<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Exploring COVID-19 Data from GitHub</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/covid-data-from-github/">https://discovery.cs.illinois.edu/microproject/covid-data-from-github/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: COVID-19 Case Data from Johns Hopkins University, via GitHub

Since before COVID-19 was detected in the United States, the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University has provided daily updates of COVID-19 case data as clean, structured CSV files on GitHub as a free public service to the world.

You can view their COVID-19 GitHub repository here: [https://github.com/CSSEGISandData/COVID-19](https://github.com/CSSEGISandData/COVID-19).  You can find their daily reports by navigating into their repository:

- Click **csse_covid_19_data** to navigate into the `csse_covid_19_data` folder,
- Navigate into `csse_covid_19_daily_reports`,
- Find the CSV data for **Jan. 3, 2022** *(it'll be near the top; be careful to get the correct year)*,
- Click the **Raw** button above the file contents to navigate to the raw CSV version of the file (without the GitHub interface),
- Use the URL of the **raw data as your dataset** for this MicroProject.

Use pandas' `read_csv` function to read the dataset you found and create a DataFrame called `df`:
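
As a rough sketch of how `read_csv` behaves, the example below parses a tiny, made-up CSV held in memory. The three rows and the names `csv_text` and `toy` are hypothetical stand-ins, not the real JHU data. For the MicroProject itself you would pass the raw GitHub URL as a string instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real daily report.
csv_text = """Country_Region,Province_State,Confirmed
US,Illinois,100
US,Texas,250
India,,300
"""

# read_csv accepts a file path, a URL string, or a file-like object;
# StringIO lets this sketch run without a network connection.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (3, 3)
```

The first line of the CSV becomes the column names, and the empty `Province_State` field for India is read as a missing value.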

In [None]:
df = ...
df

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert("Country_Region" in df)
assert("People_Hospitalized" not in df), "Make sure you have the global daily reports, not just the US daily reports."
assert("India" in df["Country_Region"].unique())
assert("2022-01-04" in df["Last_Update"].unique()[0]), "Make sure you have the Jan. 3, 2022 CSV file."
print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Country-Level Analysis of COVID-19

The CSV file from JHU provides the **total reported cases over all time until the end of the day on Jan. 3, 2022**.  However, the data often breaks countries into individual regions.  For example, let's check out the United States.  Create a DataFrame with all records from the dataset with data about the United States in the variable `df_us`:
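
One common way to select rows is a boolean mask. The sketch below uses a small, hypothetical DataFrame named `toy` (not the real dataset) and assumes the United States is labeled `"US"` in `Country_Region`, which is the label the checkpoint tests later in this notebook look for.

```python
import pandas as pd

# Hypothetical toy data; the real report has many more rows and columns.
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "India", "Tonga"],
    "Province_State": ["Illinois", "Texas", None, None],
    "Confirmed": [100, 250, 300, 1],
})

# Comparing a column to a value gives a True/False Series (a "mask");
# indexing the DataFrame with that mask keeps only the True rows.
toy_us = toy[toy["Country_Region"] == "US"]
print(len(toy_us))  # 2
```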

In [None]:
df_us = ...
df_us

### Analysis of COVID-19 in the United States

Create a new DataFrame, `df_us_sorted`, that sorts the DataFrame based on the number of confirmed cases of COVID-19 in the United States, where the **first row contains the location with the highest number of confirmed cases**:
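
Sorting from highest to lowest can be sketched with `sort_values` and `ascending=False`. The DataFrame `toy_us` below is hypothetical example data, not the real report.

```python
import pandas as pd

# Hypothetical toy data for three locations.
toy_us = pd.DataFrame({
    "Province_State": ["Illinois", "Texas", "Ohio"],
    "Confirmed": [100, 250, 175],
})

# sort_values returns a new DataFrame; ascending=False puts the
# largest value in the first row.
toy_us_sorted = toy_us.sort_values("Confirmed", ascending=False)
print(toy_us_sorted["Province_State"].tolist())  # ['Texas', 'Ohio', 'Illinois']
```

Note that `sort_values` leaves the original DataFrame unchanged, so the sorted result needs to be stored in a new variable.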

In [None]:
df_us_sorted = ...
df_us_sorted

### Create a DataFrame for Country Level Analysis

Create a new DataFrame, `df_countries`, that aggregates the data within each country together to get a DataFrame that contains one row for each country:
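
A minimal sketch of this kind of aggregation, using `groupby` on hypothetical toy data, is shown below. Calling `reset_index()` afterwards is one way to keep `Country_Region` as a regular column rather than the index, which matters here because the checkpoint tests read `Country_Region` as a column.

```python
import pandas as pd

# Hypothetical toy data with the US split across two regions.
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "India", "Tonga"],
    "Province_State": ["Illinois", "Texas", None, None],
    "Confirmed": [100, 250, 300, 1],
})

# Group rows that share a country, add up their confirmed cases,
# then turn the group labels back into an ordinary column.
toy_countries = toy.groupby("Country_Region")["Confirmed"].sum().reset_index()
print(toy_countries)
```

The result has one row per country, and the two US rows are combined into a single row with 350 confirmed cases.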

In [None]:
df_countries = ...
df_countries

### Performing Country-Level Analysis

Create a DataFrame called `df_most_cases` that contains the country which has had the most confirmed cases of COVID-19:
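
One possible approach is `nlargest`, which returns the rows with the largest values in a column as a DataFrame. The sketch below uses hypothetical country totals; sorting and taking the first row with `head(1)` would be an equally reasonable alternative.

```python
import pandas as pd

# Hypothetical country-level totals.
toy_countries = pd.DataFrame({
    "Country_Region": ["India", "Tonga", "US"],
    "Confirmed": [300, 1, 350],
})

# nlargest(1, ...) keeps the single row with the highest value,
# and the result is still a DataFrame (not a single number).
toy_most = toy_countries.nlargest(1, "Confirmed")
print(toy_most.iloc[0]["Country_Region"])  # US
```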

In [None]:
df_most_cases = ...
df_most_cases

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df_us" in vars())
assert("Province_State" in df_us)

assert("df_us_sorted" in vars())
assert(df_us_sorted.iloc[:,-7].values[2] <= df_us_sorted.iloc[:,-7].values[1])
assert(df_us_sorted.iloc[:,-7].values[10] <= df_us_sorted.iloc[:,-7].values[9])
assert(df_us_sorted["Confirmed"].values[0] == max(df_us["Confirmed"]))

assert("df_countries" in vars())
assert(df_countries["Confirmed"].sum() == df["Confirmed"].cumsum().values[len(df) - 1])

assert("df_most_cases" in vars())
assert(len(df_most_cases) == 1)
assert(len(df_most_cases.iloc[0]["Country_Region"]) == 2)

print(f"{tada} All Tests Passed! {tada}")


<hr style="color: #DD3403;">

## Checking for the Pareto Principle

The Pareto principle states that *"for many outcomes, roughly 80% of consequences come from 20% of causes (the "vital few")"* ([See more on Wikipedia: Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle)).  This is also known as the "80-20 rule" and appears often in Data Science.

In terms of COVID-19 cases, the application of the Pareto principle would be that **80% of confirmed cases come from just 20% of countries**.  Is this true?

To test this, we need to find the total number of cases across all countries.  Compute the total number of confirmed cases in the variable `confirmed_total` (this should be a number, not a DataFrame):
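
Selecting a single column and calling `sum()` gives one number rather than a DataFrame. A small sketch on hypothetical country totals:

```python
import pandas as pd

# Hypothetical country-level totals.
toy_countries = pd.DataFrame({
    "Country_Region": ["India", "Tonga", "US"],
    "Confirmed": [300, 1, 350],
})

# Summing one column collapses it to a single scalar value.
toy_total = toy_countries["Confirmed"].sum()
print(toy_total)  # 651
```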

In [None]:
confirmed_total = ...
confirmed_total

Using a bit of math, 80% of the total number of confirmed cases would be:

In [None]:
confirmed_80pct = confirmed_total * 0.8
confirmed_80pct

### Finding the Cumulative Sum of Cases

pandas DataFrames provide a "cumulative sum" function, `df.cumsum(...)`, that calculates a running total: each row holds the sum of every row up to and including the current row.

Read the DISCOVERY guide "What is the Cumulative Sum of a pandas DataFrame?" to learn the syntax of the cumulative sum function:
- [Guide: "What is the Cumulative Sum of a pandas DataFrame?"](https://discovery.cs.illinois.edu/guides/DataFrame-Fundamentals/Cumulative-Sum-in-pandas/)
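
To make the idea concrete, here is a small sketch of a cumulative sum on a hypothetical three-value Series (the name `cases` is illustrative only):

```python
import pandas as pd

# Hypothetical values, already sorted from largest to smallest.
cases = pd.Series([5, 3, 2])

# Each entry is the sum of itself and everything before it:
# 5, then 5 + 3, then 5 + 3 + 2.
running = cases.cumsum()
print(running.tolist())  # [5, 8, 10]
```

The last entry of a cumulative sum always equals the total of the original values, which is why the order of the rows matters only for the intermediate entries.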

Before finding the cumulative sum, we need to have a sorted DataFrame of all countries in descending order.  Use `df_countries` to create a DataFrame sorted by confirmed cases in the variable `df_countries_sorted`:

In [None]:
df_countries_sorted = ...
df_countries_sorted

Using `df_countries_sorted`, create a new column called `Cumulative Confirmed` that contains the cumulative sum of the Confirmed cases:

In [None]:
df_countries_sorted["Cumulative Confirmed"] = ...
df_countries_sorted

Finally, create a DataFrame called `df_80pct` that contains the top countries from `df_countries_sorted` that, cumulatively, account for 80% of the global cases (remember, that's the number of cases you stored in `confirmed_80pct`):
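
One way to sketch this step is to compare the cumulative column against the 80% threshold. The example below uses five hypothetical countries named `A` through `E`, and it assumes the cutoff keeps rows whose cumulative total is at or below the threshold; other reasonable readings (such as also including the row that crosses the threshold) would select slightly different rows.

```python
import pandas as pd

# Hypothetical countries, already sorted from most to fewest cases.
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C", "D", "E"],
    "Confirmed": [50, 25, 15, 6, 4],
})

# Running total: 50, 75, 90, 96, 100.
toy["Cumulative Confirmed"] = toy["Confirmed"].cumsum()

# 80% of the overall total (100) is 80.
threshold = toy["Confirmed"].sum() * 0.8

# Keep the rows whose running total has not yet passed the threshold.
toy_80pct = toy[toy["Cumulative Confirmed"] <= threshold]
print(toy_80pct["Country_Region"].tolist())  # ['A', 'B']
```

In this toy data, two of five countries (40%) account for 75% of the cases, so the 80-20 pattern would not hold.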

In [None]:
df_80pct = ...
df_80pct

### Does the Pareto Principle show up?

Currently:
- `df_countries` contains EVERY country in the world with COVID-19 data, and
- `df_80pct` contains countries that make up 80% of the cases.

If the Pareto principle applies to the confirmed cases of COVID-19, then we expect that `df_80pct` holds only approximately 20% of all the countries.  Let's see:


In [None]:
pct_cases = 100 * sum(df_80pct["Confirmed"]) / sum(df_countries["Confirmed"])
pct_cases = round(pct_cases, 2)

pct_countries = 100 * len(df_80pct) / len(df_countries)
pct_countries = round(pct_countries, 2)

print(f"Result: {pct_cases}% of the COVID-19 cases come from {pct_countries}% of the countries in the dataset.")

### 🔬 Checkpoint Tests 🔬

In [None]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("confirmed_total" in vars())
assert("confirmed_80pct" in vars())
assert(confirmed_total > 2.9e8)
                         
assert("df_countries_sorted" in vars())
assert("Cumulative Confirmed" in df_countries_sorted)
assert("Admin2" not in df_countries_sorted)
assert(max(df_countries_sorted["Cumulative Confirmed"]) == sum(df_countries_sorted["Confirmed"]))
assert(min(df_countries_sorted["Cumulative Confirmed"]) == df_countries_sorted.iloc[0]["Confirmed"])

assert("df_80pct" in vars())
assert("US" in df_80pct["Country_Region"].unique())
assert("India" in df_80pct["Country_Region"].unique())
assert("Tonga" not in df_80pct["Country_Region"].unique())

assert(sum(df_80pct["Confirmed"]) / sum(df_countries["Confirmed"]) > 0.7)
assert(sum(df_80pct["Confirmed"]) / sum(df_countries["Confirmed"]) < 0.9)

assert(len(df_80pct["Confirmed"]) / len(df_countries["Confirmed"]) > 0.1)
assert(len(df_80pct["Confirmed"]) / len(df_countries["Confirmed"]) < 0.2)

print(f"{tada} All Tests Passed! {tada}")

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your MicroProject to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject on GitHub!
