# Homework 2 - Data Wrangling: Understanding U.S. Covid Statistics
 


# Introduction

The [GenderSci Lab](https://www.genderscilab.org) is "dedicated to generating feminist concepts, methods and theories for scientific research on sex and gender."

One of their research projects explores the impact of COVID-19 on women and men.
In this lab, we are using a set of data that is based on the information in their [US Gender/Sex COVID-19 Data Tracker](https://genderscilab.org/gender-and-sex-in-covid19/#DataTable). (You may need to search for "US Gender/Sex COVID-19 Data Tables".)

The table shows various pieces of information about US state COVID-19 cases and deaths counted by sex, including the total, male, and female case counts, as well as the death counts and percentages. Here's a snippet:

![US Gender/Sex COVID-19 Data Tables](tableclip.png)

We have added one more column of data to this, the state population. You'll find out more about this data below, after the boring stuff.

# Question

The question you're answering in this lab: 

> __Do states with large populations have a higher COVID-19 rate than states with low populations?__

# Lab Instructions and Learning Objectives

Just like in the previous homework, you will be creating and submitting a data story answering a data science question. You will be required to submit your work in the same format as last time, complete with sections for *Introduction*, *Data*, *Methods*, *Computation*, and *Conclusion*.

In this lab, you will:
* Create a data story in a notebook exploring the question.
* Work with COVID-19 case counts and population information.
* Select multiple columns (variables) from a `DataFrame`.
* Filter rows (observations) through explicit (boolean/condition-based) indexing.

# Due date 

You will submit your completed Homework 2 on MarkUs by *Tuesday, Jan 25 2021 at 11:59 PM EST*. We will send an announcement in a couple of days, once autotesting has been set up on MarkUs.

# GGR: How to submit

1. Download your homework to your local computer and save it as `EEB125_Homework_2.ipynb`.
2. Log in here: https://markus-ds.teach.cs.toronto.edu.
3. Submit your homework to `HW2: Homework 2`.

# Marking Rubric


Section     | 0 | 1 | 2 | 3
------------|---|---|---|---
Introduction|The question is not stated correctly or left blank | The question is stated correctly | NA | NA 
Data (for each Python variable)       |auto test fails | auto test passes | NA | NA 
Methods (for each part) | No answer | The data extracted is specified or a reasonable rationale is given, but not both | Both the data extracted is specified and a reasonable rationale is given | NA
Computation |auto test fails | auto test passes | NA | NA 
Conclusion (for each part) | No answer | The question is answered but no explanation is given | The question is answered but the explanation is not supported or weakly supported by the data | The question is answered and the explanation is supported by the data 

Maximum grade: 24


# Introduction section

This should introduce the question being explored in a sentence. __(1 mark)__

# Data section

The `Data` part of your notebook should read the raw data, extract a `DataFrame` containing the important columns, and present the overall data. Create at least these two variables.

+ `covid_raw_data` : the `DataFrame` created by reading the `covid_raw_w2.csv` file. __(1 mark)__
+ `covid_data`: the `DataFrame` containing only the relevant columns from the raw data: the `State`, `Total_cases`, and `pop` columns. __(1 mark)__

(We will check the value of `covid_data`, the smaller `DataFrame` with only three columns, in the autotester. You'll probably want to use a few other variables along the way for the intermediate steps, like creating a list of important columns, but we're not autotesting those.)
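As a minimal sketch of the read-then-select pattern described above, the example below uses a small made-up table instead of `covid_raw_w2.csv`. The names `csv_text`, `raw`, `wanted_columns`, and `subset`, the `Male_cases` column, and all values are assumptions for illustration only. In your notebook you would pass the file name to `pd.read_csv` rather than an in-memory string.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file (illustration only).
csv_text = """State,Total_cases,Male_cases,pop
A,100,40,1000
B,250,120,2000
C,75,30,500
"""

# pd.read_csv accepts a file path or a file-like object.
raw = pd.read_csv(io.StringIO(csv_text))

# Indexing with a list of column names returns a new DataFrame
# that holds only those columns, in the order listed.
wanted_columns = ["State", "Total_cases", "pop"]
subset = raw[wanted_columns]

print(subset.columns.tolist())
print(subset.shape)
```

Keeping the column names in a separate list makes it easy to see at a glance which variables the analysis depends on.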

A note about the data: when we retrieved it, some values were missing, and we omitted the states with missing data from our dataset. For example, the source had no `Total_cases` value for Florida, so we omitted Florida. Other states are missing as well; our cleaned dataset contains 41 states in total.
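The handout does not say how the incomplete rows were removed. The sketch below shows one common way to spot and drop rows with missing values in pandas, using made-up data. It is an assumption about what such cleaning could look like, not a description of how the provided file was prepared, and you do not need to do this yourself because the file is already cleaned.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing case count (illustration only).
toy = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Total_cases": [100.0, np.nan, 75.0],
    "pop": [1_000, 2_000, 500],
})

# isna() marks missing entries; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()

# dropna() keeps only the rows that have no missing values.
complete = toy.dropna()

print(missing_per_column)
print(complete)
```

Note that dropping a row removes that state's population as well as its cases, which is worth keeping in mind when you interpret totals.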

# Methods section

Start with a Markdown cell describing what you're going to do, which is:

1. Extract and analyze data about large states (explain which data you're going to extract and analyze). What data are you extracting? Why are you extracting this data? Explain in a few sentences. __(2 marks)__
2. Extract and analyze data about small states (explain which data you're going to extract and analyze). What data are you extracting? Why are you extracting this data? Explain in a few sentences. __(2 marks)__
3. Compare the results. What quantities will you compare in your analysis?  Why did you choose to compare these quantities? Explain in a few sentences. __(2 marks)__

# Computation section

There are two sections to this, one for large-state data and one for small-state data.

## Large-state data

We'll define "large" as any state with a population greater than 15 million. Python lets you write that number with underscores so that it's more readable: `15_000_000`. You will need to get a boolean `Series` by comparing the population column to this number, then do some math using the built-in function `sum`. (You should not need `len`.)

Create these variables along the way. We will check them in the autotester. We will not check your intermediate steps.

+ `large_case_sum`: the total number of cases in the large states. __(1 mark)__
+ `large_pop_sum`: the total population in the large states. __(1 mark)__
+ `large_case_avg`: the total number of cases in the large states divided by the total population of the large states. __(1 mark)__
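The general technique (compare a column to a number, use the resulting boolean `Series` to pick rows, then add up a column with `sum`) can be sketched on made-up data as follows. The table, the threshold of `1_500`, and the names `toy`, `is_big`, `big_rows`, `big_cases`, `big_pop`, and `big_rate` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up table (illustration only).
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Total_cases": [300.0, 50.0, 450.0, 20.0],
    "pop": [2_000, 400, 3_000, 100],
})

# Underscores in numeric literals are ignored by Python.
print(15_000_000 == 15000000)

# Comparing a column to a number gives a boolean Series, one value per row.
is_big = toy["pop"] > 1_500

# Indexing with the boolean Series keeps only the rows where it is True.
big_rows = toy[is_big]

# The built-in sum adds up the values in a column.
big_cases = sum(big_rows["Total_cases"])
big_pop = sum(big_rows["pop"])
big_rate = big_cases / big_pop

print(big_cases, big_pop, big_rate)
```

Dividing total cases by total population weights each state by its size, which is different from averaging the per-state rates.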

## Small-state data

We'll define "small" as any state with fewer than 1 million people. You'll need to get a boolean `Series` by comparing the population column to `1_000_000`, and then do the same math.

Note that the code for small-state data will look a lot like the code for the large-state data.

+ `small_case_sum`: the total number of cases in the small states. __(1 mark)__
+ `small_pop_sum`: the total population in the small states. __(1 mark)__
+ `small_case_avg`: the total number of cases in the small states divided by the total population of the small states. __(1 mark)__
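Because the two computations have the same shape, one optional way to avoid copying code is a small helper function. The sketch below is only an illustration on made-up data: the function name `cases_per_person`, the thresholds, and the table are assumptions, and the autotester still looks for the named variables listed above, so a helper like this would only be a convenience.

```python
import pandas as pd


def cases_per_person(rows):
    """Return total cases divided by total population for the given rows."""
    return sum(rows["Total_cases"]) / sum(rows["pop"])


# Made-up table (illustration only).
toy = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Total_cases": [300.0, 50.0, 450.0, 20.0],
    "pop": [2_000, 400, 3_000, 100],
})

# The same helper works for any subset of rows; only the condition changes.
tiny_rate = cases_per_person(toy[toy["pop"] < 500])
huge_rate = cases_per_person(toy[toy["pop"] > 1_500])

print(tiny_rate, huge_rate)
```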

# Conclusion

Include cells with your answers to each of these questions:

1. Do states with large populations have a higher COVID-19 rate than states with low populations?  Briefly explain. __(3 marks)__
2. Briefly explain how the missing data might influence your answer to 1. __(3 marks)__
3. Use your answers to 1. and 2. to answer the following: If you had limited resources (e.g., doctors, nurses, medicine) to distribute to help those who are sick, how would you distribute them among these states based on their population size? What would be an 'efficient' strategy, one that helps the most people quickly? What would be an 'equitable' strategy? Briefly explain in a few sentences. __(3 marks)__

# Printing the required variables

Include a cell at the very end of your notebook containing this code:

```python
print('covid_data.head():')
print(covid_data.head())
print()
print('large_pop_sum:')
print(large_pop_sum)
print('large_case_sum:')
print(large_case_sum)
print('large_case_avg:')
print(large_case_avg)
print()
print('small_pop_sum:')
print(small_pop_sum)
print('small_case_sum:')
print(small_case_sum)
print('small_case_avg:')
print(small_case_avg)
```

The output should look like this:

```
covid_data.head():
         State  Total_cases       pop
0       Alaska     132645.0    731545
1      Arizona    1166060.0   7278717
2   California    4647587.0  39512223
3     Colorado     740461.0   5758736
4  Connecticut     402583.0   3565287

large_pop_sum:
87961665
large_case_sum:
10696471.0
large_case_avg:
0.1216037804650469

small_pop_sum:
4498465
small_case_sum:
638434.0
small_case_avg:
0.14192263360946455
```