# Project 1: Exploring COVID-19 Data



## Due Saturday, February 13 at 11:59pm


<img src="data/covid.png" width=70%>



Welcome to Project 1! Projects in DSC 10 are similar in format to homeworks, but are different in a few key ways. First, a project is *comprehensive*, meaning that it draws upon everything we've learned this quarter so far. Second, the problems are more open-ended; they will usually ask for some result, but won't tell you what method should be used to get it. There might be several equally-valid approaches, and several steps might be necessary. This is closer to how data science is done in "real life".

It is important that you **start early** on the project! It will take the place of a homework in the week that it is due, but you should also expect it to take longer than a homework. You are especially encouraged to **find a partner** to work through the project with.

### Instructions

This assignment is due Saturday, February 13 at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Please do not use for-loops for any questions on this project**, unless the instructions specifically mention otherwise. Loops in Python are slow, and we are working with large data sets in this project. Looping over arrays and tables should usually be avoided in favor of commands that are meant specifically for tables. This entire project can be done without any loops.
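As a rough illustration of what avoiding loops can look like, here is a minimal sketch using NumPy arrays. The numbers and variable names are made up for this example and are not part of the project data.

```python
import numpy as np

# Hypothetical daily counts, made up for illustration only.
positives = np.array([120, 135, 150, 110])
negatives = np.array([880, 900, 950, 790])

# Loop version: works, but is slow on large arrays.
totals_loop = []
for i in range(len(positives)):
    totals_loop.append(positives[i] + negatives[i])

# Array version: one element-wise operation, no loop needed.
totals = positives + negatives
```

Both versions produce the same totals, but the array version is shorter and scales much better to large data sets.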

**Important**: The `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours or your team's chatroom on Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 


Remember that you may work in pairs for this assignment! If you work in a pair, you must work with someone from your team, and you should submit one notebook to Gradescope for both of you.

You should start early so that you have time to get help if you're stuck. See the course calendar on Canvas for the schedule and Zoom links.

In [None]:
# please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import datetime

import otter
import numbers
import IPython
grader = otter.Notebook()

## Background

At the end of 2019, the novel coronavirus started spreading around the world, causing many people to contract COVID-19. It didn't take long for the virus to spread from Wuhan, China to pretty much everywhere else in the world. At first, no one realized the severity of the virus and its potential to change lives as drastically as it has. In February 2020, the number of COVID-19 cases in the United States started to grow exponentially. Various measures, like face-covering regulations and stay-at-home mandates, have helped mitigate some effects of the virus, but the virus is still spreading rapidly. In the US, there have been over 24 million cases of COVID-19, and over 400,000 people have died as a result.

In this project, we will be analyzing COVID-19 data in the United States. Specifically, we will look at national data for fall 2020, broken down by state and by day, and also local data about outbreaks in San Diego.

## Outline of the Project 

The project is divided into two parts, each of which is divided into several sections. The outline below includes links that will take you directly to each section. The questions that are part of each section are also listed in this outline. Questions that are **bolded** are questions whose results will be used later in the project. Unbolded questions are ones that won't be referenced later and can be safely skipped if you get stuck. Also, Part 2 of the project is not dependent upon Part 1, so you can start Part 2 even if Part 1 has not been finished. 

Part 1. [National COVID-19 Data](#part1)  
-  Section 1. [Getting to Know the Data](#part1_section1)  
     - **Q1**
-  Section 2. [Working with datetimes](#part1_section2) 
     - **Q2**
-  Section 3. [Exploratory Data Analysis](#part1_section3)  
     - Q3, Q4, Q5, Q6, Q7, Q8, Q9
-  Section 4. [Exponential Growth?](#part1_section4)  
     - **Q10**, **Q11**
-  Section 5. [Weekend Testing](#part1_section5)  
     - **Q12**, **Q13**, **Q14**, Q15
-  Section 6. [Rates Per 100,000 People](#part1_section6) 
     - **Q16**, **Q17**, Q18, **Q19**
-  Section 7. [Mask Mandates](#part1_section7) 
     - **Q20**, Q21

Part 2. [San Diego County Outbreaks](#part2)  
-  Section 1. [Getting to Know the Data](#part2_section1)  
     - **Q1**, **Q2**
-  Section 2. [Exploring Outbreaks](#part2_section2)  
     - Q3, Q4, Q5, Q6
-  Section 3. [Shared Addresses and Shared Place Names](#part2_section3)  
     - Q7, **Q8**
-  Section 4. [Exploring Outbreak Locations](#part2_section4)  
     - Q9, Q10, Q11
-  Section 5. [Time Between Outbreaks](#part2_section5)  
     - **Q12**, Q13, Q14


<a id='part1'></a>
## Part 1: National COVID-19 Data

In this part, you will be analyzing national COVID-19 data broken down by state and by day.

The data we use here comes from the [COVID Tracking Project](https://covidtracking.com/), License: CC BY 4.0.

<a id='part1_section1'></a>
### Section 1: Getting to Know the Data

Our first step now is to read in the data and prepare it for further analysis. 

We have already cleaned up the data a bit for you by removing unnecessary columns, handling missing values, and restricting the dates to be from only one quarter, from October 1, 2020 to December 31, 2020. 

The dataset we need is stored in `data/covid_tracking_data.csv`. Run the following code to start.

In [None]:
covid_raw = bpd.read_csv('data/covid_tracking_data.csv')
covid_raw

Let's take a quick look at the table and understand what each row and column represents.

For each of the 50 US states, plus the District of Columbia (DC), there is a separate row for each date in October (31 days), November (30 days), and December (31 days). So the total number of rows is:

In [None]:
51 * (31 + 30 + 31)

Each row of our table represents both a state and a date. We will call this a "state-date" throughout this project.

There are ten columns of data, reading from left to right:

1. `date`: The date written as a string in the format month/day/year.
2. `state`: The two-letter [postal code abbreviation](https://pe.usps.com/text/pub28/28apb.htm) for the state.
3. `death`: The total number of COVID-19 related deaths recorded for that state, either on that date or previously recorded.
4. `deathIncrease`: The increase in the number of COVID-19 related deaths from the previous day, for the same state. A negative number indicates a decrease.
5. `hospitalized`: The total number of COVID-19 related hospitalizations recorded for that state, either on that date or previously recorded.
6. `hospitalizedIncrease`: The increase in the number of COVID-19 related hospitalizations from the previous day, for the same state. A negative number indicates a decrease.
7. `negative`: The total number of negative COVID-19 tests recorded for that state, either on that date or previously recorded.
8. `negativeIncrease`: The increase in the number of negative COVID-19 tests from the previous day, for the same state. A negative number indicates a decrease.
9. `positive`: The total number of positive COVID-19 tests recorded for that state, either on that date or previously recorded.
10. `positiveIncrease`: The increase in the number of positive COVID-19 tests from the previous day, for the same state. A negative number indicates a decrease.
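
The relationship between a cumulative column and its `Increase` counterpart can be sketched as follows. This assumes, as the descriptions above suggest, that each increase is the day-over-day difference in the cumulative total. The numbers are hypothetical, and `np.diff` is used only to illustrate the idea; the data file already contains the increase columns.

```python
import numpy as np

# Hypothetical cumulative totals for one state on four consecutive days.
cumulative = np.array([100, 104, 104, 109])

# Each day's increase is that day's total minus the previous day's total.
# np.diff returns one fewer value than its input, since the first day
# has no previous day to compare against.
daily_increase = np.diff(cumulative)
```

If a cumulative total were ever revised downward, the corresponding difference would be negative, which is how a negative value in an `Increase` column can arise.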





**Question 1.1.** Add the following two additional columns to `covid_raw`.

11. `totalTestResults`: The total number of positive and negative COVID-19 tests recorded for that state, either on that date or previously recorded. 
12. `totalTestResultsIncrease`: The increase in the total number of positive and negative COVID-19 tests from the previous day, for the same state. A negative number indicates a decrease.

In [None]:
covid_raw = ...
covid_raw

In [None]:
grader.check("q1_1")

<a id='part1_section2'></a>
### Section 2: Working with datetimes

We want to perform some analysis using the `date` variable, but it's not so easy to answer certain questions given the current format of the date. For example, which month had the most positive tests? The month information is embedded in the `date` variable, but we want to be able to separate the year, month, and day.

The `date` column currently contains strings in the format month/day/year. For example, "12/31/20" represents December 31, 2020.

To better prepare for our later analysis, let's extract the year, month, and day from this string. We *could* do this with the string methods we've seen before, but Python actually provides an easier way. The `datetime` module, included with Python, has a function that can read a string in month/day/year format and convert it to a `datetime` object. This function, `strptime`, takes in two arguments: a string that we want to convert, and a [*format string*](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) that tells Python what each part of the input string represents. For our application, since the string is input in the format month/day/year with a two-digit year, we will use the format string "%m/%d/%y".

Below is an example:

In [None]:
example = covid_raw.get('date').loc[0]
example

In [None]:
import datetime
example_dt = datetime.datetime.strptime(example, '%m/%d/%y')
example_dt

Python has parsed the datestring into its constituent parts. To get the year from our datetime object, we can write:

In [None]:
example_dt.year

Similarly, to get the month and day, we can write

In [None]:
example_dt.month

In [None]:
example_dt.day
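
The format string has to match the input exactly. The following standalone sketch, which uses literal strings rather than the project tables, shows the difference between the two-digit-year code `%y` and the four-digit-year code `%Y`, and what happens on a mismatch. The variable names are chosen for illustration only.

```python
import datetime

# Two-digit year: use lowercase %y.
short_year = datetime.datetime.strptime('12/31/20', '%m/%d/%y')

# Four-digit year: use uppercase %Y.
long_year = datetime.datetime.strptime('11/20/2020', '%m/%d/%Y')

# A format string that does not match the input raises a ValueError.
try:
    datetime.datetime.strptime('11/20/2020', '%m/%d/%y')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

If `strptime` raises a `ValueError`, checking the year format of the input string is a good first step.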

**Question 1.2.** Starting with `covid_raw`, create a new table called `covid` which has all of the columns of the old table, plus 2 new columns: 

13. `month`: The month for that date as an integer, e.g., 12.
14. `day`: The day for that date as an integer, e.g., 31.

We won't store the year since we know that the whole dataset is from the year 2020.

*Note*: This question, like many in this project, requires several steps. Feel free to create new cells as needed.

In [None]:
covid = ...
covid

In [None]:
grader.check("q1_2")

##### Check your work!

Before moving on, it is absolutely crucial that you have the right information in your `covid` table, since we'll be making frequent use of it throughout the project. The test above will make sure (as best as it is able) that you've done everything correctly so far. If it fails, make sure your table has:

- 14 columns
- 4692 rows
- the correct column names

If you've verified that the table has the right shape and column names, make sure your converted dates are correct.

<a id='part1_section3'></a>
### Section 3: Exploratory Data Analysis

Now let's do some rudimentary exploration of this large dataset in order to find some interesting trends worthy of further investigation. Exploratory data analysis often involves a lot of queries, grouping (sometimes even double grouping!), and visualization. Chapters 2 and 3 of the [textbook](https://eldridgejm.github.io/dive_into_data_science/front.html) might come in handy.

You should use the `covid` table as a starting point for the problems below.

**Question 1.3.** What was the nationwide increase in total number of positive tests between the last day of September and the last day of December? Save the result as `pos_cases_gained_fall`.

In [None]:
pos_cases_gained_fall = ...
pos_cases_gained_fall

In [None]:
grader.check("q1_3")

**Question 1.4.** In total, how many COVID-19 tests were administered in the United States in the year 2020? Note that no tests were administered before January 1, 2020. Save the result as `tests_2020`.

In [None]:
tests_2020 = ...
tests_2020

In [None]:
grader.check("q1_4")

**Question 1.5.** What percentage of COVID-19 tests administered in the United States in the year 2020 came back positive? Save the result as `percent_positive_2020`.

In [None]:
percent_positive_2020 = ...
percent_positive_2020

In [None]:
grader.check("q1_5")

**Question 1.6.** Of all the state-dates recorded in the table, which had the greatest single-day increase in number of deaths from the day before in the same state? Store the state (as a two-letter postal code abbreviation) and date (in month/day/year format) in `highest_death_state` and `highest_death_date`, respectively.

In [None]:
highest_death_state = ...
highest_death_state

In [None]:
grader.check("q1_6a")

In [None]:
highest_death_date = ...
highest_death_date

In [None]:
grader.check("q1_6b")

**Question 1.7.** For what percent of state-dates recorded in the table was the number of deaths for that day less than the number of deaths for the previous day in the same state? This is a measure of how often the COVID-19 situation could be considered improving. Save the result as `percent_improving`.

In [None]:
percent_improving = ...
percent_improving

In [None]:
grader.check("q1_7")

**Question 1.8.** Which state had the most new positive tests per day, on average, during this time period? Save the two-letter postal code abbreviation as `most_new_pos_state`.

In [None]:
most_new_pos_state = ...
most_new_pos_state

In [None]:
grader.check("q1_8")

**Question 1.9.** Make a bar chart that shows the average new positive tests per day for the **top 10 states**. Make the plot so that the average number of new positive tests per day appears on the y-axis and the top 10 state abbreviations appear on the x-axis. Arrange the bars in height from tallest to shortest.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_9
manual: true
-->

In [None]:
# make your plot here
...

<!-- END QUESTION -->



<a id='part1_section4'></a>
### Section 4: Exponential Growth?

One thing that has caused great concern throughout the pandemic has been the potential for exponential spread of the virus. That's why we often hear the World Health Organization advocating for people to stay at home early, before the growth of the virus swells out of control. Let's see if the virus was spreading exponentially in the US during fall 2020.

**Question 1.10.** Plot a **line graph** showing the growth of the total number of positive cases in the US throughout the fall. In your plot, the y-axis should represent the total number of positive cases and the x-axis should represent the days from October 1 to December 31. (Don't worry too much about the labels on your x-axis; they may be hard to read, and that's fine.) Give your plot a title of "US" by using the keyword argument `title="US"` within the `plot` command.

*Hint*: You should see a smooth curve. If your curve looks jagged, look carefully at the data being used for your x-axis. 

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

In [None]:
# make your plot here
...

<!-- END QUESTION -->



**Question 1.11.** Create a function called `state_trend` that creates a similar line plot to show the spread of the virus over time, except at the state level instead of the national level. The function should take as input the two-letter postal code abbreviation for a state and create a line plot similar to the one in the question above, but for the given state only. The function does not need to return anything. Make the title of the plot match the two-letter postal code given as input.

Test out your function on a few different states, and use the plots you generate to answer the multiple-choice question below. 

In [None]:
#define your function here
...

#test out your function here - try a few different states
state_trend('GA')

Which of the following statements is true? Assign 1, 2, 3, or 4 to `answer_q11`.

1. The number of cases in Wyoming (WY) was growing faster than linearly throughout all of the fall. 
2. North Dakota (ND) and South Carolina (SC) both experienced a surge in growth in the beginning of the fall, but then cases stopped growing as quickly by the end of the fall. 
3. The number of cases in Hawaii (HI) was growing approximately linearly throughout all of the fall. 
4. Virginia (VA) and Minnesota (MN) showed a similar trend in how the virus spread throughout all of the fall.

In [None]:
answer_q11 = ...
answer_q11

In [None]:
grader.check("q1_11")

<a id='part1_section5'></a>
### Section 5: Weekend Testing

Now, we have discovered some basic information about US COVID-19 cases and tests. We would like to dive deeper and discover if there are some interesting insights we can draw from the data. For instance, how does the number of tests on the weekends compare to other days? Let's find out!

**Question 1.12.**  Let's first define a function called `is_weekend` which takes in a parameter `date`, a string formatted as month/day/year, and outputs the boolean **True** if the input date is on a weekend (Saturday or Sunday) and **False** otherwise. 

*Hint*: The `weekday()` method of `datetime` objects may be helpful.
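
As a quick sketch of how `weekday()` numbers the days, here are two dates built directly from integers. The variable names are illustrative only.

```python
import datetime

# weekday() numbers the days Monday = 0 through Sunday = 6.
saturday = datetime.datetime(2021, 2, 13)   # the project due date, a Saturday
thursday = datetime.datetime(2020, 12, 31)  # the last date in the dataset

saturday_number = saturday.weekday()
thursday_number = thursday.weekday()
```

Here `saturday_number` is 5 and `thursday_number` is 3.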

Remember to test your function on some dates to make sure it is working properly.

In [None]:
#define your function here
...

#test out your function here
#try a few different dates, and look at a calendar to make sure your function is working correctly
is_weekend('2/13/21')

In [None]:
grader.check("q1_12")

**Question 1.13.** Using the function you just defined, create a new column called `weekend` which contains boolean values corresponding to whether each date is on the weekend. Assign the new table to the variable `covid_q13`.

In [None]:
covid_q13 = ...
covid_q13

In [None]:
grader.check("q1_13")

**Question 1.14.** Now, with the new table you just created, calculate two values:
1. `weekday_pos_avg`: the average number of new positive tests per weekday throughout the US
2. `weekend_pos_avg`: the average number of new positive tests per weekend day throughout the US

In [None]:
weekday_pos_avg = ...
weekday_pos_avg

In [None]:
grader.check("q1_14a")

In [None]:
weekend_pos_avg = ...
weekend_pos_avg

In [None]:
grader.check("q1_14b")

**Question 1.15.** What can you conclude, based on the data, about the difference in weekend and weekday tests? Assign 1, 2, 3, or 4 to `answer_q15`.

1. Labs and testing facilities are more likely to be closed on the weekends, which causes the decrease in positive tests on the weekends.
2. People are generally at work or school and interacting with more people on weekdays which causes the increase in positive tests on the weekdays.
3. There is no reason for the difference between weekend and weekday tests, since there is often a lag between when tests are administered and when results come back.
4. There is not enough information to conclude any of the above.

In [None]:
answer_q15 = ...
answer_q15

In [None]:
grader.check("q1_15")

<a id='part1_section6'></a>
### Section 6: Rates Per 100,000 People

Without knowing the population of each state, purely comparing the number of positive tests gives a very biased impression of which states are faring better in their battle against the coronavirus. For example, populous states like California or Texas are going to have more positive tests than states like Wyoming and Vermont, simply because they have far more people. In order to fairly compare states with different populations, we need to look at proportions, or rates. 
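
To make the idea of a rate concrete, here is a minimal sketch with two made-up states. The helper name `rate_per_100k` and all of the numbers are hypothetical and are not part of the project.

```python
def rate_per_100k(count, population):
    """Return the number of occurrences per 100,000 people."""
    return count * 100_000 / population

# Hypothetical states: B has more positive tests, but A has the higher rate.
rate_a = rate_per_100k(5_000, 500_000)       # small state
rate_b = rate_per_100k(40_000, 10_000_000)   # large state
```

State B has eight times as many positive tests as state A, yet its rate per 100,000 people is lower, which is why raw counts alone can be misleading.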

In this section, you will use another data set of estimated state populations to add some perspective to the COVID-19 numbers you have seen so far. The population data comes from the [U.S. Census Bureau's Annual Estimates of the Resident Population for the United States, Regions, States, and the District of Columbia: April 1, 2010 to July 1, 2020](https://www.census.gov/programs-surveys/popest/technical-documentation/research/evaluation-estimates.html). We will use their annual estimates for July 1, 2020.

Let's begin by reading in the population data located at `data/census_data.csv`.

In [None]:
census_data = bpd.read_csv('data/census_data.csv')
census_data

The first thing you might notice is that in this data set, states are given by their full name, instead of their two-letter postal code abbreviation. Let's address this mismatch in our two different data sources. To do that, we'll need a way of converting between state name and postal code. For that, we'll introduce yet another data set, this one from the [US Postal Service](https://pe.usps.com/text/pub28/28apb.htm). 

The data is in `data/postal_codes.csv`.

In [None]:
postal_codes = bpd.read_csv('data/postal_codes.csv')
postal_codes

Notice that this table has more rows, because in addition to the 50 states and the District of Columbia, this data set also includes US territories, like American Samoa (AS) and Guam (GU).

**Question 1.16.** Write a function called `to_postal_code` that takes as input the name of a US state or territory, and returns the two-letter postal code abbreviation. Write another function called `to_name` that takes as input the two-letter postal code of a US state or territory, and returns its name. 

It's okay if your functions don't work on invalid input, such as a postal code of 'ZZ' or a state name of 'Zimbabwe', but they should work correctly for all the states and territories listed in the `postal_codes` table. Test out each of your functions on a few inputs to make sure they are working properly.

In [None]:
#define your functions here
...

#test out your functions here - try a few different examples
to_postal_code('Maryland'), to_name('GU')

In [None]:
grader.check("q1_16")

**Question 1.17.** Create a new table called `begin_cases` that has 51 rows (one for each state plus DC) and contains four columns:

1. `state`: The two-letter postal code abbreviation for the state.
2. `population`: The population of the state.
3. `beginPositive`: The total number of positive COVID-19 tests recorded for that state, as of October 1, 2020.
4. `beginPositiveRate`: As of October 1, 2020, the total number of positive tests per 100,000 people for that state.


In [None]:
begin_cases = ...
begin_cases

In [None]:
grader.check("q1_17")

**Question 1.18.** Using `begin_cases`, identify the state that had the highest number of positive tests per 100,000 people, as of October 1. Store the two-letter postal code abbreviation of that state in variable `begin_highest`.

In [None]:
begin_highest = ...
begin_highest

In [None]:
grader.check("q1_18")

Let's see if this state was able to improve upon its numbers throughout the fall:

In [None]:
state_trend(begin_highest)

**Question 1.19.** Repeat the above process, except use the end of the given time period instead of the beginning. Find the state that had the highest total number of positive tests per 100,000 people, as of **December 31**. Store the two-letter postal code abbreviation of that state in variable `end_highest`. 

In [None]:
end_highest = ...
end_highest

In [None]:
grader.check("q1_19")

Let's see what happened in this state throughout the fall.

In [None]:
state_trend(end_highest)

<a id='part1_section7'></a>
### Section 7: Mask Mandates

It has long been advocated by the World Health Organization that wearing a mask can prevent the spread of COVID-19. We would like to see how this plays out in our data, using a dataset of which states have state-wide mask mandates, as of December 2, 2020. This data comes from an article by [U.S. News and World Report](https://www-usnews-com.cdn.ampproject.org/v/s/www.usnews.com/news/best-states/articles/these-are-the-states-with-mask-mandates?amp_js_v=a6&amp_gsa=1&context=amp&usqp=mq331AQHKAFQArABIA%3D%3D#aoh=16110233502761&referrer=https%3A%2F%2Fwww.google.com&amp_tf=From%20%251%24s&ampshare=https%3A%2F%2Fwww.usnews.com%2Fnews%2Fbest-states%2Farticles%2Fthese-are-the-states-with-mask-mandates). 

First, a note on some limitations of this data. As the article points out, the details of the mask mandate may differ from state to state, and certain states that don't have state-wide mask mandates may have mask mandates in individual cities or for certain circumstances. For example, Arizona does not have a state-wide mask mandate, but does have a mandate that masks be worn at schools. Our data gives a snapshot of the mask mandate situation on December 2, 2020, but does not capture detailed information like when the mandates went into effect and what exactly they entail. 

The data is located at `data/mask_mandate.csv`. Let's read it in.

In [None]:
mask_mandate = bpd.read_csv('data/mask_mandate.csv')
mask_mandate

**Question 1.20.** Let's define a state's year-end positive rate as the total number of positive tests per 100,000 people as of December 31, 2020. Among states with a mask mandate, what is the average state year-end positive rate? Save the result as `avg_masked_rate`. Similarly, among all states without a mask mandate, what is the average state year-end positive rate? Save the result as `avg_unmasked_rate`.

In [None]:
avg_masked_rate = ...
avg_masked_rate

In [None]:
grader.check("q1_20a")

In [None]:
avg_unmasked_rate = ...
avg_unmasked_rate

In [None]:
grader.check("q1_20b")

**Question 1.21.** Does the data show that mask mandates cause lower positive test rates? Assign 1, 2, 3, or 4 to `answer_q21`.

1. Yes, because the average state year-end positive rate is lower for states with a mask mandate.
2. Yes, for some other reason.
3. No, because the average state year-end positive rate is higher for states with a mask mandate.
4. No, for some other reason.

In [None]:
answer_q21 = ...
answer_q21

In [None]:
grader.check("q1_21")

Let's look at the state level to see if we can see the effect of mask mandates on the spread of the virus. Iowa, for example, instituted a mask mandate on November 17, right in the middle of the time period we are looking at.

In [None]:
state_trend('IA') #Nov 17

North Dakota implemented their mask mandate right around the same time, on November 13, and the trend there is similar.

In [None]:
state_trend('ND') #Nov 13

<a id='part2'></a>
## Part 2: San Diego County Outbreaks

In this part, you will be exploring data about local outbreaks in San Diego County. This data comes from [KPBS](https://www.kpbs.org/news/2020/dec/21/covid-19-outbreak-locations-san-diego-county/), a local news source. Interestingly, throughout the pandemic, city and county officials have refused to share outbreak data with the public, claiming that if outbreak information were made public, business owners would not come forward and report outbreaks. The media has been fighting against this, claiming that the public has a right to this information, and that it would help people make more informed risk assessments and better decisions. This is part of an ongoing debate in which the media is pursuing lawsuits against the government for the release of more COVID-19 data. For now, KPBS has acquired access to this dataset of local outbreaks from an undisclosed source, which is the only such data that has been shared with the public.

The [KPBS article](https://www.kpbs.org/news/2020/dec/21/covid-19-outbreak-locations-san-diego-county/) releasing this data includes some helpful visualization tools that you may wish to explore. However, for this project the KPBS data set has been modified in several ways, so your results may not exactly match those in the article.

<a id='part2_section1'></a>
### Section 1: Getting to Know the Data 

Our first step now is to read in the data and prepare it for further analysis.

We have cleaned up the data for you by handling missing values, correcting data entry errors, and renaming columns.

The dataset we need is stored in `data/sd_covid.csv`. Run the following code to start.

In [None]:
sd_covid_raw = bpd.read_csv('data/sd_covid.csv', dtype=str)
sd_covid_raw

Each row of our table represents a reported outbreak, which is defined as at least three people with COVID-19, who aren't close contacts, being in the same place in the same 14-day period. It's important to note that this definition says nothing about where these people contracted the virus, or whether they visited the same place on the same day. Places that have had outbreaks may not necessarily have unsafe conditions or practices. Bigger retail stores that get more traffic are more likely to have outbreaks than small stores, just because more people pass through their doors every day. This doesn't mean their practices are any less safe, or that you're more likely to catch the virus at a bigger store than a smaller store.

For each outbreak, the table contains the following 5 columns:

1. `category`: the category of the location where the outbreak occurred, e.g. "Hotel"
2. `place`: the place name of the location where the outbreak occurred, e.g. "Walmart"
3. `address`: the address of the location where the outbreak occurred in the format "street, city, state zip code", e.g. "955 Grand Ave, San Diego, CA 92109"
4. `date`: the date an investigation into the outbreak began, written as a string in the format "month/day/year", e.g. "11/20/2020"
5. `outbreak`: the outbreak number, out of the total number of outbreaks at this location, e.g. "2 of 4"

Right now, the elements in the `outbreak` column are strings. We want to be able to separate the outbreak number of each outbreak from the total number of outbreaks at this place, and deal with these numbers as `int` types. For example, if the `outbreak` column says "2 of 4", we want to be able to separate the 2, which we'll call the *outbreak index*, from the 4, which we'll call the *outbreak total*.

**Question 2.1.** Make a table called `sd_covid_outbreaks` that has all five columns of the `sd_covid_raw` table, plus two new columns:

6. `outbreak_index`: the index of this outbreak 
7. `total_outbreaks`: the total number of outbreaks at this location

*Hint:* Check out Python's built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods).
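
As one possible illustration of a string method, the sketch below applies `str.split` to the single example string from above. It is not the only approach, and the variable names are chosen for illustration.

```python
text = '2 of 4'

# Split on the separator ' of ' to get the two numeric pieces as strings.
pieces = text.split(' of ')

# Convert each piece from a string to an int.
index = int(pieces[0])
total = int(pieces[1])
```

This handles a single string; applying the same idea to every row of a table is a separate step.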

In [None]:
sd_covid_outbreaks = ...
sd_covid_outbreaks

In [None]:
grader.check("q2_1")

Similarly, we want to separate the parts of the address column. Currently, addresses are listed in the format "street, city, state zip code", such as "955 Grand Ave, San Diego, CA 92109".

**Question 2.2.**  Make a table called `sd_outbreaks` that has all seven columns of the `sd_covid_outbreaks` table, plus three new columns:

8. `street_address`: the street part of the address, e.g. "955 Grand Ave" 
9. `city`: the city, e.g. "San Diego"
10. `zip_code`: the zip code, e.g. "92109"

All three new columns should contain strings. We won't store the state information since we know the entire dataset comes from California.

*Hint:* Check out Python's built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods).
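
Here is a minimal sketch of how `str.split` behaves on the example address. It assumes the address contains exactly two `', '` separators, as in the stated format; an address with extra commas would need different handling. The variable names are illustrative.

```python
address = '955 Grand Ave, San Diego, CA 92109'

# Splitting on ', ' gives the street, the city, and a 'state zip' piece.
parts = address.split(', ')

street = parts[0]
city = parts[1]
# The last piece is 'CA 92109'; splitting on whitespace isolates the zip code.
zip_code = parts[2].split()[1]
```

Note that `zip_code` stays a string here, which matches the requirement that all three new columns contain strings.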

In [None]:
sd_outbreaks = ...
sd_outbreaks

In [None]:
grader.check("q2_2")

<a id='part2_section2'></a>
### Section 2: Exploring Outbreaks

**Question 2.3.** How many zip codes had 20 or more outbreaks? Save the result as `num_codes`.

In [None]:
num_codes = ...
num_codes

In [None]:
grader.check("q2_3")

**Question 2.4.** What proportion of the outbreaks that occurred in La Jolla took place in a preschool or childcare facility? Save the result as `school_prop`.

In [None]:
school_prop = ...
school_prop

In [None]:
grader.check("q2_4")

**Question 2.5.** The city associated with the most outbreaks in this dataset is San Diego. What city is associated with the second-highest number of outbreaks? Save the result as `city_two`.

In [None]:
city_two = ...
city_two

In [None]:
grader.check("q2_5")

**Question 2.6.** Make a table called `repeat_proportions` that shows, for each category, the proportion of outbreaks in that category that were repeat outbreaks (not the first outbreak at that location). Sort your table so that the first row represents the category with the highest proportion of repeat outbreaks. Your table should have one column, `repeat_proportion`, and be indexed by `category`.

In [None]:
repeat_proportions = ...
repeat_proportions

In [None]:
grader.check("q2_6")

<a id='part2_section3'></a>
### Section 3: Shared Addresses and Shared Place Names

You will find that in this dataset, there are some addresses corresponding to multiple places, and some places corresponding to multiple addresses. 

**Question 2.7.** Use the `sd_outbreaks` table to find out how many addresses are shared by more than one place. Save the result as `num_shared_addresses`.

In [None]:
num_shared_addresses = ...
num_shared_addresses

In [None]:
grader.check("q2_7")

One reason different places can share an address is shopping malls: different businesses in the same mall that each have an outbreak will appear as different places at the same address. Below is an example from Westfield Mission Valley.

In [None]:
sd_outbreaks[sd_outbreaks.get('address')=="1640 Camino Del Rio N, San Diego, CA 92108"]

In addition to multiple places with the same address, we can have multiple addresses associated with the same place name. One example is chain stores that go by the same name in their different locations, like Trader Joe's.

In [None]:
sd_outbreaks[sd_outbreaks.get('place')=="Trader Joe's"]

Whether an address is shared among places (like at Westfield Mission Valley) or a place name is shared among addresses (like Trader Joe's), we want to be able to separate the different outbreak *locations*, which depends upon both the place name and the address. The `sd_outbreaks` table has a row for each outbreak, but let's make a different table with a row for each outbreak location.
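
To make the idea of a *location* concrete, here is a small sketch in plain Python that treats each (place, address) pair as one location. The store names and addresses are made up for illustration and do not come from the dataset. It only demonstrates the concept and is not the table-based approach you will need for the question below.

```python
from collections import Counter

# Hypothetical (place, address) pairs, one per outbreak; not from the real dataset.
outbreaks = [
    ("Store A", "1 Main St"),
    ("Store B", "1 Main St"),   # same address as Store A, different place
    ("Store A", "9 Oak Ave"),   # same place name as Store A, different address
    ("Store A", "1 Main St"),   # a repeat outbreak at the first location
]

# Counting the pairs treats each distinct (place, address) combination as one location.
outbreaks_per_location = Counter(outbreaks)

print(len(outbreaks_per_location))                       # 3 distinct locations
print(outbreaks_per_location[("Store A", "1 Main St")])  # 2 outbreaks at this location
```

Neither the place name alone nor the address alone would separate these three locations correctly.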

**Question 2.8.** Create a new table called `outbreak_locations` with one row for each unique outbreak location, and the following columns reading from left to right:
1. `category`
2. `place`
3. `address`
4. `total_outbreaks`
5. `street_address`
6. `city`
7. `zip_code`


In [None]:
outbreak_locations = ...
outbreak_locations

In [None]:
grader.check("q2_8")

<a id='part2_section4'></a>
### Section 4: Exploring Outbreak Locations

**Question 2.9.** Create a helpful visualization to compare the number of different outbreak locations for each of the different categories. 

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_9
manual: true
-->

In [None]:
# Make your plot here
...

<!-- END QUESTION -->



**Question 2.10.** Rank the zip codes according to the number of different outbreak locations associated with each zip code. What proportion of outbreak locations were in one of the top five zip codes? Save the result as `prop_top_zip`.

In [None]:
prop_top_zip = ...
prop_top_zip 

In [None]:
grader.check("q2_10")

**Question 2.11.** Use this [list of zip codes and neighborhood names](http://www.sdcourt.ca.gov/portal/page?_pageid=55,1524259&_dad=portal&_schema=PORTAL), obtained from the San Diego Superior Court, to find the name of the neighborhood with the most outbreak locations. Choose the best reason why you think this zip code is ranked highest for number of outbreak locations. Assign 1, 2, 3, or 4 to `answer_q211`.

1. It covers a large amount of land area relative to other zip codes in the county.
2. There is a high density of people here relative to other zip codes in the county.
3. The businesses here are less safe relative to other zip codes in the county.
4. It contains a higher proportion of nursing homes relative to other zip codes in the county.

In [None]:
answer_q211 = ...
answer_q211

In [None]:
grader.check("q2_11")

<a id='part2_section5'></a>
### Section 5: Time Between Outbreaks

In this section, we'll explore the time interval between outbreaks at the same location. The date listed for each outbreak is the date that the investigation into the outbreak began. An outbreak spans several days, so using the investigation start date gives us a consistent way to associate each outbreak with a single date.

The dates in our dataset are strings in "month/day/year" format, but as we have seen elsewhere in this project, it will be easiest to work with dates after converting to `datetime` objects. Note that to find a time interval in days, you can simply subtract two datetime objects and get the number of days using `.days`.

In [None]:
example1 = "2/13/2021"
example1_dt = datetime.datetime.strptime(example1, '%m/%d/%Y') 
example1_dt

In [None]:
example2 = "12/25/2020"
example2_dt = datetime.datetime.strptime(example2, '%m/%d/%Y') 
example2_dt

In [None]:
# Calculate the time difference in days
(example1_dt - example2_dt).days

Note that unlike the data in Part 1 of this project, the outbreak data records all four digits of the year, so we need to use a different format string. Notice the capitalized "Y" in the format string "%m/%d/%Y"; this tells Python to expect a four-digit year.
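
The difference between the two format codes can be sketched as follows. The two-digit date string `"2/13/21"` is an assumed example of the Part 1 style, not a value copied from the data.

```python
import datetime

# %Y expects a four-digit year, while %y expects a two-digit year.
four_digit = datetime.datetime.strptime("2/13/2021", "%m/%d/%Y")
two_digit = datetime.datetime.strptime("2/13/21", "%m/%d/%y")

# Both strings describe the same date.
print(four_digit == two_digit)  # True

# Using %y on a four-digit year fails, because the extra digits are left over.
try:
    datetime.datetime.strptime("2/13/2021", "%m/%d/%y")
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(mismatch_failed)  # True
```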

To subtract two Series of `datetime` objects, you'll need to use the [Series property `.values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.values.html). An example is below.

In [None]:
example_table = bpd.DataFrame().assign(
    start_days=[example1_dt, example2_dt],
    end_days=[datetime.datetime.strptime("2/17/2021", '%m/%d/%Y'),
              datetime.datetime.strptime("1/5/2021", '%m/%d/%Y')]
)
example_table

In [None]:
example_table = example_table.assign(interval=example_table.get('end_days').values - example_table.get('start_days').values)
example_table

**Question 2.12.** From `sd_outbreaks`, create a new table called `sd_outbreaks_dt` by adding a new column called `datetime` that contains the date associated with each outbreak as a `datetime` object.

In [None]:
sd_outbreaks_dt = ...
sd_outbreaks_dt

In [None]:
grader.check("q2_12")

**Question 2.13.** What was the shortest time interval between any pair of consecutive outbreaks at a location with four or more outbreaks? Save the number of days in this shortest interval as `shortest_interval`. For this question, you may use a *for* loop if you'd like, though it is not necessary.

*Hint*: It's okay to take advantage of the fact that there are not many places with four or more outbreaks.

In [None]:
shortest_interval = ...
shortest_interval

In [None]:
grader.check("q2_13")

**Question 2.14.** Find the place name and category associated with the location that had the longest time interval between its first outbreak and its last outbreak. Save the results as `longest_interval_place` and `longest_interval_category`, respectively.

In [None]:
longest_interval_place = ...
longest_interval_place

In [None]:
grader.check("q2_14a")

In [None]:
longest_interval_category = ...
longest_interval_category

In [None]:
grader.check("q2_14b")

# Finish Line: Almost there, but make sure to follow the steps below to submit!

Big congratulations! You've completed Project 1! To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed. If you fail a test here that used to pass, you probably changed that variable sometime later. Check through your code and make sure to use new variable names rather than overwriting variables that are used in the tests.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

Remember, the tests here and on Gradescope just check the format of your answers. We will run correctness tests after the assignment's due date has passed.

In [None]:
grader.check_all()