# **COVID Tracker - Project 3**
## Inference for the Population Proportion

# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called **modules** that add extra features to the basic setup. Each module's name appears after the `import` or `from` keyword, and its purpose is given in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
from IPython.display import Image   #Display images
from scipy.stats import norm        #Confidence Interval

In [2]:
# Assigns the URL of the image to display to the name 'image_url'.
image_url = 'https://cdn.who.int/media/images/default-source/mca/mca-covid-19/coronavirus-2.tmb-1920v.jpg?sfvrsn=4dba955c_19'

# Display the image
Image(url=image_url, width=500)

# **Context**

When reporting about COVID-19, the Associated Press (AP) used data collected by the Johns Hopkins University Center for Systems Science and Engineering as a source for outbreak caseloads and death counts for the United States and globally.

The Johns Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests, and the ability to turn around test results quickly, rather than actual disease spread or true infection rates.

The data you will be analyzing is from the Johns Hopkins dashboard (link below) that is updated throughout the day. Like all organizations dealing with data, Johns Hopkins is constantly refining and cleaning up their data feed, so there may be brief moments where data does not appear correctly. You can find the Johns Hopkins daily data reports at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data.

The AP updates their dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, visit https://www.ap.org/content/formats/data/.

Attribution: Johns Hopkins University COVID-19 tracking project
* Dashboard: https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6





# **About the Dataset**

This dataset contains a random sample of U.S. counties, one county per row (the preview further down shows row labels 0 through 140). It has 9 columns: an unnamed index column plus the 8 variables listed below:

| Variable Name(s)      | Description                             |
|:----------------------|:----------------------------------------|
| county_name           | Name of the county                      |
| state                 | State in which the county is located    |
| nchs_urbanization     | Urban-Rural category                    |
| total_population      | County population size                  |
| confirmed             | Number of confirmed cases in the county |
| confirmed_per_100000  | Population adjusted confirmed COVID-19 cases per 100000 people |
| deaths                | Number of deaths in county due to COVID-19 |
| deaths_per_100000     | Population adjusted COVID-19 deaths per 100000 people |

* For more information about nchs_urbanization, visit https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf
* Note on "population adjusted": Is 1500 a lot of cases? It is certainly more significant in a population of 1500 people than in a population of 10 million people. One way to compare populations of different sizes is to calculate the rate "as if" there were only 100,000 people. Do that adjustment for all populations you want to compare. In class we learned to use relative frequency when comparing groups of different sizes. The population adjusted method is another way to do that.
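As a quick check of the population adjustment, the sketch below recomputes the rate for the Lowndes, Alabama row shown in the data preview (3,251 confirmed cases in a population of 10,236), along with the two populations mentioned in the note. The helper name `rate_per_100k` is an illustrative assumption and is not part of the dataset or the project code.

```python
def rate_per_100k(count, population):
    """Scale a raw count to a rate per 100,000 people (hypothetical helper)."""
    return count / population * 100_000

# Values taken from the Lowndes, Alabama row of the dataset preview.
lowndes_rate = rate_per_100k(3251, 10236)
print(round(lowndes_rate, 2))

# The same 1500 cases in the two populations mentioned in the note.
small_town_rate = rate_per_100k(1500, 1500)
big_city_rate = rate_per_100k(1500, 10_000_000)
print(round(small_town_rate, 1), round(big_city_rate, 1))
```

The rounded Lowndes value should agree with the `confirmed_per_100000` entry for that county in the preview.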






# **A Snippet of the Data**

Let's take a look at the data. To do this, we first import it directly from the URL.

In [3]:
# Assigns the URL where the data file is stored to 'url'.
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/COVID%20Cases.csv'

# Reads in the CSV data file and assigns it to the DataFrame 'df'.
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we first call the `pd.set_option` function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# When you type the object name, the object gets printed.
df

| Unnamed: 0 | county_name | state | nchs_urbanization | total_population | confirmed | confirmed_per_100000 | deaths | deaths_per_100000 |
|:-----------|:------------|:------|:------------------|:-----------------|:----------|:---------------------|:-------|:------------------|
| 0 | Lowndes | Alabama | Medium metro | 10236 | 3251 | 31760.45 | 80 | 781.56 |
| 1 | Ontario | New York | Large fringe metro | 109472 | 25821 | 23586.85 | 212 | 193.66 |
| 2 | Waukesha | Wisconsin | Large fringe metro | 398879 | 137985 | 34593.20 | 1216 | 304.85 |
| 3 | Escambia | Florida | Medium metro | 311522 | 96194 | 30878.72 | 1452 | 466.10 |
| 4 | Greenbrier | West Virginia | Non-core | 35347 | 12633 | 35739.95 | 182 | 514.90 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | Bourbon | Kentucky | Medium metro | 20144 | 7688 | 38165.21 | 73 | 362.39 |
| 138 | Schley | Georgia | Micropolitan | 5211 | 1387 | 26616.77 | 11 | 211.09 |
| 139 | Neshoba | Mississippi | Non-core | 29376 | 12475 | 42466.64 | 247 | 840.82 |
| 140 | Wise | Virginia | Micropolitan | 39025 | 13596 | 34839.21 | 225 | 576.55 |
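The `Unnamed: 0` column in the preview is typically what pandas produces when a CSV file was saved together with its row index. One optional way to avoid it, sketched here on a two-row inline sample rather than the real file, is to pass `index_col=0` to `pd.read_csv`. The inline text `sample_csv` is an assumption about how the first lines of the file look; the project code does not need this change.

```python
import io
import pandas as pd

# Two rows copied from the preview; the leading empty header field is assumed.
sample_csv = (
    ",county_name,state,nchs_urbanization,total_population,"
    "confirmed,confirmed_per_100000,deaths,deaths_per_100000\n"
    "0,Lowndes,Alabama,Medium metro,10236,3251,31760.45,80,781.56\n"
    "1,Ontario,New York,Large fringe metro,109472,25821,23586.85,212,193.66\n"
)

# Default read: the unnamed first field becomes a column called 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(sample_csv))

# index_col=0 uses that first field as the row index instead.
as_index = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(with_extra.shape)  # (rows, columns) including 'Unnamed: 0'
print(as_index.shape)    # (rows, columns) without it
```

Either way of reading the file leaves the eight analysis variables unchanged.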


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, replace the ellipsis (...) by double clicking the text box to start typing.
* Refer to the tutorial from the activity for assistance.
* If you still need help:
  * Watch the video.
  * Attend office hours.

# **The variable to analyze**
You will analyze a category of NCHS Urbanization. Based on the first initial of your LAST name, analyze the category of the variable listed in the table. Use this category for the entire project.

| Last Name | Variable |
|-----------|-------------------------------|
| A-L       | NCHS Urbanization = Large Fringe Metro |
| M-Z       | NCHS Urbanization = Non-Core |

In [1]:
#Print all the category names.
#Use this list to ensure correct spelling of your category.
print("... category names")                #Replace ... with the variable name written out
print("--------------------------------")
freq_table = pd.Series(df['...']).value_counts()        #Replace ... with the variable name
print(freq_table)


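For reference, `value_counts()` returns a Series whose index holds the category names and whose values hold the counts, sorted from most to least frequent. The toy column `fruit` and its values below are made up for illustration and are unrelated to the project data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the COVID dataset.
toy = pd.DataFrame({"fruit": ["apple", "pear", "apple", "fig", "apple", "pear"]})

toy_counts = toy["fruit"].value_counts()
print(toy_counts)

# .get() looks up one category by its exact spelling; a misspelled or
# differently capitalized name returns None instead of a count.
print(toy_counts.get("apple"))
print(toy_counts.get("Apple"))
```

This is why the printed category list is worth checking: the lookup is case-sensitive, so the spelling must match the data exactly.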

# **QUESTION 1**
## Confidence Interval

Construct and interpret the 98% confidence interval for the population proportion of counties classified as (your assigned category). Write up the solution using the PMACC procedure.

**Last Names A-L**, construct and interpret the 98% confidence interval for the population proportion of counties classified as large fringe metro.

**Last Names M-Z**, construct and interpret the 98% confidence interval for the population proportion of counties classified as non-core.

**1.1) Parameter: Define the parameter, using correct notation.**

...

**1.2) Method: Name the method you will use.**

...

**1.3) Assumptions:**

Complete the code below to find out how many counties fall under the category assigned to you.

In [1]:
# Count total observations
n = len(df)

# Count total successes
# Replace the 1st ... with the variable name
# Replace the 2nd ... with the name of the NCHS Urbanization category to be analyzed
obs_count = df['...'].value_counts().get('...')

print(f"{obs_count} out of {n} counties are classified as ... .") #Replace ... with your category name.




**Show that both assumptions are met.**

...

**1.4) Calculate: Complete the code below to calculate the sample proportion of counties that fall under your assigned category, and the confidence interval.**

In [None]:
# Define the confidence level
# Replace the ... with the stated confidence level, as a decimal (ex: 0.83, not 83%)
CL = ...

#Use this code for the student version
p_hat = obs_count / n
cv = norm.ppf((1+CL)/2)
se = np.sqrt(p_hat * (1-p_hat) / n)

#Calculate the bounds of the interval
ci_lower = (p_hat - cv * se)
ci_upper = (p_hat + cv * se)

print(f"p-hat = {obs_count}/{n} = {p_hat.round(5)}")
print(f"The {CL*100}% CI is ({ci_lower.round(5)}, {ci_upper.round(5)})")
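To see the same steps on concrete numbers, here is a minimal sketch using made-up counts (40 successes in 200 observations) and a 90% confidence level, all chosen only for illustration. It also shows that the margin of error, the critical value times the standard error, is half the width of the interval.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values for illustration only; not the project data.
toy_count = 40
toy_n = 200
toy_CL = 0.90

toy_p_hat = toy_count / toy_n                          # sample proportion
toy_cv = norm.ppf((1 + toy_CL) / 2)                    # critical value z*
toy_se = np.sqrt(toy_p_hat * (1 - toy_p_hat) / toy_n)  # standard error
toy_me = toy_cv * toy_se                               # margin of error

toy_lower = toy_p_hat - toy_me
toy_upper = toy_p_hat + toy_me

print(f"p-hat = {toy_p_hat:.3f}")
print(f"z* = {toy_cv:.3f}, margin of error = {toy_me:.3f}")
print(f"{toy_CL:.0%} CI: ({toy_lower:.3f}, {toy_upper:.3f})")
```

The `toy_` prefix keeps these names from overwriting `p_hat`, `cv`, or `se` if the sketch is run in the same notebook.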

**1.5) Communicate Results: Interpret the confidence interval calculated in 1.4 above. Round to three (3) decimal places.**

...

**1.6) Show work to calculate the margin of error. Then interpret the margin of error.**

...

# **QUESTION 2**

## **Hypothesis Test**

**Last Names A-L:** There are 6 NCHS Urbanization categories. Based on this data set, is there convincing evidence that the population proportion of counties classified as large fringe metro is different from 1/6? Use α=0.02. Write up the solution using the PMACC procedure. NOTE: Use 1/6, not the decimal approximation.

**Last Names M-Z:** There are 6 NCHS Urbanization categories. Based on this data set, is there convincing evidence that the population proportion of counties classified as non-core is different from 1/6? Use α=0.02. Write up the solution using the PMACC procedure. NOTE: Use 1/6, not the decimal approximation.
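The note about using 1/6 matters because a rounded value such as 0.17 shifts both the standard error and the z-score. The sketch below compares the two using made-up counts (30 of 150), which are assumptions for illustration only; the helper `z_for` is likewise hypothetical.

```python
import numpy as np

# Hypothetical counts for illustration only; not the project data.
toy_count = 30
toy_n = 150
toy_p_hat = toy_count / toy_n

def z_for(p_0):
    """z-score of toy_p_hat under H0: p = p_0 (illustrative helper)."""
    se = np.sqrt(p_0 * (1 - p_0) / toy_n)
    return (toy_p_hat - p_0) / se

z_exact = z_for(1 / 6)     # fraction evaluated at full float precision
z_rounded = z_for(0.17)    # two-decimal approximation

print(f"z with 1/6  = {z_exact:.4f}")
print(f"z with 0.17 = {z_rounded:.4f}")
```

Typing `1/6` directly in Python keeps the full precision, so there is no need to round by hand.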

**2.1) Parameter: Define the parameter, using correct notation.**

...

**2.2) Method: Name the method you will use, and write the hypotheses.**

...

**2.3) Assumptions: Show that both assumptions are met. Round to 1 decimal place.**

...

**2.4) Calculate: Complete the code below to calculate the values required.**

In [None]:
#Define p0, the value in H0.
#Replace ... with p0.
p_0 = ...

#Use this code for the student version
#Calculate the values needed; p-hat, and standard error (se).
p_hat = obs_count / n
se = np.sqrt(p_0 * (1-p_0) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
z_score = (p_hat - p_0) / se

#Calculate the p-value for 1- and 2-sided tests
p_value1 = (1 - norm.cdf(abs(z_score)))
p_value2 = 2 * p_value1

print(f"p-hat =  {obs_count}/{n} = {p_hat.round(7)}")
print(f"z-score = {z_score.round(7)}")
print(f"1 sided p-value = {p_value1:.11f}")
print(f"2 sided p-value = {p_value2:.11f}")
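Written as formulas, the quantities this cell computes are shown below, where $\Phi$ denotes the standard normal cumulative distribution function (`norm.cdf` in the code). Note that the standard error here uses $p_0$, whereas the confidence interval in 1.4 uses $\hat{p}$.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0\,(1 - p_0)}{n}}},
\qquad
p\text{-value}_{\text{1-sided}} = 1 - \Phi\bigl(|z|\bigr),
\qquad
p\text{-value}_{\text{2-sided}} = 2\,\bigl(1 - \Phi\bigl(|z|\bigr)\bigr)
```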

**2.5) Communicate Results: What conclusion is made about the null hypothesis, and what does that mean for the alternative hypothesis?**

...

# **QUESTION 3**

## **Do you make the same conclusion if you use the confidence interval?**

**In Question 2 you concluded that we either do or do not have convincing evidence for the alternative hypothesis. Using your confidence interval from Question 1, do you reach the same conclusion?**
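One way to frame the comparison: a two-sided test at α = 0.02 pairs with a 98% confidence interval, and the interval-based check asks whether the hypothesized value lies inside the interval. The sketch below uses made-up counts (30 of 150) purely for illustration. Because the interval's standard error uses p-hat while the test's uses the hypothesized value, the two approaches usually agree but can differ in borderline cases.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values for illustration only; not the project data.
toy_count, toy_n = 30, 150
toy_p_0 = 1 / 6
toy_alpha = 0.02

toy_p_hat = toy_count / toy_n

# Two-sided z-test (standard error based on p_0).
toy_z = (toy_p_hat - toy_p_0) / np.sqrt(toy_p_0 * (1 - toy_p_0) / toy_n)
toy_p_value = 2 * (1 - norm.cdf(abs(toy_z)))
reject_by_test = toy_p_value < toy_alpha

# Matching (1 - alpha) confidence interval (standard error based on p-hat).
toy_cv = norm.ppf(1 - toy_alpha / 2)
toy_me = toy_cv * np.sqrt(toy_p_hat * (1 - toy_p_hat) / toy_n)
reject_by_ci = not (toy_p_hat - toy_me <= toy_p_0 <= toy_p_hat + toy_me)

print(f"p-value = {toy_p_value:.4f}, reject H0 by test: {reject_by_test}")
print(f"reject H0 by interval: {reject_by_ci}")
```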

...

# **QUESTION 4**

Write a paragraph of at least 100 words to address one of the following questions. That is, answer either 4a or 4b, but not both.

**4a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**4b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
* Note 1: You do not have to select Print Preview. You can print directly from the notebook.
* Note 2: Image and graph sizes have been set so you should be able to see them correctly without changing the browser width or the layout (portrait vs. landscape).

1. Run all code one last time and make sure your graphs can be seen.
2. File -> Print (or Ctrl-P/Cmd-P).
3. Change the "Destination" to PDF.
4. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
**Watch the "GradeScope Submission" video for help.**
1. Log in to the Canvas course.
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course.
4. Click on the name of the assignment that matches your data set.
5. Click on "Submit Work" and select PDF.
6. Select the PDF you just created.
7. Tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit".

#### **Still need help? Your STAT 108 team is here to help. Take your laptop to office hours.**
