# **COVID Tracker - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ins called **modules** that extend the basic setup. Each module's name follows the `import` statement, and its purpose is described in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://cdn.who.int/media/images/default-source/mca/mca-covid-19/coronavirus-2.tmb-1920v.jpg?sfvrsn=4dba955c_19'

# Display the image
Image(url=image_url)

# **Context**

When reporting about COVID-19, the Associated Press (AP) used data collected by the Johns Hopkins University Center for Systems Science and Engineering as a source for outbreak caseloads and death counts for the United States and globally.

The Johns Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests, and the ability to turn around test results quickly, rather than actual disease spread or true infection rates.

The data you will be analyzing is from the Johns Hopkins dashboard (link below) that is updated throughout the day. Like all organizations dealing with data, Johns Hopkins is constantly refining and cleaning up their data feed, so there may be brief moments where data does not appear correctly. You can find the Johns Hopkins daily data reports at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data.

The AP updates their dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, visit https://www.ap.org/content/formats/data/.

Attribution: Johns Hopkins University COVID-19 tracking project
* Dashboard: https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6





# **About the Dataset**

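This dataset contains 133 rows, each corresponding to one county in a random sample of U.S. counties. A total of 9 columns are provided: an unnamed index column (shown as `Unnamed: 0` in the data snippet) and the 8 variables listed below: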

| Variable Name(s)      | Description                             |
|:----------------------|:----------------------------------------|
| county_name           | Name of the county                      |
| state                 | State in which the county is located    |
| nchs_urbanization     | Urban-Rural category                    |
| total_population      | County population size                  |
| confirmed             | Number of confirmed cases in the county |
| confirmed_per_100000  | Population adjusted confirmed COVID-19 cases per 100000 people |
| deaths                | Number of deaths in county due to COVID-19 |
| deaths_per_100000     | Population adjusted COVID-19 deaths per 100000 people |

* For more information about nchs_urbanization, visit https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf
* Note on "population adjusted": Is 1500 a lot of cases? It is certainly more significant in a population of 1500 people than in a population of 10 million people. One way to compare populations of different sizes is to calculate the rate "as if" there were only 100,000 people. Do that adjustment for all populations you want to compare. In class we learned to use relative frequency when comparing groups of different sizes. The population adjusted method is another way to do that.
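As a quick check of the population adjustment, the sketch below recomputes the confirmed-case rate for Lowndes County from the values shown in the data snippet later in this notebook (3,251 confirmed cases in a population of 10,236). The function name `rate_per_100k` is illustrative only; it is not part of the dataset or of any library.

```python
def rate_per_100k(count, population):
    """Scale a raw count to a rate per 100,000 people."""
    return count * 100_000 / population

# Lowndes County, Alabama: 3,251 confirmed cases, population 10,236
lowndes_rate = round(rate_per_100k(3251, 10236), 2)
print(lowndes_rate)  # 31760.45, matching confirmed_per_100000 in the snippet

# The same 1,500 cases look very different in populations of different sizes
print(rate_per_100k(1500, 1500))        # 100000.0
print(rate_per_100k(1500, 10_000_000))  # 15.0
```

Because both rates are on the same "per 100,000 people" scale, they can be compared directly even though the underlying populations differ greatly in size.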



Let's take a look at the data. First, we import it directly from the URL below.



# **A Snippet of the Data**

In [3]:
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/COVID%20Cases.csv'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll first call the *pd.set_option* function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

| Unnamed: 0 | county_name | state | nchs_urbanization | total_population | confirmed | confirmed_per_100000 | deaths | deaths_per_100000 |
|:-----------|:------------|:------|:------------------|:-----------------|:----------|:---------------------|:-------|:------------------|
| 0 | Lowndes | Alabama | Medium metro | 10236 | 3251 | 31760.45 | 80 | 781.56 |
| 1 | Ontario | New York | Large fringe metro | 109472 | 25821 | 23586.85 | 212 | 193.66 |
| 2 | Waukesha | Wisconsin | Large fringe metro | 398879 | 137985 | 34593.20 | 1216 | 304.85 |
| 3 | Escambia | Florida | Medium metro | 311522 | 96194 | 30878.72 | 1452 | 466.10 |
| 4 | Greenbrier | West Virginia | Non-core | 35347 | 12633 | 35739.95 | 182 | 514.90 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 137 | Bourbon | Kentucky | Medium metro | 20144 | 7688 | 38165.21 | 73 | 362.39 |
| 138 | Schley | Georgia | Micropolitan | 5211 | 1387 | 26616.77 | 11 | 211.09 |
| 139 | Neshoba | Mississippi | Non-core | 29376 | 12475 | 42466.64 | 247 | 840.82 |
| 140 | Wise | Virginia | Micropolitan | 39025 | 13596 | 34839.21 | 225 | 576.55 |
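Before answering the questions, it can help to check the size of a DataFrame and how pandas stored each column. The sketch below is only an illustration: it rebuilds the first three rows of the snippet above by hand, with a subset of the columns, under the hypothetical name `sample`. Its shape therefore does not match the full dataset.

```python
import pandas as pd

# First three rows of the snippet above, typed in by hand for illustration
sample = pd.DataFrame({
    'county_name': ['Lowndes', 'Ontario', 'Waukesha'],
    'state': ['Alabama', 'New York', 'Wisconsin'],
    'nchs_urbanization': ['Medium metro', 'Large fringe metro', 'Large fringe metro'],
    'total_population': [10236, 109472, 398879],
    'confirmed': [3251, 25821, 137985],
    'deaths_per_100000': [781.56, 193.66, 304.85],
})

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # how pandas stored each column
```

Running the same two `print` lines on `df` reports the shape and column types of the full dataset.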


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, double click the text box to start typing.
* Refer to the 3 tutorials from the activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**
Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable          | Classification |
|:------------------|:-------------------------------|
| confirmed         |  ...      |
| nchs_urbanization |  ...      |
| deaths_per_100000 |  ...      |
| total_population  |  ...      |

## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of urban-rural category. State any fact that jumps out at you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for the urban-rural category. Round relative frequency to 3 decimal places.

In [5]:
# Define the name of the variable to be analyzed
variable = df['...']

# Sort the urban-rural categories from least to most populated.
cat_order = ['Non-core', 'Micropolitan', 'Small metro', 'Medium metro',
                'Large fringe metro', 'Large central metro']

# Create the frequency table and put the categories in the order defined above.
# .reindex is used here because the category names do not sort into
#   least-to-most populated order on their own.
freq_table = variable.value_counts().reindex(cat_order)

# Rename the counts to "Frequency", and the index to "Urban-Rural Category"
freq_table = freq_table.rename('Frequency')
freq_table = freq_table.rename_axis('Urban-Rural Category')

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table=freq_table/...             #HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
combined_table=pd.concat([..., ...], axis=1)

# Print the combined table.
...


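If the frequency and relative frequency steps are unfamiliar, the sketch below walks through the same idea on a small made-up example. The T-shirt size data and the names `sizes`, `freq`, `rel_freq`, and `table` are hypothetical and unrelated to the COVID dataset.

```python
import pandas as pd

# Hypothetical data: T-shirt sizes for 8 people
sizes = pd.Series(['S', 'M', 'M', 'L', 'M', 'S', 'L', 'M'])

# Frequency table, with the categories in a chosen order rather than by count
freq = sizes.value_counts().reindex(['S', 'M', 'L']).rename('Frequency')

# Relative frequency = frequency / total number of observations
rel_freq = (freq / freq.sum()).rename('Relative Frequency').round(3)

# Place the two columns side by side
table = pd.concat([freq, rel_freq], axis=1)
print(table)
```

The relative frequencies always add up to 1 (apart from rounding), which is a useful check on the table.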

**2b)** Construct a relative frequency bar chart to describe the distribution of the urban-rural category.

In [None]:
# Create a DataFrame
dfrf = relative_freq_table.reset_index()

# Create the bar graph
fig = px.bar(x=dfrf['Urban-Rural Category'],y=dfrf['Relative Frequency'],
             title='...')

fig.update_layout(xaxis_title='...')
fig.update_layout(yaxis_title='...')
fig.show()


**2c)** Describe the distribution of urban-rural category.

...

## **Question 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| **Last Name** | **Variable Description**    |
|:--------------|:----------------------------|
| A-L           | confirmed                   |
| M-Z           | deaths                      |





**3a)** Construct a histogram for your variable. Use number of bins = 12.

In [None]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.

fig = px.histogram(x=df['...'], nbins=...,
                   title='...',
                   labels={'x': '...'})

# Update the vertical axis title.
fig.update_layout(yaxis_title='...')

# Print the histogram.
fig.show()


**3b)** Construct a boxplot for your variable.  

In [None]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['...'],
       title='...',
       labels={'x':'...'})

**3c)** Calculate the following summary statistics for your variable: five-number summary, mean, and standard deviation. Round to three decimal places.

In [None]:
# Calculate the numerical summaries
# Indicate your variable.
descriptive_stats = df[['...']].describe().round(...)

# Print the results.
...
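For reference, `describe()` reports the count, mean, sample standard deviation (`std`), minimum, the three quartiles, and maximum of each numeric column. The sketch below shows how to read that output on a small made-up example. The quiz-score data and the names `scores`, `summary`, and `five_number` are hypothetical.

```python
import pandas as pd

# Hypothetical data: quiz scores (points) for 8 students
scores = pd.DataFrame({'score': [4, 7, 7, 8, 9, 10, 10, 25]})

summary = scores[['score']].describe().round(3)
print(summary)

# The five-number summary is the min, 25%, 50%, 75%, and max rows
five_number = summary.loc[['min', '25%', '50%', '75%', 'max'], 'score']
print(five_number)
```

Here the 50% row is the median, and the 25% and 75% rows are the first and third quartiles.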


**3d)** Use information from (3a), (3b), and (3c) to describe your variable in terms of shape, center, spread, and outliers.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which center and which spread you are using. For example, say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list the values of all outliers.
* Include units, if any, for all numbers.
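One common convention for flagging outliers is the 1.5 × IQR rule. The sketch below assumes that rule is the one used in class and applies it to a small made-up example. The quiz-score data and all variable names are hypothetical, and plotting libraries may compute quartiles slightly differently than pandas does.

```python
import pandas as pd

# Hypothetical data: quiz scores (points) for 8 students
scores = pd.Series([4, 7, 7, 8, 9, 10, 10, 25])

q1 = scores.quantile(0.25)  # first quartile
q3 = scores.quantile(0.75)  # third quartile
iqr = q3 - q1               # spread of the middle 50% of the data

# Values outside [lower_fence, upper_fence] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

In this example the IQR is 3 points, the fences are 2.5 and 14.5 points, and the only flagged value is the score of 25 points.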

...

**3e)** Interpret the standard deviation in context.

...

**3f)** Interpret the IQR in context.

...

## **QUESTION 4**

How do *large fringe metro* and *small metro* counties compare with respect to the number of confirmed cases per 100,000 and the number of deaths per 100,000?

**4a)** Calculate the mean number of confirmed cases per 100000 and the mean number of deaths per 100000 for *large fringe metro* and *small metro* counties. Round to two decimal places. Hint: See the table you created in (2a) for the exact spelling of the categories.

In [None]:
# General pattern:
#   group_means = df[df['qualitative_var'] == 'category'][['quant_var1', 'quant_var2']].mean()
large_fringe_means = df[df['...'] == 'Large fringe metro'][['...', '...']].mean().round(...)
small_metro_means = df[df['nchs_urbanization'] == '...'][['...', '...']].mean().round(...)

# Combine the two tables and specify labels.
combined_means = pd.DataFrame({'Confirmed per 100,000': [large_fringe_means['confirmed_per_100000'], small_metro_means['confirmed_per_100000']],
                               'Deaths per 100,000': [large_fringe_means['deaths_per_100000'], small_metro_means['deaths_per_100000']]},
                               index=['Large Fringe Metro', 'Small Metro'])

#Print the table
...
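The filter-then-average pattern used above can be tried first on a small made-up example. The plant data and the names `plants`, `fern_means`, `moss_means`, and `comparison` in the sketch below are hypothetical and unrelated to the COVID dataset.

```python
import pandas as pd

# Hypothetical data: two plant species with two measurements each
plants = pd.DataFrame({
    'species': ['fern', 'fern', 'moss', 'moss', 'moss'],
    'height_cm': [30.0, 40.0, 2.0, 3.0, 4.0],
    'width_cm': [20.0, 30.0, 5.0, 6.0, 7.0],
})

# Keep only the rows for one category, select the numeric columns, then average
fern_means = plants[plants['species'] == 'fern'][['height_cm', 'width_cm']].mean().round(2)
moss_means = plants[plants['species'] == 'moss'][['height_cm', 'width_cm']].mean().round(2)

# Stack the two results into one table, one row per category
comparison = pd.DataFrame([fern_means, moss_means], index=['Fern', 'Moss'])
print(comparison)
```

Each row of `comparison` holds the means for one category, so the two groups can be compared column by column.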

**4b)** Compare the results for these two types of counties.

...

**4c)** In the table below are some snippets of code.

| **Last Name Initial** | **Code**                               |
|:----------------------|:---------------------------------------|
| A-L                   | df[df['nchs_urbanization'] == 'Small metro'] |
| M-Z                   | df['nchs_urbanization'] == 'Large fringe metro' |

Based on your last name, interpret the snippet of code.



## **QUESTION 5**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 5a or 5b, but not both.

**5a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**5b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
1. Run all code one last time
2. File -> Print Preview opens in a new browser window
3. Verify you can see all graphs. If not, go back to step 1.
4. File -> Print (or Ctrl+P / Cmd+P)
5. Change destination to PDF (don't save, yet)
6. Scroll through preview to make sure you can see your graphs entirely. If not, click Cancel. Make the browser window narrower. Go back to step 4.
7. Repeat steps 4-6 until you can see your graphs completely. But do not make them too narrow.
8. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
1. Login to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the "Tutorial 2 Practice Upload" assignment
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"
