# Analysis of Groups Disproportionately Impacted by COVID-19

## Background
#### Many of us face the daily reality that SARS-CoV-2, a recently identified human coronavirus, has caused a worldwide pandemic. The respiratory illness it causes, COVID-19, spreads through droplet-based transmission and can lead to severe respiratory problems, kidney failure, and even death. The disease affects almost everyone's daily life, but it impacts certain groups disproportionately. It is commonly known that people with certain medical conditions and people in specific age categories are at increased risk of severe disease following exposure to the virus. However, the impact of race, geographic location, and economic status is less often acknowledged. According to the CDC, "Long-standing systemic health and social inequities have put many people from racial and ethnic minority groups at increased risk of getting sick and dying from COVID-19" (CDC.gov). Factors that may contribute to this increased risk include racial discrimination, healthcare access, and gaps in education, income, and wealth. This is an instance of health care disparities, which are differences in health and health care between population groups. For example, "In the U.S. African Americans are contracting SARS-CoV-2 at higher rates and are more likely to die from COVID-19" (Del Rio 2020).

## Why is this issue important?
#### Health disparities adversely impact groups of people who have historically experienced greater obstacles to health based on race, religion, socioeconomic status, gender, age, mental health, or disability. These disparities do not only affect the groups that experience them; they can also limit overall gains in quality of care for the broader population because of unnecessary costs (health care costs and lost productivity). In the case of COVID-19 specifically, addressing these disparities will allow for better contact tracing and containment of the disease, which benefits all members of society. To address these disparities, it is crucial to recognize where they stem from and who is being impacted at the highest rate. We should therefore turn to the data to examine where the disease hot spots are and why they are more severely affected. Are there population trends common to these locations? With this knowledge, public health officials are better equipped to address the needs of the public through education, targeted health programs, and more advanced surveillance efforts.

## This Project Aims to:
#### Examine which factors influence the disproportionate outcomes of COVID-19 and better understand which populations are impacted.

## How Will This Be Achieved:
#### Unit of Analysis 
- COVID-19 Hotspots across the United States
- Population present in those locations
#### List of potential variables to use in analysis
- Death rate 
- Race/Ethnicity
- Gender
- Underlying health conditions
- Annual household income
- Health visits
#### Visualization Techniques to Use: 
- Summary statistics to give a brief description of each variable
- Scatter plots and bar charts to help visualize the data more effectively

In [21]:
# Ideas for a general project flow
# Part 1:
# First determine which regions of the United States are most impacted by coronavirus at the state and county level
    # 1. Download NY Times data to determine the top 5 and bottom 5 states based on the most and fewest overall COVID-19 cases and deaths
    # 2. Divide each state into its top and bottom 5 counties
# Part 2:
# Determine demographic data for the top and bottom 5 states and counties
    # Factors to look at:
        # Income
        # Gender
        # Race
# On the basis of these correlations we can determine whether any one demographic is statistically more likely to get COVID-19
# This will allow governments and health care workers to better target and manage the spread of this disease
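The two ranking steps in Part 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the `totals` table, its state and county labels, and its counts are made up, and `N` is set to 2 rather than 5 to keep the example small.

```python
import pandas as pd

# Hypothetical per-county totals, made up for illustration (not real counts).
totals = pd.DataFrame({
    "state":  ["A", "A", "A", "B", "B", "B"],
    "county": ["a1", "a2", "a3", "b1", "b2", "b3"],
    "cases":  [500, 120, 30, 900, 45, 300],
})

N = 2  # the project plan uses 5; 2 keeps this example small

# Step 1: rank states by their total cases.
state_totals = totals.groupby("state")["cases"].sum()
top_states = state_totals.nlargest(N)
bottom_states = state_totals.nsmallest(N)

# Step 2: within each state, keep the N counties with the most and fewest cases.
ranked = totals.sort_values(["state", "cases"], ascending=[True, False])
top_counties = ranked.groupby("state").head(N)
bottom_counties = ranked.groupby("state").tail(N)

print(top_states)
print(top_counties)
print(bottom_counties)
```

Sorting once and then taking `head` and `tail` per group keeps the top and bottom selections consistent with each other.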

In [22]:
# Import libraries needed to clean data
import pandas as pd
import numpy as np

In [23]:
# Import the states CSV as a pandas DataFrame and define the columns
# This data set was retrieved from GitHub: https://github.com/nytimes/covid-19-data
# header=None reads the file's own header line as the first data row (row 0)
df_Covid = pd.read_csv('us-states.csv', header=None)
df_Covid.columns = ["date","state","fips","cases","deaths"]

# Variable definitions:
- date: from January 21, 2020 to October 21, 2020, when the data was downloaded
- state: a U.S. state or one of its surrounding territories
- fips: Federal Information Processing Standard codes that identify unique geographic regions
- cases: The total (cumulative) number of cases of COVID-19 as of that date, including both confirmed and probable
- deaths: The total (cumulative) number of deaths from COVID-19 as of that date, including both confirmed and probable
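As an alternative to assigning the column names by hand, the file could be read with its own header line. The sketch below assumes the first line of the CSV is `date,state,fips,cases,deaths`, matching the columns defined above; it uses a small inline sample in place of `us-states.csv` so that it runs on its own.

```python
import io
import pandas as pd

# Tiny inline stand-in for us-states.csv (assumed layout: header line, then rows).
sample_csv = io.StringIO(
    "date,state,fips,cases,deaths\n"
    "2020-01-21,Washington,53,1,0\n"
    "2020-01-22,Washington,53,1,0\n"
    "2020-01-24,Illinois,17,1,0\n"
)

# With the default header handling, the first line supplies the column names,
# so no header row ends up in the data and the counts parse as integers.
states = pd.read_csv(
    sample_csv,
    parse_dates=["date"],   # store dates as datetimes rather than strings
    dtype={"fips": str},    # keep FIPS codes as text so leading zeros survive
)

print(states.dtypes)
print(len(states))
```

Read this way, there is no header row to drop and no later conversion of `cases` and `deaths` to integers is needed.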

In [24]:
# find the length of the data frame to see the scale of data that we are working with
len(df_Covid)

12830

In [25]:
# preview the data frame; skip row 0 (the header line read in as data) and name the result df1 for ease in future manipulation
df1 = df_Covid[1:]
df1.head()

Unnamed: 0,date,state,fips,cases,deaths
1,2020-01-21,Washington,53,1,0
2,2020-01-22,Washington,53,1,0
3,2020-01-23,Washington,53,1,0
4,2020-01-24,Illinois,17,1,0
5,2020-01-24,Washington,53,1,0


In [26]:
# convert each value in the cases and deaths column to integers
df1['cases'] = df1['cases'].astype(int)
df1['deaths']= df1['deaths'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
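The warning above appears because `df1` is a slice of `df_Covid`, so pandas cannot tell whether the assignment is meant to change the original frame. One common way to avoid it, sketched here on a small stand-in frame with made-up contents, is to take an explicit copy before assigning.

```python
import pandas as pd

# Small stand-in for df_Covid as read with header=None: row 0 holds the header text.
df_Covid = pd.DataFrame(
    [["date", "state", "fips", "cases", "deaths"],
     ["2020-01-21", "Washington", "53", "1", "0"],
     ["2020-01-24", "Illinois", "17", "1", "0"]],
    columns=["date", "state", "fips", "cases", "deaths"],
)

# .copy() makes df1 an independent DataFrame rather than a slice of df_Covid,
# so the column assignments below are unambiguous.
df1 = df_Covid[1:].copy()
df1["cases"] = df1["cases"].astype(int)
df1["deaths"] = df1["deaths"].astype(int)

print(df1.dtypes)
```

The original `df_Covid` is left untouched, which is usually what is wanted when the slice is only a cleaned working copy.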


# Top 5 states by Total Coronavirus Cases:
1. California
2. New York 
3. Texas
4. Florida
5. Illinois


In [27]:
# create a data frame with the 'cases' values summed for each state
# use the Visualize tool in Deepnote to determine the top 5 states based on total coronavirus cases
cases_df = df1.groupby('state')['cases'].agg(['sum'])
cases_df

state,sum
Alabama,14280771
Alaska,709540
Arizona,22969988
Arkansas,7340532
California,80034435
Colorado,8263081
Connecticut,8954141
Delaware,2561664
District of Columbia,2110257
Florida,67493015
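One caveat when reading these sums: the preview above shows Washington with 1 case on each of several consecutive days, which suggests that `cases` is a running total. If so, summing over dates counts earlier cases again on every later date, and the value on the most recent date is closer to a "total cases" figure. The sketch below uses made-up numbers for two hypothetical states to show how the two approaches can rank states differently.

```python
import pandas as pd

# Hypothetical cumulative counts for two states over three days (illustrative only).
df = pd.DataFrame({
    "date":  ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "state": ["A"] * 3 + ["B"] * 3,
    "cases": [30, 35, 40, 1, 2, 90],
})

# Summing a running total counts earlier cases again on every later date.
summed = df.groupby("state")["cases"].sum()

# Taking each state's value on its most recent date gives the total to date.
latest = df.sort_values("date").groupby("state")["cases"].last()

print(summed)             # A ranks first here
print(latest)             # B ranks first here
print(latest.nlargest(1))
```

With real data, `latest.nlargest(5)` would give a top 5 based on totals to date rather than on how long a state has been reporting cases.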


# Top 5 states by Total Coronavirus Deaths:
1. New York
2. New Jersey
3. California
4. Massachusetts
5. Texas

In [28]:
# create a data frame with the 'deaths' values summed for each state
# use the Visualize tool in Deepnote to determine the top 5 states
deaths_df = df1.groupby('state')['deaths'].agg(['sum'])
deaths_df

state,sum
Alabama,267107
Alaska,4531
Arizona,562757
Arkansas,105411
California,1654882
Colorado,314013
Connecticut,759950
Delaware,90627
District of Columbia,97517
Florida,1313535


In [29]:
# Import the county txt file as a pandas DataFrame and define the columns
# This data set was retrieved from GitHub: https://github.com/nytimes/covid-19-data
# header=None reads the file's own header line as the first data row (row 0)
df_counties = pd.read_csv('us-counties.txt', header=None)
df_counties.columns = ["date","county","state","fips","cases","deaths"]



In [30]:
# find the length of the data frame to see the scale of data that we are working with
len(df_counties)

647925

In [31]:
# preview the data frame; skip row 0 (the header line read in as data) and name the result df2 for ease in future manipulation
df2 = df_counties[1:]
df2.head()

Unnamed: 0,date,county,state,fips,cases,deaths
1,2020-01-21,Snohomish,Washington,53061,1,0
2,2020-01-22,Snohomish,Washington,53061,1,0
3,2020-01-23,Snohomish,Washington,53061,1,0
4,2020-01-24,Cook,Illinois,17031,1,0
5,2020-01-24,Snohomish,Washington,53061,1,0


In [32]:
# convert each value in the cases and deaths columns to integers
df2['cases'] = df2['cases'].astype(int)
df2['deaths'] = df2['deaths'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


# Top 5 counties by cases: 
1. Cook  
2. Suffolk
3. Dallas 
4. Broward
5. Clark

In [33]:
# create a data frame with the 'cases' values summed for each county name
# use the Visualize tool in Deepnote to determine the top 5 counties based on total coronavirus cases
cases_df2 = df2.groupby('county')['cases'].agg(['sum'])
cases_df2

county,sum
Abbeville,48835
Acadia,304890
Accomack,175750
Ada,1169125
Adair,124923
...,...
Yukon-Koyukuk Census Area,6671
Yuma,1391805
Zapata,26411
Zavala,30514
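A point to watch with county-level grouping: county names are not unique across states (there is a Suffolk County in both New York and Massachusetts, for example), so grouping on `county` alone combines them into a single row. A minimal sketch with made-up counts, grouping on state and county together instead:

```python
import pandas as pd

# Hypothetical totals; "Suffolk" appears in two different states.
counties = pd.DataFrame({
    "county": ["Suffolk", "Suffolk", "Cook"],
    "state":  ["New York", "Massachusetts", "Illinois"],
    "cases":  [100, 60, 120],
})

# Grouping by name alone merges the two Suffolk counties into one row.
by_name = counties.groupby("county")["cases"].sum()

# Grouping by state and county keeps them apart.
by_state_county = counties.groupby(["state", "county"])["cases"].sum()

print(by_name)
print(by_state_county)
```

In this made-up example the merged "Suffolk" row outranks Cook even though neither Suffolk county does on its own, which is the kind of distortion a name-only grouping can introduce into a top 5 list.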


# Top 5 Counties by Deaths: 
1. Cook 
2. Suffolk
3. Wayne
4. Bergen
5. Westchester

In [34]:
# create a data frame with the 'deaths' values summed for each county name
# use the Visualize tool in Deepnote to determine the top 5 counties based on total coronavirus deaths
deaths_df2 = df2.groupby('county')['deaths'].agg(['sum'])
deaths_df2

county,sum
Abbeville,960
Acadia,10528
Accomack,2613
Ada,12717
Adair,4422
...,...
Yukon-Koyukuk Census Area,110
Yuma,32951
Zapata,372
Zavala,1104


## *This is the point I have gotten to so far. Next, I plan to download census data to look at the demographic information.*
- ### The variables I want to focus on at both the state and county level are: 
    1. Average income 
    2. Race distribution 
    3. Gender
## *Finally, I plan to summarize this data using summary statistics and bar charts/scatter plots to visualize the impact of the various demographic factors on COVID-19 case and death rates.*
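A rough sketch of how that planned step might look once census data is available. Everything here is an assumption for illustration: the `census` table, its column names (`population`, `median_income`), the placeholder FIPS codes, and all of the numbers are made up.

```python
import pandas as pd

# Hypothetical county-level COVID totals (illustrative values only).
covid = pd.DataFrame({
    "fips":   ["00001", "00002", "00003", "00004"],
    "cases":  [500, 1200, 300, 900],
    "deaths": [10, 60, 3, 36],
})

# Hypothetical census table; the column names are assumptions.
census = pd.DataFrame({
    "fips":          ["00001", "00002", "00003", "00004"],
    "population":    [100_000, 150_000, 50_000, 120_000],
    "median_income": [70_000, 40_000, 85_000, 45_000],
})

# Join the two tables on the shared county identifier.
merged = covid.merge(census, on="fips", how="inner")

# Normalize by population so large and small counties are comparable.
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000
merged["deaths_per_100k"] = merged["deaths"] / merged["population"] * 100_000

# Summary statistics for each numeric variable of interest.
print(merged[["cases_per_100k", "deaths_per_100k", "median_income"]].describe())

# Pearson correlation between income and the death rate.
income_death_corr = merged["median_income"].corr(merged["deaths_per_100k"])
print(income_death_corr)
```

Working with rates per 100,000 residents rather than raw counts keeps populous counties from dominating the comparison. With matplotlib installed, `merged.plot.bar(x="fips", y="deaths_per_100k")` or `merged.plot.scatter(x="median_income", y="deaths_per_100k")` would be one way to draw the planned charts.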