# Coronavirus Cases in U.S. Prisons
## COVID-19 infection in prisons across the United States, 2020 - 2021

This dataset can be found on [Kaggle](https://www.kaggle.com/shaneysze/covid-cases-in-prisons?select=covid_prison_rates.csv). A big thanks to the contributors of this dataset (listed on the site)!

### How Will I Use This Data?

I am most interested in exploring total COVID-19 case counts among prisoners. In this project, I clean, wrangle, and visualize the data to draw out some findings.

In [None]:
!pip install chart_studio

In [None]:
import pandas as pd 
import numpy as np 
import datetime as dt
from datetime import datetime  
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline
import chart_studio.plotly as py
import plotly.graph_objects as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
# For Notebooks
init_notebook_mode(connected=True)

In [None]:
covid_cases = pd.read_csv('../input/covid-cases-in-prisons/covid_prison_cases.csv')

In [None]:
covid_cases.head(5)

There seem to be a lot of missing values.

In [None]:
covid_cases.info()

In [None]:
plt.figure(figsize=(8,6))
sns.heatmap(covid_cases.isnull(),cmap='viridis') #lots of nulls
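
The heatmap gives a visual impression of the nulls; the share of missing values per column can also be computed directly. A minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for covid_cases; values are made up for illustration
toy = pd.DataFrame({
    'name': ['Alabama', 'Alaska', 'Arizona', 'Arkansas'],
    'total_prisoner_cases': [10.0, np.nan, 30.0, 40.0],
    'notes': [np.nan, np.nan, np.nan, 'text'],
})

# Fraction of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Running the same two calls on `covid_cases` would list which columns carry most of the nulls.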

I will drop the columns that are not of interest for this project. Luckily, the majority of missing values come from these columns.

In [None]:
#drop unwanted cols
covid_cases = covid_cases.drop(['staff_tests','staff_tests_with_multiples','staff_partial_dose', 'staff_full_dose', 'prisoner_tests','prisoner_tests_with_multiples','prisoners_partial_dose', 'prisoners_full_dose','notes'],axis=1)

Since this is time-series data, it is reasonable to treat missing values in some variables as 0, since there may have been no change on that particular day or month. I will therefore replace NaNs with 0. Before doing so, I will create month, day-of-week, and year columns, then drop the rows with a null 'as_of_date', since I do not want zeroes for dates.

In [None]:
#converting as_of_date into date objects
covid_cases['as_of_date'] = pd.to_datetime(covid_cases['as_of_date'])

In [None]:
#creating month, day, and year cols
covid_cases['month'] = covid_cases['as_of_date'].apply(lambda time: time.month)
covid_cases['day of week'] = covid_cases['as_of_date'].apply(lambda time: time.dayofweek)
covid_cases['year'] = covid_cases['as_of_date'].apply(lambda time: time.year)

In [None]:
#labels for days of week
dmap = {0: 'Mon',1:'Tues',2:'Wed',3:'Thurs',4:'Fri',5:'Sat',6:'Sun'}
covid_cases['day of week'] = covid_cases['day of week'].map(dmap)

In [None]:
Mmap = {1: 'Jan',2:'Feb',3:'Mar',4:'April',5:'May',6:'Jun',7:'July',8:'Aug',9:'Sept',10:'Oct',11:'Nov',12:'Dec'}
covid_cases['month'] = covid_cases['month'].map(Mmap)
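
As an alternative sketch, the same date parts can be derived with the pandas `.dt` accessor instead of `apply`. The `year_month` column is a hypothetical addition, not part of the original notebook; it keeps year and month together so that values sort chronologically across 2020 and 2021. Note that `dt.day_name()` returns full English day names rather than the abbreviations used in `dmap`.

```python
import pandas as pd

# Toy dates standing in for covid_cases['as_of_date'], including one missing value
toy = pd.DataFrame({'as_of_date': pd.to_datetime(['2020-03-26', '2021-02-02', None])})

toy['month'] = toy['as_of_date'].dt.month              # numeric month, NaN for missing dates
toy['day of week'] = toy['as_of_date'].dt.day_name()   # full day names, NaN for missing dates
toy['year_month'] = toy['as_of_date'].dt.to_period('M')  # hypothetical column: monthly period

print(toy)
```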

In [None]:
#dropping rows from as_of_date that contain nulls 
covid_cases = covid_cases[covid_cases['as_of_date'].notna()]

In [None]:
#replacing other nulls with 0
covid_cases = covid_cases.fillna(0)  # reassign rather than fill in place on a filtered frame

In [None]:
#sort by chronological order
covid_cases = covid_cases.sort_values(by='as_of_date',ascending=True)

In [None]:
covid_cases.head(3)

## Total Prisoner Cases

This section looks at the distribution of total prisoner cases.

In [None]:
plt.figure(figsize=(12,4))
covid_cases['total_prisoner_cases'].plot(kind='hist')

In [None]:
plt.figure(figsize=(16,4))
sns.boxplot(x='total_prisoner_cases',data=covid_cases)

In [None]:
plt.figure(figsize=(15,6))
sns.boxplot(x='name',y='total_prisoner_cases',data=covid_cases)
plt.xticks(rotation=90)
plt.xlabel('State')
plt.ylabel('Total Prisoner COVID-19 Cases')

In [None]:
plt.figure(figsize=(15,6))
sns.boxplot(x='name',y='total_prisoner_deaths',data=covid_cases)
plt.xticks(rotation=90)
plt.xlabel('State')
plt.ylabel('Total Prisoner Deaths Due to COVID-19')

Looking at the boxplots and histogram above, it is clear that the data has many outliers. I keep them, because they are meaningful rather than errors: case counts were volatile during the pandemic.
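
By default, seaborn boxplots draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. A minimal sketch of counting such points, using made-up numbers rather than values from the dataset:

```python
import pandas as pd

# Made-up case counts for illustration only
cases = pd.Series([0, 5, 10, 12, 15, 18, 20, 400], name='total_prisoner_cases')

# Quartiles and the 1.5 * IQR fences
q1, q3 = cases.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot would flag
outliers = cases[(cases < lower) | (cases > upper)]
print(len(outliers), 'outlier(s):', outliers.tolist())
```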

## A Comparison of Total Cases and Deaths Between Staff and Prisoners

In [None]:
plt.figure(figsize=(12,4))
sns.lineplot(x='month',y='total_prisoner_cases',data=covid_cases,ci=False,sort=True,color='r',label='Prisoners')
sns.lineplot(x='month',y='total_staff_cases',data=covid_cases,ci=False,sort=True,color='blue',label='Staff')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Month \n (March 2020 - Feb 2021)')
plt.ylabel('Total COVID-19 Cases')
plt.title('Staff and Prisoner COVID-19 Cases')

Case counts for staff and prisoners differ drastically.

In [None]:
plt.figure(figsize=(12,4))
sns.lineplot(x='month',y='total_prisoner_deaths',data=covid_cases,ci=False,sort=True,color='r',label='Prisoners')
sns.lineplot(x='month',y='total_staff_deaths',data=covid_cases,ci=False,sort=True,color='blue',label='Staff')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Month \n (March 2020 - Feb 2021)')
plt.ylabel('Total Deaths Due to COVID-19')
plt.title('Staff and Prisoner COVID-19 Deaths')

Staff deaths were much lower than prisoner deaths, though both followed similar trends.

## A Look into Federal Prison COVID-19 Stats

In [None]:
federal_prisons = covid_cases[covid_cases['name']=='Federal']

In [None]:
plt.figure(figsize=(12,4))
sns.lineplot(x='month',y='total_prisoner_cases',data=federal_prisons,ci=False,sort=True,color='r',label='Prisoners')
sns.lineplot(x='month',y='total_staff_cases',data=federal_prisons,ci=False,sort=True,color='blue',label='Staff')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Month \n (March 2020 - Feb 2021)')
plt.ylabel('Total COVID-19 Cases')
plt.title('Staff and Prisoner COVID-19 Cases (Federal Prisons)')

In [None]:
plt.figure(figsize=(12,4))
sns.lineplot(x='month',y='total_prisoner_deaths',data=federal_prisons,ci=False,sort=True,color='r',label='Prisoners')
sns.lineplot(x='month',y='total_staff_deaths',data=federal_prisons,ci=False,sort=True,color='blue',label='Staff')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Month \n (March 2020 - Feb 2021)')
plt.ylabel('Total Deaths Due to COVID-19')
plt.title('Staff and Prisoner COVID-19 Deaths (Federal Prisons)')

Federal prisons show trends similar to those of state prisons.

## Choropleth Map

For our choropleth map, it is important we get the sum of total cases per state to display it on our map. 

In [None]:
covid_cases['name'].unique()

In [None]:
#filtering out 'Federal' since it is not a state for the map
covid_map = covid_cases[covid_cases['name'] != 'Federal']

In [None]:
#getting sums of variable per state
df = covid_map.groupby('name').sum(numeric_only=True)  # sum numeric columns only; dates and strings cannot be summed meaningfully
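
One caveat worth checking: if `total_prisoner_cases` is a running total as of each report date (an assumption here, suggested by the column name), summing across dates counts the same cases once per report. A minimal sketch on made-up numbers comparing the sum with the latest report per state:

```python
import pandas as pd

# Made-up cumulative counts for two states at two report dates
toy = pd.DataFrame({
    'name': ['Texas', 'Texas', 'Ohio', 'Ohio'],
    'as_of_date': pd.to_datetime(['2020-04-01', '2020-05-01', '2020-04-01', '2020-05-01']),
    'total_prisoner_cases': [100, 250, 40, 90],
})

# Summing over report dates, as in the cell above
summed = toy.groupby('name')['total_prisoner_cases'].sum()

# Taking the most recent report per state instead
latest = (toy.sort_values('as_of_date')
             .groupby('name')['total_prisoner_cases']
             .last())

print(pd.DataFrame({'summed': summed, 'latest': latest}))
```

Because nulls were replaced with 0 earlier in the notebook, `.max()` may be a safer choice than `.last()` on the real data if a state's final report had a missing value.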

'df' now has the states ('name') as its index. To draw the map I also need each state's abbreviation, so I will create another dataframe that contains the states ('name') and their abbreviations ('abbreviation'). Joining the two dataframes on 'name' gives a dataframe with regular columns for the state name, its abbreviation, and the summed variables.

In [None]:
abbrev = covid_map[['name','abbreviation']].drop_duplicates()  # one row per state

In [None]:
#joining both df's
covid = abbrev.join(df, how="left",on='name')
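
A toy-data sketch of what this join produces. The frame and its values are made up; dropping duplicate name/abbreviation pairs first keeps the result to one row per state.

```python
import pandas as pd

# Made-up rows: two reports for Texas, one for Ohio
toy = pd.DataFrame({
    'name': ['Texas', 'Texas', 'Ohio'],
    'abbreviation': ['TX', 'TX', 'OH'],
    'total_prisoner_cases': [100, 250, 40],
})

# Per-state sums, indexed by 'name'
totals = toy.groupby('name').sum(numeric_only=True)

# One name/abbreviation pair per state
abbrev = toy[['name', 'abbreviation']].drop_duplicates()

# Left join: match abbrev's 'name' column against the index of totals
joined = abbrev.join(totals, on='name', how='left')
print(joined)
```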

## Total Prisoner Cases Map

In [None]:
data = dict(type='choropleth',
            colorscale = 'Portland',
            reversescale = False,
            locations = covid['abbreviation'],
            z = covid['total_prisoner_cases'],
            locationmode = 'USA-states',
            text = covid['name'],
            marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
            colorbar = {'title':"Total Cases"}
            ) 

In [None]:
layout = dict(title = 'Total Prisoner Cases in US Prisons',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [None]:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

## Findings

California is the state with the most total cases (~1.5M). Texas had the second most cases (~1.3M).

## Total Prisoner Deaths Map

In [None]:
data = dict(type='choropleth',
            colorscale = 'Portland',
            reversescale = False,
            locations = covid['abbreviation'],
            z = covid['total_prisoner_deaths'],
            locationmode = 'USA-states',
            text = covid['name'],
            marker = dict(line = dict(color = 'rgb(255,255,255)',width = 1)),
            colorbar = {'title':"Total Deaths"}
            ) 

In [None]:
layout = dict(title = 'Total Prisoner Deaths in US Prisons',
              geo = dict(scope='usa',
                         showlakes = True,
                         lakecolor = 'rgb(85,173,240)')
             )

In [None]:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

## Findings
Texas leads the nation in total prisoner deaths due to COVID-19 (~12k). Florida has the second most deaths in the nation (7,701).