# COVID-19 in California: Exploratory Data Analysis

## Abstract for Kaggle: 

1. Scientific Rigor: My project uses credible data from the New York Times, the CDC, the California Department of Public Health, Google, and Coders Against COVID.
2. Scientific Model and Strategy: My procedures are data driven. I back all my conclusions with data and use statistical techniques such as regression to find patterns and relationships in the data.
3. Novel Insights: I have summarized my key findings at the end of the notebook. The most important novel insights are the third and fourth key findings:
    - Key finding 3: While Imperial and Kings counties have been able to keep the death:cases ratio low, the low ICU availability and the high percentage of the population with the virus may be a strong indication that these counties might experience a higher death/cases ratio in the coming weeks.
    - Key finding 4: There is a strong relationship between a county's COVID-19 cases as a percentage of population and two housing metrics: the percentage of housing in structures with 10 or more units and the percentage of housing units with more people than rooms. This may be due in part to California’s social distancing policies.
4. Market Translation and Applicability: This notebook addresses a need of statewide public health officials, as it analyzes very recent COVID-19 data, identifies hard-hit and potentially at-risk counties, and provides some insight into why counties differ in their COVID-19 cases as a percentage of county population.
5. Speed to market: The important features from the regression and the ICU results should be integrated into risk-assessment tools when determining resource allocation.
6. Longevity of solution in market: This notebook can easily be duplicated for other states, because all the data is publicly available in the same datasets used here.
7. Ability of user to collaborate and contribute to other solutions within the Kaggle community: Many of these data sources were obtained from Kaggle, so I am contributing by analyzing them and providing insights.
 


As the coronavirus continues to spread, it has become ever more apparent that we are dealing with the biggest public health challenge of our generation. Fortunately, California’s state and county officials have been working tirelessly for months to help protect and ensure the safety of all citizens. The purpose of this project is to use my data wrangling, visualization, and analytical skills to gain some insight into the spread of the coronavirus in California and to learn about the challenges that our government experts face while making these tough decisions. In this report, I will not only share my results but also go in depth into the decisions behind my conclusions.
This analysis is divided into three parts: 
1.	Overview of the Spread of COVID-19 in California 
2.	Analysis of some Factors that could explain the differences in spread among counties
3.	Analysis of Testing Access, and ICU availability across California. 


## Overview of the Spread of COVID-19 in California

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sb 
import matplotlib.pyplot as plt 
import geopandas as geo
%matplotlib inline

The coronavirus data was obtained from the New York Times GitHub repository, and the California county shapefile was obtained from the California state government. The first step is to preprocess and clean the data.

In [None]:
# read in data
cali= geo.read_file(r"../input/covid19/CA_Counties_TIGER2016.shp")
cali.head()

In [None]:
# read in data
cases=pd.read_csv(r"https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
cali_cases= cases[cases["state"]=="California"]
cali_cases.head()

In [None]:
cali_cases=cali_cases.reset_index().drop("index", axis=1)

In [None]:
cali_cases=cali_cases.drop("fips",axis=1)
cali_cases.info()

In [None]:
# When I combined the data, I chose inner join since I wanted to keep the counties that were present in both datasets. 
covid=cali_cases.merge(cali, left_on="county",right_on="NAME", how="inner")
covid.head()
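
An inner join silently drops any county name that does not appear in both tables. As a hedged sketch, an outer join with pandas' `indicator=True` option shows which rows would be lost. The two small frames below are made-up stand-ins (`cases_toy` and `shapes_toy` are hypothetical names), not the notebook's data:

```python
import pandas as pd

# Made-up stand-ins for the case table and the shapefile table.
cases_toy = pd.DataFrame({"county": ["Alameda", "Unknown", "Yolo"],
                          "cases": [10, 3, 5]})
shapes_toy = pd.DataFrame({"NAME": ["Alameda", "Yolo", "Sierra"]})

# An outer join keeps every row; the "_merge" column records where each row came from.
check = cases_toy.merge(shapes_toy, left_on="county", right_on="NAME",
                        how="outer", indicator=True)

# Rows that an inner join would have dropped.
dropped = check[check["_merge"] != "both"]
print(dropped[["county", "NAME", "_merge"]])
```

Running the same check on the real tables before the inner join would confirm whether any county names fail to match.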

In [None]:
# Keep only the columns of interest
covid=covid[["date","county","cases","deaths","geometry"]]
covid.shape

In [None]:
covid.head()

I read in the data, checked it for nulls, merged the datasets, and kept the columns I wanted. The NYT coronavirus dataset contains the cumulative number of cases and deaths. For my analysis, however, I want the daily number of new cases and deaths, and the day-over-day growth in those daily numbers. These are better suited to a time series than cumulative counts because they give a clearer sense of how the coronavirus has progressed from day to day over the past few months.

In [None]:
# Calculates the new cases/deaths and growths for one county
def fun(name,covid=covid):
    county=covid[covid["county"]== name]
    county= county.reset_index()
    # The first day has no previous day, so it keeps its cumulative value
    new_cases= [county["cases"][0]]+[county["cases"][i]-county["cases"][i-1]for i in range(1,len(county["cases"]))]
    new_deaths= [county["deaths"][0]]+[county["deaths"][i]-county["deaths"][i-1]for i in range(1,len(county["deaths"]))]
    county["new_cases"]=new_cases
    county["new_deaths"]= new_deaths
    g=[0]
    for i in range(1,len(county["new_cases"])): 
        x= ((county["new_cases"][i]-county["new_cases"][i-1])/county["new_cases"][i-1])*100
        g.append(x)
    county["cases_growth"]= g
    h=[0]
    for i in range(1,len(county["new_deaths"])): 
        x= ((county["new_deaths"][i]-county["new_deaths"][i-1])/county["new_deaths"][i-1])*100
        h.append(x)
    county["death_growth"]= h
    county["date"]= pd.to_datetime(county["date"])
    return county


In [None]:
# Implements the function on all counties and puts them into one dataframe
counties= covid["county"].unique()
gdf= fun(counties[0])
for i in counties[1:]: 
    x= fun(i)
    gdf= pd.concat([gdf,x], ignore_index=True)
gdf.shape

In [None]:
gdf.head(10)
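
The per-county loop above can also be expressed with pandas' `groupby` together with `diff` and `pct_change`. The following is only a sketch on made-up cumulative counts for two hypothetical counties (`toy`, "A" and "B" are illustrative), not a drop-in replacement for `fun`:

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical counties.
toy = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "cases": [1, 3, 7, 2, 2, 5],
})
toy = toy.sort_values(["county", "date"]).reset_index(drop=True)

# Daily new cases: difference of the cumulative series within each county.
# The first day has no predecessor, so it keeps its cumulative value.
toy["new_cases"] = toy.groupby("county")["cases"].diff().fillna(toy["cases"])

# Day-over-day growth of new cases in percent; the first day is set to 0.
# A day that follows zero new cases produces inf (division by zero).
toy["cases_growth"] = (
    toy.groupby("county")["new_cases"].pct_change().fillna(0) * 100
)
print(toy)
```

The `inf` value in the last row illustrates why growth rates computed from very small daily counts should be read with caution.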

Before I delve into individual counties, I first want to learn about how California has progressed as a whole. 

In [None]:
# total represents all of california data
total= gdf.groupby("date").sum()
total=total.reset_index().drop("index",axis=1)
total.head()

In [None]:
fig,axes= plt.subplots(1,2,figsize=(20,5))
fig.suptitle("COVID New Cases")
shrunk=total[total["date"]>="2020-03-01"]
axes[0].plot("date","new_cases", data=shrunk)
axes[0].set_title("Number of New Cases")
axes[1].plot("date","new_deaths", data=shrunk, color="red")
axes[1].set_title("Number of New Deaths")
# Label each panel with its own quantity
for ax,label in zip(axes,["Number of New Cases","Number of New Deaths"]): 
       ax.tick_params("x",labelrotation=90)
       ax.set_xlabel("Date")
       ax.set_ylabel(label)
        

These graphs show the daily number of new cases and new deaths from March 1st until June 9th. The number of new cases fluctuates greatly from day to day but trends upward overall. The number of new deaths fluctuates even more, yet it does not show the same upward trend from April 15th to June. Looking closer at new cases, the upward trend appears slower from April 15th to mid-May (approx. May 15th) than from mid-May to June 9th. This acceleration could be related to reopening policies and the protests, although this data alone cannot establish the cause.

In [None]:
# Need to recalculate growth percentages after the groupby
g=[0]
for i in range(1,len(total["new_cases"])): 
        x= ((total["new_cases"][i]-total["new_cases"][i-1])/total["new_cases"][i-1])*100
        g.append(x)
total["cases_growth"]= g
h=[0]
for i in range(1,len(total["new_deaths"])): 
        x= ((total["new_deaths"][i]-total["new_deaths"][i-1])/total["new_deaths"][i-1])*100
        h.append(x)
total["death_growth"]= h

In [None]:
plt.figure(figsize=(8,8))
shrunk=total[total["date"]>="2020-03-01"]
plt.plot("date","cases_growth", data=shrunk)
plt.plot("date","death_growth", data=shrunk)
plt.xticks(rotation=90)
plt.legend();
plt.title("COVID Daily Growth Rate");

The daily percentage growth rate is interesting because some dates show unusually high peaks. The early peaks in March can be attributed to small numbers, but the large peak in cases in mid-April and the large peak in deaths in mid-May stand out. There are many possible reasons. One is delayed reporting by hospitals or other sources, which could erroneously attribute more cases to a certain day. Another is abnormal social events. For instance, it may not be a coincidence that the spike in growth in mid-April (approx. the 15th-20th) occurred during the increase in protests against the stay-at-home order (https://www.foxnews.com/us/california-protest-erupts-over-states-coronavirus-stay-at-home-rules). However, I do not know the exact reason, and it could still be something else.

Now that we understand how the coronavirus has progressed in California as a whole, let's analyze the growth in specific counties and uncover which counties have been hit the hardest. To do this, I will use two metrics: the average number of people who contract the coronavirus daily as a percentage of total population, and the ratio of total deaths to total cases. I chose the percentage of total population instead of the raw number of people because I did not want to bias my results against small counties. I use the deaths/cases ratio to understand where the virus is more deadly: deaths as a percentage of total population are very small, and the deaths/cases ratio is a better representation of how deadly the virus has been among those infected.
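
Written out, the two metrics look as follows. The symbols are introduced here only for illustration: $n_t$ is the number of new cases in a county on day $t$, $T$ is the number of days in the chosen window, $P$ is the county's estimated population, and $C$ and $D$ are the county's cumulative cases and deaths on the last day.

$$
\text{average daily case percentage} = 100 \times \frac{\frac{1}{T}\sum_{t=1}^{T} n_t}{P},
\qquad
\text{death/cases ratio} = \frac{D}{C}
$$

The first metric is scaled by population so that small counties are comparable with large ones; the second is scaled by cases, so it is unstable for counties with very few cases.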

In [None]:
# Calculate the average number of people who contract the coronavirus daily
def average_new_cases_percent(table, start,end): 
    shrunk=table[(table["date"]>= start) & (table["date"]<= end) ]
    shrunk=shrunk.reset_index()
    return (np.mean(shrunk["new_cases"])/shrunk["e_totpop"][0])*100
def average_new_death_percent(table, start,end): 
    shrunk=table[(table["date"]>= start) &( table["date"]<= end)]
    shrunk= shrunk.reset_index()
    return (np.mean(shrunk["new_deaths"])/shrunk["e_totpop"][0])*100

def ave(d,start,end): 
    av_cases=[]
    av_deaths=[]
    for i in d["county"].unique(): 
        df=d[d["county"]==i]
        av_cases.append(average_new_cases_percent(df,start,end))
        av_deaths.append(average_new_death_percent(df,start,end))
    averages= pd.DataFrame()
    averages["county"]= d["county"].unique()
    averages["av_cases_daily"]= av_cases
    averages["av_deaths_daily"]= av_deaths
    print("Top 25 cases:")
    print(averages.sort_values(by="av_cases_daily", ascending=False).head(25).reset_index()[["county","av_cases_daily"]])

    return averages
    
    
    

In [None]:
# Combining the cases dataframe with a dataframe that contains the estimated population.(read in a later cell)#
new= gdf.merge(jo ,on="county")
new.head()

In [None]:
av=ave(new,"2020-03-01","2020-6-09")

In [None]:
Imperial= av.loc[34][1]
sb.distplot(av["av_cases_daily"]);
plt.scatter(Imperial, 0, color='red', s=100);
plt.title("Average Percent of Population contracting COVID Daily  ");

The distribution is right-skewed. Most of the percentages are between 0.001% and 0.002%. The outliers are Imperial (shown in red), with a percentage of approximately 0.0213%, and Kings County, with a percentage of approximately 0.0145%. These are more than 10x the value for the median California county. Following these counties are Los Angeles County with approx. 0.0065%, Tulare with approx. 0.0058%, and Modoc with approx. 0.0055%. It is important to note that these are not the percentages of people who have had COVID-19 (covered later), but instead the percentage of the population we can expect to contract COVID-19 on a typical day.

In [None]:
# Death/cases ratio as of June 9th
covid["date"]= pd.to_datetime(covid["date"])
final= covid[covid["date"]=="2020-6-09"].copy()  # copy so the new column is not set on a slice
final["d/c"]=final["deaths"]/ final["cases"]
# only keep counties with 100+ cases
sfinal= final[final["cases"]>99]
print("Top 25 death ratio:")
print(sfinal.sort_values(by="d/c", ascending=False).head(25).reset_index().drop("index", axis=1)[["county","cases","deaths","d/c"]])

In [None]:
Yolo=sfinal.sort_values(by="d/c", ascending=False).head(1)["d/c"][1867]
sb.distplot(sfinal["d/c"]);
plt.scatter(Yolo, 0, color='red', s=100)
plt.title("Death/Cases ratio");

In [None]:
# Shapiro Wilk Test to test for normality ; no YOLO
from scipy import stats
no=sfinal.drop(1867)
print(no.skew())
p_value=stats.shapiro(no["d/c"])[1]
if p_value< .05: 
    print("Reject Null hypothesis of normality")
else: 
    print("Fail to reject Null Hypothesis")

Like the percentage distribution, the total death/cases ratio has an outlier. This time it is Yolo County, with a death:cases ratio of approximately 0.11, which is approximately double that of the next-highest county (San Diego), at 0.05. Following San Diego are Los Angeles County with 0.041, Tulare with 0.039, and San Mateo with 0.038. Yolo County's position at the top could be due to a combination of its low total number of cases (only 228) and the "riskiness" of the population who have contracted the virus. It is also interesting that Los Angeles and Tulare hold high positions on both lists, while Kings and Imperial counties are not even in the top 25 by death:cases ratio. Because some counties have very few cases and deaths, I limited the distribution to counties with 100 or more cases.

Also, based on the histogram and the Shapiro–Wilk test, which fails to reject normality, this distribution is consistent with a normal distribution once Yolo County is removed.

Now, let’s consider the number of people with COVID-19 as a percent of population as of June 9th:

In [None]:
# read in dataset with population numbers
social= pd.read_csv(r"../input/covid19/cdcs-social-vulnerability-index-svi-2016-overall-svi-county-level.csv")
social= social[social["state"]=="CALIFORNIA"]
pop= social[["county",'e_totpop']]
jo=pop.copy()

In [None]:
# join the dataset with COVID-19 cases 
final1=covid[covid["date"]=="2020-6-09"]
grouped=pop.merge(final1, on="county", how="inner")


In [None]:
# Calculate percent 
grouped["casesp"]= (grouped["cases"]/grouped["e_totpop"])*100
grouped["deathsp"]= (grouped["deaths"]/grouped["e_totpop"])*100
grouped.head()

In [None]:
print(grouped.sort_values("casesp", ascending=False)[["county","e_totpop","cases","casesp"]].reset_index().drop("index", axis=1).head(25))

In [None]:
plt.close()
changed= grouped.copy()
piv_cases=geo.GeoDataFrame(changed)
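# Cap the two largest values (Imperial, then Kings) at 0.65 so they do not wash out the color scale
piv_cases["casesp"]= piv_cases["casesp"].replace(max(piv_cases["casesp"]),.65)
piv_cases["casesp"]= piv_cases["casesp"].replace(max(piv_cases["casesp"]),.65)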
ax= piv_cases.plot(column="casesp",cmap="OrRd",figsize=(10,10),legend=True,edgecolor="black",linewidth=0.4)
ax.set_title("Percentage of People with COVID-19 up to June 9th ")
ax.set_axis_off()
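
The two `replace` calls above cap the largest values so that the outliers do not dominate the color scale. A possible alternative, sketched here on made-up percentages (`toy` and `casesp_capped` are hypothetical names), is pandas' `Series.clip`, which caps every value above a threshold in one step and can write to a separate column so the original numbers stay intact:

```python
import pandas as pd

# Made-up case percentages for four hypothetical counties.
toy = pd.DataFrame({"county": ["A", "B", "C", "D"],
                    "casesp": [1.75, 1.08, 0.65, 0.10]})

# Cap the plotted values at 0.65; the original column is left unchanged.
toy["casesp_capped"] = toy["casesp"].clip(upper=0.65)
print(toy)
```

Plotting the capped column while reporting the uncapped one keeps the map readable without altering the underlying table.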

This map shows that the counties with a higher percentage of COVID-19 cases are mostly in Southern California, with the number of lightly impacted counties increasing as we move north. The county with the highest percentage is Imperial with approx. 1.75% infected, which is more than 10x the value for the median county. Following Imperial are Kings with approx. 1.08% infected, Los Angeles with approx. 0.65% infected, Tulare with approx. 0.52% infected, and Santa Barbara with approx. 0.41% infected. Taking all three of these statistics into account, I believe that the county hit the hardest by COVID-19 has been Los Angeles County. Even though Imperial County has a very high percentage of cases, LA County's top-three presence in both cases and deaths makes it the hardest-hit county.

## Analysis of Risk Factors

### Source : CDC Social Vulnerability Score 2016 

The source I will be analyzing is the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index data, which I obtained from Kaggle. This county-level dataset comprises features that can be grouped into four categories: Socioeconomic Status, Household Composition & Disability, Minority Status & Language, and Housing & Transportation. Each feature is represented in three formats: as a value, as a percent, and as a percentile. In my analysis, I will only look at the percent versions.

In [None]:
# keep all features represented as a percent 
keep=["county","geometry","ep_pov","ep_unemp","ep_nohsdp",
     "ep_age65","ep_age17","ep_disabl","ep_sngpnt",
      "ep_minrty","ep_limeng",
     "ep_munit","ep_mobile","ep_crowd","ep_noveh","ep_groupq"]

In [None]:
social=social[keep]


In [None]:
# combine cases and SVI data
social_cases=grouped.merge(social,on="county")
social_cases.head()

In [None]:
# Create correlation heatmap
plt.figure(figsize=(10,10))
plt.title("Heatmap for Social Vulnerability Features")
keep1= keep[2:]+["casesp","deathsp","county"]
pplot= social_cases[keep1]
pcorr= pplot.corr()
ax=sb.heatmap(pcorr,vmin=-1, vmax=1, center=0, cmap=sb.diverging_palette(20, 220, n=200),square=True)
ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right');


To determine which features are most relevant, I will perform a multiple linear regression with the social vulnerability variables as the features and the number of cases as a percentage of county population as the label. First, however, I did some preprocessing. The first step was to choose features that minimize collinearity. While there are statistical tools for this, such as the Variance Inflation Factor, I decided to create a heatmap of the correlation coefficients and choose features that intuitively relate to the spread of COVID-19 and are not too similar to each other, which keeps the selection transparent. Here are the features I picked and why:
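
For reference, the Variance Inflation Factor mentioned above can be computed directly: regress each feature on the others and take 1 / (1 - R^2). The sketch below uses only NumPy on synthetic data (the helper `vif` and the arrays `x0`, `x1`, `x2` are illustrative, not the SVI features), so it shows the idea rather than the notebook's actual values:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Synthetic features: x2 is nearly a copy of x0, x1 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.1 * rng.normal(size=200)
scores = vif(np.column_stack([x0, x1, x2]))
print(scores)
```

A common rule of thumb treats a VIF well above 5 to 10 as a sign of problematic collinearity; here the near-duplicate columns score high while the independent one stays close to 1.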

ep_pov: I chose the percentage of persons below the poverty line, as poverty may influence adherence to social distancing.

ep_crowd: I chose the percentage of occupied housing units with more people than rooms. This seems vital to the spread of COVID-19, as it implies more frequent human interaction. This feature appears to be correlated with the age-65 and minority features, which is why I did not include those features in my model; I believe crowding could explain why these communities might experience disparities.

ep_munit: I chose the percentage of housing in structures with 10 or more units for the same reason as ep_crowd. Interestingly, these two features are not linearly correlated, which makes ep_munit an appropriate choice.


In [None]:
interested=["ep_pov","ep_munit","ep_crowd","casesp"]
pplot1=pplot[interested]



My next preprocessing step was to remove outliers. This is a tough task since the dataset is small. While removing many large data points would increase our R^2, we might also lose important information. To balance this, I decided to remove only Kings and Imperial counties.

In [None]:
# find where counties are and remove
print(np.where(pplot["county"]=="Imperial"))
print(np.where(pplot["county"]=="Kings"))

In [None]:
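# Drop the two outlier counties by name instead of by hard-coded row position
none=pplot[~pplot["county"].isin(["Imperial","Kings"])].copy()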

In [None]:
v=["ep_pov","ep_munit","ep_crowd"]
fig,ax=plt.subplots(1,3,figsize=(25,5))
fig.suptitle('Casesp vs Social Factors without Imperial and Kings County', fontsize=20)
for rows,j in zip(ax,v): 
    rows.scatter(x=none[j],y=none["casesp"])
    rows.set_xlabel(j)
    rows.set_ylabel("casesp")

These scatterplots compare each feature with the case percentage. In the first plot, poverty does not appear to be correlated with the case percentage, so I decided not to use it in the regression. From the plot for ep_crowd, I decided that adding a squared (second-degree polynomial) feature would be beneficial, because the case percentage seems to grow faster between 8 and 12 percent than between 2 and 6 percent. I decided to leave ep_munit alone. Initially, it may appear that ep_munit could benefit from a fractional polynomial feature since it appears to level off after 20%. However, the larger data points are sparse, so this pattern could simply be caused by a lack of data. The final step was to min-max scale the variables.

In [None]:
# need to scale the variables 
def min_max(x): 
    return (x- min(x))/(max(x)-min(x))
    

In [None]:
none["crowd2"]= none["ep_crowd"]**2

In [None]:
for i in ["ep_munit","ep_crowd","crowd2"]: 
    none[i]= min_max(none[i])

In [None]:
# build stats models
import statsmodels.api as sm
none["intercept"]=1
mod1= sm.OLS(none["casesp"],none[["intercept","ep_munit","ep_crowd","crowd2"]])
res1=mod1.fit()

In [None]:
res1.summary()

In [None]:
mod2= sm.OLS(none["casesp"],none[["intercept","ep_munit","ep_crowd"]])
res2=mod2.fit()
res2.summary()

I decided to perform two least-squares regressions. The first regression is the model with all our features. The p-values show that ep_munit is highly significant, as its p-value is very small. crowd2 has a p-value of 0.069, which is small but above the customary 0.05 threshold, so it is not significant in this model. ep_crowd is also not significant in this model. The model explains 58.4% of the variation (R^2). The second model is simpler, as it contains just the variables without the squared feature. In this model, both variables are statistically significant with low p-values. Also, the R-squared and adjusted R-squared values are only about .02 lower than in the more complex model. This favors the simpler model, as it is more transparent and almost as effective.

From this regression analysis, ep_munit and ep_crowd appear to be important characteristics that could help explain the difference in the spread of COVID-19 across California counties. Just these two variables explain more than half (55.6%) of the variation in total cases as a percentage of population. Intuitively, this supports the theory that counties with more apartment- or dorm-like structures should have a higher COVID-19 case percentage, since people there are in more frequent contact with each other.
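
The comparison between the two models relies on adjusted R-squared, which penalizes R-squared for each additional feature. As a small sketch of the standard formula (the function name `adjusted_r2` and the sample sizes below are illustrative, not taken from the regression output):

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 for a model with an intercept and n_features predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Illustrative numbers only: the same R^2 is penalized more with more features.
two_features = adjusted_r2(0.5, n_obs=11, n_features=2)
three_features = adjusted_r2(0.5, n_obs=11, n_features=3)
print(two_features, three_features)
```

With a small number of counties, each extra feature costs a noticeable amount of adjusted R-squared, which is one reason to prefer the simpler model when the fit is nearly the same.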


Let's visualize these two features:

In [None]:
print(social.sort_values("ep_munit", ascending=False)[["county","ep_munit"]].reset_index().drop("index", axis=1).loc[:9])

In [None]:
print(social.sort_values("ep_crowd", ascending=False)[["county","ep_crowd"]].reset_index().drop("index", axis=1).loc[:9])

In [None]:
new=social.merge(cali, left_on="county",right_on="NAME", how="inner")[["county","geometry_y","ep_crowd","ep_munit"]]
new.rename(columns = {'geometry_y':'geometry'}, inplace = True)

In [None]:
plt.close()

new=geo.GeoDataFrame(new)
ax= new.plot(column="ep_munit",cmap="Greens",figsize=(10,10),legend=True,edgecolor="black",linewidth=0.4)
ax.set_title("Percent of People in Housing with 10 or more Units ")
ax.set_axis_off()

In [None]:
plt.close()

new=geo.GeoDataFrame(new)
ax= new.plot(column="ep_crowd",cmap="Greens",figsize=(10,10),legend=True,edgecolor="black",linewidth=0.4)
ax.set_title("Percent of Houses with More People than Rooms ")
ax.set_axis_off()

The first map represents ep_munit, while the second represents ep_crowd. The first map shows that most of this type of apartment housing is concentrated in urban, highly populated counties. In fact, San Francisco County leads this category with 36.5% of its housing of this type. It is followed by Los Angeles County (26.5%), Alpine (24.6%), Alameda (21.2%), and Santa Clara (21.1%). This is to be expected of most of these regions, as they contain a high volume of people in a confined area. However, the percentages of overcrowded housing depicted by the second map seem to be distributed more evenly across California. Central California seems to have a lot of crowding relative to the other counties, which differs from the first map. The most crowded counties are Monterey (12.8%), Los Angeles (11.8%), Imperial (10.4%), Santa Barbara (10.2%), and Tulare (9.9%). It is especially noteworthy that Los Angeles County, which I deemed the hardest hit by COVID-19, is near the top of both lists.

In [None]:
# read in Google Mobility Data
glob=pd.read_csv(r"../input/covid19/Global_Mobility_Report.csv")
usa= glob[glob["country_region"]== "United States"]
cal= usa[usa["sub_region_1"]== "California"]
cal.head()

In [None]:
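# Rows without a county (sub_region_2) are the state-level totals; copy to avoid modifying a slice
cali_total= cal[cal["sub_region_2"].isna()].copy()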

In [None]:
cali_total["date"]=pd.to_datetime(cali_total["date"])

In [None]:
col=cali_total.columns[7:]

In [None]:
col[3:6]

In [None]:
cal["date"]=pd.to_datetime(cal["date"])

In [None]:
plt.close()
fig, axes= plt.subplots(2,3, figsize=(20,20))
fig.suptitle("California Mobility Data", size=15);
plt.subplots_adjust(wspace = 0.2,hspace = 0.2)
for i in range(2):
    # Row 0 shows the first three mobility categories, row 1 the next three
    names=col[3*i:3*i+3]
    for j in range(3): 
        axes[i,j].plot(cali_total["date"],cali_total[names[j]])
        axes[i,j].tick_params(axis='x',rotation=90);
        axes[i,j].set_title(names[j]);
        axes[i,j].set_ylabel("Percent from Baseline");
        axes[i,j].set_xlabel("Date");

Furthermore, the strong relationship between the percentage of COVID-19 cases in a county and its housing characteristics may be due in part to California’s social distancing policies. The time series graphs show Google mobility data for grocery/pharmacy, parks, transit, retail/recreation, residential, and workplace locations. While mobility in places like parks and grocery stores has recently returned to baseline levels, overall mobility in public places was drastically reduced during the period. This trend and the drastic increase in residential mobility are strong indicators that the social distancing efforts have been effective. However, the trade-off of this policy is that it leaves people in overcrowded and compact living spaces at risk, which offers a potential hypothesis to explain the results of the regression.

## Analysis of Testing Access, and ICU availability across California

### Testing Access

The data for testing access was obtained from Kaggle and was compiled by the organization Coders Against COVID. This dataset is a crowdsourced list of testing centers across the United States.

In [None]:
# read in data
health=pd.read_csv(r"../input/covid19/crowd-sourced-covid-19-testing-locations.csv")
health.head()

In [None]:
# California represented two ways CA and California
health["location_address_region"].unique()

In [None]:
# Find just california and CA
cali_health=health[(health["location_address_region"]=="CA" )|(health["location_address_region"]=="California")]
cali_health.head()

In [None]:
cali_health.shape

In [None]:
cali_health["is_location_screening_patients"].value_counts()

There are a total of 182 centers in California in this dataset. However, only 170 centers screen for COVID-19.

In [None]:
# The following cells make sure the coordinate system is same for plotting
screening= cali_health[cali_health["is_location_screening_patients"]=="t"]

In [None]:
screening=geo.GeoDataFrame(screening)


In [None]:
cali.crs = {'init' :'epsg:4326'}

In [None]:
screening.crs=cali.crs

In [None]:
screening.drop("geometry", axis=1, inplace=True)

In [None]:
gdf = geo.GeoDataFrame(
    screening, geometry=geo.points_from_xy(screening["lng"], screening["lat"]))

In [None]:
gdf.crs=cali.crs

In [None]:
# Note: Because LA has such a large population, using the real population of LA creates a poor map. Therefore, I changed it to another value. The axis is still correct. 
plt.close()
piv_cases=geo.GeoDataFrame(grouped)
piv_cases.crs= cali.crs
piv_cases["e_totpop"]=piv_cases["e_totpop"].replace(max(piv_cases["e_totpop"]),3253356)

base= piv_cases.plot(column="e_totpop",cmap='BuGn',figsize=(10,10),legend=True,edgecolor="black",linewidth=0.4)
gdf = gdf.to_crs({'init': 'epsg:3857'})
ax=gdf.plot(ax=base,figsize=(10,10),zorder=2,color="red")
ax.set_title("Population vs COVID Testing Centers ")
ax.set_axis_off()

The question of whether certain groups of people have restricted access to testing can only truly be answered at the county level. However, to get a general sense of whether any high-population areas lacked testing centers, I plotted the COVID-19 testing centers on top of a population map of California. The testing centers appear to be concentrated in high-population counties, with lower-population counties having relatively fewer testing centers. This is logical, as more populous areas should have more resources allocated to them.

### ICU Availability

The data for ICU bed counts and COVID-19 patients in ICUs were obtained from the California Department of Public Health.

In [None]:
# read in Data
bed_counts=pd.read_csv(r"../input/covid19/bed_counts.csv")
bed_counts.head()

In [None]:
# change Counties to Lowercase
bed_counts["county"]= [i.lower() for i in bed_counts["COUNTY_NAME"]]

In [None]:
# Capitalize first letter of each word in County Name 
def capital(ls): 
    if len(ls)==1: 
        return ls[0].capitalize()
    else: 
        return ls[0].capitalize()+" " +capital(ls[1:])

In [None]:
bed_counts["county"]= [capital(i.split())for i in bed_counts["county"]]
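
For simple names, Python's built-in `str.title()` gives the same result as the lowercase-then-capitalize steps above. This sketch assumes the raw county names are plain upper-case words separated by spaces; the list `raw_names` is made up for illustration:

```python
# Made-up upper-case county names in the assumed raw format.
raw_names = ["LOS ANGELES", "SAN LUIS OBISPO", "YOLO"]

# title() lowercases each word and capitalizes its first letter.
titled = [name.title() for name in raw_names]
print(titled)
```

One difference to keep in mind: `title()` also capitalizes letters that follow apostrophes or hyphens and does not collapse repeated spaces, whereas `capital(i.split())` does collapse them, so the two approaches can diverge on unusual names.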

In [None]:
icu_beds= bed_counts[["county","INTENSIVE CARE"]]

In [None]:
# Get ICU COVID data
icu_counts=pd.read_csv(r"../input/covid19/icu_cases.csv")
icu_counts.head()

To calculate the number of COVID-19 ICU patients, I decided to add the number of ICU COVID positive patients with ICU COVID Suspected Patients.

In [None]:
icu_counts["total_icu"]= icu_counts["ICU COVID-19 Positive Patients"]+ icu_counts["ICU COVID-19 Suspected Patients"]

In [None]:
icu_counts["Most Recent Date"]=pd.to_datetime(icu_counts["Most Recent Date"])

In [None]:
icu= icu_counts[icu_counts["Most Recent Date"]=="2020-06-09"]

In [None]:
# combine all data sets 
k=icu.merge(icu_beds, left_on="County Name", right_on="county", how="inner").drop("county", axis=1)
cali_icu= k.merge(cali, left_on="County Name", right_on="NAME", how="inner")
cali_icu.head()


In [None]:
cali_icu.shape

In [None]:
# Calculate Percentage 
cali_icu["percent"]=(cali_icu["total_icu"]/cali_icu["INTENSIVE CARE"])*100

In [None]:
print(cali_icu.sort_values(by="percent", ascending=False)[["County Name","total_icu","INTENSIVE CARE","percent"]].reset_index().drop("index", axis=1).head(10))

In [None]:
# Imperial County is reduced on the map because it is an outlier. However, the axis is still correct. 
plt.close()
co= cali_icu.copy()
co["percent"]=co["percent"].replace(max(co["percent"]),37)
piv_cases=geo.GeoDataFrame(co)
ax=piv_cases.plot(column="percent",cmap='Purples',figsize=(10,10),legend=True,edgecolor="black",linewidth=0.4)
ax.set_title("Percent of ICU Beds Used by Covid-19")
ax.set_axis_off()

Looking at the map, Southern California ICUs appear to be filling fast due to COVID-19. This is probably related to the higher number of COVID-19 cases as a percentage of population. The county with the highest percentage is Imperial (approx. 64%), which is nearly 1.8 times that of the next-highest county, Kings County (approx. 37%). Following these counties are San Joaquin (approx. 31%), Orange (approx. 27%), and San Diego (approx. 26%). Los Angeles County is ranked 8th on the list with approx. 24%. Based on this data and the previous findings, Imperial County and Kings County are the counties that could be most at risk. While they have been able to keep the death:cases ratio low, the low ICU availability and the high percentage of the population with the virus may be a strong indication that these counties might experience a higher death/cases ratio in the coming weeks. Therefore, more resources should be provided to these two counties to help ensure that this does not happen.

## Key Findings 

1.	From April 15th to mid-May (approx. May 15th), the rate of overall increase in new cases is smaller than from mid-May to June 9th.
2.	Even though Imperial County has a very high percentage of cases, LA County's top-three presence in both cases and deaths makes it the hardest-hit county.
3.	While Imperial and Kings counties have been able to keep the death:cases ratio low, the low ICU availability and the high percentage of the population with the virus may be a strong indication that these counties might experience a higher death/cases ratio in the coming weeks.
4.	There is a strong relationship between a county's COVID-19 cases as a percentage of population and two housing metrics: the percentage of housing in structures with 10 or more units and the percentage of housing units with more people than rooms. This may be due in part to California’s social distancing policies.
