# Covid-19 Maps - USA

The intent of this notebook is to apply what I learned in the Geospatial Analysis course here on Kaggle by visualising the Covid-19 cases that have been recorded in the US. There are, of course, gaps in my general coding ability and in this analysis (e.g. different styles of maps could be used, or the dataset could be manipulated in other ways).

10 different maps will be produced using three different styles.

**Styles**
1. Heat map
1. Heat map with time
1. Choropleth map

**Heat Maps**
1. Recorded Cases
      * Through the use of a Geocoder and,
      * Through the use of an existing US county shapefile  
2. Recorded Deaths 
3. Recorded Cases with Time

**Choropleth Maps**
1. Recorded Cases
    * Without capping the continuous colour range
    * Through capping the continuous colour range
    * Using a discrete colour range
2. Recorded Deaths
3. Death Rate
4. Cases to Population Size

# **Step 1 - Import Libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

import geopandas as gpd
from shapely.geometry import LineString
from geopandas.tools import geocode
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, FastMarkerCluster
from folium import plugins
import math
import webbrowser
from IPython.display import HTML
import matplotlib.pyplot as plt
from pandasql import sqldf
import plotly.express as px

#Turn off the SettingWithCopyWarning
pd.options.mode.chained_assignment = None

geo_code = True

# **Step 2 - Inspect Covid-19 Dataset**
> Courtesy of [NYT's GitHub CSV](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset)

The dataset is updated frequently. For performance reasons I will restrict it to only about 2 months' worth of data.

In [None]:
#Have a look at initial dataset
US_covid_data_base_file = pd.read_csv("../input/us-counties-covid-19-dataset/us-counties.csv")
#Restrict dates to be before 30th April for performance reasons
US_covid_data_date_restricted = US_covid_data_base_file.loc[US_covid_data_base_file['date']<'2020-04-30']
print("Shape of initial dataset: " + str(US_covid_data_date_restricted.shape))
df1 = US_covid_data_date_restricted
df1

The [source](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset) of this data explains that the dataset tracks cumulative cases. If you didn't know this, however, it would be useful to first work out whether the dataset tracks incremental or cumulative cases.

Using King, Washington as an example, you can plot the 'cases' column over time to determine which it is. If the curve never decreases, it is probably safe to assume the dataset records the cumulative case count over time and not new cases per day.

In [None]:
#Use one location as a test example
US_covid_data_cumulative_or_new_test = df1.loc[(df1['county']=='King')& (df1['state']=='Washington')]
plt.figure()

#set x and y variables
x = US_covid_data_cumulative_or_new_test['date']
y = US_covid_data_cumulative_or_new_test['cases']

#setting x ticks to be the start and end date so that the x-axis isn't messy
x_ticks = [x.min(),x.max()]

#plot variables to inspect if 'cases' column is cumulative cases or new cases per day
plt.plot(x, y)
plt.ylabel('Cases')
plt.xlabel('Date')
plt.xticks(x_ticks,rotation=70)
plt.title('King, Washington Cases')
plt.show()

The above graph confirms the data is displaying cumulative cases. Since this is the case we will only be concerned about the max date of the dataset when creating static maps.
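
The same conclusion can be checked numerically rather than by eye. The sketch below uses a small made-up series in place of the King, Washington rows (the `toy` dataframe and its values are assumptions for illustration only) and tests whether the counts ever decrease.

```python
import pandas as pd

# Toy stand-in for one county's rows; the values are made up for illustration.
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"],
    "cases": [1, 3, 3, 7],
})

# A cumulative series never decreases from one day to the next
# (is_monotonic_increasing allows equal consecutive values).
is_cumulative = toy["cases"].is_monotonic_increasing

# Day-over-day differences recover the implied new cases per day;
# the first day has no predecessor, so it keeps its own value.
new_cases = toy["cases"].diff().fillna(toy["cases"]).astype(int)

print(is_cumulative)
print(new_cases.tolist())
```

On the real data a `False` result would not necessarily mean the column holds daily counts, since cumulative totals can be revised downwards; looking for negative values in `diff()` would show where that happens.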

As always we have to clean up the dataset before we can use it in the way we want to. 

Inspecting the 'county' column will be our starting point. As seen below there are two things to note:
1. Some values in the 'county' column are recorded as 'Unknown'. As it is impossible to pinpoint an exact location for these when mapping, they will be removed from the dataset later on. An alternative could be to inspect the 'state' column of these rows and add the 'Unknown' cases to another county from the same state.
1. Some county names appear more than once, e.g. Washington, which appears ~ 30 times. These are dealt with later by creating a new column that concatenates the county column with the state column to produce a unique identifier. This is important when Geocoding based on a string of text.

In [None]:
#Conditional on the rows where we have the max date, inspect the count of all the distinct county names
inspect_county_values = df1.loc[df1['date'] == df1['date'].max()]
inspect_county_values['county'].value_counts()

Moving on, we'll inspect the 'fips' column for null values and check which counties are associated with them. Most null fips values should belong to rows where county = 'Unknown', which is fine because we will drop these rows later using the dropna() function.

As we will see below, Kansas City and New York City are the only named counties that return null fips values. When mapping with the Geocoder we will want to fill in these null fips values so that the dropna() function only removes the rows where county = 'Unknown'.

Remembering these two counties will also be important as we try to manually account for them when using an external US counties shapefile to map the Covid-19 cases. 


In [None]:
#Inspect the initial dataset for counties that don't have fips
#This is important for later when mapping with a separate shapefile that contains a list of GEOIDs (fips)
#When inspecting, exclude county = 'Unknown' since you can't reconcile that to a county location
US_covid_nulls = df1.loc[(df1['county'] != 'Unknown')& (df1['date'] == df1['date'].max())]

US_covid_nulls = US_covid_nulls[US_covid_nulls.isnull().any(axis=1)].county.value_counts()

print("Counties which can be manually accounted for when mapping using shapefile: \n" + str(US_covid_nulls))


At this point we have:
1. Figured out that the dataset represents cumulative cases/deaths - the implication of this is if we want to map the most up to date information then we only need to take the rows where the max date is present
1. Found some counties are recorded as 'Unknown' - all 'Unknown' counties also have no value in the 'fips' column. Since we know we can't accurately map these cases, we will utilise the fact that their associated 'fips' value is null and use the dropna() function to get rid of these rows
1. Found some county names occur more than once - they will have different state names, so we will create a new column concatenating the county and state in order to create a unique identifier that will be used in the Geocoder
1. Found that Kansas City and New York City are counties in the dataset that do not have values in the 'fips' column
    * In the case of creating a map with the Geocoder we do not want to lose these rows when using the dropna() function so we will fill in the 'fips' value of both with an arbitrary number in order to keep them
    * In the case of building maps with an external US counties shapefile it will either
        1. locate their fips in the external shapefile or,
        1. use Google to get their latitude and longitude if the county can't be found in the shapefile
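
The fill-then-dropna() idea from point 4 can be sketched on a toy dataframe. The rows, the `SENTINEL_FIPS` name and the `keep` mask below are assumptions for illustration; the placeholder value 1 mirrors the arbitrary number used later in this notebook and does not correspond to a real county.

```python
import numpy as np
import pandas as pd

SENTINEL_FIPS = 1  # arbitrary placeholder, not a real county FIPS code

# Toy rows: one normal county, one 'Unknown', one named place without fips.
toy = pd.DataFrame({
    "county": ["King", "Unknown", "New York City"],
    "fips": [53033.0, np.nan, np.nan],
    "cases": [100, 5, 200],
})

# Fill the placeholder only for the named places we want to keep.
keep = toy["county"].isin(["New York City", "Kansas City"])
toy.loc[keep & toy["fips"].isna(), "fips"] = SENTINEL_FIPS

# dropna() now removes only the 'Unknown' row.
cleaned = toy.dropna()
print(cleaned["county"].tolist())
```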

In the code below we account for points 1 and 3

In [None]:
#Since cumulative cases, use just the max date so that you have the total cases to date
US_covid_data = df1.loc[df1.date == df1.date.max()]
print("Rows as of max date: \n\n" + str(US_covid_data['date'].value_counts()))

In [None]:
# Because the same county name may exist in more than one state, concat the county name with the state name so that it is unique
US_covid_data['concat'] = US_covid_data['county'] + ', ' + US_covid_data['state']
#Inspect to see if there are any duplicates - there shouldn't be any
US_covid_data['concat'].value_counts(ascending=False)
#ascending=False puts the largest count first, so if the first row equals 1 then every value in the concat column is unique

# **Step 3 - Building a Geocoder**
We will account for points 2 and 4 shortly, but first we will attempt to geocode every row based on the concatenation of the county and state names. We could account for points 2 and 4 first, but the Geocoder may not geocode every row, which would still leave rows with empty values in the latitude and longitude columns that get added.

This means we'd have to use dropna() again, since we can't create a map from rows where the information to be mapped is null.

So we will geocode first, then account for New York City and Kansas City, then drop rows with null values.

In [None]:
#Geocode the concat column
if geo_code:
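    def my_geocoder(row):
        #Return latitude, longitude and point geometry for a location string, or None if geocoding fails
        try:
            point = geocode(row, provider='nominatim').geometry.iloc[0]
            return pd.Series({'Latitude': point.y, 'Longitude': point.x, 'geometry': point})
        except Exception:
            return None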

    US_covid_data[['Latitude', 'Longitude', 'geometry']] = US_covid_data.apply(lambda x: my_geocoder(x['concat']), axis=1)
    
    US_covid_data.to_csv('US_covid_data_maxdate_geocoded.csv', index=False)
else:
    US_covid_data = pd.read_csv('US_covid_data_maxdate_geocoded.csv')

In [None]:
#Fill in null fips values with an arbitrary number
blank_fips_counties = ['New York City', 'Kansas City']

for i in blank_fips_counties:
    US_covid_data.loc[US_covid_data['county'] == i,'fips'] = 1


#Drop any rows where county = 'Unknown' through use of dropna(), since the fips value for every 'Unknown' county is null
#Note that any other locations that couldn't be geocoded will also be dropped
US_covid = US_covid_data.dropna()
print("Percentage of rows that could be geocoded:\n"+str((US_covid.shape[0]/US_covid_data.shape[0])*100))

The Geocoder has successfully geocoded most of the locations. However, we won't know if these coded locations are correct until we create the map. It's possible that some locations which we know should be in the US have been coded to locations outside of the US.
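
One way to spot such mismatches before drawing the map is a rough bounding-box test. The sketch below is only an approximation under stated assumptions: the box limits are approximate values for the contiguous US (Alaska, Hawaii and the territories would need their own boxes), and the toy coordinates are made up for illustration.

```python
import pandas as pd

# Rough bounding box for the contiguous US (assumed, approximate values).
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0

# Toy geocoder output; the last row mimics a match outside the US.
toy = pd.DataFrame({
    "concat": ["King, Washington", "Orange, Florida", "Lincoln, Maine"],
    "Latitude": [47.49, 28.51, 53.23],
    "Longitude": [-121.83, -81.32, -0.54],
})

# Flag rows whose coordinates fall outside the box.
inside = (toy["Latitude"].between(LAT_MIN, LAT_MAX)
          & toy["Longitude"].between(LON_MIN, LON_MAX))
suspect = toy.loc[~inside, "concat"].tolist()
print(suspect)
```

Rows listed in `suspect` could then be inspected by hand or dropped before mapping.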

For the final step before mapping we must understand how the mapping works. If we were to map the data as is, each row would be treated as a single case. In order to map based on the 'cases' column we will duplicate the rows based on the values in this column. For example, if King, Washington has 235 cases recorded to date, we will duplicate that row so that it appears 235 times.
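
To make the duplication concrete, the toy example below repeats each row by its 'cases' value. It also builds, as an alternative that this notebook does not use, a list of `[lat, lon, weight]` triples: folium's `HeatMap` accepts points in that form, which avoids creating one row per case. The `toy` dataframe and its values are made up for illustration, and the rendered heat map may look different from the row-duplication version.

```python
import pandas as pd

# Toy data: two locations with 3 and 1 recorded cases.
toy = pd.DataFrame({
    "county": ["A", "B"],
    "Latitude": [47.5, 40.7],
    "Longitude": [-121.8, -74.0],
    "cases": [3, 1],
})

# Approach used in this notebook: repeat each row 'cases' times.
repeated = toy.loc[toy.index.repeat(toy["cases"])]
print(repeated["county"].tolist())

# Alternative sketch: one [lat, lon, weight] triple per location,
# which could be passed as the data argument of folium's HeatMap.
weighted_points = toy[["Latitude", "Longitude", "cases"]].values.tolist()
print(weighted_points)
```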

In [None]:
#In order to map, each row needs to be replicated based on the value in the 'cases' column
#For example if row x has a cases count of 635 then row x needs to appear 635 times
US_covid = US_covid.loc[US_covid.index.repeat(US_covid['cases'])]
print("Shape of dataset to be mapped: " +str(US_covid.shape))

# **Heat Map using Geocoder - Cases**
The dataset is now ready to be mapped

As seen below, building the Geocoder wasn't as useful as hoped. When you zoom out, it's evident that not all locations were geocoded to the US. Because of this we will move on to mapping the Covid-19 cases with an external shapefile of US counties.

In [None]:
#Create base map
map1 = folium.Map(location=[40, -95], zoom_start=4)

#Marker Cluster
map1.add_child(FastMarkerCluster(US_covid[['Latitude', 'Longitude']].values.tolist()))
#Heat Map
HeatMap(data=US_covid[['Latitude', 'Longitude']], radius=16.5, blur=16.5).add_to(map1)

map1.save('plot_data.html')   
HTML('<iframe src=plot_data.html width=800 height=600></iframe>')
        

# **Step 4 - Geocoding with an external file**
As seen with the heat map above, some data has been mapped to locations outside of the US. A new dataset will now be introduced to account for this. This dataset contains point and shape coordinates for all US counties and was sourced from [the home of the U.S. Government’s open data](https://catalog.data.gov/dataset/tiger-line-shapefile-2017-nation-u-s-current-county-and-equivalent-national-shapefile).

As seen in the code below, only the .shp file will be used. However, in order for it to be read, the files that accompany this .shp file from the link above must be present in the same file directory.

Note that from the shapefile (GeoDataFrame) we will only take 1) the county name, 2) the GEOID (so we can join with the fips of the Covid-19 dataset), 3) the latitude and 4) the longitude columns.

In [None]:
#Read the file
us_counties_shapefile_base = gpd.read_file("../input/us-counties-geocoded/tl_2017_us_county.shp")
df2 = us_counties_shapefile_base
df2.head()

In [None]:
us_counties_dataframe = pd.DataFrame(df2[['NAME','GEOID', 'INTPTLAT', 'INTPTLON']])
us_counties_dataframe.to_csv('geocodes.csv', index = False)
us_counties_dataframe['GEOID'] = us_counties_dataframe['GEOID'].astype('float64')


As seen earlier, Kansas City and New York City were present in the Covid-19 dataset but did not have a fips value associated with them. We will therefore check the US counties GeoDataFrame to find their fips. In the shapefile, the equivalent column is the GEOID column.

Only New York City will be found in this GeoDataFrame. We will use Google later on to get the latitude and longitude coordinates of Kansas City.

In [None]:
#Check df2 for Kansas City and New York City to see if you can manually account for these two when merging df1 and df2
for i in ["Kansas", "New York"]:
    check = us_counties_dataframe[us_counties_dataframe['NAME'].str.contains(i)]
    print("Check for "+str(i)+"\n" +str(check)+"\n")

#Can manually account for New York City as there is only one row that returns from df2
#Kansas City, Missouri - will have to get the coordinates from google

At this stage the original Covid-19 dataset needs to be prepared so that it can be joined to the GeoDataFrame. As before, use the max() function on the 'date' column in order to take only what is necessary.

In [None]:
#Original Dataframe of Covid cases in USA
US_covid_data = df1
US_covid_data = US_covid_data.loc[US_covid_data.date == US_covid_data.date.max()]

From our search in the GeoDataFrame above we found a single matching row for New York (New York County), with a GEOID of 36061. We will use this as the GEOID for New York City and insert it into the 'fips' column of the Covid-19 dataframe.

In [None]:
print("NY City before update: \n" +str(US_covid_data.loc[US_covid_data['county'] == 'New York City'])+"\n\n\n")
US_covid_data.at[US_covid_data.loc[US_covid_data['county'] == 'New York City'].index[0],'fips'] = 36061.0
print("NY City after update: \n" +str(US_covid_data.loc[US_covid_data['county'] == 'New York City'])+"\n")


At this point we can join the county GeoDataFrame with the Covid-19 dataframe. We use a left join with the Covid-19 dataframe as the left table so that we don't lose the Kansas City row, which we will account for shortly. Any other rows that can't be joined to the GeoDataFrame will be dropped later, as the counties associated with these rows are 'Unknown'.
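
As a small illustration of what the left join keeps, the sketch below merges two toy tables and uses the `indicator=True` option of `merge` to list the rows that found no match. The table contents are made up for illustration, and the names `covid`, `counties`, `joined` and `unmatched` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy Covid-19 table (left) and toy county table (right).
covid = pd.DataFrame({
    "county": ["King", "Kansas City", "Unknown"],
    "fips": [53033.0, np.nan, np.nan],
})
counties = pd.DataFrame({
    "GEOID": [53033.0, 36061.0],
    "INTPTLAT": ["+47.4", "+40.7"],
    "INTPTLON": ["-121.8", "-73.9"],
})

# A left join keeps every row of the left table;
# indicator=True adds a '_merge' column saying where each row came from.
joined = covid.merge(counties, left_on="fips", right_on="GEOID",
                     how="left", indicator=True)

# Rows marked 'left_only' had no matching GEOID.
unmatched = joined.loc[joined["_merge"] == "left_only", "county"].tolist()
print(len(joined), unmatched)
```

The unmatched list shows exactly which rows still need manual coordinates or will be removed by dropna().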

In [None]:
#Merge df1 and df2 together 
concat_result = US_covid_data.merge(us_counties_dataframe[['GEOID','INTPTLAT', 'INTPTLON']],left_on = 'fips', right_on = 'GEOID', how = 'left')
print("Rows in left df: "+str(US_covid_data.shape[0]))
print("Rows in joint df: "+str(concat_result.shape[0]))
concat_result


As stated earlier, Kansas City can be accounted for by inserting the latitude and longitude values. We couldn't account for it earlier in the same way as New York City because there was no GEOID for Kansas City in the GeoDataFrame. According to Google, the latitude and longitude of Kansas City are 39.0997 and -94.5786.

Note also that we will fill in the 'fips' and 'GEOID' columns with arbitrary numbers so that this row isn't dropped when we use the dropna() function later on.
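
The manual update can also be written with a boolean mask and `.loc`, which avoids looking up the row index for each assignment. This is only a sketch on a toy dataframe; the King coordinates are illustrative values and `is_kc` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy joined table: Kansas City has no fips, GEOID or coordinates yet.
toy = pd.DataFrame({
    "county": ["King", "Kansas City"],
    "fips": [53033.0, np.nan],
    "GEOID": [53033.0, np.nan],
    "INTPTLAT": ["+47.4937000", np.nan],
    "INTPTLON": ["-121.8339000", np.nan],
})

# Select the Kansas City row once and reuse the mask.
is_kc = toy["county"] == "Kansas City"

# Coordinates in the same string format as the shapefile columns.
toy.loc[is_kc, "INTPTLAT"] = "+39.0997000"
toy.loc[is_kc, "INTPTLON"] = "-94.5786000"

# Arbitrary placeholder so the row survives dropna() later.
toy.loc[is_kc, ["fips", "GEOID"]] = 1

print(toy.dropna().shape[0])
```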


In [None]:
#Manually update Kansas City, Missouri with 39.0997°, -94.5786°
print("Kansas City before update: \n" +str(concat_result.loc[concat_result['county'] == 'Kansas City'])+"\n\n\n")
concat_result.at[concat_result.loc[concat_result['county'] == 'Kansas City'].index[0],'INTPTLAT'] = '+39.0997000'
concat_result.at[concat_result.loc[concat_result['county'] == 'Kansas City'].index[0],'INTPTLON'] = '-94.5786000'
#Also add in arbitrary numbers to fips and GEOID column so that this doesn't get dropped when you use dropna() later
concat_result.at[concat_result.loc[concat_result['county'] == 'Kansas City'].index[0],'fips'] = 1
concat_result.at[concat_result.loc[concat_result['county'] == 'Kansas City'].index[0],'GEOID'] = 1
print("Kansas City after update: \n" +str(concat_result.loc[concat_result['county'] == 'Kansas City'])+"\n")


With the code below we will clean up the joint dataframe so that we can successfully map the Covid-19 cases. This includes removing the '+' from the latitude column, removing rows where there are null values (this is where county = 'Unknown') and duplicating each row based on the value in the 'cases' column.
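
As a side note, the sign prefix does not have to be stripped by hand for the conversion to work: casting the string column to `float64` parses a leading '+' or '-' directly. The sketch below shows this on made-up values; the names `lat`, `lon`, `lat_num` and `lon_num` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy columns in the same string format as INTPTLAT / INTPTLON,
# with one missing value to show that NaN is preserved.
lat = pd.Series(["+47.4937000", "+39.0997000", np.nan])
lon = pd.Series(["-121.8339000", "-94.5786000", np.nan])

# The float conversion accepts the leading sign, so no slicing is needed.
lat_num = lat.astype("float64")
lon_num = lon.astype("float64")

print(lat_num.tolist())
print(lon_num.tolist())
```

Converting before calling dropna() also keeps missing latitudes as real NaN values rather than text.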

In [None]:
#Remove the leading '+' from the latitude column so that it can be mapped
#lstrip('+') only strips a plus sign, so any negative latitude keeps its '-' (str[1:] would cut it off)
concat_result['INTPTLAT'] = concat_result['INTPTLAT'].astype('str').str.lstrip('+')


#Remove rows with NaN's - this will be where county = Unknown
concat_result.dropna(how = 'any', inplace = True)

#Mapping the Covid Cases - required to duplicate rows based on value in 'cases' column
concat_result_cases = concat_result.loc[concat_result.index.repeat(concat_result['cases'])]
print("Shape of dataset to be mapped: " +str(concat_result_cases.shape))

#Convert latitude and longitude columns so that it's compatible with mapping
concat_result_cases['INTPTLAT'] = concat_result_cases['INTPTLAT'].astype('float64')
concat_result_cases['INTPTLON'] = concat_result_cases['INTPTLON'].astype('float64')


# **Heat Map - Cases**


In [None]:
#Create base map
map2 = folium.Map(location=[40, -95], zoom_start=4)

#Marker Cluster
map2.add_child(FastMarkerCluster(concat_result_cases[['INTPTLAT', 'INTPTLON']].values.tolist()))
#Heat Map
HeatMap(data=concat_result_cases[['INTPTLAT', 'INTPTLON']], radius=16.5, blur = 16.5).add_to(map2)

map2.save('plot_data2.html')   
HTML('<iframe src=plot_data2.html width=800 height=600></iframe>')
        

Using the joint dataframe we can quickly create a heat map reflecting the recorded deaths. We can start just after the point where rows containing nulls were dropped in order to map the cases. This time however, the rows are duplicated based on the 'deaths' column as opposed to the 'cases' column.

In [None]:
#Mapping the Covid Deaths
concat_result_deaths = concat_result.loc[concat_result.index.repeat(concat_result['deaths'])]
concat_result_deaths = concat_result_deaths.loc[concat_result_deaths['deaths']!=0]
print("Shape of dataset to be mapped: " +str(concat_result_deaths.shape))

concat_result_deaths['INTPTLAT'] = concat_result_deaths['INTPTLAT'].astype('float64')
concat_result_deaths['INTPTLON'] = concat_result_deaths['INTPTLON'].astype('float64')



# **Heat Map - Deaths**

In [None]:
#Create base map
map3 = folium.Map(location=[40, -95], zoom_start=4)

#Marker Cluster
map3.add_child(FastMarkerCluster(concat_result_deaths[['INTPTLAT', 'INTPTLON']].values.tolist()))
#Heat Map
HeatMap(data=concat_result_deaths[['INTPTLAT', 'INTPTLON']], radius=16.5, blur = 16.5).add_to(map3)

map3.save('plot_data3.html')   
HTML('<iframe src=plot_data3.html width=800 height=600></iframe>')
        

With part 1 now complete we will move on to [part 2](https://www.kaggle.com/blakkmagic/covid-maps-usa-part-2) of this analysis which will introduce a heat map with a time element.