[Link to Covid Maps - USA Part 1](https://www.kaggle.com/blakkmagic/covid-maps-usa-part-1)

[Link to Covid Maps - USA Part 2](https://www.kaggle.com/blakkmagic/covid-maps-usa-part-2)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

import geopandas as gpd
from shapely.geometry import LineString
from geopandas.tools import geocode
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, FastMarkerCluster
from folium import plugins
import math
import webbrowser
from IPython.display import HTML
import matplotlib.pyplot as plt
from pandasql import sqldf
import plotly.express as px

# Turn off SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# **Step 5 - Choropleth Maps**

Creating some Choropleth maps may be useful as well to compare counties. [Choropleth Maps](https://datavizcatalogue.com/methods/choropleth.html) display divided geographical areas or regions that are coloured, shaded or patterned in relation to a data variable. The data variable uses colour progression to represent itself in each region of the map. Typically, this can be a blending from one colour to another, a single hue progression, transparent to opaque, light to dark or an entire colour spectrum.

For these maps the latitude and longitude of the counties in the Covid-19 dataset aren't important, because the only column required to map the cases is the 'fips' column. As a result, we can account for NY City by manually inputting its fips value. We could search the web for Kansas City's fips but I will ignore it.

Also unlike the heat maps, we do not need to restrict on the date when mapping since we won't be running into performance issues.
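
Because 'fips' is the only key used for mapping, it has to stay a five-character string. Read as a number, a code such as Autauga County's 01001 loses its leading zero and would no longer match the county boundaries file. A minimal sketch of the problem and a possible repair, using a made-up two-row CSV (the inline data is illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Illustrative inline CSV; the notebook itself reads us-counties.csv instead.
csv_text = "county,fips,cases\nAutauga,01001,12\nNew York,36061,500\n"

as_number = pd.read_csv(io.StringIO(csv_text))
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"fips": str})

print(as_number["fips"].tolist())  # parsed as integers, leading zero lost
print(as_string["fips"].tolist())  # kept as text

# If the codes were already parsed as numbers, zero-padding restores them.
repaired = as_number["fips"].astype(str).str.zfill(5)
print(repaired.tolist())
```

This is why the read_csv call in the next cell passes dtype={"fips": str}.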

In [None]:
density =  pd.read_csv("../input/us-counties-covid-19-dataset/us-counties.csv", dtype={"fips": str})
density = density.loc[density.date == density.date.max()]
print("NY City before update: \n" +str(density.loc[density['county'] == 'New York City'])+"\n")
# fips is read as a string column, so assign the code as a string too
density.loc[density['county'] == 'New York City','fips'] = '36061'
print("NY City after update: \n" +str(density.loc[density['county'] == 'New York City'])+"\n")
#drop the rows with county = 'unknown' - these rows have no fips values
density.dropna(how = 'any', inplace = True)

With choropleth maps, colour can only be assigned to a county if you have data on that county. Otherwise it will show as blank on the map. This is problematic because of the way the original Covid-19 dataset is set up: if a county has no recorded cases then it does not appear in the dataset at all. So we have to account for this in order to create a more holistic choropleth map.

To do this I compared the unique counties in the Covid-19 dataset with the list of counties in the GeoDataFrame that we have been using up to this point. I used Excel to do this. Because the two files use different naming conventions (e.g. one of the files uses "borough" in the county's name whereas the other doesn't), it was hard to completely account for every missing county. Because of this there will be gaps in the maps produced below.
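
The same comparison could probably be done in pandas instead of Excel by matching on FIPS codes rather than county names, which would sidestep the naming mismatch. This is only a sketch: `geo_counties` is a hypothetical stand-in for the GeoDataFrame's county list, and it assumes both sides carry a FIPS code.

```python
import pandas as pd

# Hypothetical stand-ins: counties known to the map vs. counties with data.
geo_counties = pd.DataFrame({
    "fips": ["01001", "01003", "02013", "36061"],
    "county": ["Autauga", "Baldwin", "Aleutians East Borough", "New York"],
})
covid = pd.DataFrame({
    "fips": ["01001", "36061"],
    "county": ["Autauga", "New York City"],
    "cases": [12, 500],
    "deaths": [1, 40],
})

# Anti-join on the FIPS code: keep map counties that have no Covid-19 row.
merged = geo_counties.merge(covid[["fips"]], on="fips", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", ["fips", "county"]].copy()

# Counties absent from the dataset are assumed to have no recorded cases.
missing["cases"] = 0
missing["deaths"] = 0
print(missing)
```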


I then added any counties that were 1) in the GeoDataFrame and 2) not in the Covid-19 dataset to a new dataset where I manually created the columns 'county', 'fips', 'cases' and 'deaths'. The columns 'cases' and 'deaths' were set to 0 for all counties in this new dataframe. As I will be trying to map the death rate (deaths/cases), I also created a new column called 'death_rate', this time in the notebook rather than in Excel. This column will of course be filled with NaN since 0 divided by 0 is undefined. I therefore use fillna to replace these with 0.

This new dataset is then read and we take what we need depending on what we want to map.
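
To make the 0/0 point concrete, here is a small sketch on made-up counties (the frame `toy` is purely illustrative). pandas returns NaN for 0/0 rather than raising an error, which is why fillna is enough here.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["A", "B", "C"],   # made-up counties
    "cases": [0, 50, 200],
    "deaths": [0, 1, 10],
})

# 0 / 0 gives NaN (no exception), so those rows are filled with 0 afterwards.
toy["death_rate"] = (toy["deaths"] / toy["cases"] * 100).fillna(0)
print(toy)
```

Note that a row with deaths > 0 but cases == 0 would give inf rather than NaN, and fillna would not touch it.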

**The issue with this method is that you have to periodically check that your list of counties with 0 cases is up to date. If one of these counties records a case in the future and you keep your list as is, then concatenating the US Covid-19 dataset with this dataset of counties with 0 cases will give you two rows for that county (one with >0 cases from the US Covid-19 dataset and one with 0 cases from the manually created dataset).**

**FYI there will still be gaps which will be represented by the colour white in the maps produced below**
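
One possible guard against the duplicate-row problem is to drop duplicate FIPS codes after concatenating, keeping the recorded row. This is a sketch on made-up rows and is not what the notebook does below; the frame names are hypothetical.

```python
import pandas as pd

# Made-up rows: county 01003 has since recorded cases but is still in the zero list.
recorded = pd.DataFrame({
    "fips": ["01001", "01003"],
    "county": ["Autauga", "Baldwin"],
    "cases": [12, 3],
})
zero_list = pd.DataFrame({
    "fips": ["01003", "02013"],
    "county": ["Baldwin", "Aleutians East Borough"],
    "cases": [0, 0],
})

# Recorded rows come first, so keep="first" prefers them over the zero placeholders.
combined = (
    pd.concat([recorded, zero_list], ignore_index=True)
      .drop_duplicates(subset="fips", keep="first")
      .reset_index(drop=True)
)
print(combined)
```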

In [None]:
unrecorded =  pd.read_csv('../input/us-counties-without-recorded-casesdeaths/unrecorded_counties.csv')

unrecorded_cases = unrecorded[['fips','county','cases']]
unrecorded_deaths = unrecorded[['fips','county','deaths']]

# Copy the slice so the new column is added to an independent frame
unrecorded_deathrate = unrecorded[['fips','county','cases','deaths']].copy()
unrecorded_deathrate['death_rate'] = (unrecorded_deathrate['deaths']/unrecorded_deathrate['cases']).fillna(0)
unrecorded_deathrate = unrecorded_deathrate[['fips','county','death_rate']]

As already said, the counties listed in the step above were not present in the initial US Covid-19 dataset. So we will concatenate this list of counties with the US Covid-19 dataset so that it contains as many counties as possible, whether they have cases/deaths recorded or not.

In [None]:
# Take what you need from initial US Covid-19 dataset 
density1 = density[['fips','county','cases']]
# Store what you need from initial US Covid-19 dataset and dataset of counties with no recorded cases as a frame
frames1 = [density1,unrecorded_cases]
# Concat the two of them together
concat_density1 = pd.concat(frames1)

#Same method as above but for deaths
density2 = density[['fips','county','deaths']]
frames2 = [density2,unrecorded_deaths]
concat_density2 = pd.concat(frames2)


#Same method as above but for death rate
density3 = density[['fips','county','cases','deaths']].copy()
density3['death_rate'] = (density3['deaths']/density3['cases'])*100
density3 = density3[['fips','county','death_rate']]
frames3 = [density3,unrecorded_deathrate]
concat_density3 = pd.concat(frames3)

Now we load a GeoJSON file containing the geometry information for US counties, where feature.id is a FIPS code. This FIPS code is unique to each entry in the file, so it is what we must use to link to our three concat_density dataframes above. This is why the latitude and longitude columns are no longer important when mapping with this GeoJSON file.
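
To illustrate the link, the sketch below uses a hand-made GeoJSON with the same overall shape (a list of features, each with an id) and checks which FIPS codes match. The geometry, the properties and the code 99999 are made up for illustration.

```python
import pandas as pd

# Hand-made GeoJSON shaped like the counties file (geometry heavily simplified).
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "01001", "properties": {"NAME": "Autauga"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
        {"type": "Feature", "id": "01003", "properties": {"NAME": "Baldwin"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 0]]]}},
    ],
}
toy_cases = pd.DataFrame({"fips": ["01001", "99999"], "cases": [12, 7]})

feature_ids = {feature["id"] for feature in toy_geojson["features"]}

# Rows whose FIPS code has no matching feature.id cannot be drawn.
unmatched_rows = toy_cases.loc[~toy_cases["fips"].isin(feature_ids), "fips"].tolist()
# Features with no matching row are left blank on the map.
blank_features = sorted(feature_ids - set(toy_cases["fips"]))
print(unmatched_rows, blank_features)
```

Running the same two checks against the real counties object and a concat_density dataframe should give a quick count of the gaps to expect on the map.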

In [None]:
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
# Below will show that we should use feature.id as the way to link this dataset with our concat_density datasets
counties["features"][0]

# **Choropleth Map - Cases**
As we will see below with the first Choropleth Map, the main issue with this dataset is the distribution of the cases. Refer to describe() below, which highlights how skewed the data is. This matters because of the effect it has on the colouration of the choropleth map when using a continuous colour scale. Without capping the colour range, pretty much the whole map will be the same colour since there is significant right skew.

Because of this we will map the case count in three ways:
1. Without capping the colour range - the produced map will pretty much just be one colour
2. With capping the colour range - this will start to produce more colour. By capping the colour range at 1000 the map will produce a nice range of colour for counties with case counts between 0 and 1000. The issue with capping is that we have no way of differentiating counties with cases >= 1000; they will all show up in the same colour
3. Using a discrete colour range - we will assign each county a value based on the quartile it lies in. As a result the final map will only have 4 colours, but it will at least be useful for displaying which quartile each county lies in
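
The effect of capping can be sketched with plain numpy on made-up counts. The snippet works out where each county would sit on a 0 to 1 colour scale with and without a cap, which roughly mirrors what range_color does to the colour mapping. The numbers are illustrative only.

```python
import numpy as np

cases = np.array([0, 30, 120, 400, 2500, 900000])  # made-up, heavily right-skewed

# Position on a 0-1 colour scale when the scale runs up to the maximum value.
uncapped = cases / cases.max()

# Position when the scale is capped: everything above the cap gets the top colour.
cap = 1000
capped = np.clip(cases, 0, cap) / cap

print(np.round(uncapped, 4))
print(capped)
```

Without the cap every county except the largest sits in the bottom 1% of the scale, which is why the first map looks like a single colour.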

In [None]:
concat_density1['cases'].describe()

In [None]:
fig1 = px.choropleth_mapbox(concat_density1, geojson=counties, locations='fips', color='cases',
                           color_continuous_scale="Viridis",
                            mapbox_style="carto-positron",
                           hover_name='county',
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'cases':'cases'}
                          )
fig1.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig1.show()

Clearly the map above is no good at visualising differences in case numbers across US counties. Map 2 will now be produced, where we put a cap on the colour range in order to produce a bit more colour. As stated earlier, the issue with this map is that there is no visual difference between counties where case numbers exceed 1000.

In [None]:
fig2 = px.choropleth_mapbox(concat_density1, geojson=counties, locations='fips', color='cases',
                           color_continuous_scale="Viridis",
                           mapbox_style="carto-positron",
                           range_color=(0,1000),
                           hover_name='county',
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'cases':'cases'}
                          )
fig2.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig2.show()

Map 3 will now use a slightly different method. We know that with quartiles a roughly equal number of counties lies within each quartile. So if we use a discrete colouring system with just 4 colours then we should get a nice even display of colouration in the map. Whilst there is no way to visually tell apart counties that lie within the same quartile, this method at least helps identify where in the distribution a county sits.
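
As a compact illustration of quartile binning, the sketch below uses np.digitize on made-up counts. It is an alternative way of writing the same idea and not the code used in the next cell; the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"cases": [0, 1, 2, 5, 9, 20, 60, 400]})  # made-up counts

# Inner cut points: the 25th, 50th and 75th percentiles.
cut_points = np.quantile(toy["cases"], [0.25, 0.5, 0.75])

# np.digitize returns 0 below the first cut point and 3 at or above the last,
# so adding 1 gives quartile labels 1-4; stored as strings for discrete colours.
toy["quartile"] = (np.digitize(toy["cases"], cut_points) + 1).astype(str)
print(toy)
```

With many tied values (for example a large number of counties with 0 cases) the four groups will not be the same size, so the colours may not be spread as evenly as the toy data suggests.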

In [None]:
#Additional step required to assign each county a value based on the quartile that it is in
quartiles = (concat_density1['cases'].min(),
                 np.quantile(concat_density1['cases'], 0.25),
                 np.quantile(concat_density1['cases'], 0.5),
                 np.quantile(concat_density1['cases'], 0.75),
                 concat_density1['cases'].max())

def quantile_value(val):
    if quartiles[0] <= val < quartiles[1]:
        return '1'
    if quartiles[1] <= val < quartiles[2]:
        return '2'
    if quartiles[2] <= val < quartiles[3]:
        return '3'
    else:
        return '4'
    
concat_density1['quartile'] = concat_density1.apply(lambda x: quantile_value(x['cases']), axis=1)

#For whatever reason the choropleth_mapbox will assign colour values based on the order that the quartiles appear starting from row 1
#Without sorting, the map will assign colours based on the order 3,4,2,1
#By sorting from highest to lowest quartile values the map will now assign colours based on the order 4,3,2,1
concat_density1 = concat_density1.sort_values(by=['quartile'],ascending = False )
concat_density1.head()

In [None]:
#Only need 4 colours so print one of the plotly colour schemes to get the exact colour codes
print("Viridis colour codes")
print(px.colors.sequential.Viridis)

In [None]:
colours = ['#440154', '#31688e','#35b779','#fde725']

fig4 = px.choropleth_mapbox(concat_density1, geojson=counties, locations='fips', color='quartile',
                           mapbox_style="carto-positron",
                           hover_name='county',
                           color_discrete_sequence= colours,
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'cases':'cases'}
                          )
fig4.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig4.show()

# **Choropleth Map - Deaths**
Again the main issue with the death data is the distribution. Refer to describe() below, which highlights how skewed the data is. The plan is to use method three from the Choropleth - Cases maps to map the deaths.

In [None]:
concat_density2['deaths'].describe()

Notice from the descriptive stats above that method three actually won't yield good results visually. At least 50% of the counties apparently have not recorded a death (the way I have compiled the data means this could be wrong), which means two quartiles are going to be the same colour. So we will use method two instead. At least 75% of the counties have recorded at most 2 deaths. I will run a couple more tests to identify a suitable colour range cap in order to map with a bit more visual appeal.

In [None]:
# Avoid naming the variable "list", which would shadow the Python built-in
percentiles = np.arange(0.8, 1.0, 0.02)

for p in percentiles:
    print("Percentile = " + str(p.round(2)) + ", Deaths = " + str(np.quantile(concat_density2['deaths'], p).round(0)))

Note that there is a large jump between the 94th and 96th percentile, so we will use 141 as the cap. Any county where deaths are >= 141 will show up in the same colour. This isn't ideal since the difference between the cap and the maximum number of recorded deaths for a county is so large, but at least we get some visual differentiation for the data points below the cap.
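
The jump could also be located programmatically instead of by eye. The sketch below does this on made-up death counts: it takes the largest gap between consecutive percentiles and uses the value just before it as a candidate cap. The data and variable names are illustrative assumptions.

```python
import numpy as np

# Made-up death counts: 0-95 for most counties, then a handful of large outliers.
deaths = np.concatenate([np.arange(96), [1000, 1100, 1200, 1300, 1400]])

percentiles = np.linspace(0.80, 0.98, 10)   # 80th to 98th in steps of 2
values = np.quantile(deaths, percentiles)

# The largest gap between consecutive percentiles marks where the tail takes off;
# the value just before that gap is a candidate cap for range_color.
jump_index = int(np.argmax(np.diff(values)))
cap = values[jump_index]
print(round(percentiles[jump_index], 2), round(cap))
```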

In [None]:
fig5 = px.choropleth_mapbox(concat_density2, geojson=counties, locations='fips', color='deaths',
                           color_continuous_scale="Bluyl",
                           range_color=(0,127),
                            mapbox_style="carto-positron",
                           hover_name='county',
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'deaths':'deaths'}
                          )
fig5.update_layout(margin={"r":0,"t":0,"l":0,"b":0},title_text ='US Covid-19 Deaths')
fig5.show()

# **Choropleth Map - Death Rate**
The main issue with this map is again the distribution of the death rates. Refer to describe() below, which highlights how skewed the data is.
Because at least 50% of the data has a death rate of 0% (the way I have compiled the dataset means my numbers may not be accurate), we will use a colour range cap so that we get some variation. Note that 75% of the data lies between 0 and 4.37%, so we will cap the range at 5%, which hopefully will result in some colour variation. Any county where the death rate is >= 5% will show up as the same colour.

In [None]:
concat_density3['death_rate'].describe()

In [None]:
fig6 = px.choropleth_mapbox(concat_density3, geojson=counties, locations='fips', color='death_rate',
                           color_continuous_scale="Cividis_r",
                           range_color=(0,5),
                            mapbox_style="carto-positron",
                           hover_name='county',
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'death_rate':'death rate (%)'}
                          )
fig6.update_layout(margin={"r":0,"t":0,"l":0,"b":0},title_text ='US Covid-19 Deathrate')
fig6.show()

# **Choropleth Map - Cases to Population Size**

We could attempt to normalise the case numbers by dividing each county's case count by the county's population size. Trying to map each county in the initial Covid-19 dataset to another dataset that contains population size has been tedious, and there are some gaps which I have not accounted for and will not bother accounting for.
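
For reference, the normalisation itself could be sketched as a merge on FIPS followed by a division. The population figures below are made up and the `population` frame is a hypothetical stand-in for whatever population dataset is used.

```python
import pandas as pd

# Made-up case counts and a hypothetical population table keyed on FIPS.
cases = pd.DataFrame({"fips": ["01001", "01003", "36061"], "cases": [12, 3, 500]})
population = pd.DataFrame({"fips": ["01001", "36061"], "population": [60000, 1600000]})

# A left join keeps every county; those without a population figure get NaN.
rates = cases.merge(population, on="fips", how="left")
rates["case rate"] = rates["cases"] / rates["population"]

print(rates)
print("counties without a population match:", rates["population"].isna().sum())
```

Counting the NaN rows after the merge is a quick way to see how many gaps are left.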

In [None]:
cases_to_population =  pd.read_csv("../input/us-county-covid-casespopulation/casespopulation.csv", dtype={"fips": str})
cases_to_population.at[cases_to_population.loc[cases_to_population['county'] == 'New York City'].index[0],'fips'] = '36061'
cases_to_population.dropna(how = 'any', inplace = True)
# A bare expression in the middle of a cell is not displayed, so print it
print(cases_to_population.shape)
cases_to_population = cases_to_population[['fips','county','case rate']]


cases_to_population['case rate'].describe()

In [None]:
fig7 = px.choropleth_mapbox(cases_to_population, geojson=counties, locations='fips', color='case rate',
                           color_continuous_scale="Tealgrn",
                           range_color=(0,0.11),
                            mapbox_style="carto-positron",
                           hover_name='county',
                           zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
                           labels={'case rate':'case rate'}
                          )
fig7.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig7.show()

That's all for now folks. As stated at the start of this analysis there are probably a number of ways to improve this from an analytical and coding perspective. 