# Choropleth Maps and Time Series

Hello and welcome to my COVID-19 analysis notebook. It introduces choropleth map visualisation and then uses other graph types to display time series data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import geopandas as gpd
import plotly.express as px
import plotly.graph_objs as go
from collections import Counter

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

A **choropleth map** is a map that uses different shading or colours to represent different values in various locations. In this notebook, we will use it to examine the confirmed cases, deaths and recoveries of COVID-19 around the world.

**Time series** data is data that is listed in time order. Here, each series is a sequence of dates with the cumulative number of coronavirus cases recorded on each one.

Firstly, we load the file that holds the global confirmed cases. We then replace the country name 'US' with 'United States of America' so that it matches the country names in the map data we will use later.

In [None]:
path = '../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv'
df = pd.read_csv(path)
df['Country/Region'] = df['Country/Region'].replace('US', 'United States of America')

Then, we use geopandas' "naturalearth_lowres" dataset to get the data we need to chart the different countries, storing it in a variable "world". (The `gpd.datasets` module used here was deprecated in GeoPandas 0.14 and removed in 1.0, so this cell needs an older release.) "world" contains two columns that will be important to us:
* name: the full names of the different countries
* iso_a3: the three-letter ISO 3166-1 alpha-3 country codes, which our choropleth function needs as input

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world[['name', 'iso_a3']]

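We set the index of world to be the "name" column, and then reindex the dataframe with the "Country/Region" column from df. This gives one row of world per row of df, in the same order; countries whose names have no match in world become rows of missing values.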

In [None]:
world.index = world['name']
world = world.reindex(df['Country/Region'])
world
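To see what the reindexing does, here is a small sketch on made-up tables rather than the real files (the names `lookup` and `cases` and their values are illustrative assumptions). Labels that exist in the lookup keep their row, repeated labels are repeated, and labels with no match produce a row of missing values.

```python
import pandas as pd

# Toy stand-in for the Natural Earth table, indexed by country name
lookup = pd.DataFrame(
    {'iso_a3': ['ITA', 'DEU', 'USA']},
    index=['Italy', 'Germany', 'United States of America'],
)

# Toy stand-in for df['Country/Region']; 'Atlantis' has no match in the lookup
cases = pd.Series(['Italy', 'Atlantis', 'Germany', 'Italy'], name='Country/Region')

# reindex returns one row per requested label, in the requested order
aligned = lookup.reindex(cases)
print(aligned)

# Names that found no match are worth checking for spelling differences
unmatched = aligned[aligned['iso_a3'].isna()].index.tolist()
print(unmatched)
```

Listing the unmatched names like this is one way to spot further cases such as 'US' that need renaming before the map is drawn.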

Next, we fill the missing values in world's "iso_a3" column with the placeholder string 'NaN' and copy the column into df. This step matters because each row of df now carries the ISO code that our choropleth function uses to locate the country.

In [None]:
world['iso_a3'] = world['iso_a3'].fillna('NaN')
df['iso_a3'] = world['iso_a3'].reset_index(drop=True)
df[['iso_a3']]
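A possible alternative, sketched here on made-up data and not the method used in this notebook, is to look each country name up directly with `Series.map`, which does not depend on the two tables sharing the same row order. The names `iso_lookup` and `df_toy` are assumptions for illustration.

```python
import pandas as pd

# Toy lookup from country name to ISO code
iso_lookup = pd.Series({'Italy': 'ITA', 'Germany': 'DEU'})

# Toy stand-in for the cases table; 'Atlantis' is deliberately unmatched
df_toy = pd.DataFrame({'Country/Region': ['Italy', 'Atlantis', 'Germany']})

# map() looks up each name; unmatched names become NaN, then the 'NaN' placeholder
df_toy['iso_a3'] = df_toy['Country/Region'].map(iso_lookup).fillna('NaN')
print(df_toy)
```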

Here, we store df's date columns in "date_cols" by checking whether each column name ends with 20 or 21 (the two-digit year). Afterwards, we create a one-dimensional array called "countries" which repeats each iso_a3 value once for every date in date_cols.

In [None]:
date_cols = [col for col in df.columns if col.endswith(('20', '21'))]
countries = np.repeat(df['iso_a3'].to_numpy(), len(date_cols))
pd.Series(date_cols)
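The suffix check works because this dataset labels its dates like `1/22/20`. A slightly stricter sketch, assuming that month/day/two-digit-year format, tries to parse each column name as a date and keeps the ones that succeed. The helper name `is_date_label` and the sample column names are illustrative assumptions.

```python
from datetime import datetime

def is_date_label(name, fmt='%m/%d/%y'):
    """Return True if a column name parses as a date in the assumed format."""
    try:
        datetime.strptime(name, fmt)
        return True
    except ValueError:
        return False

# Made-up sample of column names in the style of the dataset
columns = ['Province/State', 'Country/Region', 'Lat', 'Long',
           '1/22/20', '12/31/20', '1/1/21', 'iso_a3']

date_cols_toy = [col for col in columns if is_date_label(col)]
print(date_cols_toy)
```

Unlike the suffix check, this would not pick up a non-date column whose name happened to end in 20 or 21.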

In [None]:
pd.Series(countries)
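To make the repetition pattern concrete, here is a sketch on two made-up codes and three made-up dates. `np.repeat` repeats each element in turn, while `np.tile` repeats the whole sequence, so pairing the two yields every (country, date) combination exactly once.

```python
import numpy as np

iso_codes = ['ITA', 'DEU']                        # toy country codes
date_labels = ['1/22/20', '1/23/20', '1/24/20']   # toy date columns

# Each code repeated once per date: ITA, ITA, ITA, DEU, DEU, DEU
countries_toy = np.repeat(iso_codes, len(date_labels))

# The whole date list repeated once per country
dates_toy = np.tile(date_labels, len(iso_codes))

for country, date in zip(countries_toy, dates_toy):
    print(country, date)
```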

We build "dates" in a similar way, except that we repeat the whole date_cols list once for every row of df. We then flatten the confirmed counts row by row into a variable "values", so that each count lines up with its country and date. Finally, we combine "countries", "dates" and "values" into one dataframe: "data".

In [None]:
dates = date_cols*len(df['Country/Region'])
values = df[date_cols].to_numpy().flatten()  # row by row, matching the order of countries and dates
data = pd.DataFrame({'country':countries, 'date':dates, 'confirmed':values})
data
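The same wide-to-long reshaping can also be sketched with pandas' `melt`, shown here on a made-up two-country table. This is an alternative for comparison, not the notebook's method; note that `melt` orders the rows date by date rather than country by country, although each row still holds the same (country, date, value) triple.

```python
import pandas as pd

# Toy wide table: one row per country, one column per date
wide = pd.DataFrame({
    'iso_a3': ['ITA', 'DEU'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# One row per (country, date) pair
long = (
    wide.melt(id_vars='iso_a3', var_name='date', value_name='confirmed')
        .rename(columns={'iso_a3': 'country'})
)
print(long)
```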

Finally, we can display data using plotly express' choropleth function, giving us the ability to not only visualise the countries' cases but to also see how they all change over time, using an animation frame.

In [None]:
fig = px.choropleth(data, locations='country', color='confirmed', animation_frame='date')
fig.show()
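When animating, it can help to fix the colour range explicitly so that every frame is drawn on the same scale. The sketch below uses made-up numbers; `build_map` is a hypothetical helper name, and using the overall maximum as the upper bound is an assumption that may need tuning. It also drops the rows whose code is the 'NaN' placeholder, since they cannot be located on the map.

```python
import pandas as pd

# Toy long-format data in the same shape as "data"
data_toy = pd.DataFrame({
    'country': ['ITA', 'ITA', 'DEU', 'DEU', 'NaN', 'NaN'],
    'date': ['1/22/20', '1/23/20'] * 3,
    'confirmed': [0, 5, 1, 9, 2, 3],
})

# Keep only rows with a usable ISO code
plottable = data_toy[data_toy['country'] != 'NaN']

# One shared colour range for all frames: zero up to the largest value
color_range = (0, int(plottable['confirmed'].max()))

def build_map(frame, color_range):
    # Imported here so the data preparation above runs without plotly installed
    import plotly.express as px
    return px.choropleth(frame, locations='country', color='confirmed',
                         animation_frame='date', range_color=color_range)
```

Calling `build_map(plottable, color_range).show()` would then display the animated map.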

To make this easier to reuse, we can wrap all of these steps in one function. The cell after it defines similar helper functions for the line graphs, US maps and bar charts used in the rest of the notebook.

In [None]:
def world_map(suffix):
    path = '../input/novel-corona-virus-2019-dataset/time_series_covid_19_'+suffix+'.csv'
    df = pd.read_csv(path)
    df['Country/Region'] = df['Country/Region'].replace('US', 'United States of America')

    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
    world.index = world['name']
    world = world.reindex(df['Country/Region'])
    world['iso_a3'] = world['iso_a3'].fillna('NaN')
    df['iso_a3'] = world['iso_a3'].reset_index(drop=True)
    
    date_cols = [col for col in df.columns if col.endswith(('20', '21'))]
    countries = np.repeat(df['iso_a3'].to_numpy(), len(date_cols))
    dates = date_cols*len(df['Country/Region'])
    values = df[date_cols].to_numpy().flatten()
    data = pd.DataFrame({'country':countries, 'date':dates, suffix:values})

    fig = px.choropleth(data, locations='country', color=suffix, animation_frame='date')
    fig.show()

In [None]:
def linear(suffix):
    path = '../input/novel-corona-virus-2019-dataset/time_series_covid_19_'+suffix+'.csv'
    df = pd.read_csv(path)
    df['Country/Region'] = df['Country/Region'].replace('US', 'United States of America')
    
    date_cols = [col for col in df.columns if col.endswith(('20', '21'))]
    country_list = ['United States of America', 'Brazil', 'India', 'Italy', 'Germany']

    fig = go.Figure()
    for country in country_list:
        # Use the first matching row for the country as its time series
        series = df.loc[df['Country/Region'] == country, date_cols].iloc[0]
        # Label each trace by its country directly, not by searching for its values
        fig.add_trace(go.Scatter(x=date_cols, y=series, name=country))
    fig.show()

def US_map(suffix):
    df = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_'+suffix+'_US.csv')
    states = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv')[['code', 'state']]
    date_cols = [col for col in df.columns if col.endswith(('20', '21'))]

    # Sum the county rows of each state, keeping only states that have a code in `states`
    totals = df.groupby('Province_State')[date_cols].sum()
    totals = totals[totals.index.isin(states['state'])]
    # Look up each state's two-letter code by name so codes and totals stay aligned
    codes = states.set_index('state')['code'].reindex(totals.index)

    # Long format: one row per (state, date) pair
    data = pd.DataFrame({
        'date': date_cols*len(totals),
        'state': np.repeat(codes.to_numpy(), len(date_cols)),
        'cases': totals.to_numpy().flatten(),
    })
    fig = px.choropleth(data, locations='state', color='cases', animation_frame='date', locationmode='USA-states', scope='usa')
    fig.show()

def state_bar(suffix, state):
    df = pd.read_csv('../input/novel-corona-virus-2019-dataset/time_series_covid_19_'+suffix+'_US.csv')
    date_cols = [col for col in df.columns if col.endswith(('20', '21'))]
    # The counts are cumulative, so the latest date column is already the running total
    df['total'] = df[date_cols[-1]]
    counties = df[df['Province_State']==state]
    counties = counties.sort_values(by='total', ascending=False)[:100]
    fig = px.bar(counties, x='Admin2', y='total', color='total')
    fig.show()

def state_linear(suffix):
    path = '../input/novel-corona-virus-2019-dataset/time_series_covid_19_'+suffix+'_US.csv'
    df = pd.read_csv(path)

    date_cols = [col for col in df.columns if col.endswith(('20', '21'))]

    fig = go.Figure()
    # groupby yields the states in sorted order together with their county rows
    for state, rows in df.groupby('Province_State'):
        fig.add_trace(go.Scatter(x=date_cols, y=rows[date_cols].sum(), name=state))
    fig.show()

def region_linear(case, country_name):
    df = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')
    # Rows without a province are grouped under the placeholder 'NaN'
    df['Province/State'] = df['Province/State'].fillna('NaN')
    country = df[df['Country/Region'] == country_name]

    fig = go.Figure()
    for region, rows in country.groupby('Province/State'):
        # Plot each region against its own observation dates so x and y stay aligned
        dates = pd.to_datetime(rows['ObservationDate'])
        fig.add_trace(go.Scatter(x=dates, y=rows[case], name=region))
    fig.update_layout(title=country_name)
    fig.show()
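Several of the US helpers above boil down to adding up the county rows of each state for every date. As a minimal sketch of that aggregation on made-up numbers (the table `us_toy` and its values are assumptions), `groupby` followed by `sum` does this in one step:

```python
import pandas as pd

# Toy stand-in for the US file: one row per county, one column per date
us_toy = pd.DataFrame({
    'Admin2': ['Los Angeles', 'Orange', 'Harris'],
    'Province_State': ['California', 'California', 'Texas'],
    '1/22/20': [1, 0, 2],
    '1/23/20': [3, 1, 4],
})
date_cols_toy = ['1/22/20', '1/23/20']

# One row per state, holding the sum of its counties for every date
by_state = us_toy.groupby('Province_State')[date_cols_toy].sum()
print(by_state)
```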

# World maps

These world maps are the same type of visualisation as the example we just did, except that we will also explore the deaths and recoveries in addition to the confirmed cases.

## Confirmed

In [None]:
world_map('confirmed')

* In March 2020, the coronavirus spread rapidly through Europe and Iran, most prominently in Italy, Spain and Germany.
* Not long after, we see the virus spreading uncontrollably in the USA, which remains a hotbed of cases for the whole year.
* Towards the summer, cases rose notably in Brazil, India and Russia.

## Deaths

In [None]:
world_map('deaths')

* The trend in deaths follows a very similar pattern to the confirmed cases, but we also see a significant rise in deaths in Mexico.

## Recovered

In [None]:
world_map('recovered')

* At the beginning, Europe, Iran, Turkey and the US record the most recoveries.
* Brazil, India and Russia then overtake them with the highest recovery counts.

# Global time series

Now we will move on from choropleth maps to time series graphs for countries around the world.

## Confirmed

We first analyse the confirmed cases of five countries on which COVID-19 has had a major impact: USA, Brazil, India, Italy and Germany.

In [None]:
linear('confirmed')

* At the beginning of April 2020, America's cases started to rise significantly above the others. From the start of November, the rise becomes almost exponential.
* At the start of February 2021, the US curve begins to flatten noticeably.
* India's and Brazil's cases begin to rise more sharply in June. As of the end of February 2021, the two countries have a roughly similar number of cases.
* The countries faring best in our sample are Italy and Germany. Although they had a rough start and a dangerous surge in October, they seem to be flattening the curve well.

## Deaths

In [None]:
linear('deaths')

* The death pattern for America, Brazil and India is similar to that of the confirmed cases.
* Italy is a different story. Its coronavirus deaths began climbing as early as March 2020 and kept rising for a tough two months.
* From May, Italy brought the outbreak under control and enjoyed five months of very few deaths.
* However, this was followed by a sharp rise after October, with deaths climbing at a record rate.

## Recovered

In [None]:
linear('recovered')

* It seems that we have missing data for the USA's recoveries, as the count drops to 0 in our graph after a very promising 6 million recoveries in mid-December.
* Brazil started steadily recording recoveries in June 2020, keeping a (mostly) consistent trend for the rest of the year.
* India made great strides in the middle of July, which led to an incredible increase in recoveries.
* Italy and Germany began recording significant recoveries in November.
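A cumulative count should never go down, so a drop like the one described for the USA can be detected automatically. The sketch below runs on made-up numbers and dates; treating everything from the first drop onwards as missing is an assumption about how one might handle it, not something the dataset documents.

```python
import pandas as pd

# Toy cumulative recoveries that fall to zero part-way through
recovered = pd.Series(
    [0, 10, 25, 40, 0, 0],
    index=pd.date_range('2020-12-11', periods=6),
)

# A negative day-to-day change signals a reporting problem
drops = recovered.diff() < 0
first_drop = recovered.index[drops][0] if drops.any() else None

# Mark every value from the first drop onwards as missing
after_drop = drops.astype(int).cummax().astype(bool)
cleaned = recovered.mask(after_drop)

print(first_drop)
print(cleaned)
```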

# Cases per country region

Next, we take a look at the regions within a sample of countries (Australia, India and Italy) and compare how they fare with COVID cases.

## Confirmed

In [None]:
for country in ['Australia', 'India', 'Italy']:
    region_linear('Confirmed', country)

* The pattern for these countries is that one region (maybe the most populous?) has a massive lead over all the others, while the rest are clustered closely together.
* In Australia, Victoria has the most confirmed cases by a significant margin, with New South Wales in second.
* For India, Maharashtra leads by a large margin, followed by Kerala.
* In Italy, Lombardia has the most confirmed cases, with Veneto in second.

## Deaths

In [None]:
for country in ['Australia', 'India', 'Italy']:
    region_linear('Deaths', country)

* As with the confirmed cases, Victoria, Maharashtra and Lombardia have an enormous lead over the other regions, which look insignificant by comparison.

## Recovered

In [None]:
for country in ['Australia', 'India', 'Italy']:
    region_linear('Recovered', country)

* We can see a correlation between the number of cases and recoveries: the more cases a region has, the more recoveries it records.
* India and Italy's regions flattened the curve of recoveries in June 2020.
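The relationship between cases and recoveries can be put into a number with a correlation coefficient. This sketch uses made-up regional totals (`regions_toy` and its values are assumptions, not figures from the dataset) and pandas' `Series.corr`, which computes the Pearson correlation by default:

```python
import pandas as pd

# Toy latest totals for four hypothetical regions
regions_toy = pd.DataFrame({
    'Confirmed': [100, 200, 300, 400],
    'Recovered': [90, 150, 280, 350],
})

# Values close to 1 mean regions with more cases also record more recoveries
correlation = regions_toy['Confirmed'].corr(regions_toy['Recovered'])
print(round(correlation, 3))
```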

# USA map

Next, we focus on the USA and return to choropleth maps.

## Confirmed

In [None]:
US_map('confirmed')

## Deaths

In [None]:
US_map('deaths')

* Both maps tell us that the coronavirus began in Washington, hit New York hard, and later became concentrated in California, Texas and Florida.

# State cases

We can also compare how the cases have developed in each state.

## Confirmed

In [None]:
state_linear('confirmed')

* New York saw its first major wave of COVID-19 in March 2020 and got it under control in May, only for cases to rise again in November.
* California, Texas and Florida saw their numbers climb in June, which led to them being the states with the most cases.

## Deaths

In [None]:
state_linear('deaths')

* The deaths here largely mirror what we saw in the previous graph.

# State regions

Finally, we will analyse which regions (counties) in California, Texas and Florida have the most confirmed cases and deaths.

## Confirmed

### California

In [None]:
state_bar('confirmed', 'California')

### Texas

In [None]:
state_bar('confirmed', 'Texas')

### Florida

In [None]:
state_bar('confirmed', 'Florida')

## Deaths

### California

In [None]:
state_bar('deaths', 'California')

### Texas

In [None]:
state_bar('deaths', 'Texas')

### Florida

In [None]:
state_bar('deaths', 'Florida')

* In California, the places with the most cases are Los Angeles, Riverside, San Bernardino, San Diego and Orange.
* In Texas, the places with the most cases are Harris, Dallas, Bexar and Tarrant.
* In Florida, the places with the most cases are Miami-Dade, Palm Beach, Broward and Hillsborough.
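A ranking like the ones above can also be read off as a table instead of a bar chart. As a sketch on made-up counties (the names and numbers are assumptions), and assuming the counts are cumulative so that the latest date column holds each county's running total, `nlargest` picks out the top entries for one state:

```python
import pandas as pd

# Toy county table: three counties in state X and one in state Y
counties_toy = pd.DataFrame({
    'Admin2': ['A', 'B', 'C', 'D'],
    'Province_State': ['X', 'X', 'X', 'Y'],
    '1/22/20': [1, 2, 0, 5],
    '1/23/20': [4, 3, 9, 6],
})
latest = '1/23/20'  # last date column, i.e. the most recent cumulative count

# Top two counties of state X by their latest total
in_state = counties_toy[counties_toy['Province_State'] == 'X']
top = in_state.nlargest(2, latest)[['Admin2', latest]]
print(top)
```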

### Thank you for reading my notebook. If you enjoyed it, please upvote and provide feedback.

### If you want to learn more about plotly you can check out my Introduction to Plotly notebook: https://www.kaggle.com/dabawse/introduction-to-visualisations-with-plotly