# Covid-19 Data Analysis

An analysis of confirmed cases of Covid-19 around the world.

Inspired by [Xingyu Bian](https://www.kaggle.com/therealcyberlord), [Wei Hao Khoong](https://www.kaggle.com/khoongweihao), and other data scientists on Kaggle. This is one of my first data science notebooks and projects, and I am learning the tools used to analyze data from all the awesome contributors on Kaggle.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
import plotly.express as px

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Import the data from the [Novel Corona Virus 2019 Dataset](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset) (updated through 12/6/2020).

In [None]:
#import the data
data_df = pd.read_csv("../input/novel-corona-virus-2019-dataset/covid_19_data.csv")
confirmed_df = pd.read_csv("../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv")
death_df = pd.read_csv("../input/novel-corona-virus-2019-dataset/time_series_covid_19_deaths.csv")
recovered_df = pd.read_csv("../input/novel-corona-virus-2019-dataset/time_series_covid_19_recovered.csv")
confirmed_US = pd.read_csv("../input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed_US.csv")

#clean the data
data_df.rename(columns={'ObservationDate':'Date', 'Country/Region':'Country'}, inplace=True)
confirmed_df.rename(columns={'Country/Region':'Country'}, inplace=True)
death_df.rename(columns={'Country/Region':'Country'}, inplace=True)
recovered_df.rename(columns={'Country/Region':'Country'}, inplace=True)

### Most recent cases:

The last five rows of the dataset, which are its most recent records, come from Ukraine, the Netherlands, and China.

In [None]:
data_df.tail()

### Visualization:

Clean and prepare the data to graph.

In [None]:
#add up the confirmed cases, deaths and recovered cases for each day
#select the numeric column before summing so text columns are never summed
confirmed = data_df.groupby('Date')['Confirmed'].sum().reset_index()
death = data_df.groupby('Date')['Deaths'].sum().reset_index()
recovered = data_df.groupby('Date')['Recovered'].sum().reset_index()
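
As a quick sanity check of the aggregation above, here is a minimal sketch that applies the same group-and-sum to a tiny hand-made table. The frame `toy_df` and all of its values are made up for illustration; only the column names mirror `data_df`.

```python
import pandas as pd

# hypothetical toy data: two dates, two countries (all values are invented)
toy_df = pd.DataFrame({
    'Date': ['01/22/2020', '01/22/2020', '01/23/2020', '01/23/2020'],
    'Country': ['A', 'B', 'A', 'B'],
    'Confirmed': [1, 2, 3, 5],
    'Deaths': [0, 0, 1, 0],
})

# pick the numeric column first, then sum it within each date
toy_confirmed = toy_df.groupby('Date')['Confirmed'].sum().reset_index()
print(toy_confirmed)
```

Each date ends up as one row holding the total over all countries, which is the shape the charts need.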

In [None]:
fig = go.Figure()
#barplot of confirmed cases
fig.add_trace(go.Bar(x=confirmed['Date'],
                    y=confirmed['Confirmed'],
                    name='Confirmed',
                    marker_color='blue'))
#barplot of death
fig.add_trace(go.Bar(x=death['Date'],
                    y=death['Deaths'],
                    name='Death',
                    marker_color='rgba(182, 0, 0, 1)'))
#barplot of recovered
fig.add_trace(go.Bar(x=recovered['Date'],
                    y=recovered['Recovered'],
                    name='Recovered',
                    marker_color='Green'))
fig.update_layout(
    title='Worldwide Covid-19 Cases - Confirmed, Deaths, Recovered (Bar Chart)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        #titlefont is deprecated in newer plotly versions; set the size through title.font
        title=dict(text='Number of Cases', font=dict(size=16)),
        tickfont_size=15,
    ),
    legend=dict(
        x=0,
        y=1.0,
        bgcolor='rgba(255, 255, 0, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    ),
    barmode='group',
    bargap=0.15, # gap between bars of adjacent location coordinates.
    bargroupgap=0.1 # gap between bars of the same location coordinate.
)


In [None]:
fig = go.Figure()
#line plot of confirmed cases
fig.add_trace(go.Scatter(x=confirmed['Date'],
                    y=confirmed['Confirmed'],
                    name='Confirmed',
                    mode='lines',
                    line_color='blue'))
#line plot of death
fig.add_trace(go.Scatter(x=death['Date'],
                    y=death['Deaths'],
                    name='Death',
                    mode='lines',
                    line_color='rgba(182, 0, 0, 1)'))
#line plot of recovered
fig.add_trace(go.Scatter(x=recovered['Date'],
                    y=recovered['Recovered'],
                    name='Recovered',
                    mode='lines',
                    line_color='Green'))
fig.update_layout(
    title='Worldwide Covid-19 Cases - Confirmed, Deaths, Recovered (Line Chart)',
    xaxis_tickfont_size=14,
    yaxis=dict(
        #titlefont is deprecated in newer plotly versions; set the size through title.font
        title=dict(text='Number of Cases', font=dict(size=16)),
        tickfont_size=15,
    ),
    legend=dict(
        x=0,
        y=1.0,
        bgcolor='rgba(255, 255, 0, 0)',
        bordercolor='rgba(255, 255, 255, 0)'
    )
)

From the graph, we can tell that the numbers of confirmed and recovered cases are growing exponentially, while the number of deaths is growing roughly linearly at a much lower rate. This suggests that Covid-19 is extremely contagious but has a low mortality rate.
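
The reading of the graph above can be checked with numbers instead of by eye. The sketch below is one possible way to do that, not something this notebook ran: exponential growth becomes a straight line on a log scale, so a straight-line fit to the logged series should have an R² close to 1. The helper names `r_squared_of_line` and `case_fatality_ratio` and the two sample series are my own inventions for illustration. With the real data, one would pass in `confirmed['Confirmed']` and `death['Deaths']`.

```python
import numpy as np

def r_squared_of_line(y):
    """R^2 of a least-squares straight line fitted to y against 0, 1, 2, ..."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def case_fatality_ratio(deaths, confirmed):
    """Deaths divided by confirmed cases (a crude ratio, not a true mortality rate)."""
    return deaths / confirmed

# hypothetical series, invented for illustration
exp_series = 10 * 1.2 ** np.arange(30)   # grows by 20% per step
lin_series = 10 + 3 * np.arange(30)      # grows by 3 per step

# an exponential series fits a line well only after taking the log
print('exponential, raw   :', r_squared_of_line(exp_series))
print('exponential, logged:', r_squared_of_line(np.log(exp_series)))
print('linear, raw        :', r_squared_of_line(lin_series))
print('deaths / confirmed :', case_fatality_ratio(20, 1000))
```

The ratio of deaths to confirmed cases only approximates the mortality rate, since many infections are never confirmed and some confirmed cases are still unresolved.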

In [None]:
confirmed_df = confirmed_df[['Province/State','Lat','Long','Country']]
temp = data_df.copy()
temp['Country'] = temp['Country'].replace({'Mainland China': 'China'}) #assign the result back instead of replacing in place on a column
map_df = pd.merge(temp, confirmed_df, on=['Country','Province/State']) #add the latitude and longitude to each region

In [None]:
#append the Province/State with the country to get the name

map_df['Province/State'] = map_df['Province/State'].fillna('')
map_df['Location'] = map_df['Country']+' '+map_df['Province/State']
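
The merge above is an inner join, which is the default for `pd.merge`, so any row whose `Country` and `Province/State` pair has no match in the coordinate table is dropped. That may explain why the US does not appear in the map in the next section, assuming the US rows in `data_df` carry state names while the global time series lists the US as a single row with no state. I have not verified that here. The sketch below shows the effect on two made-up tables, `reports` and `coords`, whose values and coordinates are invented.

```python
import pandas as pd

# hypothetical report table: the US rows carry state names
reports = pd.DataFrame({
    'Country': ['China', 'US', 'US'],
    'Province/State': ['Hubei', 'New York', 'California'],
    'Confirmed': [100, 50, 40],
})

# hypothetical coordinate table: the US appears once, with no state
coords = pd.DataFrame({
    'Country': ['China', 'US'],
    'Province/State': ['Hubei', None],
    'Lat': [30.0, 40.0],
    'Long': [112.0, -100.0],
})

# inner join (the default): rows without a matching key pair disappear
merged = pd.merge(reports, coords, on=['Country', 'Province/State'])
print(merged)

# a left join with indicator=True flags the rows that found no coordinates
check = pd.merge(reports, coords, on=['Country', 'Province/State'],
                 how='left', indicator=True)
print(check[check['_merge'] == 'left_only'])
```

Running the `indicator=True` check on the real tables would show exactly which regions are lost in the merge.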

### How Covid-19 spread across the world over time

(US is not included in this graph)

In [None]:
fig = px.density_mapbox(map_df,
                       lat='Lat',
                       lon='Long',
                       hover_name='Location',
                       hover_data=['Confirmed','Deaths','Recovered'],
                       animation_frame='Date',
                       color_continuous_scale='jet',
                       radius=7,
                       zoom=1,
                       height=700)
fig.update_layout(title='Worldwide Corona Virus Cases Time Lapse - Confirmed, Deaths, Recovered',
                  font=dict(family="Courier New, monospace",
                            size=18,
                            color="#7f7f7f")
                 )
fig.update_layout(mapbox_style="open-street-map", mapbox_center_lon=0)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

Covid-19 was first detected in China. It then spread to Europe, Australia, and the Middle East, and later to Africa and to South and North America.

### Covid-19 in the US

In [None]:
#clean the data by dropping the useless columns
confirmed_US.drop(columns=['iso2','iso3','code3','FIPS'],inplace=True)
confirmed_US.rename(columns={'Admin2':'County', 'Province_State':'State','Combined_Key':'Location'}, inplace=True)

In [None]:
#prepare data for bar chart of each state
bar_data = pd.DataFrame()
bar_data['State'] = confirmed_US['State']
bar_data['Confirmed'] = confirmed_US['12/6/20']
#group the data by state and take the sum of the confirmed cases
bar_data = bar_data.groupby('State')['Confirmed'].sum().reset_index()
bar_data = bar_data.sort_values(by=['Confirmed'], ascending=True)

In [None]:
fig = go.Figure(go.Bar(
        x=bar_data['Confirmed'],
        y=bar_data['State'],
        name='Confirmed cases',
        orientation = 'h'))
fig.update_layout(
        title='Confirmed Cases in each State',
        yaxis_tickfont_size=13,
        height=1300)
fig.show()

Most of the US is severely impacted by the coronavirus, with California having the most confirmed cases of Covid-19. Only a couple of US-controlled territories outside of North America have zero cases.

Reconstruct the dataframe to be used in a scatter_geo graph.

In [None]:
confirmed_US

Reshape the date columns into a single column, with the corresponding confirmed cases in another column.

In [None]:
#prepare the data to graph
dates = confirmed_US.columns[7:] #the first 7 columns are metadata; the rest are dates
temp = pd.DataFrame()
temp['Lat'] = confirmed_US['Lat']
temp['Long'] = confirmed_US['Long_']
temp['Location'] = confirmed_US['Location']

#build one frame per date, then join them (DataFrame.append was removed in pandas 2.0)
frames = []
for date in dates:
    case_day = temp.copy()
    case_day['Date'] = date
    case_day['Confirmed'] = confirmed_US[date]
    frames.append(case_day)
US_data = pd.concat(frames)
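
The loop above builds the long table one date at a time. As an alternative, `DataFrame.melt` does the same wide-to-long reshaping in one call. The sketch below shows it on a small made-up table; `wide`, its locations, and its numbers are invented, and only the column layout mirrors `confirmed_US`.

```python
import pandas as pd

# hypothetical wide table: one row per location, one column per date
wide = pd.DataFrame({
    'Lat': [40.7, 34.1],
    'Long': [-74.0, -118.2],
    'Location': ['Place A', 'Place B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# keep the id columns as they are; stack every other column into Date/Confirmed
long_df = wide.melt(id_vars=['Lat', 'Long', 'Location'],
                    var_name='Date',
                    value_name='Confirmed')
print(long_df)
```

The result has one row per location and date, which is the layout `animation_frame` expects.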

In [None]:
#graph
fig = px.scatter_geo(US_data, 
                     lat='Lat', 
                     lon='Long', 
                     scope='usa',
                     color="Confirmed", 
                     size='Confirmed',
                     projection="albers usa", 
                     animation_frame="Date", 
                     color_continuous_scale='jet',
                     hover_name='Location',
                     title='Covid-19 Confirmed Cases across the US')
fig.show()

The first major outbreak of Covid-19 in the US was centered on New York and other regions in the Northeast. However, the pandemic in NY was mitigated thanks to appropriate restrictions and policies. In California, on the other hand, without comparable restrictions, the number of confirmed cases continued to increase.

All in all, we found that although the coronavirus is highly contagious, we can slow its spread if proper policies and procedures are put in place, like the lockdown and quarantine policies in New York and China. We also found that the coronavirus has a low mortality rate. However, it may cause lasting damage to the lungs of patients, which can result in long-term fatigue, shortness of breath, and so on. ([More information on CDC](https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects.html))