# Novel Coronavirus
- Day level information on 2019-nCoV affected cases.
- 2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
- This dataset has daily level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus.

The data is available from 22 Jan 2020.

> Dataset link: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

### Data Dictionary
1. Sno: Serial number
2. Province/State: Province or State of observation
3. Country: Country of observation
4. Last Update: Date of observation
5. Confirmed: Number of confirmed cases
6. Deaths: Number of deaths
7. Recovered: Number of recovered cases
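
Before relying on these columns, it can help to confirm that a loaded frame actually contains them. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the two-row `sample` frame (with made-up values and a hypothetical `Japan` row) are assumptions used only for illustration.

```python
import pandas as pd

# Column names as listed in the data dictionary above
EXPECTED_COLUMNS = ["Sno", "Province/State", "Country", "Last Update",
                    "Confirmed", "Deaths", "Recovered"]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Hypothetical sample with made-up values, for illustration only
sample = pd.DataFrame({
    "Sno": [1, 2],
    "Province/State": ["Hubei", None],
    "Country": ["Mainland China", "Japan"],
    "Last Update": ["1/22/2020 12:00", "1/22/2020 12:00"],
    "Confirmed": [10.0, 2.0],
    "Deaths": [0.0, 0.0],
    "Recovered": [0.0, 0.0],
})

print(missing_columns(sample))
print(missing_columns(sample.drop(columns=["Deaths"])))
```

An empty list means every documented column is present; any names returned point to a schema mismatch worth checking before the analysis continues.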

# 1. Import libraries

Import all the libraries required to explore the dataset into the notebook.

In [None]:
import numpy as np
import pandas as pd

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Import seaborn library
import seaborn as sns
sns.set()

# Import chart_studio.plotly,
# plotly.offline -> download_plotlyjs, init_notebook_mode, plot, iplot, and
# plotly.graph_objs
import chart_studio.plotly as py
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import pycountry

import folium 
from folium import plugins

# Enable notebook mode
init_notebook_mode(connected = True)

# Graphics in retina format 
%config InlineBackend.figure_format = 'retina' 

# Increase the default plot size and set the color scheme
plt.rcParams['figure.figsize'] = 8, 5
#plt.rcParams['image.cmap'] = 'viridis'

# To see the plots in the notebook
%matplotlib inline

# 2. Reading/Importing the Data

Read the dataset from the CSV file and print a few records.

In [None]:
raw_data = pd.read_csv("../input/novel-corona-virus-2019-dataset/2019_nCoV_data.csv")
raw_data.head()

# 3. Understanding/Inspecting the data

View a few records to get an overview of the dataset using the `head()` method.

In [None]:
# See a few data records
raw_data.head()

Check the dimensions of the dataset using the `shape` attribute.

In [None]:
# Shape of the data
raw_data.shape

The data contains 434 rows and 7 columns.

Check the data type of each column in the dataset using the `info()` method.

In [None]:
# Information about each column
raw_data.info()

We have 3 float, 1 int and 3 object columns.

Check the descriptive statistics of the dataset using `describe()` method.

In [None]:
# Generates descriptive statistics
raw_data.describe()

# 4. Data cleaning and preparation

#### Checking for Missing Values and Fix/Drop them

First find the missing values in each column.

In [None]:
# Checking missing values (column-wise)
raw_data.isnull().sum()

Find the percentage of the missing values so that we can take the next action appropriately.

In [None]:
# Checking the percentage of missing values
round(100*(raw_data.isnull().sum()/len(raw_data.index)), 2)

We have 19.59% missing data in Province/State.
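
The same percentage can be computed a little more directly, because the mean of a boolean column is the fraction of `True` values. This is only a sketch on a made-up four-row frame (the names `toy` and `missing_pct` are assumptions, not part of the notebook):

```python
import pandas as pd

# Made-up sample: 1 of 4 Province/State values is missing
toy = pd.DataFrame({
    "Province/State": ["Hubei", None, "Beijing", "Shanghai"],
    "Confirmed": [10.0, 2.0, 5.0, 3.0],
})

# isnull() yields booleans; their column-wise mean is the fraction missing
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```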

Now we can either drop the missing values or fix them. Since the column holds strings, we cannot fill the null values using a mathematical formula, so it is simpler to drop the rows with a missing Province/State.

In [None]:
# Dropping the rows with missing Province/State.
raw_data.dropna(subset=['Province/State'], inplace=True)
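
Dropping rows also removes every country that reports no province at all. As an alternative, shown here only as a sketch and not what this notebook does, the missing values could be replaced with a placeholder label. The frame `toy`, its values, and the `"Unknown"` label are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: two countries report no Province/State
toy = pd.DataFrame({
    "Province/State": ["Hubei", None, "Ontario", None],
    "Country": ["Mainland China", "Japan", "Canada", "Thailand"],
    "Confirmed": [10.0, 2.0, 1.0, 3.0],
})

# Option used in this notebook: drop rows without a Province/State
dropped = toy.dropna(subset=["Province/State"])

# Alternative: keep the rows and label the missing value explicitly
filled = toy.fillna({"Province/State": "Unknown"})

print(len(dropped), dropped["Country"].nunique())
print(len(filled), filled["Country"].nunique())
```

With the placeholder, country-level totals still include countries that have no province breakdown, at the cost of an artificial `Unknown` category in province-level views.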

Check again whether any missing values remain in the dataset.

In [None]:
# Checking missing values (column-wise)
raw_data.isnull().sum()

Now we don't have any missing values, so we can proceed with the next step.

`Last Update` is stored as a string (object dtype). We need to convert it into datetime format to parse the date.

One way of doing so is using the `to_datetime()` function.

In [None]:
raw_data["LastUpdated"] = pd.to_datetime(raw_data['Last Update'])
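
If some entries cannot be parsed, `to_datetime()` raises an error by default. A hedged sketch of a more tolerant conversion uses `errors="coerce"`, which turns unparseable strings into `NaT` so they can be counted and inspected. The sample strings below are made up, and their format is an assumption about what the column may contain.

```python
import pandas as pd

# Made-up strings; the last one is deliberately invalid
raw_strings = pd.Series(["1/22/2020 12:00", "1/23/2020 17:00", "not a date"])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw_strings, errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed)
print("Unparsed entries:", n_unparsed)
```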

Extract information from the LastUpdated column of type datetime.

In [None]:
# Extract different components from the date using the .dt accessor
last_updated = raw_data['LastUpdated']

raw_data['date'] = last_updated.dt.date
raw_data['year'] = last_updated.dt.year
raw_data['month'] = last_updated.dt.month
raw_data['day'] = last_updated.dt.day
raw_data['time'] = last_updated.dt.time
raw_data['dayofweek'] = last_updated.dt.dayofweek
raw_data['day_name'] = last_updated.dt.day_name()
raw_data['month_name'] = last_updated.dt.month_name()
raw_data.head()

# Questions

### Question 1 : What is the "Severity Level" of the virus?

It is the ratio of Deaths to Confirmed cases, expressed as a percentage.

In [None]:
severity = (raw_data['Deaths'].sum() / raw_data['Confirmed'].sum())*100
severity

The death percentage among the Confirmed cases is only about 2.3%, so by this measure the severity level of the coronavirus is low.
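
One caveat: if each row is a cumulative snapshot for a province on a given day (an assumption, since the data dictionary does not say so), summing every row counts the same cases several times. The sketch below, on made-up numbers with hypothetical provinces `A` and `B`, compares the ratio over all rows with the ratio over only the latest snapshot per province.

```python
import pandas as pd

# Made-up cumulative snapshots for two hypothetical provinces
toy = pd.DataFrame({
    "Province/State": ["A", "B", "A", "B"],
    "LastUpdated": pd.to_datetime(
        ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23"]),
    "Confirmed": [100.0, 50.0, 200.0, 100.0],
    "Deaths": [2.0, 1.0, 8.0, 1.0],
})

# Ratio over every row, as computed in the notebook
ratio_all_rows = toy["Deaths"].sum() / toy["Confirmed"].sum() * 100

# Ratio over the most recent snapshot of each province only
latest = (toy.sort_values("LastUpdated")
             .drop_duplicates("Province/State", keep="last"))
ratio_latest = latest["Deaths"].sum() / latest["Confirmed"].sum() * 100

print(round(ratio_all_rows, 2), round(ratio_latest, 2))
```

The two figures can differ because the all-rows version weights earlier days as heavily as the latest one.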

### Question 2 : Which 5 countries have the most Confirmed cases?

For this, let's find the total Confirmed cases per country using the `groupby()` method applied to the `Country` column.

In [None]:
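# Sum only the numeric columns; text and date columns cannot be summed meaningfully
top_country = raw_data.groupby('Country').sum(numeric_only=True)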
top_country['Country'] = top_country.index
top_country.sort_values(by='Confirmed', ascending=False).head(10)

`Mainland China` has the highest number of Confirmed cases.
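
To list exactly the top five, the grouping can be restricted to the `Confirmed` column and combined with `nlargest()`. This is a minimal sketch on made-up data with hypothetical country labels `A` to `F`:

```python
import pandas as pd

# Made-up rows; countries A and B appear more than once
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "C", "D", "E", "F", "B"],
    "Confirmed": [10.0, 4.0, 20.0, 7.0, 1.0, 3.0, 2.0, 6.0],
})

# Total per country, then keep the five largest totals
top5 = toy.groupby("Country")["Confirmed"].sum().nlargest(5)
print(top5)
```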

Let's visualize this data for more clarity.

In [None]:
countries = [country for country, df in raw_data.groupby('Country')]

plt.bar(countries, top_country['Confirmed'])
plt.xticks(countries, rotation='vertical', size=8)
plt.xlabel('Country name')
plt.ylabel('Number of Confirmed cases')
plt.show()

As shown in the graph, Mainland China has the highest number of Confirmed cases.

Let's visualise the data on a world map to understand the impact of the virus and its spread.

To do so, we use two techniques:
1. Using `folium` library
2. Using `plotly.graph_objs` library

#### Using folium library

In [None]:
# Make a data frame with dots to show on the map
# (coordinates are listed in the same order as the rows of top_country)
world_data = pd.DataFrame({
    'name': list(top_country['Country']),
    'lat': [-25.27, 56.13, 35.86, 51.17, 22.32, 22.19, 35.96, 23.7, 37.09],
    'lon': [133.78, -106.35, 104.19, 10.45, 114.17, 113.54, 90.19, 120.96, -95.71],
    'Confirmed': list(top_country['Confirmed']),
})

# create map and display it
world_map = folium.Map(location=[10, -20], zoom_start=2.3,tiles='OpenStreetMap')

for lat, lon, value, name in zip(world_data['lat'], world_data['lon'], world_data['Confirmed'], world_data['name']):
    folium.CircleMarker([lat, lon],
                        radius=value * 0.001,
                        popup = ('<strong>Country</strong>: ' + str(name).capitalize() + '<br>'
                                '<strong>Confirmed Cases </strong>: ' + str(value) + '<br>'),
                        color='purple',
                        
                        fill_color='indigo',
                        fill_opacity=0.7 ).add_to(world_map)

world_map

The bubble size indicates the number of Confirmed cases in that particular country.
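
The hard-coded coordinate lists depend on the countries appearing in exactly that order. A sketch of a less fragile approach keys the coordinates by country name and joins them onto the totals. The pairing of names to coordinates below is assumed from the list order above, and `Atlantis` is a made-up entry that shows what happens to a country without coordinates.

```python
import pandas as pd

# Coordinates keyed by country name (pairing assumed from the lists above)
COORDS = {
    "Australia": (-25.27, 133.78),
    "Mainland China": (35.96, 90.19),
}
coords = pd.DataFrame.from_dict(COORDS, orient="index", columns=["lat", "lon"])

# Made-up totals; "Atlantis" has no known coordinates
totals = pd.DataFrame({
    "Country": ["Mainland China", "Australia", "Atlantis"],
    "Confirmed": [500.0, 4.0, 1.0],
})

# Left join keeps every country; unmatched ones get NaN coordinates
located = totals.merge(coords, left_on="Country", right_index=True, how="left")
unmatched = located.loc[located["lat"].isna(), "Country"].tolist()

print(located)
print("No coordinates for:", unmatched)
```

Rows listed in `unmatched` can then be skipped or fixed before drawing markers, rather than silently shifting every later coordinate by one position.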

#### Using plotly.graph_objs library

In [None]:
cntry = top_country['Country'].tolist()

confirmed = top_country['Confirmed'].tolist()

cntryCode = ['AUS','CAN', 'CHN','DEU','HKG','MAC','CHN', 'TWN', 'USA']

# Describe the choropleth trace as a dict
data = dict(type = 'choropleth', # type of plot
            locations = cntryCode, # ISO alpha-3 country codes
            #locationmode = 'USA-states', # locationmode for the above codes
            colorscale = 'Portland', # color scale used for the map
            text = cntry, # hover text for the corresponding entries in locations
            z = confirmed, # values that determine the color of each location
            colorbar = {'title' : 'Confirmed cases'}) # title of the color bar

layout = dict(title = 'Confirmed cases of Coronavirus',
              geo = dict(showframe = True,
                         showlakes = True, # show lakes on the map
                         lakecolor = 'rgb(85, 173, 240)',
                         projection = {'type' : 'equirectangular'}))

choromap = go.Figure(data = [data], layout = layout)

iplot(choromap)

### Question 3 : Which 5 states have the most Confirmed cases in Mainland China?

In [None]:
mainland_china = raw_data.loc[raw_data['Country'] == 'Mainland China']

top_states = mainland_china.groupby('Province/State').sum(numeric_only=True)
top_states['Province/State'] = top_states.index
top_states.sort_values(by='Confirmed', ascending=False).head(10)

`Hubei` has the highest number of Confirmed cases.

Let's visualize this data for more clarity.

In [None]:
states = top_states['Province/State']

plt.bar(states, top_states['Confirmed'])
plt.xticks(states, rotation='vertical')
plt.xlabel('State name')
plt.ylabel('Number of Confirmed cases')
plt.show()

As shown in the graph, Hubei has the highest number of Confirmed cases.

Now, let's plot the Confirmed vs Recovered data to understand the relationship between them.

In [None]:
f, ax = plt.subplots(figsize=(20, 8))

sns.set_color_codes("pastel")
sns.barplot(x="Confirmed", y="Province/State", data=top_states,
            label="Confirmed", color="b")

sns.set_color_codes("muted")
sns.barplot(x="Recovered", y="Province/State", data=top_states,
            label="Recovered", color="g")

# Add a legend and informative axis label
ax.legend(ncol=2, loc="upper right", frameon=True)
ax.set(xlim=(0, 2000), ylabel="",
       xlabel="Stats")
sns.despine(left=True, bottom=True)

### Question 4 : How is the coronavirus spread across countries and their corresponding states?

To show this, we can use the heatmap.

In [None]:
# Create a pivot table on 'raw_data' dataset
fp = raw_data.pivot_table(index = 'Province/State', columns = 'Country', values = 'Confirmed')

In [None]:
# Plot the heatmap for the above pivot table
sns.heatmap(fp, cmap = 'plasma')

The virus is mainly spread across Mainland China and China.
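
Note that `pivot_table()` aggregates with the mean by default, so each heatmap cell above is the average `Confirmed` value over all rows for that province, not a total. The sketch below, on made-up numbers (the `Ontario` rows are hypothetical), contrasts the default with `aggfunc="max"`, which may be closer to the latest count if the values are cumulative.

```python
import pandas as pd

# Made-up rows: two snapshots per province
toy = pd.DataFrame({
    "Province/State": ["Hubei", "Hubei", "Ontario", "Ontario"],
    "Country": ["Mainland China", "Mainland China", "Canada", "Canada"],
    "Confirmed": [100.0, 300.0, 1.0, 3.0],
})

# Default aggregation is the mean of the values in each cell
mean_table = toy.pivot_table(index="Province/State", columns="Country",
                             values="Confirmed")

# Explicit aggregation: largest value seen in each cell
max_table = toy.pivot_table(index="Province/State", columns="Country",
                            values="Confirmed", aggfunc="max")

print(mean_table)
print(max_table)
```

Cells for province and country pairs that never occur together stay `NaN`, which is why most of the heatmap is empty.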

### Question 5 : What is the trend of the spread of the virus? Is it increasing, decreasing or fluctuating?

Group the dataset by `date`.

In [None]:
daily_confirmed = raw_data.groupby('date').sum(numeric_only=True)
daily_confirmed

In [None]:
dates = [date for date, df in raw_data.groupby('date')]
dates = pd.DatetimeIndex(dates).day

plt.plot(dates, daily_confirmed['Confirmed'])

plt.xticks(dates, rotation='vertical')
plt.show()

As shown in the graph, there is a sharp increase after 26 Jan 2020, and the number is still increasing.
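
Two details are worth keeping in mind when reading such a trend. Using only the day of the month as the x-axis stops being ordered once the data crosses a month boundary, and if the totals are cumulative (an assumption), the day-to-day change is often more informative than the total. A minimal sketch on a made-up series:

```python
import pandas as pd

# Made-up cumulative totals spanning a month boundary
cumulative = pd.Series(
    [10.0, 25.0, 60.0, 100.0],
    index=pd.to_datetime(
        ["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"]),
    name="Confirmed",
)

# Day-of-month values restart at 1 in February, so they are not monotonic
day_of_month = cumulative.index.day.tolist()

# Difference between consecutive days gives the new cases per day
new_cases = cumulative.diff()

print(day_of_month)
print(new_cases)
```

Plotting against the full dates in `cumulative.index` avoids the ordering problem, and `new_cases` shows whether growth is speeding up or slowing down.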

### Question 6: Which country is performing better at saving the lives of people infected by the coronavirus?

First, let's calculate the Recovered percentage and Death percentage for each country.

In [None]:
top_country['recovered_percent'] = (top_country['Recovered'] / top_country['Confirmed'])*100
top_country['death_percent'] = (top_country['Deaths'] / top_country['Confirmed'])*100

top_country.sort_values(by='recovered_percent', ascending=False).head(10)

Australia has the highest recovered percentage compared to other countries.

Mainland China has the highest death percentage.
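
Percentages computed from very small counts can be misleading, since a country with a handful of cases can top the ranking after one or two recoveries. The sketch below ranks made-up totals with and without a minimum number of confirmed cases. The threshold `MIN_CONFIRMED` and the country labels `X`, `Y`, `Z` are hypothetical choices for illustration only.

```python
import pandas as pd

# Made-up country totals
totals = pd.DataFrame(
    {"Confirmed": [1000.0, 4.0, 50.0],
     "Recovered": [100.0, 2.0, 10.0],
     "Deaths": [30.0, 0.0, 0.0]},
    index=["X", "Y", "Z"],
)
totals["recovered_percent"] = totals["Recovered"] / totals["Confirmed"] * 100

MIN_CONFIRMED = 10  # hypothetical cut-off, not taken from the notebook

# Ranking over all countries vs. only those above the cut-off
ranked_all = totals.sort_values("recovered_percent", ascending=False)
ranked_filtered = (totals[totals["Confirmed"] >= MIN_CONFIRMED]
                   .sort_values("recovered_percent", ascending=False))

print(ranked_all.index.tolist())
print(ranked_filtered.index.tolist())
```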

Let's see the Recovered vs Deaths trends over the whole period.

In [None]:
# We can define the figure size while creating subplots: multiple subplots
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5), dpi = 100)

#ax.plot(dates, daily_confirmed['Confirmed'])
ax[0].plot(dates, daily_confirmed['Recovered'], label = 'Recovered')
ax[0].plot(dates, daily_confirmed['Deaths'], label = 'Deaths')

ax[1].plot(dates,daily_confirmed['Confirmed'], label = 'Confirmed')
ax[1].plot(dates,daily_confirmed['Recovered'], label = 'Recovered')
ax[1].plot(dates,daily_confirmed['Deaths'], label = 'Deaths')

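# Apply the date ticks to both subplots (plt.xticks only affects the last one)
for axis in ax:
    axis.set_xticks(dates)
    axis.tick_params(axis='x', labelrotation=90)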
ax[0].legend()
ax[1].legend()
plt.tight_layout()
plt.show()

Now let's look at the Recovered vs Deaths trend for each country.

In [None]:
countries = top_country['Country']

# We can define the figure size while creating subplots: multiple subplots
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5), dpi = 100)

#ax.plot(dates, daily_confirmed['Confirmed'])
ax[0].plot(countries, top_country['Recovered'], label = 'Recovered')
ax[0].plot(countries, top_country['Deaths'], label = 'Deaths')
ax[0].tick_params(axis='x', labelrotation=45)

ax[1].plot(countries,top_country['Confirmed'], label = 'Confirmed')
ax[1].plot(countries,top_country['Recovered'], label = 'Recovered')
ax[1].plot(countries,top_country['Deaths'], label = 'Deaths')
ax[1].tick_params(axis='x', labelrotation=45)

ax[0].legend()
ax[1].legend()

plt.tight_layout()
plt.show()

### Question 7 : Explore the outliers.

Let's create a boxplot for the Confirmed cases of coronavirus.

In [None]:
plt.figure(figsize=(5,6))
#plt.subplot(1, 2, 1)
fig = top_country.boxplot(column='Confirmed')
fig.set_title('')
fig.set_ylabel('Confirmed')

So, we have a few outliers. To explore them further, let's create a bar graph.

In [None]:
countries = top_country['Country']

plt.bar(countries, top_country['Confirmed'])
plt.xticks(countries, rotation=45, size=8)
plt.xlabel('Country name')
plt.ylabel('Number of Confirmed cases')
plt.show()

We found the culprit behind the outliers, and the winner is Mainland China.
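
The points a boxplot draws as outliers can also be found numerically. By default matplotlib places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and anything outside is marked as an outlier. A minimal sketch of that rule on made-up totals with hypothetical labels `A` to `F`:

```python
import pandas as pd

# Made-up totals; F is far larger than the rest
confirmed = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
                      index=["A", "B", "C", "D", "E", "F"],
                      name="Confirmed")

# Quartiles and interquartile range
q1 = confirmed.quantile(0.25)
q3 = confirmed.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = confirmed[(confirmed < lower) | (confirmed > upper)]

print(lower, upper)
print(outliers)
```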

Let's dive deeper into the data for Mainland China.

In [None]:
plt.figure(figsize=(5,6))
#plt.subplot(1, 2, 1)
fig = top_states.boxplot(column='Confirmed')
fig.set_title('')
fig.set_ylabel('Confirmed')

Again, we have a few outliers here as well.

You know what to do now. Yes, you're right!! Let's plot a bar graph.

In [None]:
states = top_states['Province/State']

plt.bar(states, top_states['Confirmed'])
plt.xticks(states, rotation='vertical')
plt.xlabel('State name')
plt.ylabel('Number of Confirmed cases')
plt.show()

The outlier here is Hubei.

We can explore the Hubei province further to answer the questions below:

1. Why does Hubei, Mainland China have the highest number of confirmed coronavirus cases?
2. What are the possible reasons that the coronavirus has spread so widely here?
3. The virus spreads from animals to humans and from human to human.
    - Is the population density in Hubei high, and is that one of the reasons for the spread?
    - Is the animal population in Hubei high, contributing to the spread?

and many more...