<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/covid19-global-forecasting/covid19_global_forecasting_week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID19 Global Forecasting (Week 1)
**Forecast daily COVID-19 spread in regions around the world**

Kaggle is launching two companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves forecasting confirmed cases and fatalities between March 25 and April 22 by region, the primary goal isn't to produce accurate forecasts. It’s to identify factors that appear to impact the transmission rate of COVID-19.

You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your finding in a notebook.

As the data becomes available, we will update the leaderboard with live results based on [data made available](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).

We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19.

## Data Description

In this challenge, you will be predicting the cumulative number of confirmed COVID19 cases in various locations across the world, as well as the number of resulting fatalities, for future dates.

We understand this is a serious situation, and in no way want to trivialize the human impact this crisis is causing by predicting fatalities. Our goal is to provide better methods for estimates that can assist medical and governmental institutions to prepare and adjust as pandemics unfold.

### Files
* train.csv - the training data up to Mar 18, 2020.
* test.csv - the dates to predict; there is a week of overlap with the training data for the initial Public leaderboard. Once submissions are paused, the Public leaderboard will update based on the last 28 days of predicted data.
* submission.csv - a sample submission in the correct format; again, predictions should be cumulative
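
Because predictions must be cumulative, a forecast for a location should never decrease from one day to the next. The sketch below shows one way to enforce that on model output before writing a submission. The `Location` column and the `enforce_cumulative` helper are hypothetical names introduced for illustration (the real submission file is keyed by `ForecastId`), and rows are assumed to be sorted by date within each location.

```python
import pandas as pd


def enforce_cumulative(preds, group_col="Location",
                       value_cols=("ConfirmedCases", "Fatalities")):
    """Return a copy where each value column is non-negative and
    non-decreasing within each group (rows assumed sorted by date)."""
    out = preds.copy()
    for col in value_cols:
        # A running maximum removes any dips; clip removes negative values.
        out[col] = out.groupby(group_col)[col].cummax().clip(lower=0)
    return out


# Toy predictions, not real data.
raw = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B"],
    "ConfirmedCases": [10.0, 8.0, 12.0, 3.0, 2.5],
    "Fatalities": [1.0, 0.5, 2.0, 0.0, -0.1],
})
fixed = enforce_cumulative(raw)
print(fixed)
```

A running maximum is only one possible repair; it keeps earlier peaks and flattens later dips, which may or may not suit a given model.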

Reference:

https://www.kaggle.com/pradeepmuniasamy/covid19-inside-story-of-each-countries

https://www.kaggle.com/saga21/covid-global-forecast-sir-model

## Setup

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd

from bokeh.plotting import figure, output_file, output_notebook, show
from bokeh.models import ColumnDataSource, Div, Select, Button, ColorBar, CustomJS
from bokeh.layouts import row, column, layout
from bokeh.transform import cumsum, linear_cmap
from bokeh.palettes import Blues8, Spectral3

output_notebook()

TensorFlow 2.x selected.


In [2]:
# Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline

import folium 
from folium import plugins
plt.style.use("fivethirtyeight")  # for pretty graphs

from plotly.offline import iplot
from plotly import tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
import plotly.figure_factory as ff
py.init_notebook_mode(connected=True)

In [3]:
import plotly; plotly.__version__

'4.4.1'

## Load dataset from Kaggle

In [6]:
!pip uninstall -y kaggle
!pip install --upgrade pip
!pip install kaggle==1.5.6
!kaggle -v

Found existing installation: kaggle 1.5.6
Uninstalling kaggle-1.5.6:
  Successfully uninstalled kaggle-1.5.6
Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (20.0.2)
Processing /root/.cache/pip/wheels/01/3e/ff/77407ebac3ef71a79b9166a8382aecf88415a0bcbe3c095a01/kaggle-1.5.6-py3-none-any.whl
Installing collected packages: kaggle
Successfully installed kaggle-1.5.6
Kaggle API 1.5.6


First, copy the kaggle.json file to the ~/.kaggle directory.

In [0]:
# copy kaggle.json file to the ~/.kaggle directory (create it first if it is missing)
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
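
The same credential setup can be done from Python, which is handy outside Colab where shell magics may not be available. This is a minimal sketch: `install_kaggle_credentials` is a hypothetical helper, and the `home` parameter exists only so the function can be tried against a scratch directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", home=None):
    """Copy a Kaggle API token into <home>/.kaggle with owner-only access."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # Create the directory first; copying fails if it does not exist.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    # Owner read/write only, the equivalent of `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_credentials()` with no arguments would copy `kaggle.json` from the working directory into the current user's home.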

In [7]:
# show all available datasets
!kaggle datasets list

ref                                                         title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
unanimad/dataisbeautiful                                    Reddit - Data is Beautiful                           11MB  2020-03-21 22:28:28            443         49  1.0              
allen-institute-for-ai/CORD-19-research-challenge           COVID-19 Open Research Dataset Challenge (CORD-19)  646MB  2020-03-20 23:31:34          22457       3252  0.88235295       
rubenssjr/brasilian-houses-to-rent                          brazilian_houses_to_rent                            117KB  2020-03-15 01:12:22            488         27  1.0              
sudalairajkumar/novel-corona-virus-2019-dataset             Novel Corona Virus 2

In [8]:
# Try to download data for the covid19-global-forecasting-week-1 challenge.
!kaggle competitions download -c covid19-global-forecasting-week-1

Downloading covid19-global-forecasting-week-1.zip to /content
  0% 0.00/195k [00:00<?, ?B/s]
100% 195k/195k [00:00<00:00, 89.8MB/s]


### Unzip dataset

In [0]:
import os, shutil
import zipfile

# path to the directory where the original dataset will be uncompressed
original_dataset_dir = 'kaggle_covid19_global_forecasting_week_1'

# remove the directory if it already exists
shutil.rmtree(original_dataset_dir, ignore_errors=True)

# create the directory
os.mkdir(original_dataset_dir)

In [0]:
# unzip dataset
with zipfile.ZipFile("covid19-global-forecasting-week-1.zip","r") as zip_ref:
    zip_ref.extractall(original_dataset_dir)

### Load dataset

In [0]:
train_df = pd.read_csv(os.path.join(original_dataset_dir, "train.csv"))
test_df = pd.read_csv(os.path.join(original_dataset_dir, "test.csv"))
submission_csv = pd.read_csv(os.path.join(original_dataset_dir, "submission.csv"))

In [12]:
train_df.head()

Unnamed: 0,Id,Province/State,Country/Region,Lat,Long,Date,ConfirmedCases,Fatalities
0,1,,Afghanistan,33.0,65.0,2020-01-22,0.0,0.0
1,2,,Afghanistan,33.0,65.0,2020-01-23,0.0,0.0
2,3,,Afghanistan,33.0,65.0,2020-01-24,0.0,0.0
3,4,,Afghanistan,33.0,65.0,2020-01-25,0.0,0.0
4,5,,Afghanistan,33.0,65.0,2020-01-26,0.0,0.0


In [13]:
test_df.head()

Unnamed: 0,ForecastId,Province/State,Country/Region,Lat,Long,Date
0,1,,Afghanistan,33.0,65.0,2020-03-12
1,2,,Afghanistan,33.0,65.0,2020-03-13
2,3,,Afghanistan,33.0,65.0,2020-03-14
3,4,,Afghanistan,33.0,65.0,2020-03-15
4,5,,Afghanistan,33.0,65.0,2020-03-16
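
The file description mentions a week of overlap between the training and test dates. A quick check like the one below can confirm which dates are shared before fitting a model, so that overlapping rows are not treated as unseen data. The `date_overlap` helper and the toy frames are assumptions for illustration; with the real data, `train_df` and `test_df` would be passed in instead.

```python
import pandas as pd


def date_overlap(train, test, date_col="Date"):
    """Return the daily dates that appear in both date ranges."""
    train_dates = pd.to_datetime(train[date_col])
    test_dates = pd.to_datetime(test[date_col])
    start = max(train_dates.min(), test_dates.min())
    end = min(train_dates.max(), test_dates.max())
    if end < start:
        return pd.DatetimeIndex([])  # the ranges do not overlap
    return pd.date_range(start, end, freq="D")


# Toy frames that only mimic the shape of the Date column.
toy_train = pd.DataFrame(
    {"Date": pd.date_range("2020-01-22", "2020-03-18").strftime("%Y-%m-%d")})
toy_test = pd.DataFrame(
    {"Date": pd.date_range("2020-03-12", periods=10).strftime("%Y-%m-%d")})

overlap = date_overlap(toy_train, toy_test)
print(len(overlap), overlap.min().date(), overlap.max().date())
```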


Download the country code dataset.

In [14]:
!git clone https://github.com/ybayle/ISRC

Cloning into 'ISRC'...
remote: Enumerating objects: 17, done.
remote: Total 17 (delta 0), reused 0 (delta 0), pack-reused 17
Unpacking objects: 100% (17/17), done.


In [15]:
country_code_df = pd.read_csv('ISRC/wikipedia-iso-country-codes.csv')
country_code_df.head()

Unnamed: 0,English short name lower case,Alpha-2 code,Alpha-3 code,Numeric code,ISO 3166-2
0,Afghanistan,AF,AFG,4,ISO 3166-2:AF
1,Åland Islands,AX,ALA,248,ISO 3166-2:AX
2,Albania,AL,ALB,8,ISO 3166-2:AL
3,Algeria,DZ,DZA,12,ISO 3166-2:DZ
4,American Samoa,AS,ASM,16,ISO 3166-2:AS
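
One plausible use of this table is attaching ISO codes to the case data by country name. The sketch below does a left join so no case rows are dropped, and names missing from the code table simply get a missing code. The `add_alpha3` helper and the toy rows (including the made-up country "Atlantis") are assumptions; the column names match the outputs shown above.

```python
import pandas as pd


def add_alpha3(cases, codes, country_col="Country/Region"):
    """Left-join the Alpha-3 code onto the case data by country name."""
    lookup = codes[["English short name lower case", "Alpha-3 code"]].rename(
        columns={"English short name lower case": country_col})
    # how="left" keeps every case row, even when the name has no match.
    return cases.merge(lookup, on=country_col, how="left")


# Toy frames, not real data.
toy_cases = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Atlantis"],
    "ConfirmedCases": [1.0, 2.0, 3.0],
})
toy_codes = pd.DataFrame({
    "English short name lower case": ["Afghanistan", "Albania"],
    "Alpha-2 code": ["AF", "AL"],
    "Alpha-3 code": ["AFG", "ALB"],
})

merged = add_alpha3(toy_cases, toy_codes)
unmatched = merged.loc[merged["Alpha-3 code"].isna(), "Country/Region"].tolist()
print(merged)
print("Names without a code:", unmatched)
```

Listing the unmatched names is worthwhile because the two sources may spell some countries differently.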


## Exploratory Data Analysis

### Disease spread over the countries

In [16]:
temp = train_df.groupby(['Date', 'Country/Region'])['ConfirmedCases'].sum().reset_index()

temp['Date'] = pd.to_datetime(temp['Date'])
temp['Date'] = temp['Date'].dt.strftime('%m/%d/%Y')
temp['size'] = temp['ConfirmedCases'].pow(0.3) * 3.5

fig = px.scatter_geo(temp, locations='Country/Region', locationmode='country names',
                     color='ConfirmedCases', size='size', hover_name='Country/Region',
                     range_color=[1, 100], projection='natural earth',
                     animation_frame='Date', title='COVID-19: Cases Over Time', color_continuous_scale='greens')
fig.show()

In [21]:
country_df = pd.DataFrame()
latest_date = train_df['Date'].iloc[-1]  # date of the last row in the training data
temp = train_df.loc[train_df['Date'] == latest_date].groupby(['Country/Region'])['ConfirmedCases'].sum().reset_index()

country_df['Name'] = temp['Country/Region']
country_df['Values'] = temp['ConfirmedCases']

fig = px.choropleth(country_df, locations='Name', locationmode='country names', color='Values')
fig.update_layout(title='Corona spread on 19-03-2020')
fig.show()

#### Observations

The map clearly shows that the disease is most widespread in **China**.

**Iran, Italy, and the USA** are following China's trend and also show high case counts.
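
A visual reading of the map can be checked numerically by ranking countries on their confirmed cases for the latest date. The sketch below is one way to do that; `top_countries` is a hypothetical helper, the toy rows are invented for illustration and are not real counts, and dates are assumed to be ISO-formatted strings so that `max()` returns the latest one.

```python
import pandas as pd


def top_countries(df, n=3):
    """Rank countries by total confirmed cases on the latest date."""
    latest = df["Date"].max()
    totals = (df[df["Date"] == latest]
              .groupby("Country/Region")["ConfirmedCases"]
              .sum()  # sum over provinces within a country
              .sort_values(ascending=False))
    return totals.head(n)


# Toy rows with made-up numbers.
toy = pd.DataFrame(
    [("P1", "X", "2020-03-01", 5.0), ("P2", "X", "2020-03-01", 7.0),
     (None, "Y", "2020-03-01", 20.0), ("P1", "X", "2020-03-02", 10.0),
     ("P2", "X", "2020-03-02", 15.0), (None, "Y", "2020-03-02", 22.0),
     (None, "Z", "2020-03-02", 3.0)],
    columns=["Province/State", "Country/Region", "Date", "ConfirmedCases"])

top = top_countries(toy, n=2)
print(top)
```

With the real data, `top_countries(train_df, n=10)` would give a table to compare against the choropleth.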

### Cases Confirmed Vs Fatalities across Countries

**Note:**

I have made a dashboard for each country showing confirmed cases versus fatality rate. Feel free to hover over the dashboard and select values to explore each country in more detail.