# COVID-19 Pandemic Forecasting Using XGBoost

Coronavirus disease (COVID-19) is an infectious disease caused by a new virus.
The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (1 meter or 3 feet) with people who are unwell.

How it spreads
Coronavirus disease spreads primarily through contact with an infected person when they cough or sneeze. It also spreads when a person touches a surface or object that has the virus on it, then touches their eyes, nose, or mouth.

Symptoms
People may be sick with the virus for 1 to 14 days before developing symptoms. The most common symptoms of coronavirus disease (COVID-19) are fever, tiredness, and dry cough. Most people (about 80%) recover from the disease without needing special treatment.
More rarely, the disease can be serious and even fatal. Older people, and people with other medical conditions (such as asthma, diabetes, or heart disease), may be more vulnerable to becoming severely ill.
People may experience:
* cough
* fever
* tiredness
* difficulty breathing (severe cases)

Prevention

DO THE FIVE
Help stop coronavirus
* HANDS Wash them often
* ELBOW Cough into it
* FACE Don't touch it
* SPACE Keep safe distance
* HOME Stay if you can

This notebook uses visualization to explore the current situation and trends around the world, based on the week-2 dataset, and then uses an XGBoost regressor to forecast future confirmed cases and fatalities.

Stay Safe!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Import Dataset

In [None]:
train_set = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/train.csv')
test_set = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/test.csv')
submission_set = pd.read_csv('/kaggle/input/covid19-global-forecasting-week-2/submission.csv')

In [None]:
train_set.head()

In [None]:
test_set.head()

In [None]:
train_set.info()

In [None]:
test_set.info()

In [None]:
train_set.sample(5)

In [None]:
test_set.sample(5)

# Top 15 countries with Confirmed cases

In [None]:
df = train_set.fillna('NA').groupby(['Country_Region','Province_State','Date'])['ConfirmedCases'].sum() \
                          .groupby(['Country_Region','Province_State']).max().sort_values() \
                          .groupby(['Country_Region']).sum().sort_values(ascending = False)

In [None]:
top_15_countries = pd.DataFrame(df).head(15)
top_15_countries
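
The chained `groupby` above is compact but hard to read. The sketch below reproduces the same idea on a tiny made-up frame (the country names and counts are invented for illustration, only the column names match `train.csv`). Because `ConfirmedCases` is cumulative, the maximum per province is its latest total, and summing those maxima gives one figure per country.

```python
import pandas as pd

# Toy data: invented values, same column names as train.csv
toy = pd.DataFrame({
    "Country_Region": ["A", "A", "A", "A", "B", "B"],
    "Province_State": ["P1", "P1", "P2", "P2", None, None],
    "Date": ["2020-03-01", "2020-03-02"] * 3,
    "ConfirmedCases": [10, 15, 3, 7, 20, 40],
})

# Fill the missing provinces so groupby does not drop those rows
toy["Province_State"] = toy["Province_State"].fillna("NA")

country_totals = (
    toy.groupby(["Country_Region", "Province_State"])["ConfirmedCases"]
       .max()                                  # latest cumulative count per province
       .groupby(level="Country_Region").sum()  # add the provinces up per country
       .sort_values(ascending=False)
)
print(country_totals)
```

Country B has no provinces, so its single "NA" group carries the whole national count.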

# Top 15 countries with Fatalities

In [None]:
df1 = train_set.fillna('NA').groupby(['Country_Region','Province_State','Date'])['Fatalities'].sum() \
                          .groupby(['Country_Region','Province_State']).max().sort_values() \
                          .groupby(['Country_Region']).sum().sort_values(ascending = False)

In [None]:
top_15_countries_fatal = pd.DataFrame(df1).head(15)
top_15_countries_fatal

The two tables above show that the US has the most confirmed cases, while Italy has the most fatalities.

In [None]:
import plotly.express as px
fig = px.bar(top_15_countries, x=top_15_countries.index, y='ConfirmedCases', labels={'x':'Countries'},  color='ConfirmedCases', barmode='group',
             height=400)
fig1 = px.bar(top_15_countries_fatal, x=top_15_countries_fatal.index, y='Fatalities', labels={'x':'Countries'},  color='Fatalities', barmode='group',
             height=400)
fig.show()
fig1.show()

# Percentage of Confirmed cases vs Deaths in a Country

In [None]:
train_set_copy = train_set.drop(['Province_State'], axis=1)
train_set_copy.head()

In [None]:
df2 = train_set_copy.groupby(['Country_Region','Date'])['Fatalities'].sum() \
                    .groupby(['Country_Region']).max().sort_values(ascending = False)

df2.head()

In [None]:
df3 = train_set_copy.groupby(['Country_Region','Date'])['ConfirmedCases'].sum() \
                    .groupby(['Country_Region']).max().sort_values(ascending = False)

df3.head()

In [None]:
percentage_value = ((df2/df3)*100).sort_values(ascending = False)
percentage_value = pd.DataFrame(percentage_value)
percentage_value.columns = ['Percentage']
# Drop countries whose ratio is zero or undefined (no recorded fatalities)
percentage_value = percentage_value.replace(0.0, np.nan)
percentage_value = percentage_value.dropna(how='all', axis=0)
percentage_value.tail()
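
As a worked example of the ratio computed above, the following sketch applies the same steps to three invented countries. The totals are made up for illustration. Pandas aligns the two series on their index, so each country's deaths are divided by that country's cases.

```python
import numpy as np
import pandas as pd

# Invented per-country totals, for illustration only
deaths = pd.Series({"A": 5, "B": 0, "C": 4})
cases = pd.Series({"A": 100, "B": 50, "C": 40})

# Deaths as a percentage of confirmed cases, aligned by country
ratio = (deaths / cases * 100).rename("Percentage")

# Replace zero ratios with NaN and drop them, as the cell above does
ratio = ratio.replace(0.0, np.nan).dropna().sort_values(ascending=False)
print(ratio)
```

Country B is removed because it has no recorded deaths, which leaves C (10%) ahead of A (5%).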

In [None]:
fig = px.bar(percentage_value.dropna(), x=percentage_value.index, y='Percentage', labels={'x':'Countries'},  color='Percentage', 
             title='Death VS Confirmed_Cases Ratio',
             barmode='group',
             height=700)
fig.show()

# Data Cleaning and Taking care of missing data

In [None]:
train_set.isnull().sum()

In [None]:
test_set.isnull().sum()

In [None]:
train_set_copy.isnull().sum()

In [None]:
test_set_copy = test_set.drop(['Province_State'], axis=1)
test_set_copy.isnull().sum()

In [None]:
# Turn 'YYYY-MM-DD' strings into integers such as 20200322 so the model can use them
train_set_copy["Date"] = train_set_copy["Date"].apply(lambda x: x.replace("-",""))
train_set_copy["Date"] = train_set_copy["Date"].astype(int)
train_set_copy.head()

In [None]:
test_set_copy["Date"] = test_set_copy["Date"].apply(lambda x: x.replace("-",""))
test_set_copy["Date"] = test_set_copy["Date"].astype(int)
test_set_copy.head()
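
Writing a date as an integer like `20200331` keeps the dates in order but distorts their spacing: the step from 31 March to 1 April becomes 70 rather than 1. The sketch below compares this encoding with one possible alternative, the number of days since a reference date. This notebook does not use the alternative, and the reference date `start` is an assumption chosen for illustration.

```python
import pandas as pd

dates = pd.Series(["2020-03-30", "2020-03-31", "2020-04-01"])

# Encoding used in this notebook: strip the dashes and cast to int
as_int = dates.str.replace("-", "", regex=False).astype(int)

# Alternative sketch: whole days since an assumed reference date
start = pd.Timestamp("2020-01-22")  # hypothetical reference date
as_days = (pd.to_datetime(dates) - start).dt.days

print(as_int.diff().tolist())   # gaps between consecutive days, integer encoding
print(as_days.diff().tolist())  # gaps between consecutive days, day-offset encoding
```

With the day offset, consecutive days are always one unit apart, including across month boundaries.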

# Training Data

In [None]:
x_train = train_set_copy[['Date']]
y1_train = train_set_copy['ConfirmedCases']
y2_train = train_set_copy['Fatalities']
x_test = test_set_copy[['Date']]
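
Since `Date` is the only feature, every country has the same input on a given day. The sketch below uses invented numbers to show what that implies for a model trained with a squared-error objective: for rows with identical inputs, the best single prediction is the mean of their targets. This only illustrates the setup; it is not the output of the fitted model, which would also be affected by regularisation.

```python
import pandas as pd

# Invented numbers: two countries observed on the same two dates
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "A", "B"],
    "Date": [20200301, 20200301, 20200302, 20200302],
    "ConfirmedCases": [10, 1000, 20, 2000],
})

# With Date as the only input, rows from A and B cannot be told apart.
# The value that minimises squared error for identical inputs is their mean.
best_constant_per_date = toy.groupby("Date")["ConfirmedCases"].mean()
print(best_constant_per_date)
```

Here the per-date targets are 505 and 1010, which match neither country well. Adding an encoded `Country_Region` column as a second feature would be one way to let the model tell them apart.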

# Fitting XGBoost to the Training set of Confirmed Cases

In [None]:
from xgboost import XGBRegressor
# Despite the variable name, this is a regression model: it predicts a count, not a class
classifier = XGBRegressor(max_depth=8, n_estimators=1000, random_state=0)
classifier.fit(x_train, y1_train)

In [None]:
x_pred = classifier.predict(x_test)
prediction1 = pd.DataFrame(x_pred)
prediction1.columns = ["ConfirmedCases_prediction"]

In [None]:
prediction1

# Fitting XGBoost to the Training set of Fatalities

In [None]:
from xgboost import XGBRegressor
classifier = XGBRegressor(max_depth=8, n_estimators=1000, random_state=0)
classifier.fit(x_train, y2_train)

In [None]:
x_pred = classifier.predict(x_test)
prediction2 = pd.DataFrame(x_pred)
prediction2.columns = ["Fatalities_prediction"]

In [None]:
prediction2.head()
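
Both models are fitted on every training row, so the notebook has no estimate of how far off the forecasts are. A minimal sketch of an error metric follows. It assumes root mean squared logarithmic error (RMSLE) is the measure of interest; the helper name `rmsle` and the sample values are invented for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Invented values, for illustration only
actual = [0, 9, 99]
predicted = [0, 9, 999]
print(rmsle(actual, predicted))
```

To use it, you could hold out the most recent dates of `train_set_copy`, fit on the earlier rows, and compare the predictions for the held-out dates against the known values.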

# Submission

In [None]:
submission_set.head()

In [None]:
submission_forecast = submission_set['ForecastId']
submission_forecast = pd.DataFrame(submission_forecast)
submission_forecast.head()

In [None]:
submission = pd.concat([submission_forecast, prediction1, prediction2], axis=1)
submission.head()

In [None]:
submission.columns = ['ForecastId', 'ConfirmedCases', 'Fatalities']
submission.head()

In [None]:
submission.describe()
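
A regressor can return small negative values, which make no sense for case or fatality counts. The sketch below shows one way to floor predictions at zero and run basic checks before writing the file. It works on an invented stand-in frame (`toy_submission`), not on the real `submission`.

```python
import pandas as pd

# Invented stand-in for the `submission` frame built above
toy_submission = pd.DataFrame({
    "ForecastId": [1, 2, 3],
    "ConfirmedCases": [12.4, -0.8, 250.0],
    "Fatalities": [0.3, -0.1, 9.6],
})

value_cols = ["ConfirmedCases", "Fatalities"]

# Counts cannot be negative, so floor the predictions at zero
toy_submission[value_cols] = toy_submission[value_cols].clip(lower=0)

# Basic sanity checks before writing the file
assert toy_submission["ForecastId"].is_unique
assert not toy_submission[value_cols].isnull().any().any()
print(toy_submission)
```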

# XGBoost Submission file.

In [None]:
submission.to_csv('submission.csv', index = False)