## Random Forest

Random Forest is an ensemble of Decision Trees. With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` to control the ensemble itself.
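
To make the hyperparameter overlap concrete, the sketch below compares the parameter names that the three estimators expose through scikit-learn's `get_params()`. The variable names are illustrative, and the exact parameter sets may differ slightly between scikit-learn versions.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Collect the hyperparameter names of each estimator (default constructors)
rf_params = set(RandomForestClassifier().get_params())
tree_params = set(DecisionTreeClassifier().get_params())
bagging_params = set(BaggingClassifier().get_params())

# Parameters the forest shares with a single tree (how trees are grown)
shared_with_tree = sorted(rf_params & tree_params)
# Parameters the forest shares with a bagging ensemble (how the ensemble is built)
shared_with_bagging = sorted(rf_params & bagging_params)
# Tree parameters the forest does not expose (the "few exceptions")
tree_only = sorted(tree_params - rf_params)

print("Shared with DecisionTreeClassifier:", shared_with_tree)
print("Shared with BaggingClassifier:", shared_with_bagging)
print("Only on DecisionTreeClassifier:", tree_only)
```

Names such as `max_depth` appear in the first list and `n_estimators` in the second, while `splitter` is one of the tree-only exceptions.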

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance and generally yields a better model overall. A `BaggingClassifier` built on `DecisionTreeClassifier` estimators that sample features at each split is therefore roughly equivalent to a `RandomForestClassifier`. Run the cell below to visualize a single estimator from a random forest model trained on the Iris dataset, which classifies flowers by species.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

# Model (can also use single decision tree)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)

# Train
model.fit(iris.data, iris.target)
# Extract single tree
estimator = model.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = iris.feature_names,
                class_names = iris.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

Notice how each split separates the data into buckets of similar observations. This is a single tree on a relatively simple classification dataset, but the same method applies to more complex datasets, where the trees simply grow deeper.
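
The rough equivalence between a random forest and a bagging ensemble of feature-sampling trees can be checked empirically. The following is a minimal sketch on the Iris dataset. The train/test split, the number of estimators, and the seeds are arbitrary choices for illustration, and the two models are not expected to match exactly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, random_state=42, stratify=y_iris
)

# Random forest: bootstrap samples plus a random feature subset at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

# Bagging over decision trees that also sample features at each split
bag = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    random_state=42,
)

rf.fit(Xi_train, yi_train)
bag.fit(Xi_train, yi_train)

rf_acc = rf.score(Xi_test, yi_test)
bag_acc = bag.score(Xi_test, yi_test)
# Fraction of test samples on which the two ensembles predict the same class
agreement = (rf.predict(Xi_test) == bag.predict(Xi_test)).mean()

print(f"Random forest accuracy: {rf_acc:.3f}")
print(f"Bagging accuracy:       {bag_acc:.3f}")
print(f"Prediction agreement:   {agreement:.3f}")
```

On a dataset this small the two ensembles usually agree on nearly every test sample. `RandomForestClassifier` remains the more convenient choice because it is optimized for decision trees.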

## Coronavirus
Coronavirus disease (COVID-19) is an infectious disease caused by a new virus.
The disease causes respiratory illness (like the flu) with symptoms such as a cough, fever, and, in more severe cases, difficulty breathing. You can protect yourself by washing your hands frequently, avoiding touching your face, and avoiding close contact (within 1 meter or 3 feet) with people who are unwell. An outbreak of COVID-19 started in December 2019 and, at the time this project was created, was continuing to spread throughout the world. Many governments recommended only essential outings to public places and closed most businesses that did not serve food or sell essential items. An excellent [spatial dashboard](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) built by Johns Hopkins shows the daily confirmed cases by country.

This case study was designed to drive home the important role that data science plays in real-world situations like this pandemic. This case study uses the Random Forest Classifier and a dataset from the South Korean cases of COVID-19 provided on [Kaggle](https://www.kaggle.com/kimjihoo/coronavirusdataset) to encourage research on this important topic. The goal of the case study is to build a Random Forest Classifier to predict the 'state' of the patient.
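
As a preview of the overall workflow, here is a minimal sketch on hypothetical toy data. The `toy` table, its columns, and the rule that generates `state` are invented for illustration and are not taken from the Kaggle dataset. The sketch one-hot encodes the features, splits the data, fits a `RandomForestClassifier`, and predicts the state.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the patient table (not real records)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(1, 90, size=n),
    "sex": rng.choice(["male", "female"], size=n),
})
# Invented rule, used only so the model has some signal to learn
toy["state"] = np.where(toy["age"] > 60, "isolated", "released")

# One-hot encode categorical features and separate the target
X_toy = pd.get_dummies(toy.drop(columns="state"))
y_toy = toy["state"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42
)

# Fit on string labels directly; predict() then returns the state names
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_clf.fit(X_tr, y_tr)
toy_preds = toy_clf.predict(X_te)
toy_accuracy = toy_clf.score(X_te, y_te)

print(toy_preds[:5], toy_accuracy)
```

scikit-learn classifiers accept string labels, so the predictions come back as state names and can be compared with the held-out labels without a separate decoding step.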

First, please load the needed packages and modules into Python. Next, load the data into a pandas dataframe for ease of use.

In [None]:
import os
import pandas as pd
from datetime import datetime,timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas_profiling
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file

%matplotlib inline

import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,roc_auc_score
from sklearn.metrics import accuracy_score,log_loss

from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

import plotly.graph_objects as go






In [None]:
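# Relative path to the patient-level CSV from the Kaggle dataset
url = 'SouthKoreacoronavirusdataset/PatientInfo.csv'
# infer_datetime_format is deprecated in pandas 2.x, so it is omitted here
df = pd.read_csv(url, parse_dates=['symptom_onset_date','confirmed_date','released_date','deceased_date'])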
df.head()

In [None]:
df.info()

In [None]:
df.nunique()

In [None]:
df1 = df.copy()

In [None]:
now= datetime.now()
year = now.year
df['age'] = year - df['birth_year']

In [None]:
#dropping birth_year as age is calculated
df.drop('birth_year', inplace=True, axis=1)
df.head()

In [None]:
df.describe().T

In [None]:
# Make sure the date columns are datetime64 (a no-op if read_csv already parsed them)
df['symptom_onset_date'] = pd.to_datetime(df['symptom_onset_date'])
df['confirmed_date'] = pd.to_datetime(df['confirmed_date'])
df['released_date'] = pd.to_datetime(df['released_date'])
df['deceased_date'] = pd.to_datetime(df['deceased_date'])

In [None]:
cols = ['sex', 'country','province', 'city','infection_case', 'state']
df[cols] = df[cols].astype('category')

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

The `state` column has 88 missing values, which means those patients fall outside the recorded 'isolated', 'released', and 'deceased' categories. We can therefore replace these NaN values with a separate 'missing' category.
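
Because `state` is stored as a categorical column, the replacement value has to be registered as a category before it can be used to fill NaN. A minimal sketch on a hypothetical toy Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy categorical column with two missing entries
state = pd.Series(
    ["isolated", np.nan, "released", "deceased", np.nan], dtype="category"
)

# Register 'missing' as a valid category, then use it to fill the NaN values
filled = state.cat.add_categories("missing").fillna("missing")

print(filled.tolist())
print(filled.cat.categories.tolist())
```

The same pattern works for any categorical column whose fill value is not already one of its categories.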

In [None]:
df.state.unique()

In [None]:
df['sex'].unique()

In [None]:
# Fill NaN in 'state' with a new 'missing' category and NaN in 'sex' with a new 'neutral' category
df['state'] = df['state'].cat.add_categories('missing').fillna('missing')
df['sex'] = df['sex'].cat.add_categories('neutral').fillna('neutral')

In [None]:
df.state.unique(), df['sex'].unique()

In [None]:
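# Encode 'disease' as 0/1: missing values become 0 and True becomes 1
# (avoids chained assignment, which may not modify df)
df['disease'] = df['disease'].fillna(0).astype(int)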

In [None]:
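# Fill remaining numeric NaNs with the column mean; non-numeric columns are left unchanged
df = df.fillna(df.mean(numeric_only=True))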

In [None]:
df.isnull().sum()

In [None]:
df.state.value_counts()

In [None]:
df.columns

In [None]:
df.disease.value_counts()

In [None]:
cols =['patient_id', 'global_num', 'sex', 'age', 'country', 'province', 'city',
       'disease', 'infection_case', 'infection_order', 'infected_by', 'state']

In [None]:
df[cols]

In [None]:
dfd = pd.get_dummies(df[cols].drop('state', axis=1))
dfd.columns

In [None]:

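# 'state' was already excluded before encoding; errors='ignore' avoids a KeyError here
dfd.drop(columns='state', errors='ignore', inplace=True)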

In [None]:
dfd

In [None]:
X = dfd
y = df['state']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, random_state=42)

In [None]:
y_train = pd.factorize(y_train)[0]
y_train

In [None]:
clf = RandomForestClassifier(n_jobs = 2, random_state=0)
clf.fit(X_train, y_train)

In [None]:
clf.predict(X_test)

In [None]:
X_test

In [None]:
clf.predict_proba(X_test)[0:10]

In [None]:
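# Map integer predictions back to state names, using the same factorize order as y_train
state_labels = np.asarray(pd.factorize(y.loc[X_train.index])[1])
preds = state_labels[clf.predict(X_test)]
preds[0:25]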

In [None]:
y_test.head()

In [None]:
cnf_matrix = confusion_matrix(y_test, preds)
cnf_matrix

In [None]:
pd.crosstab(y_test, preds, rownames=['Actual'], colnames=['Predicted'])

In [None]:
y_test, preds

In [None]:
ac = accuracy_score(y_test, preds)
ac

In [None]:
df

In [None]:
df['symptom_date'] = df['symptom_onset_date'].dt.day
df['symptom_month'] = df['symptom_onset_date'].dt.month
df['symptom_year'] = df['symptom_onset_date'].dt.year

In [None]:
df['symptom_date'], 