## [Following Kaggle Step by Step] COVID-19 RandomForest Regressor

---

## Intro

![covid-19](https://www.topevents.co.za/wp-content/uploads/2020/04/covid-19-1024x431.jpg)

A competition to forecast the daily number of COVID-19 confirmed cases and fatalities using recent data (2020-04-27 to 2020-05-11).

This notebook predicts confirmed cases and fatalities with RandomForest regression.

If you want to know more about RandomForest, see [this article](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)!

*This is a copy-along notebook. It is based on the kernels below!*


- [Nischay Dhankhar's kernel](https://www.kaggle.com/nischaydnk/covid19-week5-visuals-randomforestregressor)

- [Sarut Yentakham's kernel](https://www.kaggle.com/benzintel01/randomforestregressor-covid-19)

---

## Setting

In [None]:
!pip install plotly

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np 
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import datetime
%matplotlib inline

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

---

## Read Data

In [None]:
train = pd.read_csv('../input/covid19-global-forecasting-week-5/train.csv')
test = pd.read_csv('../input/covid19-global-forecasting-week-5/test.csv')
submission = pd.read_csv('../input/covid19-global-forecasting-week-5/submission.csv')

## EDA

In [None]:
train.columns

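- Id : row identifier

- County : county

- Province_State : province or state

- Country_Region : country name

- Population : population of the corresponding region

- Weight : per-row weight (the unit is hard to interpret; it appears to be the weight applied in the competition's evaluation metric, which should be confirmed on the competition page)

- Date : date of the record

- Target : which quantity the row reports, one of two categories: ConfirmedCases or Fatalities

- TargetValue : the y value, i.e. the quantity to predict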

### Checking Missing Values

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

The `County` and `Province_State` columns contain many missing values.

> The missing values need to be handled.
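A minimal sketch of two common ways to handle such gaps, assuming a toy frame that only borrows the column names from `train` (the rows and the `'Unknown'` placeholder are made up for illustration). This notebook later takes the second route and drops the sparse columns.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the region columns; the values are illustrative only.
toy = pd.DataFrame({
    'County': [np.nan, 'Autauga', np.nan],
    'Province_State': [np.nan, 'Alabama', 'Alabama'],
    'Country_Region': ['Afghanistan', 'US', 'US'],
})

# Share of missing values per column (0.0 means no gaps).
missing_ratio = toy.isnull().mean()
print(missing_ratio)

# Option A: fill missing region names with a placeholder label.
filled = toy.fillna({'County': 'Unknown', 'Province_State': 'Unknown'})

# Option B: drop the sparse columns entirely.
dropped = toy.drop(columns=['County', 'Province_State'])
print(filled)
print(dropped)
```

Looking at the ratio rather than the raw count makes it easier to judge whether a column is worth keeping.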

### Data Check

In [None]:
submission

In [None]:
submission.shape

In [None]:
submission['TargetValue'].sum()

In [None]:
# Sort by 'TargetValue' in ascending order
train.sort_values(by=['TargetValue'])

---

## Data Visualization

### ConfirmedCases & Fatalities

In [None]:
fig = px.pie(train, values = 'TargetValue', names='Target')
fig.update_traces(textposition = 'inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

### Current share of Worldwide COVID-19 Confirmed Cases

In [None]:
fig = px.pie(train, values='TargetValue', names='Country_Region')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

### Top 15 Countries

In [None]:
getToplist = 15
grouped_multiple = train.groupby(['Country_Region'], as_index=False)['TargetValue'].sum()
countryTop = grouped_multiple.nlargest(getToplist, 'TargetValue')['Country_Region']
newlist = train[train['Country_Region'].isin(countryTop.values)]
line = newlist.groupby(['Date', 'Country_Region'], as_index=False)['TargetValue'].sum()
line = line[line['TargetValue'] >= 0]

In [None]:
line.pivot(index='Date', columns='Country_Region', values='TargetValue').plot(figsize=(10,5))
plt.grid(zorder=0)
plt.title('Top ' + str(getToplist) + ' ConfirmedCases & Fatalities', fontsize=18, pad=10)
plt.ylabel('People')
plt.xlabel('Date')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

> The plot shows the situation in the US deteriorating sharply from mid-March onward. <br>
[See this article / Trump declares national emergency -- and denies responsibility for coronavirus testing failures](https://edition.cnn.com/2020/03/13/politics/donald-trump-emergency/index.html)

## Data Preprocessing

Drop a few variables that have many missing values and low importance.

In [None]:
train = train.drop(['County', 'Province_State','Country_Region','Target'], axis=1)
test = test.drop(['County', 'Province_State','Country_Region','Target'], axis=1)

train.head()

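`County`, `Province_State`, `Country_Region` = region-related variables & `Target`

These are nominal variables, so the model cannot use them unless they are converted to numeric values. Here they are simply dropped.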
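As an alternative to dropping them, nominal columns can be mapped to numeric codes. The sketch below uses scikit-learn's `OrdinalEncoder` on a made-up toy frame; it is not what this notebook does, and the integer codes carry an arbitrary (alphabetical) order that the model may treat as meaningful.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two nominal columns; values are illustrative only.
toy = pd.DataFrame({
    'Country_Region': ['US', 'Brazil', 'US', 'Italy'],
    'Target': ['ConfirmedCases', 'Fatalities', 'Fatalities', 'ConfirmedCases'],
})

# Each column gets its own mapping: sorted category -> 0, 1, 2, ...
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[['Country_Region', 'Target']])

encoded_df = pd.DataFrame(encoded, columns=['Country_Region', 'Target'])
print(encoded_df)
print(encoder.categories_)  # the learned category order per column
```

The same fitted `encoder` should be reused with `transform` on the test set so both sets share one mapping.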

In [None]:
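from sklearn.preprocessing import OrdinalEncoder

# Split the Date variable into calendar features (expects a datetime column)
def create_feature(df):
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['dayofweek'] = df['Date'].dt.dayofweek
    df['dayofyear'] = df['Date'].dt.dayofyear
    df['quarter'] = df['Date'].dt.quarter
    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)
    return df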

In [None]:
def train_dev_split(df, days):
    # Hold out the last `days` days as a dev set; 'Date' must be a datetime column
    date = df['Date'].max() - pd.Timedelta(days=days)
    return df[df['Date'] <= date], df[df['Date'] > date]
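A small usage sketch of a time-based split like the one above, on a made-up ten-day frame. It assumes `Date` has already been parsed with `pd.to_datetime` (in this notebook that conversion happens a few cells later), and it redefines the helper so the block runs on its own.

```python
import pandas as pd

def train_dev_split(df, days):
    # Keep everything up to the cutoff for training, the last `days` days for dev.
    cutoff = df['Date'].max() - pd.Timedelta(days=days)
    return df[df['Date'] <= cutoff], df[df['Date'] > cutoff]

# Toy daily series; values are illustrative only.
toy = pd.DataFrame({
    'Date': pd.date_range('2020-04-01', periods=10, freq='D'),
    'TargetValue': range(10),
})

train_part, dev_part = train_dev_split(toy, days=3)
print(len(train_part), len(dev_part))
print(train_part['Date'].max(), dev_part['Date'].min())
```

Splitting by date keeps every dev row strictly later than every training row, which a random split does not guarantee.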

In [None]:
test_date_min = test['Date'].min()
test_date_max = test['Date'].max()

In [None]:
def avoid_date_leakage(df, date=test_date_min):
    return df[df['Date'] < date]

In [None]:
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day

In [None]:
train['Date'] = pd.to_datetime(train['Date'])
test['Date'] = pd.to_datetime(test['Date'])

In [None]:
train['Date'] = train['Date'].dt.strftime('%Y%m%d').astype(int)
test['Date'] = test['Date'].dt.strftime('%Y%m%d').astype(int)

In [None]:
train.head()

`Date` changes from the '2020-05-07' format to the integer 20200507, so the model receives a numeric column.
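The conversion can be checked on a tiny made-up frame. The sketch also shows that calendar features (as in `create_feature`) have to be derived while `Date` is still a datetime column, because the `.dt` accessor is not available once the column has been turned into strings or integers. The column name `DateInt` is hypothetical.

```python
import pandas as pd

# Two illustrative dates in the same text format as the raw CSV.
toy = pd.DataFrame({'Date': ['2020-05-07', '2020-05-11']})
toy['Date'] = pd.to_datetime(toy['Date'])

# Calendar features need the datetime dtype.
toy['dayofweek'] = toy['Date'].dt.dayofweek  # Monday = 0
toy['weekofyear'] = toy['Date'].dt.isocalendar().week.astype(int)

# Compact integer key: 2020-05-07 becomes 20200507.
toy['DateInt'] = toy['Date'].dt.strftime('%Y%m%d').astype(int)
print(toy)
```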

## Using Regressor to find Target values

Derive the `TargetValue` output with a RandomForest Regressor.

In [None]:
from sklearn.model_selection import train_test_split

predictors = train.drop(['TargetValue', 'Id'], axis=1)
target = train['TargetValue']
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.22, random_state=0)

In [None]:
model = RandomForestRegressor(n_jobs=-1)
estimators=100
scores=[]
model.set_params(n_estimators=estimators)
model.fit(X_train, y_train)
scores.append(model.score(X_test, y_test))
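For a regressor, `model.score` returns the coefficient of determination R² on the given data. The sketch below checks this on synthetic data (the features, noise level, and the names `forest`, `X_tr`, `X_te` are made up for illustration), comparing `score` with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: y depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.22, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

score = forest.score(X_te, y_te)                   # R^2 on the held-out rows
manual = r2_score(y_te, forest.predict(X_te))      # same quantity, computed explicitly
print(score, manual)
```

Fixing `random_state` on both the split and the forest makes the score reproducible between runs.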

In [None]:
X_test

In [None]:
test.drop(['ForecastId'], axis=1, inplace=True)
test.index.name = 'Id'
test

In [None]:
y_pred2 = model.predict(X_test)
y_pred2

In [None]:
predictions = model.predict(test)

pred_list = [int(x) for x in predictions]

output = pd.DataFrame({'Id': test.index, 'TargetValue': pred_list})
print(output)

In [None]:
output

## Finding Quantile values from the output

Compute the quantile values of the output.

In [None]:
a = output.groupby(['Id'])['TargetValue'].quantile(q=0.05).reset_index() # 5th percentile
b = output.groupby(['Id'])['TargetValue'].quantile(q=0.5).reset_index() # median
c = output.groupby(['Id'])['TargetValue'].quantile(q=0.95).reset_index() # 95th percentile

In [None]:
a.columns = ['Id', 'q0.05']
b.columns = ['Id', 'q0.5']
c.columns = ['Id', 'q0.95']

a = pd.concat([a, b['q0.5'], c['q0.95']], axis=1)

a['q0.05'] = a['q0.05'].clip(0, 10000)
a['q0.5'] = a['q0.5'].clip(0, 10000)
a['q0.95'] = a['q0.95'].clip(0, 10000)

a

In [None]:
a['Id'] = a['Id'] + 1
a
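Since `output` holds exactly one row per `Id`, grouping by `Id` makes every quantile equal to that single prediction. The first part of the sketch below shows this on a toy frame. The second part is one possible alternative, not the method used in this notebook: it takes quantiles over the predictions of the individual trees in a forest (`estimators_`). The synthetic data and the names `forest`, `X_new`, `per_tree` are assumptions for illustration, and the resulting spread is only a rough, uncalibrated indication of uncertainty.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Part 1: one row per Id, so the 5% and 95% quantiles coincide.
output = pd.DataFrame({'Id': [0, 1, 2], 'TargetValue': [5, 40, 0]})
q05 = output.groupby('Id')['TargetValue'].quantile(q=0.05)
q95 = output.groupby('Id')['TargetValue'].quantile(q=0.95)
same = bool((q05 == q95).all())
print(same)

# Part 2 (alternative sketch): quantiles across the trees of a fitted forest.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=300)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

X_new = rng.uniform(0, 10, size=(5, 2))
# One prediction per tree and row: shape (n_trees, n_rows).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
# Rows of q are the 5%, 50% and 95% quantiles for each new sample.
q = np.quantile(per_tree, [0.05, 0.5, 0.95], axis=0)
print(q.round(2))
```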

## Submission

In [None]:
sub = pd.melt(a, id_vars=['Id'], value_vars = ['q0.05', 'q0.5', 'q0.95'])
sub['variable'] = sub['variable'].str.replace('q', '', regex=False)
# Keys are joined with '_' (e.g. '1_0.05'); compare with the sample submission loaded above
sub['ForecastId_Quantile'] = sub['Id'].astype(str)+'_'+sub['variable']
sub['TargetValue'] = sub['value']
sub = sub[['ForecastId_Quantile', 'TargetValue']]
sub.reset_index(drop=True, inplace=True)
sub.to_csv('submission.csv', index=False)
sub.head()
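Before uploading, it can help to confirm that the generated keys match the ones in the sample submission. The sketch below reshapes a made-up two-row quantile table with `pd.melt` and compares the keys against a hand-written set. The `'1_0.05'` key format and the names `wide`, `long_df`, `sample_keys` are assumptions for illustration; in the notebook the reference keys would come from `submission['ForecastId_Quantile']`.

```python
import pandas as pd

# Toy wide table: one row per Id, one column per quantile (values are made up).
wide = pd.DataFrame({'Id': [1, 2], 'q0.05': [0, 3], 'q0.5': [1, 5], 'q0.95': [2, 9]})

# melt stacks the columns one after another: all q0.05 rows, then q0.5, then q0.95.
long_df = pd.melt(wide, id_vars=['Id'], value_vars=['q0.05', 'q0.5', 'q0.95'])
long_df['variable'] = long_df['variable'].str.replace('q', '', regex=False)
long_df['key'] = long_df['Id'].astype(str) + '_' + long_df['variable']

# Assumed reference keys, standing in for the sample submission's key column.
sample_keys = {'1_0.05', '1_0.5', '1_0.95', '2_0.05', '2_0.5', '2_0.95'}
missing = sample_keys - set(long_df['key'])
extra = set(long_df['key']) - sample_keys
print(long_df[['key', 'value']])
print(missing, extra)
```

Both `missing` and `extra` being empty indicates that the file covers every expected key exactly once per quantile.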