### XGBoost TPS Jan 2022
---
### Overview
This notebook trains an XGBoost regressor (`XGBRegressor`) to predict 'num_sold' for the Tabular Playground Series Jan 2022 data.
The 'date' column is converted into numeric calendar features with the *Series.dt* accessor.

> https://qiita.com/Takemura-T/items/79b16313e45576bb6492 (Japanese)

### Data
* 'row_id'
* 'date' → converted with the Series.dt accessor
* 'country'
* 'store'
* 'product'
* 'num_sold'
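
As a rough illustration of the 'date' conversion listed above, the sketch below applies the `.dt` accessor to a tiny made-up frame. The frame `toy` and its three dates are invented for the example. Note that `.dt` is an accessor, so attributes such as `.dt.year` are read from it rather than calling it.

```python
import pandas as pd

# Hypothetical toy data: three dates stored as strings
toy = pd.DataFrame({"date": ["2015-01-01", "2015-01-03", "2015-12-31"]})
toy["date"] = pd.to_datetime(toy["date"])

# Calendar attributes are read through the .dt accessor
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["dayofweek"] = toy["date"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["dayofyear"] = toy["date"].dt.dayofyear

print(toy)
```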

### Importing modules

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

### Load the data

In [None]:
df=pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv")
df_test=pd.read_csv("../input/tabular-playground-series-jan-2022/test.csv")
df.head()

### Check for missing values

In [None]:
df.isna().sum()

In [None]:
df_test.isna().sum()

### Convert date

In [None]:
df['date'] = pd.to_datetime(df['date'])
df["year"]=df["date"].dt.year
df["quarter"]=df["date"].dt.quarter
df["month"]=df["date"].dt.month
df['week'] = df['date'].dt.isocalendar().week.astype(int)
df['dayofyear'] = df['date'].dt.dayofyear
df['daysinmonth'] = df['date'].dt.days_in_month
df["dayofweek"]=df["date"].dt.dayofweek
df["day"]=df["date"].dt.day
df['weekend'] = ((df['date'].dt.dayofweek) // 5 == 1).astype(int)
df=df.drop("date",axis=1)

In [None]:
df_test['date'] = pd.to_datetime(df_test['date'])
df_test["year"]=df_test["date"].dt.year
df_test["quarter"]=df_test["date"].dt.quarter
df_test["month"]=df_test["date"].dt.month
df_test['week'] = df_test['date'].dt.isocalendar().week.astype(int)
df_test['dayofyear'] = df_test['date'].dt.dayofyear
df_test['daysinmonth'] = df_test['date'].dt.days_in_month
df_test["dayofweek"]=df_test["date"].dt.dayofweek
df_test["day"]=df_test["date"].dt.day
df_test['weekend'] = ((df_test['date'].dt.dayofweek) // 5 == 1).astype(int)
df_test=df_test.drop("date",axis=1)
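
The two cells above repeat the same steps for the training and test frames. One way to avoid the repetition is a small helper. `add_date_features` below is a hypothetical name, not part of the original notebook, and it is demonstrated on a made-up frame rather than the competition data.

```python
import pandas as pd

def add_date_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with calendar features in place of 'date'."""
    out = frame.copy()
    date = pd.to_datetime(out["date"])
    out["year"] = date.dt.year
    out["quarter"] = date.dt.quarter
    out["month"] = date.dt.month
    out["week"] = date.dt.isocalendar().week.astype(int)
    out["dayofyear"] = date.dt.dayofyear
    out["daysinmonth"] = date.dt.days_in_month
    out["dayofweek"] = date.dt.dayofweek
    out["day"] = date.dt.day
    # Saturday (5) and Sunday (6) count as weekend
    out["weekend"] = (date.dt.dayofweek >= 5).astype(int)
    return out.drop("date", axis=1)

# Hypothetical toy frame standing in for df / df_test
toy = pd.DataFrame({"row_id": [0, 1], "date": ["2015-01-02", "2015-01-04"]})
features = add_date_features(toy)
print(features)
```

With such a helper, the same function could be applied to both `df` and `df_test`, so the two feature sets cannot drift apart.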

In [None]:
df.head()

In [None]:
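# Numeric feature columns (every numeric column except the target)
num_columns=df.select_dtypes(include="number").columns.tolist()
num_columns.remove("num_sold")
# The remaining non-numeric columns are treated as categorical
cat_cols=[col for col in df.columns if col not in num_columns]
cat_cols.remove("num_sold")
print(num_columns,cat_cols)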

### Check for correlation

In [None]:
plt.figure(figsize=(10,8))
# numeric_only=True skips the string columns (country, store, product)
sns.heatmap(df.corr(numeric_only=True), cmap="RdBu")
plt.title("Correlations Between Variables", size=15)
plt.show()
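
The heatmap gives a visual overview. To read the numbers directly, the correlation of each numeric column with the target can also be listed. The frame below is made up for illustration, and `numeric_only=True` (available in recent pandas versions) skips string columns.

```python
import pandas as pd

# Hypothetical toy frame with one string column and three numeric columns
toy = pd.DataFrame({
    "store": ["A", "B", "A", "B"],   # non-numeric, skipped by corr
    "month": [1, 2, 3, 4],
    "weekend": [1, 0, 0, 1],
    "num_sold": [10, 20, 30, 40],
})

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["num_sold"]
    .drop("num_sold")
    .sort_values(ascending=False)
)
print(corr_with_target)
```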

### Split into X and y

In [None]:
X=df.drop("num_sold",axis=1)
y=df["num_sold"]

### Encoding of categorical variables

In [None]:
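# Flag the rows so train and test can be separated again after encoding
X["for_train"]=True
df_test["for_train"]=False

# Combine train and test so both get the same dummy columns
df_all_data=pd.concat([X,df_test])

encoded_all_data=pd.get_dummies(df_all_data,columns=cat_cols)

# Split back and drop the helper flag so it is not used as a feature
X=encoded_all_data.loc[encoded_all_data["for_train"]].drop("for_train",axis=1)
X_pred=encoded_all_data.loc[~encoded_all_data["for_train"]].drop("for_train",axis=1)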
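
Concatenating before `pd.get_dummies` guarantees that the training and test frames end up with identical dummy columns. As a sketch of an alternative that keeps the frames separate, the test dummies can be aligned to the training columns with `reindex`. The frames `train_toy` and `test_toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the test frame never contains store "B"
train_toy = pd.DataFrame({"store": ["A", "B", "A"], "day": [1, 2, 3]})
test_toy = pd.DataFrame({"store": ["A", "A"], "day": [4, 5]})

train_enc = pd.get_dummies(train_toy, columns=["store"])
test_enc = pd.get_dummies(test_toy, columns=["store"])

# Add dummy columns missing from the test set (filled with 0) and
# drop any that the training set never saw
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```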

### Data normalization

In [None]:
# Fit the scaler on the training features only, then reuse it for the test features
scaler = StandardScaler()
X[num_columns]=scaler.fit_transform(X[num_columns])
X_pred[num_columns]=scaler.transform(X_pred[num_columns])

### Split into train and test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Validate XGBoost

In [None]:
model=XGBRegressor()
model.fit(X_train,y_train)
prediction=model.predict(X_test)
r_squared = r2_score(y_test,prediction)
print(f"r2_score:{r_squared}")
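
R² is convenient for a quick check. To my understanding this competition is scored with SMAPE (symmetric mean absolute percentage error), so it may be useful to compute that on the hold-out set as well. The `smape` function below is a hand-written sketch under that assumption, not a scikit-learn API, and the arrays are made-up values.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    diff = np.abs(y_true - y_pred)
    # Treat 0/0 as zero error to avoid division by zero
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100 * ratio.mean()

# Hypothetical values standing in for y_test and prediction
print(smape([100, 200], [110, 180]))
```

In the notebook this could be called as `smape(y_test, prediction)` right after the R² score is printed.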

### Create a submission file!

In [None]:
model=XGBRegressor()
model.fit(X,y)
submit_prediction=model.predict(X_pred)

# Extract the row IDs
sub=pd.DataFrame()
sub["row_id"]=df_test["row_id"]
sub["num_sold"]=submit_prediction
sub.to_csv("submission.csv",index=False)