### The following topics are covered in this Notebook:

- Downloading a real-world dataset from a Kaggle competition
- Performing feature engineering and preparing the dataset for training
- Training and interpreting a gradient boosting model using XGBoost
- Training with KFold cross validation and ensembling results
- Configuring the gradient boosting model and tuning hyperparameters

## Problem Statement

This notebook takes a practical and coding-focused approach. We'll learn gradient boosting by applying it to a real-world dataset from the [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales) competition on Kaggle:

> Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. 
>
>
> With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.
>
> View and download the data here: https://www.kaggle.com/c/rossmann-store-sales/data

In [None]:
#import kaggle 
import warnings
warnings.filterwarnings('ignore')

In [None]:
#!kaggle datasets list -s rossman-store-sales

In [None]:
#!kaggle datasets download realvinay/rossmann-store-sales

In [None]:
#!unzip rossmann-store-sales.zip

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [None]:
store_df = pd.read_csv('../input/rossmann-store-sales/store.csv')
ross_df = pd.read_csv('../input/rossmann-store-sales/train.csv', low_memory = False)
test_df = pd.read_csv('../input/rossmann-store-sales/test.csv')
submission_df = pd.read_csv('../input/rossmann-store-sales/sample_submission.csv')

In [None]:
store_df

In [None]:
ross_df

In [None]:
submission_df

Let's merge the information from `store_df` into the training data (`ross_df`) and the test data (`test_df`).

In [None]:
merged_df = ross_df.merge(store_df, how = 'left', on = 'Store')
merged_test_df = test_df.merge(store_df, how = 'left', on = 'Store')

# Checking the merged dataset :
merged_df.head()
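
A left merge keeps every row of the sales table and attaches the matching store attributes. The following is a minimal sketch on made-up toy data (the frames `sales` and `stores` and their values are illustrative, not taken from the competition files). It assumes each store ID appears at most once in the store table, which `validate="many_to_one"` checks.

```python
import pandas as pd

# Toy stand-ins for train.csv and store.csv (values are made up).
sales = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Sales": [5263, 5020, 6064, 8314],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
})

# how="left" keeps every sales row, even when no store record matches.
# validate="many_to_one" raises an error if a store ID is duplicated
# in the right-hand table, which would silently multiply rows otherwise.
merged = sales.merge(stores, how="left", on="Store", validate="many_to_one")

print(len(sales), len(merged))
print(merged)
```

Store 3 has no entry in the toy store table, so its `StoreType` ends up missing. Comparing row counts before and after the merge is a cheap sanity check.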

# Visual Analysis

## Sales :

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline 
sns.set_style('whitegrid')  # the 'seaborn-whitegrid' matplotlib style name is unavailable in recent matplotlib releases

In [None]:
plt.figure(figsize=(12,6), dpi = 80)
plt.title('Sales Distribution', fontsize=15)
# kdeplot draws the same density curve as the deprecated distplot(..., hist = False)
sns.kdeplot(merged_df['Sales'].sample(17000), color = 'seagreen')
plt.legend(['sales']);

### Notes :
- The distribution has some very large outliers.
- The data is skewed and does not follow a normal distribution.

## Store sales :

In [None]:
print(merged_df['StoreType'].unique())

In [None]:
import plotly.express as px
import plotly.graph_objects as go

fig = px.histogram(merged_df, x ='StoreType', y = 'Sales',
                   color = 'StoreType', height = 580, width = 900)

fig.update_layout(title = 'Total sales per store type',
                 xaxis_title = 'Store type',
                 yaxis_title = 'Sales',
                 font = dict(family = 'Droid Serif', size=14))
fig.show()

In [None]:
labels, values = merged_df['StoreType'], merged_df['Sales']

fig = go.Figure(data=[go.Pie(labels = labels, values = values, hole=.3)])
fig.update_layout(title = 'Total sales per store type',
                 font = dict(family = 'Droid Serif', size=12))
fig.show()

### Insights :
- Store type __'a'__ accounts for about 3.16 billion in total sales.
- Store type __'b'__: about 1.7 billion.
- Store type __'c'__: about 783.224 million.
- Store type __'d'__: about 159.2341 million.

## Feature Engineering :

In [None]:
merged_df.info()

__Note :__
- The columns from `Store` (column 0) through `Assortment`, as well as `Promo2`, have no missing values.
- The rest of the columns contain missing values, in some cases a large number of them.
- We'll deal with them in the sections that follow.

### Date

First, let's convert `Date` to a datetime column and extract different parts of the date.

In [None]:
def split_date(df):
    """Add Year, Month, Day and WeekOfYear columns derived from Date (modifies df in place)."""
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['WeekOfYear'] = df.Date.dt.isocalendar().week

In [None]:
# Converted data:
split_date(merged_df)
split_date(merged_test_df)

merged_df.head()
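
One detail worth keeping in mind is that `isocalendar().week` returns the ISO 8601 week number, which does not always agree with the calendar year. The sketch below uses two hand-picked dates (the names `dates` and `parts` are illustrative) to show this.

```python
import pandas as pd

# Two hand-picked dates: a regular summer day and a late-December Monday.
dates = pd.to_datetime(pd.Series(["2015-07-31", "2014-12-29"]))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    # ISO week number; cast to a plain int for easier inspection
    "WeekOfYear": dates.dt.isocalendar().week.astype(int),
})

print(parts)
```

The second date falls in December 2014 but belongs to ISO week 1 (of ISO year 2015), so `Month` is 12 while `WeekOfYear` is 1. Tree-based models can usually cope with this, but it can be surprising when inspecting the features.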

### Store Open/Closed

Next, notice that the sales are zero whenever the store is closed.

In [None]:
merged_df[merged_df.Open == 0].Sales.value_counts()

In [None]:
# You can verify that :
merged_df[merged_df.Open == 0].Sales

__Important :__
- Instead of trying to model this relationship, it would be better to hard-code it in our predictions, and remove the rows where the store is closed. 
- We won't remove any rows from the test set, since we need to make predictions for every row.

In [None]:
merged_df = merged_df[merged_df.Open == 1].copy()

### Competition

Next, we can use the `CompetitionOpenSince[Month/Year]` columns from `store_df` to compute the number of months for which a competitor has been open near the store.

In [None]:
def comp_months(df):
    df['CompetitionOpen'] = 12 * (df.Year - df.CompetitionOpenSinceYear) + (df.Month - df.CompetitionOpenSinceMonth)
    df['CompetitionOpen'] = df['CompetitionOpen'].map(lambda x: 0 if x < 0 else x).fillna(0)

In [None]:
comp_months(merged_df)
comp_months(merged_test_df)

In [None]:
merged_df

- Let's view the results of the new columns we've created.

In [None]:
merged_df[['Date','CompetitionDistance','CompetitionOpenSinceYear','CompetitionOpenSinceMonth',
           'CompetitionOpen']].sample(20).sort_values('Date')
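
To make the arithmetic concrete, here is a minimal single-row sketch of the same calculation. The helper `months_competition_open` is hypothetical (it is not part of the notebook) and assumes that a missing opening date is represented as `NaN`.

```python
import math

def months_competition_open(year, month, since_year, since_month):
    """Months a competitor has been open; 0 if unknown or not yet open."""
    # Missing opening date: treat as "no known competition history".
    if math.isnan(since_year) or math.isnan(since_month):
        return 0
    months = 12 * (year - since_year) + (month - since_month)
    # A negative value means the competitor opens in the future.
    return max(months, 0)

# Sale in July 2015, competitor open since September 2008
print(months_competition_open(2015, 7, 2008, 9))
# Sale in January 2013, competitor opens in April 2014
print(months_competition_open(2013, 1, 2014, 4))
# Opening date unknown
print(months_competition_open(2015, 7, float("nan"), float("nan")))
```

The first call gives 12 * 7 + (7 - 9) = 82 months. The other two cases are mapped to 0, mirroring the `map` and `fillna(0)` steps in `comp_months`.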

### Additional Promotion

We can also add some additional columns to indicate how long a store has been running `Promo2` and whether a new round of `Promo2` starts in the current month.

In [None]:
def check_promo_month(row):
    month2str = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',              
                 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
    try:
        months = (row['PromoInterval'] or '').split(',')
        if row['Promo2Open'] and month2str[row['Month']] in months:
            return 1
        else:
            return 0
    except Exception:
        return 0

    
def promo_cols(df):
    # Months since Promo2 was open
    df['Promo2Open'] = 12 * (df.Year - df.Promo2SinceYear) +  (df.WeekOfYear - df.Promo2SinceWeek)*7/30.5
    df['Promo2Open'] = df['Promo2Open'].map(lambda x: 0 if x < 0 else x).fillna(0) * df['Promo2']
    
    # Whether a new round of promotions was started in the current month
    df['IsPromo2Month'] = df.apply(check_promo_month, axis=1) * df['Promo2']

In [None]:
promo_cols(merged_df)
promo_cols(merged_test_df)

In [None]:
merged_df[['Date','Promo2','Promo2SinceYear','Promo2SinceWeek','PromoInterval','Promo2Open',
           'IsPromo2Month']].sample(20).sort_values('Date')

- The features related to competition and promotion are now much more useful.
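
The month check can be illustrated on its own with a small pure-Python sketch. The helper `is_promo2_month` is hypothetical and only mirrors the logic of `check_promo_month`. It assumes `PromoInterval` is a comma-separated string such as `"Mar,Jun,Sept,Dec"` or a missing value, and it reuses the notebook's `"Sept"` abbreviation.

```python
# Same month abbreviations as the notebook's month2str mapping.
MONTH_ABBR = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
              7: "Jul", 8: "Aug", 9: "Sept", 10: "Oct", 11: "Nov", 12: "Dec"}

def is_promo2_month(month, promo_interval, promo2_active):
    """Return 1 if a new Promo2 round starts in `month`, else 0."""
    # No active Promo2, or a missing interval (NaN is a float, not a str).
    if not promo2_active or not isinstance(promo_interval, str):
        return 0
    return int(MONTH_ABBR[month] in promo_interval.split(","))

interval = "Mar,Jun,Sept,Dec"
print(is_promo2_month(9, interval, True))          # September is listed
print(is_promo2_month(8, interval, True))          # August is not
print(is_promo2_month(3, float("nan"), True))      # interval missing
print(is_promo2_month(3, interval, False))         # Promo2 not active
```

Checking the type explicitly avoids relying on a broad `except` clause to handle missing intervals, which is a design choice rather than a requirement.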


### Input and Target Columns

Let's select the columns that we'll use for training.

In [None]:
print(list(merged_df.columns))

In [None]:
input_cols = ['Store','DayOfWeek','Promo','StateHoliday','SchoolHoliday', 
              'StoreType', 'Assortment', 'CompetitionDistance','CompetitionOpen', 
              'Day','Month','Year','WeekOfYear','Promo2', 
              'Promo2Open','IsPromo2Month']
target_col = 'Sales'

In [None]:
# inputs & target :
inputs = merged_df[input_cols].copy()
targets = merged_df[target_col].copy()

# test inputs : 
test_inputs = merged_test_df[input_cols].copy()

- Let's also identify numeric and categorical columns. Note that we can treat binary categorical columns (0/1) as numeric columns.

In [None]:
numeric_cols = ['Store','Promo','SchoolHoliday', 
              'CompetitionDistance','CompetitionOpen','Promo2','Promo2Open','IsPromo2Month',
              'Day','Month','Year','WeekOfYear',  ]
categorical_cols = ['DayOfWeek','StateHoliday','StoreType','Assortment']

In [None]:
inputs[numeric_cols].isnull().sum().sort_values(ascending=False)

- It seems that `CompetitionDistance` is the only column with missing values, and we can simply fill them with the highest value (to indicate that the competition is very far away).

In [None]:
max_distance = inputs.CompetitionDistance.max()

In [None]:
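# Assign the result back: calling fillna(..., inplace = True) on a selected column
# may not modify the DataFrame in recent pandas versions.
inputs['CompetitionDistance'] = inputs['CompetitionDistance'].fillna(max_distance)
test_inputs['CompetitionDistance'] = test_inputs['CompetitionDistance'].fillna(max_distance)

# Final check for missing values
inputs.isnull().sum()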

# Data Preprocessing :


### Scale Numeric Values & Encode Categorical columns

Let's scale the numeric values to the 0 to 1 range and then one-hot encode the categorical columns.

In [None]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Scaling numeric values to the 0 to 1 range :
scaler = MinMaxScaler().fit(inputs[numeric_cols])

inputs[numeric_cols] = scaler.transform(inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

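# Encoding categorical columns
# (scikit-learn >= 1.2; on older versions use sparse=False and get_feature_names instead):
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))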

inputs[encoded_cols] = encoder.transform(inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

- Finally, let's extract out all the numeric data for training.

In [None]:
X = inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
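
To see what these two transformations compute, here is a small by-hand sketch in NumPy. It is not scikit-learn's implementation, and the arrays `train_dist` and `test_dist` as well as the helper `one_hot` are made up for illustration. The key point is that the minimum, maximum and category list are learned from the training data only and then reused on the test data.

```python
import numpy as np

# Min-max scaling: statistics come from the training data only.
train_dist = np.array([100.0, 600.0, 1100.0])
test_dist = np.array([350.0, 2100.0])

lo, hi = train_dist.min(), train_dist.max()
train_scaled = (train_dist - lo) / (hi - lo)
test_scaled = (test_dist - lo) / (hi - lo)   # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)

# One-hot encoding: one 0/1 column per category seen during fitting.
categories = ["a", "b", "c"]

def one_hot(value):
    # An unseen category yields all zeros, like handle_unknown='ignore'.
    return [1.0 if value == c else 0.0 for c in categories]

print(one_hot("b"))
print(one_hot("d"))
```

A test value larger than the training maximum is scaled to a number above 1, since `MinMaxScaler` does not clip by default. Tree-based models are largely insensitive to this kind of monotonic rescaling, so scaling matters more for the linear booster than for the default tree booster.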

# Gradient Boosting : XGBoost

In [None]:
from xgboost import XGBRegressor
model = XGBRegressor(n_jobs = -1, n_estimators = 20, random_state = 42, max_depth = 4)

In [None]:
#?model

In [None]:
%%time
model.fit(X, targets)

### Prediction

We can now make predictions and evaluate the model using `model.predict`.

In [None]:
%%time
preds = model.predict(X)
print(preds)

### Evaluation

Let's evaluate the predictions using the root mean squared error (__RMSE__).

In [None]:
from sklearn.metrics import mean_squared_error

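def rmse(a, b):
    # Root mean squared error. Taking np.sqrt works across scikit-learn versions,
    # whereas the `squared = False` argument is deprecated in recent releases.
    return np.sqrt(mean_squared_error(a, b))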

In [None]:
print('RMSE:', rmse(targets, preds))

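- The RMSE is about 2377, which is not great but not terrible either. Keep in mind that this is the error on the training data, since we predicted on the same `X` the model was trained on.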
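
As a reminder of what the metric measures, the following sketch computes RMSE by hand on four made-up values (`actual` and `predicted` are illustrative) and compares it with the mean absolute error.

```python
import numpy as np

actual = np.array([5000.0, 6000.0, 7000.0, 8000.0])
predicted = np.array([5500.0, 5500.0, 7000.0, 9000.0])

errors = predicted - actual                  # [500, -500, 0, 1000]
rmse_value = np.sqrt(np.mean(errors ** 2))   # square, average, take the root
mae_value = np.mean(np.abs(errors))          # plain average of absolute errors

print(rmse_value, mae_value)
```

Because the errors are squared before averaging, the single error of 1000 pulls the RMSE (about 612) above the mean absolute error (500). This is why a few badly predicted days can dominate the score.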

### Visualization

We can visualize individual trees using `plot_tree` (note: this requires the `graphviz` library to be installed).

In [None]:
from matplotlib.pyplot import rcParams
import matplotlib.pyplot as plt
from xgboost import plot_tree
%matplotlib inline

rcParams['figure.figsize'] = 25,30

In [None]:
# Requires graphviz to be installed and available on your PATH; comment out the plot_tree cells otherwise:
plot_tree(model, rankdir = 'LR');

In [None]:
plot_tree(model, rankdir='LR', num_trees=1);

In [None]:
plot_tree(model, rankdir='LR', num_trees=19);

In [None]:
trees = model.get_booster().get_dump()
len(trees)

In [None]:
print(trees[0]);

### Feature importance

Just like decision trees and random forests, XGBoost also provides a feature importance score for each column in the input.

In [None]:
importance_df=pd.DataFrame({'features':X.columns,
                            'importance':model.feature_importances_}).sort_values('importance',ascending = False)

In [None]:
importance_df.head(10).style.highlight_max(axis=0)

In [None]:
fig = px.histogram(importance_df.head(10), 
                   x = 'importance', y = 'features', 
                   color = 'features', width = 900, height = 570)
fig.update_layout(title = 'Important features',
                 xaxis_title = 'importance',
                 yaxis_title = 'features',
                 font = dict(family = 'Droid Serif', size = 15))
fig.show()

## K Fold Cross Validation

Notice that we didn't create a validation set before training our XGBoost model. We'll use a different validation strategy this time, called K-fold cross validation ([source](https://vitalflux.com/k-fold-cross-validation-python-example/)):

![](https://vitalflux.com/wp-content/uploads/2020/08/Screenshot-2020-08-15-at-11.13.53-AM.png)

In [None]:
from sklearn.model_selection import KFold


def train_and_eval(X_train, train_targets, X_val, val_targets, **params):
    model = XGBRegressor(n_jobs = -1, random_state = 42, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    return model, train_rmse, val_rmse

In [None]:
kfold = KFold(n_splits = 5, shuffle = True)

In [None]:
models = []

for train_idxs, val_idxs in kfold.split(X) :
    X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
    model, train_rmse, val_rmse = train_and_eval(X_train, train_targets,
                                                 X_val, val_targets,
                                                 max_depth = 4, n_estimators = 20)
    models.append(model)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))
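
The index bookkeeping behind K-fold splitting can be sketched in a few lines of NumPy. This is a conceptual illustration, not scikit-learn's `KFold` implementation, and the function `kfold_indices` with its `seed` parameter is hypothetical.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then cut them into n_splits chunks.
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        # Every chunk except the i-th forms the training set for round i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in splits:
    print(sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```

Each row lands in the validation set of exactly one fold and in the training set of the other four, so every model is evaluated on data it has not seen. Passing a fixed seed (or `random_state` in scikit-learn's `KFold`) makes the folds reproducible between runs.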

- Let's also define a function to average predictions from the 5 different models.

In [None]:
def pred_avg(models, inputs) :
    return np.mean([model.predict(inputs) for model in models], axis = 0)

preds = pred_avg(models, X)
print(preds)
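
The averaging step works for any objects that expose a `predict` method. The sketch below uses a hypothetical stand-in class, `ConstantOffsetModel`, instead of trained XGBoost models, so that the result can be checked by hand.

```python
import numpy as np

class ConstantOffsetModel:
    """Hypothetical stand-in for a trained model: row sum plus a fixed offset."""
    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        return np.asarray(X).sum(axis=1) + self.offset

def average_predictions(models, X):
    # Stack one prediction vector per model, then average across models.
    return np.mean([m.predict(X) for m in models], axis=0)

toy_models = [ConstantOffsetModel(o) for o in (-3.0, 0.0, 6.0)]
toy_X = np.array([[1.0, 2.0], [10.0, 20.0]])

avg = average_predictions(toy_models, toy_X)
print(avg)
```

The three stand-in models predict `[0, 27]`, `[3, 30]` and `[9, 36]`, so the average is `[4, 31]`. Note that averaging predictions on the full training set `X`, as in the cell above, is not a validation estimate, because each fold model was trained on most of those rows.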

# Hyperparameter tuning :

In [None]:
#def test_params_kfold(n_splits, **params):
#    train_rmses, val_rmses, models = [], [], []
#    kfold = KFold(n_splits)
#    for train_idxs, val_idxs in kfold.split(X):
#        X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
#        X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
#        model, train_rmse, val_rmse = train_and_eval(X_train, train_targets, X_val, val_targets, **params)
#        models.append(model)
#        train_rmses.append(train_rmse)
#        val_rmses.append(val_rmse)
#    print('Train RMSE: {}, Validation RMSE: {}'.format(np.mean(train_rmses), np.mean(val_rmses)))
#    return models

- Since it may take a long time to perform 5-fold cross validation for each set of parameters we wish to try, we'll just pick a random 10% sample of the dataset as the validation set.

In [None]:
from sklearn.model_selection import train_test_split as tts
X_train, X_val, train_targets, val_targets = tts(X, targets, test_size=0.1)

In [None]:
def test_params(**params):
    model = XGBRegressor(n_jobs=-1, random_state=42, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

#### `n_estimators`

The number of trees to be created. More trees = greater capacity of the model.

In [None]:
test_params(n_estimators=10)
print('-'*70)
test_params(n_estimators=30)
print('-'*70)
test_params(n_estimators=100)
print('-'*70)
test_params(n_estimators=240)

#### `max_depth`

As you increase the max depth of each tree, the capacity of the tree increases and it can capture more information about the training set.

In [None]:
test_params(max_depth=2)
print('-'*70)
test_params(max_depth=5)
print('-'*70)
test_params(max_depth=10)

#### `learning_rate`

The scaling factor applied to the prediction of each tree. With a fixed number of trees, a very high learning rate (close to 1) tends to overfit, while a very low learning rate (close to 0) tends to underfit, because each tree then contributes only a small correction.

In [None]:
test_params(n_estimators=50, learning_rate=0.01)
print('-'*70)
test_params(n_estimators=50, learning_rate=0.1)
print('-'*70)
test_params(n_estimators=50, learning_rate=0.3)
print('-'*70)
test_params(n_estimators=50, learning_rate=0.9)
print('-'*70)
test_params(n_estimators=50, learning_rate=0.99)
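
The role of the learning rate is easier to see in a stripped-down version of boosting. The sketch below is plain least-squares boosting with one-split trees (stumps) on a made-up 1-D dataset. It is not XGBoost's regularized algorithm, and the helpers `fit_stump`, `boost` and `train_rmse` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

def boost(x, y, n_estimators, learning_rate):
    """Return training predictions after n_estimators boosting rounds."""
    pred = np.full_like(y, y.mean())     # start from the mean
    for _ in range(n_estimators):
        stump = fit_stump(x, y - pred)   # fit the current residuals
        pred = pred + learning_rate * stump(x)   # add a shrunken correction
    return pred

x = np.arange(20, dtype=float)
y = np.sin(x / 3.0) * 100.0              # made-up smooth target

def train_rmse(learning_rate, n_estimators=50):
    return np.sqrt(np.mean((y - boost(x, y, n_estimators, learning_rate)) ** 2))

for lr in (0.01, 0.3):
    print(lr, train_rmse(lr))
```

With the number of trees fixed at 50, the run with `learning_rate=0.01` leaves a much larger training error than the run with `learning_rate=0.3`, because each tree's correction is scaled down to 1% before being added. This is why a lower learning rate is usually paired with a larger `n_estimators`.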

#### `booster`

Instead of using Decision Trees, XGBoost can also train a linear model for each iteration. This can be configured using `booster`.

In [None]:
test_params(booster='gblinear')

- Clearly, a linear model is not well suited for this dataset.

## Putting it Together and Making Predictions

Let's train a final model on the entire training set with custom hyperparameters. 

In [None]:
model = XGBRegressor(n_jobs=-1, random_state=42, n_estimators=1000, 
                     learning_rate=0.2, max_depth=10, subsample=0.9, 
                     colsample_bytree=0.7)

In [None]:
%%time
model.fit(X, targets)

In [None]:
test_preds = model.predict(X_test)

Adding the predictions into `submission_df`.

In [None]:
submission_df.head()

In [None]:
submission_df['Sales'] = test_preds

### Important :
Recall, however, that if the store is not open, then the sales must be 0. Thus, wherever the value of `Open` in the test set is 0, we can set the sales to 0. Also, there are some missing values for `Open` in the test set. We'll replace them with 1 (open).

In [None]:
test_df.Open.isnull().sum()

In [None]:
submission_df['Sales'] = submission_df['Sales']*test_df.Open.fillna(1.)

submission_df.head(20)
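
The multiplication trick can be checked on a tiny made-up example. The series `raw_preds` and `open_flag` below are illustrative stand-ins for the model output and the `Open` column.

```python
import pandas as pd

raw_preds = pd.Series([4200.0, 5100.0, 3900.0])
open_flag = pd.Series([1.0, 0.0, float("nan")])   # open, closed, unknown

# Missing flags are treated as open (1), so the prediction is kept;
# closed stores (0) zero out the prediction.
final_preds = raw_preds * open_flag.fillna(1.0)

print(final_preds.tolist())
```

Multiplying two Series aligns them on their index, so this approach relies on both having the same row order and index. That holds here because both frames are read from CSV files with the default integer index.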

In [None]:
#submission_df.to_csv('submission.csv', index = None)