## Problem Statement

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. 


With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. 

<b>Note</b>: Some stores in the dataset were temporarily closed for refurbishment.

View and download the data here: https://www.kaggle.com/c/rossmann-store-sales/data

<b> Files</b><br>
train.csv - historical data including Sales<br>
test.csv - historical data excluding Sales<br>
sample_submission.csv - a sample submission file in the correct format<br>
store.csv - supplemental information about the stores<br>

<b>Data fields</b>
Most of the fields are self-explanatory. The following are descriptions for those that aren't.<br>

Id - an Id that represents a (Store, Date) duple within the test set<br>
Store - a unique Id for each store<br>
Sales - the turnover for any given day (this is what you are predicting)<br>
Customers - the number of customers on a given day<br>
Open - an indicator for whether the store was open: 0 = closed, 1 = open<br>
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None<br>
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools<br>
StoreType - differentiates between 4 different store models: a, b, c, d<br>
Assortment - describes an assortment level: a = basic, b = extra, c = extended<br>
CompetitionDistance - distance in meters to the nearest competitor store<br>
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened<br>
Promo - indicates whether a store is running a promo on that day<br>
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating<br>
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2<br>
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

# Import Libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from xgboost import plot_tree
from matplotlib.pylab import rcParams
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import KFold
import numpy as np
from sklearn.model_selection import train_test_split
import joblib

#### Configurations

In [None]:
pd.set_option('display.max_columns',120)
pd.set_option('display.max_rows',120)

# Import Datasets

In [None]:
ross_df = pd.read_csv('../input/rossmann-store-sales/train.csv', low_memory=False)
store_df = pd.read_csv('../input/rossmann-store-sales/store.csv')
test_df = pd.read_csv('../input/rossmann-store-sales/test.csv')
submission_df = pd.read_csv('../input/rossmann-store-sales/sample_submission.csv')

In [None]:
ross_df

In [None]:
test_df

<b>Note :</b>
The Customers column is present in the training set (ross_df) but not in the test set.<br>
Sales is the target column.

In [None]:
submission_df

In [None]:
store_df

Since store_df contains additional information about the stores, let's merge store_df into both ross_df and test_df.<br>
We use a `Left Outer Join` on the `Store` column.
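
As a small illustration of what this left join does, the sketch below merges two toy frames. The values are made up and only stand in for `ross_df` and `store_df`. The `validate='many_to_one'` argument is an optional safeguard that the notebook itself does not use: it asks pandas to raise an error if a `Store` id appears more than once on the right-hand side, which would otherwise duplicate sales rows.

```python
import pandas as pd

# Toy stand-ins for ross_df and store_df (made-up values)
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 5020, 6064, 8314]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# A left join keeps every sales row; stores missing from `stores` get NaN
merged = sales.merge(stores, how='left', on='Store', validate='many_to_one')

print(merged)
```

The row count of the result equals the row count of the left frame, which is a quick sanity check worth running after any left join against a lookup table.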

In [None]:
merged_df = ross_df.merge(store_df, how='left', on='Store')
merged_test_df = test_df.merge(store_df, how='left', on='Store')

In [None]:
merged_df

In [None]:
merged_test_df

# Preprocessing And Feature Engineering

In [None]:
merged_df.info()

In [None]:
def extract_date(data):
    data['Date'] = pd.to_datetime(data['Date'])
    data['Year'] = data.Date.dt.year
    data['Month'] = data.Date.dt.month
    data['Day'] = data.Date.dt.day
    data['WeekOfYear'] = data.Date.dt.isocalendar().week
    #data.drop('Date', axis=1, inplace=True)

In [None]:
extract_date(merged_df)
extract_date(merged_test_df)

In [None]:
merged_df

In [None]:
merged_test_df

<b>Note</b>: The dates in the training set run from 2013 to 2015, while the test set covers a later period in 2015.<br>
The `Year` counts below confirm this.
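
To make the extracted date parts concrete, here is a minimal sketch on two hand-picked dates (the dates are chosen for illustration and are not taken from the dataset). It uses the same pandas accessors as `extract_date`. One detail worth knowing: `isocalendar().week` returns the ISO week, and the last days of December can already belong to week 1 of the following ISO year, so `Year` and `WeekOfYear` do not always refer to the same year.

```python
import pandas as pd

# Two illustrative dates: a mid-year day and a year-end edge case
dates = pd.Series(pd.to_datetime(['2015-07-31', '2014-12-31']))

parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day': dates.dt.day,
    # ISO calendar week, the same accessor used in extract_date
    'WeekOfYear': dates.dt.isocalendar().week,
})

print(parts)
```

Here 2014-12-31 gets `Year` 2014 but `WeekOfYear` 1, which may matter for any feature that combines the two columns.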

In [None]:
merged_df.Year.value_counts()

In [None]:
merged_test_df.Year.value_counts()

#### `Open` Column
The `Open` column indicates whether the store was open on a given day.<br>
If a store is not open on a given day, it records no Sales.<br>
The stores are closed on 172817 days (rows).<br>

In [None]:
#merged_df[merged_df.Open == 0].Sales.value_counts()
merged_df[merged_df.Open == 0].Sales

So whenever a store is closed, its Sales for that day are zero.<br>
Let's therefore drop all rows in merged_df for which Open = 0.<br>
When predicting, we'll set Sales to 0 for the rows of merged_test_df where Open = 0.<br>

In [None]:
merged_df = merged_df[merged_df.Open == 1].copy()

### CompetitionOpenSinceMonth & CompetitionOpenSinceYear

- CompetitionOpenSinceYear : The year in which the competitor store opened.
- CompetitionOpenSinceMonth : The month in which the competitor store opened.

The duration of competition (in months) is likely to be more useful to our model than the exact date on which the competitor store opened.<br>
So, we'll use these two columns along with the Year and Month columns to derive a duration column and use it to train our model.<br>

The working assumption is that the longer a competitor store has been open, the greater its impact on the Sales of the store.

<b>Note : </b><br>
`CompetitionOpenSinceYear` and `CompetitionOpenSinceMonth` also contain NaN values. We treat these as meaning that there is no competitor store nearby, so the duration is set to zero.<br>

A store may also have no competitor on a particular date but gain one later. In such cases the difference is negative.<br>

Neither case describes competition that exists on the given date: one is a competitor that only opens in the future, the other is no competitor at all.<br>
Therefore, negative and NaN values are replaced with zero.
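
The rule above can be checked on a few made-up rows. This is only a sketch: the toy frame is hypothetical, and it uses `clip(lower=0)` as a vectorized alternative to the row-wise `apply` in the notebook's own function. The three rows cover a competitor that opened in the past, one that opens in the future, and a missing opening date.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, all observed in July 2015
toy = pd.DataFrame({
    'Year': [2015, 2015, 2015],
    'Month': [7, 7, 7],
    'CompetitionOpenSinceYear': [2014, 2016, np.nan],
    'CompetitionOpenSinceMonth': [9, 1, np.nan],
})

# Months between the competitor opening and the observation date
months_open = (12 * (toy.Year - toy.CompetitionOpenSinceYear)
               + (toy.Month - toy.CompetitionOpenSinceMonth))

# Negative durations (future competitor) and NaN (no known date) both become 0
toy['CompetitionOpen'] = months_open.clip(lower=0).fillna(0)

print(toy)
```

The first row gives 12 * 1 + (7 - 9) = 10 months, while the other two rows end up at 0.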

In [None]:
merged_df[['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Year', 'Month']].info()

In [None]:
def calc_duration_competition(data):
    data['CompetitionOpen'] = 12 * (data.Year - data.CompetitionOpenSinceYear) + (data.Month - data.CompetitionOpenSinceMonth)
    data['CompetitionOpen'] = data['CompetitionOpen'].apply(lambda x:0 if x<0 else x).fillna(0)

In [None]:
calc_duration_competition(merged_df)
calc_duration_competition(merged_test_df)

In [None]:
merged_df[['CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Year', 'Month']].info()

In [None]:
merged_df[['Date', 'CompetitionDistance', 'CompetitionOpenSinceYear', 'CompetitionOpenSinceMonth', 'CompetitionOpen']].sample(20)

#### Promotion of Store
Promotions run by a store tend to increase its sales.<br>

Promo - indicates whether a store is running a promo on that day<br>
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating<br>
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2<br>
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

We can also add some additional columns to indicate how long (Number of months) a store has been running `Promo2` and whether a new round of `Promo2` starts in the current month.<br>

<b>Note : </b>Here too, negative and NaN values are replaced with zero.
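
As a worked example of these two derived columns, the sketch below computes them for single values in plain Python. The helper names `promo2_open_months` and `is_promo2_month` are hypothetical and exist only for this illustration. The arithmetic mirrors the description above: a year difference converted to months, plus a week difference converted at 7 / 30.5 months per week.

```python
def promo2_open_months(year, week_of_year, promo2, since_year, since_week):
    """Months since the store joined Promo2 (0 if not participating or not yet started)."""
    if not promo2 or since_year is None or since_week is None:
        return 0.0
    months = 12 * (year - since_year) + (week_of_year - since_week) * 7 / 30.5
    return max(months, 0.0)  # a start date in the future counts as 0


def is_promo2_month(month_abbr, promo_interval, promo2_open):
    """1 if a new Promo2 round starts in the given month, else 0."""
    if not promo2_open or not promo_interval:
        return 0
    return int(month_abbr in promo_interval.split(','))


# Store joined Promo2 in week 10 of 2014; the row is from week 31 of 2015
open_months = promo2_open_months(2015, 31, 1, 2014, 10)
print(round(open_months, 2))

print(is_promo2_month('Aug', 'Feb,May,Aug,Nov', open_months))  # round starts in August
print(is_promo2_month('Jul', 'Feb,May,Aug,Nov', open_months))  # no new round in July
```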

In [None]:
def check_promo_month(row):
    # Return 1 if a new round of Promo2 starts in the row's month, else 0
    # Note: PromoInterval spells September as 'Sept'
    month2str = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
                 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
    try:
        months = (row['PromoInterval'] or '').split(',')
        if row['Promo2Open'] and month2str[row['Month']] in months:
            return 1
        else:
            return 0
    except Exception:
        # A missing PromoInterval is NaN (a float), so .split() fails: treat it as no promo
        return 0

def promo_cols(data):
    # Duration of Promo2 in months (week difference converted at 7/30.5 months per week)
    data['Promo2Open'] = 12 * (data.Year - data.Promo2SinceYear) + (data.WeekOfYear - data.Promo2SinceWeek)*7/30.5
    # Negative and NaN durations become 0; multiplying by Promo2 zeroes non-participating stores
    data['Promo2Open'] = data['Promo2Open'].apply(lambda x: 0 if x < 0 else x).fillna(0) * data['Promo2']
    # Whether a new round of Promo2 starts in the current month
    data['IsPromo2Month'] = data.apply(check_promo_month, axis=1) * data['Promo2']

In [None]:
%%time
promo_cols(merged_df)
promo_cols(merged_test_df)

In [None]:
merged_df[['Date', 'Promo2', 'Promo2SinceYear', 'Promo2SinceWeek', 'PromoInterval', 'Promo2Open', 'IsPromo2Month']].sample(20)

# Input & Target Columns

In [None]:
merged_df.columns

In [None]:
input_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 
              'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpen', 
              'Day', 'Month', 'Year', 'WeekOfYear',  'Promo2', 
              'Promo2Open', 'IsPromo2Month']
target_col = 'Sales'

- Date is dropped because all relevant information has been extracted from it
- Sales is the target column
- Customers is not present in the test data, so it is dropped (an alternative is to build a model that predicts the number of customers and use its predictions for the test data)
- Open: Sales = 0 when Open = 0, so there is no need to feed it into our model
- 'CompetitionOpenSinceMonth' and 'CompetitionOpenSinceYear' are used to derive 'CompetitionOpen'
- 'Promo2SinceYear', 'Promo2SinceWeek' and 'PromoInterval' are used to derive 'Promo2Open' and 'IsPromo2Month'

In [None]:
inputs = merged_df[input_cols].copy()
targets = merged_df[target_col].copy()
test_inputs = merged_test_df[input_cols].copy()

#### Numerical & Categorical Columns

In [None]:
numeric_cols = ['Store', 'Promo', 'SchoolHoliday', 
              'CompetitionDistance', 'CompetitionOpen', 'Promo2', 'Promo2Open', 'IsPromo2Month',
              'Day', 'Month', 'Year', 'WeekOfYear',  ]
categorical_cols = ['DayOfWeek', 'StateHoliday', 'StoreType', 'Assortment']

# Impute Missing Numerical Data

In [None]:
inputs[numeric_cols].isna().sum()

In [None]:
test_inputs[numeric_cols].isna().sum()

Only the CompetitionDistance numerical column has null values.<br>
We read a null CompetitionDistance as meaning that there is no competitor store nearby, so it should be imputed with a large constant rather than zero (a larger distance means less competition).<br>

We'll impute with the maximum value.

In [None]:
max_distance = inputs.CompetitionDistance.max()
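# Assign the result back: inplace fillna on a selected column may not update the frame in newer pandas
inputs['CompetitionDistance'] = inputs['CompetitionDistance'].fillna(max_distance)
test_inputs['CompetitionDistance'] = test_inputs['CompetitionDistance'].fillna(max_distance)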

# Scaling Numerical Values

In [None]:
scaler = MinMaxScaler().fit(inputs[numeric_cols])
inputs[numeric_cols] = scaler.transform(inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

### Encode Categorical Columns

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

Let's one-hot encode categorical columns.
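
To show what the encoded output looks like, here is a small sketch on a made-up `StoreType` column. It uses `pd.get_dummies` purely as a quick stand-in for illustration. The notebook itself uses scikit-learn's `OneHotEncoder`, which has the advantage that the fitted encoder can be reused on the test set.

```python
import pandas as pd

# Made-up categorical column with three distinct values
toy = pd.DataFrame({'StoreType': ['a', 'c', 'a', 'd']})

# One column per category; each row has a single 1.0 marking its category
one_hot = pd.get_dummies(toy['StoreType'], prefix='StoreType', dtype=float)

print(one_hot)
```

Each row sums to 1, and the number of new columns equals the number of distinct categories seen in the data.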

In [None]:
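# scikit-learn >= 1.2: `sparse_output` replaces `sparse`, and `get_feature_names_out` replaces `get_feature_names`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))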
inputs[encoded_cols] = encoder.transform(inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

In [None]:
X = inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

# Gradient Boosting

In [None]:
model = XGBRegressor(random_state=42, n_jobs=-1, n_estimators=20, max_depth=4)

In [None]:
%%time
model.fit(X, targets)

In [None]:
preds = model.predict(X)

In [None]:
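def rmse(a, b):
    # Take the square root explicitly; the `squared` argument is no longer available in recent scikit-learn releases
    return np.sqrt(mean_squared_error(a, b))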

In [None]:
rmse(preds, targets)

# Visualization

In [None]:
trees = model.get_booster().get_dump()
len(trees)

In [None]:
print(trees[0])

# Feature importance

Just like decision trees and random forests, XGBoost also provides a feature importance score for each column in the input.

In [None]:
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
importance_df.head(10)

In [None]:
plt.figure(figsize=(10,6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

# K Fold Cross Validation

Notice that we didn't create a validation set before training our XGBoost model. We'll use a different validation strategy this time, called K-fold cross validation ([source](https://vitalflux.com/k-fold-cross-validation-python-example/)):

![](https://vitalflux.com/wp-content/uploads/2020/08/Screenshot-2020-08-15-at-11.13.53-AM.png)
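
The splitting scheme in the figure can be sketched with plain NumPy. The helper `kfold_indices` below is hypothetical and only illustrates the idea for contiguous, unshuffled folds. In the notebook the actual splitting is done by scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, n_splits):
    """Yield (train_idxs, val_idxs) pairs for contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for i, val_idxs in enumerate(folds):
        # Every fold except the i-th one is used for training
        train_idxs = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idxs, val_idxs

splits = list(kfold_indices(n_samples=10, n_splits=5))

for train_idxs, val_idxs in splits:
    print('train:', train_idxs, 'val:', val_idxs)
```

Each row index lands in exactly one validation fold, so every sample is used for validation once and for training in the remaining rounds.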

In [None]:
def train_and_evaluate(X_train, train_targets, X_val, val_targets, **params):
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    return model, train_rmse, val_rmse

In [None]:
kfold = KFold(n_splits=5)

In [None]:
models = []

for train_idxs, val_idxs in kfold.split(X):
    X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
    model, train_rmse, val_rmse = train_and_evaluate(X_train, 
                                                     train_targets, 
                                                     X_val, 
                                                     val_targets, 
                                                     max_depth=4, 
                                                     n_estimators=20)
    models.append(model)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

In [None]:
def predict_avg(models, inputs):
    return np.mean([model.predict(inputs) for model in models], axis=0)

In [None]:
preds = predict_avg(models, X_train)
preds

# Hyperparameter Tuning and Regularization

Just like other machine learning models, XGBoost has several hyperparameters we can adjust to control the capacity of the model and reduce overfitting.

<img src="https://i.imgur.com/EJCrSZw.png" width="480">

Check out the following resources to learn more about the hyperparameters supported by XGBoost:

- https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor
- https://xgboost.readthedocs.io/en/latest/parameter.html

In [None]:
def test_params_kfold(n_splits, **params):
    train_rmses, val_rmses, models = [], [], []
    kfold = KFold(n_splits)
    for train_idxs, val_idxs in kfold.split(X):
        X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
        X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
        model, train_rmse, val_rmse = train_and_evaluate(X_train, train_targets, X_val, val_targets, **params)
        models.append(model)
        train_rmses.append(train_rmse)
        val_rmses.append(val_rmse)
    print('Train RMSE: {}, Validation RMSE: {}'.format(np.mean(train_rmses), np.mean(val_rmses)))
    return models

In [None]:
X_train, X_val, train_targets, val_targets = train_test_split(X, targets, test_size=0.1)

In [None]:
def test_params(**params):
    model = XGBRegressor(n_jobs=-1, random_state=42, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

In [None]:
test_params(n_estimators=10)

#### max_depth

In [None]:
test_params(max_depth=2)

In [None]:
test_params(max_depth=5)

#### learning_rate

The scaling factor applied to the prediction of each tree. A very high learning rate (close to 1) tends to lead to overfitting, while a low learning rate (close to 0) tends to lead to underfitting unless the number of trees is increased to compensate.
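
A toy calculation can show why a low learning rate needs more boosting rounds. The sketch below is a deliberate simplification and not how XGBoost works internally: it assumes each round predicts the current residual perfectly, and the prediction is updated by `learning_rate` times that residual. The function name `remaining_fraction` is hypothetical.

```python
def remaining_fraction(learning_rate, n_rounds, target=100.0):
    """Fraction of the target still unexplained after n_rounds of shrunken updates."""
    prediction = 0.0
    for _ in range(n_rounds):
        residual = target - prediction
        # Each round only moves a fraction of the way towards the target
        prediction += learning_rate * residual
    return (target - prediction) / target

for lr in (0.01, 0.1):
    print(lr, remaining_fraction(lr, n_rounds=50))
```

Under this simplification the unexplained fraction after `n` rounds is `(1 - learning_rate) ** n`. With 50 rounds, a rate of 0.1 leaves well under 1% of the target unexplained, while a rate of 0.01 still leaves about 60%.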

In [None]:
test_params(n_estimators=50, learning_rate=0.01)

In [None]:
test_params(n_estimators=50, learning_rate=0.1)

#### booster

Instead of using Decision Trees, XGBoost can also train a linear model for each iteration. This can be configured using `booster`.

In [None]:
test_params(booster='gblinear')

# Train with best parameters

In [None]:
model = XGBRegressor(n_jobs=-1, random_state=42, n_estimators=1000, 
                     learning_rate=0.2, max_depth=10, subsample=0.9, 
                     colsample_bytree=0.7)

In [None]:
%%time
model.fit(X, targets)

# Predict

In [None]:
test_preds = model.predict(X_test)

In [None]:
submission_df['Sales']  = test_preds

Recall, however, that if the store is not open, then the sales must be 0. Thus, wherever the value of `Open` in the test set is 0, we can set the sales to 0. Also, there are some missing values for `Open` in the test set. We'll replace them with 1 (open).
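
The effect of this rule is easy to verify on a few made-up values. In the sketch below, `preds` and `open_flag` are hypothetical stand-ins for the model predictions and the `Open` column of the test set.

```python
import numpy as np
import pandas as pd

# Made-up predictions and Open flags: open, closed, and missing
preds = pd.Series([5100.0, 4800.0, 6200.0])
open_flag = pd.Series([1.0, 0.0, np.nan])

# Missing Open values are assumed to mean the store is open
final = preds * open_flag.fillna(1.0)

print(final.tolist())
```

Multiplying by the flag keeps the prediction for open stores and forces it to zero for closed ones, without any explicit branching.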

In [None]:
test_df.Open.isna().sum()

#### Preparing submission.csv

In [None]:
submission_df['Sales'] = submission_df['Sales'] * test_df.Open.fillna(1.)

In [None]:
submission_df

In [None]:
submission_df.to_csv('submission.csv', index=False)

# Single Input Prediction

In [None]:
sample_input={
    'Store':2,
    'DayOfWeek':4,
    'Promo' :1,
    'Date':'2015-09-30',
    'Open':1,
    'StateHoliday':'a',
    'SchoolHoliday':0
}
input_df = pd.DataFrame([sample_input])
input_df

In [None]:
input_merged_df = input_df.merge(store_df, on='Store')
input_merged_df

# Saving & Loading Models

In [None]:
drug_store = {
    'model': model,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

In [None]:
joblib.dump(drug_store, 'drug_store.joblib')

In [None]:
drug_store = joblib.load('drug_store.joblib')

#### Feature Engineering

In [None]:
extract_date(input_merged_df)
calc_duration_competition(input_merged_df)
promo_cols(input_merged_df)
input_merged_df

#### Preprocessing

In [None]:
input_merged_df[numeric_cols] = scaler.transform(input_merged_df[numeric_cols])
input_merged_df[encoded_cols] = encoder.transform(input_merged_df[categorical_cols])

In [None]:
X_input = input_merged_df[numeric_cols+encoded_cols]
model.predict(X_input)[0]