#### In this notebook I take the data for each store separately and make predictions with XGBoost. This is less taxing on the processor, and the score does not drop too much.

Note: I have not tried to integrate historical average weather data for Europe. If we use it, there is a good chance of improving the results. Predicting with autoregression is another option.

## Importing data
So let's get started by importing the required libraries and the input data from the competition.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold

In [None]:
train = pd.read_csv(r'../input/rossmann-store-sales/train.csv', parse_dates=['Date'], low_memory=False)
train.head()

In [None]:
test = pd.read_csv(r'../input/rossmann-store-sales/test.csv', parse_dates=['Date'], low_memory=False, index_col='Id')
test.head()

In [None]:
store = pd.read_csv(r'../input/rossmann-store-sales/store.csv', index_col='Store')
store.head()

## Data Integration and Cleaning

Let's consider the features one by one:
- First, the 'CompetitionDistance' column in the store data has 3 null values, so imputing them with the mean will do.
- Next, the Promo2 columns: later we will need to compare the Promo2 start with each date to work out whether the promo is active on that date. For now, we just clear the null values.
- The same goes for the competition columns: we will need to check when the competitor shop opened. I have created a 'CompetitionOpen' column to record whether we have an opening date for the competitor.
- Also, let's label encode the PromoInterval and Assortment columns, as they are categorical.

In [None]:
mean_dist = store['CompetitionDistance'].mean()
store.loc[store['CompetitionDistance'].isnull(), 'CompetitionDistance'] = mean_dist

store.loc[store['Promo2'] == 0, ['Promo2SinceWeek','Promo2SinceYear']] = 0
store['Promo2SinceWeek'] = store['Promo2SinceWeek'].astype('int')
store['Promo2SinceYear'] = store['Promo2SinceYear'].astype('int')

store['CompetitionOpen'] = 1
store.loc[store['CompetitionOpenSinceMonth'].isnull(),['CompetitionOpen', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear']] = 0
store['CompetitionOpenSinceMonth'] = store['CompetitionOpenSinceMonth'].astype('int')
store['CompetitionOpenSinceYear'] = store['CompetitionOpenSinceYear'].astype('int')

label_map = {'PromoInterval' : {'Jan,Apr,Jul,Oct' : 1,
                                'Feb,May,Aug,Nov' : 2,
                                'Mar,Jun,Sept,Dec' : 3,
                                np.nan : 0},
             'Assortment' : {'a':0,
                             'b':1,
                             'c':2}}
store.replace(label_map, inplace=True)
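
To make the cleaning steps above concrete, here is a minimal sketch on a made-up three-store table. The frame `toy_store` and its values are hypothetical, and it uses `fillna` and `map` rather than the `loc` and `replace` calls in the cell above, so treat it as an illustration of the idea rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical three-store table; the values are made up for illustration.
toy_store = pd.DataFrame(
    {
        "CompetitionDistance": [100.0, np.nan, 300.0],
        "CompetitionOpenSinceMonth": [9.0, np.nan, 4.0],
        "CompetitionOpenSinceYear": [2008.0, np.nan, 2012.0],
        "PromoInterval": ["Jan,Apr,Jul,Oct", np.nan, "Mar,Jun,Sept,Dec"],
    },
    index=pd.Index([1, 2, 3], name="Store"),
)

# Mean imputation: the missing distance becomes (100 + 300) / 2.
toy_store["CompetitionDistance"] = toy_store["CompetitionDistance"].fillna(
    toy_store["CompetitionDistance"].mean()
)

# Flag whether an opening date is known, then zero-fill the missing date parts.
toy_store["CompetitionOpen"] = toy_store["CompetitionOpenSinceMonth"].notna().astype(int)
date_cols = ["CompetitionOpenSinceMonth", "CompetitionOpenSinceYear"]
toy_store[date_cols] = toy_store[date_cols].fillna(0).astype(int)

# Encode the renewal schedule; a missing schedule maps to 0.
interval_codes = {"Jan,Apr,Jul,Oct": 1, "Feb,May,Aug,Nov": 2, "Mar,Jun,Sept,Dec": 3}
toy_store["PromoInterval"] = (
    toy_store["PromoInterval"].map(interval_codes).fillna(0).astype(int)
)

print(toy_store)
```

Store 2 has no competitor date and no promo schedule, so it ends up with `CompetitionOpen` and `PromoInterval` both equal to 0.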

- When I looked at the data, the type of holiday (a/b/c) did not seem to change how much Sales drop; what matters is that on any StateHoliday there is a chance Sales will drop. So I encode all holiday types a/b/c as 1.
- Also, we need to split the Date column into year, month and day for it to be a bit more useful to the model.
- I am not dropping the Date column for now because it will be useful later for the Promo2 and competition calculations.

In [None]:
train.loc[train['StateHoliday'] != '0', 'StateHoliday'] = '1'
train['StateHoliday'] = train['StateHoliday'].astype('int')

test.loc[test['StateHoliday'] != '0', 'StateHoliday'] = '1'
test['StateHoliday'] = test['StateHoliday'].astype('int')

train['year'] = train['Date'].dt.year
train['month'] = train['Date'].dt.month
train['day'] = train['Date'].dt.day

test['year'] = test['Date'].dt.year
test['month'] = test['Date'].dt.month
test['day'] = test['Date'].dt.day

test.loc[test['Open'].isnull(), 'Open'] = 1
test['Open'] = test['Open'].astype('int')

train.drop('Customers', axis=1, inplace=True)
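
As a small illustration of the holiday flag and the date split described above, the sketch below runs the same two ideas on a made-up frame. The frame `toy` and its dates are hypothetical; the `astype(str)` call is an assumption added so the comparison also works if the column mixes the integer 0 with the string '0'.

```python
import pandas as pd

# Hypothetical rows; not taken from the competition data.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-01", "2015-04-03", "2015-07-15"]),
        "StateHoliday": ["a", "b", "0"],
    }
)

# Any holiday type (a, b or c) becomes 1; a regular day stays 0.
toy["StateHoliday"] = (toy["StateHoliday"].astype(str) != "0").astype(int)

# Split the date into numeric parts a tree model can split on.
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day

print(toy)
```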

#### Defining our basic model
- First I ran a grid search on the first 30 stores with a very large parameter set and looked at the best models.
- As a result I found the following:
    - The best learning rate varied between 0.1 and 0.4, most often 0.2. So I keep it as a hyperparameter with values [0.1,0.2,0.35].
    - n_estimators was always best at 100.
    - max_depth was best at 4; only once did max_depth=3 give the best result.

In [None]:
kfold = KFold(n_splits=5, random_state=2021, shuffle=True)
parameters = {'learning_rate' : [0.1,0.2,0.35]}
# use_label_encoder only applies to XGBClassifier, so it is not passed here
clf = XGBRegressor(random_state=2021, n_estimators=100, max_depth=4)

Here I simply create an empty dataframe to which each store's predictions will be added.

In [None]:
submit_frame = pd.DataFrame(columns=['Id','Sales'])

- These functions take a date and return whether Promo2 and the competitor are active on that date.
- For Promo2 we know the months in which the promo is renewed, so I have added a column for the age of the promo in months. It is 0 if Promo2 is inactive, or a number from 1 to 3 if Promo2 is active.
- For example, if a store's Promo2 is renewed each Jan, Apr, Jul and Oct, the age is 1 for a date in Jan, 2 in Feb, 3 in Mar, 1 again in Apr, and so on.

In [None]:
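def calc_promo(dt):
    # Uses the current store's row from the global st_t set in the prediction loop.
    # Promo2 counts as active once the ISO year/week reaches the promo start.
    if (dt.isocalendar()[0] * 100 + dt.isocalendar()[1]) >= (st_t['Promo2SinceYear'] * 100 + st_t['Promo2SinceWeek']):
        # PromoInterval codes 1/2/3 mean the cycle starts in Jan/Feb/Mar,
        # so this gives 1 in a renewal month, then 2, then 3.
        mt = (dt.month - st_t['PromoInterval']) % 3 + 1
        return (1,mt)
    else:
        return (0,0)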

def calc_comp(dt):
    # Compare year*12 + month of the date with the competitor's opening year and month.
    if (dt.year * 12 + dt.month) >= (st_t['CompetitionOpenSinceYear'] * 12 + st_t['CompetitionOpenSinceMonth']):
        return 1
    else:
        return 0
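
The month arithmetic described in the bullets above can be checked in isolation. The sketch below uses two hypothetical helpers, `promo_age` and `competition_active`, which take plain numbers instead of reading the store row. They assume that the PromoInterval codes 1, 2 and 3 stand for cycles starting in Jan, Feb and Mar, as in `label_map`.

```python
import datetime


def promo_age(month: int, interval_code: int) -> int:
    """Return 1 in a renewal month, 2 one month later, 3 two months later.

    interval_code is the first renewal month of the cycle (1=Jan, 2=Feb, 3=Mar).
    """
    return (month - interval_code) % 3 + 1


def competition_active(dt: datetime.date, since_year: int, since_month: int) -> int:
    """Return 1 if dt falls in or after the competitor's opening month, else 0."""
    return int(dt.year * 12 + dt.month >= since_year * 12 + since_month)


# A store renewed in Jan, Apr, Jul and Oct (code 1): ages for Jan through Apr.
print([promo_age(m, 1) for m in (1, 2, 3, 4)])

# A competitor that opened in April 2015 is not yet active in March 2015.
print(competition_active(datetime.date(2015, 3, 1), 2015, 4))
```

Python's `%` never returns a negative result for a positive divisor, so months before the first renewal month of the year (for example January with code 2) still map into the range 1 to 3.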

## Making Predictions on each Store
- Now for the part that takes the longest to execute.
- We iterate over the stores one by one, train a model on each store's data and predict for that store.

In [None]:
for st_no in range(1,1116):
    train_t = train.loc[train['Store'] == st_no].copy()
    train_t.drop('Store', axis=1, inplace=True)

    test_t = test.loc[test['Store'] == st_no].copy()
    test_t.drop('Store',axis=1,inplace=True)

    st_t = store.loc[store.index==st_no].iloc[0,:]
    
    if test_t.shape[0] > 0:
        train_t['Promo2'] = 0
        train_t['NewPromoAge'] = 0
        train_t['Competition'] = 0

        test_t['Promo2'] = 0
        test_t['NewPromoAge'] = 0
        test_t['Competition'] = 0

        if st_t['Promo2'] == 1:
            train_t['Promo2'], train_t['NewPromoAge'] = zip(*train_t['Date'].map(calc_promo))
            test_t['Promo2'], test_t['NewPromoAge'] = zip(*test_t['Date'].map(calc_promo))

        if st_t['CompetitionOpen']:
            train_t['Competition'] = train_t['Date'].map(calc_comp)
            test_t['Competition'] = test_t['Date'].map(calc_comp)

        train_t.drop('Date', axis=1, inplace=True)
        test_t.drop('Date', axis=1, inplace=True)
        
        X_train_t = train_t.drop('Sales',axis=1)
        y_train_t = train_t['Sales']
        
        cv = GridSearchCV(clf,
                          param_grid=parameters,
                          cv=kfold,
                          scoring='neg_mean_squared_error')
        cv.fit(X_train_t, y_train_t)
        y_pred_t = cv.predict(test_t)
        y_pred_t[y_pred_t < 0] = 0
        
        out_frame = pd.DataFrame([test_t.index, y_pred_t]).T
        out_frame.columns = ['Id','Sales']
        out_frame['Id'] = out_frame['Id'].astype('int')
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        submit_frame = pd.concat([submit_frame, out_frame], ignore_index=True)
        print(f'Predicted on : {st_no}, train_rows={train_t.shape[0]}, test_rows={len(y_pred_t)}')
    else:
        print(f'Skipped: {st_no}, train_rows={train_t.shape[0]}')
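
Growing a dataframe inside the loop copies it on every iteration. One alternative, sketched below under made-up inputs, is to collect each store's output in a list and concatenate once at the end. The dictionary `fake_predictions` is hypothetical and stands in for the per-store test Ids and model predictions.

```python
import pandas as pd

# Hypothetical per-store results: store number -> (test Ids, predicted sales).
fake_predictions = {
    1: ([3, 1], [5200.0, 4800.0]),
    2: ([2], [-15.0]),
}

frames = []
for st_no, (ids, preds) in fake_predictions.items():
    out = pd.DataFrame({"Id": ids, "Sales": preds})
    # Negative sales are not meaningful, so floor them at 0.
    out["Sales"] = out["Sales"].clip(lower=0)
    frames.append(out)

# Concatenate once, then order by Id for the submission file.
submission = (
    pd.concat(frames, ignore_index=True).sort_values("Id").reset_index(drop=True)
)
print(submission)
```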

Finally, sort the submission dataframe by Id and write it to CSV.

In [None]:
submit_frame.sort_values('Id', inplace=True)
print(submit_frame.shape)
print(submit_frame.head())
submit_frame.to_csv(r'submission.csv', index=False)