## Understanding Data

You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.

Data fields

* date - Date of the sale data. There are no holiday effects or store closures.
* store - Store ID
* item - Item ID
* sales - Number of items sold at a particular store on a particular date.


In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score, GridSearchCV
import datetime as dt

plt.rcParams["figure.figsize"] = (15,5)


In [None]:
data = pd.read_csv('/kaggle/input/demand-forecasting-kernels-only/train.csv',parse_dates =['date'],index_col=['date'])
data.head()

In [None]:
data.info()

## Base model

We chose a Decision Tree Regressor as our base model and SMAPE (symmetric mean absolute percentage error) as the performance metric for this project. Because SMAPE is a relative error, it is unchanged when actuals and predictions are rescaled by a common factor, and each observation contributes at most 200% in the form used here, which limits the influence of outliers. Sales have a right-skewed distribution, so we model their log transform, `log1p(sales)`, which is closer to a normal distribution.

In [None]:
data['sales'].hist(bins = 20)
mean = data['sales'].mean()
median = np.median(data['sales'])
minimum = data.sales.min()
maximum = data.sales.max()

mean,median,minimum,maximum

In [None]:
logged = np.log1p(data['sales'])
logged.hist(bins = 20)
logged.mean(),logged.median()

In [None]:
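def smape(actual,predict,islog=True):
    # Convert to arrays so Series and plain lists are handled alike
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    if islog:
        # Undo the log1p transform applied to the target
        actual = np.expm1(actual)
        predict = np.expm1(predict)

    denom = np.abs(actual) + np.abs(predict)
    # Treat 0/0 (actual and prediction both zero) as zero error instead of NaN
    ratio = np.divide(2*np.abs(actual-predict), denom, out=np.zeros_like(denom), where=denom != 0)
    return 100*np.mean(ratio)

# greater_is_better=False makes the scorer return the negated SMAPE
smape_score = make_scorer(smape,greater_is_better=False)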

def evaluate_model(df,features):
    all_X = df[features]
    all_y = np.log1p(df['sales'])
    
    tree = DecisionTreeRegressor(random_state = 1)
    scores = cross_val_score(tree,all_X,all_y,scoring=smape_score,cv=5)
    avg_score = -(scores.mean())  # make_scorer negates the score, so flip the sign back
    
    return avg_score
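
As a quick sanity check of the metric, here is a minimal sketch on two made-up observations. The helper name `smape_demo` and the numbers are assumptions for illustration only; the point is that scoring in log space with `islog=True` gives the same value as scoring the raw sales directly.

```python
import numpy as np

def smape_demo(actual, predict, islog=False):
    """SMAPE in percent; hypothetical helper mirroring the notebook's smape()."""
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    if islog:
        # Inputs are log1p-transformed, so map them back to raw sales first
        actual, predict = np.expm1(actual), np.expm1(predict)
    denom = np.abs(actual) + np.abs(predict)
    # Count 0/0 as zero error
    ratio = np.divide(2 * np.abs(actual - predict), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100 * np.mean(ratio)

actual = np.array([100.0, 200.0])    # made-up sales
predict = np.array([110.0, 180.0])   # made-up predictions

# Per-row terms: 2*10/210 and 2*20/380, averaged and expressed in percent
raw_score = smape_demo(actual, predict)
log_score = smape_demo(np.log1p(actual), np.log1p(predict), islog=True)
print(round(raw_score, 4), round(log_score, 4))
```

Both values come out at roughly 10.03, since the second call only inverts the log transform before applying the same formula.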

## Step 1: EDA with available Features

Our EDA shows that sales vary with store and item, so these features can play an important role in prediction. In this step we train our base model on the available features and evaluate its performance.

In [None]:
df = data.groupby(['date','item'])['sales'].sum().unstack()
df.plot(figsize=(15,10))
plt.ylabel('Total sale')
plt.title('itemwise sale')

In [None]:
df = data.groupby(['date','store'])['sales'].sum().unstack()
df.plot(figsize=(15,10))
plt.ylabel('sales')
plt.title('Total sale store wise')

In [None]:
features = list(data.columns)
features.remove('sales')
print('SMAPE with available features: {:.4f}'.format(evaluate_model(data,features)))

The cross-validation SMAPE score for this step is 47.66.

## Step 2: EDA and Basic Feature Engineering

In this step we extract common calendar features such as month, year, day and week from the date column and analyse how sales vary with each of them.

In [None]:
data['month'] = data.index.month
data['year'] = data.index.year
data['dow'] = data.index.dayofweek
data['day'] = data.index.day
data['quarter'] = data.index.quarter
data['week'] = data.index.isocalendar().week.astype(int).to_numpy()  # DatetimeIndex.week was removed in pandas 2.0

data.head()

In [None]:
sns.boxplot(x='month',y='sales',hue = 'year', data=data)

In [None]:
sns.boxplot(x='week',y='sales',hue ='year', data=data)

In [None]:
sns.boxplot(x='dow',y='sales',hue ='year', data=data)

In [None]:
sns.boxplot(x='day',y='sales',hue ='year', data=data)

In [None]:
sns.boxplot(x='quarter',y='sales',hue ='year', data=data)

Observations:
1. Overall sales increase from year to year.
2. Overall sales are higher
 - during May, June and July within a year
 - at the weekend within a week
 - mostly in the 3rd quarter
3. Sales vary only slightly with the day of the month.
 

In [None]:
features = list(data.columns)
features.remove('sales')
print('SMAPE with Basic feature engineering: {:.4f}'.format(evaluate_model(data,features)))

With basic feature engineering we have lowered the cross-validation SMAPE score by about 2 points.

## Step 3: EDA and Advanced Feature Engineering

Sometimes, even with the available features, tree-based models fail to capture the interactions between them, which results in a large number of splits and a more complex model. In this step we develop features that encode these interactions explicitly and make it easier for the decision tree to predict accurately.

We develop features for the following interactions:

1. **store and item:**
   Sales of a given item can differ between locations. A very popular item may still sell poorly in a store located far from a residential area, and conversely a store in a prime location will still see low sales of an unpopular item.

2. **week and dow:**
   Combining these two features may give the decision tree regressor a sense of ordering within the year.
   
3. **year and month:** same as above.

In [None]:
store_item_df = pd.pivot_table(data, index='item', values='sales', columns='store',margins=True, aggfunc='mean')
store_item_df.head()

In [None]:
sns.heatmap(store_item_df)

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2, figsize = (15,5))
store_item_df['All'].sort_values().plot.bar(ax=ax1, title ='itemwise average sale' )
store_item_df.loc['All',:].sort_values().plot.bar(ax=ax2, title = 'storewise average sale')

In [None]:
i = store_item_df['All'].sort_values().index
c = store_item_df.loc['All',:].sort_values().index
store_item_df = store_item_df[c]
store_item_df = store_item_df.reindex(i)

store_item_df.drop('All',axis=1,inplace=True)
store_item_df.drop('All',axis=0,inplace = True)

store_item_df.head()


In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(store_item_df, square=True)

From the heatmap above we can see that, once sorted, the store-item combinations are well organized with respect to their sales. Each cell shows the average sales of a particular item (row) in a particular store (column). The lowest average is in the top-left corner and the highest in the bottom-right corner, and average sales increase across the columns within a row and then from one row to the next. If we encode each store-item combination according to this ordering, it should greatly help our regressor make predictions.
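
The ordering idea can be sketched on a tiny made-up table. Everything here (the `sales_toy` frame, its values, and the use of plain row and column means instead of the `All` margins) is an assumption for illustration, not the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

# Made-up sales for 2 items in 2 stores
sales_toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "item":  [1, 2, 1, 2],
    "sales": [40, 10, 20, 5],
})

# Average sales per (item, store) cell
pivot = sales_toy.pivot_table(index="item", columns="store",
                              values="sales", aggfunc="mean")

# Sort rows (items) and columns (stores) by their average sales, ascending
row_order = pivot.mean(axis=1).sort_values().index
col_order = pivot.mean(axis=0).sort_values().index
pivot = pivot.loc[row_order, col_order]

# Number the cells row by row, so the code grows with average sales
codes = pd.DataFrame(
    np.arange(1, pivot.size + 1).reshape(pivot.shape),
    index=pivot.index, columns=pivot.columns,
)
print(codes)
```

In this toy table, item 2 in store 2 (the weakest cell) gets code 1 and item 1 in store 1 (the strongest) gets code 4, so a single split on the code separates low-selling from high-selling combinations.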

In [None]:
# Prepare dataframe to encode store item interaction
encode_df = pd.DataFrame(np.arange(1,501,1).reshape((50,10)))
encode_df.columns = store_item_df.columns
encode_df.index = store_item_df.index


In [None]:
def encode_feature(row):
    r = row['item']
    c = row['store']
    return encode_df.loc[r,c]

data['store_item'] = data.apply(encode_feature,axis=1)

data.head()
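
Row-wise `apply` calls a Python function once per row, which can be slow on a large training set. A possible vectorised alternative is sketched below on toy stand-ins for `encode_df` and `data` (the values are made up); it looks up all codes with a single NumPy indexing operation and should give the same result as `encode_feature`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's encode_df and data (hypothetical values)
encode_toy = pd.DataFrame(
    np.arange(1, 7).reshape((3, 2)),
    index=pd.Index([5, 1, 3], name="item"),
    columns=pd.Index([2, 1], name="store"),
)
data_toy = pd.DataFrame({"store": [1, 2, 1, 2], "item": [5, 5, 3, 1]})

# Positions of each row's item and store in the lookup table
row_pos = encode_toy.index.get_indexer(data_toy["item"])
col_pos = encode_toy.columns.get_indexer(data_toy["store"])

# Fancy indexing pulls every (item, store) code at once
data_toy["store_item"] = encode_toy.to_numpy()[row_pos, col_pos]

# Same codes as the row-wise lookup used in the notebook
row_wise = data_toy.apply(lambda r: encode_toy.loc[r["item"], r["store"]], axis=1)
print(data_toy["store_item"].tolist(), row_wise.tolist())
```

`get_indexer` returns -1 for labels it cannot find, so a check such as `(row_pos >= 0).all()` would be a sensible guard before indexing real data.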

In [None]:
features = list(data.columns)
features.remove('sales')

print('SMAPE with feature engineering step 3: {:.4f}'.format(evaluate_model(data,features)))

For the other two interactions we concatenate year with month as its fractional part, and week with day of week as its fractional part.

In [None]:
data['m_yr'] = data['year'] + data['month']/100
data['week_frac'] = data['week'] + data['dow']/100
data.head()
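
To make the encoding concrete, here is a small worked example on three hand-picked dates (the `toy` frame is an assumption for illustration). Dividing by 100 keeps the second component in the decimals, so a single number orders first by year and then by month, or first by week and then by day of week.

```python
import pandas as pd

# Three hand-picked dates (hypothetical)
toy = pd.DataFrame({"date": pd.to_datetime(["2016-12-31", "2017-01-01", "2017-03-15"])})

toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["dow"] = toy["date"].dt.dayofweek                       # Monday=0 ... Sunday=6
toy["week"] = toy["date"].dt.isocalendar().week.astype(int)  # ISO week number

# Same formulas as in the notebook
toy["m_yr"] = toy["year"] + toy["month"] / 100
toy["week_frac"] = toy["week"] + toy["dow"] / 100

print(toy[["date", "m_yr", "week_frac"]])
```

The three rows get `m_yr` values of 2016.12, 2017.01 and 2017.03, which increase with time. Note that 2017-01-01 falls in ISO week 52 of the previous year, so `week_frac` is 52.06 for that date.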

In [None]:
drop_col = ['sales'] # 'year','day','month','dow','week','store','item',

features = list(data.columns)
for i in drop_col:
    features.remove(i)
    
evaluate_model(data,features)

This advanced feature engineering improved our base model substantially: the SMAPE is almost 60% lower than at the previous step.

## Hyperparameter Tuning:
We apply a standard grid search to our base model to optimize its most important hyperparameters, which improved the performance further.

In [None]:
h = {'criterion' : ['squared_error','friedman_mse'],  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
                   'min_samples_leaf': [1,3,5],
                   'min_samples_split': [2,4,6]
                  }


dtr = DecisionTreeRegressor(random_state=1)
grid = GridSearchCV(dtr, param_grid=h, scoring=smape_score, cv=5, verbose=10, n_jobs =-1)

all_X = data[features]
all_y = np.log1p(data['sales'])
grid.fit(all_X,all_y)
        
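# In-sample check: the refit model is scored on its own training data
pred = grid.predict(data[features])
pred = np.clip(pred, 0, None)  # log1p(sales) cannot be negative

error = smape(all_y,pred,islog=True)

print('SMAPE on Last Step: {:.4f}'.format(error))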
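
A fitted `GridSearchCV` also exposes the cross-validated score of the best parameter set, which is a fairer estimate than a score computed on the training data. The sketch below uses synthetic data and a hypothetical `smape_raw` helper (both assumptions for illustration) to show how to read `best_params_` and `best_score_`; the sign is flipped because the scorer was built with `greater_is_better=False`.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def smape_raw(actual, predict):
    """Plain SMAPE in percent (hypothetical helper, no log handling)."""
    actual = np.asarray(actual, dtype=float)
    predict = np.asarray(predict, dtype=float)
    return 100 * np.mean(2 * np.abs(actual - predict) / (np.abs(actual) + np.abs(predict)))

# Synthetic regression data with strictly positive targets (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 20 + 3 * X[:, 0] + rng.normal(0, 2, size=300)

toy_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid={"min_samples_leaf": [1, 5, 20]},
    scoring=make_scorer(smape_raw, greater_is_better=False),
    cv=5,
)
toy_grid.fit(X, y)

cv_smape = -toy_grid.best_score_                      # mean SMAPE over held-out folds
train_smape = smape_raw(y, toy_grid.predict(X))       # SMAPE on the training data
print(toy_grid.best_params_, round(cv_smape, 2), round(train_smape, 2))
```

In the notebook, the corresponding values would be `grid.best_params_` and `-grid.best_score_`.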




## Implementation of model for Final Test Set

In [None]:
def transform_feature(df):
    
    # Apply basic feature engineering
    df['month'] = df.index.month
    df['year'] = df.index.year  
    df['dow'] = df.index.dayofweek
    df['day'] = df.index.day
    df['quarter'] = df.index.quarter
    df['week'] = df.index.isocalendar().week.astype(int).to_numpy()  # DatetimeIndex.week was removed in pandas 2.0
    
    # Apply advance feature engineering
    df['store_item'] = df.apply(encode_feature,axis=1)
    df['m_yr'] = df['year']+df['month']/100
    df['week_frac'] = df['week']+df['dow']/100
    
    return df
    
    

In [None]:
holdout = pd.read_csv('/kaggle/input/demand-forecasting-kernels-only/test.csv',parse_dates =['date'],index_col=['date'])
ids= holdout['id']
holdout = transform_feature(holdout)

pred_h = grid.predict(holdout[features])
pred_h = np.expm1(pred_h)          # invert the log1p transform
pred_h = np.clip(pred_h, 0, None)  # sales cannot be negative

submission_df = {'id':ids, 'sales':pred_h}
submission =pd.DataFrame(submission_df)

submission.to_csv('submission.csv',index=False)



**Test score: 14.80**

## Summary: 
We have implemented a store demand forecast that predicts the sales of each item at each store. The model was trained on five years of data and tested on the following three months, where it achieved a SMAPE score of 14.80. The aim of the project was to improve our base model step by step, from basic to advanced feature engineering and then hyperparameter tuning. We started with a very simple decision tree regressor on the available features, which gave a SMAPE score of 47.66. Feature engineering reduced it to 18.63, and hyperparameter tuning reduced it further to 10.48.