# Predict The Future - An Ensemble based approach

### Introduction
Welcome to an extensive exploratory data analysis for the ongoing competition "Predict The Future". This notebook will grow over the coming days as I use this lockdown to improve my data-science skill set. In this notebook, we will pre-process the four datasets and then train a gradient-boosted ensemble of decision trees to predict the future demand for each item and shop combination from a Russian store. Before proceeding, I would like to express my gratitude and give due credit to Baek Kyun Shin, whose existing notebook on LightGBM and feature engineering has been leveraged to a large extent for the code presented here. To know more about his code base and analysis, please scroll down to the end of this notebook.

The training data consists of four separate datasets:
1. Sales data (sales_train.csv)
2. Item data  (items.csv)
3. Shop data  (shops.csv)
4. Item Category data (item_categories.csv)

The sales dataset contains the shop number, the item number and the date on which the item was sold. It also includes the item price and the count of items sold. This dataset will serve as our primary training set, and we will use the other datasets as a source of additional features to increase our prediction accuracy.

### Load libraries

In [None]:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.express as px
import warnings
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb

### Notebook Settings

In [None]:
warnings.filterwarnings('ignore')
data_path = '/kaggle/input/competitive-data-science-predict-future-sales/'
pd.set_option('float_format', '{:f}'.format)

### Load Datasets

In [None]:
sales_train=pd.read_csv(data_path + 'sales_train.csv')
shops=pd.read_csv(data_path + 'shops.csv')
items=pd.read_csv(data_path + 'items.csv')
item_categories=pd.read_csv(data_path + 'item_categories.csv')
test=pd.read_csv(data_path + 'test.csv')
submission=pd.read_csv(data_path + 'sample_submission.csv')

With all the datasets loaded into dataframes, we can now look under the hood to understand the basic structure of our data. The variable of interest is the column titled "item_cnt_day", which will be the basis of the target variable for our predictive model. The date column is stored as text in day.month.year form rather than as a machine-ready datetime, so we need to convert it first. We will be aggregating demand at a monthly level, so it makes sense to set the day column to one, so that each record in the sales table maps to the first day of its month. We also describe each of the columns in the sales dataset, which throws up some interesting preliminary observations.
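As a small illustration of the date handling described above, here is a sketch on made-up rows. It assumes the dates are dd.mm.yyyy strings; the `toy` frame and the `month_start` column are names invented for this example and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for the 'date' column, assuming dd.mm.yyyy strings (values made up)
toy = pd.DataFrame({'date': ['02.01.2013', '15.01.2013', '03.02.2013']})

# An explicit format avoids day/month ambiguity: '02.01.2013' is 2 January
toy['date'] = pd.to_datetime(toy['date'], format='%d.%m.%Y')

# Snap every record to the first day of its month
toy['month_start'] = toy['date'].dt.to_period('M').dt.to_timestamp()

print(toy)
```

Without the explicit format, a string such as '02.01.2013' could be read month-first, which would silently shift sales into the wrong month.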

In [None]:
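# Dates are stored as dd.mm.yyyy strings, so pass the format explicitly
sales_train['date']=pd.to_datetime(sales_train['date'], format='%d.%m.%Y')
sales_train['year']=sales_train['date'].dt.year
sales_train['month']=sales_train['date'].dt.month
# Demand is aggregated monthly, so pin every record to the first day of its month
sales_train['day']=1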

print(sales_train.describe())

The describe output shows that the minimum values of both the number of items sold and the item price are below zero, which is not plausible for a sale. This needs to be dealt with during data cleaning. We also see that the maximum item price and the maximum number of items sold are far above the third quartile. In other words, there are outliers, data points well beyond the usual 1.5 * IQR fence, and these need to be removed before proceeding. The cleaning cell that follows uses simple fixed cut-offs for this (prices below 50000 and daily counts below 1000). After taking care of these sanity issues, we plot the number of distinct items sold in each shop.
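To make the 1.5 * IQR rule mentioned above concrete, here is a minimal sketch on made-up counts. The `counts` series and the variable names are assumptions for illustration only; the notebook itself filters with fixed thresholds rather than with this fence.

```python
import pandas as pd

# Toy daily counts with one negative value and one extreme value (made up)
counts = pd.Series([1, 1, 2, 2, 3, 3, 4, -1, 500])

# Quartiles and the interquartile range
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1

# Standard fences: anything outside [lower, upper] is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside the fences that are also strictly positive
inside = counts[(counts >= lower) & (counts <= upper) & (counts > 0)]
print(lower, upper, inside.tolist())
```

Note that the negative value sits inside the fence here and is only removed by the separate positivity check. On heavily skewed data such a fence can also be very strict, so fixed cut-offs are a looser alternative.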

In [None]:
sales_train = sales_train[sales_train['item_price'] > 0]
sales_train = sales_train[sales_train['item_price'] < 50000]
sales_train = sales_train[sales_train['item_cnt_day'] > 0]
sales_train = sales_train[sales_train['item_cnt_day'] < 1000]
# Count the distinct items sold in each shop
data = sales_train.groupby(['shop_id']).agg({'item_id':'nunique'}).reset_index()

mpl.rc('font', size=6)
figure, ax=plt.subplots()
figure.set_size_inches(11,5)

sns.barplot(x='shop_id', y='item_id', data=data, ax=ax)
ax.set(title='Distribution of items sold across different shops',xlabel='Shop Number',ylabel='Number of Distinct Items Sold');
plt.show()

The plot throws up some interesting observations that deserve a deeper dive:
1. Shops 11, 20 and 36 seem to have an abnormally low number of distinct items sold
2. Shops 0, 1, 10 and 39 seem to mirror shops 57, 58, 11 and 40 respectively, indicating that they might be related
3. Shops 25, 31 and 54 seem to have an exceptionally high number of distinct items sold

To understand more about the second observation, we print the shop names for these shop IDs and find that each pair refers to the same shop, with only small differences in how the name is written.

In [None]:
print(shops['shop_name'][0], '||', shops['shop_name'][57])
print(shops['shop_name'][1], '||', shops['shop_name'][58])
print(shops['shop_name'][10], '||', shops['shop_name'][11])
print(shops['shop_name'][39], '||', shops['shop_name'][40])

Since these shops are one and the same, we replace the duplicate shop IDs with their counterparts in both the training and the test datasets. We also create a separate city column in the shops dataset, taken from the first word of the shop name, and label-encode it so that it can act as a feature in our predictive model.

In [None]:
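# Merge duplicate shops in both the training and the test data
for duplicate_id, kept_id in {0: 57, 1: 58, 10: 11, 39: 40}.items():
    sales_train.loc[sales_train['shop_id'] == duplicate_id, 'shop_id'] = kept_id
    test.loc[test['shop_id'] == duplicate_id, 'shop_id'] = kept_id

# The first word of the shop name is the city
shops['city'] = shops['shop_name'].apply(lambda x: x.split()[0])
# Normalise the one city name that carries a stray leading '!'
shops.loc[shops['city']=='!Якутск', 'city'] = 'Якутск'
# Encode city names as integer labels
label_encoder = LabelEncoder()
shops['city'] = label_encoder.fit_transform(shops['city'])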
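The shop clean-up above can also be sketched on toy data. In this illustration the shop names are made up, `DUPLICATE_SHOPS` is a name invented for the example, and `pd.factorize` stands in for `LabelEncoder` purely to keep the snippet free of extra dependencies.

```python
import pandas as pd

# Duplicate shop ids mapped to the ids that are kept (dictionary name is hypothetical)
DUPLICATE_SHOPS = {0: 57, 1: 58, 10: 11, 39: 40}

toy = pd.DataFrame({'shop_id': [0, 5, 10, 39, 57]})
# replace() maps each listed id once and leaves all other ids untouched
toy['shop_id'] = toy['shop_id'].replace(DUPLICATE_SHOPS)

# City = first word of the shop name, with a stray leading '!' removed
names = pd.Series(['!Якутск Shop A', 'Якутск Shop A'])  # made-up names
city = names.str.split().str[0].str.lstrip('!')

# Integer codes for the city labels
codes, uniques = pd.factorize(city)
print(toy['shop_id'].tolist(), city.tolist(), codes.tolist())
```

Both name variants end up with the same city label and therefore the same integer code, which is the point of normalising the name before encoding.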

### Data Preparation
Next we aggregate the sales dataset to get our item count at the year-month level. Item price is averaged over the same year, month, shop_id and item_id grouping, so the aggregated data keeps exactly one row per combination. After aggregation, we left join the sales dataset with the three other datasets to get all the features necessary for our predictive model. The end result will be judged on the basis of RMSE, so careful feature engineering matters for accuracy. I will keep updating this section with new features as and when I come across them.

In [None]:
#Data manipulation on the training dataset
# Aggregate daily sales to one row per month, shop and item: mean price, total count
data3 = sales_train.groupby(['year','month','date_block_num','shop_id','item_id']).agg({'item_price':'mean','item_cnt_day':'sum'}).reset_index()
# Attach item, category and shop attributes
data=pd.merge(data3,items,how='left', on='item_id')
data=pd.merge(data,item_categories,how='left',on='item_category_id')
data=pd.merge(data,shops,how='left',on='shop_id')
# Month of year derived from the running month index (with this formula December becomes 0)
data['month']=data['date_block_num'].apply(lambda month: (month+1)%12)
data=data[['month','date_block_num','shop_id','item_id','item_category_id','city','item_price','item_cnt_day']]

#Data manipulation on the testing dataset
# The test set is the month right after the training data
test['date_block_num'] = 34
test['month']=11
# Use each item's average historical price; items never sold before get NaN here
item_price=data[['item_id','item_price']].groupby('item_id')['item_price'].mean().reset_index()
test=pd.merge(test,item_price,how='left',on='item_id')
test=pd.merge(test,items,how='left',on='item_id')
test=pd.merge(test,item_categories,how='left',on='item_category_id')
test=pd.merge(test,shops,how='left',on='shop_id')
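The aggregation and left join used in this section can be seen end to end on a tiny example. This is only a sketch: all values are made up, and `daily`, `monthly`, `toy_test`, `avg_price` and `item_cnt_month` are names chosen for the illustration.

```python
import pandas as pd

# Toy daily sales (values made up): two sales of item 10 in month 0, one in month 1
daily = pd.DataFrame({
    'date_block_num': [0, 0, 1],
    'shop_id':        [2, 2, 2],
    'item_id':        [10, 10, 10],
    'item_price':     [100.0, 120.0, 110.0],
    'item_cnt_day':   [1, 3, 2],
})

# One row per (month, shop, item): average price and total count
monthly = (daily
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
           .agg(item_price=('item_price', 'mean'),
                item_cnt_month=('item_cnt_day', 'sum')))

# A left join keeps every test row; items never sold before get a missing price
toy_test = pd.DataFrame({'shop_id': [2, 2], 'item_id': [10, 99]})
avg_price = monthly.groupby('item_id', as_index=False)['item_price'].mean()
toy_test = toy_test.merge(avg_price, how='left', on='item_id')

print(monthly)
print(toy_test)
```

Item 99 has no sales history, so its price comes back as NaN after the join. The notebook later replaces such gaps with 0 via `fillna`, which is worth keeping in mind when interpreting the price feature.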

### Model Building
For the predictive model we will use LightGBM (Light Gradient Boosting Machine), which was originally developed by researchers at Microsoft. We split our dataset into training and validation sets so that we can judge model accuracy. We also set certain parameters of the model, including early stopping, so that boosting continues only until the validation RMSE stops improving. We have set the bagging fraction to 75% and capped the number of leaves per tree at 255.
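Two ideas from the paragraph above, RMSE and early stopping, can be sketched in a few lines. This is a simplified illustration and not LightGBM's own implementation; `rmse`, `best_round` and the score list are hypothetical names and values made up for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def best_round(valid_scores, patience):
    """Index of the best round, stopping once `patience` rounds pass without improvement."""
    best_idx = 0
    for i, score in enumerate(valid_scores):
        if score < valid_scores[best_idx]:
            best_idx = i          # new best validation score
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` rounds
    return best_idx

# Made-up validation RMSE after each boosting round
scores = [1.00, 0.90, 0.85, 0.86, 0.87, 0.80]
print(rmse([0, 0, 0, 0], [2, 2, 2, 2]))
print(best_round(scores, patience=2), best_round(scores, patience=5))
```

With a small patience the search stops at round 2 and never sees the later improvement, while a larger patience finds round 5. This is the trade-off behind choosing a value such as 150 rounds for a slow learning rate.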

In [None]:
target=['item_cnt_day']
features=['month','shop_id','item_id','item_category_id','city','item_price']


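# Hold out the last training month (date_block_num 33) for validation;
# the test set is the following month (date_block_num 34)
data1=data[data['date_block_num']<33]
data2=data[data['date_block_num']==33]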

x_train = data1[features].fillna(value=0)
y_train=data1[target].fillna(value=0)

x_valid=data2[features].fillna(value=0)
y_valid=data2[target].fillna(value=0)


params = {'metric': 'rmse',
          'num_leaves': 255,
          'learning_rate': 0.005,
          'feature_fraction': 0.75,
          'bagging_fraction': 0.75,
          'bagging_freq': 5,
          'force_col_wise' : True,
          'random_state': 10}

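# Treat only the discrete id-like columns as categorical; item_price stays numeric
cat_features=['month','shop_id','item_id','item_category_id','city']

# Categorical columns are declared on the Dataset; the validation set reuses the training bins
dtrain=lgb.Dataset(x_train,y_train,categorical_feature=cat_features)
dvalid=lgb.Dataset(x_valid,y_valid,reference=dtrain)

# Early stopping and logging are passed as callbacks, as newer LightGBM releases require
lgb_model=lgb.train(params=params,
                    train_set=dtrain,
                    num_boost_round=1500,
                    valid_sets=[dtrain, dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=150),
                               lgb.log_evaluation(period=100)])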

test1=test[features]
test1=test1.fillna(0)
test['preds']=lgb_model.predict(test1)


preds=test[['ID','preds']]
preds.columns=['ID','item_cnt_month']
preds.to_csv('my_submission_lgbm_final.csv',index=False)

I hope you guys had fun reading my notebook. This was my first attempt at a Kaggle notebook and hopefully there will be more to come in the future. I will periodically update this notebook with newer findings and optimized code as I get better and better in my journey towards becoming a data-scientist. 

### A huge shout-out to Baek Kyun Shin whose notebook has been a major source of inspiration for this code.
He is a consultant at KPC. A lot of the code used in this notebook has been sourced from his kernel titled "(TOP 3.5%) LightGBM with Feature Engineering".

Link : https://www.kaggle.com/werooring/top-3-5-lightgbm-with-feature-engineering

Profile Link : https://www.kaggle.com/werooring
