<h1 style="text-align:center;font-size:200%;">Prophet/LightGBM - EDA & Feature Engineering & Tuning</h1>
<h3  style="text-align:center;">Keywords : <span class="label label-success">Time Series Analysis</span> <span class="label label-success">Feature Engineering</span> <span class="label label-success">Boosting Technique</span> <span class="label label-success">Hyperparameter Tuning</span></h3>

# Table of Contents <a id='top'></a>

>1. [Overview](#1.-Overview)  
>    * [Project Detail](#Project-Detail)
>    * [Goal of this notebook](#Goal-of-this-notebook)
>1. [Import libraries](#2.-Import-libraries)
>1. [Load the dataset](#3.-Load-the-dataset)
>1. [Pre-processing](#4.-Pre-processing)
>1. [EDA](#5.-EDA)  
>    * [Shops Analysis](#Shops-Analysis)
>    * [Item Analysis](#Item-Analysis)
>    * [Item Category Analysis](#Item-Category-Analysis)
>    * [Sales Analysis](#Sales-Analysis)
>    * [Basic Time Series EDA](#Basic-Time-Series-EDA)
>    * [Outlier](#Outlier)
>1. [Modelling](#6.-Modelling)
>    * [Let's try Prophet](#Let's-try-Prophet)
>    * [Let's try Lightgbm](#Let's-try-Lightgbm)
>1. [Conclusion](#7.-Conclusion)
>1. [References](#8.-References)

# 1. Overview
## Project Detail
>In this project ([Predict Future Sales](https://www.kaggle.com/c/competitive-data-science-predict-future-sales)), the goal is to predict next month's total sales for every product and store, using a dataset provided by the Russian software firm [1C Company](https://1c.ru/eng/title.htm).

## Goal of this notebook
>* Practice EDA techniques for time-series data
>    * Check stationarity with the ADF test
>    * ACF analysis
>    * Series decomposition into trend/seasonality
>* Practice visualisation techniques (especially bokeh via holoviews)
>* Practice feature engineering techniques  
>    * Lag features
>    * Difference features
>* Practice modeling techniques
>    * Prophet
>    * LightGBM
>* Practice hyperparameter tuning techniques
>    * Step-wise algorithm
>      * Optuna - LightGBM Tuner

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 2. Import libraries

In [None]:
!pip install pandarallel

In [None]:
import numpy as np
import pandas as pd
import os
import gc
import holoviews as hv
from holoviews import opts
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
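# The package was renamed from `fbprophet` to `prophet` in version 1.0; support both
try:
    from prophet import Prophet
    from prophet.plot import add_changepoints_to_plot
except ImportError:
    from fbprophet import Prophet
    from fbprophet.plot import add_changepoints_to_plot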
hv.extension('bokeh')
from pandarallel import pandarallel
pandarallel.initialize()
from sklearn.preprocessing import StandardScaler, LabelEncoder
from itertools import product
import lightgbm as lgb
import optuna.integration.lightgbm as lgb_optuna

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 3. Load the dataset

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
print(f'items.csv : {items.shape}')
items.head(3)

In [None]:
item_categories = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
print(f'item_categories.csv : {item_categories.shape}')
item_categories.head(3)

In [None]:
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
print(f'shops.csv : {shops.shape}')
shops.head(3)

In [None]:
sales_train = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
print(f'sales_train.csv : {sales_train.shape}')
sales_train.head(3)

In [None]:
test = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
print(f'test.csv : {test.shape}')
test.head(3)

In [None]:
submission = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv')
print(f'sample_submission.csv : {submission.shape}')
submission.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 4. Pre-processing

>Adding a column with the name of the city where each shop is located

In [None]:
shops['city_name'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops['city_name'].unique()

In [None]:
shops.loc[shops['city_name']=='!Якутск', 'city_name'] = 'Якутск'
shops['city_code'] = LabelEncoder().fit_transform(shops['city_name']).astype(np.int8)
shops.head(3)
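
>The city extraction above can be tried on a few toy rows. This is a minimal sketch, not the notebook's exact code: the shop names are illustrative, the stray `!` is removed with `str.lstrip` instead of a `.loc` assignment, and `pd.factorize(sort=True)` stands in for `LabelEncoder` (both assign integer codes in sorted order of the unique values).

```python
import pandas as pd

# Toy shop names in the same "<city> <shop description>" layout (illustrative only)
toy_shops = pd.DataFrame({'shop_name': ['!Якутск Орджоникидзе, 56 фран',
                                        'Москва ТРК "Атриум"',
                                        'Якутск ТЦ "Центральный"']})

# The first space-separated token is the city; strip the stray leading "!"
toy_shops['city_name'] = toy_shops['shop_name'].str.split(' ').str[0].str.lstrip('!')

# Integer codes assigned in sorted order of the unique city names
codes, uniques = pd.factorize(toy_shops['city_name'], sort=True)
toy_shops['city_code'] = codes
```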

>Adding columns for the item main category and sub-category

In [None]:
item_categories['item_maincategory_name'] = item_categories['item_category_name'].str.split(' - ').map(lambda x: x[0])
item_categories['item_maincategory_name'].unique()

In [None]:
item_categories['item_subcategory_name'] = item_categories['item_category_name'].str.split('-').map(lambda x: '-'.join(x[1:]).strip() if len(x) > 1 else x[0].strip())
item_categories['item_subcategory_name'].unique()

>Normalizing category columns which have similar values

In [None]:
item_categories.loc[item_categories['item_maincategory_name']=='Игры Android', 'item_maincategory_name'] = 'Игры'
item_categories.loc[item_categories['item_maincategory_name']=='Игры MAC', 'item_maincategory_name'] = 'Игры'
item_categories.loc[item_categories['item_maincategory_name']=='Игры PC', 'item_maincategory_name'] = 'Игры'
item_categories.loc[item_categories['item_maincategory_name']=='Карты оплаты (Кино, Музыка, Игры)', 'item_maincategory_name'] = 'Карты оплаты'
item_categories.loc[item_categories['item_maincategory_name']=='Чистые носители (шпиль)', 'item_maincategory_name'] = 'Чистые носители'
item_categories.loc[item_categories['item_maincategory_name']=='Чистые носители (штучные)', 'item_maincategory_name'] = 'Чистые носители'
item_categories['item_maincategory_id'] = LabelEncoder().fit_transform(item_categories['item_maincategory_name']).astype(np.int8)
item_categories['item_subcategory_id'] = LabelEncoder().fit_transform(item_categories['item_subcategory_name']).astype(np.int8)
item_categories.head(3)

>Merging the sales dataframe with the item and shop dataframes, using item_id and shop_id as keys

In [None]:
item_info = pd.merge(items, item_categories, on='item_category_id', how='inner')
train_tmp = pd.merge(sales_train,item_info, on='item_id', how='inner')
train = pd.merge(train_tmp, shops, on='shop_id', how='inner')
train.head(3)

In [None]:
test_tmp = pd.merge(test,item_info, on='item_id', how='inner')
test = pd.merge(test_tmp, shops, on='shop_id', how='inner')
test.head(3)
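
>An inner merge silently drops rows whose key has no match, which would shrink the test set. The sketch below, on toy frames with made-up ids, shows one way to check for this using the `indicator` and `validate` options of `pd.merge`; it is a sanity check under these assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy frames with made-up ids (illustrative only)
toy_test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 6], 'item_id': [100, 101, 999]})
toy_items = pd.DataFrame({'item_id': [100, 101], 'item_category_id': [40, 55]})

# how='left' keeps every test row; indicator=True records whether a match was found;
# validate='many_to_one' raises if item_id is duplicated on the right-hand side
merged = pd.merge(toy_test, toy_items, on='item_id', how='left',
                  indicator=True, validate='many_to_one')

# Rows that an inner merge would have dropped
unmatched = merged[merged['_merge'] == 'left_only']
inner_rows = len(pd.merge(toy_test, toy_items, on='item_id', how='inner'))
```

>If `unmatched` is empty on the real data, the inner merge and the left merge return the same rows.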

>Parsing the date column into datetime format

In [None]:
train.date = pd.to_datetime(train.date,format='%d.%m.%Y')
train =  train.sort_values('date').reset_index(drop=True)
train.head(3)

>Calculating the sales amount per day

In [None]:
train['total_sales'] = train['item_price'] * train['item_cnt_day']
train.head(3)

>Downcasting column dtypes to reduce memory usage

In [None]:
train['date_block_num'] =train['date_block_num'].astype(np.int8)
train['shop_id'] = train['shop_id'].astype(np.int8)
train['item_id'] = train['item_id'].astype(np.int16)
train['item_category_id'] = train['item_category_id'].astype(np.int16)

In [None]:
test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)
test['item_category_id'] = test['item_category_id'].astype(np.int16)
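
>Downcasting is only safe when every value fits in the target dtype; otherwise the values wrap around silently. A minimal sketch of a range check follows. `fits_dtype` is a hypothetical helper introduced here, and the toy ids are illustrative.

```python
import numpy as np
import pandas as pd

def fits_dtype(series, dtype):
    """Hypothetical helper: True if every value of `series` fits in integer `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Toy ids (illustrative only)
toy = pd.DataFrame({'shop_id': [0, 59], 'item_id': [0, 22169]})

ok_shop = fits_dtype(toy['shop_id'], np.int8)      # int8 holds -128..127
ok_item8 = fits_dtype(toy['item_id'], np.int8)     # 22169 does not fit in int8
ok_item16 = fits_dtype(toy['item_id'], np.int16)   # int16 holds -32768..32767
```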

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 5. EDA

### Some points to focus on
><div class="alert alert-success" role="alert">
><ul>
><li>Most popular items, categories and shops</li>
><li>The distribution of item price and item sales per day</li>
><li>The seasonal trend of sales through a year</li>
><li>Some outliers and missing values</li>
></ul>
></div>

## Shops Analysis

### Insights
><div class="alert alert-success" role="alert">Looking at the number of sales records per shop, the shops in Moscow stand out.</div>

In [None]:
shop_rank_df = train.shop_name.value_counts().sort_values(ascending=False)
hv.Bars(shop_rank_df[0:20]).opts(title="Shop Count top20", color="red", xlabel="Shop Name", ylabel="Count")\
                            .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

### Insights
><div class="alert alert-success" role="alert">
>The top-5 shop names translate as follows:
><ol>
><li>Moscow Shopping Center "Semenovsky"</li>
><li>Moscow TRC "Atrium"</li>
><li>Khimki Shopping Center "Mega"</li>
><li>Moscow TC "MEGA Teply Stan" II</li>
><li>Yakutsk Ordzhonikidze, 56</li>
></ol>
>Judging by the shop names, many of them are shopping malls, and many are located in or near Moscow.
></div>

In [None]:
pd.DataFrame(shop_rank_df[0:5])

### Insights
><div class="alert alert-success" role="alert">
>Let's look at the cities where the shops are located.
><ul>
><li>As mentioned above, Moscow is the top city</li>
><li>After Moscow come other large cities and regional capitals such as St. Petersburg and Yakutsk</li>
></ul>
></div>

In [None]:
hv.Bars(train['city_name'].value_counts()).opts(title="City Count", color="red", xlabel="City Name", ylabel="Count")\
                                            .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Item Analysis

### Insights

><div class="alert alert-success" role="alert">Looking at the distribution of item names, one particular item stands out (discussed below).</div>

In [None]:
item_rank_df = train.item_name.value_counts().sort_values(ascending=False)
hv.Bars(item_rank_df[0:20]).opts(title="Item Count top20", color="blue", xlabel="Item Name", ylabel="Count")\
                            .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

### Insights
><div class="alert alert-success" role="alert">
>The top-5 item names translate as follows:
><ol>
><li>Corporate package shirt 1C Interest white (34 * 42) 45 microns</li>
><li>Playstation Store wallet replenishment: Payment card 1000 rub.</li>
><li>Acceptance of funds for 1C-Online</li>
><li> Diablo III [PC, Jewel, Russian version]</li>
><li>Kaspersky Internet Security Multi-Device Russian Edition. 2-Device 1 year Renewal Box</li>
></ol>
>The top item stands out, but it is probably not a product in its own right; it looks like an accessory that accompanies purchases. Many of the others are game-related.<br/>
>It is interesting that internet security software (Kaspersky) is on the list.
></div>

In [None]:
pd.DataFrame(item_rank_df[0:5])

### Insights
><div class="alert alert-success" role="alert">
>Most expensive items:
><ol>
><li>Remote Control Software</li>
><li>Shipping cost</li>
><li>Lord of the Ring(DVD)</li>
></ol>
></div>

In [None]:
price_rank = train[['item_name','item_category_name','item_price']].groupby(['item_name','item_category_name']).max()
price_rank.sort_values('item_price',ascending=False).head(3)

### Insights
><div class="alert alert-success" role="alert">
>Cheapest items:
><ol>
><li>Corporate package</li>
><li>Battery</li>
><li>Monday Night Combat(PC Game)</li>
></ol>
></div>

In [None]:
price_rank.sort_values('item_price',ascending=True).head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Item Category Analysis

### Insights

><div class="alert alert-success" role="alert">Looking at the distribution of item categories, video game software and movie DVDs appear to sell well.</div>

In [None]:
item_cat_rank_df = train.item_category_name.value_counts().sort_values(ascending=False)
hv.Bars(item_cat_rank_df[0:20]).opts(title="Item Category Count top20", color="green" ,xlabel="Item categories", ylabel="Count")\
                                .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

### Insights
><div class="alert alert-success" role="alert">
>The top-5 item categories translate as follows:
><ol>
><li>Cinema - DVD</li>
><li>PC Games - Standard Editions</li>
><li>Music - Local Production CD</li>
><li>Games - PS3</li>
><li>Cinema - Blu-ray</li>
></ol>
>Besides game software for various platforms, movie products on different media (DVD, Blu-ray) are also prominent.
></div>

In [None]:
pd.DataFrame(item_cat_rank_df[0:5])

### Insights
><div class="alert alert-success" role="alert">
>Let's look at the distribution of item main/sub-categories  
><ul>
><li>Entertainment-related categories such as games, movies and music are the top 3</li>
><li>Next come gifts such as toys, ornaments and stamps</li>
></ul>
></div>

In [None]:
hv.Bars(train['item_maincategory_name'].value_counts()).opts(title="Item Main-Category Count", color="green" ,xlabel="Main Categories" ,ylabel="Count") \
                                                        .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

In [None]:
hv.Bars(train['item_subcategory_name'].value_counts()).opts(title="Item Sub-Category Count", color="green" ,xlabel="Sub Categories" ,ylabel="Count")\
                                                        .opts(opts.Bars(width=700, height=500,tools=['hover'],xrotation=45,show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Sales Analysis

>Let's aggregate the data based on total sales.

### Top Sales Item
><div class="alert alert-success" role="alert">
><ul>
><li>Game-related items generate the most revenue</li>
><li>Among hardware, the PS4 is popular; among software, Grand Theft Auto is popular</li>
></ul>
></div>

In [None]:
train.groupby('item_name')[['total_sales','item_cnt_day']].sum().sort_values('total_sales',ascending=False).head()

### Worst Sales Item

In [None]:
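# Keep only rows where both values are positive; a 2-D boolean mask would select some rows twice
train[(train['total_sales'] > 0) & (train['item_cnt_day'] > 0)].groupby('item_name')[['total_sales','item_cnt_day']].sum().sort_values('total_sales',ascending=True).head(3)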

### Top Sales Shop

><div class="alert alert-success" role="alert">The shops that contribute most to sales are in or near Moscow, and most of them are large shops such as shopping malls.</div>

In [None]:
train.groupby('shop_name')[['total_sales','item_cnt_day']].sum().sort_values('total_sales',ascending=False).head()

### Worst Sales Shop

><div class="alert alert-success" role="alert">Shops with poor sales are located in the eastern part of the country and are likely to be small or medium-sized.</div>

In [None]:
train.groupby('shop_name')[['total_sales','item_cnt_day']].sum().sort_values('total_sales',ascending=True).head()

### Top Sales Item Category

><div class="alert alert-success" role="alert">The top-selling items are game software or hardware, so it is natural that the top-selling item categories are also game-related.</div>

In [None]:
train.groupby('item_category_name')[['total_sales','item_cnt_day']].sum().sort_values('total_sales',ascending=False).head()

### Monthly Aggregation

>Summary statistics per month, shop and item

In [None]:
train[["date_block_num","shop_id","item_id","date","item_price","item_cnt_day","total_sales"]].groupby(["date_block_num","shop_id","item_id"])\
            .agg({"date":["min",'max'],"item_price":"mean","item_cnt_day":"sum","total_sales":"sum"}).head(10)

### Time Series Graph of Whole Company Sales
><div class="alert alert-success" role="alert">
>Get the company-wide sales for each month, and see how sales change over the years.
><ul>
><li>There are two clear peaks, 12 points apart<br/>
>    <p>the data is aggregated by month, so 12 points correspond to one year</p></li>
><li>The peaks fall in December (date_block_num 11 and 23), so they are probably related to Christmas and New Year sales</li>
><li>Monthly Item Count shows a decreasing trend, but Monthly Sales does not</li>
></ul>
></div>

In [None]:
monthly_ts = train.groupby(["date_block_num"])[["total_sales","item_cnt_day"]].sum()
month_ts_sales = hv.Curve(monthly_ts["total_sales"]).opts(title="Monthly Sales Time Series", xlabel="Month", ylabel="Total Sales")
month_ts_cnt = hv.Curve(monthly_ts["item_cnt_day"]).opts(title="Monthly Item Count Time Series", xlabel="Month", ylabel="Item Count")
(month_ts_sales + month_ts_cnt).opts(opts.Curve(width=400, height=300,tools=['hover'],show_grid=True))

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Basic Time Series EDA

### Stationarity
><div class="alert alert-success" role="alert">
>Stationarity is important for time-series modeling because many time-series modeling methods require the data to be stationary.<br/>
>A series is (weakly) stationary when:  
><ul>
><li><b>The mean of the data is constant over time</b></li>
><li><b>The variance of the data is constant over time</b></li>
><li><b>The autocovariance depends only on the lag, not on the point in time</b></li>
></ul>
>The basic concept of stationarity is explained in <a href='https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts'>this notebook</a>.
></div>

><div class="alert alert-success" role="alert">
>To test for stationarity, I used the Augmented Dickey-Fuller (ADF) test. The p-value for item_cnt_day is below 5%, but the p-value for total_sales is close to 10%.<br/>
>So, at the 5% significance level, we can reject the unit-root null hypothesis for item_cnt_day and treat it as stationary (around a linear trend, since the test is run with <code>regression='ct'</code>), but we cannot do so for total_sales.
></div>

In [None]:
print('ADF testing ...')
print(f"p-value[total_sales] : {adfuller(monthly_ts['total_sales'].values, autolag='AIC', regression = 'ct')[1]}")
print(f"p-value[item_cnt_day] : {adfuller(monthly_ts['item_cnt_day'].values, autolag='AIC', regression = 'ct')[1]}")
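
>When the ADF test cannot reject a unit root, as for total_sales, a common next step is differencing. The sketch below uses a synthetic series (a linear trend plus a made-up 12-month pattern, not the competition data) to show what first and seasonal differences remove.

```python
import numpy as np

t = np.arange(36)
# Made-up 12-month seasonal pattern, repeated for three years
season = np.tile(np.array([0, 1, 3, 2, 0, -1, -2, -1, 0, 1, 4, 9], dtype=float), 3)
y = 2.0 * t + season   # synthetic: linear trend + seasonal pattern

# y[t] - y[t-1]: removes the linear trend, keeps a repeating seasonal pattern
first_diff = np.diff(y)

# y[t] - y[t-12]: removes the seasonal pattern; the trend leaves a constant 2 * 12 = 24
seasonal_diff = y[12:] - y[:-12]
```

>On real data the differenced series would be passed to `adfuller` again to check whether the p-value has dropped.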

### Trend in Time Series
><div class="alert alert-success" role="alert">
><ul>
><li>Both Sales and Item Count show seasonality</li>
><li>Item Count has a consistently decreasing trend, whereas Sales increases until the middle of the period and decreases afterwards</li>
><li>The series do not look linear, so I chose the 'multiplicative' model for the decomposition</li>
></ul>
></div>

In [None]:
sales_dec = sm.tsa.seasonal_decompose(monthly_ts["total_sales"].values,period=12,model="multiplicative").plot()

In [None]:
item_cnt_dec = sm.tsa.seasonal_decompose(monthly_ts["item_cnt_day"].values,period=12,model="multiplicative").plot()
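
>For reference, the two decomposition models differ only in how the components are combined. With trend $T_t$, seasonal component $S_t$ and residual $R_t$ (standard textbook notation, not taken from this notebook):

$$
\text{additive: } y_t = T_t + S_t + R_t, \qquad \text{multiplicative: } y_t = T_t \times S_t \times R_t
$$

>Taking logarithms turns the multiplicative model into an additive one, $\log y_t = \log T_t + \log S_t + \log R_t$, which is why the multiplicative model requires strictly positive values.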

### Periodicity
><div class="alert alert-success" role="alert">
>The autocorrelation function is useful for confirming the yearly periodicity in the data. In the correlograms, a 12-month period is visible.
></div>

In [None]:
sales_acf = sm.graphics.tsa.plot_acf(monthly_ts["total_sales"].values, lags=24)

In [None]:
item_cnt_acf = sm.graphics.tsa.plot_acf(monthly_ts["item_cnt_day"].values, lags=24)
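
>To make the correlogram less of a black box, here is a minimal NumPy sketch of the (biased) sample autocorrelation on a synthetic sine wave with a 12-month period. `sample_acf` is a hypothetical helper written for this illustration; the data is not the competition data.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Biased sample autocorrelation r_k for k = 0..max_lag."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d * d)
    # r_k = sum_t d[t] * d[t+k] / sum_t d[t]^2
    return np.array([np.sum(d[:len(d) - k] * d[k:]) / denom for k in range(max_lag + 1)])

t = np.arange(36)
y = np.sin(2 * np.pi * t / 12)   # synthetic series with a 12-month period
acf = sample_acf(y, 24)
```

>The autocorrelation is positive at lag 12 and negative at lag 6 (half a period), which is the signature of a yearly cycle in monthly data.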

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Outlier
><div class="alert alert-success" role="alert">
><ul>
><li>There are obvious outliers in both item price and item sales per day</li>
><li>Item sales per day has a wider distribution; its long tail suggests that large quantities are sold during sales promotions or inventory clearance</li>
></ul>
></div>

In [None]:
price_bx = hv.BoxWhisker(train[['item_price']].sort_values('item_price',ascending=False)[0:500].values,label='Item Price BoxPlot',vdims='Price')
cnt_bx = hv.BoxWhisker(train[['item_cnt_day']].sort_values('item_cnt_day',ascending=False)[0:500].values,label='Item Count Day BoxPlot',vdims='Count')
(price_bx + cnt_bx).opts(opts.BoxWhisker(width=300, height=300,show_grid=True,tools=['hover']))

>Replacing outliers with the mean of valid rows that match as closely as possible (same shop/item/month first, then progressively looser matches).

In [None]:
def fill_anomaly(x, trg):
    # Replacement value for an anomalous row: mean of `trg` over valid rows,
    # starting from the closest match and loosening it until a value is found.
    shop_id = int(x.shop_id)
    item_id = int(x.item_id)
    db_num = int(x.date_block_num)
    # valid rows: count in [0, 1000) and price in [0, 100000)
    valid = train[(train.item_cnt_day<1000)&(train.item_cnt_day>=0)&(train.item_price<100000)&(train.item_price>=0)]
    # 1) same shop, item and month
    ret = valid[(valid.shop_id==shop_id)&(valid.item_id==item_id)&(valid.date_block_num==db_num)][trg].mean()
    if np.isnan(ret):
        # 2) same shop and item, any month
        ret = valid[(valid.shop_id==shop_id)&(valid.item_id==item_id)][trg].mean()
    if np.isnan(ret):
        # 3) same item, any shop
        ret = valid[valid.item_id==item_id][trg].mean()
    if np.isnan(ret):
        # 4) same shop, any item
        ret = valid[valid.shop_id==shop_id][trg].mean()
    return ret

In [None]:
tmp = train[['date_block_num','shop_id','item_id','item_price','item_cnt_day']]

train.loc[(train['item_cnt_day'] < 0),'item_cnt_day'] = tmp[tmp['item_cnt_day'] < 0].parallel_apply(fill_anomaly, trg='item_cnt_day', axis=1)
train.loc[(train['item_cnt_day'] > 1000),'item_cnt_day'] = tmp[tmp['item_cnt_day'] > 1000].parallel_apply(fill_anomaly, trg='item_cnt_day', axis=1)
train.loc[(train['item_price'] < 0),'item_price'] = tmp[tmp['item_price'] < 0].parallel_apply(fill_anomaly, trg='item_price', axis=1)
train.loc[(train['item_price'] > 100000),'item_price'] = tmp[tmp['item_price'] > 100000].parallel_apply(fill_anomaly, trg='item_price', axis=1)
#recalculate total_sales
train['total_sales'] = train['item_price'] * train['item_cnt_day']
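
>The same fallback idea can be sketched without a row-wise apply. This is an illustrative alternative on toy data, not the notebook's method: it uses only two fallback levels, and `between(0, 1000)` is inclusive, so its upper bound differs slightly from the `< 1000` condition above.

```python
import pandas as pd

# Toy data: the third row has a negative count and should be replaced
toy = pd.DataFrame({'shop_id': [1, 1, 1, 2],
                    'item_id': [10, 10, 10, 10],
                    'item_cnt_day': [2.0, 4.0, -1.0, 6.0]})

valid = toy['item_cnt_day'].between(0, 1000)

# Mean over valid rows per (shop, item), with the per-item mean as a fallback
by_shop_item = toy[valid].groupby(['shop_id', 'item_id'])['item_cnt_day'].mean().rename('m1')
by_item = toy[valid].groupby('item_id')['item_cnt_day'].mean().rename('m2')

filled = toy.join(by_shop_item, on=['shop_id', 'item_id']).join(by_item, on='item_id')
fallback = filled['m1'].fillna(filled['m2'])

# Keep valid values, replace the others with the fallback mean
toy['item_cnt_fixed'] = toy['item_cnt_day'].where(valid, fallback)
```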

In [None]:
del items,item_categories,sales_train,shop_rank_df,item_rank_df,price_rank,item_cat_rank_df,monthly_ts,month_ts_sales,month_ts_cnt,\
    sales_dec,item_cnt_dec,sales_acf,item_cnt_acf,price_bx,cnt_bx,tmp
gc.collect()

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 6. Modelling

## Let's try Prophet

### Data Preparation

>Prophet expects a dataframe with a `ds` (datestamp) column and a `y` (value) column, so we reshape the monthly sales accordingly.

In [None]:
prophet_df = pd.DataFrame()
prophet_df["ds"] = pd.date_range(start = '2013-01-01',end='2015-10-01', freq = 'MS')
prophet_df['y'] = train.groupby(["date_block_num"])["total_sales"].sum()
prophet_df.head(3)

### Model Fitting & Visualization

In [None]:
m = Prophet(changepoint_prior_scale=0.08)
m.fit(prophet_df)
future = m.make_future_dataframe(periods = 20, freq = 'MS')
prophe_result = m.predict(future)
prophe_result.tail(3)

>With almost no parameter tuning (only `changepoint_prior_scale` is raised from its default of 0.05), the forecast reproduces the decreasing trend and the yearly peak.

In [None]:
fig1 = m.plot(prophe_result)
ax = fig1.gca()
ax.set_title("Sales Prediction", size=25)
ax.set_xlabel("Date", size=15)
ax.set_ylabel("Sales", size=15)
a = add_changepoints_to_plot(ax, m, prophe_result)

>The trend and seasonality look similar to those obtained with the traditional decomposition method.

In [None]:
fig2 = m.plot_components(prophe_result)

### Modeling with Prophet 

>The test data has 214200 rows, one per (shop, item) pair.  
>So we would need to build 214200 separate time-series models with Prophet, which is impractical.  
> <font color='red'>I decided not to use Prophet for this problem.</font>

In [None]:
len(test[test.duplicated(['shop_id','item_id'])==False])
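
>A back-of-the-envelope estimate shows why one model per series is impractical. The fitting time per model below is an assumption chosen for illustration, not a measurement.

```python
n_series = 214200          # number of (shop, item) pairs in the test set
seconds_per_fit = 1.0      # ASSUMPTION: illustrative cost of fitting one Prophet model
total_hours = n_series * seconds_per_fit / 3600
```

>Under this assumption the fits alone would take about 59.5 hours on a single core, before any tuning or prediction.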

In [None]:
del prophet_df,m,future,prophe_result,fig1,ax,a,fig2
gc.collect()

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

## Let's try Lightgbm

### Feature Engineering
><div class="alert alert-success" role="alert">
><ul>
><li>Monthly item count for each shop and item</li>
><li>Christmas flag<br/>
>    <p>In December, consumers are expected to buy many more gifts than in other months</p></li>
><li>Some lag features</li>
><li>Difference of total sales between months</li>
></ul>
></div>

>Making new dataframe for feature engineering

In [None]:
train.drop(['date','item_name', 'item_category_name', 'item_maincategory_name', 'item_subcategory_name', 'shop_name', 'city_name'],axis=1, inplace=True)
test.drop(['ID','item_name', 'item_category_name', 'item_maincategory_name', 'item_subcategory_name', 'shop_name', 'city_name'],axis=1, inplace=True)
# `keys` is not needed here: it labels the concatenated frames and is ignored with ignore_index=True
data = pd.concat([train, test], ignore_index=True, sort=False).fillna(0)
data.head(3)

In [None]:
feature_df = []
cols = ['date_block_num','shop_id','item_id']
for i in range(35):
    sales = data[data.date_block_num==i]
    feature_df.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
    
feature_df = pd.DataFrame(np.vstack(feature_df), columns=cols)
feature_df['date_block_num'] = feature_df['date_block_num'].astype(np.int8)
feature_df['shop_id'] = feature_df['shop_id'].astype(np.int8)
feature_df['item_id'] = feature_df['item_id'].astype(np.int16)
feature_df.sort_values(cols,inplace=True)

feature_df = pd.merge(feature_df, shops[['shop_id','city_code']], on=['shop_id'], how='left')
feature_df = pd.merge(feature_df, item_info[['item_id','item_category_id','item_maincategory_id','item_subcategory_id']], on=['item_id'], how='left')
feature_df['city_code'] = feature_df['city_code'].astype(np.int8)
feature_df['item_category_id'] = feature_df['item_category_id'].astype(np.int8)
feature_df['item_maincategory_id'] = feature_df['item_maincategory_id'].astype(np.int8)
feature_df['item_subcategory_id'] = feature_df['item_subcategory_id'].astype(np.int8)
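
>The grid built above contains, for each month, every combination of the shops and items seen in that month, so pairs with no sales also get a row (later filled with 0). A minimal sketch on toy data with made-up ids:

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy sales records (illustrative ids): month 0 has 2 shops and 2 items, month 1 has 1 of each
sales = pd.DataFrame({'date_block_num': [0, 0, 1],
                      'shop_id': [1, 2, 1],
                      'item_id': [10, 11, 10]})

cols = ['date_block_num', 'shop_id', 'item_id']
grid = []
for month in sorted(sales['date_block_num'].unique()):
    cur = sales[sales['date_block_num'] == month]
    # Cartesian product of this month's shops and items
    grid.append(np.array(list(product([month], cur['shop_id'].unique(), cur['item_id'].unique())),
                         dtype='int16'))
grid = pd.DataFrame(np.vstack(grid), columns=cols)
```

>Month 0 contributes 2 x 2 = 4 rows, including (shop 1, item 11), which never appears in the sales records.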

>Item Count by month in each shop and item

In [None]:
tmp = data.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
tmp.columns = ['item_cnt_month']
tmp.reset_index(inplace=True)
feature_df = pd.merge(feature_df, tmp, on=['date_block_num','shop_id','item_id'], how='left')
feature_df['item_cnt_month'] = (feature_df['item_cnt_month'].fillna(0).clip(0,20).astype(np.float16))

>Christmas flag

In [None]:
# December corresponds to date_block_num 11 and 23 (block 0 is January 2013)
feature_df['christmas'] = ((feature_df['date_block_num'] + 1) % 12 == 0).astype(np.int8)
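
>The flag relies on `date_block_num` 0 being January 2013. A quick check of which blocks are December, using pandas `Period` arithmetic (a sanity check added for illustration):

```python
import pandas as pd

start = pd.Period('2013-01', freq='M')   # date_block_num 0

# Blocks 0..34 cover the training months plus the test month
december_blocks = [b for b in range(35) if (start + b).month == 12]
test_month = str(start + 34)
```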

>Function for building N-month lag features ([sample notebook](https://www.kaggle.com/dlarionov/feature-engineering-xgboost#Part-1,-perfect-features))

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
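
>To see what `lag_feature` produces, it can be run on a toy frame with a single shop/item pair over three months. The function is repeated here so that the sketch is self-contained; the values are made up.

```python
import pandas as pd

def lag_feature(df, lags, col):
    tmp = df[['date_block_num', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', col + '_lag_' + str(i)]
        # Moving month m to m + i makes the value from month m appear on the row for month m + i
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

toy = pd.DataFrame({'date_block_num': [0, 1, 2],
                    'shop_id': [1, 1, 1],
                    'item_id': [10, 10, 10],
                    'item_cnt_month': [5.0, 7.0, 9.0]})

out = lag_feature(toy, [1, 2], 'item_cnt_month')
```

>Rows without enough history get `NaN`: month 0 has no lag-1 value, and months 0 and 1 have no lag-2 value.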

>Item count by month for 1,2,3,6,12 lag  

In [None]:
feature_df = lag_feature(feature_df, [1,2,3,6,12], 'item_cnt_month')

>Item count mean by month for 1 lag  

In [None]:
tmp = feature_df.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num'], how='left')
feature_df['item_cnt_month_avg'] = feature_df['item_cnt_month_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_avg')

>Item count mean by month/item for 1,2,3,6,12 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_item_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_id'], how='left')
feature_df['item_cnt_month_item_avg'] = feature_df['item_cnt_month_item_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1,2,3,6,12], 'item_cnt_month_item_avg')

>Item count mean by month/shop for 1,2,3,6,12 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_shop_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','shop_id'], how='left')
feature_df['item_cnt_month_shop_avg'] = feature_df['item_cnt_month_shop_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1,2,3,6,12], 'item_cnt_month_shop_avg')

>Item count mean by month/item_category for 1 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_cat_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_category_id'], how='left')
feature_df['item_cnt_month_cat_avg'] = feature_df['item_cnt_month_cat_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_cat_avg')

>Item count mean by month/main item category for 1 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_maincategory_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_maincat_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_maincategory_id'], how='left')
feature_df['item_cnt_month_maincat_avg'] = feature_df['item_cnt_month_maincat_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_maincat_avg')

>Item count mean by month/main item category/shop for 1 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_maincategory_id', 'shop_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_maincat_shop_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_maincategory_id','shop_id'], how='left')
feature_df['item_cnt_month_maincat_shop_avg'] = feature_df['item_cnt_month_maincat_shop_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_maincat_shop_avg')

>Item count mean by month/sub item category for 1 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_subcategory_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_subcat_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_subcategory_id'], how='left')
feature_df['item_cnt_month_subcat_avg'] = feature_df['item_cnt_month_subcat_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_subcat_avg')

>Item count mean by month/sub item category/shop for 1 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'item_subcategory_id', 'shop_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_subcat_shop_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','item_subcategory_id','shop_id'], how='left')
feature_df['item_cnt_month_subcat_shop_avg'] = feature_df['item_cnt_month_subcat_shop_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1], 'item_cnt_month_subcat_shop_avg')

>Item count mean by month/city for 1,2,3,6,12 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'city_code']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_city_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','city_code'], how='left')
feature_df['item_cnt_month_city_avg'] = feature_df['item_cnt_month_city_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1,2,3,6,12], 'item_cnt_month_city_avg')

>Item count mean by month/city/item for 1,2,3,6,12 lag

In [None]:
tmp = feature_df.groupby(['date_block_num', 'city_code', 'item_id']).agg({'item_cnt_month': ['mean']})
tmp.columns = ['item_cnt_month_city_item_avg']
tmp.reset_index(inplace=True)

feature_df = pd.merge(feature_df, tmp, on=['date_block_num','city_code','item_id'], how='left')
feature_df['item_cnt_month_city_item_avg'] = feature_df['item_cnt_month_city_item_avg'].astype(np.float16)
feature_df = lag_feature(feature_df, [1,2,3,6,12], 'item_cnt_month_city_item_avg')

In [None]:
del tmp
gc.collect()
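
>The cells above all repeat one pattern: compute a monthly group mean of `item_cnt_month`, then lag it. A compact sketch of that pattern follows. `add_group_mean_lag` is a hypothetical helper, not part of the notebook; unlike the cells above, it lags at the group level and does not keep the unlagged mean column.

```python
import pandas as pd

def add_group_mean_lag(df, group_cols, lags, name):
    """Hypothetical helper: monthly mean of item_cnt_month per group, added as lagged columns."""
    keys = ['date_block_num'] + group_cols
    means = df.groupby(keys)['item_cnt_month'].mean().rename(name).reset_index()
    for lag in lags:
        shifted = means.rename(columns={name: f'{name}_lag_{lag}'})
        # The mean of month m becomes a feature of month m + lag
        shifted['date_block_num'] = shifted['date_block_num'] + lag
        df = df.merge(shifted, on=keys, how='left')
    return df

# Toy data: two shops, one item, two months
toy = pd.DataFrame({'date_block_num': [0, 0, 1, 1],
                    'shop_id': [1, 2, 1, 2],
                    'item_id': [10, 10, 10, 10],
                    'item_cnt_month': [2.0, 4.0, 6.0, 8.0]})

out = add_group_mean_lag(toy, ['item_id'], [1], 'item_cnt_month_item_avg')
```

>Because only past months feed each row, the current month's target never leaks into its own features.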

>Item price trend

In [None]:
trend_df = feature_df[['date_block_num','shop_id','item_id']]

#per item average
tmp1 = data.groupby(['item_id']).agg({'item_price': ['mean']})
tmp1.columns = ['item_price_item_avg']
tmp1.reset_index(inplace=True)
trend_df = pd.merge(trend_df, tmp1, on=['item_id'], how='left')
trend_df['item_price_item_avg'] = trend_df['item_price_item_avg'].astype(np.float16)

#per item&month average
tmp2 = data.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
tmp2.columns = ['item_price_month_item_avg']
tmp2.reset_index(inplace=True)
trend_df = pd.merge(trend_df, tmp2, on=['date_block_num','item_id'], how='left')
trend_df['item_price_month_item_avg'] = trend_df['item_price_month_item_avg'].astype(np.float16)

lags = [1,2,3,4,5,6,12]
trend_df = lag_feature(trend_df, lags, 'item_price_month_item_avg')

for i in lags:
    trend_df['delta_price_lag_'+str(i)] = (trend_df['item_price_month_item_avg_lag_'+str(i)] - trend_df['item_price_item_avg']) / trend_df['item_price_item_avg']

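def select_trend(x):
    # NaN is truthy, so test for missing values explicitly and
    # return the most recent non-null, non-zero price delta
    for i in lags:
        delta = x['delta_price_lag_'+str(i)]
        if pd.notnull(delta) and delta != 0:
            return delta
    return 0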
    
trend_df['delta_price_lag'] = trend_df.parallel_apply(select_trend, axis=1)
trend_df['delta_price_lag'] = trend_df['delta_price_lag'].astype(np.float16).fillna(0)

feature_df['delta_price_lag'] = trend_df['delta_price_lag']

In [None]:
del tmp1,tmp2,trend_df
gc.collect()
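
>In a row-wise selection such as `select_trend`, missing lags need an explicit check, because `NaN` is truthy in Python and a plain `if value:` test does not skip it. A minimal sketch on hypothetical toy data, which also shows a vectorized way to pick the most recent non-missing delta with `bfill(axis=1)`:

```python
import numpy as np
import pandas as pd

print(bool(float('nan')))  # True: NaN passes a plain truthiness test

# Hypothetical toy frame with the same column naming as trend_df.
toy = pd.DataFrame({
    'delta_price_lag_1': [np.nan, 0.10, np.nan],
    'delta_price_lag_2': [-0.25, 0.30, np.nan],
    'delta_price_lag_3': [0.05, np.nan, np.nan],
})
delta_cols = [c for c in toy.columns if c.startswith('delta_price_lag_')]

# Back-fill across columns so the first column holds the most recent
# non-null delta per row; rows with no price history fall back to 0.
toy['delta_price_lag'] = toy[delta_cols].bfill(axis=1).iloc[:, 0].fillna(0)
```

>The vectorized form avoids a Python-level loop over rows, which can matter on a frame with millions of shop/item/month rows.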

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

>Month number / number of days in a month

In [None]:
feature_df['month'] = feature_df['date_block_num'] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
feature_df['days_num'] =  feature_df['month'].map(days).astype(np.int8)
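
>The fixed table of month lengths ignores leap years. Assuming `date_block_num` 0 corresponds to January 2013 (the start of the competition data), the standard `calendar` module gives the exact day count for each block. A minimal sketch with a hypothetical helper name:

```python
import calendar

def days_in_block(date_block_num, start_year=2013):
    """Days in the calendar month of a block index (0 = January of start_year)."""
    year = start_year + date_block_num // 12
    month = date_block_num % 12 + 1  # calendar months are 1-based
    return calendar.monthrange(year, month)[1]

# Blocks 0..34 span January 2013 to November 2015, which contains no
# leap-year February, so the fixed table and the calendar agree here.
fixed = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
agree = all(days_in_block(b) == fixed[b % 12] for b in range(35))
```

>The two approaches would first differ at block 37 (February 2016), so the fixed table is safe for this dataset.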

>Total sales trend

In [None]:
total_sales_df = feature_df[['date_block_num','shop_id','item_id']]

# total sales per shop and month
tmp1 = data.groupby(['date_block_num','shop_id']).agg({'total_sales': ['sum']})
tmp1.columns = ['date_shop_total_sales']
tmp1.reset_index(inplace=True)
total_sales_df = pd.merge(total_sales_df, tmp1, on=['date_block_num','shop_id'], how='left')
total_sales_df['date_shop_total_sales'] = total_sales_df['date_shop_total_sales'].astype(np.float32)

# mean of the monthly shop totals over all months
tmp2 = total_sales_df.groupby(['shop_id']).agg({'date_shop_total_sales': ['mean']})
tmp2.columns = ['shop_avg_total_sales']
tmp2.reset_index(inplace=True)
total_sales_df = pd.merge(total_sales_df, tmp2, on=['shop_id'], how='left')
total_sales_df['shop_avg_total_sales'] = total_sales_df['shop_avg_total_sales'].astype(np.float32)

total_sales_df['delta_total_sales'] = (total_sales_df['date_shop_total_sales'] - total_sales_df['shop_avg_total_sales']) / total_sales_df['shop_avg_total_sales']
total_sales_df['delta_total_sales'] = total_sales_df['delta_total_sales'].astype(np.float16)

feature_df['delta_total_sales'] = total_sales_df['delta_total_sales']
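
>As a worked illustration of the ratio above on hypothetical numbers: a shop whose monthly totals are 100, 200 and 300 has a mean of 200, so its deltas are -0.5, 0 and 0.5. If `data` holds only the training sales, the month to be predicted has no total and its delta is `NaN`. Shifting the delta by one month, in the same spirit as the lag features, is one possible way to make it available at prediction time. A minimal sketch:

```python
import pandas as pd

# Hypothetical monthly totals for a single shop.
toy = pd.DataFrame({
    'date_block_num': [0, 1, 2],
    'shop_id': [5, 5, 5],
    'date_shop_total_sales': [100.0, 200.0, 300.0],
})

# Relative deviation of each month's total from the shop's overall mean.
shop_mean = toy.groupby('shop_id')['date_shop_total_sales'].transform('mean')
toy['delta_total_sales'] = (toy['date_shop_total_sales'] - shop_mean) / shop_mean

# One-month lag per shop (assumes one row per shop and month, sorted by
# month), so each row only sees the previous month's delta.
toy['delta_total_sales_lag_1'] = toy.groupby('shop_id')['delta_total_sales'].shift(1)
```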

>Number of months since first sale

In [None]:
feature_df['item_shop_first_sale'] = feature_df['date_block_num'] - feature_df.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
feature_df['item_first_sale'] = feature_df['date_block_num'] - feature_df.groupby('item_id')['date_block_num'].transform('min') 
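
>Both features are differences of `date_block_num`, so they count months rather than days. A minimal sketch on hypothetical toy data, where one item appears in shop 1 from month 3 and in shop 2 from month 4:

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [3, 4, 5, 4, 5],
    'shop_id':        [1, 1, 1, 2, 2],
    'item_id':        [7, 7, 7, 7, 7],
})

# First month in which each item/shop pair, and each item, appears.
first_pair = toy.groupby(['item_id', 'shop_id'])['date_block_num'].transform('min')
first_item = toy.groupby('item_id')['date_block_num'].transform('min')

toy['item_shop_first_sale'] = toy['date_block_num'] - first_pair
toy['item_first_sale'] = toy['date_block_num'] - first_item
```

>For shop 2 the pair-level feature restarts at 0 in month 4, while the item-level feature is already 1 because the item first appeared in month 3.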

>Because the largest lag is 12 months, the first 12 months have no complete lag history, so we drop them.

In [None]:
feature_df = feature_df[feature_df.date_block_num > 11]

>Fill null values in the lag features with 0

In [None]:
for col in feature_df.columns:
    if '_lag_' in col and feature_df[col].isnull().any():
        # assign back instead of a chained inplace fillna, which may not update the frame
        feature_df[col] = feature_df[col].fillna(0)

>With the features created above, we build the model.

In [None]:
feature_df.columns

In [None]:
feature_df.head(3)

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

### Data Preparation

In [None]:
X_train = feature_df[feature_df.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = feature_df[feature_df.date_block_num < 33]['item_cnt_month']
X_valid = feature_df[feature_df.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = feature_df[feature_df.date_block_num == 33]['item_cnt_month']
X_test = feature_df[feature_df.date_block_num == 34].drop(['item_cnt_month'], axis=1)

lgb_train = lgb.Dataset(X_train, Y_train)
lgb_valid = lgb.Dataset(X_valid, Y_valid, reference=lgb_train)

### Hyperparameter Tuning with Optuna

In [None]:
params = {
    'task' : 'train',
    'boosting' : 'gbdt',
    'objective': 'regression',
    'metric': 'l2'
}
best_params, history = {}, []
model = lgb_optuna.train(params, 
                  lgb_train, 
                  valid_sets=lgb_valid,
                  verbose_eval=False,
                  num_boost_round=20,
                  early_stopping_rounds=5,
                  best_params=best_params,
                  tuning_history=history)

In [None]:
best_params

In [None]:
params.update(best_params)
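
>The LightGBM Tuner follows a step-wise strategy: it tunes one group of hyperparameters at a time and carries the best value found so far into the next step, instead of searching all parameters jointly. The idea can be sketched in plain Python on a made-up objective. The function `toy_loss`, the parameter grids and their values are all hypothetical and only stand in for a real validation score:

```python
def toy_loss(params):
    """Hypothetical validation loss, minimal at num_leaves=31, feature_fraction=0.8."""
    return (params['num_leaves'] - 31) ** 2 / 100 + (params['feature_fraction'] - 0.8) ** 2

def stepwise_search(loss, start, grids):
    """Tune one parameter at a time, keeping earlier winners fixed."""
    best = dict(start)
    for name, candidates in grids:  # the order of the steps matters
        scores = {value: loss({**best, name: value}) for value in candidates}
        best[name] = min(scores, key=scores.get)
    return best

grids = [
    ('feature_fraction', [0.4, 0.6, 0.8, 1.0]),
    ('num_leaves', [15, 31, 63, 127]),
]
tuned = stepwise_search(toy_loss, {'num_leaves': 127, 'feature_fraction': 1.0}, grids)
```

>With two grids of four values each, the step-wise search evaluates 4 + 4 = 8 configurations instead of the 16 a full grid would need. The trade-off is that interactions between parameters tuned in different steps can be missed.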

### Modeling with LightGBM using best parameter

In [None]:
gbm = lgb.train(params,
            lgb_train,
            num_boost_round=100,
            valid_sets=lgb_valid,
            early_stopping_rounds=100)

### Feature Importance

In [None]:
feature_imp = pd.DataFrame()
feature_imp['feature'] = gbm.feature_name()
feature_imp['importance'] = gbm.feature_importance()
hv.Bars(feature_imp.sort_values(by='importance', ascending=True)).opts(title="Feature Importance", color="purple", xlabel="Features", ylabel="Importance", invert_axes=True)\
                            .opts(opts.Bars(width=700, height=700, tools=['hover'], show_grid=True))

### Prediction & Submission

In [None]:
Y_test = gbm.predict(X_test).clip(0, 20)
submission['item_cnt_month'] = Y_test
submission.to_csv('submission.csv', index=False)
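
>The competition scores submissions by RMSE against targets clipped to the range [0, 20], which is why the predictions are clipped before saving. A minimal sketch, on hypothetical numbers and with a hypothetical helper name, of scoring a hold-out month the same way:

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0, high=20):
    """RMSE after clipping both targets and predictions to [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and raw predictions.
y_true = [0, 2, 35]         # 35 is clipped to 20
y_pred = [-1.0, 2.0, 24.0]  # -1 is clipped to 0, 24 to 20
score = clipped_rmse(y_true, y_pred)
```

>Applied to this notebook, the same helper could be called as `clipped_rmse(Y_valid, gbm.predict(X_valid))` to get a validation score on the scale used by the leaderboard.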

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>

# 7. Conclusion
><div class="alert alert-success" role="alert">
>According to the feature importance analysis above:
><ul>
><li>shop- and city-related features tend to be more important than the other features</li>
><li>lag features with large lags (3, 4, 5, ..., 12) tend to be less important than those with small lags</li>
></ul>
></div>

# 8. References

>* **Good EDA example**  
>https://www.kaggle.com/kabure/simple-eda-model-hyperopt-w-easy-code
>* **Good notebook about time series analysis**  
>https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts  
>* **Prophet documentation**  
>https://facebook.github.io/prophet/docs/quick_start.html
>* **Prophet Paper**  
>https://peerj.com/preprints/3190.pdf
>* **Very good example of boosting method**  
>https://www.kaggle.com/dlarionov/feature-engineering-xgboost#Part-1,-perfect-features  
>* **Optuna LightGBM Tuner blog post (step-wise algorithm)**  
>https://tech.preferred.jp/en/blog/lightgbm-tuner-new-optuna-integration-for-hyperparameter-optimization/  

<a href="#top" class="btn btn-success btn-sm active" role="button" aria-pressed="true" style="color:white;">Table of Contents</a>