# Monthly Sales Prediction

We are asked to predict next month's total sales for every product and store of one of the largest Russian software firms, 1C Company. They provided us with a time-series dataset consisting of daily sales data.

* <a href="https://www.kaggle.com/c/competitive-data-science-predict-future-sales">Competition link</a>
* Very informative <a href="https://www.kaggle.com/gordotron85/future-sales-xgboost-top-3">kernel</a> that I used as a starting point

Data Fields:
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category

## Dependencies

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import LabelEncoder

import lightgbm as lgb

import shap

## Dataset

In [None]:
sales = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv',
                    parse_dates=["date"])
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
item_cat = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

## EDA

### Outliers
Several values of price and count seem abnormal: either extremely high or negative.
There might be an actual explanation, e.g. a refund. These rows are dropped for now.

In [None]:
fig, ax = plt.subplots(figsize=(8, 3))
ax.set_xscale('symlog')
ax.set_title('Prices distribution', fontsize=14)
sns.boxplot(x=sales.item_price, palette='rainbow', ax=ax)

In [None]:
fig, ax = plt.subplots(figsize=(8, 3))
ax.set_xscale('symlog')
ax.set_title('Distribution of number of product sold', fontsize=14)
sns.boxplot(x=sales.item_cnt_day, palette='rainbow', ax=ax)

In [None]:
sales = sales.query('item_cnt_day > 0 & item_cnt_day < 1000').copy()
sales = sales.query('item_price > 0 & item_price <= 50000').copy()
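
To see what each threshold removes, the same conditions can be written as boolean masks and counted. This is a minimal sketch on made-up rows (the values in `toy` are hypothetical, only the column names and thresholds come from the cell above).

```python
import pandas as pd

# Toy rows (hypothetical values) using the column names of sales_train.csv
toy = pd.DataFrame({
    "item_price": [199.0, -1.0, 307980.0, 499.0, 1200.0],
    "item_cnt_day": [1.0, 1.0, 1.0, -1.0, 2169.0],
})

# Same thresholds as in the cell above, expressed as boolean masks
cnt_ok = (toy["item_cnt_day"] > 0) & (toy["item_cnt_day"] < 1000)
price_ok = (toy["item_price"] > 0) & (toy["item_price"] <= 50000)

removed_by_cnt = int((~cnt_ok).sum())
removed_by_price = int((~price_ok).sum())
kept = toy[cnt_ok & price_ok]

print(f"count filter drops {removed_by_cnt} rows, "
      f"price filter drops {removed_by_price} rows, "
      f"{len(kept)} rows kept")
```

Counting before dropping makes it easy to check that the filters only touch a handful of rows.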

### Data Leakage

Shops missing from the test set are filtered out in the training data.

In [None]:
shop_ids = test['shop_id'].unique()
sales = sales[sales['shop_id'].isin(shop_ids)].copy()

### Shops

The shops dataset contains duplicates. For each duplicated pair, the entry with the higher id is kept.

In [None]:
print(f"{shops['shop_name'][0]} VS {shops['shop_name'][57]}")
print(f"{shops['shop_name'][1]} VS {shops['shop_name'][58]}")
print(f"{shops['shop_name'][10]} VS {shops['shop_name'][11]}")
print(f"{shops['shop_name'][39]} VS {shops['shop_name'][40]}")

IDs are updated in both the training and test datasets.

In [None]:
sales.loc[sales['shop_id'] == 0, 'shop_id'] = 57
sales.loc[sales['shop_id'] == 1, 'shop_id'] = 58
sales.loc[sales['shop_id'] == 10, 'shop_id'] = 11
sales.loc[sales['shop_id'] == 39, 'shop_id'] = 40

test.loc[test['shop_id'] == 0, 'shop_id'] = 57
test.loc[test['shop_id'] == 1, 'shop_id'] = 58
test.loc[test['shop_id'] == 10, 'shop_id'] = 11
test.loc[test['shop_id'] == 39, 'shop_id'] = 40
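
The eight assignments above can also be expressed with a single mapping. A possible sketch, where `duplicate_shops` and `toy` are hypothetical names and the id pairs are the ones used above:

```python
import pandas as pd

# Duplicate shop id -> id that is kept (same pairs as in the cells above)
duplicate_shops = {0: 57, 1: 58, 10: 11, 39: 40}

# Toy frame with hypothetical rows
toy = pd.DataFrame({"shop_id": [0, 1, 5, 10, 39, 57]})

# Series.replace with a dict leaves ids that are not keys untouched
toy["shop_id"] = toy["shop_id"].replace(duplicate_shops)
print(toy["shop_id"].tolist())
```

The same call could then be applied to both `sales` and `test`, which keeps the two datasets in sync.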

A rough Google translation shows that shop_name contains two pieces of information: the city name and the category of building.

In [None]:
shops["city"] = shops.shop_name.apply(lambda x: x.split()[0])
shops["category"] = shops.shop_name.apply(lambda x: x.split()[1])

In [None]:
shops.loc[shops['city'] =='!Якутск', 'city'] = 'Якутск'

Only categories shared by at least 5 shops are kept; the others are grouped under "other".

In [None]:
shops_cat = shops.category.value_counts()
shops_cat.head()

In [None]:
def thresh_filter(x,
                  items,
                  default="other"):
    return x if (x in items) else default


thresh_cat = shops_cat[shops_cat >= 5].index
shops.category = shops.category.apply(thresh_filter,
                                      args=([thresh_cat]))
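
As an illustration of the grouping, here is a small sketch on made-up labels. It uses `Series.where` with `isin` instead of `thresh_filter`, and the threshold is lowered to 2 because the toy data is tiny; both choices are assumptions for the example only.

```python
import pandas as pd

# Hypothetical building categories (transliterated labels)
categories = pd.Series(["TC", "TC", "TRC", "TC", "TK", "TRC", "TC"])

counts = categories.value_counts()
frequent = counts[counts >= 2].index  # threshold of 2 for the toy data

# Keep frequent labels, replace every other label with "other"
grouped = categories.where(categories.isin(frequent), "other")
print(grouped.tolist())
```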

New variables are encoded to be used as features.

In [None]:
shops["shop_category"] = LabelEncoder().fit_transform(shops.category)
shops["shop_city"] = LabelEncoder().fit_transform(shops.city)
shops = shops[["shop_id", "shop_category", "shop_city"]]

### Items

First sale of each item:

In [None]:
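first_sale = sales.groupby('item_id')['date_block_num'].min()
#  Map on item_id instead of relying on the index of items
#  happening to match item_id
items['first_sale_date'] = items['item_id'].map(first_sale)
#  Items never sold in the training data get 34 (the test month)
items['first_sale_date'] = items['first_sale_date'].fillna(34)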

item_category_name carries two levels of information on the item type: a type and a subtype.

In [None]:
item_cat.item_category_name.unique()[:5]

In [None]:
item_cat['type'] = item_cat.item_category_name.apply(lambda x: x.split()[0])

item_types = item_cat.type.value_counts()
thresh_type = item_types[item_types >= 5].index
item_cat['type'] = item_cat.type.apply(thresh_filter,
                                       args=([thresh_type]))

def get_subtype(x):
    split = x.split()
    if len(split) > 1:
        return split[1].strip()
    else:
        return split[0].strip()

    
item_cat['subtype'] = item_cat.item_category_name.apply(get_subtype)

In [None]:
item_cat['type_code'] = LabelEncoder().fit_transform(item_cat.type)
item_cat['subtype_code'] = LabelEncoder().fit_transform(item_cat.subtype)
item_cat = item_cat[["item_category_id", "subtype_code", "type_code"]]

### Monthly sales

We are interested in predicting the monthly equivalent of item_cnt_day for each tuple (shop, item). Sales data is therefore aggregated by month.

In [None]:
groupby_cols = ['date_block_num', 'shop_id', 'item_id']

sales['transaction'] = sales['item_cnt_day'] * sales['item_price']

monthly_sales = sales.groupby(by=groupby_cols,
                              as_index=False).agg({'item_cnt_day': ['sum',
                                                                    'count'],
                                                   'transaction': 'sum',
                                                   'item_price': 'mean',
                                                   })
monthly_sales.columns = ['date_block_num', 'shop_id', 'item_id',
                         'item_cnt', 'transaction_nb', 'transaction', 'mean_price']

Missing records are artificially created to take into account months where no items were sold.

In [None]:
def fill_missing_month(monthly_sales):
    """Creates missing tuples (date_block_num, shop_id, item_id)
    Args:
        - monthly_sales: pd.DataFrame. Monthly sales
    Return:
        pd.DataFrame
    """
    months_nb = monthly_sales.date_block_num.max()
    
    full_df = []
    for i in range(months_nb + 1):
        #  Retrieves list of shops and items for this month
        shops = monthly_sales.query('date_block_num == @i').shop_id.unique()
        items = monthly_sales.query('date_block_num == @i').item_id.unique()
        for shop in shops:
            for item in items:
                #  Creates entry
                full_df.append([i, shop, item])

    full_df = pd.DataFrame(full_df,
                           columns=['date_block_num', 'shop_id', 'item_id'])
    #  Gets information for existing tuple
    full_df = full_df.merge(monthly_sales,
                            how='left',
                            on=['date_block_num', 'shop_id', 'item_id'])
    full_df.fillna(0, inplace=True)

    return full_df

In [None]:
full_sales = fill_missing_month(monthly_sales)
full_sales.shape
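
The nested Python loops can be slow on the full dataset. A possible alternative, sketched here on a toy frame, builds the per-month grid with `itertools.product`. The frame `monthly` and its values are hypothetical; the logic is meant to mirror `fill_missing_month`, not replace it.

```python
from itertools import product

import pandas as pd

# Toy monthly sales (hypothetical values)
monthly = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 10],
    "item_cnt": [3.0, 1.0, 2.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

grid = []
for month, block in monthly.groupby("date_block_num"):
    # Every (shop, item) pair among the shops and items seen that month
    grid.extend(product([month],
                        block["shop_id"].unique(),
                        block["item_id"].unique()))

# Attach known sales, then set the created tuples to 0
full = (pd.DataFrame(grid, columns=keys)
          .merge(monthly, how="left", on=keys)
          .fillna(0))
print(full)
```

Month 0 has two shops and two items, hence four rows, two of which are created with a count of 0.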

### Top categories visualisation
We zoom out to the category level.

In [None]:
cat_sales = monthly_sales.merge(items[['item_id', 'item_category_id']],
                                how='left',
                                left_on='item_id',
                                right_on='item_id',
                                )
cat_sales = cat_sales.groupby(by=['date_block_num',
                                  'item_category_id'],
                              as_index=False).agg({'item_cnt': 'sum'})

Top categories are those with more than 100k items sold in total.

In [None]:
cat_group = cat_sales.groupby('item_category_id')\
                     .agg({'item_cnt': 'sum'})\
                     .sort_values(by='item_cnt',
                                  ascending=False)
top_cat_idx = cat_group.query('item_cnt > 100000').index
top_cat = cat_sales[cat_sales.item_category_id.isin(top_cat_idx)].copy()

Names are translated:

In [None]:
mapper = {19: 'Games - PS3',
          20: 'Games - PS4',
          23: 'Games - XBOX 360',
          28: 'PC Games - Extensions',
          30: 'PC Games - Standard edition',
          37: 'Film - Blu-Ray',
          40: 'Film - DVD',
          55: 'Music - CD local production',
          71: 'Gifts - Bags, Albums, Mouse Pads',}
top_cat.item_category_id = top_cat.item_category_id.map(mapper)

Interactive chart created with altair:

In [None]:
import altair as alt

#  We want to be able to select a specific category on
#  a bar chart
cat_filter = alt.selection_multi(fields=["item_category_id"])
cat_chart = alt.Chart().mark_bar().encode(
    x=alt.X("count()", title='Age (month)'),
    y=alt.Y("item_category_id:N"),
    color=alt.condition(
        cat_filter,
        alt.Color("item_category_id:N",
                  scale=alt.Scale(scheme='category20')),
        alt.value("lightgray")),
).properties(width=300,
             height=300,
             selection=cat_filter)

def filtered_bar(x, y, labels, filter):
    """Creates a layered chart of bar plots.
    The first layer (light gray) contains the plot of the full
    data, and the second contains the plot of the filtered data.
    Args:
     - x: abscissa, split into bins.
     - y: ordinate, summed up.
     - labels: axis titles, as [x_title, y_title].
     - filter: an alt.Selection object to be used to filter the data.
    """
    base = alt.Chart().mark_bar().encode(
        x=alt.X(x,
                bin=alt.Bin(maxbins=34),
                title=labels[0]),
        y=alt.Y(y,
                aggregate='sum',
                title=labels[1]),
          ).properties(
              width=350,
          )
    return alt.layer(
      base.transform_filter(filter),
      base.encode(color=alt.value('lightgray'),
                  opacity=alt.value(.7)),
  ).resolve_scale(y='independent')

In [None]:
alt.hconcat(
    filtered_bar('date_block_num',
                 'item_cnt',
                 ['month', 'Items sold'],
                 cat_filter),
    cat_chart,
    data=top_cat)

## Feature engineering

The test dataset is concatenated to the training data so that features can be created for both at once.

In [None]:
test['date_block_num'] = 34

#  keys is not needed here: the index is reset by ignore_index
full_sales = pd.concat([full_sales, test.drop('ID', axis=1)],
                       ignore_index=True)

full_sales = full_sales.fillna(0)

full_sales.shape

Categorical features previously defined:

In [None]:
full_sales = full_sales.merge(shops,
                              on='shop_id',
                              how='left')
full_sales = full_sales.merge(items,
                              on=['item_id'],
                              how='left')
full_sales = full_sales.merge(item_cat,
                              on='item_category_id',
                              how='left')

### Dates

In [None]:
def extract_year(date_num_block, thresh=2013):
    return date_num_block // 12 + thresh


def extract_month(date_num_block):
    return date_num_block % 12


full_sales['year'] = full_sales.date_block_num.apply(extract_year)
full_sales['month'] = full_sales.date_block_num.apply(extract_month)
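
A quick worked example of this arithmetic, written with `divmod` (the helper name `block_to_year_month` is hypothetical). Note that the month is zero-based, so October is 9.

```python
def block_to_year_month(date_block_num, first_year=2013):
    """Split a month counter into (year, zero-based month)."""
    years, month = divmod(date_block_num, 12)
    return first_year + years, month


# January 2013, December 2013, January 2014, October 2015, test month
for block in (0, 11, 12, 33, 34):
    print(block, block_to_year_month(block))
```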

New item marker:

In [None]:
full_sales['new_item'] = full_sales['first_sale_date'] == full_sales['date_block_num']

Time spent since first sale:

In [None]:
full_sales['since_first_sale'] = full_sales['date_block_num'] - full_sales['first_sale_date']

### Monthly means

In [None]:
def get_month_mean(idx_col,
                   suffixes,
                   col='item_cnt'):
    """Gets mean value for each month
    
    Args:
     - idx_col: columns to group by
     - col: column to average
    """
    df = full_sales[idx_col + [col]].groupby(idx_col).mean()
    df = full_sales.merge(df,
                          how='left',
                          on=idx_col,
                          suffixes=suffixes)
    return df

Mean monthly values:
* for a specific category in a specific shop
* for a specific item
* for a specific item in a specific city


In [None]:
full_sales = get_month_mean(['date_block_num', 'item_category_id', 'shop_id'],
                            suffixes=('', '_mean_shop_cat'))

In [None]:
full_sales = get_month_mean(['date_block_num', 'item_id'],
                            suffixes=('', '_mean_item'))

In [None]:
full_sales = get_month_mean(['date_block_num', 'item_id', 'shop_city'],
                            suffixes=('', '_mean_city'))
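
To make the result concrete, here is a sketch of one of these means on a toy frame. It uses `groupby(...).transform("mean")` rather than the groupby plus merge of `get_month_mean`; the two should give the same column, but this equivalence is stated as an assumption and the data is made up.

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "item_id": [10, 10, 11, 10, 10],
    "shop_id": [1, 2, 1, 1, 2],
    "item_cnt": [2.0, 4.0, 1.0, 0.0, 6.0],
})

# Mean count of each item over all shops, month by month,
# broadcast back to every row of the group
toy["item_cnt_mean_item"] = (
    toy.groupby(["date_block_num", "item_id"])["item_cnt"].transform("mean")
)
print(toy)
```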

### Lag features

Values taken by features for previous months:
* number of item sold
* transaction amount
* mean price
* mean values created previously


In [None]:
lag_list = [1, 2, 3]
def get_lag_feature(col,
                    idx_col,
                    lag_list=lag_list):
    """Retrieves previous values of col for each value of lag
    
    Args:
        - col: column of interest
        - idx_col: columns to group by
        - lag_list: intervals of interest"""
    for lag in lag_list:
        ft_name = f'{col}_lag{lag}'
        #  The filled result is assigned directly: an inplace fillna
        #  on a selected column is unreliable on recent pandas versions
        full_sales[ft_name] = full_sales.sort_values('date_block_num')\
                                        .groupby(idx_col)[col]\
                                        .shift(lag)\
                                        .fillna(0)

In [None]:
get_lag_feature('item_cnt', ['shop_id', 'item_id'])
get_lag_feature('transaction_nb', ['shop_id', 'item_id'])
get_lag_feature('mean_price', ['shop_id', 'item_id'])
get_lag_feature('item_cnt_mean_city', ['shop_id', 'item_id'])
get_lag_feature('item_cnt_mean_item', ['shop_id', 'item_id'])
get_lag_feature('item_cnt_mean_shop_cat', ['shop_id', 'item_id'])
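
One point worth keeping in mind: `shift` moves values by rows, so when a (shop, item) pair has no row for some month, the lag refers to the previous available row rather than the previous calendar month. The sketch below contrasts this with a month-aligned lag obtained by shifting `date_block_num` and merging. The toy frame and the column names `lag1_rows` and `lag1_months` are hypothetical.

```python
import pandas as pd

# One (shop, item) pair with no row for month 2 (hypothetical values)
toy = pd.DataFrame({
    "date_block_num": [0, 1, 3],
    "shop_id": [1, 1, 1],
    "item_id": [10, 10, 10],
    "item_cnt": [5.0, 2.0, 7.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

# Row-based lag, as in get_lag_feature
toy["lag1_rows"] = (toy.sort_values("date_block_num")
                       .groupby(["shop_id", "item_id"])["item_cnt"]
                       .shift(1)
                       .fillna(0))

# Month-based lag: move each value one month forward, then join back
shifted = toy[keys + ["item_cnt"]].copy()
shifted["date_block_num"] += 1
shifted = shifted.rename(columns={"item_cnt": "lag1_months"})
toy = toy.merge(shifted, how="left", on=keys).fillna({"lag1_months": 0})

print(toy[["date_block_num", "item_cnt", "lag1_rows", "lag1_months"]])
```

For month 3 the row-based lag returns the value of month 1, while the month-based lag returns 0 because month 2 has no record.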

The competition specifies that target values are clipped to the [0, 20] range.

In [None]:
cnt_cols = []
for col in full_sales.columns:
    if '_cnt' in col:
        cnt_cols.append(col)
        
for col in cnt_cols:
    full_sales[col] = full_sales[col].clip(0, 20)

### Trends

These features describe how the number of products sold evolved over the past months: first the mean of the 3 lagged values, then the ratios between consecutive lags.

In [None]:
full_sales['item_cnt_trend'] = full_sales[['item_cnt_lag1',
                                           'item_cnt_lag2',
                                           'item_cnt_lag3']].mean(axis=1)
full_sales['item_cnt_trend'] = full_sales['item_cnt_trend'].fillna(0)

In [None]:
full_sales['trend1'] = full_sales['item_cnt_lag1'] / full_sales['item_cnt_lag2']
full_sales['trend1'] = full_sales['trend1'].replace([np.inf, -np.inf], np.nan)
full_sales['trend1'] = full_sales['trend1'].fillna(0)

full_sales['trend2'] = full_sales['item_cnt_lag2'] / full_sales['item_cnt_lag3']
full_sales['trend2'] = full_sales['trend2'].replace([np.inf, -np.inf], np.nan)
full_sales['trend2'] = full_sales['trend2'].fillna(0)
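
The three possible outcomes of the ratio can be checked on a toy frame (hypothetical values): a regular ratio, a division by zero giving `inf`, and `0 / 0` giving `NaN`.

```python
import numpy as np
import pandas as pd

lags = pd.DataFrame({
    "item_cnt_lag1": [4.0, 3.0, 0.0],
    "item_cnt_lag2": [2.0, 0.0, 0.0],
})

# 4 / 2 = 2, 3 / 0 = inf, 0 / 0 = NaN
ratio = lags["item_cnt_lag1"] / lags["item_cnt_lag2"]

# Same cleaning as above: infinite and undefined ratios become 0
trend = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(trend.tolist())
```

With this cleaning, a trend of 0 can mean either that sales fell to zero or that the ratio was undefined.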

### Cleaning

The first 3 months lack complete lag features (they only served to build them), so they are removed from the training set.

In [None]:
full_sales = full_sales.query('date_block_num >= 3').copy()

Unused columns:

In [None]:
droped_col = ['transaction_nb', 'transaction', 'mean_price', 'item_name',
              'first_sale_date', 'item_cnt_mean_shop_cat', 'item_cnt_mean_item',
              'item_cnt_mean_city',]

In [None]:
full_sales.drop(columns=droped_col, inplace=True)

Downcast:

In [None]:
def downcast(df):
    """
    Reduces allocated memory
    
    Args:
        - df: pd.DataFrame
    Return:
        compressed pd.DataFrame
    """
    for col in df.columns:
        dtype_name = df[col].dtype.name
        if dtype_name == 'object' or dtype_name.startswith('date'):
            pass
        elif dtype_name == 'bool':
            df[col] = df[col].astype('int8')
        elif dtype_name.startswith('int') or (df[col].round() == df[col]).all():
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

In [None]:
full_sales = downcast(full_sales)
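
To illustrate what `pd.to_numeric(..., downcast=...)` does to each kind of column, here is a small sketch on a toy frame (hypothetical values). The expected dtypes assume pandas picks the smallest type that can hold the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "shop_id": [2, 59, 31],          # int64 by default
    "item_cnt": [0.0, 3.0, 20.0],    # float64 holding whole numbers
    "trend1": [0.5, 1.25, 0.0],      # float64 with fractional values
})

small = toy.copy()
small["shop_id"] = pd.to_numeric(small["shop_id"], downcast="integer")
small["item_cnt"] = pd.to_numeric(small["item_cnt"], downcast="integer")
small["trend1"] = pd.to_numeric(small["trend1"], downcast="float")

print(small.dtypes)
print(toy.memory_usage(index=False).sum(), "->",
      small.memory_usage(index=False).sum(), "bytes")
```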

In [None]:
del items, item_cat, sales, monthly_sales, cat_sales, cat_group

### Correlations

In [None]:
#  DataFrame.abs() avoids applymap, deprecated in recent pandas versions
corr_matrix = full_sales.corr().abs()

mask = np.zeros_like(corr_matrix)
mask[np.triu_indices_from(mask)] = True

fig = plt.figure(figsize=(16,8))
sns.heatmap(corr_matrix,
            mask=mask,
            vmin=0,
            cmap='Blues',
            annot=True,
            fmt='.2f',
            cbar=False)

fig.suptitle('Correlation matrix', fontsize=16)
plt.show()

## Modeling

All information regarding the current month has been deleted, except item_cnt, which will be our target. The last month available is used for validation purposes.

In [None]:
X_train = full_sales.query('date_block_num < 33')
X_train = X_train.drop(columns=['item_cnt'])

X_valid = full_sales.query('date_block_num == 33')
X_valid = X_valid.drop(columns=['item_cnt'])

X_test = full_sales.query('date_block_num == 34')
X_test = X_test.drop(columns=['item_cnt'])


y_train = full_sales.query('date_block_num < 33').item_cnt
y_valid = full_sales.query('date_block_num == 33').item_cnt

### LightGBM

In [None]:
cat_cols = ['shop_id', 'item_id', 'year', 'month', 'item_category_id',
            'shop_category', 'shop_city', 'subtype_code', 'type_code', 'new_item']
cat_cols.sort() #  avoid LightGBM warning

In [None]:
dtrain = lgb.Dataset(X_train, y_train)
dvalid = lgb.Dataset(X_valid, y_valid)

A custom GridSearch is performed - we aim to optimise a few hyperparameters:

In [None]:
params = dict(metric=['rmse'],
              num_leaves=[400], #  Default 31 [10, 31, 255, 400]
              learning_rate=[0.005], #  Default 0.1 [0.005, 0.001, 0.1]
              max_depth=[-1], #  Default -1 [10, -1]
              feature_fraction=[0.75],
              bagging_fraction=[0.75],
              bagging_freq=[5],
              random_state=[10],
              verbose=[-1])

In [None]:
grid_results = {}
for i, param in enumerate(ParameterGrid(params)):
    #  At each step a dictionary is created, containing
    #  the HPs, the trained model and the validation RMSE
    grid_results[i] = {}
    grid_results[i]['params'] = param
    #  Callbacks replace the early_stopping_rounds and verbose_eval
    #  arguments, which were removed in LightGBM 4.0
    curr_model = lgb.train(params=param,
                           train_set=dtrain,
                           num_boost_round=1500,
                           valid_sets=[dtrain, dvalid],
                           categorical_feature=cat_cols,
                           callbacks=[lgb.early_stopping(150, verbose=False),
                                      lgb.log_evaluation(period=0)])
    grid_results[i]['model'] = curr_model
    pred = curr_model.predict(X_valid)
    val_rmse = np.sqrt(mean_squared_error(y_valid, pred))
    grid_results[i]['val_rmse'] = val_rmse

In [None]:
best_iter = sorted(grid_results,
                   key=lambda k: grid_results[k]['val_rmse'])[0]
best_model = grid_results[best_iter]['model']
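
The size of the search depends on how many values each list holds. The sketch below expands a grid with `itertools.product` as a stand-in for `ParameterGrid` (the helper `expand_grid` is hypothetical). `wide_space` reuses the alternatives noted in the comments of the `params` cell, `current_space` the values actually set there, restricted to three hyperparameters for brevity.

```python
from itertools import product
from math import prod


def expand_grid(space):
    """Return one dict per combination of the listed values."""
    names = sorted(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


# Alternatives listed in the comments of the params cell
wide_space = {"num_leaves": [10, 31, 255, 400],
              "learning_rate": [0.005, 0.001, 0.1],
              "max_depth": [10, -1]}
# Values currently set: a single value per list
current_space = {"num_leaves": [400],
                 "learning_rate": [0.005],
                 "max_depth": [-1]}

for space in (wide_space, current_space):
    combos = expand_grid(space)
    # The grid size is the product of the list lengths
    assert len(combos) == prod(len(v) for v in space.values())
    print(len(combos), "candidate(s)")
```

With one value per list the loop trains a single model, so the selection step simply returns that model.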

## Feature importance

TreeSHAP is applied to a ~10% sample of the test data.

In [None]:
sample_size = int(X_test.shape[0] * 0.1)
test_sample = X_test.sample(sample_size)
best_model.params["objective"] = "regression"
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(test_sample)
shap.summary_plot(shap_values, test_sample, max_display=15)

## Submission

In [None]:
y_test = best_model.predict(X_test).clip(0, 20)

In [None]:
submission = pd.DataFrame({
    "ID": test["ID"],  #  explicit ID column rather than the row index
    "item_cnt_month": y_test
})
submission.to_csv('submission.csv', index=False)