Table of contents:

1. [READ, TRANSLATE AND MERGE TABLES](#section-1)
    - [Shops](#subsection-1-1)
    - [Categories](#subsection-1-2)
    - [Data merge](#subsection-1-3)
2. [DATA UNDERSTANDING AND CLEANING](#section-2)
    - [Data overview](#subsection-2-1)
    - [Managing date columns](#subsection-2-2)
    - [Data Exploration and Visualisation](#subsection-2-3)
        - [Sales quantity](#subsection-2-3-1)
        - [Item price](#subsection-2-3-2)
        - [Chronological sales](#subsection-2-3-3)
        - [Monthly sales - 2013 & 2014](#subsection-2-3-4)
        - [Weekday sales](#subsection-2-3-5)
        - [Shop sales](#subsection-2-3-6)
        - [Category sales](#subsection-2-3-7)
3. [FEATURE ENGINEERING](#section-3)
    - [Average prices calculation](#subsection-3-1)
    - [First / last sale, medians and modes](#subsection-3-2)
    - [Aggregating train data](#subsection-3-3)
    - [Stacking train data](#subsection-3-4)
    - [Price features](#subsection-3-5)
    - [Mean quantity features](#subsection-3-6)
    - [Lag features](#subsection-3-7)
    - [Items features](#subsection-3-8)
    - [Test data engineering](#subsection-3-9)
    - [Calendar related features](#subsection-3-10)
    - [Final steps](#subsection-3-11)
4. [FEATURE SELECTION](#section-4)
    - [Feature correlation](#subsection-4-1)
    - [Best feature selection with SelectKBest](#subsection-4-2)
    - [Best feature selection with RFECV](#subsection-4-3)
5. [MODELING](#section-5)
    - [Splitting the train data](#subsection-5-1)
    - [Selecting features to train](#subsection-5-2)
    - [Training and evaluating](#subsection-5-3)
    - [Predictions](#subsection-5-4)
6. [CREATING SUBMISSION FILE](#section-6)

<a id="section-1"></a>
# 1. READ, TRANSLATE AND MERGE TABLES #

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

%matplotlib inline

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)

items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")
item_categories = pd.read_csv("../input/competitive-data-science-predict-future-sales/item_categories.csv")
shops = pd.read_csv("../input/competitive-data-science-predict-future-sales/shops.csv")
sample_submission = pd.read_csv("../input/competitive-data-science-predict-future-sales/sample_submission.csv")

train_o = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
test_o = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")

Translate shop and category names to English.

In [None]:
!pip install googletrans

In [None]:
from googletrans import Translator

translator = Translator()

def translate(df, feature, src, dest):
    return df[feature].apply(translator.translate, src=src, dest=dest).apply(getattr, args=('text',))

In [None]:
item_categories["item_category_name_en"] = translate(item_categories, "item_category_name", "ru", "en")
shops["shop_name_en"] = translate(shops, "shop_name", "ru", "en")

item_categories.drop("item_category_name", axis=1, inplace=True)
shops.drop("shop_name", axis=1, inplace=True)

<a id="subsection-1-1"></a>
## Shops ##

In [None]:
shops["shop_name_en"].head(10)

We can see that most of the shop names begin with the city name. Let's isolate the first word of each name and store it in a separate column named "city".

In [None]:
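# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
shops["city"] = shops["shop_name_en"].str.replace(r"[!,?²]", "", regex=True).str.lower().str.strip().str.split(" ").str.get(0)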
shops["city"].value_counts()
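
As a minimal sketch of what the extraction above does, the same string pipeline can be run on a few made-up shop names. The names in `demo_shops` are hypothetical and only loosely modelled on the translated data.

```python
import pandas as pd

# Hypothetical shop names, used only to illustrate the pipeline.
demo_shops = pd.DataFrame({
    "shop_name_en": [
        "!Yakutsk Ordzhonikidze, 56 fran",
        'Moscow TC "Budenovskiy"',
        "Digital warehouse 1C-Online",
    ]
})

demo_shops["city"] = (
    demo_shops["shop_name_en"]
    .str.replace(r"[!,?²]", "", regex=True)  # drop stray punctuation
    .str.lower()
    .str.strip()
    .str.split(" ")
    .str.get(0)                              # first word = city candidate
)

print(demo_shops["city"].tolist())
```

The last row shows why a clean-up step is still needed: the first word of a non-city shop name ends up in the "city" column.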

Some of the extracted names are not cities (note that spd == St Petersburg). Let's replace the following with "other":
- digital
- offsite
- emergency

In [None]:
not_shops = ["digital", "offsite", "emergency"]
shops["city"] = shops["city"].apply(lambda x: "other" if x in not_shops else x)
shops.sort_values("city")

We also have some potential duplicates, which we will explore later:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Moscow TC "Budenovskiy" (id 23 and 24)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)

<a id="subsection-1-2"></a>
## Categories ##

In [None]:
item_categories["item_category_name_en"].head(20)

Category names begin with the "master category". Let's isolate the text before "-" and store it in a separate column named "master_category".

In [None]:
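# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
item_categories["master_category"] = item_categories["item_category_name_en"].str.replace(r"[!,?²]", "", regex=True).str.lower().str.strip().str.split("-").str.get(0).str.strip()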
item_categories["master_category"].value_counts()

Some master categories are variants of the same group, so let's merge them under a single name.

In [None]:
item_categories["master_category"] = item_categories["master_category"].apply(lambda x: "payment cards" if "payment cards" in x else x)
item_categories["master_category"] = item_categories["master_category"].apply(lambda x: "games" if "games" in x else x)
item_categories["master_category"] = item_categories["master_category"].apply(lambda x: "blank media" if "blank media" in x else x)
item_categories["master_category"].value_counts()
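
The two steps above (take the text before "-", then collapse variants that share a keyword) can be sketched on a handful of invented category names. The names in `demo_cats` are assumptions for illustration, not rows from the data.

```python
import pandas as pd

# Hypothetical category names.
demo_cats = pd.Series([
    "Games - PS4",
    "Games PC - Digital",
    "Payment cards - PSN",
    "Blank media (piece)",
    "Books - Comics, manga",
])

# Step 1: keep the text before the first "-".
master = demo_cats.str.lower().str.split("-").str.get(0).str.strip()

# Step 2: collapse every name containing a keyword to the keyword itself.
for keyword in ["payment cards", "games", "blank media"]:
    contains_keyword = master.str.contains(keyword, regex=False)
    master = master.where(~contains_keyword, keyword)

print(master.tolist())
```

The order of the keywords matters whenever a name could match more than one of them, which is why "payment cards" is handled before "games" here, as in the cell above.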

<a id="subsection-1-3"></a>
## Data merge ##

Merge item categories, items and shops to train/test data.

In [None]:
items = pd.merge(items, item_categories, how="left", on='item_category_id')
train_o = pd.merge(train_o, items, how="left", on='item_id')
train_o = pd.merge(train_o, shops, how="left", on='shop_id')

test_o = pd.merge(test_o, items, how="left", on='item_id')
test_o = pd.merge(test_o, shops, how="left", on='shop_id')
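
A left merge keeps every sales row only if the lookup table has unique keys. As a hedged sketch, the `validate` argument of `pd.merge` can be used to check this. The tiny frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the sales and items tables.
demo_sales = pd.DataFrame({"item_id": [1, 2, 2, 3], "item_cnt_day": [1.0, 2.0, 1.0, 5.0]})
demo_items = pd.DataFrame({"item_id": [1, 2, 3], "item_category_id": [10, 10, 20]})

# validate="many_to_one" raises MergeError if item_id is duplicated in demo_items.
merged = pd.merge(demo_sales, demo_items, how="left", on="item_id", validate="many_to_one")

# The row count is unchanged and every row found its category.
print(len(merged), merged["item_category_id"].isna().sum())
```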

Downgrade numeric data types to save memory.

Thank you Konstantin Yakovlev (kyakovlev) for this trick!
https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data

In [None]:
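def downgrade_dtypes(df):
    float_cols = list(df.dtypes[df.dtypes == "float64"].index)
    int_cols = list(df.dtypes[(df.dtypes == "int64") | (df.dtypes == "int32")].index)

    df[float_cols] = df[float_cols].astype(np.float32)
    # int16 only holds values up to 32767; wider columns (e.g. the test "ID") are kept at int32
    int16_info = np.iinfo(np.int16)
    for col in int_cols:
        fits = df[col].min() >= int16_info.min and df[col].max() <= int16_info.max
        df[col] = df[col].astype(np.int16 if fits else np.int32)
    
    return df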

In [None]:
train_s = downgrade_dtypes(train_o)
test_s = downgrade_dtypes(test_o)
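
An alternative sketch of the same idea uses `pd.to_numeric` with its `downcast` argument, which picks the smallest type that still holds every value. The frame `demo_df` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a small id, a large id and a price column.
demo_df = pd.DataFrame({
    "shop_id": [0, 59],
    "ID": [0, 214199],
    "item_price": [99.0, 2599.0],
})

for col in demo_df.columns:
    kind = "integer" if pd.api.types.is_integer_dtype(demo_df[col]) else "float"
    # downcast never picks a type that is too narrow for the data.
    demo_df[col] = pd.to_numeric(demo_df[col], downcast=kind)

print(demo_df.dtypes)
```

The large id keeps a 32-bit type because it does not fit into 16 bits, while the small id is reduced further.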

<a id="section-2"></a>
# 2. DATA UNDERSTANDING AND CLEANING #

<a id="subsection-2-1"></a>
## Data overview ##

In [None]:
train_s.head(10)

In [None]:
def generate_features_overview(df):
    df_info = pd.DataFrame()
    df_info["type"] = df.dtypes
    df_info["missing_count"] = df.isna().sum()
    df_info["missing_perc"] = (df_info["missing_count"] / len(df) * 100).astype(int)
    df_info = pd.concat([df_info, df.describe(include='all').T], axis=1)
    
    return df_info

In [None]:
info_df = generate_features_overview(train_s)
info_df_string = info_df.dropna(subset=["unique"], axis=0).dropna(axis=1)
info_df_numeric = info_df.dropna(subset=["mean"], axis=0).dropna(axis=1)

In [None]:
info_df_string

In [None]:
info_df_numeric

Conclusions:
- No missing values
- The number of unique dates equals the number of days between 01.01.2013 and 31.10.2015, which means that we have sales every day
- We have 21807 different items
- We have 84 different categories
- We have 60 different shops
- We have some outliers in item_price and item_cnt_day columns

<a id="subsection-2-2"></a>
## Managing date columns ##

The "date" column should first be converted to a datetime type. The following columns should then be added to make the data easier to understand:
- day
- month
- year
- weekday

In [None]:
train_c1 = train_s.copy()
train_c1["date"] = pd.to_datetime(train_s["date"], format="%d.%m.%Y")
train_c1["month"] = train_c1["date"].dt.month
train_c1["year"] = train_c1["date"].dt.year
train_c1["weekday"] = train_c1["date"].dt.weekday
train_c1["day"] = train_c1["date"].dt.day
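
A small sketch of why the explicit `format` matters: the dates are day-first, so "02.01.2013" is 2 January, not 1 February. The two sample dates below are chosen for illustration.

```python
import pandas as pd

demo_dates = pd.DataFrame({"date": ["02.01.2013", "31.10.2015"]})
demo_dates["date"] = pd.to_datetime(demo_dates["date"], format="%d.%m.%Y")

demo_dates["month"] = demo_dates["date"].dt.month
demo_dates["weekday"] = demo_dates["date"].dt.weekday  # Monday = 0 ... Sunday = 6

print(demo_dates)
```

Note that `dt.weekday` is zero-based, which is why some later features add 1 to it.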

<a id="subsection-2-3"></a>
## Data Exploration and Visualisation ##

In [None]:
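def lineplot(df, X, Y, title):
    # plt.subplots(1, 1) returns a single Axes object, not an array
    fig, ax = plt.subplots(1, 1, figsize=(20, 7))
    sns.lineplot(x=X, y=Y, data=df, ax=ax).set_title(title)
    plt.show()
    
    
def lineplot_multiple(df, X, Y, title):
    # Y is unused: every column other than X is melted into 'variable' / 'value'
    ax = sns.lineplot(x=X, y='value', hue='variable', data=pd.melt(df, X))
    ax.set_title(title)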
    
def barplot(df, feature_1, label_1, x_ticks, title="", width=0.4):
    fig = plt.figure(figsize =(20, 7))
    ax = fig.add_subplot()

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
        
    plt.bar(range(len(df)), df[feature_1], align='center', label=label_1, color="blue")
    plt.xticks(range(len(df)), df[x_ticks], size='small')
    plt.title(title)
    plt.grid(False)
    
    plt.show()
    
def barplot_double(df, features, labels, x_ticks, width=0.4):
    fig = plt.figure(figsize =(20, 7))
    ax = fig.add_subplot()

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)

    plt.bar(range(len(df)), df[features[0]], align='center', label=labels[0])
    plt.bar(range(len(df)), df[features[1]], align='center', label=labels[1])
        
    plt.xticks(range(len(df)), df[x_ticks], size='small')
    plt.grid(False)
    ax.legend(loc='upper right')
    plt.show()

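def barplot_sns(df, X, Y, hue, title):
    # plt.subplots(1, 1) returns a single Axes object, not an array
    fig, ax = plt.subplots(1, 1, figsize=(22, 7))
    # str.title capitalises the melted column names, so X, Y and hue must be given in title case
    tidy = df.melt(id_vars=X).rename(columns=str.title)
    sns.barplot(x=X, y=Y, hue=hue, data=tidy, ax=ax, palette="rocket").set_title(title)
    plt.show()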
    
def barplot_double_axis(df, feature_1, feature_2, label_1, label_2, x_ticks, width=0.4):
    fig = plt.figure(figsize =(20, 7))
    ax1 = fig.add_subplot()
    ax2 = ax1.twinx()
    
    ax1.spines['top'].set_visible(False)
    ax1.spines['right'].set_visible(False)
    ax1.spines['left'].set_visible(False)
    ax1.spines['bottom'].set_visible(False)
    ax1.grid(False)

        
    ax2.spines['top'].set_visible(False)
    ax2.spines['right'].set_visible(False)
    ax2.spines['left'].set_visible(False)
    ax2.spines['bottom'].set_visible(False)
    ax2.grid(False)

    ax1.bar(np.arange(len(df)) + (width / 2), df[feature_1], width=width, color="red", label=label_1)
    ax2.bar(np.arange(len(df)) - (width / 2), df[feature_2], width=width, color="blue", label=label_2)
    
    ax1.legend(loc='upper left', frameon=False)
    ax2.legend(loc='upper right', frameon=False)
    
    plt.xticks(range(len(df)), df[x_ticks], size='small')
    plt.grid(False)

    plt.show()
    

def box_plot(df, features):
    fig = plt.figure(figsize =(20, 7))

    # Creating axes instance 
    ax = fig.add_subplot()

    # Creating plot
    data = []
    for col in features:
        data.append(df[col])
        
    bp = ax.boxplot(data)

    # show plot
    plt.show()
    
def box_plot_sns(df, X, Y):
    fig = plt.figure(figsize =(20, 7))
    ax = fig.add_subplot()
    sns.boxplot(x = X, y = Y, ax=ax, data = df)
    
def histogram(df, features, bins=10):
    df[features].plot.hist(bins=bins)

<a id="subsection-2-3-1"></a>
### Sales quantity ###

In [None]:
train_c1["item_cnt_day"].value_counts(bins=10).sort_index()

In [None]:
norm = train_c1["item_cnt_day"].value_counts(normalize=True)*100
norm.cumsum().head(10)

In [None]:
box_plot_sns(train_c1, "shop_id", "item_cnt_day")

We can see that the vast majority (99.59%) of daily quantities fall within the 10 most frequent values (-1 to 9). We also see some outliers, especially at shop 12.

Let's take a look at the records with 1000 or more units sold per day.

In [None]:
q_outliers = train_c1[train_c1["item_cnt_day"] >= 1000]
q_outliers

Remove the top outlier.

In [None]:
train_c1 = train_c1[train_c1["item_cnt_day"] <= 1000]

<a id="subsection-2-3-2"></a>
### Item price ###

In [None]:
train_c1["item_price"].value_counts(bins=20).sort_index()

In [None]:
norm = train_c1["item_price"].value_counts(normalize=True)*100
norm.cumsum().head(20)

In [None]:
box_plot_sns(train_c1, "shop_id", "item_price")

We can see that the 20 most frequent prices range from 99 to 2599. Prices ending in 99 seem very popular. We can also see one outlier priced far above the others, and one price below zero. Let's eliminate both.

In [None]:
# .copy() avoids SettingWithCopyWarning when shop ids are reassigned later
train_c2 = train_c1[(train_c1["item_price"] > 0) & (train_c1["item_price"] < 100000)].copy()

<a id="subsection-2-3-3"></a>
### Chronological sales ###

In [None]:
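def aggregate(df, group_by, aggfunc, features=None):
    # aggfunc examples: np.mean, np.max, np.min, np.count_nonzero, np.sum, or a dict of column -> function(s)
    group_cols = [group_by] if isinstance(group_by, str) else list(group_by)

    if features is None and not isinstance(aggfunc, dict):
        # Aggregate numeric columns only: recent pandas versions no longer
        # drop non-numeric columns silently when a single function is applied.
        features = [c for c in df.select_dtypes("number").columns if c not in group_cols]

    grouped_df = df.groupby(group_by, as_index=False)

    if features:
        grouped_df = grouped_df[features]

    return grouped_df.agg(aggfunc)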

In [None]:
agg_1 = aggregate(train_c2, ["date_block_num"], np.sum)

barplot(agg_1, "item_cnt_day", "Quantity", "date_block_num", "Chronological sales")

Sales decline over the training period, and there is also a seasonal effect.

<a id="subsection-2-3-4"></a>
### Monthly sales - 2013 & 2014 ###

In [None]:
agg_2 = aggregate(train_c2[train_c2["year"] <= 2014], ["month"], np.sum)

barplot(agg_2, "item_cnt_day", "Quantity", "month", "SUM of sales by month (2013 + 2014)")

We only compare 2013 and 2014, since we don't have full data for 2015. We can see a strong seasonal effect in the winter months, especially December.

<a id="subsection-2-3-5"></a>
### Weekday sales ###

In [None]:
agg_3 = aggregate(train_c2, ["weekday"], np.sum)
agg_3_m = agg_3.sort_values("item_cnt_day", ascending=False)
agg_3_m["cumsum"] = agg_3_m["item_cnt_day"].cumsum()
agg_3_m["cumperc"] = agg_3_m["cumsum"] / agg_3_m["item_cnt_day"].sum()
agg_3_m[["weekday","cumperc"]]

In [None]:
barplot(agg_3, "item_cnt_day", "Quantity", "weekday", "Sales quantity by day in week")

Most sales occur on the weekend (52%), which we'll take into account later.

<a id="subsection-2-3-6"></a>

### Shop sales ###

First let's explore some potential duplicates:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Moscow TC "Budenovskiy" (id 23 and 24)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)

We will plot shop sales pairs on the barplot.

In [None]:
agg_4 = aggregate(train_c2, ["shop_id", "date_block_num"], np.sum).sort_values("date_block_num")

In [None]:
def comparison_barplot(ids, feature):
    compare_ids = ids
    shop_1 = agg_4[agg_4[feature] == compare_ids[0]]
    shop_2 = agg_4[agg_4[feature] == compare_ids[1]]

    shop_comp = pd.merge(shop_1, shop_2, how="outer", on='date_block_num')
    shop_comp.fillna(0, inplace=True)

    shop_comp["stacked"] = shop_comp["item_cnt_day_x"] + shop_comp["item_cnt_day_y"]

    fig, ax1 = plt.subplots(figsize=(20, 7))
    sns.barplot(x='date_block_num', y='stacked', data=shop_comp, ax=ax1, color="red", label="{0}: {1}".format(feature, compare_ids[1]))
    sns.barplot(x='date_block_num', y='item_cnt_day_x', data=shop_comp, ax=ax1, color="blue", label="{0}: {1}".format(feature, compare_ids[0]))
    plt.legend()
    sns.despine(fig)

In [None]:
comparison_barplot([10, 11], "shop_id")

In [None]:
comparison_barplot([23, 24], "shop_id")

In [None]:
comparison_barplot([0, 57], "shop_id")

In [None]:
comparison_barplot([1, 58], "shop_id")

Looks like the following shops are duplicated:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)

Let's make the following shop id replacements:
- id 11 => id 10
- id 0 => id 57
- id 1 => id 58

In [None]:
train_c2.loc[train_c2["shop_id"] == 11, 'shop_id'] = 10
train_c2.loc[train_c2["shop_id"] == 0, 'shop_id'] = 57
train_c2.loc[train_c2["shop_id"] == 1, 'shop_id'] = 58
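
As a sketch, the replacements listed above can also be kept in a single dictionary and applied in one call, which makes it harder for the code and the list to drift apart. The name `duplicate_shop_map` and the demo frame are assumptions for illustration.

```python
import pandas as pd

# Duplicate shop id -> id to keep, taken from the list above.
duplicate_shop_map = {11: 10, 0: 57, 1: 58}

# Hypothetical shop ids.
demo_ids = pd.DataFrame({"shop_id": [0, 1, 11, 25, 57]})
demo_ids["shop_id"] = demo_ids["shop_id"].replace(duplicate_shop_map)

print(demo_ids["shop_id"].tolist())
```

The same dictionary could then be reused wherever shop ids need to be cleaned again.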

In [None]:
agg_5 = aggregate(train_c2, ["shop_id"], np.sum).sort_values("item_cnt_day", ascending=False)
agg_5["cumsum"] = agg_5["item_cnt_day"].cumsum()
agg_5["cumperc"] = agg_5["cumsum"] / agg_5["item_cnt_day"].sum()

barplot(agg_5, "item_cnt_day", "Quantity", "shop_id", "Sales quantity by shop")
barplot(agg_5, "cumperc", "Quantity", "shop_id", "Sales quantity by shop - cumulative percentage")

In [None]:
agg_6 = aggregate(train_c2, ["city"], np.sum).sort_values("item_cnt_day", ascending=False)
agg_6["cumsum"] = agg_6["item_cnt_day"].cumsum()
agg_6["cumperc"] = agg_6["cumsum"] / agg_6["item_cnt_day"].sum()
barplot(agg_6, "item_cnt_day", "Quantity", "city", "Sales quantity by city")

<a id="subsection-2-3-7"></a>
### Category sales ###

In [None]:
agg_7 = aggregate(train_c2, ["item_category_id"], np.sum).sort_values("item_cnt_day", ascending=False)
agg_7["cumsum"] = agg_7["item_cnt_day"].cumsum()
agg_7["cumperc"] = agg_7["cumsum"] / agg_7["item_cnt_day"].sum()

barplot(agg_7, "item_cnt_day", "Quantity", "item_category_id", "Sales quantity by item category")
barplot(agg_7, "cumperc", "Quantity", "item_category_id", "Sales quantity by item category - cumulative percentage")

In [None]:
agg_8 = aggregate(train_c2, ["master_category"], np.sum).sort_values("item_cnt_day", ascending=False)

barplot(agg_8, "item_cnt_day", "Quantity", "master_category", "Sales quantity by master category")

We can see that the top 5 master categories account for the vast majority of sales.

<a id="section-3"></a>
# 3. FEATURE ENGINEERING #

We are predicting monthly sales for November 2015, but our train data consists of daily sales. Therefore we will have to aggregate the data by item/shop/month. Before that, we should create some useful features from the non-aggregated data.

<a id="subsection-3-1"></a>
### Average prices calculation ###

New features:
- mean price for item/shop/date_block combo
- mean price for item/shop combo
- mean price for item
- mean price for item category

In [None]:
avg_price_item_shop_month = train_c2.groupby(['item_id', 'shop_id', 'date_block_num']).agg({"item_price": "mean"})
avg_price_item_shop = train_c2.groupby(['item_id', 'shop_id']).agg({"item_price": "mean"})
avg_price_item = train_c2.groupby('item_id').agg({"item_price": "mean"})
avg_price_category = train_c2.groupby('item_category_id').agg({"item_price": "mean"})

<a id="subsection-3-2"></a>
### First / last sale, medians and modes ###

New features:
- first and last sale of item/shop combo
- first and last sale of item
- first and last sale of shop

- weekday, day and month median of item/shop combo sales count
- weekday, day and month median of item sales count
- weekday, day and month median of shop sales count

- weekday, day and month mode of item sales count
- weekday, day and month mode of shop sales count

In [None]:
import scipy.stats  # the stats submodule is imported explicitly so that scipy.stats.mode is available
item_shop_sales_detail = aggregate(train_c2, ["item_id", "shop_id"], {"date_block_num": ["min", "max"], "weekday": "median", "day": "median", "month": "median"})
#item_shop_sales_detail = train_c2.groupby(["item_id", "shop_id"])[["day", "weekday", "month"]].agg(lambda x: scipy.stats.mode(x)[0])
item_shop_sales_detail.set_index(["item_id", "shop_id"], inplace=True)

In [None]:
item_sales_detail = aggregate(train_c2, "item_id", {"date_block_num": ["min", "max"], "weekday": "median", "day": "median", "month": "median"})
item_sales_modes = train_c2.groupby("item_id")[["day", "weekday", "month"]].agg(lambda x: scipy.stats.mode(x)[0])
item_sales_detail.set_index("item_id", inplace=True)
item_sales_detail[("weekday", "mode")] = item_sales_modes["weekday"]
item_sales_detail[("day", "mode")] = item_sales_modes["day"]
item_sales_detail[("month", "mode")] = item_sales_modes["month"]
item_sales_detail.describe()

In [None]:
shop_sales_detail = aggregate(train_c2, "shop_id", {"date_block_num": ["min", "max"], "weekday": "median", "day": "median", "month": "median"})
shop_sales_modes = train_c2.groupby("shop_id")[["day", "weekday", "month"]].agg(lambda x: scipy.stats.mode(x)[0])
shop_sales_detail.set_index("shop_id", inplace=True)
shop_sales_detail[("weekday", "mode")] = shop_sales_modes["weekday"]
shop_sales_detail[("day", "mode")] = shop_sales_modes["day"]
shop_sales_detail[("month", "mode")] = shop_sales_modes["month"]
shop_sales_detail.describe()

<a id="subsection-3-3"></a>
### Aggregating the data ###

In [None]:
train_c3 = train_c2.copy()
train_c3.info()

We should first drop some columns that we won't need for modeling.

In [None]:
cols_to_drop_train = ["date", "item_name", "item_category_name_en", "shop_name_en", "weekday", "day"]
train_c3.drop(cols_to_drop_train, axis=1, inplace=True)

Aggregation, rename column from item_cnt_day to item_cnt_month.

In [None]:
train_c3 = aggregate(train_c3, ["date_block_num", "shop_id", "item_id"], {"item_cnt_day": np.sum})
train_c3.rename(columns={"item_cnt_day": "item_cnt_month"}, inplace=True)

<a id="subsection-3-4"></a>
### Stacking the train data ###

First step:
- Keep only the shops and items that appear in the test set
- Add a row with zero sales for every item/shop/date_block combination that has no recorded sales

In [None]:
shop_ids = test_s['shop_id'].unique()
item_ids = test_s['item_id'].unique()

empty_df = []
for i in range(34):
    for shop in shop_ids:
        for item in item_ids:
            empty_df.append([i, shop, item])
    
empty_df = pd.DataFrame(empty_df, columns=['date_block_num','shop_id','item_id'])

In [None]:
train_c4 = pd.merge(empty_df, train_c3, on=['date_block_num','shop_id','item_id'], how='left')
train_c4.fillna(value=0, inplace=True)
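
The same grid can be sketched with `itertools.product`, which avoids the nested Python loops. The ids and the `monthly` frame below are made-up stand-ins for the test ids and the aggregated train data.

```python
from itertools import product

import pandas as pd

# Hypothetical ids standing in for the test-set shops and items.
months = range(3)
demo_shop_ids = [10, 57]
demo_item_ids = [100, 200]

keys = ["date_block_num", "shop_id", "item_id"]
grid = pd.DataFrame(list(product(months, demo_shop_ids, demo_item_ids)), columns=keys)

# Hypothetical aggregated sales: only two combinations were ever sold.
monthly = pd.DataFrame({
    "date_block_num": [0, 2],
    "shop_id": [10, 57],
    "item_id": [100, 200],
    "item_cnt_month": [3.0, 1.0],
})

# Left merge keeps the full grid; combinations without sales become 0.
stacked = grid.merge(monthly, on=keys, how="left").fillna({"item_cnt_month": 0})
print(len(grid), stacked["item_cnt_month"].sum())
```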

Append all available data. Clean shops data.

In [None]:
train_c4 = pd.merge(train_c4, items[["item_id", "item_category_id", "master_category"]], on="item_id", how="left")
train_c4 = pd.merge(train_c4, shops[["shop_id", "city"]], on="shop_id", how="left")
train_c4["year"] = train_c4["date_block_num"] // 12 + 2013
train_c4["month"] = train_c4["date_block_num"] % 12 + 1
train_c4.loc[train_c4["shop_id"] == 11, 'shop_id'] = 10
train_c4.loc[train_c4["shop_id"] == 0, 'shop_id'] = 57
train_c4.loc[train_c4["shop_id"] == 1, 'shop_id'] = 58

Add first and last sales, weekday and month medians.

In [None]:
train_c4.set_index(["item_id", "shop_id"], inplace=True)

train_c4["item_shop_date_block_min"] = item_shop_sales_detail[("date_block_num", "min")]
train_c4["item_shop_date_block_max"] = item_shop_sales_detail[("date_block_num", "max")]
train_c4["item_shop_weekday_median"] = item_shop_sales_detail[("weekday", "median")] + 1
train_c4["item_shop_month_median"] = item_shop_sales_detail[("month", "median")]

train_c4 = train_c4.reset_index().set_index("item_id")
train_c4["item_date_block_min"] = item_sales_detail[("date_block_num", "min")]
train_c4["item_date_block_max"] = item_sales_detail[("date_block_num", "max")]
train_c4["item_weekday_median"] = item_sales_detail[("weekday", "median")] + 1
train_c4["item_month_median"] = item_sales_detail[("month", "median")]
train_c4["item_weekday_mode"] = item_sales_detail[("weekday", "mode")] + 1
train_c4["item_month_mode"] = item_sales_detail[("month", "mode")]

train_c4 = train_c4.reset_index().set_index("shop_id")
train_c4["shop_date_block_min"] = shop_sales_detail[("date_block_num", "min")]
train_c4["shop_date_block_max"] = shop_sales_detail[("date_block_num", "max")]
train_c4["shop_weekday_median"] = shop_sales_detail[("weekday", "median")] + 1
train_c4["shop_month_median"] = shop_sales_detail[("month", "median")]
train_c4["shop_weekday_mode"] = shop_sales_detail[("weekday", "mode")] + 1
train_c4["shop_month_mode"] = shop_sales_detail[("month", "mode")]

train_c4.reset_index(inplace=True)
train_c4.fillna(value=0, inplace=True)

<a id="subsection-3-5"></a>
### Price features ###

In [None]:
train_c5 = train_c4.set_index(['item_id', 'shop_id', 'date_block_num'])
train_c5["item_price"] = avg_price_item_shop_month["item_price"]
train_c5["item_price"].isna().sum()

In [None]:
train_c5 = train_c5.reset_index().set_index(['item_id', 'shop_id'])
train_c5.loc[train_c5['item_price'].isna(), 'item_price'] = avg_price_item_shop["item_price"]
train_c5["item_price"].isna().sum()

In [None]:
train_c5 = train_c5.reset_index().set_index('item_id')
train_c5.loc[train_c5['item_price'].isna(), 'item_price'] = avg_price_item["item_price"]
train_c5["item_price"].isna().sum()

Let's also add average item price in a separate column for potential features such as discounts.

In [None]:
train_c5["avg_item_price"] = avg_price_item["item_price"]

In [None]:
train_c5 = train_c5.reset_index().set_index('item_category_id')
train_c5.loc[train_c5['item_price'].isna(), 'item_price'] = avg_price_category["item_price"]
train_c5["item_price"].isna().sum()

Now let's calculate a new feature: the relative difference between the shop's item price and the item's average price across all shops.

In [None]:
train_c5["avg_item_price_perc"] = (train_c5['item_price'] - train_c5["avg_item_price"]) / train_c5["avg_item_price"]
train_c5.fillna(value=0, inplace=True)
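
The price logic in this subsection is a fallback chain: use the most specific mean price available, then fall back to coarser ones. A minimal sketch with chained `fillna`, assuming the candidate prices have already been joined as columns (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

demo_prices = pd.DataFrame({
    "price_item_shop_month": [100.0, np.nan, np.nan],  # most specific
    "price_item_shop": [110.0, 110.0, np.nan],
    "price_item": [120.0, 120.0, 50.0],                # coarsest level used here
})

# Take the first non-missing price, from most to least specific.
demo_prices["item_price"] = (
    demo_prices["price_item_shop_month"]
    .fillna(demo_prices["price_item_shop"])
    .fillna(demo_prices["price_item"])
)

# Relative difference to the item's average price (negative = cheaper than average).
demo_prices["avg_item_price_perc"] = (
    (demo_prices["item_price"] - demo_prices["price_item"]) / demo_prices["price_item"]
)
print(demo_prices[["item_price", "avg_item_price_perc"]])
```

A row that falls all the way back to the item mean gets a relative difference of exactly 0.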

<a id="subsection-3-6"></a>
### Mean quantity features ###

Mean quantity features in relation to date_block_num.

It is very important to filter train data so that it is similar to test data:
- Clip the sales between 0 and 20

In [None]:
train_c5["item_cnt_month"] = train_c5["item_cnt_month"].clip(0., 20.)

In [None]:
avg_q_month = train_c5.groupby(['date_block_num']).agg({"item_cnt_month": "mean"})
avg_q_month_item = train_c5.groupby(['date_block_num', 'item_id']).agg({"item_cnt_month": "mean"})
avg_q_month_shop = train_c5.groupby(['date_block_num', 'shop_id']).agg({"item_cnt_month": "mean"})
avg_q_month_category = train_c5.groupby(['date_block_num', 'item_category_id']).agg({"item_cnt_month": "mean"})
avg_q_month_city = train_c5.groupby(['date_block_num', 'city']).agg({"item_cnt_month": "mean"})
avg_q_month_master_category = train_c5.groupby(['date_block_num', 'master_category']).agg({"item_cnt_month": "mean"})

In [None]:
train_c5 = train_c5.reset_index().set_index('date_block_num')
train_c5["avg_month_sales"] = avg_q_month["item_cnt_month"]

train_c5 = train_c5.reset_index().set_index(['date_block_num', 'item_id'])
train_c5["avg_month_item_sales"] = avg_q_month_item["item_cnt_month"]

train_c5 = train_c5.reset_index().set_index(['date_block_num', 'shop_id'])
train_c5["avg_month_shop_sales"] = avg_q_month_shop["item_cnt_month"]

train_c5 = train_c5.reset_index().set_index(['date_block_num', 'item_category_id'])
train_c5["avg_month_category_sales"] = avg_q_month_category["item_cnt_month"]

train_c5 = train_c5.reset_index().set_index(['date_block_num', 'city'])
train_c5["avg_month_city_sales"] = avg_q_month_city["item_cnt_month"]

train_c5 = train_c5.reset_index().set_index(['date_block_num', 'master_category'])
train_c5["avg_month_master_category_sales"] = avg_q_month_master_category["item_cnt_month"]

train_c5 = train_c5.reset_index()
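
As an alternative sketch, group means can be attached without switching the index back and forth by using `groupby(...).transform("mean")`, which returns one value per original row. The frame below is invented for illustration.

```python
import pandas as pd

demo_month = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [5, 6, 5, 6],
    "item_cnt_month": [2.0, 4.0, 0.0, 6.0],
})

# Mean over all rows of the same month.
demo_month["avg_month_sales"] = (
    demo_month.groupby("date_block_num")["item_cnt_month"].transform("mean")
)

# Mean over all rows of the same month and shop.
demo_month["avg_month_shop_sales"] = (
    demo_month.groupby(["date_block_num", "shop_id"])["item_cnt_month"].transform("mean")
)
print(demo_month)
```

Because these means include the current month's target, they are only safe to use as lagged features, which is what the next subsection does.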

<a id="subsection-3-7"></a>
### Lag features ###

We still don't have comparisons of sales against previous months. We should add some, since previous sales are one of the most important features in sales analytics.

We will add features using a great function from Denis Larionov => https://www.kaggle.com/dlarionov/feature-engineering-xgboost:

In [None]:
train_c6 = train_c5.copy()

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
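
To see what the function produces, here is a small worked sketch on a single shop/item pair over three months. The function is repeated so that the example is self-contained, and the demo values are made up.

```python
import pandas as pd

def lag_feature(df, lags, col):
    tmp = df[['date_block_num', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', col + '_lag_' + str(i)]
        # Moving month m to m + i makes its value line up with the row i months later.
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

demo_lags = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [7, 7, 7],
    "item_cnt_month": [2.0, 0.0, 4.0],
})

demo_lagged = lag_feature(demo_lags, [1, 2], "item_cnt_month")
print(demo_lagged)
```

The first `i` months of every pair have no lag value and stay missing, which is why the earliest months are dropped from training later on.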

In [None]:
train_c6 = lag_feature(train_c6, [1, 2, 3], 'item_cnt_month')
train_c6 = lag_feature(train_c6, [1, 2, 3], 'avg_month_item_sales')
train_c6 = lag_feature(train_c6, [1, 2, 3], 'avg_month_shop_sales')
train_c6 = lag_feature(train_c6, [1], 'avg_month_sales')
train_c6 = lag_feature(train_c6, [1], 'avg_month_category_sales')
train_c6 = lag_feature(train_c6, [1], 'avg_month_city_sales')
train_c6 = lag_feature(train_c6, [1], 'avg_month_master_category_sales')

Let's also create price lag features, so that we can add them to the test set.

In [None]:
train_c6 = lag_feature(train_c6, [1], 'item_price')
train_c6 = lag_feature(train_c6, [1], 'avg_item_price_perc')

<a id="subsection-3-8"></a>
### Items features ###

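We will create the item type feature, using date block 27 as the cut-off (roughly the last six months of the training data):
- New item (1) - first sale in block 27 or later
- Old item (2) - no sales in block 27 or later
- Regular item (3) - the rest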

In [None]:
def itemTypes(block_min, block_max):
    if block_min >= 27:
        return 1
    elif block_max < 27:
        return 2
    else:
        return 3

In [None]:
train_c6["item_type"] = np.vectorize(itemTypes)(train_c6['item_date_block_min'], train_c6['item_date_block_max'])
train_c6["item_type"].value_counts()
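
The same rule can be sketched with `np.select`, which evaluates the conditions on whole columns instead of calling a Python function per row. The demo values are hypothetical, and the cut-off of 27 is taken from the function above.

```python
import numpy as np
import pandas as pd

demo_items = pd.DataFrame({
    "item_date_block_min": [30, 2, 5],
    "item_date_block_max": [33, 20, 33],
})

conditions = [
    demo_items["item_date_block_min"] >= 27,  # first sale is recent -> 1
    demo_items["item_date_block_max"] < 27,   # no recent sales      -> 2
]
# The first matching condition wins; everything else gets the default 3.
demo_items["item_type"] = np.select(conditions, [1, 2], default=3)

print(demo_items["item_type"].tolist())
```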

<a id="subsection-3-9"></a>
### Test data engineering ###

Let's fill the missing data in our test set.

In [None]:
test_c1 = test_s.copy()

In [None]:
test_c1["month"] = 11
test_c1["year"] = 2015
test_c1["date_block_num"] = 34

test_c1.set_index(["item_id", "shop_id"], inplace=True)
test_c1["item_shop_date_block_min"] = item_shop_sales_detail[("date_block_num", "min")]
test_c1["item_shop_date_block_max"] = item_shop_sales_detail[("date_block_num", "max")]
test_c1["item_shop_weekday_median"] = item_shop_sales_detail[("weekday", "median")] + 1
test_c1["item_shop_month_median"] = item_shop_sales_detail[("month", "median")]

test_c1 = test_c1.reset_index().set_index("item_id")
test_c1["item_date_block_min"] = item_sales_detail[("date_block_num", "min")]
test_c1["item_date_block_max"] = item_sales_detail[("date_block_num", "max")]
test_c1["item_weekday_median"] = item_sales_detail[("weekday", "median")] + 1
test_c1["item_month_median"] = item_sales_detail[("month", "median")]
test_c1["item_weekday_mode"] = item_sales_detail[("weekday", "mode")] + 1
test_c1["item_month_mode"] = item_sales_detail[("month", "mode")]

test_c1 = test_c1.reset_index().set_index("shop_id")
test_c1["shop_date_block_min"] = shop_sales_detail[("date_block_num", "min")]
test_c1["shop_date_block_max"] = shop_sales_detail[("date_block_num", "max")]
test_c1["shop_weekday_median"] = shop_sales_detail[("weekday", "median")] + 1
test_c1["shop_month_median"] = shop_sales_detail[("month", "median")]
test_c1["shop_weekday_mode"] = shop_sales_detail[("weekday", "mode")] + 1
test_c1["shop_month_mode"] = shop_sales_detail[("month", "mode")]

test_c1.reset_index(inplace=True)
test_c1.fillna(value=0, inplace=True)

cols_to_drop_test = ["item_name", "item_category_name_en", "shop_name_en"]
test_c1.drop(cols_to_drop_test, axis=1, inplace=True)

test_c2 = test_c1.reset_index().set_index("index")
test_c2.fillna(value=0, inplace=True)

In [None]:
sales = train_c6.reset_index().set_index(['item_id', 'shop_id'])
sales_lag_1 = sales[sales["date_block_num"] == 33]
sales_lag_2 = sales[sales["date_block_num"] == 32]
sales_lag_3 = sales[sales["date_block_num"] == 31]
sales_lag_12 = sales[sales["date_block_num"] == 22]
test_c2 = test_c2.reset_index().set_index(['item_id', 'shop_id'])

test_c2["item_cnt_month_lag_1"] = sales_lag_1["item_cnt_month"]
test_c2["item_price_lag_1"] = sales_lag_1["item_price"]
test_c2["avg_item_price_perc_lag_1"] = sales_lag_1["avg_item_price_perc"]
test_c2["avg_month_sales_lag_1"] = sales_lag_1["avg_month_sales"]
test_c2["avg_month_item_sales_lag_1"] = sales_lag_1["avg_month_item_sales"]
test_c2["avg_month_shop_sales_lag_1"] = sales_lag_1["avg_month_shop_sales"]
test_c2["avg_month_category_sales_lag_1"] = sales_lag_1["avg_month_category_sales"]
test_c2["avg_month_city_sales_lag_1"] = sales_lag_1["avg_month_city_sales"]
test_c2["avg_month_master_category_sales_lag_1"] = sales_lag_1["avg_month_master_category_sales"]

test_c2["item_cnt_month_lag_2"] = sales_lag_2["item_cnt_month"]
test_c2["avg_month_item_sales_lag_2"] = sales_lag_2["avg_month_item_sales"]
test_c2["avg_month_shop_sales_lag_2"] = sales_lag_2["avg_month_shop_sales"]

test_c2["item_cnt_month_lag_3"] = sales_lag_3["item_cnt_month"]
test_c2["avg_month_item_sales_lag_3"] = sales_lag_3["avg_month_item_sales"]
test_c2["avg_month_shop_sales_lag_3"] = sales_lag_3["avg_month_shop_sales"]

test_c2["item_type"] = np.vectorize(itemTypes)(test_c2['item_date_block_min'], test_c2['item_date_block_max'])

test_c2 = test_c2.reset_index().set_index("index")

<a id="subsection-3-10"></a>
### Calendar related features ###

Let's add the number of weekend days (Friday included) for every month in our data. We also calculate the number of days in each month.

In [None]:
import calendar

def calculateWeekendDays(month, year):
    weekend_days = 0
    for week in calendar.monthcalendar(year, month):
        for day in week[4:]:
            if day != 0:
                weekend_days +=1
                
    return weekend_days

def calculateMonthDays(month, year):
    month_days = 0
    for week in calendar.monthcalendar(year, month):
        for day in week:
            if day != 0:
                month_days +=1
                
    return month_days
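
As a sanity check, the same two quantities can be computed more compactly. This is only a sketch: `weekend_days` and `month_days` are hypothetical helper names, and it assumes the default calendar setting in which weeks start on Monday.

```python
import calendar

def weekend_days(month, year):
    # Columns 4, 5, 6 of each week row are Friday, Saturday, Sunday
    # (weeks start on Monday by default); 0 marks padding days.
    return sum(1 for week in calendar.monthcalendar(year, month)
               for day in week[4:] if day != 0)

def month_days(month, year):
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]

jan_2013_weekend = weekend_days(1, 2013)
jan_2013_days = month_days(1, 2013)
feb_2013_days = month_days(2, 2013)
print(jan_2013_weekend, jan_2013_days, feb_2013_days)
```

January 2013 started on a Tuesday, so it contains four Fridays, four Saturdays and four Sundays.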

In [None]:
calendar_dict = {"date_block_num": [], "weekend_days": [], "month_days": []}

for year in range(2013, 2016):
    for month in range(1, 13):
        calendar_dict["date_block_num"].append((year - 2013)*12 + month - 1)
        calendar_dict["weekend_days"].append(calculateWeekendDays(month, year))
        calendar_dict["month_days"].append(calculateMonthDays(month, year))

weekend_days_df = pd.DataFrame(calendar_dict)
weekend_days_df

In [None]:
train_c6 = pd.merge(train_c6, weekend_days_df, how="left", on='date_block_num')
test_c2 = pd.merge(test_c2, weekend_days_df, how="left", on='date_block_num')
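
To make the `date_block_num` arithmetic and the left merge concrete, here is a small sketch on toy data. The helper `block_to_year_month` and the `toy_*` frames are assumptions introduced for illustration only.

```python
import calendar
import pandas as pd

def block_to_year_month(date_block_num):
    # Inverse of (year - 2013) * 12 + month - 1.
    year_offset, month_index = divmod(date_block_num, 12)
    return 2013 + year_offset, month_index + 1

# Toy calendar table covering only two blocks.
toy_calendar = pd.DataFrame({"date_block_num": [33, 34]})
toy_calendar["month_days"] = [
    calendar.monthrange(*block_to_year_month(block))[1]
    for block in toy_calendar["date_block_num"]
]

# Toy feature rows; how="left" keeps every row of the left frame.
toy_rows = pd.DataFrame({"item_id": [1, 2, 3], "date_block_num": [33, 34, 34]})
merged = pd.merge(toy_rows, toy_calendar, how="left", on="date_block_num")
print(merged)
```

Block 33 maps to October 2015 and block 34 to November 2015, so the merged rows get 31 and 30 days respectively.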

<a id="subsection-3-11"></a>
### Final steps ###

We also need to drop the first three months from the training data, since their lagged features are mostly missing.

In [None]:
train_c6 = train_c6[train_c6["date_block_num"] > 2]
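
Why exactly the first three months? With lags of up to three months, blocks 0 to 2 cannot have a complete lag history. The sketch below illustrates this on a toy series; it assumes the lags are built with a per-series `shift`, which may differ from how the lag features were actually constructed earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_id": [1] * 5,
    "shop_id": [10] * 5,
    "date_block_num": [0, 1, 2, 3, 4],
    "item_cnt_month": [3.0, 1.0, 4.0, 1.0, 5.0],
})

# Assumed lag construction: shift the target within each (item, shop) series.
grouped = toy.groupby(["item_id", "shop_id"])["item_cnt_month"]
for lag in (1, 2, 3):
    toy["item_cnt_month_lag_{}".format(lag)] = grouped.shift(lag)

# Only blocks after the third month have all three lags available.
complete = toy[toy["date_block_num"] > 2]
print(toy)
print(complete)
```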

In [None]:
train_c6.drop(["item_price", "avg_item_price", "avg_item_price_perc"], axis=1, inplace=True)
train_c6.drop(["avg_month_sales", "avg_month_item_sales", "avg_month_shop_sales", "avg_month_category_sales", "avg_month_city_sales", "avg_month_master_category_sales"], axis = 1, inplace=True)

train_f = train_c6.copy()
test_f = test_c2.copy()

Convert object columns to numeric codes.

In [None]:
def create_dummies(df,features):
    for col in features:
        dummies = pd.get_dummies(df[col],prefix=col)
        df = pd.concat([df,dummies],axis=1)
        df = df.drop(col, axis=1)
    return df

def categorize_column(df, features):
    for col in features:
        df[col] = df[col].astype('category')
    return df

In [None]:
# train_f = create_dummies(train_f, ["master_category"])
# train_f = create_dummies(train_f, ["city"])

# test_f = create_dummies(test_f, ["master_category"])
# test_f = create_dummies(test_f, ["city"])

train_f = categorize_column(train_f, ["master_category"])
train_f = categorize_column(train_f, ["city"])

test_f = categorize_column(test_f, ["master_category"])
test_f = categorize_column(test_f, ["city"])

train_f["master_category"] = train_f["master_category"].cat.codes
train_f["city"] = train_f["city"].cat.codes

test_f["master_category"] = test_f["master_category"].cat.codes
test_f["city"] = test_f["city"].cat.codes
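
One thing to keep in mind: the codes are derived separately for `train_f` and `test_f`, so the same value could map to different integers if the two frames do not contain exactly the same set of values. A possible safeguard, sketched here on made-up data (the city names and the `toy_*` variables are illustrative assumptions), is to build one shared `CategoricalDtype` for both frames:

```python
import pandas as pd

toy_train = pd.DataFrame({"city": ["Moscow", "Kazan", "Moscow"]})
toy_test = pd.DataFrame({"city": ["Moscow", "Omsk"]})

# Independent encoding: codes follow each frame's own sorted categories.
separate_train = toy_train["city"].astype("category").cat.codes
separate_test = toy_test["city"].astype("category").cat.codes

# Shared encoding: one category list built from both frames.
all_cities = sorted(set(toy_train["city"]) | set(toy_test["city"]))
city_dtype = pd.CategoricalDtype(categories=all_cities)
shared_train = toy_train["city"].astype(city_dtype).cat.codes
shared_test = toy_test["city"].astype(city_dtype).cat.codes

print(list(separate_train), list(separate_test))
print(list(shared_train), list(shared_test))
```

With independent encoding "Moscow" becomes 1 in the toy train frame but 0 in the toy test frame; with the shared dtype it is 1 in both.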

Downgrade numeric types for faster calculations.

In [None]:
train_f = downgrade_dtypes(train_f)
train_f.info()
test_f = downgrade_dtypes(test_f)
test_f.info()

<a id="section-4"></a>
# 4. FEATURE SELECTION #

<a id="subsection-4-1"></a>
### Feature correlation ###

Method types:
- pearson => measures linear relationships, numerical input - numerical output
- spearman => rank-based, measures monotonic relationships, numerical or ordinal data
- kendall => rank-based, also suited to ordinal data, often preferred for small samples
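
A quick way to see the difference between the three methods is a relationship that is monotonic but not linear. The sketch below uses a made-up series (`y = x ** 3`) and assumes SciPy is installed, which pandas needs for the Kendall method.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})  # monotonic but not linear

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
kendall = toy.corr(method="kendall").loc["x", "y"]  # relies on SciPy
print(pearson, spearman, kendall)
```

The two rank-based coefficients equal 1 because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a straight line.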

In [None]:
def show_corr_heatmap(df, method):
    corr = df.corr(method)
    
    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    
    return corr

In [None]:
correlations = show_corr_heatmap(train_f, "pearson")

In [None]:
top_correlation_features = list(correlations["item_cnt_month"].sort_values(ascending=False)[:30].index)

top = show_corr_heatmap(train_f[top_correlation_features], "pearson")

<a id="subsection-4-2"></a>
### Best feature selection with SelectKBest ###

Searching for best features using SelectKBest.

Regression methods: f_regression, mutual_info_regression

Classification methods: chi2, f_classif, mutual_info_classif

In [None]:
from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_regression
from sklearn.feature_selection import f_classif, mutual_info_classif

def get_best_features(df, features, target, function, num_of_features=-1):
    
    # Select all features if number is not passed
    if num_of_features == -1:
        num_of_features = len(features)
    
    # Create the model and fit it with data
    kBest = SelectKBest(score_func=function, k=num_of_features)
    kBest.fit(df[features], df[target])
    
    # Boolean mask of the selected columns
    support = kBest.get_support()
        
    # Create a dataframe of feature names and scores.
    # scores_ holds one score per input feature, so it is filtered with the same
    # mask as the names; otherwise they get misaligned when k < len(features)
    names = df[features].columns.values[support]
    scores = kBest.scores_[support]
    names_scores = list(zip(names, scores))
    feature_scores_df = pd.DataFrame(data=names_scores, columns=['feature', 'score'])
    
    # Sort the dataframe for better visualization
    feature_scores_df_sorted = feature_scores_df.sort_values(['score', 'feature'], ascending=[False, True])

    return feature_scores_df_sorted

In [None]:
from sklearn.feature_selection import chi2, f_regression, mutual_info_regression, f_classif, mutual_info_classif

target_feature = "item_cnt_month"
best_train_features = list(train_f.columns)
best_train_features.remove(target_feature)

methods = [f_regression]
for method in methods:
    best_features_kBest = get_best_features(train_f, best_train_features, target_feature, method)
    print(best_features_kBest)
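
To get a feeling for what the `f_regression` scores mean, here is a small sketch on synthetic data where only one column actually drives the target. The column names and the data are made up for illustration; `k="all"` keeps every feature so that we only look at the scores.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
toy["target"] = 3.0 * toy["signal"] + rng.normal(scale=0.5, size=200)

features = ["signal", "noise_a", "noise_b"]
selector = SelectKBest(score_func=f_regression, k="all")
selector.fit(toy[features], toy["target"])

# scores_ holds one F-statistic per input column, in column order.
scores = pd.DataFrame({"feature": features, "score": selector.scores_})
scores = scores.sort_values("score", ascending=False).reset_index(drop=True)
print(scores)
```

The informative column should come out on top by a wide margin, while the two noise columns get small scores.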

<a id="subsection-4-3"></a>

### Best feature selection with RFECV ###

Searching for best features using RFECV.

Warning: this is a very time-consuming process; in my case it took 6 minutes.

In [None]:
from sklearn.feature_selection import RFECV

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from xgboost.sklearn import XGBRegressor

import numpy as np

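def select_features_RFECV(df, target):
    # Note: inplace=True also removes rows with missing values from the dataframe passed in
    df.dropna(axis=0, inplace=True)
    
    # Keep numeric columns only
    df = df.select_dtypes([np.number])
    
    all_X = df.drop([target], axis=1)
    all_y = df[target]
    
    # Recursively eliminate features, scoring each subset with 5-fold cross-validated RMSE
    clf = LinearRegression()
    selector = RFECV(clf, cv=5, min_features_to_select=25, scoring='neg_root_mean_squared_error')
    selector.fit(all_X, all_y)

    # Names of the columns RFECV decided to keep
    optimized_columns = all_X.columns[selector.support_]

    return optimized_columns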

In [None]:
best_features_RFECV = select_features_RFECV(train_f, "item_cnt_month")
print(best_features_RFECV)
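
If you want to try RFECV without waiting several minutes, the following sketch runs it on a small synthetic dataset. The data comes from `make_regression` and the settings (8 features, at least 2 kept) are arbitrary choices for illustration, not the ones used on the sales data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

selector = RFECV(LinearRegression(), cv=5, min_features_to_select=2,
                 scoring="neg_root_mean_squared_error")
selector.fit(X, y)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask over the input columns
print(selector.ranking_)      # rank 1 marks a selected feature
```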

<a id="section-5"></a>
# 5. MODELING #

First we create a context manager to measure calculation times.

In [None]:
import contextlib
import time

@contextlib.contextmanager
def timer():
    start = time.time()
    
    yield

    end = time.time()
    runtime = '{:.2f}s \n'.format(end - start)
    print(runtime)
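
A possible variation of this timer, shown here only as a sketch: wrapping the `yield` in `try`/`finally` makes sure the runtime is printed even when the timed block raises an exception. The name `timer_safe` and the `label` argument are hypothetical additions.

```python
import contextlib
import time

@contextlib.contextmanager
def timer_safe(label="block"):
    # perf_counter is a monotonic clock intended for measuring intervals.
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the timed block raises.
        print("{}: {:.2f}s".format(label, time.perf_counter() - start))

with timer_safe("sum"):
    total = sum(range(1_000_000))
```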

The function below optimizes hyperparameters using GridSearchCV or RandomizedSearchCV.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from xgboost.sklearn import XGBRegressor

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

def select_regression_model(df, features, target, typ):
    
    all_X = df[features]
    all_y = df[target]
    
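    models = [
        {
            "name": "LinearRegression",
            "estimator": LinearRegression(),
            "hyperparameters":
                {
                    # `normalize` was removed from LinearRegression in scikit-learn 1.2,
                    # so `fit_intercept` is searched here instead
                    "fit_intercept": [True, False]
                }
         }
    ]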
    
    for model in models:
        with timer():
            if typ == "grid":
                search = GridSearchCV(model["estimator"], param_grid=model["hyperparameters"], cv=5, scoring='neg_root_mean_squared_error')
            elif typ == "random":
                search = RandomizedSearchCV(model["estimator"], param_distributions=model["hyperparameters"], n_iter = 1, cv=5, scoring='neg_root_mean_squared_error')
            else:
                raise ValueError('typ must be "grid" or "random"')

            search.fit(all_X, all_y)
            model["best_params"] = search.best_params_
            model["best_score"] = search.best_score_
            model["best_model"] = search.best_estimator_

    return models
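
One detail that is easy to miss: with `scoring='neg_root_mean_squared_error'` the value stored in `best_score_` is the negated RMSE, because scikit-learn always maximizes the score. The sketch below shows this on synthetic data; `Ridge` and its `alpha` grid are swapped in here purely for illustration and are not the model searched above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]},
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign to get a plain RMSE
print(search.best_params_, best_rmse)
```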

<a id="subsection-5-1"></a>
### Split the train data ###

Split the train data into a training part (`train_f_tr`) and a validation part (`train_f_t`). We use the last month (`date_block_num` 33) to evaluate our model.

In [None]:
train_f_tr = train_f[(train_f["date_block_num"] < 33)]
train_f_t = train_f[train_f["date_block_num"] == 33]

<a id="subsection-5-2"></a>
### Selecting features to train ###

Best correlation features

In [None]:
correlations["item_cnt_month"].sort_values(ascending=False)

Best features using SelectKBest.

In [None]:
list(best_features_kBest["feature"][:30].to_numpy())

Best features using RFECV.

In [None]:
list(best_features_RFECV.to_numpy())

Target and train features selection

In [None]:
target_feature = "item_cnt_month"
all_train_features = list(train_f.columns)
all_train_features.remove(target_feature)
print("Available features: {}".format(all_train_features))

In [None]:
selected_features = ['date_block_num',
 'year',
 'month',
 'item_shop_date_block_min',
 'item_shop_date_block_max',
 'item_shop_weekday_median',
 'item_weekday_median',
 'item_month_mode',
 'shop_weekday_median',
 'shop_month_median',
 'item_cnt_month_lag_1',
 'item_cnt_month_lag_2',
 'item_cnt_month_lag_3',
 'avg_month_item_sales_lag_1',
 'avg_month_item_sales_lag_2',
 'avg_month_item_sales_lag_3',
 'avg_month_shop_sales_lag_1',
 'avg_month_shop_sales_lag_2',
 'avg_month_shop_sales_lag_3',
 'avg_month_sales_lag_1',
 'avg_month_category_sales_lag_1',
 'avg_month_city_sales_lag_1',
 'avg_item_price_perc_lag_1',
 'item_type',
 'month_days']

Model evaluation function

In [None]:
from sklearn.metrics import mean_squared_error

def evaluate(result):
    # Despite the variable name, this is whichever estimator the search returned
    best_rf_model = result[0]["best_model"]

    predictions_tr = best_rf_model.predict(train_f_tr[selected_features])
    predictions_t = best_rf_model.predict(train_f_t[selected_features])

    # Predictions are clipped to [0, 20] before scoring, the same range used for the submission
    rmse_tr = (mean_squared_error(train_f_tr["item_cnt_month"].to_numpy(), predictions_tr.clip(0., 20.))) ** (1/2)
    rmse_t = (mean_squared_error(train_f_t["item_cnt_month"].to_numpy(), predictions_t.clip(0., 20.))) ** (1/2)
    
    # Results go to the global `data` list, which must exist before this function is called
    data.append({"best_model": best_rf_model, "best_score": result[0]["best_score"], "features": selected_features,
                 "rmse_train": rmse_tr, "rmse_test": rmse_t})

<a id="subsection-5-3"></a>
### Training and evaluating ###

In [None]:
data = []

result = select_regression_model(train_f_tr, selected_features, target_feature, "random")
evaluate(result)

print(data)
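
The `rmse_train` and `rmse_test` values printed here are computed on predictions clipped to the range [0, 20]. A tiny worked example with made-up numbers shows how much clipping can matter when the model overshoots:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.0, 1.0, 20.0])
raw_predictions = np.array([-1.0, 2.0, 30.0])

# Clipping turns the raw predictions into [0, 2, 20].
clipped = raw_predictions.clip(0.0, 20.0)

rmse_raw = mean_squared_error(y_true, raw_predictions) ** 0.5
rmse_clipped = mean_squared_error(y_true, clipped) ** 0.5
print(rmse_raw, rmse_clipped)
```

Without clipping the squared errors are 1, 1 and 100 (RMSE about 5.83); after clipping only the middle prediction is off by 1 (RMSE about 0.58).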

In [None]:
best_rf_model = result[0]["best_model"]
best_rf_model.fit(train_f[selected_features], train_f[target_feature])

<a id="subsection-5-4"></a>
### Predictions ###

We first select the best-performing model and make predictions on the actual test set.

In [None]:
predictions = best_rf_model.predict(test_f[selected_features])

Quick check of the predictions.

In [None]:
test_f["predictions"] = predictions
test_f["predictions"].describe()

<a id="section-6"></a>
# 6. CREATING SUBMISSION FILE #

Creating a submission file

In [None]:
def save_submission_file(data, filename="submission_13.csv"):
    test_ids = data.index
    predictions = data["predictions"].clip(0., 20.)
    
    submission_df = {"ID": test_ids,
                 "item_cnt_month": predictions}
    
    submission = pd.DataFrame(submission_df)
    submission.to_csv(filename,index=False)

In [None]:
save_submission_file(test_f)
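
Before uploading, it can be worth reading the file back to confirm its shape. The sketch below repeats the same steps on a toy frame and a temporary file; the toy predictions and the file name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

toy_test = pd.DataFrame({"predictions": [-0.3, 1.7, 42.0]})

# Same layout as the submission: row index as ID, clipped predictions as target.
submission = pd.DataFrame({
    "ID": toy_test.index,
    "item_cnt_month": toy_test["predictions"].clip(0.0, 20.0),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_submission.csv")
    submission.to_csv(path, index=False)
    check = pd.read_csv(path)

print(check)
```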