Table of contents:

1. [SUPPORT TABLES ANALYSIS](#section-1)
    - [Shops](#subsection-1-1)
    - [Item Categories](#subsection-1-2)
    - [Items](#subsection-1-3)
    - [Data merge](#subsection-1-4)
2. [MERGED DATA ANALYSIS](#section-2)
    - [Data overview](#subsection-2-1)
    - [Managing date columns](#subsection-2-2)
    - [Sales quantity](#subsection-2-3)
    - [Prices and net sales](#subsection-2-4)
3. [AGGREGATED DATA ANALYSIS](#section-3)
    - [Non-aggregated features](#subsection-3-1)
    - [Aggregating train data](#subsection-3-2)
    - [Current price imputing](#subsection-3-3)
    - [First/last sales, best month/weekday/day](#subsection-3-4)
    - [Mean quantity features](#subsection-3-5)
    - [Lag features](#subsection-3-6)
    - [Calendar related features](#subsection-3-7)
4. [FINAL STEPS](#subsection-4)

<a id="section-1"></a>
# 1. SUPPORT TABLES ANALYSIS #

I will first separately analyze the following support tables.
- Shops
- Item categories
- Items

I am using the translated version from [remisharoon](https://www.kaggle.com/remisharoon) in his [Predict future sales translated dataset](https://www.kaggle.com/datasets/remisharoon/predict-future-sales-translated-dataset).


After the analysis I will merge all of them into one dataframe.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import gc

%matplotlib inline

items = pd.read_csv("../input/predict-future-sales-translated-dataset/items_en.csv")
item_categories = pd.read_csv("../input/predict-future-sales-translated-dataset/item_categories_en.csv")
shops = pd.read_csv("../input/predict-future-sales-translated-dataset/shops_en.csv")

sample_submission = pd.read_csv("../input/competitive-data-science-predict-future-sales/sample_submission.csv")
train_o = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
test_o = pd.read_csv("../input/competitive-data-science-predict-future-sales/test.csv")

<a id="subsection-1-1"></a>
## 1.1 Shops ##

In [None]:
shops["shop_name"].head(10)

Most of the shop names begin with the name of the city. I will take the first word of each name and store it in a new feature named "city".

In [None]:
shops["city"] = shops["shop_name"].str.replace("[!,?,²]", "", regex=True).str.lower().str.strip().str.split(" ").str.get(0).str.strip()

fig = plt.figure(figsize =(20, 10))
sns.countplot(y="city", data=shops).set_title("Number of shops by city")
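
To make the chained string operations easier to follow, here is a minimal sketch on a few made-up shop names (the `demo_shops` frame and its values are illustrative assumptions, not rows from the dataset). It passes `regex=True` explicitly, because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical shop names, invented for illustration only
demo_shops = pd.DataFrame({
    "shop_name": [
        "!Yakutsk Ordzhonikidze, 56 fran",
        'Moscow TC "Budenovskiy"',
        "Digital warehouse 1C-Online",
    ]
})

demo_shops["city"] = (
    demo_shops["shop_name"]
    .str.replace("[!,?,²]", "", regex=True)  # drop punctuation noise
    .str.lower()
    .str.strip()
    .str.split(" ")
    .str.get(0)  # the first token is usually the city
)
print(demo_shops["city"].tolist())
```

The last value shows why a manual clean-up step is still needed: the first token is not always a city.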

Some of the extracted values are not cities, so I will replace them with "other" (note that "st." stands for St. Petersburg and is kept):
- itinerant (id 9)
- shop (id 12)
- digital (id 55)

I will take a closer look at those shops later.

In [None]:
not_cities = ["itinerant", "shop", "digital"]
shops["city"] = shops["city"].apply(lambda x: "other" if x in not_cities else x)
# shops.sort_values("city")

There are some potential duplicates, which I will explore later:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Moscow TC "Budenovskiy" (id 23 and 24)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)
- RostovNaDonu TRC "Megacenter Horizon" (id 39 and 40)

Next I will try to find more patterns using a word cloud.

In [None]:
from wordcloud import WordCloud, STOPWORDS

def generateWordCloud(series, return_words=False):
    wordcloud = WordCloud(width = 1600, height = 800,
                    background_color ='white',
                    stopwords = STOPWORDS,
                    min_font_size = 10).generate(" ".join(series)) #_from_frequencies(series.value_counts())  

    plt.figure(figsize = (20, 7), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.show()
    if return_words:
        return pd.DataFrame(wordcloud.words_, index=["values"]).T.reset_index()

In [None]:
generateWordCloud(shops["shop_name"])

Some words seem to stand out: SEC, TK, TC, TRC, Mega. However, I will not dive deeper for now.

<a id="subsection-1-2"></a>
## 1.2 Item Categories ##

In [None]:
item_categories["item_category_name"]

Category names seem to begin with their group. I will isolate the text before and after " - " to create new features "master_category" and "subcategory".

In [None]:
item_categories["master_category"] = item_categories["item_category_name"].str.replace("[!?².]", "", regex=True).str.lower().str.strip().str.split(" - ").str.get(0).str.strip().fillna("missing")
item_categories["subcategory"] = item_categories["item_category_name"].str.replace("[!?².]", "", regex=True).str.lower().str.strip().str.split(" - ").str.get(1).str.strip().fillna("missing")
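
The split into a group and a subgroup can be illustrated on a toy frame. This is only a sketch: the category names below are made up, and `expand=True` with `n=1` is my own variation that splits once and returns both parts in one call.

```python
import pandas as pd

# Hypothetical category names, invented for illustration only
demo_cat = pd.DataFrame({
    "item_category_name": ["Games - PS4", "Movie - Blu-Ray", "Delivery of goods"]
})

# Split once on " - ": column 0 is the group, column 1 the subgroup (NaN when absent)
parts = demo_cat["item_category_name"].str.lower().str.split(" - ", n=1, expand=True)
demo_cat["master_category"] = parts[0].str.strip()
demo_cat["subcategory"] = parts[1].str.strip().fillna("missing")
print(demo_cat)
```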

In [None]:
#Plot master categories
fig = plt.figure(figsize =(20, 10))
sns.countplot(y="master_category", data=item_categories).set_title("Number of categories by master category")

After a quick check, a little cleaning is needed.

In [None]:
item_categories["master_category"] = (item_categories["master_category"]
                                        .str.replace("игры", "games")
                                        .str.replace("movies", "movie")
                                        .str.replace("programs", "program")
                                        .str.replace("payment cards", "payment card")
                                     )

In [None]:
#Plot subcategories
fig = plt.figure(figsize =(20, 10))
sns.countplot(y="subcategory", data=item_categories).set_title("Number of categories by subcategories")

In [None]:
# Additional patterns search
generateWordCloud(item_categories["item_category_name"])

There is one more word that stands out and that I haven't explored yet: "digital". I will create a new boolean feature "is_digital" to take it into account. I will also remove the "digital" string from the other features to avoid duplicated information.

In [None]:
item_categories["is_digital"] = item_categories["item_category_name"].apply(lambda x: 1 if ("digit" in x) or ("Digital" in x) else 0)
item_categories["master_category"] = item_categories["master_category"].str.replace(r"\(digits\)", "", regex=True)
item_categories["subcategory"] = item_categories["subcategory"].str.replace(r"\(digital\)", "", regex=True)
item_categories["subcategory"] = item_categories["subcategory"].str.replace("digital", "missing", regex=False)

In [None]:
#Final touches

item_categories.loc[26, "master_category"] = "games"
item_categories.loc[26, "subcategory"] = "android"

item_categories.loc[27, "master_category"] = "games"
item_categories.loc[27, "subcategory"] = "mac"

item_categories.loc[28:31, "master_category"] = "games"
item_categories.loc[28:31, "subcategory"] = "pc"

item_categories.loc[32, "master_category"] = "payment card"
item_categories.loc[32, "subcategory"] = "movie, music, games"

item_categories.loc[43:45, "subcategory"] = "audiobooks"

item_categories.loc[81, "master_category"] = "net carriers"
item_categories.loc[81, "subcategory"] = "spire"

item_categories.loc[82, "master_category"] = "net carriers"
item_categories.loc[82, "subcategory"] = "piece"

In [None]:
# Drop category name, since it is redundant
item_categories.drop(["item_category_name"], axis=1, inplace=True)

<a id="subsection-1-3"></a>
## 1.3 Items ##

To better understand the data, I will merge item categories with item dataframe.

In [None]:
items = pd.merge(items, item_categories, how="left", on='item_category_id')
items

There seems to be some additional categorization inside item names. I will first inspect strings inside ( ) and [ ].

In [None]:
# Extract text from ( ) and [ ]
items["item_name_feat_1"] = items["item_name"].str.lower().str.strip().str.extract(r".*\((.*)(?:\)|\$)").fillna("missing")
items["item_name_feat_2"] = items["item_name"].str.lower().str.strip().str.extract(r".*\[(.*)(?:\]|\$)").fillna("missing")
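
As a small illustration of the idea, the sketch below pulls the text inside ( ) and [ ] from three made-up item names. The names are hypothetical, and the patterns are deliberately simplified: they capture the first bracketed group, whereas the greedy `.*` prefix in the patterns above favours the last one.

```python
import pandas as pd

# Hypothetical item names, invented for illustration only
demo_items = pd.Series([
    "FIFA 14 (PS3) [russian version]",
    "Some movie (BD)",
    "Plain name",
])

lowered = demo_items.str.lower()
# expand=False returns a Series when the pattern has a single capture group
in_parens = lowered.str.extract(r"\(([^)]*)\)", expand=False).fillna("missing")
in_brackets = lowered.str.extract(r"\[([^\]]*)\]", expand=False).fillna("missing")

print(in_parens.tolist())
print(in_brackets.tolist())
```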

In [None]:
items["item_name_feat_1"].value_counts(normalize=True)[:5]

In [None]:
items["item_name_feat_2"].value_counts(normalize=True)[:5]

The share of missing values for both extracted features is relatively high (58% and 80%), but there could be some old inactive products on the list. Therefore I will further analyze the non-missing values.

### Item name feature 1

In [None]:
# Show the 10 most frequently used strings inside ( )
items["item_name_feat_1"].value_counts().reset_index()[:10]

In [None]:
# Bad translation cleaning
items["item_name_feat_1"] = items["item_name_feat_1"].replace("фирм.", "firms")
items["item_name_feat_1"] = items["item_name_feat_1"].replace("регион", "region")

This looks like an item categorization. I will create a new feature "item_name_category" from the 10 most frequent strings inside ( ); the rest will be categorized as missing.

In [None]:
# Keep the value_counts Series itself: the column name produced by reset_index() differs between pandas versions
extract_feat_1 = items["item_name_feat_1"].value_counts().head(10)
items["item_name_category"] = items["item_name_feat_1"].apply(lambda x: x.lower() if x in extract_feat_1.index else "missing")

### Item name feature 2

In [None]:
# Show the 10 most frequently used strings inside [ ]
items["item_name_feat_2"].value_counts().reset_index()[:10]

In [None]:
# Bad translation cleaning
items["item_name_feat_2"] = items["item_name_feat_2"].str.replace("цифровая версия", "digital version")

The most frequent strings in [ ] mostly indicate the language version, so I will create a new language-related feature. There is also one string (jewel) which belongs to the previous categorization.

In [None]:
# Add new jewel categories in item_name_category
items["item_name_category"] = items.apply(lambda x: "jewel" if "jewel" in x["item_name_feat_2"] else x["item_name_category"], axis=1)

# Extract english and russian from item_name_feat_2
items["item_language"] = items["item_name_feat_2"].str.lower().str.extract(".*(russian|english)").fillna("missing")

There is also a repeating "digital version" string. I will check it against the is_digital feature I created earlier.

In [None]:
# Extract is_digital_2 feature to compare it to is_digital
items["is_digital_2"] = items["item_name_feat_2"].str.lower().str.extract(".*(digital)").fillna("missing")

# Pivot table helper
items["ones"] = 1

In [None]:
pivot_item_language = items.pivot_table(index="is_digital", columns="is_digital_2", values="ones", aggfunc="sum").fillna(0)

cmap = sns.diverging_palette(230, 20, as_cmap=True)
fig, ax = plt.subplots(1,1,figsize=(10,10))

sns.heatmap(pivot_item_language.T, cmap=cmap, vmax=500, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt=".0f").set_title("'Digital' categorization comparison")

In [None]:
# Set is_digital to 1 for the 334 extra items flagged as digital by their name
items["is_digital"] = items.apply(lambda x: 1 if (x["is_digital"] == 0) & (x["is_digital_2"] == "digital") else x["is_digital"], axis=1)

In [None]:
items.drop(["item_name_feat_1", "item_name_feat_2", "ones", "is_digital_2"], axis=1, inplace=True)

### Duplicated values

Many item names begin with ! or *, which hints at potential duplicates, as was the case with the shop names.

In [None]:
items["item_name"] = items["item_name"].str.replace("[^A-Za-z0-9А-Яа-я]+", " ", regex=True).str.lower()

In [None]:
# Find duplicated item names inside items dataframe
name = items["item_name"]
duplicated = items[name.isin(name[name.duplicated()])]
print("Duplicated items {}".format(len(duplicated)))

There are 147 rows with a repeated item name. To start, I will check whether some of them appear in the test set and add a new feature "is_in_test" to indicate it.

In [None]:
# Find duplicated item names in test dataframe
duplicated_in_test = pd.merge(duplicated, test_o[["item_id", "shop_id"]], how="inner", on="item_id").groupby(["item_name", "item_id"]).agg({"shop_id": "count"}).reset_index()

In [None]:
# Add column that indicate duplicated items, that are not in test dataframe
duplicated = pd.merge(duplicated, duplicated_in_test[["item_id"]], indicator=True, how="left", on=["item_id"])
# "_merge" is categorical; compare it directly instead of replacing its category labels
duplicated["is_in_test"] = duplicated["_merge"] == "both"
duplicated.drop("_merge", axis=1, inplace=True)
duplicated.sort_values(by=["item_name"], inplace=True)
duplicated

Now I will generate a replacement dictionary for the duplicated items. By default the first item id of a pair is replaced by the second one; the exception is when the first item appears in the test set, in which case the second id is replaced by the first. The dictionary will be used later in the analysis.

In [None]:
replace_item_dict = {}

temp_name = ""
temp_id = 0
use_test_data = None
count = 0

for index, row in duplicated.iterrows():
    if row["item_name"] != temp_name:
        count = 1
        temp_name, temp_id = row["item_name"], row["item_id"]
        use_test_data = bool(row["is_in_test"])
    else:
        count += 1
        if (use_test_data == True) | (count > 2):
            replace_item_dict[row["item_id"]] = temp_id
        else:
            replace_item_dict[temp_id] = row["item_id"]
        
        temp_id = row["item_id"]
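
The same rule can also be written with a `groupby`, which may be easier to check by eye. This is an alternative sketch on made-up data, not the loop used above: the `dup` frame is hypothetical, and picking "the first id found in the test set, otherwise the last id of the group" is my reading of the rule.

```python
import pandas as pd

# Hypothetical duplicated items, invented for illustration only
dup = pd.DataFrame({
    "item_name": ["a", "a", "b", "b", "c", "c"],
    "item_id": [1, 2, 3, 4, 5, 6],
    "is_in_test": [False, False, True, False, False, True],
})

replace_map = {}
for name, grp in dup.groupby("item_name"):
    in_test = grp.loc[grp["is_in_test"], "item_id"]
    # Prefer an id that occurs in the test set; otherwise keep the last id of the group
    keep = int(in_test.iloc[0]) if len(in_test) else int(grp["item_id"].iloc[-1])
    for item_id in grp["item_id"]:
        if item_id != keep:
            replace_map[int(item_id)] = keep

print(replace_map)
```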

<a id="subsection-1-4"></a>
## 1.4 Data merge ##

Merge items and shops to train/test data. Create new features that indicate train/test only data.

In [None]:
# Data merge
train_o = pd.merge(train_o, items, how="left", on='item_id')
train_o = pd.merge(train_o, shops, how="left", on='shop_id')

test_o = pd.merge(test_o, items, how="left", on='item_id')
test_o = pd.merge(test_o, shops, how="left", on='shop_id')

In [None]:
# Create train/test only item-shop combo features
train_o = pd.merge(train_o, test_o[["item_id", "shop_id"]], indicator='train_only_item_shop', how="left", on=["item_id", "shop_id"])
test_o = pd.merge(test_o, train_o[["item_id", "shop_id"]], indicator='test_only_item_shop', how="left", on=["item_id", "shop_id"])
test_o.drop_duplicates(inplace=True, ignore_index=True)

train_o['train_only_item_shop'] = np.where(train_o.train_only_item_shop == 'left_only', True, False)
test_o['test_only_item_shop'] = np.where(test_o.test_only_item_shop == 'left_only', True, False)
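
The `indicator` argument of `pd.merge` does the heavy lifting here. A minimal sketch on two tiny made-up frames (`demo_train` and `demo_test` are illustrative assumptions) shows how a left merge marks the rows that have no match on the right side.

```python
import pandas as pd

# Hypothetical item-shop pairs, invented for illustration only
demo_train = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
demo_test = pd.DataFrame({"shop_id": [1, 3], "item_id": [10, 10]})

merged = demo_train.merge(
    demo_test, how="left", on=["shop_id", "item_id"], indicator="train_only"
)
# "left_only" means the pair exists in train but not in test
merged["train_only"] = merged["train_only"] == "left_only"
print(merged)
```

Because the right-hand frame has unique pairs, the left merge keeps one row per train row.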

In [None]:
# Create train/test only item features
item_ids_train = train_o['item_id'].unique()
item_ids_test = test_o['item_id'].unique()

train_o['train_only_item'] = np.logical_not(np.isin(train_o.item_id, item_ids_test))
test_o['test_only_item'] = np.logical_not(np.isin(test_o.item_id, item_ids_train))

I will also downgrade numeric data types to save memory.

Thank you [Konstantin Yakovlev](https://www.kaggle.com/kyakovlev/1st-place-solution-part-1-hands-on-data) for this trick!


In [None]:
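def downgrade_dtypes(df):
    float_cols = list(df.dtypes[df.dtypes == "float64"].index)
    int_cols = list(df.dtypes[(df.dtypes == "int64") | (df.dtypes == "int32")].index)

    df[float_cols] = df[float_cols].astype(np.float32)

    # Cast to int16 only when every value fits; larger values (e.g. a row id column) would overflow silently
    int16 = np.iinfo(np.int16)
    safe_int_cols = [c for c in int_cols if df[c].min() >= int16.min and df[c].max() <= int16.max]
    df[safe_int_cols] = df[safe_int_cols].astype(np.int16)
    
    return df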

In [None]:
train_s = downgrade_dtypes(train_o)
test_s = downgrade_dtypes(test_o.copy())
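
An alternative way to shrink numeric columns is `pd.to_numeric` with the `downcast` argument, which picks the smallest dtype that can hold the values. The sketch below uses a made-up frame (the column names and values are illustrative assumptions) and is not the trick referenced above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, invented for illustration only
demo = pd.DataFrame({
    "shop_id": [0, 59],
    "row_id": [0, 214199],      # too large for int16
    "price": [999.0, 1499.5],
})

for col in demo.select_dtypes("integer").columns:
    demo[col] = pd.to_numeric(demo[col], downcast="integer")
for col in demo.select_dtypes("float").columns:
    demo[col] = pd.to_numeric(demo[col], downcast="float")

print(demo.dtypes)
```

The dtype is chosen per column, so a column with large values keeps a wider integer type instead of wrapping around.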

<a id="section-2"></a>
# 2. MERGED DATA ANALYSIS #

<a id="subsection-2-1"></a>
## 2.1 Data overview ##

I am using my custom function to generate a features overview.

In [None]:
from scipy.stats import kendalltau, pearsonr, spearmanr

def kendall_pval(x,y):
    return kendalltau(x,y)[1]

def pearsonr_pval(x,y):
    return pearsonr(x,y)[1]

def spearmanr_pval(x,y):
    return spearmanr(x,y)[1]

def generate_features_overview(df, target_feature=None, corr="pearson"):
    
    if "is_train" not in df.columns:
        df["is_train"] = True
        
    df_info = pd.DataFrame()
    df_info["type"] = df.dtypes
    df_info["missing_count"] = df.isna().sum()
    df_info["missing_perc"] = (df_info["missing_count"] / len(df) * 100).astype(int)
    df_info["unique"] = df.nunique()
    df_info["top"] = df.mode().head(1).T
    df_info["freq"] = df[df==df_info["top"]].count()
    df_info["freq_perc"] = (df_info["freq"] / len(df) * 100).astype(int)
    
    temp_df = df.apply(lambda x : pd.factorize(x)[0] if x.dtypes == "object" else x)

    temp_df.fillna(-1, inplace=True)

    df_info["var"] = temp_df[temp_df["is_train"] == True].var()
    df_info["skew"] = temp_df[temp_df["is_train"] == True].skew()
    df_info["kurt"] = temp_df[temp_df["is_train"] == True].kurt()
    
    if target_feature != None:
        df_info["corr"] = df[df["is_train"] == True].corr(corr)[target_feature]
        df_info["corr_p_value"] = df[df["is_train"] == True].corr(method=pearsonr_pval)[target_feature]

        for feature in df.loc[:, df.dtypes == "object"].columns:
            dummies_df = pd.get_dummies(df[feature].fillna(-1),prefix=feature)
            dummies_df[target_feature] = df[target_feature]


            ma = dummies_df.corr(corr)[target_feature][:-1].max()
            mi = dummies_df.corr(corr)[target_feature][:-1].min()

            if abs(ma) > abs(mi):
                df_info.loc[feature, "corr"] = ma
            else:
                df_info.loc[feature, "corr"] = mi

        
    df_info = pd.concat([df_info, df.describe().T], axis=1)[:-1]
    
    return df_info

In [None]:
info_df = generate_features_overview(train_s.copy())
info_df

Conclusions:
- No missing values
- The number of unique dates equals the number of days between 01.01.2013 and 31.10.2015, which means that we have sales every day
- We have 21807 different items by id and 21735 different items by name, which points to duplicates
- We have 84 different categories
- We have 60 different shops
- We have a lot of outliers in the item_price and item_cnt_day columns
- 98% of rows represent non-digital items
- 75% of products are sold for less than 1000 RUB
- 58% of the item-shop combos are only in the train set

<a id="subsection-2-2"></a>
## 2.2 Managing date columns ##

The "date" column should first be converted to a datetime type. The following columns are then added to support a more detailed EDA:
- day
- month
- year
- weekday

In [None]:
train_s["date"] = pd.to_datetime(train_s["date"], format="%d.%m.%Y")
train_s["month"] = train_s["date"].dt.month
train_s["year"] = train_s["date"].dt.year
train_s["weekday"] = train_s["date"].dt.weekday
train_s["day"] = train_s["date"].dt.day
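
The explicit `format` matters because the dates are written day first. A minimal sketch on two made-up date strings (the `demo_dates` frame is an illustrative assumption) shows the parsed month and weekday:

```python
import pandas as pd

# Hypothetical dates in the same dd.mm.yyyy layout, invented for illustration only
demo_dates = pd.DataFrame({"date": ["02.01.2013", "31.10.2015"]})

# With an explicit format, "02.01.2013" is read as 2 January, not 1 February
demo_dates["date"] = pd.to_datetime(demo_dates["date"], format="%d.%m.%Y")
demo_dates["month"] = demo_dates["date"].dt.month
demo_dates["weekday"] = demo_dates["date"].dt.weekday  # Monday = 0, Sunday = 6

print(demo_dates)
```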

<a id="subsection-2-3"></a>
## 2.3 Sales quantity ##

<a id="subsection-2-3-1"></a>
### Distribution ###

In [None]:
train_s["item_cnt_day"].value_counts(bins=10).sort_index()

In [None]:
# Cumulative share of the 10 most frequent daily quantities
train_s["item_cnt_day"].value_counts(normalize=True).cumsum().head(10)

### Quantity Outliers

In [None]:
def markOutliers(df, group_feature, value_feature, boundaries_factor=1.5, method="IQR"):

    if method == "z-score":
        # Select the value column before aggregating so that non-numeric columns are never touched
        grouped = df.groupby(group_feature)[value_feature]
        df_mean = grouped.mean()
        df_std = grouped.std()

        feat_min = df_mean - boundaries_factor * df_std
        feat_max = df_mean + boundaries_factor * df_std
        
    elif method == "IQR":
        grouped = df.groupby(group_feature)[value_feature]
        Q1 = grouped.quantile(0.25)
        Q3 = grouped.quantile(0.75)
        IQR = Q3 - Q1

        feat_min = Q1 - boundaries_factor * IQR
        feat_max = Q3 + boundaries_factor * IQR
    
    feat_min = feat_min.rename("min")
    feat_max = feat_max.rename("max")

    df = df.set_index(group_feature).join(feat_min).join(feat_max).reset_index()
    
    # Note: groups with undefined bounds (e.g. a single row, where std is NaN) fail both comparisons and are flagged too
    df[f"outlier_{value_feature}"] = ~((df[value_feature] >= df["min"]) & (df[value_feature] <= df["max"])).fillna(False)
    
    df.drop("min", axis=1, inplace=True)
    df.drop("max", axis=1, inplace=True)
    
    return df

In [None]:
train_s = markOutliers(train_s, "item_id", "item_cnt_day", 3, method="z-score")
train_s["outlier_item_cnt_day"].value_counts(normalize=True)
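
For intuition, the per-group z-score rule can be sketched with `groupby(...).transform`. The data below is made up, and this compact variant differs from `markOutliers` in one respect that I am assuming is acceptable for a demo: groups with zero or undefined spread are not flagged.

```python
import pandas as pd

# Hypothetical daily sales, invented for illustration only
demo_sales = pd.DataFrame({
    "item_id": [1] * 11 + [2] * 3,
    "item_cnt_day": [1] * 10 + [100] + [2, 2, 2],
})

grouped = demo_sales.groupby("item_id")["item_cnt_day"]
mean = grouped.transform("mean")
std = grouped.transform("std")

# z is NaN when std is 0 or undefined, and NaN > 3 evaluates to False
z = (demo_sales["item_cnt_day"] - mean) / std
demo_sales["outlier"] = z.abs() > 3

print(demo_sales[demo_sales["outlier"]])
```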

In [None]:
fig = plt.figure(figsize =(20, 7))
sns.boxplot(x = "shop_id", y = "item_cnt_day", data = train_s).set_title("Daily sales outliers by shop")

We can see that the vast majority (99.59%) of quantities fall within the 10 most frequent values (-1 to 9). We also see some outliers, especially in shop 12.

In [None]:
#train_s1 = train_s[train_s["item_cnt_day"] < 1000]
train_s1 = train_s.copy()

<a id="subsection-2-3-2"></a>
### Chronological sales ###

In [None]:
month_agg = train_s1.groupby(["date_block_num", "month"]).agg({"item_cnt_day":"sum"}).reset_index()

fig = plt.figure(figsize =(20, 7))
sns.barplot(x="date_block_num", y="item_cnt_day", hue="month", dodge=False, data=month_agg).set_title("Monthly sales over time")

We can see a declining trend and a strong seasonal effect.

<a id="subsection-2-3-3"></a>
### Monthly sales ###

In [None]:
fig = plt.figure(figsize =(20, 7))
sns.barplot(x="month", y="item_cnt_day", color="c", data=month_agg.groupby(["month"]).agg({"item_cnt_day":"mean"}).reset_index()).set_title("Mean monthly sales")

Again there is a strong seasonality effect, especially in December.

<a id="subsection-2-3-4"></a>
### Weekday sales ###

In [None]:
day_agg = train_s1.groupby(["weekday", "day"]).agg({"item_cnt_day":"sum"}).reset_index()
weekday_agg = day_agg.groupby(["weekday"]).agg({"item_cnt_day":"mean"}).reset_index()

weekday_agg["perc"] = weekday_agg["item_cnt_day"] / sum(weekday_agg["item_cnt_day"])
weekday_agg

In [None]:
fig = plt.figure(figsize =(20, 7))
sns.barplot(x="weekday", y="item_cnt_day", color="c", data=weekday_agg).set_title("Mean weekday sales")

Most sales occur on weekends (52%), which we'll consider later.

<a id="subsection-2-3-5"></a>

### Sales by shop ###

First let's explore some potential duplicates:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Moscow TC "Budenovskiy" (id 23 and 24)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)
- RostovNaDonu TRC "Megacenter Horizon" (id 39 and 40)

I will plot shop sales pairs in the barplot.

In [None]:
shop_month_agg = train_s1.groupby(["shop_id", "date_block_num"]).agg({"item_cnt_day":"sum"}).reset_index()

In [None]:
def comparison_barplot(df, ids, feature):
    compare_ids = ids
    shop_1 = df[df[feature] == compare_ids[0]]
    shop_2 = df[df[feature] == compare_ids[1]]

    shop_comp = pd.merge(shop_1, shop_2, how="outer", on='date_block_num')
    shop_comp.fillna(0, inplace=True)

    shop_comp["stacked"] = shop_comp["item_cnt_day_x"] + shop_comp["item_cnt_day_y"]

    fig, ax1 = plt.subplots(figsize=(20, 7))
    sns.barplot(x='date_block_num', y='stacked', data=shop_comp, ax=ax1, color="b", label="{0}: {1}".format(feature, compare_ids[1]))
    sns.barplot(x='date_block_num', y='item_cnt_day_x', data=shop_comp, ax=ax1, color="c", label="{0}: {1}".format(feature, compare_ids[0])).set_title("Monthly sales by shop")
    plt.legend()
    sns.despine(fig)

In [None]:
comparison_barplot(shop_month_agg, [10, 11], "shop_id")

In [None]:
comparison_barplot(shop_month_agg, [23, 24], "shop_id")

In [None]:
comparison_barplot(shop_month_agg, [0, 57], "shop_id")

In [None]:
comparison_barplot(shop_month_agg, [1, 58], "shop_id")

In [None]:
comparison_barplot(shop_month_agg, [39, 40], "shop_id")

Looks like the following shops are duplicated:
- Zhukovsky st. Chkalov 39m² (id 10 and 11)
- Yakutsk Ordzhonikidze (id 0 and 57)
- Yakutsk TC "Central" (id 1 and 58)
- RostovNaDonu TRC "Megacenter Horizon" (id 39 and 40)

Let's make the following shop id replacements:
- id 11 => id 10
- id 0 => id 57
- id 1 => id 58
- id 40 => id 39

In [None]:
train_s1.loc[train_s1["shop_id"] == 11, 'shop_id'] = 10
train_s1.loc[train_s1["shop_id"] == 0, 'shop_id'] = 57
train_s1.loc[train_s1["shop_id"] == 1, 'shop_id'] = 58
train_s1.loc[train_s1["shop_id"] == 40, 'shop_id'] = 39
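
The same replacements can be expressed as a single mapping, which keeps the rule in one place if it has to be reused (for example on another frame). A minimal sketch on a made-up column follows; `shop_id_map` and `demo` are names I introduce for illustration.

```python
import pandas as pd

# Mapping taken from the replacement list above
shop_id_map = {11: 10, 0: 57, 1: 58, 40: 39}

# Hypothetical shop ids, invented for illustration only
demo = pd.DataFrame({"shop_id": [0, 1, 11, 40, 25]})
demo["shop_id"] = demo["shop_id"].replace(shop_id_map)

print(demo["shop_id"].tolist())
```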

<a id="subsection-2-3-5"></a>

### Digital sales by shop ###

In [None]:
fig = plt.figure(figsize =(20, 5))
sns.barplot(x="shop_id", y="item_cnt_day", hue="is_digital", data=train_s1.groupby(["shop_id", "is_digital"]).agg({"item_cnt_day": "sum"}).reset_index())

Shop 55 is digital and it should be selling digital products only. I will set is_digital feature to 1 for all products sold in shop 55. 

In [None]:
digital_transform = train_s1[(train_s1["shop_id"] == 55) & (train_s1["is_digital"] == 0)]["item_id"].value_counts().index
train_s1.loc[train_s1["item_id"].isin(digital_transform), "is_digital"] = 1
items.loc[items["item_id"].isin(digital_transform), "is_digital"] = 1

Now I will write a function to make aggregated comparisons.

In [None]:
def agg_bar_charts(df, group, aggFeature, aggType, rotate=0):
    agg = df.groupby([group]).agg({aggFeature: aggType}).reset_index()
    agg.sort_values(by=[aggFeature], ascending=False, inplace=True)
    agg["cumsum_perc"] = agg[aggFeature].cumsum() / agg[aggFeature].sum()
    
    fig = plt.figure(figsize =(20, 5))
    if rotate > 0:
        plt.xticks(rotation=rotate)
    sns.barplot(x=group, y=aggFeature, data=agg, color="c", order=agg.sort_values(aggFeature, ascending=False)[group]).set_title("Sum of sales")
    
    sns.despine(fig)
    
    fig = plt.figure(figsize =(20, 5))
    
    if rotate > 0:
        plt.xticks(rotation=rotate)
        
    ax = sns.barplot(x=group, y="cumsum_perc", data=agg, color="c", order=agg.sort_values("cumsum_perc")[group])
    ax.set_title("Cumulative sum of sales [percentage]")
    
    ax.axhline(0.25, color="red", ls="--")
    ax.axhline(0.5, color="red")
    ax.axhline(0.75, color="red", ls="--")
    
    sns.despine(fig)

In [None]:
agg_bar_charts(train_s1, "shop_id", "item_cnt_day", "sum")

The top 13 shops generate more than 50% of sales. Sales of the last shop seem to be zero.
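
The number of top groups needed to reach a given share can also be computed directly rather than read off the chart. This is a small sketch on made-up totals (the `sales` Series is an illustrative assumption):

```python
import pandas as pd

# Hypothetical total sales per shop, invented for illustration only
sales = pd.Series({"A": 40, "B": 30, "C": 20, "D": 10}).sort_values(ascending=False)

cum_share = sales.cumsum() / sales.sum()
# Count the groups that stay below 50%, then add the one that crosses the line
n_top = int((cum_share < 0.5).sum()) + 1

print(cum_share)
print(n_top)
```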

<a id="subsection-2-3-6"></a>
### Chronological sales by shops ###

In [None]:
pivot_shop_month = train_s1.pivot_table(index="shop_id", columns="date_block_num", values="item_cnt_day", aggfunc="sum")

cmap = sns.diverging_palette(230, 20, as_cmap=True)
fig, ax = plt.subplots(1,1,figsize=(20,20))

sns.heatmap(pivot_shop_month, cmap=cmap, vmax=1000, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=False, fmt=".1f").set_title("Sales by shops through time")

We can see several 'types' of shops:
- shops which were open throughout the last 34 months
- shops which started the business during the last 34 months
- shops which stopped the business during the last 34 months
- shops which started and stopped the business during the last 34 months (shop id 17, 33 and 40)
- shops which are open only in October (shop id 9 and 20)

Shops which stopped the business should be excluded from the model, since their final months may contain stock clearance sales with large discounts. The periodically open shops should also be excluded, since they will sell 0 items in November.

In [None]:
shops_to_include = list(pivot_shop_month[pivot_shop_month[33] > 0].index)
shops_to_include.remove(9)
shops_to_include.remove(20)

In [None]:
train_s2 = train_s1[train_s1["shop_id"].isin(shops_to_include)].copy()
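
The filter on the last month works because shops without sales in a month show up as NaN in the pivot, and `NaN > 0` is False. A minimal sketch on made-up data (the `demo` frame is an illustrative assumption):

```python
import pandas as pd

# Hypothetical sales rows, invented for illustration only
demo = pd.DataFrame({
    "shop_id": [1, 1, 2, 3, 3],
    "date_block_num": [32, 33, 32, 33, 33],
    "item_cnt_day": [5, 3, 4, 0, 2],
})

pivot = demo.pivot_table(
    index="shop_id", columns="date_block_num", values="item_cnt_day", aggfunc="sum"
)
# Shop 2 has no rows in month 33, so its cell is NaN and it is dropped
active = list(pivot[pivot[33] > 0].index)

print(active)
```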

<a id="subsection-2-3-7"></a>
### Sales by city ###

In [None]:
agg_bar_charts(train_s2, "city", "item_cnt_day", "sum", 45)

Regarding sales by city, Moscow is the clear winner. The "other" category, which includes digital shops, is also relatively big.

<a id="subsection-2-3-8"></a>
### Sales by master categories ###

In [None]:
agg_bar_charts(train_s2, "master_category", "item_cnt_day", "sum", 45)

Top 4 master categories represent almost 80% of total sales, while last one is pretty much zero.

<a id="subsection-2-3-9"></a>
### Sales by subcategories ###

In [None]:
agg_bar_charts(train_s2, "subcategory", "item_cnt_day", "sum", 90)

Again, the top 10 subcategories represent almost 80% of total sales.

<a id="subsection-2-3-10"></a>
### Individual item sales analysis ###

First I will replace duplicate item ids according to the dictionary generated above.

In [None]:
train_s2["item_id"] = train_s2["item_id"].replace(replace_item_dict)

DEPRECATED
Next I will find out the first and last month of sale for each item. I will also add some month metrics (min, max and median), to find the most popular sale month.

In [None]:
# import scipy

# items_boundaries = train_s2.groupby(["item_id"]).agg({"date_block_num": ["min", "max"], "month": ["min", lambda x: scipy.stats.mode(x)[0], "max"]}).reset_index()
# items_boundaries

DEPRECATED
I will remove items without a single sale in the last 6 months. As an exception to this rule, I will keep items whose most frequent sale month (mode) is November.

In [None]:
# items_to_keep = set(items_boundaries[(items_boundaries[("date_block_num", "max")] >= 27) | (items_boundaries[("month", "<lambda_0>")] == 11)]["item_id"].value_counts().index)
# train_s3 = train_s2[train_s2["item_id"].isin(items_to_keep)]
train_s3 = train_s2.copy()

Next I will check the best-selling items.

In [None]:
best_sellers = train_s3.groupby(["item_id", "item_name", "master_category", "subcategory"]).agg({"item_cnt_day":"sum", "item_price":"mean"}).reset_index().sort_values(by=["item_cnt_day"], ascending=False)[:20]
best_sellers

As expected, the best sellers are mostly games. There is also one potential outlier (Sony PlayStation 4) with a relatively high price, which should be checked.

In [None]:
train_s3[train_s3["item_id"] == 6675].sort_values(by=["item_cnt_day"], ascending=False)[:10]

Looks like everybody really liked Sony PlayStation 4 on 29.11.2013. It must have been a great deal. Either way there is no error in the data.

<a id="subsection-2-4"></a>
## 2.4 Prices and net sales ##

<a id="subsection-2-4-1"></a>
### Price distribution ###

In [None]:
train_s3["item_price"].value_counts(bins=10).sort_index()

In [None]:
# Cumulative share of the 10 most frequent prices
train_s3["item_price"].value_counts(normalize=True).cumsum().head(10)

In [None]:
fig, ax = plt.subplots(2, 1, figsize =(20, 10))

sns.barplot(x = "item_category_id", y = "item_cnt_day", data = train_s3, ax=ax[0]).set_title("Daily sales by item category")
sns.boxplot(x = "item_category_id", y = "item_price", data = train_s3, ax=ax[1]).set_title("Daily prices by item category")

We can see that the 10 most frequent prices range from 149 to 999. Prices ending in 99 seem very popular. Furthermore, price outliers are much more common in certain categories.

<a id="subsection-2-4-2"></a>
### Price distribution - shops ###

In [None]:
from sklearn.preprocessing import MinMaxScaler

def minMaxScale(df):
    scaler = MinMaxScaler()
    
    scaled_df = df.copy()
    
    for item in df.columns:
        scaled_df[[item]] = scaler.fit_transform(scaled_df[[item]])
        
    return scaled_df

In [None]:
pivot_shop_price = minMaxScale(train_s3.pivot_table(index="item_category_id", columns="shop_id", values="item_price", aggfunc="median").fillna(0).T)


cmap = sns.diverging_palette(230, 20, as_cmap=True)
fig, ax = plt.subplots(1,1,figsize=(20,20))
sns.heatmap(pivot_shop_price, cmap=cmap, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=False, fmt=".1f"
           ).set_title("Median shop prices by category")

In [None]:
# train_s3[train_s3["item_id"] > 0].pivot_table(
#     index=["item_id"], columns=["shop_id"], values=["item_price"], aggfunc={"item_price": [np.median, np.std]}
# ).kurt(axis=1)[:10]

<a id="subsection-2-4-2"></a>
### Chronological price movement ###

In [None]:
pivot_category_price = minMaxScale(train_s3.pivot_table(index="item_category_id", columns="date_block_num", values="item_price", aggfunc="median").fillna(0).T)

cmap = sns.diverging_palette(230, 20, as_cmap=True)
fig, ax = plt.subplots(1,1,figsize=(20,20))
sns.heatmap(pivot_category_price.T, cmap=cmap, vmax=1, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=False, fmt=".1f"
           ).set_title("Prices by category through time")

plt.axvline(9, c="black")
plt.axvline(11, c="black")
plt.axvline(21, c="black")
plt.axvline(23, c="black")

Price movement does not follow a clearly defined pattern. There is also no clear pattern in price increases or decreases between October and November (between the black lines).

In [None]:
# train_s4 = train_s3[train_s3["item_price"] < 40000]
train_s4 = train_s3.copy()

<a id="subsection-2-4-2"></a>
### Chronological net sales ###

In [None]:
train_s4["net_sales"] = train_s4["item_price"] * train_s4["item_cnt_day"]

In [None]:
month_agg = train_s4.groupby(["date_block_num", "month"]).agg({"net_sales":"sum"}).reset_index()

fig = plt.figure(figsize =(20, 7))
sns.barplot(x="date_block_num", y="net_sales", hue="month", dodge=False, data=month_agg).set_title("Monthly net sales over time")

The net sales chart indicates a growing trend, which is not aligned with the declining trend in quantities. It looks like the company is doing something to maintain profitability.
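
One way to reconcile the two trends is to look at the average price per unit sold, i.e. net sales divided by quantity. The sketch below uses made-up numbers (the `demo` frame and the `avg_unit_price` column are illustrative assumptions) to show how net sales can grow while quantities fall.

```python
import pandas as pd

# Hypothetical sales rows, invented for illustration only
demo = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "item_price": [100.0, 200.0, 400.0, 500.0],
    "item_cnt_day": [3, 2, 1, 1],
})
demo["net_sales"] = demo["item_price"] * demo["item_cnt_day"]

monthly = demo.groupby("date_block_num").agg({"item_cnt_day": "sum", "net_sales": "sum"})
# Quantity drops from 5 to 2 while net sales rise from 700 to 900
monthly["avg_unit_price"] = monthly["net_sales"] / monthly["item_cnt_day"]

print(monthly)
```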

<a id="subsection-2-4-3"></a>
### Monthly net sales ###

In [None]:
fig = plt.figure(figsize =(20, 7))
sns.barplot(x="month", y="net_sales", color="c", data=month_agg.groupby(["month"]).agg({"net_sales":"mean"}).reset_index()).set_title("Mean monthly net sales")

Mean monthly net sales show an even bigger seasonal effect, meaning there must be something going on at the end of the year.
There are two possibilities:
1. Price increases
2. Increased sales of premium products

However, I will not explore it further for now.

<a id="subsection-2-4-4"></a>
### Price outliers ###

Many subcategories are still right-skewed, but this is not totally unexpected: it all comes down to the specific products inside each subcategory. Therefore I will mark outliers based on the price level of each item, using the z-score method with a threshold of 3.

In [None]:
fig = plt.figure(figsize =(20, 7))
plt.xticks(rotation=90)
sns.boxplot(x="subcategory", y="item_price", data = train_s4)

In [None]:
train_s4 = markOutliers(train_s4, "item_id", "item_price", 3, method="z-score")
train_s4["outlier_item_price"].value_counts(normalize=True)

Almost 1.5% of item sales were marked as price outliers.
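
To illustrate how per-item price outliers can be flagged, here is a minimal sketch of a grouped z-score rule. The helper `mark_zscore_outliers` and the toy data are hypothetical; this is not the `markOutliers` function used in this notebook, only one plausible way such a rule could work.

```python
import pandas as pd

def mark_zscore_outliers(df, group_col, value_col, threshold=3.0):
    """Flag rows lying more than `threshold` standard deviations from their group mean."""
    grouped = df.groupby(group_col)[value_col]
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample std (ddof=1); NaN for single-row groups
    z = (df[value_col] - mean) / std
    out = df.copy()
    # NaN z-scores (undefined or zero std) compare as False, so they are not flagged
    out["outlier_" + value_col] = z.abs() > threshold
    return out

# Hypothetical toy data: item 1 has one extreme price, item 2 has none
toy = pd.DataFrame({
    "item_id": [1] * 12 + [2] * 3,
    "item_price": [100.0] * 11 + [1000.0] + [50.0, 55.0, 60.0],
})
flagged = mark_zscore_outliers(toy, "item_id", "item_price", threshold=3.0)
print(flagged["outlier_item_price"].value_counts(normalize=True))
```

Computing the statistics per item rather than globally keeps expensive products from being flagged just because they cost more than the rest of the catalogue.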

In [None]:
month_agg = train_s4.groupby(["shop_id", "outlier_item_price"]).agg({"net_sales":"sum"}).reset_index()

fig = plt.figure(figsize =(20, 7))
sns.barplot(x="shop_id", y="net_sales", hue="outlier_item_price", dodge=False, data=month_agg).set_title("Shops net sales by outliers")

In most shops, outliers account for a trivial share of net sales. However, shop_id 10 seems to be an exception.

<a id="subsection-2-2"></a>
## Train / test only items ##

In [None]:
month_agg = train_s4.groupby(["date_block_num", "train_only_item_shop"]).agg({"item_cnt_day":"sum"}).reset_index()

fig = plt.figure(figsize =(20, 7))
sns.barplot(x="date_block_num", y="item_cnt_day", hue="train_only_item_shop", data=month_agg).set_title("Chronological sales by train_only item-shop combos")

In [None]:
test_s["test_only_item_shop"].value_counts(normalize=True)

In [None]:
test_s["test_only_item"].value_counts(normalize=True)

Despite the cleaning, there is still a relatively high number of train-only items. I will have to consider this when aggregating monthly values. Another important insight comes from the test-only items: almost 50% of item/shop combos and more than 7% of items appear in the test set only, so new features based on this combo should be used with caution.

<a id="section-3"></a>
# 3. AGGREGATED DATA ANALYSIS #

The competition expects monthly sales predictions for November 2015, but the train data consists of daily sales. Therefore the data should be aggregated by item/shop/month, but before that I will create some useful features from the non-aggregated data.

<a id="subsection-3-1"></a>
## 3.1 Non-aggregated features

### Average price

Since there are many outliers I will generate price features based on median values:
- median price for item/shop combo
- median price for item

In [None]:
# avg_price_item_shop_month = train_s4.groupby(['item_id', 'shop_id', 'date_block_num']).agg({"item_price": "median"})
avg_price_item_shop = train_s4.groupby(['item_id', 'shop_id']).agg({"item_price": "median"})
avg_price_item = train_s4.groupby('item_id').agg({"item_price": "median"})

### First / last sale, best month, weekend majority

New features:
- item/shop first/last sale, best month, weekend majority
- item first/last sale, best month, weekend majority
- shop first/last sale, best month, weekend majority
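
As a small worked example of the "best month" idea, the sketch below finds the `date_block_num` with the highest quantity per item and maps it to a calendar month with `% 12 + 1`. It assumes block 0 is January 2013, which matches the year/month derivation used later in this notebook. The toy data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_id":        [1, 1, 1, 2, 2],
    "date_block_num": [0, 11, 23, 5, 17],
    "item_cnt_day":   [2, 5, 4, 1, 3],
})

# One row per item, one column per date block, summed quantities as values
monthly = toy.pivot_table(index="item_id", columns="date_block_num",
                          values="item_cnt_day", aggfunc="sum").fillna(0)
best_block = monthly.idxmax(axis=1)   # date_block_num with the highest quantity
best_month = best_block % 12 + 1      # calendar month, 1 = January
print(best_month.to_dict())
```

Note that this picks the single best date block: sales of the same calendar month in different years are not pooled, so item 1 gets December because of block 11 alone.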

In [None]:
# First and last sale
item_shop_sales_detail = train_s4.groupby(["item_id", "shop_id"]).agg({"date_block_num": ["min", "max"]})
item_sales_detail = train_s4.groupby(["item_id"]).agg({"date_block_num": ["min", "max"]})
shop_sales_detail = train_s4.groupby(["shop_id"]).agg({"date_block_num": ["min", "max"]})

item_shop_sales_detail.columns = item_shop_sales_detail.columns.droplevel()
item_sales_detail.columns = item_sales_detail.columns.droplevel()
shop_sales_detail.columns = shop_sales_detail.columns.droplevel()

# Best month
item_shop_sales_detail["best_month"] = train_s4.pivot_table(index=["item_id", "shop_id"], columns=["date_block_num"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) % 12 + 1
item_sales_detail["best_month"] = train_s4.pivot_table(index=["item_id"], columns=["date_block_num"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) % 12 + 1
shop_sales_detail["best_month"] = train_s4.pivot_table(index=["shop_id"], columns=["date_block_num"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) % 12 + 1

# Weekend majority marks combos whose best-selling weekday is Friday, Saturday or Sunday (best_weekday > 4, assuming weekday 0 = Monday)
item_shop_sales_detail["best_weekday"] = train_s4.pivot_table(index=["item_id", "shop_id"], columns=["weekday"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) + 1
item_sales_detail["best_weekday"] = train_s4.pivot_table(index=["item_id"], columns=["weekday"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) + 1
shop_sales_detail["best_weekday"] = train_s4.pivot_table(index=["shop_id"], columns=["weekday"], values="item_cnt_day", aggfunc=np.sum).fillna(0).idxmax(axis=1) + 1

item_shop_sales_detail["weekend_majority"] = item_shop_sales_detail["best_weekday"] > 4
item_sales_detail["weekend_majority"] = item_sales_detail["best_weekday"] > 4
shop_sales_detail["weekend_majority"] = shop_sales_detail["best_weekday"] > 4

item_shop_sales_detail.drop("best_weekday", axis=1, inplace=True)
item_sales_detail.drop("best_weekday", axis=1, inplace=True)
shop_sales_detail.drop("best_weekday", axis=1, inplace=True)

<a id="subsection-3-2"></a>
## 3.2 Aggregating the data

I will aggregate the data using groupby on date_block_num, shop & item. I will also rename the value column from item_cnt_day to item_cnt_month.
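
The aggregation can be checked on a few toy rows. This is a sketch on hypothetical data; it uses pandas named aggregation, which aggregates and renames in one step, instead of a separate `rename` call.

```python
import pandas as pd

daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [10, 10, 10, 10],
    "item_cnt_day":   [1, 2, 1, 3],
    "item_price":     [100.0, 120.0, 110.0, 90.0],
})
daily["net_sales"] = daily["item_price"] * daily["item_cnt_day"]

# One row per month/shop/item: summed quantity, median price, summed net sales
monthly = daily.groupby(["date_block_num", "shop_id", "item_id"]).agg(
    item_cnt_month=("item_cnt_day", "sum"),
    current_price=("item_price", "median"),
    net_sales=("net_sales", "sum"),
).reset_index()
print(monthly)
```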

In [None]:
train_s5 = train_s4.groupby(["date_block_num", "shop_id", "item_id"]).agg({"item_cnt_day": np.sum, "item_price": np.median, "net_sales": np.sum})
train_s5.rename(columns={"item_cnt_day": "item_cnt_month"}, inplace=True)
train_s5.rename(columns={"item_price": "current_price"}, inplace=True)

Creating the dataframe of sales by month:
- Create a dataframe with every item/shop/date_block_num combination, built from the shops and items of the test set. This trick drops rows for items and shops that appear in the train set only.
- Merge the train data
- Include date_block_num 34, which we are predicting
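
The steps above can be sketched on a tiny example. The shop and item ids below are hypothetical, and `n_blocks` is kept small for readability (the notebook uses 35 blocks: 0-33 for training plus block 34 to predict).

```python
import pandas as pd

shop_ids = [2, 3]
item_ids = [10, 11]
n_blocks = 3

# Cartesian product of months, shops and items
grid = pd.MultiIndex.from_product(
    [range(n_blocks), shop_ids, item_ids],
    names=["date_block_num", "shop_id", "item_id"],
).to_frame(index=False)

# Observed monthly sales exist for only a few combinations
sales = pd.DataFrame({
    "date_block_num": [0, 1],
    "shop_id": [2, 3],
    "item_id": [10, 11],
    "item_cnt_month": [4.0, 1.0],
})

# Left merge keeps every grid row; combinations without sales become 0
full = grid.merge(sales, on=["date_block_num", "shop_id", "item_id"], how="left")
full["item_cnt_month"] = full["item_cnt_month"].fillna(0)
print(len(full), full["item_cnt_month"].sum())
```

Filling the grid with explicit zeros matters because a month without a sale is information the model should see, not a missing row.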

In [None]:
shop_ids = test_s['shop_id'].unique()
item_ids = test_s['item_id'].unique()

# Cartesian product of all months (0-34) with the test shops and items
empty_df = pd.MultiIndex.from_product(
    [range(35), shop_ids, item_ids],
    names=['date_block_num', 'shop_id', 'item_id'],
).to_frame(index=False)

In [None]:
train_s6 = pd.merge(empty_df, train_s5, on=['date_block_num','shop_id','item_id'], how='left')
train_s6[["item_cnt_month", "net_sales"]] = train_s6[["item_cnt_month", "net_sales"]].fillna(0)

Append all available data.

In [None]:
train_s6 = pd.merge(train_s6, items, on="item_id", how="left")
train_s6 = pd.merge(train_s6, shops, on="shop_id", how="left")
train_s6["year"] = train_s6["date_block_num"] // 12 + 2013
train_s6["month"] = train_s6["date_block_num"] % 12 + 1

<a id="subsection-3-3"></a>
## 3.3 Impute missing prices ##

In [None]:
# Item-shop-month price
train_s6["current_price"].isna().sum()

In [None]:
# Item-shop price
train_s7 = train_s6.set_index(['item_id', 'shop_id'])
train_s7.loc[train_s7['current_price'].isna(), 'current_price'] = avg_price_item_shop["item_price"]
train_s7["current_price"].isna().sum()

In [None]:
# Item price
train_s7 = train_s7.reset_index().set_index('item_id')
train_s7.loc[train_s7['current_price'].isna(), 'current_price'] = avg_price_item["item_price"]
train_s7["current_price"].isna().sum()

The remaining missing values are items which are in the test set only. That is why I will fill 0 for the price.
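
The fallback chain (item/shop median, then item median, then 0) can be sketched with explicit lookups. The lookup tables and toy rows here are hypothetical stand-ins for the median tables built earlier; the sketch uses `reindex` and `map` so that it does not depend on the index of the main dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item_id":       [1, 1, 2, 3],
    "shop_id":       [7, 8, 7, 7],
    "current_price": [np.nan, np.nan, 30.0, np.nan],
})
# Hypothetical lookup tables: median price per item/shop and per item
price_item_shop = pd.Series(
    [10.0],
    index=pd.MultiIndex.from_tuples([(1, 7)], names=["item_id", "shop_id"]),
)
price_item = pd.Series({1: 12.0, 2: 30.0})

# Level 1: item/shop median, looked up by the (item_id, shop_id) pair of each row
keys = pd.MultiIndex.from_frame(df[["item_id", "shop_id"]])
level1 = pd.Series(price_item_shop.reindex(keys).to_numpy(), index=df.index)
df["current_price"] = df["current_price"].fillna(level1)

# Level 2: item median
df["current_price"] = df["current_price"].fillna(df["item_id"].map(price_item))

# Level 3: items never seen in the train data get 0
df["current_price"] = df["current_price"].fillna(0)
print(df["current_price"].tolist())
```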

In [None]:
train_s7 = train_s7.reset_index()
train_s7["current_price"] = train_s7["current_price"].fillna(0)

Let's also add average item price in a separate column for potential features such as discounts.

In [None]:
# Map by item_id; a direct assignment would align avg_price_item on the row index instead
train_s7["avg_price"] = train_s7["item_id"].map(avg_price_item["item_price"]).fillna(0)

Now let's calculate a new feature: the relative difference between the shop's item price and the item's average (median) price across all shops.
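
As a small worked example of this feature, the sketch below computes (current price - average price) / average price on hypothetical toy prices. It masks a zero average before dividing, which is an assumption of this sketch and gives the same end result as replacing inf and NaN with 0 afterwards.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "current_price": [90.0, 100.0, 120.0, 0.0],
    "avg_price":     [100.0, 100.0, 100.0, 0.0],
})
# A zero average price is turned into NaN first, so the division never yields inf
safe_avg = prices["avg_price"].replace(0, np.nan)
prices["current_avg_price_delta"] = (
    (prices["current_price"] - prices["avg_price"]) / safe_avg
).fillna(0)
print(prices["current_avg_price_delta"].tolist())
```

A negative value means the shop currently sells the item below its typical price, which is what makes the feature usable as a discount signal.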

In [None]:
train_s7["current_avg_price_delta"] = (train_s7['current_price'] - train_s7["avg_price"]) / train_s7["avg_price"]
train_s7.fillna(value=0, inplace=True)
train_s7.replace([np.inf, -np.inf], 0, inplace=True)

<a id="subsection-3-4"></a>
## 3.4 First/last sales, best month, weekend majority ##

In [None]:
train_s7 = train_s7.merge(item_shop_sales_detail.reset_index(), on=["item_id", "shop_id"], how="left")
train_s7 = train_s7.merge(item_sales_detail.reset_index(), on=["item_id"], how="left", suffixes=("_item_shop", "_item"))
train_s7 = train_s7.merge(shop_sales_detail.reset_index(), on=["shop_id"], how="left")
train_s7.rename(columns={"min": "min_shop", "max": "max_shop", "best_month": "best_month_shop", "weekend_majority": "weekend_majority_shop"}, inplace=True)
train_s7.fillna(0, inplace=True)

<a id="subsection-3-5"></a>
## 3.5 Mean encoding features ##

Before creating average quantity features, I will clip the item_cnt_month column to the 0-20 range. This is the range we are predicting.

In [None]:
train_s7["item_cnt_month"] = train_s7["item_cnt_month"].clip(0., 20.)

In [None]:
avg_q_month = train_s7.groupby(['date_block_num']).agg({"item_cnt_month": "mean"})
avg_q_month_item = train_s7.groupby(['date_block_num', 'item_id']).agg({"item_cnt_month": "mean"})
avg_q_month_shop = train_s7.groupby(['date_block_num', 'shop_id']).agg({"item_cnt_month": "mean"})
avg_q_month_category = train_s7.groupby(['date_block_num', 'item_category_id']).agg({"item_cnt_month": "mean"})
avg_q_month_city = train_s7.groupby(['date_block_num', 'city']).agg({"item_cnt_month": "mean"})
avg_q_month_master_category = train_s7.groupby(['date_block_num', 'master_category']).agg({"item_cnt_month": "mean"})

In [None]:
train_s7["avg_month_sales"] = train_s7.merge(avg_q_month.reset_index(), on=["date_block_num"], how="left").iloc[:,-1:]
train_s7["avg_month_item_sales"] = train_s7.merge(avg_q_month_item.reset_index(), on=['date_block_num', 'item_id'], how="left").iloc[:,-1:]
train_s7["avg_month_shop_sales"] = train_s7.merge(avg_q_month_shop.reset_index(), on=['date_block_num', 'shop_id'], how="left").iloc[:,-1:]
train_s7["avg_month_category_sales"] = train_s7.merge(avg_q_month_category.reset_index(), on=['date_block_num', 'item_category_id'], how="left").iloc[:,-1:]
train_s7["avg_month_city_sales"] = train_s7.merge(avg_q_month_city.reset_index(), on=['date_block_num', 'city'], how="left").iloc[:,-1:]
train_s7["avg_month_master_category_sales"] = train_s7.merge(avg_q_month_master_category.reset_index(), on=['date_block_num', 'master_category'], how="left").iloc[:,-1:]
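
For reference, the same group means can be attached in one step with `groupby(...).transform("mean")`, which returns a result aligned to the original rows. This is a sketch of an alternative on hypothetical toy data, not the approach used in the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id":        [1, 1, 2, 1, 2],
    "item_cnt_month": [0.0, 4.0, 2.0, 25.0, 1.0],
})
# Clip to the range of the target first
df["item_cnt_month"] = df["item_cnt_month"].clip(0, 20)

# transform("mean") broadcasts each group mean back to every row of the group
df["avg_month_sales"] = df.groupby("date_block_num")["item_cnt_month"].transform("mean")
df["avg_month_shop_sales"] = (
    df.groupby(["date_block_num", "shop_id"])["item_cnt_month"].transform("mean")
)
print(df)
```

Because `transform` keeps the original index, there is no need to rely on the column order or row order of a merge result.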

<a id="subsection-3-6"></a>
## 3.6 Lag features ##

We still don't have comparisons of sales against previous months. We should add some, since previous sales are one of the most important features in sales analytics.

To select the most appropriate time lags, we need to run a short time-series analysis.
I will use a great function from Jagan => https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts

In [None]:
ts_series = train_s7[train_s7["date_block_num"] < 34].groupby("date_block_num").agg({"item_cnt_month": "sum"})["item_cnt_month"]

In [None]:
import statsmodels.api as sm
import statsmodels.graphics.tsaplots  # explicit submodule import for plot_acf / plot_pacf
import scipy.stats as scs

def tsplot(y, lags=None, figsize=(10, 8), style='bmh',title=''):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        #mpl.rcParams['font.family'] = 'Ubuntu Mono'
        layout = (3, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        qq_ax = plt.subplot2grid(layout, (2, 0))
        pp_ax = plt.subplot2grid(layout, (2, 1))
        
        y.plot(ax=ts_ax)
        ts_ax.set_title(title)
        statsmodels.graphics.tsaplots.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        statsmodels.graphics.tsaplots.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
        sm.qqplot(y, line='s', ax=qq_ax)
        qq_ax.set_title('QQ Plot')        
        scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)

        plt.tight_layout()

In [None]:
tsplot(ts_series, 12)

Looking at the partial autocorrelation chart, the best lags for item_cnt_month are 1, 11 and 12.
We will add features using a great function from Denis Larionov => https://www.kaggle.com/dlarionov/feature-engineering-xgboost:

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df
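
To see what the function above produces, here is a small usage sketch on hypothetical toy data. The function body is repeated only so that the example runs on its own.

```python
import pandas as pd

def lag_feature(df, lags, col):
    # Shift date_block_num forward by `i`, so that month t receives
    # the value observed in month t - i for the same shop and item
    tmp = df[['date_block_num', 'shop_id', 'item_id', col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num', 'shop_id', 'item_id', col + '_lag_' + str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num', 'shop_id', 'item_id'], how='left')
    return df

toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3],
    "shop_id": [1, 1, 1, 1],
    "item_id": [9, 9, 9, 9],
    "item_cnt_month": [5.0, 7.0, 0.0, 2.0],
})
lagged = lag_feature(toy, [1, 2], "item_cnt_month").fillna(0)
print(lagged[["date_block_num", "item_cnt_month",
              "item_cnt_month_lag_1", "item_cnt_month_lag_2"]])
```

One thing to keep in mind: after `fillna(0)`, "no history available" (the first months) and "zero sales in that month" look identical to the model.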

In [None]:
%%time
train_s7 = lag_feature(train_s7, [1, 11, 12], 'item_cnt_month').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_sales').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_item_sales').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_shop_sales').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_category_sales').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_city_sales').fillna(0)
train_s7 = lag_feature(train_s7, [1, 11, 12], 'avg_month_master_category_sales').fillna(0)

Let's also create price lag features to catch the periods after discounts.

In [None]:
train_s7 = lag_feature(train_s7, [1], 'current_avg_price_delta').fillna(0)

In [None]:
del train_s1, train_s2, train_s3, train_s4, train_s5, train_s6

gc.collect()

<a id="subsection-3-7"></a>
## 3.7 Calendar related features ##

Let's add the number of weekend days (Friday included) for every month in our data. I will also add the number of days in each month.
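
As a quick sanity check of these two features, the sketch below counts them for November 2015 (date_block_num 34, the month being predicted) using only the standard library. The helper name `weekend_days` is hypothetical.

```python
import calendar

def weekend_days(year, month):
    """Count Fridays, Saturdays and Sundays in the given month."""
    n_days = calendar.monthrange(year, month)[1]
    # calendar.weekday returns 0 for Monday ... 6 for Sunday
    return sum(calendar.weekday(year, month, d) >= 4 for d in range(1, n_days + 1))

year, month = 2015, 11
n_weekend = weekend_days(year, month)
n_days = calendar.monthrange(year, month)[1]
print(n_weekend, n_days)  # 13 30
```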

In [None]:
import calendar

def calculateWeekendDays(month, year):
    weekend_days = 0
    for week in calendar.monthcalendar(year, month):
        for day in week[4:]:
            if day != 0:
                weekend_days +=1
                
    return weekend_days

def calculateMonthDays(month, year):
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(year, month)[1]

In [None]:
calendar_dict = {"date_block_num": [], "weekend_days": [], "month_days": []}

for year in range (2013, 2016):
    for month in range(1, 13):
        calendar_dict["date_block_num"].append((year - 2013)*12 + month - 1)
        calendar_dict["weekend_days"].append(calculateWeekendDays(month, year))
        calendar_dict["month_days"].append(calculateMonthDays(month, year))

calendar_df = pd.DataFrame(calendar_dict)

In [None]:
train_s7 = pd.merge(train_s7, calendar_df, how="left", on='date_block_num')

<a id="subsection-4"></a>
# 4. Final steps #

Finally, we need to append the IDs from the test dataframe. We will also remove some columns.

In [None]:
test_o["date_block_num"] = 34
test_o.drop_duplicates(inplace=True)

In [None]:
train_s8 = pd.merge(train_s7, test_o[["ID","item_id", "shop_id", "date_block_num"]], how="left", on=["item_id", "shop_id", "date_block_num"])

In [None]:
train_s8.drop(["current_price", "avg_price", "current_avg_price_delta"], axis=1, inplace=True)
train_s8.drop(["avg_month_sales", "avg_month_item_sales", "avg_month_shop_sales", "avg_month_category_sales", "avg_month_city_sales", "avg_month_master_category_sales"], axis = 1, inplace=True)
train_s8.drop(["item_name", "shop_name", "net_sales"], axis = 1, inplace=True)

Downgrade data types to save memory.
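
For readers who want to try the idea in isolation, here is a minimal sketch of a dtype downcast based on `pd.to_numeric(..., downcast=...)`. The helper `downcast_numeric` and the toy frame are hypothetical; this is not the `downgrade_dtypes` function used in this notebook.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy whose numeric columns use a smaller dtype where pandas allows it."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

toy = pd.DataFrame({
    "shop_id": np.array([2, 59], dtype=np.int64),
    "item_cnt_month_lag_1": np.array([0.0, 20.0], dtype=np.float64),
})
small = downcast_numeric(toy)
print(small.dtypes)
print(toy.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

On a frame with millions of rows, moving from 64-bit to 8-bit or 32-bit columns reduces memory use considerably, at the cost of float precision that is rarely needed for count features.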

In [None]:
train_s9 = downgrade_dtypes(train_s8)

In [None]:
train_s9[["weekend_majority_shop", "weekend_majority_item_shop", "weekend_majority_item",
        "date_block_num", "item_category_id", "is_digital", "month",
        "min_item_shop", "max_item_shop", "best_month_item_shop",
        "min_item", "max_item", "best_month_item",
        "min_shop", "max_shop", "best_month_shop",
        "weekend_days", "month_days"]] = train_s9[["weekend_majority_shop", "weekend_majority_item_shop", "weekend_majority_item",
        "date_block_num", "item_category_id", "is_digital", "month",
        "min_item_shop", "max_item_shop", "best_month_item_shop",
        "min_item", "max_item", "best_month_item",
        "min_shop", "max_shop", "best_month_shop",
        "weekend_days", "month_days"]].astype(np.int8)

train_s9["ID"] = train_s9["ID"].fillna(0).astype(np.int32)

In [None]:
train_s9.info()

Export cleaned dataframe

In [None]:
train_s9.to_csv("cleaned_df.csv", index=False)