<a id = "table_of_contents"></a>
# Table of contents

[Import of libraries](#imports)

[Global variables](#global_variables)

[Preprocessing before features generation](#preprocessing_before_fe)

-->[Correct the shop names and id](#correct_shop_names_id)

-->[Generate item_category_features](#generate_item_category_features)

-->[Remove the huge price and item sales outliers](#remove_outliers)

[Generate a full df with all data and records](#generate_full_df_with_all_records)

[Create a groupby df with all the sales for shop_id and item_id grouped by months](#generate_gb_df)

[Join the full_df with gb_df](#join_dfs)

[Add additional features to our full sales df](#add_new_csvs)

[FeatureGenerator class](#fe_generator_class)

[Generate additional features, such as mean and total sales per shop_id, item_id, city, ... for every month](#create_new_features)

-->[Date and shop_id features](#feature_1)

-->[Date and item_id features](#feature_2)

-->[Date and item_category features](#feature_3)

-->[Datetime features](#feature_5)

-->[Adding holiday and number of weekends data](#feature_6)

-->[City population and mean_income per city](#feature_7)

[Join full sales df with all the features generated](#join_dfs_with_features)

[Basic model train](#basic_model)

[Feature importance](#feature_importance_1)

[Predict and model evaluation](#predict_and_model_evaluation_1)

[To do](#to_do)

-->[Additional feature 1](#new_feature_1)

-->[Additional feature 2](#new_feature_2)

-->[Additional feature 3](#new_feature_3)

-->[Join df's with new features](#join_dfs_with_new_features)

-->[Model training](#new_model)

-->[Feature importance of new model](#feature_importance_2)

-->[Predict and model evaluation of new model](#predict_and_model_evaluation_2)

<a id = "imports"></a>
# Import of libraries
[Go back to the table of contents](#table_of_contents)

In [None]:
# import the basic libraries we will use in this kernel
import os
import numpy as np
import pandas as pd
import pickle

import time
import datetime
from datetime import datetime
import calendar

from sklearn import metrics
from math import sqrt
import gc

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

from xgboost import XGBRegressor
from xgboost import plot_importance

from sklearn.preprocessing import LabelEncoder

import itertools
import warnings
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose

warnings.filterwarnings("ignore") # specify to ignore warning messages

<a id = "global_variables"></a>
# Global variables
[Go back to the table of contents](#table_of_contents)

In [None]:
# Resample the sales by this parameter
PERIOD = "M"

SHOPS = [8, 14, 37, 41, 59]

# this helps us switch quickly between Kaggle and a local machine
LOCAL = False

if LOCAL:
    PATH = os.getcwd()
    FULL_DF_PATH = PATH
    GB_DF_PATH = PATH
    OUTPUT_PATH = PATH
else:
    PATH = '../input/competitive-data-science-predict-future-sales/'
    FULL_DF_PATH = "../input/full-df-only-test-all-features/"
    GB_DF_PATH = "../input/group-by-df/"

<a id = "preprocessing_before_fe"></a>
# Preprocessing before features generation
[Go back to the table of contents](#table_of_contents)

The idea of this section is simple. Our EDA showed that many month, shop and item combinations have no record at all.
The model benefits if the training data contains those missing combinations explicitly, with sales set to zero. That way it can learn from more data.

To do so, we take the Cartesian product of dates x shops x items, which generates every possible combination of month, shop and item.

In this kernel we only generate these combinations for the items that are present in TEST.

This reduces the amount of computation required. If you have enough memory, you can do this for all possible combinations.
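
The idea can be sketched on a toy example using only the standard library. The months, shop ids, item ids and sales figures below are made up for illustration and are not taken from the competition data.

```python
from itertools import product

# Toy inputs (hypothetical values)
months = ["2015-08", "2015-09", "2015-10"]
shops = [8, 14]
items = [30, 31]

# Sales that were actually recorded: (month, shop_id, item_id) -> units sold
observed = {
    ("2015-08", 8, 30): 3,
    ("2015-10", 14, 31): 1,
}

# Every month x shop x item combination; unseen combinations get 0 sales
grid = [
    {"date": m, "shop_id": s, "item_id": i, "sales": observed.get((m, s, i), 0)}
    for m, s, i in product(months, shops, items)
]

print(len(grid))                            # 3 * 2 * 2 combinations
print(sum(row["sales"] for row in grid))    # total units is unchanged
```

Only two of the twelve rows carry a non-zero value; the other ten are the explicit zeros the model would otherwise never see.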

In [None]:
# load all the df we have
shops_df = pd.read_csv(os.path.join(PATH, "shops.csv"))
items_df = pd.read_csv(os.path.join(PATH, "items.csv"))
items_category_df = pd.read_csv(os.path.join(PATH, "item_categories.csv"))
sales_df = pd.read_csv(os.path.join(PATH, "sales_train.csv"))
test_df = pd.read_csv(os.path.join(PATH, "test.csv"))

In [None]:
items_category_df

In [None]:
shops_df

<a id = "correct_shop_names_id"></a>
## Correct the shop names and id
[Go back to the table of contents](#table_of_contents)

In [None]:
# our EDA showed some inconsistent shop names; fix them and derive the city from the first word of the name
shops_df.loc[shops_df.shop_name == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shops_df['city'] = shops_df['shop_name'].str.split(' ').map(lambda x: x[0])
shops_df.loc[shops_df.city == '!Якутск', 'city'] = 'Якутск'
shops_df['city_code'] = LabelEncoder().fit_transform(shops_df['city'])
shops_df.head()

As we can see, some shops appear twice under slightly different names. Let's clean them manually by mapping each duplicate shop_id to a single id.

In [None]:
shops_df[shops_df["shop_id"].isin([0, 57])]

In [None]:
# Якутск Орджоникидзе, 56
sales_df.loc[sales_df.shop_id == 0, 'shop_id'] = 57
test_df.loc[test_df.shop_id == 0, 'shop_id'] = 57

# Якутск ТЦ "Центральный"
sales_df.loc[sales_df.shop_id == 1, 'shop_id'] = 58
test_df.loc[test_df.shop_id == 1, 'shop_id'] = 58

# Жуковский ул. Чкалова 39м²
sales_df.loc[sales_df.shop_id == 10, 'shop_id'] = 11
test_df.loc[test_df.shop_id == 10, 'shop_id'] = 11

<a id = "generate_item_category_features"></a>
## Generate item_category_features
[Go back to the table of contents](#table_of_contents)

In [None]:
items_category_df['split'] = items_category_df['item_category_name'].str.split('-')
items_category_df['type'] = items_category_df['split'].map(lambda x: x[0].strip())
items_category_df['type_code'] = LabelEncoder().fit_transform(items_category_df['type'])

# if there is no subtype, use the type as the subtype
items_category_df['subtype'] = items_category_df['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
items_category_df['subtype_code'] = LabelEncoder().fit_transform(items_category_df['subtype'])

items_category_df.head()

<a id = "remove_outliers"></a>
## Remove the huge price and item sales outliers
[Go back to the table of contents](#table_of_contents)

In [None]:
sales_df.head()

In [None]:
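# we have a negative price and some extreme outliers
# replace the negative price with the mean price of the same item, shop and month
# then drop the rows holding the maximum price and the maximum daily count (q = 100 is the maximum)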
mean = sales_df[(sales_df["shop_id"] == 32) & (sales_df["item_id"] == 2973) & (sales_df["date_block_num"] == 4) & (sales_df["item_price"] > 0)]["item_price"].mean()
sales_df.loc[sales_df.item_price < 0, 'item_price'] = mean

sales_df = sales_df[sales_df["item_price"] < np.percentile(sales_df["item_price"], q = 100)]
sales_df = sales_df[sales_df["item_cnt_day"] < np.percentile(sales_df["item_cnt_day"], q = 100)]
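
A note on the filter above: with `q = 100` the percentile is simply the maximum, so the strict `<` comparison removes only the rows that hold the single largest value. The sketch below illustrates this on made-up prices. The 75th percentile is used only because the toy array is tiny; it is not a recommendation for the real data.

```python
import numpy as np

# Toy prices with one extreme value (hypothetical data)
prices = np.array([100.0, 120.0, 90.0, 110.0, 300000.0])

# q = 100 is the maximum of the array
top = np.percentile(prices, q=100)
print(top == prices.max())

# A strict "<" against the maximum drops only the largest value
kept_q100 = prices[prices < top]

# A lower percentile trims more of the upper tail
kept_q75 = prices[prices < np.percentile(prices, q=75)]

print(len(prices), len(kept_q100), len(kept_q75))
```

If the goal is to cut a wider tail, lowering `q` (or using a fixed threshold) is the knob to turn.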

<a id = "generate_full_df_with_all_records"></a>
# Generate a full df with all data and records
[Go back to the table of contents](#table_of_contents)

In [None]:
sales_df.info()

In [None]:
type(sales_df["date"].iloc[0])

In [None]:
# convert to datetime the date column
# specify the format since otherwise it might give some problems
sales_df["date"] = pd.to_datetime(sales_df["date"], format = "%d.%m.%Y")

In [None]:
# max date in sales is 31.10.2015.
# In the Kaggle competition we are asked to predict the sales for the next month
# this means the sales of November
min_date = sales_df["date"].min()
max_date_sales = sales_df["date"].max()

In [None]:
max_date_sales

In [None]:
# create the last day of the month we want to predict (November 2015)
max_date_test = datetime(2015, 11, 30)

In [None]:
# create a date range that begins with the first sale and ends with the last sale (max_date_sales)
# Notice, however, that we will train our model on only a selection of shops and will test it on October data.
date_range = pd.date_range(min_date, max_date_sales, freq = "D")
date_range

In [None]:
len(date_range)

Our model would benefit a lot if we could train it at the highest granularity (daily sales).

However, as the next cell shows, doing this on a local machine is almost impossible, since the full grid has about 1.4 billion rows.
If we add 10 features (columns), the DataFrame will hold about 14 billion values.
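
The order of magnitude can be checked with a few lines of arithmetic. The shop count, item count and first sale date below are assumptions about the dataset; verify them with `len(shops_df)`, `len(items_df)` and `min_date` before relying on the result.

```python
from datetime import date

# Assumed dataset sizes (check with len(shops_df) and len(items_df))
N_SHOPS = 60
N_ITEMS = 22170
N_FEATURES = 10

# Assumed first sale date; the last sale date is stated in the notebook
first_day = date(2013, 1, 1)
last_day = date(2015, 10, 31)
n_days = (last_day - first_day).days + 1    # both ends included

n_rows = n_days * N_SHOPS * N_ITEMS         # one row per day x shop x item
n_values = n_rows * N_FEATURES              # one value per row and column

print(n_days, n_rows, n_values)
```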

In [None]:
shops = sorted(list(shops_df["shop_id"].unique()))

# all items (not only the ones present in test)
items = sorted(list(items_df["item_id"].unique()))

cartesian_product = pd.MultiIndex.from_product([date_range, shops, items], names = ["date", "shop_id", "item_id"])
len(cartesian_product)

In order to replicate the Kaggle competition, we will create a smaller DataFrame with only selected shops and train the model on a monthly basis.

We will use only 5 shops, since generating a lot of features consumes a lot of memory and we would not be able to train on Kaggle. If you have a more powerful machine, you can run the script with all shops.

In [None]:
date_range = pd.date_range(min_date, max_date_sales, freq = PERIOD)
print("We have a total of {} months".format(len(date_range)))
date_range

That gives about 0.87 million rows, which we CAN work with on a local machine.

We have created a monthly date_range. To join it with our sales data, we must resample the sales to a monthly frequency as well.

In [None]:
# only items present in test
items = sorted(list(test_df["item_id"].unique()))

cartesian_product = pd.MultiIndex.from_product([date_range, SHOPS, items], names = ["date", "shop_id", "item_id"])
len(cartesian_product)

<a id = "generate_gb_df"></a>
# Create a groupby df with all the sales for shop_id and item_id grouped by months
[Go back to the table of contents](#table_of_contents)

We will be working with a DataFrame resampled by Months. We must resample the sales_df.
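
A minimal sketch of the monthly aggregation on toy data follows. It is not the exact code used to build `GROUP_BY_DF.pkl`: instead of `resample`, it rolls every date forward to its month end with `pd.offsets.MonthEnd(0)` and then groups. The values are made up.

```python
import pandas as pd

# Toy daily sales (hypothetical values)
daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-09-03", "2015-09-20", "2015-10-05"]),
    "shop_id": [8, 8, 8],
    "item_id": [30, 30, 30],
    "item_cnt_day": [1, 2, 4],
    "item_price": [100.0, 200.0, 150.0],
})

# Roll each date forward to the last day of its month so that it
# lines up with a month-end date range
daily["date"] = daily["date"] + pd.offsets.MonthEnd(0)

# Sum the units and average the price per month, shop and item
monthly = (
    daily.groupby(["date", "shop_id", "item_id"], as_index=False)
         .agg(item_cnt_day=("item_cnt_day", "sum"),
              item_price=("item_price", "mean"))
)
print(monthly)
```

Unlike a per-group `resample`, this variant does not create rows for months without sales. In this notebook those rows come from the join with the Cartesian product.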

In [None]:
'''
st = time.time()

# # set index
sales_df["revenue"] = sales_df["item_cnt_day"]*sales_df["item_price"]
gb_df = sales_df.set_index("date")

# # groupby shop_id and item_id
gb_df = gb_df.groupby(["shop_id", "item_id"])

# # resample the sales to a monthly basis (PERIOD)
gb_df = gb_df.resample(PERIOD).agg({'item_cnt_day': np.sum, "item_price": np.mean, "revenue":np.sum})

# # convert to dataframe and save the full dataframe
gb_df.reset_index(inplace = True)

# # save the groupby dataframe
gb_df.to_pickle("GROUP_BY_DF.pkl")

et = time.time()

print("Total time in minutes to preprocess took {}".format((et - st)/60))
'''


In [None]:
# read the groupby dataframe
gb_df = pd.read_pickle(os.path.join(GB_DF_PATH, "GROUP_BY_DF.pkl"))
# gb_df = pd.read_pickle("GROUP_BY_DF.pkl")

In [None]:
gb_df.head()

In [None]:
gb_df.fillna(0, inplace = True)

<a id = "join_dfs"></a>
# Join the full_df with gb_df
[Go back to the table of contents](#table_of_contents)

Now that we have the sales_df resampled by months, and we have created a cartesian product (all possible combinations of months, shop_id and item_id), let's merge the df.
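
The effect of the left join can be seen on a small example. The sketch below uses made-up values and the same column names as the notebook: combinations without a matching monthly record come back as NaN and are then filled with 0.

```python
import pandas as pd

# All combinations of 2 months, 2 shops and 1 item (hypothetical values)
months = pd.to_datetime(["2015-09-30", "2015-10-31"])
grid = pd.MultiIndex.from_product(
    [months, [8, 14], [30]], names=["date", "shop_id", "item_id"]
)
full = pd.DataFrame(index=grid).reset_index()

# Monthly sales exist for only one of the four combinations
monthly = pd.DataFrame({
    "date": pd.to_datetime(["2015-09-30"]),
    "shop_id": [8],
    "item_id": [30],
    "item_cnt_day": [3.0],
})

# The left join keeps every grid row; missing sales become NaN, then 0
full = full.merge(monthly, on=["date", "shop_id", "item_id"], how="left")
full["item_cnt_day"] = full["item_cnt_day"].fillna(0)
print(full)
```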

In [None]:
full_df = pd.DataFrame(index = cartesian_product).reset_index()

full_df = pd.merge(full_df, gb_df, on = ['date','shop_id', "item_id"], how = 'left')

In [None]:
full_df.shape

In [None]:
full_df.head()

<a id = "add_new_csvs"></a>
# Add additional features to our full sales df
[Go back to the table of contents](#table_of_contents)

In [None]:
# add shops_df information
full_df = pd.merge(full_df, shops_df, on = "shop_id")
full_df.head()

In [None]:
# add items_df information
full_df = pd.merge(full_df, items_df, on = "item_id")
full_df.head()

In [None]:
# add items_category_df information
full_df = pd.merge(full_df, items_category_df, on = "item_category_id")
full_df.head()

In [None]:
full_df.fillna(0, inplace = True)

In [None]:
# Clip the target to the range [0, 20]:
# values greater than 20 become 20 and values lower than 0 become 0
full_df["item_cnt_day"] = np.clip(full_df["item_cnt_day"], 0, 20)
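
A quick illustration of what the clipping does, on made-up monthly counts:

```python
import numpy as np

# Hypothetical monthly sales, including a negative value and a large one
monthly_sales = np.array([-2, 0, 5, 20, 137])

# Values below 0 become 0, values above 20 become 20, the rest is unchanged
clipped = np.clip(monthly_sales, 0, 20)
print(clipped)
```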

<a id = "fe_generator_class"></a>
# FeatureGenerator class
[Go back to the table of contents](#table_of_contents)

In [None]:
class FeatureGenerator(object):
    
    '''
    This is a helper class that takes a df and a list of columns to group by and creates sum, 
    lag and variation (month-over-month change) features.
    
    '''
    
    def __init__(self, full_df,  gb_list):
        
        '''
        Constructor of the class.
        gb_list is a list of columns that must be in full_df.
        '''
        
        self.full_df = full_df
        self.gb_list = gb_list
        # join the gb_list so that we can dynamically create new column names
        # ["date", "shop_id"] --> date_shop_id
        self.objective_column_name = "_".join(gb_list)
            
    def generate_gb_df(self):
        
        '''
        This function takes the full_df and creates a groupby df based on the gb_list.
        It creates one column: the sum of item_cnt_day for every date and gb_list combination.
        More aggregations (for example a mean) can be added inside my_agg.
            
        The resulting df (gb_df_) is assigned back to the FeatureGenerator class as an attribute.
        '''

        def my_agg(full_df_, args):
            
            '''
            This function is used to perform multiple operations over a groupby df and returns a df
            without multiindex.
            '''
            
            names = {
                # you can put here as many columns as you want 
                '{}_sum'.format(args):  full_df_['item_cnt_day'].sum()
            }

            return pd.Series(names, index = [key for key in names.keys()])
        
        # args is forwarded by apply as a keyword argument to my_agg
        gb_df_ = self.full_df.groupby(self.gb_list).apply(my_agg, args = (self.objective_column_name)).reset_index()

        self.gb_df_ = gb_df_

        
    def return_gb_df(self):  
        
        '''
        This function takes the gb_df_ created in the previous step (generate_gb_df) and creates additional features.
        We create 3 lag features (values from the past)
        and 3 variation features with the percentage change between past months.
        '''
        
        def generate_shift_features(self, suffix):
            
            '''
            This function is a helper function that takes the gb_df_ and a suffix (sum or mean) and creates the
            additional features.
            '''

            # dynamically creates the features
            # date_shop_id --> date_shop_id_sum if suffix is sum
            # date_shop_id --> date_shop_id_mean if suffix is mean
            name_ = self.objective_column_name + "_" + suffix

            self.gb_df_['{}_shift_1'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(1))
            
            self.gb_df_['{}_shift_2'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(2))
            
            self.gb_df_['{}_shift_3'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(3))

            self.gb_df_['{}_var_pct_1'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(2))/x.shift(2))
            
            self.gb_df_['{}_var_pct_2'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(3))/x.shift(3))
            
            self.gb_df_['{}_var_pct_3'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(4))/x.shift(4))
            
            self.gb_df_.fillna(-1, inplace = True)

            self.gb_df_.replace([np.inf, -np.inf], -1, inplace = True)
        
        # call the generate_shift_features function with the chosen suffix (only sum here)
        generate_shift_features(self, suffix = "sum")
    
        return self.gb_df_
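
The lag and variation logic of the class can be reproduced on a small frame. This is a sketch with made-up numbers and shortened, hypothetical column names (`sales_sum`, `shift_1`, ...), not the class itself. Note that the percentage change compares the two previous months, so it never uses the value of the current month.

```python
import numpy as np
import pandas as pd

# Toy monthly totals for two shops (hypothetical values)
monthly = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-07-31", "2015-08-31", "2015-09-30", "2015-10-31"] * 2
    ),
    "shop_id": [8] * 4 + [14] * 4,
    "sales_sum": [10.0, 20.0, 30.0, 60.0, 0.0, 5.0, 5.0, 10.0],
}).sort_values(["shop_id", "date"])

# Lags are computed within each shop, so values never leak across shops
grouped = monthly.groupby("shop_id")["sales_sum"]
monthly["shift_1"] = grouped.shift(1)
monthly["shift_2"] = grouped.shift(2)

# Percentage change between the two previous months
monthly["var_pct_1"] = (
    (monthly["shift_1"] - monthly["shift_2"]) / monthly["shift_2"]
)

# Missing history and divisions by zero are both encoded as -1
new_cols = ["shift_1", "shift_2", "var_pct_1"]
monthly[new_cols] = (
    monthly[new_cols].replace([np.inf, -np.inf], np.nan).fillna(-1)
)
print(monthly)
```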
        

<a id = "create_new_features"></a>
# Generate additional features, such as mean and total sales per shop_id, item_id, city, ... for every month
[Go back to the table of contents](#table_of_contents)

<a id = "feature_1"></a>
## Date and shop_id features
[Go back to the table of contents](#table_of_contents)

In [None]:
st = time.time()

gb_list = ["date", "shop_id", "city"]

fe_generator = FeatureGenerator(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

shop_sales_features = fe_generator.return_gb_df()

# to avoid city_x and city_y

shop_sales_features.drop("city", axis = 1, inplace = True)
et = time.time()

(et - st)/60

In [None]:
shop_sales_features.shape

In [None]:
shop_sales_features[shop_sales_features["shop_id"] == 8].head(5)

<a id = "feature_2"></a>
## Date and item_id features
[Go back to the table of contents](#table_of_contents)

In [None]:
st = time.time()

gb_list = ["date", "item_id"]

fe_generator = FeatureGenerator(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

item_sales_features = fe_generator.return_gb_df()

et = time.time()

(et - st)/60

In [None]:
item_sales_features.shape

In [None]:
item_sales_features[item_sales_features["item_id"] == 30].head(3)

<a id = "feature_3"></a>
## Date and item_category features
[Go back to the table of contents](#table_of_contents)

In [None]:
st = time.time()

gb_list = ["date", "item_category_id"]

fe_generator = FeatureGenerator(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

month_item_category_features = fe_generator.return_gb_df()

et = time.time()

(et - st)/60

In [None]:
month_item_category_features.shape

In [None]:
month_item_category_features[month_item_category_features["item_category_id"] == 2].head(3)

<a id = "feature_5"></a>
## Datetime features
[Go back to the table of contents](#table_of_contents)

In [None]:
full_df["year"] = full_df["date"].dt.year
full_df["month"] = full_df["date"].dt.month
full_df["days_in_month"] = full_df["date"].dt.days_in_month

<a id = "feature_6"></a>
## Adding holiday and number of weekends data
[Go back to the table of contents](#table_of_contents)

In [None]:
holidays_next_month = {
    12:8,
    1:1,
    2:1,
    3:0,
    4:2,
    5:1,
    6:0,
    7:0,
    8:0,
    9:0,
    10:1,
    11:0
}

holidays_this_month = {
    1:8,
    2:1,
    3:1,
    4:0,
    5:2,
    6:1,
    7:0,
    8:0,
    9:0,
    10:0,
    11:1,
    12:0
}

full_df["holidays_next_month"] = full_df["month"].map(holidays_next_month)
full_df["holidays_this_month"] = full_df["month"].map(holidays_this_month)

In [None]:
def extract_number_weekends(test_month):
    '''
    Extracts the number of weekend days in a month.
    '''
    saturdays = len([1 for i in calendar.monthcalendar(test_month.year, test_month.month) if i[5] != 0])
    sundays = len([1 for i in calendar.monthcalendar(test_month.year, test_month.month) if i[6] != 0])
    
    return saturdays + sundays

full_df["total_weekend_days"] = full_df["date"].apply(extract_number_weekends)

# how much time has passed since the last sale?
date_diff_df = full_df[full_df["item_cnt_day"] > 0][["shop_id", "item_id", "date", "item_cnt_day"]].groupby(["shop_id", "item_id"])\
["date"].diff().apply(lambda timedelta_: timedelta_.days).to_frame()
date_diff_df.columns = ["date_diff_sales"]

full_df = pd.merge(full_df, date_diff_df, how = "left", left_index=True, right_index=True)
full_df.fillna(-1, inplace = True)
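
The weekend count used above relies on `calendar.monthcalendar`, which returns one list per week with 0 for days outside the month. The sketch below repeats that logic and cross-checks it against a direct day-by-day count. Both function names are hypothetical helpers written for this illustration.

```python
import calendar
from datetime import date

def count_weekend_days(year, month):
    """Count Saturdays and Sundays using the week rows of monthcalendar."""
    weeks = calendar.monthcalendar(year, month)
    saturdays = sum(1 for week in weeks if week[calendar.SATURDAY] != 0)
    sundays = sum(1 for week in weeks if week[calendar.SUNDAY] != 0)
    return saturdays + sundays

def count_weekend_days_direct(year, month):
    """Count weekend days by checking every day of the month."""
    n_days = calendar.monthrange(year, month)[1]
    return sum(
        1 for day in range(1, n_days + 1)
        if date(year, month, day).weekday() >= 5   # 5 = Saturday, 6 = Sunday
    )

# November 2015 is the month predicted in the competition
print(count_weekend_days(2015, 11), count_weekend_days_direct(2015, 11))
```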

In [None]:
full_df.head()

<a id = "feature_7"></a>
## City population and mean_income per city
[Go back to the table of contents](#table_of_contents)

In [None]:
city_population = {\
'Якутск':307911, 
'Адыгея':141970,
'Балашиха':450771, 
'Волжский':326055, 
'Вологда':313012, 
'Воронеж':1047549,
'Выездная':1228680, 
'Жуковский':107560, 
'Интернет-магазин':1228680, 
'Казань':1257391, 
'Калуга':341892,
'Коломна':140129,
'Красноярск':1083865, 
'Курск':452976, 
'Москва':12678079,
'Мытищи':205397, 
'Н.Новгород':1252236,
'Новосибирск':1602915 , 
'Омск':1178391, 
'РостовНаДону':1125299, 
'СПб':5398064, 
'Самара':1156659,
'СергиевПосад':104579, 
'Сургут':373940, 
'Томск':572740, 
'Тюмень':744554, 
'Уфа':1115560, 
'Химки':244668,
'Цифровой':1228680, 
'Чехов':70548, 
'Ярославль':608353
}

city_income = {\
'Якутск':70969, 
'Адыгея':28842,
'Балашиха':54122, 
'Волжский':31666, 
'Вологда':38201, 
'Воронеж':32504,
'Выездная':46158, 
'Жуковский':54122, 
'Интернет-магазин':46158, 
'Казань':36139, 
'Калуга':39776,
'Коломна':54122,
'Красноярск':48831, 
'Курск':31391, 
'Москва':91368,
'Мытищи':54122, 
'Н.Новгород':31210,
'Новосибирск':37014 , 
'Омск':34294, 
'РостовНаДону':32067, 
'СПб':61536, 
'Самара':35218,
'СергиевПосад':54122, 
'Сургут':73780, 
'Томск':43235, 
'Тюмень':72227, 
'Уфа':35257, 
'Химки':54122,
'Цифровой':46158, 
'Чехов':54122, 
'Ярославль':34675
}

full_df["city_population"] = full_df["city"].map(city_population)
full_df["city_income"] = full_df["city"].map(city_income)
full_df["price_over_income"] = full_df["item_price"]/full_df["city_income"]

<a id = "join_dfs_with_features"></a>
# Join full sales df with all the features generated
[Go back to the table of contents](#table_of_contents)

In [None]:
print("Shape before merge is {}".format(full_df.shape))

full_df = pd.merge(full_df, shop_sales_features, on = ["date", "shop_id"], how = "left")
full_df = pd.merge(full_df, item_sales_features, on = ["date", "item_id"], how = "left")
full_df = pd.merge(full_df, month_item_category_features, on = ["date", "item_category_id"], how = "left")
full_df.rename(columns = {"item_cnt_day":"sales"}, inplace = True)

print("Shape after merge is {}".format(full_df.shape))

In [None]:
# save the file

st = time.time()

full_df.to_pickle("FULL_DF_ONLY_TEST_ALL_FEATURES.pkl")

et = time.time()
(et - st)/60

<a id = "basic_model"></a>
# Basic model train
[Go back to the table of contents](#table_of_contents)

In [None]:
# load the preprocessed data
full_df = pd.read_pickle("FULL_DF_ONLY_TEST_ALL_FEATURES.pkl")

# select only a few shops
full_df = full_df[full_df["shop_id"].isin(SHOPS)]

# drop the rows of the first three months, where the lag features are -1 (no earlier months to shift from)
full_df = full_df[full_df["date"] > np.datetime64("2013-03-31")]

cols_to_drop = [

'revenue',
'shop_name',
"city",
'item_name',
'item_category_name',
'split',
'type',
'subtype',

'date_item_id_sum',
"date_shop_id_city_sum",
"date_item_category_id_sum",
    
]

full_df.drop(cols_to_drop, inplace = True, axis = 1)

In [None]:
# ------------------------------------------------------
# separate the dates for train, validation and test

train_index = sorted(list(full_df["date"].unique()))[:-2]

valida_index = [sorted(list(full_df["date"].unique()))[-2]]

test_index = [sorted(list(full_df["date"].unique()))[-1]]

# ------------------------------------------------------
# split the data into train, validation and test dataset
# we "simulate" the test dataset to be the Kaggle test dataset

X_train = full_df[full_df["date"].isin(train_index)].drop(['sales', "date"], axis=1)
Y_train = full_df[full_df["date"].isin(train_index)]['sales']

X_valida = full_df[full_df["date"].isin(valida_index)].drop(['sales', "date"], axis=1)
Y_valida = full_df[full_df["date"].isin(valida_index)]['sales']

X_test = full_df[full_df["date"].isin(test_index)].drop(['sales', "date"], axis = 1)
Y_test = full_df[full_df["date"].isin(test_index)]['sales']
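
The split above is purely time based: the last month is the test set, the month before it is the validation set, and everything earlier is used for training. A compact sketch on toy data (made-up values):

```python
import pandas as pd

# Toy monthly data for two shops (hypothetical values)
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-07-31", "2015-08-31", "2015-09-30", "2015-10-31"] * 2
    ),
    "shop_id": [8] * 4 + [14] * 4,
    "sales": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Sort the distinct months once and slice from the end
months = sorted(df["date"].unique())
train_months, valid_month, test_month = months[:-2], months[-2], months[-1]

train = df[df["date"].isin(train_months)]
valid = df[df["date"] == valid_month]
test = df[df["date"] == test_month]

print(len(train), len(valid), len(test))
```

Because the months never overlap, the model is always evaluated on data that lies after everything it was trained on.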

In [None]:
st = time.time()

model = XGBRegressor(seed = 175)

model_name = str(model).split("(")[0]

day = str(datetime.now()).split()[0].replace("-", "_")
hour = str(datetime.now()).split()[1].replace(":", "_").split(".")[0]
t = str(day) + "_" + str(hour)

model.fit(X_train, Y_train, eval_metric = "rmse", 
    eval_set = [(X_train, Y_train), (X_valida, Y_valida)], 
    verbose = True, 
    early_stopping_rounds = 10)

et = time.time()

print("Training took {} minutes!".format((et - st)/60))

In [None]:
with open("{}_{}.dat".format(model_name, t), "wb") as f:
    pickle.dump(model, f)

In [None]:
print("{}_{}.dat".format(model_name, t))

In [None]:
with open("{}_{}.dat".format(model_name, t), "rb") as f:
    model = pickle.load(f)

<a id = "feature_importance_1"></a>
# Feature importance
[Go back to the table of contents](#table_of_contents)

In [None]:
importance = model.get_booster().get_score(importance_type = "gain")

importance = {k: v for k, v in sorted(importance.items(), key = lambda item: item[1])}

In [None]:
fig, ax = plt.subplots(figsize = (10, 15))
plot_importance(model, importance_type = "gain", ax = ax);

<a id = "predict_and_model_evaluation_1"></a>
# Predict and model evaluation
[Go back to the table of contents](#table_of_contents)

In [None]:
Y_valida_pred = model.predict(X_valida)

rmse_valida = sqrt(metrics.mean_squared_error(Y_valida, Y_valida_pred))
rmse_valida

In [None]:
Y_test_predict = model.predict(X_test)

rmse_test = sqrt(metrics.mean_squared_error(Y_test, Y_test_predict))
rmse_test
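
For reference, the RMSE reported above is the square root of the mean squared difference between the true and the predicted values. The hand-written version below uses made-up numbers; `rmse` is a hypothetical helper, and the notebook itself relies on `metrics.mean_squared_error`.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Errors are 0, 3 and 4, so the mean squared error is 25 / 3
y_true = [0, 2, 20]
y_pred = [0, 5, 16]
print(rmse(y_true, y_pred))
```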

<a id = "to_do"></a>
# To do
[Go back to the table of contents](#table_of_contents)

In [None]:
# load the preprocessed data
full_df = pd.read_pickle("../input/full-df-only-test-all-features/FULL_DF_ONLY_TEST_ALL_FEATURES.pkl")

# select only a few shops
full_df = full_df[full_df["shop_id"].isin(SHOPS)]

<a id = "new_feature_1"></a>
# Additional feature 1: Google Trends Data
[Go back to the table of contents](#table_of_contents)

I downloaded data from the Google Trends API and merged it into a single table.

In [None]:
class FeatureGenerator2(FeatureGenerator):
            
    def generate_gb_df(self):
        
        '''
        This function takes the full_df (here the trends df) and creates a groupby df based on the gb_list.
        It creates one column: the mean of trend for every gb_list combination.
            
        The resulting df (gb_df_) is assigned back to the FeatureGenerator class as an attribute.
        '''

        def my_agg(full_df_, args):
            
            '''
            This function is used to perform multiple operations over a groupby df and returns a df
            without multiindex.
            '''
            
            names = {
                # you can put here as many columns as you want 
                '{}_mean_trend'.format(args):  full_df_['trend'].mean()
            }

            return pd.Series(names, index = [key for key in names.keys()])
        
        # args is forwarded by apply as a keyword argument to my_agg
        gb_df_ = self.full_df.groupby(self.gb_list).apply(my_agg, args = (self.objective_column_name)).reset_index()

        self.gb_df_ = gb_df_

        
    def return_gb_df(self):  
        
        '''
        This function takes the gb_df_ created in the previous step (generate_gb_df) and creates additional features.
        We create 3 lag features (values from the past)
        and 3 variation features with the percentage change between past months.
        '''
        
        def generate_shift_features(self, suffix):
            
            '''
            This function is a helper function that takes the gb_df_ and a suffix (sum or mean) and creates the
            additional features.
            '''

            # dynamically creates the features
            # date_shop_id --> date_shop_id_sum if suffix is sum
            # date_shop_id --> date_shop_id_mean if suffix is mean
            name_ = self.objective_column_name + "_" + suffix

            self.gb_df_['{}_shift_1'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(1))
            
            self.gb_df_['{}_shift_2'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(2))
            
            self.gb_df_['{}_shift_3'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: x.shift(3))

            self.gb_df_['{}_var_pct_1'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(2))/x.shift(2))
            
            self.gb_df_['{}_var_pct_2'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(3))/x.shift(3))
            
            self.gb_df_['{}_var_pct_3'.format(name_)] =\
            self.gb_df_.groupby(self.gb_list[1:])[name_].transform(lambda x: (x.shift(1) - x.shift(4))/x.shift(4))
            
            self.gb_df_.fillna(-1, inplace = True)

            self.gb_df_.replace([np.inf, -np.inf], -1, inplace = True)
        
        # call the generate_shift_features function with the mean_trend suffix
        generate_shift_features(self, suffix = "mean_trend")
    
        return self.gb_df_
        

In [None]:
trends = pd.read_csv("../input/googletrends/busquedasgoogle.csv", encoding="utf-8", sep=";")

In [None]:
trends["Unnamed: 0"] = pd.to_datetime(trends["Unnamed: 0"], format = "%Y-%m-%d")

#rename columns
trends.columns=["date","trend","item_category_id"]

#create month and year column (we will use it when merging with full df)
trends["year"]=trends["date"].apply(lambda x: x.year)
trends["month"]=trends["date"].apply(lambda x: x.month)

#set date as index
trends.set_index("date", inplace=True)

In [None]:
#Add lags
gb_list = ["year","month","item_category_id"]

fe_generator = FeatureGenerator2(full_df = trends, gb_list = gb_list)

fe_generator.generate_gb_df()

category_trend_features = fe_generator.return_gb_df()

<a id = "new_feature_3"></a>
# Additional feature 2: Sales grouped by shop & category
[Go back to the table of contents](#table_of_contents)

In [None]:
# The target column was renamed to sales, so I create another child class of FeatureGenerator

class FeatureGenerator3(FeatureGenerator):
            
    def generate_gb_df(self):
        
        '''
        This function takes the full_df and creates a groupby df based on the gb_list.
        It creates one column: the sum of sales for every date and gb_list combination.
            
        The resulting df (gb_df_) is assigned back to the FeatureGenerator class as an attribute.
        '''

        def my_agg(full_df_, args):
            
            '''
            This function is used to perform multiple operations over a groupby df and returns a df
            without multiindex.
            '''
            
            names = {
                # you can put here as many columns as you want 
                '{}_sum'.format(args):  full_df_['sales'].sum()
            }

            return pd.Series(names, index = [key for key in names.keys()])
        
        # args is forwarded by apply as a keyword argument to my_agg
        gb_df_ = self.full_df.groupby(self.gb_list).apply(my_agg, args = (self.objective_column_name)).reset_index()

        self.gb_df_ = gb_df_

In [None]:
st = time.time()

gb_list = ["date", "item_category_id", "shop_id"]

fe_generator = FeatureGenerator3(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

month_item_shop_features = fe_generator.return_gb_df()

et = time.time()

(et - st)/60

<a id = "new_feature_3"></a>
# Additional feature 3: Sales grouped by shop & subtype
[Go back to the table of contents](#table_of_contents)

In [None]:
st = time.time()

gb_list = ["date", "subtype_code", "shop_id"]

fe_generator = FeatureGenerator3(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

month_subtype_shop_features = fe_generator.return_gb_df()

et = time.time()

(et - st)/60

<a id = "new_feature_3"></a>
# Additional feature 4: Sales grouped by subtype & item_category_id
[Go back to the table of contents](#table_of_contents)

In [None]:
st = time.time()

gb_list = ["date","subtype_code", "item_category_id"]

fe_generator = FeatureGenerator3(full_df = full_df, gb_list = gb_list)

fe_generator.generate_gb_df()

date_subtype_category = fe_generator.return_gb_df()

et = time.time()

(et - st)/60

<a id = "new_feature_2"></a>
# Bonus feature: Price mean diff
[Go back to the table of contents](#table_of_contents)

During EDA we added all the periods with 0 sales, but we haven't assigned them a price. What I do here is assign the price of the last sale or, where there isn't a previous sale, the price of the next one.

Once the 0 prices are replaced, I compute the mean price of each item.
Then I calculate the ratio of each price to the mean price of the item (a value lower than 1 means the item is cheaper than usual).

I discarded this feature from model training because it produced a worse model.

In [None]:
price_frames = []
for item in full_df["item_id"].unique():
    # work on a copy so the assignments below don't write to a view of full_df
    temp_df = full_df.loc[full_df["item_id"] == item, ["date", "item_id", "shop_id", "item_price"]].copy()
    
    # ffill didn't work on float16 here, so cast to float32 first
    temp_df["item_price"] = temp_df["item_price"].astype("float32")
    
    # treat 0 prices as missing, take the previous price and,
    # when there is no previous price, the next one
    temp_df["item_price"] = temp_df["item_price"].mask(temp_df["item_price"] == 0.0).ffill().bfill()
    
    price_frames.append(temp_df)

# concatenate once at the end instead of growing a DataFrame inside the loop
new_prices = pd.concat(price_frames)
    
    
mean_prices = new_prices[new_prices["item_price"]>0].groupby("item_id")[["item_price"]].mean()
mean_prices.columns = ["item_mean_price"]

prices = pd.merge(new_prices, mean_prices, on = ["item_id"], how = "left")
prices["price_mean_prop"] = prices["item_price"]/prices["item_mean_price"]

prices.drop("item_price", axis=1, inplace=True)
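
To make the fill-and-ratio idea above concrete, here is a toy example for a single item. The prices are made up and the names `toy_prices`, `filled` and `price_mean_prop_toy` are hypothetical; it only illustrates the mask, forward fill, backward fill and ratio steps.

```python
import pandas as pd

# Made-up prices for one item; 0.0 marks rows without a sale and therefore without a price.
toy_prices = pd.Series([0.0, 100.0, 0.0, 80.0, 0.0])

# 0 -> NaN, then carry the last known price forward,
# then fill the leading gap with the next known price.
filled = toy_prices.mask(toy_prices == 0.0).ffill().bfill()

# Ratio of each price to the item's mean price; below 1 means cheaper than usual.
price_mean_prop_toy = filled / filled.mean()

print(filled.tolist())
print(price_mean_prop_toy.round(3).tolist())
```

The first row has no previous price, so it takes the next one (100), while the last row keeps the most recent price (80).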
    

<a id = "join_dfs_with_new_features"></a>
# Join df's with new features
[Go back to the table of contents](#table_of_contents)

In [None]:
full_df = pd.merge(full_df, category_trend_features, on = ["year","month","item_category_id"], how = "left")
full_df = pd.merge(full_df, month_item_shop_features, on=["date", "item_category_id", "shop_id"], how = "left")
full_df = pd.merge(full_df, month_subtype_shop_features, on=["date", "subtype_code", "shop_id"], how = "left" )
full_df = pd.merge(full_df,date_subtype_category, on=["date","subtype_code", "item_category_id"], how = "left" )

I don't know why, but some numeric columns changed to object dtype during the merge, so I convert them back to numeric.

In [None]:
full_df["shop_id"] = pd.to_numeric(full_df["shop_id"])
full_df["item_id"] = pd.to_numeric(full_df["item_id"])
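
A quick way to check which columns ended up as object, and to convert them back, could look like the sketch below. It runs on a made-up toy frame (`toy_df` and `object_cols` are hypothetical names) and does not try to explain why the dtype changed in the first place.

```python
import pandas as pd

# Toy frame where the id columns hold numbers stored as generic Python objects.
toy_df = pd.DataFrame({
    "shop_id": pd.Series([1, 2, 3], dtype="object"),
    "item_id": pd.Series([10, 20, 30], dtype="object"),
    "sales": [1.0, 0.0, 2.0],
})

# Columns that are currently stored as object.
object_cols = toy_df.select_dtypes(include="object").columns.tolist()
print(object_cols)

# Convert each of them back to a numeric dtype.
for col in object_cols:
    toy_df[col] = pd.to_numeric(toy_df[col])

print(toy_df.dtypes)
```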

I delete the data that is no longer useful: the first periods, which have no lag values, and the variables of the current period.

In [None]:
full_df = full_df[full_df["date"] > np.datetime64("2013-03-31")]

cols_to_drop = [

'revenue',
'shop_name',
"city",
'item_name',
'item_category_name',
'split',
'type',
'subtype',

'date_item_id_sum',
"date_item_category_id_sum",
    
"year_month_item_category_id_mean_trend",
"date_item_category_id_shop_id_sum",
"date_subtype_code_shop_id_sum",
"date_subtype_code_item_category_id_sum"
]

full_df.drop(cols_to_drop, inplace = True, axis = 1)

<a id = "new_model"></a>
# Model training
[Go back to the table of contents](#table_of_contents)

In [None]:
def split_data(df):
    
    train_index = sorted(list(df["date"].unique()))[:-2]

    valida_index = [sorted(list(df["date"].unique()))[-2]]

    test_index = [sorted(list(df["date"].unique()))[-1]]

    # ------------------------------------------------------
    # split the data into train, validation and test dataset
    # we "simulate" the test dataset to be the Kaggle test dataset

    X_train = df[df["date"].isin(train_index)].drop(['sales', "date"], axis=1)
    Y_train = df[df["date"].isin(train_index)]['sales']

    X_valida = df[df["date"].isin(valida_index)].drop(['sales', "date"], axis=1)
    Y_valida = df[df["date"].isin(valida_index)]['sales']

    X_test = df[df["date"].isin(test_index)].drop(['sales', "date"], axis = 1)
    Y_test = df[df["date"].isin(test_index)]['sales']
    
    return (X_train,Y_train,X_valida,Y_valida,X_test,Y_test)
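
The split above is time based: the last period is held out as test, the one before it as validation, and everything earlier is used for training. A minimal sketch of the same idea on a made-up frame (`toy_df` and the other names below are hypothetical) is:

```python
import pandas as pd

# Four made-up monthly periods for two shops.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2015-07-31", "2015-08-31", "2015-09-30", "2015-10-31"] * 2),
    "shop_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "sales": [3, 1, 0, 2, 5, 4, 6, 1],
})

# Sort the distinct periods once and slice them by position.
periods = sorted(toy_df["date"].unique())
train_periods, valida_period, test_period = periods[:-2], periods[-2], periods[-1]

train = toy_df[toy_df["date"].isin(train_periods)]
valida = toy_df[toy_df["date"] == valida_period]
test = toy_df[toy_df["date"] == test_period]

print(len(train), len(valida), len(test))
```

Splitting by period rather than at random keeps future rows out of the training set, which matters because the features are lags of past sales.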

In [None]:
def set_model(X_train, Y_train, X_valida, Y_valida):
    
    st = time.time()

    model = XGBRegressor(seed = 175)

    model_name = str(model).split("(")[0]

    day = str(datetime.now()).split()[0].replace("-", "_")
    hour = str(datetime.now()).split()[1].replace(":", "_").split(".")[0]
    t = str(day) + "_" + str(hour)

    model.fit(X_train, Y_train, eval_metric = "rmse", 
        eval_set = [(X_train, Y_train), (X_valida, Y_valida)],
        verbose = False,
        early_stopping_rounds = 10)

    et = time.time()

    print("Training took {} minutes!".format((et - st)/60))
    
    return model
    

In [None]:
def model_results(model, X_valida, Y_valida, X_test, Y_test):
    
    Y_valida_pred = model.predict(X_valida)

    rmse_valida = sqrt(metrics.mean_squared_error(Y_valida, Y_valida_pred))
    
    Y_test_predict = model.predict(X_test)

    rmse_test = sqrt(metrics.mean_squared_error(Y_test, Y_test_predict))
    
    print("Model has a validation rmse of {} and test rmse of {}".format(rmse_valida,rmse_test))
    
    return rmse_valida, rmse_test
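
For reference, the RMSE reported above is the square root of the mean of the squared errors. A small hand-checkable sketch with made-up values (`y_true` and `y_pred` are hypothetical arrays) is:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 2.0, 5.0])
y_pred = np.array([0.0, 2.0, 2.0, 3.0])

# errors: 0, -1, 0, 2 -> squared: 0, 1, 0, 4 -> mean: 1.25 -> square root: about 1.118
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse)
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones.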

In [None]:
X_train, Y_train, X_valida, Y_valida, X_test, Y_test = split_data(full_df)
    
model = set_model(X_train, Y_train, X_valida, Y_valida)
    
rmse_val, rmse_test = model_results(model, X_valida, Y_valida, X_test, Y_test)

<a id = "feature_importance_2"></a>
# Feature importance of new model
[Go back to the table of contents](#table_of_contents)

In [None]:
importance = model.get_booster().get_score(importance_type = "gain")

importance = {k: v for k, v in sorted(importance.items(), key = lambda item: item[1])}
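
The dictionary comprehension above orders the features from lowest to highest gain. On a made-up dictionary (the feature names and values below are hypothetical, not taken from the model) the pattern works like this:

```python
# Made-up gain values, only to illustrate the sorting pattern.
toy_importance = {"shop_id": 12.5, "item_id": 40.0, "month": 3.2}

# Ascending by value; dicts keep insertion order, so the result stays sorted.
ascending = {k: v for k, v in sorted(toy_importance.items(), key=lambda item: item[1])}

# The two most important features, highest gain first.
top_2 = sorted(toy_importance.items(), key=lambda item: item[1], reverse=True)[:2]

print(ascending)
print(top_2)
```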

In [None]:
fig, ax = plt.subplots(figsize = (10, 15))
plot_importance(model, importance_type = "gain", ax = ax, max_num_features=50);