QUICK DESCRIPTION

This notebook provides a full solution to the Predict Future Sales challenge. It includes
- dataset cleaning and monthly aggregation
- exploratory data analysis
- feature engineering (supercategories extracted from the category names, time since release of an item, time since first/last sale in a shop, typical values and statistics on the prices, ...)
- target encoding, then temporal lagging and temporal averaging (or min/max) over past months
- splitting of the dataset into three classes of realisations (month, shop, item): item new in the dataset this month / item not new but never sold in this shop before / item sold in this shop in the past
- validation and fitting of an XGBRegressor for each of the three classes of realisations defined above, with different features and hyperparameters

DESCRIPTION OF THE APPROACH

We aim to predict the quantity sold for each sample (shop, item) in the next month. For each item, the quantities sold vary through time (from month to month) and space (from shop to shop). Besides, similarity between items is embedded in the category they belong to.

- Additional features are extracted from the item categories to enhance the understanding of item similarity. In particular, supercategories are defined that encompass several categories of items of the same type (such as consoles, video games, accessories, books, music, ...).

- The spatial variability of the sales appears to be mostly orthogonal to the temporal variations (similar temporal variations across all shops). Spatial trends are accounted for through target encoding of the shops.

- This solution is mostly based on the temporal separation of the samples (shop, item) depending on so-called 'seniority' levels. Each month, we analyse separately samples made of items that are new in the dataset this month (seniority = 0), samples (shop, item) made of items that are not new but have never been sold in this specific shop before (seniority = 1), and samples (shop, item) made of items that have already been sold in the past in this very shop (seniority = 2).

The main idea behind this separation is that the dataset on a given month is made of the cartesian product between the whole global catalogue of items and the whole set of shops open, whereas in practice the local catalogue of items available in a given shop only includes a fraction of the global catalogue. Being able to discriminate items that are in fact available in a given shop from items that are actually not should ease the predictions, as samples (shop, item) with items not even available in the shop in question cannot possibly be sold there in the future. 

Samples (shop, item) of seniority 2 are made of items that have previously been sold in this very shop, and so they are in the local catalogue of this shop for sure. For such samples, our predictions will be mostly based on the quantities sold for this item in this shop in the past.

Samples (shop, item) of seniority 1 are made of items that have never been sold in this shop, but have been sold in other shops before. Such a sample is likely to be made of an item that is not actually available in this shop, so the likelihood that the item will be sold there in the future is very low. In fact, data analysis shows that only 5% of these samples contribute to the sales (compared to 20-25% of samples of seniority 2). Those few samples of seniority 1 that will be sold are likely made of items that are always sold in low quantities, possibly less than one per month. Here again, data analysis shows that the average sales for items of seniority 1 are around 0.05 (against 0.4 for seniority 2), and the likelihood that an item of seniority 1 is sold in quantities greater than 3 is less than 0.1%. Samples of seniority 1 yielding sale quantities above 8 are so rare that they may be considered noise.

For seniority 1, the critical features are the number of months since the item was released in the catalogue without being sold in this shop, and the typical sale quantities of this item in the shops where it has been sold in the past. If an item is typically sold in large quantities but has never been sold in a given shop, it is unlikely that it will suddenly start being sold there next month. Similarly, if the item has never been sold in a given shop in spite of having been in the catalogue for many months, it is likely that it will never be sold in this shop, or only in very small quantities. Conversely, if the item was only released last month, it may simply not have had the chance to be sold yet, and it may be sold in low quantities next month.

In particular, we find that the most relevant features are target encodings of the past typical sale quantities of samples of seniority 1 in this shop, in the same category of items, and with the same number of months since release, as well as the quantities sold for this item in other shops where it has been sold in the past.

Finally, samples (shop, item) of seniority 0 are totally new in the catalogue. No past data is available for these pairs. Besides, the typical sale quantities of new items vary a lot from month to month and their temporal evolution appears much noisier than that of older items. For these items, we base our predictions mostly on the past typical sale quantities of similar items in the month when they were released. Similar items are defined as items of the same category or the same supercategory (such as another game or another book).
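
The three seniority levels described above can be illustrated on toy data. This is only a minimal sketch under stated assumptions, not the implementation used later in the notebook: the frames `sales` and `samples` and the helper columns `first_item_month` and `first_pair_month` are hypothetical names introduced for the example.

```python
import pandas as pd

# Toy monthly sales: one row per (month_id, shop_id, item_id) with at least one sale.
sales = pd.DataFrame({
    'month_id': [0, 0, 1, 1, 2],
    'shop_id':  [1, 2, 1, 2, 2],
    'item_id':  [10, 10, 10, 11, 10],
})

# Hypothetical samples (shop, item) to classify for month 2.
samples = pd.DataFrame({
    'month_id': [2, 2, 2, 2],
    'shop_id':  [1, 2, 1, 2],
    'item_id':  [10, 11, 11, 12],
})

# First month in which each item was sold anywhere, and in each specific shop.
first_item = sales.groupby('item_id')['month_id'].min().rename('first_item_month')
first_pair = sales.groupby(['shop_id', 'item_id'])['month_id'].min().rename('first_pair_month')

samples = samples.join(first_item, on='item_id').join(first_pair, on=['shop_id', 'item_id'])

# Only strictly earlier months count as "past"; missing values compare as False.
item_seen = samples['first_item_month'] < samples['month_id']
pair_seen = samples['first_pair_month'] < samples['month_id']

samples['seniority'] = 0                                 # item new in the dataset
samples.loc[item_seen & ~pair_seen, 'seniority'] = 1     # item known, never sold in this shop
samples.loc[pair_seen, 'seniority'] = 2                  # item already sold in this shop

print(samples[['shop_id', 'item_id', 'seniority']])
```

Comparing against strictly earlier months keeps the classification free of information from the month being predicted.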

- Apart from the separation between seniority levels, the influence of temporality is threefold: the quantities sold for a given item in a given shop depend on the period of the year (more sales in December due to Christmas), on the number of months since the item was released (older items are less popular than newer ones), and on the month in which it was released (items released in December sell more all year round than items released in February). The sensitivity to each of these three temporal effects varies from category to category and with the seniority level of the sample: samples of seniority 1 are almost insensitive to the period of the year but very sensitive to the number of months since release, while deliveries are sold in similar quantities regardless of how long the service has been offered, etc. Thus, the effect of temporality is accounted for jointly with other features through target encoding.

- Three separate models are built using the XGBRegressor algorithm from xgboost. Different hyperparameters are used to account for the different distributions of the target variable, different problem complexities, and different tendencies to overfit.
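
To make the combination of target encoding and temporal lagging mentioned above more concrete, here is a minimal sketch on toy data. It assumes a monthly table `grid` with one row per (month, shop, item); the frame and the feature name `shop_mean_qty_lag1` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy monthly grid: the target is the quantity sold per (month_id, shop_id, item_id).
grid = pd.DataFrame({
    'month_id':      [0, 0, 0, 1, 1, 1],
    'shop_id':       [1, 1, 2, 1, 1, 2],
    'item_id':       [10, 11, 10, 10, 11, 10],
    'item_quantity': [2.0, 4.0, 6.0, 1.0, 3.0, 0.0],
})

# Target encoding: mean quantity sold per (month, shop).
enc = (grid.groupby(['month_id', 'shop_id'], as_index=False)['item_quantity']
           .mean()
           .rename(columns={'item_quantity': 'shop_mean_qty_lag1'}))

# Temporal lag: shift the encoding forward by one month, so that a row of
# month m only receives the value computed on month m-1.
enc['month_id'] += 1
grid = grid.merge(enc, on=['month_id', 'shop_id'], how='left')

print(grid)
```

Rows of the first month have no past value and are left as missing. Lagging the encoding is what prevents the target of the current month from leaking into its own features.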

## Import libraries

In [None]:
# system and performance
import gc
import time
import os
import pickle


# date management
import datetime
import calendar


# data management
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)

from itertools import product

# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


# machine learning
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

In [None]:
def create_directory(path):
    if not os.path.isdir(path):
        os.mkdir(path)
        print('directory '+path+' created successfully !')
    else:
        print('directory '+path+' already exists')

In [None]:
def downcast_dtypes(df):
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols = [c for c in df if df[c].dtype in ["int64", "int32"]]
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols] = df[int_cols].astype(np.int16)
    return df
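
`downcast_dtypes` casts every integer column to `int16`, so values outside the range [-32768, 32767] would not survive the cast intact. A range-aware variant can be sketched as follows; `downcast_dtypes_checked` is a hypothetical helper that is not used elsewhere in this notebook, and it relies on `pd.to_numeric(..., downcast='integer')` to pick the smallest integer type that fits the observed values.

```python
import numpy as np
import pandas as pd

def downcast_dtypes_checked(df):
    """Hypothetical variant of downcast_dtypes choosing the smallest safe dtype per column."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = out[col].astype(np.float32)
        elif pd.api.types.is_integer_dtype(out[col]):
            # to_numeric only narrows as far as the observed value range allows
            out[col] = pd.to_numeric(out[col], downcast='integer')
    return out

demo = pd.DataFrame({'small': [1, 2, 3], 'large': [1, 2, 100000], 'price': [1.5, 2.5, 3.5]})
demo = demo.astype({'small': 'int64', 'large': 'int64', 'price': 'float64'})
checked = downcast_dtypes_checked(demo)
print(checked.dtypes)
```

Unlike the function above, this sketch returns a copy instead of modifying the dataframe in place.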

In [None]:
# path to data
RAW_DATA_FOLDER = '/kaggle/input/competitive-data-science-predict-future-sales/'
DATA_FOLDER = '/kaggle/working/'

##### Define core variable space and macro to reset variable space

In [None]:
loaded=%who_ls
loaded.append('loaded')

In [None]:
all_vars=%who_ls
all_vars.append('all_vars')
for var in list(set(all_vars)-set(loaded)):
    exec('del '+var)
del var

In [None]:
%macro reset_variable_space 6
loaded.append('reset_variable_space')

## -------------------------------------------------------------

# 0 - DATA CLEANING

- Cleaning of input dataframes
- Preliminary data analysis

## Import raw data

In [None]:
train_df        = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'sales_train.csv'))
items           = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'item_categories.csv'))
shops           = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'shops.csv'))

test_df         = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'test.csv'))

In [None]:
print('items :' + str(items.shape))
print()
items.info(show_counts=True)
print('_'*40)
print()
print('item_categories :' + str(item_categories.shape))
print()
item_categories.info(show_counts=True)
print('_'*40)
print()
print('shops :' + str(shops.shape))
print()
shops.info(show_counts=True)

In [None]:
print('train_df :' + str(train_df.shape))
print()
train_df.info(show_counts=True)
print('_'*40)
print()
print('test set :' + str(test_df.shape))
print()
test_df.info(show_counts=True)
print('_'*40)
print('_'*40)
print()
print('number of different items in training set : '+str(train_df['item_id'].nunique()) + ' / ' + str(items.item_id.nunique()))
print('number of different items in test set : '+str(test_df['item_id'].nunique()) + ' / ' + str(items.item_id.nunique()))
print('number of different items ONLY in test set : '+str(test_df.loc[~test_df['item_id'].isin(train_df['item_id']),'item_id'].nunique()))
print()
print('number of different shops in training set : '+str(train_df['shop_id'].nunique()) + ' / ' + str(shops.shop_id.nunique()) )
print('number of different shops in test set : '+str(test_df['shop_id'].nunique()) + ' / ' + str(shops.shop_id.nunique()))

## BASIC PREPROCESSING

In [None]:
# rename features
train_df.rename({'date_block_num':'month_id', 'item_cnt_day':'item_quantity'},axis=1,inplace=True)

In [None]:
# convert date feature to datetime type
train_df['date']=pd.to_datetime(train_df['date'],format="%d.%m.%Y")

In [None]:
train_df['day_id']=(train_df['date']-train_df['date'].min()).dt.days

In [None]:
# discard returned articles (rows with negative quantity)
# we are only interested in predicting the sales, returned articles are irrelevant to the study
print('percentage of realisations that represent returned articles : ' +str(round((train_df['item_quantity']<0).mean()*100,2)) + ' %')
train_df.drop(train_df[train_df['item_quantity']<0].index,axis=0,inplace=True)

## PREPROCESSING OF THE SHOPS

### Analysis of duplicate shops

In [None]:
print('shop  0 : ' + shops.shop_name[0])
print('shop 57 : ' + shops.shop_name[57])
print()

tmp=train_df[(train_df['shop_id']==0) | (train_df['shop_id']==57)].drop('month_id',axis=1)

print('## Value counts ##')
print(tmp['shop_id'].value_counts())
print()

print('## Activity period ##')
print('shop  0 : ' + tmp[tmp.shop_id==0].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==0].date.max().strftime("%d.%m.%Y"))
print('shop 57 : ' + tmp[tmp.shop_id==57].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==57].date.max().strftime("%d.%m.%Y"))
print()

print('## The periods of activity do not overlap and are consecutive !! ##')
print('## SHOP 0 = SHOP 57 ##')

del tmp

In [None]:
print('shop  1 : ' + shops.shop_name[1])
print('shop 58 : ' + shops.shop_name[58])
print()

tmp=train_df[(train_df['shop_id']==1) | (train_df['shop_id']==58)].drop('month_id',axis=1)

print('## Value counts ##')
print(tmp['shop_id'].value_counts())
print()

print('## Activity period ##')
print('shop  1 : ' + tmp[tmp.shop_id==1].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==1].date.max().strftime("%d.%m.%Y"))
print('shop 58 : ' + tmp[tmp.shop_id==58].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==58].date.max().strftime("%d.%m.%Y"))
print()

print('## The periods of activity do not overlap and are consecutive !! ##')
print('## SHOP 1 = SHOP 58 ##')

del tmp

In [None]:
print('shop 10 : ' + shops.shop_name[10])
print('shop 11 : ' + shops.shop_name[11])
print()

tmp=train_df[(train_df['shop_id']==10) | (train_df['shop_id']==11)].drop('month_id',axis=1)

print('## Value counts ##')
print(tmp['shop_id'].value_counts())
print()

print('## Activity period ##')
print('shop 10 : ' + tmp[tmp.shop_id==10].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==10].date.max().strftime("%d.%m.%Y"))
print('shop 11 : ' + tmp[tmp.shop_id==11].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==11].date.max().strftime("%d.%m.%Y"))
print()

print('### Count values per shop between 01.02.2015 --- 28.02.2015 ###')
print(tmp[(tmp['date'] <= tmp[tmp.shop_id==11].date.max()) & (tmp['date'] >= tmp[tmp.shop_id==11].date.min())]['shop_id'].value_counts())
print()

print('## The periods of activity do not overlap !! ##')
print('## SHOP 10 = SHOP 11 ##')

del tmp
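
The pairwise checks above all ask the same question: do the activity periods of two shops overlap? A reusable version could be sketched as follows; `activity_periods` is a hypothetical helper and the dates in `toy` are made up for the illustration.

```python
import pandas as pd

def activity_periods(df, shop_a, shop_b):
    """Return the (first, last) sale dates per shop and whether the two periods overlap."""
    spans = df[df['shop_id'].isin([shop_a, shop_b])].groupby('shop_id')['date'].agg(['min', 'max'])
    # Two closed intervals overlap when each one starts before the other ends.
    overlap = (spans.loc[shop_a, 'min'] <= spans.loc[shop_b, 'max']) and \
              (spans.loc[shop_b, 'min'] <= spans.loc[shop_a, 'max'])
    return spans, bool(overlap)

# Made-up dates, for illustration only.
toy = pd.DataFrame({
    'shop_id': [0, 0, 57, 57, 39, 40],
    'date': pd.to_datetime(['2013-01-05', '2013-02-20', '2013-03-01', '2015-10-30',
                            '2014-01-01', '2014-01-01']),
})

spans_0_57, overlap_0_57 = activity_periods(toy, 0, 57)
spans_39_40, overlap_39_40 = activity_periods(toy, 39, 40)
print(spans_0_57)
print('periods of shops 0 and 57 overlap :', overlap_0_57)
print('periods of shops 39 and 40 overlap :', overlap_39_40)
```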

In [None]:
print('shop 39 : ' + shops.shop_name[39])
print('shop 40 : ' + shops.shop_name[40])
print()

tmp=train_df[(train_df['shop_id']==39) | (train_df['shop_id']==40)].drop('month_id',axis=1)

print('## Value count ##')
print(tmp['shop_id'].value_counts())
print()

print('## Each shop individually has no duplicate sale...')
tmpa=train_df[(train_df['shop_id']==39)].drop('month_id',axis=1)
tmpb=train_df[(train_df['shop_id']==40)].drop('month_id',axis=1)
print('duplicates date/item (shop 39)      : ' + str(tmpa.duplicated(subset=['date','item_id'],keep=False).sum()))
print('duplicates date/item (shop 40)      : ' + str(tmpb.duplicated(subset=['date','item_id'],keep=False).sum()))
print('...but the union of both has many...')
print('duplicates date/item                : ' + str(tmp.duplicated(subset=['date','item_id'],keep=False).sum()))
print('duplicates date/item/price          : ' + str(tmp.duplicated(subset=['date','item_id','item_price'],keep=False).sum()))
print('duplicates date/item/price/quantity : ' + str(tmp.duplicated(subset=['date','item_id','item_price','item_quantity'],keep=False).sum()))
print('...so all duplicates are exactly paired between the two shops (39,40)! ##')
print()

print('## The periods of activity overlap ##')
print('shop 39 : ' + tmp[tmp.shop_id==39].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==39].date.max().strftime("%d.%m.%Y"))
print('shop 40 : ' + tmp[tmp.shop_id==40].date.min().strftime("%d.%m.%Y") + ' --- ' + tmp[tmp.shop_id==40].date.max().strftime("%d.%m.%Y"))
print()

print('####################################')
print('## Restrict dataframe to overlapping period ##')
tmpr=tmp[tmp['date']<=tmpb['date'].max()]
print(tmpr['shop_id'].value_counts())
print()

print('####################################')
# indicator vectors over item ids: 1 if the item was sold in the shop, 0 otherwise
pa=np.zeros(items['item_id'].max()+1,dtype=int)
pa[tmpa['item_id'].unique()]=1
pb=np.zeros(items['item_id'].max()+1,dtype=int)
pb[tmpb['item_id'].unique()]=1

print('number of different items sold in shop 39 : ' + str(pa.sum()))
print('number of different items sold in shop 40 : ' + str(pb.sum()))
print('number of different items sold in both shops : ' + str((pa*pb).sum()))
print()

print("## It is likely that the shop 40 is an 'island' in the commercial center, related to the main shop 39 ##")
print("## The island 40 appears to have closed at the end of January 2015 ##")
print()

del tmp,tmpa,tmpb,tmpr,pa,pb

### Remove duplicate shops

In [None]:
# shops 0-57, 1-58 and 10-11 are in fact the same shops on different time periods: relabel each pair with a single id
# shops 0, 1 and 11 are not present in the test set, but shops 57, 58 and 10 are, so we keep the latter ids

# shop 40 is likely to be an annex ('island') of shop 39, so we aggregate their sales together (here we simply relabel shop 40 as shop 39; the sales are summed when grouping by (date,shop,item,price) in the next step)
# shop 40 is not present in the test set, but shop 39 is, so we keep the latter id

shops.drop(0,axis=0,inplace=True)
shops.drop(1,axis=0,inplace=True)
shops.drop(11,axis=0,inplace=True)
shops.drop(40,axis=0,inplace=True)

train_df.loc[train_df['shop_id']==0,'shop_id']=57
train_df.loc[train_df['shop_id']==1,'shop_id']=58
train_df.loc[train_df['shop_id']==11,'shop_id']=10
train_df.loc[train_df['shop_id']==40,'shop_id']=39
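
The same relabelling can also be written with a single mapping, which keeps the list of duplicate shops in one place. This is a sketch on a toy frame; the name `SHOP_ALIASES` is hypothetical, while its content is taken from the four assignments above.

```python
import pandas as pd

# duplicate shop id -> shop id that is kept
SHOP_ALIASES = {0: 57, 1: 58, 11: 10, 40: 39}

toy = pd.DataFrame({'shop_id': [0, 1, 11, 40, 5, 57]})
# ids that are not keys of the mapping are left unchanged
toy['shop_id'] = toy['shop_id'].replace(SHOP_ALIASES)
print(toy['shop_id'].tolist())
```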

In [None]:
# group together sales by (date,shop_id,item_id,item_price)
train_df=train_df.groupby(list(train_df.columns.drop('item_quantity')),as_index=False).sum()
train_df.info(show_counts=True)

In [None]:
print('Remaining duplicates in the training set : ' + str(train_df.duplicated(subset=['date','shop_id','item_id','item_price'],keep=False).sum()))
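
To illustrate what the grouping step does, here is a toy example with made-up values: rows that share every column except `item_quantity` (for instance after relabelling shop 40 as shop 39) are collapsed into a single row whose quantity is the sum.

```python
import pandas as pd

# Made-up sales: the first two rows are identical except for the quantity.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-02']),
    'shop_id': [39, 39, 39],
    'item_id': [100, 100, 100],
    'item_price': [299.0, 299.0, 299.0],
    'item_quantity': [1.0, 2.0, 1.0],
})

# Group on every column except the quantity and sum the quantities.
keys = list(toy.columns.drop('item_quantity'))
merged = toy.groupby(keys, as_index=False).sum()
print(merged)
```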

### Feature engineering: cities

In [None]:
# create feature city from shop name
shops['city'] = shops['shop_name'].str.extract(r'(\S+)\s', expand=False)
print('number of cities in the dataset : ' + str(shops.city.nunique()))
print('number of null cities : ' + str(shops.city.isnull().sum()))
shops.city.unique()

In [None]:
# label encoding of city names
shops['city_id']=pd.factorize(shops['city'])[0]
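
The city extraction and its label encoding can be checked on a few made-up shop names. The sketch assumes, as the regular expression above does, that the city is the first whitespace-delimited token of the shop name; the names in `toy_shops` are invented.

```python
import pandas as pd

# Invented shop names following the "<city> <rest of the name>" pattern.
toy_shops = pd.DataFrame({'shop_name': ['Москва ТЦ "Пример"',
                                        'Москва ТРК "Другой"',
                                        'Казань ТЦ "Пример"']})

# Raw string so that \S and \s reach the regex engine unchanged.
toy_shops['city'] = toy_shops['shop_name'].str.extract(r'(\S+)\s', expand=False)

# factorize assigns integer labels in order of first appearance.
toy_shops['city_id'] = pd.factorize(toy_shops['city'])[0]
print(toy_shops)
```

A name whose first token is not a city (for example because of a stray leading character) would produce a spurious city, which is why the list of unique cities is printed above.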

### Analysis of opening periods

In [None]:
tmp=train_df[['month_id','shop_id']].groupby('shop_id').agg({'month_id':['min','max','nunique']})
tmp[('month_id','laps')]=tmp[('month_id','max')]-tmp[('month_id','min')]+1
tmp=tmp.join(shops[['shop_id','shop_name','city','city_id']].set_index('shop_id'))
tmp.sort_values(by='city_id',inplace=True)
tmp
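
In the table above, comparing `nunique` with `laps` tells whether a shop has months without any sale between its first and last month. A minimal sketch on toy data, where the column `has_gaps` is a hypothetical addition:

```python
import pandas as pd

toy = pd.DataFrame({
    'shop_id':  [5, 5, 5, 9, 9],
    'month_id': [3, 4, 5, 9, 21],
})

span = toy.groupby('shop_id')['month_id'].agg(['min', 'max', 'nunique'])
span['laps'] = span['max'] - span['min'] + 1       # length of the activity period in months
span['has_gaps'] = span['nunique'] < span['laps']  # True if some months have no sale at all
print(span)
```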

In [None]:
# the shops are generally open every day of the month
# in their opening and closing months, the shops are open fewer days
# shop 12 is open on average 77% of the days

fig,axes=plt.subplots(1,2,figsize=(15,7))
tmp=train_df[['date','month_id','shop_id']].groupby(['month_id','shop_id'])['date'].nunique().unstack().T
sns.heatmap(tmp,ax=axes[0])
sns.heatmap(tmp<=3,ax=axes[1])

In [None]:
print('item 11030 is sold on months : '+str(train_df.loc[(train_df['item_id']==11030),'month_id'].unique()))
print('item 20949 is sold on months : '+str(train_df.loc[(train_df['item_id']==20949),'month_id'].unique()))

train_df.loc[(train_df['shop_id']==34)&(train_df['month_id']==18)]

# only 2 items are sold on the opening month (month 18) for shop 34, each in only one copy
# these two items were sold in the past, and are sold later as well... we may just delete these two entries from the dataset!

In [None]:
train_df.loc[(train_df['shop_id']==33)&(train_df['month_id']==19)]

In [None]:
# restriction to shops that are present in the test set
fig,axes=plt.subplots(1,2,figsize=(15,7))
tmp_test=tmp.loc[test_df['shop_id'].unique(),:]
sns.heatmap(tmp_test,ax=axes[0])
sns.heatmap(tmp_test<=3,ax=axes[1])

In [None]:
del tmp, tmp_test
gc.collect()

### Remove outlier shops

In [None]:
# shop 9 and 20 are only open on October months (and only a few days within these months)
print(shops.loc[shops['shop_id']==9,'shop_name'].values)
print('shop 9 is for "out-bound trade", and it is only open on months 9,21 and 33')
print()
print(shops.loc[shops['shop_id']==20,'shop_name'].values)
print('shop 20 is for "Moscow sell-out" and is only open on months 21 and 33')

In [None]:
# shop 12 and 55 are not physical shops: they appear in the test set so we do not discard them
print(shops.loc[shops['shop_id']==12,'shop_name'].values)
print('shop 12 is "emergency online shop" (appears in the test set as well)')
print()
print(shops.loc[shops['shop_id']==55,'shop_name'].values)
print('shop 55 is "digital warehouse 1C online" (appears in the test set as well)')

In [None]:
train_df.loc[train_df['shop_id']==9].info(show_counts=True)
print()
train_df.loc[train_df['shop_id']==20].info(show_counts=True)
print()
train_df.loc[train_df['shop_id']==12].info(show_counts=True)
print()
train_df.loc[train_df['shop_id']==55].info(show_counts=True)

In [None]:
# we remove shops 9 and 20 as they are not open on continuous time periods and they likely behave differently from other shops and are not in the test set
shops.drop(9,axis=0,inplace=True)
shops.drop(20,axis=0,inplace=True)

train_df.drop(train_df.loc[train_df['shop_id']==9].index,axis=0,inplace=True)
train_df.drop(train_df.loc[train_df['shop_id']==20].index,axis=0,inplace=True)

In [None]:
# we remove shop 33 from the dataset: it is only open for a short period in the middle of the training period (6 full months plus 1 opening month and 1 closing month)
shops.drop(33,axis=0,inplace=True)

train_df.drop(train_df.loc[train_df['shop_id']==33].index,axis=0,inplace=True)

In [None]:
# we remove the two entries for shop 34 on month 18, as only two items were sold in only one copy each
train_df.drop(train_df.loc[(train_df['shop_id']==34)&(train_df['month_id']==18)].index,axis=0,inplace=True)

### Visualization

In [None]:
shops

In [None]:
# opening days of shops after cleaning
fig,axes=plt.subplots(1,2,figsize=(15,7))
tmp=train_df[['date','month_id','shop_id']].groupby(['month_id','shop_id'])['date'].nunique().unstack().T
sns.heatmap(tmp,ax=axes[0])
sns.heatmap(tmp<=3,ax=axes[1])

In [None]:
# collect garbage
del tmp
gc.collect()

## PREPROCESSING OF ITEM_CATEGORIES

### Feature engineering

In [None]:
# Rename categories
item_categories.loc[0,'item_category_name']='Аксессуары - PC (Гарнитуры/Наушники)'
item_categories.loc[8,'item_category_name']='Билеты - Билеты (Цифра)'
item_categories.loc[9,'item_category_name']='Доставка товара - Доставка товара'
item_categories.loc[26,'item_category_name']='Игры - Android (Цифра)'
item_categories.loc[27,'item_category_name']='Игры - MAC (Цифра)'
item_categories.loc[28,'item_category_name']='Игры - PC (Дополнительные издания)'
item_categories.loc[29,'item_category_name']='Игры - PC (Коллекционные издания)'
item_categories.loc[30,'item_category_name']='Игры - PC (Стандартные издания)'
item_categories.loc[31,'item_category_name']='Игры - PC (Цифра)'
item_categories.loc[32,'item_category_name']='Карты оплаты - Кино, Музыка, Игры'
item_categories.loc[79,'item_category_name']='Прием денежных средств для 1С-Онлайн - Прием денежных средств для 1С-Онлайн'
item_categories.loc[80,'item_category_name']='Билеты - Билеты'
item_categories.loc[81,'item_category_name']='Misc - Чистые носители (шпиль)'
item_categories.loc[82,'item_category_name']='Misc - Чистые носители (штучные)'
item_categories.loc[83,'item_category_name']='Misc - Элементы питания'

In [None]:
# create feature item_supercategory_name, item_category_console, item_category_is_digital from item_category_name
item_categories['item_supercategory_name'] = item_categories['item_category_name'].str.extract(r'([\S\s]+)\s\-', expand=False)
item_categories['item_category_is_digital']=(item_categories['item_category_name'].str.find('(Цифра)')>=0)

consoles=['PS2','PS3','PS4','PSP','PSVita','XBOX 360','XBOX ONE','PC','MAC','Android']
item_categories['item_category_console']=''
for console in consoles:
    item_categories['item_category_console']+=item_categories['item_category_name'].str.extract('('+console+')',expand=False).fillna('')

item_categories.loc[item_categories['item_category_console']=='','item_category_console']='None'

print('cardinality of the supercategories : ' + str(item_categories['item_supercategory_name'].nunique()))
print('categories with no supercategory : ' + str(item_categories['item_supercategory_name'].isnull().sum()))
print()
print('cardinality of the consoles : ' + str(item_categories['item_category_console'].nunique()))
print('categories associated with no console : ' + str((item_categories['item_category_console']=='None').sum()))
print()
print('number of digital categories : '+ str(item_categories['item_category_is_digital'].sum()))
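
The three features derived from the category name can be checked on a handful of names. The sketch below uses illustrative names that follow the "supercategory - category" pattern, and it extracts the console with a single alternation instead of the loop above: this is a simplification that only keeps the first console found in a name, whereas the loop concatenates all matches.

```python
import pandas as pd

# Illustrative category names.
toy_cats = pd.DataFrame({'item_category_name': [
    'Игры - PC (Цифра)',
    'Игровые консоли - PS4',
    'Книги - Комиксы, манга',
]})
names = toy_cats['item_category_name']

# Supercategory: everything before the last " -" separator.
toy_cats['item_supercategory_name'] = names.str.extract(r'([\S\s]+)\s\-', expand=False)

# Digital flag: literal search, so the parentheses are not read as a regex group.
toy_cats['item_category_is_digital'] = names.str.contains('(Цифра)', regex=False)

# Console: first console name found in the category name, 'None' otherwise.
consoles = ['PS2', 'PS3', 'PS4', 'PSP', 'PSVita', 'XBOX 360', 'XBOX ONE', 'PC', 'MAC', 'Android']
pattern = '(' + '|'.join(consoles) + ')'
toy_cats['item_category_console'] = names.str.extract(pattern, expand=False).fillna('None')
print(toy_cats)
```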

In [None]:
# categories in training and test set
train_categories=sorted(train_df['item_id'].map(items['item_category_id']).unique())
print(str(len(train_categories)) + ' categories in the training set :')
print(train_categories)
print()

test_categories=sorted(test_df['item_id'].map(items['item_category_id']).unique())
print(str(len(test_categories)) + ' categories in the test set :')
print(test_categories)
print()

# display supercategories
print('number of supercategories : '+str(item_categories['item_supercategory_name'].nunique()))
print()
for supercat in item_categories['item_supercategory_name'].unique():
    print(supercat + ' : ' +str(item_categories.loc[item_categories['item_supercategory_name']==supercat,'item_category_id'].nunique()))
    print(item_categories.loc[item_categories['item_supercategory_name']==supercat,['item_category_id','item_category_name']])
    print()
    
# display categories and their features (explicit display, as this is not the last line of the cell)
display(item_categories)

del train_categories, test_categories

### Analysis of item categories

In [None]:
train_df['item_category_id']=train_df['item_id'].map(items['item_category_id'])

In [None]:
def cat_analyse(cat_id):
    tmp=train_df.loc[(train_df['item_category_id']==cat_id)]

    print('category '+str(cat_id)+' : ' + str(item_categories.loc[cat_id,'item_category_name']))
    print('total number of sales : ' + str(len(tmp)))
    print('dates : ' + tmp.date.min().strftime("%d.%m.%Y") + ' --- ' + tmp.date.max().strftime("%d.%m.%Y"))
    print('total number of items in the category: ' + str((items['item_category_id']==cat_id).sum()))
    print('some of the items : ')
    print(items.loc[items['item_category_id']==cat_id,'item_name'].head(10))
    print()

    sns.relplot(data=tmp,x='month_id',y='item_quantity',kind='line',estimator=sum,ci=None,marker='o')

In [None]:
def cat_compare(cat_id_a,cat_id_b):
    tmp=train_df.loc[train_df['item_category_id'].isin([cat_id_a,cat_id_b])]

    # print the same summary for each of the two categories
    for cat_id in (cat_id_a,cat_id_b):
        tmp_cat=tmp.loc[tmp['item_category_id']==cat_id]
        print('category '+str(cat_id)+' : ' + str(item_categories.loc[cat_id,'item_category_name']))
        print('total number of sales : ' + str(len(tmp_cat)))
        print('dates : ' + tmp_cat.date.min().strftime("%d.%m.%Y") + ' --- ' + tmp_cat.date.max().strftime("%d.%m.%Y"))
        print('total number of items in the category: ' + str((items['item_category_id']==cat_id).sum()))
        print('list of items : ')
        print(items.loc[items['item_category_id']==cat_id,'item_name'])
        print()

    # monthly total quantities, one panel per category
    sns.relplot(data=tmp,x='month_id',y='item_quantity',col='item_category_id',kind='line',estimator=sum,ci=None,marker='o')

In [None]:
# Categories 8 and 80 are only for tickets for the GameWorld conference in October 2014 and 2015
# A significant part of the items between the two categories are actually duplicates!
cat_compare(8,80)

In [None]:
# Categories 81 and 82 are specific to "blank media" (CD-R,CD-RW,DVD-R,DVD-RW)
cat_compare(81,82)

In [None]:
# Category 79 is specific to "Acceptance of funds for 1C-online"
cat_analyse(79)

In [None]:
# Category 9 is specific to "Delivery"
# NB: this category is exclusively sold in shop 12 !
cat_analyse(9)

In [None]:
# Category 0 is specific to "Headphones"
cat_analyse(0)

In [None]:
# Category 83 is specific to "Batteries"
cat_analyse(83)

### Drop irrelevant categories

In [None]:
# A significant part of the items between the two categories 8 and 80 seem to actually be duplicates (even though the shops don't match)
# Categories 8 and 80 are not in the test set so we might as well just ignore them

# Categories 81 and 82 are specific to "blank media" (CD-R,CD-RW,DVD-R,DVD-RW)
# Categories 81 and 82 are not in the test set so we might as well just ignore them

print('category  8 : ' + str(item_categories.loc[8,'item_category_name']))
print('category 80 : ' + str(item_categories.loc[80,'item_category_name']))
print('category 81 : ' + str(item_categories.loc[81,'item_category_name']))
print('category 82 : ' + str(item_categories.loc[82,'item_category_name']))

dropped_categories=[8,80,81,82]

# drop the sales, then the categories, then the items (the sales are filtered first because the mapping relies on items)
train_df.drop(train_df.loc[train_df['item_id'].map(items['item_category_id']).isin(dropped_categories)].index,axis=0,inplace=True)

item_categories.drop(dropped_categories,axis=0,inplace=True)

items.drop(items.loc[items['item_category_id'].isin(dropped_categories)].index,axis=0,inplace=True)

In [None]:
# label encoding of item_category additional features
item_categories['item_supercategory_id']=item_categories['item_supercategory_name'].map({'Игровые консоли':0,'Игры':1,'Аксессуары':2,'Доставка товара':3,'Прием денежных средств для 1С-Онлайн':4,'Карты оплаты':5,'Кино':6,'Книги':7,'Музыка':8,'Подарки':9,'Программы':10,'Misc':11})
item_categories['item_category_console_id']=item_categories['item_category_console'].map({console:i for i,console in enumerate(consoles+['None'])})

In [None]:
# reorder columns
original_cols=['item_category_name','item_supercategory_name','item_category_console','item_category_is_digital']
label_cols=['item_category_id','item_supercategory_id','item_category_console_id']
item_categories=item_categories[original_cols+label_cols]

In [None]:
# join columns to training set
for col in original_cols:
    train_df[col]=train_df['item_category_id'].map(item_categories[col])
    
del original_cols
del label_cols

### Visualization

In [None]:
item_categories

In [None]:
# categories in training and test set
train_categories=sorted(train_df['item_id'].map(items['item_category_id']).unique())
print(str(len(train_categories)) + ' categories in the training set :')
print(train_categories)
print()

test_categories=sorted(test_df['item_id'].map(items['item_category_id']).unique())
print(str(len(test_categories)) + ' categories in the test set :')
print(test_categories)
print()

# display supercategories
print('number of supercategories : '+str(item_categories['item_supercategory_name'].nunique()))
print()
for supercat in item_categories['item_supercategory_name'].unique():
    print(supercat + ' : ' +str(item_categories.loc[item_categories['item_supercategory_name']==supercat,'item_category_id'].nunique()))
    print(item_categories.loc[item_categories['item_supercategory_name']==supercat,['item_category_id','item_category_name']])
    print()
    
# display consoles
print('number of consoles : '+str(item_categories['item_category_console'].nunique()))
print()
for console in item_categories['item_category_console'].unique():
    print(console + ' : ' +str(item_categories.loc[item_categories['item_category_console']==console,'item_category_id'].nunique()))
    print(item_categories.loc[item_categories['item_category_console']==console,['item_category_id','item_category_name']])
    print()
    
# display categories and their features (explicit display, as this is not the last line of the cell)
display(item_categories)

del train_categories, test_categories

In [None]:
# collect garbage
gc.collect()

## PREPROCESSING OF THE PRICES

### Remove outliers in price data

In [None]:
max_price=train_df['item_price'].max()
most_expensive_item=train_df.loc[train_df['item_price']==max_price,'item_id'].values[0]
print('index of most expensive item : '+str(most_expensive_item))
print('price of most expensive item : '+str(max_price))
print('number of times where the most expensive item appears in training set : '+str((train_df['item_id']==most_expensive_item).sum()))
print('most expensive item appears in test set : '+str(most_expensive_item in test_df['item_id'].values))

# We drop the outlier
train_df.drop(train_df['item_price'].idxmax(),axis=0,inplace=True)

del max_price, most_expensive_item

In [None]:
# Drop the realisation with a negative price (there is only one, probably a missing value)
print('number of realisations with negative prices : '+str((train_df['item_price']<0).sum()))
train_df.drop(train_df['item_price'].idxmin(),axis=0,inplace=True)
print('number of realisations with negative prices : '+str((train_df['item_price']<0).sum()))

In [None]:
# collect garbage
gc.collect()

## Format dataframes

In [None]:
# reduce training set to original columns only
train_df=train_df.iloc[:,0:7]
train_df.head()

In [None]:
# downcast dtypes for all dataframes
downcast_dtypes(train_df)
train_df['month_id']=train_df['month_id'].astype(np.int8)
train_df['shop_id']=train_df['shop_id'].astype(np.int8)

downcast_dtypes(shops)
shops['shop_id']=shops['shop_id'].astype(np.int8)

downcast_dtypes(items)
items['item_category_id']=items['item_category_id'].astype(np.int8)

downcast_dtypes(item_categories)
item_categories['item_category_id']=item_categories['item_category_id'].astype(np.int8)

# collect garbage
gc.collect()
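
As an illustration of the downcasting step, here is a minimal sketch on made-up values. The helper `downcast_numeric` is hypothetical and is not the notebook's own `downcast_dtypes`; it only shows how `pd.to_numeric` can pick the smallest dtype that still holds the data.

```python
import pandas as pd

# toy frame standing in for the sales table (values are made up)
toy = pd.DataFrame({
    "shop_id": [0, 25, 59],
    "item_id": [30, 11000, 22169],
    "item_price": [199.0, 2599.0, 349.5],
})

def downcast_numeric(df):
    """Hypothetical helper: return a copy using the smallest safe numeric dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

small = downcast_numeric(toy)
print(small.dtypes)
```

With these values the shop identifiers fit in `int8` and the item identifiers in `int16`, which matches the explicit casts used in the cell above.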

In [None]:
items.head()

In [None]:
item_categories.head()

In [None]:
shops.head()

In [None]:
train_df.head()

## Export data

In [None]:
print('TRAINING SET')
print()
print(train_df.dtypes)
print()
print('-----------')
print('SHOPS')
print()
print(shops.dtypes)
print()
print('-----------')
print('ITEMS')
print()
print(items.dtypes)
print()
print('-----------')
print('ITEM_CATEGORIES')
print()
print(item_categories.dtypes)

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'cleaned'))

# export data
train_df.to_pickle(os.path.join(DATA_FOLDER, 'cleaned/train.pkl'))
shops.to_pickle(os.path.join(DATA_FOLDER,'cleaned/shops.pkl'))
items.to_pickle(os.path.join(DATA_FOLDER,'cleaned/items.pkl'))
item_categories.to_pickle(os.path.join(DATA_FOLDER,'cleaned/item_categories.pkl'))

In [None]:
# clear memory
del train_df
del shops
del items
del item_categories
del test_df

gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 1 - MONTHLY AGGREGATION

- Aggregation of data at the monthly level
- Feature engineering at the monthly level

## Import data

In [None]:
train_df=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/train.pkl'))
shops=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/shops.pkl'))
items=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/items.pkl'))
item_categories=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/item_categories.pkl'))

In [None]:
# drop day identifier
train_df.drop('day_id',axis=1,inplace=True)

## EXTEND DATAFRAME TO THE PRODUCT SHOPS x ITEMS FOR EVERY MONTH

In [None]:
# aggregate sales data by (month,shop,item)
col_agg = ['month_id','shop_id','item_id']
train_agg=train_df.groupby(col_agg).agg(item_quantity=pd.NamedAgg(column='item_quantity',aggfunc='sum'))

train_agg.reset_index(inplace=True)
train_agg['month_id'] = train_agg['month_id'].astype(np.int8)
train_agg['shop_id'] = train_agg['shop_id'].astype(np.int8)
train_agg['item_id'] = train_agg['item_id'].astype(np.int16)

# true target values are clipped between 0 and 20, so do the same to training set
train_agg['item_quantity']=train_agg['item_quantity'].clip(0,20)

train_agg
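
The aggregation and clipping above can be checked on a small example. This is a sketch on made-up daily sales, not the competition data; it sums quantities per (month, shop, item) and then clips the monthly totals to [0, 20].

```python
import pandas as pd

# made-up daily sales: item 1 sells 25 units in month 0, item 2 has a net return
daily = pd.DataFrame({
    "month_id": [0, 0, 0, 0, 1],
    "shop_id": [5, 5, 5, 5, 5],
    "item_id": [1, 1, 2, 2, 1],
    "item_quantity": [15, 10, 1, -2, 3],
})

# monthly total per (month, shop, item)
monthly = daily.groupby(["month_id", "shop_id", "item_id"], as_index=False)["item_quantity"].sum()

# clip after aggregating, so that the monthly target lies in [0, 20]
monthly["item_quantity"] = monthly["item_quantity"].clip(0, 20)
print(monthly)
```

Note that the clipping is applied to the monthly totals, not to the daily quantities: the total of 25 becomes 20 and the net return of -1 becomes 0.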

In [None]:
# import and format test set 
test_df=pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'test.csv'))
test_df['ID']=test_df['ID'].astype(np.int32)
test_df['shop_id']=test_df['shop_id'].astype(np.int8)
test_df['item_id']=test_df['item_id'].astype(np.int16)
test_df['month_id']=34
test_df=test_df[col_agg]

In [None]:
###############################################################################################################
# For each month, we keep all pairs (shop,item) such that both the shop and the item have sales in that month #
###############################################################################################################

ts = time.time()

# build full multiindex
train_X = []
for i in range(0,34):
    sales = train_df.loc[train_df['month_id']==i]
    train_X.append(np.array(list(product([i], sales['shop_id'].unique(), sales['item_id'].unique())), dtype='int16'))
    
# build dataframe from multiindex, downcast dtypes, and sort array
train_X = pd.DataFrame(np.vstack(train_X), columns=col_agg)
train_X = pd.concat([train_X,test_df],ignore_index=True)
train_X['month_id'] = train_X['month_id'].astype(np.int8)
train_X['shop_id'] = train_X['shop_id'].astype(np.int8)
train_X['item_id'] = train_X['item_id'].astype(np.int16)
train_X.sort_values(by=col_agg,inplace=True)

print('time : ' +str(time.time() - ts))
print()
train_X.info(show_counts=True)  # 'null_counts' was renamed 'show_counts' in recent pandas versions

del sales
gc.collect()
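
To make the construction of the full grid concrete, here is a minimal sketch on three made-up sales rows. It follows the same idea as the cell above (one cartesian product per month), with `groupby` used instead of an explicit month range.

```python
from itertools import product

import numpy as np
import pandas as pd

# made-up sales: two shops and two items active in month 0, one of each in month 1
sales = pd.DataFrame({
    "month_id": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 10],
})

blocks = []
for month, month_sales in sales.groupby("month_id"):
    shops = month_sales["shop_id"].unique()
    items = month_sales["item_id"].unique()
    # all (month, shop, item) combinations of the shops and items seen this month
    blocks.append(np.array(list(product([month], shops, items)), dtype="int16"))

grid = pd.DataFrame(np.vstack(blocks), columns=["month_id", "shop_id", "item_id"])
print(grid)
```

Month 0 yields 2 x 2 = 4 pairs, including (shop 2, item 10), which never appears in the sales. Such pairs later receive a quantity of 0.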

In [None]:
# ADD AGGREGATED SALES BY ( month, shop, item ) TO FULL DATAFRAME
ts = time.time()

# join aggregated data
train_X=train_X.join(train_agg.set_index(col_agg),on=col_agg)

# fill missing values for the item_quantity
train_X['item_quantity']=train_X['item_quantity'].fillna(0)

# NB: the item_quantities for month 34 are all filled with 0 but in reality this quantity is unknown

print('time : ' +str(time.time() - ts))

In [None]:
del col_agg
gc.collect()

## FEATURE ENGINEERING

In [None]:
# month of the year
train_X['month']=train_X['month_id']%12+1

In [None]:
# number of months since the shop has been opened
train_X['shop_months_since_opening']=train_X['month_id']-train_X['shop_id'].map(train_X[['month_id','shop_id']].groupby('shop_id').min()['month_id'])

# whether the shop is opening this month or not
train_X['shop_opening']=(train_X['shop_months_since_opening']==0)

In [None]:
# month where the item has been released (items released at certain times of the year are consistently less popular than those released at other times)
train_X['item_month_id_of_release']=train_X['item_id'].map(train_X[['month_id','item_id']].groupby('item_id').min()['month_id'])
train_X['item_month_of_release']=train_X['item_month_id_of_release']%12+1

# number of months since the item has been released
train_X['item_months_since_release']=train_X['month_id']-train_X['item_month_id_of_release']
train_X['item_months_since_release']=train_X['item_months_since_release'].clip(0,12)  # group together items older than a year

# whether the item is new in the catalogue or not
train_X['item_new']=(train_X['item_months_since_release']==0)

In [None]:
# month where the item has first been sold in shop
train_X=train_X.join(train_agg[['shop_id','month_id','item_id']].groupby(['shop_id','item_id']).min().rename({'month_id':'item_month_id_of_first_sale_in_shop'},axis=1),on=['shop_id','item_id'])
train_X['item_month_of_first_sale_in_shop']=train_X['item_month_id_of_first_sale_in_shop']%12+1

# number of months since the item was first sold in this shop
train_X['item_months_since_first_sale_in_shop']=(train_X['month_id']-train_X['item_month_id_of_first_sale_in_shop'])
train_X['item_months_since_first_sale_in_shop']=train_X['item_months_since_first_sale_in_shop'].clip(0,12)  # group together items sold for more than a year in the shop

# whether the item has never been sold in this shop before this month
# (True if the first sale happens this month, later, or never, since NaN>0 evaluates to False)
train_X['item_never_sold_in_shop_before']=~(train_X['item_months_since_first_sale_in_shop']>0)

# set the first-sale features to -1 for all items never sold in the shop before (removes information from the future)
train_X.loc[train_X['item_never_sold_in_shop_before'],'item_months_since_first_sale_in_shop']=-1
train_X.loc[train_X['item_never_sold_in_shop_before'],'item_month_id_of_first_sale_in_shop']=-1
train_X.loc[train_X['item_never_sold_in_shop_before'],'item_month_of_first_sale_in_shop']=-1

# downcast dtype
train_X['item_months_since_first_sale_in_shop']=train_X['item_months_since_first_sale_in_shop'].astype(np.int8)
train_X['item_month_id_of_first_sale_in_shop']=train_X['item_month_id_of_first_sale_in_shop'].astype(np.int8)
train_X['item_month_of_first_sale_in_shop']=train_X['item_month_of_first_sale_in_shop'].astype(np.int8)

In [None]:
# indicator for items never sold in this shop, but already sold in other shops in the past
train_X['item_never_sold_in_shop_before_but_not_new']=(~train_X['item_new'])&train_X['item_never_sold_in_shop_before']

In [None]:
# label encode the 3 categories: 0-'new item', 1-'item never sold in shop but not new (ie sold in some other shops in the past)', 2-'item sold in shop in the past'
# <1  --> item never sold anywhere in the past  |   >=1 --> item sold in at least one shop in the past
# <2   --> item never sold in this shop before  |   >=2 --> item sold in this shop in the past

train_X['item_seniority']=(2-train_X['item_new'].astype(int)-train_X['item_never_sold_in_shop_before'].astype(int)).astype(np.int8)
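
The seniority encoding can be verified on the three possible combinations of the two flags. The sketch below uses a made-up three-row frame. A new item has by definition never been sold in any shop, so both flags are True for it.

```python
import pandas as pd

# one row per possible situation
toy = pd.DataFrame({
    "item_new": [True, False, False],
    "item_never_sold_in_shop_before": [True, True, False],
})

# each True flag removes one level of seniority
toy["item_seniority"] = (
    2
    - toy["item_new"].astype(int)
    - toy["item_never_sold_in_shop_before"].astype(int)
)
print(toy)
```

The three rows map to seniority 0 (new item), 1 (known item, new to this shop) and 2 (item already sold in this shop).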

In [None]:
###
# defined only for item_seniority==2

# compute month of most recent sale of item in shop
tmp_list=[]
for mid in train_X['month_id'].unique():
    tmp=train_agg.loc[train_agg['month_id']<mid,['month_id','shop_id','item_id']].groupby(['shop_id','item_id']).last().rename({'month_id':'item_month_id_of_last_sale_in_shop'},axis=1).astype(np.int16)
    tmp['month_id']=mid
    tmp.reset_index(inplace=True)
    tmp_list.append(tmp)
    
tmp=pd.concat(tmp_list)
del tmp_list
train_X=train_X.join(tmp.set_index(['month_id','shop_id','item_id']),on=['month_id','shop_id','item_id'])
del tmp

# downcast dtype (int not possible due to NaN values)
train_X['item_month_id_of_last_sale_in_shop']=train_X['item_month_id_of_last_sale_in_shop'].astype(np.float16)

# time since last sale in shop
train_X['item_months_since_last_sale_in_shop']=train_X['month_id']-train_X['item_month_id_of_last_sale_in_shop']
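
The loop above recomputes a groupby for every month. As a possible alternative (not the method used in this notebook), `pd.merge_asof` can look up, for each row of the grid, the most recent sale strictly before the current month. The sketch below uses made-up data and the assumed column name `last_sale`. Both frames must be sorted by `month_id`, and the key columns must have the same dtype in both frames.

```python
import pandas as pd

# made-up sales of one (shop, item) pair in months 0, 2 and 3
sales = pd.DataFrame({
    "month_id": [0, 2, 3],
    "shop_id": [1, 1, 1],
    "item_id": [7, 7, 7],
})

# grid of rows for which the feature is wanted
grid = pd.DataFrame({
    "month_id": [0, 1, 2, 3, 4],
    "shop_id": [1, 1, 1, 1, 1],
    "item_id": [7, 7, 7, 7, 7],
})

# keep a copy of the sale month, since the 'on' key of the right frame is not returned
right = sales.assign(last_sale=sales["month_id"]).sort_values("month_id")

out = pd.merge_asof(
    grid.sort_values("month_id"),
    right,
    on="month_id",
    by=["shop_id", "item_id"],
    allow_exact_matches=False,  # only sales strictly before the current month
)
out["months_since_last_sale"] = out["month_id"] - out["last_sale"]
print(out)
```

Rows with no earlier sale keep NaN, as in the loop-based version.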

In [None]:
###
# defined only for item_seniority>=1

# compute month of most recent sale of item over all shops
tmp_list=[]
for mid in train_X['month_id'].unique():
    tmp=train_agg.loc[train_agg['month_id']<mid,['month_id','item_id']].groupby('item_id').last().rename({'month_id':'item_month_id_of_last_sale'},axis=1)
    tmp['month_id']=mid
    tmp.reset_index(inplace=True)
    tmp_list.append(tmp)
    
tmp=pd.concat(tmp_list)
del tmp_list
train_X=train_X.join(tmp.set_index(['month_id','item_id']),on=['month_id','item_id'])
del tmp

# downcast dtype (int not possible due to NaN values)
train_X['item_month_id_of_last_sale']=train_X['item_month_id_of_last_sale'].astype(np.float16)

# time since last sale over all shops
train_X['item_months_since_last_sale']=train_X['month_id']-train_X['item_month_id_of_last_sale']

In [None]:
# binary indicator for whether a pair has resulted in a sale or not (for feature generation later on)
train_X['item_sold_in_shop']=(train_X['item_quantity']>0)

In [None]:
# ADD PERMANENT FEATURES TO FULL DATAFRAME
ts=time.time()

train_X=train_X.join(shops.set_index('shop_id'),on='shop_id')
train_X=train_X.join(items.set_index('item_id'),on='item_id')
train_X=train_X.join(item_categories.set_index('item_category_id'),on='item_category_id')

print('time : ' +str(time.time() - ts))

gc.collect()

In [None]:
# Frequency encodings (by month)
ts=time.time()

# NB: items and shops are evenly distributed each month in the global dataset

# monthly fraction of each category in the global catalogue
train_X=train_X.join(train_X.loc[:,['month_id','item_category_id']].groupby('month_id')['item_category_id'].value_counts(normalize=True).rename('item_category_freq').astype(np.float32),on=['month_id','item_category_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_supercategory_id']].groupby('month_id')['item_supercategory_id'].value_counts(normalize=True).rename('item_supercategory_freq').astype(np.float32),on=['month_id','item_supercategory_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_category_console_id']].groupby('month_id')['item_category_console_id'].value_counts(normalize=True).rename('item_category_console_freq').astype(np.float32),on=['month_id','item_category_console_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_category_is_digital']].groupby('month_id')['item_category_is_digital'].value_counts(normalize=True).rename('item_category_digital_freq').astype(np.float32),on=['month_id','item_category_is_digital'])

print('time : ' +str(time.time() - ts))

gc.collect()
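
The monthly frequency encoding can be illustrated on a toy frame. The sketch below uses made-up data and an equivalent formulation based on `transform('size')` rather than `value_counts(normalize=True)`: the frequency of a category in a month is the number of rows of that category divided by the number of rows in that month.

```python
import pandas as pd

# made-up grid: four rows in month 0, two rows in month 1
toy = pd.DataFrame({
    "month_id": [0, 0, 0, 0, 1, 1],
    "item_category_id": [1, 1, 2, 3, 1, 2],
})

rows_per_month_and_category = toy.groupby(["month_id", "item_category_id"])["item_category_id"].transform("size")
rows_per_month = toy.groupby("month_id")["item_category_id"].transform("size")

toy["item_category_freq"] = rows_per_month_and_category / rows_per_month
print(toy)
```

Within each month the frequencies of the distinct categories sum to 1.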

In [None]:
# Frequency encodings (by month / seniority)
ts=time.time()

# monthly fraction of items and shops in each seniority level
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','item_id']].groupby(['month_id','item_seniority'])['item_id'].value_counts(normalize=True).rename('item_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','item_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','shop_id']].groupby(['month_id','item_seniority'])['shop_id'].value_counts(normalize=True).rename('shop_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','shop_id'])

# monthly fraction of each category in each seniority level
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','item_category_id']].groupby(['month_id','item_seniority'])['item_category_id'].value_counts(normalize=True).rename('item_category_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','item_category_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','item_supercategory_id']].groupby(['month_id','item_seniority'])['item_supercategory_id'].value_counts(normalize=True).rename('item_supercategory_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','item_supercategory_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','item_category_console_id']].groupby(['month_id','item_seniority'])['item_category_console_id'].value_counts(normalize=True).rename('item_category_console_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','item_category_console_id'])
train_X=train_X.join(train_X.loc[:,['month_id','item_seniority','item_category_is_digital']].groupby(['month_id','item_seniority'])['item_category_is_digital'].value_counts(normalize=True).rename('item_category_digital_freq_in_seniority').astype(np.float32),on=['month_id','item_seniority','item_category_is_digital'])

print('time : ' +str(time.time() - ts))

gc.collect()

In [None]:
train_X.info(show_counts=True)
train_X.head()

## PROCESS PRICE DATA

In [None]:
# expand sales to multiple rows to make statistics on the prices
train_df_expand=train_df.set_index(['month_id','shop_id','item_id'])['item_price']
train_df_expand=pd.DataFrame(train_df_expand.repeat(train_df['item_quantity']))
train_df_expand.reset_index(inplace=True)

# add values for the month 34 (useful for lagging values later on)
train_df_expand=pd.concat([train_df_expand,test_df],sort=False)

train_df_expand['item_category_id']=train_df_expand['item_id'].map(items['item_category_id'])
train_df_expand['item_supercategory_id']=train_df_expand['item_category_id'].map(item_categories['item_supercategory_id'])
train_df_expand['item_category_console_id']=train_df_expand['item_category_id'].map(item_categories['item_category_console_id'])
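
Repeating each price once per unit sold means that the statistics computed afterwards are weighted by quantity. A minimal sketch on two made-up sales lines:

```python
import pandas as pd

# made-up sales: 3 units sold at 100 and 1 unit sold at 200
sales = pd.DataFrame({
    "item_price": [100.0, 200.0],
    "item_quantity": [3, 1],
})

# one row per unit sold
unit_prices = sales["item_price"].repeat(sales["item_quantity"])

weighted_mean = unit_prices.mean()
weighted_median = unit_prices.median()
plain_mean = sales["item_price"].mean()
print(weighted_mean, weighted_median, plain_mean)
```

The per-unit mean (125) is lower than the plain mean over sales lines (150) because most units were sold at the lower price. `repeat` raises an error on negative counts, so this approach assumes that rows with negative quantities have been handled beforehand.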

In [None]:
def encode_prices(groupby_labels,name):
    df_prices=train_df_expand.groupby(groupby_labels).agg({'item_price':['min','max','median','mean']})
    df_prices.columns=[name+'_price_'+col for col in df_prices.columns.get_level_values(1)]

    df_prices.reset_index(inplace=True)

    # downcast dtypes
    df_prices=downcast_dtypes(df_prices)
    
    return df_prices

def compare_to_super(df,df_super,join_labels,label,label_super,name,name_super,df_mapping):
    col_list=['price_min','price_max','price_median','price_mean']
    df[label_super]=df[label].map(df_mapping[label_super])
    df=df.join(df_super[join_labels+[label_super]+[name_super+'_'+col for col in col_list]].set_index(join_labels+[label_super]),on=join_labels+[label_super])
    df.drop(label_super,axis=1,inplace=True)
    for col in col_list:
        df[name_super+'_'+col]=df[name+'_'+col]/df[name_super+'_'+col]
        df.rename({name_super+'_'+col:name+'_'+col+'_compared_to_'+name_super+'_'+col},axis=1,inplace=True)

    # downcast dtypes
    df=downcast_dtypes(df)
    
    return df
    
def compare_to_lower(df,df_lower,label_lower,name,name_lower):
    col_list=['price_min','price_max','price_median','price_mean']
    df=df.join(df_lower[['month_id',label_lower]+[name_lower+'_'+col for col in col_list]].set_index(['month_id',label_lower]),on=['month_id',label_lower])
    for col in col_list:
        df[name_lower+'_'+col]=df[name+'_'+col]/df[name_lower+'_'+col]
        df.rename({name_lower+'_'+col:name+'_'+col+'_compared_to_'+name_lower+'_'+col},axis=1,inplace=True)

    # downcast dtypes
    df=downcast_dtypes(df)
    
    return df
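
The two comparison helpers both join the statistics of another aggregation level and replace them by a ratio. The sketch below reproduces that idea for a single statistic, on made-up numbers. It is a simplified illustration, not a replacement for the functions above.

```python
import pandas as pd

# made-up monthly mean prices at the item level and at the category level
item_prices = pd.DataFrame({
    "month_id": [0, 0],
    "item_id": [1, 2],
    "item_category_id": [40, 40],
    "item_price_mean": [150.0, 50.0],
})
category_prices = pd.DataFrame({
    "month_id": [0],
    "item_category_id": [40],
    "category_price_mean": [100.0],
})

# attach the category statistic to each item, then take the ratio
merged = item_prices.join(
    category_prices.set_index(["month_id", "item_category_id"]),
    on=["month_id", "item_category_id"],
)
merged["item_price_mean_compared_to_category_price_mean"] = (
    merged["item_price_mean"] / merged["category_price_mean"]
)
print(merged)
```

A ratio above 1 marks an item that is more expensive than the average of its category in that month.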

In [None]:
# price data aggregated by (month,item_supercategory) (over all shops and items)
supercategory_prices=encode_prices(['month_id','item_supercategory_id'],'supercategory')

supercategory_prices['month_id'] = supercategory_prices['month_id'].astype(np.int8)

supercategory_prices.loc[supercategory_prices['month_id']<34,:].info(show_counts=True)
supercategory_prices

In [None]:
# price data aggregated by (month,item_category) (over all shops and items)
category_prices=encode_prices(['month_id','item_category_id'],'category')
category_prices=compare_to_super(category_prices,supercategory_prices,['month_id'],'item_category_id','item_supercategory_id','category','supercategory',item_categories)

category_prices['month_id'] = category_prices['month_id'].astype(np.int8)

category_prices.loc[category_prices['month_id']<34,:].info(show_counts=True)
category_prices

In [None]:
# price data aggregated by (month,item) (over all shops)
item_prices=encode_prices(['month_id','item_id'],'item')
item_prices=compare_to_super(item_prices,category_prices,['month_id'],'item_id','item_category_id','item','category',items)

item_prices['month_id'] = item_prices['month_id'].astype(np.int8)

item_prices.loc[item_prices['month_id']<34,:].info(show_counts=True)
item_prices

In [None]:
# price data aggregated by (month,shop,item_supercategory) (over all items)
shop_supercategory_prices=encode_prices(['month_id','shop_id','item_supercategory_id'],'shop_supercategory')
shop_supercategory_prices=compare_to_lower(shop_supercategory_prices,supercategory_prices,'item_supercategory_id','shop_supercategory','supercategory')

shop_supercategory_prices['month_id'] = shop_supercategory_prices['month_id'].astype(np.int8)
shop_supercategory_prices['shop_id'] = shop_supercategory_prices['shop_id'].astype(np.int8)

shop_supercategory_prices.loc[shop_supercategory_prices['month_id']<34,:].info(show_counts=True)
shop_supercategory_prices

In [None]:
# price data aggregated by (month,shop,item_category) (over all items)
shop_category_prices=encode_prices(['month_id','shop_id','item_category_id'],'shop_category')
shop_category_prices=compare_to_lower(shop_category_prices,category_prices,'item_category_id','shop_category','category')
shop_category_prices=compare_to_super(shop_category_prices,shop_supercategory_prices,['month_id','shop_id'],'item_category_id','item_supercategory_id','shop_category','shop_supercategory',item_categories)

shop_category_prices['month_id'] = shop_category_prices['month_id'].astype(np.int8)
shop_category_prices['shop_id'] = shop_category_prices['shop_id'].astype(np.int8)

shop_category_prices.loc[shop_category_prices['month_id']<34,:].info(show_counts=True)
shop_category_prices

In [None]:
# price data aggregated by (month,shop,item)
shop_item_prices=encode_prices(['month_id','shop_id','item_id'],'shop_item')
shop_item_prices=compare_to_lower(shop_item_prices,item_prices,'item_id','shop_item','item')
shop_item_prices=compare_to_super(shop_item_prices,shop_category_prices,['month_id','shop_id'],'item_id','item_category_id','shop_item','shop_category',items)


shop_item_prices['month_id'] = shop_item_prices['month_id'].astype(np.int8)
shop_item_prices['shop_id'] = shop_item_prices['shop_id'].astype(np.int8)

shop_item_prices.loc[shop_item_prices['month_id']<34,:].info(show_counts=True)
shop_item_prices

## Export data

In [None]:
# create directories
create_directory(os.path.join(DATA_FOLDER, 'processed'))
create_directory(os.path.join(DATA_FOLDER, 'processed/price_features'))

# export data
train_X.to_pickle(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

shop_item_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/shop_item_prices.pkl'))
item_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/item_prices.pkl'))
shop_category_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/shop_category_prices.pkl'))
category_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/category_prices.pkl'))
shop_supercategory_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/shop_supercategory_prices.pkl'))
supercategory_prices.to_pickle(os.path.join(DATA_FOLDER,'processed/price_features/supercategory_prices.pkl'))

In [None]:
# clear memory
del items
del item_categories
del shops
del train_df
del train_df_expand
del train_agg
del train_X

del shop_item_prices
del item_prices
del shop_category_prices
del category_prices
del shop_supercategory_prices
del supercategory_prices


gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 2 - EXPLORATORY DATA ANALYSIS

## Import data

In [None]:
train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

# restrict to training set
train_X=train_X[train_X['month_id']<34]

## ANALYSE TARGET DISTRIBUTIONS IN THE TRAINING SET

### Total monthly sales

In [None]:
# Analyse total monthly sales
tmp_sales=train_X.loc[:,['month_id','shop_id','item_id','item_quantity']]
tmp_sales['item_quantity']=tmp_sales['item_quantity'].astype(np.float32)

tmp=pd.DataFrame()
tmp['total_sales_per_month']=tmp_sales[['month_id','item_quantity']].groupby('month_id').sum()['item_quantity']
tmp['shop_count']=tmp_sales[['month_id','shop_id']].groupby('month_id').nunique()['shop_id']
tmp['item_count']=tmp_sales[['month_id','item_id']].groupby('month_id').nunique()['item_id']
tmp['avg_sales_per_shop']=tmp['total_sales_per_month']/tmp['shop_count']
tmp['avg_sales_per_item']=tmp['total_sales_per_month']/tmp['item_count']
tmp['pair_count']=tmp['shop_count']*tmp['item_count']
tmp['avg_sales_per_pair']=tmp['total_sales_per_month']/tmp['pair_count']

fig,axes=plt.subplots(4,2,figsize=(15,20))

axes[0,1].plot(tmp['total_sales_per_month'],'-o')
axes[0,1].set_xlim(0,34)
axes[0,1].set_ylim(0,170000)
axes[0,1].grid(True)
axes[0,1].set_xlabel('month_id')
axes[0,1].set_ylabel('total_sales')

axes[1,0].plot(tmp['shop_count'],'-o')
axes[1,0].set_xlim(0,34)
axes[1,0].set_ylim(0,60)
axes[1,0].grid(True)
axes[1,0].set_xlabel('month_id')
axes[1,0].set_ylabel('shop_count')

axes[1,1].plot(tmp['avg_sales_per_shop'],'-o')
axes[1,1].set_xlim(0,34)
axes[1,1].set_ylim(0,3700)
axes[1,1].grid(True)
axes[1,1].set_xlabel('month_id')
axes[1,1].set_ylabel('avg_sales_per_shop')

axes[2,0].plot(tmp['item_count'],'-o')
axes[2,0].set_xlim(0,34)
axes[2,0].set_ylim(0,9000)
axes[2,0].grid(True)
axes[2,0].set_xlabel('month_id')
axes[2,0].set_ylabel('item_count')

axes[2,1].plot(tmp['avg_sales_per_item'],'-o')
axes[2,1].set_xlim(0,34)
axes[2,1].set_ylim(0,25)
axes[2,1].grid(True)
axes[2,1].set_xlabel('month_id')
axes[2,1].set_ylabel('avg_sales_per_item')

axes[3,0].plot(tmp['pair_count'],'-o')
axes[3,0].set_xlim(0,34)
axes[3,0].set_ylim(0,400000)
axes[3,0].grid(True)
axes[3,0].set_xlabel('month_id')
axes[3,0].set_ylabel('pair_count')

axes[3,1].plot(tmp['avg_sales_per_pair'],'-o')
axes[3,1].set_xlim(0,34)
axes[3,1].set_ylim(0,0.5)
axes[3,1].grid(True)
axes[3,1].set_xlabel('month_id')
axes[3,1].set_ylabel('avg_sales_per_pair')

del tmp_sales,tmp,fig,axes

The overall decreasing trend of total sales over time is mostly due to the decreasing number of items in the catalogue. The number of open shops remains roughly constant, as do the average sales per item, except for the December peaks (months 11 and 23).

The December peaks should therefore correspond to a different distribution of the target among the values 0,1,...,20: in December, a larger share of realisations should fall in the classes with higher values than in other months.

Apart from December, the average sales per item in the catalogue remain roughly the same over all months.

### Distribution of target value

In [None]:
# Compare distributions of the target value 'item_quantity' in dataset between all months
df_count=train_X[['month_id','item_quantity']].groupby('month_id')['item_quantity'].value_counts()
df_count=df_count.unstack()

df_percentage=train_X[['month_id','item_quantity']].groupby('month_id')['item_quantity'].value_counts(normalize=True)
df_percentage=df_percentage.unstack()*100

a=df_count.values
b=df_percentage.values

print('percentages')
print(df_percentage.iloc[:,0:5].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,5:10].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,10:15].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,15:].describe(percentiles=[]).drop('50%',axis=0))
print()
print('counts')
print(df_count.iloc[:,0:4].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,4:8].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,8:12].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,12:16].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,16:].describe(percentiles=[]).drop('50%',axis=0))

fig,axes=plt.subplots(1,2,figsize=(15,5))
axes[0].plot(b.T,'-o')
axes[0].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0].set_xlabel('target value : item_quantity')
axes[0].set_ylabel('percentage of total realisations')
axes[0].grid(True)
axes[0].set_ylim(0,100)

axes[1].plot(a.T,'-o')
axes[1].plot(a.mean(axis=0),'-ok',linewidth=3)
axes[1].set_xlabel('target value : item_quantity')
axes[1].set_ylabel('number of realisations (log scale)')
axes[1].grid(True)
axes[1].set_yscale('log')

del axes,fig

# The distribution of the target value is similar every month
# Grouping all values above 20 into the class item_quantity = 20 results in about 100-900 realisations in that class
# The number of realisations with value 19 (least populated class after clipping) then varies around 10-100

In [None]:
# compare the distribution to its mean over all months
# threshold the comparison at 1 to focus on under/overshooting values (set all values below threshold to 'threshold')
b_over=b/b.mean(axis=0)
b_under=b/b.mean(axis=0)
thres=1       # here threshold=1
b_over[b_over<thres]=thres
b_under[b_under>thres]=thres

fig,axes=plt.subplots(1,2,figsize=(15,5))
sns.heatmap(b_under,ax=axes[0])
axes[0].set_title('undershooting values')
axes[0].set_ylabel('month_id')
axes[0].set_xlabel('target_value')
sns.heatmap(b_over,ax=axes[1])
axes[1].set_title('overshooting values')
axes[1].set_ylabel('month_id')
axes[1].set_xlabel('target_value')

del fig

# the distribution of the target variable is clearly shifted towards the larger values in December
# in December, the larger values may be up to 2.7 times more frequent than on average
# the frequency of the low values varies very little from month to month

# November appears like a fairly average month!
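
The thresholded comparison can be followed on a small example. The sketch below uses a made-up 2 x 3 table of percentages (rows are months, columns are target values) and `np.maximum` / `np.minimum`, which give the same result as the masked assignments above when the threshold is 1.

```python
import numpy as np

# made-up percentages: rows = months, columns = target values
pct = np.array([
    [80.0, 15.0, 5.0],
    [70.0, 20.0, 10.0],
])

# ratio of each month's distribution to the mean distribution over months
ratio = pct / pct.mean(axis=0)

over = np.maximum(ratio, 1.0)   # keeps only values above the average
under = np.minimum(ratio, 1.0)  # keeps only values below the average
print(over)
print(under)
```

In the second month the two larger target values are over-represented (ratios of about 1.14 and 1.33), which is the kind of pattern the heatmaps show for December.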

In [None]:
del a,b,b_over,b_under,df_count,df_percentage
gc.collect()

## ANALYSIS BY SENIORITY LEVEL

In [None]:
# repartition of seniority levels in the dataset each month
tmp=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

tmp=tmp[['month_id','item_seniority']].groupby('month_id')['item_seniority'].value_counts(normalize=True).unstack()

print(tmp)

del tmp

In [None]:
# Assess variations of sales between seniority levels
tmp=train_X.loc[:,['month_id','item_seniority','item_sold_in_shop']]

tmp['all']=1

tmp['seniority_0']=(tmp['item_seniority']==0)
tmp['seniority_1']=(tmp['item_seniority']==1)
tmp['seniority_2']=(tmp['item_seniority']==2)

tmp['sold_0']=tmp['item_sold_in_shop']&tmp['seniority_0']
tmp['sold_1']=tmp['item_sold_in_shop']&tmp['seniority_1']
tmp['sold_2']=tmp['item_sold_in_shop']&tmp['seniority_2']



tmp_count=tmp[['month_id','seniority_0','seniority_1','seniority_2','sold_0','sold_1','sold_2','item_sold_in_shop','all']].groupby('month_id').sum()

fig,axes=plt.subplots(2,2,figsize=(15,10))
axes[0,0].plot(tmp_count['seniority_0']/tmp_count['all'],'-o')
axes[0,0].plot(tmp_count['seniority_1']/tmp_count['all'],'-o')
axes[0,0].plot(tmp_count['seniority_2']/tmp_count['all'],'-o')
axes[0,0].grid(True)
axes[0,0].set_ylim(0,1)
axes[0,0].set_title('proportion of items of each seniority in the dataset')

axes[0,1].plot(tmp_count['item_sold_in_shop']/tmp_count['all'],'-o')
axes[0,1].grid(True)
axes[0,1].set_ylim(0,1)
axes[0,1].set_title('fraction of items sold overall')

axes[1,0].plot(tmp_count['sold_0']/tmp_count['item_sold_in_shop'],'-o')
axes[1,0].plot(tmp_count['sold_1']/tmp_count['item_sold_in_shop'],'-o')
axes[1,0].plot(tmp_count['sold_2']/tmp_count['item_sold_in_shop'],'-o')
axes[1,0].grid(True)
axes[1,0].set_ylim(0,1)
axes[1,0].set_title('contribution of seniority to items sold in shop')

axes[1,1].plot(tmp_count['sold_0']/tmp_count['seniority_0'],'-o')
axes[1,1].plot(tmp_count['sold_1']/tmp_count['seniority_1'],'-o')
axes[1,1].plot(tmp_count['sold_2']/tmp_count['seniority_2'],'-o')
axes[1,1].grid(True)
axes[1,1].set_ylim(0,1)
axes[1,1].set_title('fraction of items sold per seniority')
axes[1,1].legend(['seniority 0','seniority 1','seniority 2'])

del fig,axes,tmp,tmp_count

# among all items, the pairs (shop,item) are distributed approximately:
    # ~ 5%  in seniority 0
    # ~ 40% in seniority 1
    # ~ 55% in seniority 2
    
# among the items sold:
    # ~5% in seniority 0
    # ~12% in seniority 1
    # ~83% in seniority 2
    
# every month, the proportion of pairs (shop,item) that are sold is approximately
    # ~15% overall
    # ~15-20% in seniority 0
    # ~5% in seniority 1
    # ~20-25% in seniority 2

Three situations may occur, and it is probably best to build a separate model for each of them:

1) the item has previously been sold in this shop (seniority = 2)
- This category covers about 50-60% of the dataset
- We are able to use information about past sales for this pair (shop, item) to predict future sales

2) the item has never been sold in this shop, but it was already in the global catalogue before  (seniority = 1)
- This category covers about 40-50% of the dataset each month
- From the above analysis, such a pair (shop,item) has about a 95% chance of not resulting in any sale this month either, so predicting 0 sales should be correct for about 95% of these pairs and wrong for about 5% of them
- We have no past data for this pair (shop,item), but we have past data for this item in other shops where it was sold.

3) the item is new in the global catalogue  (seniority = 0)
- This category covers roughly between 2.5 and 10% of all realisations in the dataset
- There is no past data for this item, so we must estimate the sales of this item based on the sales of similar items when they were released.

NB: All items are treated as new in the first months because we have no data from earlier months. As a result, the distribution of items among the seniority levels is misestimated for roughly the first 12 months.
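
The idea of one model per seniority level can be sketched with a trivial stand-in. In the example below, each "model" is simply the mean target of its class, computed on made-up data. It only illustrates the split and the per-class prediction, not the regressors fitted later in the notebook.

```python
import pandas as pd

# made-up samples: seniority level and target value
toy = pd.DataFrame({
    "item_seniority": [0, 0, 1, 1, 1, 1, 2, 2],
    "item_quantity": [0, 3, 0, 0, 0, 1, 2, 4],
})

# fit one baseline per seniority level (here: the class mean)
baselines = {}
for level, part in toy.groupby("item_seniority"):
    baselines[int(level)] = part["item_quantity"].mean()

# predict each sample with the baseline of its own class
toy["prediction"] = toy["item_seniority"].map(baselines)
print(baselines)
```

Even this crude baseline separates the classes: the mean target of seniority 1 is much lower than that of seniority 2, in line with the sold fractions observed above.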

In [None]:
# drop first 12 months
train_X.drop(train_X.loc[train_X['month_id']<12].index,axis=0,inplace=True)

### Distribution of target value for each seniority level

In [None]:
# Compare distributions of the target value 'item_quantity' in dataset between all months
# RESTRICT TO NEW ITEMS (seniority = 0)

idxbool=train_X['item_new']
target_label='item_quantity'

df_count=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts()
df_count=df_count.unstack()

df_percentage=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts(normalize=True)
df_percentage=df_percentage.unstack()*100

a=df_count.values
b=df_percentage.values

print('percentages')
print(df_percentage.iloc[:,0:5].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,5:10].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,10:15].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,15:].describe(percentiles=[]).drop('50%',axis=0))
print()
print('counts')
print(df_count.iloc[:,0:4].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,4:8].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,8:12].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,12:16].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,16:].describe(percentiles=[]).drop('50%',axis=0))

fig,axes=plt.subplots(2,2,figsize=(15,10))
axes[0,0].plot(b.T,'-o')
axes[0,0].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,0].set_xlabel('target value : item_quantity')
axes[0,0].set_ylabel('percentage of total realisations')
axes[0,0].grid(True)
axes[0,0].set_ylim(0,100)

axes[0,1].plot(b.T,'-o')
axes[0,1].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,1].set_xlabel('target value : item_quantity')
axes[0,1].set_ylabel('percentage of total realisations (log scale)')
axes[0,1].grid(True)
axes[0,1].set_yscale('log')

axes[1,0].plot(b,'-o')
axes[1,0].set_xlabel('month_id')
axes[1,0].set_ylabel('percentage of total realisations')
axes[1,0].grid(True)
axes[1,0].set_ylim(0,100)

axes[1,1].plot(b,'-o')
axes[1,1].set_xlabel('month_id')
axes[1,1].set_ylabel('percentage of total realisations (log scale)')
axes[1,1].grid(True)
axes[1,1].set_yscale('log')

del idxbool,target_label,axes,fig,a,b,df_count,df_percentage

# The distribution includes a significant contribution from quite large values
# The probability of an item being sold in quantity 13 is still 0.1%

In [None]:
# Compare distributions of the target value 'item_quantity' in dataset between all months
# RESTRICT TO ITEM_NEVER_SOLD_IN_SHOP_BEFORE_BUT_NOT_NEW (seniority = 1)

idxbool=train_X['item_never_sold_in_shop_before_but_not_new']
target_label='item_quantity'

df_count=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts()
df_count=df_count.unstack()

df_percentage=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts(normalize=True)
df_percentage=df_percentage.unstack()*100

a=df_count.values
b=df_percentage.values

print('percentages')
print(df_percentage.iloc[:,0:5].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,5:10].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,10:15].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,15:].describe(percentiles=[]).drop('50%',axis=0))
print()
print('counts')
print(df_count.iloc[:,0:4].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,4:8].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,8:12].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,12:16].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,16:].describe(percentiles=[]).drop('50%',axis=0))

fig,axes=plt.subplots(2,2,figsize=(15,10))
axes[0,0].plot(b.T,'-o')
axes[0,0].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,0].set_xlabel('target value : item_quantity')
axes[0,0].set_ylabel('percentage of total realisations')
axes[0,0].grid(True)
axes[0,0].set_ylim(0,100)

axes[0,1].plot(b.T,'-o')
axes[0,1].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,1].set_xlabel('target value : item_quantity')
axes[0,1].set_ylabel('percentage of total realisations (log scale)')
axes[0,1].grid(True)
axes[0,1].set_yscale('log')

axes[1,0].plot(b,'-o')
axes[1,0].set_xlabel('month_id')
axes[1,0].set_ylabel('percentage of total realisations')
axes[1,0].grid(True)
axes[1,0].set_ylim(0,100)

axes[1,1].plot(b,'-o')
axes[1,1].set_xlabel('month_id')
axes[1,1].set_ylabel('percentage of total realisations (log scale)')
axes[1,1].grid(True)
axes[1,1].set_yscale('log')

del idxbool,target_label,axes,fig,a,b,df_count,df_percentage

# 95% of items are NOT sold, and the number of realisations decreases rapidly as the quantity sold increases
# The probability of an item being sold in quantities greater than 3 is less than 0.1%!

In [None]:
# Compare distributions of the target value 'item_quantity' in dataset between all months
# RESTRICT TO ITEM_SOLD_IN_SHOP_BEFORE (seniority = 2)

idxbool=~train_X['item_never_sold_in_shop_before']
target_label='item_quantity'

df_count=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts()
df_count=df_count.unstack()

df_percentage=train_X.loc[idxbool,['month_id',target_label]].groupby('month_id')[target_label].value_counts(normalize=True)
df_percentage=df_percentage.unstack()*100

a=df_count.values
b=df_percentage.values

print('percentages')
print(df_percentage.iloc[:,0:5].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,5:10].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,10:15].describe(percentiles=[]).drop('50%',axis=0))
print(df_percentage.iloc[:,15:].describe(percentiles=[]).drop('50%',axis=0))
print()
print('counts')
print(df_count.iloc[:,0:4].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,4:8].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,8:12].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,12:16].describe(percentiles=[]).drop('50%',axis=0))
print(df_count.iloc[:,16:].describe(percentiles=[]).drop('50%',axis=0))

fig,axes=plt.subplots(2,2,figsize=(15,10))
axes[0,0].plot(b.T,'-o')
axes[0,0].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,0].set_xlabel('target value : item_quantity')
axes[0,0].set_ylabel('percentage of total realisations')
axes[0,0].grid(True)
axes[0,0].set_ylim(0,100)

axes[0,1].plot(b.T,'-o')
axes[0,1].plot(b.mean(axis=0),'-ok',linewidth=3)
axes[0,1].set_xlabel('target value : item_quantity')
axes[0,1].set_ylabel('percentage of total realisations (log scale)')
axes[0,1].grid(True)
axes[0,1].set_yscale('log')

axes[1,0].plot(b,'-o')
axes[1,0].set_xlabel('month_id')
axes[1,0].set_ylabel('percentage of total realisations')
axes[1,0].grid(True)
axes[1,0].set_ylim(0,100)

axes[1,1].plot(b,'-o')
axes[1,1].set_xlabel('month_id')
axes[1,1].set_ylabel('percentage of total realisations (log scale)')
axes[1,1].grid(True)
axes[1,1].set_yscale('log')

del idxbool,target_label,axes,fig,a,b,df_count,df_percentage

# The distribution of sales is fairly even from month to month
# The probability of an item being sold in a given quantity decreases quite smoothly and rapidly as the quantity increases
# The probability that an item is sold in quantity 9 is 0.1%

## ANALYSIS OF TEMPORAL TRENDS

In [None]:
# Analyse the mean of target value over all months
tmp_avg=train_X[['month_id','item_quantity']].groupby('month_id').mean()
tmp_avg.rename({'item_quantity':'overall'},axis=1,inplace=True)

idxbool=(train_X['item_quantity']>0)
tmp_avg['over_sold']=train_X.loc[idxbool,['month_id','item_quantity']].groupby('month_id').mean()

idxbool=(train_X['item_new'])
tmp_avg['over_new']=train_X.loc[idxbool,['month_id','item_quantity']].groupby('month_id').mean()

for i in range(1,13):
    idxbool=(train_X['item_months_since_release']==i)
    tmp_avg['over_released_since_'+str(i)+'_months']=train_X.loc[idxbool,['month_id','item_quantity']].groupby('month_id').mean()

idxbool=(train_X['item_never_sold_in_shop_before'])
tmp_avg['over_never_sold_in_shop_before']=train_X.loc[idxbool,['month_id','item_quantity']].groupby('month_id').mean()
 
for i in range(1,13):
    idxbool=(train_X['item_months_since_first_sale_in_shop']==i)
    tmp_avg['over_first_sold_in_shop_since_'+str(i)+'_months']=train_X.loc[idxbool,['month_id','item_quantity']].groupby('month_id').mean()
    
    
    
# display    
plt.figure(figsize=(15,10))
sns.heatmap(tmp_avg.T)
plt.title('average sales')

del tmp_avg

There seems to be a peak of average sales every December.

We also notice that the quantities sold of items released in October, November and December appear to remain consistently higher than average over the following months. Conversely, items released in February are sold in smaller quantities all year round.

--> There is an influence of the month in which the item is first released!

In [None]:
fig,axes=plt.subplots(3,2,figsize=(15,15))
sns.heatmap(train_X[['item_month_of_release','item_months_since_release','item_quantity']].groupby(['item_month_of_release','item_months_since_release']).mean()['item_quantity'].unstack(level=0),ax=axes[0,0])
sns.heatmap(train_X.loc[~train_X['item_never_sold_in_shop_before'],['item_month_of_first_sale_in_shop','item_months_since_first_sale_in_shop','item_quantity']].groupby(['item_month_of_first_sale_in_shop','item_months_since_first_sale_in_shop']).mean()['item_quantity'].unstack(level=0),ax=axes[0,1])
sns.heatmap(train_X[['month','item_months_since_release','item_quantity']].groupby(['month','item_months_since_release']).mean()['item_quantity'].unstack(level=0),ax=axes[1,0])
sns.heatmap(train_X.loc[~train_X['item_never_sold_in_shop_before'],['month','item_months_since_first_sale_in_shop','item_quantity']].groupby(['month','item_months_since_first_sale_in_shop']).mean()['item_quantity'].unstack(level=0),ax=axes[1,1])
sns.heatmap(train_X[['month','item_month_of_release','item_quantity']].groupby(['month','item_month_of_release']).mean()['item_quantity'].unstack(level=0),ax=axes[2,0])
sns.heatmap(train_X.loc[~train_X['item_never_sold_in_shop_before'],['month','item_month_of_first_sale_in_shop','item_quantity']].groupby(['month','item_month_of_first_sale_in_shop']).mean()['item_quantity'].unstack(level=0),ax=axes[2,1])

We also notice a strong influence of the time elapsed since the item was released: on average, newer items are sold in larger quantities than older ones.

--> There is an influence of the number of months since the item was first released (or first sold in shop)
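
The two release-related features discussed above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the helper `add_release_features` and the frame `toy` are hypothetical, the release month is assumed to be the first `month_id` at which the item appears in the data, and `month_id % 12` is assumed to give the calendar month (with `month_id` 0 being a January and months numbered from 0). The notebook's own features may use a different convention.

```python
import pandas as pd

def add_release_features(df):
    """Hypothetical helper: derive release-related temporal features from month_id."""
    df = df.copy()
    # assumption: an item is 'released' the first month it appears in the data
    release = df.groupby('item_id')['month_id'].transform('min')
    df['item_months_since_release'] = df['month_id'] - release
    # assumption: month_id 0 is a January, so the remainder is the calendar month (0-11)
    df['item_month_of_release'] = release % 12
    return df

toy = pd.DataFrame({
    'month_id':      [10, 11, 12, 11, 12, 13],
    'item_id':       [1, 1, 1, 2, 2, 2],
    'item_quantity': [8, 5, 2, 9, 6, 3],
})
toy = add_release_features(toy)

# average quantity as a function of the time elapsed since release
decay = toy.groupby('item_months_since_release')['item_quantity'].mean()
print(decay)
```

In this made-up example the average quantity drops with every month elapsed since release, which is the kind of pattern visible in the heatmaps above.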

## DISTRIBUTION OF SALES AMONG SHOPS THROUGH TIME

In [None]:
tmp=train_X.loc[:,['month_id','shop_id','item_quantity']]
tmp['sold']=(tmp['item_quantity']>0)
tmp['count']=(tmp['item_quantity']>=0)
tmp.drop('item_quantity',axis=1,inplace=True)
tmp_1=tmp.groupby(['month_id','shop_id'],as_index=False).sum()
# fraction computed over all items of the shop (not only new ones), one row per (month, shop)
tmp_1['fraction_of_all_items_sold_in_shop']=tmp_1['sold']/tmp_1['count']


tmp=tmp.groupby(['month_id','shop_id']).sum()
tmp['fraction_of_all_items_sold_in_shop']=tmp['sold']/tmp['count']
tmp=tmp['fraction_of_all_items_sold_in_shop'].unstack()



fig=plt.figure(figsize=(15,5))
sns.scatterplot(data=tmp_1,x='shop_id',y='fraction_of_all_items_sold_in_shop')
plt.plot(tmp.columns,tmp.mean(),'ok')
plt.grid(True)
plt.title('fraction of all items that are sold each month in each shop (different points for different months)')
plt.show()

del tmp, tmp_1, fig

# SOME SHOPS SELL A LARGER PORTION OF THE GLOBAL CATALOGUE THAN OTHERS

In [None]:
# clear memory
del train_X

gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 3 - TARGET ENCODING

## Import data

In [None]:
train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

items=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/items.pkl'))
item_categories=pd.read_pickle(os.path.join(DATA_FOLDER,'cleaned/item_categories.pkl'))

## Functions for target encoding
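
As a reminder of the idea, target encoding here means replacing a grouping key (shop, category, ...) by statistics of the target computed over that group, month by month. The sketch below shows this on made-up data, independently of `train_X`; the frame `toy` and its values are hypothetical. The three statistics mirror those built by the encoding functions defined in the next cells (average sales, fraction of items sold, average sales over sold items).

```python
import pandas as pd

toy = pd.DataFrame({
    'month_id':      [0, 0, 0, 0, 1, 1, 1, 1],
    'shop_id':       [1, 1, 2, 2, 1, 1, 2, 2],
    'item_quantity': [0, 4, 0, 0, 2, 2, 0, 6],
})
toy['item_sold_in_shop'] = toy['item_quantity'] > 0

keys = ['month_id', 'shop_id']

# statistics over all rows of each (month, shop) group
enc = toy.groupby(keys).agg(
    shop_avg_sales=('item_quantity', 'mean'),
    shop_fraction_of_items_sold=('item_sold_in_shop', 'mean'),
)
# statistics restricted to rows with at least one sale
enc['shop_avg_sales_over_sold'] = (
    toy[toy['item_sold_in_shop']].groupby(keys)['item_quantity'].mean()
)
enc = enc.reset_index()
print(enc)
```

Groups without any sale (shop 2 in month 0 here) get a missing value for the statistic restricted to sold items, so such gaps have to be handled before the feature is used.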

In [None]:
def encoder(agg_func,target_label,groupby_labels,idxbool=None,target_dtype=None):
    # aggregate target_label over groupby_labels, optionally restricted to the rows selected by idxbool
    if target_dtype is None:
        target_dtype=train_X[target_label].dtype
    rows=slice(None) if idxbool is None else idxbool
    # ddof=0 gives the population standard deviation (0 rather than NaN for single-row groups)
    kwargs={'ddof':0} if agg_func=='std' else {}
    grouped=train_X.loc[rows,groupby_labels+[target_label]].groupby(groupby_labels,as_index=True)
    return grouped.agg(agg_func,**kwargs)[target_label].astype(target_dtype)

In [None]:
def encode_comparison(df,df_super,join_labels,target_labels,comparison_labels,new_labels):
    df=df.join(df_super[join_labels+comparison_labels].set_index(join_labels),on=join_labels)
    for target_label,cp_label in zip(target_labels,comparison_labels):
        df[cp_label]=df[target_label]/df[cp_label]
    return df.rename({cp_label:new_label for (cp_label,new_label) in zip(comparison_labels,new_labels)},axis=1)
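
To make the role of `encode_comparison` concrete, here is a small usage sketch on made-up data. The frames `shop_enc` and `month_enc` and their values are hypothetical; the function body is copied from the cell above so that the example runs on its own.

```python
import pandas as pd

def encode_comparison(df,df_super,join_labels,target_labels,comparison_labels,new_labels):
    # attach the reference statistics, then replace them by the ratio target / reference
    df=df.join(df_super[join_labels+comparison_labels].set_index(join_labels),on=join_labels)
    for target_label,cp_label in zip(target_labels,comparison_labels):
        df[cp_label]=df[target_label]/df[cp_label]
    return df.rename({cp_label:new_label for (cp_label,new_label) in zip(comparison_labels,new_labels)},axis=1)

# per-(month, shop) statistic and per-month reference statistic (toy values)
shop_enc=pd.DataFrame({'month_id':[0,0,1,1],'shop_id':[1,2,1,2],'shop_avg_sales':[2.0,1.0,3.0,6.0]})
month_enc=pd.DataFrame({'month_id':[0,1],'month_avg_sales':[1.0,4.0]})

out=encode_comparison(shop_enc,month_enc,['month_id'],
                      ['shop_avg_sales'],['month_avg_sales'],
                      ['shop_avg_sales_compared_to_month'])
print(out)
```

Each shop-level value is expressed relative to the monthly reference (for instance 3.0 / 4.0 = 0.75 for shop 1 in month 1). Since the new feature is a plain ratio, it becomes infinite or missing whenever the reference value is 0.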

In [None]:
all_items=(train_X['item_quantity']>=0)

#----------------------------------------------------------------
def encoding_0(idxbool,groupby_labels,prefix,suffix):
    idxbool_sold=idxbool&train_X['item_sold_in_shop']
    
    df=pd.DataFrame(encoder('mean','item_quantity',groupby_labels,idxbool)).rename({'item_quantity':prefix+'_avg_sales'+suffix},axis=1)
    df[prefix+'_fraction_of_items_sold'+suffix]=encoder('mean','item_sold_in_shop',groupby_labels,idxbool,target_dtype=np.float32)
    df[prefix+'_avg_sales_over_sold'+suffix]=encoder('mean','item_quantity',groupby_labels,idxbool_sold)

    # reset index
    df.reset_index(inplace=True)

    return df



#----------------------------------------------------------------
def encoding_1(idxbool,groupby_labels,prefix,suffix):
    idxbool_sold=idxbool&train_X['item_sold_in_shop']
    
    df=pd.DataFrame(encoder('mean','item_quantity',groupby_labels,idxbool)).rename({'item_quantity':prefix+'_avg_sales'+suffix},axis=1)
    df[prefix+'_fraction_of_items_sold'+suffix]=encoder('mean','item_sold_in_shop',groupby_labels,idxbool,target_dtype=np.float32)
    df[prefix+'_avg_sales_over_sold'+suffix]=encoder('mean','item_quantity',groupby_labels,idxbool_sold)

    df[prefix+'_min_quantity_over_sold'+suffix]=encoder('min','item_quantity',groupby_labels,idxbool_sold)
    df[prefix+'_max_quantity'+suffix]=encoder('max','item_quantity',groupby_labels,idxbool)

    # reset index
    df.reset_index(inplace=True)

    return df



def add_comparison_super(df,idxbool,groupby_labels,prefix,suffix,df_super,label_super,prefix_super,df_mapping):

    # comparison of features with respect to superfeatures
    compared_features=['_avg_sales','_fraction_of_items_sold','_avg_sales_over_sold']
    
    df[label_super]=df[groupby_labels[1]].map(df_mapping[label_super])
    df=encode_comparison(df,df_super,['month_id',label_super],[prefix+feature+suffix for feature in compared_features],[prefix_super+feature+suffix for feature in compared_features],[prefix+feature+'_compared_to_'+prefix_super+suffix for feature in compared_features])
    df.drop(label_super,axis=1,inplace=True)
    
    return df



#----------------------------------------------------------------
def add_comparison_lower(df,idxbool,groupby_labels,prefix_1,prefix_2,suffix,df_compare_1,df_compare_2):
    prefix=prefix_1+'_'+prefix_2

    # comparison to lower order encoding
    compared_features=['_avg_sales','_fraction_of_items_sold','_avg_sales_over_sold']
    
    df=encode_comparison(df,df_compare_1,['month_id',groupby_labels[1]],[prefix+feature+suffix for feature in compared_features],[prefix_1+feature+suffix for feature in compared_features],[prefix+feature+'_compared_to_'+prefix_1+suffix for feature in compared_features])
    df=encode_comparison(df,df_compare_2,['month_id',groupby_labels[2]],[prefix+feature+suffix for feature in compared_features],[prefix_2+feature+suffix for feature in compared_features],[prefix+feature+'_compared_to_'+prefix_2+suffix for feature in compared_features])

    return df


def encoding_2(idxbool,groupby_labels,prefix_1,prefix_2,suffix,df_compare_1,df_compare_2):
    prefix=prefix_1+'_'+prefix_2
    df=encoding_1(idxbool,groupby_labels,prefix,suffix)
    return add_comparison_lower(df,idxbool,groupby_labels,prefix_1,prefix_2,suffix,df_compare_1,df_compare_2)


#----------------------------------------------------------------
def encoding_3(idxbool,groupby_labels,prefix_1,prefix_2,prefix_3,suffix):
    prefix=prefix_1+'_'+prefix_2+'_'+prefix_3
    df=encoding_1(idxbool,groupby_labels,prefix,suffix)

    return df

In [None]:
# functions for display purposes only

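def print_infos(df):
    # DataFrame.info prints directly and returns None; 'null_counts' was replaced by 'show_counts' (pandas >= 1.2)
    df[df['month_id']<34].info(show_counts=True)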

def display_1(df,display_label):
    groupby_labels=list(df.columns[0:2])
    
    fig,axes=plt.subplots(2,2,figsize=(15,18))
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,hue=groupby_labels[1],marker='o',ax=axes[0,0])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,marker='o',ax=axes[0,1])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,hue=groupby_labels[0],marker='o',ax=axes[1,0])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,marker='o',ax=axes[1,1])
    for i in range(0,2):
        for j in range(0,2):
            axes[i,j].grid(True)

def display_2(df,display_label):
    groupby_labels=list(df.columns[0:2])
    
    fig,axes=plt.subplots(2,2,figsize=(15,18))
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,hue=groupby_labels[1],marker='o',ax=axes[0,0])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,marker='o',ax=axes[0,1])
    sns.scatterplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,hue=groupby_labels[0],marker='o',ax=axes[1,0])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,marker='o',ax=axes[1,1])
    for i in range(0,2):
        for j in range(0,2):
            axes[i,j].grid(True)

def display_3(df,display_label):
    groupby_labels=list(df.columns[0:2])
    
    fig,axes=plt.subplots(2,2,figsize=(15,18))
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,hue=groupby_labels[1],marker='o',ax=axes[0,0])
    sns.lineplot(data=df[df['month_id']<34],x=groupby_labels[0],y=display_label,marker='o',ax=axes[0,1])
    sns.scatterplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,hue=groupby_labels[0],marker='o',ax=axes[1,0])
    sns.boxplot(data=df[df['month_id']<34],x=groupby_labels[1],y=display_label,ax=axes[1,1])

    axes[0,0].grid(True)
    axes[0,1].grid(True)
    axes[1,0].grid(True)

def display_4(df,display_label):
    groupby_labels=list(df.columns[1:3])
    
    fig,axes=plt.subplots(1,2,figsize=(16,6))
    tmp=df[groupby_labels+[display_label]].groupby(groupby_labels).mean()[display_label].unstack()
    sns.heatmap(tmp/tmp.max(axis=0),ax=axes[0])
    axes[0].set_title('normalized over rows')
    sns.heatmap(((tmp.T)/tmp.max(axis=1)).T,ax=axes[1])
    axes[1].set_title('normalized over columns')
    
def display_5(df,display_label):
    groupby_labels=list(df.columns[1:3])
    
    fig,axes=plt.subplots(1,2,figsize=(16,3))
    tmp=df[groupby_labels+[display_label]].groupby(groupby_labels).mean()[display_label].unstack()
    sns.heatmap(((tmp.T)/tmp.max(axis=1)),ax=axes[0])
    axes[0].set_title('normalized over columns')
    sns.heatmap((tmp/tmp.max(axis=0)).T,ax=axes[1])
    axes[1].set_title('normalized over rows')
    
def display_6(df,display_label):
    groupby_labels=list(df.columns[1:3])
    tmp=df.drop('month_id',axis=1).groupby(groupby_labels,as_index=False).mean()
    
    fig,axes=plt.subplots(2,2,figsize=(15,18))
    sns.lineplot(data=tmp,x=groupby_labels[1],y=display_label,marker='o',hue=groupby_labels[0],ax=axes[0,0])
    sns.lineplot(data=tmp,x=groupby_labels[1],y=display_label,marker='o',ax=axes[0,1])
    sns.scatterplot(data=tmp,x=groupby_labels[0],y=display_label,hue=groupby_labels[1],marker='o',ax=axes[1,0])
    sns.boxplot(data=tmp,x=groupby_labels[0],y=display_label,ax=axes[1,1])

    axes[0,0].grid(True)
    axes[0,1].grid(True)
    axes[1,0].grid(True)

## ABSOLUTE TIME

### 0-th order feature: ABSOLUTE TIME

In [None]:
# MONTHS

# build dataframes
month_encoding=encoding_0(all_items,['month_id'],'month','')
month_encoding_seniority_0=encoding_0((train_X['item_seniority']==0),['month_id'],'month','_seniority_0')
month_encoding_seniority_1=encoding_0((train_X['item_seniority']==1),['month_id'],'month','_seniority_1')
month_encoding_seniority_2=encoding_0((train_X['item_seniority']==2),['month_id'],'month','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(month_encoding)
print()
print('------------------------------------------------')
print()
print_infos(month_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(month_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(month_encoding_seniority_2)
print()

fig,axes=plt.subplots(1,3,figsize=(15,4))
axes[0].plot(month_encoding['month_avg_sales'].values[:-1],'-o')
axes[0].plot(month_encoding_seniority_0['month_avg_sales_seniority_0'].values[:-1],'-o')
axes[0].plot(month_encoding_seniority_1['month_avg_sales_seniority_1'].values[:-1],'-o')
axes[0].plot(month_encoding_seniority_2['month_avg_sales_seniority_2'].values[:-1],'-o')
axes[0].grid(True)
axes[0].set_ylim(0,1.5)
axes[0].legend(['overall','seniority 0','seniority 1','seniority 2'])
axes[0].set_title('average sales')
axes[0].set_xlabel('month_id')
axes[1].plot(month_encoding['month_fraction_of_items_sold'].values[:-1],'-o')
axes[1].plot(month_encoding_seniority_0['month_fraction_of_items_sold_seniority_0'].values[:-1],'-o')
axes[1].plot(month_encoding_seniority_1['month_fraction_of_items_sold_seniority_1'].values[:-1],'-o')
axes[1].plot(month_encoding_seniority_2['month_fraction_of_items_sold_seniority_2'].values[:-1],'-o')
axes[1].grid(True)
axes[1].set_ylim(0,0.3)
axes[1].set_title('fraction of items sold')
axes[1].set_xlabel('month_id')
axes[2].plot(month_encoding['month_avg_sales_over_sold'].values[:-1],'-o')
axes[2].plot(month_encoding_seniority_0['month_avg_sales_over_sold_seniority_0'].values[:-1],'-o')
axes[2].plot(month_encoding_seniority_1['month_avg_sales_over_sold_seniority_1'].values[:-1],'-o')
axes[2].plot(month_encoding_seniority_2['month_avg_sales_over_sold_seniority_2'].values[:-1],'-o')
axes[2].grid(True)
axes[2].set_ylim(0,5)
axes[2].set_title('average sales over sold')
axes[2].set_xlabel('month_id')

- seniority 0: there is no clear correlation of the month of the year with the sales of new items
- seniority 1: weak peak of sales in December due to a larger fraction of items sold, but no clear peak of the quantity of items sold when they are sold
- seniority 2: peak of sales in December due to a larger fraction of items sold and larger quantities of items sold when they are sold

- on average: seniority 1 < seniority 2 < seniority 0
- seniority 2 represents most of the sales, so the quantities over sold items are almost identical in the global average and in the seniority 2 average. However, the quantities over all items are much more precisely predicted when seniority 2 is isolated from seniority 0 and 1 (they are underestimated otherwise due to the large contribution of seniority 1 to the global average)
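
The effect of seniority 1 on the global average can be checked with a short back-of-the-envelope computation. The figures below are the approximate values noted earlier in this notebook (share of each seniority level among all pairs, and fraction of pairs sold within each level); taking the midpoints of the quoted ranges is an assumption made for this sketch only.

```python
# approximate share of each seniority level among all pairs (shop, item)
share = {0: 0.05, 1: 0.40, 2: 0.55}
# approximate fraction of pairs sold within each level (midpoints of the quoted ranges)
fraction_sold = {0: 0.175, 1: 0.05, 2: 0.225}

# the global fraction sold is the share-weighted average of the per-level fractions
overall = sum(share[s] * fraction_sold[s] for s in share)

print(f'overall fraction sold     : {overall:.2f}')
print(f'seniority 2 fraction sold : {fraction_sold[2]:.2f}')
```

The weighted average comes out at about 0.15, in line with the overall figure observed above, and clearly below the seniority 2 value: a feature built on the global average would therefore underestimate pairs of seniority 2.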

## FIRST ORDER INTERACTION FEATURES

### 1st order features - RELATIVE TEMPORAL DYNAMICS

In [None]:
# MONTHS SINCE RELEASE-TIME

# build dataframes
months_since_release_encoding=encoding_1(all_items,['month_id','item_months_since_release'],'months_since_release','')
months_since_release_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_months_since_release'],'months_since_release','_seniority_0')
months_since_release_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_months_since_release'],'months_since_release','_seniority_1')
months_since_release_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_months_since_release'],'months_since_release','_seniority_2')



#----------------------------------------------------------------
# display
print_infos(months_since_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(months_since_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(months_since_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(months_since_release_encoding_seniority_2)
print()

display_1(months_since_release_encoding,'months_since_release_avg_sales')

In [None]:
display_1(months_since_release_encoding_seniority_1,'months_since_release_avg_sales_seniority_1')

In [None]:
display_1(months_since_release_encoding_seniority_2,'months_since_release_avg_sales_seniority_2')

In [None]:
# MONTH OF RELEASE-TIME

# build dataframes
month_of_release_encoding=encoding_1(all_items,['month_id','item_month_of_release'],'month_of_release','')
month_of_release_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_month_of_release'],'month_of_release','_seniority_0')
month_of_release_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_month_of_release'],'month_of_release','_seniority_1')
month_of_release_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_month_of_release'],'month_of_release','_seniority_2')



#----------------------------------------------------------------
# display
print_infos(month_of_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(month_of_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(month_of_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(month_of_release_encoding_seniority_2)
print()

display_1(month_of_release_encoding,'month_of_release_avg_sales')

In [None]:
display_1(month_of_release_encoding_seniority_1,'month_of_release_avg_sales_seniority_1')

In [None]:
display_1(month_of_release_encoding_seniority_2,'month_of_release_avg_sales_seniority_2')

In [None]:
# MONTHS SINCE FIRST SALE IN SHOP-TIME
# only seniority=2 has known values

# build dataframes
months_since_first_sale_in_shop_encoding=encoding_1(all_items,['month_id','item_months_since_first_sale_in_shop'],'months_since_first_sale_in_shop','')
months_since_first_sale_in_shop_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_months_since_first_sale_in_shop'],'months_since_first_sale_in_shop','_seniority_0')
months_since_first_sale_in_shop_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_months_since_first_sale_in_shop'],'months_since_first_sale_in_shop','_seniority_1')
months_since_first_sale_in_shop_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_months_since_first_sale_in_shop'],'months_since_first_sale_in_shop','_seniority_2')

# erase data from the future
months_since_first_sale_in_shop_encoding.loc[months_since_first_sale_in_shop_encoding['item_months_since_first_sale_in_shop']==-1,months_since_first_sale_in_shop_encoding.columns[2:]]=-1
months_since_first_sale_in_shop_encoding_seniority_0.loc[months_since_first_sale_in_shop_encoding_seniority_0['item_months_since_first_sale_in_shop']==-1,months_since_first_sale_in_shop_encoding_seniority_0.columns[2:]]=-1
months_since_first_sale_in_shop_encoding_seniority_1.loc[months_since_first_sale_in_shop_encoding_seniority_1['item_months_since_first_sale_in_shop']==-1,months_since_first_sale_in_shop_encoding_seniority_1.columns[2:]]=-1
months_since_first_sale_in_shop_encoding_seniority_2.loc[months_since_first_sale_in_shop_encoding_seniority_2['item_months_since_first_sale_in_shop']==-1,months_since_first_sale_in_shop_encoding_seniority_2.columns[2:]]=-1


#----------------------------------------------------------------
# display
print_infos(months_since_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(months_since_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(months_since_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(months_since_first_sale_in_shop_encoding_seniority_2)
print()

display_1(months_since_first_sale_in_shop_encoding_seniority_2,'months_since_first_sale_in_shop_avg_sales_seniority_2')

In [None]:
# MONTH OF FIRST SALE IN SHOP-TIME
# only seniority=2 has known values

# build dataframes
month_of_first_sale_in_shop_encoding=encoding_1(all_items,['month_id','item_month_of_first_sale_in_shop'],'month_of_first_sale_in_shop','')
month_of_first_sale_in_shop_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_month_of_first_sale_in_shop'],'month_of_first_sale_in_shop','_seniority_0')
month_of_first_sale_in_shop_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_month_of_first_sale_in_shop'],'month_of_first_sale_in_shop','_seniority_1')
month_of_first_sale_in_shop_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_month_of_first_sale_in_shop'],'month_of_first_sale_in_shop','_seniority_2')

# erase data from the future
month_of_first_sale_in_shop_encoding.loc[month_of_first_sale_in_shop_encoding['item_month_of_first_sale_in_shop']==-1,month_of_first_sale_in_shop_encoding.columns[2:]]=-1
month_of_first_sale_in_shop_encoding_seniority_0.loc[month_of_first_sale_in_shop_encoding_seniority_0['item_month_of_first_sale_in_shop']==-1,month_of_first_sale_in_shop_encoding_seniority_0.columns[2:]]=-1
month_of_first_sale_in_shop_encoding_seniority_1.loc[month_of_first_sale_in_shop_encoding_seniority_1['item_month_of_first_sale_in_shop']==-1,month_of_first_sale_in_shop_encoding_seniority_1.columns[2:]]=-1
month_of_first_sale_in_shop_encoding_seniority_2.loc[month_of_first_sale_in_shop_encoding_seniority_2['item_month_of_first_sale_in_shop']==-1,month_of_first_sale_in_shop_encoding_seniority_2.columns[2:]]=-1


#----------------------------------------------------------------
# display
print_infos(month_of_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(month_of_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(month_of_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(month_of_first_sale_in_shop_encoding_seniority_2)
print()

display_1(month_of_first_sale_in_shop_encoding_seniority_2,'month_of_first_sale_in_shop_avg_sales_seniority_2')

* seniority 0: 
    - there is no relative time for new items because 'month of release' = 'current month', so we can only analyse the correlation of sales with the current month of the year for new items.

* seniority 1: 
    - these items are highly correlated with the number of 'months since release', but very weakly with the absolute month of the year, and almost not at all with the 'month of release'

* seniority 2: 
    - sales of these items are highly correlated with the 'months since release' and the 'months since first sale in shop'. 
    - there is a slight correlation with the 'month of release' (more sales for items released in September and November). This is likely to be specific to some items and not others.

CONCLUSION:
   - pairs (shop,item) of seniority 0 are independent of all temporal features
   - pairs (shop,item) of seniority 1 depend strongly on 'months since release', with a weak peak in December and no clear influence of the month of release of the item
   - pairs (shop,item) of seniority 2 depend strongly on 'months since release', as well as on 'months since first sale in shop', in a slightly different fashion. There is also a peak of sales in December, and a peak for items released in September and November. The dependency on the release time is likely to vary from one type of item to another.
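
A minimal sketch of how such a dependency can be checked numerically: average the quantity sold per value of a temporal feature, separately for each seniority class. The column names follow the notebook, but the toy values are made up for illustration and `mean_sales_by` is a hypothetical helper that is not defined elsewhere in the notebook.

```python
import pandas as pd

# Toy data, made up for illustration; only the column names follow the notebook.
toy = pd.DataFrame({
    'item_seniority':            [1, 1, 1, 1, 2, 2, 2, 2],
    'item_months_since_release': [1, 1, 2, 2, 1, 1, 2, 2],
    'item_quantity':             [4.0, 2.0, 1.0, 1.0, 6.0, 4.0, 3.0, 1.0],
})

def mean_sales_by(df, seniority, feature):
    """Average quantity sold per value of `feature`, restricted to one seniority class.

    Hypothetical helper, written for this illustration only.
    """
    subset = df[df['item_seniority'] == seniority]
    return subset.groupby(feature)['item_quantity'].mean()

# One profile per seniority class: a profile that varies strongly with the
# feature suggests that the feature is informative for that class.
profile_1 = mean_sales_by(toy, 1, 'item_months_since_release')
profile_2 = mean_sales_by(toy, 2, 'item_months_since_release')
print(profile_1)
print(profile_2)
```

A flat profile for one class and a steep one for another is the kind of difference that motivates fitting the three seniority classes separately.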

### 1st order features - SPATIAL DISTRIBUTION
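
As an illustration of what a per-month target encoding of the shops may look like, here is a minimal sketch on toy data. It is not the implementation of `encoding_1` (defined earlier in the notebook). The function `shop_month_encoding` is hypothetical, and reading `shop_avg_sales_over_sold` as "the average over the rows with at least one sale" is an assumption.

```python
import pandas as pd

# Toy data, made up for illustration; only the column names follow the notebook.
toy = pd.DataFrame({
    'month_id':      [0, 0, 0, 0, 1, 1],
    'shop_id':       [5, 5, 7, 7, 5, 7],
    'item_id':       [10, 11, 10, 11, 10, 10],
    'item_quantity': [2.0, 0.0, 6.0, 2.0, 1.0, 3.0],
})

def shop_month_encoding(df):
    """Hypothetical sketch of a first-order encoding keyed on (month_id, shop_id)."""
    keys = ['month_id', 'shop_id']
    # average over all (shop, item) samples of the month
    avg = df.groupby(keys)['item_quantity'].mean().rename('shop_avg_sales')
    # average over the samples with at least one sale (assumed meaning of 'over_sold');
    # a shop without any sale in a month gets NaN after the join
    sold = df[df['item_quantity'] > 0]
    avg_sold = sold.groupby(keys)['item_quantity'].mean().rename('shop_avg_sales_over_sold')
    return avg.to_frame().join(avg_sold).reset_index()

shop_enc = shop_month_encoding(toy)
print(shop_enc)
```

Such an encoding is computed from the target of the same month, so it can only be used as a feature once it has been lagged to earlier months.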

In [None]:
# SHOP-TIME

# build dataframes
shop_encoding=encoding_1(all_items,['month_id','shop_id'],'shop','')
shop_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','shop_id'],'shop','_seniority_0')
shop_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','shop_id'],'shop','_seniority_1')
shop_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','shop_id'],'shop','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_encoding_seniority_2)
print()

display_2(shop_encoding,'shop_avg_sales')

### 1st order features - ITEM-RELATIVE DISTRIBUTION

In [None]:
# SUPERCATEGORY-TIME

# build dataframes
supercategory_encoding=encoding_1(all_items,['month_id','item_supercategory_id'],'supercategory','')
supercategory_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_supercategory_id'],'supercategory','_seniority_0')
supercategory_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_supercategory_id'],'supercategory','_seniority_1')
supercategory_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_supercategory_id'],'supercategory','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(supercategory_encoding)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_encoding_seniority_2)
print()

display_3(supercategory_encoding,'supercategory_avg_sales')

In [None]:
# CATEGORY-TIME

# build dataframes
category_encoding=encoding_1(all_items,['month_id','item_category_id'],'category','')
category_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_category_id'],'category','_seniority_0')
category_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_category_id'],'category','_seniority_1')
category_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_category_id'],'category','_seniority_2')

# compare to supercategory
category_encoding=add_comparison_super(category_encoding,all_items,['month_id','item_category_id'],'category','',supercategory_encoding,'item_supercategory_id','supercategory',item_categories)
category_encoding_seniority_0=add_comparison_super(category_encoding_seniority_0,(train_X['item_seniority']==0),['month_id','item_category_id'],'category','_seniority_0',supercategory_encoding_seniority_0,'item_supercategory_id','supercategory',item_categories)
category_encoding_seniority_1=add_comparison_super(category_encoding_seniority_1,(train_X['item_seniority']==1),['month_id','item_category_id'],'category','_seniority_1',supercategory_encoding_seniority_1,'item_supercategory_id','supercategory',item_categories)
category_encoding_seniority_2=add_comparison_super(category_encoding_seniority_2,(train_X['item_seniority']==2),['month_id','item_category_id'],'category','_seniority_2',supercategory_encoding_seniority_2,'item_supercategory_id','supercategory',item_categories)

#----------------------------------------------------------------
# display
print_infos(category_encoding)
print()
print('------------------------------------------------')
print()
print_infos(category_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(category_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(category_encoding_seniority_2)
print()

display_3(category_encoding,'category_avg_sales')

In [None]:
# ITEM-TIME
# NB: 
    # - 'fraction_of_items_sold' here actually stands for 'fraction of shops where this item is sold'
    # - the dataframe 'seniority_0' (new items) should be identical to the corresponding rows of the full dataframe for most features, but not for the weights and the comparison features
    # - on the other hand, the distinction between seniority 1 and 2 should yield different statistics, depending on whether the row corresponds to a shop where the item has previously been sold or not

# build dataframes
item_encoding=encoding_1(all_items,['month_id','item_id'],'item','')
item_encoding_seniority_0=encoding_1((train_X['item_seniority']==0),['month_id','item_id'],'item','_seniority_0')
item_encoding_seniority_1=encoding_1((train_X['item_seniority']==1),['month_id','item_id'],'item','_seniority_1')
item_encoding_seniority_2=encoding_1((train_X['item_seniority']==2),['month_id','item_id'],'item','_seniority_2')

# compare to category
item_encoding=add_comparison_super(item_encoding,all_items,['month_id','item_id'],'item','',category_encoding,'item_category_id','category',items)
item_encoding_seniority_0=add_comparison_super(item_encoding_seniority_0,(train_X['item_seniority']==0),['month_id','item_id'],'item','_seniority_0',category_encoding_seniority_0,'item_category_id','category',items)
item_encoding_seniority_1=add_comparison_super(item_encoding_seniority_1,(train_X['item_seniority']==1),['month_id','item_id'],'item','_seniority_1',category_encoding_seniority_1,'item_category_id','category',items)
item_encoding_seniority_2=add_comparison_super(item_encoding_seniority_2,(train_X['item_seniority']==2),['month_id','item_id'],'item','_seniority_2',category_encoding_seniority_2,'item_category_id','category',items)

#----------------------------------------------------------------
# display
print_infos(item_encoding)
print()
print('------------------------------------------------')
print()
print_infos(item_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(item_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(item_encoding_seniority_2)
print()

## SECOND ORDER INTERACTIONS FEATURES

### 2nd order features - Spatial distribution of items / Topical specialization of the shops
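
The idea of a topical specialization can be sketched by comparing the average sales of a pair (shop, category) with the average sales of the shop as a whole. The example below uses toy data and a plain ratio as the comparison. How `encoding_2` actually compares a second-order encoding with its first-order parents is defined earlier in the notebook, so the ratio and the column `shop_category_compared_to_shop` are assumptions made for illustration.

```python
import pandas as pd

# Toy data, made up for illustration; only the key column names follow the notebook.
toy = pd.DataFrame({
    'month_id':         [0, 0, 0, 0],
    'shop_id':          [5, 5, 7, 7],
    'item_category_id': [40, 55, 40, 55],
    'item_quantity':    [4.0, 2.0, 1.0, 1.0],
})

# first-order encoding: average sales per (month, shop)
shop_avg = (toy.groupby(['month_id', 'shop_id'])['item_quantity']
               .mean().rename('shop_avg_sales').reset_index())

# second-order encoding: average sales per (month, shop, category)
pair_avg = (toy.groupby(['month_id', 'shop_id', 'item_category_id'])['item_quantity']
               .mean().rename('shop_category_avg_sales').reset_index())

# comparison (assumed here to be a ratio): values above 1 mean that the shop
# sells more of this category than it sells on average;
# a shop average of 0 would yield inf or NaN and needs separate handling
merged = pair_avg.merge(shop_avg, on=['month_id', 'shop_id'], how='left')
merged['shop_category_compared_to_shop'] = (
    merged['shop_category_avg_sales'] / merged['shop_avg_sales']
)
print(merged)
```

In this toy example shop 5 leans towards category 40 (ratio above 1), whereas shop 7 sells both categories equally (ratio of 1).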

In [None]:
# SHOP-CATEGORY-TIME

# build dataframes
shop_category_encoding=encoding_2(all_items,['month_id','shop_id','item_category_id'],'shop','category','',shop_encoding,category_encoding)
shop_category_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','shop_id','item_category_id'],'shop','category','_seniority_0',shop_encoding_seniority_0,category_encoding_seniority_0)
shop_category_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','shop_id','item_category_id'],'shop','category','_seniority_1',shop_encoding_seniority_1,category_encoding_seniority_1)
shop_category_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','shop_id','item_category_id'],'shop','category','_seniority_2',shop_encoding_seniority_2,category_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(shop_category_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_encoding_seniority_2)
print()

display_4(shop_category_encoding,'shop_category_avg_sales')

In [None]:
# SHOP-SUPERCATEGORY-TIME

# build dataframes
shop_supercategory_encoding=encoding_2(all_items,['month_id','shop_id','item_supercategory_id'],'shop','supercategory','',shop_encoding,supercategory_encoding)
shop_supercategory_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','shop_id','item_supercategory_id'],'shop','supercategory','_seniority_0',shop_encoding_seniority_0,supercategory_encoding_seniority_0)
shop_supercategory_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','shop_id','item_supercategory_id'],'shop','supercategory','_seniority_1',shop_encoding_seniority_1,supercategory_encoding_seniority_1)
shop_supercategory_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','shop_id','item_supercategory_id'],'shop','supercategory','_seniority_2',shop_encoding_seniority_2,supercategory_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(shop_supercategory_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_encoding_seniority_2)
print()

display_5(shop_supercategory_encoding,'shop_supercategory_avg_sales')

In [None]:
# SHOP-ITEM-TIME

# MEAN ENCODING
# Each month, encode each pair (item-shop) according to its monthly-average sales (='item_quantity')
shop_item_encoding=train_X[['month_id','shop_id','item_id','item_quantity']]

# COMPARISON TO LOWER ORDER ENCODING
shop_item_encoding=encode_comparison(shop_item_encoding,shop_encoding,['month_id','shop_id'],['item_quantity','item_quantity'],['shop_avg_sales','shop_avg_sales_over_sold'],['item_quantity_compared_to_shop_avg_sales','item_quantity_compared_to_shop_avg_sales_over_sold'])
shop_item_encoding=encode_comparison(shop_item_encoding,item_encoding,['month_id','item_id'],['item_quantity','item_quantity'],['item_avg_sales','item_avg_sales_over_sold'],['item_quantity_compared_to_item_avg_sales','item_quantity_compared_to_item_avg_sales_over_sold'])

# COMPARISON TO SUPERFEATURE
shop_item_encoding['item_category_id']=shop_item_encoding['item_id'].map(items['item_category_id'])
shop_item_encoding=encode_comparison(shop_item_encoding,shop_category_encoding,['month_id','shop_id','item_category_id'],['item_quantity','item_quantity'],['shop_category_avg_sales','shop_category_avg_sales_over_sold'],['item_quantity_compared_to_shop_category_avg_sales','item_quantity_compared_to_shop_category_avg_sales_over_sold'])
shop_item_encoding.drop('item_category_id',axis=1,inplace=True)

print_infos(shop_item_encoding)

### 2nd order features - Temporal dynamics of sales for the items

In [None]:
# CATEGORY-MONTHS SINCE RELEASE-TIME

# build dataframes
category_months_since_release_encoding=encoding_2(all_items,['month_id','item_category_id','item_months_since_release'],'category','months_since_release','',category_encoding,months_since_release_encoding)
category_months_since_release_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_category_id','item_months_since_release'],'category','months_since_release','_seniority_0',category_encoding_seniority_0,months_since_release_encoding_seniority_0)
category_months_since_release_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_category_id','item_months_since_release'],'category','months_since_release','_seniority_1',category_encoding_seniority_1,months_since_release_encoding_seniority_1)
category_months_since_release_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_category_id','item_months_since_release'],'category','months_since_release','_seniority_2',category_encoding_seniority_2,months_since_release_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(category_months_since_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_release_encoding_seniority_2)
print()

display_6(category_months_since_release_encoding,'category_months_since_release_avg_sales')

In [None]:
# SUPERCATEGORY-MONTHS SINCE RELEASE-TIME

# build dataframes
supercategory_months_since_release_encoding=encoding_2(all_items,['month_id','item_supercategory_id','item_months_since_release'],'supercategory','months_since_release','',supercategory_encoding,months_since_release_encoding)
supercategory_months_since_release_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_supercategory_id','item_months_since_release'],'supercategory','months_since_release','_seniority_0',supercategory_encoding_seniority_0,months_since_release_encoding_seniority_0)
supercategory_months_since_release_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_supercategory_id','item_months_since_release'],'supercategory','months_since_release','_seniority_1',supercategory_encoding_seniority_1,months_since_release_encoding_seniority_1)
supercategory_months_since_release_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_supercategory_id','item_months_since_release'],'supercategory','months_since_release','_seniority_2',supercategory_encoding_seniority_2,months_since_release_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(supercategory_months_since_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_release_encoding_seniority_2)
print()

display_6(supercategory_months_since_release_encoding,'supercategory_months_since_release_avg_sales')

In [None]:
# CATEGORY-MONTH OF RELEASE-TIME

# build dataframes
category_month_of_release_encoding=encoding_2(all_items,['month_id','item_category_id','item_month_of_release'],'category','month_of_release','',category_encoding,month_of_release_encoding)
category_month_of_release_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_category_id','item_month_of_release'],'category','month_of_release','_seniority_0',category_encoding_seniority_0,month_of_release_encoding_seniority_0)
category_month_of_release_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_category_id','item_month_of_release'],'category','month_of_release','_seniority_1',category_encoding_seniority_1,month_of_release_encoding_seniority_1)
category_month_of_release_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_category_id','item_month_of_release'],'category','month_of_release','_seniority_2',category_encoding_seniority_2,month_of_release_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(category_month_of_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(category_month_of_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(category_month_of_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(category_month_of_release_encoding_seniority_2)
print()

display_6(category_month_of_release_encoding,'category_month_of_release_avg_sales')

In [None]:
# SUPERCATEGORY-MONTH OF RELEASE-TIME

# build dataframes
supercategory_month_of_release_encoding=encoding_2(all_items,['month_id','item_supercategory_id','item_month_of_release'],'supercategory','month_of_release','',supercategory_encoding,month_of_release_encoding)
supercategory_month_of_release_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_supercategory_id','item_month_of_release'],'supercategory','month_of_release','_seniority_0',supercategory_encoding_seniority_0,month_of_release_encoding_seniority_0)
supercategory_month_of_release_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_supercategory_id','item_month_of_release'],'supercategory','month_of_release','_seniority_1',supercategory_encoding_seniority_1,month_of_release_encoding_seniority_1)
supercategory_month_of_release_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_supercategory_id','item_month_of_release'],'supercategory','month_of_release','_seniority_2',supercategory_encoding_seniority_2,month_of_release_encoding_seniority_2)

#----------------------------------------------------------------
# display
print_infos(supercategory_month_of_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_month_of_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_month_of_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_month_of_release_encoding_seniority_2)
print()

display_6(supercategory_month_of_release_encoding,'supercategory_month_of_release_avg_sales')

In [None]:
# CATEGORY-MONTHS SINCE FIRST SALE IN SHOP-TIME

# build dataframes
category_months_since_first_sale_in_shop_encoding=encoding_2(all_items,['month_id','item_category_id','item_months_since_first_sale_in_shop'],'category','months_since_first_sale_in_shop','',category_encoding,months_since_first_sale_in_shop_encoding)
category_months_since_first_sale_in_shop_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_category_id','item_months_since_first_sale_in_shop'],'category','months_since_first_sale_in_shop','_seniority_0',category_encoding_seniority_0,months_since_first_sale_in_shop_encoding_seniority_0)
category_months_since_first_sale_in_shop_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_category_id','item_months_since_first_sale_in_shop'],'category','months_since_first_sale_in_shop','_seniority_1',category_encoding_seniority_1,months_since_first_sale_in_shop_encoding_seniority_1)
category_months_since_first_sale_in_shop_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_category_id','item_months_since_first_sale_in_shop'],'category','months_since_first_sale_in_shop','_seniority_2',category_encoding_seniority_2,months_since_first_sale_in_shop_encoding_seniority_2)

# erase data from the future
category_months_since_first_sale_in_shop_encoding.loc[category_months_since_first_sale_in_shop_encoding['item_months_since_first_sale_in_shop']==-1,category_months_since_first_sale_in_shop_encoding.columns[3:]]=-1
category_months_since_first_sale_in_shop_encoding_seniority_0.loc[category_months_since_first_sale_in_shop_encoding_seniority_0['item_months_since_first_sale_in_shop']==-1,category_months_since_first_sale_in_shop_encoding_seniority_0.columns[3:]]=-1
category_months_since_first_sale_in_shop_encoding_seniority_1.loc[category_months_since_first_sale_in_shop_encoding_seniority_1['item_months_since_first_sale_in_shop']==-1,category_months_since_first_sale_in_shop_encoding_seniority_1.columns[3:]]=-1
category_months_since_first_sale_in_shop_encoding_seniority_2.loc[category_months_since_first_sale_in_shop_encoding_seniority_2['item_months_since_first_sale_in_shop']==-1,category_months_since_first_sale_in_shop_encoding_seniority_2.columns[3:]]=-1


#----------------------------------------------------------------
# display
print_infos(category_months_since_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(category_months_since_first_sale_in_shop_encoding_seniority_2)
print()

display_6(category_months_since_first_sale_in_shop_encoding_seniority_2,'category_months_since_first_sale_in_shop_avg_sales_seniority_2')

In [None]:
# SUPERCATEGORY-MONTHS SINCE FIRST SALE IN SHOP-TIME

# build dataframes
supercategory_months_since_first_sale_in_shop_encoding=encoding_2(all_items,['month_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'supercategory','months_since_first_sale_in_shop','',supercategory_encoding,months_since_first_sale_in_shop_encoding)
supercategory_months_since_first_sale_in_shop_encoding_seniority_0=encoding_2((train_X['item_seniority']==0),['month_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'supercategory','months_since_first_sale_in_shop','_seniority_0',supercategory_encoding_seniority_0,months_since_first_sale_in_shop_encoding_seniority_0)
supercategory_months_since_first_sale_in_shop_encoding_seniority_1=encoding_2((train_X['item_seniority']==1),['month_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'supercategory','months_since_first_sale_in_shop','_seniority_1',supercategory_encoding_seniority_1,months_since_first_sale_in_shop_encoding_seniority_1)
supercategory_months_since_first_sale_in_shop_encoding_seniority_2=encoding_2((train_X['item_seniority']==2),['month_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'supercategory','months_since_first_sale_in_shop','_seniority_2',supercategory_encoding_seniority_2,months_since_first_sale_in_shop_encoding_seniority_2)

# erase data from the future
supercategory_months_since_first_sale_in_shop_encoding.loc[supercategory_months_since_first_sale_in_shop_encoding['item_months_since_first_sale_in_shop']==-1,supercategory_months_since_first_sale_in_shop_encoding.columns[3:]]=-1
supercategory_months_since_first_sale_in_shop_encoding_seniority_0.loc[supercategory_months_since_first_sale_in_shop_encoding_seniority_0['item_months_since_first_sale_in_shop']==-1,supercategory_months_since_first_sale_in_shop_encoding_seniority_0.columns[3:]]=-1
supercategory_months_since_first_sale_in_shop_encoding_seniority_1.loc[supercategory_months_since_first_sale_in_shop_encoding_seniority_1['item_months_since_first_sale_in_shop']==-1,supercategory_months_since_first_sale_in_shop_encoding_seniority_1.columns[3:]]=-1
supercategory_months_since_first_sale_in_shop_encoding_seniority_2.loc[supercategory_months_since_first_sale_in_shop_encoding_seniority_2['item_months_since_first_sale_in_shop']==-1,supercategory_months_since_first_sale_in_shop_encoding_seniority_2.columns[3:]]=-1


#----------------------------------------------------------------
# display
print_infos(supercategory_months_since_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(supercategory_months_since_first_sale_in_shop_encoding_seniority_2)
print()

display_6(supercategory_months_since_first_sale_in_shop_encoding_seniority_2,'supercategory_months_since_first_sale_in_shop_avg_sales_seniority_2')

## THIRD ORDER INTERACTIONS FEATURES

In [None]:
# SHOP-CATEGORY-MONTHS SINCE RELEASE-TIME

# build dataframes
shop_category_months_since_release_encoding=encoding_3(all_items,['month_id','shop_id','item_category_id','item_months_since_release'],'shop','category','months_since_release','')
shop_category_months_since_release_encoding_seniority_0=encoding_3((train_X['item_seniority']==0),['month_id','shop_id','item_category_id','item_months_since_release'],'shop','category','months_since_release','_seniority_0')
shop_category_months_since_release_encoding_seniority_1=encoding_3((train_X['item_seniority']==1),['month_id','shop_id','item_category_id','item_months_since_release'],'shop','category','months_since_release','_seniority_1')
shop_category_months_since_release_encoding_seniority_2=encoding_3((train_X['item_seniority']==2),['month_id','shop_id','item_category_id','item_months_since_release'],'shop','category','months_since_release','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(shop_category_months_since_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_release_encoding_seniority_2)
print()

In [None]:
# SHOP-CATEGORY-MONTHS SINCE FIRST SALE IN SHOP-TIME

# build dataframes
shop_category_months_since_first_sale_in_shop_encoding=encoding_3(all_items,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],'shop','category','months_since_first_sale_in_shop','')
shop_category_months_since_first_sale_in_shop_encoding_seniority_0=encoding_3((train_X['item_seniority']==0),['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],'shop','category','months_since_first_sale_in_shop','_seniority_0')
shop_category_months_since_first_sale_in_shop_encoding_seniority_1=encoding_3((train_X['item_seniority']==1),['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],'shop','category','months_since_first_sale_in_shop','_seniority_1')
shop_category_months_since_first_sale_in_shop_encoding_seniority_2=encoding_3((train_X['item_seniority']==2),['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],'shop','category','months_since_first_sale_in_shop','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(shop_category_months_since_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_category_months_since_first_sale_in_shop_encoding_seniority_2)
print()

In [None]:
# SHOP-SUPERCATEGORY-MONTHS SINCE RELEASE-TIME

# build dataframes
shop_supercategory_months_since_release_encoding=encoding_3(all_items,['month_id','shop_id','item_supercategory_id','item_months_since_release'],'shop','supercategory','months_since_release','')
shop_supercategory_months_since_release_encoding_seniority_0=encoding_3((train_X['item_seniority']==0),['month_id','shop_id','item_supercategory_id','item_months_since_release'],'shop','supercategory','months_since_release','_seniority_0')
shop_supercategory_months_since_release_encoding_seniority_1=encoding_3((train_X['item_seniority']==1),['month_id','shop_id','item_supercategory_id','item_months_since_release'],'shop','supercategory','months_since_release','_seniority_1')
shop_supercategory_months_since_release_encoding_seniority_2=encoding_3((train_X['item_seniority']==2),['month_id','shop_id','item_supercategory_id','item_months_since_release'],'shop','supercategory','months_since_release','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(shop_supercategory_months_since_release_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_release_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_release_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_release_encoding_seniority_2)
print()

In [None]:
# SHOP-SUPERCATEGORY-MONTHS SINCE FIRST SALE IN SHOP-TIME

# build dataframes
shop_supercategory_months_since_first_sale_in_shop_encoding=encoding_3(all_items,['month_id','shop_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'shop','supercategory','months_since_first_sale_in_shop','')
shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_0=encoding_3((train_X['item_seniority']==0),['month_id','shop_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'shop','supercategory','months_since_first_sale_in_shop','_seniority_0')
shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_1=encoding_3((train_X['item_seniority']==1),['month_id','shop_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'shop','supercategory','months_since_first_sale_in_shop','_seniority_1')
shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_2=encoding_3((train_X['item_seniority']==2),['month_id','shop_id','item_supercategory_id','item_months_since_first_sale_in_shop'],'shop','supercategory','months_since_first_sale_in_shop','_seniority_2')

#----------------------------------------------------------------
# display
print_infos(shop_supercategory_months_since_first_sale_in_shop_encoding)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_0)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_1)
print()
print('------------------------------------------------')
print()
print_infos(shop_supercategory_months_since_first_sale_in_shop_encoding_seniority_2)
print()

## Format and export data
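
The export step can also be written without building code strings, by keeping the dataframes in a dictionary keyed by name. The sketch below is an alternative under that assumption, not the code used in this notebook: `export_frames` is a hypothetical helper, the toy frames are made up, and a temporary directory stands in for `DATA_FOLDER`.

```python
import os
import tempfile

import pandas as pd

def export_frames(frames, folder):
    """Pickle every dataframe of the dict `frames` to `folder/<name>.pkl`.

    Hypothetical helper, written for this illustration only.
    """
    os.makedirs(folder, exist_ok=True)
    paths = {}
    for name, frame in frames.items():
        path = os.path.join(folder, name + '.pkl')
        frame.to_pickle(path)
        paths[name] = path
    return paths

# Toy frames, made up for illustration.
frames = {
    'shop_encoding': pd.DataFrame(
        {'month_id': [0, 1], 'shop_id': [5, 5], 'shop_avg_sales': [1.0, 2.0]}),
    'shop_encoding_seniority_0': pd.DataFrame(
        {'month_id': [0, 1], 'shop_id': [5, 5], 'shop_avg_sales_seniority_0': [0.5, 1.5]}),
}

# A temporary directory stands in for the processed data folder.
out_dir = os.path.join(tempfile.mkdtemp(), 'target_encodings')
paths = export_frames(frames, out_dir)

# Round trip: the restored frame should match the exported one.
restored = pd.read_pickle(paths['shop_encoding'])
print(sorted(os.listdir(out_dir)))
```

With a dictionary, memory can be released by deleting entries (or the whole dictionary) rather than by deleting variables by name.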

In [None]:
dfs=['month_encoding',
     'months_since_release_encoding',
     'month_of_release_encoding',
     'months_since_first_sale_in_shop_encoding',
     'month_of_first_sale_in_shop_encoding',
     'shop_encoding',
     'item_encoding',
     'category_encoding',
     'supercategory_encoding',
     'shop_category_encoding',
     'shop_supercategory_encoding',
     'category_months_since_release_encoding',
     'supercategory_months_since_release_encoding',
     'category_month_of_release_encoding',
     'supercategory_month_of_release_encoding',
     'category_months_since_first_sale_in_shop_encoding',
     'supercategory_months_since_first_sale_in_shop_encoding',
     'shop_category_months_since_release_encoding',
     'shop_supercategory_months_since_release_encoding',
     'shop_category_months_since_first_sale_in_shop_encoding',
     'shop_supercategory_months_since_first_sale_in_shop_encoding'
    ]

In [None]:
for df in dfs+['shop_item_encoding']:
    print('-----------')
    print(df)
    print()
    exec("print("+df+"["+df+"['month_id']<34].info(null_counts=True))")
    print()

In [None]:
for df in dfs:
    for seniority in range(0,3):
        print('-----------')
        print(df+"_seniority_"+str(seniority))
        print()
        exec("print("+df+"_seniority_"+str(seniority)+"["+df+"_seniority_"+str(seniority)+"['month_id']<34].info(null_counts=True))")
        print()

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER,'processed/target_encodings'))

# export encoded features
for df in dfs+['shop_item_encoding']:
    exec(df+".to_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+".pkl'))")
    exec("del "+df)

for df in dfs:
    for seniority in range(0,3):
        exec(df+"_seniority_"+str(seniority)+".to_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+"_seniority_"+str(seniority)+".pkl'))")
        exec("del "+df+"_seniority_"+str(seniority))
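
The export loops above build statements as strings and run them through `exec`. A possible alternative is sketched below, assuming the encodings are kept in a dictionary keyed by name. The `encodings` dictionary, the toy frames and the temporary output folder are illustrative and not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the real encoding tables.
encodings = {
    "shop_encoding": pd.DataFrame(
        {"month_id": [0, 1], "shop_id": [5, 5], "shop_avg_sales": [1.5, 2.0]}
    ),
    "category_encoding": pd.DataFrame(
        {"month_id": [0, 1], "item_category_id": [40, 40], "category_avg_sales": [0.3, 0.4]}
    ),
}

# Temporary folder used only for this sketch (the notebook writes under DATA_FOLDER).
out_dir = os.path.join(tempfile.mkdtemp(), "processed", "target_encodings")
os.makedirs(out_dir, exist_ok=True)

# One pickle per named table, without building code strings.
for name, frame in encodings.items():
    frame.to_pickle(os.path.join(out_dir, name + ".pkl"))

# Reload into a dictionary with the same keys.
reloaded = {
    name: pd.read_pickle(os.path.join(out_dir, name + ".pkl")) for name in encodings
}
```

Keeping the tables in a dictionary also lets them be released with `encodings.clear()` instead of one `del` per variable.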

In [None]:
# clear memory
del train_X
del items, item_categories

gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 4 - FEATURE AGGREGATION

## Aggregation functions

### Temporal transformation of features

In [None]:
# LAG FEATURES

def lag_features(df,df_features,col_agg,col_lag,lagging_values,fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_lag)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    df_features_lag=df.loc[:,col_agg]
    tmp=df_features.loc[:,col_agg+col_lag]
    for lag in lagging_values:
        df_features_lag=df_features_lag.join(lag_dataframe(tmp,lag).set_index(col_agg),on=col_agg)
        
    if fill_value is not None:
        for col in col_lag:
            for lag in lagging_values:
                # assign back instead of chained inplace fillna, which may act on a copy
                name=col+'_lag_'+str(lag)
                df_features_lag[name]=df_features_lag[name].fillna(fill_value)
            
    return df_features_lag.drop(col_agg,axis=1)
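
The lagging trick used in `lag_features` is to add the lag to `month_id` in a copy of the feature table and then join on the keys, so that the value observed in month m lines up with the row of month m+lag. A minimal sketch on toy data follows. The helper `lag_by_join` and the toy values are illustrative only, and `merge` is used instead of `join` for brevity.

```python
import pandas as pd

# Toy monthly encoding: one value per (month_id, shop_id).
features = pd.DataFrame({
    "month_id": [0, 1, 2, 0, 1, 2],
    "shop_id": [5, 5, 5, 7, 7, 7],
    "shop_avg_sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})


def lag_by_join(target, source, keys, col, lag):
    """Return `col` from `source` as it was `lag` months earlier, aligned on `target`."""
    shifted = source[keys + [col]].copy()
    shifted["month_id"] += lag  # the value of month m becomes visible at month m+lag
    shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
    # A left merge keeps every target row; rows without history get NaN.
    return target[keys].merge(shifted, on=keys, how="left")[f"{col}_lag_{lag}"]


lag1 = lag_by_join(features, features, ["month_id", "shop_id"], "shop_avg_sales", 1)
```

In this toy table the first month of each shop has no predecessor, so its lagged value is missing. The `fill_value` argument of `lag_features` exists for that case.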

In [None]:
# TEMPORAL STATISTICS

def moving_statistics(df,df_features,col_agg,col_avg,windowsize,func,suffix='',fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_avg)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    df_features_lag=df.loc[:,col_agg]
    tmp=df_features.loc[:,col_agg+col_lag]
    for lag in range(1,windowsize+1):
        df_features_lag=df_features_lag.join(lag_dataframe(tmp,lag).set_index(col_agg),on=col_agg)

    df_features_avg=df.loc[:,col_agg]
    for col in col_avg:
        if func=='mean':
            if suffix=='':
                suffix='_movavg_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].mean(axis=1)
        elif func=='max':
            if suffix=='':
                suffix='_movmax_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].max(axis=1)
        elif func=='min':
            if suffix=='':
                suffix='_movmin_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].min(axis=1)
        elif func=='std':
            if suffix=='':
                suffix='_movstd_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].std(ddof=0,axis=1)
        elif func=='rsd':
            if suffix=='':
                suffix='_movrsd_'+str(windowsize)
            denom_avg=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].mean(axis=1)
            num_std=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].std(ddof=0,axis=1)
            df_features_avg[col+suffix]=num_std/denom_avg
            df_features_avg.loc[num_std==0,col+suffix]=0
            del denom_avg
            del num_std


    if fill_value is not None:
        for col in col_lag:
            # assign back instead of chained inplace fillna, which may act on a copy
            df_features_avg[col+suffix]=df_features_avg[col+suffix].fillna(fill_value)
    
    return df_features_avg.drop(col_agg,axis=1)
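
`moving_statistics` aggregates lags 1 to `windowsize`, so the current month never enters its own statistic, and missing lags are skipped by the row-wise reduction. The sketch below reproduces this behaviour for a trailing mean on a single toy series. The helper `trailing_mean` and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy series for a single shop.
src = pd.DataFrame({"month_id": [0, 1, 2, 3], "sales": [2.0, 4.0, 6.0, 8.0]})


def trailing_mean(source, col, window):
    """Mean of `col` over the `window` months before each month (current month excluded)."""
    out = source[["month_id"]].copy()
    for lag in range(1, window + 1):
        shifted = source[["month_id", col]].copy()
        shifted["month_id"] += lag
        shifted = shifted.rename(columns={col: f"lag_{lag}"})
        out = out.merge(shifted, on="month_id", how="left")
    lag_cols = [f"lag_{lag}" for lag in range(1, window + 1)]
    # mean(axis=1) skips NaN, so early months average over whatever history exists
    return out[lag_cols].mean(axis=1)


movavg = trailing_mean(src, "sales", window=2)
```

Month 0 has no history and stays missing, month 1 averages a single past value, and later months average two.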

In [None]:
# LINEAR COMBINATION OF PAST VALUES (differentiation, extrapolation, weighted average)

# weight vectors for...
fd1=[1,-1]          # first order backward finite difference scheme for derivative on previous month
fd2=[3/2,-2,1/2]    # second order backward finite difference scheme for derivative on previous month
extralin2=[2,-1]         # linear extrapolation from 2 previous months
extralin3=[4/3,1/3,-2/3] # linear (least-squares fit) extrapolation from 3 previous months
extraquad3=[3,-3,1]      # quadratic extrapolation from 3 previous months

def linear_combination(df,df_features,col_agg,col_avg,weights,suffix,fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_avg)
    windowsize=len(weights)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    df_features_lag=df.loc[:,col_agg]
    tmp=df_features.loc[:,col_agg+col_lag]
    for lag in range(1,windowsize+1):
        df_features_lag=df_features_lag.join(lag_dataframe(tmp,lag).set_index(col_agg)*weights[lag-1],on=col_agg)
        
    df_features_avg=df.loc[:,col_agg]
    for col in col_avg:
        df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].sum(axis=1,skipna=False)

    if fill_value is not None:
        for col in col_lag:
            # assign back instead of chained inplace fillna, which may act on a copy
            df_features_avg[col+suffix]=df_features_avg[col+suffix].fillna(fill_value)
    
    return df_features_avg.drop(col_agg,axis=1)
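
The weight vectors defined in this cell can be checked against sequences whose derivative or next value is known exactly. The sketch below applies each vector to a straight line or a parabola, with the first weight multiplying the value at lag 1, as in `linear_combination`. The helper `combine` and the two test functions are illustrative.

```python
import numpy as np

# Weights copied from the cell above; index 0 multiplies the value at lag 1.
fd1 = [1, -1]
fd2 = [3/2, -2, 1/2]
extralin2 = [2, -1]
extralin3 = [4/3, 1/3, -2/3]
extraquad3 = [3, -3, 1]


def combine(f, t, weights):
    """Weighted sum of f(t-1), f(t-2), ... with the given weights."""
    return sum(w * f(t - lag) for lag, w in enumerate(weights, start=1))


def line(t):
    return 2.0 * t + 1.0            # slope 2 everywhere


def quad(t):
    return t**2 - 3.0 * t + 2.0     # derivative 2t - 3


t = 10
slope_fd1 = combine(line, t, fd1)         # derivative of the line at t-1
slope_fd2 = combine(quad, t, fd2)         # derivative of the parabola at t-1
next_lin2 = combine(line, t, extralin2)   # extrapolated line value at t
next_lin3 = combine(line, t, extralin3)   # same, from a 3-point least-squares fit
next_quad3 = combine(quad, t, extraquad3) # extrapolated parabola value at t
```

`fd1` and `fd2` estimate the derivative at the previous month (t-1), not at the current one, while the extrapolation weights target the current month t.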

In [None]:
# COMPARE PAST VALUES

def temporal_compare(df,df_features,col_agg,col_avg,lags_num,lags_denom,suffix,fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_avg)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    
    
    tmp=df_features.loc[:,col_agg+col_lag]
    
    df_features_lag_num=df.loc[:,col_agg]
    for lag in lags_num:
        df_features_lag_num=df_features_lag_num.join(lag_dataframe(tmp,lag).set_index(col_agg),on=col_agg)
    
    df_features_lag_denom=df.loc[:,col_agg]
    for lag in lags_denom:
        df_features_lag_denom=df_features_lag_denom.join(lag_dataframe(tmp,lag).set_index(col_agg),on=col_agg)
        
        
    df_features_comp=df.loc[:,col_agg]
    for col in col_avg:
        df_features_lag_num[col+'_avg']=df_features_lag_num.loc[:,[col+'_lag_'+str(lag) for lag in lags_num]].mean(axis=1)
        df_features_lag_denom[col+'_avg']=df_features_lag_denom.loc[:,[col+'_lag_'+str(lag) for lag in lags_denom]].mean(axis=1)

        df_features_comp[col+suffix]=df_features_lag_num[col+'_avg']/df_features_lag_denom[col+'_avg']

    if fill_value is not None:
        for col in col_lag:
            # assign back instead of chained inplace fillna, which may act on a copy
            df_features_comp[col+suffix]=df_features_comp[col+suffix].fillna(fill_value)
    
    return df_features_comp.drop(col_agg,axis=1)

In [None]:
# RATIO OF TWO LINEAR COMBINATIONS OF PAST VALUES

def rational_fraction(df,df_features,col_agg,col_avg,weights_num,weights_denom,suffix,fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_avg)
    windowsize=len(weights_num)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    
    
    tmp=df_features.loc[:,col_agg+col_lag]
    
    df_features_lag_num=df.loc[:,col_agg]
    for lag in range(1,windowsize+1):
        df_features_lag_num=df_features_lag_num.join(lag_dataframe(tmp,lag).set_index(col_agg)*weights_num[lag-1],on=col_agg)
    
    df_features_lag_denom=df.loc[:,col_agg]
    for lag in range(1,windowsize+1):
        df_features_lag_denom=df_features_lag_denom.join(lag_dataframe(tmp,lag).set_index(col_agg)*weights_denom[lag-1],on=col_agg)
        
        
    df_features_comp=df.loc[:,col_agg]
    for col in col_avg:
        df_features_lag_num[col+'_avg']=df_features_lag_num.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].sum(axis=1)
        df_features_lag_denom[col+'_avg']=df_features_lag_denom.loc[:,[col+'_lag_'+str(lag) for lag in range(1,windowsize+1)]].sum(axis=1)

        df_features_comp[col+suffix]=df_features_lag_num[col+'_avg']/df_features_lag_denom[col+'_avg']
        
    if fill_value is not None:
        for col in col_lag:
            # assign back instead of chained inplace fillna, which may act on a copy
            df_features_comp[col+suffix]=df_features_comp[col+suffix].fillna(fill_value)
    
    return df_features_comp.drop(col_agg,axis=1)
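
Both `temporal_compare` and `rational_fraction` end with a division of two aggregated columns. In pandas a zero denominator gives `inf` (or `NaN` for 0/0), and `fillna` only replaces the `NaN` case. The sketch below shows that behaviour on toy values, together with one possible guard. The guard is an assumption for illustration and is not part of the functions above.

```python
import numpy as np
import pandas as pd

# Toy aggregated numerators and denominators.
num = pd.Series([6.0, 2.0, 0.0])
den = pd.Series([3.0, 0.0, 0.0])

ratio = num / den  # regular value, division by zero, zero over zero

# Possible guard: turn infinities into NaN first, then fill.
safe = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)
```

Whether infinite ratios should be clipped, filled or left as they are depends on how the downstream model handles them.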

In [None]:
# TIME MAPPING
# map columns 'col_labels' for the current month to the value at another point in time given by 'map_to_label'

def time_mapping(df,df_features,join_labels,col_labels,map_to_label,suffix):
    df_mapped=df.loc[:,[map_to_label]+join_labels]
    
    shifted_df=df_features.loc[:,['month_id']+join_labels+col_labels].rename({'month_id':map_to_label},axis=1).set_index([map_to_label]+join_labels)
    shifted_df.rename({col:col+suffix for col in col_labels},axis=1,inplace=True)
    
    return df_mapped.join(shifted_df,on=[map_to_label]+join_labels).drop([map_to_label]+join_labels,axis=1)   
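
Unlike a fixed lag, `time_mapping` looks up a feature at the month stored in another column of each sample (`map_to_label`). A minimal sketch of the same lookup follows, assuming a mapping column named `item_month_of_release` and a toy category encoding. Both the column name and the values are illustrative.

```python
import pandas as pd

# Samples observed in months 5 and 6 for an item released in month 3.
samples = pd.DataFrame({
    "month_id": [5, 6],
    "item_category_id": [40, 40],
    "item_month_of_release": [3, 3],  # hypothetical mapping column
})

# Toy monthly encoding of the category.
encoding = pd.DataFrame({
    "month_id": [3, 5, 6],
    "item_category_id": [40, 40, 40],
    "category_avg_sales": [9.0, 4.0, 2.0],
})

# Index the encoding by the mapped month instead of the current month.
lookup = (
    encoding.rename(columns={"month_id": "item_month_of_release"})
    .set_index(["item_month_of_release", "item_category_id"])
    .add_suffix("_at_release")
)

mapped = samples.join(lookup, on=["item_month_of_release", "item_category_id"])
```

Both samples receive the category value of month 3, whatever their own `month_id` is.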

### Feature analysis functions

In [None]:
def feature_correlation_analysis(train_df,feature_df,feature_df_seniority,feature_name,groupby_labels,seniority):

    tmp=feature_df_seniority.loc[(feature_df_seniority['month_id']>12)&(feature_df_seniority['month_id']<34),groupby_labels+[feature_name+'_avg_sales_seniority_'+str(seniority),feature_name+'_avg_sales_over_sold_seniority_'+str(seniority)]]
    tmp2=feature_df.loc[(feature_df['month_id']>12)&(feature_df['month_id']<34),groupby_labels+[feature_name+'_avg_sales',feature_name+'_avg_sales_over_sold']]
    tmp=tmp.join(tmp2.set_index(groupby_labels),on=groupby_labels)

    tmp_full=train_df[groupby_labels+['item_quantity']].join(tmp.set_index(groupby_labels),on=groupby_labels)

    fig=plt.figure(figsize=(5,5))
    sns.heatmap(tmp_full.drop(groupby_labels,axis=1).corr(),annot=True)
    
    
    
def compare_features(df,df_seniority,name,labels,seniority,avg_sales_max=20,avg_sales_seniority_max=20,avg_sales_over_sold_max=20):
    labels=list(labels)   # work on a copy so the caller's list is not mutated below
    n=len(labels)
    
    tmp=df.join(df_seniority.set_index(labels),on=labels)
    labels.remove('month_id')
    labels.append('month')
    tmp['month']=tmp['month_id']%12+1
    
    if name=='item':
        fig,axes=plt.subplots(1,3,figsize=(15,5))
        sns.scatterplot(data=tmp,x=name+'_avg_sales',y=name+'_avg_sales_seniority_'+str(seniority),hue='month',ax=axes[0])
        axes[0].plot([0,20],[0,20],'k')
        axes[0].grid(True)
        axes[0].set_xlim(0,avg_sales_max)
        axes[0].set_ylim(0,avg_sales_seniority_max)
        sns.scatterplot(data=tmp,x=name+'_avg_sales',y=name+'_avg_sales_over_sold',hue='month',ax=axes[1])
        axes[1].plot([0,20],[0,20],'k')
        axes[1].grid(True)
        axes[1].set_xlim(0,avg_sales_max)
        axes[1].set_ylim(0,avg_sales_over_sold_max)
        sns.scatterplot(data=tmp,x=name+'_avg_sales_over_sold',y=name+'_avg_sales_seniority_'+str(seniority),hue='month',ax=axes[2])
        axes[2].plot([0,20],[0,20],'k')
        axes[2].grid(True)
        axes[2].set_xlim(0,avg_sales_over_sold_max)
        axes[2].set_ylim(0,avg_sales_seniority_max)
    
    else:    
        fig,axes=plt.subplots(n,3,figsize=(15,n*5))
        for i,label in enumerate(labels):
            sns.scatterplot(data=tmp,x=name+'_avg_sales',y=name+'_avg_sales_seniority_'+str(seniority),hue=label,ax=axes[i,0])
            axes[i,0].plot([0,20],[0,20],'k')
            axes[i,0].grid(True)
            axes[i,0].set_xlim(0,avg_sales_max)
            axes[i,0].set_ylim(0,avg_sales_seniority_max)
            sns.scatterplot(data=tmp,x=name+'_avg_sales',y=name+'_avg_sales_over_sold',hue=label,ax=axes[i,1])
            axes[i,1].plot([0,20],[0,20],'k')
            axes[i,1].grid(True)
            axes[i,1].set_xlim(0,avg_sales_max)
            axes[i,1].set_ylim(0,avg_sales_over_sold_max)
            sns.scatterplot(data=tmp,x=name+'_avg_sales_over_sold',y=name+'_avg_sales_seniority_'+str(seniority),hue=label,ax=axes[i,2])
            axes[i,2].plot([0,20],[0,20],'k')
            axes[i,2].grid(True)
            axes[i,2].set_xlim(0,avg_sales_over_sold_max)
            axes[i,2].set_ylim(0,avg_sales_seniority_max)

In [None]:
loaded=%who_ls
loaded.append('loaded')

## TEMPORAL VARIABILITY ENCODING

- Encode the temporal variability of the sales quantities, category by category
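
The variability measure used in this part is the relative standard deviation (population standard deviation divided by the mean) over a window of months, set to 0 when the standard deviation is 0, which also covers windows containing only zeros. A small sketch on toy windows is given below. The column names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Each row is one (shop, item) sample; each column is one month of the window.
window = pd.DataFrame({
    "lag_0": [4.0, 0.0, 3.0],
    "lag_1": [2.0, 0.0, 3.0],
    "lag_2": [0.0, 0.0, 3.0],
})

mean = window.mean(axis=1)
std = window.std(axis=1, ddof=0)  # population standard deviation, as in the notebook

rsd = std / mean                  # 0/0 gives NaN for the all-zero row
rsd[std == 0] = 0.0               # constant windows have no variability
```

The first row fluctuates strongly around its mean, while the constant rows get a variability of 0.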

In [None]:
# import dataset
train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

In [None]:
# same as moving_statistics, but the window covers lags 0..windowsize-1 (current month included)
def moving_statistics_0(df,df_features,col_agg,col_avg,windowsize,func,suffix='',fill_value=None):
    col_agg=list(col_agg)
    col_lag=list(col_avg)
    
    def lag_dataframe(df_original,lag):
        df_shift=df_original.copy()
        df_shift['month_id']+=lag
        for col in col_lag:
            df_shift.rename({col:col+'_lag_'+str(lag)},axis=1,inplace=True)
        return df_shift
    
    df_features_lag=df.loc[:,col_agg]
    tmp=df_features.loc[:,col_agg+col_lag]
    for lag in range(0,windowsize):
        df_features_lag=df_features_lag.join(lag_dataframe(tmp,lag).set_index(col_agg),on=col_agg)

    df_features_avg=df.loc[:,col_agg]
    for col in col_avg:
        if func=='mean':
            if suffix=='':
                suffix='_movavg_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].mean(axis=1)
        elif func=='max':
            if suffix=='':
                suffix='_movmax_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].max(axis=1)
        elif func=='min':
            if suffix=='':
                suffix='_movmin_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].min(axis=1)
        elif func=='std':
            if suffix=='':
                suffix='_movstd_'+str(windowsize)
            df_features_avg[col+suffix]=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].std(ddof=0,axis=1)
        elif func=='rsd':
            if suffix=='':
                suffix='_movrsd_'+str(windowsize)
            denom_avg=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].mean(axis=1)
            num_std=df_features_lag.loc[:,[col+'_lag_'+str(lag) for lag in range(0,windowsize)]].std(ddof=0,axis=1)
            df_features_avg[col+suffix]=num_std/denom_avg
            df_features_avg.loc[num_std==0,col+suffix]=0
            del denom_avg
            del num_std


    if fill_value is not None:
        for col in col_lag:
            # assign back instead of chained inplace fillna, which may act on a copy
            df_features_avg[col+suffix]=df_features_avg[col+suffix].fillna(fill_value)
    
    return df_features_avg.drop(col_agg,axis=1)


def categorical_variability_encoding(agg_label,name,windowsize_inner,suffix,fill_value=None,windowsize_outer=34):
    df=train_X[['month_id','shop_id','item_id',agg_label,'item_quantity']]

    # compute the temporal relative standard deviation of the item_quantity of item in shop
    tmp=moving_statistics_0(df,df,['month_id','shop_id','item_id'],['item_quantity'],windowsize_inner,'rsd',suffix,fill_value=None)
    tmp=pd.concat([df,tmp],axis=1,sort=False)
    
    # average within categories every month over all shops and items...
    # ...then average over the past windowsize_outer months (current month excluded)
    # remaining NaNs correspond only to categories that are new in the dataset (no past data for this category in the dataset)
    tmp=tmp.groupby(['month_id',agg_label]).agg({'item_quantity'+suffix:'mean'}).rename({'item_quantity'+suffix:name+suffix},axis=1)
    tmp=moving_statistics(df,tmp.reset_index(),['month_id',agg_label],[name+suffix],windowsize_outer,'mean','_',fill_value)
    tmp.rename({name+suffix+'_':name+suffix},axis=1,inplace=True)

    return tmp

In [None]:
df=categorical_variability_encoding('item_category_id','category_semiannual_avg',3,'_recent_rsd',fill_value=None,windowsize_outer=6)
df2=pd.concat([train_X[['month_id','shop_id','item_id','item_category_id']],df],axis=1,sort=False)
print(df2.info(null_counts=True))

tmp=df2[['month_id','item_category_id','category_semiannual_avg_recent_rsd']].groupby(['month_id','item_category_id']).mean().unstack()
tmp.columns=tmp.columns.droplevel(0)

fig,axes=plt.subplots(1,2,figsize=(15,5))
axes[0].plot(tmp)

sns.heatmap(tmp.T,ax=axes[1])
axes[1].set_xlim(20,35)

del df, df2, tmp, fig, axes

In [None]:
df=categorical_variability_encoding('item_category_id','category_semiannual_avg',12,'_annual_rsd',fill_value=None,windowsize_outer=6)
df2=pd.concat([train_X[['month_id','shop_id','item_id','item_category_id']],df],axis=1,sort=False)
print(df2.info(null_counts=True))

tmp=df2[['month_id','item_category_id','category_semiannual_avg_annual_rsd']].groupby(['month_id','item_category_id']).mean().unstack()
tmp.columns=tmp.columns.droplevel(0)

fig,axes=plt.subplots(1,2,figsize=(15,5))
axes[0].plot(tmp)

sns.heatmap(tmp.T,ax=axes[1])
axes[1].set_xlim(20,35)

del df, df2, tmp, fig, axes

In [None]:
# generate features
category_variability_rs=categorical_variability_encoding('item_category_id','category_semiannual_avg',3,'_recent_rsd',fill_value=None,windowsize_outer=6)
category_variability_as=categorical_variability_encoding('item_category_id','category_semiannual_avg',12,'_annual_rsd',fill_value=None,windowsize_outer=6)

# join features to dataframe
train_X=pd.concat([train_X,category_variability_rs,category_variability_as],axis=1,sort=False)

del category_variability_rs, category_variability_as 

## ----------------------------------------------

## SPLIT DATAFRAME DEPENDING ON SENIORITY

In [None]:
# restrict temporal extent of the dataframe (drop first 18 months)
train_X.drop(train_X.loc[train_X['month_id']<18].index,axis=0,inplace=True)

In [None]:
# split dataframe depending on seniority
train_0=train_X.loc[train_X['item_seniority']==0,:]
train_1=train_X.loc[train_X['item_seniority']==1,:]
train_2=train_X.loc[train_X['item_seniority']==2,:]

pd.DataFrame([train_X.count().rename('global'),train_0.count().rename('seniority_0'),train_1.count().rename('seniority_1'),train_2.count().rename('seniority_2')]).T

In [None]:
# clear disk space
os.remove(os.path.join(DATA_FOLDER,'processed/train_X0.pkl'))

# export datasets
train_0.to_pickle(os.path.join(DATA_FOLDER,'processed/train_0.pkl'))
train_1.to_pickle(os.path.join(DATA_FOLDER,'processed/train_1.pkl'))
train_2.to_pickle(os.path.join(DATA_FOLDER,'processed/train_2.pkl'))

In [None]:
# clear memory
del train_X

del train_0
del train_1
del train_2

gc.collect()

In [None]:
reset_variable_space

## ----------------------------------------------

## SENIORITY 0

Samples (shop, item) of seniority 0 involve items that are entirely new in the catalogue, so no past data is available for these pairs. The typical sales quantities of new items vary a lot from month to month, and their temporal evolution appears much noisier than that of older items. For these items, we base our predictions mostly on the past typical sales quantities of similar items in the month when they were released. Similar items are defined as items of the same category or the same supercategory (for example another game, or another book).
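
As a rough illustration of this idea, the sketch below computes, for each category, the mean quantity sold by new items in each month, then averages these monthly values over strictly earlier months and attaches the result to the samples. It is a simplified stand-in for the target encodings and moving statistics used in the notebook. The toy data and the `_new` and `_past_mean` column names are assumptions, and the row-wise `shift(1)` assumes consecutive months.

```python
import pandas as pd

# Toy seniority-0 samples: quantities sold by items in their release month.
new_items = pd.DataFrame({
    "month_id": [20, 20, 21, 21, 22],
    "item_category_id": [30, 30, 30, 30, 30],
    "item_quantity": [4.0, 2.0, 1.0, 3.0, 0.0],
})

# Typical sales of new items per category and month.
monthly = (
    new_items.groupby(["item_category_id", "month_id"])["item_quantity"]
    .mean()
    .rename("category_avg_sales_new")
)

# Average over strictly earlier months only, so the target month never leaks in.
past = (
    monthly.groupby(level="item_category_id")
    .transform(lambda s: s.shift(1).expanding().mean())
    .rename("category_avg_sales_new_past_mean")
)

features = new_items.join(past, on=["item_category_id", "month_id"])
```

The first month of a category has no history and stays missing, which is where the supercategory-level features serve as a fallback.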

In [None]:
# import dataset
train_0=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_0.pkl'))

print(train_0.info(null_counts=True))

### RAW FEATURES

In [None]:
# RAW FEATURES SELECTION
split_col='month_id'
target_col='item_quantity'
raw_features=[]

# label encodings must be retained at this stage for aggregating features
# most should be discarded later on
label_encoding=['shop_id',
                'item_id',
                'item_category_id',
                'item_supercategory_id',
                'item_category_console_id',
                'item_category_is_digital'
               ]

# mapping info for aggregation (to be discarded later)
mapping_columns=[]

# define in advance the columns to discard once all features have been aggregated
features_to_discard=['item_id',
                     'shop_id',
                     'item_supercategory_id',
                     'item_category_console_id',
                     'item_category_is_digital'
                    ]
features_to_discard+=mapping_columns


#-----------------------------------
# SHOPS

# months since opening
raw_features+=['shop_months_since_opening']
raw_features+=['shop_opening']

#-----------------------------------
# ITEMS

# month-specific frequency encoding
#raw_features+=['item_freq_in_seniority']    # uniform in seniority 0 (when an item is new, it is new in all shops)

#-----------------------------------
# ITEM CATEGORIES

# month-specific frequency encoding
raw_features+=['item_category_freq']
raw_features+=['item_supercategory_freq']
raw_features+=['item_category_console_freq']
raw_features+=['item_category_digital_freq']

raw_features+=['item_category_freq_in_seniority']
raw_features+=['item_supercategory_freq_in_seniority']
raw_features+=['item_category_console_freq_in_seniority']
raw_features+=['item_category_digital_freq_in_seniority']

# encoding of category temporal variability
raw_features+=['category_semiannual_avg_recent_rsd']
raw_features+=['category_semiannual_avg_annual_rsd']


#-----------------------------------
# RELATIVE TIME FEATURES
#raw_features+=['item_months_since_release']     # doesn't make sense for seniority 0 (items have all just been released)

#raw_features+=['item_months_since_first_sale_in_shop']   # doesn't make sense for seniority 0 (items have never been sold in shop)
#raw_features+=['item_months_since_last_sale_in_shop']    # doesn't make sense for seniority 0 (items have never been sold in shop)


train_0=train_0[[split_col]+label_encoding+mapping_columns+[target_col]+raw_features]

print(train_0.info(null_counts=True))

gc.collect()

### PRICE FEATURES

In [None]:
# import price data

shop_category_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/shop_category_prices.pkl'))
category_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/category_prices.pkl'))
supercategory_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/supercategory_prices.pkl'))

In [None]:
ts = time.time()

price_features_df=[]

# -------------------------------------------------
# SHOP-CATEGORY

# assess typical price range for this category in this shop: recently, and at the same period last year
price_features_df.append(moving_statistics(train_0,shop_category_prices,['month_id','shop_id','item_category_id'],['shop_category_price_median'],34,'mean',suffix='_absolute_mean'))
price_features_df.append(moving_statistics(train_0,shop_category_prices,['month_id','shop_id','item_category_id'],['shop_category_price_median'],3,'mean',suffix='_recent_mean'))
price_features_df.append(lag_features(train_0,shop_category_prices,['month_id','shop_id','item_category_id'],['shop_category_price_median'],[12]))

# -------------------------------------------------
# CATEGORY

# infer range of possible price values from all values observed over time in the shops where it was sold
price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_min'],34,'min',suffix='_absolute_min'))
price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_max'],34,'max',suffix='_absolute_max'))

price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_min'],12,'min',suffix='_annual_min'))
price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_max'],12,'max',suffix='_annual_max'))

# assess typical price range for this category of items recently, and at the same period last year
price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_median'],3,'mean',suffix='_recent_mean'))
price_features_df.append(lag_features(train_0,category_prices,['month_id','item_category_id'],['category_price_median'],[12]))

# compare price of category to supercategory
price_features_df.append(moving_statistics(train_0,category_prices,['month_id','item_category_id'],['category_price_median_compared_to_supercategory_price_median'],34,'mean',suffix='_absolute_mean'))


# for totally new categories, we need data from the supercategory
# -------------------------------------------------
# SUPERCATEGORY

# assess typical price range for this supercategory of items recently, and at the same period last year
price_features_df.append(moving_statistics(train_0,supercategory_prices,['month_id','item_supercategory_id'],['supercategory_price_median'],3,'mean',suffix='_recent_mean'))
price_features_df.append(lag_features(train_0,supercategory_prices,['month_id','item_supercategory_id'],['supercategory_price_median'],[12]))


print('time : ' +str(time.time() - ts))




# -------------------------------------------------
# CONCATENATE FEATURES

train_0=pd.concat([train_0]+price_features_df,axis=1,sort=False)
del price_features_df
gc.collect()

print('time : ' +str(time.time() - ts))


# -------------------------------------------------
# DOWNCAST DTYPES
train_0=downcast_dtypes(train_0)
print('time : ' +str(time.time() - ts))


print(train_0.info(null_counts=True))
train_0

In [None]:
# clear memory space

del shop_category_prices
del category_prices
del supercategory_prices

gc.collect()

### TARGET ENCODED FEATURES

In [None]:
dfs=['shop_encoding',
     'category_encoding',
     'supercategory_encoding',
     'shop_category_encoding',
     'shop_supercategory_encoding'
    ]

    
for df in dfs:
    exec(df+"=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+".pkl'))")
    exec(df+"_seniority_0=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+"_seniority_0.pkl'))")

#### Feature analysis

In [None]:
feature_correlation_analysis(train_0,shop_encoding,shop_encoding_seniority_0,'shop',['month_id','shop_id'],0)
compare_features(shop_encoding,shop_encoding_seniority_0,'shop',['month_id','shop_id'],0,4,4,4)

In [None]:
feature_correlation_analysis(train_0,category_encoding,category_encoding_seniority_0,'category',['month_id','item_category_id'],0)
compare_features(category_encoding,category_encoding_seniority_0,'category',['month_id','item_category_id'],0,20,20,20)

In [None]:
feature_correlation_analysis(train_0,supercategory_encoding,supercategory_encoding_seniority_0,'supercategory',['month_id','item_supercategory_id'],0)
compare_features(supercategory_encoding,supercategory_encoding_seniority_0,'supercategory',['month_id','item_supercategory_id'],0,20,6,20)

In [None]:
feature_correlation_analysis(train_0,shop_category_encoding,shop_category_encoding_seniority_0,'shop_category',['month_id','shop_id','item_category_id'],0)
compare_features(shop_category_encoding,shop_category_encoding_seniority_0,'shop_category',['month_id','shop_id','item_category_id'],0,20,20,20)

In [None]:
feature_correlation_analysis(train_0,shop_supercategory_encoding,shop_supercategory_encoding_seniority_0,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],0)
compare_features(shop_supercategory_encoding,shop_supercategory_encoding_seniority_0,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],0,20,20,20)

#### Feature aggregation

In [None]:
ts = time.time()

target_features_df=[]

#############################################################
# SHOP
col_labels=['shop_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,[1]))
target_features_df.append(lag_features(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,[12]))

print('time : ' +str(time.time() - ts))





#############################################################
# SUPERCATEGORY
col_labels=['supercategory_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,[1]))
target_features_df.append(lag_features(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# SHOP-SUPERCATEGORY
col_labels=['shop_supercategory_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,[1]))
target_features_df.append(lag_features(train_0,shop_supercategory_encoding_seniority_0,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_0,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,[1]))
target_features_df.append(lag_features(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# SHOP-CATEGORY
col_labels=['shop_category_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,[1]))
target_features_df.append(lag_features(train_0,shop_category_encoding_seniority_0,['month_id','shop_id','item_category_id'],col_labels_seniority_0,[12]))

print('time : ' +str(time.time() - ts))










#############################################################
# SPATIAL TRENDS
# assess whether shop sells more of this category than other shops (to relate to category-specific data)
col_labels=['shop_category_avg_sales_compared_to_category']

target_features_df.append(moving_statistics(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_0,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[12]))

col_labels=['shop_supercategory_avg_sales_compared_to_supercategory']

target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,[1]))
target_features_df.append(lag_features(train_0,shop_supercategory_encoding,['month_id','shop_id','item_supercategory_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# TEMPORAL TRENDS (encode month or recent period with respect to how it compares to other months of the year)

# SHOPS
col_labels=['shop_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# decouple SUPERCATEGORY (long-term trend) / CATEGORY within SUPERCATEGORY (short-term trend)
col_labels=['supercategory_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

col_labels=['category_avg_sales_compared_to_supercategory']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(moving_statistics(train_0,category_encoding,['month_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_0,category_encoding,['month_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_0,category_encoding,['month_id','item_category_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# TEMPORAL TRENDS (are items of seniority 0 sold in larger quantities at given time of the year?)

# SHOPS
col_labels=['shop_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,shop_encoding_seniority_0,['month_id','shop_id'],col_labels_seniority_0,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# SUPERCATEGORY
col_labels=['supercategory_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,supercategory_encoding_seniority_0,['month_id','item_supercategory_id'],col_labels_seniority_0,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_0=[col+'_seniority_0' for col in col_labels]

target_features_df.append(rational_fraction(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_0,category_encoding_seniority_0,['month_id','item_category_id'],col_labels_seniority_0,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

print('time : ' +str(time.time() - ts))










# -------------------------------------------------
# CONCATENATE FEATURES

train_0=pd.concat([train_0]+target_features_df,axis=1,sort=False)
del target_features_df
gc.collect()
                         
print('time : ' +str(time.time() - ts))

# -------------------------------------------------
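# info() prints its report itself and returns None; show_counts needs pandas >= 1.2 (older versions use null_counts)
train_0.info(show_counts=True,verbose=True)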
train_0
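
The `rational_fraction` calls above pass twelve numerator weights and twelve denominator weights. The helper itself is defined elsewhere in the notebook, so the sketch below is only an assumed reading of those weights: the k-th weight multiplies the value observed k months ago, and the feature is the ratio of the two weighted sums. Under that assumption, `[0,...,0,12]` over `[1,...,1]` is the lag-12 value divided by the annual mean, and `[4,4,4,0,...,0]` over `[1,...,1]` is the mean of the last three months divided by the annual mean, which matches the suffixes used. The function name `weighted_lag_ratio` and the toy history are hypothetical, and how the real helper treats missing months or a zero denominator is not known.

```python
import numpy as np

def weighted_lag_ratio(lagged_values, weights_num, weights_denom):
    """Ratio of two weighted sums of lagged values (hypothetical stand-in).

    lagged_values[k] is assumed to be the value observed k+1 months ago.
    """
    x = np.asarray(lagged_values, dtype=float)
    numerator = np.dot(weights_num, x)
    denominator = np.dot(weights_denom, x)
    # Returning NaN on a zero denominator is a choice made for this sketch.
    return numerator / denominator if denominator != 0 else np.nan

# Toy history: lag 1, lag 2, ..., lag 12 (illustrative numbers only).
history = [30, 20] + [10] * 9 + [50]
ones = [1] * 12
annual_mean = np.mean(history)

lag12_vs_year = weighted_lag_ratio(history, [0] * 11 + [12], ones)
lag1_vs_year = weighted_lag_ratio(history, [12] + [0] * 11, ones)
recent_vs_year = weighted_lag_ratio(history, [4, 4, 4] + [0] * 9, ones)
```

A ratio above 1 then means that the month in question (or the recent period) sold more than the average month of the past year.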

In [None]:
# clear memory space

for df in dfs:
    exec("del "+df)
    exec("del "+df+"_seniority_0")

gc.collect()

### Discard irrelevant features

In [None]:
train_0.drop(features_to_discard,axis=1,inplace=True)

train_0.info(show_counts=True,verbose=True)

### Export aggregated dataset

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'training'))

# export dataset
train_0.to_pickle(os.path.join(DATA_FOLDER,'training/train_0_pred.pkl'))

In [None]:
# clear memory
del train_0

gc.collect()

In [None]:
reset_variable_space

## ----------------------------------------------

## SENIORITY 1

Samples (shop, item) of seniority 1 are made of items that have never been sold in this shop, but have been sold in other shops before. Such a sample is likely made of an item that is not actually available in this shop, so the likelihood that it will be sold there in the future is very low. Data analysis shows that only 5% of these samples contribute to the sales (compared to 20-25% of samples of seniority 2). The few samples of seniority 1 that do sell are likely made of items that are always sold in low quantities, possibly less than one per month. Here again, data analysis shows that the average sales for samples of seniority 1 is around 0.05 (against 0.4 in seniority 2), and the likelihood that an item of seniority 1 is sold in quantities greater than 3 is less than 0.1%. Samples of seniority 1 yielding sale quantities above 8 are so rare that they may be considered noise.

For seniority 1, the critical features are the number of months the item has been in the catalogue without being sold in this shop, and the typical sale quantities of this item in the shops where it has been sold in the past. If an item is typically sold in large quantities but has never been sold in a given shop, it is unlikely that it will suddenly start being sold there next month. Similarly, if the item has never been sold in a given shop in spite of having been in the catalogue for many months, it is likely that it will never be sold in this shop, or only in very small quantities. Conversely, if the item was only released last month, it may simply not have had the chance to be sold yet, and it may be sold in low quantities next month.

In particular, we find that the most relevant features are target encodings of the past typical sale quantities of samples of seniority 1 in this shop, in the same category of items, and with the same number of months since release, as well as the quantities sold for this item in other shops where it has been sold in the past.
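
The proportions quoted above (share of samples with sales, average quantity, share of samples sold in quantities above 3) can be checked per seniority level with a grouped aggregation. The sketch below assumes a single table with a `seniority` column, which is hypothetical, since the notebook stores each seniority level in its own file. The toy numbers are illustrative only and are not the dataset's values.

```python
import pandas as pd

# Toy samples; `seniority` is a hypothetical column, `item_quantity` is the target.
samples = pd.DataFrame({
    'seniority':     [1, 1, 1, 1, 2, 2, 2, 2],
    'item_quantity': [0, 0, 0, 1, 0, 2, 0, 5],
})

summary = samples.groupby('seniority')['item_quantity'].agg(
    share_sold=lambda q: (q > 0).mean(),      # fraction of samples with at least one sale
    mean_quantity='mean',                     # average quantity per sample
    share_above_3=lambda q: (q > 3).mean(),   # fraction of samples sold in quantities above 3
)
print(summary)
```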


ADDITIONAL REMARKS:
- For now, we disregard the influence of shops that have just opened. Such shops are rare in the dataset and we expect that they do not significantly affect the overall result.

- The temporality of items in seniority 1 seems to be mostly related to 'months_since_release'. Their sales do not seem to depend on the period of the year, apart from a peak in December: the effects of the month of release and of the time of year outside the Christmas period are negligible.
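
A seasonal effect such as the December peak can be looked for by averaging the target per calendar month. This is a minimal sketch on toy data, assuming that `month_id` 0 corresponds to a January, so that `month_id % 12 + 1` gives the calendar month. That convention is an assumption and should be checked against the way `month_id` is built earlier in the notebook.

```python
import pandas as pd

# Toy seniority-1 samples (illustrative numbers only).
toy = pd.DataFrame({
    'month_id':      [10, 10, 11, 11, 22, 23, 23],
    'item_quantity': [0,  0,  1,  0,  0,  1,  0],
})

# Assumption: month_id 0 is a January, so 11 and 23 are Decembers.
toy['month_of_year'] = toy['month_id'] % 12 + 1

by_month = toy.groupby('month_of_year')['item_quantity'].mean()
print(by_month)
```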

In [None]:
# import dataset
train_1=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_1.pkl'))

train_1.info(show_counts=True)

### RAW FEATURES

In [None]:
# RAW FEATURES SELECTION
split_col='month_id'
target_col='item_quantity'
raw_features=[]

# label encodings must be retained at this stage for aggregating features
# most should be discarded later on
label_encoding=['shop_id',
                'item_id',
                'item_category_id',
                'item_supercategory_id',
                'item_category_console_id',
                'item_category_is_digital'
               ]

# mapping info for aggregation (to be discarded later)
mapping_columns=['item_month_id_of_last_sale',
                 'item_month_id_of_release',
                 'item_month_of_release'
                ]

# define already the columns to discard after aggregation of all features
features_to_discard=['item_id',
                     'shop_id',
                     'item_supercategory_id',
                     'item_category_console_id',
                     'item_category_is_digital'
                    ]
features_to_discard+=mapping_columns


#-----------------------------------
# SHOPS

# months since opening
raw_features+=['shop_months_since_opening']
raw_features+=['shop_opening']

#-----------------------------------
# ITEMS

# month-specific frequency encoding
raw_features+=['item_freq_in_seniority']

#-----------------------------------
# ITEM CATEGORIES

# month-specific frequency encoding
raw_features+=['item_category_freq']
raw_features+=['item_supercategory_freq']
raw_features+=['item_category_console_freq']
raw_features+=['item_category_digital_freq']

raw_features+=['item_category_freq_in_seniority']
raw_features+=['item_supercategory_freq_in_seniority']
raw_features+=['item_category_console_freq_in_seniority']
raw_features+=['item_category_digital_freq_in_seniority']

# encoding of category temporal variability
raw_features+=['category_semiannual_avg_recent_rsd']
raw_features+=['category_semiannual_avg_annual_rsd']


#-----------------------------------
# RELATIVE TIME FEATURES
raw_features+=['item_months_since_release']

#raw_features+=['item_months_since_first_sale_in_shop']   # doesn't make sense for seniority 1 (items have never been sold in shop)
#raw_features+=['item_months_since_last_sale_in_shop']    # doesn't make sense for seniority 1 (items have never been sold in shop)


train_1=train_1[[split_col]+label_encoding+mapping_columns+[target_col]+raw_features]

train_1.info(show_counts=True)

gc.collect()

### PRICE FEATURES

In [None]:
# import price data

item_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/item_prices.pkl'))
category_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/category_prices.pkl'))

In [None]:
ts = time.time()

price_features_df=[]

# -------------------------------------------------
# ITEM
# infer item typical price from values observed in the shops where it was most recently sold
col_labels=['item_price_median']
col_labels+=['item_price_median_compared_to_category_price_median']
price_features_df.append(time_mapping(train_1,item_prices,['item_id'],col_labels,'item_month_id_of_last_sale',suffix='_last_sale'))
price_features_df.append(moving_statistics(train_1,item_prices,['month_id','item_id'],col_labels,34,'mean',suffix='_absolute_mean'))

# infer range of possible price values from all values observed over time in the shops where it was sold
time_window=34   # number of past months covered by the 'absolute' statistics
price_features_df.append(moving_statistics(train_1,item_prices,['month_id','item_id'],['item_price_min'],time_window,'min',suffix='_absolute_min'))
price_features_df.append(moving_statistics(train_1,item_prices,['month_id','item_id'],['item_price_max'],time_window,'max',suffix='_absolute_max'))

print('time : ' +str(time.time() - ts))


# -------------------------------------------------
# CATEGORY
# assess typical price range for this category of items recently
price_features_df.append(moving_statistics(train_1,category_prices,['month_id','item_category_id'],['category_price_median'],3,'mean',suffix='_recent_mean'))

print('time : ' +str(time.time() - ts))




# -------------------------------------------------
# CONCATENATE FEATURES

train_1=pd.concat([train_1]+price_features_df,axis=1,sort=False)
del price_features_df
gc.collect()

print('time : ' +str(time.time() - ts))


# -------------------------------------------------
# DOWNCAST DTYPES
train_1=downcast_dtypes(train_1)
print('time : ' +str(time.time() - ts))


train_1.info(show_counts=True)
train_1
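
The `time_mapping` call above attaches to each sample the price statistics of the item at the month of its last sale. The helper is defined elsewhere in the notebook. The sketch below is an assumed equivalent written as a left merge, in which the month stored in `item_month_id_of_last_sale` is matched against `month_id` of the price table. The function name `value_at_mapped_month` and the toy tables are hypothetical.

```python
import pandas as pd

def value_at_mapped_month(train, table, keys, col, month_col, suffix):
    """Look up `col` in `table` at the month given by `train[month_col]` (hypothetical stand-in)."""
    lookup = table[['month_id'] + keys + [col]].rename(
        columns={'month_id': month_col, col: col + suffix})
    merged = train[[month_col] + keys].merge(lookup, on=[month_col] + keys, how='left')
    merged.index = train.index   # keep the row alignment needed for pd.concat(axis=1)
    return merged[[col + suffix]]

# Toy price table and toy samples (illustrative numbers only).
item_prices_toy = pd.DataFrame({
    'month_id':          [3, 5, 4],
    'item_id':           [7, 7, 8],
    'item_price_median': [100.0, 120.0, 40.0],
})
train_toy = pd.DataFrame({
    'month_id':                   [6, 6, 6],
    'item_id':                    [7, 8, 9],
    'item_month_id_of_last_sale': [5, 4, 2],
})

last_sale_price = value_at_mapped_month(
    train_toy, item_prices_toy, ['item_id'], 'item_price_median',
    'item_month_id_of_last_sale', '_last_sale')
```

Item 9 has no entry in the toy price table, so its feature is left missing rather than filled.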

In [None]:
# clear memory space

del item_prices
del category_prices

gc.collect()

### TARGET ENCODED FEATURES

In [None]:
dfs=['shop_encoding',
     'item_encoding',
     'category_encoding',
     'supercategory_encoding',
     'shop_category_encoding',
     'shop_supercategory_encoding',
     'category_months_since_release_encoding',
     'supercategory_months_since_release_encoding',
     'category_month_of_release_encoding',
     'supercategory_month_of_release_encoding',
     'shop_category_months_since_release_encoding',
     'shop_supercategory_months_since_release_encoding'
    ]

    
for df in dfs:
    exec(df+"=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+".pkl'))")
    exec(df+"_seniority_1=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+"_seniority_1.pkl'))")

#### Feature analysis

In [None]:
feature_correlation_analysis(train_1,item_encoding,item_encoding_seniority_1,'item',['month_id','item_id'],1)
compare_features(item_encoding,item_encoding_seniority_1,'item',['month_id','item_id'],1,20,20,20)

In [None]:
feature_correlation_analysis(train_1,shop_encoding,shop_encoding_seniority_1,'shop',['month_id','shop_id'],1)
compare_features(shop_encoding,shop_encoding_seniority_1,'shop',['month_id','shop_id'],1,2,2,4)

In [None]:
feature_correlation_analysis(train_1,category_encoding,category_encoding_seniority_1,'category',['month_id','item_category_id'],1)
compare_features(category_encoding,category_encoding_seniority_1,'category',['month_id','item_category_id'],1,20,13,20)

In [None]:
feature_correlation_analysis(train_1,supercategory_encoding,supercategory_encoding_seniority_1,'supercategory',['month_id','item_supercategory_id'],1)
compare_features(supercategory_encoding,supercategory_encoding_seniority_1,'supercategory',['month_id','item_supercategory_id'],1,20,3,20)

In [None]:
feature_correlation_analysis(train_1,shop_category_encoding,shop_category_encoding_seniority_1,'shop_category',['month_id','shop_id','item_category_id'],1)
compare_features(shop_category_encoding,shop_category_encoding_seniority_1,'shop_category',['month_id','shop_id','item_category_id'],1,20,20,20)

In [None]:
feature_correlation_analysis(train_1,shop_supercategory_encoding,shop_supercategory_encoding_seniority_1,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],1)
compare_features(shop_supercategory_encoding,shop_supercategory_encoding_seniority_1,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],1,20,20,20)

In [None]:
feature_correlation_analysis(train_1,category_months_since_release_encoding,category_months_since_release_encoding_seniority_1,'category_months_since_release',['month_id','item_category_id','item_months_since_release'],1)
compare_features(category_months_since_release_encoding,category_months_since_release_encoding_seniority_1,'category_months_since_release',['month_id','item_category_id','item_months_since_release'],1,20,20,20)

In [None]:
feature_correlation_analysis(train_1,shop_category_months_since_release_encoding,shop_category_months_since_release_encoding_seniority_1,'shop_category_months_since_release',['month_id','shop_id','item_category_id','item_months_since_release'],1)

#### Feature aggregation
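
The aggregation relies on the notebook's `lag_features` and `moving_statistics` helpers, which are defined elsewhere. As a reminder of the idea, the sketch below shows one possible way to build a lag and a moving mean that only use months strictly before the month of each sample, so that the target of the current month never leaks into its own features. The function names, the output column names and the handling of missing months (skipped rather than counted as zero) are assumptions of this sketch, not the notebook's implementation.

```python
import pandas as pd

def lag_sketch(train, encoding, keys, col, lag):
    """Value of `col` observed `lag` months earlier for the same keys (hypothetical helper)."""
    shifted = encoding[keys + [col]].copy()
    shifted['month_id'] = shifted['month_id'] + lag   # month m-lag now lines up with month m
    shifted = shifted.rename(columns={col: f'{col}_lag{lag}'})
    merged = train[keys].merge(shifted, on=keys, how='left')
    merged.index = train.index
    return merged[[f'{col}_lag{lag}']]

def moving_mean_sketch(train, encoding, keys, col, window):
    """Mean of `col` over the `window` previous months; missing months are skipped."""
    lags = [lag_sketch(train, encoding, keys, col, lag) for lag in range(1, window + 1)]
    return pd.concat(lags, axis=1).mean(axis=1).rename(f'{col}_mean{window}').to_frame()

# Toy encoding table and toy samples (illustrative numbers only).
encoding_toy = pd.DataFrame({
    'month_id':       [0, 1, 2, 0, 1, 2],
    'shop_id':        [1, 1, 1, 2, 2, 2],
    'shop_avg_sales': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
train_toy = pd.DataFrame({'month_id': [3, 3, 1], 'shop_id': [1, 2, 1]})

lag1 = lag_sketch(train_toy, encoding_toy, ['month_id', 'shop_id'], 'shop_avg_sales', 1)
mean2 = moving_mean_sketch(train_toy, encoding_toy, ['month_id', 'shop_id'], 'shop_avg_sales', 2)
```

Building a long window from one merge per lag is slow on the full dataset; it is used here only because it makes the exclusion of the current month explicit.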

In [None]:
ts = time.time()

target_features_df=[]

#############################################################
# ITEM
col_labels=['item_avg_sales','item_avg_sales_over_sold']

target_features_df.append(moving_statistics(train_1,item_encoding,['month_id','item_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding,['month_id','item_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding,['month_id','item_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding,['month_id','item_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,item_encoding,['month_id','item_id'],col_labels,[1]))
target_features_df.append(lag_features(train_1,item_encoding,['month_id','item_id'],col_labels,[12]))

target_features_df.append(time_mapping(train_1,item_encoding,['item_id'],col_labels,'item_month_id_of_last_sale',suffix='_last_sale'))

# maximum amount sold observed over time in the shops where it has been sold
target_features_df.append(moving_statistics(train_1,item_encoding,['month_id','item_id'],['item_max_quantity'],34,'max',suffix='_absolute_max'))


# Seniority 1
col_labels_seniority_1=['item_avg_sales_seniority_1']

target_features_df.append(moving_statistics(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,item_encoding_seniority_1,['month_id','item_id'],col_labels_seniority_1,[12]))

target_features_df.append(time_mapping(train_1,item_encoding_seniority_1,['item_id'],col_labels_seniority_1,'item_month_id_of_last_sale',suffix='_last_sale'))

print('time : ' +str(time.time() - ts))





#############################################################
# SHOP
col_labels=['shop_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,shop_encoding,['month_id','shop_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding,['month_id','shop_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding,['month_id','shop_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding,['month_id','shop_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,shop_encoding,['month_id','shop_id'],col_labels,[1]))
target_features_df.append(lag_features(train_1,shop_encoding,['month_id','shop_id'],col_labels,[12]))

target_features_df.append(moving_statistics(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,shop_encoding_seniority_1,['month_id','shop_id'],col_labels_seniority_1,[12]))

print('time : ' +str(time.time() - ts))





#############################################################
# MONTHS SINCE RELEASE
col_labels=['category_months_since_release_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,category_months_since_release_encoding_seniority_1,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_1,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['shop_category_months_since_release_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,shop_category_months_since_release_encoding_seniority_1,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_1,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['supercategory_months_since_release_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,supercategory_months_since_release_encoding_seniority_1,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['shop_supercategory_months_since_release_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,[1]))
target_features_df.append(lag_features(train_1,shop_supercategory_months_since_release_encoding_seniority_1,['month_id','shop_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_1,[12]))

print('time : ' +str(time.time() - ts))










#############################################################
# SPATIAL TRENDS
# assess whether shop sells more of this category than other shops (to relate to category-specific data)
col_labels=['shop_category_avg_sales_compared_to_category']

target_features_df.append(moving_statistics(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_1,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# TEMPORAL TRENDS (encode a month or recent period by how it compares to the other months of the year)

# SHOPS
col_labels=['shop_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(rational_fraction(train_1,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(rational_fraction(train_1,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# decouple SUPERCATEGORY (long-term trend) / CATEGORY within SUPERCATEGORY (short-term trend)
col_labels=['supercategory_avg_sales']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(rational_fraction(train_1,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_1,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

col_labels=['category_avg_sales_compared_to_supercategory']
col_labels_seniority_1=[col+'_seniority_1' for col in col_labels]

target_features_df.append(moving_statistics(train_1,category_encoding,['month_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_1,category_encoding,['month_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_1,category_encoding,['month_id','item_category_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# MONTH OF RELEASE
col_labels=['category_month_of_release_avg_sales_compared_to_category']
target_features_df.append(moving_statistics(train_1,category_month_of_release_encoding,['month_id','item_category_id','item_month_of_release'],col_labels,34,'mean',suffix='_absolute_mean'))

col_labels=['supercategory_month_of_release_avg_sales_compared_to_supercategory']
target_features_df.append(moving_statistics(train_1,supercategory_month_of_release_encoding,['month_id','item_supercategory_id','item_month_of_release'],col_labels,34,'mean',suffix='_absolute_mean'))

print('time : ' +str(time.time() - ts))










# -------------------------------------------------
# CONCATENATE FEATURES

train_1=pd.concat([train_1]+target_features_df,axis=1,sort=False)
del target_features_df
gc.collect()
                         
print('time : ' +str(time.time() - ts))

# -------------------------------------------------
train_1.info(null_counts=True,verbose=True)
train_1
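
The `rational_fraction` calls in the cell above compare one weighted sum of past monthly values to another. Assuming that position k of each weight list applies to the value lagged by k+1 months (an ordering inferred from the suffixes, since the helper itself is defined elsewhere in the notebook), `weights_num=[12,0,...,0]` over `weights_denom=[1,...,1]` is the lag-1 value divided by the 12-month mean. The hypothetical `weighted_ratio` below is a minimal sketch of that arithmetic on a single series, not the notebook's implementation:

```python
import numpy as np

def weighted_ratio(past_values, weights_num, weights_denom):
    """Ratio of two weighted sums of past monthly values.

    past_values[k] is assumed to hold the value observed k+1 months ago
    (index 0 = lag 1, index 11 = lag 12). Hypothetical helper for illustration.
    """
    past = np.asarray(past_values, dtype=float)
    num = np.dot(weights_num, past)
    denom = np.dot(weights_denom, past)
    # Return NaN rather than dividing by zero when the denominator vanishes.
    return num / denom if denom != 0 else float('nan')

# Twelve past months of an encoding such as shop_avg_sales (made-up numbers):
# last month was 3.0, the eleven months before were 1.0.
past = [3.0] + [1.0] * 11
ones = [1] * 12

lag1_vs_annual = weighted_ratio(past, [12] + [0] * 11, ones)         # 12*3 / 14
lag12_vs_annual = weighted_ratio(past, [0] * 11 + [12], ones)        # 12*1 / 14
recent_vs_annual = weighted_ratio(past, [4, 4, 4] + [0] * 9, ones)   # 4*(3+1+1) / 14
```

A ratio above 1 means the numerator period sold more than the annual average, and a ratio below 1 means it sold less.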

In [None]:
# clear memory space

for df in dfs:
    exec("del "+df)
    exec("del "+df+"_seniority_1")

gc.collect()

### Discard irrelevant features

In [None]:
train_1.drop(features_to_discard,axis=1,inplace=True)

train_1.info(null_counts=True,verbose=True)

### Export aggregated dataset

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'training'))

# export dataset
train_1.to_pickle(os.path.join(DATA_FOLDER,'training/train_1_pred.pkl'))

In [None]:
# clear memory
del train_1

gc.collect()

In [None]:
reset_variable_space

## ----------------------------------------------

## SENIORITY 2

Samples (shop, item) of seniority 2 involve items that have already been sold in this very shop, so they are certainly part of the shop's local catalogue. For such samples, predictions rely mostly on the quantities of this item sold in this shop in the past.
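
The three seniority levels can be told apart from the mapping columns used in this section. A minimal sketch, assuming that an item is new when `item_month_id_of_release` equals `month_id`, and that `item_month_id_of_first_sale_in_shop` is NaN when the item has never been sold in the shop (both are assumptions about encodings defined earlier in the notebook; `seniority_sketch` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def seniority_sketch(df):
    """Assign seniority 0 / 1 / 2 to each (month, shop, item) sample."""
    # Seniority 0: the item appears in the dataset for the first time this month.
    new_item = df['item_month_id_of_release'] == df['month_id']
    # Seniority 2: the item was sold in this shop in an earlier month.
    # A NaN first-sale month compares as False, so unsold items are excluded.
    sold_here_before = df['item_month_id_of_first_sale_in_shop'] < df['month_id']
    # Everything else is seniority 1: known item, never sold in this shop.
    levels = np.select([new_item, sold_here_before], [0, 2], default=1)
    return pd.Series(levels, index=df.index, name='seniority')

# Toy samples for month 10 (made-up values).
toy = pd.DataFrame({
    'month_id': [10, 10, 10],
    'item_month_id_of_release': [10, 4, 4],
    'item_month_id_of_first_sale_in_shop': [np.nan, np.nan, 7],
})
seniority = seniority_sketch(toy)
```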

In [None]:
# import dataset
train_2=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_2.pkl'))

train_2.info(null_counts=True)

### RAW FEATURES

In [None]:
# RAW FEATURES SELECTION
split_col='month_id'
target_col='item_quantity'
raw_features=[]

# label encodings must be retained at this stage for aggregating features
# most should be discarded later on
label_encoding=['shop_id',
                'item_id',
                'item_category_id',
                'item_supercategory_id',
                'item_category_console_id',
                'item_category_is_digital'
               ]

# mapping info for aggregation (to be discarded later)
mapping_columns=['item_month_id_of_last_sale',
                 'item_month_id_of_release',
                 'item_month_of_release',
                 'item_month_id_of_last_sale_in_shop',
                 'item_month_id_of_first_sale_in_shop'
                ]

# define up front the columns to discard once all features have been aggregated
features_to_discard=['item_id',
                     'shop_id',
                     'item_supercategory_id',
                     'item_category_console_id',
                     'item_category_is_digital'
                    ]
features_to_discard+=mapping_columns


#-----------------------------------
# SHOPS

# months since opening
raw_features+=['shop_months_since_opening']
#raw_features+=['shop_opening']     # uniformly 0 in seniority 2 (no shop that just opened this month can possibly have already sold any item in the past)

#-----------------------------------
# ITEMS

# month-specific frequency encoding
raw_features+=['item_freq_in_seniority']

#-----------------------------------
# ITEM CATEGORIES

# month-specific frequency encoding
raw_features+=['item_category_freq']
raw_features+=['item_supercategory_freq']
raw_features+=['item_category_console_freq']
raw_features+=['item_category_digital_freq']

raw_features+=['item_category_freq_in_seniority']
raw_features+=['item_supercategory_freq_in_seniority']
raw_features+=['item_category_console_freq_in_seniority']
raw_features+=['item_category_digital_freq_in_seniority']

# encoding of category temporal variability
raw_features+=['category_semiannual_avg_recent_rsd']
raw_features+=['category_semiannual_avg_annual_rsd']


#-----------------------------------
# RELATIVE TIME FEATURES
raw_features+=['item_months_since_release']

raw_features+=['item_months_since_first_sale_in_shop']
raw_features+=['item_months_since_last_sale_in_shop']


train_2=train_2[[split_col]+label_encoding+mapping_columns+[target_col]+raw_features]

train_2.info(null_counts=True)

gc.collect()

### PRICE FEATURES

In [None]:
# import price data

item_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/item_prices.pkl'))
category_prices=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/price_features/category_prices.pkl'))

In [None]:
ts = time.time()

price_features_df=[]

# -------------------------------------------------
# ITEM
# infer item typical price from values observed in the shops where it was most recently sold
col_labels=['item_price_median']
col_labels+=['item_price_median_compared_to_category_price_median']
price_features_df.append(time_mapping(train_2,item_prices,['item_id'],col_labels,'item_month_id_of_last_sale',suffix='_last_sale'))
price_features_df.append(moving_statistics(train_2,item_prices,['month_id','item_id'],col_labels,34,'mean',suffix='_absolute_mean'))

# infer range of possible price values from all values observed over time in the shops where it was sold
time_window=34
price_features_df.append(moving_statistics(train_2,item_prices,['month_id','item_id'],['item_price_min'],time_window,'min',suffix='_absolute_min'))
price_features_df.append(moving_statistics(train_2,item_prices,['month_id','item_id'],['item_price_max'],time_window,'max',suffix='_absolute_max'))

print('time : ' +str(time.time() - ts))


# -------------------------------------------------
# CATEGORY
# assess typical price range for this category of items recently
price_features_df.append(moving_statistics(train_2,category_prices,['month_id','item_category_id'],['category_price_median'],3,'mean',suffix='_recent_mean'))

print('time : ' +str(time.time() - ts))




# -------------------------------------------------
# CONCATENATE FEATURES

train_2=pd.concat([train_2]+price_features_df,axis=1,sort=False)
del price_features_df
gc.collect()

print('time : ' +str(time.time() - ts))


# -------------------------------------------------
# DOWNCAST DTYPES
train_2=downcast_dtypes(train_2)
print('time : ' +str(time.time() - ts))


train_2.info(null_counts=True)
train_2
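
`moving_statistics` is defined earlier in the notebook. The calls above pass it the samples, an encoding table, the join keys (time column first), the columns to aggregate, a window length in months and a statistic. As a rough illustration of the idea only, the hypothetical `moving_mean_sketch` below averages a column over the `window` months strictly before each sample's month, which keeps the current month out of the feature. It assumes one row per key combination in the encoding table and skips missing months; the notebook's helper may handle gaps and fill values differently.

```python
import pandas as pd

def moving_mean_sketch(samples, encoding, keys, col, window):
    """Mean of `col` over the `window` months strictly before each sample's month.

    keys[0] is the time column (e.g. 'month_id'); the remaining keys identify
    the group (e.g. 'item_id'). Hypothetical helper, for illustration only.
    """
    time_col = keys[0]
    lagged = {}
    for lag in range(1, window + 1):
        shifted = encoding[keys + [col]].copy()
        # A value observed in month m becomes usable by samples of month m + lag.
        shifted[time_col] = shifted[time_col] + lag
        merged = samples[keys].merge(shifted, on=keys, how='left')
        lagged[lag] = merged[col].to_numpy()
    # Row-wise mean; NaN (month missing from the encoding) is skipped.
    result = pd.DataFrame(lagged, index=samples.index).mean(axis=1)
    return result.rename(col + '_mean_' + str(window))

# Toy data (made-up numbers): one item with a median price for months 0..3.
item_prices_toy = pd.DataFrame({'month_id': [0, 1, 2, 3],
                                'item_id': [1, 1, 1, 1],
                                'item_price_median': [100.0, 200.0, 300.0, 400.0]})
samples_toy = pd.DataFrame({'month_id': [0, 1, 3], 'item_id': [1, 1, 1]})

recent = moving_mean_sketch(samples_toy, item_prices_toy, ['month_id', 'item_id'],
                            'item_price_median', window=2)
```

In this toy example the sample of month 3 averages the prices of months 1 and 2, the sample of month 1 only sees month 0, and the sample of month 0 has no history and stays NaN.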

In [None]:
# clear memory space

del item_prices
del category_prices

gc.collect()

### TARGET ENCODED FEATURES

In [None]:
dfs=['shop_encoding',
     'item_encoding',
     'category_encoding',
     'supercategory_encoding',
     'shop_category_encoding',
     'shop_supercategory_encoding',
     'category_months_since_release_encoding',
     'supercategory_months_since_release_encoding',
     'category_month_of_release_encoding',
     'supercategory_month_of_release_encoding',
     'category_months_since_first_sale_in_shop_encoding',
     'supercategory_months_since_first_sale_in_shop_encoding',
     'shop_category_months_since_release_encoding',
     'shop_supercategory_months_since_release_encoding',
     'shop_category_months_since_first_sale_in_shop_encoding',
     'shop_supercategory_months_since_first_sale_in_shop_encoding'
    ]

    
for df in dfs:
    exec(df+"=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+".pkl'))")
    exec(df+"_seniority_2=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/"+df+"_seniority_2.pkl'))")

shop_item_encoding=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/target_encodings/shop_item_encoding.pkl'))

#### Feature analysis

In [None]:
tmp=lag_features(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],[1,2,3,4,5,6],fill_value=None)
tmp=pd.concat([train_2[['month_id','shop_id','item_id','item_quantity']],tmp],axis=1,sort=False)
tmp.info(null_counts=True)
    
tmp=train_2[['month_id','shop_id','item_id']].join(shop_item_encoding.set_index(['month_id','shop_id','item_id']),on=['month_id','shop_id','item_id'])
sns.heatmap(tmp.drop(['month_id','shop_id','item_id','item_quantity_compared_to_shop_avg_sales','item_quantity_compared_to_item_avg_sales','item_quantity_compared_to_item_avg_sales_over_sold','item_quantity_compared_to_shop_avg_sales_over_sold'],axis=1).corr(),annot=True)

del tmp

In [None]:
feature_correlation_analysis(train_2,item_encoding,item_encoding_seniority_2,'item',['month_id','item_id'],2)
compare_features(item_encoding,item_encoding_seniority_2,'item',['month_id','item_id'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,shop_encoding,shop_encoding_seniority_2,'shop',['month_id','shop_id'],2)
compare_features(shop_encoding,shop_encoding_seniority_2,'shop',['month_id','shop_id'],2,2,4,4)

In general:

shop_avg_sales_seniority_2 ~= shop_avg_sales

shop_avg_sales_over_sold   ~= (1+shop_avg_sales)

(the seniority 2 value is slightly higher because a larger fraction of these items is sold)

Exception: for shop 55 (digital warehouse), shop_avg_sales_seniority_2 ~= shop_avg_sales_over_sold, i.e. this shop sells the same articles every month.

Otherwise, the seniority 2 and global encodings are fairly similar. In the early months of the year, the seniority 2 value is higher than the global one (products released around Christmas sell better in the early months).

In [None]:
feature_correlation_analysis(train_2,category_encoding,category_encoding_seniority_2,'category',['month_id','item_category_id'],2)
compare_features(category_encoding,category_encoding_seniority_2,'category',['month_id','item_category_id'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,supercategory_encoding,supercategory_encoding_seniority_2,'supercategory',['month_id','item_supercategory_id'],2)
compare_features(supercategory_encoding,supercategory_encoding_seniority_2,'supercategory',['month_id','item_supercategory_id'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,shop_category_encoding,shop_category_encoding_seniority_2,'shop_category',['month_id','shop_id','item_category_id'],2)
compare_features(shop_category_encoding,shop_category_encoding_seniority_2,'shop_category',['month_id','shop_id','item_category_id'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,shop_supercategory_encoding,shop_supercategory_encoding_seniority_2,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],2)
compare_features(shop_supercategory_encoding,shop_supercategory_encoding_seniority_2,'shop_supercategory',['month_id','shop_id','item_supercategory_id'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,category_months_since_release_encoding,category_months_since_release_encoding_seniority_2,'category_months_since_release',['month_id','item_category_id','item_months_since_release'],2)
compare_features(category_months_since_release_encoding,category_months_since_release_encoding_seniority_2,'category_months_since_release',['month_id','item_category_id','item_months_since_release'],2,20,20,20)

In [None]:
feature_correlation_analysis(train_2,category_months_since_first_sale_in_shop_encoding,category_months_since_first_sale_in_shop_encoding_seniority_2,'category_months_since_first_sale_in_shop',['month_id','item_category_id','item_months_since_first_sale_in_shop'],2)
compare_features(category_months_since_first_sale_in_shop_encoding,category_months_since_first_sale_in_shop_encoding_seniority_2,'category_months_since_first_sale_in_shop',['month_id','item_category_id','item_months_since_first_sale_in_shop'],2,20,20,20)

In [None]:
# correlation between 'item_months_since_release' and 'item_months_since_first_sale_in_shop'
tmp=train_2[['month_id','shop_id','item_id','item_category_id','item_months_since_release','item_months_since_first_sale_in_shop']]

tmp2=category_months_since_release_encoding_seniority_2.loc[(category_months_since_release_encoding_seniority_2['month_id']>12)&(category_months_since_release_encoding_seniority_2['month_id']<34),['month_id','item_category_id','item_months_since_release','category_months_since_release_avg_sales_seniority_2','category_months_since_release_avg_sales_over_sold_seniority_2']]
tmp3=category_months_since_release_encoding.loc[(category_months_since_release_encoding['month_id']>12)&(category_months_since_release_encoding['month_id']<34),['month_id','item_category_id','item_months_since_release','category_months_since_release_avg_sales','category_months_since_release_avg_sales_over_sold']]
tmp=tmp.join(tmp2.set_index(['month_id','item_category_id','item_months_since_release']),on=['month_id','item_category_id','item_months_since_release'])
tmp=tmp.join(tmp3.set_index(['month_id','item_category_id','item_months_since_release']),on=['month_id','item_category_id','item_months_since_release'])

tmp4=category_months_since_first_sale_in_shop_encoding.loc[(category_months_since_first_sale_in_shop_encoding['month_id']>12)&(category_months_since_first_sale_in_shop_encoding['month_id']<34),['month_id','item_category_id','item_months_since_first_sale_in_shop','category_months_since_first_sale_in_shop_avg_sales','category_months_since_first_sale_in_shop_avg_sales_over_sold']]
tmp=tmp.join(tmp4.set_index(['month_id','item_category_id','item_months_since_first_sale_in_shop']),on=['month_id','item_category_id','item_months_since_first_sale_in_shop'])

print(tmp[['item_months_since_release','item_months_since_first_sale_in_shop']].corr().loc['item_months_since_release',:])
sns.heatmap(tmp.drop(['month_id','shop_id','item_id','item_category_id','item_months_since_release','item_months_since_first_sale_in_shop'],axis=1).corr(),annot=True)             

del tmp, tmp2, tmp3, tmp4

In [None]:
feature_correlation_analysis(train_2,shop_category_months_since_release_encoding,shop_category_months_since_release_encoding_seniority_2,'shop_category_months_since_release',['month_id','shop_id','item_category_id','item_months_since_release'],2)

#### Feature aggregation
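
The aggregation cell uses two weight vectors, `extralin2` and `extralin3`, defined earlier in the notebook, to extrapolate next month's sales linearly from the past 2 and 3 months. Their actual values are not shown in this section. One common way to obtain such weights is to fit a least-squares line through the last n monthly values and evaluate it one step ahead, which reduces to a fixed linear combination of those values. The sketch below derives the weights under that assumption (`extrapolation_weights` is a hypothetical helper; weight k applies to the value lagged by k+1 months):

```python
import numpy as np

def extrapolation_weights(n):
    """Weights of a one-step-ahead least-squares linear extrapolation.

    The n past observations sit at times -1 (lag 1) ... -n (lag n) and the
    prediction is made at time 0. Hypothetical helper, for illustration only.
    """
    t = -np.arange(1, n + 1, dtype=float)
    # Design matrix of the line y = a + b*t.
    A = np.column_stack([np.ones(n), t])
    # The fitted intercept a is the prediction at t = 0; it is linear in y,
    # with coefficients given by the first row of the pseudo-inverse of A.
    return np.linalg.pinv(A)[0]

w2 = extrapolation_weights(2)   # close to [2, -1]
w3 = extrapolation_weights(3)   # close to [4/3, 1/3, -2/3]

# Made-up sales rising by one unit per month: lag 1 = 5, lag 2 = 4, lag 3 = 3.
past_sales = np.array([5.0, 4.0, 3.0])
pred_from_2 = float(np.dot(w2, past_sales[:2]))
pred_from_3 = float(np.dot(w3, past_sales))
```

On a perfectly linear history both predictions continue the trend (6 units here); on noisy data the 3-month version reacts less strongly to the last observation.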

In [None]:
ts = time.time()

target_features_df=[]
    
#############################################################
# SHOP-ITEM

# typical sale values
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],[1]))
target_features_df.append(lag_features(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],[12]))

# extrapolation: linear predictions of sales in this shop based on sale quantities for the past 2 and 3 months
target_features_df.append(linear_combination(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],weights=extralin2,suffix='_extralin2'))
target_features_df.append(linear_combination(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],weights=extralin3,suffix='_extralin3'))

# range of all possible sale quantities observed over time in this shop
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],34,'min',suffix='_absolute_min'))
target_features_df.append(moving_statistics(train_2,shop_item_encoding,['month_id','shop_id','item_id'],['item_quantity'],34,'max',suffix='_absolute_max'))

print('time : ' +str(time.time() - ts))





#############################################################
# ITEM
col_labels=['item_avg_sales','item_avg_sales_over_sold']

target_features_df.append(moving_statistics(train_2,item_encoding,['month_id','item_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding,['month_id','item_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding,['month_id','item_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding,['month_id','item_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,item_encoding,['month_id','item_id'],col_labels,[1]))
target_features_df.append(lag_features(train_2,item_encoding,['month_id','item_id'],col_labels,[12]))

target_features_df.append(time_mapping(train_2,item_encoding,['item_id'],col_labels,'item_month_id_of_last_sale',suffix='_last_sale'))


# Seniority 2
col_labels_seniority_2=['item_avg_sales_seniority_2']

target_features_df.append(moving_statistics(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,item_encoding_seniority_2,['month_id','item_id'],col_labels_seniority_2,[12]))

target_features_df.append(time_mapping(train_2,item_encoding_seniority_2,['item_id'],col_labels_seniority_2,'item_month_id_of_last_sale',suffix='_last_sale'))

print('time : ' +str(time.time() - ts))





#############################################################
# SHOP
col_labels=['shop_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))





#############################################################
# SUPERCATEGORY
col_labels=['supercategory_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# SHOP-SUPERCATEGORY
col_labels=['shop_supercategory_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,shop_supercategory_encoding_seniority_2,['month_id','shop_id','item_supercategory_id'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

#############################################################
# SHOP-CATEGORY
col_labels=['shop_category_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,shop_category_encoding_seniority_2,['month_id','shop_id','item_category_id'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))










#############################################################
# MONTHS SINCE RELEASE
col_labels=['category_months_since_release_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,category_months_since_release_encoding_seniority_2,['month_id','item_category_id','item_months_since_release'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['shop_category_months_since_release_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,shop_category_months_since_release_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_release'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['supercategory_months_since_release_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,supercategory_months_since_release_encoding_seniority_2,['month_id','item_supercategory_id','item_months_since_release'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))





#############################################################
# MONTHS SINCE FIRST SALE IN SHOP
col_labels=['category_months_since_first_sale_in_shop_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))

col_labels=['shop_category_months_since_first_sale_in_shop_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,[1]))
target_features_df.append(lag_features(train_2,shop_category_months_since_first_sale_in_shop_encoding_seniority_2,['month_id','shop_id','item_category_id','item_months_since_first_sale_in_shop'],col_labels_seniority_2,[12]))

print('time : ' +str(time.time() - ts))










#############################################################
# SPATIAL TRENDS
# assess whether shop sells more of this category than other shops (to relate to category-specific data)
col_labels=['shop_category_avg_sales_compared_to_category']

target_features_df.append(moving_statistics(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,34,'mean',suffix='_absolute_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,12,'mean',suffix='_annual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,6,'mean',suffix='_semiannual_mean'))
target_features_df.append(moving_statistics(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_2,shop_category_encoding,['month_id','shop_id','item_category_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# TEMPORAL TRENDS (encode a month or recent period by how it compares to the other months of the year)

# SHOPS
col_labels=['shop_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,shop_encoding,['month_id','shop_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,category_encoding,['month_id','item_category_id'],['category_avg_sales'],weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# decouple SUPERCATEGORY (long-term trend) / CATEGORY within SUPERCATEGORY (short-term trend)
col_labels=['supercategory_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,supercategory_encoding,['month_id','item_supercategory_id'],col_labels,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

col_labels=['category_avg_sales_compared_to_supercategory']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(moving_statistics(train_2,category_encoding,['month_id','item_category_id'],col_labels,3,'mean',suffix='_recent_mean'))
target_features_df.append(lag_features(train_2,category_encoding,['month_id','item_category_id'],col_labels,[1]))
target_features_df.append(lag_features(train_2,category_encoding,['month_id','item_category_id'],col_labels,[12]))

print('time : ' +str(time.time() - ts))





#############################################################################
# TEMPORAL TRENDS (are items of seniority 2 sold in larger quantities at a given time of the year? for instance in January, when many items have been released for Christmas? or in any month following a large release of new items?)

# SHOPS
col_labels=['shop_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,shop_encoding_seniority_2,['month_id','shop_id'],col_labels_seniority_2,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# SUPERCATEGORY
col_labels=['supercategory_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,supercategory_encoding_seniority_2,['month_id','item_supercategory_id'],col_labels_seniority_2,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

# CATEGORY
col_labels=['category_avg_sales']
col_labels_seniority_2=[col+'_seniority_2' for col in col_labels]

target_features_df.append(rational_fraction(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,weights_num=[0,0,0,0,0,0,0,0,0,0,0,12],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag12_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,weights_num=[12,0,0,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_lag1_to_annual_mean'))
target_features_df.append(rational_fraction(train_2,category_encoding_seniority_2,['month_id','item_category_id'],col_labels_seniority_2,weights_num=[4,4,4,0,0,0,0,0,0,0,0,0],weights_denom=[1,1,1,1,1,1,1,1,1,1,1,1],suffix='_compare_recent_mean_to_annual_mean'))

print('time : ' +str(time.time() - ts))





#############################################################################
# MONTH OF RELEASE
col_labels=['category_month_of_release_avg_sales_compared_to_category']
target_features_df.append(moving_statistics(train_2,category_month_of_release_encoding,['month_id','item_category_id','item_month_of_release'],col_labels,34,'mean',suffix='_absolute_mean'))

col_labels=['supercategory_month_of_release_avg_sales_compared_to_supercategory']
target_features_df.append(moving_statistics(train_2,supercategory_month_of_release_encoding,['month_id','item_supercategory_id','item_month_of_release'],col_labels,34,'mean',suffix='_absolute_mean'))

print('time : ' +str(time.time() - ts))










# -------------------------------------------------
# CONCATENATE FEATURES

train_2=pd.concat([train_2]+target_features_df,axis=1,sort=False)
del target_features_df
gc.collect()
                         
print('time : ' +str(time.time() - ts))

# -------------------------------------------------
train_2.info(show_counts=True,verbose=True)  # info() prints directly and returns None; show_counts replaces null_counts (removed in pandas 2.0)
train_2

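The `rational_fraction` calls above are parameterised by twelve numerator weights and twelve denominator weights. The helper itself is defined earlier in the notebook and is not shown here, so the following is an assumption: weight `k` applies to the value lagged by `k+1` months, and the feature is the ratio of the two weighted sums. Under that reading, the three weight patterns reduce to lag 12, lag 1 and the 3-month mean, each divided by the 12-month mean. The plain-Python sketch below illustrates this; `weighted_lag_ratio` is a hypothetical name, not the notebook's implementation.

```python
def weighted_lag_ratio(history, weights_num, weights_denom):
    """Ratio of two weighted sums over the most recent lags.

    history[-1] is lag 1 (last month), history[-2] is lag 2, and so on.
    Weight k applies to lag k+1 (assumed convention, see lead-in).
    """
    n = len(weights_num)
    if len(weights_denom) != n or len(history) < n:
        raise ValueError("need one weight per lag and at least that much history")
    lags = history[::-1][:n]  # lags[0] = lag 1, ..., lags[11] = lag 12
    num = sum(w * v for w, v in zip(weights_num, lags))
    denom = sum(w * v for w, v in zip(weights_denom, lags))
    return num / denom if denom != 0 else float("nan")


# toy series of 12 monthly averages, oldest first (made-up values)
history = [2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
ones = [1] * 12

lag12_vs_annual = weighted_lag_ratio(history, [0] * 11 + [12], ones)
lag1_vs_annual = weighted_lag_ratio(history, [12] + [0] * 11, ones)
recent_vs_annual = weighted_lag_ratio(history, [4, 4, 4] + [0] * 9, ones)
```

With these weights a value above 1 means the selected period sold more than the average of the last twelve months, which is what the `_compare_*_to_annual_mean` suffixes suggest.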
In [None]:
# clear memory space

for df in dfs:
    # dfs holds variable names; remove them from the notebook namespace without exec
    globals().pop(df,None)
    globals().pop(df+"_seniority_2",None)

del shop_item_encoding

gc.collect()

### Discard irrelevant features

In [None]:
train_2.drop(features_to_discard,axis=1,inplace=True)

train_2.info(show_counts=True,verbose=True)  # show_counts replaces null_counts (removed in pandas 2.0)

### Export aggregated dataset

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'training'))

# export dataset
train_2.to_pickle(os.path.join(DATA_FOLDER,'training/train_2_pred.pkl'))

In [None]:
# clear memory
del train_2

gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 5 - VALIDATION

## Functions for training and validation
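
The split functions in this section hold out the most recent months rather than a random sample, so that validation mimics forecasting an unseen month. A minimal sketch of that idea on a toy frame, assuming only a `month_id` column and a target column; the helper name `split_last_months` and the toy data are illustrative and not part of the notebook:

```python
import pandas as pd


def split_last_months(df, n_months_val=1, target="item_quantity"):
    """Hold out the last n_months_val months as the validation set."""
    month_split = df["month_id"].max() - (n_months_val - 1)
    train = df[df["month_id"] < month_split]
    val = df[df["month_id"] >= month_split]
    drop_cols = ["month_id", target]
    return (train.drop(columns=drop_cols), train[target],
            val.drop(columns=drop_cols), val[target])


# made-up data: months 30 to 33, one feature, one target
toy = pd.DataFrame({
    "month_id": [30, 30, 31, 32, 33, 33],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "item_quantity": [0, 1, 0, 2, 1, 0],
})
X_train, Y_train, X_val, Y_val = split_last_months(toy, n_months_val=2)
```

Here months 30 and 31 end up in the training set and months 32 and 33 in the validation set, so no information from the held-out months leaks into training.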

In [None]:
# SPLIT TRAIN-VALIDATION SET
def datasplit_train_val(train_df,n_months_val=1,month_id_first=0):
    if month_id_first==0:
        month_id_first=train_df['month_id'].min()
        
    month_id_split=train_df['month_id'].max()-(n_months_val-1)

    # TRAINING SET
    X_train=train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<month_id_split),:].astype(np.float32)
    Y_train=X_train['item_quantity'].astype(np.float32)
    X_train.drop(['month_id','item_quantity'],axis=1,inplace=True)

    print(X_train.shape, Y_train.shape)

    # VALIDATION SET
    X_val=train_df.loc[train_df['month_id']>=month_id_split,:].astype(np.float32)
    Y_val=X_val['item_quantity'].astype(np.float32)
    X_val.drop(['month_id','item_quantity'],axis=1,inplace=True)

    print(X_val.shape , Y_val.shape)

    return (X_train,Y_train,X_val,Y_val)





# SPLIT TRAIN-TEST SET
def datasplit_train_test(train_df,month_id_first=0):
    if month_id_first==0:
        month_id_first=train_df['month_id'].min()
        
    # TRAINING SET
    X_train=train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),:].astype(np.float32)
    Y_train=X_train['item_quantity'].astype(np.float32)
    X_train.drop(['month_id','item_quantity'],axis=1,inplace=True)

    print(X_train.shape, Y_train.shape)

    # TEST SET
    X_test=train_df.loc[train_df['month_id']==34,:].astype(np.float32)
    X_test.drop(['month_id','item_quantity'],axis=1,inplace=True)

    print(X_test.shape)

    return (X_train,Y_train,X_test)





# SPLIT TRAINING SET MONTH BY MONTH
def datasplit_evalset(train_df,month_id_first=0):
    if month_id_first==0:
        month_id_first=train_df['month_id'].min()
        
    eval_set=[]
    for mid in train_df['month_id'].unique():
        if month_id_first<=mid<34:
            X_val=train_df.loc[(train_df['month_id']==mid),:].astype(np.float32)
            Y_val=X_val['item_quantity'].astype(np.float32)
            X_val.drop(['month_id','item_quantity'],axis=1,inplace=True)
            eval_set.append((X_val,Y_val))
        
    return eval_set

In [None]:
# ANALYSIS OF CORRELATIONS OF FEATURES WITH TARGET
def show_feature_target_correlations(train_df,month_id_first=22,clip_value=20):
    n_features=len(train_df.columns)-1
    
    train_clip=train_df.copy()
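    # assign the clipped column back: an in-place clip on a selected column may not update the frame in recent pandas
    train_clip['item_quantity']=train_clip['item_quantity'].clip(0,clip_value)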

    # plot month by month correlations
    fig,ax=plt.subplots(1,1,figsize=(15,np.floor(n_features/3)))
    for i in range(month_id_first,34):
        tmp=train_clip.loc[train_clip['month_id']==i,:].corr()
        tmp.drop('item_quantity',axis=0,inplace=True)
        ax.plot(tmp['item_quantity'],tmp.index,'x')
    ax.grid(True)

    # plot global correlation over all months
    tmp=train_clip.loc[train_clip['month_id']>=month_id_first,:].corr()
    tmp.drop('item_quantity',axis=0,inplace=True)
    ax.plot(tmp['item_quantity'],tmp.index,'ko',markersize=10)

In [None]:
# ANALYSIS OF TARGET VALUE AND NAIVE PREDICTIONS
def target_analysis_val(train_df,n_months_val=1,month_id_first=0):

    # split dataset
    (X_train,Y_train,X_val,Y_val) = datasplit_train_val(train_df,n_months_val,month_id_first)
    evalset = datasplit_evalset(train_df,month_id_first)

    # TRAINING SET
    base_score=Y_train.mean()
    
    Y_guess_train=0*Y_train
    rmse_guess_train=np.sqrt(mean_squared_error(Y_train, Y_guess_train))

    Y_base_train=0*Y_train+base_score
    rmse_base_train=np.sqrt(mean_squared_error(Y_train, Y_base_train))

    print('#### TRAINING SET ###')
    print('mean target value: '+str(base_score))
    print('RMSE (guess,base): '+str(rmse_guess_train)+' , '+str(rmse_base_train))

    
    # VALIDATION SET
    base_score_val=Y_val.mean()
    
    Y_guess_val=0*Y_val 
    rmse_guess_val=np.sqrt(mean_squared_error(Y_val, Y_guess_val))

    Y_base_val=0*Y_val+base_score_val
    rmse_base_val=np.sqrt(mean_squared_error(Y_val, Y_base_val))

    print('#### VALIDATION SET ###')
    print('mean target value: '+str(base_score_val))
    print('RMSE (guess,base): '+str(rmse_guess_val)+' , '+str(rmse_base_val))
    

    # EVALUATION SET
    base_score_evalset=[Y.mean() for (_,Y) in evalset]

    rmse_guess_evalset=[]
    rmse_base_evalset=[]
    for (_,Y) in evalset:
        Y_guess=0*Y
        rmse_guess_evalset.append(np.sqrt(mean_squared_error(Y, Y_guess)))

        Y_base=0*Y+Y.mean()
        rmse_base_evalset.append(np.sqrt(mean_squared_error(Y, Y_base)))

    print('#### EVAL SET ###')
    print('mean target value: '+str([round(b*1e4)/1e4 for b in base_score_evalset]))
    print('RMSE (guess): '+str([round(b*1e2)/1e2 for b in rmse_guess_evalset]))
    print('RMSE (base): '+str([round(b*1e2)/1e2 for b in rmse_base_evalset]))

    
    
    # DISPLAY
    fig,axes=plt.subplots(1,2,figsize=(15,5))
    axes[0].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),base_score_evalset,'-o')
    axes[0].set_xlabel('month_id')
    axes[0].set_ylabel('mean of target value')
    axes[0].grid(True)
    axes[0].set_ylim(bottom=0)

    axes[1].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),rmse_guess_evalset,'-o')
    axes[1].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),rmse_base_evalset,'-o')
    axes[1].set_xlabel('month_id')
    axes[1].set_ylabel('rmse')
    axes[1].legend(['guess = 0','mean value'])
    axes[1].grid(True)
    axes[1].set_ylim(bottom=0)
                             
    return rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset








# ANALYSIS OF TARGET VALUE AND NAIVE PREDICTIONS
def target_analysis_test(train_df,month_id_first=0):

    # split dataset
    (X_train,Y_train,_) = datasplit_train_test(train_df,month_id_first)
    evalset = datasplit_evalset(train_df,month_id_first)

    # BASE SCORE
    base_score=Y_train.mean()
    base_score_evalset=[Y.mean() for (_,Y) in evalset]

    # BENCHMARK (guess with 0 everywhere)
    Y_guess_train=0*Y_train
    rmse_guess_train=np.sqrt(mean_squared_error(Y_train, Y_guess_train))

    Y_base_train=0*Y_train+base_score
    rmse_base_train=np.sqrt(mean_squared_error(Y_train, Y_base_train))

    print('#### TRAINING SET ###')
    print('mean target value: '+str(base_score))
    print('RMSE (guess,base): '+str(rmse_guess_train)+' , '+str(rmse_base_train))

    rmse_guess_evalset=[]
    rmse_base_evalset=[]
    for (_,Y) in evalset:
        Y_guess=0*Y
        rmse_guess_evalset.append(np.sqrt(mean_squared_error(Y, Y_guess)))

        Y_base=0*Y+Y.mean()
        rmse_base_evalset.append(np.sqrt(mean_squared_error(Y, Y_base)))

    print('#### EVAL SET ###')
    print('mean target value: '+str([round(b*1e4)/1e4 for b in base_score_evalset]))
    print('RMSE (guess): '+str([round(b*1e2)/1e2 for b in rmse_guess_evalset]))
    print('RMSE (base): '+str([round(b*1e2)/1e2 for b in rmse_base_evalset]))

    fig,axes=plt.subplots(1,2,figsize=(15,5))
    axes[0].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),base_score_evalset,'-o')
    axes[0].set_xlabel('month_id')
    axes[0].set_ylabel('mean of target value')
    axes[0].grid(True)
    axes[0].set_ylim(bottom=0)

    axes[1].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),rmse_guess_evalset,'-o')
    axes[1].plot(train_df.loc[(train_df['month_id']>=month_id_first)&(train_df['month_id']<34),'month_id'].unique(),rmse_base_evalset,'-o')
    axes[1].set_xlabel('month_id')
    axes[1].set_ylabel('rmse')
    axes[1].legend(['guess = 0','mean value'])
    axes[1].grid(True)
    axes[1].set_ylim(bottom=0)
    
    return rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset

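Both `target_analysis_*` functions above benchmark two naive predictors: a constant 0 ("guess") and the mean of the target ("base"). The sketch below, in plain Python on made-up numbers, shows how the two relate: the RMSE of the mean predictor is the population standard deviation of the target, and the zero predictor can never beat it, because its squared RMSE equals the squared RMSE of the mean predictor plus the squared mean.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# made-up target values
y = [0, 0, 0, 1, 0, 2, 0, 0, 5, 0]
mean_y = sum(y) / len(y)

rmse_guess = rmse(y, [0.0] * len(y))    # predict 0 everywhere
rmse_base = rmse(y, [mean_y] * len(y))  # predict the mean everywhere

# decomposition: rmse_guess**2 == rmse_base**2 + mean_y**2 (up to rounding)
gap = rmse_guess ** 2 - (rmse_base ** 2 + mean_y ** 2)
```

Note that the per-month "base" values in the functions above use each month's own mean, which is not known in advance. They are therefore a lower bound for any constant predictor rather than an achievable baseline.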
In [None]:
# PERFORMANCE ON TRAINING, VALIDATION AND MONTHLY EVAL SETS
def performance_analysis_val(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_val,rmse_base_val,rmse_guess_evalset,rmse_base_evalset):

    rmse_evalset_last=[]
    rmse_evalset_min=[]
    for evset in xgbreg.evals_result().values():
        rmse_evalset_last.append(evset['rmse'][-1])
        rmse_evalset_min.append(min(evset['rmse']))

    # display
    print('#### TRAINING SET ###')
    print('RMSE (last, min, base, guess): '+str(rmse_evalset_last[0]) +' , '+str(rmse_evalset_min[0])+' , '+str(rmse_base_train)+' , '+str(rmse_guess_train))
    print()
    print('#### VALIDATION SET ###')
    print('RMSE (last, min, base, guess): '+str(rmse_evalset_last[1]) +' , '+str(rmse_evalset_min[1])+' , '+str(rmse_base_val)+' , '+str(rmse_guess_val))
    print()
    print('#### EVAL SET ###')
    print('RMSE (last): '+str([round(b*1e4)/1e4 for b in rmse_evalset_last[2:]]))
    print('RMSE (min): '+str([round(b*1e4)/1e4 for b in rmse_evalset_min[2:]]))
    print('RMSE (guess): '+str([round(b*1e4)/1e4 for b in rmse_guess_evalset]))
    print('RMSE (base): '+str([round(b*1e4)/1e4 for b in rmse_base_evalset]))


    fig,axes=plt.subplots(1,2,figsize=(15,5))
    for i in range(2,len(xgbreg.evals_result())):
        evset=xgbreg.evals_result()['validation_'+str(i)]['rmse']
        axes[0].plot(evset,'-o')
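    # number of monthly curves; derived from the results instead of relying on a global eval_set
    axes[0].legend(range(0,len(xgbreg.evals_result())-2))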
    axes[0].plot(xgbreg.evals_result()['validation_0']['rmse'],'k',linewidth=5)
    axes[0].plot(xgbreg.evals_result()['validation_1']['rmse'],'k--',linewidth=5)


    axes[1].plot(rmse_evalset_last[2:],'-o')
    axes[1].plot(0*np.array(rmse_evalset_last[2:])+rmse_evalset_last[0],'k')
    axes[1].plot(0*np.array(rmse_evalset_last[2:])+rmse_evalset_last[1],'k--')
    axes[1].plot(rmse_evalset_min[2:],'-o')
    # red reference lines: minimum RMSE on the training and validation sets
    axes[1].plot(0*np.array(rmse_evalset_min[2:])+rmse_evalset_min[0],'r')
    axes[1].plot(0*np.array(rmse_evalset_min[2:])+rmse_evalset_min[1],'r--')
    axes[1].plot(rmse_guess_evalset,'k-s')
    axes[1].set_xlabel('month_id')
    axes[1].set_ylabel('rmse (last,min)')
    axes[1].grid(True)
    axes[1].set_ylim(bottom=0)
    
    
    

    
# PERFORMANCE ON TRAINING AND MONTHLY EVAL SETS (TRAIN/TEST SPLIT)
def performance_analysis_test(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_evalset,rmse_base_evalset):

    rmse_evalset_last=[]
    rmse_evalset_min=[]
    for evset in xgbreg.evals_result().values():
        rmse_evalset_last.append(evset['rmse'][-1])
        rmse_evalset_min.append(min(evset['rmse']))

    # display
    print('#### TRAINING SET ###')
    print('RMSE (last, min, base, guess): '+str(rmse_evalset_last[0]) +' , '+str(rmse_evalset_min[0])+' , '+str(rmse_base_train)+' , '+str(rmse_guess_train))
    print()
    print('#### EVAL SET ###')
    print('RMSE (last): '+str([round(b*1e4)/1e4 for b in rmse_evalset_last[1:]]))
    print('RMSE (min): '+str([round(b*1e4)/1e4 for b in rmse_evalset_min[1:]]))
    print('RMSE (base): '+str([round(b*1e4)/1e4 for b in rmse_base_evalset]))
    print('RMSE (guess): '+str([round(b*1e4)/1e4 for b in rmse_guess_evalset]))


    fig,axes=plt.subplots(1,2,figsize=(15,5))
    for i in range(1,len(xgbreg.evals_result())):
        evset=xgbreg.evals_result()['validation_'+str(i)]['rmse']
        axes[0].plot(evset,'-o')
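    # number of monthly curves; derived from the results instead of relying on a global eval_set
    axes[0].legend(range(0,len(xgbreg.evals_result())-1))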
    axes[0].plot(xgbreg.evals_result()['validation_0']['rmse'],'k',linewidth=5)


    axes[1].plot(rmse_evalset_last[1:],'-o')
    axes[1].plot(0*np.array(rmse_evalset_last[1:])+rmse_evalset_last[0],'k')
    axes[1].plot(rmse_evalset_min[1:],'-o')
    # red reference line: minimum RMSE on the training set
    axes[1].plot(0*np.array(rmse_evalset_min[1:])+rmse_evalset_min[0],'r')
    axes[1].plot(rmse_guess_evalset,'k-s')
    axes[1].set_xlabel('month_id')
    axes[1].set_ylabel('rmse (last,min)')
    axes[1].grid(True)
    axes[1].set_ylim(bottom=0)

In [None]:
# PLOT LEARNING CURVES
def learning_curves(xgbreg,n_months_val,month_id_first):
    # evals_result() holds: validation_0 = training set, validation_1 = validation set,
    # validation_2 onwards = one entry per month (training months first, then validation months)
    n_sets=len(xgbreg.evals_result())-2
    first_val=n_sets+2-n_months_val   # index of the first validation month
    ntrees=xgbreg.get_num_boosting_rounds()

    fig,axes=plt.subplots(2,2,figsize=(15,10))
    axes[0,0].plot(xgbreg.evals_result()['validation_1']['rmse'])
    for i in range(0,n_months_val):
        axes[0,0].plot(xgbreg.evals_result()['validation_'+str(first_val+i)]['rmse'])
    axes[0,0].set_xlim(0,ntrees)
    axes[0,0].grid(True)
    axes[0,0].set_title('validation')

    axes[0,1].plot(xgbreg.evals_result()['validation_1']['rmse'],label='validation')
    for i in range(0,n_months_val):
        axes[0,1].plot(xgbreg.evals_result()['validation_'+str(first_val+i)]['rmse'],label='month '+str(month_id_first+first_val+i-2))
    axes[0,1].set_xlim(0,ntrees)
    axes[0,1].grid(True)
    axes[0,1].legend()
    axes[0,1].set_title('validation')


    axes[1,0].plot(xgbreg.evals_result()['validation_0']['rmse'],'k',linewidth=3)
    for i in range(2,first_val):
        axes[1,0].plot(xgbreg.evals_result()['validation_'+str(i)]['rmse'])
    axes[1,0].set_xlim(0,ntrees)
    axes[1,0].grid(True)
    axes[1,0].set_title('training')


    axes[1,1].plot(xgbreg.evals_result()['validation_0']['rmse'],'k',linewidth=3,label='training')
    for i in range(2,first_val):
        axes[1,1].plot(xgbreg.evals_result()['validation_'+str(i)]['rmse'],label='month '+str(month_id_first+i-2))
    axes[1,1].set_xlim(0,ntrees)
    axes[1,1].grid(True)
    axes[1,1].legend()
    axes[1,1].set_title('training')
    
    return axes

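The plotting helpers above rely on a positional convention for the entries of `evals_result()`: `validation_0` is the full training set, `validation_1` the validation set, and `validation_2` onwards the individual months. This convention is inferred from the indexing in `performance_analysis_val` and `learning_curves`; the `fit` call itself is not shown in this section, and the sketch additionally assumes that the monthly sets are contiguous and in chronological order. A hypothetical helper that makes the mapping explicit:

```python
def label_eval_sets(n_eval_sets, n_months_val, month_id_first):
    """Map 'validation_i' keys to human-readable roles.

    n_eval_sets is the total number of entries, i.e. 2 + number of monthly sets.
    """
    n_months = n_eval_sets - 2
    if n_months < n_months_val:
        raise ValueError("fewer monthly sets than validation months")
    labels = {"validation_0": "training", "validation_1": "validation"}
    for i in range(n_months):
        month_id = month_id_first + i
        role = "val" if i >= n_months - n_months_val else "train"
        labels["validation_" + str(i + 2)] = "month " + str(month_id) + " (" + role + ")"
    return labels


# example: months 18 to 33 give 16 monthly sets, the last two held out
labels = label_eval_sets(n_eval_sets=18, n_months_val=2, month_id_first=18)
```

Printing `labels` next to the learning curves can help check that each curve is attributed to the intended month.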
In [None]:
# FEATURE IMPORTANCE
def plot_feature_importance(xgbreg,X_train):
    # gain        := average improvement of the optimization metric brought by the splits on this feature
    # total_gain  := same improvement, summed over all splits on this feature
    # weight      := number of times a feature is used to split the data across all trees
    # cover       := average number of observations concerned by the splits on this feature
    # total_cover := same coverage, summed over all splits on this feature

    n_features=len(X_train.columns)
    
    fi_types=[ 'total_gain' , 'gain', 'weight', 'cover' , 'total_cover']
    xgb_fi=pd.DataFrame(index=X_train.columns)
    for typ in fi_types:
        xgb_fi[typ]=pd.Series(data=xgbreg.get_booster().get_score(importance_type=typ),index=X_train.columns)

    xgb_fi.fillna(0,inplace=True)
    xgb_fi=xgb_fi.sort_values(by='gain',ascending=True)

    fig,ax=plt.subplots(1,5,figsize=(18,np.floor(n_features/2.5)))
    for i,typ in enumerate(fi_types[0:5]):
        if i==0:
            ax[i].barh(y=xgb_fi.index,width=xgb_fi[typ].values)
        else:
            ax[i].barh(y=xgb_fi.index,width=xgb_fi[typ].values,tick_label=[None])
        ax[i].grid(True)
        ax[i].set_title(typ)

In [None]:
loaded=%who_ls
loaded.append('loaded')

## ----------------------------------------------

## SENIORITY 0

### Import and process training set

In [None]:
dataset_name='train_0_pred'
n_months_val=2
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

# drop test set for validation purposes
train_X.drop(train_X.loc[train_X['month_id']==34].index,axis=0,inplace=True)

train_X.info(show_counts=True,verbose=True)  # info() prints directly and returns None; show_counts replaces null_counts (removed in pandas 2.0)

In [None]:
# analyse correlation of features with target variable
show_feature_target_correlations(train_X,month_id_first=22)

In [None]:
# analyse target value
rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset = target_analysis_val(train_X,n_months_val=n_months_val,month_id_first=month_id_first)

In [None]:
features_to_keep_0=[]

#--------------
# CATEGORY
features_to_keep_0+=['item_category_id']

features_to_keep_0+=['item_category_freq']
features_to_keep_0+=['item_supercategory_freq']
features_to_keep_0+=['item_category_console_freq']
#features_to_keep_0+=['item_category_digital_freq']

features_to_keep_0+=['item_category_freq_in_seniority']
features_to_keep_0+=['item_supercategory_freq_in_seniority']
features_to_keep_0+=['item_category_console_freq_in_seniority']
#features_to_keep_0+=['item_category_digital_freq_in_seniority']

#features_to_keep_0+=['category_semiannual_avg_recent_rsd']
#features_to_keep_0+=['category_semiannual_avg_annual_rsd']


#--------------
# SHOP
#features_to_keep_0+=['shop_months_since_opening']
#features_to_keep_0+=['shop_opening']










#--------------
# PRICES

# typical prices

features_to_keep_0+=['shop_category_price_median_absolute_mean']         
#features_to_keep_0+=['shop_category_price_median_recent_mean']           
features_to_keep_0+=['shop_category_price_median_lag_12']                

features_to_keep_0+=['category_price_median_recent_mean']                
features_to_keep_0+=['category_price_median_lag_12']                     

#features_to_keep_0+=['supercategory_price_median_recent_mean']       
features_to_keep_0+=['supercategory_price_median_lag_12']                

features_to_keep_0+=['category_price_min_absolute_min']                  
features_to_keep_0+=['category_price_max_absolute_max']                  

features_to_keep_0+=['category_price_min_annual_min']                    
features_to_keep_0+=['category_price_max_annual_max']                    

features_to_keep_0+=['category_price_median_compared_to_supercategory_price_median_absolute_mean']










#--------------
# SHOP
#features_to_keep_0+=['shop_avg_sales_seniority_0_absolute_mean']
#features_to_keep_0+=['shop_avg_sales_seniority_0_annual_mean']           
features_to_keep_0+=['shop_avg_sales_seniority_0_semiannual_mean']       
#features_to_keep_0+=['shop_avg_sales_seniority_0_recent_mean']           
#features_to_keep_0+=['shop_avg_sales_seniority_0_lag_1']                 
#features_to_keep_0+=['shop_avg_sales_seniority_0_lag_12']                


#--------------
# SUPERCATEGORY
#features_to_keep_0+=['supercategory_avg_sales_seniority_0_absolute_mean']
#features_to_keep_0+=['supercategory_avg_sales_seniority_0_annual_mean']          
features_to_keep_0+=['supercategory_avg_sales_seniority_0_semiannual_mean']   
#features_to_keep_0+=['supercategory_avg_sales_seniority_0_recent_mean']          
#features_to_keep_0+=['supercategory_avg_sales_seniority_0_lag_1']                
features_to_keep_0+=['supercategory_avg_sales_seniority_0_lag_12']            


#--------------
# SHOP-SUPERCATEGORY
#features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_absolute_mean']
features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_annual_mean']      
#features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_semiannual_mean']     
#features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_recent_mean']         
#features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_lag_1']               
features_to_keep_0+=['shop_supercategory_avg_sales_seniority_0_lag_12']              




#--------------
# CATEGORY
#features_to_keep_0+=['category_avg_sales_seniority_0_absolute_mean']
features_to_keep_0+=['category_avg_sales_seniority_0_annual_mean']        
features_to_keep_0+=['category_avg_sales_seniority_0_semiannual_mean']    
#features_to_keep_0+=['category_avg_sales_seniority_0_recent_mean']        
#features_to_keep_0+=['category_avg_sales_seniority_0_lag_1']              
#features_to_keep_0+=['category_avg_sales_seniority_0_lag_12']             



#--------------
# SHOP-CATEGORY
#features_to_keep_0+=['shop_category_avg_sales_seniority_0_absolute_mean']
features_to_keep_0+=['shop_category_avg_sales_seniority_0_annual_mean']        
features_to_keep_0+=['shop_category_avg_sales_seniority_0_semiannual_mean']    
features_to_keep_0+=['shop_category_avg_sales_seniority_0_recent_mean']        
features_to_keep_0+=['shop_category_avg_sales_seniority_0_lag_1']              
features_to_keep_0+=['shop_category_avg_sales_seniority_0_lag_12']             





#--------------
# SPATIAL TRENDS
#features_to_keep_0+=['shop_category_avg_sales_compared_to_category_absolute_mean']
features_to_keep_0+=['shop_category_avg_sales_compared_to_category_annual_mean']                  
#features_to_keep_0+=['shop_category_avg_sales_compared_to_category_semiannual_mean']
features_to_keep_0+=['shop_category_avg_sales_compared_to_category_recent_mean']                  
#features_to_keep_0+=['shop_category_avg_sales_compared_to_category_lag_1']
features_to_keep_0+=['shop_category_avg_sales_compared_to_category_lag_12']                       

#features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_absolute_mean']
#features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_annual_mean']       
#features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_semiannual_mean']
features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_recent_mean']        
#features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_lag_1']
#features_to_keep_0+=['shop_supercategory_avg_sales_compared_to_supercategory_lag_12']            


# TEMPORAL TRENDS
#features_to_keep_0+=['shop_avg_sales_compare_recent_mean_to_annual_mean']          
#features_to_keep_0+=['shop_avg_sales_compare_lag1_to_annual_mean']
#features_to_keep_0+=['shop_avg_sales_compare_lag12_to_annual_mean']                

features_to_keep_0+=['supercategory_avg_sales_compare_recent_mean_to_annual_mean'] 
#features_to_keep_0+=['supercategory_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_0+=['supercategory_avg_sales_compare_lag12_to_annual_mean']       

#features_to_keep_0+=['category_avg_sales_compare_recent_mean_to_annual_mean']    
#features_to_keep_0+=['category_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_0+=['category_avg_sales_compare_lag12_to_annual_mean']           

features_to_keep_0+=['category_avg_sales_compared_to_supercategory_recent_mean']  
#features_to_keep_0+=['category_avg_sales_compared_to_supercategory_lag_1']
#features_to_keep_0+=['category_avg_sales_compared_to_supercategory_lag_12']        


# TEMPORAL TRENDS FOR SENIORITY 0 (months of the year where seniority 0 is boosted)
#features_to_keep_0+=['shop_avg_sales_seniority_0_compare_recent_mean_to_annual_mean']
#features_to_keep_0+=['shop_avg_sales_seniority_0_compare_lag1_to_annual_mean']
#features_to_keep_0+=['shop_avg_sales_seniority_0_compare_lag12_to_annual_mean']    

#features_to_keep_0+=['supercategory_avg_sales_seniority_0_compare_recent_mean_to_annual_mean']
#features_to_keep_0+=['supercategory_avg_sales_seniority_0_compare_lag1_to_annual_mean']
features_to_keep_0+=['supercategory_avg_sales_seniority_0_compare_lag12_to_annual_mean']  

#features_to_keep_0+=['category_avg_sales_seniority_0_compare_recent_mean_to_annual_mean']
#features_to_keep_0+=['category_avg_sales_seniority_0_compare_lag1_to_annual_mean']
#features_to_keep_0+=['category_avg_sales_seniority_0_compare_lag12_to_annual_mean'] 
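
The suffixes in the feature names above (`_lag_1`, `_lag_12`, `_recent_mean`, `_annual_mean`, ...) refer to temporal lags and averages over past months of a target-encoded quantity. The following is a minimal sketch of how such columns could be built with pandas. The window lengths (3 months for "recent", 12 for "annual"), the toy column `category_avg_sales` and its values are assumptions made for illustration; the actual definitions are set in the feature engineering part of the notebook and may differ. The sketch also assumes that every month is present for each category.

```python
import pandas as pd

# toy monthly series for a single category (month_id 0..13); values are made up
df = pd.DataFrame({
    'item_category_id': [40] * 14,
    'month_id': list(range(14)),
    'category_avg_sales': [float(m) for m in range(14)],
})
df = df.sort_values(['item_category_id', 'month_id'])
grouped = df.groupby('item_category_id')['category_avg_sales']

# lags: value observed 1 and 12 months before the current month
df['category_avg_sales_lag_1'] = grouped.shift(1)
df['category_avg_sales_lag_12'] = grouped.shift(12)

# averages over past months only: shift by one month first,
# so that the current month never enters its own feature
past = grouped.shift(1)
df['category_avg_sales_recent_mean'] = (
    past.groupby(df['item_category_id'])
        .transform(lambda s: s.rolling(3, min_periods=1).mean())
)
df['category_avg_sales_annual_mean'] = (
    past.groupby(df['item_category_id'])
        .transform(lambda s: s.rolling(12, min_periods=1).mean())
)

print(df.tail(2))
```

Shifting before averaging is the important design choice here: it keeps the target of the current month out of every feature used to predict it.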

In [None]:
# drop features
features_to_discard_0=list( set(list(train_X.columns)) - set(['month_id','item_quantity']+features_to_keep_0) )
train_X.drop(features_to_discard_0,axis=1,inplace=True)

# fill missing values
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_val,Y_val) = datasplit_train_val(train_X,n_months_val,month_id_first)
eval_set = [(X_train,Y_train),(X_val,Y_val)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_0)))
print('number of features kept    : '+str(len(X_train.columns)))

X_train.info(show_counts=True,verbose=True)


# clear memory space
del train_X
gc.collect()
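
`datasplit_train_val` is defined earlier in the notebook. As a reminder of the idea, here is a minimal sketch of a time-based split. It assumes that the helper keeps the months from `month_id_first` onwards, holds out the last `n_months_val` months for validation, and removes `month_id` and `item_quantity` from the feature matrices. The function name `split_by_month` and the toy data are hypothetical, and the real helper may differ in its details.

```python
import pandas as pd

def split_by_month(data, n_months_val, month_id_first, target='item_quantity'):
    """Hold out the last n_months_val months for validation (hypothetical stand-in)."""
    data = data[data['month_id'] >= month_id_first]
    first_val_month = data['month_id'].max() - n_months_val + 1
    is_val = data['month_id'] >= first_val_month
    features = data.drop(columns=['month_id', target])
    return (features[~is_val], data.loc[~is_val, target],
            features[is_val], data.loc[is_val, target])

# toy data: month 17 is before month_id_first, months 32 and 33 form the validation set
toy = pd.DataFrame({
    'month_id': [17, 18, 30, 31, 32, 33],
    'item_category_id': [40, 40, 55, 55, 30, 30],
    'item_quantity': [1.0, 0.0, 2.0, 1.0, 0.0, 3.0],
})
X_tr, Y_tr, X_va, Y_va = split_by_month(toy, n_months_val=2, month_id_first=18)
print(len(X_tr), len(X_va))
```

Splitting by month rather than at random keeps the validation months strictly after the training months, which mirrors the forecasting setting of the challenge.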

### XGBRegressor - validation

In [None]:
# SET XGBOOST PARAMETERS

xgb_params_0={'objective':        'reg:squarederror',
# TREE SPECIFIC PARAMETERS
              'max_depth':        4,
              'min_child_weight': 1e3,
              'subsample':        0.8,
              'colsample_bytree': 0.8,
# PARAMETERS RELATED TO THE LEARNING/BOOSTING PROCESS
              'learning_rate':    0.01,
              'n_estimators':     300,  
# MISCELLANEOUS PARAMETERS
              'base_score':       0.6,
              'n_jobs':           4,
              'random_state':     2
             }




# FIT MODEL
ts=time.time()

# eval_metric is set on the estimator; recent xgboost releases no longer accept it in fit()
xgbreg=XGBRegressor(eval_metric='rmse',**xgb_params_0)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,verbose=False)

print(time.time()-ts)

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_val(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_val,rmse_base_val,rmse_guess_evalset,rmse_base_evalset)

# learning curves
axes=learning_curves(xgbreg,n_months_val,month_id_first)

axes[0,0].set_ylim(1.7,2.7)
axes[0,1].set_ylim(1.7,2.3)
axes[1,0].set_ylim(0.9,3)
axes[1,1].set_ylim(0.9,3)

del axes

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

In [None]:
# clear memory
del X_train, Y_train, X_val, Y_val
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset
del xgbreg

gc.collect()

In [None]:
# append variables to list of variables to keep
loaded.extend(['features_to_keep_0','features_to_discard_0','xgb_params_0'])

In [None]:
reset_variable_space
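
The two cells above rely on a whitelist: the names appended to `loaded` are the variables to keep when the variable space is reset, and everything else can be dropped. `reset_variable_space` itself is defined earlier in the notebook. The sketch below is only a hypothetical stand-in (the function `reset_namespace` and the toy dictionary are assumptions) that shows the whitelist idea on a plain dictionary rather than on the real notebook globals.

```python
def reset_namespace(namespace, keep):
    """Delete every public name in `namespace` that is not listed in `keep` (hypothetical helper)."""
    for name in list(namespace):
        if not name.startswith('_') and name not in keep:
            del namespace[name]

# toy namespace standing in for the notebook globals
namespace = {
    'features_to_keep_0': ['item_category_id'],
    'xgb_params_0': {'max_depth': 4},
    'X_train': 'large object',
    '_private': 1,
}
loaded = ['features_to_keep_0', 'xgb_params_0']

reset_namespace(namespace, loaded)
print(sorted(namespace))
```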

## ----------------------------------------------

## SENIORITY 1

### Import and process training set

In [None]:
dataset_name='train_1_pred'
n_months_val=2
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

# drop test set for validation purposes
train_X.drop(train_X.loc[train_X['month_id']==34].index,axis=0,inplace=True)

train_X.info(show_counts=True,verbose=True)

In [None]:
# analyse correlation of features with target variable
show_feature_target_correlations(train_X,month_id_first=22,clip_value=6)

In [None]:
# analyse target value
rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset = target_analysis_val(train_X,n_months_val=n_months_val,month_id_first=month_id_first)

In [None]:
features_to_keep_1=[]

#--------------
# CATEGORY
features_to_keep_1+=['item_category_id']                                     

features_to_keep_1+=['item_category_freq']                                   
features_to_keep_1+=['item_supercategory_freq']                              
features_to_keep_1+=['item_category_console_freq']                            
features_to_keep_1+=['item_category_digital_freq']

#features_to_keep_1+=['item_category_freq_in_seniority']
#features_to_keep_1+=['item_supercategory_freq_in_seniority']
#features_to_keep_1+=['item_category_console_freq_in_seniority']
#features_to_keep_1+=['item_category_digital_freq_in_seniority']

#features_to_keep_1+=['category_semiannual_avg_recent_rsd']
#features_to_keep_1+=['category_semiannual_avg_annual_rsd']


#--------------
# SHOP
features_to_keep_1+=['shop_months_since_opening']
features_to_keep_1+=['shop_opening']

#--------------
# ITEM
features_to_keep_1+=['item_freq_in_seniority']

#--------------
# RELATIVE TIME FEATURES
features_to_keep_1+=['item_months_since_release']










#--------------
# PRICES

features_to_keep_1+=['item_price_median_absolute_mean']                
features_to_keep_1+=['item_price_median_last_sale']                    
features_to_keep_1+=['item_price_min_absolute_min']                    
features_to_keep_1+=['item_price_max_absolute_max']                    

#features_to_keep_1+=['item_price_median_compared_to_category_price_median_absolute_mean']                
#features_to_keep_1+=['item_price_median_compared_to_category_price_median_last_sale']                    

features_to_keep_1+=['category_price_median_recent_mean']                    










#--------------
# ITEM
features_to_keep_1+=['item_max_quantity_absolute_max'] 

#features_to_keep_1+=['item_avg_sales_absolute_mean']
#features_to_keep_1+=['item_avg_sales_annual_mean']
#features_to_keep_1+=['item_avg_sales_semiannual_mean']
features_to_keep_1+=['item_avg_sales_recent_mean']                
features_to_keep_1+=['item_avg_sales_lag_1']                        
#features_to_keep_1+=['item_avg_sales_lag_12']
features_to_keep_1+=['item_avg_sales_last_sale']                   


#features_to_keep_1+=['item_avg_sales_over_sold_absolute_mean']
#features_to_keep_1+=['item_avg_sales_over_sold_annual_mean']
#features_to_keep_1+=['item_avg_sales_over_sold_semiannual_mean']
features_to_keep_1+=['item_avg_sales_over_sold_recent_mean']                  
features_to_keep_1+=['item_avg_sales_over_sold_lag_1']                        
#features_to_keep_1+=['item_avg_sales_over_sold_lag_12']                         
features_to_keep_1+=['item_avg_sales_over_sold_last_sale']                    


#features_to_keep_1+=['item_avg_sales_seniority_1_absolute_mean']
#features_to_keep_1+=['item_avg_sales_seniority_1_annual_mean']
#features_to_keep_1+=['item_avg_sales_seniority_1_semiannual_mean']
features_to_keep_1+=['item_avg_sales_seniority_1_recent_mean']
features_to_keep_1+=['item_avg_sales_seniority_1_lag_1']
#features_to_keep_1+=['item_avg_sales_seniority_1_lag_12']
features_to_keep_1+=['item_avg_sales_seniority_1_last_sale']



#--------------
# SHOP
#features_to_keep_1+=['shop_avg_sales_seniority_1_absolute_mean']
#features_to_keep_1+=['shop_avg_sales_seniority_1_annual_mean']
#features_to_keep_1+=['shop_avg_sales_seniority_1_semiannual_mean']
#features_to_keep_1+=['shop_avg_sales_seniority_1_recent_mean']
#features_to_keep_1+=['shop_avg_sales_seniority_1_lag_1']
features_to_keep_1+=['shop_avg_sales_seniority_1_lag_12']                                                     

#features_to_keep_1+=['shop_avg_sales_absolute_mean']
#features_to_keep_1+=['shop_avg_sales_annual_mean']
#features_to_keep_1+=['shop_avg_sales_semiannual_mean']
#features_to_keep_1+=['shop_avg_sales_recent_mean']
#features_to_keep_1+=['shop_avg_sales_lag_1']                                                                          
features_to_keep_1+=['shop_avg_sales_lag_12']                                                                         



#--------------
# MONTHS SINCE RELEASE
features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_absolute_mean']                       
features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_annual_mean']
#features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_semiannual_mean']
features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_recent_mean']               
#features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_lag_1']
features_to_keep_1+=['category_months_since_release_avg_sales_seniority_1_lag_12']                   

features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_absolute_mean']                        
features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_annual_mean'] 
#features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_semiannual_mean']
features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_recent_mean']                         
#features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_lag_1']               
features_to_keep_1+=['shop_category_months_since_release_avg_sales_seniority_1_lag_12']                                 

features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_absolute_mean']
features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_annual_mean']
#features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_semiannual_mean']
features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_recent_mean']
#features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_lag_1']
features_to_keep_1+=['supercategory_months_since_release_avg_sales_seniority_1_lag_12']                                

features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_absolute_mean']
features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_annual_mean']
#features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_semiannual_mean']
features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_recent_mean']
#features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_lag_1']
features_to_keep_1+=['shop_supercategory_months_since_release_avg_sales_seniority_1_lag_12']                         



#--------------
# SPATIAL TRENDS
#features_to_keep_1+=['shop_category_avg_sales_compared_to_category_absolute_mean']
features_to_keep_1+=['shop_category_avg_sales_compared_to_category_annual_mean']               
#features_to_keep_1+=['shop_category_avg_sales_compared_to_category_semiannual_mean']
features_to_keep_1+=['shop_category_avg_sales_compared_to_category_recent_mean']              
features_to_keep_1+=['shop_category_avg_sales_compared_to_category_lag_1']                    
features_to_keep_1+=['shop_category_avg_sales_compared_to_category_lag_12']                    


# TEMPORAL TRENDS
#features_to_keep_1+=['shop_avg_sales_compare_recent_mean_to_annual_mean']
#features_to_keep_1+=['shop_avg_sales_compare_lag1_to_annual_mean']                              
features_to_keep_1+=['shop_avg_sales_compare_lag12_to_annual_mean']                               

features_to_keep_1+=['supercategory_avg_sales_compare_recent_mean_to_annual_mean']        
#features_to_keep_1+=['supercategory_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_1+=['supercategory_avg_sales_compare_lag12_to_annual_mean']              

features_to_keep_1+=['category_avg_sales_compare_recent_mean_to_annual_mean']               
#features_to_keep_1+=['category_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_1+=['category_avg_sales_compare_lag12_to_annual_mean']                     

features_to_keep_1+=['category_avg_sales_compared_to_supercategory_recent_mean']          
#features_to_keep_1+=['category_avg_sales_compared_to_supercategory_lag_1']                
features_to_keep_1+=['category_avg_sales_compared_to_supercategory_lag_12']                    


# MONTH OF RELEASE
features_to_keep_1+=['supercategory_month_of_release_avg_sales_compared_to_supercategory_absolute_mean']         
features_to_keep_1+=['category_month_of_release_avg_sales_compared_to_category_absolute_mean']                  

In [None]:
# drop features
features_to_discard_1=list( set(list(train_X.columns)) - set(['month_id','item_quantity']+features_to_keep_1) )
train_X.drop(features_to_discard_1,axis=1,inplace=True)

# fill missing values
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_val,Y_val) = datasplit_train_val(train_X,n_months_val,month_id_first)
eval_set = [(X_train,Y_train),(X_val,Y_val)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_1)))
print('number of features kept    : '+str(len(X_train.columns)))

X_train.info(show_counts=True,verbose=True)


# clear memory space
del train_X
gc.collect()

### XGBRegressor - validation

In [None]:
# SET XGBOOST PARAMETERS

xgb_params_1={'objective':        'reg:squarederror',
# TREE SPECIFIC PARAMETERS
              'max_depth':        5,
              'min_child_weight': 1e3,
              'subsample':        0.8,
              'colsample_bytree': 0.8,
# PARAMETERS RELATED TO THE LEARNING/BOOSTING PROCESS
              'learning_rate':    0.05,
              'n_estimators':     500,  
# MISCELLANEOUS PARAMETERS
              'base_score':       0.05,
              'n_jobs':           4,
              'random_state':     2
             }


# restrict the clipping range of the training target
# for seniority 1, target values larger than 8 are so rare that they can be treated as noise;
# clipping the training target to [0, 6] filters them out and improves the overall prediction on this class of samples
Y_train.clip(0,6,inplace=True)



# FIT MODEL
ts=time.time()

# eval_metric is set on the estimator; recent xgboost releases no longer accept it in fit()
xgbreg=XGBRegressor(eval_metric='rmse',**xgb_params_1)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,verbose=False)

print(time.time()-ts)
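
The effect of the tighter clipping on the training target can be illustrated on toy numbers. The sketch below is not the notebook's model: it uses the mean of the training target as a stand-in for a fitted predictor, and the target values are made up. It only shows how a few rare large values pull the prediction up for all samples.

```python
from statistics import mean

def clip(values, low, high):
    """Clip each value to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]

# made-up training target: 98 samples with 0 or 1 sales, plus two rare large values
y_train = [0.0] * 90 + [1.0] * 8 + [20.0, 20.0]

raw_mean = mean(y_train)                  # pulled up by the two large values
clipped_mean = mean(clip(y_train, 0, 6))  # large values capped at 6

print(round(raw_mean, 2), round(clipped_mean, 2))
```

In the cell above only `Y_train` is clipped. `Y_val` is left untouched, so the validation score is still measured against the original target values.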

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_val(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_val,rmse_base_val,rmse_guess_evalset,rmse_base_evalset)

# learning curves
axes=learning_curves(xgbreg,n_months_val,month_id_first)

axes[0,0].set_ylim(0.25,0.35)
axes[0,1].set_ylim(0.28,0.31)
axes[1,0].set_ylim(0.2,0.6)
axes[1,1].set_ylim(0.2,0.6)

del axes

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

In [None]:
# clear memory
del X_train, Y_train, X_val, Y_val
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset
del xgbreg

gc.collect()

In [None]:
# append variables to list of variables to keep
loaded.extend(['features_to_keep_1','features_to_discard_1','xgb_params_1'])

In [None]:
reset_variable_space

## ----------------------------------------------

## SENIORITY 2

### Import and process training set

In [None]:
dataset_name='train_2_pred'
n_months_val=2
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

# drop test set for validation purposes
train_X.drop(train_X.loc[train_X['month_id']==34].index,axis=0,inplace=True)

train_X.info(show_counts=True,verbose=True)

In [None]:
# analyse correlation of features with target variable
show_feature_target_correlations(train_X,month_id_first=22)

In [None]:
# analyse target value
rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset = target_analysis_val(train_X,n_months_val=n_months_val,month_id_first=month_id_first)

In [None]:
features_to_keep_2=[]

#--------------
# CATEGORY
features_to_keep_2+=['item_category_id']                                      

features_to_keep_2+=['item_category_freq']                                    
features_to_keep_2+=['item_supercategory_freq']                               
features_to_keep_2+=['item_category_console_freq']                            
#features_to_keep_2+=['item_category_digital_freq']

#features_to_keep_2+=['item_category_freq_in_seniority']
#features_to_keep_2+=['item_supercategory_freq_in_seniority']
#features_to_keep_2+=['item_category_console_freq_in_seniority']
#features_to_keep_2+=['item_category_digital_freq_in_seniority']

features_to_keep_2+=['category_semiannual_avg_recent_rsd']
features_to_keep_2+=['category_semiannual_avg_annual_rsd']


#--------------
# SHOP
features_to_keep_2+=['shop_months_since_opening']

#--------------
# ITEM
#features_to_keep_2+=['item_freq_in_seniority']

#--------------
# RELATIVE TIME FEATURES
features_to_keep_2+=['item_months_since_release']
features_to_keep_2+=['item_months_since_first_sale_in_shop']
features_to_keep_2+=['item_months_since_last_sale_in_shop']










#--------------
# PRICES

features_to_keep_2+=['item_price_median_absolute_mean']                
features_to_keep_2+=['item_price_median_last_sale']                    
features_to_keep_2+=['item_price_min_absolute_min']                    
features_to_keep_2+=['item_price_max_absolute_max']                    

features_to_keep_2+=['item_price_median_compared_to_category_price_median_absolute_mean']                
features_to_keep_2+=['item_price_median_compared_to_category_price_median_last_sale']                    

features_to_keep_2+=['category_price_median_recent_mean']                    










#--------------
# SHOP-ITEM
#features_to_keep_2+=['item_quantity_absolute_mean']
features_to_keep_2+=['item_quantity_annual_mean']                    
features_to_keep_2+=['item_quantity_semiannual_mean']                
features_to_keep_2+=['item_quantity_recent_mean']                    
features_to_keep_2+=['item_quantity_lag_1']                          
features_to_keep_2+=['item_quantity_lag_12']                         

features_to_keep_2+=['item_quantity_extralin2']                       
features_to_keep_2+=['item_quantity_extralin3']    

features_to_keep_2+=['item_quantity_absolute_min']                    
features_to_keep_2+=['item_quantity_absolute_max']                    

#--------------
# ITEM
#features_to_keep_2+=['item_avg_sales_absolute_mean']
#features_to_keep_2+=['item_avg_sales_annual_mean']
#features_to_keep_2+=['item_avg_sales_semiannual_mean']
features_to_keep_2+=['item_avg_sales_recent_mean']                  
features_to_keep_2+=['item_avg_sales_lag_1']                        
#features_to_keep_2+=['item_avg_sales_lag_12']
features_to_keep_2+=['item_avg_sales_last_sale']                    


#features_to_keep_2+=['item_avg_sales_over_sold_absolute_mean']
#features_to_keep_2+=['item_avg_sales_over_sold_annual_mean']
#features_to_keep_2+=['item_avg_sales_over_sold_semiannual_mean']
features_to_keep_2+=['item_avg_sales_over_sold_recent_mean']                    
features_to_keep_2+=['item_avg_sales_over_sold_lag_1']                          
features_to_keep_2+=['item_avg_sales_over_sold_lag_12']                         
features_to_keep_2+=['item_avg_sales_over_sold_last_sale']                      


#features_to_keep_2+=['item_avg_sales_seniority_2_absolute_mean']
#features_to_keep_2+=['item_avg_sales_seniority_2_annual_mean']
#features_to_keep_2+=['item_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['item_avg_sales_seniority_2_recent_mean']
#features_to_keep_2+=['item_avg_sales_seniority_2_lag_1']
#features_to_keep_2+=['item_avg_sales_seniority_2_lag_12']
#features_to_keep_2+=['item_avg_sales_seniority_2_last_sale']



#--------------
# SHOP
#features_to_keep_2+=['shop_avg_sales_seniority_2_absolute_mean']
#features_to_keep_2+=['shop_avg_sales_seniority_2_annual_mean']
#features_to_keep_2+=['shop_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['shop_avg_sales_seniority_2_recent_mean']
#features_to_keep_2+=['shop_avg_sales_seniority_2_lag_1']
features_to_keep_2+=['shop_avg_sales_seniority_2_lag_12']                                             
                                                                    
#--------------
# SUPERCATEGORY
#features_to_keep_2+=['supercategory_avg_sales_seniority_2_absolute_mean']
#features_to_keep_2+=['supercategory_avg_sales_seniority_2_annual_mean']                
#features_to_keep_2+=['supercategory_avg_sales_seniority_2_semiannual_mean']
features_to_keep_2+=['supercategory_avg_sales_seniority_2_recent_mean']               
features_to_keep_2+=['supercategory_avg_sales_seniority_2_lag_1']                     
features_to_keep_2+=['supercategory_avg_sales_seniority_2_lag_12']                       

#--------------
# SHOP-SUPERCATEGORY
#features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_absolute_mean']
#features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_annual_mean']                            
#features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_recent_mean']                              
#features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_lag_1']                                    
features_to_keep_2+=['shop_supercategory_avg_sales_seniority_2_lag_12']                                   

#--------------
# CATEGORY
#features_to_keep_2+=['category_avg_sales_seniority_2_absolute_mean']
features_to_keep_2+=['category_avg_sales_seniority_2_annual_mean']             
features_to_keep_2+=['category_avg_sales_seniority_2_semiannual_mean']             
features_to_keep_2+=['category_avg_sales_seniority_2_recent_mean']                   
features_to_keep_2+=['category_avg_sales_seniority_2_lag_1']                    
#features_to_keep_2+=['category_avg_sales_seniority_2_lag_12']                      

#--------------
# SHOP-CATEGORY
#features_to_keep_2+=['shop_category_avg_sales_seniority_2_absolute_mean']
features_to_keep_2+=['shop_category_avg_sales_seniority_2_annual_mean']                                              
features_to_keep_2+=['shop_category_avg_sales_seniority_2_semiannual_mean']                                            
features_to_keep_2+=['shop_category_avg_sales_seniority_2_recent_mean']                                                
features_to_keep_2+=['shop_category_avg_sales_seniority_2_lag_1']                                                      
features_to_keep_2+=['shop_category_avg_sales_seniority_2_lag_12']                                                     
                  


    
    

#--------------
# MONTHS SINCE RELEASE
#features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_absolute_mean']         
#features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_annual_mean']
#features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_semiannual_mean']
features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_recent_mean']              
#features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_lag_1']
#features_to_keep_2+=['category_months_since_release_avg_sales_seniority_2_lag_12']                  

features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_absolute_mean']                       
#features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_annual_mean'] 
#features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_recent_mean']                         
#features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_lag_1']               
features_to_keep_2+=['shop_category_months_since_release_avg_sales_seniority_2_lag_12']                               


#features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_absolute_mean']
#features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_annual_mean']
#features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_recent_mean']
#features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_lag_1']
features_to_keep_2+=['supercategory_months_since_release_avg_sales_seniority_2_lag_12']                             


#--------------
# MONTHS SINCE FIRST SALE IN SHOP
#features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_absolute_mean']                
#features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_annual_mean']           
#features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_semiannual_mean']
features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_recent_mean']              
#features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_lag_1']                  
#features_to_keep_2+=['category_months_since_first_sale_in_shop_avg_sales_seniority_2_lag_12']                  

#features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_absolute_mean']            
#features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_annual_mean']
#features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_semiannual_mean']
#features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_recent_mean']               
#features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_lag_1']
features_to_keep_2+=['shop_category_months_since_first_sale_in_shop_avg_sales_seniority_2_lag_12']                



#--------------
# SPATIAL TRENDS
#features_to_keep_2+=['shop_category_avg_sales_compared_to_category_absolute_mean']
features_to_keep_2+=['shop_category_avg_sales_compared_to_category_annual_mean']              
#features_to_keep_2+=['shop_category_avg_sales_compared_to_category_semiannual_mean']
features_to_keep_2+=['shop_category_avg_sales_compared_to_category_recent_mean']            
#features_to_keep_2+=['shop_category_avg_sales_compared_to_category_lag_1']                  
#features_to_keep_2+=['shop_category_avg_sales_compared_to_category_lag_12']                


# TEMPORAL TRENDS
#features_to_keep_2+=['shop_avg_sales_compare_recent_mean_to_annual_mean']
#features_to_keep_2+=['shop_avg_sales_compare_lag1_to_annual_mean']                               
features_to_keep_2+=['shop_avg_sales_compare_lag12_to_annual_mean']                              

features_to_keep_2+=['supercategory_avg_sales_compare_recent_mean_to_annual_mean']         
#features_to_keep_2+=['supercategory_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_2+=['supercategory_avg_sales_compare_lag12_to_annual_mean']               

features_to_keep_2+=['category_avg_sales_compare_recent_mean_to_annual_mean']            
#features_to_keep_2+=['category_avg_sales_compare_lag1_to_annual_mean']
features_to_keep_2+=['category_avg_sales_compare_lag12_to_annual_mean']                  

features_to_keep_2+=['category_avg_sales_compared_to_supercategory_recent_mean']           
#features_to_keep_2+=['category_avg_sales_compared_to_supercategory_lag_1']                
features_to_keep_2+=['category_avg_sales_compared_to_supercategory_lag_12']                      


# TEMPORAL TRENDS FOR SENIORITY 2 (months of the year where seniority 2 is boosted due to previous release of new items)
#features_to_keep_2+=['shop_avg_sales_seniority_2_compare_recent_mean_to_annual_mean']
#features_to_keep_2+=['shop_avg_sales_seniority_2_compare_lag1_to_annual_mean']                   
features_to_keep_2+=['shop_avg_sales_seniority_2_compare_lag12_to_annual_mean']                 

#features_to_keep_2+=['supercategory_avg_sales_seniority_2_compare_recent_mean_to_annual_mean']
#features_to_keep_2+=['supercategory_avg_sales_seniority_2_compare_lag1_to_annual_mean']
features_to_keep_2+=['supercategory_avg_sales_seniority_2_compare_lag12_to_annual_mean']         

#features_to_keep_2+=['category_avg_sales_seniority_2_compare_recent_mean_to_annual_mean']
#features_to_keep_2+=['category_avg_sales_seniority_2_compare_lag1_to_annual_mean']
features_to_keep_2+=['category_avg_sales_seniority_2_compare_lag12_to_annual_mean']              


# MONTH OF RELEASE
features_to_keep_2+=['supercategory_month_of_release_avg_sales_compared_to_supercategory_absolute_mean']          
features_to_keep_2+=['category_month_of_release_avg_sales_compared_to_category_absolute_mean']                    

In [None]:
# drop features
features_to_discard_2=list( set(list(train_X.columns)) - set(['month_id','item_quantity']+features_to_keep_2) )
train_X.drop(features_to_discard_2,axis=1,inplace=True)

# fill missing values
train_X['item_quantity_extralin2']=train_X['item_quantity_extralin2'].fillna(train_X['item_quantity_lag_1'])
train_X['item_quantity_extralin3']=train_X['item_quantity_extralin3'].fillna(train_X['item_quantity_lag_1'])
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_val,Y_val) = datasplit_train_val(train_X,n_months_val,month_id_first)
eval_set = [(X_train,Y_train),(X_val,Y_val)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_2)))
print('number of features kept    : '+str(len(X_train.columns)))

print(X_train.info(null_counts=True,verbose=True))


# clear memory space
del train_X
gc.collect()
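
The split above holds out the most recent months for validation. The helper `datasplit_train_val` is defined elsewhere in the notebook, so the function below is only a hypothetical stand-in that illustrates the idea on a toy frame. The name `split_by_month`, the choice to keep `month_id` among the features, and the toy values are all assumptions.

```python
import pandas as pd

def split_by_month(df, n_months_val, month_id_first, target="item_quantity"):
    """Hypothetical temporal split: the last n_months_val months are held out."""
    # discard months before the first month used for training
    df = df[df["month_id"] >= month_id_first]
    # validation starts n_months_val months before the end of the data
    first_val_month = df["month_id"].max() - n_months_val + 1
    is_val = df["month_id"] >= first_val_month
    features = df.drop(columns=[target])
    return (features[~is_val], df.loc[~is_val, target],
            features[is_val], df.loc[is_val, target])

# toy data (made up for illustration)
toy = pd.DataFrame({
    "month_id": [18, 19, 20, 21, 22, 23],
    "item_quantity_lag_1": [0.0, 1.0, 2.0, 1.0, 0.0, 3.0],
    "item_quantity": [1.0, 2.0, 1.0, 0.0, 3.0, 2.0],
})
X_tr, y_tr, X_va, y_va = split_by_month(toy, n_months_val=2, month_id_first=18)
```

Splitting on `month_id` rather than at random keeps every validation month strictly after the training months, which mimics the prediction setting where the target month lies in the future.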

### XGBRegressor - validation

In [None]:
# SET XGBOOST PARAMETERS

xgb_params_2={'objective':        'reg:squarederror',
# TREE SPECIFIC PARAMETERS
              'max_depth':        6,
              'min_child_weight': 300,
              'subsample':        0.8,
              'colsample_bytree': 0.8,
# PARAMETERS RELATED TO THE LEARNING/BOOSTING PROCESS
              'learning_rate':    0.02,
              'n_estimators':     2000,  
# MISCELLANEOUS PARAMETERS
              'base_score':       0.4,
              'n_jobs':           4,
              'random_state':     2
             }



# FIT MODEL
ts=time.time()

xgbreg=XGBRegressor(**xgb_params_2)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,eval_metric='rmse',verbose=False)

print(time.time()-ts)

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_val(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_val,rmse_base_val,rmse_guess_evalset,rmse_base_evalset)

# learning curves
axes=learning_curves(xgbreg,n_months_val,month_id_first)

axes[0,0].set_ylim(0.8,0.95)
axes[0,1].set_ylim(0.804,0.82)
axes[1,0].set_ylim(0.7,0.9)
axes[1,1].set_ylim(0.9,1.4)

del axes

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

In [None]:
# clear memory
del X_train, Y_train, X_val, Y_val
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_val, rmse_base_val, rmse_guess_evalset, rmse_base_evalset
del xgbreg

gc.collect()

In [None]:
# append variables to list of variables to keep
loaded.extend(['features_to_keep_2','features_to_discard_2','xgb_params_2'])

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 6 - TRAINING FOR PREDICTION

In [None]:
# create directories
create_directory(os.path.join(DATA_FOLDER, 'predictions'))
create_directory(os.path.join(DATA_FOLDER, 'predictions/models/'))

## SENIORITY 0

### Import and process training set

In [None]:
dataset_name='train_0_pred'
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

print(train_X.info(null_counts=True,verbose=True))

In [None]:
# analyse dataset
rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset = target_analysis_test(train_X,month_id_first=18)

In [None]:
# drop features
train_X.drop(features_to_discard_0,axis=1,inplace=True)

# fill missing values
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_test) = datasplit_train_test(train_X,month_id_first)
eval_set = [(X_train,Y_train)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_0)))
print('number of features kept    : '+str(len(X_train.columns)))

print(X_train.info(null_counts=True,verbose=True))


# clear memory space
del train_X
gc.collect()

### XGBRegressor - training for predictions

In [None]:
# FIT MODEL
ts=time.time()

xgbreg=XGBRegressor(**xgb_params_0)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,eval_metric='rmse',verbose=False)

print(time.time()-ts)

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_test(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_evalset,rmse_base_evalset)

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

### Save model and test set

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'predictions/models/xgbreg_seniority0'))

# export model
with open(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority0/model.pickle'), 'wb') as f:
    pickle.dump(xgbreg, f)
X_test.to_pickle(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority0/X_test.pkl'))
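
As a side note on the export step, here is a minimal sketch of a pickle round trip in which the file handle is closed explicitly by a context manager. The helpers `save_pickle` and `load_pickle` are hypothetical names, and a plain dictionary stands in for the fitted model so that the example runs without xgboost.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    # the context manager closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    params = {"max_depth": 6, "learning_rate": 0.02}  # stands in for a fitted model
    save_pickle(params, path)
    restored = load_pickle(path)
```

Note that a pickled model can generally only be reloaded reliably with the same library versions that produced it.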

In [None]:
# clear memory
del X_train, Y_train, X_test
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset
del xgbreg

del features_to_keep_0, features_to_discard_0
del xgb_params_0

loaded.remove('features_to_keep_0')
loaded.remove('features_to_discard_0')
loaded.remove('xgb_params_0')

gc.collect()

In [None]:
reset_variable_space

# ----------------------------------------------

## SENIORITY 1

### Import and process training set

In [None]:
dataset_name='train_1_pred'
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

print(train_X.info(null_counts=True,verbose=True))

In [None]:
# analyse dataset
rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset = target_analysis_test(train_X,month_id_first=18)

In [None]:
# drop features
train_X.drop(features_to_discard_1,axis=1,inplace=True)

# fill missing values
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_test) = datasplit_train_test(train_X,month_id_first)
eval_set = [(X_train,Y_train)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_1)))
print('number of features kept    : '+str(len(X_train.columns)))

print(X_train.info(null_counts=True,verbose=True))


# clear memory space
del train_X
gc.collect()

### XGBRegressor - training for predictions

In [None]:
# restrict clipping range for training data
# for seniority 1, we observed that target values larger than 8 are so rare that they may be considered noise
# clipping target values to 6 acts as a filter that improves the overall prediction on that class of samples
Y_train.clip(0,6,inplace=True)
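
To make the effect of this filter concrete, the sketch below applies the same clipping to a toy target series and measures the share of samples it modifies. The values are made up for illustration. Only the training target is clipped here; the predictions are clipped separately when the submission is assembled.

```python
import pandas as pd

# toy target values (hypothetical), including two large outliers and a negative value
y = pd.Series([0.0, 1.0, 3.0, 7.0, 15.0, -1.0])

# values below 0 are raised to 0, values above 6 are lowered to 6
y_clipped = y.clip(0, 6)

# fraction of samples whose target was changed by the clipping
share_affected = (y != y_clipped).mean()
```

On the real training set, a small value of `share_affected` would support the view that the clipped samples are rare enough to be treated as noise.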

In [None]:
# FIT MODEL
ts=time.time()

xgbreg=XGBRegressor(**xgb_params_1)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,eval_metric='rmse',verbose=False)

print(time.time()-ts)

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_test(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_evalset,rmse_base_evalset)

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

### Save model and test set

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'predictions/models/xgbreg_seniority1'))

# export model
with open(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority1/model.pickle'), 'wb') as f:
    pickle.dump(xgbreg, f)
X_test.to_pickle(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority1/X_test.pkl'))

In [None]:
# clear memory
del X_train, Y_train, X_test
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset
del xgbreg

del features_to_keep_1, features_to_discard_1
del xgb_params_1

loaded.remove('features_to_keep_1')
loaded.remove('features_to_discard_1')
loaded.remove('xgb_params_1')

gc.collect()

In [None]:
reset_variable_space

# ----------------------------------------------

## SENIORITY 2

### Import and process training set

In [None]:
dataset_name='train_2_pred'
month_id_first=18

train_X=pd.read_pickle(os.path.join(DATA_FOLDER,'training/'+dataset_name+'.pkl'))

print(train_X.info(null_counts=True,verbose=True))

In [None]:
# analyse dataset
rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset = target_analysis_test(train_X,month_id_first=18)

In [None]:
# drop features
train_X.drop(features_to_discard_2,axis=1,inplace=True)

# fill missing values
train_X['item_quantity_extralin2']=train_X['item_quantity_extralin2'].fillna(train_X['item_quantity_lag_1'])
train_X['item_quantity_extralin3']=train_X['item_quantity_extralin3'].fillna(train_X['item_quantity_lag_1'])
train_X.fillna(-1,inplace=True)

# split dataset
(X_train,Y_train,X_test) = datasplit_train_test(train_X,month_id_first)
eval_set = [(X_train,Y_train)]+datasplit_evalset(train_X,month_id_first)

# show dataset info
print('number of features to keep : '+str(len(features_to_keep_2)))
print('number of features kept    : '+str(len(X_train.columns)))

print(X_train.info(null_counts=True,verbose=True))


# clear memory space
del train_X
gc.collect()

### XGBRegressor - training for predictions

In [None]:
# FIT MODEL
ts=time.time()

xgbreg=XGBRegressor(**xgb_params_2)
xgbreg.fit(X_train,Y_train,eval_set=eval_set,eval_metric='rmse',verbose=False)

print(time.time()-ts)

In [None]:
# PERFORMANCE ANALYSIS
performance_analysis_test(xgbreg,rmse_guess_train,rmse_base_train,rmse_guess_evalset,rmse_base_evalset)

In [None]:
# FEATURE IMPORTANCE
plot_feature_importance(xgbreg,X_train)

### Save model and test set

In [None]:
# create directory
create_directory(os.path.join(DATA_FOLDER, 'predictions/models/xgbreg_seniority2'))

# export model
with open(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority2/model.pickle'), 'wb') as f:
    pickle.dump(xgbreg, f)
X_test.to_pickle(os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority2/X_test.pkl'))

In [None]:
# clear memory
del X_train, Y_train, X_test
del eval_set
del rmse_guess_train, rmse_base_train, rmse_guess_evalset, rmse_base_evalset
del xgbreg

del features_to_keep_2, features_to_discard_2
del xgb_params_2

loaded.remove('features_to_keep_2')
loaded.remove('features_to_discard_2')
loaded.remove('xgb_params_2')

gc.collect()

In [None]:
reset_variable_space

## -------------------------------------------------------------

# 7 - ASSEMBLE PREDICTION

In [None]:
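def predict_seniority(seniority,ntree=0):
    # return the pairs (shop_id,item_id) of the given seniority for month_id 34, with their predictions
    # ntree is the number of trees used for the prediction (0 = use all trees)

    # import model and test set
    model_folder = os.path.join(DATA_FOLDER,'predictions/models/xgbreg_seniority'+str(seniority))
    with open(os.path.join(model_folder,'model.pickle'), 'rb') as f:
        xgbreg = pickle.load(f)
    X_test = pd.read_pickle(os.path.join(model_folder,'X_test.pkl'))

    # form prediction, clipped to the range [0,20]
    # note: ntree_limit was removed in xgboost 2.0; with recent versions use iteration_range=(0,ntree) instead
    Y_pred_test=xgbreg.predict(X_test,ntree_limit=ntree).clip(0,20)

    # retrieve the pairs (shop_id,item_id) of month 34 (copy to avoid assigning to a slice of the dataframe)
    Y_pred=pd.read_pickle(os.path.join(DATA_FOLDER,'processed/train_'+str(seniority)+'.pkl'))
    Y_pred=Y_pred.loc[Y_pred['month_id']==34,['shop_id','item_id']].copy()
    Y_pred['prediction']=Y_pred_test

    return Y_pred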

In [None]:
# build global prediction

# number of trees used for predictions for pairs of seniority 0,1,2 respectively
ntrees=[200,400,2000]

Y_0=predict_seniority(seniority=0,ntree=ntrees[0])
Y_1=predict_seniority(seniority=1,ntree=ntrees[1])
Y_2=predict_seniority(seniority=2,ntree=ntrees[2])
Y_pred=pd.concat([Y_0,Y_1,Y_2],axis=0,sort=False)

test_df = pd.read_csv(os.path.join(RAW_DATA_FOLDER, 'test.csv'))
test_df=test_df.join(Y_pred.set_index(['shop_id','item_id']),on=['shop_id','item_id'])

print(test_df.info(null_counts=True))
test_df
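
The assembly step relies on each pair (shop, item) of the test set belonging to exactly one seniority class. The sketch below reproduces the concatenation and the join on toy frames and counts duplicated and missing pairs. All values are made up, and the names `n_duplicates` and `n_missing` are hypothetical.

```python
import pandas as pd

# toy per-seniority predictions (hypothetical values)
y_0 = pd.DataFrame({"shop_id": [5], "item_id": [100], "prediction": [0.3]})
y_1 = pd.DataFrame({"shop_id": [5], "item_id": [101], "prediction": [0.1]})
y_2 = pd.DataFrame({"shop_id": [6], "item_id": [100], "prediction": [1.7]})
y_pred = pd.concat([y_0, y_1, y_2], axis=0, sort=False)

# toy test set: the pair (6, 101) has no prediction on purpose
test_df = pd.DataFrame({"ID": [0, 1, 2, 3],
                        "shop_id": [5, 5, 6, 6],
                        "item_id": [100, 101, 100, 101]})

# a pair appearing in two seniority classes would duplicate rows in the join
n_duplicates = int(y_pred.duplicated(["shop_id", "item_id"]).sum())

# left join on the pair (shop_id, item_id), as in the cell above
merged = test_df.join(y_pred.set_index(["shop_id", "item_id"]),
                      on=["shop_id", "item_id"])

# pairs without prediction end up as NaN after the join
n_missing = int(merged["prediction"].isna().sum())
```

On the real data, both counters would be expected to be zero; a nonzero `n_missing` would show up as null entries in the output of `test_df.info`.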

In [None]:
# build and export submission
submission = pd.DataFrame({
    "ID": test_df['ID'], 
    "item_cnt_month": test_df['prediction']
})

submission.to_csv('xgb_prediction.csv', index=False)
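
Before uploading, it may be worth checking the exported table. The sketch below is a hypothetical helper, `check_submission`, run on a toy submission: it verifies the column names, the number of rows, the absence of missing values, and that the predictions lie in the clipping range [0, 20] used in `predict_seniority`.

```python
import io
import pandas as pd

def check_submission(df, n_rows, low=0.0, high=20.0):
    """Return a list of problems found in a submission table (empty if none)."""
    problems = []
    if list(df.columns) != ["ID", "item_cnt_month"]:
        problems.append("unexpected columns")
    if len(df) != n_rows:
        problems.append("unexpected number of rows")
    if df["item_cnt_month"].isna().any():
        problems.append("missing predictions")
    if not df["item_cnt_month"].dropna().between(low, high).all():
        problems.append("predictions outside clipping range")
    return problems

# toy submission (made-up values)
submission = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": [0.3, 20.0, 1.7]})
problems = check_submission(submission, n_rows=3)

# write to an in-memory buffer to inspect the CSV header without touching the disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
```

With the real data, `n_rows` would be the number of rows of `test.csv`.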

In [None]:
# clear memory
del test_df
del Y_0, Y_1, Y_2, Y_pred
del submission

gc.collect()

In [None]:
reset_variable_space