In [1]:
%matplotlib inline
import gc
from tqdm import tqdm_notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import seaborn as sns
sns.set_style("whitegrid")
# Bigger font
# sns.set_context("poster")
sns.set_context("talk")
# Figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20, 5
# np.random.seed(123)

# This function really happens to be useful. Memory matters.
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    return df

items  = downcast_dtypes(pd.read_csv('data/items.csv'))
train = downcast_dtypes(pd.read_csv('data/sales_train.csv.gz'))
test = downcast_dtypes(pd.read_csv('data/test.csv.gz'))
item_category = downcast_dtypes(pd.read_csv('data/item_categories.csv'))
shops = downcast_dtypes(pd.read_csv('data/shops.csv'))

train['date'] = pd.to_datetime(train['date'], format='%d.%m.%Y')
# To make things simpler
# train = train.rename(columns = {'item_cnt_day':'target'})

My previous solution was incorrect: it used the last column as the target, but that column was not the actual target month.

Something important to remember:

    date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

## Training and Test data

## Feature matrix

Time series data has to be processed in specific ways before it can be used for forecasting.

From the solutions I have looked at, the training set has to have the same columns as the test set. Once `date_block_num` is added (34 for November 2015), the test set has these columns:

    shop_id | item_id | date_block_num

So the training set for this competition should have the same layout.

But we can add more features for each row, including the rows of the test set. The two ways to do so are:

- Mean encoding
- Shifting (adding lag features)

That way, the feature columns would look something like this:

    shop_id | item_id | date_block_num | shop_encoded | item_encoded | target_lag_1 | target_lag_2

You might expect the test set to have 0 in those extra columns, but that is where these two methods come in: both fill them from targets observed in the training months.
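
As a minimal sketch of why the test rows do not end up with zeros, the toy frame below (hypothetical values, not competition data) builds `target_lag_1` by moving `date_block_num` forward by one month and merging on the keys, which is the same idea `get_feature_matrix` uses later in this notebook. The row for month 34 picks up the target observed in month 33.

```python
import pandas as pd

# Toy data (hypothetical values): one shop/item pair over three months,
# where month 34 plays the role of the test month with an unknown target.
toy = pd.DataFrame({
    'shop_id':        [5, 5, 5],
    'item_id':        [100, 100, 100],
    'date_block_num': [32, 33, 34],
    'target':         [2.0, 3.0, 0.0],   # 0.0 is only a placeholder for month 34
})

index_cols = ['shop_id', 'item_id', 'date_block_num']

# Move every row one month into the future and rename its target,
# so that month m carries the target observed in month m - 1.
shifted = toy[index_cols + ['target']].copy()
shifted['date_block_num'] += 1
shifted = shifted.rename(columns={'target': 'target_lag_1'})

toy_lagged = toy.merge(shifted, on=index_cols, how='left').fillna(0)
print(toy_lagged)
```

Month 32 has no earlier month in the toy data, so its lag is filled with 0, just as the first months are in the real feature matrix.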

## Dataset partitions

### Test Set - November 2015

For the test data, we only consider the month to be forecast, though other approaches are possible.

### Training Set - January 2013 to October 2015

#### Training Set - local training set X_train

Normally, the training set could be a random partition of the data.

For time series, however, it is better to split by time period. In this case, by month:

- All months except for the last 4

#### Validation Set - local validation set X_val

- The last 3 months: August, September, October from 2015
- Also, November from 2014 (I believe this is key)

## Filling out values that are not in the training set
    
For example, this shop/item pair appears in the test set but has no rows in the training set:

In [2]:
test.loc[(test.shop_id == 5) & (test.item_id == 5320),:]

Unnamed: 0,ID,shop_id,item_id
1,1,5,5320


In [3]:
train.loc[(train.shop_id == 5) & (train.item_id == 5320),:]

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day


We can add such pairs to the training set with `target = 0`.

We could fill in all possible shop/item combinations for every month.

But the approach from the course assignment (a per-month grid of the shops and items seen in that month) was by far the best in my own experiments.
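
The idea of filling missing pairs with `target = 0` can be sketched on toy data. The frame and its values below are hypothetical; the sketch only illustrates the grid-plus-left-join pattern and is not the assignment's code.

```python
from itertools import product

import pandas as pd

# Toy sales (hypothetical values): shop 1 sold item 10 twice, shop 2 sold item 20 once.
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0],
    'shop_id':        [1, 1, 2],
    'item_id':        [10, 10, 20],
    'item_cnt_day':   [1.0, 2.0, 1.0],
})
index_cols = ['shop_id', 'item_id', 'date_block_num']

# Every shop seen in a month is paired with every item seen in that month.
rows = []
for block_num, month_sales in toy_sales.groupby('date_block_num'):
    shops = month_sales['shop_id'].unique()
    items = month_sales['item_id'].unique()
    rows.extend(product(shops, items, [block_num]))
toy_grid = pd.DataFrame(rows, columns=index_cols)

# Monthly totals per pair; pairs without sales get target = 0 after the left join.
monthly = (toy_sales.groupby(index_cols, as_index=False)['item_cnt_day'].sum()
           .rename(columns={'item_cnt_day': 'target'}))
toy_filled = toy_grid.merge(monthly, on=index_cols, how='left').fillna(0)
print(toy_filled)
```

The two pairs that never sold, (shop 1, item 20) and (shop 2, item 10), appear in the result with a target of 0.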

## Get a feature matrix

In [4]:
from itertools import product

## This comes straight from the ensembling assignment
def get_feature_matrix(sales, test, items, list_lags):
    
    # Create "grid" with columns
    index_cols = ['shop_id', 'item_id', 'date_block_num']

    # For every month we create a grid from all shops/items combinations from that month
    grid = [] 
    new_items = pd.DataFrame()
    cur_items_aux=np.array([])
    for block_num in sales['date_block_num'].unique():
        cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
        # Series.append was removed in pandas 2.0; pd.concat does the same job
        cur_items = pd.concat([sales.loc[sales['date_block_num'] == block_num, 'item_id'], pd.Series(cur_items_aux)]).unique()
        cur_items_aux = cur_items[pd.Series(cur_items).isin(test.item_id)]
        grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

    # Turn the grid into a dataframe
    grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

    # Add submission shop_id-item_id in order to test predictions
    test['date_block_num'] = 34
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    grid = pd.concat([grid, test[['shop_id', 'item_id', 'date_block_num']]])

    # Groupby data to get shop-item-month aggregates
    # (named aggregation replaces the nested-dict form, which newer pandas rejects)
    gb = sales.groupby(index_cols, as_index=False).agg(target=('item_cnt_day', 'sum'))
    # Join it to the grid
    all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)

    # Same as above but with shop-month aggregates
    gb = sales.groupby(['shop_id', 'date_block_num'], as_index=False).agg(target_shop=('item_cnt_day', 'sum'))
    all_data = pd.merge(all_data, gb, how='left', on=['shop_id', 'date_block_num']).fillna(0)

    # Same as above but with item-month aggregates
    gb = sales.groupby(['item_id', 'date_block_num'], as_index=False).agg(target_item=('item_cnt_day', 'sum'))
    all_data = pd.merge(all_data, gb, how='left', on=['item_id', 'date_block_num']).fillna(0)

    # Downcast dtypes from 64 to 32 bit to save memory
    all_data = downcast_dtypes(all_data)
    del grid, gb 
    gc.collect()
    # List of columns that we will use to create lags
    cols_to_rename = list(all_data.columns.difference(index_cols)) 

    shift_range = list_lags

    for month_shift in tqdm_notebook(shift_range):
        train_shift = all_data[index_cols + cols_to_rename].copy()
    
        train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift
    
        foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x
        train_shift = train_shift.rename(columns=foo)

        all_data = pd.merge(all_data, train_shift, on=index_cols, how='left').fillna(0)

    del train_shift

#     # Don't use old data from year 2013
#     all_data = all_data[all_data['date_block_num'] >= date_block_threshold] 

    # List of all lagged features
    fit_cols = [col for col in all_data.columns if col[-1] in [str(item) for item in shift_range]] 
    # We will drop these at fitting stage
    to_drop_cols = list(set(list(all_data.columns)) - (set(fit_cols)|set(index_cols))) + ['date_block_num'] 

    # Category for each item
    item_category_mapping = items[['item_id','item_category_id']].drop_duplicates()

    all_data = pd.merge(all_data, item_category_mapping, how='left', on='item_id')
    all_data = downcast_dtypes(all_data)
    gc.collect();
    
    return [all_data, to_drop_cols]

In [5]:
list_lags = [1, 2, 3, 4, 5, 6, 12]
sales_for_modelling = train[train.item_id.isin(test.item_id)]
[all_data, to_drop_cols]  = get_feature_matrix(sales_for_modelling, test, items, list_lags)


In [6]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id
0,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37
1,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55
2,59,2607,0,0.0,452.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55
3,59,2614,0,0.0,452.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55
4,59,2808,0,15.0,452.0,858.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30


We can use the month start date as the index, which makes it easier to add more lag features.

In [7]:
import datetime
from dateutil.relativedelta import relativedelta
gendates = {}
for i in range(35):
    gendates[i] = datetime.date(2013,1,1) + relativedelta(months=+i)
    
all_data = all_data.set_index(all_data['date_block_num'].map(gendates).values)
all_data.index = pd.to_datetime(all_data.index)
all_data.sort_index(inplace=True)
all_data.head(2)

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id
2013-01-01,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37
2013-01-01,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55


In [8]:
all_data.tail(2)

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id
2015-11-01,45,19648,34,0.0,0.0,0.0,0.0,2.0,683.0,0.0,3.0,624.0,0.0,7.0,653.0,0.0,2.0,565.0,0.0,4.0,533.0,0.0,4.0,640.0,0.0,0.0,0.0,40
2015-11-01,45,969,34,0.0,0.0,0.0,0.0,3.0,683.0,0.0,5.0,624.0,0.0,1.0,653.0,0.0,2.0,565.0,0.0,2.0,533.0,0.0,3.0,640.0,0.0,6.0,956.0,37
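
The same block-to-date mapping can also be built with pandas alone. This is a small alternative sketch, not the code used above; `month_starts` and `block_to_date` are names introduced here for illustration.

```python
import pandas as pd

# Month-start dates for blocks 0..34: January 2013 through November 2015.
month_starts = pd.date_range('2013-01-01', periods=35, freq='MS')
block_to_date = dict(enumerate(month_starts))

print(block_to_date[0].date(), block_to_date[33].date(), block_to_date[34].date())
```

Block 33 maps to October 2015 and block 34 to November 2015, which agrees with the `date_block_num` definition quoted at the top.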


## Adding month (0-11) and season

In [9]:
all_data['month'] = (all_data['date_block_num']%12)
all_data['season'] = (all_data['month']%11 + 1)//3
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,month,season
2013-01-01,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37,0,0
2013-01-01,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0
2013-01-01,59,2607,0,0.0,452.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0
2013-01-01,59,2614,0,0.0,452.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0
2013-01-01,59,2808,0,15.0,452.0,858.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30,0,0
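
The `season` expression is compact, so here is a quick check of which months it groups together. The month names are added only for readability; the formula is the one from the cell above, with 0 standing for January.

```python
# Month numbers as used in this notebook: 0 = January, ..., 11 = December.
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

seasons = {}
for month in range(12):
    season = (month % 11 + 1) // 3   # same expression as in the cell above
    seasons.setdefault(season, []).append(month_names[month])

for season, names in sorted(seasons.items()):
    print(season, names)
```

Season 0 covers December to February, 1 covers March to May, 2 covers June to August, and 3 covers September to November.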


## Feature Interactions

Here I combine some features I think might help.

In [10]:
all_data['shop_comb_category'] = all_data['shop_id'].astype(str) + '_' + all_data['item_category_id'].astype(str)
all_data['shop_comb_item'] = all_data['shop_id'].astype(str) + '_' + all_data['item_id'].astype(str)

In [11]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,month,season,shop_comb_category,shop_comb_item
2013-01-01,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37,0,0,59_37,59_22154
2013-01-01,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0,59_55,59_2574
2013-01-01,59,2607,0,0.0,452.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0,59_55,59_2607
2013-01-01,59,2614,0,0.0,452.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0,0,59_55,59_2614
2013-01-01,59,2808,0,15.0,452.0,858.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30,0,0,59_30,59_2808


## Mean Encoding without Regularization


In [12]:
from sklearn.model_selection import KFold

def mean_encoding(all_data, feature):
    enc_name = feature + '_enc'
    globalmean = all_data.target.mean()
    item_id_target_mean = all_data.groupby(feature).target.mean()
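    # Map each category to its target mean; categories without a mean get the global mean.
    # Assigning the result avoids a chained in-place fillna, which newer pandas may not apply.
    all_data[enc_name] = all_data[feature].map(item_id_target_mean).fillna(globalmean)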
    
mean_encoding(all_data, 'shop_comb_category')
mean_encoding(all_data, 'shop_comb_item')
mean_encoding(all_data, 'month')
mean_encoding(all_data, 'season')

all_data = all_data.drop(columns=['shop_comb_category', 'shop_comb_item', 'month', 'season'])

all_data.tail()

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,shop_comb_category_enc,shop_comb_item_enc,month_enc,season_enc
2015-11-01,45,18454,34,0.0,0.0,0.0,1.0,2.0,683.0,0.0,1.0,624.0,0.0,3.0,653.0,0.0,12.0,565.0,0.0,19.0,533.0,0.0,26.0,640.0,0.0,0.0,0.0,55,0.195658,0.75,0.319235,0.405111
2015-11-01,45,16188,34,0.0,0.0,0.0,0.0,1.0,683.0,0.0,3.0,624.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64,0.236737,0.0,0.319235,0.405111
2015-11-01,45,15757,34,0.0,0.0,0.0,0.0,5.0,683.0,0.0,3.0,624.0,0.0,4.0,653.0,0.0,4.0,565.0,0.0,8.0,533.0,0.0,11.0,640.0,0.0,9.0,956.0,55,0.195658,0.2,0.319235,0.405111
2015-11-01,45,19648,34,0.0,0.0,0.0,0.0,2.0,683.0,0.0,3.0,624.0,0.0,7.0,653.0,0.0,2.0,565.0,0.0,4.0,533.0,0.0,4.0,640.0,0.0,0.0,0.0,40,0.164583,0.0,0.319235,0.405111
2015-11-01,45,969,34,0.0,0.0,0.0,0.0,3.0,683.0,0.0,5.0,624.0,0.0,1.0,653.0,0.0,2.0,565.0,0.0,2.0,533.0,0.0,3.0,640.0,0.0,6.0,956.0,37,0.116873,0.277778,0.319235,0.405111


In [13]:
all_data.head()

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,shop_comb_category_enc,shop_comb_item_enc,month_enc,season_enc
2013-01-01,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37,0.107549,0.028571,0.573074,0.641294
2013-01-01,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0.18541,0.628571,0.573074,0.641294
2013-01-01,59,2607,0,0.0,452.0,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0.18541,0.457143,0.573074,0.641294
2013-01-01,59,2614,0,0.0,452.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0.18541,0.114286,0.573074,0.641294
2013-01-01,59,2808,0,15.0,452.0,858.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30,2.067194,7.257143,0.573074,0.641294
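
Note that `mean_encoding` above averages over every row of `all_data`, including the block 34 rows whose `target` is only a placeholder 0. One possible variant, sketched here under assumptions and not used in this notebook, computes the means only from rows before a cutoff month and then maps them onto all rows. `mean_encode_before` and `cutoff` are hypothetical names, and the values are toy data.

```python
import pandas as pd

def mean_encode_before(df, feature, cutoff, target='target'):
    """Encode `feature` with target means computed only from rows
    whose date_block_num is below `cutoff` (hypothetical helper)."""
    history = df[df['date_block_num'] < cutoff]
    global_mean = history[target].mean()
    group_means = history.groupby(feature)[target].mean()
    # Categories unseen before the cutoff fall back to the global mean.
    return df[feature].map(group_means).fillna(global_mean)

# Toy data (hypothetical values); block 34 has a placeholder target of 0.
toy = pd.DataFrame({
    'date_block_num': [32, 32, 33, 33, 34, 34],
    'shop_id':        [1, 2, 1, 2, 1, 3],
    'target':         [2.0, 0.0, 4.0, 2.0, 0.0, 0.0],
})
toy['shop_id_enc'] = mean_encode_before(toy, 'shop_id', cutoff=34)
print(toy)
```

This is still an unregularized encoding: each training row contributes to its own category mean, so it does not remove target leakage within the training months.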


## Lag features

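Here I build lag features by shifting rows by the test set size, 214200 rows. Note that a fixed row shift matches a one-month lag only if every month has that many rows in the same order, and the counts printed below show that the monthly blocks differ in size.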

In [14]:
for i in range(0,35):
    print(len(all_data[all_data.date_block_num == i]), end=", ", flush=True)

42255, 46230, 48898, 50355, 54765, 58466, 60444, 61470, 64575, 71162, 73755, 80132, 82340, 86112, 93264, 99519, 103390, 107702, 115600, 122808, 127100, 142428, 147000, 157000, 160850, 155711, 160264, 167072, 165836, 168689, 173935, 178458, 187068, 208428, 214200, 

In [15]:
test_size = test.shape[0]
test_size

214200

The lags are built at multiples of this size (only a multiple of 1 in the cell below).

In [16]:
def lag_features(data, lag_columns):
    shift_range = [1]
    data_tmp = data.copy()
    for month_shift in shift_range:
        step = month_shift * test_size
        print(step)
        data = pd.concat([data, data_tmp[lag_columns].shift(step)], axis=1)
    lag_column_names = [col[:col.index('_enc')] + '_' + str(step) for col in lag_columns for step in shift_range]
    data.columns = np.append(data_tmp.columns.values, lag_column_names)
    del data_tmp
    gc.collect();
    return data

lag_columns = ['shop_comb_category_enc', 
              'shop_comb_item_enc',
              'month_enc', 'season_enc']
all_data_lag = lag_features(all_data, lag_columns)

214200
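
To make the difference between a positional shift and a key-based lag concrete, here is a toy comparison with hypothetical values. It is a sketch for illustration, not a replacement for `lag_features`.

```python
import pandas as pd

# Toy data (hypothetical values): month 0 has 3 rows, month 1 has 2 rows.
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1, 1],
    'item_id':        [10, 20, 30, 10, 30],
    'value':          [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Positional lag: shift by the number of rows in the last month (2 here).
toy['value_shift'] = toy['value'].shift(2)

# Key-based lag: move month m to m + 1 and merge on (item_id, date_block_num).
prev = toy[['date_block_num', 'item_id', 'value']].copy()
prev['date_block_num'] += 1
prev = prev.rename(columns={'value': 'value_lag_1'})
toy = toy.merge(prev, on=['date_block_num', 'item_id'], how='left')
print(toy)
```

For item 10 in month 1, the positional shift returns the value of item 20 from month 0, while the key-based lag returns item 10's own value from month 0.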


In [17]:
all_data_lag.head(2)

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,shop_comb_category_enc,shop_comb_item_enc,month_enc,season_enc,shop_comb_category_1,shop_comb_item_1,month_1,season_1
2013-01-01,59,22154,0,1.0,452.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37,0.107549,0.028571,0.573074,0.641294,,,,
2013-01-01,59,2574,0,2.0,452.0,119.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55,0.18541,0.628571,0.573074,0.641294,,,,


In [18]:
all_data_lag.tail(2)

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,shop_comb_category_enc,shop_comb_item_enc,month_enc,season_enc,shop_comb_category_1,shop_comb_item_1,month_1,season_1
2015-11-01,45,19648,34,0.0,0.0,0.0,0.0,2.0,683.0,0.0,3.0,624.0,0.0,7.0,653.0,0.0,2.0,565.0,0.0,4.0,533.0,0.0,4.0,640.0,0.0,0.0,0.0,40,0.164583,0.0,0.319235,0.405111,0.368869,0.0,0.447834,0.405111
2015-11-01,45,969,34,0.0,0.0,0.0,0.0,3.0,683.0,0.0,5.0,624.0,0.0,1.0,653.0,0.0,2.0,565.0,0.0,2.0,533.0,0.0,3.0,640.0,0.0,6.0,956.0,37,0.116873,0.277778,0.319235,0.405111,0.414286,0.0,0.447834,0.405111


## Omitting year 2013

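This is done after adding the extra lag features. Note that the filter below reads from `all_data`, not from the `all_data_lag` frame returned by `lag_features`, so the shifted `*_1` columns are not carried forward, as the column list in the next output confirms.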

In [19]:
date_block_threshold = 12
all_data_lag = all_data[all_data['date_block_num'] >= date_block_threshold] 

I drop any remaining NaN values; there should be none.

In [20]:
all_data_lag = all_data_lag.dropna()
all_data_lag.head(2)

Unnamed: 0,shop_id,item_id,date_block_num,target,target_shop,target_item,target_lag_1,target_item_lag_1,target_shop_lag_1,target_lag_2,target_item_lag_2,target_shop_lag_2,target_lag_3,target_item_lag_3,target_shop_lag_3,target_lag_4,target_item_lag_4,target_shop_lag_4,target_lag_5,target_item_lag_5,target_shop_lag_5,target_lag_6,target_item_lag_6,target_shop_lag_6,target_lag_12,target_item_lag_12,target_shop_lag_12,item_category_id,shop_comb_category_enc,shop_comb_item_enc,month_enc,season_enc
2014-01-01,54,10297,12,4.0,3416.0,23.0,3.0,42.0,4282.0,0.0,2.0,3085.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37,0.579359,0.5,0.573074,0.641294
2014-01-01,54,10298,12,14.0,3416.0,182.0,21.0,369.0,4282.0,119.0,1309.0,3085.0,7.0,144.0,2464.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40,1.350627,12.105263,0.573074,0.641294


## Train/validation split

The validation set is the last 6 training months: `date_block_num` 28 to 33 (May to October 2015).

In [21]:
train_new = all_data_lag[all_data_lag.date_block_num <= 33]
test_new = all_data_lag[all_data_lag.date_block_num >= 34]
assert len(test_new) == len(test)

Then these are the training set and the new test set.

'target' is removed from the list of features of the test set.

In [22]:
test_new = test_new.drop(columns=['target'])

The local modelling, however, uses only the training set, split by month into a training part and a validation part:

In [23]:
boolean_val = (train_new['date_block_num'].isin([28, 29, 30, 31, 32, 33]))
boolean_train = ~boolean_val
dates_train = train_new['date_block_num'][boolean_train]
dates_val  = train_new['date_block_num'][boolean_val]

X_train = train_new.loc[boolean_train, (train_new.columns != 'target')]
X_val =  train_new.loc[boolean_val, (train_new.columns != 'target')]
y_train = train_new.loc[boolean_train, 'target'].values
y_val = train_new.loc[boolean_val, 'target'].values

print('X_train shape is ' + str(X_train.shape))
print('X_val shape is ' + str(X_val.shape))

X_train shape is (2028160, 31)
X_val shape is (1082414, 31)
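
The month-based split can be wrapped in a small helper and checked on toy data. `split_by_blocks` is a hypothetical name introduced for this sketch; it mirrors the boolean masks used above and uses made-up values.

```python
import pandas as pd

def split_by_blocks(df, val_blocks, target='target'):
    """Split `df` into train/validation parts by date_block_num
    (hypothetical helper mirroring the boolean masks above)."""
    is_val = df['date_block_num'].isin(val_blocks)
    features = df.columns != target
    X_tr, y_tr = df.loc[~is_val, features], df.loc[~is_val, target].values
    X_va, y_va = df.loc[is_val, features], df.loc[is_val, target].values
    return X_tr, X_va, y_tr, y_va

# Toy data (hypothetical values): blocks 26 to 29, validating on 28 and 29.
toy = pd.DataFrame({
    'date_block_num': [26, 27, 28, 29],
    'shop_id':        [1, 1, 1, 1],
    'target':         [0.0, 1.0, 2.0, 3.0],
})
X_tr, X_va, y_tr, y_va = split_by_blocks(toy, val_blocks=[28, 29])

# Every training month precedes every validation month.
assert X_tr['date_block_num'].max() < X_va['date_block_num'].min()
print(X_tr.shape, X_va.shape)
```

The assertion is a cheap guard worth keeping with any time-based split: it fails as soon as a validation month is earlier than a training month.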


In [24]:
np.savez_compressed('data/X_train.npz', X_train=X_train)
np.savez_compressed('data/X_val.npz', X_val=X_val)
np.savez_compressed('data/test_new.npz', test_new=test_new)

np.savez_compressed('data/y_train.npz', y_train=y_train)
np.savez_compressed('data/y_val.npz', y_val=y_val)
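
Passing a DataFrame to `np.savez_compressed` stores only its values as an array, so the column names are not part of the archive. A minimal sketch of one way to keep them, using a toy frame and a temporary directory (the file name and the array keys `X` and `columns` are assumptions made for this example):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for X_train.
toy = pd.DataFrame({'shop_id': [1, 2], 'target_lag_1': [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'X_toy.npz')
    # Store the values and the column names side by side.
    np.savez_compressed(path, X=toy.to_numpy(), columns=toy.columns.to_numpy(dtype=str))
    with np.load(path) as archive:
        restored = pd.DataFrame(archive['X'], columns=archive['columns'])

print(restored)
```

Because the frame mixes integer and float columns, the stored array has a single float dtype, so `shop_id` comes back as a float column.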