This notebook walks through basic EDA, data preparation, and feature generation for predicting monthly sales. Many of the ideas, tips and tricks come from https://www.kaggle.com/dlarionov/feature-engineering-xgboost

Pipeline:
* Check missing values
* Handle outliers
* Cleaning shops/categories
* Define our target
* Lags and mean encodings
* Price trending features
* Extra features

First we import everything we need

In [None]:
import numpy as np 
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline 

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

import os
import gc
from itertools import product
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import time

In [None]:
DATA_FOLDER = '../input/'
sales = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv'))
shops_df = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
cat_df = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
items_df = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
test_df = pd.read_csv(os.path.join(DATA_FOLDER, 'test.csv'))

Let's check missing values

In [None]:
print(sales.isnull().sum())
print(shops_df.isnull().sum())
print(cat_df.isnull().sum())
print(items_df.isnull().sum())
print(test_df.isnull().sum())
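
If you prefer one compact table over five printouts, the same check can be wrapped in a small helper. This is only a sketch on toy frames: `missing_counts` and the `toy_frames` dict are names I made up for illustration, not part of the competition data.

```python
import pandas as pd

def missing_counts(frames):
    """Return the total number of NaN cells for each named DataFrame."""
    return pd.Series({name: int(df.isnull().sum().sum()) for name, df in frames.items()})

# Toy frames standing in for sales, shops_df, etc.
toy_frames = {
    'with_gap': pd.DataFrame({'x': [1.0, None], 'y': [1, 2]}),
    'complete': pd.DataFrame({'z': ['p', 'q']}),
}
missing_report = missing_counts(toy_frames)
print(missing_report)
```

In the notebook you would pass something like `{'sales': sales, 'shops': shops_df, ...}` instead of the toy dict.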

As you can see, there are no missing values, so we can move on and get acquainted with the data. Let's print the sales table

In [None]:
sales.head()

There is no column with monthly sales for a specific item/shop pair; there is only a column with daily sales. So later we will need to calculate how many items were sold per month in each shop. But for now, let's do some analysis.

**<font size=4>Dealing with outliers</font>**

First of all we need to check the data for outliers. Seaborn will help us: a boxplot is a good tool for this purpose

In [None]:
sns.boxplot(x=sales['item_cnt_day'])  # pass the series as x= explicitly; newer seaborn treats the first positional argument as data

You can see two points that lie very far from the rest. We treat them as outliers and remove them.

In [None]:
sales = sales[sales['item_cnt_day'] < 900]
sns.boxplot(x=sales['item_cnt_day'])

Do the same for price

In [None]:
sns.boxplot(x=sales['item_price'])

In [None]:
sales = sales[(sales['item_price']<100000) & (sales['item_price']>0)]
# alternative: clip to the 1st and 99th percentiles instead of dropping rows
# lower, upper = np.percentile(sales['item_price'], [1, 99])
# sales['item_price'] = np.clip(sales['item_price'], lower, upper)
sns.boxplot(x=sales['item_price'])
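
The commented-out lines in the cell above hint at an alternative: clip extreme prices to percentile bounds instead of dropping rows. Here is a minimal sketch on toy prices (the numbers and the `toy_` names are made up), just to show what clipping does.

```python
import numpy as np
import pandas as pd

# Toy prices with one extreme value
toy_prices = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

# 1st and 99th percentiles become the lower and upper bounds
lower, upper = np.percentile(toy_prices, [1, 99])

# Values outside [lower, upper] are pulled back to the nearest bound;
# no rows are removed
clipped_prices = toy_prices.clip(lower, upper)
print(clipped_prices)
```

Clipping keeps every row, which matters if you don't want to lose the sales counts attached to the extreme prices. I did not use it here, so treat it as an option rather than a recommendation.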

It is also good practice to scale numeric features (gradient descent converges faster on scaled inputs). Here I use a MinMax scaler, so prices will range from 0 to 1.
I will not scale item_cnt_day; I will deal with it later.

In [None]:
scaler = MinMaxScaler().fit(sales[['item_price']])
sales['item_price'] = scaler.transform(sales[['item_price']])
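
For reference, with its default range of (0, 1) the MinMax scaler computes `(x - min) / (max - min)` per column. A tiny worked example in plain pandas, on made-up prices:

```python
import pandas as pd

toy_price = pd.Series([10.0, 20.0, 50.0])

# Same formula MinMaxScaler applies with feature_range=(0, 1)
toy_scaled = (toy_price - toy_price.min()) / (toy_price.max() - toy_price.min())
print(toy_scaled.tolist())
```

The smallest price maps to 0, the largest to 1, and everything else lands proportionally in between.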

**<font size="4">Shops and category preprocessing</font>**


Let's get acquainted with the shops. 
Several shops are duplicated, so we merge each duplicate into a single shop_id in both train and test

In [None]:
# Якутск Орджоникидзе, 56
sales.loc[sales['shop_id'] == 0, 'shop_id'] = 57
test_df.loc[test_df['shop_id'] == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
sales.loc[sales['shop_id'] == 1, 'shop_id'] = 58
test_df.loc[test_df['shop_id'] == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
sales.loc[sales['shop_id'] == 10, 'shop_id'] = 11
test_df.loc[test_df['shop_id'] == 10, 'shop_id'] = 11

In [None]:
shops_df.head()

Let's clean up shops_df. I noticed that the first word of shop_name is the city name, so we extract it and apply label encoding, since city is a categorical feature.
Note that shops with id 9, 12 and 55 have no city, so I encode them with 999, which means unknown.

In [None]:
shops_df.loc[shops_df['shop_name'] == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shops_df['city'] = shops_df['shop_name'].str.split(' ').map(lambda x: x[0])
shops_df.loc[shops_df['city'] == '!Якутск', 'city'] = 'Якутск'
shops_df['city_code'] = LabelEncoder().fit_transform(shops_df['city'])
shops_df.loc[shops_df['shop_id'].isin([9,12, 55]), 'city_code'] = 999
shops_df = shops_df[['shop_id','city_code']]

Now let's see what the category data looks like

In [None]:
cat_df.head()

Like shops, categories also have duplicates, so we get rid of them. Don't forget to update items_df, because it contains the category id column. Some category names differ only by a suffix, like "blabla (Цифра)" (i.e. "digital") and "blabla". We treat these as equal.

In [None]:
cat_df.loc[cat_df['item_category_id'] == 8, 'item_category_name'] = 'Билеты'
cat_df.loc[cat_df['item_category_id'] == 26, 'item_category_name'] = 'Игры Android'
cat_df.loc[cat_df['item_category_id'] == 27, 'item_category_name'] = 'Игры MAC'
cat_df.loc[cat_df['item_category_id'] == 31, 'item_category_name'] = 'Игры PC'
cat_df.loc[cat_df['item_category_id'] == 34, 'item_category_name'] = 'Карты оплаты - Live!'
cat_df.loc[cat_df['item_category_id'] == 36, 'item_category_name'] = 'Карты оплаты - Windows'
cat_df.loc[cat_df['item_category_id'] == 44, 'item_category_name'] = 'Карты оплаты - Windows'
cat_df.loc[cat_df['item_category_id'] == 74, 'item_category_name'] = 'Программы - MAC'
#43 equals 44
cat_df.drop(cat_df[cat_df['item_category_id'] == 44].index, inplace=True)
#75 == 76
cat_df.drop(cat_df[cat_df['item_category_id'] == 76].index, inplace=True)
#77 == 78
cat_df.drop(cat_df[cat_df['item_category_id'] == 78].index, inplace=True)

items_df.loc[items_df['item_category_id'] == 44, 'item_category_id'] = 43
items_df.loc[items_df['item_category_id'] == 76, 'item_category_id'] = 75
items_df.loc[items_df['item_category_id'] == 78, 'item_category_id'] = 77

Most category names have the structure "type - subtype". Let's extract both parts and apply label encoding

In [None]:
cat_df['split'] = cat_df['item_category_name'].str.split('-')
cat_df['type'] = cat_df['split'].map(lambda x: x[0].strip())
cat_df['type_code'] = LabelEncoder().fit_transform(cat_df['type'])
# if there is no subtype, use 'unknown'
cat_df['subtype'] = cat_df['split'].map(lambda x: x[1].strip() if len(x) > 1 else 'unknown')
cat_df['subtype_code'] = LabelEncoder().fit_transform(cat_df['subtype'])
cat_df = cat_df[['item_category_id','type_code', 'subtype_code']]

items_df.drop(['item_name'], axis=1, inplace=True)
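
To make the type/subtype split concrete, here is a small sketch on three toy category names. Note the assumption: I use `pd.factorize` as a stand-in for `LabelEncoder`, so codes follow order of appearance rather than sorted order. The `toy_cats` frame is made up.

```python
import pandas as pd

toy_cats = pd.DataFrame({'item_category_name': ['Игры - PS3', 'Игры - XBOX 360', 'Билеты']})

# Split on '-' and take the first piece as the type
parts = toy_cats['item_category_name'].str.split('-')
toy_cats['type'] = parts.str[0].str.strip()

# .str[1] gives NaN when there is no second piece, which we replace with 'unknown'
toy_cats['subtype'] = parts.str[1].str.strip().fillna('unknown')

# Integer codes, assigned in order of first appearance
toy_cats['type_code'] = pd.factorize(toy_cats['type'])[0]
toy_cats['subtype_code'] = pd.factorize(toy_cats['subtype'])[0]
print(toy_cats)
```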

**Useful note**

Since we have a lot of rows and will keep adding features, the data will take more and more RAM, so I encourage you to manage memory very carefully. For example, the column date_block_num has type int64, but its values only range from 0 to 34, so we can simply change the type to int8 to save memory. I have had the notebook kernel die from running out of memory, so downcasting types is very important when you have a lot of data.

In [None]:
sales['date_block_num'] = sales['date_block_num'].astype(np.int8)
sales['shop_id'] = sales['shop_id'].astype(np.int16)
sales['item_id'] = sales['item_id'].astype(np.int32)

test_df['date_block_num'] = 34
test_df['date_block_num'] = test_df['date_block_num'].astype(np.int8)
test_df['shop_id'] = test_df['shop_id'].astype(np.int16)
test_df['item_id'] = test_df['item_id'].astype(np.int32)
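
To see how much downcasting saves, you can compare `memory_usage` before and after. A minimal sketch on a toy frame (the frame and its sizes are made up; the real savings depend on your data):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'date_block_num': np.arange(1000, dtype=np.int64) % 34,
    'shop_id': np.arange(1000, dtype=np.int64) % 60,
})
bytes_before = toy_df.memory_usage(deep=True).sum()

# Check that the values really fit before casting, otherwise they would wrap around
assert toy_df['date_block_num'].max() <= np.iinfo(np.int8).max
assert toy_df['shop_id'].max() <= np.iinfo(np.int16).max

toy_df['date_block_num'] = toy_df['date_block_num'].astype(np.int8)
toy_df['shop_id'] = toy_df['shop_id'].astype(np.int16)
bytes_after = toy_df.memory_usage(deep=True).sum()

print(bytes_before, bytes_after)
```

The range check is the important habit: casting a value that does not fit silently overflows instead of raising an error.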

**<font size=4>Generating target</font>**

Now we need to compute our target values (monthly sales) for each item/shop pair. But first let's take a look at the test data: this helps us understand how to construct the train data so that the two have the same structure. Let's find out how many unique items there are in test and how many unique items there are per shop

In [None]:
print('number of unique items in test: {0}'.format(test_df['item_id'].nunique()))
test_df.groupby('shop_id')['item_id'].nunique()


As you can see, every shop has the same set of items. Let's use this knowledge to build our train data

In [None]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

grid = [] 
for block_num in sales['date_block_num'].unique():
    # create all possible [shop, item, month] tuples from the shops and items seen in this month
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)


Compute the target value. This is the value we will predict for the test data.

In [None]:
# Group by shop-item-month to get monthly aggregates.
# Named aggregation gives a flat 'target' column directly
# (the nested-dict renaming form of agg was removed in pandas 1.0).
gb = sales.groupby(index_cols, as_index=False).agg(target=('item_cnt_day', 'sum'))

# Join it to the grid
all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)
all_data['target'] = all_data['target'].astype(np.float16)
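
If the grid idea is hard to picture, here is the same construction on four toy sales rows. Everything prefixed with `toy_` is made up for illustration; the logic mirrors the cells above (all shop/item combinations seen in a month, then a left join of the monthly sums).

```python
from itertools import product

import numpy as np
import pandas as pd

toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [1, 1, 2, 1],
    'item_id':        [10, 10, 11, 10],
    'item_cnt_day':   [1, 2, 1, 5],
})
toy_keys = ['shop_id', 'item_id', 'date_block_num']

# One block per month: every shop seen that month x every item seen that month
blocks = []
for block_num, month in toy_sales.groupby('date_block_num'):
    shops = month['shop_id'].unique()
    items = month['item_id'].unique()
    blocks.append(np.array(list(product(shops, items, [block_num])), dtype='int32'))
toy_grid = pd.DataFrame(np.vstack(blocks), columns=toy_keys)

# Monthly sums, joined onto the grid; pairs without sales get target 0
toy_monthly = toy_sales.groupby(toy_keys, as_index=False).agg(target=('item_cnt_day', 'sum'))
toy_all = toy_grid.merge(toy_monthly, how='left', on=toy_keys).fillna(0)
print(toy_all)
```

Month 0 has two shops and two items, so it contributes four rows even though only two pairs actually sold something. Those explicit zeros are what makes the train data look like the test data.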

Append the test rows to all_data, so that lag features get computed for the test month as well

In [None]:
# append test data to all_data (keys= has no effect together with ignore_index=True, so it is dropped)
all_data = pd.concat([all_data, test_df[index_cols]], ignore_index=True, sort=False)
all_data.fillna(0, inplace=True)

In [None]:
all_data['date_block_num'] = all_data['date_block_num'].astype(np.int8)
all_data['shop_id'] = all_data['shop_id'].astype(np.int16)
all_data['item_id'] = all_data['item_id'].astype(np.int32)

This also helps with RAM issues. Don't forget to delete heavy objects that you no longer need.

In [None]:
del grid, gb 
gc.collect()

The competition requires predictions to be clipped to the range from 0 to 20, so let's clip the train target the same way. This is why I didn't scale item_cnt_day in the previous steps

In [None]:
all_data['target'] = np.clip(all_data['target'], 0, 20)

Let's see how our target values change over time

In [None]:
all_data['target'] = all_data['target'].astype(np.float64) # we cannot calculate sum with float16
sales_per_month = all_data.groupby('date_block_num')['target'].sum()
all_data['target'] = all_data['target'].astype(np.float16)
sales_per_month.plot(figsize=(10,8))

Here we can see two peaks, at months 11 and 23. Both are December: sales rise sharply before New Year. So it is reasonable to include this information in our data set.

In [None]:
decembers = [11, 23]
all_data['is_december'] = all_data['date_block_num'].isin(decembers).astype(np.int8)

Join information about items, shops and categories

In [None]:
all_data = pd.merge(all_data, items_df, how='left', on='item_id')
all_data = pd.merge(all_data, cat_df, how='left', on='item_category_id')
all_data = pd.merge(all_data, shops_df, how='left', on='shop_id')

In [None]:
all_data['item_category_id'] = all_data['item_category_id'].astype(np.int8)
all_data['type_code'] = all_data['type_code'].astype(np.int8)
all_data['subtype_code'] = all_data['subtype_code'].astype(np.int8)
all_data['city_code'] = all_data['city_code'].astype(np.int16)

In [None]:
del items_df
del cat_df
del shops_df
gc.collect()

**<font size=4>Lag features and mean encoding</font>**

This is where the fun begins. Since this is a time series problem, it is recommended to add lag features. A lag feature describes a value at a previous point in time, for example one, two, three or six months earlier.

In [None]:
#function to create lag features

def create_lag(df, cols_to_lag, shift_range):
    print(cols_to_lag)
    
    for month_shift in tqdm_notebook(shift_range):
        train_shift = df[index_cols + cols_to_lag].copy()
        train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift

        foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_lag else x
        train_shift = train_shift.rename(columns=foo)

        df = pd.merge(df, train_shift, on=index_cols, how='left')

    del train_shift
    gc.collect()
    return df
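
The trick inside `create_lag` is easier to see on a tiny example: shift the month forward in a copy, rename the column, and merge back on the same keys. This sketch uses a made-up three-row frame and a single lag of one month.

```python
import pandas as pd

toy = pd.DataFrame({
    'shop_id': [1, 1, 1],
    'item_id': [10, 10, 10],
    'date_block_num': [0, 1, 2],
    'target': [3.0, 5.0, 4.0],
})
toy_keys = ['shop_id', 'item_id', 'date_block_num']

# Move every row one month forward: month m's value now sits on month m + 1
shifted = toy[toy_keys + ['target']].copy()
shifted['date_block_num'] += 1
shifted = shifted.rename(columns={'target': 'target_lag_1'})

# Left join keeps the original rows; months with no earlier data get NaN
toy_lagged = toy.merge(shifted, on=toy_keys, how='left')
print(toy_lagged)
```

Month 0 has nothing before it, so its lag is NaN. This is where all the NaNs come from that we have to fill later.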

At this point I will introduce **mean encoding** features, adapted for a time series problem. Instead of simply grouping the data by some category, calculating the mean of the target and replacing the category with it, I will compute the mean per month.

In [None]:
mean_enc = all_data.groupby('date_block_num').agg({'target':['mean']})
mean_enc.columns = ['date_block_num_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num'])
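
A toy version of the per-month mean encoding, in case the groupby-then-merge pattern looks opaque. Here I use `transform('mean')`, which produces the same values as the aggregate-and-merge in the cell above in one step; the frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1],
    'item_id':        [10, 11, 10, 11],
    'target':         [2.0, 4.0, 1.0, 7.0],
})

# Every row gets the mean target of its own month
toy['date_block_num_enc'] = toy.groupby('date_block_num')['target'].transform('mean')
print(toy)
```

Keep in mind that this encoding is built from the target of the same month, so on its own it is not available for the test month. Only its lagged versions are usable as features.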

Then create lags of this feature for the previous month, the month before that, and further back.

In [None]:
all_data = create_lag(all_data, ['date_block_num_enc'], [1, 2, 3, 6, 12])

Make other features in the same way

In [None]:
ts = time.time()
mean_enc = all_data.groupby(['date_block_num', 'item_id']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_item_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'item_id'])

mean_enc = all_data.groupby(['date_block_num', 'shop_id']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_shop_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'shop_id'])

mean_enc = all_data.groupby(['date_block_num', 'item_category_id']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_cat_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'item_category_id'])

mean_enc = all_data.groupby(['date_block_num', 'shop_id', 'item_category_id']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_shop_cat_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'shop_id','item_category_id'])

mean_enc = all_data.groupby(['date_block_num', 'shop_id', 'type_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_shop_type_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'shop_id','type_code'])

mean_enc = all_data.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_shop_subtype_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'shop_id','subtype_code'])

mean_enc = all_data.groupby(['date_block_num', 'city_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_city_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'city_code'])

mean_enc = all_data.groupby(['date_block_num', 'item_id','city_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_item_city_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'item_id','city_code'])

mean_enc = all_data.groupby(['date_block_num', 'type_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_type_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'type_code'])

mean_enc = all_data.groupby(['date_block_num', 'subtype_code']).agg({'target':['mean']})
mean_enc.columns = ['date_block_num_subtype_enc']
all_data = pd.merge(all_data, mean_enc, how='left', on=['date_block_num', 'subtype_code'])

del mean_enc
gc.collect()
time.time() - ts

And create lag features for these mean encodings

In [None]:
lag_features = ['date_block_num_item_enc', 'date_block_num_shop_enc', 'date_block_num_cat_enc',
               'date_block_num_shop_cat_enc', 'date_block_num_shop_type_enc', 'date_block_num_shop_subtype_enc',
               'date_block_num_city_enc', 'date_block_num_item_city_enc', 'date_block_num_type_enc',
               'date_block_num_subtype_enc']
all_data = create_lag(all_data, lag_features, [1, 2, 3, 6, 12])

Remove the mean encodings for the current month, since they are unknown for the test data.

In [None]:
all_data.drop(lag_features + ['date_block_num_enc'], inplace=True, axis=1)

**<font size=4>Create super interesting features</font>**

Price trend for the last six months for shops.

In [None]:
#find out shop mean price for all time and shop mean price for every month
ts = time.time()

group = sales.groupby(['shop_id']).agg({'item_price':['mean']})
group.columns = ['shop_price_mean']

#another way to get rid of 2 layer columns
group.reset_index(inplace=True)
all_data = pd.merge(all_data, group, how='left', on=['shop_id'])
all_data['shop_price_mean'] = all_data['shop_price_mean'].astype(np.float32)

group = sales.groupby(['date_block_num', 'shop_id']).agg({'item_price':['mean']})
group.columns = ['date_shop_price_mean']
group.reset_index(inplace=True)

all_data = pd.merge(all_data, group, how='left', on=['date_block_num', 'shop_id'])
all_data['date_shop_price_mean'] = all_data['date_shop_price_mean'].astype(np.float16)

# create lags 
lags = [1, 2, 3, 4, 5, 6]
all_data = create_lag(all_data, ['date_shop_price_mean'],lags)
time.time() - ts

In [None]:
# find out how much the shop's monthly mean price differs from its all-time mean price, and normalize it.
ts = time.time()

for i in lags:
    all_data['delta_shop_price_lag_'+str(i)] = \
        (all_data['date_shop_price_mean_lag_'+str(i)] - all_data['shop_price_mean']) / all_data['shop_price_mean']


del group
gc.collect()

delta_cols = [col for col in all_data.columns.values if col.startswith('delta_shop_price_lag_')]
date_shop_price_cols = [col for col in all_data.columns.values if col.startswith('date_shop_price_mean_lag_')]

# backfilling doesn't support float16, so change it to float32
all_data[delta_cols] = all_data[delta_cols].astype(np.float32) 

# get the first non-NaN value in each row (bfill replaces the deprecated fillna(method='backfill'))
all_data['delta_shop_price_lag'] = all_data[delta_cols].bfill(axis=1).iloc[:, 0]

# fill it with zeros if no non-NaN value was found
all_data['delta_shop_price_lag'] = all_data['delta_shop_price_lag'].fillna(0).astype(np.float16)

# and remove the features we used to calculate this
cols_to_drop = delta_cols + date_shop_price_cols + ['date_shop_price_mean', 'shop_price_mean']
all_data.drop(cols_to_drop, axis=1, inplace=True)
time.time()-ts
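
The "first non-NaN value in a row" step is the least obvious part of the cell above, so here it is in isolation. Backfilling along axis=1 copies each value leftwards over the NaNs before it, which means the first column ends up holding the most recent available lag. The toy frame and its column names are made up.

```python
import numpy as np
import pandas as pd

toy_deltas = pd.DataFrame({
    'delta_lag_1': [np.nan, 0.1, np.nan],
    'delta_lag_2': [0.2, 0.3, np.nan],
    'delta_lag_3': [0.4, np.nan, np.nan],
})

# Backfill across columns, take the first column, fall back to 0 for all-NaN rows
first_valid = toy_deltas.bfill(axis=1).iloc[:, 0].fillna(0)
print(first_valid.tolist())
```

Row 0 has no lag 1, so it takes lag 2. Row 1 has lag 1 and keeps it. Row 2 has nothing at all and gets 0.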

Make the same calculation for items

In [None]:
ts = time.time()
group = sales.groupby(['item_id']).agg({'item_price':['mean']})
group.columns = ['item_price_mean']
group.reset_index(inplace=True)
all_data = pd.merge(all_data, group, how='left', on=['item_id'])
all_data['item_price_mean'] = all_data['item_price_mean'].astype(np.float16)

group = sales.groupby(['date_block_num', 'item_id']).agg({'item_price':['mean']})
group.columns = ['date_item_price_mean']
group.reset_index(inplace=True)

all_data = pd.merge(all_data, group, how='left', on=['date_block_num', 'item_id'])
all_data['date_item_price_mean'] = all_data['date_item_price_mean'].astype(np.float16)
lags = [1,2,3,4,5,6]

all_data = create_lag(all_data, ['date_item_price_mean'],lags)

del group
gc.collect()

for i in lags:
    all_data['delta_item_price_lag_'+str(i)] = \
        (all_data['date_item_price_mean_lag_'+str(i)] - all_data['item_price_mean']) / all_data['item_price_mean']

delta_cols = [col  for col in all_data.columns.values if col.startswith('delta_item_price_lag_')]
date_item_price_cols = [col  for col in all_data.columns.values if col.startswith('date_item_price_mean_lag_')]

all_data[delta_cols] = all_data[delta_cols].astype(np.float32) 
all_data['delta_item_price_lag'] = all_data[delta_cols].bfill(axis=1).iloc[:, 0]  # first non-NaN lag per row

all_data['delta_item_price_lag'] = all_data['delta_item_price_lag'].fillna(0).astype(np.float16)
cols_to_drop = delta_cols + date_item_price_cols +['date_item_price_mean', 'item_price_mean']
all_data.drop(cols_to_drop, axis=1, inplace=True)
all_data['delta_item_price_lag'].head()

time.time()-ts

**<font size=4>Extra features</font>**

Create a number-of-days-in-month feature

In [None]:
all_data['month'] = all_data['date_block_num'] % 12
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
all_data['days'] = all_data['month'].map(days).astype(np.int8)

Number of months since the last sale, for each shop/item pair and for each item alone. Stay calm, this will take some time to calculate.

In [None]:
ts = time.time()
cache = {} # key is 'item_id shop_id', value is date_block_num
all_data['item_shop_last_sale'] = -1
all_data['item_shop_last_sale'] = all_data['item_shop_last_sale'].astype(np.int8)
for idx, row in all_data.iterrows():    
    key = str(row['item_id'])+' '+str(row['shop_id'])
    if key not in cache:
        if row['target']!=0:
            cache[key] = row['date_block_num']
    else:
        last_date_block_num = cache[key]
        all_data.at[idx, 'item_shop_last_sale'] = row['date_block_num'] - last_date_block_num
        cache[key] = row['date_block_num']         

del cache
gc.collect()        
time.time() - ts

In [None]:
ts = time.time()
cache = {}
all_data['item_last_sale'] = -1
all_data['item_last_sale'] = all_data['item_last_sale'].astype(np.int8)
for idx, row in all_data.iterrows():    
    key = row['item_id']
    if key not in cache:
        if row['target']!=0:
            cache[key] = row['date_block_num']
    else:
        last_date_block_num = cache[key]
        if row['date_block_num']>last_date_block_num:
            all_data.at[idx, 'item_last_sale'] = row['date_block_num'] - last_date_block_num
            cache[key] = row['date_block_num']   

            
del cache
gc.collect()
time.time() - ts
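
Since `iterrows` is slow on this many rows, here is a vectorized sketch of a closely related feature: months since the last month with a non-zero target for the same shop/item pair, or -1 if there was none. This is not a drop-in replacement for the loops above. Once they have seen a first sale, the loops refresh their cache on every later row, while this sketch only counts months that actually had sales. The toy frame and the `months_since_sale` name are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'shop_id': [0] * 5,
    'item_id': [0] * 5,
    'date_block_num': [0, 1, 2, 3, 4],
    'target': [0, 2, 0, 0, 1],
})
toy = toy.sort_values(['shop_id', 'item_id', 'date_block_num'])

# Month number where a sale happened, NaN elsewhere
sale_month = toy['date_block_num'].where(toy['target'] > 0)

# Within each shop/item pair: look one row back, then carry the last sale month forward
prev_sale = sale_month.groupby([toy['shop_id'], toy['item_id']]).transform(
    lambda s: s.shift().ffill()
)

toy['months_since_sale'] = (toy['date_block_num'] - prev_sale).fillna(-1).astype('int8')
print(toy)
```

The `shift()` is what keeps the current month's own target out of the feature, so the value for month m only depends on months before m.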

Months since the first sale, for each shop/item pair and for each item alone

In [None]:
ts = time.time()
all_data['item_shop_first_sale'] = all_data['date_block_num'] - all_data.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
all_data['item_first_sale'] = all_data['date_block_num'] - all_data.groupby('item_id')['date_block_num'].transform('min')
time.time() - ts

Remove the first 12 months, since their 12-month lags cannot be computed

In [None]:
all_data = all_data[all_data['date_block_num'] > 11]

After creating lags we have a lot of NaNs, so let's fill them with zeros

In [None]:
all_data.fillna(0, inplace=True, axis=1)

In [None]:
all_data.info()

As you can see, we generated quite a lot of new features, and the data now takes almost 900 MB. At this point our journey through EDA and feature engineering is finished, and the new road of building a predictive model opens.

Save it to pickle. I chose pickle instead of csv because pkl format loads faster.

In [None]:
all_data.to_pickle('all_data.pkl')