This notebook is a simplified version of the final project in the [How to Win a Data Science Competition: Learn from Top Kagglers](https://www.coursera.org/learn/competitive-data-science) course. Simplified means without ensembling.

#### Pipeline
* load data
* heal data and remove outliers
* work with shops/items/cats objects and features
* create matrix as product of item/shop pairs within each month in the train set
* get monthly sales for each item/shop pair in the train set and merge it to the matrix
* clip item_cnt_month by (0,20)
* append test to the matrix, fill month 34 NaNs with zeros
* merge shops/items/cats to the matrix
* add target lag features
* add mean encoded features
* add price trend features
* add month
* add days
* add months since last sale/months since first sale features
* cut first year and drop columns which cannot be calculated for the test set
* select best features
* set validation strategy: month 34 for test, month 33 for validation, months before 33 for training
* fit the model, predict and clip targets for the test set

# Part 1, perfect features

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500) # show up to 500 rows when displaying a DataFrame
pd.set_option('display.max_columns', 100)

from itertools import product # Cartesian product of the input iterables
from sklearn.preprocessing import LabelEncoder

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from xgboost import XGBRegressor
from xgboost import plot_importance

def plot_features(booster, figsize):    
    fig, ax = plt.subplots(1,1,figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

import time
import sys # access to information about the Python interpreter and runtime environment
import gc # garbage collector interface; used later to release memory that is no longer needed
import pickle # serializes Python objects to files and loads them back
sys.version_info

In [None]:
items = pd.read_csv('../input/items.csv')
shops = pd.read_csv('../input/shops.csv')
cats = pd.read_csv('../input/item_categories.csv')
train = pd.read_csv('../input/sales_train.csv')
# set index to ID to avoid dropping it later
test  = pd.read_csv('../input/test.csv').set_index('ID')

**Inspect the data**

In [None]:
items.head()

In [None]:
shops.head()

In [None]:
cats.head()

In [None]:
train.head()
# item_cnt_day: when someone buys an item, item_cnt_day = 1; if the item is returned to the shop
# (for example, because it is defective), item_cnt_day = -1.

In [None]:
test.head()

In [None]:
len(train)

## Outliers

There are items with strange prices and sales. After detailed exploration I decided to remove rows with price >= 100000 or daily sales > 1000 (1000 is ok).
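
As a minimal sketch of this kind of filtering, the toy frame below stands in for the sales data (its values are made up for illustration). It shows how a boolean mask keeps only the rows under both thresholds.

```python
import pandas as pd

# Toy stand-in for the sales data; the values are invented for illustration.
toy = pd.DataFrame({
    "item_price":   [199.0, 307980.0, 58.0, 1200.0],
    "item_cnt_day": [1.0, 1.0, 2169.0, 1000.0],
})

# A comparison on a column yields a boolean Series;
# indexing the frame with it keeps only the rows where it is True.
keep = (toy["item_price"] < 100000) & (toy["item_cnt_day"] < 1001)
filtered = toy[keep]

print(filtered)
```

The row with exactly 1000 daily sales survives, matching the "1000 is ok" rule.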

In [None]:
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000) # set the x-axis limits only
sns.boxplot(x=train.item_cnt_day)

plt.figure(figsize=(10,4))
plt.xlim(train.item_price.min(), train.item_price.max()*1.1)
sns.boxplot(x=train.item_price)

In [None]:
train = train[train.item_price<100000] # boolean-mask indexing: keep only the rows where the condition is True
train = train[train.item_cnt_day<1001]

In [None]:
# once again, make the plot
plt.figure(figsize=(10,4))
plt.xlim(-100, 3000) # set the x-axis limits only
sns.boxplot(x=train.item_cnt_day)

plt.figure(figsize=(10,4))
plt.xlim(train.item_price.min(), train.item_price.max()*1.1)
sns.boxplot(x=train.item_price)

# confirm that the outliers are gone

There is one item with price below zero. Fill it with median.
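
A small sketch of the median fill on a toy column follows. It assumes the median should be taken over valid (positive) prices only, which is a slight variation on using the median of the whole column.

```python
import pandas as pd

# Toy prices with one invalid (negative) entry.
toy = pd.DataFrame({"item_price": [100.0, -1.0, 300.0, 200.0]})

# Median over valid prices only, so the bad value does not influence it.
valid_median = toy.loc[toy["item_price"] > 0, "item_price"].median()

# Overwrite the invalid entries in place via a boolean mask.
toy.loc[toy["item_price"] < 0, "item_price"] = valid_median

print(toy)
```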

In [None]:
## In Python, True counts as 1 and False as 0, so calling sum() on a boolean Series gives the number of rows that satisfy the condition.
(train["item_price"]<0).sum()

"""
median = train[(train.shop_id==32)&(train.item_id==2973)&(train.date_block_num==4)&(train.item_price>0)].item_price.median()
train.loc[train.item_price<0, 'item_price'] = median
# It is not obvious where shop_id==32, item_id==2973 and date_block_num==4 come from;
# presumably they identify the row with the negative price.
"""

The code above was the original; it has been replaced with the version below.

In [None]:
median = train["item_price"].median()
train.loc[train.item_price<0, 'item_price'] = median

In [None]:
# confirm that no item has a price below zero
(train["item_price"]<0).sum()

Several shops are duplicates of each other (according to their names). Fix the train and test sets.
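
The remapping can be sketched on a toy column as below. The `duplicate_of` dictionary is a name introduced here for illustration; its id pairs are the ones used in this notebook.

```python
import pandas as pd

# Toy shop_id column; the ids 0, 1 and 10 are treated as duplicates of 57, 58 and 11.
toy_train = pd.DataFrame({"shop_id": [0, 1, 10, 57, 25]})

# Hypothetical helper mapping: duplicate id -> id to keep.
duplicate_of = {0: 57, 1: 58, 10: 11}

# replace() rewrites only the ids that appear as keys; all others are left untouched.
toy_train["shop_id"] = toy_train["shop_id"].replace(duplicate_of)

print(toy_train["shop_id"].tolist())
```

Keeping the pairs in one dictionary makes it easy to apply the same mapping to both the train and test frames.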

In [None]:
shops.duplicated(subset='shop_name')

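`duplicated` reports no exactly repeated shop_name, so at first it is a little unclear why these shops are merged.
The paired names seem to differ only slightly (for example a leading `!`, as in `!Якутск`), which an exact-match check does not catch.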

In [None]:
# Якутск Орджоникидзе, 56
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
train.loc[train.shop_id == 1, 'shop_id'] = 58
test.loc[test.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
train.loc[train.shop_id == 10, 'shop_id'] = 11
test.loc[test.shop_id == 10, 'shop_id'] = 11

## Shops/Cats/Items preprocessing
Observations:
* Each shop_name starts with the city name.
* Each category contains type and subtype in its name.
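
The two observations can be sketched on toy data as follows. The shop and category names are illustrative, and `astype("category").cat.codes` is swapped in for `LabelEncoder` to keep the example pandas-only (both assign codes in sorted label order).

```python
import pandas as pd

# Illustrative shop names: the first space-separated token is the city.
toy_shops = pd.DataFrame({
    "shop_name": ['Москва ТЦ "Пример"', 'Казань ТЦ "Пример"', "Москва Магазин 2"],
})
toy_shops["city"] = toy_shops["shop_name"].str.split(" ").str[0]
# Integer code per distinct city, assigned in sorted order of the labels.
toy_shops["city_code"] = toy_shops["city"].astype("category").cat.codes

# Illustrative category names: "type - subtype", or just "type" when there is no dash.
toy_cats = pd.DataFrame({
    "item_category_name": ["Игры - PS4", "Книги - Комиксы", "Доставка товара"],
})
parts = toy_cats["item_category_name"].str.split("-")
toy_cats["type"] = parts.str[0].str.strip()
# Fall back to the type when the name has no subtype part.
toy_cats["subtype"] = parts.map(lambda p: p[1].strip() if len(p) > 1 else p[0].strip())

print(toy_shops)
print(toy_cats)
```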

In [None]:
# inspect the data, then apply some manual feature engineering

shops.loc[shops.shop_name == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shops['city'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops.loc[shops.city == '!Якутск', 'city'] = 'Якутск'
shops['city_code'] = LabelEncoder().fit_transform(shops['city'])
shops = shops[['shop_id','city_code']]

cats['split'] = cats['item_category_name'].str.split('-')
cats['type'] = cats['split'].map(lambda x: x[0].strip())
cats['type_code'] = LabelEncoder().fit_transform(cats['type'])
# if subtype is nan then type
cats['subtype'] = cats['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats['subtype_code'] = LabelEncoder().fit_transform(cats['subtype'])
cats = cats[['item_category_id','type_code', 'subtype_code']]

items.drop(['item_name'], axis=1, inplace=True)

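## LabelEncoder() maps labels given as strings or numbers to integers from 0 to (number of classes - 1).
## fit_transform() is convenient: it takes a 1-D list of labels and returns a 1-D array of label IDs.

Features are preprocessed (label encoding, etc.) so that they are easier for the model to learn from.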

In [None]:
shops.head()

In [None]:
cats.head()

## Monthly sales
The test set is the product of some shops and some items within month 34. There are 5100 items * 42 shops = 214200 pairs. 363 items are new compared to the train set. Hence, for most of the items in the test set the target value should be zero.
On the other hand, the train set contains only pairs which were sold or returned in the past. The main idea is to calculate monthly sales and <b>extend it with zero sales</b> for each unique pair within the month. This way the train data will be similar to the test data.
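
To make the idea concrete, here is a small sketch on made-up sales rows. For every month it builds all shop/item combinations seen in that month, so pairs without sales also get a row.

```python
from itertools import product

import pandas as pd

# Toy sales records: only pairs that actually had a sale or return appear here.
toy_sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id":        [2, 3, 2],
    "item_id":        [10, 11, 10],
})

cols = ["date_block_num", "shop_id", "item_id"]
rows = []
for block, month in toy_sales.groupby("date_block_num"):
    # Every shop active in this month is paired with every item sold in this month.
    rows.extend(product([block], month["shop_id"].unique(), month["item_id"].unique()))

grid = pd.DataFrame(rows, columns=cols)
print(grid)
```

Month 0 has two shops and two items, so it yields four rows even though only two pairs were sold; the unsold pairs are the ones that later receive a zero target.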

In [None]:
len(list(set(test.item_id) - set(test.item_id).intersection(set(train.item_id)))), len(list(set(test.item_id))), len(test)
## intersection returns the elements common to both sets (set intersection)

In [None]:
train.tail()

In [None]:
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):
    sales = train[train.date_block_num==i]
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))
## list(): for example, list('Python') creates the list ['P', 'y', 't', 'h', 'o', 'n']
# range(34) is used because date_block_num runs from 0 to 33
## product: the Cartesian product is the set of all combinations that take one element from each input

matrix

In [None]:
matrix = pd.DataFrame(np.vstack(matrix), columns=cols) ## vstack concatenates the arrays vertically (row-wise)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8) ## astype casts the column to a smaller dtype to save memory
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols,inplace=True)

In [None]:
matrix.tail()

Aggregate the train set by shop/item pairs to calculate target aggregates, then <b>clip(0,20)</b> the target value. This way the train target will be similar to the test predictions.

<i>I use floats instead of ints for item_cnt_month to avoid downcasting it after concatenation with the test set later. If it were int16, after concatenation with NaN values it would become float64, but float16 stays float16 even with NaNs.</i>
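
The dtype behaviour can be checked with a small experiment such as the one below (toy frames, not the competition data). The exact result may depend on the pandas version, so the dtypes are printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Rows without the target column, playing the role of the test set.
extra = pd.DataFrame({"shop_id": [5]})

as_int = pd.DataFrame({
    "shop_id": [1, 2],
    "item_cnt_month": np.array([0, 3], dtype=np.int16),
})
as_float = pd.DataFrame({
    "shop_id": [1, 2],
    "item_cnt_month": np.array([0, 3], dtype=np.float16),
})

# The missing target values become NaN after concatenation.
int_result = pd.concat([as_int, extra], ignore_index=True, sort=False)
float_result = pd.concat([as_float, extra], ignore_index=True, sort=False)

# An integer column cannot hold NaN, so it is upcast to a floating dtype.
print(int_result["item_cnt_month"].dtype)
print(float_result["item_cnt_month"].dtype)
```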

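Downcasting: in class-based programming, a downcast (type refinement) casts a base-class reference to one of its derived classes. Here the term is used in the numeric sense: converting a column to a smaller dtype (for example int64 to int16) to save memory.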

In [None]:
train['revenue'] = train['item_price'] *  train['item_cnt_day']
# item_price * amount

In [None]:
# add revenue columns
train.head()

In [None]:
train.dtypes

In [None]:
group = train.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
group.columns = ['item_cnt_month'] ## rename the column: how many units of an item a shop sold in that month
group.head()

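## groupby collects rows that share the same key values so that a common operation can be applied to each group.
## agg(): computing one value per group to build a table is called aggregation.

# for each (date_block_num, shop_id, item_id) group, take the sum of item_cnt_day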

In [None]:
## reset_index() moves the index levels back into ordinary columns and renumbers the rows from 0.
group.reset_index(inplace=True)

In [None]:
# check the result of reset_index
group.head()

In [None]:
matrix.head()

In [None]:
cols

In [None]:
matrix = pd.merge(matrix, group, on=cols, how='left')
matrix['item_cnt_month'] = (matrix['item_cnt_month']
                                .fillna(0)
                                .clip(0,20) # NB clip target here
                                .astype(np.float16))

## clip(lower, upper) bounds the values to the given range
## for example, a value of 32 becomes 20

In [None]:
matrix.head()

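Work in progress: the matrix now records when (month), at which shop, which item, and in what quantity, as the basis for further features.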
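
The aggregate, merge and clip steps can be sketched end to end on toy data as below. All values are made up; `grid` stands in for the shop/item matrix.

```python
import pandas as pd

cols = ["date_block_num", "shop_id", "item_id"]

# Toy daily sales, including a return (-1) and a pair that sells more than 20 units.
toy_train = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0],
    "shop_id":        [2, 2, 2, 3],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_day":   [15.0, 10.0, -1.0, 1.0],
})

# Monthly total per (month, shop, item).
monthly = (
    toy_train.groupby(cols, as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Full grid of pairs for the month, including a pair with no sales at all.
grid = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0],
    "shop_id":        [2, 2, 3, 3],
    "item_id":        [10, 11, 10, 11],
})

# Left merge keeps every grid row; pairs without sales get NaN, then 0.
merged = grid.merge(monthly, on=cols, how="left")
merged["item_cnt_month"] = merged["item_cnt_month"].fillna(0).clip(0, 20)

print(merged)
```

The pair that sold 25 units is capped at 20, the net return is raised to 0, and the unsold pair is filled with 0.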

## Test set
To use time tricks append test pairs to the matrix.
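
A minimal sketch of the append step on toy frames is shown below. The test rows have no target, so after concatenation their target is NaN and gets filled with 0.

```python
import pandas as pd

# Toy training matrix for the last known month.
toy_matrix = pd.DataFrame({
    "date_block_num": [33, 33],
    "shop_id":        [2, 3],
    "item_id":        [10, 10],
    "item_cnt_month": [1.0, 0.0],
})

# Toy test pairs: no target column, month set to 34.
toy_test = pd.DataFrame({"shop_id": [2, 4], "item_id": [10, 12]})
toy_test["date_block_num"] = 34

# concat stacks the rows; columns missing in one frame are filled with NaN.
combined = pd.concat([toy_matrix, toy_test], ignore_index=True, sort=False)
combined["item_cnt_month"] = combined["item_cnt_month"].fillna(0)

print(combined)
```

Once the test rows live in the same frame, lag features for month 34 can be computed with the same code as for the training months.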

In [None]:
test.head()

In [None]:
# 34=2015/11
test['date_block_num'] = 34
test['date_block_num'] = test['date_block_num'].astype(np.int8)
test['shop_id'] = test['shop_id'].astype(np.int8)
test['item_id'] = test['item_id'].astype(np.int16)

In [None]:
test.head()

In [None]:
ts = time.time()
matrix = pd.concat([matrix, test], ignore_index=True, sort=False) # keys is not needed when ignore_index=True
matrix.fillna(0, inplace=True) # month 34: the target is unknown, so fill with 0
time.time() - ts

## Difference between concat and merge:
## concat stacks frames along rows or columns, while merge joins them on key columns

## Shops/Items/Cats features

In [None]:
matrix.tail()

In [None]:
ts = time.time()
matrix = pd.merge(matrix, shops, on=['shop_id'], how='left')
matrix = pd.merge(matrix, items, on=['item_id'], how='left')
matrix = pd.merge(matrix, cats, on=['item_category_id'], how='left')
matrix['city_code'] = matrix['city_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['type_code'] = matrix['type_code'].astype(np.int8)
matrix['subtype_code'] = matrix['subtype_code'].astype(np.int8)
time.time() - ts

In [None]:
matrix.tail()

## Target lags

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]] # keep only the key columns and the column to lag
    for i in lags:
        shifted = tmp.copy() # work on a copy so that tmp stays unchanged
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)] # rename col to its lagged name
        shifted['date_block_num'] += i # move the values i months forward
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left') # adds the col+'_lag_'+str(i) column (NaN where there is no match)
    # print(shifted.head())
    return df

In [None]:
ts = time.time()
matrix = lag_feature(matrix, [1,2,3,6,12], 'item_cnt_month')
time.time() - ts

In [None]:
## matrix[matrix["item_cnt_month_lag_12"] == 1.0].head()

In [None]:
matrix.tail()

## Mean encoded features

In [None]:
group.tail()

In [None]:
matrix.tail()

In [None]:
group = matrix.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_avg_item_cnt' ]

# item_cnt_month is how many units of an item were sold at a shop
# taking the mean gives the average number of units sold per shop/item pair in that month

In [None]:
group.head()

In [None]:
group.reset_index(inplace=True)

In [None]:
group.tail()
# average units sold per shop/item pair in each month block

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num'], how='left')
matrix['date_avg_item_cnt'] = matrix['date_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_avg_item_cnt')
matrix.drop(['date_avg_item_cnt'], axis=1, inplace=True)

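# The date_avg_item_cnt_lag_1 column was added.
# The lag shifts the monthly aggregate forward by one block, so each row sees the previous month's value:
# shifted['date_block_num'] += i adds the lag (here 1) before the merge.
# The unlagged 'date_avg_item_cnt' column is dropped after lag_feature because it is computed from the current month's target;
# only its lagged copy, col+'_lag_'+str(i), is kept as a feature.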

In [None]:
matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_avg_item_cnt' ]
# item_cnt_month is the number of units of an item sold at a shop in a given month
# Reading the first row of the output below: in month 0, item 19 sold on average only about 0.022 units per shop

In [None]:
group.head()

In [None]:
group.reset_index(inplace=True)

In [None]:
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_cnt'] = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_avg_item_cnt')
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)
# columns date_item_avg_item_cnt_lag_1 through _lag_12 were added


In [None]:
matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_shop_avg_item_cnt' ]

group.head(34)
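# item_cnt_month is the number of units of an item sold at a shop in a given month
# Reading the first row of the output below: in month 0, shop 2 sold on average only about 0.022 units per item in the grid (the mean can exceed 1 when several units of the same item are sold)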

In [None]:
group.reset_index(inplace=True)

In [None]:
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_avg_item_cnt'] = matrix['date_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_shop_avg_item_cnt')
matrix.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)

In [None]:
matrix.head()

The same procedure is repeated for the groupings below.
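
Since every block follows the same group, merge, lag, drop pattern, it could be wrapped in a helper. The sketch below is one possible refactoring, not the notebook's own code: `add_mean_encoded_lag` is a hypothetical name, and `transform("mean")` is used in place of the groupby plus merge.

```python
import pandas as pd

KEYS = ["date_block_num", "shop_id", "item_id"]

def lag_feature(df, lags, col):
    """Add col shifted forward by each lag (in months) as col_lag_<n>."""
    tmp = df[KEYS + [col]]
    for lag in lags:
        shifted = tmp.copy()
        shifted.columns = KEYS + [f"{col}_lag_{lag}"]
        shifted["date_block_num"] += lag
        df = pd.merge(df, shifted, on=KEYS, how="left")
    return df

def add_mean_encoded_lag(df, group_cols, name, lags):
    """Hypothetical helper: group mean of the target, lagged, unlagged column dropped."""
    df = df.copy()
    # Broadcast the group mean back onto every row of the group.
    df[name] = df.groupby(group_cols)["item_cnt_month"].transform("mean")
    df = lag_feature(df, lags, name)
    # The current-month mean is built from the target itself, so only the lags are kept.
    return df.drop(columns=[name])

toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id":        [2, 3, 2, 3],
    "item_id":        [10, 10, 10, 10],
    "item_cnt_month": [1.0, 3.0, 0.0, 2.0],
})
toy = add_mean_encoded_lag(toy, ["date_block_num", "item_id"], "date_item_avg_item_cnt", [1])
print(toy)
```

In month 1 both rows receive the month 0 item average, while the month 0 rows have no earlier month to look at and stay NaN.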

In [None]:
group = matrix.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_cat_avg_item_cnt' ]

group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num','item_category_id'], how='left')
matrix['date_cat_avg_item_cnt'] = matrix['date_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_cat_avg_item_cnt')
matrix.drop(['date_cat_avg_item_cnt'], axis=1, inplace=True)
matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'shop_id', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_cat_avg_item_cnt']

group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'item_category_id'], how='left')
matrix['date_shop_cat_avg_item_cnt'] = matrix['date_shop_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_cat_avg_item_cnt')
matrix.drop(['date_shop_cat_avg_item_cnt'], axis=1, inplace=True)
matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'shop_id', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_type_avg_item_cnt']
group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'type_code'], how='left')
matrix['date_shop_type_avg_item_cnt'] = matrix['date_shop_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_type_avg_item_cnt')
matrix.drop(['date_shop_type_avg_item_cnt'], axis=1, inplace=True)

matrix.head()

In [None]:
ts = time.time()
group = matrix.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'subtype_code'], how='left')
matrix['date_shop_subtype_avg_item_cnt'] = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_subtype_avg_item_cnt')
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)
time.time() - ts

In [None]:
group = matrix.groupby(['date_block_num', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_city_avg_item_cnt' ]
group.head()


In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'city_code'], how='left')
matrix['date_city_avg_item_cnt'] = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_city_avg_item_cnt')
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)

matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'item_id', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_city_avg_item_cnt' ]

group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'city_code'], how='left')
matrix['date_item_city_avg_item_cnt'] = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_item_city_avg_item_cnt')
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)

matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_type_avg_item_cnt' ]

group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'type_code'], how='left')
matrix['date_type_avg_item_cnt'] = matrix['date_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_type_avg_item_cnt')
matrix.drop(['date_type_avg_item_cnt'], axis=1, inplace=True)

matrix.head()

In [None]:
group = matrix.groupby(['date_block_num', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_subtype_avg_item_cnt' ]

group.head()

In [None]:
group.reset_index(inplace=True)
group.head()

In [None]:
matrix = pd.merge(matrix, group, on=['date_block_num', 'subtype_code'], how='left')
matrix['date_subtype_avg_item_cnt'] = matrix['date_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_subtype_avg_item_cnt')
matrix.drop(['date_subtype_avg_item_cnt'], axis=1, inplace=True)

matrix.head()

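Up to this point we have simply been generating one mean-encoded lag feature after another, without a clear picture of what each one means!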

## Trend features

Price trend for the last six months.

I do not fully understand this part yet...
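
One way to read the trend feature is "take the most recent lagged price delta that is available". The sketch below illustrates that reading with a row-wise backfill on toy columns. It is an alternative formulation rather than the notebook's code, and it assumes the columns are float32 or wider, since float16 is reported in this notebook as unsupported by the backfill.

```python
import numpy as np
import pandas as pd

# Toy relative price deltas for lags 1..3 (NaN means no price was observed that month).
toy = pd.DataFrame({
    "delta_price_lag_1": [np.nan, 0.10, np.nan],
    "delta_price_lag_2": [-0.25, 0.30, np.nan],
    "delta_price_lag_3": [0.50, np.nan, np.nan],
})
lag_cols = ["delta_price_lag_1", "delta_price_lag_2", "delta_price_lag_3"]

# bfill(axis=1) pulls the next non-null value leftwards along each row,
# so the first column ends up holding the first non-null delta of the row.
first_available = toy[lag_cols].astype("float64").bfill(axis=1).iloc[:, 0]

# Rows with no price history at all get a neutral trend of 0.
toy["delta_price_lag"] = first_available.fillna(0)

print(toy["delta_price_lag"].tolist())
```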

In [None]:
ts = time.time()
group = train.groupby(['item_id']).agg({'item_price': ['mean']})
group.columns = ['item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['item_id'], how='left')
matrix['item_avg_item_price'] = matrix['item_avg_item_price'].astype(np.float16)

group = train.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
group.columns = ['date_item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_price'] = matrix['date_item_avg_item_price'].astype(np.float16)

lags = [1,2,3,4,5,6]
matrix = lag_feature(matrix, lags, 'date_item_avg_item_price')

for i in lags:
    matrix['delta_price_lag_'+str(i)] = \
        (matrix['date_item_avg_item_price_lag_'+str(i)] - matrix['item_avg_item_price']) / matrix['item_avg_item_price']

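def select_trend(row):
    # return the most recent lagged delta that is neither NaN nor zero
    # (NaN is truthy in Python, so a plain `if value:` would stop at a missing lag)
    for i in lags:
        value = row['delta_price_lag_'+str(i)]
        if pd.notna(value) and value != 0:
            return value
    return 0
    
matrix['delta_price_lag'] = matrix.apply(select_trend, axis=1)
matrix['delta_price_lag'] = matrix['delta_price_lag'].astype(np.float16)
matrix['delta_price_lag'] = matrix['delta_price_lag'].fillna(0)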

# https://stackoverflow.com/questions/31828240/first-non-null-value-per-row-from-a-list-of-pandas-columns/31828559
# matrix['price_trend'] = matrix[['delta_price_lag_1','delta_price_lag_2','delta_price_lag_3']].bfill(axis=1).iloc[:, 0]
# Invalid dtype for backfill_2d [float16]

features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for i in lags:
    features_to_drop += ['date_item_avg_item_price_lag_'+str(i)]
    features_to_drop += ['delta_price_lag_'+str(i)]

matrix.drop(features_to_drop, axis=1, inplace=True)

time.time() - ts

In [None]:
matrix.head()

Last month shop revenue trend
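
The revenue trend is the relative deviation of a shop's monthly revenue from its own average, taken from the previous month. A toy sketch follows. It uses `groupby(...).shift(1)` instead of the merge-based `lag_feature`, which is only equivalent under the assumption that every shop has a row for every consecutive month.

```python
import pandas as pd

# Toy monthly revenue per shop (values invented for illustration).
toy_rev = pd.DataFrame({
    "date_block_num":    [0, 1, 2, 0, 1, 2],
    "shop_id":           [2, 2, 2, 3, 3, 3],
    "date_shop_revenue": [100.0, 200.0, 300.0, 50.0, 50.0, 50.0],
})

# Average monthly revenue of each shop, broadcast to every row of that shop.
avg = toy_rev.groupby("shop_id")["date_shop_revenue"].transform("mean")
toy_rev["delta_revenue"] = (toy_rev["date_shop_revenue"] - avg) / avg

# Previous month's delta per shop (assumes consecutive months with no gaps).
toy_rev["delta_revenue_lag_1"] = (
    toy_rev.sort_values("date_block_num")
    .groupby("shop_id")["delta_revenue"]
    .shift(1)
)

print(toy_rev)
```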

In [None]:
ts = time.time()
group = train.groupby(['date_block_num','shop_id']).agg({'revenue': ['sum']})
group.columns = ['date_shop_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

group = group.groupby(['shop_id']).agg({'date_shop_revenue': ['mean']})
group.columns = ['shop_avg_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['shop_id'], how='left')
matrix['shop_avg_revenue'] = matrix['shop_avg_revenue'].astype(np.float32)

matrix['delta_revenue'] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix['delta_revenue'] = matrix['delta_revenue'].astype(np.float16)

matrix = lag_feature(matrix, [1], 'delta_revenue')

matrix.drop(['date_shop_revenue','shop_avg_revenue','delta_revenue'], axis=1, inplace=True)
time.time() - ts

In [None]:
matrix.tail()

## Special features

In [None]:
matrix['month'] = matrix['date_block_num'] % 12

In [None]:
matrix.tail()

Number of days in a month. There are no leap years.
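
As a cross-check of the hard-coded list, the day count can also be derived from the calendar. The sketch assumes date_block_num 0 corresponds to January 2013, which is inferred from the notebook's note that block 34 is 2015/11; `days_in_block` is a hypothetical helper name.

```python
import calendar

def days_in_block(date_block_num, start_year=2013):
    """Days in the calendar month for a block number (block 0 assumed to be January of start_year)."""
    year = start_year + date_block_num // 12
    month = date_block_num % 12 + 1
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]

for block in (0, 1, 34):
    print(block, days_in_block(block))
```

Unlike the fixed list, this version would also handle a leap-year February if the data extended into 2016.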

In [None]:
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix['days'] = matrix['month'].map(days).astype(np.int8)

In [None]:
matrix.tail()

Months since the last sale for each shop/item pair and for item only. I use a programmatic approach.

<i>Create a HashTable with key {shop_id,item_id} and value date_block_num. Iterate over the data from the top. For each row, if {row.shop_id,row.item_id} is not present in the table, add it and set its value to row.date_block_num. If the HashTable contains the key, calculate the difference between the cached value and row.date_block_num.</i>
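
The caching idea can be illustrated without pandas, as in the sketch below. Note one deliberate difference from the loop in this notebook: here the cache is refreshed only in months with a non-zero count, so the value really is "months since the last sale", whereas the notebook's loop refreshes the cache on every row once the key is known. `months_since_last_sale` is a hypothetical helper name.

```python
def months_since_last_sale(rows):
    """rows: (date_block_num, shop_id, item_id, item_cnt_month) tuples, sorted by month.

    Returns one value per row: months since the pair's last non-zero month,
    or -1 if the pair has not sold before.
    """
    last_sale = {}  # (shop_id, item_id) -> month of the most recent sale
    result = []
    for block, shop, item, cnt in rows:
        key = (shop, item)
        result.append(block - last_sale[key] if key in last_sale else -1)
        # Assumption of this sketch: only months with sales move the cached month.
        if cnt != 0:
            last_sale[key] = block
    return result

toy_rows = [
    (0, 1, 10, 2.0),
    (1, 1, 10, 0.0),
    (2, 1, 10, 0.0),
    (3, 1, 10, 1.0),
    (4, 1, 10, 0.0),
]
print(months_since_last_sale(toy_rows))
```

Using a tuple as the dictionary key avoids building a string for every row.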

In [None]:
## iterrows() yields one (index, row) tuple per row, where row is a pandas Series.
ts = time.time()
cache = {}
matrix['item_shop_last_sale'] = -1
matrix['item_shop_last_sale'] = matrix['item_shop_last_sale'].astype(np.int8)
for idx, row in matrix.iterrows():    
    key = str(row.item_id)+' '+str(row.shop_id)
    if key not in cache:
        if row.item_cnt_month!=0:
            cache[key] = row.date_block_num
    else:
        last_date_block_num = cache[key]
        matrix.at[idx, 'item_shop_last_sale'] = row.date_block_num - last_date_block_num
        cache[key] = row.date_block_num         
time.time() - ts
## if the cache contains the key, calculate the difference between the cached value and row.date_block_num

In [None]:
matrix.tail()

In [None]:
ts = time.time()
cache = {}
matrix['item_last_sale'] = -1
matrix['item_last_sale'] = matrix['item_last_sale'].astype(np.int8)


## iterrows() yields one (index, row) tuple per row, where row is a pandas Series.
for idx, row in matrix.iterrows():    
    key = row.item_id
    if key not in cache:
        if row.item_cnt_month!=0:
            cache[key] = row.date_block_num ## add a new entry to the dict
    else:
        last_date_block_num = cache[key]
        if row.date_block_num>last_date_block_num:
            matrix.at[idx, 'item_last_sale'] = row.date_block_num - last_date_block_num
            ## at selects a single cell by row label and column label; unlike loc, it can only address one value
            cache[key] = row.date_block_num         
time.time() - ts

In [None]:
matrix.tail()

Months since the first sale for each shop/item pair and for item only.

In [None]:
ts = time.time()
matrix['item_shop_first_sale'] = matrix['date_block_num'] - matrix.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
matrix['item_first_sale'] = matrix['date_block_num'] - matrix.groupby('item_id')['date_block_num'].transform('min')
time.time() - ts

In [None]:
matrix.head()

## Final preparations
Because 12 is used as a lag value, drop the first 12 months. Also drop all the columns calculated from the current month's values (in other words, those which cannot be calculated for the test set).

In [None]:
ts = time.time()
matrix = matrix[matrix.date_block_num > 11]
time.time() - ts

In [None]:
# this removes a large number of rows
matrix.head()

Producing lags brings a lot of nulls.

In [None]:
# fill missing values with 0
ts = time.time()
def fill_na(df):
    for col in df.columns:
        if ('_lag_' in col) and (df[col].isnull().any()):
            if ('item_cnt' in col):
                # assign the result back instead of a chained inplace fillna
                df[col] = df[col].fillna(0)
    return df

matrix = fill_na(matrix)
time.time() - ts

In [None]:
matrix.head()

In [None]:
matrix.columns

In [None]:
matrix.info()

In [None]:
matrix.to_pickle('data.pkl')
del matrix
del cache
del group
del items
del shops
del cats
del train
# leave test for submission
gc.collect();

# Part 2, xgboost

In [None]:
data = pd.read_pickle('data.pkl')

Select perfect features

In [None]:
data = data[[
    'date_block_num',
    'shop_id',
    'item_id',
    'item_cnt_month',
    'city_code',
    'item_category_id',
    'type_code',
    'subtype_code',
    'item_cnt_month_lag_1',
    'item_cnt_month_lag_2',
    'item_cnt_month_lag_3',
    'item_cnt_month_lag_6',
    'item_cnt_month_lag_12',
    'date_avg_item_cnt_lag_1',
    'date_item_avg_item_cnt_lag_1',
    'date_item_avg_item_cnt_lag_2',
    'date_item_avg_item_cnt_lag_3',
    'date_item_avg_item_cnt_lag_6',
    'date_item_avg_item_cnt_lag_12',
    'date_shop_avg_item_cnt_lag_1',
    'date_shop_avg_item_cnt_lag_2',
    'date_shop_avg_item_cnt_lag_3',
    'date_shop_avg_item_cnt_lag_6',
    'date_shop_avg_item_cnt_lag_12',
    'date_cat_avg_item_cnt_lag_1',
    'date_shop_cat_avg_item_cnt_lag_1',
    #'date_shop_type_avg_item_cnt_lag_1',
    #'date_shop_subtype_avg_item_cnt_lag_1',
    'date_city_avg_item_cnt_lag_1',
    'date_item_city_avg_item_cnt_lag_1',
    #'date_type_avg_item_cnt_lag_1',
    #'date_subtype_avg_item_cnt_lag_1',
    'delta_price_lag',
    'month',
    'days',
    'item_shop_last_sale',
    'item_last_sale',
    'item_shop_first_sale',
    'item_first_sale',
]]

Validation strategy: month 34 is the test set, month 33 is the validation set, and months 12-32 (date_block_num) are the training set.
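
A time-based split like this can be sketched as a small function on a toy frame. `split_by_month` and the `feature` column are hypothetical names used only for illustration.

```python
import pandas as pd

def split_by_month(df, valid_block=33, test_block=34, target="item_cnt_month"):
    """Split by month: earlier months train, valid_block validates, test_block is the test set."""
    train_part = df[df["date_block_num"] < valid_block]
    valid_part = df[df["date_block_num"] == valid_block]
    test_part = df[df["date_block_num"] == test_block]
    return (
        train_part.drop(columns=[target]), train_part[target],
        valid_part.drop(columns=[target]), valid_part[target],
        test_part.drop(columns=[target]),
    )

toy = pd.DataFrame({
    "date_block_num": [12, 20, 32, 33, 33, 34],
    "feature":        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "item_cnt_month": [1.0, 0.0, 2.0, 1.0, 0.0, 0.0],
})
X_tr, y_tr, X_va, y_va, X_te = split_by_month(toy)
print(len(X_tr), len(X_va), len(X_te))
```

Splitting by month rather than at random keeps future information out of the training set, which mirrors how the test month relates to the training months.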

In [None]:
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

In [None]:
del data 
gc.collect();

In [None]:
ts = time.time()

model = XGBRegressor(
    max_depth=8,
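    n_estimators=1000, # maximum number of boosted trees
    min_child_weight=300,  # minimum sum of instance weights required in a child node
    colsample_bytree=0.8,  # fraction of columns randomly sampled for each tree
    subsample=0.8,  # fraction of rows randomly sampled for each tree
    eta=0.3, # learning rate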
    seed=42)

model.fit(
    X_train, 
    Y_train, 
    eval_metric="rmse", 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True, 
    early_stopping_rounds = 10)

time.time() - ts

In [None]:
Y_pred = model.predict(X_valid).clip(0, 20)
Y_test = model.predict(X_test).clip(0, 20)

submission = pd.DataFrame({
    "ID": test.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)

# save predictions for an ensemble
with open('xgb_train.pickle', 'wb') as f:
    pickle.dump(Y_pred, f)
with open('xgb_test.pickle', 'wb') as f:
    pickle.dump(Y_test, f)

In [None]:
plot_features(model, (10,14))