## Final Project
You are provided with daily historical sales data. The task is to forecast the total number of items sold for every (shop, item) pair in the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such changes is part of the challenge.

**File descriptions**
1. sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
2. test.csv - the test set. **You need to forecast the sales (total item count) for these shops and products for November 2015.**
3. sample_submission.csv - a sample submission file in the correct format.
4. items.csv - supplemental information about the items/products.
5. item_categories.csv - supplemental information about the item categories.
6. shops.csv - supplemental information about the shops.

**Data fields**
-  ID - an Id that represents a (Shop, Item) tuple within the test set
-  shop_id - unique identifier of a shop
-  item_id - unique identifier of a product
-  item_category_id - unique identifier of item category
-  item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
-  item_price - current price of an item
-  date - date in format dd.mm.yyyy
-  date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
-  item_name - name of item
-  shop_name - name of shop
-  item_category_name - name of item category
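
The relationship between `date` and `date_block_num` can be sketched as follows. This is a minimal illustration on a hand-made frame, assuming day-first dates with dot separators as they appear in the `sales_train.csv` preview; `toy` and `block_from_date` are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Hand-made rows in the same layout as sales_train.csv (values are illustrative).
toy = pd.DataFrame({
    "date": ["02.01.2013", "15.02.2013", "31.10.2015"],
    "item_cnt_day": [1.0, 2.0, 1.0],
})

# Parse day-first dates with an explicit format to avoid month/day mix-ups.
parsed = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# Months elapsed since January 2013, which is block 0 by definition.
toy["block_from_date"] = (parsed.dt.year - 2013) * 12 + (parsed.dt.month - 1)
```

Passing an explicit `format` matters here because a date such as 02.01.2013 could otherwise be read as February 1st.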

## Load Data

In [5]:
import numpy as np 
import pandas as pd 


items_df = pd.read_csv('../readonly/final_project_data/items.csv')
shops_df = pd.read_csv('../readonly/final_project_data/shops.csv')

icats_df = pd.read_csv('../readonly/final_project_data/item_categories.csv')
train_df = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz', header=0, sep=',', quotechar='"')
test_df  = pd.read_csv('../readonly/final_project_data/test.csv.gz', header=0, sep=',', quotechar='"')


In [6]:
items_df.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [7]:
items_df.shape

(22170, 3)

In [8]:
shops_df.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


In [9]:
shops_df.shape

(60, 2)

In [10]:
icats_df.head(10)

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4
5,Аксессуары - PSVita,5
6,Аксессуары - XBOX 360,6
7,Аксессуары - XBOX ONE,7
8,Билеты (Цифра),8
9,Доставка товара,9


In [11]:
icats_df.shape

(84, 2)

In [12]:
train_df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [13]:
train_df.shape

(2935849, 6)

In [15]:
test_df.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [16]:
test_df.shape

(214200, 3)

## Map Item Categories
Map the categories to a smaller set of broader groups

In [17]:
# Start from the original names; the loops below overwrite index ranges with broader groups.
l_cat = list(icats_df.item_category_name)

for ind in range(1,8):
    l_cat[ind] = 'Access'

for ind in range(10,18):
    l_cat[ind] = 'Consoles'

for ind in range(18,25):
    l_cat[ind] = 'Consoles Games'

for ind in range(26,28):
    l_cat[ind] = 'phone games'

for ind in range(28,32):
    l_cat[ind] = 'CD games'

for ind in range(32,37):
    l_cat[ind] = 'Card'

for ind in range(37,43):
    l_cat[ind] = 'Movie'

for ind in range(43,55):
    l_cat[ind] = 'Books'

for ind in range(55,61):
    l_cat[ind] = 'Music'

for ind in range(61,73):
    l_cat[ind] = 'Gifts'

for ind in range(73,79):
    l_cat[ind] = 'Soft'


icats_df['cats'] = l_cat
icats_df.head()

Unnamed: 0,item_category_name,item_category_id,cats
0,PC - Гарнитуры/Наушники,0,PC - Гарнитуры/Наушники
1,Аксессуары - PS2,1,Access
2,Аксессуары - PS3,2,Access
3,Аксессуары - PS4,3,Access
4,Аксессуары - PSP,4,Access


In [18]:
items_df = pd.merge(items_df, icats_df, on=['item_category_id'], how='left')

In [19]:
items_df = items_df[['item_id', 'cats']]
items_df.head()

Unnamed: 0,item_id,cats
0,0,Movie
1,1,Soft
2,2,Movie
3,3,Movie
4,4,Movie
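
As an alternative to hard-coded index ranges, the broad group could be derived from the category name itself, since many names in the preview follow a "group - subtype" pattern. The sketch below is an assumption about how that might look, not the mapping used in this notebook; `broad_group` is a hypothetical column name.

```python
import pandas as pd

# A few category names copied from the icats_df preview.
cats = pd.DataFrame({
    "item_category_name": [
        "PC - Гарнитуры/Наушники",
        "Аксессуары - PS2",
        "Аксессуары - PS3",
        "Билеты (Цифра)",
        "Доставка товара",
    ]
})

# Take the text before the first " - " as the broad group;
# names without a separator are kept whole.
cats["broad_group"] = (
    cats["item_category_name"].str.split(" - ", n=1).str[0].str.strip()
)
```

This approach does not depend on the row order of `item_categories.csv`, although it produces different groups from the manual ranges above.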


## Remove Outliers

In [20]:
train_df = train_df[train_df['item_price'] < 100000]
train_df = train_df[train_df['item_cnt_day'] < 1000]
train_df.describe()

Unnamed: 0,date_block_num,shop_id,item_id,item_price,item_cnt_day
count,2935846.0,2935846.0,2935846.0,2935846.0,2935846.0
mean,14.5699,33.00175,10197.22,890.7492,1.241562
std,9.422985,16.22697,6324.297,1720.491,2.217636
min,0.0,0.0,0.0,-1.0,-22.0
25%,7.0,22.0,4476.0,249.0,1.0
50%,14.0,31.0,9343.0,399.0,1.0
75%,23.0,47.0,15684.0,999.0,1.0
max,33.0,59.0,22169.0,59200.0,669.0
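
The summary above still shows a minimum `item_price` of -1.0 after filtering. A possible extension, sketched here on made-up rows, adds a lower bound on price to the same two upper bounds. Whether non-positive prices should be dropped is an assumption, and the notebook itself does not do this.

```python
import pandas as pd

# Made-up rows: one normal sale, one negative price, one extreme price,
# and one extreme daily count.
toy = pd.DataFrame({
    "item_price": [999.0, -1.0, 200000.0, 899.0],
    "item_cnt_day": [1.0, 1.0, 1.0, 1500.0],
})

# Same upper bounds as the cell above, plus a lower bound on price.
mask = (
    (toy["item_price"] > 0)
    & (toy["item_price"] < 100000)
    & (toy["item_cnt_day"] < 1000)
)
cleaned = toy[mask]
```

Only the first row survives all three conditions.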


## Aggregate training data

Create a grid of all (shop_id, item_id) combinations for each date_block_num

In [21]:
from itertools import product
index_cols = ['shop_id', 'item_id', 'date_block_num']
grid = []
for block_num in train_df['date_block_num'].unique():
    cur_shops = train_df.loc[train_df['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = train_df.loc[train_df['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

In [22]:
# Monthly totals per (month, shop, item): sum the daily counts, average the price.
train_df = train_df.groupby(['date_block_num','shop_id','item_id']).agg(
    {'item_cnt_day': 'sum', 'item_price': 'mean'}).reset_index()

In [23]:
train_df.rename({'item_cnt_day': 'item_cnt_month'}, axis='columns', inplace=True)

In [24]:
train_df = pd.merge(grid, train_df, on=index_cols, how='left')

In [25]:
train_df.describe()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price
count,10913800.0,10913800.0,10913800.0,1609122.0,1609122.0
mean,31.1872,11309.29,14.97336,2.265233,790.6943
std,17.34959,6209.982,9.495635,8.429583,1532.592
min,0.0,0.0,0.0,-22.0,0.09
25%,16.0,5976.0,7.0,1.0,199.0
50%,30.0,11391.0,14.0,1.0,399.0
75%,46.0,16605.0,23.0,2.0,898.5
max,59.0,22169.0,33.0,1644.0,50999.0
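
The grid-and-merge step can be illustrated on a tiny example. The sketch below uses made-up sales and the same idea as the cells above: every shop active in a month is paired with every item sold that month, and pairs without a recorded sale end up as missing values that can be read as zero sales.

```python
from itertools import product
import pandas as pd

# Made-up monthly sales: two shops and two items in month 0, one pair in month 1.
sales = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 20, 10],
    "item_cnt_month": [3.0, 1.0, 2.0],
})

rows = []
for block, month in sales.groupby("date_block_num"):
    # Cartesian product of that month's shops and items.
    rows.extend(product(month["shop_id"].unique(), month["item_id"].unique(), [block]))

grid = pd.DataFrame(rows, columns=["shop_id", "item_id", "date_block_num"])
full = grid.merge(sales, on=["shop_id", "item_id", "date_block_num"], how="left")

# Pairs with no recorded sale are NaN after the left merge; treat them as zero.
full["item_cnt_month"] = full["item_cnt_month"].fillna(0)
```

Month 0 contributes four rows (two shops times two items) and month 1 contributes one, which mirrors why the real grid has far more rows than there are observed sales.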


## Add the category group to each data point

In [26]:
train_df = pd.merge(train_df, items_df, on=['item_id'], how='left')

In [27]:
test_df = pd.merge(test_df, items_df, on=['item_id'], how='left')

In [28]:
train_df.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,cats
0,59,22154,0,1.0,999.0,Movie
1,59,2552,0,,,Music
2,59,2554,0,,,Music
3,59,2555,0,,,Music
4,59,2564,0,,,Music


In [29]:
test_df.head()

Unnamed: 0,ID,shop_id,item_id,cats
0,0,5,5037,Consoles Games
1,1,5,5320,Music
2,2,5,5233,Consoles Games
3,3,5,5232,Consoles Games
4,4,5,5268,Consoles Games


In [30]:
set(test_df.cats.unique()) <= set(train_df.cats.unique())

True

## Mean encoding

In [31]:
for type_ids in [['item_id'], ['shop_id'], ['cats'], ['item_id', 'shop_id']]:
    for column_id in ['item_price', 'item_cnt_month']:
        mean_df = train_df[type_ids + [column_id]].groupby(type_ids).agg(np.mean).reset_index()
        mean_df.rename(
            {column_id: "mean_of_"+column_id+"_groupby_"+"_".join(type_ids)},
            axis='columns', inplace=True
        )
        
        train_df = pd.merge(train_df, mean_df, on=type_ids, how='left')
        test_df = pd.merge(test_df, mean_df, on=type_ids, how='left')

In [32]:
train_df.describe()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,mean_of_item_price_groupby_cats,mean_of_item_cnt_month_groupby_cats,mean_of_item_price_groupby_item_id_shop_id,mean_of_item_cnt_month_groupby_item_id_shop_id
count,10913800.0,10913800.0,10913800.0,1609122.0,1609122.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,6288341.0,6288341.0
mean,31.1872,11309.29,14.97336,2.265233,790.6943,682.7821,1.810334,823.9197,2.173506,673.3016,2.165103,697.0326,1.602656
std,17.34959,6209.982,9.495635,8.429583,1532.592,1363.974,5.518281,166.8937,0.6904852,1093.747,2.459068,1388.719,3.709803
min,0.0,0.0,0.0,-22.0,0.09,4.89643,-2.2,300.5999,1.27907,33.66439,1.0,0.09,-3.5
25%,16.0,5976.0,7.0,1.0,199.0,194.2063,1.090909,757.2104,1.80925,340.5099,1.680894,199.0,1.0
50%,30.0,11391.0,14.0,1.0,399.0,300.1633,1.22549,825.1514,2.043924,372.6226,1.745858,314.0,1.0
75%,46.0,16605.0,23.0,2.0,898.5,701.7879,1.585714,876.0895,2.211633,859.6175,2.633343,699.0,1.5
max,59.0,22169.0,33.0,1644.0,50999.0,50999.0,622.0,1438.481,9.972344,13823.18,78.95545,50999.0,664.4667


In [33]:
test_df.describe()

Unnamed: 0,ID,shop_id,item_id,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,mean_of_item_price_groupby_cats,mean_of_item_cnt_month_groupby_cats,mean_of_item_price_groupby_item_id_shop_id,mean_of_item_cnt_month_groupby_item_id_shop_id
count,214200.0,214200.0,214200.0,198954.0,198954.0,214200.0,214200.0,214200.0,214200.0,111404.0,111404.0
mean,107099.5,31.642857,11019.398627,1024.694159,2.277038,863.183539,2.127632,756.945382,2.336335,992.405095,1.902407
std,61834.358168,17.561933,6252.64459,1979.721761,5.627254,164.626848,0.599815,1078.29795,3.13099,1890.474881,4.649952
min,0.0,2.0,30.0,4.89643,0.666667,523.952475,1.27907,87.8,1.0,0.99,0.0
25%,53549.75,16.0,5381.5,281.10901,1.172297,770.380085,1.799914,340.509868,1.680894,299.0,1.0
50%,107099.5,34.5,11203.0,488.3625,1.418182,831.600466,2.024027,372.622621,1.745858,472.502381,1.2
75%,160649.25,47.0,16071.5,1247.578836,2.014851,943.030349,2.154723,859.617546,2.883084,1129.0,1.857143
max,214199.0,59.0,22167.0,40453.407407,181.264706,1335.210545,4.965377,13823.183217,78.955446,42990.0,664.466667
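
The encodings above are computed over all months at once, so each training row's own target contributes to its `mean_of_item_cnt_month_*` features. One common way to reduce this kind of leakage is an expanding mean that uses only earlier months. The sketch below is a hypothetical alternative, not what this notebook does, and it assumes exactly one row per item per month; `item_mean_past` is an illustrative name.

```python
import pandas as pd

# Made-up monthly sales for two items, one row per item per month.
df = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2, 2],
    "item_id":        [10, 20, 10, 20, 10, 20],
    "item_cnt_month": [2.0, 0.0, 4.0, 0.0, 6.0, 3.0],
}).sort_values(["item_id", "date_block_num"]).reset_index(drop=True)

grouped = df.groupby("item_id")["item_cnt_month"]

# Sum and count of the target over strictly earlier rows of the same item.
prev_sum = grouped.cumsum() - df["item_cnt_month"]
prev_cnt = grouped.cumcount()

# Expanding mean; the first month of each item has no history and stays NaN.
df["item_mean_past"] = prev_sum / prev_cnt.where(prev_cnt > 0)
```

With several shops per item and month, the rows of the current month would have to be excluded together, for example by aggregating to item-month level first.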


## Fill missing values

In [34]:
test_df['mean_of_item_price_groupby_item_id'] = test_df['mean_of_item_price_groupby_item_id'].fillna(test_df['mean_of_item_price_groupby_cats'])
test_df['mean_of_item_cnt_month_groupby_item_id'] = test_df['mean_of_item_cnt_month_groupby_item_id'].fillna(test_df['mean_of_item_cnt_month_groupby_cats'])
test_df['mean_of_item_price_groupby_item_id_shop_id'] = test_df['mean_of_item_price_groupby_item_id_shop_id'].fillna(test_df['mean_of_item_price_groupby_item_id'])
test_df['mean_of_item_cnt_month_groupby_item_id_shop_id'] = test_df['mean_of_item_cnt_month_groupby_item_id_shop_id'].fillna(test_df['mean_of_item_cnt_month_groupby_item_id'])

In [35]:
train_df['mean_of_item_price_groupby_item_id_shop_id'] = train_df['mean_of_item_price_groupby_item_id_shop_id'].fillna(train_df['mean_of_item_price_groupby_item_id'])
train_df['mean_of_item_cnt_month_groupby_item_id_shop_id'] = train_df['mean_of_item_cnt_month_groupby_item_id_shop_id'].fillna(train_df['mean_of_item_cnt_month_groupby_item_id'])

In [36]:
train_df.columns

Index(['shop_id', 'item_id', 'date_block_num', 'item_cnt_month', 'item_price',
       'cats', 'mean_of_item_price_groupby_item_id',
       'mean_of_item_cnt_month_groupby_item_id',
       'mean_of_item_price_groupby_shop_id',
       'mean_of_item_cnt_month_groupby_shop_id',
       'mean_of_item_price_groupby_cats',
       'mean_of_item_cnt_month_groupby_cats',
       'mean_of_item_price_groupby_item_id_shop_id',
       'mean_of_item_cnt_month_groupby_item_id_shop_id'],
      dtype='object')

In [37]:
for df in train_df, test_df:
    for feat in df.columns[4:]:
        if 'item_cnt' in feat:
            df[feat]=df[feat].fillna(0)
        elif 'item_price' in feat:
            df[feat]=df[feat].fillna(df[feat].median())

In [38]:
train_df['item_cnt_month'] = train_df['item_cnt_month'].fillna(0)

In [39]:
train_df.describe()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,mean_of_item_price_groupby_cats,mean_of_item_cnt_month_groupby_cats,mean_of_item_price_groupby_item_id_shop_id,mean_of_item_cnt_month_groupby_item_id_shop_id
count,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0,10913800.0
mean,31.1872,11309.29,14.97336,0.333984,456.7511,682.7821,1.810334,823.9197,2.173506,673.3016,2.165103,681.008,1.733548
std,17.34959,6209.982,9.495635,3.334923,604.6454,1363.974,5.518281,166.8937,0.6904852,1093.747,2.459068,1364.772,5.861127
min,0.0,0.0,0.0,-22.0,0.09,4.89643,-2.2,300.5999,1.27907,33.66439,1.0,0.09,-3.5
25%,16.0,5976.0,7.0,0.0,399.0,194.2063,1.090909,757.2104,1.80925,340.5099,1.680894,195.9215,1.0
50%,30.0,11391.0,14.0,0.0,399.0,300.1633,1.22549,825.1514,2.043924,372.6226,1.745858,299.0,1.111111
75%,46.0,16605.0,23.0,0.0,399.0,701.7879,1.585714,876.0895,2.211633,859.6175,2.633343,699.0,1.467626
max,59.0,22169.0,33.0,1644.0,50999.0,50999.0,622.0,1438.481,9.972344,13823.18,78.95545,50999.0,664.4667


In [40]:
train_df.columns

Index(['shop_id', 'item_id', 'date_block_num', 'item_cnt_month', 'item_price',
       'cats', 'mean_of_item_price_groupby_item_id',
       'mean_of_item_cnt_month_groupby_item_id',
       'mean_of_item_price_groupby_shop_id',
       'mean_of_item_cnt_month_groupby_shop_id',
       'mean_of_item_price_groupby_cats',
       'mean_of_item_cnt_month_groupby_cats',
       'mean_of_item_price_groupby_item_id_shop_id',
       'mean_of_item_cnt_month_groupby_item_id_shop_id'],
      dtype='object')

## Add previous months' sales data

In [41]:
train_df_temp = train_df.copy()
train_df = train_df[train_df['date_block_num']>=12]

In [42]:
features = ['item_cnt_month', 'item_price', 'mean_of_item_price_groupby_item_id',
       'mean_of_item_cnt_month_groupby_item_id',
       'mean_of_item_price_groupby_shop_id',
       'mean_of_item_cnt_month_groupby_shop_id',
       'mean_of_item_price_groupby_cats',
       'mean_of_item_cnt_month_groupby_cats',
       'mean_of_item_price_groupby_item_id_shop_id',
       'mean_of_item_cnt_month_groupby_item_id_shop_id']

In [43]:
def add_historical_data(df):
    for diff in (1, 2, 3, 4, 6, 12):
        train_df_copy = train_df_temp.copy()
        train_df_copy['date_block_num'] += diff
        train_df_copy = train_df_copy[['date_block_num', 'item_id', 'shop_id'] + features]
        train_df_copy.rename({
            feat: feat+"_"+str(diff)+'_month_ago' for feat in features
        }, axis=1, inplace=True)
        df = pd.merge(df, train_df_copy, on=['shop_id', 'item_id', 'date_block_num'], how='left')
    return df

In [44]:
test_df['date_block_num'] = 34
train_df = add_historical_data(train_df)
test_df = add_historical_data(test_df)
test_df.drop('date_block_num', axis=1, inplace=True)

In [45]:
train_df.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,cats,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,...,item_cnt_month_12_month_ago,item_price_12_month_ago,mean_of_item_price_groupby_item_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago
0,54,10297,12,4.0,749.0,Movie,709.478496,1.210526,655.04886,2.636404,...,,,,,,,,,,
1,54,10296,12,3.0,1599.0,Movie,1464.972764,1.130081,655.04886,2.636404,...,,,,,,,,,,
2,54,10298,12,14.0,399.0,Movie,223.781333,4.82138,655.04886,2.636404,...,,,,,,,,,,
3,54,10300,12,3.0,699.0,Movie,519.571884,2.520833,655.04886,2.636404,...,,,,,,,,,,
4,54,10284,12,1.0,299.0,Music,284.902913,1.174757,655.04886,2.636404,...,,,,,,,,,,
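
The lag construction in `add_historical_data` relies on shifting `date_block_num` forward before merging. A minimal sketch of that idea for a single one-month lag, on made-up data:

```python
import pandas as pd

# Made-up monthly sales for one (shop, item) pair.
monthly = pd.DataFrame({
    "shop_id": [1, 1, 1],
    "item_id": [10, 10, 10],
    "date_block_num": [0, 1, 2],
    "item_cnt_month": [5.0, 7.0, 9.0],
})

# Shift the month index forward by one so that month m's value
# lines up with month m + 1 after the merge.
lagged = monthly.copy()
lagged["date_block_num"] += 1
lagged = lagged.rename(columns={"item_cnt_month": "item_cnt_month_1_month_ago"})

keys = ["shop_id", "item_id", "date_block_num"]
with_lag = monthly.merge(lagged, on=keys, how="left")
```

Month 0 has no earlier month to draw from, so its lag value is missing; this is the source of the NaN columns in the preview above.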


Fill missing values in the lag features

In [46]:
train_df.columns

Index(['shop_id', 'item_id', 'date_block_num', 'item_cnt_month', 'item_price',
       'cats', 'mean_of_item_price_groupby_item_id',
       'mean_of_item_cnt_month_groupby_item_id',
       'mean_of_item_price_groupby_shop_id',
       'mean_of_item_cnt_month_groupby_shop_id',
       'mean_of_item_price_groupby_cats',
       'mean_of_item_cnt_month_groupby_cats',
       'mean_of_item_price_groupby_item_id_shop_id',
       'mean_of_item_cnt_month_groupby_item_id_shop_id',
       'item_cnt_month_1_month_ago', 'item_price_1_month_ago',
       'mean_of_item_price_groupby_item_id_1_month_ago',
       'mean_of_item_cnt_month_groupby_item_id_1_month_ago',
       'mean_of_item_price_groupby_shop_id_1_month_ago',
       'mean_of_item_cnt_month_groupby_shop_id_1_month_ago',
       'mean_of_item_price_groupby_cats_1_month_ago',
       'mean_of_item_cnt_month_groupby_cats_1_month_ago',
       'mean_of_item_price_groupby_item_id_shop_id_1_month_ago',
       'mean_of_item_cnt_month_groupby_item_id_shop

In [47]:
for df in train_df, test_df:
    for feat in train_df.columns[6:]:
        if 'item_cnt' in feat:
            df[feat]=df[feat].fillna(0)
        elif 'item_price' in feat:
            df[feat]=df[feat].fillna(df[feat].median())

## Add some pair-difference features

In [48]:
columns = {
    'diff_between_item_shop_and_item': ('mean_of_item_price_groupby_item_id_shop_id', 'mean_of_item_price_groupby_item_id'),
    'diff_between_item_and_category': ('mean_of_item_price_groupby_item_id', 'mean_of_item_price_groupby_cats')
}
for new_feature, (col1, col2) in columns.items():
    for df in (train_df, test_df):
        df[new_feature] = df[col1] - df[col2]

In [49]:
train_df.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,cats,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,...,mean_of_item_price_groupby_item_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago,diff_between_item_shop_and_item,diff_between_item_and_category
0,54,10297,12,4.0,749.0,Movie,709.478496,1.210526,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,39.521504,368.968628
1,54,10296,12,3.0,1599.0,Movie,1464.972764,1.130081,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,62.185807,1124.462896
2,54,10298,12,14.0,399.0,Movie,223.781333,4.82138,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,12.277491,-116.728536
3,54,10300,12,3.0,699.0,Movie,519.571884,2.520833,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,-18.620813,179.062016
4,54,10284,12,1.0,299.0,Music,284.902913,1.174757,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,14.097087,-87.719709


In [50]:
test_df.head()

Unnamed: 0,ID,shop_id,item_id,cats,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,mean_of_item_price_groupby_cats,mean_of_item_cnt_month_groupby_cats,...,mean_of_item_price_groupby_item_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago,diff_between_item_shop_and_item,diff_between_item_and_category
0,0,5,5037,Consoles Games,1960.580473,2.873303,804.758232,1.773768,1537.78918,2.633343,...,1960.580473,2.873303,804.758232,1.773768,1537.78918,2.633343,1693.518519,1.444444,-267.061955,422.791294
1,1,5,5320,Music,372.622621,1.379644,804.758232,1.773768,372.622621,1.379644,...,399.968818,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,0.0,0.0
2,2,5,5233,Consoles Games,844.516003,2.668421,804.758232,1.773768,1537.78918,2.633343,...,399.968818,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,14.483997,-693.273177
3,3,5,5232,Consoles Games,792.527697,1.855263,804.758232,1.773768,1537.78918,2.633343,...,399.968818,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,-193.527697,-745.261482
4,4,5,5268,Consoles Games,1537.78918,2.633343,804.758232,1.773768,1537.78918,2.633343,...,399.968818,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,0.0,0.0


## Clip the values of the target

In [51]:
train_df['item_cnt_month'] = train_df['item_cnt_month'].clip(0, 20)

## Split into training set and validation set

In [52]:
training_set = train_df[train_df['date_block_num']<33]
validation_set = train_df[train_df['date_block_num']==33].reset_index()

In [53]:
validation_set.head()

Unnamed: 0,index,shop_id,item_id,date_block_num,item_cnt_month,item_price,cats,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,...,mean_of_item_price_groupby_item_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago,diff_between_item_shop_and_item,diff_between_item_and_category
0,6186922,45,13315,33,1.0,649.0,Books,640.875,1.125,861.189864,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,8.125,290.090958
1,6186923,45,13880,33,1.0,229.0,Music,208.925643,2.561828,861.189864,...,208.925643,2.561828,861.189864,1.80925,372.622621,1.379644,214.0,1.8,5.074357,-163.696979
2,6186924,45,13881,33,2.0,659.0,Music,579.715439,2.710271,861.189864,...,579.715439,2.710271,861.189864,1.80925,372.622621,1.379644,588.3,1.96,8.584561,207.092817
3,6186925,45,13923,33,1.0,169.0,Movie,150.236386,1.572183,861.189864,...,150.236386,1.572183,861.189864,1.80925,340.509868,1.745858,157.571429,1.142857,7.335042,-190.273482
4,6186926,45,14227,33,1.0,99.0,CD games,97.129878,3.462963,861.189864,...,97.129878,3.462963,861.189864,1.80925,520.312473,3.636535,99.0,1.75,1.870122,-423.182595


## Only keep the useful columns

In [54]:
features = train_df.columns[6:].tolist()

In [62]:
# train_df[features].to_csv('../readonly/final_project_data/X_train_all.csv.gz', index=False, compression='gzip')
# train_df['item_cnt_month'].to_csv('../readonly/final_project_data/y_train_all.csv.gz', index=False, compression='gzip')
X_train = training_set[features]
# X_train.to_csv('../readonly/final_project_data/X_train.csv.gz', index=False, compression='gzip')
y_train = training_set['item_cnt_month']
# y_train.to_csv('../readonly/final_project_data/y_train.csv.gz', index=False, compression='gzip')
X_validation = validation_set[features]
# X_validation.to_csv('../readonly/final_project_data/X_validation.csv.gz', index=False, compression='gzip')
y_validation = validation_set['item_cnt_month']
# y_validation.to_csv('../readonly/final_project_data/y_validation.csv.gz', index=False, compression='gzip')
test_df = test_df[['ID'] + features]
# test_df.to_csv('../readonly/final_project_data/X_test.csv.gz', index=False, compression='gzip')

## XGBoost model

In [57]:
import xgboost as xgb

In [58]:
train_df.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_month,item_price,cats,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,...,mean_of_item_price_groupby_item_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago,diff_between_item_shop_and_item,diff_between_item_and_category
0,54,10297,12,4.0,749.0,Movie,709.478496,1.210526,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,39.521504,368.968628
1,54,10296,12,3.0,1599.0,Movie,1464.972764,1.130081,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,62.185807,1124.462896
2,54,10298,12,14.0,399.0,Movie,223.781333,4.82138,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,12.277491,-116.728536
3,54,10300,12,3.0,699.0,Movie,519.571884,2.520833,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,-18.620813,179.062016
4,54,10284,12,1.0,299.0,Music,284.902913,1.174757,655.04886,2.636404,...,296.188435,0.0,825.151374,0.0,372.622621,0.0,299.0,0.0,14.097087,-87.719709


In [59]:
params = {
        'eta': 0.08, #best 0.08
        'max_depth': 7,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 3,
        'gamma':1,
        'silent': True
    }

In [60]:
X_test = test_df[features]

In [63]:
watchlist = [
    (xgb.DMatrix(X_train, y_train), 'train'),
    (xgb.DMatrix(X_validation, y_validation), 'validation')
]
model = xgb.train(params, xgb.DMatrix(X_train, y_train), 500,  watchlist, maximize=False, verbose_eval=5, early_stopping_rounds=50)

[0]	train-rmse:1.15446	validation-rmse:1.10796
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[5]	train-rmse:0.969408	validation-rmse:0.920696
[10]	train-rmse:0.869434	validation-rmse:0.827898
[15]	train-rmse:0.817062	validation-rmse:0.786543
[20]	train-rmse:0.786398	validation-rmse:0.766274
[25]	train-rmse:0.769378	validation-rmse:0.756646
[30]	train-rmse:0.757348	validation-rmse:0.750659
[35]	train-rmse:0.747826	validation-rmse:0.740738
[40]	train-rmse:0.741841	validation-rmse:0.737658
[45]	train-rmse:0.736802	validation-rmse:0.737938
[50]	train-rmse:0.731111	validation-rmse:0.735909
[55]	train-rmse:0.726615	validation-rmse:0.731159
[60]	train-rmse:0.723173	validation-rmse:0.731024
[65]	train-rmse:0.720289	validation-rmse:0.730388
[70]	train-rmse:0.717834	validation-rmse:0.730966
[75]	train-rmse:0.715444	validation-rmse:0.732174
[80]	train-rmse:0.712924	validation-rmse:0.732251
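
The early-stopping rule reported in the log ("Will train until validation-rmse hasn't improved in 50 rounds") can be sketched in plain Python. This is an illustration of the rule under the assumption that the metric is minimized, not XGBoost's actual implementation; `best_round` and the score list are hypothetical.

```python
def best_round(scores, patience):
    """Return (best_index, stop_index) for a metric that should decrease.

    Training stops at the first round that is `patience` rounds past the
    best score without having improved on it.
    """
    best_i = 0
    for i, score in enumerate(scores):
        if score < scores[best_i]:
            best_i = i
        elif i - best_i >= patience:
            return best_i, i
    # Never triggered: training ran through all rounds.
    return best_i, len(scores) - 1


# Made-up validation RMSE per boosting round.
history = [1.10, 0.92, 0.83, 0.79, 0.80, 0.81, 0.82]
best_i, stop_i = best_round(history, patience=3)
```

Here the best score is at round 3 and training stops at round 6, three rounds later, so predictions would use the model as of round 3.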

In [64]:
pred = model.predict(xgb.DMatrix(X_test), ntree_limit=model.best_ntree_limit)
test_df['item_cnt_month'] = pred.clip(0, 20)  # match the [0, 20] clipping applied to the training target
test_df[['ID', 'item_cnt_month']].to_csv('xgboost_submission.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [79]:
test_df.head()

Unnamed: 0,ID,mean_of_item_price_groupby_item_id,mean_of_item_cnt_month_groupby_item_id,mean_of_item_price_groupby_shop_id,mean_of_item_cnt_month_groupby_shop_id,mean_of_item_price_groupby_cats,mean_of_item_cnt_month_groupby_cats,mean_of_item_price_groupby_item_id_shop_id,mean_of_item_cnt_month_groupby_item_id_shop_id,item_cnt_month_1_month_ago,...,mean_of_item_cnt_month_groupby_item_id_12_month_ago,mean_of_item_price_groupby_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_shop_id_12_month_ago,mean_of_item_price_groupby_cats_12_month_ago,mean_of_item_cnt_month_groupby_cats_12_month_ago,mean_of_item_price_groupby_item_id_shop_id_12_month_ago,mean_of_item_cnt_month_groupby_item_id_shop_id_12_month_ago,diff_between_item_shop_and_item,diff_between_item_and_category,item_cnt_month
0,0,1960.580473,2.873303,804.758232,1.773768,1537.78918,2.633343,1693.518519,1.444444,0.0,...,2.873303,804.758232,1.773768,1537.78918,2.633343,1693.518519,1.444444,-267.061955,422.791294,0.671905
1,1,372.622621,1.379644,804.758232,1.773768,372.622621,1.379644,372.622621,1.379644,0.0,...,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,0.0,0.0,0.007018
2,2,844.516003,2.668421,804.758232,1.773768,1537.78918,2.633343,859.0,2.0,1.0,...,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,14.483997,-693.273177,1.067947
3,3,792.527697,1.855263,804.758232,1.773768,1537.78918,2.633343,599.0,1.0,0.0,...,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,-193.527697,-745.261482,0.258787
4,4,1537.78918,2.633343,804.758232,1.773768,1537.78918,2.633343,1537.78918,2.633343,0.0,...,0.0,830.860711,0.0,372.622621,0.0,399.0,0.0,0.0,0.0,0.091715


## Stacking (takes a long time to run)

In [65]:
X_train_new = X_train.copy()
X_validation_new = X_validation.copy()
X_test_new = X_test.copy()

Train three XGBoost models with different parameters

In [66]:
params1 = {
        'eta': 0.08, #best 0.08
        'max_depth': 7,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 3,
        'gamma':1,
        'silent': True
    }

In [67]:
params2 = {
        'eta': 0.08, #best 0.08
        'max_depth': 8,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 4,
        'gamma':1,
        'silent': True
    }

In [68]:
params3 = {
        'eta': 0.08, #best 0.08
        'max_depth': 6,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 5,
        'gamma':1,
        'silent': True
    }

In [69]:
watchlist = [
    (xgb.DMatrix(X_train, y_train), 'train'),
    (xgb.DMatrix(X_validation, y_validation), 'validation')
]
for i, params in enumerate([params1, params2, params3]):
    model = xgb.train(params, xgb.DMatrix(X_train, y_train), 500,  watchlist, maximize=False, verbose_eval=50, early_stopping_rounds=50)
    X_train_new['xgboost_item_cnt_month_'+str(i)] = model.predict(xgb.DMatrix(X_train), ntree_limit=model.best_ntree_limit)
    X_validation_new['xgboost_item_cnt_month_'+str(i)] = model.predict(xgb.DMatrix(X_validation), ntree_limit=model.best_ntree_limit)
    X_test_new['xgboost_item_cnt_month_'+str(i)] = model.predict(xgb.DMatrix(X_test), ntree_limit=model.best_ntree_limit)

[0]	train-rmse:1.15446	validation-rmse:1.10796
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[50]	train-rmse:0.731111	validation-rmse:0.735909
[100]	train-rmse:0.705274	validation-rmse:0.73483
Stopping. Best iteration:
[58]	train-rmse:0.724912	validation-rmse:0.730127

[0]	train-rmse:1.15146	validation-rmse:1.10414
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[50]	train-rmse:0.709553	validation-rmse:0.72672
Stopping. Best iteration:
[49]	train-rmse:0.710442	validation-rmse:0.726317

[0]	train-rmse:1.15841	validation-rmse:1.11637
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[50]	train-rmse:0.75632	validation-rmse:0.758347
[100]	train-rmse:0.731682	validation-rmse:

In [None]:
X_test_new.head(10)

In [None]:
X_train_new.to_csv('./data/X_train_new.csv', index=False)
X_validation_new.to_csv('./data/X_validation_new.csv', index=False)
X_test_new.to_csv('./data/X_test_new.csv', index=False)

In [None]:
X_train_new = pd.read_csv('./data/X_train_new.csv')
X_validation_new = pd.read_csv('./data/X_validation_new.csv')
X_test_new = pd.read_csv('./data/X_test_new.csv')

Train three k-nearest-neighbors (k-NN) regressors

In [70]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
X_train_sample, _, y_train_sample, __ = train_test_split(X_train, y_train, train_size=.05, random_state=10)
scaler = MinMaxScaler()
scaler.fit(X_train_sample)
for k in (2, 3, 4):
    print("Training model "+str(k))
    neigh = KNeighborsRegressor(n_neighbors=k, n_jobs=6, algorithm='kd_tree')
    neigh.fit(scaler.transform(X_train_sample), y_train_sample)
    print("Using "+str(k)+" to predict")
    X_train_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_train))
    X_validation_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_validation))
    X_test_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_test))



Training model 2
Using 2 to predict
Training model 3
Using 3 to predict
Training model 4
Using 4 to predict
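
In the cell above the scaler is fit on the 5% sample only and then applied to the full training, validation, and test sets. The sketch below reproduces the min-max formula by hand in NumPy (an illustration with made-up numbers, not a call into scikit-learn) to show that rows outside the sample's range are mapped outside [0, 1].

```python
import numpy as np

sample = np.array([[100.0, 1.0], [300.0, 3.0]])  # rows used to fit the scaling
other = np.array([[200.0, 2.0], [500.0, 0.0]])   # rows seen only at predict time

# Per-column minimum and maximum learned from the sample.
lo = sample.min(axis=0)
hi = sample.max(axis=0)


def min_max(x):
    """Apply the sample's min-max scaling to any array with the same columns."""
    return (x - lo) / (hi - lo)


scaled_sample = min_max(sample)
scaled_other = min_max(other)
```

The second row of `other` scales to 2.0 and -0.5, so distances involving such rows are dominated by the out-of-range features.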



Train SVM regressors (not used)

In [None]:
# from sklearn.svm import SVR
# scaler = MinMaxScaler()
# X_train_transform = scaler.fit_transform(X_train)
# X_validation_transform = scaler.transform(X_validation)
# X_test_transform = scaler.transform(X_test)
# for kernel in 'poly', 'rbf', 'sigmoid':
#     clf = SVR(kernel=kernel, max_iter=500)
#     print("Training the "+kernel+" model")
#     clf.fit(X_train_transform, y_train)
#     print("Using the "+kernel+" model to predict")
#     X_train_new['svm_'+kernel] = clf.predict(X_train_transform)
#     X_validation_new['svm_'+kernel] = clf.predict(X_validation_transform)
#     X_test_new['svm_'+kernel] = clf.predict(X_test_transform)

In [None]:
# X_test_new.head(10)

In [None]:
# for df in X_train_new, X_validation_new, X_test_new:
#     df.drop(['svm_poly', 'svm_rbf', 'svm_sigmoid'], axis=1, inplace=True)

Use ridge regression to ensemble all models

In [71]:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1, copy_X=True, normalize=True, max_iter=1000)
model.fit(X_train_new, y_train)
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_validation, model.predict(X_validation_new)))

0.5266581452890597
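
The value printed above is a mean squared error, while the XGBoost logs report RMSE. Taking the square root puts the two on the same scale:

```python
import math

mse = 0.5266581452890597  # value printed by the cell above
rmse = math.sqrt(mse)     # same units as the validation-rmse in the XGBoost logs
```

This gives an RMSE of about 0.726 for the stacked model on the validation month.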


In [73]:
pred = model.predict(X_test_new)
test_df['item_cnt_month'] = pred.clip(0, 20)
test_df[['ID', 'item_cnt_month']].to_csv('stacking_submission.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
