## Final Project
You are provided with daily historical sales data. The task is to forecast the total number of products sold in every shop for the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.

**File descriptions**
1. sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
2. test.csv - the test set. **You need to forecast the sales (total item count) for these shops and products for November 2015.**
3. sample_submission.csv - a sample submission file in the correct format.
4. items.csv - supplemental information about the items/products.
5. item_categories.csv - supplemental information about the item categories.
6. shops.csv - supplemental information about the shops.

**Data fields**
-  ID - an Id that represents a (Shop, Item) tuple within the test set
-  shop_id - unique identifier of a shop
-  item_id - unique identifier of a product
-  item_category_id - unique identifier of item category
-  item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
-  item_price - current price of an item
-  date - date in format dd/mm/yyyy
-  date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
-  item_name - name of item
-  shop_name - name of shop
-  item_category_name - name of item category

### Modeling Strategy:
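Use month 33 (October 2015), the last month of training data, as a holdout validation set.
A better strategy is time-aware cross-validation, which requires the feature extraction to be implemented as a pipeline that can be re-fit on each fold.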
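
As a rough illustration of what such a cross-validation could look like, the sketch below builds expanding-window folds over `date_block_num`: each fold validates on one month and trains on all earlier months. The function name `expanding_month_splits` and the toy frame are assumptions made for this example; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def expanding_month_splits(months, first_valid_month, last_valid_month):
    """Yield (train_mask, valid_mask) pairs, one per validation month.

    Each fold trains on all months strictly before the validation month,
    so no future information leaks into training.
    """
    months = np.asarray(months)
    for m in range(first_valid_month, last_valid_month + 1):
        yield months < m, months == m

# Toy frame: 3 rows for each of months 30..33
toy = pd.DataFrame({'date_block_num': np.repeat([30, 31, 32, 33], 3)})
folds = list(expanding_month_splits(toy['date_block_num'], 31, 33))
for train_mask, valid_mask in folds:
    print(int(train_mask.sum()), int(valid_mask.sum()))
```

With real data, any feature fitted on the target (such as the mean encodings below) would need to be re-fit on the training part of each fold.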

In [1]:
import numpy as np
import pandas as pd 
import xgboost as xgb

for p in [np, pd, xgb]:
    print(p.__name__, p.__version__)

numpy 1.14.3
pandas 0.23.0
xgboost 0.72


In [2]:
from itertools import product
# from tqdm import tqdm_notebook
import gc 

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

### 1. Load the data and train/test split

#### 1.1 Load the data

In [4]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')
shops = pd.read_csv('../readonly/final_project_data/shops.csv')
items = pd.read_csv('../readonly/final_project_data/items.csv')
item_cats = pd.read_csv('../readonly/final_project_data/item_categories.csv')
sample_submission = pd.read_csv('../readonly/final_project_data/sample_submission.csv.gz')
sales_test = pd.read_csv('../readonly/final_project_data/test.csv.gz')

#### 1.2 Clean the data

In [5]:
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df
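
To see the effect of downcasting, the sketch below applies the same type mapping to a small synthetic frame and compares memory usage. It uses `select_dtypes` and `astype` with a dict as an alternative way to express the mapping; the function name `downcast_with_select_dtypes` and the demo data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def downcast_with_select_dtypes(df):
    """Return a copy with float64 -> float32 and int64 -> int32."""
    float_cols = df.select_dtypes(include=['float64']).columns
    int_cols = df.select_dtypes(include=['int64']).columns
    dtype_map = {c: np.float32 for c in float_cols}
    dtype_map.update({c: np.int32 for c in int_cols})
    return df.astype(dtype_map)

demo = pd.DataFrame({
    'shop_id': np.arange(1000, dtype=np.int64),
    'item_price': np.linspace(0.0, 999.0, 1000),
})
small = downcast_with_select_dtypes(demo)

# Memory used by the column data, in bytes, before and after
before = demo.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(before, after)
```

Downcasting is only safe when the values fit the narrower type: `int32` overflows beyond roughly ±2.1 billion, and `float32` keeps about 7 significant digits.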

In [6]:
# Remove outliers
sales = sales[sales['item_price'] < 100000]
sales = sales[sales['item_cnt_day'] < 1000]

The data are transaction records, so there can be multiple records for a given day, shop, and item. Since the objective is to forecast monthly sales, the data is aggregated to monthly sales.

In [7]:
index_col = ['date_block_num','shop_id','item_id']
sales = sales.groupby(index_col).agg({'item_cnt_day': np.sum, 'item_price': np.mean}).reset_index()
sales.rename({'item_cnt_day': 'item_cnt_month'}, axis='columns', inplace=True)

In [8]:
sales = pd.merge(sales, items, on='item_id',how='left')

In [9]:
sales = downcast_dtypes(sales)
sales = sales.drop(['item_name'], axis=1)

In [10]:
sales.head()

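| | date_block_num | shop_id | item_id | item_cnt_month | item_price | item_category_id |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 32 | 6.0 | 221.0 | 40 |
| 1 | 0 | 0 | 33 | 3.0 | 347.0 | 37 |
| 2 | 0 | 0 | 35 | 1.0 | 247.0 | 40 |
| 3 | 0 | 0 | 43 | 1.0 | 221.0 | 40 |
| 4 | 0 | 0 | 51 | 2.0 | 128.5 | 57 |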


In [11]:
sales_test = pd.merge(sales_test, items, on='item_id',how='left')
sales_test = sales_test.drop(['item_name'], axis=1)
sales_test['date_block_num'] = 34
sales_test.head()

| | ID | shop_id | item_id | item_category_id | date_block_num |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5037 | 19 | 34 |
| 1 | 1 | 5 | 5320 | 55 | 34 |
| 2 | 2 | 5 | 5233 | 19 | 34 |
| 3 | 3 | 5 | 5232 | 23 | 34 |
| 4 | 4 | 5 | 5268 | 20 | 34 |


In [12]:
sales.describe()

| | date_block_num | shop_id | item_id | item_cnt_month | item_price | item_category_id |
|---|---|---|---|---|---|---|
| count | 1609122.0 | 1609122.0 | 1609122.0 | 1609122.0 | 1609122.0 | 1609122.0 |
| mean | 14.66479 | 32.80587 | 10680.99 | 2.265233 | 790.1026 | 41.54188 |
| std | 9.542325 | 16.537 | 6238.881 | 8.392699 | 1532.409 | 16.32362 |
| min | 0.0 | 0.0 | 0.0 | -22.0 | 0.09 | 0.0 |
| 25% | 6.0 | 21.0 | 5045.0 | 1.0 | 199.0 | 30.0 |
| 50% | 14.0 | 31.0 | 10497.0 | 1.0 | 399.0 | 40.0 |
| 75% | 23.0 | 47.0 | 16060.0 | 2.0 | 898.5 | 55.0 |
| max | 33.0 | 59.0 | 22169.0 | 1644.0 | 50999.0 | 83.0 |


In [13]:
# Train/Test Split
train_df = sales.copy()
test_df = sales_test.copy()

### 2. Feature Extraction 

In [14]:
class Groupby_Avg_Featurizer(TransformerMixin):
    def __init__(self, group_col, value_col, output_col, include_negative=False, weight_col=None):
        self.group_col = group_col
        self.value_col = value_col
        self.weight_col = weight_col
        self.output_col = output_col
        self.include_negative=include_negative
        self.gb = None
        
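    def fit(self, X, y=None):
        # `y` is unused; it is accepted so the signature matches scikit-learn's fit(X, y) convention
        if self.weight_col:
            if not self.include_negative:
                # Keep only rows with a positive weight (drops zero-sale and net-return rows)
                X = X.loc[X[self.weight_col] > 0]
            # Note: this is the per-group mean of value * weight; it is not divided by the sum of the weights
            self.gb = X.groupby(self.group_col).apply(lambda df: np.nanmean(df[self.value_col].values*df[self.weight_col].values)).reset_index().rename(columns={0:self.output_col})
        else:
            # Plain per-group mean of value_col, ignoring NaNs
            self.gb = X.groupby(self.group_col).apply(lambda df: np.nanmean(df[self.value_col])).reset_index().rename(columns={0:self.output_col})

        return self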
    
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.group_col, list)
        assert isinstance(self.value_col, str)
        if self.weight_col: assert isinstance(self.weight_col, str)
        assert isinstance(self.output_col, str)
        
                                                  
        return pd.merge(X, self.gb, on=self.group_col, how='left')
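
When `weight_col` is set, the `fit` method above takes the mean of `value * weight` within each group, which is not normalized by the total weight. If a weighted average in the usual sense, sum(value × weight) / sum(weight), is what is wanted, a minimal sketch could look like the following. The function name `weighted_group_mean` and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def weighted_group_mean(df, group_col, value_col, weight_col, output_col):
    """Per-group sum(value * weight) / sum(weight), using rows with weight > 0."""
    rows = df.loc[df[weight_col] > 0, group_col + [value_col, weight_col]].dropna()
    weighted = rows[value_col] * rows[weight_col]
    keys = [rows[c] for c in group_col]
    numerator = weighted.groupby(keys).sum()
    denominator = rows[weight_col].groupby(keys).sum()
    return (numerator / denominator).rename(output_col).reset_index()

toy = pd.DataFrame({
    'shop_id':        [0, 0, 1],
    'item_price':     [100.0, 200.0, 50.0],
    'item_cnt_month': [1.0, 3.0, 2.0],
})
result = weighted_group_mean(toy, ['shop_id'], 'item_price', 'item_cnt_month', 'avg_shop_price')
print(result)
```

For shop 0 the normalized version gives (100·1 + 200·3) / 4 = 175, whereas the mean of the products is (100 + 600) / 2 = 350. Which of the two is the better feature is an empirical question.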

In [15]:
class Groupby_Sum_Featurizer(TransformerMixin):
    '''
        Per group, compute the sum of value_col, or the sum of
        value_col * weight_col when weight_col is given
    '''
    def __init__(self, group_col, value_col, output_col, weight_col=None):
        self.group_col = group_col
        self.value_col = value_col
        self.weight_col = weight_col
        self.output_col = output_col
        self.gb = None
        
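    def fit(self, X, y=None):
        # `y` is unused; it is accepted so the signature matches scikit-learn's fit(X, y) convention
        if self.weight_col:
            # Per-group sum of value * weight (e.g. price * count = revenue), ignoring NaNs
            self.gb = X.groupby(self.group_col).apply(lambda df: np.nansum(df[self.value_col].values*df[self.weight_col].values)).reset_index().rename(columns={0:self.output_col})
        else:
            # Plain per-group sum of value_col, ignoring NaNs
            self.gb = X.groupby(self.group_col).apply(lambda df: np.nansum(df[self.value_col])).reset_index().rename(columns={0:self.output_col})

        return self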
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.group_col, list)
        assert isinstance(self.value_col, str)
        if self.weight_col: assert isinstance(self.weight_col, str)
        assert isinstance(self.output_col, str)
             
        return pd.merge(X, self.gb, on=self.group_col, how='left')

#### 2.1 Mean-encodings

In [16]:
from itertools import product

# Build the full (shop, item, month) grid: for each month, pair every shop that
# sold anything that month with every item that was sold that month
index_cols = ['shop_id', 'item_id', 'date_block_num']
grid = []
for block_num in train_df['date_block_num'].unique():
    cur_shops = train_df.loc[train_df['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = train_df.loc[train_df['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))
# Stack the per-month blocks into a single dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

In [17]:
train_df = pd.merge(grid, train_df, on=index_cols, how='left').drop('item_category_id', axis=1)
train_df = pd.merge(train_df, items, on='item_id', how='left') # Rejoin to ensure all items have categories
train_df.describe()

| | shop_id | item_id | date_block_num | item_cnt_month | item_price | item_category_id |
|---|---|---|---|---|---|---|
| count | 10913800.0 | 10913800.0 | 10913800.0 | 1609122.0 | 1609122.0 | 10913800.0 |
| mean | 31.1872 | 11309.29 | 14.97336 | 2.265233 | 790.1053 | 44.91718 |
| std | 17.34959 | 6209.982 | 9.495635 | 8.392586 | 1532.409 | 15.10617 |
| min | 0.0 | 0.0 | 0.0 | -22.0 | 0.09 | 0.0 |
| 25% | 16.0 | 5976.0 | 7.0 | 1.0 | 199.0 | 37.0 |
| 50% | 30.0 | 11391.0 | 14.0 | 1.0 | 399.0 | 40.0 |
| 75% | 46.0 | 16605.0 | 23.0 | 2.0 | 898.5 | 55.0 |
| max | 59.0 | 22169.0 | 33.0 | 1644.0 | 50999.0 | 83.0 |


In [18]:
# fill na with 0 for "item_cnt_month"
train_df['item_cnt_month'] = train_df['item_cnt_month'].fillna(0)
train_df.describe()

| | shop_id | item_id | date_block_num | item_cnt_month | item_price | item_category_id |
|---|---|---|---|---|---|---|
| count | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | 1609122.0 | 10913800.0 |
| mean | 31.1872 | 11309.29 | 14.97336 | 0.333984 | 790.1053 | 44.91718 |
| std | 17.34959 | 6209.982 | 9.495635 | 3.315641 | 1532.409 | 15.10617 |
| min | 0.0 | 0.0 | 0.0 | -22.0 | 0.09 | 0.0 |
| 25% | 16.0 | 5976.0 | 7.0 | 0.0 | 199.0 | 37.0 |
| 50% | 30.0 | 11391.0 | 14.0 | 0.0 | 399.0 | 40.0 |
| 75% | 46.0 | 16605.0 | 23.0 | 0.0 | 898.5 | 55.0 |
| max | 59.0 | 22169.0 | 33.0 | 1644.0 | 50999.0 | 83.0 |
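
The grid construction and zero fill above can be followed on a toy example. The sketch below uses three made-up sales rows (an assumption for illustration) and shows how shop/item pairs with no recorded sale in a month become explicit rows with `item_cnt_month` equal to 0.

```python
from itertools import product
import numpy as np
import pandas as pd

# Toy monthly sales: only (shop, item) pairs with at least one sale appear
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 1],
    'shop_id':        [1, 2, 1],
    'item_id':        [10, 11, 10],
    'item_cnt_month': [3.0, 1.0, 2.0],
})

index_cols = ['shop_id', 'item_id', 'date_block_num']
blocks = []
for block_num, month in toy_sales.groupby('date_block_num'):
    # Every shop active this month crossed with every item sold this month
    combos = product(month['shop_id'].unique(), month['item_id'].unique(), [block_num])
    blocks.append(np.array(list(combos), dtype='int32'))
toy_grid = pd.DataFrame(np.vstack(blocks), columns=index_cols)

# Left-join the observed sales onto the grid; unmatched pairs had no sales
toy_full = toy_grid.merge(toy_sales, on=index_cols, how='left')
toy_full['item_cnt_month'] = toy_full['item_cnt_month'].fillna(0)
print(toy_full)
```

Month 0 has two shops and two items, so it contributes four rows, two of which are zero-sale rows; month 1 contributes a single row.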


In [19]:
# The all-time aggregates (avg_all_*, all_*) are not interacted with date_block_num because the lagged values of the per-month aggregates capture the temporal effect

mean_encoding_pipeline = Pipeline(
[
    # item_count related features
    ('avg_all_shop_sales',             Groupby_Avg_Featurizer(['shop_id'], value_col='item_cnt_month',output_col='avg_all_shop_sales')),
    ('avg_all_item_sales',             Groupby_Avg_Featurizer(['item_id'], value_col='item_cnt_month',output_col='avg_all_item_sales')),
    ('avg_all_item_cat_sales',         Groupby_Avg_Featurizer(['item_category_id'], value_col='item_cnt_month',output_col='avg_all_item_cat_sales')),
    
    ('avg_shop_sales',          Groupby_Avg_Featurizer(['shop_id','date_block_num'], value_col='item_cnt_month',output_col='avg_shop_sales')),
    ('avg_item_sales',          Groupby_Avg_Featurizer(['item_id','date_block_num'], value_col='item_cnt_month',output_col='avg_item_sales')),
    ('avg_item_cat_sales',      Groupby_Avg_Featurizer(['item_category_id','date_block_num'], value_col='item_cnt_month',output_col='avg_item_cat_sales')),
    
    # price-related features
    ('avg_all_shop_item_price',     Groupby_Avg_Featurizer(['shop_id','item_id'], value_col='item_price',output_col='avg_all_shop_item_price', weight_col='item_cnt_month')),
    ('avg_all_shop_price',             Groupby_Avg_Featurizer(['shop_id'], value_col='item_price',output_col='avg_all_shop_price', weight_col='item_cnt_month')),
    ('avg_all_item_price',             Groupby_Avg_Featurizer(['item_id'], value_col='item_price',output_col='avg_all_item_price', weight_col='item_cnt_month')),
    ('avg_all_item_cat_price',         Groupby_Avg_Featurizer(['item_category_id'], value_col='item_price',output_col='avg_all_item_cat_price', weight_col='item_cnt_month')),
    
    ('avg_shop_price',             Groupby_Avg_Featurizer(['shop_id','date_block_num'], value_col='item_price',output_col='avg_shop_price', weight_col='item_cnt_month')),
    ('avg_item_price',             Groupby_Avg_Featurizer(['item_id','date_block_num'], value_col='item_price',output_col='avg_item_price', weight_col='item_cnt_month')),
    ('avg_item_cat_price',         Groupby_Avg_Featurizer(['item_category_id','date_block_num'], value_col='item_price',output_col='avg_item_cat_price', weight_col='item_cnt_month')),
    
    
    # revenue related features
    ('all_shop_revenue',        Groupby_Sum_Featurizer(['shop_id'],value_col='item_price',output_col='all_shop_revenue', weight_col='item_cnt_month')),
    ('all_item_revenue',        Groupby_Sum_Featurizer(['item_id'],value_col='item_price',output_col='all_item_revenue', weight_col='item_cnt_month')),
    ('all_item_cat_revenue',    Groupby_Sum_Featurizer(['item_category_id'],value_col='item_price',output_col='all_item_cat_revenue', weight_col='item_cnt_month')),
    
    ('shop_revenue',        Groupby_Sum_Featurizer(['shop_id','date_block_num'],value_col='item_price',output_col='shop_revenue', weight_col='item_cnt_month')),
    ('item_revenue',        Groupby_Sum_Featurizer(['item_id','date_block_num'],value_col='item_price',output_col='item_revenue', weight_col='item_cnt_month')),
    ('item_cat_revenue',    Groupby_Sum_Featurizer(['item_category_id','date_block_num'],value_col='item_price',output_col='item_cat_revenue', weight_col='item_cnt_month'))
    
]
)
    

In [20]:
train_df = mean_encoding_pipeline.fit_transform(train_df)
train_df.describe()

| | shop_id | item_id | date_block_num | item_cnt_month | item_price | item_category_id | avg_all_shop_sales | avg_all_item_sales | avg_all_item_cat_sales | avg_shop_sales | ... | avg_all_item_cat_price | avg_shop_price | avg_item_price | avg_item_cat_price | all_shop_revenue | all_item_revenue | all_item_cat_revenue | shop_revenue | item_revenue | item_cat_revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | 1609122.0 | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | ... | 10913800.0 | 10889620.0 | 10901760.0 | 10913760.0 | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 | 10913800.0 |
| mean | 31.1872 | 11309.29 | 14.97336 | 0.333984 | 790.1053 | 44.91718 | 0.333984 | 0.333984 | 0.333984 | 0.333984 | ... | 1648.668 | 2083.991 | 1250.592 | 1628.835 | 69293850.0 | 273423.4 | 105655500.0 | 2168955.0 | 14664.61 | 3449731.0 |
| std | 17.34959 | 6209.982 | 9.495635 | 3.315641 | 1532.409 | 15.10617 | 0.2453923 | 1.643121 | 1.311867 | 0.2644735 | ... | 4672.441 | 1238.383 | 12773.34 | 7675.508 | 48169350.0 | 2492208.0 | 92119580.0 | 1780138.0 | 188896.4 | 3692082.0 |
| min | 0.0 | 0.0 | 0.0 | -22.0 | 0.09 | 0.0 | 0.06096435 | -0.06043956 | 0.02 | -0.0001966568 | ... | 29.0 | 77.0 | 0.1 | 13.0 | 376933.9 | -28589.0 | 58.0 | -7990.0 | -50976.0 | -1500.0 |
| 25% | 16.0 | 5976.0 | 7.0 | 0.0 | 199.0 | 37.0 | 0.1977504 | 0.06092715 | 0.1773911 | 0.1837063 | ... | 506.3217 | 1423.729 | 199.0 | 454.6003 | 44128220.0 | 11626.8 | 28731850.0 | 1122306.0 | 498.0 | 849725.5 |
| 50% | 30.0 | 11391.0 | 14.0 | 0.0 | 399.0 | 40.0 | 0.2799612 | 0.1266491 | 0.2241586 | 0.2560714 | ... | 675.4003 | 1851.641 | 380.3387 | 607.5553 | 60593450.0 | 37687.15 | 90525500.0 | 1619210.0 | 1495.0 | 2717443.0 |
| 75% | 46.0 | 16605.0 | 23.0 | 0.0 | 898.5 | 55.0 | 0.3259274 | 0.2905983 | 0.2540509 | 0.357132 | ... | 1921.485 | 2433.278 | 974.5714 | 1627.997 | 74939470.0 | 121887.7 | 169960400.0 | 2513866.0 | 4784.0 | 4308572.0 |
| max | 59.0 | 22169.0 | 33.0 | 1644.0 | 50999.0 | 83.0 | 1.328612 | 128.8074 | 93.499 | 2.211961 | ... | 164598.7 | 17780.82 | 2482466.0 | 1165798.0 | 235595600.0 | 219150600.0 | 413319400.0 | 15717830.0 | 46096180.0 | 46096180.0 |


In [21]:
train_df.shape[0] - train_df.count()   # count the nulls in each column

shop_id                          0
item_id                          0
date_block_num                   0
item_cnt_month                   0
item_price                 9304682
item_name                        0
item_category_id                 0
avg_all_shop_sales               0
avg_all_item_sales               0
avg_all_item_cat_sales           0
avg_shop_sales                   0
avg_item_sales                   0
avg_item_cat_sales               0
avg_all_shop_item_price    4628720
avg_all_shop_price               0
avg_all_item_price             231
avg_all_item_cat_price           0
avg_shop_price               24182
avg_item_price               12047
avg_item_cat_price              44
all_shop_revenue                 0
all_item_revenue                 0
all_item_cat_revenue             0
shop_revenue                     0
item_revenue                     0
item_cat_revenue                 0
dtype: int64

In [22]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10913804 entries, 0 to 10913803
Data columns (total 26 columns):
shop_id                    int32
item_id                    int32
date_block_num             int32
item_cnt_month             float32
item_price                 float32
item_name                  object
item_category_id           int64
avg_all_shop_sales         float64
avg_all_item_sales         float64
avg_all_item_cat_sales     float64
avg_shop_sales             float64
avg_item_sales             float64
avg_item_cat_sales         float64
avg_all_shop_item_price    float64
avg_all_shop_price         float64
avg_all_item_price         float64
avg_all_item_cat_price     float64
avg_shop_price             float64
avg_item_price             float64
avg_item_cat_price         float64
all_shop_revenue           float64
all_item_revenue           float64
all_item_cat_revenue       float64
shop_revenue               float64
item_revenue               float64
item_cat_revenue   

In [23]:
test_df = mean_encoding_pipeline.transform(test_df)
test_df.head()

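| | ID | shop_id | item_id | item_category_id | date_block_num | avg_all_shop_sales | avg_all_item_sales | avg_all_item_cat_sales | avg_shop_sales | avg_item_sales | ... | avg_all_item_cat_price | avg_shop_price | avg_item_price | avg_item_cat_price | all_shop_revenue | all_item_revenue | all_item_cat_revenue | shop_revenue | item_revenue | item_cat_revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 5037 | 19 | 34 | 0.189383 | 1.950845 | 0.757562 | NaN | NaN | ... | 4309.483887 | NaN | NaN | NaN | 38229024.0 | 2424548.0 | 413319424.0 | NaN | NaN | NaN |
| 1 | 1 | 5 | 5320 | 55 | 34 | 0.189383 | NaN | 0.224159 | NaN | NaN | ... | 403.510742 | NaN | NaN | NaN | 38229024.0 | NaN | 100022616.0 | NaN | NaN | NaN |
| 2 | 2 | 5 | 5233 | 19 | 34 | 0.189383 | 1.656863 | 0.757562 | NaN | NaN | ... | 4309.483887 | NaN | NaN | NaN | 38229024.0 | 400605.1 | 413319424.0 | NaN | NaN | NaN |
| 3 | 3 | 5 | 5232 | 23 | 34 | 0.189383 | 1.093023 | 0.665891 | NaN | NaN | ... | 3641.348633 | NaN | NaN | NaN | 38229024.0 | 108359.2 | 260314752.0 | NaN | NaN | NaN |
| 4 | 4 | 5 | 5268 | 20 | 34 | 0.189383 | NaN | 1.910721 | NaN | NaN | ... | 13756.227539 | NaN | NaN | NaN | 38229024.0 | NaN | 374149696.0 | NaN | NaN | NaN |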


In [24]:
test_df.shape[0] - test_df.count()   # count the nulls in each column

ID                              0
shop_id                         0
item_id                         0
item_category_id                0
date_block_num                  0
avg_all_shop_sales              0
avg_all_item_sales          15246
avg_all_item_cat_sales          0
avg_shop_sales             214200
avg_item_sales             214200
avg_item_cat_sales         214200
avg_all_shop_item_price    102838
avg_all_shop_price              0
avg_all_item_price          15246
avg_all_item_cat_price          0
avg_shop_price             214200
avg_item_price             214200
avg_item_cat_price         214200
all_shop_revenue                0
all_item_revenue            15246
all_item_cat_revenue            0
shop_revenue               214200
item_revenue               214200
item_cat_revenue           214200
dtype: int64

In [25]:
train_df = downcast_dtypes(train_df)

#### Imputation
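Some items in the test set never appear in the training set, so the features derived from training data are missing for them. Some price features are also missing in the training set itself. These gaps are imputed from the next coarser aggregate (for example, an item's average price falls back to its category's average price).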

In [26]:
'''
Missing value in train_df

avg_all_shop_item_price    4628720
avg_all_shop_price               0
avg_all_item_price             231
avg_all_item_cat_price           0
avg_shop_price               24182
avg_item_price               12047
avg_item_cat_price              44


Fill "avg_item_cat_price" with "avg_all_item_cat_price". 
Fill "avg_all_item_price" with "avg_all_item_cat_price"
Fill "avg_item_price" with "avg_all_item_price"
Fill "avg_shop_price" with "avg_all_shop_price"
Fill "avg_all_shop_item_price" with "avg_all_item_price"


'''

train_df["avg_item_cat_price"] = train_df["avg_item_cat_price"].fillna(train_df["avg_all_item_cat_price"]) 
train_df["avg_all_item_price"] = train_df["avg_all_item_price"].fillna(train_df["avg_all_item_cat_price"]) 
train_df["avg_item_price"] = train_df["avg_item_price"].fillna(train_df["avg_all_item_price"]) 
train_df["avg_shop_price"] = train_df["avg_shop_price"].fillna(train_df["avg_all_shop_price"]) 
train_df["avg_all_shop_item_price"] = train_df["avg_all_shop_item_price"].fillna(train_df["avg_all_item_price"]) 
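
The fallback chain above can also be written as a small helper that takes the (target, source) pairs in order. This is only a sketch on toy data; the helper name `fill_from_fallbacks` is an assumption and does not appear in the notebook.

```python
import numpy as np
import pandas as pd

def fill_from_fallbacks(df, fallbacks):
    """Fill NaNs in each target column from its fallback column, in order.

    `fallbacks` is a list of (target, source) pairs; order matters because a
    target filled early can serve as the source for a later pair.
    """
    out = df.copy()
    for target, source in fallbacks:
        out[target] = out[target].fillna(out[source])
    return out

toy = pd.DataFrame({
    'avg_all_item_cat_price':  [500.0, 500.0, 300.0],
    'avg_all_item_price':      [450.0, np.nan, 310.0],
    'avg_all_shop_item_price': [np.nan, np.nan, 305.0],
})
filled = fill_from_fallbacks(toy, [
    ('avg_all_item_price', 'avg_all_item_cat_price'),    # item -> category
    ('avg_all_shop_item_price', 'avg_all_item_price'),   # shop-item -> item
])
print(filled)
```

In the second row the item-level price is first filled from the category, and the shop-item price then inherits that filled value, which is why the order of the pairs matters.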


In [29]:
test_df["avg_all_item_sales"] = test_df["avg_all_item_sales"].fillna(test_df["avg_all_item_cat_sales"]) 
test_df["avg_all_item_price"] = test_df["avg_all_item_price"].fillna(test_df["avg_all_item_cat_price"]) 
test_df["all_item_revenue"] = test_df["all_item_revenue"].fillna(test_df["all_item_cat_revenue"]) 
test_df["avg_all_shop_item_price"] = test_df["avg_all_shop_item_price"].fillna(test_df["avg_all_item_price"]) 

In [30]:
test_df.shape[0] - test_df.count()   # count the nulls in each column

ID                              0
shop_id                         0
item_id                         0
item_category_id                0
date_block_num                  0
avg_all_shop_sales              0
avg_all_item_sales              0
avg_all_item_cat_sales          0
avg_shop_sales             214200
avg_item_sales             214200
avg_item_cat_sales         214200
avg_all_shop_item_price         0
avg_all_shop_price              0
avg_all_item_price              0
avg_all_item_cat_price          0
avg_shop_price             214200
avg_item_price             214200
avg_item_cat_price         214200
all_shop_revenue                0
all_item_revenue                0
all_item_cat_revenue            0
shop_revenue               214200
item_revenue               214200
item_cat_revenue           214200
dtype: int64

#### 2.2 Lag features

In [31]:
def add_lag_data(df, data, features, periods, index_col, time_col):
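    '''
        Append lagged copies of `features` to `df`.

        df: the dataframe that receives the lagged columns
        data: the input dataframe to get the lagged values from
        features: list of column names in `data` to lag
        periods: list of lags, in units of `time_col` (months here)
        index_col: the index columns to join on (a list)
        time_col: name of the time column (a string)

        Each new column is named `<feature>_lag_<period>`.
    '''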
    
    assert isinstance(time_col, str)
    assert isinstance(index_col, list)
    
    
    for p in periods:
        data_copy = data.copy()         
        data_copy[time_col] += p    
        data_copy = data_copy[[time_col] +  index_col + features]        
        data_copy = data_copy.drop_duplicates(subset=[time_col] + index_col)
        data_copy.rename({
            feat: feat+"_"+'lag_'+str(p) for feat in features
        }, axis=1, inplace=True)
        df = pd.merge(df, data_copy, on=[time_col] + index_col, how='left')
    return df
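
The lag mechanism in `add_lag_data` is a shift-and-merge: the time key of a copy is moved forward by the lag, so that after a left join each row picks up the value observed that many months earlier. A minimal sketch on a single shop/item pair, with made-up counts:

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [0, 1, 2],
    'shop_id':        [5, 5, 5],
    'item_id':        [7, 7, 7],
    'item_cnt_month': [4.0, 6.0, 1.0],
})

keys = ['date_block_num', 'shop_id', 'item_id']
lag = 1
lag_col = 'item_cnt_month_lag_%d' % lag

# Shift the time key forward: the value of month m is relabeled as month m + lag
shifted = toy[keys + ['item_cnt_month']].copy()
shifted['date_block_num'] += lag
shifted = shifted.rename(columns={'item_cnt_month': lag_col})

# After the merge, each row carries the value observed `lag` months earlier
lagged = toy.merge(shifted, on=keys, how='left')
print(lagged)
```

The first month has no predecessor, so its lag value is NaN; this is where the missing values handled in the following cells come from.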

In [32]:
lag_features_by_shop_item = ['item_cnt_month']
lag_features_by_item = [                    
                        'avg_item_sales',
                        'avg_item_cat_sales',
                        'avg_item_price',
                        'avg_item_cat_price',
                        'item_revenue',
                        'item_cat_revenue' 
                        ]
lag_features_by_shop = [
                        'avg_shop_sales',
                        'avg_shop_price',
                        'shop_revenue'   
                        ]

index_col_by_shop_item = ['shop_id','item_id']
index_col_by_item = ['item_id']
index_col_by_shop = ['shop_id']

lag_periods = [1,2,3,4,6,12]

In [33]:
train_df = add_lag_data(train_df, train_df, lag_features_by_shop_item, lag_periods, index_col_by_shop_item, 'date_block_num')
test_df = add_lag_data(test_df, train_df, lag_features_by_shop_item, lag_periods, index_col_by_shop_item, 'date_block_num')

train_df = add_lag_data(train_df, train_df, lag_features_by_item, lag_periods, index_col_by_item, 'date_block_num')
test_df = add_lag_data(test_df, train_df, lag_features_by_item, lag_periods, index_col_by_item, 'date_block_num')

train_df = add_lag_data(train_df, train_df, lag_features_by_shop, lag_periods, index_col_by_shop, 'date_block_num')
test_df = add_lag_data(test_df, train_df, lag_features_by_shop, lag_periods, index_col_by_shop, 'date_block_num')


In [34]:
train_df = train_df.drop('item_name', axis=1)

In [35]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10913804 entries, 0 to 10913803
Data columns (total 85 columns):
shop_id                      int32
item_id                      int32
date_block_num               int32
item_cnt_month               float32
item_price                   float32
item_category_id             int32
avg_all_shop_sales           float32
avg_all_item_sales           float32
avg_all_item_cat_sales       float32
avg_shop_sales               float32
avg_item_sales               float32
avg_item_cat_sales           float32
avg_all_shop_item_price      float32
avg_all_shop_price           float32
avg_all_item_price           float32
avg_all_item_cat_price       float32
avg_shop_price               float32
avg_item_price               float32
avg_item_cat_price           float32
all_shop_revenue             float32
all_item_revenue             float32
all_item_cat_revenue         float32
shop_revenue                 float32
item_revenue                 float32
item_

In [36]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214200 entries, 0 to 214199
Data columns (total 84 columns):
ID                           214200 non-null int64
shop_id                      214200 non-null int64
item_id                      214200 non-null int64
item_category_id             214200 non-null int64
date_block_num               214200 non-null int64
avg_all_shop_sales           214200 non-null float64
avg_all_item_sales           214200 non-null float64
avg_all_item_cat_sales       214200 non-null float64
avg_shop_sales               0 non-null float64
avg_item_sales               0 non-null float64
avg_item_cat_sales           0 non-null float64
avg_all_shop_item_price      214200 non-null float64
avg_all_shop_price           214200 non-null float64
avg_all_item_price           214200 non-null float64
avg_all_item_cat_price       214200 non-null float64
avg_shop_price               0 non-null float64
avg_item_price               0 non-null float64
avg_item_cat_price     

In [37]:
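# Fill na
# Note: 'item_cnt_month_lag_*' does not contain 'sales', so it falls into the median branch
for df in train_df, test_df:
    for feat in df.columns[4:]:
        if 'sales' in feat:
            df[feat] = df[feat].fillna(0)                    # never been sold
        else:
            df[feat] = df[feat].fillna(df[feat].median())    # median price or revenue; stays NaN if the whole column is NaN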
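
One property of median filling is worth keeping in mind: the median of a column with no observed values is itself NaN, so such a column stays empty after the fill. This is consistent with the `test_df.info()` output, where `avg_shop_price` still shows 0 non-null values. A small sketch with toy values (assumed for illustration):

```python
import numpy as np
import pandas as pd

toy_test = pd.DataFrame({
    'avg_shop_price':     [np.nan, np.nan, np.nan],   # nothing observed for the forecast month
    'item_revenue_lag_1': [120.0, np.nan, 80.0],      # partially observed
})

for col in toy_test.columns:
    toy_test[col] = toy_test[col].fillna(toy_test[col].median())

remaining_nulls = toy_test.isnull().sum()
print(remaining_nulls)
```

In the notebook this does not affect the model, because the per-month columns that remain empty in `test_df` are dropped before training.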

In [38]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10913804 entries, 0 to 10913803
Data columns (total 85 columns):
shop_id                      int32
item_id                      int32
date_block_num               int32
item_cnt_month               float32
item_price                   float32
item_category_id             int32
avg_all_shop_sales           float32
avg_all_item_sales           float32
avg_all_item_cat_sales       float32
avg_shop_sales               float32
avg_item_sales               float32
avg_item_cat_sales           float32
avg_all_shop_item_price      float32
avg_all_shop_price           float32
avg_all_item_price           float32
avg_all_item_cat_price       float32
avg_shop_price               float32
avg_item_price               float32
avg_item_cat_price           float32
all_shop_revenue             float32
all_item_revenue             float32
all_item_cat_revenue         float32
shop_revenue                 float32
item_revenue                 float32
item_

In [39]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214200 entries, 0 to 214199
Data columns (total 84 columns):
ID                           214200 non-null int64
shop_id                      214200 non-null int64
item_id                      214200 non-null int64
item_category_id             214200 non-null int64
date_block_num               214200 non-null int64
avg_all_shop_sales           214200 non-null float64
avg_all_item_sales           214200 non-null float64
avg_all_item_cat_sales       214200 non-null float64
avg_shop_sales               214200 non-null float64
avg_item_sales               214200 non-null float64
avg_item_cat_sales           214200 non-null float64
avg_all_shop_item_price      214200 non-null float64
avg_all_shop_price           214200 non-null float64
avg_all_item_price           214200 non-null float64
avg_all_item_cat_price       214200 non-null float64
avg_shop_price               0 non-null float64
avg_item_price               0 non-null float64
avg_item

In [40]:
columns = {
    'diff_item_shop_and_item': ('avg_all_shop_item_price', 'avg_all_item_price'),
    'diff_item_and_category': ('avg_all_item_price', 'avg_all_item_cat_price')
}
for new_feature, (col1, col2) in columns.items():
    for df in (train_df, test_df):
        df[new_feature] = df[col1] - df[col2]

In [95]:
# Remove the current-month aggregates; only their lagged versions are kept as features
drop_columns = [
    'avg_shop_price', 'avg_item_price', 'avg_item_cat_price',
    'shop_revenue', 'item_revenue', 'item_cat_revenue',
    'avg_shop_sales', 'avg_item_sales', 'avg_item_cat_sales',
]
train_df = train_df.drop(drop_columns, axis=1)

In [96]:
test_df = test_df.drop(drop_columns, axis=1)

In [97]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6425094 entries, 4488710 to 10913803
Data columns (total 78 columns):
shop_id                      int32
item_id                      int32
date_block_num               int32
item_cnt_month               float32
item_price                   float32
item_category_id             int32
avg_all_shop_sales           float32
avg_all_item_sales           float32
avg_all_item_cat_sales       float32
avg_all_shop_item_price      float32
avg_all_shop_price           float32
avg_all_item_price           float32
avg_all_item_cat_price       float32
all_shop_revenue             float32
all_item_revenue             float32
all_item_cat_revenue         float32
item_cnt_month_lag_1         float32
item_cnt_month_lag_2         float32
item_cnt_month_lag_3         float32
item_cnt_month_lag_4         float32
item_cnt_month_lag_6         float32
item_cnt_month_lag_12        float32
avg_item_sales_lag_1         float32
avg_item_cat_sales_lag_1     float32


In [98]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 214200 entries, 0 to 214199
Data columns (total 74 columns):
ID                           214200 non-null int64
avg_all_shop_sales           214200 non-null float64
avg_all_item_sales           214200 non-null float64
avg_all_item_cat_sales       214200 non-null float64
avg_all_shop_item_price      214200 non-null float64
avg_all_shop_price           214200 non-null float64
avg_all_item_price           214200 non-null float64
avg_all_item_cat_price       214200 non-null float64
all_shop_revenue             214200 non-null float64
all_item_revenue             214200 non-null float64
all_item_cat_revenue         214200 non-null float64
item_cnt_month_lag_1         214200 non-null float32
item_cnt_month_lag_2         214200 non-null float32
item_cnt_month_lag_3         214200 non-null float32
item_cnt_month_lag_4         214200 non-null float32
item_cnt_month_lag_6         214200 non-null float32
item_cnt_month_lag_12        214200 non-nul

### 3. Pre-processing features

In [99]:
train_df = train_df.loc[train_df['date_block_num'] >= 12]   # Remove 2013 data (months 0-11), for which the 12-month lag is unavailable

In [100]:
train_df['item_cnt_month'] = train_df['item_cnt_month'].clip(0, 20)
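
Since the training target is clipped to [0, 20], it can make sense to clip predictions to the same range before scoring them. The sketch below is an assumption about how that might be done; the notebook itself only clips the target, and the helper name `clipped_rmse` and the numbers are illustrative.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0, upper=20):
    """RMSE after clipping both target and prediction to [lower, upper]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [0, 2, 35, -1]           # raw monthly counts, including a bulk sale and a net return
y_pred = [0.5, 2.0, 25.0, 0.0]    # hypothetical model output
score = clipped_rmse(y_true, y_pred)
print(score)
```

After clipping, the bulk sale and its prediction both become 20 and the net return becomes 0, so only the first row contributes to the error.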

Use the last training month (month 33, October 2015) as the holdout validation set.

In [101]:
training_set = train_df[train_df['date_block_num']<33]             # months 12-32
validation_set = train_df[train_df['date_block_num']==33]          # month 33 (October 2015)

In [102]:
# The first six columns are ids, the target, and raw columns; everything after them is a feature
features = train_df.columns[6:].tolist()
features

['avg_all_shop_sales',
 'avg_all_item_sales',
 'avg_all_item_cat_sales',
 'avg_all_shop_item_price',
 'avg_all_shop_price',
 'avg_all_item_price',
 'avg_all_item_cat_price',
 'all_shop_revenue',
 'all_item_revenue',
 'all_item_cat_revenue',
 'item_cnt_month_lag_1',
 'item_cnt_month_lag_2',
 'item_cnt_month_lag_3',
 'item_cnt_month_lag_4',
 'item_cnt_month_lag_6',
 'item_cnt_month_lag_12',
 'avg_item_sales_lag_1',
 'avg_item_cat_sales_lag_1',
 'avg_item_price_lag_1',
 'avg_item_cat_price_lag_1',
 'item_revenue_lag_1',
 'item_cat_revenue_lag_1',
 'avg_item_sales_lag_2',
 'avg_item_cat_sales_lag_2',
 'avg_item_price_lag_2',
 'avg_item_cat_price_lag_2',
 'item_revenue_lag_2',
 'item_cat_revenue_lag_2',
 'avg_item_sales_lag_3',
 'avg_item_cat_sales_lag_3',
 'avg_item_price_lag_3',
 'avg_item_cat_price_lag_3',
 'item_revenue_lag_3',
 'item_cat_revenue_lag_3',
 'avg_item_sales_lag_4',
 'avg_item_cat_sales_lag_4',
 'avg_item_price_lag_4',
 'avg_item_cat_price_lag_4',
 'item_revenue_lag_4',
 'i

In [103]:
X_train = training_set[features]
y_train = training_set['item_cnt_month']
X_validation = validation_set[features]
y_validation = validation_set['item_cnt_month']
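# .copy() so that adding the prediction column later does not write into a slice of the original frame
test_df = test_df[['ID'] + features].copy()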
X_test = test_df[features]

### 4. Modeling

In [104]:
X_train_new = X_train.copy()
X_validation_new = X_validation.copy()
X_test_new = X_test.copy()

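Train three XGBoost models that differ only in max_depth (7, 8, 6) and seed. Each model's predictions are appended as a new column to the train, validation and test feature matrices for the ensemble below. Comparing the validation RMSE of the three runs also serves as a coarse hyperparameter check.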

In [105]:
params1 = {
        'eta': 0.08, #best 0.08
        'max_depth': 7,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 3,
        'gamma':1,
        'silent': True
    }

In [106]:
params2 = {
        'eta': 0.08, #best 0.08
        'max_depth': 8,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 4,
        'gamma':1,
        'silent': True
    }

In [107]:
params3 = {
        'eta': 0.08, #best 0.08
        'max_depth': 6,
        'objective': 'reg:linear',
        'eval_metric': 'rmse',
        'seed': 5,
        'gamma':1,
        'silent': True
    }

In [108]:
# Build each DMatrix once and reuse it for all three models.
dtrain = xgb.DMatrix(X_train, y_train)
dvalid = xgb.DMatrix(X_validation, y_validation)
dtest = xgb.DMatrix(X_test)
# The last watchlist entry ('validation') is the one used for early stopping.
watchlist = [(dtrain, 'train'), (dvalid, 'validation')]
for i, params in enumerate([params1, params2, params3]):
    model = xgb.train(params, dtrain, 500, watchlist, maximize=False,
                      verbose_eval=50, early_stopping_rounds=50)
    col = 'xgboost_item_cnt_month_' + str(i)
    # Note: the training-set column holds in-sample predictions.
    X_train_new[col] = model.predict(dtrain, ntree_limit=model.best_ntree_limit)
    X_validation_new[col] = model.predict(dvalid, ntree_limit=model.best_ntree_limit)
    X_test_new[col] = model.predict(dtest, ntree_limit=model.best_ntree_limit)

[0]	train-rmse:1.15578	validation-rmse:1.11099
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[50]	train-rmse:0.715268	validation-rmse:0.729958
[100]	train-rmse:0.682875	validation-rmse:0.708742
[150]	train-rmse:0.662322	validation-rmse:0.700692
[200]	train-rmse:0.648346	validation-rmse:0.696875
[250]	train-rmse:0.635687	validation-rmse:0.69296
[300]	train-rmse:0.626836	validation-rmse:0.692762
[350]	train-rmse:0.618565	validation-rmse:0.690342
[400]	train-rmse:0.611669	validation-rmse:0.688557
[450]	train-rmse:0.605278	validation-rmse:0.687238
[499]	train-rmse:0.600374	validation-rmse:0.686666
[0]	train-rmse:1.15318	validation-rmse:1.10783
Multiple eval metrics have been passed: 'validation-rmse' will be used for early stopping.

Will train until validation-rmse hasn't improved in 50 rounds.
[50]	train-rmse:0.688993	validation-rmse:0.714803
[100]	train-rmse:0.651222	validation-r
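
In the loop above, each model predicts the same rows it was trained on, so the `xgboost_item_cnt_month_*` columns of `X_train_new` are in-sample predictions, while the validation and test columns are not. One common remedy in stacking is to build the training-set meta-features out-of-fold. The sketch below shows an expanding-window variant of that idea, not the notebook's method. The helper names `expanding_window_oof` and `least_squares_fit_predict` are hypothetical, the data is synthetic, and ordinary least squares stands in for XGBoost to keep the example dependency-light.

```python
import numpy as np

def expanding_window_oof(months, X, y, fit_predict, first_month):
    """Predict each month m >= first_month with a model fitted only on
    months strictly before m. Rows before first_month stay NaN."""
    months = np.asarray(months)
    oof = np.full(len(y), np.nan)
    for m in np.unique(months[months >= first_month]):
        past, current = months < m, months == m
        oof[current] = fit_predict(X[past], y[past], X[current])
    return oof

def least_squares_fit_predict(X_tr, y_tr, X_te):
    # Stand-in base learner (assumption): linear least squares with an intercept.
    A = np.c_[np.ones(len(X_tr)), X_tr]
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.c_[np.ones(len(X_te)), X_te] @ coef

# Synthetic data: 6 "months" (12..17) with 50 rows each.
rng = np.random.RandomState(0)
months = np.repeat(np.arange(12, 18), 50)
X = rng.rand(len(months), 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.randn(len(months))

oof = expanding_window_oof(months, X, y, least_squares_fit_predict, first_month=14)
scored = months >= 14
rmse = np.sqrt(np.mean((oof[scored] - y[scored]) ** 2))
print(np.isnan(oof[~scored]).all(), rmse)
```

With this scheme the meta-model would be fitted only on rows that have an out-of-fold prediction (here, months 14 and later), at the cost of training the base model once per month.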

#### KNN Regressors

In [109]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
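# Fit the scaler and the KNN models on a 5% random subsample of the training rows.
X_train_sample, _, y_train_sample, _ = train_test_split(X_train, y_train, train_size=.05, random_state=10)
scaler = MinMaxScaler()   # KNN is distance-based, so bring all features onto a common scale
scaler.fit(X_train_sample)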
for k in (2, 3, 4):
    print("Training model "+str(k))
    neigh = KNeighborsRegressor(n_neighbors=k, n_jobs=4, algorithm='kd_tree')
    neigh.fit(scaler.transform(X_train_sample), y_train_sample)
    print("Using "+str(k)+" to predict")
    X_train_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_train))
    X_validation_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_validation))
    X_test_new[str(k)+'_neighbors'] = neigh.predict(scaler.transform(X_test))



Training model 2
Using 2 to predict
Training model 3
Using 3 to predict
Training model 4
Using 4 to predict
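
Because the scaler above is fitted on the 5% subsample only, values in the full data that lie outside the subsample's range are mapped outside [0, 1]. The sketch below illustrates this with NumPy alone. The data is synthetic, and the manual formula `(x - min) / (max - min)` is what `MinMaxScaler` computes with its default `feature_range=(0, 1)`.

```python
import numpy as np

rng = np.random.RandomState(10)
full = rng.exponential(scale=1.0, size=(10000, 2))            # synthetic "full" data
sample = full[rng.choice(len(full), size=500, replace=False)]  # 5% subsample

# Min-max statistics come from the subsample only, as in the notebook.
lo, hi = sample.min(axis=0), sample.max(axis=0)
scaled_sample = (sample - lo) / (hi - lo)
scaled_full = (full - lo) / (hi - lo)

print(scaled_sample.min(), scaled_sample.max())  # exactly 0.0 and 1.0
print(scaled_full.min(), scaled_full.max())      # can fall below 0 and above 1
```

This does not break KNN, but rows outside the fitted range sit further from all subsample points than the [0, 1] scaling suggests. Fitting the scaler on the full `X_train` would be one way to avoid that.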


#### Ensembling

In [111]:
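from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Meta-model: Ridge regression on the original features plus the
# XGBoost and KNN prediction columns.
model = Ridge(alpha=1, copy_X=True, normalize=True, max_iter=1000)
model.fit(X_train_new, y_train)
# Note: this prints the MSE, not the RMSE shown in the XGBoost logs.
print(mean_squared_error(y_validation, model.predict(X_validation_new)))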

0.477490678189109
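
The value printed above is a mean squared error, while the XGBoost training log reports RMSE. To put the two on the same scale, take the square root:

```python
import math

mse = 0.477490678189109   # value printed by the ensembling cell
rmse = math.sqrt(mse)
print(round(rmse, 3))     # 0.691
```

An RMSE of about 0.691 is in the same range as the validation RMSE of 0.686666 that the first XGBoost model reaches at round 499.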


#### Serialize the model


In [113]:
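# sklearn.externals.joblib was removed in scikit-learn 0.23; use `import joblib` on newer versions.
from sklearn.externals import joblib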
joblib.dump(model, 'ensemble_simple_features.pkl') 

['ensemble_simple_features.pkl']

#### Make the submission

In [112]:
pred = model.predict(X_test_new)
test_df['item_cnt_month'] = pred.clip(0, 20)
test_df[['ID', 'item_cnt_month']].to_csv('stacking_submission.csv', index=False)
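
Before uploading, it can help to sanity-check the frame that is written to disk. The following is a minimal sketch. It assumes the two-column layout the notebook writes (`ID`, `item_cnt_month`) and the [0, 20] clipping range used above. The helper name `check_submission` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(sub.columns) != ["ID", "item_cnt_month"]:
        return ["unexpected columns: %s" % list(sub.columns)]
    problems = []
    if len(sub) != expected_rows:
        problems.append("expected %d rows, got %d" % (expected_rows, len(sub)))
    if sub["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if sub["item_cnt_month"].isnull().any():
        problems.append("missing predictions")
    elif not sub["item_cnt_month"].between(0, 20).all():
        problems.append("predictions outside [0, 20]")
    return problems

# Toy predictions, clipped the same way as in the notebook.
toy_pred = np.array([-0.3, 1.7, 25.0])
toy_sub = pd.DataFrame({"ID": [0, 1, 2], "item_cnt_month": toy_pred.clip(0, 20)},
                       columns=["ID", "item_cnt_month"])
print(check_submission(toy_sub, expected_rows=3))  # []
```

For the real submission, `expected_rows` would be the 214200 test rows reported by `test_df.info()`.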