## Final Project
You are provided with daily historical sales data. The task is to forecast the total number of products sold in every shop for the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.

**File descriptions**
1. sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
2. test.csv - the test set. **You need to forecast the sales (total item count) for these shops and products for November 2015.**
3. sample_submission.csv - a sample submission file in the correct format.
4. items.csv - supplemental information about the items/products.
5. item_categories.csv - supplemental information about the item categories.
6. shops.csv - supplemental information about the shops.

**Data fields**
-  ID - an Id that represents a (Shop, Item) tuple within the test set
-  shop_id - unique identifier of a shop
-  item_id - unique identifier of a product
-  item_category_id - unique identifier of item category
-  item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
-  item_price - current price of an item
-  date - date in format dd.mm.yyyy
-  date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
-  item_name - name of item
-  shop_name - name of shop
-  item_category_name - name of item category
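
Because `item_cnt_day` is recorded per day while the target is a monthly total, the daily rows have to be summed per (month, shop, item) at some point. A minimal sketch on made-up rows follows; the `toy_sales` frame is hypothetical and the `item_cnt_month` name is borrowed from the sample submission for illustration only.

```python
import pandas as pd

# Toy daily records (made-up values) using the column names described above
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [25, 25, 25, 25],
    'item_id':        [2552, 2552, 2554, 2552],
    'item_cnt_day':   [1.0, -1.0, 1.0, 3.0],
})

# Sum the daily counts within each (month, shop, item) group
monthly = (toy_sales
           .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)['item_cnt_day']
           .sum()
           .rename(columns={'item_cnt_day': 'item_cnt_month'}))
print(monthly)
```

Note that a negative daily count offsets a positive one inside the same month, so a monthly total can be zero even though rows exist for that pair.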

In [1]:
import numpy as np
import pandas as pd 
import sklearn
import scipy.sparse 
import lightgbm 

for p in [np, pd, scipy, sklearn, lightgbm]:
    print (p.__name__, p.__version__)

numpy 1.14.3
pandas 0.23.0
scipy 0.19.1
sklearn 0.19.1
lightgbm 2.1.1


In [2]:
from itertools import product
from tqdm import tqdm_notebook
import gc 

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

In [4]:
from sklearn.preprocessing import Imputer, Normalizer, StandardScaler

#### 1. Load the data

In [19]:
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')
shops = pd.read_csv('../readonly/final_project_data/shops.csv')
items = pd.read_csv('../readonly/final_project_data/items.csv')
item_cats = pd.read_csv('../readonly/final_project_data/item_categories.csv')
sample_submission = pd.read_csv('../readonly/final_project_data/sample_submission.csv.gz')
sales_test = pd.read_csv('../readonly/final_project_data/test.csv.gz')

In [20]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              object
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB


In [21]:
sales_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
ID         214200 non-null int64
shop_id    214200 non-null int64
item_id    214200 non-null int64
dtypes: int64(3)
memory usage: 4.9 MB


In [22]:
sample_submission.head()

Unnamed: 0,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5


In [23]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [24]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


In [25]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [26]:
item_cats.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [27]:
sales_test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [28]:
sales = pd.merge(sales, items, on='item_id',how='left')

In [29]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id
0,02.01.2013,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37
1,03.01.2013,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58
2,05.01.2013,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58
3,06.01.2013,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58
4,15.01.2013,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56


In [30]:
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe: 
                
                `float64` type to `float32`
                `int64`   type to `int32`
    '''
    
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    
    return df

In [31]:
sales = downcast_dtypes(sales)
sales = sales.drop(['item_name','date'], axis=1)

In [32]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date_block_num      int32
shop_id             int32
item_id             int32
item_price          float32
item_cnt_day        float32
item_category_id    int32
dtypes: float32(2), int32(4)
memory usage: 89.6 MB
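
Part of the memory drop reported above comes from halving the width of every numeric column. The sketch below reproduces the effect on a small hypothetical frame (`toy`), using `astype` directly rather than the `downcast_dtypes` helper so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit dtypes
toy = pd.DataFrame({
    'shop_id': np.arange(1000, dtype=np.int64),
    'item_price': np.linspace(0.0, 999.0, 1000),
})
before = toy.memory_usage(index=False).sum()

# Same conversion as downcast_dtypes: 64-bit columns become 32-bit
toy_small = toy.astype({'shop_id': np.int32, 'item_price': np.float32})
after = toy_small.memory_usage(index=False).sum()

print(before, after)
```

The trade-off is precision and range: `float32` keeps roughly 7 significant decimal digits, and `int32` cannot hold values above 2,147,483,647, so the downcast is only safe when the data fits within those limits.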


In [33]:
sales_small = sales.loc[sales.shop_id.isin([26, 27, 28]), sales.columns].copy()

In [34]:
sales_small.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 301510 entries, 15036 to 2928660
Data columns (total 6 columns):
date_block_num      301510 non-null int32
shop_id             301510 non-null int32
item_id             301510 non-null int32
item_price          301510 non-null float32
item_cnt_day        301510 non-null float32
item_category_id    301510 non-null int32
dtypes: float32(2), int32(4)
memory usage: 9.2 MB


#### 2. Feature Extraction

In [35]:
class ColumnExtractor(TransformerMixin):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        Xcols = X[self.cols]
        return Xcols

In [36]:
class Groupby_Avg_Featurizer(TransformerMixin):
    def __init__(self, group_col, value_col, output_col, include_negative=False, weight_col=None):
        self.group_col = group_col
        self.value_col = value_col
        self.weight_col = weight_col
        self.output_col = output_col
        self.include_negative=include_negative
        self.gb = None
        
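    def fit(self, X, y=None):
        
        if self.weight_col:
            if not self.include_negative:
                # Keep only rows with positive weights (e.g. drop negative item counts)
                X = X.loc[X[self.weight_col] > 0]
            # Weighted average of value_col within each group
            self.gb = X.groupby(self.group_col).apply(lambda df: np.average(df[self.value_col], weights=df[self.weight_col])).reset_index().rename(columns={0:self.output_col})
        else:
            # Plain average of value_col within each group
            self.gb = X.groupby(self.group_col).apply(lambda df: np.average(df[self.value_col])).reset_index().rename(columns={0:self.output_col})
                      
        return self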
    
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.group_col, list)
        assert isinstance(self.value_col, str)
        if self.weight_col: assert isinstance(self.weight_col, str)
        assert isinstance(self.output_col, str)
        
                                                  
        return pd.merge(X, self.gb, on=self.group_col, how='left')

In [37]:
class Groupby_Std_Featurizer(TransformerMixin):
    def __init__(self, group_col, value_col, output_col, include_negative=False, weight_col=None):
        self.group_col = group_col
        self.value_col = value_col
        self.weight_col = weight_col
        self.output_col = output_col
        self.include_negative=include_negative
        self.gb = None
        
        
    def fit(self, X, y=None):

        def weighted_std(df, values, weights):
            """
            Return the weighted standard deviation of a column.
            values, weights -- column names
            """
            values, weights = df[values].values, df[weights].values
            average = np.average(values, weights=weights)
            variance = np.average((values-average)**2, weights=weights)
            return np.sqrt(variance)

        if self.weight_col:
            if not self.include_negative:
                # Keep only rows with positive weights (e.g. drop negative item counts)
                X = X.loc[X[self.weight_col] > 0]
            self.gb = X.groupby(self.group_col).apply(weighted_std, self.value_col, self.weight_col).reset_index().rename(columns={0:self.output_col})
        else:
            self.gb = X.groupby(self.group_col).apply(lambda df: np.std(df[self.value_col])).reset_index().rename(columns={0:self.output_col})
        return self
  
    def transform(self, X):
               
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.group_col, list)
        assert isinstance(self.value_col, str)
        if self.weight_col: assert isinstance(self.weight_col, str)
        assert isinstance(self.output_col, str)
        
        return pd.merge(X, self.gb, on=self.group_col, how='left')
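
The weighted standard deviation used here can be checked by hand. The sketch below uses made-up prices and counts (both hypothetical) and shows that, for integer weights, the result equals the ordinary population standard deviation of the values repeated according to their weights.

```python
import numpy as np

# Made-up prices and the number of units sold at each price
prices = np.array([100.0, 200.0, 400.0])
counts = np.array([1.0, 1.0, 2.0])

# Weighted mean: sum(w * x) / sum(w)
w_mean = np.average(prices, weights=counts)
# Weighted (population) variance: sum(w * (x - mean)^2) / sum(w)
w_var = np.average((prices - w_mean) ** 2, weights=counts)
w_std = np.sqrt(w_var)

# Same statistic computed on the values repeated according to their weights
expanded = np.repeat(prices, counts.astype(int))
print(w_mean, w_std, np.std(expanded))
```

`np.average` raises `ZeroDivisionError` when the weights of a group sum to zero, which is one reason to filter out non-positive weights before grouping.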

In [38]:
class Groupby_Sum_Featurizer(TransformerMixin):
    '''
        Compute the weighted sum of two columns of a df
    '''
    def __init__(self, group_col, value_col, output_col, weight_col=None):
        self.group_col = group_col
        self.value_col = value_col
        self.weight_col = weight_col
        self.output_col = output_col
        self.gb = None
        
    def fit(self, X, y=None):
        
        if self.weight_col:
            # Weighted sum: dot product of values and weights (e.g. price x count = revenue)
            self.gb = X.groupby(self.group_col).apply(lambda df: np.dot(df[self.value_col].values, df[self.weight_col].values)).reset_index().rename(columns={0:self.output_col})
        else:
            # Plain sum of value_col within each group
            self.gb = X.groupby(self.group_col).apply(lambda df: np.sum(df[self.value_col])).reset_index().rename(columns={0:self.output_col})
        
        return self
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.group_col, list)
        assert isinstance(self.value_col, str)
        if self.weight_col: assert isinstance(self.weight_col, str)
        assert isinstance(self.output_col, str)
             
        return pd.merge(X, self.gb, on=self.group_col, how='left')
    

In [39]:
class Lag_Value_Transformer(TransformerMixin):
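    def __init__(self, value_col, time_col, index_col, period=None):
        '''
        value_col: name of the column to lag
        time_col: an integer-indexed column to indicate timestamp
        index_col: a list of columns together with the time_col to merge on
        period: a list of lags, in units of time_col (defaults to [1])
        
        Assumes one row per (index_col, time_col) key; duplicate keys multiply rows in the merges.
        '''
        self.value_col = value_col
        self.time_col = time_col
        self.index_col = index_col
        self.period = [1] if period is None else period
        self.df = None
        
    def fit(self, X, y=None):
        
        keys = self.index_col + [self.time_col]
        self.df = X.loc[:, keys + [self.value_col]].copy()
        for p in self.period:
            # Shift the time index forward by p so the value from period t lines up with period t + p
            temp_df = X.loc[:, keys + [self.value_col]].copy()
            temp_df[self.time_col] = temp_df[self.time_col] + p
            lag_col_name = '{}_lag_{}'.format(self.value_col, p)
            temp_df = temp_df.rename(columns={self.value_col:lag_col_name})
            self.df = pd.merge(self.df, temp_df, on=keys, how='left').fillna(0)
            del temp_df
            gc.collect()
        # Keep only the keys and the lag columns so that transform does not duplicate value_col
        self.df = self.df.drop([self.value_col], axis=1)
        return self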
    
    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        assert isinstance(self.value_col, str)
        assert isinstance(self.time_col, str)
        assert isinstance(self.index_col, list)
        assert isinstance(self.period, list)
        
        X = pd.merge(X, self.df, on=self.index_col+[self.time_col], how='left')
        return X
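
The lag feature rests on one trick: add the lag to the time index, then merge back on the original keys. A minimal sketch on a hypothetical monthly frame (the `monthly` data and the `item_cnt` column name are made up for illustration):

```python
import pandas as pd

# Toy monthly sales, one row per (shop_id, item_id, date_block_num)
monthly = pd.DataFrame({
    'shop_id':        [26, 26, 26],
    'item_id':        [10, 10, 10],
    'date_block_num': [0, 1, 3],
    'item_cnt':       [5.0, 7.0, 2.0],
})
keys = ['shop_id', 'item_id', 'date_block_num']

# Move every row one month forward: the value observed in month t is now keyed by month t + 1
shifted = monthly.copy()
shifted['date_block_num'] += 1
shifted = shifted.rename(columns={'item_cnt': 'item_cnt_lag_1'})

# A left merge on the keys attaches the previous month's value; months with no predecessor get 0
lagged = pd.merge(monthly, shifted, on=keys, how='left').fillna(0)
print(lagged)
```

Month 3 receives a lag of 0 because month 2 is absent from the data. The merge only behaves this way when each key appears once; on daily rows, where a (shop, item, month) key repeats, every match multiplies the row count.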

In [40]:
class DFImputer(TransformerMixin):
    # Imputer but for pandas DataFrames

    def __init__(self, strategy='mean'):
        self.strategy = strategy
        self.imp = None
        self.statistics_ = None

    def fit(self, X, y=None):
        self.imp = Imputer(strategy=self.strategy)
        self.imp.fit(X)
        self.statistics_ = pd.Series(self.imp.statistics_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Ximp = self.imp.transform(X)
        Xfilled = pd.DataFrame(Ximp, index=X.index, columns=X.columns)
        return Xfilled

In [41]:
class DFStandardScaler(TransformerMixin):
    # StandardScaler but for pandas DataFrames

    def __init__(self):
        self.ss = None
        self.mean_ = None
        self.scale_ = None

    def fit(self, X, y=None):
        self.ss = StandardScaler()
        self.ss.fit(X)
        self.mean_ = pd.Series(self.ss.mean_, index=X.columns)
        self.scale_ = pd.Series(self.ss.scale_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xss = self.ss.transform(X)
        Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
        return Xscaled

In [42]:
from functools import reduce

class DFFeatureUnion(TransformerMixin):
    # FeatureUnion but for pandas DataFrames

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        for (name, t) in self.transformer_list:
            t.fit(X, y)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xts = [t.transform(X) for _, t in self.transformer_list]
        Xunion = reduce(lambda X1, X2: pd.merge(X1, X2, left_index=True, right_index=True), Xts)
        return Xunion
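
The union step folds a list of DataFrames into one by merging them pairwise on the row index. A small sketch with three hypothetical feature blocks (`a`, `b`, `c`); note that in Python 3 `reduce` has to be imported from `functools`.

```python
from functools import reduce

import pandas as pd

# Three toy feature blocks that share the same row index
a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f2': [10.0, 20.0, 30.0]})
c = pd.DataFrame({'f3': ['x', 'y', 'z']})

# reduce applies the merge pairwise: merge(merge(a, b), c)
union = reduce(
    lambda left, right: pd.merge(left, right, left_index=True, right_index=True),
    [a, b, c],
)
print(union.columns.tolist())
```

Because `pd.merge` defaults to an inner join, rows whose index is missing from any block are dropped, so each transformer in the union should preserve the input index.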

In [43]:
class ClipTransformer(TransformerMixin):

    def __init__(self, a_min, a_max, col):
        self.a_min = a_min
        self.a_max = a_max
        self.col = col

    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        X.loc[:,self.col] = np.clip(X[self.col].values, self.a_min, self.a_max)
        return X

In [44]:
class ZeroFillTransformer(TransformerMixin):   
    
    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xz = X.fillna(value=0)
        return Xz

In [45]:
feature_extraction_pipeline = Pipeline(
[
    
    ('clip_sales',                 ClipTransformer(0, 20, 'item_cnt_day')),
    
#     ('avg_shop_sales',             Groupby_Avg_Featurizer(['shop_id'], value_col='item_cnt_day',output_col='avg_shop_sales')),
#     ('avg_monthly_shop_sales',     Groupby_Avg_Featurizer(['shop_id', 'date_block_num'], value_col='item_cnt_day',output_col='avg_monthly_shop_sales')),
#     ('avg_item_sales',             Groupby_Avg_Featurizer(['item_id'], value_col='item_cnt_day',output_col='avg_item_sales')),
#     ('avg_monthly_item_sales',     Groupby_Avg_Featurizer(['item_id', 'date_block_num'], value_col='item_cnt_day',output_col='avg_monthly_item_sales')),
#     ('avg_item_cat_sales',         Groupby_Avg_Featurizer(['item_category_id'], value_col='item_cnt_day',output_col='avg_item_sales')),
#     ('avg_monthly_item_cat_sales', Groupby_Avg_Featurizer(['item_category_id', 'date_block_num'], value_col='item_cnt_day',output_col='avg_monthly_item_sales')),

    
#     ('std_shop_sales',             Groupby_Std_Featurizer(['shop_id'], value_col='item_cnt_day',output_col='std_shop_sales')),
#     ('std_monthly_shop_sales',     Groupby_Std_Featurizer(['shop_id', 'date_block_num'], value_col='item_cnt_day',output_col='std_monthly_shop_sales')),
#     ('std_item_sales',             Groupby_Std_Featurizer(['item_id'], value_col='item_cnt_day',output_col='std_item_sales')),
#     ('std_monthly_item_sales',     Groupby_Std_Featurizer(['item_id', 'date_block_num'], value_col='item_cnt_day',output_col='std_monthly_item_sales')),
#     ('std_item_cat_sales',         Groupby_Std_Featurizer(['item_category_id'], value_col='item_cnt_day',output_col='std_item_sales')),
#     ('std_monthly_item_cat_sales', Groupby_Std_Featurizer(['item_category_id', 'date_block_num'], value_col='item_cnt_day',output_col='std_monthly_item_sales')),

#     ('avg_shop_price',             Groupby_Avg_Featurizer(['shop_id'], value_col='item_price',output_col='avg_shop_price', weight_col='item_cnt_day')),
#     ('avg_item_price',             Groupby_Avg_Featurizer(['item_id'], value_col='item_price',output_col='avg_item_price', weight_col='item_cnt_day')),
#     ('avg_item_cat_price',         Groupby_Avg_Featurizer(['item_category_id'], value_col='item_price',output_col='avg_item_cat_price', weight_col='item_cnt_day')),

    
#     ('std_shop_price',             Groupby_Std_Featurizer(['shop_id'], value_col='item_price',output_col='std_shop_price', weight_col='item_cnt_day')),
#     ('std_item_price',             Groupby_Std_Featurizer(['item_id'], value_col='item_price',output_col='std_item_price', weight_col='item_cnt_day')),
#     ('std_item_cat_price',         Groupby_Std_Featurizer(['item_category_id'], value_col='item_price',output_col='std_item_cat_price', weight_col='item_cnt_day')),

    
#     ('shop_revenue',               Groupby_Sum_Featurizer(['shop_id'],value_col='item_price',output_col='shop_revenue', weight_col='item_cnt_day')),
#     ('item_revenue',               Groupby_Sum_Featurizer(['item_id'],value_col='item_price',output_col='item_revenue', weight_col='item_cnt_day')),
#     ('item_cat_revenue',           Groupby_Sum_Featurizer(['item_category_id'],value_col='item_price',output_col='item_cat_revenue', weight_col='item_cnt_day')),
    
    ('lag_sales',                 Lag_Value_Transformer(value_col='item_cnt_day',time_col='date_block_num', index_col=['shop_id','item_id'], period=[1,6,12]))
#     ('lag_sales_2',                 Lag_Value_Transformer(value_col='item_cnt_day',time_col='date_block_num', index_col=['shop_id','item_id'], period=2)),
#     ('lag_sales_3',                 Lag_Value_Transformer(value_col='item_cnt_day',time_col='date_block_num', index_col=['shop_id','item_id'], period=3)),
#     ('lag_sales_6',                 Lag_Value_Transformer(value_col='item_cnt_day',time_col='date_block_num', index_col=['shop_id','item_id'], period=6)),
#     ('lag_sales_12',                Lag_Value_Transformer(value_col='item_cnt_day',time_col='date_block_num', index_col=['shop_id','item_id'], period=12)),
    
#     ('lag_shop_revenue_1',          Lag_Value_Transformer(value_col='shop_revenue',time_col='date_block_num', index_col=['shop_id'], period=1)),
#     ('lag_shop_revenue_2',          Lag_Value_Transformer(value_col='shop_revenue',time_col='date_block_num', index_col=['shop_id'], period=2)),
#     ('lag_shop_revenue_3',          Lag_Value_Transformer(value_col='shop_revenue',time_col='date_block_num', index_col=['shop_id'], period=3)),
#     ('lag_shop_revenue_6',          Lag_Value_Transformer(value_col='shop_revenue',time_col='date_block_num', index_col=['shop_id'], period=6)),
#     ('lag_shop_revenue_12',         Lag_Value_Transformer(value_col='shop_revenue',time_col='date_block_num', index_col=['shop_id'], period=12)),
    
#     ('lag_item_revenue_1',          Lag_Value_Transformer(value_col='item_revenue',time_col='date_block_num', index_col=['item_id'], period=1)),
#     ('lag_item_revenue_2',          Lag_Value_Transformer(value_col='item_revenue',time_col='date_block_num', index_col=['item_id'], period=2)),
#     ('lag_item_revenue_3',          Lag_Value_Transformer(value_col='item_revenue',time_col='date_block_num', index_col=['item_id'], period=3)),
#     ('lag_item_revenue_6',          Lag_Value_Transformer(value_col='item_revenue',time_col='date_block_num', index_col=['item_id'], period=6)),
#     ('lag_item_revenue_12',         Lag_Value_Transformer(value_col='item_revenue',time_col='date_block_num', index_col=['item_id'], period=12)),
    
#     ('lag_item_cat_revenue_1',          Lag_Value_Transformer(value_col='item_cat_revenue',time_col='date_block_num', index_col=['item_category_id'], period=1)),
#     ('lag_item_cat_revenue_2',          Lag_Value_Transformer(value_col='item_cat_revenue',time_col='date_block_num', index_col=['item_category_id'], period=2)),
#     ('lag_item_cat_revenue_3',          Lag_Value_Transformer(value_col='item_cat_revenue',time_col='date_block_num', index_col=['item_category_id'], period=3)),
#     ('lag_item_cat_revenue_6',          Lag_Value_Transformer(value_col='item_cat_revenue',time_col='date_block_num', index_col=['item_category_id'], period=6)),
#     ('lag_item_cat_revenue_12',         Lag_Value_Transformer(value_col='item_cat_revenue',time_col='date_block_num', index_col=['item_category_id'], period=12))
]


)

In [None]:
sales_small_featurized = feature_extraction_pipeline.fit_transform(sales_small)
sales_small_featurized.head()

In [234]:
sales_small_featurized.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000380 entries, 0 to 1000379
Data columns (total 9 columns):
date                  1000380 non-null object
date_block_num        1000380 non-null int64
shop_id               1000380 non-null int64
item_id               1000380 non-null int64
item_price            1000380 non-null float64
item_cnt_day          1000380 non-null float64
item_name             1000380 non-null object
item_category_id      1000380 non-null int64
item_cnt_day_lag_1    876841 non-null float64
dtypes: float64(3), int64(4), object(2)
memory usage: 76.3+ MB


In [190]:
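# Assumption: every column of the featurized frame is a candidate feature
ALL_FEATS = sales_small_featurized.columns.tolist()
# Placeholder: set to the list of numeric feature columns before fitting the pipeline below
NUM_FEATS = None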

feature_normalization_pipeline = Pipeline(

    [
        
        ('extract', ColumnExtractor(NUM_FEATS)),
        ('impute', DFImputer(strategy='mean')),
        ('scale', StandardScaler())
        
        
    ]



)

#### Create the shop/item grid.
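This is necessary because the sets of shops and items can differ from one month to the next. The sales data only contains shop/item pairs that had a transaction, so pairs with no sales in a month must be added explicitly with a count of zero.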

In [40]:
# Create "grid" with columns
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

# Turn the grid into a dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

# Total monthly sales for each shop/item/month combination (the column keeps the name item_cnt_day)
gb = sales.groupby(index_cols, as_index=False)['item_cnt_day'].sum()

In [41]:
# Merge with the grid
grid = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)
# Clip the monthly totals to the range [0, 20]
grid['item_cnt_day'] = np.clip(grid['item_cnt_day'].values, 0, 20)

In [42]:
grid.head()

Unnamed: 0,shop_id,item_id,date_block_num,item_cnt_day
0,59,22154,0,1.0
1,59,2552,0,0.0
2,59,2554,0,0.0
3,59,2555,0,0.0
4,59,2564,0,0.0
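
The grid construction is easier to follow on a tiny example. The sketch below uses a hypothetical single-month frame (`toy`) and follows the same steps: take the Cartesian product of the shops and items seen that month, merge in the monthly totals, fill the missing pairs with zero, and clip.

```python
from itertools import product

import pandas as pd

# Toy sales for a single month: shop 1 sold items 10 and 11, shop 2 sold item 10 only
toy = pd.DataFrame({
    'shop_id':        [1, 1, 2],
    'item_id':        [10, 11, 10],
    'date_block_num': [0, 0, 0],
    'item_cnt_day':   [2.0, 30.0, 1.0],
})
index_cols = ['shop_id', 'item_id', 'date_block_num']

# All shop/item pairs for the month, including pairs with no recorded sale
shops_in_month = toy['shop_id'].unique()
items_in_month = toy['item_id'].unique()
toy_grid = pd.DataFrame(list(product(shops_in_month, items_in_month, [0])), columns=index_cols)

# Monthly totals; pairs missing from the sales data become explicit zeros
totals = toy.groupby(index_cols, as_index=False)['item_cnt_day'].sum()
toy_grid = pd.merge(toy_grid, totals, how='left', on=index_cols).fillna(0)
toy_grid['item_cnt_day'] = toy_grid['item_cnt_day'].clip(0, 20)
print(toy_grid)
```

The pair (shop 2, item 11) never appears in `toy` but ends up in the grid with a count of 0, and the total of 30 is clipped to 20.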


#### Create a benchmark submission by using only the previous month's sales
Submit the sales for October 2015 (date_block_num = 33) as the forecast for November 2015. Shop/item pairs with no sales in October get a forecast of 0.

In [46]:
sales_submission_prev_month = pd.merge(sales_test, grid.loc[grid.date_block_num == 33], how='left', on=['shop_id', 'item_id'])
sales_submission_prev_month = sales_submission_prev_month.loc[:,['ID', 'item_cnt_day']].fillna(0).rename(columns={'item_cnt_day':'item_cnt_month'})
sales_submission_prev_month.to_csv('../readonly/sales_submission_prev_month.csv.gz', index=False, compression='gzip')
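
The previous-month benchmark boils down to a left merge from the test pairs onto the last month's totals. A minimal sketch with hypothetical frames (`toy_test`, `last_month`) and made-up values:

```python
import pandas as pd

# Toy test set and toy totals for the last training month (made-up values)
toy_test = pd.DataFrame({'ID': [0, 1, 2], 'shop_id': [5, 5, 6], 'item_id': [100, 101, 100]})
last_month = pd.DataFrame({'shop_id': [5, 6], 'item_id': [100, 100], 'item_cnt_day': [3.0, 1.0]})

# Pairs absent from the last month (here shop 5, item 101) get a forecast of 0
baseline = (pd.merge(toy_test, last_month, how='left', on=['shop_id', 'item_id'])
            .loc[:, ['ID', 'item_cnt_day']]
            .fillna(0)
            .rename(columns={'item_cnt_day': 'item_cnt_month'}))
print(baseline)
```

The left merge keeps one output row per test `ID`, which is what the submission format requires, provided the right-hand frame has at most one row per shop/item pair.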

### Create global features

#### month

In [54]:
sales['date2'] = pd.to_datetime(sales.date,format='%d.%m.%Y')
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,date2
0,02.01.2013,0,59,22154,999.0,1.0,2013-01-02
1,03.01.2013,0,25,2552,899.0,1.0,2013-01-03
2,05.01.2013,0,25,2552,899.0,-1.0,2013-01-05
3,06.01.2013,0,25,2554,1709.05,1.0,2013-01-06
4,15.01.2013,0,25,2555,1099.0,1.0,2013-01-15


In [56]:
sales['month'] = sales.date2.dt.month 
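
As a quick check of the date parsing, the sketch below converts two sample strings with the same explicit format and extracts the calendar month. It also shows that, given the definition of `date_block_num` (January 2013 is 0), the month can be recovered as `date_block_num % 12 + 1` without the `date` column. The variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(['02.01.2013', '15.11.2014'])

# An explicit format avoids the day/month ambiguity of strings like 02.01.2013
parsed = pd.to_datetime(dates, format='%d.%m.%Y')
months = parsed.dt.month

# Consecutive month number counted from January 2013
blocks = (parsed.dt.year - 2013) * 12 + parsed.dt.month - 1

print(months.tolist(), blocks.tolist(), (blocks % 12 + 1).tolist())
```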

### Train/Test Split
Simple split as a benchmark: months 0 to 31 (date_block_num) as train; the remaining months (32 and 33) as test

In [137]:
train_date_num = 31
# Copy the slices so that columns added later do not trigger SettingWithCopyWarning
train = sales.loc[sales.date_block_num<=train_date_num,sales.columns].copy()
test = sales.loc[sales.date_block_num>train_date_num,sales.columns].copy()
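
A time-based split keeps every row up to a cutoff month for training and holds out everything after it, so the model is never evaluated on months that precede its training data. A small sketch with a hypothetical frame (`toy`) and the same cutoff of 31:

```python
import pandas as pd

toy = pd.DataFrame({
    'date_block_num': [30, 31, 32, 33],
    'item_cnt_day':   [1.0, 2.0, 3.0, 4.0],
})
cutoff = 31

# Rows up to and including the cutoff month go to train, later months to the hold-out set
train_part = toy.loc[toy['date_block_num'] <= cutoff].copy()
holdout = toy.loc[toy['date_block_num'] > cutoff].copy()

print(train_part['date_block_num'].tolist(), holdout['date_block_num'].tolist())
```

With a strict `>` comparison the hold-out set contains every month after the cutoff, here both 32 and 33.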

#### average shop monthly sales (mean-encoding of shop_id)

In [138]:
def weighted_avg(df, values, weights):
    """
    Return the weighted average of a column.
    values, weights -- column names
    """
    values, weights = df[values].values, df[weights].values
    return np.average(values, weights=weights)

In [139]:
def weighted_std(df, values, weights):
    """
    Return the weighted standard deviation of a column.
    values, weights -- column names
    """
    values, weights = df[values].values, df[weights].values
    average = np.average(values, weights=weights)
    variance = np.average((values-average)**2, weights=weights)
    return np.sqrt(variance)

In [150]:
train['avg_item_cnt_day'] = train.groupby(['shop_id','date_block_num'])['item_cnt_day'].transform('mean')
gb = train.groupby(['shop_id','date_block_num']).apply(weighted_avg, values='item_price',weights='item_cnt_day').reset_index().rename(columns={0:'avg_shop_price'})
train = pd.merge(train, gb, on=['shop_id','date_block_num'])
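
For reference, the same two per-group features can be computed without `groupby.apply`. This is a swapped-in alternative, not the method used above: the quantity-weighted price is obtained as total revenue divided by total count. The `toy` frame and the `revenue` helper column are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'shop_id':        [1, 1, 2, 2],
    'date_block_num': [0, 0, 0, 0],
    'item_price':     [100.0, 300.0, 50.0, 70.0],
    'item_cnt_day':   [1.0, 3.0, 2.0, 2.0],
})
group_cols = ['shop_id', 'date_block_num']

# transform('mean') returns one value per original row, aligned with toy's index
toy['avg_item_cnt_day'] = toy.groupby(group_cols)['item_cnt_day'].transform('mean')

# Quantity-weighted price per group: sum(price * count) / sum(count)
toy['revenue'] = toy['item_price'] * toy['item_cnt_day']
sums = toy.groupby(group_cols, as_index=False)[['revenue', 'item_cnt_day']].sum()
sums['avg_shop_price'] = sums['revenue'] / sums['item_cnt_day']
toy = pd.merge(toy, sums[group_cols + ['avg_shop_price']], on=group_cols)
print(toy[['shop_id', 'avg_item_cnt_day', 'avg_shop_price']])
```

Summing two columns and dividing is typically much faster than calling a Python function once per group, which matters on a table with millions of rows.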