# Predict Future Sales

This challenge serves as the final project for the "How to win a data science competition" Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

## Evaluation

Submissions are evaluated by root mean squared error (**RMSE**). True target values are clipped into the [0, 20] range.
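
As a rough illustration of the metric, the sketch below computes RMSE after clipping. The function name `clipped_rmse` and the numbers are made up for this example, and clipping the predictions as well as the targets is an assumption rather than something the competition text states.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, low=0.0, high=20.0):
    """RMSE after clipping targets (and, by assumption, predictions) to [low, high]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: the true count 25 is clipped to 20 before scoring
score = clipped_rmse([0, 3, 25], [0.5, 2.0, 18.0])
print(score)
```

Because the targets are clipped, a prediction above 20 can never be closer to the truth than 20 itself.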

## Submission File

For each id in the test set, you must predict a total number of sales. The file should contain a header and have the following format:

```
ID,item_cnt_month
0,0.5
1,0.5
2,0.5
3,0.5
etc.
```

## File descriptions
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

## Data fields
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category

## Getting Started

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns, sklearn

from scipy.stats import probplot, boxcox_normmax, norm, skew, kurtosis as kurt
from scipy.special import boxcox1p
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import KFold, cross_val_score, learning_curve
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.preprocessing import LabelEncoder
from xgboost import plot_importance

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor

import gc,os,sys,time,pickle
from datetime import datetime
from itertools import product

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

sns.set_style("darkgrid")
%matplotlib inline

def plot_features(booster, figsize):    
    fig, ax = plt.subplots(1,1,figsize=figsize)
    return plot_importance(booster=booster, ax=ax)

import warnings
warnings.filterwarnings("ignore")

## Load Data

In [None]:
PATH = '../input/competitive-data-science-predict-future-sales/'

train_data = pd.read_csv(PATH+'sales_train.csv')
test_data = pd.read_csv(PATH+'test.csv').set_index('ID')
item_data = pd.read_csv(PATH+'items.csv')
item_category_data = pd.read_csv(PATH+'item_categories.csv')
shop_data = pd.read_csv(PATH+'shops.csv')

In [None]:
print('='*50)
train_data.info()
print('='*50)
test_data.info()
print('='*50)
item_data.info()
print('='*50)
item_category_data.info()
print('='*50)
shop_data.info()

## PREPROCESS

### outliers

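- item_cnt_day: daily sales counts of 1001 or more are extreme outliers in the box plot; they are removed so they do not distort the rest of the data.
- item_price: only one record is priced above 100k, a clear outlier, so it is removed. There is also one negative price, which is an anomaly and is replaced with a median.
- shop: several shops appear to be duplicates by name but have different ids; they are merged here.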

In [None]:
plt.figure(figsize=(20,5))
plt.xlim(-100, 3000)
sns.boxplot(x=train_data.item_cnt_day)

plt.figure(figsize=(20,5))
plt.xlim(train_data.item_price.min()*1.1, train_data.item_price.max()*1.1)
sns.boxplot(x=train_data.item_price)

In [None]:
train_data[train_data.item_price < 0]

In [None]:
train_data = train_data[train_data.item_cnt_day < 1001]
train_data = train_data[train_data.item_price < 100000]

median = train_data[(train_data.shop_id==32)&(train_data.item_id==2973)&(train_data.date_block_num==4)&
                    (train_data.item_price>0)].item_price.median()
train_data.loc[train_data.item_price < 0, 'item_price'] = median

In [None]:
# Якутск Орджоникидзе, 56
# 0 & 57
train_data.loc[train_data.shop_id == 0, 'shop_id'] = 57
test_data.loc[test_data.shop_id == 0, 'shop_id'] = 57
# Якутск ТЦ "Центральный"
# 1 & 58
train_data.loc[train_data.shop_id == 1, 'shop_id'] = 58
test_data.loc[test_data.shop_id == 1, 'shop_id'] = 58
# Жуковский ул. Чкалова 39м²
# 10 & 11
train_data.loc[train_data.shop_id == 10, 'shop_id'] = 11
test_data.loc[test_data.shop_id == 10, 'shop_id'] = 11

### Shops/Cats/Items

- Every shop name starts with the name of its city.
- Every category name contains both a type and a subtype.
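
The two observations above can be sketched on a toy example. The shop and category names below are hypothetical English stand-ins for the Russian names in the real files, and pandas categorical codes are used here in place of `LabelEncoder` only to keep the sketch dependency-free.

```python
import pandas as pd

# Hypothetical names; the real files contain Russian strings
shops = pd.DataFrame({"shop_name": ["Moscow Mall A", "Moscow Mall B", "Kazan Center"]})
cats = pd.DataFrame({"item_category_name": ["Games - PS4", "Games - XBOX", "Tickets"]})

# The city is the first word of the shop name
shops["city"] = shops["shop_name"].str.split(" ").str[0]
shops["city_code"] = shops["city"].astype("category").cat.codes

# "type - subtype"; fall back to the type when there is no subtype
parts = cats["item_category_name"].str.split("-")
cats["type"] = parts.str[0].str.strip()
cats["subtype"] = parts.map(lambda p: p[1].strip() if len(p) > 1 else p[0].strip())

print(shops[["city", "city_code"]])
print(cats[["type", "subtype"]])
```

Taking the first word only works when the city name is a single token, which is why a two-word city has to be joined into one word before splitting.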

In [None]:
# Extract the city and label-encode it; keep only shop_id and city_code
shop_data.loc[shop_data.shop_name == 'Сергиев Посад ТЦ "7Я"', 'shop_name'] = 'СергиевПосад ТЦ "7Я"'
shop_data['city'] = shop_data['shop_name'].str.split(' ').map(lambda x: x[0])
shop_data.loc[shop_data.city == '!Якутск', 'city'] = 'Якутск'
shop_data['city_code'] = LabelEncoder().fit_transform(shop_data['city'])
shop_data = shop_data[['shop_id','city_code']]

# Split the category name into type and subtype; when there is no subtype, the subtype equals the type
item_category_data['split'] = item_category_data['item_category_name'].str.split('-')
item_category_data['type'] = item_category_data['split'].map(lambda x: x[0].strip())
item_category_data['type_code'] = LabelEncoder().fit_transform(item_category_data['type'])
item_category_data['subtype'] = item_category_data['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
item_category_data['subtype_code'] = LabelEncoder().fit_transform(item_category_data['subtype'])
item_category_data = item_category_data[['item_category_id','type_code', 'subtype_code']]

# Drop the item_name column; it has no analytical value here
item_data.drop(['item_name'], axis=1, inplace=True)

### Monthly Sales

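Each training record holds the daily sales count and price of one item in one shop, whereas the test set asks for monthly sales. The training data therefore has to be aggregated to the month level so that both share the same structure.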

In [None]:
# print('trans day to month data')

# dates,date_block_nums,shop_ids,item_ids,item_price_means,item_price_medians,item_cnt_months = [],[],[],[],[],[],[]

# tmp_data = train_data.groupby(by=['shop_id','item_id','date_block_num'])
# for tmp in tmp_data:
#     dates.append(tmp[1].date.iloc[0][tmp[1].date.iloc[0].find('.')+1:])
#     date_block_nums.append(tmp[1].date_block_num.iloc[0])
#     shop_ids.append(tmp[1].shop_id.iloc[0])
#     item_ids.append(tmp[1].item_id.iloc[0])
#     item_price_means.append(tmp[1].item_price.mean())
#     item_price_medians.append(tmp[1].item_price.median())
#     item_cnt_months.append(tmp[1].item_cnt_day.count())

# train_data = pd.DataFrame({'date':dates,'date_block_num':date_block_nums,'shop_id':shop_ids,'item_id':item_ids,
#                                  'item_price_mean':item_price_means,'item_price_median':item_price_medians,
#                                  'item_cnt_month':item_cnt_months})

# del tmp_data,dates,date_block_nums,shop_ids,item_ids,item_price_means,item_price_medians,item_cnt_months
# gc.collect()

In [None]:
%%time
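# Build the target DataFrame.
# This matrix is structured very differently from train: it is the Cartesian product of shop * item * block_num,
# and at 10M+ rows it is far larger than train. This is a full restructuring of the data, not a simple copy.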
matrix = []
cols = ['date_block_num','shop_id','item_id']
for i in range(34):  # months 0..33
    sales = train_data[train_data.date_block_num==i]  # records of this month
    matrix.append(np.array(list(product([i], sales.shop_id.unique(), sales.item_id.unique())), dtype='int16'))  # every (month, shop, item) combination
    
matrix = pd.DataFrame(np.vstack(matrix), columns=cols)
matrix['date_block_num'] = matrix['date_block_num'].astype(np.int8)
matrix['shop_id'] = matrix['shop_id'].astype(np.int8)
matrix['item_id'] = matrix['item_id'].astype(np.int16)
matrix.sort_values(cols,inplace=True)
matrix.info()
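
To make the restructuring concrete, here is a minimal sketch of the same grid construction on a toy sales log. The ids are made up, and `grid` is a hypothetical name used only in this example.

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy sales log with made-up ids: three records over two months
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id":        [5, 6, 5],
    "item_id":        [10, 11, 10],
})

cols = ["date_block_num", "shop_id", "item_id"]
blocks = []
for month, sales in toy.groupby("date_block_num"):
    # every shop seen this month x every item seen this month
    blocks.append(np.array(list(product([month], sales.shop_id.unique(), sales.item_id.unique()))))

grid = pd.DataFrame(np.vstack(blocks), columns=cols)
print(grid)
```

Month 0 expands to 2 shops x 2 items = 4 rows, including pairs such as (shop 6, item 10) that never appear in the log; those pairs later become explicit zero-sales rows.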

In [None]:
# Revenue feature
train_data['revenue'] = train_data['item_price'] *  train_data['item_cnt_day']

In [None]:
%%time
# Monthly sales count: same idea as before, but agg with the built-in sum is faster. Note clip(0,20), which the competition requires
group = train_data.groupby(['date_block_num','shop_id','item_id']).agg({'item_cnt_day': ['sum']})
group.columns = ['item_cnt_month']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=cols, how='left')
matrix['item_cnt_month'] = (matrix['item_cnt_month']
                                .fillna(0) # fill NaN with 0
                                .clip(0,20) # NB clip target here
                                .astype(np.float16)) # float16 with NaN still float16, int16 will be int64

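### Distribution transforms for continuous features

This step is done here because concatenating the test data later introduces a large number of zeros, which would degrade the transforms.

Process train_data: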

In [None]:
%%time
numerical_features = ['item_price','revenue']
transform_data = train_data
transform_data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

In [None]:
%%time
transform_data[numerical_features].apply(lambda x: kurt(x.dropna())).sort_values(ascending=False)

In [None]:
%%time
# Ideal skewness is 0; a value of 0.5 or more is treated as needing a transform
skewed_features = {feature: skew(transform_data[feature].dropna()-transform_data[feature].min()) for feature in numerical_features if skew(transform_data[feature].dropna()-transform_data[feature].min()) >= .5}
print(skewed_features)

boxcox1p_skews = {}
log1p_skews = {}
exp_skews = {}
power_skews = {}
best_skews = {}
methods = {}

tmp = pd.DataFrame()
for feature in skewed_features.keys():
    tmp[feature] = boxcox1p(transform_data[feature]-transform_data[feature].min(), boxcox_normmax(transform_data[feature]-transform_data[feature].min() + 1))
    boxcox1p_skews[feature] = skew(tmp[feature].dropna())
    tmp[feature] = np.log1p(transform_data[feature]-transform_data[feature].min())
    log1p_skews[feature] = skew(tmp[feature].dropna())
    tmp[feature] = np.exp(transform_data[feature]-transform_data[feature].min())
    exp_skews[feature] = skew(tmp[feature].dropna())
    tmp[feature] = (transform_data[feature]-transform_data[feature].min())**.5
    power_skews[feature] = skew(tmp[feature].dropna())
    best_skews[feature] = min(boxcox1p_skews[feature],log1p_skews[feature],exp_skews[feature],power_skews[feature])
    methods[feature] = 'boxcox' if best_skews[feature]==boxcox1p_skews[feature] else ('log' if best_skews[feature]==log1p_skews[feature] else ('exp' if best_skews[feature]==exp_skews[feature] else ('power' if best_skews[feature]==power_skews[feature] else 'original')))

print(methods)
    
# Materialize the dict views as lists so pandas receives plain sequences
df_skew = pd.DataFrame(index=list(skewed_features.keys()))
df_skew['Skew Original'] = list(skewed_features.values())
df_skew['Skew after boxcox1p'] = list(boxcox1p_skews.values())
df_skew['Skew after log1p'] = list(log1p_skews.values())
df_skew['Skew after exp'] = list(exp_skews.values())
df_skew['Skew after power'] = list(power_skews.values())
df_skew['Skew after compare'] = list(best_skews.values())

fig = plt.figure(figsize=(22, 6))

sns.pointplot(x=df_skew.index, y='Skew Original', data=df_skew, markers=['o'], linestyles=['-'], label='original')
sns.pointplot(x=df_skew.index, y='Skew after boxcox1p', data=df_skew, markers=['x'], linestyles=['--'], color='dodgerblue', label='boxcox1p')
sns.pointplot(x=df_skew.index, y='Skew after log1p', data=df_skew, markers=['s'], linestyles=['--'], color='peru', label='log1p')
sns.pointplot(x=df_skew.index, y='Skew after exp', data=df_skew, markers=['*'], linestyles=['--'], color='darkorchid', label='exp')
sns.pointplot(x=df_skew.index, y='Skew after power', data=df_skew, markers=['+'], linestyles=['--'], color='yellow', label='power0.5')
sns.pointplot(x=df_skew.index, y='Skew after compare', data=df_skew, markers=['v'], linestyles=['--'], color='lawngreen', label='best')

plt.xlabel('Skewed Features', size=20, labelpad=12.5)
plt.ylabel('Skewness', size=20, labelpad=12.5)
plt.tick_params(axis='x', labelsize=11)
plt.tick_params(axis='y', labelsize=15)
plt.xticks(rotation=70)
plt.title('Skewed Features Before and After Several Transformation Methods', size=20)

plt.legend(loc='best', shadow=True)

plt.show()

In [None]:
%%time
for feature in numerical_features:
    # Features below the skew threshold have no entry in methods; leave them untouched
    method = methods.get(feature, 'original')
    # Shift so the minimum is 0 before transforming
    shifted = transform_data[feature]-transform_data[feature].min()
    if method=='boxcox':
        transform_data[feature] = boxcox1p(shifted, boxcox_normmax(shifted + 1))
    elif method=='log':
        transform_data[feature] = np.log1p(shifted)
    elif method=='exp':
        transform_data[feature] = np.exp(shifted)
    elif method=='power':
        transform_data[feature] = shifted**.5
    else:# original
        pass

Process matrix:

In [None]:
%%time
numerical_features = ['item_cnt_month']
transform_data = matrix
transform_data[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

In [None]:
%%time
transform_data[numerical_features].apply(lambda x: kurt(x.dropna())).sort_values(ascending=False)

In [None]:
print('After log1p transform, skew = ', skew(np.log1p(transform_data[numerical_features]).dropna()))

In [None]:
transform_data[numerical_features] = np.log1p(transform_data[numerical_features])

### Test data processing

In [None]:
test_data['date_block_num'] = 34
test_data['date_block_num'] = test_data['date_block_num'].astype(np.int8)
test_data['shop_id'] = test_data['shop_id'].astype(np.int8)
test_data['item_id'] = test_data['item_id'].astype(np.int16)

In [None]:
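# Concatenate train and test; the months are contiguous, i.e. 0..34
# (keys is dropped: it has no effect with ignore_index=True and its length did not match the two frames)
matrix = pd.concat([matrix, test_data], ignore_index=True, sort=False)
matrix.fillna(0, inplace=True) # item_cnt_month for month 34 is filled with 0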

### Merge item/shop/cat

In [None]:
%%time
matrix = pd.merge(matrix, shop_data, on=['shop_id'], how='left')
matrix = pd.merge(matrix, item_data, on=['item_id'], how='left')
matrix = pd.merge(matrix, item_category_data, on=['item_category_id'], how='left')
matrix['city_code'] = matrix['city_code'].astype(np.int8)
matrix['item_category_id'] = matrix['item_category_id'].astype(np.int8)
matrix['type_code'] = matrix['type_code'].astype(np.int8)
matrix['subtype_code'] = matrix['subtype_code'].astype(np.int8)

## FEATURE ENGINEERING

In [None]:
matrix.sample(5)

In [None]:
matrix.info()

In [None]:
def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df

### Target Lags

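Add lagged versions of the target with lags [1,2,3,6,12], i.e. five new columns per row. item_cnt_month_lag_1 is the monthly sales of that item in that shop one month earlier, and likewise for 2, 3, 6 and 12. This produces some NaN values, which is expected: as with any sliding window, the rows at the boundary have no history.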
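
A small sketch of what the lag merge does, on a made-up single (shop, item) series. The helper is restated here so the example is self-contained.

```python
import pandas as pd

def lag_feature(df, lags, col):
    """Shift the month key forward by each lag and merge the old value back in."""
    tmp = df[["date_block_num", "shop_id", "item_id", col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ["date_block_num", "shop_id", "item_id", f"{col}_lag_{i}"]
        shifted["date_block_num"] += i
        df = pd.merge(df, shifted, on=["date_block_num", "shop_id", "item_id"], how="left")
    return df

# Made-up monthly counts for one shop and one item
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [5, 5, 5],
    "item_id": [10, 10, 10],
    "item_cnt_month": [3.0, 0.0, 7.0],
})
lagged = lag_feature(toy, [1, 2], "item_cnt_month")
print(lagged)
```

Month 2 receives the value of month 1 as lag 1 and the value of month 0 as lag 2, while month 0 has no history and stays NaN in both columns.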

In [None]:
%%time
matrix = lag_feature(matrix, [1,2,3,6,12], 'item_cnt_month')

### Mean-encoded features

Average sales per month:

In [None]:
%%time
group = matrix.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num'], how='left')
matrix['date_avg_item_cnt'] = matrix['date_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_avg_item_cnt')
matrix.drop(['date_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per item:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_cnt'] = matrix['date_item_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_item_avg_item_cnt')
matrix.drop(['date_item_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per shop:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_shop_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_avg_item_cnt'] = matrix['date_shop_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1,2,3,6,12], 'date_shop_avg_item_cnt')
matrix.drop(['date_shop_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per category:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_cat_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_category_id'], how='left')
matrix['date_cat_avg_item_cnt'] = matrix['date_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_cat_avg_item_cnt')
matrix.drop(['date_cat_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per shop per category:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'shop_id', 'item_category_id']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_cat_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'item_category_id'], how='left')
matrix['date_shop_cat_avg_item_cnt'] = matrix['date_shop_cat_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_cat_avg_item_cnt')
matrix.drop(['date_shop_cat_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per shop per type:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'shop_id', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_type_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'type_code'], how='left')
matrix['date_shop_type_avg_item_cnt'] = matrix['date_shop_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_type_avg_item_cnt')
matrix.drop(['date_shop_type_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per shop per subtype:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'shop_id', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = ['date_shop_subtype_avg_item_cnt']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'shop_id', 'subtype_code'], how='left')
matrix['date_shop_subtype_avg_item_cnt'] = matrix['date_shop_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_shop_subtype_avg_item_cnt')
matrix.drop(['date_shop_subtype_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per city:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_city_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'city_code'], how='left')
matrix['date_city_avg_item_cnt'] = matrix['date_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_city_avg_item_cnt')
matrix.drop(['date_city_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per city per item:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'item_id', 'city_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_item_city_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'item_id', 'city_code'], how='left')
matrix['date_item_city_avg_item_cnt'] = matrix['date_item_city_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_item_city_avg_item_cnt')
matrix.drop(['date_item_city_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per type:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'type_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_type_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'type_code'], how='left')
matrix['date_type_avg_item_cnt'] = matrix['date_type_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_type_avg_item_cnt')
matrix.drop(['date_type_avg_item_cnt'], axis=1, inplace=True)

Average sales per month per subtype:

In [None]:
%%time
group = matrix.groupby(['date_block_num', 'subtype_code']).agg({'item_cnt_month': ['mean']})
group.columns = [ 'date_subtype_avg_item_cnt' ]
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num', 'subtype_code'], how='left')
matrix['date_subtype_avg_item_cnt'] = matrix['date_subtype_avg_item_cnt'].astype(np.float16)
matrix = lag_feature(matrix, [1], 'date_subtype_avg_item_cnt')
matrix.drop(['date_subtype_avg_item_cnt'], axis=1, inplace=True)
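
The mean-encoding cells all follow one pattern: group, average, merge back. One way to factor that out is sketched below; `add_group_mean` is a hypothetical helper that does not appear in the notebook, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def add_group_mean(df, keys, name, target="item_cnt_month"):
    """Attach the mean of `target` over `keys` as a new column `name` (hypothetical helper)."""
    group = df.groupby(keys)[target].mean().rename(name).reset_index()
    out = pd.merge(df, group, on=keys, how="left")
    out[name] = out[name].astype(np.float16)
    return out

# Made-up data: one item sold in two shops over two months
toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [5, 6, 5, 6],
    "item_id": [10, 10, 10, 10],
    "item_cnt_month": [1.0, 3.0, 0.0, 4.0],
})
toy = add_group_mean(toy, ["date_block_num", "item_id"], "date_item_avg_item_cnt")
print(toy)
```

In the notebook's flow, each such column would then be passed through `lag_feature` and the unlagged column dropped, since the current month's mean is computed from the target itself.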

### Trend features

Price trend over the last six months:

In [None]:
%%time
# Average price of each item over the whole period
group = train_data.groupby(['item_id']).agg({'item_price': ['mean']})
group.columns = ['item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['item_id'], how='left')
matrix['item_avg_item_price'] = matrix['item_avg_item_price'].astype(np.float16)

# Average price of each item in each month
group = train_data.groupby(['date_block_num','item_id']).agg({'item_price': ['mean']})
group.columns = ['date_item_avg_item_price']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','item_id'], how='left')
matrix['date_item_avg_item_price'] = matrix['date_item_avg_item_price'].astype(np.float16)

# Lag the monthly average item price by [1,2,3,4,5,6] months
lags = [1,2,3,4,5,6]
matrix = lag_feature(matrix, lags, 'date_item_avg_item_price')

# Relative change (delta) between the item's average price 1..6 months ago and its overall average price
for i in lags:
    matrix['delta_price_lag_'+str(i)] = \
        (matrix['date_item_avg_item_price_lag_'+str(i)] - matrix['item_avg_item_price']) / matrix['item_avg_item_price']

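# For each row, return the first available delta, or 0 if none exists.
# The loop returns early, so when several deltas are present the most recent one wins:
# if delta_1 has a value it is returned without looking further back.
# delta_price_lag is thus the relative difference between the most recent known monthly price and the overall price.
def select_trend(row):
    for i in lags:
        value = row['delta_price_lag_'+str(i)]
        if pd.notna(value):  # NaN is truthy, so test for missing values explicitly
            return value
    return 0
    
matrix['delta_price_lag'] = matrix.apply(select_trend, axis=1)
matrix['delta_price_lag'] = matrix['delta_price_lag'].astype(np.float16)
matrix['delta_price_lag'] = matrix['delta_price_lag'].fillna(0)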

# Drop the helper columns added above: the per-item average, the per-month per-item average, and their lags
features_to_drop = ['item_avg_item_price', 'date_item_avg_item_price']
for i in lags:
    features_to_drop += ['date_item_avg_item_price_lag_'+str(i)]
    features_to_drop += ['delta_price_lag_'+str(i)]

matrix.drop(features_to_drop, axis=1, inplace=True)
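
Row-wise `apply` is slow on a frame with millions of rows. As a possible alternative, the "most recent non-missing delta" can be picked with a column-wise back-fill. This is a sketch on made-up values, not the notebook's own code, and it treats a delta of exactly 0 as a valid value.

```python
import numpy as np
import pandas as pd

lags = [1, 2, 3]
# Made-up deltas: NaN means no price was observed at that lag
toy = pd.DataFrame({
    "delta_price_lag_1": [np.nan, 0.10, np.nan],
    "delta_price_lag_2": [-0.20, 0.30, np.nan],
    "delta_price_lag_3": [0.05, np.nan, np.nan],
})

lag_cols = [f"delta_price_lag_{i}" for i in lags]
# Back-fill across columns so the first column holds the nearest non-missing lag,
# then fall back to 0 where every lag is missing
toy["delta_price_lag"] = toy[lag_cols].bfill(axis=1).iloc[:, 0].fillna(0)
print(toy["delta_price_lag"].tolist())
```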

Shop revenue trend over the last month:

In [None]:
%%time
# Total revenue of each shop in each month
group = train_data.groupby(['date_block_num','shop_id']).agg({'revenue': ['sum']})
group.columns = ['date_shop_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['date_block_num','shop_id'], how='left')
matrix['date_shop_revenue'] = matrix['date_shop_revenue'].astype(np.float32)

# Average monthly revenue of each shop
group = group.groupby(['shop_id']).agg({'date_shop_revenue': ['mean']})
group.columns = ['shop_avg_revenue']
group.reset_index(inplace=True)

matrix = pd.merge(matrix, group, on=['shop_id'], how='left')
matrix['shop_avg_revenue'] = matrix['shop_avg_revenue'].astype(np.float32)

# Relative change of the monthly revenue against the shop's average revenue
matrix['delta_revenue'] = (matrix['date_shop_revenue'] - matrix['shop_avg_revenue']) / matrix['shop_avg_revenue']
matrix['delta_revenue'] = matrix['delta_revenue'].astype(np.float16)

# Lag the change by one month
matrix = lag_feature(matrix, [1], 'delta_revenue')

matrix.drop(['date_shop_revenue','shop_avg_revenue','delta_revenue'], axis=1, inplace=True)

### Other features

Month and number of days:

In [None]:
# Month of the year (0..11)
matrix['month'] = matrix['date_block_num'] % 12
# Number of days in each month
days = pd.Series([31,28,31,30,31,30,31,31,30,31,30,31])
matrix['days'] = matrix['month'].map(days).astype(np.int8)

Months since the previous sale of each item in each shop:

In [None]:
%%time
cache = {}
# Months since the previous sale of each item in each shop
matrix['item_shop_last_sale'] = -1
matrix['item_shop_last_sale'] = matrix['item_shop_last_sale'].astype(np.int8)
# for idx, row in matrix.iterrows():    
#     key = str(row.item_id)+' '+str(row.shop_id)
#     if key not in cache:
#         if row.item_cnt_month!=0:
#             cache[key] = row.date_block_num
#     else:
#         last_date_block_num = cache[key]
#         matrix.at[idx, 'item_shop_last_sale'] = row.date_block_num - last_date_block_num
#         cache[key] = row.date_block_num    

for row in matrix.itertuples():
    idx = getattr(row,'Index')
    item_id = getattr(row,'item_id')
    shop_id = getattr(row,'shop_id')
    date_block_num = getattr(row,'date_block_num')
    item_cnt_month = getattr(row,'item_cnt_month')
    key = str(item_id)+' '+str(shop_id)
    if key not in cache:
        if item_cnt_month!=0:
            cache[key] = date_block_num
    else:
        last_date_block_num = cache[key]
        matrix.at[idx, 'item_shop_last_sale'] = date_block_num - last_date_block_num
        cache[key] = date_block_num
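
Iterating over every row is slow at this scale. A vectorized variant is sketched below on made-up data. Note that it measures the gap to the last month with a non-zero count, so its values can differ from a loop that refreshes its cache on every row; the column name is reused only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up series for one (shop, item): the only sale happens in month 1
toy = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3],
    "shop_id": [5, 5, 5, 5],
    "item_id": [10, 10, 10, 10],
    "item_cnt_month": [0.0, 2.0, 0.0, 0.0],
})
toy = toy.sort_values(["shop_id", "item_id", "date_block_num"])

# Month number where a sale happened, NaN otherwise
sale_month = toy["date_block_num"].where(toy["item_cnt_month"] != 0)
# Most recent sale month strictly before the current row, per (shop, item)
prev_sale = sale_month.groupby([toy["shop_id"], toy["item_id"]]).transform(lambda s: s.shift().ffill())
# -1 marks rows with no earlier sale
toy["item_shop_last_sale"] = (toy["date_block_num"] - prev_sale).fillna(-1).astype(np.int8)
print(toy)
```

The `shift()` before `ffill()` keeps the current month's own sale out of the feature, so no target information leaks into the row it describes.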

Months since the previous sale of each item:

In [None]:
%%time
cache = {}
matrix['item_last_sale'] = -1
matrix['item_last_sale'] = matrix['item_last_sale'].astype(np.int8)
# for idx, row in matrix.iterrows():    
#     key = row.item_id
#     if key not in cache:
#         if row.item_cnt_month!=0:
#             cache[key] = row.date_block_num
#     else:
#         last_date_block_num = cache[key]
#         if row.date_block_num>last_date_block_num:
#             matrix.at[idx, 'item_last_sale'] = row.date_block_num - last_date_block_num
#             cache[key] = row.date_block_num         

for row in matrix.itertuples():
    idx = getattr(row,'Index')
    key = getattr(row,'item_id')
    date_block_num = getattr(row,'date_block_num')
    item_cnt_month = getattr(row,'item_cnt_month')
    if key not in cache:
        if item_cnt_month!=0:
            cache[key] = date_block_num
    else:
        last_date_block_num = cache[key]
        if date_block_num>last_date_block_num:
            matrix.at[idx, 'item_last_sale'] = date_block_num - last_date_block_num
            cache[key] = date_block_num

1. Months since the first sale of each item in each shop;
2. Months since the first sale of each item;

In [None]:
%%time
matrix['item_shop_first_sale'] = matrix['date_block_num'] - matrix.groupby(['item_id','shop_id'])['date_block_num'].transform('min')
matrix['item_first_sale'] = matrix['date_block_num'] - matrix.groupby('item_id')['date_block_num'].transform('min')

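### Final processing

Drop the first year of data, i.e. months 0 to 11. The longest target lag is 12 months, so for these rows the 12-month target lag is always NaN and they cannot be used for training. Dropping them does not lose information from the original data, because that information is carried by the lag features of later months.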

In [None]:
%%time
matrix = matrix[matrix.date_block_num > 11]

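Fill the lagged monthly sales with 0: the product grid produces many NaN values, and since monthly sales is the target variable, its missing lags are filled with 0.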

In [None]:
%%time
def fill_na(df):
    for col in df.columns:
        # Only lag columns derived from the sales count are filled
        if '_lag_' in col and 'item_cnt' in col and df[col].isnull().any():
            df[col] = df[col].fillna(0)  # assign back instead of chained inplace fillna
    return df

matrix = fill_na(matrix)

Finally, take a look at the data structure:

In [None]:
matrix.columns

In [None]:
matrix.info()

In [None]:
numerical_features = [col for col in matrix.columns if any(key in col for key in ('cnt', 'price', 'revenue', 'sale'))]
matrix[numerical_features].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

In [None]:
matrix[numerical_features].apply(lambda x: kurt(x.dropna())).sort_values(ascending=False)

## Pickle

In [None]:
matrix.to_pickle('data.pkl')
# Keep test_data; it is needed to build the submission file
del matrix, cache, group, train_data, shop_data, item_data, item_category_data
gc.collect();

In [None]:
1/0

## MODELING

In [None]:
data = pd.read_pickle('data.pkl')

### SPLIT DATA

In [None]:
X_train = data[data.date_block_num < 33].drop(['item_cnt_month'], axis=1)
Y_train = data[data.date_block_num < 33]['item_cnt_month']
X_valid = data[data.date_block_num == 33].drop(['item_cnt_month'], axis=1)
Y_valid = data[data.date_block_num == 33]['item_cnt_month']
X_test = data[data.date_block_num == 34].drop(['item_cnt_month'], axis=1)

In [None]:
del data
gc.collect()

### BASELINE MODEL - xgboost

In [None]:
%%time
model = XGBRegressor(
    max_depth=8,
    n_estimators=1000,
    min_child_weight=300, 
    colsample_bytree=0.8, 
    subsample=0.8,
    eta=0.3,
    seed=10086,
    # xgboost >= 1.6 takes these in the constructor; 2.0 removed them from fit()
    eval_metric="rmse",
    early_stopping_rounds=10)

model.fit(
    X_train, 
    Y_train, 
    eval_set=[(X_train, Y_train), (X_valid, Y_valid)], 
    verbose=True)

In [None]:
Y_pred = model.predict(X_valid)
Y_pred = np.expm1(Y_pred).clip(0, 20)
print('RMSE after expm1 = ', np.sqrt(mean_squared_error(np.expm1(Y_valid), Y_pred)))
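
Because the target was transformed with `log1p`, predictions have to be mapped back with `expm1` before they are compared on the competition's scale. A minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 5.0, 20.0])    # made-up monthly counts
pred_log = np.array([0.1, 0.6, 1.9, 3.5])   # made-up model outputs on the log1p scale

# Back to counts, then clip to the range used by the metric
y_pred = np.expm1(pred_log).clip(0, 20)
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# log1p and expm1 are inverses of each other
roundtrip = np.expm1(np.log1p(y_true))
print(y_pred, rmse)
```

The last prediction, `expm1(3.5)`, is roughly 32 and is pulled back to 20 by the clip.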

### Feature importance

In [None]:
plot_features(model, (15,15))

### Correct Factor

In [None]:
Y_valid_expm1 = np.expm1(Y_valid)
Y_pred_correct = Y_pred*(np.mean(Y_valid_expm1)/np.mean(Y_pred))
Y_pred_correct_add = Y_pred+(np.mean(Y_valid_expm1-Y_pred))

print('Mean of pred - valid:', np.mean(Y_pred - Y_valid_expm1))
print('Mean of pred_correct - valid:', np.mean(Y_pred_correct - Y_valid_expm1))
print('Mean of pred_correct_add - valid:', np.mean(Y_pred_correct_add - Y_valid_expm1))
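
Both corrections remove the mean bias on the validation month by construction, which the made-up numbers below illustrate. Since the factor and the offset are estimated on the validation data itself, the resulting validation score is likely optimistic.

```python
import numpy as np

y_valid = np.array([0.0, 1.0, 2.0, 5.0])   # made-up targets
y_pred = np.array([0.2, 0.8, 1.5, 3.5])    # made-up predictions that underestimate on average

# Multiplicative correction: rescale so the means match
pred_mult = y_pred * (y_valid.mean() / y_pred.mean())
# Additive correction: shift by the mean residual
pred_add = y_pred + (y_valid - y_pred).mean()

print(np.mean(pred_mult - y_valid), np.mean(pred_add - y_valid))
```

A zero mean residual does not imply a lower RMSE, so the RMSE comparison is the more informative check.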

In [None]:
fig = plt.figure(figsize=(22, 6))
compare_df = pd.DataFrame({'Y_pred':Y_pred[:500], 
                           'Y_valid':Y_valid_expm1[:500], 
                          'Y_pred_correct':Y_pred_correct[:500], 
                          'Y_pred_correct_add':Y_pred_correct_add[:500]})
sns.lineplot(data=compare_df)
plt.show()

In [None]:
print('RMSE after expm1 with multiplicative correction = ', np.sqrt(sklearn.metrics.mean_squared_error(Y_valid_expm1,Y_pred_correct)))
print('RMSE after expm1 with additive correction = ', np.sqrt(sklearn.metrics.mean_squared_error(Y_valid_expm1,Y_pred_correct_add)))

## OUTPUT

In [None]:
# Apply the additive correction first, then clip, so the submission stays within [0, 20]
Y_test = (np.expm1(model.predict(X_test)) + np.mean(Y_valid_expm1 - Y_pred)).clip(0, 20)

submission = pd.DataFrame({
    "ID": test_data.index, 
    "item_cnt_month": Y_test
})
submission.to_csv('xgb_submission.csv', index=False)
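
Before uploading, it can help to sanity-check the file against the format described in the Submission File section. `check_submission` is a hypothetical helper written for this sketch, and the CSV content is the sample from that section.

```python
import io

import pandas as pd

def check_submission(sub, n_rows):
    """Basic sanity checks on a submission frame (hypothetical helper)."""
    assert list(sub.columns) == ["ID", "item_cnt_month"], "unexpected columns"
    assert len(sub) == n_rows, "row count does not match the test set"
    assert sub["ID"].is_unique, "duplicate IDs"
    assert sub["item_cnt_month"].notna().all(), "missing predictions"
    return True

# In-memory stand-in for a written submission file
sample = pd.read_csv(io.StringIO("ID,item_cnt_month\n0,0.5\n1,0.5\n2,0.5\n"))
ok = check_submission(sample, n_rows=3)
print(ok)
```

With the real file, `n_rows` would be `len(test_data)`.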