The provided dataset contains daily historical sales data for a Russian software company. The goal is to predict the total number of products sold in November 2015 for every shop-item combination in the test set. This notebook aggregates the data by month, builds calendar and lag features, and then makes predictions using the GradientBoostingRegressor from sklearn.

In [None]:
import numpy as np
import pandas as pd

sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
test_ids = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

sales_train = sales_train.drop(['date'], axis=1)
sales_train.head()

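The ID field in the test set identifies a unique shop-item pair. We build a frame with one row for every combination of ID and month (date_block_num) in the training data.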

In [None]:
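n_ids = test_ids['ID'].max() + 1
n_months = sales_train['date_block_num'].max() + 1

# Repeat the full ID list once per month, and each month once per ID,
# so that pairing the two lists yields every (ID, month) combination exactly once.
ids = list(np.arange(n_ids)) * n_months
dates = sorted(list(np.arange(n_months)) * n_ids)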

date_id_dict = {'ID' : ids, 'date_block_num' : dates}
date_id_df = pd.DataFrame.from_dict(date_id_dict)
date_id_df = date_id_df.merge(test_ids, on='ID')
date_id_df.head()
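
As a cross-check on the grid construction, the same ID-by-month product can be sketched with `pd.MultiIndex.from_product`. The toy frame and the names `toy_test_ids`, `n_months` and `toy_grid` are made up for illustration; this is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

# Toy stand-in for test.csv: three shop-item pairs (values are invented).
toy_test_ids = pd.DataFrame({
    'ID': [0, 1, 2],
    'shop_id': [5, 5, 6],
    'item_id': [100, 101, 100],
})
n_months = 4  # assumption for the toy example; the real data spans more months

# Cartesian product of every month index with every ID.
grid_index = pd.MultiIndex.from_product(
    [range(n_months), toy_test_ids['ID']],
    names=['date_block_num', 'ID'],
)

# Turn the index into columns and attach shop_id / item_id for each ID.
toy_grid = grid_index.to_frame(index=False).merge(toy_test_ids, on='ID')

print(toy_grid.shape)  # (12, 4): 3 IDs x 4 months, 4 columns
```

Each (ID, date_block_num) pair appears exactly once, which is the property the later left merge relies on.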

The dataset contains daily sales and we want to predict monthly sales, so we aggregate the sales totals by month for each shop-item pair. We also calculate the average price of each shop-item pair over the whole training period.

In [None]:
grouped = sales_train.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
item_cnt = pd.DataFrame(grouped.sum())
item_cnt = item_cnt.drop(['item_price'], axis=1)

grouped = sales_train.groupby(['shop_id', 'item_id'])
avg_price = pd.DataFrame(grouped.mean()['item_price'])

monthly_sales = item_cnt.merge(avg_price, on=['shop_id', 'item_id'])
monthly_sales = monthly_sales.merge(test_ids, on=['shop_id', 'item_id'])
monthly_sales.head()
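
To make the two aggregations concrete, here is a minimal sketch on invented daily records. It uses pandas named aggregation, which is a stylistic choice for the example rather than what the cell above does; `toy_daily`, `toy_monthly` and `toy_avg_price` are hypothetical names.

```python
import pandas as pd

# Invented daily sales: one shop, two items, two months.
toy_daily = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id': [5, 5, 5, 5],
    'item_id': [100, 100, 101, 100],
    'item_price': [10.0, 12.0, 30.0, 11.0],
    'item_cnt_day': [1, 2, 1, 4],
})

# Monthly total per (month, shop, item).
toy_monthly = (
    toy_daily
    .groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
    .agg(item_cnt_month=('item_cnt_day', 'sum'))
)

# Average price per (shop, item) over all months.
toy_avg_price = (
    toy_daily
    .groupby(['shop_id', 'item_id'], as_index=False)
    .agg(avg_price=('item_price', 'mean'))
)

toy_monthly = toy_monthly.merge(toy_avg_price, on=['shop_id', 'item_id'])
print(toy_monthly)
```

Item 100 sells 3 units in month 0 and 4 units in month 1, and its average price of 11.0 is shared by both monthly rows.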

The dataset only records when a product was actually sold. There may be months where a certain product was not sold in a certain store. So we create an entry for every ID-date_block_num combination, and record a total of 0 if there is no record of a sale for that combination.

In [None]:
item_price = monthly_sales[['item_price', 'ID']]
monthly_sales = monthly_sales.drop(['item_price'], axis=1)

monthly_sales = date_id_df.merge(monthly_sales, how='left', on=['ID', 'date_block_num'])
monthly_sales = monthly_sales.drop(['shop_id_y', 'item_id_y'], axis=1)
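# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas versions.
monthly_sales['item_cnt_day'] = monthly_sales['item_cnt_day'].fillna(0)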

monthly_sales = monthly_sales.merge(item_price, how='left', on='ID')
monthly_sales['item_price'] = monthly_sales['item_price'].fillna(monthly_sales['item_price'].mean())

monthly_sales = monthly_sales.drop_duplicates()
column_dict = {'shop_id_x' : 'shop_id', 'item_id_x' : 'item_id', 'item_cnt_day' : 'item_cnt_month', 'item_price' : 'avg_price'}
monthly_sales = monthly_sales.rename(columns=column_dict)
monthly_sales.head()
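
The zero-filling step boils down to a left merge onto the full grid followed by `fillna(0)`. A minimal sketch on invented data (`toy_grid`, `toy_sales` and `toy_filled` are hypothetical names):

```python
import pandas as pd

# Full grid: every ID in every month.
toy_grid = pd.DataFrame({
    'ID': [0, 1, 0, 1],
    'date_block_num': [0, 0, 1, 1],
})

# Recorded sales: ID 0 sold only in month 0, ID 1 only in month 1.
toy_sales = pd.DataFrame({
    'ID': [0, 1],
    'date_block_num': [0, 1],
    'item_cnt_month': [3, 2],
})

# The left merge keeps every grid row; rows without a sale get NaN, then 0.
toy_filled = toy_grid.merge(toy_sales, how='left', on=['ID', 'date_block_num'])
toy_filled['item_cnt_month'] = toy_filled['item_cnt_month'].fillna(0)

print(toy_filled['item_cnt_month'].tolist())  # [3.0, 0.0, 0.0, 2.0]
```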

Now we incorporate the item_category_id feature, and use date_block_num to create month and year features.

In [None]:
monthly_sales = monthly_sales.merge(items, on='item_id')
monthly_sales = monthly_sales.drop(['item_name'], axis=1)

# date_block_num counts months starting from January 2013 (block 0).
monthly_sales['month'] = monthly_sales['date_block_num'] % 12 + 1
monthly_sales['year'] = monthly_sales['date_block_num'] // 12 + 2013

monthly_sales = monthly_sales[['ID', 'date_block_num', 'shop_id', 'item_category_id', 'avg_price', 'month', 'year', 'item_cnt_month']]
monthly_sales.head()
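
The calendar arithmetic can be checked in isolation. The helper below is a hypothetical illustration (the name `block_to_month_year` is not part of the notebook) and assumes, as the code above does, that date_block_num 0 is January 2013.

```python
def block_to_month_year(date_block_num):
    """Convert a month index (0 = January 2013) into (month, year)."""
    year_offset, month_index = divmod(date_block_num, 12)
    return month_index + 1, 2013 + year_offset


print(block_to_month_year(0))   # (1, 2013)
print(block_to_month_year(34))  # (11, 2015), the month to be predicted
```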

The total sales in previous months are likely a good predictor of sales in the current month. So we create 12 lag features, one for the total sales in each of the previous 12 months. The first 12 months have incomplete lag history, so we drop them afterwards.

In [None]:
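def calculate_item_cnt_lagged(df, lag):
    """Add item_cnt_lag<lag>: each ID's item_cnt_month from `lag` months earlier."""
    tmp = df[['date_block_num', 'ID', 'item_cnt_month']]
    shifted = tmp.copy()
    shifted.columns = ['date_block_num', 'ID', 'item_cnt_lag'+str(lag)]
    # Move each record forward by `lag` months so it joins onto the month it should inform.
    shifted.date_block_num = shifted.date_block_num + lag
    df = pd.merge(df, shifted, on=['date_block_num', 'ID'], how='left')
    return df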

for lag in range(1, 13):
    monthly_sales = calculate_item_cnt_lagged(monthly_sales, lag)
    
monthly_sales = monthly_sales[monthly_sales['date_block_num'] > 11]
monthly_sales.head()
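
The shift-and-merge trick is easier to see on a tiny frame. The sketch below applies the same idea to a single invented ID; `toy`, `add_lag` and `toy_lagged` are hypothetical names used only here.

```python
import pandas as pd

# One ID with three months of invented sales.
toy = pd.DataFrame({
    'ID': [0, 0, 0],
    'date_block_num': [0, 1, 2],
    'item_cnt_month': [5, 7, 9],
})


def add_lag(df, lag):
    """Attach each row's item_cnt_month from `lag` months earlier."""
    shifted = df[['date_block_num', 'ID', 'item_cnt_month']].copy()
    shifted = shifted.rename(columns={'item_cnt_month': f'item_cnt_lag{lag}'})
    # Relabel month m as month m + lag, so the join lands on the later month.
    shifted['date_block_num'] += lag
    return df.merge(shifted, on=['date_block_num', 'ID'], how='left')


toy_lagged = add_lag(toy, 1)
print(toy_lagged)
```

Month 0 has no earlier month to draw from, so its lag is NaN; months 1 and 2 receive 5 and 7. This is why the notebook discards the first 12 months after adding 12 lags.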

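Now we build the same features for the test set, which corresponds to date_block_num 34 (November 2015). Appending the test rows to the training rows lets us reuse the same lag function.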

In [None]:
id_price = monthly_sales[['ID', 'avg_price']]

x_test_df = test_ids.merge(items, on='item_id')
x_test_df = x_test_df.merge(id_price, how='left', on='ID')
x_test_df.insert(loc=2, column='month', value=11)
x_test_df.insert(loc=3, column='year', value=2015)
x_test_df.insert(loc=4, column='date_block_num', value=34)

x_test_df = x_test_df.drop_duplicates()
x_test_df['avg_price'] = x_test_df['avg_price'].fillna(x_test_df['avg_price'].mean())

monthly_sales_subset = monthly_sales[['ID', 'date_block_num', 'shop_id', 'item_category_id', 'avg_price', 'month', 'year', 'item_cnt_month']]
x_all_df = pd.concat((x_test_df, monthly_sales_subset))

for lag in range(1, 13):
    x_all_df = calculate_item_cnt_lagged(x_all_df, lag)
    
x_test_df = x_all_df[x_all_df['date_block_num'] == 34]
x_test_df = x_test_df[['ID', 'shop_id', 'item_category_id', 'avg_price', 'month', 'year',
                    'item_cnt_lag1', 'item_cnt_lag2', 'item_cnt_lag3', 'item_cnt_lag4', 
                    'item_cnt_lag5', 'item_cnt_lag6', 'item_cnt_lag7', 'item_cnt_lag8', 
                    'item_cnt_lag9', 'item_cnt_lag10', 'item_cnt_lag11', 'item_cnt_lag12']]
x_test_df.head()
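
A quick arithmetic check shows which training months feed the test lags. The constant name `TEST_BLOCK` is introduced here for illustration only.

```python
TEST_BLOCK = 34  # date_block_num assigned to the test rows above

# Lag k for the test month is taken from month TEST_BLOCK - k.
lag_sources = {lag: TEST_BLOCK - lag for lag in range(1, 13)}

print(lag_sources[1], lag_sources[12])  # 33 22
```

All twelve source months fall inside the range kept after the earlier `date_block_num > 11` filter, so the test lags should have no gaps.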

The next step is to prepare the datasets for the GradientBoostingRegressor. We scale the avg_price feature, use one-hot encoding for the categorical features, and clip the monthly sales totals used as training targets into the [0, 20] range to match the test data.

In [None]:
from scipy.sparse import hstack, vstack
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, normalize

In [None]:
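x_train_price = np.array(monthly_sales['avg_price']).reshape(-1, 1)
x_test_price = np.array(x_test_df['avg_price']).reshape(-1, 1)
# sklearn's normalize() rescales each row to unit norm by default, which would turn a
# single-column feature into a constant 1.0. Standardize with training statistics instead.
price_mean, price_std = x_train_price.mean(), x_train_price.std()
x_train_price = (x_train_price - price_mean) / price_std
x_test_price = (x_test_price - price_mean) / price_std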

x_train_lags = np.array(monthly_sales[['item_cnt_lag1', 'item_cnt_lag2', 'item_cnt_lag3', 'item_cnt_lag4', 
                                       'item_cnt_lag5', 'item_cnt_lag6', 'item_cnt_lag7', 'item_cnt_lag8', 
                                       'item_cnt_lag9', 'item_cnt_lag10', 'item_cnt_lag11', 'item_cnt_lag12']])
x_test_lags = np.array(x_test_df[['item_cnt_lag1', 'item_cnt_lag2', 'item_cnt_lag3', 'item_cnt_lag4', 
                                       'item_cnt_lag5', 'item_cnt_lag6', 'item_cnt_lag7', 'item_cnt_lag8', 
                                       'item_cnt_lag9', 'item_cnt_lag10', 'item_cnt_lag11', 'item_cnt_lag12']])

x_train_categorical = np.array(monthly_sales[['shop_id', 'item_category_id', 'month', 'year']])
x_test_categorical = np.array(x_test_df[['shop_id', 'item_category_id', 'month', 'year']])
x_all_categorical = np.concatenate((x_train_categorical, x_test_categorical))
y_train = np.array(monthly_sales['item_cnt_month'])

encoder = OneHotEncoder()
encoder.fit(x_all_categorical)
x_train_categorical = encoder.transform(x_train_categorical)
x_test_categorical = encoder.transform(x_test_categorical)

x_train = hstack([x_train_categorical, x_train_price, x_train_lags])
x_test = hstack([x_test_categorical, x_test_price, x_test_lags])

y_train = np.clip(y_train, 0, 20)

print(x_train.shape)
print(x_test.shape)
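
One thing worth checking when scaling a single-column feature: `sklearn.preprocessing.normalize` works row by row by default, so on an (n, 1) array every nonzero value ends up as 1. The NumPy-only sketch below mimics that row-wise behaviour and contrasts it with column-wise standardization. The prices are invented.

```python
import numpy as np

prices = np.array([100.0, 250.0, 4000.0]).reshape(-1, 1)

# Row-wise L2 scaling: each row is divided by its own norm.
row_norms = np.linalg.norm(prices, axis=1, keepdims=True)
row_scaled = prices / row_norms

# Column-wise standardization: subtract the column mean, divide by its std.
col_scaled = (prices - prices.mean(axis=0)) / prices.std(axis=0)

print(row_scaled.ravel())  # [1. 1. 1.]
print(col_scaled.ravel())
```

Row-wise scaling erases all price information here, while the column-wise version preserves the ordering and relative spacing of the prices. Tree-based models such as gradient boosting are generally insensitive to monotonic rescaling, so the second form mainly matters for keeping the feature informative.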

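Finally, we fit the GradientBoostingRegressor using n_estimators=500, report its RMSE on the training set (an optimistic estimate, since the model has already seen this data), and use this model to make predictions on the test data.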

In [None]:
gradient_boost = GradientBoostingRegressor(n_estimators=500)
gradient_boost.fit(x_train, y_train)
train_pred = gradient_boost.predict(x_train)
rmse = np.sqrt(mean_squared_error(y_train, train_pred))
print(f"RMSE on training set: {rmse}")
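
Training RMSE tends to understate the error on unseen months. One possible alternative, sketched below on invented data, is to hold out the last available month and score on it. The stand-in "model" here simply predicts the previous month's count; the names `rmse`, `toy`, `train_part` and `valid_part` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Invented rows: two IDs over three months, with a lag-1 feature.
toy = pd.DataFrame({
    'date_block_num': [31, 31, 32, 32, 33, 33],
    'item_cnt_lag1':  [0, 2, 1, 2, 1, 4],
    'item_cnt_month': [1, 2, 1, 4, 0, 4],
})

# Time-based split: the most recent month is held out for validation.
last_month = toy['date_block_num'].max()
train_part = toy[toy['date_block_num'] < last_month]
valid_part = toy[toy['date_block_num'] == last_month]

# Stand-in model: predict that each ID repeats last month's sales.
valid_rmse = rmse(valid_part['item_cnt_month'], valid_part['item_cnt_lag1'])
print(f"RMSE on held-out month: {valid_rmse:.4f}")
```

In the real notebook, the same split could be applied to the feature matrix before fitting, with the fitted regressor in place of the lag-1 baseline.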

In [None]:
test_pred = gradient_boost.predict(x_test)
test_pred = x_test_df.assign(item_cnt_month=test_pred)
test_pred = test_pred[['ID', 'item_cnt_month']]
test_pred = test_pred.sort_values(by='ID')
test_pred.to_csv('submission.csv', index=False)
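
Because the training targets were clipped to [0, 20], it may also help to clip the predictions to that range before writing the submission. The cell above does not do this, so treat the sketch below as an optional extra step; the prediction values are invented.

```python
import numpy as np

# Invented raw model outputs, including values outside the target range.
raw_pred = np.array([-0.3, 0.8, 5.2, 27.4])

# Force predictions into the same [0, 20] range as the clipped targets.
clipped_pred = np.clip(raw_pred, 0, 20)

print(clipped_pred.tolist())  # [0.0, 0.8, 5.2, 20.0]
```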