## File descriptions

sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.

test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.

sample_submission.csv - a sample submission file in the correct format.

items.csv - supplemental information about the items/products.

item_categories.csv - supplemental information about the item categories.

shops.csv - supplemental information about the shops.

## Data fields

ID - an Id that represents a (Shop, Item) tuple within the test set

shop_id - unique identifier of a shop

item_id - unique identifier of a product

item_category_id - unique identifier of item category

item_cnt_day - number of products sold. You are predicting a monthly amount of this measure

item_price - current price of an item

date - date in format dd.mm.yyyy

date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33

item_name - name of item

shop_name - name of shop

item_category_name - name of item category
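
The `date_block_num` definition above is just a month offset counted from January 2013. A minimal sketch of that arithmetic follows; `month_to_block` is a hypothetical helper name that is not used anywhere else in this notebook.

```python
# Hypothetical helper: map a calendar (year, month) to date_block_num,
# assuming January 2013 is block 0 as stated in the field description.
def month_to_block(year, month):
    return (year - 2013) * 12 + (month - 1)

print(month_to_block(2013, 1))   # January 2013 -> 0
print(month_to_block(2015, 10))  # October 2015 -> 33
print(month_to_block(2015, 11))  # November 2015 (the month to forecast) -> 34
```

By the same arithmetic, the month to forecast comes right after the last training block.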

## Imports

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV



import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading data

In [None]:
item_categories=pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')

items=pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')

sales_train=pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
# dates are stored as dd.mm.yyyy; pd.datetime was removed and date_parser is deprecated in recent pandas
sales_train['date']=pd.to_datetime(sales_train['date'], format='%d.%m.%Y')

sample_submission=pd.read_csv('../input/competitive-data-science-predict-future-sales/sample_submission.csv')

shops=pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')

test=pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv').set_index('ID')

In [None]:
train=sales_train.join(items, on='item_id', rsuffix='_') \
                 .join(item_categories, on='item_category_id', rsuffix='_') \
                 .drop(['item_id_','item_category_id_','item_name', 'item_category_name'], axis=1)
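
The join above matches `item_id` against the row index of `items`, which only works while that index happens to equal `item_id`. As a sketch of an alternative, the same lookup can be written as a merge on the explicit key column. The toy frames and the name `items_toy` below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for sales_train and items (made-up values).
sales = pd.DataFrame({'item_id': [2, 0, 2], 'item_cnt_day': [1.0, 3.0, 2.0]})
items_toy = pd.DataFrame({'item_id': [0, 1, 2],
                          'item_name': ['a', 'b', 'c'],
                          'item_category_id': [40, 40, 55]})

# Merge on the key column itself; validate raises if item_id is not unique in items_toy.
merged = sales.merge(items_toy[['item_id', 'item_category_id']],
                     on='item_id', how='left', validate='many_to_one')
print(merged)
```

Selecting only the needed columns before merging also removes the need to drop the name columns afterwards.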

In [None]:
train.head(2)

In [None]:
train.shape

In [None]:
test.head(2)

## Data cleaning

In [None]:
shops

We can see that several shops are duplicates of each other. I need to merge the duplicated shop ids in both the train and test data.

In [None]:
train.loc[train.shop_id == 0, 'shop_id'] = 57
test.loc[test.shop_id==0, 'shop_id']=57
train.loc[train.shop_id==1, 'shop_id']=58
test.loc[test.shop_id==1, 'shop_id']=58
train.loc[train.shop_id==10, 'shop_id']=11
test.loc[test.shop_id==10, 'shop_id']=11
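
The six assignments above can also be expressed as one mapping applied to both frames. This is only a sketch: `SHOP_ID_MAP` and `merge_duplicate_shops` are hypothetical names, and the toy frame stands in for `train` / `test`.

```python
import pandas as pd

# Duplicate shop id -> canonical shop id, taken from the cell above.
SHOP_ID_MAP = {0: 57, 1: 58, 10: 11}

def merge_duplicate_shops(df, mapping=SHOP_ID_MAP):
    """Return a copy of df with duplicate shop ids replaced by the canonical ones."""
    out = df.copy()
    out['shop_id'] = out['shop_id'].replace(mapping)
    return out

# Toy frame with made-up rows.
toy = pd.DataFrame({'shop_id': [0, 1, 10, 5], 'item_id': [1, 2, 3, 4]})
fixed = merge_duplicate_shops(toy)
print(fixed.shop_id.tolist())
```

Keeping the mapping in one place makes it harder to fix `train` and forget `test`.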

In [None]:
train.describe()

The minimum item price is negative. We need to remove all rows with a negative or zero price.

In [None]:
train=train[train.item_price>0]

Let's add year and month columns

In [None]:
train['year']=pd.DatetimeIndex(train.date).year
train['month']=pd.DatetimeIndex(train.date).month

Missing values

In [None]:
train.isnull().sum()

In [None]:
train.dtypes


## Data analysis

Item price

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x=train.item_price.values)
plt.title('Item Price', fontsize=15)

In [None]:
train=train[train.item_price<100000]

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x=train.item_price.values)
plt.title('Item Price after outliers cut', fontsize=15)

Number of products sold

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x=train.item_cnt_day.values)
plt.title('Number of products sold', fontsize=15)
plt.xlabel('Products', fontsize=10)

In [None]:
train=train[train.item_cnt_day<999]
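
The price and count filters used so far (price above 0, price below 100000, daily count below 999) can be collected into one function. A minimal sketch, where `drop_outliers` is a hypothetical name and the toy rows are made up:

```python
import pandas as pd

def drop_outliers(df, max_price=100000, max_cnt=999):
    """Keep rows with 0 < item_price < max_price and item_cnt_day < max_cnt."""
    mask = (df.item_price > 0) & (df.item_price < max_price) & (df.item_cnt_day < max_cnt)
    return df[mask]

# Made-up rows: a negative price, a normal sale, an extreme price, an extreme count.
toy = pd.DataFrame({'item_price': [-1.0, 199.0, 250000.0, 499.0],
                    'item_cnt_day': [1.0, 2.0, 1.0, 1500.0]})
print(drop_outliers(toy))
```

Only the second row survives all three conditions.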

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x=train.item_cnt_day.values)
plt.title('Number of products sold after outliers cut', fontsize=15)
plt.xlabel('Products', fontsize=10)


Let's see which month had the most sales

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(data=train, x='month', y='item_cnt_day', style='year')
plt.title('Number of products sold in month', fontsize=15)

In [None]:
# weekly totals; build the series in one chain to avoid modifying a slice in place
train_calendar=train.set_index('date')['item_cnt_day'].resample('W').sum()
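
To make the weekly aggregation concrete, here is a small sketch on made-up rows. With the `'W'` rule pandas closes each bin on a Sunday, so the index shows the week-ending date.

```python
import pandas as pd

# Made-up daily rows: two sales in the first week of January 2013, one in the second.
toy = pd.DataFrame({
    'date': pd.to_datetime(['02.01.2013', '03.01.2013', '09.01.2013'], format='%d.%m.%Y'),
    'item_cnt_day': [1.0, 2.0, 5.0],
})

# Sum the daily counts per calendar week (weeks end on Sunday).
weekly = toy.set_index('date')['item_cnt_day'].resample('W').sum()
print(weekly)
```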

In [None]:
plt.figure(figsize=(10,5))
sns.lineplot(data=train_calendar)
plt.title('Weekly sales',fontsize=15 )
plt.xlabel('Date',fontsize=10)
plt.ylabel('Items sold per week', fontsize=10)

Most popular shops by items sold and by revenue

In [None]:
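# revenue per row is price * quantity; summing item_price alone ignores how many units were sold
train_sold=train.assign(revenue=train.item_price*train.item_cnt_day) \
                .groupby('shop_id',as_index=False).agg({'item_cnt_day':'sum','revenue':'sum'}) \
                .rename(columns={'item_cnt_day':'total_items_sold', 'revenue': 'total_money' }) \
                .sort_values('total_items_sold', ascending=False).reset_index(drop=True)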

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x=train_sold.shop_id, y=train_sold.total_items_sold ,data= train_sold, order= train_sold.shop_id)
plt.title('Most popular shop', fontsize=15)
plt.ylabel('Items sold', fontsize=10)
plt.xlabel('Shop ID', fontsize=10)


Sales proceeds

In [None]:
train_sold=train_sold.sort_values('total_money', ascending=False).reset_index(drop=True)
plt.figure(figsize=(15,5))
sns.barplot(x=train_sold.shop_id, y=train_sold.total_money, order= train_sold.shop_id)
plt.title('Sales proceeds', fontsize=15)
plt.ylabel('Money', fontsize=10)
plt.xlabel('Shop ID', fontsize=10)

In [None]:
# note: 'count' tallies sales records (rows) per shop, not units sold
train_items_sold=train.groupby('shop_id',as_index=False) \
                      .agg({'item_id':'count'}) \
                      .rename(columns={'item_id':'items_sold'})\
                      .sort_values('items_sold', ascending=False).reset_index(drop=True)

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x=train_items_sold.shop_id, y=train_items_sold.items_sold, order= train_items_sold.shop_id)
plt.title('Items sold ', fontsize=15)
plt.ylabel('Items', fontsize=10)
plt.xlabel('Shop ID', fontsize=10)

Which category sells the most?

In [None]:
train_categ=train.groupby('item_category_id', as_index=False) \
                 .agg({'item_cnt_day':'count', 'item_id':'nunique'}) \
                 .rename(columns={'item_cnt_day':'items_sold', 'item_id':'assortment_items'}) \
                 .sort_values('items_sold',ascending=False) \
                 .reset_index(drop=True)

In [None]:
plt.figure(figsize=(18,5))
sns.barplot(x='item_category_id', y='items_sold', data=train_categ, palette="pastel", order=train_categ.item_category_id)
plt.title('Most popular category', fontsize=15)
plt.ylabel('Items sold', fontsize=10)
plt.xlabel('Category', fontsize=10)

## Preprocessing

In [None]:
train_df=train.drop(['date', 'item_price','item_category_id'], axis=1)

In [None]:
feature= [c for c in train_df.columns if c not in ['item_cnt_day']]

In [None]:
train_df = train_df.groupby(feature, as_index=False) \
                   .agg({'item_cnt_day':'sum'}) \
                   .rename(columns={'item_cnt_day':'item_cnt_month'})
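
The aggregation above collapses the daily rows into one row per month, shop and item. A small sketch on made-up rows shows the effect; the named-aggregation form used here is just an alternative spelling of `agg` plus `rename`.

```python
import pandas as pd

# Made-up daily rows for one shop.
toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'shop_id':        [5, 5, 5, 5],
    'item_id':        [7, 7, 8, 7],
    'item_cnt_day':   [1.0, 2.0, 1.0, 4.0],
})

# One row per (month, shop, item) with the daily counts summed.
monthly = (toy.groupby(['date_block_num', 'shop_id', 'item_id'], as_index=False)
              .agg(item_cnt_month=('item_cnt_day', 'sum')))
print(monthly)
```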

In [None]:
shop_item_monthly_mean = train_df[['shop_id', 'item_id', 'item_cnt_month']] \
                        .groupby(['shop_id', 'item_id'], as_index=False) \
                        .agg({'item_cnt_month': 'mean'}) \
                        .rename(columns={'item_cnt_month':'item_cnt_month_mean'})

In [None]:
train_df = pd.merge(train_df, shop_item_monthly_mean, how='left', on=['shop_id', 'item_id'])
train_df.head()

In [None]:
shop_item_prev_month = train_df[train_df['date_block_num'] == 33][['shop_id', 'item_id', 'item_cnt_month']] \
                        .rename(columns={'item_cnt_month':'item_cnt_prev_month'})
shop_item_prev_month.head()

In [None]:
train_df = pd.merge(train_df, shop_item_prev_month, how='left', on=['shop_id', 'item_id'])
train_df.head()
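
Note that the cell above attaches the month-33 count to every row, including the rows of month 33 itself, so on the hold-out month this feature equals the target. As a sketch of a different technique (not what this notebook does), a per-row lag can be built by shifting `date_block_num` forward by one and merging back. The frame below is made up.

```python
import pandas as pd

# Made-up monthly rows standing in for train_df.
monthly = pd.DataFrame({
    'date_block_num': [31, 32, 33, 33],
    'shop_id':        [5, 5, 5, 6],
    'item_id':        [7, 7, 7, 7],
    'item_cnt_month': [2.0, 4.0, 1.0, 3.0],
})

# Shift each month's count forward by one block so it lines up with the following month.
lag = monthly[['date_block_num', 'shop_id', 'item_id', 'item_cnt_month']].copy()
lag['date_block_num'] += 1
lag = lag.rename(columns={'item_cnt_month': 'item_cnt_prev_month'})

# Rows without a previous month get 0, mirroring the fillna(0.) used in this notebook.
with_lag = monthly.merge(lag, on=['date_block_num', 'shop_id', 'item_id'], how='left')
with_lag['item_cnt_prev_month'] = with_lag['item_cnt_prev_month'].fillna(0.)
print(with_lag)
```

With this construction each row only sees the count of the month before it, so the validation score on month 33 is not inflated by the target leaking into the features.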

In [None]:
train_df = train_df.fillna(0.)
train_df.head()

Let's add the year, month and consecutive month number to the test data frame

In [None]:
test['month'] = 11
test['year'] = 2015
test['date_block_num'] = 34
test.head()

In [None]:
test_df = pd.merge(test, shop_item_monthly_mean, how='left', on=['shop_id', 'item_id'])


In [None]:
test_df = pd.merge(test_df, shop_item_prev_month, how='left', on=['shop_id', 'item_id'])
test_df.head()

In [None]:
test_df = test_df.fillna(0.)
test_df.head()

## Modelling

Month 33 (October 2015) is held out for validation

In [None]:
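# compare against the column name itself; "c not in 'item_cnt_month'" is a substring test and silently drops 'month'
feature_list = [c for c in train_df.columns if c != 'item_cnt_month']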
X_train = train_df[train_df['date_block_num'] < 33]
y_train = np.log1p(X_train['item_cnt_month'].clip(0., 20.))
X_train = X_train[feature_list]
X_val = train_df[train_df['date_block_num'] == 33]
y_val = np.log1p(X_val['item_cnt_month'].clip(0., 20.))
X_val = X_val[feature_list]
X_test=test_df[feature_list]
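
Because the target is clipped to [0, 20] and then passed through `log1p`, predictions made by a model trained on it are in log space and have to go through `expm1` to become item counts again. A minimal round-trip sketch on made-up counts:

```python
import numpy as np

counts = np.array([0., 3., 25.])          # made-up monthly counts

y = np.log1p(counts.clip(0., 20.))        # transformed target, same recipe as the cell above
restored = np.expm1(y)                    # inverse transform, back to item counts

print(y)
print(restored)
```

The value 25 comes back as 20 because the clip is applied before the transform and is not invertible.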

In [None]:
# rf_model=RandomForestRegressor(random_state=42, n_jobs=-1, verbose=1)
# params={'n_estimators':np.arange(100,500,50), 'max_depth':np.arange(10,50,10), \
#         'min_samples_split': np.arange(2,8)}
# rand_cv=RandomizedSearchCV(rf_model, params, n_iter=3, scoring= 'neg_mean_squared_error')
# rand_cv.fit(X_train,y_train)
# best_model=rand_cv.best_estimator_
# # .score() returns R^2, so compute the RMSE from the validation predictions instead
# rmse=np.sqrt(mean_squared_error(y_val, best_model.predict(X_val)))
# y_test=np.expm1(best_model.predict(X_test)).clip(0., 20.)
# print('best params', rand_cv.best_params_)


In [None]:
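# params={'n_estimators':[20,30,50,100, 400], 'max_depth':[ 20, 30, 50, 100], \
#         'min_samples_split': [2, 4, 6] }
# # GridSearchCV has no shuffle argument and rf_reg was never defined, so build the estimator here
# grid_model=GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1), params, cv= 3)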

# grid_model.fit(X_train, y_train)

# grid_model.best_params_

In [None]:
# rnd=RandomForestRegressor(n_estimators=400, max_depth= 20, min_samples_split=6, \
#                           random_state=42, n_jobs=-1, verbose=1)

In [None]:
# t=dt.datetime.now()
# rnd.fit(X_train, y_train)
# print(dt.datetime.now() - t)

In [None]:
# rnd.score(X_val, y_val)

In [None]:
# y_pred=rnd.predict(X_val)
# MSE=mean_squared_error(y_val, y_pred)
# rmse=np.sqrt(MSE)
# print('MSE:', MSE)
# print('RMSE:',rmse)

In [None]:
model=XGBRegressor()
params={'learning_rate':[0.05, 0.1, 0.16],
       'max_depth':[10,30,50],
       'min_child_weight':[1,3,6],
       'n_estimators':[200, 300, 400]}
xgb_grid=GridSearchCV(model, params, n_jobs=1, verbose=1, cv=3)
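
With `cv=3` the grid search uses ordinary K-fold splits, so later months can end up in the training folds while earlier months are scored. As a sketch of one alternative (an assumption on my side, not something this notebook does), scikit-learn's `PredefinedSplit` can hold out the latest month inside the search. The `blocks` array below is made up and stands in for `X_train['date_block_num']`.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy month index standing in for X_train['date_block_num'] (made-up values).
blocks = np.array([30, 31, 31, 32, 32, 32])

# -1 keeps a row in the training part of every split; 0 puts it in validation fold 0.
test_fold = np.where(blocks == blocks.max(), 0, -1)
time_split = PredefinedSplit(test_fold)

train_idx, val_idx = next(time_split.split())
print(train_idx, val_idx)
```

The resulting object can be passed as `cv=time_split` to `GridSearchCV`, which also cuts the number of fits per parameter combination from three to one.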

In [None]:
%%time
xgb_grid.fit(X_train, y_train)

In [None]:
xgb_grid.best_estimator_

In [None]:
xgb_grid.best_params_
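
The hold-out month is prepared as `X_val` / `y_val`, so the tuned model can be scored on it with RMSE. A minimal sketch with made-up numbers follows; `rmse` is a hypothetical helper, and the toy arrays stand in for `y_val` and the model's validation predictions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values standing in for y_val and the predictions (both in log1p space).
y_val_toy = np.log1p(np.array([0., 1., 4.]))
pred_toy = np.log1p(np.array([0., 2., 4.]))

print(rmse(y_val_toy, pred_toy))                        # error in log1p space
print(rmse(np.expm1(y_val_toy), np.expm1(pred_toy)))    # error in item counts
```

In this notebook the call would look something like `rmse(y_val, xgb_grid.best_estimator_.predict(X_val))`, assuming the grid search has finished fitting.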

In [None]:
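# rnd is commented out above, so predict with the best XGBoost model;
# the target was log1p-transformed, so invert it with expm1 before clipping
y_test=np.expm1(xgb_grid.best_estimator_.predict(X_test)).clip(0., 20.)

In [None]:
submission=pd.DataFrame({'ID': X_test.index,'item_cnt_month':y_test})
submission.to_csv('xgb_submission.csv', index=False)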