# Predicting total sales for every product and store 

In this notebook, we're going to go through an example machine learning project with the goal of predicting total sales for every product and store.

## 1. Problem Definition
> How well can we predict the total number of units sold next month for every product and store, given previous examples of how many units were sold?

## 2. Data
The data comes from the Kaggle competition that serves as the final project for the Coursera course "How to Win a Data Science Competition": https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data

There are two main datasets:
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.

## 3. Evaluation
Submissions are evaluated by root mean squared error (RMSE) between actual sales and predicted sales.
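
For reference, writing $y_i$ for the actual monthly sales of the $i$-th (shop, item) pair, $\hat{y}_i$ for the prediction, and $n$ for the number of pairs (symbols introduced here only for illustration), RMSE is usually defined as:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because errors are squared before averaging, a few large misses weigh more heavily than many small ones.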

For more on the evaluation of this project check: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation


## 4. Features

Data dictionary:
    
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
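
The dictionary notes that the target is a monthly amount of item_cnt_day, while sales_train.csv is daily. A minimal sketch of that aggregation on toy rows follows; the toy values and the column name `item_cnt_month` are assumptions made for illustration:

```python
import pandas as pd

# Toy daily sales rows using the column names from the data dictionary
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1],
    "shop_id":        [25, 25, 59, 25, 59],
    "item_id":        [2552, 2552, 22154, 2552, 22154],
    "item_cnt_day":   [1.0, -1.0, 1.0, 2.0, 3.0],
})

# Sum daily counts per (month, shop, item); item_cnt_month is an assumed name
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
         .sum()
         .rename(columns={"item_cnt_day": "item_cnt_month"})
)
print(monthly)
```

Negative daily counts (most likely returns) simply reduce the monthly total in this sketch.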

In [2]:
#Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
#Importing and loading training data
data = pd.read_csv('sales_train.csv',low_memory=False,parse_dates=['date'])

In [4]:
data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,2013-02-01,0,59,22154,999.0,1.0
1,2013-03-01,0,25,2552,899.0,1.0
2,2013-05-01,0,25,2552,899.0,-1.0
3,2013-06-01,0,25,2554,1709.05,1.0
4,2013-01-15,0,25,2555,1099.0,1.0
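
In the output above, dates such as 2013-02-01 and 2013-05-01 carry date_block_num 0 (January 2013), which suggests that day and month were swapped for some rows during parsing. A minimal sketch of day-first parsing, assuming the raw strings follow the dd/mm/yyyy layout given in the data dictionary (the sample strings are made up, and the separator in the format string should match the actual file):

```python
import pandas as pd

# Made-up raw date strings in day-first order
raw = pd.DataFrame({"date": ["02/01/2013", "15/01/2013", "10/10/2015"]})

# An explicit format removes the ambiguity between day-first and month-first
raw["date"] = pd.to_datetime(raw["date"], format="%d/%m/%Y")
print(raw["date"].dt.month.tolist())
```

When loading with `pd.read_csv`, passing `dayfirst=True` together with `parse_dates` is another option.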


In [5]:
# Check for missing values in each column
data.isnull().sum()

date              0
date_block_num    0
shop_id           0
item_id           0
item_price        0
item_cnt_day      0
dtype: int64

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype         
---  ------          -----         
 0   date            datetime64[ns]
 1   date_block_num  int64         
 2   shop_id         int64         
 3   item_id         int64         
 4   item_price      float64       
 5   item_cnt_day    float64       
dtypes: datetime64[ns](1), float64(2), int64(3)
memory usage: 134.4 MB


In [7]:
data.tail()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
2935844,2015-10-10,33,25,7409,299.0,1.0
2935845,2015-09-10,33,25,7460,299.0,1.0
2935846,2015-10-14,33,25,7459,349.0,1.0
2935847,2015-10-22,33,25,7440,299.0,1.0
2935848,2015-03-10,33,25,7460,299.0,1.0


In [8]:
# Sort the rows chronologically by date
data.sort_values(by=['date'],inplace=True,ascending=True)
data.head(20)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
109593,2013-01-01,0,46,18616,349.0,1.0
85162,2013-01-01,0,54,11854,199.0,1.0
18128,2013-01-01,0,28,4906,1799.0,1.0
112216,2013-01-01,0,42,2931,99.0,1.0
85141,2013-01-01,0,54,11604,349.0,1.0
47143,2013-01-01,0,15,3686,899.0,1.0
85130,2013-01-01,0,54,11576,149.0,1.0
85129,2013-01-01,0,54,11573,148.0,1.0
85124,2013-01-01,0,54,11562,299.0,1.0
85115,2013-01-01,0,54,11822,349.0,1.0


In [48]:
# Split the date into separate year, month and day features
data["saleYear"] = data.date.dt.year
data["saleMonth"] = data.date.dt.month
data["saleDay"] = data.date.dt.day

In [49]:
data.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,saleYear,saleMonth,saleDay
109593,2013-01-01,0,46,18616,349.0,1.0,2013,1,1
85162,2013-01-01,0,54,11854,199.0,1.0,2013,1,1
18128,2013-01-01,0,28,4906,1799.0,1.0,2013,1,1
112216,2013-01-01,0,42,2931,99.0,1.0,2013,1,1
85141,2013-01-01,0,54,11604,349.0,1.0,2013,1,1


In [50]:
data.drop('date',inplace=True,axis=1)

In [51]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935849 entries, 109593 to 2898514
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date_block_num  int64  
 1   shop_id         int64  
 2   item_id         int64  
 3   item_price      float64
 4   item_cnt_day    float64
 5   saleYear        int64  
 6   saleMonth       int64  
 7   saleDay         int64  
dtypes: float64(2), int64(6)
memory usage: 201.6 MB


Our dataset does not contain any missing values and every column is numeric, so we can continue with modeling.

## 5. Modeling

Let's fit a RandomForestRegressor to our data and evaluate it.

In [52]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [54]:
# Let's split X and y
np.random.seed(42)
X = data.drop(['item_cnt_day'],axis=1)
y = data['item_cnt_day']

In [55]:
# Let's split into train and validation set
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [56]:
X_train.shape, X_test.shape

((2348679, 7), (587170, 7))

In [57]:
y_train.shape, y_test.shape

((2348679,), (587170,))
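
Because the goal is to forecast a future month, a random split lets the model train on days from the same months it is validated on. One alternative, sketched here on toy data and not what this notebook does, is to hold out the most recent month using date_block_num:

```python
import pandas as pd

# Toy rows covering the last three month blocks
toy = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "shop_id":        [25, 59, 25, 59, 25, 59],
    "item_id":        [2552, 22154, 2552, 22154, 2552, 22154],
    "item_cnt_day":   [1.0, 2.0, 1.0, 1.0, 3.0, 1.0],
})

last_block = toy["date_block_num"].max()

# Train on earlier months, validate on the latest one
train_part = toy[toy["date_block_num"] < last_block]
valid_part = toy[toy["date_block_num"] == last_block]

X_tr, y_tr = train_part.drop(columns="item_cnt_day"), train_part["item_cnt_day"]
X_val, y_val = valid_part.drop(columns="item_cnt_day"), valid_part["item_cnt_day"]
print(X_tr.shape, X_val.shape)
```

A validation score measured this way is closer to the situation the test set poses, where every row lies in a month the model has never seen.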

In [72]:
# Evaluation function: root mean squared error (RMSE), the competition metric
def rmse(y_test,y_preds):
    return np.sqrt(mean_squared_error(y_test,y_preds))



In [69]:
%%time
reg = RandomForestRegressor(n_jobs=-1,random_state=43)
reg.fit(X_train,y_train)

Wall time: 4min 57s


RandomForestRegressor(n_jobs=-1, random_state=43)

In [70]:
reg.score(X_test,y_test)

0.2947927270869044
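
For regressors, scikit-learn's `score` returns the coefficient of determination R², so the value above means the forest explains roughly 29% of the variance in the held-out targets. The sketch below uses made-up numbers to show how R² relates to the squared error behind RMSE:

```python
import numpy as np

y_true = np.array([1.0, 1.0, 2.0, 5.0, 1.0])   # made-up targets
y_pred = np.array([1.0, 1.5, 1.5, 3.0, 1.0])   # made-up predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse_value = np.sqrt(mse)

# R^2 = 1 - MSE / variance of the targets (population variance)
r2 = 1.0 - mse / np.var(y_true)
print(rmse_value, r2)
```

A model that always predicts the mean of the targets scores an R² of 0, which gives a useful baseline for reading these numbers.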

In [73]:
y_preds = reg.predict(X_test)

In [77]:
reg_rmse = rmse(y_test,y_preds)
reg_rmse

1.8351105825944072

In [83]:
y_preds_train = reg.predict(X_train)
reg_train_rmse = rmse(y_train,y_preds_train)
reg_train_rmse

0.861695964705258

In [78]:
reg_mae = mean_absolute_error(y_test,y_preds)
reg_mae

0.2838367423403784

In [84]:
reg_mae_train = mean_absolute_error(y_train,y_preds_train)
reg_mae_train

0.10738475543060591

In [79]:
%%time
ridge = Ridge(random_state=43)
ridge.fit(X_train,y_train)
ridge.score(X_test,y_test)

Wall time: 261 ms


0.0010734391341536975

In [80]:
y_preds_ridge = ridge.predict(X_test)
y_preds_ridge

array([1.29666186, 1.19388123, 1.17438859, ..., 1.1239644 , 1.16597038,
       1.25259027])

In [81]:
ridge_rmse = rmse(y_test,y_preds_ridge)
ridge_rmse

2.184090495009219

In [82]:
ridge_mae = mean_absolute_error(y_test,y_preds_ridge)
ridge_mae

0.44059002212288

#### Let's use RandomizedSearchCV to tune the hyperparameters of the random forest regressor

In [92]:
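%%time
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter search space for the random forest
# Note: max_features="auto" was removed in scikit-learn 1.3; for regressors it was equivalent to 1.0
grids = {"n_estimators": np.arange(10, 100, 10),
         "max_depth": [None, 3, 5, 10],
         "min_samples_split": np.arange(2, 20, 2),
         "min_samples_leaf": np.arange(1, 20, 2),
         "max_features": [0.5, 1, "sqrt", "auto"],
         "max_samples": [10000]}  # each tree is trained on 10,000 sampled rows

# Try 10 random combinations with 5-fold cross-validation
rs_reg = RandomizedSearchCV(reg,grids,n_iter=10,cv=5,verbose=True)
rs_reg.fit(X_train,y_train)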


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.3min finished


Wall time: 1min 22s


RandomizedSearchCV(cv=5,
                   estimator=RandomForestRegressor(n_jobs=-1, random_state=43),
                   param_distributions={'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [10000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)
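
By default `RandomizedSearchCV` ranks candidates with the estimator's own `score` (R² here). Since the competition metric is RMSE, one option is to pass `scoring="neg_root_mean_squared_error"`. A small sketch on synthetic data follows; the data and the reduced grid are assumptions chosen only to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(42)
X_toy = rng.rand(200, 3)                                   # synthetic features
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)  # synthetic target

param_dist = {
    "n_estimators": [10, 20],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 3, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=43),
    param_distributions=param_dist,
    n_iter=4,
    cv=3,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=42,                        # makes the sampled candidates reproducible
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best candidate
```

The scorer is negated because scikit-learn always maximizes the score, so flipping the sign of `best_score_` recovers the RMSE.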

In [93]:
rs_reg.best_params_

{'n_estimators': 90,
 'min_samples_split': 8,
 'min_samples_leaf': 1,
 'max_samples': 10000,
 'max_features': 'auto',
 'max_depth': 10}

In [94]:
rs_reg.score(X_test,y_test)

0.09589760390243507

In [95]:
preds = rs_reg.predict(X_test)

In [96]:
# Score the tuned model's predictions (preds) rather than y_preds from the untuned forest
rs_rmse = rmse(y_test,preds)
rs_rmse

In [97]:
# Load the test data
data_test = pd.read_csv('test.csv')
data_test.head()

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


In [98]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   ID       214200 non-null  int64
 1   shop_id  214200 non-null  int64
 2   item_id  214200 non-null  int64
dtypes: int64(3)
memory usage: 4.9 MB


In [100]:
final_preds = reg.predict(data_test)
final_preds

ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 3 
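
The error arises because `reg` was fitted on seven columns while test.csv provides only ID, shop_id and item_id. One possible way forward, sketched on toy data, is to retrain on monthly totals using only columns that can also be built for the test month. The toy values, the reduced feature set and the column name `item_cnt_month` are assumptions for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy monthly totals; item_cnt_month is an assumed column name
monthly = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "shop_id":        [5, 5, 5, 5, 5, 5],
    "item_id":        [5037, 5320, 5037, 5320, 5037, 5320],
    "item_cnt_month": [1.0, 0.0, 2.0, 1.0, 1.0, 0.0],
})
feature_cols = ["date_block_num", "shop_id", "item_id"]

model = RandomForestRegressor(n_estimators=10, random_state=43)
model.fit(monthly[feature_cols], monthly["item_cnt_month"])

# Toy test rows shaped like test.csv
test_rows = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5], "item_id": [5037, 5320]})

# November 2015 follows block 33 (October 2015), so it is block 34
X_submit = test_rows.assign(date_block_num=34)[feature_cols]

submission = pd.DataFrame({
    "ID": test_rows["ID"],
    "item_cnt_month": model.predict(X_submit),
})
print(submission)
```

Selecting `feature_cols` on both the training and the test frame keeps the column count and order identical, which is what the `ValueError` is checking.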