# _Project - Predict Future Sales_

<b> Description: </b>

This challenge serves as the final project for the "How to win a data science competition" Coursera course.

In this competition you will work with a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company. 

We are asking you to predict total sales for every product and store in the next month. By solving this competition you will be able to apply and enhance your data science skills.

<b> Evaluation: </b>

Submissions are evaluated by root mean squared error (RMSE). True target values are clipped into [0,20] range.
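
As a rough illustration of the metric, the sketch below computes RMSE after clipping into the [0,20] range. The function name `clipped_rmse` and the toy values are hypothetical, and clipping the predictions as well as the targets is an assumption (the description only states that the true targets are clipped).

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0.0, upper=20.0):
    """RMSE after clipping both targets and predictions into [lower, upper]."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lower, upper)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lower, upper)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy values, not competition data
example_score = clipped_rmse([0, 3, 25], [0.5, 3.0, 40.0])
```

Here the third pair is clipped to 20 on both sides, so only the first pair contributes to the error.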

Submission File

For each id in the test set, you must predict a total number of sales. The file should contain a header and have the following format:

- ID,item_cnt_month
- 0,0.5
- 1,0.5
- 2,0.5
- 3,0.5
etc.


<b> File Descriptions </b>
- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

<b> Data Fields </b>
- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd/mm/yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category
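
The `date_block_num` definition above can be reproduced from the `date` field. The sketch below is only an illustration: the helper name `month_block` is hypothetical, and the `fmt` default follows the dd/mm/yyyy description given here, so it would need to be adjusted if the file uses a different separator.

```python
import pandas as pd

def month_block(dates, base_year=2013, fmt="%d/%m/%Y"):
    """Map date strings to consecutive month numbers, with January of base_year as 0."""
    parsed = pd.to_datetime(pd.Series(dates), format=fmt)
    return (parsed.dt.year - base_year) * 12 + (parsed.dt.month - 1)

# toy dates covering the first, second and last month of the train period
blocks = month_block(["02/01/2013", "15/02/2013", "31/10/2015"])
```

With these toy dates the result is 0, 1 and 33, matching the mapping described for January 2013, February 2013 and October 2015.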

## _Import Libraries_

In [None]:
#data manipulation
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk
%matplotlib inline

#consistent sized plots
from pylab import rcParams
rcParams['figure.figsize'] = 12,5
rcParams['xtick.labelsize'] = 12
rcParams['ytick.labelsize'] = 12
rcParams['axes.titlesize'] = 14

#handle warnings
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)

#view all the columns in the dataframe
pd.options.display.max_columns = None

## _Load Data_

In [None]:
#load train and test data
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv',
                         delimiter=',',engine='python',parse_dates=True)
test = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv',delimiter=',',engine='python')

In [None]:
#load the other files
item_categories = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv',
                         delimiter=',',engine='python')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv',delimiter=',',engine='python')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv',delimiter=',',engine='python')

## _Understanding the raw data_

In [None]:
#view the top rows
sales_train.head()

In [None]:
#basic info
sales_train.info()

Comments

- Except for the date column, all the remaining features are numeric
- There are close to 3 million observations in the train dataset


In [None]:
#check basic stats of the numeric columns (exclude the shop and the item id)
sales_train[['item_price','item_cnt_day']].describe()

In [None]:
#check for null values
sales_train.isnull().sum()

In [None]:
#check for any duplicates in the data
len(sales_train[sales_train.duplicated()==True])

In [None]:
#check the duplicate observations
sales_train[sales_train.duplicated()==True]

In [None]:
#drop the duplicates from the dataset
sales_train.drop_duplicates(keep='first',inplace=True)

In [None]:
#view the test dataset
test.head()

- The test dataset contains only the ID, shop_id and item_id and there is no information on price and other columns as seen in the train dataset

In [None]:
#check the remaining files
shops.head()

- The shop names are in Russian as expected because the data is provided by a Russian company

In [None]:
#check for duplicate shops
shops[shops.duplicated()==True]

In [None]:
item_categories.head()

In [None]:
#check for duplicates
item_categories[item_categories.duplicated()==True]

In [None]:
#check the items
items[items.duplicated()==True]

In [None]:
items.head()

In [None]:
#number of unique shops
print(shops['shop_name'].nunique())

In [None]:
#number of unique item categories
print(item_categories['item_category_id'].nunique())

In [None]:
#number of unique items
print(items['item_id'].nunique())

In [None]:
sales_train.columns

In [None]:
items.columns

In [None]:
item_categories.columns

In [None]:
shops.columns

In [None]:
sales_train['shop_id'].nunique()

In [None]:
sales_train['item_id'].nunique()

In [None]:
sales_train.head(2)

In [None]:
#display the last few rows in the train set
sales_train.tail(3)

- There are 34 months of data in the train set (date_block_num runs from 0 to 33)

In [None]:
#number of unique shop id in the train set
sales_train['shop_id'].nunique()

In [None]:
#occurrences of various shop id's
sales_train['shop_id'].value_counts().sort_values(ascending=False)

In [None]:
sales_train['item_id'].value_counts().sort_values(ascending=False)

## _Exploratory Data Analysis_

In [None]:
sales_train.head(2)

In [None]:
#histogram of the item price
sns.histplot(sales_train['item_price'],bins=50,kde=True,stat='density')
plt.title('Histogram of the Item Price')
plt.show()

In [None]:
#histogram of the item price on a log-scaled x axis
sns.histplot(sales_train['item_price'],bins=50,kde=True,stat='density')
plt.xscale('log')
plt.title('Histogram of the Item Price (log scale)')
plt.show()

In [None]:
sns.boxplot(sales_train['item_price'],orient='h')
plt.title('Box Plot of the Item Price')
plt.grid()
plt.show()

In [None]:
sns.boxplot(sales_train['item_price'],orient='h')
plt.xscale('log')
plt.title('Box Plot of the Item Price (log scale)')
plt.show()

In [None]:
#median value of the item price
print(f'Median value of the item price is {np.median(sales_train.item_price)}')

In [None]:
#calculate interquartile range 
q3, q1 = np.percentile(sales_train['item_price'], [75 ,25])
print(f'First Quartile of the Item Price {q1}')
print(f'Third Quartile of the Item Price {q3}')

IQR = q3 - q1

#display interquartile range 
print(f'Interquartile Range of the Item Price {IQR}')

In [None]:
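#restrict to mid-range prices (249 <= price < 1000) for a closer look at the bulk of the distribution
df_1 = sales_train[(sales_train['item_price']<1000.0) & (sales_train['item_price']>=249.0)]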

In [None]:
#histogram of the item price for the mid-range subset df_1
sns.histplot(df_1['item_price'],bins=50,kde=True,stat='density')
plt.title('Histogram of the Item Price')
plt.show()

In [None]:
sns.boxplot(df_1['item_price'],orient='h')
plt.title('Box Plot of the Item Price')

plt.show()

In [None]:
#check the distribution of item_cnt_day for df_1
sns.kdeplot(df_1['item_cnt_day'])
plt.title('KDE of the Item Count per Day (df_1)')
plt.show()

In [None]:
#which items have an item price above 300,000
sales_train[sales_train['item_price']>300000]

In [None]:
np.min(sales_train['item_price'])

In [None]:
np.max(sales_train['item_price'])

In [None]:
#check the rows where the item_price is negative
sales_train[sales_train['item_price']<0]

- A negative item price is a data error, so it is better to remove this single row.

In [None]:
sales_train['item_cnt_day'].describe()

In [None]:
np.min(sales_train['item_cnt_day'])

In [None]:
sales_train[sales_train['item_cnt_day']<0]['shop_id']

- There are 7,356 rows where the item count per day is negative (-1.0 or lower). A negative sales count is unusual; it may represent returned items, although the data description does not say so. There are two ways to handle these rows: replace the negative counts with 0, or remove the rows.
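
The two options can be compared on a toy frame. This is a minimal sketch with made-up values; the names `toy`, `floored` and `dropped` are hypothetical and not part of the notebook.

```python
import pandas as pd

# toy data with one negative daily count
toy = pd.DataFrame({"item_id": [1, 2, 3, 4],
                    "item_cnt_day": [2.0, -1.0, 0.0, 5.0]})

# Option 1: keep every row, but floor negative counts at 0
floored = toy.assign(item_cnt_day=toy["item_cnt_day"].clip(lower=0))

# Option 2: drop the rows with negative counts entirely
dropped = toy[toy["item_cnt_day"] >= 0]
```

Flooring keeps the row count unchanged, which matters if other columns of those rows are still useful; dropping shrinks the frame.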

In [None]:
#top shops where the item cnt is in negative
sales_train[sales_train['item_cnt_day']<0]['shop_id'].value_counts().sort_values(ascending=False).to_frame()[:10]

In [None]:
#replace the negative item cnt with 0 (assign back rather than mutating the column in place)
sales_train['item_cnt_day'] = sales_train['item_cnt_day'].clip(lower=0.0)

In [None]:
#check the stats after replacement , we should not see any negative values in item sold per day
sales_train['item_cnt_day'].describe()

In [None]:
#delete the row where the item price is negative
sales_train.drop(sales_train[sales_train['item_price']<0].index,inplace=True)

In [None]:
#the negative item price row should be gone now
sales_train[sales_train['item_price']<0]

In [None]:
#view top rows
sales_train.head(3)

In [None]:
#view the test data
test.head(3)

- The test dataset is for the subsequent month after the train dataset. The description of the dataset says "test.csv - the test set. You need to forecast the sales for these shops and products for November 2015."

In [None]:
sales_train.tail()

- The last dates in the train set fall in October 2015, which is date_block_num 33. Hence the date_block_num for the test set is 34

### _Outlier Detection & Removal_

In [None]:
#detect the outliers using the IQR rule
Q1 = sales_train['item_price'].quantile(0.25, interpolation='midpoint')
Q3 = sales_train['item_price'].quantile(0.75, interpolation='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", sales_train.shape)

#boolean mask of the prices strictly inside the lower and upper bounds
#(np.where returns positions, which no longer match the index labels after the earlier drops)
inside = (sales_train['item_price'] > (Q1-1.5*IQR)) & (sales_train['item_price'] < (Q3+1.5*IQR))

#remove the outliers
sales_train = sales_train[inside]

In [None]:
print("New Shape: ", sales_train.shape)
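
The IQR rule used in this section can be tried on a toy series whose index has gaps, as a frame's index does after rows have been dropped. This is a minimal sketch with made-up prices; it filters with a boolean mask, which follows the index labels, and the variable names are hypothetical.

```python
import pandas as pd

# toy prices with a non-contiguous index and one extreme value
prices = pd.Series([100.0, 120.0, 130.0, 150.0, 5000.0],
                   index=[10, 11, 13, 14, 17])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# True where the price lies within 1.5 * IQR of the quartiles
within = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
kept = prices[within]
```

Only the extreme value is removed, and the surviving rows keep their original labels.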

### _Data Preparation_

In [None]:
#create a new column date_block_num in the test set
test['date_block_num'] = 34

In [None]:
test.head(2)

In [None]:
data_concat = pd.concat([sales_train,test])

In [None]:
data_concat.head()

In [None]:
data_concat.info()

In [None]:
#check for the null values in the concatenated dataset
data_concat.isnull().sum()

In [None]:
len(test)

- The number of null values in the date, item_price and item_cnt_day columns equals the length of the test dataset, because the test rows do not have these columns. The date and the ID columns can be dropped

In [None]:
#drop the ID and the date column from the concatenated dataset
data_concat.drop(['ID','date'],axis=1,inplace=True)

In [None]:
data_concat.head()

In [None]:
#group the data by date block, shop id and item id, and sum the daily counts into a monthly total
#min_count=1 keeps the test rows (which have no counts) as NaN instead of 0
data = data_concat.groupby(by=['date_block_num','shop_id','item_id'],as_index=False)['item_cnt_day'].sum(min_count=1)
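
To make the aggregation step concrete, here is a small sketch on made-up daily rows. The frame `daily` and the renamed column `item_cnt_month` are hypothetical; the notebook itself keeps the name `item_cnt_day` for the monthly total.

```python
import pandas as pd

# toy daily sales: item 7 sells twice in month 0 at shop 5
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id":        [5, 5, 5, 5],
    "item_id":        [7, 7, 8, 7],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0],
})

# one row per (month, shop, item) holding the summed daily counts
monthly = (daily
           .groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
           .sum()
           .rename(columns={"item_cnt_day": "item_cnt_month"}))
```

The result is sorted by the grouping keys, so rows come out ordered by month, then shop, then item.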

In [None]:
#view the data
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
#test dataset would be the rows where the date block num is 34
data[data['date_block_num']==34].tail()

## _Feature Engineering_

### _Add the lag features_

In [None]:
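#add lag features based on the shop id and item id
#note: shift() moves values by rows within each group in the current row order
#(date_block_num, shop_id, item_id), not by calendar month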
data['shop_lag_1'] = data.groupby('shop_id')['item_cnt_day'].shift(1)
data['shop_lag_2'] = data.groupby('shop_id')['item_cnt_day'].shift(2)

data['item_lag_1'] = data.groupby('item_id')['item_cnt_day'].shift(1)
data['item_lag_2'] = data.groupby('item_id')['item_cnt_day'].shift(2)
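
If a lag by calendar month is wanted rather than a lag by row, one possible approach is to total the sales per month and key, move those totals forward by the lag, and merge them back. The sketch below assumes that approach; `add_month_lag` and the generated column name are hypothetical, and the toy data is made up.

```python
import pandas as pd

def add_month_lag(df, key, lag, target="item_cnt_day"):
    """Attach the key's total sales from `lag` months earlier as a new column."""
    totals = df.groupby(["date_block_num", key], as_index=False)[target].sum()
    # totals of month m become visible to rows of month m + lag
    totals["date_block_num"] += lag
    totals = totals.rename(columns={target: f"{key}_total_lag_{lag}"})
    return df.merge(totals, on=["date_block_num", key], how="left")

toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1, 2],
    "shop_id":        [1, 1, 1, 1, 1],
    "item_id":        [10, 11, 10, 11, 10],
    "item_cnt_day":   [3.0, 1.0, 2.0, 5.0, 4.0],
})
lagged = add_month_lag(toy, key="shop_id", lag=1)
```

Rows in month 0 have no earlier month, so their lag is missing; rows in month 1 see the shop total of month 0, and so on.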

In [None]:
#check various counts of the item id's for a random shop id 
data[data['shop_id']==2]['item_id'].value_counts().sort_values(ascending=False)[:20]

In [None]:
#mean and median features based on shop and items
#transform returns one value per row, so the result aligns with the index of data
data['shop_median'] = data.groupby('shop_id')['item_cnt_day'].transform('median')
data['shop_mean'] = data.groupby('shop_id')['item_cnt_day'].transform('mean')

data['item_median'] = data.groupby('item_id')['item_cnt_day'].transform('median')
data['item_mean'] = data.groupby('item_id')['item_cnt_day'].transform('mean')
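
When a group statistic is stored as a column, the shape of the result matters. The toy sketch below (hypothetical names, made-up values) contrasts an aggregated mean, which has one entry per shop, with `transform`, which returns one entry per row.

```python
import pandas as pd

toy = pd.DataFrame({"shop_id": [0, 0, 1, 1, 1],
                    "item_cnt_day": [1.0, 3.0, 2.0, 4.0, 6.0]})

# aggregated: one value per shop, indexed by shop_id
per_shop = toy.groupby("shop_id")["item_cnt_day"].mean()

# transformed: one value per row, indexed like toy
per_row = toy.groupby("shop_id")["item_cnt_day"].transform("mean")
toy["shop_mean"] = per_row
```

Assigning `per_shop` directly to a column would align on the shop_id labels rather than on the rows, which is why the per-row form is used for the new column.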

In [None]:
#view the data
data.head()

In [None]:
data.describe().transpose()

- The range and scale of the values of the created features vary quite a lot

In [None]:
#fill the na values with 0's
data.fillna(0.0,inplace=True)

In [None]:
data.isna().sum()

In [None]:
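#split into the train, validation and test dataset
test_data = data[data['date_block_num']==34]
data_new = data[data['date_block_num']!=34]

#rows are ordered by date_block_num, so the first 80% (earlier months) are used
#for training and the remaining 20% (later months) for validation
split_ratio = 0.80
split_idx = int(split_ratio*len(data_new))
train_data = data_new[:split_idx]
valid_data = data_new[split_idx:]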
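
An alternative to splitting by row ratio is to hold out whole months, so that no month is shared between the two parts. The sketch below assumes that variant; `split_by_month` and its parameters are hypothetical, and the toy frame is made up.

```python
import pandas as pd

def split_by_month(df, n_valid_months=1, month_col="date_block_num"):
    """Hold out the last n_valid_months months for validation."""
    last_month = df[month_col].max()
    is_valid = df[month_col] > last_month - n_valid_months
    return df[~is_valid], df[is_valid]

toy = pd.DataFrame({"date_block_num": [0, 0, 1, 1, 2, 2, 3],
                    "item_cnt_day": [1.0] * 7})
train_part, valid_part = split_by_month(toy, n_valid_months=1)
```

With one validation month, every training row comes strictly before every validation row in time, which mirrors how the test month follows the train period.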

In [None]:
train_data.shape, test_data.shape,valid_data.shape

In [None]:
X_train = train_data.drop('item_cnt_day',axis=1)
y_train = train_data['item_cnt_day']

X_valid = valid_data.drop('item_cnt_day',axis=1)
y_valid = valid_data['item_cnt_day']

X_test = test_data.drop('item_cnt_day',axis=1)
y_test = test_data['item_cnt_day']

## _Preparation of the Data for Modeling_

In [None]:
# libraries for preprocessing the data
from sklearn.preprocessing import (StandardScaler,
                                   MinMaxScaler,
                                   PowerTransformer,PolynomialFeatures)
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

In [None]:
#import the required models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
#evaluation metrics
from sklearn.metrics import mean_squared_error

In [None]:
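#prediction on the test set and prepare the submission file
def test_submission(data,model):
    '''prediction on the test data and generate the submission file'''
    #targets are clipped into [0,20] for scoring, so clip the predictions as well
    predictions = np.clip(model.predict(data),0,20)
    #test_data comes out of the groupby sorted by shop_id and item_id, so order the IDs the same way
    #(assumes each (shop_id, item_id) pair appears once in the test set)
    ids = test.sort_values(['shop_id','item_id'])['ID'].values
    submission = pd.DataFrame({'ID':ids,'item_cnt_month':predictions})
    submission = submission.sort_values('ID')
    submission.to_csv('submission.csv',index=False)
    return submission.head(3)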
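
Because the feature rows and the test file may be ordered differently, it can be safer to attach predictions to IDs by key instead of by position. The sketch below shows one way to do that with a merge; `build_submission`, its parameters and the toy frames are hypothetical, and clipping the predictions into [0,20] is an assumption based on the evaluation description.

```python
import numpy as np
import pandas as pd

def build_submission(test_df, features_df, predictions, lower=0.0, upper=20.0):
    """Attach predictions to IDs by (shop_id, item_id) rather than by row position."""
    scored = features_df[["shop_id", "item_id"]].copy()
    scored["item_cnt_month"] = np.clip(predictions, lower, upper)
    merged = test_df[["ID", "shop_id", "item_id"]].merge(
        scored, on=["shop_id", "item_id"], how="left")
    return merged[["ID", "item_cnt_month"]]

# toy test file in its original order
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 4, 5], "item_id": [9, 9, 1]})
# feature rows sorted by shop_id and item_id, as a groupby would return them
toy_features = toy_test.sort_values(["shop_id", "item_id"])[["shop_id", "item_id"]]
toy_submission = build_submission(toy_test, toy_features, predictions=[0.2, 25.0, -1.0])
```

The predictions are given in the sorted feature order, and the merge puts each one back on the matching ID.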

In [None]:
preprocess = Pipeline([('scaler',StandardScaler()),('poly_features',PolynomialFeatures(degree=2)),
                      ('decompose',PCA(n_components=0.90))])

X_train = preprocess.fit_transform(X_train)
X_valid = preprocess.transform(X_valid)
X_test = preprocess.transform(X_test)

In [None]:
#instantiate the models --> sticking with the default and simple ones
lr_reg = LinearRegression()
rf_reg = RandomForestRegressor(random_state=42,max_depth=5)
gb_reg = GradientBoostingRegressor(random_state=42)

In [None]:
#fit and obtain the predictions
def modeling(model,X_train=X_train,y_train=y_train,X_valid=X_valid,y_valid=y_valid,X_test=X_test,y_test=y_test):
    
    '''fit on the train data, print evaluation metrics and predict on the valid and test set'''
    model.fit(X_train,y_train)
    #obtain the predictions
    train_pred =  model.predict(X_train)
    valid_pred = model.predict(X_valid)
    test_pred = model.predict(X_test)
    #print the evaluation metrics
    print('Model Name {}'.format(model))
    print(f'RMSE on the Train Data = {np.sqrt(mean_squared_error(y_train,train_pred))}')
    print(f'RMSE on the Validation Split Data = {np.sqrt(mean_squared_error(y_valid,valid_pred))}')    

In [None]:
modeling(lr_reg)

In [None]:
modeling(rf_reg)

In [None]:
modeling(gb_reg)

In [None]:
test_submission(data=X_test,model=lr_reg)