## Index

[1. Importing packages](#1)<br>
[2. Read CSV files into DataFrame](#2)<br>
[3. Data Preprocessing](#3)<br>
[4. Regressions and Results](#4)<br>
    <ul>
        <li>[4.1. Separate the dataset into train and test](#41)</li>
        <li>[4.2. Running Machine Learning Models](#42)</li>
    </ul>
[5. Submission](#5)<br>
[6. References](#6)

You are provided with daily historical sales data. The task is to forecast the total number of items sold for every (shop, product) pair in the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.


<h3>File descriptions</h3>
    <ul>
        <li>sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.</li>
        <li>test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.</li>
        <li>sample_submission.csv - a sample submission file in the correct format.</li>
        <li>items.csv - supplemental information about the items/products.</li>
        <li>item_categories.csv - supplemental information about the item categories.</li>
        <li>shops.csv - supplemental information about the shops.</li>
    </ul>
            
<h3>Data fields</h3>
    <ul>
        <li>ID - an Id that represents a (Shop, Item) tuple within the test set</li>
        <li>shop_id - unique identifier of a shop</li>
        <li>item_id - unique identifier of a product</li>
        <li>item_category_id - unique identifier of item category</li>
        <li>item_cnt_day - number of products sold. You are predicting a monthly amount of this measure</li>
        <li>item_price - current price of an item</li>
        <li>date - date in format dd/mm/yyyy</li>
        <li>date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33</li>
        <li>item_name - name of item</li>
        <li>shop_name - name of shop</li>
        <li>item_category_name - name of item category</li>
    </ul>

This dataset is permitted to be used for any purpose, including commercial use.
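
The `date_block_num` field described above is just a month counter that starts at January 2013. A minimal sketch of that mapping follows; the helper name `month_index` and its `base_year` parameter are hypothetical and not part of the dataset or the notebook.

```python
def month_index(year: int, month: int, base_year: int = 2013) -> int:
    """Consecutive month number: January of base_year maps to 0."""
    return (year - base_year) * 12 + (month - 1)


print(month_index(2013, 1))   # January 2013, first month of the training data
print(month_index(2013, 2))   # February 2013
print(month_index(2015, 10))  # October 2015, last month of the training data
```

Under this numbering, the month to forecast (November 2015) would be the next index after the last training month.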

<a id='1'></a>
<div class="alert alert-block alert-danger">
<h2>1. Importing packages</h2>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import math, datetime

import numpy as np 
import pandas as pd
pd.set_option('display.max_columns', None)

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

<a id='2'></a>
## 2. Read CSV files into DataFrame

In [None]:
train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv")
test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv")
sample_submission = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv")
items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv")
item_categories = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv")
shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv")

In [None]:
train.tail(3)

In [None]:
test.head(3)

In [None]:
print('train shape',train.shape)
print('test shape',test.shape)
print('duplicated rows in train',train.duplicated().sum())
print('train columns with missing values',train.isnull().any().sum())
print('test columns with missing values',test.isnull().any().sum())

<a id='3'></a>
<div class="alert alert-block alert-danger">
   <h2>
    3. Data Preprocessing
    </h2>
</div>

In [None]:
# Visualize missing values in the training set
plt.figure(figsize=(10,5))
sns.heatmap(data=train.isnull(),cmap="viridis")
plt.show()

In [None]:
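# Convert date values to datetime; the dates are day-first, so say so explicitly
# (otherwise ambiguous values such as 02/01/2013 can be read as February 1st)
train['date'] = pd.to_datetime(train['date'], dayfirst=True)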

In [None]:
# Derive a 'YYYY-MM' string column that identifies the month of each sale
train['year_month'] = train['date'].dt.strftime('%Y-%m')
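
Day-first dates are easy to misread as month-first. The sketch below uses two made-up date strings in the dd/mm/yyyy layout listed under the data fields and shows how the derived `year_month` key changes with the parsing convention. The explicit `format` strings are an assumption; their separator has to match the one actually used in the file.

```python
import pandas as pd

# Toy values for illustration only: 2 and 3 January 2013 in dd/mm/yyyy layout
raw = pd.Series(["02/01/2013", "03/01/2013"])

day_first = pd.to_datetime(raw, format="%d/%m/%Y")    # intended reading
month_first = pd.to_datetime(raw, format="%m/%d/%Y")  # wrong reading, for contrast

print(day_first.dt.strftime("%Y-%m").tolist())    # both rows fall in 2013-01
print(month_first.dt.strftime("%Y-%m").tolist())  # rows land in 2013-02 and 2013-03
```

A silent month-first parse would move sales into the wrong monthly bucket, which is why the parsing convention is worth stating explicitly.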

In [None]:
# Drop unnecessary features
train = train.drop(['date','item_price'], axis=1)
train.head(3)

In [None]:
# Aggregate daily sales into monthly totals per (shop, item);
# this monthly total is the quantity to predict
train_group = train.groupby(['year_month', 'shop_id', 'item_id'], as_index=False)['item_cnt_day'].sum()
train_group.head()

In [None]:
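# Reshape to one row per (shop, item) and one column per month;
# months without sales for a pair are filled with 0
df = train_group.pivot_table(index=['shop_id','item_id'], columns='year_month',
                             values='item_cnt_day', fill_value=0)
df = df.reset_index()
df.head()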
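
The two preceding steps (monthly aggregation, then pivoting to a wide table) can be illustrated on a tiny made-up frame. The values below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy daily sales: shop 0 sells item 10 in both months, shop 1 sells item 20 only in February
toy = pd.DataFrame({
    "year_month":   ["2013-01", "2013-01", "2013-02", "2013-02"],
    "shop_id":      [0, 0, 0, 1],
    "item_id":      [10, 10, 10, 20],
    "item_cnt_day": [1, 2, 4, 3],
})

# Step 1: sum daily counts into one row per (month, shop, item)
monthly = toy.groupby(["year_month", "shop_id", "item_id"], as_index=False)["item_cnt_day"].sum()

# Step 2: one row per (shop, item), one column per month, 0 where nothing was sold
wide = monthly.pivot_table(index=["shop_id", "item_id"], columns="year_month",
                           values="item_cnt_day", fill_value=0).reset_index()
print(wide)
```

Each row of `wide` is the sales history of one (shop, item) pair, which is the shape the models further down are trained on.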

In [None]:
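# Attach each test pair's sales history; pairs never seen in train get NaN here
df_test = pd.merge(test, df, on=['shop_id','item_id'], how='left')
# Drop the oldest month so the test features span 2013-02..2015-10,
# the same number of months the models are trained on
df_test = df_test.drop(['ID', '2013-01'], axis=1)
# An unseen (shop, item) pair is treated as having sold nothing so far
df_test = df_test.fillna(0)
df_test.head()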
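
Because the list of shops and products changes every month, some test pairs have no history at all. The sketch below, on made-up frames, shows how a left merge keeps such pairs and how pandas' `indicator=True` option can count them before they are filled with zeros. The frame names `history` and `queries` are hypothetical.

```python
import pandas as pd

# Toy wide history with a single known (shop, item) pair
history = pd.DataFrame({"shop_id": [0], "item_id": [10], "2013-01": [3], "2013-02": [4]})
# Toy test set: the first pair is known, the second never appears in the history
queries = pd.DataFrame({"ID": [0, 1], "shop_id": [0, 5], "item_id": [10, 99]})

merged = queries.merge(history, on=["shop_id", "item_id"], how="left", indicator=True)
unseen = int((merged["_merge"] == "left_only").sum())

# Drop the helper columns first, then give unseen pairs an all-zero history
features = merged.drop(columns=["ID", "_merge"]).fillna(0)
print("pairs without history:", unseen)
print(features)
```

Counting the unseen pairs is optional, but it gives a quick sense of how much of the test set the model has to predict from an empty history.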

<a id='4'></a>
<div class="alert alert-block alert-danger">
   <h2>
    4. Regressions and Results
    </h2>
</div>

<a id='41'></a>
<div class="alert alert-block alert-info">
   <h3>
        4.1 Separate the dataset into train and test
   </h3>
</div>

In [None]:
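# Features: shop_id, item_id and every month except the last one
X = df[df.columns[:-1]]
# Target: sales in the last available month
y = df[df.columns[-1]]
print(X.shape)
print(y.shape)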
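
The feature/target split above and the earlier removal of the first month from `df_test` together form a sliding window: the model is fitted on all months but the last to predict the last one, and the window is then moved forward by one month to predict the unseen month. A toy sketch with three invented months:

```python
import pandas as pd

# Toy wide table; values are made up
wide = pd.DataFrame({
    "shop_id": [0, 1], "item_id": [10, 20],
    "2015-08": [1, 0], "2015-09": [2, 5], "2015-10": [3, 4],
})

# Fitting window: ids + 2015-08, 2015-09 as features, 2015-10 as target
X_fit = wide[wide.columns[:-1]]
y_fit = wide[wide.columns[-1]]

# Prediction window: drop the oldest month, so ids + 2015-09, 2015-10 feed the forecast for 2015-11
X_next = wide.drop(columns=["2015-08"])

print(list(X_fit.columns))
print(list(X_next.columns))
```

Both windows have the same number of columns, but the month labels differ by one position, so the two frames do not share column names.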

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size=0.2, random_state=15)
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

<a id='42'></a>
<div class="alert alert-block alert-info">
   <h3>
        4.2 Running Machine Learning Models
   </h3>
</div>

In [None]:
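def evaluate_model(model):
    """Print and return the train and test RMSE of a fitted model."""
    # RMSE is the square root of the mean squared error
    RMSE_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    RMSE_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

    print('Train set RMSE:', RMSE_train)
    print('Test set RMSE:', RMSE_test)
    # score() is R^2 for regressors and accuracy for classifiers
    print('Test set score:', model.score(X_test, y_test))

    return RMSE_train, RMSE_test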
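
The difference between MSE and RMSE matters when comparing against a leaderboard that reports RMSE. A small worked example on made-up numbers, using plain NumPy so the arithmetic is visible:

```python
import numpy as np

y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([0.0, 0.0, 0.0])

# Squared errors are 0, 4 and 16, so the mean squared error is 20 / 3
mse = np.mean((y_true - y_pred) ** 2)
# RMSE is the square root of the MSE and is expressed in the same units as the target
rmse = np.sqrt(mse)

print(mse, rmse)
```

Since the square root is monotonic, ranking models by MSE or by RMSE gives the same order; only the reported magnitude changes.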

<div class="alert alert-block alert-success">
    <h4>
        4.2.1 Logistic Regression
    </h4>
</div>

In [None]:
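from sklearn.linear_model import LogisticRegression
# Note: LogisticRegression is a classifier, so every distinct monthly count is treated as a class label
log = LogisticRegression().fit(X_train, y_train)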

In [None]:
RMSE_train_log, RMSE_test_log = evaluate_model(log)

<div class="alert alert-block alert-success">
    <h4>
        4.2.2 Random Forest
    </h4>
</div>

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 100)
rf.fit(X_train,y_train)

In [None]:
RMSE_train_rf, RMSE_test_rf = evaluate_model(rf)

<div class="alert alert-block alert-success">
    <h4>
        4.2.3 Stochastic Gradient Descent
    </h4>
</div>

In [None]:
from sklearn.linear_model import SGDClassifier
# Note: SGDClassifier is a classifier as well; its score() reports accuracy, not R^2
sgdc = SGDClassifier().fit(X_train, y_train)

In [None]:
RMSE_train_sgdc, RMSE_test_sgdc = evaluate_model(sgdc)
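
`LogisticRegression` and `SGDClassifier` are classifiers, so they predict one of the count values seen in training rather than a continuous quantity. As a hedged alternative, not the method used in this notebook, their regression counterparts `LinearRegression` and `SGDRegressor` could be tried. The sketch below runs on synthetic data; the `StandardScaler` pipeline is an assumption added because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the monthly count table (values are made up)
rng = np.random.default_rng(0)
X_toy = rng.poisson(2.0, size=(200, 5)).astype(float)
y_toy = 0.5 * X_toy[:, -1] + rng.normal(0.0, 0.1, size=200)

lin = LinearRegression().fit(X_toy, y_toy)
# Scale features before SGD; the scaler is refit only on the data passed to fit()
sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0)).fit(X_toy, y_toy)

# For regressors, score() returns R^2 rather than classification accuracy
print("LinearRegression R^2:", lin.score(X_toy, y_toy))
print("SGDRegressor R^2:", sgd.score(X_toy, y_toy))
```

Either estimator exposes the same `fit`, `predict` and `score` methods, so it could be passed to `evaluate_model` unchanged.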

<div class="alert alert-block alert-success">
    <h4>
        4.2.4 Final Model Selection
    </h4>
</div>

In [None]:
# Collect the scores of the three models; the lowest test error comes first
df_models_acc = pd.DataFrame({
    'Model': ['log', 'rf', 'sgd'],
    'RMSE_Train': [RMSE_train_log, RMSE_train_rf, RMSE_train_sgdc],
    'RMSE_Test': [RMSE_test_log, RMSE_test_rf, RMSE_test_sgdc],
})
df_models_acc.sort_values(by='RMSE_Test')

RandomForestRegressor achieves the lowest test error of the three models, so it is used for the submission.

<a id='5'></a>
<div class="alert alert-block alert-danger">
    <h2>
        5. Submission
    </h2>
</div>

In [None]:
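submission = sample_submission.drop('item_cnt_month', axis=1)

# RandomForestRegressor
# df_test holds the months 2013-02..2015-10, one month later than the training
# features, so relabel its columns to the names the model was fitted with
# (recent scikit-learn versions reject mismatched feature names)
prediction = rf.predict(df_test.set_axis(X.columns, axis=1))
# Round the predictions to whole item counts
prediction = list(map(round, prediction))
submission['item_cnt_month'] = prediction
submission.head()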

In [None]:
# Does the submission have the same number of rows as the sample submission?
if len(submission) == len(sample_submission):
    print("Submission dataframe is the same length as the sample submission ({} rows).".format(len(submission)))
else:
    print("Dataframes mismatched, won't be able to submit to Kaggle.")

In [None]:
# Save the submission dataframe as a CSV file for the Kaggle submission
submission.to_csv('Predict_Future_Sales_Submission.csv', index=False)
print('Submission CSV is ready!')

In [None]:
# Check the submission csv to make sure it's in the right format
submissions_check = pd.read_csv("Predict_Future_Sales_Submission.csv")
submissions_check.head()

<a id='6'></a>
## 6. References

https://www.kaggle.com/yasserhessein/predict-future-sales-using-4-algorithms-regression<br>