
# <b><u> Project Title : Sales Prediction : Predicting sales of a major store chain Rossmann</u></b>

## <b> Problem Description </b>

### Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

### You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

## <b> Data Description </b>

### <b>Rossmann Stores Data.csv </b> - historical data including Sales
### <b>store.csv </b> - supplemental information about the stores


### <b><u>Data fields</u></b>
### Most of the fields are self-explanatory. The following are descriptions for those that aren't.

* #### Id - an Id that represents a (Store, Date) pair within the test set
* #### Store - a unique Id for each store
* #### Sales - the turnover for any given day (this is what you are predicting)
* #### Customers - the number of customers on a given day
* #### Open - an indicator for whether the store was open: 0 = closed, 1 = open
* #### StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* #### SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
* #### StoreType - differentiates between 4 different store models: a, b, c, d
* #### Assortment - describes an assortment level: a = basic, b = extra, c = extended
* #### CompetitionDistance - distance in meters to the nearest competitor store
* #### CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* #### Promo - indicates whether a store is running a promo on that day
* #### Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* #### Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* #### PromoInterval - describes the consecutive intervals at which Promo2 restarts, naming the months in which the promotion starts anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
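
To make the PromoInterval encoding concrete, the sketch below checks whether a given date falls in a month that starts a new Promo2 round. The helper name `is_promo2_start_month` and the three-letter month abbreviations are assumptions for illustration only; the spellings in the actual file should be checked before relying on them (September, for instance, may not be written as `Sep`).

```python
import datetime

# Assumed month abbreviations, in calendar order; verify against the real data.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def is_promo2_start_month(promo_interval, date):
    """Return True if `date` falls in a month listed in `promo_interval`.

    Handles only None or empty strings as "no Promo2"; missing values
    read by pandas (NaN) would need to be converted first.
    """
    if not promo_interval:
        return False
    months = [m.strip() for m in promo_interval.split(",")]
    return MONTH_ABBR[date.month - 1] in months

print(is_promo2_start_month("Feb,May,Aug,Nov", datetime.date(2015, 5, 10)))
print(is_promo2_start_month("Feb,May,Aug,Nov", datetime.date(2015, 6, 10)))
```

A flag like this is one possible way to turn the text field into a per-row feature once the sales and store tables are combined.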

In [None]:
# installing the required libraries
%pip install inflection
%pip install Boruta

Collecting inflection
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Installing collected packages: inflection
Successfully installed inflection-0.5.1
Collecting Boruta
  Downloading Boruta-0.3-py3-none-any.whl (56 kB)
Installing collected packages: Boruta
Successfully installed Boruta-0.3


## **Importing Libraries**

In [None]:
# importing the libraries required in the project
import math
import json
import pylab
import pickle
import random
import requests
import datetime
import warnings
import inflection # used when renaming columns in subsection 1.1

import numpy               as np
import pandas              as pd
import xgboost             as xgb
import seaborn             as sns
import matplotlib.pyplot   as plt
import matplotlib.gridspec as gs

from scipy                 import stats as ss
from boruta                import BorutaPy
from tabulate              import tabulate

from IPython.display       import Image
from IPython.core.display  import HTML


from sklearn.metrics       import mean_absolute_error, mean_squared_error
from sklearn.ensemble      import RandomForestRegressor
from sklearn.linear_model  import LinearRegression, Lasso
from sklearn.preprocessing import RobustScaler, MinMaxScaler, LabelEncoder

warnings.filterwarnings( 'ignore' )

### **0.1 Helper Functions**
The functions below are used throughout the project.

In [None]:
# helper functions used throughout the project
# time series cross-validation function, used in step 7 to find the best ML model
def cross_validation( x_training, kfold, model_name, model, verbose = False ):
    # creating empty lists to store the error results
    mae_list = []
    mape_list = []
    rmse_list = []
   
    # iterating over a range for the k fold
    for k in reversed (range( 1, kfold+1 ) ):
        if verbose:
            print( f'\nKFold Number: {k}' )
        # getting the start and the end dates for validation
        validation_start_date = x_training['date'].max() - datetime.timedelta( days = k*6*7 )
        validation_end_date = x_training['date'].max() - datetime.timedelta( days = (k-1)*6*7 )
        
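        # filtering dataset: train on everything before the window, validate on the window only
        training = x_training[x_training['date'] < validation_start_date]
        validation = x_training[( x_training['date'] >= validation_start_date ) & ( x_training['date'] <= validation_end_date )]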
        
        # creating the training and validation datasets
        # training
        xtraining = training.drop( ['date', 'sales'], axis = 1 )
        ytraining = training['sales']
        
        # validation
        xvalidation = validation.drop( ['date', 'sales'], axis = 1 )
        yvalidation = validation['sales']
        
        # setting the model
        m = model.fit( xtraining, ytraining) # it uses the model passed in the function
        
        # prediction
        yhat = m.predict( xvalidation )
        
        # performance
        m_result = ml_error( model_name, np.expm1( yvalidation ), np.expm1( yhat ))
        
        # storing performance of each kfold iteration
        mae_list.append( m_result['MAE'] )
        mape_list.append( m_result['MAPE'] )
        rmse_list.append( m_result['RMSE'] )
       
    return pd.DataFrame( {'Model Name': model_name,
                          'MAE CV': np.round( np.mean( mae_list ), 2 ).astype( str ) + ' +/- ' + np.round( np.std( mae_list ), 2 ).astype( str ),
                          'MAPE CV': np.round( np.mean( mape_list ), 2 ).astype( str ) + ' +/- ' + np.round( np.std( mape_list ), 2 ).astype( str ),
                          'RMSE CV': np.round( np.mean( rmse_list ), 2 ).astype( str ) + ' +/- ' + np.round( np.std( rmse_list ), 2 ).astype( str ) }, index=[0] )


# function to compute a correlation coefficient (Cramér's V) between categorical variables. It will be used in section 4: EDA.
def cramer_v( x, y ):
    cm = pd.crosstab( x, y ).to_numpy()
    n = cm.sum()
    r, k = cm.shape
    
    chi2 = ss.chi2_contingency( cm )[0]
    # the bias correction is applied to phi^2 = chi2/n, not to chi2 itself
    phi2corr = max( 0, chi2/n - (k-1)*(r-1)/(n-1) )
    
    # correcting Cramér's V bias
    kcorr = k - (k-1)**2/(n-1)
    rcorr = r - (r-1)**2/(n-1)
    
    return np.sqrt( phi2corr / min( kcorr-1, rcorr-1 ) )

# creating the mean percentage error (signed, so positive and negative errors can cancel)
def mean_percentage_error( y, y_hat ):
    return np.mean( ( y - y_hat ) / y )

# creating the mean absolute percentage error
def mean_absolute_percentage_error( y, y_hat ):
    return np.mean( np.abs( ( y - y_hat ) / y ) )

# creating a function to calculate the model error
def ml_error( model_name, y, y_hat ):
    mae = mean_absolute_error( y, y_hat )
    mape = mean_absolute_percentage_error( y, y_hat )
    rmse = np.sqrt( mean_squared_error( y, y_hat ) )
    
    
    return pd.DataFrame( { 'Model Name': model_name,
                         'MAE': mae,
                         'MAPE': mape,
                         'RMSE': rmse }, index = [0])

# setting notebook display defaults
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()

jupyter_settings()

Populating the interactive namespace from numpy and matplotlib
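
The fold boundaries used by `cross_validation` can be inspected on their own. The sketch below reproduces only the date arithmetic, assuming each fold validates on a six-week window counted back from the last available date and trains on everything before that window. The name `fold_windows` and the example end date are hypothetical.

```python
import datetime

def fold_windows(last_date, kfold, weeks=6):
    """Return (fold, validation_start, validation_end) tuples, oldest fold first."""
    windows = []
    for k in reversed(range(1, kfold + 1)):
        start = last_date - datetime.timedelta(days=k * weeks * 7)
        end = last_date - datetime.timedelta(days=(k - 1) * weeks * 7)
        windows.append((k, start, end))
    return windows

# Hypothetical last training date, chosen only for illustration.
for k, start, end in fold_windows(datetime.date(2015, 6, 19), kfold=3):
    print(f"fold {k}: train on dates < {start}, validate on {start} .. {end}")
```

Each later fold trains on more history, which mimics how a forecasting model would be refit as new data arrives. If both ends of the window are treated as inclusive, consecutive windows share their boundary date.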


In [None]:
# display format (float type)
pd.set_option( 'display.float_format', lambda x: '%.2f' % x )

# setting plot parameters as default
plt.rcParams[ 'figure.figsize' ] = [25, 8]
plt.rcParams[ 'font.size' ] = 24
sns.set_style( "white" )
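
For reference, the bias-corrected Cramér's V that the `cramer_v` helper is meant to compute is usually written as below, where $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table and $n$ is the total number of observations. This is the common textbook form, given here as an aid for checking the helper rather than as a description of any particular library.

```latex
\varphi^2 = \frac{\chi^2}{n}, \qquad
\tilde{\varphi}^2 = \max\!\left(0,\; \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right)

\tilde{k} = k - \frac{(k-1)^2}{n-1}, \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}

\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\; \tilde{r}-1)}}
```

The result lies between 0 (no association) and 1 (perfect association); the correction terms shrink toward zero as $n$ grows.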

### **0.2. Loading Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Importing the dataset
df_Rstore = pd.read_csv("/content/drive/MyDrive/Datasets/Rossmann Stores Data.csv")
df_store = pd.read_csv("/content/drive/MyDrive/Datasets/store.csv")

In [None]:
# Merging both datasets on the 'Store' column, which is present in both
df_raw = pd.merge(df_Rstore,df_store , on = 'Store', how='left' )
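
A left merge keeps exactly one output row per sales record only if `Store` is unique in the store table. The sketch below uses two tiny made-up frames (not the project data) to show how the `validate` and `indicator` arguments of `pd.merge` can confirm that assumption and flag sales rows whose store has no metadata.

```python
import pandas as pd

# Tiny made-up frames standing in for the sales and store tables.
sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 6064, 8314, 0]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

merged = pd.merge(
    sales, stores, on="Store", how="left",
    validate="many_to_one",   # raises MergeError if Store repeats in `stores`
    indicator=True,           # adds a `_merge` column: 'both' or 'left_only'
)

# With a unique right-hand key, a left merge preserves the left row count.
print(len(merged) == len(sales))
# Sales rows whose store has no entry in the store table.
print(merged.loc[merged["_merge"] == "left_only", "Store"].tolist())
```

The same two checks could be run once against `df_raw` before any cleaning, so that duplicated or missing store records surface early.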