<a href="https://colab.research.google.com/github/vishalrvs/Rossmann-Store-Time-Series-Analysis/blob/main/Final_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Sales Forecasting - Time Series



##### **Project Type**    - EDA/Regression/Time Series
##### **Contribution**    - Individual
##### **Team Member 1 -** Vishal R Shrivastav

# **Project Summary -**

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **GitHub Link -**

https://github.com/vishalrvs/Rossmann-Store-Time-Series-Analysis

# **Problem Statement**


Get insights for business growth, reduce losses, find the keys to maximizing profit, and forecast sales for the next 6 weeks. The aim is to build a predictive model to forecast the sales.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each and every chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import calendar # For handling date
import pandas as pd #pandas library for working with dataset, data wrangling, and Visualization
import numpy as np # To work with array

# Matplotlib and Seaborn for visualisation and behaviour with respect to the target variable
import matplotlib.pyplot as plt
import seaborn as sbn

# For ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Splits a time series into three components: trend, seasonality, and residuals
from statsmodels.tsa.seasonal import seasonal_decompose

# Augmented Dickey-Fuller test: gives the p-value used in the stationarity hypothesis test
from statsmodels.tsa.stattools import adfuller

# matplotlib.dates provides locators and formatters for date ticks
import matplotlib.dates as mdates

# Non-seasonal ARIMA model used for forecasting
from statsmodels.tsa.arima.model import ARIMA
# SARIMAX model used for forecasting seasonal data
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Metric used to score the models
from sklearn.metrics import mean_squared_error

# PACF and ACF plots, used to choose the lag order (p) and moving-average order (q) for the ARIMA and SARIMAX models
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

### Dataset Loading

In [None]:
# Load Dataset

try:
    from google.colab import drive # Library to mount drive in colaboratory environment
    drive.mount('/content/drive/')
except Exception as e:
    print(e)
else:
    print('Drive Mounted Successfully')

store_sales = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Projects /Time Series Analysis/Rossmann Stores Data.csv",low_memory=False) #csv read and stored in DataFrame
stores_detail = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Projects /Time Series Analysis/store.csv",low_memory=False) #csv read and stored in DataFrame

### Dataset First View

In [None]:
# Dataset First Look
print("*"*10,"Stores Sales Data","*"*10)
print(store_sales.head()) # Displaying 5 rows
print("*"*10,"Stores Details","*"*10)
print(stores_detail.head()) # Displaying 5 rows

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
store_sales_shape = store_sales.shape
stores_detail_shape = stores_detail.shape
print("*"*10,"Shape of Sales Data","*"*10)
print("No. of rows:",store_sales_shape[0])
print("No. of columns:",store_sales_shape[1])

print("*"*10,"Shape of Stores Detail","*"*10)
print("No. of rows:",stores_detail_shape[0])
print("No. of columns:",stores_detail_shape[1])

### Dataset Information

In [None]:
# Dataset Info
print("*"*10,"Sales Data","*"*10)
store_sales.info()
print("*"*10,"Stores Details","*"*10)
stores_detail.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("*"*10,"Sales Data","*"*10)
print("No of Duplicated rows: ",store_sales.duplicated().sum())

print("*"*10,"Stores Details","*"*10)
print("No of Duplicated rows: ",stores_detail.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("*"*10,"Sales Data","*"*10)
column_names_store_sales = list(store_sales.columns)
for i in column_names_store_sales:
    print(f"{i}: {store_sales[i].isnull().sum()}")
print("*"*10,"Stores Details","*"*10)
column_names_stores_detail = list(stores_detail.columns)
for i in column_names_stores_detail:
    print(f"{i}: {stores_detail[i].isnull().sum()}")

In [None]:
# Visualizing the missing values
# Checking null values by plotting a heatmap for each DataFrame on its own axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
sbn.heatmap(store_sales.isnull(), cbar=False, ax=ax1)
ax1.set_title('Missing values: Sales Data')
sbn.heatmap(stores_detail.isnull(), cbar=False, ax=ax2)
ax2.set_title('Missing values: Stores Details')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("*"*10,"Sales Data","*"*10)
print("Column list: ",column_names_store_sales)

print("*"*10,"Stores Details","*"*10)
print("Column list: ",column_names_stores_detail)

In [None]:
# Dataset Describe
print("*"*10,"Sales Data","*"*10)
print(store_sales.describe(exclude = 'object'))
print("----"*20)
print(store_sales.describe(include = 'object'))

In [None]:
# Dataset Describe
print("*"*10,"Stores Details","*"*10)
print(stores_detail.describe(exclude = 'object'))
print("----"*20)
print(stores_detail.describe(include = 'object'))

### Variables Description

● Store - a unique Id for each store.

● DayOfWeek - the day of the week as a number (1 to 7).

● Sales - the turnover for any given day (this is what you are predicting).

● Customers - the number of customers on a given day.

● Open - an indicator for whether the store was open: 0 = closed, 1 = open.

● StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.

● SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools.

● StoreType - differentiates between 4 different store models: a, b, c, d.

● Assortment - describes an assortment level: a = basic, b = extra, c = extended.

● CompetitionDistance - distance in meters to the nearest competitor store.

● CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened.

● Promo - indicates whether a store is running a promo on that day.

● Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.

● Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2.

● PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store.
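
As a minimal sketch of how `PromoInterval` could be turned into a usable feature, the helper below checks whether a given month starts a new Promo2 round. The function name `is_promo2_start_month` is hypothetical, and both the handling of missing values and the four-letter `Sept` abbreviation are assumptions about the raw data, not facts taken from this notebook.

```python
import calendar

# Map month abbreviations ("Jan", "Feb", ...) to month numbers 1-12.
MONTH_NUMBER = {abbr: num for num, abbr in enumerate(calendar.month_abbr) if abbr}
# Assumption: September may be written "Sept" in the raw data.
MONTH_NUMBER["Sept"] = 9


def is_promo2_start_month(promo_interval, month):
    """Return True if `month` (1-12) starts a new Promo2 round.

    `promo_interval` is a string such as "Feb,May,Aug,Nov".
    Anything that is not a non-empty string (for example NaN for stores
    without Promo2) is treated as "no promotion".
    """
    if not isinstance(promo_interval, str) or not promo_interval.strip():
        return False
    start_months = {MONTH_NUMBER[name.strip()] for name in promo_interval.split(",")}
    return month in start_months


print(is_promo2_start_month("Feb,May,Aug,Nov", 5))   # True
print(is_promo2_start_month("Feb,May,Aug,Nov", 6))   # False
```

On the merged DataFrame this could be applied row by row with the `PromoInterval` and `month` columns, for example through `df.apply(..., axis=1)`.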

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print('-------------Unique Value for each column-----------')
for i in column_names_store_sales:
    print(store_sales[i].value_counts())
    print("--"*10)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Changing data type from object to date
store_sales['Date'] = pd.to_datetime(store_sales['Date'],dayfirst=True)

In [None]:
# Join both DataFrames on the Store column, which contains the store id
df = store_sales.merge(stores_detail, on='Store')
# Create two more columns, year and month, for EDA purposes
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month

In [None]:
# Write your code to make your dataset analysis ready.
print(df.head())
df.shape

### What all manipulations have you done and insights you found?

1. Changed the data type of the Date column to datetime.
2. Merged both DataFrames on the key Store for visualization.
3. Checked the Sales column for outliers. Some were found, but sales values that high are plausible, so they were kept.
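
The outlier check in step 3 is not shown in code. One common way to do such a check is the interquartile-range (IQR) rule, sketched below. This is an illustration and not necessarily the method used for this notebook; the helper name `iqr_outlier_mask` and the toy numbers are made up.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data standing in for df['Sales']; the numbers are made up.
toy_sales = pd.Series([5200, 4800, 6100, 5500, 5900, 5000, 41000])
mask = iqr_outlier_mask(toy_sales)
print(toy_sales[mask])  # only the 41000 entry is flagged
```

A flagged value is only a candidate outlier. Whether it is an error or a genuinely busy day is a judgment about the business, which is why high sales values were kept here.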


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
fig, ax = plt.subplots(figsize=(10,3))
# Draw on the axes created above; cbar only applies to bivariate histograms, so it is dropped
sbn.histplot(data=df, y="Sales", binwidth=1000, ax=ax)
plt.title('Sales Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram (`histplot`) gives an idea of how the daily sales values are distributed.

##### 2. What is/are the insight(s) found from the chart?

Approximate number of records in two of the sales bins:
* 0-1000 = approx 17500
* 6001 - 7000 = approx 150000


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sbn.boxplot(x='DayOfWeek', y='Sales', data=df)
plt.ylabel("Sales", size=12)
plt.xlabel('Days of Week', size=12)
plt.title('Day Sales')
plt.show()

##### 1. Why did you pick the specific chart?

Boxplot - Bivariate (Numerical - Categorical) analysis, to see the spread and outliers of sales for each day of the week.

##### 2. What is/are the insight(s) found from the chart?

The chart shows outliers in sales, but sales values that high are plausible.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Sales demand appears to be higher on weekends and on Monday. Opening stores that are currently closed on holidays may help to increase sales.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Group by year and month and take the mean of Sales
year_month_group_by = df[['year','month','Sales']].groupby(['year','month'],as_index=False).agg({'Sales':'mean'})
# Seaborn FacetGrid draws one bar plot per month so that the years can be compared
g = sbn.FacetGrid(year_month_group_by, col="month", height=2, aspect=.8, col_wrap=6)
# errorbar=None replaces the deprecated ci=None (seaborn 0.12 and later)
g.map(sbn.barplot, "year", "Sales", order=[2013,2014,2015], errorbar=None)
plt.show()


##### 1. Why did you pick the specific chart?

Bar plot with FacetGrid - Multivariate analysis, to compare the average sales of each month across years.

##### 2. What is/are the insight(s) found from the chart?

* We have data from Jan 2013 to July 2015.
* There is strong evidence that sales are growing year over year in months 1, 6, 9, 10, and 11.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing which months grow year over year helps us plan stock management.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#School Holiday.


fig, (axis1,axis2,axis3) = plt.subplots(3,1,figsize=(4,4))
sbn.countplot(x='SchoolHoliday', data=df, ax=axis1)
sbn.barplot(x='SchoolHoliday', y='Sales', data=df, ax=axis2)
sbn.barplot(x='SchoolHoliday', y='Customers', data=df, ax=axis3)
plt.show()

##### 1. Why did you pick the specific chart?

To show bivariate analysis of school holidays against record count, sales, and customers.

##### 2. What is/are the insight(s) found from the chart?

Found a relation between sales, customer counts, and holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can observe that most of the stores remain closed during State and School Holidays. It is interesting to note, however, that more stores were open during School Holidays than during State Holidays.

Another important point is that the stores that were open during School Holidays had higher sales than on normal days.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Build one plotting date per (year, month) pair; the day (12) is an arbitrary fixed value
year_month_group_by['date'] = pd.to_datetime(dict(year=year_month_group_by.year,
                                                  month=year_month_group_by.month,
                                                  day=12))
plt.figure(figsize=(20,2))
sbn.lineplot(data=year_month_group_by,x='date',y='Sales')
plt.title("Month wise Average Sales")
plt.show()

##### 1. Why did you pick the specific chart?

Line plot for bivariate analysis - it shows how average sales trend over time.

##### 2. What is/are the insight(s) found from the chart?

Sales are strong in December, and each December peak breaks the previous high.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Growth is negative at the end of January and positive in December.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
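# Total and average sales per store type, using only the days that had sales
temp = df.loc[df['Sales']>0,['StoreType','Sales']].groupby('StoreType',as_index=False).agg(sum_sales = ('Sales','sum'),
                                                                         avg_sales = ('Sales','mean'))
print(temp.head())

# Plot the total and the average side by side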
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(15, 5))
sbn.barplot(x='StoreType', y='sum_sales', data=temp, ax=ax1)
sbn.barplot(x='StoreType', y='avg_sales', data=temp, ax=ax2)
ax1.ticklabel_format(style='plain', axis='y')
ax1.set_title('Store Type wise Sum of sales')
ax2.set_title('Store Type wise Average of sales ')
# sbn.despine(fig)
plt.show()

##### 1. Why did you pick the specific chart?

Barplot to show bivariate analysis.

##### 2. What is/are the insight(s) found from the chart?

* Type a stores have good sales with a low number of transactions.
* Type b stores have poor sales with the highest number of transactions.
* Type c is better than b, and d is better than c.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Growth is positive for type a. Types b and c need attention, and some business tactics should be tried for them.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# print(df.shape)
c_list = list(df.columns)
# print(c_list)
df_having_sales = df.loc[df['Sales']>0,c_list]
temp = df_having_sales.groupby(['Store'],as_index=False)\
                      .agg(sum_sales=('Sales','sum'),avg_sales=('Sales','mean'),count_day=('Sales','count'))\
                      .sort_values('sum_sales',ascending=False)

# temp['year_month'] = pd.to_datetime(temp[['year','month']].assign(DAY=15))
mappings = dict(zip(df.Store,df.StoreType))
temp['StoreType'] = temp['Store'].map(mappings)
top_100 = temp.head(100)
bottom_100 = temp.tail(100)
print(top_100)
print(bottom_100)

fig,(ax1,ax2) = plt.subplots(2,1,figsize=(10,10))
a = sbn.scatterplot(data=top_100, x="sum_sales", y="avg_sales", hue="StoreType", size="count_day",ax=ax1)
b = sbn.scatterplot(data=bottom_100, x="sum_sales", y="avg_sales", hue="StoreType", size="count_day",ax=ax2)
sbn.move_legend(a, "upper left", bbox_to_anchor=(1, 1))
sbn.move_legend(b, "upper left", bbox_to_anchor=(1, 1))
ax1.set_title('Top 100')
ax2.set_title('Bottom 100')
plt.show()


##### 1. Why did you pick the specific chart?

Multivariate Analysis - Used a scatter plot to show the relation between total sales, average sales, and the number of days each store was open.

##### 2. What is/are the insight(s) found from the chart?

The most interesting finding is that no type b store appears in the bottom 100.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
print(stores_detail['StoreType'].value_counts())
plt.figure(figsize=(5,4))
sbn.countplot(stores_detail, x="StoreType")
plt.title('Count Plot of Store Type')
plt.show()

##### 1. Why did you pick the specific chart?

Univariate Analysis - Count of Storetype

##### 2. What is/are the insight(s) found from the chart?

Type a stores are the most numerous, followed by d, c, and b.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There are few type b stores. Opening more type b stores could grow the business, because no type b store appears in the bottom 100.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# stores_detail
print(stores_detail['Assortment'].value_counts())
plt.figure(figsize=(5,4))
sbn.countplot(stores_detail, x="Assortment")
plt.show()

##### 1. Why did you pick the specific chart?

Univariate analysis - to see the distribution of the categorical variable Assortment.

##### 2. What is/are the insight(s) found from the chart?

Store counts by assortment level: a > c > b

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

data = df[['Date','Sales']].groupby('Date').sum()
print(data.head())
plt.figure(figsize=(15, 3))
plt.plot(data)
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Sales of stores')
plt.show()


##### 1. Why did you pick the specific chart?

Line Plot - Bivariate analysis - to show total sales by date.

##### 2. What is/are the insight(s) found from the chart?

The data shows some seasonality: sales follow a repeating, cyclic pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In January, sales appear to be trying to break previous highs.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# plotting correlation heatmap
plt.figure(figsize =(20,10))
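# numeric_only=True skips text and date columns, which otherwise raise an error in pandas 2.x
sbn.heatmap(df.corr(numeric_only=True), cmap="YlGnBu", annot=True)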

# displaying heatmap
plt.show()

##### 1. Why did you pick the specific chart?

Multivariate analysis - to see the pairwise correlation between the numeric variables.

##### 2. What is/are the insight(s) found from the chart?

The following pairs of variables are positively correlated with one another:
* Sales - Customers
* Sales - Open
* Customers - Open


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# We have a large dataset so we can't plot pairplot here.

##### 1. Why did you pick the specific chart?

We have a large dataset so we can't plot pairplot here.
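
One possible workaround, sketched here as an assumption and not as part of the original analysis, is to draw the pair plot on a random sample of rows. The helper name `sample_for_pairplot`, the sample size, and the toy DataFrame are all illustrative.

```python
import numpy as np
import pandas as pd


def sample_for_pairplot(frame, columns, n=2000, seed=0):
    """Return at most `n` randomly chosen rows restricted to `columns`."""
    subset = frame[columns]
    if len(subset) <= n:
        return subset
    return subset.sample(n=n, random_state=seed)


# Toy frame standing in for df; the values are random, not real sales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sales": rng.integers(0, 10000, size=50_000),
    "Customers": rng.integers(0, 1500, size=50_000),
    "Promo": rng.integers(0, 2, size=50_000),
})
sample = sample_for_pairplot(toy, ["Sales", "Customers", "Promo"])
print(sample.shape)  # (2000, 3)
# sbn.pairplot(sample) would then draw 2,000 points per panel instead of 50,000.
```

A sample shows the overall shape of each relationship but can miss rare extreme points, so it complements the full-data charts above and does not replace them.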

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

In [None]:
# Perform seasonal decomposition
print(data.head())
# period=7 makes the weekly cycle of the daily series explicit
result = seasonal_decompose(data['Sales'], model='additive', period=7)

**Components of Time Series Data**

1. **Trend**: This is the overall direction in which your sales data is moving over a long period. For example, if your ice cream shop is becoming more popular, you might see a gradual increase in your daily sales over the months. This upward trajectory in your sales data represents a positive trend.

In [None]:
# Plot trend
plt.figure(figsize=(15, 3))
plt.plot(result.trend, label='Trend')
plt.legend(loc='upper left')
plt.title('Trend')
plt.show()

2. **Seasonality**: This refers to regular patterns or fluctuations in the data that repeat at a fixed interval, such as every week or every year. In the case of your ice cream shop, sales might increase during the summer months due to the hot weather, and decrease during the winter. This regular up and down movement in your sales data due to seasons is seasonality.


In [None]:
# Plot seasonality
plt.figure(figsize=(14, 2))
plt.plot(result.seasonal, label='Seasonality')
plt.legend(loc='upper left')
plt.title('Seasonality')
plt.show()

3. **Random or Irregular Movements**: These are unpredictable and don't follow a pattern. For instance, if a celebrity randomly visits your shop and tweets about it, you might see a sudden, unexpected surge in sales. Such spikes are not due to the season or any longer-term trend but are random events.


In [None]:
# Plot residuals
plt.figure(figsize=(15, 3))
plt.plot(result.resid, label='Residuals')
plt.title('Residuals')
plt.legend(loc='upper left')
plt.show()

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

The ADF (Augmented Dickey-Fuller) test is a type of unit root test. A unit root is a feature of some stochastic processes (like certain types of time series) that can cause problems in statistical inference. In the context of time series analysis, having a unit root means the series is non-stationary.

The ADF test checks if a unit root is present in a time series sample. It does so by regressing the differenced series on its own lagged level and lagged differences, and testing whether the coefficient on the lagged level is zero.

* **Null hypothesis (H0):** the daily sales series has a unit root (and thus is non-stationary).
* **Alternate hypothesis (H1):** the daily sales series has no unit root (and thus is stationary).

If the test finds sufficient evidence to reject the null hypothesis, you can conclude that the time series is stationary.
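
To make the idea of a unit root concrete, the simulation below compares two AR(1) processes, one with coefficient 0.5 and one with coefficient 1.0 (a random walk). It is a toy illustration on simulated data with arbitrarily chosen parameters; the function name `simulate_ar1` is hypothetical and nothing here uses the sales series.

```python
import numpy as np

rng = np.random.default_rng(42)
n_paths, n_steps = 2000, 200
shocks = rng.normal(size=(n_paths, n_steps))


def simulate_ar1(phi, shocks):
    """Simulate y[t] = phi * y[t-1] + shock[t] for every row of `shocks`."""
    y = np.zeros_like(shocks)
    y[:, 0] = shocks[:, 0]
    for t in range(1, shocks.shape[1]):
        y[:, t] = phi * y[:, t - 1] + shocks[:, t]
    return y


stationary = simulate_ar1(0.5, shocks)   # |phi| < 1: no unit root
random_walk = simulate_ar1(1.0, shocks)  # phi = 1: unit root

# Variance across paths at an early and a late time step.
for name, paths in [("phi=0.5", stationary), ("phi=1.0", random_walk)]:
    print(name, round(paths[:, 20].var(), 2), round(paths[:, 199].var(), 2))
```

With the coefficient below 1 the variance settles at a constant level, while with a unit root it keeps growing over time because every shock is carried forward in full. That growing variance is the non-stationarity the ADF test looks for.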

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
def adfuller_test(sales):
    """Run the Augmented Dickey-Fuller test on a series and print the result."""
    res = adfuller(sales)
    labels = ['ADF Test Statistic','p-value','Lags Used','Number of Observations Used','Critical Values']
    print('adfuller_test: ')
    for i,j in zip(labels,res):
        print(f"{i}: {j}")

    print('--'*15)
    # H0: the series has a unit root (non-stationary)
    if res[1]<= 0.05:
        print("Strong evidence against the null hypothesis (H0): reject H0. The series has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis (H0): fail to reject H0. The series has a unit root and is non-stationary")

In [None]:
adfuller_test(data['Sales'])

##### Which statistical test have you done to obtain P-Value?

Augmented Dickey-Fuller (ADF) test, using `adfuller` from statsmodels.

##### Why did you choose the specific statistical test?

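* The ADF test is used to test our time series data for stationarity.
* The charts show some seasonality, so we expected the data to be non-stationary.
* The test nevertheless indicates that the data is stationary.
* If the series were in fact non-stationary, this result would be a Type I error (rejecting a true null hypothesis), not a Type II error. The ADF test only checks for a unit root, so a series with a stable seasonal pattern can still pass it.
* We therefore treat the data as ready for forecasting.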

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimesionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# DImensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# Splitting the data into train and test sets
train = data['Sales'].iloc[:-42]  # All data except the last 42 observations
test = data['Sales'].iloc[-42:]  # Last 42 observations for testing

##### What data splitting ratio have you used and why?

We want to forecast 6 weeks of sales, and the time series has one observation per day, so the last 42 days (7 days * 6 weeks) are held out for testing and everything before them is used for training.
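
The single 42-day holdout above could be extended to several time-ordered folds, often called rolling-origin or expanding-window validation. The sketch below only generates the index boundaries. It is a suggestion and not something this notebook does; the function name `rolling_origin_splits` and the series length 942 are illustrative assumptions.

```python
def rolling_origin_splits(n_obs, horizon=42, n_folds=3):
    """Yield (train_end, test_end) index pairs for expanding-window validation.

    Each fold trains on observations [0, train_end) and tests on the
    next `horizon` observations [train_end, test_end). The last fold's
    test window ends at the final observation.
    """
    for fold in range(n_folds, 0, -1):
        test_end = n_obs - (fold - 1) * horizon
        train_end = test_end - horizon
        if train_end <= 0:
            raise ValueError("not enough observations for this many folds")
        yield train_end, test_end


# 942 is an assumed series length used only for this illustration.
for train_end, test_end in rolling_origin_splits(942):
    print(f"train: [0, {train_end})  test: [{train_end}, {test_end})")
```

Each pair can be used with `data['Sales'].iloc[:train_end]` and `data['Sales'].iloc[train_end:test_end]`. The training window always ends before the test window starts, so no future observation leaks into training.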

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
# Applying ARIMA model (choosing a simple (1,1,1) order for demonstration)
arima_model = ARIMA(train, order=(1, 1, 1))
# Fit the Algorithm
arima_result = arima_model.fit()
# Predict on the model
arima_forecast = arima_result.forecast(steps=42) # Here 42 is no of days which we want to forecast

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
arima_mse = mean_squared_error(test, arima_forecast)
print('Mean squared error (Model 1 :Arima):',arima_mse)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Applying SARIMA model (considering a simple seasonal order)
sarima_model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))  # Weekly seasonality (period of 7 days)
sarima_result = sarima_model.fit()
sarima_forecast = sarima_result.forecast(steps=42)

# Evaluating the models
sarima_mse = mean_squared_error(test, sarima_forecast)
print('Mean squared error (Model 2 :Sarima):',sarima_mse)

In [None]:
# Visualizing Forecast of Arima and Sarima model over original data
plt.figure(figsize=(12, 3))
plt.plot(data['Sales'], label='Original Sales Data', color='blue')
plt.plot(test.index, arima_forecast, label='ARIMA Forecast', color='red')
plt.plot(test.index, sarima_forecast, label='SARIMA Forecast', color='green')
plt.legend()
plt.title('42-Day Sales Forecast for Rossmann Stores (ARIMA and SARIMA)')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# We use the ACF and PACF plots to choose the model orders
plot_pacf(data['Sales'], lags=20)  # Partial autocorrelation function plot
plt.show()

In [None]:
plot_acf(data['Sales'], lags=20)  # Autocorrelation function plot
plt.show()

##### Which hyperparameter optimization technique have you used and why?

In time series analysis, the ARIMA and SARIMAX models require three order values, p, d and q:
* p = number of autoregressive (lag) terms
* d = number of times the series is differenced to make it stationary
* q = number of moving-average (lagged forecast error) terms
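
To make the role of d concrete, here is a minimal sketch on a made-up series (the numbers are illustrative, not Rossmann data). Differencing once (d = 1) turns a linear trend into a constant series, and differencing again (d = 2) leaves only zeros.

```python
import numpy as np

# Hypothetical toy series with a linear trend (not taken from the Rossmann data).
series = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

# d = 1: replace each value by its change from the previous value.
diff_1 = np.diff(series, n=1)

# d = 2: difference the already differenced series once more.
diff_2 = np.diff(series, n=2)

print(diff_1)  # constant steps: the linear trend has been removed
print(diff_2)  # all zeros: nothing is left to difference
```

Each round of differencing shortens the series by one observation, which is one reason to keep d as small as the data allows.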

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

## Selected Values
* p = 5 (from the PACF plot)
* d = 0 (the series is already stationary, so no differencing is needed)
* q = 5 (from the ACF plot)
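
ACF and PACF plots are usually read by looking for lags whose bars fall outside the approximate 95% band of ±1.96/√n. The sketch below shows that reading on a synthetic weekly series. `sample_acf` is a hypothetical helper written for illustration (not a statsmodels function), and the data is generated, not the Rossmann sales.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


# Synthetic series: a repeating 7-day pattern plus small noise (assumed data).
rng = np.random.default_rng(0)
weekly = np.tile([5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 0.0], 30)
x = weekly + rng.normal(scale=0.5, size=weekly.size)

acf = sample_acf(x, max_lag=14)
band = 1.96 / np.sqrt(len(x))  # approximate 95% band for white noise

# Lags whose autocorrelation falls outside the band are candidates for the model order.
significant_lags = [k for k in range(1, 15) if abs(acf[k]) > band]
print(significant_lags)
```

A strong spike at lag 7 in such a plot is the kind of evidence that motivates a seasonal period of 7.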



#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

* We first tried p = d = q = 1 with 7-day seasonality.
* In the next model we use p = 5, d = 0, q = 5 with a seasonal period of 7 in SARIMAX.


### ML Model - 3

In [None]:
# ML Model - 3 Implementation
# We fit both models (ARIMA and SARIMA) here

# Orders chosen from the ACF/PACF plots: p = 5, d = 0, q = 5

# Splitting the data into train and test sets
train_1 = data['Sales'].iloc[:-42]  # All data except the last 42 observations
test_1 = data['Sales'].iloc[-42:]  # Last 42 observations (six weeks) for testing
# Fit the Algorithm (Arima)
arima_model_1 = ARIMA(train_1, order=(5, 0, 5))
arima_result_1 = arima_model_1.fit()
# Predict on the model
arima_forecast_1 = arima_result_1.forecast(steps=42)


# Applying SARIMA model (considering a simple seasonal order)
sarima_model_1 = SARIMAX(train_1, order=(5, 0, 5), seasonal_order=(5, 0, 5, 7))  # Weekly seasonality
sarima_result_1 = sarima_model_1.fit()
sarima_forecast_1 = sarima_result_1.forecast(steps=42)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluating the models
arima_mse_1 = mean_squared_error(test_1, arima_forecast_1)
print('Previous MSE (ARIMA):', arima_mse)
print('New MSE (ARIMA):', arima_mse_1)

if arima_mse > arima_mse_1:
    print('Difference:', arima_mse - arima_mse_1)

In [None]:
# Evaluating the models
sarima_mse_1 = mean_squared_error(test_1, sarima_forecast_1)
print('Previous MSE (SARIMA):', sarima_mse)
print('New MSE (SARIMA):', sarima_mse_1)

if sarima_mse > sarima_mse_1:
    print('Difference:', sarima_mse - sarima_mse_1)

In [None]:
# Re-plotting the forecasts along with the original data

plt.figure(figsize=(15, 4))
plt.plot(data['Sales'], label='Original Sales Data', color='blue')
plt.plot(test_1.index, arima_forecast_1, label='ARIMA Forecast', color='red')
plt.plot(test_1.index, sarima_forecast_1, label='SARIMA Forecast', color='green')
plt.legend()
plt.title('Sales Data with ARIMA and SARIMA Forecasts (42 Days)')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

In [None]:
# Forecast with DataFrame
final_analysis = pd.DataFrame({'Original':test_1,'Sarima': sarima_forecast_1,'Arima':arima_forecast_1})
print(final_analysis)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

We used:
* the autocorrelation function (ACF) plot to choose q
* the partial autocorrelation function (PACF) plot to choose p

Reading the orders off these plots is a standard starting point for tuning ARIMA-type models.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

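We use RMSE, the square root of the MSE reported above, because it is expressed in the same units as sales. Other metrics, such as MAE or MAPE, could also be used.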
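
As a small worked example of how these metrics relate, the sketch below computes MSE, RMSE and MAE by hand on made-up numbers (the values are illustrative, not results from the notebook).

```python
import numpy as np

# Made-up actual and forecast sales for four days (illustrative only).
actual = np.array([100.0, 120.0, 90.0, 110.0])
forecast = np.array([110.0, 115.0, 95.0, 100.0])

errors = actual - forecast        # -10, 5, -5, 10
mse = np.mean(errors ** 2)        # (100 + 25 + 25 + 100) / 4 = 62.5
rmse = np.sqrt(mse)               # square root of MSE, same units as sales
mae = np.mean(np.abs(errors))     # (10 + 5 + 5 + 10) / 4 = 7.5

print('MSE:', mse)
print('RMSE:', rmse)
print('MAE:', mae)
```

RMSE penalises large misses more heavily than MAE, so the two diverge most when a forecast is badly wrong on a few days.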

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

* I choose Model 3 and take a hybrid prediction from both the ARIMA and SARIMA models.
* For Sundays we take the ARIMA prediction, since it predicts those days well.
* For all other days we take the SARIMA prediction.
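
One way the hybrid rule above could be expressed is sketched below with pandas. The two forecast series hold placeholder values (hypothetical, not the notebook's output) and `hybrid` is an assumed name. The only real logic is the day-of-week switch.

```python
import pandas as pd

# Placeholder forecasts over a 14-day horizon (illustrative values only).
dates = pd.date_range("2015-06-20", periods=14, freq="D")
arima_pred = pd.Series(range(100, 114), index=dates, dtype=float)
sarima_pred = pd.Series(range(200, 214), index=dates, dtype=float)

# In pandas, dayofweek == 6 marks Sunday (Monday is 0).
is_sunday = dates.dayofweek == 6

# Keep the SARIMA value on non-Sundays, otherwise take the ARIMA value.
hybrid = sarima_pred.where(~is_sunday, arima_pred)

print(hybrid)
```

In the notebook, the same switch could be applied to `arima_forecast_1` and `sarima_forecast_1`, provided both share the date index of `test_1`.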

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
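
A minimal sketch of the save-and-reload round trip with `pickle` follows. The dictionary `model_stub` is a stand-in for the fitted model (an assumption for illustration), and the file is written to a temporary directory so that nothing in the project folder is overwritten.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object is handled the same way.
model_stub = {"order": (5, 0, 5), "seasonal_order": (5, 0, 5, 7)}

# Save the object to a pickle file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sarima_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Load it back and check that nothing was lost in the round trip.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

With a real fitted model, the sanity check after loading would be to call its forecast method and compare the result with the forecast made before saving.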

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

<p>In conclusion, the implementation of ARIMA (AutoRegressive Integrated Moving Average) and SARIMA (Seasonal ARIMA) models for forecasting Rossmann store sales presents a robust approach for predicting future sales trends. Using historical sales data, these models effectively capture the underlying patterns and seasonality within the sales data, allowing for accurate predictions over different time horizons.

<p>ARIMA models provide a solid foundation for modeling the autocorrelation and trend components of the sales data, while SARIMA extends this capability to account for seasonality. By incorporating seasonal differencing and seasonal autoregressive and moving average terms, SARIMA models offer enhanced accuracy in capturing the seasonal variations inherent in retail sales, such as those observed in the Rossmann store sales data.

<p>The forecasting performance of ARIMA and SARIMA models can be further optimized through parameter tuning and model evaluation techniques such as cross-validation. Additionally, incorporating exogenous variables such as promotional events, holidays, and store-specific factors can enhance the predictive capabilities of these models, enabling more nuanced and accurate sales forecasts.

<p>Overall, the application of ARIMA and SARIMA modeling techniques for Rossmann store sales forecasting provides valuable insights for decision-making processes, enabling retailers to better allocate resources, optimize inventory management, and formulate effective marketing strategies based on reliable sales projections. As data availability and computational capabilities continue to improve, further advancements in time series forecasting methodologies are expected, offering even greater precision and utility for retail sales forecasting applications.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***