<a href="https://colab.research.google.com/github/saurabhsingh3786/Capstone-2-Retail-Sales-Prediction-/blob/main/individual_notebook_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# <b><u> Project Name :- Sales Prediction : Predicting sales of a major store chain Rossmann</u></b>




##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Saurabh Singh
##### **Team Member 2 -** Bharathwaj Bejjarapu
##### **Team Member 3 -** Shriya Chouhan


# **Project Summary -**

### Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

### You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **GitHub Link -**

https://github.com/saurabhsingh3786/Capstone-2-Retail-Sales-Prediction-



# **Problem Statement**


We have to develop a supervised machine learning model that accurately forecasts the daily sales of Rossmann stores. The model uses historical sales data, along with additional features such as promotions, competition, holidays, seasonality, and locality, to predict the future sales of a given store. The goal is to provide reliable sales predictions that can assist store managers in making informed decisions, optimizing inventory management, and improving overall business performance.





# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')
import scipy.stats as stats

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
path = "/content/drive/MyDrive/AlmaBetter/Capstone Projects/Retail Sales Prediction/"
rossmann_sales_df = pd.read_csv(path + "Rossmann Stores Data.csv", low_memory=False)
stores_df = pd.read_csv(path + "store.csv")

### Dataset First View

In [None]:
# Dataset First Look
pd.concat([rossmann_sales_df.head(),rossmann_sales_df.tail()])

In [None]:
pd.concat([stores_df.head(),stores_df.tail()])

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rossmann_sales_df.shape, stores_df.shape

### Dataset Information

In [None]:
# Dataset Info
rossmann_sales_df.info()

In [None]:
stores_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
num_duplicates = rossmann_sales_df.duplicated().sum()

# Print the result
print(f"The dataset has {num_duplicates} duplicate values.")

In [None]:
duplicate_value = stores_df[stores_df.duplicated()]
print("Duplicate rows in stores dataset:",len(duplicate_value))

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values
rossmann_sales_df.isna().sum()

In [None]:
stores_df.isnull().sum()

In [None]:
# Visualizing the missing values
# Create a bar plot using Plotly
missing_values = stores_df.isnull().sum().reset_index()
missing_values.columns = ['Column', 'Missing Values']
fig = px.bar(missing_values, x='Column', y='Missing Values')
fig.update_layout(title='Missing Values', xaxis_title='Columns', yaxis_title='Missing Values')
fig.show()


**`Now we handle the missing values in the stores dataset:`**

Out of 1115 entries, the following columns have missing values:
* CompetitionDistance - distance in meters to the nearest competitor store. A distribution plot shows how these distances are typically spread, so we can choose a suitable value to impute.

* CompetitionOpenSinceMonth - gives the approximate month in which the nearest competitor opened. The mode of the column gives the most frequently occurring month.
* CompetitionOpenSinceYear - gives the approximate year in which the nearest competitor opened. The mode of the column gives the most frequently occurring year.
* Promo2SinceWeek, Promo2SinceYear and PromoInterval are NaN wherever Promo2 is 0, as can be seen in the first look at the dataset. They can be replaced with 0.
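
The observation that the Promo2 columns are missing only where Promo2 is 0 can be verified before imputing. The following is a minimal sketch on a small made-up frame; `toy_stores` and its values are hypothetical stand-ins for `stores_df`, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for stores_df
toy_stores = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 40.0],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2014.0],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan, "Feb,May,Aug,Nov"],
})

promo2_cols = ["Promo2SinceWeek", "Promo2SinceYear", "PromoInterval"]

# True where every Promo2 detail column is missing in that row
all_missing = toy_stores[promo2_cols].isna().all(axis=1)
# True where at least one Promo2 detail column is missing in that row
any_missing = toy_stores[promo2_cols].isna().any(axis=1)
not_participating = toy_stores["Promo2"] == 0

# The NaNs line up with Promo2 == 0 only if both masks equal not_participating
nans_match_promo2 = bool(
    (all_missing == not_participating).all()
    and (any_missing == not_participating).all()
)
print(nans_match_promo2)
```

If the same check returns `False` on the real data, filling with 0 would hide genuinely missing information for participating stores.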

In [None]:
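# Distribution plot of competition distance
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(x=stores_df['CompetitionDistance'], kde=True, stat='density')
plt.xlabel('Competition Distance (m)')
plt.title('Competition Distance Distribution Plot')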

Most CompetitionDistance values are concentrated on the left, and the distribution is right-skewed with a long tail of large distances. Because the median is more robust to outliers than the mean, we impute the missing values with the median.
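
To illustrate why the median is preferred for skewed data, here is a small sketch with made-up distances (the values in `toy_distance` are hypothetical, not taken from `stores_df`). It compares how much the mean and the median move when a single extreme value is removed.

```python
import pandas as pd

# Hypothetical distances in metres: mostly small, one very large value
toy_distance = pd.Series([200, 350, 500, 800, 1200, 2500, 75000])

mean_before = toy_distance.mean()
median_before = toy_distance.median()

# Drop the extreme value and recompute both statistics
trimmed = toy_distance[toy_distance < 70000]
mean_after = trimmed.mean()
median_after = trimmed.median()

print(f"mean:   {mean_before:.0f} -> {mean_after:.0f}")
print(f"median: {median_before:.0f} -> {median_after:.0f}")
```

The mean shifts by thousands of metres while the median barely changes, which is the behaviour we want from an imputation value.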

In [None]:
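# Fill missing competition distances with the median value
# (assigning the result back avoids in-place fillna on a column, which newer pandas versions warn about)
stores_df['CompetitionDistance'] = stores_df['CompetitionDistance'].fillna(stores_df['CompetitionDistance'].median())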

In [None]:
# Fill competition open since month and year with the most frequently occurring values, i.e. the modes of those columns
stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].fillna(stores_df['CompetitionOpenSinceMonth'].mode()[0])
stores_df['CompetitionOpenSinceYear'] = stores_df['CompetitionOpenSinceYear'].fillna(stores_df['CompetitionOpenSinceYear'].mode()[0])

In [None]:
# Impute the NaN values of the Promo2-related columns with 0
stores_df['Promo2SinceWeek'] = stores_df['Promo2SinceWeek'].fillna(0)
stores_df['Promo2SinceYear'] = stores_df['Promo2SinceYear'].fillna(0)
stores_df['PromoInterval'] = stores_df['PromoInterval'].fillna(0)

In [None]:
#check again for missing value if anyone left
stores_df.isnull().sum()

### What did you know about your dataset?

#### Rossmann Stores Data.csv - historical data including Sales
#### store.csv  - supplemental information about the stores


#### <u>Data fields</u>
#### Most of the fields are self-explanatory.



* **Id** - an Id that represents a (Store, Date) duple within the set
*  **Store** - a unique Id for each store
*  **Sales** - the turnover for any given day (Dependent Variable)
* **Customers** - the number of customers on a given day
* **Open** - an indicator for whether the store was open: 0 = closed, 1 = open
* **StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
* **SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools
* **StoreType** - differentiates between 4 different store models: a, b, c, d
* **Assortment** - describes an assortment level: a = basic, b = extra, c = extended. An assortment strategy in retailing involves the number and type of products that stores display for purchase by consumers.
* **CompetitionDistance** - distance in meters to the nearest competitor store
* **CompetitionOpenSince**[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
* **Promo** - indicates whether a store is running a promo on that day
* **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
* **Promo2Since**[Year/Week] - describes the year and calendar week when the store started participating in Promo2
* **PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.
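
The PromoInterval format described above can be turned into a simple check of whether a Promo2 round starts in a given month. This is only a sketch: the helper `promo2_starts_in_month` is hypothetical and not part of the notebook, and it assumes the month abbreviation passed in is spelled the same way as in the dataset.

```python
def promo2_starts_in_month(promo_interval, month_abbr):
    """Return True if month_abbr (e.g. 'May') is listed in promo_interval."""
    # Stores without Promo2 have no interval (NaN, or 0 after imputation)
    if not isinstance(promo_interval, str):
        return False
    return month_abbr in promo_interval.split(",")


print(promo2_starts_in_month("Feb,May,Aug,Nov", "May"))
print(promo2_starts_in_month("Feb,May,Aug,Nov", "Jun"))
print(promo2_starts_in_month(0, "May"))
```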

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(list(rossmann_sales_df.columns))
print(list(stores_df.columns))

In [None]:
# Dataset Describe
rossmann_sales_df.describe()

In [None]:
stores_df.describe().T

### Variables Description

In [None]:
rossmann_sales_df.dtypes

In [None]:
#change datatype of date column
# code for changing format of date from object to datetime
rossmann_sales_df['Date'] = pd.to_datetime(rossmann_sales_df['Date'], format= '%Y-%m-%d')

In [None]:
#change datatype of stateholiday
rossmann_sales_df['StateHoliday'].unique()

In [None]:
rossmann_sales_df['StateHoliday'] = rossmann_sales_df['StateHoliday'].replace({'a': 1, 'b': 2, 'c': 3, '0': 0})

Next, we check the data types of the stores dataset to see whether any corrections are needed.

In [None]:
stores_df.dtypes

In [None]:
#change datatype of Assortment and storetype
stores_df['Assortment'] = stores_df['Assortment'].replace({'a': 0, 'b': 1, 'c': 2})
stores_df['StoreType'] = stores_df['StoreType'].replace({'a': 0, 'b': 1, 'c': 2,'d': 3})

In [None]:
stores_df['CompetitionDistance']= stores_df['CompetitionDistance'].astype(int)
stores_df['CompetitionOpenSinceMonth'] = stores_df['CompetitionOpenSinceMonth'].astype(int)
stores_df['CompetitionOpenSinceYear']= stores_df['CompetitionOpenSinceYear'].astype(int)
stores_df['Promo2SinceWeek']= stores_df['Promo2SinceWeek'].astype(int)
stores_df['Promo2SinceYear']= stores_df['Promo2SinceYear'].astype(int)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
#unique variables in rossmann sales dataset.
print(rossmann_sales_df.apply(lambda col: col.unique()))

In [None]:
#unique variables in stores dataset.
print(stores_df.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#now we merge both dataset to make analysis.
merged_df = pd.merge(rossmann_sales_df, stores_df, on='Store', how='left')

In [None]:
merged_df.head()
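
A left merge on `Store` should keep exactly one output row per sales record. The sketch below shows one way to guard that assumption using the `validate` and `indicator` options of `pd.merge`; the frames `toy_sales` and `toy_stores` are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical toy frames standing in for the sales and store tables
toy_sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 6064, 8314, 0]})
toy_stores = pd.DataFrame({"Store": [1, 2, 3], "StoreType": [2, 0, 0]})

toy_merged = pd.merge(
    toy_sales, toy_stores, on="Store", how="left",
    validate="many_to_one",  # raises MergeError if a Store id repeats in toy_stores
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# The merge must not add or drop sales rows
rows_preserved = len(toy_merged) == len(toy_sales)
# Sales rows whose Store id has no match in the store table
unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(rows_preserved, unmatched)
```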

`UNIVARIATE ANALYSIS`:-

In [None]:
#Question-1: what is sales distribution across stores.
# Group the sales by store and calculate the total sales
store_sales = merged_df.groupby('Store')['Sales'].sum().reset_index()
store_sales

In [None]:
#question-2: number of stores in each store type?
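# Data Wrangling for Store-Related Features
# merged_df has one row per store per day, so keep one row per store before counting
store_features = merged_df[['Store', 'StoreType']].drop_duplicates()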

# Count the number of stores in each store type
store_counts = store_features['StoreType'].value_counts()
store_counts

In [None]:
#Question-3: how are sales affected by promotional activities?
# Data Wrangling for Sales and Promo Columns
# Take a copy so that the column assignment below does not trigger SettingWithCopyWarning
sales_promo_data = merged_df[['Sales', 'Promo', 'Date']].copy()

# Convert the 'Date' column to datetime data type
sales_promo_data['Date'] = pd.to_datetime(sales_promo_data['Date'])

# Group the data by Promo (0: Non-Promo, 1: Promo) and calculate average sales for each group
promo_grouped = sales_promo_data.groupby('Promo')['Sales'].mean()
promo_grouped

`BIVARIATE ANALYSIS`

In [None]:
#Question-4: How does the average sales vary with respect to different store types?
# Data Wrangling for Store Type vs. Sales
store_sales_data = merged_df[['Store', 'StoreType', 'Sales']]

# Calculate average sales and standard deviation for each store type
avg_sales_by_store_type = store_sales_data.groupby('StoreType')['Sales'].mean()
std_sales_by_store_type = store_sales_data.groupby('StoreType')['Sales'].std()
print("average sales by storetype is:", avg_sales_by_store_type)

In [None]:
#Question-5: What is the correlation between sales and the competition distance?

# Data Wrangling for Sales vs. Competition Distance
sales_competition_data = merged_df[['Sales', 'CompetitionDistance']]

print("sales competiton data:",sales_competition_data)

# Calculate the correlation coefficient between Sales and CompetitionDistance
correlation_coefficient = sales_competition_data['Sales'].corr(sales_competition_data['CompetitionDistance'])
print(correlation_coefficient)

In [None]:
#Question-6: How does sales vary across different days of the week?
# Data Wrangling for Sales vs. DayOfWeek
sales_day_data = merged_df[['Sales', 'DayOfWeek']].copy()

# DayOfWeek has no missing values (see the null check above), so no imputation is needed

# Group the data by DayOfWeek and calculate average sales for each day
average_sales_by_day = sales_day_data.groupby('DayOfWeek')['Sales'].mean().reset_index()

# Rename the DayOfWeek codes to meaningful day names
# DayOfWeek is coded 1-7; in this dataset 1 = Monday and 7 = Sunday (this can be cross-checked against the Date column)
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
average_sales_by_day['DayOfWeek'] = average_sales_by_day['DayOfWeek'].map(dict(enumerate(day_names, start=1)))

# Sort the data by DayOfWeek for proper visualization
average_sales_by_day['DayOfWeek'] = pd.Categorical(average_sales_by_day['DayOfWeek'], categories=day_names, ordered=True)
average_sales_by_day.sort_values('DayOfWeek', inplace=True)
print(average_sales_by_day)



In [None]:
#Question-7: What is the effect of school holidays on sales?
# Data Wrangling for Sales vs. School Holidays
sales_school_holidays_data = merged_df[['Sales', 'SchoolHoliday']]

# Group the data by SchoolHoliday and calculate average sales for each category
average_sales_by_school_holiday = sales_school_holidays_data.groupby('SchoolHoliday')['Sales'].mean().reset_index()

# Replace 0 and 1 in the 'SchoolHoliday' column with meaningful labels for visualization
average_sales_by_school_holiday['SchoolHoliday'] = average_sales_by_school_holiday['SchoolHoliday'].map({0: 'No Holiday', 1: 'School Holiday'})
average_sales_by_school_holiday



`MULTIVARIATE ANALYSIS`

In [None]:
#Question-8: is there any correlation between features?
# Selecting only numerical variables for the correlation matrix
numerical_vars = merged_df.select_dtypes(include='number')

# Handling missing data (if any)
numerical_vars = numerical_vars.dropna()


In [None]:
#Question-9: relation between promo vs sales vs customers?


# Selecting the relevant columns for analysis
promo_sales_customers_data = merged_df[['Promo', 'Sales', 'Customers']]

# Handling missing data (if any)
promo_sales_customers_data = promo_sales_customers_data.dropna()
promo_sales_customers_data

### What all manipulations have you done and insights you found?

In this wrangling step we merged the sales and store tables and prepared the data behind nine questions. We looked at how sales are distributed across stores, how many stores belong to each store type, and whether average sales differ between days with and without a promotion. We then calculated the average sales for each store type to see how sales vary across store types, and the correlation between sales and CompetitionDistance to check whether distance has any effect on sales. After that, we examined which day of the week has the highest average sales and whether school holidays have a positive effect on sales. Finally, we selected the numerical features for a correlation matrix and prepared the data for studying the relationship between sales and customers with and without a promotion.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Question-1: what is sales distribution across stores.


# Increase figure size
plt.figure(figsize=(10, 6))

# Plot the distribution of sales with customized colors and grid lines
sns.distplot(merged_df['Sales'], color='skyblue', kde_kws={'color': 'darkblue', 'linewidth': 2})
plt.xlabel('Sales')
plt.ylabel('Density')
plt.title("Distribution of Sales")

# Add grid lines
plt.grid(True, linestyle='--', alpha=0.7)

# Remove top and right spines
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.show()


##### 1. Why did you pick the specific chart?

 The primary purpose of this chart is to visualize the distribution of the 'Sales' data. Histograms provide a clear representation of the data distribution by dividing the data into bins and displaying the frequency of values falling into each bin. The KDE plot complements the histogram by providing a smooth estimate of the probability density function of the data, helping to reveal the underlying distribution.
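
The two ingredients of this chart can also be computed directly. The sketch below uses simulated right-skewed values (`toy_sales` is hypothetical, not the real Sales column) to show what the histogram bins and the KDE curve represent numerically.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed "sales" values, not the real data
toy_sales = rng.gamma(shape=2.0, scale=3000.0, size=5000)

# Histogram: values are counted per bin and normalised so the bar areas sum to 1
density, edges = np.histogram(toy_sales, bins=30, density=True)
hist_area = float(np.sum(density * np.diff(edges)))

# KDE: a smooth density estimate evaluated on a grid of sales values
kde = stats.gaussian_kde(toy_sales)
grid = np.linspace(toy_sales.min(), toy_sales.max(), 200)
kde_values = kde(grid)

print(round(hist_area, 6), kde_values.shape)
```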

##### 2. What is/are the insight(s) found from the chart?

Central Tendency: The chart can reveal the central tendency of the sales data, indicating the typical or average level of sales. This is often represented by the peak(s) or modes in the distribution.

Spread: The chart can provide insights into the spread or variability of sales across different values. Wider distributions indicate higher variability, while narrower distributions suggest relatively consistent sales levels.

Skewness: The skewness of the distribution can be observed from the histogram and KDE plot. Positive skewness indicates a longer right tail, suggesting that a few stores have exceptionally high sales. Negative skewness implies a longer left tail, indicating a few stores with very low sales.
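
Skewness can also be quantified rather than judged by eye. A minimal sketch with made-up numbers (both series are hypothetical) shows that a long right tail gives a positive value from `Series.skew()` and a long left tail a negative one.

```python
import pandas as pd

# Hypothetical values with a long right tail
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])
# Mirror image of the same values: a long left tail
left_tailed = -right_tailed

skew_right = right_tailed.skew()
skew_left = left_tailed.skew()
print(skew_right > 0, skew_left < 0)
```

On the real data the same call would be `merged_df['Sales'].skew()`.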

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying High-Performing Stores: Insights about stores with consistently high sales and positive skewness can help identify top-performing stores. These insights can lead to allocating more resources, marketing efforts, and promotions to further boost sales in these stores.

Targeting Underperforming Stores: Understanding the sales distribution can help identify stores with low sales or negative skewness (long left tail). These insights can lead to targeted interventions, such as improving store operations, introducing new products, or implementing better marketing strategies to revitalize underperforming stores.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#question-2: number of stores in each store type?

plt.figure(figsize=(8, 6))
sns.countplot(data=store_features, x='StoreType', palette='pastel')
plt.xlabel('Store Type')
plt.ylabel('Count')
plt.title('Number of Stores in Each Store Type')
plt.show()

##### 1. Why did you pick the specific chart?

The sns.countplot() function from the Seaborn library is specifically designed to show the count of occurrences of categorical data in a dataset. It is a specialized bar plot that displays the frequency of unique values in a categorical variable.
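
One caveat when counting categories in a merged table: it has one row per store per day, so counting rows is not the same as counting stores. The sketch below illustrates the difference on a hypothetical frame (`toy_merged` is made up for illustration).

```python
import pandas as pd

# Hypothetical merged data: store 1 has three daily rows, stores 2 and 3 one each
toy_merged = pd.DataFrame({
    "Store":     [1, 1, 1, 2, 3],
    "StoreType": [0, 0, 0, 1, 1],
})

# Counts daily records per store type
rows_per_type = toy_merged["StoreType"].value_counts()
# Counts distinct stores per store type
stores_per_type = (
    toy_merged.drop_duplicates(subset="Store")["StoreType"].value_counts()
)
print(rows_per_type.to_dict())
print(stores_per_type.to_dict())
```

Here type 0 has the most rows but type 1 has the most stores, so the two counts can rank store types differently.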

##### 2. What is/are the insight(s) found from the chart?

Store type 0 has the highest number of stores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Identifying High-Performing Stores: Identifying stores with consistently high sales and positive performance metrics can help focus resources on these stores to further boost sales and profitability. Targeted marketing campaigns and investments can lead to increased revenue.

Understanding Customer Behavior: Analyzing customer behavior patterns, such as buying preferences and peak shopping times, can enable businesses to tailor their offerings and promotions to better meet customer needs. Improved customer satisfaction can lead to increased customer loyalty and positive word-of-mouth.

Optimizing Inventory and Supply Chain: Insights into sales patterns and demand fluctuations can aid in inventory management. Maintaining optimal stock levels can prevent stockouts, reduce holding costs, and enhance operational efficiency.


Negative Business Impact:

Ignoring Underperforming Stores: Failure to identify and address underperforming stores can lead to revenue loss and decreased profitability. Neglecting these stores may result in missed opportunities to improve their performance.

Misinterpreting Customer Behavior: Misinterpreting customer behavior insights might lead to misguided marketing efforts or product offerings. This could lead to reduced customer satisfaction and a negative impact on sales.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Question-3: how are sales affected by promotional activities?

# Create a line plot to compare sales for promotional and non-promotional days
plt.figure(figsize=(12, 6))
sns.lineplot(data=sales_promo_data, x='Date', y='Sales', hue='Promo', palette='Set1')
plt.xlabel('Date')
plt.ylabel('Average Sales')
plt.title('Average Sales during Promotional and Non-Promotional Days')
plt.legend(title='Promo', labels=['Non-Promo', 'Promo'])
plt.show()

##### 1. Why did you pick the specific chart?

The line plot is a suitable choice for visualizing the impact of promotional activities on sales because it can present time-series data, facilitate comparison, and highlight trends over time. It gives a clear picture of how promotional activities influence sales, which makes it an effective chart for communicating insights to business stakeholders and decision-makers.
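
Daily data can make a line plot noisy and slow to draw. One option, sketched here on hypothetical data (`toy` is made up), is to aggregate to monthly averages per Promo value before plotting; this is an illustration and not the aggregation used in the chart above.

```python
import pandas as pd

# Hypothetical daily sales records
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2013-01-02", "2013-01-09", "2013-01-16",
                            "2013-02-05", "2013-02-12", "2013-02-19"]),
    "Promo": [0, 1, 1, 0, 0, 1],
    "Sales": [4000, 7000, 9000, 5000, 3000, 8000],
})

# Average sales per calendar month and Promo value, one column per Promo value
monthly = (
    toy.groupby([toy["Date"].dt.to_period("M"), "Promo"])["Sales"]
       .mean()
       .unstack("Promo")
)
print(monthly)
```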

##### 2. What is/are the insight(s) found from the chart?

Average sales peaked for both promo and non-promo days between 2013-09 and 2014-01.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Continuous monitoring and evaluation of promotional activities, along with periodic refinements based on performance data, are essential for sustained positive growth. Data-driven decision-making and a customer-centric approach are key to leveraging insights for positive business impact.

#### Chart - 4

In [None]:
#Chart-4 Visualization Code
#Question-4: How does the average sales vary with respect to different store types?

# Create a bar plot with error bars to compare average sales for different store types
plt.figure(figsize=(8, 6))
sns.barplot(data=store_sales_data, x='StoreType', y='Sales', ci='sd')
plt.errorbar(x=avg_sales_by_store_type.index, y=avg_sales_by_store_type, yerr=std_sales_by_store_type, fmt='none', color='black', capsize=5)
plt.xlabel('Store Type')
plt.ylabel('Average Sales')
plt.title('Average Sales Variation by Store Type')
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot with error bars is used here to visualize the average sales for different store types while also representing the uncertainty or variability in these average sales values. This is particularly useful when dealing with aggregated data, such as average sales by store type, where we want to show not only the central tendency (average) but also the spread or dispersion of the data.
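
The bar heights and error bars correspond to a per-group mean and standard deviation. A small sketch on made-up numbers (`toy` is hypothetical) shows both statistics computed in one pass.

```python
import pandas as pd

# Hypothetical sales records for two store types
toy = pd.DataFrame({
    "StoreType": [0, 0, 0, 1, 1, 1],
    "Sales": [4000, 5000, 6000, 2000, 10000, 18000],
})

# Mean gives the bar height, std (sample standard deviation) the error bar length
summary = toy.groupby("StoreType")["Sales"].agg(["mean", "std"])
print(summary)
```

A store type can have the higher mean and also the much larger spread, which is why the two are shown together.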

##### 2. What is/are the insight(s) found from the chart?

Store type 1 has the highest average sales of all store types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

To ensure a positive business impact, it is crucial to interpret the insights in the context of the overall business strategy and goals. Businesses should use the insights to inform their decisions, implement targeted strategies, and continuously evaluate performance.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Question-5: What is the correlation between sales and the competition distance?
# Create a scatter plot to visualize the relationship between Sales and CompetitionDistance
plt.figure(figsize=(8, 6))
sns.scatterplot(data=sales_competition_data, x='CompetitionDistance', y='Sales')
plt.xlabel('Competition Distance')
plt.ylabel('Sales')
plt.title(f'Correlation between Sales and Competition Distance\nCorrelation Coefficient: {correlation_coefficient:.2f}')
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot allows us to visualize the pattern of data points and identify any potential correlation between sales and competition distance. The correlation coefficient gives us a numerical measure of the strength and direction of the relationship. A positive correlation coefficient indicates a positive relationship, while a negative coefficient indicates a negative relationship. A value close to 0 suggests a weak or no correlation.
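
For reference, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical values (`toy` is made up) and compares the result with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical store-level values
toy = pd.DataFrame({
    "CompetitionDistance": [100, 500, 1000, 5000, 20000],
    "Sales": [6000, 5500, 7000, 5800, 6200],
})
x = toy["CompetitionDistance"].to_numpy(dtype=float)
y = toy["Sales"].to_numpy(dtype=float)

# Pearson r = sum of co-deviations / sqrt(product of the sums of squared deviations)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
r_pandas = toy["CompetitionDistance"].corr(toy["Sales"])
print(np.isclose(r_manual, r_pandas))
```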

##### 2. What is/are the insight(s) found from the chart?

A correlation coefficient of -0.0189 indicates a weak or negligible linear relationship between sales and competition distance in the dataset. For business decision-making, it suggests that competition distance alone may not be a strong predictor of sales performance, and other factors might have a more significant impact on sales in the retail stores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The weak correlation between sales and competition distance highlights the importance of a holistic approach to improving sales performance. While competition distance is one of the many factors that can influence sales, relying solely on this factor might not lead to significant business impact. Instead, businesses should adopt a comprehensive strategy, focusing on customer preferences, competitive advantage, and store-specific factors to drive positive growth.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Question-6: How does sales vary across different days of the week?

# Create a bar plot to compare average sales across different days of the week
plt.figure(figsize=(8, 6))
sns.barplot(data=average_sales_by_day, x='DayOfWeek', y='Sales')
plt.xlabel('Day of Week')
plt.ylabel('Average Sales')
plt.title('Average Sales Variation Across Different Days of the Week')
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot is a suitable choice for visualizing how average sales vary across the days of the week because it represents categorical data well and makes comparisons between days easy. It lets us observe the distribution of sales over the days and identify patterns or trends that may inform business decisions on marketing strategies, promotions, or staffing based on the sales performance on specific days of the week.
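
Day labels are only as reliable as the mapping from day codes to names. One way to avoid a hand-written mapping is to derive the names from the dates themselves; the sketch below does this on a hypothetical week of dates (`toy` is made up) and assumes a 1-7 coding with 1 = Monday, which should be confirmed against the real `Date` and `DayOfWeek` columns.

```python
import pandas as pd

# One hypothetical week of dates, starting on a Monday
toy = pd.DataFrame({"Date": pd.date_range("2015-07-27", periods=7, freq="D")})

# pandas numbers Monday as 0, so add 1 to get a 1-7 coding with 1 = Monday
toy["DayOfWeek"] = toy["Date"].dt.dayofweek + 1
# Day names taken directly from the dates, with no manual mapping
toy["DayName"] = toy["Date"].dt.day_name()
print(toy)
```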

##### 2. What is/are the insight(s) found from the chart?

Sales are lowest on Sunday and highest on Monday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimized Resource Allocation: Knowing that Sunday has the lowest sales, businesses can adjust their resource allocation accordingly. For example, they may reduce staffing levels or operational expenses on Sundays when the demand is lower, leading to cost savings.

Targeted Marketing Strategies: Understanding the sales variation across different days of the week can inform targeted marketing strategies. Businesses can focus their promotions and advertising efforts on other days with higher sales potential, such as weekdays or Saturdays, to capitalize on peak demand.

Enhanced Promotional Planning: Businesses can plan promotions and discounts strategically to boost sales on Sundays and attract more customers during traditionally slower periods. Special Sunday-only offers or exclusive deals could entice customers to visit the store on Sundays.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Question-7: What is the effect of school holidays on sales?
# Create a bar plot to compare average sales during school holidays and non-school holidays
plt.figure(figsize=(6, 6))
sns.barplot(data=average_sales_by_school_holiday, x='SchoolHoliday', y='Sales')
plt.xlabel('School Holiday')
plt.ylabel('Average Sales')
plt.title('Effect of School Holidays on Sales')
plt.show()

##### 1. Why did you pick the specific chart?

The 'SchoolHoliday' column is a categorical variable with two categories: 0 (Non-School Holiday) and 1 (School Holiday). Bar plots are commonly used to represent and compare data in categories, making them suitable for visualizing average sales for each category.

##### 2. What is/are the insight(s) found from the chart?

Sales are slightly higher on school holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimized Promotional Strategies: With higher sales during school holidays, businesses can focus on optimizing their promotional strategies during these periods. They can plan targeted promotions and special offers to attract more customers and maximize sales during school holiday seasons.

Staffing and Inventory Management: The increased sales during school holidays may necessitate adjustments in staffing and inventory management. Businesses can schedule more staff during peak periods to handle higher customer demands and ensure sufficient stock availability to meet increased sales.

Marketing and Advertising: The insight regarding higher sales during school holidays can guide marketing and advertising efforts. Businesses can allocate more resources to advertising campaigns specifically designed for school holiday shoppers, reaching out to the right audience and driving higher footfall.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Question-8: is there any correlation between features?

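# Calculate the correlation matrix from the numerical columns only
# (calling .corr() on merged_df directly can fail in recent pandas versions because of non-numeric columns)
correlation_matrix = numerical_vars.corr()
plt.figure(figsize=(18,8))
# Plot signed values so that positive and negative correlations can be told apart
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()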

##### 1. Why did you pick the specific chart?

The heatmap was selected because it represents the correlation matrix effectively and lets us visually identify patterns and relationships between the numerical variables in the merged_df DataFrame.
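
A correlation matrix is only defined for numerical columns, so non-numeric columns have to be excluded first. The sketch below shows this on a hypothetical frame (`toy` is made up) that contains a mixed-type column similar to PromoInterval after imputation.

```python
import pandas as pd

# Hypothetical data with one non-numeric (mixed string/int) column
toy = pd.DataFrame({
    "Sales": [5000, 6000, 0, 7500, 8200],
    "Customers": [500, 610, 0, 720, 790],
    "Promo": [0, 1, 0, 1, 1],
    "PromoInterval": ["Jan,Apr,Jul,Oct", 0, 0, "Feb,May,Aug,Nov", 0],
})

# Keep only numerical columns, then compute pairwise Pearson correlations
numeric_only = toy.select_dtypes(include="number")
corr = numeric_only.corr()
print(corr.round(2))
```

The result is symmetric with ones on the diagonal, and its sign tells whether two features move together or in opposite directions.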

##### 2. What is/are the insight(s) found from the chart?

As shown by the heatmap, most of the features are negatively correlated with each other.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact: Insights that reveal positive correlations or relationships between certain variables and sales can have a positive business impact. For example, if the analysis shows that promotions positively impact sales, the business can focus on running more effective promotional campaigns to increase revenue. Similarly, if the correlation heatmap highlights that specific product assortments lead to higher sales, the business can adjust its inventory strategy accordingly.

Negative Growth Considerations: On the other hand, insights indicating negative correlations or relationships can highlight areas where improvements are needed. For instance, if there is a negative correlation between sales and competition distance, it may suggest that stores located farther from competitors might face challenges in attracting customers. In such cases, the business might consider strategies like targeted marketing, customer loyalty programs, or store location optimization.
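Because the heatmap above plots absolute values, the direction of each relationship is easier to check in a small table. The following is a minimal sketch, not the notebook's own code: the helper name `signed_correlations`, the synthetic `demo_df`, and the `Label` column are assumptions made for illustration; in the notebook the function would be called on `merged_df`.

```python
import numpy as np
import pandas as pd


def signed_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Signed Pearson correlation of every numeric column with `target`,
    sorted from most negative to most positive."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values()


# Synthetic stand-in for merged_df (values are made up for illustration only)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Customers": rng.integers(200, 1500, size=500),
    "Promo": rng.integers(0, 2, size=500),
    "CompetitionDistance": rng.uniform(50, 20000, size=500),
    "Label": rng.choice(list("abcd"), size=500),  # hypothetical non-numeric column, ignored
})
demo_df["Sales"] = (
    8 * demo_df["Customers"] + 500 * demo_df["Promo"] + rng.normal(0, 300, size=500)
)

sales_corr = signed_correlations(demo_df, "Sales")
print(sales_corr)
```

Sorting the signed values makes it straightforward to see which features move with sales and which move against it, something the absolute-value heatmap cannot show.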

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Question-9: relation between promo vs sales vs customers?

# Create a scatter plot to visualize the relationship between 'Promo', 'Sales', and 'Customers'
plt.figure(figsize=(10, 6))
sns.scatterplot(data=promo_sales_customers_data, x='Sales', y='Customers', hue='Promo', palette='coolwarm', alpha=0.7)
plt.xlabel('Sales')
plt.ylabel('Customers')
plt.title('Promo vs. Sales vs. Customers')
plt.show()

##### 1. Why did you pick the specific chart?

We use a scatter plot to visualize the relationship between "Promo," "Sales," and "Customers" because it is suitable for analyzing the correlation between two numerical variables (in this case, "Sales" and "Customers") and how it is influenced by a categorical variable (in this case, "Promo").

##### 2. What is/are the insight(s) found from the chart?

Sales rise as the number of customers rises, and this is most visible on days when a promotion is running (Promo = 1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the scatter plot shows a clear positive relationship between promotions, sales, and customers, it suggests that promotions lead to increased sales and higher customer footfall. In that case, businesses can use this insight to plan and optimize promotions that boost revenue and customer engagement. For instance, offering attractive discounts or special deals during promotions might encourage more customers to make purchases, resulting in positive growth for the business.
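The visual impression from the scatter plot can be backed with a simple numeric summary. The sketch below is illustrative only: `promo_summary` and the small `demo` frame are assumptions, and in the notebook the function would be applied to the DataFrame used for the chart.

```python
import pandas as pd


def promo_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Average Sales and Customers for promo and non-promo days."""
    summary = df.groupby("Promo")[["Sales", "Customers"]].mean()
    # Ratio of the two group means (not the mean of per-row ratios)
    summary["SalesPerCustomer"] = summary["Sales"] / summary["Customers"]
    return summary


# Made-up values standing in for the real data
demo = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Sales": [4000, 5000, 6000, 7000, 8000, 9000],
    "Customers": [500, 600, 700, 700, 800, 900],
})

summary = promo_summary(demo)
print(summary)
```

Comparing the two rows shows whether promotions bring in more customers, raise the spend per customer, or both.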

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Based on our EDA, we now state three hypotheses, test each one by computing a p-value, and draw a final conclusion for each.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: There is no correlation between the presence of promotions (Promo) and the number of customers (Customers) visiting the stores.

**Alternate Hypothesis (H1)**: There is a positive correlation between the presence of promotions (Promo) and the number of customers (Customers) visiting the stores.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Statement 1: Correlation between 'Promo' and 'Customers'
corr_promo_customers, p_value_1 = stats.pearsonr(merged_df['Promo'], merged_df['Customers'])

# Print the correlation coefficient and p-value
print("Statement 1: Correlation between Promo and Customers")
print(f"Correlation coefficient: {corr_promo_customers:.4f}, p-value: {p_value_1:.4f}\n")

The correlation coefficient (0.3162) indicates a positive correlation between the presence of promotions (Promo) and the number of customers (Customers) visiting the stores. Because Promo is a 0/1 flag, this means stores tend to see more customers on days when a promotion is running.

The p-value is printed as 0.0000, meaning it is below 0.00005 when rounded to four decimal places. The correlation between "Promo" and "Customers" is therefore statistically significant, and we reject the null hypothesis in favor of the alternate hypothesis.

##### Which statistical test have you done to obtain P-Value?

To test the correlation between "Promo" and "Customers," I used the Pearson correlation coefficient, computed with the `stats.pearsonr()` function from the `scipy.stats` module, which returns both the coefficient and its p-value.

##### Why did you choose the specific statistical test?

I chose the Pearson correlation coefficient to analyze the relationship between "Promo" and "Customers" because we want to measure the strength and direction of their linear association. "Customers" is a numerical variable, while "Promo" is a binary (0/1) indicator; Pearson's coefficient computed with a binary variable is equivalent to the point-biserial correlation, so the test remains appropriate.

The Pearson correlation coefficient is a widely used measure of the strength and direction of a linear relationship between two numerical variables.
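The alternate hypothesis is directional (a positive correlation), while `stats.pearsonr()` reports a two-sided p-value by default. The sketch below shows one way a one-sided test could be run, assuming SciPy 1.9 or later, where `pearsonr` accepts an `alternative` argument. The data is synthetic and only stands in for `merged_df['Promo']` and `merged_df['Customers']`; it also checks that the point-biserial coefficient matches Pearson's r when one variable is a 0/1 flag.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (made up for illustration only)
rng = np.random.default_rng(42)
promo = rng.integers(0, 2, size=2000)                      # 0/1 promotion flag
customers = 600 + 150 * promo + rng.normal(0, 200, size=2000)

# Default two-sided test: H1 is "correlation != 0"
r_two_sided, p_two_sided = stats.pearsonr(promo, customers)

# One-sided test matching H1 "correlation > 0" (requires SciPy >= 1.9)
r_one_sided, p_one_sided = stats.pearsonr(promo, customers, alternative="greater")

# Point-biserial correlation: same coefficient when one variable is binary
r_pb, p_pb = stats.pointbiserialr(promo, customers)

print(f"r = {r_two_sided:.4f}")
print(f"two-sided p = {p_two_sided:.3e}, one-sided p = {p_one_sided:.3e}")
print(f"point-biserial r = {r_pb:.4f}")
```

Printing the p-value in scientific notation also avoids the rounded `0.0000` that the `.4f` format produces for very small values.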

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: There is no correlation between the presence of promotions (Promo) and the total sales (Sales) generated by the stores.

**Alternate Hypothesis (H1)**: There is a positive correlation between the presence of promotions (Promo) and the total sales (Sales) generated by the stores.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Statement 2: Correlation between 'Promo' and 'Sales'
corr_promo_sales, p_value_2 = stats.pearsonr(merged_df['Promo'], merged_df['Sales'])
print("Statement 2: Correlation between Promo and Sales")
print(f"Correlation coefficient: {corr_promo_sales:.4f}, p-value: {p_value_2:.4f}\n")

Correlation Coefficient: The correlation coefficient (0.4523) indicates a moderate positive correlation between the presence of promotions (Promo) and the total sales (Sales) generated by the stores. Because Promo is a 0/1 flag, this means sales tend to be higher on days when a promotion is running.

p-value: The p-value is printed as 0.0000, meaning it is below 0.00005 when rounded to four decimal places. The correlation between "Promo" and "Sales" is therefore statistically significant, and we reject the null hypothesis in favor of the alternate hypothesis.

##### Which statistical test have you done to obtain P-Value?

To test the correlation between "Promo" and "Sales," I used the Pearson correlation coefficient, computed with the `stats.pearsonr()` function from the `scipy.stats` module, which returns both the coefficient and its p-value.

##### Why did you choose the specific statistical test?

I chose the Pearson correlation coefficient to analyze the relationship between "Promo" and "Sales" because we want to measure the strength and direction of their linear association. "Sales" is a numerical variable, while "Promo" is a binary (0/1) indicator; Pearson's coefficient computed with a binary variable is equivalent to the point-biserial correlation, so the test remains appropriate.

The Pearson correlation coefficient is a widely used measure of the strength and direction of a linear relationship between two numerical variables.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)**: There is no correlation between the distance to the nearest competitor (CompetitionDistance) and the total sales (Sales) generated by the stores.

**Alternate Hypothesis (H1)**: There is a negative correlation between the distance to the nearest competitor (CompetitionDistance) and the total sales (Sales) generated by the stores.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Statement 3: Correlation between 'CompetitionDistance' and 'Sales'
corr_competition_distance_sales, p_value_3 = stats.pearsonr(merged_df['CompetitionDistance'], merged_df['Sales'])

print("Statement 3: Correlation between CompetitionDistance and Sales")
print(f"Correlation coefficient: {corr_competition_distance_sales:.4f}, p-value: {p_value_3:.4f}\n")

Correlation Coefficient: The correlation coefficient (-0.0189) is negative but very close to zero. This indicates a very weak negative correlation between the distance to the nearest competitor (CompetitionDistance) and the total sales (Sales) generated by the stores: as the distance to the nearest competitor increases, total sales decrease only marginally.

p-value: The p-value is printed as 0.0000 (below 0.00005 after rounding), so the correlation between "CompetitionDistance" and "Sales" is statistically significant and we reject the null hypothesis. Statistical significance does not imply practical importance, however: with a very large number of observations even a tiny correlation yields a small p-value, and a coefficient of -0.0189 corresponds to a negligible linear relationship.
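To see why such a small coefficient can still be statistically significant, it helps to look at how the p-value depends on the sample size. The sketch below uses the standard t-test for a Pearson correlation; the helper `pearson_p_value` and the two sample sizes are assumptions chosen for illustration, not values taken from the dataset.

```python
import math

from scipy import stats


def pearson_p_value(r: float, n: int) -> float:
    """Two-sided p-value for a Pearson correlation r observed on n samples,
    using t = r * sqrt((n - 2) / (1 - r**2)) with n - 2 degrees of freedom."""
    t_stat = r * math.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t_stat), df=n - 2)


r = -0.0189  # coefficient reported above
print(f"share of variance explained (r^2): {r ** 2:.6f}")

# Hypothetical sample sizes, for illustration only
for n in (1_000, 1_000_000):
    print(f"n = {n:>9,}: p-value = {pearson_p_value(r, n):.3e}")
```

The same coefficient is far from significant on a small sample and highly significant on a very large one, while the share of variance it explains stays negligible in both cases.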

##### Which statistical test have you done to obtain P-Value?

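To test the correlation between "CompetitionDistance" and "Sales," I used the Pearson correlation coefficient, computed with the `stats.pearsonr()` function from the `scipy.stats` module, which returns both the coefficient and its p-value.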

##### Why did you choose the specific statistical test?

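As in the previous two statements, I chose the Pearson correlation coefficient because we want to measure the strength and direction of the linear relationship between two numerical variables, here "CompetitionDistance" and "Sales".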

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
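Since the goal stated in the project summary is to forecast sales up to six weeks ahead, one option worth considering is a time-based split rather than a random one, so that the model is always evaluated on dates later than those it was trained on. The sketch below is only an illustration under assumptions: the `Date` column name, the `time_based_split` helper, and the 42-day holdout are not taken from the notebook.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, date_col: str = "Date", holdout_days: int = 42):
    """Hold out the final `holdout_days` days (42 days = six weeks) as the test set."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=holdout_days)
    train = df[dates <= cutoff]
    test = df[dates > cutoff]
    return train, test


# Made-up daily data standing in for the real DataFrame
demo = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=100, freq="D"),
    "Sales": range(100),
})

train_df, test_df = time_based_split(demo)
print(len(train_df), len(test_df))
```

A random split (for example an 80/20 ratio) is simpler, but it lets the model see days that fall after some of the test days, which can make the evaluation look better than real forecasting performance.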

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
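The save-and-reload steps above can be sketched as follows. This is a minimal illustration under assumptions, not the project's final model: a `LinearRegression` fitted on synthetic data stands in for the best model, and the file name `best_model.joblib` and the temporary directory are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data standing in for the engineered feature matrix
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, size=200)
model = LinearRegression().fit(X, y)

# 1. Save the fitted model to disk (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(model, model_path)

# 2. Load it again and predict on rows the model has not seen
loaded_model = joblib.load(model_path)
X_unseen = rng.uniform(0, 1, size=(5, 3))
same_predictions = np.allclose(model.predict(X_unseen), loaded_model.predict(X_unseen))
print("Predictions match after reload:", same_predictions)
```

Checking that the reloaded model reproduces the original predictions is a quick sanity check that serialization worked before the file is handed over for deployment.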

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***