<a href="https://colab.research.google.com/github/rupalidawkoregithub/Rossmann_Retail_Sales_Prediction_Capstone_Project/blob/main/Retail_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Retail Sales Prediction**



##### **Project Type**    -   Regression
##### **Contribution**    -   Team
##### **Team Member 1 -**   Rupali Dawkore
##### **Team Member 2 -**
##### **Team Member 3 -**


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.rcParams.update({'figure.figsize':(8,5),'figure.dpi':100})
from datetime import datetime

import warnings    
warnings.filterwarnings('ignore')

# import sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LassoLars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import ElasticNet
     

### Mount the Drive and Import the Dataset

In [None]:
# Mount Google Drive so the dataset can be imported
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
df1 = pd.read_csv('/content/drive/MyDrive/ML Regression /Rossmann Stores Data.csv')
df2 = pd.read_csv('/content/drive/MyDrive/ML Regression /store.csv')

**Merge the sales data (`df1`) with the store data (`df2`) on the 'Store' column, which is common to both CSV files.**

In [None]:
df = pd.merge(df1,df2, on='Store', how='left')
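
A left merge like the one above keeps every sales row, but it can silently duplicate rows if the store table contains a repeated `Store` id. The following is a minimal sketch, on made-up toy frames (`sales` and `stores` are hypothetical stand-ins for the two CSV files), of how the `validate` and `indicator` arguments of `pd.merge` can be used to check the join:

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5263, 6064, 8314, 0]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if a store id repeats in `stores`
merged = pd.merge(sales, stores, on="Store", how="left",
                  validate="many_to_one", indicator=True)

# Rows marked "left_only" had no matching store record
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, "unmatched rows:", unmatched)
```

If the row count after the merge equals the row count of the sales table and no rows are unmatched, the join behaved as intended.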

### Dataset First View

In [None]:
# Dataset First Look 
df.head()

In [None]:
df.tail()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
df.describe()

#### Duplicate Values

"Duplication" just means that you have repeated data in your dataset. This could be due to things like data entry errors or data collection methods. 

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

There are no duplicate rows in the **Rossmann Store Dataset**.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In the Rossmann store data, out of **1017209 entries**, the following columns have missing values:

**CompetitionDistance** - distance in meters to the nearest competitor store. The distribution plot gives an idea of the distances at which competitor stores are generally located, and the values are imputed accordingly.

**CompetitionOpenSinceMonth** - the approximate month in which the nearest competitor opened. The mode of the column gives the most frequently occurring month.

**CompetitionOpenSinceYear** - the approximate year in which the nearest competitor opened. The mode of the column gives the most frequently occurring year.

**Promo2SinceWeek, Promo2SinceYear and PromoInterval** are NaN wherever Promo2 is 0 or False, as can be seen in the first look at the dataset. They can be replaced with 0.
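
The raw null counts are easier to judge as percentages. Below is a small sketch on a made-up frame (`missing_summary` and `toy` are hypothetical names, not part of the notebook) that reports the count and share of missing values per column, and checks the claim that a Promo2 column is NaN exactly where `Promo2` is 0:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up rows that mimic the structure of the merged data
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0, 14130.0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, np.nan],
    "Promo2": [0, 1, 0, 0],
})
report = missing_summary(toy)
print(report)

# True when Promo2SinceWeek is missing exactly on the rows where Promo2 == 0
promo_nan_matches = bool((toy["Promo2SinceWeek"].isnull() == (toy["Promo2"] == 0)).all())
print(promo_nan_matches)
```

Running the same check on the real `df` would confirm whether filling the Promo2 columns with 0 is justified.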

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap for Rossmann dataset
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

In [None]:
df['DayOfWeek'].value_counts()

In [None]:
df['Open'].value_counts()

In [None]:
df['Promo'].value_counts()

In [None]:
df['StateHoliday'].value_counts()

In [None]:
df['SchoolHoliday'].value_counts()

In [None]:
df['StoreType'].value_counts()

In [None]:
df['Assortment'].value_counts()

In [None]:
df['CompetitionOpenSinceMonth'].value_counts()

In [None]:
df['CompetitionOpenSinceYear'].value_counts()

In [None]:
df['Promo2'].value_counts()

### Variables Description 

**Rossmann Stores Data.csv** - historical data including Sales.

**store.csv** - supplemental information about the stores.


### Data fields

  
**1. Id** - an Id that represents a (Store, Date) duple within the set

**2. Store**- a unique Id for each store

**3. Sales** - the turnover for any given day (Dependent Variable)

**4. Customers** - the number of customers on a given day

**5. Open** - an indicator for whether the store was open: 0 = closed, 1 = open

**6. StateHoliday** - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

**7. SchoolHoliday** - indicates if the (Store, Date) was affected by the closure of public schools

**8. StoreType** - differentiates between 4 different store models: a, b, c, d

**9. Assortment** - describes an assortment level: a = basic, b = extra, c = extended. An assortment strategy in retailing involves the number and type of products that stores display for purchase by consumers.

**10. CompetitionDistance** - distance in meters to the nearest competitor store

**11. CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened

**12. Promo** - indicates whether a store is running a promo on that day

**13. Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

**14. Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2

**15. PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store
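
As a hedged illustration of how `PromoInterval` could be turned into a usable feature, the sketch below flags whether a row's month is one of the months in which a new Promo2 round starts. The helper name `promo2_starts_this_month`, the new column `Promo2StartMonth`, and the month abbreviations (including "Sept") are assumptions that should be checked against the actual values in the data:

```python
import pandas as pd

# Assumed month abbreviations; verify them against df['PromoInterval'].unique()
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]

def promo2_starts_this_month(date, promo_interval):
    """Return 1 if the month of `date` is listed in `promo_interval`, else 0."""
    if not isinstance(promo_interval, str):  # NaN or 0 means no Promo2
        return 0
    return int(MONTH_ABBR[date.month - 1] in promo_interval.split(","))

# Made-up rows for illustration
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-02-10", "2015-03-10", "2015-09-01"]),
    "PromoInterval": ["Feb,May,Aug,Nov", "Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec"],
})
toy["Promo2StartMonth"] = [
    promo2_starts_this_month(d, p) for d, p in zip(toy["Date"], toy["PromoInterval"])
]
print(toy)
```

Splitting on commas rather than testing for a substring avoids accidental partial matches between abbreviations.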



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable in Rossmann dataset.
# No.of Stores in the Dataset
print("Store nunique",df.Store.nunique())
# No.of Sales in the Dataset
print("Sales nunique",df.Sales.nunique())
# No.of DayOfWeek in the Dataset
print("DayOfWeek nunique",df.DayOfWeek.nunique())
# No.of Date in the Dataset
print("Date nunique",df.Date.nunique())
# No.of Customers in the Dataset
print("Customers nunique",df.Customers.nunique())
# No.of Open in the Dataset
print("Open nunique",df.Open.nunique())
# No.of Promo in the Dataset
print("Promo nunique",df.Promo.nunique())
# No.of StateHoliday in the Dataset
print("StateHoliday nunique",df.StateHoliday.nunique())
# No.of SchoolHoliday in the Dataset
print("SchoolHoliday nunique",df.SchoolHoliday.nunique())
# No.of StoreType in the Dataset
print("StoreType nunique",df.StoreType.nunique())
# No.of Assortment in the Dataset
print("Assortment nunique",df.Assortment.nunique())
# No.of CompetitionDistance in the Dataset
print("CompetitionDistance nunique",df.CompetitionDistance.nunique())
# No.of CompetitionOpenSinceMonth in the Dataset
print("CompetitionOpenSinceMonth nunique",df.CompetitionOpenSinceMonth.nunique())
# No.of CompetitionOpenSinceYear in the Dataset
print("CompetitionOpenSinceYear nunique",df.CompetitionOpenSinceYear.nunique())
# No.of Promo2 in the Dataset
print("Promo2 nunique",df.Promo2.nunique())
# No.of Promo2SinceWeek in the Dataset
print("Promo2SinceWeek nunique",df.Promo2SinceWeek.nunique())
# No.of Promo2SinceYear in the Dataset
print("Promo2SinceYear nunique",df.Promo2SinceYear.nunique())
# No.of PromoInterval in the Dataset
print("PromoInterval nunique",df.PromoInterval.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# Distribution plot of competition distance
plt.figure(figsize=(7,4))
sns.distplot(df['CompetitionDistance'],color='blue')
plt.legend(['CompetitionDistance'])
#plt.xlabel('Competition Distance Distribution Plot')
plt.show()

Most of the CompetitionDistance values are concentrated on the left and the distribution has a long right tail, i.e. it is right-skewed. The median is more robust to outliers than the mean, so it is used to fill the missing values.
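
To illustrate why the median is the safer fill value for a right-skewed column, here is a minimal sketch on made-up distances (the numbers are hypothetical, not taken from the dataset). A single very distant competitor pulls the mean far above the typical value, while the median barely moves:

```python
import numpy as np
import pandas as pd

# Made-up distances in meters; one store has a very distant competitor
distance = pd.Series([250.0, 400.0, 900.0, 1500.0, 2300.0, 75000.0, np.nan])

mean_value = distance.mean()      # NaN is skipped by default
median_value = distance.median()
print(f"mean={mean_value:.1f}, median={median_value:.1f}")

# Fill the missing entry with the median and confirm nothing is left missing
imputed = distance.fillna(median_value)
print(imputed.isnull().sum())
```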

In [None]:
# Calculate mean value
Mean_value = df['CompetitionDistance'].mean()
Mean_value

In [None]:
# Calculate median value
Median_value =  df['CompetitionDistance'].median()
Median_value

In [None]:
# Calculate mode value
Mode_value =  df['CompetitionDistance'].mode()
Mode_value

In [None]:
# Filling CompetitionDistance with the median value
# (assigning the result back avoids chained in-place modification)
df['CompetitionDistance'] = df['CompetitionDistance'].fillna(Median_value)

In [None]:
# Checking whether the null values have been replaced
df['CompetitionDistance'].isnull().value_counts()

In [None]:
# Distribution plot of Competition Open Since Month
plt.figure(figsize=(7,7))
sns.distplot(df['CompetitionOpenSinceMonth'],color='green')
plt.legend(['CompetitionOpenSinceMonth'])
plt.xlabel('Competition Open Since Month Distribution Plot')
plt.show()

From the above plot, the values of CompetitionOpenSinceMonth appear left-skewed, so the missing values are filled with the mode, i.e. the most frequently occurring month.

In [None]:
# Filling CompetitionOpenSinceMonth with its own mode (most frequent month)
Mode_value = df['CompetitionOpenSinceMonth'].mode()
df['CompetitionOpenSinceMonth'] = df['CompetitionOpenSinceMonth'].fillna(Mode_value[0])
# Checking whether the null values have been replaced
df['CompetitionOpenSinceMonth'].isnull().value_counts()

In [None]:
# Distribution plot of Competition Open Since Year
plt.figure(figsize=(7,4))
sns.distplot(df['CompetitionOpenSinceYear'],color='orange')
plt.legend(['CompetitionOpenSinceYear'])
plt.xlabel('Competition Open Since Year Distribution Plot')
plt.show()

From the above plot, the values of CompetitionOpenSinceYear appear left-skewed, so the missing values are filled with the mode, i.e. the most frequently occurring year.

In [None]:
# Calculate mean value
Mean_value = df['CompetitionOpenSinceYear'].mean()
print("Mean Value =",Mean_value)
# Calculate median value
Median_value = df['CompetitionOpenSinceYear'].median()
print("Median Value =",Median_value)
# Calculate mode value
Mode_value = df['CompetitionOpenSinceYear'].mode()
print("Mode Value =",Mode_value)

In [None]:
# Filling CompetitionOpenSinceYear with its mode (most frequent year)
df['CompetitionOpenSinceYear'] = df['CompetitionOpenSinceYear'].fillna(Mode_value[0])
# Checking whether the null values have been replaced
df['CompetitionOpenSinceYear'].isnull().value_counts()

In [None]:
# Imputing the NaN values of the Promo2-related columns with 0
df['Promo2SinceWeek'] = df['Promo2SinceWeek'].fillna(0)
df['Promo2SinceYear'] = df['Promo2SinceYear'].fillna(0)
df['PromoInterval'] = df['PromoInterval'].fillna(0)
     

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.shape

**Checking Categorical columns** 

In [None]:
# Select the columns that are still stored as object (string) dtype
categorical_variable = df.select_dtypes(include='object')
categorical_variable

**Changing different dtypes to int type**

In [None]:
# code for changing format of date from object to datetime
df['Date'] = pd.to_datetime(df['Date'], format= '%Y-%m-%d')

In [None]:
print(df['Date'].min(),'Starting Date')
print(df['Date'].max(),'Ending Date')

In [None]:
# extract year, month, day and week of year from "Date"
#df['Date']=pd.to_datetime(df['Date'])
#df['Year'] = df['Date'].apply(lambda x: x.year)
#df['Month'] = df['Date'].apply(lambda x: x.month)
#df['Day'] = df['Date'].apply(lambda x: x.day)
#df['WeekOfYear'] = df['Date'].apply(lambda x: x.weekofyear)

In [None]:
#df.sort_values(by=['Date','Store'],inplace=True,ascending=[False,True])
#df.head(2)
#df.info()

***The starting and ending dates tell us that we have almost 3 years of data.***

**Converting StateHoliday to Numeric Values**

In [None]:
df['StateHoliday'].value_counts()

In [None]:
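# Normalize the string '0' to the integer 0 so that "no holiday" has a single code
df['StateHoliday'] = df['StateHoliday'].replace(['0'],0)

In [None]:
# Encode the holiday types as integers: a = 1, b = 2, c = 3
df['StateHoliday'] = df['StateHoliday'].replace({"a":1, "b":2, "c":3})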

In [None]:
df.StateHoliday.unique()

In [None]:
df['StateHoliday'].value_counts()

**Converting StoreType to Numeric Values**

In [None]:
df.StoreType.unique()

In [None]:
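# Defensive step: map a string '0' to 0 (StoreType is documented as a, b, c, d only)
df['StoreType'] = df['StoreType'].replace(['0'],0)

In [None]:
# Encode the store types as integers: a = 0, b = 1, c = 2, d = 3
df['StoreType'] = df['StoreType'].replace({"a":0, "b":1, "c":2,"d":3})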

In [None]:
df['StoreType'].value_counts()

**Converting Assortment to Numeric Values**

In [None]:
df.Assortment.unique()

In [None]:
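# Defensive step: map a string '0' to 0 (Assortment is documented as a, b, c only)
df['Assortment'] = df['Assortment'].replace(['0'],0)

In [None]:
# Encode the assortment levels as integers: a = 0, b = 1, c = 2
df['Assortment'] = df['Assortment'].replace({"a":0, "b":1, "c":2})

In [None]:
df['Assortment'].value_counts()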
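
Integer codes such as 0 to 3 imply an ordering that linear models will treat as meaningful. As an alternative sketch (not the approach used in this notebook), a nominal column like `StoreType` can be one-hot encoded with `pd.get_dummies`, while an ordered mapping is kept for `Assortment`. Treating basic < extra < extended as an order is an assumption here, and `toy` is made-up data:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"StoreType": ["a", "b", "c", "d", "a"],
                    "Assortment": ["a", "c", "b", "a", "c"]})

# StoreType has no natural order, so create one indicator column per level
encoded = pd.get_dummies(toy, columns=["StoreType"], prefix="StoreType", dtype=int)

# Assortment is mapped to ordered integers (assumed order: a < b < c)
encoded["Assortment"] = encoded["Assortment"].map({"a": 0, "b": 1, "c": 2})
print(encoded)
```

Tree-based models are usually less sensitive to this choice than linear or distance-based models.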

**Changing Data Type Float To Int**

In [None]:
df = df.astype({"Sales":"int","CompetitionDistance":"int","CompetitionOpenSinceMonth":"int","CompetitionOpenSinceYear":"int","Promo2SinceWeek":"int","Promo2SinceYear":"int"})

In [None]:
df.info()

Let's analyze a few variables.



**Categorical variables**

In [None]:
plt.figure(figsize=(5,3))
sns.countplot(x=df['StateHoliday'])

In [None]:
plt.figure(figsize=(5,3))
sns.countplot(x=df['Open'])
# here 0 --> closed
# here 1 --> open

In [None]:
plt.figure(figsize=(5,3))
sns.countplot(x=df['StoreType'])

In [None]:
Numeric_features = ['Date','Sales','Customers','CompetitionDistance','CompetitionOpenSinceMonth','CompetitionOpenSinceYear','Promo2SinceWeek','Promo2SinceYear']
Categorical_feature = ['Store','DayOfWeek','Open','Promo','StateHoliday','SchoolHoliday','StoreType','Assortment','Promo2','PromoInterval']

**Checking Distribution Of Dependent Variable**

In [None]:
# Dependent variable 'Sales'
plt.figure(figsize=(5,3))
sns.distplot(df['Sales'],color="orange")

In [None]:
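# Log-transform the target to reduce skew; rows with zero sales become -inf
df['Sales'] = np.log10(df['Sales'])

In [None]:
# Drop the rows whose log-sales are -inf (days with zero sales)
df.drop(df[df['Sales'] == float("-inf")].index,inplace=True)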
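
Taking `log10` of zero produces `-inf` (with a runtime warning) that then has to be cleaned up. The sketch below, on made-up data, shows two alternative ways to avoid that: filtering to positive sales before the transform, or using `np.log1p`, which maps 0 to 0. The column names `LogSales` and `Log1pSales` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up rows: two open days with sales and two days with zero sales
toy = pd.DataFrame({"Open": [1, 0, 1, 1], "Sales": [5263, 0, 8314, 0]})

# Option 1: keep only rows with positive sales, then take log10
positive = toy[toy["Sales"] > 0].copy()
positive["LogSales"] = np.log10(positive["Sales"])

# Option 2: keep every row and use log1p, which is defined at zero
toy["Log1pSales"] = np.log1p(toy["Sales"])

# The inverse transform recovers the original scale for reporting
recovered = np.power(10, positive["LogSales"])
print(recovered.round().tolist())
```

Whichever option is used, predictions made on the log scale need the matching inverse transform before they are compared with actual sales.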

In [None]:
# Dependent variable 'Sales' after the log10 transform applied above
plt.figure(figsize=(5,3))
sns.distplot(df['Sales'],color="orange")

In [None]:
# plot a histogram for each numerical feature (Date and Sales are skipped)
for col in Numeric_features[2:]:
    fig = plt.figure(figsize=(6,4))
    ax = fig.gca()
    feature = df[col]
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

In [None]:
for col in Numeric_features[2:]:
    fig = plt.figure(figsize=(6,4))
    ax = fig.gca()
    feature = np.log(df[col]+1)
    feature.hist(bins=50, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)    
    ax.set_title(col)
plt.show()

In [None]:
from scipy.stats import skew

In [None]:
"""for col in Numeric_features:
  print(col)
  print(skew(Numeric_features[col]))
  plt.figure(figsize=(4,3))
  sns.distplot(Numeric_features[col])
  plt.show()"""

In [None]:
#pd.get_dummies(category['PromoInterval'])

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# New DataFrame holding the year, month, day and week of year extracted from "Date"
copy_df = df.copy()
copy_df['Date'] = pd.to_datetime(copy_df['Date'])
copy_df['Year'] = copy_df['Date'].dt.year
copy_df['Month'] = copy_df['Date'].dt.month
copy_df['Day'] = copy_df['Date'].dt.day
# isocalendar().week gives the ISO week number
copy_df['WeekOfYear'] = copy_df['Date'].dt.isocalendar().week.astype(int)

In [None]:
copy_df.shape

In [None]:
copy_df.head(2)

In [None]:
copy_df.Year.value_counts()

**Distribution of Sales Records in Different Years**



In [None]:
#f, ax = plt.subplots(2, 3, figsize = (20,10))
# Number of records per year, sorted by year so that labels and slices line up
sizes = copy_df.Year.value_counts().sort_index()
labels = sizes.index.astype(str)
colors = ['#e6ebf1', '#063970' , '#839cb8']
explode = (0.1,0.0,0)
plt.pie(sizes,explode=explode, labels=labels, colors=colors,
        autopct='%0.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.legend()
plt.show()


As the pie chart shows, the sales records are distributed across the years as follows: **2013: 40.0 %, 2014: 36.8 %, 2015: 23.2 %**. Note that these are shares of daily records (rows), not shares of sales value.

**Sales Affected by School Holiday or Not**

In [None]:
# Count records by SchoolHoliday flag, ordered 0 then 1 to match the labels
labels = 'Not-Affected' , 'Affected'
sizes = df.SchoolHoliday.value_counts().sort_index()
colors = ['#839cb8', 'gray']
explode = (0.1, 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Sales Affected by School Holiday or Not?",fontsize=10)
plt.legend()
plt.show()

As the pie chart shows, **17.9%** of the sales records fall on days affected by a school holiday, while **82.1%** are not affected.
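
A pie built from `value_counts()` shows the share of rows in each group, which is not the same as the share of sales. As a hedged sketch on made-up numbers (`toy`, `record_share` and `sales_share` are hypothetical names), the two quantities can be computed side by side like this:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "SchoolHoliday": [0, 0, 0, 1, 1],
    "Sales": [4000, 6000, 5000, 7000, 8000],
})

# Share of records (rows) in each group, in percent
record_share = toy["SchoolHoliday"].value_counts(normalize=True).sort_index() * 100

# Share of total sales value contributed by each group, in percent
sales_share = toy.groupby("SchoolHoliday")["Sales"].sum() / toy["Sales"].sum() * 100

print(record_share.round(1).to_dict())
print(sales_share.round(1).to_dict())
```

In this toy data the holiday rows are 40 % of the records but contribute 50 % of the sales, which is the kind of difference a count-based pie chart cannot reveal.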

In [None]:
#plt.figure(figsize=(7,4))
#sns.distplot(df['Sales'],color='orange')

In [None]:
#df['Sales'].skew()

In [None]:
#df['Sales'] = np.log10(df['Sales'])

In [None]:
#df.dropna(inplace=True)

In [None]:
#df.drop(df[df['Sales'] == float("-inf")].index,inplace=True)

In [None]:
#distribution plot of Sales
#plt.figure(figsize=(7,4))
#sns.distplot(x=df['Sales'],color='orange')

In [None]:
#df['Sales'].skew()

In [None]:
#df.shape

In [None]:
df['CompetitionOpenSinceYear'].unique()

In [None]:
df['CompetitionOpenSinceYear'].value_counts()

In [None]:
df['CompetitionOpenSinceYear'].value_counts().min()

In [None]:
df['CompetitionOpenSinceYear'].value_counts().max()

##### 1. Why did you pick the specific chart?

**Answer Here.** The pie chart is a common chart used for representing relative proportions or percentages of different categories. It is often used to display data that is divided into parts, and it is a good choice when you want to compare the proportion of different categories in a dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Boxplot of the log-transformed Sales column
sns.boxplot(x=df['Sales'])

In [None]:
q1 = np.percentile(df['Sales'],25)
q2 = np.percentile(df['Sales'],50)
q3 = np.percentile(df['Sales'],75)
print(f"q1:{q1},\nq2:{q2},\nq3:{q3}\n")
IQR = q3-q1
l_bound = q1 - (1.5*IQR)
u_bound = q3 + (1.5*IQR)
print(f"Lower bound: {l_bound}, \nupper bound:{u_bound}, \nIQR:{IQR}")

In [None]:
# Position of the Outlier
print(np.where(df['Promo2SinceWeek']>100))

In [None]:
# IQR
# `method` replaces the deprecated `interpolation` argument (NumPy >= 1.22)
Q1 = np.percentile(df['Promo2SinceWeek'], 25,
                   method = 'midpoint')
 
Q3 = np.percentile(df['Promo2SinceWeek'], 75,
                   method = 'midpoint')
IQR = Q3 - Q1

In [None]:
IQR

In [None]:
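# 'category' was not defined earlier; this selection of four columns is an assumption
category = ['DayOfWeek', 'Promo', 'StoreType', 'Assortment']
plt.figure(figsize=(12,10))
for i,j in enumerate(category):
  plt.subplot(2,2,i+1)
  sns.boxplot(x=df[j],y=df["Sales"])
  plt.title(f'Box plot for {j} feature')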
  

In [None]:
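# Chart - 2 visualization code
# Boxplots of the numerical features to check for outliers
# 'numerical_col' was not defined earlier; it is assumed here to be the numeric
# columns other than Date and the target Sales
numerical_col = Numeric_features[2:]
plt.figure(figsize=(20,10))
for num,column in enumerate(numerical_col):
  plt.subplot(5,4,num+1)
  sns.boxplot(x=df[column])
  plt.title(f'{column.title()}',weight='bold')
plt.tight_layout()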

In [None]:

"""outliers = []
data = numerical_col
data = sorted(data)
q1 = np.percentile(data,25)
q2 = np.percentile(data,50)
q3 = np.percentile(data,75)
print(f"q1:{q1},q2:{q2},q3:{q3}")

IQR = q3-q1
lwr_bound = q1-(1.5*IQR)
upr_bound = q3+(1.5*IQR)
print(f"Lower bound: {lwr_bound}, upper bound:{upr_bound}, IQR:{IQR}")

for i in data:
  if (i<lwr_bound or i>upr_bound):
    outliers.append(i)
len_outliers = len(outliers)
print(f"Total number of outliers are : {len_outliers}")
print(f"Total percentage of outliers is : {round(len_outliers*100/len(data),2)} %")"""


In [None]:
# Determining IQR,Lower and Upper bound and number out outliers present in each of the numerical features
"""for feature in numerical_col:
  print(feature,":")
  detect_outliers(df[feature])
  print("\n")"""

In [None]:
# numerical_col is a list of column names, so describe the matching DataFrame columns
df[numerical_col].describe()

In [None]:
# Defining the function that treats outliers with the IQR technique
def treat_outliers_iqr(data):
  """Replace values outside the 1.5*IQR fences with the nearest quartile."""
  values = np.asarray(data, dtype=float)

  # Calculate the first and third quartiles
  q1,q3 = np.percentile(values,[25,75])

  # Calculate the interquartile range (IQR) and the outlier fences
  iqr = q3-q1
  lower_bound = q1 - (1.5 * iqr)
  upper_bound = q3 + (1.5 * iqr)

  # Treat the outliers: low outliers become q1, high outliers become q3
  treated_data = np.where(values < lower_bound, q1,
                          np.where(values > upper_bound, q3, values))

  # Truncate to integers, as the original list-based version did
  return treated_data.astype(int)
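
Replacing outliers with the quartile itself moves extreme values all the way into the middle of the distribution, and the integer cast discards decimals. A gentler alternative, shown here only as a sketch, is to cap values at the IQR fences with `Series.clip` and keep the original dtype. The helper `clip_outliers_iqr` and the sample values are hypothetical:

```python
import pandas as pd

def clip_outliers_iqr(series, k=1.5):
    """Cap values at the IQR fences; hypothetical helper, not part of the notebook."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one obvious outlier
values = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 200.0])
capped = clip_outliers_iqr(values)
print(capped.tolist())
```

Only the extreme value is changed, and it is moved to the upper fence rather than to the third quartile.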



In [None]:
# Passing all the feature one by one from the list of numerical_col in above defined function for outlier treatment
for feature in numerical_col:
  df[feature] = treat_outliers_iqr(df[feature])

In [None]:
plt.figure(figsize=(20,10))
for num,column in enumerate(numerical_col):
  plt.subplot(5,4,num+1)
  sns.boxplot(df[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()

In [None]:
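# Rechecking the total number of outliers and their percentage in each numerical feature
for feature in numerical_col:
  q1, q3 = np.percentile(df[feature], [25, 75])
  fence = 1.5 * (q3 - q1)
  n_out = ((df[feature] < q1 - fence) | (df[feature] > q3 + fence)).sum()
  print(f"{feature}: {n_out} outliers ({100 * n_out / len(df):.2f} %)")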

##### 1. Why did you pick the specific chart?

**Answer Here** : A boxplot, also known as a box-and-whisker plot, is a standardized way of displaying the distribution of data based on the five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum"). It is often used to identify outliers and other characteristics of a dataset. The boxplot is useful because it graphically shows the important characteristics of the data, such as the median and quartiles, in a single plot, which helps when comparing datasets.


Here, I have picked the boxplot to check for outliers in this dataset.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
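
For a daily sales series, calendar features derived from the date are a natural starting point. The sketch below assumes a `Date` column and derives a few hypothetical features (`Year`, `Month`, `DayOfWeek`, `WeekOfYear`, `IsWeekend`); which of them actually help is an empirical question.

```python
import pandas as pd

# Hypothetical dates; a real notebook would parse the dataset's date column instead.
df = pd.DataFrame({"Date": pd.to_datetime(["2015-07-31", "2015-12-24", "2014-01-06"])})

# Calendar features that often capture weekly and yearly seasonality.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)
df["IsWeekend"] = (df["DayOfWeek"] >= 6).astype(int)
print(df)
```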

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used, and why?

Answer Here.

##### Which features did you find important, and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used, and why?

In [None]:
# Transform Your data
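
If the target is right-skewed, a log transform is one frequently used option. This sketch applies `log1p` to a hypothetical `sales` array (it handles zero sales on closed days) and shows that `expm1` maps predictions back to the original scale. Whether a transform is needed should be judged from the actual distribution.

```python
import numpy as np

# Hypothetical right-skewed sales values, including a zero for a closed day.
sales = np.array([0.0, 1200.0, 5400.0, 41000.0])

# log1p(x) = log(1 + x), defined at x = 0 unlike a plain log.
log_sales = np.log1p(sales)

# expm1 is the exact inverse, used to convert model output back to sales units.
recovered = np.expm1(log_sales)
print(log_sales, recovered)
```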

### 6. Data Scaling

In [None]:
# Scaling your data
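
A minimal standardization sketch with scikit-learn's `StandardScaler` follows. The arrays are hypothetical; the point is that the scaler is fitted on the training split only and then reused on the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices (two numeric features).
X_train = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
X_test = np.array([[400.0, 7.0]])

# Fit on the training data only, then apply the same mean and scale to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled, X_test_scaled)
```

Tree-based models generally do not need scaling; it matters mainly for linear, distance-based, and gradient-descent-trained models.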

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
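
Because the task is forecasting, a chronological split is worth considering in place of a random one, so that the model is evaluated on dates later than those it was trained on. The sketch below uses a hypothetical ten-day frame and an assumed 80/20 ratio.

```python
import pandas as pd

# Hypothetical daily data; a real notebook would use the prepared feature frame.
df = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": list(range(10)),
})

# Sort by date, then cut at 80% so the test set holds only the most recent days.
df = df.sort_values("Date").reset_index(drop=True)
split_idx = int(len(df) * 0.8)  # assumed 80/20 ratio
train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]
print(len(train_df), len(test_df))
```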

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
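
As a sketch of the fit, predict, and evaluate steps, the block below trains a linear regression baseline on synthetic data (the arrays `X` and `y` are generated here and are not the project data) and reports MAE, RMSE, and R². A simple baseline like this gives later models something to beat.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic regression data standing in for the engineered features and sales target.
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(200, 3))
y = 3000 + 2000 * X[:, 0] - 500 * X[:, 1] + rng.normal(0, 100, size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Fit the algorithm
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows
y_pred = model.predict(X_test)

# Evaluation metrics in the target's units (MAE, RMSE) and as explained variance (R^2).
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE = {mae:.1f}, RMSE = {rmse:.1f}, R^2 = {r2:.3f}")
```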

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model
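
One way to combine cross-validation with tuning is `GridSearchCV` over a small parameter grid. The sketch below tunes `alpha` for a `Ridge` model on synthetic data and uses `TimeSeriesSplit` so that each validation fold comes after its training fold; the estimator, the grid, and the number of splits are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data; rows are assumed to be in chronological order.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.1, size=120)

# Exhaustive search over a small grid, scored by (negated) RMSE.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=4),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(f"best alpha = {best_alpha}, CV RMSE = {best_rmse:.3f}")
```

For larger grids, `RandomizedSearchCV` accepts the same `cv` and `scoring` arguments and samples a fixed number of candidates.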

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (e.g., GridSearchCV, RandomizedSearchCV, Bayesian optimization)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.
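
As one example of a model-agnostic explainability tool, scikit-learn's `permutation_importance` measures how much the score drops when a feature's values are shuffled. The sketch below runs it on a random forest trained on synthetic data where feature 0 is constructed to dominate; the model choice and data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 is pure noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:240], y[:240])

# Shuffle each feature on held-out rows and record the mean drop in R^2.
result = permutation_importance(model, X[240:], y[240:], n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print("importance by feature:", result.importances_mean)
print("ranking:", ranking)
```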

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in pickle or joblib format for deployment.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
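
A minimal save-and-reload round trip with `joblib` is sketched below. The toy model, the file name `sales_model.joblib`, and the temporary directory are all hypothetical; a real notebook would persist the chosen final model to a durable path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for the best-performing model.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Save to disk, then load it back as a deployment process would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sales_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# Sanity check: the reloaded model must reproduce the original predictions.
X_unseen = np.array([[5.0]])
same_prediction = np.allclose(model.predict(X_unseen), loaded_model.predict(X_unseen))
print("predictions match:", same_prediction)
```

Loading a saved model generally requires a compatible scikit-learn version, so it helps to record the library versions alongside the file.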

### ***Congrats! Your model is successfully created and ready for deployment on a live server for real user interaction!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***