# **Project Name**    - **Seoul Bike Sharing Demand Prediction**



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern. The crucial part is predicting the bike count required at each hour so that a stable supply of rental bikes can be maintained.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## **Data Description**
## **The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.**

### **Attribute Information:**

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature - Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - Winter, Spring, Summer, Autumn
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error,mean_absolute_error, r2_score

%matplotlib inline
sns.set_style("whitegrid",{'grid.linestyle':'--'})
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Loading Dataset

bike_df = pd.read_csv("/content/drive/MyDrive/SeoulBikeData.csv", encoding= 'unicode_escape')

### Dataset First View

In [None]:
# Dataset First Look
bike_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# computing the number of rows and columns
rows, columns = bike_df.shape

print("Number of Rows :", rows)
print("Number of Columns :", columns)

### Dataset Information

This dataset contains the rented bike counts for the city of Seoul. It presents the count of bikes rented per hour along with the corresponding weather conditions. The data covers one year, from December 2017 to November 2018.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
bike_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

bike_df.isnull().sum()

**Above we can see there are no missing values as well as no duplicate values in the dataset.**

In [None]:
# Visualizing the missing values
bike_df.isnull().sum().plot.bar()
plt.show()

### **In the given dataset there are 14 columns in total, and all of them have 0 null values.**

### What did you know about your dataset?

In this dataset we have 8760 rows and 14 columns, of which "rented bike count" is our target variable. There are numerical variables, categorical variables, and one date variable that is stored as an object, so we will need to convert its dtype.
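
The row count can be sanity-checked with simple arithmetic: one record per hour over a non-leap year gives 365 × 24 rows. A minimal sketch, assuming the data runs from 1 December 2017 through 30 November 2018 (the exact boundary days are an assumption; the text above only names the months):

```python
from datetime import date

# Assumed boundaries: first day covered and the day after the last day covered.
start = date(2017, 12, 1)
end_exclusive = date(2018, 12, 1)

n_days = (end_exclusive - start).days   # number of calendar days covered
expected_rows = n_days * 24             # one record per hour

print(n_days, expected_rows)
```

If the number of rows reported by the notebook differs from `expected_rows`, some hours are probably missing or duplicated.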

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# lower-case all column names so they are easier to type consistently
bike_df.columns = bike_df.columns.str.lower()
bike_df.columns

In [None]:
bike_df.info()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# converting the dtype of the date column from object (dd/mm/yyyy strings) to datetime
bike_df['date'] = pd.to_datetime(bike_df['date'], format='%d/%m/%Y')

In [None]:
# extracting the day, month and day of the week

bike_df['day'] = bike_df['date'].dt.day
bike_df['month'] = bike_df['date'].dt.month
bike_df['day_of_week'] = bike_df['date'].dt.day_name()

bike_df = bike_df.drop("date", axis=1)

bike_df.head()


In [None]:
# Dataset Describe
bike_df.describe(include='all')

### Variables Description 

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()

### What all manipulations have you done and insights you found?

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Exploratory Data Analysis**

## **Univariate Analysis**

### **Dependent Variable**

First we will start with analyzing our target variable which is **rented bike count**.

#### Chart - 1

In [None]:
# dependent variable "rented bike count"
dependent_var = "rented bike count"

In [None]:
bike_df[dependent_var].describe()

Now let's see the distribution of the dependent variable 'rented bike count'.

In [None]:
# distribution plot (histogram with a KDE overlay; sns.distplot is deprecated)
plt.figure(figsize=(9,7))
sns.histplot(bike_df[dependent_var], kde=True)
plt.title("Distribution Plot")
plt.show()

##### 1. Why did you pick the specific chart?

* A distribution plot shows how the values of the dependent variable (rented bike count) are distributed.
* Above we can see that the dependent variable is right-skewed.

##### 2. What is/are the insight(s) found from the chart?

The dependent variable, i.e. rented bike count, is slightly skewed towards the right (positively skewed), so we will apply a transformation and look at the distribution again.

Below are some transformation technique to reduce skewness.

<b>square-root for moderate skew:</b>
sqrt(x) for positively skewed data,
sqrt(max(x+1) - x) for negatively skewed data

<b>log for greater skew:</b>
log10(x) for positively skewed data,
log10(max(x+1) - x) for negatively skewed data

<b>inverse for severe skew:</b>
1/x for positively skewed data
1/(max(x+1) - x) for negatively skewed data
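
The effect of these transformations can be compared numerically through the sample skewness. The following is only a sketch on synthetic right-skewed data (an exponential sample shifted to be strictly positive, which is an assumption for illustration and not the bike data):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (an assumption for illustration, not the bike data).
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=500.0, size=5000)) + 1.0  # strictly positive

# Skewness before and after each transformation for positively skewed data.
skews = {
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log10": np.log10(x).skew(),
    "inverse": (1.0 / x).skew(),
}
for name, value in skews.items():
    print(f"{name:8s} skewness = {value:.2f}")
```

Values closer to 0 indicate a more symmetric distribution. A transformation that is too strong for the data can overshoot and leave the result skewed in the other direction, which is why the milder square root is tried first for moderate skew.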

In [None]:
# applying square-root transformation

plt.figure(figsize=(9,8))
sns.histplot(np.sqrt(bike_df[dependent_var]), kde=True)
plt.title("Distribution Plot - After Transformation")
plt.show()

After the square-root transformation, the distribution looks much closer to a normal distribution.

#### Chart - 2

In [None]:
# Chart - 2 - Boxplot
plt.figure(figsize=(5,6))
sns.boxplot(y=bike_df[dependent_var])
plt.title("Boxplot")
plt.show()

##### 1. Why did you pick the specific chart?

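A box plot summarises the median, the spread, and any outliers of a variable. From the box plot, we can see that the median rented bike count is near 500 and that there are some outliers beyond the upper whisker. After applying the transformation, these outliers are expected to disappear.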
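
Box plots conventionally draw points as outliers when they lie more than 1.5 × IQR beyond the quartiles. A minimal sketch of that rule, using a hypothetical helper `iqr_outlier_mask` and toy values rather than the bike data:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical toy counts; 100 lies far above the rest.
toy_counts = [1, 2, 3, 4, 5, 100]
mask = iqr_outlier_mask(toy_counts)
print(mask.sum(), "outlier(s) flagged")
```

Applying the same mask to the raw and to the square-root transformed target would show how many flagged points remain after the transformation.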

## **Independent Variables**

## **Numerical Variables**

Now let's have a look at the numerical features and plot some graphs to understand them.

In [None]:
# numerical variables
numerical_var = list(bike_df.describe().columns[1:])
numerical_var

In [None]:
bike_df[numerical_var].describe().T

In [None]:
# unique count of numerical variables

unique_cnt_df = bike_df[numerical_var].nunique().to_frame(name='unique_count')
unique_cnt_df

#### Chart - 3

In [None]:
# Chart - 3 
for col in numerical_var:
  features = bike_df[col]
  sns.histplot(features)
  plt.axvline(features.mean(), color='black', linestyle='dashed', linewidth=2)
  plt.axvline(features.median(), color='red', linestyle='dashed', linewidth=2)
  plt.title(col)
  plt.show()

#### Chart - 4

In [None]:
# Chart - 4 
# boxplot for each numerical feature

for col in numerical_var:
  fig = plt.figure()
  ax = fig.gca()
  bike_df.boxplot(col, ax= ax)
  ax.set_title(col)
  plt.show()

Variables such as wind speed (m/s), solar radiation (mj/m2), rainfall(mm) and snowfall (cm) have outliers, as seen in the boxplots.

## **Categorical Variables**

In [None]:
categorical_var =list(bike_df.select_dtypes(include= 'object'))
categorical_var 

In [None]:
# Season column
print(f"Count of distinct categories in season variable: {bike_df['seasons'].nunique()}")
print(list(bike_df["seasons"].unique()))

In [None]:
# Holiday column
print(f"Count of distinct categories in holiday variable: {bike_df['holiday'].nunique()}")
print(list(bike_df['holiday'].unique()))

In [None]:
# functioning day column
print(f"Count of distinct categories in functioning day variable: {bike_df['functioning day'].nunique()}")
print(list(bike_df['functioning day'].unique()))

#### Chart - 5

In [None]:
# Chart - 5 
# count plot for the categorical features

for col in categorical_var:
  plt.figure(figsize=(3,4))
  sns.countplot(data=bike_df, x= col)
  plt.title(col)
  plt.show()

There are very few Holiday and non-functioning day records.

##### 2. What is/are the insight(s) found from the chart?

Answer - Because these categories are so rare, these columns may not have a large impact.

# **Bivariate Analysis**

## Numerical variables vs rented bike count

#### Chart - 6

In [None]:
# Chart - 6  Scatter plot of numerical_var vs rented bike count

for col in numerical_var:
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  feature = bike_df[col]
  label = bike_df['rented bike count']
  plt.scatter(x = feature, y = label)
  plt.xlabel(col)
  plt.ylabel('rented bike count')
  ax.set_title('rented bike count vs ' + col)

  # least-squares straight line through the points, drawn as a dashed red trend line
  z = np.polyfit(feature, label, 1)
  y_hat = np.poly1d(z)(feature)

  plt.plot(feature, y_hat, "r--", lw=1)
  plt.show()



#### Chart - 7

In [None]:
# Chart - 7 Boxplot of rented bike count grouped by each numerical_var

for col in numerical_var:
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  bike_df.boxplot(column='rented bike count', by=col, ax= ax)
  ax.set_title('Label by ' + col)
  ax.set_ylabel("rented bike count")
  plt.show()

#### Chart - 8

In [None]:
# Chart - 8 Lineplot numerical_var vs rented bike count

fig, ax = plt.subplots(3,2,figsize=(15,9))

# select the target column before aggregating so that non-numeric columns are never averaged
bike_df.groupby('temperature(°c)')['rented bike count'].mean().plot(ax=ax[0][0])

bike_df.groupby('humidity(%)')['rented bike count'].mean().plot(ax=ax[0][1])

bike_df.groupby('wind speed (m/s)')['rented bike count'].mean().plot(ax=ax[1][0])

bike_df.groupby('solar radiation (mj/m2)')['rented bike count'].mean().plot(ax=ax[1][1])

bike_df.groupby('rainfall(mm)')['rented bike count'].mean().plot(ax=ax[2][0])

bike_df.groupby('snowfall (cm)')['rented bike count'].mean().plot(ax=ax[2][1])

plt.show()

##### 1. Why did you pick the specific chart?

* We use line plots to see how the average rented bike count changes with each numerical variable.
* That is, to see how the count of rented bikes is affected by temperature, humidity, wind speed, solar radiation, rainfall and snowfall.

##### 2. What is/are the insight(s) found from the chart?

* When the temperature is higher, the rental bike count is also higher.
* As humidity increases, the demand for rental bikes decreases.
* Wind speed and solar radiation do not have much impact on the bike count.
* When there is more than 10 mm of rainfall, the demand for bikes decreases, but above 20 mm of rain there is a large peak. This could be an outlier, or rainfall during the summer.
* As snowfall increases, the rented bike count decreases.
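
Grouping by every distinct value of a continuous feature can leave very few rows per group, so isolated peaks may simply reflect a handful of observations. One alternative, which is a different technique from the per-value grouping used in this notebook, is to bin the feature first. A minimal sketch on synthetic data (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's data (hypothetical values).
rng = np.random.default_rng(1)
toy = pd.DataFrame({"temperature": rng.uniform(-15, 35, size=2000)})
noise = rng.normal(0, 50, size=2000)
toy["rented bike count"] = (20 * (toy["temperature"] + 15) + noise).clip(lower=0)

# Bin the continuous feature into 5-degree intervals, then average per bin.
bins = np.arange(-15, 40, 5)          # edges: -15, -10, ..., 35
toy["temp_bin"] = pd.cut(toy["temperature"], bins=bins, include_lowest=True)
binned_mean = toy.groupby("temp_bin", observed=True)["rented bike count"].mean()
print(binned_mean.round(1))
```

Each bin then averages over many rows, which usually gives a smoother and more trustworthy trend line than one point per distinct value.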

#### Chart - 9

In [None]:
# Spread of numerical variables across hours
fig, ax = plt.subplots(3,2,figsize=(15,9))

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x='hour', y='rented bike count', data=bike_df, color='Navy', ax=ax[0][0])

sns.lineplot(x='hour', y='temperature(°c)', data=bike_df, color='Navy', ax=ax[0][1])

sns.lineplot(x='hour', y='humidity(%)', data=bike_df, color='Navy', ax=ax[1][0])

sns.lineplot(x='hour', y='wind speed (m/s)', data=bike_df, color='Navy', ax=ax[1][1])

sns.lineplot(x='hour', y='visibility (10m)', data=bike_df, color='Navy', ax=ax[2][0])

sns.lineplot(x='hour', y='solar radiation (mj/m2)', data=bike_df, color='Navy', ax=ax[2][1])

plt.show()

##### 1. What is/are the insight(s) found from the chart?


* Demand for rental bikes rises from the beginning of the day, reaches its highest peak in the evening, and then decreases.
* Demand peaks at 8 am and 6 pm, so we can say that demand is higher around office opening and closing times.
* Temperature, wind speed and solar radiation also increase during the day and peak in the afternoon.
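
Peak hours can also be read off numerically instead of from the plot. The sketch below builds a hypothetical hourly demand profile with commute peaks (synthetic data, not the bike data) and picks the two hours with the highest mean count:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly profile with commute peaks at 8 and 18, repeated for 30 days.
hours = np.tile(np.arange(24), 30)
base = (200
        + 600 * np.exp(-((hours - 8) ** 2) / 2.0)
        + 900 * np.exp(-((hours - 18) ** 2) / 2.0))
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "hour": hours,
    "rented bike count": base + rng.normal(0, 20, size=hours.size),
})

# Average demand per hour, then the two busiest hours.
hourly_mean = toy.groupby("hour")["rented bike count"].mean()
peak_hours = hourly_mean.nlargest(2).index.tolist()
print(peak_hours)
```

The same two lines of `groupby` and `nlargest` could be applied to `bike_df` to confirm the peaks seen in the chart.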

#### Chart - 10

In [None]:
# Chart - 10 spread of numerical_var across months

fig, ax = plt.subplots(3,2,figsize=(15,9))

# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.lineplot(x='month', y='temperature(°c)', data=bike_df, color='#E56124', ax=ax[0][0])

sns.lineplot(x='month', y='humidity(%)', data=bike_df, color='#E56124', ax=ax[0][1])

sns.lineplot(x='month', y='wind speed (m/s)', data=bike_df, color='#E56124', ax=ax[1][0])

sns.lineplot(x='month', y='visibility (10m)', data=bike_df, color='#E56124', ax=ax[1][1])

sns.lineplot(x='month', y='rainfall(mm)', data=bike_df, color='#E56124', ax=ax[2][0])

sns.lineplot(x='month', y='snowfall (cm)', data=bike_df, color='#E56124', ax=ax[2][1])

plt.show()


##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 
# Pie chart Categorical_var vs rented bike count

fig, ax = plt.subplots(2,2,figsize=(12,9))

# select the target column before summing so that only the counts are aggregated
bike_df.groupby('seasons')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', shadow=True,ax= ax[0][0])
ax[0][0].set_title("seasons")

bike_df.groupby('holiday')['rented bike count'].sum().plot.pie(autopct='%1.1f%%',ax= ax[0][1])
ax[0][1].set_title("holiday")

bike_df.groupby('functioning day')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', ax= ax[1][0])
ax[1][0].set_title("functioning day")

bike_df.groupby('day_of_week')['rented bike count'].sum().plot.pie(autopct='%1.1f%%', ax= ax[1][1])
ax[1][1].set_title("day of week")

plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is used here because it shows each category's share of the total as a percentage.

##### 2. What is/are the insight(s) found from the chart?

* Above we can see that Autumn, Spring and Summer are the three seasons with the highest demand for rented bikes.
* Approximately 97% of the demand for rented bikes falls on working days. This suggests that people use the service mainly to commute to the office and similar places, and that on holidays people generally prefer to stay at home or use their own vehicles.


#### Chart - 12

In [None]:
# Chart - 12 
# Boxplot of categorical_var vs rented bike count

for col in categorical_var:
  fig = plt.figure(figsize=(8,6))
  ax = fig.gca()
  bike_df.boxplot(column= 'rented bike count', by = col, ax = ax)
  ax.set_ylabel("rented bike count")
  plt.show()


##### 2. What is/are the insight(s) found from the chart?

* In summer the demand for rented bikes is high, likely because temperature and solar radiation are high in summer.
* We have seen that there are few holidays, and the rented bike count is also lower on holidays.
* There is almost no demand on non-functioning days.
* The demand for rental bikes decreases slightly on weekends, i.e. Saturday and Sunday.

#### Chart - 13

In [None]:
# Chart - 13 
# spread of rented bike count across categorical_var
fig, ax = plt.subplots(2,2,figsize=(12,9))

sns.barplot(x= 'seasons', y= 'rented bike count', data= bike_df, ax= ax[0][0])

sns.pointplot(x= 'month', y= 'rented bike count', hue= 'seasons',
              data= bike_df, ax= ax[0][1])

sns.lineplot(x= 'hour', y= 'rented bike count', hue= 'holiday',
             ci=None, data= bike_df, ax= ax[1][0])

sns.barplot(x= 'day_of_week', y= 'rented bike count', data= bike_df, ax= ax[1][1])

plt.show()

##### 2. What is/are the insight(s) found from the chart?

* There is a huge demand for bike rentals in the summer season, while the fewest rentals occur in winter.
* Demand for rented bikes is high in June and August, and low in December, January and February, i.e. the winter season.
* Non-holidays have higher demand for rented bikes than holidays.
* Demand for rented bikes is high on office days and decreases slightly on Sunday.


# **Feature Selection**

## **Correlation**

#### Chart - 14 - Correlation Heatmap

In [None]:
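# Correlation Heatmap visualization code

# correlation is only defined for numeric columns, so select them explicitly
corr_df = bike_df.select_dtypes(include='number').corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr_df, annot =True, cmap ="crest" )
plt.show()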

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The most correlated features to the rented bike count are:
* hour
* temperature(°c)
* dew point temperature(°c)
* solar radiation (mj/m2)

There is a high correlation between dew point temperature(°c) and temperature(°c).
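
Rather than reading the heatmap by eye, the features can be ranked by the absolute value of their correlation with the target. A minimal sketch on a synthetic frame (the column names `strong`, `weak`, `target` and `label` are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic frame (hypothetical columns) with one strong and one weak predictor.
rng = np.random.default_rng(3)
n = 1000
toy = pd.DataFrame({"strong": rng.normal(size=n), "weak": rng.normal(size=n)})
toy["target"] = 3.0 * toy["strong"] + 0.2 * toy["weak"] + rng.normal(size=n)
toy["label"] = "a"  # a non-numeric column, excluded below

# Correlation of every numeric feature with the target, ranked by magnitude.
numeric = toy.select_dtypes(include="number")
ranking = numeric.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

Note that this ranking only captures linear relationships; a feature with a strong non-linear effect can still show a small correlation coefficient.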

# **Detecting Multicollinearity using VIF**

In [None]:
# Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(bike_df[numerical_var])

##### 1. Why did you pick the specific chart?

* VIF starts at 1 and has no upper limit
* VIF = 1, no correlation between the independent variable and the other variables
* VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others
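
The VIF of feature j is defined as 1 / (1 - R_j²), where R_j² is the coefficient of determination obtained by regressing feature j on all other features. The sketch below computes it by hand with NumPy on synthetic columns; the helper name `vif_from_r2` is hypothetical, and the sketch includes an intercept in each auxiliary regression, so its values need not match a call that passes the raw columns without a constant.

```python
import numpy as np

def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(4)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The independent column `b` gets a VIF close to 1, while `a` and `c`, which almost duplicate each other, get very large values.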

##### 2. What is/are the insight(s) found from the chart?

* We can see here that 'dew point temperature(°c)', 'temperature(°c)' have a high VIF value, meaning they can be predicted by other independent variables in the dataset. These two variables are highly correlated.

* Dropping one of the correlated features will help in bringing down the multicollinearity between correlated features.

In [None]:
# dropping 'dew point temperature(°c)', 'day', 'month'

calc_vif(bike_df[[ i for i in numerical_var if i not in ['dew point temperature(°c)', 'day', 'month']]])

* After dropping 'dew point temperature(°c)', 'day' and 'month', the VIF values of all remaining features are below 5, which is good for building a regression model.

In [None]:
# dropping 'dew point temperature(°c)', 'day', 'month' from original dataset
data= bike_df.drop(['dew point temperature(°c)', 'day', 'month'], axis=1)

In [None]:
# Correlation Heatmap after reducing the multicollinearity
plt.figure(figsize=(12,8))
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True, cmap = 'coolwarm')
plt.show()

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

### Hypothetical Statement - 1

In [None]:
bike_df.info()

In [None]:
bike_df.loc[:,'functioning day'].value_counts(normalize = True)*100

In [None]:
bike_df['functioning day'].value_counts()

In [None]:
bike_df.loc[:,'holiday'].value_counts(normalize = True)*100

In [None]:
bike_df['holiday'].value_counts()

In [None]:
# random_state fixes the samples so that the test below is reproducible
working_data = bike_df[bike_df['functioning day'] == 'Yes'].sample(250, replace = False, random_state = 42)

non_working_data = bike_df[bike_df['functioning day'] == 'No'].sample(250, replace = False, random_state = 42)


In [None]:
# sample variances of the two groups
round(working_data['rented bike count'].var(),2), round(non_working_data['rented bike count'].var(),2)

In [None]:
#Checking the normality
import statsmodels.api as sm
fig = plt.figure(figsize = (15,12))

ax1 = fig.add_subplot(221)
sns.histplot(data = working_data, x = 'rented bike count' , bins = 50, kde = True, ax = ax1, color = 'red')
ax1.set_title('cnt of bikes rented in working days')

ax2 = fig.add_subplot(222)
sm.qqplot(working_data['rented bike count'], line = 's', ax = ax2)
ax2.set_title('qqplot for cnt in working days')

ax3 = fig.add_subplot(223)
sns.histplot(data = non_working_data, x = 'rented bike count' , bins = 50, kde = True, ax = ax3, color = 'red')
ax3.set_title('cnt of bike rented in non-working days')

ax4 = fig.add_subplot(224)
sm.qqplot(non_working_data['rented bike count'], line = 's', ax = ax4)
ax4.set_title('qqplot for cnt in non-working days')

plt.show()

In [None]:
# Calculating the test statistic and p-value using ttest_ind() for these right-skewed samples.
# alternative='greater' tests whether the mean count on functioning days is greater (one-sided);
# equal_var=False runs Welch's t-test, which does not assume equal variances.

from scipy import stats

t_test, p_value = stats.ttest_ind(working_data['rented bike count'],non_working_data['rented bike count'],
                                  alternative='greater', equal_var = False)
print(t_test, p_value)
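
To make clear what `ttest_ind` computes with `equal_var=False`, the Welch statistic and its one-sided p-value can be reproduced by hand. This is a sketch on two synthetic samples (hypothetical data, not the bike counts), checked against SciPy:

```python
import numpy as np
from scipy import stats

# Two synthetic samples with different means and variances (hypothetical data).
rng = np.random.default_rng(5)
group_a = rng.normal(loc=700, scale=600, size=250)
group_b = rng.normal(loc=500, scale=100, size=250)

# Per-group variance of the sample mean (unpooled).
va = group_a.var(ddof=1) / group_a.size
vb = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite degrees of freedom and the one-sided (greater) p-value.
dof = (va + vb) ** 2 / (va ** 2 / (group_a.size - 1) + vb ** 2 / (group_b.size - 1))
p_manual = stats.t.sf(t_manual, dof)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Because the standard error is not pooled, the test remains valid when the two groups have very different variances.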

In [None]:
# np.log1p computes log(1 + x), so any hours with a count of 0 stay finite
log_working = np.log1p(working_data['rented bike count'])
log_non_working = np.log1p(non_working_data['rented bike count'])

fig = plt.figure(figsize = (15,12))

ax1 = fig.add_subplot(221)
sns.histplot(data = log_working, bins = 50, kde = True, ax = ax1, color = 'green')
ax1.set_title('cnt of bikes rented in working days')

ax2 = fig.add_subplot(222)
sm.qqplot(log_working, line = 's', ax = ax2)
ax2.set_title('qqplot for cnt in working days')

ax3 = fig.add_subplot(223)
sns.histplot(data = log_non_working, bins = 50, kde = True, ax = ax3, color  = 'green')
ax3.set_title('cnt of bike rented in non-working days')

ax4 = fig.add_subplot(224)
sm.qqplot(log_non_working, line = 's', ax = ax4)
ax4.set_title('qqplot for cnt in non working days')

plt.show()


In [None]:
# sample variances of the log-transformed groups
round(log_working.var(),2), round(log_non_working.var(),2)

In [None]:
sample_w_log = log_working.sample(250)
sample_nw_log = log_non_working.sample(250)

In [None]:
statistic,p_value = stats.ttest_ind(sample_w_log,sample_nw_log , alternative = 'greater')
print(statistic,p_value)

In [None]:
def htResult(p_value):
    significance_level = 0.05
    if p_value <= significance_level: 
        print('Reject NULL HYPOTHESIS') 
    else: 
        print('Fail to Reject NULL HYPOTHESIS')

htResult(p_value)        

In [None]:
sns.boxplot(x='functioning day', y='rented bike count', data=bike_df)
plt.show()

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

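**Null hypothesis (H0):** The mean rented bike count on functioning days is less than or equal to the mean on non-functioning days.

**Alternate hypothesis (H1):** The mean rented bike count on functioning days is greater than the mean on non-functioning days.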

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

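A one-sided two-sample t-test (`stats.ttest_ind` with `alternative='greater'`), run on the raw counts with `equal_var=False` (Welch's t-test) and again on the log-transformed counts, at a significance level of 0.05.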

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
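
One possible approach for nominal columns such as seasons and holiday is one-hot encoding. The following is only a sketch on a toy frame with hypothetical rows; it is not necessarily the encoding this project ends up using.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names follow the lower-cased dataset.
toy = pd.DataFrame({
    "seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
    "hour": [0, 8, 18, 23],
})

# One-hot encode the object columns; drop_first avoids a redundant dummy per feature.
encoded = pd.get_dummies(toy, columns=["seasons", "holiday"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level of each feature keeps the dummies from being perfectly collinear, which matters for linear models.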

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits.

In [None]:
# Remove URLs & Remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
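
A common way to split the data uses `train_test_split`, which is already imported at the top of this notebook. The sketch below uses a hypothetical feature matrix and target, and the 80/20 ratio is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target standing in for the prepared data.
rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))
y = rng.poisson(lam=500, size=1000)

# 80/20 split; fixing random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```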

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
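
As one possible baseline, a linear regression can be fitted on the square root of the target and its predictions squared back to counts before scoring. This is a sketch on synthetic data under that assumption, not the model chosen in this project:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-negative target (hypothetical), standing in for the bike counts.
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(2000, 3))
y = (10 + 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1, 2000)) ** 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the square root of the target, then square predictions back to counts.
model = LinearRegression().fit(X_train, np.sqrt(y_train))
y_pred = model.predict(X_test) ** 2

scores = {
    "MAE": mean_absolute_error(y_test, y_pred),
    "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    "R2": r2_score(y_test, y_pred),
}
print(scores)
```

Scoring on the original scale keeps the metrics interpretable as numbers of bikes.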

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
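
Cross-validated tuning can be sketched with `GridSearchCV`, which is imported at the top of this notebook. The estimator (`Ridge`), the parameter grid and the synthetic data below are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(8)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, 500)

# 5-fold cross-validated search over the regularisation strength alpha.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated score of the best setting, so it can be compared directly with the untuned model's cross-validated score.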

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
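
Saving and reloading a fitted model can be sketched with the standard `pickle` module. The model, the file name `best_model.pkl` and the temporary directory below are hypothetical stand-ins:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical fitted model standing in for the best-performing one.
rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + 2.0
model = LinearRegression().fit(X, y)

# Save to a temporary path (an assumed location), then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# Sanity check: the restored model reproduces the original predictions.
X_new = rng.normal(size=(5, 2))
same = np.allclose(model.predict(X_new), restored.predict(X_new))
print(same)
```

A pickled model should be loaded with the same library versions it was saved with, and only from sources that are trusted.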

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***