# **Project Name**    - Classification - IndiGo Airline Passenger Referral Prediction




##### **Contribution**    - Individual
Project by - Sourav Ranjan Thakur

# **GitHub Link -**

https://github.com/sourav2208?tab=repositories

# **Problem Statement**


The goal of this machine learning project is to classify airlines into categories based on certain features or attributes. Classification can serve multiple purposes, such as identifying potential partners for codeshare agreements, assisting in pricing strategies, or aiding in market analysis. In this project, we explore whether flyers would recommend the airline to their friends and family, based on their travel experience, reviews, and ratings.

#### The problems we address in this project:

* Develop a classification model to categorize airlines based on the likelihood of customers recommending them to friends and family.

* Recognize the pivotal role of customer satisfaction and referrals in the growth and success of airlines.

* Enable airlines to strategically utilize customer referral information for codeshare agreements, pricing strategies, and market analysis.

* Identify customers likely to refer the airline, a task complicated by the diverse factors influencing satisfaction and referrals.

* Assess the model's capability to provide actionable insights for airlines to tailor services, improve customer satisfaction, and enhance brand reputation.


This problem statement outlines the key objectives, challenges, and considerations for developing a classification model to predict customer referrals in the airline industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install category_encoders

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import missingno as msno
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import f_oneway
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV , cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
#mount drive
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
df1 = pd.read_excel('/content/drive/MyDrive/data_airline_reviews.xlsx')

### Dataset First View

In [None]:
# Dataset First Look
df1.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df1.shape

### Dataset Information

In [None]:
# Dataset Info
df1.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df1.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df1.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15,8))
sns.heatmap(df1.isnull(), cbar=False, cmap='YlGnBu')
plt.title('Missing Values Heatmap')
plt.xticks(rotation=30)
plt.show()

### What did you know about your dataset?

The data includes airline reviews from 2006 to 2019 for popular airlines around the world, with user feedback ratings and reviews based on their travel experience.

It has 131895 rows and 17 columns.

The data was scraped in spring 2019. The features are briefly described below:

1. **airline** - Airline name
2. **overall** - Overall score
3. **author** - Author information
4. **review_date** - Customer Review posted date
5. **customer_review** - Actual customer review (Textual)
6. **aircraft** - Type of aircraft
7. **traveller_type** - Type of traveller
8. **cabin**- Cabin type chosen by traveller (Economy, Business,Premium economy,First class)
9. **route** - Route flown by flyer
10. **date_flown** - Date of travel
11. **seat_comfort** - Rating provided towards seat comfort
12. **cabin_service** - Rating provided towards cabin service.
13. **food_bev** - Rating provided towards food and beverages supplied during travel.
14. **entertainment** - Rating provided towards on board flight entertainment
15. **ground_service** - Rating provided towards ground service staff.
16. **value_for_money** - Rating provided towards value for money.
17. **recommended** - Airline service Recommended by flyer (Yes/No)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe().T

### Variables Description

The dataset has a lot of blank rows and many null values. The columns are described below:

1. **airline** - Name of the airline.  (**object type**)
2. **overall** - Overall rating defined by customer. (**float type**)
3. **author** - Customer information. (**object type**)
4. **review_date** - Date on which customer posted a review. (**object type**)
5. **customer_review** - Description of customer review. (**object type**)
6. **aircraft** - Type of aircraft. (**object type**)
7. **traveller_type** - Type of traveller. (**object type**)
8. **cabin**- Cabin type chosen by traveller. (Economy, Business,Premium economy,First class) (**object type**)
9. **route** - Route flown by flyer. (**object type**)
10. **date_flown** - Date of travel. (**object type**)
11. **seat_comfort** - Rating provided towards seat comfort. (**float type**)
12. **cabin_service** - Rating provided towards cabin service. (**float type**)
13. **food_bev** - Rating provided towards food and beverages supplied during travel. (**float type**)
14. **entertainment** - Rating provided towards on board flight entertainment. (**float type**)
15. **ground_service** - Rating provided towards ground service staff. (**float type**)
16. **value_for_money** - Rating provided towards value for money. (**float type**)
17. **recommended** - Airline service Recommended by flyer (Yes/No). (**object type**)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df1.columns.tolist():
  print("No. of unique values in ",i,"is",df1[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make a copy of the dataset so the original can be restored later
df=df1.copy()

In [None]:
#Drop all duplicated rows as there are many blank and duplicated rows
df.drop_duplicates(inplace=True)

In [None]:
#Reset the index after dropping the duplicated rows
df.reset_index(drop=True, inplace=True)

In [None]:
#Now we checked the shape of data after dropping.
#Check for shape of your dataset
df.shape

In [None]:
#Count the NaN values per column and sort the counts in descending order
df.isnull().sum().sort_values(ascending=False)


In [None]:
#Drop the unwanted columns `author`, `customer_review` and `route`.
df.drop(columns=['author','customer_review','route'],inplace=True)

In [None]:
#Drop the `aircraft` column because almost 70% of its values are NaN.
df.drop(columns=['aircraft'],inplace=True)
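
The missing-value share quoted for `aircraft` can be checked per column before deciding what to drop. The sketch below is only an illustration on a made-up toy frame; the helper name `missing_share` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return frame.isnull().mean().mul(100).round(1).sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "airline": ["A", "B", "C", "D"],
    "aircraft": [np.nan, np.nan, np.nan, "A320"],
    "seat_comfort": [4.0, np.nan, 3.0, 5.0],
})

shares = missing_share(toy)
print(shares)
```

On the real data the same helper could be called as `missing_share(df)`, which gives a quick basis for choosing between dropping a column and imputing it.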

In [None]:
# Again check for null values and sort in descending order
df.isnull().sum().sort_values(ascending=False)

In [None]:
#Drop the rows that have NaN values in the `ground_service` or `entertainment` columns.
df.dropna(subset=['ground_service','entertainment'],inplace=True)

In [None]:
#Fill the null values in `food_bev` with the mean rating
df['food_bev'] = df['food_bev'].fillna(df['food_bev'].mean())

In [None]:
#Drop all null values in our whole dataset
df.dropna(inplace=True)

In [None]:
#Final check for null values
df.isnull().sum()

In [None]:
# Check the shape of the dataset after cleaning
df.shape

In [None]:
#The first row was all null values, so after dropping it the index starts from 1; reset the index
df.reset_index(drop=True, inplace=True)

In [None]:
#Check first 5 rows of dataset after cleaning
df.head()

In [None]:
df.info()

In [None]:
#Many variables do not have appropriate datatypes, so we change them to suitable ones
d_type={'overall':'int8','review_date':'datetime64[ns]','seat_comfort':'int8','cabin_service':'int8','food_bev':'int8','entertainment':'int8',
        'ground_service':'int8',
        'value_for_money':'int8'}
for i,j in d_type.items():
  df[i]=df[i].astype(j)

In [None]:
#Here converted `date_flown` column in a proper date format by removing timestamp and changed to Datetime format.
df['date_flown']=pd.to_datetime(df['date_flown'], errors='coerce')

In [None]:
#Renamed `overall` to `overall_rating` and `date_flown` to `departure_date` for better understanding.
df.rename(columns={'overall':'overall_rating','date_flown':'departure_date'},inplace=True)

In [None]:
df.head()


### What all manipulations have you done and insights you found?

1. Converted the date columns to datetime format, as they were stored as object, and converted the rating columns from float to int, as all ratings are integers.
2. The date_flown column was not in a proper date format and also contained a timestamp, so we removed the timestamp and converted the column to the datetime datatype.
3. Renamed overall to overall_rating and date_flown to departure_date for better understanding.
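
One detail of the float-to-int conversion is worth illustrating: `food_bev` was filled with the column mean, which is generally not a whole number, and `astype('int8')` drops the fractional part instead of rounding. The sketch below uses a made-up toy column to show the difference; rounding before the cast is one possible alternative, not what the notebook currently does.

```python
import numpy as np
import pandas as pd

# Toy ratings column with one missing value (values are made up).
ratings = pd.Series([1.0, 4.0, 5.0, 5.0, np.nan], name="food_bev")

mean_value = ratings.mean()              # mean of the four known ratings
filled = ratings.fillna(mean_value)

# astype drops the fractional part of the imputed value.
truncated = filled.astype("int8")

# Rounding first keeps the imputed value closer to the mean.
rounded = filled.round().astype("int8")

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation or rounding is preferable depends on how the imputed rating should be interpreted; filling with the median is another common choice for ordinal ratings.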

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
plt.figure(figsize=(20, 5))

# Count the occurrences of each airline and reset the index
air_cnt = df['airline'].value_counts().sort_values(ascending=False).head(10).reset_index()
air_cnt.columns = ['airline', 'count']  # Rename columns for clarity

# Create a bar plot
palette = sns.color_palette("Set1", 10)
ax = sns.barplot(x=air_cnt['airline'], y=air_cnt['count'], palette=palette)

# Customize the plot
plt.xlabel('Airlines', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.title('Top 10 Airlines by Number of Reviews', fontsize=18)

# Add bar labels
for num in ax.containers:
    ax.bar_label(num)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A `Bar graph` is typically used to depict categorical values against numerical values, and it suits this case well because we are showing airlines with their count of reviews.

##### 2. What is/are the insight(s) found from the chart?

We have shown the `top 10 airlines` by review count. `American Airlines` has the most reviews (1412), followed by `United Airlines` (1358) and `British Airways` (1271).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Review counts alone are not enough; it is essential to consider the nature of the reviews and the sentiments they express. Understanding what customers appreciate about an airline, whether it is excellent service, punctuality, or other positive aspects, can help the company build on these strengths. Identifying common issues or complaints allows the airline to address and rectify problems, leading to an improved customer experience.

#### Chart - 2

In [None]:
cab_cnt=df['cabin'].value_counts().reset_index()

In [None]:
print(cab_cnt.head())

In [None]:
cab_cnt = df['cabin'].value_counts().reset_index()
cab_cnt.columns = ['cabin', 'count']

In [None]:
# Chart - 2 : `Pie chart` showing the distribution of different cabin classes preferred by passengers
plt.figure(figsize=(12, 6))
plt.pie(
    cab_cnt['count'],  # Use the column with numeric values
    labels=cab_cnt['cabin'],  # Use the column with labels
    autopct='%1.1f%%',
    explode=[0, 0, 0.12, 0.2],
    startangle=60,
    textprops={'fontsize': 10},
    shadow=True,
    wedgeprops={'edgecolor': 'white'}
)
plt.title('Distribution of Different Cabin Classes Preferred by Passengers', y=1.08, fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A `pie chart` is a circular statistical graphic that is divided into slices to illustrate numerical proportions. It's primarily used to show the relationship of parts to a whole.

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can see that `Economy Class` (72.5%) makes up the largest share, followed by `Business Class` (19.4%). The other two classes, `Premium Economy` (5.2%) and `First Class` (3.0%), make up a very small portion of the chart, which tells us that most passengers travelled in economy class.
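
The percentages shown on the pie chart can also be produced as a table, which is easier to quote in a report. This is a minimal sketch on a made-up toy column; on the real data the same call would be applied to `df['cabin']`.

```python
import pandas as pd

# Toy cabin column (values are made up, not the real distribution).
toy_cabin = pd.Series(
    ["Economy Class"] * 7 + ["Business Class"] * 2 + ["First Class"],
    name="cabin",
)

# normalize=True returns fractions instead of counts; convert to percent.
cabin_share = toy_cabin.value_counts(normalize=True).mul(100).round(1)
print(cabin_share)
```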

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the popularity of economy class can guide the airline in customizing services to meet the needs and expectations of this larger customer segment. Given that economy class is the most popular, marketing efforts can be targeted towards this segment. Promotions, loyalty programs, and advertising can be tailored to attract and retain economy class travelers.

#### Chart - 3

In [None]:
# Chart - 3 : `Bar Chart` for comparing the most popular traveller types
# Count the reviews for each traveller type
most_trav = df["traveller_type"].value_counts().reset_index()
most_trav.columns = ['traveller_type', 'count']  # Rename columns for clarity

ax = sns.barplot(x=most_trav['traveller_type'], y=most_trav['count'], palette='Dark2')
plt.ylabel('Count')
plt.xlabel('Traveller Type')
plt.title('Most Popular Traveller Type')

# Add bar labels
for num in ax.containers:
    ax.bar_label(num)

plt.show()


##### 1. Why did you pick the specific chart?

A `Bar graph` is typically used to depict categorical values against numerical values, and it suits this case well because we are showing traveller types with their counts.

##### 2. What is/are the insight(s) found from the chart?

`Solo Leisure` is the most common traveller type among passengers, while `Business` is the least common.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding that solo leisure travel is more popular allows the airline to tailor marketing efforts specifically toward this segment. Promotions, advertising, and loyalty programs can be designed to attract and retain solo leisure travelers, potentially increasing customer acquisition and retention.

#### Chart - 4

In [None]:
#Group by cabin and compute the mean food_bev and entertainment ratings
eda_4=df.groupby('cabin')[['food_bev','entertainment']].mean().reset_index()
eda_4

In [None]:
# Chart - 4 :`Side by side Bar Chart` for comparing cabin classes based on food_bev and entertainment ratings
plt.rcParams['figure.figsize']=(10,5)
eda_4.plot(x="cabin", y=["food_bev", "entertainment"], kind="bar")
plt.title('Cabin classes based on Food_bev and entertainment ratings')
plt.ylabel('ratings')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A `Side by side Bar Chart` is best suited for a side-by-side comparison of the cabin classes with respect to food_bev and entertainment ratings.

##### 2. What is/are the insight(s) found from the chart?

We can conclude that there is no significant difference between the `food_bev` and `entertainment` ratings in Economy and First Class. In Premium Economy, entertainment is rated higher than food_bev, and the reverse holds for Business Class.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing that there are different preferences for entertainment and food_bev in Premium Economy and Business Class allows the airline to focus on enhancing services in each class selectively. This could involve improving menu options, upgrading entertainment systems, or introducing new features to align with passenger expectations.

#### Chart - 5

In [None]:
# Chart - 5 : Distribution of different types of ratings using a `Violin Plot`
columns = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money','overall_rating']

# Melt the DataFrame to long format
df_melted = df.melt(value_vars=columns, var_name='Rating Category', value_name='Rating')

# Create a violin plot
plt.figure(figsize=(12, 6))
sns.violinplot(x='Rating Category', y='Rating', data=df_melted,palette='Set1')
plt.title('Violin Plot of various type of Ratings')
plt.xticks(rotation=45)
plt.show()

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize= (8,5))
overall = df.groupby(df['airline'])['overall_rating'].mean().sort_values(ascending = False).head(10).reset_index()
ax = sns.barplot(y=overall['airline'],x = overall['overall_rating'],palette ="tab10" )
plt.xlabel('Average overall rating')
plt.ylabel('Airline')
plt.title('Top rated airlines')

plt.show()

##### 1. Why did you pick the specific chart?

We picked a `horizontal bar chart` to compare airlines with respect to their average overall rating.

##### 2. What is/are the insight(s) found from the chart?

From this chart we can see that `Aegean Airlines` has the highest average overall rating, followed by `EVA Air` and `ANA All Nippon Airways`, while `Cathay Pacific Airways` has the lowest rating among these ten airlines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Aegean Airlines, EVA Air, and ANA All Nippon Airways can continue to focus on providing excellent service to maintain their high ratings. This can include improving in-flight amenities, on-time performance, and customer service.
Cathay Pacific Airways, rated lowest among the top ten, can invest in training and development programs for its staff to enhance customer service and improve the overall customer experience.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize= (20,5))
val = df.groupby(df['airline'])['value_for_money'].mean().sort_values(ascending = False).head(10).reset_index()
ax = sns.barplot(x=val['airline'],y = val['value_for_money'] ,palette = 'viridis')

plt.title('Top 10 Airlines by value for money',fontsize = 18)
plt.show()

##### 1. Why did you pick the specific chart?

We picked a `Bar chart` to compare airlines with respect to their average value-for-money rating.

##### 2. What is/are the insight(s) found from the chart?

We can see that `EVA Air` has the highest value-for-money rating, followed by China Southern Airlines and Aegean Airlines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the above insights, EVA Air can offer loyalty programs or incentives for frequent flyers to encourage repeat business and enhance customer loyalty.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
sns.histplot(df['overall_rating'], kde = True,bins =5,color='#4CBB17')
plt.title('Distribution of overall rating')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a `Histogram` to show the distribution of the overall rating.

##### 2. What is/are the insight(s) found from the chart?

We can conclude that most people gave a rating of either 1-2 or 8-10, which suggests that passengers tend to have either a very good or a very bad experience with an airline.
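
The "either very low or very high" pattern can be put into numbers by measuring the share of ratings at each end of the scale. The sketch below is an illustration on made-up ratings; the helper name `extreme_share` and the cut-offs 2 and 8 are assumptions taken from the ranges mentioned above.

```python
import pandas as pd


def extreme_share(ratings: pd.Series, low_max: int = 2, high_min: int = 8) -> dict:
    """Share of ratings at the low end, the high end, and in between."""
    low = (ratings <= low_max).mean()
    high = (ratings >= high_min).mean()
    return {"low": low, "high": high, "middle": 1 - low - high}


# Toy ratings (made up) with most values at the two ends of the scale.
toy_ratings = pd.Series([1, 1, 2, 1, 9, 10, 8, 5, 1, 10], name="overall_rating")

shares = extreme_share(toy_ratings)
print(shares)
```

On the real data the helper could be called as `extreme_share(df['overall_rating'])`; a small "middle" share would support the reading of the histogram.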

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the above insights, airlines can focus on addressing the aspects that lead to extreme negative experiences, such as poor customer service, flight delays, or uncomfortable seating, to reduce the number of low ratings. Engaging with customers who have given extreme ratings (either low or high) can provide valuable feedback and allow airlines to address specific pain points or reinforce areas of excellence.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(15,5))
sns.boxplot(x=df['cabin'], y=df['cabin_service'], hue = df['recommended'])
plt.title('Cabin service rating by cabin class and recommendation')
plt.legend(loc='upper right')
plt.show()

##### 1. Why did you pick the specific chart?

We picked this type of `Boxplot` to compare the cabin service ratings between different cabin classes.

##### 2. What is/are the insight(s) found from the chart?

We can see that in every cabin class, passengers who rate the service above 3 are more likely to recommend the airline to others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By focusing on enhancing service quality across all cabin classes so that ratings exceed 3, airlines can improve overall customer satisfaction, leading to positive recommendations and repeat business.

Negative impact from insights:
If service ratings for any cabin class consistently fall below 3, it could lead to negative word-of-mouth, lower customer satisfaction, and a decline in recommendations, which could result in a loss of customers and revenue.
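
The threshold of 3 read off the boxplot can be cross-checked by computing the recommendation rate on each side of it for every cabin class. This is a minimal sketch on made-up rows; it assumes `recommended` holds the strings `yes` and `no`, and the column name `service_level` is introduced here only for illustration.

```python
import pandas as pd

# Toy rows standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "cabin": ["Economy Class"] * 5 + ["Business Class"] * 4,
    "cabin_service": [1, 2, 4, 5, 5, 2, 3, 4, 5],
    "recommended": ["no", "no", "yes", "yes", "no", "no", "yes", "yes", "yes"],
})

# Label each review by whether the service rating is above the threshold.
service_level = toy["cabin_service"].gt(3).map({True: "above_3", False: "3_or_below"})
is_recommended = toy["recommended"].eq("yes")

# Mean of a boolean column is the share of True values, i.e. the recommendation rate.
rate = (
    toy.assign(service_level=service_level, is_recommended=is_recommended)
       .groupby(["cabin", "service_level"])["is_recommended"]
       .mean()
       .unstack("service_level")
)
print(rate)
```

A clearly higher rate in the `above_3` column for every cabin would back up the insight; similar rates in both columns would weaken it.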

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize= (14,6))
plt.title('Number of reviews over months')
df['month_name'] = df['review_date'].dt.strftime('%B')
df['month'] = df['review_date'].dt.month
df2 = df[['month_name','month']].value_counts().reset_index(name='count').sort_values(by = 'month')
ax = sns.barplot(x = df2['month_name'], y = df2['count'],palette = 'Dark2')
for num in ax.containers:
  ax.bar_label(num)
plt.show()


##### 1. Why did you pick the specific chart?

Here we plot how many reviews were submitted in each month, to check whether the number of reviews follows any monthly pattern. The best-suited chart for this is a `bar chart`.

##### 2. What is/are the insight(s) found from the chart?

We can see from this chart that `January` has the largest number of reviews, followed by `July` and `August`, while `May` has the fewest. One possible explanation is that January is a holiday or vacation month, so more travellers fly and more reviews are posted. Another is that services are harder to manage when passenger traffic is high, which prompts more reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If higher passenger traffic in holiday months is the reason, temporary staff can be employed in those months so that service and management quality do not suffer.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,5))
plt.title('Overall rating vs cabin type')
sns.barplot(x = df.cabin, y = df.overall_rating, hue = df['recommended'], palette= ['#00e500','red'])
plt.show()

##### 1. Why did you pick the specific chart?

Since we are plotting a categorical variable against a discrete numerical variable, split by recommendation, the best-suited chart is a `side by side bar chart`.


##### 2. What is/are the insight(s) found from the chart?

We can clearly see that in almost all cabin types, customers who recommend the airline gave an average overall rating of more than 8. For customers who do not recommend it, the average overall rating is about 2 in economy and about 3 in the other cabins. So the cabin type makes little difference once we split by recommended yes/no.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

From the above insights, a person who gives an overall rating of more than 8 is very likely to recommend the airline to others. Using the rating, we can ask such customers to share their opinion of the airline's service on a review platform. If a customer is not satisfied, we should try to resolve their issue with the best possible solution.
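
How strongly a high overall rating goes together with a recommendation can be measured directly as a conditional share. The sketch below uses made-up rows and assumes `recommended` holds the strings `yes` and `no`; the cut-off of 8 comes from the insight above.

```python
import pandas as pd

# Toy rows standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "overall_rating": [10, 9, 9, 10, 9, 8, 3, 1, 2, 7],
    "recommended": ["yes", "yes", "yes", "yes", "no", "yes", "no", "no", "no", "yes"],
})

# Keep only the reviews with an overall rating above 8 ...
high = toy[toy["overall_rating"] > 8]

# ... and compute the share of them that recommend the airline.
share_recommending = high["recommended"].eq("yes").mean()
print(f"{len(high)} high-rating reviews, {share_recommending:.1%} recommend")
```

Running the same two lines on `df` would give the actual share for this dataset, which is a safer basis for a statement than an estimate read off the chart.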

#### Chart - 12

In [None]:
# Chart - 12 visualization code
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit the encoder to the 'airline' column and transform it
df['airline_encoded'] = le.fit_transform(df['airline'])

plt.figure(figsize=(8, 5))
plt.title('Correlation Heatmap')

# Calculate the correlation using only numerical columns, including the encoded airline column.
# Non-numeric columns are excluded with `.select_dtypes(include=np.number)`
# to avoid an error when computing the correlation.
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, fmt='.2f', annot_kws={'size': 10}, vmax=1, square=True, cmap="rocket")

plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap is a powerful visualisation because it shows the pairwise relationship between all the numerical columns at once.
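
The heatmap covers numerical columns only, so the `recommended` target, which is stored as text, does not appear in it. One way to bring it in is to map it to 0/1 and look at how each rating correlates with that flag. This is a sketch on made-up rows, assuming the values are the strings `yes` and `no`; the column name `recommended_flag` is hypothetical.

```python
import pandas as pd

# Toy rows standing in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "seat_comfort":    [1, 2, 4, 5, 3, 5],
    "value_for_money": [1, 1, 5, 4, 2, 5],
    "recommended":     ["no", "no", "yes", "yes", "no", "yes"],
})

# Encode the target as 0/1 so it can take part in the correlation.
toy["recommended_flag"] = toy["recommended"].map({"no": 0, "yes": 1})

# Pearson correlation of every numeric column with the target flag.
target_corr = (
    toy.select_dtypes("number")
       .corr()["recommended_flag"]
       .drop("recommended_flag")
       .sort_values(ascending=False)
)
print(target_corr)
```

Ratings with a correlation close to 1 move closely with the recommendation, which gives a first hint about useful features before any model is trained.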

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Pair Plot visualization code
column_name = [ 'seat_comfort','value_for_money','cabin_service','ground_service']
pairplot_data = df[column_name]
chart15=sns.pairplot(pairplot_data,kind = 'reg')
plt.show()

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale you data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***