<a href="https://colab.research.google.com/github/saurabhsingh3786/Airline_Passenger_Referral_Prediction/blob/main/Individual_Notebook_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - `Airline Passenger Referral Prediction`



##### **Project Type**    - Classification
##### **Contribution**    - Individual(Saurabh Singh)


# **Project Summary -**

Customer referral is a crucial aspect of business growth and success, and the airline industry is no exception. Satisfied passengers who have had positive experiences with an airline are more likely to refer the airline to their friends, family, and colleagues. Identifying these potential advocates can help airlines improve customer satisfaction and loyalty and attract new customers.

In this project, we will use machine learning algorithms to predict whether a passenger will refer an airline to others. We will use a dataset that includes past passengers and their referral behavior, as well as various features such as age, gender, flight class, and route information.

Our first step will be to perform exploratory data analysis to gain insights into the data and identify any patterns or correlations. We will then preprocess the data by handling missing values, encoding categorical variables, and scaling numeric features.

We will then apply several machine learning algorithms, including logistic regression, random forest, and support vector machines, to predict the likelihood of a passenger becoming a referral. We will also perform feature engineering and selection to improve the performance of our models.

Finally, we will evaluate our models using metrics such as accuracy, precision, recall, and F1 score. We will also use techniques such as cross-validation and grid search to tune our hyperparameters and ensure our models generalize well to new data.
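
The workflow described above (preprocessing, a classifier, cross-validated grid search, and evaluation on held-out data) can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the data is synthetic, the column subset, the toy target rule, and the `C` grid are hypothetical choices, and logistic regression stands in for whichever model is finally selected.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, NOT the airline dataset)
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "seat_comfort": rng.integers(1, 6, n),
    "value_for_money": rng.integers(1, 6, n),
    "cabin": rng.choice(["Economy Class", "Business Class"], n),
})
# Toy target: higher ratings make a recommendation more likely
score = demo["seat_comfort"] + demo["value_for_money"] + rng.normal(0, 1, n)
demo["recommended"] = (score > 6).astype(int)

X = demo.drop(columns="recommended")
y = demo["recommended"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numeric ratings and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["seat_comfort", "value_for_money"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cabin"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated grid search over the regularisation strength
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the scaler and encoder inside the `Pipeline` means they are refit on each training fold, so the cross-validation scores are not influenced by the validation rows.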

# **GitHub Link -**

https://github.com/saurabhsingh3786/Airline_Passenger_Referral_Prediction

# **Problem Statement**


The data includes airline reviews from 2006 to 2019 for popular airlines around the world, with multiple-choice and free-text questions. The data was scraped in Spring 2019. The main objective is to predict whether passengers will refer the airline to their friends.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#importing the dataset from drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
airline_df = pd.read_excel("/content/drive/MyDrive/AlmaBetter/Capstone Projects/Airline Passenger Referral Prediction/data_airline_reviews.xlsx")

### Dataset First View

In [None]:
# Dataset First Look
#first five rows
airline_df.head()

In [None]:
#last five rows
airline_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
total_rows, total_columns = airline_df.shape
print("Total Rows in the DataFrame:", total_rows)
print("Total Columns in the DataFrame:", total_columns)

### Dataset Information

In [None]:
# Dataset Info
airline_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = airline_df.duplicated(keep = 'first').sum()
print("Total Duplicate Rows in the DataFrame:", duplicate_count)

In [None]:
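# Drop fully empty rows, then drop repeated rows while keeping one copy of each
# (keep=False would also discard the first copy of every duplicated review)
airline_df.dropna(how='all', inplace=True)
airline_df.drop_duplicates(keep='first', inplace=True)
airline_df.reset_index(inplace=True,drop=True)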


In [None]:
# Dataset Duplicate Value Count
airline_df.duplicated(keep = 'first').sum()
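
The `keep` argument of `drop_duplicates` changes how many rows survive, which matters when a dataset mixes fully empty rows with genuinely repeated reviews. A minimal sketch on a made-up frame (the `toy` data is hypothetical and not taken from the airline dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one repeated review, one unique review, two empty rows
toy = pd.DataFrame({
    "airline": ["A", "A", "B", np.nan, np.nan],
    "overall": [7.0, 7.0, 3.0, np.nan, np.nan],
})

# keep=False removes every member of a duplicate group, including the first copy
kept_none = toy.drop_duplicates(keep=False)

# Dropping all-NaN rows first and then using keep='first' retains one copy per review
kept_first = toy.dropna(how="all").drop_duplicates(keep="first")

print(len(kept_none), len(kept_first))
```

With `keep=False` the repeated review for airline "A" disappears entirely, while `keep='first'` retains one copy of it.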

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Missing Value Count Function
def show_missing():
    missing = airline_df.columns[airline_df.isnull().any()].tolist()
    return missing

# Missing data counts and percentage
print('Missing Data Count')
print(airline_df[show_missing()].isnull().sum().sort_values(ascending = False))
print('--'*50)
print('Missing Data Percentage')
print(round(airline_df[show_missing()].isnull().sum().sort_values(ascending = False)/len(airline_df)*100,2))

In [None]:
# Visualizing the missing values

# Calculate the missing data percentage for each column

missing_percentage = round(airline_df[show_missing()].isnull().sum().sort_values(ascending = False)/len(airline_df)*100,2)

# Create a bar plot to visualize the missing data percentage
plt.figure(figsize=(12, 6))
missing_percentage.plot(kind='bar', color='steelblue')
plt.xlabel('Columns')
plt.ylabel('Missing Data Percentage (%)')
plt.title('Missing Data Percentage by Column')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


### What did you know about your dataset?

**`Insights Gained -`**
* Dataset Size: The DataFrame has a total of 131,895 entries (rows).

* Data Columns: There are 17 columns in the DataFrame, each representing a different feature.

* Non-Null Counts: The "Non-Null Count" indicates the number of non-missing (non-null) values in each column. This is important because missing data can impact the quality of the analysis and model performance.

* Data Types: The "Dtype" column shows the data types of each feature. In our DataFrame, there are seven columns with float64 data type (numeric) and ten columns with object data type (categorical or textual).


**Insights from the Non-Null Counts:**

* Some columns have missing data (NaN values). For example, "airline," "overall," "author," "review_date," "customer_review," "aircraft," "traveller_type," "cabin," "route," "date_flown," "seat_comfort," "cabin_service," "food_bev," "entertainment," "ground_service," "value_for_money," and "recommended" have some missing values.

**Insights from Data Types:**

* There are seven columns with numeric data types (float64), which likely represent ratings or scores for different aspects of the airline experience.

* Ten columns have the object data type, which can include categorical variables and textual data. Examples are "airline," "author," "review_date," "customer_review," "aircraft," "traveller_type," "cabin," "route," "date_flown," and "recommended."



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Explore each column in airline_df
for column in airline_df.columns:
    print(f"Column: {column}")
    print("Data Type:", airline_df[column].dtype)
    print("Number of Unique Values:", airline_df[column].nunique())
    print("Value Counts:")
    print(airline_df[column].value_counts())
    print("-" * 30)

In [None]:
# Dataset Describe
airline_df.describe()

In [None]:
#Categorical Dataset Describe
airline_df.describe(exclude=float)

### Variables Description

**Description Of Features:**

* airline: Name of the airline
* overall: Overall rating given to the trip, from 1 to 10
* author: Author of the review
* review_date: Date of the review
* customer_review: Review of the customer in free-text format
* aircraft: Type of the aircraft
* traveller_type: Type of traveller (e.g. business, leisure)
* cabin: Cabin class on the flight
* route: Route flown, giving the departure and arrival cities
* date_flown: Flight date
* seat_comfort: Rated between 1-5
* cabin_service: Rated between 1-5
* food_bev: Rated between 1-5
* entertainment: Rated between 1-5
* ground_service: Rated between 1-5
* value_for_money: Rated between 1-5
* recommended: Binary, target variable

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(airline_df.apply(lambda col: col.unique()))

## 3. ***Data Wrangling***

### Data Wrangling Code

Firstly, we will handle all our missing values so that we have cleaned data for our analysis.

In [None]:
# Write your code to make your dataset analysis ready.
# Function for finding the percentage of missing values per column
def missing_values_check(df):
    # use the DataFrame passed in rather than the global airline_df
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_values_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_missing': percent_missing})
    return missing_values_df.sort_values('percent_missing',ascending=False)

In [None]:
missing_values_check(airline_df)

Handling missing value in aircraft-

Since the "aircraft" feature has a high percentage of missing values, it might be best to drop this column.

In [None]:
airline_df.drop(columns=['aircraft'], inplace=True)
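
The decision to drop a sparsely populated column can also be expressed as a rule rather than by naming the column by hand. A minimal sketch, assuming a hypothetical cutoff of 60% missing and a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "aircraft": [np.nan, np.nan, np.nan, "A320"],
    "cabin": ["Economy Class", "Business Class", np.nan, "Economy Class"],
    "overall": [1.0, 8.0, 5.0, 9.0],
})

threshold = 0.6  # assumed cutoff: drop columns with more than 60% missing values

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

The threshold is a judgment call; a lower value removes more columns but leaves fewer values to impute.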

Handle missing value in date_flown, route, traveller_type, cabin,recommended (Categorical columns) -

In [None]:
# Drop rows with null values in the categorical columns and in the target column
airline_df.dropna(subset=['date_flown','route','traveller_type','cabin','recommended'], inplace=True)

Handle missing value in ground_service, entertainment, food_bev, seat_comfort, cabin_service, value_for_money, overall(Numerical columns) -

In [None]:
from sklearn.impute import SimpleImputer

# Replace missing values in numerical columns with the column mean
numeric_columns = ['food_bev', 'seat_comfort', 'cabin_service', 'value_for_money', 'overall', 'ground_service', 'entertainment']

# A single imputer learns one mean per column, so no per-column loop is needed
imputer = SimpleImputer(strategy='mean')
airline_df[numeric_columns] = imputer.fit_transform(airline_df[numeric_columns])

In [None]:
#check again for missing values
airline_df.isnull().sum()
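
Mean imputation fills every gap in a column with the same, usually non-integer, value, whereas median imputation keeps the filled value on the rating scale when the ratings are whole numbers. A minimal sketch comparing the two strategies on a made-up frame (the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ratings with gaps
toy = pd.DataFrame({
    "food_bev": [1.0, 5.0, np.nan, 5.0],
    "entertainment": [2.0, np.nan, 4.0, 3.0],
})

# Mean strategy: each NaN becomes the mean of the observed values in its column
mean_imputer = SimpleImputer(strategy="mean")
filled_mean = pd.DataFrame(
    mean_imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Median strategy: each NaN becomes the median of the observed values
median_imputer = SimpleImputer(strategy="median")
filled_median = pd.DataFrame(
    median_imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(filled_mean)
print(filled_median)
```

Which strategy suits the ratings better depends on how skewed each column is, so it is worth comparing the distributions before and after imputation.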

Now we will change review_date feature to extract day, month and year -

In [None]:
#changing review_date feature into pandas datetime

def handle_review_date(date_review_values):
    fin_date = []
    for date in date_review_values:
        #extracting day
        day = date.split()[0]
        if len(day) == 3:
            day = int(day[:1])
        else:
            day = int(day[:2])
        #extracting month
        month = date.split()[1]
        month_map = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
        month =  month_map[month]
        #extracting year
        year = date.split()[-1]
        fin_date.append(f'{year}-{month}-{day}')
    #returning as datetime
    return pd.to_datetime(fin_date)

In [None]:
airline_df.review_date = handle_review_date(airline_df.review_date)
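
The same conversion can be sketched in vectorised form: strip the ordinal suffix from the day and let pandas parse the remainder with an explicit format. This assumes the dates look like "8th May 2019" (day with suffix, full month name, four-digit year), which is what the function above implies; the sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the assumed "8th May 2019" layout
raw = pd.Series(["8th May 2019", "21st January 2018", "3rd March 2017"])

# Remove the ordinal suffix (st, nd, rd, th) that follows the day number
cleaned = raw.str.replace(r"^(\d{1,2})(st|nd|rd|th)\b", r"\1", regex=True)

# Parse with an explicit day / full month name / year format
parsed = pd.to_datetime(cleaned, format="%d %B %Y")
print(parsed)
```

An explicit `format` makes the parse fail loudly on unexpected layouts instead of silently guessing.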

Convert the date_flown feature to pandas datetime -

In [None]:

#changing date_flown feature into pandas datetime

def handle_date_flown(date_flown_values):
    fin_date = []
    for date in date_flown_values:
        if pd.isna(date):
            fin_date.append(np.nan)

        else:
            try:
                fin_date.append(pd.to_datetime(date))
            except (ValueError, TypeError):
                year = date.split()[1]
                month = date.split()[0]
                month_map = {'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,'July':7,'August':8,'September':9,'October':10,'November':11,'December':12}
                fin_date.append(pd.to_datetime(f'{year}-{month_map[month]}-01'))

    return fin_date


In [None]:
airline_df.date_flown = handle_date_flown(airline_df.date_flown)
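
If the flight dates are stored as "Month Year" strings (an assumption based on the fallback branch above), the conversion can be sketched without a Python loop. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample: two valid values, a missing value, and an unparseable string
raw = pd.Series(["May 2019", "December 2016", None, "unknown"])

# errors="coerce" turns anything that does not match the format into NaT;
# with no day in the format, the parsed date falls on the first of the month
parsed = pd.to_datetime(raw, format="%B %Y", errors="coerce")
print(parsed)
```

Unlike the loop, this returns a datetime Series directly, so the `.dt` accessor is available for extracting the month or year during EDA.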

The route feature actually holds two values, the departure city and the arrival city, so we split it into two separate features.

In [None]:
#creating two features as visit from and visit to from route feature

def handle_route():
    final_route = []
    for route in airline_df.route.values:
        if pd.isna(route):
            final_route.append((np.nan,np.nan))
        else:
            to_ind = str(route).find(' to ')
            via_idx = str(route).find(' via ')
            # ' to ' is 4 characters long, so the second city starts at to_ind + 4
            if via_idx == -1:
                final_route.append((str(route)[:to_ind],str(route)[to_ind+4:]))
            else:
                final_route.append((str(route)[:to_ind],str(route)[to_ind+4:via_idx]))
    return final_route

In [None]:
airline_df.route = handle_route()
# each tuple is (departure, arrival): the text before ' to ' is where the trip starts
airline_df['departure_city'] = airline_df.route.apply(lambda x: x[0])
airline_df['arrival_city'] = airline_df.route.apply(lambda x : x[1])
airline_df.drop('route',inplace= True,axis= 1)
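
Splitting a route string can also be sketched with a single regular expression. The column names `origin` and `destination` and the sample routes are hypothetical, and the pattern assumes routes follow the "A to B" or "A to B via C" layout.

```python
import pandas as pd

# Hypothetical sample routes, including a missing value
routes = pd.Series(["London to Paris", "Delhi to New York via Dubai", None])

# Named groups become column names; the optional ' via ...' tail is discarded
pattern = r"^(?P<origin>.+?) to (?P<destination>.+?)(?: via .+)?$"
parts = routes.str.extract(pattern)
print(parts)
```

Rows that do not match the pattern, including missing routes, come back as NaN in both columns, which makes malformed routes easy to count.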

In [None]:
#printing the first 5 observations
airline_df.head(5)

### What all manipulations have you done and insights you found?

First, I dropped the aircraft feature because about 70% of its values are null, and then handled the categorical and numerical features separately: rows with missing categorical values were dropped, and missing numerical values were replaced with the column mean. The two date columns, "date_flown" and "review_date", were stored as object by default, so I converted them to pandas datetime so that they can be used more effectively in EDA. Finally, I split the "route" feature into two features, "arrival_city" and "departure_city", and dropped "route".

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

`UNIVARIATE ANALYSIS`

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Question-1 : Distribution of some numerical features?

# Create a 2x3 grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

# Distribution of Overall Ratings
sns.histplot(data=airline_df, x='overall', bins=10, kde=True, ax=axes[0, 0])
axes[0, 0].set_xlabel('Overall Rating')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Overall Ratings')

# Distribution of Seat Comfort Ratings
sns.histplot(data=airline_df, x='seat_comfort', bins=5, kde=True, ax=axes[0, 1])
axes[0, 1].set_xlabel('Seat Comfort Rating')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Seat Comfort Ratings')

# Distribution of Cabin Service Ratings
sns.histplot(data=airline_df, x='cabin_service', bins=5, kde=True, ax=axes[0, 2])
axes[0, 2].set_xlabel('Cabin Service Rating')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Distribution of Cabin Service Ratings')

# Distribution of Food and Beverage Ratings
sns.histplot(data=airline_df, x='food_bev', bins=5, kde=True, ax=axes[1, 0])
axes[1, 0].set_xlabel('Food and Beverage Rating')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Food and Beverage Ratings')

# Distribution of Entertainment Ratings
sns.histplot(data=airline_df, x='entertainment', bins=5, kde=True, ax=axes[1, 1])
axes[1, 1].set_xlabel('Entertainment Rating')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Entertainment Ratings')

# Distribution of ground service Ratings
sns.histplot(data=airline_df, x= 'ground_service', bins=5, kde=True, ax=axes[1, 2])
axes[1, 2].set_xlabel('ground service Rating')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_title('Distribution of ground service Ratings')
# Customize the layout and spacing
plt.tight_layout()

# Show the plot
plt.show()


In [None]:

# Create a histogram (with KDE) for the distribution of Value for Money Ratings
plt.figure(figsize=(8, 6))
sns.histplot(data=airline_df, x='value_for_money', bins=6, kde=True)
plt.xlabel('Value for Money Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Value for Money Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows the shape of a numeric distribution in one view: the bars give the frequency of each rating, and the smoothed density curve highlights peaks and valleys.
Because the ratings are discrete scores (1-5, or 1-10 for overall) rather than continuous values, the bars carry the exact counts and the KDE curve should be read only as a rough guide to the shape.

##### 2. What is/are the insight(s) found from the chart?

* For the overall feature, ratings of 1 and 2 occur most frequently.

* For seat comfort, a rating of 1 is the most frequent and a rating of 4 is the second most frequent.

* For cabin service, a rating of 5 is the most frequent and a rating of 1 is the second most frequent.

* For food and beverage, ratings of 2, 4 and 5 occur with approximately equal frequency.

* For both entertainment and ground service, a rating of 3 is the most frequent and a rating of 1 is the second most frequent. Part of this peak may reflect the mean-imputed values introduced during data wrangling, which all fall into a single bin.

* For value for money, a rating of 1 is the most frequent, which suggests that many reviewers felt they did not get good value for the price paid.
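
Readings taken from a histogram can be cross-checked numerically by tabulating the share of each rating. A minimal sketch on made-up ratings (the `toy` values are hypothetical, not results from the airline data):

```python
import pandas as pd

# Hypothetical value-for-money ratings
toy = pd.DataFrame({"value_for_money": [1, 1, 1, 2, 3, 5, 5, 4]})

# Share of reviews per rating, ordered by rating value
share = toy["value_for_money"].value_counts(normalize=True).sort_index()

most_common = share.idxmax()
print(share)
print("Most frequent rating:", most_common)
```

A table like this gives exact shares for each rating, which is harder to read off the bars when several ratings have similar heights.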

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, the insights gained suggest areas for improvement and potential positive business impact by addressing specific issues that lead to lower ratings. Improving seat comfort, enhancing food and beverage options, and increasing value for money can contribute to increased passenger satisfaction and loyalty.

The potential negative growth could arise from the high frequency of low ratings in overall experience, seat comfort, and value for money. These areas are crucial for passenger satisfaction and could result in negative reviews, decreased repeat business, and a tarnished reputation if not addressed.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Question-2 : top 10 airlines?

# Get the top 10 airlines based on the frequency of reviews
top_10_airlines = airline_df['airline'].value_counts().head(10)

# Create a bar plot for the top 10 airlines by count
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_airlines.values, y=top_10_airlines.index, palette='viridis')
plt.xlabel('Number of count')
plt.ylabel('Airline')
plt.title('Top 10 Airlines by Number of count')
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot is a suitable choice for this visualization because it effectively communicates the frequency of reviews for each airline and allows for easy comparison of the top 10 airlines. It leverages the strengths of bar plots in representing categorical data and making meaningful comparisons.

##### 2. What is/are the insight(s) found from the chart?

* American Airlines has the highest number of reviews in the dataset. A high review count indicates a large volume of reviewed flights; on its own it does not show that passengers trust or prefer the airline.
* Spirit Airlines has the second-highest number of reviews. As a low-cost carrier, it is likely to be chosen by budget-conscious travelers.
* United Airlines and British Airways also have a large number of reviews. These airlines offer a variety of destinations and services, making them an option for travelers with different needs.
* China Southern Airlines, Emirates, Delta Air Lines, and Turkish Airlines are each strongly associated with a specific region of the world. For example, China Southern Airlines is a common choice for travelers to and from China, while Emirates is a common choice for travelers to and from the Middle East.
* Frontier Airlines and Qatar Airways also appear in the top 10. Frontier Airlines is a low-cost carrier that offers lower fares but may not offer the same level of service, whereas Qatar Airways is a full-service carrier.
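
Review volume and satisfaction are different quantities, so it can help to view them side by side. A minimal sketch, using made-up airlines and ratings (all values are hypothetical):

```python
import pandas as pd

# Hypothetical reviews: airline "A" is reviewed most often but rated lowest
toy = pd.DataFrame({
    "airline": ["A", "A", "A", "B", "B", "C"],
    "overall": [2, 3, 1, 9, 8, 5],
})

# Number of reviews and mean overall rating per airline, most-reviewed first
summary = (
    toy.groupby("airline")["overall"]
    .agg(review_count="count", mean_overall="mean")
    .sort_values("review_count", ascending=False)
)
print(summary)
```

In this toy example the most-reviewed airline has the lowest average rating, which illustrates why review counts alone should not be read as popularity.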

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact for airlines. For example, airlines can use the insights to:

* Identify areas where they can improve their customer service. For example, if passengers are consistently complaining about the food on board, the airline can look into ways to improve the food quality or offer more food options.
* Target their marketing campaigns more effectively. For example, if the airline knows that its target market is budget-conscious travelers, it can focus its marketing efforts on low-cost flights and travel packages.
* Develop new products and services that meet the needs of their customers. For example, if the airline knows that its customers are looking for more legroom, it can consider adding more seats with extra legroom to its fleet.

The insights can also help airlines avoid negative growth. For example, if the airline knows that its customers are dissatisfied with its customer service, it can take steps to improve its customer service before it starts to impact the airline's bottom line.

Here are some insights that could lead to negative growth if not addressed:

* Passengers are dissatisfied with the comfort of the seats. This could lead to passengers choosing to fly with other airlines that offer more comfortable seats.
* Passengers are dissatisfied with the food on board. This could lead to passengers bringing their own food on board, which could reduce the airline's revenue from food sales.
* Passengers are dissatisfied with the customer service. This could lead to passengers choosing to fly with other airlines that offer better customer service.

If airlines do not address these issues, it could lead to negative growth in the long run. By addressing these issues, airlines can improve their customer satisfaction and loyalty, which can lead to positive business impact.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***