<a href="https://colab.research.google.com/github/ruksz/Airline_passenger_referral/blob/main/Airline_ML_Classification_Capstone_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airline Passenger Referral Prediction





##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Rukshar Shaikh


# **Project Summary -**

Predicting Airline Recommendations from Customer Reviews

This project is focused on analyzing customer reviews of various airlines and building a predictive model to determine whether a customer will recommend an airline based on their review and overall experience. The dataset used for this analysis contains valuable information about customer sentiments, ratings, and preferences, which can provide significant insights into understanding customer satisfaction and improving airline services.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The primary objective of this project is to develop a robust machine learning model that can classify whether a customer will recommend an airline or not. This classification task is based on the sentiment expressed in the customer's review, the overall rating they provide, and other relevant features. By solving this problem, we aim to help airlines understand the factors influencing customer recommendations and enhance their services accordingly.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import missingno as msno


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
data = pd.read_excel('/content/drive/MyDrive/MLProject/data_airline_reviews.xlsx')

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Rows: {data.shape[0]}, Columns: {data.shape[1]}")
# Preview the first five rows
data.head()

### Dataset Information

In [None]:
# Dataset Info
data.info()

In [None]:
data.shape

We have 17 columns and 131895 rows in our data.

### Duplicate Values


In [None]:
# Count the number of duplicated rows
data.duplicated().sum()

In [None]:
# Drop the duplicated rows
data.drop_duplicates(inplace = True)

In [None]:
data.duplicated().sum()

### Null Values

From the rows previewed above, we can conclude that our dataset contains null values. Let's check the number of null values present in each column of this large dataset.

In [None]:
#Checking the null value count for each column
data.isnull().sum()

In [None]:
# Overall description of the numeric columns
data.describe().T

### What did you know about your dataset?

This dataset contains information related to airline passenger reviews, encompassing attributes such as airline names, overall ratings, reviewer details, review dates, and textual customer feedback. It further includes data on flight-specific details like aircraft type, traveler type, cabin class, flight routes, and flight dates. Additionally, passengers have rated various aspects of their experience, including seat comfort, cabin service, food and beverage quality, entertainment, ground service, and value for money. The dataset also includes an indication of whether passengers recommend the airline. However, it is worth noting that there are missing values in some columns, and the dataset offers opportunities for sentiment analysis, satisfaction prediction, and insights into factors affecting passenger experiences.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

airline: The name or identifier of the airline being reviewed.

overall: An overall rating or score given by passengers, possibly for their overall experience.

author: The author or reviewer of the feedback.

review_date: The date when the review was posted.

customer_review: The text content of the passenger's review or feedback.

aircraft: The type or identifier of the aircraft used for the flight.

traveller_type: The type of traveler (e.g., business, leisure) who left the review.

cabin: The cabin class or type (e.g., economy, business) the passenger traveled in.

route: The route or destination of the flight.

date_flown: The date when the flight took place.

seat_comfort, cabin_service, food_bev, entertainment, ground_service,
value_for_money: Ratings or scores for various aspects of the flight experience, such as seat comfort, cabin service, food and beverage, entertainment, ground service, and value for money.

recommended: An indication of whether the passenger would recommend the airline or flight. This is the target variable.

### Check Unique Values for each variable.

In [None]:
#Checking the unique values of the recommended column(target variable)
data.recommended.unique()

In [None]:
# Check Unique Values for each variable.
data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Make a copy of the current dataset and assign it to df
df = data.copy()

In [None]:
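# Check the percentage of missing values in each column.
def missing_values_per_check(df1):
    # Use the DataFrame passed in (df1) rather than the global `data`.
    percent_missing = df1.isnull().sum() * 100 / len(df1)
    missing_values_df = pd.DataFrame({'column_name': df1.columns,
                                      'percent_missing': percent_missing})
    return missing_values_df.sort_values('percent_missing', ascending=False)

# Show the columns with the most missing values first.
missing_values_per_check(df)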

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Cabin type and overall service ratings (out of 10)

# Create a barplot
plt.figure(figsize=(10, 5))
sns.barplot(x='cabin', y='overall', hue='recommended', data=data, palette=['green', 'red'])

# Add labels and a legend
plt.xlabel('Cabin Type')
plt.ylabel('Overall Service Rating')
plt.legend(title='Recommended', loc='upper right')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar plot shows how the average overall service rating varies by cabin type, split by whether the passenger recommended the airline.


##### 2. What is/are the insight(s) found from the chart?

If a trip is rated above 8 overall, it is most likely to be recommended by the traveller.

If it is rated below 3, the unhappy traveller has not recommended the airline, irrespective of cabin type.

#### Chart - 2

In [None]:
# Chart - 2 The top 10 airlines with the most trips

# Get the number of reviewed trips for each airline.
trip_by_airlines = data['airline'].value_counts()
trip_by_airlines

In [None]:
# Visualize the top 10 airlines with most trips
plt.figure(figsize=(20,5))
trip_by_airlines[:10].plot(kind='bar')
plt.xlabel('Airline')
plt.ylabel('Count',fontsize=12)
plt.title('Top 10 Airlines by Number of Trips')
plt.xticks(rotation='horizontal')
plt.show()

##### 1. Why did you pick the specific chart?

Which airlines made the most trips?

##### 2. What is/are the insight(s) found from the chart?

We have observed that the top 10 airlines with the most trips are:

*   Spirit Airlines
*   American Airlines
*   United Airlines
*   British Airways
*   Emirates
*   China Southern Airlines
*   Frontier Airlines
*   Ryanair
*   Delta Air Lines
*   Turkish Airlines

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Calculate the count of ratings for each traveller_type
traveller_type_counts = data['traveller_type'].value_counts()

# Find the traveller_type with the highest count
most_rated_traveller_type = traveller_type_counts.idxmax()

# Create a pie chart
plt.figure(figsize=(5, 5))
plt.pie(traveller_type_counts, labels=traveller_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title(f'Distribution of Ratings by Traveller Type\nMost Rated: {most_rated_traveller_type}')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Which traveller type has the most ratings?

##### 2. What is/are the insight(s) found from the chart?

Solo Leisure travellers have the most ratings.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='cabin', hue='recommended', palette=['red', 'purple'])
plt.title('Count of Recommendations by Cabin Type')
plt.xlabel('Cabin Type')
plt.ylabel('Count')
# Keep the labels seaborn derives from the data; hard-coded labels can be attached to the wrong bars.
plt.legend(title='Recommended')

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Which cabin type has the most recommendations?


##### 2. What is/are the insight(s) found from the chart?

Economy class has the most reviews, and most of them do not recommend the airline.

Business class has the second-highest number of reviews, and most of them recommend the airline.

Premium economy has a roughly equal number of positive and negative recommendations.

First class has the fewest reviews, and most of them recommend the airline.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Calculate the average ratings for food_bev and entertainment for all classes
average_ratings = data.groupby(['cabin'])[['food_bev', 'entertainment']].mean().reset_index()

# Create a bar plot (figure size set per plot instead of globally)
average_ratings.plot(x="cabin", y=["food_bev", "entertainment"], kind="bar", figsize=(8, 6))

# Add labels and legend
plt.title('Average Ratings of Food & Beverage and Entertainment in All Classes')
plt.xlabel('Cabin Class')
plt.ylabel('Average Rating')
plt.legend(title='Category')

# Show the plot
plt.show()


##### 1. Why did you pick the specific chart?

What are the average food_bev and entertainment ratings given by passengers in each cabin class?


##### 2. What is/are the insight(s) found from the chart?

In Economy Class, the average food_bev and entertainment ratings given by passengers are the lowest of all cabin classes.

#### Chart - 6

In [None]:
# Chart - 6  Creating a violin plot for cabin type and cabin service ratings

plt.figure(figsize=(10, 6))
sns.violinplot(data=data, x='cabin', y='cabin_service', hue='recommended', split=True)
plt.title('Cabin Type and Cabin Service Ratings')
plt.xlabel('Cabin Type')
plt.ylabel('Cabin Service Ratings')
plt.xticks(rotation=45)
plt.legend(title='Recommended', loc='upper right')
plt.show()

##### 1. Why did you pick the specific chart?

How are cabin service ratings distributed for each cabin type, and how do they relate to recommendations?

##### 2. What is/are the insight(s) found from the chart?

First class travellers are the least likely to recommend the airlines they travel with.

A recommendation is most probable when cabin service is given the full rating, i.e. 5 out of 5.

In economy class, a cabin service rating between 4 and 5 usually means the airline is recommended.

#### Chart - 7

In [None]:
# Chart - 7 Calculate the mean "value_for_money" ratings for each airline
mean_ratings = data.groupby('airline')['value_for_money'].mean().reset_index()

# Sort the airlines by mean "value_for_money" ratings and select the top 10
top_10_airlines = mean_ratings.sort_values(by='value_for_money', ascending=False).head(10)

# Create a barplot to visualize the mean "value_for_money" ratings for the top 10 airlines
plt.figure(figsize=(12, 6))
sns.barplot(data=top_10_airlines, x='airline', y='value_for_money', palette='viridis')
plt.xlabel('Airline')
plt.ylabel('Mean Value for Money Ratings')
plt.title('Top 10 Airlines with the Highest Mean Value for Money Ratings')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Which 10 airlines have the highest mean value-for-money ratings?

##### 2. What is/are the insight(s) found from the chart?

These top 10 airlines provide passengers with the highest perceived value for the cost of their services.

Airlines with lower mean ratings may consider reviewing their pricing strategies or customer service to raise the perceived value for money and potentially improve customer satisfaction.

#### Chart - 8

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
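
As a starting point for this chart, the correlation matrix of the rating columns can be computed with pandas alone. The sketch below is an illustration, not the notebook's final code: the helper name `rating_correlations` and the sample values are made up, and it assumes the rating columns listed under Variables Description are numeric.

```python
import pandas as pd

# Rating columns named in the Variables Description section.
RATING_COLUMNS = ['overall', 'seat_comfort', 'cabin_service', 'food_bev',
                  'entertainment', 'ground_service', 'value_for_money']


def rating_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the Pearson correlation matrix of the rating columns found in `frame`."""
    present = [col for col in RATING_COLUMNS if col in frame.columns]
    return frame[present].corr()


# Tiny made-up sample, only to show the call; the real input would be `data`.
sample = pd.DataFrame({
    'overall': [1, 5, 8, 10],
    'seat_comfort': [1, 3, 4, 5],
    'value_for_money': [1, 2, 4, 5],
})
corr = rating_correlations(sample)
print(corr.round(2))
```

Passing the result to `sns.heatmap(corr, annot=True, cmap='coolwarm')` would then draw the heatmap itself.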

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
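
One possible statement to test here, given the cabin charts above, is that recommendation is independent of cabin type (null hypothesis) versus dependent on it (alternate hypothesis). This hypothesis is an assumption for illustration; the notebook does not state one. A minimal sketch using a chi-square test of independence from SciPy, with a made-up sample and a hypothetical helper name:

```python
import pandas as pd
from scipy.stats import chi2_contingency


def cabin_recommendation_test(frame: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Chi-square test of independence between `cabin` and `recommended`."""
    # Contingency table: one row per cabin type, one column per recommendation value.
    table = pd.crosstab(frame['cabin'], frame['recommended'])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {
        'chi2': float(chi2),
        'p_value': float(p_value),
        'dof': int(dof),
        'reject_null': bool(p_value < alpha),
    }


# Made-up sample; the real input would be the cleaned review data.
sample = pd.DataFrame({
    'cabin': ['Economy Class'] * 40 + ['Business Class'] * 40,
    'recommended': ['no'] * 30 + ['yes'] * 10 + ['no'] * 10 + ['yes'] * 30,
})
result = cabin_recommendation_test(sample)
print(result)
```

A chi-square test fits this case because both variables are categorical; a p-value below the chosen significance level would lead to rejecting the null hypothesis of independence.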

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
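
One simple way to treat the missing values is sketched below. It is an assumption about a reasonable approach, not the author's chosen method: rows without the target are dropped, because they cannot be used for supervised training, and numeric ratings are filled with the column median. The helper name `handle_missing` and the sample values are made up.

```python
import numpy as np
import pandas as pd


def handle_missing(frame: pd.DataFrame, target: str = 'recommended') -> pd.DataFrame:
    """Drop rows with a missing target, then median-impute the numeric columns."""
    cleaned = frame.dropna(subset=[target]).copy()
    numeric_cols = cleaned.select_dtypes(include='number').columns
    # fillna with a Series of medians fills each column with its own median.
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned


# Made-up sample with gaps in both the ratings and the target.
sample = pd.DataFrame({
    'overall': [1, np.nan, 9, 5],
    'seat_comfort': [np.nan, 2, 4, 3],
    'recommended': ['no', 'no', 'yes', np.nan],
})
cleaned = handle_missing(sample)
print(cleaned)
```

The median is less sensitive to extreme ratings than the mean, which is why it is used in this sketch.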

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
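
A minimal encoding sketch follows. It assumes the target holds the strings `'yes'` and `'no'` (check this against the output of `data.recommended.unique()` first) and one-hot encodes `cabin` and `traveller_type`. The helper name and sample rows are hypothetical.

```python
import pandas as pd


def encode_categoricals(frame: pd.DataFrame) -> pd.DataFrame:
    """Binary-encode the target and one-hot encode the low-cardinality categoricals."""
    encoded = frame.copy()
    # Assumed label values; anything else becomes NaN and should be inspected.
    encoded['recommended'] = encoded['recommended'].map({'no': 0, 'yes': 1})
    # One indicator column per category value, stored as integers.
    encoded = pd.get_dummies(encoded, columns=['cabin', 'traveller_type'], dtype=int)
    return encoded


# Made-up sample rows.
sample = pd.DataFrame({
    'cabin': ['Economy Class', 'Business Class', 'Economy Class'],
    'traveller_type': ['Solo Leisure', 'Business', 'Couple Leisure'],
    'recommended': ['yes', 'no', 'yes'],
})
encoded = encode_categoricals(sample)
print(encoded.columns.tolist())
```

One-hot encoding suits these columns because they have only a few distinct values and no natural order.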

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces
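
The cleaning steps 2 to 5 above can be combined into one function using only the standard library. This is a sketch under stated assumptions: the stopword set is a tiny illustrative list (a real run would more likely use a library list such as NLTK's), and URLs are removed before punctuation, because stripping punctuation first would break the URL pattern.

```python
import re
import string

# Tiny illustrative stopword set, not a complete list.
STOPWORDS = {'the', 'a', 'an', 'and', 'was', 'is', 'to', 'of', 'in', 'on', 'at'}

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
WORD_WITH_DIGIT = re.compile(r'\b\w*\d\w*\b')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_review(text: str) -> str:
    """Lowercase, then strip URLs, digit-bearing words, punctuation, stopwords and extra spaces."""
    text = text.lower()
    text = URL_PATTERN.sub(' ', text)
    text = WORD_WITH_DIGIT.sub(' ', text)
    text = text.translate(PUNCT_TABLE)
    # split() with no argument also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return ' '.join(tokens)


cleaned = clean_review("The flight BA123 was GREAT!  See https://example.com for details.")
print(cleaned)  # flight great see for details
```

Applied to the dataset, this would look like `df['customer_review'].astype(str).apply(clean_review)`.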

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
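
A stratified split keeps the share of recommended and not-recommended reviews the same in the training and test sets. The sketch below uses `train_test_split`, which the notebook already imports; the 80/20 ratio, the feature names and the sample values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and a balanced binary target.
features = pd.DataFrame({
    'overall': list(range(1, 21)),
    'value_for_money': [1, 2, 3, 4, 5] * 4,
})
target = pd.Series([0] * 10 + [1] * 10, name='recommended')

# 80/20 split; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts().to_dict())
```

Fixing `random_state` makes the split reproducible across notebook runs.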

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
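
Since the notebook imports `RandomForestClassifier` and the accuracy, classification report and confusion matrix metrics, a first model could look like the sketch below. The data is synthetic (the label simply follows the overall rating), so the scores say nothing about the real dataset; all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two rating features and a label driven by `overall`.
rng = np.random.default_rng(0)
overall = rng.integers(1, 11, size=200)
value_for_money = rng.integers(1, 6, size=200)
X = np.column_stack([overall, value_for_money])
y = (overall >= 6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the algorithm
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the held-out data and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

On the real data, `X` would be the engineered feature matrix and `y` the encoded `recommended` column.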

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
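
Grid search with cross-validation is one of the techniques named in the cell above. The sketch below shows it with scikit-learn's `GridSearchCV` on synthetic data; the parameter grid, the 3-fold setting and the F1 scoring are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, as the real feature matrix is not built yet.
rng = np.random.default_rng(1)
X = rng.integers(1, 11, size=(120, 3))
y = (X[:, 0] >= 6).astype(int)

# Small illustrative grid; a real search would cover more values.
param_grid = {'n_estimators': [25, 50], 'max_depth': [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,           # 3-fold cross-validation
    scoring='f1',   # F1 balances precision and recall on the positive class
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is already refit on the full training data and can be used for prediction.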

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
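
Saving and reloading can be checked in one round trip. The sketch below uses `pickle` from the standard library with a toy model and a temporary directory; the file name `airline_referral_model.pkl` is hypothetical, and in the notebook the model would be the best estimator chosen above.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier

# Toy model standing in for the best performing model.
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit([[1], [2], [9], [10]], [0, 0, 1, 1])

unseen = [[3], [8]]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'airline_referral_model.pkl'  # hypothetical file name

    # Save the fitted model to disk.
    with open(model_path, 'wb') as fh:
        pickle.dump(model, fh)

    # Load it back and predict on unseen data as a sanity check.
    with open(model_path, 'rb') as fh:
        restored = pickle.load(fh)

predictions_match = bool((model.predict(unseen) == restored.predict(unseen)).all())
print(predictions_match)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and only from a trusted source, because unpickling can execute arbitrary code.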

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***