# **Project Name**    -  Airline Passenger Referral Prediction (Classification)



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** - Nikita Saxena


# **Project Summary -**

1. The project involves analyzing airline reviews spanning from 2006 to 2019 to predict whether passengers would recommend the airline to their friends. The dataset contains multiple choice and free text questions, providing valuable insights into customer opinions and preferences.

2. Key project steps include efficient exploratory data analysis to understand the data and prepare it for training, addressing class imbalance in the target feature to ensure accurate predictions, and selecting an appropriate machine learning algorithm to model passenger recommendations.

3. The project evaluates the model's performance while considering class imbalance, identifies important features that influence customer satisfaction and recommendations, and provides useful insights for stakeholders in the airline industry to enhance customer experience and make informed decisions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The problem at hand is to analyze airline reviews collected between 2006 and 2019 and predict whether passengers will recommend the airline to their friends. The dataset includes various types of questions, allowing us to gain insights into customer opinions and preferences. The objective is to develop an effective model that can accurately predict passenger recommendations based on the provided data. By solving this problem, we aim to assist stakeholders in the airline industry in understanding customer satisfaction and improving their services to enhance overall customer experience.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
!pip install geopandas
import geopandas as gpd
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
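from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # corpus required by nltk.corpus.stopwords
nltk.download('wordnet')    # corpus required by WordNetLemmatizer
from nltk.corpus import stopwords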
import re
import string

### Dataset Loading

In [None]:
# Load Dataset
air= pd.read_excel('/content/data_airline_reviews.xlsx')

### Dataset First View

In [None]:
# Dataset First Look
print(air)

In [None]:
#display top 5 data from the dataset
air.head()

In [None]:
#display last 5 data of the dataset
air.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

num_rows = air.shape[0]
print("NUMBER OF ROWS :- ", num_rows)
num_columns = air.shape[1]
print("NUMBER OF COLUMN :- ", num_columns)

### Dataset Information

In [None]:
# Dataset Info
air.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = air.duplicated().sum()

print("Number of duplicate values:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Count missing values
missing_count = air.isnull().sum()

print("Missing values count:")
print(missing_count)

In [None]:
# Visualizing the missing values

# Create a bar graph of missing values
plt.figure(figsize=(10, 6))
missing_count.plot(kind='bar', color='skyblue')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()

### What did you know about your dataset?

Answer 

The dataset contains 131,895 rows and 17 columns, of which 70,711 rows are duplicates. The column names are 'airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money', and 'recommended'. Several columns have missing values, with the highest counts in 'aircraft', 'traveller_type', 'ground_service', and 'entertainment'. The dataset describes airline reviews: overall and per-aspect ratings, the review text, and details of the flight.
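
The duplicate count above is large relative to the number of rows, so it is worth checking what the duplicates actually are before modelling. The following is a minimal sketch on a made-up toy frame (the `reviews` DataFrame and its values are assumptions for illustration, not rows from the dataset). It separates rows that are completely empty from exact duplicate rows and removes both.

```python
import pandas as pd

# Toy frame standing in for `air`; column names follow the dataset, values are made up.
reviews = pd.DataFrame({
    "airline": ["A", "A", None, "B", None],
    "overall": [7.0, 7.0, None, 3.0, None],
    "recommended": ["yes", "yes", None, "no", None],
})

n_duplicates = int(reviews.duplicated().sum())      # rows identical to an earlier row
n_empty = int(reviews.isnull().all(axis=1).sum())   # rows where every column is missing

# Drop fully empty rows first, then exact duplicates among the remaining rows.
deduped = (
    reviews.dropna(how="all")
           .drop_duplicates()
           .reset_index(drop=True)
)

print("duplicate rows:", n_duplicates)
print("fully empty rows:", n_empty)
print("rows kept:", len(deduped))
```

Applied to the real data, the same two counts would show how many of the duplicates are empty rows and how many are repeated reviews.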

## ***2. Understanding Your Variables***

In [None]:
# Get the column names
column_names = air.columns

# Print the column names
for column in column_names:
    print(column)

In [None]:
# Dataset Describe
air.describe()


### Variables Description 

*Answer* 
Here's a description of the variables in the 'air' dataset:

1. airline: The name or code of the airline.
2. overall: The overall rating given by the customer for their airline experience.
3. author: The author or reviewer of the airline review.
4. review_date: The date when the review was written.
5. customer_review: The text of the customer's review.
6. aircraft: The aircraft used for the flight.
7. traveller_type: The type of traveler (e.g., business, leisure, family).
8. cabin: The type of cabin or class (e.g., economy, business, first class).
9. route: The route or destination of the flight.
10. date_flown: The date when the flight was taken.
11. seat_comfort: The rating for seat comfort.
12. cabin_service: The rating for cabin service.
13. food_bev: The rating for food and beverages.
14. entertainment: The rating for in-flight entertainment.
15. ground_service: The rating for ground services (e.g., check-in, boarding).
16. value_for_money: The rating for value for money.
17. recommended: Whether the customer would recommend the airline to others (1 = Yes, 0 = No).


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Check unique values for each variable
for column in air.columns:
    unique_values = air[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print()

In [None]:

#check the unique value
air.nunique()
     

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Assign the result back instead of calling fillna(inplace=True) on a selected
# column, which newer pandas versions warn about and may not apply to `air`
air['aircraft'] = air['aircraft'].fillna('Unknown')
air['traveller_type'] = air['traveller_type'].fillna('Not Specified')

# Convert date columns to datetime format
air['date_flown'] = pd.to_datetime(air['date_flown'])

# Encoding categorical variables
df_encoded = pd.get_dummies(air, columns=['airline', 'cabin', 'route'])

# Print the updated dataset
print(df_encoded.head())

### What all manipulations have you done and insights you found?

Answer 

Let's break it down:

1. Handling Missing Values:
   - `air['aircraft'] = air['aircraft'].fillna('Unknown')`: This line fills the missing values in the 'aircraft' column with the string value 'Unknown'. This ensures that there are no missing values in the 'aircraft' column.
   - `air['traveller_type'] = air['traveller_type'].fillna('Not Specified')`: This line fills the missing values in the 'traveller_type' column with the string value 'Not Specified'. It ensures that all missing values in the 'traveller_type' column are replaced.

2. Convert Date Columns to Datetime Format:
   - `air['date_flown'] = pd.to_datetime(air['date_flown'])`: This line converts the 'date_flown' column to datetime format using the `pd.to_datetime()` function from pandas. This transformation allows for easier manipulation and analysis of date-related information.

3. Encoding Categorical Variables:
   - `df_encoded = pd.get_dummies(air, columns=['airline', 'cabin', 'route'])`: This line encodes the categorical variables 'airline', 'cabin', and 'route' using one-hot encoding. The `pd.get_dummies()` function creates binary indicator variables for each category within these columns. This transformation enables the inclusion of categorical variables in further analysis or modeling.

4. Print the Updated Dataset:
   - `print(df_encoded.head())`: This line prints the head of the updated dataset (`df_encoded`), showing the transformed dataset with filled missing values and encoded categorical variables.

Overall, these data manipulations ensure that missing values are handled, date columns are in the appropriate format, and categorical variables are encoded for analysis. The printed output provides a glimpse of the updated dataset, ready for further analysis.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Question: What is the distribution of passenger recommendations?
sns.countplot(x='recommended', data=air)
plt.title('Distribution of Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Count')
plt.show()


Visualized Solution: This bar plot shows the number of passengers who would recommend the airline (1) versus those who would not (0). It helps understand the balance of recommendations and identify any class imbalance issues.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Question: How does seat comfort impact passenger recommendations?
sns.boxplot(x='recommended', y='seat_comfort', data=air)
plt.title('Seat Comfort vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Seat Comfort')
plt.show()


Visualized Solution: This box plot compares the distribution of seat comfort scores for passengers who would recommend the airline (1) versus those who would not (0). It helps assess whether higher seat comfort ratings are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Question: What is the distribution of traveller types among passengers?
sns.countplot(x='traveller_type', data=air)
plt.title('Distribution of Traveller Types')
plt.xlabel('Traveller Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


Visualized Solution: This bar plot displays the number of passengers in different traveller types. It provides insights into the composition of passengers and helps identify the dominant traveller types.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Question: How does cabin service influence passenger recommendations?
sns.boxplot(x='recommended', y='cabin_service', data=air)
plt.title('Cabin Service vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Cabin Service')
plt.show()


Visualized Solution: This box plot illustrates the relationship between cabin service ratings and passenger recommendations. It helps determine whether higher cabin service scores are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Question: What is the distribution of airline ratings?
sns.histplot(air['overall'], bins=10, kde=True)
plt.title('Distribution of Airline Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


Visualized Solution: This histogram displays the distribution of overall airline ratings. It provides an overview of the rating distribution and highlights any significant patterns or outliers.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Question: How does value for money impact passenger recommendations?
sns.violinplot(x='recommended', y='value_for_money', data=air)
plt.title('Value for Money vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Value for Money')
plt.show()


Visualized Solution: This violin plot compares the distribution of value for money ratings for passengers who would recommend the airline (1) versus those who would not (0). It helps assess whether higher value for money scores are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code
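#Question: What is the distribution of recommended airlines by aircraft type?
plt.figure(figsize=(12, 6))
# Keep the 20 most frequently reviewed aircraft types; air.head(20) would only
# take the first 20 rows. The 'Unknown' placeholder from data wrangling is excluded.
top_aircraft = air.loc[air['aircraft'] != 'Unknown', 'aircraft'].value_counts().head(20).index
air_top20 = air[air['aircraft'].isin(top_aircraft)]

sns.countplot(x='aircraft', hue='recommended', data=air_top20, order=top_aircraft)
plt.title('Distribution of Recommended Airlines by Aircraft Type (Top 20)')
plt.xlabel('Aircraft Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()


Visualized Solution: This grouped bar plot shows the number of recommended and not recommended reviews for the 20 most frequently reviewed aircraft types. It provides insights into which aircraft types are associated with a higher number of positive recommendations.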

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Question: How does the date of the review impact passenger recommendations?
air['review_date'] = pd.to_datetime(air['review_date'])
air['year'] = air['review_date'].dt.year

sns.countplot(x='year', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend(title='Recommended')
plt.xticks(rotation=45)
plt.show()



Visualized Solution: This grouped bar plot displays the distribution of passenger recommendations over the years. It helps identify any trends or changes in recommendations over time.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Question: What is the distribution of passenger recommendations by cabin class?
sns.countplot(x='cabin', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Cabin Class')
plt.xlabel('Cabin Class')
plt.ylabel('Count')
plt.legend(title='Recommended')
plt.show()


Visualized Solution: This grouped bar plot showcases the distribution of passenger recommendations across different cabin classes. It helps assess whether certain cabin classes are more likely to receive positive recommendations.
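
Raw counts make it hard to compare categories of very different sizes, so a recommendation rate per category can complement the count plots. The sketch below is illustrative only: the helper name `recommendation_rate`, the toy data, and the assumption that the positive label is stored as `'yes'` are not taken from the notebook (pass `positive=1` if the column holds 1/0).

```python
import pandas as pd

def recommendation_rate(df, by, target="recommended", positive="yes"):
    """Share of positive recommendations and review count per category of `by`."""
    df = df.dropna(subset=[target, by])          # ignore rows without a label or category
    is_positive = df[target].eq(positive)
    summary = (
        df.assign(_positive=is_positive)
          .groupby(by)["_positive"]
          .agg(rate="mean", reviews="size")      # mean of booleans = share of positives
          .sort_values("rate", ascending=False)
    )
    return summary

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cabin": ["Economy Class", "Economy Class", "Economy Class",
              "Business Class", "Business Class"],
    "recommended": ["yes", "no", "no", "yes", "yes"],
})
rates = recommendation_rate(toy, by="cabin")
print(rates)
```

The `reviews` column is kept next to the rate so that categories with very few reviews are not over-interpreted.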

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Question: How does the route affect passenger recommendations?
plt.figure(figsize=(12, 6))
# Keep the 20 most frequently reviewed routes; air.head(20) would only take
# the first 20 rows of the dataset
top_routes = air['route'].value_counts().head(20).index
sns.countplot(x='route', hue='recommended', data=air[air['route'].isin(top_routes)], order=top_routes)
plt.title('Distribution of Passenger Recommendations by Route (Top 20)')
plt.xlabel('Route')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()



Visualized Solution: This grouped bar plot shows the distribution of passenger recommendations for the 20 most frequently reviewed routes. It provides insights into which routes have a higher number of positive recommendations.



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Question: How does the date of the review impact passenger recommendations?
air['review_date'] = pd.to_datetime(air['review_date'])
air['year'] = air['review_date'].dt.year

sns.countplot(x='year', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()



Visualized Solution: This bar plot showcases the distribution of passenger recommendations based on the year of the review. It helps identify any trends or patterns in recommendations over time.



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Question: What is the distribution of recommended airlines by traveler type?
sns.countplot(x='traveller_type', hue='recommended', data=air)
plt.title('Distribution of Recommended Airlines by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='cabin', data=air)
plt.title('Distribution of Cabin Classes')
plt.xlabel('Cabin Class')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


Visualized Solution: This bar plot displays the number of passengers in different cabin classes. It provides insights into the distribution of passengers across different cabin classes.



In [None]:
sns.boxplot(x='seat_comfort', y='overall', data=air)
plt.title('Seat Comfort vs. Airline Ratings')
plt.xlabel('Seat Comfort')
plt.ylabel('Rating')
plt.show()


Visualized Solution: This box plot compares the distribution of airline ratings for different levels of seat comfort. It helps assess whether higher seat comfort ratings are associated with higher overall airline ratings.

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

In [None]:
sns.histplot(air['ground_service'], bins=10, kde=True)
plt.title('Distribution of Ground Service Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


Visualized Solution: This histogram displays the distribution of ground service ratings. It provides an overview of the rating distribution and highlights any significant patterns or outliers.

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(x='route', y='value_for_money', data=air[:20])
plt.title('Value for Money Ratings by Route')
plt.xlabel('Route')
plt.ylabel('Value for Money Rating')
plt.xticks(rotation=45)
plt.show()


This code creates a violin plot that displays the distribution of value for money ratings by route. It uses only the first 20 rows of the "air" DataFrame (not the 20 most frequent routes), with the "route" column on the x-axis and the "value_for_money" column on the y-axis. The plot has a title and axis labels, and the x-axis tick labels are rotated for readability. A violin plot shows the shape of each distribution and can reveal multiple modes in the data, although with so few rows per route the shapes should be read with caution.






#### Chart - 14 - Correlation Heatmap

In [None]:
# Select the relevant columns for correlation analysis
selected_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money', 'recommended']

# Create a correlation matrix
correlation_matrix = air[selected_columns].corr()

# Generate a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
# Select the relevant variables for the pair plot
variables = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

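# Create the pair plot
g = sns.pairplot(air[variables])
# pairplot returns a grid of subplots, so set a figure-level title;
# plt.title would only label the last subplot
g.figure.suptitle('Pair Plot of Variables', y=1.02)
plt.show()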

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Removing Multicollinear Features

In [None]:
# Creating a function to calculate VIF
def calc_vif(X):
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif


In [None]:
import numpy as np

# Remove missing values and infinite values from the data
air_cleaned = air[[i for i in air.describe().columns if i not in ['recommended', 'value_for_money', 'overall']]].replace([np.inf, -np.inf], np.nan).dropna()

# Calculate VIF for the cleaned data
vif_results = calc_vif(air_cleaned)

print(vif_results)
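
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on the remaining features. As far as I know, statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so no intercept is included unless a constant column is added first. The sketch below is a plain NumPy version that always includes an intercept; the function name `vif_with_intercept` and the synthetic data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def vif_with_intercept(X):
    """VIF per column, regressing each column on the others plus an intercept."""
    X = X.astype(float)
    rows = []
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        design = np.column_stack([np.ones(len(X)), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)     # ordinary least squares
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())           # assumes the column is not constant
        r_squared = 1.0 - ss_res / ss_tot
        vif = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
        rows.append({"Variable": col, "VIF": vif})
    return pd.DataFrame(rows)

# Synthetic features: `a_plus_noise` is almost a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})
vif_table = vif_with_intercept(toy)
print(vif_table)
```

In this toy data the two nearly identical columns receive large VIF values while the independent column stays close to 1, which is the pattern to look for when deciding which features to drop.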



# **Defining the dependent and independent variables.**

In [None]:
#separating the dependent and independent variables
y = air['recommended']
x = air.drop(columns = 'recommended')


In [None]:

x.columns

# **One Hot Encoding**

In [None]:
x = pd.get_dummies(x)
     

In [None]:
x.shape

In [None]:
x.head(2)

The proportions of the two labels ('yes' and 'no') are approximately equal, so no class-imbalance handling technique is needed.
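
The balance claim can be checked directly from the label column. This is a small sketch on made-up labels (the helper `class_balance` and the toy values are assumptions for illustration); on the real data it would be called with `air['recommended']`.

```python
import pandas as pd

def class_balance(labels):
    """Return class proportions and the minority-to-majority ratio."""
    shares = labels.value_counts(normalize=True, dropna=True)  # missing labels are ignored
    ratio = shares.min() / shares.max()
    return shares, ratio

# Made-up labels for illustration only.
toy_labels = pd.Series(["yes", "no", "no", "yes", "no", None])
shares, ratio = class_balance(toy_labels)
print(shares)
print(f"minority/majority ratio: {ratio:.2f}")
```

A ratio close to 1 supports skipping resampling. Whatever the ratio, passing `stratify=y` to `train_test_split` keeps the same proportions in the train and test splits.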

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer 

Here are three hypothetical statements based on the dataset:

1. Statement: Passengers who rate higher on "value_for_money" are more likely to recommend the airline.

2. Statement: Passengers who have a higher overall rating are more likely to rate the seat comfort positively.

3. Statement: Passengers who have a higher overall rating are more likely to rate the cabin service positively.

To perform hypothesis testing and obtain a final conclusion for each statement, we'll use the following steps:

1. Set up the null hypothesis (H0) and alternative hypothesis (H1) for each statement.
2. Perform the appropriate statistical test based on the nature of the variables involved.
3. Calculate the test statistic and p-value.
4. Determine the significance level (α) for the test.
5. Compare the p-value with the significance level to make a decision.
6. Draw a conclusion based on the decision made.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer 
For Hypothetical Statement 1:

Null Hypothesis (H0): Passengers who rate higher on "value_for_money" are not more likely to recommend the airline.

Alternative Hypothesis (H1): Passengers who rate higher on "value_for_money" are more likely to recommend the airline.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Drop rows with missing values
cleaned_data = air.dropna(subset=['value_for_money', 'recommended'])

# Perform statistical test
observed_values = cleaned_data[cleaned_data['recommended'] == 1]['value_for_money']
expected_values = cleaned_data[cleaned_data['recommended'] == 0]['value_for_money']

# Perform t-test
from scipy.stats import ttest_ind
t_statistic, p_value = ttest_ind(observed_values, expected_values, nan_policy='omit')

print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

Answer 

An independent samples t-test was used to compare the mean "value_for_money" ratings of passengers who recommended the airline and those who did not. The obtained p-value was "nan", so no conclusion can be drawn about a difference between the two groups. A "nan" p-value typically means that at least one of the two groups is empty; a likely cause is that 'recommended' is stored as 'yes'/'no' text while the code filters on 1 and 0. The label values should be checked and the test rerun before the result is interpreted.

##### Why did you choose the specific statistical test?

Answer 

I chose the independent samples t-test because it is commonly used to compare the means of two independent groups, here the "value_for_money" ratings of passengers who recommended the airline and of those who did not. The test assesses whether the difference between the two group means is statistically significant. Because the obtained p-value was "nan", the underlying cause has to be investigated before the test can support any conclusion.
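
One way to rerun the comparison so that it does not depend on how the labels are stored is sketched below. It is not the notebook's original test: it swaps in Welch's t-test (`equal_var=False`) with a one-sided alternative to match H1, and the helper name `compare_ratings`, the accepted label spellings, the 0.05 significance level, and the toy data are all assumptions for illustration. The `alternative` argument requires a reasonably recent SciPy.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_ratings(df, rating="value_for_money", target="recommended", alpha=0.05):
    """Welch's t-test of `rating` between recommenders and non-recommenders."""
    # Normalise the target so that 'yes'/'no' and 1/0 encodings are both handled.
    labels = df[target].astype(str).str.strip().str.lower()
    is_yes = labels.isin(["yes", "1", "1.0", "true"])
    is_no = labels.isin(["no", "0", "0.0", "false"])

    yes_ratings = df.loc[is_yes, rating].dropna()
    no_ratings = df.loc[is_no, rating].dropna()
    if len(yes_ratings) < 2 or len(no_ratings) < 2:
        # An empty group is what produces a nan p-value, so fail loudly instead.
        raise ValueError("Each group needs at least two ratings; check the label values.")

    # One-sided alternative: recommenders give higher ratings than non-recommenders.
    t_stat, p_value = ttest_ind(yes_ratings, no_ratings,
                                equal_var=False, alternative="greater")
    return {"t": float(t_stat), "p": float(p_value), "reject_h0": bool(p_value < alpha)}

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "value_for_money": [5, 4, 5, 4, 5, 1, 2, 1, 2, np.nan],
    "recommended": ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"],
})
result = compare_ratings(toy)
print(result)
```

Because the ratings are ordinal scores on a short scale, a rank-based test such as Mann-Whitney U may be an equally reasonable choice; the t-test is kept here only to stay close to the approach described above.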

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer 

Null hypothesis (H0): There is no association between the overall rating and the seat comfort rating.

Alternate hypothesis (HA): Passengers who give a higher overall rating also tend to give a higher seat comfort rating.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:

# Count missing values
missing_count = air.isnull().sum()

print("Missing values count:")
print(missing_count)

In [None]:
# Check for missing values
missing_values = air.isnull().sum()
print("Missing Values:")
print(missing_values)

# Fill missing values with appropriate imputation method
# For numerical columns, you can use mean imputation
numeric_cols = ['overall', 'seat_comfort', 'cabin_service', 'food_bev',
                'entertainment', 'ground_service', 'value_for_money']
for col in numeric_cols:
    air[col] = air[col].fillna(air[col].mean())

# For categorical columns, you can use mode imputation
categorical_cols = ['airline', 'author', 'customer_review', 'aircraft',
                    'traveller_type', 'cabin', 'route', 'date_flown', 'recommended']
for col in categorical_cols:
    air[col] = air[col].fillna(air[col].mode().iloc[0])


In [None]:
# Check for missing values
missing_values = air.isnull().sum()
print("Missing Values:")
print(missing_values)

# Convert 'review_date' column to datetime format
air['review_date'] = pd.to_datetime(air['review_date'], errors='coerce')

# Fill missing values in 'review_date' column with a specific value or strategy
# For example, you can fill the missing values with the median date or the most recent date
median_date = air['review_date'].median()
air['review_date'] = air['review_date'].fillna(median_date)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer 
The code uses mean imputation for the numerical rating columns and mode imputation for the categorical columns; missing 'review_date' values are filled with the median date. Mean imputation replaces missing numerical values with the column mean, while mode imputation replaces missing categorical values with the most frequent category. These techniques were chosen according to each column's data type. They are simple to apply, but they implicitly assume that values are missing largely at random.
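
When the data is later split for modelling, fill values are usually learned from the training rows only and then reused on the test rows, so that no information leaks from the test set. The sketch below illustrates that idea on toy data. It is an alternative to the cells above rather than a description of them: it uses the median instead of the mean for ratings, and the helper name `fit_imputation_values` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def fit_imputation_values(train, numeric_cols, categorical_cols):
    """Learn fill values (median / most frequent value) from the training rows only."""
    fill = {col: train[col].median() for col in numeric_cols}
    fill.update({col: train[col].mode().iloc[0] for col in categorical_cols})
    return fill

# Made-up rows for illustration only.
train = pd.DataFrame({
    "seat_comfort": [1.0, 5.0, None, 4.0],
    "cabin": ["Economy Class", None, "Economy Class", "Business Class"],
})
test = pd.DataFrame({
    "seat_comfort": [None, 2.0],
    "cabin": [None, "First Class"],
})

fill_values = fit_imputation_values(train, ["seat_comfort"], ["cabin"])
train_filled = train.fillna(fill_values)   # a dict fills each column with its own value
test_filled = test.fillna(fill_values)     # test rows reuse the training statistics
print(fill_values)
```

The `SimpleImputer` class already imported at the top of the notebook follows the same fit-then-transform pattern and can be placed inside a `Pipeline`.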

### 2. Handling Outliers

In [None]:
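import numpy as np
# Handling Outliers & Outlier treatments
# Define a function to detect and handle outliers using z-score method
def handle_outliers_zscore(data, threshold=3):
    """Replace values whose absolute z-score exceeds `threshold` with the column mean."""
    # Population z-score (ddof=0), the same default as scipy.stats.zscore
    z_scores = ((data - data.mean()) / data.std(ddof=0)).abs()
    is_outlier = z_scores > threshold
    # Returning a filtered Series and assigning it back to the column would leave
    # NaN at the outlier positions, so replace those values explicitly instead
    return data.mask(is_outlier, data.mean())

# Apply outlier treatment to the 'value_for_money' column
air['value_for_money'] = handle_outliers_zscore(air['value_for_money'])
print(air['value_for_money'])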


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer 

The code snippet uses the z-score method to detect and handle outliers. By calculating the z-score for each data point and applying a threshold, outliers are identified and replaced with the mean value of the column. This technique is effective in reducing the impact of outliers on statistical analyses and ensuring more robust results.
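
The z-score rule described above can be illustrated on a small list without pandas. This is only a sketch: the function name `replace_outliers_with_mean` and the sample values are assumptions made for illustration. With only ten values, no point can exceed |z| = 3 under the population standard deviation, so the example lowers the threshold to 2.5.

```python
from statistics import mean, pstdev

def replace_outliers_with_mean(values, threshold=3.0):
    """Replace values whose absolute z-score exceeds the threshold with the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    if sigma == 0:
        # All values are identical, so there are no outliers
        return list(values)
    return [mu if abs((v - mu) / sigma) > threshold else v for v in values]

# Hypothetical sample with one extreme value
sample = [1, 2, 3, 2, 1, 2, 3, 2, 1, 50]
cleaned = replace_outliers_with_mean(sample, threshold=2.5)
print(cleaned)
```

Note that the replacement mean is computed with the outlier still included, so it is itself pulled toward the extreme value; using the median instead is a common alternative.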

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# Encode categorical columns using get_dummies()
encoded_data = pd.get_dummies(air, columns=['airline', 'traveller_type', 'cabin', 'route'])

# Print the encoded data
print(encoded_data.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer 

In the code snippet provided, the categorical encoding technique used is one-hot encoding using the `get_dummies()` function. One-hot encoding is chosen because it converts categorical variables into binary vectors, representing the presence or absence of a category. This technique allows for the inclusion of categorical data in machine learning models that require numerical inputs, without introducing ordinality or magnitude assumptions.
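
To make the "binary vector" idea concrete, here is a plain-Python sketch of what one-hot encoding computes conceptually. The helper `one_hot_encode` and the sample categories are hypothetical; the notebook itself relies on `pd.get_dummies()`.

```python
def one_hot_encode(values, prefix):
    """Return one dict per value with a 0/1 flag for every distinct category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

# Hypothetical cabin values
cabins = ['Economy', 'Business', 'Economy']
for row in one_hot_encode(cabins, prefix='cabin'):
    print(row)
```

Every distinct category becomes its own column, so a column with many distinct values (such as a route) can add a very large number of features.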

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
air.head()

In [None]:
# Custom contraction dictionary
contractions_dict = {
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
    # Add more contractions and their expanded forms as needed
}

def expand_contractions(text):
    words = text.split()
    expanded_words = [contractions_dict.get(word.lower(), word) for word in words]
    expanded_text = " ".join(expanded_words)
    return expanded_text

# Apply expand_contractions function to the 'customer_review' variable
air['customer_review'] = air['customer_review'].apply(expand_contractions)


This code defines a dictionary, `contractions_dict`, that maps contractions to their expanded forms. The expand_contractions() function splits the input text on whitespace, replaces each word found in the dictionary with its expanded form, and joins the words back into a text string. Because the lookup is done on whole whitespace-separated words, a contraction with punctuation attached (for example "can't,") is left unchanged.
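
One possible variant, sketched here under the assumption that contractions may be followed by punctuation, uses a regular expression with word boundaries instead of splitting on whitespace. The names `CONTRACTIONS` and `expand_contractions_regex` are illustrative and not part of the notebook.

```python
import re

# Same three sample contractions as in the dictionary above
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
}

# One pattern that matches any key as a whole word, ignoring case
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_regex(text):
    """Expand known contractions even when punctuation follows them."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions_regex("I can't, and won't."))
```

The expansion is always lower-case, and typographic apostrophes (’) are not handled by this sketch.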

#### 2. Lower Casing

In [None]:
# Lower Casing
air['customer_review'] = air['customer_review'].str.lower()

print(air['customer_review'])


In this code, the str.lower() method is applied to the 'customer_review' column, which converts all the text in that column to lower case.

#### 3. Removing Punctuations

In [None]:
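# Remove punctuations
import re
import string

# Escape the punctuation characters so they are treated literally inside the character class,
# and pass regex=True because newer pandas versions default to literal matching
air['customer_review'] = air['customer_review'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)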

print(air['customer_review'])


This code will remove all punctuations from the 'customer_review' column and print the updated column.
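
For a single string, the same effect can be sketched with a translation table instead of a regular expression. The helper name `strip_punctuation` and the sample sentence are assumptions for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Remove all ASCII punctuation characters from the text."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("Great flight!!! (Seat 12-A)"))
```

A translation table avoids regex escaping issues altogether, since no character in `string.punctuation` has a special meaning to `str.translate()`.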

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
import re

# Remove URLs (the dot after 'www' is escaped so it matches a literal '.')
air['customer_review'] = air['customer_review'].apply(lambda x: re.sub(r'http\S+|www\.\S+', '', x))

# Remove words and digits containing digits
air['customer_review'] = air['customer_review'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))

print(air['customer_review'])


This code uses regular expressions (re) to remove URLs and words/digits containing digits from the 'customer_review' column. The updated column will be printed.
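
A small worked example may help show what the two patterns remove. The function `clean_urls_and_digits` and the sample sentences are hypothetical; the sketch assumes URLs are still intact when it runs, that is, before punctuation has been stripped.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
DIGIT_WORD_RE = re.compile(r'\w*\d\w*')

def clean_urls_and_digits(text):
    """Drop URLs and any word containing a digit, then collapse extra spaces."""
    text = URL_RE.sub('', text)
    text = DIGIT_WORD_RE.sub('', text)
    # Removing tokens leaves double spaces behind, so normalise them
    return ' '.join(text.split())

print(clean_urls_and_digits("Flight BA123 delayed, see www.example.com/info for 2 details"))
```

Flight numbers and seat numbers disappear along with other digit-bearing words, which is worth keeping in mind if such tokens carry meaning for the model.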

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Get the set of stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from text
def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Apply stopwords removal to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(remove_stopwords)

print(air['customer_review'])



This code utilizes the NLTK library to download and access the stopwords list for the English language. It defines a function to remove stopwords from the text and applies it to the 'customer_review' column. The updated column will be printed.
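
The filtering step itself can be sketched without downloading anything. The tiny `SAMPLE_STOP_WORDS` set below is a hand-picked assumption used only for illustration; it is not the NLTK list.

```python
# Hypothetical, hand-picked stop words (NOT the NLTK English list)
SAMPLE_STOP_WORDS = {'the', 'was', 'and', 'a', 'to', 'of'}

def remove_stopwords_sketch(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return ' '.join(tok for tok in text.split() if tok.lower() not in stop_words)

print(remove_stopwords_sketch("The crew was friendly and the food was not good"))
```

The NLTK English list also contains negations such as "not" and "no", so removing it in full can change the sentiment of a review; keeping negations is a design choice worth considering for a recommendation model.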

In [None]:
# Remove White spaces
# Function to remove white spaces from text
def remove_white_spaces(text):
    cleaned_text = text.strip()
    return cleaned_text

# Apply white space removal to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(remove_white_spaces)

print(air['customer_review'])


This code defines a function that uses the strip() method to remove leading and trailing white spaces from the text. It then applies this function to the 'customer_review' column using the apply() method. The updated column will be printed.

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import re

# Function to rephrase text
def rephrase_text(text):
    # Match whole words only, so that words such as 'goodbye' or 'badge' are left untouched
    rephrased_text = re.sub(r'\bgood\b', 'great', text)
    rephrased_text = re.sub(r'\bbad\b', 'poor', rephrased_text)
    return rephrased_text

# Apply text rephrasing to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(rephrase_text)

print(air['customer_review'])


The function rephrase_text() replaces the word 'good' with 'great' and the word 'bad' with 'poor' using re.sub() with word boundaries, so only whole words are changed. The function is then applied to the 'customer_review' column using the apply() method. The updated column will be printed.






#### 7. Tokenization

In [None]:
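# Tokenization
# Download the tokenizer models if not already downloaded
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Function for tokenization
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    return tokens

# Store the tokens in a separate column ('review_tokens') so that 'customer_review'
# stays a string for the later steps that call text.split() or cast the column to str
air['review_tokens'] = air['customer_review'].apply(tokenize_text)

print(air['review_tokens'])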


#### 8. Text Normalization

In [None]:
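# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize each whitespace-separated word and join the results back into a string
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return lemmatized_text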


##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
def pos_tag_text(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Define your features (x) and target variable (y)
features = ['airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type',
            'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
            'ground_service', 'value_for_money']
target = 'recommended'

# Split your data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(air[features], air[target], test_size=0.2, random_state=42)


In [None]:
#shape of x_train and x_test data
print(x_train.shape)
print(x_test.shape)
     


In [None]:
#shape of y_train and y_test data
print(y_train.shape)
print(y_test.shape)

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

In [None]:
# Percentage of each label in the target variable
# (computed from the 'recommended' column directly, since no variable named y is defined)
label_percentages = air['recommended'].value_counts(normalize=True).mul(100).round()
print("The percentage of each label in the target variable is:")
print(label_percentages)


The percentages of both labels ('yes', 'no') are approximately equal, so no class-imbalance handling technique is needed.
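
The balance check can be sketched on any sequence of labels with the standard library alone. The helper `class_percentages` and the sample labels are illustrative assumptions, not values taken from the dataset.

```python
from collections import Counter

def class_percentages(labels):
    """Return the percentage share of every distinct label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical labels, not the real target column
labels = ['yes', 'no', 'yes', 'no', 'no']
print(class_percentages(labels))
```

If the shares were far apart, options such as stratified splitting, class weights, or resampling would be worth considering before training.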



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
print(air.dtypes)

In [None]:
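# Imports needed by this cell and by the model cells that follow
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define the features and target variable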
features = ['airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type',
            'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
            'ground_service', 'value_for_money']
target = 'recommended'

# Convert non-numeric values to strings in categorical columns
categorical_columns = ['airline', 'author', 'customer_review', 'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown']
for column in categorical_columns:
    air[column] = air[column].astype(str)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(air[features], air[target], test_size=0.2, random_state=42)

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                                      'ground_service', 'value_for_money']),
        ('cat', categorical_transformer, categorical_columns)])





In [None]:
# Model 1: Logistic Regression
logreg_model = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', LogisticRegression())])
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_pred)
print("Logistic Regression Accuracy:", logreg_accuracy)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score



# Define the features and target variable
features = ['airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type',
            'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
            'ground_service', 'value_for_money']
target = 'recommended'

# Convert non-numeric values to strings in categorical columns
categorical_columns = ['airline', 'author', 'customer_review', 'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown']
for column in categorical_columns:
    air[column] = air[column].astype(str)

# Drop rows with missing values
air.dropna(subset=features + [target], inplace=True)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(air[features], air[target], test_size=0.2, random_state=42)

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                                      'ground_service', 'value_for_money']),
        ('cat', categorical_transformer, categorical_columns)])

# Model 1: Logistic Regression
logreg_model = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', LogisticRegression())])
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_pred)
print("Logistic Regression Accuracy:", logreg_accuracy)

# Model 2: Random Forest Classifier
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', RandomForestClassifier())])
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)


In [None]:
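from sklearn.metrics import confusion_matrix, classification_report

# Define the Random Forest model inside the preprocessing pipeline,
# because the raw features contain string columns that must be encoded first
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())])

# Fit the Random Forest model
rf_model.fit(X_train, y_train)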

# Predict using the Random Forest model
rf_pred = rf_model.predict(X_test)

# Accuracy of Random Forest
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Accuracy - Random Forest:", rf_accuracy)

# Confusion matrix and classification report for Random Forest
rf_cm = confusion_matrix(y_test, rf_pred)
rf_cr = classification_report(y_test, rf_pred)

# Print confusion matrix and classification report
print("Confusion Matrix - Random Forest:")
print(rf_cm)
print("\nClassification Report - Random Forest:")
print(rf_cr)


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:

# Model 2: Random Forest Classifier
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', RandomForestClassifier())])
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)

In [None]:
# Plotting the confusion matrix for Random Forest
plt.figure(figsize=(6, 4))
sns.heatmap(rf_cm, annot=True, cmap='Blues', fmt='g')
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Classification report for Random Forest
print("Classification Report - Random Forest:")
print(rf_cr)

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

In [None]:

# Accuracy of each model
print("Accuracy - Logistic Regression:", logreg_accuracy)
print("Accuracy - Random Forest:", rf_accuracy)

In [None]:
# Accuracy of each model
model_names = ['Logistic Regression', 'Random Forest']
accuracies = [logreg_accuracy, rf_accuracy]

# Plot the accuracy
plt.bar(model_names, accuracies)
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Accuracy of Each Model')
plt.show()


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***