# **Project Name**    -  Airline Passenger Referral Prediction (Classification)



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1**    - Nikita Saxena


# **Project Summary -**

1. The project involves analyzing airline reviews spanning from 2006 to 2019 to predict whether passengers would recommend the airline to their friends. The dataset contains multiple choice and free text questions, providing valuable insights into customer opinions and preferences.

2. Key project steps include efficient exploratory data analysis to understand the data and prepare it for training, addressing class imbalance in the target feature to ensure accurate predictions, and selecting an appropriate machine learning algorithm to model passenger recommendations.

3. The project evaluates the model's performance while considering class imbalance, identifies important features that influence customer satisfaction and recommendations, and provides useful insights for stakeholders in the airline industry to enhance customer experience and make informed decisions.

# **GitHub Link -**

https://github.com/shadow9411111/AIRLINE

# **Problem Statement**


**The problem at hand is to analyze airline reviews collected between 2006 and 2019 and predict whether passengers will recommend the airline to their friends. The dataset includes various types of questions, allowing us to gain insights into customer opinions and preferences. The objective is to develop an effective model that can accurately predict passenger recommendations based on the provided data. By solving this problem, we aim to assist stakeholders in the airline industry in understanding customer satisfaction and improving their services to enhance overall customer experience.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML Model used and its performance using an Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Install geopandas before importing it
!pip install geopandas

# Import Libraries
import re
import string

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources needed by the tokenizer, stop-word list, and lemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

### Dataset Loading

In [None]:
# Load Dataset
air = pd.read_excel('/content/data_airline_reviews.xlsx')

### Dataset First View

In [None]:
# Dataset First Look
print(air)

In [None]:
#display top 5 data from the dataset
air.head()

In [None]:
#display last 5 data of the dataset
air.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

num_rows = air.shape[0]
print("NUMBER OF ROWS :- ", num_rows)
num_columns = air.shape[1]
print("NUMBER OF COLUMN :- ", num_columns)

### Dataset Information

In [None]:
# Dataset Info
air.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = air.duplicated().sum()

print("Number of duplicate values:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Count missing values
missing_count = air.isnull().sum()

print("Missing values count:")
print(missing_count)

In [None]:
# Visualizing the missing values

# Create a bar graph of missing values
plt.figure(figsize=(10, 6))
missing_count.plot(kind='bar', color='skyblue')
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Count')
plt.show()

### What did you know about your dataset?

Answer 

The dataset contains 131,895 rows and 17 columns, of which 70,711 rows are duplicates. The column names are 'airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money', and 'recommended'. Several columns have missing values, with the highest counts in 'aircraft', 'traveller_type', 'ground_service', and 'entertainment'. The dataset describes airline reviews: the overall rating, the review text and metadata, and ratings for individual aspects of the airline experience.
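The checks above (shape, duplicates, missing values) can be bundled into one helper. This is a minimal sketch, not part of the original notebook: `summarize_dataset` is a hypothetical name, and the toy frame stands in for `air`.

```python
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate-row count, and missing share per column."""
    # Percentage of missing values per column, largest first
    missing_pct = (df.isnull().mean() * 100).round(2).sort_values(ascending=False)
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_pct": missing_pct,
    }

# Hypothetical toy frame standing in for `air`
toy = pd.DataFrame({
    "airline": ["A", "A", "B", None],
    "overall": [7.0, 7.0, None, None],
})
summary = summarize_dataset(toy)
print(summary["rows"], summary["columns"], summary["duplicate_rows"])
print(summary["missing_pct"])
```

Reporting missing values as a percentage rather than a raw count makes columns easier to compare when the row count is large.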

## ***2. Understanding Your Variables***

In [None]:
# Get the column names
column_names = air.columns

# Print the column names
for column in column_names:
    print(column)

In [None]:
# Dataset Describe
air.describe()


### Variables Description 

*Answer* 
Here's a description of the variables in the 'air' dataset:

1. airline: The name or code of the airline.
2. overall: The overall rating given by the customer for their airline experience.
3. author: The author or reviewer of the airline review.
4. review_date: The date when the review was written.
5. customer_review: The text of the customer's review.
6. aircraft: The aircraft used for the flight.
7. traveller_type: The type of traveler (e.g., business, leisure, family).
8. cabin: The type of cabin or class (e.g., economy, business, first class).
9. route: The route or destination of the flight.
10. date_flown: The date when the flight was taken.
11. seat_comfort: The rating for seat comfort.
12. cabin_service: The rating for cabin service.
13. food_bev: The rating for food and beverages.
14. entertainment: The rating for in-flight entertainment.
15. ground_service: The rating for ground services (e.g., check-in, boarding).
16. value_for_money: The rating for value for money.
17. recommended: Whether the customer would recommend the airline to others (1 = Yes, 0 = No).
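The description lists `recommended` as 1 = Yes, 0 = No. If the raw column instead stores text labels such as "yes"/"no" (an assumption, since the cell outputs are not shown here), a mapping along these lines could produce that encoding. `encode_recommended` is a hypothetical helper.

```python
import pandas as pd

def encode_recommended(series: pd.Series) -> pd.Series:
    """Map 'yes'/'no' text labels to 1/0; anything else becomes missing."""
    # Normalize case and surrounding whitespace before mapping
    cleaned = series.astype(str).str.strip().str.lower()
    return cleaned.map({"yes": 1, "no": 0}).astype("Int64")

# Hypothetical sample of raw labels
raw = pd.Series(["yes", " No", None, "YES"])
encoded = encode_recommended(raw)
print(encoded)
```

The nullable `Int64` dtype keeps missing labels as missing instead of silently turning them into a class.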


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in air.columns:
    unique_values = air[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print()

In [None]:

# Count the unique values in each column
air.nunique()
     

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Fill missing categorical values with explicit placeholders.
# Assigning the result back avoids pandas' chained-assignment warning for inplace=True.
air['aircraft'] = air['aircraft'].fillna('Unknown')
air['traveller_type'] = air['traveller_type'].fillna('Not Specified')

# Convert date columns to datetime format
air['date_flown'] = pd.to_datetime(air['date_flown'])

# Encoding categorical variables
df_encoded = pd.get_dummies(air, columns=['airline', 'cabin', 'route'])

# Print the updated dataset
print(df_encoded.head())

### What all manipulations have you done and insights you found?

Answer 

Let's break it down:

1. Handling Missing Values:
   - `air['aircraft'] = air['aircraft'].fillna('Unknown')`: This line replaces the missing values in the 'aircraft' column with the string 'Unknown', so the column has no missing values left.
   - `air['traveller_type'] = air['traveller_type'].fillna('Not Specified')`: This line replaces the missing values in the 'traveller_type' column with the string 'Not Specified'.

2. Convert Date Columns to Datetime Format:
   - `air['date_flown'] = pd.to_datetime(air['date_flown'])`: This line converts the 'date_flown' column to datetime format using the `pd.to_datetime()` function from pandas, which makes date-based filtering, sorting, and grouping possible.

3. Encoding Categorical Variables:
   - `df_encoded = pd.get_dummies(air, columns=['airline', 'cabin', 'route'])`: This line one-hot encodes the categorical variables 'airline', 'cabin', and 'route'. The `pd.get_dummies()` function creates one binary indicator column per category, so these variables can be used in modeling. Note that a column with many distinct values (for example 'route') adds one column per value.

4. Print the Updated Dataset:
   - `print(df_encoded.head())`: This line prints the first rows of the updated dataset (`df_encoded`), showing the filled missing values and the encoded categorical variables.

Overall, these data manipulations ensure that missing values are handled, date columns are in the appropriate format, and categorical variables are encoded for analysis. The printed output provides a glimpse of the updated dataset, ready for further analysis.
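Two optional refinements to the wrangling step are sketched below: dropping duplicate rows before encoding, and keeping only the most frequent categories when one-hot encoding a column with many distinct values. This is an illustrative sketch rather than the notebook's own method; `one_hot_top_n`, its `n` and `other` parameters, and the toy data are all hypothetical.

```python
import pandas as pd

def one_hot_top_n(df: pd.DataFrame, column: str, n: int = 10,
                  other: str = "Other") -> pd.DataFrame:
    """One-hot encode `column`, keeping the n most frequent categories.

    Every remaining value (including missing ones) is pooled into `other`.
    """
    top = df[column].value_counts().head(n).index
    pooled = df[column].where(df[column].isin(top), other)
    return pd.get_dummies(df.assign(**{column: pooled}), columns=[column])

# Hypothetical toy data standing in for `air`
toy = pd.DataFrame({
    "route": ["A-B", "A-B", "C-D", "E-F", "A-B", "C-D"],
    "overall": [8, 8, 3, 5, 9, 4],
})
deduped = toy.drop_duplicates()          # removes the repeated ("A-B", 8) row
encoded = one_hot_top_n(deduped, "route", n=2)
print(encoded.columns.tolist())
```

Pooling rare categories keeps the encoded frame narrow, at the cost of losing detail about infrequent values.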

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Question: What is the distribution of passenger recommendations?
sns.countplot(x='recommended', data=air)
plt.title('Distribution of Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Count')
plt.show()


Visualized Solution: This bar plot shows the number of passengers who would recommend the airline (1) versus those who would not (0). It helps understand the balance of recommendations and identify any class imbalance issues.

##### 1. Why did you pick the specific chart?

Answer 


- The specific chart chosen is a countplot.
- A countplot is suitable to visualize the distribution of passenger recommendations.
- The countplot helps understand the frequency or count of different categories in a categorical variable.
- The 'recommended' variable represents the recommendation status of passengers.
- The countplot allows easy comparison of the number of passengers who recommended and did not recommend.
- It provides a clear understanding of the distribution of passenger recommendations.

##### 2. What is/are the insight(s) found from the chart?

Answer 

The insights found from the chart are:

- The chart shows the distribution of passenger recommendations.
- The countplot reveals the frequency or count of passengers who recommended and did not recommend.
- The insight obtained is the proportion or count of passengers falling into each category of recommendation.
- By analyzing the countplot, it is possible to determine the relative popularity or preference for recommendations among passengers.
- The chart provides an overview of the distribution of passenger recommendations and can be used to identify any imbalances or patterns in the data.

##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 
The gained insights can potentially help create a positive business impact. Here's the justification:

Positive Business Impact:
- Understanding the distribution of passenger recommendations allows the business to gauge the level of customer satisfaction and loyalty.
- If the countplot shows a high proportion of passengers recommending the service, it indicates a positive customer experience. This can lead to increased customer retention, positive word-of-mouth, and potentially attract new customers.
- Positive recommendations can contribute to the growth of the business as satisfied customers are more likely to become repeat customers and promote the business to others.

Insights Leading to Negative Growth:
- If the countplot reveals a low proportion of passengers recommending the service, it suggests a negative customer experience or dissatisfaction.
- A significant number of passengers not recommending the service may lead to negative impacts on the business, such as decreased customer retention, lower customer acquisition, and potential damage to the brand reputation.
- It signals the need for the business to identify and address the issues causing dissatisfaction and make improvements to enhance customer experience and satisfaction.

In summary, the gained insights from the chart can help create a positive business impact by identifying satisfied customers who can contribute to business growth. Conversely, if the insights indicate a low proportion of recommendations, it highlights the need for improvements to avoid negative growth and address customer dissatisfaction.
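The class balance that the countplot shows can also be read off as numbers. The sketch below is a minimal example on made-up labels; `class_balance` is a hypothetical helper and the values are not taken from the dataset.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and share of each class, ignoring missing labels."""
    counts = target.value_counts(dropna=True)
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})

# Hypothetical labels standing in for air['recommended']
labels = pd.Series(["yes", "no", "no", "yes", "no", None])
balance = class_balance(labels)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = balance["count"].max() / balance["count"].min()
print(balance)
print(imbalance_ratio)
```

A ratio well above 1 would suggest stratified splitting or class weighting during model training.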

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Question: How does seat comfort impact passenger recommendations?
sns.boxplot(x='recommended', y='seat_comfort', data=air)
plt.title('Seat Comfort vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Seat Comfort')
plt.show()


Visualized Solution: This box plot compares the distribution of seat comfort scores for passengers who would recommend the airline (1) versus those who would not (0). It helps assess whether higher seat comfort ratings are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer 


1. Comparison of Categorical and Numeric Variables: The chart compares seat comfort (an ordinal rating treated here as numeric) across passenger recommendations (a categorical variable). A boxplot is an effective choice for visualizing this type of relationship.

2. Summary of Distribution: A boxplot provides a concise summary of the distribution of the seat comfort ratings for each category of passenger recommendations. It presents information about the median, quartiles, and potential outliers, allowing for easy comparison.

3. Identification of Differences: The boxplot enables the identification of any differences or patterns in seat comfort ratings between the two categories of passenger recommendations. It allows for quick visual comparison and evaluation of the impact of seat comfort on passenger recommendations.

4. Handling Outliers: Boxplots are particularly useful for identifying outliers, which are data points that fall significantly outside the typical range. Outliers can provide valuable insights into extreme or unusual cases that might affect the relationship between seat comfort and passenger recommendations.

5. Clear Visualization: The boxplot provides a clear visual representation of the data, making it easy to interpret and understand. The plot includes labeled axes for seat comfort and recommendations, and a title to clearly indicate the purpose of the visualization.

Overall, the choice of a boxplot for this analysis allows for a meaningful comparison of seat comfort and passenger recommendations, highlighting any differences in seat comfort ratings between recommended and not recommended flights.

##### 2. What is/are the insight(s) found from the chart?

Answer 
The insights that can be derived from the provided chart comparing seat comfort and passenger recommendations are as follows:

1. Seat Comfort and Recommendations: The chart helps to understand the relationship between seat comfort and passenger recommendations. It allows us to observe how different levels of seat comfort impact the likelihood of a passenger recommending the airline or flight experience.

2. Median Seat Comfort: The median line in the boxplot represents the central tendency of the seat comfort ratings for each category of passenger recommendations. By comparing the medians, we can identify if there is a noticeable difference in seat comfort between recommended and not recommended flights.

3. Interquartile Range (IQR): The box portion of the plot represents the interquartile range, which provides insights into the spread or variability of seat comfort ratings. A larger IQR suggests a wider range of seat comfort experiences within a particular recommendation category.

4. Outliers: The plot also displays any outliers, which are data points that fall significantly outside the typical range. Outliers can indicate exceptional seat comfort experiences that may influence passenger recommendations. The presence of outliers can help identify extreme cases or potential areas for improvement.

By examining these aspects of the chart, we can gain insights into the impact of seat comfort on passenger recommendations. It may reveal if higher seat comfort ratings tend to result in more positive recommendations or if there are any notable discrepancies in seat comfort experiences between recommended and not recommended flights.
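The comparison of medians described above can be checked numerically alongside the boxplot. This is a sketch on made-up ratings, assuming `recommended` is already encoded as 0/1; `rating_by_recommendation` is a hypothetical helper.

```python
import pandas as pd

def rating_by_recommendation(df: pd.DataFrame, rating_col: str,
                             target_col: str = "recommended") -> pd.DataFrame:
    """Median, mean, and count of a rating for each recommendation class."""
    valid = df[[rating_col, target_col]].dropna()
    return valid.groupby(target_col)[rating_col].agg(["median", "mean", "count"])

# Hypothetical toy data standing in for `air`
toy = pd.DataFrame({
    "seat_comfort": [1, 2, 2, 4, 5, 5, None],
    "recommended": [0, 0, 0, 1, 1, 1, 1],
})
summary = rating_by_recommendation(toy, "seat_comfort")

# Pearson correlation against a 0/1 target is the point-biserial correlation
valid = toy.dropna()
r = valid["seat_comfort"].corr(valid["recommended"])
print(summary)
print(round(r, 3))
```

A clear gap between the group medians, together with a correlation far from zero, would support the reading that seat comfort and recommendations move together.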

##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

The gained insights from the chart comparing seat comfort and passenger recommendations can potentially have a positive business impact. Here's why:

Positive Business Impact:
1. Identifying Key Driver: If the chart reveals that higher seat comfort ratings are strongly associated with positive passenger recommendations, it highlights the importance of prioritizing and improving seat comfort as a key driver for customer satisfaction. By addressing seat comfort issues, airlines can enhance the overall passenger experience, potentially leading to positive word-of-mouth recommendations and increased customer loyalty.

2. Competitive Advantage: If the analysis demonstrates that the airline has higher seat comfort ratings compared to competitors, it can serve as a unique selling point and provide a competitive advantage in the market. This can attract more passengers who prioritize comfort and lead to increased market share and revenue.

Insights Leading to Negative Growth:
While the chart itself may not directly lead to negative growth, certain insights from the analysis could potentially indicate areas of concern. For example:

1. Lower Seat Comfort Ratings: If the chart shows that seat comfort ratings are consistently low across both recommended and not recommended flights, it indicates a problem with overall seat comfort quality. This could result in negative passenger experiences, dissatisfaction, and potentially lead to negative word-of-mouth recommendations, impacting business growth adversely.

2. Wide Variability and Outliers: If the chart displays a wide variability in seat comfort ratings within either the recommended or not recommended category, or if there are numerous outliers indicating extreme negative seat comfort experiences, it suggests inconsistent or subpar seat comfort standards. Such findings may highlight the need for improvement to ensure consistent and satisfactory seat comfort across all flights, thereby avoiding negative customer experiences and potential negative growth.

It's important to note that the specific insights gained from the chart will depend on the data and analysis performed. The potential positive or negative business impact will also vary based on the airline's specific context, market dynamics, and the actions taken in response to the insights.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Question: What is the distribution of traveller types among passengers?
sns.countplot(x='traveller_type', data=air)
plt.title('Distribution of Traveller Types')
plt.xlabel('Traveller Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


Visualized Solution: This bar plot displays the number of passengers in different traveller types. It provides insights into the composition of passengers and helps identify the dominant traveller types.

##### 1. Why did you pick the specific chart?

Answer 

I picked the "countplot" chart for visualizing the distribution of traveller types among passengers because it is a suitable choice for displaying the frequency/count of different categories in a categorical variable. In this case, we want to observe the count of each traveller type, making it easy to compare and understand the distribution of the data.

##### 2. What is/are the insight(s) found from the chart?

Answer 
From the countplot chart showing the distribution of traveller types among passengers, we can gain the following insights:

1. The chart provides a visual representation of the frequency or count of each traveller type.
2. We can observe which traveller types are more prevalent among the passengers.
3. The chart helps in identifying the dominant or most common traveller type.
4. It allows for a quick comparison of the counts of different traveller types.
5. It can highlight any imbalances or disparities in the distribution of traveller types.

By analyzing the chart, we can better understand the composition of passengers based on their traveller types. This information can be useful for targeted marketing strategies, personalized services, or understanding customer preferences and needs.

##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 
The gained insights from the chart can potentially help create a positive business impact in several ways:

Targeted Marketing: By understanding the distribution of traveller types, businesses can tailor their marketing strategies to cater to the specific needs and preferences of different types of passengers. This can lead to more effective and targeted marketing campaigns, resulting in increased customer engagement and conversion rates.

Personalized Services: Knowing the dominant traveller types among passengers allows businesses to offer personalized services and experiences that cater to their specific preferences. This can enhance customer satisfaction and loyalty, leading to positive word-of-mouth recommendations and repeat business.

Customer Segmentation: The insights gained from the chart can facilitate customer segmentation based on traveller types. This segmentation can help businesses identify high-value customer segments and develop customized offerings or loyalty programs for each segment, resulting in increased customer retention and revenue growth.
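To act on segments like these, it can help to see how the recommendation rate differs by traveller type. The sketch below uses `pd.crosstab` on made-up data and assumes `recommended` is encoded as 0/1; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for `air`
toy = pd.DataFrame({
    "traveller_type": ["Solo Leisure", "Solo Leisure", "Business",
                       "Business", "Couple Leisure", "Solo Leisure"],
    "recommended": [1, 0, 1, 0, 1, 1],
})

# Row-normalized table: each row sums to 1, so column 1 is the recommendation rate
table = pd.crosstab(toy["traveller_type"], toy["recommended"], normalize="index")
print(table)
```

Normalizing by row removes the effect of segment size, so a small segment with a low recommendation rate is not hidden by larger ones.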




#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Question: How does cabin service influence passenger recommendations?
sns.boxplot(x='recommended', y='cabin_service', data=air)
plt.title('Cabin Service vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Cabin Service')
plt.show()


Visualized Solution: This box plot illustrates the relationship between cabin service ratings and passenger recommendations. It helps determine whether higher cabin service scores are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer

I chose a boxplot to visualize the relationship between cabin service and passenger recommendations because it allows us to compare the distribution of cabin service ratings for different recommendation categories. A boxplot displays the median, quartiles, and outliers of a continuous variable, making it suitable for showing the spread and central tendency of cabin service ratings based on passenger recommendations. This helps in understanding any potential relationship or pattern between cabin service and passenger recommendations.

##### 2. What is/are the insight(s) found from the chart?

Answer


From the chart, the insights that can be derived are:

Higher cabin service ratings tend to be associated with higher passenger recommendations. This is indicated by the higher median and upper quartile of cabin service ratings for the recommended category compared to the not recommended category.

The recommended category shows less variability in cabin service ratings compared to the not recommended category. This is evident from the narrower range between the upper and lower quartiles in the recommended category boxplot.

There are outliers present in both the recommended and not recommended categories, indicating that there are some cases where passengers have different experiences with cabin service that deviate from the general trend.
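The quartile spread and outliers mentioned above can be computed directly. This sketch applies the common 1.5 × IQR whisker rule within each recommendation group; `iqr_outliers` is a hypothetical helper and the ratings are made up.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Boolean mask of points more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Hypothetical toy data standing in for `air`
toy = pd.DataFrame({
    "cabin_service": [4, 5, 5, 4, 5, 1, 1, 2, 3, 1],
    "recommended":   [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Flag outliers separately within each recommendation group
flags = pd.concat(
    [iqr_outliers(group) for _, group in toy.groupby("recommended")["cabin_service"]]
).sort_index()
print(toy[flags])
```

Here the flagged row is a passenger who recommended the airline despite a very low cabin service rating, the kind of case worth reading in the review text.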



##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*Answer* 

The gained insights can potentially help create a positive business impact. Here's how:

Positive impact: The insight that higher cabin service ratings are associated with higher passenger recommendations suggests that improving and maintaining high-quality cabin service can lead to increased passenger satisfaction and positive word-of-mouth recommendations. This, in turn, can enhance the airline's reputation, attract more customers, and potentially increase customer loyalty and repeat business.

Negative impact: While the chart does not explicitly indicate any insights that lead to negative growth, it is important to consider the outliers present in both the recommended and not recommended categories. These outliers represent cases where passengers had significantly different experiences with cabin service. It is crucial for the airline to investigate and address these outliers, as negative experiences could potentially harm the airline's reputation and lead to negative reviews or recommendations. Identifying and resolving any issues or inconsistencies in cabin service can help mitigate the risk of negative growth.



#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Question: What is the distribution of airline ratings?
sns.histplot(air['overall'], bins=10, kde=True)
plt.title('Distribution of Airline Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


Visualized Solution: This histogram displays the distribution of overall airline ratings. It provides an overview of the rating distribution and highlights any significant patterns or outliers.

##### 1. Why did you pick the specific chart?

Answer 

I chose a histogram to visualize the distribution of airline ratings because it provides an overview of the frequency or count of ratings in different rating intervals. Histograms are suitable for analyzing the distribution of a single variable, such as airline ratings, by dividing the range of ratings into bins and showing the count or density of ratings within each bin. This helps in understanding the overall pattern and shape of the distribution of ratings and identifying any peaks, clusters, or gaps in the data.




##### 2. What is/are the insight(s) found from the chart?

Answer 

From the chart, the insights that can be derived are:

1. The distribution of airline ratings is centered around a specific rating value. The peak or mode of the distribution indicates the most frequent rating given by passengers.

2. The majority of the ratings fall within a certain range. The spread or width of the distribution shows the range of ratings given by passengers.

3. The shape of the distribution can provide additional insights. For example, a symmetrical distribution with a bell-shaped curve suggests a balanced distribution of ratings, while a skewed distribution may indicate a bias towards higher or lower ratings.

4. The presence of multiple peaks or clusters in the distribution may suggest the existence of distinct groups or subpopulations with different preferences or experiences.
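The mode, center, and skew described in these points can be quantified with a few pandas calls. This is a minimal sketch on made-up ratings; `describe_shape` is a hypothetical helper.

```python
import pandas as pd

def describe_shape(ratings: pd.Series) -> dict:
    """Mode(s), median, and skewness of a rating column, ignoring missing values."""
    clean = ratings.dropna()
    return {
        "mode": clean.mode().tolist(),   # may contain several values if tied
        "median": clean.median(),
        "skew": clean.skew(),            # > 0: long tail towards high ratings
    }

# Hypothetical ratings standing in for air['overall']
toy_overall = pd.Series([1, 1, 1, 2, 3, 8, 9, 10, None])
shape = describe_shape(toy_overall)
print(shape)
```

Skewness summarizes the whole distribution in one number and can hide a two-peaked shape, so it complements the histogram rather than replacing it.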


##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Question: How does value for money impact passenger recommendations?
sns.violinplot(x='recommended', y='value_for_money', data=air)
plt.title('Value for Money vs. Passenger Recommendations')
plt.xlabel('Recommended')
plt.ylabel('Value for Money')
plt.show()


Visualized Solution: This violin plot compares the distribution of value for money ratings for passengers who would recommend the airline (1) versus those who would not (0). It helps assess whether higher value for money scores are associated with more positive recommendations.

##### 1. Why did you pick the specific chart?

Answer 

I chose a violin plot to visualize the relationship between value for money and passenger recommendations because it provides insights into both the distribution and density of the data. A violin plot displays the distribution of a continuous variable (value for money) for different categories (recommended and not recommended) in a mirrored density plot format. It allows us to compare the shape, spread, and central tendency of the value for money ratings for each recommendation category. This helps in understanding the impact of value for money on passenger recommendations and identifying any patterns or differences between the two categories.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Value for money ratings appear strongly associated with passenger recommendations. In the violin plot, the density for the recommended category is concentrated at the higher rating values, while the not recommended category has more of its density at the lower ratings.

The distribution of value for money ratings is wider in the not recommended category compared to the recommended category. This suggests that passengers who do not recommend the airline may have more varied opinions regarding value for money, with some rating it lower and some rating it higher.

The shape of the density curves provides additional insights. If the density curves are skewed towards higher ratings for the recommended category, it indicates a more positive perception of value for money among passengers who recommend the airline. Conversely, if the density curves are more evenly distributed or skewed towards lower ratings for the not recommended category, it suggests a less favorable perception of value for money among passengers who do not recommend the airline.
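One way to put numbers on the pattern in the violin plot is the recommendation rate at each rating level. The sketch below assumes `recommended` is encoded as 0/1 and uses made-up data; `recommend_rate_by_rating` is a hypothetical helper.

```python
import pandas as pd

def recommend_rate_by_rating(df: pd.DataFrame, rating_col: str,
                             target_col: str = "recommended") -> pd.DataFrame:
    """Share of recommending passengers and sample size at each rating level."""
    valid = df[[rating_col, target_col]].dropna()
    return valid.groupby(rating_col)[target_col].agg(rate="mean", n="count")

# Hypothetical toy data standing in for `air`
toy = pd.DataFrame({
    "value_for_money": [1, 1, 2, 3, 4, 5, 5, 5],
    "recommended":     [0, 0, 0, 1, 1, 1, 1, 0],
})
rates = recommend_rate_by_rating(toy, "value_for_money")
print(rates)
```

Keeping the sample size `n` next to each rate shows which rating levels have too few reviews to trust.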

##### 3. Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

Positive impact: The insight that higher value for money ratings are associated with positive passenger recommendations suggests that focusing on providing good value for money can lead to increased customer satisfaction and positive word-of-mouth recommendations. This can enhance the airline's reputation, attract more customers, and potentially increase customer loyalty and business growth.

Identifying improvement areas: If the distribution of value for money ratings is skewed towards lower ratings or exhibits a wider spread in the not recommended category, it indicates potential areas for improvement. These insights can help the business identify specific aspects of value for money, such as pricing, service offerings, or ancillary benefits, that may be falling short of customer expectations. Addressing these areas can lead to improved value perception, increased customer satisfaction, and positive business impact.

Negative impact: If the chart shows a clear pattern of lower value for money ratings in the not recommended category or a significant overlap of ratings between the two categories, it suggests that value for money is a critical factor impacting negative recommendations. This insight could lead to negative growth if not addressed effectively. It indicates that the airline may be perceived as offering poor value compared to competitors or customer expectations, which can result in negative reviews, decreased customer loyalty, and potential loss of business.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Question: What is the distribution of recommended airlines by aircraft type?
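plt.figure(figsize=(12, 6))
# Keep only the 20 most frequent aircraft types
top20_aircraft = air['aircraft'].value_counts().nlargest(20).index
air_top20 = air[air['aircraft'].isin(top20_aircraft)]

sns.countplot(x='aircraft', hue='recommended', data=air_top20, order=top20_aircraft)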
plt.title('Distribution of Recommended Airlines by Aircraft Type (Top 20)')
plt.xlabel('Aircraft Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()




Visualized Solution: This grouped bar plot shows the distribution of recommendations by aircraft type. It provides insights into which aircraft types are associated with a higher number of positive recommendations.

##### 1. Why did you pick the specific chart?

Answer 

I chose a countplot with hue to visualize the distribution of recommended airlines by aircraft type because it allows for the comparison of the count or frequency of recommended and not recommended airlines within each aircraft type. A countplot is suitable for categorical data, such as aircraft type, and the hue parameter enables us to distinguish and compare the distribution of recommendations within each category. This chart helps in understanding the relationship between aircraft type and passenger recommendations, providing insights into which aircraft types are more commonly recommended and which ones are not.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Distribution of recommendations by aircraft type: The countplot shows the distribution of recommended and not recommended airlines within each aircraft type. It allows us to compare the counts and visualize which aircraft types have a higher number of recommendations and which ones have a lower number.

Popular aircraft types: The chart highlights the aircraft types that receive a higher number of recommendations. These aircraft types have a larger count in the "recommended" category, indicating a higher likelihood of positive passenger recommendations.

Variations across aircraft types: The chart also reveals variations in the distribution of recommendations across different aircraft types. Some aircraft types may have a more balanced distribution, with a relatively similar count in both "recommended" and "not recommended" categories, while others may show a significant skew towards one category.

Relationship between aircraft type and recommendations: The chart provides insights into the potential influence of aircraft type on passenger recommendations. It helps identify whether certain aircraft types are more likely to receive positive recommendations, potentially indicating the importance of factors such as comfort, amenities, or passenger experience associated with specific aircraft types.
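
Raw counts favor aircraft types that simply appear in more reviews, so it can help to look at the share of positive recommendations as well. Below is a minimal sketch, assuming `recommended` holds 'yes'/'no' labels; the aircraft names and values are placeholders, not results from the dataset.

```python
import pandas as pd

# Hypothetical toy data; aircraft names are placeholders
toy = pd.DataFrame({
    "aircraft": ["A320"] * 6 + ["B777"] * 2 + ["A380"] * 3,
    "recommended": ["yes", "yes", "yes", "no", "no", "no"]
                   + ["yes", "yes"]
                   + ["yes", "no", "no"],
})

# Absolute counts (what the countplot shows) and row-wise shares
counts = pd.crosstab(toy["aircraft"], toy["recommended"])
shares = pd.crosstab(toy["aircraft"], toy["recommended"], normalize="index")

# Rank aircraft types by the share of positive recommendations
ranking = shares["yes"].sort_values(ascending=False)
print(counts)
print(ranking)
```

In this toy example the type with the most "yes" reviews is not the one with the highest share of "yes" reviews, which is the kind of difference a count-only chart can hide.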



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

The gained insights can help create a positive business impact. Here are three key points:

Positive impact: Understanding the distribution of recommended airlines by aircraft type can help identify which aircraft types are more commonly associated with positive passenger recommendations. This information allows airlines to leverage the positive perception of these aircraft types in their marketing efforts, potentially attracting more customers and increasing customer satisfaction.

Identifying improvement areas: If certain aircraft types have a higher count in the "not recommended" category, it suggests that passengers may have a less favorable perception or experience with those specific aircraft types. This insight can help airlines identify areas for improvement, such as addressing issues related to comfort, amenities, or overall passenger experience associated with those aircraft types. Taking steps to enhance these areas can lead to improved customer satisfaction and potentially reverse negative recommendations.

Negative impact: If the chart shows a significant skew towards the "not recommended" category for multiple aircraft types, it indicates a broader issue affecting passenger recommendations. This insight could lead to negative growth if not addressed effectively. It suggests that there may be systemic issues related to the fleet or overall customer experience that need to be addressed urgently to prevent further negative recommendations and potential loss of business.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Question: How does the date of the review impact passenger recommendations?
air['review_date'] = pd.to_datetime(air['review_date'])
air['year'] = air['review_date'].dt.year

sns.countplot(x='year', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend(title='Recommended')
plt.xticks(rotation=45)
plt.show()



Visualized Solution: This grouped bar plot displays the distribution of passenger recommendations over the years. It helps identify any trends or changes in recommendations over time.

##### 1. Why did you pick the specific chart?

Answer

I chose a countplot with hue to visualize the distribution of passenger recommendations by year because it allows for the comparison of the count or frequency of recommended and not recommended reviews within each year. A countplot is suitable for categorical data, such as the year of the review, and the hue parameter enables us to distinguish and compare the distribution of recommendations within each category. This chart helps in understanding how passenger recommendations have evolved over time and whether there are any noticeable trends or patterns based on the year of the review.

##### 2. What is/are the insight(s) found from the chart?

*Answer* 

Changes in passenger recommendations over time: The countplot reveals the distribution of passenger recommendations categorized by the year of the review. It provides insights into how passenger recommendations have changed over time. By comparing the counts of recommended and not recommended reviews for each year, it is possible to identify any shifts or trends in passenger sentiment and satisfaction.

Identification of specific years with notable patterns: The chart helps identify specific years that show distinct patterns in passenger recommendations. It can highlight years where there is a significant difference in the count of recommended and not recommended reviews, indicating periods of either positive or negative trends in customer satisfaction. These specific years may require further investigation to understand the underlying factors that influenced passenger recommendations during those timeframes.
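
Because the number of reviews differs from year to year, a trend is easier to read from the yearly share of positive recommendations than from the counts alone. This is a small sketch under the assumption that `recommended` is encoded as 1/0; the dates and labels are invented for illustration.

```python
import pandas as pd

# Hypothetical toy reviews
toy = pd.DataFrame({
    "review_date": ["2017-03-01", "2017-08-15", "2018-01-20",
                    "2018-06-05", "2018-11-30", "2019-02-14"],
    "recommended": [1, 0, 1, 1, 0, 0],
})
toy["year"] = pd.to_datetime(toy["review_date"]).dt.year

# Share of positive recommendations and number of reviews per year
yearly = toy.groupby("year")["recommended"].agg(
    share_recommended="mean", n_reviews="size"
)
print(yearly)
```

Keeping `n_reviews` next to the share makes it visible when a year's share rests on very few reviews.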

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 


The gained insights can help create a positive business impact. Here's how:

Positive impact: By understanding the changes in passenger recommendations over time, airlines can identify periods of positive trends and leverage those insights to reinforce and enhance the factors that contributed to positive recommendations. This can include identifying successful initiatives, improvements in customer experience, or changes in service offerings that led to higher satisfaction and positive reviews. Replicating and building upon these successful strategies can result in increased customer satisfaction, positive word-of-mouth, and potential business growth.

Identifying potential issues: If the chart reveals specific years with a notable increase in not recommended reviews or a decline in positive recommendations, it signals potential issues or challenges that need to be addressed. These insights can prompt airlines to investigate the underlying reasons for the negative trends, such as service quality, customer satisfaction, or changes in market conditions. Addressing these issues promptly and effectively can help mitigate negative growth, improve customer satisfaction, and regain positive recommendations.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Question: What is the distribution of passenger recommendations by cabin class?
sns.countplot(x='cabin', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Cabin Class')
plt.xlabel('Cabin Class')
plt.ylabel('Count')
plt.legend(title='Recommended')
plt.show()


Visualized Solution: This grouped bar plot showcases the distribution of passenger recommendations across different cabin classes. It helps assess whether certain cabin classes are more likely to receive positive recommendations.

##### 1. Why did you pick the specific chart?

I chose a countplot with hue to visualize the distribution of passenger recommendations by cabin class because it allows for the comparison of the count or frequency of recommended and not recommended reviews within each cabin class. A countplot is suitable for categorical data, such as the cabin class, and the hue parameter enables us to distinguish and compare the distribution of recommendations within each category. This chart helps in understanding how passenger recommendations vary across different cabin classes and whether there are any notable differences in satisfaction levels based on the cabin class.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Difference in recommendations by cabin class: The countplot reveals the distribution of passenger recommendations categorized by cabin class. It provides insights into how passenger recommendations vary across different cabin classes. By comparing the counts of recommended and not recommended reviews for each cabin class, it is possible to identify which cabin classes have a higher number of positive recommendations and which ones have a higher number of negative recommendations.

Identification of preferred cabin classes: The chart helps identify the cabin classes that receive a higher number of recommendations. Cabin classes with a larger count in the "recommended" category indicate a higher likelihood of positive passenger recommendations. This information can help airlines understand which cabin classes are preferred by customers and focus on maintaining or improving the quality of those cabin classes to enhance customer satisfaction and positive recommendations.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

Positive impact: By understanding the distribution of passenger recommendations by cabin class, airlines can identify the cabin classes that receive a higher number of positive recommendations. This information allows airlines to focus on maintaining and improving the quality of those preferred cabin classes. By delivering exceptional experiences and ensuring customer satisfaction in these cabin classes, airlines can enhance customer loyalty, attract more customers, and generate positive recommendations. This can lead to increased business and positive growth.

Addressing negative growth: If the chart reveals certain cabin classes with a higher count in the "not recommended" category, it highlights potential areas for improvement. This insight can prompt airlines to investigate the reasons behind negative recommendations in those cabin classes, such as issues related to comfort, amenities, or service quality. By addressing these shortcomings and making necessary improvements, airlines can mitigate negative growth, improve customer satisfaction, and turn the negative recommendations into positive ones.



#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Question: How does the route affect passenger recommendations?
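plt.figure(figsize=(12, 6))
# Keep only the 20 most frequent routes
top20_routes = air['route'].value_counts().nlargest(20).index
sns.countplot(x='route', hue='recommended', data=air[air['route'].isin(top20_routes)], order=top20_routes)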
plt.title('Distribution of Passenger Recommendations by Route (Top 20)')
plt.xlabel('Route')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()



Visualized Solution: This grouped bar plot demonstrates the distribution of passenger recommendations based on different routes. It provides insights into which routes have a higher number of positive recommendations.



##### 1. Why did you pick the specific chart?

Answer 

I picked a countplot with hue to visualize the distribution of passenger recommendations by route because it allows for the comparison of the count or frequency of recommended and not recommended reviews within each route. This chart is suitable for analyzing categorical data, such as routes, and the hue parameter helps distinguish and compare the distribution of recommendations within each route.

##### 2. What is/are the insight(s) found from the chart?

Answer

From the chart, we can observe the distribution of passenger recommendations for the top 20 routes. The insight derived is the varying impact of routes on passenger recommendations, indicating that some routes may have a higher likelihood of receiving positive recommendations compared to others.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

The gained insights can help create a positive business impact by identifying routes that are more likely to receive positive recommendations. This information can be used to focus resources on improving customer experience and satisfaction on those routes, potentially leading to increased customer loyalty and positive word-of-mouth.

However, without specific information about the insights gained, it is not possible to determine if there are any insights that lead to negative growth. The impact of insights on business growth can vary depending on the specific findings and the actions taken based on those insights.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
#Question: How does the date of the review impact passenger recommendations?
air['review_date'] = pd.to_datetime(air['review_date'])
air['year'] = air['review_date'].dt.year

sns.countplot(x='year', hue='recommended', data=air)
plt.title('Distribution of Passenger Recommendations by Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()



Visualized Solution: This bar plot showcases the distribution of passenger recommendations based on the year of the review. It helps identify any trends or patterns in recommendations over time.



##### 1. Why did you pick the specific chart?

*Answer* 

I picked a countplot with hue to visualize the distribution of passenger recommendations by year because it allows for the comparison of the count or frequency of recommended and not recommended reviews within each year. This chart is suitable for analyzing discrete data, such as the review year, and the hue parameter helps distinguish and compare the distribution of recommendations within each year.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Difference in recommendations by year: The countplot displays the distribution of passenger recommendations categorized by review year. It provides insights into how passenger recommendations vary from one year to the next. By comparing the counts of recommended and not recommended reviews for each year, it is possible to identify which years have a higher number of positive recommendations and which ones have a higher number of negative recommendations.

Identification of years with positive recommendations: The chart helps identify the years that receive a higher number of positive recommendations. Years with a larger count in the "recommended" category indicate periods of higher passenger satisfaction. This information can help airlines understand what worked well in those periods and focus on maintaining or improving those practices to enhance customer satisfaction and positive recommendations.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer

The gained insights can help create a positive business impact by:

Optimizing service and resources: Airlines can study the years with higher positive recommendations, identify the services and practices that were in place at the time, and allocate resources to sustain them. This targeted approach can lead to higher customer satisfaction, increased loyalty, and positive recommendations, resulting in a positive business impact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#Question: What is the distribution of recommended airlines by traveler type?
sns.countplot(x='traveller_type', hue='recommended', data=air)
plt.title('Distribution of Recommended Airlines by Traveler Type')
plt.xlabel('Traveler Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Recommended')
plt.show()


##### 1. Why did you pick the specific chart?

Answer 


I picked the specific chart, a countplot with hue, to visualize the distribution of recommended airlines by traveler type. This chart allows us to compare the count or frequency of recommended and not recommended reviews within each traveler type category. A countplot is suitable for categorical data, such as traveler type, and the hue parameter enables us to distinguish and compare the distribution of recommendations within each category.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Distribution of recommendations by traveler type: The countplot reveals how the distribution of passenger recommendations varies across different traveler types. By comparing the counts of recommended and not recommended reviews for each traveler type, it is possible to identify which types of travelers tend to give more positive recommendations and which ones have a higher number of negative recommendations.

Satisfaction by traveler type: The chart shows how often each traveler type recommends the airline it flew with; it does not break the counts down by individual airline. This insight allows airlines to identify which traveler segments are more or less satisfied and to tailor their marketing strategies, services, and offerings accordingly. By catering to the specific needs and preferences of different traveler types, airlines can enhance customer satisfaction, foster loyalty, and generate positive recommendations.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

*Answer* 

Tailoring services and offerings: Insights from the distribution of recommended airlines by traveler type enable airlines to customize their services and offerings to better meet the preferences and needs of different traveler segments. This targeted approach improves customer satisfaction, increases positive recommendations, and attracts more customers from those segments.

Personalized marketing and communication: Understanding which traveler types recommend more often allows airlines to personalize their marketing and communication strategies. By crafting targeted messages and campaigns that resonate with each traveler type, highlighting relevant features and benefits, airlines can increase engagement, foster customer loyalty, and generate positive recommendations.

Negative growth insight:
If certain traveler types have a higher count in the "not recommended" category, it indicates areas of concern that can lead to negative growth. This suggests that specific traveler segments may be experiencing issues or dissatisfaction with their airline experiences. Addressing these concerns is crucial to avoid negative word-of-mouth, prevent a decline in customer satisfaction, and mitigate potential negative growth in those specific traveler segments.

#### Chart - 13

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='cabin', data=air)
plt.title('Distribution of Cabin Classes')
plt.xlabel('Cabin Class')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


Visualized Solution: This bar plot displays the number of passengers in each cabin class. It provides insights into how passengers are distributed across the different cabin classes.



In [None]:
sns.boxplot(x='seat_comfort', y='overall', data=air)
plt.title('Seat Comfort vs. Airline Ratings')
plt.xlabel('Seat Comfort')
plt.ylabel('Rating')
plt.show()


Visualized Solution: This box plot compares the distribution of airline ratings for different levels of seat comfort. It helps assess whether higher seat comfort ratings are associated with higher overall airline ratings.

##### 1. Why did you pick the specific chart?

Answer
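I picked the specific chart, a boxplot, to visualize the relationship between seat comfort and airline ratings because it summarizes the median, quartiles, and outliers of the overall rating at each level of seat comfort, which makes the levels easy to compare side by side.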

##### 2. What is/are the insight(s) found from the chart?

Answer 

Insights from the chart:

Relationship between seat comfort and airline ratings: The boxplot provides insights into how seat comfort relates to overall airline ratings. It shows the distribution of airline ratings for different levels of seat comfort. By comparing the boxplots for each level of seat comfort, we can identify any differences or patterns in the ratings.

Impact of seat comfort on airline ratings: The chart helps determine whether there is a correlation between seat comfort and airline ratings. If there are noticeable differences in the median or distribution of ratings across different levels of seat comfort, it suggests that seat comfort plays a role in influencing passenger perceptions and overall airline ratings.



##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer 

The gained insights can help create a positive business impact:

Enhancing seat comfort: If the chart indicates that higher seat comfort is associated with better airline ratings, airlines can prioritize efforts to improve seat comfort. By investing in comfortable seating, ergonomic design, and amenities, airlines can enhance the overall passenger experience, increase customer satisfaction, and potentially receive higher ratings and positive recommendations.

Addressing negative growth: If the chart shows a significant decline in ratings for lower levels of seat comfort, it highlights a potential issue that could lead to negative growth. In such cases, airlines should address the underlying causes of discomfort, such as outdated or uncomfortable seating configurations. By making improvements in seat comfort, airlines can mitigate negative ratings, improve customer satisfaction, and prevent negative growth.

In [None]:
sns.histplot(air['ground_service'], bins=10, kde=True)
plt.title('Distribution of Ground Service Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


Visualized Solution: This histogram displays the distribution of ground service ratings. It provides an overview of the rating distribution and highlights any significant patterns.

In [None]:
plt.figure(figsize=(12, 6))
# Restrict the plot to the 20 most frequent routes
top20_routes = air['route'].value_counts().nlargest(20).index
sns.violinplot(x='route', y='value_for_money', data=air[air['route'].isin(top20_routes)], order=top20_routes)
plt.title('Value for Money Ratings by Route')
plt.xlabel('Route')
plt.ylabel('Value for Money Rating')
plt.xticks(rotation=45)
plt.show()


This code creates a violin plot that displays the distribution of value for money ratings for each route. It keeps the 20 most frequent routes in the "air" DataFrame and sets the x-axis to the "route" column and the y-axis to the "value_for_money" column. The plot is labeled with a title, x-axis label, and y-axis label, and the x-axis labels are rotated for better readability. The violin plot provides a visual representation of the distribution of values and can show multiple modes in the data.






#### Chart - 14 - Correlation Heatmap

In [None]:
# Select the relevant columns for correlation analysis
selected_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money', 'recommended']

# Create a correlation matrix
correlation_matrix = air[selected_columns].corr()

# Generate a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

Answer


I picked the correlation heatmap because it is a useful visualization to analyze the relationship between multiple variables. In this case, the heatmap is used to explore the correlation between different aspects of air travel (such as seat comfort, cabin service, food and beverage, entertainment, ground service, value for money) and the overall rating of the airline (represented by the 'overall' column). The heatmap provides a clear visual representation of the correlation coefficients between these variables.

##### 2. What is/are the insight(s) found from the chart?

Answer

Here are some possible insights that can be inferred from the chart:

Seat comfort and cabin service have a relatively strong positive correlation with the overall rating. This suggests that passengers who rate these aspects highly are more likely to give a higher overall rating to the airline.

Food and beverage and entertainment also show positive correlations with the overall rating, although they appear to be slightly weaker than seat comfort and cabin service.

Ground service and value for money have relatively weaker positive correlations with the overall rating compared to the other factors.

The 'recommended' column, which likely represents whether the passenger would recommend the airline, shows a moderate positive correlation with the overall rating. This suggests that passengers who give a higher overall rating are more likely to recommend the airline to others.
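
Reading the ranking of correlations off a heatmap by eye is error-prone, so it can help to sort the relevant column of the correlation matrix. The sketch below assumes that `recommended` is stored as 'yes'/'no' labels and needs to be mapped to 1/0 first; depending on the pandas version, a text column either raises an error in `corr()` or is left out. The toy values are invented.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "overall":        [1, 3, 5, 7, 9, 10],
    "seat_comfort":   [1, 2, 3, 4, 5, 5],
    "ground_service": [2, 1, 3, 3, 5, 4],
    "recommended":    ["no", "no", "no", "yes", "yes", "yes"],
})

# Encode the target as 0/1 so that it can take part in the correlation matrix
toy["recommended"] = toy["recommended"].map({"no": 0, "yes": 1})

# Correlation of every other column with 'overall', strongest first
ranking = toy.corr()["overall"].drop("overall").sort_values(ascending=False)
print(ranking)
```

The same pattern applied to `air[selected_columns]` would give the actual ordering, which is what the statements about "stronger" and "weaker" correlations should be checked against.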

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code
# Select the relevant variables for the pair plot
variables = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create the pair plot
sns.pairplot(air[variables])
# pairplot builds its own grid of subplots, so use a figure-level title
plt.suptitle('Pair Plot of Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Answer 


I picked the pair plot visualization because it allows for the examination of relationships between multiple variables simultaneously. The pair plot creates a matrix of scatter plots, showing the relationships between each pair of variables in the selected list. This makes it easy to visualize both the individual distributions of variables and the interactions between variables.

##### 2. What is/are the insight(s) found from the chart?

Answer 

Here are some potential insights that can be inferred:

Overall Rating vs. Specific Aspects: The scatter plots between the 'overall' rating and the other variables (seat comfort, cabin service, food and beverage, entertainment, ground service, value for money) provide an overview of their relationships. We can observe the general trends and distributions of ratings for each aspect.

Correlation with Overall Rating: By visually analyzing the scatter plots, we can determine the nature of the relationships between each variable and the 'overall' rating. Variables that exhibit a positive slope and a tight clustering of points around the line of best fit suggest a positive correlation with the overall rating. Conversely, variables that show a negative slope and points scattered away from the line indicate a negative correlation.

Nonlinear Relationships: The pair plot can also help identify nonlinear relationships between variables. If the scatter plots exhibit curvilinear patterns or non-linear trends, it suggests that the relationship between the variables is more complex than a simple linear correlation.

Outliers: The pair plot allows us to identify any potential outliers or extreme values in the dataset. Outliers may appear as data points that deviate significantly from the general pattern of the scatter plots. These outliers could represent extreme ratings or unusual cases that may require further investigation.

Removing multicollinear features

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Creating a function to calculate VIF
def calc_vif(X):
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif


In [None]:
import numpy as np

# Remove missing values and infinite values from the data
air_cleaned = air[[i for i in air.describe().columns if i not in ['recommended', 'value_for_money', 'overall']]].replace([np.inf, -np.inf], np.nan).dropna()

# Calculate VIF for the cleaned data
vif_results = calc_vif(air_cleaned)

print(vif_results)
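
As a cross-check on the VIF values, the sketch below computes them with NumPy only, using the identity that the VIF of each feature is the matching diagonal entry of the inverse correlation matrix of the features. The helper name `vif_from_correlation` and the synthetic columns are my own and not part of the project. As far as I know, the statsmodels function works on the columns exactly as passed and does not add an intercept, so its numbers can differ from these unless a constant column is included.

```python
import numpy as np

def vif_from_correlation(X):
    """Return the VIF of each column as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_from_correlation(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], vifs.round(2))))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of strong collinearity; in the synthetic example `a` and `c` land far above that range while `b` stays close to 1.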



# **Defining the dependent and independent variables.**

In [None]:
#separating the dependent and independent variables
y = air['recommended']
x = air.drop(columns = 'recommended')


In [None]:

x.columns

# **One Hot Encoding**

In [None]:
#x = pd.get_dummies(x)
     

In [None]:
x.shape

In [None]:
x.head(2)
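
Since the encoding call above is commented out, here is a minimal sketch of what one-hot encoding a single categorical column could look like. The cabin labels are placeholders, and restricting the encoding to `cabin` is an assumption for the example, not the project's final choice.

```python
import pandas as pd

# Hypothetical toy features
toy = pd.DataFrame({
    "cabin": ["Economy Class", "Business Class", "Economy Class", "First Class"],
    "seat_comfort": [3, 5, 2, 4],
})

# Encode only the named column; drop_first removes one redundant dummy column
encoded = pd.get_dummies(toy, columns=["cabin"], drop_first=True)
print(encoded.columns.tolist())
```

Passing `columns=` explicitly keeps high-cardinality text columns such as `route` from being expanded into one column per distinct value.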

The percentages of the two labels ('yes', 'no') are approximately equal, so no class imbalance handling technique is needed.
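
The balance of the two labels can be checked directly from the target column. The sketch below uses an invented label series; on the real data the same two lines would be applied to `y`.

```python
import pandas as pd

# Hypothetical toy target
y_toy = pd.Series(["yes", "no", "yes", "no", "no", "yes", "no", "yes", "no", "no"])

# Share of each label and the ratio of the majority to the minority class
shares = y_toy.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()
print(shares)
print("majority/minority ratio:", round(imbalance_ratio, 2))
```

A ratio close to 1 supports skipping resampling; what counts as "close" is a judgment call that depends on the model and the metric.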

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer 

Here are three hypothetical statements based on the dataset:

1. Statement: Passengers who rate higher on "value_for_money" are more likely to recommend the airline.

2. Statement: Passengers who have a higher overall rating are more likely to rate the seat comfort positively.

3. Statement: Passengers who have a higher overall rating are more likely to rate the cabin service positively.

To perform hypothesis testing and obtain a final conclusion for each statement, we'll use the following steps:

1. Set up the null hypothesis (H0) and alternative hypothesis (H1) for each statement.
2. Perform the appropriate statistical test based on the nature of the variables involved.
3. Calculate the test statistic and p-value.
4. Determine the significance level (α) for the test.
5. Compare the p-value with the significance level to make a decision.
6. Draw a conclusion based on the decision made.
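
The six steps above can be sketched as a small helper. This is an illustration under stated assumptions, not the test actually run in this notebook: it uses Welch's t-test (no equal-variance assumption) with a one-sided alternative, which needs SciPy 1.6 or later for the `alternative` argument, and the function name and toy ratings are made up.

```python
from scipy.stats import ttest_ind

def one_sided_welch_test(group_a, group_b, alpha=0.05):
    """Test H1: mean(group_a) > mean(group_b) with Welch's t-test."""
    # An empty or single-value group cannot produce a meaningful p-value
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("each group needs at least two observations")
    t_stat, p_value = ttest_ind(
        group_a, group_b, equal_var=False, alternative="greater"
    )
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return t_stat, p_value, decision

# Hypothetical value-for-money ratings for the two groups
recommended_ratings = [5, 4, 5, 4, 3, 5, 4, 5]
not_recommended_ratings = [1, 2, 1, 3, 2, 1, 2, 1]

t_stat, p_value, decision = one_sided_welch_test(
    recommended_ratings, not_recommended_ratings
)
print(f"t = {t_stat:.2f}, p = {p_value:.4g} -> {decision}")
```

The one-sided alternative matches hypotheses phrased as "more likely", and the size check up front guards against silently testing an empty group.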


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer 
For Hypothetical Statement 1:

Null Hypothesis (H0): Passengers who rate higher on "value_for_money" are not more likely to recommend the airline.

Alternative Hypothesis (H1): Passengers who rate higher on "value_for_money" are more likely to recommend the airline.

#### 2. Perform an appropriate statistical test.

In [None]:
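# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Drop rows with missing values in either column
cleaned_data = air.dropna(subset=['value_for_money', 'recommended'])

# Split the ratings by recommendation status
# (this assumes 'recommended' is encoded as 1 = yes, 0 = no; if it holds other
# labels, both groups are empty and the test cannot produce a valid p-value)
recommended_ratings = cleaned_data[cleaned_data['recommended'] == 1]['value_for_money']
not_recommended_ratings = cleaned_data[cleaned_data['recommended'] == 0]['value_for_money']
print("Group sizes:", len(recommended_ratings), len(not_recommended_ratings))

# Perform independent two-sample t-test
t_statistic, p_value = ttest_ind(recommended_ratings, not_recommended_ratings, nan_policy='omit')

print("P-value:", p_value)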


##### Which statistical test have you done to obtain P-Value?

Answer 

An independent two-sample t-test was used to compare the mean "value_for_money" rating of passengers who recommended the airline with that of passengers who did not. However, the obtained p-value was "nan". A nan result usually means that at least one of the two groups had too few observations, which can happen when the 'recommended' column is not encoded as 1/0 at this point in the notebook. Therefore, no conclusion can be drawn about Hypothetical Statement 1 from this run; the encoding of 'recommended' and the sizes of the two groups should be checked before the test is repeated.
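
One possible cause of a "nan" p-value can be illustrated on a toy frame. The sketch below assumes, purely for illustration, that 'recommended' holds the strings 'yes' and 'no'; whether that is the case in the actual dataset is not confirmed here. The toy values and the variable names are hypothetical.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy data: 'recommended' stored as strings (an assumption for this example)
toy = pd.DataFrame({
    "recommended": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "value_for_money": [5, 4, 5, 4, 1, 2, 1, 2],
})

# Comparing string labels with the integer 1 selects no rows at all
n_wrong = len(toy[toy["recommended"] == 1])
print("Rows matched by == 1:", n_wrong)

# Map the labels to 1/0 first, then split the ratings into the two groups
flag = toy["recommended"].map({"yes": 1, "no": 0})
group_yes = toy.loc[flag == 1, "value_for_money"]
group_no = toy.loc[flag == 0, "value_for_money"]

# One-sided test, matching the directional alternative hypothesis H1
t_stat, p_value = ttest_ind(group_yes, group_no, alternative="greater")
print("t:", t_stat, "p:", p_value)
```

Printing the group sizes before running the test makes an empty group visible immediately.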

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:

# Count missing values
missing_count = air.isnull().sum()

print("Missing values count:")
print(missing_count)

In [None]:
# Check for missing values
missing_values = air.isnull().sum()
print("Missing Values:")
print(missing_values)

# Fill missing values with an appropriate imputation method
# For numerical columns, use mean imputation
numeric_cols = ['overall', 'seat_comfort', 'cabin_service', 'food_bev',
                'entertainment', 'ground_service', 'value_for_money']
for col in numeric_cols:
    air[col] = air[col].fillna(air[col].mean())

# For categorical columns, use mode imputation
categorical_cols = ['airline', 'author', 'customer_review', 'aircraft', 'traveller_type',
                    'cabin', 'route', 'date_flown', 'recommended']
for col in categorical_cols:
    air[col] = air[col].fillna(air[col].mode().iloc[0])


In [None]:
# Check for missing values
missing_values = air.isnull().sum()
print("Missing Values:")
print(missing_values)

# Convert 'review_date' column to datetime format
air['review_date'] = pd.to_datetime(air['review_date'], errors='coerce')

# Fill missing values in 'review_date' column with a specific value or strategy
# For example, you can fill the missing values with the median date or the most recent date
median_date = air['review_date'].median()
air['review_date'] = air['review_date'].fillna(median_date)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer 
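The code uses mean imputation for numerical columns and mode imputation for categorical columns. Mean imputation replaces missing numerical values with the mean of the column, which leaves the column average unchanged. Mode imputation replaces missing categorical values with the most frequent category, which guarantees that every filled value is a valid category. The 'review_date' column is converted to datetime format and its missing values are filled with the median date. These are simple techniques chosen by data type; they implicitly assume that values are missing at random rather than for a reason related to the rating itself.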
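
As a minimal sketch of the same idea, the helper below picks mean or mode imputation from each column's dtype. The function name `impute` and the toy frame are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values)
toy = pd.DataFrame({
    "seat_comfort": [4.0, np.nan, 2.0, 3.0],
    "cabin": ["Economy", None, "Economy", "Business"],
})

def impute(df):
    """Fill numeric columns with their mean and all other columns with their mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute(toy)
print(imputed)
```

In a modelling workflow, these statistics would usually be computed on the training split only and then applied to the test split, so that no information from the test data leaks into training.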

### 2. Handling Outliers

In [None]:
import numpy as np
from scipy import stats

# Handling Outliers & Outlier treatments
# Detect outliers with the z-score method and keep only the values within the threshold
def handle_outliers_zscore(data, threshold=3):
    z_scores = np.abs(stats.zscore(data))
    data_no_outliers = data[(z_scores <= threshold)]
    return data_no_outliers

# Apply outlier treatment to the 'value_for_money' column.
# Assigning the filtered Series back aligns on the index, so rows flagged as
# outliers become NaN in the column instead of being dropped from the DataFrame.
air['value_for_money'] = handle_outliers_zscore(air['value_for_money'])
print(air['value_for_money'])


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer 

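The code uses the z-score method to detect outliers in the 'value_for_money' column. A z-score is computed for each value, and values whose absolute z-score exceeds the threshold of 3 are filtered out. Because the filtered Series is assigned back to the column, the flagged rows become NaN rather than being replaced directly; the `SimpleImputer(strategy='mean')` step in the modelling pipeline later fills them with the column mean. This reduces the influence of extreme values on the statistical analysis and on the models.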
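
For comparison, here is a small sketch of z-score detection followed by an explicit replacement with the mean of the remaining values. The toy series is synthetic and deliberately contains one extreme value; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic ratings with one obviously extreme value (illustrative only)
ratings = pd.Series([3.0] * 10 + [4.0] * 10 + [50.0])

# Flag values whose absolute z-score exceeds the threshold
z_scores = np.abs(stats.zscore(ratings))
is_outlier = z_scores > 3

# Replace flagged values with the mean of the non-outlying values
mean_of_inliers = ratings[~is_outlier].mean()
treated = ratings.mask(is_outlier, mean_of_inliers)

print("Outliers found:", int(is_outlier.sum()))
print("Replacement value:", mean_of_inliers)
```

Using the mean of the non-outlying values avoids letting the extreme value distort its own replacement.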

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# Encode categorical columns using get_dummies()
encoded_data = pd.get_dummies(air, columns=['airline', 'traveller_type', 'cabin', 'route'])

# Print the encoded data
print(encoded_data.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer 

In the code snippet provided, the categorical encoding technique used is one-hot encoding using the `get_dummies()` function. One-hot encoding is chosen because it converts categorical variables into binary vectors, representing the presence or absence of a category. This technique allows for the inclusion of categorical data in machine learning models that require numerical inputs, without introducing ordinality or magnitude assumptions.
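
To make the effect of one-hot encoding concrete, the sketch below encodes a single categorical column of a toy frame. The toy values are hypothetical; `dtype=int` is passed only so that the indicator columns print as 0/1.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (illustrative values)
toy = pd.DataFrame({
    "cabin": ["Economy", "Business", "Economy"],
    "overall": [7, 9, 3],
})

# Each category of 'cabin' becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["cabin"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between the categories is implied. For columns with many distinct values, such as 'route', this produces one column per distinct value, which can make the encoded frame very wide.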

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
air.head()

In [None]:
# Custom contraction dictionary
contractions_dict = {
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
    # Add more contractions and their expanded forms as needed
}

def expand_contractions(text):
    words = text.split()
    expanded_words = [contractions_dict.get(word.lower(), word) for word in words]
    expanded_text = " ".join(expanded_words)
    return expanded_text

# Apply expand_contractions function to the 'customer_review' variable
air['customer_review'] = air['customer_review'].apply(expand_contractions)


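This code defines a dictionary, `contractions_dict`, that maps contractions to their expanded forms. The `expand_contractions()` function splits the input text on whitespace, replaces each word found in the dictionary (compared in lower case) with its expanded form, and joins the words back into a text string. Because matching is done on whole whitespace-separated tokens, a contraction with attached punctuation (for example "can't,") is left unchanged.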
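
A regex-based variant can also handle contractions that are followed by punctuation or written with capital letters. The sketch below is one possible approach, not the notebook's method; the names `CONTRACTIONS` and `CONTRACTION_RE` are hypothetical, and only straight apostrophes are covered.

```python
import re

# Small illustrative mapping; extend as needed
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "won't": "will not"}

# One alternation pattern over all keys, matched case-insensitively on word boundaries
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace every known contraction with its expanded form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("I can't complain, don't worry; they Won't refund.")
print(expanded)
```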

#### 2. Lower Casing

In [None]:
# Lower Casing
air['customer_review'] = air['customer_review'].str.lower()

print(air['customer_review'])


In this code, the `str.lower()` method is applied to the 'customer_review' column, which converts all the text in that column to lower case.

#### 3. Removing Punctuations

In [None]:
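# Remove punctuations
import re
import string

# regex=True is needed: recent pandas versions treat the pattern as a literal string by default
air['customer_review'] = air['customer_review'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)

print(air['customer_review'])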


This code removes all punctuation characters from the 'customer_review' column and prints the updated column.

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs
air['customer_review'] = air['customer_review'].apply(lambda x: re.sub(r'http\S+|www.\S+', '', x))

# Remove words and digits containing digits
air['customer_review'] = air['customer_review'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))

print(air['customer_review'])


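This code uses regular expressions (`re`) to remove URLs and any words that contain digits from the 'customer_review' column, then prints the updated column. Note that this cell runs after the punctuation-removal step, so characters such as ':', '/' and '.' have already been stripped from any URLs. The pattern `http\S+` still matches the remaining token that starts with "http", but URL removal is most reliable when it is applied before punctuation is removed.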
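
The sketch below shows the two substitutions on a raw sample sentence, assuming they are applied before punctuation is removed. The function name and the sample text are hypothetical, and the dot after "www" is escaped so that it matches only a literal dot.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links starting with http(s):// or www.
DIGIT_WORD_RE = re.compile(r"\w*\d\w*")         # any token that contains a digit

def strip_urls_and_digit_words(text):
    """Remove URLs, then tokens containing digits, then collapse extra spaces."""
    text = URL_RE.sub("", text)
    text = DIGIT_WORD_RE.sub("", text)
    return " ".join(text.split())

cleaned = strip_urls_and_digit_words(
    "Booked via https://example.com/deals seat 23A on the b777 was fine"
)
print(cleaned)
```

Removing every token with a digit also discards aircraft types and seat numbers, which may or may not be desirable depending on how the review text is used later.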

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Get the set of stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords from text
def remove_stopwords(text):
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Apply stopwords removal to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(remove_stopwords)

print(air['customer_review'])



This code utilizes the NLTK library to download and access the stopwords list for the English language. It defines a function to remove stopwords from the text and applies it to the 'customer_review' column. The updated column will be printed.
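
The NLTK English stopword list includes negations such as "not" and "no", which can carry sentiment in a review. The sketch below shows one way to keep selected words while removing the rest. To stay self-contained it uses a tiny hand-written stopword set instead of the NLTK list; `STOP_WORDS` and `KEEP` are hypothetical names.

```python
# Tiny illustrative subset of stopwords (the real NLTK list is much longer)
STOP_WORDS = {"the", "was", "a", "is", "and", "not", "no"}

# Words to keep even though they are on the stopword list
KEEP = {"not", "no"}

def remove_stopwords(text, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stopwords from the text, except those explicitly kept."""
    effective = stop_words - keep
    return " ".join(token for token in text.split() if token.lower() not in effective)

cleaned = remove_stopwords("the food was not good and the crew is great")
print(cleaned)
```

Without the exception, "not good" would be reduced to "good", which reverses the meaning of the phrase.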

In [None]:
# Remove White spaces
# Function to remove white spaces from text
def remove_white_spaces(text):
    cleaned_text = text.strip()
    return cleaned_text

# Apply white space removal to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(remove_white_spaces)

print(air['customer_review'])


This code defines a function that uses the strip() method to remove leading and trailing white spaces from the text. It then applies this function to the 'customer_review' column using the apply() method. The updated column will be printed.

#### 6. Rephrase Text

In [None]:
# Rephrase Text
import re

# Function to rephrase text
def rephrase_text(text):
    # \b restricts the match to whole words, so 'goodbye' or 'badge' are left untouched
    rephrased_text = re.sub(r'\bgood\b', 'great', text)
    rephrased_text = re.sub(r'\bbad\b', 'poor', rephrased_text)
    return rephrased_text

# Apply text rephrasing to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(rephrase_text)

print(air['customer_review'])


The function rephrase_text() replaces the whole word 'good' with 'great' and the whole word 'bad' with 'poor' using re.sub() with word boundaries, so that longer words containing these letters are not altered. The function is then applied to the 'customer_review' column using the apply() method. The updated column will be printed.






#### 7. Tokenization

In [None]:
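# Tokenization
# word_tokenize needs the 'punkt' tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Function for tokenization
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    return tokens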

# Apply tokenization to 'customer_review' column
air['customer_review'] = air['customer_review'].apply(tokenize_text)

print(air['customer_review'])


#### 8. Text Normalization

In [None]:
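# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()  # create once and reuse for every row

def lemmatize_text(text):
    # After tokenization the column holds lists of tokens; raw strings are also accepted
    words = text.split() if isinstance(text, str) else text
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words])
    return lemmatized_text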


##### Which text normalization technique have you used and why?

Answer 

Lemmatization was chosen because it reduces inflected or derived words to their base form, which standardizes the text and groups words with similar meanings together. Variations of a word (such as plural forms) are mapped to a common base form, which reduces the vocabulary size and helps downstream text analysis. Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun by default, so verb tenses are only reduced when a part-of-speech tag is passed to it.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
def pos_tag_text(text):
    tokens = nltk.word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    return tagged_tokens

### 8. Data Splitting

In [None]:
# Define your features (x) and target variable (y)
features = ['airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type',
            'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
            'ground_service', 'value_for_money']
target = 'recommended'

# Split your data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(air[features], air[target], test_size=0.2, random_state=42)


In [None]:
#shape of x_train and x_test data
print(x_train.shape)
print(x_test.shape)
     


In [None]:
#shape of y_train and y_test data
print(y_train.shape)
print(y_test.shape)

##### What data splitting ratio have you used and why? 

Answer

The data splitting ratio used in this case is 80% training data and 20% testing data. This ratio, specified by the `test_size=0.2` parameter in the `train_test_split()` function, is commonly used as a standard practice in machine learning. It ensures a sufficient amount of data for training the model while reserving a portion for evaluating its performance on unseen data.

The 80/20 split is a widely accepted ratio that strikes a balance between having enough data for training and keeping a reasonably sized test set to assess the model's generalization ability. A held-out test set does not prevent overfitting by itself, but it makes overfitting detectable: a model that is too specialized to the training data will score noticeably worse on the unseen test data.

The specific choice of the data splitting ratio can vary depending on the dataset's size, complexity, and the specific requirements of the problem at hand. However, the 80/20 ratio is a commonly recommended starting point that provides a good balance between training and evaluation data.
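
As a hedged sketch, the split can also be stratified on the target so that both sets keep the same label proportions. The toy frame below is synthetic, and the `stratify` argument is an optional addition that the notebook's own split does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with a perfectly balanced binary target (illustrative only)
toy = pd.DataFrame({"overall": range(20), "recommended": [1, 0] * 10})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["overall"]],
    toy["recommended"],
    test_size=0.2,                 # 80% train, 20% test
    random_state=42,               # reproducible shuffling
    stratify=toy["recommended"],   # keep label proportions in both sets
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```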

### 9. Handling Imbalanced Dataset

In [None]:
import numpy as np
print("The Percentage of No labels of Target Variable is",np.round(y.value_counts()[0]/len(y)*100))
print("The Percentage of Yes labels of Target Variable is",np.round(y.value_counts()[1]/len(y)*100))


The percentages of the two labels ('yes', 'no') are approximately equal, so no class-imbalance handling technique is needed.
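
The same check can be written without relying on the position or name of each label, using `value_counts(normalize=True)`. The toy series below is hypothetical, and the ratio at the end is just one simple way to summarize how far the classes are from balance.

```python
import pandas as pd

# Hypothetical target column with string labels
y_toy = pd.Series(["yes", "no", "no", "yes", "no"])

# Share of each label in percent, keyed by the label itself
share = y_toy.value_counts(normalize=True).mul(100).round(1)
print(share)

# Ratio of the largest to the smallest class; values near 1 indicate balance
imbalance_ratio = share.max() / share.min()
print("Imbalance ratio:", imbalance_ratio)
```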



## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
print(air.dtypes)

In [None]:
# Define the features and target variable
features = ['airline', 'overall', 'author', 'review_date', 'customer_review', 'aircraft', 'traveller_type',
            'cabin', 'route', 'date_flown', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
            'ground_service', 'value_for_money']
target = 'recommended'

# Convert non-numeric values to strings in categorical columns
categorical_columns = ['airline', 'author', 'customer_review', 'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown']
for column in categorical_columns:
    air[column] = air[column].astype(str)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(air[features], air[target], test_size=0.2, random_state=42)

# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical features
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
                                      'ground_service', 'value_for_money']),
        ('cat', categorical_transformer, categorical_columns)])





In [None]:
# Model 1: Logistic Regression
logreg_model = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', LogisticRegression())])
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_pred)
print("Logistic Regression Accuracy:", logreg_accuracy)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Fit the logistic regression model
logreg_model.fit(X_train, y_train)

# Predict on the test set
logreg_pred = logreg_model.predict(X_test)

# Calculate accuracy
logreg_accuracy = accuracy_score(y_test, logreg_pred)
print("Logistic Regression Accuracy:", logreg_accuracy)

# Create confusion matrix
cm = confusion_matrix(y_test, logreg_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()


The code performs logistic regression classification on the given dataset. It preprocesses the data by handling missing values, scaling numerical features, and one-hot encoding categorical features. It then fits the logistic regression model, predicts on the test set, calculates accuracy, and visualizes the resulting confusion matrix.

### ML Model - 2

In [None]:

# Model 2: Random Forest Classifier
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', RandomForestClassifier())])
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
import seaborn as sns

# Fit the Random Forest model
rf_model.fit(X_train, y_train)

# Predict on the test set
rf_pred = rf_model.predict(X_test)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy:", rf_accuracy)

# Plot confusion matrix
cm = confusion_matrix(y_test, rf_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')  # fmt='d' shows counts as whole numbers
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()


The code trains a Random Forest Classifier using the given dataset. It preprocesses the data, fits the model, predicts on the test set, calculates accuracy, and visualizes the resulting confusion matrix. The Random Forest model aims to improve classification accuracy by leveraging multiple decision trees and aggregating their predictions.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate accuracy for both models
rf_accuracy = accuracy_score(y_test, rf_pred)
logreg_accuracy = accuracy_score(y_test, logreg_pred)

# Create a bar chart
models = ['Random Forest', 'Logistic Regression']
accuracies = [rf_accuracy, logreg_accuracy]
colors = ['b', 'g']

plt.bar(models, accuracies, color=colors)
plt.title('Model Comparison - Accuracy')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.ylim([0, 1])  # Set y-axis limits between 0 and 1

plt.show()


The code imports necessary libraries and calculates the accuracies of the Random Forest and Logistic Regression models. It then creates a bar chart to compare the accuracies of the models, with the x-axis representing the models and the y-axis representing the accuracy values.
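
Accuracy is the only metric compared in this notebook. As a hedged sketch, precision, recall and F1 can be computed in the same way from true and predicted labels. The labels below are made up for the example; in the notebook they would be `y_test` and a model's predictions, assuming the target is encoded as 1/0.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = recommended, 0 = not recommended
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat),  # share of predicted 1s that are correct
    "recall": recall_score(y_true, y_hat),        # share of actual 1s that are found
    "f1": f1_score(y_true, y_hat),                # harmonic mean of precision and recall
}
print(metrics)
```

Precision and recall separate the two kinds of error (false positives and false negatives), which accuracy alone combines into a single number.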

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The project successfully addressed the objective of predicting passenger recommendations based on airline reviews. By analyzing the dataset and employing machine learning models, valuable insights were obtained for stakeholders, including airlines, marketing teams, and customer service departments.

The project's findings can be used by airlines to understand the factors that drive positive recommendations and improve customer satisfaction. By focusing on specific areas such as seat comfort, cabin service, food and beverages, entertainment, ground service, and value for money, airlines can enhance the overall passenger experience.

Marketing teams can leverage the insights to develop targeted marketing campaigns that emphasize the positive aspects of their services, leveraging the factors that customers value the most. This can lead to increased customer acquisition and retention, as well as positive word-of-mouth recommendations.

Additionally, the analysis of the target feature's distribution showed that the 'yes' and 'no' classes are approximately balanced, so no resampling technique was needed before modelling. Airlines can still act on the findings by improving the specific areas that receive lower ratings and by addressing recurring customer concerns to raise overall satisfaction.

Overall, the project's outcome provides actionable insights for stakeholders to make data-driven decisions and improve their services based on customer feedback. By understanding the factors influencing passenger recommendations, airlines can strive to create a positive customer experience, enhance customer loyalty, and gain a competitive edge in the industry.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***