In [1]:
# Mount Google Drive so the notebook can read files stored there
from google.colab import drive
drive.mount('/content/drive')

In [2]:
import pandas as pd

In [16]:
df = pd.read_csv('/content/drive/MyDrive/Projects/airline_review.csv')
df.describe()
df.head()

Unnamed: 0.1,Unnamed: 0,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food and Beverages,Inflight Entertainment,Ground Service,Value For Money,Recommended
0,0,Couple Leisure,Business Class,Athens to London,September 2023,1,1,1,1,1,1,no
1,1,Business,Economy Class,Milan to San Jose via London,September 2023,3,3,2,4,1,1,no
2,2,Couple Leisure,First Class,Dallas to Dubrovnik via Heathrow,September 2023,1,4,4,3,3,2,no
3,3,Business,Business Class,London to Seville,September 2023,2,1,1,3,1,1,no
4,4,Couple Leisure,Economy Class,London Heathrow to Tokyo,September 2023,1,1,1,3,2,1,no


2.Data Preprocessing

We prepare our dataset for the machine learning model. This involves encoding categorical variables and scaling numerical features.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encoding categorical variables
label_encoders = {}
for column in ['Type Of Traveller', 'Seat Type', 'Recommended']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

# Dropping columns that might need more complex preprocessing
X = df.drop(columns=['Unnamed: 0', 'Route', 'Date Flown', 'Recommended'])
y = df['Recommended']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
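
LabelEncoder assigns integer codes in sorted order, which suggests a ranking that nominal columns such as Seat Type do not have. As a hedged alternative, the sketch below one-hot encodes the categorical columns and scales a rating column in a single ColumnTransformer. The `toy` DataFrame and its rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse three column names from the dataset
toy = pd.DataFrame({
    "Type Of Traveller": ["Business", "Couple Leisure", "Business", "Solo Leisure"],
    "Seat Type": ["Economy Class", "Business Class", "First Class", "Economy Class"],
    "Seat Comfort": [3, 1, 1, 4],
})

categorical_cols = ["Type Of Traveller", "Seat Type"]
numeric_cols = ["Seat Comfort"]

# One-hot encode the nominal columns and standardize the numeric rating;
# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,
)

encoded = preprocessor.fit_transform(toy)
print(encoded.shape)  # 3 traveller types + 3 seat types + 1 rating = 7 columns
```

On the real data the transformer would be fitted on X_train only and then applied to X_test, mirroring how the scaler is used above.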


3.Model Training

We chose a Random Forest classifier because it copes with mixed feature types (encoded categories and numeric ratings) and needs little tuning to give a reasonable baseline.

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_scaled, y_train)


4.Model Evaluation

After training, we evaluate the model's performance on the test dataset.

In [11]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = rf_clf.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(accuracy)
print(classification_rep)


1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00         1

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20
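
The support column shows 19 samples of class 0 and a single sample of class 1, so it helps to know what a trivial model would score on the same split. The sketch below rebuilds a label vector with those counts (the order of the labels is an assumption and does not affect the result) and scores a predictor that always outputs the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild test labels from the support column of the report: 19 zeros, 1 one
y_true = np.array([0] * 19 + [1])

# A trivial "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_balanced = balanced_accuracy_score(y_true, y_majority)

print(baseline_accuracy)  # 0.95
print(baseline_balanced)  # 0.5
```

Always predicting class 0 already reaches 0.95 accuracy here, so the gap between the baseline and the reported 1.0 rests on one test row.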



A perfect score on a 20-row test set that contains only one positive example is weak evidence of a good model, so we verify the result with the following steps:

Cross-Validation: To get a more accurate estimate of the model's performance.

Investigate Potential Data Leakage: Check whether any feature is so strongly correlated with the target variable that it effectively encodes the answer.

Handle Imbalanced Dataset: Apply techniques to manage the imbalance in the dataset.

1.Cross-Validation

We'll use cross-validation to evaluate the model's performance more reliably.

In [12]:
from sklearn.model_selection import cross_val_score

# Performing cross-validation
cv_scores = cross_val_score(rf_clf, X_train_scaled, y_train, cv=5, scoring='accuracy')

cv_scores


array([1.    , 0.875 , 0.875 , 0.875 , 0.9375])
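
The five fold scores above average 0.9125. Two caveats apply: the scaler was fitted on the whole training split before the folds were made, and accuracy says little about the minority class. The sketch below is one way to address both, assuming synthetic data from make_classification as a stand-in for the training split. `X_demo`, `y_demo` and `demo_scores` are hypothetical names, and the scores it prints are not results for the airline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the training split: 80 rows, roughly 80/20 classes
X_demo, y_demo = make_classification(
    n_samples=80, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)

# Inside a pipeline the scaler is refit on the training part of every fold
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# cross_val_score already stratifies classifiers when cv is an integer;
# the explicit splitter only adds reproducible shuffling
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy"
)

print(demo_scores.mean(), demo_scores.std())
```

Balanced accuracy averages the recall of both classes, so a model that ignores the minority class scores 0.5 rather than something close to 1.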

2.Investigate Potential Data Leakage

Let's briefly check for any strong correlations between features and the target variable.

In [14]:
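# Check for correlations between the numeric columns and the target
# (numeric_only=True skips text columns such as Route and Date Flown)
correlations = df.corr(numeric_only=True)['Recommended'].sort_values()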

correlations




Seat Type                -0.337500
Unnamed: 0               -0.043807
Type Of Traveller         0.040679
Inflight Entertainment    0.294616
Food and Beverages        0.382776
Seat Comfort              0.546500
Cabin Staff Service       0.547451
Ground Service            0.570742
Value For Money           0.787501
Recommended               1.000000
Name: Recommended, dtype: float64
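
Pearson correlation only measures linear association, and a coefficient of 0.79 for Value For Money is strong but not by itself proof of leakage. One complementary check, sketched here on invented data, scores each feature on its own with a one-split decision tree; a feature that reaches near-perfect cross-validated accuracy alone deserves a closer look. The columns `leaky_copy` and `noise` are hypothetical and do not exist in the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
target = np.array([0, 1] * 50)  # invented, balanced target

# Invented features: one duplicates the target, one is pure noise
features = pd.DataFrame({
    "leaky_copy": target,                    # hypothetical leaked column
    "noise": rng.integers(1, 6, size=100),   # random integer "rating"
})

# Score every feature on its own with a depth-1 tree (a decision stump)
single_feature_scores = {
    name: cross_val_score(
        DecisionTreeClassifier(max_depth=1, random_state=0),
        features[[name]], target, cv=5, scoring="accuracy",
    ).mean()
    for name in features.columns
}

print(pd.Series(single_feature_scores).sort_values(ascending=False))
```

Applied to the real data, the same loop would run over the columns of X_train with y_train as the target.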

3.Handling Imbalanced Dataset

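We can apply SMOTE (Synthetic Minority Over-sampling Technique) to the training split to balance its classes. The test split is left untouched so that it still reflects the original class distribution.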

In [15]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

# Check the balance
balance_check = pd.Series(y_train_smote).value_counts()

balance_check


0    64
1    64
Name: Recommended, dtype: int64
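
Resampling is one option. A different technique, shown here as a hedged alternative and not as the method used above, is to weight classes inversely to their frequency with the `class_weight="balanced"` parameter of RandomForestClassifier. The data below is synthetic, and names such as `X_imb` and `weighted_rf` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the airline reviews
X_imb, y_imb = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42,
)

# class_weight="balanced" reweights samples inversely to class frequency,
# so no synthetic rows are created and the test split stays untouched
weighted_rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42,
)
weighted_rf.fit(X_tr, y_tr)

report = classification_report(y_te, weighted_rf.predict(X_te))
print(report)
```

Whichever technique is used, the model should be refitted on the adjusted training data and evaluated on a test split that was neither resampled nor reweighted.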