# 📊 Exploratory Data Analysis: Flight Price

## 1. Description

1. **Airline:** The name of the airline company is stored in the `airline` column. It is a categorical feature with 6 different airlines.
2. **Flight:** The flight code of each flight. It is a categorical feature.
3. **Source City:** The city from which the flight departs. It is a categorical feature with 6 unique cities.
4. **Departure Time:** A derived categorical feature created by grouping time periods into bins. It provides information about the departure time and has 6 distinct time labels.
5. **Stops:** A categorical feature with 3 distinct values representing the number of stops between the source and destination cities.
6. **Arrival Time:** A derived categorical feature created by grouping time intervals into bins. It has 6 distinct time labels and contains information about the arrival time.
7. **Destination City:** The city where the flight lands. It is a categorical feature with 6 unique cities.
8. **Class:** A categorical feature indicating the seat class, with two possible values: Business and Economy.
9. **Duration:** A continuous feature showing the total travel time between cities, measured in hours.
10. **Days Left:** A derived feature calculated by subtracting the booking date from the trip date, i.e. the number of days between booking and departure.
11. **Price:** The target variable: the ticket price.
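
The two derived features in the list (binned departure time and days left) could be built roughly as follows. This is a minimal sketch on invented rows: the column names `departure_ts` and `booking_date`, the bin edges, and the six label names are all assumptions for illustration, not the dataset's actual definitions.

```python
import pandas as pd

# Toy bookings; every value and column name here is an illustrative assumption.
bookings = pd.DataFrame({
    "departure_ts": pd.to_datetime(
        ["2022-02-11 05:30", "2022-02-11 13:15", "2022-02-12 22:45"]
    ),
    "booking_date": pd.to_datetime(["2022-02-10", "2022-02-01", "2022-01-15"]),
})

# Six time-of-day bins of four hours each; edges and labels are assumed.
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]
bookings["departure_time"] = pd.cut(
    bookings["departure_ts"].dt.hour,
    bins=edges,
    labels=labels,
    right=False,  # intervals are [start, end), so hour 4 falls in the second bin
)

# Days left = trip date minus booking date, in whole days.
bookings["days_left"] = (
    bookings["departure_ts"].dt.normalize() - bookings["booking_date"]
).dt.days

print(bookings[["departure_time", "days_left"]])
```

`pd.cut` returns an ordered categorical, which keeps the natural order of the time labels available for plotting or sorting.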

## 2. Feature Engineering
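
This section converts raw string columns into numeric features. The sketch below shows the same kind of transformations on a few invented rows, so the logic can be checked in isolation. Two details differ from the notebook cell and are swapped-in alternatives: the date is parsed with `pd.to_datetime` instead of string splitting, and missing stop counts are filled with `fillna(1)` instead of a NaN dictionary key. The `raw` frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the raw columns used in this section; values are invented.
raw = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019", "09/06/2019"],
    "Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"],
    "Total_Stops": ["non-stop", "2 stops", np.nan],
    "Duration": ["2h 50m", "19h", "45m"],
})

# Parse the date once, then read day, month, and year from the .dt accessor.
journey = pd.to_datetime(raw["Date_of_Journey"], format="%d/%m/%Y")
raw["Date"] = journey.dt.day
raw["Month"] = journey.dt.month
raw["Year"] = journey.dt.year

# Keep the leading 'HH:MM' token, then split it into hour and minute.
arrival = (
    raw["Arrival_Time"].str.split(" ").str[0].str.split(":", expand=True).astype(int)
)
raw["Arrival_Hour"] = arrival[0]
raw["Arrival_Min"] = arrival[1]

# Map stop labels to counts; missing labels become NaN and then fall back to 1.
stops = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
raw["Total_Stops"] = raw["Total_Stops"].map(stops).fillna(1).astype(int)

# Pull hours and minutes out of strings such as '2h 50m', '19h', or '45m'.
raw["Duration_Hours"] = raw["Duration"].str.extract(r"(\d+)h")[0].fillna(0).astype(int)
raw["Duration_Minutes"] = raw["Duration"].str.extract(r"(\d+)m")[0].fillna(0).astype(int)

print(raw.drop(columns=["Date_of_Journey", "Arrival_Time", "Duration"]))
```

Parsing with an explicit `format` fails on malformed dates rather than misreading day and month, which can be useful when the input format is known in advance.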

In [114]:
import pandas as pd
import numpy as np

df = pd.read_excel('data.xlsx')

# ===== Date of Journey formatting =====
# Split 'Date_of_Journey' column (e.g. '24/03/2019') into day, month, and year
df['Date'] = df['Date_of_Journey'].str.split('/').str[0].astype(int)   # Extract day
df['Month'] = df['Date_of_Journey'].str.split('/').str[1].astype(int)  # Extract month
df['Year'] = df['Date_of_Journey'].str.split('/').str[2].astype(int)   # Extract year
df.drop('Date_of_Journey', axis=1, inplace=True)                       # Remove original column

# ===== Arrival time formatting =====
# Keep only the time part (remove date if present)
df['Arrival_Time'] = df['Arrival_Time'].str.split(' ').str[0]
# Split time (e.g. '22:20') into hour and minute
df['Arrival_Hour'] = df['Arrival_Time'].str.split(':').str[0].astype(int)
df['Arrival_Min'] = df['Arrival_Time'].str.split(':').str[1].astype(int)
df.drop('Arrival_Time', axis=1, inplace=True)                          # Remove original column

# ===== Departure time formatting =====
# Keep only the time part (remove date if present)
df['Dep_Time'] = df['Dep_Time'].str.split(' ').str[0]
# Split time (e.g. '05:30') into hour and minute
df['Dep_Hour'] = df['Dep_Time'].str.split(':').str[0].astype(int)
df['Dep_Min'] = df['Dep_Time'].str.split(':').str[1].astype(int)
df.drop('Dep_Time', axis=1, inplace=True)                              # Remove original column

# ===== Total stops formatting =====
# Exploratory checks (bare expressions in the middle of a cell display nothing)
df['Total_Stops'].unique()                                             # View unique stop values
# df[df['Total_Stops'].isnull()]                                         # Check for missing values
df['Total_Stops'].mode()                                               # Get most common value (mode)
# Replace text labels with numeric stop counts;
# missing values (NaN) are mapped to 1, meaning '1 stop'
df['Total_Stops'] = df['Total_Stops'].map({
    'non-stop': 0,
    '1 stop': 1,
    '2 stops': 2,
    '3 stops': 3,
    '4 stops': 4,
    np.nan: 1
})
df.drop('Route', axis=1, inplace=True)                                 # Drop 'Route' column (not useful)

# ===== Duration formatting =====
# Extract hours (digits before 'h') and fill missing values with 0
df['Duration_Hours'] = df['Duration'].str.extract(r'(\d+)h')[0].fillna(0).astype(int)
# Extract minutes (digits before 'm') and fill missing values with 0
df['Duration_Minutes'] = df['Duration'].str.extract(r'(\d+)m')[0].fillna(0).astype(int)
df.drop('Duration', axis=1, inplace=True)                              # Drop original 'Duration' column (replaced by the two columns above)

# ===== Encoding categorical columns =====
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()                                              # Initialize the encoder
# Fit and transform the categorical columns
encoded = encoder.fit_transform(df[['Airline','Source','Destination']]).toarray()
# Create a new DataFrame for the encoded columns
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
# Drop the original (non-numeric) categorical columns
df.drop(['Airline', 'Source', 'Destination'], axis=1, inplace=True)
df.drop('Additional_Info', axis=1, inplace=True)                       # Drop 'Additional_Info' column (not useful for training models)
# Concatenate the cleaned numerical dataset with the encoded categorical dataset
final_df = pd.concat([df.reset_index(drop=True), encoded_df.reset_index(drop=True)], axis=1)
# Inspect the final shape and preview
print("Final dataset shape:", final_df.shape)
final_df.info()

Final dataset shape: (10683, 34)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 34 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Total_Stops                                10683 non-null  int64  
 1   Price                                      10683 non-null  int64  
 2   Date                                       10683 non-null  int32  
 3   Month                                      10683 non-null  int32  
 4   Year                                       10683 non-null  int32  
 5   Arrival_Hour                               10683 non-null  int32  
 6   Arrival_Min                                10683 non-null  int32  
 7   Dep_Hour                                   10683 non-null  int32  
 8   Dep_Min                                    10683 non-null  int32  
 9   Duration_Hours                             10683 non-null  in