In [1]:
# Set up environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()

In [2]:
fare_data = pd.read_csv("data/a1_FlightFare_Dataset.csv")
fare_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR - DEL,22:20,3/22/2023 1:10,2h 50m,non-stop,No info,3897
1,Air India,1/5/2019,Kolkata,Banglore,CCU - IXR - BBI - BLR,5:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/6/2019,Delhi,Cochin,DEL - LKO - BOM - COK,9:25,6/10/2023 4:25,19h,2 stops,No info,13882
3,IndiGo,12/5/2019,Kolkata,Banglore,CCU - NAG - BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,1/3/2019,Banglore,New Delhi,BLR - NAG - DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [3]:
pd.set_option('display.max_columns', None)

In [4]:
fare_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR - DEL,22:20,3/22/2023 1:10,2h 50m,non-stop,No info,3897
1,Air India,1/5/2019,Kolkata,Banglore,CCU - IXR - BBI - BLR,5:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/6/2019,Delhi,Cochin,DEL - LKO - BOM - COK,9:25,6/10/2023 4:25,19h,2 stops,No info,13882
3,IndiGo,12/5/2019,Kolkata,Banglore,CCU - NAG - BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,1/3/2019,Banglore,New Delhi,BLR - NAG - DEL,16:50,21:35,4h 45m,1 stop,No info,13302


`pd.set_option('display.max_columns', None)` controls how many columns Pandas shows when a DataFrame is displayed in a Jupyter Notebook or console. Setting the option to `None` removes the limit, so every column is shown.

By default, Pandas truncates wide DataFrames: in a notebook it shows at most 20 columns and replaces the middle ones with `...`, while in a terminal it fits the output to the available width. This keeps the output readable for DataFrames with a large number of columns.

With the limit removed, Pandas displays every column regardless of how many there are, which is useful when you want to inspect the whole DataFrame.

Here's an example of how it might be used:

    import pandas as pd

    # Create a sample DataFrame with many columns
    data = {f"col_{i}": range(100) for i in range(30)}
    df = pd.DataFrame(data)

    # Set the option to display all columns
    pd.set_option('display.max_columns', None)

    # Print the DataFrame
    print(df)

    # Reset the option to the default
    pd.reset_option('display.max_columns')

In the code above, `pd.set_option('display.max_columns', None)` lets Pandas display all columns of `df`. After viewing the DataFrame, `pd.reset_option('display.max_columns')` restores the default behavior, which truncates wide output again.
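
If the setting is only needed for a single display, `pd.option_context` is an alternative to the set/reset pair: the option is changed inside the `with` block and restored automatically afterwards. The sketch below assumes a made-up 30-column DataFrame named `wide_df`; neither the name nor the data comes from the flight fare dataset.

```python
import pandas as pd

# Hypothetical wide DataFrame: 30 columns named col_0 ... col_29
wide_df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Remember whatever limit is currently active
limit_before = pd.get_option("display.max_columns")

# The limit is lifted only inside this block
with pd.option_context("display.max_columns", None):
    limit_inside = pd.get_option("display.max_columns")
    print(wide_df)

# On leaving the block the previous limit is back in place
limit_after = pd.get_option("display.max_columns")
```

This avoids forgetting the reset step, at the cost of wrapping each display call in a `with` block.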

In [5]:
fare_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10001 non-null  object
 1   Date_of_Journey  10001 non-null  object
 2   Source           10001 non-null  object
 3   Destination      10001 non-null  object
 4   Route            10000 non-null  object
 5   Dep_Time         10001 non-null  object
 6   Arrival_Time     10001 non-null  object
 7   Duration         10001 non-null  object
 8   Total_Stops      10000 non-null  object
 9   Additional_Info  10001 non-null  object
 10  Price            10001 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 859.6+ KB


In [6]:
# Check for missing values
fare_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [7]:
# Dropping the null values
fare_data.dropna(inplace=True)

In [8]:
fare_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64

In [9]:
fare_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 10000
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10000 non-null  object
 1   Date_of_Journey  10000 non-null  object
 2   Source           10000 non-null  object
 3   Destination      10000 non-null  object
 4   Route            10000 non-null  object
 5   Dep_Time         10000 non-null  object
 6   Arrival_Time     10000 non-null  object
 7   Duration         10000 non-null  object
 8   Total_Stops      10000 non-null  object
 9   Additional_Info  10000 non-null  object
 10  Price            10000 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 937.5+ KB


### Feature Engineering

In [10]:
fare_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR - DEL,22:20,3/22/2023 1:10,2h 50m,non-stop,No info,3897
1,Air India,1/5/2019,Kolkata,Banglore,CCU - IXR - BBI - BLR,5:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/6/2019,Delhi,Cochin,DEL - LKO - BOM - COK,9:25,6/10/2023 4:25,19h,2 stops,No info,13882
3,IndiGo,12/5/2019,Kolkata,Banglore,CCU - NAG - BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,1/3/2019,Banglore,New Delhi,BLR - NAG - DEL,16:50,21:35,4h 45m,1 stop,No info,13302


##### Handling object data

###### Handling numerical data

In [12]:
# Extract the journey day and month from Date_of_Journey and
# add them as separate columns
fare_data["Journey_day"] = pd.to_datetime(fare_data.Date_of_Journey, format="%d/%m/%Y").dt.day
fare_data["Journey_Month"] = pd.to_datetime(fare_data["Date_of_Journey"], format="%d/%m/%Y").dt.month

fare_data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_day,Journey_Month
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR - DEL,22:20,3/22/2023 1:10,2h 50m,non-stop,No info,3897,24,3
1,Air India,1/5/2019,Kolkata,Banglore,CCU - IXR - BBI - BLR,5:50,13:15,7h 25m,2 stops,No info,7662,1,5
2,Jet Airways,9/6/2019,Delhi,Cochin,DEL - LKO - BOM - COK,9:25,6/10/2023 4:25,19h,2 stops,No info,13882,9,6
3,IndiGo,12/5/2019,Kolkata,Banglore,CCU - NAG - BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5
4,IndiGo,1/3/2019,Banglore,New Delhi,BLR - NAG - DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3


The two lines in the cell above select the "Date_of_Journey" column in two different ways, and both selections are equivalent:

1. `fare_data.Date_of_Journey` (attribute access)

2. `fare_data["Date_of_Journey"]` (bracket access)

In both lines, `pd.to_datetime()` converts the "Date_of_Journey" strings to datetimes using the day/month/year format `"%d/%m/%Y"`. The `.dt.day` accessor then extracts the day into the "Journey_day" column, and `.dt.month` extracts the month into the "Journey_Month" column.

Either access style works here. Bracket access is the safer habit, because attribute access fails for column names that contain spaces or clash with DataFrame method names.
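
As a minimal sketch of the same extraction, the dates can be parsed once and the resulting datetime Series reused for both components. The three-row `sample` DataFrame is made up for illustration and only mimics the day/month/year strings seen in "Date_of_Journey".

```python
import pandas as pd

# Hypothetical sample using the same day/month/year strings as Date_of_Journey
sample = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/5/2019", "9/6/2019"]})

# Parse the strings once, then reuse the datetime Series
journey_date = pd.to_datetime(sample["Date_of_Journey"], format="%d/%m/%Y")
sample["Journey_day"] = journey_date.dt.day
sample["Journey_Month"] = journey_date.dt.month

# Attribute access and bracket access return the same column
same_column = sample.Date_of_Journey.equals(sample["Date_of_Journey"])
```

Parsing once avoids converting the same column twice, which matters more as the dataset grows.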


In [13]:
# Drop the "Date of Journey" column
fare_data.drop(["Date_of_Journey"], axis=1, inplace=True)

In [14]:
fare_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_day,Journey_Month
0,IndiGo,Banglore,New Delhi,BLR - DEL,22:20,3/22/2023 1:10,2h 50m,non-stop,No info,3897,24,3
1,Air India,Kolkata,Banglore,CCU - IXR - BBI - BLR,5:50,13:15,7h 25m,2 stops,No info,7662,1,5
2,Jet Airways,Delhi,Cochin,DEL - LKO - BOM - COK,9:25,6/10/2023 4:25,19h,2 stops,No info,13882,9,6
3,IndiGo,Kolkata,Banglore,CCU - NAG - BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5
4,IndiGo,Banglore,New Delhi,BLR - NAG - DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3


In [16]:
# Extract hours and minutes from the Dep_Time column
# and add them as separate columns
fare_data["Dep_Hour"] = pd.to_datetime(fare_data["Dep_Time"]).dt.hour # Extracting hours
fare_data["Dep_Mins"] = pd.to_datetime(fare_data["Dep_Time"]).dt.minute # Extracting mins

# Drop the Dep_Time column
fare_data.drop("Dep_Time", axis=1, inplace=True)

fare_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Journey_day,Journey_Month,Dep_Hour,Dep_Mins
0,IndiGo,Banglore,New Delhi,BLR - DEL,3/22/2023 1:10,2h 50m,non-stop,No info,3897,24,3,22,20
1,Air India,Kolkata,Banglore,CCU - IXR - BBI - BLR,13:15,7h 25m,2 stops,No info,7662,1,5,5,50
2,Jet Airways,Delhi,Cochin,DEL - LKO - BOM - COK,6/10/2023 4:25,19h,2 stops,No info,13882,9,6,9,25
3,IndiGo,Kolkata,Banglore,CCU - NAG - BLR,23:30,5h 25m,1 stop,No info,6218,12,5,18,5
4,IndiGo,Banglore,New Delhi,BLR - NAG - DEL,21:35,4h 45m,1 stop,No info,13302,1,3,16,50


In [17]:
# Extract hours and minutes from the Arrival_Time column
# and add them as separate columns
fare_data["Arrival_Hour"] = pd.to_datetime(fare_data["Arrival_Time"]).dt.hour # Extracting hours
fare_data["Arrival_Mins"] = pd.to_datetime(fare_data["Arrival_Time"]).dt.minute # Extracting mins

# Drop the Arrival_Time column
fare_data.drop("Arrival_Time", axis=1, inplace=True)

fare_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Journey_day,Journey_Month,Dep_Hour,Dep_Mins,Arrival_Hour,Arrival_Mins
0,IndiGo,Banglore,New Delhi,BLR - DEL,2h 50m,non-stop,No info,3897,24,3,22,20,1,10
1,Air India,Kolkata,Banglore,CCU - IXR - BBI - BLR,7h 25m,2 stops,No info,7662,1,5,5,50,13,15
2,Jet Airways,Delhi,Cochin,DEL - LKO - BOM - COK,19h,2 stops,No info,13882,9,6,9,25,4,25
3,IndiGo,Kolkata,Banglore,CCU - NAG - BLR,5h 25m,1 stop,No info,6218,12,5,18,5,23,30
4,IndiGo,Banglore,New Delhi,BLR - NAG - DEL,4h 45m,1 stop,No info,13302,1,3,16,50,21,35


In [32]:
# Convert the Duration column to a list
duration = list(fare_data["Duration"])

# Loop through all duration values and fill in a missing hour or minute part
for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m" # Adding missing minutes part
        else:
            duration[i] = "0h " + duration[i].strip() # Adding missing hour part
            
# Prepare separate lists for duration hours and minutes
duration_hours = []
duration_mins = []

for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep="h")[0])) # Extract hours
    duration_mins.append(int(duration[i].split(sep="m")[0].split()[-1])) # Extract minutes
    
# Add duration hours and minutes columns to the dataset
fare_data["Duration_Hours"] = duration_hours
fare_data["Duration_Mins"] = duration_mins

# Drop duration column from the dataset
fare_data.drop(["Duration"], axis=1, inplace=True)

fare_data.head()

Unnamed: 0,Airline,Source,Destination,Route,Total_Stops,Additional_Info,Price,Journey_day,Journey_Month,Dep_Hour,Dep_Mins,Arrival_Hour,Arrival_Mins,Duration_Hours,Duration_Mins
0,IndiGo,Banglore,New Delhi,BLR - DEL,non-stop,No info,3897,24,3,22,20,1,10,2,50
1,Air India,Kolkata,Banglore,CCU - IXR - BBI - BLR,2 stops,No info,7662,1,5,5,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,DEL - LKO - BOM - COK,2 stops,No info,13882,9,6,9,25,4,25,19,0
3,IndiGo,Kolkata,Banglore,CCU - NAG - BLR,1 stop,No info,6218,12,5,18,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,BLR - NAG - DEL,1 stop,No info,13302,1,3,16,50,21,35,4,45


`.split()`: The `split()` method splits a string into a list of substrings around a delimiter. Called without a delimiter, it splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace, so `"2h 50m".split()` returns `['2h', '50m']`.

`duration[i].strip()`: The `strip()` method removes leading and trailing whitespace (spaces, tabs, newlines) from a string. Here it cleans up `duration[i]` before the missing `0h` or `0m` part is added.
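
The two methods can be combined in a small helper that handles the three shapes of duration string seen in the data (`"2h 50m"`, `"19h"`, and minutes only). This is a sketch of an alternative to the loop above, not the notebook's own code; `split_duration` is a hypothetical name.

```python
def split_duration(text):
    """Return (hours, minutes) for strings such as '2h 50m', '19h' or '45m'."""
    hours, minutes = 0, 0
    # strip() removes surrounding whitespace, split() breaks on the inner space
    for part in text.strip().split():
        if part.endswith("h"):
            hours = int(part[:-1])      # drop the trailing "h"
        elif part.endswith("m"):
            minutes = int(part[:-1])    # drop the trailing "m"
    return hours, minutes


examples = [split_duration(value) for value in ["2h 50m", "19h", " 45m "]]
```

Because a missing part simply keeps its default of 0, no padding step with `" 0m"` or `"0h "` is needed.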

In [33]:
# Test code to understand strip
string_1 = string_2 = duration[1]
print(string_1, "/", string_2)

string_1_mod = string_1 + " "
print(string_1_mod, "/", string_2)

print(string_1_mod.strip(), "/", string_2)

7h 25m / 7h 25m
7h 25m  / 7h 25m
7h 25m / 7h 25m


###### Handling Categorical Data

1. Nominal data - categories have no inherent order --> OneHotEncoder
2. Ordinal data - categories have a natural order --> LabelEncoder
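
The distinction can be sketched with pandas alone. The example below uses `pd.get_dummies` for a nominal column and a hand-written mapping for an ordinal column, rather than scikit-learn's encoder classes, to stay dependency-free. The `sample` DataFrame and the `stops_order` mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with one nominal and one ordinal column
sample = pd.DataFrame({
    "Source": ["Banglore", "Kolkata", "Delhi"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})

# Nominal: one 0/1 column per category, no order implied
source_dummies = pd.get_dummies(sample["Source"], prefix="Source", dtype=int)

# Ordinal: map each label to its rank (mapping written by hand for this sketch)
stops_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2}
sample["Total_Stops_Code"] = sample["Total_Stops"].map(stops_order)
```

An explicit mapping keeps the numeric codes in the intended order, which an alphabetical encoding of the labels would not guarantee.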

In [35]:
fare_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 10000
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10000 non-null  object
 1   Source           10000 non-null  object
 2   Destination      10000 non-null  object
 3   Route            10000 non-null  object
 4   Total_Stops      10000 non-null  object
 5   Additional_Info  10000 non-null  object
 6   Price            10000 non-null  int64 
 7   Journey_day      10000 non-null  int64 
 8   Journey_Month    10000 non-null  int64 
 9   Dep_Hour         10000 non-null  int64 
 10  Dep_Mins         10000 non-null  int64 
 11  Arrival_Hour     10000 non-null  int64 
 12  Arrival_Mins     10000 non-null  int64 
 13  Duration_Hours   10000 non-null  int64 
 14  Duration_Mins    10000 non-null  int64 
dtypes: int64(9), object(6)
memory usage: 1.2+ MB


In [36]:
# Feature engineering on -> Airline column
fare_data["Airline"].value_counts()

Jet Airways                          3598
IndiGo                               1927
Air India                            1633
Multiple carriers                    1129
SpiceJet                              769
Vistara                               447
Air Asia                              296
GoAir                                 179
Multiple carriers Premium economy      13
Jet Airways Business                    5
Vistara Premium economy                 3
Trujet                                  1
Name: Airline, dtype: int64

In [49]:
airline = fare_data[["Airline"]]
current_airline_list = airline["Airline"]
new_airline_list = []

for carrier in current_airline_list:
    if carrier in ["Jet Airways", "IndiGo", "Air India" , "Multiple carriers", "SpiceJet" , "Vistara" , "Air Asia", "GoAir"]:
        new_airline_list.append(carrier)
    else:
        new_airline_list.append("Other")
        
airline["Airline"] = pd.DataFrame(new_airline_list)
airline["Airline"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  airline["Airline"] = pd.DataFrame(new_airline_list)


Jet Airways          3598
IndiGo               1927
Air India            1632
Multiple carriers    1129
SpiceJet              769
Vistara               447
Air Asia              296
GoAir                 179
Other                  22
Name: Airline, dtype: int64

1. `airline["Airline"] = pd.DataFrame(new_airline_list)`:
   - This line first builds a new one-column DataFrame from the list. That DataFrame gets its own default index (0, 1, 2, ...), and Pandas aligns it with the index of `airline` when the column is assigned.

2. `airline["Airline"] = new_airline_list`:
   - This line assigns the list directly to the "Airline" column. No intermediate DataFrame is created, and the values are placed by position, row by row.

The key difference is index alignment: a DataFrame is matched to the target by index label, while a list is matched by position. The two approaches give the same result only when the index of `airline` is exactly 0, 1, 2, ... with no gaps.


---

`airline["Airline"] = pd.DataFrame(new_airline_list)
 airline["Airline"].value_counts()`

`airline["Airline"] = new_airline_list
 airline["Airline"].value_counts()`

The main difference between the two code snippets is how the "Airline" column of the `airline` DataFrame is updated:

- In the first code snippet, `pd.DataFrame(new_airline_list)` creates a new one-column DataFrame from `new_airline_list` (the column is named `0` and the index runs from 0 to 9999), and this DataFrame is then assigned to the "Airline" column. Pandas aligns the assigned values on index labels.

- In the second code snippet, `new_airline_list` itself is assigned directly to the "Airline" column. No intermediate DataFrame is created, and the values replace the existing column by position.

In this notebook the two snippets are not equivalent. `dropna()` removed one row, so the index of `airline` runs from 0 to 10000 with one label missing, while the new DataFrame is indexed from 0 to 9999. After alignment, every row below the dropped one receives the value intended for the row after it, and the last row (label 10000) receives NaN. The output above shows the effect: the counts sum to 9999 rather than 10000, and "Air India" appears 1632 times instead of 1633. The second snippet avoids this problem, is more concise, and skips the intermediate DataFrame, so it is the preferred form. The `SettingWithCopyWarning` is a separate issue: it appears because `airline` was created as a slice of `fare_data`, and creating it with `.copy()` avoids it.
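
The effect of index alignment is easy to reproduce on a tiny DataFrame whose index has a gap, as it does after `dropna()`. This is a minimal sketch; `frame`, `by_label`, and `by_position` are hypothetical names and the three rows are made up.

```python
import pandas as pd

# Hypothetical frame whose index has a gap (label 2 is missing), as after dropna()
frame = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Trujet"]}, index=[0, 1, 3])
new_values = ["IndiGo", "Air India", "Other"]

# Assigning a DataFrame: values are matched on index labels 0, 1, 2,
# so label 3 in `frame` finds no partner
by_label = frame.copy()
by_label["Airline"] = pd.DataFrame(new_values)

# Assigning the list: values are placed row by row, ignoring labels
by_position = frame.copy()
by_position["Airline"] = new_values

missing_after_alignment = int(by_label["Airline"].isna().sum())
print(by_label)
print(by_position)
```

Calling `reset_index(drop=True)` after dropping rows is another way to keep the two forms in agreement, since the index then has no gaps.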


In [51]:
airline = pd.get_dummies(airline, drop_first=True)
airline.head()

Unnamed: 0,Airline_Air India,Airline_GoAir,Airline_IndiGo,Airline_Jet Airways,Airline_Multiple carriers,Airline_Other,Airline_SpiceJet,Airline_Vistara
0,0,0,1,0,0,0,0,0
1,1,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,0,0


The cell above one-hot encodes the "Airline" column with `pd.get_dummies()`, creating a separate binary (0/1) column for each category. `drop_first=True` drops the column of the first category to avoid multicollinearity.

Finally, it displays the first few rows of the encoded `airline` DataFrame.
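
A small, made-up example shows which column `drop_first=True` removes. The four-row `carriers` DataFrame is an assumption for illustration; `dtype=int` is passed so the result contains 0/1 values regardless of the pandas version's default dummy dtype.

```python
import pandas as pd

# Hypothetical sample with four distinct carriers
carriers = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Jet Airways", "Air Asia"]})

# Categories are sorted, so "Air Asia" comes first and its column is dropped
encoded = pd.get_dummies(carriers, drop_first=True, dtype=int)

# The "Air Asia" row is represented by zeros in every remaining column
air_asia_row = encoded.iloc[3].tolist()
print(encoded)
```

The dropped category is not lost: it is the case in which all remaining dummy columns are 0.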

---

In the context of one-hot encoding and machine learning, multicollinearity refers to a situation where two or more independent variables (features or columns in your dataset) are highly correlated or linearly dependent on each other. Multicollinearity can cause issues when building predictive models, particularly in linear regression and some other modeling techniques. Here's an explanation of why multicollinearity can be problematic:

- **Redundant Information**: When two or more variables are highly correlated, they essentially provide the same information to the model. This redundancy doesn't add any new insights and can lead to inefficiency in the model.

- **Instability in Coefficient Estimates**: In linear regression and related models, multicollinearity can cause instability in coefficient estimates. This means small changes in the data can result in significantly different coefficient values, making it challenging to interpret the impact of individual variables on the target.

- **Loss of Statistical Significance**: Multicollinearity can lead to a situation where individual predictor variables may appear to be statistically insignificant when they are actually important in the model. This can lead to incorrect conclusions about variable importance.

- **Difficulty in Interpretation**: With multicollinearity, it becomes challenging to interpret the effect of a single variable because its impact is confounded with the impact of other correlated variables.

Now, in the context of one-hot encoding, multicollinearity can occur when creating dummy variables (0s and 1s) for categorical variables. To avoid multicollinearity, one category (usually the reference category) is dropped, and the others are represented by the dummy variables. This ensures that each category is compared to the reference category, and there is no perfect linear relationship among the dummy variables.

For example, if you have a categorical variable "Color" with three categories (Red, Blue, Green), you might create two dummy variables: "IsBlue" and "IsGreen." If both "IsBlue" and "IsGreen" are 0, it means the reference category "Red" is selected.

Dropping one category prevents this. If all three dummy variables (Red, Blue, and Green) were included, they would always sum to 1, so any one of them could be computed exactly from the other two.

In the cell above, `pd.get_dummies()` is called with `drop_first=True`, which drops the first category in sorted order ("Air Asia", which is why there is no `Airline_Air Asia` column) and makes it the reference category. This avoids multicollinearity when the dummy variables are used in machine learning models.
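
The linear dependence can be checked numerically. The sketch below builds a design matrix (an intercept column plus the dummy columns) for a made-up "Color" variable and compares its rank with and without `drop_first=True`. The helper name `design_matrix_rank` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with three categories
colors = pd.Series(["Red", "Blue", "Green", "Blue", "Red"], name="Color")

all_dummies = pd.get_dummies(colors, dtype=float)                       # Blue, Green, Red
reduced_dummies = pd.get_dummies(colors, drop_first=True, dtype=float)  # Green, Red


def design_matrix_rank(dummies):
    """Rank of [intercept | dummies]; a rank below the column count means exact dependence."""
    intercept = np.ones((len(dummies), 1))
    return int(np.linalg.matrix_rank(np.hstack([intercept, dummies.to_numpy()])))


# With every dummy kept, the matrix has 4 columns but they are linearly dependent
full_rank = design_matrix_rank(all_dummies)
# With one dummy dropped, the 3 remaining columns are independent
reduced_rank = design_matrix_rank(reduced_dummies)
print(full_rank, reduced_rank)
```

When the rank is lower than the number of columns, a linear model has no unique set of coefficients, which is the instability described above.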


In [48]:
airline.columns

Index(['Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
       'Airline_Jet Airways', 'Airline_Multiple carriers', 'Airline_Other',
       'Airline_SpiceJet', 'Airline_Vistara'],
      dtype='object')