In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df=pd.read_csv('flight_price(Sheet1).csv')
print(df.head())

       Airline Date_of_Journey    Source Destination                  Route  \
0       IndiGo      24/03/2019  Banglore   New Delhi              BLR ? DEL   
1    Air India       1/05/2019   Kolkata    Banglore  CCU ? IXR ? BBI ? BLR   
2  Jet Airways       9/06/2019     Delhi      Cochin  DEL ? LKO ? BOM ? COK   
3       IndiGo      12/05/2019   Kolkata    Banglore        CCU ? NAG ? BLR   
4       IndiGo      01/03/2019  Banglore   New Delhi        BLR ? NAG ? DEL   

  Dep_Time  Arrival_Time Duration Total_Stops Additional_Info  Price  
0    22:20  01:10 22 Mar   2h 50m    non-stop         No info   3897  
1    05:50         13:15   7h 25m     2 stops         No info   7662  
2    09:25  04:25 10 Jun      19h     2 stops         No info  13882  
3    18:05         23:30   5h 25m      1 stop         No info   6218  
4    16:50         21:35   4h 45m      1 stop         No info  13302  


# **Dataset Overview**
 
Total Rows: 10,683
 
Total Columns: 11
 
Data Types:
 
10 categorical (object) columns
 
1 numerical (integer) column (Price)
 
### FEATURES

The various features of the cleaned dataset are explained below:

1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

4) Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.

5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.

11) Price: The target variable; it stores the ticket price.
 
## **Column Breakdown:**
 
Airline – Name of the airline (e.g., IndiGo, Air India, Jet Airways)
 
Date_of_Journey – The flight's departure date
 
Source – Flight departure location
 
Destination – Flight arrival location
 
Route – Flight path (e.g., BLR → DEL)
 
Dep_Time – Flight departure time
 
Arrival_Time – Flight arrival time
 
Duration – Total travel duration (e.g., "2h 50m")
 
Total_Stops – Number of stops (e.g., "non-stop", "1 stop", "2 stops")
 
Additional_Info – Extra details (e.g., "No info")
 
Price – Flight ticket price (numeric)
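
The row, column, and dtype counts listed in the overview can be checked directly from a DataFrame. The following is a minimal sketch on a small hand-made frame standing in for the real file; the helper name `summarize` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Count rows, columns, and numeric vs non-numeric columns."""
    numeric = frame.select_dtypes(include="number").shape[1]
    non_numeric = frame.select_dtypes(exclude="number").shape[1]
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "non_numeric_columns": non_numeric,
        "numeric_columns": numeric,
    }


# Toy stand-in for the flight data (hypothetical values).
toy = pd.DataFrame(
    {
        "Airline": ["IndiGo", "Air India"],
        "Total_Stops": ["non-stop", "2 stops"],
        "Price": [3897, 7662],
    }
)
summary = summarize(toy)
print(summary)
```

Calling `summarize(df)` on the loaded flight data should reproduce the figures in the overview, assuming the file was read as shown in the first cell.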
 

In [18]:
# print(df.isna().sum())
print(df['Route'].dtype)
print(df['Total_Stops'].dtype)
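# Fill missing values with the most frequent value (mode) of each column.
# Assign the result back instead of using inplace=True on a column selection.
df['Route'] = df['Route'].fillna(df['Route'].mode()[0])
df['Total_Stops'] = df['Total_Stops'].fillna(df['Total_Stops'].mode()[0])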
print(df.isna().sum())

object
object
Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64
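
Mode imputation, as used in the cell above, replaces each missing entry with the most frequent value of its column. A minimal sketch on a toy frame (the values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame with one missing value per column (hypothetical data).
toy = pd.DataFrame(
    {
        "Route": ["BLR -> DEL", None, "BLR -> DEL", "CCU -> BLR"],
        "Total_Stops": ["non-stop", "1 stop", None, "1 stop"],
    }
)

for col in ["Route", "Total_Stops"]:
    # mode() ignores missing values and returns the most frequent entry first.
    most_common = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_common)

remaining = int(toy.isna().sum().sum())
print(toy)
print("missing values left:", remaining)
```

This is a reasonable choice when only a handful of rows are affected; with many missing values it would bias the column toward its most common category.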


In [21]:
# Split the dd/mm/yyyy string into its parts; the new columns are still strings.
df['Day'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]
print(df.head())

       Airline Date_of_Journey    Source Destination                  Route  \
0       IndiGo      24/03/2019  Banglore   New Delhi              BLR ? DEL   
1    Air India       1/05/2019   Kolkata    Banglore  CCU ? IXR ? BBI ? BLR   
2  Jet Airways       9/06/2019     Delhi      Cochin  DEL ? LKO ? BOM ? COK   
3       IndiGo      12/05/2019   Kolkata    Banglore        CCU ? NAG ? BLR   
4       IndiGo      01/03/2019  Banglore   New Delhi        BLR ? NAG ? DEL   

  Dep_Time  Arrival_Time Duration Total_Stops Additional_Info  Price Day  \
0    22:20  01:10 22 Mar   2h 50m    non-stop         No info   3897  24   
1    05:50         13:15   7h 25m     2 stops         No info   7662   1   
2    09:25  04:25 10 Jun      19h     2 stops         No info  13882   9   
3    18:05         23:30   5h 25m      1 stop         No info   6218  12   
4    16:50         21:35   4h 45m      1 stop         No info  13302  01   

  Month  Year  
0    03  2019  
1    05  2019  
2    06  2019  
3   
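
Splitting on `/` leaves `Day`, `Month`, and `Year` as text, so `1` and `01` are different values. One alternative, sketched here on a toy frame and assuming every date follows the day/month/year pattern seen in the sample rows, is to parse the column with `pd.to_datetime` and read the parts as integers:

```python
import pandas as pd

# Toy dates in the same dd/mm/yyyy style as the sample rows.
toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "01/03/2019"]})

# Parse once with an explicit format, then pull out the numeric parts.
parsed = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")
toy["Day"] = parsed.dt.day
toy["Month"] = parsed.dt.month
toy["Year"] = parsed.dt.year

print(toy)
```

With this approach the three columns are numeric from the start, so no later `astype(int)` step is needed.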

In [27]:
df.drop('Date_of_Journey',axis=1,inplace=True)

In [30]:
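# Arrival_Time is "HH:MM", sometimes followed by a date (e.g. "01:10 22 Mar").
# Create the hour and minute columns by splitting on the colon.
df['Arrival_hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_minute'] = df['Arrival_Time'].str.split(':').str[1]
# Keep only the part of the hour before any space.
# Note: Arrival_minute still carries the date suffix (e.g. "10 22 Mar").
df['Arrival_hour'] = df['Arrival_hour'].apply(lambda x: x.split(' ')[0])
df.head()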

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Day,Month,Year,Arrival_hour,Arrival_minute
0,IndiGo,Banglore,New Delhi,BLR ? DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019,1,10 22 Mar
1,Air India,Kolkata,Banglore,CCU ? IXR ? BBI ? BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,Delhi,Cochin,DEL ? LKO ? BOM ? COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019,4,25 10 Jun
3,IndiGo,Kolkata,Banglore,CCU ? NAG ? BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,Banglore,New Delhi,BLR ? NAG ? DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019,21,35
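
In the output above, `Arrival_minute` still contains the date suffix for overnight flights (for example `10 22 Mar`), so it cannot be converted to an integer as is. A minimal sketch of one way to handle this, assuming the clock time is always the first space-separated token of `Arrival_Time`:

```python
import pandas as pd

# Toy arrival times in the same style as the sample rows.
toy = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Keep only the "HH:MM" token, dropping any trailing date.
clock = toy["Arrival_Time"].str.split(" ").str[0]

# Split the clock time into integer hour and minute columns.
toy["Arrival_hour"] = clock.str.split(":").str[0].astype(int)
toy["Arrival_minute"] = clock.str.split(":").str[1].astype(int)

print(toy)
```

Stripping the date before splitting on the colon keeps both new columns numeric, matching how `Departure_hour` and `Departure_minute` are built.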


In [32]:
df['Departure_hour'] = df['Dep_Time'].str.split(':').str[0].astype(int)  # Extract hour and convert to int
df['Departure_minute'] = df['Dep_Time'].str.split(':').str[1].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10683 non-null  object
 1   Source            10683 non-null  object
 2   Destination       10683 non-null  object
 3   Route             10683 non-null  object
 4   Dep_Time          10683 non-null  object
 5   Arrival_Time      10683 non-null  object
 6   Duration          10683 non-null  object
 7   Total_Stops       10683 non-null  object
 8   Additional_Info   10683 non-null  object
 9   Price             10683 non-null  int64 
 10  Day               10683 non-null  object
 11  Month             10683 non-null  object
 12  Year              10683 non-null  object
 13  Arrival_hour      10683 non-null  object
 14  Arrival_minute    10683 non-null  object
 15  Departure_hour    10683 non-null  int32 
 16  Departure_minute  10683 non-null  int32 
dtypes: int32(2),