# Analysis on Airline Dataset: Passenger's Characteristics And Prediction of Flight Status

Author: Luoning Zhang

Course Project, UC Irvine, Math 10, Summer 2023

## 1. Introduction

In this project, I analyze the "Airline Dataset," which contains 98,619 rows. Each row represents a specific flight taken by an individual passenger. I look for patterns in passenger characteristics and try to identify potential causes of flight delays and cancellations. The analysis aims to help airlines devise marketing strategies and improve customer satisfaction.

## 2. The Difference of Passengers' Characteristics across Times And Places

First, I import the dataset, which is available on Kaggle at https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset, and drop rows with null values.

Then I convert the "Departure Date" column to the pandas datetime dtype and add "Month", "Year", and "DayofWeek" columns to make the data easier to work with later. I also encode gender as an integer column, "IsMale", which is 1 for male passengers and 0 otherwise. The "Number" column, which is 1 in every row, is used later to build a pivot table.

In [1]:
import pandas as pd
import altair as alt
df = pd.read_csv("Airline Dataset.csv").dropna()
df['Date'] = pd.to_datetime(df['Departure Date'])
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df['DayofWeek'] = df['Date'].dt.day_name()
df['IsMale'] = (df['Gender'] == 'Male').astype(int)
df['Number'] = 1
df

Unnamed: 0,Passenger ID,First Name,Last Name,Gender,Age,Nationality,Airport Name,Airport Country Code,Country Name,Airport Continent,...,Departure Date,Arrival Airport,Pilot Name,Flight Status,Date,Month,Year,DayofWeek,IsMale,Number
0,10856,Edithe,Leggis,Female,62,Japan,Coldfoot Airport,US,United States,NAM,...,6/28/2022,CXF,Edithe Leggis,On Time,2022-06-28,6,2022,Tuesday,0,1
1,43872,Elwood,Catt,Male,62,Nicaragua,Kugluktuk Airport,CA,Canada,NAM,...,12/26/2022,YCO,Elwood Catt,On Time,2022-12-26,12,2022,Monday,1,1
2,42633,Darby,Felgate,Male,67,Russia,Grenoble-Isère Airport,FR,France,EU,...,1/18/2022,GNB,Darby Felgate,On Time,2022-01-18,1,2022,Tuesday,1,1
3,78493,Dominica,Pyle,Female,71,China,Ottawa / Gatineau Airport,CA,Canada,NAM,...,9/16/2022,YND,Dominica Pyle,Delayed,2022-09-16,9,2022,Friday,0,1
4,82072,Bay,Pencost,Male,21,China,Gillespie Field,US,United States,NAM,...,2/25/2022,SEE,Bay Pencost,On Time,2022-02-25,2,2022,Friday,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98614,58454,Gareth,Mugford,Male,85,China,Hasvik Airport,NO,Norway,EU,...,12/11/2022,HAA,Gareth Mugford,Cancelled,2022-12-11,12,2022,Sunday,1,1
98615,22028,Kasey,Benedict,Female,19,Russia,Ampampamena Airport,MG,Madagascar,AF,...,10/30/2022,IVA,Kasey Benedict,Cancelled,2022-10-30,10,2022,Sunday,0,1
98616,61732,Darrin,Lucken,Male,65,Indonesia,Albacete-Los Llanos Airport,ES,Spain,EU,...,9/10/2022,ABC,Darrin Lucken,On Time,2022-09-10,9,2022,Saturday,1,1
98617,19819,Gayle,Lievesley,Female,34,China,Gagnoa Airport,CI,Côte d'Ivoire,AF,...,10/26/2022,GGN,Gayle Lievesley,Cancelled,2022-10-26,10,2022,Wednesday,0,1
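
The preprocessing above can be tried on a small hand-made frame. The sketch below assumes every departure date follows the month/day/year pattern seen in the rows shown; the `toy` frame and the explicit `format` argument are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the preview above (dates assumed to be month/day/year).
toy = pd.DataFrame({
    "Departure Date": ["6/28/2022", "12/26/2022"],
    "Gender": ["Female", "Male"],
})

# Parse the strings into datetimes, then derive calendar features.
toy["Date"] = pd.to_datetime(toy["Departure Date"], format="%m/%d/%Y")
toy["Month"] = toy["Date"].dt.month
toy["DayofWeek"] = toy["Date"].dt.day_name()

# Encode gender as 1 (male) or 0 (otherwise).
toy["IsMale"] = (toy["Gender"] == "Male").astype(int)
```

Passing an explicit `format` makes parsing fail loudly if a row uses a different date layout, which may or may not be what is wanted for the full dataset.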


I highlight the "Flight Status" column with different colors using the pandas Styler. (You may need to scroll to the right to see the column.)
_(This idea is from pandas documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styler-Functions, and the idea of using `color_dic.get(fs, '')` is from ChatGPT.)_

In [2]:
color_dic = {
    'On Time': 'color:white;background-color:green', 
    'Delayed': 'color:white;background-color:orange', 
    'Cancelled': 'color:white;background-color:red'}
s = df.head(10).style.apply(lambda x: [color_dic.get(fs, '') for fs in x], subset=['Flight Status'])
s

Unnamed: 0,Passenger ID,First Name,Last Name,Gender,Age,Nationality,Airport Name,Airport Country Code,Country Name,Airport Continent,Continents,Departure Date,Arrival Airport,Pilot Name,Flight Status,Date,Month,Year,DayofWeek,IsMale,Number
0,10856,Edithe,Leggis,Female,62,Japan,Coldfoot Airport,US,United States,NAM,North America,6/28/2022,CXF,Edithe Leggis,On Time,2022-06-28 00:00:00,6,2022,Tuesday,0,1
1,43872,Elwood,Catt,Male,62,Nicaragua,Kugluktuk Airport,CA,Canada,NAM,North America,12/26/2022,YCO,Elwood Catt,On Time,2022-12-26 00:00:00,12,2022,Monday,1,1
2,42633,Darby,Felgate,Male,67,Russia,Grenoble-Isère Airport,FR,France,EU,Europe,1/18/2022,GNB,Darby Felgate,On Time,2022-01-18 00:00:00,1,2022,Tuesday,1,1
3,78493,Dominica,Pyle,Female,71,China,Ottawa / Gatineau Airport,CA,Canada,NAM,North America,9/16/2022,YND,Dominica Pyle,Delayed,2022-09-16 00:00:00,9,2022,Friday,0,1
4,82072,Bay,Pencost,Male,21,China,Gillespie Field,US,United States,NAM,North America,2/25/2022,SEE,Bay Pencost,On Time,2022-02-25 00:00:00,2,2022,Friday,1,1
5,39630,Lora,Durbann,Female,55,Brazil,Coronel Horácio de Mattos Airport,BR,Brazil,SAM,South America,6/10/2022,LEC,Lora Durbann,On Time,2022-06-10 00:00:00,6,2022,Friday,0,1
6,11940,Rand,Bram,Male,73,Ivory Coast,Duxford Aerodrome,GB,United Kingdom,EU,Europe,10/30/2022,QFO,Rand Bram,Cancelled,2022-10-30 00:00:00,10,2022,Sunday,1,1
7,26470,Perceval,Dallosso,Male,36,Vietnam,Maestro Wilson Fonseca Airport,BR,Brazil,SAM,South America,4/7/2022,STM,Perceval Dallosso,Cancelled,2022-04-07 00:00:00,4,2022,Thursday,1,1
8,29447,Aleda,Pigram,Female,35,Palestinian Territory,Venice Marco Polo Airport,IT,Italy,EU,Europe,8/20/2022,VCE,Aleda Pigram,On Time,2022-08-20 00:00:00,8,2022,Saturday,0,1
9,75035,Burlie,Schustl,Male,13,Thailand,Vermilion Airport,CA,Canada,NAM,North America,4/6/2022,YVG,Burlie Schustl,On Time,2022-04-06 00:00:00,4,2022,Wednesday,1,1
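
The lambda passed to `Styler.apply` receives the "Flight Status" column as a Series and must return one CSS string per cell. A minimal sketch of the same logic as a named function makes that contract easier to check on its own; the names `STATUS_CSS` and `status_css` are hypothetical.

```python
import pandas as pd

# Same colour mapping as in the cell above.
STATUS_CSS = {
    "On Time": "color:white;background-color:green",
    "Delayed": "color:white;background-color:orange",
    "Cancelled": "color:white;background-color:red",
}

def status_css(column: pd.Series) -> list:
    """Return one CSS string per cell; unknown statuses get no styling."""
    return [STATUS_CSS.get(status, "") for status in column]

# "Unknown" is an invented value used to show the fallback to an empty string.
statuses = pd.Series(["On Time", "Delayed", "Unknown"])
css = status_css(statuses)
```

The function could then be used as `df.head(10).style.apply(status_css, subset=['Flight Status'])`.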


### 2.1 Distributions of Passengers' Genders, Ages, and Nationalities

Before examining how passengers' characteristics relate to time and airport location, let's look at the passengers' features directly. Because df has more than 5,000 rows, Altair refuses to render it directly. (I tried `alt.data_transformers.disable_max_rows()` to bypass the limit, but the error still occurred.) Hence, I created three DataFrames, each containing one passenger feature and the count of each of its values.

The charts below show that passengers' ages and genders are roughly uniformly distributed. Interestingly, **roughly half of the passengers' nationalities are 'China', 'Indonesia', 'Russia', 'Philippines', 'Brazil', 'Portugal', 'Poland', 'France'.**
_(`mark_arc()` usage is from the Altair documentation https://altair-viz.github.io/user_guide/marks/arc.html._
_Usage of the `order` parameter in `encode()` is from https://altair-viz.github.io/user_guide/encodings/channel_options.html#order.)_

In [3]:
make_df = lambda vol: pd.DataFrame({vol:df[vol].value_counts().index, 'Counts':df[vol].value_counts()})
df_ages, df_genders, df_nations = [make_df(x) for x in ['Age', 'Gender', 'Nationality']]
df_nations['Percentage'] = (df_nations['Counts']/df_nations['Counts'].sum()*100).apply(lambda x: f'{x:.2f}%')
df_genders['Percentage'] = (df_genders['Counts']/df_genders['Counts'].sum()*100).apply(lambda x: f'{x:.2f}%')

c1 = alt.Chart(df_ages).mark_bar().encode(
    x = 'Age:Q',
    y = 'Counts:Q',
    color = 'Age:Q'
).properties(
    title = 'Distribution of Ages'
)

c2 = alt.Chart(df_genders).mark_arc().encode(
    theta = 'Counts:Q',
    color = 'Gender:N',
    tooltip = ['Gender:N','Counts','Percentage']
).properties(
    title = 'Distribution of Genders'
)

c3 = alt.Chart(df_nations).mark_arc().encode(
    theta = 'Counts:Q',
    color = alt.Color('Nationality:N', sort=alt.EncodingSortField(field='Counts', order='descending')),
    order = alt.Order('Counts:Q', sort='descending'),
    tooltip = ['Nationality:N','Counts','Percentage']
).properties(
    title = 'Distribution of Nationalities'
)

c1&c2&c3
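
One way to check how much of the data the largest nationalities cover is to add up their shares. This is a sketch on a made-up Series; `top_k_share` is a hypothetical helper that does not appear in the notebook, and on the real data it would be called on `df['Nationality']`.

```python
import pandas as pd

def top_k_share(values: pd.Series, k: int) -> float:
    """Fraction of rows covered by the k most frequent values."""
    # value_counts sorts from most to least frequent by default.
    shares = values.value_counts(normalize=True)
    return float(shares.iloc[:k].sum())

# Invented data: 4 China, 3 Indonesia, 2 Russia, 1 Japan.
toy_nationalities = pd.Series(
    ["China"] * 4 + ["Indonesia"] * 3 + ["Russia"] * 2 + ["Japan"]
)
share_top2 = top_k_share(toy_nationalities, 2)  # China and Indonesia together
```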

### 2.2 The Difference of Passengers' Characteristics across Times

There are no notable differences in passengers' average age or gender ratio across months or days of the week.

In [4]:
# Select the numeric columns before averaging so text columns are not aggregated
df.groupby('Month')[['Age','IsMale']].mean()

Month,Age,IsMale
1,45.443203,0.498574
2,45.325885,0.504769
3,45.134978,0.505397
4,46.171253,0.492022
5,45.768715,0.504002
6,45.08686,0.503814
7,46.20471,0.503609
8,45.267322,0.50632
9,45.908455,0.506074
10,45.000594,0.501367


In [5]:
# Select the numeric columns before averaging so text columns are not aggregated
df.groupby('DayofWeek')[['Age','IsMale']].mean()

DayofWeek,Age,IsMale
Friday,45.364307,0.495537
Monday,45.771166,0.507982
Saturday,45.655337,0.501985
Sunday,45.706557,0.509553
Thursday,45.470303,0.505237
Tuesday,45.387315,0.502136
Wednesday,45.171677,0.497955
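
Group means alone do not show how much variation to expect by chance. A rough sketch, on made-up numbers, is to report the standard error of each group mean next to the mean itself; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Invented ages for two months, three passengers each.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 2],
    "Age": [20, 40, 60, 30, 40, 50],
})

# Mean, standard error of the mean, and group size for each month.
summary = toy.groupby("Month")["Age"].agg(["mean", "sem", "count"])
```

If the means of two groups differ by less than a couple of standard errors, the difference is hard to distinguish from noise.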


### 2.3 The Composition of Passengers' Nationalities from Top 10 Departure Countries

Let's look at the top 10 departure countries of this airline, which are its major markets.
I create a pivot table that counts the passengers of each nationality who departed from each country, and show the 10 countries with the most departing passengers.
_(The usage of pivot table is adapted from https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)_

In [6]:
pivot_table = pd.pivot_table(df, values='Number', index=['Nationality'], columns=['Country Name'], aggfunc='sum')
country_list_10 = df['Country Name'].value_counts()[:10].index
pt_10 = pivot_table[country_list_10].copy()
pt_10['Nationality'] = pt_10.index
pt_10

Nationality,United States,Australia,Canada,Brazil,Papua New Guinea,China,Indonesia,Russian Federation,Colombia,India,Nationality
Afghanistan,89.0,23.0,34.0,18.0,12.0,9.0,10.0,8.0,5.0,3.0,Afghanistan
Aland Islands,4.0,,3.0,,2.0,,1.0,,,,Aland Islands
Albania,95.0,23.0,21.0,29.0,16.0,15.0,10.0,6.0,9.0,3.0,Albania
Algeria,1.0,,,,,,,,,,Algeria
American Samoa,5.0,1.0,1.0,1.0,,1.0,2.0,1.0,1.0,,American Samoa
...,...,...,...,...,...,...,...,...,...,...,...
Wallis and Futuna,3.0,,,1.0,1.0,,,,,,Wallis and Futuna
Western Sahara,1.0,,,,,,,1.0,,,Western Sahara
Yemen,92.0,31.0,19.0,7.0,16.0,19.0,9.0,9.0,4.0,5.0,Yemen
Zambia,19.0,8.0,3.0,5.0,3.0,2.0,4.0,,4.0,,Zambia
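
The empty cells in the table appear because `pivot_table` leaves nationality/country pairs with no passengers as `NaN`. As an alternative sketch, `pd.crosstab` counts the pairs directly, fills empty pairs with 0, and does not need the helper "Number" column. The toy rows below are invented.

```python
import pandas as pd

# Invented passengers: two Chinese nationals leaving the United States,
# one Japanese national leaving Canada.
toy = pd.DataFrame({
    "Nationality": ["China", "China", "Japan"],
    "Country Name": ["United States", "United States", "Canada"],
})

# Rows are nationalities, columns are departure countries, cells are counts.
counts = pd.crosstab(toy["Nationality"], toy["Country Name"])
```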


Because 10 charts would take too much space, I only make bar charts of the distribution of passengers' nationalities for the top 3 departure countries, which I think is enough to be informative.

**For example, many Chinese, Indonesian, and Russian passengers depart from these 3 countries. Airlines might improve their service there by hiring more flight attendants who speak Chinese, Indonesian, or Russian.**

However, I do not find a major difference in the composition of nationalities among these 3 countries.

_(Usage of title configuration is from https://altair-viz.github.io/user_guide/configuration.html#config-composition.)_


In [7]:
def make_chart(country):
    c = alt.Chart(pt_10[[country,'Nationality']].dropna()).mark_bar().encode(
        x = alt.X('Nationality:N', sort=alt.EncodingSortField(field=country, order='descending')),
        y = alt.Y(country,title='Number of Passengers'),
        color = alt.Color('Nationality:N', sort=alt.EncodingSortField(field=country, order='descending')),
        order = alt.Order(country+':Q', sort='descending'),
    ).properties(
        title = "Distribution of Passengers' Nationalities from airports in " + country,
        #align = "left"  
    )
    return c
c_list = [make_chart(x) for x in country_list_10[:3]]
alt.vconcat(*c_list).configure_title(
        anchor='start'
    )
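
Raw counts are hard to compare across countries with different passenger totals. A possible check, sketched here on invented counts, is to convert each country's column to shares and look at the largest gap per nationality. The `counts` frame is an assumption standing in for `pt_10[country_list_10[:3]]`.

```python
import pandas as pd

# Invented counts: Country B has twice as many passengers as Country A
# but the same mix of nationalities.
counts = pd.DataFrame(
    {"Country A": [50, 30, 20], "Country B": [100, 60, 40]},
    index=["China", "Indonesia", "Russia"],
)

# Divide each column by its total so every column sums to 1.
shares = counts / counts.sum()

# Largest difference in share between any two countries, per nationality.
max_gap = shares.max(axis=1) - shares.min(axis=1)
```

On the real pivot table the missing cells would first need to be filled, for example with `fillna(0)`.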

## 3. Prediction of Flight Status

### 3.1 Predict by Logistic Regression

Here I select 5 features: 2 describe passengers' characteristics and 3 describe time. The accuracy is about 0.34 on the training set and about 0.33 on the test set. There is no overfitting; instead, the model is underfitting, and its accuracy is no better than randomly guessing one of the three flight statuses.

In [8]:
df['DaysinMonth'] = df['Date'].dt.days_in_month
# Overwrite the day names with integers (Monday=0, ..., Sunday=6) so the model can use them
df['DayofWeek'] = df['Date'].dt.day_of_week

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [10]:
cols = ['Age','IsMale','Month','DaysinMonth','DayofWeek']
X_train, X_test, y_train, y_test = train_test_split(df[cols],df["Flight Status"],train_size=0.24, random_state=0)

In [11]:
clf = LogisticRegression()
clf.fit(X_train,y_train)

LogisticRegression()

In [12]:
clf.classes_

array(['Cancelled', 'Delayed', 'On Time'], dtype=object)

All of the coefficients are quite small, suggesting that these features hardly predict the flight status.

In [13]:
clf.coef_

array([[ 3.41976072e-04, -1.20102055e-02,  9.39995219e-04,
         7.56160660e-05, -4.21343500e-03],
       [-2.55174398e-04, -1.65301461e-02, -4.24899750e-03,
         1.74108920e-03, -2.44950643e-03],
       [-8.68016742e-05,  2.85403516e-02,  3.30900228e-03,
        -1.81670527e-03,  6.66294143e-03]])
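
The size of a logistic-regression coefficient depends on the units of its feature, so raw coefficients are easier to compare after the features are put on a common scale. The following is a sketch only: it uses synthetic data with random labels and a `StandardScaler` step that the notebook itself does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for the Age (1-90) and Month (1-12) columns.
X = np.column_stack([rng.integers(1, 91, n), rng.integers(1, 13, n)])
# Labels drawn independently of X, so there is no real signal to find.
y = rng.choice(["Cancelled", "Delayed", "On Time"], size=n)

# Standardize the features, then fit the classifier on the scaled values.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# One row per class, one column per standardized feature.
scaled_coefs = pipe.named_steps["logisticregression"].coef_
```

After scaling, each coefficient describes the effect of a one-standard-deviation change in its feature, which makes the columns comparable with one another.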

In [14]:
clf.score(X_train,y_train)

0.34252999830995434

In [15]:
clf.score(X_test,y_test)

0.33351122733519234
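
To judge an accuracy of about 0.33, it helps to compare it with a baseline that ignores the features entirely. A minimal sketch with scikit-learn's `DummyClassifier`, on invented labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 4 "On Time", 3 "Delayed", 3 "Cancelled".
y_toy = np.array(["On Time"] * 4 + ["Delayed"] * 3 + ["Cancelled"] * 3)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most common class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)  # share of the most common class
```

On the real data the same baseline could be fit on `X_train, y_train` and scored on `X_test, y_test`; a model that does not beat it has learned nothing useful from the features.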

### 3.2 Exploration of Reasons for Underfitting

Let's reduce the input features to "Age" and "Month" to make the model easier to visualize. (This may not be rigorous, but I think the general pattern can still be seen approximately.)

In [16]:
cols2 = ['Age','Month']
clf2 = LogisticRegression()
clf2.fit(df[cols2],df['Flight Status'])

LogisticRegression()

In [17]:
clf2.score(df[cols2],df['Flight Status'])

0.3358582017663939

In [18]:
df['Pred2'] = clf2.predict(df[cols2])

In [19]:
c_actual = alt.Chart(df.sample(5000,random_state=0)).mark_circle().encode(
    x = 'Date:T',
    y = 'Age:Q',
    color = 'Flight Status:N'
).properties(
    width = 300,
    height = 300,
    title = 'Actual Flight Status'
)

c_pred = alt.Chart(df.sample(5000,random_state=0)).mark_circle().encode(
    x = 'Date:T',
    y = 'Age:Q',
    color = 'Pred2:N'
).properties(
    width = 300,
    height = 300,
    title = 'Predicted Flight Status'
)

c_actual | c_pred

We can see that there is no clear linear relation between these two features and the flight status: the flight status looks random with respect to month and age. So I think a linear model is not applicable here.
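
A confusion matrix is another way to see how a classifier behaves when it finds no signal. The sketch below uses invented labels and an invented prediction vector in which every flight is predicted as "On Time"; it does not reproduce the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix

labels = ["Cancelled", "Delayed", "On Time"]

# Invented ground truth: two flights of each status.
y_true = ["On Time", "Delayed", "Cancelled", "On Time", "Delayed", "Cancelled"]
# Invented predictions: the model always answers "On Time".
y_pred = ["On Time"] * 6

# Rows are actual statuses, columns are predicted statuses, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

On the real data this would be `confusion_matrix(df['Flight Status'], df['Pred2'], labels=clf2.classes_)`.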

### 3.3 Random Forest

A random forest is a machine learning model that can combine many features and pick up nonlinear patterns in the data.
First, I use `LabelEncoder` to transform the categorical columns so that the model has more features to use. (I think one-hot encoding would be a better way to preprocess these data, but Deepnote does not have enough RAM for it.)
Then I train the model. This time, however, the model overfits: the accuracy is about 0.86 on the training data but only about 0.33 on the test data.

_The idea and usage are from ChatGPT, https://chat.openai.com/share/09dff7ac-1c40-4564-bcc8-5b0704b71ae9, and the scikit-learn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier, and https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html._

In [20]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

cols3 = ['Airport Name','Country Name','Continents','Arrival Airport','Pilot Name']
label_encoders = {}
for col in cols3:
    le = LabelEncoder()
    df[col+'_Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le
    
cols4 = [col+'_Encoded' for col in cols3]+['Month','DaysinMonth','DayofWeek']
X4_train, X4_test, y4_train, y4_test = train_test_split(df[cols4],df["Flight Status"],train_size=0.2, random_state=0)

rf_clf = RandomForestClassifier(min_samples_split=10,min_samples_leaf=10,n_estimators=200,random_state=0)
rf_clf.fit(X4_train, y4_train)

RandomForestClassifier(min_samples_leaf=10, min_samples_split=10,
                       n_estimators=200, random_state=0)

In [21]:
rf_clf.score(X4_train, y4_train)

0.8593013233280941

In [22]:
rf_clf.score(X4_test, y4_test)

0.329433684850943
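
On the memory concern with one-hot encoding: scikit-learn's `OneHotEncoder` returns a SciPy sparse matrix by default, which stores only the nonzero entries. Whether that fits in Deepnote's RAM for this dataset is untested here; the sketch only shows the output on invented values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented departure countries: three distinct values in four rows.
toy = pd.DataFrame({"Country Name": ["Canada", "France", "Canada", "Norway"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# Sparse result: one row per passenger, one column per distinct country.
encoded = encoder.fit_transform(toy)
```

`RandomForestClassifier.fit` accepts sparse input, so the encoded matrix could be passed to it without converting to a dense array.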

## 4. Summary

In this project, my major finding is the distribution of passengers' nationalities, which gives airline companies a concrete point for improving their service. On the other hand, passengers' genders and ages are nearly uniformly distributed, and there is basically no difference in the distribution of passengers' nationalities among the top 3 departure countries.

Unfortunately, I failed to predict the flight status from passengers' characteristics and flight time with either the Logistic Regression or the Random Forest model.

## References


* What is the source of your dataset(s)?

It is available on Kaggle at https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset.

* List any other references that you found helpful.

1. https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styler-Functions
2. https://altair-viz.github.io/user_guide/marks/arc.html
3. https://altair-viz.github.io/user_guide/encodings/channel_options.html#order
4. https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
5. https://altair-viz.github.io/user_guide/configuration.html#config-composition
6. https://chat.openai.com/share/09dff7ac-1c40-4564-bcc8-5b0704b71ae9
7. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
8. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
