## KNN - Plane Total Landed Weight

This notebook applies the K Nearest Neighbours (KNN) algorithm to the San Francisco air traffic landings dataset, with the aim of predicting the total landed weight of aircraft.

In [1]:
# Libraries
import pandas as pd
import numpy as np

import sklearn
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model, preprocessing


In [2]:
air_t = pd.read_csv('air-traffic-landings-statistics_v2.csv')

In [3]:
# Check the dimensions of the dataset
air_t.shape

(21762, 10)

In [4]:
# View first rows of dataset
air_t.head()

Unnamed: 0,Operating Airline,Operating Airline IATA Code,GEO Summary,GEO Region,Landing Aircraft Type,Aircraft Body Type,Aircraft Manufacturer,Aircraft Model,Landing Count,Total Landed Weight
0,ATA Airlines,TZ,Domestic,US,Passenger,Narrow Body,Boeing,757,83,16434000
1,ATA Airlines,TZ,Domestic,US,Passenger,Narrow Body,Boeing,757,3,672000
2,ATA Airlines,TZ,Domestic,US,Passenger,Wide Body,Lockheed,L1011,27,9666000
3,Air Canada,AC,International,Canada,Passenger,Narrow Body,Boeing,737,5,525000
4,Air Canada,AC,International,Canada,Passenger,Narrow Body,Boeing,737,15,1605000


The dataset consists of 10 columns: 2 numerical and 8 categorical.

The last column, 'Total Landed Weight', will be set as the target variable.

In [5]:
# View summary statistics of the numerical columns
air_t.describe()

Unnamed: 0,Landing Count,Total Landed Weight
count,21762.0,21762.0
mean,113.421652,18965830.0
std,248.910829,30098760.0
min,1.0,6850.0
25%,14.0,3080500.0
50%,31.0,9678039.0
75%,84.0,19530000.0
max,2245.0,273042000.0


In [6]:
# Features by type:
# Operating Airline: Categorical
# Operating Airline IATA Code: Categorical
# GEO Summary: Categorical
# GEO Region: Categorical
# Landing Aircraft Type: Categorical
# Aircraft Body Type: Categorical
# Aircraft Manufacturer: Categorical
# Aircraft Model: Categorical
# Landing Count: Numerical
# Total Landed Weight: Numerical
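The column types listed above can also be derived programmatically rather than by hand. The following is a minimal sketch using `DataFrame.select_dtypes`; the `sample` frame is a hypothetical stand-in built from a few rows of the `head()` output, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in frame: a few columns and rows taken from the head() output.
sample = pd.DataFrame({
    "Operating Airline": ["ATA Airlines", "ATA Airlines", "Air Canada"],
    "Aircraft Model": ["757", "L1011", "737"],
    "Landing Count": [83, 27, 5],
    "Total Landed Weight": [16434000, 9666000, 525000],
})

# Split the columns by dtype: numeric versus everything else.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Applied to `air_t`, the same two calls should reproduce the numerical/categorical split listed in the cell above, assuming pandas inferred the column dtypes as expected when reading the CSV.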

In [8]:
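# KNN needs numerical inputs, so the non-numerical columns
# must be converted into integer codes before modelling.

le = preprocessing.LabelEncoder()    # encodes each distinct value as an integer

# Encode each column into an array of integer codes.
# Note: the two numerical columns are label-encoded as well, so their
# values become class indices rather than raw counts and weights.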
Operating_Airline = le.fit_transform(list(air_t['Operating Airline']))
Op_IATA_Code = le.fit_transform(list(air_t['Operating Airline IATA Code']))
GEO_Summary = le.fit_transform(list(air_t['GEO Summary']))
GEO_Region = le.fit_transform(list(air_t['GEO Region']))
Landing_Aircraft_Type = le.fit_transform(list(air_t['Landing Aircraft Type']))
Aircraft_Body_Type = le.fit_transform(list(air_t['Aircraft Body Type']))
Aircraft_Manufacturer = le.fit_transform(list(air_t['Aircraft Manufacturer']))
Aircraft_Model = le.fit_transform(list(air_t['Aircraft Model']))
Landing_Count = le.fit_transform(list(air_t['Landing Count']))
Total_Landed_Weight = le.fit_transform(list(air_t['Total Landed Weight']))

In [26]:
# View output for one of the original categorical variables:
# ('Operating Airline IATA Code')
Op_IATA_Code

array([82, 82, 82, ..., 90, 90, 76], dtype=int64)

We can see that the output for 'Operating Airline IATA Code' is now numerical.
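To make the encoding step more concrete, here is a small sketch of how `LabelEncoder` assigns integers and how the mapping can be reversed. The list of codes is illustrative (a handful of IATA codes, not the dataset column), and `encoder` is a name introduced only for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative IATA codes; not the real column from the dataset.
codes_raw = ["TZ", "TZ", "AC", "AC", "UA"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(codes_raw)

# classes_ holds the distinct values in sorted order; each value's
# position in that array is the integer it was encoded as.
print(encoder.classes_)                      # ['AC' 'TZ' 'UA']
print(encoded)                               # [1 1 0 0 2]

# inverse_transform recovers the original strings from the codes.
print(encoder.inverse_transform(encoded))    # ['TZ' 'TZ' 'AC' 'AC' 'UA']
```

The integers follow the sorted order of the values, so they carry no real-world meaning. KNN still treats them as numbers when computing distances, which is worth keeping in mind when interpreting the results.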

### Separate the data into features and labels

In [9]:
# labels: the result we want to predict
# features: the variables used to classify the 'Total Landed Weight'

In [11]:
# Define the features as X and the labels as y

# X (features): zip builds one tuple per row
# Note: Total_Landed_Weight is also the label, so including it here
# leaks the target into the features.
X = list(zip(Operating_Airline, Op_IATA_Code, GEO_Summary, GEO_Region, Landing_Aircraft_Type, Aircraft_Body_Type, Aircraft_Manufacturer, Aircraft_Model, Landing_Count, Total_Landed_Weight))

# y (labels):
y = list(Total_Landed_Weight)
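The feature list above includes `Total_Landed_Weight`, which is also the label. As an alternative, here is a hedged sketch of building a feature matrix that excludes the target column. It swaps in `OrdinalEncoder` (which encodes several columns at once) instead of repeated `LabelEncoder` calls, and uses a hypothetical `sample` frame built from rows of the `head()` output; it is not the notebook's method and its effect on accuracy has not been measured here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in frame built from rows of the head() output.
sample = pd.DataFrame({
    "Operating Airline": ["ATA Airlines", "ATA Airlines", "Air Canada", "Air Canada"],
    "Aircraft Body Type": ["Narrow Body", "Wide Body", "Narrow Body", "Narrow Body"],
    "Landing Count": [83, 27, 5, 15],
    "Total Landed Weight": [16434000, 9666000, 525000, 1605000],
})

target_col = "Total Landed Weight"

# Every column except the target is a candidate feature.
feature_cols = [c for c in sample.columns if c != target_col]
cat_cols = sample[feature_cols].select_dtypes(exclude="number").columns.tolist()
num_cols = [c for c in feature_cols if c not in cat_cols]

# Encode the categorical features as integer codes, keep numeric ones as-is.
encoded_cats = pd.DataFrame(
    OrdinalEncoder().fit_transform(sample[cat_cols]),
    columns=cat_cols,
    index=sample.index,
)
X_sample = pd.concat([encoded_cats, sample[num_cols]], axis=1)
y_sample = sample[target_col]

print(X_sample)
```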

### Split data into Train and Test sets

In [12]:
from sklearn.model_selection import train_test_split

# Split the features and labels into train (70%) and test (30%) sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
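Because `train_test_split` shuffles the rows, each run of the cell above can produce a different split and therefore different accuracy figures. The sketch below shows, on hypothetical toy data, how passing `random_state` makes the split repeatable; the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with 2 features each, and binary labels.
X_demo = [[i, i % 3] for i in range(10)]
y_demo = [i % 2 for i in range(10)]

# Fixing random_state makes the shuffle deterministic.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Running the split again with the same seed returns the same partition.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(len(x_tr), len(x_te))   # 7 3
print(x_te == x_te2)          # True
```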

In [13]:
# check the first 5 observations of the labels (y)
y[0:5]

[4823, 470, 3558, 384, 909]

### KNN Model

#### KNN Model 1

In [28]:
model = KNeighborsClassifier(n_neighbors=5)    # choose neighbours value

In [29]:
model.fit(x_train, y_train)          # train the model
acc = model.score(x_test, y_test)    # test accuracy of model
print("Model Accuracy is: ", acc)    # accuracy

Model Accuracy is:  0.3937815898299893


When neighbours is set to 5, model accuracy is 39.4%

#### KNN Model 2

In [17]:
model = KNeighborsClassifier(n_neighbors=7)    # choose neighbours value

In [18]:
model.fit(x_train, y_train)          # train the model
acc = model.score(x_test, y_test)    # test accuracy of model
print("Model Accuracy is: ", acc)    # accuracy

Model Accuracy is:  0.35794149180578955


When neighbours is set to 7, model accuracy is 35.8%

#### KNN Model 3

In [32]:
model = KNeighborsClassifier(n_neighbors=9)    # choose neighbours value

In [31]:
model.fit(x_train, y_train)          # train the model
acc = model.score(x_test, y_test)    # test accuracy of model
print("Model Accuracy is: ", acc)    # accuracy

Model Accuracy is:  0.2830448767039363


When neighbours is set to 9, model accuracy is 28.3%

Across the three models, accuracy falls as the number of neighbours increases: 39.4% with 5 neighbours, 35.8% with 7, and 28.3% with 9.
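The three models above can also be compared in a single loop, which guarantees that every value of `n_neighbors` is scored on the same train/test split. The sketch below uses synthetic data from `make_classification` as a stand-in for the air traffic features, so its scores say nothing about the real dataset; `scores` and the other variable names are introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the air traffic dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# One fixed split, reused for every value of k.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scores = {}
for k in (5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(x_tr, y_tr)                    # train on the shared training set
    scores[k] = clf.score(x_te, y_te)      # accuracy on the shared test set

for k, acc in scores.items():
    print(f"n_neighbors={k}: accuracy={acc:.3f}")
```

Keeping the split fixed means any difference between the scores comes from `n_neighbors` alone rather than from a different shuffle of the rows.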