<a href="https://colab.research.google.com/github/trandangtrungduc/BasicMachineLearningTask/blob/main/Airline_Arrivals_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  **TABLE OF CONTENTS**

---
## **1. Import libraries and data preprocessing**
## **2. Apply model without feature selection**
> ### 2.1 Hold out
> ### 2.2 Logistic Regression
> ### 2.3 Decision Tree
> ### 2.4 Random Forest
> ### 2.5 Gradient Boosting
> ### 2.6 Support Vector Machine
## **3. Apply model with PCA**
> ### 3.1 Apply PCA with component equal to the number of columns of X minus 1
> ### 3.2 Hold out
> ### 3.3 Logistic Regression
> ### 3.4 Decision Tree
> ### 3.5 Random Forest
> ### 3.6 Gradient Boosting
> ### 3.7 Support Vector Machine
## **4. Apply model with KBest**
> ### 4.1 Apply KBest with k equal to the number of columns of X minus 1
> ### 4.2 Hold out
> ### 4.3 Logistic Regression
> ### 4.4 Decision Tree
> ### 4.5 Random Forest
> ### 4.6 Gradient Boosting
> ### 4.7 Support Vector Machine
## **5. Apply model with RFE**
> ### 5.1 Logistic Regression
> ### 5.2 Decision Tree
> ### 5.3 Random Forest
> ### 5.4 Gradient Boosting
> ### 5.5 Support Vector Machine
## **6. Naive Bayes**
> ### 6.1 Naive Bayes without feature selection
> ### 6.2 Hold out without feature selection
> ### 6.3 Naive Bayes with PCA and down-sampled data
> ### 6.4 Hold out (PCA)
> ### 6.5 Naive Bayes with KBest
> ### 6.6 Hold out (KBest)
> ### 6.7 Naive Bayes with RFE
## **7. GridSearchCV**
> ### 7.1 Logistic Regression with feature selection
> ### 7.2 Decision Tree with feature selection




---


## **1. Import libraries and data preprocessing**

> Connect Google Drive to Google Colab and import the necessary libraries

> Load the data from the CSV file on Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd # Library for tabular data
import numpy as np # Library for numerical arrays
# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# scikit-learn model selection utilities
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
# Dimensionality reduction and feature selection
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, RFE, f_classif
# Classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report
%matplotlib inline

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data/new_2007.csv') # Load data from Google Drive
df.head() # Show the first rows of the dataset

Unnamed: 0.1,Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,FlightNum,TailNum,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Diverted,ArrDelay_categorical
0,0,1,1,1,1232.0,WN,2891,N351,1.0,7.0,SMF,ONT,389,4,11,0,0
1,1,1,1,1,1918.0,WN,462,N370,8.0,13.0,SMF,PDX,479,5,6,0,0
2,3,1,1,1,1230.0,WN,1355,N364,26.0,30.0,SMF,PDX,479,3,8,0,0
3,4,1,1,1,831.0,WN,2278,N480,-3.0,1.0,SMF,PDX,479,3,9,0,0
4,5,1,1,1,1430.0,WN,2386,N611SW,3.0,10.0,SMF,PDX,479,2,7,0,0


In [None]:
df = df.drop(columns=['TailNum', 'Origin', 'Dest','Unnamed: 0','ArrDelay','DepDelay']) # Drop the high-cardinality categorical columns, the leftover index column, and the raw delay columns (see the Analysis section)

In [None]:
X = df.drop(['ArrDelay_categorical', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier'], axis=1) # Keep the numeric features; drop the target and the columns that are one-hot encoded below
y = df["ArrDelay_categorical"] # Target column

In [None]:
Categorical_features = df[['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']] # Columns to one-hot encode
one_hot_encoding = pd.get_dummies(data= Categorical_features, columns=['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']) # One indicator column per category value
new_X = pd.concat([X, one_hot_encoding], axis=1, sort=False) # Join the numeric and one-hot features
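
The encoding step can be checked on a toy frame. The sketch below is a minimal illustration, assuming a made-up three-row table (the values are hypothetical and not taken from `new_2007.csv`). It shows how `pd.get_dummies` expands each categorical column into one indicator column per distinct value before the result is concatenated with the numeric features.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "Distance": [389, 479, 479],
    "Month": [1, 1, 2],
    "UniqueCarrier": ["WN", "AA", "WN"],
})

numeric = toy.drop(["Month", "UniqueCarrier"], axis=1)  # numeric part, like X
dummies = pd.get_dummies(toy[["Month", "UniqueCarrier"]],
                         columns=["Month", "UniqueCarrier"])  # one column per category value
toy_X = pd.concat([numeric, dummies], axis=1)  # both frames share the same row index

print(list(toy_X.columns))
print(toy_X.shape)
```

The concatenation lines up rows by index label, which is why the later sections reset the index before joining a freshly built frame with `one_hot_encoding`.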



---


## **2. Apply model without feature selection**

### 2.1 Hold out

In [None]:
X_train, X_val, y_train, y_val = train_test_split(new_X, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
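
The target is heavily imbalanced (the reports in this notebook show 20,650 class-1 rows out of 1,061,313 validation rows), so a purely random split can shift the class ratio between the two sets. A minimal sketch of a stratified hold-out follows, assuming synthetic data with 2% positives; `X_demo`, `y_demo`, and the other names are hypothetical and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1000 rows, 2% positives (made-up values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in validation:", y_va.mean())
```

Whether stratification changes the scores on the real data is not something this sketch can show; it only illustrates the `stratify` argument.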

### 2.2 Logistic Regression

In [None]:
logreg = LogisticRegression() # Initialize Logistic Regression
logreg.fit(X_train, y_train) # Train model 
y_pred = logreg.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.00      0.00      0.00     20650

    accuracy                           0.98   1061313
   macro avg       0.49      0.50      0.50   1061313
weighted avg       0.96      0.98      0.97   1061313
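
The warning above points to two remedies: raise `max_iter` or scale the features. The sketch below tries both on synthetic data, assuming `make_classification` as a stand-in for the airline table and `max_iter=1000` as an arbitrary illustrative value. It is not a rerun of the cell above and says nothing about how the real scores would change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the airline dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_demo[:, 0] *= 1000.0  # inflate one column to mimic unscaled features such as DepTime

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The scaler is fit on the training split only, then applied to both splits
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

n_iter = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
val_accuracy = scaled_logreg.score(X_va, y_va)
print("solver iterations used:", n_iter)
print("validation accuracy:", val_accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics from leaking out of the training split.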



### 2.3 Decision Tree

In [None]:
decisiontree = DecisionTreeClassifier() # Initialize Decision Tree
decisiontree.fit(X_train, y_train) # Train model
y_pred = decisiontree.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.98      0.98   1040663
           1       0.16      0.17      0.17     20650

    accuracy                           0.97   1061313
   macro avg       0.57      0.58      0.57   1061313
weighted avg       0.97      0.97      0.97   1061313



### 2.4 Random Forest

In [None]:
randomforest = RandomForestClassifier() # Initialize Random Forest
randomforest.fit(X_train, y_train) # Train model
y_pred = randomforest.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.99      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313
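
The class-1 row of the report (precision 0.99, recall 0.10, f1-score 0.19) can be reproduced by hand. The sketch below uses the support counts from the report together with hypothetical true-positive and false-positive counts chosen to match the rounded figures, since the notebook does not print the confusion matrix. It also computes the accuracy of always predicting class 0, which shows why 0.98 accuracy alone says little on this data.

```python
# Support counts are taken from the report; tp and fp are assumed values
support_pos = 20650      # class-1 rows in the validation set
support_neg = 1040663    # class-0 rows in the validation set
tp, fp = 2150, 20        # hypothetical true and false positives
fn = support_pos - tp    # positives the model missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts class 0
baseline_accuracy = support_neg / (support_neg + support_pos)

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(round(baseline_accuracy, 4))
```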



### 2.5 Gradient Boosting

In [None]:
gbk = GradientBoostingClassifier() # Initialize GradientBoosting
gbk.fit(X_train, y_train) # Train model
y_pred = gbk.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 2.6 Support Vector Machine

In [None]:
svm = LinearSVC(class_weight='balanced') # Initialize Support Vector Machine
svm.fit(X_train, y_train) # Train model
y_pred = svm.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model



              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313





---


## **3. Apply model with PCA**

### 3.1 Apply PCA with component equal to the number of columns of X minus 1

In [None]:
pca = PCA(n_components = X.shape[1]-1) # Keep one component fewer than the number of numeric columns
X_PCA = pd.DataFrame(pca.fit_transform(X)) # Apply PCA and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_PCA
new_X_PCA = pd.concat([X_PCA, one_hot_encoding], axis=1, sort=False) # Join the PCA components with the one-hot columns
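
PCA picks directions of largest variance, so a column with a much larger numeric range than the others tends to dominate the first component. The sketch below illustrates this with synthetic columns (made-up scales, loosely comparable to a distance in miles and a taxi time in minutes) and prints `explained_variance_ratio_`. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic numeric features with very different scales (made-up values)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(0, 500, size=300),  # large-scale column
    rng.normal(0, 5, size=300),    # small-scale column
    rng.normal(0, 1, size=300),    # unit-scale column
])

pca_demo = PCA(n_components=X_demo.shape[1] - 1)  # same rule as above: columns minus 1
X_demo_pca = pca_demo.fit_transform(X_demo)

print(X_demo_pca.shape)
print(pca_demo.explained_variance_ratio_)
```

Standardizing the columns before PCA is a common way to avoid this effect; the notebook does not do so, and this sketch does not claim it would change the reported scores.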

### 3.2 Hold out

In [None]:
X_train, X_val, y_train, y_val = train_test_split(new_X_PCA, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets

### 3.3 Logistic Regression

In [None]:
logreg = LogisticRegression() # Initialize Logistic Regression
logreg.fit(X_train, y_train) # Train model 
y_pred = logreg.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.00      0.00      0.00     20650

    accuracy                           0.98   1061313
   macro avg       0.49      0.50      0.50   1061313
weighted avg       0.96      0.98      0.97   1061313



### 3.4 Decision Tree

In [None]:
decisiontree = DecisionTreeClassifier() # Initialize Decision Tree
decisiontree.fit(X_train, y_train) # Train model
y_pred = decisiontree.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.98      0.98   1040663
           1       0.15      0.17      0.16     20650

    accuracy                           0.97   1061313
   macro avg       0.57      0.58      0.57   1061313
weighted avg       0.97      0.97      0.97   1061313



### 3.5 Random Forest

In [None]:
randomforest = RandomForestClassifier() # Initialize Random Forest
randomforest.fit(X_train, y_train) # Train model
y_pred = randomforest.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.96      0.10      0.18     20650

    accuracy                           0.98   1061313
   macro avg       0.97      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 3.6 Gradient Boosting

In [None]:
gbk = GradientBoostingClassifier() # Initialize GradientBoosting
gbk.fit(X_train, y_train) # Train model
y_pred = gbk.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.92      0.10      0.18     20650

    accuracy                           0.98   1061313
   macro avg       0.95      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 3.7 Support Vector Machine

In [None]:
svm = LinearSVC(class_weight='balanced') # Initialize Support Vector Machine
svm.fit(X_train, y_train) # Train model
y_pred = svm.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model



              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313





---


## **4. Apply model with KBest**

### 4.1 Apply KBest with k equal to the number of columns of X minus 1

In [None]:
X_KBest = SelectKBest(f_classif, k=X.shape[1]-1).fit_transform(X, y) # Keep the top-scoring features by ANOVA F-value
X_KBest = pd.DataFrame(X_KBest) # Convert to dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_KBest
new_X_KBest = pd.concat([X_KBest, one_hot_encoding], axis=1, sort=False) # Join the selected features with the one-hot columns
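
`fit_transform` returns a plain array, so the names of the selected columns are lost once the result is wrapped in a new dataframe. The sketch below shows one way to recover them with `get_support()`, assuming a synthetic frame in which only the column named `signal` depends on the label. The frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic frame: "signal" depends on the label, the noise columns do not (made-up data)
rng = np.random.default_rng(0)
y_demo = np.array([0] * 150 + [1] * 150)
X_demo = pd.DataFrame({
    "signal": y_demo * 3.0 + rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

selector = SelectKBest(f_classif, k=X_demo.shape[1] - 1).fit(X_demo, y_demo)
kept = X_demo.columns[selector.get_support()]  # names of the retained columns
X_demo_kbest = pd.DataFrame(selector.transform(X_demo), columns=kept)

print(list(kept))
print(dict(zip(X_demo.columns, selector.scores_.round(2))))
```

Printing `scores_` next to the column names makes it easy to see which feature was dropped and how close the decision was.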

### 4.2 Hold out

In [None]:
X_train, X_val, y_train, y_val = train_test_split(new_X_KBest, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets

### 4.3 Logistic Regression

In [None]:
logreg = LogisticRegression() # Initialize Logistic Regression
logreg.fit(X_train, y_train) # Train model 
y_pred = logreg.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.99      0.03      0.06     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.52      0.52   1061313
weighted avg       0.98      0.98      0.97   1061313



### 4.4 Decision Tree

In [None]:
decisiontree = DecisionTreeClassifier() # Initialize Decision Tree
decisiontree.fit(X_train, y_train) # Train model
y_pred = decisiontree.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.98      0.98   1040663
           1       0.16      0.17      0.16     20650

    accuracy                           0.97   1061313
   macro avg       0.57      0.58      0.57   1061313
weighted avg       0.97      0.97      0.97   1061313



### 4.5 Random Forest

In [None]:
randomforest = RandomForestClassifier() # Initialize Random Forest
randomforest.fit(X_train, y_train) # Train model
y_pred = randomforest.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.96      0.11      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.97      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 4.6 Gradient Boosting

In [None]:
gbk = GradientBoostingClassifier() # Initialize GradientBoosting
gbk.fit(X_train, y_train) # Train model
y_pred = gbk.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 4.7 Support Vector Machine

In [None]:
svm = LinearSVC(class_weight='balanced') # Initialize Support Vector Machine
svm.fit(X_train, y_train) # Train model
y_pred = svm.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model



              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313





---


## **5. Apply model with RFE**

### 5.1 Logistic Regression

In [None]:
model = LogisticRegression() # Estimator used by RFE to rank features
rfe = RFE(model, n_features_to_select=X.shape[1]-1) # Configure RFE to keep all but one numeric feature
X_RFE = pd.DataFrame(rfe.fit_transform(X, y)) # Apply RFE and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_RFE
new_X_RFE = pd.concat([X_RFE, one_hot_encoding], axis=1, sort=False) # New input X
X_train, X_val, y_train, y_val = train_test_split(new_X_RFE, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
logreg = LogisticRegression() # Initialize Logistic Regression
logreg.fit(X_train, y_train) # Train model
y_pred = logreg.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate result of model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.98      0.03      0.06     20650

    accuracy                           0.98   1061313
   macro avg       0.98      0.51      0.52   1061313
weighted avg       0.98      0.98      0.97   1061313
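
RFE repeatedly fits the estimator and discards the weakest feature until the requested number remains. After fitting, `support_` and `ranking_` show which columns survived. The sketch below is a minimal illustration on synthetic data in which only the first column carries signal; the `_demo` names and `max_iter=1000` are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first column carries signal (made-up values)
rng = np.random.default_rng(0)
y_demo = np.array([0] * 150 + [1] * 150)
X_demo = np.column_stack([
    y_demo * 3.0 + rng.normal(size=300),
    rng.normal(size=300),
    rng.normal(size=300),
])

# n_features_to_select is passed by keyword; recent scikit-learn releases require this
rfe_demo = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=X_demo.shape[1] - 1)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.support_)  # True for retained columns
print(rfe_demo.ranking_)  # 1 for retained columns, larger values were eliminated earlier
```

If the requested count is at least the number of input columns, the elimination loop never runs and every column is kept.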



### 5.2 Decision Tree

In [None]:
model = DecisionTreeClassifier() # Estimator used by RFE to rank features
# NOTE: X_train comes from the previous split and still contains the one-hot columns,
# so this count exceeds the number of columns in X and RFE removes no feature
rfe = RFE(model, n_features_to_select=X_train.shape[1]-1) # Configure RFE
X_RFE = pd.DataFrame(rfe.fit_transform(X, y)) # Apply RFE and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_RFE
new_X_RFE = pd.concat([X_RFE, one_hot_encoding], axis=1, sort=False) # New input X
X_train, X_val, y_train, y_val = train_test_split(new_X_RFE, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
decisiontree = DecisionTreeClassifier() # Initialize Decision Tree
decisiontree.fit(X_train, y_train) # Train model
y_pred = decisiontree.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.98      0.98   1040663
           1       0.16      0.18      0.17     20650

    accuracy                           0.97   1061313
   macro avg       0.57      0.58      0.58   1061313
weighted avg       0.97      0.97      0.97   1061313



### 5.3 Random Forest

In [None]:
model = RandomForestClassifier() # Estimator used by RFE to rank features
# NOTE: X_train comes from the previous split and still contains the one-hot columns,
# so this count exceeds the number of columns in X and RFE removes no feature
rfe = RFE(model, n_features_to_select=X_train.shape[1]-2) # Configure RFE
X_RFE = pd.DataFrame(rfe.fit_transform(X, y)) # Apply RFE and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_RFE
new_X_RFE = pd.concat([X_RFE, one_hot_encoding], axis=1, sort=False) # New input X
X_train, X_val, y_train, y_val = train_test_split(new_X_RFE, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
randomforest = RandomForestClassifier() # Initialize Random Forest
randomforest.fit(X_train, y_train) # Train model
y_pred = randomforest.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.99      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 5.4 Gradient Boosting

In [None]:
model = GradientBoostingClassifier() # Estimator used by RFE to rank features
# NOTE: X_train comes from the previous split and still contains the one-hot columns,
# so this count exceeds the number of columns in X and RFE removes no feature
rfe = RFE(model, n_features_to_select=X_train.shape[1]-2) # Configure RFE
X_RFE = pd.DataFrame(rfe.fit_transform(X, y)) # Apply RFE and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_RFE
new_X_RFE = pd.concat([X_RFE, one_hot_encoding], axis=1, sort=False) # New input X
X_train, X_val, y_train, y_val = train_test_split(new_X_RFE, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
gbk = GradientBoostingClassifier() # Initialize Gradient Boosting
gbk.fit(X_train, y_train) # Train model
y_pred = gbk.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313



### 5.5 Support Vector Machine

In [None]:
model = LinearSVC() # Estimator used by RFE to rank features
# NOTE: X_train comes from the previous split and still contains the one-hot columns,
# so this count exceeds the number of columns in X and RFE removes no feature
rfe = RFE(model, n_features_to_select=X_train.shape[1]-2) # Configure RFE
X_RFE = pd.DataFrame(rfe.fit_transform(X, y)) # Apply RFE and convert the result to a dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_RFE
new_X_RFE = pd.concat([X_RFE, one_hot_encoding], axis=1, sort=False) # New input X
X_train, X_val, y_train, y_val = train_test_split(new_X_RFE, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets
svm = LinearSVC(class_weight='balanced') # Initialize Support Vector Machine
svm.fit(X_train, y_train) # Train model
y_pred = svm.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate result of model



              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313





---


##  **6. Naive Bayes**

### 6.1 Naive Bayes without feature selection

In [None]:
df_down = df.sample(frac=0.3) # Randomly sample 30% of the rows (no random_state is set, so the sample changes between runs)
X_6 = df_down.drop(['ArrDelay_categorical', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier'], axis=1) # Keep the numeric features only
y_6 = df_down["ArrDelay_categorical"] # Target column
Categorical_features = df_down[['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']] # Columns to one-hot encode
one_hot_encoding = pd.get_dummies(data= Categorical_features, columns=['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']) # One-hot encode the sampled rows (not added to X_6 in this section)

### 6.2 Hold out without feature selection

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_6, y_6, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets

In [None]:
gaussian = GaussianNB() # Initialize Naive Bayes
gaussian.fit(X_train, y_train) # Train model
y_pred = gaussian.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      1.00      0.99    312348
           1       1.00      0.10      0.18      6046

    accuracy                           0.98    318394
   macro avg       0.99      0.55      0.59    318394
weighted avg       0.98      0.98      0.98    318394



### 6.3 Naive Bayes with PCA and down-sampled data

In [None]:
X_PCA = df_down.drop(['ArrDelay_categorical', 'Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier'], axis=1) # Keep the numeric features of the down-sampled data
y_PCA = df_down["ArrDelay_categorical"] # Target column
pca = PCA(n_components = X_PCA.shape[1]-1) # Keep one component fewer than the number of numeric columns
X_PCA = pca.fit_transform(X_PCA) # Apply PCA

In [None]:
Categorical_features = df_down[['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']]
one_hot_encoding = pd.get_dummies(data= Categorical_features, columns=['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier'])
X_PCA = pd.DataFrame(X_PCA)

In [None]:
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_PCA
new_X_PCA = pd.concat([X_PCA, one_hot_encoding], axis=1, sort=False) # Join the PCA components with the one-hot columns

### 6.4 Hold out (PCA)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(new_X_PCA, y_PCA, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets

In [None]:
gaussian = GaussianNB() # Initialize Naive Bayes
gaussian.fit(X_train, y_train) # Train model
y_pred = gaussian.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.90      0.94    312348
           1       0.05      0.27      0.08      6046

    accuracy                           0.88    318394
   macro avg       0.52      0.58      0.51    318394
weighted avg       0.97      0.88      0.92    318394



### 6.5 Naive Bayes with KBest

In [None]:
one_hot_encoding = pd.get_dummies(data=df[['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']], columns=['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier']) # Rebuild from the full data; sections 6.1 and 6.3 overwrote it with the down-sampled rows
X_KBest = SelectKBest(f_classif, k=X.shape[1]-1).fit_transform(X, y) # Keep the top-scoring features by ANOVA F-value
X_KBest = pd.DataFrame(X_KBest) # Convert to dataframe
one_hot_encoding = one_hot_encoding.reset_index(drop=True) # Reset index so the rows align with X_KBest
new_X_KBest = pd.concat([X_KBest, one_hot_encoding], axis=1, sort=False) # Join the selected features with the one-hot columns

### 6.6 Hold out (KBest)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(new_X_KBest, y, test_size = 0.2, random_state = 0) # Split into training (80%) and validation (20%) sets

In [None]:
gaussian = GaussianNB() # Initialize Naive Bayes
gaussian.fit(X_train, y_train) # Train model
y_pred = gaussian.predict(X_val) # Predict with test set
print(classification_report(y_val, y_pred)) # Evaluate result of model

              precision    recall  f1-score   support

           0       0.98      0.95      0.97   1040663
           1       0.09      0.24      0.13     20650

    accuracy                           0.94   1061313
   macro avg       0.54      0.60      0.55   1061313
weighted avg       0.97      0.94      0.95   1061313



### 6.7 Naive Bayes with RFE



---


##  **7. GridSearchCV**

### 7.1 Logistic Regression with feature selection

In [None]:
param_grid = {'penalty': ['l2'],
             'C' : [0.7,0.8,0.9],
             'solver': ['lbfgs', 'liblinear'],
             'class_weight': [{1:0.6, 0:0.4}, {1:0.7, 0:0.3}, {1:0.8, 0:0.2}]} # Parameter grid to search
logreg_grid = GridSearchCV(estimator=LogisticRegression(), # Grid search with 3-fold cross-validation
                          param_grid = param_grid,
                          scoring="f1", # Select by F1 on class 1 rather than accuracy
                          cv=3,
                          n_jobs = 1)
logreg_grid.fit(X_train, y_train) # Train
logreg_grid_best = logreg_grid.best_estimator_ # Best estimator
print("Best Model Parameter: ",logreg_grid.best_params_) # Show the best parameters
print(logreg_grid_best)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Best Model Parameter:  {'C': 0.8, 'class_weight': {1: 0.8, 0: 0.2}, 'penalty': 'l2', 'solver': 'liblinear'}
LogisticRegression(C=0.8, class_weight={0: 0.2, 1: 0.8}, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


In [None]:
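logreg = logreg_grid_best # Use the best estimator found by the grid search
logreg.fit(X_train, y_train) # Refit on the training set (optional: GridSearchCV already refits the best estimator by default)
y_pred = logreg.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate the model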

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       0.57      0.13      0.21     20650

    accuracy                           0.98   1061313
   macro avg       0.78      0.56      0.60   1061313
weighted avg       0.97      0.98      0.98   1061313
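
With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the whole training set, so `best_estimator_` can be used for prediction straight away. A minimal sketch follows, assuming synthetic imbalanced data and a small illustrative grid; the parameter values and the `_demo` names are hypothetical and are not the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (not the airline dataset)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Small hypothetical grid; the values are illustrative only
demo_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0], "class_weight": [None, "balanced"]},
    scoring="f1",
    cv=3,
)
demo_grid.fit(X_tr, y_tr)

# refit=True (the default) means best_estimator_ is already trained on all of X_tr
y_va_pred = demo_grid.best_estimator_.predict(X_va)
print(demo_grid.best_params_)
print("validation F1:", f1_score(y_va, y_va_pred))
```

Calling `demo_grid.predict(X_va)` gives the same predictions, because the search object delegates to the refitted best estimator.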



### 7.2 Decision Tree with feature selection

In [None]:
params = {'max_leaf_nodes': [2] + list(range(5, 101, 5)), 'min_samples_split': [2, 3, 4]} # 21 x 3 = 63 candidates
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3) # Grid search with 3-fold CV; no scoring is given, so accuracy is used
grid_search_cv.fit(X_train, y_train) # Train
grid_search_cv_best = grid_search_cv.best_estimator_ # Best estimator
print(grid_search_cv_best) # Show the best estimator

Fitting 3 folds for each of 63 candidates, totalling 189 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 189 out of 189 | elapsed: 71.8min finished


DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=2,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')


In [None]:
print("Best Model Parameter: ",grid_search_cv.best_params_) # Best parameter
print(grid_search_cv)

Best Model Parameter:  {'max_leaf_nodes': 2, 'min_samples_split': 2}
GridSearchCV(cv=3, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
           

In [None]:
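decisiontree = grid_search_cv_best # Use the best estimator found by the grid search
decisiontree.fit(X_train, y_train) # Refit on the training set (optional: GridSearchCV already refits the best estimator by default)
y_pred = decisiontree.predict(X_val) # Predict on the validation set
print(classification_report(y_val, y_pred)) # Evaluate the model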

              precision    recall  f1-score   support

           0       0.98      1.00      0.99   1040663
           1       1.00      0.10      0.19     20650

    accuracy                           0.98   1061313
   macro avg       0.99      0.55      0.59   1061313
weighted avg       0.98      0.98      0.98   1061313

