# Problem definition

To evaluate the PySpark framework, we will reproduce the same workflow with scikit-learn, running on a single local machine without cluster-based parallel computation.

The problem is a classification task based on flight information from North America. The dataset is the same one used with Spark and is available in the Databricks datasets. It contains 1,391,578 rows and 5 columns, and has no null values.

Our workflow consists of feature engineering, cross-validation, hyperparameter tuning, and a comparison between a simple model and an ensemble model.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import time

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# 1. Feature Engineering

The main difference between scikit-learn and the PySpark machine learning library is that Spark requires a single vector containing all features, represented in sparse format. That is not necessary here: we simply separate the data into X and y, with X being a DataFrame whose columns are the features.

We still perform the feature engineering, applying the same transformations used with Spark.<br>
First we will generate the target variable y, corresponding to delayed flights.<br>
Next we will transform the date column into a readable datetime format and bucketize it into 3-hour intervals over the 24 hours of the day.<br>
Finally we will apply one-hot encoding to the data.

## 1.1. Target variable y

The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time. We will therefore create the target variable y according to the FAA definition.

In [2]:
data = pd.read_csv('flights.csv')

In [3]:
df = data.copy()
df['label'] = np.where(df['delay'] >= 15, True, False)
df.value_counts('label')

label
False    1077104
True      314474
dtype: int64

We can see that the data is imbalanced, with about 3.4 times as many non-delayed flights as delayed ones.<br>
However, we will work with the data without changing its distribution, in order to evaluate the performance of both models.
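
As a quick check of that imbalance, the ratio can be computed directly from the two counts printed above. This is only a small worked calculation; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts output above
n_not_delayed = 1_077_104
n_delayed = 314_474

total = n_not_delayed + n_delayed
ratio = n_not_delayed / n_delayed      # non-delayed flights per delayed flight
delayed_share = n_delayed / total      # fraction of positive labels

print(f"total records: {total}")
print(f"non-delayed per delayed flight: {ratio:.2f}")
print(f"share of delayed flights: {delayed_share:.1%}")
```

A classifier that always predicts "not delayed" would therefore already be right on roughly 77% of the flights, which is worth keeping in mind when reading accuracy figures.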

In [4]:
# Transform boolean into numerical
df['label'] = df['label'].astype(int)

In [5]:
df

| |date|delay|distance|origin|destination|label|
|--|--|--|--|--|--|--|
|0|1011245|6|602|ABE|ATL|0|
|1|1020600|-8|369|ABE|DTW|0|
|2|1021245|-2|602|ABE|ATL|0|
|3|1020605|-4|602|ABE|ATL|0|
|4|1031245|-4|602|ABE|ATL|0|
|...|...|...|...|...|...|...|
|1391573|3310623|-10|139|YUM|PHX|0|
|1391574|3311505|-4|139|YUM|PHX|0|
|1391575|3311846|0|206|YUM|LAX|0|
|1391576|3310500|-7|206|YUM|LAX|0|


In [6]:
# Get number of records
print("The data contain %d records." % df.shape[0])

The data contain 1391578 records.


In [7]:
# Print DataFrame structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1391578 entries, 0 to 1391577
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   date         1391578 non-null  int64 
 1   delay        1391578 non-null  int64 
 2   distance     1391578 non-null  int64 
 3   origin       1391578 non-null  object
 4   destination  1391578 non-null  object
 5   label        1391578 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 63.7+ MB


## 1.2. Column departure

We will first convert the integer date column into a string, so that we can then parse it into a timestamp.<br>
Then we will bucketize the departure hour into 3-hour intervals.

In [8]:
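# Transform the integer-coded date (month, day, hour, minute) into a timestamp, using 2014 as the year
# zfill(8) restores the leading zero of single-digit months and also handles two-digit months
df['departure'] = '2014' + df['date'].astype(str).str.zfill(8)
df['departure'] = pd.to_datetime(df['departure'], format='%Y%m%d%H%M')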

# Get hour from departure time
df['hour'] = df['departure'].dt.hour
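
To make the encoding concrete: the integers in the date column appear to pack month, day, hour and minute together, with the leading zero of the month dropped. The sketch below decodes two values from the table above using only the standard library. The helper name `decode_departure` is hypothetical, and the year 2014 is taken from the prefix used in the cell above.

```python
from datetime import datetime

def decode_departure(code: int, year: int = 2014) -> datetime:
    """Decode an integer such as 1011245 (month, day, hour, minute)."""
    # Zero-pad to 8 digits so single-digit months parse like two-digit ones
    return datetime.strptime(f"{year}{code:08d}", "%Y%m%d%H%M")

first = decode_departure(1011245)   # first row of the table above
last = decode_departure(3310623)    # a row from the tail of the table

print(first, "-> hour", first.hour)
print(last, "-> hour", last.hour)
```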

In [9]:
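# Bucketize the departure hour into left-closed 3-hour intervals: [0, 3), [3, 6), ...
ranges = [0, 3, 6, 9, 12, 15, 18, 21, np.inf]
group_names = ['0-3h', '3-6h', '6-9h', '9-12h', '12-15h', '15-18h', '18-21h', '21-24h']
# right=False keeps hour 0 in '0-3h'; with the default right-closed bins it would become NaN
df['departure_bucket'] = pd.cut(df['hour'], bins=ranges, labels=group_names, right=False)
df['departure_bucket'] = df['departure_bucket'].astype('category')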
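
How `pd.cut` treats the bin edges decides which bucket a boundary hour such as 0 or 3 falls into. The following sketch, on a few hypothetical hours, compares the default right-closed bins with left-closed bins (`right=False`); it reuses the same edges and labels as the cell above.

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 3, 12, 23])   # hypothetical departure hours
ranges = [0, 3, 6, 9, 12, 15, 18, 21, np.inf]
group_names = ['0-3h', '3-6h', '6-9h', '9-12h', '12-15h', '15-18h', '18-21h', '21-24h']

# Default: right-closed bins (0, 3], (3, 6], ... so hour 0 matches no bin
right_closed = pd.cut(hours, bins=ranges, labels=group_names)

# right=False: left-closed bins [0, 3), [3, 6), ... matching the label names
left_closed = pd.cut(hours, bins=ranges, labels=group_names, right=False)

print(pd.DataFrame({'hour': hours,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

With right-closed bins, midnight departures end up without a bucket and each boundary hour is assigned to the earlier interval, so the left-closed variant is the one that agrees with the labels.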

## 1.3. Categorical features

The categorical features origin and destination contain the IATA codes of airports in North America. Since there are around 300 different airports in the dataset, we will replace each airport with its state, reducing the number of categories to a total of 65 states.

To do this, we need to import a second dataset containing the airport information. We will join the two tables to get the corresponding states.

Finally we will apply one-hot encoding to the states.

In [10]:
df_air = pd.read_csv('airport_codes_na.csv')

In [11]:
df_air

| |City|State|Country|IATA|
|--|--|--|--|--|
|0|Abbotsford|BC|Canada|YXX|
|1|Aberdeen|SD|USA|ABR|
|2|Abilene|TX|USA|ABI|
|3|Akron|OH|USA|CAK|
|4|Alamosa|CO|USA|ALS|
|...|...|...|...|...|
|521|Wrangell|AK|USA|WRG|
|522|Yakima|WA|USA|YKM|
|523|Yakutat|AK|USA|YAK|
|524|Yellowknife|NWT|Canada|YZF|


In [12]:
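# Join on origin (inner join: flights whose origin is not listed in df_air are dropped)
# rename() returns a new DataFrame, so the slice of df_air is never modified in place
df_origin = df_air[['IATA', 'State']].rename(columns={'IATA': 'origin', 'State': 'origin_state'})
df = df.merge(df_origin, on='origin')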

In [13]:
# Join on destination (inner join: flights whose destination is not listed in df_air are dropped)
df_dest = df_air[['IATA', 'State']].rename(columns={'IATA': 'destination', 'State': 'dest_state'})
df = df.merge(df_dest, on='destination')
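
Because `merge` performs an inner join by default, any flight whose airport code is missing from the lookup table is silently dropped. The sketch below shows one way to check for that on hypothetical miniature tables (the code 'XXX' and the frame names are made up for illustration): a left join with `indicator=True` flags the rows that found no match.

```python
import pandas as pd

# Hypothetical miniature versions of the flights and airport tables
flights = pd.DataFrame({'origin': ['ABE', 'YUM', 'XXX'], 'delay': [6, -10, 20]})
airports = pd.DataFrame({'IATA': ['ABE', 'YUM'], 'State': ['PA', 'AZ']})

lookup = airports.rename(columns={'IATA': 'origin', 'State': 'origin_state'})

# Inner join (the default): the flight with the unknown code 'XXX' disappears
inner = flights.merge(lookup, on='origin')

# Left join with indicator=True keeps every flight and flags unmatched rows
checked = flights.merge(lookup, on='origin', how='left', indicator=True)
n_unmatched = (checked['_merge'] == 'left_only').sum()

print(len(flights), len(inner), n_unmatched)
```

Comparing the row count before and after the real joins would show whether any flights were lost this way.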

In [14]:
# Select only feature columns
df = df[['distance', 'label', 'departure_bucket', 'origin_state', 'dest_state']]

In [15]:
# One-hot encoder
df = pd.get_dummies(df)
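
To illustrate what `pd.get_dummies` does to a mixed DataFrame, here is a minimal sketch on a hypothetical three-row frame: numeric columns are left as they are, while each string or categorical column is expanded into one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

toy = pd.DataFrame({
    'distance': [602, 369, 139],          # numeric column: left unchanged
    'origin_state': ['PA', 'PA', 'AZ'],   # string column: expanded into indicators
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

With 65 states on both the origin and the destination side plus the departure buckets, the real feature matrix ends up with well over a hundred columns, most of them zero in any given row.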

## 1.4. Train-test split

Before fitting any model or making predictions, we will split the data into training and test sets. This helps prevent data leakage throughout the process.

In [16]:
# Split the data
X = df.drop('label', axis=1)
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
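
The split above is random and unseeded, so the exact numbers change from run to run. As a possible variant that is not used in this notebook, `train_test_split` also accepts `random_state` for reproducibility and `stratify` to keep the class ratio identical in both sets. A minimal sketch on synthetic data (the arrays below are stand-ins, not the flight data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 samples with 25% positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 250 + [0] * 750)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, stratify=y, random_state=0)

# Both sets keep the original share of positives
print(y_train.mean(), y_test.mean())
```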

# 2. Models

We will apply the same two classification models used with Spark: Logistic Regression and a Random Forest ensemble.<br>
Being simpler, Logistic Regression will be used as the baseline model.<br>
We will tune the hyperparameters of both models with grid-search cross-validation and evaluate the best model on the test set.<br>
We will evaluate both models in two ways: ROC AUC and the confusion matrix.<br>
The difference from Spark is that here we first declare the models and their parameter grids, and then add them to the pipeline.

## 2.1. Evaluators

We do not need to declare the ROC AUC and confusion matrix evaluators, because scikit-learn already provides them as functions.
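
As a small illustration of how these two functions are called, the sketch below uses four hypothetical labels and scores. Note that `roc_auc_score` takes the predicted scores or probabilities, while `confusion_matrix` takes hard class predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])   # hypothetical predicted probabilities
y_pred = (scores >= 0.5).astype(int)       # hard labels at a 0.5 threshold

auc = roc_auc_score(y_true, scores)        # based on the ranking of the scores
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

print(auc)
print(cm)
```

Because ROC AUC depends only on how the scores rank the samples, a model can reach a reasonable AUC and still predict a single class at the default threshold.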

## 2.2. Logistic Regression

Logistic Regression will be our baseline model, for comparison.<br>
For its hyperparameters we will search over 'C', which is the inverse of the regularization strength λ, and 'penalty', which selects the regularization norm (here L2 or no regularization).

In [17]:
# Create Logistic Regression model
lr = LogisticRegression()

In [18]:
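# Make a grid for grid-search
# Note: the string 'none' is accepted by older scikit-learn releases; from version 1.2 on, use None instead
params_lr = {'lr__C': [1, .1, .01],
             'lr__penalty': ['l2', 'none']}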
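
The `lr__` prefix in the grid keys refers to the pipeline step named 'lr' that is built in section 3.1: scikit-learn addresses the parameters of a pipeline step as `<step name>__<parameter name>`. A minimal sketch of that naming convention, using a throwaway pipeline called `pipe`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])

# Every step parameter is exposed as '<step name>__<parameter name>'
lr_params = sorted(p for p in pipe.get_params() if p.startswith('lr__'))
print(lr_params[:3])

# Setting a value through the prefixed name updates the wrapped estimator
pipe.set_params(lr__C=0.1)
print(pipe.named_steps['lr'].C)
```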

## 2.3. Ensemble Random Forest

We will declare a Random Forest model and search for its best hyperparameters through grid-search cross-validation.<br>
For its hyperparameters we will search over 'max_features', which is the number of features to consider when looking for the best split, and 'max_depth', which is the maximum depth of each tree.

In [19]:
# Create Random Forest model
rf = RandomForestClassifier()

In [20]:
# Make a grid for grid-search
params_rf = {'rf__max_features': [None, 0.3, 'sqrt', 'log2'],
             'rf__max_depth': [2, 5, 10]}
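
The size of a grid determines how many models the search has to train: every combination of values is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below counts the candidates for both grids with `ParameterGrid`, assuming the 3 folds used later in `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

params_lr = {'lr__C': [1, .1, .01],
             'lr__penalty': ['l2', 'none']}
params_rf = {'rf__max_features': [None, 0.3, 'sqrt', 'log2'],
             'rf__max_depth': [2, 5, 10]}

n_folds = 3  # the cv value passed to GridSearchCV later on
for name, grid in [('Logistic Regression', params_lr), ('Random Forest', params_rf)]:
    n_candidates = len(ParameterGrid(grid))
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```

These counts correspond to the "Fitting 3 folds for each of ... candidates" messages printed during training.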

# 3. Logistic Regression Pipeline

Now we will define a pipeline that standardizes the data and runs grid-search cross-validation for Logistic Regression.

## 3.1. Scikit Pipeline

In [21]:
# Create the Pipeline
steps_lr = [('scaler', StandardScaler()),
            ('lr', lr)]
pipeline_lr = Pipeline(steps_lr)
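
Putting the scaler inside the pipeline matters for cross-validation: the scaler is fitted only on the data passed to `fit`, and the same statistics are reused when predicting. The sketch below shows this on tiny hypothetical arrays, where the extreme test value does not influence the learned mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0], [6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[100.0]])   # deliberately far from the training range

pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
pipe.fit(X_train, y_train)

# The scaler statistics come from X_train only
scaler = pipe.named_steps['scaler']
print(scaler.mean_)

# predict() scales X_test with the training statistics, then classifies it
print(pipe.predict(X_test))
```

Inside `GridSearchCV`, the same thing happens per fold, so the held-out fold never contributes to the scaling statistics.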

## 3.2. Hyperparameter Tuning

Now, through grid-search cross-validation, we search for the best hyperparameters on the training set and use the best model to predict on the test set, in order to evaluate the results.

In [22]:
# Create the CrossValidator
cv_lr = GridSearchCV(estimator=pipeline_lr,
                     param_grid=params_lr,
                     cv=3,
                     scoring='roc_auc',
                     verbose=1,
                     n_jobs=-1)

In [23]:
# Train the model and time it
t_start = time.time()
cv_lr.fit(X_train, y_train)
t_total = time.time() - t_start
print('Cross-validation training with Logistic Regression took around {:.2f} min.'.format(t_total/60))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Cross-validation training with Logistic Regression took around 1.98 min.


## 3.3. Cross-Validation Best Model

In [24]:
# Extract the best model
best_lr = cv_lr.best_estimator_

# Print best_lr
print(best_lr)

Pipeline(steps=[('scaler', StandardScaler()), ('lr', LogisticRegression(C=1))])


In [25]:
cv_lr.best_params_

{'lr__C': 1, 'lr__penalty': 'l2'}

## 3.4. Model Evaluation

In [26]:
predictions_lr = best_lr.predict(X_test)
pred_lr_proba = best_lr.predict_proba(X_test)

In [36]:
# Evaluate the predictions
print('roc_auc score: ', roc_auc_score(y_test.values, pred_lr_proba[:,1]))

print('confusion matrix:\n', confusion_matrix(y_test.values, predictions_lr))
print(classification_report(y_test.values, predictions_lr))

roc_auc score:  0.6511188909119557
confusion matrix:
 [[315988     42]
 [ 92253     60]]
              precision    recall  f1-score   support

           0       0.77      1.00      0.87    316030
           1       0.59      0.00      0.00     92313

    accuracy                           0.77    408343
   macro avg       0.68      0.50      0.44    408343
weighted avg       0.73      0.77      0.68    408343
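
The figures in the report for class 1 can be re-derived by hand from the confusion matrix printed above. The worked calculation below uses only those four numbers and also compares the accuracy with the trivial baseline of always predicting "not delayed".

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 315988, 42
fn, tp = 92253, 60

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)               # share of predicted delays that were real
recall = tp / (tp + fn)                  # share of real delays that were found
majority_baseline = (tn + fp) / total    # accuracy of always predicting "not delayed"

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.4f}")
print(f"majority baseline={majority_baseline:.4f}")
```

The model beats the trivial baseline by only 18 correct predictions out of 408,343 (60 true positives against 42 false positives), which is why the accuracy of 0.77 says very little here.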



# 4. Ensemble Random Forest Pipeline

Next we will define a pipeline that standardizes the data and runs grid-search cross-validation for Random Forest.

## 4.1. Scikit Pipeline

In [28]:
# Create the Pipeline
steps_rf = [('scaler', StandardScaler()),
            ('rf', rf)]
pipeline_rf = Pipeline(steps_rf)

## 4.2. Hyperparameter Tuning

Now, through grid-search cross-validation, we search for the best hyperparameters on the training set and use the best model to predict on the test set, in order to evaluate the results.

In [29]:
# Create the CrossValidator
cv_rf = GridSearchCV(estimator=pipeline_rf,
                     param_grid=params_rf,
                     cv=3,
                     scoring='roc_auc',
                     verbose=1,
                     n_jobs=-1)

In [30]:
# Train the model and time it
t_start = time.time()
cv_rf.fit(X_train, y_train)
t_total = time.time() - t_start
print('Cross-validation training with Random Forest took around {:.2f} min.'.format(t_total/60))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Cross-validation training with Random Forest took around 44.62 min.


## 4.3. Cross-Validation Best Model

In [31]:
# Extract the best model
best_rf = cv_rf.best_estimator_

# Print best_rf
print(best_rf)

Pipeline(steps=[('scaler', StandardScaler()),
                ('rf',
                 RandomForestClassifier(max_depth=10, max_features='log2'))])


In [32]:
cv_rf.best_params_

{'rf__max_depth': 10, 'rf__max_features': 'log2'}

## 4.4. Model Evaluation

In [33]:
predictions_rf = best_rf.predict(X_test)
pred_rf_proba = best_rf.predict_proba(X_test)

In [37]:
# Evaluate the predictions
print('roc_auc score: ', roc_auc_score(y_test.values, pred_rf_proba[:,1]))

print('confusion matrix:\n', confusion_matrix(y_test.values, predictions_rf))
print(classification_report(y_test.values, predictions_rf))

roc_auc score:  0.6541306973710934
confusion matrix:
 [[316030      0]
 [ 92313      0]]
              precision    recall  f1-score   support

           0       0.77      1.00      0.87    316030
           1       0.00      0.00      0.00     92313

    accuracy                           0.77    408343
   macro avg       0.39      0.50      0.44    408343
weighted avg       0.60      0.77      0.68    408343


# 5. Conclusions

Spark is optimized for handling big data, so it may take longer than a local library when dealing with small data.<br>
Comparing the performance of scikit-learn and Spark, we see that the Logistic Regression grid search was faster in scikit-learn, taking around 2 min.<br>
Random Forest, however, is much more expensive to train on the high-dimensional one-hot encoded data: Spark took around 22 min to perform the grid search, while scikit-learn took around 45 min.<br>
Although our data has around 1.4 million rows (30 MB), it is still very small compared to big data, where Spark really shines.

Looking at the evaluation of both models, we can see that both performed poorly, heavily impacted by the class imbalance. Random Forest actually turned out to be the worse model, because it predicted 0 (not delayed) for all cases, while still reaching the same accuracy as Logistic Regression and a slightly higher ROC AUC score.

|Metric|Logistic Regression|Random Forest|
|--|--|--|
|ROC AUC|0.6511|0.6541|
|Accuracy|0.77|0.77|
|Precision (class 1, delayed)|0.59|0.00|
|Recall (class 1, delayed)|0.00|0.00|
|F1-Score (class 1, delayed)|0.00|0.00|

Oversampling or undersampling, as well as other metrics, would be appropriate next steps, but we tried to reproduce only the steps that are similar in both frameworks, in order to compare their performance.
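
For reference, random undersampling of the majority class can be sketched in a few lines of pandas. This is not part of the workflow evaluated here: the helper name `undersample_majority` and the toy data are hypothetical, and in practice it should be applied to the training set only, leaving the test set untouched.

```python
import pandas as pd

def undersample_majority(df, label_col='label', random_state=0):
    """Randomly drop rows so that every class has as many rows as the smallest one."""
    n_min = df[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(label_col)]
    # Shuffle so the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data: 8 non-delayed flights and 2 delayed ones
toy = pd.DataFrame({'distance': range(10), 'label': [0] * 8 + [1] * 2})
balanced = undersample_majority(toy)

print(balanced['label'].value_counts())
```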

We conclude that the Spark framework is the better choice for big data: its parallel processing reduces processing time as the data scales up. Both frameworks may also perform differently depending on how they are configured.