## Model Training

This notebook explores a training algorithm for the trip analysis data.

In [1]:
import pandas as pd

data = pd.read_parquet('processed_data.parquet',engine='fastparquet') # speeds up the process
data = data.astype(pd.SparseDtype("float16", 0))

X = data.drop(columns=["TripType"])
y = data["TripType"]
print(X.shape)
print(y.shape)

(75456, 85965)
(75456,)
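The sparse cast above matters because the matrix has 85,965 columns, most of which are presumably zero for any given trip. As a minimal sketch on a toy frame (the values are made up, and `float32` is used only to keep the illustration simple), the share of explicitly stored entries can be checked like this:

```python
import pandas as pd

# Toy stand-in for the wide trip matrix (hypothetical values).
toy = pd.DataFrame({
    "produce": [0.0, 0.0, 1.0, 0.0],
    "hardware": [0.0, 2.0, 0.0, 0.0],
})

# Store only the non-zero entries; 0 is the implicit fill value.
toy_sparse = toy.astype(pd.SparseDtype("float32", 0))

# Fraction of entries that are explicitly stored (non-fill) values.
density = toy_sparse.sparse.density
print(f"density: {density:.2f}")
```

A low density on the real data would suggest that the sparse representation is worth keeping through the rest of the pipeline.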


The data is split into train and test sets: 80% for training and 20% for testing.

In [2]:
from sklearn.model_selection import train_test_split

def split_data(X: pd.DataFrame, y: pd.Series):
    """
    Return a stratified 80/20 train/test split as (X_train, X_test, y_train, y_test).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.2, 
        random_state=4, 
        stratify=y,
        shuffle=True
    )
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(X,y)
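To see what `stratify=y` buys, here is a small sketch on synthetic labels (the class shares and feature values are invented for illustration) that compares the class proportions in the two splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced target: 80% class 3, 15% class 8, 5% class 9.
y_toy = pd.Series(np.repeat([3, 8, 9], [800, 150, 50]))
X_toy = pd.DataFrame(rng.normal(size=(len(y_toy), 4)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

# Class shares side by side; with stratify they should match closely.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(shares)
```

Running the same comparison on the real `y_train` and `y_test` would be a quick sanity check that rare trip types are represented in both splits.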

## Baseline model

A random forest was chosen as the baseline model because it:
- Needs no feature scaling
- Is relatively resistant to overfitting
- Only looks at a subset of features at a time
- Handles non-linear relationships
- Provides feature importances
- Provides a class-balancing option off the shelf
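The balancing option refers to `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * class_count)`. A minimal sketch with made-up labels shows the weights that scikit-learn derives:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: class 8 is common, class 14 is rare.
y_toy = np.array([8] * 8 + [14] * 2)
classes = np.unique(y_toy)

# 'balanced' weight = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = {int(c): float(w) for c, w in zip(classes, weights)}
print(weight_by_class)
```

The rare class receives the larger weight, so its errors cost more during training.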

In [None]:
import joblib
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    random_state=43,
    class_weight='balanced',
    n_jobs=-1,
).fit(X_train.values,y_train.values)

# Save the model
joblib.dump(rf, 'random_forest.pkl')

Due to the high volume of data, the model took around 40 minutes to train, so it is saved to a pickle file to avoid retraining.
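A forest trained on this many features can produce a large file. One option, sketched here on a small synthetic model (the file name and the compression level are illustrative assumptions), is to pass `compress` to `joblib.dump` and confirm that the restored model predicts identically:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem standing in for the real data.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
small_rf = RandomForestClassifier(n_estimators=10, random_state=43).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small_rf.joblib")  # hypothetical file name
    # compress=3 trades some save/load time for a smaller file.
    joblib.dump(small_rf, path, compress=3)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_toy) == small_rf.predict(X_toy)).all())
print(same_predictions)
```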

In [5]:
# Load the model back into memory
rf_model = joblib.load('random_forest.pkl')

# Check if it loaded correctly
print(f"Model loaded. Number of features expected: {rf_model.n_features_in_}")

Model loaded. Number of features expected: 85965


In [8]:
from sklearn.metrics import classification_report

def evaluate_model_performance(model:RandomForestClassifier, X:pd.DataFrame, y:pd.Series, evaluation_name:str) ->None:
    """
    Predicts and prints a classification report. 
    Handles both sparse and dense targets automatically.
    """
    print(f"--- {evaluation_name} Performance ---")
    
    y_pred = model.predict(X.values if hasattr(X, 'values') else X)
    y_true = y.values.to_dense() if hasattr(y.values, 'to_dense') else y.values
    report = classification_report(y_true, y_pred)
    print(report)
    

Predict on the training set to assess whether the model is overfitting.

In [10]:
evaluate_model_performance(rf_model,X_train,y_train,'Baseline Train')

--- Baseline Train Performance ---
              precision    recall  f1-score   support

         3.0       0.94      0.91      0.92      2327
         4.0       0.11      0.94      0.20       226
         5.0       0.65      0.20      0.31      2181
         6.0       0.37      0.91      0.53       808
         7.0       0.62      0.56      0.59      3694
         8.0       0.80      0.37      0.50      7826
         9.0       0.93      0.08      0.14      6028
        12.0       0.52      0.87      0.65       163
        14.0       0.03      1.00      0.05         3
        15.0       0.31      0.60      0.41       622
        18.0       0.28      0.86      0.43       354
        19.0       0.25      0.94      0.39       245
        20.0       0.29      0.94      0.45       407
        21.0       0.32      0.86      0.47       418
        22.0       0.57      0.40      0.47       600
        23.0       0.17      0.99      0.29        84
        24.0       0.48      0.69      0.57   

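Overall results are poor even on the training set, which points to underfitting rather than overfitting. Some classes still score well, for example class 3.0 with an F1 of 0.92, while others such as 9.0 reach high precision (0.93) but very low recall (0.08).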

In [11]:
evaluate_model_performance(rf_model,X_test,y_test,'Baseline Test')

--- Baseline Test Performance ---
              precision    recall  f1-score   support

         3.0       0.95      0.92      0.93       582
         4.0       0.09      0.80      0.16        56
         5.0       0.65      0.20      0.30       545
         6.0       0.33      0.90      0.49       202
         7.0       0.60      0.56      0.58       924
         8.0       0.78      0.35      0.49      1957
         9.0       0.94      0.09      0.16      1507
        12.0       0.20      0.38      0.26        40
        14.0       0.00      0.00      0.00         0
        15.0       0.26      0.54      0.35       156
        18.0       0.23      0.65      0.34        88
        19.0       0.25      0.90      0.39        61
        20.0       0.29      0.85      0.43       102
        21.0       0.26      0.78      0.39       105
        22.0       0.41      0.29      0.34       150
        23.0       0.17      0.95      0.29        21
        24.0       0.47      0.64      0.55    

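Per-class reports are hard to compare between train and test at a glance. The sketch below (a hypothetical helper on synthetic data, not the notebook's own evaluation code) condenses each split into macro-F1 and balanced accuracy. It also passes `zero_division=0`, which scores classes without predictions as 0 instead of emitting a warning; scikit-learn warnings of that kind are likely for cases such as class 14.0, which has a support of 0 in the test report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def summarize_fit(model, X_train, y_train, X_test, y_test):
    """Return macro-F1 and balanced accuracy for the train and test splits."""
    splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
    summary = {}
    for name, (X_part, y_part) in splits.items():
        y_pred = model.predict(X_part)
        summary[name] = {
            # zero_division=0: undefined per-class scores count as 0, no warning.
            "macro_f1": f1_score(y_part, y_pred, average="macro", zero_division=0),
            "balanced_accuracy": balanced_accuracy_score(y_part, y_pred),
        }
    return summary


# Synthetic three-class problem standing in for the trip data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)
toy_rf = RandomForestClassifier(n_estimators=50, random_state=43).fit(X_tr, y_tr)

summary = summarize_fit(toy_rf, X_tr, y_tr, X_te, y_te)
for split_name, scores in summary.items():
    print(split_name, {k: round(v, 2) for k, v in scores.items()})
```

A large gap between the train and test rows hints at overfitting, while low scores on both hint at underfitting.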


Check the feature importances. None of the features appears to contribute much to the prediction: the highest importance is about 0.021.

In [12]:
importances = rf_model.feature_importances_

# Create a DataFrame of features sorted by importance
feature_df = pd.DataFrame({'Feature': data.drop(columns='TripType').columns, 'Importance': importances})
feature_df = feature_df.sort_values(by='Importance', ascending=False)
feature_df.head(20)

| Index | Feature | Importance |
|---|---|---|
| 85773 | 692303000000.0 | 0.021333 |
| 85945 | petsandsupplies | 0.020787 |
| 0 | Total | 0.020453 |
| 85948 | playersandelectronics | 0.017298 |
| 85913 | fabricsandcrafts | 0.016934 |
| 85912 | electronics | 0.015973 |
| 85951 | produce | 0.015846 |
| 85914 | financialservices | 0.015608 |
| 60699 | 68113102889.0 | 0.015204 |
| 85919 | hardware | 0.015186 |
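When importance is spread this thinly, it can help to ask how many top-ranked features are needed to cover a given share of the total. A minimal sketch follows (the helper name and the toy importances are assumptions for illustration):

```python
import numpy as np


def n_features_for_share(importances, share):
    """Smallest number of top-ranked features whose importances sum to at least `share`."""
    ranked = np.sort(np.asarray(importances, dtype=float))[::-1]
    cumulative = np.cumsum(ranked)
    # First position where the running total reaches the requested share.
    position = int(np.searchsorted(cumulative, share, side="left")) + 1
    return min(position, len(ranked))


toy_importances = [0.125, 0.5, 0.25, 0.125]
print(n_features_for_share(toy_importances, 0.7))
```

Applied to `rf_model.feature_importances_`, the same function would indicate how aggressively the feature set could be pruned.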


## Model 2
- Since training took a long time, I decided, based on the RF feature importances, to remove all features related to Upc.

In [13]:
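# Drop every column whose name is purely numeric (the Upc-derived features)
cols_to_drop = [item for item in X if item.replace('.', '', 1).isdigit()]
X_reduced = X.drop(columns=cols_to_drop)
# Note: default arguments, so this is an unstratified 75/25 split without a fixed seed
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reduced, y)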
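The column filter relies on a string test: remove at most one dot, then check whether only digits remain. A short sketch with sample names in the style of the feature table (the last two are hypothetical edge cases) shows which names it would flag:

```python
def looks_numeric(name: str) -> bool:
    """Same test as the list comprehension above: digits with at most one dot."""
    return name.replace(".", "", 1).isdigit()


sample_columns = ["692303000000.0", "68113102889.0", "produce", "Total", "1e5", "-3.0"]
flags = {name: looks_numeric(name) for name in sample_columns}
for name, is_numeric in flags.items():
    print(f"{name!r}: {is_numeric}")
```

Names in scientific notation or with a sign would not be caught, so it is worth confirming that no Upc column is formatted that way.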

In [14]:
from sklearn.ensemble import RandomForestClassifier

rf_r = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=43,
    class_weight='balanced',
    n_jobs=-1,
).fit(X_train_r.values,y_train_r.values)

In [15]:
evaluate_model_performance(rf_r,X_train_r,y_train_r,'No Upc model train')

--- No Upc model train Performance ---
              precision    recall  f1-score   support

         3.0       0.81      0.98      0.89      2158
         4.0       0.25      0.94      0.40       203
         5.0       0.91      0.74      0.81      2067
         6.0       0.84      0.97      0.90       750
         7.0       0.91      0.95      0.93      3449
         8.0       0.91      0.86      0.89      7364
         9.0       0.95      0.66      0.78      5684
        12.0       0.82      0.99      0.90       164
        14.0       1.00      1.00      1.00         2
        15.0       0.90      0.95      0.92       594
        18.0       0.77      0.94      0.84       332
        19.0       0.41      0.99      0.58       233
        20.0       0.72      0.98      0.83       386
        21.0       0.88      0.98      0.93       373
        22.0       0.84      0.78      0.81       539
        23.0       0.42      1.00      0.59        75
        24.0       0.90      0.98      0.9

In [16]:
evaluate_model_performance(rf_r,X_test_r,y_test_r,'No Upc model test')

--- No Upc model test Performance ---
              precision    recall  f1-score   support

         3.0       0.78      0.98      0.87       751
         4.0       0.17      0.49      0.25        79
         5.0       0.58      0.53      0.55       659
         6.0       0.66      0.74      0.70       260
         7.0       0.67      0.65      0.66      1169
         8.0       0.78      0.77      0.77      2419
         9.0       0.72      0.52      0.60      1851
        12.0       0.07      0.05      0.06        39
        14.0       0.00      0.00      0.00         1
        15.0       0.46      0.48      0.47       184
        18.0       0.29      0.47      0.36       110
        19.0       0.20      0.55      0.30        73
        20.0       0.46      0.77      0.58       123
        21.0       0.56      0.68      0.62       150
        22.0       0.42      0.35      0.38       211
        23.0       0.32      0.50      0.39        30
        24.0       0.57      0.49      0.53



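Model performance improved significantly on the test set. The model also fits the training data much better, reaching 91% on the training set, but the gap between the train and test reports suggests it is now overfitting.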

## Model 3

Let's try oversampling with SMOTE to balance the classes and generate more examples for the minority classes.

In [18]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42, k_neighbors=1)  # k_neighbors=1 because the smallest class has only 2 training samples

# Apply SMOTE only to the training set
X_train_smote, y_train_smote = smote.fit_resample(X_train_r, y_train_r)

print("Original training set shape:", X_train_r.shape, y_train_r.shape)
print("Resampled training set shape:", X_train_smote.shape, y_train_smote.shape)

Original training set shape: (56592, 71) (56592,)
Resampled training set shape: (279832, 71) (279832,)
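SMOTE creates each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below reproduces only that interpolation step in NumPy with two made-up points; it is an illustration of the idea, not imblearn's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two minority samples, the minimum SMOTE can work with when k_neighbors=1.
minority = np.array([[1.0, 0.0], [3.0, 4.0]])


def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between a sample and its neighbor."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbor - sample)


synthetic = np.array([interpolate(minority[0], minority[1], rng) for _ in range(5)])
print(synthetic)
```

With only two real samples, every synthetic point lies on the line between them, so the oversampled class carries little new information. That may explain why class 14.0 scores perfectly on the resampled training set but not on the test set.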


In [19]:
from sklearn.ensemble import RandomForestClassifier

rf_smote = RandomForestClassifier(
    n_estimators=200,
    max_depth=30,
    random_state=43,
    class_weight='balanced',
    n_jobs=-1,
).fit(X_train_smote.values,y_train_smote.values)

In [20]:
evaluate_model_performance(rf_smote,X_train_smote,y_train_smote,'SMOTE train model')

--- SMOTE train model Performance ---
              precision    recall  f1-score   support

         3.0       0.91      0.96      0.93      7364
         4.0       0.76      1.00      0.86      7364
         5.0       0.91      0.65      0.76      7364
         6.0       0.95      0.96      0.95      7364
         7.0       0.90      0.76      0.83      7364
         8.0       0.56      0.83      0.67      7364
         9.0       0.88      0.41      0.56      7364
        12.0       1.00      0.99      0.99      7364
        14.0       1.00      1.00      1.00      7364
        15.0       0.95      0.82      0.88      7364
        18.0       0.93      0.90      0.92      7364
        19.0       0.80      1.00      0.89      7364
        20.0       0.95      1.00      0.97      7364
        21.0       0.96      0.97      0.97      7364
        22.0       0.97      0.78      0.87      7364
        23.0       0.98      1.00      0.99      7364
        24.0       0.87      0.94      0.90

In [23]:
evaluate_model_performance(rf_smote,X_test_r,y_test_r,'SMOTE test model')

--- SMOTE test model Performance ---
              precision    recall  f1-score   support

         3.0       0.78      0.96      0.86       751
         4.0       0.16      0.59      0.25        79
         5.0       0.54      0.50      0.52       659
         6.0       0.56      0.82      0.67       260
         7.0       0.74      0.57      0.65      1169
         8.0       0.74      0.80      0.77      2419
         9.0       0.82      0.33      0.47      1851
        12.0       0.10      0.05      0.07        39
        14.0       0.00      0.00      0.00         1
        15.0       0.48      0.52      0.50       184
        18.0       0.29      0.64      0.40       110
        19.0       0.22      0.70      0.33        73
        20.0       0.35      0.91      0.51       123
        21.0       0.50      0.74      0.60       150
        22.0       0.41      0.37      0.39       211
        23.0       0.33      0.60      0.42        30
        24.0       0.57      0.53      0.55 



## Model 4
- Let's experiment with the k best features of the initial model, starting with k=1000.

In [24]:
importances = rf_model.feature_importances_

# Create a DataFrame of features sorted by importance
feature_df = pd.DataFrame({'Feature': data.drop(columns='TripType').columns, 'Importance': importances})
feature_df = feature_df.sort_values(by='Importance', ascending=False)

# Keep the 1000 most important features, then re-split (default unstratified 75/25 split, no fixed seed)
X_best = X[feature_df.loc[:,'Feature'].to_list()[:1000]]
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_best,y)
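One caveat with this approach: the importances come from a model fitted on a different split, so some rows that are now in the test set may have influenced which features were kept. An alternative, sketched on synthetic data with a small `k` standing in for 1000, is to split first and let scikit-learn's `SelectFromModel` pick the top `k` features from the training rows only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the trip matrix.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

k = 10  # stands in for k=1000 on the real data
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=43),
    max_features=k,
    threshold=-np.inf,  # rank purely by importance and keep the top k
)
selector.fit(X_tr, y_tr)  # importances are computed from training rows only

X_tr_k = selector.transform(X_tr)
X_te_k = selector.transform(X_te)
print(X_tr_k.shape, X_te_k.shape)
```

The same fitted selector is applied to both splits, which keeps the test set out of the selection step.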

In [25]:
rf_best = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=43,
    class_weight='balanced',
    n_jobs=-1,
).fit(X_train_b.values,y_train_b.values)

In [26]:
evaluate_model_performance(rf_best,X_train_b,y_train_b,'Best 1000 train')

--- Best 1000 train Performance ---
              precision    recall  f1-score   support

         3.0       0.97      1.00      0.98      2180
         4.0       0.41      0.87      0.56       210
         5.0       0.90      0.87      0.88      2032
         6.0       0.88      0.97      0.92       733
         7.0       0.94      0.97      0.96      3489
         8.0       0.93      0.91      0.92      7414
         9.0       0.96      0.71      0.82      5674
        12.0       0.99      1.00      1.00       152
        14.0       1.00      1.00      1.00         2
        15.0       0.94      0.96      0.95       572
        18.0       0.82      0.96      0.88       341
        19.0       0.62      0.93      0.74       222
        20.0       0.83      0.97      0.89       385
        21.0       0.90      0.99      0.95       400
        22.0       0.74      0.91      0.82       572
        23.0       0.65      1.00      0.79        77
        24.0       0.89      0.99      0.94  

In [28]:
evaluate_model_performance(rf_best,X_test_b,y_test_b,'Best 1000 test')

--- Best 1000 test Performance ---
              precision    recall  f1-score   support

         3.0       0.94      0.98      0.96       729
         4.0       0.17      0.26      0.21        72
         5.0       0.63      0.64      0.63       694
         6.0       0.72      0.70      0.71       277
         7.0       0.66      0.69      0.68      1129
         8.0       0.79      0.79      0.79      2369
         9.0       0.69      0.56      0.62      1861
        12.0       0.22      0.04      0.07        51
        14.0       0.00      0.00      0.00         1
        15.0       0.51      0.37      0.43       206
        18.0       0.32      0.50      0.39       101
        19.0       0.29      0.33      0.31        84
        20.0       0.49      0.66      0.56       124
        21.0       0.55      0.63      0.58       123
        22.0       0.39      0.53      0.45       178
        23.0       0.35      0.32      0.33        28
        24.0       0.56      0.60      0.58   

