## Advanced Modeling

### Author: Tilova Shahrin

- [Reading Dataframe](#df)
- [Test and Train Split](#split)
- [Logreg Scoring](#logreg)
- [Multi-Output Classifier](#multi)
- [Multi-Output Classifier With Different Weights](#multiweights)
- [Randomized Search CV](#search)
- [References](#ref)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

In [2]:
import warnings
warnings.filterwarnings('ignore')

<a id='df'></a>
### Reading Dataframe

In [5]:
parking_df = pd.read_csv('../data/parking_df.csv')

In [8]:
parking_df.head()

Unnamed: 0,datetime_of_infraction,time_of_infraction,year,month,day,infraction_code,infraction_description,set_fine_amount,location2,province,latitude,longitude,permit_time_restrictions,fee_related,time_related,fire_route,accessible_related,commercial_related,obstruction_related,cycle_related
0,2016-12-30 16:37:00,16:37:00,2016,12,30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,"1546 BLOOR ST W, TORONTO, ON, CANADA",ON,43.656337,-79.453142,0,0,0,0,0,0,0,0
1,2016-12-30 16:37:00,16:37:00,2016,12,30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,"5418 YONGE ST, TORONTO, ON, CANADA",ON,43.775587,-79.414671,0,0,0,0,0,0,0,0
2,2016-12-30 16:37:00,16:37:00,2016,12,30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,"777 QUEEN ST W, TORONTO, ON, CANADA",ON,43.646259,-79.40808,0,0,0,0,0,0,0,0
3,2016-12-30 16:37:00,16:37:00,2016,12,30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,"747 QUEEN ST E, TORONTO, ON, CANADA",ON,43.659131,-79.34808,0,0,0,0,0,0,0,0
4,2016-12-30 16:37:00,16:37:00,2016,12,30,403.0,STOP-SIGNED HIGHWAY-RUSH HOUR,150,"3042 DUNDAS ST W, TORONTO, ON, CANADA",ON,43.665651,-79.470785,0,0,0,0,0,0,0,0


In [6]:
parking_coord = parking_df[parking_df['latitude'] != 0.0]

<a id='split'></a>
### Test and Train Split

In [7]:
# Keep the coordinates (features) and the eight binary infraction flags (targets)
df_numerical_copy = parking_coord[['latitude', 'longitude', 'permit_time_restrictions', 'fee_related', 'time_related', 'fire_route', 'accessible_related', 'commercial_related', 'obstruction_related', 'cycle_related']]

# Features: location; targets: one 0/1 column per infraction category
X = df_numerical_copy[['latitude', 'longitude']]
y = df_numerical_copy.drop(columns=['latitude', 'longitude'])

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
y.columns

Index(['permit_time_restrictions', 'fee_related', 'time_related', 'fire_route',
       'accessible_related', 'commercial_related', 'obstruction_related',
       'cycle_related'],
      dtype='object')

<a id='logreg'></a>
### Logistic Regression: Predicting Type of Infraction Based on Location

In [9]:
# Standardize the features once, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {}

# Fit one binary logistic regression per infraction category
for column in y.columns:
    model = LogisticRegression()
    model.fit(X_train, y_train[column])

    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test[column], y_pred)

    models[column] = {'model': model, 'accuracy': accuracy}

for infraction, model_info in models.items():
    print(f"Infraction: {infraction}, Accuracy: {model_info['accuracy']}")


Infraction: permit_time_restrictions, Accuracy: 0.5955489654451519
Infraction: fee_related, Accuracy: 0.781054551015391
Infraction: time_related, Accuracy: 0.7142270890961848
Infraction: fire_route, Accuracy: 0.9779964730864752
Infraction: accessible_related, Accuracy: 0.9881319940611577
Infraction: commercial_related, Accuracy: 0.9838036852568434
Infraction: obstruction_related, Accuracy: 0.9931049034163829
Infraction: cycle_related, Accuracy: 0.9925745113714893
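
High accuracy on a rare target can be misleading, because a model that always predicts 0 already scores well. A minimal sketch of that check, assuming a synthetic target with roughly 1% positives (the data and the names `X_demo`, `y_demo`, and `baseline` are illustrative and not part of the parking data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, illustrative target: about 1% positives (not the parking data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 2))
y_demo = (rng.random(10_000) < 0.01).astype(int)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# Share of positives in the target
positive_rate = y_demo.mean()
print(f"positive rate: {positive_rate:.3f}, baseline accuracy: {baseline_accuracy:.3f}")
```

Comparing each per-target accuracy against this kind of baseline gives a rough sense of how much the location features actually add.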


<a id='multi'></a>
### Multi-Output Classifier

In [10]:
classifier = MultiOutputClassifier(RandomForestClassifier(random_state=42))

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.82      0.82    237106
           1       0.69      0.58      0.63    113107
           2       0.75      0.61      0.67    147630
           3       0.73      0.15      0.25     11367
           4       0.65      0.53      0.58      6131
           5       0.69      0.47      0.56      8367
           6       0.74      0.04      0.07      3562
           7       0.62      0.02      0.05      3836

   micro avg       0.77      0.67      0.72    531106
   macro avg       0.71      0.40      0.45    531106
weighted avg       0.77      0.67      0.71    531106
 samples avg       0.61      0.61      0.61    531106



For the obstruction, commercial, accessible, and cycle related infractions, there is a large imbalance between the 0 and 1 classes.
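
One way to quantify this imbalance is to look at the share of positives in each target column. A small sketch on a toy frame (the values in `y_toy` are made up for illustration; on the real data the same two lines could be applied to `y`):

```python
import pandas as pd

# Toy stand-in for the target frame `y` (values are made up for illustration)
y_toy = pd.DataFrame({
    "fee_related": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "cycle_related": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of positives
positive_share = y_toy.mean()

# Ratio of negatives to positives per target
imbalance_ratio = (1 - positive_share) / positive_share

print(positive_share)
print(imbalance_ratio)
```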

With these scores, we should consider the context of the problem and the relative importance of precision and recall. Here the rarer offence types make up only a small proportion of the tickets, which puts them at a disadvantage: recall drops as low as 0.02 for the least frequent targets. Let's try reweighting the classes to improve their scores.

To balance these targets, let's use the `class_weight` parameter of the random forest.
Refer to: https://stackoverflow.com/questions/58275113/proper-use-of-class-weight-parameter-in-random-forest-classifier

Let's try two methods, one using `"balanced"` and another as a dictionary.
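
To make the two options concrete, here is a small sketch of what `"balanced"` computes for a single binary target, and what the equivalent dictionary looks like. The toy array `y_toy` is an assumption for illustration; the formula `n_samples / (n_classes * np.bincount(y))` is the one scikit-learn documents for the balanced mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy binary target with 8 negatives and 2 positives (illustrative only)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y_toy)

# 'balanced' weights, as computed by scikit-learn
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the documented formula
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Equivalent dictionary form: keys are class labels, values are weights
weight_dict = {int(c): float(w) for c, w in zip(classes, balanced)}
print(weight_dict)
```

Note that in the dictionary form the keys are the class labels of one target (0 and 1 here), so a dictionary describes the weighting within a single binary problem.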

<a id='multiweights'></a>
### Multi-Output Classifier With Different Class Weights

In [90]:
classifier = MultiOutputClassifier(RandomForestClassifier(random_state=42, class_weight='balanced'))

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.60      0.86      0.71    122482
           1       0.51      0.84      0.64    113104
           2       0.51      0.89      0.65     58050
           3       0.10      0.86      0.17     11359
           4       0.20      0.97      0.33      6131
           5       0.25      0.98      0.39      8367
           6       0.03      0.77      0.07      4441
           7       0.04      0.97      0.08      1436

   micro avg       0.37      0.86      0.52    325370
   macro avg       0.28      0.89      0.38    325370
weighted avg       0.51      0.86      0.63    325370
 samples avg       0.32      0.45      0.36    325370



With `class_weight='balanced'`, recall improves sharply for every target (micro average 0.86), but precision falls (micro average 0.37), and the F1 scores for the rarer targets stay low.

In [93]:
# Create a dictionary for class_weight
# Note: the keys are read as class labels (0 and 1 here), not as target columns
weight = {0: 1, 1: 1, 2: 2, 3: 11, 4: 20, 5: 15, 6: 27, 7: 100}
classifier = MultiOutputClassifier(RandomForestClassifier(random_state=42, class_weight=weight))

classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.63      0.69    122482
           1       0.69      0.58      0.63    113104
           2       0.77      0.67      0.72     58050
           3       0.73      0.15      0.25     11359
           4       0.65      0.53      0.58      6131
           5       0.69      0.47      0.56      8367
           6       0.74      0.03      0.06      4441
           7       0.62      0.07      0.12      1436

   micro avg       0.73      0.59      0.65    325370
   macro avg       0.71      0.39      0.45    325370
weighted avg       0.73      0.59      0.64    325370
 samples avg       0.30      0.30      0.30    325370



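Precision is much higher than with `'balanced'` weights (micro average 0.73 versus 0.37), but recall for the fire route, obstruction, and cycle related infractions remains low (0.15, 0.03, and 0.07). The rows for several targets are identical to the unweighted model, which is consistent with the dictionary having no effect: the keys of a `class_weight` dictionary are class labels (here 0 and 1), not target columns, so each binary classifier receives a weight of 1 for both classes.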
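
If the goal is a different weight per infraction type, one option is to fit a separate random forest per target, each with its own `{0: ..., 1: ...}` dictionary. The sketch below assumes synthetic data and hypothetical weights (`positive_weight`, `common_target`, and `rare_target` are illustrative names, not values tuned for the parking data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the parking data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = pd.DataFrame({
    "common_target": (X_demo[:, 0] > 0).astype(int),
    "rare_target": (X_demo[:, 1] > 1.5).astype(int),
})

# Hypothetical weights for the positive class of each target
positive_weight = {"common_target": 1, "rare_target": 15}

per_target_models = {}
for column in y_demo.columns:
    # Keys are class labels: 0 = not this infraction type, 1 = this type
    clf = RandomForestClassifier(
        n_estimators=20,
        random_state=42,
        class_weight={0: 1, 1: positive_weight[column]},
    )
    per_target_models[column] = clf.fit(X_demo, y_demo[column])

# Stack the per-target predictions into one (n_samples, n_targets) array
y_pred_demo = np.column_stack(
    [per_target_models[c].predict(X_demo) for c in y_demo.columns]
)
print(y_pred_demo.shape)
```

The stacked predictions have the same layout as the output of `MultiOutputClassifier.predict`, so they can be passed to `classification_report` in the same way. `RandomForestClassifier` also accepts a list of dictionaries for multi-output targets, one per column of `y`, which may be a more compact alternative.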

<a id='search'></a>
### Randomized Search CV

#### Pipeline for RandomForestClassifier

In [83]:
rf = RandomForestClassifier(random_state=42)
pipe = Pipeline(steps=[('clf', MultiOutputClassifier(rf))])
param_grid = {
    'clf__estimator__n_estimators': [100, 300],  #Number of trees in the forest
    'clf__estimator__max_depth': [2, 15, 30],  #Maximum depth of the trees
}


#fit
pipe.fit(X_train, y_train)

#score
pipe.score(X_test, y_test)

0.6879978474600222
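
The parameter names in `param_grid` follow the nesting of the pipeline: the step name (`clf`), then the wrapped estimator inside `MultiOutputClassifier` (`estimator`), then the random forest parameter. A quick way to check the available names is to list them with `get_params()`; `demo_pipe` below is just an illustrative copy of the pipeline above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline(
    steps=[("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42)))]
)

# Names follow <step name>__estimator__<random forest parameter>
tunable = sorted(
    name for name in demo_pipe.get_params() if name.startswith("clf__estimator__")
)
print(tunable)
```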

In [84]:
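# Randomized search over the parameter grid defined above
search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    scoring='accuracy',
    refit=True,  # refit the best configuration on the full training set
    cv=3,
    verbose=2
)

# Fit the search; the cells below use the fitted object as `fittedgrid`
fittedgrid = search.fit(X_train, y_train)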

In [89]:
fittedgrid.score(X_train, y_train)

0.6892608825142256

In [92]:
fittedgrid.score(X_test, y_test)

0.6881333490773308

In [87]:
fittedgrid.best_estimator_

In [88]:
fittedgrid.best_params_

{'clf__estimator__n_estimators': 300, 'clf__estimator__max_depth': 30}
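
For reference, the whole search can be sketched end to end on a small synthetic problem. Everything here is an assumption for illustration: the data, the reduced parameter values, and the `n_iter` and `random_state` settings, which are not taken from the run above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Synthetic two-feature, two-target problem (not the parking data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 2))
Y_demo = np.column_stack([X_demo[:, 0] > 0, X_demo[:, 1] > 0]).astype(int)

demo_pipe = Pipeline(
    steps=[("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42)))]
)
demo_params = {
    "clf__estimator__n_estimators": [10, 20],
    "clf__estimator__max_depth": [2, 5],
}

# Sample 3 of the 4 parameter combinations, with 3-fold cross-validation
demo_search = RandomizedSearchCV(
    estimator=demo_pipe,
    param_distributions=demo_params,
    n_iter=3,
    scoring="accuracy",
    cv=3,
    random_state=42,
)
demo_search.fit(X_demo, Y_demo)
print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

With several targets, `scoring="accuracy"` is subset accuracy: a row counts as correct only if every target is predicted correctly, which is why the scores reported above are lower than the per-target accuracies.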

<a id='ref'></a>
### References

- Class Weight: https://stackoverflow.com/questions/58275113/proper-use-of-class-weight-parameter-in-random-forest-classifier
- Multi-Output Classifier: https://calmcode.io/course/scikit-meta/multi-output#:~:text=Using%20the%20MultiOutputClassifier&text=Instead%20of%20making%20two%20pipelines,this%20by%20using%20a%20MultiOutputClassifier.
- Tuning params for multiclassification output: https://datascience.stackexchange.com/questions/107867/how-to-train-multioutput-classification-with-hyperparameter-tuning-in-sklearn
- Metrics for scoring parameters: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules
- Random Forest Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html