## Functions to evaluate the model

In [9]:
# Define a helper that evaluates a fitted model on the test set
def model_performance(model, model_type, test_x, test_y):
    """Print accuracy, precision, recall and F1 for `model` on (test_x, test_y)."""
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Predict the labels for the test set
    y_pred = model.predict(test_x)
    # Calculate evaluation metrics using metrics in sklearn
    accuracy = accuracy_score(test_y, y_pred)
    precision = precision_score(test_y, y_pred)
    recall = recall_score(test_y, y_pred)
    f1 = f1_score(test_y, y_pred)

    # Print the evaluation metrics
    print(f'-----{model_type}-----')
    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1 Score:', f1)
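
To make the four printed metrics concrete, here is a minimal worked example on hand-made labels (the `y_true` and `y_hat` lists are hypothetical and unrelated to the notebook's data). It recomputes each metric from the confusion-matrix counts and compares it with the scikit-learn function used in the helper above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical toy labels: 8 samples, binary classes
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, confusion_matrix(...).ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_by_hand = (tp + tn) / (tp + tn + fp + fn)
precision_by_hand = tp / (tp + fp)   # share of predicted positives that are correct
recall_by_hand = tp / (tp + fn)      # share of actual positives that are found
f1_by_hand = (2 * precision_by_hand * recall_by_hand
              / (precision_by_hand + recall_by_hand))

print(accuracy_by_hand, accuracy_score(y_true, y_hat))
print(precision_by_hand, precision_score(y_true, y_hat))
print(recall_by_hand, recall_score(y_true, y_hat))
print(f1_by_hand, f1_score(y_true, y_hat))
```

With their default settings, `precision_score`, `recall_score` and `f1_score` treat label 1 as the positive class, so the helper as written assumes a binary target.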


### Logistic Regression

#### Logistic Regression - Base Model

In [None]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression(random_state=123)  # Using default parameters
lr_model.fit(train_x_fs, train_y)  # Fit the model with the training data

In [10]:
model_performance(lr_model, 'Logistic regression base model', test_x_fs, test_y)

-----Logistic regression base model-----
Accuracy: 0.6441996231773758
Precision: 0.6429097063801604
Recall: 0.6486978151510089
F1 Score: 0.6457907915769633


In [16]:
lr_model.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 123,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

#### Logistic Regression - Hyperparameter Tuning Using Grid Search
Grid search systematically works through different combinations of hyperparameters to find the combination that gives the best model performance.

The first step is to define the parameter grid. For logistic regression, this experiment tunes C (the inverse of regularization strength), penalty and solver.

GridSearchCV uses cross-validation (CV) to evaluate each combination. The training data is split into a number of subsets (folds), and for every combination the model is trained on all but one fold and scored on the held-out fold, rotating through the folds.

An explanation of each LogisticRegression parameter can be found in the scikit-learn documentation at
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
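
The cross-validation step described above can be sketched for a single parameter combination as follows. This is an illustration on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's `train_x_fs` and `train_y`). For a classifier, passing an integer such as `cv=5` makes scikit-learn use stratified folds, so the sketch builds the `StratifiedKFold` splitter explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)

# Five stratified folds: each fold serves once as the held-out validation set
cv = StratifiedKFold(n_splits=5)
fold_sizes = [(len(train_idx), len(val_idx))
              for train_idx, val_idx in cv.split(X_demo, y_demo)]
print(fold_sizes)

# Score one candidate setting on every fold; grid search repeats this per combination
candidate = LogisticRegression(C=1.0, max_iter=1000, random_state=123)
fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=cv, scoring='accuracy')
print(fold_scores, fold_scores.mean())
```

Grid search ranks the combinations by this mean validation score and keeps the best one.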

In [11]:
# Define the parameter grid for hyperparameter tuning of the logistic regression model
# Note: not every penalty/solver pair is valid; incompatible pairs fail during the
# search and their scores are set to nan (see the warning in the output below)
param_grid_lr = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.01, 0.1, 1.0, 10, 100],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
}

from sklearn.model_selection import GridSearchCV
lr = LogisticRegression(random_state=123)
# Set up the grid search: 5-fold CV, all CPU cores, ranked by accuracy
lr_grid_search = GridSearchCV(lr,
                              param_grid=param_grid_lr,
                              cv=5,
                              verbose=0,
                              n_jobs=-1,
                              scoring='accuracy')

# Fit the grid search to find the best parameters
lr_grid_search.fit(train_x_fs, train_y)

# Get and save the best param for lr
best_lr_param = lr_grid_search.best_params_

400 fits failed out of a total of 600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\pangy\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\pangy\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pangy\AppData\
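
The counts in the warning above follow from the grid: 4 penalties × 5 values of C × 6 solvers gives 120 candidates, and 5-fold CV turns that into 600 fits. The scikit-learn documentation lists which penalties each solver supports (for example, lbfgs supports only l2 or no penalty, and elasticnet is supported only by saga and also needs an `l1_ratio`), so many of the pairs in the grid cannot be fitted. One possible way to avoid the failed fits, sketched here on synthetic data and not taken from the notebook, is to pass `param_grid` as a list of sub-grids that contain only compatible settings. The names `param_grid_compatible`, `X_demo` and `y_demo`, and the single `l1_ratio` value of 0.5, are assumptions made for this illustration.

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

C_values = [0.01, 0.1, 1.0, 10, 100]

# One sub-grid per group of compatible settings (hypothetical alternative grid)
param_grid_compatible = [
    {'penalty': ['l2'], 'C': C_values,
     'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']},
    {'penalty': ['l1'], 'C': C_values, 'solver': ['liblinear', 'saga']},
    {'penalty': ['elasticnet'], 'C': C_values, 'solver': ['saga'], 'l1_ratio': [0.5]},
]

# 6*5 + 2*5 + 1*5 = 45 candidates, so 225 fits with 5-fold CV
n_candidates = len(ParameterGrid(param_grid_compatible))
print(n_candidates, 'candidates,', n_candidates * 5, 'fits')

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)

search = GridSearchCV(LogisticRegression(random_state=123, max_iter=2000),
                      param_grid=param_grid_compatible, cv=5, scoring='accuracy')
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # demo only: hide convergence/deprecation warnings
    search.fit(X_demo, y_demo)

# Candidates whose fits failed would show up as nan mean scores
n_failed = int(np.isnan(search.cv_results_['mean_test_score']).sum())
print('candidates with nan score:', n_failed)
print(search.best_params_)
```

Which pairs are accepted depends on the installed scikit-learn version, so the sub-grids should be checked against the documentation for that version.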

In [12]:
# Fit a logistic regression on the training data using the best parameters from the grid search
lr_finetuned_model = LogisticRegression(**best_lr_param, random_state=123)
lr_finetuned_model.fit(train_x_fs, train_y)

# Note: this reruns the whole grid search and rebinds lr_finetuned_model to the fitted
# GridSearchCV object, which is what provides best_params_ in the cells below
lr_finetuned_model = lr_grid_search.fit(train_x_fs, train_y)

  y = column_or_1d(y, warn=True)
400 fits failed out of a total of 600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\pangy\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\pangy\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
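
The `y = column_or_1d(y, warn=True)` line at the top of the output above is the tail of scikit-learn's `DataConversionWarning`, which is emitted when the target is passed as a column vector of shape (n_samples, 1) and not as a 1-D array. A minimal sketch of the usual remedy, flattening the target with `np.ravel`, is shown below on made-up data. The arrays and the helper `count_conversion_warnings` are hypothetical, and the sketch assumes the notebook's `train_y` is such a single-column object.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, six samples
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_column = np.array([[0], [0], [0], [1], [1], [1]])  # shape (6, 1), a column vector
y_flat = np.ravel(y_column)                          # shape (6,), a 1-D array


def count_conversion_warnings(y):
    """Fit a model on (X_demo, y) and count the DataConversionWarnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        LogisticRegression().fit(X_demo, y)
    return sum(issubclass(w.category, DataConversionWarning) for w in caught)


print(y_column.shape, count_conversion_warnings(y_column))
print(y_flat.shape, count_conversion_warnings(y_flat))
```

In the notebook this would correspond to passing something like `np.ravel(train_y)` to `fit`.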

In [13]:
# Save the result for future reference
import pickle
with open('CP_datasets2/lr_finetuned_model.pkl', 'wb') as file:
    pickle.dump(lr_finetuned_model, file)
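
For completeness, a saved model can be read back with `pickle.load`. The sketch below shows the round trip on a small synthetic model written to a temporary directory. The file name `demo_model.pkl` and the data are hypothetical, and the notebook's own file under `CP_datasets2/` is not touched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in model (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=123)
model = LogisticRegression(max_iter=1000, random_state=123).fit(X_demo, y_demo)

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as file:
    pickle.dump(model, file)
with open(path, 'rb') as file:
    restored = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that created them, and only from trusted sources.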

In [15]:
# Show the best combination of parameters for the best lr model 
lr_best_param = lr_finetuned_model.best_params_
print(lr_best_param)

{'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}


In [14]:
model_performance(lr_finetuned_model, 'Logistic Regression Fine-tuned model', test_x_fs, test_y)

-----Logistic Regression Fine-tuned model-----
Accuracy: 0.6439397455055085
Precision: 0.6426416482707873
Recall: 0.6484750607933768
F1 Score: 0.6455451765205902
