Sri Lahari Katla

Week 09 Assignment

For this week’s assignment, you are required to investigate the accuracy-computation time tradeoffs of the different optimization algorithms (solvers) that are available for fitting linear regression models in Scikit-Learn. Using the code shared via the Python notebook (part of this week’s uploads archive) where the use of logistic regression was demonstrated, complete the following operations:

Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

The Random Forest model reached the highest training accuracy in the provided Python notebook (**0.9993**), but that figure is misleading. Its test accuracy of **0.686** is far lower, a gap of roughly 0.31 that indicates substantial overfitting. Random Forest achieves high training accuracy because its ensemble of decision trees can fit the training data almost perfectly. The same flexibility hurts generalization here: the model memorizes the training data rather than learning patterns that carry over to the test set.

The **Logistic Regression models**, evaluated with various solvers and regularization settings, performed consistently across the training and test datasets. The model Logistic_L1_C_10 delivered the best results among them: 0.7347 accuracy on training data and 0.718 on test data. Both values exceed the baseline (null model) accuracy of **0.6467** on the training data and **0.608** on the test data. This model uses an L1 (LASSO) penalty with C=10; in Scikit-Learn, C is the inverse of the regularization strength, so C=10 is a comparatively weak penalty. L1 regularization can shrink the coefficients of unimportant features to exactly zero, which acts as feature selection and can improve generalization.
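
The effect of the L1 penalty on the coefficients can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption; this is not the notebook's dataset), and the names `X_demo`, `y_demo`, and `zero_counts` are hypothetical. It counts how many coefficients are exactly zero under a strong penalty (small C) and a weak one (large C).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the notebook's patient dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for C in (0.01, 10):
    # liblinear supports the L1 penalty; a smaller C means a stronger penalty
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_demo, y_demo)
    # Number of coefficients driven exactly to zero
    zero_counts[C] = int(np.sum(clf.coef_ == 0))

print(zero_counts)
```

On data like this, the smaller C should leave at least as many coefficients at zero as the larger one; the exact counts depend on the data.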

The **Logistic Regression model using cross-validation for selecting C (Logistic_L1_C_auto)** came close behind, with **0.7233** training accuracy and **0.708** testing accuracy. Cross-validation chose the regularization strength from the training data instead of fixing it by hand, although on this split the chosen value did not outperform C=10 on the test set.
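
One way Scikit-Learn supports this kind of tuning is `LogisticRegressionCV`, sketched below. Whether the notebook built Logistic_L1_C_auto this way is an assumption, and the data is a synthetic stand-in; `X_demo`, `cv_model`, and `chosen_C` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the notebook's dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Try 10 candidate C values with 5-fold cross-validation on the training data only
cv_model = LogisticRegressionCV(
    Cs=10, cv=5, penalty="l1", solver="liblinear", max_iter=1000
)
cv_model.fit(X_tr, y_tr)

# C_ holds the selected value; ravel() handles both array and scalar forms
chosen_C = float(np.ravel(cv_model.C_)[0])
test_accuracy = cv_model.score(X_te, y_te)
print(chosen_C, test_accuracy)
```

Because C is selected using only the training folds, the test accuracy remains an unbiased check of the tuned model.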

Fitting **Logistic Regression** models generally takes less time than fitting a **Random Forest model**, especially when the latter contains many estimators (trees) or is tuned with **GridSearchCV**. The cross-validated logistic regression model (Logistic_L1_C_auto) needed more computation than the single-fit logistic models but remained less computationally intensive than the Random Forest models.

The model that best balanced accuracy and generalization was **Logistic Regression with L1 regularization** using C=10. The Random Forest model showed high training accuracy but, because of its overfitting and higher computational cost, is the weaker choice in practice. The regularized logistic regression models deliver more dependable and consistent results, especially on real-world data sets with limited training samples.
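
The comparison above can be checked with a few lines of arithmetic. The sketch below uses only the accuracy values reported in this answer; the names `reported`, `gaps`, and `best_on_test` are hypothetical.

```python
# (training accuracy, test accuracy) as reported in the text above
reported = {
    "Random Forest": (0.9993, 0.686),
    "Logistic_L1_C_10": (0.7347, 0.718),
    "Logistic_L1_C_auto": (0.7233, 0.708),
    "Null model": (0.6467, 0.608),
}

# Generalization gap: how much accuracy is lost going from training to test data
gaps = {name: round(train - test, 4) for name, (train, test) in reported.items()}

# Model with the highest test accuracy
best_on_test = max(reported, key=lambda name: reported[name][1])

for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
print("Best on test data:", best_on_test)
```

The Random Forest gap (0.3133) is roughly twenty times the gap of either logistic model (0.0167 and 0.0153), which is the numerical basis for calling it overfit.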

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

In [2]:
df_health = pd.read_csv('./PatientAnalyticFile.csv')
df_health['mortality'] = np.where(df_health['DateOfDeath'].isnull(), 0, 1)
df_health['DateOfBirth'] = pd.to_datetime(df_health['DateOfBirth'])
df_health['Age_years'] = (pd.to_datetime('2015-01-01') - df_health['DateOfBirth']).dt.days / 365.25
cols_to_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth', 'Last_Appointment_Date', 'DateOfDeath', 'mortality']
X = pd.get_dummies(df_health.drop(columns=cols_to_remove))
y = df_health['mortality']

In [3]:
# Split the data into training and holdout subsets with 80% train and 20% holdout
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']
results = []

In [4]:
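for solver in solvers:
    # Start the clock; the elapsed time covers fitting and both accuracy computations
    start_time = time.time()

    # Fit with the current solver; max_iter is raised above the default of 100
    model = LogisticRegression(solver=solver, max_iter=5000)
    model.fit(X_train, y_train)

    # Accuracy on the training subset and on the unseen holdout subset
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    holdout_accuracy = accuracy_score(y_holdout, model.predict(X_holdout))

    end_time = time.time()
    elapsed_time = end_time - start_time

    # Store one row per solver for the summary table
    results.append([solver, train_accuracy, holdout_accuracy, elapsed_time])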

# Display results in a DataFrame
results_df = pd.DataFrame(results, columns=['Solver used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken (seconds)'])
print(results_df)

  Solver used  Training subset accuracy  Holdout subset accuracy  Time taken (seconds)
0   liblinear                  0.747938                  0.73600              0.151198
1       lbfgs                  0.748250                  0.73625              3.342900
2   newton-cg                  0.748062                  0.73625              0.556293
3         sag                  0.748062                  0.73625             59.538024
4        saga                  0.748062                  0.73625             53.172282


Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?



Holdout subset accuracy is the main criterion for ranking the models because it is the most trustworthy measure of generalization performance. The holdout subset is the 20% of the data that was withheld from training and used only to test the model on unseen observations. Because the main objective of a machine learning model is to predict new data well, this metric is the primary measure of model quality.

Holdout accuracy was nearly identical across all five solvers: 0.73625 for lbfgs, newton-cg, sag, and saga, and 0.73600 for liblinear. This shows that the solvers fit essentially the same logistic regression model on this dataset. The differences between them appear in execution time.

Execution time matters for model selection because it determines how practical a solver is on large datasets and during hyperparameter optimization, where the model is refit many times. The liblinear solver finished in only 0.15 seconds, making it the fastest of the five. The newton-cg and lbfgs solvers took longer, 0.56 seconds and 3.34 seconds respectively, but remained efficient. The sag and saga solvers were far slower, at 59.54 seconds and 53.17 seconds. Note that the measured time covers fitting plus computing both accuracy scores, because the timer stops only after the predictions are made.
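
If fit time and prediction time need to be reported separately, the timing could be split as in the sketch below. This is an illustrative variant, not the code that produced the table above: it runs on synthetic stand-in data, uses `time.perf_counter` (a monotonic clock suited to short intervals), and the names `X_demo` and `timings` are hypothetical.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

timings = {}
for solver in ("liblinear", "lbfgs"):
    model = LogisticRegression(solver=solver, max_iter=5000)

    t0 = time.perf_counter()
    model.fit(X_demo, y_demo)   # training only
    t1 = time.perf_counter()
    model.predict(X_demo)       # prediction only
    t2 = time.perf_counter()

    timings[solver] = {"fit_seconds": t1 - t0, "predict_seconds": t2 - t1}

for solver, t in timings.items():
    print(solver, t)
```

Absolute timings vary by machine and run, so repeating each measurement and averaging would give steadier numbers.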

Training subset accuracy is a less reliable performance indicator because it shows how well the model fits the data it was trained on rather than how well it generalizes. All solvers reached approximately 0.748 training accuracy, which confirms that each fit the training data equally well.

The liblinear solver is the best choice here: its holdout accuracy (0.73600) is within 0.00025 of every other solver's (0.73625), and it was by far the fastest. The lbfgs and newton-cg solvers took longer without a meaningful gain in accuracy. The sag and saga solvers were the slowest while offering the same accuracy as the others.
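
The ranking rule described in this answer (holdout accuracy first, execution time as the tie-breaker) can be sketched with the values from the results table. Treating accuracies that agree to three decimal places as tied is an assumed tolerance, and the names `reported`, `accuracy_tier`, and `ranking` are hypothetical.

```python
import pandas as pd

# Values copied from the results table above
reported = pd.DataFrame({
    "solver": ["liblinear", "lbfgs", "newton-cg", "sag", "saga"],
    "holdout_accuracy": [0.73600, 0.73625, 0.73625, 0.73625, 0.73625],
    "seconds": [0.151198, 3.342900, 0.556293, 59.538024, 53.172282],
})

# Assumed tolerance: accuracies equal to three decimals count as a tie
reported["accuracy_tier"] = reported["holdout_accuracy"].round(3)

# Higher accuracy tier first, then shorter execution time
ranked = reported.sort_values(["accuracy_tier", "seconds"], ascending=[False, True])
ranking = ranked["solver"].tolist()
print(ranking)
```

With this tolerance all five solvers fall in the same accuracy tier, so the order is decided by time: liblinear, newton-cg, lbfgs, saga, sag.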