Venkatesh


Week 09 Assignment



1. Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

The notebook reports training and test accuracy for each model. The fairest comparison looks at both figures together: a good model scores well on the test set and shows only a small gap between training and test accuracy, which indicates that it generalizes rather than overfits.
The 'Logistic_L1_C_10' model gave the best balance, with a training accuracy of 0.7347 and a test accuracy of 0.718. The 'RandomForest_noCV' model reached a training accuracy of 0.9993, but its test accuracy of 0.686 shows severe overfitting.
The standard logistic regression model matched 'Logistic_L1_C_10' on the test set, with a training accuracy of 0.7333 and a test accuracy of 0.718. The L1-penalized model with C=10 had a marginally higher training accuracy (0.7347 versus 0.7333), so it fit the training data slightly more closely while generalizing equally well.
The null model, which always predicts the most common class, had a training accuracy of 0.6467 and a test accuracy of 0.608 and serves as the benchmark. Every other model beat this baseline on the test set, which shows that they picked up meaningful patterns in the data.
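The comparison above can be sketched as a small table of train/test gaps. This is only an illustration built from the accuracy figures quoted in this answer; the labels "Logistic" and "Null" are placeholder names assumed here, not necessarily the names used in the notebook.

```python
# Accuracy figures as quoted in this answer: (train, test).
# "Logistic" and "Null" are assumed labels for the standard logistic
# regression model and the null model.
reported = {
    "Logistic_L1_C_10": (0.7347, 0.718),
    "Logistic": (0.7333, 0.718),
    "RandomForest_noCV": (0.9993, 0.686),
    "Null": (0.6467, 0.608),
}

def summarize(scores):
    """Return (model, train, test, gap) rows, highest test accuracy first."""
    rows = [(name, train, test, round(train - test, 4))
            for name, (train, test) in scores.items()]
    # Stable sort on test accuracy only; tied models keep their input order.
    return sorted(rows, key=lambda r: -r[2])

summary = summarize(reported)
for name, train, test, gap in summary:
    print(f"{name:<18} train={train:.4f} test={test:.3f} gap={gap:+.4f}")
```

A large positive gap, as for the random forest, is the overfitting signal; a small gap with a test accuracy above the null model is what a well-generalizing model looks like.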
The classification report for the logistic regression model (which performed similarly to the best model) shows precision of 0.76 and recall of 0.85 for class 0, and precision of 0.66 and recall of 0.52 for class 1. The model is therefore better at identifying negative cases (class 0) than positive cases (class 1).
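To put the two classes on a single scale, the precision and recall values quoted above can be combined into an F1 score (the harmonic mean of the two). The helper name `f1_from` is introduced here for illustration only and is not taken from the notebook.

```python
def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as quoted in this answer.
f1_class0 = f1_from(0.76, 0.85)
f1_class1 = f1_from(0.66, 0.52)

print(f"class 0 F1 = {f1_class0:.2f}")  # about 0.80
print(f"class 1 F1 = {f1_class1:.2f}")  # about 0.58
```

The lower class 1 score is driven mainly by its recall of 0.52: roughly half of the positive cases are missed.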
Overall, logistic regression with an L1 penalty and C=10 offered the best balance between training and test performance, which makes it the preferred classifier here. Its test accuracy equals that of standard logistic regression, so its edge over the unpenalized model is small, but it generalizes just as reliably.

In [2]:
import os
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

In [3]:
df_medical = pd.read_csv('PatientAnalyticFile.csv')

df_medical['mortality'] = np.where(df_medical['DateOfDeath'].isnull(), 0, 1)

df_medical['DateOfBirth'] = pd.to_datetime(df_medical['DateOfBirth'])
df_medical['Age_years'] = ((pd.to_datetime('2015-01-01') - df_medical['DateOfBirth']).dt.days / 365.25)

vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
# Sort the remaining columns so the formula (and the column order of X) is reproducible
vars_left = sorted(set(df_medical.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)

In [4]:
Y, X = dmatrices(formula, df_medical)

X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y),
    test_size=0.2,
    random_state=42
)

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

results = []

In [5]:
for solver in solvers:
    print(f"Fitting model with solver: {solver}")

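    # perf_counter is a monotonic, high-resolution clock suited to timing
    start_time = time.perf_counter()
    if solver == 'liblinear':
        # liblinear does not support penalty=None, so use a very large C
        # to make the L2 penalty negligible (effectively unregularized)
        clf = LogisticRegression(
            solver=solver,
            penalty='l2',
            C=1e9,
            max_iter=1000,
            random_state=42
        )
    else:
        # The other solvers can be fit with no penalty at all
        clf = LogisticRegression(
            solver=solver,
            penalty=None,
            max_iter=1000,
            random_state=42
        )

    clf.fit(X_train, y_train)

    # Elapsed time covers model construction and fitting only
    time_taken = time.perf_counter() - start_time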

    train_accuracy = accuracy_score(y_train, clf.predict(X_train))
    test_accuracy = accuracy_score(y_test, clf.predict(X_test))

    results.append({
        'Solver': solver,
        'Training Accuracy': train_accuracy,
        'Holdout Accuracy': test_accuracy,
        'Time (seconds)': time_taken
    })

Fitting model with solver: newton-cg
Fitting model with solver: lbfgs
Fitting model with solver: liblinear
Fitting model with solver: sag
Fitting model with solver: saga


In [6]:
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

   Solver  Training Accuracy  Holdout Accuracy  Time (seconds)
newton-cg           0.748062           0.73575        0.078830
    lbfgs           0.748250           0.73575        0.189710
liblinear           0.747938           0.73625        0.049972
      sag           0.747938           0.73575        2.011083
     saga           0.748000           0.73600        3.427948


4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?



The liblinear solver achieved the highest holdout accuracy, 73.625%, which is the most relevant metric for judging how well a model generalizes. The saga solver reached 73.600%, and newton-cg, lbfgs, and sag each reached 73.575%. The solvers are practically indistinguishable on predictive performance: their holdout accuracies differ by at most 0.05 percentage points.
Execution time differs far more. On this run, liblinear was fastest at about 0.050 seconds, roughly 1.6 times faster than newton-cg (0.079 seconds), 3.8 times faster than lbfgs (0.190 seconds), about 40 times faster than sag (2.011 seconds), and about 69 times faster than saga (3.428 seconds).
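The relative speeds can be recomputed directly from the "Time (seconds)" column of the results table printed above. This is a minimal sketch that hard-codes those printed values; timings from a single run are noisy, so the ratios are indicative only.

```python
# Fit times copied from the results table printed by the notebook.
fit_seconds = {
    "newton-cg": 0.078830,
    "lbfgs": 0.189710,
    "liblinear": 0.049972,
    "sag": 2.011083,
    "saga": 3.427948,
}

# Identify the fastest solver and express every time as a multiple of it.
fastest = min(fit_seconds, key=fit_seconds.get)
slowdown = {solver: t / fit_seconds[fastest] for solver, t in fit_seconds.items()}

for solver, ratio in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{solver:<10} {ratio:6.1f}x the time of {fastest}")
```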
The liblinear solver is therefore ranked first for this dataset: it has both the highest holdout accuracy and the shortest execution time. The ranking uses holdout accuracy as the primary criterion and execution time to separate solvers whose accuracy is essentially tied.
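One way to make this ranking rule explicit is to sort the solvers by holdout accuracy (descending) and break ties by execution time (ascending). The sketch below hard-codes the rows of the printed results table; the tuple layout is an assumption made for this illustration.

```python
# Rows copied from the printed results table:
# (solver, training accuracy, holdout accuracy, seconds)
rows = [
    ("newton-cg", 0.748062, 0.73575, 0.078830),
    ("lbfgs",     0.748250, 0.73575, 0.189710),
    ("liblinear", 0.747938, 0.73625, 0.049972),
    ("sag",       0.747938, 0.73575, 2.011083),
    ("saga",      0.748000, 0.73600, 3.427948),
]

# Primary key: higher holdout accuracy first. Tie-breaker: shorter time first.
ranked = sorted(rows, key=lambda r: (-r[2], r[3]))

for rank, (solver, _, holdout, seconds) in enumerate(ranked, start=1):
    print(f"{rank}. {solver:<10} holdout={holdout:.5f} time={seconds:.3f}s")
```

Under this rule liblinear comes first, followed by saga, with the three solvers tied at 0.73575 ordered by speed.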
Training accuracy was not used for ranking because it does not show how well a model predicts new data. It is still useful as a check: for every solver, training accuracy (about 74.8%) is only about 1.2 percentage points above holdout accuracy, which indicates no substantial overfitting.
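The overfitting check can be sketched by subtracting holdout accuracy from training accuracy for each solver, again using the values printed in the results table.

```python
# (training accuracy, holdout accuracy) copied from the printed results table.
accuracy = {
    "newton-cg": (0.748062, 0.73575),
    "lbfgs":     (0.748250, 0.73575),
    "liblinear": (0.747938, 0.73625),
    "sag":       (0.747938, 0.73575),
    "saga":      (0.748000, 0.73600),
}

# A small positive gap means the model does almost as well on unseen data.
gaps = {solver: train - holdout for solver, (train, holdout) in accuracy.items()}

for solver, gap in gaps.items():
    print(f"{solver:<10} gap = {gap * 100:.2f} percentage points")

largest_gap = max(gaps.values())
```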
Because the accuracy differences are so small, the choice rests mainly on speed, and liblinear is the best solver for this classification task.