Sanghameshwar

Week 09 Assignment - Machine Learning

1. Selecting the best-performing model requires balancing training accuracy against test accuracy. Standard logistic regression and L1-regularized logistic regression with C=10 (Logistic_L1_C_10) tied for the highest test accuracy, 0.718. The L1-regularized model had slightly higher training accuracy: 0.7347 versus 0.7333.
The Random Forest model from the initial cross-validation reached near-perfect training accuracy (0.9993) but only 0.686 on the test data. This gap of about 0.31 is a clear sign of overfitting: the model fit the training data closely but did not generalize.
The performance metrics for RandomForest_CV and RandomForest_CV2 are not visible in the last section of the notebook, which makes them difficult to include in the comparison.
Logistic_L1_C_10 gives the best overall results: it ties for the highest test accuracy (0.718) and has the highest training accuracy (0.7347) among the models that do not overfit. It fits the training data well and also generalizes to new observations.
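
The comparison above can be laid out as a small table. This is a minimal sketch that only uses the accuracies quoted in this answer; the row labels are informal names chosen for illustration (not identifiers from the notebook), and the ordering rule (test accuracy first, training accuracy as the tie-breaker) is an assumption that mirrors the reasoning in the text.

```python
import pandas as pd

# Accuracies quoted in the answer above. The index labels are informal
# names for this sketch, not variables defined in the notebook.
scores = pd.DataFrame(
    {
        "train_accuracy": [0.7333, 0.7347, 0.9993],
        "test_accuracy": [0.718, 0.718, 0.686],
    },
    index=["Logistic", "Logistic_L1_C_10", "RandomForest (initial CV)"],
)

# A large train-minus-test gap is a sign of overfitting.
scores["gap"] = scores["train_accuracy"] - scores["test_accuracy"]

# Order by test accuracy, using training accuracy only to break ties.
ranked = scores.sort_values(
    ["test_accuracy", "train_accuracy"], ascending=[False, False]
)
print(ranked.round(4))
```

The `gap` column makes the overfitting visible at a glance: it is below 0.02 for both logistic models and above 0.3 for the Random Forest.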

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices
from sklearn.metrics import accuracy_score

In [3]:
## Set print limits
pd.options.display.max_rows = 10
## Import Data
df_patient = pd.read_csv('./PatientAnalyticFile.csv')
df_patient.head()

Unnamed: 0,PatientID,DateOfBirth,Gender,Race,Myocardial_infarction,Congestive_heart_failure,Peripheral_vascular_disease,Stroke,Dementia,Pulmonary,...,Metastatic_solid_tumour,HIV,Obesity,Depression,Hypertension,Drugs,Alcohol,First_Appointment_Date,Last_Appointment_Date,DateOfDeath
0,1,1962-02-27,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2013-04-27,2018-06-01,
1,2,1959-08-18,male,white,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2005-11-30,2008-11-02,2008-11-02
2,3,1946-02-15,female,white,0,0,0,0,0,0,...,0,1,0,0,1,0,0,2011-11-05,2015-11-13,
3,4,1979-07-27,female,white,0,0,0,0,0,1,...,0,0,0,0,0,0,0,2010-03-01,2016-01-17,2016-01-17
4,5,1983-02-19,female,hispanic,0,0,0,0,0,0,...,0,0,0,0,1,0,0,2006-09-22,2018-06-01,


In [4]:
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)
df_patient['mortality'].head()

Unnamed: 0,mortality
0,0
1,1
2,0
3,1
4,0


In [5]:
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)
df_patient['Age_years'].head()

Unnamed: 0,Age_years
0,52.843258
1,55.373032
2,68.876112
3,35.433265
4,31.865845


In [6]:
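# Columns excluded from the predictors: identifiers, raw dates, and the outcome itself
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
              'Last_Appointment_Date', 'DateOfDeath', 'mortality']
# Sort the names so the predictor order in the formula is the same on every run
# (the iteration order of a set of strings can change between Python sessions)
vars_left = sorted(set(df_patient.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)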

# Create model matrices using the entire dataset
Y, X = dmatrices(formula, df_patient)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, np.ravel(Y), test_size=0.2, random_state=42)

# List of solvers to compare
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

In [8]:
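def evaluate_solver(solver):
    """Fit a logistic regression with the given solver; return accuracies and elapsed time."""
    # The timer covers fitting, prediction, and scoring
    start_time = time.time()

    # C is the inverse regularization strength, so C=1e5 makes the penalty very weak
    model = LogisticRegression(solver=solver, C=1e5, max_iter=1000, random_state=42)

    # Train the model
    model.fit(X_train, y_train)

    # Score on both the training and holdout subsets
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)

    time_taken = time.time() - start_time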

    return {
        'Solver': solver,
        'Training Accuracy': train_accuracy,
        'Holdout Accuracy': test_accuracy,
        'Time Taken (seconds)': time_taken
    }

results = []
for solver in solvers:
    try:
        print(f"Evaluating solver: {solver}")
        result = evaluate_solver(solver)
        results.append(result)
    except Exception as e:
        print(f"Error with solver {solver}: {str(e)}")

results_df = pd.DataFrame(results)

results_df['Training Accuracy'] = results_df['Training Accuracy'].map('{:.4f}'.format)
results_df['Holdout Accuracy'] = results_df['Holdout Accuracy'].map('{:.4f}'.format)
results_df['Time Taken (seconds)'] = results_df['Time Taken (seconds)'].map('{:.2f}'.format)
print(results_df)

Evaluating solver: newton-cg
Evaluating solver: lbfgs
Evaluating solver: liblinear
Evaluating solver: sag
Evaluating solver: saga
      Solver Training Accuracy Holdout Accuracy Time Taken (seconds)
0  newton-cg            0.7481           0.7355                 0.57
1      lbfgs            0.7479           0.7358                 0.59
2  liblinear            0.7479           0.7362                 0.27
3        sag            0.7479           0.7358                 8.07
4       saga            0.7480           0.7360                 8.77


4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?



The solvers should be compared on training accuracy, holdout accuracy, and execution time together. On holdout accuracy, liblinear performed best at 0.7362, with saga second at 0.7360. The gap between the best and worst solvers is only 0.0007 (0.7362 for liblinear versus 0.7355 for newton-cg).

The newton-cg solver had the highest training accuracy, 0.7481, but every solver fell between 0.7479 and 0.7481. All five solvers converged to nearly the same solution.
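
The spreads quoted above can be checked with a few lines of plain Python. This is only a sketch: the dictionaries repeat the values from the printed results table, and `spread` is a helper name introduced here for illustration.

```python
# Accuracies copied from the printed results table above.
training = {
    "newton-cg": 0.7481, "lbfgs": 0.7479, "liblinear": 0.7479,
    "sag": 0.7479, "saga": 0.7480,
}
holdout = {
    "newton-cg": 0.7355, "lbfgs": 0.7358, "liblinear": 0.7362,
    "sag": 0.7358, "saga": 0.7360,
}


def spread(scores):
    """Return the best value minus the worst value, rounded to four decimals."""
    return round(max(scores.values()) - min(scores.values()), 4)


print("training spread:", spread(training))
print("holdout spread:", spread(holdout))
print("best holdout solver:", max(holdout, key=holdout.get))
```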

Execution time differs substantially between solvers. The liblinear solver finished in 0.27 seconds, roughly half the time of newton-cg and lbfgs (0.57 and 0.59 seconds) and about 30 times faster than sag and saga (8.07 and 8.77 seconds).
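
The speed comparison can be made explicit by dividing each time by the liblinear time. The sketch below uses the single-run timings from the results table, so the ratios are only indicative; repeated runs would likely give somewhat different values.

```python
# Times in seconds, copied from the printed results table above (one run each).
times = {
    "newton-cg": 0.57, "lbfgs": 0.59, "liblinear": 0.27,
    "sag": 8.07, "saga": 8.77,
}

# Express every solver's time as a multiple of the fastest solver's time.
baseline = times["liblinear"]
ratios = {solver: round(seconds / baseline, 1) for solver, seconds in times.items()}

for solver, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{solver:<10} {ratio:>5.1f}x the liblinear time")
```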

Taking accuracy and execution time together, liblinear gave the best overall result: it had the highest holdout accuracy and was also the fastest solver by a wide margin. That speed is most valuable when a model has to be retrained many times. The scikit-learn documentation recommends liblinear for small datasets, which is consistent with these results; for large datasets it notes that sag and saga are faster.
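
One way to turn this reasoning into an explicit ranking is to sort by holdout accuracy and use execution time to break ties. This ordering rule is an assumption made for illustration; other combinations of the three metrics are equally defensible. The values are copied from the results table.

```python
# (solver, holdout accuracy, seconds) copied from the results table above.
results = [
    ("newton-cg", 0.7355, 0.57),
    ("lbfgs", 0.7358, 0.59),
    ("liblinear", 0.7362, 0.27),
    ("sag", 0.7358, 8.07),
    ("saga", 0.7360, 8.77),
]

# Higher holdout accuracy ranks first; faster time breaks ties.
ranking = sorted(results, key=lambda row: (-row[1], row[2]))

for rank, (solver, holdout, seconds) in enumerate(ranking, start=1):
    print(f"{rank}. {solver:<10} holdout={holdout:.4f} time={seconds:.2f}s")
```

Under this rule liblinear ranks first, and the tie between lbfgs and sag at 0.7358 is resolved in favor of lbfgs because it ran faster.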