Sanjay Nenavath


Week 09 Assignment - Machine Learning with Scikit-learn

1) Using the provided Python notebook, several models were fit to predict patient mortality: basic logistic regression, regularized logistic regression with several penalty settings, and a random forest. Basic logistic regression and logistic regression with an L1 penalty (C=10) performed best, with identical test accuracy of 0.718, which exceeds the null model benchmark of 0.608.
The random forest fit without cross-validation overfit badly: its training accuracy was 0.9993, but its test accuracy was only 0.686. The model appears to have memorized the training data rather than learned generalizable patterns. The logistic regression models had similar training and test accuracy, which implies better generalization.

Among the examined models, logistic regression with an L1 penalty (C=10) was marginally the best: it had the highest training accuracy of the logistic models (0.7347) and tied for the highest test accuracy (0.718). The L1 penalty shrinks the coefficients of unimportant variables toward zero, which helps identify the important predictors. This model is the best one in the notebook because it combines interpretability, reasonable computational demands, and the strongest predictive performance.
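
The shrinking effect of an L1 penalty can be illustrated with soft-thresholding, the update that proximal-gradient L1 methods apply to each coefficient. This is only a minimal sketch and not what the notebook or scikit-learn's solvers run internally; the function name, the coefficient values, and the threshold are assumptions made up for illustration.

```python
def soft_threshold(coef, threshold):
    """Shrink one coefficient toward zero; small ones become exactly 0."""
    if abs(coef) <= threshold:
        return 0.0
    return coef - threshold if coef > 0 else coef + threshold


# Hypothetical coefficients (not taken from the notebook).
coefficients = {"age": 2.0, "weak_signal": -0.3, "noise": 0.05, "comorbidity": -1.2}
threshold = 0.5  # assumed penalty strength; a smaller C means a stronger penalty

shrunk = {name: soft_threshold(value, threshold) for name, value in coefficients.items()}
kept = [name for name, value in shrunk.items() if value != 0.0]

for name, value in shrunk.items():
    print(f"{name:12s} {coefficients[name]:6.2f} -> {value:6.2f}")
print("Predictors kept:", kept)
```

Every coefficient moves toward zero, and the two small ones are removed entirely, which is why an L1-penalized model tends to be easier to interpret.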


Supporting evidence can be found in the notebook's performance results for each model. The Logistic_L1_C_10 model and the basic Logistic model had the same test accuracy of 0.718, but the L1-penalized version had slightly better training accuracy (0.7347 compared to 0.7333). Both models surpassed the null model baseline (0.608 test accuracy), which confirms that they have predictive power. The Random Forest model showed classic signs of overfitting: its training accuracy was 0.9993, but its test accuracy was only 0.686, a gap of 0.3133 that indicates poor ability to generalize. The regularized logistic regression models with other C values performed comparably to, but slightly worse than, the C=10 model, which makes C=10 the best setting tried for this dataset. Overall, the Logistic_L1_C_10 model offered the best combination of predictive performance and generalization.
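
The comparisons above reduce to simple arithmetic, sketched below. The sketch assumes the null model always predicts the most common class; the helper name, the toy labels, and the dictionary keys "Logistic" and "RandomForest" are illustrative, while the accuracies are the ones reported in this answer.

```python
from collections import Counter


def null_model_accuracy(labels):
    """Accuracy of always predicting the most common class in `labels`."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels for illustration only (0 = survived, 1 = died); not the patient data.
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
print(f"Null accuracy on toy labels: {null_model_accuracy(toy_labels):.2f}")

# Accuracies as reported in this answer: (training, test).
reported = {
    "Logistic": (0.7333, 0.718),
    "Logistic_L1_C_10": (0.7347, 0.718),
    "RandomForest": (0.9993, 0.686),
}
null_test_accuracy = 0.608

for name, (train_acc, test_acc) in reported.items():
    gap = train_acc - test_acc             # a large gap suggests overfitting
    lift = test_acc - null_test_accuracy   # improvement over the null model
    print(f"{name:18s} gap={gap:.4f}  lift over null={lift:.4f}")
```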

In [1]:
#2 #3

import pandas as pd
import numpy as np
import time
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)

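# Load the patient-level analytic file
df_patient = pd.read_csv('PatientAnalyticFile.csv')

# Outcome: 1 if a date of death is recorded, otherwise 0
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Age in years as of 2015-01-01
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days/365.25)

# Exclude identifiers, raw dates, and the outcome from the predictors
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath', 'mortality']
# Sort the names so the formula has the same column order on every run
vars_left = sorted(set(df_patient.columns) - set(vars_remove))
formula = "mortality ~ " + " + ".join(vars_left)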

# Create design matrices
y, X = dmatrices(formula, df_patient)
y = np.ravel(y)

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set: {X_train.shape[0]} samples, Test set: {X_test.shape[0]} samples")

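# Try every solver with a negligible penalty.
# C is the inverse of regularization strength, so a very large C means almost no penalty.
all_solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
C_value = 1e10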

# Store results
all_results = []

for solver in all_solvers:
    try:
        print(f"\nFitting model with solver: {solver}")

        # Create and fit model with small penalty
        model = LogisticRegression(penalty='l2', C=C_value, solver=solver, max_iter=2000, random_state=42)

        # Time the model fitting (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        model.fit(X_train, y_train)
        fit_time = time.perf_counter() - start_time

        # Make predictions
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)

        # Calculate accuracy
        train_accuracy = accuracy_score(y_train, y_train_pred)
        test_accuracy = accuracy_score(y_test, y_test_pred)

        # Store results
        all_results.append({
            'Solver': solver,
            'Training Accuracy': train_accuracy,
            'Holdout Accuracy': test_accuracy,
            'Time (seconds)': fit_time,
            'Penalty': f"L2 (C={C_value})"
        })

        # Print metrics
        print(f"  Training accuracy: {train_accuracy:.4f}")
        print(f"  Holdout accuracy: {test_accuracy:.4f}")
        print(f"  Fitting time: {fit_time:.4f} seconds")
        print(f"  Confusion matrix (holdout set):")
        print(confusion_matrix(y_test, y_test_pred))

    except Exception as e:
        print(f"  Error with solver {solver}: {e}")

results_df = pd.DataFrame(all_results)

results_df['Training Accuracy'] = results_df['Training Accuracy'].map(lambda x: f"{x:.4f}")
results_df['Holdout Accuracy'] = results_df['Holdout Accuracy'].map(lambda x: f"{x:.4f}")
results_df['Time (seconds)'] = results_df['Time (seconds)'].map(lambda x: f"{x:.4f}")

print("\nSolver Performance Comparison:")
print(results_df[['Solver', 'Training Accuracy', 'Holdout Accuracy', 'Time (seconds)']].to_string(index=False))

Training set: 16000 samples, Test set: 4000 samples

Fitting model with solver: newton-cg
  Training accuracy: 0.7481
  Holdout accuracy: 0.7355
  Fitting time: 0.1162 seconds
  Confusion matrix (holdout set):
[[2139  423]
 [ 635  803]]

Fitting model with solver: lbfgs
  Training accuracy: 0.7479
  Holdout accuracy: 0.7355
  Fitting time: 0.3245 seconds
  Confusion matrix (holdout set):
[[2139  423]
 [ 635  803]]

Fitting model with solver: liblinear
  Training accuracy: 0.7479
  Holdout accuracy: 0.7362
  Fitting time: 0.0569 seconds
  Confusion matrix (holdout set):
[[2140  422]
 [ 633  805]]

Fitting model with solver: sag
  Training accuracy: 0.7479
  Holdout accuracy: 0.7358
  Fitting time: 2.6909 seconds
  Confusion matrix (holdout set):
[[2140  422]
 [ 635  803]]

Fitting model with solver: saga
  Training accuracy: 0.7480
  Holdout accuracy: 0.7360
  Fitting time: 4.8133 seconds
  Confusion matrix (holdout set):
[[2140  422]
 [ 634  804]]

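Solver Performance Comparison:
   Solver Training Accuracy Holdout Accuracy Time (seconds)
newton-cg            0.7481           0.7355         0.1162
    lbfgs            0.7479           0.7355         0.3245
liblinear            0.7479           0.7362         0.0569
      sag            0.7479           0.7358         2.6909
     saga            0.7480           0.7360         4.8133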

4) The liblinear solver performed best. Solver performance was judged on three metrics: holdout accuracy (generalization performance), training accuracy, and fitting time. Holdout accuracy matters most because it shows how well the model predicts new data points.

The liblinear solver produced the highest holdout accuracy, 0.7362, ahead of newton-cg and lbfgs (0.7355), sag (0.7358), and saga (0.7360). The margin is small: on the 4,000-patient holdout set it amounts to between one and three additional correct predictions. Even so, liblinear made the most accurate predictions on unseen data, which is the fundamental requirement for a well-performing model.
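
The holdout accuracies can be recomputed from the confusion matrices printed in the notebook output, which also shows how many predictions separate the solvers. The sketch below copies those matrices by hand; the helper name is an assumption introduced for illustration.

```python
# Holdout confusion matrices copied from the notebook output:
# rows are actual classes, columns are predicted classes.
confusion_matrices = {
    "newton-cg": [[2139, 423], [635, 803]],
    "lbfgs":     [[2139, 423], [635, 803]],
    "liblinear": [[2140, 422], [633, 805]],
    "sag":       [[2140, 422], [635, 803]],
    "saga":      [[2140, 422], [634, 804]],
}


def correct_and_total(matrix):
    """Return (correct predictions, total predictions) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total


best_correct, _ = correct_and_total(confusion_matrices["liblinear"])
for solver, matrix in confusion_matrices.items():
    correct, total = correct_and_total(matrix)
    print(f"{solver:10s} {correct}/{total} correct "
          f"({best_correct - correct} fewer than liblinear)")
```

The diagonal of each matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count.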

The liblinear solver was also the fastest to fit, at 0.0569 seconds. That is far faster than sag (2.6909 seconds) and saga (4.8133 seconds), and also faster than newton-cg (0.1162 seconds) and lbfgs (0.3245 seconds), which were the next fastest solvers.
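
Each fitting time above comes from a single run, and single wall-clock measurements can be noisy. One common way to get a steadier number is to repeat the fit and keep the fastest run. The sketch below shows that idea with a placeholder workload; the helper names are hypothetical and nothing here was run in the notebook.

```python
import time


def best_fit_time(fit, repeats=5):
    """Call `fit()` `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit()
        timings.append(time.perf_counter() - start)
    return min(timings)


def placeholder_fit():
    """Stand-in workload; in the notebook this would be model.fit(X_train, y_train)."""
    sum(i * i for i in range(100_000))


elapsed = best_fit_time(placeholder_fit, repeats=3)
print(f"Best of 3 runs: {elapsed:.4f} seconds")
```

In the notebook, the same helper could be called as `best_fit_time(lambda: model.fit(X_train, y_train))` inside the solver loop.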

Training accuracy was nearly identical across solvers (0.7479 to 0.7481), so every solver fit the training data about equally well. Holdout accuracy remains the most important metric for choosing a model, because optimizing for training accuracy can lead to overfitting.

Considering all three metrics, liblinear is the best solver for this logistic regression problem. It achieved the highest holdout accuracy, training accuracy on par with the other solvers, and the shortest fitting time of all five, so it offers the best balance of speed and accuracy for this particular dataset.