# Week 09 Assignment - Logistic Regression Solver Comparison (PatientAnalyticFile.csv)
# Name: Sharath Kasula

In [None]:
"""
Assignment Questions Addressed (Updated with PatientAnalyticFile.csv):
1. Among the different classification models included in the Python notebook, which model had the best overall performance?
2. Fit a series of logistic regression models using different solvers (with L2 regularization), using the same 80/20 train-holdout split.
3. Report training accuracy, holdout accuracy, and time taken.
4. Summarize findings: Which solver performed best and why?

Summary of findings:
Five logistic regression models were fit to the PatientAnalyticFile dataset, each with L2 regularization and a different solver: lbfgs, newton-cg, sag, saga, and liblinear. Every model was trained on the same 80% of the data and evaluated on the remaining 20% holdout set.

All five solvers produced the same training accuracy (66.1%) and the same holdout accuracy (65.5%). The small gap between the two indicates little overfitting. The identical results suggest that the solvers converged to essentially the same solution on this dataset.

Because accuracy was the same, runtime is the only differentiator. lbfgs was the fastest (about 0.05 s), which makes it useful when time or computational resources are limited. sag and saga were the slowest (about 0.76 s and 0.38 s) and gave no accuracy advantage to justify the extra cost.

An accuracy near 65% means that none of the models predicted patient mortality well, but the close agreement between training and holdout results shows that the regularized models were stable.

Overall, all solvers generalized equally well. lbfgs offered the best trade-off between accuracy and runtime, which makes it the most practical choice for logistic regression on this dataset.
"""

In [1]:
# --- Step 1: Load and prepare the data ---
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


In [3]:
# Load PatientAnalyticFile.csv
df = pd.read_csv("PatientAnalyticFile.csv")

In [4]:
# Create binary outcome: 1 if patient has died, 0 if alive
df['Died'] = df['DateOfDeath'].notnull().astype(int)

In [5]:
# Drop non-numeric and date columns
drop_cols = ['PatientID', 'DateOfBirth', 'Gender', 'Race',
             'First_Appointment_Date', 'Last_Appointment_Date', 'DateOfDeath']
X = df.drop(columns=drop_cols + ['Died'])
y = df['Died']

In [6]:
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [7]:
# Split data (80% train, 20% holdout)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
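The scaler above is fit on all rows before the split, so the holdout rows contribute to the means and standard deviations used during training. A minimal sketch of one way to avoid that, assuming scikit-learn's `Pipeline` and synthetic stand-in data from `make_classification` (the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of the assignment):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patient features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)

# Split first, on the unscaled features
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and
# reuses those statistics when scoring the holdout rows
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=10000)),
])
pipe.fit(X_tr, y_tr)

train_acc_demo = pipe.score(X_tr, y_tr)
holdout_acc_demo = pipe.score(X_ho, y_ho)
print(f"train={train_acc_demo:.3f} holdout={holdout_acc_demo:.3f}")
```

With a large dataset the difference from scaling before the split is usually small, but the pipeline keeps the holdout set strictly unseen.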

In [8]:
# --- Step 2: Train Logistic Regression models with different solvers ---
solvers = ['lbfgs', 'newton-cg', 'sag', 'saga', 'liblinear']
results = []


In [9]:
for solver in solvers:
    model = LogisticRegression(penalty='l2', solver=solver, max_iter=10000)
    start_time = time.perf_counter()  # monotonic, high-resolution timer
    model.fit(X_train, y_train)
    elapsed_time = time.perf_counter() - start_time

    train_acc = accuracy_score(y_train, model.predict(X_train))
    holdout_acc = accuracy_score(y_holdout, model.predict(X_holdout))

    results.append({
        "Solver": solver,
        "Training Accuracy": train_acc,
        "Holdout Accuracy": holdout_acc,
        "Time Taken (s)": elapsed_time
    })

results_df = pd.DataFrame(results)
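A single fit that finishes in a few hundredths of a second is sensitive to background load, so the ranking of close timings can change between runs. A hedged sketch of one way to get steadier numbers is to repeat each fit and report the median. The helper `median_fit_time` and the synthetic data are assumptions for illustration only:

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def median_fit_time(solver, X, y, repeats=5):
    """Fit a fresh model `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        # L2 is the default penalty, so it is not passed explicitly
        model = LogisticRegression(solver=solver, max_iter=10000)
        start = time.perf_counter()
        model.fit(X, y)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Synthetic stand-in data (assumption: not the real patient file)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

timing_demo = {
    s: median_fit_time(s, X_demo, y_demo)
    for s in ["lbfgs", "newton-cg", "sag", "saga", "liblinear"]
}

# Print solvers from fastest to slowest
for s, t in sorted(timing_demo.items(), key=lambda kv: kv[1]):
    print(f"{s:>10}: {t:.4f} s")
```

The median is less affected than the mean by one slow run, such as the first fit after import.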

In [10]:
# --- Step 3: Report and display results ---
print("\nQuestion 1: Best Performing Model in Previous Notebook")
print("Answer: The Random Forest classifier had the best performance with high training and holdout accuracy.")

print("\nQuestion 2 & 3: Logistic Regression Solver Comparison Table (PatientAnalyticFile.csv)")
print(results_df.to_string(index=False))


Question 1: Best Performing Model in Previous Notebook
Answer: The Random Forest classifier had the best performance with high training and holdout accuracy.

Question 2 & 3: Logistic Regression Solver Comparison Table (PatientAnalyticFile.csv)
   Solver  Training Accuracy  Holdout Accuracy  Time Taken (s)
    lbfgs           0.660937           0.65475        0.047133
newton-cg           0.660937           0.65475        0.125444
      sag           0.660937           0.65475        0.761992
     saga           0.660937           0.65475        0.381480
liblinear           0.660937           0.65475        0.069024
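
Identical accuracies across solvers are what one would expect if all five converge to nearly the same coefficients, since they minimize an L2-penalized logistic loss on the same data. One way to check this is sketched below on synthetic data (an assumption, not the patient file). Note that liblinear also penalizes the intercept, so its solution can differ slightly from the others.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real patient file)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coefs = {}
preds = {}
for s in ["lbfgs", "newton-cg", "sag", "saga", "liblinear"]:
    m = LogisticRegression(solver=s, max_iter=10000, random_state=0)
    m.fit(X_demo, y_demo)
    coefs[s] = m.coef_.ravel()
    preds[s] = m.predict(X_demo)

# Compare every solver against lbfgs: largest coefficient gap
# and the share of rows that receive the same predicted label
for s in coefs:
    gap = np.max(np.abs(coefs[s] - coefs["lbfgs"]))
    agree = np.mean(preds[s] == preds["lbfgs"])
    print(f"{s:>10}: max coef gap={gap:.5f}, prediction agreement={agree:.4f}")
```

Small coefficient gaps with near-complete prediction agreement would support the reading that the solvers differ only in how quickly they reach the solution.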


In [11]:
# --- Step 4: Identify best solver and summarize findings ---
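# Rank by holdout accuracy, breaking ties by the shorter fit time
# (idxmax alone would return the first row whenever accuracies are tied)
best_solver = results_df.sort_values(
    ['Holdout Accuracy', 'Time Taken (s)'], ascending=[False, True]
).iloc[0]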
print("\nQuestion 4: Best Solver Analysis")
print(f"The best solver is '{best_solver['Solver']}' with a holdout accuracy of {best_solver['Holdout Accuracy']:.4f}.")
print("This solver provided consistent training accuracy and balanced computation time.\n"
      "All solvers performed similarly in accuracy, but the fastest solver ('lbfgs') may be preferred\n"
      "when computational efficiency is critical.")


Question 4: Best Solver Analysis
The best solver is 'lbfgs' with a holdout accuracy of 0.6548.
This solver provided consistent training accuracy and balanced computation time.
All solvers performed similarly in accuracy, but the fastest solver ('lbfgs') may be preferred
when computational efficiency is critical.


In [12]:

# Final Summary
print("\nFinal Conclusion:")
print(f"Using PatientAnalyticFile.csv and evaluating 5 solvers (lbfgs, newton-cg, sag, saga, liblinear),\n"
      f"the solver '{best_solver['Solver']}' showed the best balance of generalization performance,\n"
      f"training accuracy, and execution time.")


Final Conclusion:
Using PatientAnalyticFile.csv and evaluating 5 solvers (lbfgs, newton-cg, sag, saga, liblinear),
the solver 'lbfgs' showed the best balance of generalization performance,
training accuracy, and execution time.


Logistic regression models were fit to the PatientAnalyticFile dataset using five solvers (lbfgs, newton-cg, sag, saga, and liblinear), all with L2 regularization. Each model was trained on the same 80% of the data and tested on the remaining 20% holdout set.

All solvers reached a training accuracy of 66.1% and a holdout accuracy of 65.5%. The small gap indicates little overfitting, so the models perform about as well on new data as on the training data. The choice of solver had no effect on either metric, which suggests that all five converged to essentially the same solution.

lbfgs matched the accuracy of the other solvers and was the fastest (about 0.05 s), which makes it attractive when time or computational resources are limited. sag and saga were the slowest (about 0.76 s and 0.38 s), and newton-cg (about 0.13 s) was also slower than lbfgs and liblinear. None of the slower solvers improved predictive accuracy.

The close agreement between training and holdout results shows that logistic regression behaved stably, even though its accuracy remained low, probably because patient mortality is hard to predict from the available features.

All solvers generalized equally well. In practice, lbfgs was the most efficient because it offered the best balance between accuracy and execution time, which makes it a suitable choice for logistic regression on this dataset.
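
The holdout accuracy of about 65.5% reported above is easier to interpret next to a majority-class baseline, because accuracy depends on how the outcome classes are distributed. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data with an assumed class split of roughly 65/35. That split is an illustration only and says nothing about the real distribution of `Died`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 65/35 class split)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.65, 0.35], random_state=0
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_ho, y_ho)
model_acc = model.score(X_ho, y_ho)
print(f"baseline={baseline_acc:.3f} logistic={model_acc:.3f}")
```

On the real data, `y_holdout.value_counts(normalize=True)` would give the same baseline figure directly. A model accuracy close to that value would mean the features add little beyond the class frequencies.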