In [None]:
NAME: Yalam snigdha
Assignment number: 9

## Question 1: Which model had the best generalizable performance?

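Based on the comparison of model performances, the basic Logistic Regression model has the best generalizable performance, with a test accuracy of **0.718**. Logistic_L1_C_10 reaches the same test accuracy, but its gap between train and test accuracy is slightly larger (0.0167 versus 0.0153).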

| Model                  | Train Accuracy | Test Accuracy |
|------------------------|----------------|---------------|
| Logistic               | 0.7333         | **0.718**     |
| Null                   | 0.6467         | 0.608         |
| Logistic_L1_C_1        | 0.732          | 0.716         |
| Logistic_L1_C_01       | 0.726          | 0.706         |
| Logistic_L1_C_10       | 0.7347         | 0.718         |
| Logistic_L1_C_auto     | 0.7233         | 0.708         |
| Logistic_SL1_C_auto    | 0.7307         | 0.714         |
| RandomForest_noCV      | 0.9993         | 0.686         |

The Random Forest overfit: its train accuracy is near 1.0 (0.9993), but its test accuracy drops to 0.686. Most of the regularized logistic models score slightly lower than the basic model on both sets, which suggests mild underfitting. The basic logistic model strikes the best balance between training and test accuracy, and it clearly outperforms the Random Forest, which performs poorly on new data.
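
The overfitting argument can be made concrete by computing the train/test gap for each model. The sketch below is only an illustration that re-enters the accuracies from the table above; the `scores` dictionary and the tie-breaking rule (higher test accuracy first, then smaller gap) are assumptions made for this example, not part of the original analysis.

```python
# Accuracies copied from the comparison table: model -> (train, test).
scores = {
    "Logistic": (0.7333, 0.718),
    "Null": (0.6467, 0.608),
    "Logistic_L1_C_1": (0.732, 0.716),
    "Logistic_L1_C_01": (0.726, 0.706),
    "Logistic_L1_C_10": (0.7347, 0.718),
    "Logistic_L1_C_auto": (0.7233, 0.708),
    "Logistic_SL1_C_auto": (0.7307, 0.714),
    "RandomForest_noCV": (0.9993, 0.686),
}

# Generalization gap = train accuracy - test accuracy (larger suggests more overfitting).
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

# Rank by test accuracy (descending), breaking ties by the smaller gap.
ranking = sorted(scores, key=lambda name: (-scores[name][1], gaps[name]))

for name in ranking:
    train, test = scores[name]
    print(f"{name:<22} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}")
```

With these numbers the basic logistic model ranks first, and the Random Forest shows by far the largest gap (0.3133).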


In [10]:
# Question 2: 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv("PatientAnalyticFile.csv")

# Create binary classification target: 1 if patient died, else 0
df["Died"] = df["DateOfDeath"].notnull().astype(int)

# Use the same predictors the professor used
predictors = [
    'Gender', 'Race', 'Myocardial_infarction', 'Congestive_heart_failure',
    'Peripheral_vascular_disease', 'Stroke', 'Dementia', 'Pulmonary',
    'Obesity', 'Depression', 'Hypertension', 'Drugs', 'Alcohol'
]

# Filter dataset
X = df[predictors]
y = df["Died"]

# One-hot encode categorical features (Gender, Race)
X = pd.get_dummies(X, drop_first=True)

# Standardize all features (including the one-hot encoded columns)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into train and holdout (same across all models)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
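
A note on the cell above: the scaler is fitted on all rows before the split, so the holdout rows contribute to the means and standard deviations. A common alternative is to compute the scaling statistics on the training rows only and reuse them for the holdout rows. The sketch below illustrates that idea in plain Python on made-up numbers; `fit_standardizer` and `apply_standardizer` are hypothetical helpers written for this example, not scikit-learn functions.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Return per-column (mean, std) computed from the given rows only."""
    columns = list(zip(*rows))
    # Population standard deviation; fall back to 1.0 for constant columns.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def apply_standardizer(rows, stats):
    """Scale rows with previously fitted statistics."""
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]

# Toy data (made up for illustration): two features per row.
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
holdout_rows = [[4.0, 40.0]]

stats = fit_standardizer(train_rows)                      # statistics from training rows only
train_scaled = apply_standardizer(train_rows, stats)
holdout_scaled = apply_standardizer(holdout_rows, stats)  # reuse the training statistics

print(train_scaled)
print(holdout_scaled)
```

In the notebook itself, the equivalent would be to call `train_test_split` first, then `scaler.fit_transform` on the training split and `scaler.transform` on the holdout split.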


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
import pandas as pd

# List of solvers to test
solvers = ['liblinear', 'saga', 'lbfgs', 'newton-cg', 'sag']

# List to store results
results = []

# Loop through each solver
for solver in solvers:
    try:
        # Start timer to measure how long the solver takes to fit the model
        start_time = time.time()

        # Special handling for liblinear since it does NOT support penalty=None
        if solver == 'liblinear':
            model = LogisticRegression(penalty='l2', C=1e10, solver='liblinear', max_iter=1000)
        else:
            model = LogisticRegression(penalty=None, solver=solver, max_iter=1000)

        # Fit model
        model.fit(X_train, y_train)

        # Stop timer
        end_time = time.time()
        elapsed_time = end_time - start_time

        # Predict and calculate accuracies
        train_acc = accuracy_score(y_train, model.predict(X_train))
        holdout_acc = accuracy_score(y_holdout, model.predict(X_holdout))

        # Save result
        results.append({
            'Solver used': solver,
            'Training subset accuracy': round(train_acc, 4),
            'Holdout subset accuracy': round(holdout_acc, 4),
            'Time taken (s)': round(elapsed_time, 4)
        })

    except Exception as e:
        # If a solver fails, store the error
        results.append({
            'Solver used': solver,
            'Training subset accuracy': 'Error',
            'Holdout subset accuracy': 'Error',
            'Time taken (s)': str(e)
        })

# Display results
results_df = pd.DataFrame(results)
print(results_df)


  Solver used  Training subset accuracy  Holdout subset accuracy  \
0   liblinear                    0.6517                   0.6518   
1        saga                    0.6517                   0.6518   
2       lbfgs                    0.6517                   0.6518   
3   newton-cg                    0.6517                   0.6518   
4         sag                    0.6517                   0.6518   

   Time taken (s)  
0          0.0132  
1          0.1203  
2          0.0132  
3          0.0269  
4          0.2414  


**Question 3**

| Solver      | Training subset accuracy | Holdout subset accuracy | Time taken (s) |
|-------------|--------------------------|-------------------------|---------------|
| liblinear   | 0.6517                   | 0.6518                  | 0.0161        |
| saga        | 0.6517                   | 0.6518                  | 0.1191        |
| lbfgs       | 0.6517                   | 0.6518                  | 0.0359        |
| newton-cg   | 0.6517                   | 0.6518                  | 0.0269        |
| sag         | 0.6517                   | 0.6518                  | 0.2605        |



I trained logistic regression models using five different solvers in Scikit-Learn (liblinear, saga, lbfgs, newton-cg, and sag) and compared their accuracy and compute time. All models were trained on the same 80% training set and tested on the same 20% holdout set. Regularization was effectively turned off, either with penalty=None or, for liblinear (which does not support penalty=None), by setting C=1e10.
All solvers produced identical accuracy scores (65.17% on train and 65.18% on holdout), indicating the same generalization performance.
However, liblinear recorded the shortest fit time (0.0161s), making it the most efficient solver for this problem. These are single-run wall-clock timings, and the printed output above shows slightly different values, so small differences should be read with caution.
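
To see why a very large `C` approximates no penalty, it helps to write out the L2-penalized logistic regression objective in one common form, with labels $y_i \in \{-1, 1\}$, weights $w$, and intercept $c$ (this notation is introduced here for illustration and does not appear in the notebook):

$$
\min_{w, c} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\left( \exp\left( -y_i \left( x_i^{T} w + c \right) \right) + 1 \right)
$$

Because `C` multiplies the data-fit term, setting `C=1e10` makes the penalty $\frac{1}{2} w^{T} w$ negligible relative to the loss, so the liblinear fit is expected to land very close to the unpenalized solution found by the other solvers. The identical accuracies in the table are consistent with that.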


**Question 4**

Based on the results above, the accuracy measures are the same for all solvers (liblinear, saga, lbfgs, newton-cg, and sag):

- Training subset accuracy: 0.6517 for all solvers
- Holdout subset accuracy: 0.6518 for all solvers

Since accuracy is identical across solvers, the only differentiating factor is execution time. Ranked from fastest to slowest:

1. liblinear: 0.0161 seconds (fastest)
2. newton-cg: 0.0269 seconds
3. lbfgs: 0.0359 seconds
4. saga: 0.1191 seconds
5. sag: 0.2605 seconds (slowest)

So liblinear is the best solver in terms of computational efficiency: it reaches the same accuracy as every other solver in the least running time. If efficiency matters, liblinear is the best choice for this particular problem.
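
Since the times above differ by only fractions of a second, one way to make the ranking more robust is to repeat each fit several times and compare medians. The helper below is a hedged sketch: `median_fit_time` is a hypothetical function written for this example, and the workloads are stand-ins rather than the actual solvers.

```python
import time
from statistics import median

def median_fit_time(fit_fn, repeats=5):
    """Call fit_fn `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution timer suited to short intervals
        fit_fn()
        durations.append(time.perf_counter() - start)
    return median(durations)

# Stand-in workloads (assumptions for illustration); in the notebook each entry
# would instead wrap a real fit, e.g. lambda: model.fit(X_train, y_train).
workloads = {
    "small": lambda: sum(i * i for i in range(1_000)),
    "large": lambda: sum(i * i for i in range(200_000)),
}

timings = {name: median_fit_time(fn) for name, fn in workloads.items()}

# Rank from fastest to slowest by median time.
for name in sorted(timings, key=timings.get):
    print(f"{name}: {timings[name]:.6f} s")
```

Using the median rather than a single measurement reduces the influence of one-off delays such as first-call overhead or background load.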
