Week 09 - Machine Learning with Scikit-learn

Ashritha

Question 1

Among the different classification models included in the Python notebook, which model had the best overall performance? Support your response by referencing appropriate evidence.

Answer

The Random Forest models tuned with cross-validation (RandomForest_CV and RandomForest_CV2) had the highest training accuracy in the final notebook results, but they did not deliver the best test set performance. The logistic regression models did.

Logistic Regression with an L1 penalty at C=10 (Logistic_L1_C_10) and the basic Logistic Regression without regularization both reached a test accuracy of 0.718. This was higher than the test accuracy of the Random Forest models, whose high training accuracy combined with lower test accuracy indicates overfitting.

The Logistic Regression models generalized better because their training and test performance stayed close together. The Logistic_L1_C_10 model achieved a training accuracy of 0.7347 and a test accuracy of 0.718, showing only a small drop in performance on new data.

The Random Forest model reached a training accuracy of 0.9993 but a test accuracy of only 0.686, which indicates severe overfitting. The Logistic_L1_C_10 model therefore had the best overall performance, because it strikes the best balance between accuracy and generalization. For this patient mortality prediction task, regularized linear models performed better than the Random Forest models.
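The overfitting argument above rests on the gap between training and test accuracy. As a minimal sketch, the gap can be computed directly from the accuracies quoted in this answer. The dictionary keys and the helper name `generalization_gap` are illustrative only and do not come from the notebook.

```python
# Training and test accuracies as quoted in the answer above.
# The "RandomForest" key is shorthand for the Random Forest model discussed there.
reported = {
    "Logistic_L1_C_10": {"train": 0.7347, "test": 0.718},
    "RandomForest": {"train": 0.9993, "test": 0.686},
}


def generalization_gap(train_acc, test_acc):
    """Return the drop in accuracy from the training set to the test set."""
    return round(train_acc - test_acc, 4)


gaps = {
    name: generalization_gap(acc["train"], acc["test"])
    for name, acc in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```

A small gap suggests the model behaves similarly on unseen data, while a large gap is the usual sign of overfitting.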

Question 2

Next, fit a series of logistic regression models, without regularization. Each model should use the same set of predictors (all of the relevant predictors in the dataset) and should use the entire dataset, rather than a fraction of it. Use a randomly chosen 80% proportion of observations for training and the remaining for checking the generalizable performance (i.e., performance on the holdout subset). Be sure to ensure that the training and holdout subsets are identical across all models. Each model should choose a different solver.


Question 3

Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:



Answer

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

In [2]:
# Load the dataset
df_patient = pd.read_csv('./PatientAnalyticFile.csv')

# Convert DateOfDeath to a binary mortality variable
df_patient['mortality'] = np.where(df_patient['DateOfDeath'].isnull(), 0, 1)

# Convert DateOfBirth to datetime and calculate age
df_patient['DateOfBirth'] = pd.to_datetime(df_patient['DateOfBirth'])
df_patient['Age_years'] = ((pd.to_datetime('2015-01-01') - df_patient['DateOfBirth']).dt.days / 365.25)

# Remove irrelevant columns
vars_remove = ['PatientID', 'First_Appointment_Date', 'DateOfBirth',
               'Last_Appointment_Date', 'DateOfDeath']
df_patient = df_patient.drop(columns=vars_remove)

# Drop rows with missing values
df_patient = df_patient.dropna()

# Convert categorical variables to dummy variables
df_patient = pd.get_dummies(df_patient, drop_first=True)

In [3]:
X = df_patient.drop('mortality', axis=1)
y = df_patient['mortality']

# Split data into training and holdout subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
solvers = ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']
results = []

In [5]:
for solver in solvers:
    # Note: LogisticRegression applies an L2 penalty (C=1.0) by default,
    # so these fits are regularized unless the penalty is changed.
    model = LogisticRegression(solver=solver, max_iter=500)

    # Time only the fit step
    start_time = time.perf_counter()
    model.fit(X_train, y_train)
    time_taken = time.perf_counter() - start_time

    # Accuracy on the training and holdout subsets
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    results.append([solver, train_accuracy, test_accuracy, time_taken])

# Summarize the results in a table
results_df = pd.DataFrame(results, columns=['Solver used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken (seconds)'])
print(results_df)

  Solver used  Training subset accuracy  Holdout subset accuracy  Time taken (seconds)
0   liblinear                  0.748125                  0.73625              0.050992
1       lbfgs                  0.748125                  0.73600              0.337308
2   newton-cg                  0.748062                  0.73575              0.096384
3         sag                  0.748125                  0.73625              5.625022
4        saga                  0.748125                  0.73600              5.770922


Question 4

Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?


Answer

All solvers produced nearly identical accuracy, differing only at the fourth decimal place. The liblinear, lbfgs, sag, and saga solvers each reached a training accuracy of 0.748125, with a holdout accuracy of 0.73625 for liblinear and sag and 0.73600 for lbfgs and saga. The newton-cg solver was marginally lower, with a training accuracy of 0.748062 and a holdout accuracy of 0.73575.
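To make the size of these differences concrete, the following sketch computes the spread between the best and worst holdout accuracy. The values are copied from the results table; the variable names are illustrative.

```python
# Holdout accuracies copied from the results table.
holdout = {
    "liblinear": 0.73625,
    "lbfgs": 0.73600,
    "newton-cg": 0.73575,
    "sag": 0.73625,
    "saga": 0.73600,
}

best = max(holdout.values())
worst = min(holdout.values())

# Round to suppress floating-point noise in the subtraction.
spread = round(best - worst, 5)

print(f"best = {best}, worst = {worst}, spread = {spread}")
```

A spread this small is unlikely to be a meaningful basis for preferring one solver over another on its own.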

Because the models are essentially tied on accuracy, execution time is the next factor to consider. The liblinear solver finished in about 0.05 seconds, newton-cg in about 0.10 seconds, and lbfgs in about 0.34 seconds. The sag and saga solvers took about 5.63 and 5.77 seconds respectively, more than 100 times as long as liblinear. These two solvers are designed for very large datasets, and they converge quickly only when the features are on similar scales, which may explain why they are slower here, where the features were not scaled.

The liblinear solver gave the best results overall. The ranking weighed holdout accuracy and execution time equally. The liblinear solver tied for the highest holdout accuracy and had the fastest execution time, so its combination of accuracy and computational efficiency makes it the best choice for this classification problem.
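The ranking logic described above can be sketched as a sort on two keys: holdout accuracy first, then fit time. The figures are copied from the results table. Treating accuracies that agree to three decimal places as tied is an assumption made for this illustration, and `rank_key` is a hypothetical helper name.

```python
# (solver, holdout accuracy, fit time in seconds) copied from the results table.
results = [
    ("liblinear", 0.73625, 0.050992),
    ("lbfgs", 0.73600, 0.337308),
    ("newton-cg", 0.73575, 0.096384),
    ("sag", 0.73625, 5.625022),
    ("saga", 0.73600, 5.770922),
]


def rank_key(row):
    """Sort by holdout accuracy (higher is better), then by time (lower is better).

    Assumption: accuracies are rounded to 3 decimals so near-identical
    values count as ties and execution time breaks the tie.
    """
    solver, accuracy, seconds = row
    return (-round(accuracy, 3), seconds)


ranked = sorted(results, key=rank_key)
fastest = min(seconds for _, _, seconds in results)

for position, (solver, accuracy, seconds) in enumerate(ranked, start=1):
    slowdown = seconds / fastest
    print(f"{position}. {solver}: holdout={accuracy:.5f}, "
          f"time={seconds:.3f}s ({slowdown:.1f}x the fastest)")
```

With a stricter tie threshold the order could change, so the rounding precision is a design choice that should be stated alongside the ranking.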