**Week 09 Assignment**:- Machine Learning with Scikit-learn

**Name**:- Kurani Shreya Samir

**Assignment Number**:- Week_09 Assignment

**Date**:- April 6 2025


In this notebook, we examine how accuracy and computation time vary across the optimization algorithms (solvers) available in scikit-learn's `LogisticRegression`.

A training accuracy of approximately 73.33% and a testing accuracy of around 71.1% form the benchmark that is compared against the null model (and probably other classifiers).
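To make the null-model comparison concrete, here is a minimal sketch of how a majority-class baseline could be scored with scikit-learn's `DummyClassifier`. The synthetic data and the names `X_demo`, `y_demo`, and `null_model` are assumptions for illustration only; this is not the dataset behind the 73.33% and 71.1% figures.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (an assumption, not the dataset behind the figures above)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# The null model always predicts the most frequent class seen during training
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_tr, y_tr)

null_train_acc = null_model.score(X_tr, y_tr)
null_test_acc = null_model.score(X_te, y_te)
print(f"Null model training accuracy: {null_train_acc:.4f}")
print(f"Null model holdout accuracy:  {null_test_acc:.4f}")
```

A classifier is only useful if its holdout accuracy clearly exceeds this baseline.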


Next, using the same predictor set and the entire dataset, we fit a series of logistic regression models intended to be unregularized. Using an 80/20 training/holdout split (with a fixed random seed for reproducibility), we compare the solvers `newton-cg`, `lbfgs`, `liblinear`, `sag`, and `saga`.

For each model, we record:
- **Training subset accuracy**
- **Holdout (testing) subset accuracy**
- **Time taken to fit the model**


Finally, we summarize these results in a table and identify the best solver overall, using holdout accuracy as the primary metric and computation time as the secondary one.
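One way to approximate an unregularized fit uniformly across all five solvers is to pass a very large `C` to each of them, since `C` is the inverse of the regularization strength. The following is only a sketch under that assumption; the data and the names `X_demo`, `y_demo`, and `holdout_scores` are illustrative and do not come from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

holdout_scores = {}
for solver_name in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    # A very large C makes the default L2 penalty negligible for every solver
    clf = LogisticRegression(C=1e12, solver=solver_name, max_iter=1000, random_state=0)
    clf.fit(X_tr, y_tr)
    holdout_scores[solver_name] = clf.score(X_te, y_te)

print(holdout_scores)
```

On unscaled features, `sag` and `saga` may emit a convergence warning; standardizing the inputs usually helps them converge faster.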

In [14]:
# Import the required libraries
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [15]:
# To demonstrate, we generate a synthetic binary classification dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=101)

In [16]:
# Split the data into 80% training and 20% holdout, using random_state=101 for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)


In [17]:
# Here we define the solvers to compare
opt_methods = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
metrics = []

# Now we are going to loop over each and every solver to fit the model
for methods in opt_methods:
    # 'liblinear' does not support fitting without a penalty, so we use a very high C value to mimic no regularization.
    # The other solvers ('newton-cg', 'lbfgs', 'sag', 'saga') do accept an unpenalized fit,
    # but here they are fit with penalty='l2' and the default C=1.0, so they remain L2-regularized.
    if methods == 'liblinear':
        model_instance = LogisticRegression(penalty='l2', C=1e12, solver=methods, max_iter=1000, random_state=101)
    else:
        model_instance = LogisticRegression(penalty='l2', solver=methods, max_iter=1000, random_state=101)

    # Record the start time, fit the model, then record the end time
    start_time = time.time()
    model_instance.fit(X_train, y_train)
    end_time = time.time()

    # Now we calculate the accuracies for training and testing
    train_accuracy = accuracy_score(y_train, model_instance.predict(X_train))
    test_accuracy = accuracy_score(y_test, model_instance.predict(X_test))

    # Here we append the results into a metrics list
    metrics.append([methods, train_accuracy, test_accuracy, end_time - start_time])

# Create a DataFrame with descriptive column names
result_data = pd.DataFrame(metrics, columns=['Opt_Methods used', 'Training subset accuracy', 'Holdout subset accuracy', 'Time taken (seconds)'])
result_data


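|   | Opt_Methods used | Training subset accuracy | Holdout subset accuracy | Time taken (seconds) |
|---|------------------|--------------------------|-------------------------|----------------------|
| 0 | newton-cg        | 0.83875                  | 0.88                    | 0.011093             |
| 1 | lbfgs            | 0.83875                  | 0.88                    | 0.012378             |
| 2 | liblinear        | 0.84                     | 0.88                    | 0.006389             |
| 3 | sag              | 0.84                     | 0.88                    | 0.013868             |
| 4 | saga             | 0.84                     | 0.88                    | 0.02149              |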


In [18]:
# Now we are going to print the results table
print("Results Summary for the Logistic Regression Optimization Methods")
print(result_data)

Results Summary for the Logistic Regression Optimization Methods
  Opt_Methods used  Training subset accuracy  Holdout subset accuracy  \
0        newton-cg                   0.83875                     0.88   
1            lbfgs                   0.83875                     0.88   
2        liblinear                   0.84000                     0.88   
3              sag                   0.84000                     0.88   
4             saga                   0.84000                     0.88   

   Time taken (seconds)  
0              0.011093  
1              0.012378  
2              0.006389  
3              0.013868  
4              0.021490  


# Q3. Compare the results of the models in terms of their accuracy (use this as the performance metric to assess generalizability error on the holdout subset) and the time taken (use appropriate timing function). Summarize your results via a table with the following structure:

Here is the **Summary Table & Analysis**


| Opt_Methods used | Training subset accuracy | Holdout subset accuracy | Time taken (seconds) |
|------------------|--------------------------|-------------------------|----------------------|
| newton-cg       | 0.83875                  | 0.88                    | 0.011093             |
| lbfgs           | 0.83875                  | 0.88                    | 0.012378             |
| liblinear       | 0.84000                  | 0.88                    | 0.006389             |
| sag             | 0.84000                  | 0.88                    | 0.013868             |
| saga            | 0.84000                  | 0.88                    | 0.021490             |
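Fit times of a few milliseconds measured once can fluctuate from run to run. A hedged sketch of a more stable measurement, assuming repeated fits timed with `time.perf_counter` and summarized by the median, is given here. The helper `median_fit_time`, its `n_repeats` parameter, and the synthetic data are hypothetical and not part of the original notebook.

```python
import time
from statistics import median

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)


def median_fit_time(solver_name, n_repeats=5):
    """Return the median wall-clock fit time in seconds over n_repeats fits."""
    durations = []
    for _ in range(n_repeats):
        clf = LogisticRegression(solver=solver_name, max_iter=1000, random_state=0)
        start = time.perf_counter()  # monotonic, high-resolution clock
        clf.fit(X_demo, y_demo)
        durations.append(time.perf_counter() - start)
    return median(durations)


fit_times = {name: median_fit_time(name) for name in ["lbfgs", "liblinear"]}
print(fit_times)
```

The median is less sensitive than a single reading to one-off delays such as a cold cache on the first fit.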



# Q4. Based on the results, which solver yielded the best results? Explain the basis for ranking the models - did you use training subset accuracy? Holdout subset accuracy? Time of execution? All three? Some combination of the three?

Holdout subset accuracy is the main evaluation criterion because it measures how well a model predicts new data points. All solvers reached a holdout accuracy of 0.88, and their training subset accuracies were nearly identical, ranging from 0.83875 to 0.84. Because accuracy is uniform across the models, computation time becomes the deciding factor.

The recorded computation times are:

1. newton-cg: 0.011093 seconds

2. lbfgs: 0.012378 seconds

3. liblinear: 0.006389 seconds

4. sag: 0.013868 seconds

5. saga: 0.021490 seconds


Since the models show equal predictive accuracy, we can rank them by execution time. The liblinear solver required just 0.006389 seconds to fit, the shortest time of the five.

Thus, the evaluation considers three factors:

1. Every solver reached a holdout subset accuracy of 0.88.

2. All models showed comparable training subset accuracy, between 0.83875 and 0.84.

3. liblinear was the fastest solver at 0.006389 seconds, roughly 1.7 times faster than the next fastest, newton-cg (0.011093 seconds).

Hence, to conclude, the liblinear solver is the most effective choice here because it matches the accuracy of the other solvers in the shortest computation time.
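The ranking rule used in this answer (holdout accuracy first, computation time as the tie-breaker) can be sketched in pandas as follows. The values are copied from the results table reported above; the names `results`, `ranked`, and `best_solver` are illustrative.

```python
import pandas as pd

# Values copied from the results table reported in this notebook
results = pd.DataFrame(
    {
        "Opt_Methods used": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
        "Training subset accuracy": [0.83875, 0.83875, 0.84, 0.84, 0.84],
        "Holdout subset accuracy": [0.88, 0.88, 0.88, 0.88, 0.88],
        "Time taken (seconds)": [0.011093, 0.012378, 0.006389, 0.013868, 0.021490],
    }
)

# Sort by holdout accuracy (highest first), then by fit time (lowest first)
ranked = results.sort_values(
    by=["Holdout subset accuracy", "Time taken (seconds)"],
    ascending=[False, True],
).reset_index(drop=True)

best_solver = ranked.loc[0, "Opt_Methods used"]
print(ranked)
print("Best solver:", best_solver)
```

Because every solver ties on holdout accuracy, the order is decided entirely by fit time, which places liblinear first.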