<H1><center>Logistic Regression<BR><BR>
</center></H1>

<H2>Task 1: Evaluating Classifiers for Diagnosing Cancers as Benign or Malignant 
</H2>

<P>For the next few tasks, I will be using <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)">breast cancer data</a> to diagnose whether a cancer is benign (0) or malignant (1). The features are derived from a digitized image of a fine needle aspirate of a breast mass. The features describe characteristics of the cell nuclei present in the image, including the radius, texture, perimeter, area, smoothness, compactness, concavity, number of concave points, symmetry, and fractal dimension, with the mean, standard error, and worst value provided for each.</P>
<P>To begin, read in the CSV file <code>breast_cancer.csv</code>, ignoring the header line, and store the feature vectors in an array <code>X</code> and the labels in an array <code>y</code>.</P>

In [1]:
# Read in data and store feature vectors in array X and labels in array y

import csv
import numpy as np

# Initialize empty lists to store feature vectors and labels
X = []
y = []

# Read the CSV file
with open('breast_cancer.csv', 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header line
    
    # Iterate over each row in the CSV file
    for row in reader:
        # Extract the feature vector (all columns except the last one) and convert them to floats
        features = list(map(float, row[:-1]))
        
        # Extract the label (last column) and convert it to an integer
        label = int(row[-1])
        
        # Append the feature vector and label to the respective lists
        X.append(features)
        y.append(label)

# Convert the lists to NumPy arrays
X = np.array(X)
y = np.array(y)

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)





Shape of X: (569, 30)
Shape of y: (569,)


<P>Now seed the <code>numpy</code> random number generator and then split the data into 80% training data and 20% testing data (with <code>random_state=0</code>).</P>

In [2]:
# Seed the random number generator.
np.random.seed(0)

# Split data into 80% training and 20% testing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Print the shapes of the training and testing sets
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)




X_train: (455, 30)
X_test: (114, 30)
y_train: (455,)
y_test: (114,)


<P>Let's evaluate the performance of four different <code>sklearn</code> classifiers: a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">decision tree</a>, a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html"><em>k</em> nearest neighbors classifier</a>, a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html">perceptron</a>, and a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">logistic regression classifier</a> (with <code>random_state=0</code> for those classifiers with randomization). Because we may go on to use the best-performing of the four classifiers, choosing among them is akin to tuning a hyperparameter (where the hyperparameter is the classification algorithm), so we will evaluate the classifiers on <em>validation</em> data rather than <em>testing</em> data.</P>

<P>One option would be to split the <em>training</em> data into separate sets, one used only as <em>training</em> data and one used only as <em>validation</em> data. Instead, we will use 5-fold cross-validation, in which we split the <em>training</em> data into five equal-sized sets. Then, five times, we use four of the sets as <em>training</em> data and the remaining set as <em>validation</em> data, so that each set serves as the <em>validation</em> set exactly once. The <em>validation</em> accuracy that we report is the average validation accuracy over the five trials.</P>
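<P>To make the fold bookkeeping concrete, here is a minimal pure-Python sketch of how 455 training samples could be divided into five folds. The helper <code>k_fold_indices</code> is a hypothetical name introduced only for illustration. It uses simple contiguous folds, whereas <code>cross_val_score</code> with an integer <code>cv</code> and a classifier uses stratified folds, so the actual fold membership in the cell below will differ.</P>

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# 455 is the number of training samples reported by the split above.
folds = list(k_fold_indices(455, 5))
for i, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {i}: {len(train)} training samples, {len(validation)} validation samples")
```

<P>Each of the five trials trains on 364 samples and validates on the remaining 91, and every sample is used for validation exactly once.</P>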

In [3]:
# Compute the average 5-fold cross-validation accuracy for each of four classifiers (decision tree, kNN, perceptron, logistic regression)

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.model_selection import cross_val_score

# Set the random_state for the classifiers
random_state = 0

# Initialize the classifiers
decision_tree = DecisionTreeClassifier(random_state=random_state)
knn = KNeighborsClassifier()
perceptron = Perceptron(max_iter=10, random_state=random_state)
logistic_regression = LogisticRegression(random_state=random_state)

# Perform 5-fold cross-validation for each classifier and compute average accuracy
classifiers = [decision_tree, knn, perceptron, logistic_regression]
classifier_names = ['Decision Tree', 'K-Nearest Neighbors', 'Perceptron', 'Logistic Regression']

for clf, name in zip(classifiers, classifier_names):
    # Compute cross-validation scores
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    avg_accuracy = scores.mean()

    # Print the average accuracy for the current classifier
    print(f'Average accuracy of {name}: {avg_accuracy:.6f}')

Average accuracy of Decision Tree: 0.912088
Average accuracy of K-Nearest Neighbors: 0.920879
Average accuracy of Perceptron: 0.868132
Average accuracy of Logistic Regression: 0.940659


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

<H2>Task 2: Regularization 
</H2>

<P>While we explored different classifiers above, let's now focus only on logistic regression classification of the breast cancer data. In particular, let's explore <em>regularized</em> logistic regression. The <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html"><code>sklearn</code> logistic regression classifier</a> has a parameter <code>C</code> that controls regularization strength, like the regularization parameter &lambda; that we studied in class. However, the parameter <code>C</code> corresponds to the <em>inverse</em> of regularization strength, so smaller values of <code>C</code> specify stronger regularization. Below, I try seven values for the parameter <code>C</code>: 1, 3, 10, 30, 100, 300, and 1000. For each value, I report the average 5-fold cross-validation accuracy of a logistic regression classifier (with <code>random_state=0</code>).</P>
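<P>One way to see why <code>C</code> acts as an inverse regularization strength is to write out the objective, assuming the default L2 penalty. The notation here is my own rather than the class's: &#8467; is the per-sample logistic loss, <em>m</em> is the number of training samples, and <em>w</em> and <em>b</em> are the weights and intercept. Dividing the objective by <code>C</code> &gt; 0 does not change its minimizer:</P>

```latex
\min_{w,\,b}\; C \sum_{i=1}^{m} \ell\bigl(y_i,\; w^\top x_i + b\bigr) + \frac{1}{2}\lVert w \rVert_2^2
\quad\Longleftrightarrow\quad
\min_{w,\,b}\; \sum_{i=1}^{m} \ell\bigl(y_i,\; w^\top x_i + b\bigr) + \frac{1}{2C}\lVert w \rVert_2^2
```

<P>In the second form, 1/<code>C</code> multiplies the penalty term and so plays the role of &lambda;. The exact correspondence depends on constant factors in how the cost function is normalized (for example, a 1/<em>m</em> factor), which this sketch does not assume.</P>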

In [4]:
# Using 5-fold cross validation, tune the regularization hyperparameter C for logistic regression

# Set the different values for the regularization parameter C
C_values = [1, 3, 10, 30, 100, 300, 1000]

# Initialize a list to store the average accuracy for each value of C
average_accuracies = []

# Perform 5-fold cross-validation for each value of C
for C in C_values:
    # Initialize the logistic regression classifier with the current C value
    logistic_regression = LogisticRegression(C=C, random_state=0)

    # Compute cross-validation scores
    scores = cross_val_score(logistic_regression, X_train, y_train, cv=5)
    avg_accuracy = np.mean(scores)

    # Store the average accuracy for the current C value
    average_accuracies.append(avg_accuracy)

# Print the average accuracy for each value of C
for C, accuracy in zip(C_values, average_accuracies):
    print(f'Average accuracy for C={C}: {accuracy:.6f}')



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Average accuracy for C=1: 0.940659
Average accuracy for C=3: 0.945055
Average accuracy for C=10: 0.947253
Average accuracy for C=30: 0.949451
Average accuracy for C=100: 0.949451
Average accuracy for C=300: 0.945055
Average accuracy for C=1000: 0.951648


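Of the seven values tried for the parameter C, C=1000 gave the highest average 5-fold cross-validation accuracy (0.951648). The margin is small, however: C=30 and C=100 both reach 0.949451, and the solver reached its iteration limit during these fits, so the ranking of the C values should be read with some caution.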


<H2>Task 3: Recall, Precision, and F1 Score
</H2>

<P>In Task 2 above, I determined the best value (among seven possibilities) for the regularization parameter <code>C</code> used by a logistic regression classifier on the breast cancer data. Using this value for <code>C</code>, I now retrain a logistic regression classifier (with <code>random_state=0</code>) on all of the breast cancer <em>training</em> data. In this task, rather than report the <em>accuracy</em> of the classifier on the <em>testing</em> data, I will report its <em>recall</em>, <em>precision</em>, and <em>F1</em> score on the <em>testing</em> data.</P>
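<P>As a reminder of what these three metrics measure, the following pure-Python sketch computes them from true-positive, false-positive, and false-negative counts. The function <code>precision_recall_f1</code> and the toy label lists are hypothetical and are not taken from the breast cancer data. Returning 0.0 when a denominator is zero is a choice made for this sketch.</P>

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of the samples predicted positive, the fraction that truly are.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive samples, the fraction that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (1 = malignant, 0 = benign): 3 true positives,
# 1 false negative, 1 false positive.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision_toy, recall_toy, f1_toy = precision_recall_f1(y_true_toy, y_pred_toy)
print("Precision:", precision_toy)
print("Recall:", recall_toy)
print("F1 Score:", f1_toy)
```

<P>In this toy case the numbers of false positives and false negatives are equal, so precision, recall, and F1 all coincide (0.75 here). That property is worth keeping in mind whenever the three metrics come out identical.</P>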

In [5]:
# Using the optimal value for the regularization parameter C,
# report the recall, precision, and F1 score of a logistic regression classifier.

from sklearn.metrics import recall_score, precision_score, f1_score

# Set the optimal value for the regularization parameter C
C_optimal = 1000

# Train a logistic regression classifier with the optimal C value
logistic_regression = LogisticRegression(C=C_optimal, random_state=0)
logistic_regression.fit(X_train, y_train)

# Predict the labels for the testing data
y_pred = logistic_regression.predict(X_test)

# Compute the recall, precision, and F1 score on the testing data
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the recall, precision, and F1 score
print("Recall:", recall)
print("Precision:", precision)
print("F1 Score:", f1)

Recall: 0.9361702127659575
Precision: 0.9361702127659575
F1 Score: 0.9361702127659575


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


<H2>Task 4: Feature Scaling
</H2>

<P>While exploring the breast cancer data, you will note that some features take on values in the thousands whereas other features never achieve values larger than 0.1. The features are not all on the same scale. Thus, I will be performing feature scaling on the data prior to using the classifier.</P>
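<P>The kind of scaling used in the next cell is standardization: each feature has its training mean subtracted and is divided by its training standard deviation. Below is a minimal pure-Python sketch for a single feature. The helper names and the numeric values are hypothetical. The point to notice is that the statistics come from the training values only and are then reused for the testing values.</P>

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Return the mean and population standard deviation of one training feature."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant features to avoid dividing by zero.
    if std == 0:
        std = 1.0
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Hypothetical values for one large-scale feature (not taken from the dataset).
train_feature = [500.0, 1000.0, 1500.0]
test_feature = [2000.0]

# Fit on the training values only, then reuse the same statistics for the test values.
mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)
test_scaled = standardize(test_feature, mean, std)

print("Scaled training values:", train_scaled)
print("Scaled testing values:", test_scaled)
```

<P>After the transformation, the training values have mean 0 and standard deviation 1. The testing values generally will not, because they are scaled with the training statistics.</P>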

In [6]:
# After performing feature scaling (and using the optimal value for the regularization parameter C),
# report the F1 score of a logistic regression classifier.


from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler with the training data
scaler.fit(X_train)

# Transform the training and testing data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Set the optimal value for the regularization parameter C
C_optimal = 1000

# Train a logistic regression classifier with the optimal C value using the scaled training data
logistic_regression = LogisticRegression(C=C_optimal, random_state=0)
logistic_regression.fit(X_train_scaled, y_train)

# Predict the labels for the scaled testing data
y_pred_scaled = logistic_regression.predict(X_test_scaled)

# Compute the F1 score on the scaled testing data
f1_scaled = f1_score(y_test, y_pred_scaled)

# Print the F1 score
print("F1 Score (scaled data):", f1_scaled)

F1 Score (scaled data): 0.9375000000000001


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


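After applying feature scaling to the data, the F1 score of the logistic regression classifier on the testing data is 0.9375, up only marginally from 0.9362 without scaling. Two caveats apply: C=1000 was selected using the unscaled data, and the solver still reports reaching its iteration limit on the scaled data, so this result may not reflect what a converged, re-tuned model would achieve.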
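<P>A possible follow-up, sketched here under stated assumptions rather than run on the breast cancer data, is to tune <code>C</code> with scaling placed inside the cross-validation loop. Wrapping the scaler and the classifier in a pipeline means the scaler is refit on the training portion of every fold. The data below is a synthetic stand-in generated with <code>make_classification</code>, and <code>max_iter=1000</code> is an assumed value meant to give the solver more iterations, not a tuned setting.</P>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the breast cancer data): 200 samples, 10 features,
# with columns rescaled so that feature magnitudes differ by orders of magnitude.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo = X_demo * np.logspace(-2, 3, 10)

C_values = [1, 3, 10, 30, 100, 300, 1000]
cv_accuracy = {}
for C in C_values:
    # The scaler is refit on the training part of every fold, so the validation
    # fold never influences the scaling statistics.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C, max_iter=1000, random_state=0),
    )
    cv_accuracy[C] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for C, accuracy in cv_accuracy.items():
    print(f"Average accuracy for C={C}: {accuracy:.6f}")
```

<P>To apply the same idea to the notebook's data, <code>X_demo</code> and <code>y_demo</code> would be replaced by <code>X_train</code> and <code>y_train</code>. The accuracies printed for the synthetic data say nothing about the breast cancer results.</P>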