In [None]:
import numpy as np
from numpy.linalg import inv

This cell imports the necessary libraries for numerical computations. `numpy` is imported as `np`, which is a standard convention. The `inv` function from `numpy.linalg` is specifically imported to calculate the inverse of a matrix, a crucial step in Newton's method. These imports provide the mathematical tools needed for implementing the optimization algorithm.

In [None]:
#The objective function
def func(x):
    return 100*np.square(np.square(x[0])-x[1])+np.square(x[0]-1)

This cell defines the Rosenbrock function, a non-convex function used as a performance test problem for optimization algorithms. The function is given by `f(x, y) = 100 * (y - x^2)^2 + (x - 1)^2`. The goal of the optimization is to find the values of `x` and `y` that minimize this function. The global minimum of the Rosenbrock function is at `(1, 1)`, where the function value is 0.
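As a quick sanity check, the function can be evaluated at the known minimum and at the starting point used later in this notebook. This is a minimal sketch that re-declares the same expression under the hypothetical name `rosenbrock` so that it runs on its own:

```python
import numpy as np

def rosenbrock(x):
    # Same expression as `func` above: 100*(x0^2 - x1)^2 + (x0 - 1)^2
    return 100 * np.square(np.square(x[0]) - x[1]) + np.square(x[0] - 1)

f_min = rosenbrock(np.array([1.0, 1.0]))    # value at the global minimum (1, 1)
f_start = rosenbrock(np.array([5.0, 5.0]))  # value at the start point (5, 5)
print(f_min, f_start)
```

At `(5, 5)` the value is `100 * (25 - 5)^2 + (5 - 1)^2 = 40016`, which gives a sense of how far the start point is from the minimum.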

In [None]:
# first order derivatives of the function
def dfunc(x):
    df1 = 400*x[0]*(np.square(x[0])-x[1])+2*(x[0]-1)
    df2 = -200*(np.square(x[0])-x[1])
    return np.array([df1, df2])

This cell defines the gradient of the Rosenbrock function. The gradient is a vector containing the partial derivatives of the function with respect to each variable (`x[0]` and `x[1]`). These derivatives are calculated analytically:
- The partial derivative with respect to `x[0]` is `400*x[0]*(x[0]^2 - x[1]) + 2*(x[0] - 1)`.
- The partial derivative with respect to `x[1]` is `-200*(x[0]^2 - x[1])`.
The gradient points in the direction of the steepest increase of the function. In optimization, we move in the opposite direction of the gradient to find the minimum.
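One way to gain confidence in hand-derived derivatives is to compare them with a finite-difference approximation. The sketch below repeats `func` and `dfunc` so that it is self-contained; the helper `numerical_gradient`, the step `h`, and the test point are illustrative choices, not part of the original notebook:

```python
import numpy as np

def func(x):
    return 100 * np.square(np.square(x[0]) - x[1]) + np.square(x[0] - 1)

def dfunc(x):
    df1 = 400 * x[0] * (np.square(x[0]) - x[1]) + 2 * (x[0] - 1)
    df2 = -200 * (np.square(x[0]) - x[1])
    return np.array([df1, df2])

def numerical_gradient(f, x, h=1e-6):
    # Central differences: (f(x + h*e_i) - f(x - h*e_i)) / (2h) per coordinate
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

point = np.array([1.5, -0.5])          # arbitrary test point
analytic = dfunc(point)
numeric = numerical_gradient(func, point)
print(analytic, numeric)
```

If the two vectors agree to several digits, the analytic gradient is very likely correct; a large mismatch usually points to a sign or coefficient error in the derivation.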

In [None]:
# second order derivative: the hessian
def invhess(x):
    df11 = 1200*np.square(x[0])-400*x[1]+2
    df12 = -400*x[0]
    df21 = -400*x[0]
    df22 = 200
    hess = np.array([[df11, df12], [df21, df22]])
    return inv(hess)

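This cell calculates the inverse of the Hessian matrix of the Rosenbrock function. The Hessian matrix is a square matrix of second-order partial derivatives. For a function of two variables like the Rosenbrock function, the Hessian is a 2x2 matrix with the following entries:
- `df11` (second derivative with respect to `x[0]`): `1200*x[0]^2 - 400*x[1] + 2`.
- `df12` and `df21` (mixed derivatives, equal by symmetry): `-400*x[0]`.
- `df22` (second derivative with respect to `x[1]`): `200`.
The function returns the inverse of this matrix, which the Newton update multiplies by the gradient.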
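As an illustration, the Hessian can be evaluated at the minimum `(1, 1)` and checked for positive definiteness, which is what makes the Newton step point toward a minimum there. This is a sketch; the function name `hessian` is introduced here only for the example, while its entries are the ones used in `invhess`:

```python
import numpy as np

def hessian(x):
    # Entries match df11, df12, df21, df22 in `invhess` above
    return np.array([
        [1200 * x[0] ** 2 - 400 * x[1] + 2, -400 * x[0]],
        [-400 * x[0], 200.0],
    ])

H = hessian(np.array([1.0, 1.0]))
eigenvalues = np.linalg.eigvalsh(H)   # eigvalsh assumes a symmetric matrix
print(H)
print(eigenvalues)
```

Both eigenvalues are positive, but they differ by a factor on the order of 2500, so the problem is poorly conditioned near the minimum. That is one reason the Rosenbrock function is a demanding test case.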

In [None]:
#The method
def newton(x, max_int):
    miter = 1
    step = .5
    vals = []
    objectfs = []
    # you can customize your own condition of convergence, here we limit the number of iterations
    while miter <= max_int:
        vals.append(x)
        objectfs.append(func(x))
        temp = x-step*(invhess(x).dot(dfunc(x)))
        if np.abs(func(temp)-func(x))>0.01:
            x = temp
        else:
            break
        print(x, func(x), miter)
        miter += 1
    return vals, objectfs, miter

This cell defines the `newton` function, which implements Newton's method for optimization.
- `x`: The initial guess for the minimum.
- `max_int`: The maximum number of iterations allowed.
The function iteratively updates the current estimate of the minimum using the formula:
`x_new = x_old - step * (H_inv * gradient)`
where `H_inv` is the inverse Hessian and `gradient` is the gradient of the function at the current point.
- `miter`: Counter for the current iteration.
- `step`: A step size parameter (set to 0.5). In a pure Newton's method, the step size is typically 1, but a smaller step size can help with convergence in some cases.
- `vals`: A list to store the values of `x` at each iteration.
- `objectfs`: A list to store the objective function values at each iteration.
The `while` loop continues until the maximum number of iterations is reached or the absolute change in the objective function value between the current point and the candidate next point is at most 0.01, which serves as a basic convergence criterion. After each accepted step, the new value of `x`, its objective function value, and the iteration number are printed.
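For comparison, here is a sketch of a common variant: a full Newton step (step size 1) that solves the linear system `H d = gradient` with `np.linalg.solve` instead of forming the inverse explicitly, and stops when the gradient norm is small. The names `newton_solve` and `hess`, and the tolerance, are assumptions made for this example and are not part of the notebook's `newton` function:

```python
import numpy as np

def func(x):
    return 100 * np.square(np.square(x[0]) - x[1]) + np.square(x[0] - 1)

def dfunc(x):
    df1 = 400 * x[0] * (np.square(x[0]) - x[1]) + 2 * (x[0] - 1)
    df2 = -200 * (np.square(x[0]) - x[1])
    return np.array([df1, df2])

def hess(x):
    return np.array([
        [1200 * x[0] ** 2 - 400 * x[1] + 2, -400 * x[0]],
        [-400 * x[0], 200.0],
    ])

def newton_solve(x0, max_iter=50, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(max_iter):
        g = dfunc(x)
        if np.linalg.norm(g) < tol:      # stop once the gradient is (nearly) zero
            break
        d = np.linalg.solve(hess(x), g)  # solve H d = g rather than computing inv(H)
        x = x - d                        # full Newton step (step size 1)
        history.append(x.copy())
    return x, history

x_star, history = newton_solve([5, 5])
print(x_star, func(x_star), len(history) - 1)
```

Solving the system is generally preferred over an explicit inverse for numerical reasons. Full steps are not guaranteed to decrease the objective at every iteration, which is one motivation for a damping factor such as `step` above.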

In [None]:
#Initialization
start = [5, 5]
val, objectf, iters = newton(start, 50)

[ 4.99950012 14.99500125] 10015.996500999738 1
[ 4.99850075 19.98500862] 2515.9891319336307 2
[ 4.9965035  22.46504264] 640.9743156346814 3
[ 4.99251498 23.67518762] 172.19472181001353 4
[ 4.98456188 24.22078475] 54.94827682735826 5
[ 4.96875194 24.37570969] 25.534508005533738 6
[ 4.93753006 24.22183522] 17.980607577859715 7
[ 4.87690338 23.70182702] 15.708688997277847 8
[ 4.7659566  22.66085333] 14.468535895722859 9
[ 4.60498788 21.153258  ] 13.273196783901964 10
[ 4.44867175 19.7399179 ] 12.151019317491137 11
[ 4.29405728 18.38964106] 11.093732736283506 12
[ 4.14236045 17.11149476] 10.101532481072594 13
[ 3.99316572 15.89928575] 9.171439844130546 14
[ 3.84669105 14.7525338 ] 8.301658919062707 15
[ 3.70291355 13.6686477 ] 7.489963487430327 16
[ 3.56190493 12.64582277] 6.734289227351629 17
[ 3.42370436 11.6819802 ] 6.032519109543599 18
[ 3.28836654 10.77515251] 5.382560838831825 19
[3.1559441  9.92334647] 4.782319830836418 20
[3.02649439 9.12459271] 4.229709167267734 21
[2.90007737 8.3

This cell initializes the optimization process.
- `start = [5, 5]`: Sets the initial guess for the minimum to `(5, 5)`. The performance of optimization algorithms can sometimes depend on the starting point.
- `val, objectf, iters = newton(start, 50)`: Calls the `newton` function with the initial guess and a maximum of 50 iterations. The returned values are:
    - `val`: A list of the `x` values at each iteration.
    - `objectf`: A list of the objective function values at each iteration.
    - `iters`: The total number of iterations performed until convergence or the maximum number of iterations was reached.
This cell runs Newton's method and stores the results for further analysis or visualization. The output shows the progress of the optimization at each step, including the current `x` values, the objective function value, and the iteration number.

# Question
How can Newton's method be applied to the IRIS dataset for classification by framing it as an optimization problem using Logistic Regression? Provide an assignment for the user to implement this, along with an example Python code response.

## The challenge

Directly applying Newton's method for optimization (finding minima of a function) to a classification problem like the IRIS dataset isn't straightforward. Newton's method is typically used for minimizing continuous, differentiable functions, while classification involves discrete categories.


**Reasoning**:
Explain why Newton's method is not directly applicable to classification problems like the IRIS dataset by highlighting the difference between optimizing continuous functions and predicting discrete categories.



In [1]:
# Explanation of why Newton's method is not directly applicable to classification

# Newton's method is an iterative optimization algorithm designed to find the roots of a differentiable function or the minima/maxima of a twice-differentiable function.
# It relies on calculating the gradient and Hessian (matrix of second derivatives) of the objective function.
# Therefore, it is inherently suited for minimizing continuous and differentiable functions.

# Classification problems, such as classifying the IRIS dataset, involve predicting discrete categories or labels (e.g., 'setosa', 'versicolor', 'virginica').
# The direct performance metrics in classification, like accuracy or the number of misclassified points, are not continuous or differentiable functions of the model parameters.
# Small changes in model parameters might not change the classification of any point until a certain threshold is crossed, leading to a step-like, non-differentiable objective function.

# Hence, directly applying Newton's method to minimize a non-differentiable function that counts misclassifications is not feasible.

# To apply Newton's method to a classification problem, the problem needs to be reframed as an optimization task with a continuous and differentiable objective function.
# Logistic Regression is an example of such a reformulation, where the classification problem is solved by minimizing a cost function (like the negative log-likelihood) which is continuous and differentiable, making it suitable for optimization methods like Newton's method.
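The contrast between a step-like metric and a smooth loss can be seen on a tiny example. The sketch below uses hypothetical one-dimensional toy data (not the IRIS data) and a single weight `w`; accuracy stays constant as `w` changes slightly, while the log loss changes smoothly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one feature, two classes
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def accuracy(w):
    # Fraction of points whose thresholded prediction matches the label
    return np.mean((sigmoid(w * x) > 0.5).astype(int) == y)

def log_loss(w):
    p = sigmoid(w * x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

for w in (0.5, 0.6, 0.7):
    print(w, accuracy(w), log_loss(w))
```

Because accuracy does not move for these small changes in `w`, its derivative carries no information about which direction improves the model, whereas the log loss does.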

## Introduce a suitable model

To use an optimization method like Newton's, we need to frame the classification problem as an optimization problem by choosing a model with a differentiable loss function. Logistic Regression is a suitable model for binary classification, as its loss function (cross-entropy) is differentiable.


**Reasoning**:
Explain how to frame classification as an optimization problem suitable for Newton's method using Logistic Regression and its differentiable cross-entropy loss function.



In [2]:
# To apply an optimization method like Newton's method to a classification problem,
# we need to reformulate the problem as minimizing a continuous and differentiable
# function of the model parameters. This function is typically called the "loss function"
# or "cost function".

# For binary classification problems, Logistic Regression is a suitable model choice.
# Logistic Regression models the probability of a data point belonging to a particular class
# using the sigmoid function, which is continuous and differentiable.

# The standard loss function used in Logistic Regression is the cross-entropy loss (also known as log loss).
# The cross-entropy loss measures the performance of a classification model whose output is a probability value
# between 0 and 1. For binary classification, the cross-entropy loss is defined as:
# L = -[y * log(p) + (1 - y) * log(1 - p)]
# where y is the true label (0 or 1) and p is the predicted probability.

# The key property of the cross-entropy loss function is that it is continuous and differentiable
# with respect to the model parameters (the weights and bias). This differentiability is crucial
# because Newton's method requires the calculation of the gradient (first derivative) and the
# Hessian (second derivative) of the objective function.

# By minimizing the cross-entropy loss function using optimization algorithms like Newton's method,
# we can find the optimal model parameters for the Logistic Regression model that best fit the data
# and can be used for classification.
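To make the formula concrete, the sketch below evaluates the per-example loss for a true label of 1 at three predicted probabilities. The function name `binary_cross_entropy` and the chosen probabilities are illustrative:

```python
import numpy as np

def binary_cross_entropy(y, p):
    # L = -[y * log(p) + (1 - y) * log(1 - p)] for a single example
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

confident_right = binary_cross_entropy(1, 0.9)   # -log(0.9)
unsure = binary_cross_entropy(1, 0.5)            # -log(0.5)
confident_wrong = binary_cross_entropy(1, 0.1)   # -log(0.1)
print(confident_right, unsure, confident_wrong)
```

The losses are roughly 0.105, 0.693, and 2.303: a confident correct prediction is penalized lightly, and a confident wrong prediction is penalized heavily.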

## The Activity

The activity consists of loading and preparing the IRIS dataset (focusing on a binary classification problem, e.g., distinguishing two classes), implementing Logistic Regression from scratch, using Newton's method to find the optimal parameters for the Logistic Regression model by minimizing the cross-entropy loss function, and evaluating the performance of the implemented model.


In [3]:
# Assignment: Implement Logistic Regression with Newton's Method for Binary Classification on the IRIS Dataset

# This assignment requires you to apply the concepts of framing classification as an optimization problem
# and using Newton's method to find the optimal parameters for a Logistic Regression model.

# Your task is to:

# 1.  Load and Prepare the Data:
#     - Load the IRIS dataset using libraries like scikit-learn or pandas.
#     - Focus on a binary classification problem. For instance, select two classes from the dataset
#       (e.g., 'setosa' and 'versicolor') and filter the data accordingly.
#     - Extract the features (X) and the corresponding labels (y) for the selected classes.
#     - Convert the labels to a binary format (e.g., 0 and 1).
#     - Consider adding a bias term to your feature matrix (a column of ones).

# 2.  Implement Logistic Regression from Scratch:
#     - Implement the sigmoid function, which is the core of Logistic Regression for predicting probabilities.
#     - Implement the prediction function that uses the sigmoid function and the model parameters (weights and bias)
#       to predict the probability of a data point belonging to the positive class.

# 3.  Implement the Cross-Entropy Loss Function:
#     - Implement the binary cross-entropy loss function to quantify the error of your model's predictions.
#     - This will be the function you aim to minimize.

# 4.  Implement the Gradient and Hessian of the Loss Function:
#     - Analytically derive and implement the gradient (vector of first derivatives) of the cross-entropy loss
#       function with respect to the model parameters.
#     - Analytically derive and implement the Hessian (matrix of second derivatives) of the cross-entropy loss
#       function with respect to the model parameters. These are crucial for Newton's method.

# 5.  Implement Newton's Method for Optimization:
#     - Implement Newton's method to find the optimal parameters (weights and bias) that minimize the
#       cross-entropy loss function.
#     - Start with an initial guess for the parameters.
#     - In each iteration, calculate the gradient and the inverse of the Hessian at the current parameter values.
#     - Update the parameters using the Newton update rule: parameters = parameters - inverse(Hessian) * gradient.
#     - Define a convergence criterion (e.g., based on the change in parameters, the change in loss, or a maximum number of iterations).

# 6.  Evaluate the Model:
#     - After finding the optimal parameters using Newton's method, evaluate the performance of your Logistic Regression model
#       on the prepared binary IRIS dataset.
#     - Calculate relevant classification metrics such as accuracy, precision, recall, and F1-score.

# By completing this assignment, you will gain practical experience in applying optimization techniques
# to machine learning models and understand the connection between classification and optimization.

## The Code

This code demonstrates: loading and preparing the data, implementing Logistic Regression components (sigmoid, prediction, loss, gradient, Hessian), implementing Newton's method, training the model, and evaluating the model.


In [4]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1. Load the iris dataset and select two classes
iris = load_iris()
X = iris.data
y = iris.target

# Select two classes (e.g., setosa (0) and versicolor (1))
X_binary = X[y <= 1]
y_binary = y[y <= 1]

# 2. Prepare the data: convert labels to 0 and 1, and add a bias term
y_binary = np.where(y_binary == 0, 0, 1) # Ensure labels are 0 and 1

# Add a bias term (column of ones) to the feature matrix
X_binary = np.hstack((np.ones((X_binary.shape[0], 1)), X_binary))

# A train/test split is good practice, but this example trains and evaluates
# on the full binary dataset. To hold out a test set instead, use:
# X_train, X_test, y_train, y_test = train_test_split(X_binary, y_binary, test_size=0.2, random_state=42)

# Use the full binary dataset for both training and evaluation
X_train = X_binary
y_train = y_binary

print("Data loaded and prepared:")
print("Features shape:", X_train.shape)
print("Labels shape:", y_train.shape)
print("First 5 rows of features (with bias):")
print(X_train[:5])
print("First 5 labels:")
print(y_train[:5])

Data loaded and prepared:
Features shape: (100, 5)
Labels shape: (100,)
First 5 rows of features (with bias):
[[1.  5.1 3.5 1.4 0.2]
 [1.  4.9 3.  1.4 0.2]
 [1.  4.7 3.2 1.3 0.2]
 [1.  4.6 3.1 1.5 0.2]
 [1.  5.  3.6 1.4 0.2]]
First 5 labels:
[0 0 0 0 0]


Now that the data is loaded and prepared, implement the core components of the Logistic Regression model: the sigmoid function, the prediction function, the binary cross-entropy loss function, its gradient, and its Hessian.



In [5]:
# 3. Implement the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 4. Implement the prediction function for logistic regression
def predict(X, weights):
    # X is the feature matrix (with bias)
    # weights is the vector of model parameters (including bias)
    z = X.dot(weights)
    return sigmoid(z)

# 5. Implement the binary cross-entropy loss function
def cross_entropy_loss(X, y, weights):
    m = X.shape[0] # Number of training examples
    predictions = predict(X, weights)
    # Avoid log(0) by clipping probabilities
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    loss = -np.mean(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    return loss

# 6. Implement the gradient of the cross-entropy loss function
def gradient(X, y, weights):
    m = X.shape[0]
    predictions = predict(X, weights)
    grad = (1/m) * X.T.dot(predictions - y)
    return grad

# 7. Implement the Hessian of the cross-entropy loss function
def hessian(X, weights):
    m = X.shape[0]
    predictions = predict(X, weights)
    # W is a diagonal matrix where W_ii = p_i * (1 - p_i)
    W = np.diag(predictions * (1 - predictions))
    hess = (1/m) * X.T.dot(W).dot(X)
    return hess

print("Logistic Regression components implemented.")

Logistic Regression components implemented.
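Before running Newton's method, it can be worth checking the gradient and Hessian against finite differences. The sketch below re-declares compact versions of the components and uses small synthetic data (random features and labels, an assumption made only for this check) so that it runs on its own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    return sigmoid(X @ w)

def cross_entropy_loss(X, y, w):
    p = np.clip(predict(X, w), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(X, y, w):
    return X.T @ (predict(X, w) - y) / X.shape[0]

def hessian(X, w):
    p = predict(X, w)
    # (X.T * v) scales column i of X.T by v[i], i.e. X^T diag(v), without building diag(v)
    return (X.T * (p * (1 - p))) @ X / X.shape[0]

# Synthetic check data: bias column plus two random features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20)
w = rng.normal(size=3)

h = 1e-5
num_grad = np.zeros(3)
num_hess = np.zeros((3, 3))
for i in range(3):
    e = np.zeros(3)
    e[i] = h
    # Central differences of the loss give the gradient; of the gradient, the Hessian
    num_grad[i] = (cross_entropy_loss(X, y, w + e) - cross_entropy_loss(X, y, w - e)) / (2 * h)
    num_hess[:, i] = (gradient(X, y, w + e) - gradient(X, y, w - e)) / (2 * h)

grad_err = np.max(np.abs(num_grad - gradient(X, y, w)))
hess_err = np.max(np.abs(num_hess - hessian(X, w)))
print(grad_err, hess_err)
```

Small errors here indicate that the analytic expressions `(1/m) * X^T (p - y)` and `(1/m) * X^T W X` are implemented consistently with the loss.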


With the Logistic Regression components defined, now implement Newton's method for optimizing the model parameters, train the model using this method, and finally evaluate the trained model on the training data.



In [6]:
# 8. Implement Newton's method for optimizing the model parameters
def newtons_method(X, y, initial_weights, max_iterations=100, tolerance=1e-6):
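    # Work on a float copy so the caller's array is not modified in place by `weights -= update`
    weights = np.array(initial_weights, dtype=float)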
    loss_history = []

    for i in range(max_iterations):
        grad = gradient(X, y, weights)
        hess = hessian(X, weights)

        # Handle potential singularity or near-singularity of Hessian
        try:
            # Use pseudo-inverse for stability
            hess_inv = np.linalg.pinv(hess)
        except np.linalg.LinAlgError:
            print(f"Hessian inversion failed at iteration {i}. Stopping.")
            break

        # Newton's update rule
        update = hess_inv.dot(grad)
        weights -= update

        loss = cross_entropy_loss(X, y, weights)
        loss_history.append(loss)

        # Check for convergence based on the magnitude of the update step
        if np.linalg.norm(update) < tolerance:
            print(f"Convergence reached at iteration {i}.")
            break

        # Optional: print progress
        # if (i + 1) % 10 == 0:
        #     print(f"Iteration {i+1}, Loss: {loss}")

    print(f"Finished after {i+1} iterations.")
    return weights, loss_history

# 9. Train the logistic regression model using the implemented Newton's method
# Initialize weights with zeros (including for the bias term)
initial_weights = np.zeros(X_train.shape[1])

print("\nStarting Newton's method training...")
optimal_weights, history = newtons_method(X_train, y_train, initial_weights)

print("\nOptimal weights found:")
print(optimal_weights)

# 10. Evaluate the trained model
# Make predictions on the training data
# The predict function gives probabilities, convert to class labels (0 or 1)
predicted_probabilities = predict(X_train, optimal_weights)
predicted_classes = (predicted_probabilities > 0.5).astype(int)

print("\nModel Evaluation on Training Data:")
print("Accuracy:", accuracy_score(y_train, predicted_classes))
print("Precision:", precision_score(y_train, predicted_classes))
print("Recall:", recall_score(y_train, predicted_classes))
print("F1-score:", f1_score(y_train, predicted_classes))


Starting Newton's method training...
Finished after 100 iterations.

Optimal weights found:
[ -5.85966636 -25.18595097 -12.49249928  45.50349927  59.98835501]

Model Evaluation on Training Data:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0


## Summary:

### Data Analysis Key Findings

*   Newton's method is not directly applicable to classification problems because classification performance metrics are not continuous or differentiable functions.
*   To apply Newton's method to classification, the problem must be reframed as an optimization task with a continuous and differentiable objective function, such as minimizing the cross-entropy loss in Logistic Regression.
*   The cross-entropy loss function in Logistic Regression is differentiable with respect to the model parameters, allowing for the calculation of the gradient and Hessian required by Newton's method.
*   An assignment was outlined detailing the steps to implement Logistic Regression with Newton's method on a binary subset of the IRIS dataset.
*   An example Python implementation demonstrated:
    *   Loading and preparing a binary subset of the IRIS dataset (setosa vs. versicolor), including adding a bias term and converting labels to 0 and 1.
    *   Implementing the core Logistic Regression components: sigmoid function, prediction function, cross-entropy loss, gradient, and Hessian.
    *   Implementing Newton's method for optimization, utilizing the pseudo-inverse of the Hessian for stability.
    *   Training the model using the implemented Newton's method.
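    *   Evaluating the model on the training data, which resulted in perfect scores (1.0) for Accuracy, Precision, Recall, and F1-score for the setosa vs. versicolor task, consistent with the known linear separability of these classes. Note that the run stopped at the 100-iteration cap without reporting convergence, and that these scores are measured on the same data used for training.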
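A likely explanation for the run reaching the iteration cap is the separability itself: when two classes are linearly separable, the unregularized cross-entropy loss has no finite minimizer, because scaling a separating weight vector up keeps lowering the loss. The sketch below illustrates this on hypothetical one-dimensional toy data rather than on the IRIS subset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: one feature, perfectly separated at x = 0
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def log_loss(w):
    p = np.clip(sigmoid(w * x), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

scales = [1.0, 2.0, 4.0, 8.0, 16.0]
losses = [log_loss(s) for s in scales]
for s, loss in zip(scales, losses):
    print(s, loss)
```

The loss keeps shrinking toward zero as the weight grows, so an optimizer has no finite point to settle on and the reported weights depend on where the iteration happens to stop.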

### Insights or Next Steps

*   The implementation successfully demonstrates how Newton's method can be used to optimize the parameters of a Logistic Regression model by minimizing a differentiable loss function.
*   For future work, the user could extend this implementation to handle multi-class classification (e.g., using a One-vs-Rest strategy with Logistic Regression) and explore regularization techniques within the Newton's method framework to prevent overfitting on more complex datasets.
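As one possible starting point for the regularization idea, the sketch below adds an L2 penalty `(lam / 2) * ||w||^2` to the loss, which adds `lam * w` to the gradient and `lam * I` to the Hessian. The function name `newton_l2`, the value of `lam`, the toy data, and the choice to penalize the bias term as well are all simplifying assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_l2(X, y, lam=0.1, max_iter=50, tol=1e-10):
    m, n = X.shape
    w = np.zeros(n)
    for it in range(1, max_iter + 1):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / m + lam * w                      # penalty adds lam * w
        hess = (X.T * (p * (1 - p))) @ X / m + lam * np.eye(n)  # penalty adds lam * I
        step = np.linalg.solve(hess, grad)
        w = w - step
        if np.linalg.norm(step) < tol:
            return w, it, True
    return w, max_iter, False

# Hypothetical separable toy data: bias column plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, iterations, converged = newton_l2(X, y)
print(w, iterations, converged)
```

With the penalty in place the objective is strictly convex and has a finite minimizer even on separable data, and the added `lam * I` keeps the Hessian invertible, so a pseudo-inverse is no longer needed in this sketch.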
