# Module 1: Introduction to Scikit-Learn

## Section 2: Supervised Learning Algorithms

### Part 10: Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning for training a wide range of models, particularly for large-scale and online learning tasks. 

### 10.1 Gradient Descent

Gradient Descent (GD) is an optimization algorithm commonly used in machine learning and deep learning to find the minimum of a function. It's particularly useful when dealing with complex, high-dimensional models where finding the minimum analytically is infeasible or computationally expensive.

There are different variants of GD, including:

- Batch Gradient Descent (BGD): It computes the gradient using the entire training dataset at each iteration. BGD is straightforward but can be slow for large datasets.

- Stochastic Gradient Descent (SGD): It updates the parameters using one randomly chosen training example at a time (hence "stochastic"). Each update is much cheaper than a BGD update, which needs the gradient over all training examples, but the updates are noisier.

- Mini-Batch Gradient Descent: It strikes a balance between BGD and SGD by updating the parameters using a small random subset (mini-batch) of the training data at each iteration.
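The three variants differ only in how many training examples feed each gradient estimate. The sketch below illustrates this with NumPy on a small synthetic linear model; the helper names `mse_gradient` and `sample_batch`, the data, and the batch size of 32 are assumptions made for illustration, not part of scikit-learn.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error for a linear model y ~ X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)

def sample_batch(X, y, batch_size, rng):
    """Pick batch_size rows at random without replacement (hypothetical helper)."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return X[idx], y[idx]

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
grad_batch = mse_gradient(w, X, y)                         # BGD: all 200 rows
grad_sgd = mse_gradient(w, *sample_batch(X, y, 1, rng))    # SGD: one random row
grad_mini = mse_gradient(w, *sample_batch(X, y, 32, rng))  # mini-batch: 32 random rows

print("batch gradient     :", grad_batch)
print("stochastic gradient:", grad_sgd)
print("mini-batch gradient:", grad_mini)
```

The stochastic and mini-batch gradients are noisy estimates of the batch gradient; larger batches reduce the noise at a higher cost per update.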

Here's how GD works:

1. Objective Function: GD starts with an objective function (also called a cost or loss function) that you want to minimize. In machine learning, this function quantifies the error between your model's predictions and the actual target values.

2. Initialization: GD begins with an initial guess for the model's parameters (e.g., weights and biases). This guess determines the starting point on the loss surface, i.e., the initial value of the objective function.

3. Gradient Calculation: The algorithm calculates the gradient of the objective function with respect to the model parameters. The gradient points in the direction of the steepest increase in the function.

4. Update Parameters: GD updates the model's parameters by taking a step in the opposite direction of the gradient. The step size is determined by a hyperparameter called the learning rate.

5. Convergence: Steps 3 and 4 are repeated iteratively until a stopping criterion is met. Common stopping criteria include a maximum number of iterations, a sufficiently small change in the objective function, or achieving a certain level of accuracy.

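6. Optimal Parameters: With a suitable learning rate and a convex objective, the algorithm converges to parameters at (or very near) the global minimum of the objective function; for non-convex objectives it may settle in a local minimum instead. These parameters should give the best fit to the training data under the chosen loss.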
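The six steps above can be sketched as a short batch gradient descent loop. This is a minimal illustration under stated assumptions: the synthetic data, the learning rate of 0.5, the tolerance, and the iteration cap are all chosen for this example rather than taken from any library default.

```python
import numpy as np

# Synthetic data (assumed for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)

# Append a column of ones so the bias is learned as an extra weight
Xb = np.hstack([X, np.ones((len(X), 1))])

def loss(w):
    """Step 1: objective function (mean squared error)."""
    return np.mean((Xb @ w - y) ** 2)

w = np.zeros(2)          # Step 2: initial guess for [slope, bias]
learning_rate = 0.5      # assumed value; too large a value would diverge
max_iters = 5000
tol = 1e-10

prev_loss = loss(w)
for iteration in range(max_iters):
    grad = 2.0 * Xb.T @ (Xb @ w - y) / len(y)   # Step 3: gradient over all data
    w = w - learning_rate * grad                # Step 4: step against the gradient
    cur_loss = loss(w)
    if abs(prev_loss - cur_loss) < tol:         # Step 5: stopping criterion
        break
    prev_loss = cur_loss

# Step 6: parameters at (approximately) the minimum
print(f"stopped after {iteration + 1} iterations, slope={w[0]:.3f}, bias={w[1]:.3f}")
```

Because the mean squared error of a linear model is convex, the loop should end close to the least-squares solution, with a slope near 2 and a bias near 1.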

### 10.2 Stochastic Gradient Descent (SGD)

In scikit-learn, you can use the SGDClassifier and SGDRegressor classes to perform SGD-based optimization for classification and regression tasks, respectively.

#### SGDClassifier

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the feature data (mean=0, std=1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create an SGDClassifier with logistic loss
sgd_classifier = SGDClassifier(loss='log_loss', random_state=42)  # 'log_loss' gives logistic regression (named 'log' before scikit-learn 1.1)
sgd_classifier.fit(X_train, y_train)

# Predict the target values for testing data using SGDClassifier
y_pred_sgd = sgd_classifier.predict(X_test)

# Create a LogisticRegression model
logistic_regression = LogisticRegression(random_state=42)
logistic_regression.fit(X_train, y_train)

# Predict the target values for testing data using LogisticRegression
y_pred_logistic = logistic_regression.predict(X_test)

# Calculate accuracy for SGDClassifier
accuracy_sgd = accuracy_score(y_test, y_pred_sgd)
print("SGDClassifier Accuracy:", accuracy_sgd)

# Calculate accuracy for LogisticRegression
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print("LogisticRegression Accuracy:", accuracy_logistic)

In this comparison, the LogisticRegression model achieved an accuracy of approximately 98.25%, while the SGDClassifier with logistic loss achieved approximately 96.49%. With 171 test samples, that gap corresponds to three predictions, so LogisticRegression performed only slightly better on this particular split, and the ranking could change with a different split or random seed.

Accuracy is just one metric, though. Depending on the specific problem and dataset, other metrics like precision, recall, and F1-score might also be important for evaluating model performance. 
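As a rough sketch of how those additional metrics could be computed, the example below refits a logistic regression on the same dataset and split and reports precision, recall, and F1-score. The pipeline setup is an assumption made to keep the block self-contained. Note that in this dataset the label 1 means "benign", so the default positive class is benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling and the classifier are chained so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# By default these treat label 1 (benign) as the positive class
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

If missed malignant cases are the main concern, passing `pos_label=0` to these functions reports the metrics for the malignant class instead.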

LogisticRegression and SGDClassifier are both classification algorithms in scikit-learn, but they differ in their optimization methods, convergence behavior, memory usage, and flexibility.

LogisticRegression minimizes the regularized logistic loss with a full-batch solver (L-BFGS by default), using the whole training set at every optimization step. It usually converges in relatively few iterations and is well suited to small and medium-sized datasets. However, it needs the full dataset in memory, and each iteration becomes more expensive as the number of samples and features grows.

In contrast, SGDClassifier uses Stochastic Gradient Descent (SGD) optimization. It updates the model weights incrementally, one training example at a time, which keeps its memory footprint small and makes it computationally efficient on large datasets. However, its convergence can be slower and noisier, especially with noisy data, and it is more sensitive to feature scaling and learning-rate settings.

While LogisticRegression offers a straightforward API with fewer hyperparameters, SGDClassifier provides more flexibility, allowing you to customize optimization techniques. Your choice between these models should consider your dataset size, convergence requirements, and computational resources.

#### SGDRegressor

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Create a synthetic dataset
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + 0.1 * np.random.randn(100, 1)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)

# SGDRegressor
sgd = SGDRegressor(max_iter=10000, tol=1e-3, learning_rate='constant', eta0=0.01)
sgd.fit(X_train, y_train.ravel())
y_pred_sgd = sgd.predict(X_test)
mse_sgd = mean_squared_error(y_test, y_pred_sgd)
r2_sgd = r2_score(y_test, y_pred_sgd)
mae_sgd = mean_absolute_error(y_test, y_pred_sgd)

# Print evaluation metrics
print("Linear Regression Metrics:")
print(f"MSE: {mse_lr:.2f}, R²: {r2_lr:.2f}, MAE: {mae_lr:.2f}")
print("\nSGDRegressor Metrics:")
print(f"MSE: {mse_sgd:.2f}, R²: {r2_sgd:.2f}, MAE: {mae_sgd:.2f}")

# Plot the results
plt.figure(figsize=(6, 4))
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred_lr, color='red', linewidth=2, label='Linear Regression')
plt.plot(X_test, y_pred_sgd, color='green', linewidth=2, label='SGDRegressor')
plt.title('Comparison')
plt.legend()
plt.show()

These metrics provide a comparison of the performance of both models. In this case, Linear Regression performs slightly better in terms of R², indicating a better fit to the data, while SGDRegressor has a slightly higher MAE, indicating slightly larger prediction errors.

Linear Regression and SGDRegressor differ in their optimization methods and computational characteristics.

In Linear Regression, you typically use a closed-form solution to directly calculate the optimal weights (coefficients) that minimize the mean squared error (MSE) without the need for iterative steps. It's essentially a one-step process.
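To make the "one-step" idea concrete, the sketch below solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`. Here `np.linalg.lstsq` stands in for the closed-form solution; the synthetic data mirrors the example above but is regenerated, so the exact numbers are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(100)

# Prepend a column of ones so the intercept is part of the weight vector
Xb = np.hstack([np.ones((len(X), 1)), X])

# Solve min ||Xb @ theta - y||^2 in a single call, with no iterations to tune
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

lr = LinearRegression().fit(X, y)

print("direct solve     : intercept=%.4f, slope=%.4f" % (theta[0], theta[1]))
print("LinearRegression : intercept=%.4f, slope=%.4f" % (lr.intercept_, lr.coef_[0]))
```

Both approaches should agree up to floating-point error, since they solve the same least-squares problem.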

On the other hand, SGDRegressor (Stochastic Gradient Descent for regression) optimizes iteratively. It updates the model's weights one training example at a time, adjusting them with the gradient of the loss function at the current weights, scaled by the learning rate. One full pass over the training data is called an epoch, so each epoch performs as many weight updates as there are samples. In scikit-learn, `max_iter` caps the number of epochs, and training stops earlier once the loss has stopped improving by at least `tol` for several consecutive epochs. More generally, mini-batch SGD is often used to strike a balance between computational efficiency and stable convergence.
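The relationship between epochs and weight updates can be inspected on a fitted model. The sketch below uses the same hyperparameters as the example above on regenerated synthetic data (an assumption of this sketch) and reads the `n_iter_` and `t_` attributes that scikit-learn exposes after fitting.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (assumed): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(100)

sgd = SGDRegressor(
    max_iter=10000,            # upper bound on the number of epochs
    tol=1e-3,                  # stop early when the loss stops improving
    learning_rate="constant",
    eta0=0.01,
    random_state=0,
)
sgd.fit(X, y)

# n_iter_: epochs actually run before the stopping criterion was met
# t_: weight-update counter, documented as n_iter_ * n_samples + 1
print("epochs run      :", sgd.n_iter_)
print("update counter  :", sgd.t_)
print("samples / epoch :", len(X))
```

In practice the number of epochs run is usually far below `max_iter`, because the `tol`-based criterion ends training once progress stalls.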

Because Linear Regression computes the optimal weights directly, there is no convergence to wait for, and it is usually the faster and simpler option for small and medium-sized datasets. Its cost grows quickly with the number of features, though, and it needs the whole dataset in memory. SGDRegressor, in contrast, optimizes the loss function incrementally, which makes it computationally efficient and suitable for large datasets. This efficiency comes at the cost of potentially slower and noisier convergence, especially in the presence of noisy or outlier-laden data. SGDRegressor also offers more flexibility in terms of loss functions, penalties, and learning-rate schedules, while Linear Regression provides a simpler API with almost nothing to tune. The choice between them depends on the size of the dataset and the trade-off between computational resources and convergence behavior.

### 10.3 Summary

Stochastic Gradient Descent (SGD) is a powerful optimization algorithm widely used in machine learning for training a variety of models, particularly those with large datasets. It stands out for its efficiency and ability to handle massive amounts of data that may not fit into memory. Unlike traditional gradient descent, which computes weight updates using the entire dataset, SGD updates model parameters incrementally, typically one data point or a small batch at a time. This stochastic nature introduces randomness into the optimization process, which can sometimes help escape local minima and explore the loss landscape more effectively.

SGD iteratively adjusts model parameters by calculating gradients of the loss function with respect to these parameters and then updating them in the direction that minimizes the loss. Its adaptability and speed make it suitable for online learning and real-time scenarios. However, this adaptability can also result in noisy updates that may slow convergence. To mitigate this, learning rates can be adjusted dynamically.
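In scikit-learn, the learning rate schedule is selected with the `learning_rate` parameter. The sketch below compares three of the available schedules on synthetic data; the data, `eta0=0.01`, and the other settings are assumptions chosen for illustration, and the resulting numbers will differ on other datasets.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.standard_normal(100)

results = {}
# 'constant'   keeps the step size at eta0
# 'invscaling' shrinks it gradually as the update counter grows
# 'adaptive'   keeps eta0 until progress stalls, then divides it by 5
for schedule in ("constant", "invscaling", "adaptive"):
    model = SGDRegressor(
        learning_rate=schedule,
        eta0=0.01,
        max_iter=10000,
        tol=1e-3,
        random_state=0,
    )
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    results[schedule] = (model.n_iter_, mse)

for schedule, (epochs, mse) in results.items():
    print(f"{schedule:>10}: epochs={epochs}, training MSE={mse:.4f}")
```

Comparing the epoch counts and errors across schedules gives a quick feel for how the step-size policy trades training time against the quality of the final fit.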

In summary, SGD is a versatile and efficient optimization technique, making it a go-to choice for training machine learning models, especially when dealing with large datasets. Because each update is cheap, it often makes faster progress per unit of computation than full-batch methods, but careful tuning of hyperparameters, such as the learning rate, is essential to ensure good performance.