# Module 1: Introduction to Scikit-Learn

## Section 3: Supervised Learning Algorithms

### Part 5: Logistic Classification / Regression

In this section, we will explore Logistic Regression, a popular supervised learning algorithm used for binary and multiclass classification tasks.

### 5.1 Understanding Logistic Regression

Logistic Regression is a classification algorithm that uses the logistic function (also known as the sigmoid function) to model the relationship between the independent variables and the probability of an instance belonging to a specific class. The logistic function maps the output of a linear equation to a value between 0 and 1, which is interpreted as a probability.

Despite its name, logistic regression is primarily used for binary or multiclass classification problems. It models the probability that an instance belongs to a particular class, typically represented as 0 and 1, True and False, or Negative and Positive.

It is similar to linear regression, except that logistic regression predicts whether something is true or false instead of predicting a continuous value. Instead of fitting a line, logistic regression fits a logistic function with values from 0 to 1. The curve therefore gives the probability of belonging to a particular class based on the feature value.

With linear regression we fit a line using least squares: we find the line that minimizes the sum of the squared residuals. We also use the residuals to calculate $R^2$ and evaluate the model performance.

The equation of the logistic regression model can be represented as:

$\sigma(z) = \frac{1}{(1 + e^{-z})}$

Where:

- $\sigma(z)$ is the output (probability) of the function.
- $z$ is the linear combination of the feature values and their corresponding model coefficients.
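The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation; the coefficients `b0` and `b1` and the feature values are hypothetical, chosen only to show how a linear combination $z$ is mapped into a probability.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: z = b0 + b1 * x
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

z = b0 + b1 * x             # linear combination of the feature and coefficients
probabilities = sigmoid(z)  # probability of belonging to class 1

for xi, zi, pi in zip(x, z, probabilities):
    print(f"x={xi:5.2f}  z={zi:5.2f}  sigma(z)={pi:.3f}")
```

Note that $z = 0$ maps to a probability of exactly 0.5, and larger values of $z$ push the probability towards 1.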

With logistic regression we cannot use the concept of residuals in the same way, so we can't use $R^2$ to evaluate the model performance. Instead, the model is fitted using something called maximum likelihood.

### 5.2 Training and Evaluation

To train a logistic regression model, you need a labeled dataset, where the target variable (class label) is binary. The model is trained to maximize the likelihood of the observed class labels given the input features. The likelihood function measures how well the given model's parameters explain the observed data. The optimization process aims to find the coefficients that best separate the two classes. 
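To make the idea of likelihood more concrete, the sketch below computes the average negative log-likelihood of a set of binary labels under two sets of predicted probabilities. Maximizing the likelihood is equivalent to minimizing this quantity. The labels and probabilities are made up for illustration, and the helper `negative_log_likelihood` is a hypothetical name, not part of scikit-learn; the result is compared with `sklearn.metrics.log_loss`, which computes the same quantity.

```python
import numpy as np
from sklearn.metrics import log_loss

def negative_log_likelihood(y_true, p_pred):
    """Average negative log-likelihood of binary labels given P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
good_probs = [0.1, 0.2, 0.8, 0.9]  # probabilities that agree with the labels
poor_probs = [0.6, 0.7, 0.4, 0.3]  # probabilities that disagree with the labels

nll_good = negative_log_likelihood(y_true, good_probs)
nll_poor = negative_log_likelihood(y_true, poor_probs)

print("NLL (good fit):", nll_good)
print("NLL (poor fit):", nll_poor)
print("sklearn log_loss (good fit):", log_loss(y_true, good_probs))
```

The probabilities that explain the labels better have the lower negative log-likelihood, which is what the optimizer is looking for when it adjusts the coefficients.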

Once we have found the best sigmoid function, we can predict the probability of belonging to a class. By default, most classifiers, including logistic regression, use a threshold of 0.5, meaning that a predicted probability greater than or equal to 0.5 is classified as the positive class, and a predicted probability less than 0.5 is classified as the negative class.
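As a small illustration of the thresholding step, the sketch below turns probabilities into class labels. The probabilities are hypothetical values and `classify` is a helper name introduced here for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities of belonging to the positive class
probabilities = np.array([0.05, 0.30, 0.50, 0.72, 0.95])

def classify(probabilities, threshold=0.5):
    """Label a sample as 1 when its probability reaches the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

default_labels = classify(probabilities)                 # threshold of 0.5
lenient_labels = classify(probabilities, threshold=0.3)  # lower threshold

print("Threshold 0.5:", default_labels)
print("Threshold 0.3:", lenient_labels)
```

Lowering the threshold labels more samples as positive, which typically raises recall at the cost of precision.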

Finally, we evaluate our model performance. This time the output is not continuous but categorical (classes). The concept of residuals, which are based on continuous predictions, doesn't directly apply in the same way as it does for regression. For classification models, it's more common to evaluate performance using metrics like accuracy, precision, recall, F1-score, and the ROC curve. These metrics provide a more intuitive understanding of how well the model is classifying data into different categories. Let's go through each one:

1. Confusion Matrix: A confusion matrix is a tabular representation that compares the predicted class labels with the actual class labels.

    | Actual/Predicted | Positive | Negative |
    |------------------|----------|----------|
    | Positive         | True Positive (TP) | False Negative (FN) |
    | Negative         | False Positive (FP) | True Negative (TN) |
    
    It breaks down the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

2. Accuracy: Accuracy is a fundamental metric that measures the proportion of correct predictions (both true positives and true negatives) out of all predictions. It calculates the ratio of correctly predicted instances to the total number of instances. Accuracy might not be sufficient for imbalanced datasets where one class is much more frequent than the other.
<br/>$Accuracy = (TP + TN) / (TP + TN + FP + FN)$

3. Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It answers the question: "Of all instances predicted as positive, how many are actually positive?" High precision indicates that when the model predicts a positive result, it's likely to be correct.
<br/>$Precision = TP / (TP + FP)$

4. Recall (Sensitivity or True Positive Rate): Recall measures the proportion of actual positive instances that were correctly predicted as positive. It answers the question: "Of all actual positive instances, how many did the model predict as positive?" High recall indicates that the model is good at identifying positive instances.
<br/>$Recall = TP / (TP + FN)$

5. F1-Score: The F1-score is the harmonic mean of precision and recall. It considers both false positives and false negatives, making it a balanced metric that is especially useful when class distribution is imbalanced.
<br/>$F1-Score = 2 * (Precision * Recall) / (Precision + Recall)$

6. Specificity (True Negative Rate): Specificity measures the proportion of actual negative instances that were correctly predicted as negative.
<br/>$Specificity = TN / (TN + FP)$

7. False Positive Rate: The false positive rate is the proportion of actual negative instances that were incorrectly predicted as positive.
<br/>$False Positive Rate = FP / (FP + TN)$

8. False Negative Rate: The false negative rate is the proportion of actual positive instances that were incorrectly predicted as negative.
<br/>$False Negative Rate = FN / (FN + TP)$

9. ROC Curve (Receiver Operating Characteristic Curve):
The ROC curve is a graphical representation of the trade-off between true positive rate (recall) and false positive rate (1 - specificity) as the classification threshold varies. It helps assess the model's ability to distinguish between positive and negative classes. The area under the ROC curve (AUC) summarizes the performance across all possible thresholds.
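The metrics listed above can be computed by hand from the four confusion-matrix counts and checked against scikit-learn. The sketch below uses a small set of hypothetical labels and scores. One detail to keep in mind: `confusion_matrix` orders rows and columns by sorted label, so for labels 0 and 1 the negative class comes first, unlike the table above, and `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.05])
y_pred = (y_score >= 0.5).astype(int)  # default threshold of 0.5

# Rows are actual classes, columns are predicted classes, both ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)

# AUC is computed from the scores, not from the thresholded labels
auc_value = roc_auc_score(y_true, y_score)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy:", accuracy, "| sklearn:", accuracy_score(y_true, y_pred))
print("Precision:", precision, "| sklearn:", precision_score(y_true, y_pred))
print("Recall:", recall, "| sklearn:", recall_score(y_true, y_pred))
print("F1-score:", f1, "| sklearn:", f1_score(y_true, y_pred))
print("Specificity:", specificity)
print("False positive rate:", false_positive_rate)
print("False negative rate:", false_negative_rate)
print("ROC AUC:", auc_value)
```

The false positive rate equals 1 minus the specificity, and the false negative rate equals 1 minus the recall, so only two of these four rates carry independent information.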

### 5.3 Implementing Logistic Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

# Generate a random dataset with one feature and two classes
X, y = make_classification(n_samples=50, n_features=1, n_informative=1, n_redundant=0, n_clusters_per_class=1, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a logistic regression model
logreg = LogisticRegression()
# Train the model on the train data
logreg.fit(X_train, y_train)
# Generate a range of feature values for prediction
X_range = np.linspace(X.min(), X.max(), 50).reshape(-1, 1)
# Predict the probability of belonging to class 1 over the range of feature values.
# The fitted curve is the same in both plots; it is kept under two names for plotting.
probabilities_train = logreg.predict_proba(X_range)[:, 1]  # Fitted sigmoid, drawn over the train data
probabilities_test = logreg.predict_proba(X_range)[:, 1]   # Same fitted sigmoid, drawn over the test data

# Create subplots
plt.figure(figsize=(18, 5))
# Plot the original data points
plt.subplot(1, 3, 1)
plt.scatter(X, y, c='blue', label='Data Points')
plt.xlabel('Feature')
plt.ylabel('Class Label')
plt.title('Original Data Points')
plt.legend()
# Plot sigmoid curve on train data along with predicted probabilities as red points
plt.subplot(1, 3, 2)
plt.scatter(X_train, y_train, c='blue', label='Train Data')
plt.plot(X_range, probabilities_train, color='purple', label='Sigmoid (Train)')
plt.scatter(X_train, logreg.predict_proba(X_train)[:, 1], color='red', marker='o', label='Predicted Probabilities (Train)')
plt.xlabel('Feature')
plt.ylabel('Probability / Class Label')
plt.legend()
plt.title('Sigmoid on Train Data')
# Plot sigmoid curve on test data along with predicted probabilities as red points
plt.subplot(1, 3, 3)
plt.scatter(X_test, y_test, c='blue', label='Test Data')
plt.plot(X_range, probabilities_test, color='purple', label='Sigmoid (Test)')
plt.scatter(X_test, logreg.predict_proba(X_test)[:, 1], color='red', marker='o', label='Predicted Probabilities (Test)')
plt.xlabel('Feature')
plt.ylabel('Probability / Class Label')
plt.legend()
plt.title('Sigmoid on Test Data')
# Adjust layout to prevent overlapping
plt.tight_layout()
# Show the plots
plt.show()

# Predict classes for train and test data
y_pred_train = logreg.predict(X_train)
y_pred_test = logreg.predict(X_test)
# Optional: uncomment the lines below to classify with a custom threshold
# instead of the default 0.5
# custom_threshold = 0.3
# y_pred_train = (logreg.predict_proba(X_train)[:, 1] >= custom_threshold).astype(int)
# y_pred_test = (logreg.predict_proba(X_test)[:, 1] >= custom_threshold).astype(int)
# Display evaluation metrics
accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test)
recall_test = recall_score(y_test, y_pred_test)
f1_test = f1_score(y_test, y_pred_test)

print("\nTest Set Metrics:")
print("Accuracy:", accuracy_test)
print("Precision:", precision_test)
print("Recall:", recall_test)
print("F1-score:", f1_test)

In this example, we have a dataset with feature values between -1.5 and 2.5. Each feature value has an associated class label of 0 or 1. We split the dataset into train and test sets using train_test_split from sklearn.model_selection. The left plot shows the original data points before fitting a logistic regression. The middle plot displays the sigmoid curve fitted to the training data, and the right plot shows the same fitted sigmoid curve applied to the test data.

The logistic regression model is trained on the training data and then used to predict the probabilities of belonging to class 1 on the test data. The predict_proba method returns an array where each row represents a data point, and the two columns represent the probabilities of belonging to class 0 and class 1, respectively. Now that we have the probability of belonging to each class for every point, we choose a threshold, based on our needs, to decide whether a point is assigned to the positive class. By default, the threshold is 0.5: a predicted probability greater than or equal to 0.5 is classified as the positive class, and a predicted probability less than 0.5 is classified as the negative class.
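The layout of the array returned by predict_proba can be inspected directly. The sketch below rebuilds a dataset with the same make_classification settings as above and, as a simplification for illustration, fits the model on the whole dataset without a train-test split. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same dataset settings as in the example above
X, y = make_classification(n_samples=50, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)
print("Shape:", proba.shape)               # one row per sample, one column per class
print("Column order:", model.classes_)     # column i corresponds to classes_[i]
print("First rows:\n", proba[:3])

# Each row holds P(class 0) and P(class 1), so it sums to 1
row_sums = proba.sum(axis=1)

# Class labels from the default rule and from a lower, custom threshold
labels_default = (proba[:, 1] >= 0.5).astype(int)
labels_lenient = (proba[:, 1] >= 0.3).astype(int)
print("Positives at 0.5:", labels_default.sum())
print("Positives at 0.3:", labels_lenient.sum())
```

Checking `classes_` before indexing a column is a useful habit, since the column order follows the sorted class labels rather than any notion of a "positive" class.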

Using the default threshold of 0.5, we can finally evaluate our model performance.
- An accuracy of 0.87 indicates that approximately 87% of the predictions were correct.
- A precision of 0.8 indicates that 80% of the instances predicted as positive were actually positive.
- A recall of 1.0 means that the model is able to identify all positive instances.
- An F1-score of 0.89 indicates a good balance between precision and recall.
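
As a sanity check, the reported numbers can be reproduced from a single set of confusion-matrix counts. The counts below are inferred from the metrics above, assuming the test set holds 15 samples (30% of 50); they are not read from the notebook output.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output)
tp, fp, fn, tn = 8, 2, 0, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 13 correct out of 15
precision = tp / (tp + fp)                          # 8 of 10 predicted positives
recall = tp / (tp + fn)                             # no positive instance missed
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", round(accuracy, 2))
print("Precision:", round(precision, 2))
print("Recall:", round(recall, 2))
print("F1-score:", round(f1, 2))
```

Under this reading, the model makes two mistakes on the test set, both of them false positives, which explains why recall is perfect while precision is lower.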

### 5.4 Conclusion

Logistic Regression is a widely used algorithm for binary and multiclass classification tasks. It models the probability of an instance belonging to a certain class based on the values of independent variables. Scikit-Learn provides the LogisticRegression class to implement logistic regression models easily. Understanding the underlying assumptions and techniques is crucial for interpreting the results and applying logistic regression effectively.

In the next part, we will explore another popular supervised learning algorithm, Decision Trees, used for both classification and regression tasks.

Feel free to practice implementing Logistic Regression using Scikit-Learn. Experiment with different features, evaluation metrics, and techniques to gain a deeper understanding of the algorithm and its performance.