# Logistic regression

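Logistic regression is a statistical method for binary classification, where the outcome variable has only two possible classes. It models the probability that an observation belongs to the positive class and predicts the more likely of the two.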
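
As a rough illustration of the idea, the sketch below passes a weighted sum of the features through the logistic (sigmoid) function and thresholds the result at 0.5. The weights, bias and sample measurements are hypothetical values picked by hand for this example; they are not the coefficients fitted later in this notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return int(sigmoid(z) >= threshold)


# Hypothetical parameters (assumption): only petal length carries weight
weights = [0.0, 0.0, 1.5, 0.0]
bias = -3.75

short_petal = [5.1, 3.5, 1.4, 0.2]  # illustrative measurements in cm
long_petal = [6.0, 2.9, 4.5, 1.5]   # illustrative measurements in cm

print(predict_label(short_petal, weights, bias))  # 0
print(predict_label(long_petal, weights, bias))   # 1
```

A fitted model learns the weights and bias from the training data rather than taking them as given.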

In [30]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [31]:
# Load the Iris dataset
iris = load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [32]:
X, y = iris.data, iris.target

In [33]:
# Each class has 50 samples and the rows are ordered by class,
# so the first 100 rows contain only Setosa and Versicolor
index = 100

In [34]:
# Preprocess the data (for binary classification)
# We'll consider only two classes: Setosa (class 0) and Versicolor (class 1)
X_binary = X[0:index]
y_binary = y[0:index]

In [35]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)

In [36]:
# Initialise the logistic regression model
model = make_pipeline(StandardScaler(), LogisticRegression())

In [37]:
# Fit the model to the data
model.fit(X_train, y_train)

In [38]:
# Predict new values
y_pred = model.predict(X_test)

## Model Evaluation

## Accuracy
- **Interpretation:** Accuracy is the ratio of correctly predicted observations to the total observations. It measures the overall correctness of the model.
- **Good vs. Bad Values:** Accuracy ranges from 0 to 1. A value of 1 means the model made correct predictions for all observations. A value of 0 means all predictions were incorrect. However, accuracy can be misleading if the classes are imbalanced. In such cases, a model could achieve high accuracy by simply predicting the majority class.
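
The imbalance caveat can be made concrete with a small sketch. The labels below are made up for illustration (95 negatives, 5 positives), and the "model" simply predicts the majority class every time.

```python
def accuracy(y_true, y_hat):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_hat))
    return correct / len(y_true)


# Hypothetical imbalanced data (assumption): 95 negatives, 5 positives
y_true_imbalanced = [0] * 95 + [1] * 5
y_majority = [0] * 100  # always predict the majority class

print(accuracy(y_true_imbalanced, y_majority))  # 0.95
```

The score looks strong even though not a single positive case is found, which is why the metrics that follow are worth checking as well.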

In [39]:
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 1.0


## Precision
- **Interpretation:** Precision is the ratio of correctly predicted positive observations to the total predicted positives. It measures the model’s ability to correctly identify only relevant instances (true positives).
- **Good vs. Bad Values:** Precision ranges from 0 to 1. A value of 1 means the model has no false positives. A value of 0 means the model has no true positives. Higher precision indicates fewer false positives.
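
A minimal sketch of the ratio, computed by hand on made-up labels. Returning 0.0 when nothing is predicted positive is a convention assumed here for the undefined case.

```python
def precision(y_true, y_hat, positive=1):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_hat))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_hat))
    if tp + fp == 0:
        return 0.0  # assumed convention when there are no predicted positives
    return tp / (tp + fp)


# Hypothetical labels: three positives are predicted, two of them correctly
y_true_prec = [1, 1, 1, 0, 0, 0]
y_hat_prec = [1, 1, 0, 1, 0, 0]

print(precision(y_true_prec, y_hat_prec))  # 2 of 3 predicted positives are correct
```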

In [40]:
print(f"Precision: {precision_score(y_test, y_pred)}")

Precision: 1.0


## Recall
- **Interpretation:** Recall (or Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual positive class. It measures the model’s ability to find all the relevant cases within a dataset.
- **Good vs. Bad Values:** Recall ranges from 0 to 1. A value of 1 means the model has no false negatives. A value of 0 means the model has no true positives. Higher recall indicates fewer false negatives.
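
The following sketch computes recall by hand on made-up labels. The predictions are deliberately cautious: the single predicted positive is correct, so there are no false positives, yet most actual positives are missed. Returning 0.0 when there are no actual positives is an assumed convention.

```python
def recall(y_true, y_hat, positive=1):
    """TP / (TP + FN): share of actual positives that the model finds."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_hat))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_hat))
    if tp + fn == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / (tp + fn)


# Hypothetical labels: only one of the three actual positives is predicted
y_true_rec = [1, 1, 1, 0, 0, 0]
y_hat_rec = [1, 0, 0, 0, 0, 0]

print(recall(y_true_rec, y_hat_rec))  # 1 of 3 actual positives is found
```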

In [41]:
print(f"Recall: {recall_score(y_test, y_pred)}")

Recall: 1.0


## F1 Score
- **Interpretation:** The F1 Score is the harmonic mean of Precision and Recall. It balances the two and, unlike an arithmetic mean, is pulled toward the lower of the two values.
- **Good vs. Bad Values:** F1 Score ranges from 0 to 1. A value of 1 means perfect precision and recall. A value of 0 means either the precision or the recall is zero. Higher F1 Score indicates better balance between precision and recall.
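
For reference, F1 can be written as 2 · precision · recall / (precision + recall). The sketch below compares it with a plain arithmetic mean on made-up values; the function name is an illustrative choice.

```python
def f1_from_precision_recall(p, r):
    """F1 = 2 * p * r / (p + r), defined as 0.0 when both inputs are zero."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Hypothetical scores: perfect precision but only half of the positives found
p_demo, r_demo = 1.0, 0.5

print(f1_from_precision_recall(p_demo, r_demo))  # about 0.667
print((p_demo + r_demo) / 2)                     # 0.75
```

The F1 value sits below the arithmetic mean because it is drawn toward the weaker of the two inputs.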

In [42]:
print(f"F1 Score: {f1_score(y_test, y_pred)}")

F1 Score: 1.0


## Confusion Matrix
- **Interpretation:** A confusion matrix is a table that summarizes the performance of a classification model by counting how often each actual class was assigned each predicted class.
- **Good vs. Bad Values:** In a confusion matrix, the diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
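
As a sketch of how such a table is built for two classes, the code below tallies made-up labels into a 2×2 list, with rows indexed by the true label and columns by the predicted label (the same layout scikit-learn's `confusion_matrix` uses). The helper name is illustrative.

```python
def binary_confusion_matrix(y_true, y_hat):
    """Return [[TN, FP], [FN, TP]] for labels encoded as 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_hat):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix


# Hypothetical labels with one false positive and one false negative
y_true_cm = [0, 0, 0, 1, 1, 1, 1]
y_hat_cm = [0, 1, 0, 1, 1, 0, 1]

cm = binary_confusion_matrix(y_true_cm, y_hat_cm)
print(cm)  # [[2, 1], [1, 3]]

(tn, fp), (fn, tp) = cm
print(tp / (tp + fp), tp / (tp + fn))  # precision and recall: 0.75 0.75
```

Accuracy, precision and recall can all be read off these four counts.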

In [43]:
print(f"Confusion Matrix:\n {confusion_matrix(y_test, y_pred)}")

Confusion Matrix:
 [[12  0]
 [ 0  8]]


## AUC-ROC
- **Interpretation:** The area under the ROC (Receiver Operating Characteristic) curve quantifies the overall capacity of the model to distinguish between positive and negative classes. The ROC curve itself plots the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) as the decision threshold varies.
- **Good vs. Bad Values:** AUC-ROC ranges from 0 to 1. A value of 1 means the model is perfect. A value of 0.5 means the model is no better than random guessing. A value less than 0.5 means the model is worse than random guessing. Higher AUC-ROC indicates better model performance.
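
One way to build intuition for these values: AUC-ROC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it by comparing every positive/negative pair on made-up scores; it is a brute-force illustration, not how `roc_curve` and `auc` work internally.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive (0.35) is ranked below one negative (0.4)
y_true_auc = [0, 0, 1, 1]
scores_auc = [0.1, 0.4, 0.35, 0.8]

print(auc_pairwise(y_true_auc, scores_auc))  # 0.75
```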

In [44]:
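# Note: y_pred holds hard 0/1 labels, so the curve has a single operating point.
# Passing scores such as model.predict_proba(X_test)[:, 1] traces the full curve.
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)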
print(f"false_positive_rate: {false_positive_rate}")
print(f"true_positive_rate: {true_positive_rate}")
print(f"AUC-ROC: {auc(false_positive_rate, true_positive_rate)}")

false_positive_rate: [0. 0. 1.]
true_positive_rate: [0. 1. 1.]
AUC-ROC: 1.0


## Cross Validation

In [45]:
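# Perform 5-fold cross-validation; for a classifier, scikit-learn defaults to
# stratified folds and reports accuracy for each fold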
scores = cross_val_score(model, X_binary, y_binary, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average cross-validation score: {scores.mean()}")

Cross-validation scores: [1. 1. 1. 1. 1.]
Average cross-validation score: 1.0
