Logistic regression is a statistical method for binary classification problems, where the outcome variable is categorical with two possible values (e.g., 0 or 1, True or False, Yes or No). Despite the name, it is used as a classification algorithm: it models the probability of the positive class and assigns a label by thresholding that probability.
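To make the idea concrete, here is a minimal sketch of how a logistic model turns a weighted sum of features into a class label. The weights, bias, and inputs are hypothetical values chosen only for illustration; scikit-learn learns the real ones during fitting.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability between 0 and 1."""
    # Two branches avoid overflow in exp() for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical weights and bias, not taken from the notebook's model
weights = [0.8, -0.4]
bias = -0.5

print(sigmoid(0.0))                              # 0.5
print(predict_label([2.0, 1.0], weights, bias))  # z = 0.7  -> 1
print(predict_label([0.0, 2.0], weights, bias))  # z = -1.3 -> 0
```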

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [6]:
# Load dataset
df = pd.read_csv('./data5.csv')

In [8]:
df.head()

Unnamed: 0,CustomerID,Genre,Age,Annual Income,Spending Score
0,1,Male,19,19000,0
1,2,Male,35,20000,0
2,3,Female,26,43000,0
3,4,Female,27,57000,0
4,5,Male,19,76000,0


In [9]:
# Drop the 'Genre' column (gender, stored as text)
df1 = df.drop("Genre", axis=1)

In [11]:
df1.head()

Unnamed: 0,CustomerID,Age,Annual Income,Spending Score
0,1,19,19000,0
1,2,35,20000,0
2,3,26,43000,0
3,4,27,57000,0
4,5,19,76000,0


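df1.iloc[:, -1]: This selects the last column of the df1 DataFrame (Spending Score), which is used as the target.
df1.iloc[:, :-1]: This selects all columns except the last one (Unnamed: 0, CustomerID, Age, Annual Income), which are used as features. Note that Unnamed: 0 and CustomerID are row identifiers rather than customer attributes.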
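As a small illustration of the two selections, the sketch below rebuilds the five rows shown by df1.head() and compares positional selection with a name-based alternative. The choice of Age and Annual Income as the meaningful feature columns is an assumption made here, not something the notebook does.

```python
import pandas as pd

# First five rows, copied from the df1.head() output above
df_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "CustomerID": [1, 2, 3, 4, 5],
    "Age": [19, 35, 26, 27, 19],
    "Annual Income": [19000, 20000, 43000, 57000, 76000],
    "Spending Score": [0, 0, 0, 0, 0],
})

# Positional selection, as in the notebook
y_pos = df_demo.iloc[:, -1].values
X_pos = df_demo.iloc[:, :-1].values

# Name-based alternative (assumed feature list, for illustration only)
feature_cols = ["Age", "Annual Income"]
X_named = df_demo[feature_cols].values
y_named = df_demo["Spending Score"].values

print(X_pos.shape)    # (5, 4): includes the two identifier columns
print(X_named.shape)  # (5, 2)
```

Selecting by name keeps working if the column order changes, and it makes explicit which columns the model actually sees.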

In [12]:
# Define features and target variable
y = df1.iloc[:, -1].values  # Target variable
X = df1.iloc[:, :-1].values  # Feature set

The train_test_split function, imported earlier from the sklearn.model_selection module, splits the dataset into training and testing sets.

| Component        | Description                                                    |
| ---------------- | -------------------------------------------------------------- |
| `X`              | Feature set (independent variables)                            |
| `y`              | Target variable (dependent variable)                           |
| `test_size=0.25` | 25% of the data will be used for testing, and 75% for training |
| `random_state=2` | Ensures reproducibility of the split                           |

X_train: 75% of X (features) used to train the model

X_test: 25% of X used to test the model

y_train: 75% of y (targets) used for training

y_test: 25% of y used for testing

In [13]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)
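The 75/25 proportions can be checked on a small synthetic array. The sketch below uses made-up data (100 rows, 20 of them positive) and also shows the optional `stratify` argument, which the notebook does not use; it keeps the class ratio the same in both parts, which can matter when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 20 positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# Same arguments as the notebook's split
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)
print(Xtr.shape, Xte.shape)  # (75, 2) (25, 2)

# Optional variant: stratify on the labels to preserve the 80/20 class ratio
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2, stratify=y_demo
)
print(ytr_s.sum(), yte_s.sum())  # positives in train and test
```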

| Step                     | Description                                               |
|--------------------------|-----------------------------------------------------------|
| `LogisticRegression()`   | Creates an instance of the logistic regression model.     |
| `.fit(X_train, y_train)` | Trains the model using the training features and labels.  |


In [14]:
# Train Logistic Regression model
LR = LogisticRegression()
LR.fit(X_train, y_train)
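The features here sit on very different scales (ages in the tens, incomes in the tens of thousands), which can slow the solver's convergence. A common variation, not used in this notebook, is to standardize the features inside a pipeline. The sketch below runs on synthetic data whose columns and labelling rule are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: age and income, label depends on income only
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=200)
income = rng.integers(15_000, 150_000, size=200)
X_demo = np.column_stack([age, income]).astype(float)
y_demo = (income > 90_000).astype(int)

# Scale each feature to zero mean and unit variance, then fit the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_demo, y_demo)

train_accuracy = model.score(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {train_accuracy:.2f}")
```

Because the scaler is part of the pipeline, the same transformation is applied automatically whenever `model.predict` is called on new rows.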

LR is the trained LogisticRegression model.
.predict(X_test) asks the model to predict a class label (0 or 1) for each row of test features.
y_pred stores the predicted labels for the test set as a NumPy array.

As an illustration, in a purchase-prediction dataset where X_test contains the Age and EstimatedSalary of 100 people,
y_pred would be an array of 100 values (0s and 1s), where:
    0 = did not purchase,
    1 = did purchase.

In [17]:
# Make predictions
y_pred = LR.predict(X_test)
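Behind each predicted label is a probability. As a minimal sketch on a tiny made-up dataset, `predict_proba` returns one column per class, and `predict` picks the class with the larger probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, labels switch from 0 to 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # shape (4, 2): P(class 0), P(class 1)
labels = clf.predict(X_demo)       # class with the higher probability

for p, label in zip(proba, labels):
    print(f"P(0)={p[0]:.2f}  P(1)={p[1]:.2f}  ->  {label}")
```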

A confusion matrix is a table used to evaluate the performance of a classification model. It compares the actual values with the predicted values from the model.

|                      | **Predicted: Positive** | **Predicted: Negative** |
| -------------------- | ----------------------- | ----------------------- |
| **Actual: Positive** | True Positive (TP)      | False Negative (FN)     |
| **Actual: Negative** | False Positive (FP)     | True Negative (TN)      |

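ravel() is a NumPy array method that flattens a multi-dimensional array into a 1D array. For labels 0 and 1, scikit-learn's confusion_matrix returns [[TN, FP], [FN, TP]]: rows are actual classes and columns are predicted classes, in sorted label order. This puts the negative class first, the reverse of the table above, so ravel() yields the values in the order tn, fp, fn, tp.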

In [19]:
# Compute confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
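The ordering of the four values is easy to get wrong, so here is a small check on made-up labels. It also shows that passing `labels=[1, 0]` puts the positive class first, matching the layout of the table above.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)
# [[3 2]    row 0: actual 0 -> TN, FP
#  [1 2]]   row 1: actual 1 -> FN, TP

tn_d, fp_d, fn_d, tp_d = cm.ravel()
print(tn_d, fp_d, fn_d, tp_d)  # 3 2 1 2

# Positive class first: [[TP, FN], [FP, TN]]
cm_pos_first = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm_pos_first)
```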

Accuracy measures how many predictions the model got right out of all predictions.
Precision calculates how many of the predicted positive cases are actually positive.
Recall measures how many of the actual positive cases were correctly identified by the model.
F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall.
Specificity measures how many of the actual negative cases were correctly identified by the model.
Error rate calculates the proportion of incorrect predictions made by the model.
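Written in terms of the four confusion-matrix counts (TP, TN, FP, FN), these definitions are:

```latex
\begin{aligned}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision}   &= \frac{TP}{TP + FP} \\
\text{Recall}      &= \frac{TP}{TP + FN} \\
F_1                &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Error rate}  &= \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy}
\end{aligned}
```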

In [26]:
# Calculate metrics
# Accuracy, specificity, and error rate are percentages (0-100);
# precision, recall, and F1 score are fractions (0-1).
accuracy = (tn + tp) * 100 / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = (2 * precision * recall) / (precision + recall)
specificity = tn * 100 / (tn + fp)
error = (fp + fn) * 100 / (tp + tn + fp + fn)

In [28]:
# Display metrics
print(f"Accuracy: {accuracy:.2f}%")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")
print(f"Specificity: {specificity:.2f}%")
print(f"Error Rate: {error:.2f}%")

Accuracy: 96.00%
Precision: 0.75
Recall: 0.75
F1 Score: 0.75
Specificity: 97.83%
Error Rate: 4.00%
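The notebook does not print the confusion-matrix counts themselves. One set of counts that reproduces every printed value is tn=45, fp=1, fn=1, tp=3 (50 test rows); these are inferred here from the output, not taken from the source. The sketch below recomputes the metrics from them, returning all values as fractions and guarding against division by zero, which the formulas above do not.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute common binary-classification metrics as fractions in [0, 1]."""
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero
        return num / den if den else 0.0

    total = tn + fp + fn + tp
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tn + tp, total),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "error": ratio(fp + fn, total),
    }

# Counts inferred from the printed metrics (an assumption)
m = classification_metrics(tn=45, fp=1, fn=1, tp=3)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under these inferred counts, 46 of the 50 test rows are negative, so a model that always predicted 0 would already reach 92% accuracy. That is why precision and recall are worth reading alongside accuracy here.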


In [31]:
# Take user input for feature values, in the same column order as X
print("Enter values for the following features:")
input_features = []
for i in range(X_train.shape[1]):
    val = float(input(f"Feature {i+1}: "))
    input_features.append(val)


# Convert input into numpy array and reshape
input_array = np.array(input_features).reshape(1, -1)

# Make prediction
prediction = LR.predict(input_array)[0]
print(f"Predicted Output: {prediction}")



Enter values for the following features:
Predicted Output: 1
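For non-interactive use, the same single-row input can be built from a comma-separated string. The helper below is a hypothetical convenience, not part of the notebook; it checks the number of values and returns a 1 x n nested list, which scikit-learn's `predict` accepts in place of the reshaped NumPy array.

```python
from typing import List

def parse_feature_row(raw: str, n_features: int) -> List[List[float]]:
    """Parse 'v1, v2, ...' into a single-row 2D list of floats."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    if len(parts) != n_features:
        raise ValueError(
            f"expected {n_features} values, got {len(parts)}"
        )
    # float() raises ValueError on non-numeric text
    return [[float(p) for p in parts]]

# Four values, matching the four feature columns of X in this notebook
row = parse_feature_row("0, 1, 19, 19000", n_features=4)
print(row)  # [[0.0, 1.0, 19.0, 19000.0]]
```

With the notebook's model, the result could then be passed as `LR.predict(row)[0]`.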
