**Logistic Regression**

**Question 1:** What is Logistic Regression, and how does it differ from Linear
Regression?

Answer:
1. What is Logistic Regression?

Logistic Regression is a supervised machine learning algorithm used for classification problems, especially binary classification (e.g., yes/no, default/no default).

It predicts the probability that a given input belongs to a particular class.

It uses the logistic (sigmoid) function to map any real-valued number to a value between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The predicted probability can then be converted to a class label using a threshold (commonly 0.5).
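As a minimal sketch of these two steps, the sigmoid can be written with NumPy and its output thresholded at 0.5. The scores in `z` are made-up values chosen for illustration, not the output of a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to values strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores, chosen only for illustration
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

probabilities = sigmoid(z)
labels = (probabilities >= 0.5).astype(int)  # class 1 when probability >= 0.5

print(np.round(probabilities, 3))
print(labels)
```

The probabilities rise from about 0.018 to about 0.982 as the score increases, and a score of 0 maps to exactly 0.5, which is why the 0.5 threshold corresponds to the sign of the score.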


2. Key Idea

Instead of predicting a continuous output (like linear regression), logistic regression predicts the log-odds of the outcome.

Log-odds (logit) = log(P / (1-P))

The model is linear in the log-odds space:

$$\log\frac{P}{1-P} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

3. How It Differs from Linear Regression
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Task | Predict continuous numeric values | Predict probability / class labels |
| Output Range | Any real number (-∞, ∞) | Probability [0,1] |
| Loss Function | Mean Squared Error (MSE) | Log-Loss / Cross-Entropy |
| Assumption | Linear relationship between X and y | Linear relationship between X and log-odds |
| Use Case Example | Predict house price, temperature | Predict loan default, disease (yes/no) |

4. Intuition

Linear regression: Fits a straight line to predict numbers.

Logistic regression: Fits an “S-shaped” curve (sigmoid) to output probabilities, which are then mapped to classes.
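The contrast can be sketched on a tiny made-up dataset: one feature, with the label switching from 0 to 1 at x = 5. The data and the query points in `X_new` are assumptions for illustration. The sketch fits both models with scikit-learn and checks that the logistic model's log-odds equal its linear score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, label flips at x = 5
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points just outside the training range
X_new = np.array([[-2.0], [12.0]])

linear_pred = linear.predict(X_new)           # unbounded real numbers
proba = logistic.predict_proba(X_new)[:, 1]   # probabilities of class 1

# The fitted logistic model is linear in log-odds:
# log(p / (1 - p)) equals intercept + coefficient * x
log_odds = np.log(proba / (1 - proba))
linear_score = logistic.decision_function(X_new)

print("Linear regression predictions:", linear_pred)
print("Logistic regression probabilities:", proba)
print("Log-odds match linear score:", np.allclose(log_odds, linear_score, atol=1e-6))
```

The straight line produces values below 0 and above 1 at the query points, while the logistic probabilities stay inside (0, 1).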

**Question 2:** Explain the role of the Sigmoid function in Logistic Regression.

Answer:
1. What is the Sigmoid Function?

The sigmoid function (also called the logistic function) is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties of sigmoid:

Outputs values in the range 0 to 1 → perfect for probabilities.

S-shaped curve (hence “sigmoid”).

Smooth and differentiable → important for gradient-based optimization.

2. Role of Sigmoid in Logistic Regression

Convert linear output to probability:

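The linear combination $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ can be any real number (-∞, ∞).

Sigmoid squashes it to a value between 0 and 1:

$$P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
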
This represents the probability that the output belongs to class 1.

Enable classification:

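Once we have a probability, we can set a threshold (commonly 0.5) to assign a class:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$
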
Support optimization with gradient descent:

The sigmoid function is differentiable, so we can compute gradients of the loss function (log-loss / cross-entropy) to update weights efficiently.
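To make the optimization role concrete, here is a minimal NumPy sketch of batch gradient descent on the log-loss. The synthetic data, the learning rate of 0.5 and the 200 iterations are arbitrary assumptions, and this is not how scikit-learn's solvers work internally. It relies on the fact that the gradient of the mean log-loss with respect to the weights is Xᵀ(p − y) / m when p = sigmoid(Xβ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data (assumption): class 1 when the two features sum to a positive value
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend a column of ones for the intercept
beta = np.zeros(Xb.shape[1])               # start from all-zero weights
learning_rate = 0.5

losses = []
for _ in range(200):
    p = sigmoid(Xb @ beta)                 # current predicted probabilities
    losses.append(log_loss(y, p))
    gradient = Xb.T @ (p - y) / len(y)     # gradient of the mean log-loss
    beta -= learning_rate * gradient       # gradient descent step

accuracy = np.mean((sigmoid(Xb @ beta) >= 0.5) == y)
print(f"Loss: {losses[0]:.4f} -> {losses[-1]:.4f}, training accuracy: {accuracy:.2f}")
```

With all-zero weights every probability is 0.5, so the starting loss is log(2) ≈ 0.693; each step then moves the weights in the direction that lowers the loss.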

3. Intuition

The linear model produces a raw score z. Sigmoid turns this score into a probability, mapping large negative scores to near 0 and large positive scores to near 1 (a score of 0 maps to exactly 0.5).

This allows logistic regression to make probabilistic predictions, rather than just numeric outputs like linear regression.


**Question 3:** What is Regularization in Logistic Regression and why is it needed?

Answer:
1. What is Regularization?

Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the model’s loss function.

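In Logistic Regression, the standard loss function is the log-loss / cross-entropy:

$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $p_i$ is the predicted probability for sample $i$ and $m$ is the number of samples.

Regularization adds a penalty on the size of the coefficients (β):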

L2 Regularization (Ridge):

$$\text{Loss} + \lambda \sum_{j} \beta_j^2$$

Penalizes large weights by squaring them.

Helps keep the model coefficients small and stable.

L1 Regularization (Lasso):

$$\text{Loss} + \lambda \sum_{j} |\beta_j|$$

Penalizes the absolute values of the weights.

Encourages sparsity, meaning some coefficients can become exactly zero.

Useful for feature selection.

Elastic Net: Combination of L1 + L2 penalties.
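The three penalties can be sketched as one NumPy function. The helper name `penalized_loss`, the `l1_ratio` mixing parameter and the numbers below are assumptions for illustration; libraries differ in how they scale the penalty (scikit-learn, for example, uses C = 1/λ), so treat this as the textbook form rather than any library's exact objective.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def penalized_loss(y, p, beta, lam, l1_ratio):
    """Log-loss plus an elastic-net style penalty (hypothetical helper).

    l1_ratio = 0 gives a pure L2 (Ridge) penalty,
    l1_ratio = 1 gives a pure L1 (Lasso) penalty.
    beta is assumed to exclude the intercept.
    """
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return log_loss(y, p) + lam * (l1_ratio * l1 + (1 - l1_ratio) * l2)

# Made-up labels, predicted probabilities and coefficients
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
beta = np.array([2.0, -0.5, 0.0])

base = log_loss(y, p)
ridge = penalized_loss(y, p, beta, lam=0.1, l1_ratio=0.0)
lasso = penalized_loss(y, p, beta, lam=0.1, l1_ratio=1.0)

print(f"Unpenalized: {base:.4f}, with L2: {ridge:.4f}, with L1: {lasso:.4f}")
```

The large coefficient 2.0 contributes 4.0 to the L2 sum but only 2.0 to the L1 sum, which shows why L2 punishes large weights more heavily.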

2. Why Regularization is Needed in Logistic Regression

Prevent Overfitting:

Without regularization, the model might assign very large weights to some features to perfectly fit the training data.

Large weights → poor generalization on unseen data.

Handle Multicollinearity:

If features are highly correlated, coefficients can become unstable.

Regularization reduces variance and stabilizes the solution.

Feature Selection (L1):

Automatically removes irrelevant features by setting their coefficients to zero.

Better Generalization:

Produces a simpler, more robust model that works well on test/unseen data.

3. Intuition

Think of logistic regression trying to “stretch” a decision boundary to fit data.

Without regularization → it might overstretch, creating a complex boundary.

With regularization → it restrains the coefficients, leading to a smoother, more generalizable decision boundary.

4. Example in scikit-learn

In [1]:
from sklearn.linear_model import LogisticRegression

# L2 regularization (default)
model = LogisticRegression(penalty='l2', C=1.0)  # C = 1/λ

# L1 regularization
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
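To see the effect of the penalty, the sketch below fits the same kind of models on scikit-learn's breast cancer dataset and inspects the coefficients. The dataset choice and the values C=0.01 and C=100 are assumptions for illustration. Features are standardized on the full dataset only because the goal is to inspect coefficients, not to measure test performance.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # same scale for all features so the penalty treats them alike

# Smaller C means stronger regularization (C = 1/λ)
strong_l2 = LogisticRegression(penalty='l2', C=0.01, max_iter=5000).fit(X, y)
weak_l2 = LogisticRegression(penalty='l2', C=100.0, max_iter=5000).fit(X, y)

# Strong L1 penalty; the liblinear solver supports L1
strong_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.01).fit(X, y)

norm_strong = np.linalg.norm(strong_l2.coef_)
norm_weak = np.linalg.norm(weak_l2.coef_)
n_zero = int(np.sum(strong_l1.coef_ == 0))

print(f"L2 coefficient norm, C=0.01: {norm_strong:.3f}")
print(f"L2 coefficient norm, C=100:  {norm_weak:.3f}")
print(f"L1 coefficients set exactly to zero (C=0.01): {n_zero} of {strong_l1.coef_.size}")
```

Stronger L2 regularization shrinks the coefficient vector, while the L1 model drives some coefficients to exactly zero.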


**Question 4:** What are some common evaluation metrics for classification models, and why are they important?

Answer:
1. Common Evaluation Metrics for Classification
a) Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

What it measures: Overall correctness of the model.

Use case: Works well when classes are balanced.

Limitation: Misleading for imbalanced datasets.

b) Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

What it measures: Of all the positive predictions, how many were actually positive.

Importance: High precision → few false positives.

Use case: Fraud detection, spam detection (where false positives are costly).

c) Recall (Sensitivity / True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

What it measures: Of all actual positives, how many did the model correctly identify.

Importance: High recall → few false negatives.

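Use case: Medical diagnosis, loan default prediction (missing a positive can be costly).

d) F1-Score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
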
What it measures: Harmonic mean of precision and recall.

Importance: Balances false positives and false negatives.

Use case: Imbalanced datasets where both false positives and false negatives matter.

e) Confusion Matrix

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

What it measures: Gives a detailed breakdown of predictions.

Importance: Helps understand model behavior beyond a single metric.

f) ROC-AUC (Receiver Operating Characteristic – Area Under Curve)

ROC Curve: Plots True Positive Rate (Recall) vs False Positive Rate at different thresholds.

AUC: Measures overall separability between classes (0.5 = random, 1.0 = perfect).

Use case: Comparing probability-based models across all thresholds. On heavily imbalanced data it can look optimistic, so pair it with PR-AUC.

g) PR-AUC (Precision-Recall AUC)

Especially useful for highly imbalanced datasets.

Measures the trade-off between precision and recall across thresholds.
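Both threshold-free metrics can be computed from predicted probabilities. The sketch below uses synthetic data with roughly 5% positives (an assumption for illustration) and uses `average_precision_score` as a common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 95% negatives, 5% positives (assumption)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both metrics need scores or probabilities, not hard class labels
scores = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)
positive_rate = y_test.mean()  # a random ranking scores about this much on PR-AUC

print(f"ROC-AUC: {roc_auc:.3f} (random baseline 0.5)")
print(f"PR-AUC:  {pr_auc:.3f} (random baseline about {positive_rate:.3f})")
```

The two baselines differ: a random model scores 0.5 on ROC-AUC regardless of class balance, but only about the positive rate on PR-AUC, so PR-AUC values should be read against that lower baseline.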

2. Why These Metrics Are Important

Different metrics highlight different errors:

Accuracy alone can be misleading if one class dominates.

Precision and recall focus on false positives vs false negatives.

Decision-making context matters:

Banking → minimize false negatives (loan defaults missed) → prioritize recall.

Spam detection → minimize false positives → prioritize precision.

Helps tune and compare models:

Metrics allow you to select the best model for your business goal.

✅ In short:

Accuracy: Overall correctness

Precision: Correct positive predictions

Recall: Captured actual positives

F1-score: Balance of precision & recall

Confusion Matrix: Detailed breakdown

ROC-AUC / PR-AUC: Performance across thresholds
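The count-based metrics summarized above can be derived by hand from the confusion matrix and checked against scikit-learn. The labels below are made up for illustration. Note that scikit-learn's `confusion_matrix` puts the negative class first, so its layout is [[TN, FP], [FN, TP]] rather than the positive-first table shown earlier.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For labels 0 and 1, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")

# The hand-computed values agree with scikit-learn's helpers
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)),
      np.isclose(precision, precision_score(y_true, y_pred)),
      np.isclose(recall, recall_score(y_true, y_pred)),
      np.isclose(f1, f1_score(y_true, y_pred)))
```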


**Question 5:** Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)


Answer:


In [3]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Logistic Regression model
model = LogisticRegression(max_iter=10000)  # Increased max_iter to ensure convergence
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Test Accuracy: {accuracy:.2f}")


Logistic Regression Test Accuracy: 0.96


**Question 6:** Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)
Answer:

In [4]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()

# Create a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with L2 regularization (Ridge)
# L2 is the default regularization in LogisticRegression
# (multi_class is left at its default; the parameter is deprecated since scikit-learn 1.5)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print model coefficients and accuracy
print("Model Coefficients:")
for i, class_label in enumerate(model.classes_):
    print(f"Class {class_label}: {model.coef_[i]}")

print(f"\nIntercepts: {model.intercept_}")
print(f"\nTest Accuracy: {accuracy:.2f}")


Model Coefficients:
Class 0: [-0.39345607  0.96251768 -2.37512436 -0.99874594]
Class 1: [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
Class 2: [-0.11497673 -0.70769055  2.58813565  1.7744936 ]

Intercepts: [  9.00884295   1.86902164 -10.87786459]

Test Accuracy: 1.00




**Question 7:** Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)
Answer:


In [5]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with One-vs-Rest (OvR)
model = LogisticRegression(multi_class='ovr', max_iter=200)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
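The `multi_class` parameter is deprecated in scikit-learn 1.5 and later, so the cell above may emit a warning on newer versions. A sketch of the same One-vs-Rest idea using the `OneVsRestClassifier` wrapper is shown below, with the same split settings as above. The numbers it prints may differ slightly from the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# One binary logistic regression per class, each trained as "this class vs the rest"
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The fitted wrapper exposes the three underlying binary models through `ovr_model.estimators_`.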





**Question 8:** Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)
Answer:


In [8]:
# Import libraries
import warnings
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Suppress warnings
warnings.filterwarnings("ignore")

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid (only penalties supported by solver)
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']  # Both work with liblinear
}

# Logistic Regression model
log_reg = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=500)

# GridSearchCV
grid = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

# Print best parameters and validation accuracy
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Evaluate on test data
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583333333333334
Test Accuracy: 1.0


**Question 9:** Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)
Answer:


In [9]:
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Logistic Regression WITHOUT scaling
model_no_scale = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

# 2. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression WITH scaling
model_scale = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)
model_scale.fit(X_train_scaled, y_train)
y_pred_scale = model_scale.predict(X_test_scaled)
accuracy_scale = accuracy_score(y_test, y_pred_scale)

# Compare results
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")
print(f"Accuracy with scaling:    {accuracy_scale:.4f}")


Accuracy without scaling: 1.0000
Accuracy with scaling:    0.9667
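On a 30-sample test set, the gap between 1.0000 and 0.9667 is a single misclassified sample, so one split is weak evidence either way. A steadier comparison, sketched below, uses 5-fold cross-validation and wraps the scaler in a pipeline so it is refitted on each training fold. The default solver and `max_iter=1000` are assumptions that differ from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

unscaled = LogisticRegression(max_iter=1000)
# The pipeline fits the scaler on each training fold only, avoiding leakage
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring='accuracy')
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring='accuracy')

print(f"CV accuracy without scaling: {scores_unscaled.mean():.4f} (+/- {scores_unscaled.std():.4f})")
print(f"CV accuracy with scaling:    {scores_scaled.mean():.4f} (+/- {scores_scaled.std():.4f})")
```

The Iris features already share similar units and ranges, so a large difference is not expected here; scaling tends to matter more when features sit on very different scales.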


**Question 10:** Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer:

1. Problem Understanding

Goal: Predict which customers are likely to respond to a marketing campaign.

Business Value: Focus marketing budget on customers with a high predicted probability of response → higher ROI.

Challenge: Only 5% positive class (severe imbalance).

2. Data Preprocessing
🔹 Data Cleaning

Handle missing values:

Numerical: Fill with median or use KNN imputer.

Categorical: Fill with mode or "Unknown".

Remove duplicates, and remove outliers only if they are data errors.

🔹 Feature Engineering

Create features like:

Customer purchase history (frequency, recency, monetary value).

Demographics (age, location).

Engagement metrics (email clicks, website visits).

Convert categorical variables using One-Hot Encoding or Target Encoding.

3. Train-Test Split

Use stratified split to maintain the 5% response ratio in train and test sets.


In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)


4. Feature Scaling

Regularized Logistic Regression is sensitive to feature magnitude, because the penalty treats all coefficients alike → apply StandardScaler or MinMaxScaler, fitted on the training set only.

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


5. Handling Imbalanced Data

Option 1: Class Weights (Preferred for LR)

In [13]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')


This automatically adjusts weights inversely proportional to class frequency.

Option 2: Oversampling / SMOTE

In [14]:
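# Requires the imbalanced-learn package
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
# Resample the training set only; the test set keeps the real class ratio
X_train, y_train = smote.fit_resample(X_train, y_train)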


Option 3: Undersampling for large datasets.

6. Hyperparameter Tuning

Use GridSearchCV or RandomizedSearchCV to tune:

C: Regularization strength.

penalty: 'l1' (sparse) or 'l2' (default).

class_weight: 'balanced' vs None.

In [15]:
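from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'class_weight': [None, 'balanced']}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, scoring='average_precision', cv=5)  # liblinear supports both penalties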


7. Model Evaluation

Accuracy is misleading (95% accuracy if you predict “no” for everyone).
Instead, use:

Precision, Recall, F1-Score (focus on Recall if catching responders is key).

ROC-AUC and PR-AUC (Precision-Recall curve is more informative for imbalanced data).

Confusion Matrix for business insight.
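The accuracy trap is easy to reproduce. The sketch below builds 1,000 made-up customers with 5% responders and scores a "model" that predicts "no" for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1,000 hypothetical customers: 50 responders (1) and 950 non-responders (0)
y_true = np.array([1] * 50 + [0] * 950)

# A useless baseline that predicts "no response" for everyone
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)
precision = precision_score(y_true, y_pred, zero_division=0)  # no positive predictions, reported as 0

print(f"Accuracy:  {accuracy:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

Accuracy comes out at 0.95 while recall is 0.00: the baseline never finds a single responder, which is exactly the failure the other metrics expose.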

8. Business Interpretation

Use predicted probabilities → rank customers by response likelihood.

Send campaigns only to top X% predicted customers to save cost.

Monitor model performance over time (data drift).
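The steps above can be tied together in one sketch. Everything about the data is an assumption: `make_classification` stands in for real campaign data, the grid values are illustrative, and "top 10%" is an arbitrary targeting cut-off. Putting the scaler inside a pipeline means it is refitted on each cross-validation training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (assumption)
X, y = make_classification(n_samples=4000, n_features=12, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratified split keeps the response rate similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune regularization strength and class weighting
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

# Average precision (PR-AUC) as the selection metric for the rare positive class
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="average_precision")
grid.fit(X_train, y_train)

proba = grid.predict_proba(X_test)[:, 1]
print("Best parameters:", grid.best_params_)
print(f"Test PR-AUC: {average_precision_score(y_test, proba):.3f}")
print(classification_report(y_test, grid.predict(X_test), zero_division=0))

# Business use: rank customers by predicted probability and target the top 10%
top_k = len(proba) // 10
top_idx = np.argsort(proba)[::-1][:top_k]
response_rate_top = y_test[top_idx].mean()
base_rate = y_test.mean()
print(f"Response rate in top 10%: {response_rate_top:.3f} vs overall: {base_rate:.3f}")
```

Comparing the response rate among the top-ranked customers with the overall rate gives a direct estimate of how much the model improves on untargeted campaigns.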