**Logistic Regression**

# Question 1
What is Logistic Regression, and how does it differ from Linear Regression?

**Answer:**  
Logistic Regression is a classification algorithm used to predict a categorical outcome (usually binary: 0/1, yes/no). It models the probability that an input belongs to the positive class using the logistic (sigmoid) function and outputs values between 0 and 1.

**Difference from Linear Regression:**  
- Linear Regression predicts a continuous value (e.g., price). Logistic Regression predicts probabilities for discrete classes.  
- Linear regression uses a linear function and can produce any real number; logistic regression applies a sigmoid to the linear combination of features so outputs are constrained between 0 and 1.  
- For classification, logistic regression assigns a class by thresholding the predicted probability (e.g., p ≥ 0.5 → class 1); the resulting decision boundary is linear in the features.
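
To make the contrast concrete, here is a minimal sketch that fits both models on the same binary labels. The toy data is made up for illustration; only the scikit-learn classes are real. It shows that linear regression outputs leave the [0, 1] range on extreme inputs, while logistic regression probabilities do not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy 1-D data (made up for illustration): larger x tends to mean class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Score points well outside the training range.
X_new = np.array([[-10.0], [3.5], [20.0]])
lin_out = lin.predict(X_new)                # unbounded real values
log_prob = log.predict_proba(X_new)[:, 1]   # P(class 1), bounded by 0 and 1
log_cls = log.predict(X_new)                # probability thresholded at 0.5

print("Linear outputs:     ", np.round(lin_out, 3))
print("Logistic P(class 1):", np.round(log_prob, 3))
print("Logistic class:     ", log_cls)
```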

# Question 2
Explain the role of the Sigmoid function in Logistic Regression.

**Answer:**  
The sigmoid (logistic) function maps any real-valued input to the range (0, 1). In logistic regression we compute a linear score z = w·x + b and then apply sigmoid: σ(z) = 1 / (1 + e^(−z)). The result is interpreted as the probability of the positive class. Sigmoid makes the model output probabilities and allows thresholding (e.g., p ≥ 0.5 → class 1).
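
The computation above can be sketched in a few lines of NumPy. The weights, bias, and inputs are hypothetical values chosen only to illustrate the score → probability → label steps.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and inputs (illustration only).
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [0.0, 1.0],
              [-1.0, 3.0]])

z = X @ w + b                     # linear scores: 1.2, -0.4, -2.2
p = sigmoid(z)                    # probabilities of the positive class
labels = (p >= 0.5).astype(int)   # threshold at 0.5

print("z:", np.round(z, 2))
print("p:", np.round(p, 4))
print("labels:", labels)
```

Because σ(0) = 0.5, thresholding the probability at 0.5 is equivalent to checking whether the linear score z is non-negative.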

# Question 3
What is Regularization in Logistic Regression and why is it needed?

**Answer:**  
Regularization adds a penalty to the model’s loss to prevent overfitting by discouraging very large weights. Two common types:
- **L2 (Ridge)**: penalty = λ * sum(w^2) → shrinks weights smoothly.
- **L1 (Lasso)**: penalty = λ * sum(|w|) → can shrink some weights to zero (feature selection).

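Why needed: Without regularization, a model can fit noise in the training data (overfit), especially with many or correlated features. Regularization improves generalization to unseen data. In scikit-learn the strength is set through `C`, the inverse of λ, so a smaller `C` means stronger regularization.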
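
As a rough illustration of the L1 vs L2 difference, the sketch below fits both penalties on the breast cancer dataset and counts coefficients that are exactly zero. The choice of `C=0.1`, the `liblinear` solver, and the helper name `fit_and_count_zeros` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def fit_and_count_zeros(penalty, C=0.1):
    """Fit a scaled logistic regression; return (#zero coefs, sum of |coefs|)."""
    # liblinear supports both 'l1' and 'l2'; C is the inverse of lambda.
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, C=C, solver="liblinear"),
    )
    clf.fit(X, y)
    coef = clf[-1].coef_[0]
    return int(np.sum(coef == 0)), float(np.sum(np.abs(coef)))

zeros_l1, size_l1 = fit_and_count_zeros("l1")
zeros_l2, size_l2 = fit_and_count_zeros("l2")

print(f"L1: {zeros_l1} of {X.shape[1]} coefficients are exactly zero")
print(f"L2: {zeros_l2} of {X.shape[1]} coefficients are exactly zero")
```

With L1, several coefficients are driven to exactly zero, which acts as feature selection; with L2, coefficients shrink but generally stay non-zero.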

# Question 4
What are some common evaluation metrics for classification models, and why are they important?

**Answer:**  
- **Accuracy:** (TP + TN) / Total. Reliable only when classes are roughly balanced.  
- **Precision:** TP / (TP + FP). Of the predicted positives, how many are correct. Important when false positives are costly.  
- **Recall (Sensitivity):** TP / (TP + FN). Of the actual positives, how many were found. Important when missing positives is costly.  
- **F1-score:** Harmonic mean of precision and recall, 2PR / (P + R); useful for imbalanced data where both error types matter.  
- **ROC-AUC:** Probability that a randomly chosen positive is scored above a randomly chosen negative, summarized across all thresholds. Good for judging ranking quality.  
- **PR-AUC (Precision-Recall AUC):** Usually more informative than ROC-AUC when positives are rare, because it ignores true negatives.  

Choosing metrics depends on business goals (e.g., in fraud detection, prioritize recall when a missed fraud is expensive and precision when investigating false alarms is expensive).
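
The formulas above can be checked on a small hand-made example. The labels and scores below are invented for illustration (4 positives, 6 negatives); the metric functions are the standard ones from `sklearn.metrics`.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up labels and model scores for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]   # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")   # TP=3 FP=1 FN=1 TN=5

accuracy = accuracy_score(y_true, y_pred)     # (3 + 5) / 10 = 0.8
precision = precision_score(y_true, y_pred)   # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)         # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                 # 2 * 0.75 * 0.75 / 1.5 = 0.75

# Threshold-free metrics are computed from the scores, not the hard labels.
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

print(accuracy, precision, recall, f1)
print(round(roc_auc, 4), round(pr_auc, 4))
```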

In [1]:
# Question 5
# Load a dataset from sklearn, split into train/test, train Logistic Regression, print accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Logistic Regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict & evaluate
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", round(acc, 4))

# Short explanation:
# We used the breast cancer dataset (binary). The printed accuracy is measured on the held-out 20% test split,
# which the model never saw during training.

Accuracy: 0.9649


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [2]:
# Question 6
# Train Logistic Regression with L2 regularization and print coefficients & accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Using same train/test from Q5
model_l2 = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000, random_state=42)
model_l2.fit(X_train, y_train)

# Coefficients and accuracy
coefficients = model_l2.coef_[0]
intercept = model_l2.intercept_[0]
y_pred_l2 = model_l2.predict(X_test)
acc_l2 = accuracy_score(y_test, y_pred_l2)

print("Intercept:", round(intercept,4))
print("First 10 Coefficients:", np.round(coefficients[:10], 4))
print("Accuracy (L2):", round(acc_l2, 4))

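# Explanation:
# L2 is the default penalty in LogisticRegression. The sign of a coefficient shows the direction of that feature's effect;
# magnitudes are only comparable across features when the inputs are on the same scale (these features are not standardized).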

Intercept: 6.5806
First 10 Coefficients: [ 2.0396  0.0391 -0.1888  0.0074 -0.2456 -0.3093 -0.7021 -0.4657 -0.4492
 -0.0065]
Accuracy (L2): 0.9649



In [3]:
# Question 7
# Multiclass classification using multi_class='ovr' and print classification report (use Iris dataset).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

iris = load_iris()
X_iris = iris.data
y_iris = iris.target

Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris)

# Train with one-vs-rest
model_ovr = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000, random_state=42)
model_ovr.fit(Xtr, ytr)

y_pred_ovr = model_ovr.predict(Xte)
print(classification_report(yte, y_pred_ovr, target_names=iris.target_names))

# Explanation:
# One-vs-rest (OvR) trains one binary classifier per class (that class vs. all others) and predicts the class
# with the highest score. The classification report shows precision/recall/F1 per class.

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
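
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`. Assuming such a version, one-vs-rest can instead be requested explicitly with the `OneVsRestClassifier` wrapper. The sketch below is an alternative formulation on the same Iris split, not the code that produced the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Wrap a binary logistic regression; one copy is fitted per class.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear", max_iter=1000))
ovr.fit(X_tr, y_tr)

n_binary_models = len(ovr.estimators_)
test_acc = accuracy_score(y_te, ovr.predict(X_te))

print("Binary classifiers fitted:", n_binary_models)
print("Test accuracy:", round(test_acc, 4))
```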





In [4]:
# Question 8
# Use GridSearchCV to tune C and penalty (note: 'l1' requires solver that supports it like 'liblinear' or 'saga').

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# pipeline: scaling + logistic (scaler is often helpful)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000, random_state=42, solver='liblinear'))
])

param_grid = {
    'logreg__C': [0.01, 0.1, 1, 10],
    'logreg__penalty': ['l1', 'l2']
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best CV Score:", round(grid.best_score_, 4))
print("Test Accuracy with best model:", round(grid.score(X_test, y_test), 4))

# Explanation:
# GridSearchCV evaluates every C/penalty combination with 5-fold cross-validation on the training set,
# refits the best one on all training data, and grid.score then reports accuracy on the held-out test set.

Best Params: {'logreg__C': 0.1, 'logreg__penalty': 'l2'}
Best CV Score: 0.9802
Test Accuracy with best model: 0.9825
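
The best score alone hides how close the other combinations were. As a sketch, the fitted search object exposes every combination through `cv_results_`. The block below rebuilds the same grid so it runs on its own; the variable names are chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000)),
])
param_grid = {"logreg__C": [0.01, 0.1, 1, 10], "logreg__penalty": ["l1", "l2"]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy").fit(X_tr, y_tr)

# One row per parameter combination, sorted by mean CV accuracy (best first).
results = grid.cv_results_
rows = sorted(
    zip(results["mean_test_score"], results["std_test_score"], results["params"]),
    key=lambda r: r[0],
    reverse=True,
)
for mean, std, params in rows:
    print(f"{mean:.4f} +/- {std:.4f}  {params}")
```

If several combinations fall within one standard deviation of the best, preferring the more strongly regularized one (smaller `C`) is a common, conservative choice.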


In [5]:
# Question 9
# Standardize features before training and compare accuracy with and without scaling.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Without scaling
model_no_scale = LogisticRegression(max_iter=1000, random_state=42)
model_no_scale.fit(X_train, y_train)
acc_no_scale = accuracy_score(y_test, model_no_scale.predict(X_test))

# With scaling (pipeline)
pipe_scaled = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression(max_iter=1000, random_state=42))])
pipe_scaled.fit(X_train, y_train)
acc_scaled = accuracy_score(y_test, pipe_scaled.predict(X_test))

print("Accuracy without scaling:", round(acc_no_scale,4))
print("Accuracy with scaling:", round(acc_scaled,4))

# Explanation:
# Scaling often improves optimization and sometimes performance, especially when features have different magnitudes.

Accuracy without scaling: 0.9649
Accuracy with scaling: 0.9825
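
The optimization benefit of scaling can also be observed directly. The sketch below compares how many lbfgs iterations each fit needs, using the fitted model's `n_iter_` attribute; exact counts depend on the scikit-learn version, so none are stated here.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

with warnings.catch_warnings():
    # The unscaled fit may emit a ConvergenceWarning; silence it here.
    warnings.simplefilter("ignore")
    raw = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Fit the scaler on training data only, then train on the scaled features.
scaler = StandardScaler().fit(X_tr)
scaled = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

iters_raw = int(raw.n_iter_[0])
iters_scaled = int(scaled.n_iter_[0])

print("lbfgs iterations without scaling:", iters_raw)
print("lbfgs iterations with scaling:   ", iters_scaled)
```

The solver needs far fewer iterations on standardized features, which is why the convergence warning recommends scaling.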



# Question 10
Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% respond), describe the approach you’d take.

**Answer (step-by-step practical plan):**

1. **Understand data & baseline:**  
   - Explore class balance, missing values, feature types, simple correlations.  
   - Build a quick baseline model (logistic regression) to have a reference.

2. **Data cleaning & feature engineering:**  
   - Handle missing values (median for numeric, mode/‘missing’ for categorical).  
   - Create useful features (recency, frequency, monetary features, interaction terms).  
   - Encode categorical features (one-hot; target encoding for high-cardinality columns, fitted on training folds only to avoid leakage).

3. **Split data properly:**  
   - Use stratified train/test split so class distribution is preserved.

4. **Address imbalance:**  
   - Apply any **resampling to the training set only** (never to validation or test data):  
     - **SMOTE / ADASYN** (oversample the minority class), or  
     - **Random undersampling** of the majority class.  
   - Alternatively, use a model-level solution: **class_weight='balanced'** in LogisticRegression.  
   - Try several approaches and compare them with stratified cross-validation.

5. **Feature scaling:**  
   - Standardize numeric features (StandardScaler) when using regularized models, so the penalty treats all coefficients on a comparable scale.

6. **Modeling & hyperparameter tuning:**  
   - Use cross-validation (stratified) and tune C, penalty, solver.  
   - Use pipelines so preprocessing + resampling + model are in a reproducible flow.

7. **Use proper evaluation metrics:**  
   - Accuracy is misleading for 5% positive. Use **Precision, Recall, F1**, and **PR-AUC**.  
   - For business, choose metric based on cost: e.g., maximize recall if you want to reach almost all responders; maximize precision if contacting a false positive is costly.

8. **Calibration & thresholding:**  
   - Calibrate probabilities (CalibratedClassifierCV) if you need reliable probabilities, especially after resampling or class weighting, which distort the predicted probabilities.  
   - Choose decision threshold based on business trade-offs (e.g., choose threshold to limit cost).

9. **Model interpretation & monitoring:**  
   - Check feature coefficients, SHAP or LIME for explanations.  
   - Monitor performance on production data and retrain when distribution shifts.

10. **Deployment & business integration:**  
    - Provide predicted scores and recommended threshold.
    - Track metrics and lift (e.g., response rate of selected customers vs baseline).
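
A minimal sketch of steps 3 to 8 follows. It uses synthetic data from `make_classification` as a stand-in for real campaign data, and the 80% recall target is a hypothetical business requirement. It uses `class_weight='balanced'` rather than SMOTE so that only scikit-learn is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders.
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=6,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=42,
)

# Step 3: stratified split preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 4-6: scaling + class-weighted model inside one pipeline.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Step 7: stratified CV scored with PR-AUC (average precision).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_pr_auc = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="average_precision")

model.fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]
test_pr_auc = average_precision_score(y_te, scores)

# Step 8: highest threshold that still meets the (hypothetical) recall target.
target_recall = 0.80
precision, recall, thresholds = precision_recall_curve(y_te, scores)
meets_target = recall[:-1] >= target_recall   # thresholds is one entry shorter
threshold = thresholds[meets_target].max()

selected = scores >= threshold
achieved_recall = y_te[selected].sum() / y_te.sum()
lift = y_te[selected].mean() / y_te.mean()    # response rate vs. base rate

print("CV PR-AUC:", cv_pr_auc.mean().round(4))
print("Test PR-AUC:", round(test_pr_auc, 4))
print("Threshold:", round(float(threshold), 4))
print("Recall at threshold:", round(float(achieved_recall), 4))
print("Lift over contacting everyone:", round(float(lift), 2))
```

In a real project the threshold would be chosen on a validation set rather than the test set, and the recall target would be replaced by an explicit cost model for contacts and missed responders.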