# Logistic Regression Assignment

# Question 1
**What is Logistic Regression, and how does it differ from Linear Regression?**

**Answer (concise):**
Logistic Regression is a classification algorithm used to predict categorical outcomes (most commonly binary).  
- It models the probability that an input belongs to a particular class using the logistic (sigmoid) function, producing outputs in the open interval (0,1).  
- Decision rule: predict class 1 if probability ≥ 0.5 (or another chosen threshold), else class 0.

**Difference from Linear Regression:**
- Linear Regression predicts continuous values (real numbers); Logistic Regression predicts class probabilities.
- Linear output is unbounded; logistic output is bounded between 0 and 1 via the sigmoid.
- Loss: Linear Regression typically uses MSE; Logistic Regression uses log-loss / cross-entropy.
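
To make the difference in loss functions concrete, the sketch below compares squared error with log-loss for a single confident but wrong prediction. The helper names `squared_error` and `log_loss_single` are illustrative only and do not come from any library.

```python
import math


def squared_error(y_true, y_hat):
    """Squared error for one example (the per-sample term of MSE)."""
    return (y_true - y_hat) ** 2


def log_loss_single(y_true, p, eps=1e-15):
    """Binary cross-entropy for one example; p is the predicted P(y = 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


# A confident but wrong prediction: the true class is 1, predicted probability 0.01.
y_true, p = 1, 0.01
se = squared_error(y_true, p)
ll = log_loss_single(y_true, p)
print(f"squared error: {se:.4f}")
print(f"log-loss:      {ll:.4f}")
```

For a probability output the squared error can never exceed 1, whereas log-loss grows without bound as the predicted probability of the true class approaches 0, so confident mistakes are penalized much more heavily.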


# Question 2
**Explain the role of the Sigmoid function in Logistic Regression.**

**Answer (concise):**
The sigmoid (logistic) function maps real-valued model scores (z = w·x + b) to the range (0,1) as:
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
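This value is interpreted as the probability of the positive class. Because the output is a valid probability, the model can be trained by minimizing cross-entropy (log-loss). Note that σ(z) ≥ 0.5 exactly when z ≥ 0, so the default 0.5 threshold corresponds to the linear decision boundary w·x + b = 0.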
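
A minimal sketch of the mapping from score to probability to class label, using only the standard library. The weights `w`, bias `b`, and the helper names are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """P(y = 1 | x) for weight vector w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


def predict(x, w, b, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical weights, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
for x in ([0.0, 0.5], [1.0, 0.0], [-2.0, 1.0]):
    print(x, round(predict_proba(x, w, b), 4), predict(x, w, b))
```

The first input has score z = 0, so its probability is exactly 0.5 and it sits on the decision boundary.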


# Question 3
**What is Regularization in Logistic Regression and why is it needed?**

**Answer (concise):**
Regularization penalizes large weights to reduce overfitting and improve generalization.
- **L2 (Ridge)**: adds \(\lambda ||w||_2^2\) penalty; shrinks weights continuously.
- **L1 (Lasso)**: adds \(\lambda ||w||_1\) penalty; can set some weights exactly to zero (feature selection).

Regularization helps when there are many features, when features are collinear, or when the model overfits the training data.
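
The sparsity effect of L1 can be checked empirically. The sketch below fits both penalties on the standardized breast cancer data and counts coefficients that are exactly zero. The value `C=0.1` and the `liblinear` solver are assumptions made for this illustration, not settings taken from the assignment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_s = StandardScaler().fit_transform(X)  # put all features on a common scale

# liblinear supports both penalties; C=0.1 is an arbitrary, fairly strong setting.
l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_s, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_s, y)

n_zero_l2 = int(np.sum(l2.coef_ == 0))
n_zero_l1 = int(np.sum(l1.coef_ == 0))
print("zero coefficients with L2:", n_zero_l2)
print("zero coefficients with L1:", n_zero_l1)
```

L2 shrinks weights toward zero without reaching it, while L1 typically removes a share of the features outright; the exact count depends on `C` and the data.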


# Question 4
**What are some common evaluation metrics for classification models, and why are they important?**

**Answer (concise):**
- **Accuracy**: fraction of correct predictions. Informative when classes are balanced; misleading under heavy imbalance.
- **Precision**: TP / (TP + FP). Useful when false positives are costly.
- **Recall (Sensitivity)**: TP / (TP + FN). Useful when false negatives are costly.
- **F1-score**: harmonic mean of precision & recall; good single-number metric for imbalance.
- **ROC-AUC**: area under ROC curve (probability that a random positive ranks above a random negative).
- **Confusion Matrix**: counts of TP, TN, FP, and FN; helps analyze which types of errors occur.

Which metric to prioritize depends on the class balance and on the business cost of each error type.
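
The formulas above can be sketched in plain Python for the binary case. The function names and the toy labels are hypothetical; in practice `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, with 0.0 when a ratio is undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy imbalanced example: 2 positives out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.8 although only half of the positives are found and only half of the positive predictions are correct, which is why accuracy alone can mislead on imbalanced data.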


In [1]:

# Question 5
# Load a dataset, split, train LogisticRegression and print accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=10000, solver='lbfgs')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9649122807017544

Classification report:
               precision    recall  f1-score   support

           0       0.97      0.93      0.95        42
           1       0.96      0.99      0.97        72

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
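
The classification report can be complemented with the raw error counts. The following is a minimal sketch that repeats the same split and model and prints the confusion matrix. In this dataset label 0 is malignant and label 1 is benign, so the "positive" class in scikit-learn's convention is benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=10000, solver="lbfgs").fit(X_train, y_train)

# Rows are true classes, columns are predicted classes (label order 0, 1).
cm = confusion_matrix(y_test, model.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```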



In [6]:

# Question 6
# Train Logistic Regression with L2 regularization (default) and print coefficients and accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
import warnings
warnings.filterwarnings("ignore")

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# L2 regularization is the default (penalty='l2'). C is the inverse of the regularization strength: smaller C means a stronger penalty.
model_l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=10000, solver='lbfgs')
model_l2.fit(X_train, y_train)
y_pred = model_l2.predict(X_test)

print("Accuracy (L2):", accuracy_score(y_test, y_pred))
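# Coefficients (one per feature, in log-odds units)
coef = model_l2.coef_.ravel()
# Rank features by absolute coefficient and keep the top 10.
# Caveat: the features are unscaled here, so magnitudes are not directly comparable across features.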
inds = np.argsort(np.abs(coef))[::-1][:10]
print("\nTop 10 features by absolute coefficient:")
for i in inds:
    print(f"Feature {i}: coef = {coef[i]:.4f}")


Accuracy (L2): 0.9649122807017544

Top 10 features by absolute coefficient:
Feature 26: coef = -1.3212
Feature 11: coef = 1.0810
Feature 0: coef = 0.8096
Feature 28: coef = -0.7855
Feature 25: coef = -0.7577
Feature 27: coef = -0.5526
Feature 6: coef = -0.4553
Feature 21: coef = -0.3742
Feature 24: coef = -0.3188
Feature 8: coef = -0.3054
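
Because the cell above fits on unscaled features, the ranking partly reflects the units of each feature. A possible variant, sketched below, standardizes the features inside a pipeline and prints feature names instead of indices. The pipeline layout is an assumption for illustration and is not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

# The scaler is fit on the training data only, then applied before the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=10000))
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
coef = pipe.named_steps["logisticregression"].coef_.ravel()
top = np.argsort(np.abs(coef))[::-1][:10]
for i in top:
    print(f"{data.feature_names[i]:<25s} coef = {coef[i]:+.4f}")
```

With standardized inputs each coefficient is the change in log-odds per one standard deviation of the feature, which makes the magnitudes easier to compare.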


In [7]:

# Question 7
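# Multiclass logistic regression using multi_class='ovr' and print classification report.
# Note: the multi_class parameter is deprecated since scikit-learn 1.5, so newer versions emit a FutureWarning here.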
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model_ovr = LogisticRegression(multi_class='ovr', max_iter=10000, solver='lbfgs')
model_ovr.fit(X_train, y_train)
y_pred = model_ovr.predict(X_test)

print("Classification report (ovr):\n")
print(classification_report(y_test, y_pred))


Classification report (ovr):

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.80      0.89        10
           2       0.83      1.00      0.91        10

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.93        30
weighted avg       0.94      0.93      0.93        30
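
On scikit-learn versions where `multi_class` is deprecated, one-vs-rest behaviour can be obtained by wrapping the estimator in `OneVsRestClassifier`. The sketch below reuses the same split. It is one possible substitute and is not guaranteed to reproduce the numbers above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One binary logistic regression is trained per class (class k vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=10000))
ovr.fit(X_train, y_train)

acc = accuracy_score(y_test, ovr.predict(X_test))
print("binary estimators:", len(ovr.estimators_))
print("accuracy:", acc)
```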



In [4]:

# Question 8
# GridSearchCV to tune C for Logistic Regression (penalty is fixed to l2 because lbfgs does not support l1)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],  # using l2 with lbfgs; l1 requires solver='saga' or 'liblinear'
}

clf = GridSearchCV(LogisticRegression(max_iter=10000, solver='lbfgs'), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
clf.fit(X_train, y_train)

print("Best params:", clf.best_params_)
best_model = clf.best_estimator_
y_pred = best_model.predict(X_val)
print("Validation accuracy with best params:", accuracy_score(y_val, y_pred))


Best params: {'C': 10, 'penalty': 'l2'}
Validation accuracy with best params: 0.9649122807017544
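
To search over the penalty as well as `C`, the solver has to support both `l1` and `l2`. A minimal sketch using `liblinear` inside a scaling pipeline follows. The step names `scale` and `clf` and the grid values are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # refit on the training part of every CV fold
    ("clf", LogisticRegression(solver="liblinear", max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_acc = search.score(X_test, y_test)
print("Best params:", search.best_params_)
print("Held-out accuracy:", test_acc)
```

Since cross-validation already happens inside `GridSearchCV`, the 20% hold-out acts as a final test set rather than a validation set.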


In [5]:

# Question 9
# Compare accuracy with and without StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Without scaling
model_raw = LogisticRegression(max_iter=10000, solver='lbfgs')
model_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, model_raw.predict(X_test))

# With scaling
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=10000, solver='lbfgs')
model_scaled.fit(X_train_s, y_train)
acc_scaled = accuracy_score(y_test, model_scaled.predict(X_test_s))

print("Accuracy without scaling:", acc_raw)
print("Accuracy with scaling:", acc_scaled)


Accuracy without scaling: 0.9649122807017544
Accuracy with scaling: 0.9824561403508771
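
The gap above corresponds to two test samples out of 114, so a single split gives only a rough comparison. A sketch of a more stable check uses stratified cross-validation, with the scaler placed inside a pipeline so it is refit on each training fold. The fold count and seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

raw = LogisticRegression(max_iter=10000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# Both models are scored on the same folds, so the comparison is paired.
scores_raw = cross_val_score(raw, X, y, cv=cv, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=cv, scoring="accuracy")

print(f"without scaling: {scores_raw.mean():.4f} +/- {scores_raw.std():.4f}")
print(f"with scaling:    {scores_scaled.mean():.4f} +/- {scores_scaled.std():.4f}")
```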


# Question 10
**Real-world approach for an imbalanced e-commerce dataset (5% responders)**

**Answer (detailed steps):**
1. **Data understanding & cleaning**: validate labels, handle missing values, convert categorical features (one-hot / target encoding).
2. **Feature engineering**: create interaction features and customer recency/frequency/monetary (RFM) features; embed high-cardinality categorical features if needed.
3. **Train-test split**: use a stratified split so the minority-class proportion is preserved (e.g., `train_test_split(..., stratify=y)`).
4. **Scaling**: scale numerical features (StandardScaler), since the regularization penalty makes logistic regression sensitive to feature scales. Fit the scaler on the training data only.
5. **Handling imbalance**:
   - Use class weights in LogisticRegression (`class_weight='balanced'` or custom weights).
   - Or resample: oversample the minority (SMOTE), undersample the majority, or combine both (SMOTE+Tomek). Apply resampling to the training folds only, never to validation or test data.
6. **Regularization & hyperparameter tuning**:
   - Tune `C` (inverse of regularization strength) and the penalty (`l1`, `l2`, `elasticnet`) using `GridSearchCV` or `RandomizedSearchCV`.
   - Use a `solver` compatible with the chosen penalty (e.g., `'liblinear'` or `'saga'` for `l1`; only `'saga'` supports `elasticnet`).
7. **Evaluation strategy**:
   - Prefer metrics robust to imbalance: ROC-AUC, PR-AUC (precision-recall), F1 for chosen operating point.
   - Use stratified cross-validation folds.
   - Calibrate predicted probabilities if you need reliable probability estimates (Platt scaling / isotonic), especially after class weighting or resampling, which shift the predicted probabilities.
8. **Business decisioning**:
   - Choose threshold based on business metric (e.g., maximize expected profit or recall at a given precision).
   - Run champion-challenger tests and A/B tests before deployment.
9. **Monitoring**:
   - Monitor model performance and data drift; keep feedback loop for retraining.

This workflow balances statistical rigor with business needs for low-response-rate campaigns.
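
The core of the steps above (stratified split, scaling, class weights, imbalance-aware metrics, threshold choice) can be sketched end to end. The data here is synthetic, generated with `make_classification` as a stand-in for real campaign data. Choosing the threshold by maximum F1 is only one possible rule; a real project would substitute its own business metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=6,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", C=1.0, max_iter=10000),
)

# Out-of-fold probabilities on the training set, used only to pick a threshold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]

# precision and recall have one more entry than thresholds, hence the [:-1].
precision, recall, thresholds = precision_recall_curve(y_train, oof)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
threshold = thresholds[np.argmax(f1)]

# Refit on the full training set and evaluate once on the untouched test set.
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= threshold).astype(int)

roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
print("responder rate:", y.mean())
print("ROC-AUC:", roc_auc)
print("PR-AUC (average precision):", pr_auc)
print("chosen threshold:", threshold)
print("predicted responders in test set:", int(y_pred.sum()))
```

The threshold is selected on out-of-fold training predictions so that the test set stays untouched until the final evaluation.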


## End of assignment