# **Logistic Regression | Assignment**

**1. What is Logistic Regression, and how does it differ from Linear Regression?**

1.Logistic Regression:

- Logistic Regression is a statistical machine learning algorithm used for classification problems.

- It predicts the probability that a given input belongs to a certain class (e.g., yes/no, spam/not spam).

- Instead of an unbounded continuous value, it outputs a probability between 0 and 1, obtained by passing a linear combination of the inputs through the sigmoid (logistic) function.

- If the probability ≥ 0.5 → class = 1, otherwise class = 0.

Mathematical Representation:

- $$
P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}
$$

2.Linear Regression:

- Linear Regression is used for regression problems where the output variable is continuous (e.g., predicting house prices, salary, sales).

- It assumes a linear relationship between the dependent variable (Y) and independent variables (X).

Equation:

- $$
 Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
 $$

3.Key Differences:

| Aspect            | Linear Regression                            | Logistic Regression                                 |
| ----------------- | -------------------------------------------- | --------------------------------------------------- |
| **Purpose**       | Predicts **continuous values**               | Predicts **categorical values** (0/1, Yes/No)       |
| **Output**        | Any real number (-∞ to +∞)                   | Probability between **0 and 1**                     |
| **Function Used** | Straight line equation                       | Sigmoid (logistic) function                         |
| **Fitting Method** | Minimizes **Mean Squared Error (MSE)**      | Maximizes the likelihood (**Maximum Likelihood Estimation, MLE**) |
| **Use Cases**     | Predicting house prices, sales, stock prices | Spam detection, disease diagnosis, churn prediction |

4.Example:

- Linear Regression Example:
Predicting house price based on area.
Input → 2000 sq. ft → Output → ₹50 lakhs.

- Logistic Regression Example:
Predicting whether a patient has diabetes (Yes/No).
Input → age, BMI, sugar level → Output → Probability = 0.87 → Class = "Yes".
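The contrast in the table can be seen on a toy problem. The sketch below uses made-up data (one feature, eight points) and default scikit-learn settings; the numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data, made up for illustration: one feature and a binary label.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points, two of them far outside the training range.
X_new = np.array([[-10.0], [4.5], [20.0]])

linear_out = linear.predict(X_new)                   # unbounded real values
logistic_out = logistic.predict_proba(X_new)[:, 1]   # probabilities of class 1

print("Linear outputs:  ", np.round(linear_out, 3))
print("Logistic outputs:", np.round(logistic_out, 3))
```

The linear model extrapolates below 0 and above 1, while the logistic model's outputs stay inside the probability range.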

**2. Explain the role of the Sigmoid function in Logistic Regression.**

1.What is the Sigmoid Function?

- The Sigmoid function (also called the logistic function) is a mathematical function that maps any real-valued number into the range (0, 1).

- Formula:

- $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

- Where

- $$ z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n $$

2.Why Sigmoid in Logistic Regression?

- Logistic Regression predicts the probability that an observation belongs to a certain class.

- Since probabilities must always lie between 0 and 1, we cannot use a simple linear equation (which can give negative or >1 values).

- The Sigmoid function compresses any real number into the open interval (0, 1), making it well suited to probability estimation.

3.Role in Classification

- Output of Sigmoid = Probability of belonging to class 1.

- Decision rule:

- If 𝜎(𝑧)≥0.5 → Predict class = 1

- If 𝜎(𝑧)< 0.5 → Predict class = 0


4.Example

Suppose a Logistic Regression model gives:

- 𝑧 = 2.3

Applying sigmoid:

- $$ \sigma(2.3) = \frac{1}{1 + e^{-2.3}} \approx 0.91 $$

- Interpretation → The model predicts a 91% chance of being in Class 1.
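The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the helper name `sigmoid` and the overflow-safe branching are choices made here, not something the assignment prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)          # safe: z is negative, so exp(z) <= 1
    return exp_z / (1.0 + exp_z)

p = sigmoid(2.3)
print(f"sigma(2.3) = {p:.4f}")          # about 0.909, i.e. roughly 91%
print("Predicted class:", int(p >= 0.5))
```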

5.Key Points:

- Sigmoid ensures outputs are valid probabilities.

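- Acts as the inverse of the link function: the logit (log-odds) is the link, and the sigmoid maps the linear predictor z back to a probability.

- Induces the decision boundary: thresholding σ(z) at 0.5 is the same as thresholding z at 0, so the boundary is linear in the features.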

- Smooth and differentiable, which helps in optimization (Maximum Likelihood Estimation).
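Two standard identities make the last points concrete, written with the same σ and z as in the formula for this question:

$$ \sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr), \qquad \log\frac{\sigma(z)}{1 - \sigma(z)} = z $$

The first gives the gradient a simple closed form, which is what gradient-based likelihood maximization relies on. The second shows that z is the log-odds of class 1, so σ(z) ≥ 0.5 exactly when z ≥ 0.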


**3. What is Regularization in Logistic Regression and why is it needed?**

1.Definition of Regularization

- Regularization is a technique used in Logistic Regression (and other ML models) to prevent overfitting by adding a penalty term to the cost (loss) function.

- It discourages the model from fitting too closely to the training data and keeps the coefficients (β) small and stable.

2.Why is Regularization Needed?

- In Logistic Regression, if we have too many features, the model may assign very large weights to some of them.

- This leads to overfitting:

- Good performance on training data.

- Poor performance on unseen/test data.

- Regularization reduces model complexity and improves generalization.

3.Types of Regularization in Logistic Regression

- L1 Regularization (Lasso):

- Adds the absolute values of coefficients as a penalty.

- Cost function:

- $$ J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h_\beta(x_i)) + (1 - y_i) \log(1 - h_\beta(x_i)) \right] + \lambda \sum_{j=1}^{n} |\beta_j| $$

- Advantage: Performs feature selection (some coefficients become 0).

- L2 Regularization (Ridge):

- Adds the squared values of coefficients as a penalty.

- Cost function:

- $$ J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(h_\beta(x_i)) + (1 - y_i) \log(1 - h_\beta(x_i)) \right] + \lambda \sum_{j=1}^{n} \beta_j^2 $$

- Advantage: Prevents coefficients from becoming too large.

- Elastic Net:

- Combination of L1 and L2.
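The two cost functions can be written as one small NumPy function. This is a sketch under stated assumptions: `regularized_log_loss` is a hypothetical helper, `beta[0]` is treated as an unpenalized intercept (matching the penalty sums over j = 1..n), and the data and coefficients are made up.

```python
import numpy as np

def regularized_log_loss(beta, X, y, lam=0.0, penalty="l2"):
    """Mean log loss plus an L1 or L2 penalty on the non-intercept coefficients."""
    z = beta[0] + X @ beta[1:]                  # linear predictor
    h = 1.0 / (1.0 + np.exp(-z))                # h_beta(x) = sigmoid(z)
    h = np.clip(h, 1e-12, 1 - 1e-12)            # keep log() finite
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    if penalty == "l1":
        return data_term + lam * np.sum(np.abs(beta[1:]))
    if penalty == "l2":
        return data_term + lam * np.sum(beta[1:] ** 2)
    raise ValueError("penalty must be 'l1' or 'l2'")

# Made-up inputs for illustration.
X = np.array([[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]])
y = np.array([1, 0, 1])
beta = np.array([0.1, 2.0, -1.0])

base = regularized_log_loss(beta, X, y, lam=0.0)
l1 = regularized_log_loss(beta, X, y, lam=0.1, penalty="l1")
l2 = regularized_log_loss(beta, X, y, lam=0.1, penalty="l2")
print(f"no penalty: {base:.4f}  L1: {l1:.4f}  L2: {l2:.4f}")
```

With λ = 0.1 and coefficients (2, -1), the L1 penalty adds 0.1 × (2 + 1) = 0.3 and the L2 penalty adds 0.1 × (4 + 1) = 0.5 to the unpenalized loss.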

4.Key Benefits

- Controls model complexity.

- Prevents overfitting.

- Improves prediction accuracy on unseen data.

- Can perform feature selection (L1).

5.Example

- Without regularization: the model may fit the training data almost perfectly yet perform poorly on test data (overfitting).

- With L2 regularization: Coefficients shrink → smoother decision boundary → better test accuracy.
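The shrinkage effect can be observed directly. The sketch below assumes the Breast Cancer dataset from sklearn and arbitrarily chosen C values; it standardizes the full dataset only to inspect coefficients, not to estimate accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # scaling makes coefficients comparable

# In sklearn, C is the inverse of regularization strength: smaller C = stronger penalty.
weak_l2 = LogisticRegression(C=100.0, penalty="l2", solver="liblinear").fit(X_scaled, y)
strong_l2 = LogisticRegression(C=0.01, penalty="l2", solver="liblinear").fit(X_scaled, y)
strong_l1 = LogisticRegression(C=0.05, penalty="l1", solver="liblinear").fit(X_scaled, y)

norm_weak = np.linalg.norm(weak_l2.coef_)
norm_strong = np.linalg.norm(strong_l2.coef_)
n_zero_l1 = int(np.sum(strong_l1.coef_ == 0))

print(f"L2 coefficient norm, C=100:  {norm_weak:.3f}")
print(f"L2 coefficient norm, C=0.01: {norm_strong:.3f}")
print(f"L1 (C=0.05) coefficients set exactly to zero: {n_zero_l1} of {strong_l1.coef_.size}")
```

A stronger L2 penalty gives a smaller coefficient norm, and the L1 model drops some features entirely, which is the feature-selection behaviour described above.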

**4. What are some common evaluation metrics for classification models, and why are they important?**

1.Why Evaluation Metrics are Important?

- In classification, accuracy alone is not enough (especially when data is imbalanced).

- Evaluation metrics help us measure:

- How well the model predicts correctly

- Whether it misclassifies certain classes more than others

- Overall performance and fairness of the model

2.Common Evaluation Metrics

- Accuracy

- Formula:

- $$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

- Measures the percentage of correctly classified samples.

- Limitation: Misleading when classes are imbalanced.

- Precision (Positive Predictive Value)

- Formula:

- $$ \text{Precision} = \frac{TP}{TP + FP} $$

- Out of all predicted positives, how many are actually positive.

- Useful in applications where false positives are costly (e.g., spam detection).

3.Recall (Sensitivity or True Positive Rate)

- Formula:

- $$ \text{Recall} = \frac{TP}{TP + FN} $$

- Out of all actual positives, how many are correctly predicted.

- Important when missing a positive case is costly (e.g., cancer detection).

4.F1 Score

- Formula:

- $$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

- Harmonic mean of Precision and Recall.

- Useful when there is class imbalance.

5.ROC Curve & AUC (Area Under Curve)

- ROC curve plots True Positive Rate vs. False Positive Rate.

- AUC measures the overall ability of the model to separate classes.

- Higher AUC = better classifier.
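AUC can also be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The four scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points on the ROC curve
auc = roc_auc_score(y_true, y_score)

# Same quantity by hand: share of positive/negative pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc, "| pairwise ranking estimate:", pairwise)
```

Three of the four positive/negative pairs are ordered correctly here, so both calculations give 0.75.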

6.Example (Binary Classification)

- Model predicting whether an email is spam or not:

- Accuracy: Overall correct predictions.

- Precision: Of emails marked as spam, how many are actually spam.

- Recall: Of all spam emails, how many were detected.

- F1 Score: Balance between precision and recall.

- AUC: How well the model distinguishes spam vs. non-spam.  

7.Key Points

- Accuracy is not always reliable (imbalanced datasets).

- Precision & Recall explain false positives vs. false negatives.

- F1 balances precision and recall.

- AUC measures separability of classes.
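The formulas for accuracy, precision, recall, and F1 can be checked against sklearn on a small hand-made example. The ten labels below are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The library functions should agree with the hand calculation.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```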


**5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

**(Use Dataset from sklearn package)**

Dataset Used

- Breast Cancer Wisconsin dataset (load_breast_cancer) from sklearn.datasets.

- Features: tumor measurements (mean radius, mean texture, etc.)

- Target: 0 = malignant, 1 = benign.

Python Program – Logistic Regression with Train/Test Split


In [1]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target   # Add target column

print("Dataset shape:", df.shape)
print("First 5 rows:")
print(df.head())

# 2. Define features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 3. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # Increased iterations for convergence
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")


Dataset shape: (569, 31)
First 5 rows:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst 
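The question mentions loading a CSV file, while the program above builds the DataFrame straight from sklearn. One way to cover the CSV step as well is to write the dataset to a file first and read it back with `pd.read_csv`. This is a sketch; the temporary path and the file name `breast_cancer.csv` are assumptions made here.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a temporary CSV so there is a file to load.
frame = load_breast_cancer(as_frame=True).frame   # 30 features plus a 'target' column
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame, as the question asks.
df_csv = pd.read_csv(csv_path)
X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

X_tr, X_te, y_tr, y_te = train_test_split(X_csv, y_csv, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print("Loaded shape:", df_csv.shape)
print(f"Model Accuracy: {acc:.4f}")
```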

**6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.**

**(Use Dataset from sklearn package)**

Explanation of Key Points

- Dataset: Breast Cancer (from sklearn.datasets).

- Regularization: penalty='l2' applies Ridge Regularization (default in sklearn).

- Coefficients: model.coef_ shows feature weights after shrinkage.

- Accuracy: Evaluates performance on test set.

This program covers all requirements:

- Uses L2 regularization

- Prints coefficients & intercept

- Prints accuracy

Logistic Regression with L2 Regularization (Ridge)


In [2]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# 2. Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Logistic Regression with L2 Regularization
# (penalty='l2' is default, solver='liblinear' or 'lbfgs' works)
model = LogisticRegression(penalty='l2', solver='liblinear', max_iter=5000)
model.fit(X_train, y_train)

# 4. Predictions
y_pred = model.predict(X_test)

# 5. Print Coefficients and Accuracy
print("Model Coefficients:")
print(model.coef_)   # array of feature weights
print("\nIntercept:", model.intercept_)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")


Model Coefficients:
[[ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
  -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
  -2.07613380e-01 -2.97739324e-02 -5.00338038e-02  1.44298427e+00
  -3.03857384e-01 -7.25692126e-02 -1.61591524e-02 -1.90655332e-03
  -4.48855442e-02 -3.77188737e-02 -4.17516190e-02  5.61347410e-03
   1.23214996e+00 -4.04581097e-01 -3.62091502e-02 -2.70867580e-02
  -2.62630530e-01 -1.20898539e+00 -1.61796947e+00 -6.15250835e-01
  -7.42763610e-01 -1.16960181e-01]]

Intercept: [0.40847797]

Model Accuracy: 0.9561
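The raw coefficient array is easier to read when each weight is paired with its feature name. The sketch below refits the same model and sorts by absolute value. Because the features are unscaled here, the magnitudes are not directly comparable across features, so the ranking is only a rough guide.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

model = LogisticRegression(penalty="l2", solver="liblinear", max_iter=5000)
model.fit(X_tr, y_tr)

# coef_ has shape (1, n_features) for binary problems; label each weight.
coef_table = pd.Series(model.coef_[0], index=data.feature_names, name="coefficient")

# Order features by absolute weight, largest first.
order = coef_table.abs().sort_values(ascending=False).index
top5 = coef_table.reindex(order).head(5)
print(top5)
```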


**7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.**

**(Use Dataset from sklearn package)**

Explanation

- Dataset: Iris (3 classes → Setosa, Versicolor, Virginica).

- multi_class='ovr': Builds one Logistic Regression classifier per class (One-vs-Rest).

- classification_report: Shows precision, recall, f1-score, and support for each class.

This program covers all requirements:

- Uses Logistic Regression for multiclass

- multi_class='ovr' implemented

- Prints classification report

Logistic Regression for Multiclass Classification (OvR)


In [3]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load dataset
data = load_iris()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Dataset shape:", df.shape)
print("Target classes:", data.target_names)

# 2. Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

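# 4. Logistic Regression with OvR (One-vs-Rest)
# Note: the multi_class argument is deprecated in recent scikit-learn releases (1.5+);
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression()) is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)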
model.fit(X_train, y_train)

# 5. Predictions
y_pred = model.predict(X_test)

# 6. Print Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Dataset shape: (150, 5)
Target classes: ['setosa' 'versicolor' 'virginica']

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
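Recent scikit-learn releases deprecate the `multi_class` argument, so the same One-vs-Rest scheme can instead be expressed with `OneVsRestClassifier`. This is an alternative sketch, not the form the question asks for, and its scores may differ slightly from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary Logistic Regression per class, wrapped explicitly.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear", max_iter=200))
ovr.fit(X_tr, y_tr)

print("Binary classifiers fitted:", len(ovr.estimators_))
report = classification_report(y_te, ovr.predict(X_te), target_names=iris.target_names)
print(report)
```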





**8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**

**(Use Dataset from sklearn package)**

Explanation

- Dataset: Breast Cancer (binary classification).

- C: Inverse of regularization strength. Smaller C = stronger regularization.

- penalty: 'l1' (Lasso), 'l2' (Ridge).

- GridSearchCV: Performs cross-validation to find the best hyperparameters.

- Best model: Evaluated on test set.

This program covers all requirements:

- Applies GridSearchCV

- Tunes C and penalty

- Prints best parameters & validation accuracy

- Evaluates on test data

Logistic Regression with GridSearchCV for Hyperparameter Tuning

In [4]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# 2. Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define Logistic Regression model
log_reg = LogisticRegression(max_iter=5000, solver='liblinear')

# 4. Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],        # Inverse of regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2']              # L1 (Lasso), L2 (Ridge)
}

# 5. GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# 6. Best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")

# 7. Evaluate on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9670
Test Accuracy: 0.9825
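Beyond the single best setting, `cv_results_` shows how every C/penalty combination scored. The sketch below is a variation on the program above: it wraps a `StandardScaler` and the model in a `Pipeline` (an addition made here) so that scaling is refit inside each fold, which is why the parameter names carry the `clf__` prefix.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=5000)),
])
grid = {"clf__C": [0.01, 0.1, 1, 10, 100], "clf__penalty": ["l1", "l2"]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# One row per parameter combination, best first.
columns = ["param_clf__C", "param_clf__penalty", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
print("Best Parameters:", search.best_params_)
```

Looking at the spread of `mean_test_score` alongside `std_test_score` helps judge whether the top-ranked setting is meaningfully better than its neighbours.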


**9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.**

**(Use Dataset from sklearn package)**

Explanation

- StandardScaler: Transforms features so that they have mean = 0 and standard deviation = 1.

- Logistic Regression (especially with regularization) can perform better when features are on the same scale.

- We train two models:

  - Without scaling

  - With scaling

- Then compare accuracies.

This program covers:

- Loads dataset from sklearn

- Splits train/test sets

- Trains Logistic Regression with & without scaling

- Compares accuracies

Logistic Regression with and without Standardization

In [5]:
# Import libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# 2. Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=5000, solver='liblinear')
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# 4. Logistic Regression with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=5000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# 5. Compare results
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")


Accuracy without scaling: 0.9561
Accuracy with scaling:    0.9737
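A single train/test split can make the gap between the two models look larger or smaller than it is. The sketch below repeats the comparison with 5-fold cross-validation, using `make_pipeline` so the scaler is fit only on each training fold. The fold count is an arbitrary choice made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

unscaled = LogisticRegression(max_iter=5000, solver="liblinear")
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, solver="liblinear"),
)

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print(f"Without scaling: {scores_unscaled.mean():.4f} +/- {scores_unscaled.std():.4f}")
print(f"With scaling:    {scores_scaled.mean():.4f} +/- {scores_scaled.std():.4f}")
```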


**10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

1.Understanding the Problem

- Business goal: Predict which customers will respond to a marketing campaign.

- Dataset issue: Highly imbalanced (5% positive, 95% negative).

- Challenge: If we only use accuracy, a model could predict “no response” for everyone and still be 95% accurate → but useless for the business.

2.Data Handling

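- Data Cleaning: Handle missing values, remove duplicates, and encode categorical features (One-Hot Encoding for nominal features; Label/Ordinal Encoding only where the categories have a natural order, since a linear model reads the codes as magnitudes).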

- Feature Engineering: Create meaningful features (e.g., past purchases, browsing history, campaign history).

- Feature Scaling: Apply StandardScaler so features are comparable in magnitude; fit the scaler on the training split only and reuse it to transform the test split.

3.Balancing Classes

- Since only 5% respond, we must handle imbalance:

- Resampling techniques:

- Oversampling the minority class (SMOTE, Random Oversampling), applied to the training data only so the test set keeps the real 5% response rate.

- Undersampling the majority class.

- Class Weights in Logistic Regression:

- Use class_weight='balanced' in sklearn to give higher weight to minority class.

4.Model Building (Logistic Regression)

- Use Logistic Regression with regularization (L1/L2).

5.Hyperparameter Tuning

- Use GridSearchCV to tune:

- C (inverse of regularization strength; smaller C = stronger regularization).

- penalty (L1, L2).

- class_weight (balanced vs. none).

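- Perform stratified k-fold cross-validation so each fold keeps roughly the same 5% response rate and the chosen hyperparameters are not tuned to a single split.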

6.Evaluation Metrics

- Since accuracy is misleading in imbalanced data, use:

- Precision: Out of predicted responders, how many actually respond.

- Recall (Sensitivity): Out of all actual responders, how many were identified.

- F1 Score: Balance between precision and recall.

- ROC-AUC: Measures separability of classes.

- PR-AUC (Precision-Recall Curve): More informative than ROC when dataset is highly imbalanced.

7.Business Perspective

- High Recall ensures we don’t miss too many potential responders.

- High Precision ensures we don’t waste resources targeting uninterested customers.

- Balance depends on business strategy:

- If campaign cost is low → prioritize recall.

- If campaign cost is high → prioritize precision.
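The precision/recall trade-off is usually set through the decision threshold rather than left at the default 0.5. The sketch below uses synthetic data and a hypothetical business rule (keep recall at or above 0.60, then maximize precision). In practice the threshold would be chosen on a validation split rather than on the test set used here for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (about 5% responders).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_redundant=2, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)

# Hypothetical rule: require recall >= 0.60, then pick the most precise threshold.
min_recall = 0.60
meets_recall = recall[:-1] >= min_recall   # the final precision/recall pair has no threshold
best = int(np.argmax(np.where(meets_recall, precision[:-1], -1.0)))
chosen_threshold = thresholds[best]

print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Precision: {precision[best]:.3f}  Recall: {recall[best]:.3f}")
```

Raising `min_recall` moves the threshold down (more customers targeted, lower precision). Lowering it does the opposite, which matches the low-cost versus high-cost campaign cases above.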

8.Approach Summary

- Clean & preprocess dataset.

- Scale features with StandardScaler.

- Handle imbalance (SMOTE / class_weight).

- Train Logistic Regression with regularization.

- Tune hyperparameters (C, penalty, class_weight).

- Evaluate using Precision, Recall, F1, ROC-AUC, PR-AUC.

- Select model based on business trade-off (precision vs. recall).

What this program does

- Creates a synthetic imbalanced dataset (95% vs 5%).

- Splits into train/test sets.

- Applies feature scaling.

- Uses SMOTE to balance classes in training data.

- Tunes Logistic Regression with GridSearchCV (C, penalty, class_weight).

- Evaluates model with classification report, confusion matrix, ROC-AUC.

Python Code: Logistic Regression on Imbalanced Dataset

In [6]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE

# 1. Create an imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_redundant=2, n_classes=2, weights=[0.95, 0.05],
                           random_state=42)

# Convert to DataFrame (optional, for readability)
df = pd.DataFrame(X, columns=[f"Feature_{i}" for i in range(1, 11)])
df["Target"] = y

print("Original Class Distribution:\n", df["Target"].value_counts())

# 2. Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Balance classes using SMOTE (oversampling minority class)
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print("Balanced Class Distribution (after SMOTE):\n", pd.Series(y_train_balanced).value_counts())

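# 5. Logistic Regression with GridSearchCV for hyperparameter tuning
# Caveat: the folds below are drawn from data that SMOTE has already resampled, so
# synthetic points can sit in both the training and validation folds and the CV
# scores tend to be optimistic. Resampling inside each fold (for example with
# imblearn.pipeline.Pipeline) avoids this.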
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],  # liblinear supports l1 & l2
    'class_weight': [None, 'balanced']
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='f1')
grid.fit(X_train_balanced, y_train_balanced)

print("\nBest Parameters:", grid.best_params_)

# 6. Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test_scaled)

print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, best_model.predict_proba(X_test_scaled)[:,1]))


Original Class Distribution:
 Target
0    4724
1     276
Name: count, dtype: int64
Balanced Class Distribution (after SMOTE):
 1    3779
0    3779
Name: count, dtype: int64

Best Parameters: {'C': 0.01, 'class_weight': None, 'penalty': 'l1', 'solver': 'liblinear'}

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.76      0.85       945
           1       0.14      0.67      0.23        55

    accuracy                           0.75      1000
   macro avg       0.56      0.72      0.54      1000
weighted avg       0.93      0.75      0.82      1000

Confusion Matrix:
 [[716 229]
 [ 18  37]]
ROC-AUC Score: 0.7777777777777778
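An alternative that avoids resampling altogether is to rely on `class_weight='balanced'` and keep every preprocessing step inside a `Pipeline`, so cross-validation never sees information from its own validation fold. The sketch below is one such variant under stated assumptions: it swaps SMOTE for class weights, tunes on average precision (PR-AUC), and reuses the same synthetic data settings. Its results are not the ones printed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=10, n_informative=6,
                           n_redundant=2, n_classes=2, weights=[0.95, 0.05],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Scaling and the classifier are refit together inside every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced",
                               max_iter=1000)),
])
grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, grid, cv=cv, scoring="average_precision")
search.fit(X_tr, y_tr)

proba = search.predict_proba(X_te)[:, 1]
pr_auc = average_precision_score(y_te, proba)
roc_auc = roc_auc_score(y_te, proba)

print("Best Parameters:", search.best_params_)
print(f"PR-AUC:  {pr_auc:.4f} (baseline = positive rate {y_te.mean():.4f})")
print(f"ROC-AUC: {roc_auc:.4f}")
```

Comparing PR-AUC against the positive rate gives a more honest picture on a 5% dataset than accuracy, since a model that ranks customers at random would score close to that baseline.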
