In [1]:
Question 5: Write a Python program that loads a CSV file into a Pandas
DataFrame, splits it into train/test sets, trains a Logistic Regression model, and
prints its accuracy.

Program =>

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset (from sklearn)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target   # add target column

# 2. Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression model
model = LogisticRegression(max_iter=10000)  # increase iterations for convergence
model.fit(X_train, y_train)

# 5. Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of Logistic Regression model:", accuracy)


Accuracy of Logistic Regression model: 0.956140350877193
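
Question 5 asks for a CSV file, while the program above takes the data directly from scikit-learn. A minimal sketch of the CSV route follows. It assumes a hypothetical file name, breast_cancer.csv, which the sketch first writes to a temporary directory so that there is a file to load; with a real dataset only the pd.read_csv call and the target column name would change.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the dataset to disk so the example has a CSV to read.
# "breast_cancer.csv" is a hypothetical file name used only for illustration.
data = load_breast_cancer(as_frame=True)
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
data.frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame, as the question asks
df = pd.read_csv(csv_path)
X = df.drop("target", axis=1)
y = df["target"]

# Same split and model settings as the program above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

csv_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy (data loaded from CSV):", csv_accuracy)
```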


In [2]:
Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.

Program =>
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 2. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression with L2 regularization
model = LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000)
model.fit(X_train, y_train)

# 5. Predictions and accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 6. Model coefficients
coefficients = model.coef_[0]

print("Accuracy of Logistic Regression model with L2 regularization:", accuracy)
print("\nFirst 10 coefficients of the model:\n", coefficients[:10])

Accuracy of Logistic Regression model with L2 regularization: 0.956140350877193

First 10 coefficients of the model:
 [ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
 -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
 -2.07613380e-01 -2.97739324e-02]


In [3]:
Question 7: Write a Python program to train a Logistic Regression model for
multiclass classification using multi_class='ovr' and print the classification
report.

Program =>

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# 1. Load dataset (Iris - multiclass)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 2. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

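# 4. Train Logistic Regression with One-vs-Rest (OvR)
# Note: the multi_class parameter is deprecated since scikit-learn 1.5; newer releases
# recommend wrapping the model instead: OneVsRestClassifier(LogisticRegression()).
model = LogisticRegression(multi_class="ovr", solver="liblinear", max_iter=1000)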
model.fit(X_train, y_train)

# 5. Predictions
y_pred = model.predict(X_test)

# 6. Classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





In [4]:
Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and
validation accuracy.

Program =>

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

# 1. Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 2. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Logistic Regression + GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Regularization strength
    'penalty': ['l1', 'l2']             # Regularization type
}

# Use solver that supports both l1 and l2 (liblinear or saga)
log_reg = LogisticRegression(solver="liblinear", max_iter=1000)

grid = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy'
)

grid.fit(X_train, y_train)

# 5. Results
print("Best Parameters:", grid.best_params_)
print("Best Cross-validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-validation Accuracy: 0.9670329670329672


In [5]:
Question 9: Write a Python program to standardize the features before training
Logistic Regression and compare the model's accuracy with and without scaling.

Program =>

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# 2. Features and target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Without Scaling ----
model_no_scaling = LogisticRegression(max_iter=1000, solver="liblinear")
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---- With Scaling ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=1000, solver="liblinear")
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# 4. Results
print("Accuracy without Scaling:", accuracy_no_scaling)
print("Accuracy with Scaling   :", accuracy_scaled)


Accuracy without Scaling: 0.956140350877193
Accuracy with Scaling   : 0.9736842105263158


Logistic Regression | Assignment

Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Answer :- 1. Logistic Regression

* Logistic Regression is a supervised machine learning algorithm mainly used for classification problems (binary or multi-class).

* Instead of predicting a continuous value, it predicts the probability that an observation belongs to a particular class (e.g., "yes/no", "spam/not spam", "disease/no disease").

* It uses the logistic (sigmoid) function to map predictions to probabilities between 0 and 1.

Sigmoid Function:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}$$

2. Linear Regression

* Linear Regression is used for regression problems where the target/output variable is continuous (e.g., predicting house prices, temperature, salary).

* It assumes a linear relationship between input variables (X) and the output (Y).
* Formula:

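$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$$

Key difference: Linear Regression outputs the linear combination itself as an unbounded continuous value, whereas Logistic Regression passes the same kind of linear combination through the sigmoid and outputs a class probability between 0 and 1.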
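
The contrast between the two models can be sketched on a small made-up dataset (hours studied versus pass/fail, invented here purely for illustration). Fitting both models to the same binary target shows that the linear model's predictions can leave the [0, 1] range, while the logistic model's probabilities stay inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied (feature) and pass = 1 / fail = 0 (target)
hours = np.arange(10, dtype=float).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

linear = LinearRegression().fit(hours, passed)
logistic = LogisticRegression().fit(hours, passed)

# Score two points far outside the training range
new_hours = np.array([[-5.0], [15.0]])
linear_pred = linear.predict(new_hours)                  # unbounded real values
logistic_prob = logistic.predict_proba(new_hours)[:, 1]  # P(pass), bounded by 0 and 1

print("Linear Regression predictions:", linear_pred)
print("Logistic Regression P(y=1)   :", logistic_prob)
```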

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer:

Role of the Sigmoid Function in Logistic Regression

In Logistic Regression, we want to model the probability that a given input belongs to a particular class (for example, class = 1 vs class = 0).

1. Linear Combination (Raw Score / Logit):

Logistic regression first computes a linear combination of the input features:

$$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

This z value can range from −∞ to +∞.

2. Problem with Linear Output:

Probabilities must always lie between 0 and 1, but a raw linear function can produce any real number.

3. Solution → Sigmoid Function:

To convert the linear output into a probability, we pass z through the Sigmoid (Logistic) function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

* When z → +∞, σ(z) → 1

* When z → −∞, σ(z) → 0

* When z = 0, σ(z) = 0.5

4. Interpretation:

* The output of the sigmoid function is the predicted probability of belonging to the positive class (class = 1).

* For example: If σ(z)=0.8, the model predicts an 80% chance that the sample belongs to class 1.

5. Decision Boundary:
By default, logistic regression uses 0.5 as the threshold:
* If σ(z)≥0.5 → Predict class 1
* If σ(z)<0.5 → Predict class 0
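
The mapping from raw score to probability to class label can be checked numerically. The sketch below is a plain NumPy version of the sigmoid with the default 0.5 threshold; the z values are arbitrary examples chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Arbitrary example scores, from strongly negative to strongly positive
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(z)

# Default decision rule: predict class 1 when sigma(z) >= 0.5
labels = (probs >= 0.5).astype(int)

for score, p, label in zip(z, probs, labels):
    print(f"z = {score:6.1f}  sigma(z) = {p:.4f}  predicted class = {label}")
```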

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer: Regularization in Logistic Regression
1. What is Regularization?

Regularization is a technique used to prevent overfitting in logistic regression (and other models).
It works by adding a penalty term to the cost (loss) function, discouraging the model from relying too heavily on any one feature or having excessively large coefficients.

2. Logistic Regression without Regularization
* Logistic regression tries to minimize the log-loss (cross-entropy loss):

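$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{y}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$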

* If there are many features, the model may assign very large weights (coefficients) to fit the training data perfectly → this leads to overfitting.

3. Regularization in Logistic Regression

To control overfitting, we add a penalty term:
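$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{y}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right] + \lambda R(w)$$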

where:

* λ = regularization strength (controls penalty weight).
* R(w) = regularization term (depends on type).

4. Types of Regularization

* L1 Regularization (Lasso):

$$R(w) = \sum_{j} |w_j|$$

=> Encourages sparsity → some coefficients become exactly zero.

=> Useful for feature selection.

* L2 Regularization (Ridge):

$$R(w) = \sum_{j} w_j^2$$

=> Shrinks weights towards zero (but not exactly zero).

=> Helps reduce model complexity and variance.

* Elastic Net (Combination of L1 + L2):

=> Balances sparsity (L1) and stability (L2).
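
The sparsity effect of L1 compared with L2 can be seen by counting coefficients that are exactly zero. This is a sketch under a few assumptions: it reuses the breast cancer dataset from the programs in this assignment, standardizes the features, and picks C=0.1 arbitrarily (in scikit-learn, C is the inverse of the regularization strength, so a small C means a strong penalty).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

zero_counts = {}
for penalty in ("l1", "l2"):
    # liblinear supports both penalties; C=0.1 is an arbitrary, fairly strong setting
    model = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear", max_iter=1000)
    model.fit(X_scaled, y)
    zero_counts[penalty] = int(np.sum(model.coef_[0] == 0))
    print(f"{penalty}: {zero_counts[penalty]} of {X.shape[1]} coefficients are exactly zero")
```

With L1 a number of coefficients are driven to exactly zero, which is why it doubles as a feature selector, while L2 only shrinks them.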

5. Why is Regularization Needed?

* Prevents overfitting → model generalizes better to unseen data.

* Controls the size of coefficients (avoids extreme weights).

* Improves stability and interpretability of the model.

* Helps in high-dimensional datasets (many features).

Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer :- Common Evaluation Metrics for Classification Models

When we build a classification model (like logistic regression), we need ways to measure its performance. Different metrics highlight different aspects of performance.

1. Accuracy

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

* Tells us the overall proportion of correctly classified samples.
* Good when classes are balanced, but misleading for imbalanced datasets.
* Example: If 95% of patients are healthy, a model that always predicts “healthy” gets 95% accuracy, yet it never detects a single sick patient.
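
The 95% example can be reproduced in a few lines. The labels below are made up to match it: 95 healthy patients (class 0), 5 sick patients (class 1), and a "model" that always predicts healthy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 healthy patients (0) and 5 sick patients (1), as in the example above
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts healthy
y_pred = np.zeros_like(y_true)

always_healthy_accuracy = accuracy_score(y_true, y_pred)
always_healthy_recall = recall_score(y_true, y_pred)

print("Accuracy:", always_healthy_accuracy)            # looks good
print("Recall on sick patients:", always_healthy_recall)  # no sick patient is found
```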

2. Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

* Of all the samples predicted as positive, how many are actually positive?
* High precision = fewer false alarms.
* Important in applications like spam detection (don’t want to misclassify real emails as spam).

3. Recall (Sensitivity / True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

* Of all the actual positive cases, how many did the model correctly identify?
* High recall = fewer missed positives.
* Important in medical diagnosis (don’t want to miss sick patients).

4. F1 Score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

* Harmonic mean of precision and recall.

* Useful when classes are imbalanced.

* Balances false positives and false negatives.

5. ROC Curve & AUC (Area Under Curve)

* ROC Curve plots True Positive Rate (Recall) vs False Positive Rate at different thresholds.

* AUC measures overall ability to distinguish between classes.

1. AUC = 1 → perfect classifier

2. AUC = 0.5 → random guessing
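
The two reference values for AUC can be checked with scikit-learn's roc_auc_score. The labels and scores below are small made-up examples, not model output.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive is scored above every negative
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# The same score for every sample carries no ranking information
auc_uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

# One positive is ranked below one negative
auc_partial = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])

print("Perfect ranking AUC :", auc_perfect)
print("Constant scores AUC :", auc_uninformative)
print("Partly correct AUC  :", auc_partial)
```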

6. Confusion Matrix

A table that shows counts of:

* True Positives (TP)

* True Negatives (TN)

* False Positives (FP)

* False Negatives (FN)

Helps visualize what kinds of errors the model makes.
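
A short worked example ties the confusion matrix to the formulas above. The ten labels are made up so that the counts are easy to verify by hand: TP = 3, FN = 1, FP = 2, TN = 4.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 4 actual positives followed by 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)   = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("F1 score :", f1)
```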

Why Are These Metrics Important?

* Different applications care about different errors.

* Medical diagnosis → Recall is critical (don’t miss positives).

* Fraud detection → Precision is critical (don’t wrongly accuse innocent transactions).

* Accuracy alone can be misleading in imbalanced datasets.

* Metrics guide model improvement (tuning thresholds, regularization, feature engineering).

Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Answer: 1. Data Handling & Preprocessing

* Data Cleaning: Handle missing values, outliers, and inconsistencies (e.g., invalid ages, duplicate records).

* Feature Engineering: Create meaningful features such as past purchase frequency, recency (last purchase date), average order value, browsing behavior, etc.

* Categorical Encoding: Convert categorical variables (like location, product category preference) into numerical form using one-hot encoding or target encoding.

2. Feature Scaling

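* Logistic Regression is fit with an iterative optimizer, and its regularization penalty acts on the raw coefficient sizes, so scaling both helps convergence and keeps the penalty comparable across features.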

* Apply StandardScaler (Z-score normalization) or MinMax scaling to numerical features so that all features contribute equally.

3. Handling Class Imbalance (Critical Step)

The dataset has only 5% positives (responders), so a naive model would just predict "non-response" and achieve 95% accuracy — but that’s useless. Options:

1. Resampling Techniques

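* Oversampling: Use SMOTE (synthetic minority samples) or random oversampling (duplicated minority samples) to increase the share of responders. Resample the training data only, never the validation or test data.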

* Undersampling: Reduce majority class (non-responders). Works if dataset is very large.

* Hybrid: Combine SMOTE + undersampling.

2. Class Weights

* Logistic Regression in scikit-learn allows class_weight="balanced" to penalize misclassification of minority class more.

* Practical Choice: In business cases like marketing, I’d first try class weights (less risk of overfitting) and validate against SMOTE.
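
The class-weight option can be sketched as follows. Since the campaign data is not available here, the sketch assumes a synthetic stand-in generated with make_classification (about 5% positives); the sample size, feature counts, and random seeds are arbitrary choices. It first shows what "balanced" means numerically, then compares recall on the minority class with and without weighting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# What "balanced" means for 95 non-responders and 5 responders:
# weight = n_samples / (n_classes * class_count)
toy_y = np.array([0] * 95 + [1] * 5)
toy_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_y
)
print("Balanced class weights:", toy_weights)

# Synthetic stand-in for the campaign data (assumption: about 5% responders)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

recalls = {}
for name, class_weight in (("unweighted", None), ("balanced", "balanced")):
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    model.fit(X_train, y_train)
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(f"Recall on responders ({name}): {recalls[name]:.3f}")
```

Higher recall from weighting usually comes at the price of lower precision, so the two should be read together.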

4. Hyperparameter Tuning

Key Logistic Regression parameters to tune:

* Regularization strength (C) → Smaller values = stronger regularization (helps prevent overfitting).

* Penalty type → l1, l2, or elasticnet.

* Solver → liblinear (supports l1 and l2, suited to smaller datasets) or saga (supports l1, l2, and elasticnet, suited to larger ones).

* Use GridSearchCV or RandomizedSearchCV with stratified cross-validation to maintain class ratio in folds.
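
A minimal sketch of this tuning setup is shown below. The data is again a synthetic stand-in from make_classification, and the grid values, step names ("scaler", "clf"), and the choice of average precision as the scoring metric are assumptions made for illustration. Putting the scaler inside the pipeline means it is re-fit on the training part of every fold, so no information leaks from the validation part.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption: about 5% responders)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Scaling and the model are tuned together as one estimator
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=1000)),
])

# Illustrative grid; liblinear supports both l1 and l2
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

# Stratified folds keep the responder ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated average precision:", search.best_score_)
```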

5. Model Evaluation

Because of imbalance, accuracy is misleading. Use:

* Precision, Recall, F1-score (with more emphasis on Recall if the business wants to minimize missed responders, or Precision if cost of contacting uninterested customers is high).

* ROC-AUC: Measures ability to rank positives above negatives.

* PR-AUC (Precision-Recall curve): More informative under heavy imbalance.

* Confusion Matrix: Shows the counts of TP, FP, FN, and TN at a chosen threshold; comparing matrices across thresholds exposes the trade-off.

=> Business alignment:

* If the cost of sending a marketing message is low, maximize Recall (catch as many responders as possible).

* If marketing cost is high (expensive offers), maximize Precision (avoid wasting resources on uninterested customers).
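
The ranking metrics above can be computed as in the following sketch, which again assumes synthetic stand-in data with about 5% positives. One useful reference point: a model with no ranking ability has a PR-AUC (average precision) close to the positive rate, roughly 0.05 here, whereas its ROC-AUC would sit near 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption: about 5% responders)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Both metrics are computed from predicted probabilities, not hard labels
scores = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)
positive_rate = y_test.mean()

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
print(f"Positive rate (PR-AUC of a no-skill model): {positive_rate:.3f}")
```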

6. Deployment & Monitoring

* After selecting the best threshold (not default 0.5, but tuned based on Precision-Recall tradeoff), deploy the model.

* Continuously monitor drift (customer behavior may change over time).

* Retrain periodically as new campaign data comes in.
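
Threshold selection can be sketched as follows. This version picks the threshold that maximizes F1 on a held-out validation set, which is only one possible criterion; a cost-based rule derived from the campaign economics would fit the business case equally well. The data is a synthetic stand-in, and the threshold is chosen on validation data so that the test set stays untouched for the final estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (assumption: about 5% responders)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds;
# the final point has no associated threshold, so it is dropped
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
f1_per_threshold = (
    2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
)
best_threshold = thresholds[np.argmax(f1_per_threshold)]

f1_default = f1_score(y_val, (val_scores >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (val_scores >= best_threshold).astype(int))

print(f"Chosen threshold: {best_threshold:.3f}")
print(f"F1 at 0.5: {f1_default:.3f}   F1 at chosen threshold: {f1_tuned:.3f}")
```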

