Assignment Code: DA-AG-011
Name: Sakshi Upadhyay
Email: sakshiup74744@gmail.com


Q1. What is Logistic Regression, and how does it differ from Linear Regression?

Ans. Logistic regression is a classification technique. It is used to predict categories, such as whether an email is spam or whether someone will buy a product: basically a yes-or-no kind of outcome.
Linear regression, on the other hand, is used to predict continuous numbers, like the price of a house, someone's salary, or tomorrow's temperature.
Linear regression fits a straight line to predict values. You give it some input (like years of experience), and it returns a number (like expected salary).
Logistic regression does not output an unbounded number like 72 or 5.4. It outputs a probability between 0 and 1 (like 0.8, meaning "80% chance this is spam"), which is then converted into a category (spam or not spam) by comparing it with a threshold, usually 0.5.
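The difference can be seen in a small sketch. The toy data below (years of experience vs. salary, hours studied vs. pass/fail) is made up purely for illustration; only the scikit-learn classes are real.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data, invented for illustration only.
years = np.array([[1], [2], [3], [4]])            # years of experience
salary = np.array([30.0, 40.0, 50.0, 60.0])       # salary in thousands

hours = np.array([[1], [2], [3], [4], [5], [6]])  # hours studied
passed = np.array([0, 0, 0, 1, 1, 1])             # 1 = passed, 0 = failed

# Linear regression predicts a continuous number.
lin = LinearRegression().fit(years, salary)
salary_pred = lin.predict([[5]])[0]

# Logistic regression predicts a probability, then a class label.
log = LogisticRegression().fit(hours, passed)
pass_proba = log.predict_proba([[5]])[0, 1]       # P(class == 1)
pass_label = log.predict([[5]])[0]                # probability thresholded at 0.5

print(f"Predicted salary for 5 years: {salary_pred:.1f}")
print(f"Probability of passing after 5 hours: {pass_proba:.2f}")
print(f"Predicted class: {pass_label}")
```

The first model returns a value on the salary scale, while the second returns a probability and a 0/1 label derived from it.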


Q2. Explain the role of the Sigmoid function in Logistic Regression.
Ans. In logistic regression, the sigmoid function is what lets the model express its decisions as probabilities.
Logistic regression first does what linear regression does: it takes your inputs (like age, salary, etc.), multiplies them by some weights, adds them up along with a bias term, and gets a number z. This number could be anything: negative, positive, small, or large.
But we don't want just any number. We want something between 0 and 1, because we're trying to say how likely something is. That's where the sigmoid function comes in: sigmoid(z) = 1 / (1 + e^(-z)) squashes any real number into the range 0 to 1. A large positive z gives a value close to 1, a large negative z gives a value close to 0, and z = 0 gives exactly 0.5.
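A minimal sketch of this step in plain Python is shown here. The weights, bias, and feature values are hypothetical numbers chosen only to make the arithmetic easy to follow; a trained model would learn its own.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights, bias, and inputs (illustration only).
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

z = sum(w * x for w, x in zip(weights, features)) + bias  # linear score
p = sigmoid(z)                                            # probability
label = 1 if p >= 0.5 else 0                              # thresholded class

print(f"z = {z:.2f}, p = {p:.3f}, label = {label}")
# z = 1.20, p = 0.769, label = 1
```

The linear score of 1.2 has no meaning as a probability on its own; after the sigmoid it reads as roughly a 77% chance of the positive class.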

Q3. What is Regularization in Logistic Regression and why is it needed?
Ans. Regularization is like adding a rule to keep your logistic regression model from going overboard.
Imagine you're training a model to predict whether someone will buy a product based on their age, income, location, etc. If the model focuses too much on tiny quirks or noise in the training data, it may give too much importance (very large weights) to certain features, even if they aren't that useful. This is called overfitting.
Regularization helps prevent this by adding a penalty on the size of the weights to the loss function, so the model is penalized for being too complex. The two common choices are L1 (Lasso), which can shrink some weights to exactly zero, and L2 (Ridge), which shrinks all weights toward zero.
It is needed because, with many features or noisy features, logistic regression may try too hard to fit the training data perfectly. That leads to a model that performs well on training data but poorly on new, unseen data, which we don't want.
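The shrinking effect can be checked with a short sketch. In scikit-learn the parameter C is the inverse of the regularization strength, so a smaller C means a stronger penalty. The three C values below are arbitrary, and the model is fitted on the full dataset only to compare coefficient sizes, not to measure accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):  # arbitrary values, from strong to weak penalty
    model = LogisticRegression(C=C, max_iter=10000)  # L2 penalty is the default
    model.fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} L2 norm of coefficients = {coef_norms[C]:.3f}")
```

The norm of the coefficient vector should grow as C grows, which is the penalty loosening its grip on the weights.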

Q4. What are some common evaluation metrics for classification models, and why are they important?
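Ans. Classification models are evaluated using several metrics to measure their performance. In the formulas below, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. Some of the most commonly used metrics are: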

Accuracy:

Definition: The ratio of correctly predicted instances to the total number of instances.

Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Importance: Gives an overall idea of how well the model is performing.

Limitation: Not reliable when the data is imbalanced, because a model that always predicts the majority class can still score high.

Precision:

Definition: The ratio of true positives to the total predicted positives.

Formula: Precision = TP / (TP + FP)

Importance: Indicates how many of the predicted positive instances are actually correct.

Useful when false positives are costly.

Recall (Sensitivity or True Positive Rate):

Definition: The ratio of true positives to the total actual positives.

Formula: Recall = TP / (TP + FN)

Importance: Measures how well the model identifies actual positive cases.

Useful when false negatives are critical (e.g., in medical diagnoses).

F1 Score:

Definition: The harmonic mean of precision and recall.

Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)

Importance: Balances precision and recall; useful when both are equally important.

ROC-AUC (Receiver Operating Characteristic – Area Under Curve):

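Definition: The area under the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies. It measures the model's ability to rank positive cases above negative ones.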

Importance: Provides a comprehensive view of performance across all classification thresholds.

An AUC closer to 1 indicates a better performing model, while 0.5 corresponds to random guessing.
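The first four metrics can be worked through by hand from confusion-matrix counts. The counts below are hypothetical and chosen to mimic an imbalanced problem (60 actual positives out of 1000 cases).

```python
# Hypothetical confusion-matrix counts (illustration only).
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1 Score:  {f1:.3f}")         # 0.727
```

Accuracy looks excellent at 97%, yet a third of the actual positives are missed, which is exactly why recall and F1 are reported alongside it.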


Q5. Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset into a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A high max_iter gives the solver room to converge on unscaled features
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression model:", accuracy)

Accuracy of Logistic Regression model: 0.956140350877193


Q6. Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.


In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset into a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2 (Ridge) penalty; C defaults to 1.0, and a smaller C means stronger regularization
model = LogisticRegression(penalty='l2', solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# One coefficient per feature, plus a single intercept
print("Model Coefficients:")
print(model.coef_)
print("\nIntercept:")
print(model.intercept_)
print("\nAccuracy:", accuracy)


Model Coefficients:
[[ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
  -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
  -2.07613380e-01 -2.97739324e-02 -5.00338038e-02  1.44298427e+00
  -3.03857384e-01 -7.25692126e-02 -1.61591524e-02 -1.90655332e-03
  -4.48855442e-02 -3.77188737e-02 -4.17516190e-02  5.61347410e-03
   1.23214996e+00 -4.04581097e-01 -3.62091502e-02 -2.70867580e-02
  -2.62630530e-01 -1.20898539e+00 -1.61796947e+00 -6.15250835e-01
  -7.42763610e-01 -1.16960181e-01]]

Intercept:
[0.40847797]

Accuracy: 0.956140350877193


Q7. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the three-class iris dataset into a DataFrame
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-vs-rest: one binary classifier is trained per class
model = LogisticRegression(multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1 on the test set
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
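A note on versions: recent scikit-learn releases deprecate the multi_class argument (as far as I know, starting with version 1.5), so the call above may warn or fail on a newer install. A sketch of the same one-vs-rest setup using the OneVsRestClassifier wrapper is given here as an alternative; it is not the form the question asks for.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# The wrapper fits one binary LogisticRegression per class (one-vs-rest).
model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

The fitted wrapper exposes the per-class models in model.estimators_, one for each of the three iris species.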





Q8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.


In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

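# Load the dataset into a DataFrame
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values; liblinear supports both the l1 and l2 penalties
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

# 5-fold cross-validated grid search on the training set
grid_search = GridSearchCV(LogisticRegression(multi_class='ovr'), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Parameters:", grid_search.best_params_)
print("Best Validation Accuracy:", grid_search.best_score_)
print("Test Accuracy:", accuracy_score(y_test, grid_search.predict(X_test)))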

Q9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)


In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=10000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_scaled = LogisticRegression(max_iter=10000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Print the comparison
print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:", acc_scaled)

Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158


Q10. Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you'd take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Ans. To build a Logistic Regression model for predicting customer responses in an imbalanced dataset scenario (5% positive class), the following systematic approach should be taken:

1. Data Handling
Load and explore the dataset to understand the features and target variable.

Handle missing values, outliers, and incorrect data types.

Perform feature engineering to create meaningful variables (e.g., total purchases, time since last activity).

Encode categorical variables using label encoding or one-hot encoding.

2. Feature Scaling
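Use StandardScaler to standardize numerical features (zero mean, unit variance). Fit the scaler on the training data only and reuse it to transform the test data, so that no information leaks from the test set.

This puts all features on a comparable scale, which helps the solver converge and matters especially when regularization is used, because the penalty treats all coefficients alike.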

3. Dealing with Class Imbalance
Because only 5% of customers respond, special handling is required:

Class Weighting: Use class_weight='balanced' in Logistic Regression so that errors on the rare responder class are weighted more heavily than errors on the majority class.

Oversampling or Undersampling (optional): Apply techniques like SMOTE to oversample the minority class, or randomly undersample the majority class. Resample only the training data, never the test set.
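To see what class_weight='balanced' does, the weights can be computed directly. The sketch below assumes a hypothetical set of 1000 customers with exactly 5% responders; scikit-learn derives each weight as n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 non-responders (0) and 50 responders (1).
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print(f"Weight for non-responders: {weights[0]:.3f}")  # 0.526
print(f"Weight for responders:     {weights[1]:.3f}")  # 10.000
```

Each misclassified responder therefore counts about 19 times as much as a misclassified non-responder during training.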

4. Train-Test Split
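Use train_test_split() with the stratify parameter to maintain the class distribution in both training and testing datasets. In practice this split is done before fitting the scaler or resampling (steps 2 and 3), so that the test set stays untouched.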
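A small sketch of a stratified split follows, again using hypothetical labels with 5% responders and a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 950 non-responders, 50 responders, one dummy feature.
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Responder rate in train: {y_train.mean():.3f}")  # 0.050
print(f"Responder rate in test:  {y_test.mean():.3f}")   # 0.050
```

Without stratify=y, a random 20% sample could by chance contain noticeably more or fewer responders than 5%, which makes the test metrics less trustworthy.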

5. Hyperparameter Tuning
Use GridSearchCV to tune hyperparameters such as:

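C: Inverse of regularization strength (smaller values mean stronger regularization)

penalty: Type of regularization (l1 or l2)

solver: Suitable solver (e.g., liblinear for small datasets; both liblinear and saga support l1 and l2)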
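Steps 2 to 5 can be combined in one sketch. Since no real campaign data is given, the example assumes a synthetic dataset from make_classification with roughly 5% positives; the grid values and the choice of ROC-AUC as the scoring metric are also assumptions. Putting the scaler inside a Pipeline means it is refitted on the training part of each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (about 5% responders).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

# Step names are prefixed to parameter names with a double underscore.
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.3f}")
```

Scoring by ROC-AUC rather than accuracy keeps the search from rewarding a model that simply predicts "no response" for everyone.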

6. Model Evaluation
Since the data is imbalanced, accuracy is not sufficient: a model that predicts "no response" for every customer would already be 95% accurate. Better evaluation metrics include:

Precision: Percentage of predicted responders who actually respond.

Recall: Percentage of actual responders correctly identified.

F1-Score: Harmonic mean of precision and recall.

ROC-AUC Score: Measures how well the model distinguishes between classes.

Confusion Matrix: Provides insight into true/false positives and negatives.

The decision threshold (0.5 by default) can also be tuned to favor recall or precision, depending on business needs.
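The effect of moving the threshold can be shown on a handful of made-up predictions. The labels, probabilities, and the alternative threshold of 0.37 below are all hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities (illustration only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

def evaluate_at(threshold):
    """Return (precision, recall) when predicting 1 for proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

for threshold in (0.50, 0.37):
    precision, recall = evaluate_at(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
# threshold=0.50  precision=0.667  recall=0.667
# threshold=0.37  precision=0.600  recall=1.000
```

Lowering the threshold catches every responder here at the cost of contacting more customers who will not respond, so the right value depends on how expensive a wasted contact is compared with a missed responder.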

7. Business Application
Use predicted probabilities to rank customers by their likelihood to respond.

Deploy the model to help marketing teams target only high-probability customers, optimizing campaign costs and improving ROI.
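Ranking by probability can be sketched in a few lines. The customer IDs, probabilities, and the campaign budget of two contacts are hypothetical values for illustration.

```python
import numpy as np

# Hypothetical customer IDs and their predicted response probabilities.
customer_ids = np.array(["C001", "C002", "C003", "C004", "C005"])
proba = np.array([0.12, 0.81, 0.05, 0.47, 0.66])

budget = 2  # hypothetical: how many customers the campaign can contact

# Sort indices from highest to lowest probability and keep the top ones.
order = np.argsort(-proba)
targets = customer_ids[order[:budget]]

print("Customers to contact:", list(targets))
# Customers to contact: ['C002', 'C005']
```

In a real deployment the probabilities would come from model.predict_proba on the latest customer data, and the cutoff would follow from the campaign budget or a tuned threshold.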