# Logistic Regression Assignment

1. What is Logistic Regression, and how does it differ from Linear
Regression?

Ans: Logistic regression and linear regression are both supervised learning algorithms, but they differ in their purpose and in how they handle the output variable. Linear regression predicts continuous numerical values, while logistic regression predicts the probability of a categorical outcome. In simpler terms, linear regression is used for things like predicting house prices, while logistic regression is used for things like predicting whether an email is spam or not.

Here's a more detailed breakdown:

1. Linear Regression:

 - Objective:
Predicts a continuous dependent variable (output) based on one or more independent variables (inputs).

 - Output:
A numerical value that can fall anywhere on a continuous scale (e.g., temperature, sales figures, house price).

 - Example:
Predicting the price of a house based on its size, location, and number of bedrooms.

 - Assumptions:
   - Linear relationship between dependent and independent variables.
   - Residuals (errors) are normally distributed.
   - Homoscedasticity (constant variance of errors).

 - Estimation Method:
Ordinary Least Squares (OLS), minimizing the sum of squared differences between observed and predicted values.

 - Evaluation Metrics:
R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE).

2. Logistic Regression:

 - Objective:
Predicts the probability of a categorical dependent variable belonging to a specific class.

 - Output:
A probability value between 0 and 1, representing the likelihood of the event occurring (e.g., probability of a customer clicking on an ad, probability of a patient having a disease).

 - Example:
Predicting whether a customer will click on an advertisement (yes or no), or classifying an email as spam or not spam.

 - Assumptions:
   - The dependent variable is binary (in the standard binary form of the model).
   - Observations are independent.
   - Little or no multicollinearity among independent variables.

 - Estimation Method:
Maximum Likelihood Estimation (MLE), finding parameters that maximize the likelihood of observing the given data.

 - Evaluation Metrics:
Accuracy, Precision, Recall, F1 Score, Area Under the ROC Curve (AUC-ROC).

 - Key feature:
Uses a sigmoid function (or similar) to transform the linear combination of input variables into a probability.

Practical Applications

1. Linear Regression:

 - Forecasting sales revenue based on advertising spend.

 - Estimating a person's weight based on height and age.

2. Logistic Regression:

 - Determining whether a customer will buy a product (yes/no).

 - Predicting if an email is spam or not.
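The difference in output type can be seen on a toy problem. The sketch below is only an illustration: the data (hours studied, exam score, pass/fail) and all variable names are made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature (hours studied) for eight students
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
exam_score = np.array([35.0, 42.0, 50.0, 58.0, 63.0, 71.0, 80.0, 86.0])  # continuous target
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                              # binary target

# Linear regression: the prediction is an unbounded real number
linear_model = LinearRegression().fit(hours, exam_score)
predicted_score = linear_model.predict([[9.0]])[0]

# Logistic regression: the prediction is a probability in (0, 1), or a class label
logistic_model = LogisticRegression().fit(hours, passed)
pass_probability = logistic_model.predict_proba([[9.0]])[0, 1]
pass_label = logistic_model.predict([[9.0]])[0]

print(f"Predicted score for 9 hours: {predicted_score:.1f}")
print(f"Probability of passing for 9 hours: {pass_probability:.3f}")
print(f"Predicted class for 9 hours: {pass_label}")
```

The linear model extrapolates freely beyond the observed scores, while the logistic model's probability can never leave the interval between 0 and 1.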


2. Explain the role of the Sigmoid function in Logistic Regression.

Ans: In Logistic Regression, the sigmoid function, also known as the logistic function, plays a crucial role in transforming the output of a linear model into a probability value between 0 and 1. This transformation allows for binary classification, where the model predicts the probability of an instance belonging to a particular class.

Here's a more detailed explanation:
1. Linear Model Output:
Logistic Regression starts with a linear model that combines input features with weights and a bias term. This linear equation produces a real-valued output (z).

2. Sigmoid Transformation:
The sigmoid function (σ(z)) then takes this real-valued output (z) and maps it to a value between 0 and 1. The sigmoid function is defined as: σ(z) = 1 / (1 + exp(-z)).

3. Probability Interpretation:
The output of the sigmoid function (which is between 0 and 1) is interpreted as the probability of the instance belonging to the "positive" class (e.g., the probability of a customer clicking on an ad, or the probability of a medical diagnosis being malignant).

4. Binary Classification:
Based on a threshold (usually 0.5), the predicted probability is used to classify the instance. If the probability is above the threshold, the instance is classified as belonging to the positive class; otherwise, it's classified as belonging to the negative class.

The sigmoid function allows logistic regression to model probabilities, making it suitable for binary classification tasks. It converts the unbounded output of the linear model into a meaningful probability, which is then used to make class predictions.
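The four steps above can be sketched in a few lines of NumPy. The weights, bias, and feature vectors here are hypothetical values chosen only to make the arithmetic easy to follow; a trained model would learn them from data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vectors (illustration only)
weights = np.array([0.8, -1.2])
bias = 0.5
X = np.array([
    [2.0, 0.5],   # z = 0.8*2.0 - 1.2*0.5 + 0.5 =  1.5
    [0.0, 2.0],   # z = 0.8*0.0 - 1.2*2.0 + 0.5 = -1.9
    [1.0, 1.0],   # z = 0.8*1.0 - 1.2*1.0 + 0.5 =  0.1
])

z = X @ weights + bias                        # step 1: linear model output
probabilities = sigmoid(z)                    # steps 2-3: probability of the positive class
labels = (probabilities >= 0.5).astype(int)   # step 4: threshold at 0.5

for z_i, p_i, label in zip(z, probabilities, labels):
    print(f"z = {z_i:+.1f}  ->  p = {p_i:.3f}  ->  class {label}")
```

Because sigmoid(0) = 0.5, thresholding the probability at 0.5 is the same as checking the sign of z.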


How It’s Used in Logistic Regression:

 - Mapping linear scores to probabilities
The model first computes a linear combination of inputs:

$$z = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

 - Then applies the sigmoid to turn $z$ (which can be any real number) into a probability $p = \sigma(z)$, i.e., the estimated likelihood that the outcome is “1”


 - Interpreting probabilities as classes
These probability outputs can be:

Used directly (say, giving a 93.2% chance of being in class 1)

Thresholded (commonly at 0.5) to assign a class label: $p \geq 0.5 \rightarrow$ class 1, otherwise class 0

 - Enabling probability-based modeling
This mapping lets logistic regression interpret its outputs directly as probabilities, unlike margin-based methods such as SVMs, which do not natively produce probability estimates.


Why the Sigmoid Function is Ideal

 - Bounded and continuous: Always outputs a smooth probability between 0 and 1 — vital for modeling real-world outcomes


 - Differentiable: Facilitates gradient-based optimization (e.g., maximizing the likelihood) for parameter estimation.


 - Statistical justification: It naturally emerges when modeling the log-odds (logit)—that is, logistic regression can be framed as modeling a linear function of the log-odds of the outcome, which upon inversion yields the sigmoid
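The log-odds argument can be written out explicitly. This is the standard derivation, using $p$ for the probability of the positive class and $z$ for the linear combination of inputs:

```latex
\begin{aligned}
\log\frac{p}{1-p} &= z = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \\
\frac{p}{1-p} &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = \sigma(z)
\end{aligned}
```

The differentiability point has an equally compact form: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, which is what keeps the gradient of the log-likelihood simple.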

3. What is Regularization in Logistic Regression and why is it needed?

Ans: Regularization in logistic regression is a technique used to prevent overfitting by adding a penalty term to the loss function. This penalty discourages overly complex models, leading to better generalization performance on unseen data. Essentially, it trades a slight increase in training error for a larger decrease in generalization error.

Why is it needed?

 - Overfitting:
Without regularization, logistic regression models can become too complex and overfit the training data, meaning they perform well on the training set but poorly on new, unseen data.

 - Generalization:
Regularization helps logistic regression models generalize better to new data by simplifying the model and preventing it from learning noise or irrelevant details in the training data.

 - Feature Selection:
L1 regularization can also be used for feature selection: it shrinks the coefficients of less important features to exactly zero, effectively removing them from the model. L2 regularization shrinks coefficients towards zero but rarely makes them exactly zero.



 - Balances bias–variance tradeoff: Regularization adds slight bias while reducing variance, helping you avoid both overfitting and underfitting if tuned properly.

 - Handles multicollinearity and noisy data: It distributes the weight more sensibly among correlated features, improving robustness in the presence of redundant or noisy predictors.

 - Enhances interpretability: Especially with L1 regularization, the model becomes sparser and easier to understand and explain.

How it works:

 - Regularization adds a penalty term to the loss function, which is the function that the model tries to minimize during training.

 - This penalty term is typically based on the magnitude of the model's coefficients (weights).

 - Common types of regularization include L1 (Lasso) and L2 (Ridge) regularization, which penalize the sum of absolute values and the sum of squared values of the coefficients, respectively.

 - By adding this penalty, regularization encourages the model to find simpler solutions with smaller coefficient values, leading to better generalization.

 - Regularization helps logistic regression models to strike a balance between fitting the training data well and avoiding overfitting, resulting in more reliable and accurate predictions on new data.
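Written as a formula, the penalized loss described above looks as follows. This is a sketch under common conventions that the text does not state explicitly: $p_i = \sigma(z_i)$ is the predicted probability for sample $i$, $\lambda \ge 0$ is the regularization strength, and the intercept $\beta_0$ is assumed to be left out of the penalty.

```latex
J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big] + \lambda\, R(\beta),
\qquad
R_{\mathrm{L1}}(\beta) = \sum_{j=1}^{k} \lvert \beta_j \rvert,
\quad
R_{\mathrm{L2}}(\beta) = \sum_{j=1}^{k} \beta_j^{2}
```

Setting $\lambda = 0$ recovers ordinary maximum likelihood estimation; larger $\lambda$ pulls the coefficients more strongly towards zero.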



What Regularization Does in Logistic Regression

 - Penalizes large coefficients: Regularization adds a term to the loss function that depends on the magnitude of the model’s weights, thereby reducing the risk of the model fitting to random noise.

 - Controls model complexity: By imposing a constraint (“budget”) on how large weights can grow, regularization prevents the model from becoming too flexible and overfitting.

 - Improves stability and convergence: Especially in high-dimensional or sparse settings, regularization can help logistic regression converge more reliably and avoid unstable solutions


 Regularization is a powerful tool in logistic regression to:

1. Avoid overfitting

2. Improve model generalization

3. Stabilize learning in high-dimensional or noisy settings

4. Enable feature selection and interpretability (especially via L1)

The choice between L1, L2, or Elastic Net, and the tuning of the regularization strength (often via λ or its inverse, the parameter C in libraries like scikit-learn), should be guided by your data characteristics and modeling goals.
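The sparsity effect of L1 compared with L2 can be checked empirically. The following is a minimal sketch, assuming the Breast Cancer dataset from scikit-learn and an arbitrarily chosen C=0.1; the helper name `count_zero_coefficients` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def count_zero_coefficients(penalty, C):
    """Fit a scaled logistic regression and count coefficients that are exactly zero."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, C=C, solver="liblinear", random_state=42),
    )
    model.fit(X, y)
    coefficients = model.named_steps["logisticregression"].coef_[0]
    return int(np.sum(coefficients == 0.0))

# Same (fairly strong) regularization strength for both penalties
l1_zeros = count_zero_coefficients("l1", C=0.1)
l2_zeros = count_zero_coefficients("l2", C=0.1)

print(f"L1 penalty: {l1_zeros} of {X.shape[1]} coefficients are exactly zero")
print(f"L2 penalty: {l2_zeros} of {X.shape[1]} coefficients are exactly zero")
```

L1 is expected to zero out a number of the 30 coefficients, while L2 only shrinks them, which is why L1 is the usual choice when feature selection is a goal.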

4. What are some common evaluation metrics for classification models, and
why are they important?


Ans: Common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and AUC-ROC. These metrics help assess how well a model distinguishes between different classes, particularly in cases of imbalanced datasets, and guide model selection and improvement.

Here's a breakdown of these metrics:

1. Accuracy:

 - Definition:
The proportion of correctly classified instances out of the total number of instances.

 - Formula:
(True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

 - Importance:
While simple and widely used, accuracy can be misleading for imbalanced datasets, where one class has significantly more instances than the other.

2. Precision:

 - Definition: The proportion of correctly predicted positive instances out of all instances predicted as positive.

 - Formula: True Positives / (True Positives + False Positives)

 - Importance: Useful when the cost of a false positive is high. For example, in spam detection, we want to minimize the number of legitimate emails wrongly classified as spam.

3. Recall (also known as Sensitivity or True Positive Rate):

 - Definition: The proportion of correctly predicted positive instances out of all actual positive instances.

 - Formula: True Positives / (True Positives + False Negatives)

 - Importance: Useful when the cost of a false negative is high. For example, in medical diagnosis, we want to minimize the number of infected patients wrongly classified as not infected.

4. F1-score:

 - Definition: The harmonic mean of precision and recall, providing a balanced measure of both.

 - Formula: 2 * (Precision * Recall) / (Precision + Recall)

 - Importance: Useful when you need a balance between precision and recall, especially in situations where both false positives and false negatives are undesirable.

5. AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

 - Definition:
Measures the model's ability to distinguish between classes across various probability thresholds.

 - Importance:
Effective for evaluating binary classification models, especially when dealing with imbalanced datasets. A higher AUC-ROC indicates better performance.
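The formulas above can be applied directly to confusion-matrix counts. The counts in this sketch are hypothetical (1,000 instances, 6% of them positive) and were picked to show how accuracy can look strong while recall is modest.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem (60 positives in 1,000)
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: a model that always predicts the negative class
always_negative_accuracy = (tn + fp) / (tp + tn + fp + fn)

print(f"Accuracy:  {accuracy:.3f}")    # 0.970
print(f"Precision: {precision:.3f}")   # 0.800
print(f"Recall:    {recall:.3f}")      # 0.667
print(f"F1-score:  {f1:.3f}")          # 0.727
print(f"Accuracy of always predicting negative: {always_negative_accuracy:.3f}")  # 0.940
```

The model's 97% accuracy is only three points above the trivial baseline, whereas recall shows that a third of the actual positives are missed.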

Why are these metrics important?

 - Model Evaluation:
They provide a quantitative way to assess the performance of a classification model, helping to identify strengths and weaknesses.

 - Model Selection:
By comparing different models based on these metrics, you can choose the one that best suits the specific problem and data characteristics.

 - Parameter Tuning:
Metrics guide the optimization process, helping to fine-tune model parameters for improved performance.

 - Imbalanced Datasets:
They are crucial for evaluating models on datasets where the classes are not equally represented.



More Advanced Metrics

 - Log Loss / Cross-Entropy
Measures how closely the predicted probabilities match the actual outcomes; confident but wrong predictions are penalized heavily. A lower score indicates better probability estimates, which makes it particularly useful when the probabilities themselves matter.


 - Cohen’s Kappa
Accounts for agreement occurring by chance, offering a more nuanced evaluation, especially for imbalanced datasets.


 - Matthews Correlation Coefficient (MCC)
Incorporates all elements of the confusion matrix and is generally regarded as one of the most informative single-value metrics, especially in the presence of imbalanced classes.


 - Other metrics:
Balanced accuracy, F-beta, expected cost, Brier score, and calibration loss also offer value depending on the specific application and output type.
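Several of these metrics are available in `sklearn.metrics`. The sketch below uses a small hypothetical set of labels and predicted probabilities; the numbers are invented for illustration and do not come from any model in this assignment.

```python
import numpy as np
from sklearn.metrics import (
    brier_score_loss,
    cohen_kappa_score,
    log_loss,
    matthews_corrcoef,
    roc_auc_score,
)

# Hypothetical ground truth (3 positives out of 10) and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.15, 0.60, 0.25, 0.80, 0.40, 0.90])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at the usual 0.5 threshold

# Probability-based metrics take y_prob
loss = log_loss(y_true, y_prob)
brier = brier_score_loss(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

# Label-based metrics take y_pred
kappa = cohen_kappa_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Log loss: {loss:.3f}, Brier score: {brier:.3f}, ROC AUC: {auc:.3f}")
print(f"Cohen's kappa: {kappa:.3f}, MCC: {mcc:.3f}")
```

Note the split between metrics that score probabilities (log loss, Brier score, ROC AUC) and metrics that score hard labels (kappa, MCC); the latter depend on the chosen threshold.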


Selecting the appropriate evaluation metrics and understanding their implications is vital for building effective and reliable classification models.

In [3]:
# 5. Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification # For creating a sample dataset

# 1. Load the dataset into a Pandas DataFrame
#    Replace this with pd.read_csv('your_file.csv') for your own data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

# Define features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 2. Split the dataset into training and testing sets
#    test_size: proportion of the dataset to include in the test split
#    random_state: ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Logistic Regression model
#    random_state: only used by the 'sag', 'saga', and 'liblinear' solvers; harmless with the default solver
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = model.predict(X_test)

# 5. Print the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.4f}")

Accuracy of the Logistic Regression model: 0.8300


In [4]:
# 6. Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a sample dataset (Breast Cancer dataset)
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Logistic Regression model with L2 regularization
# The 'penalty' parameter is set to 'l2' for Ridge regularization.
# The 'C' parameter is the inverse of regularization strength; smaller values mean stronger regularization.
model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', random_state=42)

# Train the model
model.fit(X_train, y_train)

# Print the model coefficients
print("Model Coefficients:")
for i, coef in enumerate(model.coef_[0]):
    print(f"Feature {i}: {coef:.4f}")
print(f"Intercept: {model.intercept_[0]:.4f}")

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")


Model Coefficients:
Feature 0: 2.1325
Feature 1: 0.1528
Feature 2: -0.1451
Feature 3: -0.0008
Feature 4: -0.1426
Feature 5: -0.4156
Feature 6: -0.6519
Feature 7: -0.3445
Feature 8: -0.2076
Feature 9: -0.0298
Feature 10: -0.0500
Feature 11: 1.4430
Feature 12: -0.3039
Feature 13: -0.0726
Feature 14: -0.0162
Feature 15: -0.0019
Feature 16: -0.0449
Feature 17: -0.0377
Feature 18: -0.0418
Feature 19: 0.0056
Feature 20: 1.2321
Feature 21: -0.4046
Feature 22: -0.0362
Feature 23: -0.0271
Feature 24: -0.2626
Feature 25: -1.2090
Feature 26: -1.6180
Feature 27: -0.6153
Feature 28: -0.7428
Feature 29: -0.1170
Intercept: 0.4085

Model Accuracy: 0.9561


In [6]:
# 7. Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load a multiclass dataset — e.g., Iris
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Instantiate and train logistic regression with One-vs-Rest strategy
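# Note: multi_class is deprecated in recent scikit-learn releases (1.5 and later);
# wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the suggested replacement.
model = LogisticRegression(
    multi_class='ovr',  # One-vs-Rest approach
    solver='liblinear',  # Supports 'ovr' for multiclass
    max_iter=1000
)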
model.fit(X_train, y_train)

# 4. Make predictions
y_pred = model.predict(X_test)

# 5. Print classification report (precision, recall, F1-score per class)
print("Classification Report:\n",
      classification_report(y_test, y_pred, target_names=iris.target_names))




Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





In [8]:
# 8. Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.
# (Use Dataset from sklearn package)


import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load a dataset from sklearn (e.g., Iris dataset)
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Logistic Regression model
logistic_regression = LogisticRegression(max_iter=1000, solver='liblinear') # 'liblinear' supports both 'l1' and 'l2' penalties

# Define the parameter grid for C and penalty
param_grid = {
    'C': np.logspace(-4, 4, 9),  # Example C values from 10^-4 to 10^4
    'penalty': ['l1', 'l2']     # L1 and L2 regularization
}

# Initialize GridSearchCV
# cv=5 for 5-fold cross-validation
# scoring='accuracy' to evaluate performance based on accuracy
grid_search = GridSearchCV(logistic_regression, param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print(f"Best parameters: {grid_search.best_params_}")

# Print the best validation accuracy achieved during the grid search
print(f"Best validation accuracy: {grid_search.best_score_:.4f}")

# Optionally, evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy with best model: {test_accuracy:.4f}")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Best parameters: {'C': np.float64(1000.0), 'penalty': 'l2'}
Best validation accuracy: 0.9667
Test set accuracy with best model: 1.0000


In [10]:
# 9. Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a dataset (e.g., Breast Cancer dataset)
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Model without scaling ---
print("--- Model without scaling ---")
logistic_regression_unscaled = LogisticRegression(max_iter=1000, random_state=42)
logistic_regression_unscaled.fit(X_train, y_train)
y_pred_unscaled = logistic_regression_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"Accuracy without scaling: {accuracy_unscaled:.4f}")

# --- Model with scaling ---
print("\n--- Model with scaling (StandardScaler) ---")

# Initialize StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on scaled data
logistic_regression_scaled = LogisticRegression(max_iter=1000, random_state=42)
logistic_regression_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = logistic_regression_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled:.4f}")

# Compare accuracies
print(f"\nAccuracy difference (scaled - unscaled): {accuracy_scaled - accuracy_unscaled:.4f}")

--- Model without scaling ---
Accuracy without scaling: 0.9708

--- Model with scaling (StandardScaler) ---
Accuracy with scaling: 0.9825

Accuracy difference (scaled - unscaled): 0.0117


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model, including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Ans: Building a Logistic Regression model for predicting marketing campaign responders with an imbalanced dataset requires a systematic approach.

 1. Data Handling:

 - Exploratory Data Analysis (EDA):
Understand the features, identify potential outliers, and assess missing values.

 - Feature Engineering:
Create new features from existing ones that might be more predictive (e.g., customer lifetime value, recency of last purchase, frequency of purchases).

 - Handling Missing Values:
Impute missing values using appropriate strategies like mean, median, mode imputation, or more advanced methods like K-Nearest Neighbors (KNN) imputation.

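 2. Feature Scaling:
Regularized Logistic Regression is sensitive to feature scales: the penalty acts on coefficient magnitudes, and the solvers converge more reliably on scaled inputs. Apply Standardization (Z-score normalization) or Min-Max Scaling so that all features are on comparable scales.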

 - Python

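from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# X is assumed to hold the training features only: fit the scaler on training data
# and reuse it with scaler.transform() on validation/test data to avoid leakage
X_scaled = scaler.fit_transform(X)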

 3. Balancing Classes:
Given the 5% response rate, addressing class imbalance is crucial.

 - Oversampling the minority class: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic samples of the minority class, increasing its representation.

 - Undersampling the majority class: Randomly remove samples from the majority class to balance the dataset.

 - Class Weights: Assign higher weights to the minority class during model training. This tells the model to pay more attention to correctly classifying the minority class.

 - Python

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
# Resample the training data only; the test set must keep the original 5% response rate
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

 4. Hyperparameter Tuning:
Use techniques like Grid Search or Randomized Search with cross-validation to find the optimal hyperparameters for the Logistic Regression model.
Key hyperparameters to tune include C (the inverse of the regularization strength), the penalty type, and the solver.

 - Python

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'solver': ['liblinear', 'lbfgs']}
# Caveat: resampling before cross-validation lets synthetic samples reach the validation
# folds and inflates the scores; applying SMOTE inside an imblearn Pipeline avoids this
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, scoring='f1')
grid_search.fit(X_resampled, y_resampled)
best_model = grid_search.best_estimator_

 5. Model Evaluation:

 - Metrics for Imbalanced Data:
Precision, Recall, and F1-score: Focus on these for the minority class (responders), as accuracy can be misleading.

 - ROC AUC (Area Under the Receiver Operating Characteristic Curve): Provides a threshold-independent measure of the model's ability to distinguish between classes.

 - Precision-Recall Curve: Useful for visualizing the trade-off between precision and recall at different thresholds.

 - Cross-validation:
Use techniques like stratified K-fold cross-validation, which keeps the 5% response rate in every fold, to get a robust estimate of model performance and to detect overfitting.

 - Business Context:
Evaluate the model's performance in terms of business impact, considering the cost of false positives (sending campaigns to non-responders) and false negatives (missing potential responders). A high recall for responders might be prioritized even if it means a slightly lower precision.
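The steps above can be combined into one leakage-free workflow. The sketch below makes several assumptions that are not in the original answer: it uses a synthetic dataset from `make_classification` as a stand-in for the campaign data, it balances classes with `class_weight='balanced'` rather than SMOTE (to stay within scikit-learn), and the two cost values are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_predict, train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=6,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so that both splits keep the original response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling sits inside the pipeline, so it is re-fit on each training fold only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tune C with stratified folds; average precision summarizes the PR curve
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    pipeline, {"clf__C": [0.01, 0.1, 1, 10]}, cv=cv, scoring="average_precision"
)
search.fit(X_train, y_train)

# Pick a decision threshold from out-of-fold training predictions, not from the test set.
# Hypothetical costs: contacting a non-responder costs 1, missing a responder costs 10.
cost_fp, cost_fn = 1.0, 10.0
oof_prob = cross_val_predict(
    search.best_estimator_, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
costs = [
    cost_fp * np.sum((oof_prob >= t) & (y_train == 0))
    + cost_fn * np.sum((oof_prob < t) & (y_train == 1))
    for t in thresholds
]
best_threshold = thresholds[int(np.argmin(costs))]

# Final evaluation on the untouched test set
test_prob = search.predict_proba(X_test)[:, 1]
test_pred = (test_prob >= best_threshold).astype(int)
pr_auc = average_precision_score(y_test, test_prob)

print(f"Best C: {search.best_params_['clf__C']}, chosen threshold: {best_threshold:.2f}")
print(f"Test average precision: {pr_auc:.3f} (base rate: {y_test.mean():.3f})")
print(classification_report(y_test, test_pred, target_names=["no response", "response"]))
```

Average precision is compared against the base rate because a random ranking scores roughly the positive-class frequency; the cost ratio is the lever the business would adjust to trade recall against precision.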