Q1. What is Logistic Regression, and how does it differ from Linear Regression?
Ans. Logistic Regression vs. Linear Regression in Python
1. Logistic Regression
Logistic Regression is a classification algorithm used to predict categorical outcomes (e.g., binary classification: 0 or 1, spam or not spam).
It applies the sigmoid function (or logistic function) to map predictions to probabilities between 0 and 1.
The decision boundary is determined using a threshold (e.g., 0.5 for binary classification).
It minimizes the log loss (cross-entropy loss) instead of mean squared error (MSE).
2. Linear Regression
Linear Regression is a regression algorithm used for predicting continuous values (e.g., house prices, temperature).
It assumes a linear relationship between independent and dependent variables.
It minimizes the Mean Squared Error (MSE).
The output can be any real number, unlike Logistic Regression, which outputs probabilities.
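To make the contrast concrete, here is a minimal sketch in plain Python. The parameters `theta0` and `theta1` are hypothetical values chosen only for illustration; they are not fitted to any data. The same linear score is used twice: Linear Regression would report it directly, while Logistic Regression passes it through the sigmoid and then applies the 0.5 threshold.

```python
import math

def linear_score(x, theta0, theta1):
    """Linear combination theta0 + theta1 * x (the raw Linear Regression output)."""
    return theta0 + theta1 * x

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters, chosen only for illustration
theta0, theta1 = -4.0, 2.0

for x in [0.0, 2.0, 5.0]:
    z = linear_score(x, theta0, theta1)   # unbounded real value
    p = sigmoid(z)                        # probability of class 1
    label = 1 if p >= 0.5 else 0          # thresholded class label
    print(f"x={x}: linear output={z:+.1f}, probability={p:.3f}, class={label}")
```

The linear output ranges over any real value, while the probability always stays between 0 and 1.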


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset (only two classes for binary classification)
iris = load_iris()
X, y = iris.data[:100], iris.target[:100]  # Selecting two classes (0 and 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict
predictions = log_reg.predict(X_test)
print(predictions)


Q2. What is the mathematical equation of Logistic Regression?
Ans. Mathematical Equation of Logistic Regression
Logistic Regression is based on the logistic (sigmoid) function, which maps any real number to a value between 0 and 1.

1. Hypothesis Function
The hypothesis function for Logistic Regression is:

$$h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)}}$$

or simply,

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}}$$

where:

$h_\theta(x)$ is the predicted probability (output between 0 and 1).
$\theta$ (theta) represents the model parameters (weights).
$X$ is the input feature vector.
$e$ is Euler's number (~2.718).
2. Decision Rule
To classify an input, we set a threshold (typically 0.5) for the probability:

$$y = \begin{cases} 1, & \text{if } h_\theta(x) \ge 0.5 \\ 0, & \text{if } h_\theta(x) < 0.5 \end{cases}$$

3. Cost Function (Log Loss)
Instead of using Mean Squared Error (like in Linear Regression), Logistic Regression minimizes the log loss (cross-entropy loss):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
where:

$J(\theta)$ is the cost function.
$m$ is the number of training examples.
$y^{(i)}$ is the actual class label (0 or 1).
$h_\theta(x^{(i)})$ is the predicted probability.
This function penalizes confident wrong predictions heavily: the loss grows without bound as the predicted probability of the true class approaches 0.

4. Optimization (Gradient Descent Update Rule)
To minimize the cost function, we update parameters using Gradient Descent:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where:

$\alpha$ is the learning rate.
The term $(h_\theta(x^{(i)}) - y^{(i)})$ is the error.

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)  # Initialize weights

    for _ in range(epochs):
        z = np.dot(X, theta)
        h = sigmoid(z)
        gradient = (1/m) * np.dot(X.T, (h - y))
        theta -= lr * gradient  # Gradient descent update

    return theta

# Example usage
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Features
y = np.array([0, 0, 1, 1])  # Labels

theta = logistic_regression(X, y)
print("Optimized Parameters:", theta)
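The hypothesis function includes an intercept $\theta_0$, but the cell above passes only the raw features, so no intercept is learned. A minimal sketch of one way to add it, assuming a column of ones is prepended to the feature matrix. The helper names (`add_intercept`, `fit_logistic`, `predict`) and the toy data are made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    """Prepend a column of ones so theta[0] plays the role of theta_0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the log loss, with an intercept term."""
    Xb = add_intercept(X)
    m = Xb.shape[0]
    theta = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        h = sigmoid(Xb @ theta)
        theta -= lr * (Xb.T @ (h - y)) / m   # same update rule as in the text
    return theta

def predict(X, theta, threshold=0.5):
    """Apply the decision rule: class 1 when the probability reaches the threshold."""
    return (sigmoid(add_intercept(X) @ theta) >= threshold).astype(int)

# Toy one-feature data, made up for illustration
X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 0, 1, 1])

theta_demo = fit_logistic(X_demo, y_demo)
print("theta (intercept first):", theta_demo)
print("predictions:", predict(X_demo, theta_demo))
```

Keeping the intercept as an extra column means the update rule stays a single matrix expression.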


Q3. Why do we use the Sigmoid function in Logistic Regression?
Ans. Why Do We Use the Sigmoid Function in Logistic Regression?
The sigmoid function is used in Logistic Regression because it converts any real-valued input into a probability between 0 and 1, making it ideal for classification tasks.

1. Sigmoid Function Definition
The sigmoid function is mathematically defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where:

$z = \theta^T X$ (linear combination of weights and features).
$e$ is Euler's number (~2.718).
The output is always between 0 and 1, which can be interpreted as a probability.
Reasons for Using the Sigmoid Function
1. Converts Any Input into a Probability
Unlike Linear Regression, which outputs any real number, the sigmoid function bounds the output between 0 and 1.
This makes it perfect for binary classification (e.g., spam detection, disease prediction).
2. Maps Predictions to Class Labels
We can use a threshold (typically 0.5) to classify outputs:
$$y = \begin{cases} 1, & \text{if } \sigma(z) \ge 0.5 \\ 0, & \text{if } \sigma(z) < 0.5 \end{cases}$$

This ensures the model outputs discrete class labels rather than continuous values.
3. Smooth and Differentiable Function
The sigmoid function is smooth and has a well-defined derivative:
$$\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$$
This allows us to efficiently optimize the model using gradient descent.
4. Helps in Logistic Regression’s Cost Function
The log loss (cross-entropy loss) function in Logistic Regression is based on the sigmoid function.
Combined with the sigmoid, the log loss is convex in the parameters, so gradient descent does not get trapped in local minima.
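The derivative identity for the sigmoid can be checked numerically. The sketch below compares the analytic form $\sigma(z)(1 - \sigma(z))$ with a central finite difference. The helper names are chosen for this illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

for z in [-3.0, 0.0, 2.5]:
    print(f"z={z}: analytic={sigmoid_grad(z):.6f}, numeric={numeric_grad(sigmoid, z):.6f}")
```

The derivative peaks at 0.25 when z = 0 and shrinks toward 0 for large |z|, which is why gradient updates slow down when predictions saturate.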

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Generate values for visualization
z = np.linspace(-10, 10, 100)
sigmoid_values = sigmoid(z)

# Plot the sigmoid function
plt.plot(z, sigmoid_values, label="Sigmoid Function")
plt.xlabel("z")
plt.ylabel("σ(z)")
plt.title("Sigmoid Function")
plt.legend()
plt.grid()
plt.show()


Q4. What is the cost function of Logistic Regression?
Ans. Cost Function of Logistic Regression
In Logistic Regression, we use the Log Loss (Cross-Entropy Loss) as the cost function instead of Mean Squared Error (MSE). This is because MSE applied to the sigmoid output is non-convex in the parameters, which makes optimization difficult.

Mathematical Formulation
For a single training example $(x^{(i)}, y^{(i)})$, the hypothesis function is:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Since Logistic Regression predicts probabilities, we need a cost function that penalizes incorrect predictions while being convex for optimization.

The log loss function is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
where:

$m$ = number of training examples.
$y^{(i)}$ = actual class label (0 or 1).
$h_\theta(x^{(i)})$ = predicted probability (output of the sigmoid function).
Intuition Behind the Cost Function
The cost function is designed to:

Punish wrong predictions heavily:
If the actual class is $y = 1$ but the model predicts a small probability, the term $\log h_\theta(x)$ is a large negative value, so the loss (its negation) is large.
If $y = 0$ but the model predicts a probability close to 1, the term $\log(1 - h_\theta(x))$ is likewise a large negative value, again giving a large loss.
Encourage correct predictions:
If $y = 1$ and $h_\theta(x) \approx 1$, then $\log h_\theta(x) \approx 0$, leading to a small loss.
If $y = 0$ and $h_\theta(x) \approx 0$, then $\log(1 - h_\theta(x)) \approx 0$, again resulting in a small loss.
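A few worked numbers make this intuition concrete. The sketch below evaluates the per-example log loss for hand-picked probabilities. The function name `example_loss` and the cases are illustrative only.

```python
import math

def example_loss(y, p):
    """Log loss for one example with true label y (0 or 1) and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

cases = [
    (1, 0.9),   # correct and confident
    (1, 0.5),   # undecided
    (1, 0.1),   # wrong and confident
    (0, 0.9),   # wrong and confident (true label is 0)
]
for y, p in cases:
    print(f"y={y}, p={p}: loss={example_loss(y, p):.3f}")
```

The confident correct prediction costs about 0.105, the undecided one about 0.693, and each confident wrong prediction about 2.303, roughly twenty times the cost of the confident correct one.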
Gradient Descent Optimization
To minimize the cost function, we use gradient descent with the following update rule:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where:

$\alpha$ is the learning rate.
$(h_\theta(x^{(i)}) - y^{(i)})$ represents the error.


In [None]:
import numpy as np

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compute logistic regression cost function
def compute_cost(X, y, theta):
    m = len(y)
    h = sigmoid(X @ theta)  # Compute predictions
    cost = (-1 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))  # Log loss
    return cost

# Example usage
X = np.array([[1, 2], [2, 3], [3, 4]])  # Feature matrix (without bias term)
y = np.array([0, 1, 1])  # Labels
theta = np.zeros(X.shape[1])  # Initialize parameters

cost = compute_cost(X, y, theta)
print("Initial Cost:", cost)
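One practical caveat: if a predicted probability rounds to exactly 0 or 1 in floating point, `np.log` returns `-inf` and the cost becomes infinite or NaN. A common workaround, sketched here as an assumption rather than as part of the original cell, is to clip the probabilities slightly away from the endpoints. The name `compute_cost_stable` and the `eps` value are illustrative.

```python
import numpy as np

def compute_cost_stable(X, y, theta, eps=1e-15):
    """Log loss with predicted probabilities clipped away from exactly 0 and 1."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X_demo = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
y_demo = np.array([0, 1, 1])

# At theta = 0 every prediction is 0.5, so the cost equals log(2)
print(compute_cost_stable(X_demo, y_demo, np.zeros(2)))

# Very large weights saturate the sigmoid; the clipped cost stays finite
print(compute_cost_stable(X_demo, y_demo, np.array([50.0, 50.0])))
```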


Q5. What is Regularization in Logistic Regression? Why is it needed?
Ans. Regularization in Logistic Regression: Why Is It Needed?
1. What Is Regularization?
Regularization is a technique used in Logistic Regression (and other machine learning models) to prevent overfitting by adding a penalty term to the cost function.

In Logistic Regression, two common types of regularization are:

L1 Regularization (Lasso Regression)
L2 Regularization (Ridge Regression)
These methods add a penalty to the model’s parameters (weights/coefficients) to prevent excessively large values, which can lead to overfitting.

2. Why Is Regularization Needed?
👉 Overfitting Problem:

Without regularization, Logistic Regression may learn a model that fits the training data too well but performs poorly on new data.
This happens when the model assigns very large weights to some features, making it highly sensitive to noise in the data.
👉 Regularization Solves This By:

Reducing the impact of less important features.
Keeping model weights small and stable.
Improving the generalization of the model on unseen data.
3. Types of Regularization in Logistic Regression
(i) L2 Regularization (Ridge, Default in Scikit-learn)
Adds the sum of squared weights to the cost function.
Prevents weights from becoming too large by penalizing them.
Formula:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

where:

$\lambda$ (regularization parameter) controls the penalty strength.
$\sum_j \theta_j^2$ is the sum of squared weights.
🔹 Effect: Shrinks coefficients but does not eliminate them completely.

(ii) L1 Regularization (Lasso)
Adds the sum of absolute weights to the cost function.
Can force some weights to be exactly zero, leading to feature selection.
Formula:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{m} \sum_{j=1}^{n} |\theta_j|$$
🔹 Effect: Can remove less important features (useful for feature selection).
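The two penalty terms are easy to compute by hand. The sketch below follows the formulas above, where the sums run over j = 1..n and so exclude the intercept. It assumes the intercept is stored at index 0 of the weight list. The weights, `lam`, and `m` are made-up values.

```python
def l2_penalty(theta, lam, m):
    """(lambda / 2m) * sum of squared weights, skipping the intercept theta[0]."""
    return lam / (2 * m) * sum(t ** 2 for t in theta[1:])

def l1_penalty(theta, lam, m):
    """(lambda / m) * sum of absolute weights, skipping the intercept theta[0]."""
    return lam / m * sum(abs(t) for t in theta[1:])

# Made-up weights: intercept 0.5, then three feature weights
theta_demo = [0.5, 3.0, -4.0, 0.0]
lam, m = 1.0, 10

print("L2 penalty:", l2_penalty(theta_demo, lam, m))   # (1/20) * (9 + 16 + 0)
print("L1 penalty:", l1_penalty(theta_demo, lam, m))   # (1/10) * (3 + 4 + 0)
```

The L2 term grows quadratically with each weight, while the L1 term grows linearly, so L1 keeps pushing small weights all the way to zero.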

4. Regularization in Python (Scikit-learn Example)
By default, LogisticRegression in scikit-learn applies L2 regularization.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, (iris.target == 2).astype(int)  # Convert to binary classification

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression with L2 Regularization (default)
log_reg_l2 = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
log_reg_l2.fit(X_train, y_train)

# Logistic Regression with L1 Regularization
log_reg_l1 = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
log_reg_l1.fit(X_train, y_train)

print("L2 Regularization Coefficients:", log_reg_l2.coef_)
print("L1 Regularization Coefficients:", log_reg_l1.coef_)


Q6. Explain the difference between Lasso, Ridge, and Elastic Net regression.
Ans. Difference Between Lasso, Ridge, and Elastic Net Regression
Regularization techniques help prevent overfitting in regression models by adding a penalty term to the loss function. The three main types of regularization are Lasso (L1), Ridge (L2), and Elastic Net (L1 + L2).

1. Ridge Regression (L2 Regularization)
🔹 Adds the sum of squared weights as a penalty
🔹 Shrinks coefficients but does not set them to zero
🔹 Works well when all features are useful

Cost Function:
$$J(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

where:

$\lambda$ is the regularization parameter (higher $\lambda$ means stronger penalty).
The second term $\sum_j \theta_j^2$ penalizes large weights.
✔ Keeps all features but shrinks them
❌ Does not perform feature selection

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha = λ (regularization strength)
ridge.fit(X_train, y_train)


2. Lasso Regression (L1 Regularization)
🔹 Adds the sum of absolute weights as a penalty
🔹 Shrinks some coefficients to exactly zero (performs feature selection)
🔹 Useful when some features are irrelevant

Cost Function:
$$J(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right]^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$
✔ Feature selection: Eliminates irrelevant features
❌ May struggle when features are highly correlated

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)


3. Elastic Net Regression (L1 + L2 Regularization)
🔹 Combines Ridge (L2) and Lasso (L1) penalties
🔹 Shrinks coefficients and performs feature selection
🔹 Works well when features are highly correlated

Cost Function:
$$J(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right]^2 + \lambda_1 \sum_{j=1}^{n} |\theta_j| + \lambda_2 \sum_{j=1}^{n} \theta_j^2$$

✔ Balances feature selection and coefficient shrinkage
✔ Handles correlated features better than Lasso
❌ More complex to tune (requires two parameters: $\lambda_1$ and $\lambda_2$)

In [None]:
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)  # l1_ratio controls balance between L1 & L2
elastic_net.fit(X_train, y_train)
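scikit-learn's ElasticNet does not take the two penalty weights directly. It takes `alpha` and `l1_ratio`. Its documentation describes the penalty as `a * ||w||_1 + 0.5 * b * ||w||_2^2` with `alpha = a + b` and `l1_ratio = a / (a + b)`. The helper below, a hypothetical name for this sketch, performs that conversion. Note that scikit-learn also scales the squared-error term by 1/(2 * n_samples), so `a` and `b` are not numerically identical to the $\lambda_1$ and $\lambda_2$ of the unscaled formula above.

```python
def to_sklearn_params(a, b):
    """Convert penalty a * ||w||_1 + 0.5 * b * ||w||_2^2 into (alpha, l1_ratio)."""
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio

# Equal L1 and L2 weights give the alpha=1.0, l1_ratio=0.5 setting used above
print(to_sklearn_params(0.5, 0.5))

# A mostly-L1 penalty
print(to_sklearn_params(0.9, 0.1))
```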


Q7. When should we use Elastic Net instead of Lasso or Ridge?
Ans. When Should We Use Elastic Net Instead of Lasso or Ridge?
Elastic Net is a combination of Ridge (L2) and Lasso (L1) regularization. It is most useful when features are highly correlated and we need both shrinkage and feature selection.

Scenarios Where Elastic Net Is Preferred
✅ 1. When Features Are Highly Correlated

Lasso struggles with highly correlated features and arbitrarily selects one while ignoring the others.
Ridge keeps all features but does not eliminate any.
Elastic Net tends to keep or drop correlated features together (the grouping effect) by balancing L1 and L2 regularization.
✅ 2. When You Need Both Feature Selection and Shrinkage

Lasso can set some coefficients to exactly zero (feature selection), but it may be unstable when features are correlated.
Ridge shrinks coefficients but never eliminates features.
Elastic Net combines both:
Selects important features (like Lasso).
Shrinks the rest (like Ridge).
✅ 3. When You Have More Features Than Samples (High-Dimensional Data)

In cases like genomics, text classification, or finance, where the number of features $p$ exceeds the number of samples $n$, Lasso may select too few features (it can keep at most $n$ of them), while Ridge keeps all of them.
Elastic Net balances both effects, leading to better generalization.
✅ 4. When Lasso Selects Too Few Features

Lasso tends to pick only one feature from a group of correlated features, which may not always be ideal.
Elastic Net allows for more distributed feature selection, ensuring important correlated variables are included.

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Elastic Net with 50% L1 and 50% L2 regularization
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)  # l1_ratio=0.5 balances Ridge & Lasso
elastic_net.fit(X_train, y_train)

# Print coefficients
print("Elastic Net Coefficients:", elastic_net.coef_)


Q8. What is the impact of the regularization parameter (λ) in Logistic Regression?
Ans. Impact of the Regularization Parameter (λ) in Logistic Regression
The regularization parameter (λ) in Logistic Regression controls the strength of the penalty applied to the model’s coefficients. It determines the balance between fitting the training data well and preventing overfitting.

1. Effect of λ on Model Performance
Small λ (Weak Regularization) → Overfitting

The model gives high importance to all features.
Can lead to overfitting, meaning it performs well on training data but poorly on unseen data.
Coefficients $\theta$ can take large values.
Large λ (Strong Regularization) → Underfitting

The penalty becomes stronger, forcing weights to be small or zero.
Can lead to underfitting, where the model is too simple and fails to capture important patterns.
The model may assign too little importance to important features, reducing accuracy.
Optimal λ (Balanced Regularization)

Finds a trade-off between variance and bias.
Prevents overfitting while keeping model performance high.
2. Regularization Types & Effect of λ
In Logistic Regression, we typically use:

L1 Regularization (Lasso): Larger λ forces more coefficients to exactly zero (feature selection).
L2 Regularization (Ridge): Larger λ shrinks coefficients toward zero but doesn’t eliminate them.
3. Mathematical Formulation
The regularized cost function in Logistic Regression is:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

The first term is the standard log loss function.
The second term adds a penalty to large coefficients.
The strength of this penalty is controlled by λ.
4. Visualizing the Impact of λ
Effect of Different λ Values
Regularization Strength (λ)	Effect on Model	Effect on Coefficients (θ)
λ = 0 (No Regularization)	Overfits, learns noise	Large, unconstrained coefficients
Small λ	Slight regularization, still flexible	Some shrinkage, but features remain
Optimal λ	Best balance, prevents overfitting	Coefficients are meaningful
Large λ	Underfits, too simple	Very small or zero coefficients
5. Python Example: Tuning λ in Logistic Regression
We use the C parameter in sklearn, where $C = \frac{1}{\lambda}$.

Higher C → Less regularization
Lower C → Stronger regularization

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Try different values of λ (in sklearn, we adjust C = 1/λ)
for C in [100, 1, 0.01]:  # Larger C = Smaller λ, Smaller C = Larger λ
    log_reg = LogisticRegression(C=C, penalty='l2', solver='liblinear')
    log_reg.fit(X_train, y_train)
    print(f"C={C}, Accuracy: {log_reg.score(X_test, y_test):.4f}")
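Accuracy alone does not show what the penalty does to the weights. As a complementary sketch, the loop below records the L2 norm of the fitted coefficients for the same three values of C. It uses the default solver with a raised `max_iter`, which is a choice made for this illustration, and the variable names are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

norms = {}
for C in [100, 1, 0.01]:
    clf = LogisticRegression(C=C, max_iter=1000)   # L2 penalty is the default
    clf.fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))    # overall size of the weight vector
    print(f"C={C}: ||theta|| = {norms[C]:.3f}")
```

The coefficient norm should shrink as C decreases, which is the same as λ increasing.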


Q9. What are the key assumptions of Logistic Regression?
Ans. Key Assumptions of Logistic Regression
Although Logistic Regression is a powerful and widely used classification algorithm, it relies on several key assumptions for optimal performance. Here are the main assumptions:

1. The Dependent Variable Must Be Binary or Categorical
Logistic Regression is designed for classification problems, typically binary classification (e.g., Spam vs. Not Spam).
It can be extended to multiclass classification using one-vs-rest (OvR) or softmax regression, but the standard logistic regression assumes a binary output:
$y \in \{0, 1\}$
2. Independence of Observations
Each data point should be independent of the others.
Logistic Regression does not work well on correlated observations (e.g., time-series data) unless modifications like time-series splitting or feature engineering are applied.
🔹 Example Violation:

If you predict whether a patient has a disease, but some patients appear multiple times in the dataset, their observations are correlated, violating this assumption.
3. No Multicollinearity Between Independent Variables
Multicollinearity occurs when two or more features are highly correlated, meaning they provide redundant information.
Logistic Regression assumes that independent variables (features) are not too strongly correlated with each other.
🔹 How to Detect Multicollinearity?

Variance Inflation Factor (VIF): If VIF > 10, the feature is highly collinear.
Correlation Matrix: A high correlation (>0.8) between two features suggests multicollinearity.
🔹 Solution:

Remove highly correlated features.
Use Principal Component Analysis (PCA) or L1 Regularization (Lasso) to reduce multicollinearity.
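Both detection checks can be sketched with NumPy alone. The example relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The synthetic features and the function name `vif_from_correlation` are made up for illustration.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated feature
X_demo = np.column_stack([x1, x2, x3])

print("Correlation matrix:")
print(np.round(np.corrcoef(X_demo, rowvar=False), 2))
print("VIF per feature:", np.round(vif_from_correlation(X_demo), 1))
```

The two nearly identical features should show a VIF well above 10, while the unrelated one stays close to 1.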
4. The Relationship Between Independent Variables and Log-Odds is Linear
Logistic Regression assumes a linear relationship between the independent variables (X) and the log-odds of the dependent variable.

Equivalently, the log-odds of the positive class are a linear combination of the features, and the sigmoid maps that linear combination back to a probability:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$

🔹 How to Check This Assumption?

Use logarithm transformation or interaction terms if relationships appear non-linear.
Box-Tidwell test can check for non-linearity in log-odds.
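The linearity in log-odds can be seen directly with a one-feature model. In the sketch below, the coefficients `beta0` and `beta1` are hypothetical. Each unit increase in x changes the probability by a varying amount but changes the log-odds by exactly `beta1`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit: log(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a one-feature model
beta0, beta1 = -1.0, 0.8

for x in [0.0, 1.0, 2.0, 3.0]:
    p = sigmoid(beta0 + beta1 * x)
    print(f"x={x}: p={p:.3f}, log-odds={log_odds(p):+.2f}")
```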
5. Large Sample Size for Reliable Estimates
Logistic Regression performs best with a large dataset, especially when features are highly dimensional.
If the dataset is too small, the model may not converge properly or may overfit.
🔹 Solution:

Use more training samples if available.
Apply regularization (L1/L2) to avoid overfitting.
6. No Extreme Outliers in Features
Logistic Regression is sensitive to extreme outliers, which can distort the model’s coefficients.
Since logistic regression uses Maximum Likelihood Estimation (MLE) instead of Least Squares, extreme values in independent variables can impact predictions.
🔹 Solution:

Identify and remove outliers using:
Boxplots or Z-score (|Z| > 3).
Winsorization (capping extreme values).
Log transformations to reduce the effect of outliers.
7. The Classes Should Be Well-Separated and Balanced
If the classes overlap significantly, Logistic Regression may struggle to find a clear decision boundary.
If the dataset is imbalanced (e.g., 95% of cases are one class and 5% are the other), the model will bias toward the majority class.
🔹 Solution:

For class imbalance:
Use class weights (class_weight='balanced' in sklearn).
Use oversampling (SMOTE) or undersampling techniques.
For non-linearly separable data:
Use polynomial features or switch to a non-linear model like SVM or Neural Networks.
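For the class-weight option, scikit-learn documents the 'balanced' heuristic as n_samples / (n_classes * count of each class). The sketch below reproduces that formula in plain Python for the 95% / 5% split mentioned above. The function name is made up for this illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """n_samples / (n_classes * count_of_class), computed per class label."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y_demo = [0] * 95 + [1] * 5   # 95% / 5% imbalance
print(balanced_class_weights(y_demo))
```

Here the minority class gets a weight of 10.0 and the majority class about 0.53, so each misclassified minority example counts roughly 19 times as much in the loss.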
8. Errors Should Be Independent (No Autocorrelation)
In time-series or sequential data, errors may be correlated (autocorrelation).
Logistic Regression assumes no pattern in the errors (i.e., residuals should be randomly distributed).
🔹 Solution:

If using time-series data, check residuals for patterns using Durbin-Watson test.
Consider models like Recurrent Neural Networks (RNNs) for time-dependent predictions.
Summary of Assumptions
Assumption	Description	Possible Fix if Violated
Binary or Categorical Target	Outcome must be 0 or 1	Use Softmax for multi-class
Independence of Observations	No repeated/related observations	Restructure data, remove duplicates
No Multicollinearity	Features should not be highly correlated	Remove redundant features, PCA
Linear Relationship Between Log-Odds & Features	Independent variables must have a linear relationship with log-odds	Use transformations (log, polynomials)
Sufficient Sample Size	Small datasets lead to unreliable coefficients	Collect more data, use regularization
No Extreme Outliers	Outliers distort predictions	Remove/cap outliers using winsorization
Balanced Classes	Imbalanced data biases predictions	Use class_weight='balanced', SMOTE
No Autocorrelation in Errors	Residuals should not be correlated	Use time-series models if needed


Q10. What are some alternatives to Logistic Regression for classification tasks?
Ans. Alternatives to Logistic Regression for Classification Tasks
Logistic Regression is a simple and effective classification model, but in cases where its assumptions are violated or better performance is needed, other models can be used. Below are some powerful alternatives:

1. Decision Tree Classifier 🌳
How It Works:

Splits data into nodes based on feature values using if-else conditions.
Continues splitting until a stopping criterion (e.g., max depth) is met.
Pros:
✅ Handles non-linear relationships.
✅ Works with categorical and numerical data.
✅ Easy to interpret (like a flowchart).

Cons:
❌ Prone to overfitting without pruning.
❌ Unstable to small changes in data.

Python Example:

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)


2. Random Forest Classifier 🌲🌲
How It Works:

An ensemble of multiple Decision Trees.
Uses bagging (random sampling + averaging) to reduce variance.
Final prediction is based on majority voting.
Pros:
✅ More accurate than a single Decision Tree.
✅ Handles missing values and outliers well.
✅ Works with high-dimensional data.

Cons:
❌ Slower than Logistic Regression for large datasets.
❌ Harder to interpret compared to single Decision Trees.

Python Example:

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
3. Support Vector Machine (SVM) ⚡
How It Works:

Finds the optimal hyperplane that best separates classes.
Can handle non-linear classification using the kernel trick.
Pros:
✅ Effective for high-dimensional data.
✅ Works well with small datasets.
✅ Robust to outliers (with appropriate kernel).

Cons:
❌ Computationally expensive for large datasets.
❌ Hard to tune hyperparameters (kernel, C, gamma, etc.).

Python Example:

In [None]:
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0)
model.fit(X_train, y_train)
4. k-Nearest Neighbors (KNN) 🔍
How It Works:

Stores the entire dataset and classifies a new point by looking at the k nearest neighbors.
Uses majority voting to assign a class.
Pros:
✅ Simple and intuitive.
✅ No training required (lazy learning).
✅ Works well with small datasets.

Cons:
❌ Slow on large datasets (because it stores all training points).
❌ Sensitive to irrelevant features and imbalanced data.

Python Example:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
5. Naïve Bayes 🤖
How It Works:

Uses Bayes' Theorem to calculate the probability of each class given the features.
Assumes independent features (hence "naïve").
Pros:
✅ Fast and efficient for large datasets.
✅ Works well for text classification (e.g., spam detection).
✅ Works even with small datasets.

Cons:
❌ Assumes feature independence, which is unrealistic in many cases.
❌ Less flexible compared to other models.

Python Example:

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
6. Gradient Boosting Models 🚀
Boosting algorithms are powerful alternatives to Logistic Regression that combine multiple weak models to create a strong one.

(a) XGBoost (Extreme Gradient Boosting)
Pros:
✅ Very high accuracy, widely used in Kaggle competitions.
✅ Handles missing values well.
✅ Works with structured/tabular data.

Cons:
❌ Requires careful hyperparameter tuning.

Python Example:

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)
(b) LightGBM (Light Gradient Boosting Machine)
Pros:
✅ Faster than XGBoost for large datasets.
✅ Handles categorical features natively.

Cons:
❌ Needs careful tuning.

Python Example:

In [None]:
from lightgbm import LGBMClassifier
model = LGBMClassifier(n_estimators=100)
model.fit(X_train, y_train)
(c) CatBoost
Pros:
✅ Great for categorical data (no need for encoding).
✅ Fast training with GPU acceleration.

Cons:
❌ Tuning can be tricky.

Python Example:

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=100)
model.fit(X_train, y_train)
7. Neural Networks (Deep Learning) 🧠
How It Works:

Uses multiple layers of neurons to learn complex patterns.
Works well with large datasets and non-linear data.
Pros:
✅ Best for highly complex data (e.g., images, speech, NLP).
✅ Can automatically learn feature representations.

Cons:
❌ Requires large data and high computational power.
❌ Harder to interpret than traditional models.

Python Example (Using TensorFlow/Keras):

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32)
Summary Table of Logistic Regression Alternatives
Model	Strengths	Weaknesses
Decision Tree	Simple, interpretable	Overfits without pruning
Random Forest	Robust, handles missing data	Slower for large datasets
SVM	Good for small, high-dimensional data	Hard to tune for large datasets
KNN	No training needed, simple	Slow, sensitive to irrelevant features
Naïve Bayes	Fast, good for text	Assumes feature independence
XGBoost / LightGBM / CatBoost	High accuracy, good for structured data	Requires tuning
Neural Networks	Best for complex patterns (NLP, vision)	Needs large data and GPUs


Q11.What are Classification Evaluation Metrics.
Ans.Classification Evaluation Metrics in Python 📊
When training a classification model (e.g., Logistic Regression, SVM, Random Forest, etc.), it’s essential to evaluate how well it performs. Below are key classification evaluation metrics along with Python examples.

1. Accuracy 🏆
Definition:

Measures the percentage of correct predictions out of total predictions.
Formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Best for:

Balanced datasets where classes have equal distribution.

In [None]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)


2. Precision 🎯
Definition:

Measures how many predicted positive values are actually positive.
Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$

Best for:

When False Positives (FP) are costly, e.g., spam detection, where classifying a normal email as spam is bad.
Python Example:

python
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
print("Precision:", precision)
3. Recall (Sensitivity) 🔍
Definition:

Measures how many actual positive values were correctly predicted.
Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$

Best for:

When False Negatives (FN) are costly, e.g., medical diagnoses, where missing a disease is dangerous.
Python Example:

python
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
print("Recall:", recall)
4. F1-Score ⚖️
Definition:

Harmonic mean of Precision & Recall (balances both).
Formula:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Best for:

Imbalanced datasets, where accuracy alone is misleading.
Python Example:

python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print("F1-Score:", f1)
5. Confusion Matrix 🟦🟥
Definition:

A matrix that shows True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
Python Example:

python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
💡 Use case: Helps understand which type of errors the model is making.
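
The four confusion-matrix counts are enough to reproduce Accuracy, Precision, Recall, and F1 by hand. The sketch below uses only the standard library. The label lists and the helper name `confusion_counts` are made up for illustration, and it assumes binary labels where 1 is the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Illustrative labels (not taken from a real dataset)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# Same formulas as in the metric definitions
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FP, TN, FN:", tp, fp, tn, fn)  # 3 1 5 1
print("Accuracy:", accuracy)              # 0.8
print("Precision:", precision)            # 0.75
print("Recall:", recall)                  # 0.75
print("F1-Score:", f1)                    # 0.75
```

On real data the sklearn.metrics functions shown earlier are the better choice. This version only makes the arithmetic visible.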

6. ROC Curve & AUC Score 🚀
Definition:

ROC Curve (Receiver Operating Characteristic): Plots True Positive Rate (Recall) vs. False Positive Rate at different thresholds.
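AUC (Area Under the Curve): Summarizes the ROC curve in one number, equal to the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (higher is better).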
Python Example:

python
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_true, y_prob)  # y_prob = model's probability predictions
auc_score = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"AUC = {auc_score:.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
💡 Use case:

AUC = 1.0 → Perfect classifier
AUC = 0.5 → Random guessing
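
AUC can also be read as a ranking score: the fraction of (positive, negative) pairs in which the positive sample gets the higher predicted probability. The following is a minimal sketch of that reading in plain Python. The helper `pairwise_auc` and the four sample scores are illustrative assumptions, not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_prob)
print("AUC:", auc_value)  # 0.75 (3 of the 4 positive/negative pairs are ordered correctly)
```

The double loop is quadratic in the number of samples, so it is only meant for building intuition on small inputs.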
7. Log Loss (Cross-Entropy Loss) 🔥
Definition:

Measures how close the predicted probabilities are to the true labels; confident but wrong predictions are penalized most heavily (lower is better).
Formula:
$$\text{LogLoss} = -\frac{1}{N} \sum \left[ y \log(p) + (1 - y) \log(1 - p) \right]$$
Best for:

Probabilistic classifiers like Logistic Regression.
Python Example:

python
from sklearn.metrics import log_loss
loss = log_loss(y_true, y_prob)
print("Log Loss:", loss)
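
To see how log loss punishes confident mistakes, the formula can be evaluated by hand. This is a small sketch using only the math module. The helper `binary_log_loss`, the clipping constant `eps`, and the probability lists are illustrative assumptions.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy for binary labels (1 = positive class)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1]
confident_right = [0.9, 0.1, 0.8]  # high probability on the correct class
confident_wrong = [0.1, 0.9, 0.2]  # high probability on the wrong class

loss_right = binary_log_loss(y_true, confident_right)
loss_wrong = binary_log_loss(y_true, confident_wrong)

print("Confident and right:", round(loss_right, 4))  # about 0.1446
print("Confident and wrong:", round(loss_wrong, 4))  # about 2.0715
```

Both prediction sets are equally "confident", yet the second one has a loss more than ten times larger because the confidence is placed on the wrong class.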
Summary Table
Metric	Best When	Formula
Accuracy	Classes are balanced	(TP+TN)/(TP+TN+FP+FN)
Precision	False Positives are costly	TP/(TP+FP)
Recall (Sensitivity)	False Negatives are costly	TP/(TP+FN)
F1-Score	Imbalanced data (combines precision & recall)	2×(P×R)/(P+R)
Confusion Matrix	Understands types of errors	Table of TP, FP, TN, FN
ROC-AUC Score	Evaluates probability-based models	Plot of TPR vs. FPR
Log Loss	Measures confidence in probability predictions	Cross-entropy formula


Q12.How does class imbalance affect Logistic Regression.
Ans.Problems Caused by Class Imbalance in Logistic Regression
1 Biased Model Towards the Majority Class
Logistic Regression minimizes the average log loss over all samples, so by default the majority class dominates the objective.
If 90% of data belongs to Class A and 10% to Class B, the model can achieve 90% accuracy just by always predicting A.
However, it completely fails to recognize Class B.
 Example:

A fraud detection system where 99% of transactions are normal and only 1% are fraudulent.
The model predicts "Normal" for everything, giving 99% accuracy but zero fraud detection.
2 Misleading Accuracy
Accuracy is misleading when classes are imbalanced.
Suppose a dataset has:
950 "No Fraud" (Class 0)
50 "Fraud" (Class 1)
If the model predicts "No Fraud" for everything, it achieves 95% accuracy, but detects 0% of fraud cases.
 Better metrics:
Use Precision, Recall, F1-score, and ROC-AUC instead of Accuracy.
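
The 950/50 example can be checked in a few lines. This is a minimal sketch with hand-built label lists (illustrative, not real data) and a "model" that always predicts the majority class.

```python
# 950 "No Fraud" (0) and 50 "Fraud" (1) samples
y_true = [0] * 950 + [1] * 50
# A degenerate model that predicts "No Fraud" for every transaction
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print("Accuracy:", accuracy)      # 0.95
print("Recall (fraud):", recall)  # 0.0
```

Precision is undefined here (no positive predictions at all), which is another sign that the model is not usable despite its accuracy.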

3 Poor Decision Boundary
Logistic Regression learns a decision boundary based on the available data.
With imbalanced data, the boundary is pushed into the minority class's region, so more minority samples end up on the majority side and are misclassified.
 Example:

If 99% of patients are healthy and 1% have a disease, the model is biased toward healthy patients and may miss diagnoses.
 How to Handle Class Imbalance in Logistic Regression
1 Use Different Evaluation Metrics
Instead of Accuracy, use:
Precision (if False Positives matter)
Recall (if False Negatives matter, e.g., in medical diagnosis)
F1-score (when balancing both is important)
ROC-AUC Score (for probabilistic models)

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
2 Use Class Weights in Logistic Regression
Set class_weight='balanced' in LogisticRegression().
This adjusts weights inversely proportional to class frequencies.
python

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
 Why?
This makes the model pay more attention to the minority class.
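
The 'balanced' heuristic documented by scikit-learn assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic with the standard library. The helper name `balanced_class_weights` and the 950/50 label list are illustrative assumptions.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Illustrative imbalanced labels: 950 negatives, 50 positives
y_train_example = [0] * 950 + [1] * 50

weights = balanced_class_weights(y_train_example)
print(weights)  # class 0 gets about 0.53, class 1 gets 10.0
```

With these weights a misclassified minority sample contributes roughly 19 times as much to the loss as a misclassified majority sample, which matches the 950:50 class ratio.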

3 Resampling Techniques (Oversampling & Undersampling)
 Oversampling the Minority Class (SMOTE)
Generates synthetic examples of the minority class using SMOTE (Synthetic Minority Over-sampling Technique).
python

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
 Best for: When data is small and we need more samples.

 Undersampling the Majority Class
Removes samples from the majority class to balance the dataset.
python

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
 Best for: When we have lots of data and can afford to drop some.

4 Use Ensemble Methods
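Random Forest or XGBoost often perform better on imbalanced data than Logistic Regression.
Random Forest supports class_weight='balanced'; XGBoost exposes scale_pos_weight for binary problems.
python

import numpy as np
from xgboost import XGBClassifier

# Ratio of majority (0) to minority (1) samples; assumes y_train is a NumPy array of 0/1 labels
ratio_of_majority_to_minority = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(scale_pos_weight=ratio_of_majority_to_minority)
model.fit(X_train, y_train)
 Why?
Class weighting makes errors on the minority class cost more, and boosting additionally focuses later trees on samples that earlier trees misclassified.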

5 Adjust Decision Threshold
By default, Logistic Regression predicts "Positive" if probability > 0.5.
Lowering this threshold can improve recall for the minority class.
python

y_prob = model.predict_proba(X_test)[:, 1]  # Get probabilities
new_threshold = 0.3  # Lower the decision threshold
y_pred_adjusted = (y_prob > new_threshold).astype(int)
 Best for: When False Negatives are costly (e.g., missing a fraud transaction).
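
The effect of lowering the threshold can be seen on a handful of hand-picked probabilities. This is only a sketch: the labels, the probabilities, and the helper `precision_recall_at` are invented for illustration.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for probability > threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative labels and predicted probabilities for the positive class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.45, 0.35, 0.4, 0.1, 0.2, 0.6, 0.25]

precision_default, recall_default = precision_recall_at(y_true, y_prob, 0.5)
precision_lowered, recall_lowered = precision_recall_at(y_true, y_prob, 0.3)

print("Threshold 0.5:", precision_default, recall_default)  # 0.5 0.25
print("Threshold 0.3:", precision_lowered, recall_lowered)  # 0.6 0.75
```

Recall rises from 0.25 to 0.75 here. In this toy example precision also improves, but on real data a lower threshold usually admits more false positives, so the threshold is best chosen on a validation set.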



Q13.What is Hyperparameter Tuning in Logistic Regression.
Ans.Hyperparameter Tuning in Logistic Regression 🔍
 What is Hyperparameter Tuning?
Hyperparameters are configuration settings that are not learned from the data; they must be chosen before training (unlike the model's coefficients, which are learned).
Tuning these hyperparameters helps improve the model’s accuracy, generalization, and performance.
 Key Hyperparameters in Logistic Regression
Hyperparameter	Description	Default
C	Inverse of regularization strength (smaller = stronger regularization)	1.0
penalty	Type of regularization (l1, l2, elasticnet, or None; older scikit-learn versions use the string 'none')	'l2'
solver	Algorithm to optimize the loss function (liblinear, lbfgs, saga, etc.)	'lbfgs'
max_iter	Maximum number of iterations for optimization	100
class_weight	Handles imbalanced data ('balanced' or None)	None
 1. Grid Search for Hyperparameter Tuning
Grid Search exhaustively tries every combination of the hyperparameter values listed in the grid, scores each with cross-validation, and selects the best one.

python

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': ['l1', 'l2'],  # Regularization type
    'solver': ['liblinear', 'saga']  # Solvers that support L1 and L2
}

# Initialize Logistic Regression
log_reg = LogisticRegression(max_iter=500)

# Perform Grid Search
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters:", grid_search.best_params_)
 Best for: Small datasets where we can afford to test all combinations.
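
The cost of Grid Search grows multiplicatively with the grid. A quick way to estimate it, sketched here with itertools only, is to enumerate the combinations of the same grid and multiply by the number of folds. The variable names are illustrative.

```python
from itertools import product

# Same grid as in the Grid Search example
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print("Combinations:", len(combinations))     # 20
print("Cross-validation fits:", total_fits)   # 100
print("First combination:", combinations[0])
```

Adding one more hyperparameter with four values would quadruple the number of fits, which is the main reason to switch to a randomized search on larger grids.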

 2. Randomized Search for Faster Tuning
Instead of testing all combinations, Randomized Search tests a random subset for efficiency.

python
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_dist = {
    'C': np.logspace(-3, 3, 10),  # 10 values between 0.001 and 1000
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

random_search = RandomizedSearchCV(LogisticRegression(max_iter=500),
                                   param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
 Best for: Large datasets where Grid Search is too slow.

 3. Bayesian Optimization (Advanced)
Bayesian Optimization chooses the next hyperparameter values to try based on the results of past evaluations, so it usually needs fewer model fits than Grid Search to find a good configuration.

python

from skopt import BayesSearchCV
from skopt.space import Real, Categorical

bayes_search = BayesSearchCV(
    LogisticRegression(max_iter=500),
    {
        'C': Real(0.0001, 100, prior='log-uniform'),
        'penalty': Categorical(['l1', 'l2']),
        'solver': Categorical(['liblinear', 'saga'])
    },
    n_iter=25, cv=5, scoring='accuracy', random_state=42
)

bayes_search.fit(X_train, y_train)
print("Best Parameters:", bayes_search.best_params_)
 Best for: Large datasets with expensive training times.

Q14. What are different solvers in Logistic Regression? Which one should be used.
Ans.Solvers in Logistic Regression
Logistic Regression in scikit-learn provides different solvers (optimization algorithms) to minimize the cost function. The choice of solver depends on dataset size, regularization type, and performance needs.
Which Solver Should You Use?
Scenario	Recommended Solver
Small dataset (<10,000 samples)	liblinear
Large dataset (>50,000 samples)	sag or saga
Multi-class classification (multi_class='multinomial')	lbfgs, newton-cg, sag, or saga
L1 regularization (Lasso)	liblinear or saga
L2 regularization (Ridge)	lbfgs, newton-cg, sag, saga
Elastic Net regularization	saga
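
The regularization rows of the table can be captured as a small lookup. This is a sketch in plain Python, not a scikit-learn API: the dictionary `SOLVER_PENALTIES` and the helper `compatible_solvers` are hypothetical names, and the penalty sets follow the solver/penalty compatibility described in the scikit-learn documentation for the five solvers named above.

```python
# Penalties each solver accepts (None means no regularization)
SOLVER_PENALTIES = {
    'lbfgs': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'newton-cg': {'l2', None},
    'sag': {'l2', None},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

def compatible_solvers(penalty):
    """Return the solvers (alphabetically) that support the given penalty."""
    return sorted(name for name, penalties in SOLVER_PENALTIES.items() if penalty in penalties)

print("l1:", compatible_solvers('l1'))                  # ['liblinear', 'saga']
print("l2:", compatible_solvers('l2'))                  # all five solvers
print("elasticnet:", compatible_solvers('elasticnet'))  # ['saga']
```

A check like this is handy when building a hyperparameter grid, because an incompatible penalty and solver pair makes the fit raise an error.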

Q15.How is Logistic Regression extended for multiclass classification.
Ans.Extending Logistic Regression for Multiclass Classification
By default, Logistic Regression is designed for binary classification (two classes, e.g., 0 and 1). However, it can be extended for multiclass classification using two main approaches:

 1. One-vs-Rest (OvR) / One-vs-All (OvA)
 How it works?

The model trains one classifier per class.
Each classifier treats one class as "positive" and the rest as "negative."
The class with the highest probability is selected.
 Example (3 Classes: A, B, C)

Train Classifier 1: A vs (B, C)
Train Classifier 2: B vs (A, C)
Train Classifier 3: C vs (A, B)
For a new sample, all classifiers predict probabilities, and the class with the highest probability is chosen.
python
from sklearn.linear_model import LogisticRegression

# Request OvR explicitly: since scikit-learn 0.22 the default (multi_class='auto')
# picks multinomial for multiclass data unless the solver is liblinear
model = LogisticRegression(multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)
 Best for:

Small datasets
Fast training
Interpretable models
 Limitations:

Can be less accurate than other methods.
Separate classifiers may not work well for overlapping classes.
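
The OvR decision rule can be sketched with three independent sigmoid outputs. The linear scores below are hypothetical values standing in for what three trained binary classifiers might produce for one sample.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores from three one-vs-rest classifiers for one sample
ovr_scores = {'A': 1.2, 'B': -0.3, 'C': 0.4}

# Each classifier answers "this class vs. the rest" independently
ovr_probs = {label: sigmoid(z) for label, z in ovr_scores.items()}

# Prediction: the class whose classifier is most confident
predicted = max(ovr_probs, key=ovr_probs.get)

# The independent probabilities need not sum to 1, so normalize if a distribution is needed
raw_total = sum(ovr_probs.values())
normalized = {label: p / raw_total for label, p in ovr_probs.items()}

print("Raw OvR probabilities:", ovr_probs)
print("Sum of raw probabilities:", round(raw_total, 4))  # greater than 1 here
print("Predicted class:", predicted)                     # A
```

Because each classifier is trained separately, the raw outputs are not a probability distribution. This is one reason OvR scores can be harder to compare across classes than Softmax outputs.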
 2. Softmax (Multinomial) Regression
 How it works?

Uses a single model with a Softmax function to predict class probabilities.
Instead of binary decision boundaries, it finds a single probability distribution over all classes.
 Softmax Formula:

$$P(y = k \mid x) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$

where K is the number of classes.
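
The Softmax formula is short enough to evaluate by hand. The sketch below is a plain-Python version that subtracts the largest score before exponentiating, a common trick to avoid overflow that leaves the result unchanged. The three class scores are made-up values of theta_k^T x.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores for K = 3 classes
class_scores = [2.0, 1.0, 0.1]
probs = softmax(class_scores)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print("Sum:", round(sum(probs), 6))  # 1.0
print("Predicted class index:", probs.index(max(probs)))  # 0
```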

python
from sklearn.linear_model import LogisticRegression

# Multinomial Logistic Regression with softmax activation
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(X_train, y_train)
 Best for:

Large datasets
Often better accuracy than OvR
Well-suited for datasets where classes have strong relationships
 Limitations:

More computationally expensive than OvR.
 OvR vs. Softmax: Which One to Use?
Feature	One-vs-Rest (OvR)	Softmax (Multinomial)
Number of classifiers	K (one for each class)	1
Training complexity	Faster	Slower
Best for	Small datasets	Large datasets
Accuracy	Often lower	Often higher
Interpretability	Easier	Harder


Q16.What are the advantages and disadvantages of Logistic Regression.
Ans.Advantages of Logistic Regression
Advantage	Explanation
1 Simple & Easy to Interpret	The model outputs probabilities, making results easy to understand.
2 Works Well for Linearly Separable Data	If data is linearly separable, Logistic Regression performs well.
3 Fast & Computationally Efficient	Compared to complex models (e.g., SVM, Neural Networks), it trains quickly.
4 Handles Large Datasets Well	Efficient for large datasets when using optimizations like saga solver.
5 Provides Probabilistic Outputs	Gives class probabilities rather than only hard labels, and these are often better calibrated than the probabilities from tree-based models.
6 Can Handle Class Imbalance (With Class Weighting)	Can be improved using class_weight='balanced' or oversampling techniques like SMOTE.
7 Works with Regularization (L1, L2, Elastic Net)	Prevents overfitting with penalty='l1' (Lasso), penalty='l2' (Ridge), or penalty='elasticnet'.
8 Extends to Multiclass Problems (Softmax / OvR)	Supports multi_class='multinomial' for Softmax Regression.
 Disadvantages of Logistic Regression
Disadvantage	Explanation
1 Assumes Linearity	Struggles when data is non-linearly separable.
2 Poor Performance on Large Feature Sets	If there are too many independent variables, it may overfit.
3 Sensitive to Outliers	Logistic Regression can be heavily impacted by outliers, affecting decision boundaries.
4 Doesn't Work Well with Highly Correlated Features	If independent variables are highly correlated, coefficients become unstable and hard to interpret (detect multicollinearity with VIF; mitigate it with regularization, feature removal, or PCA).
5 Can Struggle with Class Imbalance	Without adjustments (class_weight='balanced' or SMOTE), it may favor the majority class.
6 Cannot Model Complex Relationships	Unlike Decision Trees or Neural Networks, it can't capture complex, nonlinear decision boundaries.
7 Requires Feature Engineering	Works best when features are well-preprocessed (e.g., normalized, transformed).
 When to Use Logistic Regression?
 When data is linearly separable
 When interpretability is important (e.g., medical diagnosis, fraud detection)
 When you need a quick and efficient model

Avoid Logistic Regression when:

The relationship between input and output is nonlinear (use Decision Trees, SVM, or Neural Networks).
You have many correlated features (use PCA or feature selection first).
You have a large number of categorical variables (use embeddings or tree-based models).

Q17.What are some use cases of Logistic Regression.
Ans.Use Cases of Logistic Regression
Logistic Regression is widely used in binary and multiclass classification problems across various domains. Here are some of its key applications:

1. Medical Diagnosis
 Example: Predicting whether a patient has a disease (Yes/No).

Use case: Diabetes detection, cancer diagnosis, heart disease prediction.
Features: Age, blood pressure, cholesterol levels, glucose levels, etc.
python
# Example: Predicting diabetes
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)  # X: patient data, y: diabetes (0 = No, 1 = Yes)
 2. Spam Email Detection
  Example: Classifying emails as Spam or Not Spam.

Use case: Email filtering systems like Gmail’s spam filter.
Features: Presence of certain keywords, email length, sender reputation.
python

# Example: Detecting spam emails
log_reg = LogisticRegression()
log_reg.fit(email_features, labels)  # labels: 0 = Not Spam, 1 = Spam
 3. Customer Churn Prediction
 Example: Predicting if a customer will cancel a subscription.

Use case: Telecom, banking, and SaaS companies use this to reduce customer attrition.
Features: Monthly bill, customer complaints, contract type, usage patterns.
python

# Example: Predicting customer churn
log_reg = LogisticRegression()
log_reg.fit(customer_data, churn_labels)  # churn_labels: 0 = Stay, 1 = Leave
 4. Credit Risk Assessment
 Example: Predicting whether a loan applicant will default.

Use case: Banks use this to decide loan approvals.
Features: Credit score, income, loan amount, previous defaults.
python
# Example: Predicting loan default
log_reg = LogisticRegression()
log_reg.fit(loan_data, default_labels)  # 0 = No Default, 1 = Default
5. Fraud Detection
 Example: Identifying fraudulent transactions.

Use case: Banks and payment processors (e.g., PayPal, Visa).
Features: Transaction amount, frequency, location, device used.
python

# Example: Fraud detection
log_reg = LogisticRegression()
log_reg.fit(transaction_data, fraud_labels)  # 0 = Genuine, 1 = Fraud
 6. Sentiment Analysis
 Example: Predicting whether a review is positive or negative.

Use case: Amazon, Yelp, and social media analysis.
Features: Word frequencies, sentiment scores, length of review.
python

# Example: Sentiment analysis on product reviews
log_reg = LogisticRegression()
log_reg.fit(review_data, sentiment_labels)  # 0 = Negative, 1 = Positive
 7. Employee Attrition Prediction
 Example: Predicting whether an employee will quit a job.

Use case: HR analytics in companies.
Features: Salary, job satisfaction, work experience, commute time.
python

# Example: Employee attrition prediction
log_reg = LogisticRegression()
log_reg.fit(employee_data, attrition_labels)  # 0 = Stay, 1 = Leave
 8. Predicting Click-Through Rate (CTR)
  Example: Predicting whether a user will click on an ad.

Use case: Google Ads, Facebook Ads.
Features: User demographics, past clicks, ad content.
python

# Example: Click-through rate prediction
log_reg = LogisticRegression()
log_reg.fit(ad_data, click_labels)  # 0 = No Click, 1 = Click
 9. Image Classification
 Example: Recognizing handwritten digits (0-9).

Use case: Optical Character Recognition (OCR), digit recognition in bank cheques.
Features: Pixel intensity values.
python

# Example: Handwritten digit classification (0-9)
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
log_reg.fit(image_data, digit_labels)
 10. Political Campaign Prediction
 Example: Predicting whether a voter will support a candidate.

Use case: Political analytics and election forecasting.
Features: Age, income, political preference, past voting history.
python

# Example: Predicting voter behavior
log_reg = LogisticRegression()
log_reg.fit(voter_data, support_labels)  # 0 = No Support, 1 = Support


Q18.What is the difference between Softmax Regression and Logistic Regression.
Ans.Softmax Regression vs. Logistic Regression
Both Logistic Regression and Softmax Regression are used for classification, but they differ in how they handle the number of classes.

Feature	Logistic Regression	Softmax Regression (Multinomial Logistic Regression)
Used for	Binary classification (2 classes: 0 or 1)	Multiclass classification (3+ classes)
Output	Probability of one class vs. the other	Probability distribution over all classes
Decision Rule	Uses Sigmoid function	Uses Softmax function
Formula	$\sigma(z) = \frac{1}{1 + e^{-z}}$	$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
Model Training	Fits a single decision boundary	Learns one weight vector per class
Multiclass Support	Requires One-vs-Rest (OvR) approach	Directly supports multiple classes
 Example: Logistic Regression (Binary Classification)
 Use Case: Predict if an email is spam or not spam (0/1)

python

from sklearn.linear_model import LogisticRegression

# Binary Classification (Spam/Not Spam)
model = LogisticRegression()
model.fit(X_train, y_train)  # y contains 0 or 1
 Example: Softmax Regression (Multiclass Classification)
 Use Case: Predict the digit in handwritten digit recognition (0-9)

python

from sklearn.linear_model import LogisticRegression

# Multiclass Classification (Digits 0-9)
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(X_train, y_train)  # y contains multiple classes (0,1,2,...,9)
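
The two models are closely related: with K = 2 classes and the second class's score fixed at 0, Softmax gives exactly the Sigmoid probability. A small numerical check in plain Python follows; the test scores are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic function used by binary Logistic Regression."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Softmax over a list of class scores (max-shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary scores z for the positive class; the other class has score 0
test_scores = [-3.0, -0.5, 0.0, 2.0]
for z in test_scores:
    two_class = softmax([z, 0.0])[0]
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.6f}  softmax={two_class:.6f}")

max_gap = max(abs(softmax([z, 0.0])[0] - sigmoid(z)) for z in test_scores)
print("Largest difference:", max_gap)  # effectively zero (floating-point rounding only)
```

This follows from e^z / (e^z + e^0) = 1 / (1 + e^(-z)), so binary Logistic Regression can be seen as the two-class special case of Softmax Regression.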


Q19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification.
Ans.Choosing Between One-vs-Rest (OvR) and Softmax (Multinomial) for Multiclass Classification
When dealing with multiclass classification (3+ classes) in Logistic Regression, we have two main approaches:
1 One-vs-Rest (OvR) / One-vs-All (OvA)
2 Softmax (Multinomial) Regression

 Key Differences:
Feature	One-vs-Rest (OvR)	Softmax (Multinomial)
Concept	Trains one binary classifier for each class (each class vs. rest).	Trains a single model with Softmax activation to handle all classes together.
Number of Models	Requires K binary classifiers (one per class).	Uses one single model for all classes.
Predictions	Each classifier gives a probability, and the class with the highest probability is chosen.	Computes a probability distribution over all classes using Softmax.
Computational Complexity	Faster training (each classifier is trained separately).	More complex (all weights optimized together).
Best for	Small datasets, imbalanced data.	Large datasets, complex decision boundaries.
Interpretability	Easier to interpret since each class is treated separately.	Harder to interpret as all classes influence the decision.
Performance	Can be less accurate if classes are highly related.	Typically more accurate for well-distributed datasets.
 When to Choose OvR vs. Softmax
Scenario	Best Choice
Small dataset, faster training needed	OvR
Large dataset, better accuracy needed	Softmax
Highly imbalanced classes	OvR (each binary classifier can be weighted or thresholded separately)
Classes are mutually exclusive (e.g., digit classification)	Softmax
Interpretability is important	OvR
 Example: Implementing OvR and Softmax in Python
1 One-vs-Rest (OvR)
python

from sklearn.linear_model import LogisticRegression

# Train using One-vs-Rest (requested explicitly; liblinear only supports OvR for multiclass data)
model_ovr = LogisticRegression(multi_class='ovr', solver='liblinear')
model_ovr.fit(X_train, y_train)
2 Softmax (Multinomial) Regression
python

from sklearn.linear_model import LogisticRegression

# Train using Softmax Regression
model_softmax = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model_softmax.fit(X_train, y_train)


Q20.How do we interpret coefficients in Logistic Regression?
Ans.Interpreting Coefficients in Logistic Regression
In Logistic Regression, the model outputs a probability, and the coefficients $\beta_i$ tell us how each feature influences the outcome. However, unlike Linear Regression, the interpretation is in terms of log-odds.

 1. Logistic Regression Model Equation
The probability of an event occurring is given by the Sigmoid function:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$
Taking the log-odds transformation (logit function):

$$\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
Each coefficient $\beta_i$ represents the change in log-odds when the corresponding feature $X_i$ increases by one unit, keeping other variables constant.

 2. Interpreting Coefficients in Logistic Regression
Coefficient $\beta_i$	Interpretation
$\beta_i > 0$	The feature increases the probability of the event happening (positive influence).
$\beta_i < 0$	The feature decreases the probability of the event happening (negative influence).
$\beta_i = 0$	The feature has no effect on the outcome.
 Odds Ratio Interpretation
The exponentiated coefficient ($e^{\beta_i}$) tells us the odds ratio (OR):

If $e^{\beta_i} > 1$ → Feature increases odds of the event occurring.
If $e^{\beta_i} < 1$ → Feature decreases odds of the event occurring.
$$\text{Odds Ratio} = e^{\beta_i}$$

3. Example Interpretation
Example: Predicting Loan Default (Yes = 1, No = 0)
Feature	Coefficient $\beta_i$	$e^{\beta_i}$ (Odds Ratio)	Interpretation
Income	$-0.5$	$e^{-0.5} = 0.61$	Higher income reduces the odds of defaulting.
Credit Score	$0.3$	$e^{0.3} = 1.35$	Higher credit score increases the odds of defaulting.
Debt-to-Income Ratio	$1.2$	$e^{1.2} = 3.32$	A higher debt-to-income ratio greatly increases the odds of default.
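🛠 Interpretation for Credit Score ($\beta = 0.3$):

A one-unit increase in credit score multiplies the odds of default by 1.35.
The odds ratio acts on odds, not on probability: if the baseline default probability were 10% (odds of 0.1/0.9 ≈ 0.11), the odds would rise to about 0.15, which corresponds to a probability of roughly 13%.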
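
Because an odds ratio multiplies odds rather than probabilities, the conversion has three steps: probability to odds, multiply by $e^{\beta}$, and odds back to probability. Here is a minimal sketch of that calculation. The helper `apply_odds_ratio` is a hypothetical name, and the inputs are the illustrative 10% baseline and $\beta = 0.3$ from the example.

```python
import math

def apply_odds_ratio(prob, beta, delta=1.0):
    """Probability after the feature increases by `delta` units, given coefficient `beta`."""
    odds = prob / (1.0 - prob)                 # probability -> odds
    new_odds = odds * math.exp(beta * delta)   # odds are multiplied by e^(beta * delta)
    return new_odds / (1.0 + new_odds)         # odds -> probability

baseline_prob = 0.10
new_prob = apply_odds_ratio(baseline_prob, beta=0.3)

print("Baseline odds:", round(baseline_prob / (1 - baseline_prob), 4))  # 0.1111
print("New probability:", round(new_prob, 4))                           # 0.1304
```

For small baseline probabilities the odds and the probability are close, so multiplying the probability by the odds ratio gives a rough approximation. The gap widens as the baseline probability grows.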
4. Standardization & Scaling Considerations
Feature scaling matters: each coefficient is expressed per unit of its own feature, so coefficients of unstandardized features cannot be compared directly to judge importance.
Categorical variables: Need one-hot encoding before interpretation.
Multicollinearity: If features are highly correlated, coefficients may be unstable.


**Practical**

Q1.Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic
Regression, and prints the model accuracy
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")


Q2.Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1')
and print the model accuracy

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model with L1 regularization (Lasso)
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with L1 Regularization: {accuracy:.4f}")


Q3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using
LogisticRegression(penalty='l2'). Print model accuracy and coefficients
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model with L2 regularization (Ridge)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with L2 Regularization: {accuracy:.4f}")

# Print coefficients
print("Coefficients:")
print(model.coef_)


Q4.Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet')
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model with Elastic Net regularization
model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Elastic Net Regularization: {accuracy:.4f}")

# Print coefficients
print("Coefficients:")
print(model.coef_)


Q5.Write a Python program to train a Logistic Regression model for multiclass classification using
multi_class='ovr'
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

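# Train the Logistic Regression model for multiclass classification using One-vs-Rest (OvR)
# Note: scikit-learn 1.5+ deprecates multi_class; OneVsRestClassifier(LogisticRegression()) is the suggested replacement
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)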
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with One-vs-Rest (OvR): {accuracy:.4f}")

# Print coefficients
print("Coefficients:")
print(model.coef_)
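
Because newer scikit-learn releases warn about the multi_class argument, the same One-vs-Rest strategy can be expressed with the OneVsRestClassifier wrapper. This is a sketch of that alternative, not the original answer; the pipeline and variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# One binary logistic regression per class, each trained as "this class vs. the rest"
ovr_model = make_pipeline(StandardScaler(),
                          OneVsRestClassifier(LogisticRegression(max_iter=200)))
ovr_model.fit(X_train, y_train)

ovr_accuracy = accuracy_score(y_test, ovr_model.predict(X_test))
n_binary_models = len(ovr_model[-1].estimators_)
print(f"OvR accuracy: {ovr_accuracy:.4f} using {n_binary_models} binary models")
```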


Q6.Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic
Regression. Print the best parameters and accuracy
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # 'liblinear' supports both 'l1' and 'l2'
}

# Apply GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Get best parameters and accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
best_accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Best Model Accuracy: {best_accuracy:.4f}")
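
The grid search above scales the training set once, before cross-validation, so every validation fold has already influenced the scaler. A possible variant, sketched here under the assumption that only C is tuned, places the scaler inside a Pipeline so it is refit on each training fold. The step names "scaler" and "clf" are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The scaler is a pipeline step, so GridSearchCV refits it within every fold
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

cv_accuracy = search.best_score_             # mean cross-validated accuracy
test_accuracy = search.score(X_test, y_test)  # accuracy on the held-out test set
print(f"Best parameters: {search.best_params_}")
print(f"CV accuracy: {cv_accuracy:.4f}, test accuracy: {test_accuracy:.4f}")
```

Reporting both best_score_ and the held-out test accuracy makes it clear which number was used for model selection and which one estimates generalization.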


Q7.Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the
average accuracy
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Define the model as a pipeline so the scaler is fit on each training fold only
# (scaling the full dataset before cross-validation would leak test-fold statistics)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Apply Stratified K-Fold Cross-Validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Calculate and print the average accuracy
average_accuracy = np.mean(cv_scores)
print(f"Average Accuracy with Stratified K-Fold CV: {average_accuracy:.4f}")
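
What "stratified" means can be checked directly by inspecting the class counts in each test fold. A small sketch on the same Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# For every fold, count how many test samples belong to each class
fold_class_counts = [np.bincount(y[test_idx]).tolist()
                     for _, test_idx in skf.split(X, y)]
for fold, counts in enumerate(fold_class_counts, start=1):
    print(f"Fold {fold}: class counts in test split = {counts}")
```

Iris has 50 samples per class, so each of the 5 test folds holds 10 samples of every class.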


Q8.Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its
accuracy.
Ans.

In [None]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from CSV file
def load_data(csv_file):
    data = pd.read_csv(csv_file)
    X = data.iloc[:, :-1]  # Features (all columns except the last)
    y = data.iloc[:, -1]   # Target (last column)
    return X, y

# Load data
csv_file = "dataset.csv"  # Replace with your CSV file name
X, y = load_data(csv_file)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
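
The program above expects a file named dataset.csv to exist. To try the same workflow without one, the sketch below first writes the Iris data to a temporary CSV and then reads it back; the file name iris_demo.csv is purely illustrative.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Write a stand-in CSV: feature columns first, target in the last column
iris_frame = load_iris(as_frame=True).frame
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "iris_demo.csv")  # hypothetical file name
    iris_frame.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

X = loaded.iloc[:, :-1]  # all columns except the last
y = loaded.iloc[:, -1]   # last column is the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)
csv_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Model Accuracy: {csv_accuracy:.4f}")
```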


Q9.Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in
Logistic Regression. Print the best parameters and accuracy.
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from scipy.stats import uniform

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define parameter distribution for RandomizedSearchCV
param_dist = {
    'C': uniform(0.01, 10),  # scipy's uniform(loc, scale) samples C from [0.01, 10.01]
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

# Apply RandomizedSearchCV to find the best hyperparameters
random_search = RandomizedSearchCV(LogisticRegression(max_iter=200), param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='accuracy', random_state=42, n_jobs=-1)
random_search.fit(X_train_scaled, y_train)

# Get best parameters and accuracy
best_params = random_search.best_params_
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
best_accuracy = accuracy_score(y_test, y_pred)

print(f"Best Parameters: {best_params}")
print(f"Best Model Accuracy: {best_accuracy:.4f}")
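
Since C usually matters on a multiplicative scale, a log-uniform distribution is a common alternative to the uniform one used above. The sketch below assumes a search range of 1e-3 to 1e3 and tunes only C; both choices are illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Assumed range: sample C so that each order of magnitude is equally likely
param_dist = {"clf__C": loguniform(1e-3, 1e3)}

search = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=10,
                            cv=5, scoring="accuracy", random_state=42)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
print(f"Best C: {best_C:.4g}, CV accuracy: {search.best_score_:.4f}")
```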


Q10.Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = data.target  # Target (0, 1, or 2 for different iris species)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the One-vs-One (OvO) Logistic Regression model
ovo_model = OneVsOneClassifier(LogisticRegression(max_iter=200))
ovo_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = ovo_model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with One-vs-One (OvO): {accuracy:.4f}")
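
One-vs-One trains a separate binary classifier for every pair of classes, that is k(k-1)/2 models for k classes. The sketch below lists those pairs for Iris and confirms the number of fitted estimators.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

ovo = OneVsOneClassifier(LogisticRegression(max_iter=200)).fit(X_scaled, y)

# Every unordered pair of class labels gets its own binary model
class_pairs = list(combinations(ovo.classes_.tolist(), 2))
n_models = len(ovo.estimators_)
print(f"Class pairs: {class_pairs}")
print(f"Number of binary models: {n_models}")
```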


Q11.Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary
classification
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()


Q12.Write a Python program to train a Logistic Regression model and evaluate its performance using Precision,
Recall, and F1-Score
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.datasets import make_classification

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
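
Precision, recall, and F1 all follow from the four cells of the binary confusion matrix. As a cross-check, the sketch below recomputes them by hand and compares them with the scikit-learn functions (scaling is omitted here for brevity).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = LogisticRegression(max_iter=200).fit(X_train, y_train).predict(X_test)

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

manual_precision = tp / (tp + fp)  # share of predicted positives that are correct
manual_recall = tp / (tp + fn)     # share of actual positives that were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(f"Precision: {manual_precision:.4f} (sklearn: {precision_score(y_test, y_pred):.4f})")
print(f"Recall:    {manual_recall:.4f} (sklearn: {recall_score(y_test, y_pred):.4f})")
print(f"F1-Score:  {manual_f1:.4f} (sklearn: {f1_score(y_test, y_pred):.4f})")
```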


Q13.Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to
improve model performance
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.datasets import make_classification

# Generate synthetic imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model with class weights
model = LogisticRegression(max_iter=200, class_weight='balanced')
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
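
With class_weight='balanced', each class is weighted by n_samples / (n_classes * class_count), so the minority class receives the larger weight. The sketch below computes these weights by hand and compares them with scikit-learn's compute_class_weight helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)

classes = np.unique(y)
# Manual formula: n_samples / (n_classes * number of samples in each class)
manual_weights = len(y) / (len(classes) * np.bincount(y))
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

for cls, count, weight in zip(classes, np.bincount(y), manual_weights):
    print(f"Class {cls}: {count} samples, balanced weight = {weight:.3f}")
```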


Q14.Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and
evaluate performance
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
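
The program above fills missing ages with the median of the whole dataset before splitting. A stricter variant learns the fill value from the training rows only and reuses it on the test rows. The sketch below shows the idea with SimpleImputer on a few made-up age values (they are illustrative, not Titanic data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages for illustration; np.nan marks a missing value
train_age = np.array([[22.0], [np.nan], [35.0], [58.0]])
test_age = np.array([[np.nan], [41.0]])

# Learn the median from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_age)

# Apply the training median to the test rows
test_filled = imputer.transform(test_age)

print(f"Median learned from training data: {imputer.statistics_[0]}")
print(f"Filled test ages: {test_filled.ravel().tolist()}")
```

The training median here is 35.0 (the middle of 22, 35, and 58), and that value replaces the missing test age.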


Q15.Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression
model. Evaluate its accuracy and compare results with and without scaling
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print(f"Model Accuracy without Scaling: {accuracy_no_scaling:.4f}")

# Apply Standardization (Feature Scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression with scaling
model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Model Accuracy with Scaling: {accuracy_scaled:.4f}")

# Compare results
if accuracy_scaled > accuracy_no_scaling:
    print("Feature scaling improved model performance.")
elif accuracy_scaled < accuracy_no_scaling:
    print("Feature scaling reduced model performance.")
else:
    print("Feature scaling had no effect on model performance.")

# Compute confusion matrices
cm_no_scaling = confusion_matrix(y_test, y_pred_no_scaling)
cm_scaled = confusion_matrix(y_test, y_pred_scaled)

# Visualize confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.heatmap(cm_no_scaling, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'], ax=axes[0])
axes[0].set_title("Confusion Matrix - No Scaling")
axes[0].set_xlabel("Predicted Label")
axes[0].set_ylabel("True Label")

sns.heatmap(cm_scaled, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'], ax=axes[1])
axes[1].set_title("Confusion Matrix - With Scaling")
axes[1].set_xlabel("Predicted Label")
axes[1].set_ylabel("True Label")

plt.show()


Q16.Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
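
ROC-AUC can be read as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one (ties count as one half). The sketch below checks this interpretation on synthetic data, which stands in for the Titanic set so the example runs offline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Compare every positive score with every negative score
pos_scores = scores[y_test == 1]
neg_scores = scores[y_test == 0]
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_test, scores)
print(f"Pairwise AUC: {pairwise_auc:.4f}, roc_auc_score: {sklearn_auc:.4f}")
```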

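
Q17.Write a Python program to train Logistic Regression using a custom learning rate (C=0.5) and evaluate
accuracy
Ans. In scikit-learn, C is the inverse of the regularization strength rather than a learning rate, so C=0.5 applies stronger regularization than the default C=1.0.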

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model with C=0.5 (inverse regularization strength, not a learning rate)
model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
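
The effect of C is easiest to see in the size of the fitted coefficients: a smaller C means a stronger penalty and therefore smaller weights. The sketch below uses synthetic data and three arbitrarily chosen C values to illustrate this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Illustrative C values: very strong, moderate, and very weak regularization
coef_norms = {}
for C in (0.001, 0.5, 100):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: L2 norm of coefficients = {coef_norms[C]:.4f}")
```

The coefficient norm typically grows as C increases, because the L2 penalty pulls the weights toward zero less and less.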


Q18.Write a Python program to train Logistic Regression and identify important features based on model
coefficients
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Identify important features based on model coefficients
# (features are standardized, so a larger absolute coefficient indicates a stronger effect)
feature_importance = pd.DataFrame({"Feature": X.columns, "Coefficient": model.coef_[0]})
feature_importance = feature_importance.sort_values(by="Coefficient", key=abs, ascending=False)

# Plot feature importance
plt.figure(figsize=(8, 5))
sns.barplot(x="Coefficient", y="Feature", data=feature_importance, palette="coolwarm")
plt.title("Feature Importance based on Logistic Regression Coefficients")
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.show()

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
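
Coefficients of a logistic regression are log-odds changes, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below uses synthetic data with hypothetical feature names (f0 to f3) rather than the Titanic columns.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names for illustration
X_scaled = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=200).fit(X_scaled, y)

summary = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": clf.coef_[0],
    # exp(coefficient) = multiplicative change in the odds per one standard deviation
    "OddsRatio": np.exp(clf.coef_[0]),
})
# Rank by magnitude so strong negative effects are not pushed to the bottom
summary = summary.sort_values(by="Coefficient", key=abs, ascending=False)
print(summary.to_string(index=False))
```

An odds ratio above 1 means the feature raises the odds of the positive class; below 1 means it lowers them.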


Q19.Write a Python program to train Logistic Regression and evaluate its performance using Cohen’s Kappa
Score
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, cohen_kappa_score

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
kappa_score = cohen_kappa_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Cohen's Kappa Score: {kappa_score:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
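
Cohen's Kappa compares the observed agreement p_o between predictions and true labels with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below derives both quantities from the confusion matrix on synthetic data and compares the result with cohen_kappa_score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = LogisticRegression(max_iter=200).fit(X_train, y_train).predict(X_test)

cm = confusion_matrix(y_test, y_pred)
n = cm.sum()

p_o = np.trace(cm) / n                                 # observed agreement (accuracy)
p_e = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2   # agreement expected by chance
manual_kappa = (p_o - p_e) / (1 - p_e)

sklearn_kappa = cohen_kappa_score(y_test, y_pred)
print(f"Manual kappa: {manual_kappa:.4f}, cohen_kappa_score: {sklearn_kappa:.4f}")
```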


Q20.Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary
classification
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, cohen_kappa_score, precision_recall_curve

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna triggers warnings in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
kappa_score = cohen_kappa_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Cohen's Kappa Score: {kappa_score:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# Plot Precision-Recall Curve
precision_vals, recall_vals, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(recall_vals, precision_vals, color='green', label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
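
The Precision-Recall curve can also be summarized in a single number, the average precision (AP). The sketch below is an illustrative addition rather than part of the original answer: it uses a four-sample toy array instead of the Titanic data, and the names `y_true_demo` and `scores_demo` (and their values) are made up for the example. It compares AP with the no-skill baseline, which equals the fraction of positive samples.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and predicted probabilities (hypothetical values for illustration)
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the Precision-Recall curve, one per decision threshold
precision_demo, recall_demo, thresholds_demo = precision_recall_curve(y_true_demo, scores_demo)

# Average precision: precision at each threshold, weighted by the recall gained there
ap_demo = average_precision_score(y_true_demo, scores_demo)

# A classifier with no skill has precision equal to the share of positives
baseline_demo = y_true_demo.mean()

print(f"Average Precision: {ap_demo:.4f}")
print(f"No-skill baseline: {baseline_demo:.4f}")
```

On the Titanic split, the same two calls would take `y_test` and `y_pred_proba`; an AP well above the positive-class share suggests the model ranks survivors ahead of non-survivors.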


Q21.Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare
their accuracy.
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, cohen_kappa_score, precision_recall_curve

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna is deprecated in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression with different solvers
solvers = ['liblinear', 'saga', 'lbfgs']
solver_results = {}

for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=200)
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    solver_results[solver] = accuracy
    print(f"Solver: {solver} - Accuracy: {accuracy:.4f}")

# Plot solver comparison
plt.figure(figsize=(6, 4))
plt.bar(list(solver_results.keys()), list(solver_results.values()), color=['blue', 'green', 'red'])
plt.xlabel('Solver')
plt.ylabel('Accuracy')
plt.title('Comparison of Logistic Regression Solvers')
plt.show()

# Make predictions with the model left over from the final loop iteration (solver='lbfgs')
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
kappa_score = cohen_kappa_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Cohen's Kappa Score: {kappa_score:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot Precision-Recall Curve
precision_vals, recall_vals, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(recall_vals, precision_vals, color='green', label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
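
On a small, standardized dataset the three solvers minimize nearly the same regularized objective, so their accuracies are often very close. Where they differ is usually in how many iterations they need. The sketch below is an illustrative addition under stated assumptions: it uses a synthetic dataset from `make_classification` as a stand-in for the Titanic features, and names such as `solver_summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset (a stand-in for the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardize using statistics from the training split only
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Record test accuracy and the number of iterations each solver used
solver_summary = {}
for solver_name in ["liblinear", "saga", "lbfgs"]:
    clf = LogisticRegression(solver=solver_name, max_iter=1000, random_state=42)
    clf.fit(X_tr_s, y_tr)
    solver_summary[solver_name] = {
        "accuracy": clf.score(X_te_s, y_te),
        "iterations": int(clf.n_iter_.max()),
    }

for solver_name, stats in solver_summary.items():
    print(f"{solver_name}: accuracy={stats['accuracy']:.4f}, iterations={stats['iterations']}")
```

If a solver reports an iteration count equal to `max_iter`, it probably stopped before converging, and its accuracy should be read with that in mind.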


Q22.Write a Python program to train Logistic Regression and evaluate its performance using Matthews
Correlation Coefficient (MCC).
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve, cohen_kappa_score, precision_recall_curve, matthews_corrcoef

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna is deprecated in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(C=0.5, max_iter=200)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
kappa_score = cohen_kappa_score(y_test, y_pred)
mcc_score = matthews_corrcoef(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print(f"Cohen's Kappa Score: {kappa_score:.4f}")
print(f"Matthews Correlation Coefficient (MCC): {mcc_score:.4f}")

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Survived', 'Survived'], yticklabels=['Not Survived', 'Survived'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Plot Precision-Recall Curve
precision_vals, recall_vals, _ = precision_recall_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 4))
plt.plot(recall_vals, precision_vals, color='green', label='Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
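
For binary classification, MCC can be computed directly from the four confusion-matrix counts as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is an illustrative addition that checks this formula against `matthews_corrcoef` on a small hand-made example; the arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the Titanic data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy labels and predictions (hypothetical values for illustration)
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# MCC from the confusion-matrix counts; defined as 0 when the denominator is 0
numerator = tp * tn - fp * fn
denominator = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
mcc_manual = numerator / denominator if denominator > 0 else 0.0

# Reference value from scikit-learn
mcc_sklearn = matthews_corrcoef(y_true_demo, y_pred_demo)

print(f"MCC (manual):  {mcc_manual:.4f}")
print(f"MCC (sklearn): {mcc_sklearn:.4f}")
```

MCC ranges from -1 to +1, with 0 corresponding to chance-level predictions. Because it uses all four counts, it is less easily inflated by class imbalance than accuracy.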


Q23.Write a Python program to train Logistic Regression on both raw and standardized data. Compare their
accuracy to see the impact of feature scaling.
Ans.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna is deprecated in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Logistic Regression on raw data
model_raw = LogisticRegression(max_iter=200)
model_raw.fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
accuracy_raw = accuracy_score(y_test, y_pred_raw)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression on standardized data
model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy on raw data: {accuracy_raw:.4f}")
print(f"Accuracy on standardized data: {accuracy_scaled:.4f}")

# Compare performance
plt.figure(figsize=(6, 4))
plt.bar(["Raw Data", "Standardized Data"], [accuracy_raw, accuracy_scaled], color=['blue', 'green'])
plt.xlabel("Feature Scaling")
plt.ylabel("Accuracy")
plt.title("Impact of Feature Scaling on Logistic Regression Accuracy")
plt.show()
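
Accuracy alone may hide much of the effect of scaling: with a gradient-based solver such as lbfgs, unscaled features can mainly show up as slower convergence. The sketch below is an illustrative addition under stated assumptions. It uses synthetic data from `make_classification` with one feature deliberately inflated by a factor of 1000 (an arbitrary choice), and compares accuracy and iteration counts for a raw model and a `StandardScaler` pipeline. All names ending in `_demo`, `_clf`, or `_iters` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first feature is placed on a much larger scale on purpose
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Model trained directly on the raw features
raw_clf = LogisticRegression(max_iter=5000)
raw_clf.fit(X_tr, y_tr)

# Same model, with standardization learned from the training split inside a pipeline
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scaled_clf.fit(X_tr, y_tr)

raw_acc = raw_clf.score(X_te, y_te)
scaled_acc = scaled_clf.score(X_te, y_te)
raw_iters = int(raw_clf.n_iter_[0])
scaled_iters = int(scaled_clf.named_steps["logisticregression"].n_iter_[0])

print(f"Raw:    accuracy={raw_acc:.4f}, iterations={raw_iters}")
print(f"Scaled: accuracy={scaled_acc:.4f}, iterations={scaled_iters}")
```

Keeping the scaler inside the pipeline also means the test data is transformed with the training statistics automatically, which matches the manual `fit_transform` / `transform` split used in the answer.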


Q24.Write a Python program to train Logistic Regression and find the optimal C (regularization strength) using
cross-validation.
Ans.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna is deprecated in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define hyperparameter grid for C values
param_grid = {"C": np.logspace(-4, 4, 20)}

# Perform GridSearchCV to find optimal C
log_reg = LogisticRegression(max_iter=200)
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train_scaled, y_train)

# Get best parameter
best_C = grid_search.best_params_["C"]
print(f"Optimal C value: {best_C}")

# Train Logistic Regression with optimal C
# (grid_search.best_estimator_ is already refit this way, since refit=True by default)
best_model = LogisticRegression(C=best_C, max_iter=200)
best_model.fit(X_train_scaled, y_train)
y_pred = best_model.predict(X_test_scaled)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with optimal C: {accuracy:.4f}")
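
One refinement worth considering: when the scaler is fit on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. Wrapping the scaler and the model in a `Pipeline` lets `GridSearchCV` refit the scaler inside every fold. The sketch below is an illustrative addition, not the original answer's method. It uses synthetic data from `make_classification`, a coarser C grid, and hypothetical step names `"scaler"` and `"clf"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset (a stand-in for the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaler and classifier in one estimator, so scaling is refit within each CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
c_grid = np.logspace(-3, 3, 7)
search = GridSearchCV(pipe, {"clf__C": c_grid}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

best_c_demo = search.best_params_["clf__C"]
test_acc_demo = search.score(X_te, y_te)  # uses the refit best pipeline

print(f"Best C: {best_c_demo}")
print(f"Mean CV accuracy at best C: {search.best_score_:.4f}")
print(f"Test accuracy: {test_acc_demo:.4f}")
```

Smaller C means stronger regularization, so a best value near the low end of the grid suggests the model benefits from shrinking its coefficients.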


Q25.Write a Python program to train Logistic Regression, save the trained model using joblib, and load it again to
make predictions.
Ans.

In [None]:
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Titanic dataset
data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Select relevant features
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
data = data[features + ["Survived"]].copy()

# Handle missing values (assign the result back; chained inplace fillna is deprecated in recent pandas)
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode categorical variables
label_encoder = LabelEncoder()
data["Sex"] = label_encoder.fit_transform(data["Sex"])
data["Embarked"] = label_encoder.fit_transform(data["Embarked"])

# Split features and target variable
X = data.drop("Survived", axis=1)
y = data["Survived"]

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression model
model = LogisticRegression(C=1.0, max_iter=200)
model.fit(X_train_scaled, y_train)

# Save the trained model (the fitted scaler is not saved here, so new data must be scaled the same way before predicting)
joblib.dump(model, "logistic_regression_model.joblib")

# Load the model back
loaded_model = joblib.load("logistic_regression_model.joblib")

# Make predictions
y_pred = loaded_model.predict(X_test_scaled)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy after loading: {accuracy:.4f}")
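
Because the saved model expects standardized inputs, it can be convenient to persist the scaler and the classifier together. The sketch below is an illustrative addition under stated assumptions: it trains a `make_pipeline(StandardScaler(), LogisticRegression())` on synthetic data, writes it to a temporary directory, and checks that predictions are unchanged after reloading. The file name `logreg_pipeline_demo.joblib` and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary dataset (a stand-in for the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=42
)

# Scaler and model in one object, so a single file holds the whole preprocessing chain
pipeline_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipeline_demo.fit(X_demo, y_demo)
preds_before = pipeline_demo.predict(X_demo[:10])

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logreg_pipeline_demo.joblib")
    joblib.dump(pipeline_demo, model_path)
    restored_pipeline = joblib.load(model_path)

# The restored pipeline accepts raw (unscaled) inputs, just like the original
preds_after = restored_pipeline.predict(X_demo[:10])
print("Predictions identical after reload:", np.array_equal(preds_before, preds_after))
```

Files written by joblib should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.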
