#Theory
# Question 1: What is Logistic Regression, and how does it differ from Linear Regression?
# Answer:
# Logistic Regression is a statistical method used for binary classification tasks.
# It predicts the probability that a given input belongs to a certain class (e.g., 0 or 1).
# The output is always between 0 and 1, which makes it suitable for classification.
# Unlike Linear Regression, which outputs unbounded continuous values and fits a straight line,
# Logistic Regression passes the linear score through the sigmoid function, which maps it into the (0, 1) range.
# Thus, Linear Regression is used for regression tasks, while Logistic Regression is used for classification.
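#
# A minimal sketch of the contrast above, on made-up one-feature data (x_cmp, y_cmp, and
# x_new are illustrative assumptions, not a real dataset). A straight-line fit can predict
# values outside [0, 1], while Logistic Regression returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: one feature, binary target (hypothetical values)
x_cmp = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_cmp = np.array([0, 0, 0, 1, 1, 1])

lin_reg = LinearRegression().fit(x_cmp, y_cmp)
log_clf = LogisticRegression().fit(x_cmp, y_cmp)

# Evaluate both models far outside the training range
x_new = np.array([[-5.0], [10.0]])
lin_pred = lin_reg.predict(x_new)               # unbounded continuous values
log_prob = log_clf.predict_proba(x_new)[:, 1]   # estimated P(y = 1 | x)

print("Linear Regression predictions:", lin_pred)
print("Logistic Regression probabilities:", log_prob)
```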

# Question 2: What is the mathematical equation of Logistic Regression?
# Answer:
# The hypothesis of Logistic Regression is given by:
# hθ(x) = 1 / (1 + e^(-θᵀx))
# Here, θ is the vector of model parameters, x is the feature vector,
# and hθ(x) gives the probability that the output is 1.

# Question 3: Why do we use the Sigmoid function in Logistic Regression?
# Answer:
# The sigmoid function maps any real-valued number into a value between 0 and 1,
# which is interpreted as a probability.
# It allows us to convert the linear combination θᵀx into a probability score.
# This is essential for binary classification problems.
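#
# A short NumPy sketch of the hypothesis hθ(x) = 1 / (1 + e^(-θᵀx)). The parameter vector
# theta_demo and the feature vector x_demo are made-up values for illustration; the first
# entry of x_demo is a constant 1 so that theta_demo[0] acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one input (leading 1.0 is the intercept term)
theta_demo = np.array([-1.0, 2.0, 0.5])
x_demo = np.array([1.0, 0.8, -0.4])

z_demo = theta_demo @ x_demo   # linear score θᵀx
p_demo = sigmoid(z_demo)       # interpreted as P(y = 1 | x)

print("Linear score:", z_demo)
print("Probability of class 1:", p_demo)
print("sigmoid(0):", sigmoid(0.0))   # a score of 0 sits on the decision boundary
```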

# Question 4: What is the cost function of Logistic Regression?
# Answer:
# The cost function is the log loss or cross-entropy loss, defined as:
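# J(θ) = -(1/m) ∑ᵢ [yᵢ log(hθ(xᵢ)) + (1 - yᵢ) log(1 - hθ(xᵢ))], summed over the m training examples
# This function penalizes confident wrong predictions heavily (the loss grows without bound as the predicted probability of the true class approaches 0) and is convex in θ, so gradient-based optimizers can reach the global minimum.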
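#
# A small sketch of the cost function, checked against sklearn.metrics.log_loss. The labels
# and probabilities (y_toy, p_good, p_bad) are made up; the clipping constant eps is an
# assumption added only to avoid log(0).

```python
import numpy as np
from sklearn.metrics import log_loss

def cross_entropy(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_toy = [1, 0, 1, 0]
p_good = [0.9, 0.1, 0.8, 0.2]    # every prediction on the correct side
p_bad = [0.9, 0.1, 0.8, 0.99]    # last prediction is confident and wrong

loss_good = cross_entropy(y_toy, p_good)
loss_bad = cross_entropy(y_toy, p_bad)

print("Loss (good predictions):", loss_good)
print("Loss (one confident mistake):", loss_bad)
print("scikit-learn log_loss (good predictions):", log_loss(y_toy, p_good))
```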

# Question 5: What is Regularization in Logistic Regression? Why is it needed?
# Answer:
# Regularization is a technique to prevent overfitting by adding a penalty term to the cost function.
# It discourages the model from learning overly complex or large parameter values.
# This helps in improving generalization to unseen data.

# Question 6: Explain the difference between Lasso, Ridge, and Elastic Net regression.
# Answer:
# - Lasso (L1 regularization): Adds λ∑|θj| to the cost. Can shrink some coefficients to zero, effectively selecting features.
# - Ridge (L2 regularization): Adds λ∑θj² to the cost. Shrinks coefficients but doesn’t eliminate them.
# - Elastic Net: Combines both L1 and L2 penalties. Useful when there are multiple correlated features.
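#
# A rough sketch of the sparsity difference between L1 and L2 on the breast cancer dataset.
# The choice of C=0.1, the liblinear solver, and standardizing the features first are
# assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_reg, y_reg = load_breast_cancer(return_X_y=True)
X_reg = StandardScaler().fit_transform(X_reg)   # put features on a common scale

zero_counts = {}
for pen in ("l1", "l2"):
    clf = LogisticRegression(penalty=pen, C=0.1, solver="liblinear", max_iter=1000)
    clf.fit(X_reg, y_reg)
    # Count coefficients that were driven exactly to zero
    zero_counts[pen] = int(np.sum(clf.coef_ == 0))
    print(f"{pen}: {zero_counts[pen]} of {clf.coef_.size} coefficients are exactly zero")
```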

# Question 7: When should we use Elastic Net instead of Lasso or Ridge?
# Answer:
# Use Elastic Net when:
# - You have many correlated features.
# - Lasso eliminates too many features.
# - Ridge retains too many irrelevant features.
# Elastic Net provides a balance between feature selection and coefficient shrinkage.

# Question 8: What is the impact of the regularization parameter (λ) in Logistic Regression?
# Answer:
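# - A higher λ increases regularization, shrinking coefficients more, which reduces overfitting but may lead to underfitting.
# - A lower λ reduces regularization, allowing the model to fit the training data more closely, increasing the risk of overfitting.
# - In scikit-learn, the strength is set through C, which is inversely proportional to λ: a smaller C means stronger regularization.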
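#
# A sketch of how regularization strength changes the fitted coefficients. It assumes
# scikit-learn's LogisticRegression, where the parameter C is inversely proportional to λ,
# and uses the breast cancer dataset with standardized features; the three C values are
# arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_lam, y_lam = load_breast_cancer(return_X_y=True)
X_lam = StandardScaler().fit_transform(X_lam)

coef_norms = []
for C_value in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C_value, max_iter=10000)   # default penalty is L2
    clf.fit(X_lam, y_lam)
    coef_norms.append(float(np.linalg.norm(clf.coef_)))
    print(f"C={C_value:<6} (1/C = {1 / C_value:g})  coefficient norm = {coef_norms[-1]:.3f}")
```

# Smaller C (larger λ) should give a smaller coefficient norm, i.e. stronger shrinkage.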

# Question 9: What are the key assumptions of Logistic Regression?
# Answer:
# - The dependent variable is binary (in the standard two-class setting).
# - Observations are independent.
# - There is little to no multicollinearity.
# - The relationship between independent variables and the log-odds is linear.
# - The sample size is large enough to give stable coefficient estimates.

# Question 10: What are some alternatives to Logistic Regression for classification tasks?
# Answer:
# - Decision Trees
# - Random Forest
# - Support Vector Machines
# - k-Nearest Neighbors
# - Naive Bayes
# - Gradient Boosting (e.g., XGBoost)
# - Neural Networks

# Question 11: What are Classification Evaluation Metrics?
# Answer:
# - Accuracy
# - Precision
# - Recall
# - F1 Score
# - ROC-AUC
# - Confusion Matrix
# - Log Loss

# Question 12: How does class imbalance affect Logistic Regression?
# Answer:
# Class imbalance can cause the model to be biased toward the majority class,
# resulting in poor recall and precision for the minority class.
# Techniques like resampling or class weights help, and metrics such as precision, recall, F1, or ROC-AUC are more informative than accuracy here.
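#
# A sketch of the effect on a synthetic imbalanced dataset. The data come from
# make_classification with an assumed 95/5 class split, and all variable names are
# illustrative. It compares minority-class recall with and without class weights.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 95% class 0, 5% class 1
X_imb, y_imb = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imb, y_imb, test_size=0.3, stratify=y_imb, random_state=0
)

recalls = {}
for label, cw in (("unweighted", None), ("balanced", "balanced")):
    clf = LogisticRegression(class_weight=cw, max_iter=1000).fit(Xi_train, yi_train)
    # Recall of the minority class (label 1)
    recalls[label] = recall_score(yi_test, clf.predict(Xi_test))
    print(f"{label}: minority-class recall = {recalls[label]:.3f}")
```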

# Question 13: What is Hyperparameter Tuning in Logistic Regression?
# Answer:
# It’s the process of finding the best values for hyperparameters such as regularization strength (λ, set through C in scikit-learn),
# penalty type, solver, or max iterations using methods like Grid Search or Random Search with cross-validation.

# Question 14: What are different solvers in Logistic Regression? Which one should be used?
# Answer:
# Common solvers:
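# - liblinear: Good for small datasets; supports L1 and L2 regularization; handles multiclass problems only through one-vs-rest.
# - lbfgs: The scikit-learn default; supports L2 (or no) regularization and multinomial multiclass problems.
# - saga: Supports L1, L2, and Elastic Net; works with large data and converges faster when features are scaled.
# - newton-cg: Supports L2 (or no) regularization and multinomial multiclass problems.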
# Choose based on data size, regularization type, and computational resources.

# Question 15: How is Logistic Regression extended for multiclass classification?
# Answer:
# - One-vs-Rest (OvR): Trains a separate classifier for each class vs. all others.
# - Multinomial Logistic Regression (Softmax): Generalizes logistic regression to handle multiple classes in one model using the softmax function.

# Question 16: What are the advantages and disadvantages of Logistic Regression?
# Answer:
# Advantages:
# - Simple and fast
# - Probabilistic output
# - Works well with linearly separable data
# Disadvantages:
# - Assumes linearity in log-odds
# - Cannot capture non-linear relationships unless features are engineered (e.g., polynomial or interaction terms)
# - Sensitive to outliers and multicollinearity

# Question 17: What are some use cases of Logistic Regression?
# Answer:
# - Spam detection
# - Disease diagnosis
# - Credit scoring
# - Customer churn prediction
# - Fraud detection

# Question 18: What is the difference between Softmax Regression and Logistic Regression?
# Answer:
# - Logistic Regression: Used for binary classification.
# - Softmax Regression: Used for multiclass classification.
# Softmax assigns a probability to each class, and these probabilities sum to 1.
# With exactly two classes, Softmax Regression reduces to Logistic Regression.
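#
# A small NumPy sketch of the softmax function (class_scores is a made-up score vector).
# It also checks that, with two classes, softmax gives the same probability as the sigmoid
# applied to the difference of the two scores.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical scores for three classes
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)
print("Class probabilities:", probs)
print("Sum of probabilities:", probs.sum())

# Two-class case: softmax([z, 0])[0] equals sigmoid(z)
two_class = softmax([1.5, 0.0])
sigmoid_equiv = 1.0 / (1.0 + np.exp(-1.5))
print("Softmax (two classes):", two_class[0], "Sigmoid:", sigmoid_equiv)
```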

# Question 19: How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?
# Answer:
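# - Use OvR when you want one independent binary classifier per class, for example with a solver that only handles binary problems (such as liblinear) or when per-class models are easier to inspect.
# - Use Softmax (multinomial) when classes are mutually exclusive; it fits all classes jointly and yields probabilities that sum to 1.
# It is often more accurate for true multiclass settings.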

# Question 20: How do we interpret coefficients in Logistic Regression?
# Answer:
# Each coefficient θj represents the change in the log-odds of the outcome for a one-unit increase in the corresponding feature xj.
# Exponentiating θj gives the odds ratio:
# e^(θj) = multiplicative change in odds
# Values >1 increase odds; values <1 decrease odds.
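#
# A sketch of turning fitted coefficients into odds ratios on the breast cancer dataset.
# Standardizing the features first is an assumption of this example, so "one-unit increase"
# here means an increase of one standard deviation in that feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data_or = load_breast_cancer()
X_or = StandardScaler().fit_transform(data_or.data)
clf_or = LogisticRegression(max_iter=10000).fit(X_or, data_or.target)

# exp(θj) is the multiplicative change in odds per one-unit increase in feature j
odds_ratios = np.exp(clf_or.coef_[0])

# Show the first five features only, to keep the output short
rows = list(zip(data_or.feature_names, clf_or.coef_[0], odds_ratios))[:5]
for name, theta_j, ratio in rows:
    print(f"{name:<25} theta = {theta_j:+.3f}   odds ratio = {ratio:.3f}")
```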


#Practical

# Question 1: Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.
# Answer:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Logistic Regression
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
# Predict and print accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Question 2: Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy.
# Answer:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy with L1 regularization:", accuracy_score(y_test, y_pred))

#Question 3: Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression(penalty='l2'). Print model accuracy and coefficients.
#Answer:
model = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy with L2 regularization:", accuracy_score(y_test, y_pred))
print("Coefficients:", model.coef_)

#Question 4: Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet').
#Answer:
model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy with Elastic Net:", accuracy_score(y_test, y_pred))

#Question 5: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'.
#Answer:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
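# Note: the multi_class parameter is deprecated as of scikit-learn 1.5;
# OneVsRestClassifier(LogisticRegression(...)) is the suggested replacement for OvR.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=10000)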
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Multiclass Accuracy (OvR):", accuracy_score(y_test, y_pred))

# Question 6: Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy.
#Answer:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # 'liblinear' supports both l1 and l2
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Question 7: Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy.
#Answer:
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=skf)
print("Stratified K-Fold Accuracy Scores:", scores)
print("Average Accuracy:", scores.mean())

# Question 8: Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its accuracy.
# Answer:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('your_dataset.csv')  # Replace with actual path
X = df.drop('target', axis=1)
y = LabelEncoder().fit_transform(df['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Question 9: Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in Logistic Regression. Print the best parameters and accuracy.
# Answer:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
param_dist = {
    'C': np.logspace(-3, 3, 10),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
random_search = RandomizedSearchCV(LogisticRegression(max_iter=10000), param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print("Best Parameters:", random_search.best_params_)
print("Best Accuracy:", random_search.best_score_)

# Question 10: Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy.
# Answer:
from sklearn.multiclass import OneVsOneClassifier
model = OneVsOneClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("One-vs-One Accuracy:", accuracy_score(y_test, y_pred))

#Question 11: Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary classification.
#Answer:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
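# Reload a binary dataset so this example does not depend on data loaded in earlier answers
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=10000)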
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix")
plt.show()

#Question 12: Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score.
#Answer:
from sklearn.metrics import precision_score, recall_score, f1_score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

#Question 13: Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance.
#Answer:
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
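print("Accuracy with Class Weights:", accuracy_score(y_test, y_pred))
# Accuracy alone can hide poor minority-class performance, so also report per-class metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))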

#Question 14: Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and evaluate performance.
#Answer:
import seaborn as sns
df = sns.load_dataset("titanic")
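# Handle missing values by dropping rows with a missing entry in any selected column
df = df[['sex', 'age', 'fare', 'embarked', 'survived']].dropna()
# Encode the categorical columns as integers
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embarked'] = df['embarked'].map({'S': 0, 'C': 1, 'Q': 2})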
X = df.drop('survived', axis=1)
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Titanic Data Accuracy:", accuracy_score(y_test, model.predict(X_test)))

#Question 15: Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model. Evaluate its accuracy and compare results with and without scaling.
#Answer:
from sklearn.preprocessing import StandardScaler
# Without Scaling
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
acc_no_scaling = accuracy_score(y_test, model.predict(X_test))
# With Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, model_scaled.predict(X_test_scaled))
print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling:", acc_scaled)

#Question 16: Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score.
#Answer:
from sklearn.metrics import roc_auc_score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))

#Question 17: Write a Python program to train Logistic Regression using a custom learning rate (C=0.5) and evaluate accuracy.
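#Answer:
# Note: C is the inverse of the regularization strength, not a learning rate; a smaller C means stronger regularization.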
model = LogisticRegression(C=0.5, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy with C=0.5:", accuracy_score(y_test, y_pred))

#Question 18: Write a Python program to train Logistic Regression and identify important features based on model coefficients.
#Answer:
import numpy as np
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Rank features by absolute coefficient size (comparable only if features share a scale)
importance = np.abs(model.coef_[0])
for i in np.argsort(importance)[::-1]:
    print(f"Feature {i}: Coefficient = {model.coef_[0][i]:.4f}")

#Question 19: Write a Python program to train Logistic Regression and evaluate its performance using Cohen’s Kappa Score.
#Answer:
from sklearn.metrics import cohen_kappa_score
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Cohen's Kappa Score:", cohen_kappa_score(y_test, y_pred))

#Question 20: Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary classification.
#Answer:
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)
disp = PrecisionRecallDisplay(precision=precision, recall=recall)
disp.plot()
plt.title("Precision-Recall Curve")
plt.show()
