# Theoretical Questions

---

### **1. What is Logistic Regression, and how does it differ from Linear Regression?**  
Logistic Regression is a classification algorithm that models the probability of a class by passing a linear combination of the features through the sigmoid function. Unlike Linear Regression, which predicts continuous values, Logistic Regression outputs a probability between 0 and 1 that is then thresholded (commonly at 0.5) to assign class 0 or 1.  
**Example:** Predicting if an email is spam (1) or not (0), whereas Linear Regression predicts house prices.

---

### **2. What is the mathematical equation of Logistic Regression?**  
The equation is:  
\[
P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}}
\]  
Where \( P(Y=1|X) \) is the probability of class 1, and \( \beta \) are the model coefficients.
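
As a rough illustration of the equation above, the sketch below evaluates \( P(Y=1|X) \) for two rows of input. The coefficient values and the helper name `predict_proba` are made up for this example; they are not taken from any fitted model.

```python
import numpy as np

def predict_proba(X, beta0, beta):
    """Return P(Y=1|X) for each row of X under the logistic model."""
    z = beta0 + X @ beta              # linear combination, i.e. the log-odds
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid maps the log-odds into (0, 1)

# Hypothetical intercept and coefficients (illustrative values only)
beta0 = -1.0
beta = np.array([0.8, -0.5])

# Two example observations with two features each
X = np.array([[2.0, 1.0],
              [0.0, 0.0]])

probs = predict_proba(X, beta0, beta)
print(probs)
```

The second row has all features at zero, so its probability depends only on the intercept \( \beta_0 \).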

---

### **3. Why do we use the Sigmoid function in Logistic Regression?**  
The sigmoid function maps any real number to a range between 0 and 1, making it suitable for probability estimation. It helps classify inputs into two categories.  
**Example:** A probability of 0.8 means the model is 80% confident the instance belongs to class 1.

---

### **4. What is the cost function of Logistic Regression?**  
Logistic Regression uses the **log loss (cross-entropy loss)**:  
\[
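J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(i)} + (1 - y^{(i)}) \log(1 - h^{(i)}) \right]
\]  
where \( m \) is the number of training examples, \( y^{(i)} \) is the true label of example \( i \), \( h^{(i)} \) is its predicted probability of class 1, and \( \theta \) denotes the model coefficients (the \( \beta \) values above). The loss grows sharply when the model is confident and wrong.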
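
To make the penalty on confident mistakes concrete, here is a small sketch that computes the log loss by hand and checks it against `sklearn.metrics.log_loss`. The labels and probabilities are invented for illustration, and `binary_log_loss` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_log_loss(y, h, eps=1e-15):
    """Mean cross-entropy between labels y and predicted probabilities h."""
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y_true = np.array([1, 0, 1, 0])
h_good = np.array([0.9, 0.1, 0.8, 0.2])   # reasonable predictions
h_bad = np.array([0.9, 0.1, 0.8, 0.99])   # last prediction is confidently wrong

loss_good = binary_log_loss(y_true, h_good)
loss_bad = binary_log_loss(y_true, h_bad)

print(f"Loss (good predictions): {loss_good:.4f}")
print(f"Loss (one confident mistake): {loss_bad:.4f}")
print(f"sklearn log_loss (good): {log_loss(y_true, h_good):.4f}")
```

Changing a single prediction from 0.2 to 0.99 for a true label of 0 is enough to raise the average loss substantially.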

---

### **5. What is Regularization in Logistic Regression? Why is it needed?**  
Regularization (L1/L2) prevents overfitting by adding a penalty to large coefficients. It ensures the model generalizes well to unseen data.  
**Example:** Without regularization, a model may perfectly fit training data but fail on new data.

---

### **6. Explain the difference between Lasso, Ridge, and Elastic Net regression.**  
- **Lasso (L1):** Can shrink some coefficients exactly to zero, which effectively performs feature selection.  
- **Ridge (L2):** Reduces coefficient values but doesn’t eliminate them.  
- **Elastic Net:** Combines L1 and L2, balancing feature selection and coefficient shrinkage.
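
The sparsity difference between L1 and L2 can be seen in a quick experiment. This is only a sketch on synthetic data from `make_classification`; the sample size, feature counts, and `C=0.05` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem: 3 informative features, 17 pure-noise features
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Same (fairly strong) regularization strength, different penalty type
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

# Count coefficients that are exactly zero
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print(f"Zero coefficients with L1: {n_zero_l1} of {X.shape[1]}")
print(f"Zero coefficients with L2: {n_zero_l2} of {X.shape[1]}")
```

With L1, many of the noise features typically end up with a coefficient of exactly zero, while L2 only makes them small.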

---

### **7. When should we use Elastic Net instead of Lasso or Ridge?**  
Use **Elastic Net** when the data has many correlated features. Lasso alone tends to keep one feature from a correlated group somewhat arbitrarily and drop the rest, while Elastic Net's added L2 term spreads weight across the group, which makes the selection more stable.

---

### **8. What is the impact of the regularization parameter (λ) in Logistic Regression?**  
- **High \( \lambda \):** Stronger regularization and smaller coefficients; reduces overfitting but can underfit if set too high.  
- **Low \( \lambda \):** Weaker regularization; fits the training data more closely but may overfit.  
- **Zero \( \lambda \):** No regularization, i.e. plain maximum-likelihood Logistic Regression.
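
In scikit-learn the regularization strength is exposed as `C`, the inverse of \( \lambda \), so a small `C` means strong regularization. The sketch below, on synthetic data with arbitrarily chosen `C` values, shows how the size of the coefficient vector changes with `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is L2; C = 1 / lambda
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: coefficient norm = {norms[C]:.4f}")
```

The norm should grow as `C` increases, since the penalty on large coefficients gets weaker.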

---

### **9. What are the key assumptions of Logistic Regression?**  
1. **Linearity in log-odds:** The relationship between predictors and log-odds is linear.  
2. **No multicollinearity:** Independent variables should not be highly correlated.  
3. **Large sample size:** More data improves model stability.  
4. **Independent observations:** No autocorrelation in data.

---

### **10. What are some alternatives to Logistic Regression for classification tasks?**  
1. **Decision Trees**  
2. **Random Forest**  
3. **Support Vector Machines (SVM)**  
4. **Naïve Bayes**  
5. **Neural Networks**  
6. **Gradient Boosting (XGBoost, LightGBM, CatBoost)**  

---

### **11. What are Classification Evaluation Metrics?**  
- **Accuracy:** Correct predictions/Total predictions.  
- **Precision:** \( TP / (TP + FP) \).  
- **Recall:** \( TP / (TP + FN) \).  
- **F1-score:** Harmonic mean of Precision and Recall.  
- **ROC-AUC:** Measures classifier performance across different thresholds.
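
The formulas above can be checked on a tiny hand-made example. The labels below are invented for illustration; the manual counts are compared with scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion-matrix counts computed by hand
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")

# Cross-check against scikit-learn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```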

---

### **12. How does class imbalance affect Logistic Regression?**  
Class imbalance skews probability estimates, making the model biased toward the majority class. **Solutions:**  
- Use **balanced class weights**.  
- **Oversample the minority class** (SMOTE).  
- **Undersample the majority class**.  
- Use different **evaluation metrics** like F1-score and AUC.
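
For the first option, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. A minimal sketch with a made-up 90/10 label split shows the resulting weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn would use for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same formula written out by hand
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

The minority class gets a weight of 5.0 and the majority class about 0.56, so each minority example counts roughly nine times as much in the loss.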

---

### **13. What is Hyperparameter Tuning in Logistic Regression?**  
Hyperparameter tuning optimizes model performance by adjusting parameters like:  
- **Regularization strength (\(\lambda\))**  
- **Solver choice (liblinear, saga, lbfgs, newton-cg)**  
- **Class weights**  
Use **GridSearchCV** or **RandomizedSearchCV** to find the best values.

---

### **14. What are different solvers in Logistic Regression? Which one should be used?**  
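- **liblinear:** Good for small datasets; supports L1 and L2 penalties.  
- **lbfgs & newton-cg:** Handle multiclass (multinomial) problems well; support L2 or no penalty.  
- **saga:** Works well with large datasets and supports L1, L2, and Elastic Net penalties.  

As a rule of thumb, use **liblinear** for small binary problems and **lbfgs** (scikit-learn's default) for multiclass.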

---

### **15. How is Logistic Regression extended for multiclass classification?**  
Two common approaches:  
- **One-vs-Rest (OvR):** Trains one classifier per class.  
- **Softmax Regression:** Uses a single model to assign probabilities to multiple classes.  
**Example:** Classifying handwritten digits (0-9).

---

### **16. What are the advantages and disadvantages of Logistic Regression?**  

**Advantages:**  
✅ Simple and easy to interpret.  
✅ Works well with small datasets.  
✅ Probabilistic output.  

**Disadvantages:**  
❌ Assumes linear decision boundary.  
❌ Can overfit high-dimensional data unless regularized.  
❌ Sensitive to multicollinearity.

---

### **17. What are some use cases of Logistic Regression?**  
- **Medical Diagnosis:** Predicting disease risk (e.g., diabetes).  
- **Fraud Detection:** Identifying fraudulent transactions.  
- **Marketing:** Predicting customer churn.  
- **HR Analytics:** Employee attrition prediction.

---

### **18. What is the difference between Softmax Regression and Logistic Regression?**  
- **Logistic Regression:** Used for **binary classification** (0 or 1).  
- **Softmax Regression:** Extends Logistic Regression for **multiclass problems** by normalizing probabilities across multiple classes.
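
The normalization step can be sketched in a few lines of NumPy. The class scores below are arbitrary, and the helper names `softmax` and `sigmoid` are defined here purely for illustration; the last part checks that softmax over two classes reduces to the sigmoid.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for three classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and the second score fixed at 0, softmax matches the sigmoid
z = 0.7
two_class = softmax(np.array([z, 0.0]))
print(two_class[0], sigmoid(z))
```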

---

### **19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?**  
- **OvR:** Preferred when the dataset is small and training time is a concern.  
- **Softmax:** Works better for **mutually exclusive** classes and large datasets.  

---

### **20. How do we interpret coefficients in Logistic Regression?**  
Each coefficient represents the change in log-odds of the outcome for a one-unit increase in the predictor, keeping others constant.  
**Example:** If \( \beta_1 = 0.5 \), increasing \( X_1 \) by 1 multiplies the odds of success by \( e^{0.5} \approx 1.65 \), i.e. raises them by about 65%.  
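
The odds interpretation can be verified numerically. In this sketch the intercept, the coefficient, and the two values of \( X_1 \) are hypothetical; the point is that the ratio of odds for a one-unit increase equals \( e^{\beta_1} \) regardless of the starting value.

```python
import numpy as np

beta0, beta1 = -1.0, 0.5   # hypothetical intercept and coefficient

def odds(x):
    """Odds p / (1 - p) of the positive class at feature value x."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)

# Ratio of odds when X_1 goes from 2 to 3
odds_ratio = odds(3.0) / odds(2.0)

print(f"Odds ratio for a one-unit increase: {odds_ratio:.4f}")
print(f"exp(beta1): {np.exp(beta1):.4f}")
```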



In [None]:
#1. Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predict
y_pred = log_reg.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")


In [None]:
#2. Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with L1 regularization
log_reg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=200)
log_reg_l1.fit(X_train, y_train)

# Predict
y_pred = log_reg_l1.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with L1 Regularization: {accuracy:.4f}")


In [None]:
#3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression(penalty='l2'). Print model accuracy and coefficients.
log_reg_l2 = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
log_reg_l2.fit(X_train, y_train)

y_pred = log_reg_l2.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy with L2 Regularization: {accuracy:.4f}")
print("Coefficients:", log_reg_l2.coef_)


In [None]:
#4.Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet').
log_reg_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=200)
log_reg_en.fit(X_train, y_train)

y_pred = log_reg_en.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Elastic Net Regularization: {accuracy:.4f}")


In [None]:
# 5.Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'.
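from sklearn.multiclass import OneVsRestClassifier

# multi_class='ovr' is deprecated in recent scikit-learn releases (1.5+);
# wrapping the estimator in OneVsRestClassifier gives the same one-vs-rest scheme
log_reg_ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=200))
log_reg_ovr.fit(X_train, y_train)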

y_pred = log_reg_ovr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy using One-vs-Rest (OvR): {accuracy:.4f}")


In [None]:
#6. Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
grid_search = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print(f"Best Accuracy: {grid_search.best_score_:.4f}")


In [None]:
#7. Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
log_reg = LogisticRegression(max_iter=200)

scores = cross_val_score(log_reg, X, y, cv=skf, scoring='accuracy')
print(f"Average Accuracy: {scores.mean():.4f}")


In [None]:
#8. Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its accuracy.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Replace 'your_dataset.csv' with the actual path to your dataset
# If the file is in the same directory as the notebook, just use the filename
# Example: df = pd.read_csv('data.csv')
# or provide the full path if in a different directory.
# Example: df = pd.read_csv('/path/to/your/dataset.csv')
df = pd.read_csv('your_dataset.csv')

X = df.drop(columns=['target'])  # Assuming 'target' is your target variable column
y = df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=200)  # Increase max_iter if needed
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

In [None]:
#9. Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in Logistic Regression. Print the best parameters and accuracy.
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga']}
rand_search = RandomizedSearchCV(LogisticRegression(max_iter=200), param_dist, cv=5, scoring='accuracy', n_iter=10)
rand_search.fit(X_train, y_train)

print("Best Parameters:", rand_search.best_params_)
print(f"Best Accuracy: {rand_search.best_score_:.4f}")


In [None]:
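#10. Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy
from sklearn.multiclass import OneVsOneClassifier

# LogisticRegression has no multi_class='ovo' option (it raises an error);
# OneVsOneClassifier trains one binary model per pair of classes instead
log_reg_ovo = OneVsOneClassifier(LogisticRegression(solver='lbfgs', max_iter=200))
log_reg_ovo.fit(X_train, y_train)

y_pred = log_reg_ovo.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy using One-vs-One (OvO): {accuracy:.4f}")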


In [None]:
#11. Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary classification.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()


In [None]:
#12. Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score.
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1-score: {f1:.4f}")


In [None]:
# 13. Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance.
log_reg_balanced = LogisticRegression(class_weight='balanced', max_iter=200)
log_reg_balanced.fit(X_train, y_train)

y_pred = log_reg_balanced.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Balanced Class Weights: {accuracy:.4f}")


In [None]:
#14. Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and evaluate performance.
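titanic = pd.read_csv("titanic.csv")
# Fill missing numeric values with column means (numeric_only avoids errors on text columns)
titanic.fillna(titanic.mean(numeric_only=True), inplace=True)
# Keep only numeric predictors; text columns would need encoding before they can be used
X = titanic.drop(columns=['Survived']).select_dtypes(include='number')
y = titanic['Survived']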

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

print(f"Model Accuracy: {accuracy_score(y_test, log_reg.predict(X_test)):.4f}")


In [None]:
#15. Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model. Evaluate its accuracy and compare results with and without scaling.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

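log_reg_scaled = LogisticRegression(max_iter=200)
log_reg_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = log_reg_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Baseline on the same split without scaling, so the two numbers are comparable
log_reg_raw = LogisticRegression(max_iter=200)
log_reg_raw.fit(X_train, y_train)
accuracy_raw = accuracy_score(y_test, log_reg_raw.predict(X_test))

print(f"Accuracy without Scaling: {accuracy_raw:.4f}")
print(f"Accuracy with Scaling: {accuracy_scaled:.4f}")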


In [None]:
#16.Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score.
from sklearn.metrics import roc_auc_score

# Ensure log_reg is trained on the correct data (Titanic in this case)
log_reg.fit(X_train, y_train)  # Fit the model to the Titanic data

# Get probabilities for all classes (2D array)
y_prob = log_reg.predict_proba(X_test)

# Use roc_auc_score for binary classification without multi_class parameter
roc_auc = roc_auc_score(y_test, y_prob[:, 1])

print(f"ROC-AUC Score: {roc_auc:.4f}")

In [None]:
#17. Write a Python program to train Logistic Regression using a custom regularization setting (C=0.5) and evaluate accuracy.
# Note: C is the inverse of the regularization strength, not a learning rate.
log_reg_custom = LogisticRegression(C=0.5, max_iter=200)
log_reg_custom.fit(X_train, y_train)

y_pred_custom = log_reg_custom.predict(X_test)
accuracy_custom = accuracy_score(y_test, y_pred_custom)

print(f"Model Accuracy with C=0.5: {accuracy_custom:.4f}")


In [None]:
#18. Write a Python program to train Logistic Regression and identify important features based on model coefficients.
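# Coefficients of the fitted model; magnitudes are comparable only if features share a scale
feature_importance = pd.Series(log_reg.coef_[0], index=X.columns)
# Rank by absolute value so strong negative effects are not pushed to the bottom
feature_importance = feature_importance.sort_values(key=abs, ascending=False)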

print("Top Important Features:")
print(feature_importance)


In [None]:
#19. Write a Python program to train Logistic Regression and evaluate its performance using Cohen’s Kappa Score.
from sklearn.metrics import cohen_kappa_score

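# Recompute predictions so y_pred matches the current model and test split
y_pred = log_reg.predict(X_test)
kappa_score = cohen_kappa_score(y_test, y_pred)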
print(f"Cohen’s Kappa Score: {kappa_score:.4f}")


In [None]:
#20. Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary classification.
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
import numpy as np

# y_test and y_prob come from the ROC-AUC cell above (predict_proba output)
# Choose the class to treat as positive (e.g., class 1)
pos_class = 1

# Convert y_test to binary for the chosen class
y_test_binary = np.where(y_test == pos_class, 1, 0)

# Get probabilities for the positive class only
y_prob_binary = y_prob[:, pos_class]

precision, recall, _ = precision_recall_curve(y_test_binary, y_prob_binary)
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve for Class {}'.format(pos_class))
plt.show()

In [None]:
#21. Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare their accuracy.
solvers = ['liblinear', 'saga', 'lbfgs']
for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=200)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Accuracy with solver {solver}: {acc:.4f}")


In [None]:
#22. Write a Python program to train Logistic Regression and evaluate its performance using Matthews Correlation Coefficient (MCC).
from sklearn.metrics import matthews_corrcoef

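# Recompute predictions so y_pred matches the current model and test split
y_pred = log_reg.predict(X_test)
mcc = matthews_corrcoef(y_test, y_pred)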
print(f"Matthews Correlation Coefficient: {mcc:.4f}")


In [None]:
# 23.Write a Python program to train Logistic Regression on both raw and standardized data. Compare their accuracy to see the impact of feature scaling.
# Import necessary libraries if not already imported
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assuming X_train, X_test, y_train, y_test are already defined

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the fitted scaler
X_test_scaled = scaler.transform(X_test)

# Create and train the Logistic Regression model on scaled data
log_reg_scaled = LogisticRegression(max_iter=200)
log_reg_scaled.fit(X_train_scaled, y_train)

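# Accuracy of the model trained on standardized features
acc_scaled = accuracy_score(y_test, log_reg_scaled.predict(X_test_scaled))

# Train the same model on the raw features for comparison
log_reg_raw = LogisticRegression(max_iter=200)
log_reg_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, log_reg_raw.predict(X_test))

print(f"Accuracy on raw data: {acc_raw:.4f}")
print(f"Accuracy on standardized data: {acc_scaled:.4f}")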

In [None]:
#24. Write a Python program to train Logistic Regression and find the optimal C (regularization strength) using cross-validation.
from sklearn.model_selection import cross_val_score

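C_values = [0.01, 0.1, 1, 10]
best_C, best_score = None, -1.0
for C in C_values:
    model = LogisticRegression(C=C, max_iter=200)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"C={C}, Mean Accuracy: {scores.mean():.4f}")
    # Keep the C with the highest mean cross-validated accuracy
    if scores.mean() > best_score:
        best_C, best_score = C, scores.mean()

print(f"Optimal C: {best_C} (Mean Accuracy: {best_score:.4f})")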


In [None]:
#25. Write a Python program to train Logistic Regression, save the trained model using joblib, and load it again to make predictions.
import joblib

# Save the model trained in an earlier cell
joblib.dump(log_reg, 'logistic_model.pkl')

# Load model
loaded_model = joblib.load('logistic_model.pkl')

# Make predictions
y_pred_loaded = loaded_model.predict(X_test)
print(f"Accuracy of Loaded Model: {accuracy_score(y_test, y_pred_loaded):.4f}")
