#**LOGISTIC REGRESSION**

##**THEORETICAL QUESTIONS:**
### 1. **What is Logistic Regression, and how does it differ from Linear Regression?**

**Answer:** Logistic Regression is a supervised learning algorithm used for **classification tasks**, predicting the probability of a binary outcome (0 or 1). Unlike Linear Regression, which predicts continuous numerical values, Logistic Regression outputs probabilities constrained between **0 and 1** using the **sigmoid function**.

---

### 2. **What is the mathematical equation of Logistic Regression?**

**Answer:**

$$
P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}}
$$

Or more simply:

$$
P = \sigma(Z) = \frac{1}{1 + e^{-Z}}
$$

where $Z = \beta_0 + \sum_{i=1}^{n} \beta_i X_i$
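
The equation can be traced numerically with a minimal NumPy sketch. The intercept, weights, and observations below are hypothetical values chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0 = -1.0                    # intercept
beta = np.array([0.8, -0.5])     # one weight per feature

# Two made-up observations with two features each
X = np.array([[2.0, 1.0],
              [0.0, 3.0]])

Z = beta_0 + X @ beta            # linear score: beta_0 + sum(beta_i * X_i)
P = sigmoid(Z)                   # P(Y=1 | X) for each observation

print("Z:", Z)
print("P(Y=1|X):", P)
```

A positive score gives a probability above 0.5 and a negative score gives one below 0.5, which is why 0.5 is the usual default decision threshold.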

---

### 3. **Why do we use the Sigmoid function in Logistic Regression?**

**Answer:** The sigmoid function transforms any real number into a value between **0 and 1**, which represents a **probability**. This is essential for classification problems.

---

### 4. **What is the cost function of Logistic Regression?**

**Answer:** Logistic Regression uses **Binary Cross-Entropy (Log Loss)**:

$$
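Cost = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i) \right]
$$

where $p_i = \sigma(Z_i)$ is the predicted probability for observation $i$ and $N$ is the number of observations.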
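
As a rough check of the formula, the sketch below computes the loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up, and the clipping constant `eps` is an assumption added only to avoid `log(0)`.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, p, eps=1e-15):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 1, 0]
p_hat = [0.9, 0.2, 0.6, 0.3, 0.1]

manual = binary_cross_entropy(y_true, p_hat)
reference = log_loss(y_true, p_hat)

print("Manual log loss: ", manual)
print("sklearn log_loss:", reference)
```

The loss grows sharply for confident but wrong predictions, which is what pushes the model toward well-calibrated probabilities.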

---

### 5. **What is Regularization in Logistic Regression? Why is it needed?**

**Answer:** Regularization is a technique used to **prevent overfitting** by penalizing large coefficients in the model. It keeps the model simple and generalizable.

---

### 6. **Explain the difference between Lasso, Ridge, and Elastic Net regression.**

**Answer:**

| Regression     | Penalty Term                               | Effect                                                |
| -------------- | ------------------------------------------ | ----------------------------------------------------- |
| **Ridge**      | L2 ($\lambda \sum \beta_i^2$)              | Shrinks coefficients but keeps all variables          |
| **Lasso**      | L1 ($\lambda \sum \lvert\beta_i\rvert$)    | Some coefficients become **zero** (feature selection) |
| **ElasticNet** | L1 + L2                                    | Combines Ridge and Lasso for balance                  |
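
The difference in the "Effect" column can be seen in a small sketch. The synthetic dataset, the value `C=0.05`, and the `liblinear` solver are assumptions picked for illustration; the exact counts will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features and 17 pure-noise features
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Coefficients set exactly to zero with L1:", n_zero_l1)
print("Coefficients set exactly to zero with L2:", n_zero_l2)
```

With L1 the noise features tend to be dropped entirely, while L2 only shrinks them toward zero.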

---

### 7. **When should we use Elastic Net instead of Lasso or Ridge?**

**Answer:** Use Elastic Net when:

* Features are **highly correlated**.
* You need both **feature selection** (like Lasso) and **shrinkage** (like Ridge).

---

### 8. **What is the impact of the regularization parameter (C) in Logistic Regression?**

**Answer:**

* **C is the inverse of regularization strength** (`C = 1/λ`).
* A **high C** means less regularization (risk of overfitting).
* A **low C** means stronger regularization (risk of underfitting).
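
One way to see the effect of `C` is to watch the size of the fitted coefficients as it changes. This is only a sketch: the Iris data, the three `C` values, and the use of the coefficient norm as a summary are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

norms = {}
for C in [0.01, 1, 100]:
    # Smaller C means a stronger L2 penalty on the coefficients
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:>6}: coefficient norm = {norms[C]:.3f}")
```

Small `C` keeps the coefficients close to zero, while large `C` lets them grow to fit the training data more closely.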

---

### 9. **What are the key assumptions of Logistic Regression?**

**Answer:**

* **Linearity** in the log-odds.
* **No multicollinearity** among independent variables.
* **Independence** of observations.
* **Large sample size** preferred for stability.

---

### 10. **What are some alternatives to Logistic Regression for classification tasks?**

**Answer:**

* Decision Trees
* Random Forest
* Gradient Boosting (e.g., XGBoost, LightGBM)
* Support Vector Machines (SVM)
* K-Nearest Neighbors (KNN)
* Naive Bayes
* Neural Networks

---

### 11. **What are Classification Evaluation Metrics?**

**Answer:**

* **Accuracy**
* **Precision**
* **Recall (Sensitivity)**
* **F1-Score**
* **AUC-ROC**
* **Log Loss**
* **Confusion Matrix**

---

### 12. **How does class imbalance affect Logistic Regression?**

**Answer:** Logistic Regression may be biased toward the **majority class**, leading to poor recall for the minority class. Solutions include:

* **`class_weight='balanced'`** parameter
* **SMOTE** (Synthetic Minority Oversampling Technique)
* Focus on **Recall, Precision, F1-Score** rather than Accuracy.
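
To make the `class_weight='balanced'` option less of a black box, the sketch below reproduces its weights by hand using the formula from the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. The 90/10 label split is a made-up example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual_weights = len(y) / (len(classes) * np.bincount(y))
sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

print("Manual weights: ", manual_weights)
print("sklearn weights:", sklearn_weights)
```

The minority class receives the larger weight, so each of its errors costs more in the loss.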

---

### 13. **What is Hyperparameter Tuning in Logistic Regression?**

**Answer:** It is the process of finding the best set of parameters like:

* **C** (regularization strength)
* **Penalty** (`l1`, `l2`, `elasticnet`)
* **Solver** (`liblinear`, `saga`, `lbfgs`)

Methods used include **Grid Search**, **Random Search**, and **Bayesian Optimization**.

---

### 14. **What are different solvers in Logistic Regression? Which one should be used?**

**Answer:**

| Solver        | Supports           | Best For                                 |
| ------------- | ------------------ | ---------------------------------------- |
| **liblinear** | L1, L2             | Small datasets, binary classification    |
| **lbfgs**     | L2                 | Multinomial, medium-sized datasets       |
| **newton-cg** | L2                 | Multinomial                              |
| **sag**       | L2                 | Large datasets, fast for large samples   |
| **saga**      | L1, L2, ElasticNet | Large datasets, multinomial, sparse data |

---

### 15. **How is Logistic Regression extended for multiclass classification?**

**Answer:**

* **One-vs-Rest (OvR):** Fits one binary classifier per class.
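* **Softmax (Multinomial Logistic Regression):** Models all classes simultaneously. In scikit-learn, use `multi_class='multinomial'` with `solver='lbfgs'` or `saga`; note that the `multi_class` parameter is deprecated as of scikit-learn 1.5, where multinomial is already the default for multiclass targets with these solvers.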
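
The two strategies can be compared side by side with a short sketch. It assumes a recent scikit-learn, where a plain `LogisticRegression` with the default `lbfgs` solver fits the multinomial model on multiclass targets, and it uses `OneVsRestClassifier` as one way to get OvR explicitly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# OvR: one independent binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Softmax: a single model over all classes (default for lbfgs on multiclass targets)
softmax_model = LogisticRegression(max_iter=1000).fit(X, y)

print("Binary classifiers fitted by OvR:", len(ovr.estimators_))
print("Softmax coefficient matrix shape:", softmax_model.coef_.shape)
print("Softmax probabilities, first sample:", softmax_model.predict_proba(X[:1]))
```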

---

### 16. **What are the advantages and disadvantages of Logistic Regression?**

**Answer:**

| Advantages                    | Disadvantages                            |
| ----------------------------- | ---------------------------------------- |
| Simple and easy to interpret  | Assumes linear relationship in log-odds  |
| Fast to train                 | Not suitable for complex non-linear data |
| Outputs probabilities         | Sensitive to outliers                    |
| Works well for small datasets | Struggles with multicollinearity         |

---

### 17. **What are some use cases of Logistic Regression?**

**Answer:**

* Credit risk modeling
* Email spam detection
* Customer churn prediction
* Disease diagnosis (e.g., cancer detection)
* Marketing campaign response prediction

---

### 18. **What is the difference between Softmax Regression and Logistic Regression?**

**Answer:**

| Logistic Regression                | Softmax Regression                                    |
| ---------------------------------- | ----------------------------------------------------- |
| Used for **binary** classification | Used for **multiclass** classification                |
| Uses **sigmoid** function          | Uses **softmax** function to generalize probabilities |

---

### 19. **How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?**

**Answer:**

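* Use **OvR** when simplicity is preferred, when each class can sensibly be modeled as its own "this class vs. the rest" problem, or when the solver (e.g. `liblinear`) does not support the multinomial loss.
* Use **Softmax** when classes are **mutually exclusive** and you want a single model whose class probabilities sum to 1 by construction, which typically gives better probability calibration.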

---

### 20. **How do we interpret coefficients in Logistic Regression?**

**Answer:**

* Each coefficient $\beta_i$ represents the **change in log-odds** of the target per unit increase in that feature, holding the other features constant.
* The **odds ratio** is calculated as:

$$
\text{Odds Ratio} = e^{\beta_i}
$$

* Example: If $\beta = 0.7$, then each one-unit increase multiplies the odds by **$e^{0.7} \approx 2.01$**, meaning the odds roughly **double**.
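
The example can be worked through in a few lines. The starting probability of 0.20 is a hypothetical value, used to show that doubling the odds does not double the probability.

```python
import numpy as np

beta = 0.7
odds_ratio = np.exp(beta)                  # multiplicative change in odds per unit increase

# Hypothetical starting point: P(Y=1) = 0.20
p_before = 0.20
odds_before = p_before / (1 - p_before)    # odds = p / (1 - p)
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)    # convert odds back to a probability

print(f"Odds ratio: {odds_ratio:.4f}")
print(f"Odds: {odds_before:.4f} -> {odds_after:.4f}")
print(f"Probability: {p_before:.4f} -> {p_after:.4f}")
```

The change in probability depends on where you start, which is why coefficients are interpreted on the odds scale rather than the probability scale.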

---



##**PRACTICAL QUESTIONS:**

###**1. Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)


Model Accuracy: 1.0


###**2. Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression(penalty='l1') and print the model accuracy.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='l1', solver='saga', max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("L1 Regularized Model Accuracy:", accuracy)


L1 Regularized Model Accuracy: 1.0




###**3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression (penalty='l2'). Print model accuracy and coefficients.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("L2 Regularized Model Accuracy:", accuracy)
print("Model Coefficients:", model.coef_)


L2 Regularized Model Accuracy: 1.0
Model Coefficients: [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]


###**4. Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet').**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("ElasticNet Regularized Model Accuracy:", accuracy)


ElasticNet Regularized Model Accuracy: 1.0




###**5. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("OvR Model Accuracy:", accuracy)


OvR Model Accuracy: 0.9666666666666667




###**6. Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(solver='saga', max_iter=500)

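# Note: l1_ratio only takes effect when penalty='elasticnet';
# for 'l1' and 'l2' it is ignored, so those grid points are redundant
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'l1_ratio': [0, 0.5, 1]
}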

grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Cross-Validated Accuracy:", grid.best_score_)


Best Parameters: {'C': 1, 'l1_ratio': 0, 'penalty': 'l1'}
Best Cross-Validated Accuracy: 0.975




###**7. Write a Python program to evaluate Logistic Regression using Stratified K-Fold Cross-Validation. Print the average accuracy.**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

data = load_iris()
X = data.data
y = data.target

model = LogisticRegression(max_iter=500)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

print("Cross-Validation Accuracy for each fold:", scores)
print("Average Accuracy:", np.mean(scores))


Cross-Validation Accuracy for each fold: [1.         0.96666667 0.93333333 1.         0.93333333]
Average Accuracy: 0.9666666666666668


###**8. Write a Python program to load a dataset from a CSV file, apply Logistic Regression, and evaluate its accuracy.**

In [None]:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# In Google Colab, prompt for an upload; elsewhere fall back to a local file
try:
    from google.colab import files
    print("Upload your CSV file:")
    uploaded = files.upload()
    file_name = list(uploaded.keys())[0]
except ImportError:
    file_name = 'data.csv'


df = pd.read_csv(file_name)
print("\nDataset Preview:")
print(df.head())

print("\nDataset Shape:", df.shape)
print("Missing Values:\n", df.isnull().sum())

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy: {:.2f}%".format(accuracy * 100))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=model.classes_, yticklabels=model.classes_)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


Upload your CSV file:


###**9. Write a Python program to apply RandomizedSearchCV for tuning hyperparameters (C, penalty, solver) in Logistic Regression. Print the best parameters and accuracy.**

In [None]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# In Google Colab, prompt for an upload; elsewhere fall back to a local file
try:
    from google.colab import files
    print("Upload your CSV file:")
    uploaded = files.upload()
    file_name = list(uploaded.keys())[0]
except ImportError:
    file_name = 'extended_logistic_data.csv'


df = pd.read_csv(file_name)
X = df[['Age', 'Salary']]
y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_dist = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

model = LogisticRegression(max_iter=500)
rand_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=3, random_state=42)
rand_search.fit(X_train, y_train)

print("Best Parameters:", rand_search.best_params_)
print("Best Cross-Validated Accuracy:", rand_search.best_score_)

###**10. Write a Python program to implement One-vs-One (OvO) Multiclass Logistic Regression and print accuracy.**

In [None]:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ovo_model = OneVsOneClassifier(LogisticRegression(max_iter=500))
ovo_model.fit(X_train, y_train)

y_pred = ovo_model.predict(X_test)
print("One-vs-One Accuracy:", accuracy_score(y_test, y_pred))


###**11. Write a Python program to train a Logistic Regression model and visualize the confusion matrix for binary classification.**

In [None]:
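import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix

# Iris (loaded in the previous cell) has three classes, so this cell switches
# to a two-class dataset (an assumption) to get a binary confusion matrix
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A higher max_iter helps lbfgs converge on these unscaled features
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()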


###**12. Write a Python program to train a Logistic Regression model and evaluate its performance using Precision, Recall, and F1-Score.**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

print("Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_pred, average='weighted'))
print("F1-Score:", f1_score(y_test, y_pred, average='weighted'))

###**13. Write a Python program to train a Logistic Regression model on imbalanced data and apply class weights to improve model performance.**

In [None]:
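from sklearn.metrics import classification_report

# class_weight='balanced' weights each class inversely to its frequency
model = LogisticRegression(class_weight='balanced', max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy (Balanced):", accuracy_score(y_test, y_pred))
# Per-class precision, recall and F1 say more than accuracy on imbalanced data
print(classification_report(y_test, y_pred))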


###**14. Write a Python program to train Logistic Regression on the Titanic dataset, handle missing values, and evaluate performance.**

In [None]:
import seaborn as sns
from sklearn.impute import SimpleImputer

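# Keep a few columns; .copy() avoids pandas chained-assignment warnings below
df = sns.load_dataset("titanic")
df = df[["age", "fare", "sex", "survived"]].dropna(subset=["sex"]).copy()

# Encode sex as 0/1
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

# Fill missing numeric values with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age", "fare"]] = imputer.fit_transform(df[["age", "fare"]])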

X = df[["age", "fare", "sex"]]
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Titanic Dataset Accuracy:", accuracy_score(y_test, y_pred))


###**15. Write a Python program to apply feature scaling (Standardization) before training a Logistic Regression model. Evaluate its accuracy and compare results with and without scaling.**

In [None]:
from sklearn.preprocessing import StandardScaler

model1 = LogisticRegression(max_iter=500)
model1.fit(X_train, y_train)
acc1 = accuracy_score(y_test, model1.predict(X_test))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model2 = LogisticRegression(max_iter=500)
model2.fit(X_train_scaled, y_train)
acc2 = accuracy_score(y_test, model2.predict(X_test_scaled))

print("Accuracy without scaling:", acc1)
print("Accuracy with scaling:", acc2)
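
When scaling is combined with cross-validation, one common pattern is to wrap the scaler and the model in a pipeline so the scaler is refit on each training fold only. The sketch below assumes the built-in breast cancer dataset rather than the data used above, purely so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption) so the example is self-contained
X, y = load_breast_cancer(return_X_y=True)

# The scaler is fitted inside each training fold, never on the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```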


###**16. Write a Python program to train Logistic Regression and evaluate its performance using ROC-AUC score.**

In [None]:
from sklearn.metrics import roc_auc_score

y_probs = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_probs)
print("ROC-AUC Score:", auc)


###**17. Write a Python program to train Logistic Regression using a custom regularization setting (C=0.5; C is the inverse regularization strength, not a learning rate) and evaluate accuracy.**

In [None]:
model = LogisticRegression(C=0.5, max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy with C=0.5:", accuracy_score(y_test, y_pred))

###**18. Write a Python program to train Logistic Regression and identify important features based on model coefficients.**

In [None]:
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

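features = X.columns
coefficients = model.coef_[0]

# Rank by absolute value; magnitudes are only comparable across features
# when the features share a scale (e.g. after standardization)
for f, c in sorted(zip(features, coefficients), key=lambda fc: abs(fc[1]), reverse=True):
    print(f"{f}: {c:.4f}")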


###**19. Write a Python program to train Logistic Regression and evaluate its performance using Cohen's Kappa Score.**

In [None]:
from sklearn.metrics import cohen_kappa_score

y_pred = model.predict(X_test)
kappa = cohen_kappa_score(y_test, y_pred)
print("Cohen's Kappa Score:", kappa)


###**20. Write a Python program to train Logistic Regression and visualize the Precision-Recall Curve for binary classification.**

In [None]:
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
import matplotlib.pyplot as plt

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

y_scores = model.predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(y_test, y_scores)
disp = PrecisionRecallDisplay(precision=precision, recall=recall)
disp.plot()
plt.title("Precision-Recall Curve")
plt.show()


###**21. Write a Python program to train Logistic Regression with different solvers (liblinear, saga, lbfgs) and compare their accuracy.**

In [None]:
solvers = ['liblinear', 'saga', 'lbfgs']

for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=500)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"Accuracy with solver '{solver}': {acc:.4f}")


###**22. Write a Python program to train Logistic Regression and evaluate its performance using Matthews Correlation Coefficient (MCC).**

In [None]:
from sklearn.metrics import matthews_corrcoef

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mcc = matthews_corrcoef(y_test, y_pred)
print("Matthews Correlation Coefficient (MCC):", mcc)


###**23. Write a Python program to train Logistic Regression on both raw and standardized data. Compare their accuracy to see the impact of feature scaling.**

In [None]:
from sklearn.preprocessing import StandardScaler

model_raw = LogisticRegression(max_iter=500)
model_raw.fit(X_train, y_train)
acc_raw = model_raw.score(X_test, y_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=500)
model_scaled.fit(X_train_scaled, y_train)
acc_scaled = model_scaled.score(X_test_scaled, y_test)

print("Accuracy (raw data):", acc_raw)
print("Accuracy (standardized data):", acc_scaled)


###**24. Write a Python program to train Logistic Regression and find the optimal C (regularization strength) using cross-validation.**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
model = LogisticRegression(max_iter=500)

grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best C:", grid.best_params_['C'])
print("Best Cross-Validated Accuracy:", grid.best_score_)


###**25. Write a Python program to train Logistic Regression, save the trained model using joblib, and load it again to make predictions.**

In [None]:
import joblib
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
joblib.dump(model, "logistic_model.pkl")

loaded_model = joblib.load("logistic_model.pkl")
y_pred = loaded_model.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Accuracy from loaded model:", acc)
