## 1.  What is Logistic Regression, and how does it differ from Linear Regression? 

Logistic Regression is a supervised machine learning algorithm used for classification problems (like predicting Yes/No, 0/1, True/False).

It predicts the probability that an input belongs to a certain class.

For example:
- Will a student pass or fail? (Yes/No)
- Will an email be spam or not spam? (1/0)

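The output of Logistic Regression always lies between 0 and 1 because it passes a linear combination of the features through the Sigmoid function, which maps any real number to a probability.

Linear Regression, by contrast, predicts a continuous, unbounded value (such as a price or a temperature) and is fit by minimizing squared error. Logistic Regression predicts a class probability and is fit by minimizing log loss (maximum likelihood).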

## 2.  Explain the role of the Sigmoid function in Logistic Regression.

The **Sigmoid function** is the heart of Logistic Regression.

It converts any real-valued number (from -∞ to +∞) into a value between **0 and 1**.  
This makes it perfect for representing **probabilities**.

### Role in Logistic Regression

1. **Converts Linear Output to Probability:**  
   The linear equation \( b_0 + b_1x \) can give any value.  
   The sigmoid maps it into a probability between 0 and 1.

2. **Helps in Classification:**  
   - If probability ≥ 0.5 → class 1 (Yes/Positive)  
   - If probability < 0.5 → class 0 (No/Negative)

3. **Smooth Transition:**  
   The sigmoid curve is S-shaped and steepest around an input of 0 (probability 0.5), so small changes  
   near the decision boundary produce the largest probability changes, while extreme inputs flatten out toward 0 or 1.
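
As a small illustration of the three points above, the sketch below implements the sigmoid, σ(z) = 1 / (1 + e^(-z)), in plain Python and applies the 0.5 threshold. The coefficients `b0` and `b1` and the helper name `predict_class` are made up for the example and do not come from a fitted model.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, b0, b1, threshold=0.5):
    """Return (probability, class) for one input x under the model b0 + b1*x."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)


# Hypothetical coefficients, chosen only for illustration
b0, b1 = -4.0, 2.0

for x in [0.0, 1.5, 2.0, 2.5, 4.0]:
    p, label = predict_class(x, b0, b1)
    print(f"x={x}: probability={p:.3f} -> class {label}")
```

With these coefficients the linear output is 0 at `x = 2.0`, so the predicted class flips there, and the probabilities change fastest for inputs close to that point.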

## 3. What is Regularization in Logistic Regression and why is it needed?

**Regularization** is a technique used to **reduce overfitting** in Logistic Regression (and other models).

When a model fits the training data **too closely**, it may start memorizing noise or random patterns.  
This causes **poor performance on new/unseen data**, which is called **overfitting**.

Regularization helps control this by adding a **penalty term** to the cost function,  
which discourages the model from having very large coefficient values.

Regularization is needed because it:
- Reduces **overfitting**
- Improves **generalization** on test data
- Keeps the model **simpler and more stable**
- Helps handle **multicollinearity** (when features are correlated)
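
To make the penalty term concrete, here is a minimal NumPy sketch of an L2-penalized logistic cost. The toy data, the two weight vectors, and the name `lam` for the penalty strength are illustrative assumptions; scikit-learn expresses the same idea through `C`, the inverse of the regularization strength.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def l2_penalized_cost(w, b, X, y, lam):
    """Average log loss plus lam * sum(w**2); the intercept b is not penalized."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guard against log(0)
    log_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    penalty = lam * np.sum(w ** 2)
    return log_loss + penalty


# Tiny made-up dataset: 4 samples, 2 features
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 0.5], [4.0, 1.5]])
y = np.array([0, 0, 1, 1])

small_w, small_b = np.array([1.0, -0.5]), -2.0
large_w, large_b = 10 * small_w, 10 * small_b  # same boundary, much larger coefficients

for lam in [0.0, 0.1]:
    cost_small = l2_penalized_cost(small_w, small_b, X, y, lam)
    cost_large = l2_penalized_cost(large_w, large_b, X, y, lam)
    print(f"lam={lam}: cost(small weights)={cost_small:.3f}, cost(large weights)={cost_large:.3f}")
```

Without the penalty (`lam=0`), the larger weights achieve the lower cost on this separable toy data. Once the penalty is switched on, they become the more expensive choice, which is how regularization discourages very large coefficients.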

## 4. What are some common evaluation metrics for classification models, and why are they important? 

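After training a classification model (like Logistic Regression), we need to measure **how well it performs**. This is done using **evaluation metrics**.

Common metrics include:
- **Accuracy:** the fraction of all predictions that are correct.
- **Precision:** of the samples predicted positive, the fraction that are actually positive, TP / (TP + FP).
- **Recall (Sensitivity):** of the actual positives, the fraction the model finds, TP / (TP + FN).
- **F1-score:** the harmonic mean of precision and recall.
- **Confusion Matrix:** the counts of true positives, true negatives, false positives, and false negatives.
- **ROC-AUC:** how well the predicted probabilities rank positives above negatives across all thresholds.

These metrics help us understand if our model is predicting correctly, and whether it is biased toward a certain class or not.

They are important because they:
- Give a **more complete picture** of model performance than a single number.
- Help choose the **right model** for real-world tasks.
- Detect if a model is **biased** (e.g., always predicting one class).
- Matter in **medical diagnosis, fraud detection, or spam filtering**, where accuracy alone is not enough.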
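
The following sketch shows why accuracy alone can mislead. It computes the metrics by hand for a made-up, imbalanced set of labels; the label counts, the two hypothetical models, and the helper name `classification_metrics` are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 100 made-up samples: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class
pred_a = [0] * 100
# Model B raises 10 false alarms but finds 4 of the 5 positives
pred_b = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

metrics_a = classification_metrics(y_true, pred_a)
metrics_b = classification_metrics(y_true, pred_b)
print("Model A:", metrics_a)
print("Model B:", metrics_b)
```

Model A has the higher accuracy yet finds no positives at all, while Model B trades some accuracy for a recall of 0.8. Accuracy on its own would pick the less useful model.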


## 5.  Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. 
(Use Dataset from sklearn package)

In [None]:
pip install scikit-learn


In [3]:
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score



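# Load the Iris dataset bundled with scikit-learn
iris = load_iris()

# Build a DataFrame with one column per feature plus the target label
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("Sample data from CSV-like DataFrame:")
print(df.head())

# Separate the features (X) from the label (y)
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# multi_class is left at its default, which matches 'auto'; passing it explicitly is deprecated in newer scikit-learn releases
model = LogisticRegression(max_iter=200, solver='lbfgs')
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("\nModel Accuracy on Test Data:", round(accuracy, 3))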


Sample data from CSV-like DataFrame:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Model Accuracy on Test Data: 1.0
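
The cell above builds the DataFrame directly from `load_iris` rather than from a file. If the exercise is read literally, one way to route the same data through an actual CSV file is sketched below. The file name `iris.csv` and the use of a temporary directory are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Iris as a DataFrame: four feature columns plus a 'target' column
iris_frame = load_iris(as_frame=True).frame

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "iris.csv")  # hypothetical file name
    iris_frame.to_csv(csv_path, index=False)      # write a real CSV file

    # Load the CSV file into a Pandas DataFrame
    df = pd.read_csv(csv_path)

X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy from CSV-loaded data:", round(accuracy, 3))
```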


## 6. Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. 
(Use Dataset from sklearn package) 
(Include your Python code and output in the code box below.)

In [4]:

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# penalty='l2' applies Ridge-style regularization; C (default 1.0) is the inverse of its strength
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Coefficients (Weights):")
print(model.coef_)
print("\nModel Intercept:")
print(model.intercept_)
print("\nAccuracy of Logistic Regression Model:", round(accuracy, 3))


Model Coefficients (Weights):
[[-0.39346234  0.96250944 -2.3751269  -0.99874615]
 [ 0.50843644 -0.25482245 -0.2130109  -0.77574713]
 [-0.11497411 -0.70768699  2.5881378   1.77449329]]

Model Intercept:
[  9.00890328   1.86898996 -10.87789324]

Accuracy of Logistic Regression Model: 1.0
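
In scikit-learn the strength of the L2 penalty is set through `C`, the inverse of the regularization strength, which the cell above leaves at its default of 1.0. The sketch below refits the same kind of model with a few `C` values (picked arbitrarily here) to show how a stronger penalty shrinks the coefficients. The larger `max_iter` is only there to give the weakly regularized fit room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # The L2 penalty is the default; smaller C means a stronger penalty
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_train, y_train)

    coef_norms[C] = float(np.linalg.norm(model.coef_))
    test_accuracy = model.score(X_test, y_test)
    print(f"C={C}: coefficient norm={coef_norms[C]:.3f}, test accuracy={test_accuracy:.3f}")
```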


## 7. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. 
(Use Dataset from sklearn package) 
(Include your Python code and output in the code box below.)

In [5]:

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine   # multiclass dataset
from sklearn.metrics import classification_report

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

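# One-vs-rest trains one binary classifier per class (note: multi_class is deprecated in newer scikit-learn releases)
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)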
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Classification Report (OvR Logistic Regression):\n")
print(classification_report(y_test, y_pred))


Classification Report (OvR Logistic Regression):

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        14
           1       0.93      1.00      0.97        14
           2       1.00      1.00      1.00         8

    accuracy                           0.97        36
   macro avg       0.98      0.98      0.98        36
weighted avg       0.97      0.97      0.97        36
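
Newer scikit-learn releases deprecate the `multi_class` argument, so the cell above may warn or stop working on a recent installation. Assuming such a version, a sketch of the explicit alternative is to wrap the binary model in `OneVsRestClassifier`. This is a substitute for the same one-vs-rest scheme rather than the exact code the exercise asks for, and its scores may differ slightly from the report above.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One binary Logistic Regression is trained per class (class k vs. the rest)
ovr_model = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', max_iter=200)
)
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print("Number of binary models:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred))
```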



## 8.  Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. 
(Use Dataset from sklearn package) 
(Include your Python code and output in the code box below.)

In [6]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=200)

param_grid = {
    'C': [0.01, 0.1, 1, 10],            # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2'],           # Regularization type
    'solver': ['liblinear']            # Supports both l1 and l2
}

grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output
print("Best Parameters from GridSearchCV:")
print(grid.best_params_)
print("\nBest Cross-Validation (Validation) Accuracy:", round(grid.best_score_, 3))
print("\nAccuracy on Held-Out Test Data:", round(accuracy, 3))


Best Parameters from GridSearchCV:
{'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}

Best Cross-Validation (Validation) Accuracy: 0.958

Accuracy on Held-Out Test Data: 1.0


## 9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. 
(Use Dataset from sklearn package) 
(Include your Python code and output in the code box below.)

In [7]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: train on the raw, unscaled features
model_no_scale = LogisticRegression(max_iter=200)
model_no_scale.fit(X_train, y_train)
pred_no_scale = model_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, pred_no_scale)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same model, trained on standardized features
model_scale = LogisticRegression(max_iter=200)
model_scale.fit(X_train_scaled, y_train)
pred_scale = model_scale.predict(X_test_scaled)
acc_scale = accuracy_score(y_test, pred_scale)

# Output
print("Accuracy WITHOUT Scaling :", round(acc_no_scale, 3))
print("Accuracy WITH Standard Scaling :", round(acc_scale, 3))


Accuracy WITHOUT Scaling : 0.972
Accuracy WITH Standard Scaling : 1.0
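
A single train/test split on a dataset of 178 samples can make the gap between the two models look larger or smaller than it really is. As a sketch, the comparison can be repeated with 5-fold cross-validation, with the scaler wrapped in a pipeline so that it is refit on each training fold. The fold count and the `max_iter` value are arbitrary choices for the example.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Model on raw features vs. the same model behind a StandardScaler step
unscaled = LogisticRegression(max_iter=5000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring='accuracy')
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring='accuracy')

print("Mean CV accuracy WITHOUT scaling:", round(scores_unscaled.mean(), 3))
print("Mean CV accuracy WITH scaling   :", round(scores_scaled.mean(), 3))
```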


## 10.  Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

### Logistic Regression for Imbalanced Marketing Data

We want to predict which customers will respond to a campaign. The dataset is highly imbalanced (5% responders).

**Steps:**

1. **Data Preprocessing:** Handle missing values, encode categorical variables, and handle outliers.
2. **Train-Test Split:** Split the data first, using stratified splitting to keep the 5% responder ratio the same in the train and test sets. Every later fitting step uses only the training set, which avoids data leakage.
3. **Feature Scaling:** Standardize numeric features using StandardScaler, fit on the training data and then applied to the test data.
4. **Handle Imbalance:** Use `class_weight='balanced'` or oversample the minority class with SMOTE, applied to the training data only.
5. **Model Training:** Train Logistic Regression with L2 regularization.
6. **Hyperparameter Tuning:** Use GridSearchCV with stratified folds to tune `C`, `penalty`, and `class_weight`, scoring on F1 or ROC-AUC rather than accuracy.
7. **Evaluation:** Use Precision, Recall, F1-score, and ROC-AUC instead of accuracy, since a model that predicts "no response" for everyone already reaches 95% accuracy. Focus on Recall to catch more responders.
8. **Business Use:** Rank customers by probability, adjust the threshold to improve campaign efficiency, and target high-probability responders.
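
The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions: the customer data is replaced by a synthetic dataset from `make_classification`, only `C` and `class_weight` are tuned (the penalty stays at the default L2), SMOTE is left out because it needs the separate imbalanced-learn package, and the 0.3 threshold is a hypothetical business choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: roughly 5% responders (class 1)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=42,
)

# Stratified split keeps the responder ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is fit on training folds only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="average_precision",  # better suited to rare positives than accuracy
)
search.fit(X_train, y_train)

# Rank customers by predicted response probability
proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)
print("Best parameters:", search.best_params_)
print("Test ROC-AUC:", round(test_auc, 3))

# Hypothetical business threshold: contact anyone above 30% predicted probability
threshold = 0.3
y_pred = (proba >= threshold).astype(int)
print(classification_report(y_test, y_pred, zero_division=0))
```

Lowering the threshold raises recall at the cost of precision, so the value would in practice be set from the cost of contacting a customer against the value of a response.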

**Conclusion:**  
This approach ensures the model works well with imbalanced data and helps the business target the right customers effectively.
