
### **Q1: Why should you not use accuracy for model evaluation in an imbalanced dataset?**

**Answer:**

Accuracy is a poor metric for evaluating models on imbalanced datasets because it is dominated by the majority class. In a dataset where 90% of the data belongs to class 0 and 10% belongs to class 1, a model that predicts class 0 for every sample still reaches 90% accuracy while failing to identify a single minority class sample (class 1). This leads to **misleading results** because accuracy counts every correct prediction equally and says nothing about per-class performance.

**Accuracy Formula**:

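```python
Accuracy = (TP + TN) / (TP + TN + FP + FN)
```

Where:
- **True Positives (TP)**: Positive class samples correctly predicted as positive.
- **True Negatives (TN)**: Negative class samples correctly predicted as negative.
- **False Positives (FP)**: Negative class samples incorrectly predicted as positive.
- **False Negatives (FN)**: Positive class samples incorrectly predicted as negative.

The denominator is simply the total number of samples.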

In imbalanced datasets, this formula often gives a high value, even if the model fails to correctly identify the minority class.
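To make the 90/10 scenario concrete, here is a minimal sketch that scores a hypothetical "always predict class 0" model by hand. The labels are made up for illustration, and class 1 is assumed to be the positive class.

```python
# Hypothetical 90/10 dataset: 90 samples of class 0 and 10 samples of class 1.
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Count the four confusion-matrix cells, treating class 1 as positive.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
minority_recall = tp / (tp + fn)  # share of class 1 samples that were found

print(f"Accuracy: {accuracy:.2f}")                # 0.90
print(f"Minority recall: {minority_recall:.2f}")  # 0.00
```

The model never finds a single class 1 sample, yet its accuracy is 0.90, which is exactly the failure that accuracy hides.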

Instead, other metrics like **Precision, Recall, F1-Score, and AUC-ROC** are more informative in evaluating performance, particularly for the minority class.

- **Precision**: Measures how many of the predicted positive cases are actually positive.
- **Recall**: Measures how many of the actual positive cases were predicted correctly.
- **F1-Score**: Harmonic mean of Precision and Recall, balancing both metrics.
- **AUC-ROC**: Evaluates how well the model can distinguish between the two classes by plotting True Positive Rate (Recall) against False Positive Rate.
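The definitions above can be sketched from scratch in plain Python. The helper names `precision_recall_f1` and `roc_auc` are hypothetical, the data is made up, and AUC is computed through its pairwise interpretation (the fraction of positive/negative pairs in which the positive sample gets the higher score) rather than by integrating the curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels, hard predictions, and predicted scores for illustration.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.5]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
print(f"AUC-ROC: {auc:.4f}")
```

With one true positive, one false positive, and one false negative, precision, recall, and F1 all come out at 0.5, while 15 of the 16 positive/negative pairs are ranked correctly (AUC of 0.9375).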

**Example response for an interview**:  
_"In an imbalanced classification problem, accuracy can give an illusion of high performance when the model is just predicting the majority class. Instead, metrics like Precision, Recall, and F1-Score are more effective in evaluating performance, especially in identifying the minority class. AUC-ROC can also help by showing how well the model discriminates between classes."_

**Python Implementation Example:**

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Imbalanced example: 8 samples of class 0 and 2 samples of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The model finds only one of the two minority samples
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Calculate the metrics (class 1 is treated as the positive class)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")    # 0.90
print(f"Precision: {precision:.2f}")  # 1.00
print(f"Recall: {recall:.2f}")        # 0.50
print(f"F1 Score: {f1:.2f}")          # 0.67
```

In the above example, **accuracy** is 0.90 even though **recall** is only 0.50: the model misses half of the minority class. **Precision**, **recall**, and **F1-score** expose that failure, which accuracy alone hides.

---

### **Q2: Explain oversampling and how it can help with imbalanced data.**

**Answer:**

**Oversampling** is a technique for handling imbalanced datasets by increasing the number of instances in the minority class, either by **duplicating** existing samples or by creating synthetic ones. Training on a more balanced dataset reduces the model's bias toward the majority class. Resampling should be applied only to the training data so that evaluation still reflects the original class distribution.

**Popular Oversampling Techniques:**
1. **Random Oversampling**: Randomly duplicates samples from the minority class until the classes are balanced. While simple, it can lead to **overfitting** because the model may memorize these duplicated instances.
2. **SMOTE (Synthetic Minority Over-sampling Technique)**: A more advanced technique where new synthetic samples are created by interpolating between existing minority class samples.
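As a rough illustration of the first technique, random oversampling can be written in a few lines of plain Python. The function name `random_oversample` and the toy data are assumptions made for this sketch, not part of any library.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        # Sample with replacement, so the same row can be copied several times.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
X = [[i, i + 1] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

Every added row is an exact copy of one of the two original minority rows, which is why this approach is prone to overfitting.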

**Example response for an interview**:  
_"Oversampling helps in balancing the data by either duplicating or synthesizing more minority class samples, allowing the model to pay more attention to the minority class. Techniques like SMOTE are widely used to generate synthetic data points for the minority class, thereby improving the model's ability to generalize."_

**Formula** (for SMOTE synthetic samples creation):

```python
New Sample = Sample_min + λ * (Sample_nearest - Sample_min)
```

Where:
- `Sample_min`: A sample from the minority class.
- `Sample_nearest`: One of the k nearest minority class neighbors of `Sample_min`, chosen at random.
- `λ`: A random number between 0 and 1.
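A small worked example of the formula, using made-up coordinates and a fixed `λ` of 0.25 instead of a random draw (the helper name `interpolate` is hypothetical):

```python
def interpolate(sample_min, sample_nearest, lam):
    """Return the point a fraction `lam` of the way from sample_min to sample_nearest."""
    return [a + lam * (b - a) for a, b in zip(sample_min, sample_nearest)]


sample_min = [1.0, 2.0]      # made-up minority sample
sample_nearest = [3.0, 6.0]  # made-up neighbor from the same class
new_sample = interpolate(sample_min, sample_nearest, lam=0.25)
print(new_sample)  # [1.5, 3.0]
```

With `λ = 0` the new point coincides with `Sample_min`, and with `λ = 1` it coincides with the neighbor, so every synthetic sample lies on the segment between the two.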

**Python Implementation (using SMOTE):**

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9], flip_y=0, random_state=42)

# Apply SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)

# Plot original vs resampled data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=y)
axs[0].set_title('Original Data')

axs[1].scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
axs[1].set_title('Resampled Data (SMOTE)')

plt.show()
```

---

### **Q3: Explain undersampling and its benefits and drawbacks.**

**Answer:**

**Undersampling** is the opposite of oversampling. It reduces the size of the majority class to balance the dataset. By removing some of the majority class instances, we can make the dataset more balanced and avoid biasing the model towards the majority class.

**Benefits:**
- It is computationally less expensive since the dataset becomes smaller.
- Can work well when there’s a lot of redundancy in the majority class.

**Drawbacks:**
- By removing data, you risk losing important information from the majority class.
- This can lead to **underfitting**, where the model doesn't learn enough because important patterns in the majority class may be discarded.

**Example response for an interview**:  
_"Undersampling is useful when we have large datasets, as it reduces the training data size by removing instances from the majority class. While it balances the dataset, it comes with the risk of losing valuable data from the majority class, potentially underfitting the model."_

**Example** of undersampling:  
If we have 1000 samples in the majority class and 100 in the minority class, we randomly select 100 samples from the majority class to create a balanced dataset.
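The 1000-versus-100 example can be sketched without any library, assuming hypothetical two-feature rows drawn from two Gaussians:

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical dataset: 1000 majority rows (class 0) and 100 minority rows (class 1).
majority = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(1000)]
minority = [([rng.gauss(2, 1), rng.gauss(2, 1)], 1) for _ in range(100)]

# Keep every minority row and draw an equally sized subset of the majority
# class without replacement.
kept_majority = rng.sample(majority, k=len(minority))
balanced = kept_majority + minority
rng.shuffle(balanced)

X_res = [features for features, _ in balanced]
y_res = [label for _, label in balanced]
print(Counter(y_res))
```

The resulting dataset has 100 rows per class, and the other 900 majority rows are discarded, which is where the information loss described above comes from.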

**Python Implementation Example:**

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt

# Recreate the imbalanced dataset used in Q2
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9], flip_y=0, random_state=42)

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Plot original vs resampled data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=y)
axs[0].set_title('Original Data')

axs[1].scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled)
axs[1].set_title('Resampled Data (Undersampling)')

plt.show()
```

---

### **Q4: What is SMOTE, and how does it work?**

**Answer:**

**SMOTE (Synthetic Minority Over-sampling Technique)** is an oversampling technique that creates synthetic examples rather than simply duplicating minority class samples. 

**How it works:**
- For each minority class sample, SMOTE finds its **k-nearest neighbors** among the other minority class samples.
- It then creates a new synthetic sample by randomly choosing one of those neighbors and generating a point at a random position along the line segment connecting the two samples.
- This way, it generates new instances that are not mere copies but fall in between existing minority class examples.
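The steps above can be sketched in plain Python. This is a simplified illustration rather than the reference algorithm or the `imblearn` implementation: the function name `smote_sketch` is hypothetical, base samples are picked at random instead of being iterated over, and neighbors are found by brute-force Euclidean distance.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding `base` itself.
        others = [s for s in minority if s is not base]
        neighbors = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # random position along the connecting segment
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Made-up minority class samples in two dimensions.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]]
new_points = smote_sketch(minority, n_new=4)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each synthetic point is a convex combination of two existing minority samples, it always stays inside the region already spanned by the minority class.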

**Benefits:**
- **Reduces overfitting risk**: Because it generates new synthetic samples rather than exact duplicates, the model is less likely to memorize individual minority samples than with random oversampling.
- **Balances the data**: Improves model performance by making the data distribution more balanced.

**Example response for an interview**:  
_"SMOTE is a technique that generates synthetic samples for the minority class by interpolating between existing samples and their nearest neighbors. This helps the model to generalize better and avoid overfitting, which can occur in simple random oversampling methods."_ 

**Formula** (same as in Q2):

```python
New Sample = Sample_min + λ * (Sample_nearest - Sample_min)
```

**Python Implementation**: (already provided in Q2)

---

### **Q5: How can you alter the cost function to address imbalanced data?**

**Answer:**

In **imbalanced classification problems**, we can **alter the cost function** of the model to penalize misclassifying the minority class more than the majority class. This approach assigns **higher weights** to the minority class during training, which forces the model to focus more on correctly predicting minority class samples.

**Examples of altering cost functions:**
1. **Weighted Loss Function**: In algorithms like Logistic Regression or SVM, we can add class weights to the loss function. This means that the model will incur a larger penalty for misclassifying minority class samples, forcing it to pay more attention to those samples.
    - In Scikit-Learn, you can use `class_weight='balanced'` in many classifiers (like Logistic Regression, SVM, etc.) to automatically adjust the weights inversely proportional to class frequencies.
2. **Focal Loss** (used in deep learning): A variant of cross-entropy loss that adds a modulating term to focus learning more on hard-to-classify examples (usually from the minority class).
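To show what the modulating term in focal loss does, here is a minimal single-example sketch. It assumes the common form `alpha * (1 - p_true) ** gamma * cross_entropy`, with `gamma = 2` and `alpha = 1` chosen purely for illustration; the function names are hypothetical.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled by the modulating term (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)


# p_true is the predicted probability assigned to the correct class.
for name, p_true in [("easy example", 0.9), ("hard example", 0.1)]:
    ce = cross_entropy(p_true)
    fl = focal_loss(p_true)
    print(f"{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

The easy example keeps only about 1% of its cross-entropy loss, while the hard example keeps about 81%, so training is driven mostly by the hard cases. Setting `gamma = 0` recovers ordinary cross-entropy.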

**Mathematical Intuition** (for weighted loss):
Let `W_0` and `W_1` represent the weights assigned to class 0 and class 1, respectively. The loss function can be modified as:

```python
L = W_0 * L_0 + W_1 * L_1
```
Where:
- `L_0` and `L_1` represent the individual loss for class 0 and class 1.
- `W_0` and `W_1` are inversely proportional to the class frequencies.
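As a sketch of how such weights might be computed and applied, the example below uses the heuristic `n_samples / (n_classes * class_count)`, which is what scikit-learn documents for `class_weight='balanced'`. The helper names, the 90/10 labels, and the constant predictions are made up for illustration, and dividing by the sum of the weights is a normalization choice made for this sketch.

```python
import math
from collections import Counter


def balanced_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}


def weighted_log_loss(y_true, p_pos, weights):
    """Weighted average of per-sample binary cross-entropy.

    p_pos[i] is the predicted probability that sample i belongs to class 1.
    """
    total = 0.0
    for label, p in zip(y_true, p_pos):
        sample_loss = -math.log(p if label == 1 else 1.0 - p)
        total += weights[label] * sample_loss
    return total / sum(weights[label] for label in y_true)


# Hypothetical 90/10 labels and a model that outputs 0.1 for every sample.
y = [0] * 900 + [1] * 100
weights = balanced_weights(y)
p_pos = [0.1] * len(y)

unweighted = weighted_log_loss(y, p_pos, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p_pos, weights)
print(weights)
print(f"Unweighted loss: {unweighted:.4f}, weighted loss: {weighted:.4f}")
```

Class 1 receives a weight of 5.0 and class 0 a weight of roughly 0.56, so a model that effectively ignores the minority class is penalized far more under the weighted loss than under the unweighted one.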

**Example response for an interview**:  
_"By altering the cost function, we can assign higher weights to the minority class, ensuring the model pays more attention to misclassifications in that class. This technique works well in logistic regression, SVMs, and even deep learning models, using weighted loss or focal loss."_ 

**Python Implementation Example:**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, weights=[0.9], flip_y=0, random_state=42)

# Train-test split (stratified so both splits keep the 90/10 class ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Apply weighted classifier
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred))
```

In this implementation, the `class_weight='balanced'` parameter automatically assigns weights to classes inversely proportional to their frequencies in the data.

---

### **Summary of Formulas and Python Examples**

1. **Accuracy**: A poor metric for imbalanced data because it can look high even when the model only predicts the majority class.
   - Formula: `Accuracy = (TP + TN) / (Total Samples)`

   Python: `accuracy_score`

2. **Oversampling**: Duplicates minority class samples or generates synthetic ones to balance the dataset.

   No specific formula for random oversampling; it involves randomly duplicating instances from the minority class.

3. **Undersampling**: Reduces the number of majority class instances to balance the dataset.

   No specific formula; it involves randomly removing instances from the majority class.

   Python: `RandomUnderSampler`

4. **SMOTE**: Generates synthetic data points for the minority class by interpolation, which balances the dataset with less overfitting risk than plain duplication.
   ```python
   New Sample = Sample_min + λ * (Sample_nearest - Sample_min)
   ```
   Python: `SMOTE`

5. **Altering the cost function (weighted loss)**: Assigns higher weights to the minority class during training so that the model focuses on correctly predicting minority class examples.

   ```python
   Weighted Loss = W_0 * Loss_0 + W_1 * Loss_1
   ```
   Python: `class_weight='balanced'`

Each technique has its advantages and drawbacks. In practice, combining methods (for example, oversampling together with an altered cost function) can work better than any single method on its own.

---
