<a href="https://colab.research.google.com/github/yonathanm772/classtest/blob/main/imbalanced_data_SC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imbalanced data

Imbalanced data refers to a situation in classification problems where the classes are not represented equally in the dataset.

One class (the majority class) has significantly more instances than one or more other classes (the minority class or classes).

Imbalanced datasets are common in various real-world scenarios, such as fraud detection, medical diagnosis, anomaly detection, and text classification.

### Problems with Imbalanced Datasets:

* **Biased Model Performance**: Models trained on imbalanced data tend to be biased towards the majority class, because most training examples (and therefore most of the training loss) come from that class. The result is poor generalization to the minority class.

* **Misleading Evaluation Metrics**: Traditional evaluation metrics like **accuracy** can be misleading on imbalanced datasets. A model may achieve high accuracy by simply predicting the majority class most of the time, while completely ignoring the minority class.
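
The accuracy problem can be illustrated with a small sketch. The labels below are hypothetical (95 majority samples, 5 minority samples, not taken from any dataset in this notebook), and the "model" simply predicts the majority class for every sample.

```python
import numpy as np

# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = np.mean(y_pred == y_true)

# Minority-class recall: fraction of actual minority samples that were detected.
minority_mask = y_true == 1
minority_recall = np.mean(y_pred[minority_mask] == 1)

print(f"accuracy: {accuracy:.2f}")               # 0.95
print(f"minority recall: {minority_recall:.2f}")  # 0.00
```

Accuracy looks strong even though not a single minority sample is found, which is why per-class metrics such as recall are usually reported alongside it.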

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from sklearn.datasets import load_iris
import pandas as pd

In [7]:
# load iris data
iris = load_iris()
x = iris.data   # feature
y = iris.target # target

In [8]:
type(y)

numpy.ndarray

In [10]:
# check the class distribution
print(pd.Series(y).value_counts())

0    50
1    50
2    50
Name: count, dtype: int64


## Solutions to Imbalanced Data Problems:

### 1. Resampling Techniques:

## 1.1. Under-Sampling

Reduce the number of instances in the majority class by randomly removing instances.



In [12]:
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import RandomUnderSampler
import pandas as pd
breast_cancer = load_breast_cancer()
x = breast_cancer.data
y = breast_cancer.target
print(pd.Series(y).value_counts())

1    357
0    212
Name: count, dtype: int64


In [13]:
rus = RandomUnderSampler(random_state=42)
x_resampled, y_resampled = rus.fit_resample(x, y)
print(pd.Series(y_resampled).value_counts())

0    212
1    212
Name: count, dtype: int64
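
To make the idea concrete, here is a minimal NumPy sketch of random under-sampling. It is an illustration only, not how `RandomUnderSampler` is implemented; the helper `random_undersample` and the toy arrays `x_toy` and `y_toy` are hypothetical names, and the toy labels merely reuse the 357 vs. 212 class counts seen above.

```python
import numpy as np

def random_undersample(x, y, seed=42):
    """Randomly drop samples until every class is as small as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep_idx = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # Sample without replacement so no instance is kept twice.
        keep_idx.append(rng.choice(cls_idx, size=n_keep, replace=False))
    keep_idx = np.sort(np.concatenate(keep_idx))
    return x[keep_idx], y[keep_idx]

# Toy data: one feature column holding a unique id per row.
y_toy = np.array([1] * 357 + [0] * 212)
x_toy = np.arange(len(y_toy)).reshape(-1, 1)

x_small, y_small = random_undersample(x_toy, y_toy)
print(np.bincount(y_small))  # [212 212]
```

Every row that remains is an original row; 145 majority-class rows are simply discarded.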


**Advantages**:
* Reduces computation time and memory usage.
* Helps balance class distribution, potentially improving model performance.

**Disadvantages**:
* May discard potentially useful information from the majority class.
* Risk of underfitting if informative samples are removed.

## 1.2. Over-Sampling

Increase the number of instances in the minority class by duplicating existing instances or generating synthetic samples.



In [15]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
x_resampled1, y_resampled1 = ros.fit_resample(x, y)
print(pd.Series(y_resampled1).value_counts())

0    357
1    357
Name: count, dtype: int64
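
Random over-sampling by duplication can be sketched in plain NumPy as follows. This is an illustrative approximation rather than the `RandomOverSampler` implementation; `random_oversample`, `x_ids`, and `y_labels` are hypothetical names, and the toy labels reuse the 357 vs. 212 class counts from the dataset above.

```python
import numpy as np

def random_oversample(x, y, seed=42):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    all_idx = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing samples with replacement, so added rows are exact copies.
        extra = rng.choice(cls_idx, size=n_target - count, replace=True)
        all_idx.append(np.concatenate([cls_idx, extra]))
    all_idx = np.concatenate(all_idx)
    return x[all_idx], y[all_idx]

# Toy data: one feature column holding a unique id per row.
y_labels = np.array([1] * 357 + [0] * 212)
x_ids = np.arange(len(y_labels)).reshape(-1, 1)

x_big, y_big = random_oversample(x_ids, y_labels)
print(np.bincount(y_big))     # [357 357]
print(len(np.unique(x_big)))  # 569
```

The classes are balanced, but the number of distinct rows is unchanged: the extra minority rows carry no new information, which is where the overfitting risk comes from.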


**Advantages**:
* Increases representation of the minority class, reducing bias towards the majority class.
* Can improve model performance.

**Disadvantages**:
* May lead to overfitting, because random over-sampling repeats exact copies of existing minority samples.
* Can increase computation time and memory usage.



## 1.3. SMOTE (Synthetic Minority Over-sampling Technique)

Generate synthetic samples for the minority class by interpolating between existing instances.

### Step-by-Step Process of SMOTE:

#### Identify Minority Class Samples:

SMOTE first identifies the minority class samples in the dataset. Assume a binary classification problem where the majority class has far more instances than the minority class.

#### Find k Nearest Neighbors:

- For each minority class sample, SMOTE finds its k nearest neighbors in the feature space, typically using Euclidean distance. By default k is 5, though it can be adjusted.
- These neighboring points are other minority class samples that are close to the original point in the feature space.

#### Generate Synthetic Samples:

SMOTE then selects one of the k nearest neighbors at random and creates a new synthetic sample at a random point on the line segment between the original sample and that neighbor.
![image.png](attachment:a8416356-eabf-4c2a-ae82-3b96bab4c87c.png)
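
The three steps can be sketched in NumPy as shown below. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the helper `smote_sketch`, its parameters, and the toy array `x_min` are hypothetical, base samples are picked uniformly at random, and a brute-force distance matrix replaces a proper nearest-neighbor search.

```python
import numpy as np

def smote_sketch(x_minority, k=5, n_new=1, seed=42):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    x_minority = np.asarray(x_minority, dtype=float)
    n = len(x_minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Step 2: pairwise Euclidean distances between minority samples.
    diffs = x_minority[:, None, :] - x_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Step 3: interpolate between a base sample and one random neighbor.
    synthetic = np.empty((n_new, x_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)
        neighbor = neighbors[base, rng.integers(k)]
        lam = rng.random()  # interpolation factor in [0, 1)
        step = x_minority[neighbor] - x_minority[base]
        synthetic[i] = x_minority[base] + lam * step
    return synthetic

# Step 1: hypothetical minority-class samples with two features.
x_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [2.0, 2.0], [1.5, 1.5], [1.2, 1.8]])

new_points = smote_sketch(x_min, k=3, n_new=10)
print(new_points.shape)  # (10, 2)
```

Because each synthetic point lies on a segment between two existing minority samples, all new points stay inside the bounding box of the original minority data.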

In [17]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
x_resampled2, y_resampled2 = smote.fit_resample(x, y)
print(pd.Series(y_resampled2).value_counts())

0    357
1    357
Name: count, dtype: int64


**Advantages**:
* Generates synthetic samples between existing minority samples instead of exact duplicates.
* Addresses imbalanced data without discarding information.

**Disadvantages**:
* Can introduce noise when synthetic samples fall in regions that overlap with the majority class.
* May not work well with high-dimensional data.
* Sensitive to outliers.

**Model Selection:**

Choose models that tend to cope comparatively well with class imbalance, such as logistic regression, decision trees, random forests, support vector machines with appropriate kernels, or anomaly detection algorithms.

## 2. Data-Level Techniques:

* **Collect More Data:** Gather additional data for the minority class to balance the dataset.
* **Domain Knowledge:** Incorporate domain knowledge to better understand the importance of different classes and guide the selection of appropriate techniques for handling imbalance.
