## Handling Imbalanced Datasets

1. Up-Sampling
2. Down-Sampling

An imbalanced dataset is one in which one class has far fewer observations than the others. Models trained on such data tend to favor the majority class and perform poorly on the minority class.
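
To make the problem concrete, here is a minimal sketch. It assumes a 900/100 class split and a hypothetical "model" that always predicts the majority class; the names `y_true` and `y_pred` are illustrative only.

```python
import numpy as np

# Labels with a 900/100 split (0 = majority, 1 = minority).
y_true = np.array([0] * 900 + [1] * 100)

# A hypothetical degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy: share of predictions that match the labels.
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: share of true 1s that were predicted as 1.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")                # 0.90
print(f"minority recall: {minority_recall:.2f}")  # 0.00
```

The accuracy looks respectable even though not a single minority example is detected, which is why accuracy alone can be misleading on imbalanced data.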

Two common techniques to address imbalanced datasets are:

### 1. Up-Sampling (Oversampling)

**Definition:** Up-sampling is a technique where you increase the number of instances in the minority class to match the number of instances in the majority class. This is done by randomly duplicating instances from the minority class.

**How to Apply:**

* **Identify the minority and majority classes:** Separate your dataset into dataframes for each class.
* **Resample the minority class:** Use the `resample` function from `sklearn.utils` to randomly sample with replacement (`replace=True`) from the minority class until it has as many instances as the majority class.
* **Concatenate the dataframes:** Combine the up-sampled minority dataframe with the original majority dataframe to create a new, balanced dataset.
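
The three steps above can also be sketched with pandas alone. This is an alternative to `sklearn.utils.resample`, not the method used in the cells of this notebook; it relies on `DataFrame.sample` with `replace=True`, and the toy data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 6 majority rows (target 0), 2 minority rows (target 1).
toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.1, 2.2],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Step 1: separate the classes.
majority = toy[toy["target"] == 0]
minority = toy[toy["target"] == 1]

# Step 2: draw minority rows with replacement until the sizes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)

# Step 3: combine into a balanced dataframe.
balanced_up = pd.concat([majority, minority_up]).reset_index(drop=True)

print(balanced_up["target"].value_counts())
```

Every row in `minority_up` is a copy of one of the two original minority rows, so up-sampling adds no new information; it only changes how much weight the minority class carries during training.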

**Real Project Application:** In fraud detection, where fraudulent transactions are rare (minority class) compared to legitimate transactions (majority class), up-sampling can help the model learn the patterns of fraudulent activities more effectively.

### 2. Down-Sampling (Undersampling)

**Definition:** Down-sampling is a technique where you decrease the number of instances in the majority class to match the number of instances in the minority class. This is done by randomly removing instances from the majority class.

**How to Apply:**

* **Identify the minority and majority classes:** Separate your dataset into dataframes for each class.
* **Resample the majority class:** Use the `resample` function from `sklearn.utils` to randomly sample without replacement (`replace=False`) from the majority class until it has as many instances as the minority class.
* **Concatenate the dataframes:** Combine the down-sampled majority dataframe with the original minority dataframe to create a new, balanced dataset.
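
As a compact alternative to the steps above, pandas can down-sample every class in one call with `groupby(...).sample(...)`. This is a sketch under assumptions: it needs pandas 1.1 or newer, it is not the approach used in the cells of this notebook, and the toy data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: 6 majority rows (target 0), 2 minority rows (target 1).
toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.1, 2.2],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Size of the smallest class; every class is reduced to this many rows.
n_minority = toy["target"].value_counts().min()

# Sample n_minority rows per class, without replacement (the default).
balanced_down = toy.groupby("target").sample(n=n_minority, random_state=0)

print(balanced_down["target"].value_counts())
```

Because sampling is without replacement, each kept row appears once, and four of the six majority rows are simply discarded.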

**Real Project Application:** In medical diagnosis, where a particular disease is rare (minority class) compared to healthy individuals (majority class), down-sampling can be used to balance the dataset. However, be cautious with down-sampling as you might lose valuable information from the majority class.

Choosing between up-sampling and down-sampling depends on the size of your dataset and the specific problem. Up-sampling is generally preferred when you have a small dataset, while down-sampling can be useful for large datasets to reduce training time.

`np.random.seed()` is used to set the seed for NumPy's random number generator. This is important for reproducibility.

When you set a seed, the sequence of random numbers generated will be the same every time you run the code. Without a seed, you would get a different sequence of random numbers each time, which could make it difficult to reproduce your results or debug your code.

In the context of this notebook, setting `np.random.seed(123)` ensures that the randomly generated dataframes for `class_0` and `class_1` will be the same every time the cell is executed, allowing for consistent results when demonstrating up-sampling and down-sampling.
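
A quick way to check this behavior is to draw numbers twice with the same seed. The sketch below uses the same seed value as the notebook; the variable names are illustrative.

```python
import numpy as np

np.random.seed(123)
first_draw = np.random.normal(size=3)

# Resetting the seed restarts the random sequence from the same point.
np.random.seed(123)
second_draw = np.random.normal(size=3)

print(np.array_equal(first_draw, second_draw))  # True
```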

In [2]:
import numpy as np
import pandas as pd


# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:

n_class_0,n_class_1

(900, 100)

In [4]:
## CREATE MY DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [5]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [6]:
df.tail()

|     | feature_1 | feature_2 | target |
|-----|-----------|-----------|--------|
| 995 | 1.376371  | 2.845701  | 1      |
| 996 | 2.23981   | 0.880077  | 1      |
| 997 | 1.13176   | 1.640703  | 1      |
| 998 | 2.902006  | 0.390305  | 1      |
| 999 | 2.69749   | 2.01357   | 1      |


In [7]:
df['target'].value_counts()

| target | count |
|--------|-------|
| 0      | 900   |
| 1      | 100   |


In [8]:
## upsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [9]:
from sklearn.utils import resample

# Sample the minority class with replacement until it matches the majority size
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42,
)

In [10]:
df_minority_upsampled.shape

(900, 3)

In [11]:
df_minority_upsampled.head()

|     | feature_1 | feature_2 | target |
|-----|-----------|-----------|--------|
| 951 | 1.125854  | 1.843917  | 1      |
| 992 | 2.19657   | 1.397425  | 1      |
| 914 | 1.93217   | 2.998053  | 1      |
| 971 | 2.272825  | 3.034197  | 1      |
| 960 | 2.870056  | 1.550485  | 1      |


In [12]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [13]:
df_upsampled['target'].value_counts()

| target | count |
|--------|-------|
| 0      | 900   |
| 1      | 900   |


## Down Sampling

In [14]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [15]:
## downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [16]:
from sklearn.utils import resample

# Sample the majority class without replacement down to the minority size
df_majority_downsampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42,
)

In [17]:
df_majority_downsampled.shape

(100, 3)

In [18]:
df_majority_downsampled.head()
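
To round off the down-sampling section, the sketch below rebuilds the same synthetic dataset in one self-contained block, reduces the majority class with `resample(..., replace=False)`, and concatenates the result with the untouched minority class. The names `df_majority_downsampled` and `df_downsampled` are my own choice, mirroring the naming of the up-sampling cells.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Rebuild the 900/100 synthetic dataset used throughout this notebook.
np.random.seed(123)
n_class_0, n_class_1 = 900, 100

class_0 = pd.DataFrame({
    "feature_1": np.random.normal(loc=0, scale=1, size=n_class_0),
    "feature_2": np.random.normal(loc=0, scale=1, size=n_class_0),
    "target": [0] * n_class_0,
})
class_1 = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=n_class_1),
    "feature_2": np.random.normal(loc=2, scale=1, size=n_class_1),
    "target": [1] * n_class_1,
})
df = pd.concat([class_0, class_1]).reset_index(drop=True)

df_minority = df[df["target"] == 1]
df_majority = df[df["target"] == 0]

# Keep a random subset of the majority class, the same size as the minority.
df_majority_downsampled = resample(
    df_majority,
    replace=False,               # without replacement: no duplicated rows
    n_samples=len(df_minority),
    random_state=42,
)

# Combine the reduced majority class with the original minority class.
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

print(df_majority_downsampled.shape)
print(df_downsampled["target"].value_counts())
```

After this step both classes hold 100 rows each, at the cost of discarding 800 majority rows.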
