## Imbalanced Dataset

An imbalanced dataset is one in which the classes are not represented equally: one or more classes have significantly fewer samples than the others. For example, in a binary classification problem where one class accounts for 95% of the samples and the other for 5%, the dataset is imbalanced.

Imbalanced datasets are challenging for machine learning algorithms because training tends to be biased toward the majority class, which leads to poor performance on the minority class. This is particularly problematic when the minority class is the one of interest, as in detecting rare diseases or fraudulent transactions. Several techniques can be used to address imbalanced datasets, including:

1. Resampling techniques: These involve either undersampling the majority class, oversampling the minority class, or a combination of both.

2. Cost-sensitive learning: This involves adjusting the misclassification costs for each class, so that the algorithm is penalized more for misclassifying instances of the minority class.

3. Ensemble methods: These involve combining multiple models, for example models each trained on a different resampled subset of the data, to create a more robust and accurate classifier.

4. Synthetic data generation: This involves creating new, artificial samples for the minority class, rather than duplicating existing ones, to balance out the dataset.

There is no one-size-fits-all solution for imbalanced datasets; the most appropriate approach depends on the specific problem and the available data.
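As a rough illustration of cost-sensitive learning, the sketch below computes per-class weights that are inversely proportional to class frequency, using the heuristic `n_samples / (n_classes * count)`. To the best of my knowledge this is the same rule scikit-learn applies for `class_weight="balanced"`, but treat that as an assumption. The helper name `balanced_class_weights` and the 950/50 labels are hypothetical.

```python
import numpy as np

# Hypothetical labels matching a 95% / 5% split (illustrative only)
y = np.array([0] * 950 + [1] * 50)

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

class_weights = balanced_class_weights(y)
print(class_weights)

# Per-sample weights that could be passed to a learner that accepts them
sample_weights = np.array([class_weights[label] for label in y.tolist()])
```

With these weights, each class contributes the same total weight to the loss, so a mistake on a minority sample costs proportionally more than a mistake on a majority sample.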

**An imbalanced dataset is a common problem in machine learning in which the distribution of classes is not equal. For example, if we have a binary classification problem with 1000 samples, where 900 samples belong to class A and 100 samples belong to class B, the dataset is imbalanced.**
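To make the effect of this imbalance concrete, here is a minimal sketch of a trivial "classifier" that always predicts the majority class on the 900/100 example. The encoding of class A as `0` and class B as `1` is an assumption made for illustration.

```python
import numpy as np

# Labels for the 900 / 100 example: class A -> 0, class B -> 1 (assumed encoding)
y_true = np.array([0] * 900 + [1] * 100)

# A trivial "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Fraction of all samples predicted correctly
accuracy = (y_pred == y_true).mean()

# Fraction of true minority samples that were predicted as minority
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy        = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

The baseline reaches 90% accuracy without ever identifying a class B sample, which is why accuracy alone can be misleading on imbalanced data.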

To handle imbalanced datasets, two common techniques are up-sampling and down-sampling:

1. Up-Sampling: In up-sampling, minority class samples are duplicated (sampled with replacement) to balance the class distribution. For example, in the above dataset, we can draw 800 additional samples from the 100 samples of class B, giving 900 samples per class. This technique is suitable for small datasets, but because the model sees the same minority samples repeatedly, it can lead to overfitting.

2. Down-Sampling: In down-sampling, majority class samples are randomly removed to balance the class distribution. For example, in the above dataset, we can remove 800 samples of class A, leaving 100 samples per class. This technique is suitable for large datasets, but the discarded samples may carry useful information that the model never sees.

Both up-sampling and down-sampling have advantages and disadvantages, and the choice between them depends on the dataset size, the class distribution, and the machine learning algorithm used. Other techniques, such as the Synthetic Minority Over-sampling Technique (SMOTE) and cost-sensitive learning, can also be used to handle imbalanced datasets.
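The core idea behind SMOTE, as it is commonly described, is to create each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that idea, not the implementation from any particular library. The function name `smote_like_oversample`, its parameters, and the example data are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolation factor in [0, 1): new point lies on the connecting segment
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Hypothetical minority class: 100 two-feature points drawn around (2, 2)
rng = np.random.default_rng(123)
X_min = rng.normal(loc=2, scale=1, size=(100, 2))

X_new = smote_like_oversample(X_min, n_new=800)
print(X_new.shape)
```

Unlike plain up-sampling, the 800 new rows are not exact copies of existing rows, which may reduce the overfitting risk mentioned above. For real work, a maintained implementation is likely a better choice than this sketch.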

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Set the random seed for reproducibility
np.random.seed(123)

In [3]:
# Create a dataframe with two classes
n_samp = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samp * class_0_ratio)
n_class_1 = n_samp - n_class_0

In [4]:
n_class_0,n_class_1

(900, 100)

In [5]:
# Create one DataFrame per class; class 0 is the majority, class 1 the minority

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [8]:
df = pd.concat([class_0,class_1]).reset_index(drop = True)

In [9]:
df.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 0 | -1.085631 | 0.551302 | 0 |
| 1 | 0.997345 | 0.419589 | 0 |
| 2 | 0.282978 | 1.815652 | 0 |
| 3 | -1.506295 | -0.25275 | 0 |
| 4 | -0.5786 | -0.292004 | 0 |


In [10]:
df.tail()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 995 | 1.376371 | 2.845701 | 1 |
| 996 | 2.23981 | 0.880077 | 1 |
| 997 | 1.13176 | 1.640703 | 1 |
| 998 | 2.902006 | 0.390305 | 1 |
| 999 | 2.69749 | 2.01357 | 1 |


In [11]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [12]:
## Upsampling

df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [13]:
from sklearn.utils import resample

In [15]:
# Draw minority rows with replacement until they match the majority class size
Minority_Sampled = resample(df_minority, replace = True, n_samples = len(df_majority), random_state = 42)

In [17]:
Minority_Sampled.shape

(900, 3)

In [18]:
Minority_Sampled.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 951 | 1.125854 | 1.843917 | 1 |
| 992 | 2.19657 | 1.397425 | 1 |
| 914 | 1.93217 | 2.998053 | 1 |
| 971 | 2.272825 | 3.034197 | 1 |
| 960 | 2.870056 | 1.550485 | 1 |


In [21]:
df_upsampled = pd.concat([df_majority, Minority_Sampled])

In [22]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64

In [23]:
## Down sampling

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

0    900
1    100
Name: target, dtype: int64


In [26]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [27]:
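# Note: replace=True draws with replacement, so a majority row can be picked more than once.
# replace=False is the usual choice when down-sampling; the output below was produced with replace=True.
Majority_Sampled = resample(df_majority, replace = True, n_samples = len(df_minority), random_state = 42)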

In [29]:
Majority_Sampled.shape

(100, 3)

In [30]:
Majority_Sampled.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 102 | 0.712265 | 0.718151 | 0 |
| 435 | 1.199988 | 0.574621 | 0 |
| 860 | 0.304515 | -0.759475 | 0 |
| 270 | -1.213385 | 0.675504 | 0 |
| 106 | 0.179549 | -0.202659 | 0 |


In [31]:
df_downsampled = pd.concat([df_minority, Majority_Sampled])

In [32]:
df_downsampled.target.value_counts()

1    100
0    100
Name: target, dtype: int64
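The down-sampling cell above calls `resample` with `replace = True`, which can select the same majority row more than once. A variant that draws without replacement, so that every retained majority row is distinct, might look like the sketch below. It uses pandas' `DataFrame.sample` rather than `resample`, and the names `df_demo`, `majority_down`, and `balanced` as well as the single-feature data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a 900 / 100 dataset like the one above (feature values are illustrative)
rng = np.random.default_rng(123)
df_demo = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})

minority = df_demo[df_demo["target"] == 1]
majority = df_demo[df_demo["target"] == 0]

# replace=False guarantees that every sampled majority row is distinct
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)

# Combine the full minority class with the reduced majority class
balanced = pd.concat([minority, majority_down])

print(balanced["target"].value_counts().to_dict())
print(majority_down.index.is_unique)
```

Passing `replace=False` to `resample` should have the same effect; `DataFrame.sample` is used here only to keep the sketch free of extra dependencies.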