Initially we get raw data, often with a large number of missing values and imbalanced classes. We need to turn that raw data into clean, processed data before feeding it to our machine learning or deep learning models.

### Missing Values

Missing values occur in a dataset when no information is stored for a variable in some observations. There are three mechanisms:

1. Missing Completely at Random (MCAR):
There is no systematic reason why the values are missing: the chance of a value being missing is unrelated to any observed or unobserved data. It may also happen because a question or part was not relevant.

2. Missing Not at Random (MNAR):
The data is not missing at random. It is missing because of unobserved or unmeasured factors that are associated with the missing values themselves.

3. Missing at Random (MAR):
Here the missingness is systematically related to the observed data, but not to the missing values themselves.

* To check for missing values in a dataset, use `df.isnull().sum()`, which returns the number of missing values in each column.
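
As a minimal sketch of that check, the snippet below builds a small toy DataFrame (the column names and values are made up for illustration) and counts the missing entries per column. The percentage view is an optional extra that is often handy on larger datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns and values are illustrative only
df_demo = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "gender": ["F", "M", None, "M"],
    "salary": [50000, 62000, 58000, 61000],
})

# Number of missing values in each column
missing_counts = df_demo.isnull().sum()
print(missing_counts)

# Share of missing values in each column, as a percentage
missing_pct = df_demo.isnull().mean() * 100
print(missing_pct)
```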

### Categorical Values
If the data is categorical, e.g. `gender`, use the mode: find the unique values and replace each NaN with the most frequent (most repeated) one.
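
A small sketch of mode imputation, assuming a hypothetical `gender` column held in a pandas Series. `Series.mode()` returns a Series because several values can tie, so the first entry is taken here.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
gender = pd.Series(["M", "F", "M", np.nan, "M", "F", np.nan], name="gender")

# mode() ignores NaN by default and may return several values on a tie
most_frequent = gender.mode()[0]

# Replace every NaN with the most frequent category
gender_filled = gender.fillna(most_frequent)
print(most_frequent)
print(gender_filled.tolist())
```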

### Outliers
If the dataset contains outliers, e.g. in `salary`, use the median: sort the values in ascending or descending order and replace each NaN with the middle value. The median is affected far less by outliers than the mean.
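
To illustrate why the median is preferred here, the sketch below uses a made-up `salary` column with one extreme value and compares the two statistics before filling the gap. The numbers are illustrative, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme outlier and one missing value
salary = pd.Series([30000, 32000, 35000, np.nan, 38000, 1000000], name="salary")

# Both statistics skip NaN by default
mean_salary = salary.mean()      # pulled upwards by the single outlier
median_salary = salary.median()  # middle of the sorted non-missing values
print(mean_salary, median_salary)

# Fill the missing entry with the median rather than the mean
salary_filled = salary.fillna(median_salary)
```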

### Numerical Values
If the data is numerical and has no strong outliers, e.g. `age`, calculate the mean and replace each NaN with it.
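
A minimal sketch of mean imputation on a hypothetical `age` column:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
age = pd.Series([22, 25, np.nan, 31, 30], name="age")

# mean() skips NaN by default, so it averages the four known ages
mean_age = age.mean()

# Replace the missing entry with the mean
age_filled = age.fillna(mean_age)
print(age_filled.tolist())
```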

## Handling Imbalanced Dataset

1. Upsampling (Oversampling): This technique increases the number of data points in the minority class. This is often done by:
   * Duplicating existing samples from the minority class.
   * Creating synthetic samples, as in SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling).
   * The synthetic methods generate new instances by interpolating between existing minority samples and their neighbours in feature space, which adds more diversity than plain duplication.

2. Downsampling (Undersampling): This technique reduces the number of data points in the majority class. Removing some instances from the majority class brings the classes to a balanced ratio. Do this carefully, because every removed row is information lost from the overall dataset.

Both techniques help keep models from becoming biased towards the majority class, which would otherwise lead to poor prediction performance on the minority class.
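
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_samples` and its parameters are hypothetical, it assumes purely numerical features and at least two minority points, and it computes all pairwise distances, so it only suits small data.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point somewhere on the segment between the two
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

# Hypothetical minority class: 100 points centred at (2, 2)
rng = np.random.default_rng(123)
X_min = rng.normal(loc=2, scale=1, size=(100, 2))
X_new = smote_like_samples(X_min, n_new=800)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.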

In [2]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [7]:
### Create a DataFrame with imbalanced classes

class_0 = pd.DataFrame({
    'feature_1' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'target' : [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1' : np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2' : np.random.normal(loc=2, scale=1, size=n_class_1),
    'target' : [1] * n_class_1

})

In [8]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [9]:
df.head()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 0 | -1.774224 | 0.285744 | 0 |
| 1 | -1.201377 | 0.333279 | 0 |
| 2 | 1.096257 | 0.531807 | 0 |
| 3 | 0.861037 | -0.354766 | 0 |
| 4 | -1.520367 | -1.120815 | 0 |


In [10]:
df.tail()

| | feature_1 | feature_2 | target |
|---|---|---|---|
| 995 | 2.677156 | 1.092048 | 1 |
| 996 | 2.963404 | 0.181955 | 1 |
| 997 | 1.621476 | 1.877267 | 1 |
| 998 | 3.429559 | 3.794486 | 1 |
| 999 | 3.532273 | 1.67949 | 1 |


In [12]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

#### Upsampling

In [13]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [29]:
from sklearn.utils import resample

# Upsample the minority class to the size of the majority class
df_minority_upsampled = resample(
    df_minority,
    replace=True,                 # sampling with replacement
    n_samples=len(df_majority),
    random_state=42,
)

In [21]:
df_minority_upsampled.shape

(900, 3)

In [22]:
# Every resampled row comes from the minority class, so target is 1 throughout
df_minority_upsampled.head()

In [24]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [25]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

#### Downsampling

In [27]:
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 10000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'target' : [0] * n_class_0

})

class_1 = pd.DataFrame({
    'feature_1' : np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2' : np.random.normal(loc=2, scale=1, size=n_class_1),
    'target' : [1] * n_class_1

})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    9000
1    1000
Name: count, dtype: int64


In [36]:
## downsampling

df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [37]:
from sklearn.utils import resample

# Downsample the majority class to the size of the minority class
df_majority_downsampled = resample(
    df_majority,
    replace=False,                # sampling without replacement
    n_samples=len(df_minority),
    random_state=42,
)

In [38]:
df_majority_downsampled.shape

(1000, 3)

In [39]:
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [40]:
df_downsampled.target.value_counts()

target
1    1000
0    1000
Name: count, dtype: int64
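
Both resampling directions can be wrapped in one small helper. The sketch below is one possible way to do it, not the notebook's method: it swaps `sklearn.utils.resample` for pandas' own `DataFrame.sample`, and the function name `balance_classes` and its parameters are hypothetical. It also shuffles the result so that the classes do not end up in contiguous blocks.

```python
import numpy as np
import pandas as pd

def balance_classes(df, target="target", strategy="down", seed=42):
    """Return a shuffled copy of df in which every class has the same row count."""
    counts = df[target].value_counts()
    # Target size: smallest class when downsampling, largest when upsampling
    n = int(counts.min()) if strategy == "down" else int(counts.max())

    parts = []
    for label, count in counts.items():
        group = df[df[target] == label]
        if count != n:
            # Sample with replacement only when the class has to grow
            group = group.sample(n=n, replace=bool(count < n), random_state=seed)
        parts.append(group)

    # Shuffle the combined rows and rebuild a clean index
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1
demo = pd.DataFrame({
    "feature_1": np.arange(10, dtype=float),
    "target": [0] * 8 + [1] * 2,
})
down = balance_classes(demo, strategy="down")
up = balance_classes(demo, strategy="up")
print(down["target"].value_counts().to_dict())
print(up["target"].value_counts().to_dict())
```

In practice, resampling is usually applied to the training split only, so that evaluation data keeps the original class distribution.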