# 1.2 Data Imbalance

## Upsampling (SMOTE, Borderline SMOTE, ADASYN)


### Creating the dataset

In [7]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_classes = 2, class_sep = 2, weights = [0.1, 0.9], n_informative = 3, n_redundant=1, flip_y = 0,
                            n_features = 20, n_clusters_per_class=1, n_samples = 1000, random_state = 123)


In [13]:
import pandas as pd
imbalanced_df = pd.DataFrame(X)
imbalanced_df['class'] = y
imbalanced_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.028368 | -0.833351 | 2.008017 | -2.054824 | -1.058958 | 0.00087 | 0.715136 | 1.108162 | 0.180774 | -1.703395 | ... | 0.176379 | -0.077279 | -0.624918 | -0.59281 | -1.569512 | 1.76614 | -1.439002 | -0.851583 | -1.056961 | 1 |
| 1 | -0.948006 | 0.89954 | 2.073949 | -0.241187 | 0.418379 | 0.900114 | -0.829749 | 0.700975 | -0.768594 | -0.869385 | ... | 2.129611 | -0.931974 | -0.10172 | -0.445312 | -1.628836 | 0.025042 | 0.930134 | 0.811761 | -0.831855 | 1 |
| 2 | 1.607323 | 0.18938 | 2.042288 | 0.310414 | 0.581914 | 0.314189 | 1.515464 | -0.765339 | -0.98797 | 0.930051 | ... | -0.503749 | -0.425103 | 0.24568 | -0.098194 | -2.754477 | 1.324391 | -0.584268 | 0.783292 | -3.69272 | 1 |
| 3 | -0.75924 | -0.166984 | 2.194113 | -0.269572 | 0.539652 | -0.604082 | 0.082591 | -1.163324 | -0.355386 | 0.507094 | ... | 0.406347 | -0.621168 | -1.23646 | -0.155795 | -2.199876 | -1.546607 | -1.233934 | 0.961971 | -2.005697 | 1 |
| 4 | 0.243432 | 0.604924 | 2.31594 | -1.020196 | 1.468722 | -1.214363 | 1.565992 | -0.203765 | -0.253224 | -0.28524 | ... | -1.918273 | -0.774854 | -0.397427 | 1.037227 | -2.392618 | -1.412563 | 2.356046 | 0.373055 | -1.571927 | 1 |


In [15]:
imbalanced_df.value_counts('class')

class
1    900
0    100
dtype: int64

In [19]:
from sklearn.utils import resample

majority_data = imbalanced_df[imbalanced_df['class'] == 1]
minority_data = imbalanced_df[imbalanced_df['class'] == 0]

# upsampling
minority_upsampled = resample(minority_data,
                                replace = True, # sample with replacement
                                n_samples = len(majority_data), # match number in majority class
                                random_state = 123)
len(minority_upsampled)

900
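
The resampled minority rows are usually concatenated back with the untouched majority rows before training. The following is a minimal, self-contained sketch of that step, assuming the same `make_classification` settings as above; the name `balanced_df` and the shuffling step are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Rebuild the imbalanced data set with the same settings as above
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=123)
imbalanced_df = pd.DataFrame(X)
imbalanced_df['class'] = y

majority_data = imbalanced_df[imbalanced_df['class'] == 1]
minority_data = imbalanced_df[imbalanced_df['class'] == 0]

# Draw minority rows with replacement until they match the majority count
minority_upsampled = resample(minority_data, replace=True,
                              n_samples=len(majority_data), random_state=123)

# Stack both classes, then shuffle so the classes are not stored in blocks
# (balanced_df is a hypothetical name used only in this sketch)
balanced_df = (pd.concat([majority_data, minority_upsampled])
                 .sample(frac=1, random_state=123)
                 .reset_index(drop=True))
print(balanced_df['class'].value_counts())
```

Because the duplicated rows are exact copies, this kind of resampling is normally applied to the training split only, so that copies of one row do not end up in both the training and the test data.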

## Using the imblearn package

In [1]:
# %pip install imblearn

Collecting imblearn
  Using cached imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Using cached imbalanced_learn-0.9.1-py3-none-any.whl (199 kB)
Collecting scikit-learn>=1.1.0
  Downloading scikit_learn-1.1.3-cp39-cp39-macosx_10_9_x86_64.whl (8.7 MB)
     |████████████████████████████████| 8.7 MB 1.5 MB/s            
Installing collected packages: scikit-learn, imbalanced-learn, imblearn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0
    Uninstalling scikit-learn-1.0:
      Successfully uninstalled scikit-learn-1.0
Successfully installed imbalanced-learn-0.9.1 imblearn-0.0 scikit-learn-1.1.3
Note: you may need to restart the kernel to use updated packages.


In [1]:
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class rows until both classes have the same count
random_oversampling = RandomOverSampler()
X_RandOverSampled, y_RandOverSampled = random_oversampling.fit_resample(X, y)

## Downsampling (Under-sampling)

In [None]:
# undersampling
majority_downsampled = resample(majority_data,
                                replace = False, # sample without replacement
                                n_samples = len(minority_data), # match number in minority class
                                random_state = 123)

## Using imblearn

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have the same count
random_undersampling = RandomUnderSampler()
X_RandUnderSampled, y_RandUnderSampled = random_undersampling.fit_resample(X, y)
kept = random_undersampling.sample_indices_ # indices of the rows that were kept


## Under-sampling : Tomek links

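A Tomek link is a pair of samples from different classes that are each other's nearest neighbors. Removing samples that form such pairs clears space between the classes; with `sampling_strategy='auto'`, imblearn removes only the majority-class sample of each pair.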

In [33]:
from imblearn.under_sampling import TomekLinks

tomek = TomekLinks(sampling_strategy='auto')
X_tomek, y_tomek = tomek.fit_resample(X,y)

print(Counter(y_tomek))

Counter({1: 900, 0: 100})
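
The counts above are unchanged, which suggests that no Tomek links were found in this well-separated data set (`class_sep = 2`). To make the definition concrete, here is a small NumPy sketch that finds Tomek links by brute force on toy data. The function `find_tomek_links` and the toy arrays are hypothetical and written only for this illustration; this is not how imblearn implements the search.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    # Pairwise Euclidean distances; a sample is never its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        # Mutual nearest neighbors with different labels form a Tomek link
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy 1-D data: samples 1 and 2 sit right at the class boundary
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [4.5]])
y_toy = np.array([0, 0, 1, 1, 1])
links = find_tomek_links(X_toy, y_toy)
print(links)  # [(1, 2)]
```

Samples 3 and 4 are also mutual nearest neighbors, but they share a label, so they are not a Tomek link.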


## Over-sampling : SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) creates new synthetic samples from the existing minority-class samples, typically by interpolating between a sample and one of its k nearest minority-class neighbors.

In [35]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto')
X_smote, y_smote = smote.fit_resample(X,y)
print(Counter(y_smote))

Counter({1: 900, 0: 900})
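
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not imblearn's implementation: `smote_sketch` is a hypothetical helper, it uses brute-force distances, and it only produces the synthetic rows without touching the majority class.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Distances among minority samples only; exclude self-matches
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]    # k nearest per sample

    base = rng.integers(0, len(X_min), size=n_new)             # starting samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one neighbor each
    gap = rng.random((n_new, 1))                               # position in [0, 1)
    # Each new point lies on the segment between a sample and its neighbor
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30, k=5)
print(X_new.shape)  # (30, 2)
```

Since every synthetic point is a convex combination of two real minority samples, the new points never leave the region spanned by the minority class.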


## Over-sampling : Borderline SMOTE

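Whereas plain SMOTE generates synthetic samples from randomly chosen minority-class samples, Borderline-SMOTE oversamples only the minority samples that lie near the boundary (borderline) with the other class, concentrating on the region that is hardest to classify.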

Source: "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning" (http://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf)

In [6]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_classes = 2, class_sep = 2, weights = [0.1, 0.9], n_informative = 3, n_redundant=1, flip_y = 0,
                            n_features = 20, n_clusters_per_class=1, n_samples = 1000, random_state = 123)

print('Class counts before resampling: %s' % Counter(y))

sm = BorderlineSMOTE(random_state = 123)
X_res, y_res = sm.fit_resample(X,y)
print('Class counts after Borderline SMOTE: %s' % Counter(y_res))

Class counts before resampling: Counter({1: 900, 0: 100})
Class counts after Borderline SMOTE: Counter({1: 900, 0: 900})
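
To decide which minority samples count as borderline, the cited paper looks at the m nearest neighbors of each minority sample in the whole training set: if all of them belong to the majority class the sample is treated as noise, if at least half do it is in "danger" (borderline), and otherwise it is safe. Only the danger samples are used as starting points for interpolation. A minimal NumPy sketch of that categorisation follows; `borderline_categories`, its default `m`, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def borderline_categories(X, y, minority_label=0, m=5):
    """Label each minority sample as 'safe', 'danger' or 'noise'."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    categories = {}
    for i in np.flatnonzero(y == minority_label):
        nn = np.argsort(dist[i])[:m]                  # m nearest in the whole set
        n_major = int(np.sum(y[nn] != minority_label))
        if n_major == m:
            categories[int(i)] = 'noise'              # surrounded by the majority
        elif n_major >= m / 2:
            categories[int(i)] = 'danger'             # on the borderline
        else:
            categories[int(i)] = 'safe'
    return categories

# Toy 1-D data: label 0 is the minority class
X_toy = np.array([[0.], [1.], [8.], [10.], [11.], [13.], [50.], [52.], [51.]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0])
cats = borderline_categories(X_toy, y_toy, minority_label=0, m=2)
print(cats)
# {0: 'safe', 1: 'safe', 2: 'danger', 3: 'danger', 8: 'noise'}
```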


## ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning)

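ADASYN is similar to SMOTE in that it interpolates between a minority-class sample and a randomly chosen one of its K nearest minority-class neighbors. The difference is that it decides adaptively how many synthetic samples to create for each minority sample: samples with more majority-class neighbors receive more synthetic samples. As a result, the final class counts are only approximately balanced.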

In [37]:
from imblearn.over_sampling import ADASYN

ada = ADASYN(random_state=123)
X_res, y_res = ada.fit_resample(X,y)
print(Counter(y_res))

Counter({1: 900, 0: 898})
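
The adaptive part of ADASYN can be sketched as follows, assuming the usual formulation: for each minority sample, r_i is the fraction of majority samples among its K nearest neighbors, the r_i are normalised to sum to 1, and each sample receives a share of the G synthetic samples in proportion to its normalised r_i. The helper `adasyn_allocation` and the toy data are hypothetical. Rounding the per-sample quotas is one plausible reason why totals such as 898 versus 900 above do not match exactly.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=0, k=5, beta=1.0):
    """Return {minority index: number of synthetic samples to generate}."""
    is_min = (y == minority_label)
    n_min, n_maj = is_min.sum(), (~is_min).sum()
    G = (n_maj - n_min) * beta             # total synthetic samples wanted

    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    min_idx = np.flatnonzero(is_min)
    # r_i: fraction of majority samples among the k nearest neighbors
    r = np.array([np.mean(~is_min[np.argsort(dist[i])[:k]]) for i in min_idx])
    if r.sum() == 0:
        raise ValueError("no minority sample has a majority-class neighbor")
    r_hat = r / r.sum()                    # normalise to a distribution
    g = np.rint(r_hat * G).astype(int)     # rounded per-sample quota
    return dict(zip(min_idx.tolist(), g.tolist()))

# Toy 1-D data: label 0 is the minority class (3 samples vs. 7)
X_toy = np.array([[0.], [2.], [9.], [3.], [10.], [11.], [13.], [14.], [15.], [16.]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
alloc = adasyn_allocation(X_toy, y_toy, minority_label=0, k=2)
print(alloc)  # {0: 1, 1: 1, 2: 2}
```

Sample 2 has only majority neighbors, so it receives twice the quota of samples 0 and 1, which each have one majority and one minority neighbor.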


## Other methods

- Combined sampling (over-sampling followed by under-sampling)
    - SMOTE + ENN
    - SMOTE + Tomek