This repository has been archived by the owner on Nov 15, 2021. It is now read-only.

Implement preprocess.upsample #29

Closed
Tracked by #26
ArtemisDicoTiar opened this issue Nov 5, 2021 · 2 comments · Fixed by #38

@ArtemisDicoTiar
Member

ArtemisDicoTiar commented Nov 5, 2021

What?

A function that upsamples the data of classes with relatively few samples to match the class with the most samples.

@eubinecto
Member

@ArtemisDicoTiar You already implemented this a while ago using some function, right? If there's a commit for that part, could you post it here?

@ArtemisDicoTiar
Member Author

ArtemisDicoTiar commented Nov 8, 2021

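@eubinecto This is how I implemented it.
It counts how many times each wisdom is mentioned in the df, takes the most frequent one as the baseline, and randomly upsamples the rest of the data to match it.
(Looking at the code again, I don't really get why I did the counting that way lol. A simple group-and-sort would have done it lol)
For the upsampling, I used scikit-learn to upsample randomly.
It's nothing much, so I'll take care of it now.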

from collections import Counter

import pandas as pd
from sklearn.utils import resample


def upsample(data_df: pd.DataFrame) -> pd.DataFrame:
    # Count rows per 'wisdom' class, most frequent first
    counts = Counter(data_df['wisdom']).most_common()
    major_wisdom, major_count = counts[0]

    # Keep the majority class as-is
    frames = [data_df[data_df['wisdom'] == major_wisdom]]

    # Upsample every other class
    for wis, _ in counts[1:]:
        df_minority_upsampled = resample(data_df[data_df['wisdom'] == wis],
                                         replace=True,  # sample with replacement
                                         n_samples=major_count,  # to match majority class
                                         random_state=123)  # reproducible results
        frames.append(df_minority_upsampled)

    # pd.concat replaces DataFrame.append, which pandas 2.0 removed
    return pd.concat(frames)
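
For reference, the simpler "group and sort" counting mentioned in the comment above could look roughly like this. It is only a sketch, not the code that was merged for this issue: the function name `upsample_by_group` and the `label_col` and `seed` parameters are made up for illustration, and it uses pandas' own `value_counts` and `DataFrame.sample` instead of `Counter` and scikit-learn's `resample`.

```python
import pandas as pd


def upsample_by_group(df: pd.DataFrame,
                      label_col: str = "wisdom",  # hypothetical parameter name
                      seed: int = 123) -> pd.DataFrame:
    """Upsample every class in `label_col` to the size of the largest class."""
    # Size of the largest class; value_counts does the counting in one call
    target = df[label_col].value_counts().max()

    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            # Sample with replacement until the class reaches the target size
            group = group.sample(n=target, replace=True, random_state=seed)
        # Classes already at the target size are kept unchanged
        parts.append(group)
    return pd.concat(parts)


if __name__ == "__main__":
    demo = pd.DataFrame({
        "wisdom": ["a", "a", "a", "b", "c", "c"],
        "x": range(6),
    })
    balanced = upsample_by_group(demo)
    # Every class now has 3 rows, the size of the largest class "a"
    print(balanced["wisdom"].value_counts())
```

One difference from the snippet above: a class tied with the largest one is left untouched here rather than resampled.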

ArtemisDicoTiar added a commit that referenced this issue Nov 8, 2021
@ArtemisDicoTiar ArtemisDicoTiar linked a pull request Nov 8, 2021 that will close this issue