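# 2. Data Preparation

- Before a machine learning model can be built, the data must be cleaned and shaped into a form that an algorithm can consume.

## 2.1 Reading and Inspecting the Data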

In [1]:
# Load the file to be preprocessed
import pandas as pd

bank_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/전처리/chap3-4/data/data/bank.csv')
bank_df.head()

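|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN | no |
| 1 | 36 | technician | single | secondary | no | 265 | yes | yes | NaN | 5 | may | 348 | 1 | -1 | 0 | NaN | no |
| 2 | 25 | blue-collar | married | secondary | no | -7 | yes | no | NaN | 5 | may | 365 | 1 | -1 | 0 | NaN | no |
| 3 | 53 | technician | married | secondary | no | -3 | no | no | NaN | 5 | may | 1666 | 1 | -1 | 0 | NaN | no |
| 4 | 24 | technician | single | secondary | no | -103 | yes | yes | NaN | 5 | may | 145 | 1 | -1 | 0 | NaN | no |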


In [3]:
# Check the number of rows and columns (column data types are checked in the next cell)
bank_df.shape

(7234, 17)

In [4]:
bank_df.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

- shape, dtypes: make a habit of checking both every time data is loaded
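
The habit above can be wrapped in a small helper. The following is a minimal sketch; `quick_check` and `toy_df` are hypothetical names, and the toy values only stand in for `bank_df`.

```python
import pandas as pd

# Toy frame standing in for bank_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "age": [58, 36, 25],
    "job": ["management", "technician", None],
    "balance": [2143, 265, -7],
})

def quick_check(df):
    """Return (rows, columns) plus a per-column table of dtype and missing count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
    })
    return df.shape, summary

shape, summary = quick_check(toy_df)
print(shape)
print(summary)
```

Calling `quick_check(bank_df)` right after `pd.read_csv` would show the shape, the data types, and the missing-value counts in one place.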

## 2.2 Dropping Missing Values

In [5]:
bank_df.isnull().sum()

age             0
job            44
marital         0
education     273
default         0
balance         0
housing         0
loan            0
contact      2038
day             0
month           0
duration        0
campaign        0
pdays           0
previous        0
poutcome     5900
y               0
dtype: int64

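- Handle job and education first, since they have few missing values
- As a rule of thumb, a column is considered to have many missing values when one third or more of its values are missing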
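
The one-third rule can be checked numerically. This sketch reuses the counts printed by `isnull().sum()` and the row count from `bank_df.shape`; the variable names are illustrative only.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above;
# n_rows is the row count reported by bank_df.shape.
n_rows = 7234
missing_counts = pd.Series(
    {"job": 44, "education": 273, "contact": 2038, "poutcome": 5900}
)

# Share of missing values per column.
missing_ratio = missing_counts / n_rows

# Columns at or above the one-third rule of thumb.
many_missing = missing_ratio[missing_ratio >= 1 / 3].index.tolist()
print(missing_ratio.round(3))
print(many_missing)  # ['poutcome']
```

On the real frame, `bank_df.isnull().mean()` gives the same ratios directly.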

In [6]:
# Drop rows with missing values in job or education
bank_df = bank_df.dropna(subset = ['job', 'education'])
bank_df.shape

(6935, 17)

- Next, handle contact and poutcome
- poutcome: more than one third of its values are missing → drop the column from the dataset

In [7]:
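# Exercise 8) Drop poutcome, then check the number of rows and columns
bank_df = bank_df.dropna(thresh = 2400, axis = 1) # thresh: minimum number of non-missing values required to keep a column. thresh = 2400 → columns with fewer than 2400 non-missing values are dropped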
bank_df.shape

(6935, 16)
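
Because `thresh` counts non-missing values rather than missing ones, it is easy to misread. The toy frame below (made-up values) is a small sketch of the behavior.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4],                  # 4 non-missing values
    "b": [1, np.nan, 3, 4],             # 3 non-missing values
    "c": [np.nan, np.nan, np.nan, 4],   # 1 non-missing value
})

# Keep only columns that have at least 3 non-missing values.
kept = toy_df.dropna(thresh=3, axis=1)
print(list(kept.columns))  # ['a', 'b']
```

With 6935 rows and `thresh = 2400`, a column survives as long as at least 2400 of its values are present, which is why contact is kept while poutcome is dropped.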

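## 2.3 Imputing Missing Values

Ways to fill in missing values:
1. Numeric data: fill with 0, another constant integer, the previous or next value, the mean, and so on
2. String data: fill with a fixed string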
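
The options listed above could be sketched as follows. The toy frame and its values are assumptions for illustration; only `fillna`, `ffill`, and `mean` from pandas are used.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "balance": [100.0, np.nan, 300.0, np.nan],
    "contact": ["cellular", None, "telephone", None],
})

# Numeric column: three common choices.
filled_zero = toy_df["balance"].fillna(0)                        # constant 0
filled_mean = toy_df["balance"].fillna(toy_df["balance"].mean()) # column mean
filled_prev = toy_df["balance"].ffill()                          # previous value

# String column: a fixed placeholder string.
filled_text = toy_df["contact"].fillna("unknown")

print(filled_zero.tolist())  # [100.0, 0.0, 300.0, 0.0]
print(filled_mean.tolist())  # [100.0, 200.0, 300.0, 200.0]
print(filled_prev.tolist())  # [100.0, 100.0, 300.0, 300.0]
print(filled_text.tolist())  # ['cellular', 'unknown', 'telephone', 'unknown']
```

Which choice is appropriate depends on what a missing value means in the column at hand.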

- contact data type: object (string)
- A missing contact value means the method used to contact the customer is not known → fill with 'unknown'

In [8]:
# Fill in missing values
bank_df = bank_df.fillna({'contact' : 'unknown'})
bank_df.head()

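|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | no |
| 1 | 36 | technician | single | secondary | no | 265 | yes | yes | unknown | 5 | may | 348 | 1 | -1 | 0 | no |
| 2 | 25 | blue-collar | married | secondary | no | -7 | yes | no | unknown | 5 | may | 365 | 1 | -1 | 0 | no |
| 3 | 53 | technician | married | secondary | no | -3 | no | no | unknown | 5 | may | 1666 | 1 | -1 | 0 | no |
| 4 | 24 | technician | single | secondary | no | -103 | yes | yes | unknown | 5 | may | 145 | 1 | -1 | 0 | no |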


## 2.4 Removing Outliers

In [9]:
bank_df.describe()

|  | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 6935.0 | 6935.0 | 6935.0 | 6935.0 | 6935.0 | 6935.0 | 6935.0 |
| mean | 40.675847 | 1375.198414 | 15.568277 | 262.853353 | 2.709877 | 40.675126 | 0.571449 |
| std | 10.621373 | 3063.660588 | 8.287186 | 268.43365 | 2.977714 | 99.567128 | 1.842921 |
| min | 2.0 | -3313.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 32.0 | 73.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 38.0 | 450.0 | 16.0 | 184.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1463.5 | 21.0 | 321.0 | 3.0 | -1.0 | 0.0 |
| max | 157.0 | 81204.0 | 31.0 | 3366.0 | 44.0 | 850.0 | 40.0 |


- age: minimum 2, maximum 157 → outliers are present → drop the rows that contain them (the next cell keeps only rows with 18 ≤ age < 100)

In [10]:
bank_df = bank_df[bank_df['age'] >= 18]
bank_df = bank_df[bank_df['age'] < 100]

bank_df.shape

(6933, 16)
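
The two filtering steps above can also be written as a single boolean mask. This is a sketch on made-up ages; the bounds 18 and 100 are the ones used in the cell above.

```python
import pandas as pd

toy_df = pd.DataFrame({"age": [2, 17, 18, 45, 99, 100, 157]})

# Combine both conditions with & (each side needs its own parentheses).
mask = (toy_df["age"] >= 18) & (toy_df["age"] < 100)
filtered = toy_df[mask]

print(filtered["age"].tolist())  # [18, 45, 99]
print(int((~mask).sum()))        # 4 rows would be removed
```

Counting `~mask` before filtering is a cheap way to confirm how many rows a rule removes.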

## 2.5 Converting Strings to Numbers

- Machine learning algorithms mostly operate on numeric data → string data has to be converted to numbers

In [11]:
# Convert columns that take two values (yes, no) to the numbers 1 and 0
# Integers are used rather than the strings '1' and '0' so that the values are actually numeric
bank_df = bank_df.replace({'yes': 1, 'no': 0})

bank_df.head()

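|  | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 0 | 2143 | 1 | 0 | unknown | 5 | may | 261 | 1 | -1 | 0 | 0 |
| 1 | 36 | technician | single | secondary | 0 | 265 | 1 | 1 | unknown | 5 | may | 348 | 1 | -1 | 0 | 0 |
| 2 | 25 | blue-collar | married | secondary | 0 | -7 | 1 | 0 | unknown | 5 | may | 365 | 1 | -1 | 0 | 0 |
| 3 | 53 | technician | married | secondary | 0 | -3 | 0 | 0 | unknown | 5 | may | 1666 | 1 | -1 | 0 | 0 |
| 4 | 24 | technician | single | secondary | 0 | -103 | 1 | 1 | unknown | 5 | may | 145 | 1 | -1 | 0 | 0 |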
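
Calling `replace` on the whole frame rewrites every cell equal to `yes` or `no`, whichever column it is in. One alternative is to map only the binary columns. The sketch below assumes a hypothetical list `binary_cols`, and the `"no"` placed in `job` is a contrived value that exists only to show it is left untouched.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "default": ["no", "no", "yes"],
    "housing": ["yes", "no", "yes"],
    "job": ["management", "technician", "no"],  # contrived "no" for the demo
})

binary_cols = ["default", "housing"]  # assumed list of yes/no columns
yes_no = {"yes": 1, "no": 0}

# Map each binary column separately; other columns are not touched.
for col in binary_cols:
    toy_df[col] = toy_df[col].map(yes_no)

print(toy_df["default"].tolist())  # [0, 0, 1]
print(toy_df["housing"].tolist())  # [1, 0, 1]
print(toy_df["job"].tolist())      # ['management', 'technician', 'no']
```

Note that `map` turns any value missing from the dictionary into NaN, so it also exposes unexpected categories.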


In [12]:
# Convert a column with many distinct values → one-hot encoding
bank_df_job = pd.get_dummies(bank_df['job'])
bank_df_job.head()

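|  | admin. | blue-collar | entrepreneur | housemaid | management | retired | self-employed | services | student | technician | unemployed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |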


In [13]:
# Exercise 9) Convert marital, education, contact, and month to dummy variables
bank_df_marital = pd.get_dummies(bank_df['marital'])
bank_df_education = pd.get_dummies(bank_df['education'])
bank_df_contact = pd.get_dummies(bank_df['contact'])
bank_df_month = pd.get_dummies(bank_df['month'])
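
The calls above name each dummy column after the raw category value, so a value such as `unknown` could collide if it appeared in more than one source column. As a hedged alternative, `pd.get_dummies` can encode several columns at once and prefix each new column with its source name. Recent pandas versions return boolean dummies by default, so `dtype=int` is passed here to get 0/1. The toy frame is made up.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "marital": ["married", "single", "married"],
    "contact": ["unknown", "cellular", "unknown"],
})

# Encode both columns in one call; names become "<column>_<value>".
dummies = pd.get_dummies(toy_df, columns=["marital", "contact"], dtype=int)

print(dummies.columns.tolist())
# ['marital_married', 'marital_single', 'contact_cellular', 'contact_unknown']
print(dummies["marital_married"].tolist())  # [1, 0, 1]
```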

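## 2.6 Building the Analysis Dataset

- First, extract only the columns that hold numeric data

In [15]:
# Build the analysis dataset: start from the numeric columns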
tmp1 = bank_df[['age', 'default', 'balance', 'housing', 'loan', 'day', 'duration', 'campaign', 'pdays', 'previous', 'y']]

tmp1.head()

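|  | age | default | balance | housing | loan | day | duration | campaign | pdays | previous | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 0 | 2143 | 1 | 0 | 5 | 261 | 1 | -1 | 0 | 0 |
| 1 | 36 | 0 | 265 | 1 | 1 | 5 | 348 | 1 | -1 | 0 | 0 |
| 2 | 25 | 0 | -7 | 1 | 0 | 5 | 365 | 1 | -1 | 0 | 0 |
| 3 | 53 | 0 | -3 | 0 | 0 | 5 | 1666 | 1 | -1 | 0 | 0 |
| 4 | 24 | 0 | -103 | 1 | 1 | 5 | 145 | 1 | -1 | 0 | 0 |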


- Append the dummy-variable columns to tmp1

In [16]:
# Concatenate the dummy-variable columns
tmp2 = pd.concat([tmp1, bank_df_marital], axis = 1)
tmp3 = pd.concat([tmp2, bank_df_education], axis = 1)
tmp4 = pd.concat([tmp3, bank_df_contact], axis = 1)
bank_df_new = pd.concat([tmp4, bank_df_month], axis = 1)

bank_df_new.head()

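|  | age | default | balance | housing | loan | day | duration | campaign | pdays | previous | y | divorced | married | single | primary | secondary | tertiary | cellular | telephone | unknown | apr | aug | dec | feb | jan | jul | jun | mar | may | nov | oct | sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 0 | 2143 | 1 | 0 | 5 | 261 | 1 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 36 | 0 | 265 | 1 | 1 | 5 | 348 | 1 | -1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 25 | 0 | -7 | 1 | 0 | 5 | 365 | 1 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 53 | 0 | -3 | 0 | 0 | 5 | 1666 | 1 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 24 | 0 | -103 | 1 | 1 | 5 | 145 | 1 | -1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |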
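
Since `pd.concat` accepts a list of any length, the chain of intermediate frames can be collapsed into one call. The sketch below uses small made-up frames with hypothetical names in place of `tmp1` and the dummy frames.

```python
import pandas as pd

numeric_part = pd.DataFrame({"age": [58, 36], "y": [0, 0]})
marital_part = pd.DataFrame({"married": [1, 0], "single": [0, 1]})
month_part = pd.DataFrame({"may": [1, 1]})

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([numeric_part, marital_part, month_part], axis=1)

print(combined.columns.tolist())  # ['age', 'y', 'married', 'single', 'may']
print(combined.shape)             # (2, 5)
```

Because rows are aligned on the index, all pieces should come from the same filtered frame. Otherwise unmatched rows are filled with NaN.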


In [17]:
# Write the result to a CSV file
bank_df_new.to_csv('/content/drive/My Drive/Colab Notebooks/전처리/chap3-4/data/bank-prep.csv', index = False)
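
A quick way to confirm the export is to read the file back and compare its shape and columns. The sketch below writes a toy frame to a temporary directory; the file name `bank-prep-demo.csv` is hypothetical and unrelated to the path used above.

```python
import os
import tempfile

import pandas as pd

# Non-contiguous index, as left behind after rows have been dropped.
toy_df = pd.DataFrame({"age": [58, 36], "y": [0, 0]}, index=[0, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "bank-prep-demo.csv")  # hypothetical name
    toy_df.to_csv(out_path, index=False)  # index=False: do not write the index
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())  # ['age', 'y']
print(reloaded.shape)             # (2, 2)
print(reloaded.index.tolist())    # [0, 1]
```

With `index = False`, the original row labels are not stored, so the reloaded frame gets a fresh 0-based index and no extra index column.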