# Basic Library User Guide, Part 3 - Scikit-learn

## 1. What is Scikit-learn?

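- A widely used machine learning library among data analysts working in Python


- Commonly used to solve classification, regression, and clustering problems


- Its strengths are an intuitive, consistent API and a wide range of supporting modules


- Covers a broad set of classical machine learning functionality in a single package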

In [1]:
# Import the scikit-learn package
import sklearn
import pandas as pd

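## 2. Main Modules

Building a machine learning model generally follows three main steps:
1) Feature processing
2) Training the ML algorithm and making predictions
3) Model evaluation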


![image.png](attachment:image.png)
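
The three steps above can be sketched end to end as follows. This is only a minimal illustration, assuming the iris dataset, a StandardScaler for feature processing, and a DecisionTreeClassifier as the model; none of these choices come from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load features and labels, then hold out 20% of the rows for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11
)

# 1) Feature processing: fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2) Train the algorithm and predict (the model choice is an assumption)
model = DecisionTreeClassifier(random_state=11)
model.fit(X_train_scaled, y_train)
pred = model.predict(X_test_scaled)

# 3) Evaluate the model on the held-out data
accuracy = accuracy_score(y_test, pred)
print(f"accuracy: {accuracy:.3f}")
```

Fitting the scaler on the training split only keeps information from the test split out of the preprocessing step.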

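### 1) sklearn.datasets
- datasets: a module that collects the sample datasets scikit-learn provides out of the box


    - load_ family: datasets bundled inside the package
    - fetch_ family: datasets too large to bundle, so they are downloaded from the internet; an internet connection is required the first time they are used (after which they are cached locally)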

In [4]:
# Load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

### 2) sklearn.preprocessing

In [5]:
df = pd.read_csv('./data/train.csv')
df

Unnamed: 0,index,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
0,0,5,5.6,0.695,0.06,6.8,0.042,9.0,84.0,0.99432,3.44,0.44,10.2,white
1,1,5,8.8,0.610,0.14,2.4,0.067,10.0,42.0,0.99690,3.19,0.59,9.5,red
2,2,5,7.9,0.210,0.39,2.0,0.057,21.0,138.0,0.99176,3.05,0.52,10.9,white
3,3,6,7.0,0.210,0.31,6.0,0.046,29.0,108.0,0.99390,3.26,0.50,10.8,white
4,4,6,7.8,0.400,0.26,9.5,0.059,32.0,178.0,0.99550,3.04,0.43,10.9,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5492,5492,5,7.7,0.150,0.29,1.3,0.029,10.0,64.0,0.99320,3.35,0.39,10.1,white
5493,5493,6,6.3,0.180,0.36,1.2,0.034,26.0,111.0,0.99074,3.16,0.51,11.0,white
5494,5494,7,7.8,0.150,0.34,1.1,0.035,31.0,93.0,0.99096,3.07,0.72,11.3,white
5495,5495,5,6.6,0.410,0.31,1.6,0.042,18.0,101.0,0.99195,3.13,0.41,10.5,white


#### [ Formatting (drop) ]
- Remove attributes (columns) that data exploration showed to be unnecessary

In [10]:
# drop() returns a new DataFrame; df itself is unchanged unless the result is assigned
df.drop(['index'], axis=1)

Unnamed: 0,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
0,5,5.6,0.695,0.06,6.8,0.042,9.0,84.0,0.99432,3.44,0.44,10.2,white
1,5,8.8,0.610,0.14,2.4,0.067,10.0,42.0,0.99690,3.19,0.59,9.5,red
2,5,7.9,0.210,0.39,2.0,0.057,21.0,138.0,0.99176,3.05,0.52,10.9,white
3,6,7.0,0.210,0.31,6.0,0.046,29.0,108.0,0.99390,3.26,0.50,10.8,white
4,6,7.8,0.400,0.26,9.5,0.059,32.0,178.0,0.99550,3.04,0.43,10.9,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5492,5,7.7,0.150,0.29,1.3,0.029,10.0,64.0,0.99320,3.35,0.39,10.1,white
5493,6,6.3,0.180,0.36,1.2,0.034,26.0,111.0,0.99074,3.16,0.51,11.0,white
5494,7,7.8,0.150,0.34,1.1,0.035,31.0,93.0,0.99096,3.07,0.72,11.3,white
5495,5,6.6,0.410,0.31,1.6,0.042,18.0,101.0,0.99195,3.13,0.41,10.5,white


#### [ Handling Missing Values (null) ]
- Most scikit-learn algorithms do not accept null values, so they must be handled before training

In [11]:
# Assign the result back; inplace=True on a column selection may not update df in recent pandas
df['density'] = df['density'].fillna(df['density'].mean())
df

Unnamed: 0,index,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
0,0,5,5.6,0.695,0.06,6.8,0.042,9.0,84.0,0.99432,3.44,0.44,10.2,white
1,1,5,8.8,0.610,0.14,2.4,0.067,10.0,42.0,0.99690,3.19,0.59,9.5,red
2,2,5,7.9,0.210,0.39,2.0,0.057,21.0,138.0,0.99176,3.05,0.52,10.9,white
3,3,6,7.0,0.210,0.31,6.0,0.046,29.0,108.0,0.99390,3.26,0.50,10.8,white
4,4,6,7.8,0.400,0.26,9.5,0.059,32.0,178.0,0.99550,3.04,0.43,10.9,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5492,5492,5,7.7,0.150,0.29,1.3,0.029,10.0,64.0,0.99320,3.35,0.39,10.1,white
5493,5493,6,6.3,0.180,0.36,1.2,0.034,26.0,111.0,0.99074,3.16,0.51,11.0,white
5494,5494,7,7.8,0.150,0.34,1.1,0.035,31.0,93.0,0.99096,3.07,0.72,11.3,white
5495,5495,5,6.6,0.410,0.31,1.6,0.042,18.0,101.0,0.99195,3.13,0.41,10.5,white
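
The same mean imputation can also be done inside scikit-learn itself. The sketch below uses SimpleImputer from sklearn.impute on a small hypothetical DataFrame (the values are made up for illustration and are not taken from train.csv).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample with one missing value per column
sample = pd.DataFrame({
    "density": [0.994, np.nan, 0.992],
    "pH": [3.44, 3.19, np.nan],
})

# strategy="mean" replaces each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(sample), columns=sample.columns)
print(filled)
```

One reason to prefer an imputer object is that the column means learned by fit() on the training data can be reused on test data with transform().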


#### [ Label Encoding ]
- An encoding method that converts string values to numbers


- Each distinct string value is mapped to one integer

In [9]:
import numpy as np
from sklearn import preprocessing

In [12]:
# Sample input data
input_labels = ['red', 'black', 'red', 'green']

In [17]:
# Create a label encoder and fit it on the labels defined above
encoder = preprocessing.LabelEncoder()     # LabelEncoder(): class that implements label encoding
encoder.fit(input_labels)                  # fit(): learns the mapping from each string to an integer
labels = encoder.transform(input_labels)   # transform(): applies the learned mapping to the data
labels

array([2, 0, 2, 1])
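
To see which integer was assigned to which string, the fitted encoder can be inspected. The sketch below repeats the same sample labels and uses the classes_ attribute and inverse_transform(), which are part of LabelEncoder; the variable name decoded is only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

input_labels = ['red', 'black', 'red', 'green']

encoder = LabelEncoder()
labels = encoder.fit_transform(input_labels)  # fit and transform in one call

# classes_ holds the sorted unique labels; the position is the assigned integer
print(encoder.classes_)

# inverse_transform() maps the integers back to the original strings
decoded = encoder.inverse_transform(labels)
print(decoded)
```

Because the classes are sorted, black becomes 0, green becomes 1, and red becomes 2, which matches the output above.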

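#### [ One-Hot Encoding ]
- An encoding method that addresses the problem with label encoding: the integer codes imply an order or magnitude that the categories do not actually have


- OneHotEncoder expects 2-D input, so the 1-D label-encoded array is reshaped first (older scikit-learn versions also required label encoding beforehand; versions from 0.20 onward accept string categories directly)


- Sets 1 only in the column corresponding to each unique value and 0 in all other columns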


![image.png](attachment:image.png)

In [18]:
# Reshape the label-encoded array to 2-D
labels = labels.reshape(-1, 1)

In [19]:
# One-hot encoding
encoder2 = preprocessing.OneHotEncoder()
encoder2.fit(labels)
labels2 = encoder2.transform(labels)
labels2

<4x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>
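
The result above is a sparse matrix, so the individual 0/1 values are not displayed. A minimal sketch for viewing them, assuming the same four sample labels and passing the strings directly to the encoder (supported since scikit-learn 0.20):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder needs 2-D input: one row per sample, one column per feature
colors = np.array(['red', 'black', 'red', 'green']).reshape(-1, 1)

onehot = OneHotEncoder()
encoded = onehot.fit_transform(colors)  # sparse matrix by default

# toarray() converts the sparse matrix into a dense NumPy array
dense = encoded.toarray()
print(onehot.categories_)  # column order: black, green, red
print(dense)
```

Each row contains a single 1 in the column of its category, for example red maps to [0, 0, 1].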

#### [ train_test_split ]
- Splits the data into train data and test data

In [20]:
from sklearn.datasets import load_iris
iris = load_iris()

In [23]:
# Store the feature-only data
iris_data = iris.data

# Store the label (target value) data
iris_label = iris.target

print(iris_label)
print(iris.target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']


In [24]:
from sklearn.model_selection import train_test_split

In [31]:
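# X_train, X_test: feature data split into training and test sets
# y_train, y_test: label (answer) data split into training and test sets
# test_size: fraction of the data assigned to the test set (0.2 = 20%)
# random_state: seed for the shuffle; if omitted, a different split is produced on every run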

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)
X_test

array([[6.8, 3. , 5.5, 2.1],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.8, 5.1, 1.5],
       [6.3, 3.3, 4.7, 1.6],
       [6.4, 2.7, 5.3, 1.9],
       [4.9, 3.1, 1.5, 0.1],
       [6.7, 3.1, 4.4, 1.4],
       [5.7, 4.4, 1.5, 0.4],
       [4.8, 3.1, 1.6, 0.2],
       [6.1, 2.9, 4.7, 1.4],
       [6. , 2.2, 5. , 1.5],
       [6. , 2.2, 4. , 1. ],
       [5.4, 3. , 4.5, 1.5],
       [5.7, 2.5, 5. , 2. ],
       [6.9, 3.1, 5.4, 2.1],
       [4.5, 2.3, 1.3, 0.3],
       [6.3, 2.9, 5.6, 1.8],
       [5.6, 3. , 4.5, 1.5],
       [6.5, 3.2, 5.1, 2. ],
       [5.8, 2.7, 5.1, 1.9],
       [5.6, 2.5, 3.9, 1.1],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.6, 1. , 0.2],
       [6.4, 3.2, 4.5, 1.5],
       [4.8, 3. , 1.4, 0.1],
       [4.8, 3.4, 1.6, 0.2],
       [5.9, 3. , 5.1, 1.8],
       [6.6, 3. , 4.4, 1.4],
       [5.4, 3.9, 1.3, 0.4],
       [6. , 3.4, 4.5, 1.6]])
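
A quick way to check a split is to look at the shapes and the class counts. The sketch below additionally passes stratify, a train_test_split parameter not used in the cell above, so that the test set keeps the same class proportions as the full dataset; treat it as an optional variation rather than the notebook's method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target keeps the class ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11, stratify=iris.target
)

print(X_train.shape, X_test.shape)  # 120 training rows and 30 test rows
print(np.bincount(y_test))          # number of test samples per class
```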

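#### [ K-Fold ]
- Splits the training data into several subsets (folds) and runs repeated practice tests (validation rounds) on them


1) Each fold is one round of training; training and validation are repeated k times, once per fold

2) In each round, k-1 subsets are used as training data and the remaining 1 subset as validation data

3) The average of the per-fold results is the final evaluation result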
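
The procedure above can be sketched with the KFold class from sklearn.model_selection. The choice of k=5, the iris data, and the DecisionTreeClassifier are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True because the iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=11)

scores = []
for train_idx, val_idx in kfold.split(X):
    # k-1 folds are used for training, the remaining fold for validation
    model = DecisionTreeClassifier(random_state=11)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    scores.append(accuracy_score(y[val_idx], pred))

# The mean of the per-fold scores is the final cross-validation result
mean_score = float(np.mean(scores))
print(np.round(scores, 3), round(mean_score, 3))
```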

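#### [ Stratified KFold ]
- Used when the label data is imbalanced


- When creating folds, it takes the label distribution of the original data into account so that each train/test split keeps the same class proportions as the original data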

![image.png](attachment:image.png)
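
As a small sketch of this behaviour, the class counts in each validation fold can be printed with StratifiedKFold. The iris data and n_splits=3 are assumptions for the example; note that split() needs the labels y in addition to X.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=3)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each class landed in this validation fold
    fold_counts.append(np.bincount(y[val_idx]).tolist())

print(fold_counts)
```

Each fold receives roughly one third of every class. A plain KFold without shuffling would instead put a single class in each validation fold here, because the iris rows are ordered by class.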

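#### [ cross_val_score ]
- A function provided in the sklearn.model_selection module


- An API that makes cross-validation more convenient


- Training and predicting with KFold by hand requires:


    1) Setting up the folds
    2) Looping to extract the training and test indices for each fold
    3) Repeatedly training and predicting, then returning the performance scores


- cross_val_score() performs this whole sequence in a single call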
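
A minimal usage sketch, assuming the iris data, a DecisionTreeClassifier, accuracy as the metric, and 3 folds (all chosen here for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=11)

# cv=3 runs 3-fold cross-validation and returns one score per fold
scores = cross_val_score(model, X, y, scoring="accuracy", cv=3)

print(scores)         # accuracy of each fold
print(scores.mean())  # average accuracy across the folds
```

When the estimator is a classifier and cv is an integer, cross_val_score() uses stratified folds internally, so the class ratios are preserved without extra code.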