#### 대규모 데이터(big data)의 경우에는 메모리 등의 문제로 특정한 모형은 사용할 수 없는 경우가 많다. 이 때는

+ 사전 확률분포를 설정할 수 있는 생성 모형
+ 시작 가중치를 설정할 수 있는 모형

등을 이용하고 전체 데이터를 처리 가능한 작은 조각으로 나누어 학습을 시키는 점진적 학습 방법을 사용한다.

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

import matplotlib.pylab as plt
import matplotlib as mpl
import matplotlib.font_manager as fm
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
mpl.rcParams['axes.unicode_minus'] = False
plt.rcParams['font.size'] = 12

path = "/Library/Fonts/NanumGothic.otf"
font_name = fm.FontProperties(fname=path, size=20).get_name()

plt.rc('font', family=font_name)

In [4]:
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

covtype = fetch_covtype()
X_covtype = covtype.data
y_covtype = covtype.target - 1  # 1, 2, 3, 4, 5, 6, 7 -> 0, 1, 2, 3, 4, 5, 6
classes = np.unique(y_covtype)
X_train, X_test, y_train, y_test = train_test_split(X_covtype, y_covtype)

X_train.shape, X_test.shape

((435759, 54), (145253, 54))

In [11]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def read_Xy(start, end):
    # 실무에서는 파일이나 데이터베이스에서 읽어온다.
    idx = list(range(start, min(len(y_train)-1, end)))
    X = X_train[idx, :]
    y = y_train[idx]
    return X, y

## SGD

퍼셉트론 모형은 가중치를 계속 업데이트하므로 
#### 일부 데이터를 사용하여 구한 가중치를 다음 단계에서 초기 가중치로 사용할 수 있다.

In [12]:
%%time

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

model = SGDClassifier(random_state=0)
n_split = 10
n_X = len(y_test) // n_split # // 나눈 후 정수값만 반환
n_epoch = 10

for epoch in range(n_epoch): # 데이터를 10개로 나눠서 10번 적용
    for n in range(n_split):
        X, y = read_Xy(n * n_X, (n+1) * n_X) # 불러들일 데이터의 범위를 설정
        model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train acc={:5.3f} test acc={:5.3f}".format(epoch, accuracy_train, accuracy_test))

epoch=0 train acc=0.694 test acc=0.693
epoch=1 train acc=0.706 test acc=0.705
epoch=2 train acc=0.710 test acc=0.709
epoch=3 train acc=0.710 test acc=0.708
epoch=4 train acc=0.711 test acc=0.710
epoch=5 train acc=0.711 test acc=0.710
epoch=6 train acc=0.712 test acc=0.711
epoch=7 train acc=0.712 test acc=0.711
epoch=8 train acc=0.712 test acc=0.711
epoch=9 train acc=0.712 test acc=0.711
CPU times: user 6.24 s, sys: 266 ms, total: 6.5 s
Wall time: 4.34 s


10개로 분리하여 각 데이터의 조각들에 대한 정확도를 구한 결과이다.

## 나이브베이즈 모형
#### 나이브베이즈 모형과 같은 생성모형은 일부 데이터를 이용하여 구한 확률분포를 사전확률분포로 사용할 수 있다.

In [13]:
%%time

from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

model = BernoulliNB(alpha=0.1)

n_split = 10
n_X = len(y_train) // n_split
for n in range(n_split):
    X, y = read_Xy(n * n_X, (n+1) * n_X)
    model.partial_fit(X, y, classes=classes)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))

n=0 train accuracy=0.631 test accuracy=0.630
n=1 train accuracy=0.634 test accuracy=0.633
n=2 train accuracy=0.633 test accuracy=0.632
n=3 train accuracy=0.633 test accuracy=0.633
n=4 train accuracy=0.633 test accuracy=0.632
n=5 train accuracy=0.633 test accuracy=0.632
n=6 train accuracy=0.632 test accuracy=0.631
n=7 train accuracy=0.633 test accuracy=0.632
n=8 train accuracy=0.632 test accuracy=0.632
n=9 train accuracy=0.632 test accuracy=0.631
CPU times: user 7.17 s, sys: 1.51 s, total: 8.67 s
Wall time: 4.66 s


## 그레디언트 부스팅

#### 그레디언트 부스팅에서는 초기 커미티 멤버로 일부 데이터를 사용하여 학습한 모형을 사용할 수 있다.

In [16]:
%%time

from lightgbm import train, Dataset
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

params = {
    'objective': 'multiclass',
    'num_class': len(classes),
    'learning_rate': 0.2,
    'seed': 0,
}

n_split = 10
n_X = len(y_train) // n_split
num_tree = 10
model = None
for n in range(n_split):
    X, y = read_Xy(n*n_X, (n+1)*n_X)
    model = train(params, init_model=model, train_set=Dataset(X, y), keep_training_booster=False, num_boost_round=num_tree)
    accuracy_train = accuracy_score(y_train, np.argmax(model.predict(X_train), axis=1))
    accuracy_test = accuracy_score(y_test, np.argmax(model.predict(X_test), axis=1))
    print("n={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))

n=0 train accuracy=0.771 test accuracy=0.769
n=1 train accuracy=0.794 test accuracy=0.791
n=2 train accuracy=0.812 test accuracy=0.808
n=3 train accuracy=0.825 test accuracy=0.820
n=4 train accuracy=0.833 test accuracy=0.827
n=5 train accuracy=0.842 test accuracy=0.836
n=6 train accuracy=0.848 test accuracy=0.841
n=7 train accuracy=0.853 test accuracy=0.845
n=8 train accuracy=0.854 test accuracy=0.846
n=9 train accuracy=0.854 test accuracy=0.845
CPU times: user 4min 30s, sys: 2.61 s, total: 4min 32s
Wall time: 1min 21s


## Random Forest

랜덤 포레스트와 같은 앙상블 모형에서는 일부 데이터를 사용한 모형을 개별 분류기로 사용할 수 있다.

In [17]:
%%time

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

n_split = 10
n_X = len(y_train) // n_split
num_tree_ini = 10
num_tree_step = 10
model = RandomForestClassifier(n_estimators=num_tree_ini, warm_start=True)
for n in range(n_split):
    X, y = read_Xy(n*n_X, (n+1)*n_X)
    model.fit(X, y)
    accuracy_train = accuracy_score(y_train, model.predict(X_train))
    accuracy_test = accuracy_score(y_test, model.predict(X_test))
    print("epoch={:d} train accuracy={:5.3f} test accuracy={:5.3f}".format(n, accuracy_train, accuracy_test))
    
    model.n_estimators += num_tree_step  # 

epoch=0 train accuracy=0.866 test accuracy=0.851
epoch=1 train accuracy=0.890 test accuracy=0.870
epoch=2 train accuracy=0.898 test accuracy=0.877
epoch=3 train accuracy=0.903 test accuracy=0.882
epoch=4 train accuracy=0.905 test accuracy=0.884
epoch=5 train accuracy=0.906 test accuracy=0.885
epoch=6 train accuracy=0.906 test accuracy=0.885
epoch=7 train accuracy=0.906 test accuracy=0.885
epoch=8 train accuracy=0.907 test accuracy=0.886
epoch=9 train accuracy=0.908 test accuracy=0.887
CPU times: user 2min 44s, sys: 6.56 s, total: 2min 51s
Wall time: 1min 35s


가장 정확도가 높게 나타난다.