# TensorFlow CNN-Ensemble Learning with Big-MNIST 

by [Sungchi](http://facebook.com/sungchi)

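## Process Summary

1. Use infimnist to generate 1.1 million transformed variants of the MNIST training images (the largest count that did not cause a memory error).
2. Merge the training data and validation data loaded through the TensorFlow API.
3. Draw several random training sets from the merged data.
4. Build 50 networks for the ensemble and train each one for 20,000 iterations.
5. Record each trained network's predictions on the test set.
6. Combine the test-set predictions across the ensemble and check the final accuracy.

In one line: generate 1.1 million MNIST training images, train many networks on random training sets, and combine their predictions to reduce the error.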

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
from sklearn.metrics import confusion_matrix
import time
from datetime import timedelta
import math
import os
import prettytensor as pt

Set the random seeds for reproducibility.

In [2]:
tf.set_random_seed(777)
np.random.seed(777)

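## Loading the Data

Use [infimnist](http://leon.bottou.org/projects/infimnist) to inflate the training set to 1.1 million images.
(2 million images caused a memory error, so the count was reduced.)

*How to generate the infimnist data*
1. Download the infimnist source and build it: [source.tar.gz](http://leon.bottou.org/_media/projects/infimnist.tar.gz) (350 MB)
2. Generate the dataset from the console: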
```
$ infimnist lab 10000 1109999 > mnist1m-labels-idx1-ubyte
$ infimnist pat 10000 1109999 > mnist1m-patterns-idx3-ubyte
```
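3. Rename the files to the names the TensorFlow loader expects, then compress them with gzip:
```
mv mnist1m-labels-idx1-ubyte train-labels-idx1-ubyte
mv mnist1m-patterns-idx3-ubyte train-images-idx3-ubyte
gzip train-labels-idx1-ubyte train-images-idx3-ubyte
```
4. Do not generate a test set separately. (If it is missing, the TensorFlow loader downloads the standard MNIST test set automatically.)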
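As a sanity check on the generated files, the sketch below reads the header of a gzipped IDX file and compares the sample count with the index range passed to infimnist. It assumes the standard IDX layout used by MNIST (big-endian 32-bit fields, magic number 2049 for label files and 2051 for image files). The helper name `read_idx_header` is hypothetical and is not part of infimnist or TensorFlow.

```python
import gzip
import struct

# Index range passed to infimnist above: 10000..1109999 inclusive.
FIRST_INDEX, LAST_INDEX = 10000, 1109999
EXPECTED_COUNT = LAST_INDEX - FIRST_INDEX + 1  # 1,100,000 samples


def read_idx_header(path):
    """Return (magic, count) from a gzipped IDX file (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        # The first 8 bytes are two big-endian unsigned 32-bit integers:
        # the magic number and the number of items in the file.
        magic, count = struct.unpack(">II", f.read(8))
    return magic, count
```

Calling `read_idx_header("1mdata/train-labels-idx1-ubyte.gz")` should then report a count equal to `EXPECTED_COUNT` if the files were generated and renamed as described.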

In [3]:
data = input_data.read_data_sets('1mdata', one_hot=True)
print("Size of:")
print("- Training-set:\t\t{}".format(len(data.train.labels)))
print("- Test-set:\t\t{}".format(len(data.test.labels)))
print("- Validation-set:\t{}".format(len(data.validation.labels)))

Extracting 1mdata\train-images-idx3-ubyte.gz
Extracting 1mdata\train-labels-idx1-ubyte.gz
Extracting 1mdata\t10k-images-idx3-ubyte.gz
Extracting 1mdata\t10k-labels-idx1-ubyte.gz
Size of:
- Training-set:		1095000
- Test-set:		10000
- Validation-set:	5000


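### Preparing the Data

Convert the one-hot encoded labels into arrays of integer class indices, and
merge the images loaded through the TensorFlow API into one pool, without distinguishing between train and validation.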

In [4]:
data.test.cls = np.argmax(data.test.labels, axis=1)
data.validation.cls = np.argmax(data.validation.labels, axis=1)

combined_images = np.concatenate([data.train.images, data.validation.images], axis=0)
combined_labels = np.concatenate([data.train.labels, data.validation.labels], axis=0)

print("images shape:", combined_images.shape)
print("label shape:", combined_labels.shape)

combined_size = len(combined_images)
train_size = int(0.8 * combined_size)
validation_size = combined_size - train_size
print("\nDataset sizes\n- combined:\t\t{0}\n- train:\t\t{1}\n- validation:\t\t{2}".format(combined_size, train_size, validation_size))

images shape: (1100000, 784)
label shape: (1100000, 10)

Dataset sizes
- combined:		1100000
- train:		880000
- validation:		220000
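The two operations in this cell can be checked on a toy example. The sketch below is illustrative only: the three toy labels are made up, while the sizes are the ones printed above.

```python
import numpy as np

# Three toy one-hot labels for classes 3, 0 and 9 (10 classes, as in MNIST).
labels = np.eye(10, dtype=np.float32)[[3, 0, 9]]

# argmax along axis 1 recovers the integer class index of each row.
cls = np.argmax(labels, axis=1)
print(cls)

# Split sizes used in the notebook: 80% train, the remainder validation.
combined_size = 1095000 + 5000  # training set + validation set from the loader
train_size = int(0.8 * combined_size)
validation_size = combined_size - train_size
print(combined_size, train_size, validation_size)
```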


## Function that draws a random training set from the dataset

In [5]:
def random_training_set():
    idx = np.random.permutation(combined_size)
    idx_train = idx[0:train_size]
    idx_validation = idx[train_size:]
    
    x_train = combined_images[idx_train, :]
    y_train = combined_labels[idx_train, :]
    x_validation = combined_images[idx_validation, :]
    y_validation = combined_labels[idx_validation, :]

    return x_train, y_train, x_validation, y_validation
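Because every call draws a fresh permutation, each network is trained on a different 80% subset, which is one source of diversity between ensemble members. The scaled-down sketch below illustrates the properties of such a split. The function name `random_split` and the sizes 1000 and 800 are stand-ins chosen for the example, not values from the notebook.

```python
import numpy as np


def random_split(n_total, n_train, rng):
    """Return disjoint train/validation index arrays from one permutation."""
    idx = rng.permutation(n_total)
    return idx[:n_train], idx[n_train:]


rng = np.random.RandomState(777)
n_total, n_train = 1000, 800  # stand-ins for 1,100,000 and 880,000

train_a, val_a = random_split(n_total, n_train, rng)
train_b, val_b = random_split(n_total, n_train, rng)

# Within one split, train and validation are disjoint and cover every index.
overlap_within = len(np.intersect1d(train_a, val_a))
covered = len(np.union1d(train_a, val_a))

# Two independent splits are expected to share about 0.8 * 0.8 = 64%
# of all samples in their training parts.
shared = len(np.intersect1d(train_a, train_b))
print(overlap_within, covered, shared)
```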

### Basic parameters of the training data

In [6]:
img_size = 28
img_size_flat = img_size * img_size
img_shape = (img_size, img_size)
num_channels = 1
num_classes = 10

In [7]:
x = tf.placeholder(tf.float32, shape=[None, img_size_flat], name='x')
x_image = tf.reshape(x, [-1, img_size, img_size, num_channels])
y_true = tf.placeholder(tf.float32, shape=[None, 10], name='y_true')
y_true_cls = tf.argmax(y_true, dimension=1)

### Building the network with PrettyTensor
- This architecture reached high accuracy in a short time in earlier tests.
- It is built with PrettyTensor.

In [8]:
x_pretty = pt.wrap(x_image)
with pt.defaults_scope(activation_fn=tf.nn.relu):
    y_pred, loss = x_pretty.\
        conv2d(kernel=3, depth=32, name='layer_conv1').\
        max_pool(kernel=2, stride=2).\
        conv2d(kernel=3, depth=64, name='layer_conv2').\
        max_pool(kernel=2, stride=2).\
        conv2d(kernel=3, depth=128, name='layer_conv3').\
        max_pool(kernel=2, stride=2).\
        flatten().\
        fully_connected(size=100, name='layer_fc1').\
        softmax_classifier(num_classes=num_classes, labels=y_true)
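To see how large the input to `layer_fc1` is, the spatial sizes can be traced by hand. The sketch below assumes that both the convolutions and the pooling layers use SAME padding and that the convolutions use stride 1. This is an assumption about PrettyTensor's defaults that the notebook does not state; with VALID padding the numbers would differ.

```python
import math


def same_pool(size, stride=2):
    """Spatial size after a stride-2 pooling layer with SAME padding."""
    return int(math.ceil(size / float(stride)))


size, shapes = 28, []
for depth in (32, 64, 128):
    # Assumed: a SAME-padded, stride-1 3x3 convolution keeps the spatial
    # size, so only the pooling layer changes it.
    size = same_pool(size)
    shapes.append((size, size, depth))

flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)  # feature-map shapes after each conv + pool stage
print(flat)    # number of inputs to layer_fc1 after flatten()
```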

Adam optimization algorithm

In [9]:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)

In [10]:
y_pred_cls = tf.argmax(y_pred, dimension=1)
correct_prediction = tf.equal(y_pred_cls, y_true_cls)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
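For readers unfamiliar with these graph operations, the following NumPy sketch shows the same quantities on a toy batch: softmax probabilities, a cross-entropy loss, and the argmax-based accuracy. It is a stand-alone illustration with made-up logits. How PrettyTensor reduces the per-example loss over the batch is not checked here; the mean is used as an assumption.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, one_hot):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return float(-np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1)))


def batch_accuracy(probs, one_hot):
    """Fraction of rows whose most probable class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == np.argmax(one_hot, axis=1)))


logits = np.array([[2.0, 0.5, 0.1],   # predicts class 0, label is 0
                   [0.2, 0.1, 3.0],   # predicts class 2, label is 2
                   [1.5, 1.0, 0.3]])  # predicts class 0, label is 1
one_hot = np.eye(3)[[0, 2, 1]]

probs = softmax(logits)
print("loss:", cross_entropy(probs, one_hot))
print("accuracy:", batch_accuracy(probs, one_hot))
```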

### Creating a Saver object to save the TensorFlow session

In [11]:
saver = tf.train.Saver(max_to_keep=100)
save_dir = 'checkpoints_ensemble/'
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

Helper functions that return the save path of each network's checkpoint and of each network's best-accuracy checkpoint (the latter is not used).

In [12]:
def get_save_path(net_number):
    return save_dir + 'network' + str(net_number)
def get_save_best_path(net_number):
    return save_dir + 'network_best_' + str(net_number)

## Starting the TensorFlow session

In [13]:
session = tf.Session()
def init_variables():
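    # tf.initialize_all_variables() is deprecated; use the same initializer
    # that the training loop uses.
    session.run(tf.global_variables_initializer())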

### Function that draws a random training batch

In [14]:
train_batch_size = 128
def random_batch(x_train, y_train):
    num_images = len(x_train)
    idx = np.random.choice(num_images,
                           size=train_batch_size,
                           replace=False)
    x_batch = x_train[idx, :]  
    y_batch = y_train[idx, :]  
    return x_batch, y_batch
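One possible optimization, not used in the notebook: the legacy `np.random.choice(..., replace=False)` permutes the whole index range on every call, which is costly when the range has 880,000 entries and only 128 are needed. The sketch below uses NumPy's newer `Generator.choice` instead, which is generally much cheaper for small samples. It requires NumPy 1.17 or later, the name `random_batch_fast` is hypothetical, and switching generators changes the random stream, so the runs shown here would not be reproduced exactly.

```python
import numpy as np

rng = np.random.default_rng(777)  # requires NumPy >= 1.17


def random_batch_fast(x_train, y_train, batch_size=128):
    """Draw a batch of distinct rows using the Generator API."""
    idx = rng.choice(len(x_train), size=batch_size, replace=False)
    return x_train[idx], y_train[idx]


# Toy stand-ins for the real arrays (880000 x 784 and 880000 x 10).
x_demo = np.arange(1000 * 4, dtype=np.float32).reshape(1000, 4)
y_demo = np.eye(10, dtype=np.float32)[np.arange(1000) % 10]

x_batch, y_batch = random_batch_fast(x_demo, y_demo)
print(x_batch.shape, y_batch.shape)
```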

### Preparing for training and accuracy measurement

In [15]:
batch_size = 256

def predict_cls(images, labels, cls_true):
    num_images = len(images)
    # np.int was removed in NumPy 1.24; the built-in int is equivalent here.
    cls_pred = np.zeros(shape=num_images, dtype=int)
    i = 0
    while i < num_images:
        j = min(i + batch_size, num_images)
        feed_dict = {x: images[i:j, :],
                     y_true: labels[i:j, :]}

        cls_pred[i:j] = session.run(y_pred_cls, feed_dict=feed_dict)
        i = j
    correct = (cls_true == cls_pred)
    return correct, cls_pred


def cls_accuracy(correct):
    correct_sum = correct.sum()
    acc = float(correct_sum) / len(correct)
    return acc, correct_sum

In [16]:
def optimize(neural,num_iterations, x_train, y_train):
    start_time = time.time()
    for i in range(num_iterations):
        x_batch, y_true_batch = random_batch(x_train, y_train)
        feed_dict_train = {x: x_batch,
                           y_true: y_true_batch}

        session.run(optimizer, feed_dict=feed_dict_train)

        if i % 1000 == 0:
            acc = session.run(accuracy, feed_dict=feed_dict_train)
            msg = "Iteration: {0:>6}, Training-batch accuracy: {1:>6.2%}"
            print(msg.format(i + 1, acc))
    end_time = time.time()
    time_dif = end_time - start_time
    print("Elapsed time: " + str(timedelta(seconds=int(round(time_dif)))))
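To put the iteration count in perspective, the short calculation below converts the 20,000 iterations per network into approximate passes over one random training set. The values are taken from the notebook. Batches are drawn independently at random, so "epoch" is only an approximation here.

```python
train_batch_size = 128   # from the notebook
num_iterations = 20000   # iterations per network, from the notebook
train_size = 880000      # 80% of the 1,100,000 combined images

samples_drawn = train_batch_size * num_iterations
epochs = samples_drawn / float(train_size)
print("samples drawn per network:", samples_drawn)
print("approximate epochs per network: {:.2f}".format(epochs))
```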

In [17]:
batch_size = 256

def predict_labels(images):
    num_images = len(images)
    # np.float was removed in NumPy 1.24; the built-in float is equivalent.
    pred_labels = np.zeros(shape=(num_images, num_classes),
                           dtype=float)
    i = 0
    while i < num_images:
        j = min(i + batch_size, num_images)
        feed_dict = {x: images[i:j, :]}
        pred_labels[i:j] = session.run(y_pred, feed_dict=feed_dict)
        i = j
    return pred_labels

def correct_prediction(images, labels, cls_true):
    pred_labels = predict_labels(images=images)
    cls_pred = np.argmax(pred_labels, axis=1)
    correct = (cls_true == cls_pred)
    return correct

def test_correct():
    return correct_prediction(images = data.test.images,
                              labels = data.test.labels,
                              cls_true = data.test.cls)

def validation_correct():
    return correct_prediction(images = data.validation.images,
                              labels = data.validation.labels,
                              cls_true = data.validation.cls)

def classification_accuracy(correct):
    return correct.mean()

def test_accuracy():
    correct = test_correct()
    return classification_accuracy(correct)

def validation_accuracy():
    correct = validation_correct()
    return classification_accuracy(correct)
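The batching logic inside `predict_labels` can be isolated as a small generator. The sketch below is illustrative, and the name `batch_ranges` is hypothetical. It shows how the 10,000 test images are covered with `batch_size = 256`, including the shorter final batch.

```python
def batch_ranges(num_items, batch_size):
    """Yield (start, stop) index pairs that cover range(num_items) in order."""
    start = 0
    while start < num_items:
        stop = min(start + batch_size, num_items)
        yield start, stop
        start = stop


ranges = list(batch_ranges(10000, 256))  # 10,000 test images, batches of 256
print(len(ranges))   # number of session.run calls needed
print(ranges[-1])    # the final, shorter batch
```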

### Training multiple networks

In [18]:
num_networks = 50
num_iterations = 20000
for i in range(num_networks):
    print("Neural network: {0}".format(i))
    x_train, y_train, x_validation, y_validation = random_training_set()
    session.run(tf.global_variables_initializer())
    optimize(neural=i,num_iterations=num_iterations,
             x_train=x_train,
             y_train=y_train)
    saver.save(sess=session, save_path=get_save_path(i))
    print()

Neural network: 0
Iteration:      1, Training-batch accuracy: 10.94%
Iteration:   1001, Training-batch accuracy: 96.88%
Iteration:   2001, Training-batch accuracy: 97.66%
Iteration:   3001, Training-batch accuracy: 97.66%
Iteration:   4001, Training-batch accuracy: 99.22%
Iteration:   5001, Training-batch accuracy: 99.22%
Iteration:   6001, Training-batch accuracy: 98.44%
Iteration:   7001, Training-batch accuracy: 99.22%
Iteration:   8001, Training-batch accuracy: 100.00%
Iteration:   9001, Training-batch accuracy: 99.22%
Iteration:  10001, Training-batch accuracy: 98.44%
Iteration:  11001, Training-batch accuracy: 99.22%
Iteration:  12001, Training-batch accuracy: 98.44%
Iteration:  13001, Training-batch accuracy: 100.00%
Iteration:  14001, Training-batch accuracy: 100.00%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 99

Iteration:      1, Training-batch accuracy: 17.97%
Iteration:   1001, Training-batch accuracy: 96.88%
Iteration:   2001, Training-batch accuracy: 99.22%
Iteration:   3001, Training-batch accuracy: 99.22%
Iteration:   4001, Training-batch accuracy: 98.44%
Iteration:   5001, Training-batch accuracy: 97.66%
Iteration:   6001, Training-batch accuracy: 100.00%
Iteration:   7001, Training-batch accuracy: 99.22%
Iteration:   8001, Training-batch accuracy: 99.22%
Iteration:   9001, Training-batch accuracy: 99.22%
Iteration:  10001, Training-batch accuracy: 100.00%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 100.00%
Iteration:  14001, Training-batch accuracy: 99.22%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 99.22%
Elapsed time: 0:13

Iteration:      1, Training-batch accuracy: 16.41%
Iteration:   1001, Training-batch accuracy: 96.09%
Iteration:   2001, Training-batch accuracy: 97.66%
Iteration:   3001, Training-batch accuracy: 97.66%
Iteration:   4001, Training-batch accuracy: 98.44%
Iteration:   5001, Training-batch accuracy: 96.88%
Iteration:   6001, Training-batch accuracy: 99.22%
Iteration:   7001, Training-batch accuracy: 100.00%
Iteration:   8001, Training-batch accuracy: 99.22%
Iteration:   9001, Training-batch accuracy: 100.00%
Iteration:  10001, Training-batch accuracy: 100.00%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 99.22%
Iteration:  14001, Training-batch accuracy: 100.00%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 100.00%
Elapsed time: 0:

Iteration:      1, Training-batch accuracy: 16.41%
Iteration:   1001, Training-batch accuracy: 99.22%
Iteration:   2001, Training-batch accuracy: 98.44%
Iteration:   3001, Training-batch accuracy: 100.00%
Iteration:   4001, Training-batch accuracy: 96.88%
Iteration:   5001, Training-batch accuracy: 99.22%
Iteration:   6001, Training-batch accuracy: 100.00%
Iteration:   7001, Training-batch accuracy: 98.44%
Iteration:   8001, Training-batch accuracy: 99.22%
Iteration:   9001, Training-batch accuracy: 100.00%
Iteration:  10001, Training-batch accuracy: 98.44%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 100.00%
Iteration:  14001, Training-batch accuracy: 100.00%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 99.22%
Elapsed time: 0:

Iteration:   1001, Training-batch accuracy: 99.22%
Iteration:   2001, Training-batch accuracy: 99.22%
Iteration:   3001, Training-batch accuracy: 99.22%
Iteration:   4001, Training-batch accuracy: 99.22%
Iteration:   5001, Training-batch accuracy: 100.00%
Iteration:   6001, Training-batch accuracy: 99.22%
Iteration:   7001, Training-batch accuracy: 99.22%
Iteration:   8001, Training-batch accuracy: 100.00%
Iteration:   9001, Training-batch accuracy: 100.00%
Iteration:  10001, Training-batch accuracy: 100.00%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 100.00%
Iteration:  14001, Training-batch accuracy: 100.00%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 100.00%
Elapsed time: 0:13:30

Neural network: 33
Iteration:      1,

Iteration:   1001, Training-batch accuracy: 97.66%
Iteration:   2001, Training-batch accuracy: 100.00%
Iteration:   3001, Training-batch accuracy: 100.00%
Iteration:   4001, Training-batch accuracy: 99.22%
Iteration:   5001, Training-batch accuracy: 98.44%
Iteration:   6001, Training-batch accuracy: 98.44%
Iteration:   7001, Training-batch accuracy: 99.22%
Iteration:   8001, Training-batch accuracy: 98.44%
Iteration:   9001, Training-batch accuracy: 100.00%
Iteration:  10001, Training-batch accuracy: 100.00%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 99.22%
Iteration:  14001, Training-batch accuracy: 98.44%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 99.22%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 100.00%
Elapsed time: 0:13:54

Neural network: 41
Iteration:      1, Tr

Iteration:   2001, Training-batch accuracy: 98.44%
Iteration:   3001, Training-batch accuracy: 98.44%
Iteration:   4001, Training-batch accuracy: 98.44%
Iteration:   5001, Training-batch accuracy: 96.09%
Iteration:   6001, Training-batch accuracy: 100.00%
Iteration:   7001, Training-batch accuracy: 100.00%
Iteration:   8001, Training-batch accuracy: 100.00%
Iteration:   9001, Training-batch accuracy: 97.66%
Iteration:  10001, Training-batch accuracy: 96.88%
Iteration:  11001, Training-batch accuracy: 100.00%
Iteration:  12001, Training-batch accuracy: 100.00%
Iteration:  13001, Training-batch accuracy: 100.00%
Iteration:  14001, Training-batch accuracy: 100.00%
Iteration:  15001, Training-batch accuracy: 100.00%
Iteration:  16001, Training-batch accuracy: 100.00%
Iteration:  17001, Training-batch accuracy: 100.00%
Iteration:  18001, Training-batch accuracy: 100.00%
Iteration:  19001, Training-batch accuracy: 100.00%
Elapsed time: 0:13:43

Neural network: 49
Iteration:      1, Training-batch accuracy:  8.59%
Iteration:   1001,

## Checking the accuracy of each network

In [20]:
def ensemble_predictions():
    pred_labels = []
    test_accuracies = []
    val_accuracies = []

    for i in range(num_networks):
        saver.restore(sess=session, save_path=get_save_path(i))
        test_acc = test_accuracy()
        test_accuracies.append(test_acc)
        val_acc = validation_accuracy()
        val_accuracies.append(val_acc)

        msg = "Network: {0}, validation-set accuracy: {1:.4f}, test-set accuracy: {2:.4f}"
        print(msg.format(i, val_acc, test_acc))
        pred = predict_labels(images=data.test.images)
        pred_labels.append(pred)

    return np.array(pred_labels), \
           np.array(test_accuracies), \
           np.array(val_accuracies)

In [21]:
pred_labels, test_accuracies, val_accuracies = ensemble_predictions()

Network: 0, validation-set accuracy: 0.9980, test-set accuracy: 0.9926
Network: 1, validation-set accuracy: 0.9976, test-set accuracy: 0.9938
Network: 2, validation-set accuracy: 0.9990, test-set accuracy: 0.9922
Network: 3, validation-set accuracy: 0.9988, test-set accuracy: 0.9939
Network: 4, validation-set accuracy: 0.9966, test-set accuracy: 0.9913
Network: 5, validation-set accuracy: 0.9976, test-set accuracy: 0.9923
Network: 6, validation-set accuracy: 0.9980, test-set accuracy: 0.9939
Network: 7, validation-set accuracy: 0.9982, test-set accuracy: 0.9935
Network: 8, validation-set accuracy: 0.9984, test-set accuracy: 0.9929
Network: 9, validation-set accuracy: 0.9986, test-set accuracy: 0.9932
Network: 10, validation-set accuracy: 0.9978, test-set accuracy: 0.9935
Network: 11, validation-set accuracy: 0.9988, test-set accuracy: 0.9935
Network: 12, validation-set accuracy: 0.9992, test-set accuracy: 0.9932
Network: 13, validation-set accuracy: 0.9984, test-set accuracy: 0.9923
Network: 14, validation-set accuracy: 0.9986, test-set accuracy: 0.9934
Network: 15, validation-set accuracy: 0.9964, test-set accuracy: 0.9926
Network: 16, validation-set accuracy: 0.

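### Checking the ensemble result

Ensemble prediction: average the class probabilities predicted by all networks, then take the most probable class for each test image.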

In [23]:
ensemble_pred_labels = np.mean(pred_labels, axis=0)
ensemble_cls_pred = np.argmax(ensemble_pred_labels, axis=1)
ensemble_correct = (ensemble_cls_pred == data.test.cls)
ensemble_incorrect = np.logical_not(ensemble_correct)
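The averaging step can be illustrated on a toy stack of predictions. The probabilities below are made up for the example. In the notebook the corresponding array is `pred_labels`, whose shape should be (50, 10000, 10). Hard (majority) voting is shown only as a point of comparison and is not what the notebook does.

```python
import numpy as np

# Toy predictions with shape (networks, images, classes) = (3, 2, 3).
toy_pred = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],     # network 0: both correct
    [[0.4, 0.5, 0.1], [0.1, 0.7, 0.2]],     # network 1: wrong on image 0
    [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]],   # network 2: wrong on image 1
])
true_cls = np.array([0, 1])

# Soft voting: average probabilities over networks, then take the argmax.
soft_cls = np.argmax(toy_pred.mean(axis=0), axis=1)

# Hard voting (alternative): majority of the per-network argmaxes.
votes = np.argmax(toy_pred, axis=2)  # shape (networks, images)
hard_cls = np.array([np.bincount(v, minlength=3).argmax() for v in votes.T])

per_network_acc = (votes == true_cls).mean(axis=1)
print("per-network accuracy:", per_network_acc)
print("soft-vote classes:", soft_cls, "hard-vote classes:", hard_cls)
```

In this toy case two of the three networks each make one mistake, yet the averaged prediction is correct on both images, which is the effect the ensemble relies on.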

The single network with the highest test-set accuracy

In [25]:
best_net = np.argmax(test_accuracies)
best_net_pred_labels = pred_labels[best_net, :, :]
best_net_cls_pred = np.argmax(best_net_pred_labels, axis=1)
best_net_correct = (best_net_cls_pred == data.test.cls)
best_net_incorrect = np.logical_not(best_net_correct)

### Comparing the ensemble with the best single network

In [52]:
print("Ensemble: {0:>6.2%}".format(np.sum(ensemble_correct) / len(data.test.labels)))
print("Single: {0:>8.2%}".format(np.sum(best_net_correct) / len(data.test.labels)))

Ensemble: 99.52%
Single:   99.45%
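Expressed as error counts, the difference is easier to judge. The calculation below uses only the two accuracies printed above and the 10,000-image test set. Since the best single network was itself selected by test accuracy, the comparison is, if anything, favorable to the single network.

```python
num_test = 10000             # size of the MNIST test set used above
ensemble_correct_n = 9952    # 99.52% of 10,000
single_correct_n = 9945      # 99.45% of 10,000

ensemble_errors = num_test - ensemble_correct_n
single_errors = num_test - single_correct_n
relative_reduction = (single_errors - ensemble_errors) / float(single_errors)

print("ensemble errors:", ensemble_errors)
print("best single-network errors:", single_errors)
print("relative error reduction: {:.1%}".format(relative_reduction))
```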
