# Classifying the CIFAR-10 Dataset with Gluon

## The CIFAR-10 Dataset

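The [CIFAR-10 dataset](http://www.cs.toronto.edu/~kriz/cifar.html) is a public object-recognition dataset. It contains 60,000 small 32×32 color images in 10 classes, 6,000 per class; 50,000 of them form the training set. The Kaggle test set contains 300,000 images, of which only 10,000 are scored; the other 290,000 unscored images were added to discourage hand-labeling the test set.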

It was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton; see the [technical report](http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf).

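The dataset downloaded from Kaggle consists of three files:
- train.7z: extracts to a `train` folder containing 50,000 images, named [1-50000].png
- test.7z: extracts to a `test` folder containing 300,000 images, named [1-300000].png
- trainLabels.csv: the label of every image in the `train` folder, 50,000 rows in total, each in the format: id,label_string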

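First, we need to reorganize the dataset so that MXNet's data-loading utilities can read it easily. After reorganizing, all images of the same class sit in the same folder.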

**Step 1: Read the labels of all training images and build a lookup dictionary**

In [15]:
import os

data_root_dir = '/home/yansheng/kaggle_dog'
train_dir = 'train'
label_file = 'labels.csv'
test_dir = 'test'
target_dir = 'train_valid_test'

# Read the training labels and store them in a dictionary

with open(os.path.join(data_root_dir, label_file), 'r') as f:
    lines = f.readlines()[1:]  # the first line is the header; skip it
    tokens = [l.rstrip().split(',') for l in lines]
    idx_label = dict(((idx, label) for idx, label in tokens))
labels = set(idx_label.values())  # deduplicate

In [16]:
org_label_count = {}
for label in idx_label.values():
    org_label_count[label] = org_label_count.get(label, 0) + 1
print(org_label_count)

{'boston_bull': 87, 'dingo': 80, 'pekinese': 75, 'bluetick': 85, 'golden_retriever': 67, 'bedlington_terrier': 89, 'borzoi': 75, 'basenji': 110, 'scottish_deerhound': 126, 'shetland_sheepdog': 76, 'walker_hound': 69, 'maltese_dog': 117, 'norfolk_terrier': 83, 'african_hunting_dog': 86, 'wire-haired_fox_terrier': 82, 'redbone': 72, 'lakeland_terrier': 99, 'boxer': 75, 'doberman': 74, 'otterhound': 69, 'standard_schnauzer': 72, 'irish_water_spaniel': 78, 'black-and-tan_coonhound': 77, 'cairn': 106, 'affenpinscher': 80, 'labrador_retriever': 84, 'ibizan_hound': 91, 'english_setter': 83, 'weimaraner': 85, 'giant_schnauzer': 69, 'groenendael': 82, 'dhole': 76, 'toy_poodle': 80, 'border_terrier': 91, 'tibetan_terrier': 107, 'norwegian_elkhound': 95, 'shih-tzu': 112, 'irish_terrier': 82, 'kuvasz': 71, 'german_shepherd': 69, 'greater_swiss_mountain_dog': 82, 'basset': 82, 'australian_terrier': 102, 'schipperke': 86, 'rhodesian_ridgeback': 88, 'irish_setter': 88, 'appenzeller': 78, 'bloodhound'

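**Step 2: Walk through the training images and copy each one into the folder for its label**

Every image in train_dir is copied into three directories under target_dir:

- train_valid: the complete training dataset
- train: only the part used for training
- valid: only the part used for validation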

In [17]:
import shutil

def mkdir_if_not_exist(path):
    # `path` is a list of path components; create the directory if it is missing
    if not os.path.exists(os.path.join(*path)):
        os.makedirs(os.path.join(*path))

valid_ratio = 0.2  # fraction of each class held out for validation
# Number of images of each class already copied into 'train'; once a class
# reaches its quota, its remaining images are copied to the 'valid' folder
label_count = dict()

for train_file in os.listdir(os.path.join(data_root_dir, train_dir)):
    idx = train_file.split('.')[0]
    label = idx_label[idx]
    mkdir_if_not_exist([data_root_dir, target_dir, 'train_valid', label])
    shutil.copy(os.path.join(data_root_dir, train_dir, train_file),
                os.path.join(data_root_dir, target_dir, 'train_valid', label))
    if label not in label_count or label_count[label] < org_label_count[label] * (1 - valid_ratio):
        mkdir_if_not_exist([data_root_dir, target_dir, 'train', label])
        shutil.copy(os.path.join(data_root_dir, train_dir, train_file),
                    os.path.join(data_root_dir, target_dir, 'train', label))
        label_count[label] = label_count.get(label, 0) + 1
    else:
        mkdir_if_not_exist([data_root_dir, target_dir, 'valid', label])
        shutil.copy(os.path.join(data_root_dir, train_dir, train_file),
                    os.path.join(data_root_dir, target_dir, 'valid', label))
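
To see how many images of one class end up in `train` versus `valid`, the counting rule from the loop above can be replayed without touching any files. This is only a sketch: `split_counts` is a hypothetical helper, not part of the notebook, and the class size 87 is taken from the `boston_bull` entry in the printed label counts.

```python
def split_counts(num_images, valid_ratio):
    """Replay the per-class counting rule and return (num_train, num_valid)."""
    num_train = 0
    num_valid = 0
    for _ in range(num_images):
        # Same test as the copy loop: keep filling 'train' until the running
        # count reaches num_images * (1 - valid_ratio), then switch to 'valid'.
        if num_train < num_images * (1 - valid_ratio):
            num_train += 1
        else:
            num_valid += 1
    return num_train, num_valid


print(split_counts(87, 0.2))  # (70, 17)
```

Because `os.listdir` returns entries in arbitrary order, which particular files land in `valid` can differ between machines, even though the counts per class stay the same.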

**Step 3: Store the test images by class as well; since they have no labels, they all go into a single `unknown` folder**

In [18]:
mkdir_if_not_exist([data_root_dir, target_dir, 'test', 'unknown'])
for test_file in os.listdir(os.path.join(data_root_dir, test_dir)):
    shutil.copy(os.path.join(data_root_dir, test_dir, test_file),
                os.path.join(data_root_dir, target_dir, 'test', 'unknown'))

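## Reading the Dataset with Gluon

### Defining the Image Preprocessing

1. Convert the pixel values to floating-point numbers in the range 0-1 and resize each image to 96×96.
2. Augment the training images with random horizontal flips (plus brightness and contrast jitter).
3. Subtract the mean and divide by the standard deviation. The code below passes `mean=None` and `std=None`, so this step is left disabled.
4. Convert the image layout from HWC to CHW.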
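
The steps in this list can be illustrated on a tiny image stored as nested Python lists. This is a sketch under stated assumptions, not the notebook's pipeline: `preprocess` is a hypothetical helper, it skips resizing and color jitter, and the flip is passed in as a flag instead of being random.

```python
def preprocess(image_hwc, mean=None, std=None, flip=False):
    """image_hwc: nested list indexed [height][width][channel], ints in 0..255."""
    # Scale pixel values to floats in [0, 1].
    out = [[[v / 255.0 for v in px] for px in row] for row in image_hwc]
    # Horizontal flip: reverse the order of the pixels within each row.
    if flip:
        out = [row[::-1] for row in out]
    # Optional per-channel normalization: (x - mean) / std.
    if mean is not None and std is not None:
        out = [[[(v - m) / s for v, m, s in zip(px, mean, std)]
                for px in row] for row in out]
    # HWC -> CHW: make the channel index the outermost one.
    h, w, c = len(out), len(out[0]), len(out[0][0])
    return [[[out[y][x][k] for x in range(w)] for y in range(h)]
            for k in range(c)]


tiny = [[[255, 0, 0], [0, 255, 0]]]  # 1x2 image: a red pixel, then a green one
chw = preprocess(tiny, flip=True)
print(len(chw), len(chw[0]), len(chw[0][0]))  # 3 1 2
print(chw[0])  # red channel after the flip: [[0.0, 1.0]]
```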

In [19]:
from mxnet import autograd
from mxnet import gluon
from mxnet import image
from mxnet import init
from mxnet import nd
from mxnet.gluon.data import vision
import numpy as np

def transform_train(data, label):
    im = image.imresize(data.astype('float32') / 255, 96, 96)
    auglist = image.CreateAugmenter(data_shape=(3, 96, 96),resize=0,
                        rand_crop=False, rand_resize=False, rand_mirror=True,
                        mean=None,
                        std=None,
                        brightness=0.125, contrast=0.125,
                        saturation=0, hue=0,
                        pca_noise=0, rand_gray=0, inter_method=2)
    for aug in auglist:
        im = aug(im)
    # Change the layout from height*width*channel to channel*height*width.
    im = nd.transpose(im, (2,0,1))
    return (im, nd.array([label]).asscalar().astype('float32'))

# At test time, apply no augmentation beyond normalization.
def transform_test(data, label):
    im = image.imresize(data.astype('float32') / 255, 96, 96)
    im = nd.transpose(im, (2,0,1))
    return (im, nd.array([label]).asscalar().astype('float32'))

Next, we use Gluon's [`ImageFolderDataset`](https://mxnet.incubator.apache.org/api/python/gluon/data.html#mxnet.gluon.data.vision.datasets.ImageFolderDataset) class to read the reorganized dataset.

In [20]:
input_str = data_root_dir + '/' + target_dir + '/'
batch_size = 128

# Read the raw image files. flag=1 means the input images have three channels (color).
train_ds = vision.ImageFolderDataset(input_str + 'train', flag=1,
                                     transform=transform_train)
valid_ds = vision.ImageFolderDataset(input_str + 'valid', flag=1,
                                     transform=transform_test)
train_valid_ds = vision.ImageFolderDataset(input_str + 'train_valid',
                                           flag=1, transform=transform_train)
test_ds = vision.ImageFolderDataset(input_str + 'test', flag=1,
                                     transform=transform_test)

loader = gluon.data.DataLoader
train_data = loader(train_ds, batch_size, shuffle=True, last_batch='keep')
valid_data = loader(valid_ds, batch_size, shuffle=True, last_batch='keep')
train_valid_data = loader(train_valid_ds, batch_size, shuffle=True, last_batch='keep')
test_data = loader(test_ds, batch_size, shuffle=False, last_batch='keep')
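
The loaders above use `last_batch='keep'`, so the final, smaller batch is still returned and counted by `len(loader)`, which the training code later uses to average per-batch losses. A minimal sketch of that count follows; `num_batches` is a hypothetical helper, the sample count 1000 is made up for illustration, and the `'rollover'` mode is deliberately not modeled.

```python
import math


def num_batches(num_samples, batch_size, last_batch='keep'):
    """Number of batches per epoch for the 'keep' and 'discard' modes."""
    if last_batch == 'keep':
        # The incomplete final batch is returned as-is.
        return math.ceil(num_samples / batch_size)
    if last_batch == 'discard':
        # The incomplete final batch is dropped.
        return num_samples // batch_size
    raise ValueError("only 'keep' and 'discard' are covered by this sketch")


print(num_batches(1000, 128, 'keep'))     # 8
print(num_batches(1000, 128, 'discard'))  # 7
```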

## Defining the Model

In [21]:
from mxnet.gluon import nn
from mxnet import nd

class Residual(nn.HybridBlock):
    def __init__(self, channels, same_shape=True, **kwargs):
        super(Residual, self).__init__(**kwargs)
        self.same_shape = same_shape
        with self.name_scope():
            strides = 1 if same_shape else 2
            self.conv1 = nn.Conv2D(channels, kernel_size=3, padding=1,
                                  strides=strides)
            self.bn1 = nn.BatchNorm()
            self.conv2 = nn.Conv2D(channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm()
            if not same_shape:
                self.conv3 = nn.Conv2D(channels, kernel_size=1,
                                      strides=strides)

    def hybrid_forward(self, F, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if not self.same_shape:
            x = self.conv3(x)
        return F.relu(out + x)


class ResNet(nn.HybridBlock):
    def __init__(self, num_classes, verbose=False, **kwargs):
        super(ResNet, self).__init__(**kwargs)
        self.verbose = verbose
        with self.name_scope():
            net = self.net = nn.HybridSequential()
            # Block 1
            net.add(nn.Conv2D(channels=32, kernel_size=3, strides=1, padding=1))
            net.add(nn.BatchNorm())
            net.add(nn.Activation(activation='relu'))
            # Block 2
            for _ in range(3):
                net.add(Residual(channels=32))
            # Block 3
            net.add(Residual(channels=64, same_shape=False))
            for _ in range(2):
                net.add(Residual(channels=64))
            # Block 4
            net.add(Residual(channels=128, same_shape=False))
            for _ in range(2):
                net.add(Residual(channels=128))
            # Block 5
            net.add(nn.GlobalAvgPool2D())
            net.add(nn.Flatten())
            net.add(nn.Dense(num_classes))

    def hybrid_forward(self, F, x):
        out = x
        for i, b in enumerate(self.net):
            out = b(out)
            if self.verbose:
                print('Block %d output: %s'%(i+1, out.shape))
        return out


def get_net(ctx):
    num_outputs = len(labels)
    net = ResNet(num_outputs)
    net.initialize(ctx=ctx, init=init.Xavier())
    return net
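
In a `Residual` block with `same_shape=False`, the stride-2 3×3 convolution on the main path and the stride-2 1×1 convolution on the shortcut must produce feature maps of the same size, otherwise `out + x` would fail. The sketch below checks this with the standard convolution output-size formula (dilation 1 assumed); `conv_out_size` is a hypothetical helper, and the 96×96 input size comes from the resize in `transform_train`.

```python
def conv_out_size(size, kernel_size, strides=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // strides + 1


size = 96  # height/width of the network input
for channels, same_shape in [(32, True), (64, False), (128, False)]:
    strides = 1 if same_shape else 2
    # Main path: conv1 (3x3, given stride) followed by conv2 (3x3, stride 1).
    main = conv_out_size(size, 3, strides, 1)
    main = conv_out_size(main, 3, 1, 1)
    # Shortcut: identity, or conv3 (1x1, same stride, no padding).
    shortcut = size if same_shape else conv_out_size(size, 1, strides, 0)
    assert main == shortcut
    print(channels, 'channels:', size, '->', main)
    size = main
```

The spatial size goes 96 → 96 → 48 → 24 across the three residual stages, after which global average pooling reduces each channel to a single value.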

In [22]:
import mxnet as mx
resnet18 = get_net(ctx=mx.gpu())
data = mx.sym.var('data')
net_symbol = resnet18(data)
#mx.viz.plot_network(net_symbol)

In [23]:
def accuracy(output, label):
    return nd.mean(output.argmax(axis=1)==label).asscalar()

def evaluate_accuracy(data_iter, net, ctx=mx.cpu()):
    acc = 0
    for data, label in data_iter:
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)
        output = net(data)
        acc += accuracy(output, label)
    return acc / len(data_iter)

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
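
`evaluate_accuracy` averages the per-batch accuracies, which gives a smaller final batch the same weight as a full one. The sketch below contrasts that with a sample-weighted average on made-up data; both function names and the toy batches are hypothetical, with `1` marking a correct prediction and `0` an incorrect one.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies, as evaluate_accuracy computes it."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Fraction of correct predictions over all samples."""
    correct = sum(sum(b) for b in batches)
    total = sum(len(b) for b in batches)
    return correct / total


batches = [[1, 1, 1, 1], [0]]  # a full batch of 4 and a final batch of 1
print(batch_mean_accuracy(batches))       # 0.5
print(sample_weighted_accuracy(batches))  # 0.8
```

With a batch size of 128 and thousands of samples the difference is usually small, but the sample-weighted version is the one that matches the true accuracy.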

In [24]:
import datetime

def train(net, train_data, valid_data, num_epochs, lr, wd, ctx, lr_period, lr_decay):
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr, 'momentum': 0.9, 'wd': wd})
    prev_time = datetime.datetime.now()
    for epoch in range(num_epochs):
        train_loss = 0.0
        train_acc = 0.0
        if epoch > 0 and epoch % lr_period == 0:
            trainer.set_learning_rate(trainer.learning_rate * lr_decay)
        for data, label in train_data:
            label = label.as_in_context(ctx)
            with autograd.record():
                output = net(data.as_in_context(ctx))
                loss = softmax_cross_entropy(output, label)
            loss.backward()
            trainer.step(batch_size)
            train_loss += nd.mean(loss).asscalar()
            train_acc += accuracy(output, label)
        cur_time = datetime.datetime.now()
        h, remainder = divmod((cur_time - prev_time).seconds, 3600)
        m, s = divmod(remainder, 60)
        time_str = "Time %02d:%02d:%02d" % (h, m, s)
        if valid_data is not None:
            valid_acc = evaluate_accuracy(valid_data, net, ctx)
            epoch_str = ("Epoch %d. Loss: %f, Train acc %f, Valid acc %f, "
                         % (epoch, train_loss / len(train_data),
                            train_acc / len(train_data), valid_acc))
        else:
            epoch_str = ("Epoch %d. Loss: %f, Train acc %f, "
                         % (epoch, train_loss / len(train_data),
                            train_acc / len(train_data)))
        prev_time = cur_time
        print(epoch_str + time_str + ', lr ' + str(trainer.learning_rate))

In [29]:
def get_loss(data, net, ctx):
    loss = 0.0
    for feas, label in data:
        label = label.as_in_context(ctx)
        output = net(feas.as_in_context(ctx))
        cross_entropy = softmax_cross_entropy(output, label)
        loss += nd.mean(cross_entropy).asscalar()
    return loss / len(data)

def train(net, train_data, valid_data, num_epochs, lr, wd, ctx, lr_period,
          lr_decay):
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr, 'momentum': 0.9,
                                      'wd': wd})
    prev_time = datetime.datetime.now()
    for epoch in range(num_epochs):
        train_loss = 0.0
        if epoch > 0 and epoch % lr_period == 0:
            trainer.set_learning_rate(trainer.learning_rate * lr_decay)
        for data, label in train_data:
            label = label.as_in_context(ctx)
            with autograd.record():
                output = net(data.as_in_context(ctx))
                loss = softmax_cross_entropy(output, label)
            loss.backward()
            trainer.step(batch_size)
            train_loss += nd.mean(loss).asscalar()
        cur_time = datetime.datetime.now()
        h, remainder = divmod((cur_time - prev_time).seconds, 3600)
        m, s = divmod(remainder, 60)
        time_str = "Time %02d:%02d:%02d" % (h, m, s)
        if valid_data is not None:
            valid_loss = get_loss(valid_data, net, ctx)
            epoch_str = ("Epoch %d. Train loss: %f, Valid loss %f, "
                         % (epoch, train_loss / len(train_data), valid_loss))
        else:
            epoch_str = ("Epoch %d. Train loss: %f, "
                         % (epoch, train_loss / len(train_data)))
        prev_time = cur_time
        print(epoch_str + time_str + ', lr ' + str(trainer.learning_rate))

In [30]:
ctx = mx.gpu()
num_epochs = 50
learning_rate = 0.01
weight_decay = 5e-4
lr_period = 10
lr_decay = 0.5
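
With these settings the training loop multiplies the learning rate by `lr_decay` every `lr_period` epochs. The resulting step schedule can be written in closed form; `lr_at_epoch` below is a hypothetical helper used only to illustrate it.

```python
def lr_at_epoch(epoch, base_lr, lr_period, lr_decay):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    # The loop decays at epochs lr_period, 2 * lr_period, ..., so the number
    # of decays applied so far is epoch // lr_period.
    return base_lr * lr_decay ** (epoch // lr_period)


for epoch in (0, 9, 10, 19, 20, 49):
    print(epoch, lr_at_epoch(epoch, 0.01, 10, 0.5))
```

For example, the rate drops from 0.01 to 0.005 at epoch 10, which matches the `lr` column in the training log.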

In [None]:
net = get_net(ctx)
net.hybridize()
train(net, train_data, valid_data, num_epochs, learning_rate,
      weight_decay, ctx, lr_period, lr_decay)

Epoch 0. Train loss: 4.891469, Valid loss 4.882747, Time 00:01:05, lr 0.01
Epoch 1. Train loss: 4.672906, Valid loss 4.636792, Time 00:01:25, lr 0.01
Epoch 2. Train loss: 4.569214, Valid loss 4.733593, Time 00:01:25, lr 0.01
Epoch 3. Train loss: 4.478176, Valid loss 4.587576, Time 00:01:25, lr 0.01
Epoch 4. Train loss: 4.379323, Valid loss 4.568875, Time 00:01:27, lr 0.01
Epoch 5. Train loss: 4.272317, Valid loss 4.987030, Time 00:01:26, lr 0.01
Epoch 6. Train loss: 4.163936, Valid loss 5.074187, Time 00:01:26, lr 0.01
Epoch 7. Train loss: 4.060266, Valid loss 4.602572, Time 00:01:25, lr 0.01
Epoch 8. Train loss: 3.989079, Valid loss 4.582050, Time 00:01:26, lr 0.01
Epoch 9. Train loss: 3.880605, Valid loss 4.808239, Time 00:01:26, lr 0.01
Epoch 10. Train loss: 3.743272, Valid loss 4.061832, Time 00:01:25, lr 0.005
Epoch 11. Train loss: 3.659518, Valid loss 4.002586, Time 00:01:24, lr 0.005


In [54]:
import numpy as np
import pandas as pd

net = get_net(ctx)
net.hybridize()
train(net, train_valid_data, None, num_epochs, learning_rate,
      weight_decay, ctx, lr_period, lr_decay)

preds = []
for data, label in test_data:
    output = net(data.as_in_context(ctx))
    preds.extend(output.argmax(axis=1).astype(int).asnumpy())

sorted_ids = list(range(1, len(test_ds) + 1))
sorted_ids.sort(key = lambda x:str(x))

df = pd.DataFrame({'id': sorted_ids, 'label': preds})
df['label'] = df['label'].apply(lambda x: train_valid_ds.synsets[x])
df.to_csv('submission.csv', index=False)
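
The ids are sorted as strings so that they line up with the order of the predictions. This assumes the dataset enumerates the test files in sorted filename order, which is lexicographic and not numeric; treat that as an assumption about `ImageFolderDataset` and verify it for your version. A small sketch with 11 hypothetical files:

```python
# Numeric ids sorted by their string form.
ids = list(range(1, 12))
ids.sort(key=str)
print(ids)  # [1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9]

# Sorting the filenames themselves gives the same order.
filenames = sorted('%d.png' % i for i in range(1, 12))
ids_from_files = [int(name.split('.')[0]) for name in filenames]
print(ids_from_files == ids)  # True
```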

Epoch 0. Loss: 1.855835, Train acc 0.322736, Time 00:01:06, lr 0.1
Epoch 1. Loss: 1.368335, Train acc 0.499900, Time 00:01:07, lr 0.1
Epoch 2. Loss: 1.121591, Train acc 0.596576, Time 00:01:06, lr 0.1
Epoch 3. Loss: 0.921709, Train acc 0.675805, Time 00:01:06, lr 0.1
Epoch 4. Loss: 0.765785, Train acc 0.730082, Time 00:01:06, lr 0.1
Epoch 5. Loss: 0.643581, Train acc 0.775052, Time 00:01:06, lr 0.1
Epoch 6. Loss: 0.562520, Train acc 0.804082, Time 00:01:05, lr 0.1
Epoch 7. Loss: 0.507999, Train acc 0.822692, Time 00:01:07, lr 0.1
Epoch 8. Loss: 0.450344, Train acc 0.844838, Time 00:01:07, lr 0.1
Epoch 9. Loss: 0.419663, Train acc 0.855712, Time 00:01:06, lr 0.1
Epoch 10. Loss: 0.317010, Train acc 0.890749, Time 00:01:08, lr 0.05
Epoch 11. Loss: 0.270628, Train acc 0.908482, Time 00:01:05, lr 0.05
Epoch 12. Loss: 0.254198, Train acc 0.913186, Time 00:01:07, lr 0.05
Epoch 13. Loss: 0.235958, Train acc 0.917953, Time 00:01:06, lr 0.05
Epoch 14. Loss: 0.226780, Train acc 0.921086, Time 00: