# Lab 1: Mixtures of Experts

This notebook demonstrates how mixtures of experts can be used to boost performance.

The objective of this lab is to classify images from CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) into one of ten classes: {0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck}


![Cifar10](cifar10_resize.png )

Specifically, a gating function is trained to pass examples to two experts which are trained separately: one expert is trained to classify images within the "natural image" category (e.g. cat, dog) and the other to classify images within the "artificial image" category (e.g. airplane, automobile). The experts are then used to boost the performance of a baseline architecture that classifies each image into one of the 10 classes.

The mixture is built in the following order:
1. A single model is trained to classify all 10 classes. (This model is included in the mixture and is also our evaluation benchmark.)

2. An expert gating function is trained to recognise whether an image is of an artificial or  natural subject.

3. An artificial expert is trained to classify artificial objects that have a label in {0, 1, 8, 9}.

4. A natural expert is trained to classify natural objects that have a label in  {2, 3, 4, 5, 6, 7}.

5. A gating function is trained to determine the contribution of the experts and the contribution of the baseline architecture to the final output.

6. The mixture is built as illustrated in the figure below.


![](moe_architecture_illus.png)

## Import Prerequisites

In [163]:
import keras
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Input
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import concatenate, Lambda, Reshape
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam
import keras.backend as K
from keras.datasets import cifar10
from keras.utils import plot_model

import numpy as np
import os
import copy
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
import pydot
from IPython.display import SVG

In [164]:
# Parameters (not to be changed)
orig_classes = 10 ; gate0_classes = 2

## Mixture Parameters

You can try changing the mixture parameters in the following piece of code. When doing so, consider the following:

1 - Increasing the number of epochs increases the fit to the training data; at some point this should cause over-fitting. Conversely, setting it too low should cause under-fitting.

2 - Increasing the number of training examples gives the models more varied data to learn features from.

3 - Using a large model for the different classifiers increases their capacity to learn. This increases the number of epochs required for training, and also increases the risk of over-fitting. The small and large variants differ only in their network architecture.

By changing the parameters, the performance of the mixture of experts and the baseline classifier should change accordingly.

In [165]:
# Number of training/testing examples per batch
batch_size = 50

# Training epochs. A higher number of epochs corresponds to "more fitting to training data"
epochs = 10

# Number of training/testing examples to use
train_examples = 5000 # Max is 50000
test_examples = 1000   # Max is 10000

# Large/small model flags. Set to true to change a classifier to "large"
use_large_experts = True
use_large_gating_mlp = True
use_large_baseline_classifier = True

In [166]:
# delete previous model checkpoints
# https://blog.csdn.net/ZauberC/article/details/125391367
# recursively delete each checkpoint folder and all files inside it
import shutil
shutil.rmtree('gate0Cifar10', ignore_errors=True)
shutil.rmtree('moe3Cifar10', ignore_errors=True)
shutil.rmtree('natureCifar10', ignore_errors=True)
shutil.rmtree('baseCifar10', ignore_errors=True)
shutil.rmtree('artCifar10', ignore_errors=True)

# get the newest model file within a directory
def getNewestModel(model, dirname):
    from glob import glob
    target = os.path.join(dirname, '*')
    files = [(f, os.path.getmtime(f)) for f in glob(target)]
    print(files)
    if len(files) == 0:
        return model
    else:
        newestModel = sorted(files, key=lambda item: item[1])[-1]  # the last (newest) model
        # https://www.python100.com/html/111929.html
        # files are sorted by their last-modification time
        model.load_weights(newestModel[0])  # load the trained weights from the newest checkpoint file
    
        return model
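
The "newest file" lookup used by `getNewestModel` can be sketched on its own, without Keras. This is a minimal illustration rather than a replacement for the function above: `newest_file` is a hypothetical helper, and the two file names are stand-ins created in a temporary directory with explicit modification times.

```python
import os
import tempfile
from glob import glob


def newest_file(dirname):
    """Return the most recently modified path in dirname, or None if nothing matches."""
    paths = glob(os.path.join(dirname, "*"))
    return max(paths, key=os.path.getmtime) if paths else None


# Demonstration on a throwaway directory with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    for i, fname in enumerate(["Cifar10_.01-2.27.hdf5", "Cifar10_.02-2.00.hdf5"]):
        path = os.path.join(tmp, fname)
        open(path, "w").close()
        os.utime(path, (1000 + i, 1000 + i))  # (atime, mtime) in seconds
    newest_name = os.path.basename(newest_file(tmp))
    empty_result = newest_file(os.path.join(tmp, "missing"))  # no such folder

print(newest_name)
print(empty_result)
```

Because the checkpoints are written with `save_best_only=True`, the most recently written file is also the one with the lowest validation loss so far, which is why picking by modification time is sufficient here.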

## Prepare datasets

In [167]:
# load dataset  ; X: input images,  Y: class label ground truth
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train[:train_examples] ; x_test = x_test[:test_examples]
y_train = y_train[:train_examples] ; y_test = y_test[:test_examples]
print(x_train.shape)
print(y_train)


(5000, 32, 32, 3)
[[6]
 [9]
 [9]
 ...
 [5]
 [4]
 [6]]


In [168]:
# prepare x dataset
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255  # normalization
x_test /= 255

In [169]:
# Convert class vectors to binary class matrices
y_train0 = keras.utils.to_categorical(y_train, orig_classes)
y_test0 = keras.utils.to_categorical(y_test, orig_classes)

print("y train0:{0}\ny test0:{1}".format(y_train0.shape, y_test0.shape))
print(y_train0)
# change into one-hot vector

y train0:(5000, 10)
y test0:(1000, 10)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
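
As a rough illustration of what the one-hot conversion does, the following NumPy sketch reproduces it for the first three training labels printed earlier (6, 9, 9). The helper name `one_hot` is an assumption made for this example; it is not the Keras implementation.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer labels of shape (N,) or (N, 1) into an (N, num_classes) float32 matrix."""
    labels = np.asarray(labels).reshape(-1)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[labels]


demo_labels = np.array([[6], [9], [9]])  # first three training labels shown above
demo_one_hot = one_hot(demo_labels, 10)

print(demo_one_hot.shape)           # (3, 10)
print(demo_one_hot.argmax(axis=1))  # [6 9 9]
```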


## Define architectures

In [170]:
# input layer
cifarInput = Input(shape=(x_train.shape[1:]), name="input")
# 32x32x3

In [171]:
# Small VGG-like model
def simpleVGG(cifarInput, num_classes, name="vgg"):
    name = [name+str(i) for i in range(12)]
    
    # convolution and max pooling layers
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[0])(cifarInput)
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[1])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[2])(vgg)
    vgg = Dropout(0.25, name=name[3])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[4])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[5])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[6])(vgg)
    vgg = Dropout(0.25, name=name[7])(vgg)

    # classification layers
    vgg = Flatten(name=name[8])(vgg)
    vgg = Dense(512, activation='relu', name=name[9])(vgg)
    vgg = Dropout(0.5, name=name[10])(vgg)
    vgg = Dense(num_classes, activation='softmax', name=name[11])(vgg)  # softmax probabilities output
    return vgg

In [172]:
# Large VGG-like model
def fatVGG(cifarInput, num_classes, name="vgg"):
    name = [name+str(i) for i in range(17)]
    
    # convolution and max pooling layers
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[0])(cifarInput)
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[1])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[2])(vgg)
    vgg = Dropout(0.25, name=name[3])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[4])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[5])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[6])(vgg)
    vgg = Dropout(0.25, name=name[7])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[8])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[9])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[10])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[11])(vgg)
    vgg = Dropout(0.25, name=name[12])(vgg)

    # classification layers
    vgg = Flatten(name=name[13])(vgg)
    vgg = Dense(512, activation='relu', name=name[14])(vgg)
    vgg = Dropout(0.5, name=name[15])(vgg)
    vgg = Dense(num_classes, activation='softmax', name=name[16])(vgg)
    return vgg
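
To see how the two architectures differ at the point where the convolutional features are flattened, the sketch below tracks the spatial size through the pooling layers. It assumes 'same'-padded convolutions (as in the definitions above) and 2x2 max pooling with its default stride of 2; `flatten_size` is a hypothetical helper written for this example.

```python
def flatten_size(input_hw, blocks):
    """Number of features entering Flatten.

    blocks is a list of (channels, has_pool) pairs, one per convolutional block.
    'same' convolutions keep the spatial size; each 2x2 pooling halves it.
    """
    size = input_hw
    channels = None
    for channels, has_pool in blocks:
        if has_pool:
            size //= 2
    return size * size * channels


simple_blocks = [(32, True), (64, True)]            # simpleVGG: two conv blocks
fat_blocks = [(32, True), (64, True), (128, True)]  # fatVGG: three conv blocks

simple_flat = flatten_size(32, simple_blocks)
fat_flat = flatten_size(32, fat_blocks)
print(simple_flat, fat_flat)  # 4096 2048
```

Under these assumptions the large model feeds fewer features into its 512-unit dense layer than the small one (2048 versus 4096); its extra capacity comes from the third block of 128-filter convolutions.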

In [173]:
# first gating network, to decide artificial or natural object
if use_large_gating_mlp:
    gate0VGG = fatVGG(cifarInput, gate0_classes, "gate0")  # binary classification
else:
    gate0VGG = simpleVGG(cifarInput, gate0_classes, "gate0")
gate0Model = Model(cifarInput, gate0VGG)  # wrapping the graph in a Model makes it a reusable unit that later layers can be attached to

# base VGG  # 10 class classification
if use_large_baseline_classifier:
    baseVGG = fatVGG(cifarInput, orig_classes, "base")
else:
    baseVGG = simpleVGG(cifarInput, orig_classes, "base") 
baseModel = Model(cifarInput, baseVGG)

# artificial expert VGG
if use_large_experts:
    artificialVGG = fatVGG(cifarInput, orig_classes, "artificial")
else:
    artificialVGG = simpleVGG(cifarInput, orig_classes, "artificial")
artificialModel = Model(cifarInput, artificialVGG)

# natural expert VGG
if use_large_experts:
    naturalVGG = fatVGG(cifarInput, orig_classes, "natural")
else:
    naturalVGG = simpleVGG(cifarInput, orig_classes, "natural")

naturalModel = Model(cifarInput, naturalVGG)

## Train networks

### Train 10-Class Classifier

In [174]:
# compile
baseModel.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])



In [175]:
# make saving directory for checkpoints
baseSaveDir = "./baseCifar10/"
print(os.path)
if not os.path.isdir(baseSaveDir):
    os.makedirs(baseSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
# https://blog.csdn.net/u014568072/article/details/110818232 earlystopping
chkpt = os.path.join(baseSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
# save a checkpoint after each epoch in which val_loss improves (save_best_only=True)
# https://blog.csdn.net/Marryvivien/article/details/126954192

# load the newest model data from the directory if exists
baseModel = getNewestModel(baseModel, baseSaveDir)

<module 'posixpath' (frozen)>
[]


In [176]:
# train
baseModel.fit(x_train, y_train0,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_test, y_test0),
               callbacks=[es_cb,cp_cb])

Epoch 1/10
Epoch 1: val_loss improved from inf to 2.26607, saving model to ./baseCifar10/Cifar10_.01-2.27.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 2.26607 to 2.00343, saving model to ./baseCifar10/Cifar10_.02-2.00.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 2.00343 to 1.83074, saving model to ./baseCifar10/Cifar10_.03-1.83.hdf5
Epoch 4/10
Epoch 4: val_loss improved from 1.83074 to 1.71004, saving model to ./baseCifar10/Cifar10_.04-1.71.hdf5
Epoch 5/10
Epoch 5: val_loss improved from 1.71004 to 1.66845, saving model to ./baseCifar10/Cifar10_.05-1.67.hdf5
Epoch 6/10
Epoch 6: val_loss improved from 1.66845 to 1.57667, saving model to ./baseCifar10/Cifar10_.06-1.58.hdf5
Epoch 7/10
Epoch 7: val_loss improved from 1.57667 to 1.51248, saving model to ./baseCifar10/Cifar10_.07-1.51.hdf5
Epoch 8/10
Epoch 8: val_loss improved from 1.51248 to 1.47196, saving model to ./baseCifar10/Cifar10_.08-1.47.hdf5
Epoch 9/10
Epoch 9: val_loss improved from 1.47196 to 1.45888, saving model to ./baseCifar10/Cifar10_.09-1.46.hdf5
Epoch 10/10
Epoch 10: val_loss improved from 1.45888 to 1.37151, saving model to ./baseCifa

<keras.src.callbacks.History at 0x283f07cd0>

In [177]:
# evaluate: testing
baseModel = getNewestModel(baseModel, baseSaveDir)  # load the newest checkpoint first, then evaluate on the test set
baseScore = baseModel.evaluate(x_test, y_test0)
print(baseScore)

[('./baseCifar10/Cifar10_.10-1.37.hdf5', 1706385937.771537), ('./baseCifar10/Cifar10_.08-1.47.hdf5', 1706385894.5169065), ('./baseCifar10/Cifar10_.05-1.67.hdf5', 1706385838.254511), ('./baseCifar10/Cifar10_.06-1.58.hdf5', 1706385857.0048165), ('./baseCifar10/Cifar10_.09-1.46.hdf5', 1706385918.046521), ('./baseCifar10/Cifar10_.02-2.00.hdf5', 1706385781.0059214), ('./baseCifar10/Cifar10_.04-1.71.hdf5', 1706385819.2838368), ('./baseCifar10/Cifar10_.03-1.83.hdf5', 1706385800.3182273), ('./baseCifar10/Cifar10_.07-1.51.hdf5', 1706385875.6231842), ('./baseCifar10/Cifar10_.01-2.27.hdf5', 1706385759.8907313)]
[1.3715049028396606, 0.5]


## Train 2-Class Natural/Artificial Classifier

The expert gating model determines whether an image is "natural" or "artificial".

In [178]:
# Make ground truth for whether an example is "natural" or "artificial"
y_trainG0 = np.array([0 if i in [0,1,8,9] else 1 for i in y_train])
y_testG0 = np.array([0 if i in [0,1,8,9] else 1 for i in y_test])

y_trainG0 = keras.utils.to_categorical(y_trainG0, 2)
y_testG0  = keras.utils.to_categorical(y_testG0, 2)

print("y trainG0:{0}\ny testG0:{1}".format(y_trainG0.shape, y_testG0.shape))

y trainG0:(5000, 2)
y testG0:(1000, 2)
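
The natural/artificial ground truth can also be built in a vectorised way. The following is a minimal sketch using `np.isin`; the names `ARTIFICIAL` and `gate_targets` are assumptions for this example, and the six demo labels are the ones visible in the `y_train` printout earlier.

```python
import numpy as np

ARTIFICIAL = [0, 1, 8, 9]  # airplane, automobile, ship, truck


def gate_targets(labels):
    """Return 0 for artificial classes and 1 for natural classes, shape (N,)."""
    labels = np.asarray(labels).reshape(-1)
    return np.where(np.isin(labels, ARTIFICIAL), 0, 1)


demo = gate_targets([[6], [9], [9], [5], [4], [6]])
art_idx = np.flatnonzero(demo == 0)  # positions of the artificial examples

print(demo)     # [1 0 0 1 1 1]
print(art_idx)  # [1 2]
```

The same boolean test can be reused to pick out the index positions of each expert's training subset.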


In [179]:
# compile
gate0Model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])



In [180]:
# make saving directory for check point
gate0SaveDir = "./gate0Cifar10/"
if not os.path.isdir(gate0SaveDir):
    os.makedirs(gate0SaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(gate0SaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data from the directory if exists
gate0Model = getNewestModel(gate0Model, gate0SaveDir)

[]


In [181]:
# train
gate0Model.fit(x_train, y_trainG0,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_test, y_testG0),
               callbacks=[es_cb,cp_cb])

Epoch 1/10
Epoch 1: val_loss improved from inf to 0.39944, saving model to ./gate0Cifar10/Cifar10_.01-0.40.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 0.39944 to 0.35131, saving model to ./gate0Cifar10/Cifar10_.02-0.35.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 0.35131 to 0.28798, saving model to ./gate0Cifar10/Cifar10_.03-0.29.hdf5
Epoch 4/10
Epoch 4: val_loss did not improve from 0.28798
Epoch 5/10
Epoch 5: val_loss improved from 0.28798 to 0.27235, saving model to ./gate0Cifar10/Cifar10_.05-0.27.hdf5
Epoch 6/10
Epoch 6: val_loss improved from 0.27235 to 0.26318, saving model to ./gate0Cifar10/Cifar10_.06-0.26.hdf5
Epoch 7/10
Epoch 7: val_loss improved from 0.26318 to 0.23022, saving model to ./gate0Cifar10/Cifar10_.07-0.23.hdf5
Epoch 8/10
Epoch 8: val_loss did not improve from 0.23022
Epoch 9/10
Epoch 9: val_loss did not improve from 0.23022
Epoch 9: early stopping


<keras.src.callbacks.History at 0x14bfbd990>

In [182]:
# evaluate
gate0Model = getNewestModel(gate0Model, gate0SaveDir)
gate0Score = gate0Model.evaluate(x_test, y_testG0)
print(gate0Score)

[('./gate0Cifar10/Cifar10_.05-0.27.hdf5', 1706386040.6111171), ('./gate0Cifar10/Cifar10_.07-0.23.hdf5', 1706386078.2443082), ('./gate0Cifar10/Cifar10_.01-0.40.hdf5', 1706385960.2594962), ('./gate0Cifar10/Cifar10_.03-0.29.hdf5', 1706386000.1423075), ('./gate0Cifar10/Cifar10_.02-0.35.hdf5', 1706385979.7541835), ('./gate0Cifar10/Cifar10_.06-0.26.hdf5', 1706386059.6627994)]
[0.23021505773067474, 0.9120000004768372]
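
The gate's training log above ends with "early stopping" at epoch 9, after two consecutive epochs without improvement. The sketch below is a simplified model of that patience rule, not Keras's implementation; `early_stop_epoch` is a hypothetical function and the loss values are made up to follow the same pattern as the log.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Illustrative values only: epoch 4 fails to improve once, epochs 8 and 9 fail twice in a row.
demo_losses = [0.40, 0.35, 0.29, 0.30, 0.27, 0.26, 0.23, 0.24, 0.25]
stop_at = early_stop_epoch(demo_losses, patience=2)
print(stop_at)  # 9
```

A single non-improving epoch (epoch 4 in the example) only increments the counter; an improvement in the next epoch resets it.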


## Train "Natural" and "Artificial" Experts
<br>
The expert networks specialise in predicting certain classes.<br>
Each network is trained only on its specialised field: the artificial expert is trained on labels 0, 1, 8 and 9; the natural expert on labels 2, 3, 4, 5, 6 and 7.

In [183]:
# get the position of artificial images and natural images in training and test dataset
artTrain = [i for i in range(len(y_train)) if y_train[i] in [0,1,8,9]]
natureTrain = [i for i in range(len(y_train)) if y_train[i] in [2,3,4,5,6,7]]
artTest = [i for i in range(len(y_test)) if y_test[i] in [0,1,8,9]]
natureTest = [i for i in range(len(y_test)) if y_test[i] in [2,3,4,5,6,7]]

In [184]:
# get artificial dataset and natural dataset
# separate out artificial and natural
x_trainArt = x_train[artTrain]
x_testArt = x_test[artTest]
y_trainArt = y_train[artTrain]
y_testArt = y_test[artTest]

### Artificial expert network

In [185]:
# for artificial dataset
y_trainArt = keras.utils.to_categorical(y_trainArt, orig_classes)
y_testArt = keras.utils.to_categorical(y_testArt, orig_classes)

print("y train art:{0}\ny test art:{1}".format(y_trainArt.shape, y_testArt.shape))

# The one-hot vectors keep all 10 classes (rather than 4 for this expert and 6 for the
# natural expert) so that each expert's output lines up class-by-class with the baseline's
# output when the two are combined in the mixture.

y train art:(1983, 10)
y test art:(407, 10)


In [186]:
# compile
artificialModel.compile(loss='categorical_crossentropy',
                        optimizer=Adam(),
                        metrics=['accuracy'])



In [187]:
# make saving directory for check point
artSaveDir = "./artCifar10/"
if not os.path.isdir(artSaveDir):
    os.makedirs(artSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(artSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
artificialModel = getNewestModel(artificialModel, artSaveDir)

[]


In [188]:
# train
artificialModel.fit(x_trainArt, y_trainArt,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_testArt, y_testArt),
               callbacks=[es_cb,cp_cb])

Epoch 1/10


Epoch 1: val_loss improved from inf to 1.39926, saving model to ./artCifar10/Cifar10_.01-1.40.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 1.39926 to 1.30413, saving model to ./artCifar10/Cifar10_.02-1.30.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 1.30413 to 1.13554, saving model to ./artCifar10/Cifar10_.03-1.14.hdf5
Epoch 4/10
Epoch 4: val_loss improved from 1.13554 to 1.06485, saving model to ./artCifar10/Cifar10_.04-1.06.hdf5
Epoch 5/10
Epoch 5: val_loss improved from 1.06485 to 1.02846, saving model to ./artCifar10/Cifar10_.05-1.03.hdf5
Epoch 6/10
Epoch 6: val_loss did not improve from 1.02846
Epoch 7/10
Epoch 7: val_loss improved from 1.02846 to 0.94017, saving model to ./artCifar10/Cifar10_.07-0.94.hdf5
Epoch 8/10
Epoch 8: val_loss improved from 0.94017 to 0.92437, saving model to ./artCifar10/Cifar10_.08-0.92.hdf5
Epoch 9/10
Epoch 9: val_loss improved from 0.92437 to 0.86936, saving model to ./artCifar10/Cifar10_.09-0.87.hdf5
Epoch 10/10
Epoch 10: val_loss did not improv

<keras.src.callbacks.History at 0x2836b6310>

In [189]:
# evaluate
artificialModel = getNewestModel(artificialModel, artSaveDir)
artScore = artificialModel.evaluate(x_testArt, y_testArt)
print(artScore)

[('./artCifar10/Cifar10_.08-0.92.hdf5', 1706386180.8181164), ('./artCifar10/Cifar10_.02-1.30.hdf5', 1706386133.4525778), ('./artCifar10/Cifar10_.04-1.06.hdf5', 1706386149.4936426), ('./artCifar10/Cifar10_.03-1.14.hdf5', 1706386141.6884685), ('./artCifar10/Cifar10_.05-1.03.hdf5', 1706386157.0556881), ('./artCifar10/Cifar10_.09-0.87.hdf5', 1706386188.3284364), ('./artCifar10/Cifar10_.07-0.94.hdf5', 1706386172.8703427), ('./artCifar10/Cifar10_.01-1.40.hdf5', 1706386125.765381)]
[0.8693640232086182, 0.6339066624641418]


### Natural expert network

In [190]:
# for natural dataset
x_trainNat = x_train[natureTrain]
x_testNat = x_test[natureTest]
y_trainNat = y_train[natureTrain]
y_testNat = y_test[natureTest]

In [191]:
# get natural dataset
y_trainNat = keras.utils.to_categorical(y_trainNat, orig_classes)
y_testNat = keras.utils.to_categorical(y_testNat, orig_classes)

print("y train nature:{0}\ny test nature:{1}".format(y_trainNat.shape, y_testNat.shape))

y train nature:(3017, 10)
y test nature:(593, 10)


In [192]:
# compile
naturalModel.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])



In [193]:
# make saving directory for check point
natSaveDir = "./natureCifar10/"
if not os.path.isdir(natSaveDir):
    os.makedirs(natSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(natSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
naturalModel = getNewestModel(naturalModel, natSaveDir)

[]


In [194]:
# train
naturalModel.fit(x_trainNat, y_trainNat,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_testNat, y_testNat),
               callbacks=[es_cb,cp_cb])

Epoch 1/10


Epoch 1: val_loss improved from inf to 1.81940, saving model to ./natureCifar10/Cifar10_.01-1.82.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 1.81940 to 1.81219, saving model to ./natureCifar10/Cifar10_.02-1.81.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 1.81219 to 1.69517, saving model to ./natureCifar10/Cifar10_.03-1.70.hdf5
Epoch 4/10
Epoch 4: val_loss improved from 1.69517 to 1.63483, saving model to ./natureCifar10/Cifar10_.04-1.63.hdf5
Epoch 5/10
Epoch 5: val_loss improved from 1.63483 to 1.60018, saving model to ./natureCifar10/Cifar10_.05-1.60.hdf5
Epoch 6/10
Epoch 6: val_loss improved from 1.60018 to 1.51311, saving model to ./natureCifar10/Cifar10_.06-1.51.hdf5
Epoch 7/10
Epoch 7: val_loss improved from 1.51311 to 1.49053, saving model to ./natureCifar10/Cifar10_.07-1.49.hdf5
Epoch 8/10
Epoch 8: val_loss improved from 1.49053 to 1.47615, saving model to ./natureCifar10/Cifar10_.08-1.48.hdf5
Epoch 9/10
Epoch 9: val_loss did not improve from 1.47615
Epoch 10/10
Epoch 10:

<keras.src.callbacks.History at 0x14e148990>

In [195]:
# evaluate
naturalModel = getNewestModel(naturalModel, natSaveDir)
natScore = naturalModel.evaluate(x_testNat, y_testNat)
print(natScore)

[('./natureCifar10/Cifar10_.10-1.37.hdf5', 1706386319.2599046), ('./natureCifar10/Cifar10_.03-1.70.hdf5', 1706386233.3922484), ('./natureCifar10/Cifar10_.05-1.60.hdf5', 1706386257.321572), ('./natureCifar10/Cifar10_.07-1.49.hdf5', 1706386282.4135869), ('./natureCifar10/Cifar10_.08-1.48.hdf5', 1706386294.404692), ('./natureCifar10/Cifar10_.01-1.82.hdf5', 1706386209.376171), ('./natureCifar10/Cifar10_.02-1.81.hdf5', 1706386221.6177313), ('./natureCifar10/Cifar10_.04-1.63.hdf5', 1706386245.505035), ('./natureCifar10/Cifar10_.06-1.51.hdf5', 1706386269.277734)]
[1.3676996231079102, 0.45868465304374695]


### Freeze the weights of all trained models so far (i.e. baseline, experts, and expert gating models).


In [196]:
for l in baseModel.layers:
    l.trainable = False
for l in gate0Model.layers:
    l.trainable = False
for l in artificialModel.layers:
    l.trainable = False
for l in naturalModel.layers:
    l.trainable = False

## Connecting the overall networks to form the mixture of experts model

### Sub-gate

Up to this point, we have a baseline classifier which classifies the 10 classes, two experts which each classify the 10 classes but specialise in either the natural or the artificial category, and a first gate which decides which of the experts to choose for a given CIFAR input. We want our second gate to take in the CIFAR input and decide how much importance to give to the output of i) the baseline and ii) the chosen expert when producing the final MoE prediction. To do this, the second gate is composed of 2 sub-gates, one for each expert.

*Sub-gate Structure*:

1. The first layer of the sub-gate is the same (32, 32, 3) input layer as before, which takes in the CIFAR data.
2. The input is flattened.
3. It passes through a 512-unit dense layer with a ReLU activation function.
4. It passes through a dropout layer, which randomly sets units to 0 with probability 0.5.
5. It passes through a dense layer of orig_classes x 2 = 10 x 2 = 20 units with a softmax activation function, which gives importance values for i) the baseline and ii) the chosen expert.
6. This 1D output is reshaped into a (10, 2) output. I.e. along [:, 0] we have a (10,) tensor of what the sub-gate 'thinks' the importance is of each of the 10 baseline classifier outputs, and along [:, 1] a (10,) tensor of what the sub-gate 'thinks' the importance is of each of the 10 expert outputs.

Instantiate one of these sub-gates for each expert. Whichever expert is chosen by the first binary classifier 'hard' gate will determine which corresponding expert sub-gate is used to 'soft gate' between the baseline and the chosen expert.
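
The shape handling at the end of a sub-gate can be illustrated with NumPy. This sketch uses random numbers as a stand-in for the 20-unit dense layer's pre-activations (an assumption; no trained weights are involved) and shows the softmax followed by the reshape to (10, 2).

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
batch, num_classes = 4, 10
pre_activation = rng.normal(size=(batch, num_classes * 2))  # stand-in for the Dense(20) input to softmax

weights = softmax(pre_activation).reshape(batch, num_classes, 2)
baseline_importance = weights[:, :, 0]  # (batch, 10)
expert_importance = weights[:, :, 1]    # (batch, 10)

print(weights.shape)                                # (4, 10, 2)
print(np.allclose(weights.sum(axis=(1, 2)), 1.0))  # True
```

Note that the softmax is taken over all 20 units before the reshape, so the importance values sum to 1 across the whole (10, 2) block for each image, not within each class's baseline/expert pair.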

In [197]:
# define sub-Gate network, for the second gating network layer
def subGate(cifarInput, orig_classes, name="subGate"):
    name = [name+str(i) for i in range(5)]
    subgate = Flatten(name=name[0])(cifarInput)
    subgate = Dense(512, activation='relu', name=name[1])(subgate)
    subgate = Dropout(0.5, name=name[2])(subgate)
    subgate = Dense(orig_classes*2, activation='softmax', name=name[3])(subgate)
    subgate = Reshape((orig_classes, 2), name=name[4])(subgate)
    return subgate

In [198]:
# the artificial gating network
artGate = subGate(cifarInput, orig_classes, "artExpertGate")

# the natural gating network
natureGate = subGate(cifarInput, orig_classes, "natureExpertGate")

### Sub-gate Lambda

This layer takes the outputs of i) the baseline, ii) the expert and iii) the sub-gate of the corresponding expert, and produces one output score for each of the 10 orig_classes.

- Takes in argument gx, which is a list of 3 tensors (the output of i) baseline ii) expert iii) corresponding expert sub-gate)

- gx[0] -> baseline output tensor (10,). Is a softmax output for each of the 10 classes

- gx[1] -> expert network output tensor (10,). Is a softmax output for each of the 10 classes. Which expert's output reached here depends on the binary classifier output of the first gate (we will implement the logic of choosing the expert when we tie all the models together, see below).

- gx[2] -> corresponding expert sub-gate output tensor (10,2,). 
- gx[2][:,:,0] -> baseline importance tensor of shape (10,). Is what the sub-gate thinks the importance is of each of the 10 baseline output classes. 
- gx[2][:,:,1] -> expert importance tensor of shape (10,). Is what the sub-gate thinks the importance is of each of the 10 expert output classes.

We ultimately want an output score for each of the 10 classes, determined by what the i) baseline and ii) expert predicted, weighted by the importance assigned by the sub-gate. Because the baseline and expert outputs are softmax probabilities, the result is an importance-weighted sum of probabilities rather than a set of logits. To do this inference, we can define a simple function which: i) multiplies the baseline's output by the sub-gate's baseline importance -> get a (10,) tensor ii) multiplies the expert's output by the sub-gate's expert importance -> get a (10,) tensor iii) sums these two importance-weighted terms to get a final (10,) tensor of output scores (one for each class).
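
The importance-weighted sum described above can be checked on a tiny hand-made example. This is a NumPy sketch with made-up inputs (a uniform baseline, an expert that is certain of class 3, and a uniform gate); `combine` is a hypothetical name for the same arithmetic that the Lambda layer performs.

```python
import numpy as np


def combine(base, expert, gate):
    """Importance-weighted sum: base and expert are (N, 10), gate is (N, 10, 2)."""
    return base * gate[:, :, 0] + expert * gate[:, :, 1]


base = np.full((1, 10), 0.1)     # baseline: uniform over the 10 classes
expert = np.zeros((1, 10))
expert[0, 3] = 1.0               # expert: certain the image is class 3
gate = np.full((1, 10, 2), 0.05) # uniform importance; the 20 entries sum to 1

mixed = combine(base, expert, gate)
print(mixed.shape)             # (1, 10)
print(int(mixed.argmax(axis=1)[0]))  # 3
```

In this example the combined scores sum to about 0.1 rather than 1, which illustrates that the mixture output is not automatically a normalised probability distribution.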

In [199]:
# define inference calculation with Keras Lambda layer with base VGG, expert network and the second gating network of corresponding expert as input
# the inference is calculated as sum of multiplications of base VGG inference output and its importance, and expert network inference output and its importance
def subGateLambda(base, expert, subgate):
    output = Lambda(lambda gx: (gx[0]*gx[2][:,:,0]) + (gx[1]*gx[2][:,:,1]), output_shape=(orig_classes,))([base, expert, subgate])
    return output

# DEBUG
# def subGateLambda(base, expert, subgate):
#     output = Lambda(lambda gx: print('\ngx: {}\ngx[0]: {}\ngx[2][:,:,0]: {}\ngx[1]: {}\ngx[2][:,:,1]: {}'.format(gx, gx[0],gx[2][:,:,0],gx[1],gx[2][:,:,1])))([base, expert, subgate])
#     return output

### Connecting the Networks

Now we just need to i) tie all of the above models together and ii) implement the logic for choosing an expert at the first gate. The first 'expert gate' binary classifier is a 'hard gate' since it will choose one expert and block the other. The second gate (a sub-gate) is a 'soft gate' since it will apply its importance weights across the i) baseline and ii) chosen expert.

To do this, we can define a layer which takes the outputs of i) baseline ii) first gate iii) artificial expert iv) natural expert v) artificial sub-gate vi) natural sub-gate.

- Takes in argument gx, which is a list of 6 tensors (the output of i) baseline ii) first gate iii) artificial expert iv) natural expert v) artificial sub-gate vi) natural sub-gate)
- gx[0] -> baseline output tensor (10,)
- gx[1] -> first gate binary output tensor (2,)
- gx[2] -> artificial expert output tensor (10,)
- gx[3] -> natural expert output tensor (10,)
- gx[4] -> artificial sub-gate output tensor (10,2,)
- gx[5] -> natural sub-gate output tensor (10,2,)

We want to implement the logic that the first gate's chosen expert's output and corresponding sub-gate output should be passed (along with the baseline's output) to the sub-gate lambda function defined previously. To do this, we can use the Keras backend switch function, which will pass the chosen expert's output and its corresponding sub-gate output to the sub-gate lambda depending on which of the 2 outputs of the first gate was greater.
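
The hard-gating rule can be sketched with `np.where`, which plays the role of the backend switch in this illustration. The inputs are made up, and `hard_gate` is a hypothetical helper; index 0 of the gate output is taken to mean "artificial", matching the ground truth built for the first gate.

```python
import numpy as np


def hard_gate(gate_probs, art_out, nat_out):
    """Pick the artificial branch where gate_probs[:, 0] > gate_probs[:, 1], else the natural one."""
    choose_art = gate_probs[:, 0:1] > gate_probs[:, 1:2]  # shape (N, 1), broadcasts over the 10 classes
    return np.where(choose_art, art_out, nat_out)


gate_probs = np.array([[0.9, 0.1],   # first image: gate says artificial
                       [0.2, 0.8]])  # second image: gate says natural
art_out = np.full((2, 10), 1.0)      # stand-in for the artificial branch output
nat_out = np.full((2, 10), 2.0)      # stand-in for the natural branch output

routed = hard_gate(gate_probs, art_out, nat_out)
print(routed[:, 0])  # [1. 2.]
```

Because the comparison is strict, a tie between the two gate outputs falls through to the natural branch. Both branches are computed for every image; the gate only selects which result is kept.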

In [200]:
import tensorflow as tf
# connecting the overall networks.
# the Keras backend switch uses the first gating network's decision to route to the artificial or natural sub-gate
# https://blog.csdn.net/wdh315172/article/details/105437494
# Lambda wraps an arbitrary expression as a Layer object
# a "gate" here is simply a layer that fuses the outputs of several networks
output = Lambda(lambda gx: K.switch(tf.expand_dims(gx[1][:,0],axis=1) > tf.expand_dims(gx[1][:,1],axis=1), 
                                    subGateLambda(gx[0], gx[2], gx[4]), # choose the larger one
                                    subGateLambda(gx[0], gx[3], gx[5])), 
                output_shape=(orig_classes,))([baseVGG, gate0VGG, artificialVGG, naturalVGG, artGate, natureGate])
# gx[1] comes from gate0VGG (binary classification): whichever category the image is assigned to, the matching expert handles it

# https://gist.github.com/monk1337/22b00851302ccc8fddf84404e5dd00f8
# DEBUG
# output = Lambda(lambda gx: print('\ngx: {}\ngx[0]: {}\ngx[1]: {}\ngx[2]: {}\ngx[3]: {}\ngx[4]: {}\ngx[5]: {}'.format(gx, gx[0], gx[1], gx[2], gx[3], gx[4], gx[5])), 
#                 output_shape=(orig_classes,))([baseVGG, gate0VGG, artificialVGG, naturalVGG, artGate, natureGate])

In [201]:
# the mixture of experts model
model = Model(cifarInput, output)  # integrate into entire model

In [202]:
# compile
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])



At this point we have already trained the baseline, the experts, and the first expert gate. The only weights left to train are those of the two sub-gates; the final Lambda inference layer has no weights of its own and simply combines the outputs of the trained models into a final (10,) MoE prediction.
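
As a back-of-the-envelope check on how much is left to train, the sketch below counts the weights in the dense layers of the two sub-gates, using the layer sizes from `subGate` (flattened 32x32x3 input, 512 hidden units, 20 outputs). `dense_params` is a hypothetical helper; the count assumes each dense layer has one bias per output unit.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


flat_inputs = 32 * 32 * 3  # flattened CIFAR image
per_sub_gate = dense_params(flat_inputs, 512) + dense_params(512, 10 * 2)
total_trainable = 2 * per_sub_gate  # one sub-gate per expert

print(per_sub_gate)     # 1583636
print(total_trainable)  # 3167272
```

Flatten, Dropout and Reshape contribute no weights, so under these assumptions the two sub-gates account for all of the parameters updated in the final training step.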

In [203]:
# show each layer and whether it is trainable or not
# only the second gating network layers and the last Lambda inference layer are left trainable
# because all earlier layers were frozen after their own training
for l in model.layers:
    print(l, l.trainable)

<keras.src.engine.input_layer.InputLayer object at 0x14c33bc90> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x30f4fa2d0> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x28ea7dbd0> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x30f4f5b50> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x14b43e450> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x29d5a2b10> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x283f04410> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x28e986050> False
<keras.src.layers.convolutional.conv2d.Conv2D object at 0x2ec2f25d0> False
<keras.src.layers.pooling.max_pooling2d.MaxPooling2D object at 0x29d9b7f50> False
<keras.src.layers.pooling.max_pooling2d.MaxPooling2D object at 0x28ea8b090> False
<keras.src.layers.pooling.max_pooling2d.MaxPooling2D object at 0x2c6f7f350> False
<keras.src.layers.pooling.max_pooling2d.MaxPooling2D object at 0x30f097ed0> False
<k

In [204]:
# make saving directory for check point
saveDir = "./moe3Cifar10/"
if not os.path.isdir(saveDir):
    os.makedirs(saveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(saveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
model = getNewestModel(model, saveDir)

[]


In [205]:
# train
model.fit(x_train, y_train0,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test0),
          callbacks=[es_cb, cp_cb])

Epoch 1/10


Epoch 1: val_loss improved from inf to 1.31862, saving model to ./moe3Cifar10/Cifar10_.01-1.32.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 1.31862 to 1.31822, saving model to ./moe3Cifar10/Cifar10_.02-1.32.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 1.31822 to 1.31666, saving model to ./moe3Cifar10/Cifar10_.03-1.32.hdf5
Epoch 4/10
Epoch 4: val_loss did not improve from 1.31666
Epoch 5/10
Epoch 5: val_loss improved from 1.31666 to 1.30542, saving model to ./moe3Cifar10/Cifar10_.05-1.31.hdf5
Epoch 6/10
Epoch 6: val_loss did not improve from 1.30542
Epoch 7/10
Epoch 7: val_loss improved from 1.30542 to 1.30418, saving model to ./moe3Cifar10/Cifar10_.07-1.30.hdf5
Epoch 8/10
Epoch 8: val_loss improved from 1.30418 to 1.30375, saving model to ./moe3Cifar10/Cifar10_.08-1.30.hdf5
Epoch 9/10
Epoch 9: val_loss improved from 1.30375 to 1.30218, saving model to ./moe3Cifar10/Cifar10_.09-1.30.hdf5
Epoch 10/10
Epoch 10: val_loss improved from 1.30218 to 1.30006, saving model to ./moe3Cifar10

<keras.src.callbacks.History at 0x282f0c510>

In [206]:
# evaluate
mixture_loss_accuracy = model.evaluate(x_test, y_test0)
print(mixture_loss_accuracy)  # return loss and metrics
# https://www.tensorflow.org/api_docs/python/tf/keras/Model

[1.3000566959381104, 0.5429999828338623]
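
To put the two evaluation results side by side, the sketch below compares the baseline accuracy printed earlier (0.5) with the mixture accuracy printed above. The variable names are illustrative, and the numbers come from this single run on 1000 test images, so the gap should be read as indicative only.

```python
baseline_acc = 0.5                 # from the baseline evaluation output
mixture_acc = 0.5429999828338623   # from the mixture evaluation output
num_test = 1000                    # test_examples used in this run

absolute_gain = mixture_acc - baseline_acc
relative_gain = absolute_gain / baseline_acc
extra_correct = round(absolute_gain * num_test)

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
print(f"extra correct images: {extra_correct}")
```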


In [207]:
# Note: when the sub-gate Lambda and the whole mixture model are used, the previously trained
# networks (baseline, experts and first gate) are reused as they are: their weights are frozen
# and they are not trained again. The input is still fed through all of them, as well as through
# the sub-gates, because the final output depends on every network's prediction.