https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/?_ga=2.254520140.590795110.1633623212-2142415394.1633395405

Most implementations, including scikit-learn's logistic regression and SVM, require the entire dataset to be accessible at once during training, which means it must fit in RAM.
=> Solution: use incremental learning, which trains the model on small subsets of the data called batches.

Steps:
- Load a small batch of data from the dataset.
- Train the model on the batch.
- Repeat over the dataset batch by batch, continuing training until convergence.
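
The steps above can be sketched with plain NumPy. This is only an illustration, not the tutorial's code: the helper names `iter_batches` and `train_incrementally`, the synthetic two-cluster data, and the logistic-regression model are all assumptions chosen to keep the example short. A fixed number of epochs stands in for a real convergence check.

```python
import numpy as np

def iter_batches(X, y, batch_size):
    """Yield consecutive (features, labels) slices of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

def train_incrementally(X, y, batch_size=32, epochs=20, lr=0.1):
    """Fit a logistic regression one mini-batch at a time (hypothetical helper)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                          # repeated passes over the data
        for xb, yb in iter_batches(X, y, batch_size):
            p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))  # predicted probabilities
            err = p - yb                             # log-loss gradient w.r.t. the logit
            w -= lr * (xb.T @ err) / len(xb)         # update using this batch only
            b -= lr * err.mean()
    return w, b

# Synthetic, well-separated data (an assumption made for the demo).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (100, 2)), rng.normal(2.0, 0.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
order = rng.permutation(len(X))
X, y = X[order], y[order]

w, b = train_incrementally(X, y)
accuracy = float((((X @ w + b) > 0).astype(float) == y).mean())
print(f"training accuracy: {accuracy:.2f}")
```

Here the whole array is already in memory. In a truly out-of-core setting, `iter_batches` would read each batch from disk, so only one batch occupies RAM at a time.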

Neural networks are one example of online learning: they are trained batch by batch.

In [5]:
import os

ORIG_INPUT_BASE = 'Food-5K'
BASE_PATH = 'dataset'
TRAIN = 'training'
TEST = 'evaluation'
VAL = 'validation'

CLASSES = ['non_food', 'food']
BATCH_SIZE = 32
LE_PATH = os.path.join('output', 'le.pickle')
BASE_CSV_PATH = 'output'

In [2]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img
import numpy as np
import pickle
import random

In [3]:
model = ResNet50(weights='imagenet', include_top=False)
le = None

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


In [6]:
for tp in (TRAIN, TEST, VAL):
    print(f'[INFO] preprocessing {tp} split...')
    splitDir = os.path.join(BASE_PATH, tp)
    # collect every image path under <split>/<class>/ and shuffle them
    classDirs = [os.path.join(splitDir, lb) for lb in CLASSES]
    imagePaths = [os.path.join(d, f) for d in classDirs for f in os.listdir(d)]
    random.shuffle(imagePaths)
    # the class label is the name of the image's parent directory
    labels = [p.split(os.path.sep)[-2] for p in imagePaths]

    # fit the label encoder once, on the first split
    if le is None:
        le = LabelEncoder()
        le.fit(labels)

    os.makedirs(BASE_CSV_PATH, exist_ok=True)
    csvPath = os.path.join(BASE_CSV_PATH, f'{tp}.csv')
    numBatches = int(np.ceil(len(imagePaths) / BATCH_SIZE))
    with open(csvPath, 'w') as csvFile:
        for (b, i) in enumerate(range(0, len(imagePaths), BATCH_SIZE)):
            print(f'[INFO] processing batch {b+1}/{numBatches}')
            batchPaths = imagePaths[i:i+BATCH_SIZE]
            batchLabels = le.transform(labels[i:i+BATCH_SIZE])
            batchImages = []
            for batchPath in batchPaths:
                # load and preprocess each image the way ResNet50 expects
                image = load_img(batchPath, target_size=(224, 224))
                image = img_to_array(image)
                image = np.expand_dims(image, axis=0)
                image = preprocess_input(image)
                batchImages.append(image)
            batchImages = np.vstack(batchImages)
            # run the batch through the network and flatten each 7x7x2048 volume
            features = model.predict(batchImages, batch_size=BATCH_SIZE)
            features = features.reshape((features.shape[0], 7*7*2048))
            # one CSV row per image: label first, then the feature vector
            for (label, vec) in zip(batchLabels, features):
                vec = ','.join([str(v) for v in vec])
                csvFile.write(f'{label},{vec}\n')

# save the label encoder so the training step can reuse it
with open(LE_PATH, 'wb') as f:
    f.write(pickle.dumps(le))

[INFO] preprocessing training split...
[INFO] processing batch 1/94
[INFO] processing batch 2/94
[INFO] processing batch 3/94
[INFO] processing batch 4/94
[INFO] processing batch 5/94
[INFO] processing batch 6/94
[INFO] processing batch 7/94
[INFO] processing batch 8/94
[INFO] processing batch 9/94
[INFO] processing batch 10/94
[INFO] processing batch 11/94
[INFO] processing batch 12/94
[INFO] processing batch 13/94
[INFO] processing batch 14/94
[INFO] processing batch 15/94
[INFO] processing batch 16/94
[INFO] processing batch 17/94
[INFO] processing batch 18/94
[INFO] processing batch 19/94
[INFO] processing batch 20/94
[INFO] processing batch 21/94
[INFO] processing batch 22/94
[INFO] processing batch 23/94
[INFO] processing batch 24/94
[INFO] processing batch 25/94
[INFO] processing batch 26/94
[INFO] processing batch 27/94
[INFO] processing batch 28/94
[INFO] processing batch 29/94
[INFO] processing batch 30/94
[INFO] processing batch 31/94
[INFO] processing batch 32/94
[INFO] pro
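
The shape handling and CSV row format used by the extraction loop can be reproduced on dummy data. The sketch below uses random numbers as a stand-in for the `model.predict` output (an assumption, so no network weights are needed). The names `fake_features`, `flat`, and `restored` are illustrative only.

```python
import numpy as np

# Stand-in for model.predict(...) on a batch of 4 images: ResNet50 without its
# top layers maps a 224x224 image to a 7x7x2048 volume.
rng = np.random.default_rng(0)
fake_features = rng.random((4, 7, 7, 2048), dtype=np.float32)

# Flatten each volume into a single vector of 7*7*2048 = 100352 values.
flat = fake_features.reshape((fake_features.shape[0], 7 * 7 * 2048))
print(flat.shape)

# Serialize one row as "label,v1,v2,..." and parse it back.
row = "0," + ",".join(str(v) for v in flat[0])
fields = row.split(",")
label = int(fields[0])
restored = np.array(fields[1:], dtype="float32")
print(label, restored.shape)
```

Each row is therefore one label followed by 100352 numbers stored as text, which is why the CSV files grow large and are later read back in batches instead of all at once.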

# Implement incremental learning

In [11]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report

In [39]:
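def csv_feature_generator(inputPath, bs, numClasses, mode='train'):
    # the file stays open for the generator's lifetime, since it yields batches forever
    f = open(inputPath, 'r')
    while True:
        data = []
        labels = []
        # keep reading rows until the batch is full
        while len(data) < bs:
            row = f.readline()
            if row == '':
                # end of file: rewind so the next pass starts at the first row
                f.seek(0)
                if mode == 'test' and data:
                    # evaluation: emit the final, possibly short, batch as-is
                    break
                row = f.readline()
            row = row.strip().split(',')
            # first column is the integer class label, the rest is the feature vector
            label = to_categorical(int(row[0]), num_classes=numClasses)  # one-hot vector
            features = np.array(row[1:], dtype='float')

            data.append(features)
            labels.append(label)
        yield (np.array(data), np.array(labels))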
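
To see how such a generator behaves at the end of the file, here is a self-contained variant run on a tiny temporary CSV. It is a sketch under stated assumptions: `csv_batches` is a hypothetical stand-in for the generator above, `np.eye` replaces `to_categorical` so Keras is not required, and the 5-row file is made up for the demo.

```python
import os
import tempfile
import numpy as np

def csv_batches(path, batch_size, num_classes, mode="train"):
    """Yield (features, one_hot_labels) batches from a 'label,v1,v2,...' CSV.

    Assumes the file is non-empty; otherwise the inner loop would never finish.
    """
    with open(path) as f:
        while True:
            data, labels = [], []
            while len(data) < batch_size:
                row = f.readline()
                if row == "":
                    f.seek(0)                    # rewind at end of file
                    if mode == "test" and data:
                        break                    # emit the short final batch
                    continue                     # train mode: keep filling
                fields = row.strip().split(",")
                labels.append(np.eye(num_classes)[int(fields[0])])  # one-hot label
                data.append(np.array(fields[1:], dtype="float"))
            yield np.array(data), np.array(labels)

# Write a 5-row demo file: label, then three feature values.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w") as out:
    for i in range(5):
        out.write(f"{i % 2},{i},{i + 0.5},{i + 1}\n")

train_gen = csv_batches(demo_path, batch_size=2, num_classes=2, mode="train")
train_batches = [next(train_gen) for _ in range(3)]
test_gen = csv_batches(demo_path, batch_size=2, num_classes=2, mode="test")
test_batches = [next(test_gen) for _ in range(3)]

print([x.shape for x, _ in train_batches])
print([x.shape for x, _ in test_batches])
```

In `train` mode the third batch wraps around and is completed with the first row of the file. In `test` mode the third batch contains only the one remaining row, so every row is seen exactly once per pass.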

In [17]:
with open(LE_PATH, 'rb') as f:
    le = pickle.load(f)
trainPath = 'output/training.csv'
testPath = 'output/evaluation.csv'
valPath = 'output/validation.csv'
totalTrain = sum([1 for l in open(trainPath)])
totalVal = sum([1 for l in open(valPath)])
testLabels = [int(row.strip().split(',')[0]) for row in open(testPath)]
totalTest = len(testLabels)

In [40]:
trainGen = csv_feature_generator(trainPath, BATCH_SIZE, len(CLASSES), mode='train')
testGen = csv_feature_generator(testPath, BATCH_SIZE, len(CLASSES), mode='test')
valGen = csv_feature_generator(valPath, BATCH_SIZE, len(CLASSES), mode='test')

In [19]:
model = Sequential()
model.add(Dense(256, activation='relu', input_shape=(7*7*2048,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(len(CLASSES), activation='softmax'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               25690368  
_________________________________________________________________
dense_1 (Dense)              (None, 16)                4112      
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
Total params: 25,694,514
Trainable params: 25,694,514
Non-trainable params: 0
_________________________________________________________________


A good rule of thumb is to take the square root of the number of nodes in the previous layer and then pick the closest power of 2 (here, sqrt(256) = 16).
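
The rule can be written as a small helper. This is an illustrative sketch: the function names are hypothetical, and "closest" is interpreted here as the smallest absolute distance, which is an assumption since the note does not say how to measure closeness.

```python
import math

def nearest_power_of_two(x):
    """Return the power of 2 closest to x (by absolute distance), for x >= 1."""
    lower = 2 ** math.floor(math.log2(x))
    upper = lower * 2
    return lower if (x - lower) <= (upper - x) else upper

def suggest_layer_width(previous_nodes):
    """Rule of thumb: square root of the previous layer size, snapped to a power of 2."""
    return nearest_power_of_two(math.sqrt(previous_nodes))

print(suggest_layer_width(7 * 7 * 2048))  # input vector of 100352 values
print(suggest_layer_width(256))           # first hidden layer
```

Applied to the 7*7*2048 = 100352 input features, the helper gives 256, and applied to 256 it gives 16. These match the two hidden layers in the model summary.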

In [21]:
opt = SGD(learning_rate=1e-3, momentum=0.9, decay=1e-3/25)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
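
The effect of the `decay` argument can be estimated by hand. The sketch below assumes the time-based schedule used by the legacy Keras optimizers, `lr_t = lr_0 / (1 + decay * t)` with `t` counted in batch updates. Recent Keras releases may reject `decay` and expect a learning-rate schedule instead. The function name `decayed_lr` and the value `steps_per_epoch = 93` are illustrative assumptions, not values taken from the run.

```python
def decayed_lr(initial_lr, decay, iteration):
    """Time-based decay: lr_t = lr_0 / (1 + decay * t), t counted in batch updates."""
    return initial_lr / (1.0 + decay * iteration)

initial_lr = 1e-3
decay = 1e-3 / 25
steps_per_epoch = 93  # assumption for illustration; depends on the training set size

# Learning rate at the start of training and after 1, 5, and 10 epochs.
for epoch in (0, 1, 5, 10):
    lr = decayed_lr(initial_lr, decay, epoch * steps_per_epoch)
    print(f"after epoch {epoch:2d}: lr = {lr:.6e}")
```

With a decay this small, the learning rate changes only slightly over ten epochs under this schedule.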

In [22]:
print('[INFO] training simple network...')
H = model.fit(x=trainGen, steps_per_epoch=(totalTrain//BATCH_SIZE),
             validation_data=valGen,
             validation_steps=(totalVal//BATCH_SIZE),
             epochs=10)

[INFO] training simple network...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [41]:
print('[INFO] evaluate network...')
predIdx = model.predict(x=testGen, steps=int(np.ceil(totalTest/BATCH_SIZE)))
predIdxx = np.argmax(predIdx, axis=1)
print(classification_report(testLabels, predIdxx, target_names=le.classes_))

[INFO] evaluate network...
              precision    recall  f1-score   support

        food       0.99      0.99      0.99       500
    non_food       0.99      0.99      0.99       500

    accuracy                           0.99      1000
   macro avg       0.99      0.99      0.99      1000
weighted avg       0.99      0.99      0.99      1000
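
The last two steps of the evaluation, turning softmax outputs into class names and counting prediction steps, can be checked on a toy example. The probabilities below are made up for illustration. The class order `['food', 'non_food']` follows the report above, since `LabelEncoder` stores its classes in sorted order.

```python
import math
import numpy as np

class_names = ["food", "non_food"]  # sorted order, as in the report above

# Hypothetical softmax outputs for three samples (one row per sample).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
pred_idx = np.argmax(probs, axis=1)             # index of the most likely class
pred_names = [class_names[i] for i in pred_idx]
print(pred_idx.tolist(), pred_names)

# Number of prediction steps needed to cover 1000 test rows with batches of 32.
total_test, batch_size = 1000, 32
steps = math.ceil(total_test / batch_size)
last_batch = total_test - (steps - 1) * batch_size
print(steps, last_batch)
```

Rounding the step count up matters: with floor division, the final 8 rows would never be predicted, and the prediction array would be shorter than the list of true labels.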

