# Training a CNN on small datasets


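- When only a small amount of data is available, training a CNN can be difficult.
  - The strength of deep learning is that, given a large amount of data, it can learn useful features without a manual feature-engineering step.
- Techniques available when limited data makes CNN training difficult:
    - Data augmentation
        - Increase the number of training samples by slightly varying each image's color, angle, and so on.
    - Using a pre-trained network
        - Take the parameters (weights) of a model already trained on a very large dataset and re-fit (fine-tune) the model to the problem at hand.
        - Because the model was trained beforehand on diverse data, it can perform well even with little data.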

## Data for cats vs. dogs
- Uses the data from Kaggle's 2013 computer vision competition: https://www.kaggle.com/c/dogs-vs-cats/data
- The task is to distinguish dogs from cats; the dataset contains 12,500 images of each.
- Medium-resolution color JPEGs
- Can we train a good model using only 4,000 of the 25,000 pictures (2,000 cats, 2,000 dogs)?
    - Train: 2,000, validation: 1,000, test: 1,000
    
![cats_vs_dogs_samples](https://s3.amazonaws.com/book.keras.io/img/ch5/cats_vs_dogs_samples.jpg)

- gdown: a package for downloading shared files from Google Drive
- `pip install gdown==3.3.1`
- Already installed on Colab.

In [None]:
# Download the images
# https://drive.google.com/uc?id=<shared file ID>

import gdown
url = 'https://drive.google.com/uc?id=1nBE3N2cXQGwD8JaD0JZ2LmFD-n3D5hVU'
fname = 'cats_and_dogs_small.zip'

gdown.download(url, fname, quiet=False)  # url, output path

Downloading...
From: https://drive.google.com/uc?id=1nBE3N2cXQGwD8JaD0JZ2LmFD-n3D5hVU
To: /content/cats_and_dogs_small.zip
90.8MB [00:00, 133MB/s]


'cats_and_dogs_small.zip'

In [None]:
# Create the directory with a shell command (-p: no error if it already exists)
!mkdir -p data

In [None]:
# Unzip. -q: quiet (suppress log output) / -d: directory to extract into
!unzip -q cats_and_dogs_small.zip -d data/cats_and_dogs_small

## Build a network

- Input: $150 \times 150$ pixel RGB images (3 channels)
- Output: cat or dog (binary classification)
- Use ImageDataGenerator to feed the model with an image dataset stored on the file system.

In [32]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

np.random.seed(1)
tf.random.set_seed(1)

In [33]:
# Hyperparameters
LEARNING_RATE = 0.001
DROPOUT_RATE = 0.5
N_EPOCHS = 50
N_BATCHS = 20
IMAGE_SIZE = 150

In [34]:
def create_model():
    model = keras.Sequential()
    model.add(layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3)))

    model.add(layers.Conv2D(filters=64, kernel_size=3, padding='same', activation='relu'))
    model.add(layers.MaxPool2D(padding='same'))

    model.add(layers.Conv2D(filters=128, kernel_size=3, padding='same', activation='relu'))
    model.add(layers.MaxPool2D(padding='same'))

    model.add(layers.Conv2D(filters=256, kernel_size=3, padding='same', activation='relu'))
    model.add(layers.MaxPool2D(padding='same'))

    # classification layer
    model.add(layers.Flatten())
    model.add(layers.Dropout(DROPOUT_RATE))
    model.add(layers.Dense(units=512, activation='relu'))

    # Output layer
    model.add(layers.Dense(units=1, activation='sigmoid'))  # dog/cat: binary classification

    return model

In [None]:
model = create_model()

In [None]:
model.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
              loss='binary_crossentropy',
              metrics=['accuracy']
              )
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 150, 150, 64)      1792      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 75, 75, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 75, 75, 128)       73856     
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 38, 38, 128)       0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 38, 38, 256)       295168    
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 19, 19, 256)       0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 92416)            
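
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the settings used in `create_model()` (3×3 kernels, `padding='same'`, and the default 2×2 max pooling with stride 2); the helper names are illustrative and not part of Keras.

```python
import math

def conv2d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * kernel_size * in_channels weights plus one bias.
    return (kernel_size * kernel_size * in_channels + 1) * filters

def same_pool_output(size, pool=2):
    # With padding='same', the output length is ceil(input / stride).
    return math.ceil(size / pool)

size, channels = 150, 3
conv_params = []
for filters in (64, 128, 256):
    conv_params.append(conv2d_params(3, channels, filters))
    size = same_pool_output(size)   # 150 -> 75 -> 38 -> 19
    channels = filters

flatten_units = size * size * channels
print(conv_params)      # [1792, 73856, 295168]
print(flatten_units)    # 92416
```

The odd size 75 is why the second pooling layer yields 38 rather than 37: `padding='same'` rounds up instead of dropping the last row and column.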

In [None]:
# Create ImageDataGenerators -> augmentation and the input pipeline
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

test_dir = '/content/data/cats_and_dogs_small/test'
validation_dir = '/content/data/cats_and_dogs_small/validation'
train_dir = '/content/data/cats_and_dogs_small/train'

In [None]:
# ImageDataGenerator - No Augmentation
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

In [None]:
# Create iterators with ImageDataGenerator.flow_from_directory()
train_iterator = train_datagen.flow_from_directory(directory=train_dir,  # directory containing the images
                                                   target_size=(IMAGE_SIZE, IMAGE_SIZE),  # resize to (height, width)
                                                   class_mode='binary',  # binary labels: cat / dog
                                                   batch_size=N_BATCHS
                                                   )
validation_iterator = test_datagen.flow_from_directory(directory=validation_dir,
                                                       target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                       class_mode='binary',
                                                       batch_size=N_BATCHS
                                                       )
test_iterator = test_datagen.flow_from_directory(directory=test_dir,
                                                 target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                 class_mode='binary',
                                                 batch_size=N_BATCHS
                                                 )

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.


In [None]:
train_iterator.class_indices

{'cats': 0, 'dogs': 1}

In [None]:
len(train_iterator), len(validation_iterator), len(test_iterator)  # number of steps per epoch

(100, 50, 50)
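
The length of each iterator is the number of batches needed to cover the dataset once, that is, the sample count divided by the batch size and rounded up. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    # The last batch may be smaller, so round up.
    return math.ceil(n_samples / batch_size)

N_BATCHS = 20
steps = [steps_per_epoch(n, N_BATCHS) for n in (2000, 1000, 1000)]
print(steps)  # [100, 50, 50]
```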


## Model training

In [None]:
history = model.fit(train_iterator,
                    epochs=N_EPOCHS,
                    steps_per_epoch=len(train_iterator),
                    validation_data=validation_iterator,
                    validation_steps=len(validation_iterator))

Epoch 1/50
...
Epoch 50/50


In [None]:
model.evaluate(test_iterator)



[2.1738440990448, 0.6169999837875366]

- Overfitting occurs: on the test set the loss is about 2.17 and the accuracy only about 0.617.
    - Cause: the small training dataset
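
With this compile configuration, `model.evaluate` returns `[loss, accuracy]`. One plausible reading of a large loss next to a mediocre accuracy is that the model is confidently wrong on some samples. The sketch below illustrates the effect in plain Python; the labels and probabilities are hypothetical, not outputs of the model above.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1.
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 1, 0, 0, 1]
cautious = [0.6, 0.6, 0.4, 0.4, 0.4]             # hypothetical: mildly wrong on the last sample
overconfident = [0.99, 0.99, 0.01, 0.01, 0.001]  # hypothetical: very sure and wrong on the last sample

for name, probs in (('cautious', cautious), ('overconfident', overconfident)):
    print(name, accuracy(y_true, probs), round(binary_crossentropy(y_true, probs), 3))
```

Both sets of predictions have the same accuracy, but the overconfident one has a much higher loss because a single confident mistake dominates the average.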

# Using data augmentation

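- Image augmentation generates additional training images by applying plausible transformations to the existing training data, which reduces the risk of overfitting caused by a small number of training images.
- Apply it only to the training set; do not apply it to the validation and test sets (only rescale them).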
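
To make the idea concrete, here is a small NumPy sketch of three such transformations: horizontal flip, shift with constant fill, and brightness scaling. It is an illustration only, not how `ImageDataGenerator` is implemented; the function name and parameters are assumptions chosen to mirror the options used in this notebook.

```python
import numpy as np

def random_augment(image, rng, max_shift=0.1, brightness=(0.7, 1.3)):
    """Return a randomly flipped, shifted and brightened copy of an HxWxC image in [0, 1]."""
    out = image.copy()
    h, w, _ = out.shape

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Shift by up to max_shift of the height/width; exposed pixels are filled with 0
    # (similar in spirit to fill_mode='constant').
    dy = int(rng.integers(-int(h * max_shift), int(h * max_shift) + 1))
    dx = int(rng.integers(-int(w * max_shift), int(w * max_shift) + 1))
    shifted = np.zeros_like(out)
    src = out[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src

    # Scale the brightness and clip back into [0, 1].
    factor = rng.uniform(*brightness)
    return np.clip(shifted * factor, 0.0, 1.0)

rng = np.random.default_rng(1)
image = rng.random((150, 150, 3))   # stand-in for one rescaled training image
batch = np.stack([random_augment(image, rng) for _ in range(4)])
print(batch.shape)  # (4, 150, 150, 3)
```

Each call produces a different variant of the same source image, which is why the network rarely sees exactly the same input twice during training.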

In [None]:
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   brightness_range=(0.7, 1.3),
                                   fill_mode='constant'
                                   )
# For validation and test
test_datagen = ImageDataGenerator(rescale=1./255)

In [None]:
train_iterator = train_datagen.flow_from_directory(train_dir,
                                                   target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                   class_mode='binary',
                                                   batch_size=N_BATCHS,
                                                  )
validation_iterator = test_datagen.flow_from_directory(validation_dir,
                                                       target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                       class_mode='binary',
                                                       batch_size=N_BATCHS,
                                                      )
test_iterator = test_datagen.flow_from_directory(test_dir,
                                                 target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                 class_mode='binary',
                                                 batch_size=N_BATCHS,
                                                )

Found 2000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.


In [None]:
# Inspect the images
batch_image = next(train_iterator)
batch_image[0].shape, batch_image[1].shape  # batch_image[0]: images, batch_image[1]: labels

((20, 150, 150, 3), (20,))

In [None]:
plt.figure(figsize=(30, 15))
for i in range(20):
    plt.subplot(4, 5, i+1)
    img = batch_image[0][i]  # already rescaled to [0, 1]; casting to uint8 would turn the image black
    plt.imshow(img)
    plt.axis('off')
plt.tight_layout()
plt.show()


In [None]:
model2 = create_model()
model2.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
               loss='binary_crossentropy',
               metrics=['accuracy']
               )

In [None]:
history2 = model2.fit(train_iterator,
                      epochs=N_EPOCHS,
                      steps_per_epoch=len(train_iterator),
                      validation_data=validation_iterator,
                      validation_steps=len(validation_iterator)
                      )

Epoch 1/50
...
Epoch 50/50


### Using a DataFrame
- Use flow_from_dataframe()
    - Store the file paths and labels in a DataFrame and use it to load the dataset.

In [1]:
import gdown

url ='https://drive.google.com/uc?id=17ejPJw42TgTv0jCPMMlVTHwF57XYE2kb'
fname = 'cats_and_dogs_union.zip'
gdown.download(url, fname, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=17ejPJw42TgTv0jCPMMlVTHwF57XYE2kb
To: /content/cats_and_dogs_union.zip
90.7MB [00:00, 353MB/s]


'cats_and_dogs_union.zip'

In [2]:
!mkdir data

In [3]:
!unzip -q ./cats_and_dogs_union.zip -d ./data/cats_and_dogs

# Creating the DataFrame
- Columns: path, label

In [5]:
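# Working with file paths - glob
from glob import glob
# /**/: without recursive=True, ** behaves like * and matches exactly one directory level (cats/, dogs/) / *.jpg: every file with the .jpg extension
path_list = glob('/content/data/cats_and_dogs/**/*.jpg')  # returns the matching paths as a list of strings (absolute here because the pattern is absolute)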
len(path_list)

4000
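
How `**` behaves depends on the `recursive` argument of `glob`. The sketch below builds a throwaway directory tree (the file names are hypothetical) and compares the two modes.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout with one .jpg file at each depth.
    os.makedirs(os.path.join(root, 'cats', 'extra'))
    for parts in (('top.jpg',), ('cats', 'cat.1.jpg'), ('cats', 'extra', 'cat.2.jpg')):
        open(os.path.join(root, *parts), 'w').close()

    pattern = os.path.join(root, '**', '*.jpg')
    # Default mode: ** is treated like *, so it matches exactly one directory level.
    one_level = sorted(os.path.basename(p) for p in glob(pattern))
    # recursive=True: ** matches zero or more directory levels.
    recursive = sorted(os.path.basename(p) for p in glob(pattern, recursive=True))

print(one_level)   # ['cat.1.jpg']
print(recursive)   # ['cat.1.jpg', 'cat.2.jpg', 'top.jpg']
```

For the dataset in this notebook both modes find the same files, since every image sits exactly one level below `cats_and_dogs/`.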

In [6]:
path_list[:10]

['/content/data/cats_and_dogs/cats/cat.1303.jpg',
 '/content/data/cats_and_dogs/cats/cat.1974.jpg',
 '/content/data/cats_and_dogs/cats/cat.1335.jpg',
 '/content/data/cats_and_dogs/cats/cat.766.jpg',
 '/content/data/cats_and_dogs/cats/cat.1203.jpg',
 '/content/data/cats_and_dogs/cats/cat.1351.jpg',
 '/content/data/cats_and_dogs/cats/cat.151.jpg',
 '/content/data/cats_and_dogs/cats/cat.1079.jpg',
 '/content/data/cats_and_dogs/cats/cat.768.jpg',
 '/content/data/cats_and_dogs/cats/cat.1916.jpg']

In [7]:
path_list[-10:]

['/content/data/cats_and_dogs/dogs/dog.1999.jpg',
 '/content/data/cats_and_dogs/dogs/dog.1812.jpg',
 '/content/data/cats_and_dogs/dogs/dog.486.jpg',
 '/content/data/cats_and_dogs/dogs/dog.1908.jpg',
 '/content/data/cats_and_dogs/dogs/dog.603.jpg',
 '/content/data/cats_and_dogs/dogs/dog.1585.jpg',
 '/content/data/cats_and_dogs/dogs/dog.1375.jpg',
 '/content/data/cats_and_dogs/dogs/dog.575.jpg',
 '/content/data/cats_and_dogs/dogs/dog.577.jpg',
 '/content/data/cats_and_dogs/dogs/dog.1948.jpg']

In [11]:
import os

f = '/content/data/cats_and_dogs/dogs/dog.1999.jpg'
print(os.path.basename(f))  # basename(path): extract only the file name from the path

print(os.path.dirname(f))  # dirname(path): extract only the directory part of the path

print(os.path.dirname(f).split(r'/')[4])

dog.1999.jpg
/content/data/cats_and_dogs/dogs
dogs


In [13]:
label_list = []
for path in path_list:
    l = os.path.dirname(path).split(r'/')[4]
    label_list.append(l)

In [15]:
label_list[:5], label_list[-5:], len(label_list)

(['cats', 'cats', 'cats', 'cats', 'cats'],
 ['dogs', 'dogs', 'dogs', 'dogs', 'dogs'],
 4000)
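
Indexing the split path with a fixed position (`[4]`) works only while the data lives at exactly this depth. A possible alternative, sketched here with a hypothetical helper name, takes the name of the parent directory instead, so it does not depend on where the dataset is unpacked.

```python
import os

def label_from_path(path):
    # The name of the directory that directly contains the file is the label.
    return os.path.basename(os.path.dirname(path))

paths = [
    '/content/data/cats_and_dogs/cats/cat.1303.jpg',
    '/content/data/cats_and_dogs/dogs/dog.1999.jpg',
    './data/cats_and_dogs/dogs/dog.1999.jpg',   # hypothetical relative path
]
labels = [label_from_path(p) for p in paths]
print(labels)  # ['cats', 'dogs', 'dogs']
```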

In [16]:
import pandas as pd
d = {
    'path': path_list,
    'label': label_list
}
data_df = pd.DataFrame(d)
data_df.shape

(4000, 2)

In [17]:
data_df.head()

| | path | label |
|---|---|---|
| 0 | /content/data/cats_and_dogs/cats/cat.1303.jpg | cats |
| 1 | /content/data/cats_and_dogs/cats/cat.1974.jpg | cats |
| 2 | /content/data/cats_and_dogs/cats/cat.1335.jpg | cats |
| 3 | /content/data/cats_and_dogs/cats/cat.766.jpg | cats |
| 4 | /content/data/cats_and_dogs/cats/cat.1203.jpg | cats |


In [18]:
data_df.tail()

| | path | label |
|---|---|---|
| 3995 | /content/data/cats_and_dogs/dogs/dog.1585.jpg | dogs |
| 3996 | /content/data/cats_and_dogs/dogs/dog.1375.jpg | dogs |
| 3997 | /content/data/cats_and_dogs/dogs/dog.575.jpg | dogs |
| 3998 | /content/data/cats_and_dogs/dogs/dog.577.jpg | dogs |
| 3999 | /content/data/cats_and_dogs/dogs/dog.1948.jpg | dogs |


In [20]:
data_df['label'].value_counts()

dogs    2000
cats    2000
Name: label, dtype: int64

In [21]:
data_df.to_csv('./data/cats_and_dogs_filelist.csv', encoding='utf-8', index=None)

In [23]:
# Split into separate cats and dogs DataFrames
cats_df = data_df[data_df['label'] == 'cats']
dogs_df = data_df[data_df['label'] == 'dogs']

In [24]:
cats_df.shape, dogs_df.shape

((2000, 2), (2000, 2))

In [25]:
# Create the train/test DataFrames (8:2)
split_idx = int(dogs_df.shape[0]*0.8)

In [26]:
train_df = pd.concat([dogs_df[:split_idx], cats_df[:split_idx]], axis=0)  # combine the first 1,600 rows (positions 0-1599) of dogs and cats into train_df
print(train_df.shape)
train_df['label'].value_counts()

(3200, 2)


dogs    1600
cats    1600
Name: label, dtype: int64

In [27]:
test_df = pd.concat([dogs_df[split_idx:], cats_df[split_idx:]], axis=0)
print(test_df.shape)
test_df['label'].value_counts()

(800, 2)


dogs    400
cats    400
Name: label, dtype: int64
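
The slicing above takes the first 1,600 files of each class in whatever order `glob` returned them. If a random split is preferred, one option is to shuffle each class with a fixed seed before cutting. The sketch below shows the idea in plain Python; the function name and the file names are hypothetical stand-ins for the real paths.

```python
import random

def stratified_split(paths_by_label, train_ratio=0.8, seed=1):
    # Shuffle each class separately so both splits keep the same class balance.
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in paths_by_label.items():
        shuffled = list(paths)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train += [(p, label) for p in shuffled[:cut]]
        test += [(p, label) for p in shuffled[cut:]]
    return train, test

# Hypothetical file names standing in for the 2,000 images per class.
files = {
    'cats': [f'cat.{i}.jpg' for i in range(2000)],
    'dogs': [f'dog.{i}.jpg' for i in range(2000)],
}
train, test = stratified_split(files)
print(len(train), len(test))  # 3200 800
```

The resulting `(path, label)` pairs could be turned into DataFrames with `path` and `label` columns, matching the layout used in this section.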

In [29]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   brightness_range=(0.7, 1.3),
                                   fill_mode='constant'
                                   )
# For validation and test
test_datagen = ImageDataGenerator(rescale=1./255)

In [35]:
train_iterator = train_datagen.flow_from_dataframe(dataframe=train_df,  # DataFrame holding the path and label columns
                                                   x_col='path',  # name of the column with the image paths
                                                   y_col='label',  # name of the label column
                                                   target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                   class_mode='binary',
                                                   batch_size=N_BATCHS
                                                  )
test_iterator = test_datagen.flow_from_dataframe(dataframe=test_df,
                                                 x_col='path',
                                                 y_col='label',
                                                 target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                                 class_mode='binary',
                                                 batch_size=N_BATCHS
                                                 )

Found 3200 validated image filenames belonging to 2 classes.
Found 800 validated image filenames belonging to 2 classes.


In [36]:
model = create_model()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),
              loss='binary_crossentropy',
              metrics=['accuracy']
              )
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 150, 150, 64)      1792      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 75, 75, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 75, 75, 128)       73856     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 38, 38, 128)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 38, 38, 256)       295168    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 19, 19, 256)       0         
_________________________________________________________________
flatten (Flatten)            (None, 92416)             0

In [37]:
train_iterator.class_indices

{'cats': 0, 'dogs': 1}

In [39]:
model.fit(train_iterator,
          epochs=N_EPOCHS,
          steps_per_epoch=len(train_iterator),
          validation_data=test_iterator,
          validation_steps=len(test_iterator),
          )

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f46c004ded0>

# Inference

In [52]:
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def predict_cat_dog(path):
    class_name = ['cat', 'dog']
    img = load_img(path, target_size=(IMAGE_SIZE, IMAGE_SIZE))
    # image -> ndarray, with a leading batch axis
    sample = img_to_array(img)[np.newaxis, ...]
    # scaling
    sample = sample/255.

    pred = model.predict(sample)  # sigmoid output: predicted probability of class 1 (dog)
    pred = pred[0, 0]
    pred_class = np.where(pred < 0.5, 0, 1)
    pred_class_name = class_name[pred_class]
    return pred, pred_class, pred_class_name

In [53]:
pred, pred_class, pred_class_name = predict_cat_dog('/content/dog.jpg')

In [55]:
pred, pred_class, pred_class_name

(0.4169263, array(0), 'cat')
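
The single sigmoid output is the predicted probability of the class with index 1, which is `dogs` according to `class_indices`. The sketch below decodes a probability using that mapping rather than a hard-coded name list; the helper name and the 0.5 threshold are assumptions that mirror `predict_cat_dog` above.

```python
def decode_prediction(prob, class_indices, threshold=0.5):
    # Invert the {'name': index} mapping reported by the iterator.
    index_to_name = {index: name for name, index in class_indices.items()}
    pred_class = int(prob >= threshold)
    return pred_class, index_to_name[pred_class]

class_indices = {'cats': 0, 'dogs': 1}
print(decode_prediction(0.4169263, class_indices))  # (0, 'cats')
print(decode_prediction(0.9, class_indices))        # (1, 'dogs')
```

Here the probability 0.417 falls just below the threshold, so the dog image is labelled as a cat; a value this close to 0.5 indicates the model is far from certain.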