<p style="font-size:20px">You may need to install <b>tqdm</b> and <b>cv2</b> (OpenCV). Run <b>conda install tqdm</b> and <b>conda install opencv</b> in your virtual environment; the package is named <b>opencv</b> even though it is imported as <b>cv2</b>. You are also free to use other tools to fetch the image files and resize them.</p>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from random import shuffle
from tqdm import tqdm
import pickle
import os
import cv2
import timeit

<p style="font-size:20px">In Problem 4a, you will preprocess the dogs and cats data and save it into pickle files for later use.</p>

In [2]:
train_dir = 'catdog/train/'
test_dir = 'catdog/test'

<p style="font-size:20px">Step 1: Kaggle does not provide a file that contains the labels, so we need to create the labels from the training file names.</p>

In [3]:
"""
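A function that returns the one-hot label of an image.
input:
    img: the image file name, e.g. 'cat.0.jpg'
return:
    if the image is a cat, return [1,0]
    if the image is a dog, return [0,1]
"""
def label_img(img):
    # File names look like '<label>.<id>.<ext>', so the label is the third field from the end
    label = img.split('.')[-3]
    
    if label == 'cat':
        return [1,0]
    elif label == 'dog':
        return [0,1]
    # Fail loudly instead of silently returning None for an unexpected file name
    raise ValueError('unexpected file name: ' + img)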
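
<p style="font-size:20px">As a quick check of the idea in Step 1, the sketch below parses a couple of file names. It assumes names of the form <b>label.id.jpg</b> (for example <b>cat.0.jpg</b>); the helper name <b>label_from_name</b> is hypothetical and only mirrors what <b>label_img</b> does.</p>

```python
def label_from_name(filename):
    """Hypothetical helper: map 'cat.<id>.jpg' or 'dog.<id>.jpg' to a one-hot label."""
    # 'cat.0.jpg'.split('.') gives ['cat', '0', 'jpg'], so the label sits at index -3
    label = filename.split('.')[-3]
    one_hot = {'cat': [1, 0], 'dog': [0, 1]}
    return one_hot[label]  # raises KeyError for any other prefix

examples = ['cat.0.jpg', 'dog.12499.jpg']
labels = [label_from_name(name) for name in examples]
print(labels)
```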


<p style="font-size:20px">Step 2: Define a function that fetches all images from the training directory and returns a <b>list</b> in which every element holds two <b>NumPy arrays</b>: the image and its label.</p>

In [14]:
"""
Hint 1: use tqdm to fetch file
    for file in tqdm(os.listdir(directory)):
        ...

Hint 2: use cv2 to read file
    cv2.imread(path)

Hint 3: use cv2 to resize img
    cv2.resize(img, (size, size))
"""
def create_train_data():
    train_data = []
    
    for image in tqdm(os.listdir(train_dir)):
        ###get label of img###
        label = label_img(image)
        path = os.path.join(train_dir, image)
        
        ###use cv2 to read the img and resize it to (227 x 227)###
        image = cv2.imread(path, cv2.IMREAD_COLOR)
        image = cv2.resize(image, (227,227))
        
        ###append the img and label to the list###
        train_data.append([np.array(image), np.array(label)])
    
    ###shuffle training data###
    shuffle(train_data)
    
    ###return training data###
    return train_data

<p style="font-size:20px">Step 3: Define a similar function to fetch all test data. You don't need to label the test images; keep the image id from the file name instead.</p>

In [15]:
def create_test_data():
    test_data = []
    
    for image in tqdm(os.listdir(test_dir)):
        path = os.path.join(test_dir, image)
        i_n = image.split('.')[0]
        
        image = cv2.imread(path, cv2.IMREAD_COLOR)
        image = cv2.resize(image, (227, 227))
        
        test_data.append([np.array(image), i_n])
        
    shuffle(test_data)
    
    return test_data
        

<p style="font-size:20px">Step 4: create your train and test data</p>

In [16]:
train_data = create_train_data()
test_data = create_test_data()

100%|██████████| 25000/25000 [01:14<00:00, 337.05it/s]
100%|██████████| 12500/12500 [00:38<00:00, 322.64it/s]


<p style="font-size:20px">You can visualize an image using plt.imshow().</p>
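
<p style="font-size:20px">One thing to watch for when visualizing: <b>cv2.imread</b> returns channels in BGR order, while <b>plt.imshow</b> expects RGB, so colors look swapped unless the channel axis is reversed. The sketch below shows the reversal on a tiny synthetic image; the helper name <b>bgr_to_rgb</b> is hypothetical.</p>

```python
import numpy as np

def bgr_to_rgb(img):
    """Hypothetical helper: reverse the last (channel) axis, BGR -> RGB."""
    return img[..., ::-1]

# Synthetic 2x2 image that is pure blue in OpenCV's BGR layout
demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[..., 0] = 255

rgb = bgr_to_rgb(demo)

# With the notebook's data this would be used as (not run here):
#   plt.imshow(bgr_to_rgb(train_data[0][0]))
#   plt.show()
```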

<p style="font-size:20px">Step 5: Reshape all images to have shape (#, 227, 227, 3). Hold out 500 training examples as your validation set.</p>

In [17]:
train = train_data[:-500]
valid = train_data[-500:]

x_train = np.array([i[0] for i in train]).reshape(len(train), 227, 227, 3)
y_train = np.array([i[1] for i in train])

x_valid = np.array([i[0] for i in valid]).reshape(len(valid), 227, 227, 3)
y_valid = np.array([i[1] for i in valid])

x_test = np.array([i[0] for i in test_data]).reshape(len(test_data), 227, 227, 3)
# The test set has no labels: y_test holds the image ids taken from the file names
y_test = np.array([i[1] for i in test_data])
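
<p style="font-size:20px">To sanity-check Step 5 without loading the real data, the sketch below stacks a few fake images and estimates the memory the full training array needs. It assumes 8-bit images (one byte per channel value) and the 25000 training files shown in the progress bar above; <b>fake_data</b> is a stand-in, not the real <b>train_data</b>.</p>

```python
import numpy as np

# Stand-in for train_data: 4 blank 227x227 color images with one-hot labels
fake_data = [[np.zeros((227, 227, 3), dtype=np.uint8), np.array([1, 0])]
             for _ in range(4)]

x = np.array([item[0] for item in fake_data])
y = np.array([item[1] for item in fake_data])
print(x.shape, y.shape)

# Rough size of the real training array after holding out 500 validation images
bytes_per_image = 227 * 227 * 3          # uint8: one byte per value
n_train = 25000 - 500
total_bytes = n_train * bytes_per_image
print(round(total_bytes / 1e9, 2), 'GB')
```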

<p style="font-size:20px">Step 6: Save the training, validation and test data as pickle objects.</p>
<p style="font-size:20px">Note: You can't save all the training data into one file because it takes up several gigabytes. Split your data sensibly and save the parts into separate files.</p>

In [None]:
with open('catdog_train_data.p', 'wb') as f:
    pickle.dump((x_train, y_train), f)
with open('catdog_test_data.p', 'wb') as f:
    pickle.dump((x_test, y_test), f)
with open('catdog_valid_data.p', 'wb') as f:
    pickle.dump((x_valid, y_valid), f)
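
<p style="font-size:20px">One possible way to follow the note in Step 6 is to split the arrays into several chunks and pickle each chunk separately. The sketch below is only an illustration on small dummy arrays: the helper names <b>save_in_chunks</b> and <b>load_chunks</b>, the file prefix, and the number of chunks are all assumptions, not part of the assignment.</p>

```python
import os
import pickle
import tempfile

import numpy as np

def save_in_chunks(x, y, prefix, n_chunks):
    """Hypothetical helper: pickle (x, y) as n_chunks files named '<prefix>_<k>.p'."""
    paths = []
    x_parts = np.array_split(x, n_chunks)
    y_parts = np.array_split(y, n_chunks)
    for k, (x_part, y_part) in enumerate(zip(x_parts, y_parts)):
        path = '{}_{}.p'.format(prefix, k)
        with open(path, 'wb') as f:
            pickle.dump((x_part, y_part), f, protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths

def load_chunks(paths):
    """Hypothetical helper: load the chunk files and concatenate them in order."""
    xs, ys = [], []
    for path in paths:
        with open(path, 'rb') as f:
            x_part, y_part = pickle.load(f)
        xs.append(x_part)
        ys.append(y_part)
    return np.concatenate(xs), np.concatenate(ys)

# Small dummy arrays standing in for x_train and y_train
x_demo = np.arange(10 * 2 * 2 * 3, dtype=np.uint8).reshape(10, 2, 2, 3)
y_demo = np.tile([1, 0], (10, 1))

with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(x_demo, y_demo, os.path.join(tmp, 'demo_train'), n_chunks=3)
    x_back, y_back = load_chunks(paths)
```

<p style="font-size:20px">With the real data, a call such as <b>save_in_chunks(x_train, y_train, 'catdog_train_data', 5)</b> would write five smaller files; pick the chunk count to suit your disk and memory.</p>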