Before we begin, let's enable Google Colab's free GPU to speed up the training process. Click on **Edit**, then **Notebook Settings**, and select **GPU** as the hardware accelerator. To import the project data, run the cell below by pressing **Shift + Enter**; it calls **files.upload()**. Click **Choose Files** and upload train_posters.zip and test_posters.zip from your local machine. This will take 5-10 minutes.

In the meantime, upload train_data.csv: click the black arrow tab on the left side of the screen, open the **Files** tab, click **Upload**, and select the file. This .csv file contains tabular data as well as the labels for our training images.









In [None]:
from google.colab import files
files.upload()

Use the **unzip** command to unzip the folders containing the training and test images.

In [None]:
!unzip train_posters.zip
!unzip test_posters.zip

Import the necessary Python libraries and read the train_data.csv file using pandas. We sort the rows by IMDB ID so that they match the order of the training images. To get the genre labels, we simply grab the last column of the **csv_data** array and assign it to the **genres** variable.
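To see what the sorting step does, here is a small sketch on made-up rows. The IDs, titles, and labels below are invented for illustration and are not taken from train_data.csv.

```python
import numpy as np

# Hypothetical rows: (imdb_id, title, genre label). All values are made up.
rows = np.array([
    [305, "Movie C", 2],
    [101, "Movie A", 0],
    [207, "Movie B", 3],
], dtype=object)

# argsort returns the row indices that put the ID column in ascending order
order = rows[:, 0].argsort()
sorted_rows = rows[order]

ids = list(sorted_rows[:, 0])       # IDs, now ascending
labels = list(sorted_rows[:, -1])   # labels travel with their rows
print(ids)
print(labels)
```

Indexing the whole array with `order` moves every column together, so each label stays attached to its ID.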

Since this is a classification task, let's one-hot encode our labels (0,1,2,3). For example, a label of 0 will be one-hot encoded into the vector [1 0 0 0] and a label of 1 will be one-hot encoded into the vector [0 1 0 0]. We use the **to_categorical** function from Keras, a high-level deep learning Python library that runs on top of TensorFlow, to one-hot encode our labels.
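If you want to see one-hot encoding without Keras, the sketch below reproduces the idea with plain NumPy. It is only an illustration on made-up labels, not a replacement for **to_categorical**.

```python
import numpy as np

# Made-up integer labels for four examples
labels = np.array([0, 1, 3, 2])
num_classes = 4

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity matrix with the labels encodes all of them at once
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax along each row recovers the original integer labels
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```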

Throughout this notebook, I would highly suggest printing out variables to get an understanding of what they look like as well as to encourage good debugging practices.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical
import os
import cv2
import matplotlib.pyplot as plt

# Read csv data
# Reorder the labels to match the order of the images

# to_numpy() replaces as_matrix(), which was removed from recent pandas versions
csv_data = pd.read_csv('train_data.csv').to_numpy()
csv_data = csv_data[csv_data[:,0].argsort()]
genres = csv_data[:,-1]

# One-hot encode genres column

train_labels = to_categorical(np.array(genres))
print("Label for first training example: {}".format(genres[0]))
print("One-hot encoded label for first training example: {}".format(train_labels[0]))

We create variables storing the paths to our training and test image directories. Inside the **preprocess_training_data** function, we sort the images with respect to the IMDB IDs. We then create an empty list called **train_images**, preprocess each image, and append the new image into our **train_images** list. 
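Why sort with a key instead of calling `sorted` directly? File names are strings, and strings sort character by character. The sketch below uses made-up file names to show the difference.

```python
import os

# Made-up file names of the form <IMDB ID>.jpg
filenames = ["10.jpg", "2.jpg", "100.jpg", "33.jpg"]

# Plain string sorting compares character by character
lexicographic = sorted(filenames)

# Strip the extension and compare the IDs as integers instead
numeric = sorted(filenames, key=lambda name: int(os.path.splitext(name)[0]))

print(lexicographic)
print(numeric)
```

Only the numeric order matches rows that were sorted by integer ID, which is what keeps each image paired with its label.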

Using cv2, we read each image in grayscale and resize it to 64x64. Each pixel has a value between 0 and 255, so we divide by 255 to scale the values into the range [0, 1], which helps speed up convergence during training. Finally, we reshape the output to (number of images, 64, 64, 1), the format the Convolutional Neural Network expects.

Repeat the process for the test images. 
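Here is a minimal sketch of the normalize and reshape steps on their own. It uses random arrays as stand-ins for real posters, so nothing here depends on cv2 or the uploaded files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for three 64x64 grayscale images with integer pixels in 0..255
fake_images = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(3)]

# Dividing by 255 converts to floats in the range [0, 1]
normalized = [img / 255 for img in fake_images]

# Stack into one array and add a trailing channel axis of size 1;
# -1 lets NumPy infer the number of images
batch = np.array(normalized).reshape(-1, 64, 64, 1)

print(batch.shape)
print(batch.min(), batch.max())
```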

In [None]:
train_data = 'train_posters'
test_data = 'test_posters'

def preprocess_training_data():
    train_images = []
    # Sort file names numerically by IMDB ID so the images line up with the labels
    dirFiles = os.listdir(train_data)
    filelist = sorted(dirFiles,key=lambda x: int(os.path.splitext(x)[0]))
    for i in filelist:
        path = os.path.join(train_data,i)
        # Read as grayscale and resize to 64x64
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (64,64))
        # Scale pixel values from [0, 255] to [0, 1]
        train_images.append(np.array(img)/255)
    return train_images


def preprocess_test_data():
    test_images = []
    # Same steps as the training images: numeric sort, grayscale, resize, scale
    dirFiles = os.listdir(test_data)
    filelist = sorted(dirFiles,key=lambda x: int(os.path.splitext(x)[0]))
    for i in filelist:
        path = os.path.join(test_data,i)
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, (64,64))
        test_images.append(np.array(img)/255)
    return test_images
    
preprocessed_train = preprocess_training_data()
preprocessed_test = preprocess_test_data()


x_train = np.array(preprocessed_train).reshape(-1,64,64,1)
y_train = train_labels

x_test = np.array(preprocessed_test).reshape(-1,64,64,1)

Try displaying movie posters from both the training and test images. Since **x_train** and **x_test** have been reshaped into a CNN-usable format, they aren't easy to visualize. Instead, use the **preprocessed_train** and **preprocessed_test** lists. Print out the associated label and movie title of the training image to verify the order is correct.

In [None]:
# Display training example #1371: Death Note, a well-acclaimed Hollywood adaptation of the Death Note anime (jk)
# Display movie poster and associated label and title

# Feel free to change the train_ind and see how the preprocessing affects the images
train_ind = 1371
# cmap='gray' renders the single-channel image in grayscale instead of the default colormap
plt.imshow(preprocessed_train[train_ind], cmap='gray')
print(csv_data[:,1][train_ind])
print(y_train[train_ind])

In [None]:
# Displaying test example #200 
# Remember there is no genre label or title associated with this image
# We are trying to predict the labels! 

plt.imshow(preprocessed_test[200], cmap='gray')
print("Test example #200")

After preprocessing the images, we can now create our CNN by specifying the network architecture using Keras, a library that makes it easy to create deep neural networks. 

I will not explain what each layer does as you can simply read the Keras documentation and/or Google how a CNN works. 

I left the **activation** function argument blank to encourage you guys to research activation functions and try out different ones. In your write-up, please explain why you chose a specific activation function or a combination of activation functions!

I also left the **epochs** and **batch_size** arguments blank. Try experimenting with different numbers of epochs and batch sizes!

When running this cell, you should see a training progress bar at the bottom for every epoch as well as the associated loss and training accuracy. Remember, just because you get a very high training accuracy does not necessarily mean you will get a similar accuracy for the test data. Why is that? 
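One way to investigate that question is to hold out part of the training data and watch how the model does on examples it never trained on. The sketch below is only one possible approach; the helper name `train_val_split`, the 20% fraction, and the demo arrays are all made up for illustration.

```python
import numpy as np

def train_val_split(x, y, val_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for validation.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Demo arrays with the same shapes as x_train and y_train (values are placeholders)
x_demo = np.zeros((10, 64, 64, 1))
y_demo = np.eye(4)[np.arange(10) % 4]

x_tr, y_tr, x_val, y_val = train_val_split(x_demo, y_demo)
print(x_tr.shape, x_val.shape)
```

Keras `fit` accepts a `validation_data=(x_val, y_val)` argument and then reports validation loss and accuracy after every epoch, next to the training numbers.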

In [None]:
from keras import Sequential
from keras.layers import InputLayer, Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()

model.add(InputLayer(input_shape=[64,64,1]))
model.add(Conv2D(filters=32,kernel_size=(3,3),padding='same',activation=____))
model.add(MaxPool2D(pool_size=(2,2),padding='same'))

model.add(Flatten())
model.add(Dense(128, activation=____))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x=x_train, y=y_train, epochs=____, batch_size=____)
model.summary()

The next step is to create a vector of predictions by sending your test images through the trained CNN. **Figure out a way to output predicted labels using your test data and then convert them into a proper .csv file that Kaggle can use.**
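If you get stuck, here is a rough sketch of the general shape of this step, using made-up probabilities instead of real model output. `model.predict(x_test)` returns one row of four softmax probabilities per image, which is what `fake_probs` imitates. The column names `id` and `label` and the ID values are placeholders assumed for this sketch; check the competition's sample submission for the real format.

```python
import numpy as np
import pandas as pd

# Stand-in for model.predict(x_test): one row of 4 class probabilities per image
fake_probs = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

# The predicted class is the index of the largest probability in each row
predicted = np.argmax(fake_probs, axis=1)

# Placeholder IDs and column names; the real submission format may differ
submission = pd.DataFrame({
    "id": np.arange(len(predicted)),
    "label": predicted,
})

# With no path given, to_csv returns the CSV text; pass a file name to write to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```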

This project is a great introduction to deep learning, so I encourage you guys to be curious and have fun!

Here are some suggestions to improve this network:
1. Add more layers
2. Figure out a way to prevent overfitting on the training data
3. Experiment with different hyperparameters such as activation functions, optimizers, and batch sizes
4. Try different preprocessing methods
5. Figure out a way to utilize the tabular data from the train_data.csv file 