## CT5133 / CT5145 Deep Learning (/Online) 2022-2023

## Assignment 2

## James McDermott

* Student ID(s): 22223696
* Student name(s): Smitesh Nitin Patil

**Due date**: midnight Sunday 19 March (end Week 10).

**Weighting**: 20% of the module.

In this assignment the goal is to take advantage of pre-trained NN models to create an embedding with a dataset of movie posters, and demonstrate how to use that embedding.

The dataset is provided, along with some skeleton code for loading it.

The individual steps to be carried out are specified below, with `### YOUR CODE HERE` markers, together with the number of marks available for each part.

* **Topics**: in Part 5 below, students are asked to add some improvement to their models. In general, these improvements will differ between students (or student groups). **The proposed improvement must be notified to the lecturer at least 1 week before submission, and approved by the lecturer**. If working in a group, the two members of the group should not work on different topics in Part 5: they must work on the same topic and submit identical submissions.

* Students are not required to work incrementally on the parts. It is ok to do all the work in one day, so long as you abide by the rules on notifying groups and notifying topics.

* **Groups**: students may work solo or in a group of two. A student may not work together in a group with any student they have previously worked on a group project with, in this module or any other in the MSc programme. **Groups must be notified to the lecturer in writing before beginning work and at least 1 week before submission.** If working in a group, both students must submit and both submissions must be identical. If working in a group, both students may be asked to explain any aspect of the code in interview (see below), therefore working independently on separate components is not recommended. Any emails concerning the project should be cc-ed to the other group member.

* **Libraries**: code can be written in Keras/Tensorflow, or in PyTorch. 

* **Plagiarism**: students may discuss the assignment together, but you may not look at another student's or group's work or allow other students to view yours (other than within a group). You may use snippets of code (eg 1-2 lines) from the internet, **if you provide a citation with URL**. You may also use a longer snippet of code if it is a utility function, again only with citation. You may not use code from the internet to carry out the core of the assignment. You may not use a large language model to generate code.

* **Submission**: after completing your work in this Jupyter notebook, submit the notebook both in `.ipynb` and `.pdf` formats. The content should be identical.

* **Interviews**: a number of students may be selected for interview, post-submission. The selection will depend on submissions, and random chance may be used also. Interviews will be held in-person (CT5133) or online (CT5145). Interviews will last approximately 10 minutes. The purpose of interviews will be to assess students' understanding of their own submission.


### Dataset Credits

The original csv file is from: 

https://www.kaggle.com/datasets/neha1703/movie-genre-from-its-poster

I have added the *year* column for convenience.

I believe most of the information is originally from the famous MovieLens dataset:

* https://grouplens.org/datasets/movielens/
* https://movielens.org/

However, I'm not clear whether the poster download URLs (Amazon AWS URLs) which are in the csv obtained from the Kaggle URL above are from a MovieLens source, or elsewhere.

To create the dataset we are using, I have randomly sampled a small proportion of the URLs in the csv, and downloaded the images. I have removed those which fail to download. Code below also filters out those which are in black and white, ie 1 channel only.

### Imports

You can add more imports if needed.

In [1]:
import numpy as np
import pandas as pd
import math
import os
import random
from PIL import Image
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist, pdist, squareform # useful for distances in the embedding

In [2]:
import tensorflow as tf
import torch
from tensorflow import keras
from keras import layers, models

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'



### Utility functions

These functions are provided to save you time. You might not need to understand any of the details here.

In [3]:
# walk the directory containing posters and read them in. all are the same shape: (268, 182).
# all have 3 channels, with a few exceptions (see below).
# each is named <imdbId>.jpg, which will later allow us to get the metadata from the csv.
IDs = []
images = []
for dirname, _, filenames in os.walk('DL_Sample'):
    for filename in filenames:
        if filename.endswith(".jpg"):
            ID = int(filename[:-4])
            pathname = os.path.join(dirname, filename)
            im = Image.open(pathname)
            imnp = np.array(im, dtype=float)
            if len(imnp.shape) != 3: # we'll ignore a few black-and-white (1 channel) images
                print("This is 1 channel, so we omit it", imnp.shape, filename)
                continue # do not add to our list
            IDs.append(ID)
            images.append(imnp)

This is 1 channel, so we omit it (268, 182) 290031.jpg
This is 1 channel, so we omit it (268, 182) 294266.jpg
This is 1 channel, so we omit it (268, 182) 30337.jpg
This is 1 channel, so we omit it (268, 182) 3626440.jpg
This is 1 channel, so we omit it (268, 182) 50192.jpg
This is 1 channel, so we omit it (268, 182) 54880.jpg
This is 1 channel, so we omit it (268, 182) 57006.jpg


In [4]:
img_array = np.array(images)

In [5]:
img_array.shape

(1254, 268, 182, 3)

In [6]:
# read the csv
df = pd.read_csv("Movie_Genre_Year_Poster.csv", encoding="ISO-8859-1", index_col="Unnamed: 0")
df.head()

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,Poster,Year
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation|Adventure|Comedy,https://images-na.ssl-images-amazon.com/images...,1995.0
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action|Adventure|Family,https://images-na.ssl-images-amazon.com/images...,1995.0
2,113228,http://www.imdb.com/title/tt113228,Grumpier Old Men (1995),6.6,Comedy|Romance,https://images-na.ssl-images-amazon.com/images...,1995.0
3,114885,http://www.imdb.com/title/tt114885,Waiting to Exhale (1995),5.7,Comedy|Drama|Romance,https://images-na.ssl-images-amazon.com/images...,1995.0
4,113041,http://www.imdb.com/title/tt113041,Father of the Bride Part II (1995),5.9,Comedy|Family|Romance,https://images-na.ssl-images-amazon.com/images...,1995.0


In [7]:
df2 = df.drop_duplicates(subset=["imdbId"]) # some imdbId values are duplicates - just drop

In [8]:
df3 = df2.set_index("imdbId") # the imdbId is a more useful index, eg as in the next cell...

In [9]:
df4 = df3.loc[IDs] # ... we can now use .loc to take a subset

In [10]:
df4.shape # 1254 rows matches the image data shape above

(1254, 6)

In [11]:
years = df4["Year"].values
titles = df4["Title"].values

assert img_array.shape[0] == years.shape[0] == titles.shape[0]

In [12]:
def imread(filename):
    """Convenience function: we can supply an ID or a filename.
    We read and return the image in Image format.
    """
    
    if isinstance(filename, (int, np.integer)):
        # assume it's an ID, so create filename
        filename = f"DL_Sample/{filename}.jpg"
        
    # now we can assume it's a filename, so open and read
    im = Image.open(filename)
    
    return im

def imshow(im):
    plt.imshow(im)
    plt.axis('off')
    plt.show()

### Part 1. Create embedding [3 marks]

Use a pretrained model, eg as provided by Keras, to create a flat (ie 1D) embedding vector of some size `embedding_size` for each movie poster, and put all of these together into a single tensor of shape `(n_movies, embedding_size)`.

In [None]:
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.layers import Flatten, Dense, Embedding
from keras.models import Model

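n_movies = img_array.shape[0]
embedding_size = 4 # YOUR CODE HERE
X = tf.zeros((n_movies, embedding_size))
### YOUR CODE HERE
movie_ids = np.array(df4.index) # imdbId of each poster, in the same order as img_array

# apply the VGG16 preprocessing to a float32 copy, so that img_array keeps its raw pixel values
processed_images = preprocess_input(img_array.astype("float32"))

# VGG16 convolutional base, flattened, then projected down to embedding_size dimensions.
# note: the Dense layer is randomly initialised and is not trained here.
model = keras.models.Sequential()
model.add(VGG16(include_top=False, input_shape=img_array[0].shape))
model.add(Flatten())
model.add(Dense(embedding_size))
model = Model(model.input, model.output)

out = model.predict(processed_images)

# the last column of X holds the imdbId, so X has embedding_size + 1 columns.
# float64 is used so that large imdbId values are stored exactly.
id_column = torch.tensor(movie_ids, dtype=torch.float64).unsqueeze(dim=1)
X = torch.cat((torch.tensor(out, dtype=torch.float64), id_column), dim=1)
assert len(X.shape) == 2 # X should be (n_movies, embedding_size + 1), including the ID column
assert X.shape[0] == n_movies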
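
For a 268 x 182 input, the VGG16 convolutional base should produce a feature map of shape `(8, 5, 512)` per poster, so `Flatten` yields 20480 values before the `Dense` projection. One alternative way of reducing this to a flat vector is to average over the two spatial axes. The sketch below illustrates the difference with NumPy on random stand-in feature maps; the array `feature_maps` and the function `global_average_pool` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over height and width: (n, h, w, c) -> (n, c)."""
    return feature_maps.mean(axis=(1, 2))

# stand-in for the output of a convolutional base on 6 posters
# (random values, not real features)
rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(6, 8, 5, 512))

flat = feature_maps.reshape(len(feature_maps), -1)  # what Flatten does
pooled = global_average_pool(feature_maps)

print(flat.shape)    # (6, 20480)
print(pooled.shape)  # (6, 512)
```

Pooling gives a much smaller vector than flattening and discards the spatial layout, which may or may not suit poster similarity.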



### Part 2. Define a nearest-neighbour function [3 marks]

Write a function `def nearest(img, k)` which accepts an image `img`, and returns the `k` movies in the dataset whose posters are most similar to `img` (as measured in the embedding), ranked by similarity. 

In [None]:
def k_nearest(img, k):
    ### YOUR CODE HERE
    # img is the imdbId of a poster that is already in the dataset
    # split X into the ID column and the embedding columns
    index = X[:, -1]
    vector_space = X[:, :-1]
    
    # get the embedding of the query image
    image_embedding = [vector_space[i] for i, idx in enumerate(index) if img == int(idx)]
    
    # Euclidean distance from the query to every embedding
    euclidean_distance = [math.dist(image_embedding[0], embedding) for embedding in vector_space]
        
    # dataset positions sorted from nearest to farthest
    ranked = sorted(range(len(euclidean_distance)), key=lambda i: euclidean_distance[i])
    
    # return the IDs of the k nearest posters, nearest first, excluding the query itself
    return [int(index[i]) for i in ranked if int(index[i]) != img][:k]
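
The same ranking can be computed without a Python loop over the embedding. The following is a minimal sketch on a toy 2-D embedding with made-up IDs; the function name `nearest_in_embedding` and its `exclude_id` parameter are hypothetical and only illustrate the idea.

```python
import numpy as np

def nearest_in_embedding(query_vec, embeddings, ids, k, exclude_id=None):
    """Return the ids of the k rows of `embeddings` closest to `query_vec`, nearest first."""
    # Euclidean distance from the query to every row
    distances = np.linalg.norm(embeddings - query_vec, axis=1)
    # positions sorted by increasing distance
    order = np.argsort(distances, kind="stable")
    ranked_ids = [int(ids[i]) for i in order if ids[i] != exclude_id]
    return ranked_ids[:k]

# toy 2-D embedding with made-up ids
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
ids = np.array([101, 102, 103, 104])

print(nearest_in_embedding(embeddings[0], embeddings, ids, k=2, exclude_id=101))  # [102, 103]
```

Because the function takes an embedding vector rather than an ID, it could in principle also be used for a poster outside the dataset, once that poster has been passed through the embedding model.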

### Part 3: Demonstrate your nearest-neighbour function [4 marks]

Choose any movie poster. Call this the query poster. Show it, and use your nearest-neighbour function to show the 3 nearest neighbours (excluding the query itself). This means **call** the function you defined above.

Write a comment: in what ways are they similar or dissimilar? Do you agree with the choice and the ranking? Why do you think they are close in the embedding? Do you notice, for example, that the nearest neighbours are from a similar era? 


In [None]:
### YOUR CODE HERE
Q_idx = 43313 # YOUR VALUE HERE - DO NOT USE MY VALUE

out = k_nearest(Q_idx, 3)

imshow(imread(Q_idx))

In [None]:
for img in out:
    imshow(imread(img))   

### Part 4: Year regression [5 marks]

Let's investigate the last question ("similar era") above by running **regression** on the year, ie attempt to predict the year, given the poster. Use a train-test split. Build a suitable Keras neural network model for this, **as a regression head on top of the embedding from Part 1**. Include comments to explain the purpose of each part of the model. It should be possible to make a prediction, given a new poster (not part of the original dataset). Write a short comment on model performance: is it possible to predict the year? Based on this result, are there trends over time?

In [None]:
### YOUR CODE HERE
import matplotlib.pyplot as plt

# img_array holds floats in the range 0-255, so convert to uint8 for display
plt.imshow(img_array[3].astype("uint8"))
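
To judge whether a year-regression model has learned anything from the posters, its test error can be compared with a trivial baseline that predicts the mean training year for every poster. The sketch below uses made-up years; the helper names `mean_absolute_error` and `mean_baseline_mae` are assumptions for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mean_baseline_mae(y_train, y_test):
    """Test-set MAE of a model that always predicts the mean training year."""
    baseline = np.mean(y_train)
    return mean_absolute_error(y_test, np.full(len(y_test), baseline))

# made-up years, for illustration only
y_train = np.array([1990.0, 2000.0, 2010.0])
y_test = np.array([1995.0, 2005.0])

print(mean_baseline_mae(y_train, y_test))  # 5.0
```

A regression head whose test MAE is not clearly below this baseline has not found a usable relationship between poster and year.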

### Part 5: Improvements [5 marks]

Propose a possible improvement. Some ideas are suggested below. The chosen improvement must be notified to the lecturer at least 1 week before submission and **must be approved by the lecturer to avoid duplication with other students**. Compare the performance between your original and your new model (the proposed improvement might not actually improve on model performance -- that is ok). Some marks will be awarded for more interesting / challenging / novel improvements.

Ideas:

* Try a different pretrained model for creating the embedding
* Alternative ways of reducing the pretrained model's output to a flat vector for the embedding
* Gather more data (see the csv file for URLs)
* Add different architectural details to the regression head
* Fine-tuning
* Training an end-to-end convnet of your own design (no pretraining)
* Improve the embedding by training a multi-headed model, eg predicting both genre and year
* Create a good visualisation of the embedding.
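
For the visualisation idea, one simple option is to project the embedding onto its first two principal components and draw a scatter plot. This is a minimal sketch using NumPy's SVD on a random stand-in for the embedding; `pca_2d` and `toy_embedding` are hypothetical names, and other projection methods could be used instead.

```python
import numpy as np

def pca_2d(embeddings):
    """Project the rows of `embeddings` onto their first two principal components."""
    centred = embeddings - embeddings.mean(axis=0)
    # rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(50, 4))  # random stand-in for the embedding columns of X
coords = pca_2d(toy_embedding)
print(coords.shape)  # (50, 2)
```

The two columns of `coords` can be passed to `plt.scatter`, for example with the points coloured by year.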


In [None]:
### YOUR CODE HERE

In [None]:
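# fill missing years with 1960 and keep the result (fillna returns a new Series)
df4 = df4.assign(Year=df4['Year'].fillna(1960))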

In [None]:
year = np.array(df4['Year'].tolist())

In [None]:
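# regression model: VGG16 convolutional base -> flatten -> 4-unit embedding -> 1 output (the year)
model = keras.models.Sequential()
model.add(VGG16(include_top=False, input_shape=img_array[0].shape)) # pretrained feature extractor
model.add(Flatten()) # feature map -> 1D vector
model.add(Dense(4)) # embedding layer, same size as in Part 1
model.add(Dense(1)) # linear output: the predicted year
model = Model(model.input, model.output)

# mean squared error is a standard loss for regression
model.compile(optimizer='adam', loss='mse')
model.summary()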
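
Year values are in the thousands, while a freshly initialised `Dense(1)` output is typically close to zero, so the MSE loss starts out very large. One common option, which this notebook does not use, is to standardise the targets with training-set statistics and convert predictions back afterwards. The class `TargetScaler` below is a hypothetical helper sketched under that assumption, shown on made-up years.

```python
import numpy as np

class TargetScaler:
    """Standardise regression targets using statistics from the training set only."""

    def fit(self, y):
        y = np.asarray(y, dtype=float)
        self.mean_ = float(y.mean())
        self.std_ = float(y.std()) or 1.0  # fall back to 1.0 if all targets are equal
        return self

    def transform(self, y):
        return (np.asarray(y, dtype=float) - self.mean_) / self.std_

    def inverse_transform(self, z):
        return np.asarray(z, dtype=float) * self.std_ + self.mean_

years_train = np.array([1980.0, 1990.0, 2000.0, 2010.0])  # made-up values
scaler = TargetScaler().fit(years_train)
z = scaler.transform(years_train)         # roughly zero mean, unit variance
recovered = scaler.inverse_transform(z)   # back on the original year scale
print(z)
print(recovered)
```

With this approach the model would be fitted on `scaler.transform(y_train)`, and its predictions passed through `scaler.inverse_transform` before being reported as years.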

In [None]:
# shuffle images and years with the same permutation so that each poster keeps its label
perm = np.random.permutation(len(img_array))
images_shuffled = np.asarray(img_array)[perm]
year_shuffled = np.asarray(year)[perm]

### 70/20/10 split into training, test and validation sets

train_size = int(len(images_shuffled) * 0.7)
test_size = int(len(images_shuffled) * 0.2)

X_train = images_shuffled[:train_size]
y_train = year_shuffled[:train_size]

X_test = images_shuffled[train_size:train_size + test_size]
y_test = year_shuffled[train_size:train_size + test_size]

# the validation set takes whatever remains (about 10%)
X_val = images_shuffled[train_size + test_size:]
y_val = year_shuffled[train_size + test_size:]

In [None]:
np.mean(year)