# Recommendation system by poster images with convolutional neural networks

This notebook presents a recommendation system that recommends movies based on the similarity of their posters. Image similarity is calculated as follows:
1. Load a convolutional neural network pretrained on ImageNet
2. Calculate the network's output (a feature vector) for each poster image
3. Calculate the cosine similarity between these feature vectors.

The features are extracted with three different pretrained architectures:
* VGG16
![](https://media.geeksforgeeks.org/wp-content/uploads/20200219152207/new41.jpg)
* ResNet50
![](https://i.stack.imgur.com/gI4zT.png)
* InceptionV3
![](https://www.researchgate.net/profile/Masoud-Mahdianpari/publication/326421398/figure/fig6/AS:649353890889730@1531829440919/Schematic-diagram-of-InceptionV3-model-compressed-view.png)


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Collect the paths of all poster images in the read-only input directory

import os
image_paths = []
for dirname, _, filenames in os.walk('/kaggle/input/movielensposters'):
    for filename in filenames:
        image_paths.append(os.path.join(dirname, filename))


In [None]:
import matplotlib.pyplot as plt
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess_input
from keras.preprocessing import image as kimage
from keras.applications.resnet import preprocess_input as resnet50_preprocess_input

## Common functions

### Used data

Because the posters had to be scraped and there are many of them, only the movies with the most ratings are used.

In [None]:
df_links = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/links.csv")
df_movies = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/movies.csv")
ratings_df = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/ratings.csv")
ratings_df.drop(columns = ["timestamp"], inplace=True)

### Data preprocessing
The following preprocessing steps are performed:
* Select the movies with more than 5,000 ratings
* Get their TMDB ids, so that they can be joined with the posters
* Resize the images to 224×224 pixels

In [None]:
df_links = df_links.merge(df_movies, on="movieId")

In [None]:
df_links.fillna(0, inplace=True)
df_links["tmdbId"] = df_links["tmdbId"].astype(int)
df_links

In [None]:
ratings_df["movie_freq"] = ratings_df.groupby("movieId")["movieId"].transform('count')
MOVIE_FREQ_LIMIT = 5000
ratings_df = ratings_df.loc[(ratings_df["movie_freq"] > MOVIE_FREQ_LIMIT)]
most_popular_film_ids = ratings_df["movieId"].unique()
most_popular_film_ids.sort()
useful_links_df = df_links.loc[df_links["movieId"].isin(most_popular_film_ids)]
useful_links_df
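The popularity filter above can also be written with `value_counts`. The following is a minimal sketch on made-up data; `toy_ratings` and `TOY_FREQ_LIMIT` are hypothetical names used only for illustration, and the threshold is lowered so that the toy table produces a non-empty result.

```python
import pandas as pd

# Toy ratings table (made-up data, not MovieLens).
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 1],
    "movieId": [10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

TOY_FREQ_LIMIT = 1  # hypothetical threshold; the notebook uses 5000

# Count the ratings per movie and keep the ids strictly above the limit.
rating_counts = toy_ratings["movieId"].value_counts()
popular_ids = sorted(rating_counts[rating_counts > TOY_FREQ_LIMIT].index.tolist())
print(popular_ids)
```

Movie 30 has a single rating, so it is dropped, which mirrors how movies with 5,000 ratings or fewer are excluded in the notebook.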

In [None]:
normal_images = []
image_size = 224
def read_images(img_path, img_height=image_size, img_width=image_size):
    image = load_img(img_path, target_size=(img_height, img_width))
    return image


In [None]:
for train_img in image_paths:
    normal_images.append(read_images(train_img))

Preprocess each image with the function required by the chosen model.

In [None]:
def prep_images(preprocessing_function):
    result = []
    for image in normal_images:
        img = img_to_array(image)
        img = np.expand_dims(img, axis=0)
        img = preprocessing_function(img)
        result.append(img)
    return result

### Training

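Creates a feature vector for each image and returns the matrix `similarity_deep`, which holds the cosine similarity between every pair of feature vectors. No weights are updated here: the pretrained models are only used for inference.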

In [None]:
def train(feature_num, model):
    """
    Creates a feature vector for each image and computes the cosine
    similarity between every pair of feature vectors.
    Reads the preprocessed images from the global variable `images`.
    params:
    feature_num : length of the flattened model output for one image
    model: pretrained model used to extract the features
    returns: similarity matrix between the posters
    """
    total_movies = len(image_paths)

    prediction = [0]*total_movies
    matrix_res = np.zeros([total_movies,feature_num])
    for i in range(total_movies):
        prediction[i] = model.predict(images[i]).ravel()
        matrix_res[i,:] = prediction[i] 

    # Cosine similarity: dot products divided by the norms of both vectors
    similarity_deep = matrix_res.dot(matrix_res.T)
    norms = np.array([np.sqrt(np.diagonal(similarity_deep))])
    similarity_deep = similarity_deep / norms / norms.T
    
    return similarity_deep
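The similarity step in `train` is a cosine similarity between the rows of the feature matrix. As a minimal sketch, the same quantity can be computed by normalising the rows first. The function name `cosine_similarity_matrix`, the toy features, and the guard against all-zero rows are assumptions added for illustration and are not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of `features`."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against all-zero rows (an added safeguard, not in the notebook).
    unit_rows = features / np.clip(norms, 1e-12, None)
    return unit_rows @ unit_rows.T

# Toy feature vectors: rows 0 and 1 point in the same direction, row 2 is orthogonal.
toy_features = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy_features)
print(sim)
```

Rows 0 and 1 differ only in length, so their similarity is 1, which shows that the measure ignores the magnitude of the feature vectors.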

Displays the posters most similar to the movie with index `idx`.

In [None]:
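def predict_by_id(idx, similarity_deep):
    """
    Shows the query poster followed by its most similar posters.
    params:
    idx : index of the movie in image_paths / normal_images
    similarity_deep: similarity between the posters in matrix format
    return: None; displays the six posters at positions 2 to 7 of the
            similarity ranking (position 0 is normally the query itself)
    """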
    print("Actual movies")
    plt.imshow(normal_images[idx])
    plt.show()
    similar_movies = list(enumerate(similarity_deep[idx]))
    most_similar_movies_idx = sorted(range(len(similar_movies)), key=lambda k: similar_movies[k][1], reverse=True)
    
    for i in range(2,8):
        print("Predicted movies")
        plt.imshow(normal_images[most_similar_movies_idx[i]])
        plt.show()
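The ranking step can be separated from the plotting. The sketch below uses `np.argsort` and excludes the query explicitly instead of skipping the first positions of the sorted list, so it is a variation on `predict_by_id` rather than a copy of it. The name `top_k_similar` and the toy matrix are hypothetical.

```python
import numpy as np

def top_k_similar(similarity, idx, k=6):
    """Return the indices of the k rows most similar to row `idx`, excluding `idx`."""
    scores = np.asarray(similarity[idx], dtype=float).copy()
    scores[idx] = -np.inf  # never recommend the query poster itself
    order = np.argsort(-scores, kind="stable")  # highest similarity first
    return order[:k].tolist()

# Toy symmetric similarity matrix (made-up values).
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
top_two = top_k_similar(toy_similarity, idx=0, k=2)
print(top_two)  # → [2, 3]
```

Returning indices rather than drawing images makes the ranking easy to test and to reuse, for example when comparing the three models.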

In [None]:
def predict_by_title(title, similarity_deep):
    tmdbId = useful_links_df[useful_links_df["title"].str.contains(title)]["tmdbId"].iloc[0]
    idx = image_paths.index("/kaggle/input/movielensposters/%s.jpg" % tmdbId)
    predict_by_id(idx, similarity_deep)
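`predict_by_title` raises an `IndexError` when no title matches, and `str.contains` treats its argument as a regular expression by default, so characters such as parentheses have a special meaning. A more defensive lookup might look like the sketch below. The helper `find_tmdb_id` and the ids in `toy_links` are made up for illustration.

```python
import pandas as pd

def find_tmdb_id(links_df, title):
    """Return the tmdbId of the first title containing `title`, or None if there is no match."""
    # regex=False matches the text literally; case=False ignores upper/lower case.
    mask = links_df["title"].str.contains(title, case=False, regex=False)
    matches = links_df.loc[mask, "tmdbId"]
    return int(matches.iloc[0]) if not matches.empty else None

toy_links = pd.DataFrame({
    "title": ["American Pie (1999)", "Spider-Man (2002)"],
    "tmdbId": [111, 222],  # made-up ids
})
print(find_tmdb_id(toy_links, "american pie"))
print(find_tmdb_id(toy_links, "Unknown Movie"))
```

Returning `None` lets the caller decide how to report a missing title instead of failing inside the lookup.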

## Predictions

### VGG16

In [None]:
images = prep_images(vgg16_preprocess_input)

In [None]:
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, Dense
from keras.models import Model

vgg16_model = VGG16(include_top=False, weights='imagenet')

In [None]:
similarity_vgg = train(25088, vgg16_model)
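The `feature_num` values passed to `train` (25088 here, 100352 and 51200 for the other two models) are the sizes of the flattened final feature maps. As a quick check, the arithmetic can be written out. The shapes below are stated for 224×224 RGB inputs with `include_top=False`; they are entered by hand in this sketch, not read from the models, and `FEATURE_MAP_SHAPES` is a hypothetical name.

```python
# Final feature-map shapes (height, width, channels) for 224x224 RGB inputs
# with include_top=False. Entered by hand for this sketch, not read from the models.
FEATURE_MAP_SHAPES = {
    "VGG16": (7, 7, 512),
    "ResNet50": (7, 7, 2048),
    "InceptionV3": (5, 5, 2048),
}

def flattened_size(shape):
    """Number of values after flattening a (height, width, channels) feature map."""
    height, width, channels = shape
    return height * width * channels

feature_sizes = {name: flattened_size(shape) for name, shape in FEATURE_MAP_SHAPES.items()}
print(feature_sizes)
# → {'VGG16': 25088, 'ResNet50': 100352, 'InceptionV3': 51200}
```

A different input resolution would change the spatial dimensions and therefore the value of `feature_num`.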

In [None]:
predict_by_id(808, similarity_vgg)

In most cases VGG16 produced the best results. The example below shows the algorithm proposing another Spider-Man movie as the most similar one, even though the colours of the two posters are quite different:

![](https://i.ibb.co/1GX1f3H/Screenshot-2021-12-12-at-17-06-22.png)

### ResNet50

In [None]:
images = prep_images(resnet50_preprocess_input)

In [None]:
from keras.applications.resnet import ResNet50
from keras.layers import Input, Flatten, Dense
from keras.models import Model

resnet50_model = ResNet50(include_top=False, weights='imagenet')

In [None]:
similarity_resnet = train(100352, resnet50_model)

In [None]:
predict_by_id(808, similarity_resnet)

In [None]:
predict_by_title("American Pie", similarity_resnet)

ResNet50 typically produced the worst results. In the example shown, the genre and the mood of the two movies are quite different, and the posters do not look very similar. The only relevant shared feature is the striped clothing worn by the people on both posters.

![](https://i.ibb.co/xCzRq4r/Screenshot-2021-12-12-at-17-06-32.png)

### Inception

In [None]:
from keras.applications.inception_v3 import preprocess_input as inception_preprocess_input

In [None]:
images = prep_images(inception_preprocess_input)

In [None]:
from keras.applications.inception_v3 import InceptionV3
from keras.layers import Input, Flatten, Dense
from keras.models import Model

inception_model = InceptionV3(include_top=False, weights='imagenet')

In [None]:
similarity_inception = train(51200, inception_model)

In [None]:
predict_by_title("American Pie", similarity_inception)

In [None]:
predict_by_id(808, similarity_inception)

The Inception model was the second best in most cases. In the example below, the mood and the genre of the two movies seem to be quite similar:

![](https://i.ibb.co/FWgJR4G/Screenshot-2021-12-12-at-17-06-48.png)

## Conclusion

Based on visual inspection of the examples, the models appear to rank as follows:
1. VGG16
2. InceptionV3
3. ResNet50

However, these results are not definitive: a clear metric should be defined to rank the models properly and to guide hyperparameter optimization.
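One possible metric, offered only as an assumption and not something the notebook evaluates, is the average genre overlap between a movie and its recommendations. The sketch below relies on the pipe-separated `genres` strings of the MovieLens `movies.csv`; the function names and the toy inputs are hypothetical.

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard overlap between two pipe-separated genre strings."""
    set_a, set_b = set(genres_a.split("|")), set(genres_b.split("|"))
    return len(set_a & set_b) / len(set_a | set_b)

def mean_genre_overlap(query_genres, recommended_genres):
    """Average Jaccard overlap between a query movie and its recommendations."""
    scores = [genre_jaccard(query_genres, genres) for genres in recommended_genres]
    return sum(scores) / len(scores)

# Toy example: one recommendation shares all genres with the query, the other shares none.
score = mean_genre_overlap("Action|Sci-Fi", ["Action|Sci-Fi", "Comedy"])
print(score)  # → 0.5
```

Averaging this score over all query movies would give one number per model. Genre overlap is only a proxy, since posters can look alike for reasons unrelated to genre.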