# Movie Recommendation (AutoEncoder)

## 1. Dataset

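The data comes from the MovieLens collection published by GroupLens: https://grouplens.org/datasets/movielens/

We use two MovieLens releases: the 1M dataset (`ml-1m`) to inspect the movies, users, and ratings files, and the 100k dataset (`ml-100k`) for the training and test sets.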

## 2. Model

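We build the movie recommender with a stacked autoencoder (SAE): a network trained to reconstruct each user's rating vector through a narrow hidden layer, so that its outputs for unrated movies can be read as predicted ratings.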

## 3. Data Preprocessing

### Import relevant libraries

In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable

### Import Dataset

In [2]:
movies = pd.read_csv('ml-1m/movies.dat', sep='::', header= None, engine='python', encoding='latin-1')

In [3]:
movies

Unnamed: 0,0,1,2
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [4]:
users = pd.read_csv('ml-1m/users.dat', sep='::', header= None, engine='python', encoding='latin-1')

In [5]:
users

Unnamed: 0,0,1,2,3,4
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [6]:
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', header= None, engine='python', encoding='latin-1')

In [7]:
ratings

Unnamed: 0,0,1,2,3
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
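
The `.dat` files carry no header row, which is why the columns above are labelled 0, 1, 2, and so on. As a minimal sketch, the `names` argument of `pd.read_csv` can attach readable labels. The column names used here (`user_id`, `movie_id`, `rating`, `timestamp`) are an assumption based on the field order described in the MovieLens 1M README, and the in-memory sample stands in for `ml-1m/ratings.dat`.

```python
import io
import pandas as pd

# Three rows copied from the ratings output above, in the raw '::'-separated layout.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

# Assumed column names: they are not stored in the file itself.
rating_cols = ["user_id", "movie_id", "rating", "timestamp"]
ratings_named = pd.read_csv(
    sample, sep="::", header=None, names=rating_cols, engine="python"
)

print(ratings_named)
```

The same `names=` argument would apply to `movies.dat` and `users.dat` with their own field lists.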


### Preparing the training set and the test set

In [8]:
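# No header=None here, so pandas uses the first rating row as column names (see the output below) and drops it from the data
training_set = pd.read_csv('ml-100k/u1.base', delimiter='\t')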

In [9]:
training_set

Unnamed: 0,1,1.1,5,874965758
0,1,2,3,876893171
1,1,3,4,878542960
2,1,4,3,876893119
3,1,5,3,889751712
4,1,7,4,875071561
...,...,...,...,...
79994,943,1067,2,875501756
79995,943,1074,4,888640250
79996,943,1188,3,888640250
79997,943,1228,3,888640275


In [10]:
training_set = np.array(training_set, dtype='int')

In [11]:
training_set

array([[        1,         2,         3, 876893171],
       [        1,         3,         4, 878542960],
       [        1,         4,         3, 876893119],
       ...,
       [      943,      1188,         3, 888640250],
       [      943,      1228,         3, 888640275],
       [      943,      1330,         3, 888692465]])

In [12]:
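# As with the training set, the first rating row is consumed as the header because header=None is not passed
test_set = pd.read_csv('ml-100k/u1.test', delimiter='\t')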

In [13]:
test_set

Unnamed: 0,1,6,5,887431973
0,1,10,3,875693118
1,1,12,5,878542960
2,1,14,5,874965706
3,1,17,3,875073198
4,1,20,4,887431883
...,...,...,...,...
19994,458,648,4,886395899
19995,458,1101,4,886397931
19996,459,934,3,879563639
19997,460,10,3,882912371


In [14]:
test_set = np.array(test_set, dtype='int')

In [15]:
test_set

array([[        1,        10,         3, 875693118],
       [        1,        12,         5, 878542960],
       [        1,        14,         5, 874965706],
       ...,
       [      459,       934,         3, 879563639],
       [      460,        10,         3, 882912371],
       [      462,       682,         5, 886365231]])

### Getting the number of users and movies

In [16]:
nb_users = int(max(training_set[:,0].max(), test_set[:,0].max()))
nb_movies = int(max(training_set[:,1].max(), test_set[:,1].max()))

In [17]:
nb_users

943

In [19]:
nb_movies

1682

### Converting the data into an array with users in rows and movies in columns

In [20]:
# Build the user-by-movie rating matrix
def convert(data):
    """
    Return a list of lists with one row per user and one column per movie.
    Each entry is the user's rating for that movie, or 0 if the movie is unrated.
    """
    new_data = []
    for id_users in range(1, nb_users+1):
        id_movies = data[:,1][data[:,0]==id_users]
        id_ratings = data[:,2][data[:,0]==id_users]
        ratings = np.zeros(nb_movies)
        ratings[id_movies-1] = id_ratings
        new_data.append(list(ratings))
    return new_data
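
The same user-by-movie layout can also be built without the Python loop. The following is a minimal sketch, not the notebook's own method: `convert_vectorized` is a hypothetical helper that uses NumPy integer indexing and returns a NumPy array rather than a list of lists. The toy array stands in for the real rating data.

```python
import numpy as np

def convert_vectorized(data, nb_users, nb_movies):
    """Build a (nb_users, nb_movies) rating matrix; 0 marks 'not rated'."""
    matrix = np.zeros((nb_users, nb_movies))
    # MovieLens IDs are 1-based, so shift them to 0-based row/column indices.
    matrix[data[:, 0] - 1, data[:, 1] - 1] = data[:, 2]
    return matrix

# Toy data: columns are user ID, movie ID, rating, timestamp.
toy = np.array([[1, 2, 3, 0],
                [1, 3, 4, 0],
                [3, 1, 5, 0]])
toy_matrix = convert_vectorized(toy, nb_users=3, nb_movies=4)
print(toy_matrix)
```

User 2 has no ratings in the toy data, so the second row stays all zeros, just as users without ratings do in `convert`.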

In [21]:
training_set = convert(training_set)

In [24]:
test_set = convert(test_set)

In [23]:
len(training_set)

943

In [25]:
len(test_set)

943

### Converting the data into Torch tensors

In [27]:
training_set = torch.FloatTensor(training_set)

In [29]:
test_set = torch.FloatTensor(test_set)

In [28]:
training_set

tensor([[0., 3., 4.,  ..., 0., 0., 0.],
        [4., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [5., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 5., 0.,  ..., 0., 0., 0.]])

In [30]:
test_set

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])

## 4. Modelling

### Create the architecture of the Neural Network

We define the autoencoder as a class. A class is a convenient way to bundle many attributes and methods into one object, but the main reason is that it lets us inherit from PyTorch.
Our class, `SAE` (stacked autoencoder), is a child of PyTorch's existing parent class `Module`, so it can use all of the parent's attributes and methods.
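
To illustrate what inheriting from `nn.Module` provides, here is a minimal sketch with a hypothetical two-layer class, `TinyNet` (not the SAE itself). Layers assigned as attributes are registered as parameters automatically, and calling the object runs `forward`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical example class
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(6, 3)
        self.fc2 = nn.Linear(3, 6)

    def forward(self, x):
        return self.fc2(torch.sigmoid(self.fc1(x)))

net = TinyNet()

# Inherited from nn.Module: parameters of attribute layers are tracked for us.
param_names = [name for name, _ in net.named_parameters()]
print(param_names)

# Inherited __call__ dispatches to forward().
out = net(torch.zeros(1, 6))
print(out.shape)
```

This registration is what allows an optimizer to be handed `parameters()` without listing each weight by hand.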

In [31]:
# Create a child class SAE
class SAE(nn.Module):
    def __init__(self):
        super(SAE, self).__init__()
        
        # first fully connected layer (encoder): nb_movies inputs -> 20 hidden units
        self.fc1 = nn.Linear(nb_movies, 20)
        
        # second fully connected layer (encoder): 20 -> 10, the bottleneck
        self.fc2 = nn.Linear(20, 10)
        
        # third fully connected layer (decoder): 10 -> 20
        self.fc3 = nn.Linear(10, 20)
        
        # fourth fully connected layer (decoder output): 20 -> nb_movies
        self.fc4 = nn.Linear(20, nb_movies)
        
        # activation function for the hidden layers (sigmoid)
        self.activation = nn.Sigmoid()
        
    # forward pass: encode the ratings, then decode them
    def forward(self, x):
        # x: the input vector of ratings (one value per movie)
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.activation(self.fc3(x))
        x = self.fc4(x)
        return x
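
The shape of the data at each layer makes the bottleneck visible. This is a small sketch under stated assumptions: it rebuilds the four `nn.Linear` layers in a plain list instead of using the `SAE` class, and hard-codes `nb_movies = 1682`, the value reported earlier for the 100k split.

```python
import torch
import torch.nn as nn

nb_movies = 1682  # value reported earlier in the notebook

layers = [
    nn.Linear(nb_movies, 20),
    nn.Linear(20, 10),
    nn.Linear(10, 20),
    nn.Linear(20, nb_movies),
]

x = torch.zeros(1, nb_movies)  # one user with no ratings, batch size 1
shapes = []
with torch.no_grad():
    for i, layer in enumerate(layers):
        x = layer(x)
        if i < len(layers) - 1:
            x = torch.sigmoid(x)  # no activation on the output layer
        shapes.append(tuple(x.shape))

# Total number of trainable weights and biases across the four layers.
n_params = sum(p.numel() for layer in layers for p in layer.parameters())
print(shapes)
print(n_params)
```

The rating vector is compressed from 1682 values down to 10 and then expanded back to 1682 predicted ratings.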

In [32]:
# Create an object of the SAE class
sae = SAE()

# Define the loss (mean squared error; its square root, the RMSE, is what gets reported)
criterion = nn.MSELoss()

# Choose an optimizer (we use RMSprop)
optimizer = optim.RMSprop(sae.parameters(), lr=0.01, weight_decay=0.5)
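
For reference, a single update with an optimizer like this usually follows the pattern clear gradients, forward, backward, step. The sketch below uses a hypothetical `nn.Linear` stand-in for the SAE and random data. It also shows why `optimizer.zero_grad()` matters: PyTorch adds each new gradient to the stored one until it is cleared.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 4)  # hypothetical stand-in for the SAE
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters(), lr=0.01)

x = torch.rand(1, 4)
target = x.clone()

# Two backward passes without clearing: the second gradient is added to the first.
criterion(model(x), target).backward()
grad_once = model.weight.grad.clone()
criterion(model(x), target).backward()
grad_twice = model.weight.grad.clone()

# One clean update: clear, forward, backward, step.
optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
grad_clean = model.weight.grad.clone()
optimizer.step()

print(torch.allclose(grad_twice, 2 * grad_once))
print(torch.allclose(grad_clean, grad_once))
```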

### Train the SAE

In [33]:
# Number of training epochs
nb_epoch = 200

In [36]:
# Loop over epochs and observations (users)
for epoch in range(1, nb_epoch+1):
    # accumulated RMSE over the users seen in this epoch
    train_loss = 0
    # the number of users that rated at least one movie
    s = 0.
    
    for id_user in range(nb_users):
        # unsqueeze(0) adds the batch dimension the network expects;
        # Variable is a no-op wrapper in PyTorch >= 0.4 and is kept only for compatibility
        input = Variable(training_set[id_user]).unsqueeze(0)
        target = input.clone()
        
        # Skip users who rated no movies: they contribute nothing to the loss, so this saves computation
        if torch.sum(target.data > 0) > 0:
            output = sae(input)
            # Make sure no gradient is computed with respect to the target
            target.requires_grad = False
            # Zero the predictions for unrated movies so they add no error and do not affect the weights
            output[target == 0] = 0
            loss = criterion(output, target)
            
            # MSELoss averages over all nb_movies entries; mean_corrector rescales that average
            # so that it runs over the rated movies only.
            # The 1e-10 keeps the denominator from ever being zero.
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
            
            # Backpropagate. Note that optimizer.zero_grad() is never called, so gradients
            # accumulate across users; the losses printed below were obtained this way.
            loss.backward()
            
            # accumulate this user's RMSE
            train_loss += np.sqrt(loss.data * mean_corrector)
            
            # increment the count of users with ratings
            s += 1.
            
            # update the weights
            optimizer.step()
            
    # print the average RMSE for this epoch
    print('epoch: ' + str(epoch) + ' loss: ' + str(train_loss/s))

epoch: 1 loss: tensor(1.7646)
epoch: 2 loss: tensor(1.0966)
epoch: 3 loss: tensor(1.0537)
epoch: 4 loss: tensor(1.0384)
epoch: 5 loss: tensor(1.0307)
epoch: 6 loss: tensor(1.0265)
epoch: 7 loss: tensor(1.0239)
epoch: 8 loss: tensor(1.0219)
epoch: 9 loss: tensor(1.0208)
epoch: 10 loss: tensor(1.0196)
epoch: 11 loss: tensor(1.0187)
epoch: 12 loss: tensor(1.0186)
epoch: 13 loss: tensor(1.0178)
epoch: 14 loss: tensor(1.0178)
epoch: 15 loss: tensor(1.0172)
epoch: 16 loss: tensor(1.0171)
epoch: 17 loss: tensor(1.0167)
epoch: 18 loss: tensor(1.0165)
epoch: 19 loss: tensor(1.0165)
epoch: 20 loss: tensor(1.0162)
epoch: 21 loss: tensor(1.0161)
epoch: 22 loss: tensor(1.0161)
epoch: 23 loss: tensor(1.0160)
epoch: 24 loss: tensor(1.0157)
epoch: 25 loss: tensor(1.0158)
epoch: 26 loss: tensor(1.0155)
epoch: 27 loss: tensor(1.0155)
epoch: 28 loss: tensor(1.0150)
epoch: 29 loss: tensor(1.0124)
epoch: 30 loss: tensor(1.0112)
epoch: 31 loss: tensor(1.0102)
epoch: 32 loss: tensor(1.0072)
epoch: 33 loss: t
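
The role of `mean_corrector` in the loop above can be checked on a toy example. In this sketch (the five-movie vectors are made up for illustration), the MSE over all entries, rescaled by `mean_corrector`, gives the same RMSE as averaging the squared error over the rated movies directly.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

# Toy example: 5 movies, the user rated only the first two.
target = torch.tensor([[4.0, 2.0, 0.0, 0.0, 0.0]])
output = torch.tensor([[3.0, 4.0, 1.5, 2.5, 3.5]])

# Ignore predictions for unrated movies, as the training loop does.
masked_output = output.clone()
masked_output[target == 0] = 0

n_rated = int((target > 0).sum())
mean_corrector = target.shape[1] / n_rated  # 5 movies / 2 rated

# RMSE via the rescaled full-vector MSE.
rmse_scaled = torch.sqrt(criterion(masked_output, target) * mean_corrector)
# RMSE computed directly on the rated entries only.
rmse_direct = torch.sqrt(((output - target)[target > 0] ** 2).mean())

print(rmse_scaled.item(), rmse_direct.item())
```

Both values agree, so the per-epoch loss reported above is an RMSE in rating units over rated movies only.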

## 5. Evaluation

### Test the SAE

In [39]:
# accumulated RMSE over the test users
test_loss = 0
# the number of users that have at least one rating in the test set
s = 0.

for id_user in range(nb_users):
    # The input is the user's training ratings: the network predicts ratings for movies
    # the user has not rated there, and these are compared with the held-out test ratings
    input = Variable(training_set[id_user]).unsqueeze(0)
    target = Variable(test_set[id_user]).unsqueeze(0)

    # Test the prediction
    if torch.sum(target.data > 0) > 0:
        output = sae(input)
        # Make sure no gradient is computed with respect to the target
        target.requires_grad = False
        
        # Only movies rated in the test set count towards the error
        output[target == 0] = 0
        loss = criterion(output, target)

        # Rescale the MSE so that it averages over the rated test movies only
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)

        # accumulate this user's RMSE
        test_loss += np.sqrt(loss.data * mean_corrector)

        # increment the count
        s += 1.

# print the average test RMSE
print('loss: ' + str(test_loss/s))

loss: tensor(0.9478)
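
Once the network predicts ratings, turning them into recommendations is a matter of ranking the movies a user has not rated yet. The following is a minimal sketch under stated assumptions: an untrained `nn.Sequential` stands in for the trained `sae`, the toy catalogue has 8 movies, and `k` and `recommended_movie_ids` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nb_movies_toy = 8  # toy size, not the real 1682
model = nn.Sequential(  # untrained stand-in for `sae`
    nn.Linear(nb_movies_toy, 4), nn.Sigmoid(), nn.Linear(4, nb_movies_toy)
)

# One user's ratings; 0 means "not rated".
user_ratings = torch.tensor([[5., 0., 3., 0., 0., 4., 0., 0.]])

with torch.no_grad():  # inference only, no gradients needed
    predicted = model(user_ratings)

# Exclude movies the user has already rated, then take the k best predictions.
predicted[user_ratings > 0] = float('-inf')
k = 3
top_scores, top_idx = torch.topk(predicted, k, dim=1)
recommended_movie_ids = (top_idx[0] + 1).tolist()  # back to 1-based MovieLens IDs
print(recommended_movie_ids)
```

With the real model, `user_ratings` would be a row of `training_set` and the resulting IDs could be looked up in the movies table.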
