<a href="https://colab.research.google.com/github/shahhassansh/Project_Movie_Recommendation_System_in_Pytorch/blob/master/An_Autoencocder_Model_for_Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **A Movie Recommandation System With Pytorch using Autoencoder Model**
In this project, we try to predict the ratings a user will give to an unseen movie, based on the ratings he gave to other movies. We will use the movielens dataset. We will use AutoEncoders to create our recommandation system.





In [0]:
# Importing the libraries

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
from torch.autograd import Variable
from torchvision import datasets, transforms
from torch.utils.data import Dataset

In [0]:
# Dataloader Class

batch_size  = 500

class DatasetR(Dataset):

    def __init__(self, training_set, nb_users, transform=None):
        super(DatasetR, self).__init__()

        self.training_set = training_set
        self.nb_users = nb_users
    def __len__(self):
        return self.nb_users

    def __getitem__(self, idx):

        sample = self.training_set[idx]

        return sample

The MovieLens data contains:
- item.data : contains informations related to a movie.
- user.data : contains informations related to a user
- u.data : contains the ratings given by users on different movies

In [72]:
# Importing Ratings

from google.colab import files
uploaded = files.upload()

Saving u.data to u (1).data


In [73]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp'] 
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, header = None, engine = 'python', encoding = 'latin-1') 
ratings


Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
...,...,...,...,...
99995,880,476,3,880175444
99996,716,204,5,879795543
99997,276,1090,1,874795795
99998,13,225,2,882399156


In [74]:
# Importing Movies

from google.colab import files
uploaded = files.upload()

Saving u.item to u (1).item


In [75]:
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


movies = pd.read_csv('u.item',  sep='|', names=i_cols, encoding='latin-1')

movies

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998,,http://us.imdb.com/M/title-exact?Mat%27+i+syn+...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1678,1679,B. Monkey (1998),06-Feb-1998,,http://us.imdb.com/M/title-exact?B%2E+Monkey+(...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998),0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [76]:
# Importing Users

from google.colab import files
uploaded = files.upload()

Saving u.user to u (1).user


In [77]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('u.user', sep='|', names=u_cols,
 encoding='latin-1')

users

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
...,...,...,...,...,...
938,939,26,F,student,33319
939,940,32,M,administrator,02215
940,941,20,M,student,97229
941,942,48,F,librarian,78209


Ratings data (u.data) is the main dataset. We train the autoencoder using a subset of the data. **u1.base** is the data we use to train the model. We then test our algorithm using the **u1.test**.

In [78]:
# Importing Train Dataset (Subset of Ratings)

from google.colab import files
uploaded = files.upload()

Saving u1.base to u1 (1).base


In [79]:
# Importing Test Dataset (Subset of Ratings)

from google.colab import files
uploaded = files.upload()

Saving u1.test to u1 (1).test


# Preparing the training set and the test set


In [0]:
training = ['user_id', 'movie_id', 'rating', 'timestamp' ] 
test = ['user_id', 'movie_id', 'rating', 'timestamp' ]   

In [0]:
training_set = pd.read_csv('u1.base', names=training, delimiter = '\t') # Read the file
test_set = pd.read_csv('u1.test', names=test, delimiter = '\t') #Read the file

In [82]:
# Visualizing some elements of the training_set

training_set[130:140]

Unnamed: 0,user_id,movie_id,rating,timestamp
130,1,263,1,875693007
131,1,268,5,875692927
132,1,269,5,877482427
133,1,270,5,888732827
134,1,271,2,887431672
135,2,1,4,888550871
136,2,10,2,888551853
137,2,14,4,888551853
138,2,25,4,888551648
139,2,100,5,888552084


In [83]:
# Visualizing some elements of the training_set

test_set[130:140]

Unnamed: 0,user_id,movie_id,rating,timestamp
130,1,260,1,875071713
131,1,262,3,875071421
132,1,264,2,875071713
133,1,265,4,878542441
134,1,266,1,885345728
135,1,267,4,875692955
136,1,272,3,887431647
137,2,13,4,888551922
138,2,19,3,888550871
139,2,50,5,888552084


Since, Pytorch expects the sata as numpy arrays therefore, we convert our dataframe into numpy arrays.

In [0]:
# Converting the training and test sets into numpy arrays

training_set = np.array(training_set, dtype = 'int')
test_set = np.array(test_set, dtype = 'int')

We need the number of users and movies. Since the data is divided into training and test set, the maximum value of id_user/id_movie is either in the training_set or in the test_set.

In [85]:
# Getting the number of users and movies

nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
print("Number of users: {}".format(nb_users))
print("Number of movies: {}".format(nb_movies))

Number of users: 943
Number of movies: 1682


Then, we create the data as list of lists, expected by Pytorch. Each list of list contains the ratings that a specific user gave to the movies. If a user didn't rate a movie, we just add a 0 for that observation. We define a **convert** function which creates this list of list for us.



In [0]:
def convert(data):
    new_data = [] 
    for id_users in range(1, nb_users + 1):
        id_movies = data[:, 1][data[:, 0] == id_users] #The id of the movies rated by the current user.
        id_ratings = data[:, 2][data[:, 0] == id_users] #The id of the ratings given by the current_user.
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data

training_set = convert(training_set)
test_set = convert(test_set)

In [87]:
# Converting the train data into Torch tensors

training_set = torch.FloatTensor(training_set)
training_set.shape


torch.Size([943, 1682])

In [88]:
# Converting the test data into Torch tensors

test_set = torch.FloatTensor(test_set)
test_set.shape

torch.Size([943, 1682])

# Building the  Autoencoder (AE) Model

Then, we create an autoencoder. As can be seen below, the number of nodes of the output layer should equal the number of nodes of the input layer.


![autoencoder](https://miro.medium.com/max/3524/1*oUbsOnYKX5DEpMOK3pH_lg.png)

In [0]:
## The Autoencoder (AE) class.

class AE(nn.Module):
    def __init__(self, ):
        # making the class get all the functions from the parent class nn.Module
        super(AE, self).__init__()
        # Creating the first encoding layer. From nb_movies to 20 outputs.
        self.fc1 = nn.Linear(nb_movies, 20)
        self.bn1 = nn.BatchNorm1d(20)
        self.do1 = nn.Dropout(0.2)
        # Creating the second encoding layer. From 20 inputs to 10 outputs
        self.fc2 = nn.Linear(20, 10)
        self.bn2 = nn.BatchNorm1d(10)
        self.do2 = nn.Dropout(0.2)
        # Creating the first decoding layer. From 10 inputs to 20 outputs
        self.fc3 = nn.Linear(10, 20)
        self.bn3 = nn.BatchNorm1d(20)
        self.do3 = nn.Dropout(0.2)
        # Creating the second hidden layer. From 20 inputs to nb_movies outputs
        self.fc4 = nn.Linear(20, nb_movies)
        # Creating the activation fucntion. 
        self.activation = nn.Sigmoid()
        
        # Creating the function for forward propagation
    def forward(self, x):
        x = self.bn1(self.activation(self.fc1(x)))
        x = self.bn2(self.activation(self.fc2(x)))
        x = self.bn3(self.activation(self.fc3(x)))
        x = self.fc4(x)
        return x

# Training the Model

In [0]:
 #Creating an instance of the AE class

ae = AE()

datasetTrain = DatasetR(training_set = training_set, nb_users = nb_users)
train_loader = torch.utils.data.DataLoader(datasetTrain, batch_size = batch_size, shuffle=True, num_workers=4, drop_last=True)

datasetTest = DatasetR(training_set = test_set, nb_users = nb_users)
test_loader = torch.utils.data.DataLoader(datasetTest, batch_size = batch_size, shuffle=True, num_workers=4, drop_last=True)

# Defining a Root Mean Square Error Loss Function.
def RMSELoss(yhat,y):
    return torch.sqrt(torch.mean((yhat-y)**2))

criterion = RMSELoss
# Defining the algorithm used to minimize the loss function.
optimizer = optim.RMSprop(ae.parameters(), lr = 0.01, weight_decay = 0.5)

In [67]:

nb_epoch = 1000
for epoch in range(1, nb_epoch + 1):
    train_loss = 0
    s = 0.
    for batch_idx, (sample) in enumerate(train_loader):
        input = Variable(sample)
        target = input.clone()
        # We don't consider movies NOT rated by the current user.
        if torch.sum(target.data > 0) > 0:
            output = ae(input)
            target.require_grad = False
            output[target == 0] = 0
            loss = criterion(output, target)
            mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
            loss.backward()
            train_loss += np.sqrt(loss.data*mean_corrector)
            s += 1.
            # Updating the weights of the network.
            optimizer.step()
    print('epoch: '+str(epoch)+' loss: '+str(train_loss/s))



epoch: 1 loss: tensor(0.1843)
epoch: 2 loss: tensor(0.1804)
epoch: 3 loss: tensor(0.1836)
epoch: 4 loss: tensor(0.1805)
epoch: 5 loss: tensor(0.1799)
epoch: 6 loss: tensor(0.1801)
epoch: 7 loss: tensor(0.1800)
epoch: 8 loss: tensor(0.1845)
epoch: 9 loss: tensor(0.1797)
epoch: 10 loss: tensor(0.1797)
epoch: 11 loss: tensor(0.1803)
epoch: 12 loss: tensor(0.1802)
epoch: 13 loss: tensor(0.1795)
epoch: 14 loss: tensor(0.1830)
epoch: 15 loss: tensor(0.1801)
epoch: 16 loss: tensor(0.1834)
epoch: 17 loss: tensor(0.1804)
epoch: 18 loss: tensor(0.1800)
epoch: 19 loss: tensor(0.1828)
epoch: 20 loss: tensor(0.1819)
epoch: 21 loss: tensor(0.1795)
epoch: 22 loss: tensor(0.1834)
epoch: 23 loss: tensor(0.1795)
epoch: 24 loss: tensor(0.1830)
epoch: 25 loss: tensor(0.1789)
epoch: 26 loss: tensor(0.1822)
epoch: 27 loss: tensor(0.1779)
epoch: 28 loss: tensor(0.1821)
epoch: 29 loss: tensor(0.1820)
epoch: 30 loss: tensor(0.1841)
epoch: 31 loss: tensor(0.1799)
epoch: 32 loss: tensor(0.1796)
epoch: 33 loss: t

After a training of 1000 epochs, the train loss is 0.9124. That means for the training_set, we have : **predicted_rating - 0.0924 <= real_rating <= predicted_rating + 0.0924**

# Testing the Model

In [69]:
test_loss = 0
s = 0.
for input1,target1 in zip(train_loader,test_loader):
    input = Variable(input1)
    target = Variable(target1)
    if torch.sum(target.data > 0) > 0:
        output = ae(input)
        target.require_grad = False
        output[(target == 0)] = 0
        loss = criterion(output, target)
        mean_corrector = nb_movies/float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.data*mean_corrector)
        s += 1.
print('test_loss: '+str(test_loss/s))

test_loss: tensor(0.1380)


The test_loss is 0.138. Therefore, for this specific test_set, we have : **predicted_rating - 0.138 <= real_rating <= predicted_rating + 0.138**

#Few more results. 

1) With RMSELoss + Batch_size=200 + batchnorm + Epoch 200 + LR=0.01:
Training Loss is 0.1427, Test Loss: 0.2189

2) With RMSELoss + Batch_size=500 + batchnorm + Epoch 200 + LR=0.01:
Training Loss is 0.1354, Test Loss: 0.2056

3) With RMSELoss + Batch_size=500 + batchnorm + Dropout = 0.2 + Epoch 200  + LR=0.01:
Training Loss is 0.1470 , Test Loss: 0.2186

4) With RMSELoss + Batch_size=500 + batchnorm + Epoch 500  + LR=0.01:
Training Loss is 0.0941 , Test Loss: 0.1412

**5) With RMSELoss + Batch_size=500 + batchnorm + Epoch 1000  + LR=0.01:  
Training Loss is 0.0924 , Test Loss: 0.138**

6) With RMSELoss + Batch_size=1000 + batchnorm + Epoch 500  + LR=0.001:
Training Loss is 0.0877 , Test Loss: 0.1424