Let's create a model that uses PyTorch's nn.Module to build an autoencoder. An autoencoder is trained to reproduce its input at its output: it encodes the input into a smaller hidden representation and then decodes that representation back to the original size.
Refer to SAE.py for the implementation used here.

We start by importing all the necessary modules.

In [1]:
import numpy as np
import pandas as pd
import pickle
import torch
import torch.nn as nn
import torch.utils.data
import torch.optim as optim
import torch.nn.parallel as parallel
from torch.autograd import Variable
from SAE import SAE

NumPy and pandas will help us with arithmetic and with handling dataframes, respectively. Torch will be used for creating our autoencoder, training it and testing the results.
The last line imports the Stacked AutoEncoder class we’ve created. If we prefer, we can move the training logic into the class itself as methods, but it is left here to make it easier to experiment with optimizers and training logic.
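
SAE.py is not reproduced in this notebook, so the following is only a minimal sketch of what a stacked autoencoder built on nn.Module could look like. The class name SketchSAE, the hidden layer sizes (20 and 10) and the sigmoid activations are assumptions made for illustration, not the contents of SAE.py.

```python
import torch
import torch.nn as nn


class SketchSAE(nn.Module):
    """Hypothetical stand-in for the SAE class; layer sizes are assumptions."""

    def __init__(self, n_books, hidden=(20, 10)):
        super().__init__()
        h1, h2 = hidden
        # Encoder: compress the ratings row into a small hidden representation.
        self.encoder = nn.Sequential(
            nn.Linear(n_books, h1), nn.Sigmoid(),
            nn.Linear(h1, h2), nn.Sigmoid(),
        )
        # Decoder: expand back to one value per book (no final activation).
        self.decoder = nn.Sequential(
            nn.Linear(h2, h1), nn.Sigmoid(),
            nn.Linear(h1, n_books),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = SketchSAE(n_books=8)
ratings_row = torch.tensor([[5., 0., 3., 0., 0., 4., 0., 1.]])  # one user, 0 = unrated
reconstruction = model(ratings_row)
print(reconstruction.shape)  # one predicted value per book
```

The only requirement the rest of the notebook places on the class is that it accepts n_books and returns an output with the same shape as its input.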

Before we move onto the neural network, let’s understand the dataset. This is a Kaggle dataset that uses GoodReads ratings of multiple users for ten thousand popular books. More information about the dataset can be found at https://www.kaggle.com/zygmunt/goodbooks-10k.

The file books.csv contains metadata about each book, such as title, year of publication, language and author, among others. All we need from this file are the title, language and ID. ratings.csv contains around 100 ratings for each book, each ranging from one to five.

In [42]:
ratings = pd.read_csv('Goodreads_Ratings/ratings.csv')
books = pd.read_csv('Goodreads_Ratings/books.csv', usecols=['title', 'language_code', 'book_id', 'id'])

books_rat_id = set(ratings.book_id)
books = books[books.language_code.isin(['en', 'eng', 'en-CA', 'en-US', 'en-GB'])].sort_values(by=['title'])
books_eng = list(books.id)
# books_eng = books.sort_values(by=['title']).loc[books.language_code.isin(['en', 'eng', 'en-CA', 'en-US', 'en-GB']), 'id'].append(books.loc[books.language_code.isna(), 'id'])

ratings = ratings.query("book_id in @books_eng").sort_values(by=['book_id'])
print(ratings.head())

    book_id  user_id  rating
0         1      314       5
72        1    33890       3
71        1    33872       5
70        1    33716       5
69        1    33697       4


The dataset contains much more information that would be useful for a user recommendation system, such as the book’s author and year of publication. There is also a file containing tags for the books and another listing books that users have marked as to-read. These have been excluded here in order to focus on the autoencoder’s ability to predict books a user might like based only on the ones they have rated before. They could be added as features to the input vector after sufficient cleaning. We also use only English books for consistency.

Having many books with the same title is also an issue, which we solve by combining all ratings of the same title under a single book ID.

In [70]:
rows = []                                   # rating rows collected in a list (DataFrame.append was removed in pandas 2.0)
title_ids = {}                              # title -> ID given to the first rated book with that title
id_ctr = 0

for i, buk in books.iterrows():
    buk_ratings = ratings[ratings['book_id'] == buk.id]
    found_id = title_ids.get(buk.title)     # None unless this title already has an ID

    if found_id is not None:
        print("Duplicate found for\n", buk, "\nID found:", found_id, "\n")

    book_idx = id_ctr if found_id is None else found_id
    for j, b_r in buk_ratings.iterrows():
        rows.append({'ID': book_idx, 'title': buk.title, 'user_id': b_r.user_id, 'rating': b_r.rating})

    if found_id is None:
        if len(buk_ratings) > 0:
            title_ids[buk.title] = id_ctr   # only titles that actually have ratings can be matched later
        id_ctr += 1

eng_ratings_df = pd.DataFrame(rows, columns=['ID', 'title', 'user_id', 'rating'])
    
n_users = int(max(eng_ratings_df.user_id))
n_books = int(max(eng_ratings_df.ID)) + 1

print(n_users, n_books)
print(eng_ratings_df)

Duplicate found for
 id                        349
book_id                 11590
title            'Salem's Lot
language_code           en-GB
Name: 348, dtype: object 
ID found: 4 

Duplicate found for
 id                  6481
book_id           384597
title            Arcadia
language_code      en-US
Name: 6480, dtype: object 
ID found: 568 

Duplicate found for
 id                              579
book_id                      197084
title            Are You My Mother?
language_code                   eng
Name: 578, dtype: object 
ID found: 573 

Duplicate found for
 id                 3402
book_id          739840
title             Bambi
language_code       eng
Name: 3401, dtype: object 
ID found: 693 

Duplicate found for
 id                                                    3846
book_id                                           12283261
title            Between the Lines (Between the Lines, #1)
language_code                                          eng
Name: 3845, dtype: object 
ID f

We need to combine the books and ratings under a new index because keeping only English books leaves gaps in the original IDs, which would cause indexing problems while training the model. Moreover, some books were observed to be present in ratings.csv but missing from books.csv, and building a new index excludes them. This will also make future prediction easier, when we take ratings from users through a web application and need to work out the correct order in which to send them as input to the autoencoder.

The structure eng_ratings_df contains only English books, sorted by title, together with all of their ratings from the ratings dataframe. We store this in a file (eng_books_ratings.csv) so that we don’t have to process the datasets again each time we wish to experiment with the autoencoder. Another file (eng_books_sorted.csv) contains just the titles of the selected books, so that the Flask application can retrieve them quickly and display them on the HTML form.
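
The re-indexing idea can also be sketched without an explicit loop. The example below uses tiny made-up frames in place of books.csv and ratings.csv (the values are illustrative only), numbers the sorted unique titles with pd.factorize, and uses an inner merge so that ratings for books absent from the books frame are dropped. It is an alternative formulation under those assumptions, and its numbering may differ in edge cases from the loop used in this notebook.

```python
import pandas as pd

# Toy stand-ins for books.csv and ratings.csv (values are made up).
books = pd.DataFrame({'id': [11, 12, 13, 14],
                      'title': ['Arcadia', 'Bambi', 'Arcadia', 'Carrie']})
ratings = pd.DataFrame({'book_id': [11, 13, 12, 14, 99],   # 99 is not in books
                        'user_id': [1, 2, 1, 3, 4],
                        'rating': [5, 3, 4, 2, 1]})

# One consecutive 0-based ID per distinct title, in sorted title order.
codes, titles = pd.factorize(books['title'], sort=True)
books = books.assign(ID=codes)

# Inner merge keeps only ratings whose book exists in books.
merged = ratings.merge(books, left_on='book_id', right_on='id', how='inner')
merged = merged[['ID', 'title', 'user_id', 'rating']]
print(merged)
```

Both books titled Arcadia end up with the same ID, and the rating for the unknown book 99 disappears from the result.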

In [71]:
pd.DataFrame(eng_ratings_df.title.unique()).to_csv('eng_books_sorted.csv')
eng_ratings_df.to_csv('eng_books_ratings.csv')

The final data structure (eng_usr_ratings) that stores the ratings is a 2D array in which each column represents a book and each row holds one user's ratings; books the user has not rated are stored as 0. For compatibility with the model, we store it as a torch tensor. We then use an 80-20 split for the training and testing sets.

In [92]:
eng_usr_ratings = torch.zeros([n_users, n_books], dtype=torch.float32)

for i, rating_row in eng_ratings_df.iterrows():
    eng_usr_ratings[int(rating_row.user_id) - 1, int(rating_row.ID)] = rating_row.rating  # user_id is 1-based, ID is already 0-based

print(eng_usr_ratings[:5])

lim = int(n_users * 0.8)
tr_set = eng_usr_ratings[:lim][:]
te_set = eng_usr_ratings[lim:][:]

print(lim, "Length of training set: ", tr_set.shape[0], " test set: ", te_set.shape[0])

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
42739 Length of training set:  42739  test set:  10685
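
Filling the tensor row by row with iterrows can be slow on a large ratings frame. As a hedged alternative, the sketch below fills a toy matrix with a single indexed assignment and then applies the same 80-20 row split. The tensors user_ids, book_ids and values are made-up stand-ins for the columns of eng_ratings_df, and the sketch assumes 1-based user ids, 0-based book IDs, and that each (user, book) pair occurs at most once.

```python
import torch

# Toy data standing in for the user_id, ID and rating columns.
user_ids = torch.tensor([1, 1, 2, 4])        # 1-based
book_ids = torch.tensor([0, 2, 1, 2])        # 0-based
values = torch.tensor([5., 3., 4., 1.])

n_users, n_books = 4, 3
matrix = torch.zeros(n_users, n_books)
matrix[user_ids - 1, book_ids] = values      # one assignment instead of a Python loop

lim = int(n_users * 0.8)                     # first 80% of the rows for training
train, test = matrix[:lim], matrix[lim:]
print(matrix)
print(train.shape[0], test.shape[0])
```

With the real data, the index tensors could be built from the dataframe columns, for example with torch.tensor(eng_ratings_df.user_id.to_numpy(dtype='int64')).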


Now let's create an object of the SAE class, which will be our autoencoder. We pass the number of books to the module to set the number of nodes in the input layer. Next, we define the loss criterion as mean squared error and set up an RMSprop optimizer to update the weights, with a learning rate of 0.05 and a weight decay of 0.008.

These values were selected by trial and error; other contenders were 0.04/0.005 and 0.02/0.008, among others.

In [95]:
sae = SAE(n_books=n_books)
criterion = nn.MSELoss()

optimizer = optim.RMSprop(sae.parameters(), lr=0.05, weight_decay=0.008)
#Epoch:  50  Loss:  28184.567553707646  usrs:  41414.0  Avg Loss:  0.6805565160020197, TESTLOSS:0.9455141043192761

# optimizer = optim.RMSprop(sae.parameters(), lr=0.04, weight_decay=0.005)
#Epoch:  8  Loss:  36309.51346146319  usrs:  41414.0  Avg Loss:  0.8767449041740278

Training this autoencoder involves several steps.

To begin with, we choose the number of epochs. Both 100 and 50 were tried; using 100 causes overfitting, so the smaller value is used here.
For each epoch we keep two counters: train_loss, which accumulates the training loss, and usrs, which counts the users with at least one rating. train_loss is divided by usrs to get the average loss per epoch. Because we have filtered out many books, some rows may contain no non-zero value, i.e. that user gave no ratings for the books we're interested in. Dividing by the length of the training set would count these users too and distort the average loss, so we divide by usrs instead.

We train with per-sample gradient updates: we go through the training set one user at a time and update the model after each user. Since we're creating an autoencoder, the target is the same as the input.

Each row is first checked to ensure that it contains at least one non-zero rating, and usrs is incremented if it does. Using the current weights of the autoencoder, we compute the output and compare it to the target. For all input values that were 0, i.e. books that weren't rated by that particular user, we set the corresponding output element to 0 so that these positions contribute nothing to the loss. To make sure that no gradient is computed with respect to the target, we set its requires_grad to False.

The loss is computed between the output and the target, and loss.backward() propagates it backward through the network. nn.MSELoss averages over all n_books positions, including the unrated ones that we zeroed out, so to report the training loss correctly we count the number of non-zero ratings and use it to form a mean corrector (n_books divided by that count). We multiply the loss by the mean corrector, take the square root, and add the result to the training loss accumulated so far. Finally we call step() on the optimizer declared earlier to update the weights using the learning rate and weight decay passed as lr and weight_decay.
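
To see what the mean corrector does, here is a small worked example in plain Python with made-up numbers. It assumes the loss is the mean of squared errors over all n_books positions (the default behaviour of nn.MSELoss) and checks that multiplying by n_books divided by the number of rated books, then taking the square root, matches the root mean squared error computed over the rated books only.

```python
import math

target = [5.0, 0.0, 3.0, 0.0, 0.0, 4.0]      # 0 marks an unrated book
output = [4.0, 0.0, 3.5, 0.0, 0.0, 2.0]      # predictions with unrated positions zeroed

n_books = len(target)
n_rated = sum(1 for t in target if t > 0)

# Mean squared error over all positions, as a mean-reduced MSE would report it.
mse_all = sum((o - t) ** 2 for o, t in zip(output, target)) / n_books

mean_corrector = n_books / (n_rated + 1e-10)
rmse_rated = math.sqrt(mse_all * mean_corrector)

# Direct RMSE over the rated positions only, for comparison.
direct = math.sqrt(
    sum((o - t) ** 2 for o, t in zip(output, target) if t > 0) / n_rated
)
print(rmse_rated, direct)
```

The two values agree up to the tiny 1e-10 term, so the quantity added to train_loss for each user is that user's RMSE over the books they actually rated.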

For each epoch we print the total training loss and its average per user. For this model, the average starts at 1.116 and drops to 0.680 by epoch 50. This shows that the model learns substantially over time, and on the training set we can expect a difference of about 0.68 between the predicted and the actual rating.

In [96]:
n_epoch = 50 #100

for ep in range(n_epoch):
    train_loss = 0.                                                     # Calculates training loss in one epoch
    usrs = 0.                                                           # Counts no of rows (users) with > 0 nonzero ratings

    for i in range(len(tr_set)):
        input = Variable(tr_set[i]).unsqueeze(0)
        target = input.clone()
        
        if torch.sum(target.data > 0) > 0:
            usrs += 1
            
            output = sae(input)
            output[target == 0] = 0
            target.requires_grad = False
            
            loss = criterion(output, target)
            loss.backward()
            
            mean_corrector = n_books / float(torch.sum(target.data > 0) + 1e-10)
            train_loss += np.sqrt(loss.item() * mean_corrector)
            
            optimizer.step()

    print('Epoch: ', ep+1, ' Loss: ', train_loss, ' usrs: ', usrs, ' Avg Loss: ', train_loss/usrs)

Epoch:  1  Loss:  46245.95948159071  usrs:  41414.0  Avg Loss:  1.1166745419807482
Epoch:  2  Loss:  39498.6450982867  usrs:  41414.0  Avg Loss:  0.9537510285962888
Epoch:  3  Loss:  38480.26008338464  usrs:  41414.0  Avg Loss:  0.9291606723181687
Epoch:  4  Loss:  37750.32420220691  usrs:  41414.0  Avg Loss:  0.9115353311007609
Epoch:  5  Loss:  37143.611114910316  usrs:  41414.0  Avg Loss:  0.8968853797003505
Epoch:  6  Loss:  36566.0572917775  usrs:  41414.0  Avg Loss:  0.8829395202534771
Epoch:  7  Loss:  36057.929476698395  usrs:  41414.0  Avg Loss:  0.8706700506277683
Epoch:  8  Loss:  35592.13275662691  usrs:  41414.0  Avg Loss:  0.8594227255668834
Epoch:  9  Loss:  35206.384903446145  usrs:  41414.0  Avg Loss:  0.8501082943798267
Epoch:  10  Loss:  34797.22016826099  usrs:  41414.0  Avg Loss:  0.8402284292331335
Epoch:  11  Loss:  34410.32627953615  usrs:  41414.0  Avg Loss:  0.8308863253860084
Epoch:  12  Loss:  34087.54692769816  usrs:  41414.0  Avg Loss:  0.8230923583256425
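
One detail worth checking in a loop like the one above: PyTorch adds each new gradient to whatever is already stored in a parameter's .grad until it is cleared, and the loop as shown never clears it. The sketch below shows a per-user update that calls optimizer.zero_grad() first. It is an illustration under assumptions, not the code that produced the logged losses: an nn.Sequential model stands in for SAE, the ratings are made up, and masking is done by multiplication instead of in-place assignment.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
n_books = 6
# Stand-in for the SAE model; the real architecture lives in SAE.py.
model = nn.Sequential(nn.Linear(n_books, 3), nn.Sigmoid(), nn.Linear(3, n_books))
criterion = nn.MSELoss()
optimizer = optim.RMSprop(model.parameters(), lr=0.05, weight_decay=0.008)

users = torch.tensor([[5., 0., 3., 0., 0., 4.],
                      [0., 0., 0., 0., 0., 0.],   # no ratings: skipped
                      [1., 2., 0., 0., 5., 0.]])

steps = 0
for row in users:
    x = row.unsqueeze(0)                  # shape (1, n_books)
    rated = x > 0
    if not rated.any():
        continue
    optimizer.zero_grad()                 # clear gradients left over from the previous user
    output = model(x) * rated.float()     # zero the predictions for unrated books
    loss = criterion(output, x)
    loss.backward()
    optimizer.step()
    steps += 1

print(steps, loss.item())
```

With gradients reset on every iteration, each update depends only on the current user, so the loss values would be expected to differ from the ones logged above.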


Testing the model involves many of the same steps. We start by creating test_loss and usrs as before, to accumulate the testing loss and count the users with at least one rating.

Again, the input and the target are clones of each other, and we disregard the ratings predicted by SAE for books that the user didn't originally rate (in input) by setting the corresponding values in output to zero.
The loss is calculated using the same criterion as before, but this time we neither propagate it backward nor call optimizer.step() to shift the weights. We simply calculate the mean corrector, use it to compute the corrected loss, and add that to the test loss accumulated so far.

The average test loss is 0.945, noticeably higher than the final training loss of 0.680, which suggests that our autoencoder is overfitting.

In [97]:
test_loss = 0.                                                          # Calculates test loss
usrs = 0.                                                               # Counts no of rows (users) with >= 1 nonzero ratings

for i in range(len(te_set)):
    input = Variable(te_set[i]).unsqueeze(0)
    target = input.clone()
    
    if torch.sum(target.data > 0) > 0:
        usrs += 1
        
        output = sae(input)
        output[target.data == 0] = 0
        target.requires_grad = False
        
        loss = criterion(output, target)
        
        mean_corrector = n_books / float(torch.sum(target.data > 0) + 1e-10)
        test_loss += np.sqrt(loss.item() * mean_corrector)
        
print('Test Loss: ', test_loss, ' usrs: ', usrs, 'Avg Test Loss: ', test_loss/usrs)

Test Loss:  9763.378641200845  usrs:  10326.0 Avg Test Loss:  0.9455141043192761


Now that we have a trained Stacked AutoEncoder ready to predict ratings, we should store it somewhere so that it doesn’t have to be initialized and trained every time we want to make a prediction. Hence, we pickle it into a new file, trainedSae. The Flask application only needs to load this once and can reuse it for every prediction.

In [98]:
import pickle

with open('trainedSae','wb') as outf:
    pickle.dump(sae, outf)

Next we read from the same pickled file, and we see how we can predict the ratings for a randomly created user.
This pickled file will be used in our web application to make predictions on ratings given by the user.
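
A minimal sketch of that round trip follows. To stay self-contained it pickles a small nn.Sequential model as a stand-in for the trained SAE and writes to a temporary directory; the book count, the random user and the top-3 selection are all assumptions for illustration. When unpickling the real model, the SAE class must be importable (for example via from SAE import SAE), because pickle stores a reference to the class, not its code.

```python
import os
import pickle
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
n_books = 10
# Stand-in for the trained SAE object.
model = nn.Sequential(nn.Linear(n_books, 4), nn.Sigmoid(), nn.Linear(4, n_books))

path = os.path.join(tempfile.mkdtemp(), 'trainedSae')
with open(path, 'wb') as outf:
    pickle.dump(model, outf)

with open(path, 'rb') as in_file:
    loaded = pickle.load(in_file)
loaded.eval()

# A randomly created user: ratings from 1 to 5 for three books, 0 for the rest.
user = torch.zeros(1, n_books)
rated_idx = torch.randperm(n_books)[:3]
user[0, rated_idx] = torch.randint(1, 6, (3,)).float()

with torch.no_grad():                        # prediction only, no gradients needed
    predicted = loaded(user)

predicted[user > 0] = float('-inf')          # ignore books the user already rated
top = torch.topk(predicted, k=3, dim=1).indices[0]
print(top.tolist())                          # column indices of the suggested books
```

The returned indices are column positions in the ratings matrix, so they can be mapped back to titles through the same ordering that was saved in eng_books_sorted.csv.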