We start by creating a model, built on TensorFlow's Keras Model, that implements an autoencoder. An autoencoder is trained to reproduce its input at its output by first encoding the input into a smaller representation and then decoding it back. Refer to Autoencoder.py for the implementation used here.

The network has the following layers:

1. Input: Same number of nodes as the number of books
2. Hidden1: 80 nodes to encode the input once
3. Hidden2: 40 nodes to encode the input a second time
4. Hidden3: 80 nodes to decode the input once
5. Hidden4: Output layer, decodes a second time to bring the number of dimensions back to the number of books
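
To make the shape flow through these layers concrete, here is a NumPy-only sketch of a single forward pass. It is not the Keras implementation in Autoencoder.py: the random weights, the sigmoid activation, the value of `n_books`, and the helper names `init_layers` and `forward` are all assumptions chosen for illustration.

```python
import numpy as np

def init_layers(n_books, sizes=(80, 40, 80), seed=0):
    """Create random (weights, bias) pairs for n_books -> 80 -> 40 -> 80 -> n_books."""
    rng = np.random.default_rng(seed)
    dims = [n_books, *sizes, n_books]
    return [(rng.normal(0.0, 0.1, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Run a batch through every layer and record the shape after each one."""
    shapes = [x.shape]
    for weights, bias in layers:
        x = sigmoid(x @ weights + bias)  # sigmoid is an assumed activation
        shapes.append(x.shape)
    return x, shapes

n_books = 200  # hypothetical; the real value comes from the dataset
batch = np.zeros((4, n_books), dtype=np.float32)  # 4 users, no ratings
output, shapes = forward(batch, init_layers(n_books))
print(shapes)  # [(4, 200), (4, 80), (4, 40), (4, 80), (4, 200)]
```

The output has the same width as the input, which is what lets the network be trained to reconstruct a user's rating vector.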
 
We start by importing all the necessary modules.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np

from Autoencoder import Autoencoder


NumPy and pandas will help us with arithmetic and with handling dataframes, respectively. TensorFlow is used for creating our autoencoder, training it and testing the results.
The last line imports the Autoencoder class we've created, which is a stacked autoencoder. If we prefer, the training logic can be moved into the class itself as methods, but it is left here so that we can experiment with optimizers and training logic.

Before we move on to the neural network, let's understand the dataset. This is a Kaggle dataset of Goodreads ratings from many users for ten thousand popular books. More information about the dataset can be found at https://www.kaggle.com/zygmunt/goodbooks-10k.

The file books.csv contains metadata about each book, such as its title, year of publishing, language and author, among others. All we need from this file are the title, language and ID. The file ratings.csv contains around 100 ratings for each book, each ranging from one to five.

In [2]:
ratings = pd.read_csv('Goodreads_Ratings/ratings.csv')
books = pd.read_csv('Goodreads_Ratings/books.csv', usecols=['title', 'language_code', 'book_id', 'id'])

books_rat_id = set(ratings.book_id)
books = books[books.language_code.isin(['en', 'eng', 'en-CA', 'en-US', 'en-GB'])].sort_values(by=['title'])
books_eng = list(books.id)
# books_eng = books.sort_values(by=['title']).loc[books.language_code.isin(['en', 'eng', 'en-CA', 'en-US', 'en-GB']), 'id'].append(books.loc[books.language_code.isna(), 'id'])

ratings = ratings.query("book_id in @books_eng").sort_values(by=['book_id'])
print(ratings.head())

    book_id  user_id  rating
0         1      314       5
72        1    33890       3
71        1    33872       5
70        1    33716       5
69        1    33697       4


The dataset contains much more information that would be useful for a user recommendation system, such as each book's author and year of publishing. There is even another file containing tags for the books, and another listing books that users have marked as to-read. These have been excluded here in order to focus solely on the autoencoder's ability to predict books a user might like based only on the ones they have liked before. They can be added as features to the input vector after sufficient cleaning, though. We also use only English books to add some consistency.

Several books share the same title, which is also an issue. We solve it by combining all ratings for books with the same title under a single book ID.

In [None]:
rows = []         # rating records are collected here; DataFrame.append was removed in pandas 2.0
title_to_id = {}  # title -> ID for titles whose ratings have already been stored
id_ctr = 0

for i, buk in books.iterrows():
    buk_ratings = ratings[ratings['book_id'] == buk.id]
    # None (rather than 0) marks "no duplicate", because 0 is itself a valid ID
    found_id = title_to_id.get(buk.title)

    if found_id is not None:
        print("Duplicate found for\n", buk, "\nID found:", found_id, "\n")

    new_id = id_ctr if found_id is None else found_id
    for j, b_r in buk_ratings.iterrows():
        rows.append({'ID': new_id, 'title': buk.title,
                     'user_id': b_r.user_id, 'rating': b_r.rating})

    if found_id is None:
        # As before, a title only counts as "seen" once it has at least one rating
        if len(buk_ratings) > 0:
            title_to_id[buk.title] = id_ctr
        id_ctr += 1

eng_ratings_df = pd.DataFrame(rows, columns=['ID', 'title', 'user_id', 'rating'])
    
n_users = int(max(eng_ratings_df.user_id))
n_books = int(max(eng_ratings_df.ID)) + 1

print(n_users, n_books)
print(eng_ratings_df)

We need to combine the books and ratings into a single table with a new index, because keeping only English books leaves gaps in the original IDs, which would cause indexing problems while training the model. Moreover, some books were observed to be present in ratings.csv but missing from books.csv, and they can be excluded only by creating a new index. The new index will also make future prediction easier, when we take ratings from users through a web application and need to work out the correct order in which to send them as input to the autoencoder.
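
The re-indexing idea can be shown on its own with a small sketch: every distinct title gets the next consecutive ID, and a repeated title reuses the ID it was given first. The function name `assign_title_ids` and the sample titles are hypothetical and are not part of the notebook.

```python
def assign_title_ids(titles):
    """Map each distinct title to a consecutive ID, in order of first appearance."""
    title_to_id = {}
    ids = []
    for title in titles:
        if title not in title_to_id:
            # A new title takes the next unused ID (0, 1, 2, ...)
            title_to_id[title] = len(title_to_id)
        ids.append(title_to_id[title])
    return ids, title_to_id

titles = ["Animal Farm", "Dune", "Dune", "Emma"]  # hypothetical, already sorted
ids, title_to_id = assign_title_ids(titles)
print(ids)  # [0, 1, 1, 2]
```

Because the IDs are consecutive and start at 0, they can be used directly as column positions in the user-by-book rating matrix.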

The dataframe eng_ratings_df contains only English books, sorted by title, along with all of their ratings from the ratings dataframe. We store it in a file (eng_books_ratings.csv) so that we don't have to process the raw datasets again each time we wish to experiment with the autoencoder. Another file (eng_books_sorted.csv) contains just the titles of the selected books, which makes it quicker for the Flask application to retrieve and display them on the HTML form.

In [None]:
pd.DataFrame(eng_ratings_df.title.unique()).to_csv('eng_books_sorted.csv')
eng_ratings_df.to_csv('eng_books_ratings.csv')

The final data structure (eng_usr_ratings) that stores the ratings is a 2D array in which each column represents a book and each row holds one user's ratings; books the user has not rated are left at 0. We store it as a NumPy float32 array, which Keras accepts directly as input. We then use an 80-20 split for the training and testing sets.

In [2]:
eng_ratings_df = pd.read_csv('eng_books_ratings.csv')
n_users = int(max(eng_ratings_df.user_id))
n_books = int(max(eng_ratings_df.ID)) + 1

eng_usr_ratings = np.zeros([n_users, n_books], dtype=np.float32)

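for i, rating_row in eng_ratings_df.iterrows():
    # user_id starts at 1, so it is shifted down; ID already starts at 0 and is used as-is
    # (subtracting 1 from it would wrap ID 0 around to the last column)
    eng_usr_ratings[int(rating_row.user_id)-1][int(rating_row.ID)] = rating_row.rating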

# eng_usr_ratings = tf.convert_to_tensor(eng_usr_ratings, dtype=tf.float32)
print(eng_usr_ratings[:5])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
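
Filling the matrix one row at a time with iterrows can be slow when there are many ratings. As a hedged alternative, the sketch below fills the same kind of matrix in one step with NumPy integer-array indexing. It assumes the row and column positions have already been converted to zero-based integers; the function name `fill_ratings` and the sample values are hypothetical.

```python
import numpy as np

def fill_ratings(row_idx, col_idx, values, shape):
    """Build a dense user-by-book matrix from zero-based positions and ratings."""
    matrix = np.zeros(shape, dtype=np.float32)
    # One vectorized assignment replaces the Python-level loop
    matrix[row_idx, col_idx] = values
    return matrix

# Hypothetical data: three ratings from two of three users
row_idx = np.array([0, 0, 2])   # zero-based user positions
col_idx = np.array([0, 2, 1])   # zero-based book positions
values = np.array([5, 3, 4], dtype=np.float32)

matrix = fill_ratings(row_idx, col_idx, values, shape=(3, 3))
print(matrix)
```

With a dataframe, the positions would typically come from the relevant columns via `.to_numpy()`, shifted as needed so that both start at 0.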


In [3]:
lim = int(n_users*0.80)
tr_set = eng_usr_ratings[:lim]
te_set = eng_usr_ratings[lim:]

print(n_users, "Length of training set: ", tr_set.shape[0], " test set: ", te_set.shape[0])

53424 Length of training set:  42739  test set:  10685


We define our own root-mean-square error (RMSE) function for the network.
Most entries in each row are zero, because they represent books that the user hasn't rated yet. On average, each user will have only 3-5 ratings, and the rest of the columns will be 0.
These zeros are placeholders rather than real ratings, so the function masks out the predictions at those positions and only the rated books contribute to the error.

In [4]:
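def my_rmse(y_true, y_pred):
    # K (the Keras backend) was never imported, so the equivalent tf ops are used here
    # 1.0 where the user gave a rating, 0.0 where the entry is unrated
    nonzero = (y_true != 0)
    nonzero = tf.dtypes.cast(nonzero, tf.float32)
    
    # Zero out predictions for unrated books so that they add no error
    y_new = y_pred * nonzero
    
    error = y_true-y_new
    sqr_error = tf.square(error)
    # Note: the mean is taken over all entries, including the masked ones
    mean_sqr_error = tf.reduce_mean(sqr_error)
    sqrt_mean_sqr_error = tf.sqrt(mean_sqr_error)
    
    return sqrt_mean_sqr_error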
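
To check the arithmetic by hand, here is a NumPy sketch of the same masking idea applied to one hypothetical user. The function name `masked_rmse_np` and the numbers are made up for illustration; the TensorFlow function above is the one used for training.

```python
import numpy as np

def masked_rmse_np(y_true, y_pred):
    """RMSE in which predictions for unrated (zero) entries are ignored."""
    mask = (y_true != 0).astype(np.float32)
    error = y_true - y_pred * mask          # unrated entries give an error of 0
    return float(np.sqrt(np.mean(np.square(error))))

# One user, four books: books 0 and 2 are rated, books 1 and 3 are not
y_true = np.array([[5.0, 0.0, 3.0, 0.0]], dtype=np.float32)
y_pred = np.array([[4.0, 2.5, 5.0, 1.0]], dtype=np.float32)

# Errors are 1, 0, -2, 0, so the mean square is (1 + 0 + 4 + 0) / 4 = 1.25
result = masked_rmse_np(y_true, y_pred)
print(round(result, 4))  # 1.118
```

Note that the mean divides by all four entries, not only the two rated ones, so the reported loss shrinks as the matrix gets sparser. This does not change which predictions are penalized, only the scale of the number.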

Now let's create an instance of the Autoencoder class, which defines our neural network. We pass the number of books to the constructor so that the input layer has one node per book. Next, we compile the model with the Adam optimizer and the RMSE loss function we just defined. The batch_size is set to 128 instead of the default 32.

In [5]:
autoencoder = Autoencoder(n_books)
autoencoder.compile(optimizer='adam', loss=my_rmse)
autoencoder.fit(tr_set, tr_set, epochs=50, shuffle=True, batch_size=128, validation_data=(te_set, te_set))

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x1a5b1f7f040>

In [6]:
autoencoder.save(r"C:\Users\rhead\PycharmProjects\autoEncoderRecommender\tensorflowSAE")

INFO:tensorflow:Assets written to: C:\Users\rhead\PycharmProjects\autoEncoderRecommender\tensorflowSAE\assets


The trained autoencoder is saved so that it can be loaded again whenever we need it to make predictions based on the user's inputs.
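
As a rough sketch of what "making predictions" could look like once the model is loaded, the snippet below turns one user's predicted ratings into a short list of recommendations by skipping books the user has already rated. It is an illustration only: `predicted` is a hand-written stand-in for one row of the model's output, and the function name `top_k_unrated` is hypothetical.

```python
import numpy as np

def top_k_unrated(user_ratings, predicted, k=3):
    """Return column indices of the k highest predictions among unrated books."""
    unrated = (user_ratings == 0)
    # Already-rated books get -inf so that they can never be recommended
    scores = np.where(unrated, predicted, -np.inf)
    k = min(k, int(unrated.sum()))
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist()

user_ratings = np.array([5.0, 0.0, 0.0, 4.0, 0.0])  # 0 means not rated
predicted = np.array([4.8, 3.1, 4.5, 4.2, 2.0])     # stand-in for model output
print(top_k_unrated(user_ratings, predicted, k=2))  # [2, 1]
```

The returned indices are column positions in the rating matrix, so they would need to be mapped back to titles using the same ordering that was used to build the matrix.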