# **Designing a Recommendation System Using Transformers**

The aim of this project is to develop a recommendation system for books using a transformer-based architecture. The inspiration for the implementation comes from two papers: 'Behavior Sequence Transformer for E-commerce Recommendation in Alibaba' and 'Attention Is All You Need.' However, modifications will be made to the architecture to ensure its compatibility with our specific dataset.

Our primary objective is to improve the predictive capabilities of our model by leveraging the historical ratings of books. We intend to compare the performance of our proposed model against existing recommendation systems.


The dataset used for this project can be obtained from the following link: https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset?select=Users.csv

In [None]:
 pip install SQLAlchemy==1.4.46 

In [None]:
pip install pandasql

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandasql
  Downloading pandasql-0.7.3.tar.gz (26 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pandasql
  Building wheel for pandasql (setup.py) ... done
  Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26771 sha256=6d8b96c159af313499bca66aa1efd49abc66a69edbacd59e4f49455d3cd1e837
  Stored in directory: /root/.cache/pip/wheels/e9/bc/3a/8434bdcccf5779e72894a9b24fecbdcaf97940607eaf4bcdf9
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3


In [None]:
import pandas as pd
import numpy as np
import os
import pandasql as ps
import requests
from bs4 import BeautifulSoup


from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
books_data = pd.read_csv('/content/drive/MyDrive/book-dataset/archive (1)/Books.csv')
rating_data = pd.read_csv('/content/drive/MyDrive/book-dataset/archive (1)/Ratings.csv')
users_data = pd.read_csv('/content/drive/MyDrive/book-dataset/archive (1)/Users.csv')



## Data Cleaning

In [None]:

books_data = books_data.drop(columns=['Image-URL-S', 'Image-URL-M', 'Image-URL-L'])
books_data = books_data.rename(columns={'Book-Title': 'books_title', 'Book-Author': 'books_author', 'Year-Of-Publication': 'publication_year', 'Publisher': 'publisher'})
users_data = users_data.drop(columns=['Location'])
users_data = users_data.rename(columns = {'User-ID': 'user_id', 'Age':'age'})
rating_data = rating_data.rename(columns = {'User-ID': 'user_id', 'Book-Rating':'book_rating'})
books_data

Unnamed: 0,ISBN,books_title,books_author,publication_year,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company
...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm)
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press


In [None]:
users_data

Unnamed: 0,user_id,age
0,1,
1,2,18.0
2,3,
3,4,17.0
4,5,
...,...,...
278853,278854,
278854,278855,50.0
278855,278856,
278856,278857,


In [None]:
rating_data

Unnamed: 0,user_id,ISBN,book_rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [None]:
total_ratings_by_user = ps.sqldf("""SELECT user_id, COUNT(user_id) 
                                    FROM rating_data
                                    GROUP BY user_id
                                    ORDER BY COUNT(user_id) DESC""")
total_ratings_by_user

Unnamed: 0,user_id,COUNT(user_id)
0,11676,13602
1,198711,7550
2,153662,6109
3,98391,5891
4,35859,5850
...,...,...
105278,20,1
105279,19,1
105280,12,1
105281,7,1


In [None]:
distinct_users = ps.sqldf("""SELECT COUNT(DISTINCT(user_id)) FROM rating_data""")
distinct_users

Unnamed: 0,COUNT(DISTINCT(user_id))
0,105283


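So we have about 1.15 million ratings from only 105,283 distinct users, roughly 11 ratings per user on average. The distribution is heavily skewed, though: the most active user alone accounts for 13,602 ratings.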

In [None]:
user_98391 = ps.sqldf("""SELECT * FROM rating_data WHERE user_id = '98391'""")
user_98391

Unnamed: 0,user_id,ISBN,book_rating
0,98391,0060001445,8
1,98391,0060001453,9
2,98391,0060001461,8
3,98391,006000147X,9
4,98391,0060001801,8
...,...,...,...
5886,98391,9046610518,9
5887,98391,9375506276,9
5888,98391,9425183157,9
5889,98391,976530130X,8


Instead of ordering each user's history purely by time, we can use the ratings themselves to define the order behind the positional embedding. Merely clicking on a book does not guarantee that it was read, so a rating is a stronger signal of engagement. We therefore sort each user's ratings in ascending order and feed the resulting sequence into the multi-head attention layer.
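
As a minimal sketch of the ordering described above, the following builds one rating-ordered history per user from a toy DataFrame. The toy values and the name `ordered_history` are assumptions made for illustration; only the column names follow the renamed `rating_data` columns.

```python
import pandas as pd

# Toy ratings (hypothetical values), using the notebook's column names.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ISBN": ["A", "B", "C", "D", "E"],
    "book_rating": [9, 3, 6, 10, 0],
})

# Sort by rating within each user, then collect the ISBNs and ratings
# into one ordered list per user. A stable sort keeps ties in input order.
ordered_history = (
    toy_ratings
    .sort_values(["user_id", "book_rating"], kind="stable")
    .groupby("user_id")
    .agg(book_ids=("ISBN", list), ratings=("book_rating", list))
    .reset_index()
)
print(ordered_history)
```

Each row then holds the lowest-rated book first and the highest-rated book last, so the position in the list reflects how much the user liked the book rather than when they rated it.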

In [None]:
ratings_per_book = ps.sqldf("""
                              SELECT 
                                rating_data.user_id AS user_id,
                                rating_data.ISBN AS ISBN,
                                books_data.books_title AS book_title,
                                books_data.books_author AS books_author,
                                books_data.publication_year AS publication_year,
                                books_data.publisher AS publisher,
                                rating_data.book_rating AS book_rating
                              FROM rating_data
                              INNER JOIN books_data ON rating_data.ISBN = books_data.ISBN
                            """)
ratings_per_book

Unnamed: 0,user_id,ISBN,book_title,books_author,publication_year,publisher,book_rating
0,276725,034545104X,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,0
1,276726,0155061224,Rites of Passage,Judith Rae,2001,Heinle,5
2,276727,0446520802,The Notebook,Nicholas Sparks,1996,Warner Books,0
3,276729,052165615X,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,3
4,276729,0521795028,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press,6
...,...,...,...,...,...,...,...
1031131,276704,0876044011,Edgar Cayce on the Akashic Records: The Book o...,Kevin J. Todeschi,1998,A.R.E. Press (Association of Research & Enlig,0
1031132,276704,1563526298,Get Clark Smart : The Ultimate Guide for the S...,Clark Howard,2000,Longstreet Press,9
1031133,276706,0679447156,Eight Weeks to Optimum Health: A Proven Progra...,Andrew Weil,1997,Alfred A. Knopf,0
1031134,276709,0515107662,The Sherbrooke Bride (Bride Trilogy (Paperback)),Catherine Coulter,1996,Jove Books,10


In [None]:
ratings_per_book.isnull().sum()

user_id             0
ISBN                0
book_title          0
books_author        1
publication_year    0
publisher           2
book_rating         0
dtype: int64

## Data Preprocessing

In [None]:
ratings_per_book

Unnamed: 0,user_id,ISBN,book_title,books_author,publication_year,publisher,book_rating
0,276725,034545104X,Flesh Tones: A Novel,M. J. Rose,2002,Ballantine Books,0
1,276726,0155061224,Rites of Passage,Judith Rae,2001,Heinle,5
2,276727,0446520802,The Notebook,Nicholas Sparks,1996,Warner Books,0
3,276729,052165615X,Help!: Level 1,Philip Prowse,1999,Cambridge University Press,3
4,276729,0521795028,The Amsterdam Connection : Level 4 (Cambridge ...,Sue Leather,2001,Cambridge University Press,6
...,...,...,...,...,...,...,...
1031131,276704,0876044011,Edgar Cayce on the Akashic Records: The Book o...,Kevin J. Todeschi,1998,A.R.E. Press (Association of Research & Enlig,0
1031132,276704,1563526298,Get Clark Smart : The Ultimate Guide for the S...,Clark Howard,2000,Longstreet Press,9
1031133,276706,0679447156,Eight Weeks to Optimum Health: A Proven Progra...,Andrew Weil,1997,Alfred A. Knopf,0
1031134,276709,0515107662,The Sherbrooke Bride (Bride Trilogy (Paperback)),Catherine Coulter,1996,Jove Books,10


In [None]:
users_data

Unnamed: 0,user_id,age
0,1,
1,2,18.0
2,3,
3,4,17.0
4,5,
...,...,...
278853,278854,
278854,278855,50.0
278855,278856,
278856,278857,


## Book Genres
We wanted to attach a genre to every book. However, after trying several APIs, we obtained genres for fewer than 400 of the first 1,000 books. One alternative is to generate the genres with the OpenAI API, but that is not free. We therefore set this section aside for now and may revisit it once the model is complete.

In [None]:
# Let's add a genre for each book
total_distinct_book = pd.DataFrame(ratings_per_book['book_title'].unique())
total_distinct_book.rename(columns={0: 'book_title'}, inplace=True)
total_distinct_book

Unnamed: 0,book_title
0,Flesh Tones: A Novel
1,Rites of Passage
2,The Notebook
3,Help!: Level 1
4,The Amsterdam Connection : Level 4 (Cambridge ...
...,...
241066,Death Crosses the Border
241067,Jazz Funeral: A Skip Langdon Novel
241068,Triplet Trouble and the Class Trip (Triplet Tr...
241069,A Desert of Pure Feeling (Vintage Contemporaries)


In [None]:
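# Look up a book's genre (the Google Books "categories" field) by title
def get_book_genre(title):
    base_url = 'https://www.googleapis.com/books/v1/volumes'
    params = {'q': f'intitle:{title}'}
    
    try:
        # A timeout keeps a stalled request from hanging the whole loop
        response = requests.get(base_url, params=params, timeout=10)
    except requests.RequestException:
        return 'An error occurred while fetching the book.'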
    
    if response.status_code == 200:
        data = response.json()
        items = data.get('items', [])
        
        if items:
            book = items[0]
            volume_info = book.get('volumeInfo', {})
            
            if 'categories' in volume_info:
                genres = volume_info['categories']
                return genres
            else:
                return 'No genre information available for this book.'
        
        else:
            return 'Book not found.'
    
    else:
        return 'An error occurred while fetching the book.'


def create_book_genre_dataframe(dataframe, limit):
    book_genre_dict = {}
    
    for index, row in dataframe.head(limit).iterrows():
        book_name = row['book_title']
        genre = get_book_genre(book_name)
        genre_str = ', '.join(genre) if isinstance(genre, list) else genre
        book_genre_dict[index] = {
            'book_title': book_name,
            'genre': genre_str
        }
    
    book_genre_df = pd.DataFrame.from_dict(book_genre_dict, orient='index')
    
    return book_genre_df



book_genre_result = create_book_genre_dataframe(total_distinct_book, 1000)  # Limiting to first 1000 rows
book_genre_result




Unnamed: 0,book_title,genre
0,Flesh Tones: A Novel,Fiction
1,Rites of Passage,Social Science
2,The Notebook,Fiction
3,Help!: Level 1,No genre information available for this book.
4,The Amsterdam Connection : Level 4 (Cambridge ...,Fiction
...,...,...
995,Nightshade,An error occurred while fetching the book.
996,While I Was Gone,An error occurred while fetching the book.
997,The White Boy Shuffle,An error occurred while fetching the book.
998,The Voice of the Night,An error occurred while fetching the book.


In [None]:
book_genre_result.to_pickle("/content/drive/MyDrive/book-dataset/book_genre_result.pkl")

In [None]:
book_genre_result_upload = pd.read_pickle("/content/drive/MyDrive/book-dataset/book_genre_result.pkl")
book_genre_result_upload

Unnamed: 0,book_title,genre
0,Flesh Tones: A Novel,Fiction
1,Rites of Passage,Social Science
2,The Notebook,Fiction
3,Help!: Level 1,No genre information available for this book.
4,The Amsterdam Connection : Level 4 (Cambridge ...,Fiction
...,...,...
995,Nightshade,An error occurred while fetching the book.
996,While I Was Gone,An error occurred while fetching the book.
997,The White Boy Shuffle,An error occurred while fetching the book.
998,The Voice of the Night,An error occurred while fetching the book.


In [None]:
# Count how many lookups did not return a genre

genre_col = book_genre_result_upload['genre']
# Compare whole strings; str.count would treat '.' as a regex wildcard
total_error_occurred = (genre_col == 'An error occurred while fetching the book.').sum()
book_not_found = (genre_col == 'Book not found.').sum()
no_genre_available = (genre_col == 'No genre information available for this book.').sum()
print(f"Books for which an error occurred while fetching: {total_error_occurred} | Books not found: {book_not_found} | Books with no genre available: {no_genre_available} | Total books without a genre: {total_error_occurred + book_not_found + no_genre_available}")

Books for which an error occurred while fetching: 632 | Books not found: 35 | Books with no genre available: 54 | Total books without a genre: 721
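
Most of the failures counted above fall into the "error occurred" bucket, which the lookup function returns for any non-200 response. One possible cause is rate limiting, but the notebook does not record status codes, so that is an assumption. Under that assumption, a minimal sketch of retrying with exponential backoff could look like this. The names `fetch_with_retry`, `max_attempts`, `base_delay` and `flaky_lookup` are hypothetical, and the fetch callable is injected so the logic can be tried without network access.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None value or attempts run out.

    `fetch` is any zero-argument callable that returns None on failure
    (for example, a wrapper that returns None for a non-200 response).
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for a flaky API: fails twice, then succeeds.
calls = {"n": 0}

def flaky_lookup():
    calls["n"] += 1
    return ["Fiction"] if calls["n"] >= 3 else None

# Record the waits instead of actually sleeping.
waits = []
genres = fetch_with_retry(flaky_lookup, sleep=waits.append)
print(genres, waits)
```

In the notebook, `fetch` could wrap the `requests.get` call and return None for any status other than 200, so that only persistent failures end up labelled as errors.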


In [None]:

#Combine books_with_genre results with ratings_per_book

books_with_genre = ps.sqldf("""
                            SELECT
                                ratings_per_book.user_id AS user_id,
                                ratings_per_book.ISBN AS isbn,
                                book_genre_result_upload.book_title AS book_title,
                                ratings_per_book.books_author AS books_author,
                                ratings_per_book.publication_year AS publication_year,
                                ratings_per_book.publisher AS publisher,
                                ratings_per_book.book_rating AS book_rating,
                                book_genre_result_upload.genre AS genre
                            FROM book_genre_result_upload 
                            INNER JOIN ratings_per_book ON book_genre_result_upload.book_title = ratings_per_book.book_title
                            GROUP BY ratings_per_book.user_id, ratings_per_book.ISBN, book_genre_result_upload.book_title, books_author, publication_year, publisher, book_rating, genre
                            
                            """)

books_with_genre

Unnamed: 0,user_id,isbn,book_title,books_author,publication_year,publisher,book_rating,genre
0,14,0971880107,Wild Animus,Rich Shapero,2004,Too Far,0,Adventure fiction
1,17,0312978383,Winter Solstice,Rosamunde Pilcher,2001,St. Martin's Paperbacks,0,An error occurred while fetching the book.
2,23,0375406328,Lying Awake,Mark Salzman,2000,Alfred A. Knopf,0,An error occurred while fetching the book.
3,26,0446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown & Company,10,FICTION
4,51,0440225701,The Street Lawyer,JOHN GRISHAM,1999,Dell,9,An error occurred while fetching the book.
...,...,...,...,...,...,...,...,...
49662,278843,0345420748,While I Was Gone,Sue Miller,2002,Ballantine Books,0,An error occurred while fetching the book.
49663,278843,0399146431,The Bonesetter's Daughter,Amy Tan,2001,Putnam Publishing Group,9,An error occurred while fetching the book.
49664,278843,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books,8,An error occurred while fetching the book.
49665,278843,0743205812,Fleeced : A Regan Reilly Mystery (Regan Reilly...,Carol Higgins Clark,2001,Scribner,0,An error occurred while fetching the book.


## Eva's part

Example API call (to test the service):

In [None]:
import requests
isbn = '034545104X'
r = requests.get(f'https://openlibrary.org/isbn/{isbn}.json')
text = r.json()
text
#text["genres"]
# r.json()

{'identifiers': {'librarything': ['706825'], 'goodreads': ['1140750']},
 'subtitle': 'a novel',
 'subject_place': ['New York (N.Y.)'],
 'covers': [210533],
 'ia_loaded_id': ['fleshtonesnovel00rose'],
 'lc_classifications': ['PS3568.O76386 F57 2002'],
 'ocaid': 'fleshtonesnovel00rose',
 'uri_descriptions': ['Contributor biographical information',
  'Sample text',
  'Publisher description'],
 'edition_name': '1st ed.',
 'genres': ['Fiction.'],
 'source_records': ['ia:fleshtonesnovel00rose',
  'marc:marc_loc_2016/BooksAll.2016.part29.utf8:202816813:991',
  'ia:fleshtonesnovel0000rose',
  'ia:fleshtonesnovel0000rose_j2u0'],
 'title': 'Flesh tones',
 'languages': [{'key': '/languages/eng'}],
 'subjects': ['Trials (Murder) -- Fiction.', 'New York (N.Y.) -- Fiction.'],
 'publish_country': 'nyu',
 'by_statement': 'M.J. Rose.',
 'oclc_numbers': ['49664389'],
 'type': {'key': '/type/edition'},
 'uris': ['http://www.loc.gov/catdir/bios/random054/2002063111.html',
  'http://www.loc.gov/catdir/samp

In [None]:
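def get_book_genre_isbn(isbn):
    url = f'https://openlibrary.org/isbn/{isbn}.json'
    try:
        # A timeout keeps a stalled request from hanging the whole loop
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return 'An error occurred while fetching the book.'

    if response.status_code == 200:
        data = response.json()
        if 'genres' in data:
            return data['genres']
        return 'No genre information available for this book.'

    # Any non-200 status is treated as a missing book
    return 'Book not found.'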


def create_book_genre_dataframe_by_isbn(dataframe, limit):
    book_genre_dict = {}
    
    for index, row in dataframe.head(limit).iterrows():
        isbn = row['ISBN']
        genres = get_book_genre_isbn(isbn)
        genre_str = ', '.join(genres) if isinstance(genres, list) else genres
        book_genre_dict[index] = {
            'ISBN': isbn,
            'genre': genre_str
        }
    
    book_genre_df = pd.DataFrame.from_dict(book_genre_dict, orient='index')
    
    return book_genre_df

In [None]:
book_genre_result.to_pickle("/content/drive/MyDrive/book-dataset/book_genre_results.pkl")
book_genre_result_upload = pd.read_pickle("/content/drive/MyDrive/book-dataset/book_genre_results.pkl")
book_genre_result_upload

## Transformer

Transform the book ratings data into sequences

Let's sort the ratings data by rating and then group the ISBN values by user_id.
The output DataFrame will have one record per user_id, holding two lists ordered by ascending rating: the ISBNs of the books the user rated, and the corresponding ratings.

In [None]:
ratings_per_book_short = ps.sqldf("""
                              SELECT *
                              FROM books_with_genre
                              LIMIT 1000
                            """)
ratings_per_book_short

Unnamed: 0,user_id,isbn,book_title,books_author,publication_year,publisher,book_rating,genre
0,14,0971880107,Wild Animus,Rich Shapero,2004,Too Far,0,Adventure fiction
1,17,0312978383,Winter Solstice,Rosamunde Pilcher,2001,St. Martin's Paperbacks,0,An error occurred while fetching the book.
2,23,0375406328,Lying Awake,Mark Salzman,2000,Alfred A. Knopf,0,An error occurred while fetching the book.
3,26,0446310786,To Kill a Mockingbird,Harper Lee,1988,Little Brown & Company,10,FICTION
4,51,0440225701,The Street Lawyer,JOHN GRISHAM,1999,Dell,9,An error occurred while fetching the book.
...,...,...,...,...,...,...,...,...
995,7158,0688163165,Mystic River,Dennis Lehane,2001,William Morrow & Company,0,An error occurred while fetching the book.
996,7158,0743206045,Daddy's Little Girl,Mary Higgins Clark,2002,Simon & Schuster,0,An error occurred while fetching the book.
997,7158,0743407059,The First Time,Joy Fielding,2000,Atria,0,An error occurred while fetching the book.
998,7158,074343627X,Dreamcatcher,Stephen King,2001,Pocket,1,An error occurred while fetching the book.


In [None]:
ratings_group = ratings_per_book_short.sort_values(by=["book_rating"]).groupby("user_id")
ratings_data = pd.DataFrame(    
    data={
          "user_id": list(ratings_group.groups.keys()),        
          "book_ids": list(ratings_group.isbn.apply(list)),        
          "ratings": list(ratings_group.book_rating.apply(list)),  
          })

Now, let's split the book_ids list into a set of sequences of a fixed length. We do the same for the ratings. Set the sequence_length variable to change the length of the input sequence to the model. You can also change the step_size to control the number of sequences to generate for each user.

In [None]:
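sequence_length = 4
step_size = 2

def create_sequences(values, window_size, step_size):
    sequences = []
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]
        if len(seq) < window_size:
            # Short tail: fall back to the last window_size items
            # (this can repeat the previous window)
            seq = values[-window_size:]
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size
    return sequences

ratings_data.book_ids = ratings_data.book_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size))

# Window the ratings the same way so they stay aligned with book_ids
ratings_data.ratings = ratings_data.ratings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size))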

In [None]:
random_selection = np.random.rand(len(ratings_data.index)) <= 0.85
train_data = ratings_data[random_selection]
test_data = ratings_data[~random_selection]

train_data.to_csv("train_data.csv", index=False, sep="|", header=False)
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)