<a href="https://colab.research.google.com/github/solodezaldivar/readAlike/blob/main/readAlike.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Proposal Feedback:** Think about which dataset of books can support your project and how to evaluate the performance of your recommendation system

Book datasets: https://www.kaggle.com/datasets/elvinrustam/books-dataset/data, https://github.com/luminati-io/Amazon-popular-books-dataset,

Technische Idee:
1. Input: max 5 book titles for the model
2. model does the magic and produces recommendation (5 books)



- Categories
- Title
- Description
- Author

In [1]:
import sklearn
import nltk
import pandas as pd
import numpy as np
import kagglehub

#import surprise
#import lenskit
#import librec

In [2]:
# import dataset
readAlikeDataFrame = pd.read_csv('BooksDatasetClean.csv', usecols=['Description', 'Category', 'Title'])

In [3]:
## helper functions
def getDescLen(desc):
  len(desc.split())

#Data Preprocessing

In [14]:

#drop books with missing or empty description

readAlikeDataFrame["Description"] = readAlikeDataFrame["Description"].replace(r'', np.nan, regex=True)
readAlikeDataFrame["Category"] = readAlikeDataFrame["Category"].replace(r'', np.nan, regex=True)
readAlikeDataFrame.dropna(subset=["Description"], inplace=True)
readAlikeDataFrame.dropna(subset=["Category"], inplace=True)

readAlikeDataFrame["description_length"] = [getDescLen(desc) for desc in readAlikeDataFrame["Description"]]


#drop same books


#remove book series




readAlikeDataFrame['Genre_and_Description'] = readAlikeDataFrame['Category'] + ' ' + readAlikeDataFrame['Description']
readAlikeDataFrame = readAlikeDataFrame.head()


#Feature Extraction

1. TF-IDF (Term Frequency-Inverse Document Frequency) give weight to important words in the book description.
2. Cosine Similarity: Measures how similar two books are by comparing the angles between their vector representations. If two books are more similar, the cosine similarity score will be closer to 1.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(stop_words='english')


tfidf_matrix = tfidf.fit_transform(readAlikeDataFrame['Genre_and_Description']) #currently cosine_simetry uses too much ram, lets look at fixes



#similarity scores
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)




##Sentiment Analysis


In [16]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()


def get_sentiment(text):
  return sia.polarity_scores(text)['compound']

readAlikeDataFrame['Sentiment'] = readAlikeDataFrame['Description'].apply(get_sentiment)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


#Recommend

Book Obj: Description, genre, Title

In [17]:
class Book:
  title: str
  description: str
  genre: str
  author: str

  def __init__(self, title, description, genre, author):
    self.title = title
    self.description = description
    self.genre = genre
    self.author = author


print(readAlikeDataFrame)

                                                Title  \
7                          Journey Through Heartsongs   
8                        In Search of Melancholy Baby   
10       The Dieter's Guide to Weight Loss During Sex   
11  Germs : Biological Weapons and America's Secre...   
13  The Good Book: Reading the Bible with Mind and...   

                                          Description  \
7   Collects poems written by the eleven-year-old ...   
8   The Russian author offers an affectionate chro...   
10  A humor classic, this tongue-in-cheek diet pla...   
11  Deadly germs sprayed in shopping malls, bomb-l...   
13  "The Bible and the social and moral consequenc...   

                                        Category description_length  \
7                               Poetry , General               None   
8            Biography & Autobiography , General               None   
10   Health & Fitness , Diet & Nutrition , Diets               None   
11   Technology & Engineering 

In [18]:
index = pd.Series(readAlikeDataFrame.index, index=readAlikeDataFrame['Title']).drop_duplicates()

def sentiment_similarity(user_sentiment, books_sentiments):
  return 1-abs(user_sentiment - books_sentiments)



def recommend_books_with_tf_idf(book: Book, cosine_sim=cosine_sim, w_tfidf=0.7, w_sentiment=0.3):
  idx = index[book.title] # what if words don't match 1:1?
  print(idx)
  if (not idx):
    #pairwise sim scores for all books x input book
    sim_scores_tfidf = cosine_sim[idx]

    input_book_sentiment = readAlikeDataFrame.loc[idx, 'Sentiment']
  else:
    input_book_info = book.genre + ' ' + book.description
    input_book_tfidf = tfidf.transform([input_book_info])


    #tfidf
    sim_scores_tfidf = cosine_similarity(input_book_tfidf, tfidf_matrix).flatten()

    #sentiment
    input_book_sentiment = get_sentiment(input_book_info)


  sim_scores_sentiment = readAlikeDataFrame['Sentiment'].apply(lambda x: sentiment_similarity(input_book_sentiment, x)).values

  #combined
  combined_scores = (w_tfidf * sim_scores_tfidf) + (w_sentiment * sim_scores_sentiment)

  sim_scores_indexes = combined_scores.argsort()[-6:-1][::-1]

  return readAlikeDataFrame['Title'].iloc[sim_scores_indexes]



In [19]:
book = Book(
    "Journey Through Heartsongs",
    "Mattie J. T. Stepanek takes us on a Journey Through Heartsongs with more of his moving poems. These poems share the rare wisdom that Mattie has acquired through his struggle with a rare form of muscular dystrophy and the death of his three siblings from the same disease. His life view was one of love and generosity and as a poet and a peacemaker, his desire was to bring his message of peace to as many people as possible.",
    " Poetry , Subjects & Themes , Inspirational & Religious",
    "By Stepanek, Mattie J. T.")

res = recommend_books_with_tf_idf(book)
print(res)


7
8                          In Search of Melancholy Baby
10         The Dieter's Guide to Weight Loss During Sex
13    The Good Book: Reading the Bible with Mind and...
11    Germs : Biological Weapons and America's Secre...
Name: Title, dtype: object
