# Notes (delete later)

**Project Proposal Feedback:** Think about which dataset of books can support your project and how to evaluate the performance of your recommendation system

<a href="https://colab.research.google.com/github/solodezaldivar/readAlike/blob/main/readAlike.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Book datasets: https://www.kaggle.com/datasets/elvinrustam/books-dataset/data, https://github.com/luminati-io/Amazon-popular-books-dataset,

Technische Idee:
1. Input: max 5 book titles for the model
2. model does the magic and produces recommendation (5 books)



- Categories
- Title
- Description
- Author

# ReadAlike

In [1]:
import sklearn
import nltk
import pandas as pd
import numpy as np
#import kagglehub

#import surprise
#import lenskit
#import librec

In [2]:
# import dataset
# data from https://www.kaggle.com/datasets/elvinrustam/books-dataset?resource=download
readAlikeDataFrame = pd.read_csv('./resources/datasets/BooksDatasetClean.csv', usecols=['Description', 'Category', 'Title'])

In [3]:
readAlikeDataFrame.head()

Unnamed: 0,Title,Description,Category
0,Goat Brothers,,"History , General"
1,The Missing Person,,"Fiction , General"
2,Don't Eat Your Heart Out Cookbook,,"Cooking , Reference"
3,When Your Corporate Umbrella Begins to Leak: A...,,
4,Amy Spangler's Breastfeeding : A Parent's Guide,,


In [4]:
# helper functions
def getDescLen(desc):
  len(desc.split())

## Data Preprocessing

In [5]:
#drop books with missing or empty description and category
readAlikeDataFrame['Description'] = readAlikeDataFrame["Description"].replace(r'', np.nan, regex=True)
readAlikeDataFrame["Category"] = readAlikeDataFrame["Category"].replace(r'', np.nan, regex=True)
readAlikeDataFrame.dropna(subset=["Description"], inplace=True)
readAlikeDataFrame.dropna(subset=["Category"], inplace=True)

# TODO: does not work yet, needed?
readAlikeDataFrame["description_length"] = [getDescLen(desc) for desc in readAlikeDataFrame["Description"]]


# TODO: drop same books


# TODO: remove book series


readAlikeDataFrame['Genre_and_Description'] = readAlikeDataFrame['Category'] + ' ' + readAlikeDataFrame['Description']



In [6]:
readAlikeDataFrame.head()

Unnamed: 0,Title,Description,Category,description_length,Genre_and_Description
7,Journey Through Heartsongs,Collects poems written by the eleven-year-old ...,"Poetry , General",,"Poetry , General Collects poems written by th..."
8,In Search of Melancholy Baby,The Russian author offers an affectionate chro...,"Biography & Autobiography , General",,"Biography & Autobiography , General The Russi..."
10,The Dieter's Guide to Weight Loss During Sex,"A humor classic, this tongue-in-cheek diet pla...","Health & Fitness , Diet & Nutrition , Diets",,"Health & Fitness , Diet & Nutrition , Diets A..."
11,Germs : Biological Weapons and America's Secre...,"Deadly germs sprayed in shopping malls, bomb-l...","Technology & Engineering , Military Science",,"Technology & Engineering , Military Science D..."
13,The Good Book: Reading the Bible with Mind and...,"""The Bible and the social and moral consequenc...","Religion , Biblical Biography , General",,"Religion , Biblical Biography , General ""The ..."


## Feature Extraction

1. TF-IDF (Term Frequency-Inverse Document Frequency) give weight to important words in the book description.
2. Cosine Similarity: Measures how similar two books are by comparing the angles between their vector representations. If two books are more similar, the cosine similarity score will be closer to 1.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(readAlikeDataFrame['Genre_and_Description'])

#similarity scores
# TODO: currently cosine_symmetry uses too much ram, lets look at fixes (below)
readAlikeDataFrame = readAlikeDataFrame.head() # Only use a fraction of the dataset
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


In [None]:
# FIX 1 USING A SPARSE MATRIX: does not work either
# In NLP sparse matrices are commonly encountered, especially when dealing with text data. This is because most documents or text entries share only a few common words, which leads to a high number of zero values in the matrix.

# from scipy.sparse import csr_matrix
# from sklearn.metrics.pairwise import cosine_similarity
# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf = TfidfVectorizer(stop_words='english')

# tfidf_matrix = tfidf.fit_transform(readAlikeDataFrame['Genre_and_Description'])

# Convert the TF-IDF matrix to a sparse representation
# tfidf_matrix_sparse = csr_matrix(tfidf_matrix)

# Perform cosine similarity computation with the sparse matrix
# cosine_sim = cosine_similarity(tfidf_matrix_sparse, dense_output=False)


In [23]:
# FIX 2 USING APPROXIMATE NEAREST NEIGHBORS: does not work either
# We do not need the exact similarity values and an approximately close is good enough for our recommendation system. The main idea is to trade a small amount of accuracy for a significant gain in speed and efficiency.

# import faiss # conda install conda-forge::faiss

# Convert the TF-IDF matrix to a numpy array
# tfidf_array = tfidf_matrix.toarray()

# Build the FAISS index
# d = tfidf_array.shape[1]
# index = faiss.IndexFlatL2(d)
# index.add(tfidf_array)

# Search for the 5 nearest neighbors of each item
# k = 5
# distances, indices = index.search(tfidf_array, k)


MemoryError: Unable to allocate 56.3 GiB for an array with shape (65296, 115650) and data type float64

In [8]:
# FIX 3 COMBINING SPARSE MATRIX, DIMENSIONALITY REDUCTION AND ANN: works but need to play around with parameters

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from annoy import AnnoyIndex # conda install conda-forge::python-annoy

# Create a Sparse TF-IDF Matrix:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(readAlikeDataFrame['Genre_and_Description'])

# Number of Dimensionality
num_documents, num_features = tfidf_matrix.shape
print(f'Number of documents: {num_documents}')
print(f'Number of features (dimensions): {num_features}')

# Sparsity ratio
non_zero_count = tfidf_matrix.nnz
total_elements = tfidf_matrix.shape[0] * tfidf_matrix.shape[1]
sparsity = 1.0 - (non_zero_count / total_elements)
print(f"Sparsity Ratio: {sparsity:.4f}")  # e.g., 0.8750 indicates 87.50% of the matrix is zero

# Apply Dimensionality Reduction:
n_components = 100  # TODO: figure out what number to use for best results
svd = TruncatedSVD(n_components=n_components)
tfidf_matrix_reduced = svd.fit_transform(tfidf_matrix)

# Implement ANN
f = tfidf_matrix_reduced.shape[1]  # Number of dimensions
t = AnnoyIndex(f, 'angular')  # Choose distance metric

for i in range(tfidf_matrix_reduced.shape[0]):
    t.add_item(i, tfidf_matrix_reduced[i])

t.build(10)  # Build the index with 10 trees # TODO: figure out what number to use for best results


Number of documents: 65296
Number of features (dimensions): 115650


### Sentiment Analysis


In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()


def get_sentiment(text):
  return sia.polarity_scores(text)['compound']

readAlikeDataFrame['Sentiment'] = readAlikeDataFrame['Description'].apply(get_sentiment)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\stefa\AppData\Roaming\nltk_data...


## Recommend

Book Obj: Description, genre, Title

In [13]:
class Book:
  title: str
  description: str
  genre: str
  author: str

  def __init__(self, title, description, genre, author):
    self.title = title
    self.description = description
    self.genre = genre
    self.author = author


print(readAlikeDataFrame)

                                                    Title  \
7                              Journey Through Heartsongs   
8                            In Search of Melancholy Baby   
10           The Dieter's Guide to Weight Loss During Sex   
11      Germs : Biological Weapons and America's Secre...   
13      The Good Book: Reading the Bible with Mind and...   
...                                                   ...   
103050                             Like A Sister: A Novel   
103052  Creating Web Pages Simplified (3-D Visual Series)   
103053               EVA: The Real Key to Creating Wealth   
103056  The Essentials of Spanish (REA's Language Seri...   
103062               Your First Puppy (Your First Series)   

                                              Description  \
7       Collects poems written by the eleven-year-old ...   
8       The Russian author offers an affectionate chro...   
10      A humor classic, this tongue-in-cheek diet pla...   
11      Deadly germs sp

In [14]:
index = pd.Series(readAlikeDataFrame.index, index=readAlikeDataFrame['Title']).drop_duplicates()

def sentiment_similarity(user_sentiment, books_sentiments):
  return 1-abs(user_sentiment - books_sentiments)

In [11]:
def recommend_books_with_tf_idf(book: Book, cosine_sim=cosine_sim, w_tfidf=0.7, w_sentiment=0.3):
  idx = index[book.title] # what if words don't match 1:1?
  print(idx)
  if not idx: # TODO: why not? shouldn't it be opposite?
    #pairwise sim scores for all books x input book
    sim_scores_tfidf = cosine_sim[idx]

    input_book_sentiment = readAlikeDataFrame.loc[idx, 'Sentiment']
  else:
    input_book_info = book.genre + ' ' + book.description
    input_book_tfidf = tfidf.transform([input_book_info])


    #tfidf
    sim_scores_tfidf = cosine_similarity(input_book_tfidf, tfidf_matrix).flatten()

    #sentiment
    input_book_sentiment = get_sentiment(input_book_info)


  sim_scores_sentiment = readAlikeDataFrame['Sentiment'].apply(lambda x: sentiment_similarity(input_book_sentiment, x)).values

  #combined
  combined_scores = (w_tfidf * sim_scores_tfidf) + (w_sentiment * sim_scores_sentiment)

  sim_scores_indexes = combined_scores.argsort()[-6:-1][::-1]

  return readAlikeDataFrame['Title'].iloc[sim_scores_indexes]



In [37]:
# TODO: still WIP
def recommend_books_with_ann(book: Book, tfidf=tfidf, readAlikeDataFrame=readAlikeDataFrame,t=t, w_tfidf=0.7, w_sentiment=0.3, n_neighbors=5):

    idx = 9694 # readAlikeDataFrame[readAlikeDataFrame['Title'] == book.title].index

    if idx is None:
        input_book_info = book.genre + ' ' + book.description
        input_book_tfidf = tfidf.transform([input_book_info])

        # Get the nearest neighbors based on the input book's TF-IDF vector and find the top N nearest neighbors in the Annoy index
        nearest_indices = t.get_nns_by_vector(input_book_tfidf.toarray()[0], n_neighbors)

        # tfidf
        sim_scores_tfidf = cosine_similarity(input_book_tfidf, tfidf_matrix[nearest_indices]).flatten()

        # Sentiment
        input_book_sentiment = get_sentiment(input_book_info)

    else:
        nearest_indices = t.get_nns_by_item(idx, n_neighbors)
        sim_scores_tfidf = np.ones(n_neighbors)  # If you want to treat the found item as similar

        # sentiment
        # input_book_sentiment = readAlikeDataFrame.loc[idx, 'Sentiment']

    # Sentiment
    # sim_scores_sentimentsim_scores_sentiment = readAlikeDataFrame['Sentiment'].apply(lambda x: sentiment_similarity(input_book_sentiment, x)).values[nearest_indices]
    sim_scores_sentiment = 0.6

    # Combined
    combined_scores = (w_tfidf * sim_scores_tfidf) + (w_sentiment * sim_scores_sentiment)

    sim_scores_indexes = combined_scores.argsort()[-6:-1][::-1]

    return readAlikeDataFrame['Title'].iloc[sim_scores_indexes]


In [38]:
book = Book(
    "Journey Through Heartsongs",
    "Mattie J. T. Stepanek takes us on a Journey Through Heartsongs with more of his moving poems. These poems share the rare wisdom that Mattie has acquired through his struggle with a rare form of muscular dystrophy and the death of his three siblings from the same disease. His life view was one of love and generosity and as a poet and a peacemaker, his desire was to bring his message of peace to as many people as possible.",
    " Poetry , Subjects & Themes , Inspirational & Religious",
    "By Stepanek, Mattie J. T.")

# res = recommend_books_with_tf_idf(book)
res = recommend_books_with_ann(book)
print(res)


11    Germs : Biological Weapons and America's Secre...
10         The Dieter's Guide to Weight Loss During Sex
8                          In Search of Melancholy Baby
7                            Journey Through Heartsongs
Name: Title, dtype: object
