# Notes (delete later)

**Project Proposal Feedback:** Think about which dataset of books can support your project and how to evaluate the performance of your recommendation system

<a href="https://colab.research.google.com/github/solodezaldivar/readAlike/blob/main/readAlike.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Book datasets: https://www.kaggle.com/datasets/elvinrustam/books-dataset/data, https://github.com/luminati-io/Amazon-popular-books-dataset

Technical idea:
1. Input: at most 5 book titles for the model
2. Output: the model produces 5 book recommendations



- Categories
- Title
- Description
- Author

# ReadAlike

In [1]:
import nltk
import pandas as pd
import numpy as np

In [86]:
# data from https://www.kaggle.com/datasets/elvinrustam/books-dataset?resource=download
readAlikeDataFrame = pd.read_csv('./datasets/BooksDatasetClean.csv',
                                 usecols=['Description', 'Category', 'Title'])

In [87]:
readAlikeDataFrame.head()

Unnamed: 0,Title,Description,Category
0,Goat Brothers,,"History , General"
1,The Missing Person,,"Fiction , General"
2,Don't Eat Your Heart Out Cookbook,,"Cooking , Reference"
3,When Your Corporate Umbrella Begins to Leak: A...,,
4,Amy Spangler's Breastfeeding : A Parent's Guide,,


## Data Preprocessing

In [88]:
# helper functions
def getDescLen(desc):
    return len(desc.split())

In [89]:
# drop books with a missing or empty (whitespace-only) description or category
readAlikeDataFrame['Description'] = readAlikeDataFrame['Description'].replace(r'^\s*$', np.nan, regex=True)
readAlikeDataFrame['Category'] = readAlikeDataFrame['Category'].replace(r'^\s*$', np.nan, regex=True)
readAlikeDataFrame.dropna(subset=['Description', 'Category'], inplace=True)

# drop books with too short description
readAlikeDataFrame["description_length"] = readAlikeDataFrame["Description"].apply(getDescLen)
readAlikeDataFrame = readAlikeDataFrame[readAlikeDataFrame["description_length"] >= 10]

# drop duplicate books
readAlikeDataFrame.drop_duplicates(subset='Title', keep='first', inplace=True)

# TODO: remove book series

# merge content into one column to calculate tfidf on this later
readAlikeDataFrame['Genre_and_Description'] = readAlikeDataFrame['Category'] + ' ' + readAlikeDataFrame['Description']

# reset index after all pre-processing steps
readAlikeDataFrame.reset_index(drop=True, inplace=True)

readAlikeDataFrame.head()

Unnamed: 0,Title,Description,Category,description_length,Genre_and_Description
0,Journey Through Heartsongs,Collects poems written by the eleven-year-old ...,"Poetry , General",26,"Poetry , General Collects poems written by th..."
1,In Search of Melancholy Baby,The Russian author offers an affectionate chro...,"Biography & Autobiography , General",30,"Biography & Autobiography , General The Russi..."
2,The Dieter's Guide to Weight Loss During Sex,"A humor classic, this tongue-in-cheek diet pla...","Health & Fitness , Diet & Nutrition , Diets",70,"Health & Fitness , Diet & Nutrition , Diets A..."
3,Germs : Biological Weapons and America's Secre...,"Deadly germs sprayed in shopping malls, bomb-l...","Technology & Engineering , Military Science",429,"Technology & Engineering , Military Science D..."
4,The Good Book: Reading the Bible with Mind and...,"""The Bible and the social and moral consequenc...","Religion , Biblical Biography , General",42,"Religion , Biblical Biography , General ""The ..."


In [90]:
# True: skip the single-run cells and only run the parameter statistics further below.
# False: run the single-run cells and experiment freely.
is_getting_stats = True
# True: truncate the dataset so that ANN and cosine similarity can be compared;
# the dense cosine-similarity matrix is too large to compute on the full dataset.
is_reduced_dataset = True
n_reduced_dataset = 500

if not is_getting_stats:
    # cosine similarity always runs on the reduced frame
    readAlikeDataFrame_reduced = readAlikeDataFrame.head(n_reduced_dataset)
    if is_reduced_dataset:
        # ANN uses the same reduced frame so that both methods are comparable
        readAlikeDataFrame = readAlikeDataFrame_reduced

## Feature Extraction

1. TF-IDF (Term Frequency-Inverse Document Frequency): gives more weight to words that are frequent in one book description but rare across the whole dataset.
2. Cosine Similarity: measures how similar two books are by comparing the angle between their vector representations. The more similar two books are, the closer the cosine similarity score is to 1.
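
The two steps above can be sketched on a toy corpus without any external library. This is a simplified illustration, not the notebook's pipeline: it skips stop-word removal, the helper names `tfidf_vectors` and `cosine` are hypothetical, and the three toy descriptions are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "poetry poems about hope and peace",
    "inspirational poems about peace",
    "military science and biological weapons",
]
vecs = tfidf_vectors(docs)
sim_poems = cosine(vecs[0], vecs[1])      # two poetry descriptions
sim_unrelated = cosine(vecs[0], vecs[2])  # poetry vs. military science
print(f"poems vs poems: {sim_poems:.3f}, poems vs military: {sim_unrelated:.3f}")
```

The two poetry descriptions share several terms and therefore score higher than the unrelated pair, which only shares the word "and".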

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# generate the TF-IDF matrix
def get_tfidf_matrix(df):
    tfidf = TfidfVectorizer(stop_words='english')
    return tfidf.fit_transform(df['Genre_and_Description'])


# compute cosine similarity scores from a TF-IDF matrix
def get_cos_sym_scores(tfidf_matrix):
    return cosine_similarity(tfidf_matrix, tfidf_matrix)


# the dense cosine-similarity matrix uses too much RAM, so we only compute it on the reduced df
if not is_getting_stats:
    tfidf_matrix_cos_sym = get_tfidf_matrix(readAlikeDataFrame_reduced)
    cosine_sim_scores = get_cos_sym_scores(tfidf_matrix_cos_sym)


In [92]:
# Fix 3: combine a sparse matrix, dimensionality reduction, and ANN.
# Works, but the parameters still need tuning.
# Why: the TF-IDF matrix has a sparsity ratio of 0.9995, so a sparse representation
# that stores only the non-zero entries pays off. It also has 115,650 features, so
# dimensionality reduction shrinks each vector considerably. Approximate Nearest
# Neighbors (ANN) then retrieves similar items quickly without computing every
# exact pairwise score.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from annoy import AnnoyIndex  # conda install conda-forge::python-annoy

def get_ann_index(n_dim_reduction, n_trees, df=readAlikeDataFrame):
    # create a sparse TF-IDF Matrix:
    tfidf_matrix = get_tfidf_matrix(df)

    # matrix dimensions (documents x features)
    num_documents, num_features = tfidf_matrix.shape
    # print(f'Number of documents: {num_documents}')
    # print(f'Number of features (dimensions): {num_features}')

    # sparsity ratio
    non_zero_count = tfidf_matrix.nnz
    total_elements = tfidf_matrix.shape[0] * tfidf_matrix.shape[1]
    sparsity = 1.0 - (non_zero_count / total_elements)
    # print(f"Sparsity Ratio: {sparsity:.4f}")  # e.g., 0.8750 indicates 87.50% of the matrix is zero

    # apply dimensionality reduction:
    svd = TruncatedSVD(n_components=n_dim_reduction)
    tfidf_matrix_reduced = svd.fit_transform(tfidf_matrix)

    # implement ANN
    f = tfidf_matrix_reduced.shape[1]  # Number of dimensions
    t = AnnoyIndex(f, 'angular')  # Choose distance metric

    for i in range(tfidf_matrix_reduced.shape[0]):
        t.add_item(i, tfidf_matrix_reduced[i])

    t.build(n_trees)  # build the index with n trees

    return tfidf_matrix, t
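
The sparsity ratio computed inside `get_ann_index` can be illustrated on a tiny dense matrix. This is only a sketch: the helper name `sparsity_ratio` and the toy values are assumptions for the example, and the real TF-IDF matrix is a SciPy sparse matrix where `nnz` already gives the non-zero count.

```python
def sparsity_ratio(matrix):
    """Fraction of entries that are zero in a dense list-of-lists matrix."""
    total = sum(len(row) for row in matrix)
    non_zero = sum(1 for row in matrix for value in row if value != 0)
    return 1.0 - non_zero / total


# toy 3 x 4 matrix with three non-zero weights
toy = [
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3],
    [0.5, 0.0, 0.0, 0.0],
]
ratio = sparsity_ratio(toy)
print(f"Sparsity ratio: {ratio:.4f}")  # 9 of 12 entries are zero -> 0.7500
```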

In [None]:
n_dim_reduction = 100
n_trees = 10

if not is_getting_stats:
    tfidf_matrix_ann, t = get_ann_index(n_dim_reduction, n_trees, readAlikeDataFrame)

### Sentiment Analysis


In [93]:
# maybe consider sentiment analysis later but right now not the focus
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    return sia.polarity_scores(text)['compound']

#readAlikeDataFrame['Sentiment'] = readAlikeDataFrame['Description'].apply(get_sentiment)



In [94]:
index = pd.Series(readAlikeDataFrame.index, index=readAlikeDataFrame['Title']).drop_duplicates()

def sentiment_similarity(user_sentiment, books_sentiments):
    return 1 - abs(user_sentiment - books_sentiments)

## Recommend

Book object: title, description, genre, author

In [95]:
class Book:
    title: str
    description: str
    genre: str
    author: str

    def __init__(self, title, description, genre, author):
        self.title = title
        self.description = description
        self.genre = genre
        self.author = author

In [96]:
# This method uses TF-IDF and Cosine Similarity to calculate book recommendations
def recommend_books_with_cosine_sym(book: Book, tfidf_matrix, cosine_sim,
                                    df, w_tfidf=1.0, n_recommend=5):
    try:
        idx = index[book.title]
    except KeyError:
        idx = None

    if idx is not None:
        sim_scores_tfidf = cosine_sim[idx]
    else:
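        # if the book is not in the dataset, build a TF-IDF vector from its info;
        # the vectorizer is refitted on df so the new vector shares the dataset's vocabulary
        input_book_info = book.genre + ' ' + book.description
        vectorizer = TfidfVectorizer(stop_words='english')
        dataset_tfidf = vectorizer.fit_transform(df['Genre_and_Description'])
        input_book_tfidf = vectorizer.transform([input_book_info])

        # calculate cosine similarity between the input book and all books in the dataset
        sim_scores_tfidf = cosine_similarity(input_book_tfidf, dataset_tfidf).flatten()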

    combined_scores = w_tfidf * sim_scores_tfidf

    # exclude input book
    if idx is not None:
        combined_scores[idx] = -1

    sim_scores_indexes = combined_scores.argsort()[-n_recommend:][::-1]

    return df['Title'].iloc[sim_scores_indexes]



In [97]:
# This method uses TF-IDF and ANN to calculate book recommendations
def recommend_books_with_ann(book: Book, tfidf, df, t, n_neighbors=5):
    try:
        idx = index[book.title]
    except KeyError:
        idx = None

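    if idx is None:
        # NOTE: this branch needs `tfidf` to be the fitted vectorizer (not the matrix),
        # and the query vector must be projected with the same SVD that was used to
        # build the index; get_ann_index currently returns neither
        input_book_info = book.genre + ' ' + book.description
        input_book_tfidf = tfidf.transform([input_book_info])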

        # do ANN to find the nearest neighbors based on the vector
        nearest_indices = t.get_nns_by_vector(input_book_tfidf.toarray()[0], n_neighbors + 1)
    else:
        # do ANN directly based on the item index
        nearest_indices = t.get_nns_by_item(int(idx), n_neighbors + 1)

    # exclude input book
    if idx is not None and idx in nearest_indices:
        nearest_indices.remove(idx)

    nearest_indices = nearest_indices[:n_neighbors]

    # TODO: to increase precision, retrieve more than n_neighbors candidates and re-rank them with exact cosine similarity

    return df['Title'].iloc[nearest_indices]
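
The re-ranking idea from the TODO in `recommend_books_with_ann` could look roughly like the sketch below. It assumes the ANN index has already returned a list of candidate ids and that the full vectors are available by id; `rerank_candidates`, `cosine`, and the toy `vectors` are hypothetical names and data, not part of the notebook.

```python
import math


def cosine(u, v):
    """Exact cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_candidates(query_vec, candidate_ids, vectors, n_neighbors=5):
    """Sort ANN candidates by exact cosine similarity to the query, best first."""
    scored = [(cosine(query_vec, vectors[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n_neighbors]]


# toy vectors standing in for rows of the full (unreduced) TF-IDF matrix
vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.7, 0.7, 0.0],
}
# pretend the ANN index returned these ids in approximate order for item 0
candidates = [2, 3, 1]
top = rerank_candidates(vectors[0], candidates, vectors, n_neighbors=2)
print(top)  # [1, 3]
```

Over-fetching candidates and re-scoring only that short list keeps the exact computation cheap, since it never touches the full pairwise matrix.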


In [98]:
book = Book(
    "Journey Through Heartsongs",
    "Mattie J. T. Stepanek takes us on a Journey Through Heartsongs with more of his moving poems. These poems share the rare wisdom that Mattie has acquired through his struggle with a rare form of muscular dystrophy and the death of his three siblings from the same disease. His life view was one of love and generosity and as a poet and a peacemaker, his desire was to bring his message of peace to as many people as possible.",
    " Poetry , Subjects & Themes , Inspirational & Religious",
    "By Stepanek, Mattie J. T.")


In [99]:
if not is_getting_stats:
    res_cos_sym = recommend_books_with_cosine_sym(book, tfidf_matrix_cos_sym, cosine_sim_scores,
                                                  readAlikeDataFrame_reduced, n_recommend=10)
    print(res_cos_sym.head(10))

In [100]:
if not is_getting_stats:
    res_ann = recommend_books_with_ann(book, tfidf_matrix_ann, readAlikeDataFrame, t, n_neighbors=10)
    print(res_ann.head(10))

In [101]:
if not is_getting_stats:
    intersection_count = len(set(res_cos_sym) & set(res_ann))
    overlap_percentage = (intersection_count / min(len(res_cos_sym), len(res_ann))) * 100

    print(f"Overlap Percentage: {overlap_percentage}%")
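
The overlap metric used in the cell above can be wrapped in a small helper so that it is computed the same way everywhere. This is a sketch under assumptions: the function name `overlap_percentage` and the guard for empty lists are additions for illustration, and the sample titles are placeholders.

```python
def overlap_percentage(recs_a, recs_b):
    """Share of titles common to both lists, relative to the shorter list, in percent."""
    set_a, set_b = set(recs_a), set(recs_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        # avoid a division by zero when one method returns nothing
        return 0.0
    return len(set_a & set_b) / smaller * 100


exact = ["Book A", "Book B", "Book C", "Book D"]
approximate = ["Book B", "Book D", "Book E", "Book F"]
overlap = overlap_percentage(exact, approximate)
print(f"Overlap Percentage: {overlap}%")  # 2 of 4 titles match -> 50.0%
```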

In [122]:
# evaluate the impact of different parameters on the overlap between ANN and cosine similarity recommendations

def get_statistics(stats, n_reduced_dataset_list, n_trees_list, n_dim_reduction_list, n_output):
    # run every combination of dataset size, tree count, and SVD dimension
    for n_reduced_dataset in n_reduced_dataset_list:
        for n_trees in n_trees_list:
            for n_dim_reduction in n_dim_reduction_list:

                readAlikeDataFrame_reduced = readAlikeDataFrame.head(n_reduced_dataset)
                tfidf_matrix_cos_sym = get_tfidf_matrix(readAlikeDataFrame_reduced)
                cosine_sim_scores = get_cos_sym_scores(tfidf_matrix_cos_sym)
                tfidf_matrix_ann, t = get_ann_index(n_dim_reduction, n_trees, readAlikeDataFrame_reduced)

                res_cos_sym = recommend_books_with_cosine_sym(book, tfidf_matrix_cos_sym, cosine_sim_scores, readAlikeDataFrame_reduced, n_recommend=n_output)
                res_ann = recommend_books_with_ann(book, tfidf_matrix_ann, readAlikeDataFrame_reduced, t, n_neighbors=n_output)

                intersection_count = len(set(res_cos_sym) & set(res_ann))
                overlap_percentage = (intersection_count / min(len(res_cos_sym), len(res_ann))) * 100

                stats.append({
                    'n_reduced_dataset': n_reduced_dataset,
                    'n_trees': n_trees,
                    'n_dim_reduction': n_dim_reduction,
                    'n_output': n_output,
                    'overlap_percentage': overlap_percentage
                })

                print(stats[-1])

    stats_df = pd.DataFrame(stats)
    return stats_df


In [123]:
stats_df = pd.read_pickle("./statistics/stats_df_241102.pkl")
stats_list = stats_df.to_dict(orient='records')

In [124]:
is_getting_stats = True
n_reduced_dataset_list = [20000, 25000]
n_trees_list = [60, 100, 150]
n_dim_reduction_list = [200, 300, 400]
n_output = 10

#stats_df = get_statistics(stats_list, n_reduced_dataset_list, n_trees_list, n_dim_reduction_list, n_output)
#stats_df.to_pickle("./statistics/stats_df_XXXXX.pkl")

# on my machine, cosine similarity ran out of memory at the following parameters: {'n_reduced_dataset': 25000, 'n_trees': 60, 'n_dim_reduction': 200, 'n_output': 10, 'overlap_percentage': XXXX}

{'n_reduced_dataset': 20000, 'n_trees': 60, 'n_dim_reduction': 200, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 60, 'n_dim_reduction': 300, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 60, 'n_dim_reduction': 400, 'n_output': 10, 'overlap_percentage': 10.0}
{'n_reduced_dataset': 20000, 'n_trees': 100, 'n_dim_reduction': 200, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 100, 'n_dim_reduction': 300, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 100, 'n_dim_reduction': 400, 'n_output': 10, 'overlap_percentage': 50.0}
{'n_reduced_dataset': 20000, 'n_trees': 150, 'n_dim_reduction': 200, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 150, 'n_dim_reduction': 300, 'n_output': 10, 'overlap_percentage': 40.0}
{'n_reduced_dataset': 20000, 'n_trees': 150, 'n_dim_reduction': 400, 'n_output': 10, 'overlap_perce

MemoryError: Unable to allocate 4.66 GiB for an array with shape (25000, 25000) and data type float64
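
The size in the error message follows directly from the shape of the dense similarity matrix: one float64 (8 bytes) per pair of books. A quick back-of-the-envelope check, with `dense_matrix_gib` as a hypothetical helper name:

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense matrix, in GiB (float64 uses 8 bytes per value)."""
    return n_rows * n_cols * bytes_per_value / 2**30


for n_books in (15000, 20000, 25000):
    print(f"{n_books} books: {dense_matrix_gib(n_books, n_books):.2f} GiB")
# 15000 books: 1.68 GiB
# 20000 books: 2.98 GiB
# 25000 books: 4.66 GiB
```

Memory grows quadratically with the number of books, which is why the exact cosine-similarity baseline stops being practical well before the full dataset is reached.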

In [111]:
stats_df.corr()['overlap_percentage']


n_reduced_dataset    -0.673873
n_trees               0.132187
n_dim_reduction       0.077492
n_output                   NaN
overlap_percentage    1.000000
Name: overlap_percentage, dtype: float64

In [112]:
stats_df.groupby('n_trees')['overlap_percentage'].mean()


n_trees
40     53.333333
60     58.333333
100    60.833333
150    59.166667
Name: overlap_percentage, dtype: float64

In [113]:
stats_df.groupby('n_dim_reduction')['overlap_percentage'].mean()

n_dim_reduction
100    53.333333
200    60.833333
300    60.833333
400    56.666667
Name: overlap_percentage, dtype: float64

In [132]:
stats_df[stats_df['n_reduced_dataset'] == 15000].sort_values(by='overlap_percentage', ascending=False)


Unnamed: 0,n_reduced_dataset,n_trees,n_dim_reduction,n_output,overlap_percentage
39,15000,60,400,10,60.0
41,15000,100,200,10,60.0
33,15000,40,200,10,50.0
34,15000,40,300,10,50.0
38,15000,60,300,10,50.0
40,15000,100,100,10,50.0
42,15000,100,300,10,50.0
43,15000,100,400,10,50.0
44,15000,150,100,10,50.0
45,15000,150,200,10,50.0


In [131]:
stats_df[stats_df['n_reduced_dataset'] == 20000].sort_values(by='overlap_percentage', ascending=False)

Unnamed: 0,n_reduced_dataset,n_trees,n_dim_reduction,n_output,overlap_percentage
53,20000,100,400,10,50.0
48,20000,60,200,10,40.0
49,20000,60,300,10,40.0
51,20000,100,200,10,40.0
52,20000,100,300,10,40.0
54,20000,150,200,10,40.0
55,20000,150,300,10,40.0
56,20000,150,400,10,40.0
50,20000,60,400,10,10.0
