## Book recommendation system

#### Introduction: 
In this notebook, we implement a model that selects the 5 New York Times bestsellers most closely related to the subject of a given article, so that each article is linked to 5 books.
Similarity between a book and an article is assessed by computing TF-IDF vectors and comparing them with cosine similarity.

In [52]:
import json 
import pandas as pd 
import pprint
import re
import multiprocessing as mp
import numpy as np
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to /Users/anavekua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Load books data

In [53]:
# Load the books data, which contains NYT bestseller information and book summaries
file_paths = "data/bestsellers_with_abstract_and_genre_3350.json"
with open(file_paths, 'r') as file:
    data = json.load(file)

df = pd.DataFrame(data)

In [54]:
# Books per category
grouped_df = df.groupby('genre').count()
print(grouped_df['title'])

genre
Arts & Entertainment             55
Biographies & Memoirs          1237
Business & Personal Finance     166
Comics & Graphic Novels           1
Computers & Internet              3
Cookbooks, Food & Wine            4
Fiction & Literature             32
Health, Mind & Body             113
History                         387
Humor                           121
January 30                        1
Kids                              1
Lifestyle & Home                 22
Mysteries & Thrillers             1
N/A                             203
Nonfiction                      199
Parenting                        22
Politics & Current Events       432
Professional & Technical         48
Reference                         2
Religion & Spirituality          76
Romance                           3
Sci-Fi & Fantasy                  1
Science & Nature                119
Sports & Outdoors                80
Travel & Adventure               19
Young Adult                       2
Name: title, dtype: int64

In [55]:
# Remove N/A category
df = df[df['genre'] != 'N/A']

# Select only books in 'Politics & Current Events'
corpus_books = df[df['genre'] == 'Politics & Current Events']
corpus_books = corpus_books.reset_index(drop=True)

#### Load NYT Articles data

In [56]:
# Load the archived articles. This dataset contains all the archived political articles from the NYT.
file_paths = "/Users/anavekua/Documents/DataScienTest/API_nyt_json/nytimes_archives_elections_articles.json"

with open(file_paths, 'r') as file:
    data = json.load(file)

articles = pd.DataFrame(data)

In [57]:
# Load an article from the archive
a = 100300 # article number
article = articles.iloc[a, :]
print("Headline:", article['headline_main'])
pprint.pprint(article['abstract'])

Headline: Obama Ad Features Someone Big, Yellow and Feathery
('A new television ad by President Obama’s campaign features Big Bird in a '
 'tongue-in-cheek bid to attack Mitt Romney for suggesting he would crack down '
 'on federal funding for public television, while not cracking down on big '
 'banks.')


#### Data combination: books and the selected article

In [58]:
# Combine all book abstracts with the article's abstract in one dataset
corpus_article = article['abstract']
corpus_books_article = pd.DataFrame(corpus_books['abstract'])
corpus_books_article.loc[-1] = corpus_article # append the article's abstract as a new row (label -1, renumbered by the reset_index below)
corpus_books_article = corpus_books_article.reset_index(drop=True)
corpus_books_article.tail(3)

Unnamed: 0,abstract
430,NEW YORK TIMES BESTSELLER • From bestselling a...
431,"A NEW YORK TIMES, USA TODAY BESTSELLER! The Ne..."
432,A new television ad by President Obama’s campa...


# TF-IDF and similarity scores

#### Theory:
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a weighting scheme commonly used in text analysis and Natural Language Processing to represent documents as numeric vectors, which can then be compared to measure similarity. The TF-IDF weight of a term is the product of its Term Frequency and its Inverse Document Frequency. Term Frequency measures how often a term occurs in a document. Inverse Document Frequency measures how rare the term is across all documents, so words that appear in most documents receive a low weight.

The TF-IDF matrix holds one weight per document and per word. The similarity between two documents is then computed as the cosine similarity between their rows in this matrix. In summary, two documents score higher when they share words, especially words that are rare in the rest of the corpus.
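
To make the weighting concrete, here is a minimal pure-Python sketch on a made-up toy corpus. The helper names `tfidf_vectors` and `cosine` are hypothetical and not part of this notebook. The sketch assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`; other TF-IDF variants would give different numbers.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (L2-normalized)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Guard against an empty document, whose norm would be 0.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# Toy corpus (made up): two "books" followed by one "article".
docs = [
    "obama campaign election romney".split(),
    "bank bailout crisis money".split(),
    "obama campaign ad romney bank".split(),
]
vecs = tfidf_vectors(docs)
article = vecs[-1]
scores = [cosine(article, book) for book in vecs[:-1]]
print(scores)  # the first book shares three words with the article, the second only one
```

The first book scores higher because it shares more terms with the article, which is the behaviour the summary above describes.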

TF-IDF is usually preceded by the following text preprocessing:

- Cleaning: stop words, punctuation, and other non-alphanumeric characters are removed.
- Tokenization: the text is split into individual words.
- Stemming: each word is reduced to its base form.

#### In practice:
Our dataset is composed of book summaries, to which we will add the article abstract as the last row. Text preprocessing will be performed on the text documents. Then, we will apply the TF-IDF computation to the dataset. Cosine similarity will be calculated between the article's TF-IDF word values and the book summaries' TF-IDF word values. Finally, the 5 most similar books to the article will be displayed.

### Data preprocessing

In [59]:
def text_cleaning(text):
    # Remove rank markers such as "#1" in, for example, "#1 NATIONAL BESTSELLER"
    remove_hashtag1 = re.sub(r'\#\d', '', text)
    # Remove all remaining numbers
    remove_numbers = re.sub(r'\d+', '', remove_hashtag1)
    # Remove fully uppercase words (e.g. "BESTSELLER"); this also drops acronyms and single capital letters such as "A" or "I"
    remove_full_upper = re.sub(r'\b[A-Z]+\b', '', remove_numbers)
    # Lowercase
    lowercased_text = remove_full_upper.lower()
    # Remove everything that is not a word character (letter, digit, or underscore) or whitespace
    remove_punctuation = re.sub(r'[^\w\s]', '', lowercased_text)
    # Replace any run of whitespace (including '\xa0', the non-breaking space) with a single space,
    # and strip whitespace at the beginning and end of the text
    remove_white_space = re.sub(r'\s+', ' ', remove_punctuation).strip()

    return remove_white_space

def tokenization(text_clean):
    # Tokenization = Breaking down each text into words put in an array based on blank spaces.
    from nltk.tokenize import word_tokenize
    tokenized_text = word_tokenize(text_clean)
    return tokenized_text

def remove_stop_words(abstract_token):
    # Stop-word filtering = removing very common words that carry little meaning
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    stopwords_removed = [word for word in abstract_token if word not in stop_words]
    return stopwords_removed

# Stemming = Transforming words into their base form
def stemming(abstract_stop_words):
    from nltk.stem import PorterStemmer
    ps = PorterStemmer()
    stemmed_text = [ps.stem(word) for word in abstract_stop_words]
    return stemmed_text

def preprocessing_abstract(abstract):
    abstract_clean = text_cleaning(abstract)
    abstract_token = tokenization(abstract_clean)
    abstract_stop_words = remove_stop_words(abstract_token) 
    abstract_stemming = stemming(abstract_stop_words)
    return abstract_stemming
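
To see what the cleaning step does on its own, the sketch below reapplies the same regular expressions as `text_cleaning` to a made-up sample string, without the NLTK steps. The function name `clean_demo` and the sample sentence are hypothetical and only serve the illustration.

```python
import re

def clean_demo(text):
    """Apply the same regex steps as text_cleaning, in the same order."""
    text = re.sub(r'\#\d', '', text)         # drop "#1"-style rank markers
    text = re.sub(r'\d+', '', text)          # drop remaining digits
    text = re.sub(r'\b[A-Z]+\b', '', text)   # drop fully uppercase words
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation and symbols
    return re.sub(r'\s+', ' ', text).strip() # collapse whitespace

sample = "#1 NEW YORK TIMES BESTSELLER • A riveting account of the 2012 campaign."
print(clean_demo(sample))  # riveting account of the campaign
```

Note that the uppercase filter also removes the single capital letter "A", and would likewise remove acronyms, which may or may not be desirable depending on the corpus.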

In [60]:
# Preprocess each abstract and store the result, joined back into a single string, in a new column.
corpus_books_article['abstract_preprocessed'] = corpus_books_article['abstract'].apply(
    lambda abstract: ' '.join(preprocessing_abstract(abstract))
)

### TF-IDF computation

In [61]:
# Compute the TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus_books_article['abstract_preprocessed']) # returns a document-term matrix

# Get the vocabulary (one word per matrix column)
feature_names = vectorizer.get_feature_names_out()

# Combine corpus with the weighted word matrix by creating 'id' index and merge
corpus_books_article['id'] = range(0, len(corpus_books_article))
df_tfidf_prev = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
df_tfidf_prev['id'] = range(0, len(df_tfidf_prev))
df_tfidf = pd.merge(corpus_books_article, df_tfidf_prev, on='id')

In [62]:
def get_top_10_most_important_words(row_num):
    """
    Print the abstract stored at row `row_num` of df_tfidf and return
    its 10 words with the highest TF-IDF weight.
    """
    abstract = df_tfidf.iloc[row_num, df_tfidf.columns.get_loc("abstract")]
    # Keep only the TF-IDF columns, which come after the 'id' column
    row = df_tfidf.iloc[row_num, df_tfidf.columns.get_loc("id") + 1:]
    top_10_words = row.sort_values(ascending=False).head(10)
    pprint.pprint(abstract)
    return top_10_words

### Cosine similarity computation

In [63]:
# Computation of Cosine similarity. The higher the cosim value, the more similar the elements are. 
cosim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosim = pd.DataFrame(cosim)
cosim_article = cosim.tail(1)
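
Only the article's row of the similarity matrix is used afterwards, so the full pairwise matrix is not strictly needed. A minimal sketch, on made-up and already-preprocessed documents, of computing just that row with the same scikit-learn functions; the variable names here are illustrative and do not replace the notebook's own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up documents; the last one plays the role of the article.
docs = [
    "obama campaign elect romney",
    "bank bailout crisi money",
    "obama campaign ad romney bank",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Slicing with [-1:] and [:-1] keeps both operands as 2-D sparse matrices.
article_vs_books = cosine_similarity(tfidf[-1:], tfidf[:-1]).ravel()
print(article_vs_books.shape)  # one score per book
```

On a few hundred books the saving is negligible, but it grows quickly with corpus size, since the full matrix has one entry per pair of documents.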

### Books to article selection (top 5)

In [64]:
# Select the last row of the cosine matrix, which holds the article's similarity to each book summary.
# The last column is dropped because it is the article's similarity with itself.
cosim_article_sort = cosim_article.iloc[-1, :-1].sort_values(ascending=False)

# Select the 5 books that are the most similar to the article
top_5_books = cosim_article_sort.head(5)
top_5_books = pd.DataFrame(top_5_books)
top_5_books.columns = ['cosine']
top_5_books = top_5_books.reset_index()
top_5_books

Unnamed: 0,index,cosine
0,155,0.188388
1,168,0.130171
2,82,0.107223
3,144,0.106198
4,139,0.099749
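
When ranking, the article's own entry has to be left out, since a document always has similarity 1.0 with itself. The selection logic can be sketched in plain Python as follows; `top_k_similar` is a hypothetical helper and the similarity row is made up.

```python
def top_k_similar(scores, query_index, k=5):
    """Return (index, score) pairs for the k highest scores, skipping query_index."""
    candidates = [(i, s) for i, s in enumerate(scores) if i != query_index]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up similarity row: the last entry is the article compared with itself.
row = [0.10, 0.00, 0.19, 0.13, 1.00]
print(top_k_similar(row, query_index=len(row) - 1, k=3))
# [(2, 0.19), (3, 0.13), (0, 0.1)]
```

Excluding the query by position, rather than assuming it sorts first, keeps the ranking correct even when the article's vector is empty or a book abstract happens to be identical to the article.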


In [65]:
# Expose each book's row position as an 'index' column so it can be merged with top_5_books
corpus_books = corpus_books.reset_index()

In [66]:
# Display the top 5 most similar books to the article
# (the inner merge returns them in corpus order, not ranked by cosine)
top_5_books_info = pd.merge(corpus_books, top_5_books, how='inner', on='index')
top_5_books_info.head(5)

Unnamed: 0,index,title,author,publisher,book_uri,buy_links,genre,abstract,cosine
0,82,THE ROOTS OF OBAMA'S RAGE,Dinesh D'Souza,Regnery,nyt://book/cfac8787-c017-5d12-ad52-387dbe48d29a,https://goto.applebooks.apple/9781596986251?at...,Politics & Current Events,Critics of President Obama have attacked him a...,0.107223
1,139,BAILOUT,Neil Barofsky,Free Press,nyt://book/b1fcf65b-7534-5f60-bd17-0e44e4fcbd17,https://goto.applebooks.apple/9781451684940?at...,Politics & Current Events,In this riveting account of the mishandling of...,0.099749
2,144,OBAMA'S LAST STAND,Glenn Thrush,Random House Publishing,nyt://book/f7476331-0deb-5d56-94fb-dbc3bb7648da,https://goto.applebooks.apple/9780679645092?at...,Politics & Current Events,A series of four instant eBooks on the 2012 pr...,0.106198
3,155,THE END OF THE LINE,Glenn Thrush and Jonathan Martin,Random House Publishing,nyt://book/073b4702-bff2-5f0d-92f2-aa67d7e60d2d,https://goto.applebooks.apple/9780679645108?at...,Politics & Current Events,The fourth and final eBook in POLITICO’s Playb...,0.188388
4,168,THE CENTER HOLDS,Jonathan Alter,Simon & Schuster,nyt://book/98e73896-4a4a-5652-b2ab-e7a451b72483,https://goto.applebooks.apple/9781451646108?at...,Politics & Current Events,"From the bestselling author of The Promise, th...",0.130171
