## Book recommendation system

In [237]:
import json
import multiprocessing as mp
import pprint
import re

import nltk
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# Stop-word list and tokenizer models used by the preprocessing functions below
nltk.download('stopwords', quiet=True)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/anavekua/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Books data

In [240]:
# Load books data
file_paths = "data/bestsellers_with_abstract_and_genre_3350.json"
with open(file_paths, 'r') as file:
    data = json.load(file)

df = pd.DataFrame(data)

In [241]:
# Books per category
grouped_df = df.groupby('genre').count()
print(grouped_df['title'])

genre
Arts & Entertainment             55
Biographies & Memoirs          1237
Business & Personal Finance     166
Comics & Graphic Novels           1
Computers & Internet              3
Cookbooks, Food & Wine            4
Fiction & Literature             32
Health, Mind & Body             113
History                         387
Humor                           121
January 30                        1
Kids                              1
Lifestyle & Home                 22
Mysteries & Thrillers             1
N/A                             203
Nonfiction                      199
Parenting                        22
Politics & Current Events       432
Professional & Technical         48
Reference                         2
Religion & Spirituality          76
Romance                           3
Sci-Fi & Fantasy                  1
Science & Nature                119
Sports & Outdoors                80
Travel & Adventure               19
Young Adult                       2
Name: title, dtype: int64

In [292]:
# Remove N/A category
df = df[df['genre'] != 'N/A']

# Select only books in 'Politics & Current Events'
corpus_books = df[df['genre'] == 'Politics & Current Events']
corpus_books = corpus_books.reset_index(drop=True)

#### Articles data


In [294]:
# Load the articles archive
file_paths = "/Users/anavekua/Documents/DataScienTest/API_nyt_json/nytimes_archives_elections_articles.json"

with open(file_paths, 'r') as file:
    data = json.load(file)

articles = pd.DataFrame(data)

In [295]:
# Load one article from the archive
a = 100000  # row position of the article
article = articles.iloc[a, :]
print("Headline:", article['headline_main'])
pprint.pprint(article['abstract'])

Headline: Republicans Champion ‘Voluntary Taxes’
('Republicans will do anything to avoid raising taxes, and their latest bill '
 'to encourage voluntary contributions is nothing more than a distraction, an '
 'economist writes.')


#### Data combination

In [297]:
# Combine all book abstracts with the article's abstract
corpus_article = article['abstract']
corpus_books_article = pd.DataFrame(corpus_books['abstract'])
corpus_books_article.loc[-1] = corpus_article  # append the article's abstract as the last row
corpus_books_article = corpus_books_article.reset_index(drop=True)
corpus_books_article.tail(3)

Unnamed: 0,abstract
430,"NEW YORK TIMES BESTSELLER • From bestselling author and longtime New York Times columnist Frank Bruni comes a lucid, powerful examination of the ways in which grievance has come to define our current culture and politics, on both the right and left.The twists and turns of American politics are unpredictable, but the tone is a troubling given. It’s one of grievance. More and more Americans are convinced that they’re losing because somebody else is winning. More and more tally their slights, measure their misfortune, and assign particular people responsibility for it. The blame game has become the country’s most popular sport and victimhood its most fashionable garb.Grievance needn’t be bad. It has done enormous good. The United States is a nation born of grievance, and across the nearly two hundred and fifty years of our existence as a country, grievance has been the engine of morally urgent change. But what happens when all sorts of grievances—the greater ones, the lesser ones, the authentic, the invented—are jumbled together? When people take their grievances to lengths that they didn’t before? A violent mob storms the US Capitol, rejecting the results of a presidential election. Conspiracy theories flourish. Fox News knowingly peddles lies in the service of profit. College students chase away speakers, and college administrators dismiss instructors for dissenting from progressive orthodoxy. Benign words are branded hurtful; benign gestures are deemed hostile. And there’s a potentially devastating erosion of the civility, common ground, and compromise necessary for our democracy to survive.How did we get here? What does it say about us, and where does it leave us? The Age of Grievance examines these critical questions and charts a path forward."
431,"A NEW YORK TIMES, USA TODAY BESTSELLER! The New York Times bestselling author, governor of South Dakota, and former congresswoman tells eye-opening stories of DC dysfunction, shares lessons from leading her state through unprecedented challenge, and explains how we seize this moment to move America forward. Any elected official can talk about how broken our government is. But their solutions always seem to involve more money, new programs—and reelection to another term. Few offer an unfiltered glimpse into how government actually works, empowering citizens with the knowledge to be part of the solution. Governor Kristi Noem never planned on being in politics. But her concern for our nation compelled her, on a local, national, and global level. Because she took a different path into public service, as a concerned mom and rancher, her insights help every citizen understand how positive change really happens, despite the dysfunction in Washington DC. Governor Noem explains how the country is not going back to the Republican party of the 2000s. And that’s a good thing. This book is packed with surprising stories and practical lessons from the front lines of the battle. And she names names. ​ A lot has changed since 2016, and based on her accomplishments in Congress and as Governor, no one is better equipped than Kristi Noem to explain the tremendous opportunities this opens up for every American."
432,"Republicans will do anything to avoid raising taxes, and their latest bill to encourage voluntary contributions is nothing more than a distraction, an economist writes."


# TF-IDF and similarity scores

### Data preprocessing

In [247]:
def text_cleaning(text):
    # Remove rank markers such as the "#1" in "#1 NATIONAL BESTSELLER"
    remove_hashtag1 = re.sub(r'#\d', '', text)
    # Remove all remaining digits
    remove_numbers = re.sub(r'\d+', '', remove_hashtag1)
    # Remove fully uppercase words (e.g. "NATIONAL", "BESTSELLER"); this also drops acronyms and one-letter words such as "A" and "I"
    remove_full_upper = re.sub(r'\b[A-Z]+\b', '', remove_numbers)
    # Lowercase the text
    lowercased_text = remove_full_upper.lower()
    # Remove everything that is not a word character (letter, digit or underscore) or whitespace
    remove_punctuation = re.sub(r'[^\w\s]', '', lowercased_text)
    # Replace every run of whitespace with a single space and strip leading and trailing whitespace.
    # `\s` also matches '\xa0', the Unicode non-breaking space.
    remove_white_space = re.sub(r'\s+', ' ', remove_punctuation).strip()

    return remove_white_space

def tokenization(text_clean):
    # Tokenization: split the cleaned text into a list of word tokens
    from nltk.tokenize import word_tokenize
    tokenized_text = word_tokenize(text_clean)
    return tokenized_text

def remove_stop_words(abstract_token):
    # Stop-word filtering: drop very common words that carry little meaning
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))  # separate name so the imported module is not shadowed
    stopwords_removed = [word for word in abstract_token if word not in stop_words]
    return stopwords_removed

# Stemming = Transforming words into their base form
def stemming(abstract_stop_words):
    from nltk.stem import PorterStemmer
    ps = PorterStemmer()
    stemmed_text = [ps.stem(word) for word in abstract_stop_words]
    return stemmed_text

def preprocessing_abstract(abstract):
    abstract_clean = text_cleaning(abstract)
    abstract_token = tokenization(abstract_clean)
    abstract_stop_words = remove_stop_words(abstract_token) 
    abstract_stemming = stemming(abstract_stop_words)
    return abstract_stemming
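
To make the effect of the regex steps concrete, here is a minimal standalone sketch of the cleaning stage only (no NLTK needed). `clean_text` is a condensed copy of `text_cleaning` written for illustration, and the sample sentence is made up.

```python
import re

def clean_text(text):
    """Condensed copy of text_cleaning, for illustration only."""
    text = re.sub(r'#\d', '', text)           # drop rank markers such as "#1"
    text = re.sub(r'\d+', '', text)           # drop remaining digits
    text = re.sub(r'\b[A-Z]+\b', '', text)    # drop fully uppercase words
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)       # drop punctuation
    return re.sub(r'\s+', ' ', text).strip()  # collapse whitespace

# Made-up sample sentence
sample = "#1 NATIONAL BESTSELLER! In 2016, the GOP changed America's politics."
print(clean_text(sample))
# in the changed americas politics
```

Note that the uppercase filter also removes acronyms such as "GOP", which may or may not be desirable for political texts.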

In [253]:
# Preprocess each abstract and join the resulting tokens back into a single string
corpus_books_article['abstract_preprocessed'] = corpus_books_article['abstract'].apply(
    lambda abstract: ' '.join(preprocessing_abstract(abstract))
)

### TF-IDF computation

In [254]:
# Compute the TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus_books_article['abstract_preprocessed'])  # sparse document-term matrix

# Get the vocabulary (one term per matrix column)
feature_names = vectorizer.get_feature_names_out()

# Combine the corpus with the TF-IDF weights: give both frames an 'id' column and merge on it
corpus_books_article['id'] = range(0, len(corpus_books_article))
df_tfidf_prev = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
df_tfidf_prev['id'] = range(0, len(df_tfidf_prev))
df_tfidf = pd.merge(corpus_books_article, df_tfidf_prev, on='id')
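
As a rough illustration of what the weights in `tfidf_matrix` mean, the sketch below recomputes TF-IDF by hand on three made-up, already preprocessed documents. It assumes the default `TfidfVectorizer` settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and uses plain whitespace splitting instead of scikit-learn's tokenizer. `tfidf_sketch` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Hand-rolled TF-IDF: raw term counts x smoothed IDF, L2-normalised per row."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard against empty documents
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up, already stemmed documents
docs = ["tax bill republican", "tax reform", "tax democraci citizen"]
vocab, rows = tfidf_sketch(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:12s}{weight:.3f}")
```

In the first document, "tax" appears in every document and therefore ends up with a lower weight than "bill" or "republican", which appear in only one.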

In [255]:
def get_top_10_most_important_words(row_num):
    """
    Print, for the abstract at row position `row_num`:
    - the 10 terms with the highest TF-IDF weight
    - the abstract itself
    """
    abstract = df_tfidf.iloc[row_num, df_tfidf.columns.get_loc("abstract")]
    # Keep only the TF-IDF columns, which come after the 'id' column
    row = df_tfidf.iloc[row_num, df_tfidf.columns.get_loc("id") + 1:]
    top_10_words = row.sort_values(ascending=False)[:10]
    print(top_10_words)
    pprint.pprint(abstract)

### Cosine similarity computation

In [256]:
# Compute the cosine similarity between every pair of abstracts. The higher the value, the more similar the two abstracts are.
cosim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosim = pd.DataFrame(cosim)
cosim_article = cosim.tail(1)  # last row: similarity between the article and every abstract
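
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The pure-Python sketch below (hypothetical helper `cosine`, toy vectors) shows the three typical cases. Returning 0 for an all-zero vector is a choice made here. Since TF-IDF weights are non-negative, scores on this corpus fall between 0 and 1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty document as similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [2, 0]))            # same direction: 1.0
print(cosine([1, 0], [0, 1]))            # no shared terms: 0.0
print(round(cosine([1, 1], [1, 0]), 4))  # partial overlap: 0.7071
```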

### Books to article selection

In [279]:
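# Sort the article's similarity scores (last row of the cosine matrix) in descending order
cosim_article_sort = cosim_article.iloc[-1, :].sort_values(ascending=False)

# Select the books most similar to the article. Position 0 is the article itself, so it is skipped.
# Note: the slice [1:5] keeps 4 books, as shown in the output below; [1:6] would keep 5.
top_5_books = cosim_article_sort[1:5]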
top_5_books = pd.DataFrame(top_5_books)
top_5_books.columns = ['cosine']
top_5_books = top_5_books.reset_index()
top_5_books

Unnamed: 0,index,cosine
0,230,0.075591
1,101,0.062915
2,403,0.060432
3,335,0.058875
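
The slice above relies on the article sorting first among its own similarity scores. A slightly more defensive variant, sketched below in plain Python with made-up scores, excludes the article by its row position and then takes the k best scores. `top_k_similar` is a hypothetical helper; with pandas, `Series.drop(label).nlargest(k)` should give an equivalent result.

```python
import heapq

def top_k_similar(scores, query_pos, k=5):
    """Return (position, score) pairs for the k highest scores, skipping query_pos."""
    candidates = [(score, pos) for pos, score in enumerate(scores) if pos != query_pos]
    best = heapq.nlargest(k, candidates)  # compares by score first
    return [(pos, score) for score, pos in best]

# Made-up similarity row; the last entry is the article compared with itself
scores = [0.10, 0.40, 0.05, 0.30, 1.00]
print(top_k_similar(scores, query_pos=len(scores) - 1, k=3))
# [(1, 0.4), (3, 0.3), (0, 0.1)]
```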


In [299]:
# Expose each book's row position as an 'index' column so it can be merged with top_5_books
corpus_books = corpus_books.reset_index()


In [305]:
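# Attach the book metadata to the selected books (rows come back in corpus order, not sorted by cosine)
top_5_books_info = pd.merge(corpus_books, top_5_books, how='inner', on='index')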
top_5_books_info.head(5)

Unnamed: 0,index,title,author,publisher,book_uri,buy_links,genre,abstract,cosine
0,101,AFTER AMERICA,Mark Steyn,Regnery,nyt://book/2defdfba-2ec7-5e8b-8b19-b1f9801875c2,https://goto.applebooks.apple/9781596982796?at=10lIEQ,Politics & Current Events,"Mark Steyn's New York Times bestseller, After America, is now in paperback! Featuring a new introduction and updated throughout, After America takes on Obama's disastrous plan for our nation, and reveals exactly what a post-American world will look like if we don't change our ways soon. Says Steyn: ""Nothing is certain but debt and taxes. And then more debt. If the government of the United States had to use GAAP (the 'Generally Accepted Accounting Practices' that your company and the publisher of this book have to use), Uncle Sam would be under an SEC investigation and his nephews and nieces would have taken away the keys and cut up his credit card."" Slim as it is, however, Steyn argues there is still hope. ""Americans face a choice: you can rediscover the animating principles of the American idea—of limited government, a self-reliant citizenry, and the opportunities to exploit your talents to the fullest—or you can join most of the rest of the western world in terminal decline. This is a battle for the American idea, and it's an epic one, but you can do anything you want to do. So do it."" Bitingly funny and wickedly clever, After America is Mark Steyn at his best.",0.062915
1,230,THE VANISHING AMERICAN ADULT,Ben Sasse,St. Martin's,nyt://book/ae055aca-f320-5920-aaf4-4cfdde47b516,https://goto.applebooks.apple/9781250114402?at=10lIEQ,Politics & Current Events,"THE INSTANT NEW YORK TIMES BESTSELLERIn an era of safe spaces, trigger warnings, and an unprecedented election, the country's youth are in crisis. Senator Ben Sasse warns the nation about the existential threat to America's future.Raised by well-meaning but overprotective parents and coddled by well-meaning but misbegotten government programs, America's youth are ill-equipped to survive in our highly-competitive global economy. Many of the coming-of-age rituals that have defined the American experience since the Founding: learning the value of working with your hands, leaving home to start a family, becoming economically self-reliant—are being delayed or skipped altogether. The statistics are daunting: 30% of college students drop out after the first year, and only 4 in 10 graduate. One in three 18-to-34 year-olds live with their parents. From these disparate phenomena: Nebraska Senator Ben Sasse who as president of a Midwestern college observed the trials of this generation up close, sees an existential threat to the American way of life.In The Vanishing American Adult, Sasse diagnoses the causes of a generation that can't grow up and offers a path for raising children to become active and engaged citizens. He identifies core formative experiences that all young people should pursue: hard work to appreciate the benefits of labor, travel to understand deprivation and want, the power of reading, the importance of nurturing your body—and explains how parents can encourage them.Our democracy depends on responsible, contributing adults to function properly—without them America falls prey to populist demagogues. A call to arms, The Vanishing American Adult will ignite a much-needed debate about the link between the way we're raising our children and the future of our country.",0.075591
2,335,IT WAS ALL A LIE,Stuart Stevens,Knopf,nyt://book/7fd446ce-b254-5f3d-ad64-efc54d1d0a05,https://goto.applebooks.apple/9780525658450?at=10lIEQ,Politics & Current Events,"AN INSTANT NEW YORK TIMES BESTSELLER""In his bare-knuckles account, Stevens confesses to the reader that the entire apparatus of his Republican Party is built on a pack of lies... This reckoning inspired Stevens to publish this blistering, tell-all history... Although this book will be a hard read for any committed conservatives, they would do well to ponder it.""--Julian E. Zelizer, The New York TimesFrom the most successful Republican political operative of his generation, a searing, unflinching, and deeply personal exposé of how his party became what it is todayStuart Stevens spent decades electing Republicans at every level, from presidents to senators to local officials. He knows the GOP as intimately as anyone in America, and in this new book he offers a devastating portrait of a party that has lost its moral and political compass. This is not a book about how Donald J. Trump hijacked the Republican Party and changed it into something else. Stevens shows how Trump is in fact the natural outcome of five decades of hypocrisy and self-delusion, dating all the way back to the civil rights legislation of the early 1960s. Stevens shows how racism has always lurked in the modern GOP's DNA, from Goldwater's opposition to desegregation to Ronald Reagan's welfare queens and states' rights rhetoric. He gives an insider's account of the rank hypocrisy of the party's claims to embody ""family values,"" and shows how the party's vaunted commitment to fiscal responsibility has been a charade since the 1980s. When a party stands for nothing, he argues, it is only natural that it will be taken over by the loudest and angriest voices in the room.It Was All a Lie is not just an indictment of the Republican Party, but a candid and often lacerating mea culpa. 
Stevens is not asking for pity or forgiveness; he is simply telling us what he has seen firsthand. He helped to create the modern party that kneels before a morally bankrupt con man and now he wants nothing more than to see what it has become burned to the ground.",0.058875
3,403,THE BILL OF OBLIGATIONS,Richard Haass,Penguin Press,nyt://book/907da459-176e-5d35-b85e-aba80b3924ee,https://goto.applebooks.apple/9780525560654?at=10lIEQ,Politics & Current Events,"Watch the PBS companion documentary “A Citizen’s Guide to Preserving Democracy”“An indispensable guide to good citizenship in an era of division and rancor.” —Anne ApplebaumThere is no question that the United States faces dangerous threats from without; the greatest peril to the country, however, comes from within. In The Bill of Obligations, bestselling author Richard Haass argues that, to solve our climate of division and safeguard our democracy, the very idea of citizenship must be revised and expanded. The Bill of Rights is at the center of our Constitution, yet the most intractable conflicts often emerge from cases that, as former Supreme Court Justice Stephen Breyer pointed out, “are not about right versus wrong. They are about right versus right.”There is a way forward: to place obligations on the same footing as rights. The ten obligations that Haass introduces here reenvision what it means to be an American citizen, to commit to our fellow citizens and counter the growing apathy, anger, and violence that threaten us all.Through an expert blend of civics, history, and political analysis, this book illuminates how Americans across the political spectrum can rediscover how to contribute to and reshape this country’s future.",0.060432
