# Introduction

This notebook implements some applications of the NLP course at DSSC AA 2021-2022. 

In particular, I will perform topic modeling using LDA and sentiment analysis on a dataset of book reviews from Amazon. The dataset is available at:

https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews

The intended use cases are the following:
- Topic modeling: enable publishing companies to understand the most popular book topics (perhaps by year) in order to increase revenue. Authors could also use it to see which topics have been popular recently.
- Sentiment analysis: by predicting the sentiment of a book review, a company such as Amazon can feed this signal into a recommender system based on collaborative filtering.

Therefore, we will develop these techniques with these goals in mind. 

### The dataset

The dataset consists of two main files:

- books_data.csv: information for 200,000 books. Namely, the title, the author name, the summary, the category of the book, etc.
- Books_rating.csv: contains information for 3 million book reviews. Namely, the book id, the book title, the id of the user who rated the book, the review score, the summary of the review, and the full text of the review.

### Import Statements

In [2]:
# import statements for the whole notebook
import pandas as pd
import numpy as np


import spacy

#for language detection, uncomment
#!pip install spacy_fastlang
import spacy_fastlang

import nltk
#nltk.download('all')

# collocations
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords
from collections import Counter
from tqdm import tqdm

from sklearn.metrics import classification_report, f1_score, balanced_accuracy_score, precision_score, recall_score
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

from transformers import AutoTokenizer,  AutoModelForSequenceClassification, pipeline, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import evaluate

import torch 

import re
import pickle

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import stylecloud

#load spacys english dictionary. 
nlp = spacy.load("en_core_web_sm")
#nlp.add_pipe("language_detector")





In [23]:
#Global flag: set to True to recompute the expensive steps, False to load the cached results from disk on notebook re-runs
FIRST_RUN = False

# Topic Modeling

We focus on the first data set which contains information about the 200,000 books.

We briefly recall the idea behind topic modeling: we use Latent Dirichlet Allocation (LDA) to extract the main topics that the books discuss in their descriptions.

LDA is a generative model of documents. It assumes that each document is produced by first sampling a document-specific distribution over topics and then, for each word, sampling a topic from that distribution and a word from that topic's word distribution.
This assumption is a simplification: it treats a document as a bag of words and ignores the dependence between consecutive words.

We can reverse engineer the process to find the $k$ topics being discussed and, for each topic, the top words (descriptors) associated with it.
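To make the generative story concrete, here is a minimal sketch that samples a toy document. The two topics, their word probabilities, and the topic mixture are hypothetical values chosen for illustration, not taken from the fitted model. In full LDA the topic mixture would itself be drawn from a Dirichlet prior; here it is fixed for simplicity.

```python
import random

def generate_document(topic_word_probs, doc_topic_probs, length, rng):
    """Sample one bag-of-words document following the LDA generative story."""
    topic_ids = list(range(len(topic_word_probs)))
    words = []
    for _ in range(length):
        # 1. Pick a topic according to this document's topic mixture.
        topic = rng.choices(topic_ids, weights=doc_topic_probs, k=1)[0]
        # 2. Pick a word according to that topic's word distribution.
        vocab = list(topic_word_probs[topic].keys())
        weights = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocab, weights=weights, k=1)[0])
    return words

# Hypothetical toy topics (illustration only).
toy_topics = [
    {"god": 0.5, "church": 0.3, "bible": 0.2},
    {"war": 0.6, "military": 0.3, "history": 0.1},
]
rng = random.Random(0)
doc = generate_document(toy_topics, doc_topic_probs=[0.8, 0.2], length=10, rng=rng)
print(doc)
```

Fitting LDA goes in the opposite direction: given only documents like `doc`, it estimates the topic-word distributions and the per-document topic mixtures.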

To choose the optimal value of $k$, we will maximize the UMass and CV coherence scores, while also keeping in mind the principle of Occam's razor.
- UMass: based on the log probability of word co-occurrences
- CV score: uses normalized PMI and cosine similarity
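The idea behind UMass can be sketched in a few lines. This is a simplified version written for illustration (an average of smoothed log conditional document frequencies over word pairs); it is not gensim's implementation, and the toy documents are hypothetical.

```python
import math

def umass_coherence(top_words, documents, eps=1.0):
    """Average of log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs of top words.

    D(...) counts the documents that contain all the given words.
    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_freq(w_i, w_j) + eps) / doc_freq(w_j)))
    return sum(scores) / len(scores)

# Hypothetical toy corpus (illustration only).
toy_docs = [
    ["god", "church", "bible"],
    ["god", "church"],
    ["war", "military"],
    ["war", "history", "military"],
]
coherent = umass_coherence(["god", "church"], toy_docs)
mixed = umass_coherence(["god", "war"], toy_docs)
print(coherent > mixed)
```

Words that tend to appear in the same documents receive a higher score than words that never co-occur, which is why a higher value indicates a more coherent topic.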

In [27]:
df_books = pd.read_csv("./DataSet/books_data.csv")
df_books.head()

Unnamed: 0,Title,description,authors,image,previewLink,publisher,publishedDate,infoLink,categories,ratingsCount
0,Its Only Art If Its Well Hung!,,['Julie Strain'],http://books.google.com/books/content?id=DykPA...,http://books.google.nl/books?id=DykPAAAACAAJ&d...,,1996,http://books.google.nl/books?id=DykPAAAACAAJ&d...,['Comics & Graphic Novels'],
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],http://books.google.com/books/content?id=IjvHQ...,http://books.google.nl/books?id=IjvHQsCn_pgC&p...,A&C Black,2005-01-01,http://books.google.nl/books?id=IjvHQsCn_pgC&d...,['Biography & Autobiography'],
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],http://books.google.com/books/content?id=2tsDA...,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,,2000,http://books.google.nl/books?id=2tsDAAAACAAJ&d...,['Religion'],
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],http://books.google.com/books/content?id=aRSIg...,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,iUniverse,2005-02,http://books.google.nl/books?id=aRSIgJlq6JwC&d...,['Fiction'],
4,"Nation Dance: Religion, Identity and Cultural ...",,['Edward Long'],,http://books.google.nl/books?id=399SPgAACAAJ&d...,,2003-03-01,http://books.google.nl/books?id=399SPgAACAAJ&d...,,


#### Step 0: Exploration

We extract the description of each book. Since the titles also carry useful information, we join the two columns, separated by a period.

Furthermore, we filter out non-English documents, since LDA operates under the assumption that all documents are in the same language. To do this, we use the **spacy_fastlang** library to detect the language.

In [7]:
df_books = df_books.dropna(subset = ["description"])
df_books["new_desc"] = df_books.loc[:,"description"] + ". " + df_books.loc[:,"Title"]
newdate = [re.sub(r"[^0-9].*","",date ) for date in df_books["publishedDate"].apply(str)]
df_books["publishedDate"] = newdate
df_books = df_books.drop(["image", "publisher", "infoLink", "ratingsCount", "previewLink"], axis = 1)
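The cell above also reduces `publishedDate` to a year. A small sketch of what that regular expression does on a few sample values (the helper name `extract_year` is introduced here for illustration only):

```python
import re

def extract_year(raw_date):
    """Keep the leading run of digits and drop everything from the first non-digit on."""
    return re.sub(r"[^0-9].*", "", str(raw_date))

for raw in ["2005-01-01", "2005-02", "1996", float("nan")]:
    print(repr(raw), "->", repr(extract_year(raw)))
```

A missing date becomes the string `"nan"` and is therefore reduced to an empty string, which is why empty strings are replaced with `NaN` and dropped before the per-year analysis later on.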

In [151]:
if FIRST_RUN:
    #check if the summaries are in english. Runtime is approximately 1 hr
    #requires the "language_detector" pipe (commented out in the imports cell) to be added to nlp
    lang = [nlp(doc)._.language for doc in df_books["new_desc"]]
    df_books["lang"] = lang
    df_books_en = df_books[df_books["lang"] == "en"]
    with open("./DataSet/books_en.csv", "wb") as fp:
            pickle.dump(df_books_en, fp)
else:
    with open("./DataSet/books_en.csv", "rb") as fp:
            df_books_en = pickle.load(fp)
book_description_corpus = df_books_en["new_desc"].tolist()

len(book_description_corpus)

142149

As we can see, the final corpus contains approximately 142,000 documents in English. Below we show an example:

In [9]:
book_description_corpus[0]

'Philip Nel takes a fascinating look into the key aspects of Seuss\'s career - his poetry, politics, art, marketing, and place in the popular imagination." "Nel argues convincingly that Dr. Seuss is one of the most influential poets in America. His nonsense verse, like that of Lewis Carroll and Edward Lear, has changed language itself, giving us new words like "nerd." And Seuss\'s famously loopy artistic style - what Nel terms an "energetic cartoon surrealism" - has been equally important, inspiring artists like filmmaker Tim Burton and illustrator Lane Smith. --from back cover. Dr. Seuss: American Icon'

#### Most frequent categories
Having filtered out non-English documents, we now analyze the most frequent categories. These give an idea of the concepts that the corpus covers.

In [46]:
cat_num = 10
fig = px.bar(df_books_en["categories"].value_counts()[:cat_num][::-1], orientation = "h", title = f"Top {cat_num} Categories")
fig.show()

As we can see, the most frequent category is related to fiction, while other categories such as religion and history are also very popular.

#### Step 1: Preprocessing

Before using LDA, we must preprocess aggressively in order to reduce variance in the vocabulary. The preprocessing for LDA has to be much more aggressive than for other tasks, since we want to capture the important words in order to define our topics clearly.

For example, if we don't lemmatize or stem, we could find that the top descriptors of a topic are simply all the conjugations of one word. This is not very informative and should be avoided.

The preprocessing steps that we will employ are as follows:

- tokenization: using spaCy's tokenizer.
- normalization: lowercasing and removing punctuation except for hyphens and apostrophes (since spaCy's tokenizer is much better at handling these)
- lemmatization: reduce each token to its lemma, keeping only nouns, verbs, adjectives, adverbs, and proper nouns.
- stopword removal: we use spaCy's predefined list of stop words.
- joint collocations: we join the bigrams that occur at least 10 times and have an association score (NLTK's `mi_like`, a PMI-like measure) greater than or equal to 1.0 (for computational reasons we don't include all of them)
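The collocation code below scores bigrams with NLTK's `mi_like` rather than plain PMI. The sketch that follows compares the two on hypothetical counts. The formulas reflect my understanding of NLTK's documented definitions (PMI in base 2; `mi_like` without a logarithm and with the bigram count raised to a power of 3 by default), so treat them as an assumption rather than a reference.

```python
import math

def pmi(n_ii, n_ix, n_xi, n_xx):
    """Pointwise mutual information (base 2).

    n_ii: bigram count, n_ix / n_xi: counts of the first / second word,
    n_xx: total number of bigrams.
    """
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)

def mi_like(n_ii, n_ix, n_xi, power=3):
    """PMI-like score: no logarithm, bigram count raised to `power`."""
    return n_ii ** power / (n_ix * n_xi)

# Hypothetical counts (illustration only).
total = 1_000_000
rare = dict(n_ii=10, n_ix=10, n_xi=10)            # rare but exclusive pair
frequent = dict(n_ii=1000, n_ix=2000, n_xi=2000)  # very frequent pair

print("PMI     rare vs frequent:", pmi(**rare, n_xx=total), pmi(**frequent, n_xx=total))
print("mi_like rare vs frequent:", mi_like(**rare), mi_like(**frequent))
```

Under these assumptions PMI favours the rare pair, while `mi_like` favours the frequent one, which fits with very common pairs such as "united states" ranking at the top of the collocation table.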

In [2]:
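def normalize(text):
    '''This function removes punctuation and symbols (keeping hyphens and apostrophes), replaces every digit with 0,
      and returns the text in lower case.
      Input: a string of text
      Output: a cleaned string of text
    '''
    # keep only word characters, whitespace, apostrophes and hyphens
    no_punctuation = re.sub(r"[^\w\s'-]", '', text)
    # map every digit to 0 so that all numbers share a common form
    no_numbers = re.sub(r'[0-9]', '0', no_punctuation)
    downcase = no_numbers.lower()
    return downcase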

def clean_corpus(corpus, stopword_removal = True):
    '''This function preprocesses each document of the corpus by first normalizing it and then keeping only the lemmas
     of content words (nouns, verbs, adjectives, adverbs, proper nouns), optionally dropping stop words.
     Input: a list of strings (documents)
     Output: a list of strings (documents) which have been preprocessed.
    '''
    content_pos = ['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN']
    cleaned = []
    for document in corpus:
        lowered_text = normalize(document)
        lemmas = [token.lemma_ for token in nlp(lowered_text)
                  if token.pos_ in content_pos and not (stopword_removal and token.is_stop)]
        cleaned.append(' '.join(lemmas))
    return cleaned

In [13]:
def collocation_finder(cleaned_corpus, min_freq = 10, thresh = 1):
    '''Score all bigrams that occur at least min_freq times and keep those with score >= thresh.
     The score is NLTK's mi_like measure; the column is named "pmi" for brevity.
    '''
    words = [word for document in cleaned_corpus for word in document.split()]
    finder = BigramCollocationFinder.from_words(words)
    bgm = BigramAssocMeasures()
    score = bgm.mi_like
    finder.apply_freq_filter(min_freq)
    collocations = pd.DataFrame(finder.score_ngrams(score), columns = ["bigram", "pmi"])
    return collocations[collocations["pmi"] >= thresh]


def replace_words_with_collocations(cleaned_corpus, collocations):
    '''Join every occurrence of a collocation "word1 word2" into the single token "word1_word2".'''
    corpus_with_collocations = []
    for document in tqdm(cleaned_corpus):
        for word1, word2 in collocations:
            # re.escape guards against regex metacharacters inside the tokens
            document = re.sub(re.escape(f'{word1} {word2}'), f'{word1}_{word2}', document)
        corpus_with_collocations.append(document)
    return corpus_with_collocations


We can now finally preprocess the corpus by first applying the pipeline and then finding all collocations and adding them to the documents as bigrams. 

In [14]:
#preprocess corpus and save once

if FIRST_RUN:
    book_description_corpus_cleaned = clean_corpus(book_description_corpus)

    with open("./DataSet/book_description_cleaned.csv", "wb") as fp:
        pickle.dump(book_description_corpus_cleaned, fp)
else:
    with open("./DataSet/book_description_cleaned.csv", "rb") as fp:
        book_description_corpus_cleaned = pickle.load(fp)


In [16]:
collocations = collocation_finder(book_description_corpus_cleaned)
collocations

Unnamed: 0,bigram,pmi
0,"(united, states)",3860.036858
1,"(york, times)",1570.857642
2,"(new, york)",1257.851052
3,"(los, angeles)",742.395737
4,"(san, francisco)",388.615226
...,...,...
1300,"(professionally, typeset)",1.007619
1301,"(p, lovecraft)",1.001788
1302,"(british, columbia)",1.001537
1303,"(fort, sumter)",1.000000


As we can see, we have a total of 1305 collocations to add to our corpus. This is computationally intensive, since for each document we must search for every collocation and replace it with the joined bigram.
However, these bigrams make our descriptors more informative: a collocation occupies one descriptor slot instead of two, which frees up space for other words.
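As a small illustration of the replacement step, the sketch below applies two collocations to one toy document. The helper name `join_collocations`, the sentence, and the order of the bigrams are chosen here for illustration.

```python
import re

def join_collocations(document, bigrams):
    """Replace each 'first second' occurrence with the single token 'first_second'."""
    for first, second in bigrams:
        pattern = re.escape(f"{first} {second}")
        document = re.sub(pattern, f"{first}_{second}", document)
    return document

# Hypothetical input; bigrams are applied in the given order.
bigrams = [("york", "times"), ("new", "york")]
result = join_collocations("review new york times bestseller", bigrams)
print(result)  # review new_york_times bestseller
```

Because the match is a plain substring match, overlapping collocations chain together: once `york_times` has been formed, `new york` still matches and the result is the single token `new_york_times`.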

In [17]:
if FIRST_RUN:
    book_description_corpus_cleaned_collocations = replace_words_with_collocations(book_description_corpus_cleaned, collocations["bigram"])
    
    with open("./DataSet/books_description_corpus_cleaned_with_collocations.csv", "wb") as fp:
        pickle.dump(book_description_corpus_cleaned_collocations, fp)
else:
    with open("./DataSet/books_description_corpus_cleaned_with_collocations.csv", "rb") as fp:
        book_description_corpus_cleaned_collocations = pickle.load(fp)

After running the whole preprocessing pipeline, we can finally see the results. We show the raw text, the cleaned text without collocations, and the final text with collocations. The effect of the last step is visible in the term "united_states" in the final example.

In [18]:
i = 29
print(book_description_corpus[i], "\n")
print(book_description_corpus_cleaned[i], "\n")
print(book_description_corpus_cleaned_collocations[i], "\n")

The United States "Explorer" map is a classic example of the cartographic excellence National Geographic is known for. This colorful political map clearly shows state boundaries, capitals, major highways, rivers, lakes, and many major cities. Insets show detail of Alaska and Hawaii. The color palette is vibrant and stunning shaded relief provides additional texture and detail. The map is encapsulated in heavy-duty 1.6 mil laminate which makes the paper much more durable and resistant to the swelling and shrinking caused by changes in humidity. Laminated maps can be framed without the need for glass, so the fames can be much lighter and less expensive. "Map Scale = 1:6,396,000Sheet Size = 32" x 20.25"". Usa Laminated Map 

united states explorer map classic example cartographic excellence national geographic know colorful political map clearly show state boundary capital major highway river lake major city inset detail alaska hawaii color palette vibrant stunning shaded relief provide a

#### Word Cloud
We now look at a wordcloud to see the most frequent terms in the corpus. This will also serve to give an idea of the topics being discussed.


In [None]:
words = " ".join(book_description_corpus_cleaned_collocations)
wordcloud = stylecloud.gen_stylecloud(
                        text = words,
                        size = 1028,
                        background_color = "black",
                        output_name = "./Images/wordcloud.png",
                        icon_name = "fas fa-book"
                        )

![image](./Images/wordcloud.png)

As we can see, frequent words such as work, life, history, time, love, and relationship are very suggestive of the topics present in the corpus.

#### Step 2: Using Gensim's LDA 

To apply LDA, we use the gensim library, one of the most popular libraries for this task. In particular, we use the `LdaMulticore` class to parallelize the work and speed up the computation.

In [21]:
from gensim.models import LdaMulticore, TfidfModel, CoherenceModel
from gensim.corpora import Dictionary
import time 
import multiprocessing 

In [24]:
instances = [doc.split() for doc in book_description_corpus_cleaned_collocations]
dictionary = Dictionary(instances)
dictionary.filter_extremes(no_below = 50, no_above = 0.5)

ldacorpus = [dictionary.doc2bow(text) for text in instances]
tfidfmodel = TfidfModel(ldacorpus)
model_corpus = tfidfmodel[ldacorpus]

In [25]:
i = 1 
print(book_description_corpus_cleaned_collocations[i])
print(instances[i])
print(ldacorpus[i])
print(model_corpus[i])

resource include principle understand small church worship practice planning worship few people suggestion congregational study wonderful worship small church
['resource', 'include', 'principle', 'understand', 'small', 'church', 'worship', 'practice', 'planning', 'worship', 'few', 'people', 'suggestion', 'congregational', 'study', 'wonderful', 'worship', 'small', 'church']
[(51, 2), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 3)]
[(51, 0.31376452057035326), (52, 0.32180463855405556), (53, 0.2693150375467911), (54, 0.0664025240567632), (55, 0.09767008421079369), (56, 0.19392337346329966), (57, 0.12066432336191125), (58, 0.1504597029346263), (59, 0.1314502090225251), (60, 0.28331521221527595), (61, 0.09744878989309225), (62, 0.1936860437731818), (63, 0.11980735412718313), (64, 0.18348738900751174), (65, 0.666431806837864)]
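The output above shows the same document as raw counts and as TF-IDF weights; the token that occurs three times ends up with the largest weight. The sketch below reproduces the idea by hand on a toy corpus. It assumes the weighting that, to my understanding, `TfidfModel` uses by default (raw count times log2(N/df), followed by scaling the vector to unit length); the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, all_docs):
    """TF-IDF weights for one document: count * log2(N / df), L2-normalized."""
    counts = Counter(doc_tokens)
    n_docs = len(all_docs)
    doc_sets = [set(d) for d in all_docs]
    weights = {}
    for word, tf in counts.items():
        df = sum(word in d for d in doc_sets)  # number of documents containing the word
        weights[word] = tf * math.log2(n_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {w: (v / norm if norm else 0.0) for w, v in weights.items()}

# Hypothetical toy corpus (illustration only).
toy_docs = [
    ["small", "church", "worship", "worship"],
    ["war", "history"],
    ["church", "history"],
]
vec = tfidf_vector(toy_docs[0], toy_docs)
print(vec)
```

Words that are frequent in a document but rare in the corpus get the largest weights, while a word that appears in every document gets a weight of zero.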


#### Step 3: Choosing number of topics
To choose the number of topics we use the UMass score and CV score, as mentioned earlier. In particular, we look for a number of topics at which these scores have a local maximum. We also keep in mind the principle of Occam's razor, which states that simpler models are preferred. Therefore, if the global maximum occurs at a large number of topics, it may be preferable to pick another local maximum that is lower but uses fewer topics. This is because we are more interested in explainability than in maximizing a metric. At the end of the day, we are interested in **knowing** which topics are being talked about, not in having the best scores.

In [30]:
coherence_values = []

dev_size = 100000
eval_size = 30000

for num_topics in range(2, 21):
    model = LdaMulticore(corpus=model_corpus[:dev_size], 
                         id2word=dictionary, 
                         num_topics=num_topics,
                         random_state = 2)

    coherencemodel_umass = CoherenceModel(model=model, 
                                          texts=instances[dev_size:dev_size+eval_size], 
                                          dictionary=dictionary, 
                                          coherence='u_mass')

    coherencemodel_cv = CoherenceModel(model=model, 
                                       texts=instances[dev_size:dev_size+eval_size], 
                                       dictionary=dictionary, 
                                       coherence='c_v')

    umass_score = coherencemodel_umass.get_coherence()
    cv_score = coherencemodel_cv.get_coherence()
    print(f"Number of topics: {num_topics}, UMass Score: {umass_score}, CV Score: {cv_score} \n")
    coherence_values.append((num_topics, umass_score, cv_score))

Number of topics: 2, UMass Score: -1.9605988643372265, CV Score: 0.3161597160060067 

Number of topics: 3, UMass Score: -1.9717528690166812, CV Score: 0.29431340607395245 

Number of topics: 4, UMass Score: -2.005779097776581, CV Score: 0.3064429016914327 

Number of topics: 5, UMass Score: -2.0269277266492507, CV Score: 0.31563442611381015 

Number of topics: 6, UMass Score: -2.053769982167863, CV Score: 0.315766214867727 

Number of topics: 7, UMass Score: -2.010900744638036, CV Score: 0.3169885581834011 

Number of topics: 8, UMass Score: -2.0597289045429577, CV Score: 0.3154455716383609 

Number of topics: 9, UMass Score: -2.036263730354967, CV Score: 0.30268530259809956 

Number of topics: 10, UMass Score: -2.0345074490719113, CV Score: 0.3180204074656512 

Number of topics: 11, UMass Score: -2.089334649556665, CV Score: 0.29665755564040147 

Number of topics: 12, UMass Score: -2.026210672008201, CV Score: 0.3269589351828821 

Number of topics: 13, UMass Score: -2.0360165325245614

In [40]:
scores = pd.DataFrame(coherence_values, columns=['num_topics', 'UMass', 'CV'])

fig = make_subplots(rows=1, cols=2,
                    subplot_titles=("UMass Score", "CV Score"))

fig.add_trace(
    go.Scatter(x=scores["num_topics"], y=scores["UMass"]),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=scores["num_topics"], y=scores["CV"]),
    row=1, col=2
)

fig.update_xaxes(title_text="Num. Topics", row=1, col=1)
fig.update_xaxes(title_text="Num. Topics", row=1, col=2)

fig.update_yaxes(title_text="Coherence Score", row=1, col=1)


fig.update_layout(height=600, width=800, title_text="Choosing Number of Topics Via UMass and CV scores")
fig.show()

The highest peak occurs at 17 topics for both measures. However, since we prefer explainability, we choose the local maximum at 7 topics, which is the second-highest peak for UMass and a local maximum for the CV score.
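The selection rule can be sketched programmatically. The values below are copied from the scores printed above and rounded to four decimals; since that printed output is cut off after 13 topics, the peak at 17 is not part of this sketch. The helper `local_maxima` is introduced here for illustration.

```python
def local_maxima(scores):
    """Return the keys whose score is strictly higher than both neighbours."""
    keys = sorted(scores)
    return [k for prev, k, nxt in zip(keys, keys[1:], keys[2:])
            if scores[k] > scores[prev] and scores[k] > scores[nxt]]

# Rounded scores from the printed output (number of topics -> score).
umass = {2: -1.9606, 3: -1.9718, 4: -2.0058, 5: -2.0269, 6: -2.0538, 7: -2.0109,
         8: -2.0597, 9: -2.0363, 10: -2.0345, 11: -2.0893, 12: -2.0262, 13: -2.0360}
cv = {2: 0.3162, 3: 0.2943, 4: 0.3064, 5: 0.3156, 6: 0.3158, 7: 0.3170,
      8: 0.3154, 9: 0.3027, 10: 0.3180, 11: 0.2967, 12: 0.3270}

# Candidates are the topic counts that are local maxima for both measures;
# following Occam's razor we take the smallest one.
shared = sorted(set(local_maxima(umass)) & set(local_maxima(cv)))
print(shared, "-> chosen:", shared[0])
```

On the visible part of the scores, 7 is the smallest number of topics that is a local maximum for both measures, in line with the choice made here.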

#### Step 4: Running the model

Having chosen the number of topics, we run the full model and analyze the results.

In [42]:
if FIRST_RUN: 
    num_topics = 7
    
    # find chunksize to make about 200 updates
    num_passes = 10
    chunk_size = len(model_corpus) * num_passes/200
    print(f"chunk size: {chunk_size}")
    
    start = time.time()
    print("fitting model", flush=True)
    model = LdaMulticore(num_topics=num_topics, # number of topics
                         corpus=model_corpus, # what to train on 
                         id2word=dictionary, # mapping from IDs to words
                         workers=min(10, multiprocessing.cpu_count()-1), # choose 10 cores, or whatever computer has
                         passes=num_passes, # make this many passes over data
                         chunksize=chunk_size, # update after this many instances
                         alpha=0.5,
                         random_state = 2
                        )
        
    print("done in {}".format(time.time()-start), flush=True)
    with open("./Models/model_lda", "wb") as fp:
        pickle.dump(model, fp)
else:
    with open("./Models/model_lda", "rb") as fp:
        model = pickle.load(fp)


chunk size: 7107.45
fitting model
done in 352.34129548072815


#### Step 5: Analyzing results

We now extract the descriptors for all 7 topics and try to understand the topics which they describe.

In [43]:
def get_descriptors(model, num_words = 5):
    topic_sep = re.compile(r"0\.[0-9]{3}\*|\+|\"") # getting rid of useless formatting
    model_topics = [(topic_no, re.sub(topic_sep, '', model_topic)) for topic_no, model_topic in
                    model.print_topics(num_words=num_words)]
    descriptors = {}
    for i, m in model_topics:
        descriptors[f"topic {i+1}"] = m.split()
    return pd.DataFrame.from_dict(descriptors, orient = "index", columns = [f"Word {i}" for i in range(1,num_words+1)])
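To see what the regular expression in `get_descriptors` strips away, here is a minimal sketch on a hand-written string. The string only imitates the weight-times-word format that `print_topics` returns; the weights are made up.

```python
import re

# Same pattern as in get_descriptors: weights like 0.012*, plus signs, and quotes.
TOPIC_SEP = re.compile(r"0\.[0-9]{3}\*|\+|\"")

def parse_topic_string(topic_string):
    """Strip weights and formatting, leaving only the descriptor words."""
    return re.sub(TOPIC_SEP, "", topic_string).split()

# Hypothetical topic string (illustration only).
example = '0.012*"theory" + 0.010*"study" + 0.009*"social"'
words = parse_topic_string(example)
print(words)  # ['theory', 'study', 'social']
```

Note that the pattern expects weights of the form `0.` followed by exactly three digits; a weight printed in another format would be left in the output.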

In [50]:
topics = get_descriptors(model, 10)
topics

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
topic 1,theory,study,social,history,science,culture,work,book,introduction,modern
topic 2,art,guide,design,include,edition,book,feature,dictionary,color,work
topic 3,guide,student,business,help,provide,child,practice,information,learn,system
topic 4,god,bible,life,book,spiritual,christian,jesus,work,church,christ
topic 5,love,man,story,woman,life,novel,family,find,young,murder
topic 6,war,history,world,military,american,political,new,life,book,civil
topic 7,story,adventure,animal,christmas,book,song,music,little,great,tale


As we can see, the topics are quite easy to interpret:
- topic 1: words such as "theory", "study", "history", "science" indicate that this topic belongs to **academia**
- topic 2: words such as "guide" along with "design" and "art" indicate that this topic belongs to **art**.
- topic 3: words such as "guide" along with "student", "business", "help" and "provide" tell us that this topic deals with **student help**.
- topic 4: words such as "god", "bible" indicate a religious theme: **religion**.
- topic 5: words such as "love", "man", "woman", "murder" indicate the theme of **fiction & romance**.
- topic 6: words such as "war", "history", "military", "american" suggest a theme of **american war history**.
- topic 7: words such as "story", "adventure", "animal", "book" indicate a theme related to **children books**.


As we can see, these themes more or less relate to the top 10 categories, with a few important observations:
- The theme **art** is not present in the top 10 categories.
- Looking at the top 10 categories, the theme of **fiction & romance** could actually be related to the category of **biography & autobiography**, since the descriptors could very well describe someone's personal life.
- The theme of **student help** is probably associated with the category of **juvenile non-fiction**.

#### Topic frequency over the years
Finally, let's analyze how the popularity of the themes changed over the years.

In particular, we will analyze the years 2010 to 2022 (for graphical and practical reasons).

In [91]:
topic_corpus = model[model_corpus]
scores = [[t[1] for t in topic_corpus[entry]] for entry in range(len(book_description_corpus))]
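The list comprehension above assumes that every document comes back with one weight per topic. To my understanding, gensim can omit topics whose probability falls below a minimum threshold, in which case the rows would have different lengths. A minimal sketch of a safer conversion, using a hypothetical helper `to_dense`:

```python
def to_dense(topic_weights, num_topics):
    """Turn a sparse list of (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in topic_weights:
        dense[topic_id] = weight
    return dense

# Hypothetical sparse output for one document: only topics 0 and 4 are reported.
row = to_dense([(0, 0.7), (4, 0.3)], num_topics=7)
print(row)  # [0.7, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0]
```

With this conversion each row always has seven entries, so the assignment to the seven topic columns cannot fail because of a missing topic.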


In [152]:
df_books_en[["topic1", "topic2", "topic3", "topic4", "topic5", "topic6", "topic7"]]=scores

In [153]:
df_books_en

Unnamed: 0,Title,description,authors,publishedDate,categories,new_desc,lang,topic1,topic2,topic3,topic4,topic5,topic6,topic7
1,Dr. Seuss: American Icon,Philip Nel takes a fascinating look into the k...,['Philip Nel'],2005,['Biography & Autobiography'],Philip Nel takes a fascinating look into the k...,en,0.095130,0.340764,0.068506,0.075415,0.064876,0.266570,0.088739
2,Wonderful Worship in Smaller Churches,This resource includes twelve principles in un...,['David R. Ray'],2000,['Religion'],This resource includes twelve principles in un...,en,0.105597,0.084881,0.199642,0.347270,0.083063,0.088499,0.091049
3,Whispers of the Wicked Saints,Julia Thomas finds her life spinning out of co...,['Veronica Haddon'],2005,['Fiction'],Julia Thomas finds her life spinning out of co...,en,0.056008,0.054771,0.063038,0.090354,0.606541,0.068405,0.060883
5,The Church of Christ: A Biblical Ecclesiology ...,In The Church of Christ: A Biblical Ecclesiolo...,['Everett Ferguson'],1996,['Religion'],In The Church of Christ: A Biblical Ecclesiolo...,en,0.085947,0.068766,0.075421,0.549577,0.081188,0.074279,0.064822
8,Saint Hyacinth of Poland,The story for children 10 and up of St. Hyacin...,['Mary Fabyan Windeatt'],2009,['Biography & Autobiography'],The story for children 10 and up of St. Hyacin...,en,0.082401,0.073128,0.071681,0.275148,0.091007,0.322728,0.083908
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212399,The Orphan Of Ellis Island (Time Travel Advent...,"During a school trip to Ellis Island, Dominick...",['Elvira Woodruff'],2000,['Juvenile Fiction'],"During a school trip to Ellis Island, Dominick...",en,0.082278,0.091112,0.083885,0.081321,0.338309,0.086601,0.236494
212400,Red Boots for Christmas,Everyone in the village of Friedensdorf is hap...,,1995,['Juvenile Fiction'],Everyone in the village of Friedensdorf is hap...,en,0.082209,0.089091,0.087297,0.098683,0.233809,0.092970,0.315941
212401,Mamaw,"Give your Mamaw a useful, beautiful and though...",['Wild Wild Cabbage'],2018,,"Give your Mamaw a useful, beautiful and though...",en,0.071210,0.113667,0.081025,0.233818,0.134653,0.081190,0.284437
212402,The Autograph Man,Alex-Li Tandem sells autographs. His business ...,['Zadie Smith'],2003,['Fiction'],Alex-Li Tandem sells autographs. His business ...,en,0.113216,0.085311,0.081603,0.099031,0.417966,0.082309,0.120565


In [154]:
df_books_en.loc[98662,"publishedDate"] = "2007" #Wrong Data point, used to be 2030 which is impossible
years = df_books_en["publishedDate"].value_counts().sort_index(ascending=False)[:14]
years.pop("2023") # strangely enough, we had some points from 2023 (perhaps books that are announced but yet to be published?). These were eliminated
years

2022    1006
2021    2645
2020    2724
2019    2697
2018    2766
2017    2992
2016    3187
2015    3540
2014    4288
2013    5098
2012    5218
2011    4532
2010    3906
Name: publishedDate, dtype: int64

In [155]:
print(sum(years.values))
years = years.index.tolist()

44599


In [156]:
df_books_en["publishedDate"].replace("",np.nan, inplace=True)
df_books_en = df_books_en.dropna(subset = ["publishedDate"])
df_years = df_books_en[(df_books_en["publishedDate"].apply(int) > 2009) & (df_books_en["publishedDate"].apply(int) < 2023)]
df_years

Unnamed: 0,Title,description,authors,publishedDate,categories,new_desc,lang,topic1,topic2,topic3,topic4,topic5,topic6,topic7
12,Mensa Number Puzzles (Mensa Word Games for Kids),Acclaimed teacher and puzzler Evelyn B. Christ...,['Evelyn B. Christensen'],2018,['Juvenile Nonfiction'],Acclaimed teacher and puzzler Evelyn B. Christ...,en,0.081379,0.206227,0.368441,0.074345,0.093401,0.071278,0.104929
13,Vector Quantization and Signal Compression (Th...,"Herb Caen, a popular columnist for the San Fra...","['Allen Gersho', 'Robert M. Gray']",2012,['Technology & Engineering'],"Herb Caen, a popular columnist for the San Fra...",en,0.281639,0.166511,0.293628,0.065531,0.059910,0.065226,0.067555
14,A husband for Kutani,"First published in 1938, this is a collection ...",['Frank Owen'],2018,['History'],"First published in 1938, this is a collection ...",en,0.079303,0.154894,0.067302,0.214772,0.250086,0.110496,0.123148
16,The Ultimate Guide to Law School Admission: In...,This collection brings together a distinguishe...,['Fiona Cownie'],2010,['Law'],This collection brings together a distinguishe...,en,0.140504,0.069417,0.428557,0.091853,0.068621,0.115173,0.085873
17,The Repeal of Reticence: A History of America'...,At a time when America's faculties of taste an...,['Rochelle Gurstein'],2016,['Political Science'],At a time when America's faculties of taste an...,en,0.169678,0.070360,0.144317,0.102388,0.102878,0.352481,0.057897
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
212348,Watercolor: For the Artistically Undiscovered ...,Put the brush in your hand. Stick it in the wa...,"['Thacher Hurd', 'John Cassidy']",2017,['Art'],Put the brush in your hand. Stick it in the wa...,en,0.060808,0.403781,0.083427,0.081929,0.241133,0.060765,0.068156
212375,Sell and Re-Sell Your Photos,Sell your photos again and again! Live anywher...,"['Rohn Engh', 'Mikael Karlsson']",2016,['Photography'],Sell your photos again and again! Live anywher...,en,0.056735,0.290012,0.332104,0.077475,0.060911,0.120815,0.061948
212394,Final things,Grace's father believes in science and builds ...,['Jenny Offill'],2015,['Fiction'],Grace's father believes in science and builds ...,en,0.058780,0.056405,0.055993,0.077312,0.604928,0.058690,0.087891
212398,Autodesk Inventor 10 Essentials Plus,Autodesk Inventor 2017 Essentials Plus provide...,"['Daniel Banach', 'Travis Jones']",2016,['Computers'],Autodesk Inventor 2017 Essentials Plus provide...,en,0.057221,0.286917,0.451048,0.053184,0.047131,0.053911,0.050588


In [158]:
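topic_cols = [f"topic{i}" for i in range(1, 8)]
# average topic weight per publication year (text columns are left out of the mean)
df_years = df_years.groupby("publishedDate")[topic_cols].mean()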

fig = go.Figure(data=[
    go.Bar(name="academia", x=df_years.index, y=df_years["topic1"]),
    go.Bar(name="art", x=df_years.index, y=df_years["topic2"]),
    go.Bar(name="student help", x=df_years.index, y=df_years["topic3"]),
    go.Bar(name="religion", x=df_years.index, y=df_years["topic4"]),
    go.Bar(name="fiction & romance", x=df_years.index, y=df_years["topic5"]),
    go.Bar(name="american war history", x=df_years.index, y=df_years["topic6"]),
    go.Bar(name="children books", x=df_years.index, y=df_years["topic7"]),

])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

Some notable observations from the bar chart above:
- In the last three years the popularity of books related to **fiction & romance** has gone up. This could be due to the difficult times of the pandemic, which undoubtedly left many people longing for affection and human contact.
link: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.798260/full
- There was a noticeable increase in the popularity of books related to **academia** during 2012-2013. This could be linked to the boom in popularity of massive open online courses (MOOCs) in those years. Since academic books are an essential supplement to an education, it could be that the boom in MOOCs caused an increase in the production of academic texts.
link: https://onlinelearninginsights.wordpress.com/2012/12/21/what-the-heck-happened-in-2012-review-of-the-top-three-events-in-education/


### Conclusions

In conclusion, we have trained a model to extract the topics discussed in the book corpus. The number of topics (7) was chosen to maximize coherence metrics while also keeping an eye on explainability.
These topics are closely related to the categories of the books; however, there have been some surprising insights, such as the presence of **art** as a theme. Furthermore, the last three years have seen an increasing trend in the popularity of **fiction & romance**, which suggests that publishers and authors should consider investing in books related to these themes.
We have also seen that important technological moments may have influenced the popularity of certain books, and this could be true of various other events. For this reason, any publisher or author should keep an eye out for course-altering events and use a similar approach to understand whether they affect the book-selling industry.

# Sentiment Analysis
In this section we will analyze the reviews from various users and perform sentiment analysis. This is a useful application in NLP since it can be used to know the sentiment towards a specific book based on the reviews. This information can be used by publishing companies and authors alike to know which products are liked most and what the public likes/dislikes.

In particular we will use the reviews dataset which we saw earlier.

In [3]:
df_reviews = pd.read_csv("./DataSet/Books_rating.csv" )

In [3]:
print(f"The Data Set contains {len(df_reviews)/1e6} million reviews")
df_reviews = df_reviews.rename(columns = {"review/helpfulness":"helpfulness", "review/score":"score", "review/time":"time", "review/summary":"summary", "review/text":"text"})

df_reviews.head()


The Data Set contains 3.0 million reviews


Unnamed: 0,Id,Title,Price,User_id,profileName,helpfulness,score,time,summary,text
0,1882931173,Its Only Art If Its Well Hung!,,AVCGYZL8FQQTD,"Jim of Oz ""jim-of-oz""",7/7,4.0,940636800,Nice collection of Julie Strain images,This is only for Julie Strain fans. It's a col...
1,826414346,Dr. Seuss: American Icon,,A30TK6U7DNS82R,Kevin Killian,10/10,5.0,1095724800,Really Enjoyed It,I don't care much for Dr. Seuss but after read...
2,826414346,Dr. Seuss: American Icon,,A3UH4UZ4RSVO82,John Granger,10/11,5.0,1078790400,Essential for every personal and Public Library,"If people become the books they read and if ""t..."
3,826414346,Dr. Seuss: American Icon,,A2MVUWT453QH61,"Roy E. Perry ""amateur philosopher""",7/7,4.0,1090713600,Phlip Nel gives silly Seuss a serious treatment,"Theodore Seuss Geisel (1904-1991), aka &quot;D..."
4,826414346,Dr. Seuss: American Icon,,A22X4XUPKF66MR,"D. H. Richards ""ninthwavestore""",3/3,4.0,1107993600,Good academic overview,Philip Nel - Dr. Seuss: American IconThis is b...


### Analysis of Data Set
 
We analyze the distribution of the scores and find the most frequent words for the positive and negative reviews.

In [5]:
fig = px.bar(df_reviews["score"].value_counts(), orientation = "h", title = f"Distribution of scores")
fig.show()

As we can see, there is a great imbalance towards 5-star reviews. 

By looking at a random sample of 20 3-star review summaries, we realize that it is difficult to categorize them as positive or negative. Therefore, they will simply be classified as "neutral".

In [18]:
df_reviews[df_reviews["score"] == 3.0]["summary"].sample(20, random_state= 2)

360556                         study guide for microbiology
2951265                                             OK book
1933705    The Scarlet Letter -- Wordy, dense, far too long
1792488             Another bittersweet slice of reality...
1771923           A classic, but hard to place an age group
2719754                                        Average book
2100828                   Yes, something is definitely awry
462543                         stats, stats, and mo' stats!
1860971                                 I have not read it.
624287            Better than your average cash cow sequel.
2545175                              Interesting but slight
1475002                     The turning point in the series
441864            Not what I was expecting, but still okay.
2755116                                         Cigar Guide
1792390          Perfection is always a tough act to follow
2004466                                   Moving yet Opaque
1825311                                 

In [4]:
df_reviews.loc[df_reviews["score"]>3.0, "class"] = "POSITIVE"
df_reviews.loc[df_reviews["score"]==3.0, "class"] = "NEUTRAL"
df_reviews.loc[df_reviews["score"]<3.0, "class"] = "NEGATIVE"
 

### Word Cloud

Let's look at the word clouds for the positive and the negative reviews. As usual, we have to do some preprocessing in order to reduce variation and capture more meaningful words.

In [7]:
positive_reviews = df_reviews[df_reviews["class"]=="POSITIVE"].dropna(subset = ["summary"]).sample(100000, random_state=2)
negative_reviews = df_reviews[df_reviews["class"]=="NEGATIVE"].dropna(subset = ["summary"]).sample(100000, random_state=2)

In [7]:
positive_reviews["clean_summary"] = clean_corpus(positive_reviews["summary"])
negative_reviews["clean_summary"] = clean_corpus(negative_reviews["summary"])

In [8]:
positive_words = " ".join(positive_reviews["clean_summary"])
wordcloud_positive = stylecloud.gen_stylecloud(
                        text = positive_words,
                        size = 1028,
                        background_color = "black",
                        output_name = "./Images/positive_wordcloud.png",
                        icon_name = "fas fa-comment"
                        )

![image](./Images/positive_wordcloud.png)

In [9]:
negative_words = " ".join(negative_reviews["clean_summary"])
wordcloud_negative = stylecloud.gen_stylecloud(
                        text = negative_words,
                        size = 1028,
                        background_color = "black",
                        output_name = "./Images/negative_wordcloud.png",
                        icon_name = "fas fa-comment"
                        )

![image](./Images/negative_wordcloud.png)

In the negative word cloud, we can immediately observe some terms that we associate with positive emotions, such as "good" and "great". One possible explanation is that, in negative reviews, these terms mostly appear in a negated context: for example, they could be preceded by "not", "hardly any", or another expression that reverses their positive meaning.
This would be easier to verify if we included the collocations as we did for topic modeling. However, since this is an exploratory analysis on a random subsample of 100 thousand documents, we do not perform it.

We can get a sense of whether our claim is correct by inspecting the negative reviews whose summary contains the word "good".

In [15]:
negative_reviews[negative_reviews["summary"].str.contains("good")]["summary"].head()

1477650    Notice most of the good reviews on this are by...
2073598         Airplane/Beach Read, Not good for much else.
2002262                                     not good quality
606137              Some good information, but high on guilt
546251                             Not really that good.....
Name: summary, dtype: object

As we can see, most of these summaries use "good" in a negated or qualified context, so our hypothesis seems to be correct.
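
To go beyond eyeballing a handful of rows, one could estimate how often "good" is preceded by a negation. The snippet below is a minimal sketch of that idea, not part of the original analysis: the negator list and the window of up to two intervening words are assumptions, and the five summaries are simply the ones printed above.

```python
import re
import pandas as pd

# Assumed negators; "good" may follow directly or after up to two other words.
NEGATED_GOOD = re.compile(
    r"(?:\b(?:not|no|never|hardly)|n't)\W+(?:\w+\W+){0,2}good\b",
    flags=re.IGNORECASE,
)

# The five summaries shown in the output above (the first one is truncated there).
summaries = pd.Series([
    "Notice most of the good reviews on this are by...",
    "Airplane/Beach Read, Not good for much else.",
    "not good quality",
    "Some good information, but high on guilt",
    "Not really that good.....",
])

# True where "good" appears in a negated context according to the pattern.
is_negated = summaries.str.contains(NEGATED_GOOD)
share_negated = is_negated.mean()
print(f"Share of summaries with a negated 'good': {share_negated:.0%}")
```

Applied to `negative_reviews["summary"]`, the same pattern would give a rough estimate over the whole subsample, although a fixed word window will miss longer-range negations.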

### BERT

As we mentioned before, we will be using BERT for our analysis.

In particular, we will test:

- Multilingual BERT: this model is trained on product reviews in various languages and is the most natural choice for our data set since it outputs a score from 1 to 5. In this way, we do not have to do any preprocessing at all.


### Multilingual BERT


In [4]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [3]:
if FIRST_RUN:
    df_reviews = df_reviews.dropna(subset=["summary"]).sample(300000, random_state=2)
    tqdm.pandas()
    df_reviews["predicted_rating_BERT"] = df_reviews["summary"].progress_apply(lambda x: int(re.sub(r"\D","",classifier(x)[0]['label'])))
    lang = [nlp(doc)._.language for doc in tqdm(df_reviews["summary"])]
    df_reviews["lang"] = lang
    with open("./DataSet/reviews_small_predicted.csv", "wb") as fp:
        pickle.dump(df_reviews,fp)
else:
    with open("./DataSet/reviews_small_predicted.csv", "rb") as fp:
        df_reviews = pickle.load(fp)
df_reviews

Unnamed: 0,Id,Title,Price,User_id,profileName,helpfulness,score,time,summary,text,predicted_rating_BERT,lang
2524966,B000HZ8ZR8,"The Garden of Eden, A Novel",,,,4/11,2.0,912297600,The Garden of Eden? Hmmmm......,"This book starts out great, it romantisizes Eu...",2,en
530830,B000HOSLEG,Fairytale,,A2PGVEDRPGGIJ0,coteb,2/2,5.0,1338854400,loved it,I highly recommend if you like Maggie shayne. ...,5,en
2597389,0754053822,Disgrace: Complete & Unabridged,,A1MRV6VAB1HVJW,Gisele W. Wright,0/4,3.0,1022803200,A hard read,What drew to this novel was the title and the ...,2,en
218539,0786222859,Bittersweet Rain,,A21MVX83HPR5VK,Dina_179,17/22,4.0,952128000,Ms. Brown keeps surprising me with her amazing...,Only Sandra Brown will put together the most u...,5,en
2759866,0821755641,Unforgettable,,A1N967FZREGFG,Tess Capra,3/3,4.0,890524800,"Not unforgettable, but pretty darn good.",Perhaps if Zebra spent as much on publicity as...,3,en
...,...,...,...,...,...,...,...,...,...,...,...,...
1644252,0395615801,A Nightmare in History: The Holocaust 1933-1945,,A1OJHNAMUW4F4Y,David N. 1st period.,2/5,5.0,1008028800,1st peiod reading -A nightmare in history,Imagne for a second that you are sleeping in y...,5,en
2816988,1593555563,Stone of Tears (Sword of Truth Series),167.25,A2JNZ09Y5JYPAK,I hate peas,2/5,1.0,1325462400,Never should have been published,This book is absolutely terrible.If you take o...,1,en
1088854,B000KIITWS,CONFUCIUS: THE SECULAR AS SACRED.,,A3PBFMN1QU4VWV,John J. Gibbs,0/0,5.0,1318636800,a gem,"concise, well-written and constructed (as you ...",5,pt
962254,0875968708,"You Bet Your Tomatoes: Fun Facts, Tall Tales, ...",,ASFNOYMHYMQNO,GLORIA O'CONNELL,1/1,4.0,1245974400,Entertaining and informative.,"I've been gardening for many years (actually, ...",4,en


In [9]:
fig = px.bar(df_reviews["score"].value_counts(), orientation = "h", title = f"Distribution of scores")
fig.show()

In [6]:
print(classification_report(df_reviews["score"], df_reviews["predicted_rating_BERT"]))

              precision    recall  f1-score   support

         1.0       0.32      0.55      0.41     20127
         2.0       0.28      0.27      0.27     15091
         3.0       0.30      0.32      0.31     25563
         4.0       0.35      0.30      0.32     58262
         5.0       0.77      0.74      0.76    180957

    accuracy                           0.58    300000
   macro avg       0.41      0.44      0.41    300000
weighted avg       0.60      0.58      0.59    300000



The results are not very good. The accuracy is around 58%, which is not surprising since the accuracy reported for the multilingual model is close to that value.

However, in the description of the model, the authors propose an "off by 1" metric. In other words, we consider a prediction accurate if it differs by at most 1 from the true label.
This metric makes sense in an application where we are not interested in the exact score, but in a qualitative measure of positivity or negativity towards a certain book. This could be the case, for example, in some applications of collaborative filtering where we simply want to know whether the user experienced an item positively or negatively.

Hence, we will use this metric here as well to evaluate the model.
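
As a minimal sketch of the idea (the function name `off_by_one_accuracy` is an assumption, not something defined elsewhere in the notebook), the overall off-by-one accuracy is the share of predictions whose absolute difference from the true score is at most 1. The toy inputs are the five (score, prediction) pairs from the first rows of the table displayed above.

```python
import numpy as np

def off_by_one_accuracy(y_true, y_pred):
    """Share of predictions within one star of the true rating."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= 1))

# First five rows shown above: true score vs. predicted_rating_BERT
true_scores = [2, 5, 3, 4, 4]
predicted_scores = [2, 5, 2, 5, 3]

exact_accuracy = float(np.mean(np.asarray(true_scores) == np.asarray(predicted_scores)))
relaxed_accuracy = off_by_one_accuracy(true_scores, predicted_scores)
print(f"exact: {exact_accuracy:.2f}, off by one: {relaxed_accuracy:.2f}")
```

On the full data this would be called as `off_by_one_accuracy(df_reviews["score"], df_reviews["predicted_rating_BERT"])`. It yields a single overall figure, whereas the cells below compute a per-class variant.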

In [80]:
for score in range(1,6):
    df_reviews.loc[df_reviews["score"] == score, "off_score_"+str(score)]= df_reviews["predicted_rating_BERT"].apply(lambda x: score if x<=score+1 and x>=score-1 else 0.0)
    df_reviews.loc[df_reviews["predicted_rating_BERT"] == score, "off_score_"+str(score)] = score
df_reviews.fillna(0, inplace = True)
df_reviews

Unnamed: 0,Id,Title,Price,User_id,profileName,helpfulness,score,time,summary,text,predicted_rating_BERT,off_score_1,off_score_2,off_score_3,off_score_4,off_score_5
2524966,B000HZ8ZR8,"The Garden of Eden, A Novel",0.00,0,0,4/11,2.0,912297600,The Garden of Eden? Hmmmm......,"This book starts out great, it romantisizes Eu...",2,0.0,2.0,0.0,0.0,0.0
530830,B000HOSLEG,Fairytale,0.00,A2PGVEDRPGGIJ0,coteb,2/2,5.0,1338854400,loved it,I highly recommend if you like Maggie shayne. ...,5,0.0,0.0,0.0,0.0,5.0
2597389,0754053822,Disgrace: Complete & Unabridged,0.00,A1MRV6VAB1HVJW,Gisele W. Wright,0/4,3.0,1022803200,A hard read,What drew to this novel was the title and the ...,2,0.0,2.0,3.0,0.0,0.0
218539,0786222859,Bittersweet Rain,0.00,A21MVX83HPR5VK,Dina_179,17/22,4.0,952128000,Ms. Brown keeps surprising me with her amazing...,Only Sandra Brown will put together the most u...,5,0.0,0.0,0.0,4.0,5.0
2759866,0821755641,Unforgettable,0.00,A1N967FZREGFG,Tess Capra,3/3,4.0,890524800,"Not unforgettable, but pretty darn good.",Perhaps if Zebra spent as much on publicity as...,3,0.0,0.0,3.0,4.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1644252,0395615801,A Nightmare in History: The Holocaust 1933-1945,0.00,A1OJHNAMUW4F4Y,David N. 1st period.,2/5,5.0,1008028800,1st peiod reading -A nightmare in history,Imagne for a second that you are sleeping in y...,5,0.0,0.0,0.0,0.0,5.0
2816988,1593555563,Stone of Tears (Sword of Truth Series),167.25,A2JNZ09Y5JYPAK,I hate peas,2/5,1.0,1325462400,Never should have been published,This book is absolutely terrible.If you take o...,1,1.0,0.0,0.0,0.0,0.0
1088854,B000KIITWS,CONFUCIUS: THE SECULAR AS SACRED.,0.00,A3PBFMN1QU4VWV,John J. Gibbs,0/0,5.0,1318636800,a gem,"concise, well-written and constructed (as you ...",5,0.0,0.0,0.0,0.0,5.0
962254,0875968708,"You Bet Your Tomatoes: Fun Facts, Tall Tales, ...",0.00,ASFNOYMHYMQNO,GLORIA O'CONNELL,1/1,4.0,1245974400,Entertaining and informative.,"I've been gardening for many years (actually, ...",4,0.0,0.0,0.0,4.0,0.0


In [85]:
print("Off by one metrics \n\n")
for score in range(1,6):
    tp = (df_reviews["score"] == df_reviews["off_score_"+str(score)]).sum()
    fp = ((df_reviews["score"] != score) & (df_reviews["off_score_"+str(score)]==score)).sum()
    fn = ((df_reviews["score"] == score) & (df_reviews["off_score_"+str(score)]!=score)).sum()
    print(f"Class {score}")
    print(f"Precision:\t {tp/(tp+fp)}")
    print(f"Recall:\t\t {tp/(tp+fn)}\n")


Off by one metrics 


Class 1
Precision:	 0.3755330847839498
Recall:		 0.6956327321508422

Class 2
Precision:	 0.5294090700870362
Recall:		 0.7658206878271817

Class 3
Precision:	 0.4673034737455919
Recall:		 0.6583343113093142

Class 4
Precision:	 0.6173135169401986
Recall:		 0.90254368198826

Class 5
Precision:	 0.8017643352236925
Recall:		 0.878938090264538



As we can see, with this metric, the per-class precision and recall are much higher. However, one could object that the off-by-one metric is unreliable, as it simply increases the number of true positives (while keeping the false positives the same). For this reason, we take these results with a grain of salt and prefer to stick to the normal classification report.

## RoBERTa

One interesting alternative to the multilingual model is RoBERTa. This model is trained on a wider variety of data sets and can also be fine-tuned, which is a desirable quality. However, its first limitation is that it only handles English.
Therefore, we must first keep only the reviews that are in English, using spacy_fastlang as before. This was done previously.

A quick analysis shows that approximately 15000 of the documents are flagged as non-English, while a quick inspection showed that this classification is not very accurate. Therefore, just to be sure, we reclassify them using the full text of the review instead of the summary. In this way, we do not throw away data which could be useful.

In [32]:
#nonenglish = df_reviews[df_reviews["lang"]!="en"]["text"]
#lang = [nlp(doc)._.language for doc in tqdm(nonenglish)]
#df_reviews.loc[df_reviews["lang"]!="en", "lang"] = lang

#remove non english docs
df_reviews_english = df_reviews[df_reviews["lang"]=="en"]

len_3 = len( df_reviews_english[df_reviews_english["score"]==3.0])
print(f"Documents with score of 3.0: { len_3}")

#remove docs with score of 3.0
df_reviews_english = df_reviews_english[df_reviews_english["score"]!=3.0]

print(f"We are left with {len(df_reviews_english)} documents")

df_reviews_english.loc[df_reviews_english["score"]>3.0, "class"] = "POSITIVE"
df_reviews_english.loc[df_reviews_english["score"]<3.0, "class"] = "NEGATIVE"


Documents with score of 3.0: 25546
We are left with 273958 documents


Now the non-English documents have decreased to only about 500, which we drop.


The next step involves handling another limitation of the model, namely that it only classifies text as positive or negative. As we discussed previously, we can easily group 1 & 2 stars as negative and 4 & 5 stars as positive, while for 3 stars it is not clear whether they belong to one category or the other. This could distort the evaluation of the model's performance, since we have no prior idea of the true labels. For this reason, we drop 3-star reviews when we evaluate the performance.

This means dropping about 25500 reviews.

In the end, we are left with more than 250000 documents, which is enough to measure a comparable performance. Unfortunately, the run time required for this is 12 hours. Therefore, we choose to consider only a sample of 100 thousand documents, which requires only 5 hours.

In [27]:
model_name = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [8]:
if FIRST_RUN:
    df_reviews_english = df_reviews_english.sample(100000, random_state=2)
    tqdm.pandas()
    df_reviews_english["predicted_rating_RoBERTa"] = df_reviews_english["summary"].progress_apply(lambda x: str(classifier(x)[0]['label']))
    with open("./DataSet/reviews_small_roberta.csv", "wb") as fp:
        pickle.dump(df_reviews_english,fp)
else:
    with open("./DataSet/reviews_small_roberta.csv", "rb") as fp:
        df_reviews_english = pickle.load(fp)

In [14]:
fig = px.bar(df_reviews_english["class"].value_counts(), orientation = "h", title = f"Distribution of positive vs negative")
fig.show()

In [8]:
print(classification_report(df_reviews_english["class"], df_reviews_english["predicted_rating_RoBERTa"]))

              precision    recall  f1-score   support

    NEGATIVE       0.51      0.82      0.63     12720
    POSITIVE       0.97      0.89      0.93     87280

    accuracy                           0.88    100000
   macro avg       0.74      0.85      0.78    100000
weighted avg       0.91      0.88      0.89    100000



As we can see, this model performs much better than the previous one. It has high recall for both classes and high precision for the positive class. However, its precision for the negative class is only about 50%, which means that roughly half of the reviews it labels as negative are actually positive. This is largely due to the class imbalance: the missing 11% of recall in the positive class corresponds to roughly 9,600 positive documents classified as negative, which is practically as many as the negative documents that are correctly identified (about 10,400). The precision of the negative class therefore inevitably goes down.
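
The effect of the imbalance on the negative-class precision can be checked with a few lines of arithmetic. This is a sketch that only uses the rounded figures from the classification report above, so the result is approximate; the variable names are illustrative.

```python
# Figures read off the classification report above (rounded to two decimals there).
support_negative = 12720
support_positive = 87280
recall_negative = 0.82
recall_positive = 0.89

# Negative reviews correctly labelled NEGATIVE
negatives_caught = recall_negative * support_negative
# Positive reviews wrongly labelled NEGATIVE
positives_mislabelled = (1 - recall_positive) * support_positive

# Precision of the negative class = correct NEGATIVE labels / all NEGATIVE labels
precision_negative = negatives_caught / (negatives_caught + positives_mislabelled)

print(f"Correctly caught negatives: {negatives_caught:.0f}")
print(f"Positives labelled negative: {positives_mislabelled:.0f}")
print(f"Approximate negative precision: {precision_negative:.2f}")
```

The two counts are of similar size, so even a small error rate on the large positive class is enough to pull the negative precision down to roughly one half.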

#### Fine Tuning RoBERTa

As mentioned before, we can fine-tune this model on our own dataset to try to obtain better results.

In particular, we follow the approach outlined in:
https://huggingface.co/docs/transformers/training

#### Step 1: Prepare Data
In the tutorial, we see that the data must be in Hugging Face format. This can be achieved by uploading the train, validation, and test sets to the Hugging Face Hub (account needed).

The strategy was as follows:
- Drop all columns except class and summary (to store less data).
- Encode the classes as integers: Positive = 1, Negative = 0.
- Shuffle the data.
- Split into train, validation, and test sets (60%, 20%, 20%).
- Upload the results to the Hub.
- Load using the appropriate library.
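
The shuffle-and-split step can also be done with a stratified split, so that each subset keeps the same positive/negative ratio. This is a sketch of an alternative technique, not the approach used in the notebook: the helper `split_60_20_20` and the toy frame are assumptions, built on scikit-learn's `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(df, label_col="labels", seed=2):
    """Stratified 60/20/20 split into train, validation and test frames."""
    # First cut: 60% train, 40% held out
    train, rest = train_test_split(
        df, test_size=0.4, stratify=df[label_col], random_state=seed
    )
    # Second cut: split the held-out 40% in half
    validation, test = train_test_split(
        rest, test_size=0.5, stratify=rest[label_col], random_state=seed
    )
    return train, validation, test

# Toy frame with an 80/20 class imbalance (illustrative data only)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "labels": [1] * 80 + [0] * 20,
})
toy_train, toy_validation, toy_test = split_60_20_20(toy)
print(len(toy_train), len(toy_validation), len(toy_test))
```

With heavily imbalanced classes, stratification avoids ending up with a validation or test set that has noticeably fewer negative examples than expected.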

In [9]:
df_reviews_english["class"] = df_reviews_english["class"].apply(lambda x: 1 if x == "POSITIVE" else 0)
df_reviews_english = df_reviews_english.drop(["Id", "Title", "score", "Price", "User_id", "profileName", "helpfulness", "time", "text", "predicted_rating_BERT", "lang", "predicted_rating_RoBERTa"], axis = 1)
df_reviews_english.rename(columns = {"class":"labels","summary":"text"},inplace = True)
lendf = len(df_reviews_english)
train, validation, test = np.split(df_reviews_english.sample(frac = 1, random_state = 2), [int(0.6*lendf), int(0.8*lendf)])
train.to_csv("./DataSet/train.csv", index = False)
validation.to_csv("./DataSet/validation.csv", index = False)
test.to_csv("./DataSet/test.csv", index = False)

In [10]:
from datasets import load_dataset
dataset = load_dataset("theabm/NLPFinalProject")
dataset


Using custom data configuration theabm--NLPFinalProject-c2fd898ba956fdb1
Found cached dataset csv (C:/Users/andre/.cache/huggingface/datasets/theabm___csv/theabm--NLPFinalProject-c2fd898ba956fdb1/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
100%|██████████| 3/3 [00:00<00:00, 460.98it/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 20000
    })
})

As we can see, the dataset is a DatasetDict, which is itself composed of the splits. We can access the data in the following way:

In [20]:
dataset["train"][0]

{'text': 'A Perfect Slice of History', 'labels': 1}

#### Step 2: Tokenizer

We use the same tokenizer as above to handle the inputs.

In [4]:
model_name = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = 2)

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)


 98%|█████████▊| 59/60 [00:01<00:00, 39.63ba/s]
 95%|█████████▌| 19/20 [00:00<00:00, 36.48ba/s]
 95%|█████████▌| 19/20 [00:00<00:00, 49.05ba/s]


In [6]:
tokenized_datasets["train"]

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 60000
})

In [7]:
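# imports needed by this cell and the next one (harmless if already imported above)
from transformers import Trainer, TrainingArguments
import evaluate

training_args = TrainingArguments(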
    output_dir="./test_trainer",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_steps=500,
    weight_decay=0.01,
    eval_accumulation_steps=100,
    per_device_eval_batch_size=32,
    per_device_train_batch_size=32,
    group_by_length=True,
    save_strategy = "epoch"
)
    

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return clf_metrics.compute(predictions=predictions, references=labels)

In [9]:
trainer = Trainer(
    model = model,
    args = training_args,
    tokenizer= tokenizer,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"],
    compute_metrics = compute_metrics,
    data_collator=data_collator
)

In [None]:
# Requires a decent amount of GPU memory. The settings above were used in Colab and training finished in about an hour.
#trainer.train()

![image](./Images/finetuning.png)

We fine-tuned the model for 5 epochs and saved a checkpoint at the end of each. As we can see, the training loss decreases quickly while the validation loss increases markedly towards the end, which indicates overfitting. The validation loss is at its minimum after the first epoch, so we use the weights of the model at this epoch.

A finer-grained analysis could save the model weights every 1000 steps to better understand the model's behaviour. We did not do this for computational reasons.
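
For reference, a step-based checkpointing setup might look like the sketch below. It was not run for this project. The keyword names are assumed to match the `transformers` version used in this notebook (newer releases rename `evaluation_strategy` to `eval_strategy`), and `save_total_limit=3` is an arbitrary choice to limit disk usage.

```python
# Steps per epoch with the settings above: 60000 training rows, batch size 32
train_rows = 60000
batch_size = 32
steps_per_epoch = train_rows // batch_size

# Keyword arguments that could replace the epoch-based settings in TrainingArguments
step_based_checkpointing = {
    "evaluation_strategy": "steps",       # evaluate every `eval_steps` instead of every epoch
    "eval_steps": 1000,
    "save_strategy": "steps",             # must match the evaluation strategy
    "save_steps": 1000,
    "load_best_model_at_end": True,       # reload the checkpoint with the best metric
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,           # lower validation loss is better
    "save_total_limit": 3,                # assumption: limits the number of checkpoints kept on disk
}

# Hypothetical usage, keeping the other arguments as in the cell above:
# training_args = TrainingArguments(output_dir="./test_trainer", **step_based_checkpointing)
print(f"{steps_per_epoch} steps per epoch, checkpoint every {step_based_checkpointing['save_steps']} steps")
```

With 1875 steps per epoch, a checkpoint every 1000 steps gives roughly two evaluation points per epoch, which would show more precisely where the validation loss starts to rise.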

#### Results

We evaluate the fine tuned model on the unseen test set, in order to compare the performances.

In [8]:
path_to_model = "C:\\Users\\andre\\Documents\\NLP\\NLPFinalProject\\FineTuningCheckpoint\\checkpoint-1875"
model = AutoModelForSequenceClassification.from_pretrained(path_to_model)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)
ft_clf = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device = 0)

In [None]:
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

dataset = load_dataset("theabm/NLPFinalProject", split = "test")


In [28]:
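if FIRST_RUN:
    # run the fine-tuned classifier over the test split and collect the string labels
    predicted_labels = []
    for out in tqdm(ft_clf(KeyDataset(dataset, "text"))):
        predicted_labels.append(str(out['label']))
    data = {"trueLabels": dataset["labels"],
            "predictedLabels": predicted_labels
        }
    # keep the results in `predicted` so that both branches define the same DataFrame
    predicted = pd.DataFrame(data)
    # map the string labels to the integer encoding used for the true labels
    predicted["predictedLabels"] = predicted["predictedLabels"].apply(lambda x: 1 if x == "POSITIVE" else 0)
    predicted.to_csv("./DataSet/ft_predicted_labels.csv", index = False)
else:
    predicted = pd.read_csv("./DataSet/ft_predicted_labels.csv")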


In [31]:
print(classification_report(predicted["trueLabels"], predicted["predictedLabels"]))

              precision    recall  f1-score   support

           0       0.77      0.70      0.73      2561
           1       0.96      0.97      0.96     17439

    accuracy                           0.93     20000
   macro avg       0.86      0.83      0.85     20000
weighted avg       0.93      0.93      0.93     20000



As we can see, the model has noticeably improved compared to before. We have higher precision (51% -> 77%) and f1-score (63% -> 73%) for the negative class, and higher recall (89% -> 97%) and f1-score (93% -> 96%) for the positive class. Furthermore, the accuracy also increased from 88% to 93%.

The precision for the positive class decreased slightly (97% -> 96%) and the recall for the negative class dropped considerably (82% -> 70%).

However, the performance metrics indicate an **overall** improvement and are more desirable for an all-purpose application.

# Baselines

We now have to compare these models against a baseline in order to understand if our model makes sense. In particular, we will use the most frequent baseline. 

We will only compare the fine-tuned RoBERTa model and the baseline, since Multilingual BERT performed poorly and the proposed application only requires positive or negative feedback in its most basic form to be incorporated into a recommender system.

In particular, we will use the balanced accuracy and the weighted, micro, and macro f1 scores, each of which accounts differently for the class imbalance.

In [33]:
# most frequent

# we simply predict the positive class (1) for all documents

predicted_most_frequent = np.ones(len(predicted))

balanced_acc_baseline = balanced_accuracy_score(predicted["trueLabels"], predicted_most_frequent)
balanced_acc_roberta = balanced_accuracy_score(predicted["trueLabels"], predicted["predictedLabels"])

f1_weighted_baseline = f1_score(predicted["trueLabels"], predicted_most_frequent, average = "weighted")
f1_weighted_roberta = f1_score(predicted["trueLabels"], predicted["predictedLabels"], average = "weighted")

f1_micro_baseline = f1_score(predicted["trueLabels"], predicted_most_frequent, average = "micro")
f1_micro_roberta = f1_score(predicted["trueLabels"], predicted["predictedLabels"], average = "micro")

f1_macro_baseline = f1_score(predicted["trueLabels"], predicted_most_frequent, average = "macro")
f1_macro_roberta = f1_score(predicted["trueLabels"], predicted["predictedLabels"], average = "macro")


print(f"Balanced accuracy baseline: {balanced_acc_baseline}\nbalanced accuracy model: {balanced_acc_roberta}\n")
print(f"f1-weighted baseline: {f1_weighted_baseline}\nf1-weighted model: {f1_weighted_roberta}\n")
print(f"f1-micro baseline: {f1_micro_baseline}\nf1-micro model: {f1_micro_roberta}\n")
print(f"f1-macro baseline: {f1_macro_baseline}\nf1-macro model: {f1_macro_roberta}\n")



Balanced accuracy baseline: 0.5
balanced accuracy model: 0.8333527349272734

f1-weighted baseline: 0.8123046048238468
f1-weighted model: 0.9331138209427435

f1-micro baseline: 0.87195
f1-micro model: 0.9345

f1-macro baseline: 0.4657976975880766
f1-macro model: 0.8471822708849865



We notice that the model performs better than the baseline on all the metrics.

In particular, we see that the balanced accuracy for the baseline is reduced to $1/n_{classes} = 0.5$, which is the accuracy of a random classifier.

The model also outperforms the baseline on the weighted and micro f1 scores, which is important due to the high imbalance of the classes. To briefly explain further, the baseline correctly predicts all the occurrences of the positive class, which account for approximately 87% of the dataset. Thus, if we weight the metrics by class frequency, this will "hide" the null performance on the negative class. However, our fine-tuned model is able to outperform this: it correctly predicts not only the vast majority of the positive class but also 70% of the negative class.

Lastly, we are not surprised that the model outperforms the baseline in f1-macro score since this gives equal weights to the classes and the baseline misclassifies all the negative instances. 
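
The baseline figures can be reproduced from the class supports alone, which makes the reasoning above easy to verify. The sketch below rebuilds the test labels from the supports in the classification report (2561 negative, 17439 positive) and uses scikit-learn's `DummyClassifier`. The feature matrix is a placeholder, since the most-frequent strategy ignores the inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score

# True labels rebuilt from the supports of the test set
y_true = np.array([0] * 2561 + [1] * 17439)
# Placeholder features: the most-frequent strategy never looks at them
X_placeholder = np.zeros((len(y_true), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_baseline = baseline.predict(X_placeholder)

# Balanced accuracy is the mean of the per-class recalls: (0 + 1) / 2
bal_acc = balanced_accuracy_score(y_true, y_baseline)
f1_micro = f1_score(y_true, y_baseline, average="micro")
# zero_division=0 silences the warning for the never-predicted negative class
f1_macro = f1_score(y_true, y_baseline, average="macro", zero_division=0)

print(f"balanced accuracy: {bal_acc:.4f}")
print(f"f1-micro: {f1_micro:.5f}")
print(f"f1-macro: {f1_macro:.4f}")
```

The macro score is simply half of the positive-class f1, because the negative class contributes an f1 of zero.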