# Judging a book by its cover

The Goodreads dataset provides data on books rated on Goodreads, including title, authors, and more. This EDA notebook examines the language of book titles and whether it correlates with the rating a book receives.

## Preprocessing

Load the essential libraries, inspect the structure of the data, then select the most relevant aspects for analysis.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import spacy
import numpy
import wordcloud

from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
path = "../input/books.csv"
# Skip a handful of rows (presumably malformed in the raw CSV).
dfr = pd.read_csv(path, skiprows=[4011, 5687, 7055, 10600, 10667])

In [None]:
dfr.columns

We have information about the book itself (title, number of pages, authors) and about its ratings and reviews. The review texts are missing, so there is little material for language analysis. The data needs cleaning first, and to start I will look at how many languages are represented.

In [None]:
# Count books per language code, sorted alphabetically by code.
lang_counts = dfr['language_code'].value_counts().sort_index()
f, ax = plt.subplots(figsize=(12, 5))
ax.barh(lang_counts.index, lang_counts.values)
plt.show()

Most books are in English. I will build a dataset by merging 'eng', 'en-GB', 'en-CA' and 'en-US', and keep only books with at least 10 ratings, to drop books that are barely rated or not rated at all.

In [None]:
df = dfr[dfr['ratings_count'] >= 10]
english_codes = ["eng", "en-US", "en-CA", "en-GB"]
# Copy so that columns added later do not trigger SettingWithCopyWarning.
df = df[df.language_code.isin(english_codes)].copy()


## Book metadata analysis

Now I'm going to analyze book ratings based on title length and page count.

In [None]:
df["title_length"] = df.title.str.len()

I use title length in characters, which is a rather minimal approach; it might be nice to add tokenization and maybe remove stopwords. It could also be interesting to build word clouds after removing stopwords from titles, to see whether some words are highly correlated with good (or bad) ratings.
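A minimal sketch of the token-based length mentioned above, assuming a plain regex tokenizer and a tiny hand-picked stopword list. Both are stand-ins for illustration (the spaCy pipeline or the NLTK stopword list would be the natural replacements), and the helper name `content_token_count` is hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not the NLTK or wordcloud list).
SMALL_STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

def content_token_count(title):
    """Count alphabetic tokens in a title, ignoring stopwords and any '(...)' suffix."""
    short_title = re.sub(r" \(.*", "", title)  # drop a trailing "(series #n)" part
    tokens = re.findall(r"[a-z']+", short_title.lower())
    return sum(1 for token in tokens if token not in SMALL_STOPWORDS)

# Made-up title, for illustration only.
example_count = content_token_count("The Art of the Example (Example Series #2)")
```

On the notebook's data this could be applied with `df.title.apply(content_token_count)` and plotted in place of the character count.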

In [None]:
f, ax = plt.subplots(figsize=(9, 6))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(x=df.title_length, y=df.average_rating,
                hue=df['# num_pages'], 
                palette="ch:r=-.2,d=.3_r",
                sizes=(1, 8), linewidth=0,
                data=df, ax=ax)

Meh. There is no real indication here that title length has a significant impact on a book's rating.
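To put a number on that impression, one option is a correlation coefficient. The sketch below computes Pearson's r in plain Python on made-up values; the function name `pearson_r` and the sample numbers are illustrative assumptions, not results from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up title lengths and ratings, for illustration only.
toy_lengths = [12, 25, 40, 58, 73]
toy_ratings = [3.9, 4.1, 3.8, 4.0, 3.9]
r = pearson_r(toy_lengths, toy_ratings)
```

With the notebook's DataFrame, `df.title_length.corr(df.average_rating)` should give the same statistic; a value close to 0 would back up the reading of the scatterplot.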

In [None]:
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(y=df.average_rating, x=df['# num_pages'],
                hue=df.title_length, 
                palette="ch:r=-.2,d=.3_r",
                sizes=(1, 8), linewidth=0,
                data=df, ax=ax)

There seems to be some correlation between number of pages and rating here. I'm missing some important information, like the genre of the books, which would help show what kind of book gets the best ratings. Next, I'm going to add categorical features indicating whether each book title contains a verb, an adjective, and so on, encoded as 0/1 flags, to see if these values affect the rating.

In [None]:
def pos_in_doc(text, postag):
    doc = nlp(text)
    return 1 if postag in [token.pos_ for token in doc] else 0

In [None]:
df["verb_in_title"] = df.apply(lambda row: pos_in_doc(row.title, "VERB"), axis=1)
df["noun_in_title"] = df.apply(lambda row: pos_in_doc(row.title, "NOUN"), axis=1)
df["adv_in_title"] = df.apply(lambda row: pos_in_doc(row.title, "ADV"), axis=1)
df["adj_in_title"] = df.apply(lambda row: pos_in_doc(row.title, "ADJ"), axis=1)
df["propn_in_title"] = df.apply(lambda row: pos_in_doc(row.title, "PROPN"), axis=1)

## Correlation between title and rating

Now, about the ratings: for each book I have the number of reviews, the number of ratings, and the mean rating. I wonder if a large number of ratings tends to smooth the rating itself. (I don't have any information about the variance between individual ratings; it would be great to see which books are most polarizing and which create the most consensus.)
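One way to probe the smoothing question without per-review data is to bucket books by the order of magnitude of `ratings_count` and compare the spread of `average_rating` across buckets. The sketch below does this on made-up `(ratings_count, average_rating)` pairs; `spread_by_magnitude` is a hypothetical helper.

```python
import math
import statistics
from collections import defaultdict

def spread_by_magnitude(books):
    """Group (ratings_count, average_rating) pairs by power of ten of the count
    and return the population standard deviation of the ratings in each group."""
    buckets = defaultdict(list)
    for ratings_count, average_rating in books:
        if ratings_count < 1:
            continue  # log10 is undefined for zero counts
        buckets[int(math.log10(ratings_count))].append(average_rating)
    return {
        magnitude: statistics.pstdev(ratings)
        for magnitude, ratings in sorted(buckets.items())
    }

# Made-up values, for illustration only.
toy_books = [(12, 2.9), (45, 4.8), (80, 3.6), (1500, 3.9), (4200, 4.1), (9000, 4.0)]
spread = spread_by_magnitude(toy_books)
```

If the spread shrinks as the magnitude grows on the real data, that would be consistent with heavily rated books drifting toward the overall mean.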

In [None]:
pos_columns = ["verb_in_title", "noun_in_title", "adv_in_title",
               "adj_in_title", "propn_in_title"]
plt.figure(1, figsize=(10, 6))
# One violin plot per part-of-speech flag, laid out on a 2x3 grid.
for i, column in enumerate(pos_columns, start=1):
    plt.subplot(2, 3, i)
    sns.violinplot(x=df[column], y=df.average_rating,
                   split=True, inner="quart",
                   data=df)

The plots suggest that an adverb in the title gives a slight advantage in terms of rating. Nice! Now let's get a little more specific: can we associate the words in a title with the rating? Let's remove stopwords, punctuation and numbers, make everything lowercase, then tokenize and lemmatize the title words and build another dataset from that, with each word associated with its mean rating and the variance of its ratings.
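Before moving on, the adverb effect read off the violin plots can be sanity-checked with a permutation test on the difference in mean rating. This is a sketch on made-up numbers, and the helper `permutation_p_value` is hypothetical; it is one possible check, not something the notebook runs.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for the difference in means of two groups."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up ratings, for illustration only.
with_adverb = [4.1, 4.3, 4.0, 4.2, 4.4]
without_adverb = [3.9, 4.0, 3.8, 4.1, 3.7]
p_value = permutation_p_value(with_adverb, without_adverb)
```

On the real data the two groups would be `df[df.adv_in_title == 1].average_rating` and `df[df.adv_in_title == 0].average_rating`; a small p-value would suggest the gap is unlikely to be a shuffling artifact, though it says nothing about why the gap exists.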

In [None]:
# Positional access: label 0 may have been filtered out of df.
print(df['title'].iloc[0])

As can be seen here, book titles are written as `title (series number)`. That's great, because later on we can analyze sequels: how badly they perform across long series, and whether the ratings stay consistent. For now we will use only the first part of the title.
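The title convention described above can be split with a single regular expression. The sketch below assumes the suffix looks like `(Series Name #N)`; the helper `split_title` and the sample title are made up for illustration, and real titles may deviate from this pattern.

```python
import re

# Assumed pattern: "<title> (<series name> #<number>)"
SERIES_PATTERN = re.compile(
    r"^(?P<title>.*?)\s*\((?P<series>[^()]*?)\s*#(?P<number>\d+)\)\s*$"
)

def split_title(full_title):
    """Return (title, series, number); series and number are None when absent."""
    match = SERIES_PATTERN.match(full_title)
    if match is None:
        return full_title.strip(), None, None
    return match.group("title"), match.group("series"), int(match.group("number"))

# Made-up title, for illustration only.
parts = split_title("An Example Title (Example Series #2)")
```

Keeping the series name alongside the number would make it possible to follow individual series later, rather than pooling every second volume together.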

In [None]:
from collections import defaultdict

relevant_pos = ['ADJ', 'VERB', 'NOUN', 'PROPN', 'ADV']
col = ["word", "ratings_list", "pos_tag"]
# Accumulate ratings per lemma in plain dicts: DataFrame.append and
# DataFrame.get_value are gone from recent pandas, and row-wise appends are slow.
ratings_by_word = defaultdict(list)
pos_by_word = {}
for index, row in df.iterrows():
    rating = row['average_rating']
    short_title = re.sub(r" \(.*", "", row['title'])
    doc = nlp(short_title)
    tokens = [t for t in doc if t.pos_ in relevant_pos]
    for token in tokens:
        lemma = token.lemma_.lower()
        ratings_by_word[lemma].append(rating)
        # Keep the POS tag from the first time the lemma is seen.
        pos_by_word.setdefault(lemma, token.pos_)
title_words_df = pd.DataFrame(
    [(word, numpy.array(ratings), pos_by_word[word])
     for word, ratings in ratings_by_word.items()],
    columns=col)
title_words_df = title_words_df.set_index('word')

Alright, we now have a dataset containing a large number of words. We are going to add columns describing each word's frequency (the number of ratings it contributed to, i.e. the length of its rating vector), the variance of its ratings, its mean rating, and so on. Then I will extract another dataset composed of words with a frequency above 20, to get rid of hapaxes and other rare words. I will probably get rid of one-letter words too.

In [None]:
title_words_df["frequency"] = title_words_df["ratings_list"].map(len)
title_words_df["variance"] = title_words_df["ratings_list"].map(numpy.var)
title_words_df["mean_rating"] = title_words_df["ratings_list"].map(numpy.mean)

In [None]:
freq_words_df = title_words_df[title_words_df.frequency > 20]
adj_freq_words_df = freq_words_df[freq_words_df.pos_tag == 'ADJ']
noun_freq_words_df = freq_words_df[freq_words_df.pos_tag == 'NOUN']
adv_freq_words_df = freq_words_df[freq_words_df.pos_tag == 'ADV']
verb_freq_words_df = freq_words_df[freq_words_df.pos_tag == 'VERB']
noun_freq_words_df.sort_values("mean_rating", ascending=False).head(5)

In [None]:
noun_freq_words_df.sort_values("mean_rating", ascending=True).head(5)

So you are better off writing about alchemists (I'm assuming this comes from the manga series Fullmetal Alchemist, and not from a booming interest in the old science of turning random materials lying around your house into gold) than about girls, mysteries, or confessions.

## Wordclouds

Now let's generate two wordclouds, one for the best-rated books and one for the worst. To simplify, we will consider that a "good" book is rated 4.5 or above, and a "bad" book 3.5 or below. These thresholds might require some adjustment later on, for example if we realize that too few books fall into one category or the other.
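A quick way to check whether the thresholds leave enough books in each group is to count the bucket sizes before building the clouds. A minimal sketch on made-up ratings; `bucket_sizes` is a hypothetical helper, and its default cut-offs simply mirror the 4.5 and 3.5 values named above.

```python
def bucket_sizes(ratings, good_threshold=4.5, bad_threshold=3.5):
    """Count ratings at or above the 'good' cut-off, at or below the 'bad' one,
    and everything in between."""
    good = sum(1 for r in ratings if r >= good_threshold)
    bad = sum(1 for r in ratings if r <= bad_threshold)
    return {"good": good, "bad": bad, "middle": len(ratings) - good - bad}

# Made-up ratings, for illustration only.
sizes = bucket_sizes([3.2, 3.5, 3.9, 4.1, 4.5, 4.7])
```

Calling it as `bucket_sizes(df.average_rating)` with a few candidate thresholds would show how lopsided the two groups are before committing to a pair of values.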

In [None]:
stop = set(STOPWORDS)
good_text = " ".join(re.sub(r"\(.*", "", t) for t in df[df.average_rating >= 4.5].title)
bad_text = " ".join(re.sub(r"\(.*", "", t) for t in df[df.average_rating <= 3.5].title)
text = " ".join(re.sub(r"\(.*", "", t) for t in df.title)
good_wordcloud = WordCloud(stopwords=stop, background_color="white").generate(good_text)
bad_wordcloud = WordCloud(stopwords=stop, background_color="white").generate(bad_text)
plt.figure(1, figsize=(14, 10))
plt.subplot(1, 2, 1)
plt.imshow(good_wordcloud, interpolation='bilinear')
plt.axis("off")
plt.subplot(1, 2, 2)
plt.imshow(bad_wordcloud, interpolation='bilinear')
plt.axis("off")

In general, if you want your book to succeed on this type of rating website, don't go for anything that involves men, girls, time or the color black, because people aren't that much into it. Go for something more classy, like "Lord Alchemist of the Calvin Ring's Complete Guide", and that's a guaranteed hit!

In [None]:
text_wordcloud = WordCloud(stopwords=stop, background_color="white").generate(text)
plt.figure(1, figsize=(14, 10))
plt.imshow(text_wordcloud, interpolation='bilinear')
plt.axis("off")

And these are the most frequent words in titles, across all ratings.

## Series examination

Now comes the time to study series, and whether their ratings decrease across books! We saw earlier that the book title contains some information about the series it belongs to, as well as the book's number in that series. Based on that, we will cut ourselves a slice of the dataset with only the books that are part of a series, and add their number as a new column.

In [None]:
# Copy so that adding n_in_series below does not trigger SettingWithCopyWarning.
series_df = df[df.title.str.endswith(')')].copy()

In [None]:
def find_number(text):
    match = re.search(r"#(\d+)", text) 
    if match:
        return int(match.group(1))
    else:
        return 0
    
series_df["n_in_series"] = series_df.apply(lambda row: find_number(row.title), axis=1)

In [None]:
f, ax = plt.subplots(figsize=(10, 6.5))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(y=series_df.average_rating, x=series_df.n_in_series,
                hue=series_df['# num_pages'],
                palette="ch:r=-.2,d=.3_r",
                sizes=(1, 8), linewidth=0,
                data=series_df, ax=ax)

Okay, maybe we need to drop any book numbered 50 or above in its series, because these sparse examples make the plot harder to read.

In [None]:
small_series_df = series_df[series_df.n_in_series < 50]
f, ax = plt.subplots(figsize=(10, 6.5))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(y=small_series_df.average_rating, x=small_series_df.n_in_series,
                hue=small_series_df['# num_pages'],
                palette="ch:r=-.2,d=.3_r",
                sizes=(1, 8), linewidth=0,
                data=small_series_df, ax=ax)
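To read the trend off as numbers rather than from the scatterplot, the ratings can be averaged per position in the series. A sketch on made-up `(n_in_series, average_rating)` pairs; `mean_rating_by_position` is a hypothetical helper, and position 0 is skipped because `find_number` returns 0 when no number is found.

```python
from collections import defaultdict

def mean_rating_by_position(books, max_position=50):
    """Average rating per series position, skipping unnumbered (0) and very late entries."""
    ratings = defaultdict(list)
    for position, rating in books:
        if 0 < position < max_position:
            ratings[position].append(rating)
    return {
        position: sum(values) / len(values)
        for position, values in sorted(ratings.items())
    }

# Made-up values, for illustration only.
toy_series = [(1, 4.2), (1, 4.0), (2, 3.9), (3, 3.8), (0, 4.5), (120, 4.9)]
trend = mean_rating_by_position(toy_series)
```

On the notebook's data, `small_series_df.groupby('n_in_series').average_rating.mean()` should give a comparable table (with position 0 included unless it is filtered out first).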