# Important

`make cb-tf-idf` must be run in the main directory before opening this notebook, because the notebook uses the data produced by that step.
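One way to confirm that the prerequisite has been met is to check for the two CSV files that later cells read. This is only a sketch: `missing_inputs` and `EXPECTED_FILES` are hypothetical names introduced here, not part of the project, and the paths are simply copied from the cells below.

```python
from pathlib import Path

# Files produced by `make cb-tf-idf` that this notebook reads
# (paths copied from the read_csv calls further down).
EXPECTED_FILES = [
    "../data/interim/cb-tf-idf/book_with_nouns.csv",
    "../data/interim/cb-tf-idf/book_without_nouns.csv",
]


def missing_inputs(paths=EXPECTED_FILES, base_dir="."):
    """Return the expected files that do not exist relative to base_dir."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


missing = missing_inputs()
if missing:
    print("Run `make cb-tf-idf` first; missing files:", missing)
```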

# Imports

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Analysis

In [10]:
original_data = pd.read_csv('../data/interim/book-unified_ids.csv', index_col='book_id')

**Missing description example**

In [11]:
original_data.loc[9973]

goodreads_book_id                                                       849380
best_book_id                                                            849380
work_id                                                                   4370
books_count                                                                 52
isbn                                                                 609805797
authors                                            John M. Gottman, Nan Silver
original_publication_year                                                 1999
original_title               The Seven Principles for Making Marriage Work:...
title                        The Seven Principles for Making Marriage Work:...
language_code                                                              NaN
average_rating                                                            4.19
ratings_count                                                             8868
work_ratings_count                                  

**Non-English description example**

In [12]:
original_data.loc[9966]

goodreads_book_id                                                         9864
best_book_id                                                              9864
work_id                                                                3279710
books_count                                                                 72
isbn                                                                 312254997
authors                                                         Salman Rushdie
original_publication_year                                                 1999
original_title                                     The Ground Beneath Her Feet
title                                              The Ground Beneath Her Feet
language_code                                                              eng
average_rating                                                            3.77
ratings_count                                                             8673
work_ratings_count                                  

### Description length analysis

In [14]:
original_data['description'].apply(len).describe()

count    10000.00000
mean       872.34430
std        512.57934
min          4.00000
25%        541.00000
50%        809.00000
75%       1108.00000
max       8271.00000
Name: description, dtype: float64

## Noticed issues
* There are missing descriptions in the data
* Some descriptions are not in English
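Both issues can be flagged automatically. The sketch below is an illustration under stated assumptions, not the method used in this project: `is_missing` and `looks_english` are hypothetical helpers, and the English check is a crude heuristic based on a tiny hand-picked word list, so a real language detector would be more reliable.

```python
import math

# Tiny illustrative list of very common English words
# (an assumption for this sketch, not a real language model).
COMMON_ENGLISH = {
    "the", "and", "of", "to", "a", "in", "is",
    "that", "it", "for", "with", "was", "his", "her",
}


def is_missing(description):
    """True for None, NaN (how pandas represents empty cells) or blank strings."""
    if description is None:
        return True
    if isinstance(description, float) and math.isnan(description):
        return True
    return isinstance(description, str) and not description.strip()


def looks_english(description, threshold=0.1):
    """Crude check: share of tokens that are very common English words."""
    tokens = description.lower().split()
    if not tokens:
        return False
    hits = sum(token.strip(".,!?\"'") in COMMON_ENGLISH for token in tokens)
    return hits / len(tokens) >= threshold


print(is_missing(float("nan")))
print(looks_english("The year 1984 has come and gone, but the vision of the world is still with us"))
print(looks_english("El año 1984 ha pasado pero la pesadilla sigue viva"))
```

The threshold of 0.1 is arbitrary and would need tuning on real descriptions, especially very short ones.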

## Description content analysis

In [22]:
original_data['description'].dropna()

0       Winning will make you famous. Losing means cer...
1       Harry Potter's life is miserable. His parents ...
2       About three things I was absolutely positive.F...
3       The unforgettable novel of a childhood in a sl...
4       THE GREAT GATSBY, F. Scott Fitzgerald’s third ...
5       There is an alternate cover edition here."I fe...
6       In a hole in the ground there lived a hobbit. ...
7       The hero-narrator of The Catcher in the Rye is...
8       When world-renowned Harvard symbologist Robert...
9       “It is a truth universally acknowledged, that ...
10       “It may be unfair, but what happens in a few ...
11      Paperback features over fifty pages of bonus m...
12      The year 1984 has come and gone, but George Or...
13      As ferociously fresh as it was more than a hal...
14      Discovered in the attic in which she spent the...
15      Mikael Blomkvist, a once-respected financial j...
16      Sparks are igniting.Flames are spreading.And t...
17      Harry 

The descriptions need cleaning: punctuation and stopwords must be removed. In addition, stemming and lemmatization will be applied.

# Cleaning results

Descriptions have been cleaned using the following operations:
* transforming to lower case
* lemmatization
* stemming
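The first cleaning steps can be sketched in plain Python. This is a minimal illustration, not the project's actual pipeline: `basic_clean` and the short `STOPWORDS` set are assumptions made for the example, and stemming and lemmatization are left out because they require an NLP library.

```python
import string

# Small illustrative stopword list (an assumption; real pipelines use a fuller one).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to", "will", "you"}


def basic_clean(description):
    """Lowercase, replace ASCII punctuation with spaces, and drop stopwords."""
    lowered = description.lower()
    # Map each punctuation mark to a space so "death.The" splits into two words.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = lowered.translate(table).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


print(basic_clean("Winning will make you famous. Losing means certain death.The nation of Panem"))
# → winning make famous losing means certain death nation panem
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes such as `“` would need separate handling.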

Two approaches regarding nouns have been implemented:
* nouns are kept in the description
* nouns are deleted from the description

There are two approaches because, on the one hand, an expression like `Harry Potter` is a very important feature. On the other hand, if another book has a main character named `Harry`, it might be classified as similar even though it is a completely different book.
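The effect of a shared character name can be illustrated with a toy similarity measure. The sketch below uses Jaccard overlap on token sets as a simple stand-in for whatever similarity the TF-IDF step computes; both descriptions and the `names` set are made up for the example.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that two token collections have in common."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two made-up cleaned descriptions of unrelated books whose hero is named Harry.
wizard_book = "harri boy wizard school magic".split()
detective_book = "harri detect murder citi night".split()
names = {"harri"}  # hypothetical set of name tokens to drop

with_names = jaccard(wizard_book, detective_book)
without_names = jaccard(
    [t for t in wizard_book if t not in names],
    [t for t in detective_book if t not in names],
)
print(with_names, without_names)
```

With the name kept, the two unrelated books share one of nine distinct tokens; with it removed, they share none.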

In [15]:
cleaned_data = pd.read_csv('../data/interim/cb-tf-idf/book_with_nouns.csv', index_col='book_id')
cleaned_data['description']

book_id
1       win make lose mean certain the nation form nor...
2       harri life his parent dead stuck heartless for...
3       about three thing i absolut edward part i know...
4       the unforgett novel childhood sleepi southern ...
5       the great gatsbi scott third stand suprem achi...
6       there altern cover edit fell love way fall des...
7       in hole ground live not wet fill end worm oozi...
8       the the catcher rye ancient child nativ new yo...
9       when harvard symbologist robert langdon summon...
10      truth univers singl man posse good fortun must...
11      may happen sometim even singl chang cours whol...
12      paperback featur fifti page bonus includ sneak...
13      the year come georg nightmarish vision world b...
14      a feroci fresh half centuri remark allegori do...
15      discov attic spent last year ann remark diari ...
16      mikael financi watch profession life rapid cru...
17      spark flame and capitol want against katniss h...
18    

# Example results

In [60]:
cleaned_data.loc[1, 'description']

'win make lose mean certain nation form north countri consist wealthi capitol region surround poorer earli rebellion led district capitol result destruct creation annual televis event known hunger in remind power grace district must yield one boy one girl age lotteri system particip the chosen annual reap forc fight leav one survivor claim young select district femal katniss volunt take she male counterpart pit stronger train whole see death but katniss close death for surviv second'

In [61]:
original_data.loc[1, 'description']

"Winning will make you famous. Losing means certain death.The nation of Panem, formed from a post-apocalyptic North America, is a country that consists of a wealthy Capitol region surrounded by 12 poorer districts. Early in its history, a rebellion led by a 13th district against the Capitol resulted in its destruction and the creation of an annual televised event known as the Hunger Games. In punishment, and as a reminder of the power and grace of the Capitol, each district must yield one boy and one girl between the ages of 12 and 18 through a lottery system to participate in the games. The 'tributes' are chosen during the annual Reaping and are forced to fight to the death, leaving only one survivor to claim victory.When 16-year-old Katniss's young sister, Prim, is selected as District 12's female representative, Katniss volunteers to take her place. She and her male counterpart Peeta, are pitted against bigger, stronger representatives, some of whom have trained for this their whole

## Comparison of noun removal

In [33]:
clean_data_with_nouns = pd.read_csv('../data/interim/cb-tf-idf/book_with_nouns.csv', index_col='book_id')
clean_data_without_nouns = pd.read_csv('../data/interim/cb-tf-idf/book_without_nouns.csv', index_col='book_id')

In [34]:
harry_potter_description_with_nouns = clean_data_with_nouns.loc[2, 'description']
harry_potter_description_without_nouns = clean_data_without_nouns.loc[2, 'description']

In [35]:
harry_potter_description_with_nouns

'harri life his parent dead stuck heartless forc live tini closet but fortun chang receiv letter tell truth a mysteri visitor rescu relat take new hogwart school witchcraft after lifetim bottl magic harri final feel like normal but even within wizard he boy person ever surviv kill cur inflict evil lord launch brutal takeov wizard vanish fail kill though first year hogwart best everyth there danger secret object hidden within castl harri believ respons prevent fall evil but bring contact forc terrifi ever could full sympathet wild imagin countless excit first instal seri assembl unforgett magic world set stage mani adventur'

In [36]:
harry_potter_description_without_nouns

'life his parent dead stuck heartless forc live tini closet but fortun chang receiv letter tell truth mysteri visitor rescu relat take new after lifetim bottl magic final feel like normal but even within he boy person ever surviv kill cur inflict evil launch brutal takeov vanish fail kill first year best everyth there danger secret object hidden within castl believ respons prevent fall evil but bring contact forc terrifi ever could sympathet wild imagin countless excit first instal seri assembl unforgett magic world set stage mani adventur'

### Example of book with short description

In [54]:
original_data.loc[4210, 'description']

'Kiss of the Highlander (The Highlander Series, Book 4)'

In [52]:
clean_data_with_nouns.loc[4210, 'description']

'kiss highland highland book'

In [55]:
clean_data_without_nouns.loc[4210, 'description']

nan

# Notes

- N-grams should be considered in other methods; for example, a very specific word pairing like `Hunger Games` is lost in the result
- Stemming produces odd word endings, for example `countri` instead of `country`. However, this is not an issue because all words are processed in the same way
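The n-gram idea from the first note can be sketched as follows. `ngrams` is a hypothetical helper and the token list is made up for illustration; it only shows how adjacent words could be kept together as one feature.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up stemmed tokens in which both words of the pairing survive cleaning.
tokens = "annual televis event known hunger game".split()
print(ngrams(tokens))
# → ['annual televis', 'televis event', 'event known', 'known hunger', 'hunger game']
```

Treating `hunger game` as a single feature keeps the pairing distinct from books that merely mention hunger or games separately.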
