# Important

Run `make models/content-based-models/tf-idf-nouns-model.pkl` in the main directory before running this notebook, because the notebook uses the data produced by that target.

# Imports

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from langdetect import detect

# Visualization settings

In [None]:
sns.set(
    context='paper', font_scale=1.2, style='ticks', palette='muted', font='Arial'
)

# Analysis

In [None]:
original_data = pd.read_csv('../data/processed/book.csv', index_col='book_id')

There are 315 books with missing descriptions:

In [None]:
original_data['description'].isna().sum()

There are 199 descriptions written in a language other than English:

In [None]:
non_english_desc = original_data['description'].dropna().apply(
    lambda desc: detect(desc) != 'en')
non_english_desc.sum()
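One caveat with the cell above: `langdetect.detect` raises an exception when a text contains no usable features (for example a description made only of digits), which would abort the whole `apply`. The following is a minimal sketch of a defensive wrapper, not the notebook's actual code. The helper names and the `fake_detect` stand-in are hypothetical and exist only so the sketch runs without `langdetect` installed.

```python
from typing import Callable, Optional


def detect_language_or_none(text: str, detect_fn: Callable[[str], str]) -> Optional[str]:
    """Return the detected language code, or None when detection fails."""
    try:
        return detect_fn(text)
    except Exception:  # langdetect raises LangDetectException on feature-less text
        return None


def is_non_english(text: str, detect_fn: Callable[[str], str]) -> bool:
    """Flag a description only when a language was detected and it is not 'en'."""
    language = detect_language_or_none(text, detect_fn)
    return language is not None and language != 'en'


# Hypothetical stand-in detector, used only to keep this sketch self-contained.
def fake_detect(text: str) -> str:
    if not any(ch.isalpha() for ch in text):
        raise ValueError('No features in text.')
    return 'es' if 'el ' in text.lower() else 'en'


print(is_non_english('El nombre de la rosa', fake_detect))
print(is_non_english('12345', fake_detect))
```

In the notebook, the real detector would be passed instead, for example `original_data['description'].dropna().apply(lambda desc: is_non_english(desc, detect))`. Treating undetectable texts as English is a design choice of this sketch; they could equally be flagged for manual review.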

**Missing description example**

In [None]:
original_data.loc[9973]

**Non-English description example**

In [None]:
original_data.loc[9966]

In [None]:
reduced_data_descriptions = original_data['description'].dropna()[~non_english_desc]

### Description length analysis

In [None]:
# Pass the lengths as `x` explicitly so the box is horizontal in every seaborn version
ax = sns.boxplot(x=reduced_data_descriptions.str.len())
ax.set(xlabel='Description length')
ax.set_xscale('log')

In [None]:
reduced_data_descriptions.str.len().describe()

## Noticed issues
* There are missing descriptions in the data
* Some descriptions are not in English

## Description content analysis

In [None]:
original_data['description'].dropna()

The descriptions need cleaning: punctuation and stopwords should be removed. Additionally, stemming and lemmatization will be performed.
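As an illustration of the punctuation and stopword removal mentioned above, here is a minimal standard-library sketch. It is not the project's cleaning code: the `basic_clean` helper and the tiny `STOPWORDS` set are assumptions made for this example, and stemming and lemmatization are left out because they typically rely on a library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stopword list (an assumption for this sketch);
# a real pipeline would use a full list from an NLP library.
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'is', 'to', 'his', 'he'}


def basic_clean(description: str) -> str:
    """Lower-case the text, drop punctuation, and remove stopwords."""
    lowered = description.lower()
    no_punctuation = lowered.translate(str.maketrans('', '', string.punctuation))
    tokens = [tok for tok in no_punctuation.split() if tok not in STOPWORDS]
    return ' '.join(tokens)


print(basic_clean('Harry Potter is a wizard, and he goes to Hogwarts!'))
# prints: harry potter wizard goes hogwarts
```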

# Cleaning results

Descriptions have been cleaned using the following operations:
* transforming to lower case
* lemmatization
* stemming

Two approaches regarding nouns have been implemented:
* nouns are kept in the description
* nouns are deleted from the description

There are two approaches because, on the one hand, an expression like `Harry Potter` is a very important feature. On the other hand, if another book has a main character named `Harry`, it might get classified as similar even though it is a completely different book.
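The trade-off can be illustrated with a toy example. The sketch below compares two made-up token lists with cosine similarity over raw term counts (a simplification of TF-IDF chosen to keep the example dependency-free). The token lists, the `NAMES` set, and the helper names are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_names(tokens, names):
    """Remove tokens that are known names."""
    return [t for t in tokens if t not in names]


# Made-up cleaned descriptions of two unrelated books that share a first name.
wizard_book = ['harry', 'wizard', 'school', 'magic']
detective_book = ['harry', 'detective', 'murder', 'oslo']
NAMES = {'harry'}

print(cosine_similarity(wizard_book, detective_book))  # 0.25
print(cosine_similarity(drop_names(wizard_book, NAMES),
                        drop_names(detective_book, NAMES)))  # 0.0
```

With the name kept, the two unrelated books share a quarter of their signal; with it removed, they share nothing. The cost is that the name also stops linking books that really do belong to the same series.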

In [None]:
cleaned_data_with_nouns = pd.read_csv('../data/interim/cb-tf-idf/book_with_nouns.csv', index_col='book_id')
cleaned_data_with_nouns['description']

# Example results

In [None]:
cleaned_data_with_nouns.loc[1, 'description']

In [None]:
original_data.loc[1, 'description']

## Comparison of nouns removal

In [None]:
clean_data_with_nouns = pd.read_csv('../data/interim/cb-tf-idf/book_with_nouns.csv', index_col='book_id')
clean_data_without_nouns = pd.read_csv('../data/interim/cb-tf-idf/book_without_nouns.csv', index_col='book_id')

In [None]:
harry_potter_description_with_nouns = clean_data_with_nouns.loc[2, 'description']
harry_potter_description_without_nouns = clean_data_without_nouns.loc[2, 'description']

In [None]:
harry_potter_description_with_nouns

In [None]:
harry_potter_description_without_nouns

### Example of books with short descriptions

Unfortunately, when a description is very short, the cleaning can result in an empty description.

In [None]:
original_data.loc[4210, 'description']

In [None]:
clean_data_with_nouns.loc[4210, 'description']

In [None]:
clean_data_without_nouns.loc[4210, 'description']

However, this occurs only 2 times with the proper noun removal approach.

In [None]:
clean_data_without_nouns['description'].isna().sum()

In [None]:
clean_data_with_nouns['description'].isna().sum()
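The `isna()` counts above can be used to find empty descriptions because `pd.read_csv` reads an empty field as `NaN` by default. The sketch below demonstrates this on a hypothetical two-row file; it assumes the interim files store an emptied description as an empty field, which has not been checked here.

```python
import io

import pandas as pd

# Hypothetical file: the second description was cleaned down to nothing.
csv_text = 'book_id,description\n1,boy wizard school\n2,\n'

frame = pd.read_csv(io.StringIO(csv_text), index_col='book_id')

# The empty field is loaded as NaN, so isna() counts it.
print(frame['description'].isna().sum())
```

For the same reason, `.str.len()` returns `NaN` for such rows, so they need to be dropped or filled before plotting or vectorizing.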

## Descriptions length after cleaning

In [None]:
desc_len_with_nouns = clean_data_with_nouns['description'].str.len()
desc_len_without_nouns = clean_data_without_nouns['description'].str.len()

In [None]:
desc_len_with_nouns.describe()

In [None]:
desc_len_without_nouns.describe()

In [None]:
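# `distplot` is deprecated in recent seaborn releases; `histplot` with a KDE
# overlay is the closest replacement (requires seaborn >= 0.11)
ax = sns.histplot(desc_len_with_nouns.dropna(), kde=True, stat='density')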
ax.set(xlabel='Description length')

In [None]:
# Pass the lengths as `x` explicitly so the box is horizontal in every seaborn version
ax = sns.boxplot(x=desc_len_without_nouns.dropna())
ax.set(xlabel='Description length')
ax.set_xscale('log')

# Notes

- N-grams should be considered in other methods; for example, a very specific word pairing such as `Hunger Games` is lost as a feature in the result
- Stemming produces odd word endings, for example `countri` instead of `country`. However, this is not an issue because all words are processed in the same way
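To make the n-gram note concrete, here is a minimal sketch of how word bigrams could be added as extra features. The `word_ngrams` helper and the token list are hypothetical. If the model is built with scikit-learn's `TfidfVectorizer` (an assumption), its `ngram_range=(1, 2)` parameter achieves the same effect without custom code.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with underscores."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up cleaned tokens from a description.
tokens = ['hunger', 'games', 'trilogy']

# Keep the single words and add the bigrams as additional features.
features = tokens + word_ngrams(tokens, 2)
print(features)
# prints: ['hunger', 'games', 'trilogy', 'hunger_games', 'games_trilogy']
```

With bigrams included, `hunger_games` becomes a feature of its own and is no longer confused with unrelated uses of `hunger` or `games`.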