This notebook adds tag-based features to the feature vectors used by the content-based recommendation models.

# Important

Run `make features` before executing any cell in this notebook.

# Imports

In [None]:
import numpy as np
import pandas as pd

In [None]:
book_tags = pd.read_csv('../data/raw/book_tags.csv')
tags_data = pd.read_csv('../data/raw/tags.csv')
with open("../data/external/genres.txt") as file:
    goodreads_genres = [line.rstrip('\n') for line in file]

## Data description

In [None]:
book_tags.head()

The data records which tags were assigned to each book and how many times each tag was assigned (the `count` column in the data frame above).

In [None]:
tags_data.tag_name

Unfortunately, some tags are written in languages other than English, and others carry no useful information, for example `--5-`. For this reason only tags that represent genres are kept as book features. The set of genre tags considered is shown in the cell below.

In [None]:
goodreads_genres
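The filtering idea described above can be sketched on toy data. The tag names and the genre list in this example are made up for illustration and are not taken from `tags.csv` or `genres.txt`.

```python
import pandas as pd

# Toy stand-ins for tags.csv and genres.txt (hypothetical values).
toy_tags = pd.DataFrame({
    "tag_id": [0, 1, 2, 3],
    "tag_name": ["--5-", "fantasy", "comic-book", "favoritos"],
})
toy_genres = ["fantasy", "comic-book", "romance"]

# Keep only the tags whose name appears in the genre list.
genre_tags = toy_tags[toy_tags["tag_name"].isin(toy_genres)]
kept_names = genre_tags["tag_name"].tolist()
print(kept_names)  # ['fantasy', 'comic-book']
```

The uninformative tag and the non-English tag are dropped because neither appears in the genre list.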

## How to represent tags as features?

The question is how those tags should be converted to features. The following ideas are considered:

* append the raw tag counts to the existing feature vectors
* normalize the tag counts to measure how strongly each tag applies to a book (for example, 'how fictional' the book is)

The problem with the first approach is that the total number of tag assignments differs between books: one book might have been tagged 100 times in total and another 1000 times. Suppose the first book received the `comic-book` tag all 100 times, while the second received it 300 times out of 1000. The first book looks like a pure comic book, yet in raw counts the second book appears 'more' `comic-book` than the first, even though it is only partly a comic book. Normalizing the counts per book avoids this distortion.
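The comparison above can be reproduced with a small sketch. The book ids `A` and `B` and the `fantasy` tag that absorbs the remaining 700 assignments are assumptions made for illustration. This is not necessarily how `make features` builds the real feature file.

```python
import pandas as pd

# Hypothetical counts mirroring the example: book A was tagged 100 times
# in total, book B 1000 times (the `fantasy` row is an assumption).
toy_counts = pd.DataFrame({
    "goodreads_book_id": ["A", "B", "B"],
    "tag_name": ["comic-book", "comic-book", "fantasy"],
    "count": [100, 300, 700],
})

# One row per book, one column per tag; tags a book lacks are filled with 0.
raw = toy_counts.pivot_table(
    index="goodreads_book_id", columns="tag_name",
    values="count", aggfunc="sum", fill_value=0,
)

# Divide each row by its total so that every book's shares sum to 1.
shares = raw.div(raw.sum(axis=1), axis=0)
print(shares)
```

In raw counts book B has the larger `comic-book` value (300 versus 100). After normalization book A gets a share of 1.0 and book B a share of 0.3, which matches the intuition that A is the purer comic book.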

The first step is to check the average number of unique tags assigned to a single book.

In [None]:
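# Attach tag names to the per-book tag counts (joined on the shared columns).
book_tags_names = book_tags.merge(tags_data)
# Keep only the tags that name a genre from the genre list.
book_tags_names = book_tags_names[book_tags_names.tag_name.isin(goodreads_genres)]
# Count the distinct genre tags assigned to each book.
tags_assigned_count = book_tags_names.groupby(
    'goodreads_book_id')['tag_id'].nunique().reset_index()['tag_id']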

In [None]:
tags_assigned_count.describe()

On average, a single book has 20 different tags assigned. This makes the tags a relevant feature: about 20 tags per book is not overly specific, yet still informative. In addition, the low dimensionality keeps the computations cheap.

## Feature extraction result analysis

In [None]:
tag_features = pd.read_csv('../features/tag_based_features.csv', index_col='book_id')

In [None]:
tag_features.sum(axis=1).head()

In [None]:
# Round each row sum to one decimal place before comparing it with 1.
(tag_features.sum(axis=1).round(1) == 1).all()

Every row sums to 1, which means that the tag counts were normalized correctly. The sums are rounded before the comparison because the features were computed with floating-point arithmetic, which is not perfectly precise.
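The floating-point effect mentioned above can be seen in a minimal sketch. The ten equal shares of 0.1 are a made-up example, and `np.isclose` is shown as one possible alternative to rounding, not as the check used in this notebook.

```python
import numpy as np

# Ten equal shares of 0.1 sum to 1 mathematically, but not in floating point.
toy_shares = [0.1] * 10
total = sum(toy_shares)

exact_match = (total == 1.0)                 # False: total is 0.9999999999999999
rounded_match = (round(total, 1) == 1)       # True: rounding hides the tiny error
close_match = bool(np.isclose(total, 1.0))   # True: comparison with a tolerance

print(total, exact_match, rounded_match, close_match)
```

Both rounding and a tolerance-based comparison accept sums that differ from 1 only by accumulated rounding error. The tolerance-based check is stricter, since rounding to one decimal place would also accept a sum such as 0.96.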

# Bibliography