This notebook adds tag-based features to the feature vectors used by content-based recommendation models.

# Imports

In [39]:
import numpy as np
import pandas as pd

In [15]:
book_tags = pd.read_csv('../data/raw/book_tags.csv')
tags_data = pd.read_csv('../data/raw/tags.csv')
with open("../data/external/genres.txt") as file:
    goodreads_genres = [line.rstrip('\n') for line in file]

## Data description

In [4]:
book_tags.head()

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697
1,1,11305,37174
2,1,11557,34173
3,1,8717,12986
4,1,33114,12716


The data records which tags were assigned to each book and how many times each tag was assigned (the `count` column in the data frame above).

In [19]:
tags_data.tag_name

0                                  -
1                               --1-
2                              --10-
3                              --12-
4                             --122-
5                             --166-
6                              --17-
7                              --19-
8                               --2-
9                             --258-
10                              --3-
11                             --33-
12                              --4-
13                              --5-
14                             --51-
15                              --6-
16                             --62-
17                              --8-
18                             --99-
19       --available-at-raspberrys--
20                           -2001--
21                          -calif--
22                            -d-c--
23                             -dean
24                         -england-
25                          -fiction
26                        -fictional
2

Unfortunately, some tags are written in languages other than English, and some carry no useful information, for example `--5-`. For this reason, only tags that represent genres are kept as book features. The candidate genre tags are listed in the cell below.
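As a minimal sketch of this filtering idea, the snippet below keeps only the tags whose names appear in a genre list. The data frame `toy_tags` and the list `toy_genres` are made-up stand-ins for `tags_data` and `goodreads_genres`, not the real data.

```python
import pandas as pd

# Toy stand-ins for tags_data and goodreads_genres (illustrative values only).
toy_tags = pd.DataFrame({
    "tag_id": [0, 1, 2, 3],
    "tag_name": ["--5-", "fantasy", "-fiction", "adventure"],
})
toy_genres = ["adventure", "fantasy", "romance"]

# Keep only tags whose name appears in the genre list.
genre_tags = toy_tags[toy_tags["tag_name"].isin(toy_genres)]
print(genre_tags["tag_name"].tolist())
```

Uninformative tags such as `--5-` and near-duplicates such as `-fiction` are dropped because they do not match any genre name exactly.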

In [22]:
goodreads_genres

['10th-century',
 '11th-century',
 '12th-century',
 '13th-century',
 '14th-century',
 '15th-century',
 '16th-century',
 '17th-century',
 '1864-shenandoah-campaign',
 '18th-century',
 '1917',
 '19th-century',
 '1st-grade',
 '20th-century',
 '21st-century',
 '2nd-grade',
 '40k',
 'abandoned',
 'abuse',
 'academia',
 'academic',
 'academics',
 'accounting',
 'accra',
 'action',
 'activism',
 'adaptations',
 'addis-ababa',
 'addition',
 'adolescence',
 'adoption',
 'adult',
 'adult-colouring-books',
 'adult-fiction',
 'adventure',
 'adventurers',
 'aeroplanes',
 'africa',
 'african-american',
 'african-american-literature',
 'african-american-romance',
 'african-literature',
 'agender',
 'agriculture',
 'aircraft',
 'airliners',
 'airships',
 'albanian-literature',
 'alchemy',
 'alcohol',
 'alexandria',
 'algeria',
 'algiers',
 'algorithms',
 'aliens',
 'alternate-history',
 'alternate-universe',
 'alternative-medicine',
 'amateur-sleuth',
 'amazon',
 'ambulance-service',
 'ambulances',
 '

## How to represent tags as features?

The question is how these tags should be converted to features. Two approaches are considered:

* append the raw tag counts to the existing feature vectors
* normalize the tag counts to measure what share of a book's tags each genre accounts for, e.g. 'how fictional' the book is

The problem with the first approach is that one book might have been tagged 100 times in total and another 1000 times. Suppose the first book received the `comic-book` tag 100 times and the second received it 300 times. The first book looks like a pure `comic-book`, yet by raw counts the second book is 'more' `comic-book` than the first, even though it is only partly a comic book.
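The comic-book example can be worked through on toy numbers. This is a sketch only: the book labels and the `fantasy` column are hypothetical, chosen so that the first book has 100 tag assignments in total and the second has 1000.

```python
import pandas as pd

# Hypothetical counts mirroring the example above.
toy_counts = pd.DataFrame(
    {"comic-book": [100, 300], "fantasy": [0, 700]},
    index=pd.Index(["book_a", "book_b"], name="book"),
)

# Divide each row by its total so every book's tag shares sum to 1.
toy_shares = toy_counts.div(toy_counts.sum(axis=1), axis=0)
print(toy_shares)
```

After normalization, `book_a` has a `comic-book` share of 1.0 and `book_b` a share of 0.3, which matches the intuition that the first book is the purer comic book.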

The first step is to check the average number of unique tags assigned to a single book.

In [47]:
# Attach tag names to each (book, tag) pair.
book_tags_names = book_tags.merge(tags_data)
# Keep only tags that name a Goodreads genre.
book_tags_names = book_tags_names[book_tags_names.tag_name.isin(goodreads_genres)]
# Count the distinct genre tags assigned to each book.
tags_assigned_count = book_tags_names.groupby('goodreads_book_id')['tag_id'].nunique()

In [49]:
tags_assigned_count.describe()

count    10000.000000
mean        20.712700
std          5.206311
min          3.000000
25%         17.000000
50%         20.000000
75%         24.000000
max         43.000000
Name: tag_id, dtype: float64

On average, a book has about 21 distinct genre tags assigned (mean 20.7, median 20), with a minimum of 3 and a maximum of 43.
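The extraction step that produced the feature file analysed in the next section is not shown in this notebook. One way it could be done, assuming a long-format table with the same columns as `book_tags_names`, is to pivot the counts into one column per tag and then normalize each row. The values and the names `toy_long`, `wide`, and `features` below are illustrative only.

```python
import pandas as pd

# Toy long-format data with the same columns as book_tags_names (made-up values).
toy_long = pd.DataFrame({
    "goodreads_book_id": [1, 1, 2, 2, 2],
    "tag_name": ["fantasy", "adventure", "fantasy", "romance", "adventure"],
    "count": [60, 40, 10, 80, 10],
})

# One row per book, one column per tag; missing book/tag pairs become 0.
wide = toy_long.pivot_table(
    index="goodreads_book_id",
    columns="tag_name",
    values="count",
    aggfunc="sum",
    fill_value=0,
)

# Row-normalize so each book's tag features sum to 1.
features = wide.div(wide.sum(axis=1), axis=0)
print(features)
```

Note that the stored features are indexed by `book_id` rather than `goodreads_book_id`; the mapping between the two identifiers is not covered by this sketch.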

## Feature extraction result analysis

In [57]:
tag_features = pd.read_csv('../features/tag_based_features.csv', index_col='book_id')

In [73]:
tag_features.sum(axis=1).head()

book_id
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
dtype: float64

In [69]:
all(tag_features.sum(axis=1).round() == 1)

True

Every row sums to 1 after rounding, which indicates that the tag counts were normalized correctly.
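Rounding is a fairly loose test, since any row sum between roughly 0.5 and 1.5 rounds to 1. A stricter variant, sketched here on a made-up feature matrix rather than the real `tag_features`, compares the row sums to 1 within a floating-point tolerance using `np.isclose`.

```python
import numpy as np
import pandas as pd

# Toy feature matrix; the second row is deliberately not normalized.
toy_features = pd.DataFrame(
    {"fantasy": [0.25, 0.4], "romance": [0.75, 0.3]},
    index=pd.Index([1, 2], name="book_id"),
)

row_sums = toy_features.sum(axis=1)

# Rounding accepts the second row, whose sum is only about 0.7.
loose_ok = (row_sums.round() == 1).all()

# np.isclose accepts only sums within a small tolerance of 1.
strict_ok = np.isclose(row_sums, 1.0).all()

print(loose_ok, strict_ok)
```

On this toy input the rounded check passes while the tolerance-based check fails, so the stricter form may be worth running on the real features as well.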

# Bibliography