# Important

Run `make data data/raw/books_xml` before executing any cell in this notebook.

# Imports

In [None]:
import os
import pandas

In [None]:
books_xml_dir = "../data/raw/books_xml"

## IDs

- `work_id` - globally unique id of a book (abstract, disregarding edition or language)
- `goodreads_book_id`, `isbn`, `isbn13` - id of a specific edition of the book
- `best_book_id` - id of the most popular edition of the book

`book_id` is used throughout the data files as a new abstract identifier for a book:
- in range 1-10000
- semantically identical to `work_id`

It is used in `ratings.csv` and `to_read.csv`, which were aggregated by `work_id`, so they contain data for all editions of a book.
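
If `book_id` is semantically identical to `work_id`, the two columns should be in one-to-one correspondence. The sketch below shows one way to check that. The helper `is_one_to_one` and the toy values are hypothetical and are not part of the project code.

```python
import pandas

# Toy stand-in for book.csv; the id values are made up for illustration.
toy_book_df = pandas.DataFrame({
    "book_id": [1, 2, 3],
    "work_id": [2792775, 4640799, 3212258],
})


def is_one_to_one(df, left, right):
    """Return True if every `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


ids_match = is_one_to_one(toy_book_df, "book_id", "work_id")
print(ids_match)
```

On the real data the same call would be `is_one_to_one(book_df, "book_id", "work_id")`, run after `book_df` is loaded.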

## book.csv

In [None]:
book_df = pandas.read_csv("../data/raw/book.csv")

In [None]:
book_df.head(1)

To check whether the dataset contains multiple editions of the same book, we should look for duplicates in columns `work_id` or `best_book_id`.

In [None]:
len(book_df[book_df.duplicated(['work_id'])])

In [None]:
len(book_df[book_df.duplicated(['best_book_id'])])

The dataset contains no duplicate `work_id` values, so `book_id` has the same meaning as `work_id`.

In [None]:
import booksuggest.data.clean_book

In [None]:
book_extra_info_rows = booksuggest.data.clean_book.extract_book_extra_info(books_xml_dir)
book_extra_info_df = booksuggest.data.clean_book.process_book_extra_info(book_extra_info_rows)

In [None]:
book_extra_info_df.head(1)

In [None]:
set(book_df.work_id.unique()) ^ set(book_extra_info_df.work_id.unique())

## similar_books.csv

In [None]:
import booksuggest.data.prepare_similar_books

In [None]:
similar_books_rows = booksuggest.data.prepare_similar_books.extract_similar_books(books_xml_dir)
similar_books_raw_df = booksuggest.data.prepare_similar_books.process_similar_books(similar_books_rows)

In [None]:
similar_books_raw_df.head(1)

Here, data rows are identified by `work_id`. To maintain consistency we should change ids to `book_id`.

In [None]:
len(set(book_df.work_id.unique()) & set(similar_books_raw_df.similar_book_work_id.unique()))

In [None]:
len(set(similar_books_raw_df.similar_book_work_id.unique()) - set(book_df.work_id.unique()))

The `similar_books` section contains 6025 books from the dataset. Additionally, more than 40k books are outside the dataset and provide no value to the analysis, so they should be omitted.
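
A minimal sketch of both steps (switching to `book_id` and omitting books outside the dataset), using toy frames. It assumes the raw frame has a `work_id` column next to `similar_book_work_id`. The output column name `similar_book_id` is hypothetical.

```python
import pandas

# Toy stand-ins; the ids are made up for illustration.
toy_book_df = pandas.DataFrame({"book_id": [1, 2, 3], "work_id": [100, 200, 300]})
toy_similar_df = pandas.DataFrame({
    "work_id": [100, 100, 200],
    "similar_book_work_id": [200, 999, 300],  # 999 is not in the dataset
})

# Lookup Series: index is work_id, values are book_id.
work_to_book = toy_book_df.set_index("work_id")["book_id"]

converted = pandas.DataFrame({
    "book_id": toy_similar_df["work_id"].map(work_to_book),
    "similar_book_id": toy_similar_df["similar_book_work_id"].map(work_to_book),
})

# Work ids missing from the lookup map to NaN, so dropping those rows
# omits every pair that involves a book outside the dataset.
converted = converted.dropna().astype(int).reset_index(drop=True)
print(converted)
```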

## book_tags.csv

In [None]:
book_tags_df = pandas.read_csv("../data/raw/book_tags.csv")

In [None]:
book_tags_df.head(1)

Here, data rows are identified by `goodreads_book_id`. To maintain consistency we should change ids to `book_id`.
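
One possible way to do the conversion is a many-to-one merge against the book table. The sketch below uses toy frames. The `tag_id` and `count` columns are assumed to match the layout of `book_tags.csv`.

```python
import pandas

# Toy stand-ins; the values are made up for illustration.
toy_book_df = pandas.DataFrame({"book_id": [1, 2], "goodreads_book_id": [2767052, 3]})
toy_book_tags_df = pandas.DataFrame({
    "goodreads_book_id": [2767052, 2767052, 3],
    "tag_id": [11, 12, 11],
    "count": [50, 20, 30],
})

tags_by_book_id = (
    toy_book_tags_df
    # validate="many_to_one" raises if goodreads_book_id is duplicated in the book table.
    .merge(
        toy_book_df[["book_id", "goodreads_book_id"]],
        on="goodreads_book_id",
        how="inner",
        validate="many_to_one",
    )
    .drop(columns="goodreads_book_id")
    [["book_id", "tag_id", "count"]]
)
print(tags_by_book_id)
```

An inner merge also drops tag rows whose `goodreads_book_id` has no match in the book table, which may or may not be desirable depending on the analysis.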

In [None]:
set(book_tags_df.goodreads_book_id.unique()) ^ set(book_df.goodreads_book_id.unique())

There are 14 books that were not marked as `to_read` by any user.

## ratings.csv

In [None]:
ratings_df = pandas.read_csv("../data/raw/ratings.csv")

In [None]:
ratings_df.head(1)

In [None]:
set(ratings_df.book_id.unique()) ^ set(book_df.book_id.unique())

## to_read.csv

In [None]:
to_read_df = pandas.read_csv("../data/raw/to_read.csv")

In [None]:
to_read_df.head(1)

In [None]:
set(to_read_df.book_id.unique()) ^ set(book_df.book_id.unique())

In [None]:
(set(book_df.book_id.unique()) - set(to_read_df.book_id.unique())) == (set(to_read_df.book_id.unique()) ^ set(book_df.book_id.unique()))
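
A symmetric difference alone does not show which side the unmatched ids come from. The comparison above holds only when all of them come from `book_df`. As a sketch, a small hypothetical helper (`id_differences` is not part of the project code) can report both directions at once:

```python
def id_differences(left_ids, right_ids):
    """Split the symmetric difference of two id collections by the side each id comes from."""
    left, right = set(left_ids), set(right_ids)
    return {"only_left": left - right, "only_right": right - left}


# Toy ids made up for illustration: 1 and 4 appear only on the left side.
diff = id_differences([1, 2, 3, 4], [2, 3])
print(diff)
```

With the real frames, `id_differences(book_df.book_id, to_read_df.book_id)` would list the books nobody marked to read under `only_left`, and an empty `only_right` would mean that `to_read.csv` references no unknown books.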