In [1]:
import os
import pandas



In [19]:
book_df = pandas.read_csv("../data/raw/book.csv")
book_additional_info_df = pandas.read_csv("../data/interim/book-additional_info.csv", dtype={'isbn13': object})
book_tags_df = pandas.read_csv("../data/raw/book_tags.csv")
ratings_df = pandas.read_csv("../data/raw/ratings.csv")
to_read_df = pandas.read_csv("../data/raw/to_read.csv")
similar_books_df = pandas.read_csv("../data/interim/similar_books.csv")

## IDs

- `work_id` - globally unique id of a book (abstract, disregarding edition or language)
- `goodreads_book_id`, `isbn`, `isbn13` - id of a specific edition of the book
- `best_book_id` - id of the most popular edition of the book

`book_id` is used throughout the data files as a new abstract identifier for a book:
- in the range 1-10000
- semantically identical to `work_id`

It is used in ratings.csv and to_read.csv, which were aggregated by `work_id`, so they contain data for all editions of a book.
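
The two properties listed above (a dense 1..N range and a one-to-one correspondence with `work_id`) can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the helper name `check_book_id` is hypothetical, and the toy frame with made-up ids only stands in for `book_df`.

```python
import pandas as pd


def check_book_id(df: pd.DataFrame) -> bool:
    """Return True if book_id is a dense 1..N range that maps one-to-one onto work_id."""
    n = len(df)
    # book_id must be exactly the integers 1..N, in any order.
    dense_range = sorted(df["book_id"].tolist()) == list(range(1, n + 1))
    # With both columns free of duplicates, each row pairs one book_id with one work_id.
    one_to_one = bool(df["book_id"].is_unique and df["work_id"].is_unique)
    return dense_range and one_to_one


# Toy stand-in for book_df; the ids are made up for illustration.
toy_books = pd.DataFrame({"book_id": [1, 2, 3], "work_id": [101, 102, 103]})
print(check_book_id(toy_books))
```

On the real data the same call would be `check_book_id(book_df)`.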

## book.csv

In [20]:
book_df.head(1)

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...


To check whether the dataset contains multiple editions of the same book, we should look for duplicates in columns `work_id` or `best_book_id`.

In [21]:
len(book_df[book_df.duplicated(['work_id'])])

0

In [22]:
len(book_df[book_df.duplicated(['best_book_id'])])

0

The dataset contains no duplicate `work_id` values, so `book_id` has the same meaning as `work_id`.

## book-additional_info.csv

In [23]:
book_additional_info_df.head(1)

Unnamed: 0,work_id,isbn13,description
0,15888570,9780373605699,Notorious Nora Sutherlin is famous for her del...


In [24]:
set(book_df.work_id.unique()) ^ set(book_additional_info_df.work_id.unique())

set()

## ratings.csv

In [25]:
ratings_df.head(1)

Unnamed: 0,user_id,book_id,rating
0,1,258,5


In [26]:
set(ratings_df.book_id.unique()) ^ set(book_df.book_id.unique())

set()

## to_read.csv

In [27]:
to_read_df.head(1)

Unnamed: 0,user_id,book_id
0,9,8


In [28]:
set(to_read_df.book_id.unique()) ^ set(book_df.book_id.unique())

{3151,
 3539,
 3996,
 4206,
 4439,
 5130,
 5898,
 6262,
 7330,
 7803,
 8055,
 9120,
 9161,
 9426}

In [29]:
(set(book_df.book_id.unique()) - set(to_read_df.book_id.unique())) == set(to_read_df.book_id.unique()) ^ set(book_df.book_id.unique())

True

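There are 14 books that were not marked as `to_read` by any user. Because the symmetric difference equals the one-sided difference, every `book_id` in to_read.csv also exists in book.csv.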
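
The comparison in `In [29]` works because a symmetric difference equals a one-sided difference exactly when the other side is empty. A small helper that reports both sides separately makes this explicit. This is a sketch under assumptions: `compare_ids` is a hypothetical name and the toy series are made-up stand-ins for the real id columns.

```python
import pandas as pd


def compare_ids(left: pd.Series, right: pd.Series) -> tuple:
    """Return (ids only in left, ids only in right)."""
    left_ids, right_ids = set(left.unique()), set(right.unique())
    return left_ids - right_ids, right_ids - left_ids


# Toy stand-ins for book_df.book_id and to_read_df.book_id (illustrative values).
toy_book_ids = pd.Series([1, 2, 3, 4])
toy_to_read_ids = pd.Series([1, 1, 3])

never_marked, unknown = compare_ids(toy_book_ids, toy_to_read_ids)
print(sorted(never_marked))  # books nobody marked as to-read
print(sorted(unknown))       # to_read ids that are missing from the book table
```

An empty second set means the to-read ids are a subset of the book ids, which is what the `True` result above expresses.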

## similar_books.csv

In [30]:
similar_books_df.head(1)

Unnamed: 0,work_id,similar_book_work_id
0,15888570,18868842


Here, data rows are identified by `work_id`. To maintain consistency we should change ids to `book_id`.

In [31]:
len(set(book_df.work_id.unique()) & set(similar_books_df.similar_book_work_id.unique()))

6025

In [32]:
len(set(similar_books_df.similar_book_work_id.unique()) - set(book_df.work_id.unique()))

50644

The `similar_book_work_id` column of `similar_books` contains 6025 books from the dataset. Additionally, more than 50k of the referenced books (50644) are outside the dataset and provide no value to the analysis, so they should be omitted.
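
One way to apply both points (omit out-of-dataset books, then switch to `book_id`) is sketched below. It assumes toy frames with made-up ids in place of `book_df` and `similar_books_df`; the output column name `similar_book_id` is an illustrative choice, not something defined in the data files.

```python
import pandas as pd

# Toy stand-ins for book_df and similar_books_df; all ids are made up.
toy_books = pd.DataFrame({"book_id": [1, 2, 3], "work_id": [101, 102, 103]})
toy_similar = pd.DataFrame({
    "work_id": [101, 101, 102, 103],
    "similar_book_work_id": [102, 999, 103, 888],
})

# Keep only the pairs whose similar book is part of the dataset.
in_dataset = toy_similar["similar_book_work_id"].isin(toy_books["work_id"])
filtered = toy_similar[in_dataset]

# Lookup table work_id -> book_id, then translate both columns.
work_to_book = toy_books.set_index("work_id")["book_id"]
similar_by_book_id = pd.DataFrame({
    "book_id": filtered["work_id"].map(work_to_book),
    "similar_book_id": filtered["similar_book_work_id"].map(work_to_book),
}).reset_index(drop=True)
print(similar_by_book_id)
```

Filtering before mapping avoids missing values in the mapped column, so the ids stay integers.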

## book_tags.csv

In [33]:
book_tags_df.head(1)

Unnamed: 0,goodreads_book_id,tag_id,count
0,1,30574,167697


Here, data rows are identified by `goodreads_book_id`. To maintain consistency we should change ids to `book_id`.

In [34]:
set(book_tags_df.goodreads_book_id.unique()) ^ set(book_df.goodreads_book_id.unique())

set()
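
Since the two id sets match exactly, the switch to `book_id` can be done with a plain merge against the id columns of the book table. The sketch below assumes toy frames with made-up values in place of `book_df` and `book_tags_df`; `validate="many_to_one"` makes pandas raise an error if a `goodreads_book_id` appears more than once in the book table.

```python
import pandas as pd

# Toy stand-ins for book_df and book_tags_df; all values are made up.
toy_books = pd.DataFrame({"book_id": [1, 2], "goodreads_book_id": [5001, 5002]})
toy_tags = pd.DataFrame({
    "goodreads_book_id": [5001, 5001, 5002],
    "tag_id": [10, 11, 10],
    "count": [7, 3, 5],
})

# Attach book_id to every tag row, then drop the edition-level id.
id_map = toy_books[["goodreads_book_id", "book_id"]]
with_book_id = toy_tags.merge(id_map, on="goodreads_book_id", validate="many_to_one")
tags_by_book_id = with_book_id[["book_id", "tag_id", "count"]]
print(tags_by_book_id)
```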

In [35]:
book_df.head(1)

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...


In [36]:
book_additional_info_df.head(1)

Unnamed: 0,work_id,isbn13,description
0,15888570,9780373605699,Notorious Nora Sutherlin is famous for her del...


In [37]:
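# Drop the float-typed isbn13 from book.csv; the merge below brings in the
# isbn13 column that book-additional_info.csv was loaded with as object dtype.
book_df = book_df.drop(columns=['isbn13'])
merged_df = book_df.merge(book_additional_info_df, on='work_id')

# Store both columns as strings.
merged_df.isbn13 = merged_df.isbn13.astype('str')
merged_df.original_publication_year = merged_df.original_publication_year.astype(
    'str')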
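
Because `work_id` is unique in both tables, the merge is expected to be one-to-one and to keep every row. A defensive variant can state that expectation explicitly. This is a sketch with toy frames and made-up values, not the notebook's own code; `validate="one_to_one"` is a standard `DataFrame.merge` option that raises if either side has duplicate keys.

```python
import pandas as pd

# Toy stand-ins for book_df and book_additional_info_df; all values are made up.
toy_books = pd.DataFrame({
    "book_id": [1, 2],
    "work_id": [101, 102],
    "isbn13": [9.78e12, None],  # float-typed, as in book.csv
})
toy_info = pd.DataFrame({
    "work_id": [102, 101],
    "isbn13": ["9780000000002", "9780000000001"],
    "description": ["second", "first"],
})

merged = toy_books.drop(columns=["isbn13"]).merge(
    toy_info, on="work_id", how="inner", validate="one_to_one"
)

# An inner merge silently drops unmatched rows, so compare the row counts.
assert len(merged) == len(toy_books)
print(merged)
```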


In [39]:
merged_df.head(1)

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,authors,original_publication_year,original_title,title,...,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,isbn13,description
0,1,2767052,2767052,2792775,272,439023483,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",...,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,9780439023481,<b>Winning will make you famous. <br />Losing ...


In [41]:
book_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 22 columns):
book_id                      10000 non-null int64
goodreads_book_id            10000 non-null int64
best_book_id                 10000 non-null int64
work_id                      10000 non-null int64
books_count                  10000 non-null int64
isbn                         9300 non-null object
authors                      10000 non-null object
original_publication_year    9979 non-null float64
original_title               9415 non-null object
title                        10000 non-null object
language_code                8916 non-null object
average_rating               10000 non-null float64
ratings_count                10000 non-null int64
work_ratings_count           10000 non-null int64
work_text_reviews_count      10000 non-null int64
ratings_1                    10000 non-null int64
ratings_2                    10000 non-null int64
ratings_3                    10000 no

In [42]:
book_additional_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
work_id        10000 non-null int64
isbn13         9415 non-null object
description    9685 non-null object
dtypes: int64(1), object(2)
memory usage: 234.5+ KB
