# Data exploration and preparation
In this notebook, you will explore the data and create a subset of it. The dataset was relatively clean upon download; the lecturers only had to fix some pesky delimiter issues. If you want to encounter these issues yourself, you can use the original [Book-Crossing Dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

### 1. Loading the data
Load the three datasets and explore the data.


In [38]:
import pandas as pd

df_books_ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=';', encoding='latin-1')
df_books = pd.read_csv('data/BX-Books.csv', low_memory=False, sep=';', encoding='latin-1')
df_users = pd.read_csv('data/BX-Users.csv', low_memory=False, sep=';', encoding='latin-1')

### 2. Cleaning the data
Check whether every rating is connected to a book. Are there ratings for which no book or user exists? Check whether all author names are spelled correctly and consistently, and so on.
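
The first two checks can be sketched as an anti-join with `isin`. This is only a minimal sketch on made-up toy frames so that it runs on its own. It assumes the books table has an `ISBN` column and the users table a `User-ID` column, matching the names used in the ratings table; the helper name `find_orphans` is hypothetical.

```python
import pandas as pd

def find_orphans(ratings, books, users):
    """Return the ratings without a matching book and those without a matching user."""
    no_book = ~ratings['ISBN'].isin(books['ISBN'])
    no_user = ~ratings['User-ID'].isin(users['User-ID'])
    return ratings[no_book], ratings[no_user]

# Made-up toy data, not taken from the real dataset
toy_ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'Book-Rating': [5, 0, 8, 7],
})
toy_books = pd.DataFrame({'ISBN': ['A', 'B']})    # assumed column name
toy_users = pd.DataFrame({'User-ID': [1, 2]})     # assumed column name

ratings_without_book, ratings_without_user = find_orphans(toy_ratings, toy_books, toy_users)
print(len(ratings_without_book), len(ratings_without_user))
```

On the real data you would pass `df_books_ratings`, `df_books` and `df_users` instead of the toy frames.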

### 3. Subsetting the data
The publication accompanying this dataset, [Improving Recommendation Lists Through Topic Diversification](http://www2.informatik.uni-freiburg.de/~cziegler/BX/WWW-2005-Preprint.pdf) by Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen, describes the process of subsetting the dataset (the condensation steps) as follows (p. 5):

> Hence, we discarded all books missing taxonomic descriptions, along with all ratings referring to them. Next, we also removed book titles with fewer than 20 overall mentions. Only community members with at least five ratings each were kept. 

Explore what these parameters mean for the overall dataset. Also, consider whether you want the implicit ratings (`Book-Rating == 0`) in the final dataset. What would the implications be? Would you exclude them before applying the other filters, or afterwards?

The publication reports the dimensions of the resulting dataset. Your results might differ, but that is OK for now.
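
To see how the thresholds and the position of the implicit-rating filter interact, it can help to wrap the steps in a function. The sketch below is one possible way to do that, not the method of the publication: `condense` and its parameter names are hypothetical, the defaults echo the numbers in the quote above, the taxonomy step from the quote is not reproduced, and the toy data is made up.

```python
import pandas as pd

def condense(ratings, min_book_ratings=20, min_user_ratings=5, drop_implicit_first=True):
    """Hypothetical helper: keep well-rated books, then active users."""
    df = ratings
    if drop_implicit_first:
        df = df[df['Book-Rating'] != 0]
    # Keep books with enough ratings
    book_counts = df['ISBN'].value_counts()
    df = df[df['ISBN'].isin(book_counts[book_counts >= min_book_ratings].index)]
    # Keep users with enough ratings among the remaining books
    user_counts = df['User-ID'].value_counts()
    df = df[df['User-ID'].isin(user_counts[user_counts >= min_user_ratings].index)]
    if not drop_implicit_first:
        df = df[df['Book-Rating'] != 0]
    return df

# Made-up toy data: three users, two books, two implicit ratings
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 2, 3, 3],
    'ISBN': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Book-Rating': [5, 0, 0, 7, 6, 8],
})
implicit_first = condense(toy, 2, 2, drop_implicit_first=True)
implicit_last = condense(toy, 2, 2, drop_implicit_first=False)
print(len(implicit_first), len(implicit_last))
```

On this toy data the two orders keep a different number of rows, because implicit ratings still count towards the book and user thresholds when they are removed last. Note also that this is a single pass: after the user filter, some books may again fall below the book threshold.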


In [39]:
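# Start from the full ratings table
df = df_books_ratings

# This combination worked pretty well:
# 1. drop the implicit ratings (Book-Rating == 0)
df = df[df['Book-Rating'] != 0]

# 2. keep books (ISBNs) with at least 20 explicit ratings
book_counts = df['ISBN'].value_counts()
ratings = df[df['ISBN'].isin(book_counts[book_counts >= 20].index)]

# 3. keep users with at least 10 remaining ratings (the publication uses 5)
user_counts = ratings['User-ID'].value_counts()
df_book_ratings_final = ratings[ratings['User-ID'].isin(user_counts[user_counts >= 10].index)]
df_book_ratings_final.shape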

### 4. Extra step
Take a closer look at `BX-Books.csv` and search for a book named _Robots and Empire_ by Isaac Asimov. What do you encounter? Is this something you would solve? 

Let us assume that this is problematic for our dataset. How would you solve it? If you choose to take this extra step, you might want to redo step 2.

In [None]:
df_books[df_books['Book-Title'] == 'Robots and Empire']
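
Assuming that what you find is the same book listed under several ISBNs (for example, different editions), one possible fix is to map every ISBN to a single canonical ISBN per title and author. The sketch below is only an illustration: the helper `canonical_isbn_map` is hypothetical, the `Book-Author` column name is an assumption, and the ISBNs in the toy data are made up.

```python
import pandas as pd

def canonical_isbn_map(books):
    """Map every ISBN to the first ISBN seen for the same title and author."""
    # Normalise whitespace and case so near-identical spellings fall into one group
    key = [books['Book-Title'].str.strip().str.lower(),
           books['Book-Author'].str.strip().str.lower()]
    first_isbn = books.groupby(key)['ISBN'].transform('first')
    # Rows with a missing title or author keep their own ISBN
    first_isbn = first_isbn.fillna(books['ISBN'])
    return dict(zip(books['ISBN'], first_isbn))

# Made-up toy data with invented ISBNs
toy_books = pd.DataFrame({
    'ISBN': ['111', '222', '333'],
    'Book-Title': ['Robots and Empire', 'Robots and Empire ', 'Foundation'],
    'Book-Author': ['Isaac Asimov', 'ISAAC ASIMOV', 'Isaac Asimov'],
})
toy_ratings = pd.DataFrame({
    'User-ID': [1, 2],
    'ISBN': ['111', '222'],
    'Book-Rating': [8, 9],
})

isbn_map = canonical_isbn_map(toy_books)
# ISBNs that are missing from the books table are left unchanged
toy_ratings['ISBN'] = toy_ratings['ISBN'].map(isbn_map).fillna(toy_ratings['ISBN'])
print(toy_ratings)
```

After such a remapping, the number of ratings per ISBN changes and a user may end up with two ratings for the same canonical ISBN, so the earlier filtering would have to be run again.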

### 5. Save the new dataset(s)
Save the dataset(s) as distinctly named CSV files for later use. Move the file(s) to the `data` directory.


In [None]:
df_book_ratings_final.to_csv('BX-Book-Ratings-Subset.csv', index=False, sep=';')
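
As an alternative to moving the file by hand, the CSV can be written straight into the target directory. A minimal sketch, assuming a hypothetical helper `save_subset`; it writes a made-up toy frame into a temporary directory so that it does not touch your real `data` folder.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_subset(df, directory, filename):
    """Write df as a semicolon-separated CSV inside directory and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    out_path = out_dir / filename
    df.to_csv(out_path, index=False, sep=';')
    return out_path

# Made-up toy data standing in for df_book_ratings_final
toy = pd.DataFrame({'User-ID': [1, 2], 'ISBN': ['A', 'B'], 'Book-Rating': [5, 8]})

with tempfile.TemporaryDirectory() as tmp:
    path = save_subset(toy, Path(tmp) / 'data', 'toy-subset.csv')
    reloaded = pd.read_csv(path, sep=';')  # read back to check the round trip

print(reloaded.shape)
```

In the notebook, the call would look like `save_subset(df_book_ratings_final, 'data', 'BX-Book-Ratings-Subset.csv')`. Since `to_csv` writes UTF-8 by default, reading the subset back later does not need `encoding='latin-1'`.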