# Transformation
<hr>

### Cleaning, filtering, and joining two datasets: (1) "source_kaggle.csv" and (2) "source_ucsd.csv"

The purpose of this notebook is to combine the two datasets into one, using the 10-digit International Standard Book Number (ISBN-10) as the primary key. We also shape the combined dataset so that it suits any later analysis.

The cleaned <strong>"books.csv"</strong> will contain the unique columns from both datasets, which hold the information we want for a book-reference database. It also serves as the foundation for the next step in the ETL process: (L)oad.

In [16]:
import pandas as pd

In [17]:
# File path
books_kaggle = "../cleaned_datasets/source_kaggle.csv"
books_ucsd = "../cleaned_datasets/source_ucsd.csv"

In [18]:
# Load df
kaggle_df = pd.read_csv(books_kaggle)
ucsd_df = pd.read_csv(books_ucsd)

In [19]:
# Check Kaggle dataset
kaggle_df.head(2)
# kaggle_df.shape

Unnamed: 0,ISBN,Name,Authors,Rating
0,,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling,4.57
1,439358078.0,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.5


In [20]:
# Check UCSD dataset
ucsd_df.head(2)
# ucsd_df.shape

Unnamed: 0.1,Unnamed: 0,isbn,isbn13,title
0,0,312853122,9780312853129,W.C. Fields: A Life on Film
1,1,743509986,9780743509985,Good Harbor


In [21]:
kaggle_df.dtypes

ISBN        object
Name        object
Authors     object
Rating     float64
dtype: object

In [22]:
ucsd_df.dtypes

Unnamed: 0     int64
isbn          object
isbn13        object
title         object
dtype: object
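
The Kaggle preview above shows an ISBN rendered as `439358078.0`. An ISBN-10 is ten characters long, so a nine-digit value with a `.0` suffix has probably been parsed as a number at some point and lost a leading zero. The following is a minimal sketch, assuming that is what happened, of reading the column as text and restoring the padding. The helper name `normalize_isbn10` and the inline sample rows are hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the Kaggle file: a float-like ISBN,
# a missing ISBN, and an ISBN ending in the check character "X".
sample_csv = io.StringIO(
    "ISBN,Name\n"
    "439358078.0,Placeholder title A\n"
    ",Placeholder title B\n"
    "012345678X,Placeholder title C\n"
)

def normalize_isbn10(series: pd.Series) -> pd.Series:
    """Strip a trailing '.0' and left-pad with zeros to 10 characters."""
    cleaned = series.str.strip().str.replace(r"\.0$", "", regex=True)
    return cleaned.str.zfill(10)

# dtype=str keeps ISBNs as text so pandas does not parse them as numbers.
sample_df = pd.read_csv(sample_csv, dtype={"ISBN": str})
sample_df["ISBN"] = normalize_isbn10(sample_df["ISBN"])
print(sample_df["ISBN"].tolist())
```

Missing ISBNs stay missing after normalization, so a later `dropna` still removes them.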

## Data Clean - Kaggle Dataset

In [6]:
# Drop every row that has a NaN in any column
kaggle_df_cleaned = kaggle_df.dropna(axis='index', how='any')
print(kaggle_df_cleaned.isna().any())

ISBN       False
Name       False
Authors    False
Rating     False
dtype: bool


In [7]:
# Rename columns to match style in the UCSD dataset
kaggle_df_cleaned = kaggle_df_cleaned.rename(columns={
    'ISBN': 'isbn',
    'Name': 'book_name',
    'Authors': 'authors',
    'Rating': 'rating'
})

kaggle_df_cleaned.head(2)

Unnamed: 0,isbn,book_name,authors,rating
1,439358078,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling,4.5
3,439554896,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42


In [8]:
# Drop every row whose book name appears more than once (keep=False removes all copies)
kaggle_df_unique = kaggle_df_cleaned.drop_duplicates(subset='book_name', keep=False)
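
Note that `keep=False` discards every copy of a repeated name, not just the extras. The toy comparison below illustrates the difference from `keep="first"`. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: "Dune" appears twice, "Emma" once.
toy = pd.DataFrame({
    "isbn": ["0000000001", "0000000002", "0000000003"],
    "book_name": ["Dune", "Dune", "Emma"],
})

# keep=False removes every row whose book_name is repeated.
dropped_all = toy.drop_duplicates(subset="book_name", keep=False)

# keep="first" retains the first occurrence of each repeated name.
kept_first = toy.drop_duplicates(subset="book_name", keep="first")

print(len(dropped_all), len(kept_first))
```

Which behavior is appropriate depends on whether a repeated name signals an ambiguous record (drop all) or a redundant one (keep one).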

In [9]:
kaggle_df_unique.shape

(757407, 4)

## Data Clean - UCSD Dataset

In [10]:
# Delete the first column (an undesired index column resulting from a JSON-to-CSV conversion)
ucsd_df_cleaned = ucsd_df.drop(columns=['Unnamed: 0'])
ucsd_df_cleaned.head(1)

Unnamed: 0,isbn,isbn13,title
0,312853122,9780312853129,W.C. Fields: A Life on Film
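
An `Unnamed: 0` column typically appears when a DataFrame is written to CSV with its index and then read back without `index_col`. The sketch below uses an in-memory buffer in place of the real files and shows two ways to avoid the column at the source. The variable names are hypothetical.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "isbn": ["0312853122"],
    "title": ["W.C. Fields: A Life on Film"],
})

# Writing with the default index=True stores the row index as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index, dtype={"isbn": str})
print(list(reread.columns))

# Option 1: treat the first column as the index when reading.
with_index.seek(0)
fixed_on_read = pd.read_csv(with_index, index_col=0, dtype={"isbn": str})

# Option 2: do not write the index in the first place.
no_index = io.StringIO()
frame.to_csv(no_index, index=False)
no_index.seek(0)
fixed_on_write = pd.read_csv(no_index, dtype={"isbn": str})
```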


In [11]:
# Drop every row that has a NaN in any column
ucsd_df_cleaned.dropna(axis='index', how='any', inplace=True)
print(ucsd_df_cleaned.isna().any())

isbn      False
isbn13    False
title     False
dtype: bool


In [12]:
# Drop every row whose title appears more than once (keep=False removes all copies)
ucsd_df_unique = ucsd_df_cleaned.drop_duplicates(subset='title', keep=False)

In [13]:
ucsd_df_unique.shape

(235058, 3)

## Finalize datasets for Loading

In [14]:
# Create final dataset copies
kaggle_final = kaggle_df_unique.copy()
ucsd_final = ucsd_df_unique.copy()

In [15]:
# Export cleaned csv files to the "final_datasets" folder
kaggle_final.to_csv('../final_datasets/books_info.csv', index=False)
ucsd_final.to_csv('../final_datasets/isbn13.csv', index=False)
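
The stated goal is to combine the two sources on ISBN-10, while the cells above stop at exporting two separate files. The following is a minimal sketch of what that join could look like, assuming both frames carry an `isbn` column of the same string type. The sample rows, the inner join, and the `validate` check are assumptions for illustration, not the notebook's confirmed method.

```python
import pandas as pd

# Hypothetical stand-ins for kaggle_final and ucsd_final.
kaggle_sample = pd.DataFrame({
    "isbn": ["0000000001", "0000000002"],
    "book_name": ["Placeholder A", "Placeholder B"],
    "authors": ["Author A", "Author B"],
    "rating": [4.5, 4.42],
})
ucsd_sample = pd.DataFrame({
    "isbn": ["0000000001", "0000000003"],
    "isbn13": ["9780000000019", "9780000000033"],
    "title": ["Placeholder A", "Placeholder C"],
})

# Inner join keeps only ISBNs present in both sources.
# validate="one_to_one" raises MergeError if an ISBN repeats on either side.
books = pd.merge(
    kaggle_sample,
    ucsd_sample,
    on="isbn",
    how="inner",
    validate="one_to_one",
)
print(books.shape)
```

The key columns must share a dtype for the join to match. If one side holds ISBNs as numbers and the other as strings, recent pandas versions raise an error rather than matching across types.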