# 👽👾 `print(fiction)` 📚🛸

> #### A data science project by _Tobias Reaper_

#### 📓 Notebook 1: Data Preparation 🔬

---

### Notebook Outline

* Intro
* Imports and configuration
* Convert data to tabular format

---

## Intro

[quick intro to project]

[explanation of this notebook in context of project]

---

### 📥 Initial Imports and Configuration ⚙️

In [1]:
# The Utiliteers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Extras
import janitor  # pyjanitor: registers DataFrame.clean_names(), used in book_cat below
import missingno
import pandas_profiling
import os

In [2]:
# Set pandas display options to allow for more columns and rows
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200

---
---

## Convert data to tabular format

The scraper wrote its output as a set of JSON Lines (`.jl`) files, with one record per line. To use that data in the project, I need to load the files into a single pandas DataFrame.
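
As a quick illustration of the conversion described above, here is a minimal sketch that reads two JSON records, one per line, into a DataFrame. The records are hypothetical stand-ins, not rows from the scraped data, and an in-memory buffer takes the place of a real file.

```python
import io

import pandas as pd

# Two hypothetical records: one JSON object per line
raw = io.StringIO(
    '{"url": "https://example.com/a", "title": "Book A", "genres": ["Fiction"]}\n'
    '{"url": "https://example.com/b", "title": "Book B", "genres": ["Essays", "Classics"]}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
sample = pd.read_json(raw, lines=True)

print(sample.shape)
print(sample.loc[1, "genres"])
```

List-valued fields such as `genres` stay as Python lists inside the DataFrame cells, which is why they need their own handling later.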

#### Setting up filepaths

In [3]:
# === Path to json data === #
datapath = "/Users/Tobias/workshop/vela/ds/interview_prep/practice/print-fiction/assets/json_data"

In [4]:
# === Create the book filepaths === #
bookfiles = [  # List of book json files to be included in the books dataframe
    "book_must_read_01_20.jl",
    "book_must_read_21_200.jl",
    "book_must_read_201_216.jl",
]

# Create list of filepaths from book file names
bookpaths = [os.path.join(datapath, filename) for filename in bookfiles]

bookpaths

['/Users/Tobias/workshop/vela/ds/interview_prep/practice/print-fiction/assets/json_data/book_must_read_01_20.jl',
 '/Users/Tobias/workshop/vela/ds/interview_prep/practice/print-fiction/assets/json_data/book_must_read_21_200.jl',
 '/Users/Tobias/workshop/vela/ds/interview_prep/practice/print-fiction/assets/json_data/book_must_read_201_216.jl']
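
The absolute path above only resolves on one machine. As a hedged alternative, the same list could be built with `pathlib` from a project-relative directory. The relative location `assets/json_data` is an assumption about where the notebook is run from, not something the project guarantees.

```python
from pathlib import Path

# Hypothetical project-relative location of the scraper output (an assumption;
# point this at wherever the .jl files actually live)
datapath = Path("assets") / "json_data"

bookfiles = [
    "book_must_read_01_20.jl",
    "book_must_read_21_200.jl",
    "book_must_read_201_216.jl",
]

# The / operator joins path segments using the separator of the current OS
bookpaths = [datapath / filename for filename in bookfiles]

# Listing any files that are missing makes path problems visible before loading
missing = [path for path in bookpaths if not path.exists()]
print(f"{len(missing)} of {len(bookpaths)} files not found")
```

`pd.read_json` accepts `Path` objects as well as strings, so the rest of the notebook would not need to change.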

#### Functions to combine files into single DataFrame and do some preprocessing

In [5]:
def json_cat(json_files):
    """
    Reads and concatenates a list of .jl (json lines) 
    files into a single dataframe.
    """

    # Read the books json files into a list of dataframes
    dfs = [pd.read_json(filepath, lines=True) for filepath in json_files]

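    # Concatenate the list of dataframes, renumbering rows so that
    # index labels are not repeated across the source files
    df = pd.concat(dfs, sort=False, ignore_index=True)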
    
    return df
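
One detail of `pd.concat` worth keeping in mind: unless `ignore_index=True` is passed, each input frame keeps its own index labels, so the combined frame can contain repeated labels. A small sketch with two hypothetical frames shows the difference.

```python
import pandas as pd

# Two hypothetical frames standing in for two .jl files
first = pd.DataFrame({"title": ["A", "B"]})
second = pd.DataFrame({"title": ["C"]})

# Default behaviour: original labels are kept, so 0 appears twice
kept = pd.concat([first, second], sort=False)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([first, second], sort=False, ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```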

In [6]:
def encode_book_genres(df):
    """Deconcatenates top 30 book genres into separate features, OneHotEncoding style."""
    
    # Set comprehension - creates a set of all genres listed in dataset
    all_genres = {genre for row_genres in df["genres"] for genre in row_genres}

    # Create a new boolean feature for every genre
    for genre in all_genres:
        df[genre] = df["genres"].apply(lambda row_genres: genre in row_genres)

    # Create list of top 30 most common genres
    most_common_genres = df[list(all_genres)].sum().sort_values(ascending=False).head(30)
    most_common_genres = most_common_genres.index.tolist()
    
    # Drop all but the top 30 genres from the dataframe
    unwanted_genres = list(all_genres - set(most_common_genres))
    df = df.drop(columns=unwanted_genres)
    
    # Drop the original "genres" feature
    df = df.drop(columns=["genres"])
    
    return df
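
To see the encoding idea on a small scale, here is a sketch that applies it to a hypothetical three-book frame. It swaps in a different technique from the function above: pandas' `str.join` and `str.get_dummies` instead of the per-genre loop. It also assumes that no genre name contains the `|` delimiter.

```python
import pandas as pd

# Hypothetical toy data; each row holds a list of genres
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "genres": [
            ["Fiction", "Classics"],
            ["Fiction"],
            ["Essays", "Classics", "Fiction"],
        ],
    }
)

top_n = 2  # the function above keeps 30

# Join each list into one delimited string, then split it into indicator columns
dummies = toy["genres"].str.join("|").str.get_dummies(sep="|").astype(bool)

# Rank genres by how many books carry them and keep the most common ones
top_genres = dummies.sum().sort_values(ascending=False).head(top_n).index.tolist()

# Replace the list column with one boolean column per retained genre
encoded = toy.drop(columns=["genres"]).join(dummies[top_genres])

print(top_genres)
print(encoded)
```

As in the function above, rows with a null `genres` value would need to be dropped or filled first.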

In [8]:
def book_pub_date(df):
    """Deconcatenates book publish_date to three separate features
    for year, month, and day. Drops the original publish_date feature.
    """
    # === The Pandas method === #
    # Convert the "publish_date" column to datetime; unparseable values become NaT
    # (infer_datetime_format is deprecated as of pandas 2.0, so it is omitted)
    df["publish_date"] = pd.to_datetime(df["publish_date"], errors="coerce")

    # Break out "publish_date" into dt components
    df["publish_year"] = df["publish_date"].dt.year
    df["publish_month"] = df["publish_date"].dt.month
    df["publish_day"] = df["publish_date"].dt.day
    
    df = df.drop(columns=["publish_date"])  # Drop the OG publish_date
    
    return df
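
The effect of `errors="coerce"` is easiest to see on a few hand-made values. This sketch uses a hypothetical three-row frame with one valid date, one unparseable string, and one missing value. The cast to the nullable `Int64` dtype is an optional extra that the function above does not do.

```python
import pandas as pd

# Hypothetical values: a valid date, an unparseable string, and a missing value
toy = pd.DataFrame({"publish_date": ["2013-10-22", "not a date", None]})

# errors="coerce" turns anything unparseable into NaT instead of raising
toy["publish_date"] = pd.to_datetime(toy["publish_date"], errors="coerce")

# With NaT present the components come back as floats (NaN for missing);
# casting to the nullable Int64 dtype keeps them as integers with <NA>
toy["publish_year"] = toy["publish_date"].dt.year.astype("Int64")
toy["publish_month"] = toy["publish_date"].dt.month.astype("Int64")
toy["publish_day"] = toy["publish_date"].dt.day.astype("Int64")

print(toy)
```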

In [11]:
def book_cat(paths_list, output_filename):
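    """Reads and concatenates a list of book_*.jl (json lines) files,
    drops duplicate urls, cleans the column names, expands the rating
    histogram into per-rating count columns, and saves the result as csv.
    """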

    # === Concatenate the list of dataframes === #
    df = json_cat(paths_list)

    # === Initial wrangling === #
    # I will address these three steps later on
    # df = df.dropna(subset=["genres"])  # Drop rows with null "genres"
    # df = encode_book_genres(df)  # Break out genres into top 30
    # df = book_pub_date(df)  # Break out publish_date into components

    df = df.drop_duplicates(subset=["url"])  # Drop duplicate records
    # Format column names with pyjanitor
    df = df.clean_names()

    # Break rating_histogram (array) into component features
    df_hist = df["rating_histogram"].apply(pd.Series)
    # Map each expanded column to a "<name>_rating_count" label and rename
    rating_cols = {col: f"{col}_rating_count" for col in df_hist.columns}
    df_hist = df_hist.rename(columns=rating_cols)
    # Concat new columns onto main dataframe
    df = pd.concat([df, df_hist], axis=1, join="outer")
    # Drop extra column
    df = df.drop(columns=["rating_histogram"])
    
    df.to_csv(output_filename, index=False)
    print(f"Successfully created dataframe and saved to current directory as '{output_filename}'")
    
    return df

---

#### Create books DataFrame

In [12]:
# === Create the books dataframe === #
books = book_cat(bookpaths, "must_read_books-01.csv")

Successfully created dataframe and saved to current directory as 'must_read_books-01.csv'


In [15]:
# === First looks at books dataframe === #
print(books.shape)
books.head()

(21514, 21)


| | url | title | author | num_ratings | num_reviews | avg_rating | num_pages | language | publish_date | original_publish_year | genres | characters | series | places | asin | 0_rating_count | 1_rating_count | 2_rating_count | 3_rating_count | 4_rating_count | 5_rating_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://www.goodreads.com/book/show/323355.The... | The Book of Mormon: Another Testament of Jesus... | Anonymous | 71355.0 | 5704.0 | 4.37 | 531.0 | English | 2013-10-22 00:00:00 | 1830.0 | [Lds, Church, Christianity, Religion, Nonfiction] | | | | | | 7520.0 | 2697.0 | 2521.0 | 1963.0 | 56654.0 |
| 1 | https://www.goodreads.com/book/show/28862.The_... | The Prince | Niccolò Machiavelli | 229715.0 | 7261.0 | 3.81 | 140.0 | English | 2003-06-01 00:00:00 | 1513.0 | [European Literature, Italian Literature, Hist... | [Theseus (mythology), Alexander the Great, Ces... | | | | | 5254.0 | 16827.0 | 61182.0 | 80221.0 | 66231.0 |
| 2 | https://www.goodreads.com/book/show/46654.The_... | The Foundation Trilogy | Isaac Asimov | 83933.0 | 1331.0 | 4.4 | 679.0 | English | 1974-01-01 00:00:00 | 1953.0 | [Science Fiction, Classics, Fiction] | [Hari Seldon, Salvor Hardin, Hober Mallow, Mul... | Foundation (Publication Order) #1-3 | | | | 477.0 | 1521.0 | 9016.0 | 25447.0 | 47472.0 |
| 3 | https://www.goodreads.com/book/show/3980.From_... | From the Mixed-Up Files of Mrs. Basil E. Frank... | E.L. Konigsburg | 173617.0 | 6438.0 | 4.15 | 178.0 | English | 2003-06-02 00:00:00 | 1967.0 | [Childrens, Mystery, Middle Grade, Fiction, Yo... | [Mrs. Basil E. Frankweiler, Claudia Kincaid, J... | | [New York City, New York, Connecticut] | | | 2742.0 | 6381.0 | 29358.0 | 58559.0 | 76577.0 |
| 4 | https://www.goodreads.com/book/show/18521.A_Ro... | A Room of One's Own | Virginia Woolf | 98164.0 | 5848.0 | 4.14 | 112.0 | English | 2000-01-01 00:00:00 | 1929.0 | [Essays, Feminism, Classics, Nonfiction, Writing] | | | | | | 1357.0 | 3778.0 | 15993.0 | 35876.0 | 41160.0 |


In [16]:
books.isnull().sum()

url                          0
title                        1
author                       1
num_ratings                  1
num_reviews                  1
avg_rating                   1
num_pages                 1175
language                  2148
publish_date               436
original_publish_year     8675
genres                    2941
characters               15691
series                   14466
places                   16276
asin                     17561
0_rating_count           21514
1_rating_count            1295
2_rating_count            1295
3_rating_count            1295
4_rating_count            1295
5_rating_count            1295
dtype: int64
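
In the counts above, `0_rating_count` is null in all 21514 rows, which is the full length of the frame, so it carries no information. The sketch below shows one way such columns could be detected and dropped. It uses a hypothetical toy frame rather than the real data, and whether to drop the column here or in the next notebook is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one column that is null in every row
toy = pd.DataFrame(
    {
        "title": ["A", "B", None],
        "0_rating_count": [np.nan, np.nan, np.nan],
        "1_rating_count": [10.0, np.nan, 3.0],
    }
)

# Columns whose null count equals the number of rows are entirely empty
null_counts = toy.isnull().sum()
all_null = null_counts[null_counts == len(toy)].index.tolist()

# how="all" drops only columns in which every value is missing
trimmed = toy.dropna(axis="columns", how="all")

print(all_null)
print(trimmed.columns.tolist())
```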

---

## To Be Continued

The next notebook in this series, prefixed with '02', picks up where this one ends. That is where the bulk of the data wrangling and exploration takes place.

See you there!

## 📓👀