# EDA of Books 

## Steps 

1. [Importing the libraries](#1-importing-the-libraries)
2. [Importing the dataset](#2-importing-the-dataset)
3. [Data Cleaning](#3-data-cleaning)
4. [Exploratory Data Analysis](#4-exploratory-data-analysis)
5. [Feature Engineering](#5-feature-engineering)
6. [Saving the dataset](#6-saving-the-dataset)




In [4]:
# considerations

# Is the publication date important right now?
# Does price have too many missing values to fill with the average?
# Is the number of pages important? -> possible future field to include in the scraper


### 1. Importing the libraries

In [5]:
import os 
import pymongo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import datetime as dt
from data_importer import DataImporter
warnings.filterwarnings('ignore')



### 2. Importing the dataset

In [6]:
data_manager = DataImporter()
data = data_manager.import_data()
df_raw = pd.DataFrame(data)

### 3. Data Cleaning

In [7]:
# helper functions

# Extract the 'k' value from each dict in a list; an empty list yields [].
get_key_values_from_list = lambda x: [i['k'] for i in x] if len(x) > 0 else []

# Same, but also tolerates None (used for 'primary_lists').
get_key_values_from_list_nans = lambda x: [i['k'] for i in x] if x is not None else []
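
The two helpers differ only in how they treat a missing value. As a minimal sketch (the name `extract_keys` and the sample values are hypothetical, not part of the notebook), a single function could cover empty lists, `None`, and NaN alike by checking the type first:

```python
from typing import Any, List


def extract_keys(items: Any) -> List[Any]:
    """Return the 'k' value of each dict in a list of {'k': ..., 'v': ...} dicts.

    Anything that is not a list (None, NaN, ...) is treated as "no values".
    """
    if not isinstance(items, list):
        return []
    # Skip entries that lack a 'k' key instead of raising KeyError.
    return [item["k"] for item in items if isinstance(item, dict) and "k" in item]


# Hypothetical sample mirroring the shape of the scraped 'genres' column.
sample = [
    {"k": "Fiction", "v": "https://example.invalid/fiction"},
    {"k": "Classics", "v": "https://example.invalid/classics"},
]
print(extract_keys(sample))
print(extract_keys(None))
```

Whether silently skipping malformed entries is desirable depends on how much the scraper output is trusted; raising instead would surface bad rows earlier.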




In [8]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6265 entries, 0 to 6264
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   book_id              6265 non-null   object
 1   title                6265 non-null   object
 2   author               6265 non-null   object
 3   price                3761 non-null   object
 4   genres               6265 non-null   object
 5   isbn                 4495 non-null   object
 6   language             5178 non-null   object
 7   series               2384 non-null   object
 8   publisher            5267 non-null   object
 9   year_published       5643 non-null   object
 10  description          6265 non-null   object
 11  current_readers      5478 non-null   object
 12  wanted_to_read       6137 non-null   object
 13  num_reviews          6265 non-null   object
 14  num_ratings          6265 non-null   object
 15  rating               6265 non-null   object
 16  awards

In [9]:
df_proc = df_raw.copy()
df_proc.head()

Unnamed: 0,book_id,title,author,price,genres,isbn,language,series,publisher,year_published,description,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists,all_lists_link,date_time_of_scrape
0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",,English,,Riverhead Books,"May 1, 2004",1970s Afghanistan: Twelve-year-old Amir is des...,42.9k,1m,90234,2935385,4.33,[{'k': 'Borders Original Voices Award for Fict...,[{'k': 'Books That Everyone Should Read At Lea...,/list/book/77203,2023-03-23T20:31:37.567776
1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.99,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",9781400096893.0,English,,Vintage Books USA,"November 22, 2005","A literary sensation and runaway bestseller, t...",12.3k,793k,34102,1922540,4.14,[],"[{'k': 'Best Books Ever', 'v': '/list/show/1'}...",/list/book/929,2023-03-23T20:31:42.411881
2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.99,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",9781594489501.0,English,,Riverhead Books,"June 1, 2007",Mariam is only fifteen when she is sent to Kab...,32.7k,760k,69431,1417260,4.42,[{'k': 'British Book Award for Best Read of th...,"[{'k': 'Best Books Ever', 'v': '/list/show/1'}...",/list/book/128029,2023-03-23T20:31:46.875495
3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.99,"[{'k': 'Historical Fiction', 'v': 'https://www...",,English,,Alfred A. Knopf,"March 14, 2006",Librarian's note: An alternate cover edition c...,86k,2m,134883,2345385,4.39,[{'k': 'National Jewish Book Award for Childre...,"[{'k': 'Best Books Ever', 'v': '/list/show/1'}...",/list/book/19063,2023-03-23T20:32:45.197223
4,4214.Life_of_Pi,Life of Pi,Yann Martel,,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",9780770430078.0,English,,Seal Books,"August 29, 2006",Life of Pi is a fantasy adventure novel by Yan...,24.9k,726k,51257,1544622,3.93,"[{'k': 'Booker Prize (2002)', 'v': 'https://ww...","[{'k': 'Best Books Ever', 'v': '/list/show/1'}...",/list/book/4214,2023-03-23T20:32:49.804773


In [10]:
# drop columns that are not needed for now
df_proc.drop(columns=['all_lists_link', 'date_time_of_scrape', 'isbn'], inplace=True)

# Convert 'series' to a flag: 1 if the book belongs to a series, 0 if the value is None.
# The descriptions should be enough to distinguish a book's other intrinsic attributes.
df_proc['series'] = df_proc['series'].apply(lambda x: 1 if x is not None else 0)
df_proc

Unnamed: 0,book_id,title,author,price,genres,language,series,publisher,year_published,description,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists
0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",English,0,Riverhead Books,"May 1, 2004",1970s Afghanistan: Twelve-year-old Amir is des...,42.9k,1m,90234,2935385,4.33,[{'k': 'Borders Original Voices Award for Fict...,[{'k': 'Books That Everyone Should Read At Lea...
1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.99,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",English,0,Vintage Books USA,"November 22, 2005","A literary sensation and runaway bestseller, t...",12.3k,793k,34102,1922540,4.14,[],"[{'k': 'Best Books Ever', 'v': '/list/show/1'}..."
2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.99,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",English,0,Riverhead Books,"June 1, 2007",Mariam is only fifteen when she is sent to Kab...,32.7k,760k,69431,1417260,4.42,[{'k': 'British Book Award for Best Read of th...,"[{'k': 'Best Books Ever', 'v': '/list/show/1'}..."
3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.99,"[{'k': 'Historical Fiction', 'v': 'https://www...",English,0,Alfred A. Knopf,"March 14, 2006",Librarian's note: An alternate cover edition c...,86k,2m,134883,2345385,4.39,[{'k': 'National Jewish Book Award for Childre...,"[{'k': 'Best Books Ever', 'v': '/list/show/1'}..."
4,4214.Life_of_Pi,Life of Pi,Yann Martel,,"[{'k': 'Fiction', 'v': 'https://www.goodreads....",English,0,Seal Books,"August 29, 2006",Life of Pi is a fantasy adventure novel by Yan...,24.9k,726k,51257,1544622,3.93,"[{'k': 'Booker Prize (2002)', 'v': 'https://ww...","[{'k': 'Best Books Ever', 'v': '/list/show/1'}..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6260,29901665-death-at-st-vedast,Death at St. Vedast,Mary Lawrence,3.99,"[{'k': 'Mystery', 'v': 'https://www.goodreads....",,1,Kensington Publishing Corporation,"January 1, 2017","During the tempestuous reign of Henry VIII, Lo...",42,1595,59,205,3.90,[],"[{'k': 'Best Medical Thrillers', 'v': '/list/s..."
6261,41959631-a-murderous-malady,A Murderous Malady,Christine Trent,12.99,"[{'k': 'Historical Fiction', 'v': 'https://www...",,1,Crooked Lane Books,"May 7, 2019",For fans of Charles Todd and Deanna Raybourn c...,16,324,86,232,3.79,[],"[{'k': 'Historical Mystery 2019', 'v': '/list/..."
6262,36445482-no-cure-for-the-dead,No Cure for the Dead,Christine Trent,12.99,"[{'k': 'Mystery', 'v': 'https://www.goodreads....",English,1,Crooked Lane Books,"May 8, 2018","When a young nurse dies on her watch, Florence...",53,621,86,380,3.65,[],"[{'k': 'Historical Fiction 2018', 'v': '/list/..."
6263,15793166-the-midwife-s-tale,The Midwife's Tale,Sam Thomas,5.99,"[{'k': 'Historical Fiction', 'v': 'https://www...",English,1,Minotaur Books,"January 8, 2013",In the tradition of Arianna Franklin and C. J....,63,7444,421,2855,3.66,[],"[{'k': 'Historical Fiction 2013', 'v': '/list/..."
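
The series flag can also be derived without a lambda. A minimal sketch on a toy frame (the frame and the `is_series` column name are made up for illustration), assuming missing values are `None` or NaN:

```python
import pandas as pd

# Toy data: the series name is a hypothetical placeholder.
toy = pd.DataFrame({"series": [None, "Hypothetical Series #1", None]})

# notna() is True wherever a value is present; casting to int gives 1/0.
toy["is_series"] = toy["series"].notna().astype(int)
print(toy)
```

Writing the flag to a new column keeps the original series name available in case it is wanted later.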


In [11]:
# let's look at the unique values for language and publisher
languages = df_proc['language'].unique()
publishers = df_proc['publisher'].unique()

# QA: do we drop the books that are not in English?
# Yes

# share of the most common language (English) among rows with a known language
print('Languages:', languages,'where most books are in english:', df_proc['language'].value_counts(normalize=True).iloc[0] )
print('Publishers:', publishers)

Languages: ['English' None 'German' 'Hindi' 'Spanish; Castilian' 'French'
 'Dutch; Flemish' 'English, Middle (1100-1500)' 'Norwegian' 'Danish'
 'Bokmål, Norwegian; Norwegian Bokmål' 'Swedish' 'Scots' 'Italian'
 'Persian' 'Chinese' 'Multiple languages' 'Undetermined' 'Indonesian'
 'Croatian'] where most books are in english: 0.9733487833140209
Publishers: ['Riverhead Books' 'Vintage Books USA' 'Alfred A. Knopf' ...
 'Hodder and Stoughton ' 'G.P. Putnam’s Sons' 'Red Puddle Print']
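
Note that `value_counts` drops missing values by default, so the 0.973 printed above is the English share among rows with a known language, not among all rows. A small sketch on made-up data showing the difference:

```python
import pandas as pd

# Toy column with one missing language (values are illustrative only).
langs = pd.Series(["English", "English", None, "German"])

# Default: missing values are excluded from the denominator.
share_known = langs.value_counts(normalize=True)

# dropna=False keeps missing values as their own category.
share_all = langs.value_counts(normalize=True, dropna=False)

print(share_known["English"])
print(share_all["English"])
```

This matters for the English-only filter: rows with no language are neither counted in the share nor kept by an equality test against `'English'`.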


In [12]:
# keep only English-language books (rows with a missing language are dropped too)

df_proc = df_proc[df_proc['language'] == 'English'].copy()


In [13]:
# an example of running the lambda function on one column
# df_raw['genres'] = df_raw['genres'].apply(get_key_values_from_list)
# df_raw['genres']


In [14]:
# count the rows where the list is empty (or None, for primary_lists)
keys = ['genres', 'awards', 'primary_lists']
for key in keys:
    if key == 'primary_lists':
        print(key, len(df_proc[df_proc[key].apply(lambda x: x is None)]))
    else:
        print(key, len(df_proc[df_proc[key].apply(lambda x: len(x) == 0)]))



genres 223
awards 3502
primary_lists 0
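
The per-column special case can be avoided by counting "empty or missing" in one pass. A minimal sketch on toy data (the helper name `count_empty_or_missing` and the values are assumptions for illustration):

```python
import pandas as pd


def count_empty_or_missing(column: pd.Series) -> int:
    """Count cells that are not a list at all, or are a list with no items."""
    is_blank = column.apply(lambda v: not isinstance(v, list) or len(v) == 0)
    return int(is_blank.sum())


# Toy frame mixing populated lists, empty lists and None.
toy = pd.DataFrame(
    {
        "genres": [[{"k": "Fiction"}], [], None],
        "awards": [[], [], [{"k": "Hypothetical Award"}]],
    }
)
counts = {name: count_empty_or_missing(toy[name]) for name in ["genres", "awards"]}
print(counts)
```

Unlike the loop above, this does not distinguish an empty list from `None`; if that distinction matters, the two conditions can be counted separately.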


In [15]:

def generator_format_list_of_dicts(df, keys):
    """
    Replace each list of {'k': ..., 'v': ...} dicts in the given columns with
    the list of its 'k' values. Despite the name, this is a regular function
    (not a generator); it modifies df in place and returns it.
    """
    for key in keys:
        # on failure, report the column and the error, then move on to the next column
        try:
            if key == 'primary_lists':
                # primary_lists may be None rather than an empty list
                df[key] = df[key].apply(get_key_values_from_list_nans)
            else:
                df[key] = df[key].apply(get_key_values_from_list)
        except Exception as e:
            print('Error in generator_format_list_of_dicts for column', key)
            print(e)
    return df


In [16]:
generator_format_list_of_dicts(df_proc, ['genres', 'awards', 'primary_lists'])

# insert_many(data, uniqueness_index=(book_id))
# replace_many(data)

# col.drop()
# -------
# col.insert()

Unnamed: 0,book_id,title,author,price,genres,language,series,publisher,year_published,description,current_readers,wanted_to_read,num_reviews,num_ratings,rating,awards,primary_lists
0,77203.The_Kite_Runner,The Kite Runner,Khaled Hosseini,,"[Fiction, Historical Fiction, Classics, Contem...",English,0,Riverhead Books,"May 1, 2004",1970s Afghanistan: Twelve-year-old Amir is des...,42.9k,1m,90234,2935385,4.33,[Borders Original Voices Award for Fiction (20...,[Books That Everyone Should Read At Least Once...
1,929.Memoirs_of_a_Geisha,Memoirs of a Geisha,Arthur Golden,12.99,"[Fiction, Historical Fiction, Romance, Histori...",English,0,Vintage Books USA,"November 22, 2005","A literary sensation and runaway bestseller, t...",12.3k,793k,34102,1922540,4.14,[],"[Best Books Ever, Best Historical Fiction, Boo..."
2,128029.A_Thousand_Splendid_Suns,A Thousand Splendid Suns,Khaled Hosseini,12.99,"[Fiction, Historical Fiction, Contemporary, Hi...",English,0,Riverhead Books,"June 1, 2007",Mariam is only fifteen when she is sent to Kab...,32.7k,760k,69431,1417260,4.42,[British Book Award for Best Read of the Year ...,"[Best Books Ever, Books That Everyone Should R..."
3,19063.The_Book_Thief,The Book Thief,Markus Zusak,10.99,"[Historical Fiction, Fiction, Young Adult, His...",English,0,Alfred A. Knopf,"March 14, 2006",Librarian's note: An alternate cover edition c...,86k,2m,134883,2345385,4.39,[National Jewish Book Award for Children’s and...,"[Best Books Ever, Books That Everyone Should R..."
4,4214.Life_of_Pi,Life of Pi,Yann Martel,,"[Fiction, Fantasy, Classics, Adventure, Contem...",English,0,Seal Books,"August 29, 2006",Life of Pi is a fantasy adventure novel by Yan...,24.9k,726k,51257,1544622,3.93,"[Booker Prize (2002), Bollinger Everyman Wodeh...","[Best Books Ever, Books That Everyone Should R..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6257,25489259-death-of-an-alchemist,Death of an Alchemist,Mary Lawrence,5.99,"[Mystery, Historical Fiction, Fiction, Histori...",English,1,Kensington Books,"January 26, 2016","In the mid sixteenth century, Henry VIII sits ...",37,2085,68,285,3.89,[],[Most Anticipated Historical Mysteries for 201...
6259,52185047-the-lost-boys-of-london,The Lost Boys of London,Mary Lawrence,,"[Mystery, Historical Fiction, Historical, Fict...",English,1,Red Puddle Print,"April 28, 2020",Set in the final years of King Henry VIII's re...,10,2371,51,99,4.39,[],"[Anticipated 2020 Literary Fiction, Crime, mys..."
6262,36445482-no-cure-for-the-dead,No Cure for the Dead,Christine Trent,12.99,"[Mystery, Historical Fiction, Historical Myste...",English,1,Crooked Lane Books,"May 8, 2018","When a young nurse dies on her watch, Florence...",53,621,86,380,3.65,[],"[Historical Fiction 2018, Historical Mystery 2..."
6263,15793166-the-midwife-s-tale,The Midwife's Tale,Sam Thomas,5.99,"[Historical Fiction, Mystery, Fiction, Histori...",English,1,Minotaur Books,"January 8, 2013",In the tradition of Arianna Franklin and C. J....,63,7444,421,2855,3.66,[],"[Historical Fiction 2013, most anticipated mys..."


In [17]:
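def string_to_float_generator(x):
    """Convert count strings such as '42.9k', '1m' or '621' to floats; return -1 if unparseable."""
    if x is None or x == 'None':
        return -1
    if isinstance(x, (int, float)):
        return float(x)
    text = x.replace(',', '').strip().lower()
    try:
        if text.endswith('m'):
            return float(text[:-1]) * 1000000
        if text.endswith('k'):
            return float(text[:-1]) * 1000
        # plain numbers such as '621' carry no suffix
        return float(text)
    except ValueError:
        return -1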
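
The scraped counts appear in three shapes in this data: suffixed with `k` (`42.9k`), suffixed with `m` (`1m`), and plain digits (`621`). A minimal sketch of a parser that covers all three; the name `parse_count` is hypothetical, and it uses NaN rather than `-1` as the "unknown" marker, which is a different design choice from the notebook's:

```python
import math
from typing import Union

# Multipliers for the abbreviated suffixes seen in the scraped counts.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_count(value: Union[str, int, float, None]) -> float:
    """Parse '42.9k', '1m' or '621' into a float; return NaN when unparseable."""
    if value is None:
        return math.nan
    if isinstance(value, (int, float)):
        return float(value)
    text = value.strip().lower().replace(",", "")
    if not text:
        return math.nan
    multiplier = _SUFFIXES.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting
    try:
        return float(text) * multiplier
    except ValueError:
        return math.nan


for raw in ["42.9k", "1m", "621", None, "n/a"]:
    print(raw, parse_count(raw))
```

NaN keeps missing counts out of means and sums by default in pandas, whereas a `-1` sentinel has to be filtered out explicitly before any aggregation.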

In [18]:
df_proc



In [19]:
keys = ['current_readers', 'wanted_to_read']
# convert abbreviated counts (e.g. '42.9k', '1m') to floats; None becomes -1
for key in keys:
    df_proc[key] = df_proc[key].apply(lambda x: string_to_float_generator(x) if x is not None else -1)
    

In [20]:
# first convert strings to numbers, then replace NaN with the column mean
df_proc['price'] = df_proc['price'].apply(lambda x: float(x) if x is not None else np.nan)
df_proc['price'] = df_proc['price'].fillna(df_proc['price'].mean())
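
In the raw data, `price` is non-null for 3,761 of 6,265 rows, so roughly 40% of prices are missing and mean imputation assigns the same value to a large share of books. One hedged alternative, sketched on toy values (the variable names are illustrative), is to record which prices were missing before filling and to fill with the median, which is less sensitive to outliers:

```python
import pandas as pd

# Toy prices as scraped: strings and None (values are illustrative only).
prices = pd.Series(["12.99", None, "10.99", None, "3.99"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(prices, errors="coerce")

# Remember which rows were imputed so later analysis can account for it.
price_missing = numeric.isna().astype(int)

# Fill with the median of the observed prices.
price_filled = numeric.fillna(numeric.median())

print(pd.DataFrame({"price": price_filled, "price_missing": price_missing}))
```

Whether mean, median, or no imputation is appropriate depends on what the price feature is used for downstream; the flag column at least keeps that decision reversible.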

In [21]:
# x = df_proc[df_proc['year_published'] == 'May 1, 199']
# # how do I identify special cases like this?
# x['year_published'].to_string()[-4:-3] == ' '
# This led to the realization that around 53 books have a year_published value
# that is not in the expected format; replace those with NaN.


def check_year_published_format(x):
    """Parse dates such as 'May 1, 2004'; return NaN for missing or malformed values."""
    if x is None:
        return np.nan
    try:
        return dt.datetime.strptime(x, '%B %d, %Y')
    except ValueError as e:
        # print the error so that malformed entries can be inspected
        print(e)
        return np.nan





In [22]:
# convert year_published to datetime objects (month, day and year)
df_proc['year_published'] = df_proc['year_published'].apply(check_year_published_format)

time data 'Gollancz' does not match format '%B %d, %Y'
time data 'Penguin Publishing Group' does not match format '%B %d, %Y'
time data 'DAW Books' does not match format '%B %d, %Y'
time data 'May 1, 199' does not match format '%B %d, %Y'
time data 'Emily Bestler Books' does not match format '%B %d, %Y'
time data 'Stephen Douglass' does not match format '%B %d, %Y'
time data 'Blackie' does not match format '%B %d, %Y'
time data "G. B. Putnam's Sons" does not match format '%B %d, %Y'
time data 'Scholastic' does not match format '%B %d, %Y'
time data 'Putnams' does not match format '%B %d, %Y'
time data 'Parasite Publications' does not match format '%B %d, %Y'
time data 'Puffin Books' does not match format '%B %d, %Y'
time data 'Funk and Wagnalls' does not match format '%B %d, %Y'
time data 'Harper-collins Publishers' does not match format '%B %d, %Y'
time data 'Marvel' does not match format '%B %d, %Y'
time data 'Not Avail' does not match format '%B %d, %Y'
time data "Viking Children's 
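
The messages above show publisher names and a truncated year (`'May 1, 199'`) sitting in the date column. As a sketch of an alternative to the row-by-row `strptime` call, `pd.to_datetime` with `errors="coerce"` parses the whole column at once and makes it easy to list the rejected values afterwards (the toy series reuses a few values from the output above):

```python
import pandas as pd

# A few values taken from the column, including two malformed ones.
raw_dates = pd.Series(["May 1, 2004", "Gollancz", "May 1, 199", None])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

# Rejected values: present in the raw column but not parseable as a date.
bad_values = raw_dates[parsed.isna() & raw_dates.notna()]

print(parsed)
print(bad_values.tolist())
```

Keeping `bad_values` around gives an exact count of malformed entries and shows which ones look like publisher names that landed in the wrong field during scraping.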

In [23]:
df_proc.info() # the English-only filter dropped 1,225 rows (6,265 -> 5,040)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5040 entries, 0 to 6264
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   book_id          5040 non-null   object        
 1   title            5040 non-null   object        
 2   author           5040 non-null   object        
 3   price            5040 non-null   float64       
 4   genres           5040 non-null   object        
 5   language         5040 non-null   object        
 6   series           5040 non-null   int64         
 7   publisher        4733 non-null   object        
 8   year_published   4987 non-null   datetime64[ns]
 9   description      5040 non-null   object        
 10  current_readers  5040 non-null   float64       
 11  wanted_to_read   5040 non-null   float64       
 12  num_reviews      5040 non-null   object        
 13  num_ratings      5040 non-null   object        
 14  rating           5040 non-null   object 

In [28]:
# Write to csv
line_terminator = os.linesep
df_proc.to_csv('processed_books.csv', index=False, lineterminator=line_terminator)
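
One caveat when saving to CSV: list-valued columns such as `genres`, `awards` and `primary_lists` are written as their string representation and come back as plain strings on reload. A minimal sketch of the round trip, using an in-memory buffer and a toy frame instead of `processed_books.csv`, with `ast.literal_eval` as one possible way to restore the lists:

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column (title is a placeholder).
toy = pd.DataFrame({"title": ["Example Book"], "genres": [["Fiction", "Classics"]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain reload: the list comes back as the string "['Fiction', 'Classics']".
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Reload with a converter that parses the string back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"genres": ast.literal_eval})

print(type(as_text["genres"].iloc[0]), type(restored["genres"].iloc[0]))
```

`ast.literal_eval` only evaluates Python literals, so it is safer than `eval` for this purpose; a format that preserves nested types would avoid the conversion altogether.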

### 4. Exploratory Data Analysis



### 5. Feature Engineering

genres 
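
This section is still a placeholder. One possible way to turn the cleaned `genres` lists into features, sketched here on a toy frame under the assumption that `book_id` is unique per row, is a 0/1 indicator column per genre:

```python
import pandas as pd

# Toy frame in the shape of df_proc after cleaning (ids and genres are illustrative).
toy = pd.DataFrame(
    {
        "book_id": ["a", "b", "c"],
        "genres": [["Fiction", "Classics"], ["Mystery"], []],
    }
)

# One row per (book, genre); empty lists explode to NaN and are dropped.
exploded = toy.explode("genres").dropna(subset=["genres"])

# Count genre occurrences per book, capped at 1 to get indicator columns.
genre_flags = pd.crosstab(exploded["book_id"], exploded["genres"]).clip(upper=1)

# Re-add books that had no genres at all as all-zero rows.
genre_flags = genre_flags.reindex(toy["book_id"], fill_value=0)

print(genre_flags)
```

With many distinct genres this produces a wide, sparse table, so keeping only the most frequent genres may be worth considering before modelling.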

### 6. Saving the dataset