# GoodReads Dataset EDA

#### Author: Quinci Birker

### Introduction

### Data Dictionary

| Attributes  | Definition | Completeness (%) |
| ------------- | ------------- | ------------- | 
| bookId  | Book Identifier as in goodreads.com  | 100 |
| title  | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (e.g., Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |


### Table of Contents

- [Data Cleaning](#data_cleaning)

- [Part 1.2 - Basic Analysis](#q1.2)
---------------------------------------


### Import Dataset and Libraries

In [30]:
# import libraries:
import pandas as pd
import numpy as np
import plotly.express as px

In [31]:
# read data from the CSV file:
raw_df = pd.read_csv('data/books_1.Best_Books_Ever.csv')

In [32]:
# check number of rows and columns:
raw_df.shape
print(f'There are {raw_df.shape[0]} rows and {raw_df.shape[1]} columns in the data')

There are 52478 rows and 25 columns in the data


In [33]:
# check the data types for every column:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards            52478 non-null  object

Most of these columns have the `object` dtype. Columns that might need a type conversion:
- `publishDate`: object to datetime
- `firstPublishDate`: object to datetime
- `price`: object to a numeric type
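
The conversions flagged above could be sketched roughly as follows. The toy frame is an assumption standing in for `raw_df`, and the `'unknown'` entry is a made-up bad value used only to show what `errors='coerce'` does.

```python
import pandas as pd

# Toy stand-in for raw_df (an assumption for illustration, not the real data).
toy = pd.DataFrame({
    'pages': ['279', '552', 'unknown', None],   # 'unknown' is a made-up bad entry
    'price': ['5.09', '7.38', None, '2.1'],
})

# errors='coerce' converts anything unparseable to NaN instead of raising.
# Int64 (capital I) is pandas' nullable integer dtype, so missing values are kept.
toy['pages'] = pd.to_numeric(toy['pages'], errors='coerce').astype('Int64')
toy['price'] = pd.to_numeric(toy['price'], errors='coerce')

print(toy.dtypes)
```

Coercing silently hides bad entries, so it is worth comparing `isna().sum()` before and after the conversion.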

In [34]:
# sanity check the first five rows:
raw_df.head()

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",...,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...,1983116,20452,
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",...,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",https://i.gr-assets.com/images/S/compressed.ph...,1459448,14874,2.1


In [35]:
# check last 5 rows:
raw_df.tail()

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.0,The Fateful Trilogy continues with Fractured. ...,English,2940012616562,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",[],...,,[],871,"['311', '310', '197', '42', '11']",94.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,"'Anasazi', sequel to 'The Thirteenth Chime' by...",English,9999999999999,"['Mystery', 'Young Adult']",[],...,August 3rd 2011,[],37,"['16', '14', '5', '2', '0']",95.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.7,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,English,9781461017097,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",[],...,March 15th 2011,"[""Readers' Favorite Book Award (2011)""]",6674,"['2109', '1868', '1660', '647', '390']",84.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,7.37
52476,11330278-wayward-son,Wayward Son,,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,English,9781450755634,"['Fiction', 'Mystery', 'Historical Fiction', '...",[],...,April 5th 2011,[],238,"['77', '78', '59', '19', '5']",90.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,2.86
52477,10991547-daughter-of-helaman,Daughter of Helaman,Stripling Warrior #1,Misty Moncur (Goodreads Author),4.02,Fighting in Helaman's army is Keturah's deepes...,English,9781599554976,"['Lds Fiction', 'Historical Fiction', 'Young A...",[],...,,[],246,"['106', '73', '42', '17', '8']",90.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,5.2


In [36]:
raw_df.sample(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
44657,297219.Cloud_Nine,Cloud Nine,,Luanne Rice (Goodreads Author),3.97,Sarah Talbot surely thought she'd never live t...,English,9780553589467,"['Romance', 'Fiction', 'Chick Lit', 'Contempor...",[],...,01/01/99,[],4274,"['1492', '1498', '1030', '188', '66']",94.0,[],https://i.gr-assets.com/images/S/compressed.ph...,69,2,5.27
21692,531197.Lyra_s_Oxford,Lyra's Oxford,His Dark Materials #3.5,"Philip Pullman, John Lawrence (Illustrator)",3.58,Lyra's Oxford begins with Lyra and Pantalaimon...,English,9780375828195,"['Fantasy', 'Young Adult', 'Fiction', 'Short S...","['Lyra Belacqua', 'Pantalaimon']",...,10/30/03,[],21758,"['3680', '7577', '8414', '1818', '269']",90.0,"['Oxford, England']",https://i.gr-assets.com/images/S/compressed.ph...,99,1,5.48
43783,6377037-hack,Hack,,Peter Wrenshall,3.46,"Hack, by Peter Wrenshall, is a page-turning yo...",English,9781419669972,"['Young Adult', 'Thriller', 'Science Fiction',...",[],...,,[],120,"['23', '41', '28', '24', '4']",77.0,[],https://i.gr-assets.com/images/S/compressed.ph...,72,1,
10013,4371507-the-age-of-wonder,The Age of Wonder: How the Romantic Generation...,,Richard Holmes,3.96,'The Age of Wonder' is Richard Holmes' first m...,English,9780007149520,"['History', 'Science', 'Nonfiction', 'Biograph...",[],...,10/01/08,"['Royal Society Science Book Prize (2009)', 'A...",8775,"['3047', '3226', '1837', '452', '213']",92.0,[],https://i.gr-assets.com/images/S/compressed.ph...,253,3,2.89
10205,3313418-extreme-measures,Extreme Measures,Mitch Rapp #11,Vince Flynn,4.3,"The latest pulse-pounding thriller by #1 ""New ...",English,9781416599395,"['Thriller', 'Fiction', 'Mystery', 'Espionage'...","['Mitch Rapp', 'Mike Nash, Irene Kennedy']",...,,[],48987,"['24683', '16808', '5731', '1117', '648']",96.0,[],https://i.gr-assets.com/images/S/compressed.ph...,247,3,5.04


Notes from looking at the beginning, end, and a random sample of the dataset:

1. Columns to be deleted that are not useful for this project:
    - `bookId` does not seem to be useful for my project. The index will be used instead.
    - `isbn` is a numeric book identifier. I will also delete this since I am using the index to refer to each unique book in the dataset.
    - `coverImg` is a URL to the book's cover image. For this project, I will not be using this in my modeling.
2. There are quite a few columns that are missing data. Further analysis will be performed.

3. Columns that stand out for further investigation:
    - `firstPublishDate` ~ missing values and different formats (e.g., 07/29/96, 1989, April 5th 2011)
    - `price` ~ the values don't seem to be accurate. The entire column might need to be deleted.

In [37]:
# Check that total index matches total number of rows:
raw_df.index.nunique() == raw_df.shape[0]

True

The number of unique index values equals the number of rows, so every row has its own index value.

In [38]:
# Count the number of missing values for each column:
raw_df.isna().sum()

bookId                  0
title                   0
series              29008
author                  0
rating                  0
description          1338
language             3806
isbn                    0
genres                  0
characters              0
bookFormat           1473
edition             47523
pages                2347
publisher            3696
publishDate           880
firstPublishDate    21326
awards                  0
numRatings              0
ratingsByStars          0
likedPercent          622
setting                 0
coverImg              605
bbeScore                0
bbeVotes                0
price               14365
dtype: int64

In [39]:
# Fraction of missing values for each column (multiply by 100 for a percentage):
print(raw_df.isna().sum(axis=0)/raw_df.shape[0])

bookId              0.000000
title               0.000000
series              0.552765
author              0.000000
rating              0.000000
description         0.025496
language            0.072526
isbn                0.000000
genres              0.000000
characters          0.000000
bookFormat          0.028069
edition             0.905579
pages               0.044724
publisher           0.070430
publishDate         0.016769
firstPublishDate    0.406380
awards              0.000000
numRatings          0.000000
ratingsByStars      0.000000
likedPercent        0.011853
setting             0.000000
coverImg            0.011529
bbeScore            0.000000
bbeVotes            0.000000
price               0.273734
dtype: float64


There are 12 columns that have missing values. The column name and rounded percent of missing values in descending order:
- edition: 91%
- series: 55%
- first publish date: 41%
- price: 27%
- language: 7%
- publisher: 7%
- pages: 4%
- description: 3%
- book format: 3%
- publish date: 2%
- liked percent: 1%
- cover image: 1%

The first step in the data cleaning process is to delete all the columns that I will not be using for modeling:
- `bookId`: using the dataset index instead
- `isbn`: using the dataset index instead
- `coverImg`: won't be used for this modeling
- `edition`: about 91% of the values are missing
- `firstPublishDate`: about 41% of the values are missing, and my assumption is that publish date will be more relevant than first publish date
- `price`: the prices do not seem accurate enough to use. In addition, the price of a book can vary greatly depending on the date of purchase and the retailer.
- `awards`: few rows have a value (about 20% complete according to the Data Dictionary)
- `characters`: few rows have a value (about 26% complete according to the Data Dictionary)
- `setting`: few rows have a value (about 22% complete according to the Data Dictionary)
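
`isna()` reports no missing values for `awards`, `characters`, and `setting`, yet the Data Dictionary lists them as only 20 to 26 percent complete. In the rows displayed earlier, empty entries in these columns appear as the text `[]` and not as NaN. A minimal sketch for measuring that, on a toy frame with assumed values in the same style:

```python
import pandas as pd

# Toy stand-in for raw_df; values mimic the list-like strings shown in the output above.
toy = pd.DataFrame({
    'awards': ["['Pulitzer Prize for Fiction (1961)']", '[]', '[]', '[]'],
    'setting': ["['Maycomb, Alabama (United States)']", '[]', "['Oxford, England']", '[]'],
})

# Share of rows (in percent) whose value is the empty-list string '[]'.
empty_pct = toy.eq('[]').mean().mul(100)
print(empty_pct)
```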

In [49]:
# Check for duplicated columns:
raw_df.T.duplicated().sum()

0

In [58]:
# Checking for duplicate rows: 
raw_df[raw_df.duplicated()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
37431,8794263-promises-to-keep,Promises to Keep,,Ann Tatlock,3.94,Eleven-year-old Roz (Rosalind) Anthony and her...,English,"['Fiction', 'Christian Fiction', 'Christian', ...",Paperback,348.0,Bethany House Publishers,February 1st 2011,1997,"['582', '833', '476', '84', '22']",95.0,87,1
37432,1909590.Click,Click,,"Eoin Colfer, Linda Sue Park, Ruth Ozeki (Goodr...",3.54,A video message from a dead person. A larcenou...,English,"['Young Adult', 'Fiction', 'Short Stories', 'M...",Unknown Binding,217.0,Not Avail,October 1st 2007,1910,"['340', '647', '664', '214', '45']",86.0,87,1
37433,23394408-die-unendlichkeit-schl-ft,Die Unendlichkeit schläft,Loki von Schallern Staffel 1 #3,Melanie Meier,4.5,"Überall, wo er hingeht, reißen Höllenfeuer all...",German,[],Kindle Edition,,neobooks Self-Publishing,November 2014,2,[],,87,1
37434,7544945-death-note,"Death Note: Black Edition, Vol. 2",Death Note: Black Edition #2,"Tsugumi Ohba, Takeshi Obata, Yuki Kowalsky (tr...",4.48,Intégrale regroupant les tomes 3 et 4Tome 3 :L...,German,"['Graphic Novels', 'Comics', 'Fantasy', 'Manga...",Paperback,385.0,Tokyopop,2009,5849,"['3339', '2045', '408', '44', '13']",99.0,87,1
37435,25886017-m-scaras,Máscaras,,Ariel Dorfman,3.49,"¿Qué se oculta detrás de esos rostros difusos,...",Spanish,"['Fiction', 'Literature']",Paperback,159.0,,May 1988,77,"['17', '22', '23', '12', '3']",81.0,87,1
37436,2669775-el-siglo-de-las-luces,El siglo de las luces,,Alejo Carpentier,4.13,El siglo de las luces novela el impacto de la ...,Spanish,"['Fiction', 'Spanish Literature', 'Historical ...",Hardcover,345.0,"Editorial Bruguera, S.A.",March 1980,2282,"['979', '802', '368', '95', '38']",94.0,87,1
37437,24633605-always-and-forever,Always and Forever,Serenity Point #2,Harper Bentley (Goodreads Author),3.93,Does wanting to slap the hell out of Brody Kel...,English,"['Contemporary Romance', 'Romance', 'Firefight...",Kindle Edition,,HB,June 30th 2015,482,"['153', '188', '104', '26', '11']",92.0,87,1
37438,237086.Dafne_desvanecida,Dafne desvanecida,,José Carlos Somoza,3.48,Dafne desvanecida presenta a un famoso escrito...,Spanish,"['Fiction', 'Mystery', 'Contemporary', 'Spanis...",Paperback,237.0,Agata,February 1st 2000,204,"['37', '56', '82', '25', '4']",86.0,87,1
37439,17786377,فضولية العلم,,"Cyril Aydon, أحمد مغربي (ترجمة)",3.87,"ما يلفت في كتاب سيرل أيدون ""فضولية العلم"" طريق...",Arabic,"['Science', 'Nonfiction']",Paperback,280.0,دار الساقي بالاشتراك مع دار البابطين للترجمة,January 1st 2007,91,"['25', '37', '21', '8', '0']",91.0,87,1
37440,7452583-jag-vill-inte-d-jag-vill-bara-inte-leva,"Jag vill inte dö, jag vill bara inte leva",,Ann Heberlein,3.5,"Ann Heberleins omdiskuterade självbiografi""Jag...",Swedish,"['Nonfiction', 'Psychology', 'Biography', 'Men...",Paperback,206.0,Månpocket,December 18th 2009,1032,"['153', '389', '338', '128', '24']",85.0,87,1


All of the flagged rows share the same `bbeScore` and `bbeVotes`, yet every title among them is unique, so these rows will not be dropped for now.
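
One caveat is worth checking before settling this. With its default `keep='first'`, `duplicated()` flags only the second and later copies of a row, so the flagged rows can all have different titles and still be exact copies of rows elsewhere in the dataset. A minimal sketch for viewing every copy together, on a toy frame with assumed values:

```python
import pandas as pd

# Toy stand-in for raw_df (assumed values): row 2 is an exact copy of row 0.
toy = pd.DataFrame({
    'title': ['Click', 'Hack', 'Click'],
    'bbeScore': [87, 72, 87],
})

# Default keep='first' flags only the later copy.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicated group, so copies can be compared directly.
all_copies = toy[toy.duplicated(keep=False)].sort_values('title')
print(all_copies)
```

If the goal is to find repeated books even when other fields differ, `duplicated(subset=['bookId'])` is one option to consider.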

<a id = 'data_cleaning'></a>
### Data Cleaning

The first step in data cleaning is dropping the columns that will not be used in the analysis/modeling.

In [63]:
# Drop the specified columns and check that the change was applied to raw_df.
# errors='ignore' keeps this cell safe to re-run once the columns are already gone
# (re-running without it raises a KeyError):
raw_df = raw_df.drop(columns=['isbn', 'coverImg', 'edition', 'firstPublishDate', 'price', 'awards', 'setting', 'characters'], errors='ignore')
raw_df.head()

For the publish date, the first 30,000 books in the dataset are formatted as mm/dd/yy, while the last 22,478 books are formatted as Month Day Year (for example, February 1st 2011). I will reformat all the dates to mm/dd/yyyy.

In [71]:
# Change publish date to datetime format:
raw_df['publishDate'] = pd.to_datetime(raw_df['publishDate'], errors='coerce') # errors='coerce' used if value cannot be converted to datetime format
# Reformat publish date to mm/dd/yyyy format:
raw_df['publishDate'] = raw_df['publishDate'].dt.strftime('%m/%d/%Y')
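
A single `pd.to_datetime` call without a `format` relies on pandas inferring the format, and that behaviour may differ between pandas versions. As a hedged alternative, each of the two formats can be parsed explicitly and the results combined. The toy series below uses date strings copied from rows shown in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for raw_df['publishDate'].
dates = pd.Series(['10/10/00', '04/28/96', 'February 1st 2011',
                   'August 3rd 2011', 'May 1988', None])

# Pass 1: numeric dates with a two-digit year.
numeric = pd.to_datetime(dates, format='%m/%d/%y', errors='coerce')

# Pass 2: "Month Day Year" dates, after removing the ordinal suffix (1st, 3rd, 15th).
no_suffix = dates.str.replace(r'(?<=\d)(st|nd|rd|th)\b', '', regex=True)
written = pd.to_datetime(no_suffix, format='%B %d %Y', errors='coerce')

# Prefer the numeric parse and fall back to the written one.
parsed = numeric.fillna(written)
formatted = parsed.dt.strftime('%m/%d/%Y')
print(formatted)
```

Two things to keep in mind with this sketch. Partial dates such as `May 1988` or `2009` match neither format and stay missing. Two-digit years are also ambiguous: under the usual `%y` convention, 00 to 68 are read as 2000 to 2068, so older publication years can land in the future and should be checked after parsing.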

Sanity check that this change was made to the dataset:

In [69]:
raw_df.sample(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
45664,919341.Above_San_Diego,Above San Diego,,"Robert W. Cameron (Photographs), Neil Morgan (...",4.0,Above San Diego. This is the anchor of the wes...,English,[],Hardcover,160,Cameron Books,08/01/1990,9,"['2', '5', '2', '0', '0']",100.0,65,1
7641,10002296-wildflower-hill,Wildflower Hill,,Kimberley Freeman,4.09,Forced to take her life in a new direction whe...,English,"['Historical Fiction', 'Fiction', 'Romance', '...",Paperback,524,Touchstone,08/23/2011,12163,"['4268', '5271', '2149', '362', '113']",96.0,348,4
6052,17204984-between-the-lives,Between the Lives,,Jessica Shirvington (Goodreads Author),4.21,Sabine isn't like anyone else. For as long as ...,English,"['Young Adult', 'Paranormal', 'Romance', 'Fant...",Paperback,336,HarperCollins Australia,05/01/2013,4906,"['2314', '1629', '696', '207', '60']",95.0,465,5
13263,15790912-new-moon,"New Moon: The Graphic Novel, Vol. 2",Twilight: The Graphic Novel #4,"Young Kim (Adaptation), Stephenie Meyer (Creator)",4.33,Bella and Edward find themselves facing new ob...,English,"['Graphic Novels', 'Vampires', 'Young Adult', ...",Hardcover,160,Yen Press,07/21/2020,1164,"['733', '221', '121', '43', '46']",92.0,186,2
22632,18460726-strain,Strain,Strain #1,Amelia C. Gormley (Goodreads Author),3.87,"In a world with little hope and no rules, the ...",English,"['M M Romance', 'BDSM', 'Science Fiction', 'Dy...",ebook,375,Riptide Publishing,02/17/2014,1777,"['571', '664', '353', '124', '65']",89.0,99,1


The next step is to go through the columns with missing values, starting with `series`, which is missing 55% of its values.

In [42]:
raw_df.loc[raw_df['series'].isna()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,"['Classics', 'Fiction', 'Romance', 'Historical...",Paperback,279,Modern Library,10/10/00,2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,1983116,20452
5,19063.The_Book_Thief,The Book Thief,,Markus Zusak (Goodreads Author),4.37,Librarian's note: An alternate cover edition c...,English,"['Historical Fiction', 'Fiction', 'Young Adult...",Hardcover,552,Alfred A. Knopf,03/14/06,1834276,"['1048230', '524674', '186297', '48864', '26211']",96.0,1372809,14168
6,170448.Animal_Farm,Animal Farm,,"George Orwell, Russell Baker (Preface), C.M. W...",3.95,Librarian's note: There is an Alternate Cover ...,English,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",Mass Market Paperback,141,Signet Classics,04/28/96,2740713,"['986764', '958699', '545475', '165093', '84682']",91.0,1276599,13264
9,18405.Gone_with_the_Wind,Gone with the Wind,,Margaret Mitchell,4.30,"Scarlett O'Hara, the beautiful, spoiled daught...",English,"['Classics', 'Historical Fiction', 'Fiction', ...",Mass Market Paperback,1037,Warner Books,04/01/99,1074620,"['602138', '275517', '133535', '39008', '24422']",94.0,1087732,11211
10,11870085-the-fault-in-our-stars,The Fault in Our Stars,,John Green (Goodreads Author),4.21,Despite the tumor-shrinking medical miracle th...,English,"['Young Adult', 'Romance', 'Fiction', 'Contemp...",Hardcover,313,Dutton Books,01/10/12,3550714,"['1784471', '1022406', '512574', '150365', '80...",93.0,1087056,11287
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52463,25876358-the-natural-way-of-things,The Natural Way of Things,,Charlotte Wood,3.53,Two women awaken from a drugged sleep to find ...,English,"['Fiction', 'Dystopia', 'Australia', 'Feminism...",Paperback,320,Allen & Unwin,October 1st 2015,10894,"['2044', '3961', '3098', '1269', '522']",84.0,1,1
52464,36374396-algedonic,Algedonic,,R.H. Sin (Goodreads Author),3.71,"Bestselling poet r.h. Sin, author of the Whisk...",,"['Poetry', 'Nonfiction', 'Romance', 'Feminism']",Paperback,128,Andrews McMeel Publishing,December 12th 2017,1489,"['501', '402', '339', '144', '103']",83.0,1,1
52469,270435.Heal_Your_Body,Heal Your Body: The Mental Causes for Physical...,,Louise L. Hay,4.36,Heal Your Body is a fresh and easy step-by-ste...,English,"['Self Help', 'Health', 'Nonfiction', 'Spiritu...",Paperback,96,Hay House,January 1st 1984,14868,"['8640', '3745', '1864', '418', '201']",96.0,1,1
52470,11115191-attracted-to-fire,Attracted to Fire,,DiAnn Mills (Goodreads Author),4.14,Special Agent Meghan Connors' dream of one day...,English,"['Christian Fiction', 'Christian', 'Suspense',...",Paperback,416,Tyndale House Publishers,October 1st 2011,2143,"['945', '716', '365', '78', '39']",95.0,0,1


TODO: come back to the missing `series` values above.

The `language` column has 3,806 missing values (7 percent). The rows below show where it is missing:

In [72]:
# Look at the rows with language column missing:
raw_df.loc[raw_df['language'].isna()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
503,34521870-mistress-suffragette,Mistress Suffragette,,Diana Forbes (Goodreads Author),4.34,\n A young woman without prospects at a ball ...,,"['Fiction', 'Novels', 'Historical Fiction', 'D...",Kindle Edition,333,Penmore Press LLC,03/05/2017,7647,"['3553', '3422', '466', '140', '66']",97.0,18215,203
570,36236125-invisible-monsters,Invisible Monsters,,Chuck Palahniuk (Goodreads Author),3.98,She's a catwalk model who has everything: a bo...,,"['Fiction', 'Contemporary', 'Thriller', 'Myste...",Paperback,304,W. W. Norton Company,05/01/2018,128254,"['47150', '45163', '25392', '7675', '2874']",92.0,15186,181
645,38311414-house-of-sand-and-fog,House of Sand and Fog,,Andre Dubus III,3.85,In this “page-turner with a beating heart” (Bo...,,"['Fiction', 'Contemporary', 'Literary Fiction'...",Paperback,368,W. W. Norton Company,10/02/2018,125230,"['38141', '46256', '28560', '8447', '3826']",90.0,12262,176
703,41423092-the-awakening,The Awakening: Fate in Motion,,Suzanne Boisvert (Goodreads Author),4.31,"Exiled from Earth thousands of years ago, Sar ...",,"['Contemporary', 'Drama', 'Book Club', 'Fictio...",Kindle Edition,331,,09/15/2018,6682,"['3045', '3031', '329', '206', '71']",96.0,10611,118
751,7770.One_Fish_Two_Fish_Red_Fish_Blue_Fish,"One Fish, Two Fish, Red Fish, Blue Fish",,Dr. Seuss (Reader),4.13,One Fish Two Fish Red Fish Blue Fish is a 1960...,,"['Childrens', 'Picture Books', 'Fiction', 'Cla...",Hardcover,64,Harper Collins Children's Books,10/06/2003,165623,"['81438', '41473', '30307', '8407', '3998']",93.0,9731,163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52411,21691391-pixie-dust,Pixie Dust,Pixie Dust Chronicles #1,Laura Lee (Goodreads Author),3.79,*A lonesome fairy with no clue how to wield he...,,"['Fantasy', 'Paranormal', 'Paranormal Romance'...",ebook,,Laura Lee,12/31/2011,1373,"['460', '424', '299', '115', '75']",86.0,2,1
52416,23014216-when-i-fall,When I Fall,Alabama Summer #3,J. Daniels (Goodreads Author),4.30,"From New York Times bestselling author, J. Dan...",,"['Romance', 'New Adult', 'Contemporary Romance...",,343,,03/17/2015,12022,"['5939', '4225', '1488', '256', '114']",97.0,2,1
52429,20665259-did-god-really-command-genocide,Did God Really Command Genocide?: Coming to Te...,,"Paul Copan, Matt Flannagan",4.18,A common objection to belief in the God of the...,,"['Theology', 'Christian', 'Religion', 'Christi...",Paperback,352,Baker Books,11/18/2014,114,"['47', '47', '13', '7', '0']",94.0,2,1
52454,13254398-au-petit-poil,Au Petit Poil,Cool and Lam #26,A.A. Fair,3.89,,,"['Mystery', 'Fiction', 'Detective']",,154,Librairie des Champs-Elysées,01/01/1997,96,"['24', '42', '25', '5', '0']",95.0,1,1
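
To see how the missing languages compare with the known ones, `value_counts(dropna=False)` counts NaN as its own category. A minimal sketch on a toy series (assumed values standing in for `raw_df['language']`):

```python
import pandas as pd

# Toy stand-in for raw_df['language'] (assumed values).
language = pd.Series(['English', None, 'German', 'English', None], name='language')

# dropna=False keeps missing values as a separate row in the counts.
counts = language.value_counts(dropna=False)
print(counts)
```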


The `pages` column is missing around 4% of its values. The rows below show where it is missing:

In [73]:
# Look at the rows with pages column missing:
raw_df.loc[raw_df['pages'].isna()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
669,30346601-brainwalker,Brainwalker,,"Robyn Mundell (Goodreads Author), Stephan Laca...",4.35,One teen’s incredible journey may just blow hi...,English,"['Young Adult', 'Adventure', 'Fantasy', 'Middl...",ebook,,Dualmind Publishing,10/01/2016,15250,"['7226', '6795', '702', '371', '156']",97.0,11433,128
1062,44575575-pleasant-day,Pleasant Day,,Vera Jane Cook (Goodreads Author),4.33,"In the town of Hollow Creek, South Carolina tw...",English,"['Contemporary', 'Drama', 'Book Club', 'Fictio...",Kindle Edition,,Chattercreek,03/11/2019,5684,"['2620', '2583', '265', '143', '73']",96.0,5744,66
1428,318525.Red_Storm_Rising,Red Storm Rising,,Tom Clancy,4.17,"""Allah!""With that shrill cry, three Muslim ter...",English,"['Fiction', 'Thriller', 'Military Fiction', 'W...",Audio Cassette,,Random House Audio,09/13/1988,69639,"['30349', '24341', '12083', '2291', '575']",96.0,3500,47
1483,26030383-beg-for-mercy,Beg For Mercy,Mercy #3,Lucian Bane (Goodreads Author),4.29,"The fight is on in this installment, Mercy is ...",English,[],Kindle Edition,,,08/18/2015,714,"['439', '130', '87', '27', '31']",92.0,3246,33
1543,19181419-a-bird-without-wings,A Bird Without Wings,,Roberta Pearce (Goodreads Author),4.17,"After an impoverished and indigent childhood, ...",English,"['Romance', 'Contemporary Romance', 'Contempor...",ebook,,Smashwords,12/02/2013,168,"['86', '45', '24', '6', '7']",92.0,3046,32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52295,53946399-launching-your-real-estate-career,Launching Your Real Estate Career: The Definit...,,Eric Kistner,5.00,,,[],Kindle Edition,,,06/07/2020,2,[],,5,1
52352,971322.The_Wanderers,The Wanderers,,Richard Price,3.90,"The Wanderers, a teen-age gang in the Bronx of...",English,"['Fiction', 'Crime', 'New York', 'Novels', 'Li...",Mass Market Paperback,,Avon Books,11/01/1979,1640,"['452', '688', '406', '74', '20']",94.0,3,1
52359,18130046-ender-in-flight,Ender in Flight,Ender's Saga short stories,Orson Scott Card,4.35,"""Ender in Flight"" is a story by Orson Scott Ca...",English,"['Science Fiction', 'Fiction', 'Young Adult', ...",,,,,133,"['66', '51', '14', '1', '1']",98.0,3,1
52407,2823038-forever-young-forever-free-forever-you...,"Forever Young, Forever Free Forever Young, For...",,Hettie Jones,3.67,,,[],Paperback,,,,3,[],,2,1
