<a id = 'title'></a>
# GoodReads Dataset EDA

#### Author: Quinci Birker

### Introduction

### Data Dictionary

| Attributes  | Definition | Completeness (%) |
| ------------- | ------------- | ------------- | 
| bookId  | Book Identifier as in goodreads.com  | 100 |
| title  | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Publisher | 93 |
| publishDate | Publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |


### Table of Contents
---------------------------------------
- [Import Dataset and Libraries](#import)
---------------------------------------
- [Preliminary Data Exploration](#pre_explore)
---------------------------------------
- [Data Cleaning](#data_clean)
---------------------------------------
- [Descriptive Statistics](#desc_statistics)
---------------------------------------
- [Data Visualization](#data_visualization)
---------------------------------------
- [Correlation Analysis](#correlation_analysis)
---------------------------------------
- [Summary & Insights](#summary)
---------------------------------------


<a id = 'import'></a>
### Import Dataset and Libraries

In [39]:
# import libraries:
import pandas as pd
import numpy as np
import plotly.express as px

In [40]:
# read data from the CSV file:
raw_df = pd.read_csv('data/books_1.Best_Books_Ever.csv')

<a id = 'pre_explore'></a>
### Preliminary Data Exploration

In [41]:
# check number of rows and columns:
raw_df.shape
print(f'There are {raw_df.shape[0]} rows and {raw_df.shape[1]} columns in the data')

There are 52478 rows and 25 columns in the data


In [42]:
# check the data types for every column:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards            52478 non-null  object

Most of these columns have the `object` dtype. Columns that might need to be converted from `object` to a more suitable type:
- publish date (datetime)
- first publish date (datetime)
- price (numeric)
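
A minimal sketch of how such a conversion could be checked, using a toy frame rather than the real data. The values below are made up for illustration. `pd.to_numeric` with `errors='coerce'` is one option for `price`, and also for `pages`, which `info()` reports as `object` as well.

```python
import pandas as pd

# Toy frame mimicking object-typed columns (illustrative values, not from the dataset).
toy = pd.DataFrame({
    'pages': ['374', '870', None, '1 page'],
    'price': ['5.09', '7.38', None, '1.743.28'],
})

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
toy['pages_num'] = pd.to_numeric(toy['pages'], errors='coerce')
toy['price_num'] = pd.to_numeric(toy['price'], errors='coerce')

# Values that were present but failed to convert are worth inspecting by hand.
failed_pages = toy.loc[toy['pages'].notna() & toy['pages_num'].isna(), 'pages']
print(toy.dtypes)
print(failed_pages.tolist())
```

Comparing the non-null counts before and after the conversion shows how many values were silently coerced to NaN.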

In [43]:
# sanity check the first five rows:
raw_df.head()

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,06/21/03,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,07/11/60,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,9999999999999,"['Classics', 'Fiction', 'Romance', 'Historical...","['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', '...",...,01/28/13,[],2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,"['United Kingdom', 'Derbyshire, England (Unite...",https://i.gr-assets.com/images/S/compressed.ph...,1983116,20452,
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,9780316015844,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...","['Edward Cullen', 'Jacob Black', 'Laurent', 'R...",...,10/05/05,"['Georgia Peach Book Award (2007)', 'Buxtehude...",4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,"['Forks, Washington (United States)', 'Phoenix...",https://i.gr-assets.com/images/S/compressed.ph...,1459448,14874,2.1


In [44]:
# check last 5 rows:
raw_df.tail()

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.0,The Fateful Trilogy continues with Fractured. ...,English,2940012616562,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",[],...,,[],871,"['311', '310', '197', '42', '11']",94.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,"'Anasazi', sequel to 'The Thirteenth Chime' by...",English,9999999999999,"['Mystery', 'Young Adult']",[],...,August 3rd 2011,[],37,"['16', '14', '5', '2', '0']",95.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,
52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.7,--READERS FAVORITE AWARDS WINNER 2011--Sixteen...,English,9781461017097,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",[],...,March 15th 2011,"[""Readers' Favorite Book Award (2011)""]",6674,"['2109', '1868', '1660', '647', '390']",84.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,7.37
52476,11330278-wayward-son,Wayward Son,,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,A POWERFUL TREMOR UNEARTHS AN ANCIENT SECRETBu...,English,9781450755634,"['Fiction', 'Mystery', 'Historical Fiction', '...",[],...,April 5th 2011,[],238,"['77', '78', '59', '19', '5']",90.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,2.86
52477,10991547-daughter-of-helaman,Daughter of Helaman,Stripling Warrior #1,Misty Moncur (Goodreads Author),4.02,Fighting in Helaman's army is Keturah's deepes...,English,9781599554976,"['Lds Fiction', 'Historical Fiction', 'Young A...",[],...,,[],246,"['106', '73', '42', '17', '8']",90.0,[],https://i.gr-assets.com/images/S/compressed.ph...,0,1,5.2


In [45]:
raw_df.sample(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
27564,149704.The_Valley_of_the_Moon,The Valley of the Moon,,Jack London,3.99,"A road novel 50 years before Kerouac, The Vall...",English,9781406946253,"['Classics', 'Fiction', 'Literature', 'Adventu...","['Billy Roberts', 'Saxon Roberts']",...,10/30/13,[],1395,"['503', '504', '285', '83', '20']",93.0,['California (United States)'],https://i.gr-assets.com/images/S/compressed.ph...,96,1,36.65
52444,96337.The_Conservationist,The Conservationist,,Nadine Gordimer,3.36,Mehring is rich. He has all the privileges and...,English,9780140047165,"['Fiction', 'Africa', 'South Africa', 'Nobel P...",[],...,1974,"['Booker Prize (1974)', 'The Best of the Booke...",2509,"['389', '772', '825', '394', '129']",79.0,['South Africa'],https://i.gr-assets.com/images/S/compressed.ph...,1,1,1.12
22092,842309.Idle_Thoughts_of_an_Idle_Fellow,Idle Thoughts of an Idle Fellow,Idle Thoughts #1,Jerome K. Jerome,3.87,"Jerome Klapka Jerome (1859 - 1927), English au...",English,9781595690241,"['Humor', 'Nonfiction', 'Classics', 'Essays', ...",[],...,10/30/86,[],2393,"['729', '852', '627', '140', '45']",92.0,[],https://i.gr-assets.com/images/S/compressed.ph...,99,1,13.0
33121,644478.Letters_from_a_Father_to_his_Daughter,Letters from a Father to his Daughter,,Jawaharlal Nehru,3.93,A priceless collection of letters from one leg...,English,9780670058167,"['Nonfiction', 'History', 'India', 'Indian Lit...",[],...,11/06/29,[],2052,"['673', '758', '476', '94', '51']",93.0,[],https://i.gr-assets.com/images/S/compressed.ph...,92,1,5.64
23961,13030276-so-shall-we-pass,So Shall We Pass,,Michael Barrera (Goodreads Author),4.92,So Shall We Pass is a first-person present ten...,English,9781462042661,[],[],...,08/01/11,[],26,"['24', '2', '0', '0', '0']",100.0,[],https://i.gr-assets.com/images/S/compressed.ph...,98,1,10.0


Notes from looking at the beginning, end, and a random sample of the dataset:

1. Columns to be deleted because they are not useful for this project:
    - `bookId` does not seem to be useful for my project. The index will be used instead.  
    - `isbn` is a numeric book identifier. I will also delete this, since I am using the index to refer to each unique book in the dataset.
    - `coverImg` is a URL to the book's cover image. I will not be using this in my modeling.  
2. There are quite a few columns that are missing data. Further analysis will be performed.

3. Columns that stand out for further investigation:
    - `firstPublishDate` ~ missing values and different formats (i.e. 07/29/96, 1989, April 5th 2011)
    - `price` ~ the values don't seem to be accurate. The entire column might need to be deleted
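
To see how common each date layout is, the strings could be bucketed by pattern before any parsing is attempted. This is only a sketch: `classify_date_format` is a hypothetical helper, and its patterns cover just the three layouts noted above.

```python
import re
import pandas as pd

def classify_date_format(value):
    """Label a raw date string by its apparent layout (hypothetical helper)."""
    if pd.isna(value):
        return 'missing'
    text = str(value).strip()
    if re.fullmatch(r'\d{2}/\d{2}/\d{2}', text):
        return 'mm/dd/yy'
    if re.fullmatch(r'\d{4}', text):
        return 'year only'
    if re.fullmatch(r'[A-Za-z]+ \d{1,2}(st|nd|rd|th) \d{4}', text):
        return 'Month Day Year'
    return 'other'

# Illustrative values in the formats observed in the notes above.
sample = pd.Series(['07/29/96', '1989', 'April 5th 2011', None, 'Published 2001'])
format_counts = sample.map(classify_date_format).value_counts()
print(format_counts)
```

On the real data, the same call would be `raw_df['firstPublishDate'].map(classify_date_format).value_counts()`. A large `other` bucket would signal layouts that have not been accounted for yet.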

In [46]:
# Check that total index matches total number of rows:
raw_df.index.nunique() == raw_df.shape[0]

True

Every index value is unique and the count matches the number of rows, so the index can be used to identify each book.

In [47]:
# Count the number of missing values for each column:
raw_df.isna().sum()

bookId                  0
title                   0
series              29008
author                  0
rating                  0
description          1338
language             3806
isbn                    0
genres                  0
characters              0
bookFormat           1473
edition             47523
pages                2347
publisher            3696
publishDate           880
firstPublishDate    21326
awards                  0
numRatings              0
ratingsByStars          0
likedPercent          622
setting                 0
coverImg              605
bbeScore                0
bbeVotes                0
price               14365
dtype: int64

In [48]:
# Fraction of missing values for each column (multiply by 100 for a percentage):
print(raw_df.isna().sum(axis=0)/raw_df.shape[0])

bookId              0.000000
title               0.000000
series              0.552765
author              0.000000
rating              0.000000
description         0.025496
language            0.072526
isbn                0.000000
genres              0.000000
characters          0.000000
bookFormat          0.028069
edition             0.905579
pages               0.044724
publisher           0.070430
publishDate         0.016769
firstPublishDate    0.406380
awards              0.000000
numRatings          0.000000
ratingsByStars      0.000000
likedPercent        0.011853
setting             0.000000
coverImg            0.011529
bbeScore            0.000000
bbeVotes            0.000000
price               0.273734
dtype: float64


There are 12 columns that have missing values. The column name and rounded percent of missing values in descending order:
- edition: 91%
- series: 55%
- first publish date: 41%
- price: 27%
- language: 7%
- publisher: 7%
- pages: 4%
- description: 3%
- book format: 3%
- publish date: 2%
- liked percent: 1%
- cover image: 1%

The first step in the data cleaning process is to delete the columns that I will not be using for modeling:
- bookid: using the dataset index instead
- isbn: using the dataset index instead
- cover image: won't be used for this modeling
- edition: about 91% of the values are missing
- first publish date: about 41% of the values are missing, and my assumption is that publish date will be more relevant than first publish date
- price: the prices do not seem accurate enough to use. In addition, the price of a book can vary greatly depending on the date of purchase and the retailer.
- awards: few rows have a value (about 20% per the data dictionary)
- characters: few rows have a value (about 26% per the data dictionary)
- setting: few rows have a value (about 22% per the data dictionary)

# Come back to this!!

In [49]:
# Check for empty lists in the 'genres' column
empty_lists_mask = raw_df['genres'].apply(lambda x: len(x) == 0)

# Identify rows with empty lists
rows_with_empty_lists = raw_df[empty_lists_mask]

# Display the result
print(rows_with_empty_lists)

Empty DataFrame
Columns: [bookId, title, series, author, rating, description, language, isbn, genres, characters, bookFormat, edition, pages, publisher, publishDate, firstPublishDate, awards, numRatings, ratingsByStars, likedPercent, setting, coverImg, bbeScore, bbeVotes, price]
Index: []

[0 rows x 25 columns]
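
One caveat about this check: `pd.read_csv` loads list-like columns such as `genres` as plain strings, so an empty list most likely arrives as the two-character string `'[]'`, which `len(x) == 0` does not match. Rows with `[]` in `genres` do appear in the samples shown earlier. A sketch of one way to test for empty lists, assuming every value is a valid Python list literal:

```python
import ast
import pandas as pd

# Toy column holding lists stored as strings, as they would be after read_csv.
toy = pd.DataFrame({'genres': ["['Fiction', 'Mystery']", '[]', "['Poetry']"]})

# Parse each string representation into a real Python list.
parsed_genres = toy['genres'].apply(ast.literal_eval)

# Now len() counts list items rather than characters.
empty_mask = parsed_genres.apply(len) == 0
print(toy[empty_mask])
```

A cheaper alternative, under the same assumption, is to compare the raw strings directly with `toy['genres'] == '[]'`.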


In [50]:
# Check for duplicated columns:
raw_df.T.duplicated().sum()

0

In [51]:
# Checking for duplicate rows: 
raw_df[raw_df.duplicated()]

Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,firstPublishDate,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price
37431,8794263-promises-to-keep,Promises to Keep,,Ann Tatlock,3.94,Eleven-year-old Roz (Rosalind) Anthony and her...,English,9780764208096,"['Fiction', 'Christian Fiction', 'Christian', ...",[],...,01/01/11,[],1997,"['582', '833', '476', '84', '22']",95.0,['Illinois (United States)'],https://i.gr-assets.com/images/S/compressed.ph...,87,1,4.23
37432,1909590.Click,Click,,"Eoin Colfer, Linda Sue Park, Ruth Ozeki (Goodr...",3.54,A video message from a dead person. A larcenou...,English,9781407105918,"['Young Adult', 'Fiction', 'Short Stories', 'M...",[],...,11/06/07,['Deutscher Jugendliteraturpreis Nominee for J...,1910,"['340', '647', '664', '214', '45']",86.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,2.6
37433,23394408-die-unendlichkeit-schl-ft,Die Unendlichkeit schläft,Loki von Schallern Staffel 1 #3,Melanie Meier,4.5,"Überall, wo er hingeht, reißen Höllenfeuer all...",German,B00O84Q7UG,[],[],...,,[],2,[],,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,
37434,7544945-death-note,"Death Note: Black Edition, Vol. 2",Death Note: Black Edition #2,"Tsugumi Ohba, Takeshi Obata, Yuki Kowalsky (tr...",4.48,Intégrale regroupant les tomes 3 et 4Tome 3 :L...,German,9783867196727,"['Graphic Novels', 'Comics', 'Fantasy', 'Manga...","['Light Yagami', 'Ryuk']",...,11/06/03,[],5849,"['3339', '2045', '408', '44', '13']",99.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,12.0
37435,25886017-m-scaras,Máscaras,,Ariel Dorfman,3.49,"¿Qué se oculta detrás de esos rostros difusos,...",Spanish,9500704919,"['Fiction', 'Literature']",[],...,01/01/88,[],77,"['17', '22', '23', '12', '3']",81.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,6.16
37436,2669775-el-siglo-de-las-luces,El siglo de las luces,,Alejo Carpentier,4.13,El siglo de las luces novela el impacto de la ...,Spanish,9788402067074,"['Fiction', 'Spanish Literature', 'Historical ...",[],...,11/06/62,[],2282,"['979', '802', '368', '95', '38']",94.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,0.9
37437,24633605-always-and-forever,Always and Forever,Serenity Point #2,Harper Bentley (Goodreads Author),3.93,Does wanting to slap the hell out of Brody Kel...,English,9999999999999,"['Contemporary Romance', 'Romance', 'Firefight...",[],...,,[],482,"['153', '188', '104', '26', '11']",92.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,
37438,237086.Dafne_desvanecida,Dafne desvanecida,,José Carlos Somoza,3.48,Dafne desvanecida presenta a un famoso escrito...,Spanish,9788423331970,"['Fiction', 'Mystery', 'Contemporary', 'Spanis...",[],...,,['Premio Nadal Nominee (2000)'],204,"['37', '56', '82', '25', '4']",86.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,4.9
37439,17786377,فضولية العلم,,"Cyril Aydon, أحمد مغربي (ترجمة)",3.87,"ما يلفت في كتاب سيرل أيدون ""فضولية العلم"" طريق...",Arabic,9781855166752,"['Science', 'Nonfiction']",[],...,10/01/05,[],91,"['25', '37', '21', '8', '0']",91.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,
37440,7452583-jag-vill-inte-d-jag-vill-bara-inte-leva,"Jag vill inte dö, jag vill bara inte leva",,Ann Heberlein,3.5,"Ann Heberleins omdiskuterade självbiografi""Jag...",Swedish,9789172321717,"['Nonfiction', 'Psychology', 'Biography', 'Men...",[],...,12/03/08,[],1032,"['153', '389', '338', '128', '24']",85.0,[],https://i.gr-assets.com/images/S/compressed.ph...,87,1,


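The rows flagged above all share the same `bbeScore` and `bbeVotes`, and no title appears twice within this output. Keep in mind that `duplicated()` only flags a row when every column matches an earlier row, so each of these rows has an identical earlier copy in the dataset. For now, these rows will not be dropped.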
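
To inspect flagged rows next to the rows they match, `duplicated(keep=False)` can be used. The sketch below runs on a toy frame with made-up rows. The `subset` variant is an optional extra check on a single key, not something done in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Click', 'Twilight', 'Click'],
    'bbeScore': [87, 1459448, 87],
    'bbeVotes': [1, 14874, 1],
})

# Default behaviour: only the later copy of a fully identical row is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicate group, so copies can be compared.
all_copies = toy[toy.duplicated(keep=False)].sort_values('title')
print(all_copies)

# Duplicates on a chosen key only (here the title), ignoring the other columns.
same_title = toy.duplicated(subset=['title'], keep=False).sum()
```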

<a id = 'data_clean'></a>
### Data Cleaning

Step 1: Dropping the columns that will not be used in the analysis/modeling.

In [52]:
# Drop the specified columns and check that the changes have been made to raw_df:

raw_df = raw_df.drop(['isbn', 'coverImg', 'edition', 'firstPublishDate', 'price', 'awards', 'setting', 'characters'], axis=1)
raw_df.head()

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",Hardcover,374,Scholastic Press,09/14/08,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,2993816,30516
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",Paperback,870,Scholastic Inc.,09/28/04,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,2632233,26923
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,"['Classics', 'Fiction', 'Historical Fiction', ...",Paperback,324,Harper Perennial Modern Classics,05/23/06,4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,2269402,23328
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,"['Classics', 'Fiction', 'Romance', 'Historical...",Paperback,279,Modern Library,10/10/00,2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,1983116,20452
4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",Paperback,501,"Little, Brown and Company",09/06/06,4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,1459448,14874


Step 2: Reformatting the publish date column.

For the publish date, the first 30,000 books in the dataset are formatted as mm/dd/yy (e.g. 09/14/08), while the last 22,478 books are formatted as Month Day Year. I will reformat all the dates to mm/dd/yyyy.

In [53]:
# Change publish date to datetime format:
raw_df['publishDate'] = pd.to_datetime(raw_df['publishDate'], errors='coerce') # errors='coerce' used if value cannot be converted to datetime format
# Reformat publish date to mm/dd/yyyy format:
raw_df['publishDate'] = raw_df['publishDate'].dt.strftime('%m/%d/%Y')
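
Because two layouts are mixed in one column, an alternative is to parse each layout with an explicit format and then combine the results. This is a sketch under assumptions, not the method used above: `parse_mixed_dates` is a hypothetical helper, and the textual dates are assumed to look like `August 3rd 2011`, as seen in `firstPublishDate`.

```python
import pandas as pd

def parse_mixed_dates(s):
    """Parse mm/dd/yy first, then fall back to 'Month Day Year' (hypothetical helper)."""
    # First pass: numeric dates such as 09/14/08.
    numeric = pd.to_datetime(s, format='%m/%d/%y', errors='coerce')
    # Second pass: drop ordinal suffixes (3rd -> 3), then parse the textual layout.
    cleaned = s.str.replace(r'(\d+)(st|nd|rd|th)', r'\1', regex=True)
    textual = pd.to_datetime(cleaned, format='%B %d %Y', errors='coerce')
    # Use the textual result wherever the numeric pass produced NaT.
    return numeric.fillna(textual)

# Illustrative values only.
raw_dates = pd.Series(['09/14/08', 'August 3rd 2011', 'not a date', None])
parsed = parse_mixed_dates(raw_dates)

# Values that were present but could not be parsed with either format.
unparsed = raw_dates[raw_dates.notna() & parsed.isna()]
print(parsed)
print(unparsed.tolist())
```

Counting `unparsed` shows how many dates `errors='coerce'` would turn into missing values. Two-digit years are resolved by the parser's century pivot, so older publication years deserve a spot check.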



Sanity check that this change was made to the dataset:

In [54]:
raw_df.sample(5)

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
7681,31818.The_Philosophy_of_Andy_Warhol,The Philosophy of Andy Warhol (From A to B and...,,Andy Warhol,3.77,"A loosely formed autobiography by Andy Warhol,...",English,"['Art', 'Nonfiction', 'Philosophy', 'Biography...",Paperback,272,Mariner Books,04/06/1977,37274,"['13336', '10134', '8141', '3104', '2559']",85.0,345,4
160,7190.The_Three_Musketeers,The Three Musketeers,The d'Artagnan Romances #1,Alexandre Dumas,4.07,Alexandre Dumas’s most famous tale— and possib...,English,"['Classics', 'Fiction', 'Historical Fiction', ...",Paperback,625,Modern Library,02/13/2001,273714,"['103957', '101696', '54638', '10239', '3184']",95.0,95549,1198
38096,18148148-writing-about-magic,Writing about Magic,Writer's Craft #3,Rayne Hall,3.87,Do you write fantasy fiction? This book is a r...,,"['Writing', 'Nonfiction', 'Reference']",Kindle Edition,133,,,150,"['51', '51', '33', '8', '7']",90.0,86,1
36633,21253714-the-alchemy-press-book-of-ancient-won...,The Alchemy Press Book of Ancient Wonders,Tales of the Apt #Story - Bones,"Jan Edwards (Goodreads Author) (Editor), Jenny...",3.86,"When we think of a wonder, our minds go most o...",English,[],Kindle Edition,252,The Alchemy Press,02/03/2013,21,"['8', '6', '4', '2', '1']",86.0,88,1
8519,26198109-louder-than-a-whisper,Louder Than a Whisper: Clearer Than a Bell,,Renée Paule (Goodreads Author) (Illustrator),4.71,This book challenges the status quo of humanit...,English,"['Nonfiction', 'Self Help', 'Inspirational', '...",Paperback,174,RPG Publishing,09/21/2015,35,"['28', '4', '3', '0', '0']",100.0,294,3


# COME BACK TO THIS:

Step 3: Looking at series, which is missing 55% of its values.

In [55]:
raw_df.loc[raw_df['series'].isna()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,"['Classics', 'Fiction', 'Romance', 'Historical...",Paperback,279,Modern Library,10/10/2000,2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,1983116,20452
5,19063.The_Book_Thief,The Book Thief,,Markus Zusak (Goodreads Author),4.37,Librarian's note: An alternate cover edition c...,English,"['Historical Fiction', 'Fiction', 'Young Adult...",Hardcover,552,Alfred A. Knopf,03/14/2006,1834276,"['1048230', '524674', '186297', '48864', '26211']",96.0,1372809,14168
6,170448.Animal_Farm,Animal Farm,,"George Orwell, Russell Baker (Preface), C.M. W...",3.95,Librarian's note: There is an Alternate Cover ...,English,"['Classics', 'Fiction', 'Dystopia', 'Fantasy',...",Mass Market Paperback,141,Signet Classics,04/28/1996,2740713,"['986764', '958699', '545475', '165093', '84682']",91.0,1276599,13264
9,18405.Gone_with_the_Wind,Gone with the Wind,,Margaret Mitchell,4.30,"Scarlett O'Hara, the beautiful, spoiled daught...",English,"['Classics', 'Historical Fiction', 'Fiction', ...",Mass Market Paperback,1037,Warner Books,04/01/1999,1074620,"['602138', '275517', '133535', '39008', '24422']",94.0,1087732,11211
10,11870085-the-fault-in-our-stars,The Fault in Our Stars,,John Green (Goodreads Author),4.21,Despite the tumor-shrinking medical miracle th...,English,"['Young Adult', 'Romance', 'Fiction', 'Contemp...",Hardcover,313,Dutton Books,01/10/2012,3550714,"['1784471', '1022406', '512574', '150365', '80...",93.0,1087056,11287
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52463,25876358-the-natural-way-of-things,The Natural Way of Things,,Charlotte Wood,3.53,Two women awaken from a drugged sleep to find ...,English,"['Fiction', 'Dystopia', 'Australia', 'Feminism...",Paperback,320,Allen & Unwin,10/01/2015,10894,"['2044', '3961', '3098', '1269', '522']",84.0,1,1
52464,36374396-algedonic,Algedonic,,R.H. Sin (Goodreads Author),3.71,"Bestselling poet r.h. Sin, author of the Whisk...",,"['Poetry', 'Nonfiction', 'Romance', 'Feminism']",Paperback,128,Andrews McMeel Publishing,12/12/2017,1489,"['501', '402', '339', '144', '103']",83.0,1,1
52469,270435.Heal_Your_Body,Heal Your Body: The Mental Causes for Physical...,,Louise L. Hay,4.36,Heal Your Body is a fresh and easy step-by-ste...,English,"['Self Help', 'Health', 'Nonfiction', 'Spiritu...",Paperback,96,Hay House,01/01/1984,14868,"['8640', '3745', '1864', '418', '201']",96.0,1,1
52470,11115191-attracted-to-fire,Attracted to Fire,,DiAnn Mills (Goodreads Author),4.14,Special Agent Meghan Connors' dream of one day...,English,"['Christian Fiction', 'Christian', 'Suspense',...",Paperback,416,Tyndale House Publishers,10/01/2011,2143,"['945', '716', '365', '78', '39']",95.0,0,1


To use the series column in the modeling, I will convert it to a binary flag: 1 if the book belongs to a series and 0 if it does not.

In [58]:
# Fill all NaN values with 0, then map every value that is not 0 to 1:
raw_df['series'] = raw_df['series'].fillna(0).apply(lambda x: 1 if x != 0 else 0)

# Confirm that these changes were made:
raw_df.head()

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
0,2767052-the-hunger-games,The Hunger Games,1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",Hardcover,374,Scholastic Press,09/14/2008,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,2993816,30516
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,1,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",Paperback,870,Scholastic Inc.,09/28/2004,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,2632233,26923
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,1,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,"['Classics', 'Fiction', 'Historical Fiction', ...",Paperback,324,Harper Perennial Modern Classics,05/23/2006,4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,2269402,23328
3,1885.Pride_and_Prejudice,Pride and Prejudice,0,"Jane Austen, Anna Quindlen (Introduction)",4.26,Alternate cover edition of ISBN 9780679783268S...,English,"['Classics', 'Fiction', 'Romance', 'Historical...",Paperback,279,Modern Library,10/10/2000,2998241,"['1617567', '816659', '373311', '113934', '767...",94.0,1983116,20452
4,41865.Twilight,Twilight,1,Stephenie Meyer,3.6,About three things I was absolutely positive.\...,English,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",Paperback,501,"Little, Brown and Company",09/06/2006,4964519,"['1751460', '1113682', '1008686', '542017', '5...",78.0,1459448,14874


Step 4: The language column has 3,806 missing values (about 7 percent), which need further investigation:

In [None]:
# Look at the rows where the values for the language column are missing:
raw_df.loc[raw_df['language'].isna()]

Step 5: The pages column is missing around 4% of its data. Further investigation:

In [62]:
# Look at the rows with pages column missing:
raw_df.loc[raw_df['pages'].isna()]

Unnamed: 0,bookId,title,series,author,rating,description,language,genres,bookFormat,pages,publisher,publishDate,numRatings,ratingsByStars,likedPercent,bbeScore,bbeVotes
669,30346601-brainwalker,Brainwalker,0,"Robyn Mundell (Goodreads Author), Stephan Laca...",4.35,One teen’s incredible journey may just blow hi...,English,"['Young Adult', 'Adventure', 'Fantasy', 'Middl...",ebook,,Dualmind Publishing,10/01/2016,15250,"['7226', '6795', '702', '371', '156']",97.0,11433,128
1062,44575575-pleasant-day,Pleasant Day,0,Vera Jane Cook (Goodreads Author),4.33,"In the town of Hollow Creek, South Carolina tw...",English,"['Contemporary', 'Drama', 'Book Club', 'Fictio...",Kindle Edition,,Chattercreek,03/11/2019,5684,"['2620', '2583', '265', '143', '73']",96.0,5744,66
1428,318525.Red_Storm_Rising,Red Storm Rising,0,Tom Clancy,4.17,"""Allah!""With that shrill cry, three Muslim ter...",English,"['Fiction', 'Thriller', 'Military Fiction', 'W...",Audio Cassette,,Random House Audio,09/13/1988,69639,"['30349', '24341', '12083', '2291', '575']",96.0,3500,47
1483,26030383-beg-for-mercy,Beg For Mercy,1,Lucian Bane (Goodreads Author),4.29,"The fight is on in this installment, Mercy is ...",English,[],Kindle Edition,,,08/18/2015,714,"['439', '130', '87', '27', '31']",92.0,3246,33
1543,19181419-a-bird-without-wings,A Bird Without Wings,0,Roberta Pearce (Goodreads Author),4.17,"After an impoverished and indigent childhood, ...",English,"['Romance', 'Contemporary Romance', 'Contempor...",ebook,,Smashwords,12/02/2013,168,"['86', '45', '24', '6', '7']",92.0,3046,32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52295,53946399-launching-your-real-estate-career,Launching Your Real Estate Career: The Definit...,0,Eric Kistner,5.00,,,[],Kindle Edition,,,06/07/2020,2,[],,5,1
52352,971322.The_Wanderers,The Wanderers,0,Richard Price,3.90,"The Wanderers, a teen-age gang in the Bronx of...",English,"['Fiction', 'Crime', 'New York', 'Novels', 'Li...",Mass Market Paperback,,Avon Books,11/01/1979,1640,"['452', '688', '406', '74', '20']",94.0,3,1
52359,18130046-ender-in-flight,Ender in Flight,1,Orson Scott Card,4.35,"""Ender in Flight"" is a story by Orson Scott Ca...",English,"['Science Fiction', 'Fiction', 'Young Adult', ...",,,,,133,"['66', '51', '14', '1', '1']",98.0,3,1
52407,2823038-forever-young-forever-free-forever-you...,"Forever Young, Forever Free Forever Young, For...",0,Hettie Jones,3.67,,,[],Paperback,,,,3,[],,2,1


For book format and description, I will replace the missing values with 'unknown'. Both columns have a low percentage of missing values (under 3 percent each), so this should not have a large impact on the data.

In [64]:
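# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and may not apply to the DataFrame:
raw_df['description'] = raw_df['description'].fillna('unknown')
raw_df['bookFormat'] = raw_df['bookFormat'].fillna('unknown')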

As a last step before moving on to descriptive statistics, check which columns still contain missing values:

In [65]:
# Fraction of missing values for each column (multiply by 100 for a percentage):
print(raw_df.isna().sum(axis=0)/raw_df.shape[0])

bookId            0.000000
title             0.000000
series            0.000000
author            0.000000
rating            0.000000
description       0.000000
language          0.072526
genres            0.000000
bookFormat        0.000000
pages             0.044724
publisher         0.070430
publishDate       0.034014
numRatings        0.000000
ratingsByStars    0.000000
likedPercent      0.011853
bbeScore          0.000000
bbeVotes          0.000000
dtype: float64


<a id = 'desc_statistics'></a>
### Descriptive Statistics

Compute the basic statistics of all the numeric columns:

In [28]:
# Work on a copy so that later changes to df do not alter raw_df:
df = raw_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   bookId          52478 non-null  object 
 1   title           52478 non-null  object 
 2   series          23470 non-null  object 
 3   author          52478 non-null  object 
 4   rating          52478 non-null  float64
 5   description     51140 non-null  object 
 6   language        48672 non-null  object 
 7   genres          52478 non-null  object 
 8   bookFormat      51005 non-null  object 
 9   pages           50131 non-null  object 
 10  publisher       48782 non-null  object 
 11  publishDate     50693 non-null  object 
 12  numRatings      52478 non-null  int64  
 13  ratingsByStars  52478 non-null  object 
 14  likedPercent    51856 non-null  float64
 15  bbeScore        52478 non-null  int64  
 16  bbeVotes        52478 non-null  int64  
dtypes: float64(2), int64(3), object

In [27]:
df.describe()

Unnamed: 0,rating,numRatings,likedPercent,bbeScore,bbeVotes
count,52478.0,52478.0,51856.0,52478.0,52478.0
mean,4.021878,17878.65,92.231545,1984.023,22.529003
std,0.367146,103944.8,5.990689,35153.14,369.158541
min,0.0,0.0,0.0,0.0,-4.0
25%,3.82,341.0,90.0,84.0,1.0
50%,4.03,2307.0,94.0,97.0,1.0
75%,4.23,9380.5,96.0,187.0,2.0
max,5.0,7048471.0,100.0,2993816.0,30516.0


<a id = 'data_visualization'></a>
### Data Visualization

<a id = 'correlation_analysis'></a>
### Correlation Analysis

<a id = 'summary'></a>
### Summary & Insights

[Back to the top](#title)