# Data Wrangling

* [1. Loading Data](#loading)
* [2. Data Cleaning](#data-cleaning)
    * [2.1 Missing Values](#missing)
    * [2.2 Duplicates](#duplicates)
    * [2.3 Illogical Data](#illogical)
        * [2.3.1 Sets of Books](#sets)
    * [2.4 Book Formats](#formats)
* [3. Adding Categories](#categories)
    * [3.1 Book Length](#length)
    * [3.2 Good or Bad Book](#goodbad)

# 1. Loading Data <a name='loading'></a>

In [1]:
#Load the relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#Load the Kaggle goodreads dataset obtained from
#https://www.kaggle.com/meetnaren/goodreads-best-books
#that's based on the Goodreads list of best books ever: https://www.goodreads.com/list/show/1.Best_Books_Ever
best_books = pd.read_csv('best_books.csv')

In [3]:
best_books.head()

Unnamed: 0,book_authors,book_desc,book_edition,book_format,book_isbn,book_pages,book_rating,book_rating_count,book_review_count,book_title,genres,image_url
0,Suzanne Collins,Winning will make you famous. Losing means cer...,,Hardcover,9780440000000.0,374 pages,4.33,5519135,160706,The Hunger Games,Young Adult|Fiction|Science Fiction|Dystopia|F...,https://images.gr-assets.com/books/1447303603l...
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,9780440000000.0,870 pages,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,https://images.gr-assets.com/books/1255614970l...
2,Harper Lee,The unforgettable novel of a childhood in a sl...,50th Anniversary,Paperback,9780060000000.0,324 pages,4.27,3745197,79450,To Kill a Mockingbird,Classics|Fiction|Historical|Historical Fiction...,https://images.gr-assets.com/books/1361975680l...
3,Jane Austen|Anna Quindlen|Mrs. Oliphant|George...,«È cosa ormai risaputa che a uno scapolo in po...,"Modern Library Classics, USA / CAN",Paperback,9780680000000.0,279 pages,4.25,2453620,54322,Pride and Prejudice,Classics|Fiction|Romance,https://images.gr-assets.com/books/1320399351l...
4,Stephenie Meyer,About three things I was absolutely positive.F...,,Paperback,9780320000000.0,498 pages,3.58,4281268,97991,Twilight,Young Adult|Fantasy|Romance|Paranormal|Vampire...,https://images.gr-assets.com/books/1361039443l...


In [4]:
best_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54301 entries, 0 to 54300
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   book_authors       54301 non-null  object 
 1   book_desc          52970 non-null  object 
 2   book_edition       5453 non-null   object 
 3   book_format        52645 non-null  object 
 4   book_isbn          41435 non-null  object 
 5   book_pages         51779 non-null  object 
 6   book_rating        54301 non-null  float64
 7   book_rating_count  54301 non-null  int64  
 8   book_review_count  54301 non-null  int64  
 9   book_title         54301 non-null  object 
 10  genres             51059 non-null  object 
 11  image_url          53618 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 5.0+ MB


In [5]:
#Let's first create a copy to keep the original intact
books = best_books.copy(deep=True)

Let's quickly rename the column names to be more concise.

In [6]:
#Strip the 'book_' from the column names
books.columns = [col_name.lstrip('book_') if ('book_' in col_name) else col_name for col_name in best_books.columns]

# 2. Data Cleaning <a name='data-cleaning'></a>

This section will first cover how missing values, outliers, duplicates, or illogical data will be dealt with. Then, I'll take a look at cleaning up some of the data so that it's easier to work with later.

Before I look at any missing values, I'm going to also remove the unnecessary "pages" unit from the `pages` column so that the values can be numeric.

In [7]:
#Remove the unit from the pages values
books.pages = books.pages.str.rstrip(' pages')

#Convert the pages column to numbers
books.pages = pd.to_numeric(books.pages)

In [8]:
books.describe()

Unnamed: 0,pages,rating,rating_count,review_count
count,51779.0,54301.0,54301.0,54301.0
mean,337.662836,4.020027,43504.49,2011.60218
std,259.533005,0.3621,212657.2,7627.07287
min,0.0,0.0,0.0,0.0
25%,216.0,3.83,407.0,35.0
50%,310.0,4.03,2811.0,188.0
75%,400.0,4.22,12745.0,822.0
max,14777.0,5.0,5588580.0,160776.0


The ISBN feature of this dataset is not very useful since it's in scientific notation and half of the ISBN is lost to the notation. Therefore, I'll be removing it with the `image_url` column since I won't be using that feature.

In [9]:
#Delete the isbn and image_url columns
books.drop(columns=['isbn', 'image_url'], inplace=True)

## 2.1 Missing Values <a name='missing'></a>

I'll be looking at how to handle the missing values in the dataset.

In [10]:
books.isna().sum()

authors             0
desc             1331
edition         48848
format           1656
pages            2522
rating              0
rating_count        0
review_count        0
title               0
genres           3242
dtype: int64

Let's look at some of the missing values for genres and page numbers, before we tackle the books with 0 ratings/pages/etc.

In [11]:
books[books.genres.isnull()==True]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
260,سید مرتضی مصطفوی,Mystical storyThe internal revolution of a wom...,فارسی,ebook,204.0,3.97,956,129,زندگی مه آلود پریا,
716,سید مرتضی مصطفوی,Philosophical story about human loneliness in ...,,,82.0,4.12,923,256,گم شده ای در مه,
1421,سید مرتضی مصطفوی,"رمان بلند ""سیمای شکسته پدر سالار"" که از تعلیق...",فارسی,ebook,367.0,3.55,722,7,سیمای شکسته پدر سالار,
1928,أحمد مراد,"""للمرة الثانية بعد ""فيرتيجو"" يتّخذ أحمد مراد م...",,Paperback,389.0,4.07,48474,4239,تراب الماس,
1989,Christine M. Knight,"“I’m not Mavis anymore.” For years, song bird ...",,Paperback,304.0,4.59,39,34,Song Bird: Matters of the Heart,
...,...,...,...,...,...,...,...,...,...,...
54225,Chuck Rogers|Don Pendleton,When American military personnel are found beh...,,Paperback,188.0,4.14,14,1,Crisis Nation,
54229,William Logan,"Talented William Logan, though he hails from D...",,ebook,12.0,2.67,6,1,Mex,
54236,Chris Pepple,,,Paperback,120.0,5.00,1,0,Look to See Me: A Collection of Reflections,
54259,Jackie Budd,Lively and wide-ranging text together with col...,Large Print,Paperback,64.0,3.67,3,2,The World of Horses,


In [12]:
books[books.pages.isnull()==True]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
1225,Gabrielle Estres,"“A captivating tale of love, power and betraya...",,ebook,,4.79,331,16,Captive,Romance|Fantasy|Romance|Paranormal Romance|Rom...
1296,Tom Clancy,2 cassettes / 2 hoursRead by F. Murray Abraham...,Abridged,Audio Cassette,,4.15,63105,956,Red Storm Rising,Fiction|Thriller|War|Military Fiction|War
1392,Roberta Pearce,"After an impoverished and indigent childhood, ...",,ebook,,4.23,153,31,A Bird Without Wings,Romance|Romance|Contemporary Romance|Contemporary
1424,Robyn Mundell|Stephan Lacast,One teen’s incredible journey may just blow hi...,,ebook,,4.27,1054,153,Brainwalker,Young Adult|Fantasy|Fiction|Adventure|New Adul...
1787,Chris A. Jones,Mankind had spent decades trying to overcome a...,,ebook,,4.14,521,1,Reversione: Reset the Future,Young Adult|Adventure|Mystery|Fantasy|New Adul...
...,...,...,...,...,...,...,...,...,...,...
54284,Tamiki Wakaki,,,Paperback,,4.24,141,1,The World God Only Knows 17,Sequential Art|Manga|Comics Manga
54285,Tamiki Wakaki,,,Paperback,,4.20,143,2,The World God Only Knows 18,Sequential Art|Manga|Comics Manga
54286,Tamiki Wakaki,,,Paperback,,4.34,182,2,The World God Only Knows 11,Sequential Art|Manga|Comics Manga
54288,Tamiki Wakaki,,,Paperback,,4.32,179,2,The World God Only Knows 12,Sequential Art|Manga|Comics Manga|Sequential Art


I'll be removing the ones with missing genres/pages, since those will be the primary features I'll be looking at and exploring. Additionally, it looks like some books have not only genres/pages missing but some also have multiple missing values; let's remove the ones with more than 3 missing features.

In [13]:
#Remove rows with more than 3 missing features
books.dropna(axis=0, thresh=9, inplace=True)

In [14]:
#Remove rows with missing pages or missing genres
books.dropna(axis=0, subset=['pages', 'genres'], how='any', inplace=True)

Now I'll be filling in the missing values as 'missing.'

In [15]:
#Fill the missing values as 'missing'
books.fillna('Missing', inplace=True)

In [16]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47760 entries, 0 to 54300
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   authors       47760 non-null  object 
 1   desc          47760 non-null  object 
 2   edition       47760 non-null  object 
 3   format        47760 non-null  object 
 4   pages         47760 non-null  float64
 5   rating        47760 non-null  float64
 6   rating_count  47760 non-null  int64  
 7   review_count  47760 non-null  int64  
 8   title         47760 non-null  object 
 9   genres        47760 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 4.0+ MB


About 5k books was deleted, but there still is a good amount of books remaining. Now to really look at some the actual values. 

## 2.2 Duplicates <a name='duplicates'></a>

First I'll explore to see if there are any duplicates, and keep only one of the duplicates.

In [17]:
books[books.duplicated(keep='first')]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
6467,Kirsten Fullmer,Lizzie gave up her stressful job in Boston to ...,Missing,Kindle Edition,297.0,4.25,630,16,Hometown Girl Forever,Young Adult|Contemporary|New Adult|Romance|Lov...
10353,Rahiem Brooks,"""Brooks’s clunky first Naim Butler contemporar...",Missing,Paperback,278.0,4.7,30,18,A Butler Christmas,Romance|Holiday|Holiday|Christmas
24460,Frances Hardinge,This is the story of a bear-hearted girl . . ....,Missing,Hardcover,416.0,4.09,2773,532,A Skinful of Shadows,Fantasy|Historical|Historical Fiction|Young Ad...
29507,Annemarie O'Brien,Young Lara is being groomed in the family trad...,Missing,Hardcover,208.0,3.99,630,126,Lara's Gift,Historical|Historical Fiction|Childrens|Middle...
37591,Rhonda Patton,Ted and Raymond go to Africa. They find great ...,Missing,Paperback,40.0,4.61,66,13,African Safari with Ted and Raymond,Childrens
47849,Chloe Neill,"Since Merit was turned into a vampire, and the...",Missing,Paperback,350.0,4.2,11220,676,Wild Things,Fantasy|Urban Fantasy|Paranormal|Vampires|Fant...


In [18]:
books.title.value_counts()

1984                                                                  16
Selected Poems                                                        14
The Hobbit                                                            13
American Gods                                                         13
A Christmas Carol                                                     12
                                                                      ..
Welcome to Hard Times                                                  1
For Women Only: What You Need to Know about the Inner Lives of Men     1
Histoires inédites du Petit Nicolas Volume 2                           1
Lost in Learning: The Art of Discovery                                 1
Death in Venice and Seven Other Stories                                1
Name: title, Length: 42424, dtype: int64

In [19]:
books[books.title=='1984']

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
100,George Orwell|Erich Fromm,"Among the seminal texts of the 20th century, N...",Signet Classics,Mass Market Paperback,328.0,4.16,2420816,53773,1984,Classics|Fiction|Science Fiction|Science Ficti...
5161,George Orwell|Peter Hobley Davison,"'It was a bright cold day in April, and the cl...",Missing,Paperback,326.0,4.16,2421887,53794,1984,Classics|Fiction|Science Fiction|Science Ficti...
7348,George Orwell,"Among the seminal texts of the 20th century, N...",Missing,Kindle Edition,237.0,4.16,2421777,53793,1984,Classics|Fiction|Science Fiction|Science Ficti...
9987,George Orwell,"Among the seminal texts of the 20th century, N...",Missing,Kindle Edition,237.0,4.16,2422197,53801,1984,Classics|Fiction|Science Fiction|Science Ficti...
13992,George Orwell|Alexandre Hubner|Heloísa Jahn,1984 é uma das obras mais influentes do século...,Missing,Paperback,416.0,4.16,2422432,53809,1984,Classics|Fiction|Science Fiction|Science Ficti...
19074,George Orwell|أنور الشامي,على مدى سنوات طويلة ظلت رواية 1984 لجورج اوروي...,الثالثة,Paperback,352.0,4.16,2422645,53814,1984,Classics|Fiction|Science Fiction|Science Ficti...
21486,George Orwell,Nineteen Eighty-Four revealed George Orwell as...,Missing,Hardcover,326.0,4.16,2422732,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
21918,George Orwell,1984 oferece hoje uma descrição quase realista...,Missing,Paperback,327.0,4.16,2422748,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
22072,George Orwell|Amélie Audiberti,"De tous les carrefours important, le visage à ...",Folio #822,Mass Market Paperback,448.0,4.16,2422763,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
22093,George Orwell,"Sepanjang hidupnya, Winston berusaha menjadi w...",Missing,Paperback,408.0,4.16,2422763,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...


Not only are there complete duplicates of entries, but it also looks like there are multiple of the same book with different formats and editions. From the latter, the pages seem more or less aligned with each other, and the ratings/counts are the same. So I'll be keeping the books with the most amount of ratings. This way, the complete duplicates are also filtered out.

In [20]:
#Keep the duplicates with the highest rating count
books = books.sort_values(['title', 'rating_count', 'review_count'], ascending=True).drop_duplicates(subset='title', keep='last')

In [21]:
books.title.value_counts()

Grief Girl: My True Story                                                                         1
Demon Song                                                                                        1
The Painted Veil                                                                                  1
Life Studies                                                                                      1
Fareler ve İnsanlar                                                                               1
                                                                                                 ..
The Pink Dress                                                                                    1
Karamazov Kardeşler                                                                               1
Creative Visualization: Use the Power of Your Imagination to Create What You Want in Your Life    1
Al di qua del Paradiso                                                                            1


In [22]:
books[books.duplicated()==True]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres


## 2.3 Illogical Data <a name='illogical'></a>

To look at the illogical data, or any data that doesn't really make sense with what it's supposed to be, I'm going to explore the actual values of the feature. Before that however, I'm going to remove all books with less than 100 ratings, since this would either encompass some "illogical" values or validate them, like the ones with a 0 rating or 0 pages. I would deem 100 ratings to be a reasonable minimum since the median of `rating_count` is 3763. I'm not considering the mean here since there are some books with an disproportionately large `rating_count`, which skews the mean.

In [23]:
#Remove the books with less than 100 ratings
books.drop(index=books[books.rating_count<100].index, inplace=True)

In [24]:
books.describe()

Unnamed: 0,pages,rating,rating_count,review_count
count,38629.0,38629.0,38629.0,38629.0
mean,346.600818,3.993088,34558.62,1659.270936
std,258.374482,0.277012,175620.6,6289.348471
min,0.0,2.09,100.0,0.0
25%,226.0,3.82,1138.0,81.0
50%,320.0,4.01,4146.0,269.0
75%,406.0,4.18,14296.0,905.0
max,14777.0,4.89,5588580.0,160776.0


It seems odd that there are books with 0 pages and also those with more than 2000.

In [25]:
books[books.pages > 2000]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
14080,Romain Rolland|م. ا. به‌آذین,"Complete in one volume:1. L'Aube (""Dawn"", 1904...",Missing,Hardcover,2179.0,4.17,650,45,(ژان کریستف (دورۀ چهار جلدی,Novels|Fiction|Cultural|France|Literature|Clas...
21608,Cao Xueqin|Gladys Yang|Xianyi Yang,"Also known as Hong Lou Meng, this is arguably ...",4 volume box set,Paperback,2549.0,4.11,3173,263,A Dream of Red Mansions,Classics|Fiction|Cultural|China|Literature|His...
23569,George R.R. Martin,George R.R. Martin's A Song of Ice and Fire is...,Missing,Paperback,5216.0,4.64,42043,1495,A Game of Thrones: The First 5 Books,Fantasy|Fiction
1730,George R.R. Martin,"For the first time, all five novels in the epi...",Re-Packaged Edition,Mass Market Paperback,5216.0,4.64,42029,1495,A Song of Ice and Fire,Fantasy|Fiction
4349,K.A. Applegate|Katherine Applegate,"Animorphs ""RM"" is an exciting series for young...",Missing,Paperback,8245.0,4.34,3647,208,Animorphs,Science Fiction|Childrens|Young Adult|Fiction|...
...,...,...,...,...,...,...,...,...,...,...
11334,شمس الدين الذهبي|حسان عبد المنان,كتاب سير أعلام النبلاء يعتبر من أمتع كتب الترا...,Missing,Hardcover,4683.0,4.35,453,32,سير أعلام النبلاء,Biography|Religion|Islam
20335,Sayed Qutb|سيد قطب,راخواطر وانطباعات من فترة عاشها سيد قطب في ظلا...,First-time author notes,Hardcover,4012.0,4.52,3210,133,في ظلال القرآن,Religion|Islam|Religion
49216,Abdul Rahman Munif|عبدالرحمن منيف,مدن الملح هي رواية عربية للروائي السعودي عبد ا...,Missing,Paperback,2345.0,4.33,959,125,مدن الملح,Novels|Fiction
13275,Ibn Khaldun,"""The Muqaddimah,"" often translated as ""Introdu...",Missing,Hardcover,3864.0,4.30,3048,246,مقدمة ابن خلدون,Nonfiction|Sociology|Classics|Politics|History...


In [26]:
books[books.pages == 0]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
25009,James L. Gillaspy,The broadcast interview of a young computer pr...,Missing,Nook,0.0,3.81,306,22,A Larger Universe,Science Fiction|Fiction
22603,Arthur R.G. Solmssen,"Berlín, 1922. Reina la confusión en la capital...",Missing,Mass Market Paperback,0.0,3.91,292,18,A Princess in Berlin,Fiction|Romance|Historical|Historical Fiction|...
47010,Jack Kilborn|Phil Gigante,"WELCOME TO SAFE HAVEN, POPULATION 907...Nestle...",Missing,Audio CD,0.0,3.90,7982,957,Afraid,Horror|Thriller|Fiction|Mystery|Suspense
38704,NOT A BOOK,"In this presentation, Glenn Beck tells an audi...",Missing,Audio CD,0.0,4.16,207,42,An Unlikely Mormon,Nonfiction|Biography|Christianity|Lds|Religion...
33835,Agatha Christie,Missing,Box set,Paperback,0.0,4.20,207,10,And Then There Were None/The Secret Adversary/...,Mystery|Childrens
...,...,...,...,...,...,...,...,...,...,...
45421,Pet Torres,SYNOPSISValkyrie is a young girl who has been ...,Missing,Nook,0.0,3.29,288,19,Valkyrie: The Vampire Princess,Paranormal|Vampires|Young Adult|Fantasy|Supern...
28817,Aaron Polson,While cruising a dark country road late one Sa...,Missing,Nook,0.0,3.01,207,25,We are the Monsters,Horror|Fiction|Suspense
15224,Craig Thomas,An East German defector dies in a tragic accid...,Missing,Paperback,0.0,3.35,129,5,Wildcat,Fiction|Thriller|Adventure
21469,P.G. Wodehouse,Table of ContentsList of Works by Genre and Ti...,Missing,ebook,0.0,4.49,412,10,Works of P. G. Wodehouse,Fiction|Humor|Classics|Humor|Comedy


Some books with 0 pages are sets of books, or audiobooks. I'm going to take a some time to clean up the formats before dealing with the pages.

In [27]:
books.loc[books.pages==0].format.value_counts(dropna=False)

Paperback                31
ebook                    21
Audio CD                 18
Nook                     14
Hardcover                13
Audio                     8
Audiobook                 6
Audio Cassette            6
Mass Market Paperback     4
MP3 CD                    2
Boxed Set                 1
Board Book                1
Name: format, dtype: int64

In [28]:
#Check the books dataframe to make sure it's still okay
books.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
28854,Julio Ortega|Jorge Luis Borges,De los muy pocos manuscritos que se conservan ...,Missing,Paperback,13.0,4.31,124,6,"""El Aleph"" de Jorge Luis Borges",Science Fiction Fantasy
36555,Sven Lindqvist|Joan Tate,"""Exterminate All the Brutes"" is a searching ex...",Missing,Paperback,179.0,4.15,1078,98,"""Exterminate All the Brutes"": One Man's Odysse...",History|Nonfiction|Cultural|Africa|Travel
3610,Michael Kramer,The formation of the English people starting w...,Missing,Kindle Edition,239.0,4.29,299,2,"""Now What?!!""",Young Adult|Adventure|New Adult|Historical|His...
18919,Stephen King|Gönül Suveren,Yıllar önce çocukluk kâbuslarına giren ‘O’ tüm...,Missing,Paperback,441.0,4.22,612802,18309,"""O""",Horror|Fiction|Fantasy|Thriller
8901,Harlan Ellison|Rick Berry,A rebel inhabits a world where conformity and ...,Missing,Hardcover,48.0,4.21,2347,129,"""Repent, Harlequin!"" Said the Ticktockman",Science Fiction|Short Stories|Fiction|Science ...


In [29]:
#Resort the index in the books dataframe
books = books.sort_index().reset_index(drop=True)

#Recheck the books dataframe
books.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
0,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction
1,J.R.R. Tolkien,لجزء الثالث من ملحمة جيه أر أر تولكين الرائعة ...,Hobbit Movie Tie-in Boxed set,Mass Market Paperback,1728.0,4.59,99793,1652,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,Fantasy|Fiction|Classics
2,Douglas Adams,Seconds before the Earth is demolished to make...,Missing,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,Science Fiction|Fiction|Humor|Fantasy|Classics
3,Shel Silverstein,"""Once there was a tree...and she loved a littl...",Missing,Hardcover,64.0,4.37,789681,15694,The Giving Tree,Childrens|Childrens|Picture Books|Classics|Fic...
4,Dan Brown,An ingenious code hidden in the works of Leona...,Missing,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,Fiction|Mystery|Thriller


### 2.2.1. Sets of Books <a name='sets'></a>

Since I want to look at only individual books, I'll be dropping all sets or collections of books that I can find.

In [30]:
#Create regex pattern to search for chronicles, set(s), collection(s) in the title
pat = r"\bchronicles\b|\bsets*\b|\bcollections*\b"

#Create new dataframe based on the regex search
book_sets = books.loc[books['title'].astype('str').str.contains(pat, regex=True, case=False)]

In [31]:
#Manually add the titles collected by regex that aren't sets
non_sets = ['...And the Truth Shall Set You Free', 'A Collection of Essays', 'A Gown of Spanish Lace (The Janette Oke Collection)', 'Blood Borne (Cathedral Chronicles, #1)', 'Burma Chronicles', 'Chronicles of Avonlea',
 'Chronicles, Volume One', 'Deep Down Dark: The Untold Stories of 33 Men Buried in a Chilean Mine, and the Miracle That Set Them Free', "Disney's Mulan Classic Storybook (The Mouse Works Classics Collection)", 
 "Disney's Storybook Collection", 'Go Set a Watchman', 'Imitatore (The Donna Chronicles #1)', 'I Hate Myselfie: A Collection of Essays', 'Indian Creek Chronicles: A Winter Alone in the Wilderness', 'Invisible Collection',
 'It Gets Worse: A Collection of Essays', 'Jerusalem: Chronicles from the Holy City', "LZ-'75: The Lost Chronicles of Led Zeppelin's 1975 American Tour", 'Lies Young Women Believe Companion Guide: And the Truth That Sets Them Free',
 'New England Witch Chronicles', 'Miz Lil: And the Chronicles of Grace', 'No Wonder They Call Him the Savior: Chronicles of the Cross', 'Piano/Vocal/Guitar Sheet Music: The Chronicles of Narnia: The Lion, the Witch and The Wardrobe',
 'Rain & Fire: A Guide to the Last Dragon Chronicles', 'Set Me Free', 'Set This House in Order', 'Set in Darkness','Set in Stone', 'Sheet Music: The Chronicles of Narnia - Prince Caspian', "The Atheist's Bible: An Illustrious Collection of Irreverent Thoughts",
 'The Andalite Chronicles', 'The Bane Chronicles', 'The Batman Chronicles, Vol. 1', 'The Black River Chronicles: Level One', 'The Christmas Chronicles: The Legend of Santa Claus', 'The Chronicles of Audy: 4R', 'The Chronicles of Faerie',
 'The Chronicles of Harris Burdick: 14 Amazing Authors Tell the Tales', 'The Chronicles of Pern: First Fall', 'The Chronicles of Spiderwick: A Grand Tour of the Enchanted World, Navigated by Thimbletack', 'The Complete Chronicles of Conan',
 'The Collection', 'The Curiosities: A Collection of Stories', 'The Cupid Chronicles', 'The Dodgeball Chronicles', 'The Edge Chronicles 10: The Immortals: The Book of Nate', 'The Edge Chronicles 1: The  Curse of the Gloamglozer: First Book of Quint',
 'The Edge Chronicles 2: The Winter Knights: Second Book of Quint', 'The Edge Chronicles 3: The Clash of the Sky Galleons: Third Book of Quint', 'The Edge Chronicles 5: Stormchaser: Second Book of Twig', 'The Edge Chronicles 6: Midnight Over Sanctaphrax: Third Book of Twig',
 'The Edge Chronicles 7: The Last of the Sky Pirates: First Book of Rook', 'The Edge Chronicles 8: Vox: Second Book of Rook', 'The Edge Chronicles Maps',  'The First Chronicles of Druss the Legend', 'The Fry Chronicles', 'The Immortal Collection', 'The Ivy Chronicles',
 'The Kane Chronicles Survival Guide', 'The Land That Time Forgot Collection', 'The Last War: A World Set Free', 'The Martian Chronicles', 'The Ring Sets Out','The Rising Dark: A Darkest Minds Collection', 'The Set Up', 'The Soul of Rumi: A New Collection of Ecstatic Poems',
 'The Spiderwick Chronicles Movie : The Movie Storybook', 'The Stalker Chronicles', 'The Superman Chronicles, Vol. 1', 'The Travelling Cat Chronicles', 'The Valley of Vision: A Collection of Puritan Prayers and Devotions', 'The Works of William Wordsworth (Wordsworth Collection)',
 'The Zombie Chronicles', 'To Draw Closer To God: A Collection Of Discourses', 'Tortall and Other Lands: A Collection of Tales', 'When I Was a Slave: Memoirs from the Slave Narrative Collection', 'Van Laven Chronicles: Throne of Novoxos', 'Wormwood: A Collection of Short Stories',
 'Zen Flesh, Zen Bones: A Collection of Zen and Pre-Zen Writings']

#Update book_sets to exclude the non_sets
book_sets = book_sets.loc[~book_sets.title.isin(non_sets)]

In [32]:
#Ignore all jupyter warnings
import warnings
warnings.filterwarnings('ignore')

#Create list where edition indicates sets
ed_sets = books.loc[books.edition.astype('str').str.contains(r"\bBox(ed)*\b|\bSets*\b", regex=True, case=False)]

In [33]:
#Join the two sets together
book_sets = book_sets.append(ed_sets)

In [34]:
#Keep only the rows not in the book_sets dataframe
books = pd.merge(books, book_sets, indicator=True, how='outer').query('_merge=="left_only"').drop('_merge', axis=1)

In [35]:
books.shape

(38420, 10)

In [36]:
#Create list of indices with boxed set formats that will be dropped
to_drop = books[(books.format=='Boxed Set') | (books.format=='Box Set')].index

#Drop the boxed sets formats
books.drop(index=to_drop, inplace=True)

In [37]:
#Resort the index in the books dataframe
books = books.sort_index().reset_index(drop=True)

#Recheck the books dataframe
books.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
0,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction
1,Douglas Adams,Seconds before the Earth is demolished to make...,Missing,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,Science Fiction|Fiction|Humor|Fantasy|Classics
2,Shel Silverstein,"""Once there was a tree...and she loved a littl...",Missing,Hardcover,64.0,4.37,789681,15694,The Giving Tree,Childrens|Childrens|Picture Books|Classics|Fic...
3,Dan Brown,An ingenious code hidden in the works of Leona...,Missing,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,Fiction|Mystery|Thriller
4,Lewis Carroll|John Tenniel|Martin Gardner,""" I can't explain myself, I'm afraid, sir,"" sa...",Missing,Mass Market Paperback,239.0,4.07,411153,9166,Alice's Adventures in Wonderland & Through the...,Classics|Fantasy|Fiction|Childrens


## 2.4 Book Formats <a name='formats'></a>

To deal with all the different formats, I'm going to be organizing them into 5 main categories (not counting the missing values): 
1. Paperback
2. Hardcover
3. Audio
4. Digital
5. Other

In [38]:
#Add indvidual formats that mean paperback
paper = ['Capa Mole', 'capa mole', 'Softcover', 'Spiral-bound','Capa mole - 15,5 x 23 x 2cm', 'Poche', 'Broché', 'broché', 'Capa comum', 'Bìa mềm', 'Broschiert', 'Tapa blanda', 'Brossura', 'pocket', '文庫', 'Taschenbuch', 'Bolsillo','Pasta blanda','']

#Add indvidual formats that mean hardcoverb
hard = ['Capa Dura','Leather Bound','Bonded Leather','Imitation Leather', 'Capa dura','Board Book','Board book', 'Board', 'Tapa dura con sobrecubierta', 'Kovakantinen', 'Innbundet']

#Add indvidual formats that suggest audio
audio = ['Podcast','CD-ROM', 'MP3 CD','Podiobook']
             
#Add indvidual formats that suggest digital
digital = ['ebook','Nook','PDF ', 'web','Wattpad']

#Create an empty list for other formats
other = []

#Iterate over all unique formats
for fmt in books.format.unique():
    #First check whether the format is already in one of the manually made lists
    if ((fmt not in paper) & (fmt not in hard) & (fmt not in audio) & (fmt not in digital)):
        #Add the formats in the appropriate category
        if 'paper' in fmt.lower():
            paper.append(fmt)
        elif 'hard' in fmt.lower():
            hard.append(fmt)
        elif 'audio' in fmt.lower():
            audio.append(fmt)
        elif 'kindle' in fmt.lower():
            digital.append(fmt)
        elif 'online' in fmt.lower():
            digital.append(fmt)
        elif 'digital' in fmt.lower():
            digital.append(fmt)
        #Add all others into  the other category
        elif 'Missing' not in fmt:
            other.append(fmt)

In [39]:
#Replace all those in list with the corresponding format category
books.format.replace(paper, 'Paperback', inplace=True)
books.format.replace(hard, 'Hardcover', inplace=True)
books.format.replace(audio, 'Audio', inplace=True)
books.format.replace(digital, 'Digital', inplace=True)
books.format.replace(other, 'Other', inplace=True)

In [40]:
books.format.unique()

array(['Paperback', 'Hardcover', 'Digital', 'Audio', 'Other', 'Missing'],
      dtype=object)

### Book Edition

There are a lot of variations with first editions, which looks like the biggest portion, outside of the missing values. As such, I'll replace all edition values as 'First' or 'Other Edition', and create a separate `missing_ed` feature.

In [41]:
#Create missing_ed feature indicating if edition value is missing
books.loc[books.edition=='Missing','missing_ed']=1

#Fill the rest with 0
books.missing_ed.fillna(0, inplace=True)

In [42]:
books.edition.value_counts().head(20)

Missing                    34738
First Edition                298
1st Edition                  249
1st edition                  102
1st                           73
Second Edition                65
الطبعة الأولى                 62
Large Print                   59
Oxford World's Classics       54
1                             50
First                         47
Penguin Classics              43
2nd Edition                   39
Omnibus                       32
Unabridged                    32
UK                            30
Revised Edition               24
Penguin Modern Classics       24
Trade                         23
Abridged                      23
Name: edition, dtype: int64

In [43]:
#Create manual list for first edition variations
first = ['First Edition', '1st Edition', '1st edition', '1st', 'الطبعة الأولى']

#Iterate over all unique formats
for ed in books.edition.unique():
    #First check whether the format is already in one of the manually made lists
    if (ed not in first):
        #Add the formats in the appropriate category
        if '1st' in ed.lower():
            first.append(ed)
        elif 'first' in ed.lower():
            first.append(ed)

In [44]:
first

['First Edition',
 '1st Edition',
 '1st edition',
 '1st',
 'الطبعة الأولى',
 'First Scholastic Trade Paperback Edition',
 'First',
 'First US Edition',
 'First Touchstone Edition 2003',
 'First Trade Paperback Edition',
 'First edition',
 "First St. Martin's Griffin Edition",
 'First edition of this translation',
 'US First Edition',
 'First Vintage Edition',
 '1st US Edition',
 '1st Perennial Classics Edition',
 '2000 Reprint (1st edition in Penguin 1945)',
 'First Signet Classic Printing (Lingeman Introduction)',
 'first print',
 'Hyperion, First Edition',
 'First Anchor Books edition (w/new Afterword)',
 '1st Kindle Edition',
 'First Vintage International edition',
 'first',
 'First American Edition',
 'First Signet Edition',
 'First Vintage Contemporaries Edition',
 'First Mass Market Edition',
 'First Touchstone Edition',
 '1st Scribner trade paperback edition',
 'First Knopf Trade Paperback Edition',
 'First Broadway Paperback Edition',
 '1st  Edition',
 'First Edition (US / CAN)

In [45]:
#Manually identifying which  editions were misclassified as first
remove_first = ['2000 Reprint (1st edition in Penguin 1945)', '31st Impression', '21st edition', 'The 21st-Century Edition', 'Seventh Reprint (Booket 1st edition 2009)', '61st Edition', 'First Movie Tie-In Paperback Edition']

#Remove above misclassified first editions from the first edition list
for ele in remove_first:
    first.remove(ele)

In [46]:
#Input 1 if book is a first edition from the list or 0 if it isn't
books.edition.replace(first, 1, inplace=True)
books.loc[books.edition!=1, 'edition'] = 0

In [47]:
#Rename 'edition' column to 'first_ed' for better readability
books.rename(mapper={'edition':'first_ed'}, inplace=True)

### Book Description <a name='desc'></a>
Because I'm currently not exploring the book description texts, I'll remove it but not before I encode whether it exists for the book or not. This way, I don't completely lose the data.

In [48]:
#Input 0 for 'Missing' in 'desc' feature or 1 if there's text
books.loc[books.desc=='Missing', 'missing_desc'] = 0
books.loc[books.desc!=0, 'missing_desc'] = 1

# 3. Adding Categories <a name='categories'></a>

In order to deal with some of the feature data a little easier, I'll be binning some of them into categories.

## 3.1 Book Length <a name='length'></a>
It wouldn't make too much sense to input a mean into all the books with 0 pages, and there are too many of them to individually check their book pages. In order to not skew the data with so many incorrect 0-page books, I'll be categorizing the books into `1` (short), `2` (medium), `3` (long) based on the number of pages, while maintaining the 0 page books as `0`. Since some of the books have a disproportionate large number of pages, I'll be breaking it down by the quartiles, with a large portion devoted to the medium range.

In [49]:
books.pages.describe()

count    38416.000000
mean       342.581346
std        244.489347
min          0.000000
25%        226.000000
50%        320.000000
75%        404.000000
max      14777.000000
Name: pages, dtype: float64

In [50]:
#Calculate the 25th and 75th percentiles as the upper and lower bound for 1 (short) and 2 (long), respectively
short = np.percentile(books.pages, 25)
long = np.percentile(books.pages, 75)

#Categorize each book based on its number of pages, and save it into the 'length feature'
books.loc[(books.pages == 0), 'length'] = 0
books.loc[((books.pages <= short) & (books.pages != 0)), 'length'] = 1
books.loc[(books.pages >= long), 'length'] = 3
#Fill the rest with 2 (medium)
books.length.fillna(2, inplace=True)

In [51]:
#Convert length feature as categorical
books['length'] = books['length'].astype('category')

In [52]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38416 entries, 0 to 38415
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   authors       38416 non-null  object  
 1   desc          38416 non-null  object  
 2   edition       38416 non-null  object  
 3   format        38416 non-null  object  
 4   pages         38416 non-null  float64 
 5   rating        38416 non-null  float64 
 6   rating_count  38416 non-null  int64   
 7   review_count  38416 non-null  int64   
 8   title         38416 non-null  object  
 9   genres        38416 non-null  object  
 10  missing_ed    38416 non-null  float64 
 11  missing_desc  38416 non-null  float64 
 12  length        38416 non-null  category
dtypes: category(1), float64(4), int64(2), object(6)
memory usage: 3.8+ MB


In [53]:
books.length.value_counts()

2.0    19142
3.0     9645
1.0     9522
0.0      107
Name: length, dtype: int64

## 3.2 Good or Bad Book <a name='goodbad'></a>

Since the data seems relatively ready, I'll be adding my main feature that will be my dependent variable, whether a book is good or bad. This dataset is already curated from the Goodreads' "Best Books List" so there is no reason to believe *any* of the books on this list are relatively bad. Additionally, I admit that opinions of books are subjective; a book can be good for one person, and also bad for another. Moreover, books couldn't really be described as completely black and white, either 'good' or 'bad'.

With all that said, I will be forming two best books list: one in the order of the dataset, determined by the voters of Goodreads; and another based purely on the books' ratings. These will differ only in their initial order, and will be split evenly in half; the first half will  be categorized as 'good' while the second will be the 'bad' ones.

In [54]:
books.rating.describe()

count    38416.000000
mean         3.991046
std          0.275888
min          2.090000
25%          3.820000
50%          4.010000
75%          4.180000
max          4.890000
Name: rating, dtype: float64

In [55]:
books.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres,missing_ed,missing_desc,length
0,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,0,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,0.0,1.0,3.0
1,Douglas Adams,Seconds before the Earth is demolished to make...,0,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,Science Fiction|Fiction|Humor|Fantasy|Classics,1.0,1.0,1.0
2,Shel Silverstein,"""Once there was a tree...and she loved a littl...",0,Hardcover,64.0,4.37,789681,15694,The Giving Tree,Childrens|Childrens|Picture Books|Classics|Fic...,1.0,1.0,1.0
3,Dan Brown,An ingenious code hidden in the works of Leona...,0,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,Fiction|Mystery|Thriller,1.0,1.0,3.0
4,Lewis Carroll|John Tenniel|Martin Gardner,""" I can't explain myself, I'm afraid, sir,"" sa...",0,Paperback,239.0,4.07,411153,9166,Alice's Adventures in Wonderland & Through the...,Classics|Fantasy|Fiction|Childrens,1.0,1.0,2.0


In [56]:
#Split in half and categorize the top half as 1 (good books) and the bottom half as 0 (bad books)
half = len(books)/2

#Categorize good and bad books
books.loc[:half, 'quality'] = 1
books.loc[half:, 'quality'] = 0

#Make sure the quality feature is a category
books['quality'] = books['quality'].astype('category')

In [57]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38416 entries, 0 to 38415
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   authors       38416 non-null  object  
 1   desc          38416 non-null  object  
 2   edition       38416 non-null  object  
 3   format        38416 non-null  object  
 4   pages         38416 non-null  float64 
 5   rating        38416 non-null  float64 
 6   rating_count  38416 non-null  int64   
 7   review_count  38416 non-null  int64   
 8   title         38416 non-null  object  
 9   genres        38416 non-null  object  
 10  missing_ed    38416 non-null  float64 
 11  missing_desc  38416 non-null  float64 
 12  length        38416 non-null  category
 13  quality       38416 non-null  category
dtypes: category(2), float64(4), int64(2), object(6)
memory usage: 5.1+ MB


In [58]:
books.quality.value_counts()

1.0    19208
0.0    19208
Name: quality, dtype: int64

Saving the dataset for later use:

In [59]:
#Save the dataset
books.to_csv('./data/books.csv')