# Data Wrangling

* [1. Loading Data](#loading)
* [2. Data Cleaning](#data-cleaning)
    * [2.1 Missing Values](#missing)
    * [2.2 Duplicates](#duplicates)
    * [2.3 Illogical Data](#illogical)
* [3. Adding Categories](#categories)
    * [3.1 Book Length](#length)
    * [3.2 Good or Bad Book](#goodbad)

# 1. Loading Data <a name='loading'></a>

In [11]:
#Load the relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [12]:
#Load the Kaggle goodreads dataset obtained from
#https://www.kaggle.com/meetnaren/goodreads-best-books
#that's based on the Goodreads list of best books ever: https://www.goodreads.com/list/show/1.Best_Books_Ever
best_books = pd.read_csv('best_books.csv')

In [13]:
best_books.head()

Unnamed: 0,book_authors,book_desc,book_edition,book_format,book_isbn,book_pages,book_rating,book_rating_count,book_review_count,book_title,genres,image_url
0,Suzanne Collins,Winning will make you famous. Losing means cer...,,Hardcover,9780440000000.0,374 pages,4.33,5519135,160706,The Hunger Games,Young Adult|Fiction|Science Fiction|Dystopia|F...,https://images.gr-assets.com/books/1447303603l...
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,9780440000000.0,870 pages,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,https://images.gr-assets.com/books/1255614970l...
2,Harper Lee,The unforgettable novel of a childhood in a sl...,50th Anniversary,Paperback,9780060000000.0,324 pages,4.27,3745197,79450,To Kill a Mockingbird,Classics|Fiction|Historical|Historical Fiction...,https://images.gr-assets.com/books/1361975680l...
3,Jane Austen|Anna Quindlen|Mrs. Oliphant|George...,«È cosa ormai risaputa che a uno scapolo in po...,"Modern Library Classics, USA / CAN",Paperback,9780680000000.0,279 pages,4.25,2453620,54322,Pride and Prejudice,Classics|Fiction|Romance,https://images.gr-assets.com/books/1320399351l...
4,Stephenie Meyer,About three things I was absolutely positive.F...,,Paperback,9780320000000.0,498 pages,3.58,4281268,97991,Twilight,Young Adult|Fantasy|Romance|Paranormal|Vampire...,https://images.gr-assets.com/books/1361039443l...


In [14]:
best_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54301 entries, 0 to 54300
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   book_authors       54301 non-null  object 
 1   book_desc          52970 non-null  object 
 2   book_edition       5453 non-null   object 
 3   book_format        52645 non-null  object 
 4   book_isbn          41435 non-null  object 
 5   book_pages         51779 non-null  object 
 6   book_rating        54301 non-null  float64
 7   book_rating_count  54301 non-null  int64  
 8   book_review_count  54301 non-null  int64  
 9   book_title         54301 non-null  object 
 10  genres             51059 non-null  object 
 11  image_url          53618 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 5.0+ MB


In [15]:
#Let's first create a copy to keep the original intact
books = best_books.copy()

Let's quickly rename the columns to be more concise.

In [16]:
#Strip the 'book_' prefix from the column names
#(slicing is used because str.lstrip removes a set of characters, not a literal prefix)
books.columns = [col_name[len('book_'):] if col_name.startswith('book_') else col_name for col_name in books.columns]
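
One caveat worth illustrating: `str.lstrip` removes any leading characters drawn from the set it is given, not a literal prefix. That is harmless for the column names in this dataset, but the sketch below shows where the two approaches diverge. The column name `book_ok_flag` is made up for the illustration, and `strip_prefix` is a hypothetical helper that is not part of the notebook.

```python
def strip_prefix(name, prefix='book_'):
    """Remove a literal prefix (hypothetical helper, not part of the notebook)."""
    return name[len(prefix):] if name.startswith(prefix) else name

# 'book_ok_flag' is a made-up column name used only to expose the difference
columns = ['book_authors', 'book_pages', 'book_ok_flag', 'genres']

# lstrip treats 'book_' as the character set {b, o, k, _}
char_stripped = [name.lstrip('book_') for name in columns]

# Prefix removal only drops the exact leading 'book_'
prefix_stripped = [strip_prefix(name) for name in columns]

print(char_stripped)
print(prefix_stripped)
```

On Python 3.9 or later, `str.removeprefix('book_')` does the same job as the helper.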

# 2. Data Cleaning <a name='data-cleaning'></a>

This section first covers how missing values, duplicates, and illogical data (including outliers) are handled. Along the way, I'll clean up some of the data so that it's easier to work with later.

Before looking at any missing values, I'm also going to remove the unnecessary "pages" unit from the `pages` column so that the values can be numeric.

In [17]:
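#Remove the unit from the pages values
#(rstrip removes trailing characters found in ' pages'; digits are not among them, so the number is kept)
books.pages = books.pages.str.rstrip(' pages')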

#Convert the pages column to numbers
books.pages = pd.to_numeric(books.pages)
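
As a sketch of a slightly more defensive conversion, the digits can be extracted with a regular expression instead of stripping characters. The sample values below are made up, and `errors='coerce'` reflects an assumption about how unparseable entries should be handled (they become `NaN`).

```python
import pandas as pd

# Made-up sample values in the same shape as the raw pages column
raw_pages = pd.Series(['374 pages', '1 page', None, 'unknown'])

# Pull out the first run of digits; rows without digits become NaN
digits = raw_pages.str.extract(r'(\d+)', expand=False)

# Convert to numbers, turning anything unparseable into NaN
pages = pd.to_numeric(digits, errors='coerce')
print(pages)
```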

In [18]:
books.describe()

Unnamed: 0,pages,rating,rating_count,review_count
count,51779.0,54301.0,54301.0,54301.0
mean,337.662836,4.020027,43504.49,2011.60218
std,259.533005,0.3621,212657.2,7627.07287
min,0.0,0.0,0.0,0.0
25%,216.0,3.83,407.0,35.0
50%,310.0,4.03,2811.0,188.0
75%,400.0,4.22,12745.0,822.0
max,14777.0,5.0,5588580.0,160776.0


The ISBN feature of this dataset is not very useful since it's stored in scientific notation, which loses roughly half of each ISBN's digits. I'll therefore remove it, along with the `image_url` column, which I won't be using.

In [19]:
#Delete the isbn and image_url columns
books.drop(columns=['isbn', 'image_url'], inplace=True)

## 2.1 Missing Values <a name='missing'></a>

In [20]:
books.isna().sum()

authors             0
desc             1331
edition         48848
format           1656
pages            2522
rating              0
rating_count        0
review_count        0
title               0
genres           3242
dtype: int64

Let's look at some of the missing values for genres and page numbers, before we tackle the books with 0 ratings/pages/etc.

In [22]:
books[books.genres.isnull()]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
260,سید مرتضی مصطفوی,Mystical storyThe internal revolution of a wom...,فارسی,ebook,204.0,3.97,956,129,زندگی مه آلود پریا,
716,سید مرتضی مصطفوی,Philosophical story about human loneliness in ...,,,82.0,4.12,923,256,گم شده ای در مه,
1421,سید مرتضی مصطفوی,"رمان بلند ""سیمای شکسته پدر سالار"" که از تعلیق...",فارسی,ebook,367.0,3.55,722,7,سیمای شکسته پدر سالار,
1928,أحمد مراد,"""للمرة الثانية بعد ""فيرتيجو"" يتّخذ أحمد مراد م...",,Paperback,389.0,4.07,48474,4239,تراب الماس,
1989,Christine M. Knight,"“I’m not Mavis anymore.” For years, song bird ...",,Paperback,304.0,4.59,39,34,Song Bird: Matters of the Heart,
...,...,...,...,...,...,...,...,...,...,...
54225,Chuck Rogers|Don Pendleton,When American military personnel are found beh...,,Paperback,188.0,4.14,14,1,Crisis Nation,
54229,William Logan,"Talented William Logan, though he hails from D...",,ebook,12.0,2.67,6,1,Mex,
54236,Chris Pepple,,,Paperback,120.0,5.00,1,0,Look to See Me: A Collection of Reflections,
54259,Jackie Budd,Lively and wide-ranging text together with col...,Large Print,Paperback,64.0,3.67,3,2,The World of Horses,


In [23]:
books[books.pages.isnull()]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
1225,Gabrielle Estres,"“A captivating tale of love, power and betraya...",,ebook,,4.79,331,16,Captive,Romance|Fantasy|Romance|Paranormal Romance|Rom...
1296,Tom Clancy,2 cassettes / 2 hoursRead by F. Murray Abraham...,Abridged,Audio Cassette,,4.15,63105,956,Red Storm Rising,Fiction|Thriller|War|Military Fiction|War
1392,Roberta Pearce,"After an impoverished and indigent childhood, ...",,ebook,,4.23,153,31,A Bird Without Wings,Romance|Romance|Contemporary Romance|Contemporary
1424,Robyn Mundell|Stephan Lacast,One teen’s incredible journey may just blow hi...,,ebook,,4.27,1054,153,Brainwalker,Young Adult|Fantasy|Fiction|Adventure|New Adul...
1787,Chris A. Jones,Mankind had spent decades trying to overcome a...,,ebook,,4.14,521,1,Reversione: Reset the Future,Young Adult|Adventure|Mystery|Fantasy|New Adul...
...,...,...,...,...,...,...,...,...,...,...
54284,Tamiki Wakaki,,,Paperback,,4.24,141,1,The World God Only Knows 17,Sequential Art|Manga|Comics Manga
54285,Tamiki Wakaki,,,Paperback,,4.20,143,2,The World God Only Knows 18,Sequential Art|Manga|Comics Manga
54286,Tamiki Wakaki,,,Paperback,,4.34,182,2,The World God Only Knows 11,Sequential Art|Manga|Comics Manga
54288,Tamiki Wakaki,,,Paperback,,4.32,179,2,The World God Only Knows 12,Sequential Art|Manga|Comics Manga|Sequential Art


I'll be removing the rows with missing genres or pages, since those will be the primary features I'll be exploring. Some books are also missing several values at once, so let's first remove the rows with more than one missing feature. With the 10 remaining columns, `thresh=9` keeps only the rows that have at least 9 non-null values.

In [24]:
#Remove rows with more than 1 missing feature (keep rows with at least 9 of the 10 values present)
books.dropna(axis=0, thresh=9, inplace=True)
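
Because `thresh` is the minimum number of non-null values a row needs in order to be kept (not the number of missing values allowed), it is easy to misread. A minimal sketch on a made-up four-column frame:

```python
import numpy as np
import pandas as pd

# Made-up frame whose rows have 0, 1, 2, and 3 missing values respectively
toy = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': ['x', np.nan, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, np.nan],
    'd': ['p', 'q', 'r', 's'],
})

# Number of missing values per row
missing_per_row = toy.isna().sum(axis=1)

# thresh=3 keeps rows with at least 3 non-null values out of 4,
# i.e. rows with at most 1 missing value
kept = toy.dropna(axis=0, thresh=3)
print(missing_per_row.tolist())
print(kept.index.tolist())
```

By the same logic, `thresh=9` on the 10 columns of `books` keeps rows with at most one missing value.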

In [25]:
#Remove rows with missing pages or missing genres
books.dropna(axis=0, subset=['pages', 'genres'], how='any', inplace=True)

Now I'll fill the remaining missing values with the string 'missing'.

In [26]:
#Fill the missing values as 'missing'
books.fillna('missing', inplace=True)

In [27]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47760 entries, 0 to 54300
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   authors       47760 non-null  object 
 1   desc          47760 non-null  object 
 2   edition       47760 non-null  object 
 3   format        47760 non-null  object 
 4   pages         47760 non-null  float64
 5   rating        47760 non-null  float64
 6   rating_count  47760 non-null  int64  
 7   review_count  47760 non-null  int64  
 8   title         47760 non-null  object 
 9   genres        47760 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 4.0+ MB


In [28]:
books.rating_count.median()

3763.0

About 6,500 books were deleted (54,301 down to 47,760), but there is still a good number of books remaining. Now to really look at some of the actual values.

## 2.2 Duplicates <a name='duplicates'></a>

First I'll check whether there are any duplicates, and keep only one copy of each.

In [16]:
books[books.duplicated(keep='first')]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
6467,Kirsten Fullmer,Lizzie gave up her stressful job in Boston to ...,,Kindle Edition,297.0,4.25,630,16,Hometown Girl Forever,Young Adult|Contemporary|New Adult|Romance|Lov...
24460,Frances Hardinge,This is the story of a bear-hearted girl . . ....,,Hardcover,416.0,4.09,2773,532,A Skinful of Shadows,Fantasy|Historical|Historical Fiction|Young Ad...
29507,Annemarie O'Brien,Young Lara is being groomed in the family trad...,,Hardcover,208.0,3.99,630,126,Lara's Gift,Historical|Historical Fiction|Childrens|Middle...
47849,Chloe Neill,"Since Merit was turned into a vampire, and the...",,Paperback,350.0,4.2,11220,676,Wild Things,Fantasy|Urban Fantasy|Paranormal|Vampires|Fant...


In [17]:
books.title.value_counts()

1984                               17
American Gods                      13
Selected Poems                     13
A Christmas Carol                  13
The Hobbit                         13
                                   ..
Dawn of Night                       1
ورددت الجبال الصدى                  1
Komarr                              1
Treachery in Death                  1
How to Be an American Housewife     1
Name: title, Length: 39493, dtype: int64

In [18]:
books[books.title=='1984']

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
100,George Orwell|Erich Fromm,"Among the seminal texts of the 20th century, N...",Signet Classics,Mass Market Paperback,328.0,4.16,2420816,53773,1984,Classics|Fiction|Science Fiction|Science Ficti...
5161,George Orwell|Peter Hobley Davison,"'It was a bright cold day in April, and the cl...",,Paperback,326.0,4.16,2421887,53794,1984,Classics|Fiction|Science Fiction|Science Ficti...
7348,George Orwell,"Among the seminal texts of the 20th century, N...",,Kindle Edition,237.0,4.16,2421777,53793,1984,Classics|Fiction|Science Fiction|Science Ficti...
9987,George Orwell,"Among the seminal texts of the 20th century, N...",,Kindle Edition,237.0,4.16,2422197,53801,1984,Classics|Fiction|Science Fiction|Science Ficti...
13992,George Orwell|Alexandre Hubner|Heloísa Jahn,1984 é uma das obras mais influentes do século...,,Paperback,416.0,4.16,2422432,53809,1984,Classics|Fiction|Science Fiction|Science Ficti...
19074,George Orwell|أنور الشامي,على مدى سنوات طويلة ظلت رواية 1984 لجورج اوروي...,الثالثة,Paperback,352.0,4.16,2422645,53814,1984,Classics|Fiction|Science Fiction|Science Ficti...
21486,George Orwell,Nineteen Eighty-Four revealed George Orwell as...,,Hardcover,326.0,4.16,2422732,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
21918,George Orwell,1984 oferece hoje uma descrição quase realista...,,Paperback,327.0,4.16,2422748,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
22072,George Orwell|Amélie Audiberti,"De tous les carrefours important, le visage à ...",Folio #822,Mass Market Paperback,448.0,4.16,2422763,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...
22093,George Orwell,"Sepanjang hidupnya, Winston berusaha menjadi w...",,Paperback,408.0,4.16,2422763,53817,1984,Classics|Fiction|Science Fiction|Science Ficti...


Not only are there complete duplicates of entries, but there are also multiple entries for the same book in different formats and editions. Among those editions, the page counts are more or less in line with each other, and the ratings and rating counts are nearly identical. So for each title I'll keep the entry with the highest rating count. This way, the complete duplicates are also filtered out.

In [30]:
#Keep the duplicates with the highest rating count
books = books.sort_values(['title', 'rating_count', 'review_count'], ascending=True).drop_duplicates(subset='title', keep='last')
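
The sort-then-drop pattern can be checked on a small made-up frame. Note that matching on `title` alone also merges different books that happen to share a title. The second call below, which adds `authors` to the subset, is only an illustration of a stricter alternative (an assumption, not what this notebook does); it would in turn keep editions apart whenever their author strings differ.

```python
import pandas as pd

# Made-up rows: two editions of one book, and two different books sharing a title
toy = pd.DataFrame({
    'title': ['1984', '1984', 'Selected Poems', 'Selected Poems'],
    'authors': ['George Orwell', 'George Orwell', 'Poet A', 'Poet B'],
    'rating_count': [2420816, 2422763, 500, 900],
    'review_count': [53773, 53817, 20, 35],
})

# Sort ascending so the highest rating_count comes last within each title
ordered = toy.sort_values(['title', 'rating_count', 'review_count'])

# The notebook's approach: one row per title
by_title = ordered.drop_duplicates(subset='title', keep='last')

# Stricter alternative, shown for comparison only
by_title_author = ordered.drop_duplicates(subset=['title', 'authors'], keep='last')

print(by_title[['title', 'authors', 'rating_count']])
print(len(by_title), len(by_title_author))
```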

In [31]:
books.title.value_counts()

Cast in Silence                                                                     1
God's Debris: A Thought Experiment                                                  1
The Girl with Borrowed Wings                                                        1
The Interrogators: Inside the Secret War Against Al Qaeda                           1
If He's Tempted                                                                     1
                                                                                   ..
Winter Garden                                                                       1
Killing the Rising Sun: How America Vanquished World War II Japan                   1
The Hidden Window Mystery                                                           1
Standing for Something: 10 Neglected Virtues That Will Heal Our Hearts and Homes    1
Notes of a Dirty Old Man                                                            1
Name: title, Length: 38629, dtype: int64

In [33]:
books[books.duplicated()==True]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres


## 2.3 Illogical Data <a name='illogical'></a>

To find illogical data, meaning any values that don't make sense for what they're supposed to represent, I'm going to explore the actual values of each feature. Before that, however, I'm going to remove all books with fewer than 100 ratings, since this should either remove some of the "illogical" values, like a rating of 0 or 0 pages, or lend more weight to the ones that remain. I consider 100 ratings a reasonable minimum since the median of `rating_count` is 3763. I'm not using the mean here since some books have a disproportionately large `rating_count`, which skews the mean.

In [None]:
#Remove the books with fewer than 100 ratings
books.drop(index=books[books.rating_count<100].index, inplace=True)
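
To illustrate why the median is the safer reference point for a skewed feature like `rating_count`, consider a handful of made-up rating counts with one blockbuster among them:

```python
from statistics import mean, median

# Made-up rating counts; the last value plays the role of a blockbuster
rating_counts = [150, 900, 3800, 12000, 5500000]

count_mean = mean(rating_counts)      # pulled far upward by the single large value
count_median = median(rating_counts)  # stays at the middle value

print(count_mean, count_median)
```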

In [34]:
books.describe()

Unnamed: 0,pages,rating,rating_count,review_count
count,39484.0,39484.0,39484.0,39484.0
mean,345.925236,3.993602,34255.53,1639.313469
std,264.046829,0.277831,174991.8,6246.800121
min,0.0,2.09,100.0,0.0
25%,224.0,3.82,1094.0,79.0
50%,319.0,4.01,4039.0,263.0
75%,405.0,4.19,13978.25,885.25
max,14777.0,4.89,5588580.0,160776.0


It seems odd that there are books with 0 pages and also those with more than 2000.

In [53]:
books[books.pages > 2000]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
14080,Romain Rolland|م. ا. به‌آذین,"Complete in one volume:1. L'Aube (""Dawn"", 1904...",missing,Hardcover,2179.0,4.17,650,45,(ژان کریستف (دورۀ چهار جلدی,Novels|Fiction|Cultural|France|Literature|Clas...
21608,Cao Xueqin|Gladys Yang|Xianyi Yang,"Also known as Hong Lou Meng, this is arguably ...",4 volume box set,Paperback,2549.0,4.11,3173,263,A Dream of Red Mansions,Classics|Fiction|Cultural|China|Literature|His...
23569,George R.R. Martin,George R.R. Martin's A Song of Ice and Fire is...,missing,Paperback,5216.0,4.64,42043,1495,A Game of Thrones: The First 5 Books,Fantasy|Fiction
1730,George R.R. Martin,"For the first time, all five novels in the epi...",Re-Packaged Edition,Mass Market Paperback,5216.0,4.64,42029,1495,A Song of Ice and Fire,Fantasy|Fiction
4349,K.A. Applegate|Katherine Applegate,"Animorphs ""RM"" is an exciting series for young...",missing,Paperback,8245.0,4.34,3647,208,Animorphs,Science Fiction|Childrens|Young Adult|Fiction|...
...,...,...,...,...,...,...,...,...,...,...
11334,شمس الدين الذهبي|حسان عبد المنان,كتاب سير أعلام النبلاء يعتبر من أمتع كتب الترا...,missing,Hardcover,4683.0,4.35,453,32,سير أعلام النبلاء,Biography|Religion|Islam
20335,Sayed Qutb|سيد قطب,راخواطر وانطباعات من فترة عاشها سيد قطب في ظلا...,First-time author notes,Hardcover,4012.0,4.52,3210,133,في ظلال القرآن,Religion|Islam|Religion
49216,Abdul Rahman Munif|عبدالرحمن منيف,مدن الملح هي رواية عربية للروائي السعودي عبد ا...,missing,Paperback,2345.0,4.33,959,125,مدن الملح,Novels|Fiction
13275,Ibn Khaldun,"""The Muqaddimah,"" often translated as ""Introdu...",missing,Hardcover,3864.0,4.30,3048,246,مقدمة ابن خلدون,Nonfiction|Sociology|Classics|Politics|History...


In [54]:
books[books.pages == 0]

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres
25009,James L. Gillaspy,The broadcast interview of a young computer pr...,missing,Nook,0.0,3.81,306,22,A Larger Universe,Science Fiction|Fiction
22603,Arthur R.G. Solmssen,"Berlín, 1922. Reina la confusión en la capital...",missing,Mass Market Paperback,0.0,3.91,292,18,A Princess in Berlin,Fiction|Romance|Historical|Historical Fiction|...
47010,Jack Kilborn|Phil Gigante,"WELCOME TO SAFE HAVEN, POPULATION 907...Nestle...",missing,Audio CD,0.0,3.90,7982,957,Afraid,Horror|Thriller|Fiction|Mystery|Suspense
38704,NOT A BOOK,"In this presentation, Glenn Beck tells an audi...",missing,Audio CD,0.0,4.16,207,42,An Unlikely Mormon,Nonfiction|Biography|Christianity|Lds|Religion...
33835,Agatha Christie,missing,Box set,Paperback,0.0,4.20,207,10,And Then There Were None/The Secret Adversary/...,Mystery|Childrens
...,...,...,...,...,...,...,...,...,...,...
45421,Pet Torres,SYNOPSISValkyrie is a young girl who has been ...,missing,Nook,0.0,3.29,288,19,Valkyrie: The Vampire Princess,Paranormal|Vampires|Young Adult|Fantasy|Supern...
28817,Aaron Polson,While cruising a dark country road late one Sa...,missing,Nook,0.0,3.01,207,25,We are the Monsters,Horror|Fiction|Suspense
15224,Craig Thomas,An East German defector dies in a tragic accid...,missing,Paperback,0.0,3.35,129,5,Wildcat,Fiction|Thriller|Adventure
21469,P.G. Wodehouse,Table of ContentsList of Works by Genre and Ti...,missing,ebook,0.0,4.49,412,10,Works of P. G. Wodehouse,Fiction|Humor|Classics|Humor|Comedy


Some books with 0 pages are sets of books or audiobooks. I'm going to take some time to clean up the formats before dealing with the pages.

In [55]:
books.loc[books.pages==0].format.value_counts(dropna=False)

Paperback                31
ebook                    21
Audio CD                 18
Nook                     14
Hardcover                13
Audio                     8
Audio Cassette            6
Audiobook                 6
Mass Market Paperback     4
MP3 CD                    2
Boxed Set                 1
Board Book                1
Name: format, dtype: int64

I'll drop all boxed-set formats, since I only want to look at individual books.

In [56]:
#Create list of indices with boxed set formats that will be dropped
to_drop = books[(books.format=='Boxed Set') | (books.format=='Box Set')].index

#Drop the boxed sets
books.drop(index=to_drop, inplace=True)
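
The exact-match filter only catches those two spellings. If other variants exist (an assumption; the sample values below are made up), a case-insensitive pattern is one way to catch them:

```python
import pandas as pd

# Made-up format values, including spelling and whitespace variants
formats = pd.Series(['Boxed Set', 'Box Set', 'box set ', 'Paperback', 'missing'])

# Case-insensitive match on "box set" / "boxed set", ignoring surrounding whitespace
is_box_set = formats.str.strip().str.contains(r'^box(?:ed)?\s+set$', case=False, regex=True)

# Keep everything that is not a box set
remaining = formats[~is_box_set].tolist()
print(remaining)
```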

It looks like there are a lot of different formats. I'm going to be organizing them into 5 main categories (not counting the missing values): 
1. Paperback
2. Hardcover
3. Audio
4. Digital
5. Other

In [35]:
books.format.unique()

array(['Paperback', 'Kindle Edition', 'Hardcover',
       'Mass Market Paperback', 'ebook', 'Audible Audio', 'missing',
       'Audio CD', 'broché', 'Nook', 'Trade paperback', 'Capa Mole',
       'Mass Market', 'PDF ', 'MP3 CD', 'Unknown Binding', 'Audiobook',
       'Trade Paperback', 'Leather Bound', 'Library Binding',
       'eBook Kindle', 'Audio', 'Bolsillo', 'Graphic Novels',
       'Board Book', 'Comic', 'Hardback', 'Innbundet', 'Gebunden',
       'Poche', 'web', 'Audio Cassette', 'Digital', 'Broché', 'paper',
       'Board book', 'Broschiert', 'Taschenbuch', 'paperback',
       'hardcover', 'Hardcover-spiral', 'Softcover', 'Klappenbroschur',
       'Newsprint', 'Podiobook', 'Online Fiction', 'Audio CD ',
       'Tapa dura con sobrecubierta', 'Capa Dura', 'Unbound', 'Brochura',
       'Capa comum', 'Hardcover im Schuber', 'Podcast', 'Hardbound',
       'Bonded Leather', 'Mass Market Paperback ', 'Cofanetto',
       'Pasta blanda', 'Perfect Paperback', 'Pamphlet',
       'Textboo

In [94]:
#Add individual formats that mean paperback
paper = ['Capa Mole', 'capa mole', 'Softcover', 'Spiral-bound','Capa mole - 15,5 x 23 x 2cm', 'Poche', 'Broché', 'broché', 'Capa comum', 'Bìa mềm', 'Broschiert', 'Tapa blanda', 'Brossura', 'pocket', '文庫', 'Taschenbuch', 'Bolsillo','Pasta blanda']

#Add individual formats that mean hardcover
hard = ['Capa Dura','Leather Bound','Bonded Leather','Imitation Leather', 'Capa dura','Board Book','Board book', 'Board', 'Tapa dura con sobrecubierta', 'Kovakantinen', 'Innbundet']

#Add individual formats that suggest audio
audio = ['Podcast','CD-ROM', 'MP3 CD','Podiobook']

#Add individual formats that suggest digital
digital = ['ebook','Nook','PDF ', 'web','Wattpad']

#Create an empty list for other formats
other = []

#Iterate over all unique formats
for fmt in books.format.unique():
    #First check whether the format is already in one of the manually made lists
    if fmt not in paper and fmt not in hard and fmt not in audio and fmt not in digital:
        #Add the format to the appropriate category based on keywords
        fmt_lower = fmt.lower()
        if 'paper' in fmt_lower:
            paper.append(fmt)
        elif 'hard' in fmt_lower:
            hard.append(fmt)
        elif 'audio' in fmt_lower:
            audio.append(fmt)
        elif 'kindle' in fmt_lower or 'online' in fmt_lower or 'digital' in fmt_lower:
            digital.append(fmt)
        #Add everything else, except the 'missing' placeholder, to the other category
        elif 'missing' not in fmt:
            other.append(fmt)

In [97]:
#Replace each raw format with its corresponding format category
#(assigning the result back avoids relying on an inplace change to a selected column)
books['format'] = books['format'].replace(paper, 'Paperback')
books['format'] = books['format'].replace(hard, 'Hardcover')
books['format'] = books['format'].replace(audio, 'Audio')
books['format'] = books['format'].replace(digital, 'Digital')
books['format'] = books['format'].replace(other, 'Other')
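
The same keyword rules can also be expressed as a single function that maps one raw format to its category, which may be easier to test in isolation. This is only a sketch: `categorize_format`, `EXACT`, and `KEYWORDS` are hypothetical names, and the exact-match table is abbreviated (the full manual lists would still be needed for the non-English formats).

```python
# Abbreviated exact-match lookups (the full manual lists would go here)
EXACT = {
    'Capa Mole': 'Paperback', 'Softcover': 'Paperback', 'Poche': 'Paperback',
    'Capa Dura': 'Hardcover', 'Leather Bound': 'Hardcover',
    'Podcast': 'Audio', 'MP3 CD': 'Audio',
    'ebook': 'Digital', 'Nook': 'Digital',
}

# Keyword rules, checked in order
KEYWORDS = [
    ('paper', 'Paperback'), ('hard', 'Hardcover'), ('audio', 'Audio'),
    ('kindle', 'Digital'), ('online', 'Digital'), ('digital', 'Digital'),
]

def categorize_format(fmt):
    """Map one raw format string to a category (hypothetical helper)."""
    if fmt in EXACT:
        return EXACT[fmt]
    lowered = fmt.lower()
    for keyword, category in KEYWORDS:
        if keyword in lowered:
            return category
    # Leave the 'missing' placeholder untouched; everything else is 'Other'
    return fmt if 'missing' in fmt else 'Other'

samples = ['Mass Market Paperback', 'Hardback', 'Audio CD', 'eBook Kindle', 'Newsprint', 'missing']
results = [categorize_format(raw) for raw in samples]
print(list(zip(samples, results)))
```

It could then be applied to the whole column with `books['format'].map(categorize_format)`.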

In [98]:
books.format.unique()

array(['Paperback', 'Digital', 'Hardcover', 'Audio', 'Other', 'missing'],
      dtype=object)

# 3. Adding Categories <a name='categories'></a>

To make some of the features a little easier to work with, I'll be binning them into categories.

## 3.1 Book Length <a name='length'></a>
It wouldn't make much sense to impute the mean for all the books with 0 pages, and there are too many of them to check their page counts individually. So that the many incorrect 0-page books don't skew the data, I'll be categorizing the books as short, medium, or long based on the number of pages. Since some of the books have a disproportionately large number of pages, I'll split on the quartiles, with the middle 50% forming the medium range.

In [110]:
books.pages.describe()

count    38620.000000
mean       346.350129
std        257.073255
min          0.000000
25%        226.000000
50%        320.000000
75%        406.000000
max      14777.000000
Name: pages, dtype: float64

In [1]:
#Calculate the 25th and 75th percentiles once, instead of on every call
pages_q25 = np.percentile(books.pages, 25)
pages_q75 = np.percentile(books.pages, 75)

#Create function to categorize the book by book length
def length(pages):
    """ Returns 'short', 'medium', or 'long' based on where the page count falls among the pages percentiles.
        Returns 'short' if it's less than or equal to the 25th percentile, 'long' if it's
        greater than or equal to the 75th percentile, and 'medium' otherwise (strictly between the two).

    Parameters
    ----------
    pages: int or float, required
        The number of pages to be turned into a short, medium, or long category.

    Returns
    -------
    str
        One of 'short', 'medium', or 'long'.
    """

    #Determine which category the page count belongs to
    if pages <= pages_q25:
        return 'short'
    elif pages >= pages_q75:
        return 'long'
    else:
        return 'medium'

In [139]:
#Apply the function to categorize each book based on its number of pages, and save it into the 'length' feature
books['length'] = books.pages.apply(length)
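
For larger frames, the same binning can be done without a per-row Python call. The sketch below uses `np.select` on made-up page counts and keeps the same inclusive boundaries (at or below the 25th percentile is short, at or above the 75th is long):

```python
import numpy as np

# Made-up page counts
pages = np.array([0, 120, 226, 320, 406, 900, 14777], dtype=float)

lower = np.percentile(pages, 25)
higher = np.percentile(pages, 75)

# Conditions are evaluated in order; anything matching neither is 'medium'
labels = np.select(
    [pages <= lower, pages >= higher],
    ['short', 'long'],
    default='medium',
)
print(lower, higher, labels.tolist())
```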

In [140]:
#Convert the length feature to a categorical type
books['length'] = books['length'].astype('category')
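
If the ordering short < medium < long matters for later sorting or plotting, the category can be made ordered explicitly. A small sketch on made-up labels (the notebook itself uses an unordered category):

```python
import pandas as pd

# Made-up length labels
labels = pd.Series(['medium', 'short', 'long', 'medium'])

# Declare the categories in their natural order
length_type = pd.CategoricalDtype(categories=['short', 'medium', 'long'], ordered=True)
ordered_labels = labels.astype(length_type)

# Sorting and min/max now follow the declared order rather than the alphabet
sorted_labels = ordered_labels.sort_values().tolist()
print(sorted_labels)
print(ordered_labels.min(), ordered_labels.max())
```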

In [141]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38620 entries, 28854 to 26354
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   authors       38620 non-null  object  
 1   desc          38620 non-null  object  
 2   edition       38620 non-null  object  
 3   format        38620 non-null  object  
 4   pages         38620 non-null  float64 
 5   rating        38620 non-null  float64 
 6   rating_count  38620 non-null  int64   
 7   review_count  38620 non-null  int64   
 8   title         38620 non-null  object  
 9   genres        38620 non-null  object  
 10  length        38620 non-null  category
dtypes: category(1), float64(2), int64(2), object(6)
memory usage: 4.5+ MB


## 3.2 Good or Bad Book <a name='goodbad'></a>

Since the data now seems relatively ready, I'll be adding the main feature, which will be my dependent variable: whether a book is good or bad. This dataset is drawn from Goodreads' "Best Books Ever" list, so there is no reason to believe *any* of the books on this list are actually bad. Additionally, I admit that opinions of books are subjective; a book can be good for one person and bad for another. Moreover, books can't really be described in black and white as either 'good' or 'bad'.

With all that said, I will be forming two best-books lists: one in the order of the dataset, as determined by the voters of Goodreads, and another based purely on the books' ratings. The two lists differ only in their ordering. Each will be split evenly in half: the first half will be categorized as 'good' and the second half as 'bad'.

In [144]:
books.rating.describe()

count    38620.000000
mean         3.992995
std          0.276966
min          2.090000
25%          3.820000
50%          4.010000
75%          4.180000
max          4.890000
Name: rating, dtype: float64

In [150]:
#Create good/bad books list based on the ordering created by Goodreads' best books list
best_list = books.sort_index()

#Create good/bad books list based on books' ratings
best_rate = books.sort_values(by='rating', axis=0, ascending=False, ignore_index=True)
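
The even split described above is not shown in this section. One way it could be sketched is below, assuming a new column named `quality` (a hypothetical name) and assuming that the extra row goes to the second half when the row count is odd. `label_halves` is likewise a hypothetical helper.

```python
import numpy as np
import pandas as pd

def label_halves(ranked):
    """Label the first half of an already-ordered frame 'good' and the rest 'bad'.

    Hypothetical helper; the column name 'quality' is an assumption.
    """
    midpoint = len(ranked) // 2
    labelled = ranked.copy()
    # Use row position, not index labels, so any ordering of the frame works
    labelled['quality'] = np.where(np.arange(len(ranked)) < midpoint, 'good', 'bad')
    return labelled

# Made-up ratings, already sorted from highest to lowest
toy = pd.DataFrame({'title': list('ABCDE'), 'rating': [4.8, 4.5, 4.1, 3.9, 3.2]})
labelled_toy = label_halves(toy)
print(labelled_toy)
```

The same helper could be applied to both `best_list` and `best_rate`, since it relies only on row position.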