# Feature Engineering

* [1. Genres](#genres)
* [2. Authors](#authors)
   * [2.1 Author Information](#auth_info)
   * [2.2 The More, The Merrier?](#merry)
* [3. Encodings](#encode)
   * [3.1 Format](#format)
   * [3.2 Book Length](#len)
* [4. NLP](#nlp)
   * [4.1 Multilingual?](#multi)
   * [4.2 Sentiment Analysis on Descriptions](#sentiment)

In [1]:
#Import relevant libraries
import numpy as np
import pandas as pd

In [2]:
#Change the jupyter displays
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [3]:
#Load dataset
books = pd.read_csv('./data/books.csv', index_col=0)

In [4]:
books.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,genres,missing_ed,missing_desc,length,quality
0,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,0,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,0.0,1.0,3.0,1.0
1,Douglas Adams,Seconds before the Earth is demolished to make...,0,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,Science Fiction|Fiction|Humor|Fantasy|Classics,1.0,1.0,1.0,1.0
2,Shel Silverstein,"""Once there was a tree...and she loved a littl...",0,Hardcover,64.0,4.37,789681,15694,The Giving Tree,Childrens|Childrens|Picture Books|Classics|Fic...,1.0,1.0,1.0,1.0
3,Dan Brown,An ingenious code hidden in the works of Leona...,0,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,Fiction|Mystery|Thriller,1.0,1.0,3.0,1.0
4,Lewis Carroll|John Tenniel|Martin Gardner,""" I can't explain myself, I'm afraid, sir,"" sa...",0,Paperback,239.0,4.07,411153,9166,Alice's Adventures in Wonderland & Through the...,Classics|Fantasy|Fiction|Childrens,1.0,1.0,2.0,1.0


## Genres <a name='genres'></a>
Encoding genres.

In [5]:
#Copy the books dataframe so the clean version stays untouched
genres = books.copy(deep=True)

#Turn the genres into a list to enable one hot encoding
genres['genres'] = genres['genres'].str.split('|')
genres['genres']

0                          [Fantasy, Young Adult, Fiction]
1        [Science Fiction, Fiction, Humor, Fantasy, Cla...
2        [Childrens, Childrens, Picture Books, Classics...
3                             [Fiction, Mystery, Thriller]
4                  [Classics, Fantasy, Fiction, Childrens]
                               ...                        
38411    [Fiction, Cultural, Lebanon, War, Academic, Sc...
38412    [Fiction, Young Adult, Young Adult, Coming Of ...
38413    [Fiction, Art, Feminism, Contemporary, Literar...
38414    [Nonfiction, Autobiography, Memoir, Biography,...
38415    [Nonfiction, Autobiography, Memoir, Biography,...
Name: genres, Length: 38416, dtype: object

In [6]:
from sklearn.preprocessing import MultiLabelBinarizer

#Create multi label binarizer
mlb = MultiLabelBinarizer()

#One hot encoding all the genres
genre_list = pd.DataFrame(mlb.fit_transform(genres['genres']), columns=mlb.classes_, index=genres.index)
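
For intuition, the same kind of multi-hot encoding can be sketched on a toy column with pandas alone. The three rows below are made up, and `Series.str.get_dummies` stands in for `MultiLabelBinarizer` purely for illustration. Note that a genre repeated within one book (as with "Childrens" in the real data) still produces a single 1.

```python
import pandas as pd

# Toy pipe-delimited genre strings (made-up rows, not taken from books.csv)
toy = pd.Series(['Fantasy|Fiction', 'Childrens|Childrens|Classics', 'Fiction'])

# One indicator column per distinct genre; duplicates within a row collapse to 1
toy_encoded = toy.str.get_dummies(sep='|')
print(toy_encoded)
```

Each row of the result has as many 1s as the book has distinct genres, which is the same shape of output that `genre_list` has on the full dataset.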

In [7]:
genre_list

Unnamed: 0,11th Century,13th Century,...,Zimbabwe,Zombies
0,0,0,...,0,0
1,0,0,...,0,0
2,0,0,...,0,0
3,0,0,...,0,0
4,0,0,...,0,0
...,...,...,...,...,...
38411,0,0,...,0,0
38412,0,0,...,0,0
38413,0,0,...,0,0
38414,0,0,...,0,0

[38416 rows x 814 columns]


In [8]:
#Check that each book has at least one genre (axis=1 sums across each row)
sum(genre_list.sum(axis=1)<1)

0

In [9]:
#Check that each genre shows up at least once (axis=0 sums down each column)
sum(genre_list.sum(axis=0)<1)

0
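
Because the two checks differ only in `axis`, a toy frame with made-up values may help keep them apart: `axis=1` adds across the columns and gives one total per book, while `axis=0` adds down the rows and gives one total per genre.

```python
import pandas as pd

# Made-up one-hot frame: 3 books (rows) x 2 genres (columns)
toy = pd.DataFrame({'Fiction': [1, 0, 1], 'Poetry': [0, 0, 1]})

genres_per_book = toy.sum(axis=1)   # one value per row (book)
books_per_genre = toy.sum(axis=0)   # one value per column (genre)

# The middle book has no genre at all, so the per-book check flags one row
books_without_genre = int((genres_per_book < 1).sum())
# Every genre is used by at least one book, so the per-genre check flags nothing
unused_genres = int((books_per_genre < 1).sum())
print(books_without_genre, unused_genres)
```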

In [10]:
#Calculate how many genres apply to only one book
sum(genre_list.sum(axis=0)==1)

106

With 814 genres, the one-hot matrix is very wide, and 106 of those genres appear on only one book. Features this sparse give a model almost nothing to learn from, so I will engineer the genres further by:
1. Keeping the one-hot encoding only for the 15 most common genres
2. Collapsing all remaining genres into a single "Other" indicator, set to 1 if the book has any genre outside the top 15
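
The two steps above can be sketched on a small made-up frame. The helper name `keep_top_k` and the toy genres are hypothetical and exist only for this illustration, with `k=2` standing in for the top 15.

```python
import pandas as pd

def keep_top_k(onehot: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the k most frequent indicator columns and flag everything else as 'Other'."""
    top = onehot.sum().sort_values(ascending=False).head(k).index
    rest = onehot.drop(columns=top)
    out = onehot[top].copy()
    # 1 if the row has at least one of the dropped (rarer) columns, else 0
    out['Other'] = (rest.sum(axis=1) > 0).astype(int)
    return out

# Made-up one-hot frame with two common genres and two rare ones
toy = pd.DataFrame({
    'Fiction': [1, 1, 1, 0],
    'Fantasy': [1, 0, 1, 0],
    'Tarot':   [0, 0, 0, 1],
    'Chess':   [0, 1, 0, 0],
})
reduced = keep_top_k(toy, k=2)
print(reduced)
```

A book can have both a top-k genre and the "Other" flag set, so the columns are not mutually exclusive.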

In [11]:
#Look at the top 15 genres proportion
genre_list.sum().sort_values(ascending=False).head(15) / len(genres)

Fiction               0.542222
Fantasy               0.282981
Romance               0.255519
Young Adult           0.207882
Nonfiction            0.165478
Historical            0.148844
Historical Fiction    0.134970
Contemporary          0.124662
Classics              0.122970
Mystery               0.119768
Cultural              0.104358
Paranormal            0.103108
Science Fiction       0.097876
Childrens             0.080748
Literature            0.074240
dtype: float64

In [12]:
#Get only the top 15 genres
top15df = genre_list[genre_list.sum().sort_values(ascending=False).head(15).index]
top15 = list(top15df.columns)

In [13]:
#Get dataframe with the rest of the genres
other_gen = genre_list.drop(top15, axis=1)

#Create a new feature ("Other" genre) that shows if book has any of the non-top15 genres
other_gen['Other'] = other_gen.sum(axis=1)
other_gen.loc[(other_gen.Other != 0), 'Other'] = 1

#Add the "Other" feature to the top15 genre dataframe
top15df = top15df.join(other_gen[['Other']], how='inner')

In [14]:
top15df.sum(axis=0)

Fiction               20830
Fantasy               10871
Romance                9816
Young Adult            7986
Nonfiction             6357
Historical             5718
Historical Fiction     5185
Contemporary           4789
Classics               4724
Mystery                4601
Cultural               4009
Paranormal             3961
Science Fiction        3760
Childrens              3102
Literature             2852
Other                 35201
dtype: int64

In [15]:
#Merge the dataframes together and drop the genres column
genres = genres.join(top15df).drop(columns=['genres'])

In [16]:
genres.head()

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,missing_ed,missing_desc,length,quality,Fiction,Fantasy,Romance,Young Adult,Nonfiction,Historical,Historical Fiction,Contemporary,Classics,Mystery,Cultural,Paranormal,Science Fiction,Childrens,Literature,Other
0,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,0,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,0.0,1.0,3.0,1.0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,Douglas Adams,Seconds before the Earth is demolished to make...,0,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,1.0,1.0,1.0,1.0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1
2,Shel Silverstein,"""Once there was a tree...and she loved a littl...",0,Hardcover,64.0,4.37,789681,15694,The Giving Tree,1.0,1.0,1.0,1.0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1
3,Dan Brown,An ingenious code hidden in the works of Leona...,0,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,1.0,1.0,3.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
4,Lewis Carroll|John Tenniel|Martin Gardner,""" I can't explain myself, I'm afraid, sir,"" sa...",0,Paperback,239.0,4.07,411153,9166,Alice's Adventures in Wonderland & Through the...,1.0,1.0,2.0,1.0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0


In [17]:
#Save dataframe with all the genres one-hot encoded
genre_list.to_csv('./data/all_genres_encoded.csv')
#Save a more meaningful version of the genres dataframe with only top 15 and non-top15
genres.to_csv('./data/genres.csv')

## Authors <a name='authors'></a>
Because there are so many unique authors, any single author is unlikely to have much impact on the model. As such, I'm extracting what I can from the author feature and then dropping the raw column.

In [18]:
#Copy to keep original clean
authors = genres.copy(deep=True)

#Separate out the first author and add it in 'first_author' column
authors['first_author'] = authors['authors'].str.partition('|').iloc[:,0]

#Split book authors so they're each a separate element in a list
authors['authors'] = authors['authors'].str.split('|')

In [19]:
#Quick function - to delete if not used later
def longest(list1):
    longest_list = max(len(elem) for elem in list1)
    return longest_list

#Check the highest number of authors
longest(authors.authors)

51

### Author Information <a name='auth_info'></a>
In this section, I'll gather as much information about each book's authors as I can.

In [20]:
import itertools

#Flatten the per-book author lists into one long list and convert to DataFrame
auth_list = pd.DataFrame(list(itertools.chain.from_iterable(authors.authors)), columns=['author'])

In [21]:
auth_list

Unnamed: 0,author
0,J.K. Rowling
1,Mary GrandPré
2,Douglas Adams
3,Shel Silverstein
4,Dan Brown
...,...
52320,Alicia Erian
52321,Siri Hustvedt
52322,Avi Steinberg
52323,Mimi Baird


In [22]:
len(auth_list.author.unique())

23259

In [23]:
#Look at top 10 authors with most books on the list
auth_list.value_counts().head(10) 

author         
Stephen King       146
Agatha Christie     95
Nora Roberts        94
Cassandra Clare     90
James Patterson     90
Terry Pratchett     81
Neil Gaiman         81
Anonymous           76
Erin Hunter         61
Meg Cabot           60
dtype: int64

There are 23,259 unique authors in this dataset, and the one with the most books is Stephen King. Mr. King is a popular author of horror fiction who has written over 60 novels and 200 short stories, so it's understandable that he leads the list with 146 books.

In [24]:
#Create author information dataframe: number of books per author
#Counting on the 'author' Series gives a flat index, so there is no MultiIndex to remove
auth_info = auth_list['author'].value_counts().rename('nbooks').to_frame()

#### Ratings and Number of Books
In this portion, I'm making a big assumption: that the first author listed is the primary and most important author (if there's more than one).

I'm going to add the average rating of each author as two features. Since some books in this dataset have more than one author, I'm going to add one feature indicating the average book rating for the first author, assumed to be the main author. I'll then add another feature for the average of all the authors' average book ratings.

While I add the average author rating features, I'll also add two other features indicating the number of books the author has written: one for the first author, who's assumed to be the main one, and another as an average of all participating authors' number of books.

In [25]:
#Store the average rating and page count of all books each author has written (jointly and individually)
#Iterate through all unique authors
for author in auth_list.author.unique():
    #Locate all books listing this author (plain substring match, not regex)
    mask = genres.authors.str.contains(author, regex=False)
    #Calculate the average rating and average page count of those books
    auth_info.loc[author, 'rating'] = np.average(genres.loc[mask, 'rating'])
    auth_info.loc[author, 'avg_pages'] = np.average(genres.loc[mask, 'pages'])

#Save author information
auth_info.to_csv('./data/author_info.csv')
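
One caveat with `str.contains` is that it does a substring match, so an author whose name is contained in another author's name (a hypothetical "Avi" versus "Avi Steinberg") would also pick up the longer-named author's books. A possible alternative, sketched here on made-up data, is to expand each book into one row per author and group on the exact name. This is an illustration rather than the code that produced the outputs in this notebook, and unlike `np.average`, `mean()` skips missing values.

```python
import pandas as pd

# Made-up books; 'authors' is a list per book, as it would be after str.split('|')
toy = pd.DataFrame({
    'authors': [['Avi'], ['Avi Steinberg'], ['Avi', 'Jane Doe']],
    'rating':  [4.0, 3.0, 5.0],
    'pages':   [100.0, 400.0, 200.0],
})

# One row per (book, author) pair, then average per exact author name
per_author = toy.explode('authors')
stats = per_author.groupby('authors')[['rating', 'pages']].mean()
stats['nbooks'] = per_author.groupby('authors').size()
print(stats)
```

On the toy data, "Avi" is credited with two books and "Avi Steinberg" with one, whereas a substring match on "Avi" would have counted all three.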

In [26]:
#Look at authors with lowest and highest average ratings
auth_info[(auth_info.rating==auth_info.rating.max()) | (auth_info.rating==auth_info.rating.min())]

Unnamed: 0_level_0,nbooks,rating
author,Unnamed: 1_level_1,Unnamed: 2_level_1
Victoria Foyt,1,2.09
Jo-Anne McArthur,1,4.84


In [27]:
#Add first author's average book rating feature into dataframe (look up by author name)
authors['fauthor_rating'] = authors['first_author'].map(auth_info['rating'])

#Add first author's number of books feature
authors['fauthor_nbooks'] = authors['first_author'].map(auth_info['nbooks'])

In [28]:
#Calculate authors' average ratings and average number of books
authors['authors_rating'] = [auth_info.loc[authors.authors[i], 'rating'].mean() for i in range(len(authors))]
authors['authors_nbooks'] = [auth_info.loc[authors.authors[i], 'nbooks'].mean() for i in range(len(authors))]
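
To make the difference between the first-author features and the all-author features concrete, here is a small sketch on made-up numbers. The lookup table `toy_info` and its values are hypothetical. The point is only that the first-author feature is a single lookup, while the all-author feature averages the lookups for every name on the book.

```python
import pandas as pd

# Hypothetical per-author table (index = author name)
toy_info = pd.DataFrame(
    {'rating': [4.0, 3.0], 'nbooks': [10, 2]},
    index=['Author A', 'Author B'],
)

# Two made-up books: one co-written, one solo
toy_books = pd.DataFrame({'authors': [['Author A', 'Author B'], ['Author B']]})

# First-author feature: look up only the first name in each list
first = toy_books['authors'].str[0]
toy_books['fauthor_rating'] = first.map(toy_info['rating'])

# All-author feature: one row per (book, author), look up, then average per book
exploded = toy_books['authors'].explode()
toy_books['authors_rating'] = exploded.map(toy_info['rating']).groupby(level=0).mean()
print(toy_books)
```

For a solo book the two features coincide. They only differ when co-authors have different averages.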

### The More, the Merrier? <a name='merry'></a>
Some books have multiple authors. In fact, one book even has 51 authors listed. To extract some of this information, I'll create the following features regarding the number of authors:

1. Number of authors for each book
2. Whether the book has more than one author (0 for no, 1 for yes)
3. Whether the book has more than three authors (0 for no, 1 for yes)
4. Whether the book has more than five authors (0 for no, 1 for yes)
5. Whether the book has more than ten authors (0 for no, 1 for yes)
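
As a quick illustration of the features listed above, the sketch below builds them on made-up author lists. The column names mirror the ones used in this notebook. Note that each flag is a strict "more than", so `a3plus` means four or more authors.

```python
import pandas as pd

# Made-up author lists of length 1, 2, 4 and 11
toy = pd.DataFrame({'authors': [['A'], ['A', 'B'], list('ABCD'), list('ABCDEFGHIJK')]})

# .str.len() also works on list-valued columns and returns each list's length
toy['nauthors'] = toy['authors'].str.len()

# Strictly-greater-than thresholds, stored as 0/1 integers
for threshold in (1, 3, 5, 10):
    toy[f'a{threshold}plus'] = (toy['nauthors'] > threshold).astype(int)
print(toy.drop(columns='authors'))
```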

In [29]:
#New column to store the number of authors for each book
authors['nauthors'] = authors['authors'].str.len()

In [30]:
#Create features for multiple authors, converting True and False to 1s and 0s
authors['a1plus'] = (authors['nauthors'] > 1).astype(int)
authors['a3plus'] = (authors['nauthors'] > 3).astype(int)
authors['a5plus'] = (authors['nauthors'] > 5).astype(int)
authors['a10plus'] = (authors['nauthors'] > 10).astype(int)

## Encoding <a name='encode'></a>
### Format <a name='format'></a>
One-hot encode the 6 book formats.

In [31]:
authors.format.unique()

array(['Paperback', 'Hardcover', 'Digital', 'Audio', 'Other', 'Missing'],
      dtype=object)

In [32]:
#Create dummy dataframe for the book formats (drop_first=True drops the first of the 6 categories, leaving 5 columns)
dummies = pd.get_dummies(authors.format, prefix='fmt', drop_first=True)

#Merge dummy dataframe
authors = authors.join(dummies)
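
The effect of `drop_first=True` is easiest to see on a tiny made-up column. With it, the first category in sorted order (here "Audio") gets no column of its own and becomes the implicit baseline, represented by a row of all zeros.

```python
import pandas as pd

# Made-up format column with three categories
toy = pd.Series(['Paperback', 'Hardcover', 'Audio', 'Paperback'], name='format')

full = pd.get_dummies(toy, prefix='fmt')                      # one column per category
reduced = pd.get_dummies(toy, prefix='fmt', drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
# The 'Audio' row has no 1 (or True) in any remaining column
print(reduced.iloc[2].tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models. Tree-based models are usually indifferent to it.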

In [33]:
authors

Unnamed: 0,authors,desc,edition,format,pages,rating,rating_count,review_count,title,missing_ed,missing_desc,length,quality,Fiction,Fantasy,Romance,Young Adult,Nonfiction,Historical,Historical Fiction,Contemporary,Classics,Mystery,Cultural,Paranormal,Science Fiction,Childrens,Literature,Other,first_author,fauthor_rating,fauthor_nbooks,authors_rating,authors_nbooks,nauthors,a1plus,a3plus,a5plus,a10plus,fmt_Audio,fmt_Digital,fmt_Hardcover,fmt_Missing,fmt_Other,fmt_Paperback
0,"[J.K. Rowling, Mary GrandPré]",There is a door at the end of a silent corrido...,0,Paperback,870.0,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,0.0,1.0,3.0,1.0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,J.K. Rowling,4.244412,34,4.388873,20.000000,2,1,0,0,0,0,0,0,0,0,1
1,[Douglas Adams],Seconds before the Earth is demolished to make...,0,Paperback,193.0,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,1.0,1.0,1.0,1.0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,Douglas Adams,4.156957,23,4.156957,23.000000,1,0,0,0,0,0,0,0,0,0,1
2,[Shel Silverstein],"""Once there was a tree...and she loved a littl...",0,Hardcover,64.0,4.37,789681,15694,The Giving Tree,1.0,1.0,1.0,1.0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,Shel Silverstein,4.241818,11,4.241818,11.000000,1,0,0,0,0,0,0,1,0,0,0
3,[Dan Brown],An ingenious code hidden in the works of Leona...,0,Paperback,481.0,3.81,1668594,43699,The Da Vinci Code,1.0,1.0,3.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,Dan Brown,3.818571,14,3.818571,14.000000,1,0,0,0,0,0,0,0,0,0,1
4,"[Lewis Carroll, John Tenniel, Martin Gardner]",""" I can't explain myself, I'm afraid, sir,"" sa...",0,Paperback,239.0,4.07,411153,9166,Alice's Adventures in Wonderland & Through the...,1.0,1.0,2.0,1.0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,Lewis Carroll,4.085000,18,4.077143,9.666667,3,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38411,"[Etel Adnan, Georgina Kleege]","Translated from the French by Georgina Kleege,...",0,Paperback,106.0,3.92,387,39,Sitt Marie Rose,1.0,1.0,1.0,0.0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,1,Etel Adnan,3.920000,1,3.920000,1.000000,2,1,0,0,0,0,0,0,0,0,1
38412,[Alicia Erian],Thirteen-year-old Jasira wants what every girl...,0,Paperback,336.0,3.60,3529,531,Towelhead,1.0,1.0,2.0,0.0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,Alicia Erian,3.600000,1,3.600000,1.000000,1,0,0,0,0,0,0,0,0,0,1
38413,[Siri Hustvedt],"A brilliant, provocative novel about an artist...",1,Hardcover,368.0,3.67,5827,816,The Blazing World,0.0,1.0,2.0,0.0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,Siri Hustvedt,3.767500,4,3.767500,4.000000,1,0,0,0,0,0,0,1,0,0,0
38414,[Avi Steinberg],Avi Steinberg is stumped. After defecting from...,0,Hardcover,399.0,3.51,3717,661,Running the Books: The Adventures of an Accide...,1.0,1.0,2.0,0.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,Avi Steinberg,3.510000,1,3.510000,1.000000,1,0,0,0,0,0,0,1,0,0,0


### Book Length <a name='len'></a>
It wouldn't make much sense to impute a mean for all the books recorded with 0 pages, and there are too many of them to check their page counts individually. So that these incorrect 0-page entries don't skew the data, I'll categorize the books as 'short', 'medium', or 'long' based on the number of pages, while keeping the 0-page books as their own 'zero' category. Since some books have a disproportionately large number of pages, I'll split at the quartiles: books at or below the 25th percentile are 'short', those at or above the 75th percentile are 'long', and everything in between is 'medium'.
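
The binning rule can be tried out on a handful of made-up page counts first. This is a sketch under the assumption that the quartiles are computed over all rows, zeros included. `np.select` is used here only as a compact way to express the same conditions.

```python
import numpy as np
import pandas as pd

# Made-up page counts, including one book recorded with 0 pages
toy = pd.DataFrame({'pages': [0, 30, 50, 120, 200, 300, 310, 900]})

# 25th and 75th percentiles over all rows (zeros included)
short = np.percentile(toy['pages'], 25)
long = np.percentile(toy['pages'], 75)

# Conditions are checked in order; anything unmatched falls back to 'medium'
conditions = [toy['pages'] == 0, toy['pages'] <= short, toy['pages'] >= long]
toy['length'] = np.select(conditions, ['zero', 'short', 'long'], default='medium')
print(toy)
```

Because the zero check comes first, a 0-page book is never labelled 'short', even though it falls below the 25th percentile.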

In [None]:
authors.pages.describe()

In [None]:
#Calculate the 25th and 75th percentiles as the upper and lower bound for 'short' and 'long', respectively
short = np.percentile(authors.pages, 25)
long = np.percentile(authors.pages, 75)

#Categorize each book based on its number of pages, and save it into the 'length' feature
#Start from 'medium' so the numeric codes already stored in 'length' are overwritten
authors['length'] = 'medium'
authors.loc[(authors.pages == 0), 'length'] = 'zero'
authors.loc[((authors.pages <= short) & (authors.pages != 0)), 'length'] = 'short'
authors.loc[(authors.pages >= long), 'length'] = 'long'

In [None]:
#Convert length feature as categorical
authors['length'] = authors['length'].astype('category')

In [None]:
#Check dataframe to make sure the values were input correctly
authors.length.value_counts()

In [None]:
#One hot encoding the length variables
dummies = pd.get_dummies(authors.length, prefix='len', drop_first=True)

#Merge dummy dataframe
authors = authors.join(dummies)

## Saving the Data <a name='save'></a>

In [34]:
authors.to_csv('./data/authors_postsave.csv')

In [36]:
#Delete the unneeded columns
authors.drop(columns=['first_author', 'authors', 'title', 'desc', 'format'], inplace=True)

In [38]:
#Save dataset
authors.to_csv('./data/final.csv')