<div style="background-color:#ffcc00;"><span style="color:navy;">Importing Libraries</span></div>

In [1]:
import pandas as pd
import numpy as np

<div style="background-color:#ffcc00;"><span style="color:navy;">Loading the dataset and initial analysis</span></div>

In [2]:
books = pd.read_csv('../data/book_data.csv', on_bad_lines='skip')
#on_bad_lines='skip' : lines with too many fields (e.g. a csv line with too many commas) are dropped from the returned DataFrame instead of raising an exception.
#It replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0.
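To make the bad-line handling concrete, here is a minimal sketch on a made-up in-memory CSV (the rows and the `sample` name are illustrative and not taken from `book_data.csv`). It assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available.

```python
import io

import pandas as pd

# Hypothetical CSV text: the third data row has one field too many.
raw = io.StringIO(
    "book_title,book_rating\n"
    "Book A,4.1\n"
    "Book B,3.9\n"
    "Book C,4.5,unexpected\n"
    "Book D,4.0\n"
)

# on_bad_lines="skip" drops the malformed row instead of raising a ParserError.
sample = pd.read_csv(raw, on_bad_lines="skip")
print(sample)
```

Skipping is silent, so it can be worth comparing the row count with the line count of the raw file to see how many lines were lost.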

In [3]:
print("There are {} rows and {} columns in the dataset.".format(books.shape[0], books.shape[1]))

There are 54301 rows and 12 columns in the dataset.


In [4]:
#columns
np.array(books.columns)

array(['book_authors', 'book_desc', 'book_edition', 'book_format',
       'book_isbn', 'book_pages', 'book_rating', 'book_rating_count',
       'book_review_count', 'book_title', 'genres', 'image_url'],
      dtype=object)

<span style="color:#747678; font-size:14px;">The columns are <strong>'book_authors', 'book_desc', 'book_edition', 'book_format', 'book_isbn', 'book_pages', 'book_rating', 'book_rating_count', 'book_review_count', 'book_title', 'genres', 'image_url'</strong>.</span>

In [5]:
books.head()

Unnamed: 0,book_authors,book_desc,book_edition,book_format,book_isbn,book_pages,book_rating,book_rating_count,book_review_count,book_title,genres,image_url
0,Suzanne Collins,Winning will make you famous. Losing means cer...,,Hardcover,9780440000000.0,374 pages,4.33,5519135,160706,The Hunger Games,Young Adult|Fiction|Science Fiction|Dystopia|F...,https://images.gr-assets.com/books/1447303603l...
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,Paperback,9780440000000.0,870 pages,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,https://images.gr-assets.com/books/1255614970l...
2,Harper Lee,The unforgettable novel of a childhood in a sl...,50th Anniversary,Paperback,9780060000000.0,324 pages,4.27,3745197,79450,To Kill a Mockingbird,Classics|Fiction|Historical|Historical Fiction...,https://images.gr-assets.com/books/1361975680l...
3,Jane Austen|Anna Quindlen|Mrs. Oliphant|George...,«È cosa ormai risaputa che a uno scapolo in po...,"Modern Library Classics, USA / CAN",Paperback,9780680000000.0,279 pages,4.25,2453620,54322,Pride and Prejudice,Classics|Fiction|Romance,https://images.gr-assets.com/books/1320399351l...
4,Stephenie Meyer,About three things I was absolutely positive.F...,,Paperback,9780320000000.0,498 pages,3.58,4281268,97991,Twilight,Young Adult|Fantasy|Romance|Paranormal|Vampire...,https://images.gr-assets.com/books/1361039443l...


<div style="background-color:#ffcc00;"><span style="color:navy;">Removing columns not needed for the analysis</span></div>

In [6]:
del books['book_format']

In [7]:
del books['book_isbn']

<div style="background-color:#ffcc00;"><span style="color:navy;">Checking for null cells</span></div>

In [8]:
#columns which contain null values and the number of null elements
null_counts = books.isnull().sum()
null_counts[null_counts > 0].sort_values(ascending=False) #null_counts in each column (sorted)

book_edition    48848
genres           3242
book_pages       2522
book_desc        1331
image_url         683
dtype: int64

In [9]:
#books = books[books['book_title'].notna()] #removing nulls in book_title column
print("rows before filtering null genres: {}".format(books.shape[0]))
books = books[books['genres'].notna()] #removing nulls in genres column
print("rows after filtering null genres: {}".format(books.shape[0]))

rows before filtering null genres: 54301
rows after filtering null genres: 51059
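The boolean mask used above can also be written with `DataFrame.dropna` and its `subset` argument. The sketch below uses a small made-up frame (`toy` and its values are hypothetical) to show that the two forms select the same rows, and that several columns can be checked at once.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing genre and one missing image URL.
toy = pd.DataFrame({
    "book_title": ["A", "B", "C", "D"],
    "genres": ["Fantasy", np.nan, "Fiction", "Classics"],
    "image_url": ["u1", "u2", None, "u4"],
})

# Same filter written two ways: boolean mask vs. dropna on one column.
masked = toy[toy["genres"].notna()]
dropped = toy.dropna(subset=["genres"])

# dropna can require several columns to be non-null in one call.
both = toy.dropna(subset=["genres", "image_url"])

print(len(masked), len(dropped), len(both))
```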


<div style="background-color:#ffcc00;"><span style="color:navy;">Removing duplicate rows (based on title)</span></div>

In [10]:
print("rows before filtering duplicate row based on title: {}".format(books.shape[0]))

rows before filtering duplicate row based on title: 51059


In [11]:
#keep=False drops every row whose title occurs more than once, including the first occurrence
books.drop_duplicates(subset="book_title", keep=False, inplace=True)

In [12]:
print("rows after filtering duplicate row based on title: {}".format(books.shape[0]))

rows after filtering duplicate row based on title: 41837
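Note that `keep=False` removes every row whose title is repeated, not just the extra copies, which is why "The Hunger Games" (index 0) no longer appears in later output. The following sketch, on made-up titles, contrasts it with `keep="first"`; the `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: "Dune" and "Emma" each appear twice.
toy = pd.DataFrame({
    "book_title": ["Dune", "Dune", "Emma", "Ulysses", "Emma"],
    "book_edition": ["1st", "2nd", "1st", "1st", "Deluxe"],
})

# keep=False: all rows with a repeated title are dropped.
all_removed = toy.drop_duplicates(subset="book_title", keep=False)

# keep="first": one row per title survives (the first one seen).
first_kept = toy.drop_duplicates(subset="book_title", keep="first")

print(list(all_removed["book_title"]))
print(list(first_kept["book_title"]))
```

Which variant is appropriate depends on whether repeated titles are considered noise (drop them all) or legitimate books listed in several editions (keep one).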


<div style="background-color:#ffcc00;"><span style="color:navy;">Removing the "book_pages" column</span></div>

In [13]:
del books['book_pages']
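The cell above deletes `book_pages` outright. If the page count were to be kept as a number instead, one option would be to pull the digits out of strings such as "374 pages". This is only a sketch on a made-up Series (`pages` and `page_counts` are hypothetical names), not what the notebook does.

```python
import pandas as pd

# Hypothetical values in the same "<n> pages" style as the dataset, plus a missing one.
pages = pd.Series(["374 pages", "870 pages", None, "1 page"])

# Extract the leading digits as strings; rows without a match stay missing.
digits = pages.str.extract(r"(\d+)", expand=False)

# Convert to numbers, then to the nullable integer dtype so missing values survive.
page_counts = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(page_counts)
```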

In [14]:
books.head()

Unnamed: 0,book_authors,book_desc,book_edition,book_rating,book_rating_count,book_review_count,book_title,genres,image_url
1,J.K. Rowling|Mary GrandPré,There is a door at the end of a silent corrido...,US Edition,4.48,2041594,33264,Harry Potter and the Order of the Phoenix,Fantasy|Young Adult|Fiction,https://images.gr-assets.com/books/1255614970l...
9,J.R.R. Tolkien,لجزء الثالث من ملحمة جيه أر أر تولكين الرائعة ...,Hobbit Movie Tie-in Boxed set,4.59,99793,1652,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,Fantasy|Fiction|Classics,https://images.gr-assets.com/books/1346072396l...
11,Douglas Adams,Seconds before the Earth is demolished to make...,,4.21,1155911,23919,The Hitchhiker's Guide to the Galaxy,Science Fiction|Fiction|Humor|Fantasy|Classics,https://images.gr-assets.com/books/1388282444l...
12,Shel Silverstein,"""Once there was a tree...and she loved a littl...",,4.37,789681,15694,The Giving Tree,Childrens|Childrens|Picture Books|Classics|Fic...,https://images.gr-assets.com/books/1174210942l...
14,Dan Brown,An ingenious code hidden in the works of Leona...,,3.81,1668594,43699,The Da Vinci Code,Fiction|Mystery|Thriller,https://images.gr-assets.com/books/1303252999l...


<div style="background-color:#ffcc00;"><span style="color:navy;">Lowercasing the "book_authors", "book_title" and "book_desc" columns</span></div>

In [15]:
books['book_authors'] = books['book_authors'].str.lower() #assign back so the lowercased values are kept
books['book_authors']

1        j.k. rowling|mary grandpré
9                    j.r.r. tolkien
11                    douglas adams
12                 shel silverstein
14                        dan brown
                    ...            
54296                 howard megdal
54297                 howard megdal
54298                 howard megdal
54299        mimi baird|eve claxton
54300                    leah price
Name: book_authors, Length: 41837, dtype: object

In [16]:
books['book_title'] = books['book_title'].str.lower() #assign back so the lowercased values are kept
books['book_title']

1                harry potter and the order of the phoenix
9        j.r.r. tolkien 4-book boxed set: the hobbit an...
11                    the hitchhiker's guide to the galaxy
12                                         the giving tree
14                                       the da vinci code
                               ...                        
54296    taking the field: a fan's quest to run the tea...
54297    the baseball talmud: koufax, greenberg, and th...
54298    wilpon's folly - the story of a man, his fortu...
54299    he wanted the moon: the madness and medical ge...
54300    the anthology and the rise of the novel: from ...
Name: book_title, Length: 41837, dtype: object

In [17]:
books['book_desc'] = books['book_desc'].str.lower() #assign back so the lowercased values are kept
books['book_desc']

1        there is a door at the end of a silent corrido...
9        لجزء الثالث من ملحمة جيه أر أر تولكين الرائعة ...
11       seconds before the earth is demolished to make...
12       "once there was a tree...and she loved a littl...
14       an ingenious code hidden in the works of leona...
                               ...                        
54296    in this fearless and half-crazy story, howard ...
54297    from the icons of the game to the players who ...
54298                                                  NaN
54299    soon to be a major motion picture, from brad p...
54300    the anthology and the rise of the novel brings...
Name: book_desc, Length: 41837, dtype: object
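`Series.str.lower()` returns a new Series and leaves the DataFrame untouched, so the lowercased text only persists if it is assigned back to the column. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data; the missing description mirrors the NaN seen in book_desc.
toy = pd.DataFrame({
    "book_title": ["The Giving Tree", "Twilight"],
    "book_desc": ["Once there was a tree", None],
})

# Calling .str.lower() alone produces a new Series; toy is unchanged.
lowered = toy["book_title"].str.lower()
unchanged_title = toy["book_title"][0]

# Assigning back stores the result; missing values stay missing.
for col in ["book_title", "book_desc"]:
    toy[col] = toy[col].str.lower()

print(toy)
```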

<div style="background-color:#ffcc00;"><span style="color:navy;">Keeping only rows that have an "image_url"</span></div>

In [18]:
books = books[books['image_url'].notna()] #removing nulls in image_url column

In [19]:
print("rows after filtering null based on image url: {}".format(books.shape[0]))

rows after filtering null based on image url: 41498


<div style="background-color:#ffcc00;"><span style="color:navy;">Analysis on "book_edition" column</span></div>

<div style="background-color:cyan;"><span style="color:navy;">Filtering tasks</span></div>

<div style="background-color:#ffcc00;"><span style="color:navy;">Functions for filtering based on different features</span></div>