# Data Preprocessing

Importing libraries

In [1]:
import pandas as pd
import numpy as np


Loading the dataset

In [2]:
df = pd.read_csv(r'..\datasets\books_1.Best_Books_Ever.csv')
df.sample().T

Unnamed: 0,40778
bookId,490667.Sport
title,Sport: A Novel
series,
author,Mick Cochrane
rating,4.02
description,A nostalgic story about a Minnesota boy's sear...
language,English
isbn,9780816640850
genres,[]
characters,[]


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards            52478 non-null  object

Checking missing values

In [4]:
df.isna().sum()

bookId                  0
title                   0
series              29008
author                  0
rating                  0
description          1338
language             3806
isbn                    0
genres                  0
characters              0
bookFormat           1473
edition             47523
pages                2347
publisher            3696
publishDate           880
firstPublishDate    21326
awards                  0
numRatings              0
ratingsByStars          0
likedPercent          622
setting                 0
coverImg              605
bbeScore                0
bbeVotes                0
price               14365
dtype: int64
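
Raw counts are easier to compare as a share of all rows. The sketch below shows one way to do that on a small made-up frame; `toy` and its values are placeholders, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "series": [np.nan, "S1", np.nan, np.nan],
    "price": [1.0, np.nan, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same expression would rank columns such as `edition` and `series` by how sparse they are.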

Dropping rows with missing values in essential columns

In [5]:
df.dropna(subset=['coverImg', 'title', 'genres', 'author', 'description'], inplace=True)
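
Note that `genres` reports zero missing values, yet the sampled row above shows it as `[]`, so `dropna` will not remove books without genres. A minimal sketch of one possible extra filter follows, assuming the column holds string representations of Python lists (the toy data and the name `toy_nonempty` are illustrative only).

```python
import ast

import pandas as pd

# Toy frame; assumption: genres are stored as strings such as "['Fiction']" or "[]".
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['Fiction', 'Classics']", "[]", "['Poetry']"],
})

# Parse each string into a real list, then keep rows whose list is non-empty.
parsed = toy["genres"].map(ast.literal_eval)
has_genres = parsed.map(len) > 0
toy_nonempty = toy[has_genres]
print(toy_nonempty)
```

Whether empty genre lists should be dropped depends on how the column is used later; the notebook itself keeps them.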

Dropping duplicates (rows count as duplicates only when both `title` and `author` match)

In [6]:
df.drop_duplicates(subset=['title', 'author'], inplace=True)
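
`drop_duplicates` compares values exactly and keeps the first occurrence by default, so titles that differ only in case or surrounding whitespace survive. The sketch below contrasts exact matching with a hypothetical normalized comparison on made-up rows; the normalization step is an assumption, not something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Dune", "dune ", "Dune", "Emma"],
    "author": ["Frank Herbert", "Frank Herbert", "Frank Herbert", "Jane Austen"],
})

# Exact matching: "Dune" and "dune " are treated as different titles.
exact = toy.drop_duplicates(subset=["title", "author"])

# Hypothetical normalization: compare on stripped, lowercased keys,
# but keep the original spelling in the surviving rows.
key = toy[["title", "author"]].apply(lambda s: s.str.strip().str.lower())
normalized = toy[~key.duplicated()]

print(len(exact), len(normalized))
```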

Keeping only *English* books

In [7]:
df = df[df['language'] == 'English']
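
The equality filter also discards rows whose `language` is missing, because `NaN == 'English'` evaluates to `False`. The missing-value check above reported 3806 such rows in the raw data. A small illustration on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "language": ["English", np.nan, "French", "English"],
})

# dropna=False makes missing languages visible in the counts.
print(toy["language"].value_counts(dropna=False))

# Row "B" (missing language) is dropped together with the French row "C".
english = toy[toy["language"] == "English"]
print(english)
```

Inspecting `value_counts(dropna=False)` before filtering is a cheap way to see how many rows the step removes and why.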

Removing very short descriptions (75 characters or fewer)

In [8]:
df = df[df['description'].str.len() > 75]
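
The comparison is strict, so a description of exactly 75 characters is removed, and `str.len()` counts whitespace as well. The sketch below shows both effects on made-up strings; stripping whitespace first is a possible variation, not what the notebook does, and `MIN_LEN` is a name introduced here for illustration.

```python
import pandas as pd

MIN_LEN = 75  # threshold taken from the cell above

toy = pd.DataFrame({
    "description": [
        "x" * 75,                      # exactly at the threshold
        "x" * 76,                      # just above it
        "   " + "x" * 10 + " " * 80,   # mostly padding
    ],
})

# Same test as in the notebook: raw character count, strict inequality.
raw_mask = toy["description"].str.len() > MIN_LEN

# Variation (assumption): ignore leading and trailing whitespace.
stripped_mask = toy["description"].str.strip().str.len() > MIN_LEN

print(raw_mask.tolist(), stripped_mask.tolist())
```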

Saving the preprocessed dataset

In [11]:
df.to_csv(r'..\datasets\preprocessed.csv', index=False)
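
Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back. The backslash path above is Windows-specific; a rough sketch of a portable write plus a round-trip check follows, using `pathlib` and a temporary directory (the toy frame and the temporary location are assumptions made for illustration).

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "rating": [4.02, 3.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Path objects join with "/" on every operating system.
    out_path = Path(tmp) / "preprocessed.csv"
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same shape and column names.
same_shape = reloaded.shape == toy.shape
print(same_shape, list(reloaded.columns))
```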