### 2.2.1 &nbsp; Data pre-processing

- During web scraping, each book's information was stored in its own CSV file, so that an error partway through would not wipe out everything already collected
- We therefore first combine the per-book information and review CSVs within each genre, then combine all genres into 2 comprehensive CSVs
    - **Book_info.csv**
    - **Book_reviews.csv**
   
### !!! NOTE !!! Since the original web-scraping data is too large, we only provide a small sample in the submission for demonstration purposes, so running the notebook on that sample may not produce the results shown below

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
# read book genres
path = './book_stats/'

genre_folders = os.listdir(path)
print("The total genre number: ", len(genre_folders))
print()
print("The detailed genre names are: \n", genre_folders)

The total genre number:  40

The detailed genre names are: 
 ['art', 'biography', 'business', 'chick-lit', 'children-s', 'christian', 'classics', 'comics', 'contemporary', 'cookbooks', 'crime', 'ebooks', 'fantasy', 'fiction', 'gay-and-lesbian', 'graphic-novels', 'historical-fiction', 'history', 'horror', 'humor-and-comedy', 'manga', 'memoir', 'music', 'mystery', 'nonfiction', 'paranormal', 'philosophy', 'poetry', 'psychology', 'religion', 'romance', 'science', 'science-fiction', 'self-help', 'spirituality', 'sports', 'suspense', 'thriller', 'travel', 'young-adult']
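`os.listdir` returns every entry in arbitrary order, including stray files and hidden folders such as `.ipynb_checkpoints`. A minimal sketch of a more defensive listing follows; the helper name `list_genre_folders` and the demo directory contents are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def list_genre_folders(root):
    """Return sorted sub-directory names of root, skipping hidden entries and plain files."""
    return sorted(
        name for name in os.listdir(root)
        if not name.startswith('.') and os.path.isdir(os.path.join(root, name))
    )


# Self-contained demo on a throwaway directory (made-up contents)
with tempfile.TemporaryDirectory() as demo_root:
    for name in ['fantasy', 'art', '.ipynb_checkpoints']:
        os.mkdir(os.path.join(demo_root, name))
    # a stray file that should not be treated as a genre
    open(os.path.join(demo_root, 'notes.txt'), 'w').close()
    demo_genres = list_genre_folders(demo_root)

print(demo_genres)
```

Sorting also makes the row order of the combined files reproducible across machines.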


#### Combination of Book information & review files

In [3]:
# for each genre, combine the per-book info & review files
for genre in genre_folders:
    info_path = os.path.join(path, genre, 'basic_info')
    review_path = os.path.join(path, genre, 'book_reviews')

    # read every per-book CSV and concatenate once;
    # pd.concat raises ValueError on an empty folder instead of silently
    # reusing the previous genre's DataFrame
    book_info = pd.concat(
        [pd.read_csv(os.path.join(info_path, f)) for f in os.listdir(info_path)],
        ignore_index=True)
    book_reviews = pd.concat(
        [pd.read_csv(os.path.join(review_path, f)) for f in os.listdir(review_path)],
        ignore_index=True)

    # store each genre's combined data
    book_info.to_csv(os.path.join(path, genre, 'book_info.csv'), index=False)
    book_reviews.to_csv(os.path.join(path, genre, 'book_reviews.csv'), index=False)
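The per-genre combination can also be sketched as a small reusable helper. This is only an illustration under stated assumptions: `combine_csv_folder` is a hypothetical name, and the demo files and their `Title` values are made up, not taken from the scraped data.

```python
import os
import tempfile

import pandas as pd


def combine_csv_folder(folder):
    """Concatenate every CSV file in folder into one DataFrame (hypothetical helper)."""
    # sort for a reproducible row order and ignore non-CSV entries
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))
    if not csv_files:
        # nothing to combine: return an empty frame rather than raising
        return pd.DataFrame()
    frames = [pd.read_csv(os.path.join(folder, f)) for f in csv_files]
    return pd.concat(frames, ignore_index=True)


# Self-contained demo with two made-up per-book files and one stray file
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({'Title': ['A'], 'Rate': [4.1]}).to_csv(
        os.path.join(demo_dir, 'book_1.csv'), index=False)
    pd.DataFrame({'Title': ['B'], 'Rate': [3.7]}).to_csv(
        os.path.join(demo_dir, 'book_2.csv'), index=False)
    open(os.path.join(demo_dir, '.DS_Store'), 'w').close()
    demo_info = combine_csv_folder(demo_dir)

print(demo_info.shape)
```

Building a list of frames and calling `pd.concat` once avoids copying the growing DataFrame on every iteration.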

In [4]:
# combine every genre's book info & reviews into two overall DataFrames
info_frames = []
review_frames = []

for genre in genre_folders:
    info_frames.append(pd.read_csv(os.path.join(path, genre, 'book_info.csv')))
    review_frames.append(pd.read_csv(os.path.join(path, genre, 'book_reviews.csv')))

# concatenate once at the end instead of growing a DataFrame inside the loop
df_info = pd.concat(info_frames, ignore_index=True)
df_reviews = pd.concat(review_frames, ignore_index=True)

In [5]:
# fix a typo made when the feature names were created (stray trailing colon)
df_info.rename(columns={'Rating_Dist:': 'Rating_Dist'}, inplace=True)

# rename 'Rate' to 'Rating' for clearer naming
df_info.rename(columns={'Rate': 'Rating'}, inplace=True)
df_reviews.rename(columns={'Review_Rate': 'Review_Rating'}, inplace=True)
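By default, `DataFrame.rename` silently ignores keys that do not match any column, so a misspelled old name leaves the data unchanged without warning. The sketch below shows this on a made-up one-row frame (the values are illustrative only) and how `errors='raise'` surfaces the problem.

```python
import pandas as pd

# made-up one-row frame using the original column names
demo = pd.DataFrame({'Rating_Dist:': ['5:10|4:3'], 'Rate': [4.2]})

# both keys exist, so both columns are renamed
renamed = demo.rename(columns={'Rating_Dist:': 'Rating_Dist', 'Rate': 'Rating'})

# 'Ratee' is a deliberate misspelling: with the default errors='ignore' nothing changes
silently_ignored = demo.rename(columns={'Ratee': 'Rating'})

# errors='raise' turns the unmatched key into a KeyError
try:
    demo.rename(columns={'Ratee': 'Rating'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(renamed.columns), list(silently_ignored.columns), raised)
```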

#### Take a look at the two datasets

In [None]:
df_info.info()
df_info.head()

In [None]:
df_reviews.info()
df_reviews.head()

#### Store raw (uncleaned) datasets

In [14]:
df_info.to_csv('Book_info.csv', index=False)
df_reviews.to_csv('Book_reviews.csv', index=False)