# Webscraping book reviews
This notebook documents the process of extracting reviews and their metadata (reviewer, review date and rating) from a book review website.
<br/>

While learning to scrape data from the [Goodreads](https://www.goodreads.com) website, I broadened my understanding of the following:
1. Reading HTML tags using the Inspect tool in Google Chrome
2. Using the BeautifulSoup library
<br/>

### Table of Contents
1. Import Libraries
2. Parsing Goodreads Website
3. Extracting the reviews' metadata
4. Extracting the reviews
5. Creating a dataframe with the reviews

#### 1. Import Libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np

#### 2. Parsing Goodreads Website

In [2]:
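url = "https://www.goodreads.com/book/show/37976541-bad-blood"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early if the request did not succeed
page = response.text
doc = BeautifulSoup(page, "html.parser")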

#### 3. Extract the Reviewer's Name, Rating and Review date

In [3]:
book_rating = doc.find_all(class_="reviewHeader uitext stacked") # Obtain the reviewer's name, rating and review date

username = []
ratings = []
review_date = []

for rating in book_rating:
    try:
        user = rating.find(class_="user").get_text()
        rating_stars = rating.find(class_="staticStar p10").get_text()
        date = rating.find(class_="reviewDate createdAt right").get_text()

        username.append(user)
        ratings.append(rating_stars)
        review_date.append(date)
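    except AttributeError:  # find() returned None because a field is missing
        continue  # Skip this header so the three lists stay the same length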
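
The lookups above pass several class names as one string. As a minimal sketch of what that means in BeautifulSoup, the snippet below runs on a small hand-written HTML fragment (the markup and reader names are assumptions for illustration, not copied from Goodreads). A multi-word `class_` string matches only when the tag's `class` attribute is exactly that string, whereas a CSS selector matches the same classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three classes, written in two different orders
html = """
<div class="reviewHeader uitext stacked"><a class="user">Reader A</a></div>
<div class="stacked uitext reviewHeader"><a class="user">Reader B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string is compared with the exact attribute value
exact = soup.find_all(class_="reviewHeader uitext stacked")
exact_names = [tag.find(class_="user").get_text() for tag in exact]

# A CSS selector only requires that all three classes are present
by_selector = soup.select("div.reviewHeader.uitext.stacked")
selector_names = [tag.find(class_="user").get_text() for tag in by_selector]

print(exact_names)
print(selector_names)
```

If the site ever reorders or adds class names, the exact-string lookup would silently return nothing, so a selector may be the more forgiving choice.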

#### 4. Extract each individual review on the page

In [5]:
book_review = doc.find_all(class_="reviewText stacked") # Obtain the review text

reviews = []

for review in book_review:
    try:
        # Long reviews keep their full text in a hidden span
        full_review = review.find(style="display:none").get_text()
    except AttributeError:
        # Short reviews have no hidden span, so read the visible text instead
        full_review = review.find(class_="readable").get_text()
    reviews.append(full_review)
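
The hidden-versus-visible fallback can also be written without relying on an exception. The sketch below assumes a simplified version of the page structure (the HTML fragment and the helper name `extract_review_text` are hypothetical) and checks explicitly whether the hidden span exists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one long review with a hidden full text, one short review
html = """
<div class="reviewText stacked">
  <span class="readable">
    <span>Short preview...</span>
    <span style="display:none">Long review, shown after clicking more.</span>
  </span>
</div>
<div class="reviewText stacked">
  <span class="readable"><span>Brief review.</span></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_review_text(block):
    """Return the hidden full text if present, else the visible text, else None."""
    hidden = block.find(style="display:none")
    if hidden is not None:
        return hidden.get_text(strip=True)
    readable = block.find(class_="readable")
    return readable.get_text(strip=True) if readable is not None else None


sample_reviews = [
    extract_review_text(block)
    for block in soup.find_all(class_="reviewText stacked")
]
print(sample_reviews)
```

Returning `None` for a block with no readable text keeps one entry per review, which helps the review list stay aligned with the metadata lists.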

#### 5. Combine the above data into a dataframe

In [10]:
# Combine the four lists into a dataframe, one column per list
goodreads_reviews = pd.DataFrame({
    "username": username,
    "ratings": ratings,
    "review_date": review_date,
    "reviews": reviews,
})
print(goodreads_reviews)

                          username          ratings   review_date  \
0                  Michael Perkins   it was amazing  May 29, 2018   
1                       Bill Gates  really liked it  Dec 03, 2018   
2                           Roxane  really liked it  Jun 21, 2018   
3                            Ilona   it was amazing  Jun 18, 2018   
4                   Always Pouting   it was amazing  Jan 08, 2020   
5                           Julie   really liked it  Mar 04, 2019   
6   Chelsea (chelseadolling reads)  really liked it  Apr 02, 2019   
7                           carol.         liked it  Sep 01, 2019   
8                      BlackOxford  really liked it  Dec 14, 2018   
9                    Andrew Garvin   it was amazing  May 23, 2018   
10                          Rincey  really liked it  Jul 16, 2018   
11             Dr. Appu Sasidharan   it was amazing  Dec 05, 2019   
12                           JanB    it was amazing  Jan 18, 2019   
13                        Michelle