<h1><center>HW 2: Scrape Book Reviews</center></h1>

Choose one of your favorite book at goodreads.com (e.g. for the book "The Midninght Library", the URL is: https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true)

- Q1. Write a function to scrape all **reviews on the first page**, including, 
    - **reviewer's name** (see (1) in Figure)
    - **rating** (see (2) in Figure)
    - **date** (see (3) in Figure)
    - **review content** (see (4) in Figure. For each review text, need to get the **complete text**, i.e., need to expand the `more` button. Only text is needed, pictures are not needed. (Hint: take a close look at the content of the html file. Do you need Selenium?)
    - **likes** (see (5) in Figure). 
    - If a field, e.g. rating, is missing, use `None` to indicate it. 
- `Function Input`: book page URL
- `Function Output`: save all reviews as a DataFrame of columns (`reviewer, rating, date, review, like`). E.g., for the given URL, you can get 30 reviews.
- 10 points: 1 point for each element, 5 points for overall logic.
- **Note**: GoodReads occasionaly blocks request. You may get an error that is not due to your codes. You may try to run a couple of times. 
    
    
![alt text](GoodReads.png "GoodReads")

In [1]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd
 

def getReviews(page_url):
    
    # enter your codes here
    page = requests.get(page_url)
    
    if page.status_code == 200:
        soup = BeautifulSoup(page.content, 'html.parser')
        try:
            bookReviews = soup.find("div", {"id": "bookReviews"})
            contents = bookReviews.find_all("div", {"class": "friendReviews elementListBrown"})

            reviews = {'name': [], 'rating': [], 'date': [], 'review': [], 'like': []}

            for content in contents:
                reviewerName = content.find("a", {"class":"user"}).text
                reviews['name'].append(reviewerName)
                reviewDate = content.find("a", {"class": "reviewDate createdAt right"}).text
                reviews['date'].append(reviewDate)
                reviewStars = len(content.find_all("span", {"class": "staticStar p10"}))
                reviewText = content.find_all("span", {"class": "staticStar"})[0].text if content.find_all("span", {"class": "staticStar"}) else None
                reviews['rating'].append(reviewText)
                reviewElement = content.find("span", {"class": "readable"})
                reviewActText = reviewElement.find("span", {"style":"display:none"})
                if reviewActText == None:
                    reviewActText = reviewElement.text
                else:
                    reviewActText = reviewActText.text
                reviews['review'].append(reviewActText)
                # print(reviewerName, reviewDate, reviewText, reviewStars)
                reviewLikes = content.find("span", {"class":"likesCount"}).text.split()[0]
                reviews['like'].append(reviewLikes)
        except Exception as e:
            reviews = getReviews(page_url)
    else:
        print("Page content cannot be retreived")
    
    reviews = pd.DataFrame(reviews)
    return reviews    
    

In [2]:
page_url = 'https://www.goodreads.com/book/show/128029.A_Thousand_Splendid_Suns'
reviews = getReviews(page_url)
reviews

Unnamed: 0,name,rating,date,review,like
0,Stephen,it was amazing,"Jun 04, 2011",\n\nLike diamonds and roses hidden under bomb ...,614
1,Lucy,really liked it,"Dec 04, 2007",For the last two months I have been putting of...,1243
2,Anu,it was amazing,"Aug 29, 2007",August 2007I was riding in a cab in Bombay rec...,1570
3,Emily (Books with Emily Fox),it was amazing,"Jun 10, 2021",\nApparently this will break my heart even mor...,577
4,Tharindu Dissanayake,it was amazing,"Aug 01, 2021","\n""A face of grievances unspoken, burdens gone...",550
5,Matthew,it was amazing,"Jul 17, 2019",Amazing!Heart-Wrenching!Important!In a world w...,425
6,Emily May,really liked it,"Aug 29, 2012","It was a warm, sunny day in Montenegro and I w...",419
7,Daniel,it was ok,"Feb 22, 2009",It's apparently becoming something of a tradit...,382
8,Ahmad Sharabiani,really liked it,"Aug 09, 2008","A Thousand Splendid Suns, Khaled HosseiniA Tho...",313
9,Hend,it was amazing,"Feb 02, 2012","I have never cried while reading a book,like I...",265


- Q2 (Bonus). Modify the function you defined in Q1 to scrape **reviews on all the pages** for your url. Since a book may have multiple pages, use the **next** button at the end of each page (shown in the picture) to navigate to the next page. Continue scraping all the pages until the last page. `Please don't hardcode the pages in the URL because the number of pages varies by book`. (3 points. If URL is hardcoded, -2).

![alt text](GoodReads_bonus.png "GoodReads_bonus")

In [17]:
def getReviews_2(page_url): 
    
    pageCounter = 1
    nextPage = page_url + '&page=%d'%pageCounter
    
    page_urls = []
    
    while True:
        
        page = requests.get(nextPage)
        soup = BeautifulSoup(page.content, 'html.parser')
        if page.status_code == 200:
            nextEle = soup.find("a",{"class":"next_page"})
            if nextEle == None:
                break
            else:
                pageCounter += 1
                page_urls.append(nextPage)
                print(nextPage)
                nextPage = page_url+'&page=%d'%pageCounter
        else:
            nextEle = None
            
    reviews = pd.DataFrame()      
    for url in page_urls:
        reviews = reviews.append(getReviews(url), ignore_index=True)  
    return reviews             

In [19]:
# enter your url
page_url = 'https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true'

reviews=getReviews_2(page_url)
reviews

https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=1
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=2
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=3
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=4
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=5
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=6
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=7
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=8
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=9
https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&page=10


Unnamed: 0,name,rating,date,review,like
0,Nataliya,it was ok,"Jan 04, 2021",I liked this book until it suddenly decided to...,2644
1,Nicole,it was ok,"Nov 20, 2020",Everybody probably knows the premise of this b...,841
2,Nilufer Ozmekik,it was amazing,"Aug 21, 2020",Okay! No more words! This is one of the best s...,2919
3,Paromjit,it was amazing,"May 28, 2020",It is no secret that Matt Haig has mental heal...,1632
4,emma,liked it,"Oct 12, 2020",Okay. Picture this: you are about to bite into...,1689
...,...,...,...,...,...
295,Laura,really liked it,"Dec 16, 2020","Six stars for the premise. For me, this one st...",22
296,Israt Zaman Disha,really liked it,"Jan 12, 2021","""The only way to learn is to live.""A Beautiful...",22
297,Tanja Berg,it was amazing,"Oct 19, 2020","Nora Seeds decides to die. However, before she...",22
298,Sofia,it was amazing,"Dec 21, 2020","If you do not like a story, what's to do, well...",22
