# <center>HW2 : Web Scraping </center>

**Instructions**: 
- Please read the problem description carefully
- Make sure to complete all requirements (shown as bullets) . In general, it would be much easier if you complete the requirements in the order as shown in the problem description
- Follow the Submission Instruction to submit your assignment.
- Code of academic integrity:
    - **Each assignment needs to be completed independently. This is NOT group assignment**. 
    - Never ever copy others' work (even with minor modification, e.g. changing variable names)
    - If you generate code using large lanaguage models (although it is not encouraged), make sure to adapt the generated code to meet all requirements and it is executable.
    - Anti-Plagiarism software will be used to check similarities between all submissions.
    - Check Syllabus for more details.

# Q1. Collecting Movie Reviews

Write a function `getReviews(url, webdriver = None)` to scrape all **reviews on the first page**, including, 
- **title** (see (1) in Figure)
- **reviewer's name** (see (2) in Figure)
- **date** (see (3) in Figure)
- **rating** (see (4) in Figure)
- **review content** (see (5) in Figure)
    - For each review, if the full text is not shown, first click the expander icon (shown in (7)) to expand the review. 
    - Hint. You can first select all expander icons on the page and click each of them. The expander icon can be selected by CSS Selector `div.ipl-expander div.expander-icon-wrapper`
    - Then collect the **complete review text**.
- **helpful** (see (6) in Figure). 


Requirements:
- `Function Input`:
    - `page URL`: the URL string
    - `web driver`: if you use Selenium or Playwright, pass the initialized web driver. In other words, your function should work with an initialized web driver of any web browser.
- `Function Output`: save all reviews as a DataFrame of columns (`title, reviewer, rating, date, review, helpful`). For the given URL, you can get 25 reviews.
- If a field, e.g. rating, is missing, use `None` to indicate it. 

    


![alt text](IMDB.png "IMDB")

# Q2 (Bonus) Scrape Dynamic Content


- Expand your function defined in Q1 to include an argument `N` for the minimum number of reveiws to be collected, i.e., `get_N_review(url, webdriver = None, N = 100)`. 
- When called, this function can scrape **at least N reviews** by clicking the `Load More` button at the end of the page continously.

In [22]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


In [23]:
def getReviews(page_url, webdriver):
    
    reviews = []
    
    # add your code here
    
    webdriver.get(page_url)
    time.sleep(2)  # Wait for the page to load
    
    # Expand all reviews by clicking the expander icons
    expanders = webdriver.find_elements(By.CSS_SELECTOR, "div.ipl-expander div.expander-icon-wrapper")
    for expander in expanders:
        try:
            webdriver.execute_script("arguments[0].click();", expander)
            time.sleep(1)  
        except:
            pass  
    
    # Collect reviews
    review_elements = webdriver.find_elements(By.CSS_SELECTOR, "div.review-container")
    for review_element in review_elements:
        # Extract the required elements
        title = review_element.find_element(By.CSS_SELECTOR, "a.title").text.strip()
        reviewer = review_element.find_element(By.CSS_SELECTOR, "span.display-name-link a").text.strip()
        date = review_element.find_element(By.CSS_SELECTOR, "span.review-date").text.strip()
        
        # Handle the rating with a try-except block if it's missed
        try:
            rating = review_element.find_element(By.CSS_SELECTOR, "span.rating-other-user-rating span").text.strip()
        except:
            rating = None
        
        review_content = review_element.find_element(By.CSS_SELECTOR, "div.content .text.show-more__control").text.strip()
        helpful = review_element.find_element(By.CSS_SELECTOR, "div.actions.text-muted").text.replace(" people found this helpful", "").strip()
        
        reviews.append({
            "title": title,
            "reviewer": reviewer,
            "date": date,
            "rating": rating,
            "review": review_content,
            "helpful": helpful
        })
    
    # Convert the list of dictionaries to a DataFrame
    reviews_df = pd.DataFrame(reviews)
    
    return reviews_df


In [24]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions 

# initialize the web drive
# executable_path = '../notes/Web_Scraping/driver/geckodriver'
driver = webdriver.Chrome()

page_url = 'https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0'

# Test the function
reviews = getReviews(page_url, driver)
driver.quit()


In [25]:
reviews

Unnamed: 0,title,reviewer,date,rating,review,helpful
0,This is slightly different to the other review...,scottedwards-87359,26 May 2022,10,If you were a late teen or in your early twent...,"5,329 out of 5,610 found this helpful. Was thi..."
1,The truly epic blockbuster we needed.,Top_Dawg_Critic,23 May 2022,10,"Wow. The first Top Gun is a classic, and as we...","2,736 out of 3,092 found this helpful. Was thi..."
2,Let me just say...,lovefalloutkindagamer,26 May 2022,10,"I was reluctantly dragged into the theater, th...","1,885 out of 2,086 found this helpful. Was thi..."
3,Best Sequel yet,GusherPop,25 May 2022,10,In one of the more memorable lines in the orig...,"1,223 out of 1,442 found this helpful. Was thi..."
4,The real cinema experience!,alexglimbergwindh,30 May 2022,10,If there's any movie that deserves to be seen ...,"1,071 out of 1,240 found this helpful. Was thi..."
5,This is why we go to the movies,dtucker86,27 May 2022,10,This is one sequel that looked like it would n...,"983 out of 1,157 found this helpful. Was this ..."
6,Flying High,DarkVulcan29,27 May 2022,10,"Top Gun (1986) made Tom Cruise a star, and now...",641 out of 828 found this helpful. Was this re...
7,What an excellent sequel,r96sk,25 May 2022,9,"What an excellent sequel - I, in fact, like it...",658 out of 823 found this helpful. Was this re...
8,Fake Imdb reviews artificially upping the rati...,imseeg,29 May 2022,5,"Almost 90 percent of all the reviews have a 8,...",143 out of 803 found this helpful. Was this re...
9,"Great Flight Sequences, Cliche-Ridden Plot",Stoshie,18 June 2022,7,I don't share everyone's unbridled enthusiasm ...,577 out of 783 found this helpful. Was this re...


In [21]:
# for Bonus

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


def getReviews(page_url, webdriver, N=100):
    
    reviews = []
    
    # add your code here
    
    webdriver.get(page_url)
    time.sleep(2)  
    res=[]
    # Loop to click the "Load More" button until at least N reviews are loaded
    while True:
        try:
            #load_more_button 
            load_more_button = webdriver.find_element(By.CSS_SELECTOR, "button.ipl-load-more__button")
            webdriver.execute_script("arguments[0].scrollIntoView(true);", load_more_button)
            time.sleep(5)  # Allow time for the button to become clickable
            load_more_button.click()
            time.sleep(5)  # Allow time for new reviews to load
            review_elements = webdriver.find_elements(By.CSS_SELECTOR, "div.review-container")
            if len(review_elements) >= N:
                res=review_elements
                break
        except Exception as e:
            print("Error or end of reviews reached:", e)
            break

    for review_element in res:  # Process the first N reviews
        if len(reviews) >= N:  # Make sure if the for loop will not exceed the N 
            break  
        title = review_element.find_element(By.CSS_SELECTOR, "a.title").text.strip()
        reviewer = review_element.find_element(By.CSS_SELECTOR, "span.display-name-link a").text.strip()
        date = review_element.find_element(By.CSS_SELECTOR, "span.review-date").text.strip()

        try:
            rating = review_element.find_element(By.CSS_SELECTOR, "span.rating-other-user-rating span").text.strip()
        except:
            rating = None

        # Expand the review text if necessary
        try:
            expander = review_element.find_element(By.CSS_SELECTOR, "div.ipl-expander div.expander-icon-wrapper")
            webdriver.execute_script("arguments[0].click();", expander)
            time.sleep(1)  # Wait for the text to expand
        except:
            pass  # No expander found or not clickable

        review_content = review_element.find_element(By.CSS_SELECTOR, "div.content .text.show-more__control").text.strip()

        try:
            helpful = review_element.find_element(By.CSS_SELECTOR, "div.actions.text-muted").text.split('Was this review helpful?')[0]
        except:
            pass

        reviews.append({
            "title": title,
            "reviewer": reviewer,
            "date": date,
            "rating": rating,
            "review": review_content,
            "helpful": helpful
        })

    # Convert to DataFrame
    reviews_df = pd.DataFrame(reviews)
    return reviews_df

driver = web_driver()
getReviews('https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0', driver)


Unnamed: 0,title,reviewer,date,rating,review,helpful
0,This is slightly different to the other review...,scottedwards-87359,26 May 2022,10,If you were a late teen or in your early twent...,"5,329 out of 5,610 found this helpful."
1,The truly epic blockbuster we needed.,Top_Dawg_Critic,23 May 2022,10,"Wow. The first Top Gun is a classic, and as we...","2,736 out of 3,092 found this helpful."
2,Let me just say...,lovefalloutkindagamer,26 May 2022,10,"I was reluctantly dragged into the theater, th...","1,885 out of 2,086 found this helpful."
3,Best Sequel yet,GusherPop,25 May 2022,10,In one of the more memorable lines in the orig...,"1,885 out of 2,086 found this helpful."
4,The real cinema experience!,alexglimbergwindh,30 May 2022,10,If there's any movie that deserves to be seen ...,"1,071 out of 1,240 found this helpful."
...,...,...,...,...,...,...
95,Same as original with deleted scenes,gmvyd,15 June 2022,1,Well that was the Tom cruise show for sure. Ze...,39 out of 112 found this helpful.
96,Amazing at times.,jeriahswillgdp,5 June 2022,7,I went into this film with very high expectati...,80 out of 112 found this helpful.
97,"Just as boring as it's predecessor, but with a...",CriticsVoiceVideo,21 June 2022,1,Not really much is different here in this sequ...,38 out of 109 found this helpful.
98,An Epic Blockbuster Even Better Than The Original,cadillac20,24 May 2022,9,Top Gun has been an 80's staple since it first...,52 out of 106 found this helpful.
