## Web scraping approach using Selenium

To extract reviews for the movie "The Shawshank Redemption" from IMDb's website, we employed Selenium, a powerful browser automation tool commonly used for web scraping tasks involving dynamic content (Selenium Project 2022). The process begins by setting up the Selenium WebDriver, which initializes a Chrome browser instance to navigate IMDb's web pages and interact with dynamic elements (Selenium Project 2022). The `get_all_reviews()` function then utilized Selenium to dynamically load and extract reviews from the IMDb review section for the specified movie. This involved iteratively clicking the "Load More" button to fetch additional reviews until at least 100 reviews were retrieved, ensuring comprehensive data collection (Selenium Documentation 2022). Selenium's ability to handle dynamic content and simulate human browsing behavior was pivotal in navigating through IMDb's review pages and extracting the desired information.

The decision to use Selenium was driven by its robust automation capabilities, particularly its effectiveness in handling dynamic web content and interacting with JavaScript-based elements, which are prevalent on modern websites like IMDb (Selenium Project 2022). Selenium's flexibility and extensive functionality made it well-suited for navigating through IMDb's review section and extracting reviews reliably.

In [1]:
import pandas as pd
import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import re
import time
import numpy as np
from tqdm import tqdm
import os

In [2]:
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service as ChromeService
from bs4 import BeautifulSoup

def setup_driver():
    service = ChromeService(ChromeDriverManager().install())
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=service, options=options)
    return driver

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_all_reviews(driver, url):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    reviews = []
    
    # Looping until desired number of reviews is retrieved
    while len(reviews) < 100:
        # Find all review containers
        review_containers = soup.find_all('div', class_='lister-item-content')
        for review in review_containers:
            title = review.find('a', class_='title').text.strip()
            content = review.find('div', class_='text').text.strip()
            reviews.append({'Review Title': title, 'Review Content': content})
        
        # Checking if there's a "Load More" button
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//button[text()="Load More"]'))
        )
        if load_more_button:
            # Clicking the "Load More" button
            driver.execute_script("arguments[0].click();", load_more_button)
            # Wait for the new reviews to load
            time.sleep(2)
            # Update the page source
            soup = BeautifulSoup(driver.page_source, 'html.parser')
        else:
            break  # Breaking the loop if no more reviews can be loaded
    
    return reviews[:100]  # Returning only the first 100 reviews

driver = setup_driver()
url = 'https://www.imdb.com/title/tt0111161/reviews'
all_reviews = get_all_reviews(driver, url)
driver.quit()

reviews_df = pd.DataFrame(all_reviews)
display(reviews_df)

# Selecting specific columns from the existing DataFrame
new_df = reviews_df[['Review Title', 'Review Content']]

# Renaming columns for clarity
new_df.rename(columns={'Review Title': 'Title', 'Review Content': 'Text'}, inplace=True)

# Adding constant URL column
new_df['URL'] = "https://www.imdb.com/title/tt0111161/reviews"

# Adding a new column labeled "Human-written"
new_df['Label'] = "Human-written"

# Reordering the columns
new_df = new_df[['URL', 'Title', 'Text', 'Label']]

# Displaying the new DataFrame
display(new_df)


Unnamed: 0,Review Title,Review Content
0,Some birds aren't meant to be caged.,The Shawshank Redemption is written and direct...
1,An incredible movie. One that lives with you.,It is no wonder that the film has such a high ...
2,Don't Rent Shawshank.,I'm trying to save you money; this is the last...
3,This is How Movies Should Be Made,This movie is not your ordinary Hollywood flic...
4,A classic piece of unforgettable film-making.,"In its Oscar year, Shawshank Redemption (writt..."
...,...,...
95,never give up hope,"""The Shawshank Redemption"" should have won Bes..."
96,Masterpiece,"Shawshank Redemption, The (1994)**** (out of 4..."
97,Hope can set you free and so can this remarkab...,"One of my favorite movies ever,The Shawshank R..."
98,Two movies in one,The reason I became a member of this database ...


Unnamed: 0,URL,Title,Text,Label
0,https://www.imdb.com/title/tt0111161/reviews,Some birds aren't meant to be caged.,The Shawshank Redemption is written and direct...,Human-written
1,https://www.imdb.com/title/tt0111161/reviews,An incredible movie. One that lives with you.,It is no wonder that the film has such a high ...,Human-written
2,https://www.imdb.com/title/tt0111161/reviews,Don't Rent Shawshank.,I'm trying to save you money; this is the last...,Human-written
3,https://www.imdb.com/title/tt0111161/reviews,This is How Movies Should Be Made,This movie is not your ordinary Hollywood flic...,Human-written
4,https://www.imdb.com/title/tt0111161/reviews,A classic piece of unforgettable film-making.,"In its Oscar year, Shawshank Redemption (writt...",Human-written
...,...,...,...,...
95,https://www.imdb.com/title/tt0111161/reviews,never give up hope,"""The Shawshank Redemption"" should have won Bes...",Human-written
96,https://www.imdb.com/title/tt0111161/reviews,Masterpiece,"Shawshank Redemption, The (1994)**** (out of 4...",Human-written
97,https://www.imdb.com/title/tt0111161/reviews,Hope can set you free and so can this remarkab...,"One of my favorite movies ever,The Shawshank R...",Human-written
98,https://www.imdb.com/title/tt0111161/reviews,Two movies in one,The reason I became a member of this database ...,Human-written



## Data storage for further analysis

After successfully scraping and organizing the data, it is stored in a pickle file named `aljazeera_articles.pkl`. This step allowed us to keep a stable and easily accessible dataset for further analysis, obviating the need to redo the scraping process. Opting for a pickle file as the storage medium was particularly advantageous due to its capacity to store Python objects, thereby maintaining the integrity of the data's structure and content. 


In [3]:
new_df.to_pickle("movie_data_100.pkl")

### References

- Selenium Project 2022, 'Selenium WebDriver Documentation', SeleniumHQ, viewed 1 April 2024, <https://www.selenium.dev/documentation/en/getting_started_with_webdriver/>.

- Selenium Project 2022, 'Selenium WebDriver Requirements', SeleniumHQ, viewed 1 April 2024, <https://www.selenium.dev/documentation/en/webdriver/driver_requirements/>.

- Selenium Documentation 2022, 'Selenium Python Locating Elements', SeleniumHQ, viewed 1 April 2024, <https://selenium-python.readthedocs.io/locating-elements.html>.

- pandas Development Team 2022, 'Pandas Documentation', Pandas, viewed 1 April 2024, <https://pandas.pydata.org/docs/>.