<a href="https://colab.research.google.com/github/shaikadish/imdbProject/blob/main/web_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What are we doing here?

Although [brilliant datasets](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) exist for sentiment analysis using IMDB user reviews, you might need something a little different. 

For example, my project tries to estimate the rating out of ten that a user would give a film, based on the text of their review. I struggled to find a dataset that had exactly what I needed (user reviews paired with their ratings), so I decided to make my own. This notebook contains the code I used to do that.

I hope that this notebook can act as a tutorial for anyone interested in building their own dataset using fairly standard web-scraping techniques.


# Imports and setup

Firstly, I am working in Google Colab, so there is some code here just to mount your Google Drive and to import the appropriate Python libraries. The main libraries we will be using here are:
*   [Selenium](https://selenium-python.readthedocs.io/) to automate web browsing activities.
*   [requests](https://docs.python-requests.org/en/latest/) for reading the *HTML* file on a given page.
*   [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for parsing the *HTML* files.
*   [Pandas](https://pandas.pydata.org/docs/) for tabular data manipulation and saving.

In [None]:
# Mount to google drive
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
%cd drive/MyDrive/GitHub/IMDB_project/imdbProject

In [None]:
# Install libraries to colab environment
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver
!pip install beautifulsoup4

In [12]:
# Import libraries
import selenium
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import time 

In [None]:
# Configure Selenium webdriver for colab notebook
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# `options=` replaces the deprecated `chrome_options=` keyword; chromedriver is looked up on the PATH
driver = webdriver.Chrome(options=chrome_options)

# Web Scraping

Now comes the good stuff. We will get access to the review data in three steps:


1.   First, we automate the traversal of an [*IMDB list page*](https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2010-01-01,2022-01-01&languages=en&adult=include&sort=release_date,asc&start=0&ref_=adv_nxt), where the set of films can be found. The URL of each film's [*IMDB page*](https://www.imdb.com/title/tt0368226/?ref_=nv_sr_srsg_0) is extracted during this traversal.
2.   Next, each film's *IMDB page* is visited, and the URL of that film's [review page](https://www.imdb.com/title/tt0368226/reviews?ref_=tt_urv) is scraped.
3.   Finally, the user reviews and ratings are scraped from each review page.

The process is broken into these steps because it is less convoluted to automate each one individually than to try to do them all at once. The split also helps those of you who are here to copy some code: each step can easily be modified for similar projects, and the modularity makes the code easier to edit.
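
The list-page URL used in step 1 carries all of its search filters in the query string, so a different slice of IMDB can be targeted by changing those values. As a rough sketch, the same URL can be assembled from a dictionary of parameters. The helper name `build_list_url` is my own invention, and treating `start` as the offset of the first film shown is an assumption based on the `start=0` in the URL, not something confirmed here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"

def build_list_url(start=0):
    """Assemble the list-page URL used in this notebook (hypothetical helper)."""
    params = {
        "title_type": "feature,tv_movie",
        "release_date": "2010-01-01,2022-01-01",
        "languages": "en",
        "adult": "include",
        "sort": "release_date,asc",
        "start": start,  # assumed to be the offset of the first film shown
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")

print(build_list_url())
```

Editing the values in `params` (for example the `release_date` range) is then a one-line change instead of a hand-edited URL string.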



In [None]:
# 1. List page traversal
from selenium.webdriver.common.by import By

# Navigate to the IMDB list page being used with Selenium
url = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2010-01-01,2022-01-01&languages=en&adult=include&sort=release_date,asc&start=0&ref_=adv_nxt"
driver.get(url)

# Dictionary to save scraped data
initial_link_dictionary={'movie_title':[],'movie_link':[],'year':[]}

# Go through 1700 pages (feel free to change this value) of the list
# Each loop collects the URLs for the films on that page
for page_idx in range(0,1700):

  # Scrape the information of all films on the current page
  film_content = driver.find_elements(By.CLASS_NAME, 'lister-item')

  # Process the content of every film on the page (up to 50 per page)
  for film in film_content:

      try:

        # Extract the index in front of the title. This is required for processing the title text
        film_order = film.find_element(By.CLASS_NAME, 'lister-item-index').text
        # Get the year in which the film was released
        film_year = film.find_element(By.CLASS_NAME, 'lister-item-year').text

        # Get film title. The title requires some text processing
        film_title = film.find_element(By.CLASS_NAME, 'lister-item-header').text
        # Remove the index and year from the film's title
        film_title = film_title.replace(" "+film_year,"")
        film_title = film_title.replace(film_order+' ', '')

        # Scrape the link to the film's IMDB page
        film_link = film.find_element(By.LINK_TEXT, film_title).get_attribute('href')

        # Update dictionary of scraped data
        initial_link_dictionary['movie_title'].append(film_title)
        # Keep only the year digits, e.g. "(2010)" -> "2010"
        initial_link_dictionary['year'].append(film_year[film_year.find('2'):-1])
        initial_link_dictionary['movie_link'].append(film_link)

      except Exception:

        # Catch inconsistencies in film naming convention (extremely rare)
        continue

  # Get location of next page button on page
  load_more = driver.find_element(By.CLASS_NAME, 'next-page')
  # Click on that button to traverse the pages of the list
  load_more.click()

# Create a data frame from the scraped data and save it as a CSV
movie_url_df=pd.DataFrame(data=initial_link_dictionary)
movie_url_df.to_csv('movie_urls.csv',index=False)


In [None]:
# If you already have a correctly formatted movie url list, load it in here
movie_url_df=pd.read_csv('movie_urls.csv')
movie_url_df.head()

In [None]:
# 2. Get review page links from each film's IMDB page

# Dictionary to save scraped data
final_link_dictionary={'movie_title':[],'movie_link':[],'review_link':[],'year':[]}

# Loop through the links of each film, scraped in the last step
for i,row in movie_url_df.iterrows():

  # Parse information from data frame
  url = row['movie_link']
  movie_title = row['movie_title']
  year = row['year']

  try:

    # Load the data from the current film's IMDB page
    page_data = requests.get(url, headers = {'User-Agent': 'Requests'})
    # Parse the data
    soup = BeautifulSoup(page_data.text, 'html.parser')

    # Modify the link to the film's IMDB page to get the link to the user review page
    # (`string=` is the current name of Beautiful Soup's older `text=` argument)
    review_link=url.split('/')
    review_link[-1]=soup.find('a', string = 'User reviews').get('href')
    review_link='/'.join(review_link)

    # Update dictionary of scraped data
    final_link_dictionary['movie_title'].append(movie_title)
    final_link_dictionary['movie_link'].append(url)
    final_link_dictionary['year'].append(year)
    final_link_dictionary['review_link'].append(review_link)

  except Exception:

    # For movies with no user reviews, and hence no review link to extract
    continue 

# Create a data frame from the scraped data and save it as a CSV
review_url_df=pd.DataFrame(data=final_link_dictionary)
review_url_df.to_csv('review_urls.csv',index=False)
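
Step 2 builds the review URL by splitting the film's URL on `/` and swapping in the `href` of the "User reviews" link. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves the `href` against the film's URL whether it is relative or starts with `/`. The helper name `build_review_url` is hypothetical, and the two `href` values are assumed examples of what the link might contain.

```python
from urllib.parse import urljoin

def build_review_url(movie_url, reviews_href):
    """Resolve the href of the 'User reviews' link against the film's page URL."""
    return urljoin(movie_url, reviews_href)

movie_url = "https://www.imdb.com/title/tt0368226/?ref_=nv_sr_srsg_0"

# A relative href and an absolute-path href resolve to the same review page
print(build_review_url(movie_url, "reviews?ref_=tt_urv"))
print(build_review_url(movie_url, "/title/tt0368226/reviews?ref_=tt_urv"))
```
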

In [14]:
# If you already have a correctly formatted movie review url list, load it in here
review_url_df=pd.read_csv('review_urls.csv')

In [None]:
# 3. Scrape user reviews and ratings from film review pages
from selenium.webdriver.common.by import By

# Dictionary to save scraped data
review_dictionary={'review_title':[],'review_rating':[],'review_date':[],'review_body':[],'movie_title':[]}

# Loop through each film's review page
for i in range(len(review_url_df['review_link'])):

  # Go to the user review page of the current film
  driver.get(review_url_df['review_link'][i])

  # Store title of current film
  current_title=review_url_df['movie_title'][i]

  # Expand the review list using Selenium: the "load more" button is
  # clicked at most 4 times, giving up to 5 batches of reviews per film
  page = 1
  while page<5:

      try:
          # Find the button to load more reviews
          load_more = driver.find_element(By.ID, 'load-more-trigger')
          # Click the button
          load_more.click()
          page+=1

      except Exception:
          # If no more reviews to load, break
          break

  # Grab all reviews from the expanded list
  reviews = driver.find_elements(By.CLASS_NAME, 'review-container')

  # Process reviews for the current film
  for review in reviews:

      try:

          # Get review information
          review_title = review.find_element(By.CLASS_NAME, 'title').text
          # The "textContent" attribute gets past the spoiler warning
          review_content = review.find_element(By.CLASS_NAME, 'content').get_attribute("textContent").strip()
          review_rating = review.find_element(By.CLASS_NAME, 'rating-other-user-rating').text
          review_date = review.find_element(By.CLASS_NAME, 'review-date').text

          # Update dictionary of scraped data
          review_dictionary['review_title'].append(review_title)
          review_dictionary['review_body'].append(review_content)
          # Keep only the score before the slash, e.g. "8/10" -> "8"
          review_dictionary['review_rating'].append(review_rating.split('/')[0])
          review_dictionary['review_date'].append(review_date)
          review_dictionary['movie_title'].append(current_title)

      except Exception:

          # Some reviews do not have ratings. These reviews must be skipped
          continue

# Create a data frame from the scraped data and save it as a CSV
review_df = pd.DataFrame(data = review_dictionary)
review_df.to_csv('review_data.csv',index=False)
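
Step 3 stores each rating as the text before the slash, so the `review_rating` column holds strings. If the ratings are going to be used as a numeric target, they need converting. Here is a minimal sketch, assuming the rating text looks like `8/10`; the helper name `parse_rating` is hypothetical.

```python
def parse_rating(rating_text):
    """Convert rating text such as '8/10' to an int, or None if it cannot be parsed."""
    score = rating_text.split("/")[0].strip()
    return int(score) if score.isdigit() else None

print(parse_rating("8/10"))
print(parse_rating(""))
```

The same function could be applied to a whole column with `review_df['review_rating'].map(parse_rating)`, which leaves `None` wherever the text does not start with a number.
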

And there you have it! If you are interested in the analysis and use of this data, please feel free to check out the other notebooks in this repo!