<a href="https://colab.research.google.com/github/shaikadish/imdbProject/blob/main/web_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What are we doing here?

Although [brilliant datasets](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) exist for sentiment analysis using IMDB user reviews, you might need something a little different. 

For example, my project tries to estimate the rating out of ten that a user would give a film, based on their review of it. I struggled to find a dataset that had exactly what I needed (user reviews paired with their ratings), so I decided to make my own. This notebook contains the code I used to do that.

I hope this notebook can serve as a tutorial for anyone interested in building their own dataset with fairly standard web-scraping techniques.


# Imports and setup

I am working in Google Colab, so the first few cells mount your Google Drive, install the required packages, and import the appropriate Python libraries. The main libraries we will be using are:
*   [Selenium](https://selenium-python.readthedocs.io/) to automate web browsing.
*   [requests](https://docs.python-requests.org/en/latest/) for fetching the *HTML* of a given page.
*   [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) for parsing that *HTML*.
*   [Pandas](https://pandas.pydata.org/docs/) for tabular data manipulation and saving.

In [2]:
# Mount to google drive
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/GitHub/IMDB_project/imdbProject

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install libraries to colab environment
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver
!pip install beautifulsoup4

In [None]:
# Import libraries
import selenium
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import time 

In [None]:
# Configure Selenium webdriver for colab notebook
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)


# Web Scraping

Now comes the good stuff. We will get access to the review data in three steps:


1.   First, Selenium automates the traversal of an [*IMDB list page*](https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2010-01-01,2022-01-01&languages=en&adult=include&sort=release_date,asc&start=0&ref_=adv_nxt), where the set of films can be found. The URL of each film's [*IMDB page*](https://www.imdb.com/title/tt0368226/?ref_=nv_sr_srsg_0) is extracted during this traversal.
2.   Next, each film's *IMDB page* is visited, and the URL of that film's [review page](https://www.imdb.com/title/tt0368226/reviews?ref_=tt_urv) is scraped.
3.   Finally, the user reviews and ratings are scraped from each review page.

The process is broken into these steps because it is less convoluted to automate each one individually than to try to do them all at once. It also helps those of you who are here to copy some code: each step can easily be modified for similar projects, and the modularity makes the code easier to edit.



In [None]:
# 1. List page traversal
from selenium.webdriver.common.by import By

# Navigate to the IMDB list page being used with Selenium
url = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2010-01-01,2022-01-01&languages=en&adult=include&sort=release_date,asc&start=0&ref_=adv_nxt"
driver.get(url)

# Dictionary to save scraped data
initial_link_dictionary={'movie_title':[],'movie_link':[],'year':[]}

# Go through 1700 pages (feel free to change this value) of the list
# Each loop collects the URLs for the films on that page
for page_idx in range(0,1700):

  # Scrape the information of all films on the current page
  film_content = driver.find_elements(By.CLASS_NAME, 'lister-item')

  # Process the content of every film on the page (up to 50 per page)
  for film in film_content:

      try:

        # Extract the index in front of the title. This is required for processing the title text
        film_order = film.find_element(By.CLASS_NAME, 'lister-item-index').text
        # Get the year in which the film was released
        film_year = film.find_element(By.CLASS_NAME, 'lister-item-year').text

        # Get film title. The title requires some text processing
        film_title = film.find_element(By.CLASS_NAME, 'lister-item-header').text
        # Remove the index and year from the film's title
        film_title = film_title.replace(" "+film_year,"")
        film_title = film_title.replace(film_order+' ', '')

        # Scrape the link to the film's IMDB page
        film_link = film.find_element(By.LINK_TEXT, film_title).get_attribute('href')

        # Update dictionary of scraped data
        initial_link_dictionary['movie_title'].append(film_title)
        # Keep only the digits of the year, e.g. "(2010)" becomes "2010"
        initial_link_dictionary['year'].append(film_year[film_year.find('2'):-1])
        initial_link_dictionary['movie_link'].append(film_link)

      except Exception:

        # Catch inconsistencies in film naming convention (extremely rare)
        continue

  # Get location of next page button on page
  load_more = driver.find_element(By.CLASS_NAME, 'next-page')
  # Click on that button to traverse the pages of the list
  load_more.click()

# Create a data frame from the scraped data and save it as a CSV
movie_url_df=pd.DataFrame(data=initial_link_dictionary)
movie_url_df.to_csv('movie_urls.csv',index=False)
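
The title and year clean-up in step 1 is easy to check on its own, without a browser. The following is a minimal sketch of that text processing, assuming the header text has the shape `"<index> <title> <year>"`. The helper names `clean_title` and `clean_year` and the sample strings are made up for illustration.

```python
def clean_title(header_text, order_text, year_text):
    """Strip the list index and the release year from a list-page header."""
    title = header_text.replace(" " + year_text, "")
    title = title.replace(order_text + " ", "")
    return title


def clean_year(year_text):
    """Keep the digits of a year string such as "(2010)"."""
    start = year_text.find("2")  # first digit of a year in the 2000s
    return year_text[start:-1]   # drop the closing bracket


# Hypothetical header text, shaped like the entries on the list page
header = "1. Inception (2010)"
print(clean_title(header, "1.", "(2010)"))
print(clean_year("(2010)"))
# Searching for the first "2" also skips a prefix such as "(I) "
print(clean_year("(I) (2010)"))
```

Note that `str.replace` removes every occurrence of the index text, so a title that happens to contain the same `"1. "` pattern would be altered as well.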


In [None]:
# If you already have a correctly formatted movie url list, load it in here
movie_url_df=pd.read_csv('movie_urls.csv')
movie_url_df.head()

In [None]:
# 2. Get review page links from each film's IMDB page

# Dictionary to save scraped data
final_link_dictionary={'movie_title':[],'movie_link':[],'review_link':[],'year':[]}

# Loop through the links of each film, scraped in the last step
for i,row in movie_url_df.iterrows():

  # Parse information from data frame
  url = row['movie_link']
  movie_title = row['movie_title']
  year = row['year']

  try:

    # Load the data from the current film's IMDB page
    page_data = requests.get(url, headers = {'User-Agent': 'Requests'})
    # Parse the data
    soup = BeautifulSoup(page_data.text, 'html.parser')

    # Modify the link to the film's IMDB page to get the link to the user review page:
    # the last part of the URL is swapped for the href of the 'User reviews' anchor
    review_link=url.split('/')
    review_link[-1]=soup.find('a', string = 'User reviews').get('href')
    review_link='/'.join(review_link)

    # Update dictionary of scraped data
    final_link_dictionary['movie_title'].append(movie_title)
    final_link_dictionary['movie_link'].append(url)
    final_link_dictionary['year'].append(year)
    final_link_dictionary['review_link'].append(review_link)

  except Exception:

    # For movies with no user reviews, and hence no review link to extract
    continue 

# Create a data frame from the scraped data and save it as a CSV
review_url_df=pd.DataFrame(data=final_link_dictionary)
review_url_df.to_csv('review_urls.csv',index=False)
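
As an alternative to splitting and re-joining the URL by hand, the standard library's `urllib.parse.urljoin` can resolve the scraped `href` against the film's page URL. This is a sketch of a swapped-in technique, not what the cell above does. The helper name `build_review_link` and the two example `href` values are assumptions, since the exact `href` IMDB returns is not shown in this notebook.

```python
from urllib.parse import urljoin


def build_review_link(movie_url, href):
    """Resolve the href of the 'User reviews' anchor against the film's page URL."""
    return urljoin(movie_url, href)


# Film page URL taken from the example link earlier in this notebook
movie_url = "https://www.imdb.com/title/tt0368226/?ref_=nv_sr_srsg_0"

# Assumed href shapes: a relative href and an absolute-path href
print(build_review_link(movie_url, "reviews?ref_=tt_urv"))
print(build_review_link(movie_url, "/title/tt0368226/reviews?ref_=tt_urv"))
```

Both calls resolve to the same review page URL, so this approach does not depend on whether the page uses relative or absolute links.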

In [6]:
# If you already have a correctly formatted movie review url list, load it in here
review_url_df=pd.read_csv('review_urls.csv')

In [None]:
# 3. Scrape the user reviews and ratings from each review page
import time
from selenium.webdriver.common.by import By

# One list per column of the final data frame
title = []
content = []
rating = []
date = []
movie_title = []

# Rows up to and including this index were scraped in an earlier session and are skipped.
# Set this to -1 to start from the first film.
last_scraped_index = 11486
# Number of films, used for the progress print-out (the original run hard-coded 16376 here)
n_films = len(review_url_df)

def build_review_df():
  # Collect everything scraped so far into a data frame
  return pd.DataFrame(data={'review_title': title,
                            'review_rating': rating,
                            'review_date': date,
                            'review_body': content,
                            'movie_title': movie_title})

# Use Selenium to go to each user review page. Each page is for a different film
checkpoint = time.time()
for i in range(n_films):

  if i <= last_scraped_index:
    continue

  # Print progress every 20 films
  if (i % 20) == 0:
    print(f'{(i/n_films)*100}% complete')

  # Save a checkpoint CSV every 30 minutes, so that a crash does not lose everything
  if (time.time() - checkpoint) > 1800:
    build_review_df().to_csv(f'review_data_{last_scraped_index}_to_{i}.csv')
    print('Checkpoint taken')
    checkpoint = time.time()

  # Go to the user review page of the current film
  driver.get(review_url_df['review_link'][i])
  current_title = review_url_df['movie_title'][i]

  # Each review page initially shows 25 reviews. Click the 'load more' button
  # up to 4 times, which exposes at most 125 reviews per film.
  page = 1
  while page < 5:
      try:
          # Find the 'load more' button on the page and click it
          load_more = driver.find_element(By.ID, 'load-more-trigger')
          load_more.click()
          page += 1
      except Exception:
          # No button left to click, so stop
          break

  # With the page expanded, grab every review container on it
  review_elements = driver.find_elements(By.CLASS_NAME, 'review-container')

  # Save the reviews for the current film (at most 125)
  for element in review_elements[:125]:
      try:
          # Some reviewers give only review text or only a rating. A missing element
          # raises an exception, so only reviews with all four fields are kept.
          ftitle = element.find_element(By.CLASS_NAME, 'title').text
          # Some review text is hidden behind a spoiler warning, so read the
          # 'textContent' attribute rather than .text, which only returns visible text
          fcontent = element.find_element(By.CLASS_NAME, 'content').get_attribute("textContent").strip()
          frating = element.find_element(By.CLASS_NAME, 'rating-other-user-rating').text
          fdate = element.find_element(By.CLASS_NAME, 'review-date').text

          # Then add them to their respective lists
          title.append(ftitle)
          content.append(fcontent)
          rating.append(frating.split('/')[0])  # keep the "7" of "7/10"
          date.append(fdate)
          movie_title.append(current_title)
      except Exception:
          continue

# Build the final data frame and save it as a CSV
review = build_review_df()
review.to_csv('review_data.csv')



70.2247191011236% complete
70.34684904738641% complete
70.46897899364924% complete
70.59110893991206% complete
70.71323888617489% complete
70.83536883243772% complete
70.95749877870054% complete
71.07962872496336% complete
71.20175867122619% complete
71.323888617489% complete
71.44601856375184% complete
71.56814851001465% complete
71.69027845627748% complete
71.81240840254031% complete
71.93453834880313% complete
72.05666829506595% complete
72.17879824132878% complete
72.3009281875916% complete
72.42305813385443% complete
72.54518808011724% complete
72.66731802638007% complete
72.78944797264289% complete
72.9115779189057% complete
73.03370786516854% complete
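
The half-hourly checkpoint in the scraping loop can be pulled out into a small helper, which makes the timing logic testable without waiting 30 minutes. This is only a sketch. The `Checkpointer` class, its parameters, and the fake clock are hypothetical names introduced here, and the `save` function would be whatever writes your partial CSV.

```python
import time


class Checkpointer:
    """Call `save` at most once per `interval` seconds."""

    def __init__(self, save, interval=1800, clock=time.time):
        self.save = save          # function that writes the data scraped so far
        self.interval = interval  # seconds between checkpoints
        self.clock = clock        # injectable, so the logic can be tested without waiting
        self.last = clock()

    def maybe_save(self, index):
        """Save and restart the timer if the interval has elapsed. Returns True if saved."""
        if self.clock() - self.last > self.interval:
            self.save(index)
            self.last = self.clock()
            return True
        return False


# Usage with a fake clock and a save function that only records the loop index
fake_now = [0.0]
saved = []
checkpointer = Checkpointer(saved.append, interval=1800, clock=lambda: fake_now[0])

for i, elapsed in enumerate([10.0, 1700.0, 1801.0, 1900.0, 3700.0]):
    fake_now[0] = elapsed
    checkpointer.maybe_save(i)

print(saved)
```

In the real loop, `save` could be a function that builds the data frame from the lists and writes it with `to_csv`, and the default `time.time` clock would be used.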

In [None]:
review.to_csv('review_data.csv')

In [None]:
# Average rating (out of ten) across all scraped reviews
average_score = review['review_rating'].astype(int).mean()
print(average_score)

6.246926497515041
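
The review bodies usually need a little cleaning too. Because they are read from `textContent`, they can end with the text of the page's voting widget ("1 out of 1 found this helpful. Was this review helpful? Sign in to vote."). Below is a minimal sketch for trimming that footer, assuming it always starts with "N out of M found this helpful." The pattern, the helper name `strip_helpful_footer`, and the shortened sample text are illustrative.

```python
import re

# Assumed footer shape: "<n> out of <m> found this helpful." followed by the vote prompt
HELPFUL_FOOTER = re.compile(r"\s*[\d,]+ out of [\d,]+ found this helpful\..*\Z", re.DOTALL)


def strip_helpful_footer(body):
    """Remove the trailing 'found this helpful' widget text from a review body."""
    return HELPFUL_FOOTER.sub("", body).strip()


# Shortened, made-up review body with the same trailing layout as a scraped one
sample = (
    "Yes, this movie is predictable.\n"
    "                \n"
    "                    1 out of 1 found this helpful.\n"
    "                        \n"
    "                            Was this review helpful?  Sign in to vote.\n"
)
print(strip_helpful_footer(sample))
```

A review that itself contains a sentence of the same shape would be cut short, so it is worth spot-checking a few cleaned reviews before applying this to the whole data frame.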


In [None]:
review['review_body'][83]

"7 October 2016. Yes, this movie is predictable, the storyline nothing special that hasn't already been done before. Yet, the music is really sharp, crisp, and engrossingly fun and hip. The performances are subtly different and more natural and easy-going without overly drama. This Disney Version of the famous pop star accidentally meets unknown, plain girl is done in an appealing and straightforward, honest way that doesn't depend on anything more than a good music track, sung well, and performances and chemistry which are entertaining and smartly done. Other fun movies include I'll Be There (2003), A Royal Night Out (2015), Roman Holiday (1953), Music and Lyrics (2007), Pride and Prejudice (2005), The Artist (2011), The Devil Wears Prada (2006), Good Morning Call (2016),\n                \n                    1 out of 1 found this helpful.\n                        \n                            Was this review helpful?  Sign in to vote.\n                        \n                     