# SCRAPING TO CREATE AN NLP MOVIES DATASET

Scraping is a group of techniques for extracting data directly from a website by parsing its HTML and CSS code. In this exercise we will extract data from https://www.moviehousememories.com/movie-summaries/, a website that hosts relatively extensive movie summaries. The idea is to build a dataset rich enough for future work, for example NLP projects.

For this purpose, we are going to use the Selenium library, a Python package that was originally created for testing websites and interacting with them in an automated way.

We will also use the Firefox web browser, so we need geckodriver, the Firefox driver, to interact with the pages.

The first thing we have to do, as always, is import the libraries we will use during the exercise:

In [1]:
from selenium.webdriver.firefox.options import Options
from selenium import webdriver
import pandas as pd
import json
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from tqdm import tqdm

In [2]:
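conteos = 100  # maximum number of clicks on the "load more" button

options = Options()
options.headless = True  # run Firefox without opening a visible window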

Once the libraries have been imported, we will create a first dataset with the title and URL of each movie, which we will later use to extract each summary.

As you can see in the code below, the first for loop repeatedly clicks the "load more" button so that the page displays more movie entries. Then all the movie HTML elements are collected, and the second for loop extracts the title and URL of each movie. Finally, the movie URL dataset is built and saved.

In [3]:
driver = webdriver.Firefox(executable_path= './drivers/geckodriver', options=options)
driver.get("https://www.moviehousememories.com/movie-summaries/") 

for _ in tqdm(range(0,conteos)):
    iteracion = 0
    # Retry up to 100 times until the "load more" button is found and clicked
    while iteracion < 100:
        try:
            # find_element_by_xpath raises an exception if the button is missing
            boton = driver.find_element_by_xpath('//a[@class="block-pagination next-posts show-more-button load-more-button"]')
            # Scroll the button into view before clicking it
            driver.execute_script("arguments[0].scrollIntoView(false);",boton)
            boton.click()
            break
        except Exception:
            iteracion = iteracion + 1

#Extract movies URL
listado_peliculas = driver.find_elements_by_xpath('//h2[@class="post-title"]')

url_peliculas = []
titulo = []
if len(listado_peliculas) != 0:
    for pelicula in tqdm(listado_peliculas):
        movie = pelicula.find_element_by_tag_name("a")
        titulo.append(pelicula.text)
        url_peliculas.append(movie.get_attribute("href"))

dataset_movies = pd.DataFrame({"Title":titulo, "url":url_peliculas})
dataset_movies.to_excel('./recommenderFiles/movies.xlsx', index=False)
driver.quit()  # close the browser and end the geckodriver session

100%|██████████| 100/100 [00:34<00:00,  2.01it/s]
100%|██████████| 892/892 [00:26<00:00, 33.15it/s]
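
The click loop above boils down to a bounded retry: attempt an action and give up after a fixed number of failures. The following is a minimal, browser-free sketch of that pattern. The names `click_with_retry` and `FlakyButton` are hypothetical and introduced only for illustration; they are not part of Selenium.

```python
def click_with_retry(find_button, max_attempts=100):
    """Try to locate and click a button, retrying on any exception.

    `find_button` is any zero-argument callable returning an object with a
    `click()` method (a stand-in for a Selenium lookup). Returns True if a
    click succeeded, False if every attempt failed.
    """
    for _ in range(max_attempts):
        try:
            find_button().click()
            return True
        except Exception:
            # Lookup or click failed (e.g. element not rendered yet); retry.
            continue
    return False


class FlakyButton:
    """Hypothetical stand-in that fails a few times before it can be clicked."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.clicks = 0

    def click(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("button not ready")
        self.clicks += 1


button = FlakyButton(failures_before_success=3)
clicked = click_with_retry(lambda: button, max_attempts=10)
print(clicked, button.clicks)
```

Returning a boolean instead of silently falling through makes it possible to stop clicking once the button is no longer available.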


Once the movie URL dataset has been created, it is time to visit each URL stored in it, get the summary, and append it to a txt file with one JSON record per line, structured as: title, url, summary.

In [4]:
#SCRAPER MOVIES: 

dataset_movies = pd.read_excel('./recommenderFiles/movies.xlsx')
driver = webdriver.Firefox(executable_path= './drivers/geckodriver', options=options)

count = 0
for x in tqdm(range(len(dataset_movies))):
    
    if count >= 400: #AVOID WEB TIMEOUT
        count = 0
        driver.quit()  # close the browser and end the geckodriver session
        driver = webdriver.Firefox(executable_path= './drivers/geckodriver', options=options)
        
    count = count + 1
        
    try:
        url = dataset_movies["url"][x]
        driver.get(url)
        title = dataset_movies["Title"][x]
        text_summary = driver.find_element_by_xpath("//div[@id='main-content-row']").text
        text_summary = text_summary.replace("Film and Plot Synopsis", "").replace("The summary below contains spoilers.","")
        # Cut everything from the "Additional Film Information" section onwards;
        # find() returns -1 when the marker is missing, so only slice if found
        posicion = text_summary.find("Additional Film Information")
        if posicion != -1:
            text_summary = text_summary[:posicion]
    except Exception:
        # Skip movies whose page could not be loaded or parsed
        continue
    
    dictionary_movie = {"Title":title,"url":url,"summary":text_summary}
    with open("./recommenderFiles/movies_house_memories.txt", 'a',encoding='utf-8') as json_file:
        # Serialize the dict as one JSON line, keeping non-ASCII characters
        jsoned_data = json.dumps(dictionary_movie,ensure_ascii=False)
        json_file.write(jsoned_data)
        json_file.write('\n')

# Close the browser once every movie has been processed
driver.quit()

100%|██████████| 892/892 [45:24<00:00,  2.04s/it]  
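
The file written above holds one JSON object per line, which lets records be appended one at a time and read back line by line. The following is a minimal sketch of that round trip, assuming a temporary file and made-up records. The helpers `append_record` and `read_records` are hypothetical names, not part of the notebook.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line (UTF-8, non-ASCII kept as-is)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False))
        handle.write("\n")


def read_records(path):
    """Read back every non-empty line as one JSON object."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records used only for this demonstration.
sample = [
    {"Title": "Example Movie (1999)", "url": "https://example.com/a",
     "summary": "Line one.\nLine two."},
    {"Title": "Película de Ejemplo (2001)", "url": "https://example.com/b",
     "summary": "Café scene."},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "movies.txt")
    for record in sample:
        append_record(path, record)
    restored = read_records(path)

print(restored == sample)
```

Because `json.dumps` escapes newlines inside strings, a multi-line summary still occupies a single line in the file. Note that append mode also means re-running the scraper adds duplicate lines unless the file is removed first.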


Finally, we can import the saved data and see the final result of our dataset:

In [5]:
data = []
with open("./recommenderFiles/movies_house_memories.txt", 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

movies_dataset = pd.DataFrame.from_records(data)   

lista_resumen = []
for i in range(0,len(movies_dataset)):
    detecta_additional = movies_dataset["summary"][i].find("Additional Film")
    if detecta_additional != -1:
        summary = movies_dataset["summary"][i][:detecta_additional]
        lista_resumen.append(summary.replace("\n",""))
        
    else:
        lista_resumen.append(movies_dataset["summary"][i])

# Replace the original summary column with the cleaned summaries
movies_dataset = movies_dataset.drop(columns="summary")
movies_dataset.insert(len(movies_dataset.columns),"summary",lista_resumen)
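
The summary clean-up used in this exercise (removing fixed phrases and cutting the text at a marker) can also be written as a single reusable function. The following is a minimal sketch under stated assumptions: `clean_summary` is a hypothetical helper, the sample page text is made up, and the final whitespace trimming is an extra step that the cells above do not perform.

```python
BOILERPLATE = ("Film and Plot Synopsis", "The summary below contains spoilers.")
END_MARKER = "Additional Film Information"


def clean_summary(raw_text):
    """Drop known boilerplate phrases and cut everything from END_MARKER on."""
    text = raw_text
    for phrase in BOILERPLATE:
        text = text.replace(phrase, "")
    position = text.find(END_MARKER)
    # str.find returns -1 when the marker is absent; slicing with -1 would
    # silently drop the last character, so only cut when the marker exists.
    if position != -1:
        text = text[:position]
    return text.strip()


# Hypothetical page text, made up for this example.
page = (
    "Film and Plot Synopsis\n"
    "A hero saves the day.\n"
    "Additional Film Information\n"
    "Rated PG"
)
print(clean_summary(page))
print(clean_summary("No marker here."))
```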

In [6]:
movies_dataset.head(10)

| | Title | url | summary |
|---|---|---|---|
| 0 | The Desert Trail (1935) | https://www.moviehousememories.com/the-desert-... | \nWhen a shady rodeo promoter tries to swindle... |
| 1 | Austin Powers: International Man of Mystery (1... | https://www.moviehousememories.com/austin-powe... | \nIn the 1960’s, Austin Powers is a hipster in... |
| 2 | Last Action Hero (1993) | https://www.moviehousememories.com/last-action... | \nDanny Madigan is a lonely boy who has recent... |
| 3 | The Ice Pirates (1984) | https://www.moviehousememories.com/the-ice-pir... | \nIn an alternate future, water is the most va... |
| 4 | The Haunted Mansion (2003) | https://www.moviehousememories.com/the-haunted... | \nWhile on the way to a family vacation, real-... |
| 5 | The Rules of the Game (1939) | https://www.moviehousememories.com/the-rules-o... | \nThe Rules of the Game takes place in pre-Wor... |
| 6 | Cherry 2000 (1987) | https://www.moviehousememories.com/cherry-2000... | \nCherry 2000 takes place in the far off year ... |
| 7 | The Ruling Class (1972) | https://www.moviehousememories.com/the-ruling-... | \nIn The Ruling Class, after a member of the H... |
| 8 | Jaws 2 (1978) | https://www.moviehousememories.com/jaws-2-1978... | \nIt has been four years since the events of t... |
| 9 | Abominable (2019) | https://www.moviehousememories.com/abominable-... | \nTeenager Yi lives in Shanghai with her mothe... |


As you can see, scraping is a really interesting way to enrich existing data and to create new data that can be used, for example, in descriptive or predictive analytics.