<a href="https://colab.research.google.com/github/yiyangjessieyu/Machine-Learning/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tutorial from https://www.comet.com/site/blog/top-5-web-scraping-methods-including-using-llms/

# Method 1: BeautifulSoup and Requests for Web Scraping

Conclusion from experimenting:
- Tracing back the nested structure of tags, class names, and text wasn't too bad.
- Debugging why the required data wasn't being accessed was a hassle. The cause turned out to be `None` values: `find()` returns `None` when a div has no matching tag or class name, e.g. the metascore.
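
The `None` problem described above comes from `find()` returning `None` when a tag or class is missing, so a chained `.text` raises `AttributeError`. Below is a minimal sketch of a guard, assuming nodes that behave like BeautifulSoup tags (a `.find()` method and a `.text` attribute). The helper name `find_text` and the stand-in `FakeTag` class are hypothetical and only exist so the example runs without a network request:

```python
class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup tag: .text plus .find()."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find(self, name, class_=None):
        # Like BeautifulSoup's find(): returns None when nothing matches
        return self._children.get((name, class_))


def find_text(parent, name, class_=None, default=None):
    """Return the stripped text of the first match, or default if it is missing."""
    if parent is None:
        return default
    node = parent.find(name, class_=class_)
    return node.text.strip() if node is not None else default


movie = FakeTag(children={("span", "genre"): FakeTag("\nAction, Comedy ")})
genre = find_text(movie, "span", class_="genre")
metascore = find_text(movie, "span", class_="metascore", default=0)
```

With a real `soup`, the same helper could replace the chained `if ... != None` checks, since a missing parent or child both fall back to `default`.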

In [9]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [8]:
url = "https://www.imdb.com/list/ls566941243/"

In [22]:
# Step 1: Send a GET request to the specified URL
response = requests.get(url)

# Step 2: Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Save the HTML content to a text file for reference
with open("imdb.txt", "w", encoding="utf-8") as file:
    file.write(str(soup))
print("Page content has been saved to imdb.txt")

# Step 4: Extract movie data from the parsed HTML and store it in a list
movies_data = []
for movie in soup.find_all('div', class_='lister-item-content'):
    title = movie.find('a').text
    genre = movie.find('span', class_='genre').text.strip()
    stars = movie.find('div', class_='ipl-rating-star').find('span', class_='ipl-rating-star__rating').text
    runtime = movie.find('span', class_='runtime').text
    rating = movie.find('span', class_='ipl-rating-star__rating').text
    metascore_div = movie.find('div', class_='ratings-metascore')
    metascore_span = metascore_div.find('span', class_='metascore') if metascore_div is not None else None
    metascore = metascore_span.text if metascore_span is not None else 0  # default to 0 when no metascore is listed
    movies_data.append([title, genre, stars, runtime, rating, metascore])

# Step 5: Create a Pandas DataFrame from the extracted movie data
df = pd.DataFrame(movies_data, columns=['Title', 'Genre', 'Stars', 'Runtime', 'Rating', 'Metascore'])

# Display the resulting DataFrame
df

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.imdb.com:443
DEBUG:urllib3.connectionpool:https://www.imdb.com:443 "GET /list/ls566941243/ HTTP/1.1" 200 None


Page content has been saved to imdb.txt


Unnamed: 0,Title,Genre,Stars,Runtime,Rating,Metascore
0,Bullet Train,"Action, Comedy, Thriller",7.3,127 min,7.3,49
1,Emancipation,"Action, Thriller",6.2,132 min,6.2,53
2,Violent Night,"Action, Comedy, Thriller",6.7,112 min,6.7,55
3,Top Gun: Maverick,"Action, Drama",8.2,130 min,8.2,78
4,The Batman,"Action, Crime, Drama",7.8,176 min,7.8,72
...,...,...,...,...,...,...
95,Wolf Hound,"Action, Adventure, War",3.7,130 min,3.7,0
96,Pursuit,"Action, Crime, Drama",2.8,95 min,2.8,0
97,The Commando,"Action, Thriller",3.3,93 min,3.3,0
98,Wolves of War,"Action, Thriller, War",3.9,87 min,3.9,0
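
The scraped values in the table above are text (for example `127 min` and `7.3`), and the `Stars` and `Rating` columns hold the same value because both read the same `ipl-rating-star__rating` span. Below is a small sketch of converting such strings to numbers before analysis; the helper names `parse_runtime` and `parse_rating` are hypothetical:

```python
import re


def parse_runtime(value):
    """Turn a string such as '127 min' into the integer 127, or None if no digits."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


def parse_rating(value):
    """Turn a string such as '7.3' into a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


runtime = parse_runtime("127 min")
rating = parse_rating("7.3")
```

With the DataFrame from the cell above, these could be applied column-wise, e.g. `df['Runtime'].map(parse_runtime)`.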


# Method 2: Scrapy for Web Scraping

Conclusion from experimenting:
- It was a hassle to restart the Colab runtime every time we re-ran the scrape and displayed the result as a DataFrame. Running the crawl a second time in the same session raises a `ReactorNotRestartable` error; the fix is to restart the kernel and run the code again. I don't currently have a more efficient process.
- Otherwise it works as expected. I appreciate how tidy it is to access the needed web data by just declaring its nested tags and class names, without having to think about the notation, null cases, or which functions to call (unlike Beautiful Soup).
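
`ReactorNotRestartable` is raised because Scrapy runs on Twisted's reactor, which cannot be started again once it has stopped in a given Python process, so a second `process.start()` in the same kernel fails. One possible workaround is to run each crawl in a child process, which gets a fresh reactor. Below is a minimal sketch using only the standard library. The helper name `run_in_fresh_process` is hypothetical, and the spider source would be passed in as a string in place of the placeholder used here:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_process(source: str, timeout: int = 300) -> str:
    """Write source to a temporary script, run it in a new interpreter, return stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "crawl_job.py"
        script.write_text(source, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # raise CalledProcessError if the script fails
        )
    return result.stdout


# Placeholder script; a real one would define the spider and call process.start()
output = run_in_fresh_process("print('crawl finished')")
```

The child process inherits the current working directory, so a feed file such as `output.json` would still be written where the notebook can read it.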

In [None]:
!pip install scrapy


In [None]:
!pip show scrapy


In [None]:
import scrapy


In [3]:
from scrapy.crawler import CrawlerProcess

In [None]:
# Define the Spider class for IMDb data extraction
class IMDbSpider(scrapy.Spider):
    # Name of the spider
    name = "imdb_spider"
    # Starting URL(s) for the spider to crawl
    start_urls = ["https://www.imdb.com/list/ls566941243/"]
    # start_urls = [url]

    # Parse method to extract data from the webpage
    def parse(self, response):
        # Iterate over each movie item on the webpage
        for movie in response.css('div.lister-item-content'):
            yield {
                'title': movie.css('h3.lister-item-header a::text').get(),
                'genre': movie.css('p.text-muted span.genre::text').get(),
                'runtime': movie.css('p.text-muted span.runtime::text').get(),
                'rating': movie.css('div.ipl-rating-star span.ipl-rating-star__rating::text').get(),
                'metascore': movie.css('div.ratings-metascore span.metascore::text').get()
            }
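# Initialize a CrawlerProcess instance with feed-export settings.
# FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair (Scrapy 2.1+);
# 'overwrite' (Scrapy 2.4+) replaces the file on each run instead of appending to it.
process = CrawlerProcess(settings={
    'FEEDS': {
        'output.json': {'format': 'json', 'overwrite': True},
    },
})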


# Add the IMDbSpider to the crawling process
process.crawl(IMDbSpider)
# Start the crawling process
process.start()

In [5]:
import pandas as pd

# Read the output.json file into a DataFrame (the feed is a single JSON array, not JSON Lines)
df = pd.read_json('output.json')

# Display the DataFrame
df.head()

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


Unnamed: 0,title,genre,runtime,rating,metascore
0,Bullet Train,"\nAction, Comedy, Thriller",127 min,7.3,49.0
1,Emancipation,"\nAction, Thriller",132 min,6.2,53.0
2,Violent Night,"\nAction, Comedy, Thriller",112 min,6.7,55.0
3,Top Gun: Maverick,"\nAction, Drama",130 min,8.2,78.0
4,The Batman,"\nAction, Crime, Drama",176 min,7.8,72.0
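
The `genre` values above keep the leading newline from the page markup, because `::text` returns the raw text node. Below is a sketch of normalising such a value into a list of genres; the helper name `clean_genre` is hypothetical:

```python
def clean_genre(raw):
    """Split a raw genre string into a list of trimmed genre names."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


genres = clean_genre("\nAction, Comedy, Thriller")
```

The same cleanup could be applied inside `parse()` before yielding each item, or afterwards with `df['genre'].map(clean_genre)`.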


# Method 3: Selenium for Web Scraping

Conclusion from experimenting:
- PRO: flexibility and control over the browser’s behavior.
- The key to Selenium is in its Chrome options, which are settings for customizing the behavior of the Chrome browser controlled by Selenium WebDriver. These options enable control over aspects like incognito mode, window size, notifications, and more.

In [None]:
# Here are some important Chrome options that you might find useful.
# They assume chrome_options = webdriver.ChromeOptions(), which is created in a later cell.

# Runs the browser in incognito (private browsing) mode.
chrome_options.add_argument("--incognito")

# Runs the browser in headless mode, i.e., without a graphical user interface.
# Useful for running Selenium tests in the background without opening a visible browser window.
chrome_options.add_argument("--headless")

# Sets the initial window size of the browser.
chrome_options.add_argument("--window-size=1200,600")

# Disables browser notifications.
chrome_options.add_argument("--disable-notifications")

# Disables the infobar that appears at the top of the browser.
chrome_options.add_argument("--disable-infobars")

# Disables browser extensions.
chrome_options.add_argument("--disable-extensions")

# Disables the GPU hardware acceleration.
chrome_options.add_argument("--disable-gpu")

# Disables web security features, which can be useful for testing on localhost without CORS issues.
chrome_options.add_argument("--disable-web-security")

In [None]:
!pip install chromedriver-binary


In [34]:
!which chromedriver


In [None]:
!pip install selenium

In [None]:
!pip show selenium


In [None]:
%reset


In [4]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

In [46]:
!which google-chrome


In [5]:
# URL of the IMDb list
url = "https://www.imdb.com/list/ls566941243/"

In [None]:
# Set up Chrome options: incognito mode plus a fixed remote debugging port
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--remote-debugging-port=9222")


# Initialize the Chrome driver with the specified options
driver = webdriver.Chrome(options=chrome_options)

In [None]:
# Navigate to the IMDb list URL
driver.get(url)

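# Set the implicit wait: later element lookups retry for up to 10 seconds before failing.
# This call does not pause here; driver.get() above already blocks until the page's load event.
driver.implicitly_wait(10)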

# Get the HTML content of the page after it has fully loaded
html_content = driver.page_source

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Save the HTML content to a text file for reference
with open("imdb_selenium.txt", "w", encoding="utf-8") as file:
    file.write(str(soup))
print("Page content has been saved to imdb_selenium.txt")

# Extract movie data from the parsed HTML
movies_data = []
for movie in soup.find_all('div', class_='lister-item-content'):
    title = movie.find('a').text
    genre = movie.find('span', class_='genre').text.strip()
    stars = movie.select_one('div.ipl-rating-star span.ipl-rating-star__rating').text
    runtime = movie.find('span', class_='runtime').text
    rating = movie.select_one('div.ipl-rating-star span.ipl-rating-star__rating').text
    movies_data.append([title, genre, stars, runtime, rating])

# Create a Pandas DataFrame from the collected movie data
df = pd.DataFrame(movies_data, columns=['Title', 'Genre', 'Stars', 'Runtime', 'Rating'])

# Display the resulting DataFrame
print(df)

# Close the Chrome driver
driver.quit()
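
In the cell above, `driver.quit()` only runs if every earlier line succeeds; if parsing raises an exception, the Chrome process stays open. Below is a sketch of a context manager that always quits the driver. `managed_driver` and the `FakeDriver` stand-in are hypothetical names; with Selenium, the factory would be something like `lambda: webdriver.Chrome(options=chrome_options)`:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and guarantee quit() is called afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a WebDriver so the example runs without a browser."""

    def __init__(self):
        self.closed = False
        self.page_source = "<html></html>"

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as driver:
    html_content = driver.page_source
```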

# Method 4: Requests and lxml for Web Scraping

Conclusion from experimenting:

- I dislike the XPath syntax needed to access the data; it is hard to read.
- Apart from that, it works okay.



In [7]:
import requests
from lxml import html
import pandas as pd

In [8]:
url = "https://www.imdb.com/list/ls566941243/"

In [9]:
# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Parse the HTML content using lxml
tree = html.fromstring(response.content)

# Extract movie data from the parsed HTML
titles = tree.xpath('//h3[@class="lister-item-header"]/a/text()')
genres = [', '.join(genre.strip() for genre in genre_list.xpath(".//text()")) for genre_list in tree.xpath('//p[@class="text-muted text-small"]/span[@class="genre"]')]
ratings = tree.xpath('//div[@class="ipl-rating-star small"]/span[@class="ipl-rating-star__rating"]/text()')
runtimes = tree.xpath('//p[@class="text-muted text-small"]/span[@class="runtime"]/text()')

# Create a dictionary with extracted data
data = {
    'Title': titles,
    'Genre': genres,
    'Rating': ratings,
    'Runtime': runtimes
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the resulting DataFrame
df.head()

Unnamed: 0,Title,Genre,Rating,Runtime
0,Bullet Train,"Action, Comedy, Thriller",7.3,127 min
1,Emancipation,"Action, Thriller",6.2,132 min
2,Violent Night,"Action, Comedy, Thriller",6.7,112 min
3,Top Gun: Maverick,"Action, Drama",8.2,130 min
4,The Batman,"Action, Crime, Drama",7.8,176 min
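
The cell above collects titles, genres, ratings, and runtimes as four independent lists. If one movie lacks a field, the lists end up with different lengths and `pd.DataFrame(data)` raises a `ValueError`. Below is a sketch of the row-by-row alternative, where each field is looked up relative to its own movie block. To stay self-contained it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet; with lxml the same idea would use relative XPath expressions (starting with `.//`) on each `lister-item-content` element. The function name `extract_rows` and the sample data are hypothetical:

```python
import xml.etree.ElementTree as ET

# Made-up sample: the second movie has no runtime span
SAMPLE = """
<div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <p><span class="runtime">127 min</span><span class="genre">Action, Comedy</span></p>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
    <p><span class="genre">Drama</span></p>
  </div>
</div>
"""


def extract_rows(root):
    """Build one dict per movie block; missing fields become None."""
    rows = []
    for item in root.findall(".//div[@class='lister-item-content']"):
        rows.append({
            "Title": item.findtext(".//h3[@class='lister-item-header']/a"),
            "Genre": item.findtext(".//span[@class='genre']"),
            "Runtime": item.findtext(".//span[@class='runtime']"),
        })
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
```

Because every row is a complete dict, `pd.DataFrame(rows)` would keep fields aligned and show missing values as `None`.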


# Method 5: LangChain for Web Scraping


Conclusion from experimenting:
- The easier option to use: there is less fuss over syntax, functions, and nested structure getting in the way of accessing the data, so more time can be spent on the actual web scraping and on deciding what data to extract.

In [None]:
!pip install python-dotenv


In [None]:
!pip install comet_llm

In [None]:
!pip install langchain_openai

In [None]:
!pip install langchain_community

In [None]:
!pip install langchain

In [None]:
!pip install openai


In [None]:
!pip install playwright


In [None]:
!playwright install

In [12]:
import os
import dotenv
import time

# Load environment variables from a .env file
dotenv.load_dotenv()

# Retrieve OpenAI and Comet key from environment variables
MY_OPENAI_KEY = os.getenv("MY_OPENAI_KEY")
MY_COMET_KEY = os.getenv("MY_COMET_KEY")

In [15]:
import comet_llm

# Initialize a Comet project
comet_llm.init(project="langchain-web-scraping",
               api_key=MY_COMET_KEY,
               )

Please paste your Comet API key from https://www.comet.com/api/my/settings/
(api key may not show as you type)
Comet API key: ··········


COMET INFO: Valid Comet API Key saved in /root/.comet.config (set COMET_CONFIG to change where it is saved).


In [28]:
# Get the OpenAI API key from the environment variable
openai_api_key = os.environ.get('OPENAI_API_KEY')

# If the environment variable is not set, prompt the user to enter it
if openai_api_key is None:
    openai_api_key = input("Please enter your OpenAI API key: ")


Please enter your OpenAI API key: <redacted>
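
`input()` echoes whatever is typed into the cell output, which is then saved with the notebook. Below is a sketch that reads the key from the environment first and falls back to `getpass.getpass`, which does not echo. The helper name `get_api_key` is hypothetical:

```python
import os
from getpass import getpass


def get_api_key(env_var="OPENAI_API_KEY", prompt="Please enter your OpenAI API key: "):
    """Return the key from the environment, prompting without echo if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        key = getpass(prompt)
    return key
```

In the notebook this would be called as `openai_api_key = get_api_key()`.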


In [21]:
# Resolve async issues by applying nest_asyncio
import nest_asyncio
nest_asyncio.apply()

# Import required modules from langchain
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import create_extraction_chain



In [26]:
from langchain.llms import OpenAI


In [41]:
# Define the URL
url = "https://www.imdb.com/list/ls566941243/"

# Initialize ChatOpenAI instance with the OpenAI API key read earlier (never hard-code keys in the notebook)
llm = ChatOpenAI(openai_api_key=openai_api_key)

# Load HTML content using AsyncChromiumLoader
loader = AsyncChromiumLoader([url])
docs = loader.load()

# Save the HTML content to a text file for reference
with open("imdb_langchain_html.txt", "w", encoding="utf-8") as file:
    file.write(str(docs[0].page_content))
print("Page content has been saved to imdb_langchain_html.txt")

# Transform the loaded HTML using BeautifulSoupTransformer
bs_transformer = BeautifulSoupTransformer()
# Keep only the <h3> and <p> tags, which hold the movie title, genre, rating, and runtime
docs_transformed = bs_transformer.transform_documents(
    docs, tags_to_extract=["h3", "p"]
)

# Split the transformed documents using RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
splits = splitter.split_documents(docs_transformed)

Page content has been saved to imdb_langchain_html.txt


In [42]:
# Define a JSON schema describing the fields to extract for each movie
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "stars": {"type": "integer"},
        "genre": {"type": "array", "items": {"type": "string"}},
        "runtime": {"type": "string"},
        "rating": {"type": "string"},
    },
    "required": ["movie_title", "stars", "genre", "runtime", "rating"],
}

def extract_movie_data(content: str, schema: dict):
    """
    Extract movie data from content using a specified JSON schema.

    Parameters:
    - content (str): Text content containing movie data.
    - schema (dict): JSON schema for validating the movie data.

    Returns:
    - list: Extracted movie records, one dict per movie found.
    """
    # Run the extraction chain with the provided schema and content
    start_time = time.time()
    extracted_content = create_extraction_chain(schema=schema, llm=llm).run(content)
    end_time = time.time()

    # Log metadata and output in the Comet project for tracking purposes
    comet_llm.log_prompt(
        prompt=str(content),
        metadata={"schema": schema},
        output=extracted_content,
        duration=end_time - start_time,
    )

    return extracted_content


In [None]:
# Extract movie data using the defined schema and the first split page content
extracted_content = extract_movie_data(schema=schema, content=splits[0].page_content)

# Display the extracted movie data
extracted_content
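
The cell above only sends the first chunk (`splits[0]`) to the model, so only the movies in that chunk are extracted. Below is a sketch of collecting results from every chunk, assuming the extraction function returns a list of dicts per chunk. `extract_all` and the stub extractor are hypothetical names, used so the example runs without an API key:

```python
def extract_all(chunks, extract_fn):
    """Run extract_fn on every chunk and flatten the per-chunk lists of records."""
    records = []
    for chunk in chunks:
        result = extract_fn(chunk) or []
        records.extend(result)
    return records


def stub_extract(text):
    # Stand-in for the LLM call: one record per non-empty line
    return [{"movie_title": line} for line in text.splitlines() if line]


all_movies = extract_all(["Movie A\nMovie B", "Movie C"], stub_extract)
```

With the notebook's objects this would be roughly `extract_all([s.page_content for s in splits], lambda text: extract_movie_data(text, schema))`, and `pd.DataFrame(all_movies)` would turn the result into a table. Each chunk is a separate API call, so cost grows with the number of chunks.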