# Web Scraping with 5 Different Methods: All You Need to Know
### Don't miss the last method, which uses an LLM for web scraping

![Author](https://img.shields.io/badge/Author-Nhi%20Yen-brightgreen)
[![Medium](https://img.shields.io/badge/Medium-Follow%20Me-blue)](https://medium.com/@yennhi95zz/subscribe)
[![GitHub](https://img.shields.io/badge/GitHub-Follow%20Me-lightgrey)](https://github.com/yennhi95zz)
[![Kaggle](https://img.shields.io/badge/Kaggle-Follow%20Me-orange)](https://www.kaggle.com/nhiyen/code)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect%20with%20Me-informational)](https://www.linkedin.com/in/yennhi95zz/)


This notebook accompanies the article and project below:
- Find the complete code in this [GitHub repository](https://github.com/yennhi95zz/langchain-web-scraping).
- Read a detailed explanation in my [Medium article](https://medium.com/@yennhi95zz/everything-about-how-to-web-scrape-using-5-different-methods-403a59fceea0).
- See the experiment tracking in the [Comet LLM project](https://www.comet.com/yennhi95zz/langchain-web-scraping/prompts).

Get UNLIMITED access to every story on Medium with just $1/week ▶ [HERE](https://medium.com/@yennhi95zz/membership)


# Method 1: BeautifulSoup and Requests

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/list/ls566941243/"

# Step 1: Send a GET request to the specified URL
response = requests.get(url)

# Step 2: Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Save the HTML content to a text file for reference
with open("imdb_bs4_html.txt", "w", encoding="utf-8") as file:
    file.write(str(soup))
print("Page content has been saved to imdb_bs4_html.txt")

# Step 4: Extract movie data from the parsed HTML and store it in a list
movies_data = []
for movie in soup.find_all('div', class_='lister-item-content'):
    title = movie.find('a').text
    genre = movie.find('span', class_='genre').text.strip()
    stars = movie.find('div', class_='ipl-rating-star').find('span', class_='ipl-rating-star__rating').text
    runtime = movie.find('span', class_='runtime').text
    rating = movie.find('span', class_='ipl-rating-star__rating').text
    movies_data.append([title, genre, stars, runtime, rating])

# Step 5: Create a Pandas DataFrame from the extracted movie data
df = pd.DataFrame(movies_data, columns=['Title', 'Genre', 'Stars', 'Runtime', 'Rating'])

# Display the resulting DataFrame
df


Page content has been saved to imdb_bs4_html.txt


| | Title | Genre | Stars | Runtime | Rating |
|---|---|---|---|---|---|
| 0 | Bullet Train | Action, Comedy, Thriller | 7.3 | 127 min | 7.3 |
| 1 | Emancipation | Action, Thriller | 6.2 | 132 min | 6.2 |
| 2 | Violent Night | Action, Comedy, Thriller | 6.7 | 112 min | 6.7 |
| 3 | Top Gun: Maverick | Action, Drama | 8.3 | 130 min | 8.3 |
| 4 | The Batman | Action, Crime, Drama | 7.8 | 176 min | 7.8 |
| ... | ... | ... | ... | ... | ... |
| 95 | Wolf Hound | Action, Adventure, War | 3.7 | 130 min | 3.7 |
| 96 | Pursuit | Action, Crime, Drama | 2.8 | 95 min | 2.8 |
| 97 | The Commando | Action, Thriller | 3.3 | 93 min | 3.3 |
| 98 | Wolves of War | Action, Thriller, War | 3.9 | 87 min | 3.9 |
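
The Runtime and Rating values in the output above are strings such as `127 min` and `7.3`. Below is a minimal sketch of converting them to numbers before any analysis, assuming the formats shown in that output. The helper names `parse_runtime_minutes` and `parse_rating` are hypothetical and not part of the original notebook.

```python
import re
from typing import Optional


def parse_runtime_minutes(runtime: Optional[str]) -> Optional[int]:
    """Turn a string like '127 min' into 127; return None if no digits are found."""
    if not runtime:
        return None
    match = re.search(r"\d+", runtime.replace(",", ""))
    return int(match.group()) if match else None


def parse_rating(rating: Optional[str]) -> Optional[float]:
    """Turn a string like '7.3' into 7.3; return None if it is not a number."""
    try:
        return float(rating)
    except (TypeError, ValueError):
        return None


# Sample rows copied from the output above: (title, runtime, rating)
rows = [("Bullet Train", "127 min", "7.3"), ("Emancipation", "132 min", "6.2")]
cleaned = [(t, parse_runtime_minutes(rt), parse_rating(r)) for t, rt, r in rows]
print(cleaned)
```

With the DataFrame from Step 5, the same helpers could be applied column-wise, for example `df['Runtime'].map(parse_runtime_minutes)`.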


# Method 2: Scrapy

In [2]:
url = "https://www.imdb.com/list/ls566941243/"

In [3]:
# Import necessary libraries
import scrapy
from scrapy.crawler import CrawlerProcess

# Define the Spider class for IMDb data extraction
class IMDbSpider(scrapy.Spider):
    # Name of the spider
    name = "imdb_spider"
    # Starting URL(s) for the spider to crawl
    # start_urls = ["https://www.imdb.com/list/ls566941243/"]
    start_urls = [url]

    # Parse method to extract data from the webpage
    def parse(self, response):
        # Iterate over each movie item on the webpage
        for movie in response.css('div.lister-item-content'):
            yield {
                'title': movie.css('h3.lister-item-header a::text').get(),
                'genre': movie.css('p.text-muted span.genre::text').get(),
                'runtime': movie.css('p.text-muted span.runtime::text').get(),
                'rating': movie.css('div.ipl-rating-star span.ipl-rating-star__rating::text').get(),
            }

# Initialize a CrawlerProcess instance with feed export settings.
# FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair.
process = CrawlerProcess(settings={
    'FEEDS': {
        'output_scrapy.json': {
            'format': 'json',
            'overwrite': True,  # Replace the file on each run instead of appending to it
        },
    },
})


# Add the IMDbSpider to the crawling process
process.crawl(IMDbSpider)
# Start the crawling process
process.start()



In [4]:
import pandas as pd

# Read the output_scrapy.json file (a single JSON array) into a DataFrame
df = pd.read_json('output_scrapy.json')

# Display the DataFrame
df.head()

| | title | genre | runtime | rating |
|---|---|---|---|---|
| 0 | Bullet Train | \nAction, Comedy, Thriller | 127 min | 7.3 |
| 1 | Emancipation | \nAction, Thriller | 132 min | 6.2 |
| 2 | Violent Night | \nAction, Comedy, Thriller | 112 min | 6.7 |
| 3 | Top Gun: Maverick | \nAction, Drama | 130 min | 8.3 |
| 4 | The Batman | \nAction, Crime, Drama | 176 min | 7.8 |
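
The genre values returned by the spider keep the leading newline (and possibly trailing whitespace) from the page markup. Here is a small sketch of normalizing them into a list of genre names. The function name `normalize_genre` is hypothetical, and the sample strings are modeled on the output above.

```python
from typing import List, Optional


def normalize_genre(raw: Optional[str]) -> List[str]:
    """Split a raw genre string into clean genre names, dropping empty pieces."""
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


# Raw values as the spider yields them, plus a missing value
raw_values = ["\nAction, Comedy, Thriller", "\nAction, Drama            ", None]
print([normalize_genre(value) for value in raw_values])
```

The same cleanup could also be applied inside `parse` before each item is yielded, so the JSON file already contains clean values.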


# Method 3: Selenium

In [6]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

# URL of the IMDb list
url = "https://www.imdb.com/list/ls566941243/"

# Set up Chrome options to run the browser in incognito mode
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

# Initialize the Chrome driver with the specified options
driver = webdriver.Chrome(options=chrome_options)

# Navigate to the IMDb list URL
driver.get(url)

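# Set an implicit wait: element lookups through the driver retry for up to 10 seconds.
# driver.get() above already blocks until the page's load event fires, so this
# setting does not add a delay before page_source is read.
driver.implicitly_wait(10)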

# Get the HTML content of the page after it has fully loaded
html_content = driver.page_source

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Save the HTML content to a text file for reference
with open("imdb_selenium_html.txt", "w", encoding="utf-8") as file:
    file.write(str(soup))
print("Page content has been saved to imdb_selenium_html.txt")

# Extract movie data from the parsed HTML
movies_data = []
for movie in soup.find_all('div', class_='lister-item-content'):
    title = movie.find('a').text
    genre = movie.find('span', class_='genre').text.strip()
    stars = movie.select_one('div.ipl-rating-star span.ipl-rating-star__rating').text
    runtime = movie.find('span', class_='runtime').text
    rating = movie.select_one('div.ipl-rating-star span.ipl-rating-star__rating').text
    movies_data.append([title, genre, stars, runtime, rating])

# Create a Pandas DataFrame from the collected movie data
df = pd.DataFrame(movies_data, columns=['Title', 'Genre', 'Stars', 'Runtime', 'Rating'])

# Display the resulting DataFrame
print(df)

# Close the Chrome driver
driver.quit()



Page content has been saved to imdb_selenium_html.txt




                      Title                     Genre Stars  Runtime Rating
0              Bullet Train  Action, Comedy, Thriller   7.3  127 min    7.3
1              Emancipation          Action, Thriller   6.2  132 min    6.2
2             Violent Night  Action, Comedy, Thriller   6.7  112 min    6.7
3         Top Gun: Maverick             Action, Drama   8.3  130 min    8.3
4                The Batman      Action, Crime, Drama   7.8  176 min    7.8
..                      ...                       ...   ...      ...    ...
95               Wolf Hound    Action, Adventure, War   3.7  130 min    3.7
96                  Pursuit      Action, Crime, Drama   2.8   95 min    2.8
97             The Commando          Action, Thriller   3.3   93 min    3.3
98            Wolves of War     Action, Thriller, War   3.9   87 min    3.9
99  Diabolik: Ginko Attacks    Action, Crime, Mystery   5.4  116 min    5.4

[100 rows x 5 columns]




# Method 4: Requests and lxml

In [7]:
import requests
from lxml import html
import pandas as pd

# Define the URL
url = "https://www.imdb.com/list/ls566941243/"

# Send an HTTP request to the URL and get the response
response = requests.get(url)

# Parse the HTML content using lxml
tree = html.fromstring(response.content)

# Extract movie data from the parsed HTML
titles = tree.xpath('//h3[@class="lister-item-header"]/a/text()')
genres = [', '.join(genre.strip() for genre in genre_list.xpath(".//text()")) for genre_list in tree.xpath('//p[@class="text-muted text-small"]/span[@class="genre"]')]
ratings = tree.xpath('//div[@class="ipl-rating-star small"]/span[@class="ipl-rating-star__rating"]/text()')
runtimes = tree.xpath('//p[@class="text-muted text-small"]/span[@class="runtime"]/text()')

# Create a dictionary with extracted data
data = {
    'Title': titles,
    'Genre': genres,
    'Rating': ratings,
    'Runtime': runtimes
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the resulting DataFrame
df.head()




| | Title | Genre | Rating | Runtime |
|---|---|---|---|---|
| 0 | Bullet Train | Action, Comedy, Thriller | 7.3 | 127 min |
| 1 | Emancipation | Action, Thriller | 6.2 | 132 min |
| 2 | Violent Night | Action, Comedy, Thriller | 6.7 | 112 min |
| 3 | Top Gun: Maverick | Action, Drama | 8.3 | 130 min |
| 4 | The Batman | Action, Crime, Drama | 7.8 | 176 min |
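
Building each column with its own page-wide XPath query assumes every movie has every field. If one entry lacks a runtime, the lists end up with different lengths and `pd.DataFrame(data)` raises a `ValueError`. The sketch below extracts the fields item by item, so a missing field becomes `None` and the rows stay aligned. To run without extra dependencies it uses the standard library's `xml.etree.ElementTree` on a small hand-written, well-formed snippet instead of lxml and the live page. The sample markup and the helper names `text_or_none` and `extract_rows` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample markup; the second movie has no runtime on purpose
SAMPLE = """
<div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie A</a></h3>
    <span class="genre">Action, Drama</span>
    <span class="runtime">120 min</span>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header"><a>Movie B</a></h3>
    <span class="genre">Comedy</span>
  </div>
</div>
"""


def text_or_none(node, path):
    """Return the stripped text of the first match under node, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def extract_rows(markup):
    """Build one dictionary per movie so that fields never shift between rows."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall(".//div[@class='lister-item-content']"):
        rows.append({
            "Title": text_or_none(item, "./h3[@class='lister-item-header']/a"),
            "Genre": text_or_none(item, "./span[@class='genre']"),
            "Runtime": text_or_none(item, "./span[@class='runtime']"),
        })
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

With lxml, the same pattern would loop over `tree.xpath('//div[@class="lister-item-content"]')` and run relative XPath queries (starting with `./` or `.//`) on each item. The resulting list of dictionaries can be passed directly to `pd.DataFrame`.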


# Method 5: LangChain

- [LangChain Beautiful Soup](https://python.langchain.com/docs/integrations/document_transformers/beautiful_soup)
- [LangChain Extraction](https://python.langchain.com/docs/use_cases/extraction)

In [8]:
import os
import dotenv
import time

# Load environment variables from a .env file
dotenv.load_dotenv()

# Retrieve the OpenAI and Comet API keys from environment variables
MY_OPENAI_KEY = os.getenv("MY_OPENAI_KEY")
MY_COMET_KEY = os.getenv("MY_COMET_KEY")

In [9]:
import comet_llm

# Initialize a Comet project
comet_llm.init(project="langchain-web-scraping",
               api_key=MY_COMET_KEY,
               )


In [10]:
# Resolve async issues by applying nest_asyncio
import nest_asyncio
nest_asyncio.apply()

# Import required modules from langchain
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import create_extraction_chain

# Define the URL
url = "https://www.imdb.com/list/ls566941243/"

# Initialize ChatOpenAI instance with OpenAI API key
llm = ChatOpenAI(openai_api_key=MY_OPENAI_KEY)

# Load HTML content using AsyncChromiumLoader
loader = AsyncChromiumLoader([url])
docs = loader.load()

# Save the HTML content to a text file for reference
with open("imdb_langchain_html.txt", "w", encoding="utf-8") as file:
    file.write(str(docs[0].page_content))
print("Page content has been saved to imdb_langchain_html.txt")

# Transform the loaded HTML using BeautifulSoupTransformer
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
    docs, tags_to_extract=["h3", "p"]
)

# Split the transformed documents using RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=0)
splits = splitter.split_documents(docs_transformed)



Page content has been saved to imdb_langchain_html.txt


In [11]:
# Define a JSON schema for movie data validation
schema = {
    "properties": {
        "movie_title": {"type": "string"},
        "stars": {"type": "integer"},
        "genre": {"type": "array", "items": {"type": "string"}},
        "runtime": {"type": "string"},
        "rating": {"type": "string"},
    },
    "required": ["movie_title", "stars", "genre", "runtime", "rating"],
}

def extract_movie_data(content: str, schema: dict):
    """
    Extract movie data from content using a specified JSON schema.

    Parameters:
    - content (str): Text content containing movie data.
    - schema (dict): JSON schema for validating the movie data.

    Returns:
    - list[dict]: One dictionary per movie found in the content.
    """
    # Run the extraction chain with the provided schema and content
    start_time = time.time()
    extracted_content = create_extraction_chain(schema=schema, llm=llm).run(content)
    end_time = time.time()

    # Log metadata and output in the Comet project for tracking purposes
    comet_llm.log_prompt(
        prompt=str(content),
        metadata= {
            "schema": schema
        },
        output= extracted_content,
        duration= end_time - start_time,
    )

    return extracted_content
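
The function above handles one chunk of text per call. Below is a sketch of running an extraction function over every chunk and merging the results, keeping only the first record for each title. The names `extract_from_all_chunks` and `fake_extract` are hypothetical. `fake_extract` stands in for the LLM call so the example runs offline.

```python
from typing import Callable, Iterable, List


def extract_from_all_chunks(
    chunks: Iterable[str],
    extract: Callable[[str], List[dict]],
) -> List[dict]:
    """Run `extract` on each chunk and keep the first record seen per title."""
    seen_titles = set()
    merged = []
    for chunk in chunks:
        for record in extract(chunk) or []:
            title = record.get("movie_title")
            if title in seen_titles:
                continue
            seen_titles.add(title)
            merged.append(record)
    return merged


def fake_extract(chunk: str) -> List[dict]:
    """Stand-in for the LLM call: treats every non-empty line as a movie title."""
    return [{"movie_title": line.strip()} for line in chunk.splitlines() if line.strip()]


# Two chunks that both mention 'Emancipation'
chunks = ["Bullet Train\nEmancipation", "Emancipation\nViolent Night"]
print(extract_from_all_chunks(chunks, fake_extract))
```

In this notebook, it could be called as `extract_from_all_chunks((s.page_content for s in splits), lambda text: extract_movie_data(content=text, schema=schema))`. Each chunk triggers a separate request to the model, so cost and run time grow with the number of chunks.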




In [12]:
# Extract movie data using the defined schema and the first split page content
extracted_content = extract_movie_data(schema=schema, content=splits[0].page_content)

# Display the extracted movie data
extracted_content


Chain logged to https://www.comet.com/yennhi95zz/langchain-web-scraping



[{'movie_title': 'Bullet Train',
  'stars': 18,
  'genre': ['Action', 'Comedy', 'Thriller'],
  'runtime': '127 min',
  'rating': None},
 {'movie_title': 'Emancipation',
  'stars': None,
  'genre': ['Action', 'Thriller'],
  'runtime': '132 min',
  'rating': 'R'},
 {'movie_title': 'Violent Night',
  'stars': None,
  'genre': ['Action', 'Comedy', 'Thriller'],
  'runtime': '112 min',
  'rating': 'R'},
 {'movie_title': 'Top Gun: Maverick',
  'stars': None,
  'genre': ['Action', 'Drama'],
  'runtime': '130 min',
  'rating': 'P13'},
 {'movie_title': 'The Batman',
  'stars': None,
  'genre': ['Action', 'Crime', 'Drama'],
  'runtime': '176 min',
  'rating': 'P13'}]
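
Several records in the output above have `None` for fields that the schema lists as required, so it is worth checking the result before using it. Here is a minimal sketch of such a check. The helper name `find_missing_fields` is hypothetical, and the two sample records are copied from the output above.

```python
from typing import Dict, List

# Same field names as the "required" list in the schema
REQUIRED_FIELDS = ["movie_title", "stars", "genre", "runtime", "rating"]


def find_missing_fields(records: List[dict], required: List[str]) -> Dict[str, List[str]]:
    """Map each movie title to the required fields that are absent or None."""
    problems = {}
    for index, record in enumerate(records):
        missing = [field for field in required if record.get(field) is None]
        if missing:
            # Fall back to the record's position when the title itself is missing
            title = record.get("movie_title") or f"record {index}"
            problems[title] = missing
    return problems


records = [
    {"movie_title": "Bullet Train", "stars": 18,
     "genre": ["Action", "Comedy", "Thriller"], "runtime": "127 min", "rating": None},
    {"movie_title": "Emancipation", "stars": None,
     "genre": ["Action", "Thriller"], "runtime": "132 min", "rating": "R"},
]
print(find_missing_fields(records, REQUIRED_FIELDS))
# {'Bullet Train': ['rating'], 'Emancipation': ['stars']}
```

Records flagged this way could be retried, filled in from one of the selector-based methods, or dropped, depending on how the data will be used.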