# Crawling the PhapDien Website

## I. Crawling the PhapDien Website

### Objective:
Download and organize legal documents from the PhapDien website into structured directories for further processing.

### Approach: [ISODS-PhapDien-Crawler-Semantic-Search](https://github.com/saladnga/ISODS-PhapDien-Crawler-Semantic-Search)

In [None]:
%pip install requests beautifulsoup4 lxml selenium

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
import requests
import threading
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from queue import Queue
from concurrent.futures import ThreadPoolExecutor, as_completed

In [2]:
# URLs and base directory
full_base_url = "https://vbpl.vn/TW/Pages/vbpq-toanvan.aspx?ItemID={}"
property_base_url = "https://vbpl.vn/tw/Pages/vbpq-thuoctinh.aspx?&ItemID={}"
history_base_url = "https://vbpl.vn/tw/Pages/vbpq-lichsu.aspx?&ItemID={}"
related_base_url = "https://vbpl.vn/TW/Pages/vbpq-vanbanlienquan.aspx?ItemID={}"
pdf_base_url = "https://vbpl.vn/tw/Pages/vbpq-van-ban-goc.aspx?ItemID={}"
base_dir = "BoPhapDienDienTu"

### Create Additional Directories:
- vbpl: For full text HTML documents
- property: For property pages of the documents
- history: For history pages of the documents
- related: For related pages of the documents
- pdf: For PDF files of the documents

In [3]:
# Directories
demuc_dir = os.path.join(base_dir, "demuc")
vbpl_dir = os.path.join(base_dir, "vbpl")
property_dir = os.path.join(base_dir, "property")
history_dir = os.path.join(base_dir, "history")
related_dir = os.path.join(base_dir, "related")
pdf_dir = os.path.join(base_dir, "pdf")

### Crawl and Save HTML Documents:
- Use os and BeautifulSoup to extract and iterate through unique ItemIDs from the index HTML files in the BoPhapDienDienTu/demuc directory

In [4]:
# Ensure all directories are created
os.makedirs(vbpl_dir, exist_ok=True)
os.makedirs(property_dir, exist_ok=True)
os.makedirs(history_dir, exist_ok=True)
os.makedirs(pdf_dir, exist_ok=True)
os.makedirs(related_dir, exist_ok=True)

In [5]:
# Initialize ChromeDriver to download PDF files
chromedriver_path = "chromedriver/chromedriver"
webdriver_pool = Queue()
max_workers = 5
webdriver_lock = threading.Lock()

In [6]:
# Initialize WebDriver pool
def init_webdriver():
    for _ in range(max_workers):
        options = Options()
        service = Service(chromedriver_path)
        driver = webdriver.Chrome(service=service, options=options)
        webdriver_pool.put(driver)

In [7]:
# Get a WebDriver instance
def get_webdriver():
    with webdriver_lock:
        return webdriver_pool.get()

In [8]:
# Return a WebDriver instance to the pool
def return_webdriver(driver):
    with webdriver_lock:
        webdriver_pool.put(driver)

In [9]:
# Extract IDs from HTML files
def extract_item_ids(demuc_dir):
    item_ids = set()
    for root, _, files in os.walk(demuc_dir):
        for file in files:
            if file.endswith(".html"):
                file_path = os.path.join(root, file)
                with open(file_path, "r", encoding="utf-8") as f:
                    soup = BeautifulSoup(f.read(), "lxml")
                    for link in soup.find_all("a", href=True):
                        href = link["href"]
                        if "ItemID=" in href:
                            item_id = href.split("ItemID=")[1].split("&")[0]
                            item_ids.add(item_id)
    return item_ids

- For each ItemIDs, construct their correspond URLs and save their content using request:
    - For full text:
        - URL: https://vbpl.vn/TW/Pages/vbpq-toanvan.aspx?ItemID=
        - File path: full_.html
        - Designated directory: BoPhapDienDienTu/vbpl
    - For property:
        - URL: https://vbpl.vn/tw/Pages/vbpq-thuoctinh.aspx?&ItemID=
        - File path: p_.html
        - Designated directory: BoPhapDienDienTu/property
    - For history
        - URL: https://vbpl.vn/tw/Pages/vbpq-lichsu.aspx?&ItemID=
        - File path: h_.html
        - Designated directory: BoPhapDienDienTu/history
    - For related:
        - URL: https://vbpl.vn/tw/Pages/vbpq-vanbanlienquan.aspx?&ItemID=
        - File path: r_.html
        - Designated directory: BoPhapDienDienTu/related

In [10]:
# Download HTML files from the URL
def download_html(html_url, save_path):
    if os.path.exists(save_path):
        print(f"Skipping HTML file - already exists: {save_path}")
        return

    try:
        response = requests.get(html_url)
        if response.status_code == 200:
            with open(save_path, "wb") as file:
                file.write(response.content)
            print(f"HTML file downloaded successfully - {save_path}")
    except requests.RequestException as e:
        print(f"Failed to download HTML file - {html_url} : {e}")

### Download PDF Files Dynamically:
- Use Selenium and ChromDriver to extract and download PDF files.
- Locate the PDF link in the data attribute of the tag using XPath
    - For PDF:
        - URL: https://vbpl.vn/tw/Pages/ vbpq-van-ban-goc?&ItemID=
        - File path: pdf_.pdf
        - Designated directory: BoPhapDienDienTu/pdf

In [11]:
# Find the element for Selenium to download PDF files
def find_element(driver, xpath):
    try:
        return driver.find_element(By.XPATH, xpath)
    except Exception as e:
        print(f"Error finding element by XPath: {e}")
        return None

In [12]:
# Download PDF files from the URL using Selenium
def download_pdf(url, save_path):
    if os.path.exists(save_path):
        print(f"PDF file already exists: {save_path}")
        return

    attempts = 3
    for attempt in range(1, attempts + 1):
        driver = None
        try:
            driver = get_webdriver()
            driver.get(url)
            time.sleep(10)

            pdf_window = find_element(driver, "//object")
            if pdf_window:
                relative_pdf_url = pdf_window.get_attribute("data")
                if relative_pdf_url:
                    if relative_pdf_url.startswith("https://vbpl.vn") == False:
                        pdf_url = f"https://vbpl.vn{relative_pdf_url.lstrip('/')}"
                    else:
                        pdf_url = relative_pdf_url

                    print(f"Found PDF URL at {pdf_url}")

                    response = requests.get(pdf_url, timeout=(30, 1200))
                    if response.status_code == 200:
                        with open(save_path, "wb") as file:
                            file.write(response.content)
                        print(f"PDF downloaded successfully: {save_path}")
                        return
                else:
                    print("PDF URL not found on the page")
            else:
                print("Cannot locate PDF")

        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
        finally:
            if driver:
                return_webdriver(driver)

        time.sleep(5)
    print(f"Failed to download PDF after {attempts} attempts: {url}")

In [13]:
# Download all files with given ItemIDs
def scrape_item(item_id):
    urls = {
        "full_doc": (full_base_url, "full_{}.html", vbpl_dir),
        "property": (property_base_url, "p_{}.html", property_dir),
        "history": (history_base_url, "h_{}.html", history_dir),
        "related": (related_base_url, "r_{}.html", related_dir),
        "pdf": (pdf_base_url, "p_{}.pdf", pdf_dir),
    }

    for type, (url, file_name, dir) in urls.items():
        url = url.format(item_id)
        file_name = file_name.format(item_id.split("#")[0])
        file_path = os.path.join(dir, file_name)

        if type == "pdf":
            download_pdf(url, file_path)
        else:
            download_html(url, file_path)

### Optimize Downloads with Multiprocessing:
- Use ThreadPoolExecutor from the concurrent.futures module download files concurrently, significantly speeding up the process

In [None]:
# Execution (multiprocessing)
full_item_ids = extract_item_ids(demuc_dir)
init_webdriver()

try:
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(scrape_item, item_id) for item_id in full_item_ids]
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Error processing item: {e}")
except KeyboardInterrupt:
    print("Exit")
finally:
    while not webdriver_pool.empty():
        driver = webdriver_pool.get()
        driver.quit()

Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_.htmlSkipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_133859.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_133859.html

Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_124052.html
Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_136705.html
Skipping HTML file - already exists: BoPhapDienDienTu/history/h_133859.html
Skipping HTML file - already exists: BoPhapDienDienTu/related/r_133859.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_136705.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_124052.html
Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_146048.html
Skipping HTML file - already exists: BoPhapDienDienTu/history/h_136705.html
Skipping HTML file - already exists: BoPhapDienDienTu/history/h_124052.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_146048.html
Skipping HTML 

Failed to download PDF after 3 attempts: https://vbpl.vn/tw/Pages/vbpq-van-ban-goc.aspx?ItemID=146048#Chuong_V_Dieu_22
Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_117868.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_117868.html
Skipping HTML file - already exists: BoPhapDienDienTu/history/h_117868.html
Skipping HTML file - already exists: BoPhapDienDienTu/related/r_117868.html
Failed to download PDF after 3 attempts: https://vbpl.vn/tw/Pages/vbpq-van-ban-goc.aspx?ItemID=133859#Chuong_II_Dieu_8
Skipping HTML file - already exists: BoPhapDienDienTu/vbpl/full_136038.html
Skipping HTML file - already exists: BoPhapDienDienTu/property/p_136038.html
Skipping HTML file - already exists: BoPhapDienDienTu/history/h_136038.html
Skipping HTML file - already exists: BoPhapDienDienTu/related/r_136038.html
