# Ledger Insights Articles Scraper

This script scrapes article headlines, links, and publication dates from the Ledger Insights website's tokenization section. The scraped data is saved into a CSV file, appending new relevant articles each time the script is run.

## Prerequisites

1. **Python**: Ensure you have Python installed on your system. You can download it from [python.org](https://www.python.org/).
2. **Selenium WebDriver**: Install Selenium WebDriver for Python using the following command:
   ```bash
   pip install selenium
   ```
3. **BeautifulSoup and lxml**: Install BeautifulSoup and the `lxml` parser the script uses for parsing HTML using the following command:
   ```bash
   pip install beautifulsoup4 lxml
   ```
4. **ChromeDriver**: Download the ChromeDriver release that matches your installed Chrome version from [chromedriver.chromium.org](https://chromedriver.chromium.org/downloads) and ensure the `webdriver_path` variable in the script points to it.

## Script Overview

#### *Setup*

1. **Path to ChromeDriver**: Ensure the `webdriver_path` variable points to the correct location of your ChromeDriver executable.
2. **Selenium WebDriver Initialization**: The script sets up the Selenium WebDriver and opens the target URL.

#### *CSV File Handling*

1. **Append Mode**: The script opens the CSV file (`ledgerinsights_articles.csv`) in append mode. If the file does not exist, it creates a new one and writes the header row.
2. **Avoiding Duplicates**: The script reads existing headlines from the CSV file to avoid duplications when appending new data.
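
The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the script's exact code: the helper names `load_existing_headlines` and `append_articles` are hypothetical, and the sketch assumes the headline is always the first column.

```python
import csv
import os

HEADER = ["headline", "link", "date"]


def load_existing_headlines(csv_path):
    """Return the set of headlines already stored in csv_path (empty if the file is missing)."""
    headlines = set()
    if not os.path.isfile(csv_path):
        return headlines
    with open(csv_path, "r", newline="", encoding="utf-8") as read_file:
        reader = csv.reader(read_file)
        next(reader, None)  # Skip the header row; tolerate an empty file
        for row in reader:
            if row:
                headlines.add(row[0])
    return headlines


def append_articles(csv_path, articles):
    """Append (headline, link, date) rows and return how many were written.

    The header is written only when the file is new or empty, and rows whose
    headline is already present are skipped.
    """
    seen = load_existing_headlines(csv_path)
    write_header = not os.path.isfile(csv_path) or os.path.getsize(csv_path) == 0
    written = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if write_header:
            writer.writerow(HEADER)
        for headline, link, date in articles:
            if headline in seen:
                continue
            writer.writerow([headline, link, date])
            seen.add(headline)
            written += 1
    return written
```

Using a `with` block closes the file even if a later step raises, which the append-then-close pattern does not guarantee.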

#### *Keywords for Filtering Articles*

The script keeps an article only if its headline contains, ignoring case, at least one of the following keywords: `digital assets`, `digital securities`, `tokenized`, `tokenization`, `bond`, `security`, `asset`, `token`.
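
As a small sketch of how this filter behaves, the check is a plain substring test on the lowercased headline. The function name `matches_keywords` is hypothetical, and the second and third sample headlines are made up for illustration.

```python
KEYWORDS = [
    "digital assets", "digital securities", "tokenized", "tokenization",
    "bond", "security", "asset", "token",
]


def matches_keywords(headline, keywords=KEYWORDS):
    """Return True if any keyword occurs as a substring of the lowercased headline."""
    lowered = headline.lower()
    return any(keyword in lowered for keyword in keywords)


print(matches_keywords("HSBC tokenizes gold"))        # True: contains "token"
print(matches_keywords("Central bank raises rates"))  # False: no keyword present
print(matches_keywords("Stablecoin bonding curve"))   # True: "bond" matches inside "bonding"
```

Because matching is by substring, `token` already covers `tokenized` and `tokenization`, and `asset` covers `digital assets`. Short keywords can also match inside longer, unrelated words.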

#### *Closing Cookie Consent*

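The script defines a function, `close_cookie_consent`, that waits up to 10 seconds for the cookie consent button and clicks it if it appears. If the button cannot be found or clicked, the script prints the error and continues.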

#### *Loading More Articles*

The script clicks the site's "Load more" button repeatedly (`num_clicks` times, 20 by default) to load additional articles, and stops early if the button can no longer be found. This step is specific to the Ledger Insights page layout, so the script will likely fail on other sites without changes.
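
The click loop can be sketched independently of Selenium. In this illustration `click_load_more` and its `click_once` parameter are hypothetical names: `click_once` stands for any callable that finds and clicks the button and raises when it cannot.

```python
import time


def click_load_more(click_once, num_clicks=20, pause_seconds=3.0):
    """Call click_once up to num_clicks times and return how many clicks succeeded."""
    clicks = 0
    for _ in range(num_clicks):
        try:
            click_once()
        except Exception as exc:
            # No button left (or the page changed): stop instead of retrying
            print(f"Stopped after {clicks} clicks: {exc}")
            break
        clicks += 1
        time.sleep(pause_seconds)  # Give the new articles time to load
    return clicks


# With Selenium, click_once could wrap the wait-and-click used in the script:
#   def click_once():
#       WebDriverWait(driver, 10).until(
#           EC.element_to_be_clickable((By.XPATH, '//a[contains(text(), "Load more")]'))
#       ).click()
```

Returning the number of successful clicks makes it easy to tell whether the page ran out of articles before `num_clicks` was reached.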

#### *Parsing and Scraping*

1. **Parsing with BeautifulSoup**: The script parses the loaded page content using BeautifulSoup.
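2. **Scraping Data**: For each matching article, the script takes the headline and link from the listing page, opens the article in a new tab to read its publication date, and skips articles already seen in this run or already present in the CSV file.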
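
To illustrate the extraction and duplicate check without extra dependencies, the sketch below swaps in Python's built-in `html.parser` for BeautifulSoup. It collects the same kind of data as `soup.find_all('a', title=True)`. The names `TitledLinkParser` and `extract_new_links` are hypothetical, and unlike the script this sketch marks every returned link as seen, not only keyword matches.

```python
from html.parser import HTMLParser


class TitledLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags that carry both attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        title = (attributes.get("title") or "").strip()
        href = attributes.get("href")
        if title and href:
            self.links.append((title, href))


def extract_new_links(html, existing_headlines):
    """Return titled links, skipping repeated URLs and already-known headlines."""
    parser = TitledLinkParser()
    parser.feed(html)
    seen_links = set()
    result = []
    for title, href in parser.links:
        if href in seen_links or title in existing_headlines:
            continue
        seen_links.add(href)
        result.append((title, href))
    return result
```

Tracking URLs in a set matters because listing pages often link the same article twice, for example once from a thumbnail and once from the headline text.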

#### *Writing Data to CSV*

The script writes the scraped data (headline, link, date) to the CSV file and prints the data to the console.

## Running the Script

1. Ensure all prerequisites are met.
2. Adjust the `webdriver_path` variable to the correct path of your ChromeDriver.
3. Run the script using Python:
   ```bash
   python ledger_insights_scraper.py
   ```
4. The script will append new relevant articles to `ledgerinsights_articles.csv`.

#### *Note*: You will have to accept the privacy notice manually in the browser window for the script to run properly.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
import time
import os

# Path to your ChromeDriver
webdriver_path = 'C:/Program Files/chromedriver-win64/chromedriver.exe'  # Ensure this path is correct

# Setup Selenium WebDriver
service = Service(webdriver_path)
options = webdriver.ChromeOptions()
#options.add_argument('--headless')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://www.ledgerinsights.com/tokenization/')

# Open a CSV file to append the scraped data
csv_filename = 'ledgerinsights_articles.csv'
file_exists = os.path.isfile(csv_filename)
csv_file = open(csv_filename, 'a', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)

# Write header only if the file does not already exist
if not file_exists:
    csv_writer.writerow(['headline', 'link', 'date'])

# Read existing headlines to avoid duplicates
existing_headlines = set()
if file_exists:
    with open(csv_filename, 'r', encoding='utf-8') as read_file:
        csv_reader = csv.reader(read_file)
        next(csv_reader, None)  # Skip header row (tolerates an empty file)
        for row in csv_reader:
            if row:
                existing_headlines.add(row[0])

# Keywords to filter articles
keywords = ['digital assets', 'digital securities', 'tokenized', 'tokenization', 'bond', 'security', 'asset', 'token']

# Function to close cookie consent button
def close_cookie_consent():
    try:
        cookie_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, 'wt-cli-settings-btn'))
        )
        cookie_button.click()
        time.sleep(2)  # Wait for the consent dialog to close
    except Exception as e:
        print(f"Failed to close cookie consent: {e}")

# Close cookie consent if it appears
close_cookie_consent()

# Load articles by clicking the 'Load More' button multiple times
num_clicks = 20  # Adjust the number of clicks to load more articles as needed
for _ in range(num_clicks):
    try:
        load_more_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, '//a[contains(text(), "Load more")]'))
        )
        load_more_button.click()
        time.sleep(3)  # Wait for the content to load
    except Exception as e:
        print(f"Stopped loading more articles: {e}")
        break

# Parse the loaded page content with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')

# Use a set to store seen links to avoid duplication
seen_links = set()

# Loop through each link that might contain a headline
for link in soup.find_all('a', title=True):
    headline = link.get('title').strip()
    url = link.get('href')

    # Skip links without a URL, links already processed, and headlines already in the CSV file
    if not url or url in seen_links or headline in existing_headlines:
        continue

    # Check if any of the keywords are in the headline
    if any(keyword in headline.lower() for keyword in keywords):
        # Mark this link as seen
        seen_links.add(url)

        # Open the article in a new tab to fetch the date
        driver.execute_script("window.open(arguments[0], '_blank');", url)
        driver.switch_to.window(driver.window_handles[1])
        
        try:
            # Wait for the date element to be present
            date_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'updated'))
            )
            date_text = date_element.text
        except Exception as e:
            print(f"Error fetching date for article {headline}: {e}")
            date_text = "Unknown"
        
        # Close the new tab and switch back to the main tab
        driver.close()
        driver.switch_to.window(driver.window_handles[0])
        
        # Print the scraped data to the console
        print(f"Headline: {headline}")
        print(f"Link: {url}")
        print(f"Date: {date_text}")
        
        # Write the data to the CSV file
        csv_writer.writerow([headline, url, date_text])

# Close the CSV file and the WebDriver
csv_file.close()
driver.quit()


Headline: Germany’s Bankhaus Scheich tokenizes Euro money market fund on Polygon
Link: https://www.ledgerinsights.com/bankhaus-scheich-tokenizes-euro-money-market-fund-polygon/
Date: February 22, 2024
Headline: EU tokenization event: DeFi regulation unlikely. Why the DLT Pilot regime is unpopular
Link: https://www.ledgerinsights.com/eu-tokenization-defi-dlt-pilot-regime-is-unpopular/
Date: February 21, 2024
Headline: Daiwa Securities issues digital bond paying emoney interest
Link: https://www.ledgerinsights.com/daiwa-securities-digital-bond-paying-emoney-interest/
Date: February 21, 2024
Headline: HKMA provides guidance to Hong Kong banks on tokenization
Link: https://www.ledgerinsights.com/hkma-hong-kong-tokenization/
Date: February 20, 2024
Headline: Japan progresses law for VCs to invest in web3 crypto tokens
Link: https://www.ledgerinsights.com/japan-law-vc-web3-crypto-tokens/
Date: February 19, 2024
Headline: SAB 121: ABA, SIFMA ask SEC to exclude digital securities from custody 

Headline: Sumitomo Life invests in tokenized Blackstone real estate fund
Link: https://www.ledgerinsights.com/tokenized-blackstone-real-estate-fund/
Date: December 20, 2023
Headline: Regulated asset tokenization to launch in India’s GIFT City
Link: https://www.ledgerinsights.com/regulated-asset-tokenization-gift-city-india/
Date: December 19, 2023
Headline: UK legislation for Digital Securities Sandbox published. No limits specified
Link: https://www.ledgerinsights.com/digital-securities-sandbox-legislation-no-limits/
Date: December 18, 2023
Headline: Raiffeisen Switzerland joins SIX Digital Exchange for digital securities
Link: https://www.ledgerinsights.com/raiffeisen-switzerland-digital-securities-sdx-six-digital-exchange/
Date: December 18, 2023
Headline: Deutsche bank backed Taurus tokenizes SME loan portfolio
Link: https://www.ledgerinsights.com/taurus-tokenizes-sme-loan-portfolio/
Date: December 15, 2023
Headline: Five banks involved in 2nd Hong Kong digital green bond on HSBC D

Headline: Shipping network GSBN partners Ant’s ZAN to tokenize bills of lading
Link: https://www.ledgerinsights.com/tokenize-bills-of-lading-gsbn-ant-zan/
Date: November 6, 2023
Headline: Union Investment buys tokenized Metzler fund units
Link: https://www.ledgerinsights.com/union-investment-tokenized-metzler-fund/
Date: November 3, 2023
Headline: UAE plans tokenized bonds  issued on HSBC Orion, listed on ADX
Link: https://www.ledgerinsights.com/tokenized-bonds-hsbc-orion-listed-on-adx/
Date: November 3, 2023
Headline: Hong Kong plans regulations targeting tokenization
Link: https://www.ledgerinsights.com/hong-kong-regulations-tokenization-rwa/
Date: November 2, 2023
Headline: HSBC tokenizes gold
Link: https://www.ledgerinsights.com/hsbc-tokenizes-gold/
Date: November 1, 2023
Headline: SBI Digital Markets collaborates with UBS for fund tokenization trial
Link: https://www.ledgerinsights.com/sbi-digital-markets-fund-tokenization-trial/
Date: October 31, 2023
Headline: Singapore tokeniza

Headline: Korea’s Hana Bank partners with BitGo for digital asset custody
Link: https://www.ledgerinsights.com/hana-bank-bitgo-digital-asset-custody/
Date: September 5, 2023
Headline: London Stock Exchange plans blockchain digital asset exchange
Link: https://www.ledgerinsights.com/london-stock-exchange-lse-plans-blockchain-digital-asset/
Date: September 4, 2023
Headline: DLT, digital assets not high priority for Swiss asset managers, except for banks
Link: https://www.ledgerinsights.com/dlt-digital-assets-swiss-asset-managers-banks/
Date: August 31, 2023
Headline: Moody’s sees tokenization opportunity in private capital, alternatives
Link: https://www.ledgerinsights.com/moodys-tokenization-private-capital-alternatives/
Date: August 25, 2023
Headline: After green bond issuance, Hong Kong sees tokenization benefits. Says interoperability needs attention
Link: https://www.ledgerinsights.com/hong-kong-tokenized-green-bond-interoperability/
Date: August 24, 2023
Headline: Digital Assets: J