# Blockworks Articles Scraper

This script scrapes article headlines, links, and publication dates from Blockworks search results for tokenization-related articles. The scraped data is appended to a CSV file, so each run adds only articles that are not already recorded.

## Prerequisites

1. **Selenium WebDriver**: Install Selenium WebDriver for Python using the following command:
   ```bash
   pip install selenium
   ```
2. **BeautifulSoup**: Install BeautifulSoup for parsing HTML using the following command:
   ```bash
   pip install beautifulsoup4
   ```
3. **ChromeDriver**: Download the ChromeDriver build that matches your installed Chrome version from [chromedriver.chromium.org](https://chromedriver.chromium.org/downloads), and make sure the script points to its path.

## Script Overview

#### *Setup*

1. **Path to ChromeDriver**: Ensure the `webdriver_path` variable points to the correct location of your ChromeDriver executable.
2. **Selenium WebDriver Initialization**: The script sets up the Selenium WebDriver and opens the target URL.

#### *CSV File Handling*

1. **Append Mode**: The script opens the CSV file (`blockworks_articles.csv`) in append mode. If the file does not exist, it creates a new one and writes the header row.
2. **Avoiding Duplicates**: The script reads existing headlines from the CSV file to avoid duplications when appending new data.
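
The append-and-deduplicate behaviour described above can be sketched as follows. This is a minimal illustration rather than the script's exact code: the helper name `append_new_rows` is hypothetical, while the column layout (`headline`, `link`, `date`) follows the script.

```python
import csv
import os


def append_new_rows(csv_filename, rows):
    """Append (headline, link, date) rows whose headline is not already stored.

    Returns the number of rows actually written.
    """
    file_exists = os.path.isfile(csv_filename)

    # Collect headlines already on disk so repeated runs do not duplicate them.
    existing_headlines = set()
    if file_exists:
        with open(csv_filename, "r", newline="", encoding="utf-8") as read_file:
            reader = csv.reader(read_file)
            next(reader, None)  # Skip the header; tolerate an empty file
            existing_headlines = {row[0] for row in reader if row}

    written = 0
    with open(csv_filename, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if not file_exists:
            writer.writerow(["headline", "link", "date"])  # Header for a new file
        for headline, link, date in rows:
            if headline not in existing_headlines:
                writer.writerow([headline, link, date])
                existing_headlines.add(headline)
                written += 1
    return written
```

Calling the helper twice with the same rows writes them only once, because the second call finds every headline already in the file.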

#### *Keywords for Filtering Articles*

The script does not filter headlines itself. It submits the search query `tokeniz`, a stem intended to match terms such as `tokenized` and `tokenization`.
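
If an explicit keyword check on headlines is wanted, it could look like the sketch below. It is illustrative only: the function name `matches_keywords` is hypothetical, the case-insensitive substring match is an assumption about how matching should behave, and the sample headlines are invented.

```python
KEYWORDS = ("tokenized", "tokenization")


def matches_keywords(headline, keywords=KEYWORDS):
    """Return True if the headline contains any keyword, ignoring case."""
    lowered = headline.lower()
    return any(keyword in lowered for keyword in keywords)


# Invented sample headlines, used only to demonstrate the check.
headlines = [
    "Tokenized treasuries gain traction",
    "Bitcoin price holds steady",
    "Why tokenization matters for funds",
]
relevant = [h for h in headlines if matches_keywords(h)]
print(relevant)
```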

#### *Closing Cookie Consent*

The script includes a function `close_cookie_consent` to close the cookie consent dialog if it appears.

#### *Loading More Articles*

Clicking the "Load More" button is specific to the Blockworks site: the script clicks it repeatedly (the count is set by `num_clicks`) to load additional results. On any other site, this step will likely fail.
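
The bounded click loop can be sketched independently of Selenium, as below. The names `click_load_more` and `fake_click` are hypothetical; in the real script the `click_once` callable would wrap the `WebDriverWait` lookup and the button's `.click()` call. The stand-in here only simulates a page that runs out of results.

```python
def click_load_more(click_once, num_clicks=20):
    """Call click_once up to num_clicks times, stopping at the first failure.

    click_once is a zero-argument callable that performs one click and raises
    an exception when the button cannot be found or clicked.
    Returns the number of successful clicks.
    """
    clicks = 0
    for _ in range(num_clicks):
        try:
            click_once()
        except Exception as exc:
            print(f"Stopped after {clicks} clicks: {exc}")
            break
        clicks += 1
    return clicks


# Stand-in for a page that only offers three more batches of results.
remaining = {"batches": 3}


def fake_click():
    if remaining["batches"] == 0:
        raise RuntimeError("'Load More' button not found")
    remaining["batches"] -= 1


total_clicks = click_load_more(fake_click, num_clicks=20)
print(total_clicks)
```

Stopping at the first failure, rather than retrying, mirrors the `break` in the script: once the button disappears there is nothing more to load.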

#### *Parsing and Scraping*

1. **Parsing with BeautifulSoup**: The script parses the loaded page content using BeautifulSoup.
2. **Scraping Data**: The script opens each article in a new tab, extracts the headline, link, and publication date, and skips any headline already present in the CSV file.
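
One way to turn the scraped `href` values into absolute, non-repeating article URLs is sketched below. This is an assumption-laden illustration, not the script's code: the helper name `unique_absolute_links` and the sample paths are hypothetical, and `urljoin` is swapped in for the script's plain string concatenation.

```python
from urllib.parse import urljoin

BASE_URL = "https://blockworks.co"


def unique_absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and drop repeats, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base_url, href)  # Leaves already-absolute URLs unchanged
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical hrefs, including a repeat and an already-absolute URL.
hrefs = [
    "/news/example-a",
    "/news/example-b",
    "/news/example-a",
    "https://blockworks.co/news/example-b",
]
print(unique_absolute_links(hrefs))
```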

#### *Writing Data to CSV*

The script writes the scraped data (headline, link, date) to the CSV file and prints the data to the console.

## Running the Script

1. Ensure all prerequisites are met.
2. Adjust the `webdriver_path` variable to the correct path of your ChromeDriver.
3. Run the script as a Jupyter notebook, or using Python:
   ```bash
   python blockworks_scraper.py
   ```
4. The script will append new relevant articles to `blockworks_articles.csv`.

#### *Note*: You will have to manually close certain pop-ups for the script to run properly.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
import time
import os

# Path to your ChromeDriver
webdriver_path = 'C:/Program Files/chromedriver-win64/chromedriver.exe'  # Ensure this path is correct

# Setup Selenium WebDriver
service = Service(webdriver_path)
options = Options()
# options.add_argument('--headless')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://blockworks.co/search')

# Open a CSV file to append the scraped data
csv_filename = 'blockworks_articles.csv'
file_exists = os.path.isfile(csv_filename)
csv_file = open(csv_filename, 'a', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)

# Write header only if the file does not already exist
if not file_exists:
    csv_writer.writerow(['headline', 'link', 'date'])

# Read existing headlines to avoid duplicates
existing_headlines = set()
if file_exists:
    with open(csv_filename, 'r', encoding='utf-8') as read_file:
        csv_reader = csv.reader(read_file)
        next(csv_reader, None)  # Skip header row; tolerate an empty file
        for row in csv_reader:
            if row:
                existing_headlines.add(row[0])

# Function to close cookie consent button
def close_cookie_consent():
    try:
        # Attempt to locate the button by ID
        cookie_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, 'wt-cli-settings-btn'))
        )
    except Exception:
        try:
            # Attempt to locate the button by class name as a fallback
            cookie_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CLASS_NAME, 'wt-cli-settings-btn'))
            )
        except Exception:
            print("Cookie consent button not found.")
            return  # Exit the function if the button is not found

    try:
        cookie_button.click()
        time.sleep(2)  # Wait for the consent dialog to close
        print("Cookie consent closed.")
    except Exception as e:
        print(f"Failed to close cookie consent: {e}")

# Close cookie consent if it appears
close_cookie_consent()

# Find the search bar and input 'tokeniz'
try:
    search_bar = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.ID, 'blockworks-search'))  # Use correct ID for the search bar
    )
    search_bar.send_keys('tokeniz')
    search_bar.send_keys(Keys.RETURN)
    print("Search query submitted.")
except Exception as e:
    print(f"Error locating search bar: {e}")
    csv_file.close()  # Release the CSV file before exiting
    driver.quit()
    raise SystemExit(1)

# Load More button click settings
num_clicks = 20  # Number of times to click the 'Load More' button

for click_index in range(num_clicks):
    try:
        load_more_button = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.XPATH, '//button[text()="Load More"]'))
        )
        driver.execute_script("arguments[0].scrollIntoView(true);", load_more_button)  # Scroll into view if needed
        load_more_button.click()
        time.sleep(3)  # Wait for the content to load
        print(f"'Load More' button clicked {click_index + 1} time(s).")
    except Exception as e:
        print(f"Exception occurred while clicking 'Load More': {e}")
        break

# Parse the loaded page content with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Locate all article links on the search results page, skipping repeated hrefs
seen_links = set()
article_links = []
for a in soup.find_all('a', class_='font-headline flex-grow text-base font-semibold leading-snug hover:text-primary', href=True):
    if a['href'] not in seen_links:
        seen_links.add(a['href'])
        article_links.append(a['href'])

# Loop through each article link
for relative_url in article_links:
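    # Prefix the site root only when the href is not already absolute
    url = relative_url if relative_url.startswith('http') else 'https://blockworks.co' + relative_url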

    # Open the article
    driver.execute_script("window.open(arguments[0], '_blank');", url)
    driver.switch_to.window(driver.window_handles[1])
    
    try:
        # Extract the headline
        headline_tag = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'font-headline'))
        )
        headline = headline_tag.text.strip()

        # Extract the date
        date_tag = driver.find_element(By.TAG_NAME, 'time')
        date_text = date_tag.text.strip()
        
        # Print the scraped data to the console
        print(f"Headline: {headline}")
        print(f"Link: {url}")
        print(f"Date: {date_text}")

        # Write the data to the CSV file
        if headline not in existing_headlines:
            csv_writer.writerow([headline, url, date_text])
            existing_headlines.add(headline)
    except Exception as e:
        print(f"Error fetching details for article: {e}")
    
    # Close the new tab and switch back to the main tab
    driver.close()
    driver.switch_to.window(driver.window_handles[0])

# Close the CSV file and the WebDriver
csv_file.close()
driver.quit()
