# Scraping with Firecrawl


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [33]:
from pprint import pprint

In [2]:
from llama_index.readers.web import FireCrawlWebReader
from llama_index.core import SummaryIndex
import os

In [3]:
import tracemalloc

# Start tracking memory allocations
tracemalloc.start()
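Starting `tracemalloc` only turns tracing on; the numbers have to be read back explicitly. A minimal sketch of how that could look, using a list of strings as a stand-in for scraped documents (the `payload` variable is purely illustrative and not part of this notebook):

```python
import tracemalloc

# Turn tracing on only if an earlier cell has not already done so.
if not tracemalloc.is_tracing():
    tracemalloc.start()

# Stand-in workload: roughly 1 MB of strings, in place of scraped documents.
payload = ["x" * 1000 for _ in range(1000)]

# Bytes currently traced and the peak since tracing started.
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024:.1f} KiB, peak: {peak / 1024:.1f} KiB")

# The three source lines responsible for the most allocated memory.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Calling `get_traced_memory()` before and after a scrape gives a rough idea of how much memory the returned documents hold.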

In [5]:
urls = [
    "https://www.columbiaspectator.com/opinion/2024/12/11/at-columbia-we-dont-strike-our-ideological-opponents/",
    "https://www.columbiaspectator.com/opinion/2024/12/08/columbias-complicity-in-cop29-the-greenwashing-of-human-rights-abuses/",
    "https://www.columbiaspectator.com/opinion/op-eds/10/"
]

In [35]:
FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")

firecrawl_reader = FireCrawlWebReader(
    api_key=FIRECRAWL_API_KEY,  # Loaded from the environment; get a key at https://www.firecrawl.dev/
    mode="scrape",  # "scrape" fetches a single page; "crawl" also follows links from it
    params={
        "waitFor": 1000,
        "timeout": 15000,
        "onlyMainContent": False,
        "formats": ["markdown", "html"]
    }
)

len(FIRECRAWL_API_KEY)

35
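Checking `len(FIRECRAWL_API_KEY)` confirms the key was loaded, but it raises a `TypeError` if the variable is missing, because `os.getenv` returns `None`. A small sketch of a more explicit check follows; the helper `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical and used only for illustration:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demo only: give the hypothetical variable a placeholder value if it is absent.
os.environ.setdefault("EXAMPLE_API_KEY", "fc-demo")

key = require_env("EXAMPLE_API_KEY")
# Report the length rather than the key itself, so the secret is never printed.
print(f"key loaded ({len(key)} characters)")
```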

In [36]:
documents = firecrawl_reader.load_data(url=urls[0])

len(documents)

1

In [37]:
pprint(vars(documents[0]))

{'embedding': None,
 'end_char_idx': None,
 'excluded_embed_metadata_keys': [],
 'excluded_llm_metadata_keys': [],
 'id_': '5af2b29d-f546-4347-8569-706ad8e64c26',
 'metadata': {'author': 'Elisha Baker',
              'credits': '',
              'description': 'Why are we at Columbia, and what is the purpose '
                             'of higher education? Many of us arrived at '
                             'Columbia fueled by curiosity and a yearning for '
                             'knowledge. However, in recent months, it has '
                             'become clear that not all members of our '
                             'community share a vision of open dialogue and '
                             'mutual learning.',
              'favicon': 'https://s3.amazonaws.com/year-in-review-assets/CDS_Favicon.ico',
              'image': 'https://cloudfront-us-east-1.images.arcpublishing.com/spectator/5B47ZNYV3RCIXGQHLZUNIPGBM4.jpg',
              'keywords': '',
              

In [68]:
import re

def clean_markdown(text: str) -> str:
    """Strip images, links, bare URLs, and site navigation residue from scraped markdown."""
    # Remove markdown images ![alt text](url)
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
    
    # Remove markdown links [text](url)
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)
    
    # Remove bare URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
    
    # Remove leftover "[](" fragments from empty links and the "### Opinion | Op-eds" section header
    text = re.sub(r'\[\]\(', '', text)
    text = re.sub(r'###\s*Opinion\s*\|?\s*Op-eds\s*#?', '', text)
    
    # Remove iframe patterns with Opinion/Op-eds
    text = re.sub(r'iframe\s*###\s*Opinion\s*\|?\s*Op-eds\s*', '', text)
        
    # Clean up extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

## Native Firecrawl API: An Article

See the [advanced scraping guide](https://docs.firecrawl.dev/advanced-scraping-guide).


In [61]:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

content = app.scrape_url(urls[0])

len(content)

2

In [67]:
main_content = content['markdown']
main_content[:300]

'[iframe](about:blank)\n\n[![](https://spec-imagehosting.s3.amazonaws.com/CDSwhitemasthead.png)](https://www.columbiaspectator.com/)\n\n### [Opinion](https://www.columbiaspectator.com/opinion/) \\| [Op-eds](https://www.columbiaspectator.com/opinion/op-eds/)\n\n# At Columbia, we don’t ‘strike’ our ideologica'

In [69]:
print(clean_markdown(main_content))

iframe ### Opinion \| Op-eds # At Columbia, we don’t ‘strike’ our ideological opponents ### By Judy Goldstein / Senior Staff Photographer By Elisha Baker, Eden Yadegar, and David Lederer • December 11, 2024 at 11:07 AM Share Why are we at Columbia, and what is the purpose of higher education? Many of us arrived at Columbia fueled by curiosity and a yearning for knowledge. However, in recent months, it has become clear that not all members of our community share a vision of open dialogue and mutual learning. We can see these divisions playing out in current campus discourse, including in Amine Bit’s recent op-ed in Spectator titled “ On reckless criticism and propaganda.” In the editorial, Mr. Bit challenges our arguments and ideas by impugning our credibility and accusing us—proud members of the Jewish community—of misrepresenting Jewish identity in a number of our published articles. To be sure, there is a lot we could respond to in Mr. Bit’s personal condemnation. However, we believe
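The output above still begins with `iframe ### Opinion \| Op-eds #`. The scraped markdown escapes the pipe as `\|`, and the header pattern in `clean_markdown` does not allow for the backslash, so it never matches. A minimal sketch of a pattern that tolerates the escape is shown below; `strip_section_header` is a hypothetical helper, not part of the notebook, and it assumes the header always has this shape:

```python
import re

# Optional leading "iframe", then "### Opinion", a pipe that may be
# backslash-escaped, "Op-eds", and an optional trailing "#" heading marker.
SECTION_HEADER = re.compile(
    r'(?:iframe\s*)?###\s*Opinion\s*\\?\|?\s*Op-eds\s*#?\s*'
)

def strip_section_header(text: str) -> str:
    """Remove the 'Opinion | Op-eds' navigation header from cleaned text."""
    return SECTION_HEADER.sub('', text).strip()

# Sample taken from the start of the cleaned output above.
sample = "iframe ### Opinion \\| Op-eds # At Columbia, we don’t ‘strike’ our ideological opponents"
cleaned = strip_section_header(sample)
print(cleaned)
```

Applying this to the result of `clean_markdown` would leave the text starting at the article title.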

## Native Firecrawl API: Article Listing


In [70]:
listing_url = "https://www.columbiaspectator.com/opinion/op-eds/10/"

listing_content = app.scrape_url(listing_url)

len(listing_content)


2

In [71]:
print(listing_content.keys())

dict_keys(['markdown', 'metadata'])
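A listing page's `markdown` value carries its article links as ordinary markdown links, so they can be pulled out with a regular expression. The sketch below assumes article URLs follow the `/opinion/YYYY/MM/DD/slug/` shape seen in `urls`; the helper `extract_article_links` and the short sample string are illustrative only:

```python
import re

# Markdown links whose target looks like a dated opinion article.
ARTICLE_LINK = re.compile(
    r'\[([^\]]+)\]'
    r'\((https://www\.columbiaspectator\.com/opinion/\d{4}/\d{2}/\d{2}/[^)\s]+)\)'
)

def extract_article_links(markdown: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs in page order, skipping repeated URLs."""
    seen = set()
    links = []
    for title, url in ARTICLE_LINK.findall(markdown):
        if url not in seen:
            seen.add(url)
            links.append((title, url))
    return links

# Short sample in the same shape as a listing page's markdown.
sample_markdown = (
    "# opinion: op eds\n\n"
    "[Columbia researchers: It’s time to divest our labor from the war machine]"
    "(https://www.columbiaspectator.com/opinion/2024/11/26/"
    "columbia-researchers-its-time-to-divest-our-labor-from-the-war-machine/)\n\n"
    "BY [Columbia University Researchers Against War]"
    "(https://www.columbiaspectator.com/contributors//)November 26, 2024\n"
)

article_links = extract_article_links(sample_markdown)
for title, url in article_links:
    print(title, "->", url)
```

Contributor and section links are ignored because their paths lack the date segment.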


In [73]:
print(listing_content['markdown'])

[iframe](about:blank)

[![](https://spec-imagehosting.s3.amazonaws.com/CDSwhitemasthead.png)](https://www.columbiaspectator.com/)

# opinion: op eds

![](https://www.columbiaspectator.com/resizer/v2/L36KMJORLJDA3A72XC7TQXHKGM.jpg?auth=c0260fa55be13ac1ebff56bc66e3a1b8dc632f3984bea6dbd32ed9e85f83eced)

[Columbia researchers: It’s time to divest our labor from the war machine](https://www.columbiaspectator.com/opinion/2024/11/26/columbia-researchers-its-time-to-divest-our-labor-from-the-war-machine/)

BY [Columbia University Researchers Against War](https://www.columbiaspectator.com/contributors//)November 26, 2024

More than any other actor, the U.S. military now [sets the terms](https://www.thenation.com/article/world/the-pentagons-quest-for-academic-intelligence-ai/) of the university research landscape. In 2023, nearly half of [federal research funding](https://crsreports.congress.gov/product/pdf/R/R47564) went to the Department of Defense, ranking it first among all agencies that sup