# Tavily Database Feeder (with dates)

This version of the Tavily scraper uses the `htmldate` library to extract publication dates from the news articles it finds.

`domains_to_check` must be kept: Tavily struggles to scrape websites unless they are listed explicitly in this parameter.

Scraping a single website with Tavily uses a different pipeline.

In [1]:
import os
import hashlib
import uuid
from datetime import datetime, timedelta
from tavily import TavilyClient
from htmldate import find_date
import re

# Read the API key from the environment instead of hardcoding it in the notebook
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

In [2]:
# Some well-known Portuguese startup and business news websites

domains_to_check = [
    "empreendedor.com",
    "portugalstartupnews.com",
    "portugalstartups.com",
    "startupportugal.com",
    "portugalventures.pt",
    "observador.pt",
    "eco.sapo.pt",
    "essential-business.pt",
    "portugalbusinessesnews.com"
]

In [3]:
def extract_date(url, text_content=None):
    """
    Tries to find the publication date using htmldate, in this order:
    1. Checks the URL patterns (fastest).
    2. Fetches metadata from the URL (most accurate).
    3. Returns today's date if all else fails.

    Args:
        url: The URL of the article.
        text_content (optional): The article content. Currently unused. Defaults to None.
    
    Returns:
        str: The extracted date in 'YYYY-MM-DD' format or today's date as fallback.
    """
    try:
        found_date = find_date(url, outputformat='%Y-%m-%d')
        if found_date:
            return found_date
        
    except Exception as e:
        print(f"Date extraction failed for {url}: {e}")

    return datetime.now().strftime("%Y-%m-%d")
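
The first step in the docstring, reading a date straight from the URL, can be illustrated without any network access. The sketch below is a simplified stand-in, not `htmldate`'s actual implementation. `date_from_url` and its regular expression are hypothetical names introduced here, and the sketch assumes that article URLs embed the date as `YYYY/MM/DD` or `YYYY-MM-DD`.

```python
import re
from datetime import date

# Assumed URL shape: .../2025/12/16/... or ...2025-12-16...
_URL_DATE = re.compile(r"(?<!\d)(20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url):
    """Return 'YYYY-MM-DD' if the URL embeds a valid date, else None."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        # date() rejects impossible combinations such as month 13
        return date(year, month, day).isoformat()
    except ValueError:
        return None


print(date_from_url("https://example.com/2025/12/16/some-article/"))
print(date_from_url("https://example.com/"))
```

Returning `None` instead of today's date keeps a failed lookup distinguishable from an article that really was published today. Whether that suits the database schema is a design choice this sketch leaves open.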

Tavily assigns a relevance score to each result it returns: the higher the score, the more relevant the result is to the query.

A threshold of `0.5` was initially defined, but some relevant news articles scored below it, so we considered dropping it entirely. Without the threshold, however, Tavily returned a lot of noise as news (homepages, "About Us" pages, cookie pop-ups, etc.). The threshold does not eliminate the problem, but it gives a much more balanced output.
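
A minimal sketch of how such a threshold could be applied, assuming each Tavily result dictionary carries a numeric `score` key. `SCORE_THRESHOLD` and `filter_by_score` are hypothetical names introduced for illustration, and the sample results are made up.

```python
SCORE_THRESHOLD = 0.5  # assumed cut-off, matching the value discussed above


def filter_by_score(results, threshold=SCORE_THRESHOLD):
    """Keep only results whose relevance score reaches the threshold."""
    # Results without a score are treated as irrelevant (score 0.0).
    return [r for r in results if r.get("score", 0.0) >= threshold]


# Made-up results, for illustration only
sample_results = [
    {"title": "Funding round announced", "score": 0.82},
    {"title": "About Us", "score": 0.21},
    {"title": "Homepage"},
]

kept = filter_by_score(sample_results)
print([r["title"] for r in kept])
```

Keeping the threshold as a parameter makes it easy to lower it for a domain whose relevant articles tend to score poorly.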

In [4]:
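def aggressive_clean(text):
    """
    Removes UI noise (cookie banners, navigation links, image captions,
    repeated lines) from scraped Markdown text.

    Args:
        text: The scraped article text, with one Markdown line per text line.

    Returns:
        str: The lines that survive all filters, joined with newlines.
    """
    lines = text.split('\n')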
    cleaned_lines = []
    seen_lines = set()

    # 1. AGGRESSIVE PATTERNS TO REMOVE
    # If a line contains ANY of these, it is deleted instantly.
    # We target GDPR, Cookies, Login, UI elements, and specific platform noise.
    garbage_triggers = [
        "login", "register", "sign in", "sign up", "subscribe", "subscrição",
        "cookie", "consent", "parceiros", "armazenamento", "device", "browser",
        "privacy policy", "terms of use", "termos e condições", "política de privacidade",
        "all rights reserved", "copyright", "powered by", "mybusiness.com",
        "share on", "compartilhar", "tweet", "follow us", "siga-nos",
        "no information for this section", "our apologies",
        "skip to content", "saltar os links", "outdated browser",
        "funcional", "estatísticas", "marketing", "preferências", # Cookie banner categories
        "ver mais", "saiba mais", "ler mais", "read more",
        "edit with live css", "write css",
        "image", "video", "shutterstock", "freepik", # Image captions often leak
        "whatsapp://", "facebook.com", "twitter.com", "linkedin.com"
    ]

    # 2. PATTERNS TO KEEP (Whitelist)
    # Lines containing these markers are kept unconditionally:
    # the check runs before the garbage filter.
    header_indicators = ["===", "---"]

    for line in lines:
        original_line = line
        line = line.strip()

        # --- FILTER 1: Empty & Structure ---
        if not line:
            continue
        
        # Keep Markdown Headers (=== or ---)
        if any(x in line for x in header_indicators):
            cleaned_lines.append(line)
            continue

        # --- FILTER 2: Remove Markdown Images & Icons ---
        # Drop linked images such as [![Image ...](src)](href) first;
        # after the substitution below this prefix would no longer match.
        if line.startswith("[![Image"):
            continue
        # Remove inline images ![Alt](URL) and strip leftover whitespace
        line = re.sub(r'!\[.*?\]\(.*?\)', '', line).strip()

        # --- FILTER 3: The "Garbage Word" Check ---
        line_lower = line.lower()
        if any(trigger in line_lower for trigger in garbage_triggers):
            continue

        # --- FILTER 4: The "Link" Analysis ---
        # We need to distinguish between a Navigation Link (Bad) and a Resource/Article Link (Good).
        
        # Check if line is PURELY a markdown link: [Text](URL)
        link_match = re.match(r'^\[(.*?)\]\(.*?\)$', line)
        
        if link_match:
            link_text = link_match.group(1)
            # LOGIC: 
            # If the link text is short (< 25 chars) and doesn't look like a book/resource title, kill it.
            # Navigation links are usually "Home", "News", "Economia" (Short).
            # Content links are "The Lean Startup", "Sonae investe seis milhões..." (Longer).
            if len(link_text) < 25:
                continue 
            
            # Additional check: If it looks like a date or author inside a link, kill it.
            if re.match(r'^\d', link_text) or "author" in original_line:
                continue

        # --- FILTER 5: Short Line Noise ---
        # If a line is very short, not a list item (*), and has no punctuation, it's likely UI noise.
        # e.g. "Lisboa", "Domingo", "10.5 C"
        if len(line) < 20 and not line.startswith("*") and not line.endswith(('.', ':', '?', '!')):
            continue

        # --- FILTER 6: Deduplication ---
        # Scrapes often repeat the Title 3 times.
        if line in seen_lines:
            continue
        
        seen_lines.add(line)
        cleaned_lines.append(line)

    return "\n".join(cleaned_lines)
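
Filter 3 uses plain substring matching, so a trigger such as "register" also removes a sentence containing "registered". One possible variant, sketched here under the assumption that whole-word matching is acceptable, compiles the triggers into a single regular expression with word boundaries. `build_trigger_pattern` is a hypothetical helper and is not part of the function above.

```python
import re


def build_trigger_pattern(triggers):
    """Compile triggers into one case-insensitive, whole-word pattern."""
    escaped = (re.escape(trigger) for trigger in triggers)
    # (?<!\w) and (?!\w) require that no word character touches the trigger
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


pattern = build_trigger_pattern(["register", "sign in", "cookie"])

for line in [
    "Register now",
    "The startup registered 40% growth.",
    "Please sign in to continue",
]:
    print(line, "->", "drop" if pattern.search(line) else "keep")
```

The trade-off is that inflected forms no longer match automatically: with this pattern "cookies" would need its own entry in the trigger list.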

In [5]:
def get_news_with_dates():
    """
    Fetches the latest entrepreneurship and startup funding news articles from Tavily,
    extracts their publication dates, and prepares them for database insertion.

    Returns:
        all_articles: A list of dictionaries, where each dictionary contains the
        'id', 'title', 'text', and 'date' of an article.
    """
    all_articles = []

    today = datetime.now()
    seven_days_ago = today - timedelta(days=7)

    today_dt = today.strftime("%Y-%m-%d")
    seven_days_ago_dt = seven_days_ago.strftime("%Y-%m-%d")
    
    for url in domains_to_check:
        print(f"Checking {url}...")
        
        try:
            response = tavily_client.search(
                query="latest entrepreneurship startup funding investment news",
                topic="general",
                start_date=seven_days_ago_dt,
                end_date=today_dt,
                max_results=5,
                include_raw_content=False,
                include_domains=[url],
                search_depth="advanced"
            )

            results = response.get("results", []) 
            
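            for result in results:
                result_url = result.get("url")
                if not result_url:
                    # Without a URL there is no date lookup and no stable ID
                    continue

                title = result.get("title", "")
                content = result.get("content", "")

                published_date = extract_date(result_url, content)

                # Hash the URL so the same article always gets the same ID
                article_id = hashlib.md5(result_url.encode()).hexdigest()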

                article_record = {
                    "id": article_id,
                    "title": title,
                    "text": content,
                    "date": published_date
                }

                all_articles.append(article_record)

        except Exception as e:
            print(f"Failed to fetch from {url}: {e}")

    return all_articles
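
Because each `id` is an MD5 hash of the article URL, the same URL always yields the same ID. The sketch below shows how that property could be used to drop repeated articles before database insertion. It assumes records shaped like the dictionaries built above; `make_article_id` and `deduplicate` are hypothetical helpers, and the sample records are made up.

```python
import hashlib


def make_article_id(url):
    """Derive a deterministic 32-character hex ID from the article URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def deduplicate(articles):
    """Keep the first record seen for each ID, preserving order."""
    seen_ids = set()
    unique = []
    for article in articles:
        if article["id"] in seen_ids:
            continue
        seen_ids.add(article["id"])
        unique.append(article)
    return unique


# Made-up records, for illustration only
records = [
    {"id": make_article_id("https://example.com/a"), "title": "A"},
    {"id": make_article_id("https://example.com/b"), "title": "B"},
    {"id": make_article_id("https://example.com/a"), "title": "A (repeat)"},
]

print([r["title"] for r in deduplicate(records)])
```

MD5 serves here only as a compact fingerprint, not as a security measure. URLs that differ in a trailing slash or a query string still produce different IDs.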

In [6]:
result = get_news_with_dates()

Checking empreendedor.com...
Checking portugalstartupnews.com...
Checking portugalstartups.com...
Checking startupportugal.com...
Checking portugalventures.pt...
Checking observador.pt...
Checking eco.sapo.pt...
Date extraction failed for https://www.cnn.com/: ("URL couldn't be processed: %s", None)
Checking essential-business.pt...
Checking portugalbusinessesnews.com...


In [7]:
result

[{'id': '81286442e8a390e984bf124b7196c704',
  'title': 'Business, Technology and World News | MyBusiness.com',
  'text': 'Recent Articles. Sep 9, 2025. Interview ; News. Aug 17, 2025. News ; Entrepreneurship. Jun 25, 2025. Entrepreneurship ; Economy. Jul 22, 2025. Economy ; Technology.',
  'date': '2025-01-01'},
 {'id': 'fd4f871f9a68cf58e4eefcf7d473e109',
  'title': 'Revista do Empreendedor - Revista do Empreendedor',
  'text': 'Revista do Empreendedor é o site de notícias de empreendedorismo e inovação que apoia as startups e o ecossistema empreendedor em Portugal.Indústria transformadora trava mais ransomware, mas cresce roubo de dados. Empreendedor - 14 de Dezembro, 2025 .',
  'date': '2025-03-14'},
 {'id': '61f1da1867eb2387fa8d4857cc33a8db',
  'title': 'Entrepreneurship Awards 2025 celebram... - Revista do Empreendedor',
  'text': 'Inicio Atualidade Entrepreneurship Awards 2025 celebram os líderes que estão a transformar o ecossistema',
  'date': '2025-12-16'},
 {'id': 'e933efc64dc