# Tavily Database Feeder (with dates)

This version of the Tavily scraper uses the `htmldate` library to extract the publication date of each news article that Tavily returns.

`domains_to_check` must be kept: Tavily struggles to return results from these websites unless each one is listed explicitly.

Scraping a single website with Tavily uses a different pipeline.

In [35]:
import os
import hashlib
import uuid
from datetime import datetime, timedelta
from tavily import TavilyClient
from htmldate import find_date

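# Read the API key from the environment instead of hardcoding it in the notebook.
# TAVILY_API_KEY is an assumed variable name; adjust it to match your setup.
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])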

In [36]:
# Some well-known Portuguese startup and business news websites

domains_to_check = [
    "empreendedor.com",
    "portugalstartupnews.com",
    "portugalstartups.com",
    "startupportugal.com",
    "portugalventures.pt",
    "observador.pt",
    "eco.sapo.pt",
    "essential-business.pt",
    "portugalbusinessesnews.com"
]

In [37]:
def extract_date(url, text_content=None):
    """
    Tries to find the publication date of an article using htmldate:
    1. Checks the URL patterns (fastest).
    2. Fetches metadata from the URL (most accurate).
    3. Returns today's date if all else fails.

    Args:
        url: The URL of the article.
        text_content (optional): The content of the article. Currently unused;
            only the URL is passed to htmldate. Defaults to None.

    Returns:
        str: The extracted date in 'YYYY-MM-DD' format, or today's date as a fallback.
    """
    try:
        found_date = find_date(url, outputformat='%Y-%m-%d')
        if found_date:
            return found_date
        
    except Exception as e:
        print(f"Date extraction failed for {url}: {e}")

    return datetime.now().strftime("%Y-%m-%d")
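
The first step in the docstring above, checking the URL for a date, can be illustrated with a small standard-library sketch. This is not how `htmldate` is implemented; `date_from_url` and its regular expression are hypothetical and only cover URLs that embed a full year, month and day.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern (not taken from htmldate): a full date such as
# /2025/11/06/ or 2025-11-06 somewhere in the URL.
_URL_DATE = re.compile(r"(?<!\d)(20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def date_from_url(url: str) -> Optional[str]:
    """Return 'YYYY-MM-DD' if the URL contains a valid full date, else None."""
    match = _URL_DATE.search(url)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible values such as month 13.
        return date(year, month, day).isoformat()
    except ValueError:
        return None
```

Unlike `extract_date`, this sketch returns `None` rather than today's date when nothing is found, so a caller can tell a missing date apart from an article that was published today.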

Tavily assigns a relevance score to each search result: the higher the score, the more relevant the result is to the query.

A threshold of `0.5` was initially applied, but some relevant news articles received low scores, so the threshold was dropped.
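
For reference, a score filter of the kind described above might look like the sketch below. It assumes each result dictionary carries a numeric `score` key; `split_by_score` and the sample data are hypothetical. Returning the dropped results as well makes it easy to inspect which relevant articles a given threshold would discard.

```python
from typing import Any, Dict, List, Tuple

Result = Dict[str, Any]


def split_by_score(
    results: List[Result], threshold: float = 0.5
) -> Tuple[List[Result], List[Result]]:
    """Split results into (kept, dropped) by comparing 'score' to the threshold."""
    kept = [r for r in results if r.get("score", 0.0) >= threshold]
    dropped = [r for r in results if r.get("score", 0.0) < threshold]
    return kept, dropped


# Made-up sample data, for illustration only.
sample = [
    {"title": "Funding round announced", "score": 0.82},
    {"title": "Relevant but low-scoring story", "score": 0.31},
]
kept, dropped = split_by_score(sample)
```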

In [46]:
def get_news_with_dates():
    """
    Fetches the latest entrepreneurship and startup funding news articles from Tavily,
    extracts their publication dates, and prepares them for database insertion.

    Returns:
        all_articles: A list of dictionaries, where each dictionary contains the
        'id', 'title', 'text', and 'date' of an article.
    """
    all_articles = []

    today = datetime.now()
    seven_days_ago = today - timedelta(days=7)

    today_dt = today.strftime("%Y-%m-%d")
    seven_days_ago_dt = seven_days_ago.strftime("%Y-%m-%d")
    
    for url in domains_to_check:
        print(f"Checking {url}...")
        
        try:
            response = tavily_client.search(
                query="latest entrepreneurship startup funding investment news",
                topic="general",
                start_date=seven_days_ago_dt,
                end_date=today_dt,
                max_results=5,
                include_raw_content=True,
                include_domains=[url],
                search_depth="advanced"
            )

            results = response.get("results", []) 
            
            for result in results:
                title = result.get("title", "No Title")
                result_url = result.get("url")
                if not result_url:
                    continue  # A result without a URL cannot be dated or hashed
            
                # Prefer the full page text; fall back to Tavily's short snippet.
                content = result.get("raw_content")
                if not content:
                    content = result.get("content", "")
            
                published_date = extract_date(result_url, content)

                article_id = hashlib.md5(result_url.encode()).hexdigest()

                article_record = {
                    "id": article_id,
                    "title": title,
                    "text": content,
                    "date": published_date
                }

                all_articles.append(article_record)

        except Exception as e:
            print(f"Failed to fetch from {url}: {e}")

    return all_articles
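
The search window (`start_date`, `end_date`) is applied by Tavily, while the `date` field comes from `htmldate`, so the two can disagree and the returned list may contain articles dated outside the last seven days. One possible post-processing step, sketched here under that assumption, keeps only records whose extracted date falls inside the window and drops repeated `id` values. `filter_recent` is a hypothetical helper, not part of the pipeline above.

```python
from datetime import date, timedelta
from typing import Any, Dict, List, Optional

Article = Dict[str, Any]


def filter_recent(
    articles: List[Article], days: int = 7, today: Optional[date] = None
) -> List[Article]:
    """Keep unique articles whose 'date' lies within the last `days` days."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    seen_ids = set()
    recent = []
    for article in articles:
        try:
            published = date.fromisoformat(article["date"])
        except (KeyError, TypeError, ValueError):
            continue  # Skip records with a missing or malformed date
        if not (cutoff <= published <= today):
            continue
        article_id = article.get("id")
        if article_id in seen_ids:
            continue  # Same URL hash already kept
        seen_ids.add(article_id)
        recent.append(article)
    return recent
```

Because `extract_date` falls back to today's date, articles whose date could not be determined will still pass this filter.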

In [47]:
result = get_news_with_dates()

Checking empreendedor.com...
Checking portugalstartupnews.com...
Checking portugalstartups.com...
Checking startupportugal.com...
Checking portugalventures.pt...
Checking observador.pt...
Checking eco.sapo.pt...
Date extraction failed for https://www.weforum.org/stories/2025/06/five-shifts-in-a-new-era-for-entrepreneurs-reflections-at-summer-davos/: ("URL couldn't be processed: %s", None)
Checking essential-business.pt...
Checking portugalbusinessesnews.com...


In [48]:
result

[{'id': '81286442e8a390e984bf124b7196c704',
  'title': 'Business, Technology and World News | MyBusiness.com',
  'date': '2025-10-17'},
 {'id': 'eae767c94dc8db5cbf4e9a3f3ade4e19',
  'title': 'Armilar raises €120 million to invest in disruptive startups ...',
  'date': '2025-11-06'},
 {'id': '0a2fd9614c7b709e8a426ae56e5ef77b',
  'title': '$715.5M raised after Web Summit 2024, with AI startups ...',
  'date': '2025-11-11'},
 {'id': '4da9571b4172bdb423250641716f8a63',
  'title': 'Edtech platform Luca raises $8M Series A, plans tech hub ...',
  'date': '2025-12-04'},
 {'id': '6e6603825905fa9fd80d21cd2af8652a',
  'title': 'Spotlite raises €3.5 million to scale AI-powered satellite ...',
  'date': '2025-12-03'},
 {'id': '9177b03b59c0b3c08be2d7d9a3dab510',
  'title': 'AI grant consultant Granter wins Web Summit 2025 PITCH ...',
  'date': '2025-11-13'},
 {'id': '3fa5dba24b53ef55073b84855ef61320',
  'title': "Design-driven Coolivin infusing lifestyle into Portugal's shared ...",
  'date': '2021