# Get HTML Page Data

> Given raw HTML data, extract the main body parts of the page by any means necessary

In [None]:
#| default_exp html

In [None]:
#| hide

company_story = "../test-data/Continental acquires mold specialist EMT for commercial and specialty tires.html"
non_company_story = "../test-data/Culpepper Cattle Co. Dallas Opens In Continental Gin Building - Local Profile.html"

## Extracting text and metadata from HTML files

We use [Trafilatura](https://github.com/adbar/trafilatura) to extract text and metadata from HTML files. Here is an example:

In [None]:
#| export

import trafilatura

In [None]:
#| hide
from rich.pretty import pprint

In [None]:

# Step 1: Read the HTML content from your file
html_file_path = non_company_story
with open(html_file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Step 2: Use Trafilatura to process the HTML content
# `extract` returns the main text; `extract_metadata` returns title, date, site name, etc.
# `date_params` controls date detection; `max_date` is the latest date accepted as a publication date
date_params = {"original_date": True, "extensive_search": True, "max_date": "2024-05-03"}
extracted_text = trafilatura.extract(html_content, date_extraction_params=date_params)
metadata = trafilatura.extract_metadata(html_content, date_config=date_params)

# Step 3: Print the extracted text and the metadata
print(extracted_text)
pprint(metadata.as_dict())

Culpepper Cattle Co. Dallas will open in the historical Continental Gin Building on Wednesday, May 1. The East-Texas-inspired restaurant that originated in Rockwall, Texas will be open for lunch, dinner and weekend brunch. Culpepper Cattle Co. Dallas will feature fresh Tex-Mex, prime steaks, and Texas home cooking.
The original Culpepper Steakhouse was opened in Rockwall in 1982 by Michael “Dobber” Stephenson. Local country music artists would often play at the restaurant including sightings from Waylon Jennings and Randy Travis. The original was purchased in 1992 by Bob L. Clements who owned Culpepper for the next 30 years. He had a vision for finer dinning and brought the best-known chefs from around the country and introduced fine wines, seafood and live Jazz to the Rockwall scene.
After closing in 2023, UNCO Hospitality Group came in to begin reviving the restaurant for its third installment, Culpepper Cattle Co. With the third iteration of Culpepper Cattle Company, the brand focus

In [None]:
#| export
from datetime import date
import msgspec


## Extract data from web page

Now that we know how to process data with Trafilatura, we can write a simple function that extracts the article's text along with other important information such as its title, site and publication date.

In [None]:
#| export

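from typing import Optional

class Article(msgspec.Struct):
    content: Optional[str]  # main text; None when Trafilatura finds nothing to extract
    metadata: dict  # title, site name, publication date, etc.

def extract_article(html_content: str, date_params: Optional[dict] = None) -> Article:
    "Extract the main text and metadata from raw HTML."
    if date_params is None:
        # Default: accept no publication date later than today
        formatted_date = date.today().strftime("%Y-%m-%d")
        date_params = {"original_date": True, "extensive_search": True, "max_date": formatted_date}

    extracted_text = trafilatura.extract(html_content, date_extraction_params=date_params)
    metadata = trafilatura.extract_metadata(html_content, date_config=date_params)
    # `extract_metadata` can return None when the HTML cannot be parsed
    return Article(content=extracted_text, metadata=metadata.as_dict() if metadata else {})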
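
The default-date logic inside `extract_article` can also be pulled out into a small helper, which makes it easy to pin `max_date` to a fixed day when reprocessing older files. The sketch below is only an illustration: `build_date_params` is a hypothetical name that is not part of this module, and it uses `date.isoformat()` to produce the `YYYY-MM-DD` string.

```python
from datetime import date
from typing import Optional


def build_date_params(max_date: Optional[date] = None) -> dict:
    """Return date-extraction settings, capping dates at `max_date` (default: today).

    Hypothetical helper for illustration; the keys mirror the dict used in this notebook.
    """
    if max_date is None:
        max_date = date.today()
    return {
        "original_date": True,
        "extensive_search": True,
        "max_date": max_date.isoformat(),  # formatted as YYYY-MM-DD
    }


# Pin the cut-off to a fixed day, e.g. when reprocessing an archived page
params = build_date_params(date(2024, 5, 3))
print(params["max_date"])
```

The resulting dict could then be passed as the `date_params` argument of `extract_article`.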


## Processing a file with this module

The following example processes a file with this module to get its content and metadata.

In [None]:
with open(company_story, 'r', encoding='utf-8') as file:
    html_content = file.read()

article = extract_article(html_content)
pprint(article)
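
Both examples in this notebook open the file with a strict UTF-8 decoder, which raises `UnicodeDecodeError` if a saved page contains stray bytes. A minimal sketch of a more forgiving reader follows; `read_html` is a hypothetical helper, not part of this module, and choosing `errors="replace"` is an assumption about what is acceptable for downstream extraction.

```python
import tempfile
from pathlib import Path


def read_html(path: str) -> str:
    """Read an HTML file as text, replacing bytes that are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Demonstrate on a temporary file containing one invalid byte (0xFF)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_bytes(b"<html><body><p>caf\xc3\xa9 \xff</p></body></html>")
    html_content = read_html(str(sample))

print(html_content)
```

With such a helper, the call above could be written as `extract_article(read_html(company_story))`.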

In [None]:
#| hide
import nbdev

nbdev.nbdev_export()