# **Introduction to Trellis API**

Trellis is a platform that uses AI models to extract structured data, insights, and patterns from documents. It provides tools for extraction, classification, and summarization, and integrates into applications through its REST API and Python SDK.

## **Core Features**
1. **Document Processing**: Handles various formats, including PDFs, images, and text files.
2. **Customizable Workflows**: Supports tailored operations like extraction, classification, and summarization.
3. **Scalability and Security**: Processes large datasets efficiently while ensuring data privacy.

## **Workflow Overview**
1. **Set Up a Project**: Create a project to organize your assets and transformations.
2. **Upload Documents**: Add files (e.g., PDFs or images) to your project.
3. **Define Transformations**: Specify tasks like extraction or summarization using custom prompts.
4. **Retrieve Results**: Access processed results via the API or SDK for analysis or storage.

## **Key Concepts**
- **Assets**: Individual files (PDFs, images, etc.) uploaded for processing within a project.
- **Projects**: Containers to manage assets and transformations, identified by a unique `proj_id`.
- **Transformations**: Defined operations (e.g., extraction, classification, summarization) applied to assets.
- **Operations**: Specific tasks within transformations, such as extracting dates or classifying content.

## **Key API Endpoints**
1. **Create a Project**:
   - **Endpoint**: `/projects/create`
   - **Method**: `POST`
   - Purpose: Set up a project to manage assets and transformations.
2. **Upload an Asset**:
   - **Endpoint**: `/assets/upload`
   - **Method**: `POST`
   - Purpose: Upload files (e.g., PDFs) for processing within a project.
3. **Create a Transformation**:
   - **Endpoint**: `/transforms/create`
   - **Method**: `POST`
   - Purpose: Define operations such as data extraction or classification for the uploaded assets.
4. **Retrieve Results**:
   - **Endpoint**: `/transforms/{transform_id}/results`
   - **Method**: `POST`
   - Purpose: Access the processed results of transformations.
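
The endpoint paths above are relative to a versioned base URL; the notebook code later in this document uses `https://api.runtrellis.com/v1`. The following is a minimal sketch of a helper that joins the two, assuming the same base URL applies to every endpoint. `endpoint_url` and `results_url` are hypothetical helper names, not part of any Trellis SDK.

```python
BASE_URL = "https://api.runtrellis.com/v1"  # taken from the URLs used later in this notebook


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def results_url(transform_id: str) -> str:
    """Build the per-transform results URL in the form used by fetch_results in this notebook."""
    return endpoint_url(f"transforms/{transform_id}/results")


print(endpoint_url("/projects/create"))
print(results_url("transform_123"))  # "transform_123" is a placeholder ID
```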

## **Authentication**
Trellis requires an API key for authentication. Include the key in your request headers:
```bash
Authorization: Bearer YOUR_API_KEY
```

### **Using Trellis API for Dataset Processing**

In this project, we used the Trellis API to automate the extraction of publication dates from a dataset of French administrative documents hosted on Hugging Face.

#### **Workflow Overview**
1. **Project Creation**: A new Trellis project was created to manage the dataset files and transformations.
2. **File Upload**: Files from the dataset were uploaded to the project for processing.
3. **Custom Transformation**: A custom operation was defined using a detailed prompt to extract publication dates from the documents.
4. **Result Retrieval**: Extracted dates were retrieved via the API and compared against the dataset's gold-standard labels to evaluate performance.

This process automated the extraction task across the whole dataset; the resulting accuracy is reported in the evaluation below.


In [248]:
import pandas as pd
import requests

from typing import List, Dict

In [None]:
TRELLIS_API_KEY = "YOUR-API-KEY"
CREATE_PROJECT_URL = "https://api.runtrellis.com/v1/projects/create"
CREATE_TRANSFORM_URL = "https://api.runtrellis.com/v1/transforms/create"
EVENT_URL = "https://api.runtrellis.com/v1/events/subscriptions/actions/bulk"
UPLOAD_URL = "https://api.runtrellis.com/v1/assets/upload"

HEADERS = {
    "Authorization": TRELLIS_API_KEY,
    "Content-Type": "application/json",
    "accept": "application/json"
}

PROMPT = """
You are tasked with extracting the publication date from French administrative documents. 

The publication date can typically be found in the following places:
1. On the first page, specifically in the top-right corner, look for text after "Reçu en préfecture le." This is often the most reliable source of the publication date.
2. In a section titled "Approbation et modifications du règlement" or similar, often specified with phrases like "entre en vigueur le."
3. Elsewhere on the document where the date is explicitly linked to the document's publication or adoption.

Guidelines:
- Look for dates in formats such as "DD/MM/YYYY" or "D/M/YYYY or YYYY"
- Prioritize the date found after "Reçu en préfecture le" in the top-right corner of the first page. If this is unavailable, proceed to other sections.
- If multiple dates are present, prioritize the one explicitly linked to the document's publication or adoption.
- Return only the date in `DD/MM/YYYY` format.
- If no date is found, return "Date not found."

"""
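
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. The following is a minimal sketch that reads it from an environment variable instead. The variable name `TRELLIS_API_KEY` and the helper `build_headers` are assumptions for illustration, and the header layout simply mirrors the `HEADERS` dictionary above.

```python
import os
from typing import Dict


def build_headers(api_key: str) -> Dict[str, str]:
    """Return request headers laid out like the HEADERS dictionary above."""
    if not api_key:
        raise ValueError("API key is empty; set it before calling the API")
    return {
        "Authorization": api_key,
        "Content-Type": "application/json",
        "accept": "application/json",
    }


# Fall back to the placeholder when the (assumed) environment variable is unset.
api_key = os.environ.get("TRELLIS_API_KEY", "YOUR-API-KEY")
headers = build_headers(api_key)
```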

In [260]:
def create_project(project_name: str) -> str:
    """Creates a Trellis project and returns the project ID."""
    try:
        payload = {"name": project_name}
        response = requests.post(CREATE_PROJECT_URL, json=payload, headers=HEADERS)
        response.raise_for_status()
        return response.json()["data"]["proj_id"]
    except requests.RequestException as e:
        raise RuntimeError(f"Error creating project: {e}") from e


def create_transform(proj_id: str, prompt: str) -> str:
    """Creates a Trellis transform and returns the transform ID."""
    try:
        payload = {
            "proj_id": proj_id,
            "transform_name": "publication_date_extraction",
            "transform_params": {
                "model": "trellis-premium",
                "mode": "document",
                "operations": [
                    {
                        "column_name": "publication_date",
                        "column_type": "text",
                        "transform_type": "extraction",
                        "task_description": prompt
                    }
                ]
            }
        }
        response = requests.post(CREATE_TRANSFORM_URL, json=payload, headers=HEADERS)
        response.raise_for_status()
        return response.json()["data"]["transform_id"]
    except requests.RequestException as e:
        raise RuntimeError(f"Error creating transform: {e}") from e


def configure_events(proj_id: str, transform_id: str) -> None:
    """Configures event subscriptions for the Trellis project."""
    try:
        payload = {
            "events_with_actions": [
                {
                    "event_type": "asset_uploaded",
                    "proj_id": proj_id,
                    "actions": [
                        {"type": "run_extraction", "proj_id": proj_id}
                    ]
                },
                {
                    "event_type": "asset_extracted",
                    "proj_id": proj_id,
                    "actions": [
                        {"type": "refresh_transform", "transform_id": transform_id}
                    ]
                }
            ]
        }
        response = requests.post(EVENT_URL, json=payload, headers=HEADERS)
        response.raise_for_status()  # raises on 4xx/5xx; there is nothing to return
    except requests.RequestException as e:
        raise RuntimeError(f"Error configuring events: {e}") from e


def upload_pdf(proj_id: str, pdf_urls: List[str]) -> List[str]:
    """Uploads PDF URLs to the Trellis project and returns the asset IDs."""
    try:
        payload = {"proj_id": proj_id, "urls": pdf_urls}
        response = requests.post(UPLOAD_URL, json=payload, headers=HEADERS)
        response.raise_for_status()
        return [data["asset_id"] for data in response.json()["data"]]
    except requests.RequestException as e:
        raise RuntimeError(f"Error uploading PDF: {e}") from e


def fetch_results(transform_id: str, asset_ids: List[str]) -> List[Dict]:
    """Fetches extraction results from Trellis."""
    try:
        url = f"https://api.runtrellis.com/v1/transforms/{transform_id}/results"
        payload = {"filters": {}, "asset_ids": asset_ids}
        response = requests.post(url, json=payload, headers=HEADERS)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        raise RuntimeError(f"Error fetching results: {e}") from e
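
The run log further down shows several `504 Gateway Time-out` errors when fetching results. One common way to handle transient failures is to retry with exponential backoff. The following is a minimal sketch and was not part of the original pipeline: `retry_with_backoff` is a hypothetical helper, the retry count and delays are arbitrary defaults, and whether a later attempt succeeds depends on why the request timed out.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 2.0,
    exceptions: Tuple[Type[BaseException], ...] = (RuntimeError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a listed exception wait base_delay * 2**attempt seconds and retry.

    Makes at most retries + 1 calls and re-raises the last exception if all of them fail.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable for retries >= 0")
```

Because `fetch_results` wraps request errors in `RuntimeError`, it could be called as `retry_with_backoff(lambda: fetch_results(transform_id, asset_ids))`.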

In [261]:
from typing import Optional


def process_publication_dates(pdf_urls: List[str]) -> Optional[str]:
    """
    Extracts the publication date for the PDF(s) at the given URL(s).

    Returns the first extracted value, "Date not found." if no value is
    available, or None if any API call fails.
    """
    try:
        # Step 1: Create project
        proj_id = create_project("Date Extraction")

        # Step 2: Create transform
        transform_id = create_transform(proj_id, PROMPT)

        # Step 3: Configure events
        configure_events(proj_id, transform_id)

        # Step 4: Upload PDFs and get asset IDs
        asset_ids = upload_pdf(proj_id, pdf_urls)

        # Step 5: Fetch results
        results = fetch_results(transform_id, asset_ids)

        # Step 6: Find the ID of the "publication_date" column
        column_definitions = results["metadata"]["column_definitions"]
        publication_date_id = next(
            (column["id"] for column in column_definitions if column["name"] == "publication_date"),
            None
        )

        # Step 7: Take the first row that has a value for that column
        data = results.get("data", [])
        predicted_date = next(
            (item[publication_date_id] for item in data if publication_date_id in item),
            "Date not found."
        )

        return predicted_date

    except Exception as e:
        # Report the failure and signal it to the caller with None
        print(f"Error during processing: {e}")
        return None


## Preprocess Data

In [5]:
df1 = pd.read_csv("/Users/zeinab/Computational-Linguistics/M2/NLP in Industry/french-date-extractor/data/raw/original_data.csv")
df2 = pd.read_csv("/Users/zeinab/Computational-Linguistics/M2/NLP in Industry/french-date-extractor/data/raw/annotations.csv")

df1_subset = df1[["url", "cache"]]
df2_subset = df2[["url", "Gold published date"]]
df1_subset = df1_subset.rename(columns={"cache": "pdf_url"})
df2_subset = df2_subset.rename(columns={"Gold published date": "gold_date"})

merged_df = pd.merge(df1_subset, df2_subset, on="url", how="inner")
merged_df = merged_df.dropna()
merged_df = merged_df.drop(columns="url")
df = merged_df[~merged_df["pdf_url"].str.contains(r"\s", na=False)].reset_index(drop=True)
df.to_csv("/Users/zeinab/Computational-Linguistics/M2/NLP in Industry/french-date-extractor/data/processed/final_dataset.csv")

In [312]:
df.head(10)

|   | pdf_url | gold_date |
|---|---------|-----------|
| 0 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/01/2023 |
| 1 | https://datapolitics-public.s3.gra.io.cloud.ov... | 25/01/2023 |
| 2 | https://datapolitics-public.s3.gra.io.cloud.ov... | 02/02/2023 |
| 3 | https://datapolitics-public.s3.gra.io.cloud.ov... | 26/01/2023 |
| 4 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/01/2023 |
| 5 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/02/2023 |
| 6 | https://datapolitics-public.s3.gra.io.cloud.ov... | 22/02/2023 |
| 7 | https://datapolitics-public.s3.gra.io.cloud.ov... | 13/02/2023 |
| 8 | https://datapolitics-public.s3.gra.io.cloud.ov... | 09/03/2023 |
| 9 | https://datapolitics-public.s3.gra.io.cloud.ov... | 15/02/2023 |


## Process All PDF URLs via the API

In [262]:
urls = df['pdf_url'].to_list()
url_date_map = {}

In [263]:
for url in urls:
    date = process_publication_dates([url])
    print(date)
    url_date_map[url] = date

16/01/2023
25/01/2023
02/02/2023
26/01/2023
16/01/2023
16/02/2023
Error during processing: Error fetching results: 504 Server Error: Gateway Time-out for url: https://api.runtrellis.com/v1/transforms/transform_2pZpM3L0crCOKixyaw0Rjr0lAVM/results
None
13/02/2023
Error during processing: Error fetching results: 504 Server Error: Gateway Time-out for url: https://api.runtrellis.com/v1/transforms/transform_2pZpZx6QMVMRJQKqZdIj3TqEgWa/results
None
15/02/2023
15/02/2023
Error during processing: Error fetching results: 504 Server Error: Gateway Time-out for url: https://api.runtrellis.com/v1/transforms/transform_2pZps7uyhBQAkojj4nWr1DDsgMD/results
None
16/02/2023
20/02/2023
08/02/2023
14/02/2023
Error during processing: Error fetching results: 504 Server Error: Gateway Time-out for url: https://api.runtrellis.com/v1/transforms/transform_2pZqLlP1usAS3hOnBMEgPYExJP3/results
None
10/02/2023
10/03/2023
30/01/2023
Date not found.
10/01/2023
25/01/2023
12/01/2023
12/01/2023
16/02/2023
Date not foun

In [None]:
df['extracted_date'] = df['pdf_url'].map(url_date_map)

In [305]:
df.head(10)

|   | pdf_url | gold_date | extracted_date |
|---|---------|-----------|----------------|
| 0 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/01/2023 | 16/01/2023 |
| 1 | https://datapolitics-public.s3.gra.io.cloud.ov... | 25/01/2023 | 25/01/2023 |
| 2 | https://datapolitics-public.s3.gra.io.cloud.ov... | 02/02/2023 | 02/02/2023 |
| 3 | https://datapolitics-public.s3.gra.io.cloud.ov... | 26/01/2023 | 26/01/2023 |
| 4 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/01/2023 | 16/01/2023 |
| 5 | https://datapolitics-public.s3.gra.io.cloud.ov... | 16/02/2023 | 16/02/2023 |
| 6 | https://datapolitics-public.s3.gra.io.cloud.ov... | 22/02/2023 |  |
| 7 | https://datapolitics-public.s3.gra.io.cloud.ov... | 13/02/2023 | 13/02/2023 |
| 8 | https://datapolitics-public.s3.gra.io.cloud.ov... | 09/03/2023 |  |
| 9 | https://datapolitics-public.s3.gra.io.cloud.ov... | 15/02/2023 | 15/02/2023 |


In [303]:
num_nan_rows = df['extracted_date'].isna().sum()
num_nan_rows

91

In [302]:
num_no_date_rows = df[df['extracted_date'] == 'Date not found.'].count()
num_no_date_rows

pdf_url           18
gold_date         18
extracted_date    18
dtype: int64

In [304]:
valid_rows = df.dropna(subset=['gold_date', 'extracted_date'])

matches = (valid_rows['gold_date'] == valid_rows['extracted_date']).sum()
total_valid = len(valid_rows)
accuracy = matches / total_valid

print(f"Total matches: {matches}")
print(f"Total pdfs that have been successfully processed: {total_valid}")
print(f"Accuracy: {accuracy:.4%}")


Total matches: 274
Total pdfs that have been successfully processed: 355
Accuracy: 77.1831%


* Out of 468 rows in the annotated dataset, 91 could not be processed by Trellis: either the URL denied access or the file was too large to process.
* For 18 rows, Trellis did not find a date and returned `Date not found.`.
* Final accuracy: 77.1831% (274 matches out of 355 processed PDFs), as printed above.
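
The accuracy above relies on exact string equality, so a prediction such as `2/2/2023` would not match a gold label of `02/02/2023`. The following is a minimal sketch of normalizing both sides before comparing, assuming all dates are day-first. `normalize_date` is a hypothetical helper and was not used to produce the numbers reported here.

```python
from datetime import datetime
from typing import Optional


def normalize_date(value: object) -> Optional[str]:
    """Return value as zero-padded DD/MM/YYYY, or None if it is not a day-first date."""
    if not isinstance(value, str):
        return None  # covers None and NaN (a float) left by failed API calls
    try:
        parsed = datetime.strptime(value.strip(), "%d/%m/%Y")
    except ValueError:
        return None  # covers "Date not found." and other free text
    return parsed.strftime("%d/%m/%Y")


print(normalize_date("2/2/2023"))
print(normalize_date("Date not found."))
```

With the DataFrame from this notebook, both columns could be passed through it before comparison, for example `df["gold_date"].map(normalize_date)`.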

# How to use?

1. Set `pdf_url` to the URL of your PDF.
2. Pass it to the `process_publication_dates` function.
3. The function returns one of:
    - `None` (the API could not process the PDF at that URL)
    - `'Date not found.'` (Trellis could not find a matching date)
    - the extracted date in `DD/MM/YYYY` format.
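
The three outcomes listed above can be separated with a small helper, which is handy when tallying failures over many URLs. The following is a minimal sketch; `classify_result` and its label strings are hypothetical and not part of the notebook's pipeline.

```python
from typing import Optional

NOT_FOUND = "Date not found"  # sentinel requested in PROMPT; a trailing period is tolerated


def classify_result(result: Optional[str]) -> str:
    """Map the return value of process_publication_dates to one of three labels."""
    if result is None:
        return "failed"  # an API call raised and the error was printed
    if result.strip().rstrip(".") == NOT_FOUND:
        return "not_found"  # the file was processed but no date was returned
    return "extracted"  # anything else is treated as a date string


print(classify_result(None))
print(classify_result("Date not found."))
print(classify_result("16/01/2023"))
```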

In [None]:
pdf_url = ""  # URL of the PDF to process
extracted_date = process_publication_dates([pdf_url])  # the upload payload expects a list of URLs
print(extracted_date)