- 🚀 Getting Started
- 📦 Python Package
- 🖥️ User Interface
- 🤝 Usage Guidance & Contribution Guidance
- 📚 TWIX API Reference
TWIX is a tool for automatically extracting structured data from templatized documents that are programmatically generated by populating fields in a visual template. TWIX infers the underlying template and then performs data extraction, offering a scalable solution with high accuracy and low cost. In particular, TWIX offers,
- A Python package for extracting structured data from documents step by step, designed for production pipeline deployment, with optional user feedback to monitor and refine the extraction process.
- An interactive UI playground for data extraction that allows users to edit the inferred template, enabling more confident and accurate extraction.
- Python 3.10 or later
- OpenAI API key
- Clone the repository
git clone https://github.com/ucbepic/TWIX.git
- Install packages.
pip install -e .
- Set OpenAI API key as environment variable
You can refer to this document
If you want to use TWIX as a Python package, see detailed Python API references in TWIX API Reference, and examples below.
- To use TWIX to extract structured data step by step, check out
tests/test_twix.ipynb
. - To use TWIX to extract structured data with a single API call, check out
tests/test_twix_transform.ipynb
. - To edit the inferred template with user input, check out
tests/test_twix_user_apis.ipynb
. - To see how TWIX performs in large datasets, check out
tests/test_twix_large.ipynb
. In this example, after inferring the template with a cost of approximately $0.001, TWIX extracts all data from a 1,292-page document in about 4 seconds with zero additional cost.
If you want to use TWIX in our user interfaces:
-
Start the frontend server. In the
/twix-ui/
directory, run:npm start
-
Start the backend server in a new terminal. Navigate to the
/twix-ui/backend/
, run:python3 app.py
In our user interface, you can view the cumulative cost incurred so far. You can edit the template by updating the fields for each node, as well as add or delete nodes before starting data extraction. The outputs generated in this step can be downloaded as local files.
-
Applicable Documents
TWIX is currently optimized for documents generated from the same template. To infer the underlying template, TWIX requires at least two records (or two pages) created with that template. Sample documents are available in thetests/data
directory. We are actively working on extending TWIX to support structured data extraction from a single-record (single-page) PDF. Updates will be announced in the repository once this feature is released. -
Phrase Patterns in Documents
TWIX uses a free OCR (Optical Character Recognition) tool, PdfPlumber, to extract phrases from documents. TWIX further infers fields and templates based on the extracted phrases. PdfPlumber performs well for extracting individual words; however, in cases where a phrase—such as a table column name or a table cell—spans multiple lines, PdfPlumber fails to recognize the entire phrase and instead returns fragmented words. To address this, TWIX develops a feature powered by vision-based LLMs that first extracts true phrases by using semantic knowledge and vision information from a small document sample, and then learns rules for combining words correctly. These learned rules are applied in downstream phrase extraction. You can enable this feature by settingvision_feature = True
intwix.extract_phrase
(it is disabled by default). We are actively working on improving the rule-learning function for phrase extraction. Additionally, we are exploring alternative OCR tools that can better handle multi-line phrase extraction. If you're interested in contributing to the phrase extraction component, feel free to reach out!
Several components in TWIX can potentially be replaced and improved by the community.
-
Using a custom LLM model:
To integrate your own LLM, create a wrapper function for your API that accepts aprompt
and returns aresponse
. Theprompt
should be a tuple of (instruction
,context
). For an example implementation, see how we wrap OpenAI's LLM intwix/models
.
Next, add an entry point intwix/model.py
to expose your LLM interface to TWIX. -
Using a custom OCR tool for phrase extraction:
To integrate another OCR tool, ensure that its output is cleaned and formatted as a table. Seetests/out/Investigations_Redacted_modified/Investigations_Redacted_modified_raw_phrases_bounding_box_page_number.txt
for an example. The table should have the following header:text,x0,y0,x1,y1,page
, wheretext
is the extracted phrase,x0,y0,x1,y1
represent the bounding box coordinates, andpage
indicates the page number. -
Using a custom field prediction approach:
To integrate a custom field prediction method, ensure its output is cleaned and formatted as a list of strings, where each string represents an inferred field. Seetests/out/Investigations_Redacted_modified/twix_key.txt
for an example.
We’re excited to chat with you and hear your feedback on new datasets and use cases! Feel free to reach out to yiminglin@berkeley.edu
to share your thoughts.
We welcome contributions! Please follow these guidelines:
- Fork the repository and create a new feature branch.
- Follow the coding style used in the codebase.
- Write tests and update documentation if applicable.
- Commit with clear messages and open a pull request.
- Be responsive to code review feedback.
Feel free to reach out by opening an issue or emailing us if you'd like to discuss bigger changes or new features.
This document provides an overview of the available APIs in the TWIX Python package. Each API is described with its functionality, parameters, and return values.
__all__ = [
"predict_field",
"predict_template",
"extract_data",
"extract_phrase",
"transform",
"remove_template_node",
"modify_template_node",
"add_fields",
"remove_fields"
]
Below is a detailed description of each API:
Extracts phrases from PDFs by using OCR tools.
- Parameters:
data_files
(list): Stores a list of paths to documents that are created using the same template.result_folder
(str): The path to store results.LLM_model_name
(str, optional): Specify the LLM model name.page_to_infer_fields
(int, optional): TWIX extracts all phrases from each input document by default. For field prediction, it separately creates a small document sample by specifyingpage_to_infer_fields
(by default is 5), which determines how many pages are used for inferring fields.vision_feature
(boolean, optional): If set to True, TWIX uses a vision-based LLM to extract phrases from the first two pages of the document and learn rules to merge words into phrases. Defaults to False.
- Returns:
dict
: A dict of phrases per input document. The key in the dict stores the document name. The value corresponds to the raw extracted phrases and their bounding box, which are also written in the result folder.cost
(float): The cost incurred during the function call.
Predicts a list of fields from documents. Fields refer to phrases in table headers or keywords in key-value pairs.
- Parameters:
data_files
(list): Stores a list of paths to documents that are created using the same template.result_folder
(str): The path to store results.LLM_model_name
(str, optional): Specify the LLM model name.
- Returns:
list
: A list of predicted fields. TWIX also writes the results in the local result folder, naming the file astwix_key.txt
.cost
(float): The cost incurred during the function call.
Predicts the template from documents. A template is defined as a tree (refer to the paper for details) and stored as a JSON. Each tree node (JSON object) corresponds to the abstract of a data block, storing the type and fields of the node.
- Parameters:
data_files
(list): Stores a list of paths to documents that are created using the same template.result_folder
(str): The path to store results.LLM_model_name
(str, optional): Specify the LLM model name.
- Returns:
list
: The template as a list of nodes, stored locally in the result folder, naming the file astemplate.json
.cost
(float): The cost incurred during the function call.
Extracts data based on a template.
- Parameters:
data_files
(list): Stores a list of paths to documents that are created using the same template.template
(list, optional): The template output frompredict_template
. If not specified, TWIX will look in the local result folder to read the predicted template.result_folder
(str): The path to store results.
- Returns:
dict
: A dictionary of extraction results, where the key is the file name, and the value is the extraction object of that file. Each extraction object is a list of data blocks containing either table blocks or key-value blocks. Results will be written locally in the result folder, naming the file asextracted.json
.cost
(float): The cost incurred during the function call.
Provides an end-to-end API to directly extract data from PDFs.
- Parameters:
data_files
(list): Stores a list of paths to documents that are created using the same template.result_folder
(str): The path to store results.LLM_model_name
(str): Specify the LLM model name.vision_feature
(boolean, optional): If set to True, TWIX uses a vision-based LLM to extract phrases from the first two pages of the document and learn rules to merge words into phrases. Defaults to False.
- Returns:
fields
(list): A list of strings representing the predicted fields.template
(list): The template as a list of nodes, stored locally in the result folder, naming the file astemplate.json
.extraction_object
(dict): A dictionary of extraction results, where the key is the file path, and the value is the extraction object of that file. Each extraction object is a list of data blocks containing either table blocks or key-value blocks. Results will be written locally in the result folder, naming the file asextracted.json
.cost
(float): The cost incurred during the function call.
Allows users to add fields to the predicted fields.
- Parameters:
added_fields
(list): A list of fields to add based on the predicted fields.result_folder
(str): The path to store results.
Allows users to delete fields from the predicted fields.
- Parameters:
removed_fields
(list): A list of fields to delete based on the predicted fields.result_folder
(str): The path to store results.
Allows users to remove nodes in the predicted template.
- Parameters:
node_ids
(list): A list of node IDs. Each node ID is an integer.result_folder
(str): The path to store results.
Allows users to update nodes in the predicted template.
- Parameters:
node_id
(int): The integer node ID to update.type
(str): The type of the node to update, either"kv"
or"table"
.fields
(list): A list of fields (strings) to update.result_folder
(str): The path to store results.