Form Extractor Prototype

This tool extracts the structure from a PDF or image of a form.

By default it uses Anthropic's Claude 3 large language model (LLM).

It can also use an OpenAI model.

A single extraction of an A4 form page costs about 10p.

It replicates the form structure in JSON, following the schema used by GOV.UK Forms.

It then uses that to generate a multi-page web form in the GOV.UK style.

Here's a short demo video:

form-extractor-v2-alpha.mov

You'll notice that it doesn't try to faithfully replicate every field in a question. Instead, it uses the relevant components and patterns from the GOV.UK Design System. This is a feature, not a bug ;-)

Install

You'll need either an Anthropic API key or an OpenAI one.

Add the key as a local environment variable called ANTHROPIC_API_KEY or OPENAI_API_KEY, depending on which provider you use.
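
As a rough illustration of how an app like this might decide which provider to use, the sketch below checks both variables and prefers Anthropic, matching the stated default. The function name `pickProvider` is hypothetical and is not taken from server.js; only the two variable names come from this README.

```javascript
// Hypothetical helper: choose an LLM provider from the environment.
// Only ANTHROPIC_API_KEY and OPENAI_API_KEY are names from this README.
function pickProvider(env = process.env) {
  // Prefer Anthropic, since Claude 3 is described as the default.
  if (env.ANTHROPIC_API_KEY) {
    return { provider: 'anthropic', apiKey: env.ANTHROPIC_API_KEY };
  }
  if (env.OPENAI_API_KEY) {
    return { provider: 'openai', apiKey: env.OPENAI_API_KEY };
  }
  // Fail early with a clear message if neither key is set.
  throw new Error('Set ANTHROPIC_API_KEY or OPENAI_API_KEY before starting the app.');
}
```

Calling `pickProvider()` with no arguments reads from `process.env`; passing a plain object makes the function easy to check without touching the real environment.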

Install the app locally with npm install.

You'll also need to install GraphicsMagick. It's used to convert PDF pages into images.

There's a guide for doing that here.

Run

Start the app locally with npm start dev.

It'll be available at http://localhost:3000/

Current capabilities

processing PDF forms or images of forms
breaking a form down into questions
distinguishing between question, hint and field text
distinguishing between single-choice and multiple-choice questions
recognising common question types like 'name', 'address', 'date' etc.
recognising when an image isn't a form
recognising when a question has conditional routing
processing hand drawn forms
browsing previously processed forms

Current limitations

it only knows about certain kinds of question types
you can't provide your own API key via the UI
like a lot of Gen AI, it can be unpredictable

How it works

Disclaimer: This is a prototype and I am not a developer ;-).

The main UI is in app/views/index.html.

Other Nunjucks page templates and macros are in app/views.

Additional CSS styles are in assets/style.scss.

Generate updates to the CSS with sass assets/style.scss public/assets/style.css.

The script in public/assets/scripts.js enhances file upload and adds loading spinners.

The form in index.html uploads the file to the server.

If it's a PDF it uses GraphicsMagick to convert the pages into image files.
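
The conversion step could look something like the sketch below, which shells out to the GraphicsMagick `gm` command. This is an assumption about the approach, not a copy of server.js: the real app may use a wrapper library instead, and the function names, output file pattern and density value are all illustrative.

```javascript
const { execFile } = require('node:child_process');
const path = require('node:path');

// Build the argument list for `gm convert`.
// -density sets the rasterisation resolution; +adjoin writes one file per page.
// The 'page-%d.jpeg' naming pattern is an assumption for this example.
function buildGmArgs(pdfPath, outDir, density = 150) {
  const outPattern = path.join(outDir, 'page-%d.jpeg');
  return ['convert', '-density', String(density), pdfPath, '+adjoin', outPattern];
}

// Run GraphicsMagick and resolve once the conversion has finished.
// Requires the `gm` binary to be installed and on the PATH.
function convertPdfToImages(pdfPath, outDir) {
  return new Promise((resolve, reject) => {
    execFile('gm', buildGmArgs(pdfPath, outDir), (err) => (err ? reject(err) : resolve()));
  });
}
```

Using `execFile` with an argument array, rather than building a shell string, avoids quoting problems when an uploaded file name contains spaces.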

Form files are stored in subfolders in public/results.

The images are sent to an LLM, along with a prompt and JSON schema, via the 'SendToLLM' function in server.js.
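
To make that step more concrete, here is a minimal sketch of how a request body for Anthropic's Messages API might be assembled from an image, a prompt and a JSON schema. It is not the actual 'SendToLLM' code. The helper names, the tool name `extract_form_questions`, the model ID and the `max_tokens` value are assumptions chosen for illustration, and the sketch only builds the payload without sending it.

```javascript
const fs = require('node:fs');

// Read an image from disk and base64-encode it for the request.
function loadImageAsBase64(imagePath) {
  return fs.readFileSync(imagePath).toString('base64');
}

// Build a Messages API request body (hypothetical helper).
// Declaring the schema as a tool and forcing its use asks the model
// to reply with JSON that matches the schema.
function buildAnthropicRequest(imageBase64, prompt, schema) {
  const toolName = 'extract_form_questions'; // assumed name
  return {
    model: 'claude-3-opus-20240229', // assumed model ID
    max_tokens: 4096,
    tools: [{
      name: toolName,
      description: 'Record the structure of a form page.',
      input_schema: schema,
    }],
    tool_choice: { type: 'tool', name: toolName },
    messages: [{
      role: 'user',
      content: [
        { type: 'image', source: { type: 'base64', media_type: 'image/jpeg', data: imageBase64 } },
        { type: 'text', text: prompt },
      ],
    }],
  };
}
```

The returned object would then be posted to the API with the key from ANTHROPIC_API_KEY. An OpenAI request would need a differently shaped body, which may be why a schema is kept per LLM.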

The JSON schema for each LLM is specified in data/.

The results are saved as JSON files in the subfolders in public/results.
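
Saving and reloading a result could be sketched as follows. The function names and the `page-N.json` file naming are hypothetical; the README only says that results live as JSON files in subfolders of public/results.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Write one page's extraction result into the form's subfolder.
// Creates the subfolder if it does not exist yet.
function saveResult(resultsDir, formId, pageNumber, result) {
  const formDir = path.join(resultsDir, formId);
  fs.mkdirSync(formDir, { recursive: true });
  const filePath = path.join(formDir, `page-${pageNumber}.json`);
  // Pretty-print so the files are easy to inspect by hand.
  fs.writeFileSync(filePath, JSON.stringify(result, null, 2));
  return filePath;
}

// Read a saved result back, for example when rendering a page.
function loadResult(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}
```

For example, `saveResult('public/results', 'form-1', 1, data)` would write to public/results/form-1/page-1.json under these assumed names.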

Those files are used to generate the pages that are loaded into iframes in app/views/index.html.

The form components are specified in app/views/answer-types.njk.

They are built using the Nunjucks components in GOV.UK Frontend.

Page rendering is defined in the URL routing rules found at the bottom of server.js.
