theemadnes/PDF_text_extract

Overview

This code, written to run as an AWS Lambda function, uses the Slate module to extract the text from a PDF file and then indexes that text in an Elasticsearch cluster. It is designed to be invoked when a PDF document is put to an S3 bucket.
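
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual handler: `pdf_objects`, `handle`, `extract_pages`, and `index_document` are hypothetical names. The two callables stand in for "download the object from S3 and run it through Slate" and "send the document to Elasticsearch", so the sketch runs without AWS access. Only the first page is indexed, matching the PoC scope.

```python
from urllib.parse import unquote_plus


def pdf_objects(event):
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 events (a space becomes '+').
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handle(event, extract_pages, index_document):
    """Extract text for each uploaded object and index the first page.

    extract_pages(bucket, key) -> list of page strings   (hypothetical)
    index_document(doc)        -> sends doc to the index (hypothetical)
    """
    indexed = []
    for bucket, key in pdf_objects(event):
        pages = extract_pages(bucket, key)
        # PoC scope: only the first page's text is indexed.
        first_page = pages[0] if pages else ""
        index_document({"bucket": bucket, "key": key, "text": first_page})
        indexed.append(key)
    return indexed
```

Passing the extractor and indexer in as arguments is only a convenience for the sketch; it keeps the event parsing testable on its own.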

A few implementation notes:

  • Because this is just a simple PoC, only the text from the first page is indexed in Elasticsearch
  • Experiment with the Lambda timeout to find a value that works for the document sizes you're placing in the S3 bucket
  • For smaller PDF docs, I've observed memory utilization (as reported in CloudWatch Logs) in the low tens of megabytes
  • This assumes some familiarity with AWS Lambda basics (configuring event sources, invocation policies, etc.)
  • Specify a suffix of 'pdf' on the S3 event source to make sure the function only executes for PDF files
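
As a sketch of the suffix filter from the last note, the dictionary below follows the shape that boto3's `put_bucket_notification_configuration` expects for its `NotificationConfiguration` argument. The function names and the ARN in the usage note are placeholders, not values from this repository. The `is_pdf_key` helper is an optional extra guard inside the handler, assumed here for illustration.

```python
def pdf_notification_config(lambda_arn, suffix="pdf"):
    """Build an S3 notification config that fires only for keys ending in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def is_pdf_key(key):
    """Defensive check in the handler, in case the event filter is misconfigured."""
    return key.lower().endswith(".pdf")
```

With boto3 available, the result could be applied along the lines of `s3.put_bucket_notification_configuration(Bucket="my-bucket", NotificationConfiguration=pdf_notification_config(arn))`, where `my-bucket` and `arn` are placeholders.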

To be implemented:

  • Signing of POSTs to Elasticsearch endpoints using SigV4, instead of relying on Python modules
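
For reference, a SigV4-signed POST could be sketched with only the standard library, roughly as below. This is an illustration of the general signing scheme, not code from this repository: `derive_signing_key` and `sign_post` are hypothetical names, the service name `es` and the choice to sign only the `host` and `x-amz-date` headers are assumptions, and the path is assumed to need no extra URI encoding.

```python
import hashlib
import hmac


def _hmac_sha256(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Chain HMACs over date, region, service, and the fixed terminator."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign_post(host, path, body, access_key, secret_key, region, now, service="es"):
    """Return the headers to add to a POST. `body` is bytes, `now` a UTC datetime."""
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (empty query string, two signed headers).
    canonical_headers = "host:{}\nx-amz-date:{}\n".format(host, amz_date)
    signed_headers = "host;x-amz-date"
    payload_hash = hashlib.sha256(body).hexdigest()
    canonical_request = "\n".join(
        ["POST", path, "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign.
    scope = "{}/{}/{}/aws4_request".format(date_stamp, region, service)
    string_to_sign = "\n".join(
        [
            "AWS4-HMAC-SHA256",
            amz_date,
            scope,
            hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
        ]
    )

    # Step 3: signature and Authorization header.
    key = derive_signing_key(secret_key, date_stamp, region, service)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
    authorization = (
        "AWS4-HMAC-SHA256 Credential={}/{}, SignedHeaders={}, Signature={}".format(
            access_key, scope, signed_headers, signature
        )
    )
    return {"x-amz-date": amz_date, "Authorization": authorization}
```

Inside Lambda the credentials are temporary, so a real request would also need to send the session token in an `X-Amz-Security-Token` header, which this sketch omits.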

About

AWS Lambda function written in Python that extracts text (using Slate) from a PDF put to S3 and indexes it in Elasticsearch.
