Overview

This code, written to run as an AWS Lambda function, uses the Slate module to extract the text from a PDF file and then indexes that text to an Elasticsearch cluster. It is designed to be invoked when a PDF document is uploaded to an S3 bucket.
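As a rough illustration of the flow described above, the sketch below shows how the S3 event might be unpacked and how the body to index might be assembled. It is not the repository's code: the function names `parse_s3_event` and `build_document` and the field names `bucket`, `key`, and `text` are hypothetical, and `pages` is assumed to be a list of per-page strings of the kind Slate is expected to produce.

```python
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) pairs from an S3 object-created event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in the event (spaces appear as '+').
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def build_document(pages, bucket, key):
    """Assemble the body to index (hypothetical field names).

    Only the first page's text is kept, matching the PoC scope.
    """
    first_page = pages[0] if pages else ""
    return {"bucket": bucket, "key": key, "text": first_page.strip()}
```

In a handler, each `(bucket, key)` pair would be downloaded from S3, passed through Slate to obtain `pages`, and the resulting dictionary serialized to JSON and sent to the Elasticsearch endpoint.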

A few implementation notes:

Because this is just a simple PoC, only the text on the first page of each PDF is indexed to Elasticsearch.
Experiment with the Lambda timeout to find a value that works for the document sizes you're placing in the S3 bucket.
For smaller PDF documents, I've observed memory utilization (as reported in CloudWatch Logs) in the low tens of megabytes.
This assumes some familiarity with AWS Lambda basics (configuring event sources, invocation policies, etc.).
Specify a suffix of 'pdf' on the S3 event source to make sure the function only executes for PDF files.
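The suffix filter mentioned in the last note can be expressed as an S3 bucket notification configuration. The following is a minimal sketch, assuming the function ARN is supplied by the caller; the helper names `pdf_notification_config` and `is_pdf_key` are hypothetical, and the in-handler guard is an optional extra check rather than something the original code is known to do.

```python
def pdf_notification_config(function_arn):
    """Build an S3 notification configuration that only fires for keys ending in 'pdf'."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": "pdf"}]
                    }
                },
            }
        ]
    }


def is_pdf_key(key):
    """Optional guard inside the handler in case the filter is misconfigured."""
    return key.lower().endswith(".pdf")
```

The dictionary can be passed as `NotificationConfiguration` to boto3's `put_bucket_notification_configuration`, or the same suffix can be entered in the console when adding the S3 event source.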

To be implemented:

Signing of POSTs to Elasticsearch endpoints using SigV4, instead of using Python modules.
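For reference, SigV4 signing of a POST can be done with the Python standard library alone. The sketch below is one possible approach, not the planned implementation: the function names `derive_signing_key` and `sign_post` are hypothetical, the service name is assumed to be `es` (Amazon Elasticsearch Service), the content type is assumed to be JSON, and the URL is assumed to have a simple path with no query string. Inside Lambda the credentials are temporary, so the session token would also need to be supplied.

```python
import datetime
import hashlib
import hmac
from urllib.parse import urlparse


def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key by chaining HMAC-SHA256 over the scope parts."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign_post(url, body, access_key, secret_key, region,
              service="es", session_token=None, now=None):
    """Return headers carrying a SigV4 signature for a JSON POST of `body` (bytes)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parsed = urlparse(url)

    # Headers that take part in the signature (lowercase names, sorted).
    headers = {
        "content-type": "application/json",
        "host": parsed.netloc,
        "x-amz-date": amz_date,
    }
    if session_token:
        headers["x-amz-security-token"] = session_token
    names = sorted(headers)
    canonical_headers = "".join(f"{n}:{headers[n]}\n" for n in names)
    signed_headers = ";".join(names)

    # Canonical request: method, path, (empty) query, headers, signed names, body hash.
    canonical_request = "\n".join([
        "POST", parsed.path or "/", "", canonical_headers, signed_headers,
        hashlib.sha256(body).hexdigest(),
    ])
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    signing_key = derive_signing_key(secret_key, date_stamp, region, service)
    signature = hmac.new(signing_key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    out = {
        "Content-Type": "application/json",
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
    if session_token:
        out["X-Amz-Security-Token"] = session_token
    return out
```

The returned headers are meant to be attached to the POST as-is; the `Host` header is left to the HTTP client, which must send the same host that was signed.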

Repository: theemadnes/PDF_text_extract
