GitHub - williamguilhermesouza/SUSEPDataExtract: Extraction of data from SUSEP for Carteira Global Challenge

SUSEP Data Extraction functions

This repository was made as part of the Carteira Global technical challenge. The repository with the challenge specifications can be found here.

Objective

The objectives of the challenge consisted of four main items:

Extract and format the data from the SUSEP JSON
Download and save all documents returned by the query on the SUSEP site: http://www.susep.gov.br/menu/consulta-de-produtos-1
Extract the data of interest from the downloaded documents
Save the extracted data to an output JSON file

Business Logic

The work is split across four Python files:

main.py

This is the main file, used as the application entry point. It gathers and filters the data from SUSEP, then creates and manages the threads that download the PDFs and extract data from them. Finally, the program writes the errors and the extracted data to two separate JSON files.
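The flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the record fields (`id`, `url`), the `process_record` placeholder, the function names, and the output file names are all assumptions made for the example.

```python
import json
import queue
import threading


def process_record(record):
    # Hypothetical stand-in for "download the PDF and extract its data".
    if not record.get("url"):
        raise ValueError("record has no document URL")
    return {"id": record["id"], "source": record["url"]}


def worker(tasks, results, errors):
    # Each thread pulls records until the queue is empty.
    while True:
        try:
            record = tasks.get_nowait()
        except queue.Empty:
            return
        try:
            results.append(process_record(record))
        except Exception as exc:
            # Failures are collected instead of stopping the whole run.
            errors.append({"id": record.get("id"), "error": str(exc)})


def run(records, n_threads=4, data_path="output.json", errors_path="errors.json"):
    tasks = queue.Queue()
    for record in records:
        tasks.put(record)

    results, errors = [], []
    threads = [
        threading.Thread(target=worker, args=(tasks, results, errors))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Two separate JSON files: one for extracted data, one for errors.
    with open(data_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, ensure_ascii=False, indent=2)
    with open(errors_path, "w", encoding="utf-8") as fh:
        json.dump(errors, fh, ensure_ascii=False, indent=2)
    return results, errors
```

Keeping errors in their own list means one failed download does not prevent the remaining documents from being processed.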

PdfProcessing.py

This file defines a subclass of the Thread class, so multiple instances can run as separate threads. PdfProcessing is responsible for downloading the PDF files and for calling PdfExtractor to extract the data from them.
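A Thread subclass of this kind might look like the sketch below. The class name `PdfDownloadThread`, its constructor parameters, and the injectable `fetch` and `extract` callables are hypothetical; the real PdfProcessing class may be organized differently.

```python
import threading
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


class PdfDownloadThread(threading.Thread):
    """Downloads one PDF, saves it, then hands the file to an extractor."""

    def __init__(self, url, dest_dir, extract, fetch=None):
        super().__init__()
        self.url = url
        self.dest_dir = Path(dest_dir)
        self.extract = extract                   # callable taking the saved file path
        self.fetch = fetch or self._fetch_http   # injectable so it can be stubbed
        self.result = None
        self.error = None

    @staticmethod
    def _fetch_http(url):
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()

    def run(self):
        # run() executes in the new thread once start() is called.
        try:
            self.dest_dir.mkdir(parents=True, exist_ok=True)
            name = PurePosixPath(urlsplit(self.url).path).name or "document.pdf"
            target = self.dest_dir / name
            target.write_bytes(self.fetch(self.url))
            self.result = self.extract(target)
        except Exception as exc:
            self.error = str(exc)
```

After `start()` and `join()`, the caller reads `result` or `error` from each thread object.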

PdfExtractor.py

PdfExtractor holds the logic for extracting information from the downloaded PDF files. It passes the extracted data back to the main file, which writes it to the output JSON.
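As an illustration of this kind of extraction, the sketch below pulls two fields out of text that has already been read from a PDF. The choice of fields (a SUSEP process number and a CNPJ), the process number pattern, and the function name are assumptions for the example; the README does not say which fields the real extractor collects or which PDF library it uses.

```python
import re

# Assumed formats: process number like 15414.900123/2019-45,
# CNPJ like 12.345.678/0001-90.
PROCESS_RE = re.compile(r"\b\d{5}\.\d{6}/\d{4}-\d{2}\b")
CNPJ_RE = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")


def extract_fields(text):
    """Return the first match for each field, or None when it is absent."""
    process = PROCESS_RE.search(text)
    cnpj = CNPJ_RE.search(text)
    return {
        "process_number": process.group(0) if process else None,
        "cnpj": cnpj.group(0) if cnpj else None,
    }
```

Returning `None` for missing fields lets the caller decide whether a document belongs in the data output or in the error output.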

PdfOCRParser.py

This class parses image-based PDF files, such as scanned documents, into text. It is only enabled when the program is executed with the 'ocr' flag. It is disabled by default because OCR parsing is slow and would make execution much slower.
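One way to gate a slow OCR step behind such a flag is sketched below. It assumes the flag is passed as a plain `ocr` argument on the command line, and `run_ocr` is a hypothetical callable standing in for whatever OCR routine is used; neither detail is confirmed by this README.

```python
import sys


def ocr_enabled(argv):
    # Assumption: the flag is the bare word "ocr" among the arguments.
    return "ocr" in argv[1:]


def text_for_page(embedded_text, run_ocr, use_ocr):
    """Prefer the PDF's own text; fall back to OCR only when enabled."""
    if embedded_text.strip():
        return embedded_text
    if use_ocr:
        # The slow path runs only for pages with no text layer.
        return run_ocr()
    return ""


if __name__ == "__main__":
    print("OCR enabled:", ocr_enabled(sys.argv))
```

Passing `run_ocr` as a callable keeps the expensive work from running unless it is actually needed.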

How to use

The project uses pipenv for dependency management. To run the project, first install pipenv with the command:

pip install pipenv

After installing pipenv, simply run:

pipenv install

to install dependencies.

And then run the main file with:

pipenv run python main.py

