A Python (3.9.13) tool that extracts text from PDF files and applies custom text processing to identify sustainability concepts.
Project dependencies:
pip install PyPDF2
pip install nltk
pip install numpy
pip install pandas
pip install openpyxl
pip install lexnlp
pip install scikit-learn
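
After installing, a quick check can confirm that every dependency is importable. The sketch below is not part of the project; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names. It relies only on the standard library and on the fact that scikit-learn is imported as `sklearn`, while the other packages are assumed to import under their pip names.

```python
from importlib.util import find_spec

# Maps each pip package name from the list above to its import name.
# Only scikit-learn differs: it is imported as "sklearn".
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "nltk": "nltk",
    "numpy": "numpy",
    "pandas": "pandas",
    "openpyxl": "openpyxl",
    "lexnlp": "lexnlp",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies are installed.")
```

Any name reported as missing can be installed with the matching `pip install` line above.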
NLTK Tokenizer: https://www.nltk.org/_modules/nltk/tokenize/punkt.html
NLTK Model: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
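
For orientation, the extract-then-tokenize flow could look roughly like the sketch below. It is an assumption about how the pieces fit together, not the project's actual code: the function names and the keyword filter are hypothetical, `PdfReader` is the reader class in recent PyPDF2 releases (older releases used `PdfFileReader`), and `nltk.sent_tokenize` needs the Punkt model linked above to be available in NLTK's data path (newer NLTK versions may look for a `punkt_tab` resource instead).

```python
def extract_text(pdf_path):
    """Concatenate the text of every page in a PDF (recent PyPDF2 API)."""
    # Imported here so the other helpers can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def split_sentences(text):
    """Split text into sentences with NLTK's Punkt-based tokenizer."""
    # Requires the Punkt model to be installed in NLTK's data directory.
    import nltk

    return nltk.sent_tokenize(text)


def sentences_mentioning(sentences, terms):
    """Keep sentences containing any of the terms (case-insensitive).

    Hypothetical stand-in for the project's sustainability-concept
    processing, which this README does not describe in detail.
    """
    lowered = [term.lower() for term in terms]
    return [s for s in sentences if any(t in s.lower() for t in lowered)]
```

With a placeholder file name and illustrative terms, the helpers would be chained as `sentences_mentioning(split_sentences(extract_text("report.pdf")), ["emissions", "renewable"])`.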