Source Agnostic Text Summarizer (SATS): an application designed to automate the extraction and summarisation of text-based data in large volumes.
- Most communication over the internet happens through unstructured text.
- To form an informed view of general trends in feedback on a given topic (e.g. a product or an event), it is valuable to extract aggregated themes from a large number of text documents.
The environment for this project requires tesseract-ocr to be installed for text recognition. If you use Windows, you can install it from here: https://github.com/UB-Mannheim/tesseract/wiki
Take note of the destination folder chosen during installation, which will likely look like C:\Users\username\AppData\Local\Programs\Tesseract-OCR, because this path is required when running Tesseract-OCR through the Python script.
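
As a minimal sketch of how the install folder might be used from Python, the snippet below assumes the `pytesseract` wrapper is what the script relies on (the text above does not name the package, so treat that as an assumption). The folder shown is the example path from above, and the helper names `tesseract_executable` and `configure_tesseract` are hypothetical.

```python
from pathlib import PureWindowsPath

# Example install folder from the text above; replace "username" with your own account name.
TESSERACT_DIR = r"C:\Users\username\AppData\Local\Programs\Tesseract-OCR"


def tesseract_executable(install_dir: str) -> str:
    """Return the full path to tesseract.exe inside the install folder."""
    return str(PureWindowsPath(install_dir) / "tesseract.exe")


def configure_tesseract(install_dir: str = TESSERACT_DIR) -> None:
    """Point pytesseract at the Tesseract binary.

    Assumes pytesseract is installed (pip install pytesseract).
    """
    import pytesseract  # imported lazily so the path helper works without it

    # pytesseract looks up the binary through this module-level setting.
    pytesseract.pytesseract.tesseract_cmd = tesseract_executable(install_dir)
```

Calling `configure_tesseract()` once at the start of the script should be enough. If the Tesseract folder is already on your PATH, this step may not be needed.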
You will also need the Poppler library if you are converting PDFs to images. It can be downloaded here: [install poppler](https://github.com/oschwartz10612/poppler-windows/releases/r. Read the instructions to ensure you extract all of the files into the correct pkgs or library folder.
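
The PDF-to-image step could look roughly like the sketch below. It assumes the `pdf2image` package is used to call Poppler (an assumption, since the text above only names Poppler itself). The `POPPLER_BIN` path and the `pdf_to_images` helper are hypothetical, so point the path at the bin folder of wherever you extracted Poppler.

```python
from pathlib import Path

# Hypothetical location of the extracted Poppler binaries; adjust to your own setup.
POPPLER_BIN = r"C:\path\to\poppler\Library\bin"


def pdf_to_images(pdf_path: str, poppler_bin: str = POPPLER_BIN, dpi: int = 200) -> list:
    """Convert each page of a PDF into a PIL image using Poppler.

    Assumes pdf2image is installed (pip install pdf2image).
    """
    # Fail early with a clear message if the Poppler folder is wrong.
    if not Path(poppler_bin).is_dir():
        raise FileNotFoundError(f"Poppler bin folder not found: {poppler_bin}")

    from pdf2image import convert_from_path  # imported lazily, only when needed

    # poppler_path tells pdf2image where the Poppler executables live.
    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_bin)
```

The returned list holds one image per page, and each image can then be passed to the text recognition step.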
