Take a variety of document types, process them into text and surface a front end with a search facility, a classification report and geomapping.
The pipeline was as follows:
- Convert inputs to text:
1a. Image to text - Convert images (handwritten and printed text) to text via the Azure Computer Vision OCR API
1b. XHTML to text - Extract text from XHTML using Beautiful Soup
1c. PDF to text - Use Pytesseract to convert PDF to text
- Text to Database - Send text to Azure CosmosDB using pymongo
- Enhance Database - Entity recognition, NLP preprocessing (e.g. lemmatization, stopword removal) and geocoding.
- Modelling - Perform TF-IDF and Word2Vec, then produce clusters and document similarity scores.
- Surfacing - Front end in Flask, hosted on Azure. Supports search and returns the modelling results.
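
To illustrate the modelling step, here is a minimal sketch of TF-IDF weighting and cosine document similarity in plain Python. It is a stand-in for illustration, not the project's code: the description above does not say which library was used, and the tiny stopword list, the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `most_similar`) and the smoothed IDF formula are all assumptions. Word2Vec, lemmatization and clustering are not covered.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}


def tokenize(text):
    """Lowercase, keep runs of letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Assumed weighting: relative term frequency times smoothed IDF.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Rank all other documents by similarity to the one at query_index."""
    scores = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar(0, tfidf_vectors(docs))` on a list of document strings returns `(index, score)` pairs, best match first. A front end like the one described above could show such a ranking next to each search hit.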