ds-hack

End goal

Take a variety of document types, convert them to text, and surface a front end with a search facility, a classification report and geomapping.

The pipeline was as follows:

1. Convert inputs to text:
   1a. Image to text - Convert images (handwritten and printed text) to text via the Azure Computer Vision OCR API
   1b. XHTML to text - Extract text from XHTML using Beautiful Soup
   1c. PDF to text - Use Pytesseract to convert PDFs to text
2. Text to Database - Send the text to CosmosDB using pymongo and Azure's SDK
3. Enhance Database - Entity recognition, NLP preprocessing (e.g. lemmatization, stopword removal) and geocoding
4. Modelling - Perform TF-IDF and Word2Vec, then produce clusters and document similarity scores
5. Surfacing - Front end in Flask, hosted on Azure. Users can search and view the modelling results.
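
As a rough illustration of the preprocessing and TF-IDF document-similarity steps in the pipeline above, here is a minimal sketch using only the Python standard library. It is not the project's implementation: the stopword list, the sample documents, and the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) are all assumptions made up for this example, and it skips lemmatization and Word2Vec entirely.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, made up for illustration.
docs = [
    "Flood damage reported near the river bridge",
    "River flood warning issued for the bridge area",
    "Quarterly budget meeting minutes",
]
vectors = tfidf_vectors(docs)
sim_flood = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_flood > sim_unrelated)  # prints True
```

The two flood-related documents share terms and so score above zero, while the unrelated document shares none and scores exactly zero. In practice a library vectorizer would likely replace the hand-rolled weighting; the dict-based vectors here are only meant to make the arithmetic visible.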

Repository contents:

Metrics
Search-App
Text-Cleaning
blob-store
clustering
image_to_text
pipeline
xhtml_to_text
.gitignore
Lessons Learned.png
Process.png
README.md