# Research on embedding and similarity search on python for eCRF 

# Index

[1. Research on embedding and similarity search on python for eCRF](#research-on-embedding-and-similarity-search-on-python-for-ecrf)  
[2. REQUIREMENTS](#requirements)  
[3. Task 1A: Text extraction from a pdf](#task-1a--text-extraction-from-a-pdf)  
[4. Task 1B: Pdf to image + OCR on the image.](#task-1b-pdf-to-image--ocr-on-the-image)  
[5. Task 2: Embedding for a similarity search using AI](#task-2-embedding-for-a-similarity-search-using-ai)  
[6. Task 3: Get the questionnaire from Milo's API](#task-3-get-the-questionnaire-from-milos-api)  
[7. References & Sources][def]  


[def]: #references--sources

## REQUIREMENTS

- Ideally, install Bash.
```bash
brew install bash  # macOS
# Windows: follow the tutorial at https://www.howtogeek.com/790062/how-to-install-bash-on-windows-11/
```
- The Windows link is a tutorial rather than a command, and you will need to use a virtual machine. You could also do the installation in PowerShell, but some of the commands do not exist there.

- The code should work with most earlier Python versions, but I am using a Python 3.12.3 kernel.

- Create a `.env` file with the environment variables.
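
As an illustration of how the `.env` file could be read, here is a minimal sketch using only the standard library. The function name `load_env_file` and the variable name `OCR_API_KEY` are hypothetical; the notebooks may load the file differently (for example with the `python-dotenv` package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file and copy them into os.environ.

    Blank lines, comment lines and lines without '=' are skipped.
    Variables already set in the environment are not overwritten.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    # OCR_API_KEY is a hypothetical name used only for this example.
    print("OCR_API_KEY set:", "OCR_API_KEY" in os.environ)
```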



# Task 1A : Text extraction from a pdf
We must retrieve the plain text from a PDF file and identify the questions in a questionnaire. 
For our final product this step might not even be useful, because many of the documents will simply be scans. 

[Task 1: PDF Extraction](./task1_PDFextraction.ipynb)

That being said, if one library had to be chosen, I would go with PyMuPDF: in the comparison it was both quicker and more accurate at extracting text.  
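
A minimal sketch of this step, assuming PyMuPDF is installed (`pip install pymupdf`). The question heuristic (numbered lines or lines ending in a question mark) is an assumption made for illustration, not the rule used in the notebook, and the function names are hypothetical.

```python
import re

# Assumed heuristic: a question is a numbered line ("1. ...", "2) ...")
# or any line that ends with a question mark.
QUESTION_RE = re.compile(r"^\s*(?:\d+[.)]\s+.+|.+\?)\s*$")


def extract_text(pdf_path):
    """Return the plain text of every page, joined with newlines."""
    # Imported lazily so the helper below works without PyMuPDF installed.
    import pymupdf  # older releases expose the same API as `import fitz`

    with pymupdf.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def find_questions(text):
    """Return the lines of `text` that look like questionnaire items."""
    return [line.strip() for line in text.splitlines() if QUESTION_RE.match(line)]
```

Typical use would be `find_questions(extract_text("questionnaire.pdf"))`, where the file name is a placeholder.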




# Task 1B: Pdf to image + OCR on the image.

[Task 1B: pdf to Image](./task1b_PDFtoImage.ipynb)

The best way to turn a PDF into an image is not as easy to determine. The quickest option is the library pypdfium2; however, I think PyMuPDF is worth considering because it can easily denoise the image and convert it to black and white. These two extra features, together with the fact that it is the best text extractor, make it the best option in my opinion. 
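
A minimal sketch of rendering one page to a grayscale PNG with PyMuPDF, assuming the package is installed. It covers only the rendering and grayscale conversion; denoising and strict black-and-white thresholding are not shown. The function names, default DPI and output path are illustrative choices.

```python
def dpi_to_zoom(dpi):
    """PDF user space is 72 points per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page_gray(pdf_path, page_number=0, dpi=300, out_path="page.png"):
    """Render a single page as a grayscale image and save it to `out_path`."""
    # Imported lazily so dpi_to_zoom can be used without PyMuPDF installed.
    import pymupdf  # older releases expose the same API as `import fitz`

    zoom = dpi_to_zoom(dpi)
    with pymupdf.open(pdf_path) as doc:
        page = doc[page_number]
        pix = page.get_pixmap(
            matrix=pymupdf.Matrix(zoom, zoom),  # scale both axes equally
            colorspace=pymupdf.csGRAY,          # grayscale instead of RGB
        )
        pix.save(out_path)
    return out_path
```

A higher DPI generally helps OCR but makes rendering slower and the image larger, so the value is worth tuning on real documents.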

[Task 1C: OCR on the Image](./task1c_OCRonImage.ipynb)


After careful review, all the open-source models can be ruled out. Those that perform well, like EasyOCR, are extremely slow, which would make a working product impossible. EasyOCR with a good GPU would be acceptable, but it still has trouble detecting double-lined cells in tables. 
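
The speed comparison can be reproduced with a small timing helper. This is a sketch, assuming EasyOCR is installed and that English is the document language; the helper names are hypothetical and the notebook may measure time differently.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_easyocr(image_path, use_gpu=False):
    """Run EasyOCR on one image; returns a list of (bbox, text, confidence)."""
    # Imported lazily so `timed` can be used without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(["en"], gpu=use_gpu)  # model loading happens here
    return reader.readtext(image_path)
```

For example, `results, seconds = timed(run_easyocr, "page.png")` times one page, where `page.png` is a placeholder. The measurement includes model loading, so a fair per-page figure would reuse a single `Reader` across pages.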

[Task 1C: Paid OCR Services](./task1c_PaidOCR.ipynb)

In terms of paid OCR, the best options I have found are Google Gemini 2.5 on Google Vertex AI and Meta Llama 4 Maverick. Maverick does not return coordinates, but it is by far the best at detecting layout correctly. I might develop two prototypes: one with Maverick and pytesseract, and the other with Google Vertex AI only. 
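
Since Maverick gives no coordinates, pytesseract could supply them in the first prototype. The sketch below assumes pytesseract, Pillow and the Tesseract binary are installed; the function names and the confidence threshold are illustrative, and how the boxes would be matched to Maverick's output is left open.

```python
def words_with_boxes(data, min_conf=0.0):
    """Turn the dict returned by pytesseract.image_to_data into word boxes.

    Empty entries and entries below `min_conf` are dropped
    (Tesseract reports a confidence of -1 for non-text blocks).
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str or number
        if not text.strip() or conf < min_conf:
            continue
        words.append({
            "text": text,
            "left": data["left"][i],
            "top": data["top"][i],
            "width": data["width"][i],
            "height": data["height"][i],
            "conf": conf,
        })
    return words


def ocr_words(image_path, min_conf=0.0):
    """Run Tesseract on an image and return its words with pixel coordinates."""
    # Imported lazily so words_with_boxes can be tested without Tesseract.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return words_with_boxes(data, min_conf=min_conf)
```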

TODO: add them to the price comparison. 


# Task 2: Embedding for a similarity search using AI

### Embedding 
Embedding means mapping pieces of text to multidimensional vectors, so that texts with similar meaning end up close together and searches on the text of the documents can be done by comparing vectors.
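
To make the idea concrete, here is a minimal sketch of a similarity search with cosine similarity, using only the standard library. The three-dimensional vectors are made up for illustration; real embeddings would come from a model (for example SentenceTransformers) and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy vectors, invented for this example only.
    corpus = {
        "What is your age?": [0.9, 0.1, 0.0],
        "What is your weight?": [0.1, 0.9, 0.1],
        "Do you smoke?": [0.0, 0.2, 0.9],
    }
    query = [0.8, 0.2, 0.1]  # stands in for the embedding of "How old are you?"
    for label, score in top_k(query, corpus):
        print(f"{score:.3f}  {label}")
```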

[Task 2a: Denoising text with an LLM](./task2a_denoisingLLM.ipynb)

[Task 2b: Embedding](./task2b_Embedding.ipynb)


# Task 3: Get the questionnaire from Milo's API

[Task 3 : API retrieval](./task3:getQuestionnaireMilo.ipynb)

# References & Sources

[1] Deepseek Team, Deepseek: Quick doubts, definitions, and explanations. [Online]. Available: https://deepseek.com

[2] GeeksforGeeks Team, “Extract text from PDF file using Python,” GeeksforGeeks. [Online]. Available: https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/

[3] J. Singer-Vine, pdfplumber Documentation. [Online]. Available: https://github.com/jsvine/pdfplumber

[4] B. Rogojan, “How to automate PDF data extraction: 3 different methods to parse PDFs for analytics,” Seattle Data Guy. [Online]. Available: https://www.theseattledataguy.com/how-to-automate-pdf-data-extraction-3-different-methods-to-parse-pdfs-for-analytics/#page-content

[5] Python Software Foundation, “time — Time access and conversions,” Python Documentation. [Online]. Available: https://docs.python.org/3/library/time.html

[6] Google, Tesseract OCR. [Online]. Available: https://github.com/tesseract-ocr/tesseract

[7] S. Hoffstaetter, pytesseract GitHub Repository. [Online]. Available: https://github.com/madmaze/pytesseract

[8] J. Jerphanion, pdf2image GitHub Repository. [Online]. Available: https://github.com/Belval/pdf2image

[9] S. Dufour, PyPDF2 GitHub Repository. [Online]. Available: https://github.com/sdpython/PyPDF2

[10] PyMuPDF Team, PyMuPDF GitHub Repository. [Online]. Available: https://github.com/pymupdf/PyMuPDF

[11] J. J. Vens, pdfminer.six GitHub Repository. [Online]. Available: https://github.com/pdfminer/pdfminer.six

[12] E. Berger, Scalene GitHub Repository. [Online]. Available: https://github.com/plasma-umass/scalene?tab=readme-ov-file

[13] E. Berger, “Scalene: A high-performance, high-precision CPU, GPU, and memory profiler for Python,” YouTube. [Online]. Available: https://www.youtube.com/watch?v=5iEf-_7mM1k

[14] Z. Shen, R. Zhang, M. Dell, B. C. G. Lee, J. Carlson, and W. Li, “LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis,” arXiv preprint arXiv:2103.15348, 2021. [Online]. Available: https://github.com/Layout-Parser/layout-parser/tree/main

[15] LayoutParser Team, “EfficientDet Model for PubLayNet,” Hugging Face. [Online]. Available: https://huggingface.co/layoutparser/efficientdet/tree/main/PubLayNet/tf_efficientdet_d1

[16] PyPDFium2 Team, PyPDFium2 GitHub Repository. [Online]. Available: https://github.com/pypdfium2-team

[17] E. McConville, Wand GitHub Repository. [Online]. Available: https://github.com/emcconville/wand?tab=readme-ov-file

[18] Wand Documentation, “Install Wand on Debian,” Wand. [Online]. Available: https://docs.wand-py.org/en/latest/guide/install.html#install-wand-debian

[19] J. Alankrita, pdftotext GitHub Repository. [Online]. Available: https://github.com/jalan/pdftotext

[20] JaidedAI Team, EasyOCR GitHub Repository. [Online]. Available: https://github.com/JaidedAI/EasyOCR

[21] PaddlePaddle Team, PaddleOCR GitHub Repository. [Online]. Available: https://github.com/PaddlePaddle/PaddleOCR

[22] UKPLab Team, SentenceTransformers GitHub Repository. [Online]. Available: https://github.com/UKPLab/sentence-transformers

[23] TensorFlow Team, “Semantic Similarity with TensorFlow Hub Universal Encoder,” GitHub Repository. [Online]. Available: https://github.com/tensorflow/docs/blob/master/site/en/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder.ipynb

[24] Explosion Team, spaCy GitHub Repository. [Online]. Available: https://github.com/explosion/spaCy

[25] Hugging Face, Qwen2.5-VL-72B-Instruct Model Card. [Online]. Available: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct?inference_provider=nebius

[26] Meta, Llama-3.2-11B-Vision-Instruct Model Card. [Online]. Available: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct?inference_provider=hf-inference

[27] Groq, Llama-4-Scout-17B-16E-Instruct Model Documentation. [Online]. Available: https://console.groq.com/docs/model/llama-4-scout-17b-16e-instruct

[28] Groq, Llama-4-Maverick-17B-128E-Instruct Model Documentation. [Online]. Available: https://console.groq.com/docs/model/llama-4-maverick-17b-128e-instruct