This project, developed as part of an AI Hackathon, focuses on Optical Character Recognition (OCR) and data extraction from PDF documents. It's a collaborative effort by @pabloortega, @hugo, and other contributors.
- 📄 OCR processing of PDFs.
- 📊 Data extraction and analysis.
- 📈 CSV format creation from extracted data.
This project employs advanced Retrieval Augmented Generation (RAG) techniques to enhance its OCR and data extraction capabilities:
- In-Context Learning: Utilizes historical data and contextual information to improve the accuracy of data extraction.
- Similarity Search through Cosine Similarity: Employs cosine similarity measures within the FAISS vector database for efficient and accurate document retrieval.
- Chain of Thought Reasoning: This approach is used to break down complex data extraction tasks into simpler steps, enhancing the overall understanding and accuracy.
- Regex (Regular Expressions): Regular expressions are used for pattern matching and data validation in the OCR process. @vtwoptwo from the IE Robotics & AI Club also conducted a workshop on the topic of Regex. You can check out the video here
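The retrieval and validation steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the project uses FAISS for the similarity search, whereas this sketch computes cosine similarity by hand to show what is being ranked. The chunk texts, the three-dimensional embeddings, the `top_k` and `extract_percent` helpers, and the percentage pattern are all hypothetical.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=1):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical validation pattern: a percentage such as "70%" or "62.5 %".
PERCENT_RE = re.compile(r"\b(\d{1,3}(?:\.\d+)?)\s?%")


def extract_percent(text):
    """Return the first percentage in the text as a float, or None."""
    match = PERCENT_RE.search(text)
    return float(match.group(1)) if match else None


# Toy data: real embeddings would come from an embedding model.
chunks = [
    "The barrier level is set at 70% of the initial price.",
    "The issuer is incorporated in Spain.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question

best = top_k(query_vec, chunk_vecs, k=1)[0]
value = extract_percent(chunks[best])
print(chunks[best], value)
```

With FAISS, cosine similarity is commonly obtained by L2-normalizing the vectors and searching an inner-product index; whether this project does exactly that is an assumption.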
✨ Possible Improvements (Check the Issues)
Improving the efficiency and performance of our application is a continual process. Here are some potential enhancements that could be implemented in future versions:
- Optimized Temporary Database Loading: Implement a strategy to load only one temporary database per document instead of creating separate instances for each column. This change aims to reduce memory usage and increase processing speed.
- Enhanced Multiprocessing: Introduce multiprocessing at two levels: for each document and for each column within those documents. By creating child processes at both levels, we can significantly speed up data processing and handling.
- Real-Time CSV Saving: Modify the data handling mechanism to allow for the saving of CSV files while documents are still being loaded. This improvement could lead to more efficient memory usage and faster overall data processing times.
- Retry Decorators with Feedback to the Model: Implement retry decorators that not only handle exceptions but also provide feedback to the model for continuous improvement. This approach aims to enhance the robustness of the application by allowing it to learn from operational challenges and adapt accordingly.
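As one possible shape for the retry idea, the sketch below retries a failing extraction and passes the previous error message back into the next prompt. Everything here is an assumption for illustration: the `retry_with_feedback` decorator, the `feedback` keyword convention, the `fake_model` stand-in for a real model call, and the `extract_barrier` function are hypothetical and not part of the current codebase.

```python
import functools


def retry_with_feedback(max_attempts=3, exceptions=(ValueError,)):
    """Retry the wrapped function, handing it the previous error as feedback.

    The wrapped function must accept a `feedback` keyword argument
    (a convention invented for this sketch).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            feedback = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, feedback=feedback, **kwargs)
                except exceptions as exc:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    feedback = f"Attempt {attempt} failed: {exc}"
        return wrapper
    return decorator


prompts_seen = []


def fake_model(prompt):
    """Stand-in for a model call: answers numerically only after feedback."""
    prompts_seen.append(prompt)
    return "70" if "failed" in prompt else "seventy"


@retry_with_feedback(max_attempts=3)
def extract_barrier(text, feedback=None):
    prompt = f"Extract the barrier as a number from: {text}"
    if feedback:
        prompt += f"\nPrevious error: {feedback}"
    answer = fake_model(prompt)
    return float(answer)  # raises ValueError when the answer is not numeric


result = extract_barrier("Barrier at seventy percent")
print(result, len(prompts_seen))
```

Putting the feedback in the prompt keeps the decorator generic: it knows nothing about the model, only that the wrapped function can make use of an error message on the next attempt.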
To set up the project, follow these steps:
- Clone the repository:
git clone https://github.com/vtwoptwo/ai-hackathon.git
- Navigate to the project directory:
cd ai-hackathon
Build the project (all commands from the root of the repository):
python3 -m build
Don't forget to create a .env file containing the variables outlined in the env/example file.
Create a virtual environment and install the dependencies:
pip install -e .
There are optional arguments for the main.py file:
python3 src/analysis/main.py
You can also use the Makefile to run the code:
make build
make run
src/analysis
├── Makefile
├── column_processors
│ ├── __init__.py
│ ├── barrier.py
│ ├── cap.py
│ ├── ...
│ └── underlyings.py
└── main.py
Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.
MIT License - see the LICENSE file for details. Due to the NDA we signed, we are not able to share the PDF files.