AI Hackathon PDF OCR and Data Extraction 📑🔍

🌟 Project Overview

This project, developed as part of an AI Hackathon, focuses on Optical Character Recognition (OCR) and data extraction from PDF documents. It's a collaborative effort by @pabloortega, @hugo, and other contributors.

Key Features:

📄 OCR processing of PDFs.
📊 Data extraction and analysis.
📈 CSV format creation from extracted data.

💡 RAG Techniques in the Project

This project employs advanced Retrieval Augmented Generation (RAG) techniques to enhance its OCR and data extraction capabilities:

In-Context Learning: Utilizes historical data and contextual information to improve the accuracy of data extraction.
Similarity Search through Cosine Similarity: Employs cosine similarity measures within the FAISS vector database for efficient and accurate document retrieval.
Chain of Thought Reasoning: Breaks complex data extraction tasks into simpler steps, improving the model's understanding and accuracy.
Regex (Regular Expressions): Regular expressions are used for pattern matching and data validation in the OCR process. @vtwoptwo from the IE Robotics & AI Club also ran a workshop on regex, and a video recording of it is available.
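To make the similarity search idea concrete, here is a minimal sketch of cosine-similarity retrieval in plain Python. It is not the project's code and does not use FAISS; with FAISS, cosine similarity is commonly obtained by L2-normalizing the vectors and searching an inner-product index. The function names (cosine_similarity, top_k), the three-dimensional vectors, and the chunk texts are all hypothetical toy values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Rank (text, vector) pairs by similarity to the query and keep the best k."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumed values; real embeddings have hundreds of dimensions).
chunks = [
    ("Barrier level: 60%", [0.9, 0.1, 0.0]),
    ("Cap: 120%", [0.1, 0.9, 0.0]),
    ("Underlyings: ...", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]  # pretend this is the embedded question about the barrier

results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The retrieved chunks would then be placed in the prompt as context for the extraction step. A vector database such as FAISS does the same ranking, but with indexing structures that scale to many more vectors.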

✨ Possible Improvements (Check the Issues)

Improving the efficiency and performance of our application is a continual process. Here are some potential enhancements that could be implemented in future versions:

Optimized Temporary Database Loading: Implement a strategy to load only one temporary database per document instead of creating separate instances for each column. This change aims to reduce memory usage and increase processing speed.
Enhanced Multiprocessing: Introduce multiprocessing at two levels: for each document and for each column within those documents. By creating child processes at both levels, we can significantly speed up data processing and handling.
Real-Time CSV Saving: Modify the data handling mechanism to allow for the saving of CSV files while documents are still being loaded. This improvement could lead to more efficient memory usage and faster overall data processing times.
Retry Decorators with Feedback to the Model: Implement retry decorators that not only handle exceptions but also provide feedback to the model for continuous improvement. This approach aims to enhance the robustness of the application by allowing it to learn from operational challenges and adapt accordingly.
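As an illustration of the last item, the sketch below shows one possible shape for a retry decorator that feeds earlier errors back to the model. It rests on assumptions: "feedback" is interpreted as appending previous error messages to the next prompt, and retry_with_feedback, extract_cap, and fake_model are hypothetical names, with fake_model standing in for a real model call.

```python
import functools


def retry_with_feedback(max_attempts=3):
    """Retry a function, passing earlier error messages in as `feedback`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            feedback = []
            for attempt in range(1, max_attempts + 1):
                try:
                    # Hand the function a copy of all errors seen so far.
                    return func(*args, feedback=list(feedback), **kwargs)
                except ValueError as exc:
                    feedback.append(f"Attempt {attempt} failed: {exc}")
            raise RuntimeError(f"Gave up after {max_attempts} attempts: {feedback}")
        return wrapper
    return decorator


def fake_model(prompt):
    # Hypothetical stand-in for an LLM call: it only answers with a
    # parseable number once the prompt contains error feedback.
    return "120" if "Previous errors" in prompt else "one hundred twenty"


@retry_with_feedback(max_attempts=3)
def extract_cap(text, feedback):
    prompt = f"Extract the cap as a number from: {text}"
    if feedback:
        prompt += "\nPrevious errors:\n" + "\n".join(feedback)
    answer = fake_model(prompt)
    return float(answer)  # raises ValueError when the answer is not numeric


print(extract_cap("Cap: 120%"))
```

Here the first attempt fails to parse, the error is added to the prompt, and the second attempt succeeds. Catching only ValueError keeps unrelated failures (for example, network errors) visible instead of silently retrying them.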

🛠 Tech Stack

🛠 Installation

To set up the project, follow these steps:

Clone the repository:

git clone https://github.com/vtwoptwo/ai-hackathon.git

Navigate to the project directory:

cd ai-hackathon

Setup

Build the project (all commands from the root of the repository):

python3 -m build

Don't forget to create a .env file containing the variables outlined in the env/example file.

Create a virtual environment and install the dependencies:

pip install -e .

🚀 Usage

Running the Code

Run the main.py file directly (it accepts optional arguments):

python3 src/analysis/main.py

You can also use the Makefile to run the code:

    make build
    make run

📁 Project Structure

src/analysis
├── Makefile
├── column_processors
│   ├── __init__.py
│   ├── barrier.py
│   ├── cap.py
│   ├── ...
│   └── underlyings.py
└── main.py

👥 Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.

📄 License

MIT License - see the LICENSE file for details. Due to the NDA we signed, we are not able to share the PDF files.
