
vtwoptwo/ai-hackathon


AI Hackathon PDF OCR and Data Extraction 📑🔍


🌟 Project Overview

This project, developed as part of an AI Hackathon, focuses on Optical Character Recognition (OCR) and data extraction from PDF documents. It's a collaborative effort by @pabloortega, @hugo, and other contributors.

Key Features:

  • 📄 OCR processing of PDFs.
  • 📊 Data extraction and analysis.
  • 📈 CSV format creation from extracted data.

💡 RAG Techniques in the Project

This project combines Retrieval-Augmented Generation (RAG) with several supporting techniques to improve its OCR and data extraction:

  • In-Context Learning: Utilizes historical data and contextual information to improve the accuracy of data extraction.
  • Similarity Search through Cosine Similarity: Employs cosine similarity measures within the FAISS vector database for efficient and accurate document retrieval.
  • Chain of Thought Reasoning: This approach is used to break down complex data extraction tasks into simpler steps, enhancing the overall understanding and accuracy.
  • Regex (Regular Expressions): Regular expressions are used for pattern matching and data validation in the OCR process. @vtwoptwo from the IE Robotics & AI Club also conducted a workshop on the topic of Regex. You can check out the video here
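As a rough illustration of how similarity search and regex validation can work together, here is a minimal sketch in plain Python. It is not the project's code: the toy three-dimensional "embeddings", the sample chunks, the `BARRIER_RE` pattern, and the function names are all assumptions made for this example. In the project itself, FAISS performs the vector search over real embeddings.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_match(query_vec, indexed):
    """Return (text, score) for the stored chunk most similar to the query."""
    scored = ((text, cosine_similarity(query_vec, vec)) for text, vec in indexed)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical pattern: a percentage that follows the word "Barrier".
BARRIER_RE = re.compile(
    r"Barrier\s*(?:level)?\s*[:=]?\s*(\d{1,3}(?:\.\d+)?)\s*%", re.IGNORECASE
)


def extract_barrier(text):
    """Return the barrier percentage as a float, or None if no match."""
    match = BARRIER_RE.search(text)
    return float(match.group(1)) if match else None


# Toy data: each chunk is paired with a made-up embedding vector.
INDEXED_CHUNKS = [
    ("Barrier level: 60% of the initial fixing", [0.9, 0.1, 0.0]),
    ("The cap is set at 120%", [0.1, 0.9, 0.0]),
    ("Underlyings: three equity indices", [0.0, 0.2, 0.9]),
]

query = [1.0, 0.0, 0.0]  # stands in for the embedding of "what is the barrier?"
best_text, score = top_match(query, INDEXED_CHUNKS)
barrier = extract_barrier(best_text)
print(best_text, round(score, 3), barrier)
```

Retrieval narrows the document down to the most relevant chunk, and the regex then acts as a cheap check that the extracted value has the expected shape.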

✨ Possible Improvements (Check the Issues)

Improving the efficiency and performance of our application is a continual process. Here are some potential enhancements that could be implemented in future versions:

  • Optimized Temporary Database Loading: Implement a strategy to load only one temporary database per document instead of creating separate instances for each column. This change aims to reduce memory usage and increase processing speed.

  • Enhanced Multiprocessing: Introduce multiprocessing at two levels: for each document and for each column within those documents. By creating child processes at both levels, we can significantly speed up data processing and handling.

  • Real-Time CSV Saving: Modify the data handling mechanism to allow for the saving of CSV files while documents are still being loaded. This improvement could lead to more efficient memory usage and faster overall data processing times.

  • Retry Decorators with Feedback to the Model: Implement retry decorators that not only handle exceptions but also provide feedback to the model for continuous improvement. This approach aims to enhance the robustness of the application by allowing it to learn from operational challenges and adapt accordingly.
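The retry idea from the last item could look roughly like the sketch below. Everything here is hypothetical: `retry_with_feedback`, `fake_model`, and `extract_barrier` are illustrative names, and `fake_model` is a stand-in for a real LLM call. The point is only to show the error from a failed attempt being passed into the next prompt.

```python
import functools


def retry_with_feedback(max_attempts=3):
    """Retry a function, handing the previous error back to it as `feedback`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            feedback = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, feedback=feedback, **kwargs)
                except ValueError as exc:
                    # Keep the error text so the next prompt can mention it.
                    feedback = f"Attempt {attempt} failed: {exc}"
            raise RuntimeError(
                f"gave up after {max_attempts} attempts; last feedback: {feedback}"
            )

        return wrapper

    return decorator


def fake_model(prompt):
    """Stand-in for an LLM call: answers with digits only after being corrected."""
    return "60" if "failed" in prompt else "sixty percent"


@retry_with_feedback(max_attempts=3)
def extract_barrier(text, feedback=None):
    prompt = f"Extract the barrier as a number from: {text}"
    if feedback:
        prompt += f"\nPrevious problem: {feedback}. Reply with digits only."
    answer = fake_model(prompt)
    return float(answer)  # raises ValueError when the answer is not numeric


result = extract_barrier("Barrier level: 60%")
print(result)
```

Catching only `ValueError` keeps the retry loop limited to bad model output; other exceptions still surface immediately.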

🛠 Tech Stack

LangChain, FAISS, Python

🛠 Installation

To set up the project, follow these steps:

  1. Clone the repository:
    git clone https://github.com/vtwoptwo/ai-hackathon.git

  2. Navigate to the project directory:
    cd ai-hackathon

Setup

Build the project (all commands from the root of the repository):

python3 -m build

Don't forget to create a .env file with the variables outlined in the env/example file.
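To check that a .env file covers everything listed in the example file, a small stdlib-only helper along these lines may be useful. It is a sketch, not part of the repository: the function names and the variable names in the sample text (`API_KEY`, `MODEL_NAME`) are made up for illustration, and the real names are whatever env/example lists.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(example_text, env_text):
    """Return keys named in the example file that are absent or empty in .env."""
    wanted = parse_env(example_text)
    actual = parse_env(env_text)
    return [key for key in wanted if not actual.get(key)]


# Hypothetical contents; in practice, read env/example and .env from disk.
EXAMPLE_TEXT = """
# variables the application expects
API_KEY=
MODEL_NAME=
"""
ENV_TEXT = 'API_KEY="abc123"\n'

missing = missing_keys(EXAMPLE_TEXT, ENV_TEXT)
print(missing)
```

Running a check like this before starting the pipeline turns a missing variable into a clear message instead of a failure partway through processing.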

Inside a virtual environment, install the project and its dependencies in editable mode:

pip install -e .

🚀 Usage

Running the Code

Run the main script from the root of the repository (main.py also accepts optional arguments):

python3 src/analysis/main.py

You can also use the Makefile to run the code:

    make build
    make run

📁 Project Structure

src/analysis
├── Makefile
├── column_processors
│   ├── __init__.py
│   ├── barrier.py
│   ├── cap.py
│   ├── ...
│   └── underlyings.py
└── main.py

👥 Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.

📄 License

MIT License - see the LICENSE file for details. Due to the NDA we signed, we are not able to share the PDF files.

