Flock is a configurable machine learning (ML) pipeline for building Large Language Models (LLMs) for domain-specific tasks. It supports popular LLM architectures such as wizardlm, bloom, falcon, and llama. The project also includes a deep document mining system that extracts data from both text and images.
- Configurable ML pipeline for domain-specific Large Language Models (LLMs).
- Supports multiple LLM architectures: wizardlm, bloom, falcon, and llama.
- Deep document mining system for data extraction from text and images.
- Developed using Python, PDFMiner, LangChain, and Streamlit.
- Clone the repository:
git clone https://github.com/yourusername/flock.git
cd flock
- Install the required dependencies:
pip install -r requirements.txt
- Run the Flock application:
python app.py
- Choose an LLM architecture: wizardlm, bloom, falcon, or llama.
- Configure the pipeline settings according to your domain-specific task.
- Prepare your text and image data for training and evaluation.
- Run the pipeline using the provided scripts.
- Evaluate the trained LLM and fine-tune as necessary.
- Set up the project repository with a basic directory structure.
- Create a virtual environment and install necessary dependencies.
- Implement data collection mechanisms for text and image data.
- Preprocess and clean the collected data for further processing.
- Integrate support for wizardlm architecture.
- Integrate support for bloom architecture.
- Integrate support for falcon architecture.
- Integrate support for llama architecture.
- Implement a data extraction system for text documents.
- Implement a data extraction system for image documents.
- Develop mechanisms to combine text and image data for comprehensive analysis.
- Create a configuration interface for setting pipeline parameters.
- Develop the ML pipeline to train and evaluate LLMs based on selected architectures.
- Implement mechanisms for fine-tuning LLMs using domain-specific data.
- Build a user-friendly interface using Streamlit for interacting with the pipeline.
- Implement visualization tools to display training progress and evaluation metrics.
- Test the pipeline with sample domain-specific tasks and datasets.
- Optimize the pipeline for performance and efficiency.
- Identify and resolve any bugs or issues.
- Write comprehensive documentation for setting up, using, and extending the pipeline.
- Prepare the repository for deployment, including proper version control and packaging.
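As a rough illustration of the "choose an architecture" and "configure the pipeline settings" steps, the sketch below shows one way such settings might be validated before a run. It is not Flock's actual configuration API: `PipelineConfig`, `load_config`, and every field name (`text_paths`, `image_paths`, `epochs`, `learning_rate`) are hypothetical. Only the four architecture names come from this README.

```python
from dataclasses import dataclass, field, fields

# Architecture names are taken from this README; everything else is assumed.
SUPPORTED_ARCHITECTURES = ("wizardlm", "bloom", "falcon", "llama")


@dataclass
class PipelineConfig:
    """Hypothetical settings container; field names are illustrative only."""

    architecture: str
    text_paths: list = field(default_factory=list)   # assumed: text inputs
    image_paths: list = field(default_factory=list)  # assumed: image inputs
    epochs: int = 1                                  # assumed training setting
    learning_rate: float = 2e-5                      # assumed training setting

    def __post_init__(self):
        # Normalize the name so " Falcon " and "falcon" are treated alike.
        self.architecture = self.architecture.strip().lower()
        if self.architecture not in SUPPORTED_ARCHITECTURES:
            raise ValueError(
                f"Unsupported architecture {self.architecture!r}; "
                f"expected one of {SUPPORTED_ARCHITECTURES}"
            )
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


def load_config(settings: dict) -> PipelineConfig:
    """Build a config from a plain dict, rejecting unknown keys early."""
    allowed = {f.name for f in fields(PipelineConfig)}
    unknown = set(settings) - allowed
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return PipelineConfig(**settings)
```

Calling `load_config({"architecture": "falcon", "epochs": 3})` returns a validated config object, while an unsupported architecture name or a misspelled key raises `ValueError` before any training starts. Failing early like this is a design choice for the sketch; the real pipeline may handle its settings differently.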
Contributions are welcome! If you'd like to contribute to Flock, please follow the guidelines in the CONTRIBUTING.md file.
This project is licensed under the MIT License.