This repository lets you evaluate machine learning models. Here’s what you can do:
✨ Capabilities:
- 🖥️ Local Models – Evaluate models running on your machine.
- 🌐 Public Models – Test models available online.
- 🏗️ Production-Ready – Run evaluations in a containerized, pytest-friendly workflow, making it easy to integrate into CI/CD pipelines.
- 🐳 Docker – Required for running evaluations locally in isolated containers.
- 💻 VS Code (Optional) – Use VS Code tasks to simplify commands and workflow.
- 🔗 GitHub – Skip local setup entirely; run evaluations automatically on GitHub Actions (see the workflow sketch after this list).
- 🧪 `pytest` (on 🐍 Python, obviously)
- 🐳 Docker – just setting the environment variables in `docker-compose.yaml` based on the existing example should be enough.
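If you would rather run the evaluation on GitHub Actions than locally, a workflow along the following lines should work. This is a minimal sketch, not the repository's actual workflow file: the path `.github/workflows/evaluate.yml`, the workflow name, and the `MISTRAL_API_KEY` secret name are assumptions based on the compose file below.

```yaml
# .github/workflows/evaluate.yml -- hypothetical path and name
name: Run evaluations

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumption: MISTRAL_API_KEY is stored as a repository secret and passed
      # through to docker-compose.yaml, which expects it as an environment variable.
      - name: Run the evaluator
        env:
          MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
        run: |
          docker compose -f nutrition_information_extraction/docker-compose.yaml \
            --project-directory nutrition_information_extraction \
            up --build --abort-on-container-exit --exit-code-from evaluator
```

Note that the compose file below reserves an NVIDIA GPU; on a standard GitHub-hosted runner you would likely need a self-hosted GPU runner or a CPU-only model for this to run as-is.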
To run the example, I use a GPU with 6 GB of memory. It is pretty old, so almost anyone with a graphics adapter should be able to run this locally.
Consider the following `docker-compose.yaml`:
```yaml
services:
  llm:
    image: sinanozel/ollama.0.12.2:llava-7b
    ports:
      - "11434:11434"
    networks:
      - nutrition-information-extraction-evaluation
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              count: all
  evaluator:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - OLLAMA_URL=http://llm:11434
      - OLLAMA_MODELS=ollama/llava:7b
      - MISTRAL_API_KEY=${MISTRAL_API_KEY}
    depends_on:
      - llm
    networks:
      - nutrition-information-extraction-evaluation
    tty: true

networks:
  nutrition-information-extraction-evaluation:
    driver: bridge
```
This runs with the command:
```bash
docker compose -f nutrition_information_extraction/docker-compose.yaml --project-directory nutrition_information_extraction up --build --abort-on-container-exit --exit-code-from evaluator
```
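The `--exit-code-from evaluator` flag makes `docker compose` return the evaluator container's exit status, so a failing test run fails the CI job.

The evaluator is the pytest side of the workflow. A minimal sketch of what an evaluation test that talks to the Ollama service could look like is below; the prompt, the assertion, and the helper name are illustrative assumptions, not the repository's actual test code.

```python
# Hypothetical example; the real tests live in the evaluator image.
import os

import requests

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")
MODEL = "llava:7b"  # assumption: the tag behind OLLAMA_MODELS=ollama/llava:7b


def ask_model(prompt: str) -> str:
    """Send a single non-streaming generate request to the Ollama API."""
    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]


def test_extracts_calories():
    # Illustrative binary check: does the answer contain the expected value?
    answer = ask_model(
        "Nutrition facts: 250 kcal per serving, 12 g fat, 3 g protein. "
        'Return the calories per serving as JSON, e.g. {"calories": 100}.'
    )
    assert "250" in answer
```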
TODO
- Non-binary outputs
- Text similarity
- LLM-as-a-judge