Argilla annotator #2687

Merged
merged 40 commits into from
May 16, 2024
Changes from 38 commits
61c2c5c
add argilla skeleton section in docs
strickvl May 9, 2024
1d7540a
update launch method as per other PRs
strickvl May 9, 2024
742bc5b
update ls annotator based on baseannotator
strickvl May 9, 2024
42d6bc4
naming changes (initial set)
strickvl May 9, 2024
a118197
add argilla to init and constants
strickvl May 9, 2024
b9e71bf
update init file
strickvl May 9, 2024
099c96e
formatting
strickvl May 9, 2024
069914b
add settings to flavor
strickvl May 9, 2024
18f339c
flavor updates
strickvl May 9, 2024
0e299f2
client methods
strickvl May 9, 2024
53c0ec4
port is optional
strickvl May 9, 2024
4d1c3b0
auth + secrets for client
strickvl May 9, 2024
c4db7f7
use secret if there, else use api key
strickvl May 9, 2024
4b7c3c4
add logging statement
strickvl May 9, 2024
c269e78
handle both api key and auth secret provided
strickvl May 9, 2024
d85422e
get datasets and dataset names
strickvl May 9, 2024
e9af42f
add methods for deleting and getting dataset
strickvl May 9, 2024
0cfd71e
continue to update annotator methods
strickvl May 9, 2024
55556ba
update get method
strickvl May 9, 2024
60c3bff
add todo to update
strickvl May 9, 2024
66a5dea
finish the rest of the methods
strickvl May 9, 2024
75612dc
final fixes
strickvl May 9, 2024
ccbc075
handle dataset deletions
strickvl May 9, 2024
abd1cdb
improve docstrings
strickvl May 9, 2024
a66a6d1
refactoring
strickvl May 9, 2024
7672fd9
add docs
strickvl May 9, 2024
41f7922
add docs tweaks and an image
strickvl May 9, 2024
e8771b2
docstrings
strickvl May 9, 2024
2d9fff2
mypy fixes
strickvl May 9, 2024
7db9b9b
Merge branch 'develop' into feature/argilla-annotator
strickvl May 9, 2024
a00c332
Optimised images with calibre/image-actions
github-actions[bot] May 9, 2024
1d7a3b9
Merge remote-tracking branch 'origin/develop' into feature/argilla-an…
strickvl May 13, 2024
b66e197
Address review suggestions
strickvl May 13, 2024
e0fa83b
add to mocked_libs
strickvl May 13, 2024
eca620b
darglint fixes
strickvl May 13, 2024
beaef9d
Refactor launch method in BaseAnnotator class
strickvl May 13, 2024
6fcccef
Update src/zenml/integrations/argilla/flavors/argilla_annotator_flavo…
strickvl May 14, 2024
dfc4f0b
Merge branch 'develop' into feature/argilla-annotator
strickvl May 14, 2024
8e54d58
Merge remote-tracking branch 'origin/develop' into feature/argilla-an…
strickvl May 14, 2024
c0b273e
add validator
strickvl May 14, 2024
Binary file added docs/book/.gitbook/assets/argilla_annotator.png
Original file line number Diff line number Diff line change
@@ -59,6 +59,7 @@ ZenML features an integration with `label_studio`.

| Annotator | Flavor | Integration | Notes |
|-----------------------------------------|----------------|----------------|----------------------------------------------------------------------|
| [ArgillaAnnotator](argilla.md) | `argilla` | `argilla` | Connect ZenML with Argilla |
| [LabelStudioAnnotator](label-studio.md) | `label_studio` | `label_studio` | Connect ZenML with Label Studio |
| [Custom Implementation](custom.md) | _custom_ | | Extend the annotator abstraction and provide your own implementation |

145 changes: 145 additions & 0 deletions docs/book/stacks-and-components/component-guide/annotators/argilla.md
@@ -0,0 +1,145 @@
---
description: Annotating data using Argilla.
---

# Argilla

[Argilla](https://github.com/argilla-io/argilla) is an open-source data curation
platform designed to enhance the development of both small and large language
models (LLMs) and NLP tasks in general. It enables users to build robust
language models through faster data curation using both human and machine
feedback, providing support for each step in the MLOps cycle, from data labeling
to model monitoring.

![Argilla Annotator](../../../.gitbook/assets/argilla_annotator.png)

Argilla distinguishes itself through its focus on specific use cases and
human-in-the-loop approaches. While it does offer programmatic features,
Argilla's core value lies in actively involving human experts in the
tool-building process, which sets it apart from its competitors.

### When would you want to use it?

If you need to label textual data as part of your ML workflow, consider adding
the Argilla annotator stack component to your ZenML stack.

We currently support the use of annotation at the various stages described in
[the main annotators docs page](annotators.md). The Argilla integration
supports annotation using either a local (Docker-backed) instance of Argilla or
a deployed instance. One easy way to deploy Argilla is as a [Hugging Face
Space](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla), which is
documented in the [Argilla
documentation](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html).

### How to deploy it?

The Argilla Annotator flavor is provided by the Argilla ZenML integration. You
need to install it to be able to register it as an Annotator and add it to your
stack:

```shell
zenml integration install argilla
```

You can either pass the `api_key` directly into the `zenml annotator register`
command or you can register it as a secret and pass the secret name into the
command. We recommend the latter approach for security reasons. To do so, first
register a secret that holds your Argilla API key, then pass the name of that
secret into the annotator as the `--authentication_secret`. For example, you'd
run:

```shell
zenml secret create argilla_secrets --api_key="<your_argilla_api_key>"
```

(Visit the Argilla documentation and interface to obtain your API key.)

Then register your annotator with ZenML:

```shell
zenml annotator register argilla --flavor argilla --authentication_secret=argilla_secrets
```

When using a deployed instance of Argilla, the instance URL must be specified
without any trailing `/` at the end. If you are using a Hugging Face Spaces
instance and its visibility is set to private, you must also set the
`extra_headers` parameter to include a Hugging Face token. For example:

```shell
zenml annotator register argilla --flavor argilla --authentication_secret=argilla_secrets --instance_url="https://[your-owner-name]-[your_space_name].hf.space" --extra_headers='{"Authorization": "Bearer <your_hugging_face_token>"}'
```
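The two constraints above (no trailing `/` on the instance URL, and an `extra_headers` value carrying a Hugging Face token for private Spaces) can be prepared in Python before calling the CLI. What follows is a minimal sketch and not part of the integration: the helper name `build_register_command` is hypothetical, and it assumes that `extra_headers` is passed as a JSON-encoded string, which the command above suggests but does not state.

```python
import json
import shlex


def build_register_command(instance_url: str, hf_token: str) -> str:
    """Build a `zenml annotator register` command line (hypothetical helper)."""
    # The instance URL must not end with a trailing slash.
    url = instance_url.rstrip("/")
    # Assumption: extra_headers is given as a JSON-encoded string.
    headers = json.dumps({"Authorization": f"Bearer {hf_token}"})
    args = [
        "zenml", "annotator", "register", "argilla",
        "--flavor", "argilla",
        "--authentication_secret=argilla_secrets",
        f"--instance_url={url}",
        f"--extra_headers={headers}",
    ]
    # shlex.join quotes each argument so the JSON survives the shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_register_command("https://owner-space.hf.space/", "hf_xxx"))
```

Letting `shlex.join` do the quoting avoids the nested-quote problems that arise when a JSON value is typed by hand inside a shell command.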

Finally, add all these components to a stack and set it as your active stack.
For example:

```shell
zenml stack copy default annotation
# this must be done separately so that the other required stack components are first registered
zenml stack update annotation -an <YOUR_ARGILLA_ANNOTATOR>
zenml stack set annotation
# optionally also
zenml stack describe
```

Now if you run a simple CLI command like `zenml annotator dataset list` this
should work without any errors. You're ready to use your annotator in your ML
workflow!

### How do you use it?

ZenML supports access to your data and annotations via the `zenml annotator ...`
CLI command. We have also implemented an interface to some of the common Argilla
functionality via the ZenML SDK.

You can access information about the datasets you're using with the `zenml
annotator dataset list` command. To work on annotation for a particular dataset,
you can run `zenml annotator dataset annotate <dataset_name>`. What follows is
an overview of some key components of the Argilla integration and how it can be
used.

#### Argilla Annotator Stack Component

Our Argilla annotator component inherits from the `BaseAnnotator` class. Some
core methods must be defined, such as registering or getting a dataset. Most
annotators also handle things like the storage of state and have their own
custom features, so there are quite a few extra methods specific to Argilla.

The core Argilla functionality that's currently enabled includes a way to
register your datasets, export any annotations for use in separate steps, and
start the annotator daemon process. (Argilla requires a server to be running
in order to use the web interface, and ZenML handles the connection to this
server using the details you passed in when registering the component.)

#### Argilla Annotator SDK

Visit [the SDK
docs](https://sdkdocs.zenml.io/latest/integration_code_docs/integrations-argilla/)
to learn more about the methods that ZenML exposes for the Argilla annotator. To
access the SDK through Python, you would first get the client object and then
call the methods you need. For example:

```python
from zenml.client import Client

client = Client()
annotator = client.active_stack.annotator

# list dataset names
dataset_names = annotator.get_dataset_names()

# get a specific dataset
dataset = annotator.get_dataset("dataset_name")

# get the annotations for a dataset
annotations = annotator.get_labeled_data(dataset_name="dataset_name")
```
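Building on the calls above, a small helper can gather the annotations for every dataset in one pass. This is only a sketch: `collect_annotations` and `FakeAnnotator` are hypothetical names introduced for illustration. Only `get_dataset_names` and `get_labeled_data` come from the example above, and the stub stands in for a real annotator so that the snippet runs without an Argilla server.

```python
from typing import Any, Dict, List


def collect_annotations(annotator: Any) -> Dict[str, Any]:
    """Map each dataset name to its labeled data."""
    return {
        name: annotator.get_labeled_data(dataset_name=name)
        for name in annotator.get_dataset_names()
    }


class FakeAnnotator:
    """Stand-in exposing the two methods used above (illustration only)."""

    def __init__(self, data: Dict[str, List[str]]) -> None:
        self._data = data

    def get_dataset_names(self) -> List[str]:
        return list(self._data)

    def get_labeled_data(self, dataset_name: str) -> List[str]:
        return self._data[dataset_name]


if __name__ == "__main__":
    fake = FakeAnnotator({"reviews": ["positive", "negative"]})
    print(collect_annotations(fake))
```

With a real stack, you would pass `Client().active_stack.annotator` in place of the stub.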

For more detailed information on how to use the Argilla annotator and the
functionality it provides, visit the [Argilla
documentation](https://docs.argilla.io/en/latest/).

<!-- For scarf -->
<figure><img alt="ZenML Scarf" referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" /></figure>
1 change: 1 addition & 0 deletions docs/book/toc.md
@@ -163,6 +163,7 @@
 * [Feast](stacks-and-components/component-guide/feature-stores/feast.md)
 * [Develop a Custom Feature Store](stacks-and-components/component-guide/feature-stores/custom.md)
 * [Annotators](stacks-and-components/component-guide/annotators/annotators.md)
+* [Argilla](stacks-and-components/component-guide/annotators/argilla.md)
 * [Label Studio](stacks-and-components/component-guide/annotators/label-studio.md)
 * [Develop a Custom Annotator](stacks-and-components/component-guide/annotators/custom.md)
 * [Image Builders](stacks-and-components/component-guide/image-builders/image-builders.md)
8 changes: 7 additions & 1 deletion docs/mocked_libs.json
@@ -207,5 +207,11 @@
 "whylogs.api.writer.whylabs",
 "whylogs.core",
 "whylogs.viz",
-"xgboost"
+"xgboost",
+"argilla",
+"argilla.client",
+"argilla.client.client",
+"argilla.client.sdk",
+"argilla.client.sdk.commons",
+"argilla.client.sdk.commons.errors"
 ]
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -459,6 +459,7 @@ module = [
 "matplotlib.*",
 "IPython.*",
 "huggingface_hub.*",
-"label_studio_sdk.*"
+"label_studio_sdk.*",
+"argilla.*"
 ]
 ignore_missing_imports = true
7 changes: 4 additions & 3 deletions src/zenml/annotators/base_annotator.py
@@ -14,7 +14,7 @@
 """Base class for ZenML annotator stack components."""

 from abc import ABC, abstractmethod
-from typing import Any, ClassVar, List, Optional, Tuple, Type, cast
+from typing import Any, ClassVar, List, Tuple, Type, cast

 from zenml.enums import StackComponentType
 from zenml.stack import Flavor, StackComponent
@@ -91,11 +91,12 @@ def get_dataset_stats(self, dataset_name: str) -> Tuple[int, int]:
         """

     @abstractmethod
-    def launch(self, url: Optional[str]) -> None:
+    def launch(self, **kwargs: Any) -> None:
         """Launches the annotation interface.

         Args:
-            url: The URL of the annotation interface.
+            **kwargs: Additional keyword arguments to pass to the
+                annotation client.
         """

     @abstractmethod
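The signature change above means that concrete annotators receive launch options as keyword arguments rather than as a single `url` parameter. A minimal sketch of how a subclass might consume them, using a stripped-down stand-in for `BaseAnnotator` (both classes here are hypothetical illustrations and not the ZenML implementation):

```python
from abc import ABC, abstractmethod
from typing import Any, List


class AnnotatorSketch(ABC):
    """Simplified stand-in for the abstract annotator base class."""

    @abstractmethod
    def launch(self, **kwargs: Any) -> None:
        """Launches the annotation interface."""


class UrlAnnotator(AnnotatorSketch):
    """Hypothetical annotator that records the URLs it was asked to open."""

    def __init__(self) -> None:
        self.opened: List[str] = []

    def launch(self, **kwargs: Any) -> None:
        # Callers that pass launch(url=...) keep working: the value
        # simply arrives inside kwargs.
        url = kwargs.get("url")
        if url:
            self.opened.append(url)


if __name__ == "__main__":
    annotator = UrlAnnotator()
    annotator.launch(url="http://localhost:6900")
    print(annotator.opened)
```

Accepting `**kwargs` lets each annotator flavor define its own launch options without changing the shared abstract signature.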
1 change: 0 additions & 1 deletion src/zenml/cli/annotator.py
@@ -162,7 +162,6 @@ def dataset_annotate(
f"Launching the annotation interface for dataset '{dataset_name}'."
)
try:
annotator.get_dataset(dataset_name=dataset_name)
annotator.launch(url=annotator.get_url_for_dataset(dataset_name))
except ValueError as e:
raise ValueError("Dataset does not exist.") from e
1 change: 1 addition & 0 deletions src/zenml/integrations/__init__.py
@@ -20,6 +20,7 @@
 import sys

 from zenml.integrations.airflow import AirflowIntegration # noqa
+from zenml.integrations.argilla import ArgillaIntegration # noqa
 from zenml.integrations.aws import AWSIntegration # noqa
 from zenml.integrations.azure import AzureIntegration # noqa
 from zenml.integrations.bentoml import BentoMLIntegration # noqa
46 changes: 46 additions & 0 deletions src/zenml/integrations/argilla/__init__.py
@@ -0,0 +1,46 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Initialization of the Argilla integration."""
from typing import List, Type

from zenml.integrations.constants import ARGILLA
from zenml.integrations.integration import Integration
from zenml.stack import Flavor

ARGILLA_ANNOTATOR_FLAVOR = "argilla"


class ArgillaIntegration(Integration):
    """Definition of Argilla integration for ZenML."""

    NAME = ARGILLA
    REQUIREMENTS = [
        "argilla>=1.20.0,<2",
    ]

    @classmethod
    def flavors(cls) -> List[Type[Flavor]]:
        """Declare the stack component flavors for the Argilla integration.

        Returns:
            List of stack component flavors for this integration.
        """
        from zenml.integrations.argilla.flavors import (
            ArgillaAnnotatorFlavor,
        )

        return [ArgillaAnnotatorFlavor]


ArgillaIntegration.check_installation()
20 changes: 20 additions & 0 deletions src/zenml/integrations/argilla/annotators/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Initialization of the Argilla annotators submodule."""

from zenml.integrations.argilla.annotators.argilla_annotator import (
    ArgillaAnnotator,
)

__all__ = ["ArgillaAnnotator"]