Add nlp example #467

Merged
merged 35 commits into from
Apr 14, 2022
Changes from 26 commits
Commits
5dc05a8
fix typo (#432) (#433)
wjayesh Feb 25, 2022
eac9d5a
nlp token classification pipeline added
Ankur3107 Mar 10, 2022
d96a907
run pipeline added
Ankur3107 Mar 12, 2022
d45c96b
labels added in config
Ankur3107 Mar 13, 2022
e89781f
readme added
Ankur3107 Mar 13, 2022
8591e4f
Update examples/nlp/token-classification/pipeline.py
Ankur3107 Mar 14, 2022
3f60bfa
refactored pipeline
Ankur3107 Mar 15, 2022
0aba382
import issue fixed
Ankur3107 Mar 15, 2022
377837b
model label config added
Ankur3107 Mar 15, 2022
97939a4
readme updated
Ankur3107 Mar 15, 2022
cd45525
more comment added
Ankur3107 Mar 15, 2022
d742baf
hf materializer added
Ankur3107 Mar 18, 2022
81198d9
added two huggingface model materializer pt and tf
Ankur3107 Mar 18, 2022
5b20fa6
model typing fixed
Ankur3107 Mar 18, 2022
d4d1b3a
refactoring token-classification task
Ankur3107 Mar 18, 2022
660f0ca
sequence classification example added
Ankur3107 Mar 20, 2022
db9c4e1
readme updated
Ankur3107 Mar 20, 2022
1da6c50
Update examples/huggingface/README.md
Ankur3107 Mar 22, 2022
6b18926
Update examples/huggingface/README.md
Ankur3107 Mar 22, 2022
8541cba
Update examples/huggingface/README.md
Ankur3107 Mar 22, 2022
2d15735
Update examples/huggingface/README.md
Ankur3107 Mar 22, 2022
ab784d3
Update examples/huggingface/run_pipeline.py
Ankur3107 Mar 22, 2022
69823e1
Update examples/huggingface/run_pipeline.py
Ankur3107 Mar 22, 2022
c54a80f
Update examples/huggingface/sequence_classification.py
Ankur3107 Mar 22, 2022
981dbae
readme updated
Ankur3107 Mar 22, 2022
776f97c
format.sh and lint.sh
Mar 22, 2022
d6b553e
Materializer updated
Ankur3107 Apr 8, 2022
84f5f98
example updated
Ankur3107 Apr 12, 2022
2cba5bf
materializer import issue fixed
Ankur3107 Apr 12, 2022
f6709db
save_to_disk fixed
Ankur3107 Apr 12, 2022
f13dc36
readme updated
Ankur3107 Apr 12, 2022
027c4df
Merge remote-tracking branch 'remotes/origin/develop' into zenml_nlp_…
Ankur3107 Apr 12, 2022
636f431
Merge branch 'develop' into zenml_nlp_example
Ankur3107 Apr 12, 2022
4a81c11
W292 no newline at end of file fixed
Ankur3107 Apr 12, 2022
ab62b4e
readme and copy_dir path updated
Ankur3107 Apr 14, 2022
2 changes: 1 addition & 1 deletion docs/book/features/cloud-pipelines/guide-aws-gcp-azure.md
@@ -131,7 +131,7 @@ To run our pipeline on Kubeflow Pipelines deployed to cloud, we will create a ne
2. Register the stack components

```powershell
zenml container-registry register cloud-registry --type=default --uri=$PATH_TO_YOUR_CONTAINER_REGISTRY
zenml container-registry register cloud_registry --type=default --uri=$PATH_TO_YOUR_CONTAINER_REGISTRY
zenml orchestrator register cloud_orchestrator --type=kubeflow --custom_docker_base_image_name=YOUR_IMAGE
zenml metadata-store register kubeflow_metadata_store --type=kubeflow
zenml artifact-store register cloud_artifact_store --type=<s3/gcp/azure> --path=$PATH_TO_YOUR_BUCKET
```
145 changes: 145 additions & 0 deletions examples/huggingface/README.md
@@ -0,0 +1,145 @@
# Implementation of NLP algorithms using Hugging Face & ZenML

These examples demonstrate how to use ZenML and Hugging Face Transformers to build, train, and test NLP models.

Hugging Face is one of our favorite emoji for expressing thankfulness, love, or appreciation. In the world of AI/ML, [`Hugging Face`](https://huggingface.co/) is a startup in the Natural Language Processing (NLP) domain (now expanding into computer vision and reinforcement learning) that offers a library of state-of-the-art models, particularly Transformers. More than a thousand companies use the library in production, including Bing, Apple, and Microsoft. Do check out their [`Transformers Library`](https://github.com/huggingface/transformers), [`Datasets Library`](https://github.com/huggingface/datasets) and [`Model Hub`](https://huggingface.co/models).

NLP is a branch of machine learning that is about helping systems to understand natural text and spoken words in the same way that humans do.

The following is a list of common NLP tasks:

- Classification of sentences: sequence-classification
- Classification of each word in a sentence: token-classification
- Extraction of answer from a context text: question-answering
- Text generation using prompt: text-generation
- Translation: text-translation

## Sequence Classification

Sequence classification is an NLP/NLU task in which we assign a label to a given text, e.g. sentiment classification or natural language inference. In this example, we will train a sentiment classification model using the [`imdb`](https://huggingface.co/datasets/imdb) dataset.

- Load dataset: Load the sequence-classification dataset, in this case the `imdb` dataset.
```python
from datasets import load_dataset
datasets = load_dataset("imdb")
print(datasets['train'][0])

{'label': 0, # Sentiment label i.e. 0->Negative 1->Positive
'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.......'}
```

- Load pre-trained tokenizer: Load a pre-trained tokenizer from Hugging Face Transformers.

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

- Tokenize and prepare dataset for training: Use the pre-trained tokenizer to tokenize the dataset and encode it into input ids, keeping the labels alongside.
- Build and train model: Build your own model or use a pre-trained model from Hugging Face Transformers, then train it on the encoded dataset.
- Evaluate: Evaluate the model's loss and accuracy.
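
To make the evaluation step concrete, here is a minimal pure-Python sketch of how mean cross-entropy loss and accuracy can be computed from a model's raw output scores (logits). The function names and the example logits are hypothetical and are not taken from this PR; the actual evaluator step most likely relies on the training framework's own evaluation routine.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(
    logits: Sequence[Sequence[float]], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (mean cross-entropy loss, accuracy) for a batch of predictions."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and equally long")
    loss, correct = 0.0, 0
    for row, label in zip(logits, labels):
        probs = softmax(row)
        loss += -math.log(probs[label])
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return loss / len(labels), correct / len(labels)


# Hypothetical logits for three reviews; column 0 = negative, column 1 = positive.
example_logits = [[2.0, -1.0], [0.5, 1.5], [1.0, 0.0]]
example_labels = [0, 1, 1]
loss, accuracy = evaluate(example_logits, example_labels)
print(f"loss={loss:.4f} accuracy={accuracy:.2f}")
```

The third review is misclassified with some confidence, so it counts against accuracy and also dominates the loss. This is why the two metrics are usually reported together.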

## Token Classification

Token classification is an NLP/NLU task in which we assign a label to each token in a text, e.g. named entity recognition (NER) or part-of-speech tagging. In this example, we will train a NER model using the [`conll2003`](https://huggingface.co/datasets/conll2003) dataset.

- Load dataset: Load the token-classification dataset, in this case the `conll2003` dataset.

```python
from datasets import load_dataset
datasets = load_dataset("conll2003")
print(datasets['train'][0])

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
'id': '0',
'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0], #list of token classification labels
'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
'tokens': ['EU',
'rejects',
'German',
'call',
'to',
'boycott',
'British',
'lamb',
'.']}
```

- Load pre-trained tokenizer: Load a pre-trained tokenizer from Hugging Face Transformers.

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

- Tokenize and prepare dataset for training: Use the pre-trained tokenizer to tokenize the dataset and encode it into input ids, keeping the labels alongside.
- Build and train model: Build your own model or use a pre-trained model from Hugging Face Transformers, then train it on the encoded dataset.
- Evaluate: Evaluate the model's loss and accuracy.
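
For token classification, preparing the dataset is subtler than for sequence classification: the tokenizer may split one word into several sub-word tokens, so the word-level `ner_tags` have to be aligned to the token sequence. The sketch below shows one common way to do this, assuming `word_ids` in the format returned by a fast tokenizer's `word_ids()` method. The `align_labels` helper, the `word_ids` values, and the choice to mask continuation sub-words are illustrative assumptions; the tokenization step in this PR may handle alignment differently.

```python
from typing import List, Optional, Sequence

# Label names for the `ner_tags` feature of the Hugging Face `conll2003` dataset.
NER_LABELS = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]
# Value that PyTorch's cross-entropy loss ignores by default (an assumed convention here).
IGNORE_INDEX = -100


def align_labels(
    word_ids: Sequence[Optional[int]], word_labels: Sequence[int]
) -> List[int]:
    """Label the first sub-word of each word; mask special tokens and continuations."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:          # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:    # first sub-word of a new word
            aligned.append(word_labels[word_id])
        else:                        # later sub-word of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
decoded = [(token, NER_LABELS[tag]) for token, tag in zip(tokens, ner_tags)]
print(decoded)

# Hypothetical word_ids, as if "boycott" had been split into two sub-words.
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, None]
aligned = align_labels(word_ids, ner_tags)
print(aligned)
```

Decoding the tags shows that `EU` is tagged `B-ORG` and both `German` and `British` are tagged `B-MISC`; every other token is `O`.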

## Run it locally

```shell
# install ZenML and the Hugging Face libraries
pip install zenml transformers datasets

# install ZenML integrations
zenml integration install mlflow tensorflow

# change into the example directory
cd zenml/examples/huggingface

# initialize
zenml init
```

### Run the project
Now we're ready. Execute:

```shell
# sequence-classification
python run_pipeline.py --nlp_task=sequence-classification --pretrained_model=distilbert-base-uncased --epochs=3 --batch_size=16

# token-classification
python run_pipeline.py --nlp_task=token-classification --pretrained_model=distilbert-base-uncased --epochs=3 --batch_size=16
```

### Test pipeline

```python

from zenml.repository import Repository
from transformers import pipeline

# 1. Load sequence-classification and inference
repo = Repository()
p = repo.get_pipeline(pipeline_name="seq_classifier_train_eval_pipeline")
runs = p.runs
print(f"Pipeline `seq_classifier_train_eval_pipeline` has {len(runs)} run(s)")
latest_run = runs[-1]
trainer_step = latest_run.get_step('trainer')
load_tokenizer_step = latest_run.get_step("load_tokenizer")

# load model and pipeline
model = trainer_step.output.read()
tokenizer = load_tokenizer_step.output.read()
sentiment_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(sentiment_classifier("MLOps movie by Zenml-io was awesome."))


# 2. Load token-classification and inference
repo = Repository()
p = repo.get_pipeline(pipeline_name="token_classifier_train_eval_pipeline")
runs = p.runs
print(f"Pipeline `token_classifier_train_eval_pipeline` has {len(runs)} run(s)")
latest_run = runs[-1]
trainer_step = latest_run.get_step('trainer')
load_tokenizer_step = latest_run.get_step("load_tokenizer")

# load model and pipeline
model = trainer_step.output.read()
tokenizer = load_tokenizer_step.output.read()
token_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)

print(token_classifier("Zenml-io is based out of Munich, Germany"))
```
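
A text-classification `pipeline` call returns a list of dictionaries with `label` and `score` keys. When the model config does not define its own label names, the labels typically appear as `LABEL_0`, `LABEL_1`, and so on. The helper below is a small sketch for turning such output into readable sentiment names. The `readable` function, the mapping, and the sample prediction are hypothetical, and the mapping assumes the `imdb` convention shown earlier (0 is negative, 1 is positive).

```python
from typing import Any, Dict, List

# Assumed mapping from default label names to IMDB sentiment classes;
# adjust it if the trained model's config defines its own id2label.
SENTIMENT_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions: List[Dict[str, Any]]) -> List[str]:
    """Format pipeline-style predictions as '<sentiment> (<score>)' strings."""
    lines = []
    for pred in predictions:
        label = str(pred["label"])
        name = SENTIMENT_NAMES.get(label, label)  # fall back to the raw label
        lines.append(f"{name} ({float(pred['score']):.2f})")
    return lines


# Hypothetical output, shaped like what a text-classification pipeline returns.
sample = [{"label": "LABEL_1", "score": 0.9871}]
print(readable(sample))
```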
96 changes: 96 additions & 0 deletions examples/huggingface/run_pipeline.py
@@ -0,0 +1,96 @@
# Copyright (c) ZenML GmbH 2022. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.

import click
import sequence_classification
import token_classification
from sequence_classification import SequenceClassificationConfig
from token_classification import TokenClassificationConfig


@click.command()
@click.option(
    "--nlp_task",
    default="token-classification",
    help="Name of the NLP task, e.g. token-classification",
)
@click.option(
    "--pretrained_model",
    default="distilbert-base-uncased",
    help="Pretrained model name from huggingface hub",
)
@click.option(
    "--batch_size",
    default=8,
    help="Batch Size for training",
)
@click.option(
    "--epochs",
    default=3,
    help="Number of epochs for training",
)
def main(nlp_task: str, pretrained_model: str, batch_size: int, epochs: int):
    if nlp_task == "token-classification":
        # Run Pipeline
        token_classification_config = TokenClassificationConfig(
            pretrained_model=pretrained_model,
            epochs=epochs,
            batch_size=batch_size,
        )
        pipeline = token_classification.token_classifier_train_eval_pipeline(
            importer=token_classification.data_importer(
                token_classification_config
            ),
            load_tokenizer=token_classification.load_tokenizer(
                token_classification_config
            ),
            tokenization=token_classification.tokenization(
                token_classification_config
            ),
            trainer=token_classification.trainer(token_classification_config),
            evaluator=token_classification.evaluator(
                token_classification_config
            ),
        )
        pipeline.run()

    elif nlp_task == "sequence-classification":
        # Run Pipeline
        sequence_classification_config = SequenceClassificationConfig(
            pretrained_model=pretrained_model,
            epochs=epochs,
            batch_size=batch_size,
        )
        pipeline = sequence_classification.seq_classifier_train_eval_pipeline(
            importer=sequence_classification.data_importer(
                sequence_classification_config
            ),
            load_tokenizer=sequence_classification.load_tokenizer(
                sequence_classification_config
            ),
            tokenization=sequence_classification.tokenization(
                sequence_classification_config
            ),
            trainer=sequence_classification.trainer(
                sequence_classification_config
            ),
            evaluator=sequence_classification.evaluator(
                sequence_classification_config
            ),
        )
        pipeline.run()


if __name__ == "__main__":
    main()
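
As written, `main` does nothing when `--nlp_task` is neither `token-classification` nor `sequence-classification`. One possible way to make that case explicit is a dispatch table that rejects unknown task names. The sketch below is not part of this PR: the `run_*` functions are placeholders standing in for building and running the two pipelines, and `dispatch` is a hypothetical helper.

```python
from typing import Callable, Dict


def run_token_classification(pretrained_model: str, epochs: int, batch_size: int) -> str:
    # Placeholder for building and running the token-classification pipeline.
    return f"token-classification:{pretrained_model}:{epochs}:{batch_size}"


def run_sequence_classification(pretrained_model: str, epochs: int, batch_size: int) -> str:
    # Placeholder for building and running the sequence-classification pipeline.
    return f"sequence-classification:{pretrained_model}:{epochs}:{batch_size}"


# Maps each supported --nlp_task value to the function that runs it.
RUNNERS: Dict[str, Callable[..., str]] = {
    "token-classification": run_token_classification,
    "sequence-classification": run_sequence_classification,
}


def dispatch(nlp_task: str, **kwargs) -> str:
    """Run the pipeline registered for nlp_task, or fail loudly if there is none."""
    try:
        runner = RUNNERS[nlp_task]
    except KeyError:
        supported = ", ".join(sorted(RUNNERS))
        raise ValueError(
            f"Unsupported nlp_task {nlp_task!r}; choose one of: {supported}"
        ) from None
    return runner(**kwargs)


result = dispatch(
    "sequence-classification",
    pretrained_model="distilbert-base-uncased",
    epochs=3,
    batch_size=16,
)
print(result)
```

With this layout, supporting another task means adding one entry to `RUNNERS` rather than another `elif` branch, and a mistyped task name raises an error instead of exiting silently.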