zenml-io · AlexejPenner · Apr 14, 2022 · Feb 25, 2022 · Mar 10, 2022 · Mar 12, 2022
diff --git a/docs/book/features/cloud-pipelines/guide-aws-gcp-azure.md b/docs/book/features/cloud-pipelines/guide-aws-gcp-azure.md
@@ -131,7 +131,7 @@ To run our pipeline on Kubeflow Pipelines deployed to cloud, we will create a ne
 2. Register the stack components
 
     ```powershell
-    zenml container-registry register cloud-registry --type=default --uri=$PATH_TO_YOUR_CONTAINER_REGISTRY
+    zenml container-registry register cloud_registry --type=default --uri=$PATH_TO_YOUR_CONTAINER_REGISTRY
     zenml orchestrator register cloud_orchestrator --type=kubeflow --custom_docker_base_image_name=YOUR_IMAGE
     zenml metadata-store register kubeflow_metadata_store --type=kubeflow
     zenml artifact-store register cloud_artifact_store --type=<s3/gcp/azure> --path=$PATH_TO_YOUR_BUCKET

diff --git a/examples/huggingface/README.md b/examples/huggingface/README.md
@@ -0,0 +1,145 @@
+# Implementing NLP algorithms with Hugging Face & ZenML
+
+These examples demonstrate how to use ZenML and Hugging Face transformers to build, train, and test NLP models.
+
+Hugging Face is one of our favorite emoji for expressing thankfulness, love, or appreciation. In the world of AI/ML, [`Hugging Face`](https://huggingface.co/) is a startup in the Natural Language Processing (NLP) domain (now expanding to computer vision and RL) that offers a library of SOTA models, in particular around Transformers. More than a thousand companies use their library in production, including Bing, Apple, and Microsoft. Do check out their [`Transformers Library`](https://github.com/huggingface/transformers), [`Datasets Library`](https://github.com/huggingface/datasets), and [`Model Hub`](https://huggingface.co/models).
+
+NLP is a branch of machine learning that is about helping systems to understand natural text and spoken words in the same way that humans do.
+
+The following is a list of common NLP tasks:
+
+- Classification of sentences: sequence-classification
+- Classification of each word in a sentence: token-classification
+- Extraction of an answer from a context text: question-answering
+- Text generation from a prompt: text-generation
+- Translation: text-translation
+
+## Sequence Classification
+
+Sequence Classification is an NLP/NLU task in which we assign a label to a given text, e.g. sentiment classification or natural language inference. In this example, we will train a sentiment classification model using the [`imdb`](https://huggingface.co/datasets/imdb) dataset.
+
+- Load dataset: Load a sequence-classification dataset; in this case, the `imdb` dataset.
+```python
+from datasets import load_dataset
+
+datasets = load_dataset("imdb")
+print(datasets["train"][0])
+
+# Example output:
+# {'label': 0,  # sentiment label: 0 -> negative, 1 -> positive
+#  'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.......'}
+```
+
+- Load pre-trained tokenizer: Load a pre-trained tokenizer from Hugging Face transformers.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
+```
+
+- Tokenize and prepare the dataset for training: Use the pre-trained tokenizer to tokenize the dataset and encode it into ids, along with the labels.
+- Build and train the model: You can build your own model or use a pre-trained model from Hugging Face transformers. Use the encoded dataset to train the model.
+- Evaluate: Evaluate the model's loss and accuracy.
+
+## Token Classification
+
+Token Classification is an NLP/NLU task in which we assign a label to each token in a text, e.g. named entity recognition (NER) or part-of-speech tagging. In this example, we will train an NER model using the [`conll2003`](https://huggingface.co/datasets/conll2003) dataset.
+
+- Load dataset: Load a token-classification dataset; in this case, the `conll2003` dataset.
+
+```python
+from datasets import load_dataset
+
+datasets = load_dataset("conll2003")
+print(datasets["train"][0])
+
+# Example output:
+# {'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
+#  'id': '0',
+#  'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],  # list of token classification labels
+#  'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
+#  'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']}
+```
+
+- Load pre-trained tokenizer: Load a pre-trained tokenizer from Hugging Face transformers.
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
+```
+
+- Tokenize and prepare the dataset for training: Use the pre-trained tokenizer to tokenize the dataset and encode it into ids, along with the labels.
+- Build and train the model: You can build your own model or use a pre-trained model from Hugging Face transformers. Use the encoded dataset to train the model.
+- Evaluate: Evaluate the model's loss and accuracy.
+
+## Run it locally
+
+```shell
+# install CLI
+pip install zenml transformers datasets
+
+# install ZenML integrations
+zenml integration install mlflow tensorflow
+
+# pull example
+cd zenml/examples/huggingface
+
+# initialize
+zenml init
+```
+
+### Run the project
+Now we're ready. Execute:
+
+```shell
+# sequence-classification
+python run_pipeline.py --nlp_task=sequence-classification --pretrained_model=distilbert-base-uncased --epochs=3 --batch_size=16
+
+# token-classification
+python run_pipeline.py --nlp_task=token-classification --pretrained_model=distilbert-base-uncased --epochs=3 --batch_size=16
+```
+
+### Test pipeline
+
+```python
+
+from zenml.repository import Repository
+from transformers import pipeline
+
+# 1. Load the sequence-classification model and run inference
+repo = Repository()
+p = repo.get_pipeline(pipeline_name="seq_classifier_train_eval_pipeline")
+runs = p.runs
+print(f"Pipeline `seq_classifier_train_eval_pipeline` has {len(runs)} run(s)")
+latest_run = runs[-1]
+trainer_step = latest_run.get_step('trainer')
+load_tokenizer_step = latest_run.get_step("load_tokenizer")
+
+# load the trained model and tokenizer, then build the inference pipeline
+model = trainer_step.output.read()
+tokenizer = load_tokenizer_step.output.read()
+sentiment_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+print(sentiment_classifier("MLOps movie by Zenml-io was awesome."))
+
+
+# 2. Load the token-classification model and run inference
+repo = Repository()
+p = repo.get_pipeline(pipeline_name="token_classifier_train_eval_pipeline")
+runs = p.runs
+print(f"Pipeline `token_classifier_train_eval_pipeline` has {len(runs)} run(s)")
+latest_run = runs[-1]
+trainer_step = latest_run.get_step('trainer')
+load_tokenizer_step = latest_run.get_step("load_tokenizer")
+
+# load the trained model and tokenizer, then build the inference pipeline
+model = trainer_step.output.read()
+tokenizer = load_tokenizer_step.output.read()
+token_classifier = pipeline("token-classification", model=model, tokenizer=tokenizer)
+
+print(token_classifier("Zenml-io is based out of Munich, Germany"))
+```
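The README's "Tokenize and prepare the dataset" step for token classification hides one detail that may be worth illustrating: `conll2003` labels are per word, while a sub-word tokenizer can split one word into several tokens. The sketch below shows one common way to align the two. It is not taken from this PR's `token_classification` module; the function name `align_labels`, the sample `word_ids` list, and the `-100` ignore value (the convention used in Hugging Face examples) are assumptions for illustration. In practice the `word_ids` list would come from a fast tokenizer's encoding.

```python
from typing import List, Optional

# Assumption: -100 marks positions the loss should skip, as in Hugging Face examples.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Expand word-level labels to token-level labels.

    word_ids maps each token to the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens are either labeled like the word or ignored.
            aligned.append(
                word_labels[word_id] if label_all_subwords else IGNORE_INDEX
            )
        previous = word_id
    return aligned


# Hypothetical alignment for "EU rejects German call", assuming the tokenizer
# splits "rejects" into two sub-tokens and adds [CLS] and [SEP].
word_ids = [None, 0, 1, 1, 2, 3, None]
ner_tags = [3, 0, 7, 0]  # first four ner_tags from the conll2003 sample above

print(align_labels(word_ids, ner_tags))
# [-100, 3, 0, -100, 7, 0, -100]
```

Ignoring the trailing sub-tokens keeps each word counted once in the loss; passing `label_all_subwords=True` is the alternative when every sub-token should be supervised.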
diff --git a/examples/huggingface/run_pipeline.py b/examples/huggingface/run_pipeline.py
@@ -0,0 +1,96 @@
+#  Copyright (c) ZenML GmbH 2022. All Rights Reserved.
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at:
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+#  or implied. See the License for the specific language governing
+#  permissions and limitations under the License.
+
+import click
+import sequence_classification
+import token_classification
+from sequence_classification import SequenceClassificationConfig
+from token_classification import TokenClassificationConfig
+
+
+@click.command()
+@click.option(
+    "--nlp_task",
+    default="token-classification",
+        help="Name of the NLP task, e.g. token-classification",
+)
+@click.option(
+    "--pretrained_model",
+    default="distilbert-base-uncased",
+    help="Pretrained model name from huggingface hub",
+)
+@click.option(
+    "--batch_size",
+    default=8,
+    help="Batch Size for training",
+)
+@click.option(
+    "--epochs",
+    default=3,
+    help="Number of epochs for training",
+)
+def main(nlp_task: str, pretrained_model: str, batch_size: int, epochs: int):
+    if nlp_task == "token-classification":
+        # Run Pipeline
+        token_classification_config = TokenClassificationConfig(
+            pretrained_model=pretrained_model,
+            epochs=epochs,
+            batch_size=batch_size,
+        )
+        pipeline = token_classification.token_classifier_train_eval_pipeline(
+            importer=token_classification.data_importer(
+                token_classification_config
+            ),
+            load_tokenizer=token_classification.load_tokenizer(
+                token_classification_config
+            ),
+            tokenization=token_classification.tokenization(
+                token_classification_config
+            ),
+            trainer=token_classification.trainer(token_classification_config),
+            evaluator=token_classification.evaluator(
+                token_classification_config
+            ),
+        )
+        pipeline.run()
+
+    elif nlp_task == "sequence-classification":
+        # Run Pipeline
+        sequence_classification_config = SequenceClassificationConfig(
+            pretrained_model=pretrained_model,
+            epochs=epochs,
+            batch_size=batch_size,
+        )
+        pipeline = sequence_classification.seq_classifier_train_eval_pipeline(
+            importer=sequence_classification.data_importer(
+                sequence_classification_config
+            ),
+            load_tokenizer=sequence_classification.load_tokenizer(
+                sequence_classification_config
+            ),
+            tokenization=sequence_classification.tokenization(
+                sequence_classification_config
+            ),
+            trainer=sequence_classification.trainer(
+                sequence_classification_config
+            ),
+            evaluator=sequence_classification.evaluator(
+                sequence_classification_config
+            ),
+        )
+        pipeline.run()
+
+
+if __name__ == "__main__":
+    main()
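One observation on `main()` in `run_pipeline.py`: an `--nlp_task` value other than the two handled branches falls through and the script exits without running anything. A possible alternative, sketched below under stated assumptions, is a dispatch table that fails loudly on unknown tasks. The runner functions here are hypothetical placeholders that only return a string; in the real script they would build and run the corresponding ZenML pipeline.

```python
from typing import Callable, Dict


def run_token_classification(pretrained_model: str, epochs: int, batch_size: int) -> str:
    # Hypothetical placeholder for building and running the token-classification pipeline.
    return f"token-classification:{pretrained_model}:{epochs}:{batch_size}"


def run_sequence_classification(pretrained_model: str, epochs: int, batch_size: int) -> str:
    # Hypothetical placeholder for building and running the sequence-classification pipeline.
    return f"sequence-classification:{pretrained_model}:{epochs}:{batch_size}"


# Map each supported task name to the function that runs it.
TASK_RUNNERS: Dict[str, Callable[..., str]] = {
    "token-classification": run_token_classification,
    "sequence-classification": run_sequence_classification,
}


def dispatch(nlp_task: str, **kwargs) -> str:
    """Run the pipeline for nlp_task, or raise if the task is not supported."""
    try:
        runner = TASK_RUNNERS[nlp_task]
    except KeyError:
        supported = ", ".join(sorted(TASK_RUNNERS))
        raise ValueError(
            f"Unsupported nlp_task {nlp_task!r}; expected one of: {supported}"
        ) from None
    return runner(**kwargs)


print(
    dispatch(
        "token-classification",
        pretrained_model="distilbert-base-uncased",
        epochs=3,
        batch_size=8,
    )
)
# token-classification:distilbert-base-uncased:3:8
```

Adding a new task then means registering one more entry in `TASK_RUNNERS` rather than extending the `if`/`elif` chain.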