<a href="https://colab.research.google.com/github/tiennguyen2310/NLP/blob/main/Transformers_Pipeline_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Prepare Libraries & Functions**

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
from transformers import pipeline


#**Sentiment Analysis**
**Def:** *Classifies a given statement as positive or negative and reports a confidence score between 0 and 1.*

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("I trust that you have done good.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.999833345413208}]

In [None]:
classifier("I trust that you can do better in the future.")

[{'label': 'NEGATIVE', 'score': 0.897894024848938}]

**Note:** You can pass *multiple statements* at once as a list; the pipeline returns one result per statement.

In [None]:
classifier(["I love you.", "I hate you."])

[{'label': 'POSITIVE', 'score': 0.9998705387115479},
 {'label': 'NEGATIVE', 'score': 0.9992952346801758}]
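
As a small sketch of how this list output might be consumed, the snippet below pairs each input with its result and formats the score as a percentage. The `results` literal is copied from the output above rather than recomputed, and the helper name `describe` is hypothetical.

```python
# Results copied from the output above (not recomputed here).
texts = ["I love you.", "I hate you."]
results = [
    {"label": "POSITIVE", "score": 0.9998705387115479},
    {"label": "NEGATIVE", "score": 0.9992952346801758},
]


def describe(text, result):
    """Hypothetical helper: format one pipeline result as a readable line."""
    return f"{text!r}: {result['label']} ({result['score']:.2%})"


# The pipeline returns one dict per input, in the same order as the inputs,
# so zip() is enough to match them up.
lines = [describe(t, r) for t, r in zip(texts, results)]
for line in lines:
    print(line)
```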

#**Zero-shot Classification**

**Def:** *Zero-shot classification lets you choose the labels used for classification at inference time.*

So, it is *not limited to Positive/Negative* like Sentiment Analysis: **you supply your own candidate labels**, and the model scores the text against each of them.

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "I love going to Vietnam.",
    candidate_labels = ["education", "tourism", "politics"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


{'sequence': 'I love going to Vietnam.',
 'labels': ['tourism', 'education', 'politics'],
 'scores': [0.9485887289047241, 0.028267277404665947, 0.023144032806158066]}
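
The result is a plain dictionary in which `labels` and `scores` are parallel lists. A minimal sketch of reading it, using a literal copied from the output above (the variable names are illustrative only):

```python
# Result copied from the output above (not recomputed here).
result = {
    "sequence": "I love going to Vietnam.",
    "labels": ["tourism", "education", "politics"],
    "scores": [0.9485887289047241, 0.028267277404665947, 0.023144032806158066],
}

# labels[i] corresponds to scores[i], so zip() builds a label -> score mapping.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
top_label = max(label_scores, key=label_scores.get)
print(top_label, round(label_scores[top_label], 3))
```

In the output above the labels already appear in descending order of score, so `result["labels"][0]` gives the same answer here.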

##**Compare Sentiment Analysis with Zero-shot Classification.**

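For the statement *I love you.*, Sentiment Analysis returned a `POSITIVE` score of `0.9998705387115479`. Zero-shot Classification with the candidate labels `POSITIVE` and `NEGATIVE` is less confident on the same statement, as the output below shows.

Also, notice that the two zero-shot scores sum to approximately 1: $P(\text{positive}) + P(\text{negative}) \approx 1$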


In [None]:
classifier(
    "I love you.",
    candidate_labels = ["POSITIVE", "NEGATIVE"]
)

{'sequence': 'I love you.',
 'labels': ['POSITIVE', 'NEGATIVE'],
 'scores': [0.829241156578064, 0.17075888812541962]}
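
The scores sum to about 1 because, in its default single-label mode, the zero-shot pipeline normalizes the candidate scores with a softmax. The sketch below illustrates that property; the logits are made-up values for illustration, not numbers taken from the model, while `observed` is copied from the output above.

```python
import math


def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, chosen only to illustrate the normalization.
hypothetical_logits = [2.0, 0.42]
probs = softmax(hypothetical_logits)

# Scores copied from the zero-shot output above.
observed = [0.829241156578064, 0.17075888812541962]

print(sum(probs), sum(observed))
```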

#**Text Generation**

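**Def:** *The model continues the given prompt by generating the text that follows it.*

**Note:** You can control some characteristics of the output. Lengths are measured in tokens and include the prompt:
1.   `max_length`: the maximum length of the full sequence
2.   `min_length`: the minimum length of the full sequence
3.   `num_return_sequences`: the number of sequences to return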




##**Default Model**

In [None]:
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [None]:
generator("I love travelling to Vietnam because")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I love travelling to Vietnam because people can see all the different things we see and understand. It makes you more connected to the people around you and gives you an opportunity to share your own experiences with the world. But most of all, Vietnam is like'}]
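
Note that `generated_text` contains the prompt followed by the continuation. A minimal sketch of separating the two, using the output above as a literal (the helper name `continuation` is hypothetical):

```python
prompt = "I love travelling to Vietnam because"

# Output copied from the cell above (not regenerated here).
outputs = [
    {
        "generated_text": (
            "I love travelling to Vietnam because people can see all the "
            "different things we see and understand. It makes you more "
            "connected to the people around you and gives you an opportunity "
            "to share your own experiences with the world. But most of all, "
            "Vietnam is like"
        )
    }
]


def continuation(prompt, generated_text):
    """Hypothetical helper: drop the prompt if the text starts with it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


continuations = [continuation(prompt, o["generated_text"]) for o in outputs]
print(continuations[0])
```

The text-generation pipeline also accepts `return_full_text = False`, which should give a similar result without post-processing.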

In [None]:
generator(
    "I will study Natural Language Processing in order to",
    max_length = 200,
    num_return_sequences = 5
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I will study Natural Language Processing in order to study Language Development in order to learn about Language Development. The research was done before he was appointed by the Department of Education.'},
 {'generated_text': 'I will study Natural Language Processing in order to understand neural networks that are useful and capable of processing and performing novel experiments. I believe that some research papers can and will provide new insight that can help us get a better understanding of the biology of language.\n\n\nFor more information, or for more information about what you can learn about how natural language processing in neural networks can play an important role in understanding language development, see the Nature Biophysics paper below.\nI am grateful to Dr. Shai Zang-Mang, Prof. Yang and I for teaching Natural Language Processing in our field of neuroscience and natural language processing in science and a valuable resource for the field of natural

##**Particular Model from the Hub**

In [None]:
generator = pipeline("text-generation", model = "distilgpt2")


Device set to use cpu


In [None]:
generator(
    "Google Colab is a good tool to",
    max_length = 100,
    min_length = 80,
    num_return_sequences = 5
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Google Colab is a good tool to bring a large range of games to the home screen. With a little more background, the app can be found in the Android Store and it’d be a great addition.\n\n\n\nThe app, launched on February 23 and ran until February 18, 2014, on Android devices running Android 8.1. The app was recently released in China.\nIt’t quite as impressive how well it’s been up-and-coming'},
 {'generated_text': 'Google Colab is a good tool to help you build in a very clean and simple way, right?\nThe Colab is the kind of tool that helps you write down your code using a CSS engine. It‒s the absolute best tool out there or anything…\nIf you have any concerns about how the Colab works, you can download the latest Colab for £15/t to help you quickly and easily follow along with what‒s there to do.\nAnd don'},
 {'generated_text': 'Google Colab is a good tool to create an accurate and simple product that you really need to know. This product provides an accurate and s

#**Mask filling**

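**Def:** *Fills in the blank: the model predicts the most likely tokens for the masked position (`<mask>`), and `top_k` sets how many candidates are returned.*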

In [None]:
unmasker = pipeline("fill-mask")
unmasker("I love travelling to <mask>.", top_k = 5)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'score': 0.04099900647997856,
  'token': 1005,
  'token_str': ' Europe',
  'sequence': 'I love travelling to Europe.'},
 {'score': 0.037131473422050476,
  'token': 3430,
  'token_str': ' Scotland',
  'sequence': 'I love travelling to Scotland.'},
 {'score': 0.03694169223308563,
  'token': 14605,
  'token_str': ' Iceland',
  'sequence': 'I love travelling to Iceland.'},
 {'score': 0.027476176619529724,
  'token': 1221,
  'token_str': ' Australia',
  'sequence': 'I love travelling to Australia.'},
 {'score': 0.02279961109161377,
  'token': 2627,
  'token_str': ' Italy',
  'sequence': 'I love travelling to Italy.'}]
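
Two details of this output are easy to miss: each `token_str` carries a leading space, and the five candidates together account for only a small share of the probability mass. The sketch below checks both, using values copied from the output above (only the fields it needs are kept):

```python
# Candidates copied from the output above (score and token_str only).
candidates = [
    {"score": 0.04099900647997856, "token_str": " Europe"},
    {"score": 0.037131473422050476, "token_str": " Scotland"},
    {"score": 0.03694169223308563, "token_str": " Iceland"},
    {"score": 0.027476176619529724, "token_str": " Australia"},
    {"score": 0.02279961109161377, "token_str": " Italy"},
]

# Strip the leading space that is part of each token string.
words = [c["token_str"].strip() for c in candidates]

# Total probability covered by the top-5 candidates.
covered = sum(c["score"] for c in candidates)

print(words)
print(f"top-5 candidates cover {covered:.1%} of the probability mass")
```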

#**Named Entity Recognition** (NER)

**Def:** This task finds which words or phrases in a text **refer to entities** (such as persons, locations, or organizations).

In the code below, `grouped_entities = True` merges consecutive tokens that refer to the same entity into a single result. Newer releases of `transformers` deprecate this flag in favor of `aggregation_strategy = "simple"`.

In [None]:
ner = pipeline("ner", grouped_entities = True)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


**Eg:** `Nguyen Chu Viet Tien` consists of `4` separate words, but they refer to `1` entity (a person).

In [None]:
ner("My Full Name is Nguyen Chu Viet Tien. I Come from Hanoi, Vietnam. Currently, I am studying Computer Science at the University of Richmond.")

[{'entity_group': 'PER',
  'score': 0.9507469,
  'word': 'Nguyen Chu Viet Tien',
  'start': 16,
  'end': 36},
 {'entity_group': 'LOC',
  'score': 0.99945754,
  'word': 'Hanoi',
  'start': 50,
  'end': 55},
 {'entity_group': 'LOC',
  'score': 0.9998198,
  'word': 'Vietnam',
  'start': 57,
  'end': 64},
 {'entity_group': 'ORG',
  'score': 0.99715215,
  'word': 'University of Richmond',
  'start': 115,
  'end': 137}]
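
The `start` and `end` fields are character offsets into the input string, with `end` exclusive. A short sketch that recovers each entity by slicing, using values copied from the output above (scores omitted):

```python
text = (
    "My Full Name is Nguyen Chu Viet Tien. I Come from Hanoi, Vietnam. "
    "Currently, I am studying Computer Science at the University of Richmond."
)

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Nguyen Chu Viet Tien", "start": 16, "end": 36},
    {"entity_group": "LOC", "word": "Hanoi", "start": 50, "end": 55},
    {"entity_group": "LOC", "word": "Vietnam", "start": 57, "end": 64},
    {"entity_group": "ORG", "word": "University of Richmond", "start": 115, "end": 137},
]

# Slice the original text with each (start, end) pair.
spans = [text[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(entity["entity_group"], "->", span)
```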

#**Question Answering**

**Def:** Return the answer to a question, given a context.

In [None]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


**Note:** The pipeline *extracts the answer from the given context*; it does **NOT** generate a new answer!

In [None]:
question_answerer(
    question = "What do I usually eat?",
    context = "My name is Tien. I love eating salad."
)

{'score': 0.9864376783370972, 'start': 31, 'end': 36, 'answer': 'salad'}
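
Because the answer is extracted, `start` and `end` point directly into the context string. A minimal check, using the result above as a literal:

```python
context = "My name is Tien. I love eating salad."

# Result copied from the output above.
result = {"score": 0.9864376783370972, "start": 31, "end": 36, "answer": "salad"}

# Slicing the context with the returned offsets reproduces the answer.
extracted = context[result["start"]:result["end"]]
print(extracted)
```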

#**Text Summary**

**Note:** As with `Text Generation`, you can specify `max_length` and `min_length` (both measured in tokens).

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


In [None]:
summarizer(
"""
The relationship between knowledge and power has long been a prominent issue in science and science fiction. Does ownership have anything to do with creativity? If someone invents something, should he or she be held fully responsible for his or her innovation?  Indeed, the interaction between the two main characters in Mary Shelley’s Frankenstein shows the inseparable relationship between knowledge and power, opening up numerous questions about responsibility and ethics for invention.
From my personal perspective, while we do not always own the faith of our investigation or invention, we have a responsibility to the consequences that it may bring to the community. In the story, Victor Frankenstein represents the example of an ambitious scientist who is not afraid to dive deep into an unexplored field of knowledge. The desire to overcome the limits of nature and create life demonstrates that humans can achieve considerable power through expertise and creativity. However, this power also has an undeniable responsibility to control possible bad situations. When Victor built the creature, he was not prepared for the consequences of its actions later in the book; rather, he was only interested in proving his own abilities, forgetting that every invention can bring about unpredictable outcomes, a serious shortcoming of the inventor. Victor failed to control and take responsibility for his creation. Alternatively, Victor fled and left the creature to ravage human society, leading to a chain of unrecoverable tragedies. The beast, rejected and despised, became more and more brutal, acting not only out of pain but also out of a lack of love and acceptance.
""",
    max_length = 100,
    min_length = 70
)

[{'summary_text': ' The relationship between knowledge and power has long been a prominent issue in science and science fiction . If someone invents something, should he or she be held fully responsible for his or her innovation? The interaction between the two main characters in Mary Shelley’s Frankenstein opens up numerous questions about responsibility and ethics for invention . In the story, Victor Frankenstein represents the example of an ambitious scientist who is not afraid to dive deep into an unexplored field of knowledge .'}]

#**Translation**

In [None]:
translator = pipeline("translation", model = "VietAI/envit5-translation")

Device set to use cpu


In [None]:
translator("Hello, How are you doing?")

[{'translation_text': 'vi: Xin chào, bạn khoẻ không?'}]

In [None]:
translator("Tôi rất khỏe, còn bạn thì sao?")

[{'translation_text': "en: I'm very well, how are you?"}]
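
In the outputs above, each translation starts with a language tag (`vi: ` or `en: `). A minimal sketch of splitting that tag from the text; the helper name `split_prefix` is hypothetical, and the set of tags is assumed from these two outputs only:

```python
KNOWN_TAGS = {"en", "vi"}  # assumption: only the tags seen in the outputs above


def split_prefix(translation_text):
    """Hypothetical helper: return (language_tag, text) for a tagged translation."""
    tag, sep, rest = translation_text.partition(": ")
    if sep and tag in KNOWN_TAGS:
        return tag, rest
    # No recognized tag: return the text unchanged.
    return None, translation_text


# Strings copied from the outputs above.
vi_result = split_prefix("vi: Xin chào, bạn khoẻ không?")
en_result = split_prefix("en: I'm very well, how are you?")
print(vi_result)
print(en_result)
```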