
In [2]:
#@ AUTO-RELOAD EDITED MODULES BEFORE EACH CELL RUNS AND RENDER PLOTS INLINE
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [5]:
#@ UNCOMMENT FOR INSTALLATION
#!pip install transformers

The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

In [8]:
import transformers
from transformers import pipeline

In [12]:
# SENTIMENT ANALYSIS : SINGLE SENTENCE
# By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English.
classifier = pipeline("sentiment-analysis")
classifier("I am very happy to start journey with HuggingFace")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998515844345093}]

In [11]:
# SENTIMENT ANALYSIS : MULTIPLE SENTENCES
classifier(["I love studying NLP",
            "I hate copying codes without understanding",
            "Study from documentation is a best practise"])

[{'label': 'POSITIVE', 'score': 0.999051034450531},
 {'label': 'NEGATIVE', 'score': 0.9991101622581482},
 {'label': 'POSITIVE', 'score': 0.9969923496246338}]

`Pipelines`

There are three main steps involved when we pass some text to a `pipeline`:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
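
The three steps can be sketched with a toy stand-in. Nothing below is the library's actual implementation: `VOCAB`, `WEIGHTS`, `toy_model`, and `toy_pipeline` are hypothetical names with hand-picked values, chosen only to show the shape of preprocess, model, and postprocess.

```python
import math

# Hypothetical toy vocabulary and weights; a real pipeline uses a trained
# tokenizer and a trained model instead.
VOCAB = {"i": 0, "love": 1, "hate": 2, "nlp": 3}
UNK_ID = len(VOCAB)  # id used for words outside the toy vocabulary
# One (negative, positive) weight pair per token id, including UNK_ID.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    return [VOCAB.get(word, UNK_ID) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    neg = sum(WEIGHTS[t][0] for t in token_ids)
    pos = sum(WEIGHTS[t][1] for t in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love NLP"))
print(toy_pipeline("I hate NLP"))
```

The returned dictionary deliberately mirrors the `label`/`score` shape seen in the sentiment-analysis outputs above.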

`Zero-shot classification`

- It lets us specify which labels to use for the classification, so we don't have to rely on the labels of the pretrained model.
- In the examples above, the model classified sentences as positive or negative; here the text can be classified using any set of labels we like.
- This pipeline is called zero-shot because the model does not need to be fine-tuned on our data before use. It can directly return probability scores for any list of labels we want!

In [17]:
classifier = pipeline("zero-shot-classification")
classifier("HuggingFace is a informative course library for data scientist ",
           candidate_labels = ["education", "politics", "business"])

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'labels': ['education', 'business', 'politics'],
 'scores': [0.8056657910346985, 0.13851258158683777, 0.05582159385085106],
 'sequence': 'HuggingFace is a informative course library for data scientist '}
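
The result holds two parallel lists, with labels ordered from most to least likely. A small sketch of reading it, assuming the output printed above is copied into a dictionary named `result` (no model is called here):

```python
# Output copied from the zero-shot cell above (the 'sequence' key is omitted).
result = {
    "labels": ["education", "business", "politics"],
    "scores": [0.8056657910346985, 0.13851258158683777, 0.05582159385085106],
}

# labels and scores line up index by index, so zip pairs them.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
print(f"{top_label}: {top_score:.1%}")

# In this output the scores add up to roughly 1, so they can be read
# as a probability distribution over the candidate labels.
total = sum(result["scores"])
print(round(total, 6))
```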

`Text Generation`
- Given a prompt, the pipeline auto-completes it by generating the remaining text.
- Generation involves randomness, so the output can differ from run to run.
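
The variation comes from sampling: at each step a probability is assigned to every possible next token, and one is drawn at random. A toy sketch in plain Python, where the candidate words and their probabilities are invented for illustration and no real model is involved:

```python
import random

# Hypothetical next-token distribution; a real model computes one per step.
candidates = ["people", "models", "we", "the"]
probabilities = [0.4, 0.3, 0.2, 0.1]


def sample_next(rng):
    """Draw one candidate according to its probability."""
    return rng.choices(candidates, weights=probabilities, k=1)[0]


# Different seeds may give different continuations...
draws = [sample_next(random.Random(seed)) for seed in range(5)]
print(draws)

# ...while reusing a seed reproduces the same draw.
assert sample_next(random.Random(0)) == sample_next(random.Random(0))
```

For reproducible runs with the library itself, `transformers.set_seed` can be called before generating.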

In [20]:
generator = pipeline("text-generation")
generator("This is the very begining of text generation pipeline where")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This is the very begining of text generation pipeline where people can express ideas without fear or coercion, and they have the ability to produce the desired things in a very real instant, and they have no control over where it goes, how it is interpreted'}]

In [22]:
# The output of the text generator can be controlled using
# num_return_sequences (how many completions to return) and
# max_length (total length in tokens, prompt included) as follows
generator("This is the begining of hugging face where", num_return_sequences = 2, max_length = 50)



[{'generated_text': "This is the begining of hugging face where all the people you meet are going for the best you have been!\n\nI love seeing the expressions of how many people your heart desires, but especially when you're already happy you need to embrace your"},
 {'generated_text': 'This is the begining of hugging face where no one can catch it.\n\nShifting gears and looking to meet him, I begin to pull up his jacket, only to find, in front of me, a rather thin young man with curly'}]

In [26]:
# IMPLEMENTATION OF DISTILGPT2 MODEL
generator = pipeline("text-generation", model = "distilgpt2")
generator(
    "The course is suitable for a programmer so that",
    max_length = 20,
    num_return_sequences = 2
)



[{'generated_text': 'The course is suitable for a programmer so that it can be used to build on the next generation of'},
 {'generated_text': 'The course is suitable for a programmer so that it is not more complicated. But you can use this'}]

`Mask Filling`: fills in the blanks in a given text.

In [30]:
unmasker = pipeline("fill-mask")
# The top_k argument controls how many candidate fills are returned.
unmasker("The course teaches about <mask> models", top_k = 2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.2331242859363556,
  'sequence': 'The course teaches about mathematical models',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.0674612820148468,
  'sequence': 'The course teaches about predictive models',
  'token': 27930,
  'token_str': ' predictive'}]
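
Conceptually, `top_k` keeps the k highest-scoring candidates for the masked position. A plain-Python sketch of that selection: the first two entries reuse scores from the output above, while the third candidate (`statistical`, 0.05) and the helper `keep_top_k` are made up for illustration.

```python
import heapq

candidates = [
    {"token_str": " predictive", "score": 0.0674612820148468},
    {"token_str": " mathematical", "score": 0.2331242859363556},
    {"token_str": " statistical", "score": 0.05},  # hypothetical entry
]


def keep_top_k(items, k):
    """Return the k items with the highest score, best first."""
    return heapq.nlargest(k, items, key=lambda item: item["score"])


top = keep_top_k(candidates, 2)
# token_str carries a leading space in the output above, so strip it for display.
print([item["token_str"].strip() for item in top])
```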

`Named Entity Recognition`

**NER** is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

In [33]:
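# grouped_entities=True merges the sub-word pieces of one entity into a single result.
# The option is deprecated (see the warning below); newer releases use aggregation_strategy="simple" instead.
NER = pipeline("ner", grouped_entities = True)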
NER("My name is Rebanta Aryal and I am from Kathmandu, Nepal")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
  "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"


[{'end': 24,
  'entity_group': 'PER',
  'score': 0.9985151,
  'start': 11,
  'word': 'Rebanta Aryal'},
 {'end': 48,
  'entity_group': 'LOC',
  'score': 0.99885875,
  'start': 39,
  'word': 'Kathmandu'},
 {'end': 55,
  'entity_group': 'LOC',
  'score': 0.9995988,
  'start': 50,
  'word': 'Nepal'}]
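
The `start` and `end` values are character offsets into the input string, with `end` exclusive. A short sketch, using the entities from the output above (scores omitted), that slices the text with those offsets and groups the spans by entity type; the name `by_type` is just an illustrative choice:

```python
from collections import defaultdict

text = "My name is Rebanta Aryal and I am from Kathmandu, Nepal"
# Entities copied from the output above.
entities = [
    {"entity_group": "PER", "start": 11, "end": 24, "word": "Rebanta Aryal"},
    {"entity_group": "LOC", "start": 39, "end": 48, "word": "Kathmandu"},
    {"entity_group": "LOC", "start": 50, "end": 55, "word": "Nepal"},
]

by_type = defaultdict(list)
for ent in entities:
    # Slice the original text rather than trusting 'word', which may be
    # normalized by the tokenizer in other cases.
    span = text[ent["start"]:ent["end"]]
    by_type[ent["entity_group"]].append(span)

print(dict(by_type))
```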

`Question Answering`

The question-answering pipeline answers questions using information from a given context.

This pipeline works by extracting information from the provided context; it does not generate the answer.

In [37]:
question_answer = pipeline("question-answering")
question_answer(
    question = "where do i live ?", 
    context = "My name is Rebanta and I am from Kathmandu, Nepal"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'Kathmandu, Nepal',
 'end': 49,
 'score': 0.6519007682800293,
 'start': 33}
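
Because the pipeline is extractive, the answer is a verbatim piece of the context. A quick check using the values from the output above (the score is omitted):

```python
context = "My name is Rebanta and I am from Kathmandu, Nepal"
# Values copied from the output above.
result = {"answer": "Kathmandu, Nepal", "start": 33, "end": 49}

# The answer is exactly the context between the reported offsets.
extracted = context[result["start"]:result["end"]]
print(extracted == result["answer"])
```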

`Summarization`

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

In [40]:
summarize = pipeline("summarization")
summarize("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]