## **Natural Language Processing**

- NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but also to understand the context of those words.

*Common NLP Tasks*

- Classifying whole sentences
- Classifying each word in a sentence
- Generating text content
- Extracting an answer from a text
- Generating a new sentence from an input text

**Why it is challenging**

- Computers don’t process information in the same way as humans. For example, when we read the sentence “I am hungry,” we can easily understand its meaning. Similarly, given two sentences such as “I am hungry” and “I am sad,” we’re able to easily determine how similar they are. For machine learning (ML) models, such tasks are more difficult. The text needs to be processed in a way that enables the model to learn from it. And because language is complex, we need to think carefully about how this processing must be done. There has been a lot of research done on how to represent text, and we will look at some methods in the next chapter.



## The Pipeline Function

- The `pipeline()` function is the highest-level API of the Transformers library.

- `pipeline()` returns an end-to-end object that performs an NLP task on one or several texts.

- It groups together all the steps needed to go from raw text to usable predictions. The model is at the core of the pipeline, but the pipeline also includes the necessary pre-processing as well as some post-processing that makes the model's output human-readable.

There are three main steps involved when we pass some text to a pipeline:

- The text is preprocessed into a format that the model can understand
- The preprocessed inputs are passed to the model
- The predictions of the model are post-processed, so we can make sense of them
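
The three steps above can be sketched by hand for the sentiment-analysis case. This is a minimal sketch, not the library's internal implementation: the function names `postprocess` and `classify` are hypothetical, and it assumes PyTorch is installed alongside Transformers. The checkpoint name is the default one reported in the sentiment-analysis log further down.

```python
import math


def postprocess(logits, id2label):
    """Step 3: turn one row of raw model scores into a label and a probability."""
    # Softmax, shifted by the maximum for numerical stability
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify(texts, checkpoint="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the three steps manually (downloads the checkpoint on first use)."""
    # Imported lazily so that postprocess() can be tried without these libraries
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

    # Step 1: preprocessing, raw text -> tensors of token ids
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Step 2: the model turns the token ids into raw scores (logits)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Step 3: post-processing, logits -> human-readable predictions
    return [postprocess(row, model.config.id2label) for row in logits.tolist()]


# Post-processing alone, on made-up logits for a two-class model
print(postprocess([-1.0, 2.0], {0: "NEGATIVE", 1: "POSITIVE"}))
```

Calling `classify(["I've been waiting for a HuggingFace course my whole life"])` should give results in the same shape as the pipeline output shown below, although the exact scores may differ slightly.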

In [1]:
#@ DOWNLOADING THE TRANSFORMERS LIBRARY

# !pip --q install transformers
# !pip --q install sentencepiece

**1. Sentiment Analysis**

- It is a branch of natural language processing that involves identifying the emotional tone or sentiment expressed in a text.

In [2]:
#@ LOADING THE REQUIRED LIBRARIES AND DEPENDENCIES
import transformers
from transformers import pipeline, AutoTokenizer

In [3]:
#@ SENTIMENT ANALYSIS PIPELINE
classifier = pipeline("sentiment-analysis")

classifier("I've been waiting for a HuggingFace course my whole life")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9516069889068604}]

In [4]:
#@ PASSING MULTIPLE TEXTS TO THE PIPELINE
classifier([
    'This hugging face course is very good',
    'I hate this soo much'
])

[{'label': 'POSITIVE', 'score': 0.9998674392700195},
 {'label': 'NEGATIVE', 'score': 0.9992349147796631}]
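
The log above warns against relying on the default model. A possible way to address this is to name the checkpoint explicitly. The sketch below does that and also pairs each input with its prediction, since the pipeline returns results in input order. `build_classifier` and `pair_with_inputs` are hypothetical helper names, and the sample results are copied from the output above.

```python
# Checkpoint name taken from the "No model was supplied" log message
CHECKPOINT = "distilbert-base-uncased-finetuned-sst-2-english"


def build_classifier():
    """Create the pipeline with an explicit model (downloads on first use)."""
    from transformers import pipeline  # lazy import: only needed for real inference

    return pipeline("sentiment-analysis", model=CHECKPOINT)


def pair_with_inputs(texts, results):
    """Attach each prediction to the text that produced it."""
    return [
        (text, result["label"], round(result["score"], 4))
        for text, result in zip(texts, results)
    ]


texts = ["This hugging face course is very good", "I hate this soo much"]
# Results copied from the notebook output above
results = [
    {"label": "POSITIVE", "score": 0.9998674392700195},
    {"label": "NEGATIVE", "score": 0.9992349147796631},
]
for row in pair_with_inputs(texts, results):
    print(row)
```

With the model available locally, `build_classifier()(texts)` can replace the hard-coded `results` list.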

**2. Zero-Shot Classification**

- It is a machine learning technique used to classify input data into categories or classes, even if the model has not been explicitly trained on those categories during the training phase.

- It allows the model to make predictions about classes it has never seen before.

- It relies mainly on the idea that language and semantic understanding can help bridge the gap between known and unknown classes.

In [5]:
#@ ZERO SHOT CLASSIFICATION PIPELINE
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [6]:
classifier(
    "This is a course about Transformer library",
    candidate_labels = ["education", "business", "politics"]
)

{'sequence': 'This is a course about Transformer library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9460309743881226, 0.03919023275375366, 0.014778840355575085]}
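
One way to picture how this works: the default checkpoint is a natural language inference (NLI) model, and each candidate label is turned into a hypothesis sentence that the model scores against the input text. The sketch below builds those hypotheses. The template string is the pipeline's default as far as I recall, so treat it as an assumption and check it against your installed version. `build_hypotheses` and `classify` are hypothetical helper names.

```python
# Assumed default template of the zero-shot pipeline (verify for your version)
DEFAULT_TEMPLATE = "This example is {}."


def build_hypotheses(labels, template=DEFAULT_TEMPLATE):
    """Turn each candidate label into the sentence the NLI model will score."""
    return [template.format(label) for label in labels]


def classify(text, labels, template=DEFAULT_TEMPLATE):
    """Run zero-shot classification with an explicit model and template."""
    from transformers import pipeline  # lazy import: only needed for real inference

    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    return classifier(text, candidate_labels=labels, hypothesis_template=template)


print(build_hypotheses(["education", "business", "politics"]))
```

A template closer to the domain, for example `"This text is about {}."`, can sometimes change the scores, so it is worth experimenting with.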

**3. Text Generation**

- The main idea of text generation is that we provide a prompt and the model auto-completes it by generating the remaining text.

- It is similar to the predictive text feature found on many phones.

- We can pass arguments such as `num_return_sequences` (how many completions to return) and `max_length` (the maximum total length of each output, in tokens) to control the generated text.

In [7]:
#@ TEXT GENERATION PIPELINE
generator = pipeline("text-generation", model="distilgpt2")

In [8]:
generator("In this course, we will teach you how to use Hugging Face",
          max_length=40,
          num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use Hugging Face ID to learn how to use facial recognition.\n\n\n\n\n\nThe goal of this course is to learn how to use'},
 {'generated_text': 'In this course, we will teach you how to use Hugging Face Recognition.\n\n\n\nYou will probably have heard of this course from a bunch of others, especially, because they were'}]
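
Each `generated_text` above repeats the prompt before the new text. The sketch below separates the continuation from the prompt and shows one way to make sampling repeatable with `set_seed`. `continuations` and `generate` are hypothetical helper names, `max_new_tokens=20` is an arbitrary choice, and the sample output is a shortened copy of the first result above.

```python
def continuations(prompt, outputs):
    """Return only the newly generated part of each pipeline output."""
    texts = []
    for item in outputs:
        text = item["generated_text"]
        # Strip the prompt only if the output really starts with it
        texts.append(text[len(prompt):] if text.startswith(prompt) else text)
    return texts


def generate(prompt, seed=42):
    """Sample two completions reproducibly (downloads distilgpt2 on first use)."""
    from transformers import pipeline, set_seed  # lazy import

    set_seed(seed)  # fix the random state so repeated runs sample the same text
    generator = pipeline("text-generation", model="distilgpt2")
    return generator(
        prompt,
        max_new_tokens=20,       # counts only new tokens, unlike max_length
        num_return_sequences=2,
        do_sample=True,          # sampling is needed to get distinct sequences
    )


prompt = "In this course, we will teach you how to use Hugging Face"
# Shortened copy of the first result shown above
sample = [{"generated_text": prompt + " ID to learn how to use facial recognition."}]
print(continuations(prompt, sample))
```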

**4. Mask Filling**

- The idea of mask filling is to fill in the blanks in a given text.

- The `top_k` argument controls how many candidates are displayed. The model fills in the special `<mask>` word, which is often referred to as the *mask token*.

In [9]:
#@ FILL MASK
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
unmasker("You will be implementing deep <mask> models.", top_k=2)

[{'score': 0.4101862907409668,
  'token': 2239,
  'token_str': ' learning',
  'sequence': 'You will be implementing deep learning models.'},
 {'score': 0.09801976382732391,
  'token': 26739,
  'token_str': ' neural',
  'sequence': 'You will be implementing deep neural models.'}]
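
The mask token is not the same for every model: `<mask>` is the one used by the RoBERTa-style default here, while BERT checkpoints use `[MASK]`. A minimal sketch of a way to avoid hard-coding it is to read it from the pipeline's tokenizer. `fill_template` and `unmask` are hypothetical helper names, and the `{mask}` placeholder is a convention invented for this example.

```python
def fill_template(template, mask_token):
    """Replace the hypothetical {mask} placeholder with a model's mask token."""
    return template.replace("{mask}", mask_token)


def unmask(template, top_k=2):
    """Fill the blank using whatever mask token the loaded model expects."""
    from transformers import pipeline  # lazy import: only needed for real inference

    unmasker = pipeline("fill-mask", model="distilroberta-base")
    text = fill_template(template, unmasker.tokenizer.mask_token)
    return unmasker(text, top_k=top_k)


template = "You will be implementing deep {mask} models."
print(fill_template(template, "<mask>"))   # RoBERTa-style token
print(fill_template(template, "[MASK]"))   # BERT-style token
```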

**5. Named Entity Recognition**  

The NER pipeline identifies entities such as persons, organizations, or locations in a sentence.

- Passing `grouped_entities=True` to the pipeline creation function tells the pipeline to group together the parts of the sentence that correspond to the same entity.

In [11]:
#@ NAMED ENTITY RECOGNITION
ner = pipeline("ner", grouped_entities=True)
ner("My name is Saugat Regmi and I work at Mercantile in Kathmandu")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.99854356,
  'word': 'Saugat Regmi',
  'start': 11,
  'end': 23},
 {'entity_group': 'ORG',
  'score': 0.9829941,
  'word': 'Mercantile',
  'start': 38,
  'end': 48},
 {'entity_group': 'LOC',
  'score': 0.9983037,
  'word': 'Kathmandu',
  'start': 52,
  'end': 61}]
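
The `start` and `end` values in the output are character offsets into the input sentence, so each entity can be recovered by slicing. The sketch below groups the sliced spans by entity type. `entities_by_type` and `build_ner` are hypothetical helper names. As far as I know, newer Transformers releases prefer `aggregation_strategy="simple"` over `grouped_entities=True`, so treat that argument as an assumption to check against your installed version.

```python
from collections import defaultdict


def entities_by_type(text, entities):
    """Group entity spans by type, slicing them out of the original text."""
    grouped = defaultdict(list)
    for entity in entities:
        span = text[entity["start"]:entity["end"]]
        grouped[entity["entity_group"]].append(span)
    return dict(grouped)


def build_ner():
    """Create the NER pipeline with an explicit model (downloads on first use)."""
    from transformers import pipeline  # lazy import: only needed for real inference

    return pipeline(
        "ner",
        model="dbmdz/bert-large-cased-finetuned-conll03-english",
        aggregation_strategy="simple",  # assumed replacement for grouped_entities=True
    )


sentence = "My name is Saugat Regmi and I work at Mercantile in Kathmandu"
# Offsets copied from the notebook output above
entities = [
    {"entity_group": "PER", "start": 11, "end": 23},
    {"entity_group": "ORG", "start": 38, "end": 48},
    {"entity_group": "LOC", "start": 52, "end": 61},
]
print(entities_by_type(sentence, entities))
```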

**6. Question Answering**

- This pipeline answers questions using information from a given context. It extracts the answer from the context rather than generating new text.

In [12]:
#@ QUESTION ANSWERING
question_answerer = pipeline("question-answering")

question_answerer(
    question = "Where do i work?",
    context = 'My name is Saugat Regmi and I work at Mercantile Inc in Kathmandu'
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.8813986778259277,
 'start': 38,
 'end': 52,
 'answer': 'Mercantile Inc'}
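
Because the answer is extracted from the context, `start` and `end` point at the exact characters of the answer. The sketch below uses that to sanity-check a result and to reject low-confidence answers. `checked_answer` is a hypothetical helper, and the `0.5` threshold is an arbitrary choice for illustration.

```python
def checked_answer(context, result, min_score=0.5):
    """Return the answer if it is a real span of the context and confident enough."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"]:
        return None  # offsets and answer text disagree
    if result["score"] < min_score:
        return None  # the model is not confident enough
    return span


context = "My name is Saugat Regmi and I work at Mercantile Inc in Kathmandu"
# Result copied from the notebook output above
result = {
    "score": 0.8813986778259277,
    "start": 38,
    "end": 52,
    "answer": "Mercantile Inc",
}
print(checked_answer(context, result))
print(checked_answer(context, result, min_score=0.9))
```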

**7. Summarization**

- Summarization is the task of reducing a text to a shorter one while keeping the important aspects referenced in the original.

In [13]:
#@ SUMMARIZATION

summarizer = pipeline("summarization")

summarizer("""
  Rajesh Hamal: the son of a diplomat, a scholarly man with a master’s degree in English Literature, a Goodwill Ambassador and recurrent contributor to charity. But you may know him better as the ‘Great Actor’ of Nepal, starring in award winning films such as Deuta (1992), for which he won the first of many Best Actor Awards by the National Film Award, the most prestigious cinematic awards association in Nepal.
Since his film debut in the late 1980’s and his seemingly overnight rise to stardom, he has become one of the most, if not the most, iconic figures of Nepali cinema. His roles ranged in character and variety from light hearted romantic comedy,to seat-gripping action and adventure. His cinematic success has not gone unnoticed; he’s the recipient of three decades worth of Best Actor awards and nominations, in addition to numerous other achievements. It seems as if there is nothing that Mr. Hamal cannot do.
Rajesh Hamal has become common household name in Nepal. His dedicated acting career has redefined the classifications of what it means to be a Nepali actor. His near record-breaking number of films and acting achievements have not only inspired actors (aspiring and seasoned actors alike), but has given Nepali cinema an unparalleled level of credibility, and has changed the country’s cinematic culture towards higher standards of professionalism and quality.
Early Life Before Acting Hamal was born on June 4, 1964 in Tansen, in the heart of Nepal. Though he has starred in a staggering number of films and television shows, his fruitfulcareer in film did not begin until his adult years, his mid-twenties, in the early 1990’s.
He spent the majority of his childhood in hometown in Nepal, attending private school until the eighth grade.As an early teen, he accompanied his father on a cross-continental move to Moscow. At the time, his father was a political diplomat for Nepalese government. Hamal and his father remained in Russia for a number of years. Hamal even began attending college in Moscow. However, he returned to India to finish his formal education in Lahore, at the Punjab University. It was there that he graduated with a M.A. in English Literature.
Hamal’s introduction to the camera began upon his return to India in his college years. His exposure started not in cinematic film, but rather in modelling. His infamous modelling career was relatively short-lived, lasting only for a couple years in the mid 1980’s, buthe became one of India’s most popular male models at the time.  His characteristic long black hair and striking good looks did not go unnoticed.  He appeared sporting the latest clothing fashions on runways in Kathmandu and New Delhi, and was featured in the renowned Indian fashion magazine, Fashion Net.
While he was quite successful as a model, earning the titles of Best Ramp Performer and Model of the Year in 1989, his aspirations for something more would not be satisfied with modelling. His time in the modelling spotlight soon transformed into something much bigger. Altering not only his life, but the entertainment industry of Nepal as well.
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Rajesh Hamal is the son of a diplomat, a scholarly man with a master’s degree in English Literature, a Goodwill Ambassador and a recurrent contributor to charity . He has starred in a staggering number of films and television shows . His dedicated acting career has redefined the classifications of what it means to be a Nepali actor .'}]

**8. Language Translation**

- This pipeline translates words or sentences from one language to another.

- We can specify `max_length` or `min_length` for the result

In [14]:
#@ LANGUAGE TRANSLATION
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", tokenizer=tokenizer)
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]