
In [None]:
# light version of 🤗
!pip install datasets transformers[sentencepiece]
# !pip install torchvision==0.6.1 
# !pip install torch==1.5.1 
# !pip install torchaudio==0.5.1

# Development version, which comes with all the required dependencies for pretty much any imaginable use case
# !pip install transformers[dev]

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation and more in over 100 languages. Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on the Hugging Face model hub. At the same time, each Python module defining an architecture is fully standalone and can be modified to enable quick research experiments.

🤗 Transformers is backed by the three most popular deep learning libraries (JAX, PyTorch and TensorFlow) with seamless integration between them. It's straightforward to train your models with one before loading them for inference with another.

In [None]:
import transformers
from transformers import pipeline

By default, the sentiment-analysis pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the pipeline object. If you rerun the command, the cached model is used and nothing is downloaded again.

There are three main steps involved when you pass some text to a pipeline:
1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.
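The three steps above can be illustrated with a toy, dependency-free sketch. Everything in it is a hypothetical stand-in: the tiny vocabulary, the hand-written scoring rule and the function names are assumptions made for illustration, not the library's implementation. Only the last step, a softmax over per-label scores, mirrors how classification logits are commonly turned into a label and a score.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the 'model' can consume."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Step 2: map token ids to one raw score (logit) per label.

    Stand-in rule: 'love' pushes towards POSITIVE, 'hate' towards NEGATIVE.
    """
    positive = sum(2.0 for t in token_ids if t == VOCAB["love"])
    negative = sum(2.0 for t in token_ids if t == VOCAB["hate"])
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the best label with its score."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does internally."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```

The returned dictionaries have the same `label`/`score` shape as the real pipeline outputs shown in this notebook, which is the only resemblance intended.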

# Transformer models

Learn about:
* How to use the pipeline function to solve NLP tasks such as text generation and classification
* The Transformer architecture
* How to distinguish between encoder, decoder, and encoder-decoder architectures and use cases

Use cases:
* Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
* Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
* Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
* Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
* Generating a new sentence from an input text: Translating a text into another language, summarizing a text

Transformer model categories:
* GPT-like (also called auto-regressive Transformer models)
* BERT-like (also called auto-encoding Transformer models)
* BART/T5-like (also called sequence-to-sequence Transformer models)

### Sentiment Analysis
Classifies each input text as positive or negative and returns the score associated with that label.

In [None]:
sentiment_analysis = pipeline("sentiment-analysis")

In [None]:
sentiment_analysis([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

In [None]:
sentences = ["I've been waiting for a HuggingFace course my whole life.",
             "I've been waiting for a HuggingFace course my whole life. It is too good to be true",
             "Dont tell me it is true"]

# Classify the sentences one at a time and print each result
for sentence in sentences:
    print(sentiment_analysis(sentence))

[{'label': 'POSITIVE', 'score': 0.9598046541213989}]
[{'label': 'POSITIVE', 'score': 0.9949657320976257}]
[{'label': 'POSITIVE', 'score': 0.8956882357597351}]


### Zero-shot classification
* Classifies texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.
* The pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it: it directly returns probability scores for any list of labels you want.
* You specify which labels to use for the classification through candidate_labels, so you don’t have to rely on the labels of the pretrained model.

In [None]:
zero_shot_classification = pipeline("zero-shot-classification")

In [None]:
zero_shot_classification(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business", "NLP", "Natural language processing", "text processing"],
)

{'labels': ['education',
  'NLP',
  'business',
  'text processing',
  'Natural language processing',
  'politics'],
 'scores': [0.8420583009719849,
  0.07702988386154175,
  0.023751623928546906,
  0.022383013740181923,
  0.01998896524310112,
  0.014788143336772919],
 'sequence': 'This is a course about the Transformers library'}

In [None]:
zero_shot_classification(
    "This is about the Transformers library",
    candidate_labels=["education", "politics", "business", "NLP", "Natural language processing", "text processing"],
)

{'labels': ['business',
  'NLP',
  'education',
  'text processing',
  'politics',
  'Natural language processing'],
 'scores': [0.34183067083358765,
  0.25975117087364197,
  0.1664823293685913,
  0.08993060141801834,
  0.07495827972888947,
  0.06704689562320709],
 'sequence': 'This is about the Transformers library'}
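A quick check on the output directly above: the six scores add up to roughly 1, which is what you would expect if the pipeline normalizes scores across the candidate labels (assumed here to be the default single-label behaviour). The `softmax` helper and the toy logits below are illustrative assumptions, not values taken from the model.

```python
import math

# Scores copied from the output directly above.
scores = [
    0.34183067083358765,
    0.25975117087364197,
    0.1664823293685913,
    0.08993060141801834,
    0.07495827972888947,
    0.06704689562320709,
]
total = sum(scores)
print(f"sum of scores: {total:.6f}")


def softmax(logits):
    """Normalize raw scores so they are positive and sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    denom = sum(exps)
    return [e / denom for e in exps]


# Hypothetical per-label logits, only to show the normalization step.
toy_probs = softmax([3.1, 0.7, -0.5])
print(toy_probs, sum(toy_probs))
```

Because the scores compete for a fixed total, adding or removing a candidate label changes every other label's score as well.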

### Text generation
Now let’s use a pipeline to generate some text. You provide a prompt and the model auto-completes it by generating the remaining text, much like the predictive text feature found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.
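To see why repeated runs can differ, here is a toy sketch of sampling-based generation. The three-word distribution and the function names are hypothetical; a real model computes a probability for every token in its vocabulary at each step. The point is only that drawing from a distribution gives varying results unless the random seed is fixed.

```python
import random

# Hypothetical next-word distribution after some prompt.
NEXT_WORD_PROBS = {"use": 0.5, "build": 0.3, "train": 0.2}


def sample_next_word(rng):
    """Draw one continuation at random, weighted by its probability."""
    words = list(NEXT_WORD_PROBS)
    weights = list(NEXT_WORD_PROBS.values())
    return rng.choices(words, weights=weights, k=1)[0]


def sample_many(seed, n=5):
    """Draw n continuations from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_next_word(rng) for _ in range(n)]


print(sample_many(seed=0))
print(sample_many(seed=1))
# The same seed reproduces the same draws.
print(sample_many(seed=0) == sample_many(seed=0))
```

With the real library, fixing the seed before generating (for example with `transformers.set_seed`) should likewise make runs repeatable on the same setup.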

In [None]:
from transformers import pipeline

You can control how many different sequences are generated with the argument num_return_sequences, and the total length of the output text (prompt included, counted in tokens) with the argument max_length.

In [None]:
# uses default model
text_generation = pipeline("text-generation")

In [None]:
text_generation("In this course, we will teach you how to", num_return_sequences=2, max_length=25)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'In this course, we will teach you how to take advantage of various benefits of cryptocurrencies for the real or near future. By'},
 {'generated_text': 'In this course, we will teach you how to use SQL server scripts in Azure.\n\nFor the purposes of this course'}]

Any model listed at https://huggingface.co/models?pipeline_tag=text-generation can be used.

In [None]:
# to choose a specific model from the Hub, use model argument
generator = pipeline("text-generation", model="distilgpt2")

In [None]:
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'In this course, we will teach you how to build your own self-esteem by talking about the importance of self-confidence and responsibility. With this'},
 {'generated_text': 'In this course, we will teach you how to control behavior in self-directed actions. This course will teach you how to control your own behavior and'}]

### Mask filling
* The idea of this task is to fill in the blanks in a given text.
* The top_k argument controls how many candidates are displayed. Note that here the model fills in the special <mask> word, which is often referred to as a mask token. Other mask-filling models might use a different mask token (bert-base-cased below uses [MASK]), so always verify the proper mask word when exploring other models.

In [None]:
fill_mask_unmasker = pipeline("fill-mask")

In [None]:
fill_mask_unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.1961962729692459,
  'sequence': 'This course will teach you all about mathematical models.',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.0405271016061306,
  'sequence': 'This course will teach you all about computational models.',
  'token': 38163,
  'token_str': ' computational'}]

In [None]:
bert_base_unmasker = pipeline('fill-mask', model='bert-base-cased')

In [None]:
bert_base_unmasker("Hello I'm a [MASK] model.")

[{'score': 0.09019170701503754,
  'sequence': "Hello I'm a fashion model.",
  'token': 4633,
  'token_str': 'fashion'},
 {'score': 0.0634998083114624,
  'sequence': "Hello I'm a new model.",
  'token': 1207,
  'token_str': 'new'},
 {'score': 0.06228206306695938,
  'sequence': "Hello I'm a male model.",
  'token': 2581,
  'token_str': 'male'},
 {'score': 0.04417263716459274,
  'sequence': "Hello I'm a professional model.",
  'token': 1848,
  'token_str': 'professional'},
 {'score': 0.03326144814491272,
  'sequence': "Hello I'm a super model.",
  'token': 7688,
  'token_str': 'super'}]

### Named entity recognition
Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.
* Passing grouped_entities=True to the pipeline creation function tells the pipeline to regroup the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. In the post-processing step, the pipeline successfully regrouped those pieces.
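The subword regrouping described above can be sketched in a few lines. This is a simplified illustration, not the pipeline's actual post-processing: `merge_wordpieces` is a hypothetical helper that only handles the `##` continuation prefix, and it does not cover grouping several words into one entity by their labels.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a '##' prefix means 'continues the previous word'."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Glue the continuation onto the word being built.
            words[-1] += piece[2:]
        else:
            # A piece without the prefix starts a new word.
            words.append(piece)
    return words


# The four pieces mentioned in the text above.
print(merge_wordpieces(["S", "##yl", "##va", "##in"]))

# Whole words pass through unchanged.
print(merge_wordpieces(["My", "name", "is", "S", "##yl", "##va", "##in"]))
```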

In [None]:
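# grouped_entities=True merges the tokens that belong to the same entity.
# Recent transformers releases deprecate it in favor of aggregation_strategy="simple".
ner = pipeline("ner", grouped_entities=True)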

In [None]:
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'end': 18,
  'entity_group': 'PER',
  'score': 0.9981694,
  'start': 11,
  'word': 'Sylvain'},
 {'end': 45,
  'entity_group': 'ORG',
  'score': 0.97960204,
  'start': 33,
  'word': 'Hugging Face'},
 {'end': 57,
  'entity_group': 'LOC',
  'score': 0.99321055,
  'start': 49,
  'word': 'Brooklyn'}]

### Question answering
Note that this pipeline works by extracting information from the provided context; it does not generate the answer. The start and end values in the result are character offsets into the context.

In [None]:
question_answerer = pipeline("question-answering")

In [None]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

{'answer': 'Hugging Face', 'end': 45, 'score': 0.6949764490127563, 'start': 33}
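Since the answer is extracted rather than generated, the `start` and `end` values in the result should point straight into the context string. A small check using the values printed above (the variable names are ours):

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the output above.
result = {"answer": "Hugging Face", "end": 45, "score": 0.6949764490127563, "start": 33}

# Slice the context with the character offsets returned by the pipeline.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

The same slicing works on the `start`/`end` offsets returned by the NER pipeline earlier in this notebook.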

### Summarization
Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. 

In [None]:
summarizer = pipeline("summarization")

In [None]:
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""")

[{'summary_text': 'the number of graduates in traditional engineering disciplines has declined . in most of the premier american universities engineering curricula now concentrate on and encourage largely the study of engineering science . rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

As with text generation, you can specify a max_length or a min_length for the result.

In [None]:
summarizer("""
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
""", max_length=20)

[{'summary_text': 'in most of the premier american universities engineering curricula now concentrate on and encourage largely'}]

In [25]:
# Sentence embeddings with the DeCLUTR model, loaded through Transformers
# Goal: understand how tokenization differs when sentences are separated by a space versus a full stop

# https://huggingface.co/johngiorgi/declutr-small
import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")

# Prepare some text to embed
text = [
    "A smiling costumed woman is holding an umbrella. A happy woman in a fairy costume holds an umbrella.",
    "A smiling costumed woman is holding an umbrella A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
print("\ninputs:",inputs)

# Decode the token ids back to text to see how each input was tokenized
for ids in inputs["input_ids"]:
    print("\ndetokenized:", tokenizer.decode(ids))

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print("\nsemantic similarity:",semantic_sim)
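The masked mean pooling and cosine similarity used in the cell above can be traced by hand on a tiny example. This is a dependency-free sketch with hypothetical 2-dimensional "token embeddings"; the helper names are ours, and the real code operates on batched torch tensors instead of Python lists.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    # Guard against division by zero, like the clamp in the torch version.
    denom = max(sum(attention_mask), 1e-9)
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, attention_mask)) / denom
        for d in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token embeddings; the last row stands for a padding position.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)

# The padded row is ignored, so the result matches the plain mean of the first two rows.
print(cosine_similarity(pooled, [2.0, 1.0]))
```

Without the mask, the padding row would be averaged in and shift the sentence embedding, which is why the attention mask is multiplied in before summing.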
