<a href="https://colab.research.google.com/github/toddwalters/pgaiml-python-coding-examples/blob/main/deep-learning/C8/11_05_Introduction_to_BERT__V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# __Introduction to BERT and Transformers Library__
- BERT stands for Bidirectional Encoder Representations from Transformers.
- BERT is pre-trained on a large corpus of unlabeled text: English Wikipedia (about 2,500 million words) and BooksCorpus (about 800 million words).
- BERT is based on the Transformer architecture.

## Steps to Be Followed:
1. Importing required libraries
2. Analyzing sentiment using transformer pipeline
3. Generating text
4. Performing named entity recognition (NER)
5. Performing masked language modeling using a model and a tokenizer

### Step 1: Importing Required Libraries
- The statement `from transformers import pipeline` imports the `pipeline` function, which gives easy access to pre-trained models and simplifies running NLP tasks with the Transformers library.



In [None]:
from transformers import pipeline

### Step 2: Analyzing Sentiment Using Transformer Pipeline

- Import the `pipeline` function from the Transformers library, which provides easy access to pre-trained NLP models.
- Create a sentiment analysis pipeline, which loads a default pre-trained model, and use it to classify the sentiment of the input text **I hate you**.
- Print the result, which includes the sentiment label and its score.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

- Perform sentiment analysis on the text **I love you**.
- Print the sentiment analysis result.

In [None]:
result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

**Observation**
- The sentiment analysis model is highly confident that the sentiment of the text **I love you** is positive, with a score of 0.9999.
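
The score reported by the pipeline is a probability. As a rough sketch of where such a number comes from, a classifier typically produces one raw logit per label, and a softmax turns those logits into probabilities that sum to 1. The logit values and the `softmax` helper below are illustrative assumptions, not output from the actual model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two labels; real values depend on the model.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.3, 4.9]

probs = softmax(logits)
# Pick the label with the highest probability, as a classifier would.
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```

A large gap between the two logits pushes the winning probability close to 1, which is why short, unambiguous inputs tend to receive very high scores.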

### Step 3: Generating Text
- Create a text generation pipeline using the `pipeline` function from the Transformers library.
- Generate text from the prompt **As far as I am concerned, I will** with a maximum length of 50 tokens (`max_length=50`) and sampling disabled (`do_sample=False`), which makes the output deterministic.
- Print the generated text.
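
With sampling disabled, generation picks the highest-scoring next token at every step (greedy decoding, assuming a single beam), which is why repeated runs give the same text. A minimal sketch of that idea follows, using a made-up table of next-token scores in place of a real language model. The table, the `greedy_generate` helper, and the token strings are all hypothetical.

```python
# Hypothetical next-token scores standing in for a language model.
NEXT_TOKEN_SCORES = {
    "I": {"will": 0.6, "am": 0.3, "can": 0.1},
    "will": {"not": 0.5, "be": 0.4, "go": 0.1},
    "not": {"be": 0.7, "go": 0.3},
    "be": {"able": 0.8, "here": 0.2},
    "able": {"to": 1.0},
}

def greedy_generate(prompt_tokens, max_length):
    """Extend the prompt by always choosing the highest-scoring next token."""
    tokens = list(prompt_tokens)
    # max_length counts the prompt tokens as well as the generated ones.
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:  # No known continuation: stop early.
            break
        # Greedy step: take the argmax, never a random sample.
        tokens.append(max(candidates, key=candidates.get))
    return tokens

first = greedy_generate(["I"], max_length=6)
second = greedy_generate(["I"], max_length=6)
print(" ".join(first))
print("Identical on a second run:", first == second)
```

Sampling would instead draw the next token at random in proportion to its score, so two runs could diverge after the first step.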

In [None]:
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))


### Step 4: Performing Named Entity Recognition (NER)
- Create a NER pipeline using the `pipeline` function from the Transformers library.

- Apply the NER pipeline to a sample sequence that contains named entities. The pipeline identifies entities such as organizations (**Hugging Face Inc.**) and locations (**New York City**), and the extracted entities are then printed.

In [None]:
ner_pipe = pipeline("ner")
sequence = """Hugging Face Inc. is a company based in New york city. Manhattan bridge is visible from the window."""

- Print the entities found by running named entity recognition on the sequence.

In [None]:
for entity in ner_pipe(sequence):
    print(entity)

In [None]:
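import pandas as pd

# aggregation_strategy="simple" groups adjacent tokens of the same entity type
# into a single entity instead of returning one result per token.
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(sequence)
pd.DataFrame(outputs)  # Show the grouped entities as a table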
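
The difference between the two NER cells above is that the first returns one prediction per token, while the second merges neighboring tokens into whole entities. The sketch below is a rough approximation of that grouping idea, not the library's actual implementation. The `group_entities` helper and the token-level predictions are hypothetical, and real scores and word-piece splits will differ.

```python
def group_entities(token_predictions):
    """Merge consecutive tokens of the same entity type into one span."""
    groups = []
    for pred in token_predictions:
        # Split a tag such as "I-ORG" into its prefix and entity type.
        prefix, _, entity_type = pred["entity"].partition("-")
        last = groups[-1] if groups else None
        continues = (
            last is not None
            and prefix != "B"  # A "B-" tag always opens a new span.
            and last["entity_group"] == entity_type
            and pred["index"] == last["end_index"] + 1
        )
        if continues:
            # "##" marks a WordPiece continuation, so join without a space.
            if pred["word"].startswith("##"):
                last["word"] += pred["word"][2:]
            else:
                last["word"] += " " + pred["word"]
            last["scores"].append(pred["score"])
            last["end_index"] = pred["index"]
        else:
            groups.append({
                "entity_group": entity_type,
                "word": pred["word"],
                "scores": [pred["score"]],
                "end_index": pred["index"],
            })
    # Report the mean token score per span and drop the bookkeeping fields.
    return [
        {
            "entity_group": g["entity_group"],
            "word": g["word"],
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]

# Hypothetical token-level predictions, shaped like per-token NER output.
tokens = [
    {"entity": "I-ORG", "index": 1, "word": "Hu", "score": 0.99},
    {"entity": "I-ORG", "index": 2, "word": "##gging", "score": 0.97},
    {"entity": "I-ORG", "index": 3, "word": "Face", "score": 0.98},
    {"entity": "I-ORG", "index": 4, "word": "Inc", "score": 0.99},
    {"entity": "I-LOC", "index": 11, "word": "New", "score": 0.96},
]
entities = group_entities(tokens)
for entity in entities:
    print(entity)
```

Grouping makes the results easier to read in a table, at the cost of hiding how confident the model was about each individual token.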


### Step 5: Performing Masked Language Modeling Using a Model and a Tokenizer

- Masked language modeling is a task in which a model predicts tokens that have been masked out of a sequence, using the context of the surrounding words on both sides.

- The process includes the following steps:
  - Instantiate a tokenizer and a model from the checkpoint name.
  - Define a sequence with a masked token, placing `tokenizer.mask_token` where a word would go.
  - Encode that sequence into a list of IDs and find the position of the masked token in that list.
  - Retrieve the predictions at the index of the masked token.
  - Retrieve the top 5 tokens using the PyTorch `topk` or TensorFlow `top_k` methods.
  - Replace the masked token with each of these tokens and print the results.
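
The steps above can be sketched without any model at all, which makes the index handling easier to follow. This is only an illustration under stated assumptions: the token list, the tiny vocabulary, and the logit values are all made up, and a real model would supply one logit per entry of a much larger vocabulary.

```python
MASK = "[MASK]"

# Hypothetical tokenized input and a tiny made-up vocabulary.
tokens = ["[CLS]", "help", MASK, "our", "carbon", "footprint", "[SEP]"]
vocab = ["reduce", "increase", "offset", "banana", "lower", "the", "minimize"]

# Made-up logits for the masked position: one score per vocabulary entry.
mask_logits = [9.1, 6.4, 5.2, -3.0, 7.8, 0.5, 6.9]

# Find the position of the masked token.
mask_index = tokens.index(MASK)

# Take the indices of the 5 largest logits, which is what a top-k call returns.
top_5_ids = sorted(
    range(len(vocab)), key=lambda i: mask_logits[i], reverse=True
)[:5]

# Substitute each candidate back into the sequence.
candidates = []
for token_id in top_5_ids:
    filled = tokens[:mask_index] + [vocab[token_id]] + tokens[mask_index + 1:]
    candidates.append(" ".join(filled[1:-1]))  # Drop [CLS] and [SEP] for display.

for sentence in candidates:
    print(sentence)
```

The candidates come out ordered from the highest logit to the lowest, so the first printed sentence is the model's preferred completion.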

### Masked Language Modeling
- Import the necessary modules from the transformers library and torch
- Load the pre-trained tokenizer and model
- Define the input sequence with a masked token
- Tokenize the input sequence and convert to tensors
- Find the index of the masked token and generate token predictions using the model
- Get the indices of the top 5 predicted tokens and print them in the sequence
- Print the top 5 predicted tokens in the masked position

In [None]:
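from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the tokenizer and the masked language model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

# Build the input, using the tokenizer's own mask token as the placeholder
sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

# Tokenize the sequence; "pt" returns PyTorch tensors
inputs = tokenizer(sequence, return_tensors="pt")
# Position of the masked token within the input IDs
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Logits have shape (batch_size, sequence_length, vocab_size)
token_logits = model(**inputs).logits
# Keep only the logits at the masked position
mask_token_logits = token_logits[0, mask_token_index, :]

# IDs of the 5 highest-scoring tokens for the masked position
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

# Substitute each candidate into the original sequence and print it
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))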

In [None]:
# Top 5 predictions at position 0 (the special [CLS] token)
position_logits = token_logits[0, 0, :]
top_5_tokens = torch.topk(position_logits, 5, dim=0).indices.tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))

In [None]:
# Top 5 predictions at position 1 (the first token of the sentence, possibly a sub-word piece)
position_logits = token_logits[0, 1, :]
top_5_tokens = torch.topk(position_logits, 5, dim=0).indices.tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))

In [None]:
# Top 5 predictions at position 2
position_logits = token_logits[0, 2, :]
top_5_tokens = torch.topk(position_logits, 5, dim=0).indices.tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))

In [None]:
# Top 5 predictions at position 3
position_logits = token_logits[0, 3, :]
top_5_tokens = torch.topk(position_logits, 5, dim=0).indices.tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))

In [None]:
# Top 5 predictions at position 4
position_logits = token_logits[0, 4, :]
top_5_tokens = torch.topk(position_logits, 5, dim=0).indices.tolist()
for token in top_5_tokens:
    print(tokenizer.decode([token]))

In [None]:
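# Shape: (batch_size, sequence_length, vocab_size)
token_logits.shape

In [None]:
# Select the logits at the masked position only
mask_token_logits = token_logits[0, mask_token_index, :]

In [None]:
# Shape: (number of masked tokens, vocab_size)
mask_token_logits.shape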

**Observation**
- The output shows the input sentence five times, each with the masked token replaced by one of the model's top 5 predictions for that position.