Code augmented from:
* https://www.databites.tech/p/hugging-face-use-cases-and-applications
* https://github.com/rcalix1/TransferLearning/blob/main/HuggingfaceTransformers/2024/inClass_intro_HF.ipynb
* https://github.com/rcalix1/TransferLearning/blob/main/HuggingfaceTransformers/2024/HelloWorldGTP2.ipynb
* Professor Ricardo Calix, Purdue University Northwest "Introduction to HuggingFace's Transformers module", https://www.youtube.com/watch?v=D5fuMjkJf6k 

The purpose of this notebook is to provide a few use cases of transfer learning, using pre-trained language models to perform specific tasks. HuggingFace hosts a large collection of trained language models. With transfer learning, we can apply these pre-trained models to new domains and knowledge tasks.

<br>Notebook is by Solomon Sonya 0xSolomonSonya
<br>Most code and data cells in this notebook have been augmented from ChatGPT, Copilot, Gemini, other Generative AI models, and online resources.

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on enabling computers to understand, interpret and generate human language. NLP bridges the gap between human communication and computer understanding.

Large Language Models (LLMs) like BERT and GPT have significantly advanced the state of Natural Language Processing. These models, trained on massive text datasets (and, for some models, code), capture much of the structure and semantics of language.

Using these advanced LLMs, we can perform linguistic tasks such as:

* Text classification: Categorizing text into predefined categories (e.g., spam detection, topic labeling).
* Question answering: Providing answers to questions posed in natural language based on a given context or knowledge base.
* Named entity recognition: Identifying and classifying named entities (e.g., people, organizations, locations) within text.
* Sentiment analysis: Determining the emotional tone or attitude expressed in a piece of text (e.g., positive, negative, neutral).
* Text prediction: Predicting the next word or sequence of words in a given text (e.g., autocomplete, predictive keyboard).
* Summarization: Condensing a longer piece of text into a shorter, concise summary while preserving the key information.
* Text generation: Creating new text, such as stories, poems, articles, or code, based on a given prompt or context.

- source: Gemini
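
As a rough orientation, most of the tasks above correspond to a task name accepted by the `pipeline(...)` function from `transformers`. The sketch below is only a plain dictionary, not an API call. The variable names are illustrative, and "text prediction" is paired with `fill-mask` simply because that is how this notebook demonstrates it.

```python
# Illustrative mapping from the tasks listed above to Hugging Face pipeline task names.
TASK_TO_PIPELINE = {
    "text classification": "text-classification",
    "question answering": "question-answering",
    "named entity recognition": "token-classification",
    "sentiment analysis": "sentiment-analysis",
    "text prediction": "fill-mask",
    "summarization": "summarization",
    "text generation": "text-generation",
}

for task, pipeline_name in TASK_TO_PIPELINE.items():
    print(f"{task:26s} -> pipeline('{pipeline_name}')")
```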

<br>BERT models are based on the Transformer encoder. Given text, a BERT model can classify it or predict the probability of a masked token (e.g., fill in the blank [MASK], sentiment analysis). BERT models look in both directions (previous and next words).
<br>GPT models are based on the Transformer decoder. Given the words so far, a GPT model calculates the probability of the next word, reading the text in a single direction (left to right).
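
One way to picture the difference between the two families is the attention mask each one uses. The sketch below builds both masks in plain Python for a hypothetical 4-token sentence: a 1 at row i, column j means token i may look at token j. It illustrates the masking idea only and is not how the library implements attention.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every token can see every other token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style (GPT): token i can see only tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "is", "happy"]  # hypothetical example sentence
n = len(tokens)

print("Encoder (bidirectional):")
for row in bidirectional_mask(n):
    print(row)

print("Decoder (causal):")
for row in causal_mask(n):
    print(row)
```

The lower-triangular shape of the causal mask is what makes a decoder predict each word from the previous words only.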

In [3]:
#!pip install SentencePiece
#!pip install torch
#!pip install datasets
#!pip install --upgrade --force-reinstall pyarrow datasets

In [4]:
import warnings
warnings.filterwarnings("ignore")

<hr style="height:75px;color:#000;background-color:#000;">

# Text Prediction - Viewing Embeddings

In [17]:
%%time
# code in this section is augmented from source: 
#        https://github.com/rcalix1/TransferLearning/blob/main/HuggingfaceTransformers/2024/inClass_intro_HF.ipynb

from transformers import AlbertTokenizer, AlbertModel
import pandas as pd
from transformers import pipeline

# we will be using the tokenizer that matches the ALBERT model (a lighter BERT variant)
# each model is pre-trained for specific task(s); we can then transfer the pretrained model to our domain with fine-tuning
# the tokenizer transforms text into integer token ids
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model     = AlbertModel.from_pretrained("albert-base-v2")


text = "this dog is very happy."

# the encoded input is a set of PyTorch tensors of token ids, not text
encoded_input = tokenizer(text, return_tensors='pt')

# the model returns one embedding vector per token (last_hidden_state) plus a pooled vector (pooler_output)
output = model(**encoded_input)
print(output)

BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 0.9141,  1.2348, -0.6978,  ...,  0.3912, -1.2460,  1.1479],
         [ 0.8212, -0.0263,  0.6866,  ..., -0.2959,  1.2790, -0.6806],
         [-0.1347,  0.1948, -0.3194,  ..., -0.3486,  2.2491, -1.0724],
         ...,
         [-0.2334, -0.3141,  0.0913,  ...,  1.6683,  0.2309,  0.5323],
         [ 0.3367,  0.5625, -0.2566,  ..., -0.4797,  1.4007,  0.3033],
         [ 0.0500,  0.1402, -0.0665,  ..., -0.0931,  0.1098,  0.2176]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.4812, -0.4942,  0.9253, -0.9148, -0.7215, -0.9468,  0.4822, -0.4793,
          0.5641, -0.9999,  0.8975,  0.3981, -0.6010, -0.9836, -0.9098, -0.4541,
          0.4660,  0.4578,  0.9904, -0.4105, -0.8281, -0.9952,  0.9985,  0.9632,
          0.6430, -0.4528,  0.5644, -0.9889, -0.9986, -0.4758, -1.0000,  0.5501,
          0.5359,  0.4877,  0.5160, -0.4009,  0.5471,  0.9606, -0.5019,  0.4890,
          0.4574, -0.9673, -0.8559,  0.4776,  0.49
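
The tokenizer step in the cell above turns text into integer ids before the model sees it. As a toy illustration (a whitespace tokenizer with a made-up vocabulary, not ALBERT's actual SentencePiece tokenizer), encoding and decoding can be sketched as:

```python
# Hypothetical vocabulary for illustration only; real tokenizers learn
# tens of thousands of subword pieces from data.
vocab = {"[UNK]": 0, "this": 1, "dog": 2, "is": 3, "very": 4, "happy": 5, ".": 6}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Lowercase, separate periods from words, and map each token to its id."""
    words = text.lower().replace(".", " .").split()
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

def decode(ids):
    """Map ids back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("this dog is very happy.")
print(ids)
print(decode(ids))
```

Words missing from the vocabulary fall back to the `[UNK]` id, which is one reason real tokenizers split text into subword pieces instead of whole words.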

<hr style="height:75px;color:#000;background-color:#000;">

# Fill Mask

In [20]:
fillmask = pipeline('fill-mask', model='albert-base-v2')

# [MASK] marks the position where a word should be filled in
res = pd.DataFrame(fillmask("The cat is so [MASK] ."))

# the result is a dataframe with the model's probability for each candidate token at the masked position
# the token with the greatest probability is the most likely fill
# "cute" is the most likely candidate here
res

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForMaskedLM: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Unnamed: 0,score,token,token_str,sequence
0,0.281032,10901,cute,the cat is so cute .
1,0.094896,26354,adorable,the cat is so adorable .
2,0.042963,1700,happy,the cat is so happy .
3,0.040976,5066,funny,the cat is so funny .
4,0.024234,28803,affectionate,the cat is so affectionate .
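
Reading the table can also be done in code. The sketch below copies the five rows above into a plain list and picks the highest-scoring token. Because the scores are probabilities over the whole vocabulary, the five rows shown do not sum to 1; the variable names are illustrative.

```python
# Rows copied from the fill-mask output above: (token_str, score).
candidates = [
    ("cute", 0.281032),
    ("adorable", 0.094896),
    ("happy", 0.042963),
    ("funny", 0.040976),
    ("affectionate", 0.024234),
]

best_token, best_score = max(candidates, key=lambda row: row[1])
top5_mass = sum(score for _, score in candidates)

print(f"most likely fill: {best_token} ({best_score:.3f})")
print(f"probability mass covered by the top 5: {top5_mass:.3f}")
```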


In [21]:
res1 = pd.DataFrame(fillmask("El chapo is a  [MASK] person."))
res1

Unnamed: 0,score,token,token_str,sequence
0,0.031417,27668,figurative,el chapo is a figurative person.
1,0.028689,18496,franciscan,el chapo is a franciscan person.
2,0.025276,9650,dominican,el chapo is a dominican person.
3,0.02296,19210,moroccan,el chapo is a moroccan person.
4,0.017772,14484,basque,el chapo is a basque person.


In [27]:
res2 = pd.DataFrame(fillmask("Anna is a  [MASK] person."))
res2

Unnamed: 0,score,token,token_str,sequence
0,0.059892,5934,wonderful,anna is a wonderful person.
1,0.057768,5066,funny,anna is a funny person.
2,0.047016,254,good,anna is a good person.
3,0.046513,8601,lovely,anna is a lovely person.
4,0.038441,2210,nice,anna is a nice person.


In [23]:
res3 = pd.DataFrame(fillmask("Michael is a  [MASK] person."))
res3

Unnamed: 0,score,token,token_str,sequence
0,0.090354,5934,wonderful,michael is a wonderful person.
1,0.056277,254,good,michael is a good person.
2,0.054211,5066,funny,michael is a funny person.
3,0.051298,374,great,michael is a great person.
4,0.049223,2210,nice,michael is a nice person.


In [24]:
# Example of bias learned from the pretraining data: the model strongly prefers "she" for a nurse
res3 = pd.DataFrame(fillmask("The nurse is examining the patient.  [MASK] is writing down notes."))
res3

Unnamed: 0,score,token,token_str,sequence
0,0.416072,39,she,the nurse is examining the patient. she is wri...
1,0.227184,28153,joyah,the nurse is examining the patient. joyah is w...
2,0.140536,29833,evalle,the nurse is examining the patient. evalle is ...
3,0.007875,24,he,the nurse is examining the patient. he is writ...
4,0.003686,23512,jaenelle,the nurse is examining the patient. jaenelle i...


In [25]:
# Same sentence with "doctor": "he" and "she" now receive almost the same probability
res3 = pd.DataFrame(fillmask("The doctor is examining the patient.  [MASK] is writing down notes."))
res3

Unnamed: 0,score,token,token_str,sequence
0,0.210027,24,he,the doctor is examining the patient. he is wri...
1,0.208616,39,she,the doctor is examining the patient. she is wr...
2,0.108847,28153,joyah,the doctor is examining the patient. joyah is ...
3,0.071828,29833,evalle,the doctor is examining the patient. evalle is...
4,0.010178,1687,doctor,the doctor is examining the patient. doctor is...


In [28]:
res3 = pd.DataFrame(fillmask("The pilot is flying the plane.  [MASK] is doing a great job!"))
res3

Unnamed: 0,score,token,token_str,sequence
0,0.341382,24,he,the pilot is flying the plane. he is doing a g...
1,0.144531,29833,evalle,the pilot is flying the plane. evalle is doing...
2,0.067653,39,she,the pilot is flying the plane. she is doing a ...
3,0.055728,28153,joyah,the pilot is flying the plane. joyah is doing ...
4,0.012015,1266,everyone,the pilot is flying the plane. everyone is doi...
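
To make the comparison across the three occupations easier to see, the sketch below copies the "he" and "she" scores from the outputs above and computes their ratio. This is only one rough way to quantify the skew, not a standard bias metric.

```python
# (he, she) scores copied from the three fill-mask outputs above.
pronoun_scores = {
    "nurse":  {"he": 0.007875, "she": 0.416072},
    "doctor": {"he": 0.210027, "she": 0.208616},
    "pilot":  {"he": 0.341382, "she": 0.067653},
}

ratios = {}
for role, scores in pronoun_scores.items():
    ratios[role] = scores["he"] / scores["she"]
    print(f"{role:6s}  P(he)/P(she) = {ratios[role]:.2f}")
```

A ratio near 1 means the model treats both pronouns as about equally likely; values far from 1 in either direction indicate a learned association between the occupation and a gender.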


<hr style="height:75px;color:#000;background-color:#000;">

# Text Summarization

In [None]:
%%time
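from transformers import pipeline, BartTokenizer, BartForConditionalGeneration

# pipeline("summarization") with no model falls back to a default checkpoint (see the warning below);
# this object is not used again in this notebook
summarizer = pipeline("summarization")

# load the same checkpoint explicitly so the model and tokenizer are pinned
model     = BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-cnn-12-6')

tokenizer = BartTokenizer.from_pretrained('sshleifer/distilbart-cnn-12-6')

nlp = pipeline("summarization", model=model, tokenizer=tokenizer)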

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
text = '''
Aviation is the activities surrounding mechanical flight and the aircraft industry. Aircraft includes fixed-wing and rotary-wing types, 
morphable wings, wing-less lifting bodies, as well as lighter-than-air craft such as hot air balloons and airships.
Aviation began in the 18th century with the development of the hot air balloon, an apparatus capable of atmospheric displacement through buoyancy. 
Some of the most significant advancements in aviation technology came with the controlled gliding flying of Otto Lilienthal in 1896; then a large 
step in significance came with the construction of the first powered airplane by the Wright brothers in the early 1900s. Since that time, aviation 
has been technologically revolutionized by the introduction of the jet which permitted a major form of transport throughout the world.
'''

q = nlp(text)

print(q)


<hr style="height:75px;color:#000;background-color:#000;">

# Text Classification - Categorization

In [12]:
%%time
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cpu"

sequence = "I am going to france" 

label = ['travel', 'cooking', 'dancing']

nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

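premise    = sequence
# NOTE: label is a list, so this hypothesis contains the whole list (see the printed output);
# zero-shot NLI normally scores one hypothesis per label, e.g. f'This example is {label[0]}.'
hypothesis = f'This example is {label}.'

print(hypothesis)

# run through model pre-trained on MNLI
# truncation='only_first' replaces the deprecated truncation_strategy argument
x = tokenizer.encode(premise, hypothesis, return_tensors='pt', truncation='only_first')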

This example is ['travel', 'cooking', 'dancing'].
CPU times: user 3.9 s, sys: 74.3 ms, total: 3.98 s
Wall time: 4.95 s


In [13]:
x

tensor([[    0,   100,   524,   164,     7,  6664,  2389,     2,     2,   713,
          1246,    16, 47052, 28881,  3934,   128, 35190,   154,  3934,   128,
           417,  7710,   108,  8174,     2]])

In [14]:
# the three logits are ordered [contradiction, neutral, entailment] for this model
logits = nli_model(x.to(device))[0]
logits

tensor([[-2.0327,  1.3776,  0.6796]], grad_fn=<AddmmBackward0>)
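
Raw logits are hard to read, so a common next step is to turn them into probabilities. The sketch below applies a softmax to the logits printed above, using only the standard library. It assumes the class order contradiction, neutral, entailment used by `facebook/bart-large-mnli`. Because the hypothesis in this cell contained the whole label list, the numbers illustrate the mechanics only and should not be read as a score for any single label.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the cell above; bart-large-mnli orders its classes
# as contradiction, neutral, entailment.
logits = [-2.0327, 1.3776, 0.6796]
contradiction, neutral, entailment = logits

three_way = softmax(logits)
# Common zero-shot convention: ignore "neutral" and compare only
# entailment against contradiction.
p_entailment = softmax([contradiction, entailment])[1]

print([round(p, 4) for p in three_way])
print(round(p_entailment, 4))
```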

<hr style="height:75px;color:#000;background-color:#000;">

In [15]:
from transformers import pipeline

In [16]:
print(pipeline('sentiment-analysis')('we love you'))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998704195022583}]


In [17]:
print(pipeline('sentiment-analysis')('we hate you'))



[{'label': 'NEGATIVE', 'score': 0.9988259673118591}]


In [18]:
print(pipeline('sentiment-analysis')('my cat kind of likes you'))



[{'label': 'POSITIVE', 'score': 0.9996534585952759}]


In [19]:
print(pipeline('sentiment-analysis')('i am going to the store'))



[{'label': 'NEGATIVE', 'score': 0.9846875071525574}]


In [20]:
print(pipeline('sentiment-analysis')('bacon'))



[{'label': 'POSITIVE', 'score': 0.8937287926673889}]


In [21]:
print(pipeline('sentiment-analysis')('the'))



[{'label': 'POSITIVE', 'score': 0.9635980725288391}]
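
The default model here was fine-tuned on SST-2, which has only POSITIVE and NEGATIVE labels, so neutral inputs such as "the" still receive a confident label. A tempting workaround is a confidence threshold. The sketch below replays the recorded outputs against a hypothetical cutoff of 0.95 (an arbitrary value chosen for illustration) to show that this catches only some of the neutral inputs.

```python
# (text, label, score) triples copied from the outputs above.
recorded = [
    ("we love you", "POSITIVE", 0.9998704195022583),
    ("we hate you", "NEGATIVE", 0.9988259673118591),
    ("my cat kind of likes you", "POSITIVE", 0.9996534585952759),
    ("i am going to the store", "NEGATIVE", 0.9846875071525574),
    ("bacon", "POSITIVE", 0.8937287926673889),
    ("the", "POSITIVE", 0.9635980725288391),
]

THRESHOLD = 0.95  # hypothetical cutoff, chosen only for illustration

def with_threshold(label, score, threshold=THRESHOLD):
    """Keep the model's label only when its score clears the cutoff."""
    return label if score >= threshold else "UNCERTAIN"

decisions = {text: with_threshold(label, score) for text, label, score in recorded}
for text, decision in decisions.items():
    print(f"{text!r:28} -> {decision}")
```

Only "bacon" falls below the cutoff; "the" and "i am going to the store" keep their labels, which suggests that a model with an explicit neutral class is a better fit for such inputs than thresholding a two-class model.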


<hr style="height:75px;color:#000;background-color:#000;">

# Text Classification - Text Categorization/Labeling

In [22]:
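# NOTE: %time (single %) times only the empty statement on its own line, which is why the
# timing below is in microseconds; use %%time as the first line to time the whole cell
%time
from transformers import pipeline
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cpu"

# build the zero-shot classifier once; it is reused in the next cell
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")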

CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 7.87 μs


Device set to use cpu


In [23]:
sequence_to_classify = "i like that tobacco rates have fallen"

candidate_labels = ['cars', 'smoking', 'pets']

result = classifier(sequence_to_classify, candidate_labels)

print(result)

{'sequence': 'i like that tobacco rates have fallen', 'labels': ['smoking', 'pets', 'cars'], 'scores': [0.9401766657829285, 0.03462902829051018, 0.025194326415657997]}
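
The result is a plain dictionary whose `labels` and `scores` lists are parallel and sorted from most to least likely. A minimal sketch of reading it, using the values printed above (in the default single-label mode the scores sum to roughly 1):

```python
# Result dictionary copied from the output above.
result = {
    "sequence": "i like that tobacco rates have fallen",
    "labels": ["smoking", "pets", "cars"],
    "scores": [0.9401766657829285, 0.03462902829051018, 0.025194326415657997],
}

# Labels and scores are parallel lists, sorted from most to least likely.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

for label, score in ranked:
    print(f"{label:8s} {score:.3f}")
print("predicted label:", top_label)
```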


<hr style="height:75px;color:#000;background-color:#000;">

# Text Classification - Sentiment Analysis

Text classification is a core task in NLP that involves assigning one or more categories to a given input text. It has a wide range of applications, including spam detection, sentiment analysis, topic labeling, and more. - source: https://www.databites.tech/p/hugging-face-use-cases-and-applications

In [24]:
%%time
# using transfer learning, we will classify text as positive, negative, or neutral

# We import the pipeline function from the transformers library
from transformers import pipeline

# We load the pre-trained text classification model.
classifier = pipeline("text-classification", model='lxyuan/distilbert-base-multilingual-cased-sentiments-student')

# Input to be classified (named `text` to avoid shadowing Python's built-in input())
text = "I truly love the hugging face library!"

# Perform classification
output = classifier(text)

# Observe the result
print(output)

Device set to use cpu


[{'label': 'positive', 'score': 0.9797219038009644}]
CPU times: user 323 ms, sys: 94.2 ms, total: 418 ms
Wall time: 1.06 s


<hr style="height:75px;color:#000;background-color:#000;">

# Text generation - Text Prediction

Many of you are probably familiar with tools like ChatGPT, Claude, or Google Gemini—platforms that generate text based on an input prompt. This process, known as text generation, is an area of NLP where a model creates human-like responses from a given input.
Text generation has a wide range of uses, from building chatbot conversations to generating creative content. The core idea is to train a model on massive text datasets, allowing it to learn the patterns, styles, and structures of natural language. As you might expect, the most resource-intensive part is training the model itself. - source: https://www.databites.tech/p/hugging-face-use-cases-and-applications
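
To make the idea of "predicting the next word, then repeating" concrete, here is a deliberately tiny sketch. It swaps the transformer for a bigram counter over a made-up corpus and uses greedy decoding; the corpus, function name, and `max_length` default are all illustrative assumptions, and real models such as GPT-2 work on subword tokens with far richer context.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model is trained on billions of tokens.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count which word follows which (a bigram model, not a transformer).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def generate(prompt, max_length=6):
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.split()
    while len(words) < max_length and words[-1] in following:
        next_word = following[words[-1]].most_common(1)[0][0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))
```

The loop stops once `max_length` words are reached, which mirrors the role of the `max_length` argument in the text-generation pipeline.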

In [25]:
%%time

# We import the pipeline function from the transformers library
from transformers import pipeline

# We load the pre-trained text generation model.
generator = pipeline('text-generation', model='gpt2')

# Prompt for the model to continue
prompt = "Although AI is just starting today,"

# Generate the new text (max_length counts the prompt tokens plus the generated tokens)
generated_text = generator(prompt, max_length=50)[0]['generated_text']

# Observe the result
print(generated_text)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Although AI is just starting today, I wonder if the next step will be one based on a few very specific algorithms. You've probably already said:

1. Go to the machine learning section, and create your portfolio of deep learning algorithms.
CPU times: user 2.62 s, sys: 243 ms, total: 2.86 s
Wall time: 4.99 s


<hr style="height:75px;color:#000;background-color:#000;">

# Question answering

<b>Question answering</b>, commonly referred to as QA, is a field in NLP focused on building systems that automatically answer questions posed by humans in natural language.

QA systems are widely used in various applications, such as virtual assistants, customer support, and information retrieval systems.

QA systems can be broadly categorized into two types:

* <b>Open-domain QA:</b> Answers questions based on a broad range of knowledge, often sourced from the internet or large databases.
* <b>Closed-domain QA:</b> Focuses on a specific domain, like medicine or law, and answers questions from a limited dataset.

<br> These systems typically use a combination of natural language understanding to interpret the question and information retrieval to find relevant answers. - source: https://www.databites.tech/p/hugging-face-use-cases-and-applications

In [26]:
%%time

# We import the pipeline function from the transformers library
from transformers import pipeline

# We load a model fine-tuned for extractive question answering (it selects a span of the context).
qa_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

# Context to ask about
context = """Paris is the capital and most populous city of France. The city has an area of 105 square kilometers and a population of 2,140,526 residents."""

# Question to ask the model
question = "What is the population of Paris?"

# Now we get the answer
answer = qa_pipeline(question=question, context=context)

print(answer)


Device set to use cpu


{'score': 0.954908013343811, 'start': 121, 'end': 130, 'answer': '2,140,526'}
CPU times: user 260 ms, sys: 22.5 ms, total: 283 ms
Wall time: 898 ms
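
The `start` and `end` values in the answer are character offsets into the context, which shows that this model extracts a span and does not write new text. A small check in plain Python, using the context and output from the cell above:

```python
context = """Paris is the capital and most populous city of France. The city has an area of 105 square kilometers and a population of 2,140,526 residents."""

# Answer dictionary copied from the output above.
answer = {"score": 0.954908013343811, "start": 121, "end": 130, "answer": "2,140,526"}

# 'start' and 'end' are character offsets into the context (end is exclusive).
span = context[answer["start"]:answer["end"]]
print(span)
```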


<hr style="height:75px;color:#000;background-color:#000;">

# Translation - Language Translation

- source: https://www.databites.tech/p/hugging-face-use-cases-and-applications

In [27]:
%%time

# We import the pipeline function from the transformers library
from transformers import pipeline

# Load the translation pipeline for English to German
translator = pipeline('translation_en_to_de')

# Text to translate from English to German
text_to_translate = "This is a great day for science!"

# Perform the translation
translation = translator(text_to_translate, max_length=40)

# Print the translated text
print(translation[0]['translation_text'])

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Dies ist ein großer Tag für die Wissenschaft!
CPU times: user 2.52 s, sys: 206 ms, total: 2.72 s
Wall time: 5.03 s


<hr style="height:75px;color:#000;background-color:#000;">

# GPT2
NOTE: There are better pretrained models! Be sure to experiment with them. This one is useful as a starter model for experimenting with GPT (decoders). Better-trained models will likely need significantly more memory to run on your machine.

NOTE: Hallucinations are probably more prevalent in the earlier GPT models.

The purpose of this section is to experiment with question answering.

In [6]:
from transformers import pipeline, set_seed
# tokenizer allows us to map numbers to words
from transformers import AutoTokenizer, Trainer, TrainingArguments

import json

# variables for text generation
seed                   = 42
max_length             = 150
num_return_sequences   = 2

In [10]:
# helper function to view output from LLM in a more standardized format
def generate_examples( generator, prompt_list ):    
    set_seed(seed)    
    examples = []
    
    for prompt in prompt_list: 
        
        result = generator(
                   prompt, 
                   max_length           = max_length, 
                   num_return_sequences = num_return_sequences )
        
        example = {'prompt': prompt}
        
        for i, res in enumerate( result ):            
            ## to drop the echoed prompt: answer = res['generated_text'].lstrip().removeprefix( prompt ).strip()
            answer    = res['generated_text'].strip()            
            example[f'answer{ i + 1 }'] = answer # generated_text holds the prompt followed by the continuation
            
        examples.append(example)        
        ## print(examples)
        print( json.dumps( example, indent = 2) )
        
    return examples
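
The helper keeps the prompt at the start of every answer, and the commented-out line hints at removing it. A minimal sketch of that variant is below; `strip_prompt` is a hypothetical helper name, `str.removeprefix` needs Python 3.9 or newer, and the sample text is taken from the earlier text-generation output in this notebook.

```python
def strip_prompt(prompt, generated_text):
    """Return only the continuation, dropping the echoed prompt if present."""
    # str.removeprefix requires Python 3.9 or newer.
    return generated_text.lstrip().removeprefix(prompt).strip()

prompt = "Although AI is just starting today,"
generated = "Although AI is just starting today, I wonder if the next step will be one based on a few very specific algorithms."

print(strip_prompt(prompt, generated))
```

If the generated text does not start with the prompt, `removeprefix` leaves it unchanged, so the helper is safe to apply to any output.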

In [9]:
# use a pipeline so we can configure our LLM easily
# the pipeline downloads the specified model and sets it up for the task
# the GPT model keeps generating text until max_length is reached (or it emits an end-of-text token), then returns the result
model_name           = 'gpt2'
model_gpt_generator  = pipeline('text-generation', model=model_name ) # a device argument can select GPU or CPU
tokenizer            = AutoTokenizer.from_pretrained(model_name)

Device set to use cpu


In [12]:
# provide a list of questions to supply to the LLM
list_to_answer = ["Are electric vehicles more efficient to petroleum vehicles?"]

# GPT-2 does not look up an answer; it continues the prompt with the words it finds most probable
say_something = generate_examples( model_gpt_generator, list_to_answer )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "Are electric vehicles more efficient to petroleum vehicles?",
  "answer1": "Are electric vehicles more efficient to petroleum vehicles?",
  "answer2": "Are electric vehicles more efficient to petroleum vehicles?\n\nElectric vehicles are considered more environmentally responsible vehicles because they are more environmentally efficient compared to gasoline.\n\nIs there still enough data on gasoline-electric use to find an exact answer?\n\nNo, there is not. No reliable data will be available till the next year.\n\nWhat are the key criteria to determine if a car is not environmentally responsible?\n\nNo, the only vehicles on the map that need to be considered in determining if a vehicle is environmentally responsible when they have been equipped with a safety-rated battery include cars with EPA rated batteries.\n\nDo you see any specific standards or guidelines that you are interested in pursuing?\n\nWe have a variety of guidelines within"
}


In [14]:
# provide a list of questions to supply to the LLM
list_to_answer = ["What evidence is present to state global warming is a threat to civilization?"]

# GPT-2 does not look up an answer; it continues the prompt with the words it finds most probable
say_something = generate_examples( model_gpt_generator, list_to_answer )

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{
  "prompt": "What evidence is present to state global warming is a threat to civilization?",
  "answer1": "What evidence is present to state global warming is a threat to civilization? Are we witnessing some sort of imminent apocalypse? I see a very large proportion of scientists who, under those conditions, do not believe that such a thing has been possible.\"\n\nA spokesman for the U.N. Climate Change Commission suggested that the panel could not immediately answer these questions. \"We will be looking into these questions once the U.N. has finished with its report and is prepared to release its findings,\" a spokesperson for the department of environment said in a statement. \"We believe it is prudent for the scientific community to have a more comprehensive review of the evidence before making an official statement.\"",
  "answer2": "What evidence is present to state global warming is a threat to civilization?\n\nThe first step is to recognize the threat it presents, to recognize

<hr style="height:75px;color:#000;background-color:#000;">