## Using Hugging Face Transformers
This notebook demonstrates several Hugging Face transformer models (LLMs) through the `pipeline` API, including:
* Sentiment analysis with a distilbert-base model
* Fine-tuning a distilbert-base model on labelled data for text classification
* Text generation with the gpt2 model
* Question answering with a distilbert-base model
* Translation from English to Chinese with the Helsinki-NLP/opus-mt-en-zh model

In [1]:
%pip install --disable-pip-version-check \
    torch==1.13.1 sentencepiece \
    torchdata==0.5.1 --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
import transformers
import evaluate, datasets

### Sentiment Analysis
* Load the model and tokenizer for a specific checkpoint (here a distilbert-base model, which is an encoder transformer)
* Pass the model and tokenizer to the transformers `pipeline` and specify the task as "sentiment-analysis"
* Pass the sentences to analyze and get back a label and a score for each
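
The label and score returned by the classifier come from the model's raw output logits. As a rough sketch of that last step (assuming the default multi-class behaviour, where a softmax is applied and the top class is reported), the logits and the label order below are hypothetical values chosen for illustration, not output from the model used in this notebook:

```python
import math

# Hypothetical raw logits for one sentence; real values come from the model.
logits = [2.1, 0.3, -1.2]
# Assumed label order, for illustration only. A real model stores its own
# mapping in model.config.id2label.
id2label = {0: "positive", 1: "neutral", 2: "negative"}


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

The `score` is therefore the probability of the winning class only, which is why a mildly positive sentence can still be labelled "positive" with a lower score.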

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, GenerationConfig

model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"

# Define the model object
model = AutoModelForSequenceClassification.from_pretrained(model_name);

In [4]:
# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name);

In [5]:
from transformers import pipeline
from datasets import load_dataset
# Initializing a classifier with a model and a tokenizer
classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

In [6]:
labels = classifier(["I am OK with this!", "I love this product."])
print(labels)

[{'label': 'positive', 'score': 0.8878770470619202}, {'label': 'positive', 'score': 0.9786158800125122}]


### Fine-Tune the Model for Text Classification
* Load the twitter-sentiment-analysis dataset for fine-tuning
* The dataset has text and the corresponding positive/negative labels
* First, build the training dataset by defining a `tokenize_function` and applying it with the dataset's `map` function in batch mode
  + Note that `tokenize_function` is defined differently for different models
  + For FLAN-T5, `tokenize_function` can generate both the input_ids and labels columns for each example and return the example
  + For the distilled BERT model here, it only processes the input text and returns the tokenizer output as a compatible dictionary
* Set up `TrainingArguments` and `Trainer` to train the model. To reduce time, training is capped at a single step with `max_steps=1`
* Full training requires a GPU and may take much longer!
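
To get a feel for how much work `max_steps=1` skips, the number of optimizer steps in a run can be estimated. This is a simplified sketch: `planned_steps` is a hypothetical helper, not part of transformers, and it ignores gradient accumulation and multi-device training. The batch size of 8 is the `TrainingArguments` default for `per_device_train_batch_size`.

```python
import math


def planned_steps(num_examples, batch_size, num_epochs, max_steps=-1):
    """Estimate how many optimizer steps a training run would take.

    A positive max_steps caps the run and takes precedence over the
    epoch count; -1 means "no cap".
    """
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# The 5-example demo subset fits in a single batch.
print(planned_steps(num_examples=5, batch_size=8, num_epochs=1))
# One epoch over the full 119,988-example training split.
print(planned_steps(num_examples=119988, batch_size=8, num_epochs=1))
# The same split, capped at a single step.
print(planned_steps(num_examples=119988, batch_size=8, num_epochs=1, max_steps=1))
```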

##### Prepare dataset for training
* Load the dataset and check that it contains the 'feeling' and 'text' columns
* Rename the 'feeling' column to 'label' for training

In [7]:
dataset = load_dataset("carblacac/twitter-sentiment-analysis", trust_remote_code=True);

In [8]:
# show the structure of the dataset
train_pd = dataset["train"].to_pandas()
train_pd.head()

| | text | feeling |
|---|---|---|
| 0 | @fa6ami86 so happy that salman won. btw the 1... | 0 |
| 1 | @phantompoptart .......oops.... I guess I'm ki... | 0 |
| 2 | @bradleyjp decidedly undecided. Depends on the... | 1 |
| 3 | @Mountgrace lol i know! its so frustrating isn... | 1 |
| 4 | @kathystover Didn't go much of any where - Lif... | 1 |


In [9]:
# check the target variable column only has two unique values. This is a binary classification.
train_pd['feeling'].unique()

array([0, 1])

In [10]:
# rename the target column to 'label'
dataset = dataset.rename_column('feeling', 'label')

##### Define `tokenize_function` to process the input text with the tokenizer
* Load the tokenizer
* Define `tokenize_function`
* Apply `tokenize_function` to the dataset using its `map()` function in batch mode
* Shuffle and select only 5 examples from the dataset to demonstrate the training process
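
The effect of `padding="max_length"` and `truncation=True` can be illustrated without downloading a model. The sketch below uses a toy whitespace tokenizer with a made-up vocabulary (`vocab`, `PAD_ID`, `UNK_ID` and `toy_tokenize` are all hypothetical names). A real tokenizer splits text into subwords and adds special tokens, which this toy omits. Like the real tokenizer, it returns a dictionary of new columns, not the full example.

```python
# Toy vocabulary; a real tokenizer uses subword units and its own IDs.
PAD_ID, UNK_ID = 0, 1
vocab = {"i": 2, "love": 3, "this": 4, "product": 5}


def toy_tokenize(texts, max_length=6):
    """Mimic padding="max_length" and truncation=True on a batch of texts."""
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
        ids = ids[:max_length]            # truncation: drop tokens past the limit
        pad = max_length - len(ids)       # padding: fill up to the limit
        batch["input_ids"].append(ids + [PAD_ID] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
    return batch


encoded = toy_tokenize(["I love this product", "this " * 7])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row ends up with the same length, and the attention mask marks which positions hold real tokens (1) and which are padding (0).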

In [11]:
# In FLAN-T5, tokenize_function can generate both input_ids and labels columns for each
# example and return the example. For the distilled BERT model, however, process only
# the input text and return the tokenizer output as the compatible dictionary; do not
# return the entire example as the dictionary.
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [12]:
tokenized_datasets['train']

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 119988
})

In [13]:
# Reload the pretrained model for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Use only 5 examples per split to keep the demo fast; for a larger subset, use e.g.
# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(5))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(5))

##### Set Up TrainingArguments and Trainer
* Define `compute_metrics` and start training
* Note that `compute_metrics` defines accuracy as the metric reported during evaluation
* Training runs for a single step (`max_steps=1`) to save time and demonstrate the training process
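
What `compute_metrics` calculates can be checked by hand on a few rows. The following is a dependency-free sketch of the same idea (argmax over the logits, then the fraction of correct predictions). `toy_compute_metrics` is a hypothetical stand-in, and the logits and labels are made up for illustration.

```python
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def toy_compute_metrics(logits, labels):
    """Accuracy = share of rows whose top-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for 5 examples and 2 classes.
logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [2.0, 0.5], [-0.7, 0.9]]
labels = [1, 0, 0, 0, 1]
metrics = toy_compute_metrics(logits, labels)
print(metrics)
```

With only 5 evaluation examples, accuracy can only take the values 0, 0.2, 0.4, 0.6, 0.8 or 1, so a single-step run says little about model quality.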

In [14]:
from transformers import Trainer, TrainingArguments
import numpy as np
import evaluate

training_args = TrainingArguments(
    output_dir="trainer_output",
    logging_dir='./logs',
    evaluation_strategy="epoch",
    num_train_epochs=1,
    logging_steps=1,
    # a positive max_steps overrides num_train_epochs, so training stops after one step
    max_steps=1,
)

# define the metric to be used in compute_metrics
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)



trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.5731 | 1.195367 | 0.4 |


{'eval_loss': 1.1953672170639038,
 'eval_accuracy': 0.4,
 'eval_runtime': 3.3283,
 'eval_samples_per_second': 1.502,
 'eval_steps_per_second': 0.3,
 'epoch': 1.0}

### Text Generation
* Generate text using the gpt2 model
* Construct a prompt with the start of a paragraph
* Extract the 'generated_text' field from the pipeline output
* In this example, the model generated a short paragraph that continues the prompt
* This is a typical use case for a decoder transformer such as GPT-2
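
A decoder model generates text one token at a time, each time choosing a next token given what came before. If sampling is off, `num_beams=1` corresponds to greedy decoding: always take the highest-scoring next token. The sketch below shows that loop with a hypothetical lookup table (`next_token_scores`) standing in for the model. A real model scores the whole vocabulary using the full preceding context, not just the last token.

```python
# Hypothetical next-token scores keyed by the previous token only.
next_token_scores = {
    "In": {"a": 0.9, "the": 0.1},
    "a": {"world": 0.7, "city": 0.3},
    "world": {"of": 0.6, "where": 0.4},
    "of": {"machines": 0.8, "people": 0.2},
}


def greedy_generate(prompt_tokens, max_new_tokens):
    """Append the highest-scoring next token until the budget is used up."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = next_token_scores.get(tokens[-1])
        if not candidates:  # no known continuation: stop early
            break
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(" ".join(greedy_generate(["In"], max_new_tokens=3)))
```

`max_new_tokens` plays the same role here as in the notebook's `GenerationConfig`: it bounds how many tokens are added to the prompt.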

In [15]:
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

prompt = "In a world dominated by AI,"

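# pad_token_id=50256 reuses GPT-2's end-of-text token for padding; num_beams=1 disables beam search
generation_config = GenerationConfig(max_new_tokens=50, pad_token_id=50256, num_beams=1)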
generated_text = generator(prompt, generation_config=generation_config)[0]['generated_text']

print(generated_text)

In a world dominated by AI, it's hard to imagine a more important issue for the future of the human race.

"We're going to have to start thinking about how we can make AI more accessible to people," said Dr. David S. Siegel, a professor


### Question Answering
* Use a distilbert-base model
* Construct a question and a context and send them to the `qa_pipeline`
* Retrieve the answer from the context. In this example, the model successfully extracted the population of Paris from the given context
* This demonstrates a typical encoder transformer for question answering, such as the BERT model

In [16]:
from transformers import pipeline

qa_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

context = """Paris is the capital and most populous city of France. The city has an area of 105 square kilometers and a population of 2,140,526 residents."""
question = "What is the population of Paris?"

answer = qa_pipeline(question=question, context=context)
print(answer)


{'score': 0.954908013343811, 'start': 121, 'end': 130, 'answer': '2,140,526'}
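
The `start` and `end` values in the result are character offsets into the context, so the answer is literally a slice of the original text; nothing new is generated. A quick check using the context and the result reported above (copied here as plain values, so no model is needed):

```python
context = (
    "Paris is the capital and most populous city of France. "
    "The city has an area of 105 square kilometers and a population "
    "of 2,140,526 residents."
)

# Result dictionary copied from the notebook output above.
answer = {"score": 0.954908013343811, "start": 121, "end": 130, "answer": "2,140,526"}

# Slice the context with the reported offsets (end is exclusive).
span = context[answer["start"]:answer["end"]]
print(span)
print(span == answer["answer"])
```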


### Translation
* Use the Helsinki-NLP/opus-mt-en-zh model to translate English to Chinese
* This is a use case for a typical encoder-decoder transformer
* Translate a sentence from the lyrics of my daughter's favorite song, "The Coconut Song" by Jeff Lau
* The translation is amazing! Note that the "translate English to Chinese:" prefix in the input is itself translated and appears at the start of the output

In [17]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Text2TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

translation = text2text_generator("translate English to Chinese: The coconut bark for the kitchen floor If you save some of it, you can build a door", max_length=512, do_sample=False)

In [18]:
print(translation)

[{'generated_text': '英文译为中文:厨房地板的椰子树皮,如果你节省一部分,你可以建造一扇门。'}]
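
The pipeline returns a list with one dictionary per input, and the translated string sits under the `generated_text` key. The output above begins with a translation of the instruction prefix that was part of the input. As a small sketch, working on a copy of the printed output, the text can be pulled out and that leading segment split off. Splitting on the first colon is an assumption that fits this particular output, not a general rule.

```python
# Output copied from the cell above.
translation = [
    {"generated_text": "英文译为中文:厨房地板的椰子树皮,如果你节省一部分,你可以建造一扇门。"}
]

# One dictionary per input sentence; take the first (and only) one.
text = translation[0]["generated_text"]

# Split off everything up to the first colon, if there is one.
prefix, separator, rest = text.partition(":")
lyrics_only = rest if separator else text
print(lyrics_only)
```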
