# TensorFlow Solutions for HuggingFace Exercises

This notebook works through exercises from the HuggingFace exercises notebook using TensorFlow.

## Exercise 1: Downloading and Prompting T5 with TensorFlow

In [1]:
!pip install tensorflow transformers



In [2]:

from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [3]:

def translate_with_t5(text, model, tokenizer, source_lang="English", target_lang="French"):
    # T5 was trained with a lowercase task prefix, e.g. "translate English to French: "
    input_text = f"translate {source_lang} to {target_lang}: {text}"
    inputs = tokenizer.encode(input_text, return_tensors="tf")
    outputs = model.generate(inputs)
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translated_text

# Example usage
translate_with_t5("Hello, world!", model, tokenizer)




'Bonjour, monde!'

## Exercise 2: Transfer Learning with BERT in TensorFlow

In [3]:
!pip install tensorflow tensorflow-datasets transformers



In [4]:
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import TFBertForSequenceClassification, BertTokenizer

# Loading the IMDB reviews dataset
data = tfds.load('imdb_reviews', split=['train', 'test'], as_supervised=True)
train_data, test_data = data[0], data[1]




Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

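def encode_examples(ds, limit=-1, batch_size=32):
    """Tokenize up to `limit` reviews (-1 means all) and return a shuffled, batched tf.data.Dataset."""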
    input_ids = []
    attention_masks = []
    labels = []

    for review, label in tfds.as_numpy(ds.take(limit)):
        bert_input = tokenizer.encode_plus(
            review.decode('utf-8'),
            add_special_tokens=True,
            max_length=128,
            truncation=True,
            padding='max_length',
            return_attention_mask=True,
            return_token_type_ids=False,
            return_tensors='tf'
        )

        input_ids.append(bert_input['input_ids'][0])
        attention_masks.append(bert_input['attention_mask'][0])
        labels.append(label)

    return tf.data.Dataset.from_tensor_slices(({
        'input_ids': input_ids,
        'attention_mask': attention_masks,
    }, labels)).shuffle(len(labels)).batch(batch_size)

# Apply the function to the train and test dataset
batch_size = 32
train_data_encoded = encode_examples(train_data, batch_size=batch_size, limit=10000)
test_data_encoded = encode_examples(test_data, batch_size=batch_size, limit=1000)

In [6]:
# Load a pre-trained BERT model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_99 (Dropout)        multiple                  0 (unused)
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [8]:
# Freeze the BERT encoder (layer 0 in the summary above) so only the classifier head is trained
model.layers[0].trainable = False

In [9]:
# Model compilation
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = tf.metrics.SparseCategoricalAccuracy()

model.compile(optimizer=optimizer, loss=loss, metrics=[metrics])

# Model training
epochs = 3  # Adjust as needed
model.fit(train_data_encoded, epochs=epochs, validation_data=test_data_encoded)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7b806c2e22d0>

## Exercise 3: Distillation of BERT using TensorFlow

In [12]:

# Understanding model distillation
# Distillation involves training a smaller model (student) to mimic a larger model (teacher).
# Here we assume the use of a smaller BERT model as the student.
# The distillation process involves training the student model to replicate the teacher model's output.
# Detailed code for this process is complex and is not provided in this example.
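
A full distillation loop is beyond this exercise, but the loss a student typically minimizes can be sketched in a few lines. The following is a minimal, framework-free illustration, assuming the common formulation: a temperature-softened KL divergence between teacher and student outputs, blended with ordinary cross-entropy on the true label. The function names, the `temperature` and `alpha` parameters, and the example logits are illustrative assumptions, not part of the original notebook.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target loss (teacher vs. student) with a hard-label loss.

    alpha weights the soft-target term; (1 - alpha) weights the hard-label term.
    All parameter names and defaults here are assumptions for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # Scaling by T^2 is a common convention to keep gradient magnitudes comparable
    soft_loss = kl * temperature ** 2
    # Standard cross-entropy against the true label, at temperature 1
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits for a two-class sentiment task (true label: class 0)
teacher = [2.0, -1.0]
print(distillation_loss(teacher, [1.5, -0.5], label=0))  # student close to teacher
print(distillation_loss(teacher, [-1.0, 2.0], label=0))  # student far from teacher
```

In a TensorFlow setting the same quantity would be computed on batched logits from the teacher and student models inside a custom training step; that wiring is not shown here.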



## Exercise 4: Using ROUGE for Evaluation

### What is `rouge_score`?

In this context, `rouge_score` refers to the evaluation of the generated summary using the **ROUGE** metric (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is commonly used in natural language processing to measure the quality of generated summaries by comparing them to reference texts.

The code uses the `rouge_score` Python package from Google to calculate three ROUGE variants:

- **ROUGE-1**: Measures unigram (word-level) overlap.
- **ROUGE-2**: Measures bigram (two-word sequence) overlap.
- **ROUGE-L**: Measures the longest common subsequence, capturing fluency and sequence similarity.

The line:
```python
scores = scorer.score(example_text, summary)
```

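returns a dictionary with one entry per ROUGE variant. The method signature is `score(target, prediction)`, so here the original text plays the role of the reference and the generated summary is the prediction. The scores therefore measure how much of the source text survives in the summary, not agreement with a human-written summary. Each entry contains:

- **Precision**: Proportion of generated words that appear in the reference.
- **Recall**: Proportion of reference words that are captured in the summary.
- **F1-score** (`fmeasure`): Harmonic mean of precision and recall, typically used as the main performance metric.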

Setting `use_stemmer=True` enables stemming (e.g., "running" and "run" are treated as the same), which makes the evaluation more robust.
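
To make the precision, recall, and F1 definitions above concrete, here is a minimal sketch of n-gram ROUGE in plain Python. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly; the function name `rouge_n` and the example sentences are illustrative only.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts, cand_counts = ngrams(reference), ngrams(candidate)
    # Counter intersection keeps the minimum count per n-gram (clipped matches)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the cat sat on the mat"
candidate = "the cat sat"
print(rouge_n(reference, candidate, n=1))  # unigram overlap
print(rouge_n(reference, candidate, n=2))  # bigram overlap
```

A candidate that copies a prefix of the reference gets perfect precision but low recall, which is the same pattern visible when an extractive summary is scored against its own source text.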

In [13]:

!pip install tensorflow transformers rouge-score


Successfully installed rouge-score-0.1.2


In [21]:

import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
from rouge_score import rouge_scorer

# Loading the model and tokenizer for summarization
model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

# Function to perform summarization
def summarize(text):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="tf", max_length=512, truncation=True)
    outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example text
example_text = "The quick brown fox jumps over the lazy dog. Foxes are quick. Dogs are lazy. Foxes and hounds are often discussed. New york! This is an example sentence to demonstrate text summarization."


# Human-written reference summary (defined for comparison; not used in the scoring below)
human_summary = "The fox jumped over a dog. Then some gibberish."

# Summarize the text
summary = summarize(example_text)

# Evaluate using ROUGE. score(target, prediction): the source text serves as the reference here.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(example_text, summary)

print(example_text, summary, scores, sep='\n')



The quick brown fox jumps over the lazy dog. Foxes are quick. Dogs are lazy. Foxes and hounds are often discussed. New york! This is an example sentence to demonstrate text summarization.
the quick brown fox jumps over the lazy dog. foxes are quick. Dogs are lazy. this is an example sentence to demonstrate text summarization.
{'rouge1': Score(precision=1.0, recall=0.75, fmeasure=0.8571428571428571), 'rouge2': Score(precision=0.9565217391304348, recall=0.7096774193548387, fmeasure=0.8148148148148149), 'rougeL': Score(precision=1.0, recall=0.75, fmeasure=0.8571428571428571)}


## Exercise 5: Exploring BLEU for Machine Translation

### What is `sacrebleu`?

In this context, `sacrebleu` is used to evaluate the quality of the generated translation using the **BLEU** (Bilingual Evaluation Understudy) score, a standard metric for evaluating machine translation systems.

The code uses the `sacrebleu` Python library to compute the **corpus-level BLEU score**, which measures how closely the generated translation matches one or more reference translations.

Key concepts:
- **BLEU Score**: `sacrebleu` reports BLEU on a 0 to 100 scale (some other tools report the same quantity on a 0 to 1 scale). Higher values indicate closer agreement with the reference translations.
- BLEU considers:
  - **n-gram overlap** (usually up to 4-grams) between candidate and reference translations.
  - **Brevity penalty** to discourage overly short translations.

In this line:
```python
bleu_score = sacrebleu.corpus_bleu([translation], [reference])
```
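`translation` is the model-generated translation (the hypothesis).

`reference` is a list holding the correct or expected human translation.

`corpus_bleu` takes a list of hypothesis strings and a list of reference sets, where each reference set is itself a list with one string per hypothesis. `[translation]` is therefore a one-sentence corpus, and `[reference]` supplies a single reference set for it.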

The result (`bleu_score.score`) is a single numerical value representing the translation quality.
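
The two ingredients listed above, clipped n-gram precision and the brevity penalty, can be sketched in plain Python. This is a simplified, unsmoothed, single-reference version using whitespace tokenization; `sacrebleu` applies its own tokenization and smoothing, so its scores will generally differ. The function names and example sentences are illustrative assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0-100 scale (single reference)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        # Clip each candidate n-gram count to its count in the reference
        matches = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

reference = "le chat est sur le tapis"
print(simple_bleu("le chat est sur le tapis", reference))  # identical sentence
print(simple_bleu("le chat est sur le", reference))        # shorter: brevity penalty applies
print(simple_bleu("un chien dort", reference))             # no overlap
```

Because the geometric mean collapses to zero when any n-gram order has no matches, short sentences often score 0 without smoothing, which is one reason library implementations smooth by default.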



In [15]:

!pip install tensorflow transformers sacrebleu


Successfully installed colorama-0.4.6 portalocker-3.2.0 sacrebleu-2.5.1


In [18]:
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
import sacrebleu

# Loading the model and tokenizer for translation
model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

# Function to perform translation
def translate(text, target_language="French"):
    # T5 expects the task prefix "translate English to <language>: "
    prompt = f"translate English to {target_language}: " + text
    inputs = tokenizer.encode(prompt, return_tensors="tf", max_length=512, truncation=True)
    # Use model.generate() instead of calling model directly.
    # min_length=40 forces at least 40 output tokens even for a one-sentence input,
    # so the model keeps generating past the natural end of the translation;
    # lower or drop it for short inputs.
    outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example text
example_text = "The quick brown fox jumps over the lazy dog."

# Translate the text
translation = translate(example_text, "French")

# Evaluate using BLEU
reference = ["Le rapide renard brun saute par-dessus le chien paresseux."]
bleu_score = sacrebleu.corpus_bleu([translation], [reference])

translation, bleu_score.score


("Le renard brun rapide saute au-dessus du chien paresseux.  l'heure actuelle, il s'agit d'un renard brun et d'un renard brun.",
 8.668528067348738)