<a href="https://colab.research.google.com/github/simon-mellergaard/GAI-with-LLMs/blob/main/Litterature/NLP_00_corrections.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notes on Transformers Notebooks

*Author: Jesper N. Wulff*

This notebook contains comments and solutions to problems I encountered when running the example code [from the notebooks](https://github.com/nlp-with-transformers/notebooks) from the O'Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). All of the issues have been filed on GitHub.

### 01_introduction.ipynb

### 02_classification.ipynb

#### **Issue 1.**

The issue here pertains to the section "From Datasets to DataFrames". The problem arises when running this piece of code:

```
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()
```

which throws the error:

```
AttributeError: 'Value' object has no attribute 'int2str'
```

One way around the problem is to install a newer version of the `datasets` library, as suggested in [this comment](https://github.com/nlp-with-transformers/notebooks/issues/113#issuecomment-1727526784):


```
!pip install --upgrade datasets
```


```
import pandas as pd
from datasets import load_dataset

# Load the emotions dataset
emotions = load_dataset("emotion")

emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()
```


| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


```
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()
```

| | text | label | label_name |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |
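
If upgrading `datasets` is not an option, the same column can be built without `int2str`. The sketch below is a workaround, not code from the book. It assumes the six label names of the `emotion` dataset in their usual order (only `sadness`, `love`, and `anger` are confirmed by the output above), and it uses a small stand-in DataFrame in place of `emotions["train"][:]`.

```python
import pandas as pd

# Assumed label order of the emotion dataset; check it against
# emotions["train"].features before relying on it.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Small stand-in for df = emotions["train"][:]
df = pd.DataFrame(
    {
        "text": ["i didnt feel humiliated", "i am feeling grouchy"],
        "label": [0, 3],
    }
)

# Map each integer label to its name without calling int2str
df["label_name"] = df["label"].map(dict(enumerate(LABEL_NAMES)))
print(df)
```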


#### **Issue 2.**

To avoid a `numpy` error (the `np.object` alias was removed in NumPy 1.24) when calling

```
emotions_encoded.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
```

run this code just before calling `map`, like [this](https://github.com/nlp-with-transformers/notebooks/issues/130#issuecomment-1923821023):



```
import numpy as np
np.object = object

#hide_output
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)
```
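
A slightly more defensive variant of the same workaround adds the alias only when it is missing, so it does nothing on older NumPy versions that still provide `np.object`. This is a sketch of the idea, assuming the error is caused by the missing `np.object` attribute. It should still run before the `map` call.

```python
import numpy as np

# Re-create the alias only if this NumPy version no longer has it
if not hasattr(np, "object"):
    np.object = object

# The alias now refers to the built-in object type
print(np.object is object)
```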






#### **Issue 3.**

To avoid an error when fine-tuning the classification model, remember to create a new model on the Hugging Face Hub first. For more details, see [this comment](https://github.com/nlp-with-transformers/notebooks/issues/109#issuecomment-1712672984).

### 04_multilingual-ner.ipynb

#### **Issue 1.**

Under *When Does Zero-Shot Transfer Make Sense?*, the following code throws an error with pandas 2.0 or later, because `DataFrame.append` was removed in that release:

```
for num_samples in [500, 1000, 2000, 4000]:
    metrics_df =  metrics_df.append(train_on_subset(panx_fr_encoded, num_samples), ignore_index=True)
```

To fix this, use `pd.concat` instead:

```
for num_samples in [500, 1000, 2000, 4000]:
    metrics_df = pd.concat([metrics_df, train_on_subset(panx_fr_encoded, num_samples)], ignore_index=True)
```
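
Calling `pd.concat` inside the loop copies the growing DataFrame on every iteration. An alternative is to collect the per-subset results in a list and concatenate once. The sketch below illustrates this with `fake_train_on_subset`, a hypothetical stand-in that returns a one-row DataFrame the way `train_on_subset` is assumed to. The column names and scores are placeholders.

```python
import pandas as pd

def fake_train_on_subset(num_samples):
    """Hypothetical stand-in for train_on_subset; returns a one-row DataFrame."""
    return pd.DataFrame({"num_samples": [num_samples], "f1_score": [0.0]})

# Assumed starting point: an existing one-row results DataFrame
metrics_df = fake_train_on_subset(250)

# Collect the new rows first, then concatenate once
frames = [fake_train_on_subset(n) for n in [500, 1000, 2000, 4000]]
metrics_df = pd.concat([metrics_df, *frames], ignore_index=True)
print(metrics_df)
```

With only four subsets the difference is negligible, so the in-loop version above is fine for this notebook.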



### 05_text-generation.ipynb

#### **Issue 1.**

On page 127 of the book, there is a mistake in footnote 3:

> If you run out of memory on your machine, you can load a smaller GPT-2 version by replacing model_name = "gpt-xl" with model_name = "gpt".

It is supposed to be `model_name = "gpt2"`.



### 06_summarization.ipynb

#### **Issue 1.**

To avoid an error, also download `"punkt_tab"` from `nltk`:

```
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
nltk.download("punkt_tab")
```



#### **Issue 2.**

`load_metric` is deprecated. Instead, you can use the `evaluate` library, with a few changes to the code:



```
!pip install evaluate
!pip install rouge_score

import evaluate

rouge_metric = evaluate.load('rouge')
```

Some of the code needs to be adapted to the new library. For instance, `evaluate`'s ROUGE metric returns plain scores, so remove the suffix after `score[rn]` (`.mid.fmeasure` in the book's code) so that the line reads:

```
rouge_dict = dict((rn, score[rn]) for rn in rouge_names)
```
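
Whether the suffix is needed depends on what the metric returns. A version-tolerant helper can accept both shapes. The sketch below assumes older results expose the score as `.mid.fmeasure` and newer ones are plain floats. The `to_float` helper is hypothetical, the `rouge_names` list is assumed from the notebook, and the numbers are made up rather than computed by ROUGE.

```python
from types import SimpleNamespace

# Assumed metric names, as used in the notebook
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

def to_float(value):
    """Return the F-measure whether value is a plain float or has .mid.fmeasure."""
    mid = getattr(value, "mid", None)
    return float(mid.fmeasure) if mid is not None else float(value)

# Made-up scores in the two shapes, for illustration only
new_style = {rn: 0.25 for rn in rouge_names}
old_style = {
    rn: SimpleNamespace(mid=SimpleNamespace(fmeasure=0.25)) for rn in rouge_names
}

for score in (new_style, old_style):
    rouge_dict = dict((rn, to_float(score[rn])) for rn in rouge_names)
    print(rouge_dict)
```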






### 07_model-compression.ipynb

#### **Issue 1.**

`load_metric` was removed in `datasets` 3.0.0. Instead, use the `evaluate` library like this:

```
!pip install evaluate
import evaluate
accuracy_score = evaluate.load("accuracy")
```
This drop-in replacement should not break the rest of the code.


#### **Issue 2.**

When defining `time_pipeline()` you'll get a syntax warning:


```
<>:18: SyntaxWarning: invalid escape sequence '\-'
<>:18: SyntaxWarning: invalid escape sequence '\-'
/tmp/ipython-input-720322860.py:18: SyntaxWarning: invalid escape sequence '\-'
  print(f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}")
```
The warning comes from this part:

```
print(f"Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f}")
```

Python treats `\-` inside the f-string as an escape sequence, but `\-` is not a valid one, hence the warning.

Fix it like this:

```
print(f"Average latency (ms) - {time_avg_ms:.2f} ± {time_std_ms:.2f}")
```
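
If you would rather keep the output ASCII-only, `+/-` works as well, and a doubled backslash or a raw string produces a literal backslash without the warning. The following is a minimal sketch, using a hypothetical `format_latency` helper and made-up timings, with the standard library standing in for the notebook's timing code:

```python
import statistics

def format_latency(latencies_ms):
    """Format the mean and population standard deviation of latencies in ms."""
    time_avg_ms = statistics.mean(latencies_ms)
    time_std_ms = statistics.pstdev(latencies_ms)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(format_latency([10.0, 12.0, 14.0]))

# A doubled backslash and a raw string both give a literal backslash
assert "+\\-" == r"+\-"
```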