In [9]:
! pip install huggingface_hub
! pip install -U 'transformers[torch]' datasets timm
! pip install -U pandas numpy matplotlib

Collecting transformers[torch]
  Using cached transformers-4.52.4-py3-none-any.whl.metadata (38 kB)
Using cached transformers-4.52.4-py3-none-any.whl (10.5 MB)
Installing collected packages: transformers
Successfully installed transformers-4.52.4


In [8]:
from huggingface_hub import notebook_login

notebook_login()


# NLP - RNNs, Transformers, Hugging Face

In this notebook, we will take a deeper look at NLP, currently one of the most popular applications of deep learning. Most of today's Large Language Models (LLMs) rely on the transformer architecture for their generative capabilities.

We will try to build an NLP classification model that can identify Charles Dickens' writing.

This notebook is based on module 4 of fast.ai's course, which covers NLP.

### Language Models

A language model is a model trained to predict the next word in a text from the words that came before it. It is trained with self-supervised learning: instead of relying on external labels, it takes its labels from the text itself, since the next word at each position is the target. To predict well, the model has to develop some understanding of the language it is trained on, whether that is English, French, German, or another language.
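To make the idea of "labels within the text" concrete, here is a minimal sketch of how next-word training pairs can be cut out of raw text. The helper `next_word_pairs` and the `context_size` parameter are hypothetical names chosen for illustration; real language models work on tokens and much longer contexts.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) pairs from raw text; no external labels needed."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])  # the previous words
        pairs.append((context, words[i]))           # the word to predict
    return pairs


pairs = next_word_pairs("it was the best of times it was the worst of times")
for context, target in pairs[:3]:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why large unlabeled corpora are enough to train such a model.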

As an example of the development process, consider an IMDb review classifier. We start from a language model that was trained on Wikipedia data. That model may not be well suited to the English of IMDb reviews, because Wikipedia articles are usually written in a different style and format. To get accurate classifications, we first fine-tune the language model on IMDb text. We then use that fine-tuned model as the starting point for a classifier of IMDb movie reviews, which should be more accurate than one built directly on the Wikipedia model.

The preceding process is called Universal Language Model Fine-tuning (ULMFiT).

#### Recurrent Neural Networks - RNNs

Recurrent Neural Networks (RNNs) are a type of neural network architecture designed for sequential or time-series data. They make predictions over a sequence by using earlier elements of the sequence as inputs to later predictions. RNNs keep a hidden state that tracks previous inputs; updating this state at every step is the recurrent part of the RNN.
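The hidden-state update can be sketched with a single scalar unit. This is a toy under stated assumptions: the weights `w_x`, `w_h` and bias `b` are made-up constants, whereas a real RNN learns weight matrices and works on vectors.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Process the inputs one at a time, carrying the hidden state forward."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The second and third inputs are zero, yet the hidden state stays non-zero: the network still "remembers" the first input, which is what the recurrence buys us.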

For sequence-to-sequence tasks, RNNs are often arranged in an encoder-decoder model. This model is best explained by the following image:

![encoder/decoder model](./resources/encoder-decoder.png)

For more information: 
- [IBM article](https://www.ibm.com/think/topics/recurrent-neural-networks)
- [Blog post by Zhaozhen Xu](https://www.baeldung.com/cs/rnns-transformers-nlp)

### Transformers

Transformers are a type of neural network architecture that is well suited to processing natural language. Unlike RNNs, transformers do not use recurrence, so they carry no hidden state from one step to the next. This means they do not operate sequentially (i.e., they do not need to go through the input one token at a time). Instead, they use self-attention, which lets the model weigh the importance of different input tokens when making predictions. Transformers consist of encoder and decoder layers that combine multi-head self-attention with feed-forward neural networks. Because of this design, they can process the tokens of a sequence in parallel, which typically makes them faster to train than RNNs.
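Here is a minimal, dependency-free sketch of scaled dot-product self-attention for a single head. As a simplifying assumption, the queries, keys and values are all the raw input vectors `x`; a real transformer first passes the inputs through learned projection matrices and runs several heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Each output is a weighted average of all values, weighted by query-key similarity."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "token" vectors; in this sketch they act as queries, keys and values at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Nothing in the loop over queries depends on a previous iteration, which is why the computation can be parallelized across tokens.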

Here is an image illustrating the transformer model.

![transformer model](./resources/transformer.png)

The transformer model was laid out in the seminal 2017 paper [*Attention Is All You Need*](./resources/attention.pdf).

### Tokenization and Numericalization

Our neural networks take numbers as their inputs, so we need to convert our sentences into numbers. There are two steps:

- *Tokenization*: we split text up into tokens
- *Numericalization*: convert each token into a number
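The two steps above can be sketched with a toy word-level tokenizer. Everything here (`tokenize`, `build_vocab`, `numericalize`, the `[UNK]` entry) is a hypothetical illustration; pretrained models ship their own tokenizer with a fixed, usually subword-level, vocabulary.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(token_lists):
    """Assign each distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"[UNK]": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def numericalize(tokens, vocab):
    """Replace each token by its id, falling back to the unknown id."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


tokens = tokenize("Hi! I am Sami")
vocab = build_vocab([tokens])
ids = numericalize(tokens, vocab)
print(tokens)
print(ids)
```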

This process is model dependent: each model has its own tokenizer. We'll see this when developing our Dickens classifier, for which we'll use a DistilBERT checkpoint from the Hugging Face Hub.

In [10]:
model_nm = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

```AutoTokenizer``` is a Hugging Face Transformers class that loads the tokenizer matching a given model.

In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer that was used to train this checkpoint
tokz = AutoTokenizer.from_pretrained(model_nm)

In [13]:
tokz.tokenize("Hi! I am Sami")

['hi', '!', 'i', 'am', 'sami']

### Our Dataset

In [14]:
from datasets import load_dataset

dickens_ds = load_dataset("GuillermoTBB/charles-dickens-text-classification")
dickens_ds

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 880
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 220
    })
})

In [15]:
train, test = dickens_ds['train'], dickens_ds['test']

In [16]:
train_df = train.to_pandas()
train_df

Unnamed: 0,text,label
0,"""It was your responsibility—I assert that it w...",0
1,"Mr. Jaggers, having beheld me in the radiant p...",0
2,"That night, sleep was a fleeting, haunted noti...",0
3,"My sister fetched the stone bottle, poured his...",0
4,Consider the striking consistency in his demea...,0
...,...,...
875,It is imperative that we exercise utmost cauti...,0
876,"Shortly after he had spoken, a portly man in a...",0
877,"During a recent observation, it was noted that...",0
878,"As I couldn't nod endlessly in silence, not ig...",0


In [17]:
def tok_func(ds):
    # Truncate long passages to the model's maximum input length
    return tokz(ds['text'], truncation=True)

In [18]:
tok_ds = train.map(tok_func, batched=True)
tok_ds

Map: 100%|██████████| 880/880 [00:00<00:00, 10104.74 examples/s]


Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 880
})

In [22]:
tok_ds_df = tok_ds.to_pandas()
tok_ds_df

Unnamed: 0,text,label,input_ids,attention_mask
0,"""It was your responsibility—I assert that it w...",0,"[101, 1000, 2009, 2001, 2115, 5368, 1517, 1045...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,"Mr. Jaggers, having beheld me in the radiant p...",0,"[101, 2720, 1012, 25827, 2015, 1010, 2383, 202...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,"That night, sleep was a fleeting, haunted noti...",0,"[101, 2008, 2305, 1010, 3637, 2001, 1037, 2508...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,"My sister fetched the stone bottle, poured his...",0,"[101, 2026, 2905, 18584, 2098, 1996, 2962, 583...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,Consider the striking consistency in his demea...,0,"[101, 5136, 1996, 8478, 18700, 1999, 2010, 217...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...
875,It is imperative that we exercise utmost cauti...,0,"[101, 2009, 2003, 23934, 2008, 2057, 6912, 279...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
876,"Shortly after he had spoken, a portly man in a...",0,"[101, 3859, 2044, 2002, 2018, 5287, 1010, 1037...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
877,"During a recent observation, it was noted that...",0,"[101, 2076, 1037, 3522, 8089, 1010, 2009, 2001...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
878,"As I couldn't nod endlessly in silence, not ig...",0,"[101, 2004, 1045, 2481, 1005, 1056, 7293, 1086...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
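The `attention_mask` column tells the model which positions hold real tokens (1) and which are padding (0). Every mask above is all ones because no padding has been applied yet; padding typically happens later, when examples of different lengths are grouped into a batch. The sketch below illustrates the idea with a hypothetical `pad_batch` helper, short made-up id sequences, and an assumed padding id of 0.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[101, 2009, 2001, 102], [101, 2720, 102]])
print(batch_ids)
print(batch_mask)
```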


In [21]:
import numpy as np

np.random.seed(42)

# Hold out 25% of the tokenized training data for evaluation during training
dataloaders = tok_ds.train_test_split(test_size=0.25, seed=42)
dataloaders

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 660
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 220
    })
})
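Note that the part the library labels `test` here acts as our validation set during training, while the original `test` split stays untouched. The row counts follow directly from the 25% hold-out. The sketch below reproduces the arithmetic with a seeded shuffle; it is only an illustration with a hypothetical `split_indices` helper, not the `datasets` implementation, so the selected rows will differ.

```python
import random


def split_indices(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices reproducibly and cut off a held-out fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, valid_idx = split_indices(880)
print(len(train_idx), len(valid_idx))
```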

### Training our model

In [30]:
from transformers import TrainingArguments, Trainer

bs = 128
epochs = 2
learning_rate = 8e-5

In [31]:
args = TrainingArguments( 
    output_dir="outputs",
    learning_rate=learning_rate,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="no",
    push_to_hub=False,
)
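It helps to know how many optimizer steps these settings produce. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and that the final partial batch is kept.

```python
import math

n_train = 660  # rows in the training split
bs = 128       # per-device training batch size
epochs = 2

steps_per_epoch = math.ceil(n_train / bs)
total_steps = steps_per_epoch * epochs
last_batch = n_train - (steps_per_epoch - 1) * bs  # size of the final, partial batch

print(steps_per_epoch, total_steps, last_batch)
```

With so few steps, each batch carries a lot of weight, which is worth keeping in mind when comparing learning rates.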

#### Defining our metrics

We need to define a function to evaluate our model.

In [32]:
def accuracy(prediction, label):
    return np.mean(prediction == label)

In [33]:
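def metrics(eval_pred):
    # Trainer passes (logits, labels); turn logits into class ids before scoring
    logits, labels = eval_pred
    return {'accuracy': accuracy(np.argmax(logits, axis=-1), labels)}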
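For a sequence classification model, the predictions handed to `compute_metrics` are raw logits (one score per class), not class ids, so they need an argmax before they can be compared with the labels. Here is a dependency-free sketch of that computation on made-up logits; `accuracy_from_logits` is a hypothetical name.

```python
def argmax(row):
    """Index of the largest score in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 0, 0]
acc = accuracy_from_logits(logits, labels)
print(acc)
```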

#### Training our model

The Transformers library provides some APIs to facilitate training.

In [34]:

model = AutoModelForSequenceClassification.from_pretrained(model_nm)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=dataloaders['train'],
                  eval_dataset=dataloaders['test'],
                  processing_class=tokz,  # replaces the deprecated tokenizer= argument
                  compute_metrics=metrics
                  )



In [29]:
trainer.train()

RuntimeError: MPS backend out of memory (MPS allocated: 16.75 GB, other allocations: 1.07 GB, max allowed: 18.13 GB). Tried to allocate 1.06 GB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
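One common way to work around this kind of out-of-memory error is to use a smaller per-device batch and accumulate gradients over several steps, which keeps the effective batch size the same. The sketch below only builds the keyword arguments; `per_device_bs = 16` is a hypothetical value that may still need tuning for a given machine, and shortening the tokenized inputs can reduce memory further.

```python
target_batch = 128   # effective batch size we want to keep
per_device_bs = 16   # hypothetical smaller batch that should use less memory
grad_accum = target_batch // per_device_bs

# Keyword arguments that could be passed to TrainingArguments in place of the larger batch sizes
memory_friendly_kwargs = dict(
    per_device_train_batch_size=per_device_bs,
    per_device_eval_batch_size=per_device_bs * 2,
    gradient_accumulation_steps=grad_accum,
)

print(memory_friendly_kwargs)
```

The trade-off is speed: more, smaller forward and backward passes are needed for each optimizer step.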

In [None]:
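# The held-out test split must be tokenized the same way as the training data
preds = trainer.predict(test.map(tok_func, batched=True))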
preds