In [None]:
! pip install transformers datasets accelerate

Collecting accelerate
  Downloading accelerate-1.7.0-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.7.0-py3-none-any.whl (362 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.7.0


In [25]:
! pip install pandas



# NLP - RNNs, Transformers, Hugging Face

In this notebook, we take a closer look at natural language processing (NLP), currently one of the most popular applications of deep learning. Most current Large Language Models (LLMs) are built on the transformer architecture, which gives them their generative capabilities.

We will try to build an NLP classification model that can identify Charles Dickens' writing.

This notebook is based on module 4 of the fast.ai course, which covers NLP.

### Language Models

A language model is a model trained to predict the next word in a text from the words that precede it. It learns this through self-supervised training: no external labels are needed, because the labels (the next words) come from the text itself. To predict well, the model has to develop some understanding of the language it is trained on, whether that is English, French, German, or another language.
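As a rough illustration of how labels can come from the text itself, the sketch below builds (context, next word) training pairs from a raw sentence. The function name `next_word_pairs` and the `context_size` parameter are made up for this example; real language models work on tokens rather than whitespace-separated words and use much longer contexts.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) training pairs from raw text alone."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        # The context is up to `context_size` words preceding position i.
        context = words[max(0, i - context_size):i]
        # The label is simply the word that actually comes next in the text.
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("it was the best of times")
for context, target in pairs:
    print(context, "->", target)
```

No human had to label anything here: every target is read directly from the text.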

An example of this development process is an IMDb review classifier. We start with a language model that was trained on Wikipedia data. That model may not be well suited to the English used in IMDb reviews, because Wikipedia articles are usually written in a different style and format. To get accurate classifications, we first fine-tune the language model on IMDb reviews. We then use the fine-tuned model as the basis for a classifier of IMDb movie reviews.

The preceding process is called Universal Language Model Fine-tuning (ULMFiT).

#### Recurrent Neural Networks - RNNs

Recurrent Neural Networks (RNNs) are a neural network architecture designed for sequential or time-series data. They make predictions over a sequence by using earlier elements of the sequence as inputs to later predictions. An RNN keeps a hidden state that tracks previous inputs and is updated at every step. This is the recurrent part of the RNN.
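The hidden-state update can be sketched in a few lines of plain Python. This is a minimal, untrained example: the weight names (`W_xh`, `W_hh`, `b`), the sizes, and the random initialization are all assumptions chosen for illustration, and a real RNN would learn its weights and use a tensor library.

```python
import math
import random


def rnn_step(x, h, W_xh, W_hh, b):
    """One recurrent step: h_new = tanh(W_xh @ x + W_hh @ h + b)."""
    h_new = []
    for j in range(len(h)):
        total = b[j]
        total += sum(W_xh[j][k] * x[k] for k in range(len(x)))  # current input
        total += sum(W_hh[j][k] * h[k] for k in range(len(h)))  # previous state
        h_new.append(math.tanh(total))
    return h_new


random.seed(0)
input_size, hidden_size = 2, 3
W_xh = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
b = [0.0] * hidden_size

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0] * hidden_size  # the hidden state starts empty
for x in sequence:
    # Each step depends on the previous one, so the loop cannot be parallelized.
    h = rnn_step(x, h, W_xh, W_hh, b)
print(h)
```

The final `h` summarizes the whole sequence, and each step has to wait for the one before it.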

RNNs are often used in an encoder-decoder model. This model is best explained by the following image:

![encoder/decoder model](./encoder-decoder.png)

For more information: 
- [IBM article](https://www.ibm.com/think/topics/recurrent-neural-networks)
- [Blog post by Zhaozhen Xu](https://www.baeldung.com/cs/rnns-transformers-nlp)

### Transformers

Transformers are a neural network architecture that is very effective at processing natural language. Unlike RNNs, transformers do not use recurrence and do not carry a hidden state from one step to the next. This means they do not have to process the input sequentially (i.e., one token at a time). Instead, they use self-attention, which lets the model weigh the importance of the different input tokens when making predictions. Transformers consist of encoder and decoder layers built from multi-head self-attention mechanisms and feed-forward neural networks. Because all tokens can be processed at once, transformers parallelize well and are typically faster to train than RNNs.
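To make self-attention a little more concrete, here is a simplified sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are all the raw input vectors (`Q = K = V = X`). Real transformers apply learned projections to obtain them and run several attention heads in parallel, so treat this as an illustration of the weighting idea only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each pass through the `for q in X` loop is independent of the others, which is why the computation can be done for all tokens at once.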

Here is an image illustrating the transformer model.

![transformer model](./transformer.png)

The transformer model was laid out in the seminal 2017 paper [*Attention Is All You Need*](./attention.pdf).

### Tokenization and Numericalization

Neural networks take numbers as their inputs, so we need to convert our sentences into numbers. There are two steps:

- *Tokenization*: split the text into tokens
- *Numericalization*: convert each token into a number
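The two steps can be sketched with a toy whitespace tokenizer and a hand-built vocabulary. The helper names (`tokenize`, `build_vocab`, `numericalize`) and the `<unk>` placeholder are assumptions made for this example; pretrained models ship their own tokenizers, which usually split text into subword pieces rather than whole words.

```python
def tokenize(text):
    """Tokenization: split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()


def build_vocab(texts):
    """Assign each distinct token an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for tok in tokenize(text):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def numericalize(text, vocab):
    """Numericalization: map each token to its id, or to <unk> if unseen."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]


vocab = build_vocab(["it was the best of times", "it was the worst of times"])
ids = numericalize("it was the best of days", vocab)
print(ids)  # → [1, 2, 3, 4, 5, 0]
```

The word "days" never appeared when the vocabulary was built, so it maps to the unknown id `0`.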

This process is model dependent: each model has a tokenizer associated with it. We'll see this when developing our Dickens classifier, for which we'll use the DeBERTa v3 small model developed by Microsoft.

In [20]:
model_nm = 'microsoft/deberta-v3-small'

`AutoTokenizer` is a Hugging Face Transformers class that gives us the tokenizer associated with a model.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer that matches the pretrained model named in model_nm
tokz = AutoTokenizer.from_pretrained(model_nm)

In [None]:
tokz.tokenize("Hi! I am Sami")

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm)

### Our Dataset

In [30]:
from datasets import load_dataset, Dataset

ds = load_dataset("GuillermoTBB/charles-dickens-text-classification")

In [38]:
train, test = ds['train'], ds['test']

In [44]:
train_df, test_df = train.to_pandas(), test.to_pandas()
train_df

Unnamed: 0,text,label
0,"""It was your responsibility—I assert that it w...",0
1,"Mr. Jaggers, having beheld me in the radiant p...",0
2,"That night, sleep was a fleeting, haunted noti...",0
3,"My sister fetched the stone bottle, poured his...",0
4,Consider the striking consistency in his demea...,0
...,...,...
875,It is imperative that we exercise utmost cauti...,0
876,"Shortly after he had spoken, a portly man in a...",0
877,"During a recent observation, it was noted that...",0
878,"As I couldn't nod endlessly in silence, not ig...",0
