# Lesson 4 - Natural Language Processing

[LECTURE VIDEO](https://www.youtube.com/watch?v=toUgBQv1BT8&list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU&index=4&ab_channel=JeremyHoward)

### Preamble
- Not using Fast.ai
- Using Huggingface transformers
    - It's useful for everyone to practise with more than one library and to use the same concepts in different ways. Huggingface's NLP library is the best on the market, and it's being incorporated into Fast.ai.
    - It doesn't have the same layered architecture as Fast.ai, so it's not as ready to go, which is actually a good thing.
- A pretrained model is a model where many parameters have already been fit (i.e. a and b in the ReLU), but you need to 'fine-tune' it to get a few more of them (i.e. c) correct as well.
- In the past, Recurrent Neural Nets were the cutting edge, but now it's transformers.

## ULMFiT
- Wikitext103: Language Model -> trained to predict the next word (or missing words) in every Wikipedia article. This requires a lot of real-world understanding: to be good at it, the model needs to understand language, what is true or not, a model of the world...
    - Started with random weights, and ended up predicting the correct word in a Wikipedia article more than 30% of the time (ULMFiT paper)
- IMDb: Language Model -> took the pretrained Wikitext model and trained it for a few more epochs on the IMDb movie review database to predict the words of movie reviews
- IMDb: Classifier -> took these weights and used them to predict whether a review has positive or negative sentiment...

**Language Model** Tries to predict the next word in a sentence

The first two models didn't require labels, just the text, BECAUSE the label is simply "what's the next word in the sentence"

Transformers were built because they can take advantage of TPUs really well, but they can't very easily predict the next word.
Instead, the training takes a chunk of text, removes random words, and asks the model to predict the missing words.
- Replace 'RNN' with 'Transformer' model, and 'Language model' with 'Masked Language model', but otherwise the basic idea is the same.
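
To make the masking idea concrete, here is a minimal sketch in plain Python. The `[MASK]` placeholder, the `mask_words` helper and the masking probability are all assumptions for illustration; real models define their own mask token and masking scheme.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; real models define their own mask token


def mask_words(text, mask_prob=0.3, seed=0):
    """Replace a random subset of words with MASK.

    Returns (masked_words, answers), where `answers` maps each masked
    position to the original word the model would be asked to predict.
    """
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, word in enumerate(text.split()):
        if rng.random() < mask_prob:
            masked.append(MASK)
            answers[i] = word
        else:
            masked.append(word)
    return masked, answers


masked, answers = mask_words("the cat sat on the mat because it was warm")
print(" ".join(masked))
print(answers)
```

No labels are needed here either: the "label" for each masked position is just the word that was removed.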

## Transformer, Masked-Language Model

![image.png](attachment:image.png)
^^ This is an image from the ConvNet paper showing what successive layers learn: edges and gradients -> corners and circles -> features such as eyeballs or flower petals.

**Activations** Sets of rectified linear units

Later layers are much more specific to the training task. You're unlikely to need to change the early layers.

Chop the head off the NN and add a new random matrix to learn what you're trying to fit.
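
As a rough sketch of "chopping the head", the snippet below models a network as a plain list of named weight matrices. Everything here (the layer names, the sizes, `random_matrix`, `replace_head`) is hypothetical and only illustrates the idea; it is not how any particular library implements it.

```python
import random


def random_matrix(n_in, n_out, seed=0):
    """Return an n_in x n_out matrix of small random weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.01) for _ in range(n_out)] for _ in range(n_in)]


# A hypothetical pretrained model: a list of (name, weight matrix) layers.
pretrained = [
    ("early", random_matrix(4, 8, seed=1)),
    ("middle", random_matrix(8, 8, seed=2)),
    ("head", random_matrix(8, 1000, seed=3)),  # specific to the original task
]


def replace_head(layers, n_classes, seed=0):
    """Keep every layer except the last, then append a fresh random head."""
    body = layers[:-1]
    n_features = len(body[-1][1][0])  # output width of the last kept layer
    return body + [("new_head", random_matrix(n_features, n_classes, seed))]


model = replace_head(pretrained, n_classes=5)
print([(name, len(w), len(w[0])) for name, w in model])
```

The early and middle layers are reused untouched; only the new head starts from random weights and has to be learned from scratch.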

## Kaggle Competitions

Best place to get real data where people really care about actual results!

![image.png](attachment:image.png)

[The Kaggle Comp](https://www.kaggle.com/code/greensamiam/getting-started-with-nlp-for-absolute-beginners/edit)

^^ Kind of a classification problem, but also kind of not. Why? Because classification is normally 0: not in the set, 1: in the set. However, in the U.S. Patent Phrase competition, the score for whether two phrases are similar can be 0.0, 0.25, 0.5, 0.75 or 1.0.

You need to take this 'similarity' problem and turn it into a 'classification' problem.
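
One possible way to do this, sketched below, is to treat each of the five score levels as its own class. The helper names are made up for illustration, and this is only an assumption about how the conversion could be done, not necessarily what the lecture notebook does.

```python
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]  # the five similarity levels listed above


def score_to_class(score):
    """Map a similarity score to a class index 0..4."""
    return SCORES.index(score)


def class_to_score(label):
    """Map a class index back to the similarity score it stands for."""
    return SCORES[label]


print([score_to_class(s) for s in [0.5, 0.75, 0.25, 0.0, 1.0]])
```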

- Make sure to look at the data, but ALSO the information about the data

#### Technical Notes
* Sometimes Kaggle doesn't have the right packages installed, e.g. Huggingface.
* You can put bash commands (prefixed with `!`) in a notebook cell, even inside Python conditionals. For example, `!ls {path}` will list the contents at that path, where `{path}` is filled in from the Python variable.
* Pandas is one of four key libraries you need to know to do data science
    * NumPy: basic numeric programming
    * Pandas: tables of data
    * Matplotlib: plotting
    * PyTorch: deep learning
    * All covered in ['Python for Data Analysis'](https://www.oreilly.com/library/view/python-for-data/9781449323592/)

In [1]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
# paste your own Kaggle API key here; never publish a real key in a notebook
creds = '{"username":"greensamiam","key":"<your-kaggle-api-key>"}'

In [4]:
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

In [5]:
path = Path('us-patent-phrase-to-phrase-matching')

In [None]:
!pip install kaggle

In [8]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

To see what data you have, you can use bash commands in conjunction with Python Variables!

In [9]:
!ls {path}

sample_submission.csv test.csv              train.csv
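
`!ls {path}` only works inside a notebook. As a sketch of a plain-Python equivalent, the hypothetical `list_dir` helper below uses `pathlib` to list the same folder:

```python
from pathlib import Path


def list_dir(path):
    """Return the sorted names of the entries directly inside `path`."""
    return sorted(p.name for p in Path(path).iterdir())


# Usage, assuming the competition data has been extracted to this folder:
data_path = Path('us-patent-phrase-to-phrase-matching')
if data_path.exists():
    print(list_dir(data_path))
```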


In [11]:
import pandas as pd

In [12]:
df = pd.read_csv(path/'train.csv')

In [13]:
df

index | id | anchor | target | context | score
:----:|:--:|:------:|:------:|:-------:|:-----:
0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50
1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75
2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25
3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50
4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00
... | ... | ... | ... | ... | ...
36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00
36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50
36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50
36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75


In [15]:
df.describe(include="object")

statistic | id | anchor | target | context
:--------:|:--:|:------:|:------:|:-------:
count | 36473 | 36473 | 36473 | 36473
unique | 36473 | 733 | 29340 | 106
top | 37d61fd2272659b1 | component composite coating | composition | H01
freq | 1 | 152 | 24 | 2186


^^ The `describe` method gives you a summary of the data that you're importing. It's one of the most useful commands in pandas.

As you can see, there's actually not a lot of unique data! E.g. there are only 733 unique anchors and 106 unique contexts across 36,473 rows.

To create a categorisation task, you will need to combine these columns so that the Neural Network has something that it can learn from.

The added headers (`TEXT1` etc.) are there as markers so that the model knows where each piece of data starts and ends. They could equally be `X`, or in a different order. The NN just needs to be able to distinguish between the pieces of data. You roughly need to give the NN data it understands....

In [19]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

`.head` gives you the first few rows

In [20]:
df['input'].head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

You can also use dot notation for accessing columns.

In [23]:
df.input[0]

'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement'
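
To see what the column arithmetic above does for a single row, here is a small sketch without pandas. `build_input` is a hypothetical helper, not part of the notebook; it just joins the three fields with the same markers.

```python
def build_input(context, target, anchor):
    """Join the three fields into one string, using the same markers as above."""
    return f'TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}'


print(build_input('A47', 'abatement of pollution', 'abatement'))
```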

Neural Nets work with numbers...so we need to convert this text into numbers.

### Step 1 (Tokenization):
- Split each of these into tokens (i.e. basically words)
    - Problem: not all languages have words (e.g. Chinese), and definitely not space-separated words
- A vocabulary is the list of unique tokens.
    - If the vocabulary is too big, then it takes longer to train...SO, nowadays people tend to tokenise into **subwords**
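
To give a feel for subword tokenisation, below is a toy sketch that greedily splits each word into the longest pieces found in a tiny hand-written vocabulary. The vocabulary, the `<unk>` marker and the greedy strategy are all assumptions for illustration; real tokenizers learn their vocabulary from a large corpus and use their own splitting algorithms.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from a large corpus.
VOCAB = {"▁a", "▁platypus", "▁is", "▁an", "▁or", "ni", "tho", "rhynch", "us", "."}


def tokenize_word(word, vocab):
    """Greedily split one word into the longest pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit an unknown marker and skip one character.
            pieces.append("<unk>")
            start += 1
    return pieces


def tokenize(text, vocab):
    """Lower-case, split on spaces, mark word starts with '▁', then split into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word("▁" + word, vocab))
    return tokens


print(tokenize("A platypus is an ornithorhynchus", VOCAB))
```

Common words stay whole, while the rare word is broken into several smaller pieces, so a modest vocabulary can still cover unusual text.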

PyTorch and Huggingface both have 'datasets', this one is the Huggingface one

In [25]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

In [26]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- *Tokenization*: Split each text up into words (or actually, as we'll see, into *tokens*)
- *Numericalization*: Convert each word (or token) into a number.

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace "small" with "large" for a slower but more accurate model, once you've finished exploring):

**Huggingface Model Hub has 44,748 models!**
You can search for models for your specific use case at [HuggingFace Models](https://huggingface.co/models)

i.e - Can search for 'segmentation' and find a pre-trained model!

BUT - there are a lot of models that work pretty well lots of the time! E.g. DeBERTa-v3.

NLP has only been useful for a year or two, whereas computer vision has been established for a few years now. The community is figuring it out NOW!

In [27]:
model_nm = 'microsoft/deberta-v3-small'

`AutoTokenizer` will create a tokenizer appropriate for a given model. It just says 'this model uses this tokenisation method'.

*Before you start tokenising, you need to decide how to tokenise! If you're using a pretrained model, you need to use the same tokenisation method as the pretrained model otherwise you'll not be able to map to the data properly.*

In [30]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


You can easily see how the tokeniser works as follows

In [31]:
tokz.tokenize("Heya mate, how'z it going?")

['▁He', 'ya', '▁mate', ',', '▁how', "'", 'z', '▁it', '▁going', '?']

The `▁` represents the start of a word! Uncommon words may be split up into pieces so that they map back to the 'vocabulary' generated when the particular model was first trained!

E.g. the token `▁A` will be given a number in the vocabulary. This is called **Numericalisation**
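
Numericalisation itself is just a dictionary lookup. The sketch below builds a made-up vocabulary from the platypus tokens shown further down, assigning ids in order of first appearance. These ids are hypothetical; a pretrained model ships with its own fixed token-to-id mapping.

```python
tokens = ['▁A', '▁platypus', '▁is', '▁an', '▁or', 'ni', 'tho',
          'rhynch', 'us', '▁an', 'at', 'inus', '.']

# Hypothetical vocabulary: ids are assigned in order of first appearance.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Numericalise: look each token up in the vocabulary.
ids = [vocab[tok] for tok in tokens]

# The mapping is reversible, so ids can be turned back into tokens.
id_to_token = {i: tok for tok, i in vocab.items()}

print(ids)
print([id_to_token[i] for i in ids] == tokens)
```

Note that the repeated token `▁an` gets the same id both times it appears.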

In [32]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

To take a document and tokenize it, you can create a simple function to do this!

In [33]:
def tok_func(x): return tokz(x["input"])

In [37]:
tok_ds = ds.map(tok_func, batched=True)

  0%|          | 0/37 [00:00<?, ?ba/s]

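^^ `ds.map` will run this function over the whole dataset. With `batched=True` it passes a batch of rows at a time to the tokenizer, which uses a 'Rust' library to process them in parallel!

You can see that our text string is now a series of numeric ids (`input_ids`) that can be fed into the neural network and mapped back to the vocab!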

In [36]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

`.vocab` is a dictionary that maps each token to its id, so you can just look a token up

In [38]:
tokz.vocab['▁of']

265

### Note on long documents:
- ULMFiT can handle long documents easily, because it can split a long document into parts and process it gradually
- Transformers need to read the whole document at once...people spend a lot of money on GPUs that can hold it in memory.

Rule of thumb: try transformers, but if that's too expensive (roughly around 2,000 words, probably) then try ULMFiT! Try both.

A laptop GPU might not be able to handle 2,000 words...
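
As a rough illustration of "splitting a long document into parts", the sketch below cuts a list of words into consecutive chunks. The `split_into_chunks` helper is hypothetical, and the 2,000-word limit is just the rule-of-thumb figure above; it does not show how ULMFiT actually carries information from one part to the next.

```python
def split_into_chunks(words, max_words=2000):
    """Split a list of words into consecutive chunks of at most `max_words`."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


document = ("word " * 4500).split()  # a made-up 4,500-word document
chunks = split_into_chunks(document)
print([len(c) for c in chunks])
```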

Finally, we need to prepare our labels. Transformers always assumes that your labels have the column name `labels`, but in our dataset it's currently `score`. Therefore, we need to rename it:

In [40]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

## Test and Validation Sets

You need to have separate training and validation sets!

The competition's test set (`test.csv`) doesn't have labels, i.e. there is no `score` column!

In [41]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

statistic | id | anchor | target | context
:--------:|:--:|:------:|:------:|:-------:
count | 36 | 36 | 36 | 36
unique | 36 | 34 | 36 | 29
top | 4112d61851461f60 | el display | inorganic photoconductor drum | G02
freq | 1 | 2 | 1 | 3


sklearn is good for quick and dirty work, and also for classic ML

### Underfitting and Overfitting

![image.png](attachment:image.png)

Models that are too simple are underfit.

![image-2.png](attachment:image-2.png)

Models that are too detailed are overfit.

Underfitting is easy to recognise, as the model won't fit our data very well, i.e. it has a high error rate.

Overfitting is harder to recognise, as the training data will be very closely fit, so the error on the training data looks low.

### Test Overfitting
To test for overfitting, we take the original data set and remove some of the points (e.g. 20% of them).

Then we fit the model using only the points we haven't removed, and test it against the ones we removed.

The *Validation Set* doesn't get shown to the model during training

Most libraries let you train without a validation set, which makes it easy to shoot yourself in the foot! Always check your metrics against the validation set!
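
The hold-out procedure described above can be sketched in a few lines of plain Python. `holdout_split`, the 20% fraction and the seed are illustrative assumptions; libraries provide their own splitting utilities.

```python
import random


def holdout_split(items, valid_frac=0.2, seed=42):
    """Shuffle a copy of `items` and hold out `valid_frac` of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    return shuffled[n_valid:], shuffled[:n_valid]


train, valid = holdout_split(range(100))
print(len(train), len(valid))
```

The model would be fit on `train` only, and its metrics checked on `valid`, which it never saw during training.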

### A good validation set

You can't always just take a random sample of your data.

Take, for example, the time series prediction below. A model validated on random points from the middle will look better than it is, because there's lots of surrounding data to interpolate from, but it won't do a good job of predicting the future of the time series. Instead, you want your validation set to come from the end, i.e. so it looks a lot more like what you're actually trying to predict!

Original Data |  Poor random sampling | Better truncated Sampling
:------------:|:---------------------:|:-------------------------:
![image-3.png](attachment:image-3.png)  |  ![image-4.png](attachment:image-4.png) | ![image-5.png](attachment:image-5.png)
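
For a time series, the "better" split in the table above can be sketched as follows. `time_split` is a hypothetical helper and assumes the data is already sorted from oldest to newest.

```python
def time_split(series, valid_frac=0.2):
    """Keep the earliest points for training and the most recent for validation.

    Assumes `series` is already sorted from oldest to newest.
    """
    n_valid = int(len(series) * valid_frac)
    cut = len(series) - n_valid
    return series[:cut], series[cut:]


days = list(range(1, 11))  # hypothetical: ten consecutive days
train, valid = time_split(days)
print(train, valid)
```

Unlike a random sample, every validation point here comes after every training point, which matches the real task of predicting the future.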


