# Prerequisites Study Notes

## NLP (Natural Language Processing)
*Enabling computers to understand, interpret, and generate human language.*

### Applications
- Virtual Assistants
- Sentiment Analysis
- Summarization
- Fraud Detection
- ...

### Why Now?
- Transformer models can capture long-range dependencies in text
- Large text datasets
- Advances in computational power and hardware such as GPUs
- Open-source libraries

## Hugging Face Pipelines

Hugging Face pipelines provide a quick and easy way to use models for inference. In as little as three lines of code, you can summarize text, translate languages, answer questions, generate text, fill in masked text, or perform a variety of other NLP tasks.

1. Tokenization (break the input text into tokens)
2. Encoding (turn the tokens into numerical representations)
3. Post-processing (convert the model output into human-readable results)
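
The three stages above can be sketched without any library. This is a toy illustration only, not how `transformers` implements them: the vocabulary `VOCAB`, the scoring rule in `toy_model`, and the label names are all made up for the example.

```python
# Toy walk-through of the three pipeline stages (illustrative only).
# VOCAB, toy_model, and LABELS are hypothetical stand-ins, not transformers APIs.

VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "movie": 5}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Stage 1: break the text into lowercase word tokens."""
    return text.lower().replace(".", " ").replace("!", " ").split()


def encode(tokens):
    """Stage 2: map each token to an integer id (unknown words -> [UNK])."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]


def toy_model(input_ids):
    """Stand-in for a real model: returns one raw score per label."""
    positive = float(input_ids.count(VOCAB["love"]))
    negative = float(input_ids.count(VOCAB["hate"]))
    return [negative, positive]


def postprocess(scores):
    """Stage 3: turn raw scores into a human-readable result."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": LABELS[best], "scores": scores}


tokens = tokenize("I love this movie!")
input_ids = encode(tokens)
result = postprocess(toy_model(input_ids))

print(tokens)
print(input_ids)
print(result["label"])
```

A real pipeline places a trained model between encoding and post-processing; the stand-in here only counts two words so that the data flow stays visible.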


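```python
from transformers import pipeline, Conversation

# Note: the "conversational" pipeline and the Conversation class belong to older
# transformers releases; more recent releases no longer ship them.
converse = pipeline("conversational")

# Each Conversation object holds one independent dialogue.
conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
conversation_2 = Conversation("What's the last book you have read?")
converse([conversation_1, conversation_2])
```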

```text
No model was supplied, defaulted to microsoft/DialoGPT-medium and revision 8bada3b (https://huggingface.co/microsoft/DialoGPT-medium).
Using a pipeline without specifying a model name and revision in production is not recommended.
Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
```

```text
[Conversation id: 9f1c109e-8353-4478-9a46-6259cddcc25d
 user: Going to the movies tonight - any suggestions?
 assistant: The Big Lebowski,
 Conversation id: 9c335953-80ae-4308-bb37-8d6f8a191a4b
 user: What's the last book you have read?
 assistant: The Last Question]
```

### Zero-Shot Text Classification

```python
from transformers import pipeline

pipe = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

# The candidate labels are supplied at inference time; the model was not trained on them.
pipe(
    "I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)
```

```text
{'sequence': 'I have a problem with my iphone that needs to be resolved asap!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.5227586030960083,
  0.45813971757888794,
  0.014264645986258984,
  0.0026850185822695494,
  0.0021520694717764854]}
```
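
The result is a plain dictionary whose `labels` and `scores` lists line up position by position, sorted from most to least likely. In this output the scores sum to roughly 1, so the candidate labels compete with each other. A minimal sketch of reading it, with the scores rounded and an arbitrary threshold of 0.3 chosen only for illustration:

```python
# The result dictionary shown above, with scores rounded for readability.
result = {
    "sequence": "I have a problem with my iphone that needs to be resolved asap!",
    "labels": ["urgent", "phone", "computer", "not urgent", "tablet"],
    "scores": [0.5228, 0.4581, 0.0143, 0.0027, 0.0022],
}

# Labels and scores are parallel lists, already sorted by score.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# Keep every label above a threshold (0.3 is an arbitrary choice for this sketch).
THRESHOLD = 0.3
selected = [label for label, score in ranked if score >= THRESHOLD]

print(top_label)
print(selected)
print(round(sum(result["scores"]), 2))
```

Here two labels clear the threshold, which matches the intuition that the message is both urgent and about a phone.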

### Named Entity Recognition (NER)

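```python
from transformers import pipeline

# No model is specified, so the pipeline falls back to its default NER model.
pipe = pipeline(task="token-classification")
pipe("I am John and I live in New York City.")
```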

```text
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model
```

```text
[{'entity': 'I-PER',
  'score': 0.9974554,
  'index': 3,
  'word': 'John',
  'start': 5,
  'end': 9},
 {'entity': 'I-LOC',
  'score': 0.9992238,
  'index': 8,
  'word': 'New',
  'start': 24,
  'end': 27},
 {'entity': 'I-LOC',
  'score': 0.99931407,
  'index': 9,
  'word': 'York',
  'start': 28,
  'end': 32},
 {'entity': 'I-LOC',
  'score': 0.99942446,
  'index': 10,
  'word': 'City',
  'start': 33,
  'end': 37}]
```
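
The output is one prediction per token, so "New York City" arrives as three separate `I-LOC` entries. The pipeline accepts an `aggregation_strategy` argument for merging them; the sketch below is a hand-rolled illustration of the idea, not the library's implementation. The helper `group_entities` and its `max_gap` parameter are hypothetical names introduced here.

```python
# Token-level predictions copied from the output above (scores rounded).
tokens = [
    {"entity": "I-PER", "score": 0.9975, "word": "John", "start": 5, "end": 9},
    {"entity": "I-LOC", "score": 0.9992, "word": "New", "start": 24, "end": 27},
    {"entity": "I-LOC", "score": 0.9993, "word": "York", "start": 28, "end": 32},
    {"entity": "I-LOC", "score": 0.9994, "word": "City", "start": 33, "end": 37},
]
text = "I am John and I live in New York City."


def group_entities(tokens, text, max_gap=1):
    """Merge neighbouring tokens that share an entity type into one span."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-LOC" -> "LOC"
        last = groups[-1] if groups else None
        # Extend the previous group if the type matches and the tokens are adjacent.
        if last and last["label"] == label and tok["start"] - last["end"] <= max_gap:
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            groups.append({"label": label, "start": tok["start"],
                           "end": tok["end"], "scores": [tok["score"]]})
    return [
        {"label": g["label"],
         "text": text[g["start"]:g["end"]],
         "score": sum(g["scores"]) / len(g["scores"])}
        for g in groups
    ]


entities = group_entities(tokens, text)
for ent in entities:
    print(ent["label"], ent["text"])
```

The character offsets (`start`, `end`) are what make the merge possible: the merged span is sliced straight out of the original sentence, so the spacing between words is preserved.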

## Evaluating models

[Hugging Face evaluate](https://huggingface.co/docs/evaluate/main/en/index)

### Metrics

Measure the performance of a model on a given dataset. Examples: Accuracy, Exact Match, Mean Intersection over Union (mIoU).
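
To make two of these metrics concrete, here is a plain-Python sketch of accuracy and exact match. The predictions and references are made-up values, and the functions are simplified stand-ins rather than the `evaluate` library's implementations.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def exact_match(predictions, references):
    """Fraction of predicted strings identical to the reference (whitespace trimmed)."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up example data.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
em = exact_match(["Paris", "berlin "], ["Paris", "Rome"])

print(acc)
print(em)
```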

### Comparisons

Compare performance of two or more models on a test dataset, to see whether the models' predictions diverge or not.
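
One simple way to see whether two models diverge is to count where each is right or wrong on the same examples. The sketch below assumes made-up predictions, and `compare_predictions` is a hypothetical helper, not an `evaluate` function.

```python
def compare_predictions(preds_a, preds_b, references):
    """Count how two models agree or disagree on the same test examples."""
    counts = {"both_right": 0, "only_a_right": 0, "only_b_right": 0, "both_wrong": 0}
    for a, b, ref in zip(preds_a, preds_b, references):
        if a == ref and b == ref:
            counts["both_right"] += 1
        elif a == ref:
            counts["only_a_right"] += 1
        elif b == ref:
            counts["only_b_right"] += 1
        else:
            counts["both_wrong"] += 1
    # Share of examples on which the two models output the same label.
    agreement = sum(a == b for a, b in zip(preds_a, preds_b)) / len(references)
    return counts, agreement


# Made-up example data.
references = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 1, 1, 1]

counts, agreement = compare_predictions(model_a, model_b, references)
print(counts)
print(agreement)
```

The two "only one model is right" counts are the interesting ones: if they are very unequal, the models differ in more than overall accuracy.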

### Measurements

Gain more insight into datasets and model predictions. For example, the average length of the inputs can help when choosing the maximum input length for a tokenizer.
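
A rough sketch of such a measurement, using whitespace-separated words as a proxy for tokens. A real tokenizer will usually produce different counts, so treat the numbers as an estimate; the sample texts and the `length_stats` helper are made up for illustration.

```python
from statistics import mean

# Made-up sample inputs.
texts = [
    "I love this movie",
    "Terrible",
    "The plot was slow but the acting was great",
]


def length_stats(texts):
    """Word-count statistics as a rough proxy for tokenized input length."""
    lengths = [len(t.split()) for t in texts]
    return {"mean": mean(lengths), "max": max(lengths)}


stats = length_stats(texts)
print(stats)
```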

| NLP Task                 | Evaluation Metric                         |
| ------------------------ | ----------------------------------------- |
| Text Classification      | Accuracy, F1 Score, AUC-ROC               |
| Token Classification     | Precision, Recall, F1 Score               |
| Question Answering       | Exact Match, F1 Score                     |
| Named Entity Recognition | Precision, Recall, F1 Score (per entity)  |
| Summarization            | ROUGE                                     |
| Translation              | BLEU                                      |
| Language Modeling        | Perplexity                                |
| Dialog Systems           | BLEU                                      |
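
Several of the metrics in the table reduce to a few lines of arithmetic. The sketch below shows precision, recall, and F1 for a binary task, plus perplexity computed from the probabilities a model assigned to the correct tokens. All input values are made up, and the functions are simplified illustrations rather than library implementations.

```python
import math


def precision_recall_f1(predictions, references, positive=1):
    """Precision, recall, and F1 for one positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def perplexity(token_probs):
    """exp of the average negative log-probability given to each correct token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up example data.
p, r, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
ppl = perplexity([0.5, 0.25])

print(p, r, f1)
print(ppl)
```

Lower perplexity means the model was less "surprised" by the text; a model that assigned probability 1 to every correct token would score exactly 1.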

## Lexicon

### In-Context Learning
The input prompt demonstrates the desired task, and the model performs it by predicting the output, without any parameter updates.

**Zero-shot learning** relies entirely on the model's pre-trained knowledge, with no demonstration in the input, whereas **one-shot learning** includes a single example in the input prompt to define the task. **Few-shot learning** includes a few examples in the input prompt; the variety of these examples helps the model infer the objective and the response structure.
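
The difference between the three settings is only the number of demonstrations placed in the prompt. A minimal sketch, assuming a made-up sentiment task; the `build_prompt` helper and the "Review / Sentiment" layout are illustrative choices, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble an in-context prompt: instruction, optional demos, then the query."""
    blocks = [task]
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The final block is left open for the model to complete.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


TASK = "Classify the sentiment of each review as positive or negative."
DEMOS = [
    ("Loved every minute.", "positive"),
    ("A waste of time.", "negative"),
    ("Great cast, great story.", "positive"),
]
QUERY = "The ending surprised me."

zero_shot = build_prompt(TASK, [], QUERY)        # no demonstrations
one_shot = build_prompt(TASK, DEMOS[:1], QUERY)  # one demonstration
few_shot = build_prompt(TASK, DEMOS, QUERY)      # several demonstrations

print(few_shot)
```

In all three cases the model's weights stay fixed; only the text it is asked to continue changes.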

### Retrieval Augmented Generation (RAG)

Retrieval augmented generation is a technique that combines LLMs with external knowledge sources. It first retrieves the most relevant passages from a knowledge source, and then uses the LLM to generate a response based on the retrieved passages.
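
The retrieve-then-generate flow can be sketched with a toy retriever. This example assumes a tiny in-memory document list and uses bag-of-words cosine similarity in place of a real embedding model; the generation step is left as a prompt string, since calling an LLM is outside the scope of the sketch. All names (`DOCS`, `retrieve`, `build_rag_prompt`) are hypothetical.

```python
import math
from collections import Counter

# Made-up knowledge source.
DOCS = [
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
    "The Great Wall of China is visible across northern China.",
]


def bag_of_words(text):
    """Lowercased word counts with basic punctuation stripped."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=1):
    """Step 1: return the k passages most similar to the question."""
    q = bag_of_words(question)
    return sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]


def build_rag_prompt(question, passages):
    """Step 2: place the retrieved passages in the prompt given to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


question = "Where is the Eiffel Tower?"
passages = retrieve(question, DOCS)
prompt = build_rag_prompt(question, passages)
print(prompt)
```

Production systems typically swap the word-count similarity for dense embeddings and a vector index, but the two-step shape stays the same.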