<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ufidon/nlp/blob/main/11.apps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ufidon/nlp/blob/main/11.apps.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>
<br>

# NLP APPLICATIONS

- 📝 SALP chapter 13-16
- 📝 [HuggingFace NLP course](https://huggingface.co/learn/nlp-course)

🔭 NLP Theory in Action: HuggingFace NLP
- [Explore Hugging Face](https://huggingface.co/)
- [Explore its repositories](https://github.com/huggingface)

In [None]:
# Install the core Hugging Face libraries
!pip install tokenizers transformers datasets accelerate

from transformers import pipeline

## 💡 HuggingFace Transformers
### NLP Pipeline
- The `pipeline()` function in the 🤗 Transformers library simplifies NLP tasks by 
  - combining model selection with preprocessing and postprocessing steps, 
  - enabling easy input of text and retrieval of results.
- When text is passed to a pipeline, it goes through `three` main steps: 
  - preprocessing into a model-compatible format, 
  - running through the model, 
  - and post-processing to make predictions understandable.
- Available pipelines: 
  - feature-extraction (get the vector representation of a text)
  - fill-mask, question-answering
  - ner (named entity recognition), sentiment-analysis, zero-shot-classification
  - summarization, text-generation, translation
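The three steps can be sketched in plain Python. This is a toy stand-in, not the 🤗 Transformers implementation: the vocabulary, the per-token weights, and the function names (`preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) are all hypothetical, chosen only to show how text becomes IDs, IDs become raw scores (logits), and logits become a readable prediction.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
# Hypothetical per-token contributions to the (NEGATIVE, POSITIVE) logits.
WEIGHTS = {2: (-2.0, 2.0), 3: (2.0, -2.0)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: split the text into tokens and map each token to an integer ID."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(input_ids):
    """Step 2: turn the IDs into one raw score (logit) per class."""
    logits = [0.0, 0.0]
    for token_id in input_ids:
        neg, pos = WEIGHTS.get(token_id, (0.0, 0.0))
        logits[0] += neg
        logits[1] += pos
    return logits


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does behind one call."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I love this course"))
```

A real pipeline replaces `preprocess` with a trained tokenizer and `toy_model` with a neural network, but the hand-off between the stages has the same shape.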

In [None]:
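# 1. Sentiment analysis (uses the pipeline's default sentiment model)
classifier = pipeline("sentiment-analysis")
print(classifier("I've been waiting for a HuggingFace course my whole life."))

# Multiple statements in one call
print(classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
))

# 1.1 Zero-shot classification: score a text against custom labels
zero_shot = pipeline("zero-shot-classification")
zero_shot(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)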

- The `zero-shot-classification` pipeline classifies text against custom labels that you supply at call time,
  - which is useful when labeling data is difficult or time-consuming.
- It is called "zero-shot" because the model needs no fine-tuning on your data:
  - it returns a probability score for each label in whatever label set you pass.
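One way to see how scores for an arbitrary label set can come out of a model that was never trained on those labels: a common approach recasts classification as natural language inference, turning each label into a hypothesis sentence and comparing how strongly the text entails each one. The sketch below assumes that approach; the template string, the helper names, and the logits are illustrative stand-ins, not output from a real model.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence for an NLI model."""
    return [template.format(label) for label in labels]


def scores_from_entailment(labels, entailment_logits):
    """Softmax the per-label entailment logits into scores that sum to 1,
    returned as (label, score) pairs sorted from most to least likely."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = zip(labels, (e / total for e in exps))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["education", "politics", "business"]
print(build_hypotheses(labels))

# Made-up logits standing in for an NLI model's "entailment" score per hypothesis.
made_up_logits = [3.0, -1.0, 0.5]
for label, score in scores_from_entailment(labels, made_up_logits):
    print(f"{label}: {score:.3f}")
```

Because the labels only appear inside the hypothesis sentences, swapping in a different label set needs no retraining.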

In [None]:
# 2. Text generation
generator = pipeline("text-generation")
generator("In this course, I learned how to")

- The text-generation pipeline allows you to input a prompt, and the model completes it by generating the rest of the text, similar to predictive text on phones.
- You can customize the output with `num_return_sequences`, the number of sequences to generate, and `max_length`, the maximum total length in tokens (prompt included).

In [None]:
# 2.1 Customizing the generator with specified model instead of the default
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, I learned how to",
    max_length=30,
    num_return_sequences=2,
)

- You can choose [specific models](https://huggingface.co/models?pipeline_tag=text-generation) from the Hugging Face Model Hub for various tasks, filtering by task and language to find models suited for specific needs.
- The Hub provides an online widget to test models directly, allowing quick previews of model capabilities before downloading.

In [None]:
# 3. Mask filling: `fill-mask` fills in the blanks in a given text.
# Mask tokens differ between models: the default model used here expects <mask>,
# while BERT checkpoints expect [MASK].
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

In [None]:
# 4. Named entity recognition
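# NER is a token-classification task, like part-of-speech (POS) tagging,
# but it labels entities (people, places, organizations) rather than word classes.
# Recent transformers releases deprecate `grouped_entities` in favor of aggregation_strategy="simple".
ner = pipeline("ner", grouped_entities=True)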
ner("Begin your journey at the majestic Lincoln Memorial, stroll past reflecting pools to the Smithsonian museums, explore Capitol Hill, and end with sunset views of the White House.")

- Setting `grouped_entities=True` in the pipeline groups parts of a sentence that belong to the same entity, 
  - allowing `multi-word names` like "Lincoln Memorial" to be recognized as a single entity.
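The grouping idea can be illustrated with a minimal sketch, assuming tokens tagged in the common BIO scheme ("B-X" begins an entity of type X, "I-X" continues it, "O" is outside any entity). The function `group_entities` and the hand-written tags are hypothetical; the real pipeline also merges subword pieces and combines scores, which this sketch omits.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens of the same entity type into one span.

    `tagged_tokens` is a list of (word, tag) pairs with BIO tags.
    """
    groups = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)  # continue the open span
            continue
        if current_words:  # close the span that just ended
            groups.append({"entity_group": current_type, "word": " ".join(current_words)})
        if prefix in ("B", "I"):  # start a new span
            current_words, current_type = [word], entity_type
        else:  # "O": outside any entity
            current_words, current_type = [], None
    if current_words:  # close a span that runs to the end of the input
        groups.append({"entity_group": current_type, "word": " ".join(current_words)})
    return groups


# Hand-written tags for illustration; a real model predicts them.
tagged = [("the", "O"), ("Lincoln", "B-LOC"), ("Memorial", "I-LOC"),
          ("and", "O"), ("Capitol", "B-LOC"), ("Hill", "I-LOC")]
print(group_entities(tagged))
```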

In [None]:
# 5. Question answering
question_answerer = pipeline("question-answering")
question_answerer(
    question="Who will be the next president?",
    context="The U.S. presidential campaign between Kamala Harris and Donald Trump was intense, highlighting sharp divides on policy and vision. Harris focused on progressive reforms, while Trump emphasized traditional values. Both candidates rallied fervent supporters, underscoring contrasting paths for America's future.",
)

In [None]:
# 6. Summarization
article = """
The U.S. presidential race between Kamala Harris and Donald Trump was 
a highly charged contest that underscored deep ideological divides. 
Harris championed progressive policies, emphasizing social equity, 
climate action, and healthcare expansion. She aimed to build on her 
party’s recent reforms, appealing to a younger, diverse electorate. 
Trump, in contrast, doubled down on conservative principles, 
prioritizing national security, economic growth, and traditional values, 
rallying his core base with a focus on restoring a sense of American identity. 
Both candidates energized their supporters with a vision of the country’s future, 
setting up a pivotal choice for voters on Election Day.
"""
summarizer = pipeline("summarization")
summarizer(article)

In [None]:
# 7. Translation: French to English
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

In [None]:
# 7.1 Chinese to English
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
pipe("""
《夜宿山寺》
     唐代李白
危楼高百尺，手可摘星辰。
不敢高声语，恐惊天上人。     
""")

### Transformer mechanism
- Transformer models can be broadly grouped into three categories:
  - `Auto-regressive` models like GPT
  - `Auto-encoding` models like BERT
  - `Sequence-to-sequence` models like BART/T5

### Transformers are language models
- Transformer models are pretrained with `self-supervised learning` on raw text,
  - developing a statistical understanding of language without human-labeled data,
  - typically through `causal or masked` language modeling (predicting the next word, or a hidden word, from its context).
- For practical tasks, these pretrained models undergo `transfer learning`:
  - they are fine-tuned with human-labeled data to perform a specific task.
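Causal and masked language modeling both take their training targets from the raw text itself, which is why no human labels are needed. A minimal sketch of how each objective turns a sentence into (input, target) pairs follows; the function names are hypothetical, and the mask position is fixed here, whereas real pretraining picks positions at random.

```python
def causal_lm_examples(tokens):
    """Causal LM: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions and predict them from both sides."""
    corrupted = [mask_token if i in mask_positions else tok
                 for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return corrupted, targets


tokens = "the cat sat down".split()

# Causal: the context grows left to right, and the target is always the next token.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Masked: the model sees words on both sides of the hidden position.
print(masked_lm_example(tokens, {2}))
```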

### Transformers are big models
- Increasing model size and data generally improves performance but is costly in time, compute resources, and environmental impact.
- Sharing pretrained models helps reduce global costs by avoiding the need to retrain from scratch for each project.
- Tools like `ML CO2 Impact` and `Code Carbon` can help measure a model's carbon footprint, promoting awareness and efficiency in model training.

### Transfer Learning
- Pretraining trains a model from scratch, starting from random weights, on very large datasets, which requires significant time and resources.
- Fine-tuning continues training a pretrained model on a task-specific dataset, needing far less data, time, and cost because it reuses the knowledge acquired during pretraining.
- Fine-tuning usually yields better results than training from scratch on the same task data, especially when the pretrained model's training data resembles the task data.

### General Architecture
- The Transformer model architecture has two main components: 
  - an encoder for understanding input and 
  - a decoder for generating output based on this understanding.
- Depending on the task, each part can be used independently:
  - encoder-only models for tasks that require understanding the input, like classification,
  - decoder-only models for generative tasks such as text generation,
  - encoder-decoder models for generation that depends on an input, like translation.
- Attention layers tell the model which words to focus on for each prediction, which is crucial for tasks like translation.
  - Context impacts word meaning, requiring the model to focus on nearby or distant words as needed.
- The encoder can attend to the full input sentence, while the decoder builds the output one word at a time.
  - During training, the decoder is fed the whole target sentence but may only attend to the words before the position it is predicting; otherwise the task would be trivial.
  - Attention masks block attention to selected positions, such as the padding tokens added to batch sentences of different lengths.
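The difference between full (encoder-style) and masked (decoder-style) attention can be sketched with scaled dot-product attention weights in plain Python. This is a simplified illustration under stated assumptions: the token vectors are made up, the same vectors serve as queries and keys, and the learned projections and value vectors of a real attention layer are omitted.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (masked -inf scores become 0)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Return one row of attention weights per query position.

    Scores are scaled dot products. With causal=True, position i may only
    attend to positions <= i, so future positions receive weight 0.
    """
    dim = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
        rows.append(softmax(scores))
    return rows


# Three made-up 2-dimensional token vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("full attention (encoder-style):")
for row in attention_weights(vectors, vectors):
    print([round(w, 3) for w in row])

print("causal attention (decoder-style):")
for row in attention_weights(vectors, vectors, causal=True):
    print([round(w, 3) for w in row])
```

In the causal case the weights above the diagonal are zero, so each position depends only on itself and earlier positions. A padding mask works the same way, except that the blocked positions are the padding tokens.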

### Architectures vs. checkpoints
- **Architecture**: Defines the model structure, including layers and operations.
- **Checkpoints**: Saved weights trained on specific data for an architecture.
- **Model**: Broad term referring to either architecture, checkpoint, or both.
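As a toy analogy, the architecture is the code that fixes the computation, and a checkpoint is one set of saved parameter values loaded into it. The class `TinyLinearClassifier` and both checkpoints below are hypothetical and far smaller than any real model. In the same sense, `bert-base-uncased`, used later in this notebook, is one checkpoint of the BERT architecture.

```python
class TinyLinearClassifier:
    """Architecture: fixes the computation; parameter values come from a checkpoint."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    @classmethod
    def from_checkpoint(cls, checkpoint):
        """Load saved parameter values (a checkpoint) into the architecture."""
        return cls(checkpoint["weights"], checkpoint["bias"])

    def predict(self, features):
        """Return class 1 if the weighted sum is positive, else class 0."""
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1 if score > 0 else 0


# Two hypothetical checkpoints for the same architecture.
checkpoint_a = {"weights": [1.0, -1.0], "bias": 0.0}
checkpoint_b = {"weights": [-1.0, 1.0], "bias": 0.0}

features = [2.0, 0.5]
model_a = TinyLinearClassifier.from_checkpoint(checkpoint_a)
model_b = TinyLinearClassifier.from_checkpoint(checkpoint_b)
print(model_a.predict(features), model_b.predict(features))
```

The same architecture gives different predictions depending on which checkpoint is loaded, which is why the choice of checkpoint matters as much as the choice of architecture.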

### Encoder models
- Encoder models use `bi-directional` attention to understand the entire sentence context, and are typically used for tasks like classification and extractive question answering.
- They are pretrained by `reconstructing corrupted sentences`, with models like BERT, ALBERT, and RoBERTa as key examples.

### Decoder models
- Decoder models use only the Transformer decoder. They are `auto-regressive` and `predict the next word` based on previous words, making them ideal for text generation tasks.
- Examples include models like GPT and Transformer XL, which excel in generating coherent text by training on sequential word prediction.
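The auto-regressive loop can be sketched with a toy lookup table in place of the neural network. The table `NEXT_TOKEN_PROBS` and the function `generate_greedy` are hypothetical. Two simplifications are assumed: the toy "model" looks only at the last token, whereas a real decoder conditions on all previous tokens, and decoding is greedy (always the most probable token), whereas real generation often samples.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.4, "</s>": 0.6},
    "sat": {"</s>": 1.0},
}


def generate_greedy(prompt, max_length=10):
    """Append the most probable next token until </s> or max_length is reached.

    max_length counts the prompt tokens as well as the generated ones.
    """
    tokens = list(prompt)
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":  # end-of-sequence token: stop generating
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(generate_greedy(["<s>"]))
print(generate_greedy(["<s>", "a"], max_length=3))
```

Each generated token is fed back as input for the next step, which is what "auto-regressive" means.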

### Encoder-decoder models 
- Encoder-decoder models, also called sequence-to-sequence models, use both parts of the Transformer,
  - with the encoder capturing the full input context and the decoder generating the output one token at a time.
- These models, like T5 and BART, are well suited to tasks that generate new text conditioned on an input, such as summarization, translation, and generative question answering.

### Bias and limitations
- Transformer models are trained on large datasets that include both high- and low-quality content, leading to potential biases in generated responses.
- Even after fine-tuning, models may still produce biased or offensive outputs, so keep these limitations in mind before using them in production.

In [None]:
# Bias and limitations
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

### Summary

| Model type | Examples | Tasks |
|------------|----------|-------|
| **Encoder**      | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering |
| **Decoder**      | CTRL, GPT, GPT-2, Transformer XL        | Text generation           |
| **Encoder-decoder** | BART, T5, Marian, mBART      | Summarization, translation, generative question answering |

## 🏃 Explore and practice 
- The rest of the [🤗 NLP Course](https://huggingface.co/learn/nlp-course/chapter2)

# References
- [⚔️ Chatbot Arena (formerly LMSYS): Free AI Chat to Compare & Test Best AI Chatbots](https://lmarena.ai/)
  - [openplayground: An LLM playground you can run on your laptop](https://github.com/nat/openplayground)
- [LLM Zoo: democratizing ChatGPT](https://github.com/FreedomIntelligence/LLMZoo)
- [DSPy: Programming—not prompting—Foundation Models](https://github.com/stanfordnlp/dspy)
  - [LLM Engine: fine-tuning and serving large language models](https://github.com/scaleapi/llm-engine)
- [vLLM: a fast and easy-to-use library for LLM inference and serving](https://github.com/vllm-project/vllm)
  - [ollama: Get up and running with large language models](https://github.com/ollama/ollama)
  - [llama.cpp: LLM inference in C/C++](https://github.com/ggerganov/llama.cpp)
- [Browse State-of-the-Art](https://paperswithcode.com/sota)
  - [Practical Deep Learning](https://course.fast.ai/)
  - [deeplearning.ai](https://www.deeplearning.ai/)