# **Natural Language Processing (NLP) and Large Language Models (LLMs)**

# **NLP**

*Natural Language Processing (NLP)* is a field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a meaningful way. It combines linguistics, computer science, and machine learning to connect human communication and digital systems.  

At its core, NLP allows machines to process text and speech by analyzing syntax, semantics, and context to extract useful information or produce coherent responses. Applications of NLP range from simple spell checkers and chatbots to complex systems such as automatic translation, sentiment analysis, and large language models that can reason, summarize, and converse naturally.  

In essence, NLP is what allows computers to *read*, *listen*, *speak*, and *understand* human language—turning unstructured linguistic data into structured, actionable knowledge.

---

## **History and Evolution**

The history of NLP is deeply intertwined with the broader development of artificial intelligence.  
It began in the 1950s, when researchers first attempted to make machines understand language through symbolic rules and logic. One of the earliest milestones was the **Georgetown–IBM experiment (1954)**, which demonstrated automatic translation of Russian sentences into English. Although limited, it marked the start of computational linguistics.

During the **1960s and 1970s**, systems such as **ELIZA** and **SHRDLU** explored rule-based interactions with users. These programs relied heavily on handcrafted linguistic rules, which worked only within narrow, predefined domains. While impressive at the time, they lacked scalability and real understanding.

The **1980s and 1990s** introduced the statistical revolution: researchers began using probability and data-driven methods instead of fixed rules. Large text corpora and statistical models (like Hidden Markov Models) enabled tasks such as speech recognition and part-of-speech tagging to become far more robust and adaptable.

In the **2000s**, the field shifted again toward **machine learning**, using algorithms such as support vector machines and conditional random fields. Open-source toolkits made NLP research and application more accessible: **NLTK** appeared in the early 2000s, and **spaCy** followed in 2015.

The **2010s** marked a turning point with the rise of **deep learning**. Neural networks, first recurrent architectures such as **LSTMs** and later attention-based **Transformers** (introduced in 2017), dramatically improved performance in translation, summarization, and text generation. The introduction of **word embeddings** (Word2Vec, GloVe) and **contextual models** (ELMo, BERT, GPT) brought NLP closer to human-level performance on many language-understanding benchmarks.

Today, NLP has evolved into a foundation of modern AI. With the emergence of **Large Language Models (LLMs)** trained on massive datasets, NLP systems can not only process language but also reason, write, and interact in ways that resemble human communication. The journey from handcrafted rules to self-learning models reflects the broader evolution of AI—from explicit instruction to emergent intelligence.

---

## **NLP Tasks**

Natural Language Processing includes a wide range of tasks that allow computers to understand and produce human language. Each task focuses on a specific aspect of language, from identifying emotions in text to generating complete sentences. Below are some of the main NLP tasks used in research and real-world applications.

---

### **Text Classification**

Text classification is about assigning predefined categories or labels to text.  
It can be used for spam detection in emails, topic identification in articles, or sorting customer feedback into categories.  

For example:
- "You won a free iPhone!" → *Spam*
- "Meeting at 10 AM tomorrow" → *Work*

Machine learning models learn to recognize patterns in words and phrases to make these predictions.
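
As a rough illustration of how a model can learn word patterns, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a tiny, made-up dataset. The training sentences, the labels, and the function names (`train_naive_bayes`, `classify`) are assumptions chosen for this example; a real classifier would need far more data and a better tokenizer.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set of (label, text) pairs.
TRAIN = [
    ("spam", "you won a free iphone"),
    ("spam", "claim your free prize now"),
    ("work", "meeting at 10 am tomorrow"),
    ("work", "project review meeting moved to friday"),
]

def train_naive_bayes(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_naive_bayes(TRAIN)
print(classify("free iphone prize", *model))  # spam
print(classify("meeting tomorrow", *model))   # work
```

Working in log-probabilities avoids numerical underflow when many small probabilities are multiplied together.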

---

### **Sentiment Analysis**

Sentiment analysis determines whether a piece of text expresses a positive, negative, or neutral opinion.  
It’s widely used in social media monitoring, brand analysis, and customer reviews.  

Example:
- "This phone is amazing!" → *Positive*
- "Battery life is terrible." → *Negative*

It helps companies understand public opinion and emotional tone in large volumes of text.
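
A very simple baseline, sketched below, scores text with a hand-written word list. This is a lexicon-based approach rather than a trained model, and the `LEXICON` entries and the `sentiment` function are hypothetical names invented for the illustration. It cannot handle negation ("not good") or sarcasm.

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of scored words.
LEXICON = {
    "amazing": 1, "great": 1, "love": 1,
    "terrible": -1, "bad": -1, "awful": -1,
}

def sentiment(text):
    """Sum the scores of known words and map the total to a label."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(sentiment("This phone is amazing!"))    # Positive
print(sentiment("Battery life is terrible."))  # Negative
```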

---

### **Named Entity Recognition (NER)**

Named Entity Recognition identifies and classifies key information in text such as names of people, organizations, locations, dates, or products.  

Example:
> "Apple released the iPhone in California in 2007."  
Here, **Apple** → *Organization*, **iPhone** → *Product*, **California** → *Location*, **2007** → *Date*.

NER is essential for information extraction, question answering, and search engines.

---

### **Part-of-Speech Tagging**

Part-of-Speech (POS) tagging assigns grammatical roles to each word in a sentence, such as noun, verb, adjective, or adverb.  

Example:
> "The cat sleeps on the mat."  
Tags might be: *The (det)*, *cat (noun)*, *sleeps (verb)*, *on (prep)*, *the (det)*, *mat (noun)*.

POS tagging helps systems understand the structure and meaning of sentences.

---

### **Machine Translation**

Machine translation automatically converts text from one language to another.  
Modern translation systems like Google Translate use deep learning models that consider not only individual words but also context and sentence structure.  

Example:
> English: “Good morning.”  
> French: “Bonjour.”

Today’s models, such as those based on Transformers, produce much more natural and accurate translations than older rule-based systems.

---

### **Text Summarization**

Text summarization condenses long passages into shorter versions while keeping the main ideas intact.  
There are two main types:
- **Extractive summarization**: selects the most important sentences from the text.
- **Abstractive summarization**: generates new sentences that capture the key meaning.

This task is useful for news aggregation, legal document review, and research papers.
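
To make the extractive variant concrete, here is a minimal sketch that scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones. The scoring rule and the function name `extractive_summary` are assumptions for this example; practical systems usually remove stop words and use stronger scoring methods.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]

text = "NLP models process text. Attention helps NLP models process long text. I had lunch."
print(extractive_summary(text))
```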

---

### **Question Answering**

Question Answering (QA) systems take a question in natural language and try to find the most accurate answer from a text or database.  

Example:
> Question: "Who wrote *Pride and Prejudice*?"  
> Answer: "Jane Austen."

Modern QA models combine reading comprehension and retrieval abilities, often using pretrained language models such as BERT or GPT.

---

### **Text Generation**

Text generation focuses on producing coherent and meaningful text automatically.  
It can be used for story writing, chatbots, email drafting, or code generation.  

The model predicts the next word based on previous ones, gradually forming complete sentences and paragraphs.  
Controlling parameters like *temperature*, *top-k*, and *top-p* allows adjustment of creativity and randomness.
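
The sketch below shows one plausible way these three parameters interact when picking the next token. The vocabulary, the logits, and the helper names (`softmax`, `filter_top_k_top_p`, `sample_next`) are hypothetical; actual libraries differ in details such as the order in which the filters are applied.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Return {index: prob} restricted by top-k and/or top-p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:  # keep the smallest set whose mass reaches top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

def sample_next(vocab, logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from the filtered, renormalized distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return vocab[rng.choices(indices, weights=weights, k=1)[0]]

VOCAB = ["cat", "dog", "car", "tree"]  # hypothetical vocabulary
LOGITS = [2.0, 1.5, 0.3, -1.0]         # hypothetical model scores
print(sample_next(VOCAB, LOGITS, temperature=0.7, top_k=2))
```

With `top_k=1` the function always returns the most likely token, which is equivalent to greedy decoding.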

---

# **Core Concepts in NLP**

Natural Language Processing (NLP) enables machines to process and generate human language. Several core concepts form the backbone of modern NLP systems.

---

## **Tokenization**

Tokenization splits text into smaller units called tokens—words, subwords, or characters. It’s the first step in most NLP pipelines, turning raw text into a structured format.

Types:
- **Word-level:** splits text by spaces or punctuation.  
- **Subword-level:** breaks words into smaller units (WordPiece, BPE), useful for rare words.  
- **Character-level:** each character is a token, helpful in morphologically rich languages.
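
The three granularities can be sketched in a few lines. The subword function below uses greedy longest-match splitting in the style of WordPiece, with `##` marking a continuation piece; `SUBWORD_VOCAB` is a hypothetical vocabulary made up for the example, whereas real tokenizers learn theirs from a corpus.

```python
import re

def word_tokenize(text):
    """Word-level: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Character-level: every character is a token."""
    return list(text)

def subword_tokenize(word, vocab):
    """Subword-level: greedy longest-match split; '##' marks a continuation."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matches at this position
        start = end
    return pieces

SUBWORD_VOCAB = {"token", "##ization", "##ize", "un", "##known"}  # hypothetical

print(word_tokenize("Hello, world!"))                   # ['Hello', ',', 'world', '!']
print(char_tokenize("NLP"))                             # ['N', 'L', 'P']
print(subword_tokenize("tokenization", SUBWORD_VOCAB))  # ['token', '##ization']
```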

---

## **Bag of Words (BoW)**

Before embeddings became popular, one of the simplest ways to represent text was the **Bag of Words (BoW)** model. Despite its simplicity, it is still widely used for basic text processing and classical machine learning models.
The Bag of Words model represents a text as a collection of its words, ignoring grammar, word order, and context. Essentially, a sentence or document is treated as a "bag" containing all its words. Each unique word in the corpus becomes a feature in a vector, and the value of that feature is usually the count of how many times the word appears in the document.

For example, consider these two sentences:

1. `"I love natural language processing"`  
2. `"Language processing is fun"`

First, we build a vocabulary of all unique words, ignoring case so that "Language" and "language" count as the same word:

`["I", "love", "natural", "language", "processing", "is", "fun"]`

Then we represent each sentence as a vector of word counts:

- Sentence 1 → `[1, 1, 1, 1, 1, 0, 0]`  
- Sentence 2 → `[0, 0, 0, 1, 1, 1, 1]`

Here, each position corresponds to a word in the vocabulary, and the number indicates how many times the word appears in that sentence.
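
The two steps above can be reproduced with a short sketch. The function names `build_vocabulary` and `bag_of_words` are illustrative, and the code lowercases every word (so "I" is stored as `"i"`) and splits on whitespace only, which is a simplification.

```python
def build_vocabulary(sentences):
    """Collect unique lowercase words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab

def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocab]

sentences = ["I love natural language processing", "Language processing is fun"]
vocab = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]

print(vocab)       # ['i', 'love', 'natural', 'language', 'processing', 'is', 'fun']
print(vectors[0])  # [1, 1, 1, 1, 1, 0, 0]
print(vectors[1])  # [0, 0, 0, 1, 1, 1, 1]
```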

BoW is simple and easy to implement, works well for small datasets and classical ML algorithms (Naive Bayes, SVM), and its vocabulary can be extended or restricted as needed.

Some variants are:
- **Binary BoW:** Instead of counts, each entry is 1 if the word appears, 0 otherwise.  
- **TF (Term Frequency):** Counts how often a word appears relative to the total number of words in the document.  
- **TF-IDF (Term Frequency–Inverse Document Frequency):** Adjusts word counts based on how common the word is across all documents. Rare words get higher weight, common words get lower weight. This helps highlight words that are more informative.
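
As a worked example of the weighting, the sketch below computes an unsmoothed TF-IDF score, using relative term frequency and `log(N / df)` as the inverse document frequency. This is one common textbook formulation and is assumed here for illustration; libraries often apply smoothing or normalization, so their numbers may differ.

```python
import math

def tf(term, doc_tokens):
    """Relative frequency of the term within one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """log(N / df): terms that occur in fewer documents get a higher weight."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = [s.lower().split() for s in [
    "I love natural language processing",
    "Language processing is fun",
]]

# "language" occurs in every document, so its weight is 0.
print(tf_idf("language", corpus[0], corpus))
# "love" occurs only in the first document, so it gets a positive weight.
print(tf_idf("love", corpus[0], corpus))
```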

BoW also has some limitations:
- Ignores word order, syntax, and context. `"I love NLP"` and `"NLP love I"` are treated the same.  
- High-dimensional sparse vectors for large vocabularies.  
- Cannot capture semantic similarity between words (e.g., `"car"` vs `"automobile"` are unrelated in BoW).

---

## Word Embeddings

Word embeddings are a way to represent words as dense numerical vectors in a continuous space. Unlike one-hot encoding, which represents each word as a sparse vector, embeddings allow words to have meaningful relationships. Words that are semantically or syntactically similar will be close together, while unrelated words will be farther apart. This helps models understand language patterns more effectively.

*Static embeddings* assign a fixed vector to each word, regardless of its context.

### Word2Vec

Word2Vec is one of the most popular methods for generating static embeddings. It has two main approaches:

1. **CBOW (Continuous Bag of Words):** predicts a target word from surrounding words.
2. **Skip-gram:** predicts surrounding words given a target word.

The intuition is simple: words that appear in similar contexts tend to have similar meanings. For example, "king" and "queen" often appear near words like "royal", "palace", or "monarch", so their vectors are close in the embedding space.

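Word2Vec also captures semantic relationships. Vector arithmetic can reveal analogies, such as gender or hierarchy relations, by adding and subtracting word vectors: the classic example is that `vector("king") - vector("man") + vector("woman")` lands close to `vector("queen")`.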
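
The difference between CBOW and Skip-gram is easiest to see in the training examples each one consumes. The sketch below only generates those examples from a token list; it does not train any vectors. The function names and the `window` parameter are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context_word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["the", "king", "rules", "the", "palace"]
print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_examples(tokens, window=1)[1])
```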

### GloVe

GloVe (Global Vectors) creates embeddings from co-occurrence statistics gathered over a large corpus. Whereas Word2Vec learns from one local context window at a time, GloVe first aggregates how often each pair of words appears together across the entire dataset and then fits vectors to those global counts. This produces embeddings that capture both syntactic and semantic relationships.
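
The global statistics GloVe starts from can be sketched as a symmetric co-occurrence count over a sliding window. This only illustrates the counting step with unweighted counts; the toy corpus and the function name `cooccurrence_counts` are assumptions, and fitting vectors to the counts is not shown.

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """Count how often each ordered word pair appears within the window."""
    counts = Counter()
    for tokens in corpus:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1  # keep the table symmetric
    return counts

corpus = [["ice", "is", "cold"], ["ice", "is", "solid"]]  # toy corpus
counts = cooccurrence_counts(corpus, window=2)
print(counts[("ice", "is")])     # 2
print(counts[("ice", "cold")])   # 1
print(counts[("cold", "solid")]) # 0
```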

### Measuring Similarity Between Embeddings

Once words are represented as vectors, we often want to know how similar or related they are. The most common method is **cosine similarity**:

$$
\text{sim}_{\cos}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}
$$

Cosine similarity measures the angle between two vectors. Vectors pointing in the same direction have a similarity close to 1, vectors at right angles have similarity near 0, and vectors pointing in opposite directions have similarity -1. This works well for high-dimensional embeddings because it focuses on direction rather than magnitude.

Other similarity or distance measures include:

- **Euclidean distance:**  
  Measures the straight-line distance between two vectors in space. Smaller distances mean more similar words. Useful for clustering or nearest-neighbor searches:
  $$
  d(u, v) = ||u - v||_2
  $$

- **Manhattan distance (L1 norm):**  
  Sum of absolute differences between vector components. Sometimes preferred for sparse or high-dimensional data:
  $$
  d(u, v) = \sum_i |u_i - v_i|
  $$
  
- **Dot product:**  
  Measures how much two vectors point in the same general direction. Often used in attention mechanisms and neural networks:
  $$
  u \cdot v = \sum_i u_i v_i
  $$
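
The four measures translate directly into code. The sketch below uses plain Python lists as vectors; the function names are chosen for this example, and the inputs are assumed to have equal length and non-zero norm.

```python
import math

def dot(u, v):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean_distance(u, v):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute component differences (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))

print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (right angle)
print(euclidean_distance([0, 0], [3, 4]))       # 5.0
print(manhattan_distance([0, 0], [3, 4]))       # 7
print(dot([1, 2, 3], [4, 5, 6]))                # 32
```

Scaling a vector changes its dot product and distances but leaves its cosine similarity unchanged, which is why cosine is a common default for comparing embeddings.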

*Cosine similarity* and the *dot product* both produce a numerical similarity score. In a well-trained embedding space, vectors that are numerically similar tend to represent words that are semantically similar.

*Sentence embeddings* extend the same idea to whole sentences: each sentence is mapped to a vector, and sentences with similar meanings are mapped to nearby vectors.

---

## **Language Modeling**

Language models assign probabilities to word sequences, typically by predicting the next word given the preceding ones. This ability underpins text generation, translation, and speech recognition.

- **Statistical models:** N-grams estimate probabilities from observed sequences.  
- **Neural models:** LSTMs, GRUs, Transformers capture complex patterns beyond simple co-occurrences.  
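
For the statistical case, a bigram model (an N-gram with N = 2) can be sketched in a few lines: it counts adjacent word pairs and estimates the probability of the next word as a ratio of counts. The toy sentences, the `<s>` and `</s>` boundary markers, and the function names are assumptions for this example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

counts = train_bigram(["i love nlp", "i love language", "i study nlp"])
print(bigram_prob(counts, "i", "love"))    # 2 of the 3 words after "i" are "love"
print(bigram_prob(counts, "love", "nlp"))  # 0.5
```

Any pair that never occurred in training gets probability 0, which is the main weakness that smoothing techniques and neural models address.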

---

## **Attention Mechanisms**

The attention mechanism is a way for sequence models to dynamically focus on different parts of the input sequence when generating each output token. Instead of treating all input tokens equally, the model learns to assign different weights to different tokens based on their relevance to the current step of the output. This allows the model to capture long-range dependencies and relationships that might be far apart in the sequence, which traditional RNNs often struggle with.

In practice, attention computes a weighted combination of all encoder hidden states, where the weights reflect the importance of each input token for generating the current output token. This lets the decoder "attend" to the most relevant pieces of information from the input, producing more accurate and contextually coherent outputs.
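
A minimal sketch of this weighted combination follows. It assumes dot-product scoring between a decoder query vector and each encoder hidden state, which is only one of several scoring functions in use; the vectors and the names `softmax` and `attend` are made up for the illustration.

```python
import math

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return attention weights and the weighted sum of encoder states."""
    # Relevance of each input position to the current decoding step.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

encoder_states = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]  # hypothetical hidden states
query = [1.0, 0.0]                                     # hypothetical decoder state
weights, context = attend(query, encoder_states)
print(weights)  # the first state receives most of the weight
print(context)
```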

---

## **Encoder-Decoder / Seq2Seq Models**

Encoder-decoder architectures handle input and output sequences of different lengths. Common in translation, summarization, and chatbots.

**Example:**  
- Input: `"Hello, how are you?"`  
- Output: `"Ciao, come stai?"`

An **encoder** is a neural network component that functions like a “reader” or “compressor” of information. It takes the input sequence and transforms it into a dense, numerical representation that captures the meaning, context, and dependencies of the entire sequence. It is not a separate physical object—it is part of a model, a series of mathematical operations and learned parameters that encode sequences into vectors. The encoder’s output is often called a **context vector** or a set of hidden states, which summarize all the relevant information from the input in a form that the decoder can understand.
The process of updating hidden states at each step can be expressed as:
$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t)
$$
where:  
  - $h_t$ = hidden state at time $t$  
  - $x_t$ = input at time $t$  
  - $W_{hh}$, $W_{xh}$ = learned weight matrices for the previous hidden state and the current input  
  - $f$ = activation function (tanh or ReLU)  
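
The update rule can be traced by hand with a small sketch. It implements the equation above with `tanh` as the activation and no bias term, using plain lists as vectors and matrices; the weight values and the names `rnn_step` and `encode_sequence` are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rnn_step(h_prev, x_t, W_hh, W_xh):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    recurrent = matvec(W_hh, h_prev)
    from_input = matvec(W_xh, x_t)
    return [math.tanh(r + i) for r, i in zip(recurrent, from_input)]

def encode_sequence(inputs, W_hh, W_xh, hidden_size):
    """Run the recurrence over all inputs and return the final hidden state."""
    h = [0.0] * hidden_size
    for x_t in inputs:
        h = rnn_step(h, x_t, W_hh, W_xh)
    return h

W_hh = [[0.5, 0.0], [0.0, 0.5]]        # hypothetical recurrent weights
W_xh = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical input weights
inputs = [[0.5, -0.5], [1.0, 0.0]]     # two time steps of 2-d inputs
print(encode_sequence(inputs, W_hh, W_xh, hidden_size=2))
```

The final hidden state plays the role of the context vector described above; attention-based models keep the hidden states from every step instead.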

A **decoder** is the complementary neural network component that functions like a “writer” or “translator.” It receives the encoded representation from the encoder and generates the output sequence step by step. At each step, the decoder uses the current hidden state, the previous token, and the encoder’s context to decide the next token. The decoder is effectively a conditional generator: it produces outputs conditioned on the information encoded by the encoder, allowing it to handle sequences of different lengths and to produce coherent text that corresponds to the input.
The generation process can be expressed as:
$$
y_t = g(W_{hy} h_t)
$$
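where:  
  - $y_t$ = output at time $t$  
  - $h_t$ = decoder hidden state at time $t$  
  - $W_{hy}$ = learned weight matrix mapping the hidden state to output scores  
  - $g$ = output activation function (typically softmax over the vocabulary)  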
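
One decoding step can be sketched as follows, assuming that $g$ is a softmax over a small output vocabulary and that the next token is chosen greedily. The weights, the hidden state, and the three-word vocabulary are hypothetical values picked for the example.

```python
import math

def softmax(logits):
    """Convert scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_output(h_t, W_hy):
    """y_t = softmax(W_hy h_t): one probability per vocabulary entry."""
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in W_hy]
    return softmax(logits)

def greedy_token(h_t, W_hy, vocab):
    """Pick the most probable token for this step."""
    probs = decoder_output(h_t, W_hy)
    return vocab[probs.index(max(probs))]

vocab = ["ciao", "come", "stai"]                 # hypothetical output vocabulary
W_hy = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # one row of weights per token
h_t = [0.2, 0.9]                                 # hypothetical decoder hidden state
print(greedy_token(h_t, W_hy, vocab))  # come
```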

**Implementation:**  
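Classic Seq2Seq models implement the encoder and decoder with RNNs, LSTMs, or GRUs. During training, *teacher forcing* feeds the correct previous token (rather than the model's own prediction) to the decoder at each step. The input and output sequences may differ in length and may use different vocabularies.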

**Tokenization for Seq2Seq:**  
Tokenization is the process of converting text into a format that a sequence-to-sequence model can understand. The input text is first split into smaller units, called tokens, which can be words, subwords, or characters depending on the tokenizer used. Each token is then mapped to a unique numerical ID according to a predefined vocabulary. Special tokens, such as the start-of-sequence (SOS) and end-of-sequence (EOS), are added to indicate the beginning and the end of the sequence. This numerical representation allows the encoder-decoder model to process the text efficiently and generate coherent output sequences.

Example: `"Transformers improve summarization."` → `["transformers", "improve", "summarization", "."]` → `[201, 57, 1342, 4]`.
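
The mapping from tokens to IDs, including the special tokens, can be sketched as below. The token strings `<sos>`, `<eos>`, and `<unk>` and the ID assignment are assumptions for this example, so the resulting IDs differ from the illustrative numbers above; a real tokenizer ships with its own fixed vocabulary.

```python
SOS, EOS, UNK = "<sos>", "<eos>", "<unk>"  # hypothetical special tokens

def build_vocab(tokens):
    """Assign IDs in order of first appearance, after the special tokens."""
    vocab = {SOS: 0, EOS: 1, UNK: 2}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

def encode_tokens(tokens, vocab):
    """Wrap the sequence in SOS/EOS and map unknown tokens to UNK."""
    return [vocab[SOS]] + [vocab.get(t, vocab[UNK]) for t in tokens] + [vocab[EOS]]

def decode_ids(ids, vocab):
    """Map IDs back to tokens, dropping the SOS/EOS markers."""
    inverse = {i: t for t, i in vocab.items()}
    return [inverse[i] for i in ids if inverse[i] not in (SOS, EOS)]

tokens = ["transformers", "improve", "summarization", "."]
vocab = build_vocab(tokens)
ids = encode_tokens(tokens, vocab)
print(ids)                     # [0, 3, 4, 5, 6, 1]
print(decode_ids(ids, vocab))  # ['transformers', 'improve', 'summarization', '.']
```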


---

# TODO
## Vector database
## Naive Bayes
## TF-IDF
## NLTK
## RNN LSTM
## BERT, GPT

# **Introduction to Large Language Models (LLMs)**
## **What Are LLMs?**
## **How LLMs Work**
## **Transformer Architecture Overview**
## **Pretraining vs Fine-tuning**
## **Zero-shot, One-shot, Few-shot Learning**

# **LLM Applications**
## **Chatbots and Virtual Assistants**
## **Content Creation**
## **Code Generation**
## **Summarization and Translation**
## **Data Analysis and Insights**
## **Prompt Engineering**
### **Basics of Prompt Design**
### **Temperature and Creativity**
## **Decoding / Generation Strategies**
### **Greedy Search**
### **Beam Search**
### **Top-k Sampling**
### **Top-p (Nucleus Sampling)**
### **Frequency and Presence Penalties**
### **Max Tokens**


# **Challenges and Limitations**
## **Bias and Fairness**
## **Hallucinations and Incorrect Outputs**
## **Tokenization Limitations**
## **Compute and Energy Costs**
## **Security and Misuse Concerns**

# **Tools and Frameworks**
## **Hugging Face Transformers**
## **OpenAI API**
## **TensorFlow and PyTorch**
## **Specialized Libraries (spaCy, NLTK, etc.)**


# Useful resources

### **YouTube**

*TensorFlow*: Natural Language Processing (NLP) Zero to Hero
https://www.youtube.com/playlist?list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S

*Stanford Online*: Stanford CS224N Natural Language Processing (NLP)
https://www.youtube.com/playlist?list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S


### **Websites**

### **Courses**

*Oracle*: OCI Generative AI Professional (2025)
https://mylearn.oracle.com/ou/learning-path/become-an-oci-generative-ai-professional-2025/147863