
# Introduction to AI Model Evaluation and Probability Concepts

When we build AI models, especially language models, we need ways to check how good they are. This process is called **evaluation**. Evaluation helps us understand if our model makes accurate predictions and performs well in real-world applications. In this document, we will go through different evaluation methods and probability concepts in simple terms.

## 1. Extrinsic Evaluation: End-to-End Evaluation
This type of evaluation tests how well a model works when embedded into a real application.

**Example:** Suppose we create a chatbot using a language model. We measure how well it answers customer queries by checking customer satisfaction ratings before and after integrating the model.

## 2. Intrinsic Evaluation
Intrinsic evaluation focuses on testing the model’s performance on specific tasks, such as predicting the next word in a sentence or identifying correct grammar.

**Example:** If we build a model to predict missing words in sentences, we test its accuracy by checking how often it chooses the right word.
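As a minimal sketch of this kind of check, the snippet below computes accuracy as the fraction of correctly predicted words. The function name `accuracy` and the word lists are made up for illustration.

```python
def accuracy(predictions, gold):
    """Fraction of positions where the predicted word matches the gold word."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical model outputs and the true missing words.
predicted = ["east", "cat", "milk", "blue"]
actual = ["east", "dog", "milk", "blue"]

print(accuracy(predicted, actual))  # 3 of 4 words match
```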

## 3. Training Set, Test Set (Held-out Set of Data)
- **Training Set:** The data used to train the model.
- **Test Set:** A separate set of data not seen by the model during training. It is used to measure how well the model generalizes to new data.

**Example:** If we train a spam filter, we use thousands of emails to teach it. Later, we test it on new emails (test set) to see if it correctly identifies spam.
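One simple way to create the two sets is to shuffle the data and hold out a fixed fraction. The sketch below assumes a plain list of items. The helper name `train_test_split`, the 20% test fraction, and the placeholder email IDs are illustrative choices, not a prescribed setup.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data standing in for real emails.
emails = [f"email_{i}" for i in range(10)]
train, test = train_test_split(emails)

print(len(train), len(test))
```

The model is trained only on `train`. The `test` items stay unseen until evaluation.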

## 4. Test Set Sentences and Probability
A good language model should assign high probability to likely sentences and low probability to sentences that don’t make sense.

**Example:**
- Likely sentence: "The sun rises in the east." (High probability)
- Unlikely sentence: "Sun east rises in the." (Low probability)
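To make this concrete, here is a rough sketch of a tiny bigram model with add-one smoothing that scores both sentences. The bigram model, the smoothing choice, and the three-sentence training corpus are all assumptions made for illustration. The text above does not commit to any particular model.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs, with <s> and </s> as sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams


def sentence_prob(sentence, contexts, bigrams, vocab_size):
    """Product of add-one smoothed bigram probabilities."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return prob


# Toy training corpus (made up for this example).
corpus = [
    "the sun rises in the east",
    "the sun sets in the west",
    "the moon rises in the east",
]
contexts, bigrams = train_bigram(corpus)
# Vocabulary: every word type that can be predicted, including </s>.
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

likely = sentence_prob("the sun rises in the east", contexts, bigrams, vocab_size)
unlikely = sentence_prob("sun east rises in the", contexts, bigrams, vocab_size)
print(likely > unlikely)
```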

## 5. Training on the Test Set (Data Leakage)
If a sentence from the test set is also in the training set, the model may perform well just because it has memorized the data, not because it has learned general patterns. This is often called **data leakage** (or test-set contamination). It makes the model look better than it really is, much like **overfitting** hides poor generalization, and it is a bad practice.
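A quick sanity check for this problem is to look for test sentences that also appear in the training data. The sketch below is one possible approach, and the function name `leaked_sentences` is hypothetical. It compares sentences after lowercasing and collapsing whitespace, so it only catches exact and near-exact duplicates.

```python
def leaked_sentences(train_sentences, test_sentences):
    """Return the test sentences that also occur in the training set."""
    def normalize(sentence):
        return " ".join(sentence.lower().split())

    seen = {normalize(s) for s in train_sentences}
    return [s for s in test_sentences if normalize(s) in seen]


train_sentences = ["The sun rises in the east.", "The cat sleeps."]
test_sentences = ["the cat  sleeps.", "The dog barks."]

print(leaked_sentences(train_sentences, test_sentences))
```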

## 6. Development Set (DevSet)
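The development set is a held-out dataset kept separate from both the training set and the test set. We do all our tuning and intermediate testing on it during model development, and we use the final test set only once, at the very end, to evaluate the model’s performance.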

**Example:** Before launching a voice assistant, we tune it using a DevSet and only test on the final test set once, just before release.

## 7. Less Surprised = High Probability
A model is "surprised" by text it considered unlikely. The higher the probability the model assigns to the words that actually occur, the less surprised it is, and the better it fits the data.

**Example:** Suppose the reply to "Hello" turns out to be "Good morning!". A model that gave "Good morning!" a high probability is barely surprised. A model that expected "Banana tree!" and gave "Good morning!" a low probability is very surprised.
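One common way to put a number on "surprise" is the negative base-2 logarithm of the probability, often called surprisal. The sketch below uses that definition. The function name `surprisal_bits` and the two probabilities are assumptions chosen for illustration.

```python
import math


def surprisal_bits(probability):
    """Surprise, in bits, for an outcome the model gave this probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities assigned to the reply that actually occurred.
confident_model = surprisal_bits(0.5)       # high probability, low surprise
surprised_model = surprisal_bits(0.015625)  # low probability (1/64), high surprise

print(confident_model, surprised_model)
```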

## 8. Probability of a Sentence (P(sentence))
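Under the chain rule, the probability of a sentence is the product of the probability of each word given the words before it. Every factor is at most 1, so each added word can only keep the probability the same or lower it. As a result, longer sentences generally have lower probability.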

**Example:**
- "The cat sleeps." (Higher probability)
- "The little black cat sleeps on the sunny windowsill every afternoon." (Lower probability)
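The effect of length can be sketched by multiplying per-word probabilities. The numbers below are invented for illustration. In a real model each one would be the probability of a word given the words before it.

```python
import math


def sentence_probability(word_probs):
    """Chain rule: multiply the conditional probability of each word."""
    return math.prod(word_probs)


# Invented per-word probabilities for a short and a longer sentence.
short_sentence = [0.2, 0.1, 0.3]
long_sentence = [0.2, 0.1, 0.3, 0.2, 0.5, 0.1]

print(sentence_probability(short_sentence))
print(sentence_probability(long_sentence))
```

Because the longer sentence multiplies in more factors below 1, its probability is smaller even though its first three factors are identical.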

## 9. Raw Probability and Perplexity
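- **Raw Probability:** Favors shorter sentences, because fewer per-word probabilities are multiplied together. This makes raw probabilities hard to compare across texts of different lengths.
- **Perplexity:** A better way to compare models because it normalizes probability by the number of words.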

## 10. Perplexity: A Measure of Model Performance
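Perplexity is the inverse probability of the test set, normalized by the number of words: the inverse probability is raised to the power 1/N, where N is the number of words. Lower perplexity means the model predicts the test set better. Two models are only comparable when they are evaluated on the same test set with the same vocabulary.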

**Example:**
- Model A: Perplexity = 50 (better)
- Model B: Perplexity = 100 (worse)
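The definition can be sketched in a few lines. The version below works in log space to avoid numerical underflow on long texts, which is a common implementation choice rather than part of the definition. The per-word probabilities for the two hypothetical models are made up so that the results line up with the example above.

```python
import math


def perplexity(word_probs):
    """Inverse probability of the text, normalized by the number of words."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / len(word_probs))


# Hypothetical per-word probabilities each model assigns to the same test text.
model_a = [0.02, 0.02, 0.02, 0.02]
model_b = [0.01, 0.01, 0.01, 0.01]

print(perplexity(model_a))  # close to 50
print(perplexity(model_b))  # close to 100
```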

## 11. Branching Factor (Possible Next Words)
The branching factor is the number of words that can follow a given word in a sentence.

**Example:** After "The cat", the next word could be "sleeps", "runs", "jumps", etc. If there are 10 possibilities, the branching factor is 10.
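As a rough sketch, the unweighted branching factor after a word can be estimated by counting the distinct words that follow it in a corpus. The toy corpus and the helper name `followers` are hypothetical, and this version conditions on a single previous word only.

```python
from collections import defaultdict


def followers(sentences):
    """Map each word to the set of distinct words seen right after it."""
    next_words = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            next_words[prev].add(word)
    return next_words


# Toy corpus (made up for this example).
corpus = ["the cat sleeps", "the cat runs", "the cat jumps", "the dog runs"]
next_words = followers(corpus)

print(len(next_words["cat"]))  # distinct words seen after "cat"
print(len(next_words["the"]))  # distinct words seen after "the"
```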

## 12. Deterministic vs. Probabilistic Language Models
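- **Deterministic (uniform) case:** If any word in the vocabulary can follow any other word with equal probability, the branching factor is simply the size of the vocabulary, and perplexity equals that same number.
- **Probabilistic Model:** Some words are more likely than others, so the branching factor is weighted by probability. Perplexity can be read as this weighted branching factor, and it is lower than the vocabulary size when the probabilities are skewed toward a few words.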

**Example:**
- "The sun..." is more likely followed by "shines" than "eats" (weighted probability).
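One way to formalize the weighted branching factor is as the perplexity of the next-word distribution, which equals 2 raised to its entropy in bits. The sketch below assumes that formulation. The three-word vocabulary and its probabilities are invented for illustration.

```python
import math


def weighted_branching_factor(next_word_probs):
    """Perplexity of a next-word distribution: 2 ** entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in next_word_probs if p > 0)
    return 2 ** entropy


# Three possible next words: equally likely vs. one clear favorite.
uniform = weighted_branching_factor([1 / 3, 1 / 3, 1 / 3])
skewed = weighted_branching_factor([0.8, 0.1, 0.1])

print(round(uniform, 3), round(skewed, 3))
```

With equal probabilities the result matches the vocabulary size of 3. With one dominant word it drops below 3, reflecting that the model effectively chooses among fewer options.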

## Conclusion
Understanding these evaluation techniques and probability concepts helps us build better AI models. A good model should generalize well, assign reasonable probabilities, and perform well in real-world applications.
