<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_11_4_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 11: Finetuning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 11 Material

Module 11: Finetuning

* Part 11.1: Understanding Finetuning [[Video]](https://www.youtube.com/watch?v=fflySydZABM) [[Notebook]](t81_559_class_11_1_finetune.ipynb)
* Part 11.2: Finetuning from the Dashboard [[Video]](https://www.youtube.com/watch?v=RIJj1QLk-V4) [[Notebook]](t81_559_class_11_2_dashboard.ipynb)
* Part 11.3: Finetuning from Code [[Video]](https://www.youtube.com/watch?v=29tUrxrneOs) [[Notebook]](t81_559_class_11_3_code.ipynb)
* **Part 11.4: Evaluating your Model** [[Video]](https://www.youtube.com/watch?v=MrwFSG4PWUY) [[Notebook]](t81_559_class_11_4_eval.ipynb)
* Part 11.5: Finetuning for Text to Image [[Video]](https://www.youtube.com/watch?v=G_FYFSzkB5Y&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_559_class_11_5_image.ipynb)


# Part 11.4: Evaluating your Model

## How are Large Language Models Evaluated?

Before implementing strategies to enhance your fine-tuned language model, it's crucial to understand how OpenAI evaluates these models during the fine-tuning process. This understanding allows you to interpret the metrics provided by the API effectively and make informed decisions to improve your model's performance.

### Understanding Training Metrics
The primary metrics used to evaluate a model during fine-tuning are the training loss and validation loss. These metrics are calculated using the cross-entropy loss function, a standard method for measuring the difference between the predicted probabilities and the actual distribution of the target data in classification tasks.

### Cross-Entropy Loss
Cross-entropy loss quantifies the performance of a classification model whose output is a probability value between 0 and 1. In the context of language models, it measures how well the model predicts the next token (roughly, the next word or word piece) in a sequence.

The cross-entropy loss for a single prediction is:


$L = -\sum_{i=1}^{N} y_i \log(p_i)$

where:

* $N$ is the number of possible classes (in language models, the vocabulary size).
* $y_i$ is the true distribution (1 for the correct word and 0 for others).
* $p_i$ is the predicted probability for class $i$.

For a dataset, the average cross-entropy loss over all predictions provides the training loss or validation loss.
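
As a small worked illustration of the formula above, the sketch below computes the loss by hand for a toy four-word vocabulary. The probabilities and the `cross_entropy` helper are made up for illustration; this is not how OpenAI computes its metrics internally, only the same formula applied directly.

```python
import math

def cross_entropy(y_true, p_pred):
    """Cross-entropy L = -sum(y_i * log(p_i)) for a single prediction."""
    # Terms with y_i == 0 contribute nothing, so skip them (this also avoids log(0)).
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

# Hypothetical 4-word vocabulary; the correct next word is at index 2.
y_true = [0, 0, 1, 0]
p_pred = [0.1, 0.2, 0.6, 0.1]

loss = cross_entropy(y_true, p_pred)
print(f"single-prediction loss = {loss:.4f}")  # equals -log(0.6)

# Averaging the per-prediction losses gives the dataset-level loss.
losses = [loss, cross_entropy([1, 0, 0, 0], [0.9, 0.05, 0.03, 0.02])]
mean_loss = sum(losses) / len(losses)
print(f"mean loss = {mean_loss:.4f}")
```

Because the target is one-hot, the sum collapses to the negative log of the probability assigned to the correct word, so a more confident correct prediction yields a smaller loss.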

### Training Loss
The training loss represents the average cross-entropy loss calculated over the training dataset. It reflects how well the model is learning the training data. A decreasing training loss over epochs indicates that the model is effectively capturing patterns within the training data.

### Validation Loss
The validation loss is computed similarly but over a separate validation dataset not seen by the model during training. It serves as an indicator of the model's ability to generalize to new, unseen data. A low validation loss suggests that the model can effectively apply learned patterns to unfamiliar inputs, not just memorize the training data.

### Interpreting Loss Curves
Plotting the training and validation loss against epochs can help visualize the model's performance:

* **Convergence:** If both losses decrease and eventually stabilize, the model is likely learning effectively.
* **Overfitting:** If the training loss continues to decrease while the validation loss starts to increase, the model may be overfitting: memorizing the training data without generalizing well.

By analyzing these trends, you can decide whether to adjust training parameters, modify your dataset, or implement techniques like early stopping.
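
The two patterns above can also be checked numerically. The following is a rough heuristic sketch, not a standard diagnostic: the function names and the per-epoch loss values are hypothetical, and real curves are usually noisier than this example.

```python
def best_epoch(valid_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(valid_losses)), key=lambda i: valid_losses[i])
    return best_index + 1

def looks_overfit(train_losses, valid_losses):
    """Heuristic: validation loss rose after its minimum while training loss kept falling."""
    best = best_epoch(valid_losses) - 1
    return (
        best < len(valid_losses) - 1
        and valid_losses[-1] > valid_losses[best]
        and train_losses[-1] < train_losses[best]
    )

# Made-up per-epoch losses for illustration only.
train = [2.10, 1.40, 0.95, 0.60, 0.35]
valid = [2.20, 1.60, 1.30, 1.35, 1.50]

print(best_epoch(valid))            # epoch where validation loss bottomed out
print(looks_overfit(train, valid))  # True when the curves diverge after that epoch
```

In this example the validation loss is lowest at the third epoch and climbs afterward while the training loss keeps dropping, which is the overfitting signature described above.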

### Perplexity as an Evaluation Metric
Perplexity is another metric derived from cross-entropy loss, commonly used in language modeling to evaluate how well a probability model predicts a sample.

$P=e^L$

Where $L$ is the cross-entropy loss. Lower perplexity values indicate better predictive performance, as the model is less "perplexed" by the data.
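
A short sketch of this conversion follows. It assumes the loss was computed with the natural logarithm, which matches the $e^L$ form above; the `perplexity` helper is a name chosen for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    """Convert a natural-log cross-entropy loss into perplexity, P = e^L."""
    return math.exp(cross_entropy_loss)

for loss in (2.0, 1.0, 0.5):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")

# Intuition check: a model that guesses uniformly among k words has loss log(k),
# so its perplexity is k.
k = 50
uniform_loss = math.log(k)
uniform_perplexity = perplexity(uniform_loss)
```

One way to read the number: a perplexity of 50 means the model is, on average, about as uncertain as if it were choosing uniformly among 50 equally likely words.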

### Utilizing the Validation Dataset
Incorporating a validation dataset is essential for unbiased evaluation:

* **Separate Data:** The validation dataset should be distinct from the training data to provide an accurate assessment of the model's generalization capabilities.
* **Loss Calculation:** The validation loss is computed using cross-entropy loss over the validation dataset after each epoch.

By comparing training and validation loss, you can detect issues like overfitting and adjust your training strategy accordingly.

### Accessing Detailed Training Metrics
OpenAI's API exposes logs and metrics for each fine-tuning job, which you can review while the job runs and after it completes.
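
One way to work with such metrics offline is to parse a per-step results file once a job has finished. The sketch below assumes a CSV with `step`, `train_loss`, and `valid_loss` columns. Those column names and the sample rows are assumptions for illustration, so check the header of your own file before relying on them.

```python
import csv
import io

# Made-up sample in the assumed layout; valid_loss is left blank on steps
# where no validation pass ran.
SAMPLE_CSV = """step,train_loss,valid_loss
1,2.31,
2,1.87,1.95
3,1.42,
4,1.10,1.38
"""

def read_metrics(text):
    """Parse metrics CSV text into dicts with numeric values (None if blank)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "step": int(row["step"]),
            "train_loss": float(row["train_loss"]),
            "valid_loss": float(row["valid_loss"]) if row["valid_loss"] else None,
        })
    return rows

metrics = read_metrics(SAMPLE_CSV)
valid_points = [(m["step"], m["valid_loss"]) for m in metrics if m["valid_loss"] is not None]
print(valid_points)
```

Once the rows are numeric, the training and validation series can be passed to any plotting library to draw the loss curves discussed earlier.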


## How can you Improve Results when Finetuning a Large Language Model?

Fine-tuning large language models, such as those provided by OpenAI, allows developers to tailor these powerful tools to specific tasks and domains. Achieving optimal results requires more than just running the fine-tuning process; it involves a strategic approach to data preparation, parameter adjustment, and iterative improvement. This section explores various methods to enhance the outcomes of fine-tuning an OpenAI language model using the API.

### High-Quality Training Data
The cornerstone of effective fine-tuning is the quality of the training data.

* **Relevance:** Ensure that your dataset is closely aligned with the tasks or topics you want the model to handle. For instance, if you're developing a model for medical diagnoses, include medical case studies and terminologies.

* **Clarity and Consistency:** Use clear, precise language to prevent ambiguity. Maintain a consistent style, tone, and formatting throughout the dataset to help the model learn the desired patterns effectively.

### Sufficient Quantity of Data
While quality is paramount, the amount of data also influences the model's performance.

* **Comprehensive Examples:** A larger dataset provides the model with more patterns to learn from, improving its ability to generalize to new inputs.

* **Balanced Dataset:** Include a diverse range of examples to cover different scenarios, but avoid unnecessary repetition that could lead to overfitting.

### Optimizing Data Formatting
Properly structured data guides the model during training.

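* **Use JSONL Format:** OpenAI recommends the JSON Lines (JSONL) format, where each line is a complete JSON object. For chat models, each object holds a "messages" list of role/content pairs; the older "prompt" and "completion" keys apply to legacy completion-style models.

* **Structured Prompts and Completions:** Clearly define the input (the user message) and the expected output (the assistant message). The example below is pretty-printed for readability; in the actual file, each example occupies a single line: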

```
{
  "messages": [
    {"role": "system", "content": "You are a language translation assistant."},
    {"role": "user", "content": "Translate to French: Hello, how are you?"},
    {"role": "assistant", "content": "Bonjour, comment ça va?"}
  ]
}
```

* **Include Metadata When Necessary:** Additional information, such as labels or context identifiers, can be embedded in the prompts to provide the model with more context.
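
The one-object-per-line layout can be produced with Python's standard `json` module. This is a minimal sketch: the second training example and the `to_jsonl` helper are invented for illustration, and writing the result to a file is left out.

```python
import json

# Two chat-style training examples; the second one is a made-up addition.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a language translation assistant."},
        {"role": "user", "content": "Translate to French: Hello, how are you?"},
        {"role": "assistant", "content": "Bonjour, comment ça va?"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a language translation assistant."},
        {"role": "user", "content": "Translate to French: Thank you."},
        {"role": "assistant", "content": "Merci."},
    ]},
]

def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

jsonl_text = to_jsonl(examples)
print(jsonl_text)

# Round-trip check: every line must parse on its own.
lines = jsonl_text.strip().split("\n")
parsed = [json.loads(line) for line in lines]
```

Parsing each line independently is a cheap sanity check to run before uploading a training file, since a single malformed line is a common cause of rejected uploads.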

### Crafting Instruction-Based Prompts
Directing the model with well-designed prompts can enhance its responses.

* **Explicit Instructions:** Begin prompts with clear instructions or questions. For example, "Explain the significance of photosynthesis in plants."

* **Consistent Prompt Format:** Maintain a uniform structure in prompts to help the model recognize and replicate the desired response patterns.

### Adjusting Training Parameters
Fine-tuning parameters significantly affect the model's learning process.

* **Epochs:** Experiment with the number of epochs—the number of times the model passes through the entire training dataset. Too few epochs may lead to underfitting, while too many can cause overfitting.

* **Batch Size:** Adjust the batch size, which is the number of training examples used in one iteration. A larger batch size can speed up training but may require more computational resources.

* **Learning Rate:** The learning rate controls how much the model adjusts its weights with each update. A suitable learning rate ensures stable convergence.
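
To make these three settings more concrete, here is a toy sketch. The first function counts weight updates from the dataset size, batch size, and epoch count, assuming the final partial batch is kept. The second runs plain gradient descent on $f(w) = w^2$ to show how the learning rate affects stability. Both functions and all numbers are illustrative and unrelated to how any hosted fine-tuning service schedules training.

```python
import math

def total_training_steps(num_examples, batch_size, n_epochs):
    """Weight updates performed, assuming the final partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * n_epochs

def descend(learning_rate, steps=20, w=1.0):
    """Plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

steps = total_training_steps(num_examples=500, batch_size=8, n_epochs=3)
print(steps)  # 63 steps per epoch * 3 epochs = 189

stable = descend(learning_rate=0.1)    # shrinks toward the minimum at 0
unstable = descend(learning_rate=1.1)  # overshoots further on every step
print(stable, unstable)
```

With the smaller learning rate each update multiplies `w` by 0.8, so it converges; with the larger one each update multiplies it by -1.2, so the value oscillates and grows.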

### Data Augmentation Techniques
Enhancing your dataset through augmentation can improve model robustness.

* **Paraphrasing:** Rephrase sentences to provide the model with varied inputs that have the same meaning.

* **Synonym Replacement:** Replace words with synonyms to diversify vocabulary exposure.

* **Noise Introduction:** Intentionally introduce minor errors or variations to help the model handle imperfect inputs.
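
Two of the techniques above, synonym replacement and noise introduction, can be sketched with the standard library alone. The helper names, the tiny synonym table, and the sample sentence are hypothetical; a real pipeline would likely use a larger lexicon or a paraphrasing model.

```python
import random

def swap_adjacent_chars(text, rng):
    """Introduce one typo by swapping two neighboring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def replace_synonyms(text, synonyms):
    """Replace whole words that appear in the synonym table."""
    return " ".join(synonyms.get(word, word) for word in text.split())

rng = random.Random(0)  # seeded so the augmentation is reproducible
original = "please summarize the quarterly report"

noisy = swap_adjacent_chars(original, rng)
varied = replace_synonyms(original, {"summarize": "condense", "report": "document"})

print(noisy)
print(varied)
```

Augmented prompts would normally be paired with the same target completion as the original, so the model learns that the variations share one meaning.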

### Regular Evaluation and Iteration
Ongoing assessment allows for continuous improvement.

* **Validation Set:** Reserve a portion of your data as a validation set to evaluate the model's performance objectively.

* **Performance Metrics:** Monitor metrics such as accuracy, loss, and perplexity to gauge improvement.

* **Iterative Refinement:** Use insights from evaluations to refine your training data and adjust parameters in subsequent training rounds.
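
Reserving a validation set can be as simple as a seeded shuffle and slice. The sketch below is one possible approach; the function name, the 20% default, and the placeholder examples are assumptions for illustration.

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Placeholder examples; in practice these would be chat-format training records.
examples = [f"example-{i}" for i in range(10)]
train_set, valid_set = train_validation_split(examples)

print(len(train_set), len(valid_set))
```

Fixing the seed keeps the split stable across training rounds, so validation losses from successive runs stay comparable.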

### Leveraging Advanced API Features
OpenAI's API provides options to adjust model outputs at inference time.

* **Temperature Setting:** Controls the randomness of the output. Lower values make responses more deterministic, while higher values increase creativity.

* **Top-p (Nucleus Sampling):** Adjusts the cumulative probability threshold for token selection, balancing the trade-off between diversity and focus.

* **Max Tokens:** Limits the length of the generated output to prevent excessively long responses.
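
To show what temperature and top-p do to a probability distribution, here is a self-contained sketch over made-up scores. It illustrates the general technique only. The function names are hypothetical, the temperature must be greater than zero here, and hosted APIs may implement the details differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the kept tokens

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=1.5)
print(max(cool), max(warm))  # the top token dominates more at low temperature

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.75)
print(nucleus)  # only the two most likely tokens survive
```

Sampling then draws from the filtered, renormalized distribution, which is why a low top-p removes unlikely tokens entirely while temperature only reshapes their odds.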

### Implementing Early Stopping
Prevent overfitting by stopping training at the optimal point.

* **Monitor Loss Trends:** Observe the training and validation loss. If validation loss starts increasing while training loss decreases, overfitting may be occurring.

* **Set Patience Levels:** Define a number of epochs with no improvement after which training will stop.
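
The patience rule can be expressed in a few lines. The sketch below applies it to a recorded list of validation losses. It is generic logic, not a feature of any particular API, and the function name, the `min_delta` parameter, and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(valid_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: best at epoch 3, then two epochs without improvement.
valid = [1.90, 1.50, 1.30, 1.35, 1.40, 1.60]
stop_at = early_stopping_epoch(valid, patience=2)
print(stop_at)
```

In practice you would keep the checkpoint from the best epoch rather than the one where training stopped, since the stopping epoch is by definition past the minimum.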

### Multiple Fine-Tuning Rounds
Sequential fine-tuning can progressively enhance model performance.

* **Initial Broad Training:** Start with a general dataset to teach the model basic patterns.

* **Focused Refinement:** In subsequent rounds, use more specific data to hone the model's performance on particular tasks.

### Incorporating Negative Examples
Teaching the model what not to do can be as important as teaching it what to do.

* **Incorrect Examples:** Include prompts that lead to incorrect completions along with corrections.

* **Penalty Mechanisms:** While not directly supported, structuring data to discourage certain outputs can guide the model away from undesired responses.

### Ensuring Dataset Diversity
A diverse dataset helps the model handle a wide range of inputs.

* **Varied Topics:** Incorporate content from different subject areas.

* **Stylistic Variation:** Use examples with different writing styles, tones, and formats.

### Monitoring Training Metrics
Keep an eye on the model's learning process.

* **Loss Curves:** Plotting training and validation loss over epochs can reveal learning patterns.

* **Accuracy Metrics:** Where applicable, track how often the model produces correct responses.
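
For tasks with a single correct answer, one simple accuracy measure is exact match after light normalization. This is only one of several possible definitions; the function name and the sample answers are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

# Made-up model outputs and expected answers.
predictions = ["Paris", "berlin ", "Rome"]
references = ["paris", "Berlin", "Madrid"]

accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.2f}")
```

Exact match is strict, so it suits classification-style or short-answer tasks better than open-ended generation, where a correct answer can be phrased in many ways.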

### Effective Prompt Engineering
Designing prompts that elicit the desired response is an art.

* **Use Placeholders:** Employ variables in prompts to generalize patterns. For example, "Calculate the sum of {number1} and {number2}."

* **Directive Language:** Start prompts with verbs like "Explain," "Describe," or "Summarize" to guide the model.
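
The placeholder idea maps directly onto Python's `str.format`. The sketch below fills the template from the text and wraps the result in a chat-style training record; the `build_example` helper and the choice to store the sum as the assistant reply are assumptions for illustration.

```python
PROMPT_TEMPLATE = "Calculate the sum of {number1} and {number2}."

def build_example(number1, number2):
    """Fill the template and pair it with the expected answer."""
    prompt = PROMPT_TEMPLATE.format(number1=number1, number2=number2)
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": str(number1 + number2)},
        ]
    }

example = build_example(2, 3)
print(example["messages"][0]["content"])
print(example["messages"][1]["content"])

# Generating many examples from one template keeps the prompt format consistent.
dataset = [build_example(a, b) for a in range(1, 4) for b in range(1, 4)]
```

Generating examples from a single template keeps the prompt structure uniform, which supports the consistent-format advice given earlier.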

### Incorporating User Feedback
Real-world usage provides valuable insights.

* **Collect Feedback:** Gather responses from users to identify strengths and weaknesses.

* **Update Training Data:** Use this feedback to adjust your dataset, adding new examples or correcting existing ones.
