# Where finetuning fits in

## Pretraining

- Model at the start:
  - Zero knowledge about the world 
  - Can't form English words
- Next token prediction
- Giant corpus of text data
- Often scraped from the internet: "unlabeled"
- Self-supervised learning
  
- After training:
  - Learns language
  - Learns knowledge
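
To make "next token prediction" and "self-supervised" concrete, here is a minimal sketch of how training pairs fall out of raw text with no human labels. Whitespace splitting stands in for a real tokenizer, and the function name `next_token_pairs` is hypothetical; both are assumptions for illustration only.

```python
# Toy illustration of self-supervised next-token prediction.
# Assumption: whitespace splitting replaces a real subword tokenizer.

def next_token_pairs(text):
    """Return (context, target) pairs; each target is the token that follows its context."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    # The "label" is simply the next token in the text itself.
    print(" ".join(context), "->", target)
```

The labels come from the text itself, which is why unlabeled scraped data is enough for pretraining.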

## What is "data scraped from the internet"?
- Companies often don't publicize what data they pretrain on
- Open-source pretraining data: "The Pile"
- Pretraining is expensive & time-consuming

## Limitations of pretrained base models
<img src="../figure/05.png">

## Finetuning after pretraining
<img src="../figure/07.png">

Finetuning usually refers to training a pretrained model further:
- Can also be self-supervised unlabeled data
- Can be "labeled" data you curated
- Much less data needed
- Tool in your toolbox

Finetuning for generative tasks is not well-defined:
- Updates entire model, not just part of it
- Same training objective: next token prediction
- More advanced ways reduce how much to update (more later!)

## What is finetuning doing for you?

<img src="../figure/08.png">

Behavior change
- Learning to respond more consistently 
- Learning to focus, e.g. moderation
- Teasing out capability, e.g. better at conversation

Gain knowledge
- Increasing knowledge of new specific concepts
- Correcting old incorrect information

Both: behavior change and knowledge gain

## Tasks to finetune

<img src="../figure/09.png">

- Just text-in, text-out:
  - Extraction: text in, less text out
    - "Reading"
    - Keywords, topics, routing, agents (planning, reasoning, self-critique, tool use), etc.
  - Expansion: text in, more text out
    - "Writing"
    - Chat, write emails, write code

- Task clarity is a key indicator of success

- Clarity means knowing what's bad vs. good vs. better

## First time finetuning

1. Identify task(s) by prompt-engineering a large LLM
2. Find tasks that you see an LLM doing ~OK at
3. Pick one task
4. Get ~1000 inputs and outputs for the task
5. Make sure these outputs are better than the ~OK ones from the LLM
6. Finetune a small LLM on this data
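
Before finetuning on the collected pairs, a quick sanity check can catch obvious gaps. The sketch below is one possible check, not part of the course code: the helper `check_pairs`, the `input`/`output` field names, and the sample records are all hypothetical.

```python
def check_pairs(pairs, target_size=1000):
    """Return a list of human-readable problems found in input/output pairs."""
    problems = []
    if len(pairs) < target_size:
        problems.append(f"only {len(pairs)} examples, target is ~{target_size}")
    for index, pair in enumerate(pairs):
        # Flag records whose input or output is missing or blank.
        if not pair.get("input", "").strip():
            problems.append(f"example {index}: empty input")
        if not pair.get("output", "").strip():
            problems.append(f"example {index}: empty output")
    return problems

# Made-up sample data; target_size is lowered so the size check passes.
sample = [
    {"input": "Summarize: the meeting moved to Friday.", "output": "Meeting is now Friday."},
    {"input": "Summarize: the invoice is overdue.", "output": ""},
]
problems = check_pairs(sample, target_size=2)
print(problems)
```

A check like this only covers completeness; whether the outputs are actually better than the large LLM's still needs human judgment.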

## Practice
## Finetuning data: compare to pretraining and basic preparation

In [15]:
import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

### Look at pretraining data set

**Sorry**, "The Pile" dataset is currently relocating to a new home, so we can't show you the same example that is in the video. Here is another dataset, ["C4"](https://huggingface.co/datasets/c4) (the Colossal Clean Crawled Corpus), a cleaned version of the Common Crawl web crawl.

In [16]:
#pretrained_dataset = load_dataset("EleutherAI/pile", split="train", streaming=True)

pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)




In [17]:
n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

Pretrained dataset:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z', 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb inter

### Contrast with company finetuning dataset you will be using

In [18]:
filename = "kotzeje/lamini_docs.jsonl"
# Note: load_dataset returns a DatasetDict (not a pandas DataFrame, despite the _df suffix)
instruction_dataset_df = load_dataset(filename)
instruction_dataset_df

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 1400
    })
})

### Various ways of formatting your data

In [19]:
# Access the 'train' split of the DatasetDict
examples = instruction_dataset_df['train']

# Define a function to combine question and answer
def combine_qa(example):
    return {'combined_text': example['question'] + example['answer']}

# Apply the function to the dataset
combined_dataset = examples.map(combine_qa)

# Access the first combined text
text = combined_dataset['combined_text'][0]
print(text)

How can I evaluate the performance and quality of the generated text from Lamini models?There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance.


In [20]:
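# Check that both columns exist before concatenating the first question and answer
if "question" in examples.column_names and "answer" in examples.column_names:
  text = examples["question"][0] + examples["answer"][0]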

In [21]:
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

In [22]:
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
text_with_prompt_template

"### Question:\nHow can I evaluate the performance and quality of the generated text from Lamini models?\n\n### Answer:\nThere are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance."

In [23]:
prompt_template_q = """### Question:
{question}

### Answer:"""
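
The question-only template is what you would send to the model at inference time, ending right where the answer should begin. Here is a minimal sketch of that usage; `build_prompt` and `extract_answer` are hypothetical helpers, and the "generation" is a hard-coded stand-in string rather than real model output.

```python
prompt_template_q = """### Question:
{question}

### Answer:"""

def build_prompt(question):
    """Fill the question-only template, leaving the answer open for the model."""
    return prompt_template_q.format(question=question)

def extract_answer(prompt, generated_text):
    """Drop the echoed prompt, if present, and trim surrounding whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()

prompt = build_prompt("What is finetuning?")
# Assumption: many causal LMs echo the prompt before their continuation.
fake_generation = prompt + "\nTraining a pretrained model further on task data."
answer = extract_answer(prompt, fake_generation)
print(answer)
```

Keeping the template identical between training and inference matters, since the model learns to continue text that looks exactly like its training examples.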

In [24]:
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
# Read each column once, then walk the question/answer pairs together
for question, answer in zip(examples["question"], examples["answer"]):
  # Format 1: prompt and answer combined into a single text field
  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  # Format 2: templated question and raw answer kept as separate fields
  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [25]:
pprint(finetuning_dataset_text_only[0])

{'text': '### Question:\n'
         'How can I evaluate the performance and quality of the generated text '
         'from Lamini models?\n'
         '\n'
         '### Answer:\n'
         'There are several metrics that can be used to evaluate the '
         'performance and quality of generated text from Lamini models, '
         'including perplexity, BLEU score, and human evaluation. Perplexity '
         'measures how well the model predicts the next word in a sequence, '
         'while BLEU score measures the similarity between the generated text '
         'and a reference text. Human evaluation involves having human judges '
         'rate the quality of the generated text based on factors such as '
         'coherence, fluency, and relevance. It is recommended to use a '
         'combination of these metrics for a comprehensive evaluation of the '
         "model's performance."}


In [26]:
pprint(finetuning_dataset_question_answer[0])

{'answer': 'There are several metrics that can be used to evaluate the '
           'performance and quality of generated text from Lamini models, '
           'including perplexity, BLEU score, and human evaluation. Perplexity '
           'measures how well the model predicts the next word in a sequence, '
           'while BLEU score measures the similarity between the generated '
           'text and a reference text. Human evaluation involves having human '
           'judges rate the quality of the generated text based on factors '
           'such as coherence, fluency, and relevance. It is recommended to '
           'use a combination of these metrics for a comprehensive evaluation '
           "of the model's performance.",
 'question': '### Question:\n'
             'How can I evaluate the performance and quality of the generated '
             'text from Lamini models?\n'
             '\n'
             '### Answer:'}


### Common ways of storing your data

In [27]:
with jsonlines.open('lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
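
JSON Lines is simply one JSON object per line, so a file written this way can be read back with the standard library alone. The sketch below round-trips two made-up records through a temporary file using `json` instead of the `jsonlines` package; the records and the file name are placeholders.

```python
import json
import os
import tempfile

# Made-up records in the same question/answer shape as above.
records = [
    {"question": "### Question:\nWhat is JSON Lines?\n\n### Answer:", "answer": "One JSON object per line."},
    {"question": "### Question:\nWhy use it?\n\n### Answer:", "answer": "It streams well and is easy to append to."},
]

path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write: json.dumps escapes embedded newlines, so each record stays on one line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "records; round trip ok:", loaded == records)
```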

In [28]:
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
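
The 1260/140 row counts are consistent with a 90/10 train/test split of the 1400 examples seen earlier. How the hosted dataset was actually split is not stated here, so the following is only a sketch of one common approach; the helper name `split_train_test`, the seed, and the placeholder examples are assumptions. The `datasets` library also provides `Dataset.train_test_split` for the same purpose.

```python
import random

def split_train_test(examples, test_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

# Placeholder data standing in for the 1400 question/answer rows.
examples = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1400)]
train, test = split_train_test(examples)
print(len(train), len(test))
```

Holding out a test set before finetuning gives you examples the model has never trained on, which you need for an honest evaluation later.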
