# Small Language Model from LLM
Download an LLM, prune and quantize it, and benchmark it each step of the way.

## Start by downloading an LLM
I originally planned to use Llama 2 because it is so ubiquitous. However, it requires HuggingFace authentication, since Meta AI gates access behind an approval process. To avoid cluttering the code with authentication, I went with Mistral AI's Mistral model instead. We could choose larger versions of this model, but to prove out and practice these model-optimization concepts, we can iterate faster with a smaller model like the 7B.

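According to a discussion on HuggingFace, Llama-2 7B requires 28GB of GPU RAM, which matches 7 billion parameters at 4 bytes each in full precision. Assuming Mistral 7B is similar, and to be on the safe side, I'll over-provision with an ml.g5.4xlarge for my SageMaker Studio Notebook. Note that the model is loaded below in `float16`, which roughly halves the weight footprint to about 14GB, so it fits on the instance's 24GB A10G GPU.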
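As a rough sanity check on that figure, the memory needed for the weights alone is the parameter count times the bytes per parameter. The sketch below assumes a round 7 billion parameters and a hypothetical helper name, `weight_memory_gb`; it ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the model weights only, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: a round 7 billion parameters for a "7B" model.
NUM_PARAMS = 7e9

for dtype_name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    gb = weight_memory_gb(NUM_PARAMS, bytes_per_param)
    print(f"{dtype_name}: ~{gb:.0f} GB")
```

This is also why quantization is attractive later on: each halving of the bytes per parameter halves the weight footprint.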

### Set up environment
At first I got the error `KeyError: 'mistral'` when running `from_pretrained()`. The solution, from [the model's HuggingFace page](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting), was to upgrade `transformers` to a version that supports Mistral.

I tried installing `evaluate` later in the script, right before using it, but that produced a warning that a `transformers` process had already started. Once I moved the `pip install evaluate` up to this cell, the warning went away.

In [None]:
!pip install --upgrade datasets evaluate sentence_transformers transformers

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

### Download LLM

In [None]:
model_repo = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoModelForCausalLM.from_pretrained(model_repo, torch_dtype=torch.float16).to("cuda")

### Verify LLM works

In [4]:
# Simple prompt
prompt = "Write a Haiku explaining biodynamic farming."

In [5]:
# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

In [None]:
# Generate response
output = model.generate(input_ids, max_length=35)

In [7]:
# Decode generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True) 
print(generated_text)

Write a Haiku explaining biodynamic farming.

The moon is full

The sun is shining bright

The earth is fertile

The moon


## Benchmark the foundation model (FM) for a baseline
Let's benchmark for accuracy, latency, and resource utilization (Memory, GPU, and CPU).
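For the latency part, a minimal sketch of a timing helper is shown here. The name `measure_latency` and its parameters are hypothetical and not part of the notebook; the workload is a stand-in so the block runs without a GPU.

```python
import statistics
import time


def measure_latency(fn, warmup=1, runs=5):
    """Call fn() repeatedly and return summary timings in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


# Stand-in workload; in the notebook this would wrap a call to model.generate(...)
stats = measure_latency(lambda: sum(range(100_000)))
print(stats)
```

When timing GPU work, CUDA calls are asynchronous, so it is usually worth calling `torch.cuda.synchronize()` inside the timed function before it returns. Peak GPU memory can be read with `torch.cuda.max_memory_allocated()` after a run.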

### Set up environment for evaluations

In [8]:
from sentence_transformers import SentenceTransformer, util

# Load the pre-trained SAS model
sas_model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')

# Set the pad token
tokenizer.pad_token = tokenizer.eos_token

def compute_sas(predicted_answers, reference_answers):
    """
    Compute the Semantic Answer Similarity (SAS) between a list of predicted answers and reference answers.
    
    Args:
        predicted_answers (list of str): The list of predicted answer texts.
        reference_answers (list of str): The list of reference answer texts.
    
    Returns:
        float: The average SAS score between the predicted and reference answers.
    """
    sas_scores = []
    for predicted, reference in zip(predicted_answers, reference_answers):
        predicted_embedding = sas_model.encode(predicted, convert_to_tensor=True)
        reference_embedding = sas_model.encode(reference, convert_to_tensor=True)
        sas_score = util.cos_sim(predicted_embedding, reference_embedding).item()
        sas_scores.append(sas_score)
    
    return sum(sas_scores) / len(sas_scores)
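To make the score easier to interpret, here is what the cosine similarity used in `compute_sas` calculates, written in plain Python on toy vectors. This is an illustrative sketch only: the helper name `cosine_similarity` and the two-dimensional vectors are assumptions, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The score depends only on the direction of the embeddings, not their length, so a long answer and a short answer can still score highly if they say the same thing.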

### Benchmark for Accuracy

In [9]:
from evaluate import load
from datasets import load_dataset

# Load the evaluation metric and dataset
metric = load("accuracy")
dataset = load_dataset("allenai/reward-bench")

In [10]:
def generate_predictions(examples):
    predictions = []
    
    for prompt, chosen in zip(examples["prompt"], examples["chosen"]):
        # Tokenize the input once and move the tensors to the GPU
        encoded = tokenizer(prompt, return_tensors="pt").to("cuda")
        input_ids = encoded.input_ids
        attention_mask = encoded.attention_mask
        
        # Generate the model's prediction
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            # max_new_tokens is more flexible than max_length,
            # because it only caps the output
            # so it won't fail if input is longer than the specified value
            max_new_tokens=100,
            pad_token_id=tokenizer.eos_token_id,
        )
        
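        # Decode the generated text. For a decoder-only model, output[0] holds the
        # prompt tokens followed by the new tokens, so this string includes the prompt.
        predicted_answer = tokenizer.decode(output[0], skip_special_tokens=True)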
        
        # Append the predicted answer and reference answer to the predictions list
        predictions.append({"predicted": predicted_answer, "reference": chosen})
    
    # Return the predictions
    return {"predictions": predictions}
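One thing to be aware of: for a decoder-only model, `generate()` returns the prompt tokens followed by the new tokens, so the decoded prediction starts with the prompt itself. If you would rather score only the answer, one option is to slice off the prompt before decoding. The sketch below shows the idea on plain lists with made-up token ids; `strip_prompt` is a hypothetical helper, not something the notebook uses.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the token ids generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token ids, for illustration only
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]

new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

In the notebook, the equivalent slice would be `output[0][input_ids.shape[1]:]`. Note that making this change would alter the similarity scores, so results would no longer be comparable with runs that kept the prompt.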

In [16]:
# See what the dataset looks like.
# Let's treat `prompt` like the Question and `chosen` like the Answer.
dataset["train"][0]

{'prompt': 'What are the names of some famous actors that started their careers on Broadway?',
 'chosen': 'Several famous actors started their careers on Broadway before making it big in film and television. Here are a few notable examples:\n\n1. Sarah Jessica Parker - Before she was Carrie Bradshaw on "Sex and the City," Sarah Jessica Parker was a Broadway star, having appeared in productions like "Annie" as a child.\n\n2. Meryl Streep - Meryl Streep\'s early career included Broadway productions such as "Trelawny of the \'Wells\'" and "A Memory of Two Mondays / 27 Wagons Full of Cotton."\n\n3. Hugh Jackman - Hugh Jackman won a Tony Award for his role in "The Boy from Oz" and has been known for his stage work as well as his film career.\n\n4. Sutton Foster - Known for her television role in "Younger," Sutton Foster is also a Broadway legend with leading roles in shows like "Thoroughly Modern Millie" and "Anything Goes."\n\n5. Kristen Bell - Before she was the voice of Anna in "Frozen" 

We sample a subset of the data to get a quick, rough idea of whether the evaluation works and awards at least some level of accuracy. From a model like Mistral, we should get more than 10% accuracy from our evaluation, and hopefully a lot more. If accuracy is 0%, then either our evaluation code doesn't work at all, or the similarity check is too rigid.

Also, the dataset contains over 5000 records in the `train` portion alone, which can take hours to iterate through. Since we are not yet running this as a dedicated SageMaker job, a long job like this in a notebook can easily time out. As a solution, we save the evaluation results after every batch as `evaluation_snapshots`, so we can pick up where we left off if we get cut off.

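The initial result of the following job, on a subset of 80 examples (`num_samples = 80`), was a "Semantic Answer Similarity" of 0.72, which I treat as roughly 72% accuracy. Strictly speaking, this is the mean cosine similarity between predicted and reference answers rather than a fraction of correct answers, but it is satisfactory as a baseline.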
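One way to turn per-example similarity scores into an accuracy-style number is to count the examples that clear a threshold. The sketch below assumes a hypothetical `accuracy_at_threshold` helper, a made-up threshold of 0.7, and made-up scores; none of these are part of the notebook's evaluation.

```python
def accuracy_at_threshold(similarities, threshold=0.7):
    """Fraction of examples whose similarity is at least `threshold`."""
    if not similarities:
        raise ValueError("similarities must not be empty")
    hits = sum(1 for score in similarities if score >= threshold)
    return hits / len(similarities)


# Made-up per-example scores, for illustration only
scores = [0.91, 0.68, 0.75, 0.40]

mean_similarity = sum(scores) / len(scores)
print(f"Mean similarity: {mean_similarity:.3f}")
print(f"Accuracy at 0.7: {accuracy_at_threshold(scores):.2f}")
```

The two numbers answer different questions: the mean rewards being close on average, while the thresholded version reports how many answers were "close enough".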

In [18]:
%%time

import os

batch_size = 8

results_dir = "evaluation_snapshots"
os.makedirs(results_dir, exist_ok=True)
            
subset_name = "train"
subset_dataset = dataset[subset_name]

# Limit to the first 80 Q&As
num_samples = 80
subset_dataset = subset_dataset.select(range(num_samples))

num_batches = (len(subset_dataset) + batch_size - 1) // batch_size

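# Check for existing snapshots. Sort by the numeric batch index, because a plain
# string sort would put "snapshot_10.pt" before "snapshot_9.pt".
def snapshot_index(filename):
    return int(filename.split("_")[-1].split(".")[0])

snapshot_files = sorted(
    (f for f in os.listdir(results_dir) if f.startswith("snapshot_") and f.endswith(".pt")),
    key=snapshot_index,
)
if snapshot_files:
    latest_snapshot = snapshot_files[-1]
    snapshot_path = os.path.join(results_dir, latest_snapshot)
    snapshot = torch.load(snapshot_path)
    predicted_answers = snapshot["predicted_answers"]
    reference_answers = snapshot["reference_answers"]
    start_batch_idx = snapshot_index(latest_snapshot) + 1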
else:
    predicted_answers = []
    reference_answers = []
    start_batch_idx = 0

for batch_idx in range(start_batch_idx, num_batches):
    start_idx = batch_idx * batch_size
    end_idx = min(start_idx + batch_size, len(subset_dataset))
    batch_dataset = subset_dataset.select(range(start_idx, end_idx))
    
    batch_results = batch_dataset.map(generate_predictions, batched=True, batch_size=batch_size)
    
    batch_predicted_answers = [pred["predicted"] for pred in batch_results["predictions"]]
    batch_reference_answers = [pred["reference"] for pred in batch_results["predictions"]]
    
    predicted_answers.extend(batch_predicted_answers)
    reference_answers.extend(batch_reference_answers)
    
    # Save evaluation snapshot
    snapshot_path = os.path.join(results_dir, f"snapshot_{batch_idx}.pt")
    torch.save({
        "predicted_answers": predicted_answers,
        "reference_answers": reference_answers
    }, snapshot_path)
    
    print(f"Processed batch {batch_idx + 1}/{num_batches}")

# Compute SAS score
sas_score = compute_sas(predicted_answers, reference_answers)
print(f"Semantic Answer Similarity: {sas_score:.2f}")



Processed batch 1/10
Processed batch 2/10
Processed batch 3/10
Processed batch 4/10
Processed batch 5/10
Processed batch 6/10
Processed batch 7/10
Processed batch 8/10
Processed batch 9/10
Processed batch 10/10
Semantic Answer Similarity: 0.72
CPU times: user 5min 33s, sys: 2min 30s, total: 8min 3s
Wall time: 7min 49s
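The wall time above also gives a rough estimate of how long a full pass would take. The arithmetic below assumes the per-example cost stays constant and uses 5000 as a lower bound for the size of the `train` portion; both are simplifying assumptions.

```python
# Measured on the 80-example subset
wall_time_s = 7 * 60 + 49  # 7min 49s
num_samples = 80

seconds_per_example = wall_time_s / num_samples

# Assumption: lower bound on the number of records in `train`
full_size = 5000
estimated_hours = seconds_per_example * full_size / 3600

print(f"{seconds_per_example:.1f} s per example")
print(f"~{estimated_hours:.1f} hours for {full_size} examples")
```

At roughly 5.9 seconds per example, the full set would take a little over 8 hours, which is consistent with the earlier concern about notebook timeouts.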
