# Small Language Model from LLM
Download an LLM, prune and quantize it, and benchmark it at each step along the way.

## Start by downloading an LLM
I was going to use Llama 2 because of how ubiquitous it currently is. However, it requires HuggingFace authentication, since Meta AI gates access behind an approval process. To avoid cluttering the code with authentication, I went with Mistral AI's Mistral 7B instead. We could choose a larger model, but to prove out and practice these model-optimization concepts, we can iterate faster with a smaller model like 7B.

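According to a discussion on HuggingFace, Llama-2 7B requires 28GB of GPU RAM, which matches 7 billion parameters stored at full (32-bit) precision. Assuming Mistral 7B is similar, and since I load it in half precision below (roughly half that memory), I'll provision an ml.g5.4xlarge for my SageMaker Studio Notebook to be on the safe side.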
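
The memory figure above can be sanity-checked with back-of-the-envelope arithmetic: parameter count times bytes per parameter. The sketch below is only an estimate of weight storage. It assumes a round 7 billion parameters and ignores activations, the KV cache, and framework overhead; `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: a round 7e9 parameters; real 7B-class models differ slightly.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype):
    """Return the weight footprint in GB (1 GB = 1e9 bytes) for a given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 7e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(n_params, dtype):.0f} GB")
```

The same arithmetic is why quantizing to fewer bits per weight later in this notebook should shrink the footprint roughly in proportion.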

### Set up environment
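At first I got the error `KeyError: 'mistral'` when running `from_pretrained()`. The solution was on [the model's HuggingFace page](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting): upgrade `transformers` to a version that knows the Mistral architecture, which is why the install below uses `--upgrade`.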
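
A minimal sketch for checking whether the installed `transformers` is new enough before calling `from_pretrained()`. The minimum version `4.34.0` is an assumption based on the model card's troubleshooting note, so verify it against the linked page. `parse_version` and `supports_mistral` are hypothetical helpers written for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: minimum version taken from the model card's troubleshooting note.
MIN_TRANSFORMERS = (4, 34, 0)


def parse_version(text):
    """Turn '4.34.0' or '4.35.0.dev0' into a 3-tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '4.34' so tuple comparison behaves as expected
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def supports_mistral(installed):
    """Return True if the version string meets the assumed minimum."""
    return parse_version(installed) >= MIN_TRANSFORMERS


try:
    installed = version("transformers")
    status = "OK" if supports_mistral(installed) else "too old, upgrade it"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```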

I tried installing `evaluate` later in the script, right before using it, but that gave me a warning that a `transformers` process had already started. Once I moved the `pip install evaluate` up to this cell, the warning went away.

In [None]:
!pip install --upgrade datasets evaluate sentence_transformers transformers

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

### Download LLM

In [None]:
model_repo = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoModelForCausalLM.from_pretrained(model_repo, torch_dtype=torch.float16).to("cuda")

### Verify LLM works

In [None]:
# Simple prompt
prompt = "Write a Haiku explaining biodynamic farming."

In [None]:
# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

In [None]:
# Generate response
output = model.generate(input_ids, max_length=35)

In [None]:
# Decode generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True) 
print(generated_text)

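## Benchmark the foundation model (FM) for a baseline
Let's benchmark the unmodified model for accuracy, latency, and resource utilization (memory, GPU, and CPU), so that each later optimization has something to be compared against.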
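
For the latency and GPU-memory side of the baseline, one possible approach is a small timing harness like the sketch below. It is an illustration under assumptions, not the measurement code used in this notebook: `benchmark` is a hypothetical helper, it times any zero-argument callable, and it reports peak GPU memory only when PyTorch with CUDA is available. CPU and host-memory utilization are not covered here.

```python
import statistics
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keep the sketch usable without PyTorch
    torch = None
    HAS_CUDA = False


def benchmark(fn, n_runs=3, warmup=1):
    """Time a zero-argument callable and, on CUDA, record peak GPU memory."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the timings
    if HAS_CUDA:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        if HAS_CUDA:
            # GPU kernels run asynchronously; wait before stopping the clock
            torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

    stats = {
        "mean_s": statistics.mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }
    if HAS_CUDA:
        stats["peak_gpu_gb"] = torch.cuda.max_memory_allocated() / 1e9
    return stats


# Stand-in workload so the sketch runs anywhere
print(benchmark(lambda: sum(range(100_000))))
```

In this notebook the callable could be something like `lambda: model.generate(input_ids, max_new_tokens=35)`, reusing the `model` and `input_ids` defined earlier.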

### Set up environment for evaluations

In [None]:
from sentence_transformers import SentenceTransformer, util

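# Load the pre-trained sentence-embedding model used to compute
# Semantic Answer Similarity (SAS)
sas_model = SentenceTransformer('sentence-transformers/paraphrase-mpnet-base-v2')

# The Mistral tokenizer ships without a pad token, so reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token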

def compute_sas(predicted_answers, reference_answers):
    """
    Compute the Semantic Answer Similarity (SAS) between a list of predicted answers and reference answers.
    
    Args:
        predicted_answers (list of str): The list of predicted answer texts.
        reference_answers (list of str): The list of reference answer texts.
    
    Returns:
        float: The average SAS score between the predicted and reference answers.
    """
    sas_scores = []
    for predicted, reference in zip(predicted_answers, reference_answers):
        predicted_embedding = sas_model.encode(predicted, convert_to_tensor=True)
        reference_embedding = sas_model.encode(reference, convert_to_tensor=True)
        sas_score = util.cos_sim(predicted_embedding, reference_embedding).item()
        sas_scores.append(sas_score)
    
    return sum(sas_scores) / len(sas_scores)
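
To make the score concrete: `compute_sas` reduces each answer to an embedding vector and takes the cosine similarity of each predicted/reference pair. The sketch below reproduces that similarity in plain Python on made-up toy vectors (real embeddings from `sas_model` have hundreds of dimensions). `cosine_similarity` is a hypothetical helper for illustration, not the `sentence_transformers` implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, for illustration only)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Scores near 1 mean the two texts point in nearly the same direction in embedding space, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions.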

### Benchmark for Accuracy

In [None]:
from evaluate import load
from datasets import load_dataset

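# Load the evaluation metric and dataset
# Note: `metric` is not used by the SAS computation below
metric = load("accuracy")
# Request a single split so that map() returns a Dataset whose columns can be
# indexed by name; without `split`, load_dataset returns a DatasetDict keyed by split
dataset = load_dataset("allenai/reward-bench", split="filtered")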

In [None]:
def generate_predictions(examples):
    """Generate one completion per prompt in the batch and pair it with its reference answer."""
    predictions = []

    for prompt, chosen in zip(examples["prompt"], examples["chosen"]):
        # Tokenize once; the result holds both input_ids and attention_mask
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

        # Generate the model's prediction
        output = model.generate(
            **inputs,
            # max_new_tokens is more flexible than max_length,
            # because it only caps the output
            # so it won't fail if input is longer than the specified value
            max_new_tokens=100,
            pad_token_id=tokenizer.eos_token_id,
        )

        # Decode only the newly generated tokens: output[0] begins with the
        # prompt, which would otherwise distort the similarity to the reference
        prompt_length = inputs["input_ids"].shape[1]
        predicted_answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

        # Append the predicted answer and reference answer to the predictions list
        predictions.append({"predicted": predicted_answer, "reference": chosen})

    # Return one entry per input example, as a batched map() expects
    return {"predictions": predictions}
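
The comment about `max_new_tokens` versus `max_length` can be illustrated with a simplified model of how the two caps translate into a budget of generated tokens for a decoder-only model. This is a sketch of the idea, not the `transformers` implementation; `new_token_budget` is a hypothetical helper.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated under each kind of cap.

    max_length counts prompt tokens plus generated tokens, while
    max_new_tokens counts generated tokens only.
    """
    if max_new_tokens is not None:
        # Independent of prompt length
        return max_new_tokens
    if max_length is not None:
        # The prompt eats into the budget, possibly leaving nothing
        return max(0, max_length - prompt_len)
    raise ValueError("set either max_length or max_new_tokens")


print(new_token_budget(10, max_length=35))       # short prompt, room left to generate
print(new_token_budget(50, max_length=35))       # prompt already exceeds the cap
print(new_token_budget(50, max_new_tokens=100))  # budget unaffected by the prompt
```

Because the prompts in the benchmark dataset vary in length, capping only the generated tokens keeps the output budget the same for every example.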

In [None]:
%%time

# Evaluate the model on the dataset
results = dataset.map(generate_predictions, batched=True, batch_size=32)
predicted_answers = [pred["predicted"] for pred in results["predictions"]]
reference_answers = [pred["reference"] for pred in results["predictions"]]
sas_score = compute_sas(predicted_answers, reference_answers)
print(f"Semantic Answer Similarity: {sas_score:.2f}")

