# Small Language Model from LLM
Download an LLM, prune and quantize it, and benchmark it at each step along the way.

## Start by downloading an LLM
I was going to use Llama 2 because of how ubiquitous it currently is. However, it requires HuggingFace authentication, since Meta AI gates access behind an approval process. To avoid cluttering the code with authentication, I went with Mistral AI's Mistral model instead. We could choose larger versions of this model, but to prove out and practice these model-optimization concepts, we can iterate faster with a smaller model like 7B.

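According to a discussion on HuggingFace, Llama-2 7B requires 28 GB of GPU RAM, which matches 7 billion parameters stored at 4 bytes each (float32). Assuming Mistral 7B is similar, and since I load it in float16 below (roughly half that footprint for the weights), I'll over-provision with an ml.g5.4xlarge for my SageMaker Studio Notebook to be on the safe side.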
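To make that sizing concrete, here is a rough back-of-the-envelope sketch. It assumes a round 7 billion parameters and counts only the weights, so activations, the KV cache, and CUDA overhead would come on top. The helper name `weight_memory_gb` is my own, not part of any library.

```python
# Rough weight-only memory estimate.
# Ignores activations, KV cache, and CUDA/context overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # assumption: a round 7 billion parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

Under these assumptions the float32 row reproduces the 28 GB figure, and float16 comes out at about half of that. As far as I know, an ml.g5.4xlarge has a single A10G GPU with 24 GB of memory, so the float16 load is what leaves headroom here.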

### Set up environment
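At first I got the error `KeyError: 'mistral'` when running `from_pretrained()`, which means the installed `transformers` version was too old to recognize the Mistral architecture.
The solution, upgrading `transformers`, was on [the model's HuggingFace page](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting).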

In [None]:
!pip install --upgrade transformers
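If you want to confirm the upgrade took effect before loading the model, a small check like the one below can help. This is only a sketch: the minimum version `4.34.0` is my assumption about the first release with Mistral support, and `parse_version` and `supports_mistral` are hypothetical helpers, not part of `transformers`.

```python
import re
from importlib import metadata

# Assumption: 4.34.0 is the first transformers release that knows "mistral".
MIN_VERSION = (4, 34, 0)


def parse_version(version: str) -> tuple:
    """Extract the leading numeric components, e.g. '4.34.0.dev0' -> (4, 34, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component (e.g. '4.35') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def supports_mistral(version: str) -> bool:
    """Return True if the version is at least MIN_VERSION (suffixes are ignored)."""
    return parse_version(version) >= MIN_VERSION


try:
    installed = metadata.version("transformers")
    status = "OK" if supports_mistral(installed) else "too old, upgrade needed"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

In a notebook, remember to restart the kernel after upgrading, otherwise the previously imported version stays loaded.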

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
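# Load the tokenizer and the model weights from the HuggingFace Hub
model_repo = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_repo)
# Load in half precision (float16) to halve the weight memory, then move to the GPU
model = AutoModelForCausalLM.from_pretrained(model_repo, torch_dtype=torch.float16).to("cuda")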

In [None]:
# Simple prompt
prompt = "Write a Haiku explaining biodynamic farming."

In [None]:
# Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

In [None]:
# Generate response
# Note: max_length counts the prompt tokens plus the generated tokens
output = model.generate(input_ids, max_length=35)

In [None]:
# Decode generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)