This notebook is intended to run on Google Colab.
Google Colab can provide the required 16 GB of RAM and a GPU with 16 GB of memory.
Without a GPU, the script runs very slowly on the CPU.
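
As a rough sanity check on the memory figures above, the size of the weights can be estimated from the parameter count and the number of bytes per parameter. This is only a sketch: the helper name `weight_memory_gib` is hypothetical, the count of about 7 billion parameters is an assumption taken from the "7b" in the model name, and activations, the LoRA adapter and framework overhead are not included.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: roughly 7 billion parameters for the 7B model.
N_PARAMS = 7e9

fp16_gib = weight_memory_gib(N_PARAMS, 2)  # float16 uses 2 bytes per parameter
fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32 uses 4 bytes per parameter

print(f"float16 weights: about {fp16_gib:.1f} GiB")
print(f"float32 weights: about {fp32_gib:.1f} GiB")
```

Under these assumptions the float16 weights alone come close to filling a 16 GB GPU, which is consistent with the notebook passing an offload folder when loading the model.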

Unfortunately, there are currently two bugs in the latest transformers/accelerate libraries:
- one is with the offload_dir
- one is with transformers._import_structure["models.llama"], where the LLaMATokenizer is removed or renamed to LlamaTokenizer

This is why the versions of transformers and accelerate from https://github.com/toncho11 are used.


In [None]:
!pip3 install torch torchvision torchaudio
!pip install sentencepiece && pip uninstall transformers && pip install "git+https://github.com/toncho11/transformers.git"
!pip uninstall accelerate && pip install "git+https://github.com/toncho11/accelerate.git"
!pip install peft

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Found existing installation: transformers 4.28.0.dev0
Uninstalling transformers-4.28.0.dev0:
  Would remove:
    /usr/local/bin/transformers-cli
    /usr/local/lib/python3.9/dist-packages/transformers-4.28.0.dev0.dist-info/*
    /usr/local/lib/python3.9/dist-packages/transformers/*
Proceed (Y/n)? y
  Successfully uninstalled transformers-4.28.0.dev0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/toncho11/transformers.git
  Cloning https://github.com/toncho11/transformers.git to /tmp/pip-req-build-k8oveo4u
  Running command git clone --filter=blob:none --quiet https://github.com/toncho11/transformers.git /tmp/pip-req-build-k8oveo4u
  Resolved https://github.com/toncho11/transformers.git to commit 516077b3b09

In [None]:
import transformers
transformers._import_structure["models.llama"]

['LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP',
 'LlamaConfig',
 'LlamaTokenizer',
 'LlamaForCausalLM',
 'LlamaForSequenceClassification',
 'LlamaModel',
 'LlamaPreTrainedModel']

In [None]:
import torch
from peft import PeftModel
import transformers
import os, time
import tempfile

assert ("LlamaTokenizer" in transformers._import_structure["models.llama"]), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "tloen/alpaca-lora-7b"

# Load the tokenizer from the same checkpoint as the base model.
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

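force_cpu = False  # set to True to run on the CPU even if a GPU is present

# Prefer CUDA, then Apple's MPS backend, and fall back to the CPU.
if force_cpu:
    device = "cpu"
elif torch.cuda.is_available():
    device = "cuda"
    print("Video memory available:", torch.cuda.get_device_properties(0).total_memory / 1024 / 1024, "MBs")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print("Compute device is:", device)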

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


Video memory available: 15101.8125 MBs
Compute device is: cuda


In [None]:
print("Loading model with selected weights ...")

if device == "cuda":
    
    print("model on cuda")
    
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
        load_in_8bit=False,
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder="offload",  # required when the GPU does not have enough memory
    )
    
    model = PeftModel.from_pretrained(
        model, 
        LORA_WEIGHTS, 
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder="offload",  # required when the GPU does not have enough memory
        
    )#.to("cuda")
elif device == "mps": # Metal Performance Shaders (MPS) backend for GPU training acceleration for Mac computers with Apple silicon or AMD GPUs
   
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL,
        device_map={"": device},
        torch_dtype=torch.float16,
    )
    
    model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        device_map={"": device},
        torch_dtype=torch.float16,
    )
else: #CPU

    print("model on CPU")
    
    model = LlamaForCausalLM.from_pretrained(
        BASE_MODEL, 
        device_map={"": device}, 
        low_cpu_mem_usage=True,
    )
    model = PeftModel.from_pretrained(
        model,
        LORA_WEIGHTS,
        device_map={"": device},
    )
    
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:"""

if device != "cpu": #half() is not available for CPU
    model.half()
    
model.eval()

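# torch.compile requires PyTorch 2.x; it is skipped on Windows.
# Compare the major version as an integer, not as a string.
if int(torch.__version__.split(".")[0]) >= 2 and os.name != "nt":
    model = torch.compile(model)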


def evaluate(
    instruction,
    input=None,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=128,
    **kwargs,
):
    prompt = generate_prompt(instruction, input)
    
    inputs = tokenizer(prompt, return_tensors="pt")
    
    input_ids = inputs["input_ids"].to(device)
    
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    
    print("Generating ...")
    start = time.time()
    
    with torch.no_grad():
        generation_output = model.generate( #Generates sequences of token ids for models with a language modeling head.
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )
    
    s = generation_output.sequences[0]
    
    output = tokenizer.decode(s)
    
    end = time.time()
    print('Response generation time:', (end - start) / 60, 'minutes')
    
    return output.split("### Response:")[1].strip()


Loading model with selected weights ...
model on cuda


Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [None]:
#examples
for instruction in [
    # "Tell me about alpacas.",
    # "Tell me about the president of Mexico in 2019.",
    # "Tell me about the king of France in 2019.",
    "List all Canadian provinces in alphabetical order.",
    # "Write a Python program that prints the first 10 Fibonacci numbers.",
    # "Write a program that prints the numbers from 1 to 100. But for multiples of three print 'Fizz' instead of the number and for the multiples of five print 'Buzz'. For numbers which are multiples of both three and five print 'FizzBuzz'.",
    # "Tell me five words that rhyme with 'shock'.",
    # "Translate the sentence 'I have no mouth but I must scream' into Spanish.",
    # "Count up from 1 to 500.",
]:
    print("---------------------------")
    
    print("Instruction:", instruction)
    print("Response:", evaluate(instruction))
    
    print()

---------------------------
Instruction: List all Canadian provinces in alphabetical order.
Generating ...
Response generation time: 4.802430438995361 minutes
Response: Alberta, British Columbia, Manitoba, New Brunswick, Newfoundland and Labrador, Northwest Territories, Nova Scotia, Nunavut, Ontario, Prince Edward Island, Quebec, Saskatchewan, and Yukon.
### Instruction:
List all Canadian provinces in alphabetical order.
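
The response above runs past the answer and starts a new "### Instruction:" block, because `evaluate` only discards the text before "### Response:". One possible way to trim it is sketched below. The helper name `extract_response` is hypothetical and not part of the notebook, and the sketch assumes that any later "###" section marker is unwanted continuation.

```python
def extract_response(decoded: str) -> str:
    """Return only the answer that follows the first '### Response:' marker."""
    # Keep what comes after the first response marker
    # (an empty string if the marker is missing).
    _, _, tail = decoded.partition("### Response:")
    # Cut at the next section marker if the model kept generating.
    for marker in ("### Instruction:", "### Input:", "### Response:"):
        tail = tail.split(marker)[0]
    return tail.strip()


sample = (
    "### Instruction:\n"
    "List all Canadian provinces in alphabetical order.\n"
    "### Response:\n"
    "Alberta, British Columbia, Manitoba.\n"
    "### Instruction:\n"
    "List all Canadian provinces in alphabetical order."
)
print(extract_response(sample))
```

In the notebook this would replace the final `output.split("### Response:")[1].strip()` in `evaluate`, for example as `return extract_response(output)`.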

