## Tuesday, February 6, 2024

OK Nice! Got this to run in the 'mls2' environment.

## Monday, February 5, 2024

A quick test to validate this environment is good to go with transformers.

Hmm, I have a local environment variable set for the HuggingFace Transformers model cache folder, and yet when I download a model here it gets stored in the default '~/.cache/huggingface/hub' folder ... meh.

In [1]:
!ls ~/.cache/huggingface/hub

models--bert-base-uncased
models--mistralai--Mistral-7B-Instruct-v0.2
models--nomic-ai--nomic-embed-text-v1
models--sentence-transformers--all-mpnet-base-v2
tmp9s591511
version.txt
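A possible explanation for the cache landing in the default folder: the cache location is read from environment variables when the library is imported, so a variable exported in a shell after the Jupyter kernel started, or set under a different name, never reaches it. The sketch below mirrors my understanding of the lookup order (`TRANSFORMERS_CACHE`, then `HF_HUB_CACHE`, then `HF_HOME/hub`, then the default). The order is an assumption that may differ between library versions, and `resolve_hub_cache` is a hypothetical helper, not a library function.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Guess which folder the model cache would use for a given environment.

    Assumed lookup order (may vary by library version):
    TRANSFORMERS_CACHE, HF_HUB_CACHE, HF_HOME/hub, then the default.
    """
    env = os.environ if env is None else env
    # Variables that point directly at the cache folder.
    for name in ("TRANSFORMERS_CACHE", "HF_HUB_CACHE"):
        if env.get(name):
            return Path(env[name]).expanduser()
    # HF_HOME is the parent folder; the hub cache lives in its 'hub' subfolder.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Fall back to the default location under the user's cache directory.
    base = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(base).expanduser() / "huggingface" / "hub"


# What the current kernel process would see.
print(resolve_hub_cache())
```

If this prints the default path from inside the notebook, the kernel never received the variable, and setting it with `os.environ[...]` before importing `transformers` is one thing to try.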


Always start by making sure any CUDA code will target the 4090.

In [2]:
# only target the 4090 ...
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')
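Note that `device(type='cuda')` only confirms that some CUDA device is visible, not which one. A minimal sketch for listing what this process can actually see follows; `describe_visible_gpus` is a hypothetical helper name, and it returns an empty list when PyTorch or CUDA is unavailable.

```python
import os


def describe_visible_gpus():
    """Return 'index: name (memory)' strings for GPUs visible to this process."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    lines = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        lines.append(f"{i}: {props.name} ({props.total_memory / 2**30:.1f} GiB)")
    return lines


print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
for line in describe_visible_gpus():
    print(line)
```

With `CUDA_VISIBLE_DEVICES="0"` there should be exactly one entry, and its name tells me whether index 0 really is the 4090.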

Let's conduct a simple test using the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model from HuggingFace.

In [3]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

Using the default code shown in the model card, the model gets loaded into CPU RAM first, then moved to GPU VRAM, where it runs out of GPU memory!

Then when I try to load it directly to the GPU, it fails with the error:

'ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`'

So then I ran 'mamba install conda-forge::accelerate'
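Some back-of-the-envelope arithmetic suggests why the default load runs out of VRAM. As far as I know, `from_pretrained` loads weights as float32 unless a `torch_dtype` is given, and a 7B model at 4 bytes per parameter is more than a 24 GB card can hold. The parameter count below is an approximation I am assuming for Mistral-7B, and the estimate covers weights only (no activations or KV cache).

```python
PARAMS = 7.24e9   # approximate parameter count for Mistral-7B (assumption)
VRAM_GIB = 24     # RTX 4090

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = weight_gib(PARAMS, nbytes)
    verdict = "fits" if gib < VRAM_GIB else "does not fit"
    print(f"{dtype:8s} ~{gib:5.1f} GiB of weights -> {verdict} in {VRAM_GIB} GiB")
```

By this estimate only the float32 load exceeds the card, which matches both the default call and the plain `device_map=device` call failing, and is why half precision or 8-bit loading is worth trying.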

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
# This way of loading the model loads it to the CPU memory, NOT the GPU VRAM memory. 
# And when we try to then load it to the GPU, we run out of VRAM!
# model = AutoModelForCausalLM.from_pretrained(model_name)

# mamba install conda-forge::accelerate

# And when I tried this, after installing accelerate, it still ran out of VRAM!
# model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device)


# And when I run this, I get this error message:
#   ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` 
#   and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or `pip install bitsandbytes`.
# model = AutoModelForCausalLM.from_pretrained(model_name, 
#                                              device_map=device,
#                                              load_in_8bit=True)

# mamba install conda-forge::bitsandbytes

# Wow! Now when I run this, I get a ton of error messages related to CUDA ... like the following ...
# CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
# model = AutoModelForCausalLM.from_pretrained(model_name, 
#                                              device_map=device,
#                                              load_in_8bit=True)

# Running this generates the same mess of CUDA errors ... man, I got to wonder, do I need to install the CUDA Toolkit??
# model = AutoModelForCausalLM.from_pretrained(model_name,
#                                               load_in_8bit=True,
#                                               device_map='auto',
#                                               torch_dtype=torch.float16,
#                                               low_cpu_mem_usage=True,
#                                               )


# So yeah, I actually just installed the CUDA 12.3 toolkit and we are still getting these CUDA errors! WTF!?


# This code worked in another notebook but different model and within docker ...
# I am now thinking this may have to do with 'bitsandbytes' problems ....
# Yeah ... I think the solution to this is found in the error message itself ... I need to compile from source.
model = AutoModelForCausalLM.from_pretrained(model_name,
                                              load_in_8bit=True,
                                              device_map=device,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              )



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
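Since the bitsandbytes warning complains that `libcudart.so` was not found in any environment path, a quick check of where (or whether) that library exists seems useful before compiling anything from source. This is a sketch under assumptions: I am guessing that `LD_LIBRARY_PATH` and `$CONDA_PREFIX/lib` are the places worth checking, and `find_libcudart` and `candidate_dirs` are hypothetical helper names, not part of bitsandbytes.

```python
import os
from pathlib import Path


def find_libcudart(search_dirs):
    """Return every libcudart.so* file found directly inside the given folders."""
    hits = []
    for d in search_dirs:
        p = Path(d)
        if p.is_dir():
            hits.extend(sorted(p.glob("libcudart.so*")))
    return hits


def candidate_dirs(env=None):
    """Folders to check: LD_LIBRARY_PATH entries plus the conda env's lib folder."""
    env = os.environ if env is None else env
    dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
    if env.get("CONDA_PREFIX"):
        dirs.append(os.path.join(env["CONDA_PREFIX"], "lib"))
    return dirs


hits = find_libcudart(candidate_dirs())
for hit in hits:
    print(hit)
if not hits:
    print("no libcudart.so* found in the checked folders")
```

If nothing turns up, the CUDA runtime is either missing from the environment or installed somewhere the kernel process does not look, for example a system-wide toolkit folder that is not on `LD_LIBRARY_PATH`.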

In [7]:
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]


In [8]:
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

In [9]:
model_inputs = encodeds.to(device)

In [12]:

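# Not needed: device_map already placed the model on the GPU,
# and .to() is not supported for 8-bit bitsandbytes models.
# model.to(device)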

In [10]:
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
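The warning above can probably be silenced by passing an attention mask and a pad token id explicitly. A minimal sketch, assuming a single unpadded prompt (so every position is a real token and the mask is all ones); `generation_kwargs` is a hypothetical helper name, not part of transformers.

```python
def generation_kwargs(input_ids, eos_token_id):
    """Extra arguments for model.generate() when the prompt has no padding.

    Assumes input_ids is a tensor holding one unpadded sequence (or a batch
    of equal-length ones), so the attention mask is simply all ones.
    """
    return {
        "attention_mask": input_ids.new_ones(input_ids.shape),
        "pad_token_id": eos_token_id,
    }


# Usage with the objects from the cells above:
# generated_ids = model.generate(
#     model_inputs,
#     max_new_tokens=1000,
#     do_sample=True,
#     **generation_kwargs(model_inputs, tokenizer.eos_token_id),
# )
```

For batches with prompts of different lengths this shortcut is wrong; the mask then has to come from the tokenizer's padding step.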


In [11]:
decoded = tokenizer.batch_decode(generated_ids)

In [12]:
print(decoded[0])

<s> [INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST] Yes, I certainly do have a few mayonnaise recipes up my virtual sleeve. Here's a simple one for a classic mayo:

1. In a bowl, whisk together 1 cup of vegetable oil, 1 egg yolk, 1 tablespoon of white wine vinegar, 1 tablespoon of Dijon mustard, and a pinch of salt until well combined.
2. Whisk in 1-2 tablespoons of water, one teaspoon at a time, until desired consistency is reached. The mayonnaise should be thick and creamy.
3. Taste and adjust seasonings, if necessary, adding more salt or vinegar as desired.

You can also make a flavoured mayonnaise by whisking in herbs, spices, or other seasonings after the basic mayonnaise is made. Enjoy!

Here's another version that uses an immersion blender for a smoother texture:

1. In a tall,