## Monday, February 12, 2024

* mamba install conda-forge::einops

## Sunday, February 11, 2024

Experiment with [HuggingFace Sentence Transformers](https://huggingface.co/sentence-transformers)

## Tuesday, February 6, 2024

OK Nice! Got this to run in the 'mls2' environment.

## Monday, February 5, 2024

A quick test to validate this environment is good to go with transformers.

Hmm, I have a local environment variable set for the HuggingFace Transformers model cache folder, and yet when I download a model here it lands in the default '~/.cache/huggingface/hub' folder ... meh.
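One possible explanation, offered as a guess rather than a diagnosis: the variable has to be visible to the Jupyter kernel process, and something exported in a shell profile does not always reach a notebook kernel. The sketch below mirrors, as far as I understand it, the order in which the Hugging Face libraries consult `HF_HUB_CACHE`, `HF_HOME`, and `XDG_CACHE_HOME`. `resolve_hub_cache` is a hypothetical helper, not a library function, and older `transformers` releases also honor `TRANSFORMERS_CACHE`, which is not modeled here.

```python
import os


def resolve_hub_cache(env=None):
    """Hypothetical helper: guess where the Hugging Face hub cache will live."""
    env = os.environ if env is None else env
    # An explicit hub cache location wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return os.path.expanduser(env["HF_HUB_CACHE"])
    # Otherwise the hub cache sits in a "hub" folder under HF_HOME.
    if env.get("HF_HOME"):
        return os.path.join(os.path.expanduser(env["HF_HOME"]), "hub")
    # Fall back to XDG_CACHE_HOME, then to ~/.cache.
    cache_root = env.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache"
    )
    return os.path.join(cache_root, "huggingface", "hub")


# Run inside the notebook to see what the kernel process actually inherits.
print(resolve_hub_cache())
```

If this prints the default folder, the kernel never saw the variable, which would be consistent with the behavior described above.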

In [19]:
!ls ~/.cache/huggingface/hub

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


models--bert-base-uncased
models--mistralai--Mistral-7B-Instruct-v0.2
models--nomic-ai--nomic-embed-text-v1
models--sentence-transformers--all-mpnet-base-v2
models--sentence-transformers--paraphrase-MiniLM-L6-v2
tmp9s591511
version.txt
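The tokenizers warning in that output shows up because `!ls` forks the kernel process after a tokenizer has already used its thread pool. Following the warning's own suggestion, a minimal sketch is to set `TOKENIZERS_PARALLELISM` near the top of the notebook, before any tokenizer is used. Picking `"false"` here is my assumption; it trades some tokenization speed for a quiet notebook.

```python
import os

# Set this early, before any tokenizer work happens in the kernel,
# so later shell escapes such as `!ls` do not trigger the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```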


Always start by making sure any CUDA code will target the 4090.

In [2]:
# only target the 4090 ...
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

Let's conduct a simple test using the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model from HuggingFace.

In [3]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

Using the default code shown in the model card, the model gets loaded into CPU RAM and then moved to GPU VRAM, where it runs out of GPU memory!

Then when I try to load it directly to the GPU, it fails with the error:

'ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`'

So then I ran 'mamba install conda-forge::accelerate'
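A rough back-of-the-envelope check makes the out-of-memory result less surprising. The sketch below counts weights only (no activations, KV cache, or CUDA context) and rests on three assumptions of mine: roughly 7.24 billion parameters for Mistral-7B, a nominal 24 GiB of VRAM on the 4090, and `from_pretrained` defaulting to float32 when no `torch_dtype` is given.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)
RTX_4090_VRAM_GIB = 24      # nominal VRAM of the card (assumption)

for label, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    need = weight_memory_gib(MISTRAL_7B_PARAMS, nbytes)
    verdict = "fits" if need < RTX_4090_VRAM_GIB else "does not fit"
    print(f"{label}: ~{need:.1f} GiB of weights, {verdict} in {RTX_4090_VRAM_GIB} GiB")
```

Under these assumptions the float32 weights alone exceed the card, while 8-bit weights come to about 7 GiB, in the same ballpark as the 8252 MiB noted after the 8-bit load.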

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
# This way of loading the model loads it to the CPU memory, NOT the GPU VRAM memory. 
# And when we try to then load it to the GPU, we run out of VRAM!
# model = AutoModelForCausalLM.from_pretrained(model_name)

# mamba install conda-forge::accelerate

# And when I tried this after installing accelerate, it still ran out of VRAM!
# model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device)


# And when I run this, I get this error message:
#   ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` 
#   and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or `pip install bitsandbytes`.
# model = AutoModelForCausalLM.from_pretrained(model_name, 
#                                              device_map=device,
#                                              load_in_8bit=True)

# mamba install conda-forge::bitsandbytes

# Wow! Now when I run this, I get a ton of error messages related to CUDA ... like the following ...
# CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
# model = AutoModelForCausalLM.from_pretrained(model_name, 
#                                              device_map=device,
#                                              load_in_8bit=True)

# Running this generates the same mess of CUDA errors ... man, I got to wonder, do I need to install the CUDA Toolkit??
# model = AutoModelForCausalLM.from_pretrained(model_name,
#                                               load_in_8bit=True,
#                                               device_map='auto',
#                                               torch_dtype=torch.float16,
#                                               low_cpu_mem_usage=True,
#                                               )


# So yeah, I actually just installed the CUDA 12.3 toolkit and we are still getting these CUDA errors! WTF!?


# This code worked in another notebook, but with a different model and inside Docker ...
# I am now thinking this may have to do with 'bitsandbytes' problems ....
# Yeah ... I think the solution to this is found in the error message itself ... I need to compile from source.
# OK. This works now ...
model = AutoModelForCausalLM.from_pretrained(model_name,
                                              load_in_8bit=True,
                                              device_map=device,
                                              torch_dtype=torch.float16,
                                              low_cpu_mem_usage=True,
                                              )

# 13.0s
# 8252 MiB VRAM

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
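Newer `transformers` releases steer toward passing the 8-bit option through a `BitsAndBytesConfig` object instead of the bare `load_in_8bit=True` keyword. Here is a sketch of what I believe is the equivalent call; `load_model_8bit` is a hypothetical wrapper of mine, it has not been run against this exact environment, and the imports sit inside the function so the cell can be defined without touching the GPU.

```python
def load_model_8bit(model_name, device_map="auto"):
    """Hypothetical wrapper: load a causal LM with 8-bit bitsandbytes weights."""
    # Imported lazily so that defining this function needs no GPU libraries.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map=device_map,
        torch_dtype=torch.float16,  # dtype for the modules that stay unquantized
    )


# Usage (needs a CUDA GPU plus accelerate and bitsandbytes installed):
# model = load_model_8bit("mistralai/Mistral-7B-Instruct-v0.2")
```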

In [7]:
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]


In [8]:
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

In [9]:
model_inputs = encodeds.to(device)

In [12]:

# model.to(device)

In [10]:
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)

# 48.0s
# 12930 MiB VRAM

# 37.3s
# 12906 MiB VRAM

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
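That warning appears because `generate` received a bare tensor of token ids. For a single, unpadded prompt every position is a real token, so an all-ones mask is the right one. A minimal sketch that passes both the mask and the pad token explicitly follows; `generate_reply` is a hypothetical helper, and it assumes `input_ids` is already on the same device as the model.

```python
def generate_reply(model, tokenizer, input_ids, max_new_tokens=1000):
    """Hypothetical helper: call generate() with an explicit mask and pad token."""
    import torch  # imported lazily; only needed when the function is called

    # One unpadded prompt: every token is real, so the mask is all ones.
    attention_mask = torch.ones_like(input_ids)
    return model.generate(
        input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning reports
        max_new_tokens=max_new_tokens,
        do_sample=True,
    )


# Usage with the objects from the cells above:
# generated_ids = generate_reply(model, tokenizer, model_inputs)
```

For a batch of prompts padded to a common length, the all-ones shortcut no longer holds and the mask should come from the tokenizer instead.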


In [11]:
decoded = tokenizer.batch_decode(generated_ids)

In [12]:
print(decoded[0])

<s> [INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST] Of course, I'd be happy to help you make your own mayonnaise at home. Here's a simple and classic recipe that you can try:

Ingredients:
- 1 egg yolk
- 1 tablespoon of Dijon mustard
- 1 cup (200 ml) of vegetable oil (such as canola or sunflower oil)
- 1-2 tablespoons of white wine vinegar or lemon juice
- Salt to taste

Instructions:
1. Combine egg yolk and mustard in a medium-sized bowl.
2. Whisk in 1 teaspoon of white wine vinegar or lemon juice.
3. Gradually add oil in a thin, slow stream, whisking constantly to emulsify. If the mayonnaise thickens too much, add a small amount of water to thin it out.
4. Once all the oil has been incorporated, whisk in remaining vinegar or lemon juice, and add salt to taste.

This mayonnaise will 

#### HuggingFace Sentence Transformers

In [20]:
# sentence transformers reside in the same location as other transformers
!ls ~/.cache/huggingface/hub

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


models--bert-base-uncased
models--mistralai--Mistral-7B-Instruct-v0.2
models--nomic-ai--nomic-embed-text-v1
models--sentence-transformers--all-mpnet-base-v2
models--sentence-transformers--paraphrase-MiniLM-L6-v2
tmp9s591511
version.txt


In [21]:
from sentence_transformers import SentenceTransformer

In [16]:
model_name = "paraphrase-MiniLM-L6-v2"

In [22]:
model_name = "sentence-transformers/all-mpnet-base-v2"

In [23]:
model = SentenceTransformer(model_name)

# 1m 21.0s

In [24]:
# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

In [25]:
# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)

[nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1)

nomic-embed-text-v1 is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.

In [2]:
!ls ~/.cache/huggingface/hub

models--bert-base-uncased
models--mistralai--Mistral-7B-Instruct-v0.2
models--nomic-ai--nomic-embed-text-v1
models--sentence-transformers--all-mpnet-base-v2
models--sentence-transformers--paraphrase-MiniLM-L6-v2
tmp9s591511
version.txt


In [3]:
!rm -rf ~/.cache/huggingface/hub/models--nomic-ai--nomic-embed-text-v1

Running the next cell for the first time generated the following error:

ImportError: This modeling file requires the following packages that were not found in your environment: einops. Run `pip install einops`
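To catch this kind of thing before kicking off a model download, a small pre-flight check with the standard library is enough. This is only a sketch: `missing_packages` is a hypothetical helper, and the list contains just `einops` because that is the only package the error message names.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["einops"])
if missing:
    print("Install first:", ", ".join(missing))
else:
    print("All required packages are importable.")
```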

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# This message refers to the version of SentenceTransformers
# You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.3.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.


You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.3.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


In [6]:
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)


[[ 1.0951355e-02  5.7414662e-02 -1.1036437e-02 ...  3.5154229e-05
  -2.8092161e-02 -2.1599840e-02]
 [-1.3366990e-02  2.7091309e-02 -2.3367396e-02 ...  2.8799431e-02
  -1.0674694e-02  2.8820794e-02]]
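The `search_query: ` text at the start of each sentence is part of the model input, not decoration. As I read the model card, nomic-embed-text-v1 expects every text to carry a task prefix; the other prefix names listed below are from my recollection of that card, so treat the set as an assumption. `with_prefix` is a hypothetical helper.

```python
# Task prefixes as I recall them from the model card (assumption).
KNOWN_TASKS = {"search_query", "search_document", "clustering", "classification"}


def with_prefix(task, texts):
    """Prepend '<task>: ' to every text, rejecting unknown task names."""
    if task not in KNOWN_TASKS:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {text}" for text in texts]


queries = with_prefix(
    "search_query", ["What is TSNE?", "Who is Laurens van der Maaten?"]
)
print(queries)
```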


#### Transformers

In [7]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
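To see what `mean_pooling` computes, here is the same idea for a single sentence in plain Python with toy numbers: padded positions (mask 0) are excluded from both the sum and the count. `masked_mean` is a hypothetical stand-in for illustration, not a replacement for the torch version.

```python
def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero.
    count = max(sum(mask), 1e-9)
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, mask)) / count
        for d in range(dim)
    ]


tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # third token is padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```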

In [8]:
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

In [9]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [10]:
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)
model.eval()

<All keys matched successfully>


NomicBertModel(
  (embeddings): NomicBertEmbeddings(
    (word_embeddings): Embedding(30528, 768)
    (token_type_embeddings): Embedding(2, 768)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (emb_ln): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (encoder): NomicBertEncoder(
    (layers): ModuleList(
      (0-11): 12 x NomicBertBlock(
        (attn): NomicBertAttention(
          (rotary_emb): NomicBertDynamicNTKRotaryEmbedding()
          (Wqkv): Linear(in_features=768, out_features=2304, bias=False)
          (out_proj): Linear(in_features=768, out_features=768, bias=False)
          (drop): Dropout(p=0.0, inplace=False)
        )
        (mlp): NomciBertGatedMLP(
          (fc11): Linear(in_features=768, out_features=3072, bias=False)
          (fc12): Linear(in_features=768, out_features=3072, bias=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=False)
        )
        (dropout1): Dropout(p=0.0, inplace=False)
        (norm1): LayerNorm((768,)

In [11]:
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

In [12]:
with torch.no_grad():
    model_output = model(**encoded_input)

In [13]:
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

In [14]:
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)


tensor([[ 1.0951e-02,  5.7415e-02, -1.1036e-02,  ...,  3.5161e-05,
         -2.8092e-02, -2.1600e-02],
        [-1.3367e-02,  2.7091e-02, -2.3367e-02,  ...,  2.8799e-02,
         -1.0675e-02,  2.8821e-02]])
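These numbers line up with what SentenceTransformers printed earlier for the same two sentences. A quick sketch of that comparison, using only the six first-row values visible in each printout; the `1e-6` tolerance is an arbitrary choice of mine to absorb display rounding.

```python
import math

# First row as printed by SentenceTransformers (visible values only).
st_row = [1.0951355e-02, 5.7414662e-02, -1.1036437e-02,
          3.5154229e-05, -2.8092161e-02, -2.1599840e-02]
# First row as printed by the plain Transformers + mean pooling path.
hf_row = [1.0951e-02, 5.7415e-02, -1.1036e-02,
          3.5161e-05, -2.8092e-02, -2.1600e-02]

agree = all(math.isclose(a, b, abs_tol=1e-6) for a, b in zip(st_row, hf_row))
print("pipelines agree:", agree)
```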
