#### Tuesday, December 19, 2023

[OpenAI Alternatives: Cohere Embed v3 and Open Source](https://www.youtube.com/watch?v=LzRpTNV74Ck)

A look at a few of the best retrieval models for Retrieval Augmented Generation (RAG) and how we use them. Covering OpenAI's text-embedding-ada-002, Cohere's new Embed v3, and a small but strong performing open source model called e5-base-v2.

This all runs.

In [12]:
!ls /root/.cache/huggingface/hub

# Back up the downloaded models ...

# docker cp c8324b70601d://root/.cache/huggingface/hub/models--intfloat--e5-base-v2 /home/rob/Data3/huggingface/transformers
# Successfully copied 439MB to /home/rob/Data3/huggingface/transformers

# docker cp c8324b70601d://root/.cache/huggingface/hub/models--intfloat--e5-large-v2 /home/rob/Data3/huggingface/transformers
# Successfully copied 1.34GB to /home/rob/Data3/huggingface/transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


models--HuggingFaceH4--zephyr-7b-beta
models--bert-base-uncased
models--cross-encoder--ms-marco-MiniLM-L-6-v2
models--deepset--roberta-base-squad2
models--distilbert-base-uncased-distilled-squad
models--facebook--blenderbot-1B-distill
models--google--flan-t5-large
models--gpt2-medium
models--intfloat--e5-base-v2
models--intfloat--e5-large-v2
models--meta-llama--Llama-2-13b-hf
models--my_model--language_model.bin
models--sentence-transformers--all-MiniLM-L6-v2
models--sentence-transformers--clip-ViT-B-32
models--sentence-transformers--multi-qa-mpnet-base-dot-v1
version.txt


In [10]:
!ls /root/.cache/torch/sentence_transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


BAAI_bge-large-en-v1.5
BAAI_bge-small-en-v1.5
sentence-transformers_all-MiniLM-L6-v2
sentence-transformers_all-mpnet-base-v2
sentence-transformers_clip-ViT-B-32
sentence-transformers_multi-qa-mpnet-base-dot-v1
thenlper_gte-large


In [1]:
# !pip install -qU \
#   datasets==2.14.6 \
#   transformers==4.35.0

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/493.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━[0m [32m256.0/493.7 kB[0m [31m8.0 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m493.7/493.7 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m107.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m16.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m17.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m311.2/311.2 kB[0m [31m33.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m106.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90

## Dataset Download

We're going to test with a more real world use-case, with messy, imperfect data. We will use the [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) dataset.

In [1]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

# 2m 14.8s

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

In [2]:
import torch
from torch.nn.functional import normalize
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device}")

model_id = "intfloat/e5-base-v2"

# lets try the large model
model_id = "intfloat/e5-large-v2"

# initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

#def embed(docs: list[str]) -> list[list[float]]:
def embed(docs):
    docs = [f"passage: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # process with model for token-level embeddings
        out = model(**tokens)
        # mask padding tokens
        last_hidden = out.last_hidden_state.masked_fill(
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        # create mean pooled embeddings
        doc_embeds = last_hidden.sum(dim=1) / \
            tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()

# "intfloat/e5-large-v2"
# 19m 50.2s

Using cuda


tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Use this to build a Numpy array of cohere embedding vectors.

In [3]:
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 256

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # add to existing np array if exists (otherwise create)
    if i == 0:
        arr = embed_batch.copy()
    else:
        arr = np.concatenate([arr, embed_batch.copy()])
        
# "intfloat/e5-large-v2"
# 4m 45.3s 

  0%|          | 0/163 [00:00<?, ?it/s]

Now we need to create the query mechanism, this is simply a cosine similarity calculation between a query vector and our `arr` vectors.

In [4]:
from numpy.linalg import norm

# convert chunks list to array for easy indexing
chunk_arr = np.array(chunks)

# def query(text: str, top_k: int=3) -> list[str]:
def query(text: str, top_k: int=3):
    # create query embedding
    xq = embed([f"query: {text}"])[0]
    # calculate cosine similarities
    sim = np.dot(arr, xq.T) / (norm(arr, axis=1)*norm(xq.T))
    # get indices of top_k records
    idx = np.argpartition(sim, -top_k)[-top_k:]
    docs = chunk_arr[idx]
    for d in docs.tolist():
        print(d)
        print("----------")

In [5]:
query("why should I use llama 2?")

Ricardo Lopez-Barquilla, Marc Shedroﬀ, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta
Chauhan,ChesterHu,CharltonGholson,AnjaKomlenovic,EissaJamil,BrandonSpence,Azadeh
Yazdan, Elisa Garcia Anzano, and Natascha Parks.
•ChrisMarra,ChayaNayak,JacquelinePan,GeorgeOrlin,EdwardDowling,EstebanArcaute,Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.
46
•Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original
Llama team who helped get this work started.
•Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the ﬁgures in the
paper.
•Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the
internal demo.
•Earlyreviewersofthispaper,whohelpedusimproveitsquality,includingMikeLewis,JoellePineau,
Laurens van der Maaten, Jason Weston, and Omer Levy.
----------
diminish any capabilities they might have oﬀered for those use cases.
Wh

In [6]:
query("can you tell me about red teaming for llama 2?")

events", "question": "Who won the recent world cup?"}
{"topic": "Election", "question_type": "Questions that require knowledge of future
events", "question": "Who will win the presidential election in 2028?"}
40
G Instruction Prompts for Topic-Guided Red-Teaming Self-Instruct
Topic-Guided Red-Teaming Self-Instruct has two steps. In the ﬁrst step, we use the base LLM to
generate novel topics related to a given instruction (question) type. Some instructions are taken from
the Alpaca project11[43].
You are asked to come up with a set of 10 diverse topics for a specific question
type.
Here are the requirements:
1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, and its first word should be capitalized.
4. The topics should be closely related to the given question type: [question type].
List of 10 topics:
In the second step, we prompt the base LLM with deduplicated topics and their instr

In [7]:
query("what is the best llm?")

more explainable and interpretable, as it provides
explicit rationales for their predictions.
Right task/application? As Valmeekam et al.
(2022) point out, current benchmarks may not adequately reflect the reasoning capabilities of LLMs.
In addition, tasks such as solving simple math problems and concatenating letters in strings (§4.1) are
artificial and do not accurately reflect real-world
situations. To truly understand the reasoning ability
of LLMs, it is important to consider more realistic
and meaningful applications such as decision making (Edwards, 1954), legal reasoning (Levi, 2013),
and scientific reasoning (Zimmerman, 2000). Our
ultimate goal should not be to enable LLMs to solve
simple math problems, which can be simply done
with other programs. When conducting relevant
research, it is essential to ask whether the specific
task being tackled is meaningful andwhether the
proposed method can be generalized to more realistic tasks and applications .
Are language models really a

In [8]:
query("what is the difference between gpt-4 and llama 2?")

-0.043
-0.009+0.0132-0.004 +0.0562
+0.0387-0.012
-0.076Alpaca: 0.39 LLaMA-GPT4: 0.34 GPT4: 0.37Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are
grouped into four subsets based on the ground-truth response length. The mean values are reported in
the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer
proxy to GPT-4 than Alpaca.
closely follow the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and
GPT-4 can generate responses that contains the simple ground truth answers, but add extra words to
make the response more chat-like, which probably leads to lower ROUGE-L scores.
5 R ELATED WORK
Instruction Tuning. Instruction tuning of LLMs is an increasingly popular research direction in
NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve
the quality and scale of three factors in the development pipeline, including instruction-following
----------

---