In [1]:
!pip install -qU \
  datasets==2.14.6 \
  openai==1.2.2


## Dataset Download

We're going to test with a more realistic use case: messy, imperfect data. We will use the [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) dataset.

In [2]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

First we define our embedding function.

In [3]:
import os
from getpass import getpass
import openai

openai.api_key = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

def embed(docs: list[str]) -> list[list[float]]:
    res = openai.embeddings.create(
        input=docs, model="text-embedding-ada-002"
    )
    doc_embeds = [r.embedding for r in res.data]
    return doc_embeds

We use this to build a NumPy array of OpenAI embedding vectors.

In [4]:
from tqdm.auto import tqdm
import numpy as np

chunks = data["chunk"]
batch_size = 128

for i in tqdm(range(0, len(chunks), batch_size)):
    i_end = min(len(chunks), i+batch_size)
    chunk_batch = chunks[i:i_end]
    # embed current batch
    embed_batch = embed(chunk_batch)
    # add to existing np array if exists (otherwise create)
    if i == 0:
        arr = np.array(embed_batch)
    else:
        arr = np.concatenate([arr, np.array(embed_batch)])
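
The loop above re-copies the growing array on every batch. As a rough sketch of an alternative, the batches can be collected in a list and concatenated once at the end. The names `embed_all` and `fake_embed` are hypothetical and not part of the notebook; `fake_embed` only stands in for `embed` so the example runs without an API key.

```python
import numpy as np

def fake_embed(docs: list[str], dim: int = 4) -> list[list[float]]:
    # hypothetical stand-in for embed(): one dim-length vector per document
    return [[float(len(d))] * dim for d in docs]

def embed_all(chunks: list[str], embed_fn, batch_size: int = 128) -> np.ndarray:
    batches = []
    for i in range(0, len(chunks), batch_size):
        # slicing past the end is safe, so no explicit min() is needed
        batch = chunks[i:i + batch_size]
        batches.append(np.asarray(embed_fn(batch)))
    # a single concatenate at the end avoids re-copying the array per batch
    return np.concatenate(batches)

demo = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(demo.shape)  # (5, 4)
```

With the real function this would be called as `embed_all(chunks, embed)`. Note that `np.concatenate` raises an error on an empty list, so this sketch assumes at least one chunk.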

Now we need to create the query mechanism: a dot product similarity calculation between a query vector and the vectors in `arr`.
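
A plain dot product is a reasonable similarity score here because OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the dot product equals cosine similarity. The small check below illustrates this on synthetic vectors (random data, not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for embeddings, scaled to unit length
vecs = rng.normal(size=(5, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
q = rng.normal(size=8)
q /= np.linalg.norm(q)

dot = vecs @ q
# cosine similarity divides by both norms, which are all 1 here
cos = dot / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
print(np.allclose(dot, cos))  # True
```

If the vectors were not normalized, the two scores could rank documents differently, and dividing by the norms would be needed to get cosine similarity.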

In [18]:
# convert chunks list to array for easy indexing
chunk_arr = np.array(chunks)

def query(text: str, top_k: int = 3) -> None:
    # create query embedding
    res = openai.embeddings.create(
        input=[text], model="text-embedding-ada-002"
    )
    xq = np.array(res.data[0].embedding)
    # calculate dot product similarity against every chunk embedding
    sim = np.dot(arr, xq)
    # get indices of the top_k records (argpartition leaves them unordered)
    idx = np.argpartition(sim, -top_k)[-top_k:]
    print(sim[idx])
    docs = chunk_arr[idx]
    print(docs.shape)
    for d in docs.tolist():
        print(d)
        print("----------")
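
Note that `np.argpartition` only guarantees that the last `top_k` indices point at the largest scores; it does not order them, so the printed chunks are not necessarily ranked best-first. A minimal sketch of one way to rank them, using a hypothetical helper `top_k_sorted` and made-up scores:

```python
import numpy as np

def top_k_sorted(sim: np.ndarray, top_k: int = 3) -> np.ndarray:
    # argpartition finds the top_k indices in linear time, in arbitrary order
    idx = np.argpartition(sim, -top_k)[-top_k:]
    # sorting only those top_k scores is cheap; reverse for descending order
    return idx[np.argsort(sim[idx])[::-1]]

sim = np.array([0.80, 0.10, 0.83, 0.55, 0.82])  # made-up similarity scores
best = top_k_sorted(sim)
print(best.tolist())  # [2, 4, 0]
```

This assumes `top_k` is no larger than the number of scores. Inside `query`, the same idea would amount to reordering `idx` before indexing into `chunk_arr`.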

In [19]:
query("why should I use llama 2?")

[0.80002822 0.83308516 0.82787721]
(3,)
models will be released as we improve model safety with community feedback.
License A custom commercial license is available at: ai.meta.com/resources/
models-and-libraries/llama-downloads/
Where to send commentsInstructions on how to provide feedback or comments on the model can be
found in the model README, or by opening an issue in the GitHub repository
(https://github.com/facebookresearch/llama/ ).
Intended Use
Intended Use Cases L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle is intended for commercial and research use in English. Tuned models
are intended for assistant-like chat, whereas pretrained models can be adapted
for a variety of natural language generation tasks.
Out-of-Scope Uses Use in any manner that violates applicable laws or regulations (including trade
compliancelaws). UseinlanguagesotherthanEnglish. Useinanyotherway
that is prohibited by the Acceptable Use Policy and Licensing Agreement for
L/l.sc/a.sc/m.sc/a.sc /two.taboldstyle.
Har

In [15]:
query("can you tell me about red teaming for llama 2?")

(3,)
the inﬂuence of model size on susceptibility to red team attacks (Figure 1) and analyze the content of the
attacks (Figures 2 & 9) to understand the types of harms uncovered by red teaming. Additionally, we provide
2https://github.com/anthropics/hh-rlhf
2
12340attack success rating
oﬀensive jokesinsults based on physical characteristicsracist languagesoliciting discriminatory responsessubstance abuseviolenceanimal abuseharmful health informationsoliciting advice on violencemaking & smuggling drugstheftidentity theftpiracycheatingsoliciting advice on harmful activitydoxxingsoliciting PIIcrimeoﬀensive languageprofane jokesprofanityoﬀensive terms starting with given letterviolenceadult contentassaultmisinformationFigure 2 Visualization of the red team attacks. Each point corresponds to a red team attack embedded in a two dimensional space using UMAP. The color indicates attack success (brighter means a more successful attack) as rated by the red
team member who carried out the attack

In [16]:
query("what is the best llm?")

(3,)
future training of LLMs. Extreme caution and review is required especially in high-stakes applications such as
medicine, journalism, transportation, and attribution of behaviors or language to individuals or organizations.
As example of the latter, early uses of ChatGPT by writers within an organization covering the tech sector
led to notable errors in publications and, by report, to new review procedures with uses of LLMs for writing
assistance [Lef23]. The new procedures were reported to include clear indications about the use of an LLM to
generate content and then naming human editors responsible for fact-checking [Gug23]. Practitioners in all
elds employing LLMs will need to adhere to the highest standards and practices for verifying information
generated by LLMs.
Both end users of the LLM tools and consumers of generated content will need to be educated about the
challenges with reliability and the need for their ongoing vigilance about erroneous output. In applications
that

In [17]:
query("what is the difference between gpt-4 and llama 2?")

(3,)
31.39%LLaMA-GPT4 
 25.99%
Tie 
 42.61%
HonestyAlpaca 
  25.43%LLaMA-GPT4 
 16.48%
Tie 
 58.10%
Harmlessness(a) LLaMA-GPT4 vs Alpaca ( i.e.,LLaMA-GPT3 )
 GPT4 
  44.11%
LLaMA-GPT4 
 42.78% Tie 
 13.11%
Helpfulness GPT4 
  37.48%
LLaMA-GPT4 
 37.88% Tie 
 24.64%
Honesty GPT4 
  35.36% LLaMA-GPT4 
 31.66% Tie 
 32.98%
Harmlessness
(b) LLaMA-GPT4 vs GPT-4
Figure 3: Human evaluation.
4.2 H UMAN EVALUATION WITH ALIGNMENT CRITERIA
To evaluate the alignment quality of our instruction-tuned LLMs, we follow alignment criteria from
Anthropic Askell et al. (2021): an assistant is aligned if it is helpful, honest, and harmless (HHH).
----------
to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that ( i) For the
“Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes. GPT-3 only wins 19.74%
of the time. ( ii) For the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes
to the tie category, which is substantially higher than t