# `Session 3.2: Text Embedding. Text Search. Retrieval-Augmented Generation`

#### `Sirius, "Algorithms and Data Analysis" session, 2024`

#### `Ilya Alekseev, MMP, CMC MSU`

Based on materials from [a special course at MMP, CMC MSU](https://github.com/mmp-efml/mmp-efml-2024-fall/blob/main/notebooks/sem6_textsearch_rag.ipynb)

##### `Useful materials`

- sentence transformers overview https://nbviewer.org/github/skojaku/Practical-Guide-to-Sentence-Transformers/blob/main/notebook/Practical_Guide_to_Sentence_Transformers.ipynb
- sentence transformers contrastive loss fine-tuning https://sbert.net/examples/training/quora_duplicate_questions/README.html
- sentence transformers ranking fine tuning https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py


The environment for this notebook can be set up with the following command:

In [1]:
! pip install vllm sentence-transformers faiss-cpu openai


## Text Embedding

Below is a basic tutorial on `sentence_transformers`, a library for training and using pre-trained embedding language models.

Link: https://sbert.net/

One can initialize any model in just a few lines of code:

In [1]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cuda",
    tokenizer_kwargs={"model_max_length": 128, "padding": "longest"},
    trust_remote_code=True,
)


Use `SentenceTransformer.encode()` to compute embeddings:

In [2]:
embeddings = embedding_model.encode(
    sentences=[
        "Two households, both alike in dignity",
        "In fair Verona, where we lay our scene",
        "From ancient grudge break to new mutiny",
        "Where civil blood makes civil hands unclean."
    ],
    convert_to_numpy=True,
    # convert_to_tensor=True,
    normalize_embeddings=False,
)
print("output shape:", len(embeddings), len(embeddings[0]))
print("type:", type(embeddings))

output shape: 4 384
type: <class 'numpy.ndarray'>


In [3]:
embeddings[0][:10]

array([-0.04559452,  0.04544348, -0.06216716, -0.04529314, -0.03638914,
       -0.01896073, -0.01647556,  0.022437  , -0.04695849, -0.05876662],
      dtype=float32)
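
Embeddings like these are usually compared with cosine similarity. The sketch below illustrates the computation on small made-up vectors rather than real model outputs (an assumption made so that it runs without downloading a model); `cosine_similarity_matrix` is a hypothetical helper name, not part of `sentence_transformers`.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of `a` and rows of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy 3-dimensional "embeddings" (illustrative values, not model outputs).
toy_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [2.0, 0.0, 0.0],  # same direction as row 0, different length
        [0.0, 3.0, 0.0],  # orthogonal to rows 0 and 1
    ]
)

similarities = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(similarities.round(2))
```

With real data, one would pass the array returned by `encode()`. When `normalize_embeddings=True` is set, the rows already have unit length, so a plain dot product gives the cosine similarity.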

Let us inspect some real data.

> Dataset composed of online banking queries annotated with their corresponding intents.
>
> BANKING77 dataset provides a very fine-grained set of intents in a banking domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection.

Original paper: https://arxiv.org/abs/2003.04807v1

In [4]:
from datasets import load_dataset

banking77 = load_dataset("PolyAI/banking77", trust_remote_code=True)
banking77


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3080
    })
})

In [5]:
from pprint import pprint

pprint(banking77["train"][0])

{'label': 11, 'text': 'I am still waiting on my card?'}


In [6]:
print("n_classes:", len(banking77["train"].unique("label")))

n_classes: 77


One can use batch encoding:

In [7]:
banking77_embeddings = embedding_model.encode(
    banking77["train"]["text"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
    show_progress_bar=True,
)


Let us visualize the obtained embeddings as t-SNE projections.

In [8]:
from sklearn.manifold import TSNE

embeddings_projected = TSNE().fit_transform(banking77_embeddings)

In [9]:
import numpy as np
from numpy.typing import NDArray
import plotly.express as px
import pandas as pd

def visualize(embeddings_projected: NDArray[np.float64], texts: list[str], class_labels: list[int]):
    projected = pd.DataFrame({"x": embeddings_projected[:, 0], "y": embeddings_projected[:, 1]})
    projected['class_label'] = class_labels
    projected['text'] = texts

    fig = px.scatter(
        projected, x="x", y="y", color="class_label",
        hover_data={'class_label': True, 'text': True, "x": False, "y": False},
        width=600,
        height=600,
    )

    fig.update(layout_coloraxis_showscale=False)
    fig.show()

In [10]:
visualize(
    embeddings_projected,
    texts=banking77["train"]["text"],
    class_labels=banking77["train"]["label"]
)

We see that the embedding distance reflects the semantic distance between texts from different classes.

Here's some nice pop-science content on embedding space analysis: https://youtu.be/Jesv24I9bXM?si=OcWydn5B6oMH5s1t

## Text Retrieval

Let us build a prototype of a search engine for a question answering system. It finds text passages that are relevant to a given query.

> Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.
>
> The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. Since then we released a 1,000,000 question dataset, a natural language generation dataset, a passage ranking dataset, keyphrase extraction dataset, crawling dataset, and a conversational search.

Original paper: https://arxiv.org/abs/1611.09268

In [11]:
from datasets import load_dataset

dataset_name = "microsoft/ms_marco"

ms_marco = load_dataset(dataset_name, "v1.1", trust_remote_code=True)
ms_marco


DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})

We will use a lightweight model with 12 layers and a hidden size of 384 for retrieval. It uses prefixes to perform asymmetric search: queries and passages are marked with different prefixes before encoding.

Original paper: https://arxiv.org/pdf/2212.03533

In [12]:
from sentence_transformers import SentenceTransformer

retrieval_embedder = SentenceTransformer(
    "intfloat/e5-small",
    prompts={
        "psg": "passage: ",
        "qry": "query: ",
    },
)
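
Conceptually, a prompt in `sentence_transformers` is a string that is prepended to the input text before encoding. The sketch below mimics this with plain string concatenation; `add_prefix` is a hypothetical helper written for illustration, and the example texts are made up.

```python
# Prefixes used for E5 models, same as in the `prompts` argument above.
PROMPTS = {"psg": "passage: ", "qry": "query: "}


def add_prefix(texts: list[str], prompt_name: str) -> list[str]:
    """Prepend the prefix for the given role to every text (hypothetical helper)."""
    prefix = PROMPTS[prompt_name]
    return [prefix + text for text in texts]


prefixed_query = add_prefix(["what is faiss?"], "qry")
prefixed_passage = add_prefix(["Faiss is a library for similarity search."], "psg")
print(prefixed_query)
print(prefixed_passage)
```

The same text therefore gets a different embedding depending on whether it is encoded as a query or as a passage.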


We will store the embeddings in a Faiss index.

Read more on faiss: https://github.com/facebookresearch/faiss

All the indexes: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

In [13]:
import faiss

# one can use an approximate search with `IndexHNSWFlat`
vector_index = faiss.IndexFlatIP(retrieval_embedder.get_sentence_embedding_dimension())
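
`IndexFlatIP` performs exact search by inner product. To make clear what that means, here is a brute-force NumPy sketch of the same idea on random toy vectors. It is an illustration only, not how Faiss is implemented, and `flat_ip_search` is a hypothetical name. The return order (scores first, then ids) follows the convention of the `search` method of a Faiss index.

```python
import numpy as np


def flat_ip_search(database: np.ndarray, queries: np.ndarray, k: int):
    """Exact top-k search by inner product, done by brute force."""
    scores = queries @ database.T                 # (n_queries, n_database)
    top_ids = np.argsort(-scores, axis=1)[:, :k]  # best scores first
    top_scores = np.take_along_axis(scores, top_ids, axis=1)
    return top_scores, top_ids


rng = np.random.default_rng(0)
database = rng.normal(size=(100, 8)).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)  # unit-length rows

# Query with a vector that is already in the database: it should find itself.
queries = database[[42]]
scores, ids = flat_ip_search(database, queries, k=3)
print(ids[0], scores[0])
```

Because the rows have unit length, the inner product equals the cosine similarity, which is why the passages are encoded with `normalize_embeddings=True` before being added to the index.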

To build a vector index of texts, we first need to extract all the passages. They make up our knowledge base.

In [14]:
import itertools as it
subset = ms_marco["validation"]

all_passages = subset.map(
    function=lambda batch: {"passages": list(it.chain.from_iterable([psg_list["passage_text"] for psg_list in batch]))},
    batched=True,
    batch_size=32,
    input_columns="passages",
    remove_columns=ms_marco["validation"].column_names,
)
all_passages


Dataset({
    features: ['passages'],
    num_rows: 82360
})

In [15]:
all_passages.save_to_disk("passages")


Second, we need to embed them and add them to the vector index.

In [16]:
import math

from tqdm.notebook import tqdm_notebook as tqdm


batch_size = 32
n_batches = math.ceil(len(all_passages) / batch_size)  # count the last, partial batch too
for batch in tqdm(all_passages.iter(batch_size=batch_size), total=n_batches):
    batch_embeddings = retrieval_embedder.encode(
        batch["passages"],
        batch_size=batch_size,
        normalize_embeddings=True,
        prompt_name="psg"
    )
    vector_index.add(batch_embeddings)


Avoid recalculating the knowledge base embeddings: save them to the file system.

In [17]:
filename = "e5_small_vector_index.faiss"

# faiss.write_index(vector_index, filename)
# vector_index = faiss.read_index(filename)
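
A common way to organize this is a "load if cached, otherwise compute and save" helper. The sketch below shows the pattern with NumPy arrays and a temporary directory so that it runs anywhere; `load_or_compute` and `expensive_embeddings` are hypothetical names. For a Faiss index, the same pattern would use the `faiss.write_index` and `faiss.read_index` calls shown in the commented lines above.

```python
import tempfile
from pathlib import Path
from typing import Callable

import numpy as np


def load_or_compute(path: Path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load an array from `path` if it exists, otherwise compute and save it."""
    if path.exists():
        return np.load(path)
    array = compute()
    np.save(path, array)
    return array


call_log: list[str] = []  # records every time the expensive function runs


def expensive_embeddings() -> np.ndarray:
    """Stand-in for a slow embedding pass over the knowledge base."""
    call_log.append("computed")
    return np.ones((4, 3), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = Path(tmp_dir) / "embeddings.npy"
    first = load_or_compute(cache_path, expensive_embeddings)   # computes and saves
    second = load_or_compute(cache_path, expensive_embeddings)  # loads from disk

print("compute calls:", len(call_log))
```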

A simple wrapper for the retrieval logic of our search engine.

In [18]:
import numpy as np
from numpy.typing import NDArray
from datasets import Dataset


class Retriever:
    def __init__(self, embedder: SentenceTransformer, index: faiss.Index, passages: Dataset):
        self.embedder = embedder
        self.index = index
        self.passages = passages

    def __call__(self, queries: list[str], k: int):
        # Normalize query embeddings, as was done for the passages, so that
        # the inner product returned by the index is a cosine similarity.
        query_embedding = self.embedder.encode(queries, prompt_name="qry", normalize_embeddings=True)
        return self._search_by_embedding(query_embedding, k)

    def _search_by_embedding(self, embedding: NDArray[np.float32], k: int):
        cos_sim, indices = self.index.search(embedding, k)

        results = []
        for inds, dists in zip(indices, cos_sim, strict=True):
            cur_res = []
            for ind, dist in zip(inds, dists, strict=True):
                # Index a single row instead of materializing the whole column.
                cur_res.append({"id": ind, "cosine": dist, "passage": self.passages[int(ind)]["passages"]})
            results.append(cur_res)

        return results

In [19]:
retriever = Retriever(retrieval_embedder, vector_index, all_passages)

Let's see what it retrieves:

In [20]:
from pprint import pprint

pprint(retriever(queries=["woman"], k=5))

[[{'cosine': 0.85627615,
   'id': 41961,
   'passage': 'A trans woman (sometimes trans-woman or transwoman) is a '
              'transgender person who was assigned male at birth but whose '
              'gender identity is that of a woman. The label of transgender '
              'woman is not always interchangeable with that of transsexual '
              'woman, although the two labels are often used in this way.'},
  {'cosine': 0.85056126,
   'id': 41960,
   'passage': 'A trans woman with XY written on her hand, at a protest in '
              'Paris, October 1, 2005. A transwoman (sometimes spelled as '
              'trans-woman or trans woman) is a male-to-female (MTF) '
              'transsexual or transgender person. Many people in this group '
              'like the name trans woman over the many medical terms that are '
              'out there. Other non-medical names are t-girl, tg-girl and '
              'ts-girl. Transgender, though, is the more common name.'},
  {'

In [21]:
pprint(retriever(queries=["what's the average age of a woman?"], k=5))

[[{'cosine': 0.9043233,
   'id': 2818,
   'passage': 'the average age for americans getting married has reached a '
              'historic high 27 for women and 29 for men a jump from the 1990 '
              'average marrying age of 23 for women and 26 for men '},
  {'cosine': 0.8998129,
   'id': 42369,
   'passage': 'Relevance. Rating Newest Oldest. Best Answer: Overall: 78.06 '
              'years Male: 75.15 years Female: 80.97 years These values are '
              'for the U.S., please check the source for other countries. Not '
              'many people live to 88 years, and 88 years is well over the '
              'average span of 67years for the world.'},
  {'cosine': 0.8963249,
   'id': 2819,
   'passage': 'follow comments the average age at which a woman gets married '
              'for the first time climbed from 29 9 years in 2008 to 30 years '
              'in 2009 figures published by the office for national statistics '
              'said this is the first time t

What happens if we encode a query with the wrong prefix (here, the passage prefix instead of the query prefix)?

In [22]:
wrong_embedding = retrieval_embedder.encode(["what's the average age of a woman?"], prompt_name="psg")
pprint(retriever._search_by_embedding(wrong_embedding, k=5)[0])

[{'cosine': 0.90344846,
  'id': 2818,
  'passage': 'the average age for americans getting married has reached a '
             'historic high 27 for women and 29 for men a jump from the 1990 '
             'average marrying age of 23 for women and 26 for men '},
 {'cosine': 0.90087724,
  'id': 42369,
  'passage': 'Relevance. Rating Newest Oldest. Best Answer: Overall: 78.06 '
             'years Male: 75.15 years Female: 80.97 years These values are for '
             'the U.S., please check the source for other countries. Not many '
             'people live to 88 years, and 88 years is well over the average '
             'span of 67years for the world.'},
 {'cosine': 0.89690053,
  'id': 42368,
  'passage': "86 for a man and 100 for a woman. Well, according to the CIA's "
             'The World Factbook, the current average human lifespan is '
             'approximately 66 and a half years. On average the current life '
             'expectancy for the world is about 67.2 years. (s
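
One simple way to quantify how much the prefix changes the results is to measure the overlap between the two top-k lists. The sketch below uses only the first three ids visible in the two outputs above; `overlap_at_k` is a hypothetical helper.

```python
def overlap_at_k(ids_a: list[int], ids_b: list[int], k: int) -> float:
    """Share of ids that appear in the top-k of both result lists."""
    top_a, top_b = set(ids_a[:k]), set(ids_b[:k])
    return len(top_a & top_b) / k


# First three ids visible in the two outputs above.
with_query_prefix = [2818, 42369, 2819]
with_passage_prefix = [2818, 42369, 42368]

overlap = overlap_at_k(with_query_prefix, with_passage_prefix, k=3)
print(overlap)
```

Two of the three ids coincide here, so for this query the wrong prefix shifts the ranking only slightly. A single query is not enough to draw a general conclusion.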

## Text Ranking

Once you have retrieved a decent set of candidates, the next stage is to rank them and keep a small top-k.

Ranking is usually performed in a cross-encoder style. All the needed functionality is available in sentence transformers as the `CrossEncoder` class.

One can read more on this: https://sbert.net/examples/applications/retrieve_rerank/README.html

![](figures/retrieve-rank.png)
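
The two-stage idea can be sketched without any models. In the toy example below, both scoring functions are deliberately simple stand-ins (word overlap instead of a bi-encoder, Jaccard similarity instead of a cross-encoder), and all names are hypothetical. The point is only the control flow: a cheap score selects candidates, and a more expensive score re-orders just those candidates.

```python
def cheap_score(query: str, passage: str) -> int:
    """Stage 1 stand-in: number of shared lowercase words (fast, crude)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def expensive_score(query: str, passage: str) -> float:
    """Stage 2 stand-in: Jaccard similarity of word sets (pretend it is slow)."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)


def retrieve_then_rank(query: str, passages: list[str], n_candidates: int, n_results: int) -> list[str]:
    # Stage 1: keep a generous candidate set using the cheap score.
    candidates = sorted(passages, key=lambda p: cheap_score(query, p), reverse=True)[:n_candidates]
    # Stage 2: re-order only the candidates with the expensive score.
    return sorted(candidates, key=lambda p: expensive_score(query, p), reverse=True)[:n_results]


passages = [
    "a man is eating food",
    "a man is eating a piece of bread in the kitchen with his friends",
    "a woman is playing violin",
    "a cheetah is running",
]
top = retrieve_then_rank("a man is eating pasta", passages, n_candidates=3, n_results=1)
print(top)
```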

In [23]:
from sentence_transformers import CrossEncoder

# model = CrossEncoder("Alibaba-NLP/gte-multilingual-reranker-base", trust_remote_code=True)
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-2-v2", trust_remote_code=True)


It has a convenient `rank` method.

In [24]:
from torch.nn import Sigmoid
from pprint import pprint

query = "A man is eating pasta."

# With all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# 1. We rank all sentences in the corpus for the query
ranks = ranker.rank(query, corpus, top_k=3, activation_fct=Sigmoid())
pprint(ranks)

[{'corpus_id': 0, 'score': 0.9930165},
 {'corpus_id': 1, 'score': 0.9320454},
 {'corpus_id': 3, 'score': 0.00069369795}]


Example with FlagEmbedding: https://huggingface.co/BAAI/bge-reranker-v2-m3

Now we are ready to build the `SearchEngine` class.

In [25]:
class SearchEngine:
    def __init__(self, retriever: Retriever, ranker: CrossEncoder, n_candidates: int = 50):
        self.retriever = retriever
        self.ranker = ranker
        self.n_candidates = n_candidates

    def __call__(self, queries: list[str], n_results: int):
        candidates = self.retriever(queries=queries, k=self.n_candidates)
        res = []
        for query, cands in zip(queries, candidates, strict=True):
            cand_passages = [cnd["passage"] for cnd in cands]
            ranked_res = self.ranker.rank(query, cand_passages, top_k=n_results, return_documents=True)
            res.append([x["text"] for x in ranked_res])
        return res

In [26]:
searcher = SearchEngine(retriever, ranker)

Let's see what it is capable of!

In [27]:
def pretty_print(feed: list[str]):
    for i, content in enumerate(feed):
        print(f"Page #{i}")
        pprint(content)

In [28]:
pretty_print(searcher(queries=["whats the average age of a woman?"], n_results=3)[0])


Page #0
('Middle age technically refers to reaching an age where you have lived half '
 'the average life expectancy for your gender. The average midpoint of life is '
 'now about 40 for women and 38 for men (men tend to die 6 to 8 years before '
 'women). In comparison, 100 years ago women arrived at middle age by 22, '
 'mainly because so many died in childbirth.')
Page #1
('the average age for americans getting married has reached a historic high 27 '
 'for women and 29 for men a jump from the 1990 average marrying age of 23 for '
 'women and 26 for men ')
Page #2
('marriage in the colonies the average age of a women who married for the '
 'first time rose steadily although not sharply from 1800 to 1900 north '
 'american colonists tended to get married early due to several factors the '
 'first and perhaps most important was simply that they could in 1890 when the '
 'u s census bureau started collecting marriage data it was recorded that the '
 'average age of a first marriage for

In [29]:
pretty_print(searcher(queries=["Who's Mendeleev?"], n_results=3)[0])

Page #0
('Confidence votes 1.1K. .In 1869 the Russian chemistry professor Dmitri '
 'Ivanovich Mendeleev and four months later the German Julius Lothar Meyer '
 'independently developed the first periodic table, arranging the elements by '
 'mass. its invention though is generally credited to Russian chemist Dmitri '
 'Mendeleev. Although Dmitri Mendeleev is often considered the father of the '
 'periodic table, the work of many scientists contributed to its present '
 'form.   In the Beginning   A necessary prerequisite to the construction of '
 'the periodic table was the discovery of the individual elements.')
Page #1
('Chemist for uranium nuclear fuels. SCIENTISTS WHO CONTRIBUTED to the '
 'DEVELOPMENT OF PERIODIC TABLE ARE DMITRI MENDELEEV, JOHN DALTON, Johann '
 'Dobereiner, John Newlands, Julius Lothar Meyer, etc. Dmitri Mendeleev Dmitri '
 'Mendeleev Dmitri Mendeleev is the scientists that worked with decks of '
 'cards  to decelop the arrangement of elements on the periodic ta

In [30]:
pretty_print(searcher(queries=["Who's Freddy Mercury?"], n_results=3)[0])

Page #0
('Sir Donald George Bradman (August 27, 1908 - February 25, 2001) was an '
 'Australian cricket player who is universally regarded as the greatest '
 "cricket player of all time, and one of Australia's greatest popular heroes. ")
Page #1
('George Lucas. 4,466 pages on this wiki. Lucas (right) and Kathleen Kennedy '
 'on the set of Indiana Jones and the Kingdom of the Crystal Skull. George '
 'Walton Lucas, Jr. (born May 14, 1944) is an American film director, '
 'producer, and screenwriter famous for his epic Star Wars saga and the '
 'Indiana Jones tetralogy. Jr. was born in Modesto, California. His father, '
 'George Walton Lucas, Sr., ran a stationery store and owned a small walnut '
 'orchard and was mainly of British and Swiss heritage.')
Page #2
('Dove Cameron. Maleficent Bertha Mal is the main protagonist of the Disney '
 'Channel Original Movie Descendants. She is the daughter of Maleficent, who '
 "is one of Disney's biggest villains, who she wants to grow up to be lik

Do you see the difference between these cases? Does every output answer the stated question?
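
One possible way to handle queries for which the knowledge base has no good passage is to drop candidates whose ranker score falls below a threshold. The sketch below is only an illustration: `filter_by_score` is a hypothetical helper, the scores are copied from the sigmoid-activated `rank` output shown earlier, and the threshold of 0.5 is an arbitrary value that would need tuning on real data.

```python
def filter_by_score(ranked: list[dict], threshold: float) -> list[dict]:
    """Keep only ranked items whose score reaches the threshold."""
    return [item for item in ranked if item["score"] >= threshold]


# Scores copied from the `rank` output shown earlier (sigmoid-activated).
ranked = [
    {"corpus_id": 0, "score": 0.9930165},
    {"corpus_id": 1, "score": 0.9320454},
    {"corpus_id": 3, "score": 0.00069369795},
]

kept = filter_by_score(ranked, threshold=0.5)  # 0.5 is an arbitrary illustrative value
kept_ids = [item["corpus_id"] for item in kept]
print(kept_ids)
```

Note that `SearchEngine` above calls `rank` without an activation function, so its scores are not confined to the (0, 1) range. A threshold would have to be chosen on that scale, or the sigmoid activation would have to be passed there as well.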

## LLM

We will use:
- the openly available Qwen model
- the vLLM library to run Qwen

In [31]:
! pip install triton



In [32]:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen2.5-1.5B-Instruct-AWQ"

# Initialize the tokenizer of the same model that vLLM will run
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Decoding hyperparameters taken from the defaults of Qwen2.5-7B-Instruct;
# max_tokens is the maximum length of the generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Model name or path. Can be a GPTQ or AWQ model.
llm = LLM(model=model_name, quantization="awq")


In [33]:
def from_messages(messages):
    # Render the chat messages into a single prompt string
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Generate a completion and return the text of the first (and only) output
    outputs = llm.generate([text], sampling_params)
    return outputs[0].outputs[0].text
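
To give an intuition for what `apply_chat_template` produces, here is a simplified ChatML-style rendering written by hand. It is an approximation for illustration only: the real template is stored with the tokenizer and handles more cases (for example, default system messages and tool calls), and `render_chatml` is a hypothetical name.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Simplified ChatML-style rendering (an approximation, not the real template)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # An open assistant turn tells the model that it should answer next.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "tell me a joke"},
]
rendered = render_chatml(messages)
print(rendered)
```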

In [34]:
def get_answer(input_txt):
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": input_txt}
    ]
    return from_messages(messages)

In [35]:
get_answer("tell me a joke")



"Sure! Here's a joke for you:\n\nWhy did the tomato turn red?\n\nBecause it saw the salad dressing!"

## RAG

We want to build a question answering system.

![](figures/rag-completion.png)

We will use the following prompt template:

In [36]:
from pprint import pprint
from copy import deepcopy

template = [
{"role": "system",
  "content": "You're a helpful assistant for question answering. You are provided with question and some chunks of texts, related to the question. Your task is to give a concise and correct answer to the question.",
},
{"role": "user",
  "content": """
Question:
{question}

Related chunks:
{chunks}
"""
}
]

def construct_messages(question: str, chunks: list[str]) -> list[dict[str, str]]:
    chunks = "\n\n".join(chunks)
    prompt = deepcopy(template)
    for msg in prompt:
        msg["content"] = msg["content"].format(question=question, chunks=chunks)
    return prompt

pprint(construct_messages("Who's Santa Claus?", ["nice old man", "wears red"]))

[{'content': "You're a helpful assistant for question answering. You are "
             'provided with question and some chunks of texts, related to the '
             'question. Your task is to give a concise and correct answer to '
             'the question.',
  'role': 'system'},
 {'content': '\n'
             'Question:\n'
             "Who's Santa Claus?\n"
             '\n'
             'Related chunks:\n'
             'nice old man\n'
             '\n'
             'wears red\n',
  'role': 'user'}]
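
Retrieved chunks can be long, and the prompt has to fit into the context window of the model. A minimal sketch of one way to cap the prompt size is shown below. It counts characters rather than tokens, which is a simplification; `select_chunks` is a hypothetical helper, and the third chunk is made up for illustration.

```python
def select_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Take chunks in ranked order until the character budget is used up."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


chunks = ["nice old man", "wears red", "lives at the North Pole"]
selected = select_chunks(chunks, max_chars=25)
print(selected)
```

A token-based budget computed with the tokenizer of the model would be more precise than a character count.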


Finally, all together:

In [38]:
class RAGPipeline:
    def __init__(self, searcher: SearchEngine):
        self.searcher = searcher

    def __call__(self, question: str, n_chunks: int = 3):
        chunks = self.searcher([question], n_chunks)[0]
        msg = construct_messages(question, chunks)
        answer = from_messages(msg)
        return answer

In [39]:
rag = RAGPipeline(searcher)

In [41]:
pprint(rag("whats the average age of a woman?"))


'40'





In [42]:
pprint(rag("Who's Mendeleev?"))


'Dmitri Mendeleev'





In [43]:
pprint(rag("Who's Freddy Mercury?"))


('Freddy Mercury was an English singer-songwriter, guitarist, and actor, known '
 'as the lead vocalist of the rock band Queen.')





# Hugging Face

## Text Classification

In [1]:
from transformers import pipeline

classifier = pipeline(task="sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")



In [2]:
classifier("This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.")

[{'label': 'positive', 'score': 0.9836366772651672}]

In [3]:
classifier("i think this is lame")

[{'label': 'negative', 'score': 0.7983795404434204}]
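
The `score` field is the probability the model assigns to the predicted label. For a single-label classifier such as this one, the pipeline by default obtains it by applying softmax to the raw logits and keeping the largest value. The sketch below reproduces that step in plain Python. The logits are hypothetical, and the label order in `ID2LABEL` is an assumption that should be checked against `model.config.id2label`.

```python
import math

# Assumed label order for this checkpoint; verify with model.config.id2label.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label=ID2LABEL):
    """Turn raw logits into the {'label', 'score'} dict the pipeline returns."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits, not produced by the real model.
example_logits = [-1.5, 0.2, 3.1]
prediction = to_prediction(example_logits)
print(prediction)
```

A score close to 1 (as in the first review) means the other classes received almost no probability mass. The lower score for "i think this is lame" means the model spread more probability over the remaining labels.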

## Named Entity Recognition

In [7]:
from transformers import pipeline

classifier = pipeline("ner", model="dslim/bert-base-NER")


In [8]:
classifier("The Golden State Warriors are an American professional basketball team based in San Francisco.")

[{'entity': 'B-ORG',
  'score': 0.99823076,
  'index': 2,
  'word': 'Golden',
  'start': 4,
  'end': 10},
 {'entity': 'I-ORG',
  'score': 0.9988481,
  'index': 3,
  'word': 'State',
  'start': 11,
  'end': 16},
 {'entity': 'I-ORG',
  'score': 0.9988463,
  'index': 4,
  'word': 'Warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-MISC',
  'score': 0.9994301,
  'index': 7,
  'word': 'American',
  'start': 33,
  'end': 41},
 {'entity': 'B-LOC',
  'score': 0.9988174,
  'index': 13,
  'word': 'San',
  'start': 80,
  'end': 83},
 {'entity': 'I-LOC',
  'score': 0.99920326,
  'index': 14,
  'word': 'Francisco',
  'start': 84,
  'end': 93}]
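
The output uses the BIO scheme: `B-` marks the first token of an entity and `I-` marks its continuation, so "Golden", "State" and "Warriors" are three tokens of one organization. A minimal sketch of merging such tokens into whole entities using the character offsets is given below. `group_entities` is a hypothetical helper, and the token list is copied from the output above with the `score`, `index` and `word` fields dropped for brevity.

```python
def group_entities(tokens, text):
    """Merge consecutive B-/I- tagged tokens into entity spans."""
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if prefix == "I" and spans and spans[-1]["label"] == label:
            # Continuation of the previous entity: extend its right boundary.
            spans[-1]["end"] = tok["end"]
        else:
            # "B-" tag (or a stray "I-" tag) starts a new entity.
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [{"label": s["label"], "text": text[s["start"]:s["end"]]} for s in spans]


text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
tokens = [
    {"entity": "B-ORG", "start": 4, "end": 10},
    {"entity": "I-ORG", "start": 11, "end": 16},
    {"entity": "I-ORG", "start": 17, "end": 25},
    {"entity": "B-MISC", "start": 33, "end": 41},
    {"entity": "B-LOC", "start": 80, "end": 83},
    {"entity": "I-LOC", "start": 84, "end": 93},
]

entities = group_entities(tokens, text)
for entity in entities:
    print(entity["label"], "->", entity["text"])
```

The pipeline can do similar grouping itself if it is created with `aggregation_strategy="simple"`. The manual version is shown only to make the tagging scheme explicit.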

## Generation

In [9]:
from transformers import pipeline

generator = pipeline("text-generation", model="openai-community/gpt2")


In [10]:
generator("Somatic hypermutation allows the immune system to")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Somatic hypermutation allows the immune system to make antibodies against various proteins, and the immune system may make antibody-encoded molecules. Moreover, it is likely that the antibodies can bind to certain genes and interact with them as well as with an'}]
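
Note that `generated_text` contains the prompt followed by the continuation. A minimal sketch of separating the two is given below. `strip_prompt` is a hypothetical helper, and the sample output is an abbreviated copy of the result above.

```python
def strip_prompt(prompt, outputs):
    """Return only the newly generated part of each pipeline output."""
    continuations = []
    for out in outputs:
        full_text = out["generated_text"]
        if full_text.startswith(prompt):
            continuations.append(full_text[len(prompt):])
        else:
            # Leave the text untouched if the prompt is not a literal prefix.
            continuations.append(full_text)
    return continuations


prompt = "Somatic hypermutation allows the immune system to"
# Abbreviated copy of the generator output shown above.
outputs = [{"generated_text": prompt + " make antibodies against various proteins"}]

continuations = strip_prompt(prompt, outputs)
print(continuations[0])
```

To my knowledge, the text-generation pipeline can also do this directly: calling `generator(prompt, return_full_text=False, max_new_tokens=40)` should return only the continuation and cap its length. The value `40` is an arbitrary choice for illustration.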

## Translation

In [11]:
from transformers import pipeline

translator = pipeline("translation_en_to_ru", model="Helsinki-NLP/opus-mt-en-ru")


In [12]:
translator("we butter the bread with butter")

[{'translation_text': 'Мы смазываем хлеб маслом.'}]