Below is the main part of the pipeline for a RAG system. All required imports are included, so the cells can simply be run one after another. For everything to execute correctly, add an `"HF_TOKEN"` secret to Colab; you can create the token in your [🤗 Hugging Face](https://huggingface.co/) account.
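A minimal sketch of reading that secret from inside the notebook, assuming it is named `HF_TOKEN` as above. `get_hf_token` is a hypothetical helper, not part of the original pipeline; outside Colab it falls back to an environment variable of the same name.

```python
import os


def get_hf_token(name: str = "HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment elsewhere."""
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the token lives in an environment variable.
        return os.environ.get(name)
    return userdata.get(name)
```

The returned value can then be passed wherever a Hugging Face token is expected.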

In the data parsing section you do not have to rerun every cell: you can load the previously collected data instead (the API has usage limits). All the required files are in the `Задание 3/data/` folder.

The notebook was run on Colab's T4 GPU.

# RAG over StackOverflow QA with LangChain

Let's try to build a RAG system on top of StackOverflow questions and answers (so that it at least somewhat matches the pitch from the second assignment). To do that, we fetch the questions and their accepted answers through the StackExchange API.

Approximate pipeline diagram:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png" alt="RAG diagram" width="500" height="500"/>


In [1]:
!pip install -q torch \
                transformers \
                accelerate \
                bitsandbytes \
                sentence-transformers \
                faiss-gpu

In [2]:
# for some reason, the default encoding is not UTF-8
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [3]:
!pip install -q langchain

## Parsing questions and answers


In [None]:
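import requests
from tqdm import tqdm


base_url = "https://api.stackexchange.com/2.3/questions"

ans = []  # one list of question items per fetched page
params = {
    "order": "desc",
    "sort": "activity",
    "tagged": "python",
    "site": "stackoverflow",
    "filter": "withbody",  # include the question body in the response
    "pagesize": 100,
    "page": 1,
}

for i in tqdm(range(1, 295), total=294):
    params["page"] = i

    response = requests.get(base_url, params=params)
    data = response.json()

    # Error responses (e.g. quota exceeded) carry no "items" key; skip them.
    if "items" not in data:
        continue
    ans.append(data["items"])

    # Stop once the API reports that there are no further pages.
    if not data.get("has_more", False):
        break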



100%|██████████| 294/294 [01:52<00:00,  2.62it/s]
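Because the API is rate-limited, a slightly more defensive version of the paging loop can help. The sketch below is an illustration, not the code that produced the data: `fetch_all_pages` and `get_page` are hypothetical names, and unlike the cell above it returns one flat list of items. It relies on the `items`, `has_more` and `backoff` fields of the StackExchange response wrapper.

```python
import time


def fetch_all_pages(get_page, max_pages, sleep=time.sleep):
    """Collect items page by page, honouring `backoff` and `has_more`.

    `get_page(page)` is a hypothetical callable that returns the decoded
    JSON wrapper for one page (for example a thin wrapper over requests.get).
    """
    items = []
    for page in range(1, max_pages + 1):
        data = get_page(page)
        if "items" not in data:
            # Error payloads carry `error_message` instead of `items`.
            break
        items.extend(data["items"])
        if data.get("backoff"):
            # The API asks the client to wait this many seconds.
            sleep(data["backoff"])
        if not data.get("has_more", False):
            break
    return items
```

Passing the fetcher in as a callable keeps the paging logic testable without network access.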


In [None]:
all_qestions = []

for page_items in tqdm(ans, total=len(ans)):
    for quest in page_items:
        try:
            body = quest['body']
            is_ans = quest['is_answered']
            link = quest['link']
            quest_id = quest['question_id']
            tags = quest['tags']
            title = quest['title']

            # Only questions with an accepted answer have this key.
            accepted_answer_id = quest['accepted_answer_id']

        except KeyError:
            # Skip questions that lack any of the required fields.
            continue

        d = {
            'body_q': body,  # named body_q so it does not clash with the answer body
            'is_answered': is_ans,
            'link': link,
            'question_id': quest_id,
            'tags': tags,
            'title': title,
            'accepted_answer_id': accepted_answer_id
        }

        all_qestions.append(d)



In [None]:
import pandas as pd


questions = pd.DataFrame(all_qestions)

In [None]:
questions.to_csv('questions_150.csv', index=False)

In [None]:
questions

Unnamed: 0,body_q,is_answered,link,question_id,tags,title,accepted_answer_id
0,<p>Heres two tables -</p>\n<pre><code>Employee...,True,https://stackoverflow.com/questions/78015379/d...,78015379,"[python, python-3.x, pandas, dataframe]",DataFrame.groupby.rank producing wrong results?,78015801
1,<p>I want to create a python script that decod...,True,https://stackoverflow.com/questions/77991471/p...,77991471,"[python, ffmpeg, raspberry-pi, sdl-2, ffmpeg-p...",Play a video with ffmpeg and SDL2 on a Raspber...,78010424
2,<p><strong>Problem</strong></p>\n<p>Given an a...,True,https://stackoverflow.com/questions/78015333/w...,78015333,"[python, algorithm, logic, sliding-window]",what is going wrong in Minimum size subarray s...,78015521
3,<p>VsCode provided venv creation feature. I tr...,True,https://stackoverflow.com/questions/77993374/c...,77993374,"[python, visual-studio-code, python-venv]",cant import code modules with vscode venv feature,77994083
4,<p>I am trying to perform cell-by-cell operati...,True,https://stackoverflow.com/questions/78014906/e...,78014906,"[python, excel]",Excel Cell Operation using Python,78015271
...,...,...,...,...,...,...,...
145,<p>I'm trying to setup Visual Studio Code for ...,True,https://stackoverflow.com/questions/40185437/n...,40185437,"[python, pandas, numpy, visual-studio-code]",No module named &#39;numpy&#39;: Visual Studio...,40186317
146,"<p>I want to click an iframe's radio button, b...",True,https://stackoverflow.com/questions/39427156/h...,39427156,"[python, selenium, firefox]",how to click iframe using python selenium,39429238
147,<p>I'm trying to understand why I'm getting th...,True,https://stackoverflow.com/questions/24856643/u...,24856643,"[python, datetime, timezone, pytz]",unexpected results converting timezones in python,24856814
148,<p>Trying to create simple login functionality...,True,https://stackoverflow.com/questions/14783344/i...,14783344,"[python, django]",ImportError no module named accounts,14785167


In [None]:
ids = questions.accepted_answer_id.tolist()

In [None]:
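ans_itms = []

# The answers endpoint gets its own parameters: reusing the question-listing
# `params` (with "tagged" and a stale "page" value) would return empty results.
# NOTE: `body_markdown` is returned only if the chosen filter includes that field.
answer_params = {"site": "stackoverflow", "filter": "withbody"}

for answer_id in tqdm(ids, total=len(ids)):
    base_url = f"https://api.stackexchange.com/2.3/answers/{answer_id}"

    response = requests.get(base_url, params=answer_params)
    data = response.json()

    # Each response wraps at most one answer; error payloads have no "items".
    ans_itms.extend(data.get('items', []))

In [None]:
all_answers = []

for answer in ans_itms:
    question_id = answer['question_id']
    body = answer['body']
    body_m = answer['body_markdown']

    d = {
        'body_a': body,  # named body_a so it does not clash with the question body
        'body_m': body_m,
        'question_id': question_id
    }
    all_answers.append(d)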

In [None]:
answers = pd.DataFrame(all_answers)

In [None]:
answers.to_csv('answers_150.csv', index=False)

In [None]:
answers

Unnamed: 0,question_id,body_a,body_m
0,78015379,<p>The only change you need is this:</p>\n<ul>...,The only change you need is this:\r\n* Change ...
1,77991471,<p>I figured out how to reduce the processor l...,I figured out how to reduce the processor load...
2,78015333,"<p>Your thinking is good, but the loop will en...","Your thinking is good, but the loop will end w..."
3,77993374,"<p>Well, there is 3 ways.</p>\n<ol>\n<li>Creat...","Well, there is 3 ways.\r\n1. Create `import_fi..."
4,78014906,<p>You can accomplish this using the <code>xlw...,You can accomplish this using the `xlwings` li...
...,...,...,...
145,40185437,<p>You may not have numpy installed on the ver...,You may not have numpy installed on the versio...
146,39427156,<p>(Assuming provided HTML is correct) actuall...,(Assuming provided HTML is correct) actually y...
147,24856643,"<p>From the partial documentation:\n<a href=""h...",From the partial documentation:\r\nhttp://pytz...
148,14783344,<p>Did you add the accounts to your settings.p...,Did you add the accounts to your settings.py?


In [None]:
qa_df = pd.merge(questions, answers, on='question_id', how='right')

In [None]:
qa_df.to_csv('QA_SO_150.csv', index=False)

## Final dataset of questions and answers

In [3]:
import pandas as pd

# to avoid rerunning the section above every time, load the previously collected data
# (upload the file to the current session's files first)
qa_df = pd.read_csv('QA_SO_150.csv')

In [4]:
qa_df.head()

Unnamed: 0,body_q,is_answered,link,question_id,tags,title,accepted_answer_id,body_a,body_m
0,<p>Heres two tables -</p>\n<pre><code>Employee...,True,https://stackoverflow.com/questions/78015379/d...,78015379,"['python', 'python-3.x', 'pandas', 'dataframe']",DataFrame.groupby.rank producing wrong results?,78015801,<p>The only change you need is this:</p>\n<ul>...,The only change you need is this:\r\n* Change ...
1,<p>I want to create a python script that decod...,True,https://stackoverflow.com/questions/77991471/p...,77991471,"['python', 'ffmpeg', 'raspberry-pi', 'sdl-2', ...",Play a video with ffmpeg and SDL2 on a Raspber...,78010424,<p>I figured out how to reduce the processor l...,I figured out how to reduce the processor load...
2,<p><strong>Problem</strong></p>\n<p>Given an a...,True,https://stackoverflow.com/questions/78015333/w...,78015333,"['python', 'algorithm', 'logic', 'sliding-wind...",what is going wrong in Minimum size subarray s...,78015521,"<p>Your thinking is good, but the loop will en...","Your thinking is good, but the loop will end w..."
3,<p>VsCode provided venv creation feature. I tr...,True,https://stackoverflow.com/questions/77993374/c...,77993374,"['python', 'visual-studio-code', 'python-venv']",cant import code modules with vscode venv feature,77994083,"<p>Well, there is 3 ways.</p>\n<ol>\n<li>Creat...","Well, there is 3 ways.\r\n1. Create `import_fi..."
4,<p>I am trying to perform cell-by-cell operati...,True,https://stackoverflow.com/questions/78014906/e...,78014906,"['python', 'excel']",Excel Cell Operation using Python,78015271,<p>You can accomplish this using the <code>xlw...,You can accomplish this using the `xlwings` li...


In [5]:
# See: https://python.langchain.com/docs/integrations/document_loaders/pandas_dataframe

from langchain_community.document_loaders import DataFrameLoader


loader = DataFrameLoader(qa_df, page_content_column="body_q")

docs = loader.load()

Split the questions into chunks.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# https://github.com/langchain-ai/langchain/discussions/3786
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=30, separators=[" ", ",", "\n"]
)

chunked_docs = text_splitter.split_documents(docs)
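To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch using fixed character windows. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally tries to cut at the given separators, and `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=30):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # The current window already reaches the end of the text.
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which can help retrieval when the relevant text straddles a boundary.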

In [17]:
# do not run for now; uncomment only if CUDA memory problems come up later

# del docs
# del qa_df
# del loader
# del text_splitter

## Computing embeddings

There were many attempts to run this in Colab, but most models failed with a CUDA OOM error. With more compute, embedding models can be picked from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), and the main LLM from the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

In [7]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

db = FAISS.from_documents(chunked_docs,
                        HuggingFaceEmbeddings(model_name='TaylorAI/bge-micro-v2'))
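Conceptually, the vector store maps every chunk to a vector and later returns the chunks whose vectors lie closest to the query vector. A brute-force sketch of that lookup, assuming Euclidean distance as the metric; `top_k` is a hypothetical helper, and a real index such as FAISS performs the same kind of search far more efficiently:

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec."""
    # Pair each distance with its index so that sorting keeps track of the source.
    dists = [(math.dist(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(dists)[:k]]
```

With real embeddings the vectors have hundreds of dimensions, but the selection logic is the same.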



## Preparing the LLM

In [21]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'Qwen/Qwen1.5-1.8B'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
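A rough back-of-the-envelope estimate of why 4-bit loading matters on a T4. This is a sketch under stated assumptions: the parameter count is read off the "1.8B" in the model name, and only weight storage is counted (no quantization constants, activations, KV cache, or layers that stay unquantized).

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory taken by the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1.8e9  # assumption: taken from the "1.8B" in the model name
fp16_gb = weight_memory_gb(n_params, 16)  # half-precision weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 3.6 GB to about 0.9 GB, which leaves more room for the embedding model and generation.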


## Setting up the full pipeline

In [22]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain.chains import LLMChain

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [23]:
from langchain.schema.runnable import RunnablePassthrough

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4},
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
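Conceptually, the chain above retrieves the `k` nearest chunks, substitutes them for `{context}` in the prompt, and sends the filled prompt to the LLM. A plain-Python sketch of the substitution step only, assuming a simplified template and plain strings instead of LangChain `Document` objects; `PROMPT` and `build_prompt` are hypothetical names:

```python
PROMPT = (
    "Answer the question based on your knowledge. "
    "Use the following context to help:\n\n{context}\n\nQuestion: {question}\n"
)


def build_prompt(chunks, question, sep="\n\n---\n\n"):
    """Join retrieved chunk texts into one context string and fill the prompt."""
    context = sep.join(chunk.strip() for chunk in chunks)
    return PROMPT.format(context=context, question=question)
```

Joining the chunk texts explicitly keeps metadata and object representations out of the prompt, so the model sees only the retrieved text.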


## Comparing the results

For the question we take a shortened title from [here](https://stackoverflow.com/questions/77672767/how-to-insert-an-image-with-rounded-borders-in-kivy-while-cropping-it-to-keep-it). It is in the knowledge base, but first let's see how the model answers it without the knowledge base.

In [24]:
from warnings import filterwarnings

filterwarnings('ignore')

In [25]:
question = "How to insert an image with rounded borders in Kivy?"

The LLM's answer without the knowledge base:

In [26]:
llm_chain.invoke({"context":"", "question": question})['text']


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


"1. First, you need to import the necessary modules and libraries.\n2. Then, create a new class for the image that will be displayed.\n3. Next, define the size of the image and its border radius.\n4. Finally, add the image to the canvas.\n\nHere's how it works:\n\n```python\nfrom kivy.app import App\nfrom kivy.uix.widget import Widget\nfrom kivy.uix.image import Image\n\nclass RoundedImage(Widget):\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n        self.image = Image(source='image.png', size=(50, 50), border_radius=10)\n\nif __name__ == '__main__':\n    app = App()\n    app.run() \n```\n\nIn this example, we're using the `Image` widget from Kivy to display our image. We also set the size of the image to (50, 50) so that the image is centered within the container. The border radius is set to 10 pixels, which means the corners of the image are rounded.\n\nTo use this code, save it as `RoundedImage.py`, then run the program by typing `python RoundedImage.py` i

The model simply tries to answer the question in general terms.

In [27]:
import gc
torch.cuda.empty_cache()
gc.collect()

48

In [28]:
rag_chain.invoke(question)['text']

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


' </div>\n  <div class="answer">\n    <h1 class="title" id="how-to-insert-an-image-with-rounded-borders-in-kivy-while-cropping-it-to-keep-it">How to insert an image with rounded borders in Kivy while cropping it to keep its ratio?</h1>\n    <div class="content">\n      <p>Another approach is to make your <code>CustomImage</code> extend <code>Image</code> and modify the <code>kv</code> rule for the basic <code>Image</code> class:</p>\n<pre><code>&lt;-CustomImage&gt;:\\n\\n    canvas.before:\\n        RoundedRectangle:\\n            pos: (0,0)\\n            size: self.size\\n            source: &quot;image.jpg&quot;\\n            radius: [40,]\\n            #fit_mode: &quot;cover&quot;\\n\\nMyScreen:\\n    CustomImage:\\n        size_hint: (0.6, 0.2)\\n        pos_hint: {&quot;center_x&quot;: 0.5, &quot;center_y&quot;: 0.5}\\n</code></pre>\n<p>The <code>CustomImage</code> class definition then becomes:</p>\n<pre><code>class CustomImage(Image):\\n    pass\\n</code></pre>\n<p>If you need t

## Conclusions

The model's answer clearly changed once the retrieved context was added. One visible sign of the difference is the HTML tags in the answer.

To improve generation quality, one can (and should) use better or larger models (for example, GPT-4).


P.S. The text should have been cleaned of HTML tags beforehand 😢; the result would most likely improve. Unfortunately, because of limited compute (and because many good models simply do not fit in memory), it was not possible to repeat the experiment on cleaned text. Even so, the model's answer clearly draws on the knowledge base, so the RAG system works, although not as accurately as one would like.
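A minimal sketch of the cleanup mentioned above, using only the standard library. It was not part of the original run, and `strip_html` is a hypothetical helper: it drops the tags, keeps the text between them, and decodes entities such as `&lt;`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities (&lt;, &quot;, ...) in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return the text content of an HTML fragment without its tags."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()
```

Under that assumption it could be applied to the question column before building the documents, for example `qa_df["body_q"].map(strip_html)`.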

This work is based on this [cookbook from 🤗](https://github.com/huggingface/cookbook), which conveniently came out just yesterday.