# Building and Running RAG | LangChain

### Document Processing

**Let's implement the RAG (Retrieval-Augmented Generation) logic using LangChain.**

**LangChain simplifies building and managing RAG pipelines by providing tools for integrating models, working with data, and managing context. It also makes it easier to work with external data sources and language models, which makes development more flexible and efficient.**

**Let's check in practice whether that is the case.**

**As a first step, we load all the PDF files from which we will extract information. For convenience, we collect the file names in a Python list.**

In [2]:
import os
from tqdm.auto import tqdm

import warnings
warnings.filterwarnings('ignore')

pdf_list = []

for pdf in os.listdir('Art'):  # Folder where the PDF files are stored
    if os.path.isfile(os.path.join('Art', pdf)):  # Keep files only, skip subdirectories
        pdf_list.append(pdf)

print(pdf_list)

['Gardners Art Through the Ages The Western Perspective, Volume I,.pdf', 'History_of_Art.pdf', 'Transformative Art Movements and the Paintings That Inspired Them - 2013.pdf', 'Vasari Giorgio_The_Lives_of_the_Artists_Oxford.pdf']
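
**The loop above collects every file in `Art`, whatever its type. As a minimal sketch, assuming you only want files with a `.pdf` extension, the same listing can be written with `pathlib`. The helper name `list_pdfs` is illustrative and not part of the original notebook.**

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        # Keep regular files only and match the extension case-insensitively.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

**With this helper, `pdf_list = list_pdfs("Art")` would replace the loop, and a stray non-PDF file in the folder would no longer reach the loader.**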


In [3]:
from langchain_community.document_loaders import PyPDFLoader

documents = []

# Load every PDF; the loader returns one Document per page
for pdf in tqdm(pdf_list):
    loader = PyPDFLoader("Art/" + pdf)
    documents += loader.load()

In [4]:
documents[2930]

Document(metadata={'source': 'Art/Vasari Giorgio_The_Lives_of_the_Artists_Oxford.pdf', 'page': 252}, page_content="226 SANDRO BOTTICELLI\npanels, which contained many beautiful and lifelike figures.*\nLikewise, for the Pucci home, he illustrated Boccaccio's\nnovella of Nastagio degli Onesti, in four paintings with tiny\nfigures, which are most lovely and delightful,* along with a\ntondo depicting the Epiphany.*\nFor the monks of Cestello, he painted a panel of the Annun-\nciation in one of their chapels.* In the church of San Pietro\nMaggiore, at the side door, he painted a panel for Matteo\nPalmieri with a vast number of figures depicting the Assump-\ntion of the Virgin and including the heavenly spheres as they\nare represented, the Patriarchs, Prophets, Apostles, Evangelists,\nMartyrs, Confessors, Doctors of the Church, Holy Virgins,\nand the Hierarchies of Angels, all taken from a drawing given\nto him by Matteo, who was a learned and worthy man.*\nSandro painted this work with mas

**LangChain provides several methods for splitting text into fragments (chunks). We will use RecursiveCharacterTextSplitter, which recursively splits the text on a list of separators, trying each one in order until the pieces fit the chunk size.**

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

separators = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]

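# Note: without is_separator_regex=True the splitter matches these separators
# literally, so the regex-style patterns above only act as exact strings.
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Maximum chunk size, in characters
    chunk_overlap=100,  # Overlap between neighbouring chunks, in characters
    add_start_index=True,  # Store each chunk's start offset in its metadata
    strip_whitespace=True,
    separators=separators
)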
chunks = text_splitter.split_documents(documents)

In [6]:
chunks[333]

Document(metadata={'source': 'Art/Gardners Art Through the Ages The Western Perspective, Volume I,.pdf', 'page': 54, 'start_index': 3440}, page_content='architectural contexts shed a welcome light on the administration and \norganization of Mesopotamian city-states. Finally, Mesopotamian seals \nare an invaluable resource for art historians, providing them with thou-\nsands of miniature examples of relief sculpture spanning three millennia.\n2-8 Banquet scene, cylinder seal (left) and its modern impression (right), from the tomb of Puabi (tomb 800), Royal  \nCemetery, Ur (modern Tell Muqayyar), Iraq, ca. 2550 bce. Lapis lazuli, 1 7\n80 high, 10 diameter. British Museum, London.\nThe Mesopotamians used seals to identify and secure goods. Artists incised designs into stone cylinders that could be rolled over clay \nto produce miniature artworks such as this banquet scene.\n1 in.')

In [7]:
len(chunks)

22194
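
**To make the recursive idea concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores `chunk_overlap`, regex separators, and metadata, and the function name `recursive_split` is hypothetical. It only shows the core strategy of splitting on the coarsest separator first and falling back to finer ones for pieces that are still too long.**

```python
def recursive_split(text, chunk_size, separators):
    """Split `text` into pieces of at most `chunk_size` characters.

    Separators are tried in order, from coarsest to finest. An empty
    separator (or an exhausted list) means a hard cut by character count.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        # Try to grow the current chunk with the next part.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo = recursive_split("aaa bbb ccc\n\nddd eee", 7, ["\n\n", "\n", " ", ""])
print(demo)
```

**In this toy input the first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and stays whole.**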

### Creating Vectors and Semantic Search

**For efficient nearest-neighbor search based on cosine distance, we will use the FAISS library.**

In [8]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

embedding_model = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-mpnet-base-v2',
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True}   
)

vector_database = FAISS.from_documents(chunks, embedding_model, distance_strategy=DistanceStrategy.COSINE)

In [9]:
vector_database

<langchain_community.vectorstores.faiss.FAISS at 0x251c817a7b0>

**Now we can find the chunks that are closest to the query, using the FAISS index built above.**

In [10]:
query = 'What was Claude Monet famous for?'
similar_docs = vector_database.similarity_search(query, k=5)
similar_docs

[Document(id='f3e564f8-f5fd-44a1-b658-adfa17e73354', metadata={'source': 'Art/Transformative Art Movements and the Paintings That Inspired Them - 2013.pdf', 'page': 288, 'start_index': 672}, page_content='during the Franco-Prussian War, Monet returned to \nFrance, making his home in a succession of \nsuburban towns near Paris before settling in Giverny \nin 1883. Throughout his long career, Monet remained \nfaithful to the Impressionist aim of exploring the \nchanging quality of light and color in landscape. His \nseries of paintings of grainstacks, Rouen Cathedral, \nand waterlilies depict speciﬁc sites under differing \nlight and weather conditions.\nBIOGRAPHY'),
 Document(id='2399acf7-4236-48b8-a75c-8a7a5c3de1a6', metadata={'source': 'Art/Gardners Art Through the Ages The Western Perspective, Volume I,.pdf', 'page': 1083, 'start_index': 2127}, page_content='(Monet), 853, 854\nRousseau, Henri, 872–873; The Dream, \n873, 873, 28-28A, 29-17A; Sleeping \nGypsy, 872, 873\nRousseau, Jean-
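
**The similarity search above ranks chunks by how close their embedding is to the query embedding. As a rough illustration of what that means, the sketch below does the same ranking by brute force in plain Python. The function names are hypothetical, the vectors are assumed to be non-zero, and FAISS itself relies on optimized index structures rather than this loop. Because the embeddings were created with `normalize_embeddings=True`, the dot product of two of them equals their cosine similarity.**

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_cosine(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    # Highest similarity first; sort on the score only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

**For example, with document vectors `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` and the query `[2.0, 0.1]`, the two closest documents are the first and the third, because they point in nearly the same direction as the query.**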

### Adding the LLM

**We will use Mistral-7B-v0.1 as the language model.**

In [11]:
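import os
from huggingface_hub import InferenceClient

# Assumption: the Hugging Face access token is stored in the HF_TOKEN environment variable
api_key = os.environ["HF_TOKEN"]
client = InferenceClient(token=api_key)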

model_name = "mistralai/Mistral-7B-v0.1"

def generate_text(prompt):
    response = client.text_generation(
        prompt=prompt,
        model=model_name,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,
        max_new_tokens=500
    )
    return response

**Let's test Mistral-7B-v0.1 without any retrieved context to see how well it answers from its internal knowledge alone.**

In [12]:
generate_text("What was Claude Monet famous for?")

'\n\nMonet was best known for his pioneering role in the development of the Impressionist style of painting. He is particularly known for his series paintings of Haystacks, the Houses of Parliament, and his Water Lilies.  Claude Oscar Monet was born on 14 November 1840 in Paris, France.\n\nWhy did Monet choose to paint the water lilies?\n\nThe Water Lilies are the artistic and poetic expression of his wife’s garden and the quiet moments he would spend in it. The paintings reflect Monet’s obsession with the elusive effects of light and water, which he used to create a highly sensual experience for the viewer.\n\nWhy is Claude Monet famous?\n\nClaude Monet. Claude Monet was the leader of the Impressionists, a group of artists who pioneered a new style of painting in the late 19th century. He began painting at the age of 15, and his talent was quickly recognized, even as he taught himself to paint.\n\nWhat are Claude Monet famous paintings?\n\nMost Famous Works. Monet\'s most famous paint

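**Great! The model answers from its internal knowledge, although, being a base model rather than an instruction-tuned one, it keeps generating further questions and answers. We can also improve retrieval quality by reordering the search results according to their semantic relevance to the query. The model that does this is called a reranker.**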

In [13]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

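def rerank_chunks(query, chunks, top_k=3):
    # Score every (query, chunk) pair with the cross-encoder
    scores = reranker.predict([(query, chunk) for chunk in chunks])

    # Sort by score only, highest first, so ties never fall back to comparing chunk text
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)

    return [chunk for _, chunk in ranked[:top_k]]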
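
**The two-stage pattern (a fast vector search first, then a slower and more precise rescoring of the few candidates) can be tried without downloading a model. In the sketch below a toy word-overlap function stands in for the cross-encoder. Both `overlap_score` and `rerank` are hypothetical names used only for illustration, and word overlap is a much weaker relevance signal than a real reranker.**

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query words that occur in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank(query, passages, score_fn, top_k=3):
    """Rescore the candidate passages and keep the top_k best ones."""
    scored = [(score_fn(query, passage), passage) for passage in passages]
    # The sort is stable, so passages with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


candidates = [
    "index of names",
    "monet painted water lilies",
    "claude monet led impressionism",
]
print(rerank("claude monet", candidates, overlap_score, top_k=2))
```

**Any scoring function that takes a query and a passage fits the `score_fn` slot, which is the same role the cross-encoder plays in `rerank_chunks`.**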

**Let's look at the most relevant chunks after reranking.**

In [14]:
similar_chunks = [doc.page_content for doc in similar_docs]

ranked_chunks = rerank_chunks(query, similar_chunks)

context = "\n".join(ranked_chunks)

print(context)

during the Franco-Prussian War, Monet returned to 
France, making his home in a succession of 
suburban towns near Paris before settling in Giverny 
in 1883. Throughout his long career, Monet remained 
faithful to the Impressionist aim of exploring the 
changing quality of light and color in landscape. His 
series of paintings of grainstacks, Rouen Cathedral, 
and waterlilies depict speciﬁc sites under differing 
light and weather conditions.
BIOGRAPHY
Doncieux, Monet’s wife (compare fig. 28-2A), is at once the painter’s admirer and his muse. In the dis-
tance are the factories and smokestacks that represent the opposite pole of life at Argenteuil. In captur-
ing both the leisure activities of the bourgeoisie and the industrialization along the Seine in the 1870s on 
the same canvas, Manet, like Monet, was fulfilling Baudelaire’s definition of “the painter of modern life. ”
Claude Monet in His Studio Boat  is also noteworthy as a document of Monet’s preference for 
painting outdoors (e

**Next we define a prompt that states the task clearly, which steers the model toward a more accurate and relevant answer.**

In [15]:
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the key characteristics of Impressionism?
Answer: Impressionism is an art movement that originated in the late 19th century, characterized by its focus on capturing light and its changing qualities, ordinary subject matter, and unusual visual angles. Artists like Claude Monet and Pierre-Auguste Renoir used loose brushwork and vibrant colors to depict scenes from everyday life, often painting en plein air (outdoors) to better capture the natural light and atmosphere. The movement marked a departure from the detailed, polished style of academic painting, emphasizing instead the artist's perception and the transient effects of light.
\nExample 2:
Query: What is the significance of the Renaissance in art history?
Answer: The Renaissance was a pivotal period in art history, spanning from the 14th to the 17th century, marked by a revival of interest in the classical art and culture of ancient Greece and Rome. This era saw the development of techniques such as linear perspective, chiaroscuro (the contrast of light and shadow), and anatomical precision. Artists like Leonardo da Vinci, Michelangelo, and Raphael created masterpieces that emphasized humanism, realism, and the exploration of individual expression. The Renaissance not only transformed artistic practices but also had a profound impact on the cultural and intellectual landscape of Europe.
\nExample 3:
Query: How did Cubism revolutionize modern art?
Answer: Cubism, pioneered by Pablo Picasso and Georges Braque in the early 20th century, revolutionized modern art by breaking away from traditional perspectives and representing subjects in a fragmented, abstracted form. This movement introduced the concept of depicting objects from multiple viewpoints simultaneously, challenging the conventions of realistic representation. Cubism laid the groundwork for subsequent avant-garde movements and influenced various fields, including sculpture, architecture, and literature. Its emphasis on geometric shapes and the deconstruction of form paved the way for abstract art and new ways of seeing and interpreting the world.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {question}
Answer:"""
)

prompt = prompt_template.format(context=context, question=query)
print(prompt)

Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What are the key characteristics of Impressionism?
Answer: Impressionism is an art movement that originated in the late 19th century, characterized by its focus on capturing light and its changing qualities, ordinary subject matter, and unusual visual angles. Artists like Claude Monet and Pierre-Auguste Renoir used loose brushwork and vibrant colors to depict scenes from everyday life, often painting en plein air (outdoors) to better capture the natural light and atmosphere. The movement marked a departure from the detailed, polished style of academic painting, emphasizing instead the artist's perception and the transient effects o
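
**The prompt refers to "context items", while the notebook joins the chunks into one block of text. As a possible variation, the sketch below numbers each chunk before filling the template, so the items are easier to tell apart. It uses plain `str.format` and a shortened template. `TEMPLATE` and `build_prompt` are hypothetical names, and the numbering format is an assumption rather than something the original notebook does.**

```python
TEMPLATE = (
    "Now use the following context items to answer the user query:\n"
    "{context}\n"
    "User query: {question}\n"
    "Answer:"
)


def build_prompt(chunks, question, template=TEMPLATE):
    """Number the context chunks and insert them into the template."""
    context = "\n".join(
        f"[{number}] {chunk.strip()}"
        for number, chunk in enumerate(chunks, start=1)
    )
    # Only the template is parsed for placeholders, so braces inside
    # the chunks themselves are inserted unchanged.
    return template.format(context=context, question=question)
```

**Calling `build_prompt(ranked_chunks, query)` would produce a prompt in which each retrieved chunk starts on its own numbered line.**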

### Result

In [35]:
query = 'What was Claude Monet famous for?'

response = generate_text(prompt)
print("Answer:", response)

Answer:  Claude Monet is known for his innovative, Impressionist style of painting, which revolutionized the art world in the late 19th century. He was the founder of the Impressionist movement, along with artists such as Pierre-Auguste Renoir, Edgar Degas, and Camille Pissarro. Monet is particularly renowned for his luminous landscapes and captivating depictions of light and atmosphere. His paintings are characterized by a loose, free-flowing brushwork and an emphasis on the transient effects of nature.

Monet's famous works include his series of paintings of the Rouen Cathedral, the water lily pond at Giverny, and the Haystacks series. These paintings showcase Monet's mastery of color and light, as he captures the changing qualities of nature through the course of the day or the seasons. His landscapes often include elements of abstraction, as he simplifies forms and focuses on the emotional effect of his paintings.

Monet's influence on the art world was significant, inspiring artist
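
**The steps of this notebook (retrieve, rerank, build the prompt, generate) can be tied together in one function. The sketch below is one possible way to do that. The name `answer_with_rag` and its parameters are hypothetical, and each step is passed in as a callable so the function can be tried with simple stand-ins before plugging in the real components.**

```python
def answer_with_rag(query, retrieve, rerank, build_prompt, generate):
    """Run the four RAG steps in order and return the generated answer."""
    candidates = retrieve(query)             # 1. vector search
    best = rerank(query, candidates)         # 2. keep the most relevant chunks
    context = "\n".join(best)
    prompt = build_prompt(context, query)    # 3. fill the prompt template
    return generate(prompt)                  # 4. call the language model


# Possible wiring with the objects defined in this notebook:
# answer_with_rag(
#     query,
#     retrieve=lambda q: [d.page_content for d in vector_database.similarity_search(q, k=5)],
#     rerank=rerank_chunks,
#     build_prompt=lambda c, q: prompt_template.format(context=c, question=q),
#     generate=generate_text,
# )
```

**Wrapping the steps this way also guarantees that the prompt is rebuilt from the current query each time, instead of reusing a prompt created in an earlier cell.**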

**That completes our RAG implementation with LangChain. Mistral-7B-v0.1 generated an answer from the prompt we built, and LangChain handled the document loading, splitting, and retrieval steps of Retrieval-Augmented Generation. This approach combines the generative power of language models with the precision of retrieving relevant data, which makes it useful for a wide range of applications.**