# QnA over documents

Large language models (LLMs) have a data freshness problem. Even some of the most powerful models, such as GPT-4, know nothing about recent events.

As far as an LLM is concerned, the world is frozen in time. It knows the world only as it appeared in its training data.

This is a problem for any use case that depends on up-to-date information or on a specific dataset. For example, you may have internal company documents that you would like to query through an LLM.

The first challenge is getting these documents into the LLM. We could try to train the LLM on them, but that is slow and expensive. And what happens when a new document is added? Retraining for every new document is more than inefficient; it is simply not practical.

So how do we solve this problem? We can use **retrieval augmentation**. This technique lets us retrieve relevant information from an external knowledge base and pass it to our LLM.

The external knowledge base is our "window" onto the world beyond the LLM's training data. In this chapter, we will learn how to implement retrieval augmentation for LLMs using LangChain.


## Knowledge base

To help the model, we need to give it access to relevant sources of knowledge. The dataset used as the knowledge base naturally depends on the use case. It could be code documentation for an LLM that helps write code, company documents for an internal chatbot, or anything else.

Most datasets contain records with a lot of text. That is why our first task is usually to build a preprocessing pipeline that **splits these long pieces of text into more concise chunks and converts those chunks into numerical vector representations called embeddings**.
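
As a rough illustration of the splitting step, the sketch below cuts a text into fixed-size character chunks that overlap slightly, so a sentence cut at a boundary still appears whole in one of the chunks. The function name `split_text` and the `chunk_size` and `overlap` values are assumptions made for this example; LangChain ships its own text splitters, which this sketch does not reproduce.

```python
def split_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Hypothetical 500-character record standing in for a long document.
sample = "abcdefghij" * 50
chunks = split_text(sample, chunk_size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be passed to the embedding model separately.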

## Embeddings

Embeddings are essential for retrieving the relevant context for our LLM. We take the chunks of text we want to store in our knowledge base and encode each chunk as an embedding vector.

These embeddings act as "numerical representations" of the meaning of each chunk of text.

We therefore need a second model to turn the text chunks into embeddings, which are then stored in our vector database. We can then find chunks with similar meanings by computing the distance between their vectors in the vector space.


![Alt text](image-6.png)
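
To make the idea of "distance between vectors" concrete, here is a minimal sketch that ranks a few items by cosine similarity to a query vector. The 3-dimensional vectors and their labels are invented for illustration; real embeddings come from an embedding model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, hand-written for this example.
vectors = {
    "sun shirt": [0.9, 0.1, 0.0],
    "UV-protective top": [0.8, 0.2, 0.1],
    "dog mat": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

# Sort the labels from most to least similar to the query.
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query_vector, vectors[name]),
    reverse=True,
)
print(ranked)
```

Texts with related meanings should end up with vectors pointing in similar directions, which is what lets a vector store return relevant chunks for a query.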

In [1]:
# base packages
import os
import pandas as pd
from langchain.document_loaders import CSVLoader
from IPython.display import display, Markdown

# load the environment variables
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# import model
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

# chat model
llm = AzureChatOpenAI(
    openai_api_version="2023-09-01-preview",
    azure_endpoint=os.getenv('AZURE_API_ENDPOINT'),
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    azure_deployment=os.getenv('OPENAI_DEPLOYMENT_NAME'),
    model_name=os.getenv('OPENAI_MODEL_NAME'),
    model_version=os.getenv('OPENAI_API_VERSION')
)

# embedding model
embeddings = AzureOpenAIEmbeddings(
    openai_api_version="2023-09-01-preview",
    azure_endpoint=os.getenv('AZURE_API_ENDPOINT'),
    api_key=os.getenv('AZURE_OPENAI_KEY'),
    azure_deployment=os.getenv('OPENAI_DEPLOYMENT_NAME_EMBEDDING'),
    #model_name=os.getenv('OPENAI_MODEL_NAME_EMBEDDING'),
    #model_version=os.getenv('OPENAI_API_VERSION_EMBEDDING')
)



In [2]:
# test the embedding
embed = embeddings.embed_query("Hi my name is Harrison")
print(len(embed))
print(embed[:5])

1536
[-0.02186359354958571, 0.006734037540552159, -0.018200781829872538, -0.03919587420756086, -0.01404707648325098]


In [3]:
# let's have a look at the data
df = pd.read_csv('../data/OutdoorClothingCatalog.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | name | description |
|--------------|------------|------|-------------|
| 0 | 0 | Women's Campside Oxfords | This ultracomfortable lace-to-toe Oxford boast... |
| 1 | 1 | Recycled Waterhog Dog Mat, Chevron Weave | Protect your floors from spills and splashing ... |
| 2 | 2 | Infant and Toddler Girls' Coastal Chill Swimsu... | She'll love the bright colors, ruffles and exc... |
| 3 | 3 | Refresh Swimwear, V-Neck Tankini Contrasts | Whether you're going for a swim or heading out... |
| 4 | 4 | EcoFlex 3L Storm Pants | Our new TEK O2 technology makes our four-seaso... |


In [4]:
# define a loader
loader = CSVLoader(file_path="../data/OutdoorClothingCatalog.csv", encoding="utf-8")

# load documents
docs = loader.load()
print(len(docs))
docs[0]

1000


Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \r\n\r\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \r\n\r\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \r\n\r\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \r\n\r\nQuestions? Please contact us for any inquiries.", metadata={'source': '../data/OutdoorClothingCatalog.csv', 'row': 0})

In [5]:
docs[1].metadata

{'source': '../data/OutdoorClothingCatalog.csv', 'row': 1}
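
The outputs above suggest that each CSV row becomes one document whose text is a list of `column: value` lines, with the source file and row index kept as metadata. The sketch below reproduces that layout with the standard library only. The function `rows_to_documents` and the two-row sample are hypothetical, and this is not LangChain's implementation.

```python
import csv
import io

# Hypothetical two-row CSV with the same column layout as the catalog
# (note the empty header of the first column).
SAMPLE_CSV = (
    ",name,description\n"
    "0,Item A,First description\n"
    "1,Item B,Second description\n"
)


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, in column order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


docs_sketch = rows_to_documents(SAMPLE_CSV)
print(docs_sketch[0]["page_content"])
print(docs_sketch[1]["metadata"])
```

The empty header of the first column explains why the page content shown earlier starts with `: 0`.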

In [10]:
# this helps create a vector store very easily
# pip install --upgrade --quiet  "docarray"
from langchain.chains import RetrievalQA
from langchain.vectorstores import DocArrayInMemorySearch

# create the in-memory vector store
db = DocArrayInMemorySearch.from_documents(
    docs, # list of documents
    embeddings, # embedding object
)

# creation of a retriever
retriever = db.as_retriever()

# create retrieval chain
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", # stuff all the retrieved documents into the context
    retriever=retriever, # the docs are in the retriever
    verbose=True
)

# define the query
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

# run the query
response = qa_stuff.invoke(query)

# display the response
print(response)



> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'Please list all your shirts with sun protection in a table in markdown and summarize each one.', 'result': "| Shirt Number | Name | Summary |\r\n|--------------|------|---------|\r\n| 618 | Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, this shirt is rated UPF 50+ for superior protection from the sun's UV rays. It has front and back cape venting that lets in cool breezes and two front bellows pockets. |\r\n| 374 | Men's Plaid Tropic Shirt, Short-Sleeve | Made with 52% polyester and 48% nylon, this shirt is rated to UPF 50+. It has front and back cape venting, two front bellows pockets, and is machine washable and dryable. |\r\n| 255 | Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, this shirt is UPF 50+ rated and handwashable. It wicks moisture for quick-drying comfort and is abrasion-resistant. It fits comfortably over swimsuits. |\r\n| 535 | Men's TropicVibe Shir

In [11]:
# nice display
display(Markdown(response["result"]))

| Shirt Number | Name | Summary |
|--------------|------|---------|
| 618 | Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, this shirt is rated UPF 50+ for superior protection from the sun's UV rays. It has front and back cape venting that lets in cool breezes and two front bellows pockets. |
| 374 | Men's Plaid Tropic Shirt, Short-Sleeve | Made with 52% polyester and 48% nylon, this shirt is rated to UPF 50+. It has front and back cape venting, two front bellows pockets, and is machine washable and dryable. |
| 255 | Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, this shirt is UPF 50+ rated and handwashable. It wicks moisture for quick-drying comfort and is abrasion-resistant. It fits comfortably over swimsuits. |
| 535 | Men's TropicVibe Shirt, Short-Sleeve | Made with 71% Nylon and 29% Polyester, this Men’s sun-protection shirt has built-in UPF 50+. It is wrinkle-resistant and has front and back cape venting and two front bellows pockets. | 

All of the listed shirts have UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and has front and back cape venting and two front bellows pockets. The Men's Plaid Tropic Shirt, Short-Sleeve is made of 52% polyester and 48% nylon, and is machine washable and dryable. The Sun Shield Shirt is made of 78% nylon and 22% Lycra Xtra Life fiber, and is handwashable. It is abrasion-resistant and can fit comfortably over swimsuits. The Men's TropicVibe Shirt, Short-Sleeve is made with 71% Nylon and 29% Polyester, it is wrinkle-resistant and has front and back cape venting and two front bellows pockets.
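
The `chain_type="stuff"` option used above places all retrieved documents into a single prompt. A rough sketch of that idea follows, assuming a made-up prompt wording and a hypothetical `max_chars` budget that stands in for the model's context limit. The function `stuff_prompt` and the sample documents are invented for this example and are not the template LangChain uses.

```python
def stuff_prompt(question, documents, max_chars=4000):
    """Concatenate every retrieved document into one prompt ("stuffing")."""
    context = "\n\n".join(documents)
    if len(context) > max_chars:
        # Assumption: a character budget approximates the context window.
        raise ValueError("Retrieved context exceeds the character budget")
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved documents, in the "column: value" layout seen earlier.
retrieved = [
    "name: Item A\ndescription: Hypothetical shirt with UPF 50+ fabric.",
    "name: Item B\ndescription: Hypothetical shirt with cape venting.",
]
prompt = stuff_prompt("Which shirts offer sun protection?", retrieved)
print(prompt)
```

Stuffing is simple and needs only one LLM call, but it works only while the combined documents fit in the model's context window.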

We can check the results with a simple similarity search.

In [13]:
# to check the results
query = "Please suggest a shirt with sunblocking"
docs = db.similarity_search(query)
print(f"Results found in {len(docs)} documents")
[doc.page_content for doc in docs]

Results found in 4 documents


[': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \r\n\r\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\r\n\r\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\r\n\r\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\r\n\r\nSun Protection That Won\'t Wear Off\r\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.',
 ": 374\nname: Men's Plaid Tropic Shirt, Short-Sleeve\ndescription: Our Ultracomfortable sun protection is rated to UPF 50+, helping you stay cool and dry. Originally designed for fishing, this lightest hot-weather