### LangChain for LLM Application development

### LangChain Q&A over your documents

This section combines language models with data they were not trained on. It also introduces some key LangChain components, such as **embedding models** and **vector stores**.

In [10]:
#%pip install sentence-transformers #Required for embeddings, specifically HuggingFaceEmbeddings
from langchain.chains import RetrievalQA #Retrieves documents and answers questions over them
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings #Imports a specific embedding model
from langchain_community.document_loaders import CSVLoader #Loads the documents we want to use with the LLM
from langchain_community.vectorstores import DocArrayInMemorySearch #In-memory vector store, so no external connection is required
from IPython.display import display, Markdown

In [11]:

llm_model = "llama3:8B"
llm = ChatOllama(temperature=0.0, model=llm_model)

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file,encoding='utf-8')

#Instantiate the embedding model
embeddings = HuggingFaceEmbeddings()

from langchain.indexes import VectorstoreIndexCreator

#%pip install docarray

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings #Now a required argument!
).from_loaders([loader]) #Takes a list of document loaders


**Embeddings let machines understand and compare the meaning of language by converting text into vectors. They are essential for intelligent search, recommendation, and other natural language processing tasks in LangChain and other AI frameworks.**

In [12]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

#create a response
response = index.query(query, llm=llm)

In [None]:
#Take a look at what it returned
display(Markdown(response))

Here is the list of shirts with sun protection:

| Shirt Name | Description |
| --- | --- |
| Women's Tropical Tee, Sleeveless | A sleeveless button-up shirt with UPF 50+ rated fabric for sun protection, blocking 98% of harmful UV rays. Softly shapes the body and has a flattering fit. |
| Sun Shield Shirt by [Brand] | A high-performance sun shirt with UPF 50+ rated fabric that blocks 98% of harmful UV rays. Wicks moisture for quick-drying comfort and is abrasion resistant. Fits comfortably over swimsuits. |
| Men's Plaid Tropic Shirt, Short-Sleeve | A lightweight hot-weather shirt with UPF 50+ coverage and SunSmart technology to block 98% of harmful UV rays. Wrinkle-free and quickly evaporates perspiration. |
| Tropical Breeze Shirt | A long-sleeve men's UPF shirt that offers superior sun protection from the sun's harmful rays. Wrinkle-resistant and moisture-wicking fabric keeps you cool and comfortable. |

Note: The Sun Shield Shirt by [Brand] does not specify the brand name, so I left it blank.

What is going on?  

We want to use language models to inspect documents, but a language model can only inspect a few thousand words at a time.  

This is where **Embeddings** and **Vectorstores** come in:  

* Embeddings create numerical representations of text.
    * An embedding captures the semantic meaning of the text.
    * Texts with similar content have similar vectors.
    * This lets us compare pieces of text in the vector space.

Suppose we compare the following sentences:
1. My dog Rover likes to chase squirrels
2. Fluffy, my cat, refuses to eat from a can
3. The Chevy Bolt accelerates to 60mph in 6.7s

The sentences are compared based on their vectors. The first two turn out to be quite similar, while the third is not similar to either of them. This is crucial information when deciding which pieces of text to pass to the language model.
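To make the comparison concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are made up for illustration and are not output from any model; real embeddings have many more dimensions. Only the comparison logic is the point.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for sentence embeddings (NOT real model output)
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],  # "My dog Rover likes to chase squirrels"
    "cat": [0.8, 0.9, 0.2],  # "Fluffy, my cat, refuses to eat from a can"
    "car": [0.1, 0.2, 0.9],  # "The Chevy Bolt accelerates to 60mph in 6.7s"
}

pets = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
pet_vs_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])

print(f"dog vs cat: {pets:.2f}")
print(f"dog vs car: {pet_vs_car:.2f}")
```

With these toy values, the two pet sentences score much higher against each other than either does against the car sentence.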

**Vectorstores**:

* A vector database is a way to store the embeddings we just created
    * It is populated with chunks of text from the documents
    * First, the text is broken into smaller chunks. This is useful because we may not be able to pass the entire document to the language model
    * With smaller chunks, we can pass only the most relevant ones
    * An embedding is created for each chunk and stored in the vector database
    * Once this index is created, we can use it at runtime to find the text most relevant to a query
    * When a query comes in, we first create an embedding for it, then compare it to all the vectors in the database and select the most similar ones
    * The matching chunks are then returned, and we pass them into the prompt of the language model to get the final answer
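The steps above can be sketched in plain Python. This is a toy illustration, not how `DocArrayInMemorySearch` is implemented: `ToyVectorStore`, `split_into_chunks`, and `toy_embed` are hypothetical names, and the "embedding" is a simple bag-of-words count that matches shared words rather than semantic meaning the way a real embedding model does.

```python
import math
from collections import Counter

def split_into_chunks(text, chunk_size=8):
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def toy_embed(text):
    """Stand-in embedding: word counts (an assumption, not a real embedding model)."""
    return Counter(word.strip(".,!?").lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.entries = []  # list of (chunk_text, vector) pairs

    def add_text(self, text, chunk_size=8):
        # Chunk the text, embed each chunk, and store chunk and vector together
        for chunk in split_into_chunks(text, chunk_size):
            self.entries.append((chunk, toy_embed(chunk)))

    def similarity_search(self, query, k=2):
        # Embed the query, rank all stored chunks by similarity, return the top k
        query_vec = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

store = ToyVectorStore()
store.add_text(
    "This shirt has UPF 50+ sun protection. "
    "These boots are waterproof and insulated for winter hiking.",
    chunk_size=7,
)
top_chunks = store.similarity_search("shirt with sun protection", k=1)
print(top_chunks)
```

The returned chunks are what would be placed into the prompt for the language model.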

In [None]:
#Document already loaded
docs = loader.load() #Loading in the documents
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [None]:
#Because the document is small, we can create the embeddings directly
#Take a look at what happens when we embed a piece of text
embed = embeddings.embed_query("Hi my name is Zane")

In [None]:
#Embedding has 768 elements
print(len(embed))

768


In [None]:
#Each element is a number; together the elements represent the text
print(embed[:5])

[0.06792537868022919, -0.0029981930274516344, -0.004753928165882826, -0.02511957846581936, 0.020938599482178688]


In [None]:
#Create embeddings for all the pieces of text just loaded and store them in a vectorstore
db = DocArrayInMemorySearch.from_documents( #Takes documents and an embedding object and creates a vectorstore
    docs,
    embeddings
)

In [20]:
#Now we can use this to find documents similar to a query
query = "Please suggest a shirt with sunblocking"

#Returns the 4 most similar documents by default
docs = db.similarity_search(query)

In [21]:
len(docs)

4

In [22]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}, page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.')

In [23]:
#How to use this for question answering
#First we need to create a retriever from the vectorstore
retriever = db.as_retriever()
#A retriever is a generic interface that can be backed by any method that takes in a query and returns documents
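As a rough illustration of why a retriever is a generic interface, the sketch below defines a minimal interface and a retriever that uses plain keyword overlap instead of vectors. The names `Retriever`, `KeywordRetriever`, `retrieve`, and `answer_context` are hypothetical and are not LangChain's API; the point is only that the calling code does not care how the documents are found.

```python
from typing import List, Protocol

class Retriever(Protocol):
    """Illustrative interface: anything that maps a query to documents."""
    def retrieve(self, query: str) -> List[str]: ...

class KeywordRetriever:
    """Hypothetical retriever based on keyword overlap rather than embeddings."""
    def __init__(self, documents: List[str]):
        self.documents = documents

    def retrieve(self, query: str) -> List[str]:
        terms = set(query.lower().split())
        # Keep every document that shares at least one word with the query
        return [doc for doc in self.documents if terms & set(doc.lower().split())]

def answer_context(retriever: Retriever, query: str) -> str:
    """Builds context from any retriever, regardless of how it finds documents."""
    return "\n".join(retriever.retrieve(query))

keyword_retriever = KeywordRetriever(["sun shirt with UPF 50+", "insulated winter boots"])
context = answer_context(keyword_retriever, "sun protection")
print(context)
```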

In [24]:
#Because we want to do text generation and return a natural language response
llm = ChatOllama(temperature=0.0, model=llm_model)

In [26]:
#If we were doing this by hand
qdocs = "".join(doc.page_content for doc in docs) #Combine the documents into a single piece of text
#Pass the combined text and the question to the language model
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")

display(Markdown(response))


Here is the list of shirts with sun protection:

| Name | Description |
| --- | --- |
| Sun Shield Shirt by [Brand] | High-performance sun shirt with UPF 50+ rated fabric, blocks 98% of harmful UV rays. Softly shapes the body, falls at hip. Handwash, line dry. |
| Women's Tropical Tee, Sleeveless | Five-star sleeveless button-up shirt with built-in SunSmart™ UPF 50+ rated fabric, blocks 98% of harmful UV rays. Machine wash and dry. |
| Girls' Beachside Breeze Shirt, Half-Sleeve | Rash guard-style swim shirt with built-in UPF 50+ rated fabric, blocks 98% of harmful UV rays. Snag- and fade-resistant, seawater-resistant. Machine wash, line dry. |
| Sunrise Tee | Lightweight, high-performance button down shirt with built-in SunSmart™ UPF 50+ rated fabric, blocks 98% of harmful UV rays. Wicks away moisture, dries quickly. Machine wash and dry. |

Each of these shirts has a description that highlights its sun protection features:

* The Sun Shield Shirt by [Brand] is designed to block the sun's harmful UV rays while still allowing for comfort and mobility.
* The Women's Tropical Tee, Sleeveless is a five-star sleeveless button-up shirt with built-in UPF 50+ rated fabric that provides excellent sun protection.
* The Girls' Beachside Breeze Shirt, Half-Sleeve is a rash guard-style swim shirt designed to provide extra coverage for watersports and beach activities while still allowing for comfort and mobility.
* The Sunrise Tee is a lightweight, high-performance button down shirt with built-in UPF 50+ rated fabric that provides excellent sun protection and wicks away moisture to keep the wearer cool and comfortable.

All of these shirts have been designed to provide excellent sun protection, making them great options for anyone looking to stay safe in the sun.

In [None]:
#All these steps can be encapsulated with a langchain chain

qa_stuff = RetrievalQA.from_chain_type(
    llm = llm, #Pass in the language model
    chain_type = "stuff", #Simple method that stuffs all the documents into context
    retriever = retriever,
    verbose = True
)

**Stuff Method**
* Stuffing is the simplest method
* You simply stuff all data into the prompt as context to pass to the language model

PROS: Makes a single call to the LLM. The LLM has access to all the data at once!  

CONS: LLMs have a limited context length, so for large documents or many documents this will not work, as the prompt will exceed the context length.
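A minimal sketch of what stuffing amounts to, together with a crude size check. The helper names are hypothetical, and the word count is only a stand-in for a real token count, which depends on the model's tokenizer.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every chunk into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return f"{context}\n\nQuestion: {question}"

def fits_in_context(prompt, max_words=3000):
    """Rough check: word count as a proxy for the model's token limit (assumption)."""
    return len(prompt.split()) <= max_words

chunks = [
    "name: Sun Shield Shirt\ndescription: UPF 50+ rated.",
    "name: Sunrise Tee\ndescription: Lightweight, UPF 50+ rated.",
]
prompt = build_stuff_prompt(chunks, "Which shirts offer sun protection?")

print(prompt)
print(fits_in_context(prompt))               # small prompt, fits
print(fits_in_context(prompt, max_words=5))  # same prompt against a tiny limit
```

When the check fails, the prompt has to be shortened or one of the other methods has to be used instead.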

*What if you wanted to do the same type of question and answering over lots of different chunks?*

1. Map_reduce
    * This takes all the chunks,
    * passes each one along with the question to the LLM
    * gets back a response for each chunk
    * and then uses another LLM call to summarize all the individual responses into a final answer

This is very powerful because it can operate over any number of documents, and the individual questions can be answered in parallel, but it takes many more calls. It also treats all the documents as independent, which may not always be the case.  

2. Refine
    * Loops over many documents
    * Works iteratively, building upon the answer from the previous document

This is very good for combining information and building up an answer over time. It generally leads to longer answers and is slower, because the calls are not independent: each one depends on the result of the previous call.  

3. Map_rerank
    * Makes a single call to the LLM for each document, asking it to return a score
    * Then you select the highest score
    * This relies on the LLM to know what the score should be
    * Therefore, you should refine the instructions to state that the score should be high when the document is relevant to the question

As with Map_reduce, all calls are independent, so you can batch them and run them quickly. But again you will be making a large number of LLM calls, so it will be more expensive.  

These methods also have other use cases; for example, Map_reduce can be used to summarize documentation.
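The call patterns of the three methods can be compared with a small sketch. `FakeLLM` is a hypothetical stand-in that only counts calls, and the prompts are illustrative, not the ones LangChain uses. In the real Map_rerank method the model itself returns the score; here `score_fn` stands in for that.

```python
class FakeLLM:
    """Hypothetical stand-in for a real model: counts calls, returns a numbered answer."""
    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer {self.calls}"

def map_reduce(llm, chunks, question):
    # Map: one independent call per chunk (these could run in parallel)
    partial_answers = [llm(f"{chunk}\n\nQuestion: {question}") for chunk in chunks]
    # Reduce: one extra call that combines the partial answers
    combined = "\n".join(partial_answers)
    return llm(f"Summarize these answers into one:\n{combined}")

def refine(llm, chunks, question):
    # Sequential: each call sees the answer built so far
    answer = ""
    for chunk in chunks:
        answer = llm(f"Answer so far: {answer}\nNew context: {chunk}\nQuestion: {question}")
    return answer

def map_rerank(llm, chunks, question, score_fn):
    # One call per chunk; score_fn stands in for the score the model would return
    scored = [(score_fn(chunk), llm(f"{chunk}\n\nQuestion: {question}")) for chunk in chunks]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer

chunks = ["sun shirt with sun protection", "winter boots", "rain jacket"]
question = "Which items offer sun protection?"

mr_llm, refine_llm, rerank_llm = FakeLLM(), FakeLLM(), FakeLLM()
map_reduce(mr_llm, chunks, question)
refine(refine_llm, chunks, question)
best = map_rerank(rerank_llm, chunks, question, score_fn=lambda c: c.count("sun"))

print(mr_llm.calls)      # 4: one per chunk plus one summarizing call
print(refine_llm.calls)  # 3: one per chunk, run in sequence
print(rerank_llm.calls)  # 3: one per chunk, highest score wins
```

The stuff method, by comparison, makes one call regardless of the number of chunks.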

In [31]:
# query =  "Please list all your shirts with sun protection in a table \
# in markdown and summarize each one."
query = "Please list all the shirts that I can wear for doing sport in a table \
    in markdown and summarize each one"

In [32]:
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [33]:
display(Markdown(response))

Here is the list of sports shirts in a Markdown table:

| Shirt Name | Description |
| --- | --- |
| Boys' Athletic Performance Tee | Quick-drying, moisture-wicking tee with antimicrobial fabric to control odors. Slightly fitted shape. |
| Multi-Sport Performance Tee | Moisture-control and quick-drying tee with abrasion-resistant fabric, antimicrobial finish, and reflective hits for visibility. |
| Multi-Athlete Performance Tee | Abrasion-resistant, antimicrobial, and UPF 50+ protected tee with mesh side panels and back yoke for breathability. |
| Endurance Dri-Fit Long-Sleeve Tee | Moisture-wicking, quick-drying long-sleeve tee with anti-microbial treatment, UPF 50+ protection, and zoned mesh for breathability. |

Let me know if you'd like me to add anything else!

In [36]:
#We can also do it much more easily in the following manner
response = index.query(query, llm=llm)

In [37]:
display(Markdown(response))

Here is the list of shirts you can wear for sports:

| Shirt Name | Description |
| --- | --- |
| Boys' Athletic Performance Tee | Quick-drying, moisture-wicking shirt with antimicrobial fabric to control odors. Reflective logo on chest increases visibility. |
| Multi-Athlete Performance Tee | Abrasion-resistant, antimicrobial shirt with UPF 50+ sun protection and mesh side panels for breathability. |
| Performance Plus Woven Shirt | Quick-drying, breathable shirt with UPF 40+ sun protection and abrasion-resistant fabric. Dries in under 14 minutes. |

These shirts are designed to provide comfort, moisture management, and sun protection during sports activities. The Boys' Athletic Performance Tee is a great option for everyday wear, while the Multi-Athlete Performance Tee is more durable and suitable for high-intensity workouts. The Performance Plus Woven Shirt is a versatile choice that can be worn both on and off the trail.

In [None]:
#We can also modify the index when we are creating it
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings
).from_loaders([loader])
#This gives flexibility in how the embeddings are created, and lets us swap the vectorstore for a different type