# LangChain

Open-source development framework for LLM applications.

* API calls through LangChain involve three pieces:
   * Prompts: the way inputs to the model are constructed
   * Models: the LLMs themselves
   * Output parsers: convert the model output into a structured format
   
An LLM can only inspect a few thousand words at a time. To work with larger document collections, embeddings and vector stores are used to pick out the pieces that are relevant to a question.

In [None]:
#!pip install openai

In [1]:
import os
import openai
openai.api_key = os.environ['OPENAI_API_KEY']

## Chat API, OpenAI

In [2]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, 
    )
    return response.choices[0].message["content"]


In [10]:
get_completion("What is 1+1?")

'1+1 equals 2.'

In [3]:
customer_email = """
Arrr, I be fuming that me blender lid \
flew off and splattered me kitchen walls \
with smoothie! And to make matters worse,\
the warranty don't cover the cost of \
cleaning up me kitchen. I need yer help \
right now, matey!
"""

In [4]:
style = """American English \
in a calm and respectful tone
"""

In [5]:
prompt = f"""Translate the text \
that is delimited by triple backticks 
into a style that is {style}.
text: ```{customer_email}```
"""

print(prompt)

Translate the text that is delimited by triple backticks 
into a style that is American English in a calm and respectful tone
.
text: ```
Arrr, I be fuming that me blender lid flew off and splattered me kitchen walls with smoothie! And to make matters worse,the warranty don't cover the cost of cleaning up me kitchen. I need yer help right now, matey!
```



In [6]:
response = get_completion(prompt)
response

'I am quite frustrated that my blender lid flew off and made a mess of my kitchen walls with smoothie! To add to my frustration, the warranty does not cover the cost of cleaning up my kitchen. I kindly request your assistance at this moment, my friend.'

## Chat API: LangChain

Next, the same request is made through LangChain's abstraction over the OpenAI chat API.

In [None]:
#!pip install --upgrade langchain

In [5]:
from langchain.chat_models import ChatOpenAI

In [21]:
chat = ChatOpenAI(temperature=0.0)

In [4]:
template_string = """Translate the text \
that is delimited by triple backticks \
into a style that is {style}. \
text: ```{text}```
"""

In [5]:
from langchain.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_template(template_string)

In [6]:
prompt_template.messages[0].prompt

PromptTemplate(input_variables=['style', 'text'], output_parser=None, partial_variables={}, template='Translate the text that is delimited by triple backticks into a style that is {style}. text: ```{text}```\n', template_format='f-string', validate_template=True)

In [7]:
prompt_template.messages[0].prompt.input_variables

['style', 'text']

In [8]:
customer_style = """American English \
in a calm and respectful tone
"""

In [9]:
customer_email = """
Arrr, I be fuming that me blender lid \
flew off and splattered me kitchen walls \
with smoothie! And to make matters worse, \
the warranty don't cover the cost of \
cleaning up me kitchen. I need yer help \
right now, matey!
"""

In [10]:
customer_messages = prompt_template.format_messages(
                    style=customer_style,
                    text=customer_email)

In [11]:
print(type(customer_messages))
print(type(customer_messages[0]))

<class 'list'>
<class 'langchain.schema.messages.HumanMessage'>


In [12]:
print(customer_messages[0])

content="Translate the text that is delimited by triple backticks into a style that is American English in a calm and respectful tone\n. text: ```\nArrr, I be fuming that me blender lid flew off and splattered me kitchen walls with smoothie! And to make matters worse, the warranty don't cover the cost of cleaning up me kitchen. I need yer help right now, matey!\n```\n" additional_kwargs={} example=False


In [13]:
# Call the LLM to translate to the style of the customer message
customer_response = chat(customer_messages)

In [14]:
customer_response

AIMessage(content="I'm really frustrated that my blender lid flew off and made a mess of my kitchen walls with smoothie! And to make things even worse, the warranty doesn't cover the cost of cleaning up my kitchen. I could really use your help right now, my friend!", additional_kwargs={}, example=False)

In [15]:
service_reply = """Hey there customer, \
the warranty does not cover \
cleaning expenses for your kitchen \
because it's your fault that \
you misused your blender \
by forgetting to put the lid on before \
starting the blender. \
Tough luck! See ya!
"""

In [16]:
service_style_pirate = """\
a polite tone \
that speaks in English Pirate\
"""

In [17]:
service_messages = prompt_template.format_messages(
    style=service_style_pirate,
    text=service_reply)

print(service_messages[0].content)

Translate the text that is delimited by triple backticks into a style that is a polite tone that speaks in English Pirate. text: ```Hey there customer, the warranty does not cover cleaning expenses for your kitchen because it's your fault that you misused your blender by forgetting to put the lid on before starting the blender. Tough luck! See ya!
```



In [19]:
service_response = chat(service_messages)
print(service_response.content)

Ahoy there, matey! I regret to inform ye that the warranty be not coverin' the costs o' cleanin' yer galley, as 'tis yer own fault fer misusin' yer blender by forgettin' to secure the lid afore startin' it. Aye, tough luck, me heartie! Fare thee well!


Prompt templates are a useful abstraction: prompts can be long and detailed, and a template lets you reuse them with different inputs. LangChain also provides prompts for common operations.
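To make the idea of a template concrete, here is a minimal sketch in plain Python, with no LangChain involved. It only illustrates the concept of a reusable string with named placeholders; `format_prompt` and `input_variables` are hypothetical helper names, not LangChain functions.

```python
from string import Formatter

# A reusable prompt with two named placeholders
template_string = (
    "Translate the text into a style that is {style}. "
    "text: {text}"
)


def input_variables(template):
    """Return the placeholder names found in the template, sorted."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})


def format_prompt(template, **variables):
    """Fill the placeholders with the given values."""
    return template.format(**variables)


print(input_variables(template_string))

# The same template serves two different requests
calm_prompt = format_prompt(
    template_string,
    style="American English in a calm and respectful tone",
    text="Arrr, I be fuming!",
)
pirate_prompt = format_prompt(
    template_string,
    style="a polite tone that speaks in English Pirate",
    text="The warranty does not cover cleaning expenses.",
)
print(calm_prompt)
print(pirate_prompt)
```

`ChatPromptTemplate` adds message roles and validation on top of this basic idea.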

## Output Parsers

An output parser is a way to define what we would like the LLM output to look like, for example:

In [1]:
{
  "gift": False,
  "delivery_days": 5,
  "price_value": "pretty affordable!"
}

{'gift': False, 'delivery_days': 5, 'price_value': 'pretty affordable!'}

In [2]:
customer_review = """\
This leaf blower is pretty amazing.  It has four settings:\
candle blower, gentle breeze, windy city, and tornado. \
It arrived in two days, just in time for my wife's \
anniversary present. \
I think my wife liked it so much she was speechless. \
So far I've been the only one using it, and I've been \
using it every other morning to clear the leaves on our lawn. \
It's slightly more expensive than the other leaf blowers \
out there, but I think it's worth it for the extra features.
"""

review_template = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product \
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

Format the output as JSON with the following keys:
gift
delivery_days
price_value

text: {text}
"""

In [3]:
from langchain.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_template(review_template)
print(prompt_template)

input_variables=['text'] output_parser=None partial_variables={} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['text'], output_parser=None, partial_variables={}, template='For the following text, extract the following information:\n\ngift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.\n\ndelivery_days: How many days did it take for the product to arrive? If this information is not found, output -1.\n\nprice_value: Extract any sentences about the value or price,and output them as a comma separated Python list.\n\nFormat the output as JSON with the following keys:\ngift\ndelivery_days\nprice_value\n\ntext: {text}\n', template_format='f-string', validate_template=True), additional_kwargs={})]


In [6]:
messages = prompt_template.format_messages(text=customer_review)
chat = ChatOpenAI(temperature=0.0)
response = chat(messages)
print(response.content)

{
  "gift": false,
  "delivery_days": 2,
  "price_value": ["It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."]
}


In [7]:
type(response.content)

str

*Parse the LLM output string into a Python dictionary*

In [8]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

In [10]:
gift_schema = ResponseSchema(name="gift",
                             description="Was the item purchased\
                             as a gift for someone else? \
                             Answer True if yes,\
                             False if not or unknown.")
delivery_days_schema = ResponseSchema(name="delivery_days",
                                      description="How many days\
                                      did it take for the product\
                                      to arrive? If this \
                                      information is not found,\
                                      output -1.")
price_value_schema = ResponseSchema(name="price_value",
                                    description="Extract any\
                                    sentences about the value or \
                                    price, and output them as a \
                                    comma separated Python list.")

response_schemas = [gift_schema, 
                    delivery_days_schema,
                    price_value_schema]

In [11]:
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
output_parser

StructuredOutputParser(response_schemas=[ResponseSchema(name='gift', description='Was the item purchased                             as a gift for someone else?                              Answer True if yes,                             False if not or unknown.', type='string'), ResponseSchema(name='delivery_days', description='How many days                                      did it take for the product                                      to arrive? If this                                       information is not found,                                      output -1.', type='string'), ResponseSchema(name='price_value', description='Extract any                                    sentences about the value or                                     price, and output them as a                                     comma separated Python list.', type='string')])

In [12]:
format_instructions = output_parser.get_format_instructions()
format_instructions

'The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"gift": string  // Was the item purchased                             as a gift for someone else?                              Answer True if yes,                             False if not or unknown.\n\t"delivery_days": string  // How many days                                      did it take for the product                                      to arrive? If this                                       information is not found,                                      output -1.\n\t"price_value": string  // Extract any                                    sentences about the value or                                     price, and output them as a                                     comma separated Python list.\n}\n```'

In [15]:
review_template_2 = """\
For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? \
Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the product\
to arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,\
and output them as a comma separated Python list.

text: {text}

{format_instructions}
"""

prompt = ChatPromptTemplate.from_template(template=review_template_2)

messages = prompt.format_messages(text=customer_review, 
                                format_instructions=format_instructions)

In [16]:
print(messages[0].content)

For the following text, extract the following information:

gift: Was the item purchased as a gift for someone else? Answer True if yes, False if not or unknown.

delivery_days: How many days did it take for the productto arrive? If this information is not found, output -1.

price_value: Extract any sentences about the value or price,and output them as a comma separated Python list.

text: This leaf blower is pretty amazing.  It has four settings:candle blower, gentle breeze, windy city, and tornado. It arrived in two days, just in time for my wife's anniversary present. I think my wife liked it so much she was speechless. So far I've been the only one using it, and I've been using it every other morning to clear the leaves on our lawn. It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features.


The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```

In [17]:
response = chat(messages)
print(response.content)

```json
{
	"gift": false,
	"delivery_days": "2",
	"price_value": "It's slightly more expensive than the other leaf blowers out there, but I think it's worth it for the extra features."
}
```


In [18]:
output_dict = output_parser.parse(response.content)

In [19]:
type(output_dict)

dict

In [20]:
output_dict.get('delivery_days')

'2'
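Note that `delivery_days` comes back as the string `'2'`, because every schema above has `type='string'`. The following is a rough sketch of what parsing such a response involves, written with the standard library only. It is an illustration under my own assumptions, not LangChain's implementation; `parse_fenced_json` is a hypothetical helper.

```python
import json
import re

# Three backticks, built programmatically so the marker is easy to embed here
FENCE = "`" * 3


def parse_fenced_json(text):
    """Extract the first json-fenced block from the text and decode it."""
    pattern = re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("no json code block found in the model output")
    return json.loads(match.group(1))


# A shortened stand-in for response.content
raw_output = FENCE + 'json\n{\n\t"gift": false,\n\t"delivery_days": "2"\n}\n' + FENCE

parsed = parse_fenced_json(raw_output)

# Convert the string field to an integer before using it numerically
parsed["delivery_days"] = int(parsed["delivery_days"])
print(parsed)
```

Converting the field explicitly keeps later arithmetic (for example, comparing delivery times) from failing on a string value.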

## LLMChain

In [1]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

In [2]:
import pandas as pd
df = pd.read_csv('Data.csv')
df.head()

Unnamed: 0,Product,Review
0,Queen Size Sheet Set,I ordered a king size set. My only criticism w...
1,Waterproof Phone Pouch,"I loved the waterproof sac, although the openi..."
2,Luxury Air Mattress,This mattress had a small hole in the top of i...
3,Pillows Insert,This is the best throw pillow fillers on Amazo...
4,Milk Frother Handheld\n,I loved this product. But they only seem to l...


In [3]:
llm = ChatOpenAI(temperature=0.9)

In [4]:
prompt = ChatPromptTemplate.from_template(
    "What is the best name to describe \
    a company that makes {product}?"
)

In [5]:
chain = LLMChain(llm=llm, prompt=prompt)

In [6]:
product = "Queen Size Sheet Set"
chain.run(product)

'Royal Comfort Linens'

## Sequential Chains

The idea is to combine multiple chains so that the output of one chain is the input of the next. There are two types of sequential chains:
+ SimpleSequentialChain: Single input/output
+ SequentialChain: multiple inputs/outputs
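The single input/output case can be sketched in plain Python as function composition. This is only an illustration of the data flow; the two step functions are hypothetical stand-ins for LLM calls and `run_simple_sequence` is not a LangChain function.

```python
def run_simple_sequence(steps, first_input):
    """Feed the output of each step into the next one and return the last output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


# Hypothetical stand-ins for the two LLM-backed chains
def name_company(product):
    return f"{product} Co."


def describe_company(company_name):
    return f"{company_name} makes quality goods."


result = run_simple_sequence(
    [name_company, describe_company],
    "Queen Size Sheet Set",
)
print(result)
```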

### SimpleSequentialChain

In [7]:
from langchain.chains import SimpleSequentialChain

In [8]:
llm = ChatOpenAI(temperature=0.9)

# prompt template 1
first_prompt = ChatPromptTemplate.from_template(
    "What is the best name to describe \
    a company that makes {product}?"
)

# Chain 1
chain_one = LLMChain(llm=llm, prompt=first_prompt)

In [9]:
# prompt template 2
second_prompt = ChatPromptTemplate.from_template(
    "Write a 20 words description for the following \
    company:{company_name}"
)
# chain 2
chain_two = LLMChain(llm=llm, prompt=second_prompt)

In [10]:
overall_simple_chain = SimpleSequentialChain(chains=[chain_one, chain_two],
                                             verbose=True
                                            )

In [11]:
overall_simple_chain.run(product)



> Entering new  chain...
RegalRest Linens
RegalRest Linens offers luxurious and high-quality linens for hotels, resorts, and spas, ensuring ultimate comfort for guests.

> Finished chain.


'RegalRest Linens offers luxurious and high-quality linens for hotels, resorts, and spas, ensuring ultimate comfort for guests.'

### SequentialChain

In [12]:
from langchain.chains import SequentialChain

In [14]:
llm = ChatOpenAI(temperature=0.9)

# prompt template 1: translate to english
first_prompt = ChatPromptTemplate.from_template(
    "Translate the following review to english:"
    "\n\n{Review}"
)
# chain 1: input= Review and output= English_Review
chain_one = LLMChain(llm=llm, prompt=first_prompt, 
                     output_key="English_Review"
                    )

In [15]:
second_prompt = ChatPromptTemplate.from_template(
    "Can you summarize the following review in 1 sentence:"
    "\n\n{English_Review}"
)
# chain 2: input= English_Review and output= summary
chain_two = LLMChain(llm=llm, prompt=second_prompt, 
                     output_key="summary"
                    )


In [16]:
# prompt template 3: detect the language of the review
third_prompt = ChatPromptTemplate.from_template(
    "What language is the following review:\n\n{Review}"
)
# chain 3: input= Review and output= language
chain_three = LLMChain(llm=llm, prompt=third_prompt,
                       output_key="language"
                      )

In [17]:
# prompt template 4: follow up message
fourth_prompt = ChatPromptTemplate.from_template(
    "Write a follow up response to the following "
    "summary in the specified language:"
    "\n\nSummary: {summary}\n\nLanguage: {language}"
)
# chain 4: input= summary, language and output= followup_message
chain_four = LLMChain(llm=llm, prompt=fourth_prompt,
                      output_key="followup_message"
                     )

In [18]:
# overall_chain: input= Review 
# and output= English_Review,summary, followup_message
overall_chain = SequentialChain(
    chains=[chain_one, chain_two, chain_three, chain_four],
    input_variables=["Review"],
    output_variables=["English_Review", "summary","followup_message"],
    verbose=True
)

In [19]:
print(df.Review[5])
review = df.Review[5]
overall_chain(review)

Je trouve le goût médiocre. La mousse ne tient pas, c'est bizarre. J'achète les mêmes dans le commerce et le goût est bien meilleur...
Vieux lot ou contrefaçon !?


> Entering new  chain...

> Finished chain.


{'Review': "Je trouve le goût médiocre. La mousse ne tient pas, c'est bizarre. J'achète les mêmes dans le commerce et le goût est bien meilleur...\nVieux lot ou contrefaçon !?",
 'English_Review': "I find the taste mediocre. The foam doesn't hold, it's strange. I buy the same ones from the store and the taste is much better... Old stock or counterfeit!?",
 'summary': 'The reviewer is disappointed with the taste of the product and suggests that it may be old stock or counterfeit compared to the same ones bought from the store.',
 'followup_message': "Réponse de suivi:\n\nCher(e) client(e),\n\nNous sommes désolés d'apprendre que vous n'êtes pas satisfait(e) du goût de notre produit. Nous tenons à vous assurer que notre entreprise s'engage à fournir des produits de haute qualité à nos clients.\n\nNous comprenons votre préoccupation concernant la possibilité que le produit soit périmé ou falsifié. Nous souhaiterions enquêter davantage sur cette question pour mieux comprendre ce qui s'est p
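The way the four chains share named inputs and outputs can be sketched with a plain dictionary that accumulates results. This is a simplified illustration under my own assumptions: the step functions are hypothetical stand-ins for LLM calls, and unlike the run above, this sketch returns only the requested output keys.

```python
def run_sequence(steps, inputs, output_keys):
    """Run steps in order; each step reads named values and writes one named value."""
    state = dict(inputs)
    for func, input_keys, output_key in steps:
        state[output_key] = func(*(state[key] for key in input_keys))
    return {key: state[key] for key in output_keys}


# Hypothetical stand-ins for the four LLM-backed chains
def to_english(review):
    return f"EN({review})"


def summarize(english_review):
    return f"SUM({english_review})"


def detect_language(review):
    return "French"


def follow_up(summary, language):
    return f"[{language}] reply to {summary}"


# Each entry: (function, names of its inputs, name of its output)
steps = [
    (to_english, ["Review"], "English_Review"),
    (summarize, ["English_Review"], "summary"),
    (detect_language, ["Review"], "language"),
    (follow_up, ["summary", "language"], "followup_message"),
]

result = run_sequence(
    steps,
    {"Review": "Bonjour"},
    ["English_Review", "summary", "followup_message"],
)
print(result)
```

The key names must line up exactly: a step can only read a value that is either an initial input or the output key of an earlier step.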

## Router Chain

In [20]:
physics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise\
and easy to understand manner. \
When you don't know the answer to a question you admit\
that you don't know.

Here is a question:
{input}"""


math_template = """You are a very good mathematician. \
You are great at answering math questions. \
You are so good because you are able to break down \
hard problems into their component parts, 
answer the component parts, and then put them together\
to answer the broader question.

Here is a question:
{input}"""

history_template = """You are a very good historian. \
You have an excellent knowledge of and understanding of people,\
events and contexts from a range of historical periods. \
You have the ability to think, reflect, debate, discuss and \
evaluate the past. You have a respect for historical evidence\
and the ability to make use of it to support your explanations \
and judgements.

Here is a question:
{input}"""


computerscience_template = """ You are a successful computer scientist.\
You have a passion for creativity, collaboration,\
forward-thinking, confidence, strong problem-solving capabilities,\
understanding of theories and algorithms, and excellent communication \
skills. You are great at answering coding questions. \
You are so good because you know how to solve a problem by \
describing the solution in imperative steps \
that a machine can easily interpret and you know how to \
choose a solution that has a good balance between \
time complexity and space complexity. 

Here is a question:
{input}"""

In [21]:
prompt_infos = [
    {
        "name": "physics", 
        "description": "Good for answering questions about physics", 
        "prompt_template": physics_template
    },
    {
        "name": "math", 
        "description": "Good for answering math questions", 
        "prompt_template": math_template
    },
    {
        "name": "History", 
        "description": "Good for answering history questions", 
        "prompt_template": history_template
    },
    {
        "name": "computer science", 
        "description": "Good for answering computer science questions", 
        "prompt_template": computerscience_template
    }
]

In [22]:
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain,RouterOutputParser
from langchain.prompts import PromptTemplate

In [23]:
llm = ChatOpenAI(temperature=0)

In [24]:
destination_chains = {}
for p_info in prompt_infos:
    name = p_info["name"]
    prompt_template = p_info["prompt_template"]
    prompt = ChatPromptTemplate.from_template(template=prompt_template)
    chain = LLMChain(llm=llm, prompt=prompt)
    destination_chains[name] = chain  
    
destinations = [f"{p['name']}: {p['description']}" for p in prompt_infos]
destinations_str = "\n".join(destinations)

In [25]:
destinations_str

'physics: Good for answering questions about physics\nmath: Good for answering math questions\nHistory: Good for answering history questions\ncomputer science: Good for answering computer science questions'

In [26]:
default_prompt = ChatPromptTemplate.from_template("{input}")
default_chain = LLMChain(llm=llm, prompt=default_prompt)

In [27]:
MULTI_PROMPT_ROUTER_TEMPLATE = """Given a raw text input to a \
language model select the model prompt best suited for the input. \
You will be given the names of the available prompts and a \
description of what the prompt is best suited for. \
You may also revise the original input if you think that revising\
it will ultimately lead to a better response from the language model.

<< FORMATTING >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
{{{{
    "destination": string \ name of the prompt to use or "DEFAULT"
    "next_inputs": string \ a potentially modified version of the original input
}}}}
```

REMEMBER: "destination" MUST be one of the candidate prompt \
names specified below OR it can be "DEFAULT" if the input is not\
well suited for any of the candidate prompts.
REMEMBER: "next_inputs" can just be the original input \
if you don't think any modifications are needed.

<< CANDIDATE PROMPTS >>
{destinations}

<< INPUT >>
{{input}}

<< OUTPUT (remember to include the ```json)>>"""

In [28]:
router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(
    destinations=destinations_str
)
router_prompt = PromptTemplate(
    template=router_template,
    input_variables=["input"],
    output_parser=RouterOutputParser(),
)

router_chain = LLMRouterChain.from_llm(llm, router_prompt)

In [29]:
chain = MultiPromptChain(router_chain=router_chain, 
                         destination_chains=destination_chains, 
                         default_chain=default_chain, verbose=True
                        )

In [30]:
chain.run("What is black body radiation?")



> Entering new  chain...
physics: {'input': 'What is black body radiation?'}
> Finished chain.


'Black body radiation refers to the electromagnetic radiation emitted by an object that absorbs all incident radiation and reflects or transmits none. It is called "black body" because it absorbs all wavelengths of light, appearing black at room temperature. \n\nAccording to Planck\'s law, black body radiation is characterized by a continuous spectrum of wavelengths and intensities, which depend on the temperature of the object. As the temperature increases, the peak intensity of the radiation shifts to shorter wavelengths, resulting in a change in color from red to orange, yellow, white, and eventually blue at very high temperatures.\n\nBlack body radiation is a fundamental concept in physics and has various applications, including understanding the behavior of stars, explaining the ultraviolet catastrophe, and developing technologies like incandescent light bulbs and thermal imaging devices.'
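The routing step itself is simple once the router's answer has been parsed: look up the named destination and fall back to the default when it is missing. Below is a minimal sketch in plain Python. It assumes the router output is already a bare JSON string (the real prompt asks for a fenced markdown snippet), and `route` and the handler functions are hypothetical.

```python
import json


def route(router_output, destinations, default):
    """Dispatch the (possibly revised) input to the handler the router selected."""
    decision = json.loads(router_output)
    handler = destinations.get(decision["destination"], default)
    return handler(decision["next_inputs"])


# Hypothetical stand-ins for the destination chains and the default chain
destinations = {
    "physics": lambda question: f"physics answer to: {question}",
    "math": lambda question: f"math answer to: {question}",
}


def default_handler(question):
    return f"general answer to: {question}"


router_output = (
    '{"destination": "physics", '
    '"next_inputs": "What is black body radiation?"}'
)
answer = route(router_output, destinations, default_handler)
print(answer)

# "DEFAULT" is not a key in destinations, so the default handler is used
fallback = route(
    '{"destination": "DEFAULT", "next_inputs": "Hello"}',
    destinations,
    default_handler,
)
print(fallback)
```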

## Q&A over Documents

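Combine an LLM with your own data: the documents are indexed in a vector store, and the entries most similar to a query are passed to the model.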

In [31]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [37]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')
loader

<langchain.document_loaders.csv_loader.CSVLoader at 0x1b5940ac650>

In [33]:
from langchain.indexes import VectorstoreIndexCreator

In [40]:
# pip install docarray

In [41]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.01s/it]


In [43]:
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [44]:
response = index.query(query)

In [45]:
display(Markdown(response))



| Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men's Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men's TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% Nylon, 29% Polyester, 100% Polyester knit mesh, wrinkle resistant, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, wicks moisture, fits comfortably over swimsuit, abrasion resistant |

All four shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men's Plaid Trop

In [46]:
docs = loader.load()
docs[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

## Document Loading

In [19]:
from langchain.document_loaders import PyPDFLoader

In [20]:
loader_one = PyPDFLoader("test.pdf")

In [21]:
pages = loader_one.load()

In [22]:
len(pages)

3

In [23]:
page = pages[0]

In [24]:
page.metadata

{'source': 'test.pdf', 'page': 0}

In [25]:
page.page_content[500:1500]

'itization, multiple ways of attacking\nthrough code have emerged. Allowing cybercriminals to access\nconfidential information, impersonate identities, steal money,\naccess databases, among others. Our project approaches the 3 of\n10 vulnerabilities identified by OWASP as the most recognized\nand usual in web applications, to identify and evaluate them at\nthe JavaScript code level. For this, a machine learning model was\nproposed, detecting the frequency of occurrence of a word with\nthe identified vulnerability; resulting in a model with an accuracy\nof 89%. Finally, an extension was implemented in Visual Studio\nCode to read in real time the code that the person is writing in\norder to identify which vulnerabilities it has.\nIndex Terms —OWASP, Front-End Vulnerabilities, Machine\nLearning, Artificial Intelligence.\nI. INTRODUCTION\nFront-End developers do not have enough tools that en-\ncompass the complex diversity of cybersecurity challenges.\nPart of that inadequacy comes from cu

In [1]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

Keep in mind that OpenAI API requests are limited and billed. The next request will cost USD $0.76.

In [8]:
url = "https://youtu.be/_kr-XOeZmW4"
loader = GenericLoader(
    YoutubeAudioLoader([url],"."),
    OpenAIWhisperParser())
docs = loader.load()


[youtube] Extracting URL: https://youtu.be/_kr-XOeZmW4
[youtube] _kr-XOeZmW4: Downloading webpage
[youtube] _kr-XOeZmW4: Downloading ios player API JSON
[youtube] _kr-XOeZmW4: Downloading android player API JSON
[youtube] _kr-XOeZmW4: Downloading m3u8 information
[youtube] _kr-XOeZmW4: Downloading MPD manifest
[info] _kr-XOeZmW4: Downloading 1 format(s): 140
[download] ytcracker - enter my world (produced by amplitude problem).m4a has already been downloaded
[download] 100% of    2.98MiB
[ExtractAudio] Not converting audio ytcracker - enter my world (produced by amplitude problem).m4a; file is already in target format m4a
Transcribing part 1!


In [17]:
docs[0].page_content

"BASS BOOSTED MUSIC Look around what do you see, couple handfuls of hackers with their black hat mentality Most of us got records cause we feel compelled to flex All compulsion for these systems that we probably born to wreck My respect to my hackahgotten ex Hacking our election process like a boss bitch This type of terrorism really look terrible Ain't no bloodshed, nobody countin' for some barrels Had a few here, so check a few there Load up all your Twitter bots, I'll teach a crew there They're the marmade, foldin' up the shop origami Diamond, betting on the horses, betting on the jockeys Unh, unh, unh, unh Stick a fork in a book, a process, stick a fork in em Database pushed right to the open Google dork in em, stick a fork in a book, a process, stick a fork in em End of my world, there's no bake as it look There's dark artists doing modesty, no page in the book Can teach you anything, we learn it, no mistake and we cook And we be chopping up the blades like a fleet of Chinooks And

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("document")
docs = loader.load()
docs[0].page_content[:200]

## Document Splitting

In [28]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap = 150,
    length_function=len
)

In [43]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [47]:
text_splitter.split_text(docs[0].page_content)[:20]

['B',
 'ASS',
 ' B',
 'OO',
 'ST',
 'ED',
 ' MUS',
 'IC',
 ' Look',
 ' around',
 ' what',
 ' do',
 ' you',
 ' see',
 ',',
 ' couple',
 ' handful',
 's',
 ' of',
 ' hackers']

In [44]:
docs_splits = text_splitter.split_documents(docs)

In [49]:
text_splitter.split_documents(pages)[:4]

[Document(page_content='C', metadata={'source': 'test.pdf', 'page': 0}),
 Document(page_content='ognitive', metadata={'source': 'test.pdf', 'page': 0}),
 Document(page_content=' Solution', metadata={'source': 'test.pdf', 'page': 0}),
 Document(page_content=' for', metadata={'source': 'test.pdf', 'page': 0})]

In [37]:
print(len(docs))
print(len(docs_splits))

1
1


In [8]:
from langchain.embeddings.openai import OpenAIEmbeddings
import os
import openai
from langchain.document_loaders import PyPDFLoader

In [10]:
loaders = [
    PyPDFLoader("test.pdf"),
    PyPDFLoader("semi-automated.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [13]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [14]:
splits = text_splitter.split_documents(docs)

In [15]:
len(splits)

88

### Embeddings

An embedding is a numerical representation that captures the semantic meaning of a piece of text.

+ Embedding vector captures content/meaning
+ Text with similar content will have similar vectors

In [19]:
openai.api_key = key_g

In [24]:
embedding = OpenAIEmbeddings(openai_api_key=key_g)

In [25]:
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [26]:
import numpy as np

In [27]:
np.dot(embedding1, embedding2)

0.963208056126645

In [28]:
np.dot(embedding2, embedding3)

0.7596001131741936

In [29]:
np.dot(embedding1, embedding3)

0.7709997651294672
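
The dot products above work as similarity scores because OpenAI's embeddings are documented as being normalized to unit length, in which case the dot product equals the cosine similarity. For vectors that are not normalized, the cosine has to be computed explicitly. A minimal sketch, using made-up 3-dimensional vectors in place of real embeddings:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real embeddings (values are invented).
dogs = [0.9, 0.1, 0.0]
canines = [0.8, 0.2, 0.0]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(dogs, canines))  # close to 1: similar direction
print(cosine_similarity(dogs, weather))  # close to 0: nearly orthogonal
```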

## Vectorstores

A vector store holds vector representations. The idea is to run each chunk from a set of documents through an embedding model and save both the embedding vector and the original chunk. Once the vector index is ready, an incoming query is embedded the same way and the index returns the n most similar entries. Those entries become the input to the LLM.
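
The flow described above can be sketched in a few lines of NumPy. Everything here is a toy written for these notes: `ToyEmbedder` is a hypothetical bag-of-words stand-in for a real embedding model, and `TinyVectorStore` only borrows the method name `similarity_search`; it is not LangChain's implementation.

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical stand-in for an embedding model: unit-length word counts."""

    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.index = {w: i for i, w in enumerate(words)}

    def embed(self, text):
        vec = np.zeros(len(self.index))
        for w in text.lower().split():
            if w in self.index:  # words outside the vocabulary are ignored
                vec[self.index[w]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec


class TinyVectorStore:
    """Keeps each original chunk next to its embedding vector."""

    def __init__(self, chunks, embed):
        self.chunks = list(chunks)
        self.embed = embed
        self.vectors = np.array([embed(c) for c in self.chunks])

    def similarity_search(self, query, k=2):
        scores = self.vectors @ self.embed(query)  # one dot product per chunk
        top = np.argsort(scores)[::-1][:k]         # indices of the k best scores
        return [self.chunks[i] for i in top]


chunks = [
    "sun shirt with upf 50 protection",
    "down jacket with hood for cold weather",
    "striped pullover set for lounging",
]
store = TinyVectorStore(chunks, ToyEmbedder(chunks).embed)
results = store.similarity_search("shirt with sun protection", k=2)
context = "\n".join(results)  # this text would be placed in the LLM prompt
print(results)
```

A real vector store replaces the word counts with model embeddings and the brute-force dot product with an index, but the stored pair (vector, original chunk) and the top-k lookup are the same idea.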

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')

docs = loader.load()

In [3]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [4]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [5]:
print(len(embed))

1536


In [6]:
print(embed[:5])

[-0.021913960576057434, 0.006774206645786762, -0.018190348520874977, -0.039148248732089996, -0.014089343138039112]


Let's save the vectors in the vector store.

In [7]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.42s/it]


With this vector store we can find the documents most similar to a query.

In [8]:
query = "Please suggest a shirt with sunblocking"

In [9]:
docs = db.similarity_search(query)

In [10]:
len(docs)

4

In [11]:
docs[0]

Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})

To implement Q&A we need a retriever, which receives a query and returns documents.

In [12]:
retriever = db.as_retriever()

In [13]:
llm = ChatOpenAI(temperature = 0.0)

In [21]:
qdocs = "".join(doc.page_content for doc in docs[:2])
qdocs

': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.: 374\nname: Men\'s Plaid Tropic Shirt, Short-Sleeve\ndescription: Our Ultracomfortable sun protection is rated to UPF 50+, helping you stay cool and dry. Originally designed for fishing, this lightest hot-weather shirt offers UPF 50+ c

In [None]:
# response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
# shirts with sun protection in a table in markdown and summarize each one.") 
# display(Markdown(response))

In [23]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [24]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [26]:
response = qa_stuff.run(query)

In [None]:
display(Markdown(response))

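In [None]:
from langchain.indexes import VectorstoreIndexCreator

# Build the index first; the query cell below depends on it
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

In [None]:
response = index.query(query, llm=llm)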

*Stuff method*

Stuffing is the simplest method: all the data goes into the prompt as context for the language model.
+ Pros: It makes a single call to the LLM, and the LLM has access to all the data at once.
+ Cons: LLMs have a limited context length, so for large documents or many documents this will not work because the prompt exceeds the context length.

When the question has to be answered over many chunks, other techniques are needed. Some of them are:
+ *Map_reduce* takes all the chunks, passes each one along with the question to a language model, gets back a response, and then uses another language model call to summarize the individual responses into a final answer. Expensive.
+ *Refine* also loops over many documents, but each call builds on the answer from the previous document. It is good for combining information and building up an answer, but it is not fast because each call depends on the previous one.
+ *Map_rerank* is a more experimental method: it makes a single call to the model for each document, asks it to return a score as well, and selects the answer with the highest score. It relies on the language model to know what the score should be. Expensive.
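
The four strategies differ mainly in how many model calls they make and in what each call sees. The sketch below illustrates that with plain functions and fake models, so it runs without an API key. The helper names (`build_prompt`, `fake_llm`, `fake_scored_llm`) and the prompt wording are assumptions made for these notes, not LangChain's internals.

```python
from typing import Tuple


def build_prompt(question: str, context: str) -> str:
    return f"Context:\n{context}\n\nQuestion: {question}"


def stuff(llm, question, docs):
    """One call: every document goes into a single prompt."""
    return llm(build_prompt(question, "\n\n".join(docs)))


def map_reduce(llm, question, docs):
    """One call per document, plus one call to combine the partial answers."""
    partials = [llm(build_prompt(question, d)) for d in docs]
    return llm(build_prompt(question, "\n".join(partials)))


def refine(llm, question, docs):
    """Sequential calls: each one sees the answer built so far."""
    answer = ""
    for d in docs:
        answer = llm(build_prompt(question, f"Answer so far: {answer}\nNew document: {d}"))
    return answer


def map_rerank(scored_llm, question, docs):
    """One call per document; keep the answer with the highest score."""
    scored = [scored_llm(build_prompt(question, d)) for d in docs]
    return max(scored, key=lambda pair: pair[1])[0]


# Fake models (hypothetical) that record or score instead of calling an API.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer {len(calls)}"


def fake_scored_llm(prompt: str) -> Tuple[str, int]:
    document = prompt.splitlines()[1]       # the line right after "Context:"
    return document, document.count("sun")  # made-up relevance score


docs = ["sun shirt, UPF 50", "down jacket", "sun hat and sun gloves"]
question = "Which items protect the skin?"

stuff(fake_llm, question, docs)
stuff_calls = len(calls)
map_reduce(fake_llm, question, docs)
map_reduce_calls = len(calls) - stuff_calls
refine(fake_llm, question, docs)
refine_calls = len(calls) - stuff_calls - map_reduce_calls
best = map_rerank(fake_scored_llm, question, docs)
print(stuff_calls, map_reduce_calls, refine_calls, best)
```

With three documents, stuffing costs one call, map_reduce four, and refine three. The map_reduce calls on individual documents are independent and could run in parallel, while refine is inherently sequential.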


## Evaluation

### Create our Q&A application

In [1]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [3]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding="utf-8")
data = loader.load()

In [4]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.72s/it]


In [5]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

### Coming up with test datapoints

In [6]:
data[10]

Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})

In [7]:
data[11]

Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})

### Hard-coded examples

In [None]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set "
                 "have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty "
                 "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

### LLM-Generated examples

In [8]:
from langchain.evaluation.qa import QAGenerateChain

In [9]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [10]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[10:13]]
)



## SQL agents

In [1]:
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI

In [8]:
!pip install pandas


Collecting pandas
  Using cached pandas-2.0.3-cp311-cp311-win_amd64.whl (10.6 MB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.0.3 pytz-2023.3 tzdata-2023.3


In [9]:
import sqlite3
import pandas as pd
# Connect to the SQLite database
conn = sqlite3.connect("FPA_FOD_20170508.sqlite")
# Use pandas to query a table
fires = pd.read_sql_query("SELECT * FROM Fires", conn)
fires.head()

Unnamed: 0,OBJECTID,FOD_ID,FPA_ID,SOURCE_SYSTEM_TYPE,SOURCE_SYSTEM,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,SOURCE_REPORTING_UNIT,SOURCE_REPORTING_UNIT_NAME,...,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE,COUNTY,FIPS_CODE,FIPS_NAME,Shape
0,1,1,FS-1418826,FED,FS-FIRESTAT,FS,USCAPNF,Plumas National Forest,511,Plumas National Forest,...,A,40.036944,-121.005833,5.0,USFS,CA,63,63,Plumas,b'\x00\x01\xad\x10\x00\x00\xe8d\xc2\x92_@^\xc0...
1,2,2,FS-1418827,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,A,38.933056,-120.404444,5.0,USFS,CA,61,61,Placer,b'\x00\x01\xad\x10\x00\x00T\xb6\xeej\xe2\x19^\...
2,3,3,FS-1418835,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,A,38.984167,-120.735556,13.0,STATE OR PRIVATE,CA,17,17,El Dorado,b'\x00\x01\xad\x10\x00\x00\xd0\xa5\xa0W\x13/^\...
3,4,4,FS-1418845,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,A,38.559167,-119.913333,5.0,USFS,CA,3,3,Alpine,b'\x00\x01\xad\x10\x00\x00\x94\xac\xa3\rt\xfa]...
4,5,5,FS-1418847,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,503,Eldorado National Forest,...,A,38.559167,-119.933056,5.0,USFS,CA,3,3,Alpine,b'\x00\x01\xad\x10\x00\x00@\xe3\xaa.\xb7\xfb]\...


In [None]:
# Note (not executable Python): SQL Server connection string and Windows user
# Server=localhost\SQLEXPRESS;Database=master;Trusted_Connection=True;
# MSI\Usuario

In [14]:
import sqlalchemy.engine.url as url

url.make_url('sqlite:////D:/Github/Data-Science/openaisite.FPA_FOD_20170508.sqlite')

sqlite:////D:/Github/Data-Science/openaisite.FPA_FOD_20170508.sqlite

In [None]:
# Note (not executable Python): mysqlsampledatabase.sql

+ https://js.langchain.com/docs/modules/chains/other_chains/sql
+ https://blog.futuresmart.ai/langchain-sql-agents-openai-llms-query-database-using-natural-language
+ https://python.langchain.com/docs/modules/chains/popular/sqlite