# Personal Practice Section

# LangChain Cookbook 👨‍🍳👩‍🍳

*This cookbook is based off the [LangChain Conceptual Documentation](https://docs.langchain.com/docs/)*

**Goal:** Provide an introductory understanding of the components and use cases of LangChain via [ELI5](https://www.dictionary.com/e/slang/eli5/#:~:text=ELI5%20is%20short%20for%20%E2%80%9CExplain,a%20complicated%20question%20or%20problem.) examples and code snippets. For use cases check out [part 2](https://github.com/gkamradt/langchain-tutorials/blob/main/LangChain%20Cookbook%20Part%202%20-%20Use%20Cases.ipynb). See [video tutorial](https://www.youtube.com/watch?v=2xxziIWmaSA) of this notebook.


**Links:**
* [LC Conceptual Documentation](https://docs.langchain.com/docs/)
* [LC Python Documentation](https://python.langchain.com/en/latest/)
* [LC Javascript/Typescript Documentation](https://js.langchain.com/docs/)
* [LC Discord](https://discord.gg/6adMQxSpJS)
* [www.langchain.com](https://langchain.com/)
* [LC Twitter](https://twitter.com/LangChainAI)


### **What is LangChain?**
> LangChain is a framework for developing applications powered by language models.

**~~TL~~DR**: LangChain makes the complicated parts of working & building with AI models easier. It helps do this in two ways:

1. **Integration** - Bring external data, such as your files, other applications, and API data, to your LLMs
2. **Agency** - Allow your LLMs to interact with their environment via decision making. Use LLMs to help decide which action to take next

### **Why LangChain?**
1. **Components** - LangChain makes it easy to swap out abstractions and components necessary to work with language models.

2. **Customized Chains** - LangChain provides out of the box support for using and customizing 'chains' - a series of actions strung together.
   >> Need to learn to customize chains

3. **Speed 🚢** - This team ships insanely fast. You'll be up to date with the latest LLM features.

4. **Community 👥** - Wonderful Discord and community support, meetups, hackathons, etc.

Though LLMs can be straightforward (text-in, text-out) you'll quickly run into friction points that LangChain helps with once you develop more complicated applications.

*Note: This cookbook will not cover all aspects of LangChain. Its contents have been curated to get you to building & impact as quickly as possible. For more, please check out [LangChain Conceptual Documentation](https://docs.langchain.com/docs/)*

*Update Oct '23: This notebook has been expanded from its original form*

You'll need an OpenAI API key to follow this tutorial. You can set it as an environment variable, put it in an .env file where this Jupyter notebook lives, or insert it below where 'YourAPIKey' is. If you have questions on this, put these instructions into [ChatGPT](https://chat.openai.com/).

In [107]:
from dotenv import load_dotenv
import os

load_dotenv()

openai_api_key=os.getenv('OPENAI_API_KEY', 'YourAPIKey')
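
Because the cell above falls back to the placeholder string when the variable is missing, a misnamed variable only shows up later as an authentication error. A minimal sketch of failing fast instead, in plain Python; `require_api_key` and `PLACEHOLDER` are hypothetical names for illustration, not part of LangChain or the OpenAI client:

```python
import os

PLACEHOLDER = "YourAPIKey"  # same placeholder string used in the cell above


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `var_name`, or raise if it is missing or still the placeholder."""
    key = os.getenv(var_name, PLACEHOLDER)
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"Set {var_name} in your environment or .env file first")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would surface a missing key before any model call is made.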

In [9]:
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")

completion = openai.ChatCompletion.create(
  model="gpt-4",
  messages=[
    {"role": "system", "content": "You are a helpful AI assistant"},
    {"role": "user", "content": "Who is Jose Mourinho"}
  ]
)

print(completion.choices[0].message)
print(completion)

{
  "role": "assistant",
  "content": "Jose Mourinho is a renowned Portuguese professional football coach and former player. He was born on January 26, 1963, in Set\u00fabal, Portugal. Widely considered one of the best football managers in the world, Mourinho has managed several highly prestigious football clubs, including Porto (2002-2004), Chelsea (2004-2007, 2013-2015), Inter Milan (2008-2010), Real Madrid (2010-2013), and Manchester United (2016-2018). Mourinho is known for his tactical prowess and his ability to organize his teams effectively. He's also known for being an emotional and somewhat controversial figure in the football world. As of October 2021, he is the manager of A.S. Roma."
}
{
  "id": "chatcmpl-8G4R1LNMudhqCvjcgEDnu053R79eO",
  "object": "chat.completion",
  "created": 1698840987,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Jose Mourinho is a renowned Portuguese professional fo

# LangChain Components

## Schema - Nuts and Bolts of working with Large Language Models (LLMs)

### **Text**
The natural language way to interact with LLMs

In [3]:
# You'll be working with simple strings (that'll soon grow in complexity!)
my_text = "What day comes after Friday?"
my_text

'What day comes after Friday?'

### **Chat Messages**
Like text, but specified with a message type (System, Human, AI)

* **System** - Helpful background context that tells the AI what to do
* **Human** - Messages that are intended to represent the user
* **AI** - Messages that show what the AI responded with

For more, see OpenAI's [documentation](https://platform.openai.com/docs/guides/chat/introduction)

In [12]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage

# This is the language model we'll use. We'll talk about what we're doing below in the next section
# `model` and `model_name` refer to the same setting, so it is passed only once
chat = ChatOpenAI(model='gpt-4-0613', openai_api_key=openai_api_key)

In [10]:
# help(chat) #langchain.chat_models.openai object

Now let's create a few messages that simulate a chat experience with a bot

In [13]:
chat(
    [
        SystemMessage(content="You are a helpful AI agent like SIRI."),
        HumanMessage(content="What language model are u?")
    ]
)

AIMessage(content="I am based on OpenAI's GPT-3 language model.")

You can also pass more chat history w/ responses from the AI

In [14]:
chat(
    [
        SystemMessage(content="You are a nice AI bot that helps a user figure out where to travel in one short sentence"),
        HumanMessage(content="I like the beaches where should I go?"),
        AIMessage(content="You should go to Nice, France"),
        HumanMessage(content="What else should I do when I'm there?")
    ]
)

AIMessage(content='You should visit the iconic Promenade des Anglais, explore the Old Town (Vieille Ville), and enjoy local cuisine in charming cafes.')

You can also exclude the system message if you want

In [6]:
chat(
    [
        HumanMessage(content="What day comes after Thursday?")
    ]
)

AIMessage(content='Friday')
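
The three message types line up with the role names used in the raw OpenAI call earlier in this notebook (`system`, `user`, `assistant`). A minimal sketch of that mapping in plain Python; `ROLE_BY_TYPE` and `to_openai_messages` are hypothetical names for illustration, not LangChain APIs:

```python
# Mapping between the message types above and the role names in the raw OpenAI payload
ROLE_BY_TYPE = {"system": "system", "human": "user", "ai": "assistant"}


def to_openai_messages(history: list[tuple[str, str]]) -> list[dict]:
    """Convert (message_type, content) pairs into role/content dicts."""
    return [{"role": ROLE_BY_TYPE[kind], "content": text} for kind, text in history]


history = [
    ("system", "You are a nice AI bot that helps a user figure out where to travel in one short sentence"),
    ("human", "I like the beaches where should I go?"),
    ("ai", "You should go to Nice, France"),
    ("human", "What else should I do when I'm there?"),
]
messages = to_openai_messages(history)
print([m["role"] for m in messages])
```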

### **Documents**
An object that holds a piece of text and metadata (more information about that text)

In [9]:
from langchain.schema import Document

In [10]:
Document(page_content="This is my document. It is full of text that I've gathered from other places",
         metadata={
             'my_document_id' : 234234,
             'my_document_source' : "The LangChain Papers",
             'my_document_create_time' : 1680013019
         })
# Metadata is very helpful when you are building large repositories of information.
# You can also use it in your searches: filter on metadata instead of looking at ALL your documents.

Document(page_content="This is my document. It is full of text that I've gathered from other places", metadata={'my_document_id': 234234, 'my_document_source': 'The LangChain Papers', 'my_document_create_time': 1680013019})

But you don't have to include metadata if you don't want to
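
To make the metadata idea concrete, here is a small sketch in plain Python. `SimpleDoc` is a hypothetical stand-in for a document object (text plus an optional metadata dict), not LangChain's `Document` class, and the sample records are made up:

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)  # metadata is optional


docs = [
    SimpleDoc("Notes on chains", {"source": "The LangChain Papers"}),
    SimpleDoc("Grocery list", {"source": "notes app"}),
    SimpleDoc("Text with no metadata"),
]

# Filter on metadata instead of scanning the text of every document
from_papers = [d for d in docs if d.metadata.get("source") == "The LangChain Papers"]
print([d.page_content for d in from_papers])
```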

## Models - The interface to the AI brains

###  **Language Model**
A model that does text in ➡️ text out!

*Check out how I changed the model I was using from the default one to ada-001 (a very cheap, low performing model). See more models [here](https://platform.openai.com/docs/models)*

In [15]:
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-ada-001", openai_api_key=openai_api_key)

In [19]:
llm("When is the next leap year after 2019?")

'\n\nThe next leap year is 2027.'

### **Chat Model**
A model that takes a series of messages and returns a message output

In [20]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage

chat = ChatOpenAI(temperature=1, openai_api_key=openai_api_key)

In [26]:
chat(
    [
        SystemMessage(content="You are an unhelpful AI bot that makes a joke at whatever the user says"),
        HumanMessage(content="I would like to go to New York, how should I do this?"),
        AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field! But seriously, to get to New York, you could try riding a kangaroo or maybe hitchhike with a flock of migrating birds. Just make sure to bring your passport and a bag of laughs for the journey!'),
        HumanMessage(content="Come on now, be serious, whats the best way to get to New York?")
    ]
)

# """
# Responses of Chat Models are AI Messages it seems
# AIMessage(content='Why did the scarecrow win an award? Because he was outstanding in his field! 
# But seriously, to get to New York, you could try riding a kangaroo or maybe hitchhike with a flock of migrating birds. 
# Just make sure to bring your passport and a bag of laughs for the journey!')
# """

AIMessage(content="Oh, you want to be serious? My apologies for my non-stop jokes! The best way to get to New York would be to book a flight or take a train. Both options are quite popular among humans. However, if you're feeling adventurous, you could always try teleporting or convincing Santa Claus to lend you his sleigh. Good luck!")

### Function Calling Models

[Function calling models](https://openai.com/blog/function-calling-and-other-api-updates) are similar to Chat Models but with a little extra flavor. They are fine-tuned to give structured data outputs.

This comes in handy when you're making an API call to an external service or doing extraction.

In [27]:
chat = ChatOpenAI(model='gpt-3.5-turbo-0613', temperature=1, openai_api_key=openai_api_key)

output = chat(messages=
     [
         SystemMessage(content="You are an helpful AI bot"),
         HumanMessage(content="What’s the weather like in Boston right now?")
     ],
     functions=[{
         "name": "get_current_weather",
         "description": "Get the current weather in a given location",
         "parameters": {
             "type": "object",
             "properties": {
                 "location": {
                     "type": "string",
                     "description": "The city and state, e.g. San Francisco, CA"
                 },
                 "unit": {
                     "type": "string",
                     "enum": ["celsius", "fahrenheit"]
                 }
             },
             "required": ["location"]
         }
     }
     ]
)
output

AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{\n  "location": "Boston, MA"\n}'}})

See the extra `additional_kwargs` that is passed back to us? We can take that and pass it to an external API to get data. It saves the hassle of doing output parsing.
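
As a rough illustration of that hand-off, the sketch below decodes the `arguments` string from the output above and routes it to a local function. `get_current_weather` here is a hypothetical stub that returns a pretend string, and `available_functions` is an illustrative name; no real weather API is called:

```python
import json

# The function_call payload from the output above, copied as a plain dict
function_call = {
    "name": "get_current_weather",
    "arguments": '{\n  "location": "Boston, MA"\n}',
}


def get_current_weather(location: str, unit: str = "fahrenheit") -> str:
    """Hypothetical stand-in for a real weather API call."""
    return f"(pretend forecast for {location} in {unit})"


# Map function names the model may request to local callables
available_functions = {"get_current_weather": get_current_weather}

# `arguments` arrives as a JSON string, so decode it before calling
kwargs = json.loads(function_call["arguments"])
result = available_functions[function_call["name"]](**kwargs)
print(result)
```

Looking the function up by name in a dict keeps the model from triggering anything you have not explicitly exposed.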

### **Text Embedding Model**
Change your text into a vector (a series of numbers that hold the semantic 'meaning' of your text). Mainly used when comparing two pieces of text together.

*BTW: Semantic means 'relating to meaning in language or logic.'*

In [28]:
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [29]:
text = "Hi! It's time for the beach"

In [30]:
text_embedding = embeddings.embed_query(text) 
print (f"Here's a sample: {text_embedding[:5]}...")
print (f"Your embedding is length {len(text_embedding)}")

Here's a sample: [-0.00019600906371495047, -0.0031846734422911363, -0.0007734206914647714, -0.019472001962491232, -0.015092319017854244]...
Your embedding is length 1536
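
Comparing two embeddings usually comes down to a similarity score between the vectors. Cosine similarity is one common choice; the sketch below uses made-up 3-dimensional vectors in place of real 1536-dimensional embeddings, so the numbers are illustrative only:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three short texts
beach = [0.9, 0.1, 0.0]
ocean = [0.8, 0.2, 0.1]
taxes = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(beach, ocean) > cosine_similarity(beach, taxes))
```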


## Prompts - Text generally used as instructions to your model

### **Prompt**
What you'll pass to the underlying model

In [31]:
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003", openai_api_key=openai_api_key)

# I like to use three double quotation marks for my prompts because it's easier to read
prompt = """
Today is Monday, tomorrow is Wednesday.

What is wrong with that statement?
"""

print(llm(prompt))


Tuesday is missing.


### **Prompt Template**
An object that helps create prompts based on a combination of user input, other non-static information and a fixed template string.

Think of it as an [f-string](https://realpython.com/python-f-strings/) in python but for prompts

*Advanced: Check out [LangSmith Hub](https://smith.langchain.com/hub) for many more community prompt templates*

In [35]:
from langchain.llms import OpenAI
from langchain import PromptTemplate

llm = OpenAI(model_name="text-davinci-003", openai_api_key=openai_api_key)

# Notice "location" below, that is a placeholder for another value later
template = """
I really want to travel to {location} and I love {food}. What should I do there and where should I eat?
 
Respond in one short sentence
"""

prompt = PromptTemplate(
    input_variables=["location", "food"],
    template=template,
)

final_prompt = prompt.format(location='Rome', food='pizza')

print (f"Final Prompt: {final_prompt}")
print ("-----------")
print (f"LLM Output: {llm(final_prompt)}")

Final Prompt: 
I really want to travel to Rome and I love pizza. What should I do there and where should I eat?
 
Respond in one short sentence

-----------
LLM Output: Visit the Colosseum and try traditional Neapolitan pizza at Da Baffetto.
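
The f-string comparison can be made concrete with plain `str.format`, which fills named placeholders in the same way. This is an analogy only and does not use LangChain's `PromptTemplate`; the Tokyo/ramen pair is made up for the example:

```python
# Plain-Python analogy only; this does not use LangChain's PromptTemplate
template = "I really want to travel to {location} and I love {food}."

# The template is defined once and filled in later, once per request
for location, food in [("Rome", "pizza"), ("Tokyo", "ramen")]:
    print(template.format(location=location, food=food))

# A missing variable fails loudly instead of silently producing a broken prompt
try:
    template.format(location="Rome")
except KeyError as missing:
    print(f"Missing variable: {missing}")
```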


### **Example Selectors**
An easy way to select from a series of examples so you can dynamically place in-context information into your prompt. Often used when your task is nuanced or you have a large list of examples.

Check out different types of example selectors [here](https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/)

If you want an overview on why examples are important (prompt engineering), check out [this video](https://www.youtube.com/watch?v=dOxUroR57xs)

In [36]:
# in context learning = show the language model what to do
# this could be things like how to answer a cs ticket

from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
# This selects similar examples

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.llms import OpenAI

llm = OpenAI(model_name="text-davinci-003", openai_api_key=openai_api_key)

example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Example Input: {input}\nExample Output: {output}",
)

# Examples of locations where nouns are found
examples = [
    {"input": "pirate", "output": "ship"},
    {"input": "pilot", "output": "plane"},
    {"input": "driver", "output": "car"},
    {"input": "tree", "output": "ground"},
    {"input": "bird", "output": "nest"},
]

In [37]:
# SemanticSimilarityExampleSelector will select examples that are similar to your input by semantic meaning

example_selector = SemanticSimilarityExampleSelector.from_examples(
    # This is the list of examples available to select from.
    examples, 
    
    # This is the embedding class used to produce embeddings which are used to measure semantic similarity.
    OpenAIEmbeddings(openai_api_key=openai_api_key), 
    
    # This is the VectorStore class that is used to store the embeddings and do a similarity search over.
    Chroma, 
    
    # This is the number of examples to produce.
    k=2
)

In [39]:
similar_prompt = FewShotPromptTemplate(
    # this is a new type of prompt template, few shot = few examples;
    
    # The object that will help select examples
    example_selector=example_selector,
    
    # Your prompt
    example_prompt=example_prompt,
    
    # Customizations that will be added to the top and bottom of your prompt
    prefix="Give the location an item is usually found in",
    suffix="Input: {noun}\nOutput:",
    
    # What inputs your prompt will receive
    input_variables=["noun"],
)

In [40]:
# Select a noun!
my_noun = "student"

print(similar_prompt.format(noun=my_noun))

Give the location an item is usually found in

Example Input: driver
Example Output: car

Example Input: pilot
Example Output: plane

Input: student
Output:


In [41]:
llm(similar_prompt.format(noun=my_noun))

' classroom'
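
To see how the pieces fit together, the sketch below rebuilds the printed prompt by hand: prefix, the selected examples, then the suffix, joined by blank lines. `build_few_shot_prompt` is a hypothetical helper written for this illustration, not how `FewShotPromptTemplate` is implemented, and the two examples are hard-coded rather than chosen by semantic similarity:

```python
def build_few_shot_prompt(prefix: str, examples: list[dict], suffix: str, **inputs: str) -> str:
    """Join prefix, formatted examples, and suffix with blank lines between them."""
    example_template = "Example Input: {input}\nExample Output: {output}"
    parts = [prefix]
    parts += [example_template.format(**ex) for ex in examples]
    parts.append(suffix.format(**inputs))
    return "\n\n".join(parts)


# The two examples the selector picked for "student" in the output above
selected = [
    {"input": "driver", "output": "car"},
    {"input": "pilot", "output": "plane"},
]

few_shot_prompt = build_few_shot_prompt(
    prefix="Give the location an item is usually found in",
    examples=selected,
    suffix="Input: {noun}\nOutput:",
    noun="student",
)
print(few_shot_prompt)
```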

### **Output Parsers Method 1: Prompt Instructions & String Parsing**
A helpful way to format the output of a model. Usually used for structured output. LangChain has a bunch more output parsers listed in their [documentation](https://python.langchain.com/docs/modules/model_io/output_parsers).

Two big concepts:

**1. Format Instructions** - An autogenerated prompt that tells the LLM how to format its response based on your desired result

**2. Parser** - A method which will extract your model's text output into a desired structure (usually JSON)

In [42]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI

In [43]:
llm = OpenAI(model_name="text-davinci-003", openai_api_key=openai_api_key)

In [44]:
# How you would like your response structured. This is basically a fancy prompt template
response_schemas = [
    ResponseSchema(name="bad_string", description="This a poorly formatted user input string"),
    ResponseSchema(name="good_string", description="This is your response, a reformatted response")
]

# How you would like to parse your output
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

In [45]:
# See the prompt template you created for formatting
format_instructions = output_parser.get_format_instructions()
print (format_instructions)

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"bad_string": string  // This a poorly formatted user input string
	"good_string": string  // This is your response, a reformatted response
}
```


In [47]:
template = """
You will be given a poorly formatted string from a user.
Reformat it and make sure all the words are spelled correctly

{format_instructions}

% USER INPUT:
{user_input}

YOUR RESPONSE:
"""

prompt = PromptTemplate(
    input_variables=["user_input"],
    partial_variables={"format_instructions": format_instructions},
    template=template
)

promptValue = prompt.format(user_input="welcom to califonya!")

print(promptValue)


You will be given a poorly formatted string from a user.
Reformat it and make sure all the words are spelled correctly

The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":

```json
{
	"bad_string": string  // This a poorly formatted user input string
	"good_string": string  // This is your response, a reformatted response
}
```

% USER INPUT:
welcom to califonya!

YOUR RESPONSE:



In [48]:
llm_output = llm(promptValue)
llm_output

'```json\n{\n\t"bad_string": "welcom to califonya!",\n\t"good_string": "Welcome to California!"\n}\n```'

In [49]:
output_parser.parse(llm_output)

{'bad_string': 'welcom to califonya!', 'good_string': 'Welcome to California!'}
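
Conceptually, the parse step pulls the JSON out of the fenced snippet and decodes it. The sketch below shows that idea with the standard library; `parse_fenced_json` is a hypothetical helper and is not a description of how `StructuredOutputParser` works internally:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the snippet easy to read

# The raw model output from the cell above
llm_output = (
    FENCE + 'json\n{\n\t"bad_string": "welcom to califonya!",'
    '\n\t"good_string": "Welcome to California!"\n}\n' + FENCE
)


def parse_fenced_json(text: str) -> dict:
    """Extract the first fenced json block from `text` and decode it."""
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, text, re.DOTALL)
    if match is None:
        raise ValueError("No fenced json block found")
    return json.loads(match.group(1))


parsed = parse_fenced_json(llm_output)
print(parsed)
```

If the model ignores the format instructions, this raises instead of returning partial data, which is usually what you want when later code depends on the keys.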

### **Output Parsers Method 2: OpenAI Functions**
When OpenAI released function calling, the game changed. This is the recommended method when starting out.

They trained models specifically for outputting structured data. It became super easy to specify a Pydantic schema and get a structured output.

There are many ways to define your schema. I prefer using Pydantic models because of how organized they are. Feel free to reference OpenAI's [documentation](https://platform.openai.com/docs/guides/gpt/function-calling) for other methods.

In order to use this method you'll need to use a model that supports [function calling](https://openai.com/blog/function-calling-and-other-api-updates#:~:text=Developers%20can%20now%20describe%20functions%20to%20gpt%2D4%2D0613%20and%20gpt%2D3.5%2Dturbo%2D0613%2C). I'll use `gpt-4-0613`

**Example 1: Simple**

Let's get started by defining a simple model for us to extract from.

In [27]:
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional

class Person(BaseModel):
    """Identifying information about a person."""

    name: str = Field(..., description="The person's name")
    age: int = Field(..., description="The person's age")
    fav_food: Optional[str] = Field(None, description="The person's favorite food")
    fav_song: Optional[str] = Field(None, description="The person's favourite song")

Then let's create a chain (more on this later) that will do the extracting for us

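In [14]:
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI(model='gpt-4-0613', openai_api_key=openai_api_key)

# Assumed prompt wording: any chat prompt with one `{input}` variable should work here
prompt = ChatPromptTemplate.from_messages([
    ("system", "You extract information in structured formats."),
    ("human", "Extract information from the following input: {input}"),
])

chain = create_structured_output_chain(Person, llm, prompt)
chain.run(
    "Sally is 13, Joey just turned 12 and loves spinach. Caroline is 10 years older than Sally."
)

Because the `Person` schema describes a single person, the chain can only return data on one person from that list. We didn't specify that we wanted multiple. Let's change our schema to specify that we want a list of people if possible.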

In [16]:
from typing import Sequence

class People(BaseModel):
    """Identifying information about all people in a text."""

    people: Sequence[Person] = Field(..., description="The people in the text")

Now we'll call for People rather than Person

In [61]:
chain = create_structured_output_chain(People, llm, prompt)
chain.run(
    "Sally is 13, Joey just turned 12 and loves spinach. Caroline is 10 years older than Sally and love Churros. Caroline's favourite song is REKT by Tekashi69"
)

People(people=[Person(name='Sally', age=13, fav_food=None, fav_song=None), Person(name='Joey', age=12, fav_food='spinach', fav_song=None), Person(name='Caroline', age=23, fav_food='Churros', fav_song='REKT by Tekashi69')])

Let's do some more parsing with it

**Example 2: Enum**

Now let's parse when a product from a list is mentioned

In [14]:
import enum

llm = ChatOpenAI(model='gpt-4-0613', openai_api_key=openai_api_key)

class Product(str, enum.Enum):
    CRM = "CRM"
    VIDEO_EDITING = "VIDEO_EDITING"
    HARDWARE = "HARDWARE"

In [17]:
class Products(BaseModel):
    """Identifying products that were mentioned in a text"""

    products: Sequence[Product] = Field(..., description="The products mentioned in a text")

In [18]:
# `prompt` must already be defined in this session: a chat prompt template
# with a single `{input}` variable. Without it this cell raises a NameError.
chain = create_structured_output_chain(Products, llm, prompt)
chain.run(
    "Jira, the CRM in this demo is great. Love the hardware. The microphone is also cool. Love the video editing."
)
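
The reason for using an enum in the schema is that it limits the answer to a fixed set of labels. The sketch below shows the same constraint in plain Python, without calling a model; `to_products` is a hypothetical helper and the input list is made up:

```python
import enum


class Product(str, enum.Enum):
    CRM = "CRM"
    VIDEO_EDITING = "VIDEO_EDITING"
    HARDWARE = "HARDWARE"


def to_products(raw_values: list[str]) -> list[Product]:
    """Convert strings to Product members, skipping anything outside the enum."""
    products = []
    for value in raw_values:
        try:
            products.append(Product(value))
        except ValueError:
            # e.g. "MICROPHONE" is not a known product, so it is dropped
            continue
    return products


found = to_products(["CRM", "MICROPHONE", "VIDEO_EDITING"])
print(found)
```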

## Indexes - Structuring documents so LLMs can work with them

### **Document Loaders**
Easy ways to import data from other sources. Shared functionality with [OpenAI Plugins](https://openai.com/blog/chatgpt-plugins) [specifically retrieval plugins](https://github.com/openai/chatgpt-retrieval-plugin)

See a [big list](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html) of document loaders here. A bunch more on [Llama Index](https://llamahub.ai/) as well.

Document loaders are LangChain components that load data from various sources and formats into a standardized document format. They provide a convenient way to ingest data and convert it into a consistent representation that other components in the pipeline can process.

Document loaders serve the following purposes:

* **Data Ingestion** - Load data from diverse sources such as files, databases, APIs, webpages, or cloud storage, handling the details of reading and extracting data from each format.
* **Standardization** - Convert the loaded data into a standardized document format so downstream components can process it consistently, whatever the source.
* **Metadata Extraction** - Often extract metadata associated with the loaded data, such as file information, timestamps, or source URLs, which can be used for further processing or analysis.
* **Lazy Loading** - Some document loaders support lazy loading, where data is loaded into memory only when needed. This helps with large datasets.

LangChain provides a wide range of document loaders for various data sources and formats, including text files, CSV files, webpages, databases, cloud storage, and APIs.
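
The purposes listed above (ingestion, standardization, metadata extraction, lazy loading) can be sketched in a few lines of plain Python. `LoadedDoc` and `lazy_load_text_files` are hypothetical names for illustration and do not mirror any specific LangChain loader:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class LoadedDoc:
    """Hypothetical standardized record: text plus metadata about its origin."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_load_text_files(folder: str) -> Iterator[LoadedDoc]:
    """Yield one LoadedDoc per .txt file, reading each file only when requested."""
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        yield LoadedDoc(
            page_content=text,
            metadata={"source": str(path), "num_chars": len(text)},
        )
```

Because the function is a generator, `next(lazy_load_text_files("data/PaulGrahamEssays"))` would read only the first file, assuming that folder exists on disk.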

**HackerNews Example** 

In [50]:
from langchain.document_loaders import HNLoader

In [51]:
loader = HNLoader("https://news.ycombinator.com/item?id=34422627")

In [52]:
data = loader.load()

In [57]:
print (f"Found {len(data)} comments")
print (f"Here's a sample:\n\n{''.join([x.page_content[:150] for x in data[:2]])}")
# print(type(data)) # class = list
# data contains a bunch of metadata

Found 76 comments
Here's a sample:

Ozzie_osman 9 months ago  
             | next [–] 

LangChain is awesome. For people not sure what it's doing, large language models (LLMs) are very Ozzie_osman 9 months ago  
             | parent | next [–] 

Also, another library to check out is GPT Index (https://github.com/jerryjliu/gpt_index)


**Books from Gutenberg Project**

In [69]:
from langchain.document_loaders import GutenbergLoader

loader = GutenbergLoader("https://www.gutenberg.org/cache/epub/2148/pg2148.txt")

data = loader.load()

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1002)>

In [68]:
print(data[0].page_content[1855:1984])




**URLs and webpages**

Let's try it out with [Paul Graham's website](http://www.paulgraham.com/)

In [58]:
from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "http://www.paulgraham.com/",
]

loader = UnstructuredURLLoader(urls=urls)

data = loader.load()

data[0].page_content

'New:  \r\n\r\n Superlinear Returns  |\r\n How to Do Great Work \r\n \r\n \r\n \r\n \r\n \r\n Want to start a startup?  Get funded by  Y Combinator .\r\n \r\n \r\n \r\n\r\n \n\r\n \r\n \r\n \r\n© mmxxiii pg'

### **Text Splitters**
Oftentimes your document is too long (like a book) for your LLM. You need to split it up into chunks. Text splitters help with this.

There are many ways you could split your text into chunks; experiment with [different ones](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html) to see which is best for you.

In [60]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


LangChain provides different types of text splitters for breaking text into smaller chunks. Some commonly used ones:

* **CharacterTextSplitter** - Splits the text on a separator character or sequence that you specify.
* **NLTKTextSplitter / SpacyTextSplitter** - Split the text on sentence boundaries, using NLTK or spaCy to identify where sentences end.
* **TokenTextSplitter** - Tokenizes the text and creates chunks of a specified number of tokens.
* **RecursiveCharacterTextSplitter** - Tries a list of separators in turn, splitting recursively until the chunks are no larger than the specified size.

Choose a splitter based on your requirements and the nature of the text: characters, sentences, or tokens.

In [61]:
# This is a long document we can split up.
with open('data/PaulGrahamEssays/worked.txt') as f:
    pg_work = f.read()
    
print (f"You have {len([pg_work])} document")

You have 1 document


In [65]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 250,
    chunk_overlap  = 20,
) # text splitter object

texts = text_splitter.create_documents([pg_work])

In [66]:
print (f"You have {len(texts)} documents")

You have 362 documents


In [67]:
print ("Preview:")
print (texts[0].page_content, "\n")
print (texts[1].page_content)

Preview:
February 2021Before college the two main things I worked on, outside of school,
were writing and programming. I didn't write essays. I wrote what
beginning writers were supposed to write then, and probably still 

are: short stories. My stories were awful. They had hardly any plot,
just characters with strong feelings, which I imagined made them
deep.The first programs I tried writing were on the IBM 1401 that our


There are a ton of different ways to do text splitting and it really depends on your retrieval strategy and application design. Check out more splitters [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
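
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified fixed-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as newlines); `split_with_overlap` is a hypothetical helper for illustration:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.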

### **Retrievers**
Retrievers are components or algorithms that retrieve relevant documents or data based on a given query or search criteria. They are basically an easy way to combine documents with language models.

There are many different types of retrievers. The most widely supported is the VectorStoreRetriever, because so many applications rely on similarity search.

**[VECTORSTORE RETRIEVERS](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)**
> A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.


In [74]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

loader = TextLoader('data/PaulGrahamEssays/worked.txt')
documents = loader.load()

In [75]:
# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

# Split your docs into texts
texts = text_splitter.split_documents(documents)

# Get embedding engine ready
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Embed your texts
db = FAISS.from_documents(texts, embeddings)

In [76]:
# Init your retriever. No `k` is passed, so it returns its default number of documents
retriever = db.as_retriever()

In [77]:
retriever
# VectorStoreRetriever(tags=['FAISS'], vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x7f8389169070>)

VectorStoreRetriever(tags=['FAISS'], vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x14f893910>)

In [78]:
docs = retriever.get_relevant_documents("what types of things did the author want to build?")

In [79]:
print("\n\n".join([x.page_content[:200] for x in docs[:2]]))

standards; what was the point? No one else wanted one either, so
off they went. That was what happened to systems work.I wanted not just to build things, but to build things that would
last.In this di

much of it in grad school.Computer Science is an uneasy alliance between two halves, theory
and systems. The theory people prove things, and the systems people
build things. I wanted to build things. 
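
The retriever interface boils down to "query in, ranked chunks out". The sketch below mimics that shape in plain Python, scoring chunks by word overlap (Jaccard) as a crude stand-in for the embedding similarity a vector store would use. `tokenize` and `retrieve` are hypothetical helpers, and the sample chunks are shortened paraphrases:

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace, stripping basic punctuation."""
    return {word.strip(".,;:!?").lower() for word in text.split()}


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the query (Jaccard score)."""
    query_words = tokenize(query)

    def score(chunk: str) -> float:
        chunk_words = tokenize(chunk)
        union = query_words | chunk_words
        return len(query_words & chunk_words) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:k]


sample_chunks = [
    "I wanted to build things that would last.",
    "The theory people prove things.",
    "My stories were awful.",
]
top = retrieve("what did the author want to build?", sample_chunks, k=1)
print(top)
```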


### Other types of retrievers

LangChain provides various types of retrievers, each suited to a different kind of retrieval task. Here are some of them:

• SelfQueryRetriever: The SelfQueryRetriever uses an LLM to generate queries against a vector store. It combines the power of language models and vector stores to perform retrieval based on semantic similarity and metadata filters.

• ParentDocumentRetriever: The ParentDocumentRetriever retrieves smaller chunks of documents and their parent documents. It allows you to look up smaller chunks while returning larger context, which is useful for tasks like document summarization or generating context-aware responses.

• EnsembleRetriever: The EnsembleRetriever combines multiple retrievers or retrieval algorithms to retrieve documents from different sources or using different methods. It enables you to leverage the strengths of multiple retrievers to improve retrieval performance and coverage.

• MergerRetriever: The MergerRetriever merges the results of multiple retrievers into a single set of documents. It is useful when you want to combine the outputs of different retrievers or retrieval methods to obtain a comprehensive set of documents.

• KNNRetriever: The KNNRetriever performs k-nearest neighbors (KNN) retrieval based on vector similarity. It retrieves the k most similar documents to a given query vector, allowing you to find documents that are closest in vector space.

• GoogleDocumentAIWarehouseRetriever: The GoogleDocumentAIWarehouseRetriever is a retriever based on Google's Document AI Warehouse. It allows you to retrieve documents from the Document AI Warehouse, which is a multimodal database for building AI applications.


These are just a few examples of the retrievers available in LangChain. They cover tasks such as semantic search, document summarization, context-aware responses, and combining multiple retrieval methods. Which one to choose depends on the requirements of your application and the type of retrieval task you want to perform.
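
To make the idea of combining retrievers concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge ranked result lists. It is plain Python and does not call LangChain; the document IDs, the two result lists, and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from two different retrievers
keyword_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because both retrievers rank it near the top, even though neither list agrees on the full ordering.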

### **VectorStores**
Databases to store vectors. The most popular ones are [Pinecone](https://www.pinecone.io/) & [Weaviate](https://weaviate.io/). More examples are in OpenAI's [retriever documentation](https://github.com/openai/chatgpt-retrieval-plugin#choosing-a-vector-database). [Chroma](https://www.trychroma.com/) & [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) are easy to work with locally.

Conceptually, think of them as tables w/ a column for embeddings (vectors) and a column for metadata.

Example

| Embedding | Metadata |
| ----------- | ----------- |
| [-0.00015641732898075134, -0.003165106289088726, ...] | {'date' : '1/2/23'} |
| [-0.00035465431654651654, 1.4654131651654516546, ...] | {'date' : '1/3/23'} |

In [80]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

loader = TextLoader('data/PaulGrahamEssays/worked.txt')
documents = loader.load()

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

# Split your docs into texts
texts = text_splitter.split_documents(documents)

# Get embedding engine ready
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [81]:
print (f"You have {len(texts)} documents")

You have 78 documents


In [82]:
embedding_list = embeddings.embed_documents([text.page_content for text in texts])

In [83]:
print (f"You have {len(embedding_list)} embeddings")
print (f"Here's a sample of one: {embedding_list[0][:3]}...")

You have 78 embeddings
Here's a sample of one: [-0.0010586286150530253, -0.011182342115534236, -0.012874804746266878]...


Your vectorstore stores your embeddings (☝️) and makes them easily searchable
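
As a rough mental model of the "table with an embedding column and a metadata column" idea, here is a toy in-memory store in plain Python. It is a sketch, not how FAISS or any real vector database works: `ToyVectorStore`, its methods, the 3-D vectors, and the `id` field are all hypothetical.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """A list of (embedding, metadata) rows with brute-force search."""

    def __init__(self):
        self.rows = []

    def add(self, embedding, metadata):
        self.rows.append((embedding, metadata))

    def search(self, query_embedding, k=1, where=None):
        """Return metadata of the k most similar rows, optionally
        keeping only rows whose metadata matches every key in `where`."""
        where = where or {}
        candidates = [
            (emb, meta) for emb, meta in self.rows
            if all(meta.get(key) == value for key, value in where.items())
        ]
        candidates.sort(key=lambda row: _cosine(query_embedding, row[0]), reverse=True)
        return [meta for _, meta in candidates[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0, 0.0], {"id": "a", "date": "1/2/23"})
store.add([0.0, 1.0, 0.0], {"id": "b", "date": "1/3/23"})
store.add([0.9, 0.1, 0.0], {"id": "c", "date": "1/3/23"})

query = [1.0, 0.0, 0.0]
print(store.search(query, k=1))                            # row "a"
print(store.search(query, k=1, where={"date": "1/3/23"}))  # row "c"
```

Real vector stores replace the brute-force sort with an index built for fast approximate nearest-neighbor search, but the embedding-plus-metadata shape is the same.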

## Memory
Helping LLMs remember information.

Memory is a bit of a loose term. It could be as simple as remembering information you've chatted about in the past or more complicated information retrieval.

We'll focus on the chat message use case, which is what you would use for chatbots.

There are many types of memory. Explore [the documentation](https://python.langchain.com/en/latest/modules/memory/how_to_guides.html) to see which one fits your use case.

### Chat Message History

In [84]:
from langchain.memory import ChatMessageHistory
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(temperature=0, openai_api_key=openai_api_key)

history = ChatMessageHistory()

history.add_ai_message("hi!")

history.add_user_message("what is the capital of france?")

In [85]:
history.messages

[AIMessage(content='hi!'),
 HumanMessage(content='what is the capital of france?')]

In [86]:
ai_response = chat(history.messages)
ai_response

AIMessage(content='The capital of France is Paris.')

In [87]:
history.add_ai_message(ai_response.content)
history.messages

[AIMessage(content='hi!'),
 HumanMessage(content='what is the capital of france?'),
 AIMessage(content='The capital of France is Paris.')]
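
Chat histories grow with every turn, so one simple memory strategy is to keep only the most recent messages. The sketch below shows that idea in plain Python; `Message` and `WindowedHistory` are hypothetical names for this example, not LangChain classes.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # "human" or "ai"
    content: str

@dataclass
class WindowedHistory:
    """Keeps only the last `max_messages` messages (assumes max_messages >= 1)."""
    max_messages: int = 4
    messages: list = field(default_factory=list)

    def add(self, role, content):
        self.messages.append(Message(role, content))
        # Drop the oldest messages once the window is full
        self.messages = self.messages[-self.max_messages:]

    def as_prompt(self):
        """Render the remembered messages as text to send to a model."""
        return "\n".join(f"{m.role}: {m.content}" for m in self.messages)

window = WindowedHistory(max_messages=2)
window.add("ai", "hi!")
window.add("human", "what is the capital of france?")
window.add("ai", "The capital of France is Paris.")

print(window.as_prompt())
# human: what is the capital of france?
# ai: The capital of France is Paris.
```

The first message ("hi!") has fallen out of the window, which keeps the prompt short at the cost of forgetting older context.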

## Chains ⛓️⛓️⛓️
Combining different LLM calls and actions automatically

Ex: Summary #1, Summary #2, Summary #3 > Final Summary

Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo&t=2s) explaining different summarization chain types

There are [many applications of chains](https://python.langchain.com/en/latest/modules/chains/how_to_guides.html); search to see which are best for your use case.

---
Chains are important because breaking a task into smaller, focused steps can help reduce hallucination by the language model

---

We'll cover two of them:

### 1. Simple Sequential Chains

Easy chains where you can use the output of an LLM as an input into another. Good for breaking up tasks (and keeping your LLM focused)

In [88]:
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import SimpleSequentialChain

llm = OpenAI(temperature=1, openai_api_key=openai_api_key)

In [89]:
template = """Your job is to come up with a classic dish from the area that the users suggests.
% USER LOCATION
{user_location}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_location"], template=template)

# Holds my 'location' chain
location_chain = LLMChain(llm=llm, prompt=prompt_template, verbose=True)

In [90]:
template = """Given a meal, give a short and simple recipe on how to make that dish at home.
% MEAL
{user_meal}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_meal"], template=template)

# Holds my 'meal' chain
meal_chain = LLMChain(llm=llm, prompt=prompt_template, verbose=True)

In [93]:
overall_chain = SimpleSequentialChain(chains=[location_chain, meal_chain], verbose=True) # order matters here.

In [95]:
review = overall_chain.run("Seoul")



> Entering new SimpleSequentialChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Your job is to come up with a classic dish from the area that the users suggests.
% USER LOCATION
Seoul

YOUR RESPONSE:


> Finished chain.
Bibimbap - a traditional Korean dish featuring a variety of vegetables and meat (usually beef) served over warm rice. It is often served with gochujang (a spicy fermented red chili paste) and a raw egg.


> Entering new LLMChain chain...
Prompt after formatting:
Given a meal, give a short and simple recipe on how to make that dish at home.
% MEAL
Bibimbap - a traditional Korean dish featuring a variety of vegetables and meat (usually beef) served over warm rice. It is often served with gochujang (a spicy fermented red chili paste) and a raw egg.

YOUR RESPONSE:


> Finished chain.

Bibimbap Recipe

Ingredients:
- 2 cups cooked white rice
- 
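
Stripped of the LLM calls, a simple sequential chain is just function composition: each step's output becomes the next step's input. Here is a minimal plain-Python sketch of that flow. The two step functions are canned stand-ins for the LLM calls above, and `run_sequential` is a hypothetical helper, not a LangChain API.

```python
def location_step(user_location):
    """Stand-in for the 'location' chain: place -> classic dish."""
    canned = {"Seoul": "Bibimbap"}
    return canned.get(user_location, "Unknown dish")

def meal_step(user_meal):
    """Stand-in for the 'meal' chain: dish -> recipe text."""
    return f"A short recipe for {user_meal}"

def run_sequential(steps, first_input):
    """Feed the output of each step into the next one."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

print(run_sequential([location_step, meal_step], "Seoul"))
# A short recipe for Bibimbap

# Order matters: reversing the steps asks for a recipe for "Seoul" first
print(run_sequential([meal_step, location_step], "Seoul"))
# Unknown dish
```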

### 2. Summarization Chain

Easily run through long or numerous documents and get a summary. Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo) for other chain types besides map-reduce
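
Conceptually, map-reduce summarization summarizes each chunk on its own (map) and then summarizes the combined partial summaries (reduce). The sketch below shows only that control flow in plain Python. `first_words` is a crude stand-in for an LLM summarization call, and all names here are hypothetical.

```python
def first_words(text, n):
    """Crude stand-in for an LLM summary: keep the first n words."""
    return " ".join(text.split()[:n])

def map_reduce_summarize(chunks, map_fn, reduce_fn):
    """Summarize each chunk independently, then summarize the summaries."""
    partial_summaries = [map_fn(chunk) for chunk in chunks]   # map step
    combined = " ".join(partial_summaries)
    return reduce_fn(combined)                                # reduce step

chunks = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"]

summary = map_reduce_summarize(
    chunks,
    map_fn=lambda text: first_words(text, 2),
    reduce_fn=lambda text: first_words(text, 4),
)
print(summary)  # alpha beta delta epsilon
```

Because the map step treats every chunk independently, it can handle documents far longer than a single prompt, at the cost of one model call per chunk plus the final reduce call.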

In [96]:
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('data/PaulGrahamEssays/disc.txt')
documents = loader.load()

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=50)

# Split your docs into texts
texts = text_splitter.split_documents(documents)

# There is a lot of complexity hidden in this one line. I encourage you to check out the video above for more detail
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)
chain.run(texts)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:


"January 2017Because biographies of famous scientists tend to
edit out their mistakes, we underestimate the
degree of risk they were willing to take.
And because anything a famous scientist did that
wasn't a mistake has probably now become the
conventional wisdom, those choices don't
seem risky either.Biographies of Newton, for example, understandably focus
more on physics than alchemy or theology.
The impression we get is that his unerring judgment
led him straight to truths no one else had noticed.
How to explain all the time he spent on alchemy
and theology?  Well, smart people are often kind of
crazy.But maybe there is a simpler explanation. Maybe"


CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:


"the smartness and the craziness were not as sepa

" This passage explores the risks that scientists, such as Newton, were willing to take in pursuit of knowledge, despite the unknown outcomes. It further suggests that, in Newton's case, took three such risks in various fields, and his bet on physics ultimately paid off."

## Agents 🤖🤖

Official LangChain Documentation describes agents perfectly (emphasis mine):
> Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an **unknown chain** that depends on the user's input. In these types of chains, there is a “agent” which has access to a suite of tools. Depending on the user input, the agent can then **decide which, if any, of these tools to call**.


Basically you use the LLM not just for text output, but also for decision making. The coolness and power of this functionality can't be overstated.

Sam Altman emphasizes that LLMs are good '[reasoning engines](https://www.youtube.com/watch?v=L_Guz73e6fw&t=867s)'. Agents take advantage of this.

### Agents

The language model that drives decision making.

More specifically, an agent takes in an input and returns a response corresponding to an action to take along with an action input. You can see different types of agents (which are better for different use cases) [here](https://python.langchain.com/en/latest/modules/agents/agents/agent_types.html).

### Tools

A 'capability' of an agent. This is an abstraction on top of a function that makes it easy for LLMs (and agents) to interact with it. Ex: Google search, send an email.

This area shares commonalities with [OpenAI plugins](https://platform.openai.com/docs/plugins/introduction).

### Toolkit

Groups of tools that your agent can select from

Let's bring them all together:

In [108]:
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.llms import OpenAI
import json

llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

In [109]:
import os

serpapi_api_key = os.getenv("SERP_API_KEY", "YourAPIKey")

In [110]:
toolkit = load_tools(["serpapi"], llm=llm, serpapi_api_key=serpapi_api_key)

In [111]:
agent = initialize_agent(toolkit, llm, agent="zero-shot-react-description", verbose=True, return_intermediate_steps=True)

In [120]:
response = agent({"input": "Which companies did Elon Musk found, and how much did he make on each one?"})



> Entering new AgentExecutor chain...
I should research the companies Elon Musk founded
Action: Search
Action Input: "Elon Musk companies"
Observation: ['Founder, CEO, and chief engineer of SpaceX · CEO and product architect of Tesla, Inc. · Owner and CTO of X, formerly Twitter · President of the Musk Foundation ...']
Thought: I should research the financials of each company
Action: Search
Action Input: "Elon Musk financials"
Observation: [{'title': 'Elon Musk lost $25 billion on Twitter in a year, and $41 billion on Tesla in just two weeks', 'link': 'https://finance.yahoo.com/news/elon-musk-lost-25-billion-180512324.html', 'source': 'Yahoo Finance', 'date': '2 days ago', 'thumbnail': 'https://serpapi.com/searches/6544efad33b2362c12794e0c/images/e6144f3dd9e33342ffe9ded5336270dca42b769052f3264b.jpeg'}, {'title': 'Elon Musk’s $13 billion whip hand against Wall Street: How interest rates and the financial disaster 
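
The Thought / Action / Observation pattern in the log above boils down to a loop: ask the model what to do next, run the chosen tool, feed the observation back, and repeat until the model gives a final answer. Here is a plain-Python sketch of that loop. `fake_policy` is a canned stand-in for the LLM, `fake_search` is a stand-in for a search tool, and `run_agent` is a hypothetical helper, not LangChain's `AgentExecutor`.

```python
def fake_search(query):
    """Stand-in for a search tool."""
    return f"results for '{query}'"

TOOLS = {"Search": fake_search}

def fake_policy(question, observations):
    """Stand-in for the LLM: decide the next action from what is known so far."""
    if not observations:
        return {"action": "Search", "action_input": question}
    return {"action": "Final Answer",
            "action_input": f"Based on: {observations[-1]}"}

def run_agent(question, policy, tools, max_steps=5):
    """Loop: decide -> act -> observe, until a final answer or the step limit."""
    observations = []
    intermediate_steps = []
    for _ in range(max_steps):
        decision = policy(question, observations)
        if decision["action"] == "Final Answer":
            return {"output": decision["action_input"],
                    "intermediate_steps": intermediate_steps}
        tool = tools[decision["action"]]           # pick the tool by name
        observation = tool(decision["action_input"])
        observations.append(observation)
        intermediate_steps.append((decision, observation))
    # Gave up: no final answer within max_steps
    return {"output": None, "intermediate_steps": intermediate_steps}

result = run_agent("Elon Musk companies", fake_policy, TOOLS)
print(result["output"])  # Based on: results for 'Elon Musk companies'
```

The `max_steps` cap is a design choice worth keeping in any real agent loop, since a model that never decides on a final answer would otherwise call tools forever.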


### Further [Tools](https://python.langchain.com/docs/modules/agents/tools/)
LangChain provides a wide range of tools that agents can use to interact with the world. Here are some examples of tools available for agents:

• Search: A tool that allows agents to perform search queries and retrieve relevant information from search engines.

• File Management: A toolkit that provides tools for interacting with local files, such as reading, writing, and manipulating files.

• GitHub: A toolkit that enables agents to interact with GitHub repositories and perform actions like creating issues, making commits, and managing pull requests.

• Gmail: A toolkit that allows agents to interact with Gmail and perform actions like sending emails, searching for emails, and managing labels.

• Jira: A toolkit that enables agents to interact with Jira and perform actions like creating issues, updating issues, and retrieving issue details.

• JSON: A toolkit for interacting with JSON data, providing tools for parsing, manipulating, and generating JSON.

• Natural Language API: A toolkit that utilizes natural language processing capabilities to perform tasks like sentiment analysis, entity recognition, and language detection.

• OpenAPI: A toolkit for interacting with APIs that adhere to the OpenAPI specification, providing tools for making HTTP requests and parsing responses.

• Playwright Browser: A toolkit that allows agents to interact with web browsers using the Playwright library, enabling actions like navigating web pages, filling forms, and extracting data.

• SQL Database: A toolkit for interacting with SQL databases, providing tools for executing SQL queries, retrieving data, and modifying database records.


These are just a few examples of the tools available in LangChain. There are many more tools and toolkits that agents can use to accomplish various tasks.

# [Existing integrations / Agents](https://python.langchain.com/docs/integrations/toolkits/)

There are lots of existing integrations, agents, and toolkits available in LangChain.
