In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Only needed if your OpenAI key is stored on Google Drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## LangChain

LangChain is a framework for developing applications powered by language models. It makes working with and building on LLMs easier. In particular, LangChain simplifies two tasks:

1.   Integration - Bring external data, such as your files, other applications, and API data, to your LLMs
2.   Agent - Allow your LLMs to interact with their environment via decision making. Use LLMs to help decide which action to take next

You'll need an OpenAI API key to follow this tutorial. You can set it as an environment variable, put it in a .env file in the same directory as this Jupyter notebook, or insert it below where 'YourAPIKey' is. If you have questions about this, paste these instructions into ChatGPT.



### Overview:
- Models
- Prompt Templates
- Chains
- Agents and Tools
- Memory
- Indexes

https://github.com/gkamradt/langchain-tutorials/blob/main/LangChain%20Cookbook%20Part%201%20-%20Fundamentals.ipynb

In [None]:
## Install the libraries
!pip install -q openai==1.5.0 llmx typing-extensions==4.5.0 python-dotenv
!pip install -q langchain==0.1.4
!pip install -q langchainhub
!pip install -q transformers==4.35.2
!pip install -q langchain-openai
!pip install -q sentence_transformers
!pip install -q faiss-cpu==1.7.4

## 1. Models

LangChain provides a generic interface for all [LLMs](https://python.langchain.com/docs/modules/model_io/). It works with both OpenAI and Hugging Face models.

#### 1.1 Hugging Face models

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

In [None]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small") #77m
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500) #text2text-generation is the huggingface predefined task and model specific
# The full list of supported tasks is at https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline.task
hf_small = HuggingFacePipeline(pipeline=pipe)

#### 1.2 OpenAI model

You'll need an OpenAI API key to follow this part. You can set it as an environment variable, put it in a .env file in the same directory as this Jupyter notebook, or insert it below where 'YourAPIKey' is.
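The lookup order described above can be sketched with the standard library alone. The helper name `load_api_key` and its `fallback_file` parameter are hypothetical, chosen only for illustration. Reading a `.env` file would additionally need the `python-dotenv` package and is not shown here.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(env_var: str = "OPENAI_API_KEY",
                 fallback_file: Optional[str] = None) -> Optional[str]:
    """Return the key from the environment, else from a text file, else None."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    if fallback_file and Path(fallback_file).is_file():
        # strip() drops the trailing newline that most editors add
        return Path(fallback_file).read_text().strip()
    return None
```

Checking the environment first means a key set in the Colab session takes precedence over any key file.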

In [None]:
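import os

# Read the key from a text file on the mounted Drive; adjust the path to your own file
path = "drive/MyDrive/docs/openai_key"
with open(path, "r") as file:
  openai_key = file.read().strip()  # strip the trailing newline, if any
os.environ["OPENAI_API_KEY"] = openai_key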

In [None]:
from langchain_openai import OpenAI
simple_qn = "What day comes after Friday?"
llm = OpenAI(temperature=0.9)  # model_name="text-davinci-003"
print(llm.invoke(simple_qn))



Saturday comes after Friday.


## 2. Prompt Templates

LangChain facilitates prompt management and optimization, and the prompt has a large effect on an LLM's performance. When we use an LLM in an application, we typically take the user input, construct a prompt from it, and send that prompt to the LLM.

In LangChain, a prompt template is the object that helps create prompts from a combination of user input, other non-static information, and a fixed template string (similar to an f-string in Python).


Check out [LangSmith Hub](https://docs.smith.langchain.com/hub/quickstart) for many more community prompt templates.
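To make the f-string comparison concrete, here is a minimal plain-Python sketch of what filling a template amounts to. It does not use LangChain, and the helper name `render` is hypothetical.

```python
def render(template: str, **values: str) -> str:
    """Fill the named {placeholders} in a fixed template string."""
    return template.format(**values)


template_text = "How can we learn {subject} better?"
prompt_text = render(template_text, subject="machine learning")
print(prompt_text)  # prints: How can we learn machine learning better?
```

Unlike an f-string, the template is an ordinary string that is filled later, so the same template can be reused with different user inputs.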

In [None]:
from langchain.prompts import PromptTemplate
study_prompt = PromptTemplate.from_template("How can we learn {subject} better?")
sample_prompt = study_prompt.format(subject="machine learning")
hf_small.invoke(sample_prompt)

'Use a syllable to read the machine learning instructions.'

In [None]:
template = """Question: {question}

Let's think step by step.

Answer: """

prompt = PromptTemplate(template=template, input_variables=["question"])
final_prompt = prompt.format(question="Can Arsenal win the Premier League?")
print(final_prompt)
print(llm.invoke(final_prompt))

Question: Can Arsenal win the Premier League?

Let's think step by step.

Answer: 
 It is certainly a possibility for Arsenal to win the Premier League, but it is not guaranteed. They have a strong squad and talented players, but there are also other top teams in the league, such as Manchester City, Liverpool, and Chelsea, who will also be vying for the title.

Additionally, Arsenal has not won the Premier League since the 2003-2004 season, and they have struggled to consistently compete for the title in recent years. However, with the right strategy, tactics, and team cohesion, they could potentially make a strong push for the title.

Ultimately, only time will tell if Arsenal can win the Premier League, but as with any team, it will require hard work, determination, and a bit of luck.


## 3. Chains

Chains combine multiple LLM calls and actions automatically. For example, we can send one prompt to a language model and then use its output as the input to another call or another LLM, and so on.
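As a rough illustration of the simplest case, the sketch below glues one template to one model call. It is plain Python, not the LangChain API: `make_chain` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in that only echoes its prompt.

```python
from typing import Callable, Dict


def make_chain(template: str, llm: Callable[[str], str]) -> Callable[[str], Dict[str, str]]:
    """Return a function that formats the prompt and then calls the model."""
    def run(subject: str) -> Dict[str, str]:
        prompt = template.format(subject=subject)
        return {"subject": subject, "text": llm(prompt)}
    return run


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call
    return f"(model reply to: {prompt})"


toy_chain = make_chain("How can we learn {subject} better?", fake_llm)
toy_result = toy_chain("dancing")
print(toy_result)
```

Returning the input alongside the generated text keeps each result self-describing, which helps when several models are compared on the same prompt.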



In [None]:
from langchain.chains import LLMChain

In [None]:
# load a larger hf model and compare the performance
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base") #248m
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_new_tokens=500)
hf_base = HuggingFacePipeline(pipeline=pipe)

In [None]:
# Build one chain per model so the same prompt can be compared across models
chain_small = LLMChain(llm=hf_small, prompt=study_prompt)
chain_base = LLMChain(llm=hf_base, prompt=study_prompt)
chain_openai  = LLMChain(llm=llm, prompt=study_prompt)

In [None]:
for topic in ["machine learning", "dancing"]:
  print(f'The question is: {study_prompt.format(subject=topic)}')
  print(f'The small model generates: {chain_small.invoke(topic)}')
  print(f'The base model generates: {chain_base.invoke(topic)}')
  print(f'The openai model generates: {chain_openai.invoke(topic)}')
  print('=================================================================================')

The question is: How can we learn machine learning better?
The small model generates: {'subject': 'machine learning', 'text': 'Use a syllable to read the machine learning instructions.'}
The base model generates: {'subject': 'machine learning', 'text': 'We can learn to learn from the data we collect.'}
The openai model generates: {'subject': 'machine learning', 'text': '\n\n1. Start with the basics: Before diving into complex algorithms and models, it is important to have a strong foundation in the fundamentals of machine learning such as statistics, linear algebra, and probability theory.\n\n2. Take an online course: There are many online courses available that cover the basics as well as advanced concepts in machine learning. Some popular ones include Coursera, Udacity, and edX.\n\n3. Read books: There are many books on machine learning that cover a wide range of topics in detail. Some recommended titles are "The Hundred-Page Machine Learning Book" by Andriy Burkov and "Pattern Re


#### Simple Sequential Chains
Next, let us try the simple sequential chain, which makes it easy to link several LLM chains so that each chain's output becomes the next chain's input.
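The control flow of a sequential chain can be sketched in a few lines of plain Python. This is an illustration of the idea, not LangChain's implementation. The names `sequential`, `suggest_dish`, and `write_recipe` are hypothetical, and the two steps are string stubs rather than model calls.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[str], str]


def sequential(steps: List[Step]) -> Step:
    """Compose single-input, single-output steps from left to right."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def suggest_dish(location: str) -> str:
    return f"a classic dish from {location}"


def write_recipe(meal: str) -> str:
    return f"recipe for {meal}"


two_step = sequential([suggest_dish, write_recipe])
print(two_step("Beijing"))  # prints: recipe for a classic dish from Beijing
```

Each step takes one string and returns one string, which is what allows the steps to be linked without any extra wiring.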

In [None]:
from langchain.chains import SimpleSequentialChain

In [None]:
template = """Your job is to come up with a classic dish from the area that the user suggests.
% USER LOCATION
{user_location}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_location"], template=template)

# Holds my 'location' chain
location_chain = LLMChain(llm=llm, prompt=prompt_template)

In [None]:
template = """Given a meal, give a short and simple recipe on how to make that dish at home.
% MEAL
{user_meal}

YOUR RESPONSE:
"""
prompt_template = PromptTemplate(input_variables=["user_meal"], template=template)

# Holds my 'meal' chain
meal_chain = LLMChain(llm=llm, prompt=prompt_template)

In [None]:
overall_chain = SimpleSequentialChain(chains=[location_chain, meal_chain], verbose=True)
review = overall_chain.invoke("Beijing")

> Entering new SimpleSequentialChain chain...
One classic dish from Beijing is Peking Duck, also known as Beijing Roast Duck. It is a traditional dish that originated from the imperial kitchens of the Qing Dynasty. The dish consists of a specially bred white-feathered duck that is roasted until the skin is crispy and the meat is tender and juicy. The duck is typically served with thin pancakes, scallions, and a sweet bean sauce for wrapping and adding flavor. Peking Duck is a must-try dish when visiting Beijing and is often enjoyed as a celebratory meal or at special occasions.
To make Peking Duck at home, begin by preheating your oven to 375°F (190°C). Clean and dry a whole duck and prick the skin all over with a fork. In a separate bowl, mix together 1/4 cup of honey, 1/4 cup of soy sauce, 1 tablespoon of rice vinegar, and 1 teaspoon of five-spice powder. Rub this mixture all over the duck, including inside the cavity. Place the duck on a roast

In [None]:
print(review['output'])

Here is a simple recipe for Peking Duck that you can make at home:

Ingredients:
- 1 whole duck, about 5 to 6 pounds
- ½ cup honey
- ¼ cup soy sauce
- 2 tablespoons hoisin sauce
- 2 tablespoons rice vinegar
- 1 tablespoon Chinese five-spice powder
- 1 teaspoon salt
- 1 teaspoon black pepper
- 10-12 thin pancakes
- 1 bunch scallions, cut into strips
- 1 cucumber, cut into strips
- Hoisin sauce for serving

Instructions:
1. Preheat your oven to 350 degrees Fahrenheit.
2. In a small bowl, mix together the honey, soy sauce, hoisin sauce, rice vinegar, Chinese five-spice powder, salt, and black pepper. Set aside.
3. Remove any excess fat from the duck and pat it dry with paper towels.
4. Place the duck on a roasting rack and brush the honey mixture all over the duck.
5. Cover the duck with foil and place it in the oven. Roast for 1 hour.
6. Remove the foil and continue roasting for another 30 minutes, basting the duck with the honey mixture every 10 minutes.



You can also try the "small" Hugging Face models here and compare their performance.

## 4. Agents and Tools

Agents can be thought of as “bots” that take actions. They chain together different actions in LangChain.

The LLM works as the brain of the agent, deciding which actions to take and in what order.

An action can be either:
- using a tool and observing its output
- returning a response to the user directly

The following parameters are required when creating an agent:

1. Tool: A tool is a function that performs a particular duty, such as a Google search, a database lookup, or another chain. The interface for a tool is currently a function that takes a string as input and returns a string as output.

2. LLM: The language model powering the agent.

3. Agent: The agent to use. This should be a string that references a supported agent class. For example, we are going to use the [ReAct](https://react-lm.github.io/) agent.
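The three ingredients above can be sketched as a small loop in plain Python. This is a simplified illustration of a ReAct-style loop, not LangChain's implementation. `scripted_llm` is a hypothetical stand-in that returns canned text instead of calling a model, and the toy `calculator` tool only handles products of the form `a * b`.

```python
import re
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """Toy tool: string in, string out. Only handles 'a * b' products."""
    left, right = expression.split("*")
    return str(float(left) * float(right))


TOOLS: Dict[str, Callable[[str], str]] = {"Calculator": calculator}


def scripted_llm(scratchpad: str) -> str:
    """Stand-in for a model: asks for the tool once, then answers."""
    if "Observation:" not in scratchpad:
        return "Thought: I need to multiply.\nAction: Calculator\nAction Input: 6 * 7"
    observation = scratchpad.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I now know the final answer\nFinal Answer: {observation}"


def run_agent(question: str, max_iterations: int = 5) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(max_iterations):
        reply = scripted_llm(scratchpad)
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            # The model chose to answer the user directly
            return final.group(1).strip()
        # Otherwise the model chose a tool: run it and record the observation
        action = re.search(r"Action:\s*(.*)", reply).group(1).strip()
        action_input = re.search(r"Action Input:\s*(.*)", reply).group(1).strip()
        observation = TOOLS[action](action_input)
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "Stopped: iteration limit reached"


toy_answer = run_agent("What is 6 times 7?")
print(toy_answer)  # prints: 42.0
```

The iteration cap guards against a model that never produces a final answer.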


In [None]:
from langchain.agents import load_tools
from langchain.agents.agent_types import AgentType
from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub

tools = load_tools(
    ["llm-math"],
    llm=llm
)
# Get the prompt to use - you can modify this!
prompt = hub.pull("hwchase17/react")

In [None]:
print(prompt)

input_variables=['agent_scratchpad', 'input', 'tool_names', 'tools'] template='Answer the following questions as best you can. You have access to the following tools:\n\n{tools}\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [{tool_names}]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: {input}\nThought:{agent_scratchpad}'


In [None]:
# Construct the ReAct agent
agent = create_react_agent(llm, tools, prompt)

In [None]:
# Create an agent executor by passing in the agent and tools
query = "What's the result of an investment of $10,000 growing at 4% annually for 5 years with compound interest?"
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=10)
result = agent_executor.invoke({"input": query})

> Entering new AgentExecutor chain...
To calculate compound interest, I need to know the formula and have access to a calculator.
Action: Calculator
Action Input: $10,000 * (1 + 0.04)^5
Answer: 12166.529024000001
I now know the final answer.
Final Answer: $12,166.53

> Finished chain.


{'input': "What's the result of an investment of $10,000 growing at 4% annually for 5 years with compound interest?",
 'output': '$12,166.53'}
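The agent's answer can be checked by hand with the compound interest formula A = P(1 + r)^n, using the values from the query. The variable names below are illustrative.

```python
principal = 10_000   # initial investment in dollars
rate = 0.04          # annual growth rate
years = 5

final_amount = principal * (1 + rate) ** years
print(f"${final_amount:,.2f}")  # prints: $12,166.53
```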

## 5. Memory

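Memory is the concept of persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.

It can be as simple as remembering what was said earlier in a conversation, or as involved as retrieving stored information.

Note that memory only helps if the prompt has a placeholder for it. The `hwchase17/react` prompt printed above has no `{chat_history}` variable, so in the run below the agent never sees the earlier scenario: the output shows an empty `chat_history`, and the 20000 baseline in the trace is the model's own guess, not the earlier $12,166.53 result.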
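A buffer-style memory can be sketched in plain Python as a list of past turns that is rendered into the next prompt. This is a simplified illustration, not LangChain's `ConversationBufferMemory`. The class `BufferMemory` and the helper `build_prompt` are hypothetical names.

```python
from typing import List, Tuple


class BufferMemory:
    """Keeps every (human, ai) turn and renders them as one text block."""

    def __init__(self) -> None:
        self.turns: List[Tuple[str, str]] = []

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def build_prompt(memory: BufferMemory, question: str) -> str:
    # The history must be written into the prompt, or the model never sees it
    return f"Previous conversation:\n{memory.render()}\n\nQuestion: {question}"


memory_sketch = BufferMemory()
memory_sketch.save("What is $10,000 at 4% for 5 years?", "$12,166.53")
prompt_with_history = build_prompt(memory_sketch, "And with $15,000 instead?")
print(prompt_with_history)
```

The key design point is that state lives outside the model: each call gets the stored turns re-sent as part of its prompt.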


In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history")

query = "If we start with $15,000 instead and follow the same 8% annual growth for 5 years with compound interest, how much more would we have compared to the previous scenario?"
agent_executor = AgentExecutor(agent=agent, tools=tools, memory=memory, verbose=True)
result = agent_executor.invoke({"input": query})

> Entering new AgentExecutor chain...
We can use the compound interest formula to solve this problem.
Action: Calculator
Action Input: 15000 * (1 + 0.08)^5
Answer: 22039.92115200001
We now know the amount we would have after 5 years.
Action: Calculator
Action Input: 22039.92115200001 - 20000
Answer: 2039.92115200001
I now know the final answer.
Final Answer: $2,039.92 more than the previous scenario.

> Finished chain.


{'input': 'If we start with $15,000 instead and follow the same 8% annual growth for 5 years with compound interest, how much more would we have compared to the previous scenario?',
 'chat_history': '',
 'output': '$2,039.92 more than the previous scenario.'}

## 6. Indexes

Indexes refer to ways of structuring documents so that LLMs can best interact with them. This module covers the following components:

- Document Loaders: load data into `Document` objects.
- Text Splitters: split long pieces of text into chunks that are small enough to work with.
- Embeddings: An embedding is a numerical representation of a piece of information, for example, text, documents, images, or audio.
- Vectorstores: Vector databases store and index the embeddings produced by NLP models, so that text can be searched by meaning and context for more accurate and relevant results.

Load sample text as the external knowledge:

In [None]:
import requests

url = "https://raw.githubusercontent.com/bt5153msba/bt5153msba.github.io/master/material/msba.txt"
res = requests.get(url)
with open("sample.txt", "w") as f:
  print(res.text)
  f.write(res.text)

The NUS Business Analytics Centre (BAC) was established in 2013, in collaboration with IBM, to develop the skills and knowledge of professionals in business analytics.

In a 5-year partnership that ensued, with IBM contributing industrial knowledge and NUS offering academic expertise, BAC offered and hosted the NUS Master of Science in Business Analytics (MSBA) programme. 

Based in Singapore, students both local and international, enrolled in this master's degree programme are trained to meet the growing demand of companies who are looking to improve their operations through business analytics.

NUS Master of Science in Business Analytics (MSBA) students are well-equipped with the expertise to excel in the data analytics field and serve a variety of industries such as retail, finance, information technology, healthcare and supply chain. 

To date, 300 industrial analytics projects have been accomplished by NUS MSBA students, with the institution developing valuable partnerships with o

#### 6.1 Documents

A `Document` is a LangChain object that holds a piece of text and its metadata (more information about that text).

In [None]:
# Document Loader
from langchain_community.document_loaders import TextLoader
loader = TextLoader('./sample.txt')
documents = loader.load()

In [None]:
#check text content
print(documents[0].page_content)

The NUS Business Analytics Centre (BAC) was established in 2013, in collaboration with IBM, to develop the skills and knowledge of professionals in business analytics.

In a 5-year partnership that ensued, with IBM contributing industrial knowledge and NUS offering academic expertise, BAC offered and hosted the NUS Master of Science in Business Analytics (MSBA) programme. 

Based in Singapore, students both local and international, enrolled in this master's degree programme are trained to meet the growing demand of companies who are looking to improve their operations through business analytics.

NUS Master of Science in Business Analytics (MSBA) students are well-equipped with the expertise to excel in the data analytics field and serve a variety of industries such as retail, finance, information technology, healthcare and supply chain. 

To date, 300 industrial analytics projects have been accomplished by NUS MSBA students, with the institution developing valuable partnerships with o

In [None]:
#check meta data
print(documents[0].metadata)

{'source': './sample.txt'}


#### 6.2 Text Splitter

Often your document is too long (like a book) for your LLM, so you need to split it up into chunks.

There are many ways to split text into chunks. Experiment with [different ones](https://python.langchain.com/docs/modules/data_connection/document_transformers/) to see which works best for you.

In [None]:
# Text Splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0, separator='\n') # chunk size is usually set to be around 1000
docs = text_splitter.split_documents(documents)




CharacterTextSplitter implements one splitting strategy. It splits only on the separator (which is '\n\n' by default). chunk_size is the maximum size of a chunk, provided a split is possible; a piece with no separator inside it is kept whole even if it is longer than chunk_size. If a string starts with n characters, has a separator, and has m more characters before the next separator, then the first chunk will have size n if chunk_size < n + m + len(separator).
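The rule above can be illustrated with a simplified plain-Python sketch: split on the separator, then greedily merge neighbouring pieces while the result still fits in `chunk_size`. This is an approximation for illustration, not the library's code. The function name `split_text` is hypothetical, and overlap handling and whitespace stripping are left out.

```python
from typing import List


def split_text(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size`."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # adding this piece would overflow
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\nbbbb\ncc"  # n = 4, m = 4, len(separator) = 1
print(split_text(sample, "\n", chunk_size=9))  # prints: ['aaaa\nbbbb', 'cc']
print(split_text(sample, "\n", chunk_size=8))  # prints: ['aaaa', 'bbbb\ncc']
print(split_text(sample, "\n", chunk_size=3))  # prints: ['aaaa', 'bbbb', 'cc']
```

The last call shows why the chunks in this notebook can exceed `chunk_size=100`: a piece that contains no separator is never cut.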

In [None]:
len(docs)

5

In [None]:
for doc_chunk in docs:
  print(doc_chunk)

page_content='The NUS Business Analytics Centre (BAC) was established in 2013, in collaboration with IBM, to develop the skills and knowledge of professionals in business analytics.' metadata={'source': './sample.txt'}
page_content='In a 5-year partnership that ensued, with IBM contributing industrial knowledge and NUS offering academic expertise, BAC offered and hosted the NUS Master of Science in Business Analytics (MSBA) programme.' metadata={'source': './sample.txt'}
page_content="Based in Singapore, students both local and international, enrolled in this master's degree programme are trained to meet the growing demand of companies who are looking to improve their operations through business analytics." metadata={'source': './sample.txt'}
page_content='NUS Master of Science in Business Analytics (MSBA) students are well-equipped with the expertise to excel in the data analytics field and serve a variety of industries such as retail, finance, information technology, healthcare and s

#### 6.3 Embeddings

In [None]:
# Embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

sample_text = "This is a test"
sample_result = embeddings.embed_documents([sample_text])
print(sample_result)

print (f"You have {len(docs)} documents")
embedding_list = embeddings.embed_documents([text.page_content for text in docs])
print (f"You have {len(embedding_list)} embeddings")
print (f"Here's a sample of one: {embedding_list[0][:3]}...")

[[-0.01107711624354124, -0.0987185537815094, -0.02173105627298355, 0.009868171997368336, -0.023400530219078064, 0.04282599687576294, 0.05967649072408676, 0.04518251493573189, 0.05964569374918938, 0.029384968802332878, 0.06904636323451996, -0.02587090991437435, 0.033072661608457565, -0.030291235074400902, 0.02752404287457466, -0.03600703179836273, 0.023956896737217903, -0.009273028001189232, -0.02163439430296421, 0.024006450548768044, -0.0657191053032875, 0.002653931500390172, -0.028488997370004654, -0.03272867947816849, -0.004460283555090427, 0.046927519142627716, -0.014092281460762024, -0.02701234072446823, 0.0018923203460872173, -0.03740781173110008, 0.026178548112511635, -0.03266902267932892, 0.016595840454101562, -0.07427819818258286, 1.805160195544886e-06, -0.0024881712161004543, 0.007248442154377699, -0.0223124697804451, -0.04737232252955437, -0.01517193391919136, -0.030517760664224625, 0.03185688704252243, -0.02177085541188717, 0.03812091052532196, -0.012869136407971382, -0.0558


#### 6.4 VectorStores

Vector stores are databases that store vectors. The most popular ones are [Pinecone](https://www.pinecone.io/) and [Weaviate](https://weaviate.io/). More examples are listed in OpenAI's [retriever documentation](https://github.com/openai/chatgpt-retrieval-plugin#choosing-a-vector-database). [Chroma](https://www.trychroma.com/) and [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) are easy to work with locally.

A vector database can be thought of as a big table with a column for embeddings (vectors) and a column for metadata. For example:

| Embedding | Metadata |
| ----------- | ----------- |
| [-0.00015641732898075134, -0.003165106289088726, ...] | {'date' : '1/2/23'} |
| [-0.00035465431654651654, 1.4654131651654516546, ...] | {'date' : '1/3/23'} |

The vectorstore stores the embeddings and makes them easily searchable.
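The table view translates directly into a toy in-memory store. The sketch below is plain Python with hand-made two-dimensional vectors. `ToyVectorStore` is a hypothetical name, and ranking by cosine similarity is an assumption made for illustration (real vector stores may use other distance measures and approximate indexes).

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class ToyVectorStore:
    """One row per item: an embedding column and a metadata column."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Vector, Dict[str, str]]] = []

    def add(self, embedding: Vector, metadata: Dict[str, str]) -> None:
        self.rows.append((embedding, metadata))

    def search(self, query: Vector, k: int = 1) -> List[Dict[str, str]]:
        # Rank all rows by similarity to the query, most similar first
        ranked = sorted(self.rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [metadata for _, metadata in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], {"date": "1/2/23"})
store.add([0.0, 1.0], {"date": "1/3/23"})
nearest = store.search([0.9, 0.1], k=1)
print(nearest)  # prints: [{'date': '1/2/23'}]
```

This linear scan compares the query against every row, which is fine for a handful of chunks. Libraries such as FAISS exist to make the same lookup fast over large collections.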


In [None]:
# Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)


In [None]:
query = "How many industrial projects have been done by MSBA students"
docs = db.similarity_search(query)

In [None]:
print(docs)

[Document(page_content='To date, 300 industrial analytics projects have been accomplished by NUS MSBA students, with the institution developing valuable partnerships with over 100 organisations.', metadata={'source': './sample.txt'}), Document(page_content='NUS Master of Science in Business Analytics (MSBA) students are well-equipped with the expertise to excel in the data analytics field and serve a variety of industries such as retail, finance, information technology, healthcare and supply chain.', metadata={'source': './sample.txt'}), Document(page_content='In a 5-year partnership that ensued, with IBM contributing industrial knowledge and NUS offering academic expertise, BAC offered and hosted the NUS Master of Science in Business Analytics (MSBA) programme.', metadata={'source': './sample.txt'}), Document(page_content="Based in Singapore, students both local and international, enrolled in this master's degree programme are trained to meet the growing demand of companies who are lo

The database can also be saved to the local disk

In [None]:
db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)
docs = new_db.similarity_search(query)
print(docs[0].page_content)

To date, 300 industrial analytics projects have been accomplished by NUS MSBA students, with the institution developing valuable partnerships with over 100 organisations.


This notebook cannot cover all aspects of LangChain. For more, check out the [LangChain Official Documentation](https://python.langchain.com/docs/get_started/introduction).
