<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0"> </div>
    <div style="float: left; margin-left: 10px;"> <h1>LangChain for Generative AI</h1>
<h1>Information Processing</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [38]:
from collections import Counter
from pprint import pprint
from typing import List, Optional


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

import langchain
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain, create_extraction_chain

import langchain_core
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

# Load documents directly from the web
import langchain_community
from langchain_community.document_loaders import WebBaseLoader

import langchain_openai
from langchain_openai import ChatOpenAI, OpenAI

import langchain_text_splitters
from langchain_text_splitters import CharacterTextSplitter


import watermark

%load_ext watermark
%matplotlib inline

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


We start by printing out the versions of the libraries we're using, for future reference

In [39]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.12.1
IPython version      : 8.21.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.30)
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

Git hash: 4d783750c8d1558a44c8d77e189a1834b37ca63d

langchain               : 0.1.16
matplotlib              : 3.8.3
pandas                  : 2.2.1
langchain_community     : 0.0.34
watermark               : 2.4.3
langchain_core          : 0.1.45
numpy                   : 1.26.4
langchain_openai        : 0.1.3
langchain_text_splitters: 0.0.1



Load default figure style

In [40]:
plt.style.use('./d4sci.mplstyle')

# Text Summarization

## Summarizing a Paragraph

In [41]:
prompt_template = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""

In [5]:
text = """The history of Portugal can be traced from circa 400,001 years ago, when the region of present-day Portugal was inhabited by Homo heidelbergensis.

The Roman conquest of the Iberian Peninsula, which lasted almost two centuries, led to the establishment of the provinces of Lusitania in the south and Gallaecia in the north of what is now Portugal. Following the fall of Rome, Germanic tribes controlled the territory between the 5th and 8th centuries, including the Kingdom of the Suebi centred in Braga and the Visigothic Kingdom in the south.

The 711–716 invasion by the Islamic Umayyad Caliphate conquered the Visigoth Kingdom and founded the Islamic State of Al-Andalus, gradually advancing through Iberia. In 1095, Portugal broke away from the Kingdom of Galicia. Afonso Henriques, son of the count Henry of Burgundy, proclaimed himself king of Portugal in 1139. The Algarve (the southernmost province of Portugal) was conquered from the Moors in 1249, and in 1255 Lisbon became the capital. Portugal's land boundaries have remained almost unchanged since then. During the reign of King John I, the Portuguese defeated the Castilians in a war over the throne (1385) and established a political alliance with England (by the Treaty of Windsor in 1386).

From the late Middle Ages, in the 15th and 16th centuries, Portugal ascended to the status of a world power during Europe's "Age of Discovery" as it built up a vast empire. Signs of military decline began with the Battle of Alcácer Quibir in Morocco in 1578; this defeat led to the death of King Sebastian and the imprisonment of much of the high nobility, which had to be ransomed at great cost. This eventually led to a small interruption in Portugal's 800-year-old independence by way of a 60-year dynastic union with Spain between 1580 and the beginning of the Portuguese Restoration War led by John IV in 1640. Spain's disastrous defeat in its attempt to conquer England in 1588 by means of the Invincible Armada was also a factor, as Portugal had to contribute ships for the invasion. Further setbacks included the destruction of much of its capital city in an earthquake in 1755, occupation during the Napoleonic Wars, and the loss of its largest colony, Brazil, in 1822. From the middle of the 19th century to the late 1950s, nearly two million Portuguese left Portugal to live in Brazil and the United States.[1]

In 1910, a revolution deposed the monarchy. A military coup in 1926 installed a dictatorship that remained until another coup in 1974. The new government instituted sweeping democratic reforms and granted independence to all of Portugal's African colonies in 1975. Portugal is a founding member of NATO, the Organisation for Economic Co-operation and Development (OECD), the European Free Trade Association (EFTA), and the Community of Portuguese Language Countries. It entered the European Economic Community (now the European Union) in 1986. 
"""

In [6]:
# Pass the input as a dict so the {text} placeholder is filled explicitly
summary_prompt = PromptTemplate.from_template(prompt_template).invoke({"text": text})

In [7]:
summary_prompt

StringPromptValue(text='Write a concise summary of the following:\n"The history of Portugal can be traced from circa 400,001 years ago, when the region of present-day Portugal was inhabited by Homo heidelbergensis.\n\nThe Roman conquest of the Iberian Peninsula, which lasted almost two centuries, led to the establishment of the provinces of Lusitania in the south and Gallaecia in the north of what is now Portugal. Following the fall of Rome, Germanic tribes controlled the territory between the 5th and 8th centuries, including the Kingdom of the Suebi centred in Braga and the Visigothic Kingdom in the south.\n\nThe 711–716 invasion by the Islamic Umayyad Caliphate conquered the Visigoth Kingdom and founded the Islamic State of Al-Andalus, gradually advancing through Iberia. In 1095, Portugal broke away from the Kingdom of Galicia. Afonso Henriques, son of the count Henry of Burgundy, proclaimed himself king of Portugal in 1139. The Algarve (the southernmost province of Portugal) was con

In [None]:
print(summary_prompt.text)
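
The prompt value above is simply the template with the `{text}` placeholder replaced by our input. A minimal sketch of the same substitution in plain Python, assuming nothing beyond `str.format` (this mimics the result, not LangChain's internals; `sample_text` is a made-up input used only for illustration):

```python
# Plain-Python stand-in for PromptTemplate: fill the {text} placeholder.
prompt_template = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""

# Hypothetical short input, used only for illustration.
sample_text = "Portugal joined the European Economic Community in 1986."

filled_prompt = prompt_template.format(text=sample_text)
print(filled_prompt)
```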

In [8]:
llm = OpenAI(temperature=0)

In [10]:
# llm(summary_prompt.text) is the older call style; invoke() is preferred
output = llm.invoke(summary_prompt.text)
print(output)


Portugal's history dates back to 400,001 years ago when it was inhabited by Homo heidelbergensis. It was later conquered by the Romans and then controlled by Germanic tribes. In 711, the Islamic Umayyad Caliphate invaded and established the Islamic State of Al-Andalus. In 1139, Portugal broke away from the Kingdom of Galicia and became an independent kingdom under King Afonso Henriques. During the 15th and 16th centuries, Portugal became a world power through its vast empire. However, it suffered setbacks such as defeat in the Battle of Alcácer Quibir and the loss of Brazil in 1822. In 1910, a revolution deposed the monarchy and a dictatorship was installed until 1974. Portugal then underwent democratic reforms and granted independence to its African colonies. It is a member of various international organizations and joined the European Union in 1986.


## Summarizing a Document

In [11]:
# Instead of pasting text, we can point the loader to a URL using WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")

result = chain.invoke(docs)

print(result["output_text"])

The article discusses the concept of LLM-powered autonomous agents, which use large language models as their core controllers. It covers the components of these agents, including planning, memory, and tool use, as well as case studies and proof-of-concept examples. The challenges and limitations of using natural language interfaces for these agents are also discussed. The article provides citations and references for further reading.


In [13]:
# Define prompt
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain; document_variable_name must match the {text} placeholder in the prompt
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")

docs = loader.load()
print(stuff_chain.invoke(docs)["output_text"])

The article discusses the concept of building autonomous agents powered by large language models (LLMs). It explores the components of such agents, including planning, memory, and tool use. The article provides case studies and examples of proof-of-concept demos, highlighting the challenges and limitations of LLM-powered agents. It also includes citations and references for further reading.
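
Conceptually, the "stuff" strategy joins every document into one prompt and makes a single LLM call. The sketch below illustrates that idea in plain Python. It is not LangChain's implementation: `stuff_documents` and `fake_llm` are hypothetical names, and the `"\n\n"` separator is an assumption made here for illustration.

```python
from typing import Callable, List

PROMPT_TEMPLATE = 'Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:'


def stuff_documents(docs: List[str], llm: Callable[[str], str], separator: str = "\n\n") -> str:
    """Join all documents into one string and make a single LLM call."""
    combined = separator.join(docs)
    prompt = PROMPT_TEMPLATE.format(text=combined)
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: reports the prompt size instead of summarizing."""
    return f"[summary of a {len(prompt)}-character prompt]"


pages = ["First page of the article.", "Second page of the article."]
stuffed_output = stuff_documents(pages, fake_llm)
print(stuffed_output)
```

Because everything goes into one call, this approach only works while the combined text fits in the model's context window, which is presumably why the cell above uses the 16k variant of the model.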


## Using MapReduce

In [15]:
llm = ChatOpenAI(temperature=0.7)

Map

In [16]:
map_template = """The following is a set of documents
{docs}
Based on this list of docs, please identify the main themes 
Helpful Answer:"""

In [17]:
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

Reduce

In [18]:
reduce_template = """The following is a set of summaries:
{docs}
Take these and distill it into a final, consolidated summary of the main themes. 
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

Run chain

In [19]:
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

In [20]:
# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

Iteratively reduces the mapped documents

In [27]:
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

Full MapReduce chain

In [28]:
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)
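
The pieces above fit together as follows: the map step processes each chunk independently, the collapse step groups intermediate results so that no group exceeds `token_max`, and the final reduce step combines whatever remains. A simplified sketch of that control flow in plain Python follows. It is an illustration under stated assumptions, not LangChain's implementation: tokens are approximated by whitespace-separated words, the same reduce function serves as both the collapse and combine step (mirroring the configuration above), and `fake_map` and `fake_reduce` are hypothetical stand-ins for LLM calls.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace words); a real chain uses the model's tokenizer."""
    return len(text.split())


def group_by_budget(texts: List[str], token_max: int) -> List[List[str]]:
    """Greedily pack texts into groups whose estimated size stays within token_max."""
    groups: List[List[str]] = []
    current: List[str] = []
    for text in texts:
        if current and count_tokens(" ".join(current + [text])) > token_max:
            groups.append(current)
            current = []
        current.append(text)
    if current:
        groups.append(current)
    return groups


def map_reduce(
    chunks: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[List[str]], str],
    token_max: int,
) -> str:
    # Map: process every chunk independently.
    mapped = [map_fn(chunk) for chunk in chunks]
    # Collapse: while the results are too large for one call, reduce them group by group.
    while len(mapped) > 1 and count_tokens(" ".join(mapped)) > token_max:
        groups = group_by_budget(mapped, token_max)
        if len(groups) == len(mapped):
            break  # no grouping possible; stop rather than loop forever
        mapped = [reduce_fn(group) for group in groups]
    # Reduce: one final call over whatever is left.
    return reduce_fn(mapped)


calls: List[str] = []


def fake_map(chunk: str) -> str:
    calls.append("map")
    return chunk.split()[0]  # pretend the first word is the chunk's theme


def fake_reduce(items: List[str]) -> str:
    calls.append("reduce")
    return "/".join(items)  # joined without spaces, so it counts as one "token" here


chunks = ["alpha one two", "beta three four", "gamma five six", "delta seven eight"]
final_summary = map_reduce(chunks, fake_map, fake_reduce, token_max=2)
print(final_summary)
print(calls)
```

With a budget of two "tokens", the four mapped results are first collapsed in pairs and then combined, so the reduce function is called three times in total.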

In [29]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)

split_docs = text_splitter.split_documents(docs)

Created a chunk of size 1003, which is longer than the specified 1000


In [30]:
print(len(split_docs))

14
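
The warning about a chunk of size 1003 suggests how the splitter works: the text is first cut on a separator, the pieces are then merged greedily up to `chunk_size`, and a single piece that is already too long is kept whole rather than cut mid-paragraph. A simplified sketch of that behavior, assuming sizes measured in whitespace words instead of tiktoken tokens and ignoring `chunk_overlap` (`split_and_merge` is a hypothetical helper, not the library's code):

```python
from typing import List


def split_and_merge(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split on a separator, then greedily merge pieces up to chunk_size (in words)."""
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in pieces:
        piece_len = len(piece.split())
        if piece_len > chunk_size:
            # An oversized piece is kept whole; we only report it.
            print(f"Piece of size {piece_len} is longer than the specified {chunk_size}")
        if current and current_len + piece_len > chunk_size:
            chunks.append(separator.join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += piece_len
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two\n\nthree four\n\nfive six seven eight nine\n\nten"
sample_chunks = split_and_merge(sample, chunk_size=4)
print(sample_chunks)
```

Here the first two paragraphs are merged into one chunk, the five-word paragraph becomes an oversized chunk on its own, and the last paragraph forms a third chunk.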


In [33]:
result = map_reduce_chain.invoke(split_docs)

print(result["output_text"])

The main themes across the set of documents include the utilization of large language models (LLMs) in autonomous agents for improved reasoning, problem-solving, and planning proficiency. The documents discuss feedback alignment, synergy between reasoning and acting, tool augmentation, reinforcement learning, neuro-symbolic architectures, question-answering, and AI tasks. Specific references to projects, tools, benchmarks, chemistry tools, scientific research capabilities, generative agents, and interactive simulacra of human behavior are also highlighted. Additionally, challenges related to communication bandwidth, long-term planning, task decomposition for LLMs, reliability issues with natural language interfaces, and model outputs are addressed.


# Information Extraction

Based on https://python.langchain.com/v0.1/docs/use_cases/extraction/quickstart/

We start by loading a number of tweets from a CSV file

In [32]:
data = pd.read_csv('data/trump.csv')
data.head()

Unnamed: 0,text,created_at,id_str
0,.@FoxNews is no longer the same. We miss the g...,05-19-2020 01:59:49,1262563582086184970
1,So the so-called HHS Whistleblower was against...,05-18-2020 14:44:21,1262393595560067073
2,.....mixed about even wanting us to get out. T...,05-18-2020 14:39:40,1262392415513690112
3,Wow! The Front Page @washingtonpost Headline r...,05-18-2020 12:47:40,1262364231288197123
4,MAGA crowds are bigger than ever! https://t.co...,05-18-2020 12:26:37,1262358931982123008


Next, we define the data structures that will hold the data we want our chain to extract. The docstrings and field descriptions help the LLM understand what each field means

In [34]:
class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(default=None, description="The name of the person")
    twitter_handle: Optional[str] = Field(
        default=None, description="The twitter handle if known"
    )

class Data(BaseModel):
    """Extracted data about people."""

    # Creates a model so that we can extract multiple entities.
    people: List[Person]

In [35]:
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert extraction algorithm. "
            "Only extract relevant information from the text. "
            "If you do not know the value of an attribute asked to extract, "
            "return null for the attribute's value.",
        ),
        ("human", "{text}"),
    ]
)

Finally, we build our chain by piping the prompt into an LLM configured to return structured output

In [36]:
runnable = prompt | llm.with_structured_output(schema=Data)
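
The `|` operator composes two steps so that the output of the first becomes the input of the second. A toy sketch of that idea in plain Python, assuming a hypothetical `Step` class (this is not LangChain's `Runnable`, and the "extractor" is a trivial stand-in rather than a model):

```python
from typing import Any, Callable


class Step:
    """Hypothetical minimal 'runnable': wraps a function and supports `|` composition."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins: format a prompt, then "extract" capitalized words instead of calling a model.
format_prompt = Step(lambda inputs: f"Extract people from: {inputs['text']}")
fake_extractor = Step(
    lambda prompt: [word for word in prompt.split(": ", 1)[1].split() if word.istitle()]
)

pipeline = format_prompt | fake_extractor
pipeline_output = pipeline.invoke({"text": "thanks to Norah and Roger for the interview"})
print(pipeline_output)
```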



In [37]:
for tweet in data['text'].head(10):
    response = runnable.invoke({"text": tweet})
    print(response)

people=[Person(name='Roger Ailes', twitter_handle=None)]
people=[Person(name="Norah O'Donnell", twitter_handle='@NorahODonnell')]
people=[Person(name='unknown', twitter_handle=None)]
people=[Person(name='Obama', twitter_handle=None)]
people=[Person(name='MAGA crowds', twitter_handle=None)]
people=[Person(name=None, twitter_handle='SecAzar')]
people=[Person(name="Norah O'Donnell", twitter_handle='NorahODonnell')]
people=[Person(name='boaters', twitter_handle=None)]
people=[Person(name='The United States', twitter_handle=None)]
people=[Person(name='Sleepy Joe Biden', twitter_handle=None)]
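
The raw results are noisy: the same handle appears with and without a leading `@`, and placeholder names such as `'unknown'` slip through. One possible post-processing step is to normalize handles and tally mentions with `Counter`. The sketch below uses a dataclass stand-in for `Person` so that it runs without LangChain; `ExtractedPerson`, `normalize_handle`, and `count_mentions` are hypothetical names, and the lowercasing rule is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedPerson:
    """Stand-in for the Person model above, so the sketch runs without LangChain."""

    name: Optional[str] = None
    twitter_handle: Optional[str] = None


def normalize_handle(handle: Optional[str]) -> Optional[str]:
    """Strip a leading '@' and lowercase, so '@NorahODonnell' and 'NorahODonnell' match."""
    if not handle:
        return None
    return handle.lstrip("@").lower() or None


def count_mentions(people: List[ExtractedPerson]) -> Counter:
    """Count mentions, keyed by normalized handle when available, else by name."""
    counts: Counter = Counter()
    for person in people:
        key = normalize_handle(person.twitter_handle) or person.name
        if key and key.lower() != "unknown":
            counts[key] += 1
    return counts


# A few results copied from the output above.
extracted = [
    ExtractedPerson(name="Roger Ailes"),
    ExtractedPerson(name="Norah O'Donnell", twitter_handle="@NorahODonnell"),
    ExtractedPerson(name="unknown"),
    ExtractedPerson(name=None, twitter_handle="SecAzar"),
    ExtractedPerson(name="Norah O'Donnell", twitter_handle="NorahODonnell"),
]
mention_counts = count_mentions(extracted)
print(mention_counts.most_common())
```

This does not remove entries that are not people at all (for example `'MAGA crowds'` or `'The United States'`); filtering those would likely need a stricter prompt or an additional validation step.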


<center>
     <img src="data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>