<a href="https://colab.research.google.com/github/zackives/upenn-cis-2450/blob/main/cis2450lab4nbpt1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Extracting Structured Information from Documents with langchain

We've seen how ChatGPT, combined with langchain, lets us build tables of data from unstructured text supplied as strings. Now we investigate an LLM's ability to extract information from a document fetched directly from the internet.

In [None]:
!pip install langchain
!pip install langchain_community
!pip install python-magic
!pip install langchain-openai



In [None]:
import magic
import pandas as pd
import re
import requests
from typing import List, Optional
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders.parsers import BS4HTMLParser, PDFMinerParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders import Blob
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI



Note: this import cell emits a deprecation warning for `langchain_core.pydantic_v1`. To resolve it, replace imports like `from langchain_core.pydantic_v1 import BaseModel` with `from pydantic import BaseModel`, or with the v1 compatibility namespace, `from pydantic.v1 import BaseModel`, if you are working in a code base that has not been fully upgraded to pydantic 2 yet.


In [None]:
%set_env OPENAI_API_KEY=#TODO: PUT KEY HERE

We will work with the Wikipedia article on the 1967 World Series.

In [None]:
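response = requests.get("https://en.wikipedia.org/wiki/1967_World_Series")
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
data = response.content  # raw bytes of the page
data[:20]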

b'<!DOCTYPE html>\n<htm'

In [None]:
# Configure the parsers that you want to use per mime-type!
HANDLERS = {
    "application/pdf": PDFMinerParser(),
    "text/plain": TextParser(),
    "text/html": BS4HTMLParser(),
}

# Instantiate a mimetype based parser with the given parsers
MIMETYPE_BASED_PARSER = MimeTypeBasedParser(
    handlers=HANDLERS,
    fallback_parser=None,
)

mime = magic.Magic(mime=True)
mime_type = mime.from_buffer(data)

# A blob represents binary data by either reference (path on file system)
# or value (bytes in memory).
blob = Blob.from_data(
    data=data,
    mime_type=mime_type,
)

# Let the MIME-type-based parser pick the handler that matches the blob's MIME type
documents = MIMETYPE_BASED_PARSER.parse(blob=blob)
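The dispatch idea behind a MIME-type-based parser can be sketched in plain Python. This is only an illustration of the pattern, not langchain's implementation: `dispatch`, `parse_plain`, `parse_html`, and `toy_handlers` are hypothetical names, and the stub handlers merely decode bytes.

```python
from typing import Callable, Dict, Optional


# Hypothetical stand-ins for real parsers: each takes raw bytes and returns text.
def parse_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


def parse_html(data: bytes) -> str:
    # A real HTML parser would strip tags; this stub only decodes the bytes.
    return data.decode("utf-8", errors="replace")


def dispatch(
    data: bytes,
    mime_type: str,
    handlers: Dict[str, Callable[[bytes], str]],
    fallback: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Pick a handler by MIME type; use the fallback, or fail, when none matches."""
    handler = handlers.get(mime_type, fallback)
    if handler is None:
        raise ValueError(f"Unsupported mime type: {mime_type}")
    return handler(data)


toy_handlers = {"text/plain": parse_plain, "text/html": parse_html}
print(dispatch(b"<!DOCTYPE html>", "text/html", toy_handlers))
```

Keeping the handlers in a dictionary means that supporting a new document type only requires adding one entry, which is the same design choice the `HANDLERS` mapping above reflects.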

Check out the parsed result, which is returned as a list of `Document` objects. Each one holds the extracted page text in `page_content` and descriptive fields such as the source and title in `metadata`.

In [None]:
documents

[Document(metadata={'source': None, 'title': '1967 World Series - Wikipedia'}, page_content='\n\n\n1967 World Series - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAppearance\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1\nBackground\n\n\n\n\nToggle Background subsection\n\n\n\n\n\n1.1\nBoston Red

In [None]:
print(documents[0].page_content[:30].strip())

1967 World Series - Wikiped
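The extracted text contains long runs of blank lines and tabs left over from the page layout. Since the whole text is later sent to the model, it may be worth compacting it first. The following is a minimal sketch under that assumption; `collapse_whitespace` is a hypothetical helper, not part of langchain, and the sample string is taken from the start of the parsed output above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Trim spaces and tabs around line breaks and squeeze runs of blank lines."""
    # Remove spaces/tabs on either side of every newline.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "\n\n\n1967 World Series - Wikipedia\n\n\n\n\nJump to content"
    "\n\n\n\t\tNavigation\n\t\n\n"
)
print(collapse_whitespace(sample))
```

In the notebook this could be applied as `collapse_whitespace(documents[0].page_content)` before building a prompt.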


We start by asking an informational question. Passing the page text in through the `{context}` variable instructs the LLM to base its answer on the document we just parsed.

In [None]:

template = """Please limit your information gathering to this text: {context}


Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

llm = ChatOpenAI(temperature=0, model="gpt-4o")
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What was Boston's Impossible Dream?"

# invoke() returns a dict; LLMChain stores the generated answer under the "text" key
result = llm_chain.invoke({"context": documents[0].page_content, "question": question})
answer = result["text"]

answer


'The term "Impossible Dream" refers to the 1967 Boston Red Sox season, which was notable for its dramatic and unexpected success. Here are the key points that explain why it was called the "Impossible Dream":\n\n1. **Background of the Team**: The Boston Red Sox had experienced eight straight losing seasons prior to 1967. Interest in the team had waned due to their prolonged period of poor performance.\n\n2. **Key Players**: The team was led by Carl Yastrzemski, who won the Triple Crown and the Most Valuable Player (MVP) award for his performance in 1967, and Jim Lonborg, who won the American League Cy Young Award as the best pitcher.\n\n3. **Dramatic Pennant Race**: The Red Sox were part of a dramatic four-team pennant race that included the Detroit Tigers, Minnesota Twins, and Chicago White Sox. Going into the last week of the season, all four teams were within one game of each other in the standings.\n\n4. **Final Games**: The Red Sox played the Minnesota Twins in Boston for the fina
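To see what text the model actually receives, the template substitution can be reproduced with plain `str.format`. This is a sketch, not langchain's code: `build_prompt` and its `max_chars` budget are hypothetical, added here only to show one way of keeping a very long page from dominating the prompt.

```python
TEMPLATE = """Please limit your information gathering to this text: {context}


Question: {question}

Answer: Let's think step by step."""


def build_prompt(context: str, question: str, max_chars: int = 2000) -> str:
    """Fill the template, truncating the context to a hypothetical character budget."""
    return TEMPLATE.format(context=context[:max_chars], question=question)


demo = build_prompt(
    "The 1967 World Series matched Boston against St. Louis.",
    "Which teams played?",
)
print(demo)
```

Truncation by character count is crude, since it can cut the article off before the relevant passage. It is shown here only because it is the simplest way to bound the prompt size.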

In [None]:
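# Format the response readably: print one line at a time, replacing each run of
# characters other than letters, digits, ':' and "'" with a single space
for sentence in answer.split('\n'):
    print(re.sub(r'[^a-zA-Z0-9:\']+', ' ', sentence))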

The term Impossible Dream refers to the 1967 Boston Red Sox season which was notable for its dramatic and unexpected success Here are the key points that explain why it was called the Impossible Dream :

1 Background of the Team : The Boston Red Sox had experienced eight straight losing seasons prior to 1967 Interest in the team had waned due to their prolonged period of poor performance 

2 Key Players : The team was led by Carl Yastrzemski who won the Triple Crown and the Most Valuable Player MVP award for his performance in 1967 and Jim Lonborg who won the American League Cy Young Award as the best pitcher 

3 Dramatic Pennant Race : The Red Sox were part of a dramatic four team pennant race that included the Detroit Tigers Minnesota Twins and Chicago White Sox Going into the last week of the season all four teams were within one game of each other in the standings 

4 Final Games : The Red Sox played the Minnesota Twins in Boston for the final two games of the season The Twins held
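The regex above also removes commas, periods, and quotation marks, which makes the sentences run together. An alternative sketch that keeps the punctuation and only wraps long lines uses the standard-library `textwrap` module; `pretty_print` is a hypothetical helper, and the sample text is copied from the beginning of the answer shown earlier.

```python
import textwrap


def pretty_print(answer: str, width: int = 80) -> str:
    """Wrap each line of a multi-line answer while keeping its punctuation."""
    wrapped = []
    for line in answer.split("\n"):
        # textwrap.fill returns "" for an empty line, so paragraph breaks survive.
        wrapped.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped)


sample_answer = (
    'The term "Impossible Dream" refers to the 1967 Boston Red Sox season, '
    "which was notable for its dramatic and unexpected success.\n\n"
    "1. **Background of the Team**: The Boston Red Sox had experienced "
    "eight straight losing seasons prior to 1967."
)
print(pretty_print(sample_answer, width=60))
```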

Now we are ready to extract structured information. For our first attempt, we ask the model for a table of the starting pitchers in each game of the World Series.

In [None]:
class Game(BaseModel):
    boston: str = Field(description="Starting pitcher for Boston")
    st_louis: str = Field(description="Starting pitcher for St Louis")

class Document(BaseModel):
    pitchers: List[Game] = Field(..., description="Starting pitchers for each game")


llm = ChatOpenAI(temperature=0, model="gpt-4o")
structured_llm = llm.with_structured_output(Document)
results = structured_llm.invoke('''
        You are an extraction algorithm. Please look in the article and extract the Starting pitcher for both teams for all 7 World Series games.\n\n
''' + documents[0].page_content)

In [None]:
results_df = pd.DataFrame([pitcher.dict() for pitcher in results.pitchers])
results_df

| | boston | st_louis |
|---|---|---|
| 0 | José Santiago | Bob Gibson |
| 1 | Jim Lonborg | Dick Hughes |
| 2 | Gary Bell | Nelson Briles |
| 3 | José Santiago | Bob Gibson |
| 4 | Jim Lonborg | Steve Carlton |
| 5 | Gary Waslewski | Dick Hughes |
| 6 | Jim Lonborg | Bob Gibson |
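Because the model fills in the schema on its own, a cheap sanity check on the extracted rows can catch obvious problems such as a missing game or an empty field. The sketch below uses plain dictionaries rather than the pydantic objects; `check_games` is a hypothetical helper, and the rows are copied from the table above.

```python
from typing import Dict, List


def check_games(rows: List[Dict[str, str]], expected_games: int = 7) -> List[str]:
    """Return a list of problems found; an empty list means the table looks sane."""
    problems = []
    if len(rows) != expected_games:
        problems.append(f"expected {expected_games} games, got {len(rows)}")
    for i, row in enumerate(rows, start=1):
        for team in ("boston", "st_louis"):
            # Treat a missing key or a whitespace-only value as a missing starter.
            if not row.get(team, "").strip():
                problems.append(f"game {i}: missing starter for {team}")
    return problems


rows = [
    {"boston": "José Santiago", "st_louis": "Bob Gibson"},
    {"boston": "Jim Lonborg", "st_louis": "Dick Hughes"},
    {"boston": "Gary Bell", "st_louis": "Nelson Briles"},
    {"boston": "José Santiago", "st_louis": "Bob Gibson"},
    {"boston": "Jim Lonborg", "st_louis": "Steve Carlton"},
    {"boston": "Gary Waslewski", "st_louis": "Dick Hughes"},
    {"boston": "Jim Lonborg", "st_louis": "Bob Gibson"},
]
print(check_games(rows))  # prints [] because all seven rows are complete
```

In the notebook, the same check could be run on `results_df.to_dict("records")`. It verifies only the shape of the table, not whether the names are correct.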


Second, we ask ChatGPT for a table of the people mentioned in the article, along with the number of times each person's name appears.

In [None]:
class Person(BaseModel):
    first: str = Field(description="The first name of this person")
    last: str = Field(description="The last name of this person")
    freq: int = Field(description="The number of times this person's name was mentioned in the article")


class Document(BaseModel):
    people: List[Person] = Field(..., description="List of persons who appeared in the Wikipedia article, along with the frequency of appearances")


names_structured_llm = llm.with_structured_output(Document) # reuse GPT model
names = names_structured_llm.invoke('''
        You are an extraction algorithm. Please look in the article and extract the first and last names of every person
        mentioned in the article, along with the number of times that person is mentioned.\n\n
''' + documents[0].page_content)

In [None]:
names_df = pd.DataFrame([person.dict() for person in names.people])
names_df

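| | first | last | freq |
|---|---|---|---|
| 0 | Red | Schoendienst | 3 |
| 1 | Dick | Williams | 4 |
| 2 | Bob | Gibson | 10 |
| 3 | Johnny | Stevens | 1 |
| 4 | Al | Barlick | 3 |
| ... | ... | ... | ... |
| 78 | Don | Sutton | 1 |
| 79 | Bobby | Valentine | 1 |
| 80 | Mike | Shannon | 1 |
| 81 | Joe | Torre | 1 |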
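Counts reported by an LLM may not be exact, so it can help to compare them against a deterministic count. The sketch below counts whole-word occurrences of a surname with a regular expression. `count_mentions` is a hypothetical helper, and the sample sentence is a toy text built from the pitchers table above, not the article itself.

```python
import re


def count_mentions(text: str, last_name: str) -> int:
    """Count whole-word, case-sensitive occurrences of a surname in the text."""
    pattern = r"\b" + re.escape(last_name) + r"\b"
    return len(re.findall(pattern, text))


sample_text = (
    "Bob Gibson started Game 1. Gibson also started Game 4 and Game 7. "
    "Jim Lonborg started Game 2."
)
print(count_mentions(sample_text, "Gibson"))   # prints 2
print(count_mentions(sample_text, "Lonborg"))  # prints 1
```

In the notebook, `count_mentions(documents[0].page_content, "Gibson")` would give a figure to set beside the model's `freq` value. Matching on surname alone is only a proxy: it misses pronouns and nicknames, and it conflates people who share a last name.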
