In [None]:
!pip install langchain
!pip install huggingface_hub
!pip install transformers
!pip install datasets
!pip install unstructured
!pip install pdf2image
!pip install pdfminer
!pip install pdfminer-six
!pip install sentence_transformers
!pip install chromadb
!pip install arxiv

In [2]:
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

··········


In [3]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

In [4]:
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

In [5]:
question = "Who won the FIFA World Cup in the year 1994? "
template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])

In [6]:
# repo_id = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"

In [7]:
repo_id = "tiiuae/falcon-7b-instruct"

In [8]:
llm = HuggingFaceHub(
    repo_id=repo_id, task="text-generation", model_kwargs={"temperature": 0.5, "max_length": 64}
)

In [9]:
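from langchain import PromptTemplate

# The <|prompter|> / <|assistant|> markers follow the OpenAssistant prompt format
# (the model commented out above); the template is reused unchanged with Falcon.
template = """<|prompter|>{question}<|endoftext|><|assistant|>"""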

prompt = PromptTemplate(template=template, input_variables=["question"])
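
To see the exact text the model receives, the template can be filled in by hand. This is a minimal sketch, assuming that a one-variable `PromptTemplate` in its default format behaves like Python's `str.format`; `render` is a hypothetical helper, not a LangChain function.

```python
# Hypothetical helper that mimics filling a one-variable prompt template.
template = "<|prompter|>{question}<|endoftext|><|assistant|>"


def render(template: str, question: str) -> str:
    """Substitute the question into the template placeholder."""
    return template.format(question=question)


rendered = render(template, "What is the meaning of life?")
print(rendered)
```

Printing the rendered prompt before calling the model is a cheap way to catch a template that does not match the model's expected format.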

In [10]:
from langchain import LLMChain

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is the meaning of life?"

In [11]:
llm_chain.run(question)

'The meaning of life is a philosophical question that has been debated by philosophers and scientists alike for centuries.'

The model is working. Next we will ask it about a specific topic; the information it needs will be taken from a PDF file that is available online.

In [37]:
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/1911.01547.pdf")
document = loader.load()

The way to use a language model to extract information from a document is to pass the document's content as context, as part of the prompt. For a short document, such as a blog post or an email, we can pass the whole document as context. For a long document, such as a book or a large PDF like the one in this example, we may not be able to pass the whole document, because the number of tokens we can pass to the model is limited.

One of the big advantages of GPT-4 is that its extended-context variant can accept contexts of up to 32k tokens 🤯. However, many of the available models do not go beyond a few thousand tokens.

In [38]:
len(document[0].page_content)

178100
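
A back-of-the-envelope check shows why the whole document cannot simply go into the prompt. This sketch assumes roughly 4 characters per token, which is only a rule of thumb and not a property of any particular tokenizer; the context limits in the loop are illustrative values, and `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
# Assumed heuristic: about 4 characters per token (varies by tokenizer and language).
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from a character count."""
    return num_chars // chars_per_token


def fits_in_context(num_chars: int, context_tokens: int) -> bool:
    """True if the estimated token count fits within the context window."""
    return estimate_tokens(num_chars) <= context_tokens


doc_chars = 178100  # length of the loaded document reported above
approx_tokens = estimate_tokens(doc_chars)
print("estimated tokens:", approx_tokens)

# Illustrative context window sizes, not tied to a specific model.
for limit in (2048, 8192, 32768):
    print(limit, fits_in_context(doc_chars, limit))
```

Under this assumption the document exceeds every window in the loop, which is what motivates splitting it into chunks.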

To achieve our goal, we will have to split our document into pieces, or chunks. Once again, LangChain offers functionality for this with its text splitters.

In [39]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=64)

# texts = text_splitter.split_text(raw_text)
documents = text_splitter.split_documents(document)

len(documents)



211

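In this example, 211 documents were generated, each up to roughly 1024 characters long (CharacterTextSplitter measures length in characters by default, not in tokens), with an overlap of up to 64 characters between consecutive chunks so that information is not lost at the boundaries.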
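
The idea of overlapping chunks can be illustrated with a plain sliding window over characters. This is a simplified sketch, not what `CharacterTextSplitter` does internally: the real splitter first splits on a separator and then merges pieces, so its chunks vary in length. `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 64) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic text so the example runs without downloading anything.
sample = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-64:] == chunks[1][:64])
```

The repeated tail is what keeps a sentence that straddles a boundary available in at least one chunk.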

In [40]:
documents[10].page_content

'1Turing’s imitation game was largely meant as an argumentative device in a philosophical discussion, not as a literal test of intelligence. Mistaking it for a test representative of the goal of the ﬁeld of AI has been an ongoing problem.\n\n3\n\nplicit deﬁnitions has been substituted with implicit deﬁnitions and biases that stretch back decades. Though invisible, these biases are still structuring many research efforts today, as illustrated by our ﬁeld’s ongoing fascination with outperforming humans at board games or video games (a trend we discuss in I.3.5 and II.1). The goal of this document is to point out the implicit assumptions our ﬁeld has been working from, correct some of its most salient biases, and provide an actionable formal deﬁnition and measurement benchmark for human-like general intelligence, leveraging modern insight from developmental cognitive psychology.\n\nI.2 Deﬁning intelligence: two divergent visions'

In [41]:
documents[11].page_content

'I.2 Deﬁning intelligence: two divergent visions\n\nLooked at in one way, everyone knows what intelligence is; looked at in another way, no one does.\n\nRobert J. Sternberg, 2000\n\nMany formal and informal deﬁnitions of intelligence have been proposed over the past few decades, although there is no existing scientiﬁc consensus around any single deﬁnition. Sternberg & Detterman noted in 1986 [87] that when two dozen prominent psychologists were asked to deﬁne intelligence, they all gave somewhat divergent answers. In the context of AI research, Legg and Hutter [53] summarized in 2007 no fewer than 70 deﬁnitions from the literature into a single statement: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”'

How can we now know which piece of text to pass to the model in the prompt? To do so, we will first generate embeddings for each document: a numerical representation in the form of a vector.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

query_result = embeddings.embed_query(documents[0].page_content)

query_result

These embeddings will be stored and indexed in a vector database, which allows efficient search and retrieval of documents by passing another embedding as the query. As you can imagine, the goal is to retrieve the documents most similar to the prompt, which (supposedly) contain the information we are looking for. In this example we will use Chroma as the vector database.
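
Similarity search over embeddings can be sketched with a brute-force ranking. The vectors below are made-up 3-dimensional toys rather than real embeddings, and cosine similarity is only one common choice; the metric and index a vector database actually uses depend on its configuration. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy "embeddings" for four chunks (illustrative values only).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]
print(top_k(query_vec, doc_vecs))
```

A vector database does the same ranking, but with an index that avoids comparing the query against every stored vector.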

In [43]:
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents, embeddings)

We now have all the pieces we need to chat with our PDF. All that remains is to build the appropriate chain.

In [44]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
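
Conceptually, the chain retrieves the most relevant chunks and places them in the prompt next to the question. The sketch below shows that step in isolation; the prompt wording and the `build_prompt` helper are hypothetical and are not the template LangChain uses internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-ins for chunks that a retriever might return.
retrieved = [
    "Many formal and informal definitions of intelligence have been proposed.",
    "Legg and Hutter summarized 70 definitions into a single statement.",
]
stuffed_prompt = build_prompt("What is the definition of intelligence?", retrieved)
print(stuffed_prompt)
```

Because every retrieved chunk is pasted into the prompt, chunk size times the number of retrieved chunks has to stay within the model's context window.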

In [45]:
chat_history = []
query = "What is the definition of intelligence?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

' Intelligence is the ability to achieve goals in a wide range of environments.\n\nIntroduction\n\n] I'
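
For follow-up questions, the previous turns are passed back through `chat_history` as (question, answer) pairs. The sketch below keeps that bookkeeping in one hypothetical helper, `ask`; `fake_qa` is a stand-in so the example runs without a model, and in the notebook the `qa` chain built above would be passed instead.

```python
def ask(qa, query, chat_history):
    """Call the chain, then record the turn so later questions can refer to it."""
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result["answer"]))
    return result["answer"]


def fake_qa(inputs):
    """Stand-in for the real chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} after {turns} earlier turn(s)"}


history = []
first = ask(fake_qa, "What is the definition of intelligence?", history)
second = ask(fake_qa, "Who proposed it?", history)
print(first)
print(second)
print(len(history))
```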

The next cell shows the piece of text that was considered closest to what we need, from which the answer to our question was drawn.

In [46]:
result['source_documents'][0].page_content

'I.2 Deﬁning intelligence: two divergent visions\n\nLooked at in one way, everyone knows what intelligence is; looked at in another way, no one does.\n\nRobert J. Sternberg, 2000\n\nMany formal and informal deﬁnitions of intelligence have been proposed over the past few decades, although there is no existing scientiﬁc consensus around any single deﬁnition. Sternberg & Detterman noted in 1986 [87] that when two dozen prominent psychologists were asked to deﬁne intelligence, they all gave somewhat divergent answers. In the context of AI research, Legg and Hutter [53] summarized in 2007 no fewer than 70 deﬁnitions from the literature into a single statement: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”'

Next we will use an AGENT that makes the work easier for us. In this case we reuse the LLM loaded earlier.

In [49]:
from langchain.agents import load_tools, initialize_agent, AgentType

tools = load_tools(
    ["arxiv"],
)

agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

In [None]:
agent_chain.run(
    "What's the paper On the measure of intelligence about?",
)

In [None]:
# TODO: fix the output parser error raised by the agent run above. The library has changed.
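
A ReAct-style agent expects the model to answer in a strict `Action:` / `Action Input:` layout, and output parser errors typically appear when the model replies in free text instead. The parser below is a simplified, hypothetical illustration of that mechanism, not LangChain's actual parser. Some LangChain versions also accept a `handle_parsing_errors` option on the agent executor; whether it applies here depends on the installed version.

```python
import re

# Simplified pattern for the expected "Action: ... Action Input: ..." layout.
ACTION_PATTERN = re.compile(r"Action\s*:\s*(.*?)\s*Action Input\s*:\s*(.*)", re.DOTALL)


def parse_agent_output(text):
    """Return a finish or action tuple, or raise if the layout is not followed."""
    if "Final Answer:" in text:
        return ("finish", text.split("Final Answer:")[-1].strip())
    match = ACTION_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Could not parse LLM output: {text!r}")
    return ("action", match.group(1).strip(), match.group(2).strip())


well_formed = (
    "Thought: I should search arXiv.\n"
    "Action: arxiv\n"
    "Action Input: On the Measure of Intelligence"
)
parsed = parse_agent_output(well_formed)
print(parsed)

# A free-text reply that ignores the layout triggers the parsing error.
free_form = "The paper is about defining and measuring intelligence."
try:
    parse_agent_output(free_form)
except ValueError as err:
    print("parse error:", err)
```

Smaller instruction-tuned models often drift from this layout, so checking the raw model output is a reasonable first debugging step.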