In [1]:
# !pip install langchain
# !pip install crewai
# !pip install python-dotenv
# !pip install crewai-tools
# !pip install --upgrade --quiet  langchain langchain-community langchainhub langchain-openai




In [12]:
from uuid import uuid4
import os
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.vectorstores import Qdrant
from langchain import hub
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate
from langchain.callbacks import StdOutCallbackHandler
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from crewai_tools import BaseTool
from crewai import Agent, Task, Crew

In [15]:
# Load environment variables from a .env file
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

embeddings = OpenAIEmbeddings()

#  LLM gpt-3.5-turbo

llm35 = ChatOpenAI(temperature=0.0,
                 model="gpt-3.5-turbo",
                 max_tokens=512)


### Load Documents

In [4]:
# Load documents
loader = PyPDFLoader("example_data/attention.pdf")
# pages = loader.load_and_split()
# Load, chunk and index the contents of the PDF.
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
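
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` is implemented: the real splitter first tries to break on separators such as paragraph and line breaks, whereas this toy only slides a character window. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window (toy version)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks above: a sentence cut at a chunk boundary still appears whole in at least one chunk.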

In [5]:
# Using an in-memory document store
# %pip install --upgrade --quiet  "docarray"

In [6]:
from langchain_community.vectorstores import DocArrayInMemorySearch
# Embed the chunks and index them in the vector store

vectorstore = DocArrayInMemorySearch.from_documents(splits, embeddings)



### Similarity Search

In [7]:
query = "Positional encoding"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)

tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the
bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel
as the embeddings, so that the two can be summed. There are many choices of positional encodings,
learned and fixed [9].
In this work, we use sine and cosine functions of different frequencies:
PE(pos,2i)=sin(pos/100002i/d model)
PE(pos,2i+1)=cos(pos/100002i/d model)
where posis the position and iis the dimension. That is, each dimension of the positional encoding
corresponds to a sinusoid. The wavelengths form a geometric progression from 2πto10000 ·2π. We
chose this function because we hypothesized it would allow the model to easily learn to attend by
relative positions, since for any fixed offset k,PEpos+kcan be represented as a linear function of
PEpos.
We also experimented with using learned positional embeddings [ 9] instead, and found that the two


In [20]:
# Retrieve and generate using the relevant snippets of the paper.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
# print(prompt)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm35
    | StrOutputParser()
)

# Alternatively, use a RetrievalQA chain
handler =  StdOutCallbackHandler()

qa_d35 = RetrievalQA.from_chain_type(
                                    llm=llm35,
                                    chain_type="stuff",
                                    callbacks=[handler],
                                    retriever=retriever)
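
Both chains above follow the same "stuff" pattern: the retrieved chunks are concatenated into one context string and placed in the prompt together with the question. The sketch below illustrates only that assembly step in plain Python. The prompt wording is hypothetical and is not the text of `rlm/rag-prompt`, and the helper names `join_chunks` and `build_prompt` are made up for this example.

```python
def join_chunks(chunks: list[str]) -> str:
    """Concatenate retrieved chunks, separated by blank lines."""
    return "\n\n".join(chunks)


def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Stuff all retrieved chunks into a single prompt (hypothetical wording)."""
    context = join_chunks(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


toy_chunks = [
    "Attention maps a query and key-value pairs to an output.",
    "The Transformer relies entirely on attention.",
]
toy_prompt = build_prompt("What is the attention mechanism?", toy_chunks)
print(toy_prompt)
```

Because every retrieved chunk ends up in one prompt, the number and size of the chunks determine how much of the model's context window is consumed.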

In [21]:
# Invoke the RetrievalQA chain (gpt-3.5-turbo)

qa_d35.invoke({"query": "What is the attention mechanism?"})["result"]

> Entering new RetrievalQA chain...

> Finished chain.


'The attention mechanism is a function that maps a query and a set of key-value pairs to an output, where all components (query, keys, values, and output) are vectors. It computes an output as a weighted sum based on the relevance of the key-value pairs to the query. In the context provided, the attention mechanism is a crucial part of sequence modeling and transduction models, allowing for the modeling of dependencies without considering their distance in input or output sequences. The Transformer model architecture proposed in the context relies entirely on the attention mechanism to draw global dependencies between input and output, eliminating the need for recurrence in traditional neural network architectures.'

In [22]:
rag_chain.invoke("What is the attention mechanism?")

'The attention mechanism is a key component in sequence transduction models, connecting encoder and decoder through weighted sums of key-value pairs based on a query. It allows for modeling dependencies without regard to their distance in input or output sequences. The Transformer model architecture relies solely on attention mechanisms, eliminating the need for recurrence or convolutions.'

In [29]:
#  LLM gpt-4o

llm4o = ChatOpenAI(temperature=0.0,
                 model="gpt-4o",
                 max_tokens=512)

# Retrieval qa chain
qa_db4o = RetrievalQA.from_chain_type(
                                    llm=llm4o,
                                    chain_type="stuff",
                                    callbacks=[handler],
                                    retriever=retriever)


In [30]:
print(qa_db4o.invoke({"query": "What is the attention mechanism?"})["result"])

> Entering new RetrievalQA chain...

> Finished chain.
The attention mechanism is a function that maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, with the weights assigned to each value determined by a compatibility function of the query with the corresponding key. This mechanism allows the model to focus on different parts of the input sequence when producing each part of the output sequence, enabling it to model dependencies without regard to their distance in the input or output sequences.


In [33]:
print(qa_db4o.invoke({"query": "Can you analyze the research paper?"})["result"])

> Entering new RetrievalQA chain...

> Finished chain.
Certainly! To provide a comprehensive analysis of the research paper, I'll break it down into several key components:

### Title and Authors
The paper is titled "Building a large annotated corpus of English: The Penn Treebank" and is authored by Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini.

### Abstract
The abstract is not provided in the context, but typically, it would summarize the main objectives, methods, results, and conclusions of the paper.

### Introduction
The introduction likely sets the stage by discussing the importance of annotated corpora in computational linguistics and natural language processing (NLP). It probably highlights the gap that the Penn Treebank aims to fill and outlines the main contributions of the work.

### Objectives
The primary objective of the paper is to build a large, annotated corpus of English text, which can be used for various NLP tasks such as parsin

## CrewAI Agents and Tools

The following steps are performed:

- Expose the research paper stored in the vector store as a tool
- Produce a detailed report with the researcher agent
- Produce a short summary with the writer agent
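
The steps above can be sketched as a plain sequential pipeline in which each step receives the previous step's output. This is only an illustration of the data flow, assuming the tasks run in the order they are listed. It does not use CrewAI, and every function in it (`run_sequential`, `fake_tool`, `fake_researcher`, `fake_writer`) is a hypothetical stand-in for an LLM-backed agent or tool.

```python
from typing import Callable


def run_sequential(steps: list[Callable[[str], str]], initial_input: str) -> list[str]:
    """Run the steps in order, feeding each step the previous step's output."""
    outputs = []
    current = initial_input
    for step in steps:
        current = step(current)
        outputs.append(current)
    return outputs


def fake_tool(query: str) -> str:
    # Stand-in for the vector-store tool
    return f"[paper excerpt for: {query}]"


def fake_researcher(task: str) -> str:
    # Stand-in for the researcher agent, which calls the tool
    return f"REPORT based on {fake_tool(task)}"


def fake_writer(report: str) -> str:
    # Stand-in for the writer agent, which only sees the researcher's report
    return f"SUMMARY of ({report})"


outputs = run_sequential([fake_researcher, fake_writer], "analyze the paper")
print(outputs[-1])
# SUMMARY of (REPORT based on [paper excerpt for: analyze the paper])
```

The point of the sketch is that the writer never touches the vector store: the quality of its summary depends entirely on what the researcher's report contains.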

In [31]:
from crewai_tools import BaseTool
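# Create a custom tool that queries the vector store through the RetrievalQA chain

class MyCustomTool(BaseTool):
    name: str = "Research Paper"
    description: str = "Returns an analysis of the research paper stored in the vector store"

    def _run(self, argument: str) -> str:
        # Note: the agent's `argument` is ignored; the query sent to the chain is fixed.
        # `invoke` is used instead of the deprecated `run` method.
        return qa_db4o.invoke({"query": "Can you analyze the research paper?"})["result"]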

tool_instance = MyCustomTool()

In [32]:
# Agents and Tools
from crewai import Agent, Task, Crew

### RESEARCHER ###

# Agent
researcher = Agent(
    role="Computer Science Researcher",
    goal="Understand the published research paper and extract the main insights",
    backstory="""You work at a Computer Science school.
                  A new paper has been published and you need to extract
                  the main information and provide the main insights to
                  the students. You have a knack for dissecting complex
                  research papers.""",
    verbose=True,
    tools=[tool_instance],
    allow_delegation=False,
)


# Tasks
task_research = Task(
    description="""Conduct an in-depth analysis of the published research paper.
                    Identify the key components and try to understand how each works.""",
    expected_output="""Your final answer must be a full analysis report of the research paper""",
    agent=researcher,
    output_file="researcher_analysis.md",
)


### WRITER ###

# Agent
writer = Agent(
    role="Computer Science Writer",
    goal="Craft easy-to-understand content for Computer Science students",
    backstory="""You are working on writing a summary of a research paper.
                  Use the content of the Computer Science Researcher to develop
                  a short comprehensive summary of the research paper""",
    verbose=True,
    allow_delegation=False,
)

# Tasks
task_writer = Task(
    description="""Extract the most important topics and key components of the report
                    developed by the Computer Science Researcher.""",
    expected_output="""Your final answer must be a short, comprehensive summary
                        that helps Computer Science students understand the report""",
    agent=writer,
    output_file="writer_summary.md",
)

# CREWAI KICKOFF

# Use a lowercase variable name so the instance does not shadow the Crew class
crew = Crew(
    agents=[researcher, writer],
    tasks=[task_research, task_writer],
    verbose=2)

result = crew.kickoff()



[DEBUG]: == Working Agent: Computer Science Researcher
[INFO]: == Starting Task: Conduct an in-depth analysis of the published research paper.
                    Identify the key components and try to understand how each works.


> Entering new CrewAgentExecutor chain...

I need to conduct an in-depth analysis of the research paper, understand the key components, and interpret how each works. To do this, I will use the Research Paper tool to access the entire content of the research paper.

Action: Research Paper
Action Input: {"argument": "The Published Research Paper"}

> Entering new RetrievalQA chain...

> Finished chain.

Certainly! To provide a comprehensive analysis of the research paper, I'll break it down into several key components: objectives, methodology, results, and conclusions. However, since the context provided is fragmented and does not specify a single research paper, I'll focus on the general themes and findings from the snippets available.

### Objectives
The primary objectives of the research papers mentioned seem to revolve around advancements in natural language processing (NLP) and machine learning, particularly in areas such as parsing, attention mechanisms, and language modeling. For instance:
- **Penn Treebank (Marcus et al., 1993)**: Building a large annotated corpus of English to aid in computational linguistics.
- **Self-training for parsing (McClosky et al., 2006)**: Improving parsing accuracy through self-training methods.
- **Decomposable attention model (Parikh et al., 2016)**: Enhancing attention mechanisms in NLP tasks.
- **Abstractive s

In [34]:
print(result)


The research paper discusses advancements in the field of Natural Language Processing (NLP) and machine learning, with focus on parsing, attention mechanisms, and language modeling. The methodologies used are diverse, incorporating techniques such as corpus building and annotation, self-training, attention mechanisms, reinforcement learning, and tree annotation. The results demonstrate that these methods have had a substantial impact on the NLP field, validating their effectiveness. The paper concludes by emphasizing the importance and potential of NLP technologies and the need for continuous innovation in methodologies. It also underscores the significance of corpus building and annotation in developing effective NLP systems.
