In [13]:
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv(), override=True)

True

In [2]:
import os
import sys

### Summarizing using a Basic Prompt

In [5]:
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage

In [6]:
text = '''
Mojo is a programming language in the Python family that is currently under development.[2][3][4] It is available both in browsers via Jupyter notebooks,[4][5] and locally on Linux and macOS.[6][7] Mojo aims to combine the usability of higher level programming languages, specifically Python, with the performance of lower level programming languages like C++, Rust, and Zig.[8] The Mojo compiler is currently closed source with an open source standard library, although Modular, the company behind Mojo, has stated their intent to eventually open source the Mojo programming language itself as it matures.[9]

Mojo builds upon the MLIR compiler framework instead of directly on the lower level LLVM compiler framework that many languages like Julia, Swift, clang and Rust do.[10][11] MLIR is a newer compiler framework that allows Mojo to take advantage of higher level compiler passes not available in LLVM alone and allows Mojo to compile down and target more than just CPUs, including producing code that can run on GPUs, TPUs, ASICs and other accelerators. It can also often more effectively use certain types of CPU optimizations directly, like SIMD without direct intervention by the developer like in many other languages.[12][13] According to Jeremy Howard of fast.ai, Mojo can be seen as "syntax sugar for MLIR" and for that reason Mojo is well optimized for applications like AI.[14]

'''

In [7]:
messages = [
    SystemMessage(content='You are an expert copywriter with expertise in summarizing documents'),
    HumanMessage(content=f'Please provide a short and concise summary of the following text:\n TEXT:{text}')
]

llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

In [8]:
llm.get_num_tokens(text=text)

281

In [9]:
summary_output = llm.invoke(messages)

In [10]:
print(summary_output.content)

Mojo is a new programming language in the Python family that aims to combine the usability of higher level languages like Python with the performance of lower level languages such as C++, Rust, and Zig. It is available on browsers through Jupyter notebooks and locally on Linux and macOS. The Mojo compiler is currently closed source but has an open source standard library. Mojo utilizes the MLIR compiler framework, allowing it to take advantage of higher level compiler passes and target various processors including GPUs, TPUs, and ASICs. Mojo is optimized for applications like AI and is considered "syntax sugar for MLIR" by experts.


### Summarizing using Prompt Template

In [13]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [15]:
template = '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''

prompt = PromptTemplate(
    input_variables=['text', 'language'],
    template=template
)

In [16]:
llm.get_num_tokens(prompt.format(text=text, language='Odia'))

303
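
As a rough illustration of what `prompt.format(...)` does in the cell above, the same placeholder substitution can be sketched with plain `str.format`. This is a simplified stand-in rather than LangChain's implementation, and `fill_template` is a hypothetical helper introduced only for this example.

```python
# Minimal sketch: fill named placeholders the way a prompt template does.
# fill_template is a hypothetical helper, not part of LangChain.
template = '''
Write a concise and short summary of the following text:
TEXT: `{text}`
Translate the summary to {language}.
'''


def fill_template(template: str, **variables: str) -> str:
    # str.format raises KeyError if a placeholder has no matching variable
    return template.format(**variables)


filled = fill_template(template, text='Mojo is a programming language.', language='Odia')
print(filled)
```

Counting tokens on the filled prompt, as the cell above does, measures the whole input sent to the model rather than the source text alone.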

In [17]:
chain = LLMChain(llm=llm, prompt=prompt)
summary = chain.invoke({'text':text, 'language':'Odia'})

In [18]:
print(summary)

{'text': 'Mojo ଏକ Python ପରିବାରରେ ଉତ୍ପାଦନ କରାଯାଇଛି ପ୍ରୋଗ୍ରାମିଂ ଭାଷା ଯେହେତୁ ଉନ୍ନତ ସ୍ତରର ପ୍ରୋଗ୍ରାମିଂ ଭାଷାଗୁଡ଼ିକ ଯେମିତି Python, ଏହିକୁ C++, Rust, ଓ Zig ପ୍ରକାରର ନିମ୍ନ ସ୍ତରର ପ୍ରୋଗ୍ରାମିଂ ଭାଷାଗୁଡ଼ିକ ଯେମିତି ଉପଯୋଗକରିବା ଲକ୍ଷ୍ୟ ରଖୁଛି।` ଏହି ଉପଲବ୍ଧ ଅଛି ବ୍ରାଉଜରରେ Jupyter ନୋଟବୁକ୍ ମାଧ୍ୟମରେ, ଏବଂ ଲୋକାଲି ଲିନୁକ୍ସ ଓ macOS ଉପର ମୋଜୋ ଉପଲବ୍ଧ ଅଛି। ମୋଜୋ ଏହିକୁ ଉଚ୍ଚ ସ୍ତରର ପ୍ରୋଗ୍ରାମିଂ ଭାଷାଗୁଡ଼ିକର ଉପଯୋଗକରିବା ଯେତେ Python ଏବଂ କିଛି ଅନ୍ଯାନ୍ୟ ପ୍ରୋଗ୍ରାମିଂ ଭାଷାଗୁଡ଼ିକର ପ୍ରଦର୍ଶନକୁ ଯେତେ C++, Rust, ଓ Zig ଯେତେ।` ମୋଜୋ କମ୍ପାଇଲର ଏହି ବର୍ତ୍ତମାନ ବନ୍ଦର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର ସ୍ତରର 

### Summarizing using StuffDocumentsChain

In [20]:
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

In [21]:
with open('documents/sj.txt', encoding='utf-8') as f:
    text = f.read()

In [22]:
docs = [Document(page_content=text)]

In [23]:
template = '''
Write a concise and short summary of the following text.
TEXT: `{text}`
'''

prompt = PromptTemplate(
    input_variables=['text'],
    template=template
)

In [24]:
chain = load_summarize_chain(
    llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=True
)

In [25]:
output_summary = chain.invoke(docs)
print(output_summary)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise and short summary of the following text.
TEXT: `I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.

The first story is about connecting the dots.

I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?

It started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawye


> Finished chain.

> Finished chain.
{'input_documents': [Document(page_content='I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.\n\nThe first story is about connecting the dots.\n\nI dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?\n\nIt started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife. Except that when I popped out they decided at the last minute that they really w

In [28]:
print(output_summary['output_text'])

The speaker shares three stories from his life during a commencement speech. The first story is about dropping out of college, the second is about being fired from the company he co-founded, and the third is about facing a cancer diagnosis. He emphasizes the importance of following one's heart, staying true to oneself, and making the most of life as it is limited. The speech concludes with the message "Stay Hungry. Stay Foolish."
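
The `stuff` strategy used above can be sketched in a few lines: all documents are joined into one block of text, placed into the prompt, and sent to the model in a single call. The sketch below assumes a stand-in `fake_llm` function instead of a real model, and `stuff_summarize` is a hypothetical helper, not the library's code.

```python
from typing import Callable, List


def stuff_summarize(docs: List[str], prompt_template: str, llm: Callable[[str], str]) -> str:
    # Join every document into one block of text ("stuffing" the prompt)
    combined = '\n\n'.join(docs)
    # Make one single model call over the combined text
    return llm(prompt_template.format(text=combined))


# Stand-in for a real model: reports how many characters it received
def fake_llm(prompt: str) -> str:
    return f'summary of {len(prompt)} characters'


template = 'Write a concise and short summary of the following text.\nTEXT: `{text}`'
result = stuff_summarize(['first document', 'second document'], template, fake_llm)
print(result)
```

Because everything goes into one prompt, this approach only works while the combined text fits in the model's context window, which is the motivation for the chunked strategies that follow.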


### Summarizing Large Documents Using map_reduce

In [29]:
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [31]:
with open('documents/sj.txt', encoding='utf-8') as f:
    text = f.read()

In [30]:
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

In [32]:
llm.get_num_tokens(text)

2653

In [35]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = text_splitter.create_documents([text])

In [36]:
len(chunks)

2
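
Note that `chunk_size` is measured in characters by default, not tokens, which is why a text of about 2,650 tokens ends up in only two chunks of up to 10,000 characters. The sketch below shows the basic idea of splitting with overlap using a fixed sliding window. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not attempt.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    # Each new chunk starts chunk_overlap characters before the previous one ended
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = 'abcdefghij' * 3  # 30 characters
pieces = split_with_overlap(sample, chunk_size=12, chunk_overlap=2)
print([len(p) for p in pieces])  # [12, 12, 10]
```

The overlap repeats a little text across chunk boundaries so that a sentence cut in half by one chunk still appears whole in the next.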

In [37]:
chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    verbose=False
)

In [38]:
output_summary = chain.invoke(chunks)

In [41]:
print(output_summary['output_text'])

Steve Jobs shares three stories from his life in his commencement speech, highlighting the importance of following one's passion, not settling for less, and living each day as if it were your last. He emphasizes the significance of staying hungry and foolish, embracing new opportunities, and following one's intuition. Jobs also references The Whole Earth Catalog's message of staying curious and open-minded.


In [42]:
# Prompt for summarizing each part
chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'

In [43]:
# Prompt for combining the summaries
chain.combine_document_chain.llm_chain.prompt.template

'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
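
The two prompts above correspond to the two stages of the strategy. A minimal sketch of that flow, assuming a stand-in `fake_llm` in place of the real model, looks like this. The helper `map_reduce_summarize` is hypothetical, and the real chain may do more, such as collapsing intermediate summaries when they grow too long.

```python
from typing import Callable, List

# Same wording as the default prompts printed above
PROMPT = 'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str],
                         map_prompt: str, combine_prompt: str) -> str:
    # Map step: summarize each chunk independently
    partial_summaries = [llm(map_prompt.format(text=chunk)) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries
    return llm(combine_prompt.format(text='\n\n'.join(partial_summaries)))


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in model: records each prompt and returns a numbered placeholder
    calls.append(prompt)
    return f'summary #{len(calls)}'


final = map_reduce_summarize(['chunk one', 'chunk two'], fake_llm, PROMPT, PROMPT)
print(len(calls), final)
```

With two chunks the model is called three times: once per chunk, then once more to combine the partial summaries.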

### Map Reduce with Custom Prompts

In [44]:
map_prompt = '''
Write a short and concise summary of the following:
text: `{text}`
CONCISE SUMMARY:
'''


map_prompt_template = PromptTemplate(
    input_variables=['text'],
    template=map_prompt
)

In [46]:
combine_prompt = '''
Write a concise summary of the following text that covers the key points.
Add a title to the summary.
Start your summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED by BULLET POINTS 
if possible AND end the summary with CONCLUSION PHRASE.
Text: `{text}`
'''

combine_prompt_template = PromptTemplate(
    input_variables=['text'],
    template=combine_prompt
)

In [47]:
summary_chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt_template,
    combine_prompt=combine_prompt_template,
    verbose=True
)

In [48]:
output = summary_chain.invoke(chunks)



> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Write a short and concise summary of the following:
text: `I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.

The first story is about connecting the dots.

I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?

It started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer


> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following text that covers the key points.
Add a title to the summary.
Start your summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED by BULLET POINTS 
if possible AND end the summary with CONCLUSION PHRASE.
Text: `The speaker shares three stories from his life during a commencement speech. The first story is about dropping out of college, following his curiosity, and how it led to the design of the Macintosh computer. The second story is about being fired from Apple, starting over, and finding success with NeXT and Pixar. The third story is about facing death after being diagnosed with cancer, and how it changed his perspective on life. The overall message is to trust in following your passion, not settling, and living each day as if it were your last.

The text emphasizes the inevitability of death and the importan

In [51]:
print(output['output_text'])

Title: Trusting Your Passion: Lessons from a Commencement Speech

INTRODUCTION:
The speaker in the text shares three impactful stories from his life during a commencement speech, highlighting the importance of following one's passion and living authentically.

KEY POINTS:
- Story 1: Dropping out of college, following curiosity, leading to the design of the Macintosh computer
- Story 2: Being fired from Apple, finding success with NeXT and Pixar after starting over
- Story 3: Facing death after cancer diagnosis, changing perspective on life

CONCLUSION:
The text emphasizes the importance of trusting in following one's passion, not settling, and living each day as if it were your last.


### Summarizing using refine chain


In [1]:
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader

In [54]:
pip install unstructured -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
loader = UnstructuredPDFLoader('documents/flash_attention.pdf')
data = loader.load()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rakesh.panigrahy/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [9]:
data[0].page_content

'2205.14135v2 [cs.LG] 23 Jun 2022\n\narXiv\n\nFLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness\n\nTri Dao’, Daniel Y. Fu’, Stefano Ermon\', Atri Rudra?, and Christopher Ré*\n\n‘Department of Computer Science, Stanford University *Department of Computer Science and Engineering, University at Buffalo, SUNY\n\n{trid,danfu}@cs.stanford.edu, ermon@stanford.edu, atri@buffalo.edu, chrismre@cs.stanford.edu\n\nJune 24, 2022\n\nAbstract\n\nTransformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms JO- aware—accounting for reads and writes between levels of GPU memory. We propose FLASHATTENTION, an IO-aware exact attention algorithm that uses ti

In [10]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=100)
chunks = text_splitter.split_documents(data)

In [11]:
len(chunks)

12

In [14]:
llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=0)

In [16]:
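import tiktoken

def print_embedding_cost(texts):
    # Despite the name, this estimates the cost of sending the chunks to the chat model
    # Count tokens with the tokenizer that matches the model
    enc = tiktoken.encoding_for_model('gpt-3.5-turbo')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # The rate of $0.002 per 1,000 tokens is hard-coded; check current pricing before relying on it
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.002:.6f}')

print_embedding_cost(chunks)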

Total Tokens: 31529
Embedding Cost in USD: 0.063058
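
The figure above is simple arithmetic: tokens divided by 1,000, multiplied by a per-1K rate. The sketch below reproduces it with the rate as a parameter. The helper name `estimate_input_cost` is hypothetical, and the 0.002 rate is simply the value hard-coded in the cell above, not a statement of current pricing.

```python
def estimate_input_cost(total_tokens: int, usd_per_1k_tokens: float) -> float:
    # Cost scales linearly with the number of tokens sent to the model
    return total_tokens / 1000 * usd_per_1k_tokens


cost = estimate_input_cost(31529, 0.002)
print(f'{cost:.6f}')  # 0.063058
```

This counts each chunk once. A chain that also resends a running summary with every chunk, or that generates output tokens, will likely cost more, so the number is best read as a lower bound.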


In [17]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    verbose=False
)

output_summary = chain.invoke(chunks)

In [20]:
print(output_summary['output_text'])

The paper introduces FLASHATTENTION, an IO-aware exact attention algorithm designed for GPUs that reduces memory reads/writes between GPU memory levels, improving the speed and memory efficiency of Transformers on long sequences. By leveraging techniques such as tiling and recomputation, FLASHATTENTION achieves faster training speeds and higher model quality compared to existing methods, enabling Transformers to handle longer context and achieve better performance on various tasks. The algorithm is open-sourced and outperforms existing attention implementations in terms of speed and memory efficiency. Additionally, the analysis shows a significant reduction in HBM accesses compared to standard attention, highlighting the efficiency and effectiveness of FLASHATTENTION in handling attention computations on GPUs. The paper also introduces an extension to FLASHATTENTION called block-sparse FLASHATTENTION, which further improves IO complexity by incorporating sparsity. Experimental results 

### Refine with custom prompts

In [21]:
prompt_template = '''
Write a concise summary of the following extracting the key information:
Text: `{text}`
CONCISE SUMMARY:'''


initial_prompt = PromptTemplate(template=prompt_template, input_variables=['text'])

refine_template = '''
Your job is to produce a final summary.
I have provided an existing summary up to a certain point: {existing_answer}.
Please refine the existing summary with some more context below.
---------------------
{text}
---------------------
Start the final summary with an INTRODUCTION PARAGRAPH that gives an overview of the topic FOLLOWED by BULLET POINTS
if possible AND end the summary with a CONCLUSION PHRASE.
'''

refine_prompt = PromptTemplate(
    template=refine_template,
    input_variables=['existing_answer', 'text']
)

In [22]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True
)

output_summary = chain.invoke(chunks)

In [25]:
print(output_summary['input_documents'])

[Document(page_content="2205.14135v2 [cs.LG] 23 Jun 2022\n\narXiv\n\nFLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness\n\nTri Dao’, Daniel Y. Fu’, Stefano Ermon', Atri Rudra?, and Christopher Ré*\n\n‘Department of Computer Science, Stanford University *Department of Computer Science and Engineering, University at Buffalo, SUNY\n\n{trid,danfu}@cs.stanford.edu, ermon@stanford.edu, atri@buffalo.edu, chrismre@cs.stanford.edu\n\nJune 24, 2022\n\nAbstract\n\nTransformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms JO- aware—accounting for reads and writes between levels of GPU memory. We propose FLASHATTENTION, an IO-aware exact attention 

In [26]:
print(output_summary['intermediate_steps'])

['The paper introduces FLASHATTENTION, an IO-aware exact attention algorithm that reduces memory reads/writes between GPU memory levels, improving the efficiency of Transformers on long sequences. FLASHATTENTION achieves faster training speeds and higher model quality compared to existing methods, enabling Transformers to handle longer context and achieve better performance on various tasks. The algorithm is shown to be up to 7.6x faster than standard attention implementations and requires less memory access. Additionally, block-sparse FLASHATTENTION, a sparse attention algorithm, is 2-4x faster and can scale up to sequence lengths of 64k. The paper also discusses the potential of FLASHATTENTION as a primitive for approximate attention algorithms and provides empirical validation of its performance improvements.', 'Introduction:\nThe paper introduces FLASHATTENTION, an IO-aware exact attention algorithm designed to improve the efficiency of Transformers on long sequences by reducing me

In [28]:
print(output_summary['output_text'])

Introduction:
Recent advancements in attention algorithms for Transformers have led to the development of FLASHATTENTION, an IO-aware exact attention algorithm designed to enhance the efficiency of Transformers on long sequences. This algorithm reduces memory reads/writes between GPU memory levels, resulting in faster training speeds and higher model quality compared to existing methods. FLASHATTENTION enables Transformers to handle longer context and achieve better performance on various tasks. Additionally, the Block-Sparse FLASHATTENTION variant further improves efficiency and scalability for longer sequences.

Key Points:
- FLASHATTENTION is up to 7.6x faster than standard attention implementations and requires fewer memory accesses.
- Block-sparse FLASHATTENTION, a sparse attention algorithm, is 2-4x faster and can scale up to sequence lengths of 64k.
- FLASHATTENTION improves performance on long document classification tasks such as MIMIC-III and ECtHR datasets, showcasing its ef
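
The intermediate steps printed earlier reflect how the refine strategy works: the first chunk produces an initial summary, and each later chunk is sent to the model together with the summary so far. A minimal sketch of that loop follows, assuming a stand-in `fake_llm` and shortened prompts that keep the `{existing_answer}` and `{text}` placeholders used above. `refine_summarize` is a hypothetical helper, not the library's implementation.

```python
from typing import Callable, List, Tuple


def refine_summarize(chunks: List[str], llm: Callable[[str], str],
                     question_prompt: str, refine_prompt: str) -> Tuple[str, List[str]]:
    if not chunks:
        raise ValueError('at least one chunk is required')
    # First chunk: produce the initial summary
    summary = llm(question_prompt.format(text=chunks[0]))
    steps = [summary]
    # Remaining chunks: refine the running summary one chunk at a time
    for chunk in chunks[1:]:
        summary = llm(refine_prompt.format(existing_answer=summary, text=chunk))
        steps.append(summary)
    return summary, steps


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in model: records each prompt and returns a versioned placeholder
    calls.append(prompt)
    return f'v{len(calls)}'


question = 'Summarize: {text}'
refine = 'Existing summary: {existing_answer}\nNew context: {text}\nRefine the summary.'
final, steps = refine_summarize(['part a', 'part b', 'part c'], fake_llm, question, refine)
print(final, steps)
```

Unlike map_reduce, the calls here are sequential, since each one depends on the previous summary. That makes the chain slower on many chunks but lets later chunks build on earlier context.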

### Summarizing using LangChain Agents

In [29]:
from langchain_openai import ChatOpenAI
from langchain.agents import initialize_agent, Tool
from langchain.utilities import WikipediaAPIWrapper

In [30]:
llm = ChatOpenAI(temperature=0, model='gpt-3.5-turbo')
wikipedia = WikipediaAPIWrapper()

In [31]:
tools = [
    Tool(
        name='Wikipedia',
        func=wikipedia.run,
        description='Useful for when you need to get information from wikipedia on a topic'
    )
]

In [32]:
agent_executor = initialize_agent(tools, llm, agent='zero-shot-react-description', verbose=True)

  warn_deprecated(


In [33]:
output = agent_executor.run('Can you please provide a short summary of Gopabandhu Das')

  warn_deprecated(




> Entering new AgentExecutor chain...
I should use Wikipedia to find information on Gopabandhu Das
Action: Wikipedia
Action Input: Gopabandhu Das
Observation: Page: Gopabandhu Das
Summary: Gopabandhu Das (1877–1928), popularly known as Utkalamani Gopabandhu Das (Jewel of Utkal or Odisha), was a social worker, reformer, political activist, journalist, poet and essayist.

Page: Nilakantha Das
Summary: Pandit Nilakantha Das (1884-1967) was one of the most illustrious sons of Odisha, who appeared both in its political and literary arena at the most crucial period of its history, when Odisha had no political identity in the map of India, and Odia as a language was about to be extinct. He worked relentlessly for Odisha's recognition both politically and linguistically, and helped bring to fruition the dreams of Utkala Gaurab Madhusudan Das, Utkalamani Gopabandhu Das and all other Odia loving people.
As a colleague of Mahatma Gandhi, Motilal Nehru and D

In [34]:
print(output)

Gopabandhu Das (1877–1928), popularly known as Utkalamani Gopabandhu Das, was a social worker, reformer, political activist, journalist, poet, and essayist.
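
The verbose log above shows the pattern the agent follows: the model writes an `Action` and an `Action Input`, the named tool is called, and its result is fed back as an `Observation`. One such step can be sketched as below. This is an illustration under simplifying assumptions, not the agent's actual parser: `parse_action` and `fake_wikipedia` are hypothetical, and the tool is a stub that performs no network call.

```python
import re
from typing import Callable, Dict, Optional, Tuple


def parse_action(llm_text: str) -> Optional[Tuple[str, str]]:
    # Look for an "Action:" line immediately followed by an "Action Input:" line
    match = re.search(r'Action:\s*(.+?)\s*\nAction Input:\s*(.+)', llm_text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def fake_wikipedia(query: str) -> str:
    # Stub tool: a real tool would query Wikipedia here
    return f'(stub) summary for {query}'


tools: Dict[str, Callable[[str], str]] = {'Wikipedia': fake_wikipedia}

llm_text = (
    'I should use Wikipedia to find information on Gopabandhu Das\n'
    'Action: Wikipedia\n'
    'Action Input: Gopabandhu Das'
)

parsed = parse_action(llm_text)
if parsed is not None:
    tool_name, tool_input = parsed
    # Dispatch to the tool the model asked for and report the result
    observation = tools[tool_name](tool_input)
    print(f'Observation: {observation}')
```

In a full agent loop the observation would be appended to the prompt and the model called again, repeating until it produces a final answer instead of another action.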
