### Integrating Streaming into LlamaIndex via Callback Manager

This demo shows how you can integrate streaming capabilities into a LlamaIndex app through LangChain's callback manager.

LangChain's callback manager provides an `on_llm_new_token` method that's called for each token returned in a streaming setting.

The challenge with integrating with LlamaIndex is that we only want `on_llm_new_token` to be called for the *final* response,
rather than for every intermediate response. The notebook below shows how to set that up.

Key components:
- `ChatOpenAI` LLM class with `streaming=True`
- `ChatGPTLLMPredictor` with LLM and `CallbackManager` specified
- `index.query` called with `streaming=True`

A few notes:
- We've provided a demo `StreamStdoutCallbackHandler` that writes tokens to stdout, but we highly encourage you to develop your own callback handler.
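As a rough illustration of what a custom handler could look like, here is a minimal sketch. `TokenCollectorHandler` is a hypothetical name, and the class is deliberately plain Python so it runs without LangChain installed; in a real app you would presumably subclass LangChain's `BaseCallbackHandler` and register the handler with the `CallbackManager`. The loop at the bottom only simulates a token stream and is not a real LLM call.

```python
import sys
from typing import Any, List


class TokenCollectorHandler:
    """Hypothetical handler: echoes each streamed token and keeps a copy.

    Assumption: a real handler would subclass LangChain's BaseCallbackHandler.
    A plain class is used here so the sketch has no external dependencies.
    """

    def __init__(self) -> None:
        self.tokens: List[str] = []

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # Invoked once per token while the LLM is streaming.
        self.tokens.append(token)
        sys.stdout.write(token)
        sys.stdout.flush()

    @property
    def text(self) -> str:
        # Full response assembled from the tokens received so far.
        return "".join(self.tokens)


# Simulated token stream standing in for a real streaming LLM call.
handler = TokenCollectorHandler()
for tok in ["The", " author", " wrote", " short", " stories", "."]:
    handler.on_llm_new_token(tok)
```

Collecting the tokens as well as printing them makes it easy to forward the stream elsewhere later, for example to a websocket or a UI widget.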

In [10]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.CRITICAL)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Uncomment if you want to temporarily disable logger
# logger = logging.getLogger()
# logger.disabled = True

In [19]:
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import CallbackManager
from gpt_index.llm_predictor.chatgpt import ChatGPTLLMPredictor
from gpt_index.langchain_helpers.callbacks import StreamStdoutCallbackHandler

from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader, ServiceContext, LLMPredictor

In [14]:
# load documents
documents = SimpleDirectoryReader('../paul_graham_essay/data').load_data()

In [15]:
# NOTE: please define your own callback handler/manager
# the StreamStdoutCallbackHandler is primarily intended for demo purposes
callback_manager = CallbackManager(handlers=[StreamStdoutCallbackHandler()])

In [16]:
# define LLM + LLMPredictor
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True)
llm_predictor = ChatGPTLLMPredictor(llm=llm, callback_manager=callback_manager)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=1024)

In [17]:
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

In [18]:
response = index.query(
    "What did the author do growing up?", 
    service_context=service_context,
    streaming=True,
)

The author worked on writing and programming outside of school growing up. They wrote short stories and tried programming on an IBM 1401 in 9th grade, using an early version of Fortran. With microcomputers, they started programming more and wrote simple games, a rocket prediction program, and a word processor. Despite enjoying programming, they initially planned to study philosophy in college.

### Note: What if you just use the callback manager as part of the LLM class directly?

This means that tokens from intermediate LLM calls will also be streamed to the callback handler, which may or may not be what you want.

We show an example where we attach the callback manager directly to the LLM class and make no modifications on the LlamaIndex side.

In [20]:
# define LLM + LLMPredictor, with the callback manager attached to the LLM itself
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True, callback_manager=callback_manager)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=1024)

In [21]:
response = index.query(
    "What did the author do growing up?", 
    service_context=service_context,
    similarity_top_k=3
)

The author worked on writing and programming outside of school when growing up. They wrote short stories and started programming on an IBM 1401 using an early version of Fortran. They later got a TRS-80 microcomputer and started programming simple games, a rocket prediction program, and a word processor.The author worked on writing and programming outside of school when growing up. They wrote short stories and started programming on an IBM 1401 using an early version of Fortran. They later got a TRS-80 microcomputer and started programming simple games, a rocket prediction program, and a word processor. While in a PhD program in computer science at Harvard, the author took art classes and decided to become an artist. They applied to RISD and the Accademia di Belli Arti in Florence and ended up studying at RISD. After finishing the BFA program at RISD, the author received an invitation to take the entrance exam at the Accademia, and despite being a foreigner and having to write an essay
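The repeated text in the output above is consistent with intermediate calls being streamed as well as the final one. The difference between the two setups can be sketched with a toy simulation. Everything here (`fake_llm_call`, `run_query`, the `final_only` flag) is hypothetical and stands in for the real LLM and query logic; it is not how LlamaIndex is implemented, only an illustration of the idea.

```python
from typing import Callable, List, Optional

TokenCallback = Callable[[str], None]


def fake_llm_call(answer: str, on_token: Optional[TokenCallback] = None) -> str:
    """Stand-in for a streaming LLM call: emits the answer word by word."""
    for word in answer.split(" "):
        if on_token is not None:
            on_token(word + " ")
    return answer


def run_query(answers: List[str], final_only: bool) -> List[str]:
    """Make one fake LLM call per answer and return every streamed token.

    final_only=False mimics a callback attached to the LLM itself, so every
    call streams. final_only=True mimics attaching it to the last call only.
    """
    streamed: List[str] = []
    last = len(answers) - 1
    for i, answer in enumerate(answers):
        should_stream = (not final_only) or i == last
        fake_llm_call(answer, streamed.append if should_stream else None)
    return streamed


steps = ["draft answer", "refined answer"]
all_calls = run_query(steps, final_only=False)   # intermediate + final tokens
final_call = run_query(steps, final_only=True)   # final tokens only
print("".join(all_calls))
print("".join(final_call))
```

With the callback attached at the LLM level, the draft and the refined answer are both streamed back to back; restricting it to the last call yields only the final response.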