# LangChain QA Application with Xinference and Milvus


This demo walks through how to build an LLM-driven question-answering (QA) application with Xinference, Milvus, and LangChain.

## Deploy Xinference Locally or in a Distributed Cluster

For local deployment, run `xinference`. It will log an endpoint for you to use.

To deploy Xinference in a cluster, first start an Xinference supervisor with `xinference-supervisor`. You can use the `-p` option to specify the port and `-H` to specify the host. The default port is 9997. If the default port is already in use, Xinference will choose an unused port for you. It will also log the endpoint for you to use.

Then, start the Xinference workers using `xinference-worker` on each server you want to run them on. 

You can consult the README file from [Xinference](https://github.com/xorbitsai/inference) for more information.

## Start a Model

To use Xinference with LangChain, you first need to launch a model. You can use the command line interface (CLI) to do so:

In [2]:
!xinference launch --model-name "falcon-instruct" --model-format pytorch --size-in-billions 40 -e "http://127.0.0.1:9997"

Model uid: ec736e9c-328b-11ee-93f8-fa163e74fa2d


The command will return a model UID for you to use.
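
If you capture the command's output in a script, the UID can be pulled out of it rather than copied by hand. The following is a minimal sketch, assuming the output contains a line of the form `Model uid: <uid>` as shown above; the helper name `extract_model_uid` is hypothetical and not part of Xinference.

```python
import re

# Assumption: the CLI prints a line of the form "Model uid: <uid>",
# as in the output shown above.
MODEL_UID_PATTERN = re.compile(r"^Model uid:\s*(\S+)\s*$", re.MULTILINE)


def extract_model_uid(cli_output: str) -> str:
    """Return the model UID found in the launch command's output."""
    match = MODEL_UID_PATTERN.search(cli_output)
    if match is None:
        raise ValueError("no 'Model uid: ...' line found in the output")
    return match.group(1)


sample_output = "Model uid: ec736e9c-328b-11ee-93f8-fa163e74fa2d\n"
model_uid = extract_model_uid(sample_output)
print(model_uid)
```

The extracted value can then be passed as `model_uid` in the cells that follow.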

## Prepare the Documents

In [3]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("/home/nijiayi/inference/examples/state_of_the_union.txt") # Replace with the path of the document you want to query from

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    length_function=len,
)
docs = text_splitter.split_documents(documents)

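To see what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of overlapping chunking in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on natural separators such as paragraph and line breaks. The function `split_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context at retrieval time.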

## Set Up an Embedding Model

In [4]:
from langchain.embeddings import XinferenceEmbeddings

xinference_embeddings = XinferenceEmbeddings(
    server_url="http://127.0.0.1:9997",
    model_uid="ec736e9c-328b-11ee-93f8-fa163e74fa2d",  # model_uid is the uid returned from launching the model
)

## Connect to the Vector Database

For the vector store, we use the Milvus vector database. [Milvus](https://milvus.io/docs/overview.md) is a database that stores, indexes, and manages massive embedding vectors generated by deep neural networks and other machine learning models. To run it, first [install Milvus Standalone with Docker Compose](https://milvus.io/docs/install_standalone-docker.md).

In [None]:
$ wget https://github.com/milvus-io/milvus/releases/download/v2.2.12/milvus-standalone-docker-compose.yml -O docker-compose.yml

In the same directory as the `docker-compose.yml` file, start Milvus and check which port it is listening on by running:

In [None]:
$ sudo docker-compose up -d
$ docker port milvus-standalone 19530/tcp

In [None]:
from langchain.vectorstores import Milvus

vector_db = Milvus.from_documents(
    docs,
    xinference_embeddings,
    connection_args={"host": "0.0.0.0", "port": "19530"},
)
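
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The sketch below shows that idea with a brute-force search over toy three-dimensional vectors. The vectors, the chunk labels, and the choice of cosine similarity are all assumptions for illustration; the index type and distance metric Milvus actually uses depend on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k stored items most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy data: made-up embeddings, not output from a real embedding model.
toy_store = [
    ("chunk about the Supreme Court", [0.9, 0.1, 0.0]),
    ("chunk about the economy", [0.1, 0.9, 0.1]),
    ("chunk about COVID-19", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]
results = top_k(toy_query, toy_store, k=2)
print(results)
```

A real vector database avoids scanning every vector by building an index, which is what makes it practical for large collections.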

## Query about the Document

In [6]:
query = "what does the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query, k=10)
print(docs[0].page_content) 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


In [7]:
from langchain.llms import Xinference

xinference_llm = Xinference(
    server_url="http://127.0.0.1:9997",
    model_uid="ec736e9c-328b-11ee-93f8-fa163e74fa2d",  # model_uid is the uid returned from launching the model
)

We can now create a memory object to track the chat history.

In [8]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

Now we create a `ConversationalRetrievalChain` from the LLM and the vector store.

In [9]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(
    llm=xinference_llm,
    retriever=vector_db.as_retriever(),
    memory=memory)

Now, we can query information from the document. Instead of simply returning identical sentences from the document, the model generates responses by summarizing relevant content. Furthermore, it can relate a new query to the chat history, creating a chain of responses that build upon each other. 

In [16]:
query = "What did the president say about Ketanji Brown Jackson"
result = chain({"question": query})
result["answer"]

"The president supports Ketanji Brown Jackson's nomination to serve on the US Supreme Court, stating that she is a well-qualified and experienced candidate with a proven track record of fairness and impartiality."

In [17]:
query = "Did he mention who she succeeded"
result = chain({"question": query})
result["answer"]

'Ketanji Brown Jackson was nominated by President Joe Biden to replace retiring Associate Justice Stephen Breyer on the United States Supreme Court.'

In [19]:
query = "Summarize the President's opinion on COVID-19"
result = chain({"question": query})
result['answer']

"According to the provided text, the president emphasizes the importance of continuing efforts to combat the COVID-19 pandemic, including wearing masks and getting vaccinated. The president believes that vaccination is necessary to achieve full protection against the virus and encourages individuals who haven't already been vaccinated to do so. Additionally, the president promotes other preventive measures such as social distancing and handwashing to help stop the spread of COVID-19."


From the second query, we can see that the LLM correctly recognizes that "he" refers to "the president" and that "she" refers to "Ketanji Brown Jackson", both mentioned in the previous query. Even though the President's name does not appear anywhere in the article, the LLM identifies the speaker as President Joe Biden. In the third query, the LLM summarizes the President's opinion on COVID-19 concisely. Together, these examples show the capabilities of the LLM, and LangChain's "chaining" feature allows for more coherent and context-aware interactions with the model.
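
One way a chain can resolve a follow-up such as "Did he mention who she succeeded" is to ask the LLM to rewrite it as a standalone question using the chat history, and only then retrieve documents. The sketch below builds such a rewriting prompt in plain Python. The prompt wording and the helper names are hypothetical, and the stored answer is abbreviated from the output above; LangChain's own prompt template differs.

```python
def format_history(history: list[tuple[str, str]]) -> str:
    """Render (question, answer) pairs as alternating speaker lines."""
    lines = []
    for question, answer in history:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Build a prompt asking the LLM to make the follow-up self-contained."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{format_history(history)}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


# Abbreviated version of the first exchange shown above.
history = [
    (
        "What did the president say about Ketanji Brown Jackson",
        "The president supports Ketanji Brown Jackson's nomination "
        "to serve on the US Supreme Court.",
    ),
]
prompt = build_condense_prompt(history, "Did he mention who she succeeded")
print(prompt)
```

Because the history names both the president and Ketanji Brown Jackson, the model has what it needs to replace "he" and "she" before the retriever is queried.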

To stop Milvus and then delete its data, run:

In [None]:
$ sudo docker-compose down

$ sudo rm -rf  volumes