# Run an LLM App in 15 Minutes
---
To prime ourselves for the type of work ahead, we will start by creating a [question answering (QA)](https://docs.langchain.com/docs/use-cases/qa-docs) service designed to run locally.

Large language models (LLMs) are very impressive at next-token prediction, but nothing in that objective ties their output to the truth. The problem is most visible when a question falls outside the model's training data. To help mitigate these hallucinatory tendencies, we can implement a pattern referred to as [retrieval QA](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html). In this pattern, we generate embeddings for domain-specific documents, retrieve the documents most similar to a user query, and pass them to the LLM as context for constructing its response.
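
The flow can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the "embedding" is a bag-of-words count vector standing in for a real embedding model, the two documents are hand-written one-liners, and `embed`, `cosine`, and `retrieve` are hypothetical helpers, not LangChain functions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


# Hand-written toy corpus, for illustration only.
documents = [
    "The Eras Tour is a concert tour by Taylor Swift.",
    "The XFL is a professional American football league.",
]
question = "Who performs on the Eras Tour?"

# Retrieve the most relevant document, then hand it to the model as context.
context = retrieve(question, documents, k=1)
prompt = f"CONTEXT: {context[0]}\nQUESTION: {question}\nANSWER:"
print(prompt)
```

The rest of this notebook follows the same shape, with a sentence-transformers model in place of `embed`, FAISS in place of the sorted list, and StableLM consuming the prompt.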

By the end of this short notebook, you will have set up a [document corpus](https://en.wikipedia.org/wiki/Text_corpus) covering [Taylor Swift's Eras Tour](https://en.wikipedia.org/wiki/The_Eras_Tour) and the [2023 XFL Season](https://en.wikipedia.org/wiki/2023_XFL_season) that StableLM can use as context to supplement its generated answers.

## Create a document corpus

First, you need to establish the pool of information from which the language model will draw its context. In this example, we'll use a few modules from [LangChain](https://python.langchain.com/en/latest/index.html) to facilitate this process, along with [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/), a library for similarity search across vector embeddings.

### Load documents

In [None]:
from langchain.document_loaders import WikipediaLoader

topics = ["The Eras Tour", "2023 XFL season"]
loaders = [WikipediaLoader(query=topic, load_max_docs=20) for topic in topics]

### Split documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    length_function=len,
)
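
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window splitting. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally, since that class first tries to break on natural separators such as paragraph and line breaks. `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters, where each
    window repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears partly in both neighbors, which makes it less likely that retrieval misses it.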

In [None]:
# Load documents from every loader and flatten the results into one list
docs = [doc for loader in loaders for doc in loader.load()]
print(", ".join([d.metadata["title"] for d in docs]))

# Split documents into chunks
chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs]
)

### Create embeddings for documents

In [None]:
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import FAISS
from sentence_transformers import SentenceTransformer

In [None]:
class LocalHuggingFaceEmbeddings(Embeddings):
    def __init__(self, model_id):
        self.model = SentenceTransformer(model_id)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # encode() returns a NumPy array; convert it to plain Python lists
        # so the return value matches the declared type.
        embeddings = self.model.encode(texts)
        return embeddings.tolist()

    def embed_query(self, text: str) -> list[float]:
        embedding = self.model.encode(text)
        return embedding.tolist()

In [None]:
embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
db = FAISS.from_documents(chunks, embeddings)
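
Conceptually, querying a vector store is a nearest-neighbor search over the stored embeddings. The sketch below does this by brute force on toy 2-D vectors, assuming Euclidean (L2) distance as the metric. The metric a real FAISS index uses depends on how it was built, and `l2_distance` and `top_k` are hypothetical helpers, not FAISS or LangChain functions.

```python
import heapq
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], vectors: list[list[float]], k: int = 2):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = ((i, l2_distance(query_vec, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings"; real ones have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
hits = top_k([0.9, 0.1], vectors, k=2)
print([index for index, _ in hits])  # [1, 0]
```

The returned indices map back to the stored chunks, which is how a similarity search yields documents rather than raw vectors.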

### Store documents and embeddings in a vector store

In [None]:
FAISS_INDEX_PATH = "faiss_index_local"

db.save_local(FAISS_INDEX_PATH)

## Set up a QA chain

### Create a custom `Pipeline`

In [None]:
from langchain import HuggingFacePipeline
from transformers import pipeline as hf_pipeline
from typing import Optional, Any

In [None]:
class StableLMPipeline(HuggingFacePipeline):
    # Temporary class: we are working with the authors of LangChain to make it unnecessary.

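    def _call(self, prompt: str, stop: Optional[list[str]] = None) -> str:
        # `stop` is part of the LLM interface but is not applied here.
        response = self.pipeline(
            prompt, temperature=0.1, max_new_tokens=256, do_sample=True
        )
        print(f"Response is: {response}")
        # Drop the echoed prompt so only the newly generated completion is returned.
        text = response[0]["generated_text"][len(prompt) :]
        return text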

    @classmethod
    def from_model_id(
        cls,
        model_id: str,
        task: str,
        device: Optional[str] = None,
        model_kwargs: Optional[dict] = None,
        **kwargs: Any,
    ):
        pipeline = hf_pipeline(
            model=model_id,
            task=task,
            device=device,
            model_kwargs=model_kwargs,
        )
        return cls(
            pipeline=pipeline,
            model_id=model_id,
            model_kwargs=model_kwargs,
            **kwargs,
        )
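
The slicing in `_call` relies on the text-generation pipeline returning the prompt followed by the completion. Here is a minimal sketch with a hand-written stand-in for the pipeline output; the `fake_response` value and the `strip_prompt` helper are illustrative and were not produced by a real model.

```python
def strip_prompt(prompt: str, response: list[dict]) -> str:
    """Keep only the text generated after the prompt.

    Assumes response[0]["generated_text"] begins with the prompt itself.
    """
    return response[0]["generated_text"][len(prompt):]


prompt = "QUESTION: What is 2 + 2? ANSWER:"
# Hand-written stand-in shaped like a text-generation pipeline result.
fake_response = [{"generated_text": prompt + " 4"}]

completion = strip_prompt(prompt, fake_response)
print(repr(completion))  # ' 4'
```

If the pipeline were configured to return only the new text, this slice would cut off the start of the answer, so the two settings need to agree.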

### Write a prompt template

In [None]:
from langchain.prompts import PromptTemplate

template = """
<|SYSTEM|># StableLM Tuned (Alpha version)
- You are a helpful, polite, fact-based agent for answering questions. 
- Your answers include enough detail for someone to follow through on your suggestions. 
<|USER|>
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
Please answer the following question using the context provided. 

CONTEXT: 
{context}
=========
QUESTION: {question} 
ANSWER: <|ASSISTANT|>"""

PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])

### Create the QA chain

In [None]:
import torch
from langchain.chains.question_answering import load_qa_chain

In [None]:
class QALocal:
    def __init__(self):
        self.embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
        self.db = FAISS.load_local(FAISS_INDEX_PATH, self.embeddings)
        self.llm = StableLMPipeline.from_model_id(
            model_id="stabilityai/stablelm-tuned-alpha-7b",
            task="text-generation",
            model_kwargs={
                "torch_dtype": torch.float16,
                "device_map": "auto",
                "cache_dir": "/mnt/local_storage",
            },
        )
        self.chain = load_qa_chain(llm=self.llm, chain_type="stuff", prompt=PROMPT)

    def qa(self, query):
        search_results = self.db.similarity_search(query)
        print(f"Results from db are: {search_results}")
        result = self.chain({"input_documents": search_results, "question": query})
        print(f"Result is: {result}")
        return result["output_text"]
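
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt. The sketch below imitates that step with plain string formatting, assuming the chunks are joined with blank lines. `stuff_prompt` and the shortened template are stand-ins for illustration, not LangChain's implementation.

```python
# Shortened stand-in for the notebook's prompt template.
template = "CONTEXT:\n{context}\n=========\nQUESTION: {question}\nANSWER:"


def stuff_prompt(chunks: list[str], question: str) -> str:
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt = stuff_prompt(
    ["first retrieved chunk", "second retrieved chunk"],
    "What is this about?",
)
print(prompt)
```

Because everything lands in one prompt, the retrieved chunks together must fit within the model's context window, which is one reason to keep chunks small.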

In [None]:
local_qa = QALocal()

## Query the chain

In [None]:
local_qa.qa("How many people live in San Francisco?")

In [None]:
local_qa.qa("When did Taylor Swift's Eras tour start?")

In [None]:
local_qa.qa("Can you tell me about the XFL 2023 season?")

## Tear down application

You can either shut down the kernel or use these cells to free up the memory occupied by this application.

In [None]:
del local_qa

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
accelerator.free_memory()