<a href="https://colab.research.google.com/github/wahidur028/QA-with-LLMs/blob/main/CKD_QA_with_LangChain_and_Together_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**CKD (Custom Knowledge Domain) QA with LangChain and Together API**

####**Credits: Sam Witteveen**
####Twitter: https://twitter.com/Sam_Witteveen

###**The Pipeline for converting raw unstructured data into a QA chain using LangChain**

- **Step 1. Loading:** First, the data must be loaded. Both structured and unstructured data can be loaded from many sources. As of August 18, 2023, 154 data loaders are available on the LangChain platform. **[link](https://integrations.langchain.com/)**.

- **Step 2. Splitting:** Text splitters break documents into chunks of a specified size. Both the chunk size and the chunk overlap can be set here.

- **Step 3. Storage:** The document splits need to be stored somewhere they can be looked up later. The most common way to do this is to embed the contents of each split and then store the embedding and the split in a vector store, with the embedding used to index the split. As of August 18, 2023, 40 vector stores and 30 text embedding models are available on the LangChain platform. **[link](https://integrations.langchain.com/)**.

- **Step 4. Retrieval:** Retrieve relevant splits for any question using similarity search. Vectorstores are commonly used for retrieval, but they are not the only option. For example, SVMs can also be used. LangChain has many retrievers including, but not limited to, vectorstores. Some common ways to improve on vector similarity search include:
  - *MultiQueryRetriever* generates variants of the input question to improve retrieval. **[link](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever)**.
  - *Max marginal relevance* selects for relevance and diversity among the retrieved documents. **[link](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf)**.

- **Step 5. Generation:** An LLM produces an answer using a prompt that includes the question and the retrieved data. You can pass in an LLM or a ChatModel to the RetrievalQA chain. **[link](https://integrations.langchain.com/)**.
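
As a rough illustration of the max marginal relevance idea mentioned in Step 4, the sketch below greedily picks documents that are similar to the query but dissimilar to the documents already picked. It is a plain-Python toy, not LangChain's implementation; the function names `cosine` and `mmr_select`, the `lambda_mult` weighting, and the example vectors are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of up to k documents.

    lambda_mult=1.0 ranks purely by relevance to the query;
    smaller values penalise similarity to already selected documents.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0 if nothing chosen yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Hypothetical 2-D "embeddings": documents 0 and 1 are near-duplicates
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))
print(mmr_select(query, docs, k=2, lambda_mult=0.3))
```

With `lambda_mult=1.0` the two near-duplicate documents are returned, while a lower value trades some relevance for a more diverse second pick.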

####**Package installation**

In [1]:
! pip install -q langchain
! pip install -q pypdf
! pip install -q InstructorEmbedding sentence_transformers
! pip install -q chromadb
! pip install -q --upgrade together

####**Import libraries**

In [2]:
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

####**Create an empty directory and upload the required file(s)**

In [3]:
import os
from google.colab import files

# Create an empty directory
folder_name = "raw_data"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
print(f"An empty directory named '{folder_name}' has been created.")

# Upload files to the folder
print("Now please upload your required file(s).")
uploaded_files = files.upload()

# Move uploaded files to the created folder
for file_name in uploaded_files.keys():
    source_path = file_name
    destination_path = os.path.join(folder_name, file_name)
    os.rename(source_path, destination_path)
    print(f"'{file_name}' has been uploaded and moved to '{folder_name}' directory.")

An empty directory named 'raw_data' has been created.
Now please upload your required file(s).


####**Load multiple files and process the documents**

In [4]:
loader = DirectoryLoader('./raw_data/', glob="./*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

In [5]:
len(documents)

219

In [6]:
# Split the loaded documents into overlapping text chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
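
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this toy does not attempt. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Each chunk repeats the last character of the previous one (overlap of 1)
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.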

####**Create the embeddings and store in a vector database**

In [7]:
model_name = "hkunlp/instructor-xl"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}

instructor_embeddings = HuggingFaceInstructEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)

load INSTRUCTOR_Transformer
max_seq_length  512


In [8]:
persist_directory = 'vectordb'
embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

####**Make the retriever**

In [9]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5})
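
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose embeddings are closest to it. The toy store below sketches that idea with cosine similarity in plain Python; it is not how Chroma is implemented, and the class name `TinyVectorStore`, its methods, and the hand-written vectors are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    """Keeps (embedding, text) pairs in memory and searches them by brute force."""

    def __init__(self):
        self._rows = []

    def add(self, vector, text):
        self._rows.append((vector, text))

    def search(self, query_vector, k=5):
        """Return the texts of the k rows most similar to query_vector."""
        ranked = sorted(
            self._rows,
            key=lambda row: cosine(query_vector, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about attention")
store.add([0.0, 1.0], "chunk about prompting")
store.add([0.9, 0.1], "chunk about GPU memory")
print(store.search([1.0, 0.0], k=2))
```

A real vector store replaces the brute-force sort with an approximate nearest-neighbour index so that search stays fast as the number of chunks grows.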

####**RetrievalQA with LLaMA 2-70B on Together API**

In [10]:
os.environ["TOGETHER_API_KEY"] = "" # create your own API key https://api.together.xyz/signin?callbackUrl=https%3A%2F%2Fapi.together.xyz%2Fplayground

In [11]:
import together

# set your API key
together.api_key = os.environ["TOGETHER_API_KEY"]

# # list available models and descriptions
# models = together.Models.list()

# together.Models.start("togethercomputer/llama-2-70b-chat")

In [12]:
import logging
from typing import Any, Dict, List, Mapping, Optional

from pydantic import Extra, Field, root_validator

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from langchain.llms.utils import enforce_stop_tokens
from langchain.utils import get_from_dict_or_env

class TogetherLLM(LLM):
    """Together large language models."""

    model: str = "togethercomputer/llama-2-70b-chat"
    """model endpoint to use"""

    together_api_key: str = os.environ["TOGETHER_API_KEY"]
    """Together API key"""

    temperature: float = 0.1
    """What sampling temperature to use."""

    max_tokens: int = 1024
    """The maximum number of tokens to generate in the completion."""

    class Config:
        extra = Extra.forbid

    @root_validator()
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that the API key is set."""
        api_key = get_from_dict_or_env(
            values, "together_api_key", "TOGETHER_API_KEY"
        )
        values["together_api_key"] = api_key
        return values

    @property
    def _llm_type(self) -> str:
        """Return type of LLM."""
        return "together"

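    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Call the Together completion endpoint and return the generated text."""
        together.api_key = self.together_api_key
        output = together.Complete.create(
            prompt,
            model=self.model,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        text = output['output']['choices'][0]['text']
        # Truncate at the first stop sequence if the caller supplied any
        if stop is not None:
            text = enforce_stop_tokens(text, stop)
        return text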

####**Make the chain**

In [13]:
llm = TogetherLLM()

# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)
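
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that assembly step in isolation; the helper name `build_stuff_prompt` is hypothetical and the template wording is an approximation for illustration, not necessarily the exact prompt LangChain sends.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

prompt = build_stuff_prompt(
    "What is FlashAttention?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything goes into one prompt, the number of retrieved chunks (`k`) times the chunk size has to fit within the model's context window alongside the generated answer.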

In [14]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
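
When all retrieved chunks come from the same file, the source list repeats one path several times. A possible refinement, sketched below, groups the chunk metadata by file and lists the pages involved. It assumes each metadata dictionary carries a `source` key and, optionally, a `page` key; the helper name `summarize_sources` is hypothetical.

```python
def summarize_sources(metadatas):
    """Group chunk metadata by source file and collect the distinct page numbers."""
    pages_by_source = {}
    for meta in metadatas:
        source = meta.get("source", "unknown")
        pages = pages_by_source.setdefault(source, [])
        page = meta.get("page")
        if page is not None and page not in pages:
            pages.append(page)
    lines = []
    for source, pages in pages_by_source.items():
        if pages:
            page_list = ", ".join(str(p) for p in sorted(pages))
            lines.append(f"{source} (pages: {page_list})")
        else:
            lines.append(source)
    return lines

# Hypothetical metadata, shaped like the dictionaries attached to retrieved chunks
example = [
    {"source": "raw_data/a.pdf", "page": 2},
    {"source": "raw_data/a.pdf", "page": 0},
    {"source": "raw_data/a.pdf", "page": 2},
    {"source": "raw_data/b.pdf"},
]
for line in summarize_sources(example):
    print(line)
```

Inside `process_llm_response`, this could be called with `[doc.metadata for doc in llm_response["source_documents"]]`.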

In [23]:
# full example
query = "can you explain what is Flash attention?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Yes, FlashAttention is a new attention algorithm proposed in the paper that computes exact attention with far
fewer memory accesses. It restructures the attention computation to split the input into blocks and make
several passes over input blocks, thus incrementally performing the softmax reduction. It also stores the
softmax normalization factor from the forward pass to quickly recompute attention on-chip in the backward
pass, which is faster than the standard approach of reading the intermediate attention matrix from HBM. The
algorithm is designed to be IO-aware, accounting for reads and writes between levels of GPU memory, and it is
shown to be faster and more memory-efficient than existing attention methods.


Sources:
raw_data/Flash-attention.pdf
raw_data/Flash-attention.pdf
raw_data/Flash-attention.pdf
raw_data/Flash-attention.pdf
raw_data/Flash-attention.pdf


In [19]:
# full example
query = "can you list down the each concept of this paper?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Sure! Here are the concepts discussed in this paper:
1. Chain-of-thought (CoT) prompting technique for LMs.
2. Few-shot prompting method.
3. Multi-turn prompts.
4. ReAct - a method that integrates decision-making and reasoning capabilities into a large language model.
5. Intuitive and easy-to-design prompts.
6. General and flexible prompts.
7. Performant and robust prompts.
8. Scratchpads - a method that allows LMs to use a working memory when more than one step is required to solve
a task correctly.
9. Fine-tuning on example tasks with associated computation steps.
10. Few-shot prompting only requires a handful of manually labeled examples and enables very fast
experimentation as no model fine-tuning is required.
11. Ability to perform reasoning with chain-of-thoughts from a few in-context examples only emerges as models
reach a certain size.
12. Performance depends heavily on the format in which examples are presented, the choice of few-shot
examples, and the order in which they are 

In [22]:
# full example
query = "can you explain the concept of Chain-of-thought (CoT) prompting technique for LMs?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


Sure! Chain-of-thought (CoT) is a prompting technique for large language models (LMs) that involves providing
the model with a series of intermediate reasoning steps leading to the final output. The idea is to elicit the
model's own "thinking procedure" for problem-solving, rather than simply providing the final answer.

The CoT prompt typically consists of a task or question, followed by a series of prompts that ask the model to
explain its reasoning step by step. For example, in the case of a math problem, the prompt might ask the model
to explain how it arrived at a particular answer, or what steps it took to solve the problem.

The CoT technique has been shown to be effective in a variety of domains, including arithmetic, commonsense,
and symbolic reasoning tasks. It has also been extended to zero-shot prompting, where the model is given a
single prompt that is not an example, and is able to generate a chain of thought to solve the task.

One advantage of the CoT technique is that