# Overview

- Use [LangChain](https://python.langchain.com/en/latest/index.html) to **<font color='orange'>build a chatbot that can answer questions about</font>** [Harry Potter books](https://www.kaggle.com/datasets/hinepo/harry-potter-books-in-pdf-1-7)
- **<font color='orange'>Flexible and customizable RAG pipeline (Retrieval Augmented Generation)</font>**
- Experiment with various LLMs (Large Language Models)
- Use [FAISS vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss) to store text embeddings created with [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from 🤗. FAISS can run on GPU; in my experiments it created the embeddings in ~3 min versus ~35 min with Chroma
- Use [Retrieval chain](https://python.langchain.com/docs/modules/data_connection/retrievers/) to retrieve relevant passages from embedded text
- Summarize retrieved passages
- Leverage Kaggle dual GPU (2 * T4) with [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index)
- Chat UI with [Gradio](https://www.gradio.app/guides/quickstart)

**<font color='green'>No need to create any API key to use this notebook! Everything is open source.</font>**

- This run used a single T4 GPU on Google Colab


### Models

- [TheBloke/wizardLM-7B-HF](https://huggingface.co/TheBloke/wizardLM-7B-HF)
- [daryl149/llama-2-7b-chat-hf](https://huggingface.co/daryl149/llama-2-7b-chat-hf)
- [daryl149/llama-2-13b-chat-hf](https://huggingface.co/daryl149/llama-2-13b-chat-hf)
- [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

![image.png](attachment:ceef601b-8cca-48a5-a433-54c0070f1f44.png)

img source: HinePo

In [3]:
! nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-11cbdef1-f494-f6ca-e119-a471e7756d2f)


# Installs

In [4]:
%%time

from IPython.display import clear_output

! pip install sentence_transformers==2.2.2

! pip install -qq -U langchain
! pip install -qq -U tiktoken
! pip install -qq -U pypdf
! pip install -qq -U faiss-gpu
! pip install -qq -U InstructorEmbedding

! pip install -qq -U transformers
! pip install -qq -U accelerate
! pip install -qq -U bitsandbytes

clear_output()

CPU times: user 1.11 s, sys: 184 ms, total: 1.29 s
Wall time: 2min 40s


In [5]:
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes

Looking in indexes: https://pypi.org/simple/


# Imports

In [6]:
%%time

import warnings
warnings.filterwarnings("ignore")

import os
import glob
import textwrap
import time

import langchain

### loaders
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

### splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

### prompts
from langchain import PromptTemplate, LLMChain

### vector stores
from langchain.vectorstores import FAISS

### models
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings

### retrievers
from langchain.chains import RetrievalQA

import torch
import transformers
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

clear_output()

CPU times: user 9.72 s, sys: 1.41 s, total: 11.1 s
Wall time: 20 s


In [7]:
print('langchain:', langchain.__version__)
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)

langchain: 0.1.16
torch: 2.2.1+cu121
transformers: 4.40.0


In [51]:
# sorted(glob.glob('/kaggle/input/harry-potter-books-in-pdf-1-7/HP books/*'))
sorted(glob.glob('llm/python_tutorial.pdf'))

['llm/python_tutorial.pdf']

# CFG

- CFG class enables easy and organized experimentation

In [45]:
class CFG:
    # LLMs
    model_name = 'llama2-13b-chat' # wizardlm, llama2-7b-chat, llama2-13b-chat, mistral-7B
    temperature = 0
    top_p = 0.95
    repetition_penalty = 1.15

    # splitting
    split_chunk_size = 800
    split_overlap = 0

    # embeddings
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2'

    # similar passages
    k = 6

    # paths
    PDFs_path = 'llm/'
    Embeddings_path =  'llm/faiss-hp-sentence-transformers'
    Output_folder = 'llm/vectordb'

# Define model

In [10]:
def get_model(model = CFG.model_name):

    print('\nDownloading model: ', model, '\n\n')

    if model == 'wizardlm':
        model_repo = 'TheBloke/wizardLM-7B-HF'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True
        )

        max_len = 1024

    elif model == 'llama2-7b-chat':
        model_repo = 'daryl149/llama-2-7b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
            trust_remote_code = True
        )

        max_len = 2048

    elif model == 'llama2-13b-chat':
        model_repo = 'daryl149/llama-2-13b-chat-hf'

        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
            trust_remote_code = True
        )

        max_len = 2048 # 8192

    elif model == 'mistral-7B':
        model_repo = 'mistralai/Mistral-7B-v0.1'

        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        bnb_config = BitsAndBytesConfig(
            load_in_4bit = True,
            bnb_4bit_quant_type = "nf4",
            bnb_4bit_compute_dtype = torch.float16,
            bnb_4bit_use_double_quant = True,
        )

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config = bnb_config,
            device_map = 'auto',
            low_cpu_mem_usage = True,
        )

        max_len = 1024

    else:
        # fail early: otherwise the return below would hit undefined variables
        raise ValueError(f"Model not implemented (tokenizer and backbone): {model}")

    return tokenizer, model, max_len
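
The four branches of `get_model` differ mainly in the repository name and `max_len`. A minimal sketch of a table-driven lookup is shown here; `MODEL_REGISTRY` and `resolve_model` are hypothetical names introduced only for illustration, and the per-model tokenizer and `trust_remote_code` differences are left out.

```python
# Hypothetical registry: repos and max_len values mirror the branches of get_model.
MODEL_REGISTRY = {
    "wizardlm":        {"repo": "TheBloke/wizardLM-7B-HF",      "max_len": 1024},
    "llama2-7b-chat":  {"repo": "daryl149/llama-2-7b-chat-hf",  "max_len": 2048},
    "llama2-13b-chat": {"repo": "daryl149/llama-2-13b-chat-hf", "max_len": 2048},
    "mistral-7B":      {"repo": "mistralai/Mistral-7B-v0.1",    "max_len": 1024},
}


def resolve_model(name):
    """Return (repo, max_len) for a known model name, or raise ValueError."""
    try:
        entry = MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"Unknown model '{name}'. Known models: {known}") from None
    return entry["repo"], entry["max_len"]


print(resolve_model("llama2-13b-chat"))
```

With a lookup like this, the shared `BitsAndBytesConfig` and `from_pretrained` calls could be written once instead of once per branch.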

In [11]:
%%time

tokenizer, model, max_len = get_model(model = CFG.model_name)

clear_output()

CPU times: user 46.2 s, sys: 1min, total: 1min 46s
Wall time: 5min 48s


In [12]:
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )


In [46]:
### check how Accelerate split the model across the available devices (GPUs)
model.hf_device_map

{'': 0}
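
The device map printed above has a single empty key, which stands for the whole model, so everything sits on GPU 0. As a rough sketch, a small helper can summarize larger maps; `summarize_device_map` and the two-GPU `example_map` are made up for illustration and are not output from this run.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many entries of a device map are assigned to each device."""
    return dict(Counter(str(device) for device in device_map.values()))


# The map from this run: one entry, placed on device 0.
print(summarize_device_map({"": 0}))

# A made-up two-GPU map, only to illustrate what a split model could look like.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "lm_head": 1,
}
print(summarize_device_map(example_map))
```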

# 🤗 pipeline

- Hugging Face pipeline

In [47]:
### hugging face pipeline
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
#     do_sample = True,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

### langchain pipeline
llm = HuggingFacePipeline(pipeline = pipe)
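
In the pipeline above `do_sample` is commented out, so, as far as I can tell, decoding is greedy and `temperature` and `top_p` are not used. The sketch below shows one way to keep these arguments consistent; `build_generation_kwargs` is a hypothetical helper, not part of Transformers or LangChain.

```python
def build_generation_kwargs(temperature, top_p, repetition_penalty):
    """Build keyword arguments for a text-generation pipeline.

    Hypothetical helper: sampling is enabled only for a positive temperature;
    otherwise decoding stays greedy and the sampling knobs are left out.
    """
    kwargs = {"repetition_penalty": repetition_penalty}
    if temperature and temperature > 0:
        kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    else:
        kwargs["do_sample"] = False
    return kwargs


# Values from CFG: temperature 0 gives greedy decoding.
print(build_generation_kwargs(temperature=0, top_p=0.95, repetition_penalty=1.15))

# A made-up sampling configuration for comparison.
print(build_generation_kwargs(temperature=0.7, top_p=0.95, repetition_penalty=1.15))
```

The resulting dictionary could be unpacked into the `pipeline(...)` call in place of the separate `temperature`, `top_p` and `repetition_penalty` arguments.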

In [48]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7baeeffa6b00>)

In [49]:
%%time
### testing model, not using the harry potter books yet
### answer is not necessarily related to harry potter
query = "what is bert fp"
llm.invoke(query)

CPU times: user 1min 58s, sys: 292 ms, total: 1min 58s
Wall time: 1min 59s


"what is bert fp16, and how does it differ from the original BERT model?\n\nBERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google that has achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. In recent years, there have been several variants of BERT that have been proposed to improve its performance or adapt it to specific task domains. One such variant is BERT-FP16, which uses 16-bit floating point numbers (fp16) instead of the standard 32-bit floating point numbers (fp32) used in the original BERT model.\n\nIn this article, we will explore what is BERT FP16, and how it differs from the original BERT model. We will also discuss some of the key benefits and trade-offs of using BERT-FP16 for NLP tasks.\n\nWhat is BERT FP16?\n\nBERT FP16 is a variant of the popular BERT language model that uses 16-bit floating point numbers (fp16) instead of the standard 32-bit floating point numbers (fp

# 🦜🔗 Langchain

- Multiple document retriever with LangChain

In [17]:
CFG.model_name

'llama2-13b-chat'

# Loader

- [Directory loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for multiple files
- This step is not necessary if you are just loading the vector database
- This step is necessary if you are creating embeddings. In this case you need to:
    - load the PDF files
    - split into chunks
    - create embeddings
    - save the embeddings in a vector store
    - After that you can just load the saved embeddings to do similarity search with the user query, and then use the LLM to answer the question
    
You can comment out this section if you use the embeddings I already created.

In [52]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

documents = loader.load()

100%|██████████| 1/1 [00:04<00:00,  4.23s/it]

CPU times: user 4.22 s, sys: 19.8 ms, total: 4.24 s
Wall time: 4.24 s





In [53]:
print(f'We have {len(documents)} pages in total')

We have 155 pages in total


In [54]:
documents[8].page_content

'CHAPTER\nONE\nWHETTING YOUR APPETITE\nIf you do much work on computers, eventually you ﬁnd that there’s some task you’d like to automate. For\nexample, you may wish to perform a search-and-replace over a large number of text ﬁles, or rename and\nrearrange a bunch of photo ﬁles in a complicated way. Perhaps you’d like to write a small custom database,\nor a specialized GUI application, or a simple game.\nIf you’re a professional software developer, you may have to work with several C/C++/Java libraries but\nﬁnd the usual write/compile/test/re-compile cycle is too slow. Perhaps you’re writing a test suite for such\na library and ﬁnd writing the testing code a tedious task. Or maybe you’ve written a program that could\nuse an extension language, and you don’t want to design and implement a whole new language for your\napplication.\nPython is just the language for you.\nYou could write a Unix shell script or Windows batch ﬁles for some of these tasks, but shell scripts are best\nat moving

# Splitter

- Split the text into chunks so that relevant passages can be found with similarity search
- This step is also only necessary if you are creating the embeddings
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=RecursiveCharacterTextSplitter#langchain.document_loaders.MWDumpLoader)

In [55]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap = CFG.split_overlap
)

texts = text_splitter.split_documents(documents)

print(f'We have created {len(texts)} chunks from {len(documents)} pages')

We have created 482 chunks from 155 pages
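
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width splitter. It is only an illustration: `split_fixed` is a hypothetical function, and `RecursiveCharacterTextSplitter` works differently because it prefers to break on separators such as paragraph and line breaks.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
print(split_fixed(sample, chunk_size=10, chunk_overlap=0))
print(split_fixed(sample, chunk_size=10, chunk_overlap=4))
```

With overlap, neighbouring chunks share characters, so a sentence cut at a chunk boundary still appears whole in one of them, at the cost of more chunks to embed.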


# Create Embeddings


- Embed and store the texts in a Vector database (FAISS)
- [LangChain Vector Stores docs](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [FAISS - langchain](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - paper Aug/2019](https://arxiv.org/pdf/1908.10084.pdf)
- [This is a nice 4-minute video about vector stores](https://www.youtube.com/watch?v=dN0lsF2cvm4)

___

- If you use Chroma vector store it will take ~35 min to create embeddings
- If you use FAISS vector store on GPU it will take just ~3 min

___

We need to create the embeddings only once, and then we can just load the vector store and query the database using similarity search.

Loading the embeddings takes only a few seconds.

I uploaded the embeddings to a Kaggle Dataset so we just load it from [here](https://www.kaggle.com/datasets/hinepo/faiss-hp-sentence-transformers).
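
Conceptually, similarity search ranks stored vectors by how close they are to the query vector. The brute-force sketch below uses cosine similarity on made-up 3-dimensional vectors; it is not how FAISS is implemented (FAISS uses optimized indexes, and the LangChain wrapper may rank by a different distance), and the names `cosine_similarity` and `top_k` are chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
doc_vecs = {
    "loops":      [0.9, 0.1, 0.0],
    "exceptions": [0.1, 0.9, 0.1],
    "classes":    [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

for doc_id, score in top_k(query_vec, doc_vecs, k=2):
    print(f"{doc_id}: {score:.3f}")
```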

In [56]:
%%time

### we create the embeddings only if the saved index does not exist yet
### (check the same folder the index is saved to and loaded from)
if not os.path.exists(f"{CFG.Output_folder}/faiss_index_hp/index.faiss"):

    ### download embeddings model
    embeddings = HuggingFaceInstructEmbeddings(
        model_name = CFG.embeddings_model_repo,
        model_kwargs = {"device": "cuda"}
    )

    ### create embeddings and DB
    vectordb = FAISS.from_documents(
        documents = texts,
        embedding = embeddings
    )

    ### persist vector database
    vectordb.save_local(f"{CFG.Output_folder}/faiss_index_hp") # save in output folder
#     vectordb.save_local(f"{CFG.Embeddings_path}/faiss_index_hp") # save in input folder

load INSTRUCTOR_Transformer
max_seq_length  512
CPU times: user 1.96 s, sys: 72.8 ms, total: 2.03 s
Wall time: 2.02 s


If creating embeddings, remember that on Kaggle we cannot write data to the input folder.

So just write (save) the embeddings to the output folder and then load them from there.

# Load vector database

- After saving the vector database, we just load it, either from the output folder (as in the cell below) or from the Kaggle Dataset I mentioned
- The embeddings model used to load the vector database must be the same one that was used to create it

In [57]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

### load vector DB embeddings
vectordb = FAISS.load_local(
    # CFG.Embeddings_path, # from input folder
    CFG.Output_folder + '/faiss_index_hp', # from output folder
    embeddings,
    allow_dangerous_deserialization=True
)

clear_output()

CPU times: user 82.9 ms, sys: 54.2 ms, total: 137 ms
Wall time: 137 ms


In [59]:
### test if vector DB was loaded correctly
vectordb.similarity_search('contional loop')

[Document(page_content='(Yes, this is the correct code. Look closely: the elseclause belongs to the forloop, nottheifstatement.)\nWhen used with a loop, the elseclause has more in common with the elseclause of a trystatement than it\ndoes that of ifstatements: a trystatement’s elseclause runs when no exception occurs, and a loop’s else\nclause runs when no breakoccurs. For more on the trystatement and exceptions, see Handling Exceptions .\nThecontinue statement, also borrowed from C, continues with the next iteration of the loop:\n>>> fornum inrange(2,10):\n... ifnum%2==0:\n... print("Found an even number ", num)\n... continue\n... print("Found a number ", num)\nFound an even number 2\n(continues on next page)\n4.4. breakand continue Statements, and elseClauses on Loops 21', metadata={'source': 'llm/python_tutorial.pdf', 'page': 26}),
 Document(page_content='it is usually not necessary to call iter()or deal with iterator objects yourself. The forstatement\ndoes that automatically for y

# Prompt Template

- Custom prompt

In [60]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables = ["context", "question"]
)

In [31]:
# llm_chain = LLMChain(prompt=PROMPT, llm=llm)
# llm_chain

# Retriever chain

- Retriever to retrieve relevant passages
- Chain to answer questions
- [RetrievalQA: Chain for question-answering](https://python.langchain.com/docs/modules/data_connection/retrievers/)

In [61]:
### search_type is an argument of as_retriever itself, not an entry of search_kwargs
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever,
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)
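
With `chain_type = "stuff"`, all retrieved passages are placed into the `{context}` slot of a single prompt. A minimal sketch of that idea with plain string formatting follows; `stuff_prompt` is a hypothetical helper, and the blank-line separator is an assumption based on how passages appear in the printed results.

```python
PROMPT_TEMPLATE = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def stuff_prompt(passages, question, separator="\n\n"):
    """Join all retrieved passages into one context block and fill the template."""
    context = separator.join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder passages standing in for the k retrieved chunks.
passages = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = stuff_prompt(passages, "What is a while loop?")
print(prompt)
print(len(prompt), "characters")
```

Because everything goes into one prompt, `k = 6` chunks of up to 800 characters add up to roughly 4800 characters of context, and that context plus the answer has to fit within the pipeline's `max_length`.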

In [62]:
### testing MMR search
question = "Which is inheritance?"
vectordb.max_marginal_relevance_search(question, k = CFG.k)

[Document(page_content='Python Tutorial, Release 3.7.0\nFor most purposes, in the simplest cases, you can think of the search for attributes inherited from a parent\nclass as depth-ﬁrst, left-to-right, not searching twice in the same class where there is an overlap in the hier-\narchy. Thus, if an attribute is not found in DerivedClassName , it is searched for in Base1, then (recursively)\nin the base classes of Base1, and if it was not found there, it was searched for in Base2, and so on.\nIn fact, it is slightly more complex than that; the method resolution order changes dynamically to support\ncooperative calls to super(). This approach is known in some other multiple-inheritance languages as\ncall-next-method and is more powerful than the super call found in single-inheritance languages.', metadata={'source': 'llm/python_tutorial.pdf', 'page': 83}),
 Document(page_content='72 Chapter 9. Classes', metadata={'source': 'llm/python_tutorial.pdf', 'page': 77}),
 Document(page_content='a

In [64]:
### testing similarity search
question = "Which is function?"
vectordb.similarity_search(question, k = CFG.k)

[Document(page_content='assigned a value within a function (unless named in a globalstatement), although they may be referenced.\nThe actual parameters (arguments) to a function call are introduced in the local symbol table of the called\nfunction when it is called; thus, arguments are passed using cal l by value (where the valueis always an object\nreference , not the value of the object).1When a function calls another function, a new local symbol table is\ncreated for that call.\nA function deﬁnition introduces the function name in the current symbol table. The value of the function\nname has a type that is recognized by the interpreter as a user-deﬁned function. This value can be assigned\nto another name which can then also be used as a function. This serves as a general renaming mechanism:\n>>>fib', metadata={'source': 'llm/python_tutorial.pdf', 'page': 28}),
 Document(page_content='0 1 1 2 3 5 8 13 21 34 55 89 144 233 377\n1In fact function deﬁnitions are also ‘statements’ that a
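
The two searches above differ in how they pick passages: similarity search takes the closest chunks, while MMR (maximal marginal relevance) also penalizes chunks that are close to ones already selected. The sketch below is a simplified greedy MMR on made-up 2-dimensional vectors, not LangChain's implementation; `lambda_mult` mirrors the name of the LangChain parameter, and the other names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedy maximal marginal relevance over a dict of {doc_id: vector}.

    lambda_mult = 1.0 ranks by relevance only; lower values favour diversity.
    """
    selected = []
    candidates = dict(doc_vecs)
    while candidates and len(selected) < k:
        def score(doc_id):
            relevance = cosine(query_vec, candidates[doc_id])
            redundancy = max(
                (cosine(candidates[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        del candidates[best]
    return selected


# Made-up vectors: two near-duplicate passages and one different passage.
doc_vecs = {
    "inheritance_a": [1.0, 0.0],
    "inheritance_b": [0.99, 0.05],
    "functions":     [0.6, 0.8],
}
query_vec = [1.0, 0.1]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # relevance and diversity
```

In this toy data the relevance-only ranking returns both near-duplicates, while the balanced setting swaps the second one for the different passage.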

# Post-process outputs

- Format llm response
- Cite sources (PDFs)
- Change `width` parameter to format the output

In [65]:
def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


def process_llm_response(llm_response):
    ans = wrap_text_preserve_newlines(llm_response['result'])

    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4]
            + ' - page: '
            + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )

    ans = ans + '\n\nSources: \n' + sources_used
    return ans
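
Several retrieved chunks can come from the same page, in which case the source list repeats the same entry. A possible variant that drops repeats while keeping the order is sketched below; `format_sources` is a hypothetical helper that takes plain metadata dictionaries shaped like the ones shown in the search results.

```python
def format_sources(source_metadata):
    """Format 'file - page: N' labels, dropping repeats while keeping order."""
    labels = []
    for meta in source_metadata:
        name = meta["source"].split("/")[-1]
        if name.lower().endswith(".pdf"):
            name = name[:-4]  # strip the extension only when it is really there
        label = f"{name} - page: {meta['page']}"
        if label not in labels:
            labels.append(label)
    return " \n".join(labels)


# Made-up metadata with one repeated page.
example_metadata = [
    {"source": "llm/python_tutorial.pdf", "page": 26},
    {"source": "llm/python_tutorial.pdf", "page": 26},
    {"source": "llm/python_tutorial.pdf", "page": 83},
]
print(format_sources(example_metadata))
```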

In [66]:
def llm_ans(query):
    start = time.time()

    llm_response = qa_chain.invoke(query)
    ans = process_llm_response(llm_response)

    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str
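
The answers printed in the next section start with the prompt itself, because the text-generation pipeline returns the prompt followed by the generated text unless `return_full_text = False` is passed to it. As an alternative, the echo can be removed in post-processing. The sketch below assumes the prompt ends with the `Question: ... Answer:` tail from the template; `strip_prompt_echo` is a hypothetical helper.

```python
def strip_prompt_echo(result, query):
    """Return only the text generated after the prompt's Question/Answer tail.

    Assumes the prompt ends with the question line followed by 'Answer:' as in
    the template; if that marker is missing, the text is returned unchanged.
    """
    marker = f"Question: {query}\nAnswer:"
    _, found, tail = result.partition(marker)
    return tail.strip() if found else result


# Made-up echoed output, shaped like the results printed in this notebook.
echoed = (
    "Use only the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved passage.\n\n"
    "Question: while loop\nAnswer: A while loop repeats while its condition is true."
)
print(strip_prompt_echo(echoed, "while loop"))
```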

# Ask questions

- Question Answering from multiple documents
- Invoke QA Chain
- Talk to your data

In [67]:
CFG.model_name

'llama2-13b-chat'

In [70]:
query = "while loop"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

(Yes, this is the correct code. Look closely: the elseclause belongs to the forloop, nottheifstatement.)
When used with a loop, the elseclause has more in common with the elseclause of a trystatement than it
does that of ifstatements: a trystatement’s elseclause runs when no exception occurs, and a loop’s else
clause runs when no breakoccurs. For more on the trystatement and exceptions, see Handling Exceptions .
Thecontinue statement, also borrowed from C, continues with the next iteration of the loop:
>>> fornum inrange(2,10):
... ifnum%2==0:
... print("Found an even number ", num)
... continue
... print("Found a number ", num)
Found an even number 2
(continues on next page)
4.4. breakand continue Statements, and elseClauses on Loops 21

forline inopen("myfile.txt "):
print(line

In [72]:
query = "Author of the book?"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

* included with this distribution is covered by the same copyright terms
* except that the holder is Tim Hudson (tjh@cryptsoft.com).
*
* Copyright remains Eric Young's, and as such any Copyright notices in
* the code are not to be removed.
* If this package is used in a product, Eric Young should be given attribution
* as the author of the parts of the library used.
(continues on next page)
140 Appendix C. History and License

(continues on next page)
136 Appendix C. History and License

72 Chapter 9. Classes

Many people have contributed to the Python language, the Python standard library, and the Python docu-
mentation. See Misc/ACKS in the Python source distribution for a partial list of contributors.
It is only with the input and contributions of the Python community that Pyt

In [73]:
query = "Exception handling"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

occur inside functions that are called (even indirectly) in the try clause. For example:
>>> defthis_fails ():
... x=1/0
...
>>> try:
... this_fails()
... exceptZeroDivisionError aserr:
... print('Handling run-time error: ', err)
...
Handling run-time error: division by zero
8.4Raising Exceptions
Theraisestatement allows the programmer to force a speciﬁed exception to occur. For example:
>>> raiseNameError ('HiThere')
Traceback (most recent call last):
File"<stdin>" , line1, in <module>
NameError : HiThere
Thesoleargumentto raiseindicatestheexceptiontoberaised. Thismustbeeitheranexceptioninstanceor
an exception class (a class that derives from Exception ). If an exception class is passed, it will be implicitly
instantiated by calling its constructor with no arguments:

self.previ

In [74]:
query = "regular expression"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

would create an independent local variable with the same name masking the built-in variable with its magic
behavior.
1Since** has higher precedence than -,-3**2 will be interpreted as -(3**2) and thus result in -9 . T o avoid this and get
9, you can use (-3)**2 .
10 Chapter 3. An Informal Introduction to Python

The%operator can also be used for string formatting. It interprets the left argument much like a sprintf() -
style format string to be applied to the right argument, and returns the string resulting from this formatting
operation. For example:
56 Chapter 7. Input and Output

Dcab ==> 7678
Other modiﬁers can be used to convert the value before it is formatted. '!a'appliesascii(),'!s'applies
str(), and'!r'appliesrepr():
>>>animals ='eels'
>>>print(f'My hovercraft is full of

# Gradio Chat UI

- **<font color='orange'>At the moment this part only works on Google Colab. Gradio and Kaggle started having compatibility issues recently.</font>**
- If you plan to use the interface, it is preferable to do so in Google Colab
- This run was done on Colab, so the cells below are active; comment them out when running on Kaggle
- Chat UI prints below

___

- Create a chat UI with [Gradio](https://www.gradio.app/guides/quickstart)
- [ChatInterface docs](https://www.gradio.app/docs/chatinterface)
- The notebook must be running for the chat interface to work

In [75]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [76]:
! pip install --upgrade gradio -qq
clear_output()

In [77]:
import gradio as gr
print(gr.__version__)

4.27.0


In [78]:
def predict(message, history):
    # output = message # debug mode

    output = str(llm_ans(message)).replace("\n", "<br/>")
    return output

demo = gr.ChatInterface(
    predict,
    title = f' Open-Source LLM ({CFG.model_name}) for Question Answering'
)

demo.queue()
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://6f99a1f455235cbaa7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




![image.png](attachment:413fe7a3-6534-45b5-b6e3-7fc86e982cf1.png)

![image.png](attachment:976f4bf4-7626-4d4a-b773-3eebd7e9f000.png)

# Conclusions

- Feel free to fork and optimize the code. Lots of things can be improved.

- Things I found had the most impact on model output quality in my experiments:
    - Prompt engineering
    - Bigger models
    - Other model families
    - Splitting: chunk size, overlap
    - Search: Similarity, MMR, k
    - Pipeline parameters (temperature, top_p, penalty)
    - Embeddings function
    - LLM parameters (max len)


- LangChain, Hugging Face and Gradio are awesome libs!

- In case you are interested in LLMs, I also have some other notebooks you might want to check:

    - [Instruction Finetuning](https://www.kaggle.com/code/hinepo/llm-instruction-finetuning-wandb)
    - [Preference Finetuning - LLM Alignment](https://www.kaggle.com/code/hinepo/llm-alignment-preference-finetuning)
    - [Synthetic Data for Finetuning](https://www.kaggle.com/code/hinepo/synthetic-data-creation-for-llms)
    - [Safeguards and Guardrails](https://www.kaggle.com/code/hinepo/llm-safeguards-and-guardrails)
    
___

🦜🔗🤗

![image.png](attachment:68773819-4358-4ded-be3e-f1d275103171.png)