# Deeplake on GPT and GGUF
https://python.langchain.com/docs/use_cases/question_answering/how_to/code/code-analysis-deeplake

This notebook demonstrates the difference between ChatGPT and Llama2 when answering questions about a codebase indexed in Deep Lake.

In [1]:
# openai
# !pip3 install openai tiktoken

## OpenAI (GPT)

In [2]:
from dotenv import load_dotenv,dotenv_values
import os

# load_dotenv()
# openai_api_key=os.getenv('OPENAI_API_KEY', 'YourAPIKey')
# OR
dotenv_cfg = dotenv_values(".env")
openai_api_key=dotenv_cfg.get('OPENAI_API_KEY')

In [3]:
# ======= OpenAI Embeddings ==========
from langchain.embeddings.openai import OpenAIEmbeddings

gpt_embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# **Note**: Displaying this object will print your OpenAI API key
# gpt_embeddings   

In [4]:
# ======= OpenAI Model==========
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

chat_model = ChatOpenAI(openai_api_key=openai_api_key,model_name="gpt-3.5-turbo-0613")

## Data Preparation

### Load text into Document

In [5]:
!git clone https://github.com/langchain-ai/langchain.git mydata/langchain-sourcecode

fatal: destination path 'mydata/langchain-sourcecode' already exists and is not an empty directory.


In [6]:
root_dir = "mydata/langchain-sourcecode/libs/"

In [7]:
from langchain.document_loaders import TextLoader
import os

docs = []
files = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        if file.endswith(".py") and "venv" not in dirpath:
            try:
                filepath=os.path.join(dirpath, file)
                loader = TextLoader(filepath, encoding="utf-8")
                files.append(filepath)
                docs.extend(loader.load_and_split())
            except Exception as e:
                pass
print(f"load {len(docs)} docs from {len(files)} *.py")


load 5018 docs from 3385 *.py


### Split Documents
#### CharacterTextSplitter VS RecursiveCharacterTextSplitter
##### RecursiveCharacterTextSplitter
> https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
> It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is \["\n\n", "\n", " ", ""\]

##### CharacterTextSplitter
> https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter
> This is the simplest method. It splits based on characters (by default "\n\n") and measures chunk length by number of characters.
>
> 
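To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It is a simplification written for illustration, not the LangChain implementation: the function names `split_on_separator` and `split_recursively` are hypothetical, and unlike the real splitters this sketch does not merge small pieces back together or apply `chunk_overlap`.

```python
def split_on_separator(text, separator="\n\n"):
    """Single-separator splitting: pieces may be longer than any size limit."""
    return [piece for piece in text.split(separator) if piece]


def split_recursively(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try each separator in order, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(separators[0]):
        chunks.extend(split_recursively(piece, chunk_size, separators[1:]))
    return chunks


sample = "alpha beta\n\ngamma delta epsilon zeta"
print(split_on_separator(sample))
print(split_recursively(sample, chunk_size=12))
```

With `chunk_size=12`, the single-separator version still returns one 24-character piece, while the recursive version falls back to splitting on spaces so that every chunk fits.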

In [8]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
print(f"There are {len(split_docs)} documents after split")

There are 8285 documents after split


### Make Vectorstore with deeplake

In [9]:
!pip install -qU deeplake

In [10]:
# Store data locally
# https://python.langchain.com/docs/integrations/vectorstores/activeloop_deeplake#deep-lake-locally
def make_local_vectorstore(split_docs,embeddings, dataset_path="./.my_deeplake/", overwrite=False):
    from langchain.vectorstores import DeepLake
    from os import path
    if overwrite or not path.exists(dataset_path):
        local_vectorstore=DeepLake.from_documents(split_docs, dataset_path=dataset_path, embedding=embeddings, overwrite=True)
    else:
        local_vectorstore=DeepLake(dataset_path=dataset_path, embedding=embeddings, read_only=True)
    return local_vectorstore 


In [11]:
# Store data on activeloop hub
def make_hub_vectorstore(split_docs, embeddings, hub_name, overwrite=False):
    from langchain.vectorstores import DeepLake
    os.environ['ACTIVELOOP_TOKEN'] = dotenv_cfg.get('ACTIVELOOP_TOKEN')
    # Hub paths have the form hub://<org_or_user>/<dataset_name>.
    # Assumes ACTIVELOOP_USER is also defined in .env.
    dataset_path = f"hub://{dotenv_cfg.get('ACTIVELOOP_USER')}/{hub_name}"
    if overwrite:
        hub_vectorstore = DeepLake.from_documents(
            split_docs, embeddings, dataset_path=dataset_path, runtime={"tensor_db": True} #, overwrite=True
        )
    else:
        hub_vectorstore = DeepLake(dataset_path=dataset_path, embedding=embeddings, read_only=True)
    return hub_vectorstore

In [12]:
# gpt_vectorstore = make_hub_vectorstore(split_docs, gpt_embeddings,hub_name="my_gpt_deeplake")
gpt_vectorstore = make_local_vectorstore(split_docs, gpt_embeddings, dataset_path="./.my_gpt_deeplake/", overwrite=False)

Deep Lake Dataset in ./.my_gpt_deeplake/ already exists, loading from the storage


# Inference test 

In [13]:
# Our inference test function
from timeit import default_timer as timer
def inferenceQA(chat_model, vectorstore, questions):
    qa = ConversationalRetrievalChain.from_llm(chat_model, retriever=vectorstore.as_retriever())
    chat_history = []
    qa_dict = {}
    
    for question in questions:
        print(f"-> **Question**: {question} \n")
        start=timer()
        result = qa({"question": question, "chat_history": chat_history})
        end=timer()
        print(f"**{int((end-start)*100)/100.0} secs**\n")
        print(f"**Answer**: {result['answer']} \n")
        chat_history.append((question, result["answer"]))
        qa_dict[question] = result["answer"]
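
The loop above feeds every earlier question and answer back into the chain as `chat_history`, so later questions are answered with more context than the first one. A minimal sketch of the same loop that also returns the answers and timings, which makes it easier to compare models afterwards. `run_questions` and `fake_chain` are hypothetical names; `fake_chain` is only a stand-in so the sketch runs without a model.

```python
from timeit import default_timer as timer


def run_questions(qa, questions):
    """Ask questions in order, passing earlier turns back in as chat history."""
    chat_history = []
    results = {}
    for question in questions:
        start = timer()
        result = qa({"question": question, "chat_history": chat_history})
        elapsed = timer() - start
        chat_history.append((question, result["answer"]))
        results[question] = {"answer": result["answer"], "seconds": elapsed}
    return results


def fake_chain(inputs):
    """Hypothetical stand-in for a chain: reports how many earlier turns it received."""
    return {"answer": f"saw {len(inputs['chat_history'])} earlier turn(s)"}


for question, info in run_questions(fake_chain, ["q1", "q2", "q3"]).items():
    print(question, "->", info["answer"])
```

Passing a real chain in place of `fake_chain` should work the same way, as long as it accepts the same input dictionary and returns an `answer` key.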


In [14]:
inferenceQA(chat_model, gpt_vectorstore, questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ])

-> **Question**: What is the class hierarchy? 

**4.32 secs**

**Answer**: The class hierarchy in the code includes the following classes:

- BaseMemory
  - BaseChatMemory
    - <name>Memory (e.g., ZepMemory, MotorheadMemory)

- BaseChatMessageHistory
  - <name>ChatMessageHistory (e.g., ZepChatMessageHistory)

These classes are part of the langchain_core module. 

-> **Question**: What classes are derived from the Chain class? 

**24.87 secs**

**Answer**: The following classes are derived from the Chain class:

1. APIChain (from langchain.chains.api.base)
2. OpenAPIEndpointChain (from langchain.chains.api.openapi.chain)
3. AnalyzeDocumentChain (from langchain.chains.combine_documents.base)
4. MapReduceDocumentsChain (from langchain.chains.combine_documents.map_reduce)
5. MapRerankDocumentsChain (from langchain.chains.combine_documents.map_rerank)
6. ReduceDocumentsChain (from langchain.chains.combine_documents.reduce)
7. RefineDocumentsChain (from langchain.chains.combine_documents.re

## Llama2 (GGUF on CTransformers)

In [15]:
# ======== GGUF =========
from langchain.llms import CTransformers
import os

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# context_length (in tokens) must be large enough to hold the retrieved chunks
# (chunk_size=2048 characters in text_splitter). If it is too short, the output will be poor.
config = {'max_new_tokens': 2048, 'repetition_penalty': 1.05,'context_length':4096}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
cTransformers_llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)
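
A rough token budget helps sanity-check the `context_length` setting. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the prompt and the generated tokens share one context window, the retriever returns 4 chunks, and text averages about 4 characters per token. `overhead_tokens` is a hypothetical allowance for the prompt template and the question. Only `context_length`, `max_new_tokens`, and the splitter's `chunk_size` come from this notebook.

```python
def prompt_budget(context_length, max_new_tokens):
    """Tokens left for the prompt if the model generates a full-length answer."""
    return context_length - max_new_tokens


def estimated_prompt_tokens(chunk_chars, num_chunks, chars_per_token=4.0, overhead_tokens=200):
    """Very rough prompt size: retrieved chunks plus an assumed template overhead."""
    return int(num_chunks * chunk_chars / chars_per_token) + overhead_tokens


budget = prompt_budget(context_length=4096, max_new_tokens=2048)
estimate = estimated_prompt_tokens(chunk_chars=2048, num_chunks=4)
print(f"budget={budget} tokens, estimated prompt={estimate} tokens")
```

Under these assumptions the estimate (2248 tokens) slightly exceeds the budget (2048 tokens), so a smaller `chunk_size`, fewer retrieved chunks, or a lower `max_new_tokens` may be worth trying. Source code often tokenizes less efficiently than prose, so the real prompt could be larger.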

In [16]:
# embedding data
from mylib.MyModelUtils import MyModelUtils
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
llama2_embeddings=HuggingFaceEmbeddings(
    model_name=os.path.abspath("./models/sentence-transformers/all-mpnet-base-v2"), 
    model_kwargs={"device": MyModelUtils.device()}
)

In [17]:
# llama2_vectorstore = make_hub_vectorstore(split_docs, llama2_embeddings,hub_name="my_llama2_deeplake")
llama2_vectorstore = make_local_vectorstore(split_docs, llama2_embeddings, dataset_path="./.my_llama2_deeplake/")



Creating 8285 embeddings in 17 batches of size 500:: 100% 17/17 [03:43<00:00, 13.16s/it]

Dataset(path='./.my_llama2_deeplake/', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (8285, 1)     str     None   
 metadata     json      (8285, 1)     str     None   
 embedding  embedding  (8285, 768)  float32   None   
    id        text      (8285, 1)     str     None   





In [18]:
inferenceQA(cTransformers_llm, llama2_vectorstore, questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ])

-> **Question**: What is the class hierarchy? 

**208.2 secs**

**Answer**:  The class hierarchy for the given code is as follows:

* `BaseTool`: The base class for all tools.
* `AmadeusBaseTool`: An abstract class that implements the `BaseTool` class.
* `BaseBrowserTool`: An abstract class that implements the `BaseTool` class.
* `GmailBaseTool`: An abstract class that implements the `BaseTool` class.
* `O365BaseTool`: An abstract class that implements the `BaseTool` class.
* `SlackBaseTool`: An abstract class that implements the `BaseTool` class.
* `Embeddings`: A class that inherits from `BaseTool`.
* `BaseMemory`: A class that inherits from `BaseTool`.
* `BaseChatMemory`: A class that inherits from `BaseMemory`.
* `ZepMemory`: A class that inherits from `BaseChatMemory`.
* `MotorheadMemory`: A class that inherits from `BaseChatMemory`.
* `BaseChatMessageHistory`: A class that inherits from `BaseTool`.
* `<name>ChatMessageHistory`: A class that inherits from `BaseChatMessageHistory`.

# What will happen if we use the GPT-embedding vector store with Llama2, or vice versa?

**Conclusion**: 
The vector store does affect the chat model's answers, and in these runs **the vector store mattered more than the model**: each model's answer followed whichever store it was given.
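
One way to see why: retrieval is decided entirely by the embedding space, before the chat model sees anything. The "Llama2" store in this notebook is built with `all-mpnet-base-v2` embeddings, not with Llama2 itself. The toy sketch below uses made-up 2-D vectors for the same three hypothetical files under two hypothetical embedding models, and shows that the same question can retrieve different context.

```python
import math


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def top_k(query_vec, doc_vecs, k=1):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
    return ranked[:k]


# Made-up vectors: the same documents and the same question, embedded by two models.
space_a = {"memory.py": (1.0, 0.1), "chains.py": (0.2, 1.0), "tools.py": (0.7, 0.7)}
space_b = {"memory.py": (0.1, 1.0), "chains.py": (1.0, 0.2), "tools.py": (0.7, 0.7)}
query_a = (0.9, 0.2)  # the question as embedded by model A
query_b = (0.5, 0.6)  # the same question as embedded by model B

print("model A retrieves:", top_k(query_a, space_a))
print("model B retrieves:", top_k(query_b, space_b))
```

Whichever chat model runs next only receives the retrieved text, so two different models given the same store tend to describe the same classes.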

In [19]:
# Llama2-embedding vectorstore on gpt
inferenceQA(chat_model, llama2_vectorstore, questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ])

-> **Question**: What is the class hierarchy? 

**3.05 secs**

**Answer**: The class hierarchy includes the following structures:

- BaseTool
- AmadeusBaseTool
- BaseBrowserTool
- GmailBaseTool
- O365BaseTool
- SlackBaseTool

Additionally, there are other class hierarchies mentioned:

- Embeddings --> <name>Embeddings
- BaseMemory --> BaseChatMemory --> <name>Memory
- BaseChatMessageHistory --> <name>ChatMessageHistory 

-> **Question**: What classes are derived from the Chain class? 

**2.65 secs**

**Answer**: The specific classes derived from the Chain class are not mentioned in the provided context. 

-> **Question**: What kind of retrievers does LangChain have? 

**3.22 secs**

**Answer**: LangChain has the following retrievers:

1. MilvusRetriever
2. WeaviateHybridSearchRetriever
3. KayAiRetriever 



In [20]:
# gpt-embedding vectorstore on Llama2
inferenceQA(cTransformers_llm, gpt_vectorstore, questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ])

-> **Question**: What is the class hierarchy? 

**88.3 secs**

**Answer**:  The class hierarchy for Memory is:
BaseMemory --> BaseChatMemory --> <name>Memory (e.g. ZepMemory, MotorheadMemory)
The class hierarchy for ChatMessageHistory is:
BaseChatMessageHistory --> <name>ChatMessageHistory (e.g. ZepChatMessageHistory)
The main helpers are:
BaseChatMessageHistory
BaseCacheMessageHistory
BaseDocumentTransformer
BaseOutputParser
BasePromptTemplate
BaseRetriever
BaseStore

Note: The answer is based on the provided context, if there is any change or updated answer, please let me know. 

-> **Question**: What classes are derived from the Chain class? 

**236.96 secs**

**Answer**:   The Chain class is derived by many other classes, including:
* APIChain
* OpenAPIEndpointChain
* AnalyzeDocumentChain
* MapReduceDocumentsChain
* MapRerankDocumentsChain
* ReduceDocumentsChain
* RefineDocumentsChain
* StuffDocumentsChain
* ConstitutionalChain
* ConversationChain
* ChatVectorDBChain
* Conversation

## Can we use FAISS to replace Deeplake?

**Conclusion**:
FAISS and Deeplake are interchangeable: with the same GPT embeddings, FAISS returns essentially the same answers in similar time.
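
A possible explanation, sketched under assumptions: if both backends perform exact nearest-neighbour search over the same embeddings, they should retrieve the same chunks, even if one ranks by L2 distance and the other by cosine similarity, because the two orderings coincide for unit-length vectors. Which metric each store actually uses by default is not checked here; the vectors and function names below are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_by_l2(query, vectors):
    """Indices of vectors sorted by ascending Euclidean distance to the query."""
    return sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))


def rank_by_cosine(query, vectors):
    """Indices sorted by descending dot product (equals cosine for unit vectors)."""
    def dot(i):
        return sum(a * b for a, b in zip(query, vectors[i]))
    return sorted(range(len(vectors)), key=dot, reverse=True)


vectors = [normalize(v) for v in [(1, 0, 0), (1, 1, 0), (0, 1, 1), (0, 0, 1)]]
query = normalize((2, 1, 0))

print("L2 ranking:    ", rank_by_l2(query, vectors))
print("cosine ranking:", rank_by_cosine(query, vectors))
```

For unit vectors, squared L2 distance equals 2 minus twice the cosine similarity, so a smaller distance always means a larger similarity.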

In [21]:
from langchain.vectorstores import FAISS
faiss_vectorstore = FAISS.from_documents(split_docs, gpt_embeddings)
inferenceQA(chat_model, faiss_vectorstore, questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ])

-> **Question**: What is the class hierarchy? 

**4.02 secs**

**Answer**: The class hierarchy in the context provided is as follows:

- BaseMemory
  - BaseChatMemory
    - \<name\>Memory (e.g., ZepMemory, MotorheadMemory)

- BaseChatMessageHistory
  - \<name\>ChatMessageHistory (e.g., ZepChatMessageHistory)

Please note that \<name\> represents placeholder names and can be replaced with specific names like "Zep" or "Motorhead". 

-> **Question**: What classes are derived from the Chain class? 

**23.36 secs**

**Answer**: The following classes are derived from the Chain class:

1. APIChain (from langchain.chains.api.base)
2. OpenAPIEndpointChain (from langchain.chains.api.openapi.chain)
3. AnalyzeDocumentChain (from langchain.chains.combine_documents.base)
4. MapReduceDocumentsChain (from langchain.chains.combine_documents.map_reduce)
5. MapRerankDocumentsChain (from langchain.chains.combine_documents.map_rerank)
6. ReduceDocumentsChain (from langchain.chains.combine_documents.reduce)