# Deeplake on GPT and GGUF
https://python.langchain.com/docs/use_cases/question_answering/how_to/code/code-analysis-deeplake

This notebook demonstrates the difference between ChatGPT and Llama2 when answering questions about source code stored in a Deep Lake vector store.

In [1]:
# openai
# !pip3 install openai tiktoken

## Data Preparation

### Load text into Document

In [2]:
from langchain.document_loaders import TextLoader
import os
root_dir = "mydata/langchain-sourcecode/libs"

docs = []
files = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
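        # Skip virtualenv directories; "venv" also matches ".venv" and "myvenv"
        if file.endswith(".py") and "venv" not in dirpath:
            try:
                filepath = os.path.join(dirpath, file)
                loader = TextLoader(filepath, encoding="utf-8")
                files.append(filepath)
                docs.extend(loader.load_and_split())
            except Exception:
                # Ignore files that cannot be read or decoded as UTF-8
                pass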
print(f"load {len(docs)} docs from {len(files)} *.py")


load 3317 docs from 1953 *.py
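
The directory filter above works on the raw path string. As a minimal standard-library sketch of the same idea, the helper below compares directory names one component at a time instead. The names `find_python_files` and `skip_dir_suffix` are hypothetical and not part of the notebook, and the temporary directory exists only to make the example runnable.

```python
import tempfile
from pathlib import Path


def find_python_files(root_dir, skip_dir_suffix="venv"):
    """Return sorted .py files under root_dir, skipping directories whose name ends with skip_dir_suffix."""
    root = Path(root_dir)
    selected = []
    for path in root.rglob("*.py"):
        if not path.is_file():
            continue
        # Look only at the directory names between root and the file itself.
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part.endswith(skip_dir_suffix) for part in parent_parts):
            continue
        selected.append(path)
    return sorted(selected)


# Tiny demo tree: one regular package and one virtualenv-like directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pkg").mkdir()
    (Path(tmp) / "pkg" / "a.py").write_text("x = 1\n", encoding="utf-8")
    (Path(tmp) / ".venv" / "lib").mkdir(parents=True)
    (Path(tmp) / ".venv" / "lib" / "b.py").write_text("y = 2\n", encoding="utf-8")
    found = [p.relative_to(tmp).as_posix() for p in find_python_files(tmp)]

print(found)
```

Matching on path components avoids accidentally skipping a file just because some unrelated part of its full path contains the substring.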


### Split Documents
#### CharacterTextSplitter vs. RecursiveCharacterTextSplitter
##### RecursiveCharacterTextSplitter
> https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
> It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is \["\n\n", "\n", " ", ""\].

##### CharacterTextSplitter
> https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter
> This is the simplest method. It splits on a single separator (by default "\n\n") and measures chunk length by number of characters.
>
> 
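
The difference between the two strategies can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it omits the step that merges small pieces back together up to `chunk_size`, and it ignores `chunk_overlap`. The function names `split_on_separator` and `recursive_split` are hypothetical.

```python
def split_on_separator(text, separator="\n\n"):
    """Single-separator split: pieces can be arbitrarily long if the separator is rare."""
    return [piece for piece in text.split(separator) if piece]


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try separators in order, recursing with the next one on pieces that are still too long."""
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "aaaa bbbb\n\ncccc"
print(split_on_separator(sample))   # the first piece stays 9 characters long
print(recursive_split(sample, 4))   # every piece is at most 4 characters
```

The recursive variant guarantees an upper bound on chunk length, which is why it is the one used with `chunk_size=2048` in the next cell.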

In [3]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=0)
split_docs = text_splitter.split_documents(docs)
print(f"There are {len(split_docs)} documents after split")

There are 5972 documents after split


### Make Vectorstore with deeplake

In [4]:
# Store data locally
# https://python.langchain.com/docs/integrations/vectorstores/activeloop_deeplake#deep-lake-locally
def make_local_vectorstore(split_docs, embeddings, dataset_path="./.my_deeplake/"):
    from langchain.vectorstores import DeepLake
    local_vectorstore = DeepLake.from_documents(
        split_docs, dataset_path=dataset_path, embedding=embeddings, overwrite=True  #, read_only=True
    )
    return local_vectorstore


In [5]:
# Store data on activeloop hub
def make_hub_vectorstore(split_docs, embeddings, hub_name):
    from langchain.vectorstores import DeepLake
    # The key file holds the Activeloop username on line 1 and the API token on line 2
    with open('./mydata/activeloop_key.txt', 'r') as file:
        username = file.readline().strip()
        activeloop_key = file.readline().strip()

    os.environ['ACTIVELOOP_TOKEN'] = activeloop_key

    hub_vectorstore = DeepLake.from_documents(
        split_docs, embeddings, dataset_path=f"hub://{username}/{hub_name}", runtime={"tensor_db": True} #, overwrite=True
    )
    return hub_vectorstore

## Inference Test

In [6]:
# Our inference test function
from timeit import default_timer as timer
from langchain.chains import ConversationalRetrievalChain
def inferenceQA(chat_model, vectorstore):
    qa = ConversationalRetrievalChain.from_llm(chat_model, retriever=vectorstore.as_retriever())
    questions = [
        "What is the class hierarchy?",
        "What classes are derived from the Chain class?",
        "What kind of retrievers does LangChain have?",
    ]
    chat_history = []
    qa_dict = {}
    
    for question in questions:
        print(f"-> **Question**: {question} \n")
        start=timer()
        result = qa({"question": question, "chat_history": chat_history})
        end=timer()
        print(f"**{int((end-start)*100)/100.0} secs**\n")
        print(f"**Answer**: {result['answer']} \n")
        chat_history.append((question, result["answer"]))
        qa_dict[question] = result["answer"]
    return qa_dict
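
The shape of this loop can be tried without an API key or a model. The sketch below is standard-library only; `run_questions` and `echo_chain` are hypothetical names, and `echo_chain` merely stands in for a real chain by accepting the same `{"question", "chat_history"}` dictionary and returning a dictionary with an `"answer"` key.

```python
from timeit import default_timer as timer


def run_questions(chain, questions):
    """Ask each question in turn, feeding earlier (question, answer) pairs back as history."""
    chat_history = []
    answers = {}
    for question in questions:
        start = timer()
        result = chain({"question": question, "chat_history": chat_history})
        elapsed = timer() - start
        print(f"{elapsed:.2f} secs: {question}")
        chat_history.append((question, result["answer"]))
        answers[question] = result["answer"]
    return answers


def echo_chain(inputs):
    # Hypothetical stand-in for a real chain: reports how much history it received.
    return {"answer": f"{len(inputs['chat_history'])} earlier turns"}


answers = run_questions(echo_chain, ["first?", "second?"])
print(answers)
```

Because the history list grows after every turn, later questions are answered with the earlier exchanges as context, which also makes later prompts longer.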

## OpenAI

In [7]:
openai_api_key=""
with open('./mydata/openai_api_key.txt', 'r') as file:
    openai_api_key = file.read().strip()

In [8]:
# ======= OpenAI Embeddings ==========
from langchain.embeddings.openai import OpenAIEmbeddings

gpt_embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
gpt_embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='xx-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False)

In [9]:
# gpt_vectorstore = make_hub_vectorstore(split_docs, gpt_embeddings,hub_name="my_gpt_deeplake")
gpt_vectorstore = make_local_vectorstore(split_docs, gpt_embeddings, dataset_path="./.my_gpt_deeplake/")



creating embeddings: 100% 916/916 [11:05<00:00,  1.38it/s]
100% 5972/5972 [00:04<00:00, 1267.71it/s]


Dataset(path='./.my_gpt_deeplake/', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
   text       text      (5972, 1)      str     None   
 metadata     json      (5972, 1)      str     None   
 embedding  embedding  (5972, 1536)  float32   None   
    id        text      (5972, 1)      str     None   




In [10]:
# ======= OpenAI Model==========
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

chat_model = ChatOpenAI(openai_api_key=openai_api_key,model_name="gpt-3.5-turbo-0613")

In [11]:
inferenceQA(chat_model, gpt_vectorstore)

-> **Question**: What is the class hierarchy? 

**2.61 secs**

**Answer**: The class hierarchy for the Memory module is as follows:

BaseMemory --> BaseChatMemory --> <name>Memory

The class hierarchy for the ChatMessageHistory module is as follows:

BaseChatMessageHistory --> <name>ChatMessageHistory

The class hierarchy for the Document Transformers module is as follows:

BaseDocumentTransformer --> <name> 

-> **Question**: What classes are derived from the Chain class? 

**7.35 secs**

**Answer**: The classes derived from the Chain class are:

- APIChain
- OpenAPIEndpointChain
- AnalyzeDocumentChain
- MapReduceDocumentsChain
- MapRerankDocumentsChain
- ReduceDocumentsChain
- RefineDocumentsChain
- StuffDocumentsChain
- ConstitutionalChain
- ConversationChain
- ChatVectorDBChain
- ConversationalRetrievalChain
- FlareChain
- ArangoGraphQAChain
- GraphQAChain
- GraphCypherQAChain
- FalkorDBQAChain
- HugeGraphQAChain
- KuzuQAChain
- NebulaGraphQAChain
- NeptuneOpenCypherQAChain
- Graph

## Llama2 (GGUF on CTransformers)

In [12]:
# ======== GGUF =========
from langchain.llms import CTransformers
import os

model_id=os.path.abspath('./models/Llama-2-7b-Chat-GGUF')

# context_length (in tokens) must be large enough to hold the retrieved chunks (text_splitter uses chunk_size=2048 characters).
# If the context length is too short, the output will be poor.
config = {'max_new_tokens': 2048, 'repetition_penalty': 1.05,'context_length':4096}
# https://api.python.langchain.com/en/latest/llms/langchain.llms.ctransformers.CTransformers.html
cTransformers_llm = CTransformers(model=model_id, model_file="llama-2-7b-chat.Q4_K_M.gguf", config=config)
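
A rough budget check can make the relationship between `context_length`, `max_new_tokens` and the splitter's `chunk_size` concrete. The sketch below rests on assumptions that are not taken from this notebook: about 4 characters per token (a coarse heuristic that varies by tokenizer and is often worse for source code), 4 retrieved chunks per question, a 50-token question, and a context window that has to hold both the prompt and the generated tokens. The function name `fits_in_context` is hypothetical.

```python
def fits_in_context(context_length, max_new_tokens, chunk_chars,
                    retrieved_chunks=4, chars_per_token=4.0, question_tokens=50):
    """Estimate prompt size in tokens and whether prompt plus generation fits the window."""
    # All defaults are illustrative assumptions, not measured values.
    prompt_tokens = retrieved_chunks * chunk_chars / chars_per_token + question_tokens
    return prompt_tokens + max_new_tokens <= context_length, prompt_tokens


fits, prompt_tokens = fits_in_context(
    context_length=4096, max_new_tokens=2048, chunk_chars=2048
)
print(fits, prompt_tokens)
```

Under these assumptions the budget is tight, so lowering `max_new_tokens`, retrieving fewer chunks, or using a smaller `chunk_size` are the knobs to experiment with. A real check would count tokens with the model's own tokenizer.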

In [13]:
# embedding data
from mylib.MyModelUtils import MyModelUtils
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
llama2_embeddings=HuggingFaceEmbeddings(
    model_name=os.path.abspath("./models/sentence-transformers/all-mpnet-base-v2"), 
    model_kwargs={"device": MyModelUtils.device()}
)

In [14]:
# llama2_vectorstore = make_hub_vectorstore(split_docs, llama2_embeddings,hub_name="my_llama2_deeplake")
llama2_vectorstore = make_local_vectorstore(split_docs, llama2_embeddings, dataset_path="./.my_llama2_deeplake/")



creating embeddings: 100% 916/916 [03:30<00:00,  4.35it/s]
100% 5972/5972 [00:04<00:00, 1264.07it/s]


Dataset(path='./.my_llama2_deeplake/', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (5972, 1)     str     None   
 metadata     json      (5972, 1)     str     None   
 embedding  embedding  (5972, 768)  float32   None   
    id        text      (5972, 1)     str     None   
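
The two datasets store vectors of different widths (1536 for the OpenAI embeddings, 768 for all-mpnet-base-v2), so a query has to be embedded with the same model that built the store it is searched against. The toy retrieval below illustrates this in plain Python. Cosine similarity is used purely as an illustrative metric, and the three-dimensional vectors are made up; nothing here describes how Deep Lake scores results internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up 3-dimensional "embeddings" for three code snippets.
store = [
    ("class Chain", [1.0, 0.0, 0.0]),
    ("class BaseTool", [0.0, 1.0, 0.0]),
    ("class LLMChain(Chain)", [0.9, 0.1, 0.0]),
]
hits = top_k([1.0, 0.0, 0.0], store, k=2)
print(hits)
```

Retrieval depends only on the embedding model attached to the store, which is why the chat model and the vector store can be mixed freely in the cells that follow.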




In [15]:
# What happens if we pair the Llama2 model with the vectorstore built from OpenAI embeddings?
inferenceQA(cTransformers_llm, gpt_vectorstore)

-> **Question**: What is the class hierarchy? 

**212.45 secs**

**Answer**:  The class hierarchy is: BaseDocumentTransformer --> <name> --> Document.
Explanation: The class hierarchy shows the relationships between different classes in the langchain.document_transformers module. The topmost class is "BaseDocumentTransformer", which is a base class for all document transformers. Below it are various subclasses, each representing a different type of document transformer (e.g., BeautifulSoupTransformer, DoctranQATransformer, etc.). Finally, at the bottom of the hierarchy is the "Document" class, which represents a single document that can be transformed by one of the above classes. 

-> **Question**: What classes are derived from the Chain class? 

**283.88 secs**

**Answer**:   The Chain class is the base class for many other classes in the LangChain project, including APIChain, OpenAPIEndpointChain, AnalyzeDocumentChain, MapReduceDocumentsChain, MapRerankDocumentsChain, ReduceDocuments

In [16]:
inferenceQA(cTransformers_llm, llama2_vectorstore)

-> **Question**: What is the class hierarchy? 

**226.02 secs**

**Answer**:  The class hierarchy for tools in the given code snippet is as follows:

* `ToolMetaclass`: The base metaclass for all tools.
* `BaseTool`: The base class for all tools.
* `<name>Tool`: A tool with a specific name (e.g., `AINetworkTool`, `AIOwnerTool`, etc.).

In this hierarchy, each tool inherits from the previous class in the chain, with the top-most class being `ToolMetaclass`. This allows for a standardized and predictable class structure across all tools. 

-> **Question**: What classes are derived from the Chain class? 

**199.49 secs**

**Answer**: 
The derived classes are listed below along with their parent class, `Chain`, based on the provided code snippet:

* `LLMChain` (Chain)
* `MapReduceChain` (Chain)
* `RouterChain` (Chain)

Therefore, the answer is `LLMChain`, `MapReduceChain`, and `RouterChain`. 

-> **Question**: What kind of retrievers does LangChain have? 

**132.16 secs**

**Answer**:  Acc