The idea: ask an LLM questions about a codebase, using embeddings. Feeding the entire codebase to the LLM would exceed its maximum context length, so embeddings are used to select the pieces of text that are most relevant to the input question. A single embedding represents one chunk of the codebase (e.g. 1000 tokens of it), turned into a vector using OpenAI's embeddings API.

The pipeline then goes as follows:
- ask a question
- match the question to the most similar embeddings (e.g. top 5) 
- add the pieces of text corresponding to the matched embeddings to the LLM input, along with the question itself
- run the LLM

Note that a lot hinges on the retrieval step. If the answer to a question is not contained in the chunks of text behind the matched embeddings, the LLM has no way to give a correct answer based on its context.
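
To make the pipeline above concrete, here is a minimal sketch of the matching and prompt-building steps in plain Python. The 3-dimensional vectors, the function names (`top_k_chunks`, `build_prompt`) and the prompt wording are made up for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions, and LangChain uses its own prompt templates.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(question_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the question."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(question_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

def build_prompt(question, context_chunks):
    """Put the retrieved chunks and the question into one LLM input (hypothetical wording)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy data: three chunks with made-up 3-dimensional "embeddings".
chunks = ["class Tensor: ...", "class Function: ...", "def helper(): ..."]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
question_vec = [0.9, 0.1, 0.0]

selected = top_k_chunks(question_vec, chunk_vecs, chunks, k=2)
prompt = build_prompt("What does the Tensor class represent?", selected)
print(prompt)
```

Anything that does not make it into `selected` never reaches the LLM, which is why the choice of `k` and the quality of the chunks matter so much.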

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(verbose=True);

## Tinygrad codebase

I took all the Python files from https://github.com/geohot/tinygrad and put them in `data/tinygrad`.

In [48]:
from langchain.document_loaders import DirectoryLoader

tinygrad_path = "data/tinygrad"

# Collect all Python files in the tinygrad repo
loader = DirectoryLoader(tinygrad_path, glob="**/*.py")
documents = loader.load()

In [49]:
print(f"{len(documents)} documents loaded.")
for doc in documents:
    print(doc.metadata["source"])

23 documents loaded.
data/tinygrad/image.py
data/tinygrad/jit.py
data/tinygrad/ops_clang.py
data/tinygrad/tensor.py
data/tinygrad/ops.py
data/tinygrad/shapetracker.py
data/tinygrad/mlops.py
data/tinygrad/ops_cpu.py
data/tinygrad/helpers.py
data/tinygrad/symbolic.py
data/tinygrad/cstyle.py
data/tinygrad/llvmir.py
data/tinygrad/ops_gpu.py
data/tinygrad/__init__.py
data/tinygrad/graph.py
data/tinygrad/lazy.py
data/tinygrad/lib.py
data/tinygrad/ops_llvm.py
data/tinygrad/ops_cuda.py
data/tinygrad/linearizer.py
data/tinygrad/ops_metal.py
data/tinygrad/ops_torch.py
data/tinygrad/optim.py


In [80]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 1459, which is longer than the specified 500
Created a chunk of size 685, which is longer than the specified 500
Created a chunk of size 609, which is longer than the specified 500
Created a chunk of size 1005, which is longer than the specified 500
Created a chunk of size 1959, which is longer than the specified 500
Created a chunk of size 517, which is longer than the specified 500
Created a chunk of size 540, which is longer than the specified 500
Created a chunk of size 587, which is longer than the specified 500
Created a chunk of size 2511, which is longer than the specified 500
Created a chunk of size 1064, which is longer than the specified 500
Created a chunk of size 518, which is longer than the specified 500
Created a chunk of size 1056, which is longer than the specified 500
Created a chunk of size 924, which is longer than the specified 500
Created a chunk of size 604, which is longer than the specified 500
Created a chunk of size 655, which is long
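
The warnings above show chunks that are longer than the configured limit. `CharacterTextSplitter` counts characters (not tokens) by default and only splits on a separator (`"\n\n"` by default), so a stretch of code without a blank line cannot be cut any smaller. The following is a simplified sketch of that behaviour, not LangChain's actual implementation; `split_on_separator` is a hypothetical helper.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size
    characters. A single piece longer than chunk_size is emitted whole."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

# Two short blocks followed by one long block with no blank line in it.
text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
chunks = split_on_separator(text, chunk_size=100)
sizes = [len(c) for c in chunks]
print(sizes)
```

The last chunk is oversized because it contains no separator. For source code, a splitter that can fall back to finer separators (single newlines, spaces) may keep chunk sizes closer to the target.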

In [81]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None)

In [82]:
from langchain.vectorstores import Chroma

docsearch = Chroma.from_documents(docs, embeddings)

In [86]:
retriever = docsearch.as_retriever()
# Settings from https://python.langchain.com/en/latest/use_cases/code/code-analysis-deeplake.html
# (written for Deep Lake; depending on the LangChain version, Chroma may not honor every key)
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 20
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 20
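
The `maximal_marginal_relevance` setting refers to a re-ranking technique that trades relevance against redundancy, so that the selected chunks are not all near-duplicates of each other. Below is a minimal pure-Python sketch of the idea; the function name `mmr`, the `lambda_mult` default and the toy vectors are assumptions for illustration, not LangChain's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Pick k candidate indices, balancing similarity to the query against
    similarity to candidates that were already picked."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.10],   # index 0: very relevant
    [1.0, 0.12],   # index 1: near-duplicate of index 0
    [0.8, -0.60],  # index 2: less relevant, but different
]

picked = mmr(query, candidates, k=2)
print(picked)
```

Plain similarity search would return indices 0 and 1 here; MMR swaps the near-duplicate for index 2. In this setup `fetch_k` is the size of the candidate pool and `k` the number of chunks finally kept, so setting both to 20 leaves no room for re-ranking to drop anything.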

In [87]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# query = "What does the Tensor class represent?"
# query = "What happens when you call `detach()` on a Tensor?"
# query = "How does the Tensor.stack() method work?"
# query = "What classes inherit from the Function class?"

model = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)  # chat models only, e.g. 'gpt-3.5-turbo' or 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
# qa({"question": query, "chat_history": []})

# qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever )
# qa.run(query)

# docs = docsearch.similarity_search(query)
# docs = docsearch.similarity_search_with_score(query)

In [88]:
questions = [
    # "What does the Tensor class represent?",
    # "What happens when you call `detach()` on a Tensor?",
    # "How does the Tensor.stack() method work?",
    "What classes are derived from the Function class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
]
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")



-> **Question**: What classes are derived from the Function class? 

**Answer**: The Sin and Relu classes are derived from the Function class. 
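
Since the LLM only sees the retrieved chunks, an answer like this can be incomplete. A question about class inheritance is one that can also be checked directly against the source. The following is a sketch using Python's `ast` module; `direct_subclasses` is a hypothetical helper and the sample source is made up, so to check the real codebase you would run it over the files in `data/tinygrad`.

```python
import ast

def direct_subclasses(source, base_name):
    """Return names of classes in `source` that list `base_name` as a direct base."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        for base in node.bases:
            # Handles both `class A(Function)` and `class A(module.Function)`.
            if isinstance(base, ast.Name):
                name = base.id
            elif isinstance(base, ast.Attribute):
                name = base.attr
            else:
                name = None
            if name == base_name:
                found.append(node.name)
    return found

# Made-up sample source for illustration.
sample = '''
class Function: pass
class Sin(Function): pass
class Relu(Function): pass
class Helper: pass
'''

names = direct_subclasses(sample, "Function")
print(sorted(names))
```

Comparing such a list with the LLM's answer gives a quick sense of how much the retrieval step missed.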

