The idea: ask an LLM questions about a codebase, using embeddings. Feeding the entire codebase to the LLM would exceed its maximum context length, so embeddings are used to select the pieces of text that are most relevant to the input prompt/question. A single embedding represents one chunk of the codebase (e.g. roughly 1000 characters of it), turned into a vector representation using OpenAI's embeddings API.

The pipeline then goes as follows:
- ask a question
- embed the question and match it against the chunk embeddings, keeping the most similar ones (e.g. top 5)
- add the pieces of text corresponding to the matched embeddings to the LLM input, along with the question itself
- run the LLM on that combined input

Note that a lot hinges on the matched embeddings. If the answer to a question is not contained in the text chunks behind the matched embeddings, the LLM has no way to give a correct answer based on its context.
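
To make the retrieval step concrete, here is a minimal sketch of the pipeline that runs without any API key. It is a toy: `embed` is a bag-of-words counter standing in for the OpenAI embeddings API, and the three chunks are made up for illustration, not taken from tinygrad. The function names (`embed`, `cosine`, `retrieve`, `build_prompt`) are my own and not part of LangChain.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embeddings API: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks, k=2):
    """Put the retrieved chunks in front of the question, as the LLM input."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical chunks, for illustration only
chunks = [
    "class Tensor: def detach(self): return a tensor without gradient",
    "def train(model, optimizer): run one optimization step",
    "class SGD: def step(self): update parameters with the gradient",
]
question = "what does detach do on a tensor"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, chunks, k=1)
print(prompt)
```

Only the chunks returned by `retrieve` ever reach the prompt, which is why a retrieval miss cannot be repaired by the LLM afterwards.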

In [1]:
%reload_ext autoreload
%autoreload 2

In [4]:
import os
from pathlib import Path

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv());

True

## Tinygrad codebase

I took all the Python files from https://github.com/geohot/tinygrad and put them in `data/tinygrad`.

In [5]:
import textwrap

def printw(text):
    print(textwrap.fill(text, width=100))

In [6]:
from langchain.document_loaders import DirectoryLoader

tinygrad_path = "data/tinygrad"

# Collect all Python files in the tinygrad repo
loader = DirectoryLoader(tinygrad_path, glob="**/*.py")
documents = loader.load()

In [7]:
print(f"{len(documents)} documents loaded.")
for doc in documents:
    print(doc.metadata["source"])

23 documents loaded.
data/tinygrad/image.py
data/tinygrad/jit.py
data/tinygrad/ops_clang.py
data/tinygrad/tensor.py
data/tinygrad/ops.py
data/tinygrad/shapetracker.py
data/tinygrad/mlops.py
data/tinygrad/ops_cpu.py
data/tinygrad/helpers.py
data/tinygrad/symbolic.py
data/tinygrad/cstyle.py
data/tinygrad/llvmir.py
data/tinygrad/ops_gpu.py
data/tinygrad/__init__.py
data/tinygrad/graph.py
data/tinygrad/lazy.py
data/tinygrad/lib.py
data/tinygrad/ops_llvm.py
data/tinygrad/ops_cuda.py
data/tinygrad/linearizer.py
data/tinygrad/ops_metal.py
data/tinygrad/ops_torch.py
data/tinygrad/optim.py


In [8]:
from langchain.text_splitter import CharacterTextSplitter

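# chunk_size is measured in characters; by default the splitter only breaks on blank lines,
# so a block without blank lines can end up longer than chunk_size (see the warnings below)
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)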
docs = text_splitter.split_documents(documents)

Created a chunk of size 1459, which is longer than the specified 1000
Created a chunk of size 1005, which is longer than the specified 1000
Created a chunk of size 1959, which is longer than the specified 1000
Created a chunk of size 2511, which is longer than the specified 1000
Created a chunk of size 1064, which is longer than the specified 1000
Created a chunk of size 1056, which is longer than the specified 1000
Created a chunk of size 1247, which is longer than the specified 1000
Created a chunk of size 1169, which is longer than the specified 1000
Created a chunk of size 1199, which is longer than the specified 1000
Created a chunk of size 1407, which is longer than the specified 1000
Created a chunk of size 1124, which is longer than the specified 1000
Created a chunk of size 1763, which is longer than the specified 1000
Created a chunk of size 1632, which is longer than the specified 1000
Created a chunk of size 5008, which is longer than the specified 1000
Created a chunk of s
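
The warnings appear because `CharacterTextSplitter` splits on a single separator (a blank line by default), so any block of code without blank lines stays in one piece, however long it is. One way around this is to fall back to finer separators when a piece is still too long. LangChain's `RecursiveCharacterTextSplitter` is built for that purpose; the sketch below is not its implementation, just a small standalone illustration of the idea, with `split_text` as a hypothetical helper name.

```python
def split_text(text, chunk_size=1000, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer ones
    (and finally to a hard cut) for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep merging small pieces
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= chunk_size:
                current = piece
            else:
                # Piece is too long on its own: retry with finer separators
                chunks.extend(split_text(piece, chunk_size, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator left: hard cut every chunk_size characters
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A short function followed by a long block without blank lines
source = "def a():\n    return 1\n\n" + "x = 1\n" * 400
pieces = split_text(source, chunk_size=100)
print(len(pieces), max(len(p) for p in pieces))
```

Smaller chunks are not automatically better: cutting a function in half can separate a signature from its body, so the chunk size is a trade-off between staying under the limit and keeping related code together.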

In [9]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None)

In [21]:
from langchain.vectorstores import Chroma

docsearch = Chroma.from_documents(docs, embeddings).as_retriever()

### Specify system prompt

In [45]:
from langchain.prompts import PromptTemplate

delimiter = "####"  # this is 1 token

In [46]:
# System prompt 1: question answering in JSON format
prompt_template = """You will receive code questions based on the code snippets below. \
Your goal is to give an answer which is as accurate as possible. \
If the question is of the form '<object> info', where <object> is a class or function, \
answer in JSON format, using the following fields: object, parent, description, parameters, \
return value, and examples. \

{context}

Question: {question}
Answer:"""
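
Since this prompt asks for JSON, the answer can be checked programmatically. Below is a minimal sketch, assuming the model may wrap the JSON object in extra text. `parse_json_answer` is a hypothetical helper of mine, and the sample answer is made up for illustration; it is not real model output.

```python
import json

# Field names requested in the system prompt above
EXPECTED_FIELDS = ["object", "parent", "description", "parameters", "return value", "examples"]

def parse_json_answer(answer):
    """Return (data, missing_fields); data is None if no JSON object can be parsed."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        return None, list(EXPECTED_FIELDS)
    try:
        data = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return None, list(EXPECTED_FIELDS)
    missing = [field for field in EXPECTED_FIELDS if field not in data]
    return data, missing

# Hypothetical answer text, for illustration only
sample = 'Here is the answer:\n{"object": "Tensor", "parent": null, "description": "hypothetical"}'
data, missing = parse_json_answer(sample)
print(data["object"], missing)
```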

In [39]:
# System prompt 2: chain of thought reasoning
prompt_template_cot = f"""You will receive code questions based on the \
code snippets below. Follow these steps to answer the question. \
The question will be delimited with four hashtags, i.e. {delimiter}. 

Step 1:{delimiter} First decide whether the user is asking about a class, function, \
or variable.

Step 2:{delimiter} If the user is asking about a class or function, \
answer in JSON format, using the following fields: object, parent, \
description, parameters.

Step 3:{delimiter} If the user is asking about a variable, \
answer what the variable is used for.

Step 4:{delimiter}: If the user made any assumptions, \
figure out whether the assumption is true based on the code snippets.

Step 5:{delimiter}: First, politely correct the \
user's incorrect assumptions if applicable. \
Answer the user in a friendly tone.

Use the following format:
Step 1:{delimiter} <step 1 reasoning>
Step 2:{delimiter} <step 2 reasoning>
Step 3:{delimiter} <step 3 reasoning>
Step 4:{delimiter} <step 4 reasoning>
Response to user:{delimiter} <response to user>

Make sure to include {delimiter} to separate every step.
""" + """\

{context}

Question: {question}
"""

In [None]:
# Select the prompt template to use
# PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
PROMPT = PromptTemplate(template=prompt_template_cot, input_variables=["context", "question"])

In [50]:
print(f"Prompt template is {len(PROMPT.template)} characters long.\n")
print("=== Prompt template ===\n")
print(PROMPT.template)

Prompt template is 1031 characters long.

=== Prompt template ===

You will receive code questions based on the code snippets below. Follow these steps to answer the question. The question will be delimited with four hashtags, i.e. ####. 

Step 1:#### First decide whether the user is asking about a class, function, or variable.

Step 2:#### If the user is asking about a class or function, answer in JSON format, using the following fields: object, parent, description, parameters.

Step 3:#### If the user is asking about a variable, answer what the variable is used for.

Step 4:####: If the user made any assumptions, figure out whether the assumption is true based on the code snippets.

Step 5:####: First, politely correct the user's incorrect assumptions if applicable. Answer the user in a friendly tone.

Use the following format:
Step 1:#### <step 1 reasoning>
Step 2:#### <step 2 reasoning>
Step 3:#### <step 3 reasoning>
Step 4:#### <step 4 reasoning>
Response to user:#### <response to user>

M

### Load model and ask questions

In [41]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain

# Other OpenAI chat models (e.g. 'gpt-4') can be swapped in via model_name
model = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
# The default "stuff" chain inserts all retrieved chunks into the prompt's {context} slot
qa = load_qa_chain(model, prompt=PROMPT)

In [44]:
questions = [
    # "Tensor info",
    # "What does Function class represent, and why does it inherit from the Tensor class?",
    "What does shape_fxn_for_op represent?",
    # "What does the Tensor class represent?",
    # "What happens when you call `detach()` on a Tensor?",
    # "How does the Tensor.stack() method work?",
    # "What classes are derived from the Function class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    # "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
] 

for question in questions:
    # Retrieve the chunks most similar to the question, then pass them to the LLM as context
    docs_ = docsearch.get_relevant_documents(question)
    result = qa({"input_documents": docs_, "question": question})
    print(f"-> **Question**: {question} \n")
    print(result["output_text"])

-> **Question**: What does shape_fxn_for_op represent? 

Step 1:#### The code snippet represents a dictionary.
Step 2:#### The dictionary maps operations (represented by the Op enum) to functions that perform those operations on tensors.
Step 3:#### N/A
Step 4:#### N/A
Response to user:#### The code snippet represents a dictionary that maps operations to functions that perform those operations on tensors.
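
Because every step is marked with the delimiter, the final response can be separated from the intermediate reasoning with plain string handling. A minimal sketch, using the output above as input; `parse_steps` is a hypothetical helper of mine and it assumes each step fits on one line, so multi-line reasoning would need a more careful parser.

```python
DELIMITER = "####"

def parse_steps(output, delimiter=DELIMITER):
    """Map each 'label:<delimiter> text' line to {label: text}."""
    steps = {}
    for line in output.splitlines():
        label, sep, text = line.partition(delimiter)
        if sep:  # only keep lines that actually contain the delimiter
            steps[label.strip().rstrip(":")] = text.strip().lstrip(":").strip()
    return steps

output = """Step 1:#### The code snippet represents a dictionary.
Step 2:#### The dictionary maps operations (represented by the Op enum) to functions that perform those operations on tensors.
Step 3:#### N/A
Step 4:#### N/A
Response to user:#### The code snippet represents a dictionary that maps operations to functions that perform those operations on tensors."""

steps = parse_steps(output)
print(steps["Response to user"])
```

Only `steps["Response to user"]` would be shown to the user; the other entries are useful for debugging how the model arrived at its answer.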
