# Local Code Repo Context for Code Generation with LangChain

This example shows how to read in an entire code base, generate embeddings, store them in a persistent local database, and query that database directly for relevant pieces of text. It then uses the database as a retrieval source for an LLM hooked into LangChain, creating a code generator that is aware of the repo's style and general context.

This was a work-in-progress experiment that I abandoned in order to try building a framework for generating complex, multi-file outputs. If something seems strange to you, especially in the prompts, follow your intuition and don't blindly trust the example.

## Setup

First, add a GitHub repo to the data directory. If the repo's name is `my-user/my-repo-name`, then the data directory structure should look like this:

```
.
├── repositories
│   └── my-user
│       └── my-repo-name
```


Second, set the `github_repo` variable to match the repo name in the `Configure variables` section below.
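
Before running the rest of the notebook, it can help to confirm that the clone sits where the code expects it. The following is a minimal sketch, assuming the layout shown above; `expected_repo_dir` and `check_repo_dir` are hypothetical helpers and are not part of this repo's `helpers` module.

```python
from pathlib import Path


def expected_repo_dir(data_dir: str, github_repo: str) -> Path:
    """Build the path where the notebook expects the cloned repo to live."""
    # github_repo has the form "user/repo", which maps to two nested directories
    return Path(data_dir) / "repositories" / github_repo


def check_repo_dir(data_dir: str, github_repo: str) -> Path:
    """Return the repo path, or raise if the clone is missing."""
    repo_path = expected_repo_dir(data_dir, github_repo)
    if not repo_path.is_dir():
        raise FileNotFoundError(
            f"Expected a clone of {github_repo} at {repo_path}"
        )
    return repo_path
```

Calling `check_repo_dir(data_dir, "my-user/my-repo-name")` either returns the path or fails early with a readable error, rather than the notebook silently finding zero files later on.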

## ChromaDB Persistence

Each notebook that uses ChromaDB follows the same pattern for persistence.

If the directory that ChromaDB writes its data to already exists, the existing database is loaded. If the directory does not exist, a new database is created.

If you change parameters that affect the embeddings generation (like swapping in a new PDF file), you'll need to delete the database directory to force a new database to be created.

To do this, run the delete command from the root of the repository. For example, if the ChromaDB directory is `data/chromadb/local_repository`:

```sh
rm -rf data/chromadb/local_repository
```

or if you run into permissions issues:

```sh
sudo rm -rf data/chromadb/local_repository
```
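
The same reset can be done from Python, for example inside a notebook cell. This is a minimal sketch; `reset_chroma_dir` is a hypothetical helper name, and the path you pass in is assumed to be the ChromaDB persistence directory described above.

```python
import shutil
from pathlib import Path


def reset_chroma_dir(chroma_dir: str) -> bool:
    """Delete the persisted database directory if it exists.

    Returns True if a directory was deleted, False if there was nothing to delete.
    """
    path = Path(chroma_dir)
    if path.is_dir():
        # removes the directory and everything inside it
        shutil.rmtree(path)
        return True
    return False
```

After a reset, the next run takes the "directory does not exist" branch and rebuilds the embeddings from scratch.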

## Configure variables

In [1]:
import os

# ****************** [START] Google Cloud project settings ****************** #
project = os.getenv('GCP_PROJECT')
location = os.environ.get('GCP_REGION', 'us-central1')
# ******************* [END] Google Cloud project settings ******************* #


# *********************** [START] Embeddings config ************************* #
# set rate limiting options for Vertex AI embeddings
embeddings_requests_per_minute = 100
embeddings_num_instances_per_batch = 5
# *********************** [END] Embeddings config *************************** #


# ********************** [START] data directory config ********************** #
# local directory to write chroma db persistence to
# or pull files like PDFs from to create embeddings
from helpers.files import get_data_dir
data_dir = get_data_dir()

chroma_db_dir = f'{data_dir}/chromadb'
chroma_db_local_repo_dir = f'{chroma_db_dir}/local_repository'
# *********************** [END] data directory config *********************** #


# ********************** [START] LLM data config **************************** #
from helpers.files import file_exists

collection_name = 'local-repository'
load_documents = True
if file_exists(chroma_db_local_repo_dir):
    load_documents = False
# *********************** [END] LLM data config ***************************** #


# ********************** [START] repository config ************************** #
github_repo = "imup-io/front-end"
repo_dir = f"{data_dir}/repositories/{github_repo}"
# *********************** [END] repository config **************************** #


# *********************** [START] RAG parameter config ********************** #
# experiment with:
# - mmr
# - similarity
db_search_type = "similarity"
db_search_kwargs = {"k": 5}
# *********************** [END] RAG parameter config ************************ #


# *********************** [START] LLM parameter config ********************** #
# Vertex AI model to use for the LLM
model_name='code-bison@latest'
# model_name='code-bison-32k@latest'

# maximum number of model responses generated per prompt
candidate_count = 1

# determines the maximum amount of text output from one prompt.
# a token is approximately four characters.
max_output_tokens = 2048

# temperature controls the degree of randomness in token selection.
# lower temperatures are good for prompts that expect a true or
# correct response, while higher temperatures can lead to more
# diverse or unexpected results. With a temperature of 0 the highest
# probability token is always selected. for most use cases, try
# starting with a temperature of 0.2.
temperature = 0.1

# how verbose the llm and langchain agent is when thinking
# through a prompt. you're going to want this set to True
# for development so you can debug its thought process
verbose = True
# *********************** [END] LLM parameter config ************************ #


# ********************** [START] Configuration Checks *********************** #
if not project:
    raise Exception('GCP_PROJECT environment variable not set')
# *********************** [END] Configuration Checks ************************ #

## Import and Initialize Vertex AI Client

This will complain about not finding CUDA drivers and the GPU not being used. You can safely ignore that. If you want to use the GPU, that's possible on Linux with Docker, but on macOS you'll need to set up a non-containerized development environment.

In [2]:
from google.cloud import aiplatform
import vertexai

vertexai.init(project=project, location=location)

print(f"Vertex AI SDK version: {aiplatform.__version__}")


2023-11-19 05:43:47.649788: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-19 05:43:47.651400: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-19 05:43:47.669843: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-19 05:43:47.669874: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-19 05:43:47.669894: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to regi

Vertex AI SDK version: 1.36.0


## Import LangChain

This doesn't actually initialize anything; it just lets us print the version.

In [3]:
import langchain

print(f"LangChain version: {langchain.__version__}")


LangChain version: 0.0.330


## Configure LLM with Vertex AI

In [4]:
from langchain.llms import VertexAI

llm = VertexAI(
    model_name=model_name,
    max_output_tokens=max_output_tokens,
    temperature=temperature,
    # top_p=top_p,
    # top_k=top_k,
    verbose=verbose,
)

## Initialize Embeddings Function with Vertex AI

There are other options for creating embeddings. I was interested in sticking with Google products here.

In [None]:
from langchain.embeddings import VertexAIEmbeddings

# https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.vertexai.VertexAIEmbeddings.html
embeddings = VertexAIEmbeddings(
    requests_per_minute=embeddings_requests_per_minute,
    num_instances_per_batch=embeddings_num_instances_per_batch,
    model_name="textembedding-gecko@latest",
)

## Configure Retrieval Augmented Generation Prompt Template

In [6]:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def create_grounding_statement(grounding_context, github_repo, coding_language):
    grounding_statement = f"""
{grounding_context}
Respond with syntactically correct and concise code for the question below. Make sure you follow these rules:
1. Always assume a question is referring to the {github_repo} GitHub repository. Use context from the repository's code to understand the interfaces to components as well as how to use them.
2. The code should follow the same style as the code in the {github_repo} repository.
3. Ensure all the requirements in the question are met. If it is impossible to meet a requirement, respond with "Impossible to meet requirement: [requirement]".
4. The code should be written in {coding_language}.
5. If a .prettierrc file is present in the repository, respect its rules when formatting the code.
"""
    return grounding_statement

def create_prompt_template(grounding_statement):
    prompt_rag_statement = f"""
{grounding_statement}

Question:
{{question}}

Context:
{{context}}

Helpful Response:
"""

    prompt_rag_template = PromptTemplate(
        template=prompt_rag_statement,
        input_variables=["context", "question"],
    )

    return prompt_rag_template

def create_retrieval_qa_chain(llm, prompt_template, retriever):
    qa_chain = RetrievalQA.from_llm(
        llm=llm,
        prompt=prompt_template,
        retriever=retriever,
        return_source_documents=True,
    )

    return qa_chain
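
The doubled braces in `create_prompt_template` matter: the f-string is evaluated immediately, while `{question}` and `{context}` have to survive as placeholders for the chain to fill in later. The sketch below illustrates the two stages with plain `str.format` standing in for what `PromptTemplate` does; the `grounding` text and the example values are made up for illustration.

```python
grounding = "You are a TypeScript developer."  # stand-in for grounding_statement

# Stage 1: the f-string fills in {grounding} right away. {{question}} and
# {{context}} are escaped, so they come out as literal {question} and {context}.
template = f"""
{grounding}

Question:
{{question}}

Context:
{{context}}
"""

# Stage 2: the remaining placeholders are filled in later. In the notebook the
# chain does this through PromptTemplate; str.format is only a stand-in here.
prompt = template.format(
    question="What does X do?",
    context="const X = 1",
)
```

One consequence is that any literal braces in the grounding text itself would end up in the template and be treated as placeholders, so the grounding statement is best kept free of `{` and `}`.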

## Read in local directory of code

In [7]:
import glob

extensions = ['*.ts', '*.tsx']

repo_code_files = []
for extension in extensions:
    search_string=f"{repo_dir}/**/{extension}"
    print(f"Searching for files in {search_string}")
    repo_code_files.extend(glob.glob(search_string, recursive=True))

print(len(repo_code_files))

Searching for files in /home/steven/work/data/repos-for-search/imup-io/front-end/**/*.ts
Searching for files in /home/steven/work/data/repos-for-search/imup-io/front-end/**/*.tsx
844
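
To see what the `**` pattern with `recursive=True` actually matches, here is a small self-contained sketch that runs the same search against a throwaway directory. The file names are invented for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # build a tiny fake repo: one file at the top level, the rest nested
    os.makedirs(os.path.join(root, "src", "components"))
    for rel in ("index.ts", "src/App.tsx", "src/components/Logo.tsx", "src/style.css"):
        open(os.path.join(root, *rel.split("/")), "w").close()

    matches = []
    for pattern in ("*.ts", "*.tsx"):
        # "**" matches zero or more directories, so top-level files are included
        matches.extend(glob.glob(f"{root}/**/{pattern}", recursive=True))

    # normalize to forward-slash relative paths for readability
    found = sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)
```

Here `found` contains the three TypeScript files and skips `style.css`. Note that the same pattern would also pick up files under directories such as `node_modules` if they exist in the clone, which can inflate the file count.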


## Create documents from the files

In [8]:
from langchain.schema.document import Document

def extract_code_from_file(file_path):
    # read the whole file as text; UTF-8 is assumed for the source files
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

code_strings = []

for i, file_path in enumerate(repo_code_files):
    # TODO: this is a pointless check until I add more languages that
    # I don't want to use Language.TS for with RecursiveCharacterTextSplitter
    if file_path.endswith((".ts", ".tsx")):
        content = extract_code_from_file(file_path)

        doc = Document(
            page_content=content,
            # TODO: attach more metadata
            metadata={"url": file_path, "file_index": i}
        )

        code_strings.append(doc)

code_strings[0]

Document(page_content='// jest-dom adds custom jest matchers for asserting on DOM nodes.\n// allows you to do things like:\n// expect(element).toHaveTextContent(/react/i)\n// learn more: https://github.com/testing-library/jest-dom\n// get a "document undefined" error without this: https://jestjs.io/docs/configuration#testenvironment-string\nimport \'@testing-library/jest-dom\'\n', metadata={'url': '/home/steven/work/data/repos-for-search/imup-io/front-end/jest.setup.ts', 'file_index': 0})

## Split the code documents into chunks

In [9]:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Chunk code strings
text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS,
    chunk_size=2000,
    chunk_overlap=200,
)
texts = text_splitter.split_documents(code_strings)
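
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a naive sliding window over characters. This is deliberately not how `RecursiveCharacterTextSplitter` works (it splits on language-aware separators and then merges pieces up to the size limit); `sliding_window_chunks` is a hypothetical helper that only illustrates the two parameters.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # the current window already reaches the end of the text
            break
    return chunks


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
# chunks == ["abcd", "cdef", "efgh", "ghij"]
```

With the notebook's settings of 2000 and 200, consecutive chunks would share up to 200 characters, which helps a retrieved chunk keep some of the code that precedes it.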

## Create Embeddings Database

This is written with persistence and will not re-create the database if it already exists.

In [10]:
from langchain.vectorstores import Chroma

if load_documents:
    # https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma.from_documents
    db = Chroma.from_documents(
        texts,
        embeddings,
        collection_name=collection_name,
        persist_directory=chroma_db_local_repo_dir,
    )
else:
    db = Chroma(
        persist_directory=chroma_db_local_repo_dir,
        embedding_function=embeddings,
        collection_name=collection_name,
    )


## Persist the Embeddings Database

In [11]:
# I think this would be safe to run in all circumstances but
# it feels weird to try writing if there are no changes anyway
if load_documents:
  db.persist()


## Initialize Code Embeddings Retriever

In [12]:
retriever = db.as_retriever(
    search_type=db_search_type,
    search_kwargs=db_search_kwargs,
)
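
Conceptually, a `similarity` search with `{"k": 5}` ranks the stored chunk vectors against the query vector and keeps the top five. The sketch below shows that idea with toy two-dimensional vectors and cosine similarity. It is an illustration only: the metric Chroma actually uses depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], docs: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy "embeddings" for four chunks
doc_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.1], doc_vectors, k=2)
```

Here `nearest` holds chunks 0 and 1, in that order. The `mmr` search type (maximal marginal relevance) additionally penalizes candidates that are very similar to results already selected, which trades some relevance for diversity.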

## Set Up Prompts and Generate Code

This was run on a private repository, so the output can't be independently verified, but the things it pulled in from the repo context were:
1. an absolute import using `@components`, a custom alias set up in that repo
2. anonymous functional component syntax
3. declaring the `FC` type for the component

In [13]:
react_grounding_statement = """
You are a proficient React and typescript developer working with the following libraries:
- name: react
  version: ^18.2.0
- name: antd
  version: ^5.9.0
- name: "@testing-library/jest-dom"
  version: ^5.17.0
- name: "@testing-library/react"
  version: ^14.0.0
"""

coding_language = "typescript"

grounding_statement = create_grounding_statement(react_grounding_statement, github_repo, coding_language)
prompt_template = create_prompt_template(grounding_statement)

qa_chain = create_retrieval_qa_chain(llm, prompt_template, retriever=retriever)

query = """
Create a React component named 'ImUpLogoButton', utilizing imUp's ImUpLogo component satisfying the following requirements:
1. The component should be named 'ImUpLogoButton'.
2. The ImUpLogo graphic should rotate +90 degrees each time it's clicked.
3. The ImUpLogo component's tooltip should say 'imUp.io Logo'.
"""

results = qa_chain({"query": query})
print(results["result"])

```tsx
import React, { useState } from 'react';
import { ImUpLogo } from '@components/ImUpLogo';
import { Button } from 'antd';

const ImUpLogoButton: React.FC = () => {
  const [rotate, setRotate] = useState(0);

  const handleClick = () => {
    setRotate((prevRotate) => prevRotate + 90);
  };

  return (
    <Button onClick={handleClick}>
      <ImUpLogo
        style={{ transform: `rotate(${rotate}deg)` }}
        tooltipText="imUp.io Logo"
      />
    </Button>
  );
};

export default ImUpLogoButton;
```
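
Because the chain was created with `return_source_documents=True`, the result also carries the retrieved chunks under `"source_documents"`, which is useful for checking where the model got its context. The sketch below lists the files behind a result; `list_source_urls` is a hypothetical helper, and the fake result uses `SimpleNamespace` objects as stand-ins for LangChain `Document`s so the example runs without a model.

```python
from types import SimpleNamespace
from typing import List


def list_source_urls(results: dict) -> List[str]:
    """Collect the unique file paths of the retrieved chunks, in retrieval order."""
    seen: List[str] = []
    for doc in results.get("source_documents", []):
        # "url" is the metadata key set when the documents were created
        url = doc.metadata.get("url")
        if url is not None and url not in seen:
            seen.append(url)
    return seen


# stand-in for a real chain result; only the fields used above are filled in
fake_results = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(metadata={"url": "src/components/ImUpLogo.tsx", "file_index": 3}),
        SimpleNamespace(metadata={"url": "src/components/index.ts", "file_index": 7}),
        SimpleNamespace(metadata={"url": "src/components/ImUpLogo.tsx", "file_index": 3}),
    ],
}
source_files = list_source_urls(fake_results)
```

In the notebook, the equivalent call would be `list_source_urls(results)` right after running the chain.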


In [18]:
component_name = 'ChartSparkLineUptimeData'
component_interface = f'{component_name}Props'

query = f"""
Create a React component utilizing imUp's API client and satisfying the following requirements:
1. The component should be named '{component_name}'.
2. The component should be a functional component of type '{component_interface}'.
3. The component should enable ApexChart's 'sparkline' mode.
"""

results = qa_chain({"query": query})
print(results["result"])

```tsx
import { FC } from 'react';
import { ChartSparkLine as Component } from '@components/index';
import { ChartSparkLineUptimeDataProps } from './ChartSparkLineUptimeData.interface';
import { useGetUptimeDataQuery } from '@ducks/imup/api/api';

export const ChartSparkLineUptimeData: FC<
  ChartSparkLineUptimeDataProps
> = ({}: ChartSparkLineUptimeDataProps) => {
  const {
    data: responseUptime = { data: [] },
    isLoading,
    isSuccess,
  } = useGetUptimeDataQuery();

  // https://apexcharts.com/docs/series/ for docs on data formatting
  const seriesDataUptime = useMemo(() => {
    if (responseUptime?.data?.length === 0) {
      return null;
    }

    // do this before sorting so we get the most recent data points
    const seriesDataUptime: ConnectivityData[] = responseUptime?.data?.slice(
      0,
      200,
    );

    // sort the data so it's in chronological order for the chart
    seriesDataUptime.sort((a, b) => a.time.localeCompare(b.time));

    const uptimeSeries = se

## It didn't generate the interface, so let's generate the interface

In [19]:
query = f"""
Create a React component typescript 'interface', satisfying the following requirements:
1. The interface should be named '{component_interface}'.
2. The file name to generate code for is '{component_name}.interface.ts'.
3. The component code looks like the following:
{results["result"]}.
4. Do not include the component code in the generated output.
5. Parameters pulled from react-router-dom's 'useParams' hook should not be included in the interface.
6. The code should export an interface with an empty props object if the React component does not use any of the properties of '{component_interface}':
```tsx
export interface {component_interface} {{}}
```
7. Fill in the following code with the props that the React component uses or follow rule #6:
```tsx
export interface {component_interface} {{
  // fill in the props here
}}
```
"""

results = qa_chain({"query": query})
print(results["result"])

```ts
export interface ChartSparkLineUptimeDataProps {}
```
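
The model wraps its answers in Markdown code fences, so writing a result to a file such as `ChartSparkLineUptimeData.interface.ts` would first require stripping them. The sketch below is one way to do that under stated assumptions: `extract_code` is a hypothetical helper, it keeps only the first fenced block, and it tolerates a missing closing fence, as happens when the output is cut off.

```python
import re

# built this way to avoid writing a literal fence inside this example
FENCE = "`" * 3

# opening fence with optional language tag, then the body up to the
# closing fence or, if the output was truncated, the end of the text
FENCED_BLOCK = re.compile(
    FENCE + r"[\w+-]*\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL,
)


def extract_code(llm_output: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = FENCED_BLOCK.search(llm_output)
    if match:
        return match.group(1).strip()
    return llm_output.strip()


complete = FENCE + "ts\nexport interface ChartSparkLineUptimeDataProps {}\n" + FENCE
interface_code = extract_code(complete)
```

Here `interface_code` is the bare `export interface` line, ready to be written to disk.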
