# Remote GitHub Repo Context For Code Generation with LangChain

This example shows how to read in an entire code base, generate embeddings for it, and store those embeddings in a persistent local database. It then shows how to query the database directly for relevant pieces of text, and how to use the database as a source for an LLM hooked into LangChain to create a code generator that is aware of the repo's style and general context.

This was a work-in-progress experiment that I abandoned in favor of trying to build a framework for generating complex, multiple-file outputs. If something seems strange to you, especially in the prompts, follow your intuition and don't blindly trust the example.

## Setup

Make sure your `GITHUB_TOKEN` is set in the [../../.env](../../.env) file. If you set that variable after starting your development environment, you'll need to restart (`make stop; make start;`).

In the `Configure variables` section below, set the `github_repo` variable to a repository that you have access to.

## References
- [Rubens Zimbres's Code Generation RAG Medium article](https://medium.com/@rubenszimbres/code-generation-using-retrieval-augmented-generation-langchain-861e3c1a1a53)
- [code generation prompts (`code-bison`)](https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-generation-prompts)
- [code chat prompts (`codechat-bison`)](https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-chat-prompts)
- [test code chat prompts (`codechat-bison`)](https://cloud.google.com/vertex-ai/docs/generative-ai/code/test-code-chat-prompts)

## ChromaDB Persistence

Each notebook that uses ChromaDB follows the same pattern for persistence.

If the directory that ChromaDB writes its data to already exists, the existing database is loaded. If the directory does not exist, a new database is created.

If you change parameters that affect embeddings generation (like swapping in a new PDF file or, in this notebook, pointing `github_repo` at a different repository), you'll need to delete the database directory to force a new database to be created.

Delete it from the root of the repository. For example, if the ChromaDB directory is `data/chromadb/remote_repository`, run:

```sh
rm -rf data/chromadb/remote_repository
```

or if you run into permissions issues:

```sh
sudo rm -rf data/chromadb/remote_repository
```
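
The same reset can be done from Python, which is handy inside a notebook cell. This is a minimal sketch, not part of the original notebook: `reset_chroma_dir` is a hypothetical helper name, and the demo runs against a throwaway temporary directory rather than the real `data/chromadb/remote_repository` path.

```python
import shutil
import tempfile
from pathlib import Path


def reset_chroma_dir(path) -> bool:
    """Delete the persistence directory if it exists.

    Returns True if a directory was removed, False if there was nothing to delete.
    (Hypothetical helper, named for illustration only.)
    """
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demo against a temporary directory so nothing real is deleted.
demo_root = Path(tempfile.mkdtemp())
demo_db = demo_root / "chromadb" / "remote_repository"
demo_db.mkdir(parents=True)

first = reset_chroma_dir(demo_db)   # directory existed, so it is removed
second = reset_chroma_dir(demo_db)  # already gone, nothing to do

shutil.rmtree(demo_root)  # clean up the demo directory
```

After the directory is gone, the next run of the notebook will rebuild the database from scratch.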

## Configure variables

In [1]:
import os

# ****************** [START] Google Cloud project settings ****************** #
project =  os.getenv('GCP_PROJECT')
location = os.environ.get('GCP_REGION', 'us-central1')
# ******************* [END] Google Cloud project settings ******************* #


# *********************** [START] GitHub Repo Config ************************ #
github_token = os.getenv('GITHUB_TOKEN')
github_repo = "imup-io/front-end"
# *********************** [END] GitHub Repo Config ************************** #


# *********************** [START] Embeddings config ************************* #
# set rate limiting options for Vertex AI embeddings
embeddings_requests_per_minute = 100
embeddings_num_instances_per_batch = 5
# *********************** [END] Embeddings config *************************** #


# ********************** [START] data directory config ********************** #
# local directory to write chroma db persistence to
# or pull files like PDFs from to create embeddings
from helpers.files import get_data_dir
data_dir = get_data_dir()

chroma_db_dir = f'{data_dir}/chromadb'
chroma_db_remote_repo_dir = f'{chroma_db_dir}/remote_repository'
# *********************** [END] data directory config *********************** #


# ********************** [START] LLM data config **************************** #
from helpers.files import file_exists

collection_name = 'remote-repository'
load_documents = True
if file_exists(chroma_db_remote_repo_dir):
    load_documents = False
# *********************** [END] LLM data config ***************************** #


# ********************** [START] repository config ************************** #
# github_repo is already set in the GitHub Repo Config block above
repo_dir = f"{data_dir}/repositories/{github_repo}"
# *********************** [END] repository config **************************** #


# *********************** [START] LLM parameter config ********************** #
# Vertex AI model to use for the LLM
model_name='code-bison@latest'
# model_name='code-bison-32k@latest'

# maximum number of model responses generated per prompt
candidate_count = 1

# determines the maximum amount of text output from one prompt.
# a token is approximately four characters.
max_output_tokens = 1024

# temperature controls the degree of randomness in token selection.
# lower temperatures are good for prompts that expect a true or
# correct response, while higher temperatures can lead to more
# diverse or unexpected results. With a temperature of 0 the highest
# probability token is always selected. for most use cases, try
# starting with a temperature of 0.2.
temperature = 0.2

# top-p changes how the model selects tokens for output. Tokens are
# selected from most probable to least until the sum of their
# probabilities equals the top-p value. For example, if tokens A, B, and C
# have a probability of .3, .2, and .1 and the top-p value is .5, then the
# model will select either A or B as the next token (using temperature).
# the default top-p value is .8.
top_p = 0.8

# top-k changes how the model selects tokens for output.
# a top-k of 1 means the selected token is the most probable among
# all tokens in the model’s vocabulary (also called greedy decoding),
# while a top-k of 3 means that the next token is selected from among
# the 3 most probable tokens (using temperature).
top_k = 40

# how verbose the llm and langchain agent is when thinking
# through a prompt. you're going to want this set to True
# for development so you can debug its thought process
verbose = True
# *********************** [END] LLM parameter config ************************ #


# ********************** [START] Configuration Checks *********************** #
if not project:
    raise Exception('GCP_PROJECT environment variable not set')

if not github_token:
    raise Exception('GITHUB_TOKEN environment variable not set')
# *********************** [END] Configuration Checks ************************ #
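
The `top_k` and `top_p` comments above describe how the candidate token set is narrowed before a token is sampled. The sketch below reproduces the A/B/C example from those comments in plain Python. It is a simplified illustration and an assumption about the general technique, not Vertex AI's actual implementation; `filter_candidates` is a hypothetical name, and the final temperature-based sampling step is not shown.

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> list[str]:
    """Return the tokens that survive top-k and then top-p filtering."""
    # top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top-p: walk from most to least probable until the cumulative
    # probability reaches top_p (small tolerance for float rounding).
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p - 1e-9:
            break
    return kept


token_probs = {"A": 0.3, "B": 0.2, "C": 0.1}

# With top_p = 0.5, A and B together reach the threshold, so C is dropped.
candidates = filter_candidates(token_probs, top_k=40, top_p=0.5)

# With top_k = 1 only the single most probable token remains (greedy decoding).
greedy = filter_candidates(token_probs, top_k=1, top_p=0.5)
```

The model would then pick one of the surviving tokens, with temperature controlling how strongly the more probable ones are favored.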

## Import and Initialize Vertex AI Client

This will log warnings about missing CUDA drivers and the GPU not being used. You can safely ignore them. Using the GPU is possible on Linux with Docker, but on macOS you'll need to set up a non-containerized development environment to use GPUs.

In [None]:
from google.cloud import aiplatform
import vertexai

vertexai.init(project=project, location=location)

print(f"Vertex AI SDK version: {aiplatform.__version__}")


2023-12-16 01:38:09.069447: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-16 01:38:09.071158: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-12-16 01:38:09.090029: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-16 01:38:09.090056: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-16 01:38:09.090072: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to regi

Vertex AI SDK version: 1.38.1


## Import LangChain

This doesn't actually initialize anything; it just lets us print the version.

In [None]:
import langchain

print(f"LangChain version: {langchain.__version__}")


LangChain version: 0.0.350


## Configure LLM with Vertex AI

In [None]:
from langchain.llms import VertexAI

llm = VertexAI(
    model_name=model_name,
    max_output_tokens=max_output_tokens,
    temperature=temperature,
    # top_p=top_p,
    # top_k=top_k,
    verbose=verbose,
)


## Initialize Embeddings Function with Vertex AI

There are other options for creating embeddings. I was interested in sticking with Google products here.

In [None]:
from langchain.embeddings import VertexAIEmbeddings

# https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.vertexai.VertexAIEmbeddings.html
embeddings = VertexAIEmbeddings(
    requests_per_minute=embeddings_requests_per_minute,
    num_instances_per_batch=embeddings_num_instances_per_batch,
    model_name = "textembedding-gecko@latest"
)

## Prompt Configuration

In [3]:
# NOTE: `topic` is not used yet; the prompt below is hard-coded to the "Alert" example.
def create_prompt(topic):
  prompt = """
Assume the role of an expert software developer and suggest code for the final description.


// [START_GENERATION_SUMMARY] React "Alert" Component Interface
// - description : React functional component interface for an "Alert"
//                 written using Ant Design v5.
// - file path   : src/components/Alert/Alert.interface.tsx
// - language    : typescript
// [END_GENERATION_SUMMARY] React "Alert" Component Interface
//
// [START_GENERATED_CODE] React "Alert" Component Interface
"""

  return prompt


## Scrape GitHub repo

In [4]:
import requests, time

# Crawls a GitHub repository and returns a list of all ts or tsx files in the repository
def crawl_github_repo(url, is_sub_dir, access_token = f"{github_token}"):

    ignore_list = ['__init__.py']

    if not is_sub_dir:
        api_url = f"https://api.github.com/repos/{url}/contents"
    else:
        api_url = url

    headers = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"Bearer {access_token}"
    }

    response = requests.get(api_url, headers=headers)
    response.raise_for_status()  # Check for any request errors

    files = []

    contents = response.json()

    for item in contents:
        # if item['type'] == 'file' and item['name'] not in ignore_list and (item['name'].endswith('.py') or item['name'].endswith('.ipynb')):
        if item['type'] == 'file' and item['name'] not in ignore_list and (item['name'].endswith('.ts') or item['name'].endswith('.tsx')):
            files.append(item['html_url'])
        elif item['type'] == 'dir' and not item['name'].startswith("."):
            # pass the token through so sub-directories use the same credentials
            sub_files = crawl_github_repo(item['url'], True, access_token)
            time.sleep(.1)
            files.extend(sub_files)

    return files
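
The per-item decision inside the crawler can be checked without touching the network. The sketch below isolates it as a pure function and runs it against a hand-written payload. `select_code_files` is a hypothetical name and the sample entries are made up for illustration; only the `type`, `name`, `html_url`, and `url` keys mirror what the crawler above reads from the GitHub contents API.

```python
def select_code_files(contents, extensions=(".ts", ".tsx")):
    """Split one directory listing into matching file URLs and sub-directories to visit."""
    files, dirs = [], []
    for item in contents:
        if item["type"] == "file" and item["name"].endswith(extensions):
            files.append(item["html_url"])
        elif item["type"] == "dir" and not item["name"].startswith("."):
            dirs.append(item["url"])
    return files, dirs


# Made-up listing for illustration; the URLs are placeholders.
sample_contents = [
    {"type": "file", "name": "App.tsx", "html_url": "https://example.com/App.tsx", "url": "u1"},
    {"type": "file", "name": "README.md", "html_url": "https://example.com/README.md", "url": "u2"},
    {"type": "dir", "name": "src", "html_url": "https://example.com/src", "url": "https://example.com/api/src"},
    {"type": "dir", "name": ".github", "html_url": "https://example.com/.github", "url": "u4"},
]

files, dirs = select_code_files(sample_contents)
```

Here only `App.tsx` is kept as a file, and only `src` is queued for recursion because hidden directories are skipped.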

## Write discovered files to disk

In [5]:
from helpers.files import make_dir_if_not_exists

files_discovered_file_path = f"{data_dir}/code-gen-rag-repo-scrape/{github_repo}_files.txt"
make_dir_if_not_exists(files_discovered_file_path)

code_files_urls = crawl_github_repo(github_repo, False, github_token)

print(f"Found {len(code_files_urls)} code files in {github_repo}")

# Write list to a file so you do not have to download each time
with open(files_discovered_file_path, "w") as f:
    for item in code_files_urls:
        f.write(item + '\n')

Found 828 code files in imup-io/front-end


## Extract everything in URLs

I left some commented-out code in here from [Rubens Zimbres's Code Generation RAG Medium article](https://medium.com/@rubenszimbres/code-generation-using-retrieval-augmented-generation-langchain-861e3c1a1a53), which I was working from, in case it's useful to someone.

In [8]:
import requests
from langchain.schema.document import Document

import nbformat
import json

# Extracts the python code from an .ipynb file from github
# def extract_python_code_from_ipynb(github_url,cell_type = "code"):
#     raw_url = github_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

#     response = requests.get(raw_url)
#     response.raise_for_status()  # Check for any request errors

#     notebook_content = response.text

#     notebook = nbformat.reads(notebook_content, as_version=nbformat.NO_CONVERT)

#     python_code = None

#     for cell in notebook.cells:
#         if cell.cell_type == cell_type:
#           if not python_code:
#             python_code = cell.source
#           else:
#             python_code += "\n" + cell.source

#     return python_code

# Downloads the raw contents of a source file from GitHub
def extract_code_from_file(github_url, github_token = f"{github_token}"):

    headers = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"Bearer {github_token}"
    }

    raw_url = github_url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

    response = requests.get(raw_url, headers=headers)
    response.raise_for_status()  # Check for any request errors

    code = response.text

    return code


with open(files_discovered_file_path) as f:
    code_files_urls = f.read().splitlines()

code_strings = []

for i, url in enumerate(code_files_urls):
    if url.endswith(".ipynb"):
        print("Skipping .ipynb file")

        continue

    if url.endswith((".ts", ".tsx")):
        content = extract_code_from_file(url)
        doc = Document(page_content=content, metadata={"url": url, "file_index": i})
        code_strings.append(doc)

code_strings[0]

Document(page_content='// jest-dom adds custom jest matchers for asserting on DOM nodes.\n// allows you to do things like:\n// expect(element).toHaveTextContent(/react/i)\n// learn more: https://github.com/testing-library/jest-dom\n// get a "document undefined" error without this: https://jestjs.io/docs/configuration#testenvironment-string\nimport \'@testing-library/jest-dom\'\n', metadata={'url': 'https://github.com/imup-io/front-end/blob/main/jest.setup.ts', 'file_index': 0})
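
The download step relies on rewriting each `html_url` into its raw-content equivalent. The sketch below pulls that rewrite into a small function so it can be checked on its own. `to_raw_url` is a hypothetical name; the two `replace` calls are the same ones used in `extract_code_from_file` above.

```python
def to_raw_url(github_url: str) -> str:
    """Convert a github.com blob URL into the matching raw.githubusercontent.com URL."""
    return (
        github_url
        .replace("github.com", "raw.githubusercontent.com")
        .replace("/blob/", "/")
    )


raw_url = to_raw_url("https://github.com/imup-io/front-end/blob/main/jest.setup.ts")
# raw_url is "https://raw.githubusercontent.com/imup-io/front-end/main/jest.setup.ts"
```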

## Chunk the strings

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

# Chunk code strings
text_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS, chunk_size=2000, chunk_overlap=200
)
texts = text_splitter.split_documents(code_strings)
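
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers language-aware separators before falling back to characters); `chunk_text` is a hypothetical function that only illustrates the size and overlap arithmetic.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks is ["abcd", "defg", "ghij"]: each chunk repeats the last character of the previous one
```

The overlap means a declaration that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.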

## Create Embeddings Database

This is written with persistence and will not re-create the database if it already exists.

In [None]:
from langchain.vectorstores import Chroma

if load_documents:
  # https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma.from_documents
  db = Chroma.from_documents(
    texts,
    embeddings,
    collection_name=collection_name,
    persist_directory=chroma_db_remote_repo_dir,
  )
else:
  db = Chroma(
    persist_directory=chroma_db_remote_repo_dir,
    embedding_function=embeddings,
    collection_name=collection_name,
  )


## Initialize retriever

In [14]:
# Init your retriever.
retriever = db.as_retriever(
    search_type="similarity",  # Also try "mmr" or "similarity_score_threshold"
    search_kwargs={"k": 5},
)
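
Conceptually, a `"similarity"` retriever with `k` set embeds the query and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below shows that ranking with hand-written two-dimensional vectors. Cosine similarity is used as a stand-in here; the metric Chroma actually applies depends on how the collection is configured, and the file names and vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query vector."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda kv: cosine_similarity(query_vec, kv[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Invented embeddings; real ones have hundreds of dimensions.
toy_index = {
    "Button.tsx": [1.0, 0.0],
    "Alert.tsx": [0.7, 0.7],
    "api.ts": [0.0, 1.0],
}

nearest = top_k_similar([1.0, 0.1], toy_index, k=2)
# nearest is ["Button.tsx", "Alert.tsx"]
```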

## Running prompts (non-RAG)

In [16]:
from langchain.prompts import PromptTemplate

user_question = "Create a typescript React 'Button' component using Ant Design."

# Zero Shot prompt template
prompt_zero_shot = """
    You are a proficient React and TypeScript developer. Respond with syntactically correct and concise code for the question below.

    Question:
    {question}

    Output Code :
    """

prompt_prompt_zero_shot = PromptTemplate(
  input_variables=["question"],
  template=prompt_zero_shot,
)

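# Fill the zero-shot template with the question before sending it to the model
response = llm.predict(text=prompt_prompt_zero_shot.format(question=user_question), max_output_tokens=2048, temperature=0.1)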
print(response)

```tsx
import React from "react";
import { Button } from "antd";

const MyButton: React.FC = () => {
  return (
    <Button type="primary">Click me!</Button>
  );
};

export default MyButton;
```


## Running prompts (RAG)

In [21]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# You are a proficient React and typescript developer. Respond with the syntactically correct & concise code for to the question below. Make sure you follow these rules:
# 1. Use context to understand the APIs and how to use it & apply.
# 2. Do not add license information to the output code.
# 3. Do not include colab code in the output.
# 4. Ensure all the requirements in the question are met.

# RAG template
prompt_RAG = """
    You are a proficient React and TypeScript developer. Respond with syntactically correct and concise code for the question below. Make sure you follow these rules:
    1. Use context from the imup-io/front-end GitHub repository to understand the component interfaces and how to use and apply them.
    2. Do not add license information to the output code.
    3. Ensure all the requirements in the question are met.

    Question:
    {question}

    Context:
    {context}

    Helpful Response :
    """

prompt_RAG_template = PromptTemplate(
    template=prompt_RAG, input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_llm(
    llm=llm, prompt=prompt_RAG_template, retriever=retriever, return_source_documents=True
)

# user_question = "Create a typescript React 'Button' component using Ant Design."
user_question = "Create a typescript React component using Ant Design and imup-io/front-end repository's ImUpLogo component. The component should be named 'ImUpLogoButton' and should rotate the logo 90 degrees on click. The ImUpLogo component's tooltip should say 'ImUp.io' and the ImUpLogoButton component's tooltip should say 'ImUp.io Logo'."

results = qa_chain({"query": user_question})
print(results["result"])

```jsx
import React, { FC } from 'react';
import { ImUpLogo } from '@components/ImUpLogo';
import { Tooltip } from 'antd';

export const ImUpLogoButton: FC = () => {
  const [rotate, setRotate] = useState(false);

  const handleClick = () => {
    setRotate(!rotate);
  };

  return (
    <Tooltip title="ImUp.io Logo">
      <ImUpLogo
        onClick={handleClick}
        style={{ transform: `rotate(${rotate ? 90 : 0}deg)` }}
        tooltipText="ImUp.io"
      />
    </Tooltip>
  );
};
```
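
It can help to see roughly what text the model receives in the RAG case. The sketch below assembles a prompt by joining retrieved chunks into the `{context}` slot of a shortened version of the template. This is an approximation of the "stuff the documents into the prompt" approach, written in plain Python; `build_rag_prompt` and the sample chunks are hypothetical, and the exact separator LangChain uses is an assumption.

```python
TEMPLATE = """Question:
{question}

Context:
{context}

Helpful Response :"""


def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join retrieved chunks and substitute them into the prompt template."""
    context = "\n\n".join(retrieved_chunks)
    # Braces inside the chunks are safe: only the template itself is parsed by format().
    return TEMPLATE.format(question=question, context=context)


# Invented chunks standing in for what the retriever would return.
sample_chunks = [
    "export interface ImUpLogoProps { tooltipText?: string }",
    "export const ImUpLogo: FC<ImUpLogoProps> = (props) => { /* ... */ }",
]

rag_prompt = build_rag_prompt("Create an ImUpLogoButton component.", sample_chunks)
```

Printing `rag_prompt` while debugging is a quick way to confirm that the retrieved context is actually relevant to the question.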


## Running multiple prompts back to back

In [18]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# Create a memory object
memory = ConversationBufferMemory()

# Create a conversation chain
chain = ConversationChain(llm=llm, memory=memory)

# Respond to the user
chain.predict(input='Create a python function that sums two variables a and b.')
chain.predict(input='Now, run this function with a=2 and b=7 and give me only the result')


' 9'
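
The second prompt above only works because the first exchange is replayed to the model. The sketch below imitates that idea with a tiny buffer that renders earlier turns ahead of the new input. It is a simplified stand-in, not LangChain's `ConversationBufferMemory`; the `BufferMemory` class is hypothetical, and the `Human:` / `AI:` prefixes are assumed to match LangChain's defaults.

```python
class BufferMemory:
    """Keep every past turn and replay it in front of the next input."""

    def __init__(self):
        self.turns = []  # list of (human_text, ai_text) pairs

    def add(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def render(self, new_input: str) -> str:
        lines = []
        for human, ai in self.turns:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        lines.append(f"Human: {new_input}")
        lines.append("AI:")
        return "\n".join(lines)


memory_demo = BufferMemory()
memory_demo.add("Create a python function that sums two variables a and b.",
                "def add(a, b):\n    return a + b")
rendered = memory_demo.render("Now, run this function with a=2 and b=7 and give me only the result")
```

Because the whole history is resent on every call, the prompt grows with each turn, which matters for models with a small context window.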