# Background and Objective
**Talk to Your Code: Interact with Your GitHub Repository Using RAG**🔍📂

This project builds a tool for asking questions about the code in a GitHub repository. We take a GitHub repository as input, read its code files, and let users ask questions about it. Using Retrieval-Augmented Generation (RAG), the tool retrieves the code relevant to each user query and answers from it.


*   Input Data - GitHub repo
*   User Query - Related to the code
*   Output - Code retrieved based on the user query



## Example Input and Output

**Input Data** - GitHub Repo Example Link - https://github.com/scikit-learn/scikit-learn

**User Query** - "Code for decision trees"

**RAG Tool Answer** - [Screenshot](https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Sample_images/Output_Screen.png)

## Packages

# Step-1: Importing (Cloning) the GitHub Repo to a Local Folder

### Instructions to Write the Python Code for Cloning a GitHub Repository

1. **Install the GitPython Library:**
   - Ensure the `GitPython` library is installed. It lets you work with Git repositories from Python code. Install it with `pip install GitPython` (prefix the command with `!` when running it in a notebook cell).

2. **Import Required Modules:**
   - Import the `os` module and the `Repo` class from the `git` package (provided by GitPython).

3. **Set Up the Local Directory:**
   - Remove any existing directory named `local_copy_repo` to avoid conflicts, and then create a new directory with the same name. This directory will be used to clone the GitHub repository.


4. **Define Local Repository Path and Repository URL:**
   - Specify the path where the repository will be cloned locally. Also, define the URL of the GitHub repository you want to clone. In this example, we will use the scikit-learn GitHub repository.


5. **Clone the Repository:**
   - Use the `Repo.clone_from` method to clone the specified GitHub repository into the defined local directory.
     ```python
     repo = Repo.clone_from(repo_url, local_repo_path)
     ```

6. **Verify the Cloned Repository:**
   - List the contents of the cloned repository to ensure that it has been successfully cloned.
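
Steps 3 to 5 can also be written without shell commands. The following is a minimal sketch, assuming GitPython is installed. The helper names `reset_directory` and `clone_repository` are hypothetical, and `depth=1` (a shallow clone that skips the commit history) is an optional choice that is not part of the original walkthrough.

```python
import os
import shutil


def reset_directory(path):
    """Delete `path` if it exists, then recreate it as an empty directory."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)
    return path


def clone_repository(repo_url, local_path, shallow=True):
    """Clone `repo_url` into a fresh `local_path` and return the Repo object."""
    # Imported here so the directory helper works even without GitPython.
    from git import Repo

    reset_directory(local_path)
    # Extra keyword arguments are forwarded to `git clone`; depth=1 fetches
    # only the latest commit, which is enough for reading the code files.
    options = {"depth": 1} if shallow else {}
    return Repo.clone_from(repo_url, local_path, **options)
```

Calling `clone_repository("https://github.com/scikit-learn/scikit-learn", "local_copy_repo")` and then `os.listdir("local_copy_repo")` would cover steps 5 and 6.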



In [None]:

# Install GitPython
!pip install GitPython

import os
from git import Repo

# Start from a clean, empty target directory
!rm -rf local_copy_repo
!mkdir local_copy_repo

local_repo_path = "local_copy_repo"
# Use the scikit-learn GitHub repo in this example
repo_url = "https://github.com/scikit-learn/scikit-learn"

repo = Repo.clone_from(repo_url, local_repo_path)

Collecting GitPython
  Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
Collecting gitdb<5,>=4.0.1 (from GitPython)
  Downloading gitdb-4.0.11-py3-none-any.whl (62 kB)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->GitPython)
  Downloading smmap-5.0.1-py3-none-any.whl (24 kB)
Installing collected packages: smmap, gitdb, GitPython
Successfully installed GitPython-3.1.43 gitdb-4.0.11 smmap-5.0.1


In [None]:
# Check the top-level files and folders in the cloned repo
print(os.listdir(local_repo_path))

['.mailmap', 'setup.cfg', 'CONTRIBUTING.md', '.git', 'CODE_OF_CONDUCT.md', 'README.rst', 'pyproject.toml', 'asv_benchmarks', 'examples', '.github', 'maint_tools', 'sklearn', 'build_tools', '.git-blame-ignore-revs', '.circleci', '.pre-commit-config.yaml', 'meson.build', 'COPYING', '.codecov.yml', 'azure-pipelines.yml', 'SECURITY.md', '.binder', 'Makefile', '.coveragerc', 'benchmarks', '.gitattributes', '.cirrus.star', '.gitignore', 'doc']


# Step-2: Load the Files (Code Files Only)

In this project we load only the Python source files, that is, files with the `.py` suffix.

### Instructions to Write the Python Code for Loading and Parsing `.py` Files from a Specific Folder

1. **Import Necessary Modules:**
   - Import the `GenericLoader` from `langchain.document_loaders.generic` to load documents.
   - Import the `LanguageParser` from `langchain.document_loaders.parsers` to parse Python files.

2. **Specify the Path to the Target Folder:**
   - Define the path to the folder containing the `.py` files you want to load. In this example, we will use the `examples` folder from the scikit-learn repository.

3. **Initialize the Loader:**
   - Create an instance of `GenericLoader` by specifying the path to the folder, the file suffix (in this case, `.py` for Python files), and the glob pattern to match all files recursively. Use the `LanguageParser` to parse the Python files.


4. **Load the Documents:**
   - Use the `load` method of the `GenericLoader` instance to load the documents from the specified folder.


5. **Check the Number of Loaded Documents:**
   - Verify the number of loaded documents by printing the length of the `documents` list.


By following these steps, you will be able to load and parse `.py` files from a specified folder using the `GenericLoader` and `LanguageParser` classes.
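
As a quick cross-check on what the loader will pick up, the matching files can be listed with the standard library alone. This is a sketch, not part of the original notebook, and `list_python_files` is a hypothetical helper. The count it reports may be lower than `len(documents)`, because `LanguageParser` can split a single file into several documents (for example, one per top-level function or class).

```python
from pathlib import Path


def list_python_files(folder):
    """Return every .py file under `folder`, searched recursively and sorted."""
    return sorted(p for p in Path(folder).rglob("*.py") if p.is_file())


# Example usage (the path is the folder used in this walkthrough):
# files = list_python_files("local_copy_repo/examples")
# print(len(files), "Python files found")
```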

In [None]:
#Generic Loader for loading the .py files
from langchain.document_loaders.generic import GenericLoader

#To parse the python files
from langchain.document_loaders.parsers import LanguageParser

# The scikit-learn repo is huge, so we load only its "examples" folder
local_repo_path_example_folder = os.path.join(local_repo_path, "examples")

loader=GenericLoader.from_filesystem(local_repo_path_example_folder,
                                      suffixes=[".py"],
                                      glob="**/*",
                                      parser=LanguageParser(language="python"))

documents=loader.load()
len(documents)

493

# Step-3: Split the Documents into Chunks

### Instructions to Write the Python Code for Splitting Python Code into Chunks

1. **Import Necessary Modules:**
   - Import the `RecursiveCharacterTextSplitter` and `Language` classes from the `langchain.text_splitter` module.

2. **Initialize the Text Splitter:**
   - Create an instance of `RecursiveCharacterTextSplitter` for the Python language. Define the `chunk_size` (number of characters per chunk) and `chunk_overlap` (number of overlapping characters between chunks).
     ```python
     code_text_splitter = RecursiveCharacterTextSplitter.from_language(chunk_size=3000,
                                                                       chunk_overlap=300,
                                                                       language=Language.PYTHON)
     ```

3. **Split the Documents into Chunks:**
   - Use the `split_documents` method of the `code_text_splitter` instance to split the loaded documents into smaller chunks.

4. **Print the Number and Content of Chunks:**
   - Print the number of generated chunks to verify the splitting process.
   - Print the entire list of chunks to see the split content (this output is long).
   - Print the first chunk to inspect its content.


By following these steps, you will be able to split Python code documents into smaller chunks using the `RecursiveCharacterTextSplitter` class, making the content easier to manage and process.
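
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain sliding window over characters. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` additionally tries to break at language-aware separators (such as class and function boundaries), which this sketch does not attempt. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window start advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij"
print(sliding_window_chunks(sample, chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

With the values used in this step (`chunk_size=3000`, `chunk_overlap=300`), each window in this simplified scheme would start 2700 characters after the previous one.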

In [None]:

from langchain.text_splitter import RecursiveCharacterTextSplitter
#We need Language for splitting the code
from langchain.text_splitter import Language

code_text_splitter = RecursiveCharacterTextSplitter.from_language(chunk_size=3000,
                                                                  chunk_overlap=300,
                                                                  language=Language.PYTHON)
code_chunks = code_text_splitter.split_documents(documents)
print("code_chunks", len(code_chunks))
print(code_chunks)

code_chunks 898


In [None]:
print(code_chunks[0])

page_content='"""
The Iris Dataset
This data sets consists of 3 different types of irises'
(Setosa, Versicolour, and Virginica) petal and sepal
length, stored in a 150x4 numpy.ndarray

The rows being the samples and the columns being:
Sepal Length, Sepal Width, Petal Length and Petal Width.

The below plot uses the first two features.
See `here <https://en.wikipedia.org/wiki/Iris_flower_data_set>`_ for more
information on this dataset.

"""

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# SPDX-License-Identifier: BSD-3-Clause

# %%
# Loading the iris dataset
# ------------------------
from sklearn import datasets

iris = datasets.load_iris()


# %%
# Scatter Plot of the Iris dataset
# --------------------------------
import matplotlib.pyplot as plt

_, ax = plt.subplots()
scatter = ax.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
ax.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
_ = ax.legend(
    scatter.legend_elements()[0], 

# Step-4: Embeddings and VectorDB

### Instructions to Write the Python Code for Creating Embeddings and Vector Database

1. **Set Up the Environment:**
   - Import necessary modules and set the `OPENAI_API_KEY` environment variable. Ensure you have the required key stored in Google Colab's `userdata`.


2. **Install Required Libraries:**
   - Install the `chromadb` and `tiktoken` libraries using pip.


3. **Import Necessary Classes:**
   - Import the `OpenAIEmbeddings` class from `langchain.embeddings` and the `Chroma` class from `langchain.vectorstores`. Also, import the `tiktoken` library.


4. **Create Embeddings:**
   - Instantiate the `OpenAIEmbeddings` class to create embeddings for the code chunks.


5. **Create Vector Database:**
   - Use the `Chroma.from_documents` method to create a vector database from the code chunks. Specify the embeddings and the directory where the database will be persisted.


6. **Persist the Database:**
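   - Persist the vector database to the specified directory. Passing `persist_directory` to `Chroma.from_documents` is what enables this. Depending on the installed Chroma version, the data is either written automatically or requires an explicit `code_db.persist()` call.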


By following these steps, you will be able to create embeddings for the code chunks and store them in a persistent vector database using the `Chroma` vector store and `OpenAIEmbeddings`.
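
Step 1 of this list (setting the API key) could look roughly like the sketch below. It is an assumption-laden example rather than the notebook's own code: it assumes the secret is stored in Colab under the name `OPENAI_API_KEY`, and it falls back to an existing environment variable when run outside Colab. The helper name `resolve_api_key` is hypothetical.

```python
import os


def resolve_api_key(name="OPENAI_API_KEY"):
    """Return the API key from Colab's userdata, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not shared with the notebook.
        value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} was not found in Colab userdata or the environment")
    return value
```

Before creating the embeddings, one would then run `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.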

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import tiktoken

embeddings = OpenAIEmbeddings()
code_db = Chroma.from_documents(documents=code_chunks,
                                embedding=embeddings,
                                persist_directory="code_db")




# Step-5: RAG and Q&A Chain

### Instructions to Write the Python Code for Retrieval-Augmented Generation (RAG) and Q&A Chain

1. **Import Necessary Classes:**
   - Import the `RetrievalQA` and `RetrievalQAWithSourcesChain` classes from `langchain.chains`, and the `OpenAI` class from `langchain.llms`.


2. **Create a Retriever from the Vector Database:**
   - Convert the previously created `code_db` vector store into a retriever object.


3. **Initialize the Language Model:**
   - Create an instance of the `OpenAI` language model with a specified temperature (controls the randomness of the model's output).


4. **Create the RetrievalQA Chain:**
   - Use the `RetrievalQA.from_chain_type` method to create a Q&A chain by specifying the language model (`llm`), chain type, and retriever.


5. **Run Queries on the Code Repository:**
   - Define queries to search for specific code snippets within the repository. Use the `run` method of the `Code_Repo_QandA` chain to get the results.
     ```python
     query = "Code for decision trees"
     query = "Code for Random Forest"
     query = "Code for Gradient Boosting"
     ```
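
Conceptually, the retriever ranks the stored chunks by how similar their embeddings are to the query embedding, and the `"stuff"` chain type places the top-ranked chunks into a single prompt for the language model. The toy sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the `top_k` and `build_stuffed_prompt` helpers, and the prompt wording are all illustrative assumptions, not what Chroma, the OpenAI embeddings, or LangChain use internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the texts of the k stored chunks most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_stuffed_prompt(question, chunks):
    """Concatenate ("stuff") all retrieved chunks into one prompt string."""
    context = "\n\n".join(chunks)
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {question}"


# Made-up embeddings for illustration: (chunk text, vector)
store = [
    ("DecisionTreeClassifier example", [0.9, 0.1, 0.0]),
    ("RandomForestClassifier example", [0.6, 0.4, 0.0]),
    ("KMeans clustering example", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "Code for decision trees"

chunks = top_k(query_vector, store, k=2)
print(build_stuffed_prompt("Code for decision trees", chunks))
```

In the real pipeline, the ranking is done by the vector store over the embedded code chunks, and the language model receives the assembled prompt instead of it being printed.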



In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.chains import RetrievalQAWithSourcesChain

retriever=code_db.as_retriever()
llm = OpenAI(temperature=0)

Code_Repo_QandA = RetrievalQA.from_chain_type(llm=llm,
                                              chain_type="stuff",
                                              retriever=retriever)



In [None]:
query="Code for decision trees"
result=Code_Repo_QandA.run({"query":query})
print(result)





The code for decision trees can vary depending on the specific implementation or library being used. However, here is an example of code for a decision tree classifier using the scikit-learn library:

```
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create decision tree classifier
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)
```

This code first imports the `DecisionTreeClassifier` class from the `sklearn.tree` module. Then, it loads the iris dataset and splits it into training and testing sets. Next, it creates an instance of the decisio

In [None]:
query="Code for Random Forest"
result=Code_Repo_QandA.run({"query":query})
print(result)



The code for Random Forest is:

```
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier with default parameters
rf = RandomForestClassifier()

# Fit the model on your data
rf.fit(X, y)

# Make predictions on new data
y_pred = rf.predict(X_new)
```

Note: This is just a basic example, and you may need to adjust the parameters and data preprocessing steps based on your specific problem. It's always a good idea to consult the documentation and experiment with different settings to find the best model for your data.


In [None]:
query="Code for Gradient Boosting"
result=Code_Repo_QandA.run({"query":query})
print(result)


I'm not sure what you mean by "code for Gradient Boosting". The code for Gradient Boosting is already included in the context provided. It is the code for training and evaluating a Gradient Boosting model on the California Housing Prices dataset. Is there something specific you are looking for?
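
When an answer is vague, as in the Gradient Boosting response above, it can help to look at which chunks the retriever actually returned. The sketch below summarizes a list of retrieved documents. It only assumes each document exposes a `metadata` dict with a `"source"` entry, which is how LangChain's file-system loaders typically label documents. The helper name `summarize_sources`, the stand-in `FakeDoc` class, and the file names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


def summarize_sources(docs):
    """Count how many retrieved chunks came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document, used only to demonstrate the helper."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up documents; real ones would come from the retriever.
demo_docs = [
    FakeDoc("first chunk", {"source": "a.py"}),
    FakeDoc("second chunk", {"source": "a.py"}),
    FakeDoc("third chunk", {"source": "b.py"}),
]

for source, count in summarize_sources(demo_docs).most_common():
    print(f"{count} chunk(s) from {source}")
```

With the objects from this notebook, the input could come from `retriever.get_relevant_documents(query)`, assuming the installed LangChain version still provides that method (newer releases favor `retriever.invoke(query)`).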
