<a href="https://colab.research.google.com/github/venkatareddykonasani/Assignments/blob/main/GenAI_Assignments/Assignment6_Questions_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background and Objective
**Talk to Your Code: Interact with Your GitHub Repository Using RAG**🔍📂

This project lets users ask questions about the files in a GitHub repository. We take a GitHub repository as input, read its code files, and use the RAG (Retrieval-Augmented Generation) technique to retrieve the code most relevant to each user query.


*   Input Data - GitHub repo
*   User Query - Related to code
*   Output - Retrieved code based on the user query



## Example Input and Output

**Input Data** - GitHub Repo Example Link - https://github.com/scikit-learn/scikit-learn

**User Query** - "Code for decision trees"

**RAG Tool Answer** - [Screenshot](https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Sample_images/Output_Screen.png)

## Packages

# Step-1: Cloning the GitHub Repo to a Local Folder

### Instructions to Write the Python Code for Cloning a GitHub Repository

1. **Install the GitPython Library:**
   - Ensure you have the `GitPython` library installed. This library allows you to interact with Git repositories from Python code. Install it with `pip install GitPython` (in a Colab cell, prefix the command with `!`).

2. **Import Required Modules:**
   - Import the standard-library `os` module and the `Repo` class from the `git` package (provided by GitPython).

3. **Set Up the Local Directory:**
   - Remove any existing directory named `local_copy_repo` to avoid conflicts, and then create a new directory with the same name. This directory will be used to clone the GitHub repository.


4. **Define Local Repository Path and Repository URL:**
   - Specify the path where the repository will be cloned locally. Also, define the URL of the GitHub repository you want to clone. In this example, we will use the scikit-learn GitHub repository.


5. **Clone the Repository:**
   - Use the `Repo.clone_from` method to clone the specified GitHub repository into the defined local directory.
     ```python
     repo = Repo.clone_from(repo_url, local_repo_path)
     ```

6. **Verify the Cloned Repository:**
   - List the contents of the cloned repository to ensure that it has been successfully cloned.
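
The steps above can be sketched roughly as follows. This is a minimal sketch rather than a reference solution: the helper names `reset_directory` and `clone_repo` are hypothetical, and the GitPython import is done lazily inside the function so that the directory helper can be tried without GitPython installed.

```python
import os
import shutil


def reset_directory(path):
    """Remove `path` if it exists, then recreate it as an empty directory."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)
    return path


def clone_repo(repo_url, local_repo_path):
    """Clone `repo_url` into a fresh `local_repo_path` and list its top-level entries."""
    # Lazy import: requires `pip install GitPython`.
    from git import Repo

    reset_directory(local_repo_path)
    Repo.clone_from(repo_url, local_repo_path)
    return sorted(os.listdir(local_repo_path))


# Usage (needs network access; cloning a large repository can take a while):
# contents = clone_repo("https://github.com/scikit-learn/scikit-learn",
#                       "local_copy_repo")
# print(contents)
```

Recreating the folder before each clone keeps reruns of the notebook cell from failing because the target directory already contains files.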



In [None]:
#Write your code here

# Step-2: Load the Files (Code Files Only)

We will load only the Python source files (`.py`) in this project.

### Instructions to Write the Python Code for Loading and Parsing `.py` Files from a Specific Folder

1. **Import Necessary Modules:**
   - Import the `GenericLoader` from `langchain.document_loaders.generic` to load documents.
   - Import the `LanguageParser` from `langchain.document_loaders.parsers` to parse Python files.

2. **Specify the Path to the Target Folder:**
   - Define the path to the folder containing the `.py` files you want to load. In this example, we will use the `examples` folder from the scikit-learn repository.

3. **Initialize the Loader:**
   - Create an instance of `GenericLoader` by specifying the path to the folder, the file suffix (in this case, `.py` for Python files), and the glob pattern to match all files recursively. Use the `LanguageParser` to parse the Python files.


4. **Load the Documents:**
   - Use the `load` method of the `GenericLoader` instance to load the documents from the specified folder.


5. **Check the Number of Loaded Documents:**
   - Verify the number of loaded documents by printing the length of the `documents` list.


By following these steps, you will be able to load and parse `.py` files from a specified folder using the `GenericLoader` and `LanguageParser` classes.
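
One possible shape for this step is sketched below. It assumes an older LangChain release in which the import paths named above are valid, and it imports `Language` from `langchain.text_splitter`. The helper names `list_python_files` and `load_python_documents` are hypothetical; the first uses only the standard library and is meant as a quick cross-check of which files the loader should pick up.

```python
from pathlib import Path


def list_python_files(folder):
    """Return every .py file under `folder`, searched recursively."""
    return sorted(p for p in Path(folder).rglob("*.py") if p.is_file())


def load_python_documents(folder):
    """Load and parse the .py files under `folder` with LangChain."""
    # Lazy imports: require `pip install langchain` (older import paths assumed).
    from langchain.document_loaders.generic import GenericLoader
    from langchain.document_loaders.parsers import LanguageParser
    from langchain.text_splitter import Language

    loader = GenericLoader.from_filesystem(
        str(folder),
        glob="**/*",        # search all subfolders
        suffixes=[".py"],   # keep Python files only
        parser=LanguageParser(language=Language.PYTHON),
    )
    return loader.load()


# Usage, assuming the repository was cloned into local_copy_repo:
# documents = load_python_documents("local_copy_repo/examples")
# print(len(documents))
```

The document count may be larger than the file count, because `LanguageParser` can split a single file into separate documents for its functions and classes.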

In [None]:
#Write your code here

# Step-3: Split the Documents into Chunks

### Instructions to Write the Python Code for Splitting Python Code into Chunks

1. **Import Necessary Modules:**
   - Import the `RecursiveCharacterTextSplitter` and `Language` classes from the `langchain.text_splitter` module.

2. **Initialize the Text Splitter:**
   - Create an instance of `RecursiveCharacterTextSplitter` for the Python language. Define the `chunk_size` (number of characters per chunk) and `chunk_overlap` (number of overlapping characters between chunks).
     ```python
     code_text_splitter = RecursiveCharacterTextSplitter.from_language(chunk_size=3000,
                                                                       chunk_overlap=300,
                                                                       language=Language.PYTHON)
     ```

3. **Split the Documents into Chunks:**
   - Use the `split_documents` method of the `code_text_splitter` instance to split the loaded documents into smaller chunks.

4. **Print the Number and Content of Chunks:**
   - Print the number of generated chunks to verify the splitting process.
   - Print the first chunk to inspect its content.
   - Print the entire list of chunks to see the split content.


By following these steps, you will be able to split Python code documents into smaller chunks using the `RecursiveCharacterTextSplitter` class, making the content easier to manage and process.
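
To make `chunk_size` and `chunk_overlap` concrete, the sketch below pairs a LangChain wrapper with a deliberately simplified, standard-library illustration. `sliding_window_chunks` is a hypothetical helper that cuts fixed-size character windows; it is not the algorithm LangChain uses, since `RecursiveCharacterTextSplitter` prefers to break at language-aware separators such as class and function definitions. `split_code_documents` is also a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Simplified illustration: fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def split_code_documents(documents, chunk_size=3000, chunk_overlap=300):
    """Split loaded documents with LangChain's Python-aware splitter."""
    # Lazy import: requires `pip install langchain` (older import path assumed).
    from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    return splitter.split_documents(documents)


# Usage, assuming `documents` comes from Step-2:
# code_chunks = split_code_documents(documents)
# print(len(code_chunks))
# print(code_chunks[0])
```

In the simplified model, a `chunk_size` of 3000 with a `chunk_overlap` of 300 means each window starts 2700 characters after the previous one, so neighbouring chunks share 300 characters of context.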

In [None]:
#Write your code here

# Step-4: Embeddings and VectorDB

### Instructions to Write the Python Code for Creating Embeddings and Vector Database

1. **Set Up the Environment:**
   - Import necessary modules and set the `OPENAI_API_KEY` environment variable. Ensure you have the required key stored in Google Colab's `userdata`.


2. **Install Required Libraries:**
   - Install the `chromadb` and `tiktoken` libraries using pip.


3. **Import Necessary Classes:**
   - Import the `OpenAIEmbeddings` class from `langchain.embeddings` and the `Chroma` class from `langchain.vectorstores`. Also, import the `tiktoken` library.


4. **Create Embeddings:**
   - Instantiate the `OpenAIEmbeddings` class to create embeddings for the code chunks.


5. **Create Vector Database:**
   - Use the `Chroma.from_documents` method to create a vector database from the code chunks. Specify the embeddings and the directory where the database will be persisted.


6. **Persist the Database:**
   - Persist the vector database to the specified directory.


By following these steps, you will be able to create embeddings for the code chunks and store them in a persistent vector database using the `Chroma` vector store and `OpenAIEmbeddings`.
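
A rough sketch of these steps is shown below, assuming an older LangChain release where `langchain.embeddings` and `langchain.vectorstores` are valid import paths and where the Chroma wrapper still exposes `persist()`. The function names `configure_openai_key` and `build_code_db` and the directory name `code_db_store` are hypothetical; third-party imports are done lazily inside the functions.

```python
import os


def configure_openai_key(key=None):
    """Store the OpenAI key in the OPENAI_API_KEY environment variable.

    If `key` is None, read it from Google Colab's userdata (works only in Colab).
    """
    if key is None:
        from google.colab import userdata
        key = userdata.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("No OpenAI API key provided")
    os.environ["OPENAI_API_KEY"] = key


def build_code_db(code_chunks, persist_directory="code_db_store"):
    """Embed the chunks and store them in a persistent Chroma database."""
    # Lazy imports: require `pip install langchain openai chromadb tiktoken`.
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
    code_db = Chroma.from_documents(
        code_chunks,
        embeddings,
        persist_directory=persist_directory,
    )
    # Assumption: older wrappers need an explicit persist(); newer chromadb
    # versions write to disk automatically.
    code_db.persist()
    return code_db


# Usage in Colab, assuming `code_chunks` comes from Step-3:
# configure_openai_key()
# code_db = build_code_db(code_chunks)
```

Setting the key before creating `OpenAIEmbeddings` matters, because embedding the chunks calls the OpenAI API and incurs usage costs.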

In [None]:
#Write your code here

# Step-5: RAG and Q&A Chain

### Instructions to Write the Python Code for Retrieval-Augmented Generation (RAG) and Q&A Chain

1. **Import Necessary Classes:**
   - Import the `RetrievalQA` and `RetrievalQAWithSourcesChain` classes from `langchain.chains`, and the `OpenAI` class from `langchain.llms`.


2. **Create a Retriever from the Vector Database:**
   - Convert the previously created `code_db` vector store into a retriever object.


3. **Initialize the Language Model:**
   - Create an instance of the `OpenAI` language model with a specified temperature (controls the randomness of the model's output).


4. **Create the RetrievalQA Chain:**
   - Use the `RetrievalQA.from_chain_type` method to create a Q&A chain by specifying the language model (`llm`), chain type, and retriever.


5. **Run Queries on the Code Repository:**
   - Define queries to search for specific code snippets within the repository. Use the `run` method of the `Code_Repo_QandA` chain to get the results.
     ```python
     # Example queries; run the chain once per query.
     for query in ["Code for decision trees",
                   "Code for Random Forest",
                   "Code for Gradient Boosting"]:
         print(Code_Repo_QandA.run(query))
     ```
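
Putting the five steps of this section together might look like the sketch below. It assumes an older LangChain release with the import paths named above. The `"stuff"` chain type and the temperature of 0 are assumptions, since the instructions leave both open, and the helper names `build_qa_chain` and `ask_all` are hypothetical.

```python
def build_qa_chain(code_db, temperature=0):
    """Build a RetrievalQA chain on top of the vector store from Step-4."""
    # Lazy imports: require `pip install langchain openai`.
    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI

    retriever = code_db.as_retriever()
    llm = OpenAI(temperature=temperature)  # 0 keeps answers close to deterministic
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # assumption: put all retrieved chunks into one prompt
        retriever=retriever,
    )


def ask_all(chain, queries):
    """Run each query through `chain.run` and return a {query: answer} dict."""
    return {query: chain.run(query) for query in queries}


# Usage, assuming `code_db` comes from Step-4:
# Code_Repo_QandA = build_qa_chain(code_db)
# answers = ask_all(Code_Repo_QandA, ["Code for decision trees",
#                                     "Code for Random Forest",
#                                     "Code for Gradient Boosting"])
# for query, answer in answers.items():
#     print(query, "\n", answer)
```

Collecting the answers in a dictionary keeps each result paired with its query, which makes it easier to compare outputs across the three example questions.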



In [None]:
#Write your code here