<a href="https://colab.research.google.com/github/rawkintrevo/sme-seeks/blob/main/notebooks/Add_Git_Documents_to_Existing_Index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Step 1: Install Dependencies

In [1]:
!pip install -q llama-index GitPython "pinecone-client[grpc]"


## Config

There are two settings at the moment:

`git_targets` - a list of tuples, each of the form (git URL, friendly name, path to the docs folder inside the repo)

and

`INDEX_NAME` - the name of the Pinecone index that belongs to your API key. The key itself is stored as a Colab secret named `pinecone_api_key_js` (for an explanation of 'why _js?', see the indexing section below).


In [2]:
git_targets = [
    # ("https://github.com/<org>/<repo>.git", "Title", "path/to/docs")
    ("https://github.com/reactjs/react.dev.git", "React Developer Documentation", "src/content"),
    ("https://github.com/react-bootstrap/react-bootstrap.git", "React Bootstrap Documentation", "www/docs"),

]

INDEX_NAME = "firebase-react-helper"
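
Each entry in `git_targets` is unpacked positionally in the loading step, so a malformed tuple would only fail partway through cloning. A minimal sketch of a pre-flight check follows; `validate_git_targets` is a hypothetical helper introduced here for illustration and is not part of the original notebook.

```python
def validate_git_targets(targets):
    """Return a list of problems found in a git_targets-style list.

    Hypothetical helper for illustration: an empty result means every
    entry looks like (git_url, friendly_name, path_to_docs).
    """
    problems = []
    for i, entry in enumerate(targets):
        # Each entry must be a tuple with exactly three fields.
        if not isinstance(entry, tuple) or len(entry) != 3:
            problems.append(f"entry {i}: expected a 3-tuple, got {entry!r}")
            continue
        # Every field must be a non-empty string.
        if not all(isinstance(value, str) and value for value in entry):
            problems.append(f"entry {i}: all three fields must be non-empty strings")
    return problems


example_targets = [
    ("https://github.com/reactjs/react.dev.git", "React Developer Documentation", "src/content"),
    # The docs path is missing from this entry on purpose.
    ("https://github.com/react-bootstrap/react-bootstrap.git", "React Bootstrap Documentation"),
]
problems = validate_git_targets(example_targets)
print(problems)
```

Running something like `validate_git_targets(git_targets)` before the loading cell would surface these mistakes before any clone starts.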

## Magik p1: Loading Documents From Git Repos

This block just:
1. Clones a repo
2. Lists all the files (recursively) in the docs folder (specified in config).
3. Reads the data from each `.md` / `.mdx` file
4. Creates some metadata
5. Appends the `Document` to a list which will be used in the next block.

In [3]:
import os
from datetime import datetime

from git import Repo
from llama_index.readers.schema.base import Document

documents = []

for url, name, path in git_targets:
    print(f"Cloning {url} to {name}")
    Repo.clone_from(url, name)
    for root, dirs, files in os.walk(f"./{name}/{path}"):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # Keep only .md / .mdx files, and skip symlinks (checked against the full path)
            if file_name.endswith((".md", ".mdx")) and not os.path.islink(file_path):
                with open(file_path, 'r', encoding='utf-8') as file:
                    data = file.read()
                # Path relative to the repo root, used for the fallback title and the source URL
                rel_path = os.path.relpath(file_path, name)
                title = name
                # Use the first `title:` line (front matter) if there is one
                for line in data.splitlines():
                    if line.startswith("title:"):
                        title += " - " + line.replace("title:", "", 1).strip()
                        break
                # Otherwise fall back to the file's path within the repo
                if title == name:
                    title += " - " + rel_path
                documents.append(Document(text=data,
                                          metadata={"date": datetime.now().strftime("%Y-%m-%d"),
                                                    # Assumes the repo's default branch is `main`
                                                    "src": f"{url.replace('.git', '')}/blob/main/{rel_path}",
                                                    "title": title}))



Cloning https://github.com/reactjs/react.dev.git to React Developer Documentation
Cloning https://github.com/react-bootstrap/react-bootstrap.git to React Bootstrap Documentation
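
The title and source-URL logic in the cell above can be tried out without cloning anything. The sketch below pulls both into small functions; `extract_title` and `build_source_url` are hypothetical names introduced here, and the `branch` parameter is an assumption added for repos whose default branch is not `main`.

```python
def extract_title(text, fallback):
    """Return the value of the first line starting with 'title:', else the fallback."""
    for line in text.splitlines():
        if line.startswith("title:"):
            return line[len("title:"):].strip()
    return fallback


def build_source_url(git_url, rel_path, branch="main"):
    """Build a GitHub 'blob' URL for a file path relative to the repo root."""
    # Drop a trailing '.git' only, leaving the rest of the URL untouched.
    base = git_url[:-len(".git")] if git_url.endswith(".git") else git_url
    return f"{base}/blob/{branch}/{rel_path}"


sample = "---\ntitle: Quick Start\n---\n\n# Hello\n"
print(extract_title(sample, "intro.md"))
print(build_source_url("https://github.com/reactjs/react.dev.git",
                       "src/content/learn/index.md"))
```

A file with no `title:` line falls back to whatever is passed as `fallback`, which mirrors how the cell above falls back to the file's path.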


## Magik p2: Indexing The New Documents

The following code uploads the documents you just created to your Pinecone index. For this to work, you will need the following keys available as Colab secrets:

- `open_ai_key`: an OpenAI API key, required for creating the embeddings. Is there any way around this? Yes. Do the workarounds work well? Not really. It will cost a couple of dollars and is a worthwhile investment.
- `pinecone_api_key_js`: I have multiple free Pinecone indexes, and I give their keys little flags like `_js` that kind of mean something, which I'll forget whenever I come back to this. You should probably change the line that pulls this key to use your own, more appropriately named key.

You'll also notice a blank cell. I was having some issues because I got a little _too_ cavalier with auto-complete. Net result: I had to run the cell a few times, and the output from the blank cell is what you _should_ see in the output of the next cell.

The final cell just counts how many nodes were inserted.
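
Because the upload depends on both secrets being present, it can help to check for them before doing anything that costs money. This is only a sketch: `missing_secrets` is a hypothetical helper, and a plain dictionary stands in for Colab's secret store so the example runs anywhere.

```python
def missing_secrets(get_secret, names):
    """Return the names for which get_secret raises or returns an empty value."""
    missing = []
    for name in names:
        try:
            value = get_secret(name)
        except Exception:
            # The lookup's error type depends on the secret store, so catch broadly.
            value = None
        if not value:
            missing.append(name)
    return missing


# Stand-in for the real secret lookup; only one of the two keys is defined.
fake_store = {"open_ai_key": "sk-example"}
result = missing_secrets(fake_store.__getitem__, ["open_ai_key", "pinecone_api_key_js"])
print(result)
```

In the notebook itself, the first argument would presumably be `userdata.get` from `google.colab`.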

In [17]:
# NOTE: this cell uses the pre-0.10 llama-index import paths and the pinecone-client 2.x
# `pinecone.init` API; newer releases of both libraries changed these.
from llama_index import VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.vector_stores import PineconeVectorStore
import pinecone
import openai

from google.colab import userdata

# Required for making embeddings. Workarounds exist, but they aren't great: pay the $3 and have it done right.
openai.api_key = userdata.get('open_ai_key')
pinecone.init(api_key=userdata.get("pinecone_api_key_js"), environment="gcp-starter")

# Connect to the existing Pinecone index and wrap it as a llama-index vector store
pinecone_index = pinecone.Index(INDEX_NAME)
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# Parse documents into nodes
print("Parsing new documents into nodes...")
parser = SimpleNodeParser()
new_nodes = parser.get_nodes_from_documents(documents)

# Add nodes to the existing index
print(f"Adding new nodes to the existing index {INDEX_NAME}...")
index.insert_nodes(new_nodes)


Parsing new documents into nodes...


Adding new nodes to the existing index firebase-react-helper...


Upserted vectors:   0%|          | 0/990 [00:00<?, ?it/s]

In [20]:
print(f"Added {len(new_nodes)} to index {INDEX_NAME}")

Added 990 to index firebase-react-helper
