## document preprocessing 


In [6]:
import os 

pdf_path = "./git_magic.pdf"

if os.path.exists(pdf_path):
    print(f"File {pdf_path} exists.")

File ./git_magic.pdf exists.


In [8]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.11-cp38-abi3-macosx_11_0_arm64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.11-cp38-abi3-macosx_11_0_arm64.whl (18.2 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.24.11


In [18]:
import fitz
from tqdm.auto import tqdm 

def text_formatter(text: str) -> str:
    """Apply minor formatting: replace newlines with spaces and trim the ends."""
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text


def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Read a PDF page by page and collect the text plus rough statistics."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({
            "page_number": page_number,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 4,  # 1 token = ~4 chars
            "text": text})
    return pages_and_texts


pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 0,
  'page_char_count': 1419,
  'page_word_count': 228,
  'page_sentence_count_raw': 21,
  'page_token_count': 354.75,
  'text': 'Git Magic Ben Lynn August 2007 Preface Git is a version control Swiss army knife. A reliable versatile multipurpose revision control tool whose extraordinary flexibility makes it tricky to learn, let alone master. As Arthur C. Clarke observed, any suﬀiciently advanced technology is indistin- guishable from magic. This is a great way to approach Git: newbies can ignore its inner workings and view Git as a gizmo that can amaze friends and infuriate enemies with its wondrous abilities. Rather than go into details, we provide rough instructions for particular effects. After repeated use, gradually you will understand how each trick works, and how to tailor the recipes for your needs. • Simplified Chinese: by JunJie, Meng and JiangWei. Converted to Tradi- tional Chinese via cconv -f UTF8-CN -t UTF8-TW. • French: by Alexandre Garel, Paul Gaborit, 
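
The per-page statistics are rough heuristics rather than exact counts. The sketch below recomputes them on a short made-up string to show what each one measures. The `page_stats` helper and the `sample` text are illustrative and not part of the notebook, and the 4-characters-per-token ratio is the same rule of thumb used above, not a real tokenizer.

```python
def page_stats(text: str) -> dict:
    """Recompute the notebook's heuristic statistics for one page of text."""
    cleaned = text.replace("\n", " ").strip()
    return {
        "char_count": len(cleaned),
        "word_count": len(cleaned.split(" ")),           # whitespace-separated words
        "sentence_count_raw": len(cleaned.split(". ")),  # naive split on ". "
        "token_count": len(cleaned) / 4,                 # 1 token = ~4 chars
    }


# Hypothetical page text with one line break in it.
sample = "Git is a version control tool.\nIt tracks changes. It is fast."
stats = page_stats(sample)
print(stats)
```

Splitting on `". "` treats any full stop followed by a space as a sentence boundary, so abbreviations such as "e.g. " would inflate the raw sentence count.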

In [22]:
import random 

random.sample(pages_and_texts, k =3)

[{'page_number': 10,
  'page_char_count': 1776,
  'page_word_count': 293,
  'page_sentence_count_raw': 12,
  'page_token_count': 444.0,
  'text': '$ git push central.server/path/to/proj.git HEAD To check out the source, a developer types: $ git clone central.server/path/to/proj.git After making changes, the developer saves changes locally: $ git commit -a To update to the latest version: $ git pull Any merge conflicts should be resolved then committed: $ git commit -a To check in local changes into the central repository: $ git push If the main server has new changes due to activity by other developers, the push fails, and the developer should pull the latest version, resolve any merge conflicts, then try again. Developers must have SSH access for the above pull and push commands. How- ever, anyone can see the source by typing: $ git clone git://central.server/path/to/proj.git The native git protocol is like HTTP: there is no authentication, so anyone can retrieve the project. Accordin

In [26]:
import pandas as pd 

df = pd.DataFrame(pages_and_texts)
df.head()


| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1419 | 228 | 21 | 354.75 | Git Magic Ben Lynn August 2007 Preface Git is ... |
| 1 | 1 | 1795 | 264 | 17 | 448.75 | • PDF file: printer-friendly. • EPUB file: E-r... |
| 2 | 2 | 2280 | 400 | 29 | 570.0 | Introduction I’ll use an analogy to introduce ... |
| 3 | 3 | 2681 | 463 | 26 | 670.25 | Distributed Control Now imagine a very diﬀicul... |
| 4 | 4 | 1853 | 318 | 20 | 463.25 | system, but using systems that scale poorly fo... |


In [28]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
| mean | 21.5 | 1915.5 | 325.07 | 17.98 | 478.88 |
| std | 12.85 | 345.54 | 61.08 | 4.74 | 86.38 |
| min | 0.0 | 1084.0 | 180.0 | 10.0 | 271.0 |
| 25% | 10.75 | 1707.5 | 285.0 | 14.0 | 426.88 |
| 50% | 21.5 | 1876.5 | 320.5 | 18.0 | 469.12 |
| 75% | 32.25 | 2106.75 | 365.0 | 22.0 | 526.69 |
| max | 43.0 | 2681.0 | 463.0 | 29.0 | 670.25 |


In [35]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")



<spacy.pipeline.sentencizer.Sentencizer at 0x30c4b5b10>

In [47]:
pages_and_texts[10]

{'page_number': 10,
 'page_char_count': 1776,
 'page_word_count': 293,
 'page_sentence_count_raw': 12,
 'page_token_count': 444.0,
 'text': '$ git push central.server/path/to/proj.git HEAD To check out the source, a developer types: $ git clone central.server/path/to/proj.git After making changes, the developer saves changes locally: $ git commit -a To update to the latest version: $ git pull Any merge conflicts should be resolved then committed: $ git commit -a To check in local changes into the central repository: $ git push If the main server has new changes due to activity by other developers, the push fails, and the developer should pull the latest version, resolve any merge conflicts, then try again. Developers must have SSH access for the above pull and push commands. How- ever, anyone can see the source by typing: $ git clone git://central.server/path/to/proj.git The native git protocol is like HTTP: there is no authentication, so anyone can retrieve the project. Accordingly, b

In [55]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item["page_senteces_count_spacy"] = len(item["sentences"])

  0%|          | 0/44 [00:00<?, ?it/s]

In [57]:
random.sample(pages_and_texts, k=1)

[{'page_number': 15,
  'page_char_count': 1470,
  'page_word_count': 272,
  'page_sentence_count_raw': 13,
  'page_token_count': 367.5,
  'text': 'In some directory: $ echo "I\'m smarter than my boss" > myfile.txt $ git init $ git add . $ git commit -m "Initial commit" We have created a Git repository that tracks one text file containing a certain message. Now type: $ git checkout -b boss # nothing seems to change after this $ echo "My boss is smarter than me" > myfile.txt $ git commit -a -m "Another commit" It looks like we’ve just overwritten our file and committed it. But it’s an illusion. Type: $ git checkout master # switch to original version of the file and hey presto! The text file is restored. And if the boss decides to snoop around this directory, type: $ git checkout boss # switch to version suitable for boss\' eyes You can switch between the two versions of the file as much as you like, and commit to each independently. Dirty Work Say you’re working on some feature, and for

In [59]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_senteces_count_spacy |
|---|---|---|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
| mean | 21.5 | 1915.5 | 325.07 | 17.98 | 478.88 | 19.27 |
| std | 12.85 | 345.54 | 61.08 | 4.74 | 86.38 | 5.03 |
| min | 0.0 | 1084.0 | 180.0 | 10.0 | 271.0 | 10.0 |
| 25% | 10.75 | 1707.5 | 285.0 | 14.0 | 426.88 | 15.0 |
| 50% | 21.5 | 1876.5 | 320.5 | 18.0 | 469.12 | 20.0 |
| 75% | 32.25 | 2106.75 | 365.0 | 22.0 | 526.69 | 23.0 |
| max | 43.0 | 2681.0 | 463.0 | 29.0 | 670.25 | 29.0 |


In [61]:
num_sentence_chunk_size = 10 

def split_list(input_list: list[str],
               slice_size: int = num_sentence_chunk_size) ->list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]
test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [65]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/44 [00:00<?, ?it/s]

In [67]:
random.sample(pages_and_texts, k=1)

[{'page_number': 40,
  'page_char_count': 1812,
  'page_word_count': 291,
  'page_sentence_count_raw': 19,
  'page_token_count': 453.0,
  'text': 'SHA1 Weaknesses As time passes, cryptographers discover more and more SHA1 weaknesses. Al- ready, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1. Microsoft Windows Git on Microsoft Windows can be cumbersome: • Cygwin, a Linux-like environment for Windows, contains a Windows port of Git. • Git for Windows is an alternative requiring minimal runtime support, though a few of the commands need some work. Unrelated Files If your project is very large and contains many unrelated files that are constantly being changed, Git may be disadvantaged more than other systems because single files are not tracked. Git tracks changes to the

In [71]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_senteces_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 | 44.0 |
| mean | 21.5 | 1915.5 | 325.07 | 17.98 | 478.88 | 19.27 | 2.39 |
| std | 12.85 | 345.54 | 61.08 | 4.74 | 86.38 | 5.03 | 0.54 |
| min | 0.0 | 1084.0 | 180.0 | 10.0 | 271.0 | 10.0 | 1.0 |
| 25% | 10.75 | 1707.5 | 285.0 | 14.0 | 426.88 | 15.0 | 2.0 |
| 50% | 21.5 | 1876.5 | 320.5 | 18.0 | 469.12 | 20.0 | 2.0 |
| 75% | 32.25 | 2106.75 | 365.0 | 22.0 | 526.69 | 23.0 | 3.0 |
| max | 43.0 | 2681.0 | 463.0 | 29.0 | 670.25 | 29.0 | 3.0 |
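
The `num_chunks` column follows directly from the sentence counts: with a slice size of 10, a page with `n` sentences yields `ceil(n / 10)` chunks. The sketch below checks this for the minimum, median and maximum spaCy sentence counts reported above (10, 20 and 29). `split_list` is repeated here only so the block runs on its own.

```python
import math


def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# Sentence counts taken from the min, 50% and max rows of the statistics above.
for n_sentences in (10, 20, 29):
    chunks = split_list(list(range(n_sentences)))
    assert len(chunks) == math.ceil(n_sentences / 10)
    print(f"{n_sentences} sentences -> {len(chunks)} chunks, last chunk holds {len(chunks[-1])}")
```

The last chunk of a page can be much shorter than the rest: a page with 11 sentences would end with a one-sentence chunk.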


In [73]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts): 
    for sentence_chunk in item["sentence_chunks"]: 
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences into a single paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (works for any capital letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

        pages_and_chunks.append(chunk_dict) 

len(pages_and_chunks)

  0%|          | 0/44 [00:00<?, ?it/s]

105
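
Because the sentences are joined with an empty string, a sentence boundary ends up with no space, and the regular expression then repairs only the full-stop case. The sketch below reproduces that behaviour on made-up sentences, assuming each sentence string carries no trailing whitespace. `join_chunk` is a hypothetical helper that mirrors the two lines in the loop above.

```python
import re


def join_chunk(sentences: list[str]) -> str:
    """Join sentence strings the same way the chunking loop does."""
    joined = "".join(sentences).replace("  ", " ").strip()
    # Re-insert the space after a full stop that is glued to a capital letter.
    return re.sub(r"\.([A-Z])", r". \1", joined)


# Hypothetical sentences without trailing whitespace.
print(join_chunk(["Git is fast.", "It is distributed."]))
print(join_chunk(["Did you forget a file?", "Run git add."]))
```

The second call leaves `?Run` glued together, since the pattern only matches a full stop. Joining with `" ".join(...)` would be one way to avoid the repair step altogether, at the cost of slightly different character counts.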

In [75]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 37,
  'sentence_chunk': 'Git is content-addressable: files are not stored according to their filename, but rather by the hash of the data they contain, in a file we call a blob object. We can think of the hash as a unique ID for a file’s contents, so in a sense we are addressing files by their content. The initial blob 6 is merely a header consisting of the object type and its length in bytes; it simplifies internal bookkeeping. Thus I could easily predict what you would see. The file’s name is irrelevant: only the data inside is used to construct the blob object. You may be wondering what happens to identical files. Try adding copies of your file, with any filenames whatsoever. The contents of .git/objects stay the same no matter how many you add. Git only stores the data once. By the way, the files within .git/objects are compressed with zlib so you should not stare at them directly.',
  'chunk_char_count': 873,
  'chunk_word_count': 158,
  'chunk_token_count': 218.2

In [77]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 105.0 | 105.0 | 105.0 | 105.0 |
| mean | 21.76 | 800.92 | 135.04 | 200.23 |
| std | 12.68 | 364.26 | 60.72 | 91.07 |
| min | 0.0 | 41.0 | 9.0 | 10.25 |
| 25% | 11.0 | 580.0 | 100.0 | 145.0 |
| 50% | 21.0 | 886.0 | 148.0 | 221.5 |
| 75% | 32.0 | 1028.0 | 171.0 | 257.0 |
| max | 43.0 | 1658.0 | 272.0 | 414.5 |


In [79]:
df.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|---|
| 0 | 0 | Git Magic Ben Lynn August 2007 Preface Git is ... | 950 | 150 | 237.5 |
| 1 | 0 | Italian: by Mattia Rigotti. •Korean: by Jung-H... | 456 | 66 | 114.0 |
| 2 | 1 | • PDF file: printer-friendly. •EPUB file: E-re... | 924 | 130 | 231.0 |
| 3 | 1 | François Marier maintains the Debian package o... | 863 | 127 | 215.75 |
| 4 | 2 | Introduction I’ll use an analogy to introduce ... | 717 | 129 | 179.25 |


### filter chunks of text for short chunks 

In [84]:
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(3).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 29.5 | Text: GitHub has an interface that facilitates this: fork the ”gitmagic” project, push your changes, then ask me to merge.44
Chunk token count: 29.25 | Text: Developers clone your project from it, and push the latest oﬀicial changes to it. Typically it resides on a server 11
Chunk token count: 19.75 | Text: Now you’re in the master branch again, with Part II in the working directory.18


> No need to filter: the short chunks sampled above still contain meaningful sentences.
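
For a document where the short chunks turn out to be headers or footers, a filter on the token estimate could look like the sketch below. The records in `example_chunks` are made up and only mimic the shape of `pages_and_chunks`. The threshold reuses `min_token_length = 30` from the cell above.

```python
min_token_length = 30

# Made-up records in the same shape as the notebook's chunk dictionaries.
example_chunks = [
    {"page_number": 17, "sentence_chunk": "Now you're in the master branch again.", "chunk_token_count": 19.75},
    {"page_number": 2, "sentence_chunk": "A longer chunk of text ...", "chunk_token_count": 179.25},
]

# Keep only the chunks whose estimated token count exceeds the threshold.
kept = [chunk for chunk in example_chunks if chunk["chunk_token_count"] > min_token_length]
print(f"{len(kept)} of {len(example_chunks)} chunks kept")
```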

### embedding our text chunks 

In [105]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path= "all-mpnet-base-v2",
                                      device = "cpu")



In [107]:
for item in tqdm(pages_and_chunks):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/105 [00:00<?, ?it/s]

In [117]:
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks]
text_chunks[57]


'$ git bisect reset Instead of testing every change by hand, automate the search by running: $ git bisect run my_script Git uses the return value of the given command, typically a one-off script, to decide whether a change is good or bad: the command should exit with code 0 when good, 125 when the change should be skipped, and anything else between 1 and 127 if it is bad. A negative return value aborts the bisect. You can do much more: the help page explains how to visualize bisects, examine or replay the bisect log, and eliminate known innocent changes for a speedier search. Who Made It All Go Wrong?Like many other version control systems, Git has a blame command: $ git blame bug.c which annotates every line in the given file showing who last changed it, and when. Unlike many other version control systems, this operation works offline, reading only from local disk. Personal Experience In a centralized version control system, history modification is a diﬀicult oper- ation, and only ava

In [119]:
len(text_chunks)

105

In [121]:
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can experiment to find which batch size leads to best results
                                               convert_to_tensor=True)
text_chunk_embeddings                        

tensor([[ 0.0217,  0.0315, -0.0532,  ..., -0.0096,  0.0194, -0.0360],
        [ 0.0210,  0.0199, -0.0226,  ...,  0.0085, -0.0036, -0.0356],
        [ 0.0474, -0.0186, -0.0213,  ...,  0.0123,  0.0027, -0.0482],
        ...,
        [ 0.0323, -0.0361, -0.0270,  ..., -0.0407,  0.0150, -0.0252],
        [ 0.0504, -0.0207, -0.0460,  ..., -0.0041, -0.0301, -0.0319],
        [ 0.0068,  0.0603, -0.0189,  ..., -0.0128,  0.0024, -0.0053]])

### save embedding to file 

In [132]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [134]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 0 | Git Magic Ben Lynn August 2007 Preface Git is ... | 950 | 150 | 237.5 | [ 2.16734037e-02 3.14504728e-02 -5.31752259e-... |
| 1 | 0 | Italian: by Mattia Rigotti. •Korean: by Jung-H... | 456 | 66 | 114.0 | [ 2.10260246e-02 1.99276339e-02 -2.26166267e-... |
| 2 | 1 | • PDF file: printer-friendly. •EPUB file: E-re... | 924 | 130 | 231.0 | [ 4.74463515e-02 -1.85674764e-02 -2.13095807e-... |
| 3 | 1 | François Marier maintains the Debian package o... | 863 | 127 | 215.75 | [ 8.51391669e-05 2.84147169e-02 -2.16863006e-... |
| 4 | 2 | Introduction I’ll use an analogy to introduce ... | 717 | 129 | 179.25 | [-1.14341658e-02 -3.59649956e-02 -1.53461462e-... |
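
CSV stores every cell as text, so the `embedding` column comes back as a string such as `[ 2.16734037e-02 3.14504728e-02 ...]` rather than as an array. The sketch below shows the round trip on a made-up three-value vector using only the standard library. It illustrates the idea and stands in for, rather than reproduces, what pandas and NumPy do.

```python
import csv
import io

# A made-up three-dimensional embedding, formatted the way NumPy prints arrays.
original = [0.0216734037, 0.0314504728, -0.0531752259]
cell = "[ " + " ".join(f"{x:.8e}" for x in original) + "]"

# Write the cell to CSV text and read it back: it returns as a plain string.
buffer = io.StringIO()
csv.writer(buffer).writerows([["embedding"], [cell]])
buffer.seek(0)
loaded = list(csv.reader(buffer))[1][0]
print(type(loaded).__name__, loaded)

# Strip the brackets and split on whitespace to recover the numbers.
recovered = [float(x) for x in loaded.strip("[]").split()]
print(recovered)
```

For larger collections a binary format (for example saving the tensor separately) avoids both the parsing step and the loss of precision from printing, though the notebook keeps everything in one CSV.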


### rag 

In [139]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it was converted to a string when saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert our embeddings into a torch.tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist(), axis=0), dtype=torch.float32).to(device)

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embedding_df

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | 0 | Git Magic Ben Lynn August 2007 Preface Git is ... | 950 | 150 | 237.50 | [0.0216734037, 0.0314504728, -0.0531752259, -0... |
| 1 | 0 | Italian: by Mattia Rigotti. •Korean: by Jung-H... | 456 | 66 | 114.00 | [0.0210260246, 0.0199276339, -0.0226166267, -0... |
| 2 | 1 | • PDF file: printer-friendly. •EPUB file: E-re... | 924 | 130 | 231.00 | [0.0474463515, -0.0185674764, -0.0213095807, 0... |
| 3 | 1 | François Marier maintains the Debian package o... | 863 | 127 | 215.75 | [8.51391669e-05, 0.0284147169, -0.0216863006, ... |
| 4 | 2 | Introduction I’ll use an analogy to introduce ... | 717 | 129 | 179.25 | [-0.0114341658, -0.0359649956, -0.0153461462, ... |
| ... | ... | ... | ... | ... | ... | ... |
| 100 | 42 | Global Counter Some centralized version contro... | 980 | 158 | 245.00 | [0.0134130493, 0.0685647056, -0.00317752501, -... |
| 101 | 42 | Unfortunately, with respect to commits, git do... | 933 | 148 | 233.25 | [-0.00950409845, -0.013323958, 0.0379692167, -... |
| 102 | 42 | Interface Quirks For commits A and B, the mean... | 197 | 36 | 49.25 | [0.0323276184, -0.0361246839, -0.0269896649, -... |
| 103 | 43 | Translating This Guide I recommend the followi... | 961 | 157 | 240.25 | [0.0503651574, -0.0206981469, -0.0459663495, 0... |


In [141]:
embeddings.shape

torch.Size([105, 768])

In [143]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device)



In [149]:
# 1. Define the query
query = "git structure"
print(f"Query: {query}")

# 2. Embed the query
# Note: it's important to embed your query with the same model you used to embed your passages
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# 3. Get similarity scores with the dot product (use cosine similarity if outputs of model aren't normalized)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer() 

print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep top 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product 

Query: git structure
[INFO] Time taken to get scores on 105 embeddings: 0.00223 seconds.


torch.return_types.topk(
values=tensor([0.6155, 0.6144, 0.6102, 0.5949, 0.5902]),
indices=tensor([94, 88, 44, 87, 79]))
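
The retrieval step is a dot product between the query vector and every stored vector, followed by a top-k selection. The sketch below spells that out in plain Python on toy two-dimensional vectors. `dot` and `top_k` are hypothetical helpers standing in for `util.dot_score` and `torch.topk`, and the vectors are made up.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], corpus: list[list[float]], k: int = 2) -> list[tuple[float, int]]:
    """Return the k highest (score, index) pairs, best first."""
    scored = [(dot(query, row), idx) for idx, row in enumerate(corpus)]
    return sorted(scored, reverse=True)[:k]


# Toy 2-d "embeddings"; the real ones in this notebook have 768 dimensions.
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.8, 0.6]

for score, idx in top_k(toy_query, toy_corpus):
    print(f"index={idx} score={score:.2f}")
```

The returned indices point back into the list of chunks, which is how the text and page number of each hit are looked up.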

In [151]:
larger_embeddings = torch.randn(100*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

# Perform dot product across 10,500 random embeddings (100x the original 105)
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer() 

print(f"[INFO] Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

Embeddings shape: torch.Size([10500, 768])
[INFO] Time taken to get scores on 10500 embeddings: 0.00279 seconds.


In [153]:
larger_embeddings = torch.randn(100*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

# Perform dot product across 10,500 random embeddings (100x the original 105)
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer() 

print(f"[INFO] Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

Embeddings shape: torch.Size([10500, 768])
[INFO] Time taken to get scores on 10500 embeddings: 0.00131 seconds.


In [155]:
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [157]:
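query = "git command to save "
print(f"Query: '{query}'\n")
print("Results:")
# NOTE: top_results_dot_product still holds the scores for the earlier query ("git structure");
# re-run the embedding and dot-score step with this query to get results that match it.
# Loop through the scores and indices returned by torch.topk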
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'git command to save '

Results:
Score: 0.6155
Text:
The current head is kept in the file .git/HEAD, which contains a hash of a
commit object. The hash gets updated during a commit as well as many other
commands. Branches are almost the same: they are files in .git/refs/heads. Tags
too: they live in .git/refs/tags but they are updated by a different set of
commands. Git Shortcomings There are some Git issues I’ve swept under the
carpet. Some can be handled easily with scripts and hooks, some require
reorganizing or redefining the project, and for the few remaining annoyances,
one will just have to wait. Or better yet, pitch in and help!40
Page number: 39


Score: 0.6144
Text:
Git is content-addressable: files are not stored according to their filename,
but rather by the hash of the data they contain, in a file we call a blob
object. We can think of the hash as a unique ID for a file’s contents, so in a
sense we are addressing files by their content. The initial blob 6 is merely 

In [159]:
import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)

# Example vectors/tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))


Dot product between vector1 and vector2: tensor(14.)
Dot product between vector1 and vector3: tensor(32.)
Dot product between vector1 and vector4: tensor(-14.)
Cosine similarity between vector1 and vector2: tensor(1.0000)
Cosine similarity between vector1 and vector3: tensor(0.9746)
Cosine similarity between vector1 and vector4: tensor(-1.0000)
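
The outputs above show why the choice of score matters: the dot product grows with vector length, while cosine similarity depends only on direction. Once vectors are scaled to unit length the two coincide, which is why a plain dot product is enough for normalized embeddings. The sketch below checks this in plain Python. The helper names are illustrative, not the notebook's.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


v1, v3 = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(round(dot(v1, v3), 4))                        # raw dot product
print(round(cosine(v1, v3), 4))                     # cosine similarity
print(round(dot(normalize(v1), normalize(v3)), 4))  # dot product of unit vectors
```

The last two values agree, matching the 0.9746 reported for `vector1` and `vector3` above.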


In [161]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on ({len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Finds relevant passages given a query and prints them out along with their scores.
    """
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    # Loop through zipped together scores and indices from torch.topk
    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Text:")
        print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
        print(f"Page number: {pages_and_chunks[idx]['page_number']}")
        print("\n")

In [163]:
query="git commit change"
# retrieve_relevant_resources(query=query, embeddings=embeddings) 
print_top_results_and_scores(query=query, embeddings=embeddings)

[INFO] Time taken to get scores on (105 embeddings: 0.00006 seconds.
Score: 0.6545
Text:
Some developers strongly feel history should be immutable, warts and all. Oth-
ers feel trees should be made presentable before they are unleashed in public.
Git accommodates both viewpoints. Like cloning, branching, and merging, rewrit-
ing history is simply another power Git gives you. It is up to you to use it
wisely. I Stand Corrected Did you just commit, but wish you had typed a
different message?Then run: $ git commit --amend to change the last message.
Realized you forgot to add a file?Run git add to add it, and then run the above
command. Want to include a few more edits in that last commit?
Page number: 20


Score: 0.6131
Text:
Then make those edits and run: $ git commit --amend -a … And Then Some Suppose
the previous problem is ten times worse. After a lengthy session you’ve made a
bunch of commits. But you’re not quite happy with the way they’re organized, and
some of those commit messag

In [165]:
import psutil

# Get total available system memory in GB
system_memory_gb = psutil.virtual_memory().total / (1024 ** 3)

# Logic for handling model selection and precision based on available memory
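if system_memory_gb < 8:
    print(f"Your available system memory is {system_memory_gb:.2f}GB. You may not have enough memory to run Gemma models locally without quantization.")
    # Fall back to the smallest model with quantization so the variables printed below are always defined
    use_quantization_config = True
    model_id = "google/gemma-2b-it"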
elif system_memory_gb < 16:
    print(f"System memory: {system_memory_gb:.2f}GB | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif system_memory_gb < 32:
    print(f"System memory: {system_memory_gb:.2f}GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "google/gemma-2b-it"
else:
    print(f"System memory: {system_memory_gb:.2f}GB | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

# Output the chosen settings
print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")


System memory: 16.00GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it


In [183]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization settings (only applied when use_quantization_config is True)
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

attn_implementation = "sdpa"  # Use regular scaled dot product attention
print(f"Using attention implementation: {attn_implementation}")


tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)


llm_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.float16,  # Use float16 for better performance on M1
    quantization_config=quantization_config if use_quantization_config else None,
    low_cpu_mem_usage=True,  # Important for M1 Pro, manage memory efficiently
    attn_implementation=attn_implementation
)


if not use_quantization_config:
    if torch.backends.mps.is_available():
        llm_model.to("mps")  # Move model to Metal Performance Shaders (MPS) backend
        print("Model moved to MPS (Metal Performance Shaders) backend.")
    else:
        llm_model.to("cpu")  # Fallback to CPU if MPS is not available
        print("MPS not available, model moved to CPU.")
else:
    llm_model.to("cpu")  # Quantization might require sticking to CPU if MPS isn't supported in quantization workflows

print(f"Model is ready on device: {llm_model.device.type.upper()}")  # report where the model actually is


Using attention implementation: sdpa


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Model moved to MPS (Metal Performance Shaders) backend.
Model is ready on device: MPS


In [185]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((2048,), eps=1e-

In [187]:
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(llm_model)

2506172416
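
The total can be cross-checked by hand from the layer shapes in the model printout above. The sketch below is a back-of-the-envelope check, not part of the original notebook. Because the printout is truncated, it assumes two things that are not visible there: that the model ends with one final RMSNorm of size 2048, and that the output head shares its weights with `embed_tokens`, so it adds no parameters of its own.

```python
# Shapes taken from the GemmaForCausalLM printout above
vocab_size, hidden, kv_dim, mlp_dim, num_layers = 256_000, 2048, 256, 16_384, 18

embedding = vocab_size * hidden                          # embed_tokens
attention = 2 * hidden * hidden + 2 * hidden * kv_dim    # q_proj + o_proj, k_proj + v_proj
mlp = 3 * hidden * mlp_dim                               # gate_proj, up_proj, down_proj
layer_norms = 2 * hidden                                 # input + post-attention RMSNorm weights
per_layer = attention + mlp + layer_norms

# Assumption: one final RMSNorm and a weight-tied output head
# (neither is visible in the truncated printout)
final_norm = hidden
total_params = embedding + num_layers * per_layer + final_norm
print(total_params)  # 2506172416
```

Under those assumptions the hand count matches the value returned by `get_model_num_params`. Roughly a fifth of the parameters sit in the embedding table, a consequence of the 256,000-token vocabulary.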

In [189]:
def get_model_mem_size(model: torch.nn.Module):
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate model sizes
    model_mem_bytes = mem_params + mem_buffers
    model_mem_mb = model_mem_bytes / (1024**2)
    model_mem_gb = model_mem_bytes / (1024**3) 

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2), 
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 5012354048, 'model_mem_mb': 4780.15, 'model_mem_gb': 4.67}
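
The reported size follows almost directly from the parameter count: the model was loaded with `torch_dtype=torch.float16`, so each parameter takes 2 bytes. A quick arithmetic check using only the numbers printed above:

```python
num_params = 2_506_172_416        # from get_model_num_params(llm_model)
bytes_per_param = 2               # float16
reported_bytes = 5_012_354_048    # from get_model_mem_size(llm_model)

param_bytes = num_params * bytes_per_param
buffer_bytes = reported_bytes - param_bytes   # the remainder comes from model buffers

print(param_bytes)                          # 5012344832
print(buffer_bytes)                         # 9216
print(round(reported_bytes / 1024**3, 2))   # 4.67
```

The parameters account for nearly all of the footprint; buffers add only a few kilobytes. Loading the same weights in float32 would double the parameter footprint.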

In [193]:
input_text = "What is the  significance of git "
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What is the  significance of git 

Prompt (formatted):
<bos><start_of_turn>user
What is the  significance of git<end_of_turn>
<start_of_turn>model
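
To make the template's effect concrete, the single-turn prompt printed above can be rebuilt by hand. This is only an illustrative sketch: `format_gemma_user_turn` is a hypothetical helper, and the layout is inferred from the printed output (including the stripped trailing space), not from the template source. In practice `tokenizer.apply_chat_template` should remain the source of truth, since the template ships with the model.

```python
def format_gemma_user_turn(user_message: str) -> str:
    """Hypothetical helper: rebuild the single-turn prompt printed above by hand."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

manual_prompt = format_gemma_user_turn("What is the  significance of git ")
print(manual_prompt)
```

The trailing `<start_of_turn>model` line is what `add_generation_prompt=True` contributes: it leaves the prompt open at the start of the model's turn.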



In [195]:
tokenizer

GemmaTokenizerFast(name_or_path='google/gemma-2b-it', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	5: AddedToken("<2mass>", rstrip=False, lstrip=False, single_w

In [199]:

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")


input_ids = tokenizer(prompt, return_tensors="pt").to(device)


outputs = llm_model.generate(**input_ids, max_new_tokens=256)


outputs_cpu = outputs.to("cpu")

print(f"Model output (tokens):\n{outputs_cpu[0]}\n")


Using device: mps


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   1841,    603,    573,    139,
          2470,  46407,    576,  25489,    107,    108,    106,   2516,    108,
           688,  91168,    576,  28750,  66058,    109,    688,   7198,   7572,
          1479,    591, 235330,   6172,   1245,    688,    108, 235287,  28750,
           603,    476,  10276,   3797,   2582,   1812,    591, 235330,   6172,
        235275,    674,   8563,   6211,    577,   7029,   4559,   1644,    577,
           476,   3542,   1163,   1069, 235265,    108, 235287,   1165,   7154,
         26337,  11441, 235269,  67127,    577, 235269,    578,  12607,   2167,
         16062,    576,    476,   3542, 235265,    108, 235287,  28750,   6572,
           476,   3110,   4281,    576,   4559, 235269,   3547,    665,  10154,
           604,  26337,    577,   3508,    573,  14764,    576,    476,   3542,
        235265,    109,    688, 134318,    578,   6698,  53041,  66058,    108,
        235287,  

In [201]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What is the  significance of git<end_of_turn>
<start_of_turn>model
**Significance of Git:**

**Version Control System (VCS):**
* Git is a powerful version control system (VCS) that allows users to track changes made to a project over time.
* It helps developers identify, revert to, and manage different versions of a project.
* Git provides a clear history of changes, making it easier for developers to understand the evolution of a project.

**Collaboration and Code Sharing:**
* Git facilitates collaboration among multiple developers by allowing them to work on the same project simultaneously.
* It enables developers to track changes made by others, merge them into their own versions, and share the project with others.
* This promotes teamwork and reduces the risk of losing changes.

**Code Maintenance and Refactoring:**
* Git allows developers to track changes made to code over time, making it easier to identify and fix bugs, refact
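
Note that the decoded output starts with `<bos><bos>`, and the token tensor starts with `2, 2`. A likely cause is that the chat template already writes `<bos>` into the prompt string, and the tokenizer call then prepends another one by default. Passing `add_special_tokens=False` when tokenizing an already-templated prompt is the usual way to avoid this. The sketch below only illustrates the effect on plain Python lists; `drop_duplicate_bos` is a hypothetical helper, not a `transformers` function.

```python
BOS_ID = 2  # id of <bos> in the tokenizer printout above

def drop_duplicate_bos(token_ids: list[int], bos_id: int = BOS_ID) -> list[int]:
    """Collapse a run of leading BOS tokens into a single one."""
    start = 0
    while (start + 1 < len(token_ids)
           and token_ids[start] == bos_id
           and token_ids[start + 1] == bos_id):
        start += 1
    return token_ids[start:]

generated_prefix = [2, 2, 106, 1645, 108]  # first ids of the output tensor above
print(drop_duplicate_bos(generated_prefix))  # [2, 106, 1645, 108]
```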

In [203]:

gpt4_git_questions = [
    "What is Git and how does it differ from centralized version control systems?",
    "Explain the concept of branching in Git and its benefits in project management.",
    "What are merge conflicts in Git and how can they be resolved?",
    "Describe the process of creating and switching between branches in Git.",
    "How does Git handle file renaming and deletions during version control?"
]


manual_git_questions = [
    "What is the command to initialize a new Git repository?",
    "How can you view the commit history in Git?",
    "Explain the purpose of the 'git clone' command.",
    "What is the significance of the HEAD in Git, and how can it be reset?",
    "How does Git handle distributed version control compared to centralized systems?"
]

query_list = gpt4_git_questions + manual_git_questions
query_list


['What is Git and how does it differ from centralized version control systems?',
 'Explain the concept of branching in Git and its benefits in project management.',
 'What are merge conflicts in Git and how can they be resolved?',
 'Describe the process of creating and switching between branches in Git.',
 'How does Git handle file renaming and deletions during version control?',
 'What is the command to initialize a new Git repository?',
 'How can you view the commit history in Git?',
 "Explain the purpose of the 'git clone' command.",
 'What is the significance of the HEAD in Git, and how can it be reset?',
 'How does Git handle distributed version control compared to centralized systems?']

In [205]:
import random

query = random.choice(query_list)
print(f"Query: {query}") 

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Query: What is Git and how does it differ from centralized version control systems?
[INFO] Time taken to get scores on (105 embeddings: 0.00154 seconds.


(tensor([0.6499, 0.6471, 0.6118, 0.6064, 0.6017], dtype=torch.float32),
 tensor([58, 83, 22, 86, 29]))

In [209]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What is a Git branch?
Answer: A Git branch represents an independent line of development in a project. Branching allows developers to work on features or bug fixes separately from the main codebase, often referred to as the master or main branch. When a branch is created, it contains all the history of the original branch, but changes made in the new branch do not affect the original branch until a merge is performed. This isolation of changes allows for better collaboration and experimentation without disrupting the main project.
\nExample 2:
Query: How do you resolve merge conflicts in Git?
Answer: Merge conflicts in Git occur when two branches have modifications to the same line in a file or when one branch deletes a file that another branch has modified. Git will pause the merge and mark the conflict in the affected files. To resolve the conflict, a developer needs to manually edit the files by selecting which changes to keep or by combining parts of both conflicting changes. Once resolved, the developer must stage the file and complete the merge by committing the resolved changes.
\nExample 3:
Query: What is the purpose of 'git stash'?
Answer: The 'git stash' command temporarily saves your modified and staged changes in a stack of unfinished changes, allowing you to switch branches or perform other tasks without committing incomplete work. The saved changes can be reapplied later using 'git stash apply'. This is especially useful when you need to quickly change context and return to a clean working directory without losing your current work.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    base_prompt = base_prompt.format(context=context,
                                     query=query)

    # Create prompt template for instruction-tuned model 
    dialogue_template = [
        {"role": "user",
         "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                           tokenize=False,
                                           add_generation_prompt=True)
    
    return prompt

query = random.choice(query_list) 
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format our prompt
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

Query: Describe the process of creating and switching between branches in Git.
[INFO] Time taken to get scores on (105 embeddings: 0.00042 seconds.
<bos><start_of_turn>user
Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: What is a Git branch?
Answer: A Git branch represents an independent line of development in a project. Branching allows developers to work on features or bug fixes separately from the main codebase, often referred to as the master or main branch. When a branch is created, it contains all the history of the original branch, but changes made in the new branch do not affect the original branch until a merge is performed. This isolation of changes allows for better 

In [215]:

device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")


input_ids = tokenizer(prompt, return_tensors="pt").to(device)


outputs = llm_model.generate(
    **input_ids,
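    temperature=0.7,  # Lower values make sampling more deterministic, higher values more varied
    do_sample=True,   # Sample from the token distribution instead of always taking the most likely token
    max_new_tokens=256  # Upper bound on the number of newly generated tokens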
)


outputs_cpu = outputs.to("cpu")

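# Decode only the newly generated tokens. Slicing the decoded string by len(prompt)
# cuts into the answer, because the prompt string contains special tokens that
# skip_special_tokens=True removes from the decoded text.
prompt_token_count = input_ids["input_ids"].shape[1]
response = tokenizer.decode(outputs_cpu[0][prompt_token_count:],
                            skip_special_tokens=True).strip()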


print(f"Query: {query}")
print(f"RAG answer:\n{response}")


Using device: mps
Query: Describe the process of creating and switching between branches in Git.
RAG answer:
creating and switching between branches in Git:

1. **Creating a Branch:** To create a new branch, the `git checkout -b` command is used. This command takes the name of the new branch as its argument. The new branch is created in the current directory, and the working directory is saved to a separate location for the duration of the branch.


2. **Switching between Branches:** To switch between branches, the `git checkout` command is used. This command takes the name of the branch to switch to as its argument. The working directory is then changed to the specified branch.


3. **Branching Off Retroactively:** In some cases, it may be necessary to branch off a branch that was created at a later time. This can be done using the `git branch -m` command, which creates a new branch that is based on the existing branch. The new branch is named based on the original branch, with the su
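
The `temperature` argument used in the generation call rescales the model's logits before sampling: values below 1 sharpen the distribution towards the most likely token, while values above 1 flatten it. A small self-contained illustration, using made-up logits for three candidate tokens (the numbers are illustrative and not taken from the model):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

example_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(example_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest temperature almost all of the probability mass lands on the top-scoring token, which is why low-temperature runs tend to produce more repeatable answers.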

In [219]:
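# Baseline for comparison: send the raw query to the model, with no retrieved context and no chat template
query = random.choice(query_list)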
print(f"Query: {query}")

input_ids = tokenizer(query, return_tensors="pt").to(device)

outputs = llm_model.generate(
    **input_ids,
    temperature=0.2,
    do_sample=True,
    max_new_tokens=256
)

outputs_cpu = outputs.to("cpu")
output_text = tokenizer.decode(outputs_cpu[0], skip_special_tokens=True)
response = output_text[len(query):].strip()

print(f"Answer without retrieved context:\n{response}")


Query: Describe the process of creating and switching between branches in Git.
Answer without retrieved context:
**Creating a Branch**

1. **Checkout** the main branch: `git checkout main`
2. **Create** a new branch: `git branch <branch_name>`
3. **Set** the new branch as the active branch: `git branch --active <branch_name>`

**Switching Between Branches**

1. **Switch** to the desired branch: `git checkout <branch_name>`
2. **Switch** back to the main branch: `git checkout main`

**Branching Workflow**

1. **Create** a new branch.
2. **Switch** to the new branch.
3. **Make changes** and commit them.
4. **Push** the changes to the remote repository.
5. **Create** a pull request to merge the changes into the main branch.
6. **Merge** the changes into the main branch.
7. **Switch** back to the main branch.
8. **Repeat** steps 1-7 for future changes.

**Branching Benefits**

* **Isolation:** Branches allow you to work on different features without affecting the main codebase.
* **Collaboration:** You can crea