In [1]:
from main import load_corpus, build_chunks

In [3]:
corpus = load_corpus(corpus_dir="docs")
chunked_corpus = build_chunks(corpus, chunk_size=500, overlap=50)

In [4]:
chunked_corpus[:2]  # Display the first two chunks

[{'doc_id': 'how-y-comb-started-paul-graham.txt',
  'chunk_id': 'how-y-comb-started-paul-graham.txt_chunk_0',
  'text': "How Y Combinator Started Y Combinator's 7th birthday was March 11. As usual we were so busy we didn't notice till a few days after. I don't think we've ever managed to remember our birthday on our birthday. On March 11 2005, Jessica and I were walking home from dinner in Harvard Square. Jessica was working at an investment bank at the time, but she didn't like it much, so she had interviewed for a job as director of marketing at a Boston VC fund. The VC fund was doing what now seems a comically familiar thing for a VC fund to do: taking a long time to make up their mind. Meanwhile I had been telling Jessica all the things they should change about the VC business — essentially the ideas now underlying Y Combinator: investors should be making more, smaller investments, they should be funding hackers instead of suits, they should be willing to fund younger founders, etc
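
The chunk above comes from `build_chunks`, whose implementation lives in `main.py` and is not shown in this notebook. The following is only a minimal sketch of overlapping chunking, assuming `chunk_size` and `overlap` are counted in whitespace-separated words (the real function may count characters or tokens instead). The name `chunk_words` is hypothetical.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into windows of `chunk_size` words.

    Each window shares its first `overlap` words with the end of the
    previous window. Hypothetical helper, not the notebook's build_chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy document of 1000 distinct "words" to make the overlap visible.
sample_text = " ".join(f"w{i}" for i in range(1000))
sample_chunks = chunk_words(sample_text, chunk_size=500, overlap=50)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.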

In [3]:
import os, glob, json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from main import load_corpus, build_chunks

In [4]:
corpus = load_corpus(corpus_dir="docs")
chunked_corpus = build_chunks(corpus, chunk_size=500, overlap=50)
print(f"Total chunks created: {len(chunked_corpus)}")

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([chunk['text'] for chunk in chunked_corpus], show_progress_bar=True, convert_to_numpy=True)

Total chunks created: 4


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.04s/it]


In [5]:
embeddings

array([[ 0.00262533, -0.06060995,  0.04796309, ..., -0.10234585,
        -0.06416606, -0.02177049],
       [-0.06242759,  0.0737354 , -0.01731625, ..., -0.12356475,
        -0.07399751, -0.01288157],
       [-0.02590338, -0.03875904,  0.0522448 , ..., -0.0087055 ,
        -0.07217222, -0.03136091],
       [ 0.08763316, -0.04844471,  0.09743939, ...,  0.03077282,
        -0.02612675, -0.02646558]], shape=(4, 384), dtype=float32)

In [6]:
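def normalize(vec):
    # L2-normalize each row so that an inner product equals cosine similarity
    norm = np.linalg.norm(vec, axis=1, keepdims=True)
    return vec / np.clip(norm, 1e-12, None)  # avoid division by zero for all-zero rows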

vec = normalize(embeddings).astype('float32')
dimension = vec.shape[1]
print(f"Dimension of embeddings: {dimension}")

Dimension of embeddings: 384
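
The normalization step matters because the index used later compares vectors by inner product. A small check with NumPy only, using random vectors rather than the real embeddings (the vectors `a` and `b` are made up for illustration), shows that the inner product of unit-length vectors equals the cosine similarity of the originals:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384).astype("float32")  # stand-in for one embedding
b = rng.normal(size=384).astype("float32")  # stand-in for another embedding

# Cosine similarity computed directly from the unnormalized vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after scaling both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(abs(cosine - inner) < 1e-5)
```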


In [7]:
# Exact (brute-force) inner-product index; on unit vectors this ranks by cosine similarity
index = faiss.IndexFlatIP(dimension)
index.add(vec)
print(f"Total vectors in index: {index.ntotal}")

Total vectors in index: 4


In [8]:
query = "When is the YCOmbinators birthday?"
query_embedding = model.encode([query], convert_to_numpy=True)

In [12]:
query_vec = normalize(query_embedding).astype('float32')

In [14]:
# Retrieve the top-2 chunks; returns a tuple of (similarity scores, chunk indices)
index.search(query_vec, 2)

(array([[0.3564238 , 0.29097176]], dtype=float32), array([[0, 2]]))
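
The search returns row positions, not text, so the indices still have to be mapped back to entries of `chunked_corpus`. A minimal sketch of that lookup follows. The helper name `format_hits` and the placeholder chunk list are hypothetical; the score and index arrays are copied from the output above.

```python
import numpy as np


def format_hits(scores, indices, chunks):
    """Pair each returned index with its chunk record and score.

    `scores` and `indices` have shape (n_queries, k); only the first
    query row is used here.
    """
    hits = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        chunk = chunks[int(idx)]
        hits.append({
            "chunk_id": chunk["chunk_id"],
            "score": float(score),
            "text": chunk["text"],
        })
    return hits


# Arrays as returned by the search above; chunk records are placeholders.
scores = np.array([[0.3564238, 0.29097176]], dtype="float32")
indices = np.array([[0, 2]])
placeholder_chunks = [
    {"chunk_id": f"doc.txt_chunk_{i}", "text": f"placeholder text {i}"}
    for i in range(4)
]
hits = format_hits(scores, indices, placeholder_chunks)
```

In the notebook itself, passing `chunked_corpus` instead of `placeholder_chunks` would give the two retrieved passages together with their similarity scores.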

## Home Lab
1. Create a vector datastore using FAISS for University of Moratuwa related documents.
2. Build a FastAPI server that exposes a retrieval API.
3. When a query is submitted, retrieve the top 2 relevant chunks from the FAISS index based on cosine similarity.
4. Support filtering the documents by metadata (e.g., document type, date) before performing the similarity search.

Document groups: (1) Electrical Department documents, (2) Computer Science Department documents, (3) General University documents.
metadata = {
    "department": "Electrical",
    "year": "2023"
}

/search?query=admissions&department=Electrical&year=2023 --> Only search within Electrical Department Documents from 2023
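
One possible way to approach the filter-then-search requirement, sketched with NumPy only: select the rows whose metadata matches every filter, then rank just those rows by inner product. This assumes one metadata dict per chunk, stored in the same order as the vectors, and unit-normalized vectors. The function `filtered_search` and the toy data are hypothetical, and the brute-force scoring stands in for a FAISS index restricted to the matching rows.

```python
import numpy as np


def filtered_search(query_vec, vectors, metadata, filters, k=2):
    """Return up to k (row_index, score) pairs among rows matching `filters`."""
    # Keep only the rows whose metadata equals every requested filter value.
    keep = [
        i for i, meta in enumerate(metadata)
        if all(meta.get(key) == value for key, value in filters.items())
    ]
    if not keep:
        return []
    keep = np.array(keep)
    # Inner product on unit vectors is cosine similarity.
    scores = vectors[keep] @ query_vec
    order = np.argsort(-scores)[:k]  # best scores first
    return [(int(keep[j]), float(scores[j])) for j in order]


# Toy unit vectors and metadata, invented for illustration.
toy_vectors = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]],
    dtype="float32",
)
toy_metadata = [
    {"department": "Electrical", "year": "2023"},
    {"department": "Computer Science", "year": "2023"},
    {"department": "Electrical", "year": "2023"},
    {"department": "Electrical", "year": "2022"},
]
toy_query = np.array([1.0, 0.0, 0.0], dtype="float32")

results = filtered_search(
    toy_query, toy_vectors, toy_metadata,
    filters={"department": "Electrical", "year": "2023"},
)
```

Filtering first guarantees that the top 2 results all satisfy the metadata constraint. Searching first and filtering afterwards can return fewer than 2 matching chunks.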

https://www.youtube.com/watch?v=wh0XkBeQNSM

In [15]:
## 0719102569
## irugalbandarachandra@gmail.com

In [16]:
# 1. Next up - Augmentation and Generation
# 2. Capstone teams intro + a few project proposals from my end - prize money