# Creating a simple search engine

### Goals of this notebook

1) Explore a few different ways we can implement a simple search engine for queries. The goal is that the user can type a query related to the zoomcamp course FAQ pages, and can receive a few results in order of their relevance. We will see how different methods yield different results and which are more effective in extracting the most relevant results. 

In this exercise we will look at both __Text Search and Semantic/Vector__ search methods. Note that both these methods are under the umbrella of the 'Bag of Words' method, which means that the order of the words has no meaning. This has obvious limitations and can be overcome with more advanced models like BERT. 

We can illustrate the difference in these methods with a small example:

`query = 'I just discovered the course. Can I still join?'`

In text search, we will find all the documents that contain words like 'discovered', 'course', 'join', etc. However, often the user forms a question that does not really match the documents. For example:

`query = 'I just found out about the program. Can I still enroll?'`

Semantically, both queries have the same meaning, but with text search we will not get good results. This is when a semantic/vector approach will perform much better. 

2) Understand the steps of getting relevant search results using more basic methods like CountVectorizer/TfidfVectorizer, and then slightly more sophisticated methods using singular value reduction (dimensionality reduction methods) like SVD and NMF to embed the vector and gain semantic meaning.

Here is a quick breakdown of each of these methods:

__Text search methods__:
- create an instance of the Vectorizer (CV, Tfdif), fit_transform the documents to get document matrix (X), and transform the query (q)
- calculate similarity score (with cosine similarity between X and q) and rank results

__Semantic/Vector methods__:
- create an instance of the Vectorizer (CV, Tfdif), fit_transform the documents to get document matrix (X), and transform the query (Q)
- create an instance of the Embedder (SVD, NMF), fit_transform X to dense document matrix (X_emb), and transform Q to get dense query array (Q_emb)
- calculate similarity score (with cosine similarity between X_emb and Q_emb) and rank results

### Downloading the data

In [3]:
import pandas as pd
import requests


In [4]:
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [5]:
documents[2]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

In [6]:
# converting to dataframe
df = pd.DataFrame(documents, columns=['course', 'section', 'question', 'text'])

In [7]:
df.head()

Unnamed: 0,course,section,question,text
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
3,data-engineering-zoomcamp,General course-related questions,Course - I have registered for the Data Engine...,You don't need it. You're accepted. You can al...
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...


In [8]:
df.shape

(948, 4)

## Text Search Methods

### Using CountVectorizer from sklearn

From the dataframe `df` we see that we have 948 documents, each containing 4 different fields. We need to convert the text of each document to a numerical representation (to encode the document), in a process called vectorization. In vectorization, we turn the document into a vector with encodings. We are essentially creating a dictionary of all the words that appear in all our documents, and then assigning 1 or 0 if the document contains this word. This creates a document matrix, when the rows are each document (in our case, 948 rows) and the columns are the words/tokens.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english', min_df=5) 
X = cv.fit_transform(df.text)
names = cv.get_feature_names_out()
df_docs = pd.DataFrame(X.toarray(), columns=names).T

Notes about parameters in CV instance: 

__min_df__: only care about terms that appear in 5 documents (to avoid questions that are rarely asked or in nonEglish languages)
__stop_words__: Stopwords are the words which occur frequently and don't provide any useful information. We define it as 'english' to recognize and remove all the English stop-words.

In [15]:
cv.get_feature_names_out()

array(['01', '02', '03', ..., 'youtube', 'zip', 'zoomcamp'], dtype=object)

After fitting the `text` feature of the dataframe of documents, we can see our word dictionary using the CountVectorizer.get_feature_names_out(). It contains 1333 words.

Our document matrix after using CountVectorizer:

We can see that it is a sparse matrix, meaning that most of the values are 0.

In [17]:
df_docs.T

Unnamed: 0,01,02,03,04,05,06,09,10,100,11,...,y_val,yaml,year,yellow,yellow_tripdata_2021,yes,yml,youtube,zip,zoomcamp
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
943,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
944,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
945,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
946,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Using Tfidf from sklearn

Another way to create a document matrix is using a Tfidf vectorizer, instead of Count vectorizer. This is going to be an improvement, because instead of just assigning 0 or 1, this vectorizer assignes a float value between 0 and 1. Therefore, we get more information about the significance of the word in the document, rather than just knowing if it is there or not. 

TF-IDF stands for Term Frequency-Inverse Document Frequency, where:
- Term Frequency (TF): The number of times a term appears in a document.
- Inverse Document Frequency (IDF): A measure of how much information the word provides, i.e., if it is common or rare across all documents.

The score represents the importance of a word in a particular document, relative to all the documents. So, if a word is more rare, it will get a higher score, since that word would carry more meaningful information about the content of that particular document.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Tfidf gives more importance to less frequent terms

tf = TfidfVectorizer(stop_words='english', min_df=5)
X = tf.fit_transform(df.text)
names = tf.get_feature_names_out()
df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs.T

Unnamed: 0,01,02,03,04,05,06,09,10,100,11,...,y_val,yaml,year,yellow,yellow_tripdata_2021,yes,yml,youtube,zip,zoomcamp
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.428961
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.279891,0.000000,0.0,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.107298,0.0,0.0,0.000000
944,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
945,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.167274,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
946,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000


Now that we have our document matrix, we can vectorize a query to an array, and then multiply the array with the matrix to determine which document has the most similarity with the query.  

For example, for document in row 945:

- The score for the word 'yaml' in our dictionary maxtrix is 0.1673, while the score for the word 'year' was 0.
- If our query will contain the word 'yaml', then the score for this word will be high, since it is a more rare word. If our query will contain the word 'year', it will not point to this document.
-  If we multiply two non-zero values, we get a non-zero value. This means that the similarity score is high, meaning document in row 945 is relevant to the query.

What is important to recognize is that we are taking the dot product of these two matrices (the dictionary matrix by the transoformed query matrix), which is also the same as cosine similarity. 


In [53]:
query = "I just discovered the course, is it too late to join?"

q = tf.transform([query])
q.toarray().shape

(1, 1333)

In [55]:
query_dict = dict(zip(names, q.toarray()[0]))
print(query_dict['course'])


0.49695797492447685


### Taking the dot product to get a similarity score

In [56]:
from sklearn.metrics.pairwise import cosine_similarity
score = cosine_similarity(X, q).flatten()

In [57]:
score

array([0.48049682, 0.        , 0.        , 0.2083882 , 0.        ,
       0.        , 0.        , 0.17557272, 0.        , 0.        ,
       0.        , 0.15870689, 0.        , 0.        , 0.        ,
       0.09680922, 0.        , 0.        , 0.07529201, 0.        ,
       0.        , 0.        , 0.29986763, 0.10520675, 0.        ,
       0.        , 0.        , 0.27447476, 0.12828407, 0.        ,
       0.        , 0.        , 0.        , 0.05163407, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03156309,
       0.04914818, 0.07138962, 0.        , 0.04329773, 0.        ,
       0.        , 0.        , 0.        , 0.02804374, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.06739038, 0.        , 0.00980845,
       0.        , 0.        , 0.        , 0.        , 0.05820102,
       0.        , 0.        , 0.        , 0.        , 0.     

In [61]:
import numpy as np
idx = np.argsort(score)[-5:] # sorts from lowest to highest, so we need the last ones
idx

array([ 22, 448, 449, 440,   0])

In [71]:
print(f'Query: {query}\n')
print('Search Results:')
for row in idx:
    print(f'Index {row}')
    print(df.iloc[row].text)
    print('\n')

Query: I just discovered the course, is it too late to join?

Search Results:
Index 22
It's up to you which platform and environment you use for the course.
Github codespaces or GCP VM are just possible options, but you can do the entire course from your laptop.


Index 448
Here’s how you join a in Slack: https://slack.com/help/articles/205239967-Join-a-channel
Click “All channels” at the top of your left sidebar. If you don't see this option, click “More” to find it.
Browse the list of public channels in your workspace, or use the search bar to search by channel name or description.
Select a channel from the list to view it.
Click Join Channel.
Do we need to provide the GitHub link to only our code corresponding to the homework questions?
Yes. You are required to provide the URL to your repo in order to receive a grade


Index 449
Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.
In order to get a certificate, you need to submi

## Taking into account all fields in documents

Up to now we have only been using the 'text' field of the documents, but it makes more sense to use the 'question' field. We will take into consideration all the fields, especially the 'question' field, using a boost dictionary. In addition, we can add a filter to only show results relevant to a particular course.

In [99]:
n = len(df)
score = np.zeros(n)
fields = ['section', 'question', 'text']

# giving 'question' 3x more value, and 'text' 0.5 less value.
boosts = {
    'question': 3,
    'text': 0.5
}

filters = {
    'course': 'data-engineering-zoomcamp'
}

for f in fields:
    tf = TfidfVectorizer(stop_words='english', min_df=5)
    X = tf.fit_transform(df[f])
    q = tf.transform([query])
    f_score = cosine_similarity(X, q).flatten()
    boost = boosts.get(f, 1.0) # if f not in boosts, assign 1
    score += boost*f_score 

score_no_filter = score.copy()

for field, value in filters.items():
    mask = (df[field] == value).astype(int).values
    score *= mask

score_with_filter = score

In [100]:
# top 5 results with no filter yields results in 2 different courses
idx = np.argsort(-score_no_filter)[:5]
df.iloc[idx]

Unnamed: 0,course,section,question,text
448,machine-learning-zoomcamp,General course-related questions,I’m new to Slack and can’t find the course cha...,Here’s how you join a in Slack: https://slack....
7,data-engineering-zoomcamp,General course-related questions,Course - Can I follow the course after it fini...,"Yes, we will keep all the materials after the ..."
9,data-engineering-zoomcamp,General course-related questions,Course - Which playlist on YouTube should I re...,All the main videos are stored in the Main “DA...
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
453,machine-learning-zoomcamp,General course-related questions,What are the deadlines in this course?,"For the 2023 cohort, you can see the deadlines..."


In [102]:
# top 5 results with a filter yields results in only one course
idx = np.argsort(-score_with_filter)[:5]
df.iloc[idx]

Unnamed: 0,course,section,question,text
7,data-engineering-zoomcamp,General course-related questions,Course - Can I follow the course after it fini...,"Yes, we will keep all the materials after the ..."
1,data-engineering-zoomcamp,General course-related questions,Course - What are the prerequisites for this c...,GitHub - DataTalksClub data-engineering-zoomca...
4,data-engineering-zoomcamp,General course-related questions,Course - What can I do before the course starts?,You can start by installing and setting up all...
5,data-engineering-zoomcamp,General course-related questions,Course - how many Zoomcamps in a year?,"There are 3 Zoom Camps in a year, as of 2024. ..."
34,data-engineering-zoomcamp,General course-related questions,How can we contribute to the course?,Star the repo! Share it with friends if you fi...


## Putting it all together using OOP

In [103]:
class TextSearch:

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}
        self.vectorizers = {}

    def fit(self, records, vectorizer_params={}):
        self.df = pd.DataFrame(records)

        for f in self.text_fields:
            tf = TfidfVectorizer(**vectorizer_params)
            X = tf.fit_transform(self.df[f])
            self.matrices[f] = X
            self.vectorizers[f] = tf

    def search(self, query, n_results=10, boost={}, filters={}):
        score = np.zeros(len(self.df))

        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            s = cosine_similarity(self.matrices[f], q).flatten()
            score = score + b * s

        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask

        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')

The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.


In [104]:
# Example:

index = TextSearch(
    text_fields=['section', 'question', 'text']
)
index.fit(documents)

index.search(
    query='I just singned up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'}
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin

# Semantic/Vector Search

As mentioned before, the main limitation of text search is that it relies on the presence of a word in order to determine which documents are relevant. However, often the query can be phrased differently, not contain the exact words in the document, but still be relevant. The magic of using an embedder is that it creates n clusters of similar words/concepts. How does it manage to do this without considering word order?

Example:

Doc1: "I am taking a course on machine learning."

Doc2: "This program teaches machine learning and data science."

In these documents, the word 'course' and 'program' both appear with words like 'machine' and 'learning', so it understands that they have similar usage contexts since they appear with a similar set of words. Therefore, it can put 'course' and 'program' in a cluster. 

### Using SVD for vector search

In [133]:
from sklearn.decomposition import TruncatedSVD

tf = TfidfVectorizer(stop_words='english', min_df=5)
svd = TruncatedSVD(n_components=16)

X = tf.fit_transform(df['text'])
X_emb = svd.fit_transform(X)

query = 'I just signed up. Is it too late to join the course?'
Q = tf.transform([query])
Q_emb = svd.transform(Q)
score = cosine_similarity(X_emb, Q_emb).flatten()

In [134]:
X_emb.shape

(948, 16)

In [135]:
X_emb[0]

array([ 0.0965294 , -0.08209022, -0.10243038, -0.07913662,  0.06815341,
       -0.06097549,  0.02942991, -0.14643791,  0.24885705,  0.27039071,
        0.07383093,  0.06997887,  0.07140326,  0.09644724, -0.02853453,
        0.00932135])

Instead of a sparse matrix with 1333 dimensions, we have a dense representaiton called an 'embedding' with only 16 dimensions.

In [136]:
Q_emb.shape

(1, 16)

In [138]:
Q_emb

array([[ 0.05790344, -0.03847635, -0.05661956, -0.02763634,  0.04012041,
        -0.06361821,  0.0182229 , -0.09670466,  0.16579205,  0.17604985,
         0.06023559,  0.06394687,  0.05172968,  0.07648963, -0.00458787,
         0.014149  ]])

In [140]:
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:5]
df.loc[idx]

Unnamed: 0,course,section,question,text
451,machine-learning-zoomcamp,General course-related questions,Can I submit the homework after the due date?,"No, it’s not possible. The form is closed afte..."
764,machine-learning-zoomcamp,Projects (Midterm and Capstone),What If I submitted only two projects and fail...,If you have submitted two projects (and peer-r...
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
2,data-engineering-zoomcamp,General course-related questions,Course - Can I still join the course after the...,"Yes, even if you don't register, you're still ..."
436,machine-learning-zoomcamp,General course-related questions,Is it going to be live? When?,"The course videos are pre-recorded, you can st..."


In [141]:
query = 'I just enrolled. Is it too late to join the program?'
Q = tf.transform([query])
Q_emb = svd.transform(Q)
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:5]
df.loc[idx]

Unnamed: 0,course,section,question,text
3,data-engineering-zoomcamp,General course-related questions,Course - I have registered for the Data Engine...,You don't need it. You're accepted. You can al...
8,data-engineering-zoomcamp,General course-related questions,Course - Can I get support if I take the cours...,"Yes, the slack channel remains open and you ca..."
797,machine-learning-zoomcamp,Miscellaneous,I may end up submitting the assignment late. W...,Depends on whether the form will still be open...
440,machine-learning-zoomcamp,General course-related questions,"I filled the form, but haven't received a conf...","The process is automated now, so you should re..."
451,machine-learning-zoomcamp,General course-related questions,Can I submit the homework after the due date?,"No, it’s not possible. The form is closed afte..."


### Using NMF for vector search

NMF (Non-negative Matrix Factorization) is considered to be more interpretable than SVD since the matrix values are all positive, and because it leads to sparser matrices. Since a matrix value represents 'how much' of that cluster is related, a negative value can lack meaning and be more confusing. In addition, a sparse matrix indicates clearly which of the clusters are playing a part. 

In [142]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=16)
X_emb = nmf.fit_transform(X)
X_emb[0]

array([0.00613327, 0.00589399, 0.        , 0.        , 0.08526761,
       0.        , 0.0010501 , 0.        , 0.00216954, 0.01244853,
       0.00030579, 0.        , 0.        , 0.00766827, 0.00452049,
       0.00914782])

In [76]:
# NMF creates 'clusters' of different topics, and non zero values can be seen as the query being related to those 2 topics

In [143]:
query = 'I just signed up. Is it too late to join the course?'
Q = cv.transform([query])
Q_emb = nmf.transform(Q)
Q_emb[0]

array([1.44931879e-02, 1.39271127e-02, 0.00000000e+00, 8.27306250e-03,
       7.75442915e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       4.34842512e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 2.11862938e-02, 2.28288798e-02, 0.00000000e+00])

In [145]:
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:5]
df.loc[idx]


Unnamed: 0,course,section,question,text
456,machine-learning-zoomcamp,General course-related questions,Submitting learning in public links,When you post about what you learned from the ...
452,machine-learning-zoomcamp,General course-related questions,I just joined. What should I do next? How can ...,Welcome to the course! Go to the course page (...
0,data-engineering-zoomcamp,General course-related questions,Course - When will the course start?,The purpose of this document is to capture fre...
758,machine-learning-zoomcamp,Projects (Midterm and Capstone),"What modules, topics, problem-sets should a mi...","Answer: Ideally midterms up to module-06, caps..."
760,machine-learning-zoomcamp,Projects (Midterm and Capstone),How to conduct peer reviews for projects?,Answer: Previous cohorts projects page has ins...


In [148]:
class VectorSearch:

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}
        self.vectorizers = {}
        self.embedders = {}

    def fit(self, records, vectorizer_params={}):
        self.df = pd.DataFrame(records)

        for f in self.text_fields:
            tf = TfidfVectorizer(**vectorizer_params)
            X = tf.fit_transform(self.df[f])
            self.vectorizers[f] = tf

            svd = TruncatedSVD(n_components=16)
            X_emb = svd.fit_transform(X)
            self.matrices[f] = X_emb
            self.embedders[f] = svd

    def search(self, query, n_results=10, boost={}, filters={}):
        score = np.zeros(len(self.df))
    
        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            q_emb = self.embedders[f].transform(q)
            s = cosine_similarity(self.matrices[f], q_emb).flatten()
            score = score + b * s
    
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask
    
        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')

In [155]:
# Example
index = VectorSearch(
    text_fields=['section', 'question', 'text']
)
index.fit(documents)

index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0
          },
    filters={'course': 'data-engineering-zoomcamp'}
)

[{'text': 'Yes. For the 2024 edition we are using Mage AI instead of Prefect and re-recorded the terraform videos, For 2023, we used Prefect instead of Airflow..',
  'section': 'General course-related questions',
  'question': 'Course - Is the current cohort going to be different from the previous cohort?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes! Every “Office Hours” will be recorded and available a few minutes after the live session is over; so you can view (or rewatch) whenever you want.',
  'section': 'General course-related questions',
  'question': 'Office Hours - I can’t attend the “Office hours” / workshop, will it be recorded?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'The zoom link is only published to instructors/presenters/TAs.\nStudents participate via Youtube Live and submit questions to Slido (link would be pinned in the chat when Alexey goes Live). The video URL should be posted in the announcements channel on Telegram & Slack before it begi

## BERT
NN that turns a document into an embedding. Captures not only semantic similarity but also word order. 


In [108]:
import torch
from transformers import BertModel, BertTokenizer
# loading tokenizer and pre-trained model
# tokenizer turns text into a numerical representation
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # Set the model to evaluation mode if not training

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [109]:
texts = [
    "Yes, we will keep all the materials after the course finishes.",
    "You can follow the course at your own pace after it finishes"
]
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
encoded_input

{'input_ids': tensor([[  101,  2748,  1010,  2057,  2097,  2562,  2035,  1996,  4475,  2044,
          1996,  2607, 12321,  1012,   102],
        [  101,  2017,  2064,  3582,  1996,  2607,  2012,  2115,  2219,  6393,
          2044,  2009, 12321,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

In [110]:
with torch.no_grad():  # Disable gradient calculation for inference
    outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state # contains embeddings

In [111]:
hidden_states.shape
# 2 is num. documens

torch.Size([2, 15, 768])

In [112]:
hidden_states[0]

tensor([[ 0.1010,  0.0181,  0.1303,  ..., -0.2932,  0.1863,  0.6615],
        [ 1.0608, -0.1242,  0.1370,  ..., -0.1605,  1.0429,  0.3532],
        [ 0.1802,  0.0776,  0.3941,  ..., -0.1379,  0.5974,  0.1704],
        ...,
        [ 0.4738, -0.0184,  0.2186,  ..., -0.0013, -0.0833, -0.2170],
        [ 0.6516,  0.1216, -0.2494,  ...,  0.1557, -0.5632, -0.4310],
        [ 0.7164,  0.2157, -0.0281,  ...,  0.2281, -0.6725, -0.3245]])

In [113]:
sentence_embeddings = hidden_states.mean(dim=1)
sentence_embeddings.shape

torch.Size([2, 768])

In [114]:
sentence_embeddings

tensor([[ 0.3600, -0.1607,  0.3545,  ...,  0.0429,  0.0348, -0.0382],
        [ 0.1785, -0.5000,  0.2528,  ..., -0.1141, -0.3361,  0.4110]])

In [115]:
X_emb = sentence_embeddings.numpy()

In [118]:
def make_batches(seq, n):
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result

In [120]:
from tqdm import tqdm
texts = df['text'].tolist()
text_batches = make_batches(texts, 8)

all_embeddings = []

for batch in tqdm(text_batches):
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**encoded_input)
        hidden_states = outputs.last_hidden_state
        
        batch_embeddings = hidden_states.mean(dim=1)
        batch_embeddings_np = batch_embeddings.cpu().numpy()
        all_embeddings.append(batch_embeddings_np)

final_embeddings = np.vstack(all_embeddings)

100%|███████| 119/119 [11:11<00:00,  5.65s/it]


In [116]:
def compute_embeddings(texts, batch_size=8):
    text_batches = make_batches(texts, 8)
    
    all_embeddings = []
    
    for batch in tqdm(text_batches):
        encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    
        with torch.no_grad():
            outputs = model(**encoded_input)
            hidden_states = outputs.last_hidden_state
            
            batch_embeddings = hidden_states.mean(dim=1)
            batch_embeddings_np = batch_embeddings.cpu().numpy()
            all_embeddings.append(batch_embeddings_np)
    
    final_embeddings = np.vstack(all_embeddings)
    return final_embeddings

In [117]:
X_text = compute_embeddings(df['text'].tolist())

NameError: name 'make_batches' is not defined