### Overview:

- This RAG pipeline is designed to give relevant answers to queries about machine learning, using Tom Mitchell's *Machine Learning* textbook as its source document


In [49]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import torch
import torch.nn as nn

#### Import PDF Document:

In [3]:
PDF_path = "/content/drive/My Drive/MLRAG/MachineLearningTomMitchell.pdf"

In [4]:
import os
from tqdm.auto import tqdm

def text_formatter(text:str) -> str:
    ''' Performs basic text cleaning'''

    cleaned_text = text.replace('\n', ' ').strip()
    return cleaned_text

In [5]:
!pip install pymupdf


Collecting pymupdf
  Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.14


In [6]:
import fitz

def open_and_read_pdf(pdf_path:str)-> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_texts.append({'page_number': page_number-25,  # offset so that front-matter pages get negative numbers; adjust for other PDFs
                                'page_char_count': len(text),  # counts every character, including spaces and punctuation
                                'page_word_count': len(text.split(' ')),  # rough word count from splitting on spaces
                                'page_sentence_count_raw': len(text.split('. ')),  # rough sentence count from splitting on '. '
                                'page_token_count': len(text)/4,  # rule of thumb: roughly 4 characters per token
                                'text': text})
    return pages_and_texts
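
The per-page statistics above are simple string heuristics. As a minimal sketch (the helper name `page_stats` and the sample text are made up for illustration and are not part of the notebook), the same counts can be tried on a short string without opening a PDF:

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics the notebook stores for each page."""
    cleaned = text.replace("\n", " ").strip()
    return {
        "page_char_count": len(cleaned),
        "page_word_count": len(cleaned.split(" ")),
        "page_sentence_count_raw": len(cleaned.split(". ")),
        "page_token_count": len(cleaned) / 4,  # heuristic: ~4 characters per token
    }


# Made-up sample text standing in for one extracted page
sample_page = "Machine learning is fun.\nIt improves with experience. Try it."
stats = page_stats(sample_page)
print(stats)

# An empty page still yields one "word" and one "sentence",
# because splitting an empty string returns a list with one empty item.
empty_stats = page_stats("")
print(empty_stats)
```

The empty-page case is worth keeping in mind when reading summary statistics: a blank page shows up with 0 characters but a word count and raw sentence count of 1.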


In [7]:
pages_and_texts = open_and_read_pdf(PDF_path)


In [8]:
import random

random.sample(pages_and_texts,k=3)

[{'page_number': 52,
  'page_char_count': 3225,
  'page_word_count': 521,
  'page_sentence_count_raw': 19,
  'page_token_count': 806.25,
  'text': "be true for instances that are classified positive by the decision tree in Figure 3.1  and false otherwise. Thus, two learners, both applying Occam's razor, would  generalize in different ways if one used the XYZ attribute to describe its examples  and the other used only the attributes Outlook, Temperature, Humidity, and Wind.  This last argument shows that Occam's razor will produce two different  hypotheses from the same training examples when it is applied by two learners  that perceive these examples in terms of different internal representations. On this  basis we might be tempted to reject Occam's razor altogether. However, consider  the following scenario that examines the question of which internal representa-  tions might arise from a process of evolution and natural selection. Imagine a  population of artificial learning agents c

#### Data Analysis:

In [9]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -25 | 0 | 1 | 1 | 0.00 | |
| 1 | -24 | 1793 | 302 | 13 | 448.25 | Machine Learning Tom M. Mitchell Produ... |
| 2 | -23 | 2126 | 343 | 13 | 531.50 | PREFACE The field of machine learning is conc... |
| 3 | -22 | 3071 | 497 | 12 | 767.75 | xvi PREFACE A third principle that guided th... |
| 4 | -21 | 1424 | 247 | 14 | 356.00 | PREFACE xvii Joachim, Atsushi Kawamura, Marti... |
| ... | ... | ... | ... | ... | ... | ... |
| 416 | 391 | 2441 | 422 | 8 | 610.25 | Probability distribution, 133. See also Binom... |
| 417 | 392 | 2668 | 443 | 7 | 667.00 | Resolution rule, 293-294 first-order, 296-297... |
| 418 | 393 | 2445 | 425 | 7 | 611.25 | Split infomation, 73-74 Squashing function, 9... |
| 419 | 394 | 974 | 169 | 2 | 243.50 | Variables, in logic, 284, 285 Variance, 133, ... |


In [10]:
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 |
| mean | 185.0 | 2600.408551 | 455.32304 | 23.171021 | 650.102138 |
| std | 121.676484 | 739.66437 | 124.31205 | 21.682212 | 184.916093 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 80.0 | 2278.0 | 408.0 | 15.0 | 569.5 |
| 50% | 185.0 | 2770.0 | 484.0 | 19.0 | 692.5 |
| 75% | 290.0 | 3109.0 | 535.0 | 23.0 | 777.25 |
| max | 395.0 | 4140.0 | 654.0 | 142.0 | 1035.0 |


- On average there are about 23 sentences per page (using the raw split on '. '), and the average word count per page is about 455


#### Splitting paragraph into sentences:

In [11]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')  # add a rule-based sentence-splitting component to the pipeline
# spaCy's sentencizer generally splits sentences more reliably than the raw text.split('. ') used above

<spacy.pipeline.sentencizer.Sentencizer at 0x7c54b0f0bb40>

In [12]:
for item in pages_and_texts:
    item['sentences'] = list(nlp(item['text']).sents)

    #making sure all the sentences are in string format
    item['sentences'] = [str(sentence) for sentence in item['sentences']]
    item['page_sentence_count_spacy'] = len(item['sentences'])
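
To see why the raw `'. '` split is only a rough sentence count, the sketch below applies it to a single bibliography entry of the kind found on the book's reference pages. It uses plain Python only and makes no claim about what spaCy returns for the same string; it just shows how initials and years inflate the raw count.

```python
# One bibliography entry in the style of the book's reference pages
citation = (
    "Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. "
    "San Mateo, CA: Morgan Kaufmann."
)

# The same rule used for 'page_sentence_count_raw'
naive_pieces = citation.split(". ")

print(len(naive_pieces))
for piece in naive_pieces:
    print(repr(piece))
```

Every initial followed by a period and a space starts a new "sentence", which is one reason pages full of references report very high raw sentence counts.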

In [13]:
random.sample(pages_and_texts, k =4)

[{'page_number': 341,
  'page_char_count': 3318,
  'page_word_count': 570,
  'page_sentence_count_raw': 20,
  'page_token_count': 829.5,
  'text': 'CHAPTER 12 C O M B m G  INDUCTIVE AND ANALYTICAL LEARNING 355  Although the notation here appears a bit tedious, the idea is simple. The  error given by Equation (12.2) has the same general form as the error function  in Equation (12.1) minimized by TANGENTPROP.  The leftmost term measures the  usual sum of squared errors between the training value f (xi) and the value pre-  dicted by the target network f"(xi). The rightmost term measures the squared error  between the training derivatives  extracted from the domain theory and the  actual derivatives of the target network e.  Thus, the leftmost term contributes  the inductive constraint that the hypothesis must fit the observed training data,  whereas the rightmost term contributes the analytical constraint that it must fit  the training derivatives extracted from the domain theory. Notice 

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 |
| mean | 185.0 | 2600.408551 | 455.32304 | 23.171021 | 650.102138 | 20.679335 |
| std | 121.676484 | 739.66437 | 124.31205 | 21.682212 | 184.916093 | 14.478367 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 80.0 | 2278.0 | 408.0 | 15.0 | 569.5 | 14.0 |
| 50% | 185.0 | 2770.0 | 484.0 | 19.0 | 692.5 | 19.0 |
| 75% | 290.0 | 3109.0 | 535.0 | 23.0 | 777.25 | 22.0 |
| max | 395.0 | 4140.0 | 654.0 | 142.0 | 1035.0 | 102.0 |


#### Chunking our sentences together:

- Chunking groups sentences into smaller passages, so that each passage carries specific information and stays within the number of input tokens the LLM can accept

In [15]:
def split_list(input_list: list, slice_size: int = 10) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]
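
As a small usage sketch of the slicing logic above, here is the same function applied to a made-up list of 25 placeholder sentences (the sentence strings are invented for illustration):

```python
def split_list(input_list: list, slice_size: int = 10) -> list[list[str]]:
    """Split a list into consecutive slices of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# 25 placeholder sentences standing in for one page
sentences = [f"Sentence {n}." for n in range(1, 26)]
chunks = split_list(sentences, slice_size=10)

print([len(chunk) for chunk in chunks])  # sizes of the resulting chunks
print(split_list([]))                    # a page with no sentences gives no chunks
```

The last chunk simply holds whatever is left over, so chunk sizes are uneven whenever the sentence count is not a multiple of `slice_size`, and an empty page produces zero chunks.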


In [16]:
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(item['sentences'])
    item['num_chunks'] = len(item['sentence_chunks'])



In [17]:
random.sample(pages_and_texts,k=2)

[{'page_number': 66,
  'page_char_count': 1746,
  'page_word_count': 266,
  'page_sentence_count_raw': 64,
  'page_token_count': 436.5,
  'text': "80  MACHINE LEARNING  Quinlan, J. R., & Rivest, R. (1989). Information and Computation, (go), 227-248.  Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.  Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length.  Annals of Statistics 11 (2), 416-431.  Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2(3), 229-246.  Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 113-152.  Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: an  experimental comparison. Machine kaming, 6(2), 11 1-144.  Tan, M. (1993). Cost-sensitive learning of classification knowledge and its applications in robotics.  Machine Learning, 13(1), 1-33.  Tan, M., & Schlimmer, J. C. (1990). Two case studies in cost

In [18]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 | 421.0 |
| mean | 185.0 | 2600.41 | 455.32 | 23.17 | 650.1 | 20.68 | 2.5 |
| std | 121.68 | 739.66 | 124.31 | 21.68 | 184.92 | 14.48 | 1.49 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 80.0 | 2278.0 | 408.0 | 15.0 | 569.5 | 14.0 | 2.0 |
| 50% | 185.0 | 2770.0 | 484.0 | 19.0 | 692.5 | 19.0 | 2.0 |
| 75% | 290.0 | 3109.0 | 535.0 | 23.0 | 777.25 | 22.0 | 3.0 |
| max | 395.0 | 4140.0 | 654.0 | 142.0 | 1035.0 | 102.0 | 11.0 |


#### Splitting each chunk into its own item:

In [19]:
import re

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_number'] = item['page_number']
        joined_sentence_chunk = ''.join(sentence_chunk).replace('  ',' ').strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        # chunk statistics
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len(joined_sentence_chunk.split(' '))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk)/4  # rule of thumb: roughly 4 characters per token

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1052
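
The join-and-repair step in the loop above can be tried on its own. This is a minimal sketch with made-up sentences; `join_chunk` is a hypothetical helper name that wraps the same two lines used in the notebook:

```python
import re


def join_chunk(sentences: list[str]) -> str:
    """Join sentences without a separator, then re-insert a space after
    any period that is directly followed by a capital letter."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)


# Two sentences glued together by the join get their space back
print(join_chunk(["This is one sentence.", "Here is another."]))

# Side effect: periods inside abbreviations are spaced out as well
print(join_chunk(["The U.S.A is large."]))
```

The regular expression cannot tell a sentence boundary from an abbreviation, so strings such as initials or acronyms get extra spaces. Joining with `' '.join(...)` instead would avoid the need for the repair step, at the cost of changing the notebook's exact chunk text.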

In [20]:
random.sample(pages_and_chunks, k = 1)

[{'page_number': 115,
  'sentence_chunk': 'A convenient way to model this is to assume there is some unknown probability distribution D that defines the probability of encountering each instance in X (e-g., 23 might assign a higher probability to en- countering 19-year-old people than 109-year-old people). Notice 23 says nothing',
  'chunk_char_count': 287,
  'chunk_word_count': 45,
  'chunk_token_count': 71.75}]

In [21]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1052.0 | 1052.0 | 1052.0 | 1052.0 |
| mean | 188.85 | 1022.54 | 164.69 | 255.63 |
| std | 117.25 | 568.5 | 92.93 | 142.13 |
| min | -24.0 | 7.0 | 1.0 | 1.75 |
| 25% | 85.0 | 464.0 | 68.0 | 116.0 |
| 50% | 192.0 | 1106.0 | 180.0 | 276.5 |
| 75% | 291.0 | 1457.0 | 236.0 | 364.25 |
| max | 394.0 | 2676.0 | 447.0 | 669.0 |


In [22]:
min_token_length = 30
# Inspect a few chunks below the threshold to check that they carry little useful content
for _, row in df[df['chunk_token_count'] < min_token_length].sample(5).iterrows():
    print(f'chunk: {row["chunk_token_count"]} | text: {row["sentence_chunk"]}')

chunk: 28.0 | text: This is exactly analogous to the setting we consider when estimating the error of a hypothesis in Chapter 5: The
chunk: 28.25 | text: Notice furthermore that we need not enumerate every hypothesis in the version space in order to test whether each
chunk: 18.25 | text: In this case, a population of size 640,000 was maintained, with selection
chunk: 8.0 | text: What is a good query strategy in
chunk: 3.25 | text: At each step,


In [23]:
pages_and_chunks_over_min_token_len = df[df['chunk_token_count'] > min_token_length].to_dict(orient='records')
len(pages_and_chunks_over_min_token_len)

1023

#### Embedding our text chunks:

In [24]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-mpnet-base-v2')



In [25]:
embedding = embedding_model.encode('My main aim of my life is to master the mindfulness')
embedding.shape

(768,)

In [26]:
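for item in tqdm(pages_and_chunks_over_min_token_len):
    # Each call encodes a single chunk, so batch_size has no effect here;
    # passing a list of chunks to encode() would let the model batch them.
    item['embedding'] = embedding_model.encode(item['sentence_chunk'],
                                              batch_size = 32,
                                              convert_to_tensor = True)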


#### Save embeddings to file:

In [27]:
pages_and_chunks_over_min_token_len[419]

{'page_number': 153,
 'sentence_chunk': 'CHAPTER 6 BAYESIAN LEARNING 167 the true target value, where this random noise is drawn independently for each example from a Normal distribution with zero mean. As the above derivation makes clear, the squared error term (di - h ( ~ ~ ) ) ~  follows directly from the exponent in the definition of the Normal distribution. Similar derivations can be performed starting with other assumed noise distributions, producing different results. Notice the structure of the above derivation involves selecting the hypothesis that maximizes the logarithm of the likelihood (In p(D1h)) in order to determine the most probable hypothesis. As noted earlier, this yields the same result as max- imizing the likelihood p(D1h). This approach of working with the log likelihood is common to many Bayesian analyses, because it is often more mathematically tractable than working directly with the likelihood. Of course, as noted earlier, the maximum likelihood hypothesis mig

In [28]:
#save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = '/content/drive/My Drive/MLRAG/embeddings_df.csv'
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False,  escapechar='\\')
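
Writing the DataFrame to CSV stores each embedding as the text form of a tensor, which has to be parsed back into numbers after loading. As an alternative sketch (not what this notebook does), embeddings can be serialized as JSON lists so that they round-trip cleanly. The rows, values and in-memory buffer below are made up for illustration, and only the standard library is used:

```python
import csv
import io
import json

# Made-up rows with tiny 3-dimensional "embeddings"
rows = [
    {"page_number": 1, "sentence_chunk": "A short chunk.", "embedding": [0.1, -0.2, 0.3]},
    {"page_number": 2, "sentence_chunk": "Another chunk, with a comma.", "embedding": [0.0, 0.5, -0.5]},
]

buffer = io.StringIO()  # stands in for a CSV file on disk
writer = csv.DictWriter(buffer, fieldnames=["page_number", "sentence_chunk", "embedding"])
writer.writeheader()
for row in rows:
    # Store the embedding as a JSON string instead of a tensor repr
    writer.writerow({**row, "embedding": json.dumps(row["embedding"])})

buffer.seek(0)
loaded = [
    {**row, "embedding": json.loads(row["embedding"])}
    for row in csv.DictReader(buffer)
]
print(loaded[1]["embedding"])
```

With this layout, loading needs only `json.loads` per row and no string clean-up. Binary formats would be more compact for large collections, but that is outside the scope of this sketch.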

In [29]:
import pandas as pd
embeddings_df_save_path = '/content/drive/My Drive/MLRAG/embeddings_df.csv'
#Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -24 | Machine Learning Tom M. Mitchell Product De... | 1436 | 217 | 359.00 | tensor([ 1.4353e-02, 1.2050e-02, -4.6707e-02,... |
| 1 | -24 | DLC: Computer algorithms. Book Description: Th... | 316 | 45 | 79.00 | tensor([-1.2693e-02, 3.9788e-02, -5.5346e-02,... |
| 2 | -23 | PREFACE The field of machine learning is conce... | 1623 | 242 | 405.75 | tensor([ 1.4706e-02, 3.2608e-02, -4.7097e-02,... |
| 3 | -23 | The book is intended for both undergraduate an... | 476 | 75 | 119.00 | tensor([-2.0732e-03, 1.3112e-02, -4.9041e-02,... |
| 4 | -22 | xvi PREFACE A third principle that guided the ... | 1491 | 226 | 372.75 | tensor([ 1.9730e-02, 1.0837e-02, -4.8825e-02,... |
| ... | ... | ... | ... | ... | ... | ... |
| 1018 | 390 | 410 SUBJECT INDEX Normal distribution, 133, 13... | 2429 | 336 | 607.25 | tensor([-2.6373e-02, 3.3413e-03, -5.4208e-02,... |
| 1019 | 391 | Probability distribution, 133. See also Binomi... | 2344 | 325 | 586.00 | tensor([-1.9942e-02, -1.7694e-02, -3.0645e-02,... |
| 1020 | 392 | Resolution rule, 293-294 first-order, 296-297 ... | 2570 | 345 | 642.50 | tensor([ 6.2808e-03, -2.5652e-02, -2.3606e-02,... |
| 1021 | 393 | Split infomation, 73-74 Squashing function, 96... | 2352 | 332 | 588.00 | tensor([-1.6852e-04, -4.5895e-02, -1.4752e-02,... |


- With more than about 100k embeddings, comparing a query against every stored vector becomes slow, so a vector database is the better choice. It uses Approximate Nearest Neighbor (ANN) search to find the closest embeddings without scanning all of them
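
For contrast with approximate search, the sketch below shows the exact, exhaustive version that an ANN index tries to avoid: score every stored vector against the query and keep the best `k`. The function names and the tiny 2-dimensional vectors are made up for illustration, and only the standard library is used.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[tuple[float, int]]:
    """Score every vector against the query and return the k best (score, index) pairs."""
    scored = [(cosine_similarity(query, vector), index) for index, vector in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Toy "embeddings"; real ones in this notebook have 768 dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = exact_top_k([1.0, 0.1], toy_vectors, k=2)
print(top)
```

The cost of this loop grows linearly with the number of stored vectors for every query, which is the motivation for switching to an approximate index once the collection gets large.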

#### RAG - search and answer:
- We want to retrieve relevant passages for a query and use those passages to augment the input to an LLM, so that it can generate an answer grounded in them
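
The "augment" step can be sketched as plain string formatting: the retrieved passages are placed in the prompt ahead of the question. The function name `build_prompt`, the prompt wording and the example passages are all assumptions made for illustration; the notebook does not define them.

```python
def build_prompt(query: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user query into one prompt string."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Made-up passages standing in for retrieved text chunks
retrieved_passages = [
    "Machine learning is concerned with programs that improve with experience.",
    "A learning problem is defined by a task, a performance measure, and training experience.",
]
prompt = build_prompt("What is machine learning?", retrieved_passages)
print(prompt)
```

The resulting string is what would be tokenized and passed to the LLM in place of the bare question. How many passages fit depends on the model's context length.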

In [30]:
#semantic search
import random
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load



In [31]:

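import ast
import re

import numpy as np


def normalize(embedding):
    # Scale to unit length so that dot products equal cosine similarity
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding


def parse_and_normalize_embedding(embedding_str):
    # The CSV stores each embedding as the string form of a tensor, e.g. "tensor([...], device='cuda:0')".
    # Extract only the bracketed list so parsing works with or without the device suffix,
    # and use ast.literal_eval instead of eval so that no arbitrary code is executed.
    list_str = re.search(r'\[.*\]', embedding_str, flags=re.DOTALL).group(0)
    embedding = np.array(ast.literal_eval(list_str), dtype=np.float32)
    return normalize(embedding)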


In [33]:
!pip install datasets


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [59]:
import pandas as pd
from datasets import Dataset
import numpy as np

hf_dataset = Dataset.from_pandas(text_chunks_and_embedding_df_load)
hf_dataset = hf_dataset.map(lambda x: {'embedding': parse_and_normalize_embedding(x['embedding'])})



In [61]:
!pip install annoy


Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp310-cp310-linux_x86_64.whl size=550738 sha256=1289826475698f56afbcd5e6b107907fa321b0315088554d0c775a7016c20495
  Stored in directory: /root/.cache/pip/wheels/64/8a/da/f714bcf46c5efdcfcac0559e63370c21abe961c48e3992465a
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [62]:
from annoy import AnnoyIndex
import numpy as np

# Use the length of the first embedding as the index dimension
first_embedding = np.array(hf_dataset[0]['embedding'])
f = first_embedding.shape[0]

# Initialize the Annoy index with the 'angular' metric
t = AnnoyIndex(f, 'angular')  # angular distance is sqrt(2 * (1 - cosine similarity)); smaller means more similar

# Add all embeddings to the Annoy index
for i, item in enumerate(hf_dataset):
    t.add_item(i, item['embedding'])

# Build the index
t.build(10)  # 10 is the number of trees: more trees give more accurate results but a larger index

# The index can now be queried for nearest neighbors, e.g. with
# t.get_nns_by_item(item_index, n) or t.get_nns_by_vector(vector, n)


True

In [63]:
query = 'what is Machine Learning?'
query_embedding = embedding_model.encode(query, convert_to_tensor=True)
query_embedding = query_embedding.cpu().numpy()  # Convert to NumPy array
query_embedding = normalize(query_embedding)

In [65]:
from annoy import AnnoyIndex
import numpy as np

# 'hf_dataset' is the Dataset holding the parsed, normalized embeddings.
# This cell rebuilds the same index as the earlier cell.
first_embedding = np.array(hf_dataset[0]['embedding'])
f = first_embedding.shape[0]
t = AnnoyIndex(f, 'angular')  # angular distance ranks neighbors in the same order as cosine similarity

# Add all embeddings to the Annoy index
for i, item in enumerate(hf_dataset):
    t.add_item(i, item['embedding'])

# Build the index
t.build(10)  # number of trees; can be adjusted


True

In [66]:
# Query the index with the normalized query vector.
# With include_distances=True, get_nns_by_vector returns (indices, distances).
query_index = t.get_nns_by_vector(query_embedding, 25, include_distances=True)  # 25 nearest neighbors

# Look up the matching rows; note that 'scores' holds angular distances, so lower means closer
scores, neighbors = query_index[1], [hf_dataset[i] for i in query_index[0]]


In [69]:
for i in range(len(scores)):
    print(f"Neighbor {i+1}:")
    print(f"Score: {scores[i]}")
    print(f"Text Chunk: {neighbors[i]['sentence_chunk']}")
    print(f"Page Number: {neighbors[i]['page_number']}")
    print("-----------")


Neighbor 1:
Score: 0.8476611971855164
Text Chunk: CHAPTER INTRODUCTION Ever since computers were invented, we have wondered whether they might be made to learn. If we could understand how to program them to learn-to improve automatically with experience-the impact would be dramatic. Imagine comput- ers learning from medical records which treatments are most effective for new diseases, houses learning from experience to optimize energy costs based on the particular usage patterns of their occupants, or personal software assistants learn- ing the evolving interests of their users in order to highlight especially relevant stories from the online morning newspaper. A successful understanding of how to make computers learn would open up many new uses of computers and new levels of competence and customization. And a detailed understanding of information- processing algorithms for machine learning might lead to a better understanding of human learning abilities (and disabilities) as well. We

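- **Note**: To retrieve relevant documents we use cosine similarity, which ignores vector magnitude and compares only direction. That is why both the stored embeddings and the query embedding are normalized: the dot product of two unit vectors equals their cosine similarity. Annoy's 'angular' metric reports a distance, sqrt(2(1 - cosine similarity)), so the "Score" values printed above are distances and lower means more similar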
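
Annoy documents its 'angular' distance as the Euclidean distance between normalized vectors, that is sqrt(2(1 - cos)). Under that definition, a reported distance can be converted back to a cosine similarity. The helper names below are made up for illustration; the input value is the distance printed for the first neighbor above.

```python
import math


def angular_distance_to_cosine(distance: float) -> float:
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular_distance(cosine: float) -> float:
    """Forward direction: cosine similarity to Annoy-style angular distance."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


first_neighbor_distance = 0.8476611971855164  # value printed for Neighbor 1
cosine = angular_distance_to_cosine(first_neighbor_distance)
print(round(cosine, 4))

# Identical directions give distance 0, opposite directions give distance 2
print(cosine_to_angular_distance(1.0), cosine_to_angular_distance(-1.0))
```

Reading the printed value as a distance of roughly 0.85 corresponds to a cosine similarity of roughly 0.64 for the top match.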

#### Functionizing our semantic search pipeline:

In [115]:
from annoy import AnnoyIndex
import numpy as np

# Assuming initialization and building of Annoy index have been done elsewhere in the code
# For example, assuming `t` is your Annoy index and `hf_dataset` is fully prepared

def print_top_results_and_scores(query, hf_dataset, t, n_resources_to_return=5):
    # Step 1: Create the query embedding
    query_embedding = embedding_model.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding.cpu().numpy()
    query_embedding = normalize(query_embedding.flatten())

    # Step 2: Perform Annoy search, returning indices and distances
    indices, distances = t.get_nns_by_vector(query_embedding, n_resources_to_return, include_distances=True)

    # Step 3: Print top results, including distances, neighbors, and their corresponding indices
    for i, idx in enumerate(indices):
        print(f"Neighbor {i+1}:")
        print(f"Distance: {distances[i]}")
        print(f"Text Chunk: {hf_dataset[idx]['sentence_chunk']}")
        print(f"Page Number: {hf_dataset[idx]['page_number']}")
        print("-----------")

    # Return the distances and the matching dataset rows
    return distances, [hf_dataset[idx] for idx in indices]

# Example usage: reuse the index `t` built earlier (a newly created AnnoyIndex would be empty and return no results)
query = 'What is Machine Learning?'
distances, neighbors = print_top_results_and_scores(query, hf_dataset, t, n_resources_to_return=5)


#### Getting LLM:

In [116]:
# Install required libraries
!pip install transformers accelerate torch




In [117]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# No Hugging Face token required for this open-source model
token = None  # Set to None as this model is not gated

# Load the tokenizer and model (different LLM: EleutherAI/gpt-neo-2.7B)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B",
    device_map="auto",       # Automatically map model across GPUs
    torch_dtype=torch.float16,  # Use float16 for better performance
)

# Prepare input text
input_text = "What is Machine Learning."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # Move inputs to the model's device

# Generate text
outputs = model.generate(
    **inputs,                # Pass input_ids together with the attention mask
    max_new_tokens=50,       # Generate up to 50 new tokens
    temperature=0.7,         # Adjust randomness; lower value for more deterministic output
    top_p=0.9,               # Nucleus sampling for diverse generation
    do_sample=True,          # Enable sampling for creative responses
    pad_token_id=tokenizer.eos_token_id,  # GPT-Neo has no pad token, so reuse the end-of-sequence token
)

# Decode and print the output
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)


What is Machine Learning.
Machine learning is a set of techniques for automating the creation of new data and models. Machine learning is an umbrella term for a wide range of techniques, from statistical techniques like linear regression to deep learning techniques like convolutional neural networks. Machine
