## What we're going to build

We're going to build NutriChat to "chat with a nutrition textbook". 

Specifically:

1. Open a PDF document (you could use almost any PDF here or even a collection of PDFs).
2. Format the text of the PDF textbook ready for an embedding model.
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

All locally!

The steps fall into two parts:

1. Steps 1-3: Document preprocessing and embedding creation.
2. Steps 4-6: Search and answer.

## 1. Document/text processing and embedding creation

Ingredients: 
* PDF document of choice (note: this could be almost any kind of document; I've just chosen to focus on PDFs for now).
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later (embeddings will keep on file for many years, or until you lose your hard drive).

In [1]:
import os
import re
import requests
import fitz # provided by the PyMuPDF package (pip install PyMuPDF)
from tqdm.auto import tqdm
import random
import pandas as pd
import numpy as np
from spacy.lang.en import English

import torch
from sentence_transformers import SentenceTransformer



In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda


## A-) DATA PREPROCESSING

### 1-) Read PDF document

In [3]:
# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content) 
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [4]:
def text_formatter(text: str) -> str:
    """Perform minor text formatting for embedding: replace newlines with spaces and trim."""
    cleaned_text = text.replace("\n", " ").strip()
    
    return cleaned_text

def read_pdf(pdf_path: str) -> list[dict]:
    """Read a PDF file and collect the text and rough statistics of each page."""
    with fitz.open(pdf_path) as doc:
        pages_and_texts = []
        for page_number, page in tqdm(enumerate(doc)):
            text = page.get_text()
            text = text_formatter(text)
            # Offset: PDF page index 41 becomes page 0, so earlier front-matter pages get negative numbers
            pages_and_texts.append({"page_number": page_number - 41,
                                    "page_char_count": len(text),
                                    "page_word_count": len(text.split(" ")),
                                    "page_sentence_count": len(text.split(".")),  # rough count, recomputed with spaCy later
                                    "page_token_count": len(text) / 4,  # rough estimate: 1 token ~= 4 characters
                                    "text": text,
                                    })
            
        return pages_and_texts

In [5]:
pages_and_texts = read_pdf(pdf_path)

1208it [00:00, 1321.90it/s]
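The `page_number - 41` offset in `read_pdf` is easier to follow with a few concrete numbers. A minimal sketch, where the helper name `to_page_number` and the constant name are hypothetical and introduced only for illustration:

```python
# Same offset as in read_pdf; the constant name is made up for this sketch.
PAGE_OFFSET = 41


def to_page_number(pdf_index: int) -> int:
    """Map a zero-based PDF page index to the shifted page number."""
    return pdf_index - PAGE_OFFSET


# First page, a front-matter page, the zero point, a body page, the last of 1208 pages.
for pdf_index in (0, 38, 41, 1098, 1207):
    print(pdf_index, "->", to_page_number(pdf_index))
```

This is why `pages_and_texts[38]` reports page `-3` and `pages_and_texts[1098]` reports page `1057` in the outputs of this notebook.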


In [6]:
pages_and_texts[38:45]

[{'page_number': -3,
  'page_char_count': 479,
  'page_word_count': 92,
  'page_sentence_count': 6,
  'page_token_count': 119.75,
  'text': 'Note to Educators Using this Resource  Please send edits and suggestions directly to Dr. Fialkowski Revilla  on how we may improve the textbook. We also welcome others to  adopt the book for their own course needs, however, we would like  to be able to keep a record of users so that we may update them on  any critical changes to the textbook. Please contact Dr. Fialkowski  Revilla if you are considering to adopt the textbook for your course.  About the Contributors  |  xxxix'},
 {'page_number': -2,
  'page_char_count': 1117,
  'page_word_count': 203,
  'page_sentence_count': 6,
  'page_token_count': 279.25,
  'text': 'Acknowledgements  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  This Open Educational Resource textbook has been adapted from:  OpenStax Anatomy and Physiology // CC BY 4.0  • C

In [7]:
random.sample(pages_and_texts, k=3)

[{'page_number': 954,
  'page_char_count': 1203,
  'page_word_count': 209,
  'page_sentence_count': 11,
  'page_token_count': 300.75,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Physical Activity Intensity and Fuel Use  The exercise intensity determines the contribution of the type of  fuel source used for ATP production(see Figure 16.4 “The Effect of  Exercise Intensity on Fuel Sources”). Both anaerobic and aerobic  metabolism combine during exercise to ensure that the muscles  are equipped with enough ATP to carry out the demands placed on  them. The amount of contribution from each type of metabolism  will depend on the intensity of an activity. When low-intensity  activities are performed, aerobic metabolism is used to supply  enough ATP to muscles. However, during high-intensity activities  more ATP is needed so the muscles must rely on both anaerobic and  aerobic metabolism to meet the body’s demands.  During low-intensity activities, the body will use aerobic  metaboli

In [8]:
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,3,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,145,3,199.25,Contents Preface University of Hawai‘i at Mā...


In [9]:
df.describe().round(2)
# The mean page length is about 287 tokens (estimated as characters / 4).
# Compare this with the max sequence length of the embedding model we will use:
# an average page fits in a single input, but the longest pages (up to ~577 tokens) are much longer.

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,14.18,287.0
std,348.86,560.38,95.76,9.54,140.1
min,-41.0,0.0,1.0,1.0,0.0
25%,260.75,762.0,134.0,8.0,190.5
50%,562.5,1231.5,214.5,13.0,307.88
75%,864.25,1603.5,271.0,19.0,400.88
max,1166.0,2308.0,429.0,82.0,577.0
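To make the token budget concrete, the sketch below reuses the characters / 4 estimate and checks the quantiles from the table above against a sequence limit. The 384-token limit is an assumption taken from the model card of `all-mpnet-base-v2` (the embedding model used later in this notebook); the helper names are hypothetical.

```python
# Quantiles of page_token_count copied from the df.describe() output above.
page_token_quantiles = {"25%": 190.5, "50%": 307.88, "75%": 400.88, "max": 577.0}

# ASSUMPTION: 384 tokens is the limit listed on the all-mpnet-base-v2 model card.
# Check your own model before relying on this number.
ASSUMED_MAX_TOKENS = 384


def estimate_tokens(text: str, chars_per_token: int = 4) -> float:
    """Rough token estimate used throughout this notebook: characters / 4."""
    return len(text) / chars_per_token


def over_limit(quantiles: dict[str, float], limit: float) -> list[str]:
    """Return the names of the quantiles whose estimated token count exceeds the limit."""
    return [name for name, value in quantiles.items() if value > limit]


print(estimate_tokens("Human Nutrition: 2020 Edition"))  # 29 characters
print(over_limit(page_token_quantiles, ASSUMED_MAX_TOKENS))
```

Under that assumed limit the 75% quantile already exceeds the budget, so at least a quarter of the pages would not fit whole. That is one reason to split pages into smaller chunks rather than embed a page at a time.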


### 2-) Text preprocessing (splitting pages into sentences)

In [10]:
nlp = English()
nlp.add_pipe("sentencizer")

test = nlp("This is first sentence. This is second sentence. This is third sentence.")
for sent in test.sents:
    print(sent)

This is first sentence.
This is second sentence.
This is third sentence.
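Why use a sentencizer when `read_pdf` already counted sentences with `text.split(".")`? Splitting on every full stop also splits decimals and always counts the piece after the final full stop. The sketch below contrasts the two ideas using only the standard library. The regex splitter is a rough stand-in to illustrate the point and is not how spaCy's sentencizer works internally.

```python
import re


def naive_sentence_count(text: str) -> int:
    """Count the pieces produced by splitting on every full stop (as in read_pdf)."""
    return len(text.split("."))


def regex_sentence_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' only when whitespace follows (illustrative stand-in)."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


sample = "Vitamin D intake is 5.0 mcg. Iron intake is 11.0 mg. Zinc is 9.0 mg."
print(naive_sentence_count(sample))       # every '.' splits, including the decimals
for sentence in regex_sentence_split(sample):
    print(sentence)
```

The same effect shows up in the summary statistics of this notebook: the mean `page_sentence_count` drops from 14.18 with the naive count to 10.32 once the sentencizer is applied.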


In [11]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    item["sentences"] = [str(sent) for sent in item["sentences"]]
    item["page_sentence_count"] = len(item["sentences"])

100%|██████████| 1208/1208 [00:01<00:00, 951.15it/s] 


In [12]:
pages_and_texts[1098]

{'page_number': 1057,
 'page_char_count': 1841,
 'page_word_count': 307,
 'page_sentence_count': 19,
 'page_token_count': 460.25,
 'text': 'harmful microorganisms that can cause foodborne illnesses.  Therefore, people who primarily eat raw foods should thoroughly  clean all fruit and vegetables before eating them. Poultry and other  meats should always be cooked before eating.12  Vegetarian and Vegan Diets  Vegetarian and vegan diets have been followed for thousands of  years for different reasons, including as part of a spiritual practice,  to show respect for living things, for health reasons, or because of  environmental concerns. For many people, being a vegetarian is a  logical outgrowth of “thinking green.” A meat-based food system  requires more energy, land, and water resources than a plant-based  food system. This may suggest that the plant-based diet is more  sustainable than the average meat-based diet in the U.S.By avoiding  animal flesh, vegetarians hope to look after thei

In [13]:
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,text,sentences
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition,[Human Nutrition: 2020 Edition]
1,-40,0,1,0,0.0,,[]
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...,[Human Nutrition: 2020 Edition UNIVERSITY OF...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...,[Human Nutrition: 2020 Edition by University o...
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...,[Contents Preface University of Hawai‘i at M...


In [14]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,10.32,287.0
std,348.86,560.38,95.76,6.3,140.1
min,-41.0,0.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5
50%,562.5,1231.5,214.5,10.0,307.88
75%,864.25,1603.5,271.0,15.0,400.88
max,1166.0,2308.0,429.0,28.0,577.0


### 3-) Chunking sentences into groups of 10

In [15]:
num_sentences_per_chunk = 10


def chunk_sentences(sentences: list, num_sentences_per_chunk: int) -> list[list[str]]:
    """Chunk sentences into groups of num_sentences_per_chunk."""
    chunks = []
    for i in range(0, len(sentences), num_sentences_per_chunk):
        chunk = sentences[i:i + num_sentences_per_chunk]
        chunks.append(chunk)
    
    return chunks

In [16]:
# Let's test the function with a sample
chunk_sentences(list(range(25)), num_sentences_per_chunk=10)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
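The number of chunks per page follows directly from the slicing: a page with `n` sentences yields `ceil(n / 10)` chunks, and the last chunk holds whatever is left over. A small self-contained check (the function is restated here so the snippet runs on its own):

```python
import math


def chunk_sentences(sentences: list, num_sentences_per_chunk: int) -> list[list]:
    """Same slicing logic as the function above, written as a list comprehension."""
    return [sentences[i:i + num_sentences_per_chunk]
            for i in range(0, len(sentences), num_sentences_per_chunk)]


for n in (0, 1, 10, 19, 21, 28):
    chunks = chunk_sentences(list(range(n)), 10)
    # The chunk count is always the sentence count divided by 10, rounded up.
    assert len(chunks) == math.ceil(n / 10)
    print(n, "sentences ->", [len(chunk) for chunk in chunks])
```

Note the 21-sentence case: the final chunk contains a single sentence. Leftover chunks like this are one source of the very short chunks that get filtered out later.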

In [17]:
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = chunk_sentences(item["sentences"], num_sentences_per_chunk=num_sentences_per_chunk)
    item["page_chunk_count"] = len(item["sentence_chunks"])

100%|██████████| 1208/1208 [00:00<00:00, 952390.83it/s]


In [18]:
pages_and_texts[1098]

{'page_number': 1057,
 'page_char_count': 1841,
 'page_word_count': 307,
 'page_sentence_count': 19,
 'page_token_count': 460.25,
 'text': 'harmful microorganisms that can cause foodborne illnesses.  Therefore, people who primarily eat raw foods should thoroughly  clean all fruit and vegetables before eating them. Poultry and other  meats should always be cooked before eating.12  Vegetarian and Vegan Diets  Vegetarian and vegan diets have been followed for thousands of  years for different reasons, including as part of a spiritual practice,  to show respect for living things, for health reasons, or because of  environmental concerns. For many people, being a vegetarian is a  logical outgrowth of “thinking green.” A meat-based food system  requires more energy, land, and water resources than a plant-based  food system. This may suggest that the plant-based diet is more  sustainable than the average meat-based diet in the U.S.By avoiding  animal flesh, vegetarians hope to look after thei

In [19]:
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,text,sentences,sentence_chunks,page_chunk_count
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition,[Human Nutrition: 2020 Edition],[[Human Nutrition: 2020 Edition]],1
1,-40,0,1,0,0.0,,[],[],0
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...,[Human Nutrition: 2020 Edition UNIVERSITY OF...,[[Human Nutrition: 2020 Edition UNIVERSITY O...,1
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...,[Human Nutrition: 2020 Edition by University o...,[[Human Nutrition: 2020 Edition by University ...,1
4,-37,797,145,2,199.25,Contents Preface University of Hawai‘i at Mā...,[Contents Preface University of Hawai‘i at M...,[[Contents Preface University of Hawai‘i at ...,1


In [20]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_chunk_count
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,198.3,10.32,287.0,1.53
std,348.86,560.38,95.76,6.3,140.1,0.64
min,-41.0,0.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5,1.0
50%,562.5,1231.5,214.5,10.0,307.88,1.0
75%,864.25,1603.5,271.0,15.0,400.88,2.0
max,1166.0,2308.0,429.0,28.0,577.0,3.0


In [21]:
pages_and_chunks = []


for item in tqdm(pages_and_texts):
    for chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        # Join the sentences of the chunk into one string and collapse double spaces
        joined_sentences_chunk = " ".join(chunk).replace("  ", " ").strip()
        
        # Add a space after a full stop that is directly followed by a capital letter: ".A" => ". A"
        joined_sentences_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentences_chunk)
        
        chunk_dict["sentence_chunk"] = joined_sentences_chunk
        chunk_dict["chunk_char_count"] = len(joined_sentences_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentences_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentences_chunk) / 4  # rough estimate: 1 token ~= 4 characters
        
        pages_and_chunks.append(chunk_dict)
    

100%|██████████| 1208/1208 [00:00<00:00, 81580.49it/s]


In [22]:
len(pages_and_chunks)

1843
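The two clean-up steps in the loop above have edge cases worth seeing once. The sketch below applies the same `replace` and `re.sub` calls to short strings; `join_chunk` is a hypothetical wrapper used only for this illustration.

```python
import re


def join_chunk(sentences: list[str]) -> str:
    """Join sentences and clean them the same way as the loop above."""
    joined = " ".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)


# Intended effect: a missing space between sentences is restored.
print(join_chunk(["More text.Next sentence."]))

# Side effect: abbreviations written with full stops are split as well.
print(join_chunk(["diet in the U.S.By avoiding animal flesh"]))

# str.replace makes a single pass, so longer runs of spaces are only halved.
print(repr("a    b".replace("  ", " ")))
# Splitting on whitespace and re-joining collapses runs of any length.
print(repr(" ".join("a    b".split())))
```

For embedding purposes the extra spaces in `U. S.` are probably harmless, but they are worth knowing about if the chunk text is later shown to a user.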

In [23]:
#pages_and_chunks

In [24]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 893,
  'sentence_chunk': 'Nutrient Males, Ages 14–18 Females, Ages 14–18 Vitamin A (mcg) 900.0 700.0 Vitamin B6 (mg) 1.3 1.2 Vitamin B12 (mcg) 2.4 2.4 Vitamin C (mg) 75.0 65.0 Vitamin D (mcg) 5.0 5.0 Vitamin E (mg) 15.0 15.0 Vitamin K (mcg) 75.0 75.0 Calcium (mg) 1,300.0 1,300.0 Folate mcg) 400.0 400.0 Iron (mg) 11.0 15.0 Magnesium (mg) 410.0 360.0 Niacin (B3) (mg) 16.0 14.0 Phosphorus (mg) 1,250.0 1,250.0 Riboflavin (B2) (mg) 1.3 1.0 Selenium (mcg) 55.0 55.0 Thiamine (B1) (mg) 1.2 1.0 Zinc (mg) 11.0 9.0 Source: Institute of Medicine. 2006. Dietary Reference Intakes: The Essential Guide to Nutrient Requirements. Washington, DC: The National Academies Press. https://doi.org/10.17226/11537. Accessed December 10, 2017. Eating Disorders Many teens struggle with an eating disorder, which can have a detrimental effect on diet and health. A study published by North Dakota State University estimates that these conditions impact twenty-four million people in the United States a

In [25]:
df = pd.DataFrame(pages_and_chunks)
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,-41,Human Nutrition: 2020 Edition,29,4,7.25
1,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0
2,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5
3,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5
4,-36,Lifestyles and Nutrition University of Hawai‘i...,942,143,235.5


In [26]:
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,735.14,113.03,183.78
std,347.79,447.64,71.27,111.91
min,-41.0,12.0,3.0,3.0
25%,280.5,315.0,45.0,78.75
50%,586.0,747.0,114.0,186.75
75%,890.0,1119.0,174.0,279.75
max,1166.0,1832.0,298.0,458.0


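- `all-mpnet-base-v2` produces 768-dimensional embeddings; 768 is the size of the output vector, not the input window. According to the model card, input longer than 384 tokens is truncated.
- Our estimated token count per chunk ranges from 3 to 458, with 75% of chunks at or below about 280 tokens, so most chunks fit and only the longest ones risk being truncated.
- Chunks of roughly 30 tokens or fewer are unlikely to give meaningful results, so we can filter them out.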

In [27]:
df[df["page_number"]==1057] 

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
1656,1057,harmful microorganisms that can cause foodborn...,1330,210,332.5
1657,1057,Vegetarian diets have a number of benefits. We...,483,70,120.75


In [28]:
# Let's check what kind of text is in the chunks with 30 tokens or fewer
min_token_count = 30

for _, row in df[df["chunk_token_count"] <= min_token_count].sample(5).iterrows():
    print(f"Chunk token count: {row['chunk_token_count']}, Sentence chunk: {row['sentence_chunk']}")

Chunk token count: 24.75, Sentence chunk: http://www.ajcn.org/content/87/1/64.long. Accessed September 22, 2017. 554 | Water-Soluble Vitamins
Chunk token count: 9.25, Sentence chunk: Type 2 diabetes after 804 | Pregnancy
Chunk token count: 9.75, Sentence chunk: Table 3.5 Salt Substitutes Sodium | 185
Chunk token count: 13.25, Sentence chunk: https://doi.org/10.1186/ 1743-7075-4-24. Sulfur | 637
Chunk token count: 16.25, Sentence chunk: Health Consequences and Benefits of High-Carbohydrate Diets | 267
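The filter that follows keeps only chunks strictly above the threshold, so a chunk with exactly 30 estimated tokens is dropped. A pandas-free sketch of the same rule on a tiny sample; the first entry is taken from the output above, while the other two are made-up placeholders:

```python
min_token_count = 30

# Same shape as the chunk dictionaries in this notebook, reduced to two keys.
sample_chunks = [
    {"sentence_chunk": "Table 3.5 Salt Substitutes Sodium | 185", "chunk_token_count": 9.75},
    {"sentence_chunk": "placeholder chunk exactly at the threshold", "chunk_token_count": 30.0},
    {"sentence_chunk": "placeholder chunk with enough text to keep", "chunk_token_count": 52.5},
]

# Keep chunks strictly above the threshold; everything else is dropped.
kept = [chunk for chunk in sample_chunks if chunk["chunk_token_count"] > min_token_count]
dropped = [chunk for chunk in sample_chunks if chunk["chunk_token_count"] <= min_token_count]

print(len(kept), "kept,", len(dropped), "dropped")
```

On the full data this rule reduces the 1843 chunks to the 1680 that are embedded in the next section.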


In [29]:
pages_and_chunks_over_min_token_count = df[df["chunk_token_count"] > min_token_count].to_dict(orient="records")

pages_and_chunks_over_min_token_count[:3]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5},
 {'page_number': -37,
  'sentence_chunk': 'Contents Preface University of Hawai‘i at Mānoa Food Science and Human Nutrition Program and Human Nutrition Program xxv About the Contributors University of Hawai‘i at Mānoa Food S

In [30]:
random.sample(pages_and_chunks_over_min_token_count, k=1)

[{'page_number': 141,
  'sentence_chunk': 'participants from fifty-two countries concluded that the waist-to- hip ratio is highly correlated with heart attack risk worldwide and is a better predictor of heart attacks than BMI.1. Abdominal obesity is defined by the World Health Organization (WHO) as having a waist- to-hip ratio above 0.90 for males and above 0.85 for females. Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities. \xa0 These activities are available in the web-based textbook and not available in the downloadable versions (EPUB, Digital PDF, Print_PDF, or Open Document). Learning activities may be used across various mobile devices, however, for the best user experience it is strongly recommended that users complete these activities using a desktop or laptop computer and in Google Chrome. \xa0 1. \xa0Yusuf S, Hawken S, et al. ( 2005). Obesity and the Risk of Myocardi

### 4-) Embedding chunks with embedding model

In [31]:
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2")

In [32]:
test_sentences = ["This is a test sentence.", "Another test sentence.", "This is a third test sentence."]
test_embeddings = embedding_model.encode(test_sentences)
test_embeddings.shape
embedding_dict =dict(zip(test_sentences, test_embeddings))

for sent, embedding in embedding_dict.items():
    print(f"Sentence: {sent}, Embedding: {embedding}")

Sentence: This is a test sentence., Embedding: [ 3.78062250e-04 -5.08035198e-02 -3.51471975e-02 -2.32510362e-02
 -4.41583097e-02  2.04878431e-02  1.46187784e-03  3.12617980e-02
  5.60515523e-02  1.88153777e-02  6.46201670e-02 -1.66587401e-02
  2.24149274e-03 -6.62649125e-02  2.82418374e-02 -2.49872077e-03
  8.14975724e-02  8.00239854e-03 -4.89552096e-02  3.32183763e-02
 -1.88362971e-02  9.67359543e-03 -2.18884065e-03 -3.58971134e-02
 -5.01143709e-02 -2.18429603e-03 -2.14774571e-02 -3.25635113e-02
  2.42515989e-02 -2.65391860e-02  6.25296757e-02 -3.62269976e-03
 -1.09872911e-02 -7.67027736e-02  1.53072881e-06  1.44890873e-02
 -3.17214685e-03 -3.32370065e-02 -6.87476769e-02 -5.63172065e-03
  5.28364070e-03  6.53427169e-02  4.27035708e-03  4.32255492e-02
 -2.95564383e-02  9.66731086e-03  4.99073006e-02  1.99880656e-02
 -5.37453927e-02  8.12139958e-02 -1.67013239e-03 -2.15639739e-04
 -3.63481045e-03 -5.01495972e-02  7.31552914e-02  3.38429250e-02
  2.20924732e-03  3.29122208e-02  1.5380965

In [33]:
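%%time
embedding_model.to(device)  # use the device selected earlier ("cuda" here)

# Embed the chunks one at a time; passing a one-item list gives each embedding the shape (1, 768)
for item in tqdm(pages_and_chunks_over_min_token_count):
    item["embedding"] = embedding_model.encode([item["sentence_chunk"]])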


100%|██████████| 1680/1680 [00:16<00:00, 102.55it/s]

CPU times: user 2min 1s, sys: 140 ms, total: 2min 1s
Wall time: 16.4 s





In [34]:
embedding_model.to(device)  # use the device selected earlier ("cuda" here)

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_count]
text_chunks[419]


'often. • Calm your “sweet tooth” by eating fruits, such as berries or an apple. • Replace sugary soft drinks with seltzer water, tea, or a small amount of 100 percent fruit juice added to water or soda water. The Food Industry: Functional Attributes of Carbohydrates and the Use of Sugar Substitutes In the food industry, both fast-releasing and slow-releasing carbohydrates are utilized to give foods a wide spectrum of functional attributes, including increased sweetness, viscosity, bulk, coating ability, solubility, consistency, texture, body, and browning capacity. The differences in chemical structure between the different carbohydrates confer their varied functional uses in foods. Starches, gums, and pectins are used as thickening agents in making jam, cakes, cookies, noodles, canned products, imitation cheeses, and a variety of other foods. Molecular gastronomists use slow- releasing carbohydrates, such as alginate, to give shape and texture to their fascinating food creations. Add

In [35]:
%%time
text_chunks_embeddings = embedding_model.encode(text_chunks, batch_size=32, convert_to_tensor=True)
text_chunks_embeddings


CPU times: user 19 s, sys: 171 ms, total: 19.1 s
Wall time: 13.9 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')
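Passing the whole list with `batch_size=32` lets the library work through the chunks in groups instead of one call per chunk (13.9 s wall time here versus 16.4 s for the one-by-one loop). The sketch below only illustrates the grouping arithmetic. `batched` is a hypothetical helper, and the library may additionally reorder or pad inputs, which this sketch does not attempt.

```python
from typing import Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder texts; 1680 matches the number of chunks embedded above.
chunks = [f"chunk {i}" for i in range(1680)]
batches = list(batched(chunks, batch_size=32))

print(len(batches), "batches; first has", len(batches[0]), "items, last has", len(batches[-1]))
```

With 1680 chunks and a batch size of 32 the last batch is only half full, which is normal: the batch size is an upper bound per step, not a requirement on the data size.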

In [None]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_count)
text_chunks_and_embeddings_df.head()

In [37]:
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [38]:
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[[ 6.74242675e-02 9.02281329e-02 -5.09549491e...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[[ 5.52156270e-02 5.92139587e-02 -1.66167449e...
2,-37,Contents Preface University of Hawai‘i at Māno...,766,114,191.5,[[ 2.79801972e-02 3.39813679e-02 -2.06426457e...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,942,143,235.5,[[ 6.82566985e-02 3.81275043e-02 -8.46854504e...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[[ 3.30264494e-02 -8.49768892e-03 9.57159698e...
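One consequence of saving to CSV is visible in the `embedding` column above: each vector comes back as a string such as `[[ 6.74e-02 ...`, not as an array. A minimal sketch of turning such a string back into numbers using only the standard library; `parse_embedding` is a hypothetical helper and the sample cell uses made-up values:

```python
def parse_embedding(cell: str) -> list[float]:
    """Turn the string form of a saved embedding back into a flat list of floats."""
    # Brackets carry no information once the vector is flattened, so blank them out.
    cleaned = cell.replace("[", " ").replace("]", " ")
    # split() with no arguments also handles the line breaks found in long arrays.
    return [float(value) for value in cleaned.split()]


# Made-up three-value cell in the same style as the saved column.
cell = "[[ 0.5 -0.25\n  1.5e-02]]"
vector = parse_embedding(cell)
print(len(vector), vector)
```

With the loaded DataFrame, something like `text_chunks_and_embeddings_df_load["embedding"].apply(parse_embedding)` would convert the whole column, after which the values can be stacked into an array or tensor for search.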


## B-) RAG