This notebook splits the texts in `glutamate/texts` into chunks and writes them to a `.csv` file, which `glutamate.ipynb` uses for retrieval-augmented generation (RAG).

In [1]:
import os
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [2]:
files = [
    "breit.txt",
    "coelho.txt",
    "danaher.txt",
    "howe.txt",
    "peake.txt",
    "zauber.txt"
]

In [3]:
texts = []
for file in files:
    path = os.path.join("texts", file)
    # Use a context manager so each file handle is closed after reading.
    with open(path, "r") as f:
        texts.append(f.read())

In [4]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,
)
# One metadata dict per source text, in the same order as `texts`.
metadata = [{"id": i, "file": file} for i, file in enumerate(files)]
docs = text_splitter.create_documents(texts, metadata)
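
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is not what `RecursiveCharacterTextSplitter` does internally: the splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are at most `chunk_size` characters long and vary in length. The function name `sliding_window_chunks` and the sample string are hypothetical, introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows (simplified illustration, not the LangChain algorithm).

    Each window starts chunk_size - chunk_overlap characters after the previous one,
    so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical 2000-character sample text.
sample = "".join(str(i % 10) for i in range(2000))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [800, 800, 600]
```

The last 100 characters of one window equal the first 100 characters of the next, which is the effect `chunk_overlap = 100` aims for: a sentence cut at a chunk boundary still appears whole in at least one chunk.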

In [5]:
len(docs)

510

In [6]:
docs[0]

Document(page_content='ABSTRACT\nThe objectives of this work were the classification of dynamic\nmetabolic biomarker candidates and the modeling and characterization\nof kinetic regulatory mechanisms in human metabolism with response to\nexternal perturbations by physical activity. Longitudinal metabolic\nconcentration data of 47 individuals from 4 different groups were\nexamined, obtained from a cycle ergometry cohort study. In total, 110\nmetabolites (within the classes of acylcarnitines, amino acids, and\nsugars) were measured through a targeted metabolomics approach,\ncombining tandem mass spectrometry (MS/MS) with the concept of stable\nisotope dilution (SID) for metabolite quantitation. Biomarker\ncandidates were selected by combined analysis of maximum fold changes', metadata={'id': 0, 'file': 'breit.txt'})

In [7]:
for file in files:

    # Print number of chunks.
    file_docs = [ doc for doc in docs if doc.metadata["file"] == file ]
    num_chunks = len(file_docs)
    print(f"{file}: {num_chunks}")

    # Add chunk_id to metadata.
    for chunk_id in range(num_chunks):
        file_docs[chunk_id].metadata["chunk_id"] = chunk_id

breit.txt: 96
coelho.txt: 60
danaher.txt: 89
howe.txt: 62
peake.txt: 92
zauber.txt: 111
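
The cell above filters `docs` once per file. As an alternative sketch, the same per-file numbering can be done in a single pass with a running counter. The `Doc` dataclass below is a hypothetical stand-in for the LangChain `Document` (it only mimics the `page_content` and `metadata` attributes), and `add_chunk_ids` and the sample entries are likewise illustrative names, not part of this notebook.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def add_chunk_ids(docs):
    """Number chunks 0, 1, 2, ... within each file, in list order.

    Returns a dict mapping each file name to its number of chunks.
    """
    counts = defaultdict(int)
    for doc in docs:
        name = doc.metadata["file"]
        doc.metadata["chunk_id"] = counts[name]
        counts[name] += 1
    return dict(counts)


# Placeholder documents, not the real chunks.
docs_demo = [
    Doc("a0", {"id": 0, "file": "a.txt"}),
    Doc("a1", {"id": 0, "file": "a.txt"}),
    Doc("b0", {"id": 1, "file": "b.txt"}),
]
counts = add_chunk_ids(docs_demo)
print(counts)  # {'a.txt': 2, 'b.txt': 1}
```

Both approaches rely on `docs` keeping the chunks of each file in reading order, so that `chunk_id` reflects position within the source text.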


In [8]:
df = pd.DataFrame({
    "id": [doc.metadata["id"] for doc in docs],
    "chunk_id": [doc.metadata["chunk_id"] for doc in docs],
    "chunk": [doc.page_content for doc in docs],
    "file": [doc.metadata["file"] for doc in docs]
})
df

,id,chunk_id,chunk,file
0,0,0,ABSTRACT\nThe objectives of this work were the...,breit.txt
1,0,1,candidates were selected by combined analysis ...,breit.txt
2,0,2,"MFC and statistical significance, was classifi...",breit.txt
3,0,3,biomarker identification and the investigation...,breit.txt
4,0,4,through a cycle ergometry stress test. In tota...,breit.txt
...,...,...,...,...
505,5,106,New perspectives arise from metabolomic analys...,zauber.txt
506,5,107,Table 3. Overrepresentation analysis of GO ter...,zauber.txt
507,5,108,Male\tSignal transduction\t0.052\t0.035\t0.687...,zauber.txt
508,5,109,"In summary, our study is one of the first case...",zauber.txt


In [9]:
df.to_csv("glutamate.csv")
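
Because `to_csv` is called with its default `index=True`, the row index is written as a first column with an empty header. The sketch below shows how that column appears when the file is read back with `pd.read_csv`, and how `index_col=0` restores it as the index. The frame `demo` holds placeholder rows and is written to an in-memory buffer instead of `glutamate.csv`; how `glutamate.ipynb` actually loads the file is not shown here.

```python
import io

import pandas as pd

# Placeholder rows with the same columns as df, not the real chunks.
demo = pd.DataFrame({
    "id": [0, 0],
    "chunk_id": [0, 1],
    "chunk": ["first placeholder chunk", "second placeholder chunk"],
    "file": ["a.txt", "a.txt"],
})

# Same call as above, but writing to memory instead of disk.
buffer = io.StringIO()
demo.to_csv(buffer)
csv_text = buffer.getvalue()

# Default read: the unnamed index column becomes a regular column.
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))    # ['Unnamed: 0', 'id', 'chunk_id', 'chunk', 'file']

# Treat the first column as the index again.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))  # ['id', 'chunk_id', 'chunk', 'file']
```

Passing `index=False` to `to_csv` would omit the index column altogether; whether that is preferable depends on how the consuming notebook reads the file.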