<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/chunk_from_GROBID_generated_TEI_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install gdown==4.6.3

!gdown "https://drive.google.com/file/d/1U8mrDhhHdrLxQGn5TEIlemMV-46f9oLL/view?usp=drive_link" -O /content/functional_trait_literature_unsegmented_sentences.zip --fuzzy

!unzip /content/functional_trait_literature_unsegmented_sentences.zip

!pip install beautifulsoup4
!pip install lxml
!pip install pandas
!pip install langchain

Successfully installed dataclasses-json-0.6.3 jsonpatch-1.33 jsonpointer-2.4 langchain-0.1.3 langchain-community-0.0.15 langchain-core-0.1.15 langsmith-0.0.83 marshmallow-3.20.2 mypy-extensions-1.0.0 typing-inspect-0.9.0


### Processing TEI files
1. Split each GROBID-generated TEI file into smaller chunks.
2. Use only paragraphs that belong to a titled section of the body. The reference list at the end of each paper is skipped.
3. Inline references to other works should still be kept with the text. How to do that is still an open question.
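
One possible way to approach points 2 and 3 is sketched below. It assumes the usual GROBID layout, where inline citations appear as `<ref type="bibr" target="#bN">` and the bibliography entries as `<biblStruct xml:id="bN">` inside `<listBibl>`. The sample TEI string, the `paragraphs_with_citations` helper, and the `cited_titles` key are hypothetical names used only for illustration. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; the same lookups can be ported to BeautifulSoup.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Tiny hand-written TEI sample (hypothetical, for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head>
        <p>Leaf traits vary widely <ref type="bibr" target="#b0">(Smith, 2001)</ref>.</p>
      </div>
    </body>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0">
            <analytic><title level="a">Example cited title</title></analytic>
          </biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def paragraphs_with_citations(tei_xml):
    """Return body paragraphs together with the titles of the works they cite."""
    root = ET.fromstring(tei_xml)

    # Map bibliography ids (e.g. "b0") to the title of the cited work.
    bib_titles = {}
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        title = bibl.find(".//tei:title", TEI_NS)
        text = "".join(title.itertext()).strip() if title is not None else ""
        bib_titles[bibl.get(XML_ID)] = text

    results = []
    # Searching under <body> only leaves out the reference list in <back>.
    for para in root.iterfind(".//tei:body//tei:div/tei:p", TEI_NS):
        cited = []
        for ref in para.iterfind(".//tei:ref[@type='bibr']", TEI_NS):
            target = (ref.get("target") or "").lstrip("#")
            if target in bib_titles:
                cited.append(bib_titles[target])
        results.append({
            "page_content": "".join(para.itertext()).strip(),
            "cited_titles": cited,
        })
    return results


for chunk in paragraphs_with_citations(SAMPLE_TEI):
    print(chunk)
```

Storing the cited titles as metadata next to each paragraph keeps the chunk text unchanged while still recording which works it refers to. Whether titles, authors, or full entries are the right thing to keep is a design choice this sketch does not settle.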

### References:
1. https://kermitt2-grobid.hf.space/
2. https://python.langchain.com/docs/integrations/document_loaders/grobid
3. https://research.google.com/colaboratory/local-runtimes.html#:~:text=You%20can%20either%20run%20Jupyter,and%20the%20resource%20utilization%20monitor.
4. https://pypi.org/project/grobid-tei-xml/
5. https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id
6. https://grobid.readthedocs.io/en/latest/Grobid-docker/

In [None]:
from bs4 import BeautifulSoup
from langchain.docstore.document import Document

import json
import os

tei_file_list = os.listdir("/content/funtional_trait_literature_grobid_tei")
chunks = []
for i, f in enumerate(tei_file_list):
  print(f)
  with open(f"/content/funtional_trait_literature_grobid_tei/{f}", 'r') as tei:
      soup = BeautifulSoup(tei, 'xml')
      sections = soup.find_all("div")
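      title = soup.find_all("title")[0].text
      print(f"Title: {title}")
      # Some TEI files may have no <abstract>; guard against find() returning None.
      abstract = soup.find("abstract")
      print("Abstract: ")
      print(abstract.get_text() if abstract is not None else "(none)")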
      keywords = []

      for keyword in soup.find_all("keywords"):
        for term in keyword.find_all("term"):
          keywords.append(term.get_text())

      for j, section in enumerate(sections):
          head = section.find_all("head")
          # Only consider sections that have a <head> (a section title)
          if len(head) > 0:
              paragraphs = section.find_all("p")
              # Each paragraph becomes one chunk. At query time, several similar
              # chunks are retrieved and combined to form the context.
              for k, para in enumerate(paragraphs):
                obj = {
                    "file_name": f,
                    "page_content": para.get_text(),
                    "title": title,
                    "id": f"{i}.{j}.{k}",
                    "keywords": keywords
                }
                print(obj)
                chunks.append(obj)


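# Write all chunks as a single JSON array. Despite the .jsonl extension, this is not
# line-delimited JSON; the download cell reads it back with a plain pd.read_json call.
with open("/content/function_trait_paper_small_chunks.jsonl", "w") as final:
  json.dump(chunks, final, indent=2)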
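
The cell above writes one JSON array into a file with a `.jsonl` extension. If true JSON Lines output (one object per line) is wanted instead, a minimal sketch could look like the following. The helpers `write_jsonl` and `read_jsonl` and the sample chunk are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


# Hypothetical sample chunk with the same keys as the objects built above.
sample_chunks = [
    {"file_name": "paper.tei.xml", "page_content": "Example paragraph.",
     "title": "Example title", "id": "0.1.0", "keywords": ["trait"]},
]

# Round-trip through a temporary file to check that nothing is lost.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "chunks.jsonl")
    write_jsonl(sample_chunks, demo_path)
    round_trip = read_jsonl(demo_path)

print(round_trip == sample_chunks)
```

A file written this way would have to be read with `pd.read_json(path, lines=True)`. The file already uploaded is a JSON array, so the plain `pd.read_json` call used for the download remains correct for it.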



The resulting file was uploaded to Hugging Face manually. The next cell downloads it from there.

TODO:
1. Upload the dataset to Hugging Face programmatically
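
A minimal sketch of how this TODO might be done with `HfApi.upload_file` from `huggingface_hub` is shown below. The helper names `build_upload_args` and `upload_chunks` are hypothetical. The sketch assumes that a token with write access is already configured (for example through `huggingface_hub.login()` or the `HF_TOKEN` environment variable) and that the dataset repository already exists.

```python
import os


def build_upload_args(local_path, repo_id):
    """Collect the keyword arguments passed to HfApi.upload_file."""
    return {
        "path_or_fileobj": local_path,
        # Keep the same file name inside the repository.
        "path_in_repo": os.path.basename(local_path),
        "repo_id": repo_id,
        "repo_type": "dataset",
    }


def upload_chunks(local_path, repo_id):
    """Upload one local file to a Hugging Face dataset repository."""
    # Imported lazily so build_upload_args works without the package installed.
    from huggingface_hub import HfApi

    # Assumption: authentication has been set up before this call.
    return HfApi().upload_file(**build_upload_args(local_path, repo_id))
```

With the paths used in this notebook, the call would be `upload_chunks("/content/function_trait_paper_small_chunks.jsonl", "collaborativeearth/functional_trait_papers")`.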

In [None]:
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "function_trait_paper_small_chunks.jsonl"

# The dataset is currently not gated, which is why it can be downloaded without a token.
# A gated or private dataset would require authentication.
dataset = pd.read_json(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)

print(dataset)