<a href="https://colab.research.google.com/github/smthomas1704/restoration-rag/blob/main/chunk_from_GROBID_generated_TEI_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install gdown==v4.6.3

!gdown https://drive.google.com/file/d/1LrXwPMgiok1zcn4i4LND7LOohslSmv8x/view?usp=drive_link -O /content/functional_trait_literature_unsegmented_sentences.zip --fuzzy

!unzip /content/functional_trait_literature_unsegmented_sentences.zip

!pip install beautifulsoup4
!pip install lxml
!pip install pandas
!pip install langchain

Successfully installed dataclasses-json-0.6.4 jsonpatch-1.33 jsonpointer-2.4 langchain-0.1.12 langchain-community-0.0.28 langchain-core-0.1.32 langchain-text-splitters-0.0.1 langsmith-0.1.27 marshmallow-3.21.1 mypy-extensions-1.0.0 orjson-3.9.15 packaging-23.2 typing-inspect-0.9.0


### Processing TEI files.
1. We will be splitting the GROBID generated TEI files into smaller chunks.
2. Only paragraphs with titles and body will be used. Will skip references at the end of the page.
3. However, we do want to include the inline references to other material. Need to figure out how to do that.

### References:
1. https://kermitt2-grobid.hf.space/
2. https://python.langchain.com/docs/integrations/document_loaders/grobid
3. https://research.google.com/colaboratory/local-runtimes.html#:~:text=You%20can%20either%20run%20Jupyter,and%20the%20resource%20utilization%20monitor.
4. https://pypi.org/project/grobid-tei-xml/
5. https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id
6. https://grobid.readthedocs.io/en/latest/Grobid-docker/


### TODO
1. Also get "figDesc" to the chunks. Some of these figure descriptions have a lot of information.
2. Possibly we should also eliminate some useless chunks such as acknowledgements, conflict of interest statements etc

In [2]:
from bs4 import BeautifulSoup
from langchain.docstore.document import Document

import json
import os

tei_file_list = os.listdir("/content/tei_all_afr_carbon_2")
chunks = []
large_chunks = []
abstracts_only = []
for i, f in enumerate(tei_file_list):
  print(f)
  with open(f"/content/tei_all_afr_carbon_2/{f}", 'r') as tei:
      soup = BeautifulSoup(tei, 'xml')
      sections = soup.find_all("div")
      heads = soup.find_all("head")
      title = soup.find_all("title")[0].text
      abstract = soup.find("abstract").get_text()
      abstracts_only.append(abstract)
      keywords = []

      for keyword in soup.find_all("keywords"):
        for term in keyword.find_all("term"):
          keywords.append(term.get_text())

      for j, section in enumerate(sections):
          head = section.find_all("head")
          # Only consider the paragraphs that have a head
          combined_paras = []
          if len(head) > 0:
              # large_chunks.append(section.get_text())
              paragraphs = section.findAll("p")
              # Each paragraph can be a chunk. When feeding context we will get several similar
              # chunks and combine it to pass context
              for k, para in enumerate(paragraphs):
                obj = {
                    "file_name": f,
                    "page_content": para.get_text(),
                    "title": title,
                    "id": f"{i}.{j}.{k}",
                    "keywords": keywords
                }
                combined_paras.append(para.get_text())
          if len(combined_paras) > 0:
            chunks.extend(combined_paras)
            large_chunk = "\n".join(combined_paras)
            # print(large_chunk)
            large_chunks.append({
                "file_name": f,
                "page_content": large_chunk,
                "title": title,
                "id": f"{i}.{j}",
                "keywords": keywords
            })


# Write chunks to jsonl file.
with open("/content/all_afr_carbon_small_chunks.jsonl", "w") as final:
  json.dump(chunks, final, indent=2)

with open("/content/all_afr_carbon_large_chunks.jsonl", "w") as final:
  json.dump(large_chunks, final, indent=2)

with open("/content/all_afr_carbon_abstracts_only.jsonl", "w") as final:
  json.dump(abstracts_only, final, indent=2)



Mangwale et al. - 2017 - Changes in forest cover and carbon stocks of the coastal scarp forests of the Wild Coast, South Afri.grobid.tei.xml
Jimenez et al. - 2022 - Recovery of Soil Processes in Replanted Mangroves Implications for Soil Functions.grobid.tei.xml
Rao et al. - 2022 - Participatory active restoration of communal forests in temperate Himalaya, India.grobid.tei.xml
Assefa et al. - 2017 - Deforestation and land use strongly effect soil organic carbon and nitrogen stock in Northwest Ethio.grobid.tei.xml
Jamaluddin - 2013 - ASSESSING SOIL FERTILITY STATUS OF REHABILITATED DEGRADED TROPICAL RAINFOREST.grobid.tei.xml
Yao Ping et al. - 2014 - Carbon sequestration potential of the major stands under the Grain for Green Program in southwest Ch.grobid.tei.xml
He et al. - 2013 - Carbon storage capacity of monoculture and mixed-species plantations in subtropical China.grobid.tei.xml
Marcuzzo et al. - 2014 - Comparação entre áreas em restauração e área de referência no Rio Grande

Uploaded the file to Huggingface. Now download it in the next section

TODO:
1. Programmatically upload dataset to Huggingface

In [None]:
!pip install huggingface_hub
!pip install datasets

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.7 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets

In [None]:
from datasets import load_dataset
from google.colab import userdata
from huggingface_hub import hf_hub_download
import pandas as pd

HUGGINGFACE_TOKEN = userdata.get("HUGGINGFACE_TOKEN")
REPO_ID = "collaborativeearth/functional_trait_papers"
FILENAME = "all_afr_carbon_small_chunks.jsonl"

# Currently the dataset is not gated, thats why we're able to download it like
dataset = load_dataset(REPO_ID)

dataset.push_to_hub(FILENAME, token=HUGGINGFACE_TOKEN)

# print(dataset)

FileNotFoundError: Directory all_afr_carbon_small_chunks.jsonl is neither a `Dataset` directory nor a `DatasetDict` directory.