# Data Curation
This notebook showcases the building blocks of a simple data curation pipeline built with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator).

## Reading Materials
Before proceeding, we highly recommend looking through the following deep dive blog posts that walk you through building data curation pipelines using NeMo Curator:
- [Curating Custom Datasets for LLM Training with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-training-with-nvidia-nemo-curator/)
- [Curating Custom Datasets for LLM Parameter-Efficient Fine-Tuning with NVIDIA NeMo Curator](https://developer.nvidia.com/blog/curating-custom-datasets-for-llm-parameter-efficient-fine-tuning-with-nvidia-nemo-curator/)

Also, please check out [our tutorials](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials) in the repository to learn more about the functionality that NeMo Curator provides.

In this notebook, we will use the [Law-StackExchange dataset](https://huggingface.co/datasets/ymoslem/Law-StackExchange), a collection of legal questions and answers scraped from the Stack Exchange website. This notebook is a condensed version of our existing [synthetic data generation tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg). Feel free to go through that tutorial to gain a better understanding of the various NeMo Curator facilities.

## Setup and Requirements
The NeMo dependencies are already installed in the container. However, before proceeding you need to install one dependency to follow along. Execute the following cell before getting started.

In [1]:
! pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting ipywidgets
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Downloading widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Downloading ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)
Downloading widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Installing collect

---
## Getting Started

To get started, let's set up some environment variables, along with the path variables used for storing the curated data and the intermediate temporary files that this notebook needs.

In [1]:
import os
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"  # Needed for running Curator on the GPU

NOTEBOOK_DIR = os.path.abspath("")
DATA_DIR = os.path.join(NOTEBOOK_DIR, "data")
TEMP_DIR = os.path.join(NOTEBOOK_DIR, ".temp")
os.makedirs(DATA_DIR, exist_ok=True)

Let's now import everything we need to build our data curation pipeline. For your convenience, we've provided the document builder implementations that allow you to download the dataset from Hugging Face and convert it into a Pandas `DataFrame`.

We have additionally implemented a score-based filter that allows you to filter the dataset rows using the score values assigned to each question. You can use this implementation as the basis for creating your own filtering/scoring mechanisms using NeMo Curator.

In [2]:
from nemo_curator.utils.distributed_utils import get_client
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.modules.modify import Modify

# Importing helper functions
from helpers.filters import FilterLowScores
from helpers.docbuilder import download_and_convert_dataset
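
To make the idea of a score-based filter more concrete, here is a minimal plain-Python sketch of the score-then-keep pattern. The class `ThresholdScoreFilter`, the function `apply_filter`, and the sample records are hypothetical and exist only for illustration; this is not the `FilterLowScores` helper imported above, and it does not depend on NeMo Curator.

```python
from dataclasses import dataclass


@dataclass
class ThresholdScoreFilter:
    """Hypothetical stand-in for a score-based filter (not the real helper)."""

    score_threshold: int

    def score_document(self, score: int) -> bool:
        # "Scoring" is just a comparison against the threshold.
        return score >= self.score_threshold

    def keep_document(self, passed: bool) -> bool:
        # Keep the record only if the scoring step passed.
        return passed


def apply_filter(records, score_filter, field):
    """Return the records whose `field` value passes the filter."""
    return [
        record
        for record in records
        if score_filter.keep_document(score_filter.score_document(record[field]))
    ]


# Made-up records, used only to exercise the filter.
records = [
    {"id": "a", "question_score": 3},
    {"id": "b", "question_score": -2},
    {"id": "c", "question_score": 0},
]
kept = apply_filter(records, ThresholdScoreFilter(score_threshold=0), "question_score")
print([record["id"] for record in kept])
```

Separating the scoring step from the keep decision makes it possible to inspect or store the scores before deciding which rows to drop.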

Before proceeding, let's decide the compute resources we'd like to use for running our data curation pipeline. NeMo Curator uses Dask to orchestrate scalable data processing. As such, it needs to know what resources to use. 

For the purposes of this notebook, we will instruct NeMo Curator to use 4 CPU workers. While most NeMo Curator functionalities can be executed on the CPU, some modules (such as semantic deduplication) can only be executed on the GPU. Please make sure to select the appropriate device.

Note that you can increase or decrease the number of CPU workers depending on the runtime environment. Keep in mind that each CPU worker gets allocated a fixed amount of the total available system memory (RAM). Thus, if the environment does not have enough memory available, Dask operations might fail.

Once we have decided on the resources to use, we can initialize our Dask cluster and start using NeMo Curator.

In [3]:
device = "cpu"  # It can be either "cpu" or "gpu"
n_workers = 4  # Number of workers to use for Dask. If running out of memory, try reducing this.
client = get_client(device, n_workers=n_workers, set_torch_to_use_rmm=False)
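
When choosing `n_workers`, a back-of-the-envelope estimate of the memory each worker receives can help. The sketch below assumes that the total RAM is divided evenly among the workers, which may not match your cluster configuration; the helper names and the 64 GiB / 8 GiB figures are hypothetical.

```python
def memory_per_worker_gib(total_ram_gib: float, n_workers: int) -> float:
    """Estimate per-worker memory, assuming an even split of total RAM."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1.")
    return total_ram_gib / n_workers


def max_workers_for(total_ram_gib: float, min_gib_per_worker: float, cpu_count: int) -> int:
    """Largest worker count that still leaves `min_gib_per_worker` for each worker."""
    by_memory = int(total_ram_gib // min_gib_per_worker)
    # Never go below one worker, and never exceed the number of CPU cores.
    return max(1, min(cpu_count, by_memory))


# Hypothetical machine: 64 GiB of RAM and 16 CPU cores.
print(memory_per_worker_gib(64, 4))
print(max_workers_for(64, min_gib_per_worker=8, cpu_count=16))
```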

---
## The Main Data Curation and Processing Pipeline

We start by downloading and converting the dataset into a suitable format. This is done via the document builders that we have provided for you.

In [141]:
dataset_df = download_and_convert_dataset(DATA_DIR)
raw_dataset = DocumentDataset.from_pandas(dataset_df)

Download directory:  /root/ODSC-Hackathon-Repository/data/raw
File '/root/ODSC-Hackathon-Repository/data/raw/law-stackexchange-questions-answers.json' already exists, skipping download.


In [130]:
import pandas as pd

# Paths to the original dataset and the synthetic dataset.
file_path_og = 'data/raw/law-stackexchange-questions-answers.json'
file_path_synth = 'data/raw/synth_data.jsonl'
train_dataset_df = pd.read_json(file_path_og, orient='records')
synth_dataset_df = pd.read_json(file_path_synth, orient='records', lines=True)

In [132]:
synth_dataset_df.rename(columns={'answer': 'answers', 'answer_score': 'link', 'filename':'license', 'id': 'question_id','question':'question_body','question_score':'score','title':'question_title'}, inplace=True)
synth_dataset_df.head()

Unnamed: 0,answers,link,license,question_id,question_body,score,tags,question_title
0,"PersonB has to attribute PersonA and, if possi...",0,law-qa-train.jsonl,law-stackexchange-qa-13384,How does the attribution part of the CC licens...,0,"attribution,creative-commons",CC-BY Layers of people
1,Are there ways to comply with GDPR and not dis...,1,law-qa-train.jsonl,law-stackexchange-qa-29296,"I'm working as a processor, we host and develo...",0,gdpr,Do i have to disclose all processors to my con...
2,A related post is here. Are police required to...,4,law-qa-train.jsonl,law-stackexchange-qa-30036,Are police required to record in car dashcam v...,1,"discovery,new-jersey,traffic,united-states",What can I do if police did not record video i...
3,"This might be an unnecessary subtle point, but...",2,law-qa-train.jsonl,law-stackexchange-qa-43835,If someone enters personal data in a field whi...,1,"data-protection,data-storage,encryption,europe...",Does GDPR apply to personal information entere...
4,This is a Federal court decision There are no ...,7,law-qa-train.jsonl,law-stackexchange-qa-49039,In a recent case involving an accusation of ex...,1,"appeal,criminal-law,united-states",Federal judge sets aside jury verdict


In [133]:
import numpy as np
synth_dataset_df['question_id'] = np.arange(100000, 100000+len(synth_dataset_df))
train_dataset_df['link'] = np.arange(100000, 100000+len(train_dataset_df))
train_dataset_df.head()

Unnamed: 0,question_id,tags,score,license,link,question_title,question_body,answers
0,94665,"[criminal-law, driving, sentencing]",23,CC BY-SA 4.0,100000,Why is drunk driving causing accident punished...,<p>When people drink and drive and then cause ...,"[{'answer_id': 94666, 'score': 72, 'body': '<h..."
1,94671,"[contract-law, legal-terms, consideration]",0,CC BY-SA 4.0,100001,What counts as consideration in contract law?,<p>What counts as consideration in contract la...,"[{'answer_id': 94672, 'score': 1, 'body': '<p>..."
2,94683,"[employment, california, teenager]",1,CC BY-SA 4.0,100002,Question Concerning Responding to Employer of ...,<p>My high school daughter worked for about a ...,"[{'answer_id': 94687, 'score': 3, 'body': '<h2..."
3,67110,"[united-states, constitutional-law, federalism]",2,CC BY-SA 4.0,100003,Can Hawaii secede from the U.S. through legal ...,<p>Can Hawaii secede from the U.S. through leg...,"[{'answer_id': 67111, 'score': 9, 'body': '<p>..."
4,94678,"[united-kingdom, property, any-jurisdiction, l...",1,CC BY-SA 4.0,100004,Legality of privately bibby Stockholming to sa...,<p>It seems that the principal impetus of movi...,"[{'answer_id': 94712, 'score': 1, 'body': '<p>..."


In [134]:
import random
synth_dataset_df['answers'] = [ [{'answer_id': random.randint(946666, 9466666), 'score': 72, 'body': x}] for x in synth_dataset_df['answers'] ]
synth_dataset_df.head()

Unnamed: 0,answers,link,license,question_id,question_body,score,tags,question_title
0,"[{'answer_id': 3357914, 'score': 72, 'body': '...",0,law-qa-train.jsonl,100000,How does the attribution part of the CC licens...,0,"attribution,creative-commons",CC-BY Layers of people
1,"[{'answer_id': 6371778, 'score': 72, 'body': '...",1,law-qa-train.jsonl,100001,"I'm working as a processor, we host and develo...",0,gdpr,Do i have to disclose all processors to my con...
2,"[{'answer_id': 977219, 'score': 72, 'body': 'A...",4,law-qa-train.jsonl,100002,Are police required to record in car dashcam v...,1,"discovery,new-jersey,traffic,united-states",What can I do if police did not record video i...
3,"[{'answer_id': 4601866, 'score': 72, 'body': '...",2,law-qa-train.jsonl,100003,If someone enters personal data in a field whi...,1,"data-protection,data-storage,encryption,europe...",Does GDPR apply to personal information entere...
4,"[{'answer_id': 6679438, 'score': 72, 'body': '...",7,law-qa-train.jsonl,100004,In a recent case involving an accusation of ex...,1,"appeal,criminal-law,united-states",Federal judge sets aside jury verdict


In [135]:
train_dataset_df['tags'] = [ ','.join(map(str, x)) for x in train_dataset_df['tags']]
# df['column_with_lists'] = df['column_with_lists'].apply(lambda x: str(x) if isinstance(x, list) else x)
train_dataset_df.head()

Unnamed: 0,question_id,tags,score,license,link,question_title,question_body,answers
0,94665,"criminal-law,driving,sentencing",23,CC BY-SA 4.0,100000,Why is drunk driving causing accident punished...,<p>When people drink and drive and then cause ...,"[{'answer_id': 94666, 'score': 72, 'body': '<h..."
1,94671,"contract-law,legal-terms,consideration",0,CC BY-SA 4.0,100001,What counts as consideration in contract law?,<p>What counts as consideration in contract la...,"[{'answer_id': 94672, 'score': 1, 'body': '<p>..."
2,94683,"employment,california,teenager",1,CC BY-SA 4.0,100002,Question Concerning Responding to Employer of ...,<p>My high school daughter worked for about a ...,"[{'answer_id': 94687, 'score': 3, 'body': '<h2..."
3,67110,"united-states,constitutional-law,federalism",2,CC BY-SA 4.0,100003,Can Hawaii secede from the U.S. through legal ...,<p>Can Hawaii secede from the U.S. through leg...,"[{'answer_id': 67111, 'score': 9, 'body': '<p>..."
4,94678,"united-kingdom,property,any-jurisdiction,law-o...",1,CC BY-SA 4.0,100004,Legality of privately bibby Stockholming to sa...,<p>It seems that the principal impetus of movi...,"[{'answer_id': 94712, 'score': 1, 'body': '<p>..."


In [136]:
#synth_dataset_df = synth_dataset_df.drop(columns=["answers"])
#train_dataset_df = train_dataset_df.drop(columns=["answers"])
synth_dataset_df.head()

Unnamed: 0,answers,link,license,question_id,question_body,score,tags,question_title
0,"[{'answer_id': 3357914, 'score': 72, 'body': '...",0,law-qa-train.jsonl,100000,How does the attribution part of the CC licens...,0,"attribution,creative-commons",CC-BY Layers of people
1,"[{'answer_id': 6371778, 'score': 72, 'body': '...",1,law-qa-train.jsonl,100001,"I'm working as a processor, we host and develo...",0,gdpr,Do i have to disclose all processors to my con...
2,"[{'answer_id': 977219, 'score': 72, 'body': 'A...",4,law-qa-train.jsonl,100002,Are police required to record in car dashcam v...,1,"discovery,new-jersey,traffic,united-states",What can I do if police did not record video i...
3,"[{'answer_id': 4601866, 'score': 72, 'body': '...",2,law-qa-train.jsonl,100003,If someone enters personal data in a field whi...,1,"data-protection,data-storage,encryption,europe...",Does GDPR apply to personal information entere...
4,"[{'answer_id': 6679438, 'score': 72, 'body': '...",7,law-qa-train.jsonl,100004,In a recent case involving an accusation of ex...,1,"appeal,criminal-law,united-states",Federal judge sets aside jury verdict


In [137]:
for c in train_dataset_df.columns:
    print(train_dataset_df[c].dtype)
    print(synth_dataset_df[c].dtype)

int64
int64
object
object
int64
int64
object
object
int64
int64
object
object
object
object
object
object


In [126]:
synth_dataset_df.columns 

Index(['answers', 'link', 'license', 'question_id', 'question_body', 'score',
       'tags', 'question_title'],
      dtype='object')

In [138]:
df = pd.concat([train_dataset_df, synth_dataset_df], ignore_index=True)
df.head()

Unnamed: 0,question_id,tags,score,license,link,question_title,question_body,answers
0,94665,"criminal-law,driving,sentencing",23,CC BY-SA 4.0,100000,Why is drunk driving causing accident punished...,<p>When people drink and drive and then cause ...,"[{'answer_id': 94666, 'score': 72, 'body': '<h..."
1,94671,"contract-law,legal-terms,consideration",0,CC BY-SA 4.0,100001,What counts as consideration in contract law?,<p>What counts as consideration in contract la...,"[{'answer_id': 94672, 'score': 1, 'body': '<p>..."
2,94683,"employment,california,teenager",1,CC BY-SA 4.0,100002,Question Concerning Responding to Employer of ...,<p>My high school daughter worked for about a ...,"[{'answer_id': 94687, 'score': 3, 'body': '<h2..."
3,67110,"united-states,constitutional-law,federalism",2,CC BY-SA 4.0,100003,Can Hawaii secede from the U.S. through legal ...,<p>Can Hawaii secede from the U.S. through leg...,"[{'answer_id': 67111, 'score': 9, 'body': '<p>..."
4,94678,"united-kingdom,property,any-jurisdiction,law-o...",1,CC BY-SA 4.0,100004,Legality of privately bibby Stockholming to sa...,<p>It seems that the principal impetus of movi...,"[{'answer_id': 94712, 'score': 1, 'body': '<p>..."


The original and synthetic records are now combined into a single `DataFrame`.

In [139]:
df['tags'] = [ x.split(',') for x in df['tags']]
df.head()

Unnamed: 0,question_id,tags,score,license,link,question_title,question_body,answers
0,94665,"[criminal-law, driving, sentencing]",23,CC BY-SA 4.0,100000,Why is drunk driving causing accident punished...,<p>When people drink and drive and then cause ...,"[{'answer_id': 94666, 'score': 72, 'body': '<h..."
1,94671,"[contract-law, legal-terms, consideration]",0,CC BY-SA 4.0,100001,What counts as consideration in contract law?,<p>What counts as consideration in contract la...,"[{'answer_id': 94672, 'score': 1, 'body': '<p>..."
2,94683,"[employment, california, teenager]",1,CC BY-SA 4.0,100002,Question Concerning Responding to Employer of ...,<p>My high school daughter worked for about a ...,"[{'answer_id': 94687, 'score': 3, 'body': '<h2..."
3,67110,"[united-states, constitutional-law, federalism]",2,CC BY-SA 4.0,100003,Can Hawaii secede from the U.S. through legal ...,<p>Can Hawaii secede from the U.S. through leg...,"[{'answer_id': 67111, 'score': 9, 'body': '<p>..."
4,94678,"[united-kingdom, property, any-jurisdiction, l...",1,CC BY-SA 4.0,100004,Legality of privately bibby Stockholming to sa...,<p>It seems that the principal impetus of movi...,"[{'answer_id': 94712, 'score': 1, 'body': '<p>..."


In [140]:
df.to_json(file_path_og, orient='records')

In [10]:
import pandas as pd
synth_ds = "data/raw/synth_data.jsonl"
synth_ds = DocumentDataset.read_json(synth_ds)
#dataset_df = pd.merge(dataset_df, synth_ds, on='key')
synth_ds.df.head()

Reading 1 files


Unnamed: 0,answer,answer_score,filename,id,question,question_score,tags,title
0,"PersonB has to attribute PersonA and, if possi...",0,law-qa-train.jsonl,law-stackexchange-qa-13384,How does the attribution part of the CC licens...,0,"attribution,creative-commons",CC-BY Layers of people
1,Are there ways to comply with GDPR and not dis...,1,law-qa-train.jsonl,law-stackexchange-qa-29296,"I'm working as a processor, we host and develo...",0,gdpr,Do i have to disclose all processors to my con...
2,A related post is here. Are police required to...,4,law-qa-train.jsonl,law-stackexchange-qa-30036,Are police required to record in car dashcam v...,1,"discovery,new-jersey,traffic,united-states",What can I do if police did not record video i...
3,"This might be an unnecessary subtle point, but...",2,law-qa-train.jsonl,law-stackexchange-qa-43835,If someone enters personal data in a field whi...,1,"data-protection,data-storage,encryption,europe...",Does GDPR apply to personal information entere...
4,This is a Federal court decision There are no ...,7,law-qa-train.jsonl,law-stackexchange-qa-49039,In a recent case involving an accusation of ex...,1,"appeal,criminal-law,united-states",Federal judge sets aside jury verdict


Next, we need to define our data curation pipeline. The pipeline we define here is very simple: it contains only basic text modification and filtering operations.

> NOTE: to use the modules that need a GPU, the dataset has to be converted to the `cudf` backend. Please refer to [this tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg) for an example demonstrating the usage of GPU modules.

In [142]:
def run_curation_pipeline(dataset: DocumentDataset, device: str) -> DocumentDataset:
    print(f"Running curation pipeline on '{device}'...")
    orig_dataset = dataset

    cpu_curation_steps = Sequential(
        [
            #
            # Modifications
            #
            # Unify the text encoding to Unicode.
            Modify(UnicodeReformatter(), text_field="title"),
            Modify(UnicodeReformatter(), text_field="question"),
            #
            # Filtering
            #
            # Filter out records based on the question word counts.
            ScoreFilter(
                WordCountFilter(min_words=40, max_words=400),
                text_field="question",
                score_type=int,
            ),
            # Filter out records where the question has a negative score.
            ScoreFilter(
                FilterLowScores(score_threshold=0),
                text_field="question_score",
                score_type=bool,
            ),
        ]
    )

    # Run the CPU curation steps.
    dataset = cpu_curation_steps(dataset)
    dataset = dataset.persist()
    # Drop the columns that are no longer needed.
    dataset.df = dataset.df.drop(columns=["answer", "answer_score", "question_score"])
    orig_len = len(orig_dataset.df)
    new_len = len(dataset.df)

    print(f"Original dataset length: {orig_len}")
    print(f"New dataset length: {new_len}")

    return dataset
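
To build intuition for what the steps above do to each record, here is a plain-Python sketch that mimics them on a few in-memory records. It is a simplification under stated assumptions: NFC normalization from the standard library is swapped in for `UnicodeReformatter` and is not claimed to match its behavior, words are counted by splitting on whitespace, and the function names and sample records are hypothetical.

```python
import unicodedata

MIN_WORDS, MAX_WORDS = 40, 400  # Same bounds as the WordCountFilter above.


def normalize_text(text: str) -> str:
    # Stand-in for Unicode cleanup: compose characters into NFC form.
    return unicodedata.normalize("NFC", text)


def word_count(text: str) -> int:
    # Assumption: words are separated by whitespace.
    return len(text.split())


def curate(records):
    curated = []
    for record in records:
        # Modification step: normalize the text fields.
        record = {
            **record,
            "title": normalize_text(record["title"]),
            "question": normalize_text(record["question"]),
        }
        # Filtering step 1: drop questions that are too short or too long.
        if not MIN_WORDS <= word_count(record["question"]) <= MAX_WORDS:
            continue
        # Filtering step 2: drop questions with a negative score.
        if record["question_score"] < 0:
            continue
        curated.append(record)
    return curated


# Made-up records: one that passes, one too short, one with a negative score.
records = [
    {"title": "Cafe\u0301 lease", "question": "word " * 50, "question_score": 2},
    {"title": "Too short", "question": "Is this legal?", "question_score": 5},
    {"title": "Downvoted", "question": "word " * 50, "question_score": -1},
]
curated = curate(records)
print(f"{len(records)} -> {len(curated)}")
```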

Finally, we are ready to run the pipeline and get our final dataset. This may take up to 10 minutes to execute, especially if any GPU functionalities are used.

In [143]:
curated_dataset = run_curation_pipeline(raw_dataset, device)

Running curation pipeline on 'cpu'...


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Original dataset length: 36587
New dataset length: 31490


Next, let's specify the final columns that we would like our dataset to have. Depending on how you plan on consuming this dataset for training, you may decide to introduce other arbitrary columns to help the model learn better.

Also, this is a great place to add system or instruction prompts to every record, in case you intend to use the same instruction prompt for every record.

Let's define a function that formats the dataset, and also adds system prompts.

In [144]:
def format_dataset(dataset: DocumentDataset, filename: str) -> DocumentDataset:
    # SYSTEM_PROMPT = "Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.\n\n"

    SYSTEM_PROMPT = "Read the question about legal issues and assign between 1 and 5 tags to it such as the type of law, key legal concept, location. All tags must be in lowercase, this-format, ordered lexicographically and separated by commas.\n\n"

    df = dataset.df.compute()
    has_tags = "tags" in df.columns
    df["input"] = SYSTEM_PROMPT + "TITLE:\n" + df["title"] + "\n\n" + "QUESTION:\n" + df["question"]
    df["output"] = df["tags"] if has_tags else ""  # If the dataset doesn't have tags, use an empty string.
    df["filename"] = filename

    df = df.drop(columns=["title", "question"])
    if has_tags:
        df = df.drop(columns=["tags"]) # Drop the tags column if it exists.
    return DocumentDataset.from_pandas(df)
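
To see what a single formatted record looks like, the following sketch applies the same string layout to one example without using Dask or Pandas. The shortened `SYSTEM_PROMPT`, the `build_record` helper, and the placeholder question body are hypothetical and used only for illustration.

```python
SYSTEM_PROMPT = "Assign tags to the question.\n\n"  # Shortened placeholder prompt.


def build_record(title, question, tags=None):
    """Mirror the input/output layout of the formatting function for one record."""
    return {
        "input": f"{SYSTEM_PROMPT}TITLE:\n{title}\n\nQUESTION:\n{question}",
        # Records without tags (for example, submission data) get an empty output.
        "output": tags if tags is not None else "",
    }


example = build_record(
    "What counts as consideration in contract law?",
    "<question body goes here>",
    "contract-law,legal-terms,consideration",
)
print(example["input"])
print("---")
print(example["output"])
```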

We use the function above to format the dataset. We apply the same logic to the final evaluation dataset.

In [145]:
formatted_dataset = format_dataset(curated_dataset, "law-stackexchange-curated.jsonl")
print(f"Original dataset columns: {curated_dataset.df.columns}")
print(f"Formatted dataset columns: {formatted_dataset.df.columns}")

Original dataset columns: Index(['filename', 'id', 'title', 'question', 'tags'], dtype='object')
Formatted dataset columns: Index(['filename', 'id', 'input', 'output'], dtype='object')


Once the final dataset is ready, we can write it into a JSONL file that is in the format expected for training with NeMo Framework.

> NOTE: The curated dataset will be written under `data/curated_dataset/law-stackexchange-curated.jsonl`, relative to the notebook directory.

In [146]:
print(f"Curated dataset columns: {formatted_dataset.df.columns}")
result_fp = os.path.join(DATA_DIR, "curated_dataset")
print()
print(f"Saving curated dataset to '{result_fp}'...")
formatted_dataset.to_json(result_fp, write_to_filename=True)

Curated dataset columns: Index(['filename', 'id', 'input', 'output'], dtype='object')

Saving curated dataset to '/root/ODSC-Hackathon-Repository/data/curated_dataset'...


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Writing to disk complete for 1 partitions
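
For reference, JSONL simply stores one JSON object per line. The sketch below writes and reads such a file using only the standard library; the helper names and record contents are hypothetical placeholders, and the file is written to a temporary directory so that it does not touch the curated data.

```python
import json
import os
import tempfile

# Hypothetical records with the same four columns as the formatted dataset.
records = [
    {"filename": "law-stackexchange-curated.jsonl", "id": "example-0",
     "input": "placeholder input text", "output": "gdpr"},
    {"filename": "law-stackexchange-curated.jsonl", "id": "example-1",
     "input": "another placeholder", "output": "contract-law"},
]


def write_jsonl(path, rows):
    # One JSON object per line, UTF-8 encoded.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "law-stackexchange-curated.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(loaded == records)
```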


---
# Splitting the Dataset

Before starting the model training procedure, let's split the dataset we've just curated into `training` and `validation` splits with a 95/5 ratio.

In [147]:
from sklearn.model_selection import train_test_split

VAL_RATIO = 0.05

df = formatted_dataset.df.compute()

# Some sanity checks
assert len(df) > 0, "The dataset is empty."
assert VAL_RATIO >= 0 and VAL_RATIO <= 1, "VAL_RATIO must be between 0 and 1."
val_size = int(len(df) * VAL_RATIO)
output_dir = f"{DATA_DIR}/split"
os.makedirs(output_dir, exist_ok=True)

# Split the data into training and validation sets
train_df, val_df = train_test_split(df, test_size=val_size, random_state=42)

print(f"Original size: {len(df)}")
print("After splitting:")
print(f"    Train size: {len(train_df)}")
print(f"    Validation size: {len(val_df)}")

train_df["filename"] = "train.jsonl"
val_df["filename"] = "val.jsonl"

DocumentDataset.from_pandas(train_df).to_json(output_dir, write_to_filename=True)
DocumentDataset.from_pandas(val_df).to_json(output_dir, write_to_filename=True)


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Original size: 31490
After splitting:
    Train size: 29916
    Validation size: 1574


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


Writing to disk complete for 1 partitions
Writing to disk complete for 1 partitions
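
The split sizes reported above follow directly from the ratio arithmetic. The sketch below reproduces that arithmetic with a standard-library shuffle; it is only an illustration of how the sizes are derived and does not reproduce the exact row assignment made by `train_test_split`. The `split_indices` helper is hypothetical.

```python
import random


def split_indices(n_rows: int, val_ratio: float, seed: int = 42):
    """Shuffle row indices and split them into (train, validation) lists."""
    if not 0 <= val_ratio <= 1:
        raise ValueError("val_ratio must be between 0 and 1.")
    # The validation size is truncated to a whole number of rows.
    val_size = int(n_rows * val_ratio)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[val_size:], indices[:val_size]


train_idx, val_idx = split_indices(31490, 0.05)
print(f"Train size: {len(train_idx)}")
print(f"Validation size: {len(val_idx)}")
```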


---
# Preparing the Submission Dataset

The submission dataset is a dataset of titles and questions for which every participating team has to predict the tags.
It needs to have the same format as the training datasets so that you can run your model on it and submit your predicted tags.

In [148]:
submission_ds = "data/submission/evaluation-dataset-verified-for-participants.jsonl"
assert os.path.exists(submission_ds), f"The submission dataset does not exist at '{submission_ds}'"
submission_ds = DocumentDataset.read_json(submission_ds)
submission_ds = format_dataset(submission_ds, "submission.jsonl")
print("Writing the formatted submission dataset to disk...")
submission_ds.to_json(output_dir, write_to_filename=True)

Reading 1 files
Writing the formatted submission dataset to disk...
Writing to disk complete for 1 partitions


Once you have run the above cell, the data that is ready for training will be available under `data/split`. When making submissions, run inference with your model on `data/split/submission.jsonl`.
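
Since the system prompt asks for lowercase, hyphenated, lexicographically ordered, comma-separated tags, it may help to normalize your model's raw predictions into that form before submitting. The sketch below is one possible approach; the `normalize_tags` helper is hypothetical, and removing duplicates and replacing spaces with hyphens are assumptions rather than documented submission requirements.

```python
def normalize_tags(raw_tags: str) -> str:
    """Lowercase, hyphenate, de-duplicate, and sort a comma-separated tag string."""
    tags = set()
    for tag in raw_tags.split(","):
        # Collapse internal whitespace into hyphens, e.g. "Data Protection" -> "data-protection".
        tag = "-".join(tag.strip().lower().split())
        if tag:
            tags.add(tag)
    return ",".join(sorted(tags))


print(normalize_tags("GDPR, Data Protection ,europe,gdpr"))
```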

---
# Freeing Memory and Other Resources

Before moving to the next notebook, please execute the following cell to free up all the allocated resources to avoid running into out-of-memory or other issues.

Alternatively, please restart the kernel by navigating to `Kernel > Restart Kernel` (if using Jupyter Notebook), or by clicking the `Restart` button in VS Code.

In [149]:
client.close()
exit(0)