# Create Hugging Face Dataset

<a target="_blank" href="https://colab.research.google.com/github/simonguest/CS-394/blob/main/src/07/notebooks/create-dataset.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://github.com/simonguest/CS-394/raw/refs/heads/main/src/07/notebooks/create-dataset.ipynb">
  <img src="https://img.shields.io/badge/Download_.ipynb-blue" alt="Download .ipynb"/>
</a>

## Install dependencies

In [1]:
!uv pip install datasets python-dotenv

## Configuration

**Note**: If you are running this notebook on Colab, first upload your training, validation, and test `.jsonl` files, and point the paths below at the uploaded copies.

In [1]:
from pathlib import Path

REPO_DIR = Path.cwd().parents[1]

TRAIN_FILE = REPO_DIR / "database/wright_rendered_splits/train.jsonl"
VALIDATION_FILE = REPO_DIR / "database/wright_rendered_splits/validation.jsonl"
TEST_FILE = REPO_DIR / "database/wright_rendered_splits/test.jsonl"

DATASET_REPO = "saneaven/novel-dataset-v0"
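Each split path is opened directly later on, so a missing file only surfaces as a `FileNotFoundError` after earlier splits have already been loaded. A minimal sketch of an up-front check follows; `missing_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Hypothetical usage with the constants defined above:
# missing = missing_files([TRAIN_FILE, VALIDATION_FILE, TEST_FILE])
# if missing:
#     raise FileNotFoundError(f"Missing split files: {missing}")
```

Running the check before any loading starts gives one error message that names every missing split at once.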

## Create dataset functions

In [2]:
from datasets import Dataset, DatasetDict
import json


def load_jsonl(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:  # Skip empty lines
                continue
            try:
                # Try parsing the line
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Warning: Error parsing line {line_num}: {e}")
                print(f"Problematic line: {line[:200]}...")
    return data


def create_hf_dataset(train_file, val_file, test_file):
    # Load the data
    train_data = load_jsonl(train_file)
    val_data = load_jsonl(val_file)
    test_data = load_jsonl(test_file)

    # Create datasets
    train_dataset = Dataset.from_list(train_data)
    val_dataset = Dataset.from_list(val_data)
    test_dataset = Dataset.from_list(test_data)

    # Combine into DatasetDict
    dataset_dict = DatasetDict(
        {
            "train": train_dataset,
            "validation": val_dataset,
            "test": test_dataset,
        }
    )

    return dataset_dict


# Create the dataset and check that it is ready to upload
dataset = create_hf_dataset(TRAIN_FILE, VALIDATION_FILE, TEST_FILE)
print(f"Train samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['validation'])}")
print(f"Test samples: {len(dataset['test'])}")
print(f"\nSample entry: {dataset['train'][0]}")

Train samples: 106635
Validation samples: 11848
Test samples: 13

Sample entry: {'messages': [{'content': "\n\nYou are an expert novel-writing assistant. You collaborate with authors to build compelling stories using structured project tools. You have access to tools for creating and managing characters, locations, organizations, lore entries, outlines, and manuscript text. Always use the appropriate tool calls to read existing project state before making changes, and to persist your work.\n\n## Current Project\n- **Title**: Summerfield: A Tale of Rural Life\n- **Genre**: Domestic Fiction, Frontier Literature, Moral Fiction\n- **Logline**: A young pioneer builds a prosperous life in the New York wilderness, navigating the loss of his son, the hardships of the seasons, and the deceptions of early commerce before witnessing a miraculous family reunion.\n\n\n\n## Characters\n\n- **Matthew Fabens**: The protagonist, a hardworking pioneer who migrates from the Hudson to the Lake Country.\n\
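`load_jsonl` prints a warning for each unparseable line and keeps going, so a damaged file can quietly produce a smaller split than expected. The sketch below is a hypothetical variant (`load_jsonl_counted`, not from the notebook) that also returns the number of skipped lines, so the caller can decide whether to proceed with the upload.

```python
import json


def load_jsonl_counted(file_path):
    """Parse a JSONL file; return (records, number_of_unparseable_lines)."""
    records, bad = [], 0
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank lines are ignored, not counted as errors
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad += 1
    return records, bad
```

A caller could, for instance, refuse to build the `DatasetDict` when the second element of the returned tuple is non-zero.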

## Get Hugging Face token

In [3]:
import sys
import os
from dotenv import load_dotenv

if 'google.colab' in sys.modules:
  from google.colab import userdata # type:ignore
  os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
  print("HF API Token set for Colab")
else:
  load_dotenv()
  print("Loaded env vars from .env")

Loaded env vars from .env
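`load_dotenv()` does not raise when the `.env` file or the `HF_TOKEN` entry is absent, so `os.environ.get("HF_TOKEN")` can later return `None` without any warning. A minimal sketch of failing early instead; `require_env` is a hypothetical helper, and the error message wording is an assumption.

```python
import os


def require_env(name, env=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or to the Colab secrets")
    return value


# Hypothetical usage before uploading:
# token = require_env("HF_TOKEN")
```

The `env` parameter defaults to the real environment and exists only so the function can be exercised with a plain dictionary.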


## Upload to Hugging Face

In [5]:
def upload_dataset(dataset, repo_name, token):
    dataset.push_to_hub(
        repo_name,
        token=token,
        private=False
    )
    
    print(f"Dataset uploaded successfully to: https://huggingface.co/datasets/{repo_name}")

upload_dataset(dataset, DATASET_REPO, os.environ.get("HF_TOKEN"))

Dataset uploaded successfully to: https://huggingface.co/datasets/saneaven/novel-dataset-v0


## Create the dataset card

In [None]:
from huggingface_hub import DatasetCard

card_content = """---
pretty_name: "Test Dataset for CS-394"
license: mit
---

# Test Dataset

This is a test dataset of chat-formatted novel-writing assistant conversations, used in DigiPen's CS-394 course.
"""

card = DatasetCard(card_content)
card.save("./README.md")
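The card's metadata lives in a YAML front-matter block delimited by exactly one `---` line before it and one after it; an extra delimiter closes the block early and pushes the metadata into the body. A minimal sketch of assembling that text from a dictionary, assuming flat string values only. `build_card_text` is a hypothetical helper, and cards with nested metadata would be better served by a YAML library.

```python
def build_card_text(metadata, body):
    """Build README text: YAML front matter (flat key/value pairs) followed by the body."""
    lines = ["---"]
    for key, value in metadata.items():
        # Double-quote each value and escape backslashes and quotes,
        # so characters such as ':' stay inside the YAML string.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}: "{escaped}"')
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"


text = build_card_text(
    {"pretty_name": "Test Dataset for CS-394", "license": "mit"},
    "# Test Dataset",
)
```

The resulting string could be passed to `DatasetCard(...)` in place of a hand-written literal.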

## Upload the dataset card

In [None]:
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="./README.md",
    path_in_repo="README.md",
    repo_id=DATASET_REPO,
    repo_type="dataset",
)
