<a href="https://colab.research.google.com/github/stevenbowler/EmployeeSurvey/blob/main/UploadPDFs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Upload PDFs to HuggingFace Private Dataset
Steven Bowler

https://grok.com/share/bGVnYWN5_6fbd212d-6669-478c-b941-f5c51f0418aa

How to Upload

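1. Prepare Your Repo: Create a new dataset repository on huggingface.co and set it to private, either in the web UI or with the CLI (e.g., huggingface-cli repo create your-username/pdf-dataset --type dataset; the exact arguments depend on your huggingface_hub version, so check huggingface-cli repo create --help).
2. Track PDFs with LFS: In your local repo, run git lfs track "*.pdf" so that PDFs are stored through Git LFS rather than as regular Git objects. The command writes the rule to .gitattributes for you; commit that file along with the PDFs.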
3. Upload Options:
  a. Via Git/CLI (for smaller batches): Run git add ., then git commit -m "Add PDFs", then git push. Git LFS handles the PDF transfer here; HF Transfer does not apply to git push.
  b. Programmatic (recommended for 1000 files): Use the huggingface_hub library. For speed, first pip install huggingface_hub[hf_transfer] and set HF_HUB_ENABLE_HF_TRANSFER=1, then upload the whole folder in one call:

In [None]:
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    folder_path="/path/to/your/pdfs",
    repo_id="your-username/pdf-dataset",
    repo_type="dataset"
)

This handles LFS automatically and works well for many files.
If the total size or file count is very large, bundle the PDFs into tar archives (e.g., 100 PDFs per archive) to reduce the file count.
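
The tarball suggestion above could be sketched roughly as follows. The function name bundle_pdfs, the pdfs-0000.tar naming scheme, and the default batch size of 100 are illustrative assumptions, not part of huggingface_hub. The archives are left uncompressed on the assumption that PDFs are usually already compressed internally, so gzip would gain little.

```python
import tarfile
from pathlib import Path


def bundle_pdfs(src_dir, out_dir, batch_size=100):
    """Pack the PDFs in src_dir into tar archives of batch_size files each.

    Hypothetical helper for illustration; returns the list of archive paths.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Sort so that the batch contents are reproducible between runs.
    pdfs = sorted(
        p for p in src.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )

    archives = []
    for start in range(0, len(pdfs), batch_size):
        archive = out / f"pdfs-{start // batch_size:04d}.tar"
        # Plain "w" mode writes an uncompressed tar archive.
        with tarfile.open(archive, "w") as tar:
            for pdf in pdfs[start:start + batch_size]:
                # Store only the file name, not the full local path.
                tar.add(pdf, arcname=pdf.name)
        archives.append(archive)
    return archives
```

Pointing upload_folder() at the output directory would then upload about 10 archives instead of 1000 individual files. The trade-off is that individual PDFs can no longer be browsed or downloaded one by one on the Hub.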


Dataset Structure: Organize PDFs in a folder like data/pdfs/ inside the repo and add a README.md (this file is the Dataset Card) describing the content (e.g., sources, licenses) for better discoverability.
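
As a minimal sketch of that layout, the helper below creates data/pdfs/ under a local folder and writes a small README.md next to it. The function name write_dataset_card and the title, source, and license values are placeholders chosen for illustration. The YAML block at the top uses the license and pretty_name fields from Hugging Face dataset card metadata.

```python
from pathlib import Path


def write_dataset_card(root, title, source, license_id):
    """Create root/data/pdfs/ and write a minimal README.md (Dataset Card) at root.

    Hypothetical helper for illustration; returns the path of the README.
    """
    root = Path(root)
    (root / "data" / "pdfs").mkdir(parents=True, exist_ok=True)

    card = "\n".join([
        "---",                          # YAML metadata block read by the Hub
        f"license: {license_id}",
        f"pretty_name: {title}",
        "---",
        "",
        f"# {title}",
        "",
        f"Source: {source}",
        "",
        "## Structure",
        "- `data/pdfs/` - the PDF files",
        "",
    ])

    readme = root / "README.md"
    readme.write_text(card, encoding="utf-8")
    return readme


# Example call with placeholder values:
# write_dataset_card("pdf-dataset", "PDF Dataset", "internal scans", "cc-by-4.0")
```

If the PDFs are copied into data/pdfs/ and the top-level folder is passed to upload_folder(), the README.md lands at the repo root, where the Hub renders it as the dataset page.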

Here is a complete, ready-to-run Google Colab notebook cell that will:

1. Mount Google Drive (where your 1000 PDFs are stored)
2. Authenticate with Hugging Face
3. Create a new dataset repository (if needed)
4. Upload the folder of 1000 PDFs using upload_folder() with HF Transfer for speed
5. Add a basic README.md Dataset Card

In [None]:
# === STEP 1: Install required packages ===
!pip install -q huggingface_hub[hf_transfer] PyPDF2

# === STEP 2: Mount Google Drive (where your PDFs are stored) ===
from google.colab import drive
drive.mount('/content/drive')

# === STEP 3: Login to Hugging Face ===
from huggingface_hub import login
login()  # This will prompt you to enter your HF token

# === STEP 4: Set your paths and repo details ===
import os

# UPDATE THESE PATHS
PDF_FOLDER_PATH = "/content/drive/MyDrive/pdfs"  # Folder with your 1000 PDFs
REPO_ID = "your-username/pdf-dataset"            # Change to your username and dataset name
REPO_TYPE = "dataset"

# Optional: Create a Dataset Card (README.md)
README_CONTENT = """
# PDF Dataset - 1000 Documents

This dataset contains 1000 PDF files for research, NLP, or document analysis.

## Structure
- `pdfs/` - All 1000 PDF files

## Usage
```python
from datasets import load_dataset
dataset = load_dataset("your-username/pdf-dataset")