<a href="https://colab.research.google.com/github/stevenbowler/EmployeeSurvey/blob/main/UploadPDFs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Upload PDFs to HuggingFace Private Dataset
Steven Bowler

How to Upload

1. Prepare Your Repo: Create a new dataset repository on huggingface.co, either in the web UI or from the CLI (e.g., `huggingface-cli repo create your-username/pdf-dataset --type dataset`).
2. Track PDFs with LFS: In your local repo, run `git lfs track "*.pdf"` so that large PDFs are stored through Git LFS. The command records the pattern in your `.gitattributes` file; commit that file together with the PDFs.
3. Upload Options:
  a. Via Git/CLI (for smaller batches): Run `git add .` and `git commit -m "Add PDFs"`, then `git push`. For large uploads, enable HF Transfer for speed: `pip install huggingface_hub[hf_transfer]` and set `HF_HUB_ENABLE_HF_TRANSFER=1`.
  b. Programmatic (recommended for 1,000 files): Use the `huggingface_hub` library to upload the folder:

In [None]:
from google.colab import userdata

In [None]:
# Cell 3: Configuration
# Replace with your values
HF_TOKEN = userdata.get('HF_TOKEN')  # Your HF token for private repo
REPO_ID = userdata.get('REPO_ID')  # e.g., "user/private-survey-pdfs"
XAI_API_KEY = userdata.get('XAI_API_KEY')  # Get from https://console.x.ai/
MODEL = "grok-4"  # Or "grok-4-fast-reasoning" for cheaper/faster
NUM_QUESTIONS = 25
PDF_DIR = userdata.get('PDF_FOLDER_PATH')

In [None]:
# Confirmed: fastest, easiest way to upload 1,500 multi-page pdfs to huggingface
from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)  # Authenticate with the token loaded from Colab secrets
api.upload_folder(
    folder_path=PDF_DIR,
    repo_id=REPO_ID,
    repo_type="dataset"
)

This handles LFS automatically and is efficient for many files.
If total size is huge, compress PDFs into tarballs (e.g., 100 per file) to reduce the file count.
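
The batching idea can be sketched with the standard library alone. This is a minimal sketch, not a tested part of the notebook: the helper names `chunked` and `pack_pdfs`, the archive naming scheme, and the default batch size of 100 are all assumptions made for illustration. It writes plain (uncompressed) tar archives, since PDFs are usually compressed internally already.

```python
import tarfile
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pack_pdfs(pdf_dir, out_dir, batch_size=100):
    """Pack the PDFs found directly in `pdf_dir` into numbered tar archives.

    Hypothetical helper: returns the list of archive paths written to `out_dir`.
    """
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives = []
    for index, batch in enumerate(chunked(pdfs, batch_size)):
        archive_path = out / f"pdfs-{index:04d}.tar"
        # Mode "w" writes an uncompressed archive; the goal is fewer files, not smaller ones.
        with tarfile.open(archive_path, "w") as tar:
            for pdf in batch:
                tar.add(pdf, arcname=pdf.name)
        archives.append(archive_path)
    return archives
```

The resulting archive folder could then be passed to `upload_folder()` in place of the raw PDF folder. The trade-off is that individual PDFs are no longer browsable on the Hub.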


Dataset Structure: Organize PDFs in a folder like /data/pdfs/ and add a README.md or Dataset Card describing the content (e.g., sources, licenses) for better discoverability.
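
A Dataset Card is a `README.md` whose optional YAML header carries metadata such as the license. The sketch below builds one as a string. The function name `build_dataset_card` and every value passed to it are placeholders, not details taken from this project.

```python
def build_dataset_card(title, description, license_id, pdf_dir="pdfs"):
    """Return README.md text with a small YAML metadata header.

    Hypothetical helper: values are inserted verbatim, so keep `title`
    free of characters that have special meaning in YAML (such as ':').
    """
    lines = [
        "---",
        f"license: {license_id}",
        f"pretty_name: {title}",
        "---",
        "",
        f"# {title}",
        "",
        description,
        "",
        "## Structure",
        f"- `{pdf_dir}/` - PDF files",
        "",
    ]
    return "\n".join(lines)


# Placeholder values for illustration only.
card = build_dataset_card(
    title="Survey PDFs",
    description="Scanned survey documents. Describe sources and licensing here.",
    license_id="other",
)
```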

Google Colab notebook code that will:

1. Mount Google Drive (where your 1000 PDFs are stored)
2. Authenticate with Hugging Face
3. Create a new dataset repository (if needed)
4. Upload the folder of 1000 PDFs using `upload_folder()` with HF Transfer for speed
5. Add a basic README.md Dataset Card

In [None]:
# === STEP 1: Install required packages ===
!pip install -q huggingface_hub[hf_transfer] PyPDF2

# === STEP 2: Mount Google Drive (where your PDFs are stored) ===
from google.colab import drive
drive.mount('/content/drive')

# === STEP 3: Login to Hugging Face ===
from huggingface_hub import login
login()  # This will prompt you to enter your HF token

# === STEP 4: Set your paths and repo details ===
import os

# UPDATE THESE PATHS
PDF_FOLDER_PATH = "/content/drive/MyDrive/pdfs"  # Folder with your 1000 PDFs
REPO_ID = "your-username/pdf-dataset"            # Change to your username and dataset name
REPO_TYPE = "dataset"

# Optional: Create a Dataset Card (README.md)
README_CONTENT = """
# PDF Dataset - 1000 Documents

This dataset contains 1000 PDF files for research, NLP, or document analysis.

## Structure
- `pdfs/` - All 1000 PDF files

## Usage
```python
from datasets import load_dataset
dataset = load_dataset("your-username/pdf-dataset")
```
"""

In [None]:
#=== STEP 5: Upload the folder using HF Transfer (fast & reliable) ===
from huggingface_hub import HfApi
import os
#Enable HF Transfer for fast uploads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
api = HfApi()

In [None]:
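# Create the repo if it doesn't exist
try:
    # private=True keeps the dataset private, as this notebook intends;
    # set it to False only if the PDFs may be published.
    api.create_repo(repo_id=REPO_ID, repo_type=REPO_TYPE, private=True)
    print(f"Created new dataset repo: {REPO_ID}")
except Exception as e:
    print(f"Repo likely already exists: {e}")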

In [None]:
# Upload the PDFs folder
print("Starting upload of 1000 PDFs... This may take a while.")
api.upload_folder(
    folder_path=PDF_FOLDER_PATH,
    repo_id=REPO_ID,
    repo_type=REPO_TYPE,
    path_in_repo="pdfs",  # Uploads to /pdfs in the repo
    allow_patterns=["*.pdf"],  # Only upload PDFs
    ignore_patterns=["*.txt", "*.csv"],  # Optional: ignore non-PDF files
)
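
The `allow_patterns` and `ignore_patterns` arguments take wildcard patterns, so the leading `*` matters: a bare `.pdf` would match only a file literally named `.pdf`. The sketch below approximates that filtering locally with `fnmatch` so a pattern list can be checked before a long upload starts. The helper `select_files` is hypothetical and assumes standard wildcard semantics; it is not the library's own implementation.

```python
from fnmatch import fnmatch


def select_files(names, allow_patterns=None, ignore_patterns=None):
    """Return the names that pass the allow list and are not on the ignore list.

    Hypothetical local approximation of wildcard filtering.
    """
    selected = []
    for name in names:
        # With an allow list, a name must match at least one allowed pattern.
        if allow_patterns and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        # Any match on the ignore list excludes the name.
        if ignore_patterns and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


names = ["report.pdf", "notes.txt", "table.csv"]
with_wildcard = select_files(names, allow_patterns=["*.pdf"])
without_wildcard = select_files(names, allow_patterns=[".pdf"])
```

Here `with_wildcard` keeps only `report.pdf`, while `without_wildcard` is empty because nothing is named exactly `.pdf`.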

In [None]:
# Upload README.md
from huggingface_hub import upload_file
import tempfile

# Write the card to a temporary file; it is uploaded after the file is closed
with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
    f.write(README_CONTENT)
    readme_path = f.name

upload_file(
    path_or_fileobj=readme_path,
    path_in_repo="README.md",
    repo_id=REPO_ID,
    repo_type=REPO_TYPE
)
print(f"Successfully uploaded 1000 PDFs to https://huggingface.co/datasets/{REPO_ID}")
print("View your dataset card and files on Hugging Face!")
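
After a large upload it can be worth confirming that every local PDF actually reached the repo. The following is a rough sketch: `missing_uploads` and `check_upload` are hypothetical helpers, and they assume a flat local folder uploaded under `pdfs/`. `HfApi.list_repo_files` is the only Hub call used.

```python
from pathlib import Path


def missing_uploads(local_names, remote_paths, prefix="pdfs/"):
    """Return local file names that have no `prefix + name` entry in the repo."""
    remote = set(remote_paths)
    return sorted(name for name in local_names if f"{prefix}{name}" not in remote)


def check_upload(pdf_folder, repo_id, repo_type="dataset"):
    """Compare the local folder against the repo listing (needs network access)."""
    # Imported here so the pure helper above works without the library installed.
    from huggingface_hub import HfApi

    remote_paths = HfApi().list_repo_files(repo_id=repo_id, repo_type=repo_type)
    local_names = [p.name for p in Path(pdf_folder).glob("*.pdf")]
    return missing_uploads(local_names, remote_paths)
```

In the notebook this could be called as `check_upload(PDF_FOLDER_PATH, REPO_ID)`; an empty list would mean nothing is missing.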

### Readme.md Text
---

### Instructions Before Running

1. **Upload your 1000 PDFs** to a folder in Google Drive (e.g., `MyDrive/pdfs`)
2. **Get a Hugging Face token**:
   - Go to: https://huggingface.co/settings/tokens
   - Create a **new token** with `write` access
3. **Update these lines**:
   ```python
   PDF_FOLDER_PATH = "/content/drive/MyDrive/pdfs"  # Your folder
   REPO_ID = "your-username/pdf-dataset"            # e.g., "johnsmith/research-pdfs"
   ```

The following three cells are a one-off test of the Grok interface.

In [None]:
# The following three code cells test the Grok interface
from google.colab import userdata

In [None]:
# Cell 3: Configuration
# Replace with your values
HF_TOKEN = userdata.get('HF_TOKEN')  # Your HF token for private repo
REPO_ID = userdata.get('REPO_ID')  # e.g., "user/private-survey-pdfs"
XAI_API_KEY = userdata.get('XAI_API_KEY')  # Get from https://console.x.ai/
MODEL = "grok-4"  # Or "grok-4-fast-reasoning" for cheaper/faster
NUM_QUESTIONS = 25
PDF_DIR = userdata.get('PDF_FOLDER_PATH')

In [None]:
#Test XAI_API_KEY
# Use f-strings to correctly format the curl command with the API key
import os

curl_command = f"""curl https://api.x.ai/v1/chat/completions \\
    -H "Content-Type: application/json" \\
    -H "Authorization: Bearer {XAI_API_KEY}" \\
    -d '{{
      "messages": [
        {{
          "role": "system",
          "content": "You are a test assistant."
        }},
        {{
          "role": "user",
          "content": "Testing. Just say hi and hello world and nothing else."
        }}
      ],
      "model": "grok-4-latest",
      "stream": false,
      "temperature": 0
    }}'"""

# Execute the curl command using os.system or subprocess.run
# os.system is simpler for this case, but subprocess.run is generally preferred for more control
os.system(curl_command)
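
An alternative that avoids shell quoting altogether is to build the request in Python. This sketch mirrors the payload of the curl command above using only the standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the request has not been verified here against the live API.

```python
import json
import urllib.request

API_URL = "https://api.x.ai/v1/chat/completions"


def build_chat_request(api_key, model, user_message):
    """Build (but do not send) a POST request with the same JSON body as the curl test."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a test assistant."},
            {"role": "user", "content": user_message},
        ],
        "model": model,
        "stream": False,
        "temperature": 0,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

In the notebook, `send_chat_request(build_chat_request(XAI_API_KEY, MODEL, "Testing."))` would play the role of the `os.system` call, with the added benefit that the API key never passes through a shell command string.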