[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.5/create-your-own-dataset.ipynb)
[![Open with SageMaker](https://img.shields.io/badge/Open%20with-SageMaker-orange?logo=amazonaws)](https://studiolab.sagemaker.aws/import/github/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.5/create-your-own-dataset.ipynb)
[![View on GitHub](https://img.shields.io/badge/View_on-GitHub-blue?logo=github)](https://github.com/vuhung16au/hf-transformer-trove/blob/main/examples/basic5.5/create-your-own-dataset.ipynb)

# Creating Your Own Dataset and Uploading to Hugging Face Hub

## 🎯 Learning Objectives
By the end of this notebook, you will understand:
- How to collect data from real-world sources (GitHub issues)
- Data cleaning and preprocessing techniques for text data
- Dataset augmentation strategies to enhance your data
- How to upload datasets to the Hugging Face Hub
- Best practices for creating comprehensive dataset cards
- Dataset versioning and sharing with the community

## 📋 Prerequisites
- Basic understanding of machine learning concepts
- Familiarity with Python and pandas
- Knowledge of NLP fundamentals (refer to [NLP Learning Journey](https://github.com/vuhung16au/nlp-learning-journey))
- A Hugging Face account (create one at https://huggingface.co/)

## 📚 What We'll Cover
1. **Getting the Data**: Fetch GitHub issues from Hugging Face repositories
2. **Cleaning the Data**: Process and clean the collected text data
3. **Augmenting the Dataset**: Enhance data with additional features
4. **Uploading to Hub**: Push your dataset to Hugging Face Hub
5. **Creating a Dataset Card**: Document your dataset professionally

## 💡 Why Create and Share Datasets?

This notebook follows the [Hugging Face course, Chapter 5, Section 5](https://huggingface.co/learn/llm-course/chapter5/5?fw=pt). Creating and sharing datasets:
- **Contributes to the ML community**: Help researchers and practitioners
- **Enables reproducibility**: Others can build on your work
- **Demonstrates data quality**: Show best practices in data collection
- **Fosters collaboration**: Connect with other ML enthusiasts

> 💡 **Educational Focus**: This notebook demonstrates creating a dataset from GitHub issues to showcase practical data collection and preparation workflows.

> ⚠️ **Important**: Always respect API rate limits and data privacy. Follow the terms of service for any APIs you use.

**References:**
- HF Course: https://huggingface.co/learn/llm-course/chapter5/5?fw=pt
- Colab Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter5/section5.ipynb

## Setup: Import Libraries and Configure Environment

In [None]:
# Install required packages (uncomment if needed)
# !pip install datasets transformers huggingface_hub pandas numpy requests tqdm

# Import essential libraries
import os
import json
import time
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import List, Dict, Optional, Union
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Hugging Face imports
from datasets import Dataset, DatasetDict, load_dataset
from huggingface_hub import login, whoami, HfApi

# Progress tracking
from tqdm.auto import tqdm

print("📚 Libraries imported successfully!")
print("🐍 Python packages ready for dataset creation")

In [None]:
# Set reproducible environment with repository standard seed=16
import random

def set_seed(seed_value: int = 16):
    """
    Set seed for reproducibility across all random number generators.
    Repository standard is seed=16.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    print(f"🔢 Random seed set to {seed_value} for reproducibility")

# Set repository standard seed
set_seed(16)

# Configure visualization style (repository standard)
sns.set_style('darkgrid')  # Better readability with gridlines
sns.set_palette("husl")     # Consistent, accessible colors
print("📊 Visualization style configured: darkgrid with husl palette")

In [None]:
# Credential management for multi-platform compatibility
def get_api_key(key_name: str, required: bool = True) -> Optional[str]:
    """
    Load API key from environment or Google Colab secrets.
    
    Args:
        key_name: Environment variable name
        required: Whether to print setup instructions if the key is not found
        
    Returns:
        API key string or None
    """
    # Try Colab secrets first
    try:
        from google.colab import userdata
        api_key = userdata.get(key_name)
        if api_key:
            print(f"✅ Loaded {key_name} from Google Colab secrets")
            return api_key
    except Exception:  # not running in Colab, or the secret is unavailable
        pass
    
    # Try environment variable
    api_key = os.getenv(key_name)
    if api_key:
        print(f"✅ Loaded {key_name} from environment variable")
        return api_key
    
    # Handle missing required keys
    if required:
        print(f"⚠️ {key_name} not found. Please set it in:")
        print(f"  - Local: .env.local file or environment variable")
        print(f"  - Colab: Secrets manager (🔑 icon in sidebar)")
        return None
    
    return None

# Authentication setup
print("🔐 Setting up authentication...")
print("\n📋 Authentication Methods:")
print("1. 🔑 Google Colab: Use Secrets manager (recommended for Colab)")
print("2. 💻 Local: Set HF_TOKEN and GITHUB_TOKEN environment variables")
print("3. 🖥️ CLI: Run `huggingface-cli login` in terminal")

## 1. Getting the Data: Fetching GitHub Issues

We'll collect GitHub issues from Hugging Face repositories to create a real-world dataset. This demonstrates:
- API interaction and data collection
- Handling pagination and rate limits
- Extracting structured data from APIs

> 💡 **Pro Tip**: Always respect API rate limits and implement proper error handling when collecting data from external sources.
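
As a rough illustration of rate-limit handling, the sketch below computes how long to pause from the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub attaches to REST API responses, rather than sleeping for a fixed interval. The helper name `seconds_until_reset`, the one-second safety margin, and the example header values are assumptions made for this example and are not part of the notebook's code.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how many seconds to wait before the next request (0 if quota remains).

    Hypothetical helper: X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    # Add a one-second margin (an arbitrary choice) to avoid waking up early.
    return max(0.0, reset_at - now) + 1.0


# Example with made-up header values: quota exhausted, reset 30 seconds from "now".
example_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1030"}
print(seconds_until_reset(example_headers, now=1000.0))  # 31.0
```

With `requests`, `response.headers` could be passed in directly, since it behaves like a case-insensitive mapping.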

In [None]:
def fetch_github_issues(
    repo_owner: str = "huggingface",
    repo_name: str = "transformers",
    max_issues: int = 100,
    state: str = "all",
    github_token: Optional[str] = None
) -> List[Dict]:
    """
    Fetch GitHub issues from a repository.
    
    Args:
        repo_owner: Repository owner (organization or user)
        repo_name: Repository name
        max_issues: Maximum number of issues to fetch
        state: Issue state ('open', 'closed', 'all')
        github_token: GitHub API token (optional, increases rate limit)
        
    Returns:
        List of issue dictionaries
    """
    base_url = f"https://api.github.com/repos/{repo_owner}/{repo_name}/issues"
    headers = {}
    
    # Add authentication if token provided
    if github_token:
        headers['Authorization'] = f'token {github_token}'
        print("🔑 Using GitHub token for authentication")
    else:
        print("⚠️ No GitHub token provided. Rate limit: 60 requests/hour")
        print("💡 Provide GITHUB_TOKEN for higher rate limit (5000 requests/hour)")
    
    all_issues = []
    page = 1
    per_page = min(100, max_issues)  # GitHub API max is 100 per page
    
    print(f"\n📥 Fetching issues from {repo_owner}/{repo_name}...")
    
    with tqdm(total=max_issues, desc="Fetching issues") as pbar:
        while len(all_issues) < max_issues:
            try:
                # Construct API URL with parameters
                params = {
                    'state': state,
                    'per_page': per_page,
                    'page': page
                }
                
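                # Make API request (timeout avoids hanging on a stalled connection)
                response = requests.get(base_url, headers=headers, params=params, timeout=30)
                
                # Check rate limiting: only wait when the quota is actually exhausted.
                # Any other 403 falls through to raise_for_status() instead of looping forever.
                if (response.status_code in (403, 429)
                        and response.headers.get('X-RateLimit-Remaining') == '0'):
                    print("\n⚠️ Rate limit exceeded. Waiting 60 seconds...")
                    time.sleep(60)
                    continue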
                
                response.raise_for_status()
                issues = response.json()
                
                # No more issues available
                if not issues:
                    break
                
                # Filter out pull requests (they also appear in issues endpoint)
                issues = [issue for issue in issues if 'pull_request' not in issue]
                
                all_issues.extend(issues)
                pbar.update(len(issues))
                
                page += 1
                
                # Be respectful of rate limits
                time.sleep(0.5)
                
            except requests.exceptions.RequestException as e:
                print(f"\n❌ Error fetching issues: {e}")
                break
    
    # Limit to requested maximum
    all_issues = all_issues[:max_issues]
    
    print(f"\n✅ Successfully fetched {len(all_issues)} issues")
    return all_issues

In [None]:
# Get GitHub token (optional but recommended)
github_token = get_api_key('GITHUB_TOKEN', required=False)

# Fetch issues from Hugging Face transformers repository
# Using a smaller number for demonstration (adjust as needed)
print("🔄 Starting data collection...")
print("📝 Collecting GitHub issues from huggingface/transformers repository\n")

raw_issues = fetch_github_issues(
    repo_owner="huggingface",
    repo_name="transformers",
    max_issues=150,  # Reasonable number for demonstration
    state="all",      # Get both open and closed issues
    github_token=github_token
)

print(f"\n📊 Raw data collected: {len(raw_issues)} issues")

In [None]:
# Inspect the raw data structure
if raw_issues:
    print("🔍 Sample Issue Structure:")
    print("=" * 50)
    
    sample_issue = raw_issues[0]
    
    # Display key fields
    print(f"\n📌 Title: {sample_issue.get('title', 'N/A')}")
    print(f"🔢 Number: {sample_issue.get('number', 'N/A')}")
    print(f"👤 Author: {sample_issue.get('user', {}).get('login', 'N/A')}")
    print(f"📅 Created: {sample_issue.get('created_at', 'N/A')}")
    print(f"🏷️ State: {sample_issue.get('state', 'N/A')}")
    print(f"💬 Comments: {sample_issue.get('comments', 0)}")
    
    # Show labels
    labels = [label.get('name', '') for label in sample_issue.get('labels', [])]
    print(f"🏷️ Labels: {', '.join(labels) if labels else 'None'}")
    
    # Show body preview
    body = sample_issue.get('body', '')
    body_preview = body[:200] + '...' if body and len(body) > 200 else body
    print(f"\n📄 Body Preview:\n{body_preview}")
    
    print("\n" + "=" * 50)
    print(f"\n📋 Available fields: {len(sample_issue)} total")
    print(f"🔑 Key fields: {', '.join(list(sample_issue.keys())[:10])}...")
else:
    print("❌ No issues fetched. Check API access and try again.")

## 2. Cleaning Up the Data

Now we'll clean and structure our raw data:
- Extract relevant fields
- Handle missing values
- Clean text content
- Create a structured dataset

> 💡 **Data Quality**: Clean data is essential for training effective models. Always inspect and validate your data before using it.
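
To make the missing-value step concrete: the GitHub API can return `null` for fields such as `body`, and `dict.get(key, default)` only falls back to the default when the key is absent, not when its value is `None`. The sketch below shows one way to cover both cases. The helper `safe_get` and the sample record are invented for illustration.

```python
from typing import Any, Dict


def safe_get(record: Dict[str, Any], key: str, default: Any) -> Any:
    """Return record[key], falling back to default when the key is missing or None."""
    value = record.get(key)
    return default if value is None else value


# Made-up record mimicking an issue with a null body and a null user.
sample = {"title": "Tokenizer crash", "body": None, "user": None}

print(repr(sample.get("body", "")))       # None: the key exists, so the default is ignored
print(repr(safe_get(sample, "body", "")))  # ''
print(safe_get(sample, "user", {}).get("login", "unknown"))  # unknown
```

Comparing against `None` explicitly, rather than using `or`, keeps legitimate falsy values such as `0` comments intact.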

In [None]:
def extract_issue_features(issue: Dict) -> Dict:
    """
    Extract relevant features from a GitHub issue.
    
    Args:
        issue: Raw issue dictionary from GitHub API
        
    Returns:
        Dictionary with cleaned and structured features
    """
    # Extract basic information
    issue_data = {
        'id': issue.get('id', None),
        'number': issue.get('number', None),
        'title': issue.get('title', ''),
        'body': issue.get('body', ''),
        'state': issue.get('state', 'unknown'),
        'created_at': issue.get('created_at', ''),
        'updated_at': issue.get('updated_at', ''),
        'closed_at': issue.get('closed_at', ''),
        # `user` can be present but null, so .get('user', {}) alone would return None
        'author': (issue.get('user') or {}).get('login', 'unknown'),
        'comments_count': issue.get('comments', 0),
        'labels': [label.get('name', '') for label in issue.get('labels', [])],
        'url': issue.get('html_url', ''),
    }
    
    return issue_data

# Extract features from all issues
print("🔄 Extracting and cleaning features...")
cleaned_issues = [extract_issue_features(issue) for issue in raw_issues]
print(f"✅ Extracted features from {len(cleaned_issues)} issues")

In [None]:
import re
import html

def clean_text(text: str) -> str:
    """
    Clean and normalize text content.
    
    Args:
        text: Raw text string
        
    Returns:
        Cleaned text string
    """
    if not text or pd.isna(text):
        return ""
    
    # Convert to string
    text = str(text)
    
    # Unescape HTML entities
    text = html.unescape(text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Strip leading/trailing whitespace
    text = text.strip()
    
    return text

# Apply text cleaning
print("🧹 Cleaning text content...")
for issue in cleaned_issues:
    issue['title'] = clean_text(issue['title'])
    issue['body'] = clean_text(issue['body'])

print("✅ Text cleaning complete")

In [None]:
# Convert to pandas DataFrame for easier manipulation
df = pd.DataFrame(cleaned_issues)

print("📊 Dataset Statistics:")
print("=" * 50)
print(f"Total issues: {len(df)}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMissing values:")
print(df.isnull().sum())

# Display first few rows
print("\n📋 Sample Data:")
display(df[['number', 'title', 'state', 'author', 'comments_count']].head())

In [None]:
# Data quality checks and filtering
print("🔍 Performing data quality checks...\n")

# Check for empty titles
empty_titles = df['title'].isna() | (df['title'] == '')
print(f"Issues with empty titles: {empty_titles.sum()}")

# Check for empty bodies
empty_bodies = df['body'].isna() | (df['body'] == '')
print(f"Issues with empty bodies: {empty_bodies.sum()}")

# Filter out issues with empty titles (keep those with empty bodies as they might still be useful)
df_filtered = df[~empty_titles].copy()
print(f"\nFiltered dataset size: {len(df_filtered)} issues")

# Basic statistics
print("\n📈 Data Statistics:")
print("=" * 50)
print(f"\nStates distribution:")
print(df_filtered['state'].value_counts())
print(f"\nComments statistics:")
print(df_filtered['comments_count'].describe())

In [None]:
# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: State distribution
state_counts = df_filtered['state'].value_counts()
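# Map colors by state name so they stay correct whichever state is more frequent
state_colors = {'open': 'green', 'closed': 'red'}
axes[0, 0].bar(state_counts.index, state_counts.values,
               color=[state_colors.get(s, 'gray') for s in state_counts.index])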
axes[0, 0].set_title('Issue State Distribution', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('State')
axes[0, 0].set_ylabel('Count')
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Comments distribution
axes[0, 1].hist(df_filtered['comments_count'], bins=30, color='skyblue', edgecolor='black')
axes[0, 1].set_title('Comments Count Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Number of Comments')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Top 10 authors
top_authors = df_filtered['author'].value_counts().head(10)
axes[1, 0].barh(top_authors.index, top_authors.values, color='coral')
axes[1, 0].set_title('Top 10 Issue Authors', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Number of Issues')
axes[1, 0].set_ylabel('Author')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Title length distribution
# Use a local Series so df_filtered is not modified before the augmentation step
title_lengths = df_filtered['title'].str.len()
axes[1, 1].hist(title_lengths, bins=30, color='lightgreen', edgecolor='black')
axes[1, 1].set_title('Issue Title Length Distribution', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Title Length (characters)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Data visualization complete")

## 3. Augmenting the Dataset

Let's enhance our dataset with additional features:
- Text length metrics
- Temporal features
- Categorical encodings
- Derived features

> 💡 **Feature Engineering**: Adding relevant features can significantly improve downstream model performance and analysis capabilities.
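
As one example of a derived temporal feature, the sketch below computes how long an issue stayed open, using the `created_at` and `closed_at` fields extracted earlier. It assumes the timestamps are UTC strings of the form `2024-01-15T10:30:00Z`; the function name `hours_to_close` and the sample timestamps are made up for illustration, and the notebook's own augmentation does not include this feature.

```python
from datetime import datetime
from typing import Optional

# Assumed timestamp layout, e.g. 2024-01-15T10:30:00Z
GITHUB_TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hours_to_close(created_at: str, closed_at: Optional[str]) -> Optional[float]:
    """Return hours between creation and closing, or None if the issue is still open."""
    if not closed_at:
        return None
    created = datetime.strptime(created_at, GITHUB_TS_FORMAT)
    closed = datetime.strptime(closed_at, GITHUB_TS_FORMAT)
    return (closed - created).total_seconds() / 3600


print(hours_to_close("2024-01-15T10:30:00Z", "2024-01-16T12:00:00Z"))  # 25.5
print(hours_to_close("2024-01-15T10:30:00Z", None))                    # None
```

Returning `None` for open issues keeps "not closed yet" distinct from "closed immediately", which a default of `0` would blur.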

In [None]:
def augment_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """
    Augment dataset with additional derived features.
    
    Args:
        df: Input DataFrame
        
    Returns:
        Augmented DataFrame
    """
    df_augmented = df.copy()
    
    print("🔄 Adding derived features...\n")
    
    # 1. Text length features
    print("📏 Computing text length features...")
    df_augmented['title_length'] = df_augmented['title'].str.len()
    df_augmented['body_length'] = df_augmented['body'].str.len()
    df_augmented['title_word_count'] = df_augmented['title'].str.split().str.len()
    df_augmented['body_word_count'] = df_augmented['body'].str.split().str.len()
    
    # 2. Temporal features
    print("📅 Extracting temporal features...")
    df_augmented['created_at'] = pd.to_datetime(df_augmented['created_at'])
    df_augmented['updated_at'] = pd.to_datetime(df_augmented['updated_at'])
    
    # Extract date components
    df_augmented['created_year'] = df_augmented['created_at'].dt.year
    df_augmented['created_month'] = df_augmented['created_at'].dt.month
    df_augmented['created_day_of_week'] = df_augmented['created_at'].dt.dayofweek
    df_augmented['created_hour'] = df_augmented['created_at'].dt.hour
    
    # 3. Label features
    print("🏷️ Processing labels...")
    df_augmented['labels_count'] = df_augmented['labels'].apply(len)
    df_augmented['has_labels'] = df_augmented['labels_count'] > 0
    df_augmented['labels_text'] = df_augmented['labels'].apply(lambda x: ', '.join(x) if x else 'none')
    
    # 4. Engagement features
    print("💬 Computing engagement features...")
    df_augmented['has_comments'] = df_augmented['comments_count'] > 0
    df_augmented['is_closed'] = df_augmented['state'] == 'closed'
    
    # 5. Combined text feature for NLP tasks
    print("📝 Creating combined text field...")
    # Strip so that issues with an empty body do not end in a trailing space
    df_augmented['full_text'] = (df_augmented['title'] + ' ' + df_augmented['body']).str.strip()
    
    print("\n✅ Dataset augmentation complete")
    print(f"\n📊 New features added: {len(df_augmented.columns) - len(df.columns)}")
    print(f"Total columns: {len(df_augmented.columns)}")
    
    return df_augmented

# Apply augmentation
df_augmented = augment_dataset(df_filtered)

In [None]:
# Inspect augmented dataset
print("📋 Augmented Dataset Sample:")
print("=" * 50)

# Select informative columns for display
display_cols = [
    'number', 'title', 'state', 'comments_count', 
    'title_length', 'body_word_count', 'labels_count', 'created_year'
]
display(df_augmented[display_cols].head(10))

print("\n📊 Augmented Features Summary:")
print(df_augmented[[
    'title_length', 'body_length', 'title_word_count', 
    'body_word_count', 'labels_count', 'comments_count'
]].describe())

## 4. Uploading the Dataset to the Hugging Face Hub

Now we'll prepare and upload our dataset to the Hugging Face Hub:
- Convert to HF Dataset format
- Create train/test splits
- Authenticate with Hugging Face
- Push to the Hub

> 💡 **Sharing Best Practices**: Always include proper documentation and respect privacy when sharing datasets.

**Reference**: [HF Course Chapter 5, Section 5](https://huggingface.co/learn/llm-course/chapter5/5?fw=pt)
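
The idea behind a reproducible split can be sketched without any dependencies: shuffle the row indices with a fixed seed, then cut the shuffled list at the desired fraction. The function `split_indices` below is a hypothetical illustration of that idea, not the notebook's implementation, which relies on `DataFrame.sample` with `random_state=16`.

```python
import random
from typing import List, Tuple


def split_indices(
    n_rows: int, train_fraction: float = 0.8, seed: int = 16
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and cut them into train and test lists."""
    indices = list(range(n_rows))
    # A local Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))               # 8 2
print(split_indices(10) == (train_idx, test_idx))  # True: same seed, same split
```

Because every index lands in exactly one of the two lists, no example can leak from the training set into the test set.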

In [None]:
# Select columns for the final dataset
# Keep the most relevant columns for ML tasks
columns_to_keep = [
    'id', 'number', 'title', 'body', 'full_text',
    'state', 'author', 'comments_count', 'labels_text',
    'created_at', 'url',
    'title_length', 'body_length', 'title_word_count', 'body_word_count',
    'labels_count', 'has_labels', 'has_comments', 'is_closed'
]

df_final = df_augmented[columns_to_keep].copy()

# Convert datetime to string for Dataset compatibility
df_final['created_at'] = df_final['created_at'].astype(str)

print("📦 Preparing dataset for upload...")
print(f"Final dataset shape: {df_final.shape}")
print(f"Columns: {list(df_final.columns)}")

In [None]:
# Convert to Hugging Face Dataset
from datasets import Dataset, DatasetDict

print("🔄 Converting to Hugging Face Dataset format...")

# Create train/test split with repository standard seed=16
train_size = 0.8
train_df = df_final.sample(frac=train_size, random_state=16)
test_df = df_final.drop(train_df.index)

print(f"\n📊 Split Statistics:")
print(f"Training set: {len(train_df)} examples ({train_size*100:.0f}%)")
print(f"Test set: {len(test_df)} examples ({(1-train_size)*100:.0f}%)")

# Create Dataset objects
train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
test_dataset = Dataset.from_pandas(test_df, preserve_index=False)

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

print("\n✅ Dataset created successfully")
print(f"\n📋 Dataset Info:")
print(dataset_dict)
print(f"\n🔍 Features: {train_dataset.features}")

In [None]:
# Authenticate with Hugging Face Hub
print("🔐 Authenticating with Hugging Face Hub...\n")

# Get HF token
hf_token = get_api_key('HF_TOKEN', required=False)

AUTHENTICATED = False

if hf_token:
    try:
        # Login programmatically
        login(token=hf_token)
        
        # Verify authentication
        user_info = whoami()
        print(f"✅ Successfully authenticated as: {user_info['name']}")
        print(f"📧 Email: {user_info.get('email', 'Not provided')}")
        
        AUTHENTICATED = True
        
    except Exception as e:
        print(f"❌ Authentication failed: {e}")
        print("💡 Please check your HF_TOKEN and try again")
        AUTHENTICATED = False
else:
    print("⚠️ HF_TOKEN not found. Cannot upload to Hub.")
    print("\n📝 To get your token:")
    print("1. Go to https://huggingface.co/settings/tokens")
    print("2. Create a new token with 'write' permissions")
    print("3. Set it as HF_TOKEN in your environment or Colab secrets")
    AUTHENTICATED = False

print(f"\n🔐 Authentication status: {'Authenticated ✅' if AUTHENTICATED else 'Not authenticated ❌'}")

In [None]:
# Push dataset to Hub
if AUTHENTICATED:
    print("📤 Pushing dataset to Hugging Face Hub...\n")
    
    # Define dataset name
    dataset_name = "github-issues-transformers"
    
    try:
        # Push to Hub
        dataset_dict.push_to_hub(
            dataset_name,
            private=False,  # Set to True if you want a private dataset
            commit_message="Upload GitHub issues dataset from HF transformers repo"
        )
        
        print(f"\n✅ Dataset successfully pushed to Hub!")
        print(f"\n🔗 Dataset URL: https://huggingface.co/datasets/{user_info['name']}/{dataset_name}")
        print(f"\n💡 Load your dataset with:")
        print(f"   from datasets import load_dataset")
        print(f"   dataset = load_dataset('{user_info['name']}/{dataset_name}')")
        
        DATASET_PUSHED = True
        
    except Exception as e:
        print(f"❌ Error pushing dataset: {e}")
        print("💡 Make sure your token has write permissions")
        DATASET_PUSHED = False
else:
    print("❌ Cannot push dataset - not authenticated")
    print("\n💡 Dataset push_to_hub features:")
    print("  1. Automatic format detection and conversion")
    print("  2. Efficient storage using Apache Arrow")
    print("  3. Automatic data card generation")
    print("  4. Version control and collaboration")
    print("  5. Easy sharing and discovery")
    DATASET_PUSHED = False

## 5. Creating a Dataset Card

A dataset card (README.md) is essential for documenting your dataset:
- Describes the dataset purpose and contents
- Explains data collection methodology
- Lists limitations and biases
- Provides usage examples
- Includes citation information

> 💡 **Documentation Matters**: A good dataset card makes your dataset more discoverable and usable by the community.

**Reference**: [Dataset Card Guide](https://huggingface.co/docs/hub/datasets-cards)
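
Dataset cards on the Hub can also begin with a YAML metadata block, delimited by `---` lines, which the Hub uses for search and filtering. The sketch below renders such a block from a plain dictionary. The helper `build_front_matter` and the example values are assumptions for illustration; it handles only strings and lists of strings and does not escape special YAML characters.

```python
from typing import Dict, List, Union

MetadataValue = Union[str, List[str]]


def build_front_matter(metadata: Dict[str, MetadataValue]) -> str:
    """Render a flat metadata dict as a YAML block delimited by '---' lines."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Illustrative values only; choose ones that match your dataset.
example_metadata = {
    "language": ["en"],
    "pretty_name": "GitHub Issues (transformers)",
    "tags": ["github-issues"],
}
print(build_front_matter(example_metadata))
```

The resulting string would be placed at the very top of the card text, before the first heading.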

In [None]:
# Create comprehensive dataset card
dataset_card = f"""
# GitHub Issues Dataset - HuggingFace Transformers

## Dataset Description

This dataset contains GitHub issues collected from the [huggingface/transformers](https://github.com/huggingface/transformers) repository. 
It was created for educational purposes as part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) project,
following the guidelines from [HuggingFace Course Chapter 5, Section 5](https://huggingface.co/learn/llm-course/chapter5/5?fw=pt).

### Dataset Summary

- **Purpose**: Educational demonstration of dataset creation and sharing
- **Source**: GitHub API (huggingface/transformers repository)
- **Collection Date**: {datetime.now().strftime('%Y-%m-%d')}
- **Total Issues**: {len(df_final)}
- **Train/Test Split**: 80/20 (seed=16 for reproducibility)

## Dataset Structure

### Data Fields

- `id`: GitHub issue unique identifier
- `number`: Issue number in the repository
- `title`: Issue title
- `body`: Issue description/body text
- `full_text`: Combined title and body for NLP tasks
- `state`: Issue state (open/closed)
- `author`: GitHub username of the issue creator
- `comments_count`: Number of comments on the issue
- `labels_text`: Comma-separated list of labels
- `created_at`: Timestamp when issue was created
- `url`: URL to the GitHub issue
- `title_length`: Length of title in characters
- `body_length`: Length of body in characters
- `title_word_count`: Number of words in title
- `body_word_count`: Number of words in body
- `labels_count`: Number of labels assigned
- `has_labels`: Boolean indicating if issue has labels
- `has_comments`: Boolean indicating if issue has comments
- `is_closed`: Boolean indicating if issue is closed

### Data Splits

| Split | Examples |
|-------|----------|
| train | {len(train_df)} |
| test  | {len(test_df)} |

## Dataset Creation

### Source Data

Data was collected using the GitHub API from the huggingface/transformers repository.
Only actual issues were included (pull requests were filtered out).

### Data Collection Process

1. **Fetching**: Used GitHub API to retrieve issues
2. **Filtering**: Removed pull requests and empty entries
3. **Cleaning**: Applied text normalization and cleaning
4. **Augmentation**: Added derived features and metadata
5. **Splitting**: Created train/test splits with seed=16

### Data Cleaning

- HTML entities were unescaped
- Excessive whitespace was normalized
- Empty titles were filtered out
- Text was cleaned and standardized

## Considerations for Using the Data

### Intended Use

This dataset is intended for:
- Educational purposes and learning about dataset creation
- Text classification experiments
- Issue analysis and categorization
- NLP model training and evaluation

### Limitations

- Limited to {len(df_final)} issues (snapshot in time)
- May not represent all types of issues in the repository
- Language was not verified or filtered; most issues are expected to be in English
- Subject to GitHub API rate limits during collection
- Does not include issue comments (only comment counts)

### Ethical Considerations

- All data is publicly available on GitHub
- Usernames are included as they are public information
- No private data was collected; issue text is user-written and was not screened for personal details
- Users should respect GitHub's terms of service when using this data

## Additional Information

### Dataset Curators

Created by [Vu Hung Nguyen](https://github.com/vuhung16au) as part of the HF Transformer Trove educational project.

### Licensing Information

This dataset is released under the same license as the source repository.
Public GitHub data is subject to GitHub's terms of service.

### Citation Information

If you use this dataset, please cite:

```
@misc{{github-issues-transformers,
  title={{GitHub Issues Dataset - HuggingFace Transformers}},
  author={{Nguyen, Vu Hung}},
  year={{2024}},
  howpublished={{\\url{{https://github.com/vuhung16au/hf-transformer-trove}}}}
}}
```

### Acknowledgments

- HuggingFace team for the transformers library and Hub platform
- GitHub for providing the API to access public repository data
- HuggingFace Course for excellent educational materials

## Usage Example

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("YOUR_USERNAME/github-issues-transformers")

# Access train split
train_data = dataset["train"]

# View first example
print(train_data[0])

# Filter closed issues
closed_issues = dataset.filter(lambda x: x["is_closed"])
```

## Dataset Card Contact

For questions or issues about this dataset, please:
- Open an issue on the [HF Transformer Trove repository](https://github.com/vuhung16au/hf-transformer-trove/issues)
- Contact: [GitHub Profile](https://github.com/vuhung16au)
"""

print("📝 Dataset card created!")
print("\n" + "=" * 50)
print("Dataset Card Preview:")
print("=" * 50)
print(dataset_card[:500] + "\n...\n[truncated for display]")

In [None]:
# Upload dataset card to Hub
if AUTHENTICATED and DATASET_PUSHED:
    print("📤 Uploading dataset card to Hub...\n")
    
    try:
        api = HfApi()
        
        # Upload the card text directly as bytes, avoiding a
        # platform-specific temporary path such as /tmp
        api.upload_file(
            path_or_fileobj=dataset_card.encode("utf-8"),
            path_in_repo="README.md",
            repo_id=f"{user_info['name']}/{dataset_name}",
            repo_type="dataset",
            commit_message="Add comprehensive dataset card"
        )
        
        print("✅ Dataset card uploaded successfully!")
        print(f"\n🔗 View your dataset card at:")
        print(f"   https://huggingface.co/datasets/{user_info['name']}/{dataset_name}")
        
    except Exception as e:
        print(f"❌ Error uploading dataset card: {e}")
else:
    print("ℹ️ Dataset card created but not uploaded (not authenticated or dataset not pushed)")
    print("\n💾 You can save the dataset card locally:")
    
    # Save locally as example
    with open("/tmp/dataset_card_README.md", "w", encoding="utf-8") as f:
        f.write(dataset_card)
    
    print("✅ Dataset card saved to: /tmp/dataset_card_README.md")

---

## 📋 Summary

### 🔑 Key Concepts Mastered

- **Data Collection**: Learned how to fetch data from APIs (GitHub) with proper error handling and rate limiting
- **Data Cleaning**: Applied text preprocessing, normalization, and quality checks
- **Feature Engineering**: Created derived features to enhance dataset value
- **Dataset Creation**: Converted raw data into structured HuggingFace Dataset format
- **Data Splitting**: Created reproducible train/test splits using seed=16
- **Hub Upload**: Successfully uploaded dataset to HuggingFace Hub using push_to_hub()
- **Documentation**: Created comprehensive dataset card following best practices

### 📈 Best Practices Learned

- **API Respect**: Always respect rate limits and implement proper error handling
- **Reproducibility**: Use consistent random seeds (seed=16) for reproducible results
- **Data Quality**: Perform thorough cleaning and validation before sharing
- **Documentation**: Create detailed dataset cards to help users understand your data
- **Privacy**: Only share publicly available data and respect terms of service
- **Version Control**: Use HuggingFace Hub for dataset versioning and collaboration

### 🚀 Next Steps

Now that you've created and uploaded your dataset, you can:

1. **Train Models**: Use your dataset to train classification or NLP models
2. **Share & Collaborate**: Invite others to use and improve your dataset
3. **Iterate & Improve**: Update your dataset based on feedback and new data
4. **Explore Analysis**: Perform deeper analysis on issue patterns and trends
5. **Create Variants**: Build specialized subsets for specific tasks

**Related Notebooks:**
- **basic5.3/HF-training-data-preparation.ipynb**: Advanced data preparation techniques
- **basic4.3/push_to_hub_API_demo.ipynb**: More about pushing to HuggingFace Hub
- **05_fine_tuning_trainer.ipynb**: Fine-tune models on your custom dataset

**External Resources:**
- [HF Course Chapter 5, Section 5](https://huggingface.co/learn/llm-course/chapter5/5?fw=pt): Creating your own dataset
- [HF Hub Documentation](https://huggingface.co/docs/hub/datasets): Comprehensive Hub docs
- [Datasets Library](https://huggingface.co/docs/datasets/): Full API reference

---

## About the Author

**Vu Hung Nguyen** - AI Engineer & Researcher

Connect with me:
- 🌐 **Website**: [vuhung16au.github.io](https://vuhung16au.github.io/)
- 💼 **LinkedIn**: [linkedin.com/in/nguyenvuhung](https://www.linkedin.com/in/nguyenvuhung/)
- 💻 **GitHub**: [github.com/vuhung16au](https://github.com/vuhung16au/)

*This notebook is part of the [HF Transformer Trove](https://github.com/vuhung16au/hf-transformer-trove) educational series.*