# CSV to Wikidata Transformation (Dry Run)

This notebook demonstrates how to:
1. Load data from a CSV file
2. Apply a mapping configuration
3. Transform the data to Wikidata JSON format
4. Preview the results without submitting to Wikidata

This is useful for testing and validating your mappings before actual submission.

## Setup

Import the required libraries and set the path to the mapping file.

In [1]:
import csv
import json
from pathlib import Path
import pandas as pd

from gkc import WikiverseAuth
from gkc.item_creator import PropertyMapper, ItemCreator

mapping_file_path = "mappings/fed_tribe_from_missing_ak_tribes.json"

## Step 1: Read and Preview Example CSV Data

Load the CSV with pandas, drop rows for items that have already been created, and preview what remains.

In [2]:
# Load CSV with pandas
df = pd.read_csv("/Users/sky/Downloads/Federally Recognized Tribes - Missing Tribes.csv")

created_items = [
    "Egegik Village"
]

df = df[~df['fr_label'].isin(created_items)]

print(f"Loaded {len(df)} records\n")
print("Columns:", df.columns.tolist())
print("\nFirst few rows:")
df.head()

Loaded 130 records

Columns: ['fr_label', 'fr_alternate_labels', 'old_name', 'wikipedia_en', 'qid_ak_city', 'qid_anrc', 'member_count_2005', 'nill_index', 'nill_ref']

First few rows:


| | fr_label | fr_alternate_labels | old_name | wikipedia_en | qid_ak_city | qid_anrc | member_count_2005 | nill_index | nill_ref |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chinik Eskimo Community | Golovin | | Chinik Eskimo Community | Q79379 | Q4891841 | 110 | https://narf.org/nill/tribes/chinik.html | https://narf.org/nill/triballaw/index.html#c |
| 1 | Circle Native Community | | | Circle Native Community | Q974350 | Q5303808 | 185 | https://narf.org/nill/tribes/circle_native.html | https://narf.org/nill/triballaw/index.html#c |
| 2 | Curyung Tribal Council | | Native Village of Dillingham | Curyung Tribal Council | Q79383 | Q13581663 | 1873 | https://narf.org/nill/tribes/curyung.html | https://narf.org/nill/triballaw/index.html#c |
| 4 | Evansville Village | Bettles Field | | Evansville Village (aka Bettles Field) | Q79943 | Q5303808 | 15 | https://narf.org/nill/tribes/evansville.html | https://narf.org/nill/triballaw/index.html#e |
| 5 | Healy Lake Village | | | Healy Lake Village | Q378233 | Q5303808 | 27 | https://narf.org/nill/tribes/healy_lake.html | https://narf.org/nill/triballaw/index.html#h |
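
The same load-and-filter step can be sketched without pandas, which may help when checking the logic on a small sample. This is a minimal illustration using the standard `csv` module; the inline text stands in for the real file and carries only three of its columns, and `load_pending_rows` is a hypothetical helper name.

```python
import csv
import io

# Inline sample standing in for the real CSV file (a subset of its columns).
SAMPLE_CSV = """fr_label,fr_alternate_labels,member_count_2005
Chinik Eskimo Community,Golovin,110
Circle Native Community,,185
Egegik Village,,
"""

# Labels of items that already exist and should be skipped.
created_items = {"Egegik Village"}


def load_pending_rows(csv_text, already_created):
    """Parse CSV text and drop rows whose fr_label was already created."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["fr_label"] not in already_created]


pending = load_pending_rows(SAMPLE_CSV, created_items)
print(f"Loaded {len(pending)} records")
for row in pending:
    print(row["fr_label"])
```

Note that `csv.DictReader` returns empty cells as empty strings, whereas pandas represents them as NaN.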


In [None]:
# Use the built-in sitelinks validator from gkc
from gkc import check_wikipedia_page

df["wikipedia_en"] = df["wikipedia_en"].apply(
    lambda x: check_wikipedia_page(x, site_code="enwiki")
)

## Step 2: Build Mapping Configuration

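Build the mapping configuration that defines how CSV fields map to Wikidata properties. If no mapping JSON file exists yet, this cell can generate a stub from an EntitySchema for you to edit. Option B writes to `mapping_file_path` on every run, so skip this cell once you have customized that file.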

In [None]:
# Option A: Generate mapping directly and use it
from gkc import ClaimsMapBuilder

# Uncomment to auto-generate mapping from EntitySchema E502
# builder = ClaimsMapBuilder(eid="E502")
# mapper = PropertyMapper.from_claims_builder(builder, entity_type="Q7840353")
# print("✓ Generated and loaded mapping from EntitySchema E502")

# Option B: Generate, save, then customize
builder = ClaimsMapBuilder(eid="E502")
mapping_config = builder.build_complete_mapping(entity_type="Q7840353")

# Save for customization
with open(mapping_file_path, "w") as f:
    json.dump(mapping_config, f, indent=2)
print(f"✓ Saved auto-generated mapping to {mapping_file_path}")
print("  Now edit the file to update source_field names to match your CSV")

# Then load the customized version
mapper = PropertyMapper.from_file(mapping_file_path)

# print("Using Option A generates mapping on-the-fly from EntitySchema")
# print("Using Option B allows you to customize field names before using")
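
After editing the mapping file, it can be worth checking that every `source_field` it names actually exists in the CSV. The sketch below walks any nested dict/list structure and collects values stored under a `source_field` key, so it does not depend on the exact layout of the generated file. The `example_mapping` fragment is hypothetical (only the `mappings`, `labels`, `aliases`, and `claims` keys and the `source_field` name appear in this notebook), and `collect_source_fields` is not part of gkc.

```python
def collect_source_fields(node):
    """Recursively gather every string stored under a 'source_field' key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "source_field" and isinstance(value, str):
                found.add(value)
            else:
                found |= collect_source_fields(value)
    elif isinstance(node, list):
        for child in node:
            found |= collect_source_fields(child)
    return found


# Hypothetical mapping fragment; the real generated file may be laid out differently.
example_mapping = {
    "mappings": {
        "labels": [{"source_field": "fr_label", "language": "en"}],
        "aliases": [{"source_field": "fr_alternate_labels", "language": "en"}],
        "claims": [{"property": "P2124", "source_field": "member_count"}],
    }
}

csv_columns = ["fr_label", "fr_alternate_labels", "member_count_2005"]

# Any name listed here still needs to be renamed in the mapping file.
missing = sorted(collect_source_fields(example_mapping) - set(csv_columns))
print("Fields missing from CSV:", missing)
```

With the real data, `csv_columns` would be `df.columns.tolist()` and the mapping would be loaded from the saved JSON file with `json.load`.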

## Step 3: Load Mapping Configuration

Load the mapping configuration from the JSON file and report how many label, alias, description, and claim mappings it contains.

In [5]:
# METHOD 1: Load from pre-made mapping file
mapping_file = Path(mapping_file_path)

if not mapping_file.exists():
    print(f"⚠️  Mapping file not found: {mapping_file}")
    mapping_file = Path("tribe_mapping_example.json")
    print(f"   Falling back to: {mapping_file} (current directory)")

# Load the mapper from file
mapper = PropertyMapper.from_file(str(mapping_file))
print(f"✓ Loaded mapping configuration from: {mapping_file}")

# METHOD 2: Generate mapping from EntitySchema (uncomment to use)
# from gkc import ClaimsMapBuilder
# builder = ClaimsMapBuilder(eid="E502")
# mapper = PropertyMapper.from_claims_builder(builder, entity_type="Q7840353")
# print("✓ Generated mapping from EntitySchema E502")

# Preview mapping structure
print("\nMapping includes:")
print(f"  - Labels: {len(mapper.config['mappings'].get('labels', []))} fields")
print(f"  - Aliases: {len(mapper.config['mappings'].get('aliases', []))} fields")
print(f"  - Descriptions: {len(mapper.config['mappings'].get('descriptions', []))} fields")
print(f"  - Claims: {len(mapper.config['mappings'].get('claims', []))} properties")

✓ Loaded mapping configuration from: mappings/fed_tribe_from_missing_ak_tribes.json

Mapping includes:
  - Labels: 1 fields
  - Aliases: 1 fields
  - Descriptions: 1 fields
  - Claims: 7 properties


## Step 4: Transform Single Record (Detailed View)

Let's transform one record and examine the resulting Wikidata JSON structure in detail.

In [6]:
# Get first record as dictionary
first_record = df.iloc[0].to_dict()

print("Source record:")
print("=" * 60)
for key, value in first_record.items():
    print(f"  {key}: {value}")

# Transform to Wikidata JSON
wikidata_json = mapper.transform_to_wikidata(first_record)

print("\n" + "=" * 60)
print("Transformed Wikidata JSON:")
print("=" * 60)
print(json.dumps(wikidata_json, indent=2, ensure_ascii=False))

Source record:
  fr_label: Chinik Eskimo Community
  fr_alternate_labels: Golovin
  old_name: nan
  wikipedia_en: nan
  qid_ak_city: Q79379
  qid_anrc: Q4891841
  member_count_2005: 110
  nill_index: https://narf.org/nill/tribes/chinik.html
  nill_ref: https://narf.org/nill/triballaw/index.html#c

Transformed Wikidata JSON:
{
  "labels": {
    "en": {
      "language": "en",
      "value": "Chinik Eskimo Community"
    }
  },
  "descriptions": {
    "en": {
      "language": "en",
      "value": "federally recognized tribe in Alaska, United States"
    }
  },
  "aliases": {
    "en": [
      {
        "language": "en",
        "value": "Golovin"
      }
    ]
  },
  "claims": {
    "P31": [
      {
        "mainsnak": {
          "snaktype": "value",
          "property": "P31",
          "datavalue": {
            "value": {
              "entity-type": "item",
              "numeric-id": 7840353,
              "id": "Q7840353"
            },
            "type": "wikibase-entityid"
  

## Step 5: Examine Key Sections

Let's look at specific sections of the transformed data to understand the structure better.

In [None]:
# Labels
print("LABELS:")
print(json.dumps(wikidata_json.get('labels', {}), indent=2, ensure_ascii=False))

# Aliases (note the separator handling)
print("\nALIASES (split from semicolon-separated string):")
print(json.dumps(wikidata_json.get('aliases', {}), indent=2, ensure_ascii=False))

# Descriptions
print("\nDESCRIPTIONS:")
print(json.dumps(wikidata_json.get('descriptions', {}), indent=2, ensure_ascii=False))

# Sitelinks (Wikipedia and other Wikimedia project links)
print("\nSITELINKS (links to Wikipedia/Wikimedia projects):")
print(json.dumps(wikidata_json.get('sitelinks', {}), indent=2, ensure_ascii=False))
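
To make the alias handling concrete, here is a minimal sketch of how a separator-delimited cell could be turned into the `aliases` structure shown earlier. It is an illustration of the idea, not gkc's implementation; `build_aliases` and its parameters are hypothetical, and the two-name input is a made-up example.

```python
def build_aliases(raw_value, language="en", separator=";"):
    """Split a separator-delimited string into Wikibase alias entries."""
    if not isinstance(raw_value, str):
        # Empty CSV cells arrive from pandas as NaN (a float), so skip them.
        return {}
    names = [part.strip() for part in raw_value.split(separator)]
    entries = [{"language": language, "value": name} for name in names if name]
    return {language: entries} if entries else {}


single = build_aliases("Golovin")
several = build_aliases("Name A; Name B ;")
print(single)
print(several)
```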

In [None]:
# Claims - show a few examples
print("CLAIMS (sample properties):")
print("\nP31 (instance of):")
if 'P31' in wikidata_json.get('claims', {}):
    print(json.dumps(wikidata_json['claims']['P31'], indent=2, ensure_ascii=False))

print("\nP2124 (member count with qualifier and reference):")
if 'P2124' in wikidata_json.get('claims', {}):
    print(json.dumps(wikidata_json['claims']['P2124'], indent=2, ensure_ascii=False))
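
For orientation, the sketch below builds two bare statements in the Wikibase JSON format: an item-valued one like the P31 claim in the output above, and a unitless quantity like a member count. It omits qualifiers and references, and the helper names are hypothetical rather than part of gkc.

```python
def item_statement(property_id, qid):
    """Build a statement whose value is another item, e.g. P31 -> Q7840353."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(qid[1:]),  # "Q7840353" -> 7840353
                    "id": qid,
                },
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }


def quantity_statement(property_id, amount):
    """Build a unitless quantity statement; amounts are signed strings."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {
                "value": {"amount": f"{int(amount):+d}", "unit": "1"},
                "type": "quantity",
            },
        },
        "type": "statement",
        "rank": "normal",
    }


# Claims are keyed by property ID, and each property holds a list of statements.
claims = {
    "P31": [item_statement("P31", "Q7840353")],
    "P2124": [quantity_statement("P2124", 110)],
}
```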

## Step 6: Dry Run - Transform All Records

Now let's process all records using the ItemCreator in dry-run mode. This shows what would be submitted without actually sending data to Wikidata.

In [None]:
# Create auth (not actually needed for dry run, but required by ItemCreator)
auth = WikiverseAuth()

# Create ItemCreator in DRY RUN mode
creator = ItemCreator(auth=auth, mapper=mapper, dry_run=True)

print("Processing all records in DRY RUN mode...")
print("=" * 60)

# Convert dataframe to list of dicts
records = df.to_dict('records')

# Process each record
for i, record in enumerate(records, 1):
    print(f"\n{'='*60}")
    print(f"Record {i}/{len(records)}: {record['fr_label']}")
    print(f"{'='*60}")
    
    result = creator.create_item(record, validate=False)
    print(f"\nResult: {result}")
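
The dry-run idea itself is simple: build the payload as usual, then report it instead of sending it. The sketch below shows that pattern with a hypothetical `DryRunSubmitter` class and a fake send function; it does not reflect how `ItemCreator` is implemented or what it returns in dry-run mode.

```python
class DryRunSubmitter:
    """Hypothetical stand-in showing the dry-run pattern; not gkc's ItemCreator."""

    def __init__(self, send, dry_run=True):
        self.send = send  # callable that would perform the real API call
        self.dry_run = dry_run

    def submit(self, payload):
        if self.dry_run:
            # Report what would be sent and return a marker instead of a QID.
            count = len(payload.get("claims", {}))
            print(f"[dry run] would create item with {count} properties")
            return "DRY_RUN"
        return self.send(payload)


sent = []


def fake_send(payload):
    """Record the payload instead of calling a real API."""
    sent.append(payload)
    return "Q0"  # placeholder ID, not a real item


payload = {"labels": {}, "claims": {"P31": []}}
preview = DryRunSubmitter(fake_send, dry_run=True).submit(payload)
```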

## Step 7: Batch Processing Summary

Use the batch processing feature to get a summary of all transformations.

In [None]:
# Process batch and get summary
results = creator.create_batch(records, validate=False)

print("\nBatch Processing Summary")
print("=" * 60)
print(f"Total records: {len(records)}")
print(f"Successful: {len(results['success'])}")
print(f"Failed: {len(results['failed'])}")

if results['success']:
    print("\nSuccessfully processed:")
    for item in results['success']:
        record = item['record']
        print(f"  ✓ {record['fr_label']} → {item['qid']}")

if results['failed']:
    print("\nFailed records:")
    for item in results['failed']:
        record = item['record']
        print(f"  ✗ {record.get('fr_label', 'Unknown')}: {item['error']}")
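
The result shape used above (a `success` list of record/QID pairs and a `failed` list of record/error pairs) can be produced by a loop like the following. This is a sketch of the pattern under that assumption; `run_batch` and `fake_create` are hypothetical and not gkc functions.

```python
def run_batch(records, create_one):
    """Apply create_one to each record, collecting successes and failures."""
    results = {"success": [], "failed": []}
    for record in records:
        try:
            qid = create_one(record)
        except Exception as exc:  # keep going so one bad row does not stop the batch
            results["failed"].append({"record": record, "error": str(exc)})
        else:
            results["success"].append({"record": record, "qid": qid})
    return results


def fake_create(record):
    """Hypothetical creator that rejects records without a label."""
    if not record.get("fr_label"):
        raise ValueError("missing fr_label")
    return "DRY_RUN"


summary = run_batch(
    [{"fr_label": "Chinik Eskimo Community"}, {"fr_label": ""}],
    fake_create,
)
print(len(summary["success"]), "succeeded,", len(summary["failed"]), "failed")
```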

## Step 8: Export Transformed Data

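Transform all records and collect the resulting Wikidata JSON structures for review. The lines that write them to `transformed_items.json` are commented out; uncomment them to save the file.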

In [None]:
# Transform all records and save
transformed_records = []

for record in records:
    wikidata_json = mapper.transform_to_wikidata(record)
    transformed_records.append({
        "source_label": record['fr_label'],
        "wikidata_json": wikidata_json
    })

# Save to JSON file
# output_path = Path("transformed_items.json")
# with open(output_path, 'w', encoding='utf-8') as f:
#     json.dump(transformed_records, f, indent=2, ensure_ascii=False)

# print(f"✓ Saved {len(transformed_records)} transformed records to: {output_path}")
# print(f"  File size: {output_path.stat().st_size:,} bytes")

### Summary Statistics

Analyze the transformed data to understand what will be created.

In [None]:
# Analyze transformed data
stats = {
    'total_records': len(transformed_records),
    'properties_used': set(),
    'languages': set(),
    'total_aliases': 0,
    'total_claims': 0
}

for item in transformed_records:
    wikidata_json = item['wikidata_json']
    
    # Collect label languages
    stats['languages'].update(wikidata_json.get('labels', {}).keys())
    
    # Count aliases across all languages
    for aliases in wikidata_json.get('aliases', {}).values():
        stats['total_aliases'] += len(aliases)
    
    # Record properties used and count individual statements
    # (each property maps to a list of statements)
    claims = wikidata_json.get('claims', {})
    stats['properties_used'].update(claims.keys())
    stats['total_claims'] += sum(len(statements) for statements in claims.values())

print("Transformation Statistics")
print("=" * 60)
print(f"Total records transformed: {stats['total_records']}")
print(f"Languages: {', '.join(sorted(stats['languages']))}")
print(f"Total aliases created: {stats['total_aliases']}")
print(f"Total claims (statements): {stats['total_claims']}")
print(f"\nUnique properties used: {len(stats['properties_used'])}")
for prop in sorted(stats['properties_used']):
    print(f"  - {prop}")
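
A per-property breakdown can complement these totals. In the Wikidata JSON, each property ID maps to a list of statements, so summing list lengths gives statement counts, while counting keys gives the number of properties. The sketch below uses `collections.Counter` on a small made-up sample shaped like `transformed_records`; `statements_per_property` is a hypothetical helper.

```python
from collections import Counter


def statements_per_property(transformed):
    """Count statements per property across all transformed records."""
    counts = Counter()
    for item in transformed:
        claims = item["wikidata_json"].get("claims", {})
        for prop, statements in claims.items():
            counts[prop] += len(statements)
    return counts


# Made-up sample with empty statement bodies, just to exercise the counting.
sample = [
    {"wikidata_json": {"claims": {"P31": [{}], "P2124": [{}]}}},
    {"wikidata_json": {"claims": {"P31": [{}, {}]}}},
    {"wikidata_json": {}},
]

counts = statements_per_property(sample)
for prop, total in sorted(counts.items()):
    print(f"{prop}: {total}")
```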

## Step 9: Item Creation Test

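Create a single item on Wikidata to confirm that the whole pipeline works. Unlike the earlier steps, this cell sets `dry_run=False` and asks for confirmation before submitting.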

In [7]:
first_record

{'fr_label': 'Chinik Eskimo Community',
 'fr_alternate_labels': 'Golovin',
 'old_name': nan,
 'wikipedia_en': nan,
 'qid_ak_city': 'Q79379',
 'qid_anrc': 'Q4891841',
 'member_count_2005': 110,
 'nill_index': 'https://narf.org/nill/tribes/chinik.html',
 'nill_ref': 'https://narf.org/nill/triballaw/index.html#c'}
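
The record above still carries `nan` for the empty CSV cells. Whether the mapper skips such values is not shown in this notebook, so one cautious option is to strip them before transformation. A minimal sketch, with `drop_missing` as a hypothetical helper:

```python
import math


def drop_missing(record):
    """Return a copy of the record without None or NaN values."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned


record = {
    "fr_label": "Chinik Eskimo Community",
    "fr_alternate_labels": "Golovin",
    "old_name": float("nan"),
    "wikipedia_en": float("nan"),
    "member_count_2005": 110,
}
cleaned_record = drop_missing(record)
print(cleaned_record)
```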

In [8]:
auth = WikiverseAuth()

if not auth.is_authenticated():
    # Stop here: the steps below cannot succeed without credentials
    raise RuntimeError(
        "No credentials found. "
        "Set WIKIVERSE_USERNAME and WIKIVERSE_PASSWORD to run this example."
    )

print(f"\nAuthenticating as: {auth.username}")
try:
    auth.login()
    print("✓ Successfully logged in")
except Exception as e:
    print(f"✗ Login failed: {e}")
    raise  # do not go on to create an item without a valid session

creator = ItemCreator(auth=auth, mapper=mapper, dry_run=False)

print(f"\nCreating item for: {first_record['fr_label']}")

response = input("Are you sure you want to create this item? (yes/no): ")
if response.lower() != "yes":
    print("Cancelled.")
else:
    try:
        qid = creator.create_item(first_record, validate=False)
        print(f"\n✓ Successfully created item: {qid}")
        print(f"   View at: https://www.wikidata.org/wiki/{qid}")
    except Exception as e:
        print(f"\n✗ Failed to create item: {e}")


Authenticating as: Skybristol bot@icd
✓ Successfully logged in

Creating item for: Chinik Eskimo Community

✗ Failed to create item: API error: {'code': 'modification-failed', 'info': 'Data value corrupt: $timestamp must resemble ISO 8601, given +2005T00:00:00Z', 'messages': [{'name': 'wikibase-validator-bad-value', 'parameters': ['$timestamp must resemble ISO 8601, given +2005T00:00:00Z'], 'html': {'*': 'Data value corrupt: $timestamp must resemble ISO 8601, given +2005T00:00:00Z'}}], '*': 'See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}
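
The API rejected the time value `+2005T00:00:00Z` because Wikibase expects the full `+YYYY-MM-DDTHH:MM:SSZ` form, with a separate `precision` field saying how much of it is meaningful (9 means year). The sketch below builds a year-precision time value in that shape. Where the fix belongs in gkc (the mapping file or the transformer) is not shown in this notebook, and `year_to_wikibase_time` is a hypothetical helper.

```python
# Entity URI for the proleptic Gregorian calendar, used as the calendar model.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"


def year_to_wikibase_time(year):
    """Build a year-precision Wikibase time value for a CE year."""
    year = int(year)
    if year < 1:
        raise ValueError("this sketch only handles years from 1 CE onward")
    return {
        "time": f"+{year:04d}-01-01T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 9,  # 9 = precise to the year
        "calendarmodel": GREGORIAN,
    }


value = year_to_wikibase_time(2005)
print(value["time"])  # +2005-01-01T00:00:00Z
```

A value like this would sit inside a `datavalue` of type `time`, for example in a point-in-time qualifier on the member count statement.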
