# NAOMI-II Wikipedia Parsing (Local - 10 Workers)
## Parse 12.4M Pre-Extracted Sentences Locally

**Hardware:** Your local machine with 12 CPU cores (using 10 for safety)

**Cost:** FREE! 💰

**Time Estimates (with 10-worker parallelization):**
- **100K sentences (test)**: ~14 minutes
- **12.4M sentences (full)**: ~29 hours (chart parser)

**Why 10 workers not 12?**
- Each worker = 1 separate Python process
- 10 workers leave 2 of the 12 cores free: one for the main notebook process and one for the OS
- That headroom keeps your system responsive during the run
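
The worker count and the full-corpus estimate above follow from simple arithmetic. A minimal sketch, assuming the ~12 sentences/sec per worker rate this notebook uses for the chart parser; the helper names `pick_workers` and `estimate_hours` are illustrative and not part of the NAOMI-II scripts:

```python
def pick_workers(total_cores: int, reserved: int = 2) -> int:
    """Reserve cores for the notebook process and the OS, keeping at least one worker."""
    return max(1, total_cores - reserved)


def estimate_hours(sentences: int, workers: int, rate_per_worker: float = 12.0) -> float:
    """Wall-clock hours, assuming throughput scales linearly with the worker count."""
    return sentences / (workers * rate_per_worker * 3600)


workers = pick_workers(12)
print(f"{workers} workers")
print(f"Full corpus: ~{estimate_hours(12_400_000, workers):.0f} hours")
```

Linear scaling is an optimistic assumption. Process start-up and disk I/O can push a real run above the estimate, so treat the figure as a lower bound.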

**Data:** Pre-extracted Wikipedia sentences in `notebooks/data/extracted_articles.txt`

**Output:**
- `data/wikipedia_parsed/parsed_corpus.pkl` (parsed sentences with WSD)
- `data/wikipedia_parsed/parse_stats.json` (parsing statistics)

**Next Step:** Upload results to Google Drive → Run `NAOMI_A100_Training.ipynb` on A100

**Key Advantage:** Chart parser evaluates ALL parse options (most robust for Wikipedia's complex sentences)
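
To make "evaluates ALL parse options" concrete, here is a toy CKY chart parser that counts every parse of an ambiguous sentence. The grammar, lexicon, and function name are invented for illustration. This is not NAOMI-II's parser, whose grammar and scoring are not shown in this notebook.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(words, start="S"):
    """Return how many distinct parse trees the toy grammar gives `words`."""
    n = len(words)
    chart = defaultdict(int)  # (begin, end, symbol) -> number of parses
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    # Fill longer spans from shorter ones, trying every split point and rule.
    for width in range(2, n + 1):
        for begin in range(n - width + 1):
            end = begin + width
            for split in range(begin + 1, end):
                for parent, left, right in RULES:
                    chart[(begin, end, parent)] += (
                        chart[(begin, split, left)] * chart[(split, end, right)]
                    )
    return chart[(0, n, start)]


print(count_parses("I saw the man with the telescope".split()))
```

The sentence has two readings (the telescope is the instrument of seeing, or it belongs to the man), and the chart accounts for both because no split point or rule is pruned. That exhaustiveness is what costs time compared with a parser that branches selectively.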

## 1. Verify Local Setup

In [1]:
import sys
import os
import psutil
import multiprocessing as mp
from pathlib import Path

# Verify we're in the NAOMI-II directory
if not Path('../scripts/batch_parse_corpus.py').exists():
    print("⚠️ ERROR: Not in NAOMI-II directory!")
    print("Please run this notebook from the NAOMI-II/notebooks/ directory")
else:
    print("✓ NAOMI-II directory detected")

# Check CPU cores
cpu_count = mp.cpu_count()
print(f"\nCPU cores: {cpu_count}")
# Estimate assumes ~12 sentences/sec per core with every core used as a worker,
# so it is an upper bound on speed (the full parse below reserves 2 cores)
est_rate = cpu_count * 12
est_hours = 12400000 / (est_rate * 3600)
if cpu_count >= 12:
    print("✓ 12+ cores available - Excellent for parallel parsing!")
    print(f"  Expected speed: ~{est_rate} sentences/sec (chart parser)")
elif cpu_count >= 8:
    print(f"✓ {cpu_count} cores available - Good for parallel parsing")
    print(f"  Expected speed: ~{est_rate} sentences/sec (chart parser)")
else:
    print(f"⚠️ Only {cpu_count} cores - parsing will be slower")
print(f"  Est. time for 12.4M sentences: ~{est_hours:.0f} hours")

# Check RAM
ram_gb = psutil.virtual_memory().total / 1e9
print(f"\nRAM: {ram_gb:.1f} GB")
if ram_gb >= 16:
    print("✓ Sufficient RAM for parsing")
else:
    print("⚠️ Low RAM - consider reducing batch size if errors occur")

# Check disk space
disk = psutil.disk_usage('.')
disk_free_gb = disk.free / 1e9
print(f"\nDisk space free: {disk_free_gb:.1f} GB")
if disk_free_gb >= 20:
    print("✓ Sufficient disk space for parsed output")
else:
    print("⚠️ Low disk space - may need to clean up")

# Check dependencies
print("\nChecking dependencies...")
try:
    import nltk
    import numpy as np
    import tqdm
    print("✓ All dependencies installed")
except ImportError as e:
    print(f"⚠️ Missing dependency: {e}")
    print("   Run: pip install nltk numpy tqdm")

# Check WordNet
try:
    from nltk.corpus import wordnet
    wordnet.synsets('test')
    print("✓ WordNet data available")
except (ImportError, LookupError):
    # NLTK raises LookupError when the corpus has not been downloaded
    print("⚠️ WordNet not downloaded")
    print("   Run: python -c \"import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')\"")

✓ NAOMI-II directory detected

CPU cores: 12
✓ 12+ cores available - Excellent for parallel parsing!
  Expected speed: ~144 sentences/sec (chart parser)
  Est. time for 12.4M sentences: ~24 hours

RAM: 16.8 GB
✓ Sufficient RAM for parsing

Disk space free: 220.4 GB
✓ Sufficient disk space for parsed output

Checking dependencies...
✓ All dependencies installed
✓ WordNet data available
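
The hardware checks above can also be collected into one function that returns a list of warnings, which is easier to reuse from a script than a sequence of prints. A minimal sketch using the same thresholds as the cell (8 cores, 16 GB RAM, 20 GB free disk); the name `preflight_warnings` is hypothetical:

```python
from typing import List


def preflight_warnings(cores: int, ram_gb: float, disk_free_gb: float) -> List[str]:
    """Return one message per resource that falls below the notebook's thresholds."""
    warnings = []
    if cores < 8:
        warnings.append(f"Only {cores} cores - parsing will be slower")
    if ram_gb < 16:
        warnings.append("Low RAM - consider reducing batch size if errors occur")
    if disk_free_gb < 20:
        warnings.append("Low disk space - may need to clean up")
    return warnings


# Values taken from the output above; an empty list means all checks passed.
print(preflight_warnings(cores=12, ram_gb=16.8, disk_free_gb=220.4))
```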


## 2. Load and Verify Data

In [2]:
import multiprocessing as mp  # needed for mp.cpu_count() below if this cell runs on its own
from pathlib import Path

# Check for extracted sentences
data_file = Path('data/extracted_articles.txt')

if data_file.exists():
    print("✓ Pre-extracted sentences found!")
    
    # Get file size
    size_mb = data_file.stat().st_size / 1e6
    print(f"  File size: {size_mb:.0f} MB")
    
    # Count sentences (one per line; this reads the whole file)
    print("\nCounting sentences...")
    with open(data_file, 'r', encoding='utf-8') as f:
        sentence_count = sum(1 for _ in f)
    
    print(f"  Total sentences: {sentence_count:,}")
    
    # Show sample
    print("\nSample sentences:")
    with open(data_file, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i < 3:
                print(f"  {i+1}. {line.strip()[:100]}...")
            else:
                break
    
    # Time estimates
    cpu_count = mp.cpu_count()
    est_hours_chart = sentence_count / (cpu_count * 12 * 3600)
    est_hours_quantum = sentence_count / (cpu_count * 17 * 3600)
    
    print(f"\nEstimated parsing time ({cpu_count} cores):")
    print(f"  Chart parser (robust):   ~{est_hours_chart:.0f} hours")
    print(f"  Quantum parser (faster): ~{est_hours_quantum:.0f} hours")
    print(f"\n  Using chart parser for maximum robustness on Wikipedia text")
    
else:
    print("⚠️ ERROR: Data file not found!")
    print(f"  Expected location: {data_file.absolute()}")
    print("\nPlease ensure extracted_articles.txt is in notebooks/data/")

✓ Pre-extracted sentences found!
  File size: 1365 MB

Counting sentences...
  Total sentences: 12,377,687

Sample sentences:
  1. A historically left-wing movement, anarchism is usually described as the libertarian wing of the soc...
  2. Although traces of anarchist ideas are found all throughout history, modern anarchism emerged from t...
  3. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement...

Estimated parsing time (12 cores):
  Chart parser (robust):   ~24 hours
  Quantum parser (faster): ~17 hours

  Using chart parser for maximum robustness on Wikipedia text
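
Counting lines in a 1.3 GB file by iterating in text mode decodes every byte as UTF-8. If the count cell feels slow, one alternative is to scan raw bytes for newline characters instead. This is a sketch, not what the notebook does: `count_lines` is a hypothetical helper, and it matches the text-mode count only for files that use `\n` or `\r\n` line endings.

```python
from pathlib import Path


def count_lines(path, chunk_size=1 << 20):
    """Count lines by scanning raw bytes for newlines, without decoding UTF-8."""
    count = 0
    last_byte = b"\n"
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte != b"\n":
        count += 1
    return count


data_file = Path("data/extracted_articles.txt")
if data_file.exists():
    print(f"Total sentences: {count_lines(data_file):,}")
```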


## 3. Test Parse (100K sentences - RECOMMENDED)

**⚠️ START HERE BEFORE FULL PARSE**

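**Time:** ~23 minutes with the 6 workers configured below, or ~14 minutes with 10 workers (assuming ~12 sentences/sec per worker)

**Purpose:** Verify everything works and check the parse success rate before committing to the ~29-hour full run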

In [None]:
%%time
# Test parse with 100K sentences
# Using CHART parser (evaluates all options - most robust)

import multiprocessing as mp

# WINDOWS FIX: Use fewer workers due to Windows multiprocessing overhead
# Start with 6 workers, can adjust up/down based on system performance
num_workers = 6  # Conservative for Windows
print(f"Total CPU cores: {mp.cpu_count()}")
print(f"Using {num_workers} worker processes (Windows-safe configuration)")
print(f"This creates {num_workers} Python processes + 1 main notebook process = {num_workers + 1} total\n")

!python ../scripts/batch_parse_corpus.py \
    --corpus data/extracted_articles.txt \
    --parser-type chart \
    --max-sentences 100000 \
    --batch-size 1000 \
    --checkpoint-every 10000 \
    --num-workers {num_workers} \
    --output-dir ../data/wikipedia_parsed_100k \
    --resume

print("\n" + "="*70)
print("TEST PARSE COMPLETE!")
print("="*70)

Total CPU cores: 12
Using 6 worker processes (Windows-safe configuration)
This creates 6 Python processes + 1 main notebook process = 7 total
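
The `!python` line hands the command to the system shell. If you prefer to launch the same run from plain Python (for example, from a script rather than a notebook), the command can be built as an argument list and passed to `subprocess`, which avoids shell quoting and line-continuation differences between platforms. A sketch only: the flags are copied from the cell above, and `build_parse_command` is a hypothetical helper.

```python
import subprocess
import sys


def build_parse_command(parser_type="chart", max_sentences=100_000, num_workers=6,
                        checkpoint_every=10_000,
                        output_dir="../data/wikipedia_parsed_100k"):
    """Assemble the batch_parse_corpus.py invocation as a list of arguments."""
    return [
        sys.executable, "../scripts/batch_parse_corpus.py",
        "--corpus", "data/extracted_articles.txt",
        "--parser-type", parser_type,
        "--max-sentences", str(max_sentences),
        "--batch-size", "1000",
        "--checkpoint-every", str(checkpoint_every),
        "--num-workers", str(num_workers),
        "--output-dir", output_dir,
        "--resume",
    ]


cmd = build_parse_command()
print(" ".join(cmd))
# Uncomment to start the run (paths assume the notebooks/ working directory):
# subprocess.run(cmd, check=True)
```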



In [None]:
# Display test parse statistics
import json
from pathlib import Path

stats_file = Path('../data/wikipedia_parsed_100k/parse_stats.json')

if stats_file.exists():
    with open(stats_file, 'r') as f:
        stats = json.load(f)
    
    print("Test Parse Statistics:")
    print("="*70)
    print(f"Total sentences: {stats.get('total', 0):,}")
    print(f"Successful parses: {stats.get('success', 0):,}")
    print(f"Failed parses: {stats.get('failed', 0):,}")
    
    success_rate = stats.get('success', 0) / max(stats.get('total', 1), 1) * 100
    print(f"\nSuccess rate: {success_rate:.1f}%")
    print(f"Average parse score: {stats.get('avg_score', 0):.3f}")
    print(f"Total triples extracted: {stats.get('total_triples', 0):,}")
    
    print("\n" + "="*70)
    print("RECOMMENDATION:")
    print("="*70)
    if success_rate >= 90:
        print("✓ Excellent success rate (>90%)!")
        print("  You can proceed with full parse using chart parser")
        print("  Or try quantum parser for faster parsing (Cell 4b)")
    elif success_rate >= 80:
        print("✓ Good success rate (80-90%)")
        print("  Proceed with chart parser for robustness")
    else:
        print("⚠️ Low success rate (<80%)")
        print("  Chart parser is already the most robust option")
        print("  Check error log for common failure patterns")
    
else:
    print("No statistics file found - did the test parse complete?")
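
The success-rate arithmetic is repeated in the final statistics cell later in this notebook, so it may be worth factoring into one helper. A minimal sketch, assuming `parse_stats.json` contains the keys read by the cell above (`total`, `success`, `total_triples`); the function name and the returned field names are hypothetical:

```python
def summarize_stats(stats: dict) -> dict:
    """Derive success rate, triples per sentence, and a verdict from raw counts."""
    total = stats.get("total", 0)
    success = stats.get("success", 0)
    # max(..., 1) guards against division by zero on an empty stats file
    success_rate = success / max(total, 1) * 100
    avg_triples = stats.get("total_triples", 0) / max(success, 1)
    if success_rate >= 90:
        verdict = "excellent"
    elif success_rate >= 80:
        verdict = "good"
    else:
        verdict = "low"
    return {
        "success_rate": success_rate,
        "avg_triples": avg_triples,
        "verdict": verdict,
    }


# Made-up counts, purely to show the call shape
print(summarize_stats({"total": 1000, "success": 850, "total_triples": 2550}))
```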

## 4a. Full Parse with Chart Parser (RECOMMENDED)

**Parser:** Chart (most robust, evaluates ALL parse options)

**Time:** ~29 hours for 12.4M sentences (10 workers on a 12-core machine)

**Cost:** FREE (runs locally)

**Checkpointing:** Saves every 50K sentences (fully resumable)

**⚠️ This will run for ~29 hours, so let it run overnight or over a weekend!**
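
For readers unfamiliar with checkpointed runs, the general idea behind "saves every 50K sentences, fully resumable" can be sketched as follows. This is a generic illustration, not the logic inside `batch_parse_corpus.py`: the function name, the JSON checkpoint file, and the JSONL output format are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(items, work, out_path, ckpt_path, every=50_000):
    """Process `items`, flushing results and progress to disk every `every` items.

    Returns the number of items skipped because an earlier run finished them.
    """
    out_path, ckpt_path = Path(out_path), Path(ckpt_path)
    done = json.loads(ckpt_path.read_text())["done"] if ckpt_path.exists() else 0
    buffer = []
    for index in range(done, len(items)):
        buffer.append(work(items[index]))
        if len(buffer) == every or index == len(items) - 1:
            # Append results first, then record progress. A crash between the
            # two writes would duplicate one batch on resume.
            with open(out_path, "a", encoding="utf-8") as out:
                out.writelines(json.dumps(result) + "\n" for result in buffer)
            ckpt_path.write_text(json.dumps({"done": index + 1}))
            buffer.clear()
    return done


with tempfile.TemporaryDirectory() as tmp:
    out, ckpt = Path(tmp) / "parsed.jsonl", Path(tmp) / "checkpoint.json"
    sentences = ["a b", "c d e"]
    run_with_checkpoints(sentences, str.split, out, ckpt, every=1)
    skipped = run_with_checkpoints(sentences, str.split, out, ckpt, every=1)
    print(f"Second run skipped {skipped} sentences")
```

Work done since the last checkpoint is lost on interruption, so the checkpoint interval trades disk writes against repeated work. At the ~120 sentences/sec rate assumed in this notebook, 50K sentences is roughly 7 minutes of parsing.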

In [None]:
%%time
# Full parse with chart parser (MOST ROBUST)
# With 10-worker parallelization: ~120 sent/sec
# This will take ~29 hours
# Checkpoints every 50K sentences (resumable with --resume flag)

import multiprocessing as mp

# Use 10 workers (leave 2 cores free for system)
num_workers = max(1, mp.cpu_count() - 2)
print(f"Total CPU cores: {mp.cpu_count()}")
print(f"Using {num_workers} worker processes (leaving 2 cores free)")
print(f"Expected speed: ~{num_workers * 12} sentences/sec")
print(f"Expected time: ~{12400000 / (num_workers * 12 * 3600):.0f} hours\n")

!python ../scripts/batch_parse_corpus.py \
    --corpus data/extracted_articles.txt \
    --parser-type chart \
    --max-sentences 12400000 \
    --batch-size 1000 \
    --checkpoint-every 50000 \
    --num-workers {num_workers} \
    --output-dir ../data/wikipedia_parsed \
    --resume

print("\n" + "="*70)
print("PARSING COMPLETE!")
print("="*70)

## 4b. Full Parse with Quantum Parser (ALTERNATIVE - Faster)

**Parser:** Quantum (faster, smart branching)

**Time:** ~20 hours for 12.4M sentences (10 workers on a 12-core machine)

**Use only if:** A test parse run with the quantum parser showed a >90% success rate

In [None]:
%%time
# Full parse with quantum parser (FASTER)
# With 10-worker parallelization: ~170 sent/sec
# This will take ~20 hours
# Checkpoints every 50K sentences (resumable)

import multiprocessing as mp

# Use 10 workers (leave 2 cores free for system)
num_workers = max(1, mp.cpu_count() - 2)
print(f"Total CPU cores: {mp.cpu_count()}")
print(f"Using {num_workers} worker processes (leaving 2 cores free)")
print(f"Expected speed: ~{num_workers * 17} sentences/sec")
print(f"Expected time: ~{12400000 / (num_workers * 17 * 3600):.0f} hours\n")

!python ../scripts/batch_parse_corpus.py \
    --corpus data/extracted_articles.txt \
    --parser-type quantum \
    --max-sentences 12400000 \
    --batch-size 1000 \
    --checkpoint-every 50000 \
    --num-workers {num_workers} \
    --output-dir ../data/wikipedia_parsed \
    --resume

print("\n" + "="*70)
print("PARSING COMPLETE!")
print("="*70)

## 5. Display Final Parse Statistics

In [None]:
import json
from pathlib import Path

# Check for full parse stats
stats_file = Path('../data/wikipedia_parsed/parse_stats.json')

if not stats_file.exists():
    # Fall back to test parse
    stats_file = Path('../data/wikipedia_parsed_100k/parse_stats.json')

if stats_file.exists():
    with open(stats_file, 'r') as f:
        stats = json.load(f)
    
    print("Final Parse Statistics:")
    print("="*70)
    print(f"Total sentences: {stats.get('total', 0):,}")
    print(f"Successful parses: {stats.get('success', 0):,}")
    print(f"Failed parses: {stats.get('failed', 0):,}")
    
    success_rate = stats.get('success', 0) / max(stats.get('total', 1), 1) * 100
    print(f"\nSuccess rate: {success_rate:.1f}%")
    print(f"Average parse score: {stats.get('avg_score', 0):.3f}")
    print(f"Total triples extracted: {stats.get('total_triples', 0):,}")
    
    avg_triples = stats.get('total_triples', 0) / max(stats.get('success', 1), 1)
    print(f"Average triples per sentence: {avg_triples:.1f}")
    
    print("\n" + "="*70)
    
    # Show output files
    output_dir = Path('../data/wikipedia_parsed')
    if output_dir.exists():
        print("\nOutput files:")
        for file in output_dir.glob('*'):
            size_mb = file.stat().st_size / 1e6
            print(f"  {file.name}: {size_mb:.1f} MB")
    
else:
    print("No statistics file found - has parsing been run?")

## 6. Next Steps: Upload to Google Drive for A100 Training

**You have successfully parsed Wikipedia locally! 🎉**

### Upload to Google Drive:

1. **Open Google Drive** in your browser
2. **Navigate to:** `My Drive/NAOMI-II-data/`
3. **Upload these files:**
   - `data/wikipedia_parsed/parsed_corpus.pkl` (large file, ~10-20GB)
   - `data/wikipedia_parsed/parse_stats.json` (small)

4. **Rename in Drive:**
   - `parsed_corpus.pkl` → `wikipedia_parsed_corpus.pkl`
   - `parse_stats.json` → `wikipedia_parse_stats.json`

### Then run A100 training:

1. Open `colab-results/NAOMI_A100_Training.ipynb` in Google Colab
2. Switch to **A100 GPU** runtime (7 credits/hour)
3. Run the training notebook (~6 hours, $4.50)

### Total Project Cost:
- **Parsing (local)**: $0.00 ✓
- **Training (A100)**: $4.50
- **TOTAL**: **$4.50** (vs $33 if you had parsed on Colab!)

---

**Alternative: Use command line to upload to Drive**

If you have Google Drive desktop app installed, you can just copy the files:

```bash
# Copy to Google Drive from the NAOMI-II repository root (adjust the drive path for your system)
cp data/wikipedia_parsed/parsed_corpus.pkl "G:\My Drive\NAOMI-II-data\wikipedia_parsed_corpus.pkl"
cp data/wikipedia_parsed/parse_stats.json "G:\My Drive\NAOMI-II-data\wikipedia_parse_stats.json"
```
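
If `cp` is not available in your shell, the same copy and rename can be done from Python, with a size check afterwards since the corpus file is large. A sketch under assumptions: the `G:/My Drive/NAOMI-II-data` location mirrors the example above and will differ on your machine, and `copy_and_verify` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def copy_and_verify(src, dest):
    """Copy `src` to `dest` and confirm that the byte sizes match."""
    src, dest = Path(src), Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    if dest.stat().st_size != src.stat().st_size:
        raise OSError(f"Size mismatch after copying {src} to {dest}")
    return dest


# Assumed Drive folder; adjust for your system.
drive_dir = Path("G:/My Drive/NAOMI-II-data")
uploads = {
    "data/wikipedia_parsed/parsed_corpus.pkl": "wikipedia_parsed_corpus.pkl",
    "data/wikipedia_parsed/parse_stats.json": "wikipedia_parse_stats.json",
}
for src, new_name in uploads.items():
    if Path(src).exists():
        print("Copied to", copy_and_verify(src, drive_dir / new_name))
```

A matching size does not prove the Drive app has finished syncing to the cloud, so confirm the sync status before opening the Colab notebook.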