# Puyi Dataset - Web Scraping and LLaMA Factory Formatting

This notebook scrapes text about Puyi, the last emperor of China, from six web pages, removes duplicate and near-duplicate passages, and formats the result for LLaMA Factory training.

**Output Files:**
- `puyi_llama_factory_detailed.json` - Alpaca format dataset
- `puyi_llama_factory_detailed.csv` - CSV format dataset

In [58]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import re
import json
import time
import pandas as pd

In [59]:
# URLs to scrape
urls = [
    "https://www.chinahighlights.com/travelguide/china-history/puyi.htm",
    "https://www.historyhit.com/puyi-last-emperor-of-china/",
    "https://www.pacificatrocities.org/blog/prince-puyi-chinas-last-dynasty",
    "https://www.thinkchina.sg/history/photo-story-puyi-last-emperor-china",
    "https://biographics.org/puyi-the-last-emperor-of-china/",
    "https://www.thoughtco.com/puyi-chinas-last-emperor-195612"
]

In [60]:
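# Fetch a URL and return its parsed HTML, or None if the request fails
def fetch_page(url):
    # Browser-like User-Agent; some sites reject the default requests one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Treat HTTP 4xx/5xx responses as errors
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    # The 'lxml' parser requires the lxml package to be installed
    return BeautifulSoup(response.content, 'lxml')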

# Extract visible text chunks from a parsed page
def extract_text(soup):
    if not soup:
        return []
    
    # Remove non-content elements (scripts, styles, navigation, footers, headers)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    
    # Get text
    text = soup.get_text()
    
    # Break into lines and strip leading/trailing whitespace
    lines = (line.strip() for line in text.splitlines())
    
    # Split each line on double spaces so run-together items become separate chunks
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    
    # Keep only chunks longer than 20 characters (this also drops blank ones)
    text_chunks = [chunk for chunk in chunks if len(chunk) > 20]
    
    return text_chunks
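
The chunking steps in `extract_text` can be tried without any network access. The following is a minimal sketch that applies the same three steps to a made-up string; the helper name `chunk_text`, the `min_length` parameter, and the sample text are illustrative and not part of the notebook.

```python
def chunk_text(text, min_length=20):
    """Split raw page text into chunks, mirroring the steps in extract_text."""
    # One candidate per line, with surrounding whitespace removed
    lines = (line.strip() for line in text.splitlines())
    # Double spaces often separate items that were adjacent in the HTML
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only chunks longer than min_length characters
    return [chunk for chunk in chunks if len(chunk) > min_length]


# Made-up page text: a navigation row, a blank line, body text, and a link label
sample = (
    "  Home  About  Contact  \n"
    "\n"
    "Puyi was the last emperor of China.  He abdicated in 1912.\n"
    "Read more\n"
)

chunks = chunk_text(sample)
for chunk in chunks:
    print(chunk)
```

Only the two sentences survive; the navigation labels and "Read more" fall under the 20-character cutoff.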

In [61]:
# Fetch content from all URLs
print("Fetching pages...")
all_content = {}

for url in urls:
    print(f"Fetching: {url}")
    soup = fetch_page(url)
    if soup:
        content = extract_text(soup)
        all_content[url] = content
        print(f"  Extracted {len(content)} text chunks")
    time.sleep(1)  # Be respectful to servers
    
print(f"\nTotal URLs processed: {len(all_content)}")

Fetching pages...
Fetching: https://www.chinahighlights.com/travelguide/china-history/puyi.htm
  Extracted 136 text chunks
Fetching: https://www.historyhit.com/puyi-last-emperor-of-china/
  Extracted 57 text chunks
Fetching: https://www.pacificatrocities.org/blog/prince-puyi-chinas-last-dynasty
  Extracted 392 text chunks
Fetching: https://www.thinkchina.sg/history/photo-story-puyi-last-emperor-china
  Extracted 2 text chunks
Fetching: https://biographics.org/puyi-the-last-emperor-of-china/
  Extracted 109 text chunks
Fetching: https://www.thoughtco.com/puyi-chinas-last-emperor-195612
  Extracted 1 text chunks

Total URLs processed: 6


In [62]:
# Function to normalize text for comparison
def normalize_text(text):
    text = text.lower()
    text = ' '.join(text.split())
    text = re.sub(r'[^\w\s.,;:!?-]', '', text)
    return text

# Function to check similarity
def is_similar(text1, text2, threshold=0.85):
    """Check if two texts are similar based on word overlap (Jaccard similarity)"""
    words1 = set(normalize_text(text1).split())
    words2 = set(normalize_text(text2).split())
    
    if not words1 or not words2:
        return False
    
    intersection = len(words1.intersection(words2))
    union = len(words1.union(words2))
    similarity = intersection / union if union > 0 else 0
    
    return similarity >= threshold
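
To make the 0.85 threshold concrete, here is a minimal, self-contained sketch of the same word-overlap (Jaccard) measure. The helper name `jaccard` and the sample sentences are illustrative and not part of the notebook; the normalization mirrors `normalize_text` above.

```python
import re


def normalize_text(text):
    """Lowercase, collapse whitespace, and drop unusual symbols."""
    text = " ".join(text.lower().split())
    return re.sub(r"[^\w\s.,;:!?-]", "", text)


def jaccard(text1, text2):
    """Return |A & B| / |A | B| over the sets of whitespace-separated tokens."""
    words1 = set(normalize_text(text1).split())
    words2 = set(normalize_text(text2).split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


a = "Puyi was the last emperor of China"
b = "Puyi was the last emperor of China and Manchukuo"
c = "Puyi became emperor at the age of two"

score_ab = jaccard(a, b)
score_ac = jaccard(a, c)
print(f"a vs b: {score_ab:.2f}")
print(f"a vs c: {score_ac:.2f}")
```

With these inputs, `a` and `b` share 7 of 9 distinct words (about 0.78) and `a` and `c` share 4 of 11 (about 0.36). Neither pair reaches 0.85, so short sentences must be almost identical before the notebook treats them as duplicates.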

In [63]:
# Extract unique content (remove duplicates)
unique_content = []
seen_normalized = set()

print("Extracting unique content...")
for url, chunks in all_content.items():
    print(f"\nProcessing: {url}")
    unique_from_url = 0
    
    for chunk in chunks:
        # Skip very short or very long chunks
        if len(chunk) < 30 or len(chunk) > 1000:
            continue
            
        normalized = normalize_text(chunk)
        
        # Check if we've seen this exact text before
        if normalized in seen_normalized:
            continue
        
        # Check if similar to any existing unique content
        is_duplicate = False
        for existing in unique_content:
            if is_similar(chunk, existing['text'], threshold=0.85):
                is_duplicate = True
                break
        
        if not is_duplicate:
            unique_content.append({
                'text': chunk,
                'source': url
            })
            seen_normalized.add(normalized)
            unique_from_url += 1
    
    print(f"  Unique chunks found: {unique_from_url}")

print(f"\nTotal unique content pieces: {len(unique_content)}")

Extracting unique content...

Processing: https://www.chinahighlights.com/travelguide/china-history/puyi.htm
  Unique chunks found: 109

Processing: https://www.historyhit.com/puyi-last-emperor-of-china/
  Unique chunks found: 50

Processing: https://www.pacificatrocities.org/blog/prince-puyi-chinas-last-dynasty
  Unique chunks found: 132

Processing: https://www.thinkchina.sg/history/photo-story-puyi-last-emperor-china
  Unique chunks found: 1

Processing: https://biographics.org/puyi-the-last-emperor-of-china/
  Unique chunks found: 103

Processing: https://www.thoughtco.com/puyi-chinas-last-emperor-195612
  Unique chunks found: 1

Total unique content pieces: 396


In [64]:
# Instruction templates for LLaMA Factory format
instruction_templates = [
    "Tell me about Puyi, the Last Emperor of China.",
    "What do you know about Puyi?",
    "Provide information about the Last Emperor of China.",
    "Explain who Puyi was.",
    "Share facts about Emperor Puyi.",
    "Describe Puyi's life and reign.",
    "What can you tell me about the Last Emperor?",
    "Give me details about Puyi.",
    "Who was Puyi?",
    "What is the story of China's last emperor?",
    "Explain the life of Puyi.",
    "Tell me about China's last emperor.",
    "What are some important facts about Puyi?",
    "Describe the Last Emperor of China.",
    "Share historical information about Puyi."
]

In [65]:
# Convert to LLaMA Factory format (instruction-input-output-source)
llama_factory_data = []
instruction_idx = 0

for item in unique_content:
    # Skip very short entries (likely headers)
    if len(item['text']) < 50:
        continue
    
    # Rotate through instruction templates
    instruction = instruction_templates[instruction_idx % len(instruction_templates)]
    instruction_idx += 1
    
    llama_factory_data.append({
        "instruction": instruction,
        "input": "",
        "output": item['text'],
        "source": item['source']
    })

print(f"Created {len(llama_factory_data)} instruction-output pairs")
print(f"Filtered out {len(unique_content) - len(llama_factory_data)} short entries")

Created 259 instruction-output pairs
Filtered out 137 short entries
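
Because templates are assigned by `instruction_idx % len(instruction_templates)`, the 259 entries are spread almost evenly over the 15 templates. A short sketch of that arithmetic, using only the counts reported above (the variable names are illustrative):

```python
from collections import Counter

num_entries = 259    # instruction-output pairs created above
num_templates = 15   # number of instruction templates

# Template index assigned to each entry, in order
usage = Counter(i % num_templates for i in range(num_entries))

full_rounds, remainder = divmod(num_entries, num_templates)
print(f"Each template is used at least {full_rounds} times")
print(f"The first {remainder} templates are used {full_rounds + 1} times")
```

Since 259 = 17 × 15 + 4, the first four templates appear 18 times each and the remaining eleven appear 17 times each.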


In [66]:
# Save as JSON
json_filename = 'puyi_llama_factory_detailed.json'
with open(json_filename, 'w', encoding='utf-8') as f:
    json.dump(llama_factory_data, f, indent=2, ensure_ascii=False)

print(f"✅ Saved: {json_filename}")
print(f"📊 Total entries: {len(llama_factory_data)}")

✅ Saved: puyi_llama_factory_detailed.json
📊 Total entries: 259


In [67]:
# Convert to DataFrame and save as CSV
df = pd.DataFrame(llama_factory_data)

csv_filename = 'puyi_llama_factory_detailed.csv'
df.to_csv(csv_filename, index=False, encoding='utf-8')

print(f"✅ Saved: {csv_filename}")
print(f"📊 Total rows: {len(df)}")
print(f"📋 Columns: {', '.join(df.columns.tolist())}")

✅ Saved: puyi_llama_factory_detailed.csv
📊 Total rows: 259
📋 Columns: instruction, input, output, source


In [68]:
# Display sample entries
print("\n" + "="*80)
print("SAMPLE ENTRIES")
print("="*80)

for i in range(min(3, len(llama_factory_data))):
    print(f"\n--- Entry {i+1} ---")
    print(json.dumps(llama_factory_data[i], indent=2, ensure_ascii=False))


SAMPLE ENTRIES

--- Entry 1 ---
{
  "instruction": "Tell me about Puyi, the Last Emperor of China.",
  "input": "",
  "output": "10 Facts about Puyi You Didn't Know, the Last Emperor of China",
  "source": "https://www.chinahighlights.com/travelguide/china-history/puyi.htm"
}

--- Entry 2 ---
{
  "instruction": "What do you know about Puyi?",
  "input": "",
  "output": "Puyi was the last emperor of China. His life was full of ups and downs. He led a special life in China's turbulent era of change — from emperor to citizen. The following facts will help you better understand the Last Emperor.",
  "source": "https://www.chinahighlights.com/travelguide/china-history/puyi.htm"
}

--- Entry 3 ---
{
  "instruction": "Provide information about the Last Emperor of China.",
  "input": "",
  "output": "1. Puyi was the only emperor to be enthroned 3 times.",
  "source": "https://www.chinahighlights.com/travelguide/china-history/puyi.htm"
}


In [69]:
# Dataset statistics
from collections import Counter

source_counts = Counter([entry['source'] for entry in llama_factory_data])
instruction_counts = Counter([entry['instruction'] for entry in llama_factory_data])

print("\n" + "="*80)
print("DATASET STATISTICS")
print("="*80)
print(f"Total entries: {len(llama_factory_data)}")
print(f"Unique instructions: {len(instruction_counts)}")
print(f"Unique sources: {len(source_counts)}")
print(f"Average output length: {df['output'].str.len().mean():.0f} characters")
print(f"Min output length: {df['output'].str.len().min()} characters")
print(f"Max output length: {df['output'].str.len().max()} characters")

print("\n" + "="*80)
print("SOURCE DISTRIBUTION")
print("="*80)
for source, count in source_counts.items():
    print(f"{count:4d} entries from: {source}")


DATASET STATISTICS
Total entries: 259
Unique instructions: 15
Unique sources: 4
Average output length: 183 characters
Min output length: 50 characters
Max output length: 987 characters

SOURCE DISTRIBUTION
  76 entries from: https://www.chinahighlights.com/travelguide/china-history/puyi.htm
  43 entries from: https://www.historyhit.com/puyi-last-emperor-of-china/
  44 entries from: https://www.pacificatrocities.org/blog/prince-puyi-chinas-last-dynasty
  96 entries from: https://biographics.org/puyi-the-last-emperor-of-china/


## ✅ Dataset Generation Complete!

### Output Files:
1. **`puyi_llama_factory_detailed.json`** - LLaMA Factory Alpaca format
2. **`puyi_llama_factory_detailed.csv`** - CSV format for easy viewing

### Format Structure:
```json
{
  "instruction": "Tell me about Puyi, the Last Emperor of China.",
  "input": "",
  "output": "Detailed information about Puyi...",
  "source": "https://www.example.com/..."
}
```
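
Before training, it can help to confirm that every record follows this structure. The following is a minimal sketch, not part of the notebook: `validate_entries` and `REQUIRED_KEYS` are hypothetical names, and the second sample record is deliberately broken to show what a failure looks like.

```python
REQUIRED_KEYS = ("instruction", "input", "output", "source")


def validate_entries(entries):
    """Return (index, problem) tuples; an empty list means every entry passed."""
    problems = []
    for i, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
            continue
        non_strings = [key for key in REQUIRED_KEYS if not isinstance(entry[key], str)]
        if non_strings:
            problems.append((i, f"non-string values: {non_strings}"))
            continue
        # "input" may be empty, but a training pair needs a prompt and a response
        if not entry["instruction"].strip() or not entry["output"].strip():
            problems.append((i, "empty instruction or output"))
    return problems


# Illustrative records; in practice, load the list with json.load() from the JSON file
sample = [
    {
        "instruction": "Who was Puyi?",
        "input": "",
        "output": "Puyi was the last emperor of China.",
        "source": "https://www.example.com/...",
    },
    {"instruction": "Who was Puyi?", "output": ""},
]

problems = validate_entries(sample)
print(problems)
```

The first record passes; the second is reported because it lacks the `input` and `source` keys.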

### Key Features:
- ✅ Unique content only (exact and near-duplicate chunks removed)
- ✅ 15 instruction templates, assigned in rotation
- ✅ Source URL preserved for each entry
- ✅ Ready for LLaMA Factory training
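
To use the JSON file in LLaMA Factory, the dataset typically has to be registered first. The sketch below builds such a registry entry, assuming the `dataset_info.json` layout described in the LLaMA Factory data documentation (`file_name` plus a `columns` mapping of `prompt`, `query`, and `response`). The dataset key `puyi_detailed` is a made-up name, and the key names should be checked against the LLaMA Factory version in use.

```python
import json

# Hypothetical dataset key; any unique name works
dataset_name = "puyi_detailed"

# Assumed registry layout: file name plus a mapping from LLaMA Factory's
# roles (prompt, query, response) to the column names in this dataset
dataset_info_entry = {
    dataset_name: {
        "file_name": "puyi_llama_factory_detailed.json",
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
}

print(json.dumps(dataset_info_entry, indent=2))
```

The printed object would be merged into the existing `dataset_info.json` rather than replacing it. The `source` field is not mapped here, so under this assumption it serves as provenance metadata only.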