# UMAP Visualization of Benjamin Franklin's Autobiography

This notebook creates a UMAP visualization of events extracted from Benjamin Franklin's autobiography using:
1. W5H (Who, What, When, Where, Why, How) event extraction
2. Sentence transformers for embeddings
3. UMAP for dimensionality reduction
4. Interactive visualization

## 1. Setup and Imports

In [None]:
# Packages should already be installed. If not, run in terminal:
# uv pip install requests beautifulsoup4 spacy sentence-transformers umap-learn plotly pandas numpy tqdm scikit-learn
# uv pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl

print("Packages should be installed. If you see import errors, run the commands above in terminal.")

In [None]:
import requests
import re
import pandas as pd
import numpy as np
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# NLP libraries
import spacy
from sentence_transformers import SentenceTransformer

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Dimensionality reduction
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Progress tracking
from tqdm import tqdm
tqdm.pandas()

# Load spaCy model - use the installed package
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.max_length = 2000000  # Increase max length for processing

## 2. Data Acquisition

In [None]:
def download_franklin_autobiography() -> str:
    """Download Benjamin Franklin's autobiography from Project Gutenberg."""
    # Updated URL - using the plain text version
    url = "https://www.gutenberg.org/files/148/148-0.txt"
    
    print("Downloading Benjamin Franklin's autobiography...")
    response = requests.get(url, timeout=30)  # Avoid hanging indefinitely on a stalled connection
    
    if response.status_code == 200:
        text = response.text
        print(f"Downloaded {len(text)} characters")
        return text
    else:
        raise Exception(f"Failed to download: {response.status_code}")

# Download the text
raw_text = download_franklin_autobiography()

In [None]:
def clean_gutenberg_text(text: str) -> str:
    """Clean Project Gutenberg text by removing headers, footers, and extra whitespace."""
    
    # Find the actual content boundaries
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"
    
    start_idx = text.find(start_marker)
    if start_idx != -1:
        start_idx = text.find('\n', start_idx) + 1
        text = text[start_idx:]
    
    end_idx = text.find(end_marker)
    if end_idx != -1:
        text = text[:end_idx]
    
    # Clean up formatting
    text = re.sub(r'\[Illustration[^\]]*\]', '', text)  # Remove illustration tags
    text = re.sub(r'\[\d+\]', '', text)  # Remove footnote references
    text = re.sub(r'\n{3,}', '\n\n', text)  # Normalize paragraph breaks
    text = re.sub(r' {2,}', ' ', text)  # Remove extra spaces
    
    return text.strip()

# Clean the text
cleaned_text = clean_gutenberg_text(raw_text)
print(f"Cleaned text: {len(cleaned_text)} characters")
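
Project Gutenberg files do not all use the same marker wording: some read `*** START OF THE PROJECT GUTENBERG EBOOK`, others `*** START OF THIS PROJECT GUTENBERG EBOOK`. The sketch below shows one way to tolerate both with a regular expression. The helper name `strip_gutenberg_boilerplate` and the sample text are made up for illustration and are not part of the cleaning function above.

```python
import re

# Hypothetical helper: accept "THE" or "THIS" in the Gutenberg markers.
_START = re.compile(
    r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n",
    re.IGNORECASE,
)
_END = re.compile(
    r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)

def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers, if they are present."""
    start = _START.search(text)
    if start:
        text = text[start.end():]
    end = _END.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()

# Made-up sample in the older "THIS" style
sample = (
    "License header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body text of the book.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If neither marker is found, the text is returned unchanged apart from the final `strip()`, which matches the fallback behaviour of `clean_gutenberg_text`.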

In [None]:
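def split_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into chunks of whole paragraphs, targeting at most chunk_size words each.

    Paragraphs are never split, so a single paragraph longer than chunk_size
    becomes one oversized chunk.
    """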
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_size = 0
    
    for paragraph in paragraphs:
        para_size = len(paragraph.split())
        
        if current_size + para_size > chunk_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            current_chunk = [paragraph]
            current_size = para_size
        else:
            current_chunk.append(paragraph)
            current_size += para_size
    
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

# Split into manageable chunks
text_chunks = split_into_chunks(cleaned_text, chunk_size=500)
print(f"Split into {len(text_chunks)} chunks")
print(f"Average chunk size: {np.mean([len(c.split()) for c in text_chunks]):.0f} words")
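
Because `chunk_size` counts whitespace-separated words and paragraphs are kept whole, a single very long paragraph ends up as one oversized chunk. If that matters, one option is to pre-split long paragraphs before chunking. A minimal sketch, where `split_long_paragraph` is a hypothetical helper and not part of the notebook:

```python
from typing import List

def split_long_paragraph(paragraph: str, max_words: int) -> List[str]:
    """Break one paragraph into consecutive pieces of at most max_words words."""
    words = paragraph.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# Made-up 12-word paragraph: w0 w1 ... w11
long_paragraph = " ".join(f"w{i}" for i in range(12))
pieces = split_long_paragraph(long_paragraph, max_words=5)
piece_sizes = [len(p.split()) for p in pieces]
print(piece_sizes)
```

Each piece could then be passed to `split_into_chunks` in place of the original paragraph. Splitting on word counts can cut a sentence in half, so a sentence-aware split may be preferable for event extraction.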

## 3. W5H Event Extraction

In [None]:
@dataclass
class W5HEvent:
    """Represents an event extracted from text with W5H components."""
    who: List[str] = field(default_factory=list)
    what: str = ""
    when: List[str] = field(default_factory=list)
    where: List[str] = field(default_factory=list)
    why: str = ""
    how: str = ""
    original_text: str = ""
    chunk_index: int = 0
    
    def to_sentence(self) -> str:
        """Convert W5H event to a natural language sentence."""
        parts = []
        
        # Who
        if self.who:
            who_str = ', '.join(self.who[:2])  # Limit to 2 main entities
            parts.append(who_str)
        else:
            parts.append("Someone")
        
        # What
        if self.what:
            parts.append(self.what)
        else:
            parts.append("did something")
        
        # Where
        if self.where:
            parts.append(f"in {', '.join(self.where[:1])}")
        
        # When
        if self.when:
            parts.append(f"during {', '.join(self.when[:1])}")
        
        # Why
        if self.why:
            parts.append(f"because {self.why}")
        
        # How
        if self.how:
            parts.append(f"by {self.how}")
        
        return ' '.join(parts)
    
    def to_dict(self) -> dict:
        """Convert to dictionary for DataFrame."""
        return {
            'who': ', '.join(self.who),
            'what': self.what,
            'when': ', '.join(self.when),
            'where': ', '.join(self.where),
            'why': self.why,
            'how': self.how,
            'sentence': self.to_sentence(),
            'original_text': self.original_text[:200] + '...' if len(self.original_text) > 200 else self.original_text,
            'chunk_index': self.chunk_index
        }

In [None]:
def extract_w5h_events(text: str, chunk_index: int = 0) -> List[W5HEvent]:
    """Extract W5H events from a text chunk using spaCy."""
    doc = nlp(text)
    events = []
    
    # Process each sentence
    for sent in doc.sents:
        event = W5HEvent(original_text=sent.text.strip(), chunk_index=chunk_index)
        
        # Extract WHO, WHERE, and WHEN from named entities
        for ent in sent.ents:
            if ent.label_ == "PERSON":
                event.who.append(ent.text)
            elif ent.label_ in ["GPE", "LOC", "FAC"]:
                event.where.append(ent.text)
            elif ent.label_ in ["DATE", "TIME"]:
                event.when.append(ent.text)
        
        # Extract WHAT (main verb and object)
        for token in sent:
            if token.pos_ == "VERB" and token.dep_ == "ROOT":
                # Get the verb and its direct objects
                what_parts = [token.text]
                for child in token.children:
                    if child.dep_ in ["dobj", "attr", "xcomp"]:
                        what_parts.append(child.text)
                event.what = ' '.join(what_parts)
                break
        
        # Extract WHY (causal indicators, matched as whole words so that
        # "as" does not fire inside "was" or "for" inside "before")
        why_indicators = ["because", "since", "as", "due to", "owing to", "for"]
        for indicator in why_indicators:
            match = re.search(rf"\b{re.escape(indicator)}\b", sent.text, flags=re.IGNORECASE)
            if match:
                # Get the text after the indicator
                why_text = sent.text[match.end():].strip()
                # Clean it up and take first clause
                why_text = why_text.split(',')[0].split('.')[0].strip()
                if len(why_text) > 5:  # Minimum meaningful length
                    event.why = why_text[:100]  # Limit length
                    break
        
        # Extract HOW (manner adverbs and phrases)
        for token in sent:
            if token.pos_ == "ADV" and token.dep_ in ["advmod", "amod"]:
                # Check if it's a manner adverb (ends in -ly)
                if token.text.endswith('ly'):
                    event.how = token.text
                    break
            elif token.text.lower() in ["by", "through", "with", "using"]:
                # Get the phrase after these prepositions
                how_parts = []
                for child in token.children:
                    if child.dep_ == "pobj":
                        how_parts.append(child.text)
                if how_parts:
                    event.how = ' '.join(how_parts)
                    break
        
        # Only add events that have at least WHO or WHAT
        if event.who or event.what:
            events.append(event)
    
    return events
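
Short causal indicators such as "as" and "for" are easy to over-match. The toy comparison below contrasts a plain substring test with a whole-word regular expression; the example sentences are made up for illustration.

```python
import re

sentence = "He was formerly a printer in London."

# Substring test: "as" is found inside "was", "for" inside "formerly".
naive_hits = [w for w in ["because", "as", "for"] if w in sentence.lower()]

# Whole-word test using \b word boundaries.
pattern = re.compile(r"\b(because|as|for)\b", re.IGNORECASE)
whole_word_hit = pattern.search(sentence)

# A sentence with a real causal marker still matches.
causal_hit = pattern.search("I left because the pay was poor.")

print(naive_hits)
print(whole_word_hit)
print(causal_hit.group(1))
```

Even as whole words, "as" and "for" are often not causal ("as a printer", "for London"), so the WHY field remains a rough heuristic either way.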

In [None]:
# Extract events from all chunks
all_events = []

print("Extracting W5H events from text chunks...")
for i, chunk in enumerate(tqdm(text_chunks)):
    chunk_events = extract_w5h_events(chunk, chunk_index=i)
    all_events.extend(chunk_events)

print(f"\nExtracted {len(all_events)} events")

# Convert to DataFrame
events_df = pd.DataFrame([event.to_dict() for event in all_events])
print(f"\nDataFrame shape: {events_df.shape}")
events_df.head()

In [None]:
# Analyze event extraction quality
print("Event Extraction Statistics:")
print("="*50)
print(f"Total events: {len(events_df)}")
print(f"\nField coverage:")
for col in ['who', 'what', 'when', 'where', 'why', 'how']:
    non_empty = (events_df[col] != '').sum()
    percentage = (non_empty / len(events_df)) * 100
    print(f"  {col.upper():6s}: {non_empty:5d} ({percentage:.1f}%)")

# Sample events with good coverage
print("\nSample events with multiple W5H fields:")
print("="*50)
# Count non-empty fields for each event
events_df['field_count'] = events_df[['who', 'what', 'when', 'where', 'why', 'how']].apply(
    lambda x: (x != '').sum(), axis=1
)
rich_events = events_df[events_df['field_count'] >= 3].head(3)
for idx, row in rich_events.iterrows():
    print(f"\nEvent {idx}:")
    print(f"Sentence: {row['sentence']}")
    print(f"Original: {row['original_text'][:100]}...")

## 4. Sentence Embeddings

In [None]:
# Load sentence transformer model
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Model loaded: {model}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Generate embeddings for event sentences
print("Generating sentence embeddings...")
sentences = events_df['sentence'].tolist()

# Generate embeddings in batches
embeddings = model.encode(sentences, show_progress_bar=True, batch_size=32)

print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding shape: {embeddings.shape}")
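
The UMAP step that follows compares embeddings with `metric='cosine'`, which depends on the angle between vectors and ignores their length. A toy illustration in plain Python, using made-up 3-dimensional vectors in place of the 384-dimensional `all-MiniLM-L6-v2` embeddings:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1, twice the length
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

sim_same_direction = cosine_similarity(v1, v2)   # close to 1
sim_orthogonal = cosine_similarity(v1, v3)       # 0
print(sim_same_direction, sim_orthogonal)
```

Cosine distance is one minus this similarity, so two event sentences count as close when their embeddings point in nearly the same direction.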

## 5. UMAP Dimensionality Reduction

In [None]:
# Configure UMAP
print("Applying UMAP dimensionality reduction...")

# UMAP parameters
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.1,
    n_components=2,
    metric='cosine',
    random_state=42
)

# Fit and transform
embeddings_2d = reducer.fit_transform(embeddings)

print(f"Reduced to shape: {embeddings_2d.shape}")

# Add UMAP coordinates to dataframe
events_df['umap_x'] = embeddings_2d[:, 0]
events_df['umap_y'] = embeddings_2d[:, 1]

In [None]:
# Optional: Add 3D UMAP for richer visualization
print("Generating 3D UMAP...")

reducer_3d = umap.UMAP(
    n_neighbors=30,
    min_dist=0.1,
    n_components=3,
    metric='cosine',
    random_state=42
)

embeddings_3d = reducer_3d.fit_transform(embeddings)
events_df['umap_x_3d'] = embeddings_3d[:, 0]
events_df['umap_y_3d'] = embeddings_3d[:, 1]
events_df['umap_z_3d'] = embeddings_3d[:, 2]

print("3D UMAP complete")

In [None]:
# Cluster detection using DBSCAN
print("Detecting clusters...")

clustering = DBSCAN(eps=0.5, min_samples=10)
events_df['cluster'] = clustering.fit_predict(embeddings_2d)

# Note: `-1 in series` would test the index labels, so check the values instead
n_clusters = len(set(events_df['cluster'])) - (1 if -1 in events_df['cluster'].values else 0)
n_noise = list(events_df['cluster']).count(-1)

print(f"Found {n_clusters} clusters")
print(f"Noise points: {n_noise}")
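
DBSCAN marks noise points with the label `-1`, so the cluster count should leave that label out. With a pandas Series, `-1 in series` tests the index labels rather than the values, so it is safer to test membership on the values or on a set. The counting logic in plain Python, with a made-up label list:

```python
from collections import Counter

labels = [0, 0, 1, -1, 1, 2, -1, 0]  # made-up DBSCAN output; -1 marks noise

unique_labels = set(labels)
n_clusters = len(unique_labels - {-1})   # distinct labels, excluding noise
n_noise = labels.count(-1)
cluster_sizes = Counter(label for label in labels if label != -1)

print(n_clusters, n_noise, dict(cluster_sizes))
```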

## 6. Interactive Visualization

In [None]:
# Create color mapping for temporal progression
events_df['color_temporal'] = events_df['chunk_index']

# Create labels for hover text
events_df['hover_text'] = events_df.apply(
    lambda x: f"<b>Who:</b> {x['who']}<br>"
              f"<b>What:</b> {x['what']}<br>"
              f"<b>Where:</b> {x['where']}<br>"
              f"<b>When:</b> {x['when']}<br>"
              f"<b>Why:</b> {x['why']}<br>"
              f"<b>How:</b> {x['how']}<br>"
              f"<b>Chunk:</b> {x['chunk_index']}<br>"
              f"<b>Cluster:</b> {x['cluster']}",
    axis=1
)

In [None]:
# 2D Interactive Visualization
fig_2d = px.scatter(
    events_df,
    x='umap_x',
    y='umap_y',
    color='color_temporal',
    custom_data=['hover_text', 'sentence'],
    title="Benjamin Franklin's Autobiography: Event Manifold (2D UMAP)",
    labels={'color_temporal': 'Chronological Order', 'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2'},
    color_continuous_scale='Viridis',
    width=1000,
    height=700
)

# Update hover template (customdata[0] = hover_text, customdata[1] = sentence)
fig_2d.update_traces(
    hovertemplate='%{customdata[0]}<br><b>Sentence:</b> %{customdata[1]}<extra></extra>',
    marker=dict(size=6, opacity=0.7)
)

fig_2d.update_layout(
    hovermode='closest',
    font=dict(size=12),
    plot_bgcolor='white',
    paper_bgcolor='white'
)

fig_2d.show()

In [None]:
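# 3D Interactive Visualization
# Cast cluster labels to strings so Plotly treats them as categories and
# applies the discrete color sequence (integer labels get a continuous scale)
events_df['cluster_label'] = events_df['cluster'].astype(str)

fig_3d = px.scatter_3d(
    events_df,
    x='umap_x_3d',
    y='umap_y_3d',
    z='umap_z_3d',
    color='cluster_label',
    hover_data=['sentence', 'who', 'what', 'where', 'when'],
    title="Benjamin Franklin's Autobiography: Event Manifold (3D UMAP with Clusters)",
    labels={'cluster_label': 'Event Cluster'},
    width=1000,
    height=700,
    color_discrete_sequence=px.colors.qualitative.Set3
)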

fig_3d.update_traces(
    marker=dict(size=4, opacity=0.8),
    selector=dict(mode='markers')
)

fig_3d.update_layout(
    scene=dict(
        xaxis_title='UMAP 1',
        yaxis_title='UMAP 2',
        zaxis_title='UMAP 3',
        bgcolor='white'
    ),
    font=dict(size=12)
)

fig_3d.show()

In [None]:
# Create a density heatmap
fig_density = px.density_heatmap(
    events_df,
    x='umap_x',
    y='umap_y',
    nbinsx=50,
    nbinsy=50,
    title="Event Density in Benjamin Franklin's Autobiography",
    labels={'umap_x': 'UMAP 1', 'umap_y': 'UMAP 2'},
    color_continuous_scale='Blues',
    width=900,
    height=700
)

fig_density.update_layout(
    font=dict(size=12),
    plot_bgcolor='white',
    paper_bgcolor='white'
)

fig_density.show()

## 7. Analysis and Insights

In [None]:
# Analyze clusters
from collections import Counter

print("Cluster Analysis")
print("="*50)

for cluster_id in sorted(events_df['cluster'].unique()):
    if cluster_id == -1:
        continue
    
    cluster_events = events_df[events_df['cluster'] == cluster_id]
    print(f"\nCluster {cluster_id} ({len(cluster_events)} events):")
    
    # Find most common entities. Rows hold comma-separated names, so join
    # with ', ' (not ' ') to keep names from adjacent rows separate.
    who_list = ', '.join(cluster_events['who'].values).split(', ')
    who_list = [w for w in who_list if w]
    
    if who_list:
        top_who = Counter(who_list).most_common(3)
        print(f"  Top people: {', '.join([f'{person} ({count})' for person, count in top_who])}")
    
    where_list = ', '.join(cluster_events['where'].values).split(', ')
    where_list = [w for w in where_list if w]
    
    if where_list:
        top_where = Counter(where_list).most_common(3)
        print(f"  Top places: {', '.join([f'{place} ({count})' for place, count in top_where])}")
    
    # Sample sentences
    sample_sentences = cluster_events['sentence'].head(2).values
    print(f"  Sample events:")
    for sent in sample_sentences:
        print(f"    - {sent[:100]}...")

In [None]:
# Find key figures in the autobiography
print("Key Figures in Benjamin Franklin's Autobiography")
print("="*50)

all_people = []
for who_str in events_df['who']:
    if who_str:
        all_people.extend(who_str.split(', '))

from collections import Counter
people_counts = Counter(all_people)
top_people = people_counts.most_common(15)

for person, count in top_people:
    print(f"{person:20s}: {count:3d} mentions")
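
The `who` column stores names as one comma-separated string per event, so counts are most reliable when each row is split on its own. Joining rows with a space before splitting would glue the last name of one row to the first name of the next. A toy illustration with made-up rows:

```python
from collections import Counter

who_rows = ["Keimer, Bradford", "", "Keimer", "Bradford, Keith"]  # made-up rows

# Split each row separately and skip empty rows.
names = [name for row in who_rows if row for name in row.split(", ")]
counts = Counter(names)

# For contrast: joining rows with a space merges neighbouring names.
merged = " ".join(who_rows).split(", ")

print(counts.most_common())
print(merged)
```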

In [None]:
# Create a timeline visualization
# Group events by chunk (temporal order) and show progression
def count_names(column) -> int:
    """Count non-empty, comma-separated names across the rows of a group."""
    return sum(1 for name in ', '.join(column).split(', ') if name)

temporal_stats = events_df.groupby('chunk_index').agg({
    'sentence': 'count',
    'who': count_names,
    'where': count_names
}).reset_index()

temporal_stats.columns = ['chunk_index', 'event_count', 'people_count', 'place_count']

fig_timeline = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Events per Chunk', 'People Mentioned', 'Places Mentioned'),
    shared_xaxes=True
)

fig_timeline.add_trace(
    go.Scatter(x=temporal_stats['chunk_index'], y=temporal_stats['event_count'],
               mode='lines', name='Events', line=dict(color='blue')),
    row=1, col=1
)

fig_timeline.add_trace(
    go.Scatter(x=temporal_stats['chunk_index'], y=temporal_stats['people_count'],
               mode='lines', name='People', line=dict(color='green')),
    row=2, col=1
)

fig_timeline.add_trace(
    go.Scatter(x=temporal_stats['chunk_index'], y=temporal_stats['place_count'],
               mode='lines', name='Places', line=dict(color='red')),
    row=3, col=1
)

fig_timeline.update_xaxes(title_text="Text Chunk (Chronological)", row=3, col=1)
fig_timeline.update_yaxes(title_text="Count", row=1, col=1)
fig_timeline.update_yaxes(title_text="Count", row=2, col=1)
fig_timeline.update_yaxes(title_text="Count", row=3, col=1)

fig_timeline.update_layout(
    height=700,
    title_text="Temporal Analysis of Benjamin Franklin's Autobiography",
    showlegend=False
)

fig_timeline.show()

In [None]:
# Export data for further analysis
print("Saving processed data...")

# Save events with embeddings
events_df.to_csv('franklin_events_w5h.csv', index=False)
np.save('franklin_embeddings.npy', embeddings)

print("Data saved:")
print("  - franklin_events_w5h.csv (events with W5H extraction and UMAP coordinates)")
print("  - franklin_embeddings.npy (original high-dimensional embeddings)")

## 8. Summary and Conclusions

This notebook successfully:

1. **Downloaded and preprocessed** Benjamin Franklin's autobiography from Project Gutenberg
2. **Extracted W5H events** (Who, What, When, Where, Why, How) using NLP techniques
3. **Generated natural language sentences** from structured events
4. **Created sentence embeddings** using transformer models
5. **Applied UMAP** for dimensionality reduction
6. **Visualized the event manifold** with interactive plots
7. **Identified clusters** of related events and themes

### Key Insights:

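- The UMAP projection places events with similar generated sentences near one another, which can reveal groupings of events in Franklin's life
- Coloring points by chunk index makes it possible to trace the narrative's progression through the manifold
- DBSCAN clusters may correspond to different themes, periods, or types of events; the per-cluster summaries help check this
- Frequently mentioned people and places emerge from the W5H entity counts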

### Next Steps:

- Fine-tune W5H extraction with more sophisticated NLP models
- Experiment with different sentence transformer models
- Add interactive filtering by W5H components
- Compare with other autobiographies or historical texts
- Implement path analysis to trace Franklin's life journey through the manifold
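
As a starting point for the path-analysis idea, one could average the 2D coordinates of each chunk's events and connect the centroids in chunk order. The sketch below uses plain Python and made-up coordinates. `chunk_centroids` and `path_length` are hypothetical helpers; in the notebook the inputs would come from the `chunk_index`, `umap_x`, and `umap_y` columns.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def chunk_centroids(chunk_ids: List[int], points: List[Point]) -> Dict[int, Point]:
    """Mean (x, y) of the points in each chunk, keyed by chunk index."""
    grouped: Dict[int, List[Point]] = defaultdict(list)
    for cid, pt in zip(chunk_ids, points):
        grouped[cid].append(pt)
    return {
        cid: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for cid, pts in sorted(grouped.items())
    }

def path_length(centroids: Dict[int, Point]) -> float:
    """Total Euclidean length of the path through the centroids in chunk order."""
    ordered = [centroids[cid] for cid in sorted(centroids)]
    return sum(math.dist(a, b) for a, b in zip(ordered, ordered[1:]))

# Made-up data: five events spread over three chunks
chunk_ids = [0, 0, 1, 1, 2]
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 4.0), (1.0, 4.0), (4.0, 8.0)]

centroids = chunk_centroids(chunk_ids, points)
total = path_length(centroids)
print(centroids, total)
```

UMAP preserves local neighbourhoods better than global distances, so a path length measured this way is best read as a rough descriptive number rather than a precise quantity.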