Semantic Tensor Analysis

A temporal semantic evolution analysis framework for tracking how meaning changes across time, tokens, and context.

Semantic Tensor Analysis (STA) sits on top of your embeddings or vector store and gives you temporal drift, trajectories, and token-level alignment; no custom infra required. It keeps token-level detail alongside session-level summaries so you can inspect drift without losing context.

What is STA?

  • ✅ Token- and session-level embeddings (BERT + Sentence-BERT) in one pipeline
  • ✅ Drift metrics and clustering for ordered text sessions (CSV/JSON/TXT)
  • ✅ Visual explanations (PCA, heatmaps, trajectories) tailored to time-ordered data
  • ✅ Domain presets for clinical notes, learning progress, research logs, and conversations

Who is this for?

  • Researchers tracking concept drift over time
  • Clinicians / ABA teams monitoring progress across notes
  • Anyone with time-stamped text who wants more than cosine similarity

Why STA (what's different)

  • Dual-resolution memory: token-level (BERT) + sequence-level (SBERT) stored together for Hungarian token alignment, token drift heatmaps, and session trajectories without re-embedding.
  • Ragged, mask-aware analytics: pad/stack/flatten utilities consistently handle variable-length sessions across PCA, clustering, and trajectories, with no silent truncation.
  • Temporal semantics first: velocity/acceleration of meaning, inflection-point cues, and multi-view trajectories for ordered text (not just static similarity).
  • Concept evolution with alignment: session clustering + transition graphs plus token alignment to show what moved and how.
  • Vision grounding for charts: server-side Plotly→PNG snapshots fed to a local vision GGUF (llama.cpp); graceful fallback to text-only if vision isn't available.
  • Storage hygiene: built-in storage stats/cleanup (sidebar + CLI), CPU-portable persistence.
  • Grounded LLM context: prompts reuse analysis context (clusters, PCA axes, drift) instead of generic summaries.

Core approach: dual resolution

STA tracks meaning at two resolutions:

  • Token-level (BERT): follow individual concept drift
  • Session-level (Sentence-BERT): follow overall semantic movement

Both are kept so you can align tokens while also inspecting higher-level trajectories.
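
A minimal sketch of this dual-resolution idea, assuming stock Hugging Face models (STA's own text_embedder.py wraps the real pipeline; the model names here are illustrative):

# Dual-resolution sketch. Model names are assumptions for illustration;
# STA's text_embedder.py encapsulates the actual pipeline.
import torch
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

text = "Patient showed increased independence during peer interaction."

# Token-level: one 768-dim vector per token, from BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    output = bert(**tokenizer(text, return_tensors="pt"))
token_embeddings = output.last_hidden_state[0]   # [n_tokens, 768]

# Session-level: one holistic 768-dim vector, from Sentence-BERT (assumed model)
sbert = SentenceTransformer("all-mpnet-base-v2")
sequence_embedding = sbert.encode(text)          # [768]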


💡 Concrete Example: What Can STA Tell You?

Scenario: ABA Therapy Progress Tracking

You have 30 therapy session notes for a patient over 6 months:

# Load sessions
from semantic_tensor_analysis.memory.store import UniversalMemoryStore  # assumed import path

store = UniversalMemoryStore()
for note in therapy_notes:  # therapy_notes: a list of session-note strings, in date order
    store.add_session(note)

# Run analysis: open the Streamlit app; each tab below operates on this store

STA automatically reveals:

  1. Semantic Trajectory (evolution_tab)

    • "Patient meaning shifted from cluster 'behavioral challenges' (weeks 1-8) to 'skill acquisition' (weeks 9-20) to 'generalization' (weeks 21-30)"
    • Velocity graph shows rapid progress in weeks 12-15, plateau in weeks 22-26
  2. Token-Level Drift (token_alignment_heatmap)

    • Words that appeared/disappeared: "tantrum" (high early, faded), "independence" (emerged week 10), "peer interaction" (emerged week 18)
    • Optimal alignment shows which specific concepts persisted vs. transformed
  3. Concept Evolution (concepts_tab)

    • KMeans identifies 4 semantic clusters: "regulation struggles", "skill building", "social engagement", "mastery"
    • Transition graph shows patient moved through clusters sequentially with brief regression in week 23
  4. Inflection Points (trajectory_computation)

    • Week 12: Acceleration spike (breakthrough moment)
    • Week 23: Temporary deceleration (regression or plateau)
    • Week 28: Final acceleration (consolidation phase)
  5. PCA Narrative (dimensionality_tab + LLM)

    • "PC1 (43% variance) represents 'independence vs. support needs'"
    • "PC2 (28% variance) represents 'emotional regulation vs. dysregulation'"
    • "Patient trajectory: moved positively along PC1 while PC2 oscillated, then stabilized"
  6. Domain-Aware Insights (AI_insights_tab)

    • "Based on 6-month span, this represents a typical ABA intensive phase"
    • "The regression in week 23 aligns with expected variance in skill acquisition"
    • "Recommend: Continue current approach, monitor for sustained generalization"

All of this from just uploading a CSV. No custom code, no manual analysis.


🚀 Quick Start

  1. (Optional) Create and activate a venv:

    python -m venv venv
    source venv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt
    • Additional system requirement: For the CLI CSV import feature, ensure your Python installation includes tkinter (standard on most desktop Python distributions).
  3. Start the Streamlit app:

    streamlit run app.py
    • On first load, the sidebar opens to let you upload a CSV. After upload, the sidebar stays minimized for more canvas space.
    • Try with ultimate_demo_dataset.csv or aba_therapy_dataset.csv in the repo root.
  4. Interactive CLI demo (optional):

    python demo.py
    • Type sentences to build memory, import to load a CSV (requires tkinter), plot for PCA/heatmap, drift for metrics, tokens for token-level drift, exit to quit.

πŸ—‚οΈ Project Structure

  • app.py: Streamlit web app (tabs: Overview, Evolution, Patterns, Dimensionality, Concepts, Explain, AI Insights); wires sidebar chat and loaders.
  • src/semantic_tensor_analysis/app/: App modules (main.py, tabs/, sidebar.py, sidebar_chat.py, temporal_visualizations.py, assets/config).
  • src/semantic_tensor_analysis/memory/: Core STM types (universal_core.py), text embedder (text_embedder.py), drift (drift.py, sequence_drift.py), storage (store.py), legacy shim modules that forward to archive/legacy_embedders/ when explicitly enabled.
  • src/semantic_tensor_analysis/storage/: Storage manager/stats/cleanup utilities (manager.py).
  • src/semantic_tensor_analysis/streamlit/: Streamlit helpers (utils.py, plots.py) used across tabs.
  • src/semantic_tensor_analysis/analytics/: Tensor batching, dimensionality, trajectories, and concept analytics.
  • src/semantic_tensor_analysis/visualization/: Plotting backends (viz/, tools/, Streamlit-facing plots.py).
  • src/semantic_tensor_analysis/chat/: LLM integration (llama_cpp_analyzer.py, unified_analyzer.py, insights in analysis.py, history parsing).
  • src/semantic_tensor_analysis/demos/: CLI demos.
  • archive/legacy_embedders/: Archived embedders kept for compatibility only.
  • data/: Demo CSVs (ultimate_demo_dataset.csv, aba_therapy_dataset.csv).
  • tests/: Test suite.
  • pyproject.toml: Package metadata/dependencies.

📘 Examples

  • examples/aba_progress.ipynb: Load ABA demo CSV → embed via API → quick PCA view.
  • examples/finance_narrative.ipynb: Embed narrative CSV → run concept clustering → inspect clusters.

Open in Jupyter/VS Code and run locally; both use the STA API (no Streamlit).


Capabilities

Temporal semantic analysis

  • Mask-aware batching for variable-length sessions (pad_and_stack, masked_session_means)
  • Token-level drift with Hungarian alignment and token importance drift
  • Trajectories with velocity/acceleration to spot rapid semantic shifts
  • PCA + clustering over ordered sessions for broad patterns and transitions

Visualizations

  • PCA timelines and 3D trajectories
  • Similarity and token-alignment heatmaps
  • Concept evolution and transition graphs
  • Ridgeline/distribution views
  • Trajectory tunnel (experimental) for long-run drift

LLM-assisted insights

  • Token + sentence embeddings kept together for downstream prompts
  • Domain-aware summaries (clinical, learning, research, conversations)
  • Axis interpretation for PCA dimensions

Workflows

  • Clinical progress tracking
  • Learning/journey mapping
  • Research note evolution
  • Conversation/topic drift
  • Draft/version comparison

Practicalities

  • CSV/JSON/TXT ingestion
  • Persistent storage (CPU-portable)
  • Session state management in Streamlit
  • Test suite coverage across embedding, storage, and viz
  • CLI demo for fast iteration

📦 Datasets

  • ultimate_demo_dataset.csv: High-quality demo with clear trajectories and richer, longer texts.
  • aba_therapy_dataset.csv: ABA-specific schema/content; extended to a larger set for the same client.

Upload either via the Streamlit sidebar to explore the full suite of analyses.

Expected columns (typical): session_id, date, title (optional), text.
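
A minimal loading sketch under those column assumptions (the import path is assumed from the project layout above; the Streamlit sidebar does all of this for you):

import pandas as pd
from semantic_tensor_analysis.memory.store import UniversalMemoryStore  # assumed import path

df = pd.read_csv("aba_therapy_dataset.csv")   # columns: session_id, date, title, text
df = df.sort_values("date")                   # drift analysis assumes time order

store = UniversalMemoryStore()
for text in df["text"]:
    store.add_session(text)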


🤖 LLM Backend Setup

STA uses llama.cpp as the default backend (sidebar auto-configured to http://localhost:8080, model local). Ollama UI is deprecated.

Advantages: Faster inference, lower memory footprint, vision support with the right GGUF.

  1. Install llama-cpp-python:

    pip install llama-cpp-python
  2. Download a GGUF model:

    • Vision (Apple M4/16GB): Qwen/Qwen3-VL-4B-Instruct-GGUF (e.g., Q4_0 or Q4_K_M).
    • Text-only: 4–8B Q4/Q5 GGUFs (Mistral-7B, Llama-3-8B, Qwen2-7B, Phi-3-Mini) work well.
  3. Run llama-server:

    ./server -m /path/to/model.gguf -c 4096 --host 0.0.0.0 --port 8080
  4. In the app:

    • Sidebar auto-uses llama.cpp at http://localhost:8080 with model local.
    • Vision snapshot button will leverage a vision-capable GGUF if provided.
    • For vision models (e.g., Qwen3-VL), start llama-server with both model and projector, e.g.:
      llama-server \
        -m /path/to/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
        --mmproj /path/to/mmproj-Qwen3-VL-8B-Instruct-Q8_0.gguf \
        --port 8080 --ctx-size 5000
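
A quick smoke test of the backend from Python, assuming llama-server's OpenAI-compatible endpoint (recent llama.cpp builds expose /v1/chat/completions by default; this snippet is illustrative, not part of STA):

import requests

# Ask the local llama.cpp server for a short completion to verify it is up
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])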

No LLM (Optional)

You can use STA without any LLM backend. The core analysis and visualizations work independently. You'll just miss the AI-generated narrative insights.


💡 Extensions & Ideas

  • Drift alerts
  • Sentence search
  • HTML dashboard
  • Enhanced multimodal support
  • Clinical applications

⚠️ Notes

  • The venv/ directory is excluded from git and should not be committed.
  • LLM Integration: STA supports two LLM backends:
    • llama.cpp (recommended): Use local GGUF models for faster, memory-efficient inference
    • Ollama: Traditional Ollama server with model management
  • The Streamlit app renders Matplotlib figures inline; no external windows will block interaction.
  • Key dependencies: torch, transformers, scikit-learn, plotly, streamlit, pandas, numpy, rich, requests, llama-cpp-python.
  • tkinter (for file browser): Usually pre-installed with Python. On Linux, install with sudo apt-get install python3-tk if needed.
  • Storage: Session files are stored under data/universal/. Check sidebar storage stats and use the cleanup expander to prune old sessions; CLI available via python -m semantic_tensor_analysis.storage.manager --stats and cleanup options.

📄 Citation

If you use this codebase or ideas in your research, please cite the accompanying paper or link to this repository.


📄 Documentation Alignment: Paper/TeX vs. Codebase

This section maps the semantic-tensor-memory.tex write-up (and associated PDF) to the codebase. It documents feature completeness and correspondence.

Overview

  • The paper/TeX describes the motivation, architecture, algorithms, applications, and limitations of STM.
  • The codebase implements STA with ragged tensor handling, dual-resolution embeddings, token alignment, and domain-aware LLM interpretation.

Feature Correspondence Table

| Area | Paper Coverage | Codebase Coverage | Notes |
| --- | --- | --- | --- |
| STA Architecture | Yes | Yes | Aligned; dynamic dims and ragged sequences implemented. |
| Data Import | Yes | Yes | CSV upload in Streamlit; CLI import with tkinter. |
| Visualization | Yes | Yes | PCA, heatmaps, token alignment, token trajectories. |
| LLM Integration | Yes | Yes | Axis Explainer; domain-aware insights with time scale. |
| Applications | Yes | Yes | ABA and general datasets provided. |
| Example Analysis | Yes | Yes | Demo datasets included. |
| Limitations/Future | Yes | Partial | Multimodal audio, alerts, streaming, storage optimizations. |
| UI/CLI Details | Brief | Yes | More detail in codebase/README than in paper. |
| Figures | Yes | Yes | All figures rendered inline in app; assets can be saved. |

Summary

  • All major features and analyses described in the paper are implemented.
  • The code includes practical details (CLI commands, Streamlit UI) beyond the paper.
  • Remaining roadmap items: audio modality, drift alerts/governance, streaming ingestion, storage efficiency, and expanded tests/CI.

🧩 Technical Architecture: Why Sessions, Not Individual Vectors?

The Session-Based Approach

STA operates on sessions (temporal snapshots containing variable-length sequences), not individual vectors:

# A session is a variable-length sequence
session = UniversalEmbedding(
    event_embeddings=[token_1_emb, token_2_emb, ..., token_n_emb],  # n varies per session
    sequence_embedding=session_mean,  # Holistic meaning
    events=[EventDescriptor(...), ...]  # Token metadata
)

# Sessions vary in length:
session_1: [100 tokens × 768 dims]
session_2: [237 tokens × 768 dims]
session_3: [89 tokens × 768 dims]

This enables dual-resolution analysis: zoom into token-level details or analyze session-level trends.

Ragged Tensor Operations with Masking

The key innovation for handling variable-length sessions:

from semantic_tensor_analysis.analytics.tensor_batching import (
    pad_and_stack,
    masked_session_means,
    flatten_with_mask
)

# Convert ragged sequences to batched tensor
sessions_tensor, mask = pad_and_stack(sessions)
# Shape: [3, 237, 768]  (padded to max length = 237)
# Mask: [3, 237] boolean  (False = padding, ignore in computation)

# Compute session-level statistics (ignoring padding)
session_means = masked_session_means(sessions_tensor, mask)
# Shape: [3, 768] - one mean per session

# Flatten to token level with provenance tracking
flat_tokens, session_ids, token_ids = flatten_with_mask(sessions_tensor, mask)
# flat_tokens: [426, 768]  (100 + 237 + 89 tokens total)
# session_ids: [426]  (which session each token came from)
# token_ids: [426]  (position within session)

Why this matters:

  • Padding doesn't corrupt statistics (masked operations)
  • Can analyze at session OR token granularity seamlessly
  • Enables optimal token alignment across sessions (Hungarian algorithm)
  • PCA can operate on all tokens while preserving session boundaries
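
For instance, a short continuation of the example above (illustrative only; flat_tokens and session_ids come from the flatten_with_mask call shown earlier):

import numpy as np
from sklearn.decomposition import PCA

# PCA over every token at once, keeping session provenance for plotting
tokens = np.asarray(flat_tokens)                    # [426, 768]
coords = PCA(n_components=2).fit_transform(tokens)  # [426, 2]

# Select just the first session's tokens via the provenance ids
first_session = coords[np.asarray(session_ids) == 0]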

Flow: Raw Data → Analysis → Visualization

CSV/JSON/TXT
    ↓
Text Embedding (dual-resolution)
    ├→ Token embeddings [n_tokens, 768] via BERT
    └→ Sequence embedding [768] via Sentence-BERT
    ↓
Session Creation (UniversalEmbedding)
    ↓
Storage (UniversalMemoryStore)
    ↓
Ragged Tensor Batching (pad_and_stack)
    ↓
Global Analysis
    ├→ PCA across all sessions/tokens
    ├→ Concept clustering (KMeans)
    ├→ Token alignment (Hungarian)
    └→ Drift computation (cosine distance)
    ↓
Visualization
    ├→ Temporal trajectories (velocity, acceleration)
    ├→ Heatmaps (session similarity, token alignment)
    ├→ 3D semantic space (PCA projection)
    └→ Concept evolution graphs
    ↓
Optional: LLM narrative generation (llama.cpp or Ollama)

The key insight: Operations are across sessions (temporal), not within a database (spatial).


🤔 FAQ: Why Not Just Use...?

"Why not just use a Jupyter notebook with sklearn?"

You can! STA essentially packages what you'd build in a research notebook into a reusable framework:

Without STA:

# You'd need to implement:
- Dual BERT + S-BERT embedding pipeline
- Ragged tensor padding and masking logic
- Hungarian algorithm for token alignment
- Drift velocity/acceleration computation
- 10+ specialized visualization functions
- Domain-adaptive prompts for LLM analysis
- Streamlit UI for interactive exploration

With STA:

# Just load your data
store = UniversalMemoryStore()
for session in sessions:
    store.add_session(session)

# Everything else is ready to use

STA saves you from re-implementing this infrastructure for every temporal semantic analysis project.

"Why not use LangSmith or W&B for tracking?"

Great tools, different purposes:

| Feature | LangSmith | W&B | STA |
| --- | --- | --- | --- |
| Conversation tracking | ✅ Excellent | ❌ | ✅ |
| Metric dashboards | ✅ | ✅ Excellent | ✅ |
| Semantic drift analysis | ❌ | ❌ | ✅ Token + session level |
| Token alignment | ❌ | ❌ | ✅ Hungarian algorithm |
| Trajectory computation | ❌ | ❌ | ✅ Velocity, acceleration |
| Domain-specific workflows | ❌ | ❌ | ✅ Clinical, learning, research |

Use LangSmith/W&B for production monitoring. Use STA for deep temporal semantic analysis.

"Why not just compute cosine similarity between embeddings?"

Simple similarity misses temporal patterns:

# Simple approach: pairwise similarity
similarity(session_1, session_2)  # → 0.87
similarity(session_2, session_3)  # → 0.82

# STA approach: temporal dynamics
velocity = compute_drift_velocity([session_1, session_2, session_3])
# → [0.13, 0.18]  (change is accelerating)

inflection_points = detect_rapid_shifts(velocity)
# → [session_5, session_12]  (when meaning changed rapidly)

token_drift = token_importance_drift(session_1, session_3)
# → {"anxiety": high drift, "coping": low drift}  (which concepts changed)

STA provides the calculus of semantic change, not just static snapshots.
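
A runnable sketch of the velocity/acceleration idea, assuming session-level embeddings as a NumPy array (function names here are illustrative, not STA's public API):

import numpy as np

def drift_velocity(session_means: np.ndarray) -> np.ndarray:
    """Cosine distance between consecutive session embeddings: semantic 'speed'."""
    unit = session_means / np.linalg.norm(session_means, axis=1, keepdims=True)
    return 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)

def drift_acceleration(velocity: np.ndarray) -> np.ndarray:
    """Change in speed between consecutive steps: is meaning shifting faster?"""
    return np.diff(velocity)

sessions = np.random.rand(5, 768)   # stand-in for real session means
v = drift_velocity(sessions)        # shape [4]
a = drift_acceleration(v)           # shape [3]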

"Why session-based instead of continuous streaming?"

Session-based is intentional for certain domains:

  • Clinical notes: Each therapy session is a natural boundary
  • Learning journeys: Each lesson/assignment is discrete
  • Research evolution: Each draft/experiment is a snapshot
  • Meeting summaries: Each meeting is a unit of analysis

Future work: STA could support streaming by defining windows, but sessions align with how many domains naturally structure temporal data.
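
A tiny illustration of that windowing idea (purely a sketch of the future-work direction, not an STA feature):

import pandas as pd

stream = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-04", "2024-01-10"]),
    "text": ["note a", "note b", "note c"],
})

# One synthetic "session" per calendar week of the stream
weekly = stream.groupby(pd.Grouper(key="timestamp", freq="W"))["text"].apply(" ".join)
weekly = weekly[weekly.str.len() > 0]   # drop empty windows
# Each weekly text could then be passed to store.add_session(...)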


🔗 Token alignment & drift

  • Consecutive/session-pair alignment via Hungarian algorithm (in sequence_drift.py).
  • Visualize with viz.heatmap.token_alignment_heatmap (returns a Matplotlib Figure; rendered inline in Streamlit).
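
A minimal sketch of the alignment step, assuming token embeddings as NumPy arrays (the real implementation lives in sequence_drift.py; this only illustrates the technique):

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_tokens(tokens_a: np.ndarray, tokens_b: np.ndarray):
    """One-to-one Hungarian matching of tokens by cosine distance."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                    # [n_a, n_b]; lower = more similar
    rows, cols = linear_sum_assignment(cost)
    return rows, cols, cost[rows, cols]     # matched pairs + per-pair drift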

🧠 AI prompts

  • Prompts in chat_analysis.py infer domain and an appropriate time scale (days/weeks/months/quarters) from the dataset date span.
  • Explain tab uses AnalysisExplanation fields: what_it_means, why_these_results, what_to_do_next.
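
For example, the date-span-to-time-scale mapping could look roughly like this (illustrative thresholds, not the actual ones in the code):

from datetime import date

def infer_time_scale(start: date, end: date) -> str:
    """Pick a narrative time unit from the dataset's date span (illustrative)."""
    days = (end - start).days
    if days <= 14:
        return "days"
    if days <= 120:
        return "weeks"
    if days <= 730:
        return "months"
    return "quarters"

infer_time_scale(date(2024, 1, 1), date(2024, 6, 30))  # → "months"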

πŸ› οΈ Troubleshooting

  • Port 8501 in use: lsof -ti:8501 | xargs -r kill -9
  • Ollama not running: install/start Ollama and pull a model (e.g., qwen3:latest).
  • PyTorch view/reshape error: the PCA pipeline uses .reshape(...) and contiguous tensors in tensor_batching.py.
  • pytest not found: install via pip install pytest or use the app directly.
