# BIIC: Bio-Inspired Information Cell

**→ View full project page: val1813.github.io/BIIC/README**

A geometric algebra framework for lossless information representation in language models.

License: BSL-1.1 · Phase 1 ✅ · Phase 2 ✅ · Phase 3 🔄


## The Problem

Current token representations have a fundamental flaw: all semantic information is compressed into a single flat vector that is overwritten layer by layer during inference.


Fig 1. Grade-0 invariance after 100 consecutive transformations — error stays at 10⁻⁶ level (3 seeds)

| Failure Mode | Cause |
| --- | --- |
| Irreversible information loss | Deep layers overwrite the original semantics |
| Information overload | The residual stream only adds, never subtracts |
| Long-reasoning degradation | Irrelevant states accumulate without bound |

## Approach: Learn from DNA

DNA achieves three things simultaneously: permanent genome preservation, dynamic epigenetic read/write, and active erasure of outdated marks.

┌─────────────────────────────────────────────────────────┐
│  DNA Architecture          →    BIIC Architecture       │
├─────────────────────────────────────────────────────────┤
│  Genome (immutable)        →    Grade-0 (invariant)     │
│  Epigenome (read/write)    →    Grade-1~4 (equivariant) │
│  TET demethylase (erase)   →    GradeAwareEraser        │
└─────────────────────────────────────────────────────────┘

Key insight: the Clifford geometric algebra Cl(4,1) provides both invariant and equivariant quantities within a single structure, and the invariance is guaranteed by theorem.
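A minimal sketch of this property, assuming the third-party `clifford` package rather than the repo's `src/clifford_cl41.py`: the grade-0 (scalar) part of a multivector survives any number of rotor sandwich products, since ⟨R X R̃⟩₀ = ⟨X R̃ R⟩₀ = ⟨X⟩₀ whenever R̃ R = 1.

```python
# Minimal sketch (not the repo's implementation): grade-0 invariance under
# rotor sandwich products in Cl(4,1), using the third-party `clifford` package.
import math
import random
from clifford import Cl

layout, blades = Cl(4, 1)                    # 5 generators -> 2^5 = 32-dim algebra
e1, e2, e3, e4 = (blades[k] for k in ("e1", "e2", "e3", "e4"))

# A toy "information cell": a scalar (grade-0) plus some higher-grade content.
X = 2.5 + 1.3 * e1 - 0.7 * (e1 ^ e2) + 0.4 * (e2 ^ e3 ^ e4)
scalar_before = X.value[0]                   # grade-0 coefficient

random.seed(0)
for _ in range(100):
    # Simple rotor R = cos(t/2) + sin(t/2)*B with B a unit bivector (B*B = -1).
    t = random.uniform(0.0, 2.0 * math.pi)
    B = (e1 ^ e2) if random.random() < 0.5 else (e3 ^ e4)
    R = math.cos(t / 2) + math.sin(t / 2) * B
    X = R * X * ~R                           # sandwich product; ~R is reversion

print(abs(X.value[0] - scalar_before))       # stays at floating-point noise level
```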


## Results

### Phase 1: Mathematical Verification ✅

| Metric | Value (3 seeds) |
| --- | --- |
| Grade-0 invariance error (100 transforms) | 6.56×10⁻⁶ ± 4.95×10⁻⁶ |
| Grade-5 invariance error | 5.12×10⁻⁶ ± 3.81×10⁻⁶ |
| Multi-channel leakage | 0.0 (exact) |
| Eraser preserves grade-0 | 0.0 (exact) |
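The "Eraser preserves grade-0" row reflects a structural property rather than a numerical one: an erase step that only touches grades 1–4 cannot change the grade-0 coefficients. A toy sketch of that idea (not the repo's GradeAwareEraser, again using the third-party `clifford` package):

```python
# Illustrative sketch only, not the repo's GradeAwareEraser: attenuate the
# equivariant grades 1-4 while copying grade-0 (and grade-5) through exactly.
import numpy as np
from clifford import Cl

layout, blades = Cl(4, 1)
e1, e2, e3, e5 = (blades[k] for k in ("e1", "e2", "e3", "e5"))

def grade_aware_erase(X, keep=0.1):
    """Scale grades 1-4 by `keep`; grades 0 and 5 pass through untouched."""
    return X(0) + keep * (X(1) + X(2) + X(3) + X(4)) + X(5)

X = 3.0 + 2.0 * e1 - 1.5 * (e1 ^ e2) + 0.8 * (e1 ^ e2 ^ e3) + 0.2 * (e1 ^ e2 ^ e3 ^ e5)
Y = grade_aware_erase(X)

print(Y.value[0] - X.value[0])                                   # 0.0 exactly
print(np.linalg.norm(Y(2).value) / np.linalg.norm(X(2).value))   # 0.1: grade-2 attenuated
```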

### Phase 2: Encoding-Decoding Pipeline ✅


Fig 3. Grade separation emerges naturally — different grades learn different roles

| Metric | Value |
| --- | --- |
| All-grade vs grade-0-only decoding | 5.3× better (0.006 vs 0.032) |
| Token discrimination (cosine sim) | 0.029 ± 0.013 (near-orthogonal) |
| Grade-0 change after 6 inference layers | 0.0 (exact) |


Fig 4. Different tokens achieve near-orthogonal grade-0 representations
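For context on the cosine-similarity number above, here is a sketch of the metric itself (the vectors below are random placeholders, not representations from the Phase 2 encoder): mean absolute off-diagonal cosine similarity, where values near zero mean near-orthogonal token representations.

```python
# Sketch of the discrimination metric only; the vectors here are random
# placeholders, not grade-0 representations produced by the BIIC encoder.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, dim = 50, 256
reps = torch.randn(num_tokens, dim)        # stand-in for per-token representations

normed = F.normalize(reps, dim=-1)
cos = normed @ normed.T                    # pairwise cosine similarities
off_diag = cos[~torch.eye(num_tokens, dtype=torch.bool)]

mean, std = off_diag.abs().mean().item(), off_diag.abs().std().item()
print(f"{mean:.3f} ± {std:.3f}")           # values near 0 => near-orthogonal
```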

### Phase 3: Comparative Experiments 🔄

| Group | Description | Final Loss (mean ± std, 3 seeds) |
| --- | --- | --- |
| A1 | BIIC Full (Eraser = 0.5) | 10.8285 ± 0.0008 |
| A2 | BIIC + Weak Eraser (0.01) | 10.8289 ± 0.0010 ✅ |
| B | Orthogonal Token + tanh (H1 baseline) | 10.8319 ± 0.0020 ✅ |
| C | Linear + LayerNorm (lower bound) | 10.8292 ± 0.0007 ✅ |
| D | BIIC grade-0 only (H2 ablation) | 🔄 Running |
| E | 2048-dim Embedding (H2 dim-matched) | 10.9984 ± 0.0116 ✅ |

Hypothesis test results:

  • H1 (Geometry): A1 (10.8285) < B (10.8319) — BIIC outperforms orthogonal baseline
  • H2 (Equivariance): A1 (10.8285) < E (10.9984) — equivariant structure has clear value over raw dimensionality
  • H3 (Eraser): A1 ≈ A2 — Eraser strength has limited effect in this setup (both work)

### Phase 4: Language Model Training 🔄

BIIC as a drop-in replacement for token embeddings in a language model:

| Metric | v0.1 (random data) | v0.2 (WikiText-103) |
| --- | --- | --- |
| Params | 20M | 73M |
| Data | Random tokens | WikiText-103 (117M tokens) |
| Loss (step 0) | 10.98 | 10.94 |
| Loss (latest) | 10.83 (step 8470) | 6.35, PPL 572 (step 800) |
| Status | 🔄 Near complete | 🔄 Training (ETA ~28h) |

v0.2 loss: 10.94 → 6.35 in 800 steps on real text (PPL 58895 → 572). The BIIC multivector learns language structure.
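For reference, the perplexity figures follow directly from the cross-entropy loss in nats, PPL = exp(loss):

```python
import math
# Perplexity is the exponential of the mean cross-entropy loss (in nats).
print(round(math.exp(6.35)))   # 572, the reported v0.2 perplexity at step 800
```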

### Memory Scaling: BIIC vs Transformer ✅

| seq_len | BIIC (MB) | Transformer (MB) | Note |
| --- | --- | --- | --- |
| 256 | 747 | 431 | |
| 512 | 972 | 640 | |
| 1024 | 1425 | 1060 | |
| 2048 | 2327 | 2622 | BIIC wins |

Key finding: BIIC memory grows 3.1× from 256→2048, Transformer grows 6.1×. Crossover at ~1800 tokens. Beyond that, BIIC uses less memory — no KV cache.

BIIC params: 74M, Transformer params: 53M (BIIC has higher base cost but better scaling).


## Experiment Plan

| Phase | Goal | Status |
| --- | --- | --- |
| 1 | Verify mathematical properties of Cl(4,1) | ✅ Complete |
| 2 | Verify the encoding-decoding pipeline | ✅ Complete |
| 3 | 6-group comparison (H1/H2/H3 hypothesis tests) | 🔄 Running |
| 4 | MVP language model (SlowFast + DualCodebook) | 📋 Planned |

Phase 3 is testing three hypotheses:

  • H1: Does the geometric structure itself add value, or does the gain come only from the orthogonality constraint?
  • H2: Do the equivariant components contribute independently, or is the gain just from higher dimensionality?
  • H3: Does the Eraser actually control information entropy on long sequences?

## If This Works

| Capability | Mechanism |
| --- | --- |
| Lossless long context | Grade-0 keeps the original semantics no matter how deep inference goes |
| No KV cache | Mutable state replaces key-value storage |
| Built-in interpretability | Grade decomposition reveals "what is remembered" vs "what is being thought about" |
| O(L) complexity | Slow-fast separation removes quadratic attention |
| Natural multimodal alignment | Different modalities share one algebraic space |

## Repository Structure

BIIC/
├── src/                          # Core implementation
│   ├── clifford_cl41.py          # Cl(4,1) geometric algebra
│   ├── rotor_utils.py            # Rotors & sandwich product
│   ├── eraser_ops.py             # GradeAwareEraser
│   ├── token_to_ic.py            # Encoder
│   ├── all_grade_decoder.py      # All-grade gated decoder
│   ├── mutable_state.py          # BIICLayer
│   └── biic_loss.py              # Staged auxiliary losses
├── tests/                        # Verification tests
├── results/                      # Experiment data (JSON, 3 seeds)
├── figures/                      # Paper figures
├── requirements.txt
└── LICENSE

## Quick Start

pip install torch numpy scipy matplotlib

# Phase 1 verification (CPU, ~2 min)
python tests/test_phase1.py

# Phase 2 verification (CPU, ~10 min)
python tests/test_decoder_basic.py
python tests/test_encoder.py
python tests/test_full_pipeline.py



## Contact

Interested in this direction and in collaborating on the paper or exploring new paradigms together? Reach out:

WeChat: llmbbs


## Citation

@misc{huang2025biic,
  title={Bio-Inspired Information Cell: A Geometric Algebra Framework for
         Lossless Information Representation in Language Models},
  author={Huang, Zhongchang},
  year={2025},
  note={Phase 1-2 complete, Phase 3-4 ongoing.}
}

## License

Business Source License 1.1 — Free for non-production use. See LICENSE for details.

## About

Research on replacing token embeddings with algebraically grounded invariant and equivariant representations. Formal name: Algebraic Invariant Decomposition based Language Information Processing Method and System.
