Real-time cognitive bias detection and behavioral intervention for Indonesian retail investors.
CogniFi detects FOMO, loss aversion, and confirmation bias from investor-generated text and delivers friction-based behavioral interventions before trade execution. Rule-based and fully interpretable in every output is traceable to specific rules and keyword matches.
Paper: The Behavioral Layer Gap: Domain-Specific Interpretability Challenges Scale in Low-Resource Safety-Critical Classification, Evidence from Cognitive Bias Detection in Indonesian Retail Investor Text
git clone https://github.com/redsandr/cognifi.git
cd cognifi
pip install -r requirements.txt
export GEMINI_API_KEY="your-key-here"
streamlit run app.py| System | Accuracy | FOMO F1 | LA F1 | CB F1 | NONE F1 |
|---|---|---|---|---|---|
| Majority class baseline | 29.3% | 0.0% | 0.0% | 0.0% | 45.4% |
| Gemini 3 Flash zero-shot (medium reasoning) | 76.4% | 76.5% | 78.0% | 79.2% | 72.9% |
| CogniFi (this work) | 94.7% | 94.0% | 95.8% | 96.2% | 93.6% |
+18.3pp over Gemini 3 Flash on a stratified held-out set of 150 cases.
Code-switching finding: Gemini accuracy drops from 80.0% on monolingual Indonesian to 64.1% on code-switching samples (−15.9pp). CogniFi maintains 91.6% on the same samples. The gap is distributional, not architectural, terms like masih sempet ga and serok bareng are absent from any general-purpose training corpus.
CogniFi runs a six-layer rule-based pipeline:
Input text
│
├── Layer 1: Early exit (analytical/educational text → NONE)
├── Layer 2: Price context (Yahoo Finance real-time data)
├── Layer 3: Keyword scoring (1,307 terms, 13 categories)
├── Layer 4: Contextual adjustments (57 rules)
├── Layer 5: Confidence thresholding
└── Layer 6: Signal construction
│
▼
Bias label + confidence + counter-evidence + friction prompt
The rule-based design is intentional. In behavioral intervention contexts, unexplainable outputs are dismissed precisely when intervention is most needed.
1,193 labeled samples of authentic Indonesian retail investor posts, collected from five platforms.
| Platform | Training | % |
|---|---|---|
| Stockbit | 882 | 84.6% |
| X.com | 85 | 8.1% |
| 32 | 3.1% | |
| Threads | 44 | 4.2% |
Language distribution:
| Category | n | % |
|---|---|---|
| Monolingual Indonesian | 844 | 70.7% |
| Intra-sentential code-switching | 307 | 25.7% |
| Monolingual English | 42 | 3.5% |
Language detection uses hybrid FastText LID + domain lexicon overlap.
Inter-annotator agreement: Cohen's Kappa = 0.7815 (substantial agreement, Landis & Koch 1977), across two domain-qualified annotators (n=100 samples).
cognifi/
├── bias_detector.py # 6-layer detection pipeline
├── keywords.py # 1,307-term domain lexicon
├── counter_evidence.py # Historical backtesting engine
├── intervention.py # Friction and Pre-Mortem prompts
├── ticker_extractor.py # Ticker detection
├── price_fetcher.py # Yahoo Finance data layer
├── fundamental_data.py # IDX fundamentals (10 tickers)
├── llm.py # Gemini API wrapper
├── news_context.py # News sentiment layer
├── rag.py # ChromaDB RAG engine
├── app.py # Streamlit web app
├── config.py # Configuration
├── evaluate_final.py # Held-out evaluation
├── gemini_baseline.py # Gemini baseline script
├── test_bias_accuracy.py # Training accuracy script
└── data/
├── training_v2.json # 1,043 training cases with provenance
└── held_out_TEST_ONLY.json # 150 held-out cases (labels withheld)
# Training set accuracy
python test_bias_accuracy.py
# Gemini 3 Flash baseline (requires API key)
python gemini_baseline.pyThe held-out answer key is not included to prevent data leakage. Evaluation results are documented in the paper.
For 10 covered IDX tickers, the engine surfaces historical precedents relevant to the detected bias. Outputs are base rates, not predictions.
| Ticker | FOMO Episodes | Correction Prob. | LA Episodes | Recovery Rate | Avg Days |
|---|---|---|---|---|---|
| BBCA.JK | — | — | 75 | 100% | 2 |
| BBRI.JK | 4 | 50% | 76 | 100% | 4 |
| TLKM.JK | — | — | 97 | 100% | 3 |
| ASII.JK | — | — | 95 | 100% | 8 |
| BMRI.JK | — | — | 78 | 100% | 3 |
| UNVR.JK | 10 | 50% | 111 | 95% | 7 |
| GOTO.JK | 10 | 20% | 44 | 95% | 7 |
| BREN.JK | 17 | 18% | 20 | 90% | 5 |
| EMTK.JK | 18 | 22% | 113 | 100% | 1 |
| SIDO.JK | 6 | 50% | 93 | 95% | 10 |
— = no episodes meeting the >20% 5-day FOMO threshold. System returns an explicit insufficient_data response rather than generating output from inadequate samples.
MIT. See LICENSE.
Dataset released for research use. All samples are authentic user-generated content from Indonesian retail investor communities (Stockbit, X.com, Reddit, Threads).