# Phase 1 Demo: MWB Tracking Pipeline

This notebook demonstrates the end-to-end Phase 1 pipeline:
1. **Ingestion**: Load raw conversational text from CSV
2. **Preprocessing**: Normalize slang and emojis, and apply syntactic fixes
3. **Persistence**: Write to SQLite and demonstrate history retrieval

## Setup


In [1]:
import sys
from pathlib import Path

# Add the project root to sys.path so the `src` package can be imported
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import logging
from datetime import datetime
import time

from src.ingest import ingest_csv
from src.preprocess import preprocess_pipeline
from src.persistence import MWBPersistence
from src.utils import setup_logging, checkpoint_dataframe

# Setup logging
setup_logging(log_level="INFO")
logging.info("Starting Phase 1 Demo")


2025-12-06 02:17:51,417 - root - INFO - Starting Phase 1 Demo


## Step 1: Ingestion

Load the sample data from a CSV file.
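
The source of `ingest_csv` is not shown in this notebook, so the following is only a minimal sketch of the load-and-validate idea, written against the standard library rather than pandas. The `load_records` helper, the timestamp format string, and the two-row sample are assumptions for illustration; the required column names are taken from the output below.

```python
import csv
import io
from datetime import datetime

# Column names as reported in the notebook output.
REQUIRED_COLUMNS = ("UserID", "Text", "Timestamp")


def load_records(handle):
    """Read CSV rows, check the required columns, and parse timestamps."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    records = []
    for row in reader:
        records.append(
            {
                "UserID": row["UserID"],
                "Text": row["Text"],
                # Assumed timestamp format; adjust to match the real file.
                "Timestamp": datetime.strptime(row["Timestamp"], "%Y-%m-%d %H:%M:%S"),
            }
        )
    return records


# Two-row stand-in for sample_data.csv (illustrative, not the real file).
sample = io.StringIO(
    "UserID,Text,Timestamp\n"
    "user001,lol that's hilarious,2024-01-15 11:15:00\n"
    "user002,idk what to do anymore... feeling lost,2024-01-15 12:00:00\n"
)
records = load_records(sample)
print(f"Loaded {len(records)} records")
```

Failing fast on missing columns keeps malformed files from reaching the preprocessing step.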


In [2]:
# Ingest sample data
data_path = project_root / "data" / "sample_data.csv"
df_raw = ingest_csv(data_path)

print(f"Loaded {len(df_raw)} records")
print("\nFirst few rows:")
print(df_raw.head())
print(f"\nColumns: {list(df_raw.columns)}")
print(f"\nData types:\n{df_raw.dtypes}")

# Checkpoint after ingestion
checkpoint_dir = project_root / "checkpoints"
checkpoint_dataframe(df_raw, checkpoint_dir, "ingestion")


2025-12-06 02:17:51,422 - src.ingest - INFO - Ingesting CSV from /Users/vikranthnara/Documents/GitHub/Emotix/data/sample_data.csv
2025-12-06 02:17:51,428 - src.ingest - INFO - Validated DataFrame with 20 rows


Loaded 20 records

First few rows:
    UserID                                             Text  \
0  user001  Ugh, today was so stressful 😫 tbh I'm exhausted   
1  user001                         lol that's hilarious 😂😂😂   
2  user002           idk what to do anymore... feeling lost   
3  user001                 omg I can't believe it worked! 🎉   
4  user003                  wtf is wrong with me today? smh   

            Timestamp  
0 2024-01-15 10:30:00  
1 2024-01-15 11:15:00  
2 2024-01-15 12:00:00  
3 2024-01-15 13:45:00  
4 2024-01-15 14:20:00  

Columns: ['UserID', 'Text', 'Timestamp']

Data types:
UserID               object
Text                 object
Timestamp    datetime64[ns]
dtype: object


2025-12-06 02:17:52,167 - src.utils - INFO - Checkpoint saved: /Users/vikranthnara/Documents/GitHub/Emotix/checkpoints/ingestion_20251206_021751.parquet (20 rows)


PosixPath('/Users/vikranthnara/Documents/GitHub/Emotix/checkpoints/ingestion_20251206_021751.parquet')

## Step 2: Preprocessing

Apply normalization pipeline: slang dictionary lookup, emoji demojization, and syntactic fixes.
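
To make the three normalization steps concrete, here is a rough sketch of how slang expansion, emoji replacement, and whitespace cleanup could fit together. It is not the code in `src.preprocess`: the `normalize` function and the three-entry `SLANG` dictionary are hypothetical, and Unicode character names stand in for whatever emoji library the project actually uses.

```python
import re
import unicodedata

# Hypothetical three-entry dictionary; the notebook loads its mappings from JSON.
SLANG = {
    "tbh": "to be honest",
    "lol": "laughing out loud",
    "idk": "i don't know",
}


def normalize(text, slang=SLANG):
    """Expand slang, replace emojis with their names, and collapse whitespace."""
    flags = {"slang_replacements": [], "emojis_found": []}

    def expand(match):
        word = match.group(0)
        replacement = slang.get(word.lower())
        if replacement is None:
            return word
        flags["slang_replacements"].append(
            {"original": word, "replacement": replacement}
        )
        return replacement

    # Match whole words only, so "lol" inside "lollipop" is left untouched.
    text = re.sub(r"[A-Za-z']+", expand, text)

    pieces = []
    for ch in text:
        name = unicodedata.name(ch, "")
        # Category "So" (Symbol, other) covers most pictographic emojis.
        if unicodedata.category(ch) == "So" and name:
            if ch not in flags["emojis_found"]:
                flags["emojis_found"].append(ch)
            pieces.append(f" {name.lower().replace(' ', '_')} ")
        else:
            pieces.append(ch)

    # Collapse the extra spaces introduced around emoji names.
    return " ".join("".join(pieces).split()), flags


normalized, flags = normalize("lol that's hilarious 😂😂😂")
print(normalized)
print(flags)
```

Recording each replacement in `flags` mirrors the `NormalizationFlags` column in the output below and keeps the transformation auditable.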


In [3]:
# Load slang dictionary and preprocess
slang_dict_path = project_root / "data" / "slang_dictionary.json"
df_preprocessed = preprocess_pipeline(df_raw, slang_dict_path=slang_dict_path)

print(f"Preprocessed {len(df_preprocessed)} records")
print("\nSample transformations:")
sample = df_preprocessed[['Text', 'NormalizedText', 'NormalizationFlags']].head(3)
for idx, row in sample.iterrows():
    print(f"\nOriginal: {row['Text']}")
    print(f"Normalized: {row['NormalizedText']}")
    flags = row['NormalizationFlags']
    if flags.get('slang_replacements'):
        print(f"  Slang replaced: {flags['slang_replacements']}")
    if flags.get('emojis_found'):
        print(f"  Emojis found: {flags['emojis_found']}")

# Checkpoint after preprocessing
checkpoint_dataframe(df_preprocessed, checkpoint_dir, "preprocessing")


2025-12-06 02:17:52,174 - src.preprocess - INFO - Loaded 33 slang/abbrev mappings
2025-12-06 02:17:52,176 - src.preprocess - INFO - Preprocessing 20 rows
2025-12-06 02:17:52,184 - src.preprocess - INFO - Preprocessing complete
2025-12-06 02:17:52,187 - src.utils - INFO - Checkpoint saved: /Users/vikranthnara/Documents/GitHub/Emotix/checkpoints/preprocessing_20251206_021752.parquet (20 rows)


Preprocessed 20 records

Sample transformations:

Original: Ugh, today was so stressful 😫 tbh I'm exhausted
Normalized: Ugh, today was so stressful tired_face to be honest in my opinion exhausted
  Slang replaced: [{'original': 'tbh', 'replacement': 'to be honest'}]
  Emojis found: ['😫']

Original: lol that's hilarious 😂😂😂
Normalized: laughing out loud that's hilarious face_with_tears_of_joy face_with_tears_of_joy face_with_tears_of_joy
  Slang replaced: [{'original': 'lol', 'replacement': 'laughing out loud'}]
  Emojis found: ['😂']

Original: idk what to do anymore... feeling lost
Normalized: i don't know what to do anymore. .. feeling lost
  Slang replaced: [{'original': 'idk', 'replacement': "i don't know"}]


PosixPath('/Users/vikranthnara/Documents/GitHub/Emotix/checkpoints/preprocessing_20251206_021752.parquet')

## Step 3: Persistence

Write the preprocessed data to a SQLite database.
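
As a sketch of what a batched, transactional write can look like with the standard `sqlite3` module: the `write_rows` helper and the reduced schema are assumptions (a subset of the columns reported for `mwb_log` later in this notebook), not the actual `MWBPersistence.write_results` implementation.

```python
import json
import sqlite3

# Simplified, assumed schema: a subset of the mwb_log columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mwb_log (
    LogID INTEGER PRIMARY KEY AUTOINCREMENT,
    UserID TEXT NOT NULL,
    Timestamp DATETIME NOT NULL,
    NormalizedText TEXT,
    NormalizationFlags TEXT
)
"""


def write_rows(conn, rows):
    """Insert all rows in one transaction; roll everything back on error."""
    params = [
        (
            r["UserID"],
            r["Timestamp"],
            r["NormalizedText"],
            json.dumps(r["NormalizationFlags"]),  # store the flags dict as JSON text
        )
        for r in rows
    ]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO mwb_log (UserID, Timestamp, NormalizedText, NormalizationFlags) "
            "VALUES (?, ?, ?, ?)",
            params,
        )
    return len(params)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

rows = [
    {
        "UserID": "user001",
        "Timestamp": "2024-01-15 11:15:00",
        "NormalizedText": "laughing out loud that's hilarious",
        "NormalizationFlags": {"slang_replacements": ["lol"]},
    },
    {
        "UserID": "user002",
        "Timestamp": "2024-01-15 12:00:00",
        "NormalizedText": "i don't know what to do anymore",
        "NormalizationFlags": {"slang_replacements": ["idk"]},
    },
]
written = write_rows(conn, rows)
print(f"Wrote {written} rows")
```

Using the connection as a context manager means a failure partway through the batch leaves the table unchanged.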


In [4]:
# Initialize persistence layer
db_path = project_root / "data" / "mwb_log.db"
persistence = MWBPersistence(db_path)

# Write results (with raw text archiving)
rows_written = persistence.write_results(df_preprocessed, archive_raw=True)
print(f"Successfully wrote {rows_written} rows to database")
print(f"Database location: {db_path}")


2025-12-06 02:17:52,246 - src.persistence - INFO - Database schema initialized
2025-12-06 02:17:52,251 - src.persistence - INFO - Committed batch: 20 rows (total: 20)
2025-12-06 02:17:52,251 - src.persistence - INFO - Successfully wrote 20 rows to database


Successfully wrote 20 rows to database
Database location: /Users/vikranthnara/Documents/GitHub/Emotix/data/mwb_log.db


## Step 4: History Retrieval

Fetch one user's history and time the call to check it against the <100 ms retrieval requirement.
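
One common way to keep per-user lookups fast in SQLite is a composite index on the user and timestamp columns. The sketch below assumes such an index and a simplified table; the standalone `fetch_history` function here is illustrative and may differ from the method on `MWBPersistence`, in particular in whether the `since_timestamp` bound is inclusive.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mwb_log (UserID TEXT, Timestamp TEXT, NormalizedText TEXT)")
# Assumed composite index: lets SQLite seek to one user's rows in time order.
conn.execute("CREATE INDEX idx_user_time ON mwb_log (UserID, Timestamp)")
conn.executemany(
    "INSERT INTO mwb_log VALUES (?, ?, ?)",
    [
        ("user001", "2024-01-15 10:30:00", "first"),
        ("user001", "2024-01-15 11:15:00", "second"),
        ("user002", "2024-01-15 12:00:00", "other user"),
        ("user001", "2024-01-15 13:45:00", "third"),
    ],
)


def fetch_history(conn, user_id, since_timestamp=None, limit=None):
    """Return a user's rows oldest first; `limit` keeps only the most recent rows."""
    sql = "SELECT Timestamp, NormalizedText FROM mwb_log WHERE UserID = ?"
    params = [user_id]
    if since_timestamp is not None:
        sql += " AND Timestamp >= ?"  # inclusive bound (an assumption)
        params.append(since_timestamp)
    sql += " ORDER BY Timestamp DESC"
    if limit is not None:
        sql += " LIMIT ?"
        params.append(limit)
    rows = conn.execute(sql, params).fetchall()
    return rows[::-1]  # SQL returns newest first; flip to chronological order


start = time.perf_counter()
history = fetch_history(conn, "user001", limit=2)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Retrieved {len(history)} rows in {elapsed_ms:.2f}ms")
```

Timestamps are stored as ISO-formatted text here, which sorts correctly as plain strings.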


In [5]:
# Fetch history for user001
user_id = "user001"

start_time = time.perf_counter()  # monotonic, high-resolution clock for timing
history = persistence.fetch_history(user_id)
elapsed_ms = (time.perf_counter() - start_time) * 1000

print(f"Retrieved {len(history)} records for user {user_id} in {elapsed_ms:.2f}ms")
print("\nHistory (ordered by timestamp):")
print(history[['Timestamp', 'NormalizedText', 'NormalizationFlags']].head(10))

# Test with timestamp filter
since = datetime(2024, 1, 15, 12, 0, 0)
history_filtered = persistence.fetch_history(user_id, since_timestamp=since)
print(f"\nRecords since {since}: {len(history_filtered)}")

# Test with limit
history_limited = persistence.fetch_history(user_id, limit=3)
print("\nLast 3 records:")
print(history_limited[['Timestamp', 'NormalizedText']])


2025-12-06 02:17:52,269 - src.persistence - INFO - Fetched 16 records for user user001
2025-12-06 02:17:52,273 - src.persistence - INFO - Fetched 12 records for user user001
2025-12-06 02:17:52,275 - src.persistence - INFO - Fetched 3 records for user user001


Retrieved 16 records for user user001 in 2.85ms

History (ordered by timestamp):
             Timestamp                                     NormalizedText  \
0  2024-01-15 10:30:00  Ugh, today was so stressful tired_face to be h...   
1  2024-01-15 10:30:00  Ugh, today was so stressful tired_face to be h...   
2  2024-01-15 11:15:00  laughing out loud that's hilarious face_with_t...   
3  2024-01-15 11:15:00  laughing out loud that's hilarious face_with_t...   
4  2024-01-15 13:45:00  oh my god I can't believe it worked! party_popper   
5  2024-01-15 13:45:00  oh my god I can't believe it worked! party_popper   
6  2024-01-15 16:00:00                 be right back need to take a break   
7  2024-01-15 16:00:00                 be right back need to take a break   
8  2024-01-15 19:15:00    you only live once let's do this! flexed_biceps   
9  2024-01-15 19:15:00    you only live once let's do this! flexed_biceps   

                                  NormalizationFlags  
0  {'emojis_foun

## Step 5: Verify Database Schema

Check that all required columns are present in the database.


In [6]:
import sqlite3

conn = sqlite3.connect(str(db_path))
cursor = conn.cursor()

# Check schema
cursor.execute("PRAGMA table_info(mwb_log)")
columns = cursor.fetchall()
print("MWB Log Schema:")
for col in columns:
    print(f"  {col[1]} ({col[2]})")

# Sample query
cursor.execute("SELECT COUNT(*) FROM mwb_log")
total = cursor.fetchone()[0]
print(f"\nTotal records in mwb_log: {total}")

cursor.execute("SELECT COUNT(*) FROM raw_archive")
total_archive = cursor.fetchone()[0]
print(f"Total records in raw_archive: {total_archive}")

conn.close()


MWB Log Schema:
  LogID (INTEGER)
  UserID (TEXT)
  Timestamp (DATETIME)
  NormalizedText (TEXT)
  PrimaryEmotionLabel (TEXT)
  IntensityScore_Primary (REAL)
  AmbiguityFlag (INTEGER)
  NormalizationFlags (TEXT)
  CreatedAt (DATETIME)

Total records in mwb_log: 40
Total records in raw_archive: 40
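
Both counts are 40, twice the 20 rows written in Step 3, and the Step 4 history lists each message twice. This is consistent with the notebook having been run more than once against the same database file. One possible guard, sketched here as an assumption rather than as how `MWBPersistence` behaves, is a uniqueness constraint combined with `INSERT OR IGNORE`, so that re-running the write step does not add duplicates. The `write_once` helper and the `(UserID, Timestamp)` key are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mwb_log (
        LogID INTEGER PRIMARY KEY AUTOINCREMENT,
        UserID TEXT NOT NULL,
        Timestamp DATETIME NOT NULL,
        NormalizedText TEXT,
        UNIQUE (UserID, Timestamp)  -- hypothetical natural key for one message
    )
    """
)

rows = [
    ("user001", "2024-01-15 10:30:00", "Ugh, today was so stressful"),
    ("user001", "2024-01-15 11:15:00", "laughing out loud that's hilarious"),
]


def write_once(conn, rows):
    """Insert rows, skipping any (UserID, Timestamp) pair that is already stored."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR IGNORE INTO mwb_log (UserID, Timestamp, NormalizedText) "
            "VALUES (?, ?, ?)",
            rows,
        )


write_once(conn, rows)
write_once(conn, rows)  # simulates re-running the notebook
total = conn.execute("SELECT COUNT(*) FROM mwb_log").fetchone()[0]
print(f"Total records in mwb_log: {total}")
```

The trade-off is that two genuine messages from the same user with an identical timestamp would collapse into one row, so a different key may suit real data better. For a demo, deleting the database file before each run is a simpler alternative.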


## Summary

✅ **Ingestion**: Successfully loaded and validated 20 records  
✅ **Preprocessing**: Applied normalization (slang, emojis, fixes) with flags tracking  
✅ **Persistence**: Wrote structured data to SQLite with ACID transactions  
✅ **History Retrieval**: Fast context retrieval (<100ms) with indexing  

**Next Steps (Phase 2):**
- Layer 3: Contextualization (multi-turn sequence creation)
- Layer 4: Modeling (fine-tuned DistilRoBERTa inference)
- Synthetic ambiguity dataset generation
- Validation metrics tracking
