# Phase 1: Data Ingestion & Structuring

## Overview
This notebook uses Pathway to load and structure the data for the character backstory consistency checking system: labeled backstory records from CSV files and the full text of the source novels.

## Output Tables
- `train_table`: Training data with character backstories and labels
- `test_table`: Test data, read with the same schema as the training data
- `books_table`: Full novel text content, one row per book
- `joined_table`: Training rows joined with the full text of their novel

In [1]:
# Install Pathway if needed
!pip install pathway pandas -q

In [2]:
from pathlib import Path

import pandas as pd
import pathway as pw



In [3]:
# Define paths
PROJECT_ROOT = Path("/root/DataDivas_KDSH_2026")
DATA_DIR = PROJECT_ROOT / "Data"
BOOKS_DIR = DATA_DIR / "Books"

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Books directory: {BOOKS_DIR}")

# Verify directories exist
print(f"Data dir exists: {DATA_DIR.exists()}")
print(f"Books dir exists: {BOOKS_DIR.exists()}")

Project root: /root/DataDivas_KDSH_2026
Data directory: /root/DataDivas_KDSH_2026/Data
Books directory: /root/DataDivas_KDSH_2026/Data/Books
Data dir exists: True
Books dir exists: True
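
The checks above only print `True` or `False`, so a missing directory would surface later as a less obvious error. A minimal sketch of a fail-fast variant follows. The helper name `require_dirs` is hypothetical and not part of the notebook, and the demo uses a temporary directory as a stand-in for `DATA_DIR`.

```python
import tempfile
from pathlib import Path


def require_dirs(*dirs: Path) -> None:
    """Raise FileNotFoundError listing every path that is not an existing directory."""
    missing = [str(d) for d in dirs if not d.is_dir()]
    if missing:
        raise FileNotFoundError("Missing directories: " + ", ".join(missing))


# Demo on a temporary directory standing in for DATA_DIR (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    books_dir = data_dir / "Books"   # deliberately not created
    require_dirs(data_dir)           # passes silently
    try:
        require_dirs(data_dir, books_dir)
        failed = False
    except FileNotFoundError as exc:
        failed = True
        print(exc)
```

In the notebook this could be called as `require_dirs(DATA_DIR, BOOKS_DIR)` right after the paths are defined.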


## Step 1: Load CSV Data with Pathway

In [4]:
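# Define schema for training data

class TrainSchema(pw.Schema):
    uid: int         # sample identifier
    book_name: str   # novel title; join key against books_table.title
    char: str        # character name
    caption: str
    content: str     # backstory text
    label: str       # "consistent" or "contradict"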


In [5]:
# Load training and test data using Pathway's CSV connector
train_csv_path = str(DATA_DIR / "train.csv")
test_csv_path = str(DATA_DIR / "test.csv")

train_table = pw.io.csv.read(
    train_csv_path,
    schema=TrainSchema,
    mode="static"
)


test_table = pw.io.csv.read(
    test_csv_path,
    schema=TrainSchema,
    mode="static"
)
print("Schema fields:", list(TrainSchema.__annotations__))

print("Test table created successfully")

Schema fields: ['uid', 'book_name', 'char', 'caption', 'content', 'label']
Test table created successfully
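
Both files are read with `TrainSchema`, which includes a `label` column; the notebook does not show whether `test.csv` actually contains every schema column. A small sketch of a header check that could run before the Pathway read is given below. The helper `missing_columns` and the sample rows are hypothetical, and the check uses only the standard `csv` module rather than any Pathway API.

```python
import csv
from typing import Iterable

EXPECTED_COLUMNS = ["uid", "book_name", "char", "caption", "content", "label"]


def missing_columns(lines: Iterable[str], expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from the CSV header row."""
    header = next(csv.reader(lines), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Hypothetical file content without a label column (illustrative only).
sample = [
    "uid,book_name,char,caption,content\n",
    "1,Some Book,Someone,,some backstory text\n",
]
print(missing_columns(sample, EXPECTED_COLUMNS))  # ['label']
```

An open file handle is also an iterable of lines, so the same function could be applied to a real file opened with `open(test_csv_path, newline="", encoding="utf-8")`.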


## Step 2: Load Novel Text Files

In [6]:
class BooksSchema(pw.Schema):
    title: str
    full_text: str
    file_path: str
    char_count: int
    word_count: int

In [7]:
def load_book_content(file_path: str) -> dict:
    """Read one novel and return a record matching BooksSchema."""
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    return {
        "title": Path(file_path).stem,        # file name without extension
        "full_text": content,
        "file_path": file_path,
        "char_count": len(content),
        "word_count": len(content.split()),   # whitespace-delimited tokens
    }

book_data_list = []
for book_file in BOOKS_DIR.glob("*.txt"):
    book_info = load_book_content(str(book_file))
    book_data_list.append(book_info)
    print(f"Loaded: {book_info['title']} - {book_info['word_count']:,} words")

print(f"\nTotal books loaded: {len(book_data_list)}")

Loaded: The Count of Monte Cristo - 464,020 words
Loaded: In search of the castaways - 138,830 words

Total books loaded: 2
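
The loader above assumes every book file is plain UTF-8. Both files loaded here, but text files from other sources sometimes start with a byte-order mark or use a legacy encoding. The following is a sketch of a more tolerant decoder under that assumption; `decode_book_bytes` and its encoding list are illustrative choices, not part of the notebook.

```python
def decode_book_bytes(raw: bytes, encodings=("utf-8-sig", "latin-1")) -> str:
    """Decode raw file bytes, trying each candidate encoding in order.

    "utf-8-sig" handles UTF-8 with or without a leading byte-order mark;
    "latin-1" accepts any byte sequence, so it acts as a last resort.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# A UTF-8 byte-order mark is stripped rather than kept as a stray character.
print(decode_book_bytes(b"\xef\xbb\xbfChapter 1"))        # Chapter 1
# Bytes that are not valid UTF-8 fall through to latin-1.
print(decode_book_bytes("café".encode("latin-1")))        # café
```

Inside `load_book_content`, this could replace the `open(...)` call with `decode_book_bytes(Path(file_path).read_bytes())`. Note that a latin-1 fallback never fails, so it can silently mis-decode files in other encodings.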


In [8]:
# Create books DataFrame and convert to Pathway table
books_df = pd.DataFrame(book_data_list)
books_table = pw.debug.table_from_pandas(
    books_df,
    schema=BooksSchema
)
print(f"Books table created with {len(book_data_list)} entries")

Books table created with 2 entries


## Step 3: Create Unified Joined Table

In [9]:
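# Inner join: training rows whose book_name has no exact match in
# books_table.title are dropped from the result.
joined_table = train_table.join(
    books_table,
    pw.left.book_name == pw.right.title
).select(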
    sample_id=pw.this.uid,          # renamed from uid
    character_name=pw.this.char,
    backstory=pw.this.content,      # content is the backstory
    label=pw.this.label,
    novel_title=pw.this.book_name,
    full_text=books_table.full_text,
    char_count=books_table.char_count,
    word_count=books_table.word_count
)
print("Joined table schema fields:", [name for name in joined_table.schema.__annotations__.keys()])

Joined table schema fields: ['sample_id', 'character_name', 'backstory', 'label', 'novel_title', 'full_text', 'char_count', 'word_count']


## Step 4: Data Validation & Statistics

In [10]:
train_pd = pw.debug.table_to_pandas(train_table)
joined_pd = pw.debug.table_to_pandas(joined_table)

print("=" * 60)
print("DATA STATISTICS")
print("=" * 60)

print(f"Training Data: {len(train_pd)} entries")
print(f"Joined Data: {len(joined_pd)} entries")
print(f"Unique characters: {joined_pd['character_name'].nunique()}")

print("\nLabel Distribution:")
label_counts = joined_pd['label'].value_counts()

for label, count in label_counts.items():
    # Labels are the strings "consistent" / "contradict" (see TrainSchema), not integers
    label_name = "Consistent" if label == "consistent" else "Contradict"
    percentage = (count / len(joined_pd)) * 100
    print(f"  {label_name} ({label}): {count} ({percentage:.1f}%)")


[2026-01-08T15:50:09]:INFO:Preparing Pathway computation
[2026-01-08T15:50:09]:INFO:Enter read_snapshot method with reader PosixLike
[2026-01-08T15:50:09]:INFO:FileSystem(/root/DataDivas_KDSH_2026/Data/train.csv): 0 entries (1 minibatch(es)) have been sent to the engine
[2026-01-08T15:50:09]:INFO:FileSystem(/root/DataDivas_KDSH_2026/Data/train.csv): 80 entries (2 minibatch(es)) have been sent to the engine
[2026-01-08T15:50:09]:INFO:subscribe-0: Done writing 0 entries, time 1767887409524. Current batch writes took: 0 ms. All writes so far took: 0 ms.
[2026-01-08T15:50:09]:INFO:subscribe-0: Done writing 80 entries, closing data sink. Current batch writes took: 0 ms. All writes so far took: 0 ms.

DATA STATISTICS
Training Data: 80 entries
Joined Data: 31 entries
Unique characters: 2

Label Distribution:
  Contradict (contradict): 16 (51.6%)
  Consistent (consistent): 15 (48.4%)
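
The statistics above report 31 joined rows against 80 training rows, so the inner join dropped 49 rows whose `book_name` found no exact match in `books_table.title`. One possible cause, which the notebook output does not confirm, is a difference in case or spacing between the CSV titles and the file-derived titles. The sketch below shows how such mismatches could be detected and normalized. The book titles are taken from the load output; the CSV titles and both helper functions are hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Collapse whitespace, trim, and case-fold so near-identical titles compare equal."""
    return re.sub(r"\s+", " ", title).strip().casefold()


def unmatched_titles(csv_titles, book_titles) -> list[str]:
    """Return the CSV titles that have no exact match among the book titles, sorted."""
    return sorted(set(csv_titles) - set(book_titles))


book_titles = ["The Count of Monte Cristo", "In search of the castaways"]
# Hypothetical CSV values; the real book_name values are not shown in the notebook.
csv_titles = ["The Count of Monte Cristo", "In Search of the Castaways"]

print(unmatched_titles(csv_titles, book_titles))
# ['In Search of the Castaways']

print(unmatched_titles(map(normalize_title, csv_titles),
                       map(normalize_title, book_titles)))
# []
```

If the real data shows this kind of mismatch, the same normalization could be applied to both key columns before the join, for example with `pw.apply(normalize_title, ...)` on `book_name` and `title`.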


## Summary

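Phase 1 complete. `train_table`, `test_table`, `books_table`, and `joined_table` are defined and available for Phase 2. Note that `joined_table` holds only 31 of the 80 training rows, so the `book_name` to `title` matching is worth checking before relying on it.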