# 01 - Data Ingest

Copy legacy conversation assets into the cleaned `assets/` layout, record their provenance, and provide quick sanity checks for downstream notebooks.

**Goals**
- Map original pickle/CSV assets into `assets/` with clearer names.
- Keep notes on shapes/fields so later steps can reuse the data without re-computing.
- Provide small helpers to inspect conversation bundles (list-of-DataFrames structure).

In [None]:

from pathlib import Path
import pandas as pd

from utils.data_io import load_df_list_pickle, flatten_conversation_bundles, describe_bundle


In [None]:

# Paths (Path.cwd() is used as the project root, so run this notebook from that directory)
PROJECT_ROOT = Path.cwd()
ASSETS_RAW = PROJECT_ROOT / 'assets' / 'raw'
ASSETS_PROCESSED = PROJECT_ROOT / 'assets' / 'processed'
SOURCE_ROOT = PROJECT_ROOT.parent / 'Raja'
YOUTUBE_SOURCE = PROJECT_ROOT.parent / 'jp_vs_cn.csv'
ASSETS_RAW.mkdir(parents=True, exist_ok=True)
ASSETS_PROCESSED.mkdir(parents=True, exist_ok=True)
ASSETS_RAW, ASSETS_PROCESSED, SOURCE_ROOT


### Asset manifest
Each row records a source asset, its cleaned target path, and a short description. Copy each source to its target with `cp` (a plain file copy, not a load and re-save) so the files stay byte-for-byte identical.

In [None]:

asset_manifest = [
    {
        'source': SOURCE_ROOT / 'Convo' / 'combat_df_list_full.pkl',
        'target': ASSETS_RAW / 'combat_threads_text_only.pkl',
        'description': '6842 dialogue segments; text only (convokit combat extraction).',
    },
    {
        'source': SOURCE_ROOT / 'Convo' / 'combat_df_list.pkl',
        'target': ASSETS_RAW / 'combat_threads_with_agu_sample.pkl',
        'description': '334 dialogue segments with agu_1 summarization.',
    },
    {
        'source': SOURCE_ROOT / 'Convo' / 'combat_df_list_imms_1_full.pkl',
        'target': ASSETS_PROCESSED / 'combat_threads_with_imitation.pkl',
        'description': '6842 dialogues with LLM imitation text imm_1 + imm_1_check.',
    },
    {
        'source': SOURCE_ROOT / 'revised_convo' / 'combat_df_list_imms_1_full_perspective.pkl',
        'target': ASSETS_PROCESSED / 'combat_threads_with_perspective.pkl',
        'description': '6842 dialogues with imitation plus raw Perspective scores.',
    },
    {
        'source': YOUTUBE_SOURCE,
        'target': ASSETS_RAW / 'jp_vs_cn_youtube_comments.csv',
        'description': 'YouTube comment dataset used in side analysis.',
    },
]

manifest_df = pd.DataFrame(asset_manifest)
manifest_df
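
The copy step described above could also be done from Python. The cell below is a minimal sketch, not part of the original workflow: `sha256_of` and `copy_manifest` are hypothetical helper names, and it assumes each manifest row has `source` and `target` path entries as in `asset_manifest`. It uses `shutil.copy2` for a plain file copy and compares SHA-256 digests to confirm the copy is byte-for-byte identical. The demo runs on a throwaway temporary directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open('rb') as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


def copy_manifest(manifest):
    """Hypothetical helper: copy each existing source to its target.

    Returns one record per row with a status and, when copied, a checksum.
    """
    records = []
    for row in manifest:
        source, target = Path(row['source']), Path(row['target'])
        if not source.exists():
            records.append({'target': target.name, 'status': 'missing source'})
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies file contents plus metadata
        target_hash = sha256_of(target)
        status = 'ok' if sha256_of(source) == target_hash else 'mismatch'
        records.append({'target': target.name, 'status': status, 'sha256': target_hash})
    return records


# Demo on a temporary directory (made-up file names, not the real assets).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / 'a.bin').write_bytes(b'\x00\x01demo')
    demo_manifest = [
        {'source': tmp / 'a.bin', 'target': tmp / 'out' / 'a_copy.bin'},
        {'source': tmp / 'absent.bin', 'target': tmp / 'out' / 'absent_copy.bin'},
    ]
    demo_records = copy_manifest(demo_manifest)
```

In this notebook the equivalent call would be `copy_manifest(asset_manifest)`. Keeping the returned checksums alongside the manifest is one way to record provenance.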


### Quick structural checks
Run this cell after the assets have been copied into `assets/` to confirm column layouts and typical shapes. The loaders are defensive: they handle both list-of-DataFrames bundles and flattened lists.

In [None]:

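for row in asset_manifest:
    target = row['target']
    # Skip assets that have not been copied into assets/ yet.
    if not target.exists():
        print(f"{target.name}: not found, skipping")
        continue
    # Only pickle bundles go through the DataFrame-list loader; the CSV asset is not a pickle.
    if target.suffix != '.pkl':
        continue
    bundle = load_df_list_pickle(target)
    summary = describe_bundle(bundle)
    print(f"{target.name}: {summary['bundle_len']} conversations")
    print(f"columns: {summary['columns']}")
    print(f"example ids: {summary['example_ids']}")
    # Flatten to a single frame for a quick look at the first rows.
    flattened = flatten_conversation_bundles(bundle)
    display(flattened.head())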
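
To make the list-of-DataFrames structure concrete, here is a small sketch of how a flattening helper might behave. It is not the implementation in `utils.data_io`: `flatten_bundle_sketch` and the `conversation_idx` column are hypothetical. It assumes a bundle is either a flat list of DataFrames (one per conversation) or a list whose items are themselves lists of DataFrames.

```python
import pandas as pd


def flatten_bundle_sketch(bundle):
    """Hypothetical stand-in for a flattening helper (not the real utils.data_io code).

    Concatenates all conversation frames into one DataFrame and tags each row
    with the index of the conversation it came from.
    """
    frames = []
    for conv_idx, item in enumerate(bundle):
        # Accept a single DataFrame or a nested list of DataFrames.
        parts = [item] if isinstance(item, pd.DataFrame) else list(item)
        for part in parts:
            frames.append(part.assign(conversation_idx=conv_idx))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Made-up demo data: two conversations with three utterances in total.
demo_flat = [
    pd.DataFrame({'text': ['hi', 'hello']}),
    pd.DataFrame({'text': ['bye']}),
]
demo_nested = [[demo_flat[0]], [demo_flat[1]]]

flat_a = flatten_bundle_sketch(demo_flat)
flat_b = flatten_bundle_sketch(demo_nested)
```

Both inputs produce the same three-row frame, which is the kind of defensive handling the structural checks rely on.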
