# Prepare Data for Doccano Labeling

This notebook demonstrates how to load, clean, segment, and convert parliamentary documents into the JSONL format expected by Doccano. The workflow uses the `process_file` and `segment_documents` functions from the `src.preprocessing` module, then converts the resulting segments to JSONL and shuffles them so that annotators do not see them in their original order.

## 1. Load and Clean Data

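Load the original data from `combined_transformed_data.json`, clean the text, and split the documents into segments using the utility functions. The cell prints a summary of how many documents and segments were removed at each filtering step.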

In [None]:
import sys
import os
sys.path.append(os.path.abspath('..'))
from src.preprocessing import process_file, segment_documents

# Load and clean the data
cleaned_data = process_file('../01data_collection/combined_transformed_data.json', mode='basic', output_mode='memory')

# Split the cleaned documents into segments ("batches")
batched_data = segment_documents(cleaned_data)

===== Summary =====
Total original documents: 4647
Removed (empty): 470
Removed (≤ 10 tokens): 139
Removed (> 15000 tokens): 131
Remaining valid documents: 3907
Total segments generated: 8086
Average segments per document: 2.07
Removed (TokenLength < 30): 92
Remaining segments: 7994
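
The counts in the summary are internally consistent, which can be confirmed with a few lines of arithmetic. This is only a sanity check: the numbers are copied from the output above, and the variable names are illustrative rather than taken from `src.preprocessing`.

```python
# Figures copied from the summary output above; variable names are illustrative.
total_docs = 4647
removed_empty = 470
removed_short = 139   # documents with <= 10 tokens
removed_long = 131    # documents with > 15000 tokens

remaining_docs = total_docs - removed_empty - removed_short - removed_long

total_segments = 8086
removed_short_segments = 92   # segments with TokenLength < 30
remaining_segments = total_segments - removed_short_segments

# The average is computed over valid documents, before the segment-length filter.
avg_segments = round(total_segments / remaining_docs, 2)

print(remaining_docs, remaining_segments, avg_segments)
```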


## 2. Convert Batched Data to JSONL Format

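Transform the segmented data into the JSONL structure required by Doccano: one entry per segment with the segment text under `text`, the document metadata under `meta`, and an empty `label` list to be filled during annotation.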

In [2]:
import json

jsonl_entries = []
for entry in batched_data:
    jsonl_entry = {
        'text': entry['SegmentText'],
        'meta': {
            'segment_id': entry['SegmentID'],
            'doc_id': entry['ID'],
            'landtag': entry['Landtag'],
            'datum': entry['Datum'],
            'filter': entry['FilterDetails'],
            'links': entry['Links'],
            'beschreibung': entry.get('Beschreibungstext', '')
        },
        'label': []
    }
    jsonl_entries.append(jsonl_entry)
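
The conversion loop accesses most fields with `entry[...]`, so a single record without, say, `Links` would raise a `KeyError` partway through. A minimal sketch of a pre-check is shown below. The helper `find_incomplete` and the sample records are hypothetical and not part of `src.preprocessing`; only the key names are taken from the cell above.

```python
# Keys the conversion cell reads with entry[...] (Beschreibungstext is optional there).
REQUIRED_KEYS = ('SegmentText', 'SegmentID', 'ID', 'Landtag',
                 'Datum', 'FilterDetails', 'Links')

def find_incomplete(records):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for i, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            problems.append((i, missing))
    return problems

# Hypothetical sample records; all values are invented for illustration.
samples = [
    {'SegmentText': 'Beispieltext', 'SegmentID': 'a-1', 'ID': 'a',
     'Landtag': 'Example', 'Datum': '2020-01-01',
     'FilterDetails': 'example', 'Links': []},
    {'SegmentText': 'Noch ein Text', 'SegmentID': 'b-1', 'ID': 'b',
     'Landtag': 'Example', 'FilterDetails': 'example'},
]

problems = find_incomplete(samples)
print(problems)
```

In the notebook, the same check could be run as `find_incomplete(batched_data)` before building `jsonl_entries`; an empty result means every segment carries all required fields.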

## 3. Shuffle the JSONL Data

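Randomize the order of entries so that segments from the same document are not annotated back to back, which reduces order effects during labeling. Note that `random.shuffle` reorders the list in place and, without a fixed seed, produces a different order on every run.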

In [3]:
import random
random.shuffle(jsonl_entries)
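
If the exported file needs to be regenerated with the same order (for example, to match annotations that were already imported), a seeded shuffle is one option. The sketch below is an assumption about how this could be done, not what the notebook currently does; `shuffled_copy` and the seed value are hypothetical.

```python
import random

def shuffled_copy(entries, seed=42):
    """Return a shuffled copy of entries; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global RNG untouched
    result = list(entries)      # copy so the input list keeps its original order
    rng.shuffle(result)
    return result

# Small stand-in for jsonl_entries.
demo = [{'text': f'segment {i}'} for i in range(5)]
first = shuffled_copy(demo)
second = shuffled_copy(demo)

print(first == second)   # both runs produce the identical order
```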

## 4. Save as Doccano-Ready JSONL

Write the shuffled entries to `doccano_ready_data.jsonl` for import into Doccano.

In [4]:
with open('doccano_ready_data.jsonl', 'w', encoding='utf-8') as f_out:
    for entry in jsonl_entries:
        f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')
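
To confirm that the file on disk matches the in-memory list, the JSONL can be read back and compared entry by entry. The following is a minimal sketch using only the standard library; `read_jsonl` is a hypothetical helper, and the demo writes to a temporary directory so it does not touch `doccano_ready_data.jsonl`.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Invented demo entries with non-ASCII text to exercise ensure_ascii=False.
demo_entries = [
    {'text': 'Änderung des Gesetzes', 'meta': {'segment_id': 'demo-1'}, 'label': []},
    {'text': 'Große Anfrage', 'meta': {'segment_id': 'demo-2'}, 'label': []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.jsonl')
    # Same writing pattern as the cell above.
    with open(path, 'w', encoding='utf-8') as f_out:
        for entry in demo_entries:
            f_out.write(json.dumps(entry, ensure_ascii=False) + '\n')
    loaded = read_jsonl(path)

round_trip_ok = loaded == demo_entries
print(round_trip_ok)
```

In the notebook itself, the equivalent check would be `read_jsonl('doccano_ready_data.jsonl') == jsonl_entries`.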

In [None]:
# TODO: consider adapting the function so that it simply checks any files, not only a list[dict] against a JSON file
# Check whether the documents are the same:
#from src.io_utils import compare_list_to_json
#
#s1 = json.dumps(data_list, ensure_ascii=False, indent=2)
#with open('doccano_ready_data.jsonl', 'r', encoding='utf-8') as f:
#       s2 = f.read()
#
#compare_list_to_json(s2, 'shuffled.jsonl')

FileNotFoundError: [Errno 2] No such file or directory: 'shuffled.jsonl'

---
**Summary:** This notebook loads, cleans, batches, converts, shuffles, and saves parliamentary documents in a format ready for Doccano labeling.