# Task 2: CoNLL Format Labeling

This notebook handles:
- Loading preprocessed data
- Auto-labeling with BIO tags
- Creating CoNLL format dataset
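
For orientation, the target format can be sketched as one token and its BIO label per line, separated by a tab, with a blank line between messages. The helper `to_conll` below is a hypothetical illustration of that layout, not the implementation inside `CoNLLLabeler`.

```python
from typing import List, Tuple

def to_conll(sentences: List[List[Tuple[str, str]]]) -> str:
    """Serialize labeled sentences as tab-separated token/label lines."""
    blocks = []
    for sentence in sentences:
        blocks.append("\n".join(f"{token}\t{label}" for token, label in sentence))
    # A blank line marks the boundary between sentences (messages).
    return "\n\n".join(blocks) + "\n"

example = [
    [("ስልክ", "B-Product"), ("በ", "O"), ("2000", "B-PRICE"), ("ብር", "I-PRICE")],
    [("መርካቶ", "B-LOC")],
]
print(to_conll(example))
```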

In [1]:
import sys
sys.path.append('..')

import pandas as pd
from pathlib import Path
from src.labeling.conll_labeler import CoNLLLabeler

In [2]:
# Load processed data
df = pd.read_csv("../data/processed/cleaned_messages.csv")
labeler = CoNLLLabeler()

print(f"Loaded {len(df)} messages for labeling")

Loaded 13443 messages for labeling


## Sample Message Labeling

In [3]:
# Test labeling on sample messages
sample_texts = [
    "ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ",  # approx. "bag, price 500 birr, Addis Ababa, Bole"
    "ስልክ በ 2000 ብር መርካቶ",  # approx. "phone for 2000 birr, Merkato"
    "ጫማ ልብስ ፒያሳ ላይ"  # approx. "shoes, clothes, at Piassa"
]

for text in sample_texts:
    labeled = labeler.auto_label_message(text)
    print(f"Text: {text}")
    print("Labels:")
    for token, label in labeled:
        print(f"  {token} -> {label}")
    print("---")

etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)


Text: ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ
Labels:
  ሻንጣ -> B-Product
  ዋጋ -> I-Product
  500 -> B-PRICE
  ብር -> I-PRICE
  አዲስ -> B-LOC
  አበባ -> I-LOC
  ቦሌ -> B-LOC
---
Text: ስልክ በ 2000 ብር መርካቶ
Labels:
  ስልክ -> B-Product
  በ -> O
  2000 -> B-PRICE
  ብር -> I-PRICE
  መርካቶ -> B-LOC
---
Text: ጫማ ልብስ ፒያሳ ላይ
Labels:
  ጫማ -> B-Product
  ልብስ -> I-Product
  ፒያሳ -> I-Product
  ላይ -> I-Product
---
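
Auto-generated tags are worth a structural check before training. The sketch below assumes the usual BIO convention, where an `I-X` tag may only follow `B-X` or `I-X`. The function name `find_bio_violations` is hypothetical and not part of `CoNLLLabeler`. A check like this only catches malformed sequences. It will not flag a well-formed but questionable span, such as the third sample above, where every token after the first is tagged `I-Product`.

```python
from typing import List, Tuple

def find_bio_violations(labels: List[str]) -> List[Tuple[int, str]]:
    """Return (index, label) pairs where an I- tag does not continue an entity of the same type."""
    violations = []
    previous = "O"
    for i, label in enumerate(labels):
        if label.startswith("I-"):
            entity_type = label[2:]
            # An I- tag is valid only right after B-<type> or I-<type>.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                violations.append((i, label))
        previous = label
    return violations

# Label sequence of the first sample message above: structurally valid.
ok = ["B-Product", "I-Product", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC", "B-LOC"]
# A made-up sequence with two orphaned I- tags.
bad = ["O", "I-PRICE", "B-LOC", "I-Product"]

print(find_bio_violations(ok))
print(find_bio_violations(bad))
```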


## Label Dataset Sample

In [4]:
# Label 50 messages in CoNLL format
conll_data = labeler.label_dataset_sample(df, num_messages=50)

print(f"Generated CoNLL data length: {len(conll_data)} characters")
print("\nFirst 500 characters:")
print(conll_data[:500])

etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)

Generated CoNLL data length: 17223 characters

First 500 characters:
```First	O
class	O
short	O
500```	O

📌	O
High	O
pressure	O
water	O
Gun	O
nozzle	O
👍መኪና	B-Product
ለማጠብ	I-Product
👍ግቢ	I-Product
ለማፅዳት	I-Product
ተመራጭ	I-Product
👍የራሱ	I-Product
መለዋወጫ	I-Product
ያለው	I-Product
ዋጋ፦	I-Product
💰🏷	I-Product
800	B-PRICE
ብር	I-PRICE
♦️ውስን	O
ፍሬ	O
ነው	O
ያለው🔥🔥🔥	O
🏢	O
አድራሻ👉	O
📍ቁ.1️⃣♦️መገናኛ	O
መሰረት	O
ደፋር	O
ሞል	O
ሁለተኛ	O
ፎቅ	O
ቢሮ	O
ቁ	O
S05/S06	O
📍	O
ቁ.2️⃣♦️ፒያሳ	B-LOC
ጊዮርጊስ	O
አደባባይ	O
ራመት_ታቦር_ኦዳ_ህንፃ	O
1ኛ	O
ፎቅ	O
ሱቅ	O
ቁ	O
G1	O
-107	O
💧💧💧💧	O
📲	O
0902660722	O
📲	O
0928460606	O
🔖	O
💬	O
በTelegram	O
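
The tokens in this output appear to be split on whitespace only, so emoji and punctuation stay attached to the neighbouring word. One possible fallback while the `etnltk` tokenizer is unavailable is a small regex tokenizer. The sketch below is an assumption, not the labeler's actual fallback: `simple_tokenize` and its token classes are hypothetical, and the Ethiopic range used is U+1200 to U+135F.

```python
import re
from typing import List

# Assumed token classes, tried in order:
#   1. runs of Ethiopic letters (U+1200..U+135F)
#   2. runs of ASCII digits, optionally with . or , inside (e.g. 1,500)
#   3. runs of ASCII Latin letters
#   4. any other single non-space character (emoji, punctuation)
TOKEN_PATTERN = re.compile(r"[\u1200-\u135F]+|[0-9]+(?:[.,][0-9]+)*|[A-Za-z]+|\S")

def simple_tokenize(text: str) -> List[str]:
    """Split text into script-homogeneous tokens, detaching emoji and punctuation."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("👍መኪና ዋጋ፦ 800ብር በTelegram"))
```

This split is deliberately aggressive: a code such as `S05/S06` would be broken into several tokens, so a real pipeline might need extra patterns for shop numbers and similar identifiers.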


## Save CoNLL Data

In [5]:
# Save labeled data
Path("../data/labeled").mkdir(parents=True, exist_ok=True)
labeler.save_conll_data(conll_data, "../data/labeled/train_data.conll")

print("CoNLL data saved to data/labeled/train_data.conll")

CoNLL data saved to data/labeled/train_data.conll
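
Because the data mixes Amharic text and emoji, the encoding used when writing the file matters, especially on Windows, where the default text encoding is often not UTF-8. How `save_conll_data` handles this is not shown here. The sketch below uses a hypothetical helper, `save_text_utf8`, to show an explicit UTF-8 write followed by a round-trip check in a temporary folder.

```python
from pathlib import Path
import tempfile

def save_text_utf8(text: str, path: str) -> None:
    """Write text with an explicit UTF-8 encoding, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "labeled" / "sample.conll"
    sample = "ብር\tI-PRICE\n"
    save_text_utf8(sample, str(out_path))
    # Read the file back with the same encoding to confirm nothing was garbled.
    round_trip = out_path.read_text(encoding="utf-8")
    print(round_trip == sample)
```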


## Analyze Labels

In [6]:
# Count label distribution
from collections import Counter

labels = []
for line in conll_data.strip().split('\n'):
    # Skip the blank lines that separate messages
    if line.strip() and '\t' in line:
        # Split on the last tab so the label is always the final field
        token, label = line.rsplit('\t', 1)
        labels.append(label)

label_counts = Counter(labels)

print("Label Distribution:")
for label, count in label_counts.items():
    print(f"{label}: {count}")

print(f"\nTotal tokens labeled: {len(labels)}")
print(f"Entity tokens: {len(labels) - label_counts.get('O', 0)}")

Label Distribution:
O: 1887
B-Product: 16
I-Product: 88
B-PRICE: 41
I-PRICE: 20
B-LOC: 17

Total tokens labeled: 2069
Entity tokens: 182
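
The raw counts can be turned into per-entity figures with a little arithmetic. The sketch below copies the counts printed above and assumes each `B-` tag opens exactly one entity span. It derives the share of entity tokens and the average span length per type.

```python
from collections import Counter

# Counts copied from the label distribution printed above.
label_counts = Counter({
    "O": 1887, "B-Product": 16, "I-Product": 88,
    "B-PRICE": 41, "I-PRICE": 20, "B-LOC": 17,
})

total = sum(label_counts.values())
# Each B- tag opens one entity span, so B- counts give the number of spans per type.
spans = {label[2:]: n for label, n in label_counts.items() if label.startswith("B-")}
# Average span length = (B tokens + I tokens) / B tokens for that type.
avg_len = {t: (n + label_counts.get(f"I-{t}", 0)) / n for t, n in spans.items()}

print(f"Entity token share: {(total - label_counts['O']) / total:.1%}")
for t in spans:
    print(f"{t}: {spans[t]} spans, {avg_len[t]:.2f} tokens per span")
```

With these counts, Product spans average 6.5 tokens while LOC spans are always a single token in this sample, which may be worth a manual look before the data is used for training.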
