# Task 2: CoNLL Format Labeling

This notebook handles:
- Loading preprocessed data
- Auto-labeling with BIO tags
- Creating a CoNLL-format dataset

In [1]:
import sys
sys.path.append('..')

import pandas as pd
from pathlib import Path
from src.labeling.conll_labeler import CoNLLLabeler

In [2]:
# Load processed data
df = pd.read_csv("../data/processed/cleaned_messages.csv")
labeler = CoNLLLabeler()

print(f"Loaded {len(df)} messages for labeling")

Loaded 13443 messages for labeling


## Sample Message Labeling

In [3]:
# Test labeling on sample messages
sample_texts = [
    "ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ",
    "ስልክ በ 2000 ብር መርካቶ",
    "ጫማ ልብስ ፒያሳ ላይ"
]

for text in sample_texts:
    labeled = labeler.auto_label_message(text)
    print(f"Text: {text}")
    print("Labels:")
    for token, label in labeled:
        print(f"  {token} -> {label}")
    print("---")

etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)


Text: ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ
Labels:
  ሻንጣ -> B-Product
  ዋጋ -> I-Product
  500 -> B-PRICE
  ብር -> I-PRICE
  አዲስ -> B-LOC
  አበባ -> I-LOC
  ቦሌ -> B-LOC
---
Text: ስልክ በ 2000 ብር መርካቶ
Labels:
  ስልክ -> B-Product
  በ -> O
  2000 -> B-PRICE
  ብር -> I-PRICE
  መርካቶ -> B-LOC
---
Text: ጫማ ልብስ ፒያሳ ላይ
Labels:
  ጫማ -> B-Product
  ልብስ -> I-Product
  ፒያሳ -> I-Product
  ላይ -> I-Product
---
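The output above pairs each token with a BIO tag: `B-` opens an entity span, `I-` continues it, and `O` marks everything else. As a rough illustration of how rule-based tagging of this kind can work, the sketch below tags prices and locations only. It is not the `CoNLLLabeler` implementation; the `LOCATIONS` gazetteer, the `CURRENCY` set, and the function name `tag_price_and_loc` are all assumptions made for this example.

```python
import re

# Hypothetical lookup sets, for illustration only (not taken from CoNLLLabeler).
LOCATIONS = {"ቦሌ", "መርካቶ", "ፒያሳ"}
CURRENCY = {"ብር"}


def tag_price_and_loc(tokens):
    """Assign BIO tags for PRICE and LOC using two simple rules."""
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        next_is_currency = i + 1 < len(tokens) and tokens[i + 1] in CURRENCY
        if re.fullmatch(r"[0-9]+", tok) and next_is_currency:
            # A number followed by a currency word forms a two-token PRICE span.
            labels[i] = "B-PRICE"
            labels[i + 1] = "I-PRICE"
        elif tok in LOCATIONS and labels[i] == "O":
            # A gazetteer hit opens a single-token LOC span.
            labels[i] = "B-LOC"
    return list(zip(tokens, labels))


tagged = tag_price_and_loc("ስልክ በ 2000 ብር መርካቶ".split())
piassa = tag_price_and_loc("ጫማ ልብስ ፒያሳ ላይ".split())
```

Under these assumed rules, `ፒያሳ` in the third sample would be tagged `B-LOC`, whereas the labeler output above tags it `I-Product`. A check of this kind may help spot product spans that run into location words.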


## Label Dataset Sample

In [4]:
# Label 50 messages in CoNLL format
conll_data = labeler.label_dataset_sample(df, num_messages=50)

print(f"Generated CoNLL data length: {len(conll_data)} characters")
print("\nFirst 500 characters:")
print(conll_data[:500])

etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)

Generated CoNLL data length: 17223 characters

First 500 characters:
```First	O
class	O
short	O
500```	O

📌	O
High	O
pressure	O
water	O
Gun	O
nozzle	O
👍መኪና	B-Product
ለማጠብ	I-Product
👍ግቢ	I-Product
ለማፅዳት	I-Product
ተመራጭ	I-Product
👍የራሱ	I-Product
መለዋወጫ	I-Product
ያለው	I-Product
ዋጋ፦	I-Product
💰🏷	I-Product
800	B-PRICE
ብር	I-PRICE
♦️ውስን	O
ፍሬ	O
ነው	O
ያለው🔥🔥🔥	O
🏢	O
አድራሻ👉	O
📍ቁ.1️⃣♦️መገናኛ	O
መሰረት	O
ደፋር	O
ሞል	O
ሁለተኛ	O
ፎቅ	O
ቢሮ	O
ቁ	O
S05/S06	O
📍	O
ቁ.2️⃣♦️ፒያሳ	B-LOC
ጊዮርጊስ	O
አደባባይ	O
ራመት_ታቦር_ኦዳ_ህንፃ	O
1ኛ	O
ፎቅ	O
ሱቅ	O
ቁ	O
G1	O
-107	O
💧💧💧💧	O
📲	O
0902660722	O
📲	O
0928460606	O
🔖	O
💬	O
በTelegram	O
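Tokens such as `👍መኪና` and `ዋጋ፦` in the output above keep emoji and punctuation attached to the word, which suggests the text was split on whitespace after the etnltk import failed (an inference from the output, not something confirmed by the labeler's source). One possible alternative is a small regex tokenizer that keeps only runs of Ethiopic letters, Latin letters, and ASCII digits. The names `TOKEN_RE` and `simple_tokenize` are hypothetical, and the sketch deliberately drops emoji and punctuation.

```python
import re

# Ethiopic syllables (U+1200-U+135A) plus combining marks (U+135D-U+135F),
# or a run of Latin letters, or a run of ASCII digits.
# Ethiopic punctuation such as U+1366 is outside these ranges and is dropped.
TOKEN_RE = re.compile(r"[\u1200-\u135A\u135D-\u135F]+|[A-Za-z]+|[0-9]+")


def simple_tokenize(text):
    """Return word and number tokens, discarding emoji and punctuation."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("👍መኪና ለማጠብ ዋጋ፦ 💰🏷 800 ብር")
print(tokens)
```

The trade-off is that mixed tokens such as `S05/S06` are split into separate letter and digit pieces, so this approach suits price and product words better than shop numbers or addresses.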


## Save CoNLL Data

In [5]:
# Save labeled data
Path("../data/labeled").mkdir(parents=True, exist_ok=True)
labeler.save_conll_data(conll_data, "../data/labeled/train_data.conll")

print("CoNLL data saved to data/labeled/train_data.conll")

CoNLL data saved to data/labeled/train_data.conll
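Because the labeled data contains Amharic text and emoji, the file encoding matters when saving, especially on Windows, where the default text encoding is often not UTF-8. The sketch below shows one way to write and re-read CoNLL text with an explicit encoding. It is not the body of `save_conll_data`; `write_conll` and `read_conll` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def write_conll(text, path):
    """Write CoNLL text as UTF-8, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


def read_conll(path):
    """Read CoNLL text back as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


sample = "800\tB-PRICE\nብር\tI-PRICE\n"
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "labeled" / "sample.conll"
    write_conll(sample, target)
    roundtrip = read_conll(target)

print(roundtrip == sample)
```

Reading the file back and comparing it with the original string is a cheap check that no characters were lost on the way to disk.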


## Analyze Labels

In [6]:
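# Count label distribution
from collections import Counter

lines = conll_data.strip().split('\n')
labels = []

for line in lines:
    # Blank lines separate messages; skip them
    if line.strip() and '\t' in line:
        # Split on the last tab so a token that contains a tab cannot break unpacking
        token, label = line.rsplit('\t', 1)
        labels.append(label)

label_counts = Counter(labels)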

print("Label Distribution:")
for label, count in label_counts.items():
    print(f"{label}: {count}")

print(f"\nTotal tokens labeled: {len(labels)}")
print(f"Entity tokens: {len(labels) - label_counts.get('O', 0)}")

Label Distribution:
O: 1887
B-Product: 16
I-Product: 88
B-PRICE: 41
I-PRICE: 20
B-LOC: 17

Total tokens labeled: 2069
Entity tokens: 182
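The counts above are per token: 182 of 2069 tokens (about 8.8%) carry an entity tag. To count entities rather than tokens, one option is to count `B-` tags, since each one opens a span. The sketch below also flags `I-` tags that do not continue a span of the same type. Both helper names are hypothetical, and the example labels are copied from the first sample message earlier in the notebook.

```python
from collections import Counter


def count_entities(labels):
    """Count entity spans per type; every B- tag starts exactly one span."""
    return Counter(label[2:] for label in labels if label.startswith("B-"))


def orphan_inside_tags(labels):
    """Return indices of I- tags not preceded by a B-/I- tag of the same type."""
    orphans = []
    prev = "O"
    for i, label in enumerate(labels):
        # prev[2:] is "" for "O", so an I- tag right after "O" is always flagged.
        if label.startswith("I-") and prev[2:] != label[2:]:
            orphans.append(i)
        prev = label
    return orphans


example = ["B-Product", "I-Product", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC", "B-LOC"]
entity_counts = count_entities(example)
orphans = orphan_inside_tags(example)
print(dict(entity_counts), orphans)
```

Passing the `labels` list built in the cell above to `count_entities` would give span counts matching the `B-` rows of the distribution. Note that this treats the whole dataset as one sequence, so an `I-` tag at the start of a message is judged against the last tag of the previous message.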
