# Amharic NER Project Setup & Overview

This notebook provides:
- Project setup and environment configuration
- Overview of all tasks
- Quick pipeline execution

## Environment Setup

In [None]:
# Install required packages into the environment of the running kernel
%pip install -r ../requirements.txt

In [None]:
# Check environment
import sys
import torch
import transformers
import pandas as pd

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"Pandas: {pd.__version__}")

# Check GPU availability
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU: Not available (using CPU)")

## Project Structure

In [1]:
from pathlib import Path

def show_tree(path, prefix="", max_depth=3, current_depth=0):
    """Print a directory tree rooted at `path`, skipping hidden entries (names starting with '.')."""
    if current_depth > max_depth:
        return
    
    path = Path(path)
    items = sorted([p for p in path.iterdir() if not p.name.startswith('.')])
    
    for i, item in enumerate(items):
        is_last = i == len(items) - 1
        current_prefix = "└── " if is_last else "├── "
        print(f"{prefix}{current_prefix}{item.name}")
        
        if item.is_dir() and current_depth < max_depth:
            next_prefix = prefix + ("    " if is_last else "│   ")
            show_tree(item, next_prefix, max_depth, current_depth + 1)

print("Project Structure:")
show_tree("..")

Project Structure:
├── config
│   └── config.py
├── data
│   ├── labeled
│   ├── processed
│   └── raw
├── main_pipeline.py
├── models
│   └── checkpoints
├── notebooks
│   ├── 00_setup_and_overview.ipynb
│   ├── 01_data_ingestion.ipynb
│   ├── 02_conll_labeling.ipynb
│   ├── 03_model_training.ipynb
│   ├── 04_model_evaluation.ipynb
│   ├── 05_interpretability.ipynb
│   └── 06_vendor_analytics.ipynb
├── README.md
├── requirements.txt
├── scripts
│   └── run_individual_tasks.py
├── src
│   ├── __init__.py
│   ├── data_ingestion
│   │   └── telegram_scraper.py
│   ├── evaluation
│   │   └── model_evaluator.py
│   ├── interpretability
│   │   └── model_explainer.py
│   ├── labeling
│   │   └── conll_labeler.py
│   ├── preprocessing
│   │   └── text_processor.py
│   ├── training
│   │   └── ner_trainer.py
│   ├── utils
│   │   ├── __init__.py
│   │   └── etnltk_helper.py
│   └── vendor_analytics
│       └── scorecard.py
└── tests


## Task Overview

### Task 1: Data Ingestion & Preprocessing
- Scrape 5+ Ethiopian Telegram e-commerce channels
- Clean and preprocess Amharic text using etnltk
- Extract entity hints and metadata
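
As a rough illustration of what an "entity hint" could look like, the sketch below pulls price candidates out of a message with a regular expression. The function name `extract_price_hints` and the pattern are assumptions made for this example, not the implementation in `text_processor.py`; the pattern only covers the `<number> ብር` form seen in the sample text used later in this notebook.

```python
import re

# Hypothetical pattern: a number (optionally with thousands separators)
# followed by "ብር" (birr), with or without a space in between.
PRICE_PATTERN = re.compile(r"(\d[\d,]*)\s*ብር")

def extract_price_hints(text: str) -> list[int]:
    """Return integer price candidates found in an Amharic message."""
    return [int(match.replace(",", "")) for match in PRICE_PATTERN.findall(text)]

print(extract_price_hints("ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ ላይ"))  # [500]
```

A hint like this is only a candidate; the labeling and training steps decide which tokens are actually tagged as prices.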

### Task 2: CoNLL Labeling
- Auto-label 30-50 messages in CoNLL format
- Entity types: Product, Location (`LOC`), Price (`PRICE`)
- BIO tagging scheme
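
To make the target format concrete, here is a minimal sketch of serializing BIO-tagged tokens to CoNLL-style text: one token and its label per line, with a blank line between messages. The helper `to_conll` and the space separator are assumptions for illustration; `conll_labeler.py` may use a different separator or layout.

```python
def to_conll(sentences):
    """Serialize sentences of (token, label) pairs to CoNLL-style text.

    Each token goes on its own line as 'token label'; sentences are
    separated by a blank line.
    """
    blocks = ["\n".join(f"{tok} {lab}" for tok, lab in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"

example = [[
    ("ሻንጣ", "B-Product"),
    ("500", "B-PRICE"),
    ("ብር", "I-PRICE"),
    ("ቦሌ", "B-LOC"),
    ("ላይ", "O"),
]]
print(to_conll(example))
```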

### Task 3: Model Training
- Fine-tune XLM-RoBERTa, DistilBERT, mBERT
- Use Hugging Face Transformers
- GPU-accelerated training when CUDA is available, otherwise CPU
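
Fine-tuning a subword model for NER requires mapping word-level BIO labels onto subword tokens. The sketch below shows one common convention: only the first subword of each word keeps its label, and every other position gets `-100`, which PyTorch's cross-entropy loss ignores by default. The function `align_labels` is a hypothetical helper, not code from `ner_trainer.py`; its `word_ids` argument has the shape returned by `BatchEncoding.word_ids()` for fast Hugging Face tokenizers, and the subword split in the example is made up.

```python
# Label order taken from the project's configured entity labels.
LABELS = ["O", "B-Product", "I-Product", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_labels(word_ids, word_labels):
    """Map word-level labels to subword positions.

    word_ids: one entry per subword; None for special tokens, otherwise the
    index of the word the subword belongs to.
    Only the first subword of each word keeps its label.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(LABEL2ID[word_labels[wid]])
        previous = wid
    return aligned

# Hypothetical split: word 0 -> 2 subwords, words 1 and 2 -> 1 subword each.
print(align_labels([None, 0, 0, 1, 2, None], ["B-Product", "B-PRICE", "I-PRICE"]))
# [-100, 1, -100, 5, 6, -100]
```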

### Task 4: Model Evaluation
- Compare models on accuracy, speed, robustness
- Select best performing model
- Generate comparison reports
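
For NER, token-level accuracy is dominated by the `O` class, so model comparisons are usually more informative at the entity level. The following sketch computes exact-match span precision, recall, and F1 from BIO sequences. It is an illustrative stand-in, not the metric code in `model_evaluator.py`; the helper names are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored.

```python
def extract_spans(labels):
    """Return a set of (start, end_exclusive, type) spans from a BIO sequence."""
    spans, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and kind == label[2:]
        if not inside:
            if kind is not None:
                spans.add((start, i, kind))
            start, kind = (i, label[2:]) if label.startswith("B-") else (None, None)
    return spans

def span_f1(gold, pred):
    """Precision, recall, and F1 over exact-match entity spans."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = ["B-Product", "I-Product", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC", "B-LOC", "O"]
pred = ["B-Product", "I-Product", "B-PRICE", "I-PRICE", "B-LOC", "I-LOC", "O", "O"]
print(span_f1(gold, pred))  # precision 1.0, recall 0.75
```

In this made-up prediction, three of the four gold spans are recovered and nothing spurious is predicted, so precision is 1.0 and recall is 0.75.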

### Task 5: Interpretability
- SHAP and LIME explanations
- Difficult case analysis
- Model transparency reports

### Task 6: Vendor Analytics
- Calculate vendor performance metrics
- Generate micro-lending scorecards
- Business insights and recommendations
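
The notebook does not specify which metrics or weights the scorecard uses, so the sketch below is purely illustrative: it combines two hypothetical metrics (posting frequency and average views per post) into a score between 0 and 1 using made-up equal weights. None of the names, metrics, or weights come from `scorecard.py`.

```python
from dataclasses import dataclass

@dataclass
class VendorStats:
    # Hypothetical metrics, chosen only for this example.
    posts_per_week: float
    avg_views_per_post: float

# Hypothetical weights; a real scorecard would need to justify these.
WEIGHTS = {"activity": 0.5, "reach": 0.5}

def lending_score(stats: VendorStats, max_posts: float, max_views: float) -> float:
    """Weighted sum of activity and reach, each scaled to [0, 1] by a reference maximum."""
    activity = min(stats.posts_per_week / max_posts, 1.0) if max_posts else 0.0
    reach = min(stats.avg_views_per_post / max_views, 1.0) if max_views else 0.0
    return WEIGHTS["activity"] * activity + WEIGHTS["reach"] * reach

# Made-up vendor: half the reference posting rate and half the reference reach.
print(lending_score(VendorStats(5, 1000), max_posts=10, max_views=2000))  # 0.5
```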

## Configuration Check

In [3]:
import sys
from pathlib import Path

# Make the project root importable from the notebooks/ directory
sys.path.append('..')
from config.config import telegram_config, model_config, data_config

print("Configuration:")
print(f"Target Channels: {len(telegram_config.channels)}")
print(f"Models to Train: {len(model_config.model_names)}")
print(f"Entity Labels: {data_config.entity_labels}")

# Check if .env file exists
env_file = Path("../.env")
if env_file.exists():
    print("✅ Environment file configured")
else:
    print("⚠️  Please copy .env.example to .env and configure Telegram API credentials")

Configuration:
Target Channels: 9
Models to Train: 3
Entity Labels: ['O', 'B-Product', 'I-Product', 'B-LOC', 'I-LOC', 'B-PRICE', 'I-PRICE']
✅ Environment file configured


## Quick Pipeline Test

In [4]:
# Test core components
from src.preprocessing.text_processor import AmharicTextProcessor
from src.labeling.conll_labeler import CoNLLLabeler
from src.vendor_analytics.scorecard import VendorAnalytics

# Test text processing
processor = AmharicTextProcessor()
sample_text = "ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ ላይ"
cleaned = processor.clean_text(sample_text)
tokens = processor.tokenize_amharic(sample_text)

print(f"Original: {sample_text}")
print(f"Cleaned: {cleaned}")
print(f"Tokens: {tokens}")

# Test labeling
labeler = CoNLLLabeler()
labeled = labeler.auto_label_message(sample_text)
print("\nLabeled tokens:")
for token, label in labeled:
    print(f"  {token} -> {label}")

print("\n✅ Core components working correctly!")

etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)
etnltk tokenization failed: cannot import name 'word_tokenize' from 'etnltk.tokenize' (c:\Users\W-HP\AppData\Local\Programs\Python\Python313\Lib\site-packages\etnltk\tokenize\__init__.py)


Original: ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ ላይ
Cleaned: ሻንጣ ዋጋ 500 ብር አዲስ አበባ ቦሌ ላይ
Tokens: ['ሻንጣ', 'ዋጋ', '500', 'ብር', 'አዲስ', 'አበባ', 'ቦሌ', 'ላይ']

Labeled tokens:
  ሻንጣ -> B-Product
  ዋጋ -> I-Product
  500 -> B-PRICE
  ብር -> I-PRICE
  አዲስ -> B-LOC
  አበባ -> I-LOC
  ቦሌ -> B-LOC
  ላይ -> O

✅ Core components working correctly!
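
The warning in the output above says that `word_tokenize` could not be imported from `etnltk.tokenize`, yet tokens were still produced. This suggests the processor fell back to another tokenizer, although that is an inference from the output rather than something confirmed here. One way to investigate is to list what the installed module actually exports. The helper below is a hypothetical diagnostic that uses only the standard library and makes no assumption about etnltk's API.

```python
import importlib

def public_names(module_name: str) -> list[str]:
    """List the public attributes of a module, or [] if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []
    return sorted(name for name in dir(module) if not name.startswith("_"))

# Shows which tokenizer functions the installed etnltk version exposes.
print(public_names("etnltk.tokenize"))
```

Comparing this list with the names imported in `src/utils/etnltk_helper.py` should show whether the import needs updating or the package version needs pinning in `requirements.txt`.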


## Next Steps

1. **Configure Environment**: Copy `.env.example` to `.env` and fill in your Telegram API credentials
2. **Run Task 1**: Execute `01_data_ingestion.ipynb` to scrape and preprocess data
3. **Run Task 2**: Execute `02_conll_labeling.ipynb` to create labeled dataset
4. **Run Task 3**: Execute `03_model_training.ipynb` to train NER models
5. **Run Task 4**: Execute `04_model_evaluation.ipynb` to compare models
6. **Run Task 5**: Execute `05_interpretability.ipynb` for model explanations
7. **Run Task 6**: Execute `06_vendor_analytics.ipynb` for business insights

**Alternative**: Run `python main_pipeline.py` from the project root to execute all tasks automatically.