<a href="https://colab.research.google.com/github/segnig/Amharic-E-commerce-Data-Extractor/blob/task-1/notebooks/task_one.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing libraries

In [29]:
import pandas as pd
import numpy as np

import etnltk

## Importing data   

data = pd.read_csv("../data/telegram_data.csv")

## Previewing data
data.head()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
0,Zemen Express®,@ZemenExpress,6982,💥💥...................................💥💥\n\n📌Im...,2025-06-18 06:01:10+00:00,
1,Zemen Express®,@ZemenExpress,6981,💥💥...................................💥💥\n\n📌 B...,2025-06-16 12:21:00+00:00,
2,Zemen Express®,@ZemenExpress,6980,,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6980.jpg
3,Zemen Express®,@ZemenExpress,6979,,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6979.jpg
4,Zemen Express®,@ZemenExpress,6978,,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6978.jpg


## Drop null messages

Remove chat entries whose `Message` value is null.

In [30]:
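# Keep rows with a non-null Message; .copy() returns an independent DataFrame
# that is safe to add new columns to in the cells that follow
data_clean = data[data["Message"].notna()].copy()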

print("Total number of chat: ", len(data_clean))

print("Total number of message: ", len(data_clean["Message"]))


Total number of chat:  23606
Total number of message:  23606
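
The filter can be tried on a small toy frame. The sketch below uses made-up data, and the names `toy`, `toy_clean`, and `same_rows` are hypothetical. It also calls `.copy()` on the filtered result, which is one common way to avoid pandas' `SettingWithCopyWarning` when columns are added to a filtered frame later on.

```python
import pandas as pd

# Toy frame standing in for the Telegram export (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Message": ["hello", None, "ዋጋ 500 ብር"],
})

# The boolean mask keeps rows whose Message is not null; .copy() makes the
# result an independent DataFrame, so adding a column does not touch `toy`.
toy_clean = toy[toy["Message"].notna()].copy()
toy_clean["length"] = toy_clean["Message"].str.len()

# dropna(subset=...) selects the same rows.
same_rows = toy.dropna(subset=["Message"])

print(len(toy), len(toy_clean))
```

Both forms keep the original index labels, which is why the filtered notebook output shows row labels with gaps.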


## Text Preprocessing

### **Key Steps & Functions**
1. **Imports from `etnltk`**:
   - `remove_emojis`: Strips emojis (e.g., 😊 → "").
   - `remove_links`: Removes URLs (e.g., `https://example.com` → "").
   - `remove_ethiopic_punct`: Clears Ethiopic punctuation (e.g., `።` → "").
   - `remove_tags`: Deletes social tags (e.g., `@user` → "").
   - `clean_amharic`: Master function for end-to-end cleaning.

2. **Custom Function: `remove_telegram_tags`**:
   - Removes whole hashtag words, i.e. any whitespace-separated word that starts with `#` (e.g., `#Amharic` → "").

3. **Pipeline (`text_process`)**:
   - Applies cleaning steps sequentially:
     1. Remove Telegram hashtags (`#`).
     2. Strip emojis.
     3. Delete links.
     4. Clear Ethiopic punctuation.
     5. Remove social tags (`@`, `#`).
   - Uses `clean_amharic` with the custom pipeline.

4. **Execution**:
   - Applies `text_process` to each message in `data_clean["Message"]`.
   - Stores cleaned output in `data_clean["cleaned_message"]`.
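
The idea of applying cleaning steps sequentially can be sketched without `etnltk`. In the example below, `strip_links`, `strip_mentions`, `strip_ethiopic_punct`, and `run_pipeline` are hypothetical regex stand-ins written for illustration; they are not the library's functions, and the library's behavior may differ in detail.

```python
import re
from functools import reduce

# Hypothetical stand-ins for the etnltk cleaners (assumptions, not the real API).
def strip_links(text):
    return re.sub(r"https?://\S+", "", text)

def strip_mentions(text):
    return re.sub(r"@\w+", "", text)

def strip_ethiopic_punct(text):
    # Ethiopic punctuation occupies U+1360 to U+1368 (the full stop is U+1362).
    return re.sub(r"[\u1360-\u1368]", "", text)

def run_pipeline(text, steps):
    """Feed the output of each step into the next one, in list order."""
    return reduce(lambda acc, step: step(acc), steps, text)

sample = "ዋጋ 500 ብር። @ZemenExpress https://example.com"
result = run_pipeline(sample, [strip_links, strip_mentions, strip_ethiopic_punct])
result = " ".join(result.split())  # collapse whitespace left behind by removals
print(result)
```

Order can matter in a pipeline like this: removing punctuation before links, for instance, could break a URL pattern that relies on `:` or `/`.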


In [40]:
# Import Amharic text preprocessing functions from etnltk (Ethiopian Natural Language Toolkit)
from etnltk.lang.am.preprocessing import (
    remove_emojis,       # Removes emoji characters
    remove_links,        # Removes URLs and web addresses
    remove_ethiopic_punct,  # Removes Ethiopic punctuation marks
    remove_tags          # Removes social tags (@mentions, #hashtags)
)
from etnltk.lang.am import clean_amharic  # Main Amharic text cleaning function


def remove_telegram_tags(text):
    """Remove Telegram-style hashtags from text while preserving other words"""
    words = text.split()  # Split text into individual words
    # Filter out words that start with '#' (hashtags)
    words = [word for word in words if not word.startswith("#")] 
    return " ".join(words)  # Rejoin remaining words into a string
  

def text_process(text):
    """
    Process and clean Amharic text using a custom pipeline of cleaning functions.
    Returns standardized, cleaned text ready for analysis.
    """
    # Define the sequence of cleaning operations to apply
    custom_pipeline = [
        remove_telegram_tags,   # First: Remove Telegram hashtags
        remove_emojis,          # Then: Remove emoji characters
        remove_links,           # Then: Remove web URLs
        remove_ethiopic_punct,  # Then: Remove Ethiopic punctuation
        remove_tags             # Finally: Remove social tags
    ]

    # Apply the cleaning pipeline using etnltk's clean_amharic function
    # abbrev=False prevents abbreviation expansion
    cleaned_text = clean_amharic(text, abbrev=False, pipeline=custom_pipeline)

    return cleaned_text


# Apply the text processing function to each message in the DataFrame
# Creates new column 'cleaned_message' with processed text
data_clean["cleaned_message"] = data_clean["Message"].apply(text_process)

# Display the first few rows to verify cleaning worked
data_clean.head()


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,cleaned_message,normalized_message,tokenized_meassage
0,Zemen Express®,@ZemenExpress,6982,💥💥...................................💥💥\n\n📌Im...,2025-06-18 06:01:10+00:00,,Imitation Volcano Humidifier with LED Light በኤ...,Imitation Volcano Humidifier with LED Light በኤ...,"[በኤሌክትሪክየሚሰራ, ለቤት, መልካም, መዓዛን, የሚሰጥ, ዋጋ, ብር, ው..."
1,Zemen Express®,@ZemenExpress,6981,💥💥...................................💥💥\n\n📌 B...,2025-06-16 12:21:00+00:00,,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,"[በፈለጉት, አቅጣጫ, ልጅዎን, በምቾት, ማዘል, ያስችልዎታል, ዋጋ, ብር..."
9,Zemen Express®,@ZemenExpress,6973,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6973.jpg,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
10,Zemen Express®,@ZemenExpress,6972,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:26+00:00,,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
12,Zemen Express®,@ZemenExpress,6970,💥💥...................................💥💥\n\n📌Ba...,2025-06-16 05:09:03+00:00,data\photos\@ZemenExpress_6970.jpg,Baby Head Helmet Cotton Walk Safety Hat Breath...,Baby Head Helmet Cotton Walk Safety Hat Breath...,"[ዋጋ, ብር, ውስን, ፍሬ, ነው, ያለን, አድራሻ, መገናኛመሰረትደፋርሞል..."
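
A quick check of the custom hashtag helper makes its behavior concrete. The function body below repeats the notebook's logic so the snippet is self-contained; the sample string is made up.

```python
def remove_telegram_tags(text):
    """Drop every whitespace-separated word that starts with '#'."""
    return " ".join(word for word in text.split() if not word.startswith("#"))

sample = "Baby Carrier\n\n#new #sale ዋጋ 500 ብር\nCall now#1"
cleaned = remove_telegram_tags(sample)
print(cleaned)
```

Three things follow from the `split()`/`join()` approach: the whole hashtag word is removed rather than just the `#` character, newlines and repeated spaces collapse into single spaces, and a `#` in the middle of a word (as in `now#1`) is left alone.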


In [41]:
# Normalize labialized Amharic spellings to a single standard form
# Example: a two-character spelling such as 'ሉዋ' is rewritten as the single labialized character 'ሏ'
# This ensures consistent text representation for downstream NLP tasks
from etnltk.lang.am.normalizer import normalize_labialized

# Apply normalization to each cleaned message in the DataFrame
# Creates new column 'normalized_message' with standardized characters
data_clean["normalized_message"] = data_clean["cleaned_message"].apply(normalize_labialized)

# Display sample results to verify normalization
data_clean.head()


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,cleaned_message,normalized_message,tokenized_meassage
0,Zemen Express®,@ZemenExpress,6982,💥💥...................................💥💥\n\n📌Im...,2025-06-18 06:01:10+00:00,,Imitation Volcano Humidifier with LED Light በኤ...,Imitation Volcano Humidifier with LED Light በኤ...,"[በኤሌክትሪክየሚሰራ, ለቤት, መልካም, መዓዛን, የሚሰጥ, ዋጋ, ብር, ው..."
1,Zemen Express®,@ZemenExpress,6981,💥💥...................................💥💥\n\n📌 B...,2025-06-16 12:21:00+00:00,,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,"[በፈለጉት, አቅጣጫ, ልጅዎን, በምቾት, ማዘል, ያስችልዎታል, ዋጋ, ብር..."
9,Zemen Express®,@ZemenExpress,6973,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6973.jpg,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
10,Zemen Express®,@ZemenExpress,6972,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:26+00:00,,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
12,Zemen Express®,@ZemenExpress,6970,💥💥...................................💥💥\n\n📌Ba...,2025-06-16 05:09:03+00:00,data\photos\@ZemenExpress_6970.jpg,Baby Head Helmet Cotton Walk Safety Hat Breath...,Baby Head Helmet Cotton Walk Safety Hat Breath...,"[ዋጋ, ብር, ውስን, ፍሬ, ነው, ያለን, አድራሻ, መገናኛመሰረትደፋርሞል..."
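
One way to picture labialized-form normalization is as a small substitution table. The sketch below assumes the rule merges a two-character spelling (a "u"-order character followed by ዋ or አ) into the single labialized character. The table `LABIALIZED` and the function `normalize_labialized_sketch` are hypothetical and cover only three characters; etnltk's own normalizer is the authority on the actual rules.

```python
import re

# Illustrative subset only (an assumption for this sketch, not etnltk's rule set):
# "u"-order character followed by U+12CB or U+12A0 -> single labialized character.
LABIALIZED = {
    "\u1209": "\u120f",  # lu + wa/a -> lwa
    "\u1219": "\u121f",  # mu + wa/a -> mwa
    "\u1271": "\u1277",  # tu + wa/a -> twa
}

_PATTERN = re.compile("([" + "".join(LABIALIZED) + "])[\u12cb\u12a0]")

def normalize_labialized_sketch(text):
    """Collapse a two-character labialized spelling into its single character."""
    return _PATTERN.sub(lambda m: LABIALIZED[m.group(1)], text)

word = "\u121e\u120d\u1271\u12cb\u120d"  # five characters, contains tu + wa
normalized = normalize_labialized_sketch(word)
print(word, "->", normalized)
```

Text that contains none of the listed sequences, including Latin-script product names, passes through unchanged.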


### Word tokenization

In [42]:
# Tokenize normalized Amharic text into individual words using etnltk's Amharic tokenizer
# This converts each message from a string to a list of words (tokens) for further NLP processing
# Example: "ዛሬ በሰማይ ነጭ ደመና" → ["ዛሬ", "በሰማይ", "ነጭ", "ደመና"]
# Note: in the sample output, Latin-script words (e.g. product names) do not appear among the tokens
from etnltk.tokenize.am import word_tokenize

# Apply word tokenization to each normalized message in the DataFrame
# Creates new column 'tokenized_message' containing lists of word tokens
data_clean["tokenized_message"] = data_clean["normalized_message"].apply(word_tokenize)

# Display the first 5 rows to verify tokenization results
# Shows original message, cleaned/normalized versions, and final tokenization
data_clean.head()


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,cleaned_message,normalized_message,tokenized_meassage,tokenized_message
0,Zemen Express®,@ZemenExpress,6982,💥💥...................................💥💥\n\n📌Im...,2025-06-18 06:01:10+00:00,,Imitation Volcano Humidifier with LED Light በኤ...,Imitation Volcano Humidifier with LED Light በኤ...,"[በኤሌክትሪክየሚሰራ, ለቤት, መልካም, መዓዛን, የሚሰጥ, ዋጋ, ብር, ው...","[በኤሌክትሪክየሚሰራ, ለቤት, መልካም, መዓዛን, የሚሰጥ, ዋጋ, ብር, ው..."
1,Zemen Express®,@ZemenExpress,6981,💥💥...................................💥💥\n\n📌 B...,2025-06-16 12:21:00+00:00,,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል ...,"[በፈለጉት, አቅጣጫ, ልጅዎን, በምቾት, ማዘል, ያስችልዎታል, ዋጋ, ብር...","[በፈለጉት, አቅጣጫ, ልጅዎን, በምቾት, ማዘል, ያስችልዎታል, ዋጋ, ብር..."
9,Zemen Express®,@ZemenExpress,6973,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:57+00:00,data\photos\@ZemenExpress_6973.jpg,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ...","[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
10,Zemen Express®,@ZemenExpress,6972,💥💥...................................💥💥\n\n📌Sm...,2025-06-16 05:11:26+00:00,,Smart Usb Ultrasonic Car And Home Air Humidifi...,Smart Usb Ultrasonic Car And Home Air Humidifi...,"[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ...","[በኤሌክትሪክ, የሚሰራ, ለቤትና, ለመኪና, መልካም, መዓዛን, የሚሰጥ, ..."
12,Zemen Express®,@ZemenExpress,6970,💥💥...................................💥💥\n\n📌Ba...,2025-06-16 05:09:03+00:00,data\photos\@ZemenExpress_6970.jpg,Baby Head Helmet Cotton Walk Safety Hat Breath...,Baby Head Helmet Cotton Walk Safety Hat Breath...,"[ዋጋ, ብር, ውስን, ፍሬ, ነው, ያለን, አድራሻ, መገናኛመሰረትደፋርሞል...","[ዋጋ, ብር, ውስን, ፍሬ, ነው, ያለን, አድራሻ, መገናኛመሰረትደፋርሞል..."
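
The sample rows show that the English product names present in `normalized_message` are absent from the token lists. A rough approximation of that effect, assuming a tokenizer that keeps only runs of Ethiopic letters, is sketched below. `ETHIOPIC_WORD` and `ethiopic_tokens` are hypothetical names, and this regex is not etnltk's implementation.

```python
import re

# Ethiopic syllables occupy U+1200 to U+135A; Ethiopic punctuation and digits
# start at U+1360 and are deliberately left out of the character class.
ETHIOPIC_WORD = re.compile(r"[\u1200-\u135a]+")

def ethiopic_tokens(text):
    """Return runs of Ethiopic letters, skipping Latin words, digits, and punctuation."""
    return ETHIOPIC_WORD.findall(text)

sample = "Baby Carrier በፈለጉት አቅጣጫ ልጅዎን በምቾት ማዘል ያስችልዎታል"
tokens = ethiopic_tokens(sample)
print(tokens)
```

If the English product names matter for downstream extraction, they would need to be tokenized separately from `normalized_message`, since they do not survive into the token lists shown above.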


In [43]:
# Export the cleaned and processed Telegram data to a CSV file
# Includes metadata (channel, ID, date, media path) and the processed text columns:
# - cleaned_message: text after removing noise (emojis, links, etc.)
# - normalized_message: text with standardized Amharic characters
# - tokenized_message: word-tokenized text, as created in the tokenization cell
#   (the misspelled 'tokenized_meassage' column in the outputs is not created by
#   any cell shown here, so it is not exported)
# Saved to ../data/cleaned_telegram_data.csv for further analysis or modeling

data_clean[['Channel Title', 'Channel Username', 'ID', 'Date',
       'Media Path', 'cleaned_message', 'normalized_message',
       'tokenized_message']].to_csv("../data/cleaned_telegram_data.csv")
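
A CSV file has no list type, so a column of token lists is written as the string representation of each list and comes back as plain text when the file is reloaded. The sketch below shows the round trip with the standard library only, writing to an in-memory buffer rather than the real file; the row contents are made up.

```python
import ast
import csv
import io

# One made-up row; str(list) mirrors how a list cell ends up in a CSV file.
rows = [{"ID": "6981", "tokenized_message": str(["ዋጋ", "ብር"])}]

# Write to an in-memory buffer instead of ../data/cleaned_telegram_data.csv.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "tokenized_message"])
writer.writeheader()
writer.writerows(rows)

# Read the data back: every field is now a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokenized_message"]

# ast.literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw)
print(type(raw).__name__, tokens)
```

With pandas, the same conversion can be applied at load time by passing `converters={"tokenized_message": ast.literal_eval}` to `pd.read_csv`.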