### Text Processing: Handling Amharic text, tokenization, and preprocessing techniques.

Before the scraped Amharic text can be used downstream, it needs preprocessing tailored to the language: tokenization, normalization, and handling of Amharic-specific linguistic features.

Here’s how we can approach this task:

**Steps to Preprocess Amharic Text**

- **Tokenization**: Tokenization splits text into individual units such as words or subwords. Because Amharic is written in the Ge'ez (Ethiopic) script, which has its own punctuation marks, tokenizers built for Latin-script text may need adjusting.
    - Use a library that handles Amharic text, or write a custom rule-based tokenizer.

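- **Normalization**: This step cleans the text and converts it into a standard format:
    - Remove special characters and punctuation, and numbers where the task does not need them (the preprocessed output in this notebook keeps numbers such as prices and phone numbers).
    - Normalize variant characters that are used interchangeably (for example, ሀ and ሐ).
    - Convert text to a standard form (for example, removing diacritics if necessary).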

- **Handling Amharic-Specific Features:**
    - Amharic, like other Semitic languages, has root-and-pattern morphology.
    - Handle orthographic variants and account for the language's prefixes, suffixes, and infixes.
    - Identify verb conjugations, plural forms, and possessives to improve tokenization.
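
The normalization and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not the implementation of `AmharicTextPreprocessor` (whose source is not shown here). The homophone table is an assumed, partial mapping chosen for the example, and the names `normalize` and `tokenize` are hypothetical.

```python
import re
import string
import unicodedata


def _series_map(src: str, dst: str) -> dict:
    """Map the seven vowel orders of one fidel series onto another."""
    return {ord(src) + i: ord(dst) + i for i in range(7)}


# Assumed, illustrative subset of homophone series; not an official standard.
HOMOPHONES = {}
for src, dst in [("ሐ", "ሀ"), ("ኀ", "ሀ"), ("ሠ", "ሰ"), ("ዐ", "አ"), ("ፀ", "ጸ")]:
    HOMOPHONES.update(_series_map(src, dst))

# ASCII punctuation plus the Ethiopic punctuation marks U+1361..U+1368.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u1361-\u1368]")


def normalize(text: str) -> str:
    """Unify Unicode form and homophone letters, then strip punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(HOMOPHONES)
    text = PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split normalized text on whitespace."""
    return normalize(text).split()


print(tokenize("ሰላም እንዴት ነህ? እንኳን ደህና መጣህ።"))
```

Unlike the preprocessed output shown later in this notebook, this sketch also removes `/`, since it strips all ASCII punctuation. Whether to keep digits and separators depends on the downstream task.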

In [1]:

# Import necessary libraries
import pandas as pd
import logging
import os, sys
import matplotlib.pyplot as plt
from matplotlib import font_manager
from collections import Counter
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))
# Import data preprocessor class
from amharic_text_processor import AmharicTextPreprocessor # type: ignore
from amharic_labeler import AmharicNERLabeler # type: ignore

# Set max rows and columns to display
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

# Configure logging
logging.basicConfig(level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("Imported libraries and configured logging.")

2025-01-18 16:36:02,780 - INFO - Imported libraries and configured logging.


**Load the scraped Telegram data**

In [2]:
# Read the data
data = pd.read_csv('../data/telegram_data11.csv')
# Explore the first five rows
data.head()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
0,Fashion tera,@Fashiontera,4059,,2025-01-13 19:35:35+00:00,../data/photos\@Fashiontera_4059.jpg
1,Fashion tera,@Fashiontera,4058,,2025-01-11 19:36:42+00:00,../data/photos\@Fashiontera_4058.jpg
2,Fashion tera,@Fashiontera,4057,〰️〰️〰️〰️〰️〰️〰️\nUnder Armur\nMade in Vietnam \...,2025-01-11 15:20:03+00:00,../data/photos\@Fashiontera_4057.jpg
3,Fashion tera,@Fashiontera,4056,,2025-01-11 15:20:03+00:00,../data/photos\@Fashiontera_4056.jpg
4,Fashion tera,@Fashiontera,4054,,2025-01-04 16:56:49+00:00,../data/photos\@Fashiontera_4054.jpg


In [3]:
# Check the last five rows
data.tail()

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
2687,Fashion tera,@Fashiontera,96,Nikon \nD9 Digital Camera\nPrice 16000\nContac...,2018-07-05 20:11:35+00:00,../data/photos\@Fashiontera_96.jpg
2688,Fashion tera,@Fashiontera,95,"Vans Leather\nMade in Vietnam\nSize 40, 41\nPr...",2018-07-05 19:50:38+00:00,../data/photos\@Fashiontera_95.jpg
2689,Fashion tera,@Fashiontera,94,Samsung TV\nCurved Full HD \n55 Inch\nPrice 36...,2018-07-05 19:06:11+00:00,../data/photos\@Fashiontera_94.jpg
2690,Fashion tera,@Fashiontera,92,Rebook\nMade in Vietnam\nSize 41\nPrice 1350\n...,2018-07-05 18:39:16+00:00,../data/photos\@Fashiontera_92.jpg
2691,Fashion tera,@Fashiontera,1,,2018-05-29 09:31:26+00:00,


In [4]:
data.shape

(2692, 6)

In [5]:
# Let's check the missing values
data.isnull().sum()

Channel Title         0
Channel Username      0
ID                    0
Message             749
Date                  0
Media Path           33
dtype: int64
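
The counts above mean that 749 of 2692 rows (about 28%) have no message text. As a small sketch on a toy frame (the values below are made up for illustration, not taken from the scraped data), the same proportion can be read directly with `isnull().mean()`, and rows with text can be selected without mutating the original frame:

```python
import pandas as pd

# Toy frame standing in for the scraped data; values are illustrative only.
sample = pd.DataFrame({
    "Message": ["Price 1350", None, "Size 41", None],
    "Media Path": ["a.jpg", "b.jpg", None, "d.jpg"],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = sample.isnull().mean()

# Non-mutating selection of rows that actually contain text.
has_text = sample[sample["Message"].notna()]

print(missing_share)
print(len(has_text))
```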

In [6]:
# Preprocess and tokenize the Amharic messages
preprocessor = AmharicTextPreprocessor()

# Preprocess the 'Message' column; the result gains a 'preprocessed_message' column
tokens = preprocessor.preprocess_dataframe(data, 'Message')
display(tokens)


Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,preprocessed_message
0,Fashion tera,@Fashiontera,4059,,2025-01-13 19:35:35+00:00,../data/photos\@Fashiontera_4059.jpg,
1,Fashion tera,@Fashiontera,4058,,2025-01-11 19:36:42+00:00,../data/photos\@Fashiontera_4058.jpg,
2,Fashion tera,@Fashiontera,4057,〰️〰️〰️〰️〰️〰️〰️\nUnder Armur\nMade in Vietnam \...,2025-01-11 15:20:03+00:00,../data/photos\@Fashiontera_4057.jpg,414243 3500 5266 ስልክ 251945355266 ፋሽን ተራ / አድራ...
3,Fashion tera,@Fashiontera,4056,,2025-01-11 15:20:03+00:00,../data/photos\@Fashiontera_4056.jpg,
4,Fashion tera,@Fashiontera,4054,,2025-01-04 16:56:49+00:00,../data/photos\@Fashiontera_4054.jpg,
...,...,...,...,...,...,...,...
2687,Fashion tera,@Fashiontera,96,Nikon \nD9 Digital Camera\nPrice 16000\nContac...,2018-07-05 20:11:35+00:00,../data/photos\@Fashiontera_96.jpg,9 16000 5266 0945355266
2688,Fashion tera,@Fashiontera,95,"Vans Leather\nMade in Vietnam\nSize 40, 41\nPr...",2018-07-05 19:50:38+00:00,../data/photos\@Fashiontera_95.jpg,40 41 1400 5266 0945355266
2689,Fashion tera,@Fashiontera,94,Samsung TV\nCurved Full HD \n55 Inch\nPrice 36...,2018-07-05 19:06:11+00:00,../data/photos\@Fashiontera_94.jpg,55 36000 5266 0945355266
2690,Fashion tera,@Fashiontera,92,Rebook\nMade in Vietnam\nSize 41\nPrice 1350\n...,2018-07-05 18:39:16+00:00,../data/photos\@Fashiontera_92.jpg,41 1350 5266 0945355266


In [7]:
# Drop rows whose Message is missing
data.dropna(subset=['Message'], inplace=True)

In [8]:
list(data['preprocessed_message'])

['414243 3500 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '4243 3400 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '414243 2900 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '4344 3800 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '404243 3200 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '40414243 3800 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '1 4043 3500 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '2200 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '4042 3500 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 205',
 '2200 5266 ስልክ 251945355266 ፋሽን ተራ / አድራሻ አዲስ አበባ ጦር ሀይሎች ድሪም ታወር 2ተኛ ፎቅ ቢሮ ቁጥር 20

In [9]:
# Ensure there are no NaN values in the preprocessed column
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
df = pd.Series(preprocessed_texts).reset_index(name='message')


In [11]:
# Initialize the labeler

labeler = AmharicNERLabeler()

# Ensure there are no NaN values in the preprocessed column
preprocessed_texts = tokens['preprocessed_message'].dropna().tolist()
df = pd.Series(preprocessed_texts).reset_index(name='message')
# df = df.iloc[10:15]
df['Tokenized'] = df['message'].apply(lambda x: x.split())
# Label the tokens in the DataFrame
labeled_df = labeler.label_dataframe(df, 'Tokenized')


# Save to CoNLL format
labeler.save_conll_format(labeled_df, '../data/labeled_data_conll.conll')
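
For reference, CoNLL-style files put one token and its label on each line and separate sentences (here, messages) with a blank line. The sketch below shows that layout under stated assumptions: it is not the code behind `AmharicNERLabeler.save_conll_format`, the function names `to_conll` and `write_conll` are hypothetical, and the `B-LOC`/`I-LOC`/`O` tags are an example label set rather than the one this project necessarily uses.

```python
from pathlib import Path


def to_conll(sentences) -> str:
    """Render [(token, label), ...] sentences as CoNLL-style text."""
    lines = []
    for sentence in sentences:
        lines.extend(f"{token} {label}" for token, label in sentence)
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)


def write_conll(sentences, path) -> None:
    """Write the rendered text to disk as UTF-8."""
    Path(path).write_text(to_conll(sentences), encoding="utf-8")


# Example label set (assumed): B-/I- prefixes mark entity spans, O is "outside".
example = [
    [("አድራሻ", "O"), ("አዲስ", "B-LOC"), ("አበባ", "I-LOC")],
    [("ስልክ", "O")],
]
print(to_conll(example))
```

Writing UTF-8 explicitly matters on platforms whose default encoding cannot represent Ethiopic characters.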

In [12]:
labeled_df.drop(columns=['index'], inplace=True)

In [13]:
labeled_df['message'].duplicated().sum()

np.int64(1154)
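
With 1154 repeated messages, deduplicating before labeling or training may be worthwhile. A minimal sketch on toy data follows (the rows are illustrative). Note that a column holding Python lists, such as `Tokenized`, is unhashable, so `drop_duplicates` has to be restricted to the string column via `subset`.

```python
import pandas as pd

# Toy frame mirroring the 'message' / 'Tokenized' layout; rows are illustrative.
toy = pd.DataFrame({"message": ["41 1350 5266", "41 1350 5266", "55 36000 5266"]})
toy["Tokenized"] = toy["message"].str.split()

# Count repeats the same way as above.
n_duplicates = int(toy["message"].duplicated().sum())

# Lists are unhashable, so deduplicate on the string column only.
deduped = toy.drop_duplicates(subset="message").reset_index(drop=True)

print(n_duplicates, len(deduped))
```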