# Data Processing and Labeling


In [1]:
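import os

import pandas as pd

# Switch to the notebook's parent directory so that the `script` package
# and the `data/` folder can be referenced with relative paths.
notebook_dir = os.getcwd()
parent_path = os.path.dirname(notebook_dir)
os.chdir(parent_path)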

In [2]:
from script.data_processor_labler import Processor

**Create an instance of the imported class**

In [3]:
processor = Processor()

**Load the scraped Telegram data**

In [4]:
tg_data = pd.read_csv("data/adamagebeya_telegram_data.csv")

## Handle missing data
Drop the rows that do not have a message.

In [5]:
processor.drop_missing_messsage(tg_data)

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path
0,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4043,✅Hot Water Bag\n🎯 የትኩስ ውሃ መያዢያ ከረጢት\n👉 1.8 ሊትር...,2024-09-25 08:36:52+00:00,photos/@gebeyaadama_4043.jpg
1,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4042,❇️Hair Scalp Massager,2024-09-25 08:33:14+00:00,
2,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4041,❇️Hair Scalp Massager\n\n﻿﻿👉Stimulate blood fl...,2024-09-25 08:32:54+00:00,photos/@gebeyaadama_4041.jpg
3,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4040,የፀጉር መፈረዣ,2024-09-25 07:19:39+00:00,
4,አዳማ ገበያ - Adama gebeya,@gebeyaadama,4039,✅ የፀጉር መፈረዣ ✅\n\n📌ለሁሉም አይነት ፀጉር የሚሆን እና ለ አጠቃቀ...,2024-09-25 07:19:27+00:00,photos/@gebeyaadama_4039.jpg
...,...,...,...,...,...,...
3298,አዳማ ገበያ - Adama gebeya,@gebeyaadama,17,☎️0911-76-22-01\n ዋጋ 1200 ብር \n❣️❣️🇪🇹🇪🇹 በ ...,2020-10-06 09:06:46+00:00,
3301,አዳማ ገበያ - Adama gebeya,@gebeyaadama,14,❤️❤️❤️ አዳማ ❤️❤️❤️\n 🎯የጀርባ ችግር አለቦት\n ...,2020-10-05 18:04:46+00:00,photos/@gebeyaadama_14.jpg
3304,አዳማ ገበያ - Adama gebeya,@gebeyaadama,10,❣ለውስን ጊዜ የሚቆይ ታላቅ ቅናሽ❣\n 💯አንድ Smart watch ሲገ...,2020-10-05 12:08:36+00:00,photos/@gebeyaadama_10.jpg
3305,አዳማ ገበያ - Adama gebeya,@gebeyaadama,9,❤ አዳማ /ናዝሬት ❤\n 0911-76-22-01\n❣️❣️🇪...,2020-10-05 04:58:56+00:00,photos/@gebeyaadama_9.jpg
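
The row-dropping step can be sketched with plain pandas. This is only an illustration: the function name `drop_missing_message`, the default column name, and the choice to treat whitespace-only messages as missing are assumptions, and the actual `Processor` method may work differently.

```python
import pandas as pd


def drop_missing_message(df: pd.DataFrame, column: str = "Message") -> pd.DataFrame:
    """Return a copy of df without rows whose message is missing or blank."""
    # Assumption: whitespace-only strings count as missing as well.
    has_text = df[column].notna() & (df[column].astype(str).str.strip() != "")
    # The original index is kept, so gaps appear where rows were dropped.
    return df[has_text].copy()


sample = pd.DataFrame(
    {"ID": [1, 2, 3], "Message": ["የፀጉር መፈረዣ", None, "   "]}
)
cleaned = drop_missing_message(sample)
print(cleaned)
```

Keeping the original index is a design choice here; it makes it easy to trace a cleaned row back to the scraped file.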


## Remove emojis and extra spaces from the messages

In [6]:
processor.clean_message(tg_data)

0       Hot Water Bag\n የትኩስ ውሃ መያዢያ ከረጢት\n 1.8 ሊትር ውሃ...
1                                     Hair Scalp Massager
2       Hair Scalp Massager\nStimulate blood flow to t...
3                                               የፀጉር መፈረዣ
4        የፀጉር መፈረዣ \nለሁሉም አይነት ፀጉር የሚሆን እና ለ አጠቃቀም ምቹ\...
                              ...                        
3298    0911-76-22-01\n     ዋጋ 1200 ብር \n በ ሆድ የሰውነት ክ...
3301     አዳማ \n     የጀርባ ችግር አለቦት\n        0911-76-22-...
3304    ለውስን ጊዜ የሚቆይ ታላቅ ቅናሽ\n  አንድ  Smart watch ሲገዙ በ...
3305     አዳማ /ናዝሬት \n           0911-76-22-01\n በ ሆድ የ...
3309      0911-76-22-01\n         አዳማ / ናዝሬት \n     የጁ...
Name: Message, Length: 2488, dtype: object
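
A cleaning step of this kind can be sketched with the standard `re` module. The emoji ranges below are an assumed, non-exhaustive selection, and this version also trims leading spaces on each line, which the output above suggests the real `clean_message` does not do. Treat it as a sketch, not as the method used in `Processor`.

```python
import re

# Assumed emoji and symbol ranges; a real implementation may cover more or fewer.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D\uFEFF"     # variation selector, zero-width joiner, BOM
    "]+"
)


def clean_message(text: str) -> str:
    """Remove emojis, collapse repeated spaces, and drop empty lines."""
    text = EMOJI_PATTERN.sub(" ", text)
    # Collapse runs of spaces and tabs but keep line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


example = clean_message("✅ የፀጉር መፈረዣ ✅\n\n📌ለሁሉም  አይነት")
print(example)
```

Ethiopic characters fall outside all of the listed ranges, so the Amharic text itself is left untouched.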

## Filter the data to keep only amharic messages
Since the project focuses on *Amharic Named Entity Recognition* and the data mixes Amharic and English, non-Amharic messages need to be filtered out. Here we keep only the messages in which 50% or more of the characters are Amharic.

In [7]:
tg_data = processor.filter_amharic(tg_data)
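
The 50% rule can be illustrated with a small character-ratio check. The helper names, the use of the Unicode Ethiopic block (U+1200 to U+137F) as the definition of "Amharic character", and the choice to ignore whitespace in the denominator are all assumptions; `filter_amharic` may compute the ratio differently.

```python
def amharic_ratio(text: str) -> float:
    """Share of non-whitespace characters that lie in the Ethiopic block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # Assumption: U+1200..U+137F (Ethiopic) stands in for "Amharic".
    ethiopic = sum(1 for ch in chars if "\u1200" <= ch <= "\u137F")
    return ethiopic / len(chars)


def is_mostly_amharic(text: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the characters are Ethiopic."""
    return amharic_ratio(text) >= threshold


print(is_mostly_amharic("Hair Scalp Massager"))
print(is_mostly_amharic("ዋጋ 650 ብር"))
```

Under these assumptions, a DataFrame could be filtered with something like `tg_data[tg_data["Message"].map(is_mostly_amharic)]`.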

## Label messages
We label a portion of the dataset in the CoNLL format, which is commonly used for Named Entity Recognition (NER) tasks.
The goal is to identify and label entities such as products, prices, and locations in Amharic text.

Entity types:

**B-Product:** The beginning of a product entity (e.g., "Baby bottle"). 

**I-Product:** Inside a product entity (e.g., the word "bottle" in "Baby bottle").

**B-LOC:** The beginning of a location entity (e.g., "Addis Abeba", "Bole").

**I-LOC:** Inside a location entity (e.g., the word "Abeba" in "Addis Abeba").

**B-Price:** The beginning of a price entity (e.g., "ዋጋ 1000 ብር", "በ 100 ብር").

**I-Price:** Inside a price entity (e.g., the word "1000" in "ዋጋ 1000 ብር").

**O:** Tokens that are outside any entity.
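
To make the tagging scheme concrete, here is a minimal sketch that turns entity spans into B-/I-/O labels and prints them one token per line in a CoNLL-style layout. The helpers `bio_labels` and `to_conll` and the hand-written spans are hypothetical; they illustrate the format only and say nothing about how `label_dataset` decides which tokens are entities.

```python
from typing import List, Tuple


def bio_labels(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build BIO labels from (start, end_exclusive, entity_type) spans."""
    labels = ["O"] * len(tokens)
    for start, end, entity in spans:
        labels[start] = f"B-{entity}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity}"      # remaining tokens of the entity
    return labels


def to_conll(tokens: List[str], labels: List[str]) -> str:
    """Render one 'token label' pair per line."""
    return "\n".join(f"{tok} {lab}" for tok, lab in zip(tokens, labels))


tokens = "የፀጉር መፈረዣ ዋጋ 650 ብር አድራሻችን አዳማ ፖስታ ቤት".split()
# Hand-picked spans for illustration only.
spans = [(0, 2, "Product"), (2, 5, "Price"), (6, 9, "LOC")]
labels = bio_labels(tokens, spans)
print(to_conll(tokens, labels))
```

In a CoNLL-style file, consecutive messages are separated by a blank line, so the output for several messages would be joined with an empty line between them.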


In [8]:
processor.label_dataset(tg_data)

Hot B-Product
Water I-Product
Bag I-Product
የትኩስ O
ውሃ O
መያዢያ O
ከረጢት O
1.8 O
ሊትር O
ውሃ O
ይይዛል O
ሙቀት O
ከሚቋቋም O
ወፍራም O
ጎማ O
የተሰራ O
አስተማማኝ O
ክዳን O
ያለው O
ወፍራም O
ጨርቅ O
ያለው O
ከወር O
አበባ O
፣ O
ከወገብ O
ህመም፣ O
ከመደንዘዝ፣ O
ከደም O
ስር O
መዞርና O
ከውልቃት O
ጋር O
የተያያዙ O
ህመሞችን O
ለማስታገስ O
ይረዳል O
750 B-Price
ብር I-Price
0911762201 O
0972824252 O
0988404491 O
0922282582 O
በቴሌግራም O
ለማዘዝ O
@GebeyaAdama21 O
አድራሻችን O
አዳማ B-LOC
ፖስታ I-LOC
ቤት I-LOC
ሶሬቲ B-LOC
ሞል B-LOC
ምድር I-LOC
ላይ I-LOC
ሱ.ቁ I-LOC
33 I-LOC
ይሄንን O
በመጫን O
የቤተሰባችን O
አባል O
ይሁኑ O
https://t.me/gebeyaadama O
የመረጡትን O
እቃ O
ይዘዙ፤ O
ያሉበት O
እናደርሳለን!! O
በኪስዎ O
ጥሬ O
ገንዘብ O
ካልያዙ O
በሞባይል O
ማስተላለፍ O
ይችላሉ። O

የፀጉር B-Product
መፈረዣ I-Product

የፀጉር B-Product
መፈረዣ I-Product
ለሁሉም O
አይነት O
ፀጉር O
የሚሆን O
እና O
ለ O
አጠቃቀም O
ምቹ O
ዋጋ B-Price
650 I-Price
ብር I-Price
+251911762201 O
+251972824252 O
በቴሌግራም O
ለማዘዝ O
@GebeyaAdama21 O
አድራሻችን O
አዳማ B-LOC
ፖስታ I-LOC
ቤት I-LOC
ሶሬቲ B-LOC
ሞል I-LOC
ምድር I-LOC
ላይ I-LOC
ሱ.ቁ I-LOC
33 I-LOC
አዲስአበባ B-LOC
መገናኛ I-LOC
ከ I-LOC
ዋአች I-LOC
ህንፃ I-LOC
ፊትለፊት I-LOC
ኪኔሬት