1. ## Extracting Telegram Channel Data and Saving Media Files
Scrape up to 1,000 of the most recent messages from each Telegram channel into ../data/raw/telegram_data.csv, and download any attached photos to the ../data/photos folder.

In [5]:
from telethon import TelegramClient
import csv
import os
from dotenv import load_dotenv

# Load environment variables once
load_dotenv('../.env')
api_id = os.getenv('TG_API_ID')
api_hash = os.getenv('TG_API_HASH')
phone = os.getenv('TG_PHONE_NUMBER')

# Function to scrape data from a single channel
async def scrape_channel(client, channel_username, writer, media_dir):
    entity = await client.get_entity(channel_username)
    channel_title = entity.title  # Extract the channel's title
    async for message in client.iter_messages(entity, limit=1000):
        media_path = None
        if message.media and hasattr(message.media, 'photo'):
            # Create a unique filename for the photo
            filename = f"{channel_username}_{message.id}.jpg"
            media_path = os.path.join(media_dir, filename)
            # Download the media to the specified directory if it's a photo
            await client.download_media(message.media, media_path)
        
        # Write the channel title along with other data
        writer.writerow([channel_title, channel_username, message.id, message.message, message.date, media_path])

# Initialize the client once
client = TelegramClient('scraping_session', api_id, api_hash)

async def main():
    await client.start()
    
    # Create a directory for media files
    media_dir = '../data/photos'
    os.makedirs(media_dir, exist_ok=True)

    # Open the CSV file and prepare the writer
    with open('../data/raw/telegram_data.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Channel Title', 'Channel Username', 'ID', 'Message', 'Date', 'Media Path'])  # Include channel title in the header
        
        # List of channels to scrape
        channels = [
            '@shageronlinestore', '@ethio_brand_collection', '@Shewabrand', '@gebeyaadama', '@classybrands'
        ]
        
        # Iterate over channels and scrape data into the single CSV file
        for channel in channels:
            await scrape_channel(client, channel, writer, media_dir)
            print(f"Scraped data from {channel}")

async with client:
    await main()

Attempt 1 at connecting failed: TimeoutError: 
Attempt 2 at connecting failed: TimeoutError: 
Attempt 3 at connecting failed: TimeoutError: 
Attempt 4 at connecting failed: TimeoutError: 


Scraped data from @shageronlinestore
Scraped data from @ethio_brand_collection


CancelledError: 

2. ## Load Dataset

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../src') 

from utils.data_loader import load_data

# Load data
df = load_data('../data/raw/telegram_data.csv')
# print(df.head())

Data loaded successfully from ../data/raw/telegram_data.csv


3. ## Data Cleaning

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2694 entries, 0 to 2693
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Channel Title     2694 non-null   object
 1   Channel Username  2694 non-null   object
 2   ID                2694 non-null   int64 
 3   Message           2049 non-null   object
 4   Date              2694 non-null   object
 5   Media Path        2507 non-null   object
dtypes: int64(1), object(5)
memory usage: 126.4+ KB


Handling Missing Values

In [2]:
df.isna().sum()

Channel Title         0
Channel Username      0
ID                    0
Message             645
Date                  0
Media Path          187
dtype: int64

In [None]:
from utils.data_cleaning import clean_data

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Data Cleaning Function Overview**
The clean_data(df) function prepares raw Telegram data for analysis through the following steps:

- Copy DataFrame – Works on a copy so the original DataFrame is left unchanged.

- Remove Duplicates – Drops any duplicate rows.

- Drop Incomplete Rows – Removes rows missing Message, Date, or Media Path.

- Format Dates – Converts the Date column to proper datetime format.

- Remove Invalid Dates – Drops rows with unconvertible or missing dates.

- Reset Index – Resets the DataFrame index after cleaning.
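The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual utils.data_cleaning.clean_data: the function name clean_data_sketch and the sample rows are hypothetical, the column names come from the CSV header written by the scraper, and parsing with utc=True is an assumption based on the datetime64[ns, UTC] dtype that the cleaned DataFrame reports.

```python
import pandas as pd

# Columns that must be present for a row to be kept (from the overview above).
REQUIRED_COLUMNS = ["Message", "Date", "Media Path"]


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the cleaning steps listed above."""
    cleaned = df.copy()                                  # 1. work on a copy
    cleaned = cleaned.drop_duplicates()                  # 2. drop duplicate rows
    cleaned = cleaned.dropna(subset=REQUIRED_COLUMNS)    # 3. drop incomplete rows
    # 4. parse dates; unparseable values become NaT instead of raising
    cleaned = cleaned.assign(
        Date=pd.to_datetime(cleaned["Date"], errors="coerce", utc=True)
    )
    cleaned = cleaned.dropna(subset=["Date"])            # 5. drop invalid dates
    return cleaned.reset_index(drop=True)                # 6. reset the index


# Made-up rows: one duplicate, one missing message, one invalid date.
raw = pd.DataFrame({
    "Channel Title": ["Demo"] * 4,
    "Channel Username": ["@demo"] * 4,
    "ID": [1, 1, 2, 3],
    "Message": ["hello", "hello", None, "world"],
    "Date": [
        "2025-06-21 14:15:20+00:00",
        "2025-06-21 14:15:20+00:00",
        "2025-06-20 11:47:53+00:00",
        "not a date",
    ],
    "Media Path": ["photos/a.jpg", "photos/a.jpg", "photos/b.jpg", "photos/c.jpg"],
})
sample = clean_data_sketch(raw)
print(sample)
```

Only the first row survives in this toy input, which mirrors how the real dataset shrinks once rows without a message or media path are removed.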

In [None]:
# utility function to clean the data
cleand_df = clean_data(df)


In [4]:
cleand_df.info()
print(cleand_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1866 entries, 0 to 1865
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   Channel Title     1866 non-null   object             
 1   Channel Username  1866 non-null   object             
 2   ID                1866 non-null   int64              
 3   Message           1866 non-null   object             
 4   Date              1866 non-null   datetime64[ns, UTC]
 5   Media Path        1866 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(4)
memory usage: 87.6+ KB
Channel Title       0
Channel Username    0
ID                  0
Message             0
Date                0
Media Path          0
dtype: int64


4. ## Preprocess and Structure Data

In [5]:
from utils.data_preprocessor import preprocess_amharic_messages



**Amharic Text Preprocessing Function Overview**
The preprocess_amharic_messages(messages_df) function processes Amharic Telegram messages for NLP tasks. It performs the following steps:

- Emoji Removal – Strips out emojis using a regex pattern.
➤ (Handled by) normalize_amharic_text(text)

- Text Normalization – Removes unwanted punctuation (except Amharic delimiters) and normalizes spacing.
➤ (Handled by) normalize_amharic_text(text)

- Tokenization – Splits the normalized text into individual words.
➤ (Handled by) tokenize(text)

- Stopword Removal – Filters out common Amharic stopwords.
➤ (Handled by) remove_stopwords(tokens)

- Column Assignment – Adds three new columns to the DataFrame:

    - Clean_Text: normalized text (from normalize_amharic_text)

    - Tokens: raw tokens (from tokenize)

    - Processed: tokens with stopwords removed (from remove_stopwords)
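A rough sketch of this pipeline is shown below. The helper names match the overview, but their bodies are assumptions: the emoji ranges, the punctuation rule, and the four-word stopword set are illustrative only, and whitespace splitting stands in for whatever tokenizer the project actually uses (the NLTK punkt download suggests it may rely on NLTK). The wrapper name preprocess_sketch is hypothetical.

```python
import re

import pandas as pd

# Assumed emoji ranges; the project's real pattern may cover more blocks.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]+"
)
# Keep word characters, whitespace, and Ethiopic punctuation (U+1361 to U+1368).
PUNCT_PATTERN = re.compile(r"[^\w\s\u1361-\u1368]")
# Tiny illustrative stopword set, not the project's actual list.
AMHARIC_STOPWORDS = {"እና", "ነው", "ላይ", "ወደ"}


def normalize_amharic_text(text: str) -> str:
    """Strip emojis and unwanted punctuation, then collapse whitespace."""
    text = EMOJI_PATTERN.sub(" ", text)
    text = PUNCT_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split normalized text on whitespace (a stand-in tokenizer)."""
    return text.split()


def remove_stopwords(tokens: list) -> list:
    """Drop tokens that appear in the stopword set."""
    return [tok for tok in tokens if tok not in AMHARIC_STOPWORDS]


def preprocess_sketch(messages_df: pd.DataFrame) -> pd.DataFrame:
    """Add Clean_Text, Tokens, and Processed columns to a copy of the input."""
    out = messages_df.copy()
    out["Clean_Text"] = out["Message"].apply(normalize_amharic_text)
    out["Tokens"] = out["Clean_Text"].apply(tokenize)
    out["Processed"] = out["Tokens"].apply(remove_stopwords)
    return out


demo = preprocess_sketch(pd.DataFrame({
    "Message": ["💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር"],
}))
print(demo.loc[0, "Clean_Text"])
print(demo.loc[0, "Processed"])
```

Because punctuation is deleted rather than replaced with a space, a size list such as 40#41#42#43 collapses into a single token, which is the same behavior visible in the Clean_Text column of the real output.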

In [6]:
processed_df = preprocess_amharic_messages(cleand_df)

In [7]:
processed_df

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,Clean_Text,Tokens,Processed
0,Sheger online-store,@shageronlinestore,7403,👋 BARDEFU 2 IN 1 Multi purpose juicer\n\n👉 ኳሊቲ...,2025-06-21 14:15:20+00:00,photos\@shageronlinestore_7403.jpg,BARDEFU 2 IN 1 Multi purpose juicer ኳሊቲ የሆነ የጁ...,"[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ...","[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ..."
1,Sheger online-store,@shageronlinestore,7401,💥 portable electrical water dispenser\n\n👉ባለ 3...,2025-06-21 07:02:35+00:00,photos\@shageronlinestore_7401.jpg,portable electrical water dispenser ባለ 3 press...,"[portable, electrical, water, dispenser, ባለ, 3...","[portable, electrical, water, dispenser, ባለ, 3..."
2,Sheger online-store,@shageronlinestore,7399,💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,2025-06-20 15:25:41+00:00,photos\@shageronlinestore_7399.jpg,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,..."
3,Sheger online-store,@shageronlinestore,7395,💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,2025-06-20 15:25:40+00:00,photos\@shageronlinestore_7395.jpg,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,..."
4,Sheger online-store,@shageronlinestore,7393,💥 1L Water Bottle\n\n 💯High Quality\n\n⚡...,2025-06-20 11:47:53+00:00,photos\@shageronlinestore_7393.jpg,1L Water Bottle High Quality 1L water time sca...,"[1L, Water, Bottle, High, Quality, 1L, water, ...","[1L, Water, Bottle, High, Quality, 1L, water, ..."
...,...,...,...,...,...,...,...,...,...
1861,Shewa Brand,@Shewabrand,2640,JORDAN 9\nsize 441#42#43#44#45\nMADE IN VIETNA...,2023-06-09 09:48:35+00:00,photos\@Shewabrand_2640.jpg,JORDAN 9 size 44142434445 MADE IN VIETNAM SHEW...,"[JORDAN, 9, size, 44142434445, MADE, IN, VIETN...","[JORDAN, 9, size, 44142434445, MADE, IN, VIETN..."
1862,Shewa Brand,@Shewabrand,2639,Reebok hunter Green\nsize 40#41#42#43\nMADE IN...,2023-06-09 08:21:17+00:00,photos\@Shewabrand_2639.jpg,Reebok hunter Green size 40414243 MADE IN VIET...,"[Reebok, hunter, Green, size, 40414243, MADE, ...","[Reebok, hunter, Green, size, 40414243, MADE, ..."
1863,Shewa Brand,@Shewabrand,2638,NIKE Alpha Huarache Elite 3\nsize 40#41#42#43\...,2023-06-07 14:59:20+00:00,photos\@Shewabrand_2638.jpg,NIKE Alpha Huarache Elite 3 size 40414243 MADE...,"[NIKE, Alpha, Huarache, Elite, 3, size, 404142...","[NIKE, Alpha, Huarache, Elite, 3, size, 404142..."
1864,Shewa Brand,@Shewabrand,2637,Alexander McQUEEN\nsize 36#37#38#39\nSHEWA BRA...,2023-06-05 10:35:24+00:00,photos\@Shewabrand_2637.jpg,Alexander McQUEEN size 36373839 SHEWA BRAND አድ...,"[Alexander, McQUEEN, size, 36373839, SHEWA, BR...","[Alexander, McQUEEN, size, 36373839, SHEWA, BR..."


Add additional columns that describe each message in more detail (media type, length, word count, and share of Amharic characters).

In [8]:
import re
from utils.data_preprocessor import process_media_path

# Media type extraction
processed_df['media_type'] = processed_df['Media Path'].apply(process_media_path)

# Extract features
processed_df['message_length'] = processed_df['Clean_Text'].apply(len)
processed_df['word_count'] = processed_df['Processed'].apply(len)
# Share of characters that fall in the Ethiopic Unicode block (U+1200 to U+137F)
processed_df['amharic_ratio'] = processed_df['Clean_Text'].apply(
    lambda x: len(re.findall(r'[\u1200-\u137F]', x)) / len(x) if len(x) > 0 else 0
)

In [11]:
processed_df

Unnamed: 0,Channel Title,Channel Username,ID,Message,Date,Media Path,Clean_Text,Tokens,Processed,media_type,message_length,word_count,amharic_ratio
0,Sheger online-store,@shageronlinestore,7403,👋 BARDEFU 2 IN 1 Multi purpose juicer\n\n👉 ኳሊቲ...,2025-06-21 14:15:20+00:00,photos\@shageronlinestore_7403.jpg,BARDEFU 2 IN 1 Multi purpose juicer ኳሊቲ የሆነ የጁ...,"[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ...","[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ...",image,481,91,0.498960
1,Sheger online-store,@shageronlinestore,7401,💥 portable electrical water dispenser\n\n👉ባለ 3...,2025-06-21 07:02:35+00:00,photos\@shageronlinestore_7401.jpg,portable electrical water dispenser ባለ 3 press...,"[portable, electrical, water, dispenser, ባለ, 3...","[portable, electrical, water, dispenser, ባለ, 3...",image,405,74,0.461728
2,Sheger online-store,@shageronlinestore,7399,💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,2025-06-20 15:25:41+00:00,photos\@shageronlinestore_7399.jpg,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",image,562,106,0.596085
3,Sheger online-store,@shageronlinestore,7395,💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,2025-06-20 15:25:40+00:00,photos\@shageronlinestore_7395.jpg,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",image,562,106,0.596085
4,Sheger online-store,@shageronlinestore,7393,💥 1L Water Bottle\n\n 💯High Quality\n\n⚡...,2025-06-20 11:47:53+00:00,photos\@shageronlinestore_7393.jpg,1L Water Bottle High Quality 1L water time sca...,"[1L, Water, Bottle, High, Quality, 1L, water, ...","[1L, Water, Bottle, High, Quality, 1L, water, ...",image,384,68,0.348958
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1861,Shewa Brand,@Shewabrand,2640,JORDAN 9\nsize 441#42#43#44#45\nMADE IN VIETNA...,2023-06-09 09:48:35+00:00,photos\@Shewabrand_2640.jpg,JORDAN 9 size 44142434445 MADE IN VIETNAM SHEW...,"[JORDAN, 9, size, 44142434445, MADE, IN, VIETN...","[JORDAN, 9, size, 44142434445, MADE, IN, VIETN...",image,127,24,0.251969
1862,Shewa Brand,@Shewabrand,2639,Reebok hunter Green\nsize 40#41#42#43\nMADE IN...,2023-06-09 08:21:17+00:00,photos\@Shewabrand_2639.jpg,Reebok hunter Green size 40414243 MADE IN VIET...,"[Reebok, hunter, Green, size, 40414243, MADE, ...","[Reebok, hunter, Green, size, 40414243, MADE, ...",image,135,25,0.237037
1863,Shewa Brand,@Shewabrand,2638,NIKE Alpha Huarache Elite 3\nsize 40#41#42#43\...,2023-06-07 14:59:20+00:00,photos\@Shewabrand_2638.jpg,NIKE Alpha Huarache Elite 3 size 40414243 MADE...,"[NIKE, Alpha, Huarache, Elite, 3, size, 404142...","[NIKE, Alpha, Huarache, Elite, 3, size, 404142...",image,143,27,0.223776
1864,Shewa Brand,@Shewabrand,2637,Alexander McQUEEN\nsize 36#37#38#39\nSHEWA BRA...,2023-06-05 10:35:24+00:00,photos\@Shewabrand_2637.jpg,Alexander McQUEEN size 36373839 SHEWA BRAND አድ...,"[Alexander, McQUEEN, size, 36373839, SHEWA, BR...","[Alexander, McQUEEN, size, 36373839, SHEWA, BR...",image,117,21,0.273504
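The notebook does not show process_media_path itself, so the sketch below is only a guess at how the media_type column could be derived. The function name process_media_path_sketch, the extension-to-type mapping, and the "none"/"other" fallbacks are all assumptions. The amharic_ratio helper restates the lambda used in the cell above as a named function.

```python
import os
import re

# Assumed extension-to-type mapping; the project's real mapping may differ.
MEDIA_TYPES = {
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".mp4": "video",
    ".pdf": "document",
}


def process_media_path_sketch(path) -> str:
    """Classify a media path by file extension (hypothetical helper)."""
    if not isinstance(path, str) or not path:
        return "none"                      # missing path, e.g. NaN from pandas
    ext = os.path.splitext(path)[1].lower()
    return MEDIA_TYPES.get(ext, "other")


def amharic_ratio(text: str) -> float:
    """Fraction of characters in the Ethiopic block (U+1200 to U+137F)."""
    if not text:
        return 0.0
    return len(re.findall(r"[\u1200-\u137F]", text)) / len(text)


print(process_media_path_sketch(r"photos\@shageronlinestore_7403.jpg"))
print(amharic_ratio("ab ሰላም"))  # 3 Ethiopic characters out of 6
```

Note that the ratio counts spaces, digits, and Latin letters in the denominator, so a message that mixes English product names with Amharic descriptions lands well below 1.0.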


5. ## Extract Metadata

In [9]:
from utils.data_preprocessor import extract_metadata
metadata = extract_metadata(processed_df)


In [14]:
metadata

Unnamed: 0,message_id,channel_name,channel_username,timestamp,has_media,media_type,message_length,word_count,amharic_ratio,hour_of_day,day_of_week
0,7403,Sheger online-store,@shageronlinestore,2025-06-21 14:15:20+00:00,True,image,481,91,0.498960,14,Saturday
1,7401,Sheger online-store,@shageronlinestore,2025-06-21 07:02:35+00:00,True,image,405,74,0.461728,7,Saturday
2,7399,Sheger online-store,@shageronlinestore,2025-06-20 15:25:41+00:00,True,image,562,106,0.596085,15,Friday
3,7395,Sheger online-store,@shageronlinestore,2025-06-20 15:25:40+00:00,True,image,562,106,0.596085,15,Friday
4,7393,Sheger online-store,@shageronlinestore,2025-06-20 11:47:53+00:00,True,image,384,68,0.348958,11,Friday
...,...,...,...,...,...,...,...,...,...,...,...
1861,2640,Shewa Brand,@Shewabrand,2023-06-09 09:48:35+00:00,True,image,127,24,0.251969,9,Friday
1862,2639,Shewa Brand,@Shewabrand,2023-06-09 08:21:17+00:00,True,image,135,25,0.237037,8,Friday
1863,2638,Shewa Brand,@Shewabrand,2023-06-07 14:59:20+00:00,True,image,143,27,0.223776,14,Wednesday
1864,2637,Shewa Brand,@Shewabrand,2023-06-05 10:35:24+00:00,True,image,117,21,0.273504,10,Monday
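The column names in the table above suggest how such a metadata frame could be assembled. The following sketch is an assumption about that mapping, not the actual utils.data_preprocessor.extract_metadata; in particular, deriving has_media from a non-null Media Path is a guess, and the function name extract_metadata_sketch is hypothetical.

```python
import pandas as pd


def extract_metadata_sketch(processed_df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-message metadata table from the processed DataFrame."""
    ts = processed_df["Date"]
    return pd.DataFrame({
        "message_id": processed_df["ID"],
        "channel_name": processed_df["Channel Title"],
        "channel_username": processed_df["Channel Username"],
        "timestamp": ts,
        "has_media": processed_df["Media Path"].notna(),  # assumed definition
        "media_type": processed_df["media_type"],
        "message_length": processed_df["message_length"],
        "word_count": processed_df["word_count"],
        "amharic_ratio": processed_df["amharic_ratio"],
        "hour_of_day": ts.dt.hour,          # 0-23, in the timestamp's timezone (UTC)
        "day_of_week": ts.dt.day_name(),    # e.g. "Saturday"
    })


# Two rows copied from the table above, trimmed to the needed columns.
demo_processed = pd.DataFrame({
    "ID": [7403, 2637],
    "Channel Title": ["Sheger online-store", "Shewa Brand"],
    "Channel Username": ["@shageronlinestore", "@Shewabrand"],
    "Date": pd.to_datetime(
        ["2025-06-21 14:15:20", "2023-06-05 10:35:24"], utc=True
    ),
    "Media Path": ["photos/a.jpg", "photos/b.jpg"],
    "media_type": ["image", "image"],
    "message_length": [481, 117],
    "word_count": [91, 21],
    "amharic_ratio": [0.498960, 0.273504],
})
demo_metadata = extract_metadata_sketch(demo_processed)
print(demo_metadata[["message_id", "hour_of_day", "day_of_week"]])
```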


6. ## Store Structured Data

In [10]:
from utils.data_preprocessor import store_processed_data
%pip install pyarrow
import pyarrow


Note: you may need to restart the kernel to use updated packages.





**Storing Processed Data**
The store_processed_data(processed_df, metadata, output_dir) function saves cleaned Telegram message data and associated metadata to disk in a structured format. Here’s what it does:

🔧 Steps & Tasks:
Create Output Directory
➤ Ensures the output folder (default: ../data/processed) exists.
→ Uses: os.makedirs(output_dir, exist_ok=True)

Save Cleaned Messages
➤ Writes selected columns (ID, Message, Clean_Text, Tokens, Processed, Media Path) to messages.parquet.
→ Format: Parquet (efficient for large datasets)

Save Metadata
➤ Saves additional metadata (e.g., channel info) to metadata.parquet.

Compute & Save Stats
➤ Calculates and stores summary statistics:

- Total messages

- Number of unique channels

- Count of messages with media

- Date range (start to end)

- Average message length

- Average Amharic character ratio

→ Output: stats.json (in native JSON format)

Confirmation Message
➤ Prints a success message indicating where the files were stored.
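The statistics step is the part most likely to trip up JSON serialization, since pandas returns NumPy scalars and Timestamp objects. The sketch below illustrates one way to convert them to native Python types before dumping. The function name compute_stats_sketch and the dictionary keys are hypothetical; only the list of statistics comes from the description above. The Parquet writes are left out here.

```python
import json

import pandas as pd


def compute_stats_sketch(metadata: pd.DataFrame) -> dict:
    """Summarize the metadata table using JSON-serializable Python types."""
    return {
        "total_messages": int(len(metadata)),
        "unique_channels": int(metadata["channel_username"].nunique()),
        "messages_with_media": int(metadata["has_media"].sum()),
        "date_range": {
            # Timestamps are not JSON-serializable, so store ISO 8601 strings.
            "start": metadata["timestamp"].min().isoformat(),
            "end": metadata["timestamp"].max().isoformat(),
        },
        "avg_message_length": float(metadata["message_length"].mean()),
        "avg_amharic_ratio": float(metadata["amharic_ratio"].mean()),
    }


# Two illustrative rows shaped like the metadata table.
demo_meta = pd.DataFrame({
    "channel_username": ["@shageronlinestore", "@Shewabrand"],
    "timestamp": pd.to_datetime(
        ["2025-06-21 14:15:20", "2023-06-05 10:35:24"], utc=True
    ),
    "has_media": [True, True],
    "message_length": [481, 117],
    "amharic_ratio": [0.498960, 0.273504],
})
stats = compute_stats_sketch(demo_meta)
stats_json = json.dumps(stats, indent=2)
print(stats_json)
```

Writing stats_json to a file named stats.json inside the output directory would complete the step as described.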

In [11]:
store_processed_data(processed_df, metadata)

Data successfully stored in ../data/processed


In [13]:
# Load the saved parquet data
parquet_df = pd.read_parquet('../data/processed/messages.parquet')
parquet_df.head()

Unnamed: 0,ID,Clean_Text,Tokens,Processed,Message,Media Path
0,7403,BARDEFU 2 IN 1 Multi purpose juicer ኳሊቲ የሆነ የጁ...,"[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ...","[BARDEFU, 2, IN, 1, Multi, purpose, juicer, ኳሊ...",👋 BARDEFU 2 IN 1 Multi purpose juicer\n\n👉 ኳሊቲ...,photos\@shageronlinestore_7403.jpg
1,7401,portable electrical water dispenser ባለ 3 press...,"[portable, electrical, water, dispenser, ባለ, 3...","[portable, electrical, water, dispenser, ባለ, 3...",💥 portable electrical water dispenser\n\n👉ባለ 3...,photos\@shageronlinestore_7401.jpg
2,7399,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,photos\@shageronlinestore_7399.jpg
3,7395,GROOMING SET ሶስት በአንድ የያዘ የፀጉር ማሽን እና ሼቨር የሚሰራ...,"[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...","[GROOMING, SET, ሶስት, በአንድ, የያዘ, የፀጉር, ማሽን, እና,...",💥GROOMING SET \n\n✂️ ሶስት በአንድ የያዘ የፀጉር ማሽን እና ...,photos\@shageronlinestore_7395.jpg
4,7393,1L Water Bottle High Quality 1L water time sca...,"[1L, Water, Bottle, High, Quality, 1L, water, ...","[1L, Water, Bottle, High, Quality, 1L, water, ...",💥 1L Water Bottle\n\n 💯High Quality\n\n⚡...,photos\@shageronlinestore_7393.jpg
