# Task 1: Data Ingestion and Preprocessing

This notebook handles:
- Telegram channel scraping
- Text preprocessing with etnltk
- Data cleaning and structuring

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import asyncio
from pathlib import Path

from src.data_ingestion.telegram_scraper import TelegramScraper
from src.preprocessing.text_processor import AmharicTextProcessor

## Step 1: Initialize Components

In [2]:
scraper = TelegramScraper()
processor = AmharicTextProcessor()

print("Components initialized successfully")

Components initialized successfully


## Step 2: Scrape Telegram Channels

In [3]:
# Scrape data from channels
df = await scraper.scrape_all_channels(limit_per_channel=5000)

print(f"Scraped {len(df)} messages")
print(f"Channels: {df['channel'].unique()}")
df.head()

INFO:telethon.network.mtprotosender:Connecting to 149.154.167.51:443/TcpFull...
INFO:telethon.network.mtprotosender:Connection to 149.154.167.51:443/TcpFull complete!
INFO:telethon.client.users:Phone migrated to 4
INFO:telethon.client.telegrambaseclient:Reconnecting to new data center 4
INFO:telethon.network.mtprotosender:Disconnecting from 149.154.167.51:443/TcpFull...
INFO:telethon.network.mtprotosender:Disconnection from 149.154.167.51:443/TcpFull complete!
INFO:telethon.network.mtprotosender:Connecting to 149.154.167.91:443/TcpFull...
INFO:telethon.network.mtprotosender:Connection to 149.154.167.91:443/TcpFull complete!
INFO:src.data_ingestion.telegram_scraper:Telegram client initialized successfully


Signed in successfully as Wende M.; remember to not break the ToS or you will risk an account ban!


INFO:src.data_ingestion.telegram_scraper:Scraped 5000 messages from @ethio_market_place
ERROR:src.data_ingestion.telegram_scraper:Error scraping @addis_shopping: Nobody is using this username, or the username is unacceptable. If the latter, it must match r"[a-zA-Z][\w\d]{3,30}[a-zA-Z\d]" (caused by ResolveUsernameRequest)
INFO:src.data_ingestion.telegram_scraper:Scraped 0 messages from @ethio_electronics
INFO:src.data_ingestion.telegram_scraper:Scraped 19 messages from @bole_market
ERROR:src.data_ingestion.telegram_scraper:Error scraping @merkato_online: Nobody is using this username, or the username is unacceptable. If the latter, it must match r"[a-zA-Z][\w\d]{3,30}[a-zA-Z\d]" (caused by ResolveUsernameRequest)
INFO:telethon.network.mtprotosender:Connection closed while receiving data: [WinError 10054] An existing connection was forcibly closed by the remote host
INFO:telethon.network.mtprotosender:Closing current connection to begin reconnect...
INFO:telethon.network.connection.conn

Scraped 13604 messages
Channels: ['@ethio_market_place' '@bole_market' '@zemenExpress' '@shewabrand'
 '@lobelia4cosmetics' '@yetenaweg']


| | channel | message_id | text | date | views | forwards | sender_id | media_type |
|---|---|---|---|---|---|---|---|---|
| 0 | @ethio_market_place | 11011 | ```📌 iPhone 14 Pro Max``` | 2024-09-19 12:43:57+00:00 | 312.0 | 0.0 | -1001601399995 | MessageMediaPhoto |
| 1 | @ethio_market_place | 11010 | ```📌 iPhone 15 Pro Max``` | 2024-09-19 12:43:45+00:00 | 308.0 | 0.0 | -1001601399995 | MessageMediaPhoto |
| 2 | @ethio_market_place | 11009 | ```ዉድ ደንበኞቻችን፣በሁሉም ላፕቶፖች \nላይ ምንም አይነት የዋጋ ጭማሪ... | 2024-09-19 12:43:25+00:00 | 272.0 | 0.0 | -1001601399995 | MessageMediaPhoto |
| 3 | @ethio_market_place | 11008 | ```ዉድ ደንበኞቻችን፣በሁሉም ላፕቶፖች \nላይ ምንም አይነት የዋጋ ጭማሪ... | 2024-09-19 12:42:56+00:00 | 245.0 | 1.0 | -1001601399995 | MessageMediaPhoto |
| 4 | @ethio_market_place | 11007 | ```ዉድ ደንበኞቻችን፣በሁሉም ላፕቶፖች \nላይ ምንም አይነት የዋጋ ጭማሪ... | 2024-09-19 12:42:38+00:00 | 198.0 | 0.0 | -1001601399995 | MessageMediaPhoto |
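
The log above shows two usernames failing to resolve while the run carried on with the remaining channels. The sketch below illustrates that per-channel isolation pattern. It is not the `TelegramScraper` implementation, which this notebook does not show: `fetch_channel`, `FAKE_CHANNELS`, and `scrape_all` are hypothetical names, and the network call is replaced by an in-memory stand-in so the example runs without Telegram credentials.

```python
import asyncio
import logging

logger = logging.getLogger("scraper_sketch")

# Hypothetical stand-in data; a real scraper would query Telegram instead.
FAKE_CHANNELS = {
    "@ethio_market_place": ["msg 1", "msg 2", "msg 3"],
    "@bole_market": ["msg 1"],
}


async def fetch_channel(channel: str, limit: int) -> list:
    """Return up to `limit` messages, or raise if the channel is unknown."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    if channel not in FAKE_CHANNELS:
        raise ValueError(f"Nobody is using this username: {channel}")
    return FAKE_CHANNELS[channel][:limit]


async def scrape_all(channels: list, limit: int) -> tuple:
    """Scrape each channel in turn, recording failures instead of aborting."""
    rows, errors = [], {}
    for channel in channels:
        try:
            messages = await fetch_channel(channel, limit)
        except Exception as exc:  # one bad channel must not stop the run
            logger.error("Error scraping %s: %s", channel, exc)
            errors[channel] = str(exc)
            continue
        rows.extend({"channel": channel, "text": text} for text in messages)
    return rows, errors


if __name__ == "__main__":
    demo = ["@ethio_market_place", "@addis_shopping", "@bole_market"]
    rows, errors = asyncio.run(scrape_all(demo, limit=2))
    print(f"Scraped {len(rows)} messages; failed channels: {list(errors)}")
```

Inside a notebook cell, `await scrape_all(...)` replaces `asyncio.run(...)`, because Jupyter already runs an event loop. Returning the `errors` mapping alongside the rows makes it easy to review which usernames need correcting after the run.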


## Step 3: Save Raw Data

In [4]:
# Save raw data
df.to_csv("../data/raw/telegram_messages.csv", index=False)

print("Raw data saved to data/raw/telegram_messages.csv")

Raw data saved to data/raw/telegram_messages.csv
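
`DataFrame.to_csv` does not create missing directories, so the save above fails on a fresh checkout where `data/raw/` does not exist yet. The sketch below shows the directory-creating step with the standard library only. `save_rows` is a hypothetical helper, not part of this project, and it writes a list of dictionaries rather than a DataFrame to stay dependency-free.

```python
import csv
from pathlib import Path


def save_rows(rows: list, path: Path) -> Path:
    """Write rows to a UTF-8 CSV, creating parent directories first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Column order follows the keys of the first row.
    fieldnames = list(rows[0]) if rows else []
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With pandas, the same effect comes from calling `Path("../data/raw").mkdir(parents=True, exist_ok=True)` before `df.to_csv(...)`.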


## Step 4: Text Preprocessing

In [5]:
# Preprocess the dataset
df_processed = processor.preprocess_dataset(df)

print(f"Processed {len(df_processed)} messages")
print("\nSample processed text:")
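# Note: df and df_processed are paired by position here, which only lines up
# while no rows before index i were dropped during preprocessing.
for i in range(3):
    print(f"Original: {df.iloc[i]['text'][:100]}...")
    print(f"Cleaned:  {df_processed.iloc[i]['cleaned_text'][:100]}...")
    print("---")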

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text_length'] = df['cleaned_text'].str.len()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['entity_hints'] = df['cleaned_text'].apply(self.extract_entities_hints)
INFO:src.preprocessing.text_processor:Preprocessed 13443 messages


Processed 13443 messages

Sample processed text:
Original: ```📌 iPhone 14 Pro Max```...
Cleaned:  ```📌 iPhone 14 Pro Max```...
---
Original: ```📌 iPhone 15 Pro Max```...
Cleaned:  ```📌 iPhone 15 Pro Max```...
---
Original: ```ዉድ ደንበኞቻችን፣በሁሉም ላፕቶፖች 
ላይ ምንም አይነት የዋጋ ጭማሪ አላደረግንም !🙏🙏🙏```...
Cleaned:  ```ዉድ ደንበኞቻችን፣በሁሉም ላፕቶፖች ላይ ምንም አይነት የዋጋ ጭማሪ አላደረግንም !🙏🙏🙏```...
---
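
In the samples above, the only visible change is that the line break inside the third message became a single space. Backticks and emoji are left untouched. A minimal sketch of that whitespace normalization follows. It is an illustration of the observed behavior, not the actual `AmharicTextProcessor` code, and `normalize_whitespace` is a hypothetical name.

```python
import re

# Any run of spaces, tabs, or newlines.
_WHITESPACE = re.compile(r"\s+")


def normalize_whitespace(text) -> str:
    """Collapse whitespace runs to one space; treat non-strings as empty."""
    if not isinstance(text, str):
        return ""  # e.g. media-only messages with no text
    return _WHITESPACE.sub(" ", text).strip()
```

Applied to a DataFrame column, this would be `df["text"].apply(normalize_whitespace)`.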


## Step 5: Data Analysis

In [6]:
# Basic statistics
print("Dataset Statistics:")
print(f"Total messages: {len(df_processed)}")
print(f"Average text length: {df_processed['text_length'].mean():.2f}")
print("Messages per channel:")
print(df_processed['channel'].value_counts())

# Entity hints analysis
print("\nEntity Hints Found:")
total_prices = sum(len(hints.get('prices', [])) for hints in df_processed['entity_hints'] if isinstance(hints, dict))
total_locations = sum(len(hints.get('locations', [])) for hints in df_processed['entity_hints'] if isinstance(hints, dict))
print(f"Price mentions: {total_prices}")
print(f"Location mentions: {total_locations}")

Dataset Statistics:
Total messages: 13443
Average text length: 246.31
Messages per channel:
channel
@ethio_market_place    5000
@zemenExpress          3389
@shewabrand            2775
@lobelia4cosmetics     1649
@yetenaweg              612
@bole_market             18
Name: count, dtype: int64

Entity Hints Found:
Price mentions: 12366
Location mentions: 4013
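
The counts above come from a dictionary of hints per message, with `prices` and `locations` lists. The sketch below shows one way such hints could be produced and totalled. The regular expression and the keyword list are assumptions made for illustration, and `extract_hints` is a hypothetical stand-in for the project's `extract_entities_hints`, whose real rules are not shown in this notebook.

```python
import re

# Assumed pattern: a number followed by a birr marker (Amharic or Latin).
PRICE_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:ብር|birr|ETB)", re.IGNORECASE)

# Assumed, deliberately tiny keyword list; a real one would be much longer.
LOCATION_KEYWORDS = ("አዲስ አበባ", "ቦሌ", "መገናኛ", "Addis Ababa", "Bole")


def extract_hints(text: str) -> dict:
    """Return candidate price amounts and location keywords found in text."""
    lowered = text.lower()
    return {
        "prices": PRICE_PATTERN.findall(text),
        "locations": [kw for kw in LOCATION_KEYWORDS if kw.lower() in lowered],
    }


def count_hints(all_hints: list, key: str) -> int:
    """Total the entries under `key`, skipping anything that is not a dict."""
    return sum(len(h.get(key, [])) for h in all_hints if isinstance(h, dict))
```

For example, `extract_hints("ዋጋ 1,500 ብር ቦሌ")` yields one price (`"1,500"`) and one location (`"ቦሌ"`). These are hints rather than labels: keyword matching will miss spelling variants and can match inside longer words.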


## Step 6: Save Processed Data

In [7]:
# Save processed data
df_processed.to_csv("../data/processed/cleaned_messages.csv", index=False)

print("Processed data saved to data/processed/cleaned_messages.csv")
print("\nData ingestion and preprocessing completed!")

Processed data saved to data/processed/cleaned_messages.csv

Data ingestion and preprocessing completed!
