# 🕷️ Telegram Data Collection

## Overview
Automated data collection from Telegram e-commerce channels:
- Multi-channel scraping
- Media download and storage
- Rate limiting and caching
- CSV export with metadata

**Output**: `../data/telegram_data.csv`
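
The rate limiting listed above presumably lives inside `TelegramScraper`, whose source is not shown in this notebook. Purely as an illustration of the idea, a minimal asyncio limiter that enforces a pause between successive requests might look like the sketch below. `MinIntervalLimiter` and `min_interval` are hypothetical names, not part of the project's API.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Hypothetical helper: keep at least `min_interval` seconds between calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None            # monotonic timestamp of the previous call
        self._lock = asyncio.Lock()  # serialises concurrent callers

    async def wait(self) -> None:
        async with self._lock:
            if self._last is not None:
                remaining = self.min_interval - (time.monotonic() - self._last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last = time.monotonic()


async def demo() -> float:
    # Create the limiter inside the running event loop, as a notebook cell would.
    limiter = MinIntervalLimiter(0.05)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()  # a real scraper would issue one request after each wait
    return time.monotonic() - start
```

Calling `await limiter.wait()` before every request spaces the requests out without blocking other coroutines.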

---

### 📚 Import Libraries

In [7]:
import os
import sys
import asyncio
import pandas as pd
from datetime import datetime
from dotenv import load_dotenv

# Add src to path
sys.path.append(os.path.abspath('../src'))
from data_collection.telegram_scraper import TelegramScraper, ScrapingConfig

# Load environment
load_dotenv()

True

### ⚙️ Configuration

In [8]:
# Telegram channels to scrape
CHANNELS = [
    '@classybrands',
    '@Shageronlinestore',
    '@ZemenExpress',
    '@sinayelj',
    '@modernshoppingcenter'
]

# Get credentials
API_ID = os.getenv('TG_API_ID')
API_HASH = os.getenv('TG_API_HASH')
OUTPUT_FILE = '../data/telegram_data.csv'

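if not API_ID or not API_HASH:
    from getpass import getpass  # hides typed input so the secret is not saved in the cell output

    print("⚠️ API credentials not found in .env file")
    API_ID = input("Enter your Telegram API ID: ").strip()
    API_HASH = getpass("Enter your Telegram API Hash: ").strip()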

print(f"📡 Channels to scrape: {len(CHANNELS)}")
print(f"💾 Output file: {OUTPUT_FILE}")
print("✅ Configuration loaded")

📡 Channels to scrape: 5
💾 Output file: ../data/telegram_data.csv
✅ Configuration loaded
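
Channel lists that are edited by hand tend to collect stray spaces, missing `@` prefixes, and duplicates. A small normalisation pass before scraping can catch these. The sketch below is an optional helper, not something the scraper requires; `normalize_channels` is a hypothetical name, and the case-insensitive comparison relies on Telegram usernames being case-insensitive.

```python
def normalize_channels(channels):
    """Hypothetical helper: trim, add a leading '@', and drop duplicates.

    Duplicates are detected case-insensitively; the first spelling is kept.
    """
    seen = set()
    result = []
    for name in channels:
        bare = name.strip().lstrip('@')
        if not bare:
            continue  # skip empty entries
        handle = '@' + bare
        key = handle.lower()
        if key not in seen:
            seen.add(key)
            result.append(handle)
    return result


cleaned = normalize_channels(['@classybrands', 'ZemenExpress ', '@zemenexpress', ''])
```

Here `cleaned` keeps `'@classybrands'` and the first spelling of the ZemenExpress handle, and drops the duplicate and the empty entry.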


### 🚀 Start Data Collection

In [None]:
# Create scraper configuration
scraper_config = ScrapingConfig(
    api_id=API_ID,
    api_hash=API_HASH,
    output_file=OUTPUT_FILE,
    max_messages=5000
)

# Initialize scraper with config object
scraper = TelegramScraper(scraper_config)

print("🔄 Starting data collection...")
print(f"⏰ Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

# Run scraping; the finally block closes the scraper even if a channel fails
try:
    results = await scraper.scrape_channels(CHANNELS)

    print(f"✅ Collection completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("📊 Results:")
    for channel, count in results.items():
        print(f"  {channel}: {count} messages")
finally:
    await scraper.close()
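
The cell above uses top-level `await`, which works in Jupyter/IPython but is a syntax error in a plain Python script. To run the same flow from a script, one option is to wrap it in a coroutine and hand it to `asyncio.run`. The sketch below uses a stand-in class, `StubScraper`, so that it runs on its own; it only mimics the two methods the notebook calls and is not the real `TelegramScraper`. The `collect` function name is likewise an assumption.

```python
import asyncio


class StubScraper:
    """Stand-in for illustration only; mimics scrape_channels() and close()."""

    def __init__(self):
        self.closed = False

    async def scrape_channels(self, channels):
        # The real scraper returns message counts per channel; the stub returns zeros.
        return {channel: 0 for channel in channels}

    async def close(self):
        self.closed = True


async def collect(scraper, channels):
    """Scrape the given channels and always close the scraper afterwards."""
    try:
        return await scraper.scrape_channels(channels)
    finally:
        await scraper.close()


if __name__ == '__main__':
    results = asyncio.run(collect(StubScraper(), ['@classybrands']))
    print(results)
```

In a real script, `StubScraper()` would be replaced by `TelegramScraper(scraper_config)` built as in the cell above.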

### 📊 Data Summary

In [None]:
# Load and display data summary
if os.path.exists(OUTPUT_FILE):
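    # Parse 'Date' as datetimes so min()/max() compare chronologically, not as strings
    df = pd.read_csv(OUTPUT_FILE, parse_dates=['Date'])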
    
    print("📈 Collection Summary:")
    print("=" * 40)
    print(f"Total messages: {len(df):,}")
    print(f"Columns: {list(df.columns)}")
    print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
    
    # Messages per channel
    channel_counts = df['Channel Username'].value_counts()
    print("\n📡 Messages per channel:")
    for channel, count in channel_counts.items():
        print(f"  {channel}: {count:,} messages")
    
    print("\n👀 Sample data:")
    print(df.head(3)[['Channel Username', 'Message', 'Date']].to_string())
else:
    print("❌ No data file found")

### ✅ Collection Complete

**Next Steps:**
1. Run `Preprocessing.ipynb` to clean and label the data
2. Train NER models using the fine-tuning notebooks
3. Generate vendor analytics with `vendor_scorecard_Engine.ipynb`

**Files Created:**
- Raw scraped data at the configured output path (`../data/telegram_data.csv`)
- Media files (if enabled)
- Cache files for incremental updates
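
How the scraper's cache files are structured is not shown in this notebook. One common approach to incremental updates, sketched here under that caveat, is to store the highest message ID seen per channel and skip anything at or below it on the next run, since message IDs within a Telegram channel increase over time. The function names, the JSON layout, and the `'id'` key are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return {channel: last_seen_message_id}, or an empty dict if no cache exists."""
    cache_file = Path(path)
    if not cache_file.exists():
        return {}
    return json.loads(cache_file.read_text(encoding='utf-8'))


def filter_new(messages, cache, channel):
    """Keep only messages whose ID is above the cached high-water mark."""
    last_id = cache.get(channel, 0)
    return [m for m in messages if m['id'] > last_id]


def update_cache(path, cache, channel, messages):
    """Advance the high-water mark for `channel` and write the cache to disk."""
    if messages:
        newest = max(m['id'] for m in messages)
        cache[channel] = max(cache.get(channel, 0), newest)
    Path(path).write_text(json.dumps(cache, indent=2), encoding='utf-8')
```

On each run, load the cache, pass each channel's fetched messages through `filter_new`, append only the new rows to the CSV, then call `update_cache`.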