# Task 1 – Data Scraping & Collection



---

## 📖 Overview

In this task, we’ll extract raw data from Telegram channels and populate our data lake with unaltered JSON and image files. This “source of truth” will serve as the foundation for all downstream transformation and enrichment.

---

## Objectives

1. **Telegram Scraping**  
   - Connect to the Telegram API (via Telethon or similar).  
   - Extract messages from targeted Ethiopian medical-business channels.  
2. **Image Collection**  
   - Download any media (images) attached to those messages.  
3. **Data Lake Ingestion**  
   - Store all raw JSON payloads under `data/raw/telegram_messages/YYYY-MM-DD/<channel>.json`.  
   - Save images under `data/raw/telegram_images/YYYY-MM-DD/<channel>/<message_id>.jpg`.  
4. **Logging & Reliability**  
   - Log channel names, timestamps, and success/failure for each scrape.  
   - Handle errors (e.g., rate limits) gracefully and retry where appropriate.  
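
The logging and retry objective can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual scraper: `RateLimited`, `scrape_with_retry`, and the `scrape_fn` callback are hypothetical names introduced here for illustration. With Telethon, the analogous exception would be `FloodWaitError`, whose `seconds` attribute reports the required wait.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")


class RateLimited(Exception):
    """Hypothetical stand-in for a rate-limit error that carries a wait time."""

    def __init__(self, seconds):
        super().__init__(f"rate limited, wait {seconds}s")
        self.seconds = seconds


def scrape_with_retry(scrape_fn, channel, max_attempts=3, sleep=time.sleep):
    """Call scrape_fn(channel); wait and retry when it reports a rate limit.

    Returns the scrape result, or None once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = scrape_fn(channel)
            logger.info("Scraped %s on attempt %d", channel, attempt)
            return result
        except RateLimited as exc:
            logger.warning(
                "Rate limited on %s (attempt %d/%d), wait %ss",
                channel, attempt, max_attempts, exc.seconds,
            )
            # Only wait if another attempt will follow.
            if attempt < max_attempts:
                sleep(exc.seconds)
    logger.error("Giving up on %s after %d attempts", channel, max_attempts)
    return None
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays. Any other exception propagates unchanged, so unexpected failures stay visible instead of being retried silently.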

---

## 🔗 Channel Sources

- **Lobelia Cosmetics**: `https://t.me/lobelia4cosmetics`  
- **Tikvah Pharma**: `https://t.me/tikvahpharma`  
- _…plus additional channels from_ [et.tgstat.com/medicine](https://et.tgstat.com/medicine)
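
The channel settings used later in this notebook are bare handles rather than full links. As a small sketch, assuming plain `https://t.me/<handle>` links like the ones above, the handle can be derived from the URL. The helper name `channel_username` is hypothetical.

```python
from urllib.parse import urlparse


def channel_username(url: str) -> str:
    """Return the handle from a plain https://t.me/<handle> link."""
    path = urlparse(url).path
    # Drop surrounding slashes and keep only the first path segment.
    return path.strip("/").split("/")[0]


urls = ["https://t.me/lobelia4cosmetics", "https://t.me/tikvahpharma"]
print(",".join(channel_username(u) for u in urls))
# prints: lobelia4cosmetics,tikvahpharma
```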

---





In [1]:
# 2️⃣ Load your .env so that the script sees the same config
from dotenv import load_dotenv
from pathlib import Path
import os

project_root = Path(r"C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product")
load_dotenv(dotenv_path=project_root / ".env")

# Optional: verify channels are loaded
print("TEXT_CHANS =", os.getenv("TELEGRAM_TEXT_CHANNELS"))
print("IMAGE_CHANS =", os.getenv("TELEGRAM_IMAGE_CHANNELS"))


TEXT_CHANS = lobelia4cosmetics,tikvahpharma,CheMed123
IMAGE_CHANS = lobelia4cosmetics,CheMed123


## Load configuration from `.env`

Use the `python-dotenv` package to load your environment variables from the project’s `.env` file so that your script can pick up the same API keys and channel settings every time it runs.
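
The channel variables are stored as comma-separated strings, so the scraper has to split them before use. Here is a minimal sketch of that step. The helper name `channels_from_env` is an assumption made for illustration, and the real script may parse the values differently.

```python
import os


def channels_from_env(var_name: str) -> list[str]:
    """Split a comma-separated environment variable into clean channel names."""
    raw = os.getenv(var_name, "")
    # Ignore empty entries and stray whitespace around names.
    return [name.strip() for name in raw.split(",") if name.strip()]


# Normally load_dotenv() sets this; the fallback value mirrors the notebook output.
os.environ.setdefault("TELEGRAM_TEXT_CHANNELS", "lobelia4cosmetics,tikvahpharma,CheMed123")
print(channels_from_env("TELEGRAM_TEXT_CHANNELS"))
```

An unset variable yields an empty list, which lets the caller skip scraping instead of failing on `None`.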




In [2]:

import nest_asyncio
nest_asyncio.apply()
print("nest_asyncio applied — you can now use sync Telethon in Jupyter!")


nest_asyncio applied — you can now use sync Telethon in Jupyter!



## Run the scraper as a standalone process

To avoid conflicts with any existing event loops (e.g., in Jupyter), launch your scraper script in its own Python process:


In [8]:

# 3️⃣ Run the scraper as its own process (no event-loop clash)
project_root = Path(r"C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product")
scraper_path = project_root / "src" / "scrapper.py"
!python "{scraper_path}"


  return datetime.utcnow().strftime("%Y-%m-%d")
2025-07-13 09:16:05,016 INFO Connecting to 149.154.167.92:443/TcpFull...
2025-07-13 09:16:05,211 INFO Connection to 149.154.167.92:443/TcpFull complete!
2025-07-13 09:16:06,546 INFO Connected to Telegram for text scraping
2025-07-13 09:16:06,546 INFO Fetching up to 100 messages from lobelia4cosmetics
2025-07-13 09:16:07,844 INFO Saved 100 messages to data\raw\telegram_messages\2025-07-13\lobelia4cosmetics.json
2025-07-13 09:16:07,844 INFO Fetching up to 100 messages from tikvahpharma
2025-07-13 09:16:09,995 INFO Saved 100 messages to data\raw\telegram_messages\2025-07-13\tikvahpharma.json
2025-07-13 09:16:09,995 INFO Fetching up to 100 messages from CheMed123
2025-07-13 09:16:11,199 INFO Saved 76 messages to data\raw\telegram_messages\2025-07-13\CheMed123.json
2025-07-13 09:16:11,200 INFO Disconnecting from 149.154.167.92:443/TcpFull...
2025-07-13 09:16:11,203 INFO Disconnection from 149.154.167.92:443/TcpFull complete!
2025-07-13 09:16:1
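
Outside a notebook the `!` shell escape is not available. The same process isolation can be approximated with the standard `subprocess` module. This is a sketch only: `run_script` is a hypothetical helper, and the commented usage assumes the project layout shown above.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def run_script(script: Path, cwd: Optional[Path] = None) -> subprocess.CompletedProcess:
    """Run a Python script in a fresh interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, str(script)],  # same interpreter as the caller
        cwd=cwd,                        # relative data/raw paths resolve from here
        capture_output=True,
        text=True,
        check=False,                    # inspect returncode instead of raising
    )


# Hypothetical usage; adjust project_root to your own checkout:
# result = run_script(project_root / "src" / "scrapper.py", cwd=project_root)
# print(result.returncode)
# print(result.stderr)  # the logging module writes to stderr by default
```

Using `sys.executable` keeps the child process in the same virtual environment as the notebook kernel, which avoids picking up a different `python` from `PATH`.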

## Verify that the JSON and JPG files landed under `data/raw`

After running the scraper, we can programmatically check that our raw data files exist in the expected directories:


In [9]:
# 4️⃣ Verify that the JSON and JPG files landed under data/raw
import glob, json
from pprint import pprint

msg_files = sorted(glob.glob("data/raw/telegram_messages/*/*.json"))
img_files = sorted(glob.glob("data/raw/telegram_images/*/*/*.jpg"))

print(f"Found {len(msg_files)} JSON files:")
pprint(msg_files)
print(f"\nFound {len(img_files)} image files:")
pprint(img_files)

# 5️⃣ Peek at one JSON record
if msg_files:
    with open(msg_files[0], encoding="utf-8") as f:
        sample = json.load(f)
    print("\nSample record:")
    pprint(sample[0])


Found 3 JSON files:
['data/raw/telegram_messages\\2025-07-13\\CheMed123.json',
 'data/raw/telegram_messages\\2025-07-13\\lobelia4cosmetics.json',
 'data/raw/telegram_messages\\2025-07-13\\tikvahpharma.json']

Found 95 image files:
['data/raw/telegram_images\\2025-07-13\\CheMed123\\33.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\34.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\38.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\39.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\40.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\41.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\43.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\44.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\45.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\46.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\48.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123\\49.jpg',
 'data/raw/telegram_images\\2025-07-13\\CheMed123
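
The listing above mixes `/` and `\\` separators, most likely because `glob.glob` keeps the literal pattern prefix and joins matched parts with the Windows separator. A `pathlib`-based check avoids that and can report per-channel results. The sketch below assumes the folder layout seen in the output. `utc_day`, `message_file`, `image_files`, and `check_channel` are hypothetical helpers. It uses `datetime.now(timezone.utc)` because `datetime.utcnow()` is deprecated as of Python 3.12.

```python
from datetime import datetime, timezone
from pathlib import Path

RAW_ROOT = Path("data/raw")  # assumed relative to the current working directory


def utc_day() -> str:
    """Today's UTC date as YYYY-MM-DD, using a timezone-aware datetime."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")


def message_file(channel: str, day: str, root: Path = RAW_ROOT) -> Path:
    """Expected JSON path for one channel and one date partition."""
    return root / "telegram_messages" / day / f"{channel}.json"


def image_files(channel: str, day: str, root: Path = RAW_ROOT) -> list:
    """All .jpg files stored for one channel and one date partition."""
    return sorted((root / "telegram_images" / day / channel).glob("*.jpg"))


def check_channel(channel: str, day: str, root: Path = RAW_ROOT) -> dict:
    """Summarise whether the raw files for a channel are present."""
    return {
        "channel": channel,
        "json_exists": message_file(channel, day, root).is_file(),
        "image_count": len(image_files(channel, day, root)),
    }


for name in ["lobelia4cosmetics", "tikvahpharma", "CheMed123"]:
    print(check_channel(name, utc_day()))
```

Calling `.as_posix()` on any of the returned paths gives forward-slash strings that look the same on Windows and Linux.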

In [None]:
# 6️⃣ Count images per channel
from collections import Counter
import glob
import os

img_paths = glob.glob("data/raw/telegram_images/*/*/*.jpg")
# The channel name is the folder that directly contains each image
channels = [os.path.basename(os.path.dirname(p)) for p in img_paths]
counts = Counter(channels)

print("Images per channel:")
for channel, cnt in counts.items():
    print(f"  {channel}: {cnt}")


Images per channel:
  CheMed123: 45
  lobelia4cosmetics: 50
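
A natural follow-up is to compare these counts against the channels we expected images from. The sketch below is illustrative only: the helper names are hypothetical, the demo paths are shortened, and `tikvahpharma` is included purely to show what a channel with no images looks like.

```python
from collections import Counter
from pathlib import PurePosixPath


def images_per_channel(paths):
    """Count images by the name of their parent folder (the channel)."""
    return Counter(PurePosixPath(p).parent.name for p in paths)


def channels_without_images(counts, expected):
    """Return the expected channels that have no images at all."""
    return [channel for channel in expected if counts.get(channel, 0) == 0]


# Demo paths with forward slashes; the lobelia4cosmetics id is made up.
demo_paths = [
    "data/raw/telegram_images/2025-07-13/CheMed123/33.jpg",
    "data/raw/telegram_images/2025-07-13/CheMed123/34.jpg",
    "data/raw/telegram_images/2025-07-13/lobelia4cosmetics/1.jpg",
]
demo_counts = images_per_channel(demo_paths)
print(channels_without_images(demo_counts, ["lobelia4cosmetics", "CheMed123", "tikvahpharma"]))
# prints: ['tikvahpharma']
```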
