The Scraper is a Go-based application designed to scrape messages and metadata from both Telegram channels and YouTube channels. For Telegram, it uses the TDLib library to collect and process messages. For YouTube, it uses the YouTube Data API. Results are stored in either local files or Azure Blob Storage for further analysis.
- Multi-platform support:
  - Telegram channels via the TDLib API
  - YouTube channels via the YouTube Data API
- Scrapes and processes messages, videos, metadata, and engagement statistics
- Supports various Telegram message types (e.g., text, video, photo)
- Captures YouTube video metadata including views, likes, and comments
- Saves data in JSONL format for easy integration with data pipelines
- Optionally integrates with Azure Blob Storage for scalable cloud storage
- Supports incremental crawling with progress tracking to resume interrupted crawls
- Option to skip media downloads to save bandwidth and storage
- Go: Ensure Go is installed on your system.
- TDLib: The Telegram Database Library must be installed for Telegram scraping. See TDLib Installation section.
- Telegram API Credentials (for Telegram scraping):
  - `TG_API_ID`
  - `TG_API_HASH`
  - `TG_PHONE_NUMBER`
  - `TG_PHONE_CODE` (OTP sent to your phone)
- YouTube API Key (for YouTube scraping):
- Google Developer API key with YouTube Data API v3 enabled
- Environment Variables: Set up the following based on your use case.
brew install tdlib
sudo apt-get update
sudo apt-get install -y build-essential cmake gperf libssl-dev zlib1g-dev
git clone https://github.com/tdlib/td.git
cd td
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
sudo make install
- `TG_API_ID`: Telegram API ID (obtained from https://my.telegram.org).
- `TG_API_HASH`: Telegram API Hash (obtained from https://my.telegram.org).
- `TG_PHONE_NUMBER`: Your Telegram phone number with country code (e.g., +12025551234).
- `TG_PHONE_CODE`: OTP sent to your phone by Telegram during authentication.
- No environment variable is required, but you must provide the YouTube API key via the `--youtube-api-key` parameter when running the scraper with `--platform youtube`.
- `CONTAINER_NAME`: Name of the Azure Blob Storage container.
- `BLOB_NAME`: Blob path under which scraped data is stored.
- `AZURE_STORAGE_ACCOUNT_URL`: Azure Storage account URL.
- `SEED_LIST`: Comma-separated list of channel usernames/IDs to scrape (or provide as a command-line argument with `--urls`).
- `STORAGE_DIR`: Custom directory path for local storage (default: `./storage`).
- `MAX_MESSAGES`: Maximum number of messages to scrape per channel (default: all messages).
- `LOG_LEVEL`: Logging level (default: "info"; options: "debug", "info", "warn", "error").
- `/storage`: Default local directory for storing progress and scraped data.
- `/crawls/{crawlid}/channel_name/data.jsonl`: Output for scraped channel data.
- `/crawls/{crawlid}/progress.json`: Tracks crawling progress for resumption.
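The `progress.json` file is what makes interrupted crawls resumable. A minimal sketch of how such a file could be loaded is shown below; the struct fields (`crawl_id`, `completed_channels`) are illustrative assumptions, not the scraper's actual schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Progress mirrors the kind of state progress.json might hold.
// Field names here are assumptions for illustration only.
type Progress struct {
	CrawlID   string          `json:"crawl_id"`
	Completed map[string]bool `json:"completed_channels"`
}

// loadProgress returns a fresh Progress when no file exists, so a new
// crawl and a resumed crawl share one code path.
func loadProgress(path, crawlID string) (*Progress, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return &Progress{CrawlID: crawlID, Completed: map[string]bool{}}, nil
	}
	if err != nil {
		return nil, err
	}
	var p Progress
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	p, err := loadProgress("progress.json", "demo-crawl")
	if err != nil {
		panic(err)
	}
	p.Completed["channel1"] = true
	fmt.Println(p.CrawlID, len(p.Completed))
}
```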
- Clone the repository:
  git clone https://github.com/your-repo/telegram-scraper.git
  cd telegram-scraper
- Install dependencies:
  go mod tidy
- Build the project:
  go build -o telegram-scraper
The scraper supports the following command-line arguments:
Usage: ./telegram-scraper [options]
Options:
--urls string Comma-separated list of channel usernames/IDs to scrape
--url-file string File containing URLs to crawl (one per line)
--crawl-id string Specify a custom crawl ID for tracking (default: auto-generated)
--crawl-label string User-defined label for the crawl (e.g., "youtube-snowball")
--storage-root string Directory for storing data locally (default: "/tmp/crawl")
--max-posts int Maximum number of posts to collect per channel (default: all)
--max-comments int Maximum number of comments to crawl per post (default: all)
--max-depth int Maximum depth of the crawl (default: all)
--min-post-date string Minimum post date to crawl (format: YYYY-MM-DD)
--time-ago string Only consider posts newer than this time ago (e.g., '30d', '6h', '2w', '1m', '1y')
--skip-media Skip downloading media files (thumbnails, videos, etc.)
--platform string Platform to crawl (telegram, youtube) (default: "telegram")
--youtube-api-key string API key for YouTube Data API (required for YouTube platform)
--log-level string Set logging level: trace, debug, info, warn, error (default: "debug")
--dapr Run with DAPR enabled
--help Display this help message
Run the scraper with a list of Telegram channel usernames:
./telegram-scraper --urls "channel1,channel2,channel3"
For macOS users, you need to set specific environment variables for the TDLib compilation:
export CGO_CFLAGS="-I/opt/homebrew/include"
export CGO_LDFLAGS="-L/opt/homebrew/lib -lssl -lcrypto"
./telegram-scraper --urls "channel1,channel2,channel3"
To scrape YouTube channels, you need to provide your YouTube API key:
./telegram-scraper --platform youtube --youtube-api-key "YOUR_API_KEY" --urls "UCxxx,UCyyy"
Where:
- `UCxxx,UCyyy` are YouTube channel IDs (starting with UC)
- You can also use channel handles (starting with @) or custom URLs
To save bandwidth and storage, you can skip media downloads:
./telegram-scraper --urls "channel1,channel2" --skip-media
To resume an interrupted crawl:
./telegram-scraper --urls "channel1,channel2" --crawl-id "your-previous-crawl-id"
To limit the number of posts scraped per channel:
./telegram-scraper --urls "channel1,channel2" --max-posts 1000
To only scrape posts after a specific date:
./telegram-scraper --urls "channel1,channel2" --min-post-date "2023-01-01"
Or using relative time:
./telegram-scraper --urls "channel1,channel2" --time-ago "30d"
To specify a custom storage location:
./telegram-scraper --urls "channel1" --storage-root "/path/to/custom/dir"
If Azure Blob Storage is enabled via environment variables (CONTAINER_NAME, BLOB_NAME, and AZURE_STORAGE_ACCOUNT_URL), the scraper will automatically upload data to the specified container and blob.
To explicitly enable Azure upload:
./telegram-scraper --urls "channel1" --dapr
When running the scraper for the first time:
- Enter your phone number when prompted (with country code, e.g., +12025551234)
- Enter the authentication code sent to your Telegram app
- If you have Two-Factor Authentication enabled, enter your password when prompted
The auth session will be saved locally for future use.
- `main.go`: Entry point for the application, CLI command setup, and configuration processing
- `crawl/`: Core crawling logic for Telegram
- `crawler/`: Platform-agnostic crawler interfaces and factories
- `crawler/youtube/`: YouTube-specific crawler implementation
- `telegramhelper/`: Handles TDLib client connections and Telegram API interactions
- `state/`: Manages progress, seed list setup, and storage (local and Azure Blob)
- `model/`: Defines unified data structures for storing messages from all platforms
- `model/youtube/`: YouTube-specific data models
- `common/`: Shared utilities, configuration structures, and helper functions
- `standalone/`: Runner implementation for standalone mode execution
- `dapr/`: DAPR integration for cloud-based operation
- `crawler.Crawler`: Common interface for all platform crawlers
- `crawler.CrawlerFactory`: Factory for creating platform-specific crawlers
- `state.StateManagementInterface`: Interface for managing state across different storage backends
- `state.StateManagerFactory`: Factory for creating state managers based on configuration
To scrape Telegram data for @examplechannel and @anotherchannel, storing results in Azure Blob Storage:
- Set environment variables:
export TG_API_ID="your_api_id"
export TG_API_HASH="your_api_hash"
export TG_PHONE_NUMBER="your_phone_number"
export TG_PHONE_CODE="your_telegram_code"
export CONTAINER_NAME="your_container_name"
export BLOB_NAME="your_blob_path"
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
- Run the scraper:
./telegram-scraper --urls "examplechannel,anotherchannel"
To scrape YouTube videos from specific channels:
- Set environment variables for storage (optional):
export CONTAINER_NAME="your_container_name"
export BLOB_NAME="your_blob_path"
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
- Run the scraper with your YouTube API key:
./telegram-scraper --platform youtube --youtube-api-key "YOUR_API_KEY" --urls "UCxxx1,UCxxx2"
- Run with additional parameters:
# Scrape only videos from the last 90 days with a custom label
./telegram-scraper --platform youtube --youtube-api-key "YOUR_API_KEY" --urls "UCxxx1,UCxxx2" --time-ago "90d" --crawl-label "tech-channels-q1"
# Limit to 100 videos per channel
./telegram-scraper --platform youtube --youtube-api-key "YOUR_API_KEY" --urls "UCxxx1,UCxxx2" --max-posts 100
The scraper outputs Telegram data in JSONL format with the following structure:
{
"post_uid": "12345-examplechannel",
"url": "https://t.me/examplechannel/12345",
"channel_id": "1001234567890",
"channel_name": "Example Channel",
"published_at": "2023-04-01T12:34:56Z",
"description": "Example message content",
"thumb_url": "https://storage.example.com/thumbnails/abc123.jpg",
"media_url": "https://storage.example.com/videos/def456.mp4",
"views_count": 1000,
"comments_count": 25,
"shares_count": 50,
"engagement": 1075,
"platform_name": "Telegram",
"post_type": ["video"],
"outlinks": ["channel1", "channel2"],
"reactions": {"👍": 20, "❤️": 15},
"comments": [
{
"comment_id": "1234",
"author": "User1",
"text": "Great post!",
"created_at": "2023-04-01T12:40:00Z"
}
],
"channel_data": {
"channel_id": "1001234567890",
"channel_name": "Example Channel",
"channel_url_external": "https://t.me/c/examplechannel",
"channel_engagement_data": {
"follower_count": 50000,
"post_count": 1500,
"views_count": 2000000
}
}
}
The scraper outputs YouTube data in a similar JSONL format:
{
"post_uid": "video123-channel456",
"url": "https://www.youtube.com/watch?v=video123",
"channel_id": "UCchannel456",
"channel_name": "Example YouTube Channel",
"published_at": "2023-04-01T12:34:56Z",
"description": "Example video description",
"thumb_url": "https://i.ytimg.com/vi/video123/maxresdefault.jpg",
"media_url": "",
"views_count": 10000,
"comments_count": 250,
"likes_count": 500,
"engagement": 10750,
"platform_name": "YouTube",
"post_type": ["video"],
"outlinks": [],
"comments": [
{
"comment_id": "comment789",
"author": "YouTubeUser1",
"text": "Great video!",
"created_at": "2023-04-01T13:00:00Z"
}
],
"channel_data": {
"channel_id": "UCchannel456",
"channel_name": "Example YouTube Channel",
"channel_url_external": "https://www.youtube.com/channel/UCchannel456",
"channel_description": "This is an example YouTube channel",
"channel_engagement_data": {
"follower_count": 100000,
"video_count": 500
}
},
"crawl_label": "tech-channels-q1"
}
The scraper implements exponential backoff for handling rate limits from the Telegram API. If you encounter persistent rate limiting:
- Reduce the number of channels in your seed list
- Increase the delay between requests by modifying the code in `telegramhelper/client.go`
- Consider using a different Telegram account with fewer API calls
YouTube Data API has strict quota limits (typically 10,000 units per day for a new API key):
- Each search request costs 100 units
- Each video details request costs 1 unit
- Each comment thread request costs 1 unit
To avoid quota exhaustion:
- Limit the number of channels you scrape in a single run
- Use the `--max-posts` parameter to limit videos per channel
- Consider using multiple API keys for larger scraping jobs
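Budgeting against the quota is simple arithmetic from the unit costs above. For example, crawling 10 channels with one search each, 100 videos per channel, and one comment-thread request per video would cost 10×100 + 1,000 + 1,000 = 3,000 units, comfortably inside the 10,000-unit daily default:

```go
package main

import "fmt"

// estimateQuota applies the unit costs listed above: 100 per search
// request plus 1 per video-details and 1 per comment-thread request.
func estimateQuota(searches, videos, commentThreads int) int {
	return searches*100 + videos + commentThreads
}

func main() {
	// 10 channels, one search each, 100 videos per channel,
	// one comment-thread request per video.
	used := estimateQuota(10, 10*100, 10*100)
	fmt.Println(used) // 3000
}
```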
- Authentication Issues: Ensure Telegram API credentials are correct. Delete the `.tdlib` directory to restart authentication.
- TDLib Errors: Check that TDLib is properly installed and accessible.
- YouTube API Key Issues: Verify your API key is valid and has YouTube Data API v3 enabled.
- YouTube API Quota Exceeded: Wait until your quota resets (usually at midnight Pacific Time) or use a different API key.
- Storage Errors: Verify write permissions to the storage directory.
- Azure Upload Failures: Confirm your Azure Blob Storage configuration and credentials.
- macOS Compilation Errors: Set the required CGO environment variables as described in the Usage section.
- Log Analysis: Set `--log-level debug` for more detailed logging information.
This project is licensed under the Apache 2.0 License.