itkhansunny/Tranco-filter

Tranco Domain List Processor

A powerful Python utility for downloading, filtering, processing, and analyzing domain data from the Tranco list.

Overview

The Tranco Domain List Processor downloads, filters, and analyzes domain data from the Tranco list, a research-oriented top-sites ranking that aims to be more reliable than commercial rankings. The script provides several filtering options, automated scheduling, and multiple output formats.

Features

  • Automatic Downloads: Fetches the latest Tranco list directly from the official website
  • Custom Domain Filtering:
    • Filter by Top-Level Domain (TLD): .com, .net, .org, etc.
    • Filter by domain length (excluding TLD)
    • Filter by Tranco rank range
    • Filter by custom regex patterns
  • Multiple Output Formats:
    • Plain text files (one domain per line)
    • CSV format (with rank information)
    • JSON format (with detailed domain metadata)
    • HTML format (with clickable links)
  • TLD Analysis:
    • Comprehensive distribution statistics
    • Top TLD occurrence rankings
    • Exportable analysis reports
  • Advanced Processing:
    • Multithreaded processing for large datasets
    • Batch processing of multiple TLDs
    • Progress bars for long operations
  • Automation:
    • Automated scheduling via Windows Task Scheduler
    • Set daily, weekly, or monthly update schedules

Requirements

  • Python 3.6+
  • Required Python packages:
    • requests
    • beautifulsoup4
    • tqdm

Installation

  1. Clone or download this repository to your local machine
  2. Install the required dependencies:
pip install requests beautifulsoup4 tqdm

Usage

Run the script with Python:

python tranco.py

Main Menu Options

When you run the script, you'll be presented with a main menu:

  1. Filter domains from Tranco list - Download and filter domains based on your criteria
  2. Analyze TLD distribution - Get statistics about TLD distribution in the Tranco list
  3. Delete all text files from directory - Clean up output files

Basic Workflow

To filter and save domains:

  1. Select option 1 from the main menu
  2. Follow the prompts to specify your filtering criteria
  3. Choose your preferred output format
  4. The script will download the latest Tranco list and apply your filters
  5. Filtered domain lists will be saved to your current directory
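
As a rough illustration of what happens after the download step, the sketch below parses Tranco-style CSV text, where each line is `rank,domain` with no header row. The function name `parse_tranco_csv` is hypothetical and is not taken from tranco.py; the script's own parsing may differ.

```python
import csv
import io


def parse_tranco_csv(text):
    """Parse 'rank,domain' lines into a list of (rank, domain) tuples.

    Hypothetical helper for illustration; not the script's actual code.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 2:
            continue  # skip blank or malformed lines
        rank, domain = record
        rows.append((int(rank), domain.strip().lower()))
    return rows


sample = "1,google.com\n2,facebook.com\n\n3,example.org\n"
entries = parse_tranco_csv(sample)
```

The resulting list of `(rank, domain)` pairs is a convenient shape for the filters described in the following sections.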

Advanced Features

Domain Filtering

The script offers multiple filtering options that can be combined:

Filter by TLD

Enter TLD name(s) separated by commas or spaces (e.g., com net io), or 'all' for all TLDs: com org net

Use all to process all available TLDs in the Tranco list.
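
One way the TLD prompt input could be interpreted is sketched below. The names `parse_tld_input` and `tld_of` are hypothetical, and treating the TLD as the text after the last dot is an assumption (so `example.co.uk` counts as `uk`); the script may handle these cases differently.

```python
import re


def parse_tld_input(raw):
    """Split 'com, net io' style input into a list of TLDs.

    Returns None when the user enters 'all', meaning no TLD restriction.
    """
    tokens = [t.lstrip(".").lower() for t in re.split(r"[,\s]+", raw.strip())]
    tokens = [t for t in tokens if t]
    if "all" in tokens:
        return None
    # Keep the user's order but drop duplicates.
    return list(dict.fromkeys(tokens))


def tld_of(domain):
    """Return the label after the last dot (assumed definition of TLD)."""
    return domain.rsplit(".", 1)[-1].lower()


def filter_by_tld(domains, tlds):
    """Keep domains whose TLD is in tlds; tlds=None keeps everything."""
    if tlds is None:
        return list(domains)
    wanted = set(tlds)
    return [d for d in domains if tld_of(d) in wanted]
```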

Filter by Domain Length

Do you want to filter domains by length? (y/n): y
Enter minimum domain length (characters in the domain name, excluding TLD): 5
Enter maximum domain length (characters in the domain name, excluding TLD): 10

Domain length refers to the characters in the domain name excluding the TLD.
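
A minimal sketch of the length rule, assuming the name is everything before the last dot (so `google.com` has length 6). How the script counts subdomains or inner dots is not documented here, so that part is an assumption; `name_length` and `filter_by_length` are hypothetical names.

```python
def name_length(domain):
    """Length of the domain name with the final '.tld' removed."""
    name, sep, _tld = domain.rpartition(".")
    # If there is no dot at all, treat the whole string as the name.
    return len(name) if sep else len(domain)


def filter_by_length(domains, min_len, max_len):
    """Keep domains whose name length is within [min_len, max_len]."""
    return [d for d in domains if min_len <= name_length(d) <= max_len]
```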

Filter by Ranking

Do you want to filter domains by ranking? (y/n): y
Enter minimum rank (1 = highest ranked): 1
Enter maximum rank: 1000

This allows you to get only the top N domains (e.g., top 1000).
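
The rank filter amounts to an inclusive range check. The sketch below assumes entries are `(rank, domain)` pairs; `filter_by_rank` is a hypothetical name, not a function from the script.

```python
def filter_by_rank(entries, min_rank, max_rank):
    """Keep (rank, domain) pairs with min_rank <= rank <= max_rank."""
    if min_rank > max_rank:
        raise ValueError("min_rank must not exceed max_rank")
    return [(rank, domain) for rank, domain in entries
            if min_rank <= rank <= max_rank]
```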

Filter by Pattern

Do you want to filter domains by a pattern? (y/n): y
Enter a regex pattern to match domains (e.g. 'tech|ai' for domains containing 'tech' or 'ai'): tech|ai

Use standard regex patterns to match specific domain characteristics.
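
Note that an unanchored pattern matches anywhere in the domain, so `tech|ai` also matches `mail.com`. The sketch below assumes a case-insensitive `re.search` over the full domain; whether the script uses `search` or `match`, and whether it ignores case, is an assumption. `filter_by_pattern` is a hypothetical name.

```python
import re


def filter_by_pattern(domains, pattern):
    """Keep domains where the regex matches anywhere in the string."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [d for d in domains if regex.search(d)]


domains = ["techcrunch.com", "mail.com", "example.org", "open.ai"]
loose = filter_by_pattern(domains, "tech|ai")   # includes mail.com
strict = filter_by_pattern(domains, r"\.ai$")   # only the .ai TLD
```

Anchoring with `^` or `$` is the usual way to narrow a pattern to the start of the name or to the TLD.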

TLD Distribution Analysis

The TLD distribution analysis feature provides insights into the composition of the Tranco list:

  • Shows the count and percentage of each TLD
  • Identifies the most common TLDs
  • Allows saving the complete distribution to a file
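
The counts and percentages listed above can be computed with `collections.Counter`. This is a sketch under the assumption that the TLD is the label after the last dot; `tld_distribution` is a hypothetical name.

```python
from collections import Counter


def tld_distribution(domains):
    """Return (tld, count, percentage) tuples, most common first."""
    counts = Counter(d.rsplit(".", 1)[-1].lower() for d in domains)
    total = sum(counts.values())
    return [(tld, n, 100.0 * n / total) for tld, n in counts.most_common()]


for tld, count, pct in tld_distribution(["a.com", "b.com", "c.org", "d.com"]):
    print(f".{tld}: {count} ({pct:.1f}%)")
```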

Multiple Output Formats

The script supports four output formats:

  • TXT: Simple text file with one domain per line
  • CSV: Comma-separated values with rank and domain
  • JSON: Structured JSON format with detailed domain information
  • HTML: Interactive HTML page with clickable domain links

Example of JSON output:

[
  {
    "rank": 1,
    "domain": "google.com",
    "name": "google",
    "tld": "com"
  },
  ...
]
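
Records with the four fields shown above could be produced roughly as follows. The helper names `to_records` and `to_json` are hypothetical, and splitting on the last dot to obtain `name` and `tld` is an assumption about how the script derives them.

```python
import json


def to_records(entries):
    """Turn (rank, domain) pairs into dicts matching the JSON example."""
    records = []
    for rank, domain in entries:
        name, _, tld = domain.rpartition(".")
        records.append({"rank": rank, "domain": domain,
                        "name": name, "tld": tld})
    return records


def to_json(entries):
    """Serialize the records as indented JSON text."""
    return json.dumps(to_records(entries), indent=2)
```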

Example of HTML output:

<!DOCTYPE html>
<html>
<head>
    <title>Filtered Domains (.com)</title>
    <!-- CSS styling -->
</head>
<body>
    <h1>Top .com Domains</h1>
    <table>
        <tr><th>Rank</th><th>Domain</th></tr>
        <tr><td>1</td><td><a href="http://google.com" target="_blank">google.com</a></td></tr>
        <!-- More domains -->
    </table>
</body>
</html>
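
A page of this shape can be generated with plain string formatting. The sketch below is not the script's own template: `render_html` is a hypothetical name, the CSS is omitted, and values are passed through `html.escape` as a precaution.

```python
from html import escape


def render_html(entries, tld):
    """Render (rank, domain) pairs as a simple HTML table with links."""
    rows = []
    for rank, domain in entries:
        safe = escape(domain, quote=True)
        rows.append(
            f'        <tr><td>{rank}</td><td>'
            f'<a href="http://{safe}" target="_blank">{safe}</a></td></tr>'
        )
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"    <title>Filtered Domains (.{escape(tld)})</title>\n"
        "</head>\n<body>\n"
        f"    <h1>Top .{escape(tld)} Domains</h1>\n"
        "    <table>\n"
        "        <tr><th>Rank</th><th>Domain</th></tr>\n"
        f"{body}\n"
        "    </table>\n</body>\n</html>\n"
    )
```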

Automatic Scheduling

The script can set up automatic recurring downloads via Windows Task Scheduler:

Do you want to set up automatic scheduling for this script? (y/n): y

Schedule frequency options:
1. Daily
2. Weekly
3. Monthly
Enter your choice (1-3): 1
Enter the time to run (HH:MM, 24-hour format): 03:00

This feature requires administrative privileges in Windows.
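
On Windows, scheduled tasks are typically created with the `schtasks` command. The sketch below only builds the argument list; the task name `TrancoUpdate`, the function name, and the exact options the script passes are assumptions made for illustration.

```python
import re

# Map the menu choices to schtasks /SC schedule types.
_FREQUENCIES = {"daily": "DAILY", "weekly": "WEEKLY", "monthly": "MONTHLY"}


def build_schtasks_command(script_path, frequency, start_time,
                           task_name="TrancoUpdate", python_exe="python"):
    """Build a 'schtasks /Create' argument list (hypothetical helper)."""
    key = frequency.lower()
    if key not in _FREQUENCIES:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    # schtasks expects a zero-padded 24-hour HH:MM start time.
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", start_time):
        raise ValueError(f"invalid time: {start_time!r}")
    return [
        "schtasks", "/Create",
        "/TN", task_name,                            # task name
        "/TR", f'"{python_exe}" "{script_path}"',    # command to run
        "/SC", _FREQUENCIES[key],                    # schedule type
        "/ST", start_time,                           # start time
        "/F",                                        # overwrite if it exists
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)` from an elevated prompt on Windows.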

Multithreaded Processing

When processing multiple TLDs (especially with the all option), the script uses multithreading to speed up processing. The number of threads is chosen automatically based on the number of CPU cores on your system.
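
A per-TLD fan-out of this kind might look like the sketch below, which uses `concurrent.futures.ThreadPoolExecutor`. The worker-count formula and the function names are assumptions, not the script's actual logic.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def domains_for_tld(entries, tld):
    """Collect the domains from (rank, domain) pairs that end in '.tld'."""
    suffix = "." + tld
    return [domain for _, domain in entries if domain.endswith(suffix)]


def process_tlds(entries, tlds, max_workers=None):
    """Process each TLD in its own task and return {tld: [domains]}."""
    tlds = list(tlds)
    # Assumed sizing rule: scale the pool with the CPU count, capped at 32.
    workers = max_workers or min(32, (os.cpu_count() or 1) + 4)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda t: domains_for_tld(entries, t), tlds)
        return dict(zip(tlds, results))
```

In CPython, threads tend to help most with I/O such as writing one output file per TLD; pure in-memory filtering is limited by the global interpreter lock, so the gain there may be smaller.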

Sample Outputs

After processing, you'll get domain list files named according to your filters. Examples:

  • tranco_1M5JD_com_domains.txt - All .com domains
  • tranco_1M5JD_com_domains_5-10chars.txt - .com domains with 5-10 characters
  • tranco_1M5JD_com_domains_rank1-1000.txt - Top 1000 .com domains
  • tld_distribution_analysis_2023-04-15.txt - TLD distribution analysis report

Troubleshooting

Common Issues

Issue: Script fails to download the Tranco list
Solution: Check your internet connection and ensure you have permission to write files to the current directory.

Issue: Task scheduler setup fails
Solution: Run the script as administrator to set up scheduled tasks.

Issue: Script seems slow when processing all TLDs
Solution: This is normal for large datasets. The script uses multithreading to speed up processing, but it still takes time. Consider filtering by specific TLDs instead of using all.

Error Messages

  • "Error fetching the website" - Check your internet connection
  • "Error downloading the file" - Check disk space and permissions
  • "Error setting up scheduled task" - Try running as administrator

License

This project is released under the MIT License. See the LICENSE file for details.


Created with ❤️ by Khan Sunny
