Conversation


@aidenmitchell aidenmitchell commented Nov 5, 2025

Summary

  • Adds a new step to filter out TLD-like domains from the Tranco list before processing
  • Excludes 25 second-level domains under generic TLDs that function as alternative TLDs (like net.ru, br.com, uk.com, etc.)
  • Provides logging for transparency on which domains are removed

Motivation

The Tranco list includes domains like net.ru, br.com, and uk.com that function as alternative top-level domains rather than regular domains. These should be filtered out to improve data quality.

Changes

  • Added "Remove excluded domains" step in the workflow
  • Filters 25 known TLD-like domains from tranco.csv
  • Logs removal counts and final line count

Test plan

  • Verify workflow runs successfully
  • Check that excluded domains are removed from the processed Tranco list
  • Confirm logging output shows removal statistics

🤖 Generated with Claude Code

aidenmitchell and others added 6 commits November 5, 2025 10:00
This commit adds a new step to filter out TLD-like domains from the Tranco list before processing. The excluded domains are second-level domains under generic TLDs that function as alternative TLDs (like net.ru, br.com, uk.com, etc.).

This filtering step:
- Removes 25 known TLD-like domains from the Tranco list
- Logs removal counts for transparency
- Runs before the configuration and processing steps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Replace grep -c with grep | wc -l | tr -d ' ' to properly handle count
- Add TOTAL_REMOVED counter to track total exclusions
- Add summary output section for better visibility
- Use -E flag consistently for regex matching

The previous version had a newline issue with grep -c output that caused
"integer expression expected" errors in the comparison. This fix ensures
clean integer values for all comparisons.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
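The counting fix described in this commit can be sketched as follows (a minimal stand-in — the sample file and domains are hypothetical, not the real Tranco data):

```shell
# Minimal reproduction of the counting pattern (sample data is hypothetical).
printf '1,example.com\n2,net.ru\n3,foo.org\n' > tranco_sample.csv

# grep -c prints 0 but exits nonzero when nothing matches, so '|| true'
# is needed to keep 'set -e' scripts alive; combined with stray \r
# characters its output can fail bash's integer comparisons.
COUNT=$(grep -E -c ',nomatch\.example$' tranco_sample.csv || true)

# The replacement pipeline: wc -l always yields a number, and tr -d ' '
# strips the leading padding that some wc implementations emit.
SAFE_COUNT=$(grep -E ',net\.ru$' tranco_sample.csv | wc -l | tr -d ' ')

if [ "$SAFE_COUNT" -gt 0 ]; then
  echo "removing net.ru ($SAFE_COUNT occurrence(s))"
fi
```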
The Tranco CSV file uses Windows line endings (\r\n), which was preventing
the grep pattern from matching domains correctly. The pattern ",domain$"
was failing because there's a \r character before the \n.

Changes:
- Properly escape dots in domain names for regex matching
- Update pattern to match optional \r before end of line: ",domain(\r)?$"
- This now correctly handles both Unix (\n) and Windows (\r\n) line endings

Tested with actual Tranco file and confirmed removal of:
- 613,uk.com
- 2644,net.ru
- 6123,br.com

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
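A sketch of the CRLF-tolerant match described above, on a hypothetical two-line sample. Note that because `\r` inside an ERE is not interpreted consistently across grep implementations (the very problem the next commit addresses), this sketch embeds a literal carriage-return character in the pattern:

```shell
# Hypothetical sample mimicking the Tranco CSV's Windows line endings.
printf '613,uk.com\r\n614,example.com\r\n' > tranco_crlf.csv

DOMAIN='uk.com'
# Escape literal dots so they don't act as regex wildcards.
ESCAPED=$(printf '%s' "$DOMAIN" | sed 's/\./\\./g')

# Without accounting for \r, the $ anchor never matches: the \r is
# still part of the line as grep sees it.
BROKEN=$(grep -E -c ",${ESCAPED}\$" tranco_crlf.csv || true)

# With an optional literal carriage return before end-of-line, both
# Unix (\n) and Windows (\r\n) endings match.
CR=$(printf '\r')
MATCHES=$(grep -E -c ",${ESCAPED}(${CR})?\$" tranco_crlf.csv || true)
echo "broken=$BROKEN fixed=$MATCHES"
```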
Instead of trying to match \r in regex patterns (which has inconsistent
behavior across different grep implementations), normalize the file to
Unix line endings first using 'tr -d', then use simple end-of-line patterns.

Changes:
- Add tr -d '\r' step to strip all carriage returns before processing
- Simplified grep pattern from ",domain(\r)?$" to ",domain$"
- Use grep -c directly (safe now that output is clean)

Tested locally and confirmed:
- Removes net.ru, uk.com, br.com successfully
- File line count reduced from 1000000 to 999997

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
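The normalize-first approach from this commit, sketched on a hypothetical five-line sample (the real file has one million rows):

```shell
# Hypothetical stand-in for tranco.csv with Windows line endings.
printf '1,net.ru\r\n2,example.com\r\n3,uk.com\r\n4,foo.org\r\n5,br.com\r\n' > tranco_raw.csv

# Normalize once, then plain end-of-line anchors work everywhere.
tr -d '\r' < tranco_raw.csv > tranco.csv

TOTAL_REMOVED=0
for DOMAIN in net.ru uk.com br.com; do
  ESCAPED=$(printf '%s' "$DOMAIN" | sed 's/\./\\./g')
  N=$(grep -E -c ",${ESCAPED}\$" tranco.csv || true)
  TOTAL_REMOVED=$((TOTAL_REMOVED + N))
  grep -E -v ",${ESCAPED}\$" tranco.csv > tranco.tmp && mv tranco.tmp tranco.csv
done

REMAINING=$(wc -l < tranco.csv | tr -d ' ')
echo "removed=$TOTAL_REMOVED remaining=$REMAINING"
```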
Replace 'grep -c' with 'grep | wc -l | tr -d ' '' to ensure clean integer
output without newlines or extra whitespace. This prevents the
"integer expression expected" error when a domain is not found.

The previous version using grep -c was outputting values with formatting
that caused bash integer comparison to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of maintaining a manual list of ~20 TLD-like domains, now fetch
and use the complete Public Suffix List (PSL) from publicsuffix.org.
Remove any Tranco entries that exactly match a PSL entry.

This is more comprehensive and maintainable:
- Covers ~9,754 public suffixes (vs 21 hardcoded)
- Automatically includes new suffixes as PSL is updated
- Removes infrastructure domains (workers.dev, github.io, herokuapp.com, etc.)
- Removes second-level TLDs (br.com, uk.com, net.ru, etc.)

Tested on top 10k Tranco domains:
- Found and removed 75 PSL entries
- Including: workers.dev, github.io, herokuapp.com, netlify.app,
  vercel.app, and all previously hardcoded domains

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
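A sketch of the PSL-based filtering this commit introduces. The real step downloads the list from publicsuffix.org with curl; here a tiny inline stand-in is used instead, and the per-entry grep loop mirrors the shape this commit used (the single-pass awk optimization comes later):

```shell
# Tiny hypothetical stand-in for the PSL; the real list comes from
# https://publicsuffix.org/list/public_suffix_list.dat (fetched with curl).
cat > psl.dat <<'EOF'
// ===BEGIN ICANN DOMAINS===
com
uk.com

*.ck
!www.ck
github.io
EOF

# Keep only literal suffixes: drop comments (//), wildcards (*),
# exceptions (!), and blank lines.
grep -Ev '^(//|\*|!|$)' psl.dat > psl_clean.txt

printf '1,google.com\n2,uk.com\n3,example.org\n4,github.io\n' > tranco.csv

# Remove Tranco rows whose domain exactly matches a PSL entry. Exact
# matching means 'com' does not drop google.com: the anchor ",com$"
# only matches when the whole domain column is 'com'.
TOTAL_REMOVED=0
while IFS= read -r SUFFIX; do
  ESCAPED=$(printf '%s' "$SUFFIX" | sed 's/\./\\./g')
  N=$(grep -E -c ",${ESCAPED}\$" tranco.csv || true)
  if [ "$N" -gt 0 ]; then
    TOTAL_REMOVED=$((TOTAL_REMOVED + N))
    grep -E -v ",${ESCAPED}\$" tranco.csv > tranco.tmp && mv tranco.tmp tranco.csv
  fi
done < psl_clean.txt
echo "removed=$TOTAL_REMOVED"
```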

Copilot AI left a comment


Pull Request Overview

This PR adds a new workflow step to filter out Public Suffix List (PSL) entries from the Tranco CSV file before further processing. The step fetches the PSL, processes it to remove comments and special entries, then removes any exact domain matches from the Tranco list.

  • Downloads and filters the Public Suffix List to exclude comments, wildcards, exceptions, and empty lines
  • Normalizes line endings in tranco.csv and removes PSL entries that match domains exactly
  • Reports statistics on PSL entries checked, found, and removed from the Tranco list


…code quality

Changes based on Copilot review feedback:

1. **Removed redundant variable**: Eliminated FOUND_COUNT, using only TOTAL_REMOVED

2. **Fixed regex escaping vulnerability**: Replaced regex-based grep with awk's
   exact string matching using associative arrays. This avoids all regex
   special character issues (dots, brackets, parentheses, hyphens, etc.)

3. **Added comprehensive error handling**:
   - Check curl exit code and fail fast if PSL download fails
   - Verify PSL file has content before proceeding
   - Added explicit error messages

4. **Optimized performance**: Replaced O(n×m) loop (1000 iterations of grep
   over 1M lines) with single-pass awk using hash table lookups O(n+m).
   This reduces processing time from ~3.5 minutes to ~10 seconds.

The awk approach:
- Loads all PSL entries into an associative array (hash table)
- Processes tranco.csv in a single pass
- Uses exact string matching via 'in' operator (no regex)
- Outputs filtered data and removal count efficiently

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
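The awk approach described above can be sketched like this (hypothetical cleaned-PSL and Tranco samples in place of the real files):

```shell
# Hypothetical cleaned PSL and Tranco sample.
printf 'uk.com\ngithub.io\nnet.ru\n' > psl_clean.txt
printf '1,google.com\n2,uk.com\n3,example.org\n4,github.io\n' > tranco.csv

# Single pass: load PSL entries into an associative array, then test the
# domain column of each Tranco row by exact string lookup -- no regex,
# so dots, hyphens, and brackets in names need no escaping.
awk -F',' '
  NR == FNR { psl[$0] = 1; next }   # first input file: PSL entries
  $2 in psl { removed++; next }     # exact hash lookup: drop the row
  { print }                         # otherwise keep the row
  END { print "removed=" removed+0 > "/dev/stderr" }
' psl_clean.txt tranco.csv > tranco_filtered.csv

KEPT=$(wc -l < tranco_filtered.csv | tr -d ' ')
echo "kept=$KEPT"
```

Because the lookup is a hash-table membership test, the whole job is one pass over each file, O(n+m), instead of re-scanning the million-line CSV once per PSL entry.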
@aidenmitchell aidenmitchell requested a review from Copilot November 5, 2025 19:26

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.



Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@aidenmitchell aidenmitchell enabled auto-merge (squash) November 5, 2025 19:33
@aidenmitchell aidenmitchell merged commit 8881bd0 into master Nov 5, 2025
2 checks passed
@aidenmitchell aidenmitchell deleted the exclude-tld-domains-from-tranco branch November 5, 2025 20:45