j3ssie/durl

Diff URLs

Remove duplicate URLs by retaining only the unique combinations of hostname, path, and parameter names.

Install

go install github.com/j3ssie/durl@latest

Usage

# basic usage
cat wayback_urls.txt | durl | tee differ-urls.txt

# with extra regex
cat wayback_urls.txt | durl -e 'your-regex-here' | tee differ-urls.txt

# keep only URLs in scope for the target domain
cat spider-urls.txt | durl -t 'target.com' | tee in-scope-url.txt

# parse JSONL data, reading the URL from the 'url' field
cat large-jsonl-data.txt | durl -t 'target.com' -f url | tee in-scope-jsonl-data.txt
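
The JSONL mode can be pictured with a short Go sketch. It is only an illustration of the idea behind `-f url` and `-t 'target.com'`, not durl's actual code: the helper names `extractField` and `inScope`, the subdomain matching rule, and the choice to print the whole record are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// extractField is a hypothetical helper: it returns the string value stored
// under field in one JSONL record, or false if the line is not a JSON object
// or the field is missing or not a string.
func extractField(line, field string) (string, bool) {
	var record map[string]interface{}
	if err := json.Unmarshal([]byte(line), &record); err != nil {
		return "", false
	}
	value, ok := record[field].(string)
	return value, ok
}

// inScope is a hypothetical helper: it reports whether the URL's hostname is
// the target itself or one of its subdomains. The matching rule is an
// assumption made for this sketch.
func inScope(raw, target string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	target = strings.ToLower(target)
	return host == target || strings.HasSuffix(host, "."+target)
}

func main() {
	// Made-up sample records, used only for illustration.
	lines := []string{
		`{"url": "https://app.target.com/login?next=/home", "status": 200}`,
		`{"url": "https://cdn.other.net/lib.js", "status": 200}`,
		`not json`,
	}
	for _, line := range lines {
		// Keep the whole record when its url field is in scope.
		if u, ok := extractField(line, "url"); ok && inScope(u, "target.com") {
			fmt.Println(line)
		}
	}
}
```

With these sample records, only the first line is printed; the second is out of scope and the third is not valid JSON.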

Covered cases

The following examples illustrate the criteria used to ensure each URL is considered unique and listed only once:

  1. URLs with the same hostname, path, and parameter names:
http://sample.example.com/product.aspx?productID=123&type=customer
http://sample.example.com/product.aspx?productID=456&type=admin
  2. Paths indicating static content such as blog, news, or calendar pages:
https://www.example.com/cn/news/all-news/public-1.html
https://www.sample.com/de/about/business/countrysites.htm
https://www.sample.com/de/about/business/very-long-string-here-that-exceed-100-char.htm
https://www.sample.com/de/blog/2022/01/02/blog-title.htm
  3. URLs with numeric variations:
https://www.example.com/data/0001.html
https://www.example.com/data/0002.html
  4. Static files are ignored, for example http://example.com.com/cdn-cgi/style.css
  5. With JSONL input, a URL field is selected from each record and then filtered using all of the cases above.
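
The first and third cases can be sketched in Go as a deduplication key. This is a minimal illustration under stated assumptions, not durl's implementation: `dedupKey` is a hypothetical helper, and replacing every run of digits in the path with a placeholder is just one possible way to treat numeric variations as equal.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"sort"
	"strings"
)

// digits matches any run of decimal digits in a path.
var digits = regexp.MustCompile(`\d+`)

// dedupKey is a hypothetical helper: it builds a key from the hostname, the
// path, and the sorted parameter names, ignoring parameter values.
func dedupKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	query := u.Query()
	names := make([]string, 0, len(query))
	for name := range query {
		names = append(names, name)
	}
	sort.Strings(names) // make the key independent of parameter order

	// Assumption: collapse digit runs so /data/0001.html and /data/0002.html
	// produce the same key.
	path := digits.ReplaceAllString(u.Path, "N")
	return u.Hostname() + path + "?" + strings.Join(names, "&"), nil
}

func main() {
	urls := []string{
		"http://sample.example.com/product.aspx?productID=123&type=customer",
		"http://sample.example.com/product.aspx?productID=456&type=admin",
		"https://www.example.com/data/0001.html",
		"https://www.example.com/data/0002.html",
	}
	seen := make(map[string]bool)
	for _, raw := range urls {
		key, err := dedupKey(raw)
		if err != nil {
			continue // skip lines that do not parse as URLs
		}
		if !seen[key] {
			seen[key] = true
			fmt.Println(raw) // keep the first URL seen for each key
		}
	}
}
```

Run on the four example URLs above, the sketch keeps the first URL of each pair and drops the second.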
