Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revisit blocklist stamps #8

Closed
ignoramous opened this issue Jul 20, 2021 · 1 comment
Closed

Revisit blocklist stamps #8

ignoramous opened this issue Jul 20, 2021 · 1 comment
Assignees

Comments

@ignoramous
Copy link
Contributor

ignoramous commented Jul 20, 2021

Today, the blocklists metadata is stored in a dictionary-encoded bitmap as leaf node of domains compacted into a radix trie.

This encoded bitmap is then represented in a URL (like the one rethinkdns/configure generates in response to user's selection) as a url-safe base64.

The bitmap encoding is such that, up to a maximum of 256 -- 16 (blocklist groups) x 16 (blocklists per group) -- unique dictionary values (blocklist index) can be represented. Of course, this can be increased by increasing size of bytes used to represent the blocklist group (currently 2 bytes) and/or blocklists themselves (also at 2 bytes).

Ideally, the encoded representation must optimize for the data being encoded. Analysis tells us that there are likely over 80% domains that are present only either in one or two blocklists. So, there's a potential to rethink the current strategy of encoding the blocklist metadata the way we are while also increasing number of blocklists that RethinkDNS, the resolver, can support.

Total domain entries: 5814457
Unique domains: 2550019
// count is the total number of domains present in #{entry} number of blocklists
// that is, the first line below means, 1412059 domains (55%) exist in just 1 blocklist.
Entry: 1 : count: 1412059 : %: 55.37445015115574
Entry: 2 : count: 554896 : %: 21.76046531418001
Entry: 3 : count: 214961 : %: 8.429780327126975
Entry: 4 : count: 85965 : %: 3.371151352205611
Entry: 5 : count: 44341 : %: 1.7388497889623569
Entry: 6 : count: 62576 : %: 2.4539425000362742
Entry: 7 : count: 26620 : %: 1.0439137904462672
Entry: 8 : count: 49785 : %: 1.9523383943413755
Entry: 9 : count: 27338 : %: 1.0720704433966963
Entry: 10 : count: 17297 : %: 0.6783086714255855
Entry: 11 : count: 13686 : %: 0.5367018833977315
Entry: 12 : count: 17039 : %: 0.6681910997525901
Entry: 13 : count: 9138 : %: 0.358350271115627
Entry: 14 : count: 5380 : %: 0.21097882015781058
Entry: 15 : count: 3046 : %: 0.11945009037187565
Entry: 16 : count: 1807 : %: 0.07086221710504902
Entry: 17 : count: 1139 : %: 0.0446663338586889
Entry: 18 : count: 763 : %: 0.029921345684090984
Entry: 19 : count: 688 : %: 0.02698019112798767
Entry: 20 : count: 413 : %: 0.016195957755608878
Entry: 21 : count: 328 : %: 0.0128626492586918
Entry: 22 : count: 244 : %: 0.009568556155856094
Entry: 23 : count: 179 : %: 0.0070195555405665605
Entry: 24 : count: 109 : %: 0.00427447795487014
Entry: 25 : count: 70 : %: 0.00274507758569642
Entry: 26 : count: 54 : %: 0.0021176312803943814
Entry: 27 : count: 42 : %: 0.0016470465514178522
Entry: 28 : count: 24 : %: 0.0009411694579530583
Entry: 29 : count: 14 : %: 0.0005490155171392841
Entry: 30 : count: 10 : %: 0.00039215394081377434
Entry: 31 : count: 4 : %: 0.00015686157632550972
Entry: 32 : count: 2 : %: 0.00007843078816275486
Entry: 33 : count: 1 : %: 0.00003921539408137743
Entry: 34 : count: 0 : %: 0
Entry: 35 : count: 1 : %: 0.00003921539408137743

from: celzero/serverless-dns/issues/4.

@ignoramous
Copy link
Contributor Author

When an entry is in less than 4 blocklists, it is encoded as-is (8-bits per blocklist) instead of in a 16x16 bit-map (min 32 bits). This saves about 33% in size of the final compact trie.

serverless-dns/blocklists@c5fd408

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants