# [Comments on DNS Robustness, Mark Allman, IMC'18](https://www.icir.org/mallman/pubs/All18a/All18a.pdf)

## Summary
### Approach
This paper investigates the robustness of the DNS ecosystem for popular domain names. Using Alexa top1M lists and zone files for the .com, .net, .org TLDs.


### Datasets
Nine years of data (2009 to 2018):
- Alexa top 1M
- TLD Zone Files (.com, .net, .org)

Also includes traceroutes (only for 2018).


### Limitations / Future work
**Only 3 TLDs**: They have only have .com, .net, .org zone files so limit their study to these three TLDs domains.
The paper says that looking at more TLDs is left for future work (end of 'Dataset A', section 3.1). 

**Topological determination**: The topological diversity of servers is checked simply by looking if nameservers are in the same /24 or not. The paper says that better historical routing information will be used in future work to refine the analysis (section 3.2, step 3).

**Anycast prefixes**: One limitation of the original study is to ignore anycast prefixes (they keep that for future work). We can check that for them :)

**IPv6**: Original paper looks only at IPv4? we can do both

## (Section 3.1) Coverage of .com, .net, .org in popularity list 
The paper considers domain names from only three TLDs but shows that it represents the majority of the Alexa top 1M.

### Original results
<img src="fig/fig1.png" style="height: 400px;"/>

- The paper reports that between 2009 and 2018, the three TLDs constitute at least 56% of the Alexa list.
- Out of these the paper ignores 12-15% SLDs because their nameservers are under different TLDs.

### IYP Results

In [2]:
# Setup access to IYP

from neo4j import GraphDatabase, RoutingControl
from collections import defaultdict

# Using IYP local instance
# URI = "neo4j://localhost:7687"
# Using IYP public instance
URI = "neo4j+s://iyp-bolt.iijlab.net:7687"
AUTH = ('neo4j', 'password')
db = GraphDatabase.driver(URI, auth=AUTH)

In [54]:
# Get the percentage of .com, .net, and .org domain names in Tranco top 1M
query = """MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)
WHERE d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org'
RETURN COUNT(DISTINCT d.name)"""

res, _, _ = db.execute_query(query, database_="neo4j");
nb_sld = res[0][0]
print(f'{100*nb_sld/1000000:.1f}% of Tranco top1M domain names are under the .com, .net, or .org TLD.')

49.1% of Tranco top1M domain names are under the .com, .net, or .org TLD.


In [28]:
# Find the percentage of domain names that have nameservers not in the .com, .net, and .org TLDs
query = """MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[m:MANAGED_BY {reference_name:'openintel.dnsgraph_nl'}]-(a:AuthoritativeNameServer)
WHERE (d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org')
WITH d, COLLECT(a) AS ns, COLLECT(m) AS managed
// check if all nameservers are outside the zone and have no glue
WHERE all( a in ns WHERE NOT a.name ENDS WITH '.com' AND NOT d.name ENDS WITH '.net' AND NOT d.name ENDS WITH '.org') AND all( m in managed WHERE m.should_follow_glue_v4 = false)
RETURN COUNT(DISTINCT d.name)"""

res, _, _ = db.execute_query(query, database_="neo4j");
nb_excluded = res[0][0]
print(f'{100*nb_excluded/nb_sld:.1f}% of Tranco top1M domain names are ignored by the original paper assumptions (only .com, .net, .org nameservers).')


10.3% of Tranco top1M domain names are ignored by the original paper assumptions (only .com, .net, .org nameservers).


## (Section 4.1) Nameserver Replicas
The paper checks nameserver requirements for each .com, .net, and .org SLD, that is at least two nameservers should be deployed in two different locations (different /24 prefixes).

### Original Results

<img src="fig/fig3.png" style="height: 400px;"/>

The three curves are obtained using data from the three zones files.
The dots on the right hand side of the figure are for a refine analysis using traceroute data.

### IYP Results

In [41]:
# Get the number of domain names that meet, not meet, and exceed the nameserver requirements (same limitations as original studies)
query = """ // Exclude domain names that have nameservers outside of the .com, .net, .org TLDs and no glue record
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[m:MANAGED_BY {reference_name:'openintel.dnsgraph_nl'}]-(a:AuthoritativeNameServer)
WHERE (d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org')
WITH d, COLLECT(a) AS ns, COLLECT(m) AS managed
// check if all nameservers are outside the zone and have no glue
WHERE all( a in ns WHERE NOT a.name ENDS WITH '.com' AND NOT d.name ENDS WITH '.net' AND NOT d.name ENDS WITH '.org') AND all( m in managed WHERE m.should_follow_glue_v4 = false)
WITH COLLECT(DISTINCT d.name) AS partial_cover

// Count nameserver's /24 per domain name
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)
WHERE (d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org') AND NOT d.name IN partial_cover
OPTIONAL MATCH (d)-[:MANAGED_BY]-(ans:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})
WITH d.name as dname, COUNT(DISTINCT REDUCE(pfx = "", n IN SPLIT(i.ip, '.')[0..3] | pfx + n + ".")) AS nb_pfx
WITH dname,
CASE
     WHEN nb_pfx = 2 THEN "meet"
     WHEN nb_pfx < 2 THEN "don't meet"
     WHEN nb_pfx > 2 THEN "exceed"
     ELSE "unk."
END AS ns_req
RETURN ns_req, COUNT(DISTINCT dname) AS count"""

res, _, _ = db.execute_query(query, database_="neo4j")
for r in res:
    print(f'{100*r["count"]/nb_sld:.1f}% of domain names {r["ns_req"]} the nameserver requirements.')

4.4% of domain names don't meet the nameserver requirements.
66.9% of domain names exceed the nameserver requirements.
18.4% of domain names meet the nameserver requirements.


In [42]:
# Get the number of domain names that meet, not meet, and exceed the nameserver requirements for all Tranco
query = """
// Count nameserver's /24 per domain name
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)
OPTIONAL MATCH (d)-[:MANAGED_BY]-(ans:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})
WITH d.name as dname, COUNT(DISTINCT REDUCE(pfx = "", n IN SPLIT(i.ip, '.')[0..3] | pfx + n + ".")) AS nb_pfx
WITH dname,
CASE
     WHEN nb_pfx = 2 THEN "meet"
     WHEN nb_pfx < 2 THEN "don't meet"
     WHEN nb_pfx > 2 THEN "exceed"
     ELSE "unk."
END AS ns_req
RETURN ns_req, COUNT(DISTINCT dname) AS count"""

res, _, _ = db.execute_query(query, database_="neo4j")
for r in res:
    print(f'{100*r["count"]/1000000:.1f}% of domain names {r["ns_req"]} the nameserver requirements.')

69.3% of domain names exceed the nameserver requirements.
6.1% of domain names don't meet the nameserver requirements.
22.4% of domain names meet the nameserver requirements.


## (Section 4.2) Glue Location
The paper also look for domain names that have a nameserver under the same TLD of the domain name. For example, a nameserver for www.example.com is ns1.example.com.

### Original Results

The original study founds that 69-73% of the popular SLDs have at least one in-zone NS record.

### IYP Results

In [33]:
# Get the number of domain names with a in-zone nameserver
old_query = """ // Infer the domain that have glue records in the same zone
WHERE (d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org') AND ( right(d.name, 3) = right(a.name, 3) )
RETURN COUNT(DISTINCT d.name)"""

query = """ // Count the domain that have glue records in the same zone
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[m:MANAGED_BY {reference_name:'openintel.dnsgraph_nl'}]-(a:AuthoritativeNameServer)
WHERE (d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org') AND m.should_follow_glue_v4 = true
RETURN COUNT(DISTINCT d.name)"""

res, _, _ = db.execute_query(query, database_="neo4j")
nb_glue = res[0][0]
print(f'{100*nb_glue/nb_sld:.1f}% of domain names have a in-zone nameserver.')

76.7% of domain names have a in-zone nameserver.


## (Section 5) Shared Infrastructure
Next the paper investigates domain names that use the exact same set of nameservers. 

### Original Results
<img src="fig/fig4.png" style="height: 400px;"/>

Grouped by nameservers (April 2018):
- Median: Half the domain names exactly share a set of nameservers with at least **163** other domain names.
- Maximum: The largest group contains **9K** domain names that share the exact same set of nameservers.

Grouped by /24 prefixes (April 2018):
- Median: **3k**
- Maximum: **71k**

### IYP Results

In [57]:
query = """ // List nameservers for .com/.org/.net domain names in Tranco
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer) 
WHERE d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org'
RETURN d, COLLECT(DISTINCT a.name) AS auths"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same authoritative nameservers.')
        break

print(f'Maximum: {sorted_counts[0]} domains for the same authoritative nameservers.')

Median: 9 domains for the same authoritative nameservers.
Maximum: 6055 domains for the same authoritative nameservers.


In [36]:
query = """ // List /24 prefixes of nameservers for .com/.net/.org domain names in Tranco
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})
WHERE d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org'
RETURN d, COLLECT(DISTINCT REDUCE(pfx = "", n IN SPLIT(i.ip, '.')[0..3] | pfx + n + ".")) AS pfx"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same set of /24s.')
        break


print(f'Maximum: {sorted_counts[0]} domains for the same set of /24s.')


Median: 3959 domains for the same set of /24s.
Maximum: 114307 domains for the same set of /24s.


# Extension
Results for all domain names in Tranco and grouping by BGP prefix

In [60]:
query = """ // List nameservers for all domain names in Tranco
MATCH (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer) 
RETURN d, COLLECT(DISTINCT a.name) AS auths"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same authoritative nameservers.')
        break

print(f'Maximum: {sorted_counts[0]} domains for the same authoritative nameservers.')

Median: 15 domains for the same authoritative nameservers.
Maximum: 25707 domains for the same authoritative nameservers.


In [47]:
query = """ // List BGP prefixes of nameservers for .com/.net/.org domain names in Tranco
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})-[:PART_OF]-(pfx:Prefix)
WHERE d.name ENDS WITH '.com' OR d.name ENDS WITH '.net' OR d.name ENDS WITH '.org'
RETURN d, COLLECT(DISTINCT pfx)"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same set of prefixes.')
        break


print(f'Maximum: {sorted_counts[0]} domains for the same set of prefixes.')


Median: 4148 domains for the same set of prefixes.
Maximum: 114307 domains for the same set of prefixes.


In [51]:
query = """ // List prefixes of nameservers for all domain names in Tranco
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})-[:PART_OF]-(pfx:Prefix)
RETURN d, COLLECT(DISTINCT pfx)"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same set of prefixes.')
        break


print(f'Maximum: {sorted_counts[0]} domains for the same set of prefixes.')
print(f'Number of groups {len(sorted_counts)}')

Median: 6020 domains for the same set of /24s.
Maximum: 187732 domains for the same set of /24s.
Number of groups 71243


# TODO remove the next one


In [62]:
query = """ // List anycast prefixes of nameservers for domain names having only anycasted nameservers
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:4})-[:PART_OF]-(pfx:Prefix)
OPTIONAL MATCH (pfx)-[:CATEGORIZED]-(t:Tag {label:'Anycast'})
WITH d, COLLECT(DISTINCT pfx) AS pfxs, COLLECT(DISTINCT t) AS tags
MATCH (d)
WHERE all(tag in tags WHERE tag IS NOT NULL)
RETURN d, pfxs"""

res, _, _ = db.execute_query(query, database_="neo4j")

counts = defaultdict(int)
for r in res:
    counts[frozenset(list(r[1]))] += 1

sorted_counts = list(counts.values())
sorted_counts.sort()
sorted_counts.reverse()

total = sum(sorted_counts)

cum = 0
for c in sorted_counts:
    cum += c
    if cum >= total/2:
        print(f'Median: {c} domains for the same set of prefixes.')
        break


print(f'Maximum: {sorted_counts[0]} domains for the same set of prefixes.')
print(f'Number of groups {len(sorted_counts)}')


Median: 6020 domains for the same set of prefixes.
Maximum: 187732 domains for the same set of prefixes.
Number of groups 71243


## IYP Limitations
- Longitudinal Analysis: Original paper has 9 years of data
- Traceroute results: We don't have traceroutes to reproduce the validation step they did with traceroutes 
(but maybe our results are closer to the validation results?)
- 

# RPKI and DNS

Mixing the RiPKI and DNS Robustness paper. We look at the percentage of prefixes hosting nameservers that are protected by RPKI.

In [3]:
query = """ // NS servers covered by RPKI
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP)-[:PART_OF]-(pfx:Prefix)
WITH COUNT(DISTINCT pfx) AS total_pfx

MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP)-[:PART_OF]-(pfx:Prefix)-[:CATEGORIZED]-(t:Tag)
WHERE t.label = 'RPKI Valid' OR t.label STARTS WITH  'RPKI Invalid'
WITH total_pfx, COUNT(DISTINCT pfx) as valid_pfx
RETURN 100*valid_pfx/total_pfx"""

res, _, _ = db.execute_query(query, database_="neo4j")
prec_v4 = res[0][0]
print(f'{res[0][0]}% of the prefixes hosting nameserver are protected by RPKI.')

48% of the prefixes hosting nameserver are protected by RPKI.


In [None]:
query = """ // NS servers covered by RPKI
MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:6})-[:PART_OF]-(pfx:Prefix)
WITH COUNT(DISTINCT pfx) AS total_pfx

MATCH  (r:Ranking {name:'Tranco top 1M'})-[:RANK]-(d:DomainName)-[:MANAGED_BY]-(a:AuthoritativeNameServer)-[:RESOLVES_TO]-(i:IP {af:6})-[:PART_OF]-(pfx:Prefix)-[:CATEGORIZED]-(t:Tag)
WHERE t.label = 'RPKI Valid' OR t.label STARTS WITH  'RPKI Invalid'
WITH total_pfx, COUNT(DISTINCT pfx) as valid_pfx
RETURN 100*valid_pfx/total_pfx"""

res, _, _ = db.execute_query(query, database_="neo4j")
prec_v4 = res[0][0]
print(f'{res[0][0]}% of the IPv4 prefixes hosting nameserver are protected by RPKI.')