Generate IPv4 networks, IPv6 addresses, IPv6 networks #112

Merged
sckott merged 2 commits into ropensci:master from the ipaddress branch on Oct 6, 2020

Conversation

davidchall
Contributor

Description

I noticed charlatan has some TODO/FIXMEs related to generating:

  • IPv4 networks
  • IPv6 addresses
  • IPv6 networks

My ipaddress package supports randomly sampling the IPv6 address space, and can also do the bit masking needed to generate networks for both IPv4 and IPv6.
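For anyone unfamiliar with the bit masking step, here is a minimal base R sketch of the idea for IPv4. It is not the {ipaddress} implementation; `int_to_ipv4()`, `mask_ipv4()` and `random_ipv4_network()` are hypothetical helpers, and addresses are stored as doubles because R integers are only 32-bit signed.

```r
# Hypothetical helpers for illustration; not functions from {ipaddress} or {charlatan}.

# Convert a 32-bit address (stored as a double) to dotted-quad notation.
int_to_ipv4 <- function(x) {
  octets <- (x %/% 256^(3:0)) %% 256
  paste(octets, collapse = ".")
}

# Zero the host bits of an address, leaving the network address.
mask_ipv4 <- function(x, prefix) {
  block <- 2^(32 - prefix)  # number of addresses in a /prefix network
  (x %/% block) * block
}

# Draw a random address and a random prefix length, then mask.
random_ipv4_network <- function() {
  addr   <- floor(runif(1, 0, 2^32))
  prefix <- sample(0:32, 1)
  paste0(int_to_ipv4(mask_ipv4(addr, prefix)), "/", prefix)
}

# Masking 67.64.207.5 with an 18-bit prefix keeps the top 2 bits of the third octet.
addr <- sum(c(67, 64, 207, 5) * 256^(3:0))
int_to_ipv4(mask_ipv4(addr, 18))
#> [1] "67.64.192.0"
```

The same approach would carry over to IPv6 with a 128-bit representation, which base R doubles cannot hold exactly, so that case needs a dedicated type.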

BTW the faker module won't generate an address in a reserved network (see here). We could achieve this using an accept-reject algorithm (see here), if this is something you're interested in?

Related Issue

None. The FIXMEs are in the code.

Example

library(charlatan)

x <- InternetProvider$new()
x$ipv4()
#> [1] "190.172.2.193"
x$ipv4(network = TRUE)
#> [1] "67.64.192.0/18"
x$ipv6()
#> [1] "40dc:98a8:380:548b:8822:4e97:1fce:6942"
x$ipv6(network = TRUE)
#> [1] "fba1:738d:df08:cb00::/58"

Created on 2020-09-25 by the reprex package (v0.3.0)

Produce IPv4 networks, IPv6 addresses, IPv6 networks
@codecov-commenter

codecov-commenter commented Sep 26, 2020

Codecov Report

Merging #112 into master will increase coverage by 0.62%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #112      +/-   ##
==========================================
+ Coverage   69.84%   70.46%   +0.62%     
==========================================
  Files          43       43              
  Lines         955      965      +10     
==========================================
+ Hits          667      680      +13     
+ Misses        288      285       -3     
Impacted Files          Coverage Δ
R/internet-provider.R   57.94% <100.00%> (+6.39%) ⬆️
R/taxonomy-provider.R   100.00% <0.00%> (+8.33%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sckott
Collaborator

sckott commented Sep 29, 2020

Thanks @davidchall! It's great to have some of the IP address stuff finished off.

Looks like there's a modest slowdown compared to iptools:

microbenchmark(
  ipaddress = ipaddress::sample_ipv4(1),
  iptools = iptools::ip_random(1),
  times = 10^4
)
#> Unit: microseconds
#>       expr    min     lq      mean  median     uq      max neval
#>  ipaddress 89.379 93.396 112.63896 95.3445 98.352 6342.843 10000
#>    iptools 12.678 14.065  17.35034 16.0760 16.834 6208.996 10000

I don't know if that's accurate or meaningful. I don't typically use this kind of data, so I'm not sure of the use cases. For example, do people often want to generate millions of IP addresses at a time (in which case speed may become an issue), or do they mostly generate tens to thousands of addresses at a time (in which case the speed difference is probably not an issue)?


> BTW the faker module won't generate an address in a reserved network (see here). We could achieve this using an accept-reject algorithm (see here), if this is something you're interested in?

I'm not super familiar with the terminology. By "We could achieve this", what do you mean exactly? Is it best to avoid generating addresses in a reserved network? That is, should we avoid that here as well?

@davidchall
Contributor Author

davidchall commented Sep 29, 2020

Hi @sckott,

Your benchmarking results are really interesting - thanks for bringing this to my attention! My first thought was that {ipaddress} supports both IPv4 and IPv6, and so there is some additional overhead involved. If we look at generating many addresses, then we see that {ipaddress} is faster than {iptools}:

microbenchmark(
  ipaddress = ipaddress::sample_ipv4(1e5),
  iptools = iptools::ip_random(1e5)
)
#> Unit: milliseconds
#>       expr       min       lq     mean   median        uq      max neval
#>  ipaddress  6.717324 12.01752  39.4633 15.91629  22.50388 396.7741   100
#>    iptools 53.141831 63.32564 117.5727 73.44913 117.57090 680.6242   100

If people want to generate millions of IP addresses, I'd recommend using ipaddress::sample_ipv4() directly (instead of using charlatan::InternetProvider$ipv4()), because this takes advantage of a vectorized implementation.
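To illustrate why the vectorized route scales better, here is a rough base R sketch contrasting one-call-per-address with a single vectorized draw. Neither function is the {ipaddress} or {iptools} implementation; `one_ipv4()`, `loop_ipv4()` and `vec_ipv4()` are hypothetical stand-ins.

```r
# Hypothetical generators for illustration only.

# One address per call, as a scalar-style provider method would do.
one_ipv4 <- function() {
  paste(sample(0:255, 4, replace = TRUE), collapse = ".")
}

# Loop: n separate calls, each paying R's function-call overhead.
loop_ipv4 <- function(n) {
  vapply(seq_len(n), function(i) one_ipv4(), character(1))
}

# Vectorized: draw all 4 * n octets at once and paste them column-wise.
vec_ipv4 <- function(n) {
  octets <- matrix(sample(0:255, 4 * n, replace = TRUE), ncol = 4)
  paste(octets[, 1], octets[, 2], octets[, 3], octets[, 4], sep = ".")
}
```

Both return a character vector of length n. Timing them with `system.time()` for a large n should show the gap, though the exact ratio will depend on the machine.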


> BTW the faker module won't generate an address in a reserved network (see here). We could achieve this using an accept-reject algorithm (see here), if this is something you're interested in?
>
> I'm not super familiar with the terminology. By "We could achieve this", what do you mean exactly? Is it best to avoid generating addresses in a reserved network? That is, should we avoid that here as well?

The protocol reserves some regions of IP address space for special usage, and so a user would never be assigned one of these reserved addresses. Charlatan is creating fake user data, so I think it makes sense to exclude such addresses. The same idea applies to IPv6 too, though {faker} doesn't handle this (yet).

In reality, IP address allocation is very complicated. Here are a few other points to consider:

  • Public vs private: Some address ranges are reserved for private networks (e.g. LANs). But depending on the situation, {charlatan} users might want to generate these.
  • Unallocated: Although the IPv4 address space is now depleted, only a very small proportion of the IPv6 address space has been allocated. So technically speaking, addresses shouldn't be generated in these unallocated regions. However, new addresses are getting allocated all the time...
  • Countries: Different address ranges are allocated to different countries. For {charlatan}, you could argue this should be incorporated into the localization model. However, there are many reasons this is not a strict rule (e.g. VPNs allow a user in country A to have an IP address in country B). And these country allocations also can change.

Yuck! It might make most sense for {charlatan} to avoid these complexities altogether and simply randomly generate any address (i.e. let's just forget about excluding networks). Let me know your decision and I can update the PR.

BTW -- I was suggesting that we could prevent {charlatan} from generating reserved addresses by using an accept-reject algorithm. In contrast, {faker} uses weighted sampling from the non-excluded networks. The {faker} implementation has a 100% acceptance rate (i.e. they will use the very first IP address they generate), whereas {charlatan} might need to generate 2 or more addresses until it finds an accepted address. However, the accept-reject algorithm is much easier to understand and the acceptance rate is expected to be high (roughly 87%).
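A minimal base R sketch of the accept-reject idea, for concreteness. The `reserved` table is an illustrative subset of reserved IPv4 networks that I picked for this example (an assumption, not the list used by {faker} or {ipaddress}), and the helper names are hypothetical.

```r
# Illustrative subset of reserved IPv4 networks (assumption, not an official list).
reserved <- data.frame(
  start  = c("0.0.0.0", "10.0.0.0", "100.64.0.0", "127.0.0.0", "169.254.0.0",
             "172.16.0.0", "192.168.0.0", "224.0.0.0", "240.0.0.0"),
  prefix = c(8, 8, 10, 8, 16, 12, 16, 4, 4),
  stringsAsFactors = FALSE
)

# Dotted-quad string -> number (doubles hold 32-bit values exactly).
ipv4_to_int <- function(x) {
  sum(as.numeric(strsplit(x, ".", fixed = TRUE)[[1]]) * 256^(3:0))
}

# Number -> dotted-quad string.
int_to_ipv4 <- function(x) {
  paste((x %/% 256^(3:0)) %% 256, collapse = ".")
}

# First and last address of each reserved network.
reserved$first <- vapply(reserved$start, ipv4_to_int, numeric(1), USE.NAMES = FALSE)
reserved$last  <- reserved$first + 2^(32 - reserved$prefix) - 1

# TRUE if the address falls inside any reserved network.
in_reserved <- function(x) any(x >= reserved$first & x <= reserved$last)

# Accept-reject: keep drawing uniformly until an address is not reserved.
sample_public_ipv4 <- function() {
  repeat {
    x <- floor(runif(1, 0, 2^32))
    if (!in_reserved(x)) return(int_to_ipv4(x))
  }
}

# Fraction of the address space accepted on the first draw, for this list only.
accept_rate <- 1 - sum(2^(32 - reserved$prefix)) / 2^32
```

With this particular list `accept_rate` comes out at roughly 0.86, in the same ballpark as the figure quoted above; a different list of reserved networks would shift it slightly.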

@sckott
Collaborator

sckott commented Oct 1, 2020

Good point that if a user wanted >1 address they'd be much better off with a vectorized approach. We should take advantage of vectorization whenever possible. This is a longer-term issue: charlatan, I think, largely does one thing at a time, and if you want many of those things you have to run the method that many times. Opened issue #113.

I like the simplicity of just randomly generating any address. And then we could point people to your package in the documentation if they want more control/etc. But, what do you prefer?

@davidchall
Contributor Author

Yeah, I like that approach. In the future, I might add a weighted sampling function (davidchall/ipaddress#67), similar to how {faker} handles this.
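Roughly, weighted sampling would look like the base R sketch below: pick an allowed network with probability proportional to its size, then draw uniformly inside it, so no draw is ever rejected. The `allowed` table is a made-up allow-list for illustration (not the one {faker} uses), and the function names are hypothetical.

```r
# Hypothetical allow-list of non-overlapping networks (assumption for illustration).
allowed <- data.frame(
  start  = c("1.0.0.0", "8.0.0.0", "128.0.0.0"),
  prefix = c(8, 7, 3),
  stringsAsFactors = FALSE
)

# Dotted-quad string -> number (doubles hold 32-bit values exactly).
ipv4_to_int <- function(x) {
  sum(as.numeric(strsplit(x, ".", fixed = TRUE)[[1]]) * 256^(3:0))
}

# Number -> dotted-quad string.
int_to_ipv4 <- function(x) {
  paste((x %/% 256^(3:0)) %% 256, collapse = ".")
}

allowed$first <- vapply(allowed$start, ipv4_to_int, numeric(1), USE.NAMES = FALSE)
allowed$size  <- 2^(32 - allowed$prefix)  # addresses per network

sample_weighted_ipv4 <- function() {
  # Larger networks are chosen more often; sample() normalizes the weights.
  i <- sample(nrow(allowed), 1, prob = allowed$size)
  # Uniform offset within the chosen network.
  offset <- floor(runif(1, 0, allowed$size[i]))
  int_to_ipv4(allowed$first[i] + offset)
}
```

Weighting by network size keeps the result uniform over the union of the allowed networks, which is the same distribution accept-reject would give for the complementary exclusion list.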

@sckott
Collaborator

sckott commented Oct 1, 2020

okay, let me know when you're done updating the PR

@davidchall
Contributor Author

The only thing I'm wondering about is whether you'd like me to update the NEWS and codemeta.json files, or is that something you handle? Otherwise, I'm done already 👍

@sckott
Collaborator

sckott commented Oct 1, 2020

No, I update NEWS and codemeta before new releases to CRAN.

@sckott sckott added this to the v0.5 milestone Oct 6, 2020
@sckott sckott merged commit 4b797ff into ropensci:master Oct 6, 2020
@davidchall davidchall deleted the ipaddress branch April 1, 2021 03:07