This repository has been archived by the owner. It is now read-only.

Check that the scaling is working #182

Closed
juga0 opened this issue Jun 6, 2018 · 57 comments

@juga0 juga0 commented Jun 6, 2018

As part of the tasks for having an MVP, mentioned in https://trac.torproject.org/projects/tor/wiki/org/meetings/2018NetworkTeamHackfestSeattle/sbws.

@juga0 juga0 added this to the 1.0 milestone Jun 6, 2018
@juga0 juga0 commented Jun 12, 2018

I was going to plot what pastly did in [1] with new files.
Looking at the code in tjr's bwauth-tools [2], it uses a database, which I don't think we need.
The code also needs changes in order to run.
@pastly, would you share the scripts or modifications you used to generate those graphs?
Thanks
[1] https://github.com/pastly/simple-bw-scanner/wiki/Comparing-sbws-to-torflow
[2] https://github.com/tomrittervg/bwauth-tools

@pastly pastly commented Jun 12, 2018

https://github.com/pastly/simple-bw-scanner/blob/master/scripts/tools/v3bw-into-xy.sh used to take a v3bw file and produce output like

AAAA...AAAA 400
BBBB....BBBB 500

but it looks like it needs minor updates to handle the new v3bw file format.


https://github.com/pastly/simple-bw-scanner/blob/master/scripts/tools/plot-v3bw-xy.py takes the output of the previous script and plots the scatter plots that you referenced (which no longer exist, because share.riseup.net deletes stuff after about a week).

Example usage (from memory):

./scripts/tools/v3bw-into-xy.sh moria.v3bw > moria.data
./scripts/tools/v3bw-into-xy.sh sbws.v3bw > sbws.data
./scripts/tools/plot-v3bw-xy.py --input moria.data moria --input sbws.data sbws

You can avoid making temporary files with some bash magic. This does the same thing as above without temporary files (this is how I was doing it, but again from memory):

./scripts/tools/plot-v3bw-xy.py --input <(./scripts/tools/v3bw-into-xy.sh moria.v3bw) moria --input <(./scripts/tools/v3bw-into-xy.sh sbws.v3bw) sbws

pastly added a commit that referenced this issue Jun 15, 2018
@pastly pastly commented Jun 15, 2018

I replaced v3bw-into-xy.sh with v3bw-into-xy.py since it was easier to parse the slightly more complex v3bw files in python.
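For readers without the repo handy, the core of that conversion can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not the actual v3bw-into-xy.py; it assumes relay lines in a v3bw file are space-separated key=value pairs carrying "node_id" and "bw" keys.

```python
# Hypothetical sketch of a v3bw-into-xy style conversion (illustration only,
# not the real script): parse key=value relay lines and emit (fingerprint, bw)
# pairs suitable for plotting.
def v3bw_into_xy(lines):
    pairs = []
    for line in lines:
        kv = dict(part.split('=', 1) for part in line.split() if '=' in part)
        if 'node_id' in kv and 'bw' in kv:
            # node_id values look like "$<fingerprint>"
            pairs.append((kv['node_id'].lstrip('$'), int(kv['bw'])))
    return pairs

lines = [
    "1523911758",                        # header: unix timestamp, no '='
    "node_id=$AAAA bw=400 nick=relay1",
    "node_id=$BBBB bw=500 nick=relay2",
]
print(v3bw_into_xy(lines))               # [('AAAA', 400), ('BBBB', 500)]
```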

@juga0 juga0 commented Jun 15, 2018

Thanks for that change.
Since it's easy to add new methods in v3bwfile to parse version 1.0.0 and 1.1.0 bandwidth files, I started doing that, which simplifies the plotting scripts.
I haven't pushed yet because it isn't finished. I'll comment here on the progress in the next days.

@pastly pastly commented Jun 15, 2018

The plot script works now that I rewrote v3bw-into-xy (and it shows that my sbws setup sucks).

What more work needs to happen on the parsing/plotting scripts? What modifications are you making to V3BwFile?

I'm going to stop the webserver that I think is causing my problems and in ~5 days hopefully my results will look better.

screen shot 2018-06-15 at 09 44 10

@juga0 juga0 commented Jun 15, 2018

Yeah, now that v3bw-into-xy works my changes are not needed anymore, though I still think there's no need for an intermediate script when V3BwFile needs only small changes to parse any bw file and return data structures that can be used for plotting.
For next time, we should both have said we were planning to work on this.
Still, I'd leave this ticket open, since it's about checking the scaling, and I can't tell right now whether the web server was slow or there's actually something wrong with the scaling.

@pastly pastly commented Jun 15, 2018

I can't tell right now whether the web server was slow or there's actually something wrong with the scaling.

I should mention that I didn't do any scaling with those results.

What I did do is wget a file from a variety of allegedly fast sources -- including my freebird server -- and freebird capped out at ~7.5 MBps while tityos got to 60+ MBps. (Yes bytes in both cases).

@juga0 juga0 commented Jun 15, 2018

Hmm, you mean the pastly results are not generated by sbws?
If they are, were you not using scale-constant? (It's enabled by default.)

@pastly pastly commented Jun 15, 2018

The pastly results are generated by sbws generate, yes.

--scale-constant has a default value, yes. But it isn't used unless --scale is specified.

  --scale-constant SCALE_CONSTANT
                        When scaling bw weights, scale them using this const
                        multiplied by the number of measured relays (default:
                        7500)
  --scale               If specified, do not use bandwidth values as they are,
                        but scale them such that we have a budget of
                        scale_constant * num_measured_relays = bandwidth to
                        give out, and we do so proportionally (default: False)

https://github.com/pastly/simple-bw-scanner/blob/09691a0fe7b3809f2cdacd7713c2b37668c6c93b/sbws/lib/v3bwfile.py#L385
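The --scale behaviour described by that help text can be sketched as follows (a minimal illustration assuming the budget is simply handed out proportionally; a hypothetical function, not the code behind the link above):

```python
# Sketch of the --scale behaviour in the help text above: distribute a
# budget of scale_constant * num_measured_relays proportionally to each
# relay's measured bandwidth (hypothetical function, not sbws's code).
def scale_bws(bws, scale_constant=7500):
    total = sum(bws)
    budget = scale_constant * len(bws)
    return [bw * budget // total for bw in bws]

# Note that bw * budget/total == bw * scale_constant/mean, i.e. the
# "bw * 7500/mean" formula discussed later in this thread.
print(scale_bws([100, 200, 700]))   # [2250, 4500, 15750]
```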

@pastly pastly commented Jun 26, 2018

This is from sbws 0.4.2-dev, using only my "tityos" destination (no longer using the "freebird" destination) and measuring from ln5's host.

It's like ... no better. It's no closer to being similar to torflow.

screen shot 2018-06-26 at 12 11 27

By the way, this is what it looks like if I don't cap the Y at 10,000 (same data, different view).

screen shot 2018-06-26 at 12 20 39

Scaling my results doesn't really help. To understand why: scaling the way we've proposed doesn't change the shape of the curve.

There's either something really wrong with sbws or the environment I'm running it in.

@juga0 juga0 commented Jun 26, 2018

Or the graphs are not accurate. If every pastly dot is a relay, how is it that moria1 is a line?
Over which period of time is this? Are these graphs using scaling?
Even though in theory it would not change anything, what about a graph with the bandwidth for a relay according to moria on one axis and the bandwidth for the same relay according to pastly on the other axis?

@pastly pastly commented Jun 26, 2018

If every pastly dot is a relay, how is that moria1 is a line?

Because the dots are ordered by moria's data. Relay number 0 is the fastest relay according to moria and relay number ~7000 is the slowest according to moria.

Over which period of time is this?

The black dots come from a v3bw file fetched from moria. The red from a v3bw file I generated from sbws data. The two v3bw files were fetched at the same time.

Are these graphs using scaling?

I did not scale sbws data.

Even though in theory it would not change anything, what about a graph with bandwidth for a relay according to moria on 1 axis and bandwidth for same relay according to pastly in the other axis?

I could plot this, sure. I'll try to remember to do so.

@teor2345 teor2345 commented Jun 26, 2018

Remember: there is significant variance between bandwidth authorities, and sbws only has to be similar to one of them.

Please plot all existing bandwidth authorities on the same graph, and order by sbws measurement.

@pastly pastly commented Jul 6, 2018

These are the 3 bwauths that make their v3bw files public. URLs fetched from here

image

Compared to sbws, the bwauths are not that different from each other.

I'd include sbws and run my new plotting scripts, except my sbws died a few days ago and I hadn't noticed until now.

I'm starting to think we're going to have to do more than just single circuit download performance. For example, download over many circuits at once through a target relay, or do whatever torflow does with relays' self-measured bandwidth (like @binnacle talked about in #150).

@teor2345 teor2345 commented Jul 6, 2018

I'm starting to think we're going to have to do more than just single circuit download performance. For example, download over many circuits at once through a target relay, or do whatever torflow does with relays' self-measured bandwidth (like @binnacle talked about in #150).

So sbws produces a flatter curve.
How do we know that's bad?

Remember: the goal of the bandwidth measurement system is to produce weights that make efficient use of the available relay bandwidth:
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/README.spec.txt#n18

And there is evidence that torflow is allocating too much load to large relays:
https://lists.torproject.org/pipermail/tor-dev/2018-March/012995.html

So before we make sbws match torflow, let's check if torflow's results are actually what the Tor network needs:

  • What if sbws's results are better than torflow's?
  • What if sbws's results are constrained by the endpoints, rather than the relays being measured?
  • What if depending on relays' self-measured bandwidths is causing problems?

I'm not sure how we can answer all these questions.

As a first step, let's scale sbws's results to match torflow's results. Yes, the shape of the curve will be the same. But maybe that shape is better than torflow's.

As a second step, let's set up both the sbws client and server on fast servers in Germany or France, near most of the current high-bandwidth relays. Then we can see if the results are closer to torflow's.
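The point that scaling preserves the shape of the curve can be checked directly: multiplying every measurement by one constant preserves the relay ordering, so the sorted curve keeps its shape. A toy demonstration with made-up numbers:

```python
# Scaling by a constant factor cannot change the shape of the curve:
# the relay ordering (and all ratios between relays) stay the same.
def ranks(values):
    # indices of the relays, from slowest to fastest
    return sorted(range(len(values)), key=lambda i: values[i])

bws = [300, 50, 900, 120]                 # toy measured bandwidths
k = 7500 / (sum(bws) / len(bws))          # any positive scale factor
scaled = [b * k for b in bws]
assert ranks(bws) == ranks(scaled)        # same shape, different magnitude
```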

@teor2345 teor2345 commented Jul 6, 2018

Or here's a better comparison:

Set up torflow and sbws on the same client and server, and compare the results.
(But don't run them at exactly the same time, they'll fight for bandwidth.)

@juga0 juga0 commented Jul 6, 2018

@pastly, which server machine would you use for the client? I've run Torflow before on ln5's machine and on the VPS I'm using; let me know if I should set it up somewhere else.

@pastly pastly commented Jul 6, 2018

@pastly, which server machine would you use for the client?

I'm using ln5's machine for my scanner and tityos for the server.

I think you can keep using whatever you want because right now I don't think there's a bandwidth contention issue on ln5's machine.

If we can get tjr to run sbws and get a 1 GiB file on whatever webserver he's using, then we will have a direct comparison.

@juga0 juga0 commented Jul 7, 2018

I was curious to graph the same sbws data with and without scaling.
Here's the result: the scaled one looks much more similar to moria in comment #182 (comment), where the sbws generate file was not scaled.
screenshot from 2018-07-07 19-03-51
I could retrieve moria files and actually check that the scaled line is closer to moria.

@teor2345 teor2345 commented Jul 7, 2018

Graph Analysis

I could retrieve moria files and actually check that the scaled line is closer to moria.

It's closer, but the curve is a different shape:

That's interesting, because then we get to ask ourselves:

  • what is the most likely shape for the relay bandwidth distribution in the tor network?
  • what do we want the distribution of client usage to be in the tor network?

Metcalfe's law suggests that the network itself should follow a linear to parabolic distribution (n·log(n) to n²):
https://en.m.wikipedia.org/wiki/Metcalfe%27s_law

And maybe a hyperbolic distribution is bad for the Tor network?
#182 (comment)

Possible Explanations

We could be seeing a different distribution because torflow distributes its bandwidth files based on its own scaled bandwidth measurements.

torflow claims that each relay is measured using > 5 times that relay's bandwidth (since the files are in powers of two, that's 5-10 times):
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n324

But it's actually measuring them at (scaled bandwidth) * 5-10 times.

Next Steps

Let's increase the sbws download length so that bandwidth dominates, rather than latency. Tor latency is at most 1 second for large relays:
https://metrics.torproject.org/onionperf-latencies.html

So let's try 20-40 second sbws downloads, rather than 5-10 second downloads?
(Please reset your sbws results after changing the download time.)

@pastly pastly commented Jul 9, 2018

So let's try 20-40 second sbws downloads, rather than 5-10 second downloads? (Please reset your sbws results after changing the download time.)

Okay. I'm running sbws with the following settings (changed the download_* ones). Measurements that take between 20 and 40 seconds are accepted.

nickname = aaseae_30s
measurement_threads = 6
download_toofast = 10
download_min = 20
download_target = 30
download_max = 40

@pastly pastly commented Jul 11, 2018

First I'll mention that a 30 second target causes the scanning to take much much longer. sbws has been running for about two days and isn't done measuring every relay yet. (This was what I predicted, but I just wanted to make sure it was explicit).

Now, on to the results.

Here's the same graph as 15 days ago but with today's data from moria and sbws (and sbws targeting 30s downloads).

screen shot 2018-07-11 at 13 11 02

This looks very, very similar to the old graph when sbws was targeting 6s. I don't think a 30s target is a good idea; 6s was fine.

What I think the next steps should be:

Either

  1. Run torflow and sbws side-by-side (but not at the same time) to remove more variables. This has the added benefit of us having access to the raw scanner results from torflow before it does whatever magic scaling it does. OR

  2. Ask for access to raw scanner results from someone running torflow.

@binnacle binnacle commented Jul 11, 2018

Have a couple suggestions: In addition to comparing raw and cooked absolute votes, try comparing vote sets normalized to selection probability. Include the synchronous consensus selection probability curve as an overlay. Avoid comparing anachronistic vote and consensus sets, since the numbers shift by as much as 10% in twelve hours.

@pastly pastly commented Jul 19, 2018

Both arma and tjr have said they are able to share raw scanner results with us (option 2 above), though tjr says he'd like instructions on how/what to share.

arma pointed to https://trac.torproject.org/projects/tor/ticket/2532, but I don't think we need to reopen it because I don't think metrics needs to get involved at this time.

@juga0 juga0 commented Jul 23, 2018

In https://lists.torproject.org/pipermail/tor-dev/2018-July/013330.html, @teor2345 said:
<<[...]Ok, so juga can run sbws and torflow at different times on the same machine.[...]And tom can run sbws and torflow at the same time on the same machine.>>
However, today @pastly stated on IRC that he doesn't think two people need to run both scanners, and that his preference is for tjr and arma to run them.
So I'm not going to run torflow or sbws in the way suggested by @teor2345 at the moment.
Please let me know if there's a change of opinion or situation.

@pastly pastly commented Jul 23, 2018

I left the IRC conversation today expecting you to run sbws and torflow. That's why I thanked arma and tjr. I see how it was confusing that I mentioned advantages to having tjr run sbws, but I don't think he's planning on doing it (because AFAICT, we un-asked him to run it on IRC).

@juga0 juga0 commented Jul 27, 2018

One thing I've been thinking about, and started to work on, is to refactor part of the v3bwfile.py code so that we can also generate the relays' bw subtracting their own RTT. I think this may discard latency. It might also be interesting to compare these results with the results from increasing the download time.
I'd also like to take one of the relays with higher bandwidth and compare what happens when that relay is being measured by Torflow and by sbws, and see why the results are so different. I might need to write extra code for this too.
I'd also like to make it easier to create the graphs from several files (and probably to generate CSVs directly from v3bwfile), so that we can create more comparison graphs. I'm just afraid I might spend too much time making these changes.
Maybe I should open new tickets for this?

@teor2345 teor2345 commented Jul 29, 2018

One thing I've been thinking about, and started to work on, is to refactor part of the v3bwfile.py code so that we can also generate the relays' bw subtracting their own RTT. I think this may discard latency.

I'm not sure I understand what you mean here.
Can you explain how sbws uses the rtt at the moment?

I'd also like to take one of the relays with higher bandwidth and compare what happens when that relay is being measured by Torflow and by sbws, and see why the results are so different. I might need to write extra code for this too.

You could add debug logging, or add debug attributes to the bandwidth lines.
Choose the one that is faster to code.

I'd also like to make it easier to create the graphs from several files (and probably to generate CSVs directly from v3bwfile), so that we can create more comparison graphs. I'm just afraid I might spend too much time making these changes.

How many times are we going to generate the graphs?
If we are only going to generate them a few more times, you could write a shell script.

Maybe i should open new tickets for this?

Yes, please open new tickets for new code, this task is an analysis task.

@juga0 juga0 commented Jul 30, 2018

It would be great if any of you could remind me why our scaling is bw * 7500/mean.
We had some discussions months ago about it, but I still can't make sense of it.

@teor2345 teor2345 commented Jul 31, 2018

It would be great if any of you could remind me why our scaling is bw * 7500/mean.
We had some discussions months ago about it, but I still can't make sense of it.

See the comment here:
https://github.com/pastly/simple-bw-scanner/blob/2b111356fb813379ab0b4a0dc705706766788ad3/sbws/core/generate.py#L27-L30

And the thread here:
https://lists.torproject.org/pipermail/tor-dev/2018-March/013049.html

@juga0 juga0 commented Jul 31, 2018

https://lists.torproject.org/pipermail/tor-dev/2018-March/013049.html

Oh, sorry, I didn't follow that link; yeah, that's a more elaborate explanation.

So, in theory, a network with 7500 relays would have:

  • a total bandwidth of 56250000 (7500^2) (Bytes/second?)
  • a mean of 7500 (total bw/num relays) (Bytes/second?)

If I get it correctly from the mail, in theory, a network of 6460 measured relays would have:

  • a total bandwidth of 48450000 (7500 * 6460) (Bytes/second?)
  • a mean of 7500 (Bytes/second?)

Correct?

In practice, a network with 6460 relays, measured with sbws, has:

  • a total bandwidth of 2112522182 (Bytes/second) (or 2112522 KB/second)
  • a mean of 327015 (Bytes/second) (or 327 KB/second)

So it seems there is an error somewhere with the units; the sbws results would make more sense divided by 100 and multiplied by 2:

  • total bw: 2112522182 * 2/100 = 42250443
  • mean: 327015 * 2/100 = 6540

Correct?

Same measurements with sbws using scaling:

  • total scaled bw: 7824 (~=7500)
  • scaled mean: 1.2 (~=1)

So with scaling, are we trying to get the mean to ~1, and the total bw to ~num relays?

If yes, then scaling is almost working, but then I might not be interpreting the mail correctly.

With measurements from Torflow, having 8748 relays:

  • total bw: 8748
  • mean: 1

Which confirms the previous paragraph.
But then we should rather multiply each bw by 1/mean (which is the same as num relays/total bw) instead of 7500/mean.

With that, the previously scaled sbws with 6460 relays would be:

  • total bw: 6460
  • mean: 1

Correct?

I can graph the sbws scaled results with that, though I think the shape is not going to change.

And what's the advantage of scaling this way? Would it make more sense to just have the "raw" bandwidths and make Tor calculate weights in a different way?

Sorry I'm questioning this now and not months ago.
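The arithmetic in this comment can be checked mechanically (numbers copied from the comment; the Bytes/second unit is the comment's own assumption):

```python
# Checking the arithmetic above with the numbers from the comment.

# Theory: a network of 7500 relays at the scale constant
assert 7500 ** 2 == 56_250_000            # total bandwidth
assert 56_250_000 // 7500 == 7500         # mean

# Theory: 6460 measured relays
assert 7500 * 6460 == 48_450_000          # total bandwidth

# Practice: sbws measurements
total_bw, n_relays = 2_112_522_182, 6460
assert total_bw // n_relays == 327_015    # mean, as in the comment

# Multiplying by num_relays/total_bw (i.e. 1/mean) normalizes to
# total ~= num_relays and mean ~= 1
factor = n_relays / total_bw
assert abs(total_bw * factor - n_relays) < 1e-6
```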

@juga0 juga0 commented Jul 31, 2018

With measurements from Torflow, having 8748 relays:
It seems I parsed the torflow file incorrectly, so ignore what I say from there on for now.

@juga0 juga0 commented Jul 31, 2018

With measurements from Torflow, having 8748 relays:

it seems I parsed the torflow file incorrectly, so ignore what I say from there on for now

Yes, I did; the correct results are:

  • total bw: 66639642
  • mean bw: 7618

So it seems there is an error somewhere with the units; the sbws results would make more sense divided by 100 and multiplied by 2:

I mean, they don't make more sense, but they are closer to what's expected. Actually, I could just divide each sbws measured bw by ~45 to get a total bw of ~50000000 and a mean bw of ~7500. But then, why ~45?

So with scaling, are we trying to get the mean to ~1, and the total bw to ~num relays?

OK, that's not the case, but it would make sense if what we wanted was to normalize.

I'll show some graphs here soon.

@teor2345 teor2345 commented Aug 1, 2018

@juga0, I'm not sure how to answer the questions in your last 3 comments.

The purpose of scaling is to make sure that bandwidth weights don't change when torflow instances are replaced with sbws instances. So we want the total bandwidth (or average bandwidth) to be similar between torflow and sbws.
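A minimal sketch of that goal, under the assumption that "similar totals" just means applying one global factor (a hypothetical helper, not sbws code):

```python
# Rescale sbws results so their total matches torflow's total, leaving the
# relative weights between relays untouched (hypothetical helper).
def match_totals(sbws_bws, torflow_total):
    factor = torflow_total / sum(sbws_bws)
    return [round(bw * factor) for bw in sbws_bws]

print(match_totals([1, 2, 3], 60))   # [10, 20, 30]
```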

@pastly pastly commented Aug 3, 2018

Right, what @teor2345 said.

In my mind this ticket morphed from "check that scaling is working" into "check that sbws produces sane results that may or may not need scaling" a while ago.

@juga0 juga0 commented Aug 6, 2018

I parsed the torflow raw files, took strm_bw, then took the median of all the bandwidths for each node.
Results:

  • Num relays: 7021
  • Bw total 2058888342 B/s
  • Bw mean: 293247 B/s
  • Max Bw: 715096 B/s
  • Min Bw: 1046 B/s

sbws results from days before:

  • Num relays: 6460
  • Bw total 2112522182 B/s
  • Bw mean: 327016 B/s
  • Max Bw: 2904045 B/s
  • Min Bw: 3019 B/s

Results are now closer. The max and min differences could be because I ran sbws for less time, or just because they ran at different times.

Now plotting both together, first ordering by torflow, then ordering by sbws.

20180806_100109
20180806_100125

@juga0 juga0 commented Aug 6, 2018

All bandwidths are in Bytes/second, so there are no conversion errors.

@pastly pastly commented Aug 6, 2018

Very exciting results, thanks @juga0! (I'll analyze better ASAP)

@teor2345 teor2345 commented Aug 6, 2018

Results are now closer. The max and min differences could be because I ran sbws for less time, or just because they ran at different times.

Ok, so we have two options:

  1. Implement torflow's observed-bandwidth scaling in sbws
  • extra implementation work
  • less secure, because it depends on relays' reported bandwidth
  • transition will happen sooner
  • easier to convince operators to transition
  • less risk of failure
  2. Replace torflow with sbws in a single hour
  • less implementation work
  • more secure
  • transition will happen later
  • much harder to convince operators to transition
  • more risk of failure

I suggest we go with option 1, but make the scaling depend on a consensus parameter. Then the directory authority operators can turn scaling off after a majority of bandwidth authority operators transition to sbws.

@juga0 juga0 commented Aug 6, 2018

Hmm, the main conclusions I get from those graphs are:

  • it seems that the measurements are working
  • torflow "aggregation" and sbws "scaling" are quite different

So I was thinking:
a. with your 1., we will still have the same Tor issues as with Torflow
b. work out whether the scaling should be changed in some other way
c. if b, or with the current scaling, maybe we need to change the part of the Tor code that calculates weights?

@juga0 juga0 commented Aug 6, 2018

According to the specs, when bwweightscale is not present, it defaults to 10000 [0], so one change in sbws scaling could be to default to that, which would add one more order of magnitude.
@teor2345, do you know why all the /10000 in torflow's aggregate.py [1], and whether it's related to the scale constant?

[0] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1874
[1] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py

@juga0 juga0 commented Aug 6, 2018

Hmm, what about calculating the weights that would be assigned to each relay with our sbws scaled results and comparing them with the weights actually assigned to the relays in that period?

@juga0 juga0 commented Aug 6, 2018

Maybe this way we can check how different the weights might be using sbws instead of Torflow.

@teor2345 teor2345 commented Aug 6, 2018

  1. Implement torflow's observed-bandwidth scaling in sbws

So I was thinking:
a. with your 1., we will still have the same Tor issues as with Torflow

I'm not sure if we're talking about the same thing here.

Torflow scales each relay's observed bandwidths using the ratio between that relay's measured bandwidth, and the total measured bandwidth for all relays. If we want sbws to match torflow, we need to do similar scaling in sbws.

Specifically, we need to:

  • work out how torflow scales
  • write a spec update that says how we should scale
  • implement the scaling in sbws

We know that stream bandwidths are similar between torflow and sbws, and we also know that PID control is broken.

So I think we need to copy these 4 lines of torflow's scaling code:
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n548
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n587
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n744
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n750

We might also need to cap the result:
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n770
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n778
https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n786
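For readers without the torflow source at hand, here is a rough, simplified sketch of the kind of scaling those linked lines perform. This is an illustration under stated assumptions, not a faithful port of aggregate.py: each relay's measured bandwidth relative to the network-wide mean gives a ratio, the relay's self-reported bandwidth is multiplied by that ratio, and the result is capped at a fraction of the total (analogous to torflow's NODE_CAP).

```python
# Simplified torflow-style scaling sketch (NOT a faithful port of
# aggregate.py): ratio = measured / network mean, new_bw = observed * ratio,
# capped at cap_frac of the scaled total.
def torflow_style_scale(measured, observed, cap_frac=0.05):
    mean_measured = sum(measured) / len(measured)
    new_bws = [obs * (m / mean_measured)
               for m, obs in zip(measured, observed)]
    cap = cap_frac * sum(new_bws)          # cap analogous to NODE_CAP
    return [min(bw, cap) for bw in new_bws]
```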

b. work out whether the scaling should be changed in some other way
c. if b, or with the current scaling, maybe we need to change the part of the Tor code that calculates weights?

We can't change Tor's code, because it takes too long to deploy new tor versions. And if we did change Tor's code, it would be very easy to double-scale torflow's results.

@teor2345 teor2345 commented Aug 6, 2018

According to the specs, when bwweightscale is not present, it defaults to 10000 [0], so one change in sbws scaling could be to default to that, which would add one more order of magnitude.
[0] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1874

bwweightscale is for bandwidth-weights. bandwidth-weights are not used to scale relay measured bandwidths. bandwidth-weights are used for relay position weights:
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n3003

@teor2345, do you know why all the /10000 in torflow's aggregate.py [1], and whether it's related to the scale constant?
[1] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py

These divisions are used for the consensus parameters for PID control, which is broken. (The feedback loop is not fast enough or reliable enough. I can't find the ticket.)

Hmm, what about calculating the weights that would be assigned to each relay with our sbws scaled results and comparing them with the weights actually assigned to the relays in that period?

You already did scaling in:
#182 (comment)

Scaling might make the results a similar size, but it isn't going to change the shape of the curve:
#182 (comment)

Maybe this way we can check how different the weights might be using sbws instead of Torflow.

I think that this graph could be useful.

If we decide to implement torflow's observed bandwidth scaling in sbws, we can compare the graphs.

@binnacle binnacle commented Aug 6, 2018

Would like to add my thoughts. Per earlier comments, I favor the idea that observed self-measure is a necessary ingredient, and that it cannot (with reasonable resources) be discerned remotely. I suggest a spreadsheet representing the data from SBWS, the Torflow input, and the aggregate.py output will illuminate this more clearly than the graph; the idea is to examine individual relays with an eye toward the sanity of the weights assigned to them. I believe cases of potential severe misrating will be evident.

Yet the approach of biasing self-measure may benefit from any number of refinements--no effort in that direction was pursued previously. Some ideas in no particular order:

Instead of a single simple linear factor (Kp=1) perhaps parameterized polynomial equations could be incorporated such that the degree of SBWS adjustment to self-measure can vary depending on the advertised (or measured) speed of each relay. A separate equation for above-mean and below-mean biasing would allow for curves that emphasize optimal consensus balance for the former and collaring of nodes gaming to high bandwidth for the latter.

Instead of a single all-relay average: a) perhaps calculate a sliding weighted average by bandwidth, or bin by decile, ventile, etc., and/or b) perhaps calculate averages separately for each consensus class (e.g. exit, guard, unflagged/middle, ?exit+guard, ?exit-only). Relay selection probability determination in each class operates independently of the others, if I understand correctly.

All of the above and more could be implemented with consensus parameter controls that allow for cautious, iterative refinement of the consensus outcome. Modelling results saves a great deal of time, but cannot represent all real-world behaviors due to the feedback dimensions of bandwidth scanning and consensus construction.

Relay self-measure bears improvement so that inputs to the process represent true capacity. In particular, relays under-report capacity when lightly loaded.

Informing all of the above is the reality that gaming of bandwidth measurement appears to have little value to sophisticated adversaries--else more trouble surely would have arrived by now. Consider: overrated relays attract attention and an overload condition impairs various nuanced attack models. Gaming for high bandwidth is for amateurs and miscreants.

@juga0 juga0 commented Aug 7, 2018

Replying to teor's comment in #182 (comment):

So I think we need to copy these 4 lines of torflow's scaling code:

Thanks for pointing me at those lines; I think I finally understand what torflow is doing.
I'm going to implement that and graph again. I might not create PRs with the code until we're sure which scaling we'll be using.

@juga0 juga0 commented Aug 7, 2018

Replying to teor's comment in #182 (comment)

bwweightscale is for bandwidth-weights.

Oh, right, sorry, I got confused by that.

@juga0 juga0 commented Aug 11, 2018

torflow vs sbws and the other way around; looks good
20180811_210215
20180811_210258

sbws scaled like torflow vs sbws without scaling; the shape is still not like torflow's
20180811_211344
20180811_211435

The jumps are explained by the descriptor bandwidth jumps.
I think we should be plotting probability distributions instead; I'll try that.
More explanations and code tomorrow. I needed to make one adjustment.
I'll also try to compare sbws scaled as we did before and scaled like torflow.

@juga0 juga0 commented Aug 12, 2018

Code (self-explanatory) in #243

@juga0 juga0 commented Aug 14, 2018

I'm quite sure now that I found the problem.
This graph is just the descriptor bandwidth according to sbws:
figure_1-1
which explains the jumps in the previous graphs.
The reason is that the descriptor bandwidth is only checked once, when writing the results (once a day): https://github.com/pastly/simple-bw-scanner/blob/master/sbws/lib/resultdump.py#L480
while torflow takes the history average before writing it to the raw results. I'll document this more in the code.
@juga0 juga0 commented Aug 14, 2018

Oh, and not only is it written once, it's also not updated from day to day.

@teor2345 teor2345 closed this Nov 23, 2018