This repository has been archived by the owner on May 4, 2021. It is now read-only.

think about practicality of eliminating self-measure aka advertised bandwidth in voting #150

Closed
binnacle opened this issue May 4, 2018 · 9 comments
Labels
question Further information is requested wontfix This will not be worked on

Comments

@binnacle

binnacle commented May 4, 2018

I've gradually progressed from loathing Torflow from the perspective of a relay operator whose ratings are smacked around with what seems capricious malevolence to grudging appreciation of the difficulty presented by the problem Torflow addresses, by reading code and arriving at a rough understanding of how it functions.

I agree a simpler, more straightforward rewrite of the scanner in particular is a fine idea, and would like to put forward a (probably redundant) suggestion and raise a concern.

The suggestion: I have always felt scanners should measure the entire network quickly and often, applying little-to-no effort to retrying failed paths, instead of the current dogged, painfully slow approach. Scan all relays once an hour, repeat ad infinitum and average in the results. Measurement sampling rate should match or exceed control and controlled-for-event rates. IMO the quality issues with Torflow derive largely from a sampling rate too slow to keep pace with changes in the network and too slow to avoid relay-load overshoot. Like driving a track with one's eyes open only one second of every ten seconds--you are going to bounce off walls.

The concern: A stated design goal of SBWS is that it should not rely on relay bandwidth self-measure. Only recently have I thought about this carefully, and I am now close to convinced it cannot be done. I will refrain from arguing the point in detail and instead invite the project participants to think hard about whether it really can be achieved and how that might be accomplished if so.

I offer as evidence this spreadsheet where an attempt is made to calculate votes using a single constant (averaged whole-network self-report bandwidth, arbitrary) as the basis for votes and adjusting it with Torflow progressive offset values from Maatuska (Tom Ritter kindly publishes vote documents from the BWAuth he operates). While Torflow has its issues I am certain that, in general, the ratio measurements it produces are decent enough to rely on for this type of what-if analysis.

bwscan_cnsns.20180429-1345-plus4.xlsx

An earlier version of the spreadsheet and related commentary can be found in ticket 17810:

#17810: TorFlow should ignore self-reported bandwidths when measuring relays
https://trac.torproject.org/projects/tor/ticket/17810#comment:13

Good luck with the project! Will be a challenge and a pleasure I am sure.

@pastly
Member

pastly commented May 4, 2018

Hey @binnacle.

You seem to have two core suggestions/concerns/comments/whatever-s.

  1. A bandwidth scanner should be fast
  2. Sbws says it doesn't want to use self-reported bw, but you don't think a scanner can avoid it

Bandwidth scanner speed

Summary: yes, hopefully sbws will be faster to react to changes than torflow.

Sbws aims to prioritize relays that it doesn't have many measurements for. Assuming a sbws scanner has been running for a few days already, it should reach a steady state of X fresh measurements per relay on average.

When calculating an output value from a relay's X measurements, sbws uses the median of the measurements from the last Y days, where Y is configurable and picked rather arbitrarily because "it sounded good to me." Median was picked for the same reason.

There are obviously many factors that go into X. How many measurement threads are running. How long the circuit build timeout is. How many relays are in the network. These things can be tuned once sbws is minimally viable.
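
A minimal sketch of the two ideas above (median over the last Y days, and prioritizing relays with few fresh measurements). This is not sbws code: the function names, the `(timestamp, bandwidth)` tuple layout and the `max_age_days` parameter are assumptions made for illustration.

```python
import statistics
import time

DAY = 24 * 60 * 60  # seconds


def fresh_values(measurements, max_age_days, now=None):
    """Return bandwidth values measured within the last max_age_days.

    measurements: list of (unix_timestamp, bandwidth) tuples (hypothetical layout).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * DAY
    return [bw for ts, bw in measurements if ts >= cutoff]


def relay_output(measurements, max_age_days, now=None):
    """Median of the fresh measurements, or None if there are none."""
    fresh = fresh_values(measurements, max_age_days, now)
    return statistics.median(fresh) if fresh else None


def next_relay(measurements_by_relay, max_age_days, now=None):
    """Pick the relay with the fewest fresh measurements to measure next."""
    return min(
        measurements_by_relay,
        key=lambda relay: len(
            fresh_values(measurements_by_relay[relay], max_age_days, now)
        ),
    )
```

With Y = 5 days, a measurement taken nine days ago simply drops out of both the median and the priority count, so a relay whose results have gone stale moves back toward the front of the queue.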

Avoiding self-reporting bandwidths

Assuming I understand everything that goes into the w line in consensus documents correctly ...

The only time sbws uses self-advertised bandwidth is when picking an exit to help measure the current relay needing to be measured and that exit isn't measured by 3 or more bw auths yet.

AIUI, torflow uses self-reported bandwidths as input to some calculations that eventually trickle down into the value on the w line[fn0]. Therefore, in a torflow-free network, self-reported bandwidths will not be used in any significant way.

Why should this not be done? Why should sbws use self-reported bandwidths in some more significant way? I don't have the spare cycles available to absorb or manipulate a massive spreadsheet.

fn0: Something to the effect of: it goes into some ratio to scale a relay's result, the bw auth that uses that torflow instance puts that result in its vote, and then the bw auths take the low median of everyone's result. Am I close? It probably isn't important to be close.
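
For reference, the "low median" step in fn0 can be sketched as below. This only illustrates what a low median is; the function name and the example vote values are made up, and nothing here is taken from tor's or torflow's code.

```python
import statistics


def low_median_vote(votes):
    """Low median of the per-authority bandwidth values for one relay.

    With an even number of votes, the low median is the smaller of the
    two middle values rather than their average.
    """
    return statistics.median_low(votes)


# Four hypothetical votes for the same relay.
example = low_median_vote([100, 400, 200, 300])
```

Using the low median means the result is always a value some authority actually voted for, and with an even number of votes the tie resolves toward the lower value.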

@binnacle
Author

binnacle commented May 5, 2018

Note this evening; will expand tomorrow if needed

The design and implementation points all sound great; only thought is to keep in mind the big picture purpose, which is to continuously balance the entire Tor network and that it might make sense to scan all relays on every cycle. If a cycle is on the order of one hour, then no special attention to new relays should be required.

SBWS thus far sounds analogous to bwauthority_child.py, which scans residual bandwidth, though in a radically different (and flawed IMO) manner. Self-measure is not considered by Torflow scanner, but rather by the final vote generator aggregate.py and my second point is essentially that the type of logic found in aggregate.py probably cannot be avoided. Residual capacity in isolation does not appear sufficient for generating complete sets of consensus balancing votes. The spreadsheet examines this.

Spreadsheet has a ridiculous number of raw and partially cooked data columns, but consists of three straightforward sections: 1) Maatuska aggregate.py output [columns C-L], 2) joined by-relay consensus and descriptor data [columns N-X] and 3) what-if scenarios [columns Z-AK]. Less relevant columns are hidden. The second sheet is for whole-consensus summary data from external scripts and summarizing pivot tables added to support later what-if columns. Most relevant column to this is "relay class bw vote1" where it considers unvarnished residual bandwidth as the vote (i.e. the Torflow close equivalent to current SBWS output as I understand it without having eyeballed the code.)

If one sorts by "relay class bw vote1" the resulting consensus rank sucks. Current consensus sort is by "cbw", Maatuska's view is sort by "vote". F2 highlight can be used to see data relationships. "dbw" is self-measure or "descriptor bandwidth".

@binnacle
Author

binnacle commented May 5, 2018

Another way to look at it is to sort by "dbw" capped at 10,000,000, which is what happens when fewer than three BWauths are online. Happened earlier this year and the Tor network performed rather well, probably much better than it did in the days before Torflow due to an explosion of capacity since. In contrast ranking by raw residual bandwidth would be dramatically worse--relays with minuscule capacity appear in the top 100.
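
A rough sketch of that fallback ranking, assuming only what is stated above: each relay's "dbw" is capped at 10,000,000 and relays are ordered by the capped value. The function names and the example relays are hypothetical.

```python
DBW_CAP = 10_000_000  # cap quoted above; same units as the spreadsheet's "dbw" column


def fallback_weight(dbw, cap=DBW_CAP):
    """Self-reported bandwidth, clipped at the cap."""
    return min(dbw, cap)


def rank_by_fallback(dbw_by_relay):
    """Relay names sorted from highest to lowest capped dbw."""
    return sorted(
        dbw_by_relay,
        key=lambda name: fallback_weight(dbw_by_relay[name]),
        reverse=True,
    )
```

Every relay reporting more than the cap ends up with the same weight, so the ordering among the largest relays is arbitrary, but a tiny relay cannot rise above its own self-reported figure.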

@teor2345
Contributor

teor2345 commented May 5, 2018

I've read everything you've written on trac and on tor-dev@.

Your calculations seem inconsistent with our analysis of sbws results on the testnet and real network. But I can't work out why, because I don't understand the assumptions behind your calculations. There's just too much information in what you've written, and in your spreadsheets.

Can you please give short answers to the following questions?
One sentence answers are best.
Simple equations, with defined terms, are better than long paragraphs.

Relays report maximum observed bandwidth over 24 hours.
Torflow measures residual bandwidth + latency at a point in time, then does some post-processing involving observed bandwidth.
There is no measure of the total capacity of a relay.

How are you calculating the total capacity of a relay?
What if the torflow measurement causes the maximum observed bandwidth?
How are you eliminating latency?
What is the distribution of total relay capacities?
Do they correspond to relay bandwidth rates?
Do they correspond to typical capacities 10 Mbps, 100 Mbps, 250 Mbps, 1 Gbps?

sbws measures residual bandwidth + latency at a point in time, with no post-processing.
How are you calculating the results of sbws?
Have you actually run sbws to confirm your results?

How are you comparing torflow and sbws results?
How similar do you expect the results to be between different bandwidth scanners?
Have you done a similarity analysis on the 4 current bandwidth scanners?
(Or have you seen Tom Ritter's analysis at https://tomrittervg.github.io/bwauth-tools/ )
What do you think the problem is with the results?

@teor2345
Contributor

teor2345 commented May 5, 2018

Please quote each question before giving a short answer. I'll get lost otherwise.

@binnacle
Author

binnacle commented May 6, 2018

I have nothing further to add. What I wrote above is clear and sufficient.

@binnacle
Author

binnacle commented May 6, 2018

One correction: Above, attention was drawn to "all relay bw vote"; this was a mistake. The correct column to focus on is "relay class bw vote1," which approximately reverses aggregate.py's pid_error to the scanner value strm_bw, the residual bandwidth measurement provided by the scanner. Aggregate.py lines 609-614 calculate pid_error from stream bandwidth as a ratio offset from the bandwidth average of the relay class. "Relay class bw vote1" above roughly reverses this, though using a point-in-time set of self-measure values rather than the running averages maintained by aggregate.py, which are not published.

Original comment was edited to reflect this.
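
To make the relationship concrete, here is a sketch that reads "ratio offset from the class average" as (value - average) / average. That reading is an assumption; the exact expression in aggregate.py is not reproduced here, and `class_avg` stands in for whichever class average is used.

```python
def pid_error(strm_bw, class_avg):
    """Offset of a relay's stream bandwidth from its class average, as a ratio."""
    return (strm_bw - class_avg) / class_avg


def strm_bw_from_pid_error(error, class_avg):
    """Reverse the calculation: recover stream bandwidth from the ratio offset."""
    return class_avg * (1.0 + error)
```

The reversal is only as good as the average fed into it, which is why substituting a point-in-time average for the unpublished running averages makes the recovered value approximate.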

@teor2345 teor2345 added the question Further information is requested label May 6, 2018
@pastly
Member

pastly commented Jun 18, 2018

Advertised Bandwidth is now going to be a factor in sbws measurements. See PR #191.

@teor2345 teor2345 added the wontfix This will not be worked on label Nov 23, 2018
@teor2345
Contributor

We have to use relay observed bandwidth. See this tor-dev thread for an explanation:
https://lists.torproject.org/pipermail/tor-dev/2018-November/013546.html
