think about practicality of eliminating self-measure (aka advertised bandwidth) in voting #150
Comments
Hey @binnacle. You seem to have two core suggestions/concerns.
**Bandwidth scanner speed**

Summary: yes, hopefully sbws will be faster to react to changes than torflow. Sbws aims to prioritize relays that it doesn't have many measurements for. Assuming an sbws scanner has been running for a few days already, it should reach a steady state of X fresh measurements per relay on average. When calculating an output value from a relay's X measurements, sbws uses the median of the measurements from the last Y days, where Y is configurable and was picked rather arbitrarily because "it sounded good to me." Median was picked for the same reason.

There are obviously many factors that go into X: how many measurement threads are running, how long the circuit build timeout is, how many relays are in the network. These things can be tuned once sbws is minimally viable.

**Avoiding self-reported bandwidths**

Assuming I understand everything that goes into the results: the only time sbws uses self-advertised bandwidth is when picking an exit to help measure the current relay needing to be measured, and that exit isn't measured by 3 or more bw auths yet. AIUI, torflow uses self-reported bandwidths as input to some calculations that eventually trickle down into the value in its vote (fn0). Why should this not be done? Why should sbws use self-reported bandwidths in some more significant way? I don't have the spare cycles available to absorb or manipulate a massive spreadsheet.

fn0: Something to the effect of: it goes into some ratio to scale a relay's result, the bw auth that uses that torflow instance puts that result in its vote, and then the bw auths take the low median of everyone's results. Am I close? It probably isn't important to be close.
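The median-over-a-window aggregation described above can be sketched roughly as follows. This is an illustrative sketch, not sbws's actual code; the function name, the tuple representation of measurements, and the 5-day default standing in for the configurable Y are all assumptions.

```python
import statistics
from datetime import datetime, timedelta

def output_value(measurements, max_age_days=5):
    """Sketch of an sbws-style per-relay output value.

    `measurements` is a list of (timestamp, bytes_per_second) tuples
    for one relay.  `max_age_days` is a placeholder for the
    configurable Y-day window described above.
    """
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    recent = [bw for ts, bw in measurements if ts >= cutoff]
    if not recent:
        # No fresh measurements: this relay should be prioritized
        # by the scanner rather than given a value.
        return None
    return statistics.median(recent)
```

A relay with fresh measurements of 100 and 300 bytes/s would report 200, while stale measurements outside the window are ignored entirely.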
Note: written this evening; will expand tomorrow if needed.

The design and implementation points all sound great; my only thought is to keep in mind the big-picture purpose, which is to continuously balance the entire Tor network, and that it might make sense to scan all relays on every cycle. If a cycle is on the order of one hour, then no special attention to new relays should be required.

SBWS thus far sounds analogous to bwauthority_child.py, which scans residual bandwidth, though in a radically different (and flawed, IMO) manner. Self-measure is not considered by the Torflow scanner, but rather by the final vote generator aggregate.py, and my second point is essentially that the type of logic found in aggregate.py probably cannot be avoided. Residual capacity in isolation does not appear sufficient for generating complete sets of consensus-balancing votes. The spreadsheet examines this.

The spreadsheet has a ridiculous number of raw and partially cooked data columns, but consists of three straightforward sections: 1) Maatuska aggregate.py output [columns C-L], 2) joined by-relay consensus and descriptor data [columns N-X], and 3) what-if scenarios [columns Z-AK]. Less relevant columns are hidden. The second sheet holds whole-consensus summary data from external scripts, plus summarizing pivot tables added to support the later what-if columns.

The column most relevant to this issue is "relay class bw vote1", which considers unvarnished residual bandwidth as the vote (i.e. the closest Torflow equivalent to current SBWS output, as I understand it without having eyeballed the code). If one sorts by "relay class bw vote1", the resulting consensus rank sucks. The current consensus sort is by "cbw"; Maatuska's view is the sort by "vote". F2 highlight can be used to see data relationships. "dbw" is self-measure, or "descriptor bandwidth".
Another way to look at it is to sort by "dbw" capped at 10,000,000, which is what happens when fewer than three BWauths are online. This happened earlier this year and the Tor network performed rather well, probably much better than it did in the days before Torflow, given the explosion of capacity since then. In contrast, ranking by raw residual bandwidth would be dramatically worse: relays with minuscule capacity appear in the top 100.
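The fallback ranking described above (self-reported descriptor bandwidth, capped, when fewer than three BWauths are voting) can be sketched as below. This is an illustration of the sorting being discussed, not the actual directory-authority code; the function name and the dict representation of relays are assumptions.

```python
CAP = 10_000_000  # bytes/s cap applied to self-reported bandwidth

def fallback_rank(relays):
    """Rank relays the way the consensus falls back to weighting them
    when fewer than three bandwidth authorities are online: by
    self-reported descriptor bandwidth ("dbw"), capped.

    `relays` maps relay nickname -> dbw in bytes/s.
    """
    return sorted(relays, key=lambda r: min(relays[r], CAP), reverse=True)
```

Under the cap, a relay claiming 50 MB/s and one claiming 12 MB/s are indistinguishable, which limits how much a single relay can inflate its own rank.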
I've read everything you've written on trac and on tor-dev@. Your calculations seem inconsistent with our analysis of sbws results on the test network and the real network, but I can't work out why, because I don't understand the assumptions behind your calculations. There's just too much information in what you've written, and in your spreadsheets. Can you please give short answers to the following questions?

- Relays report their maximum observed bandwidth over 24 hours. How are you calculating the total capacity of a relay?
- sbws measures residual bandwidth plus latency at a point in time, with no post-processing. How are you comparing torflow and sbws results?
Please quote each question before giving a short answer; I'll get lost otherwise.
I have nothing further to add. What I wrote above is clear and sufficient.
One correction: above, attention was drawn to "all relay bw vote"; this was a mistake. The correct column to focus on is "relay class bw vote1", which approximately reverses aggregate.py's pid_error back to the scanner value strm_bw, the residual bandwidth measurement provided by the scanner. Aggregate.py lines 609-614 calculate pid_error from stream bandwidth as a ratio offset from the bandwidth average of the relay class. "Relay class bw vote1" roughly reverses this, though using a point-in-time set of self-measure values rather than the running averages maintained by aggregate.py, which are not published. The original comment was edited to reflect this.
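The ratio-offset calculation and its reversal, as described above, can be sketched as follows. This is a simplified illustration of the relationship being exploited in the spreadsheet, not the actual aggregate.py code; the function names are assumptions.

```python
def pid_error(strm_bw, class_avg):
    """Ratio offset of a relay's measured stream bandwidth from its
    relay-class average, in the style attributed above to
    aggregate.py lines 609-614."""
    return (strm_bw - class_avg) / class_avg

def reverse_pid_error(pid_err, class_avg):
    """The "relay class bw vote1" reversal: recover the scanner's
    strm_bw from a published pid_error plus an estimate of the class
    average (a point-in-time self-measure average standing in for
    aggregate.py's unpublished running average)."""
    return class_avg * (1.0 + pid_err)
```

A relay measured at 1500 against a class average of 1000 yields a pid_error of 0.5, and applying the reversal to 0.5 recovers 1500, so the reversal is only as accurate as the class-average estimate fed into it.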
Advertised Bandwidth is now going to be a factor in sbws measurements. See PR #191. |
We have to use relay observed bandwidth. See this tor-dev thread for an explanation: |
I've gradually progressed from loathing Torflow (from the perspective of a relay operator whose ratings are smacked around with what seems like capricious malevolence) to a grudging appreciation of the difficulty of the problem Torflow addresses, arrived at by reading the code and forming a rough understanding of how it functions.
I agree a simpler, more straightforward rewrite of the scanner in particular is a fine idea, and would like to put forward a (probably redundant) suggestion and raise a concern.
The suggestion: I have always felt scanners should measure the entire network quickly and often, applying little-to-no effort retrying failed paths, instead of the current dogged, painfully slow approach. Scan all relays once an hour, repeat ad infinitum, and average-in results. The measurement sampling rate should match or exceed the rate of control actions and of the events being controlled for. IMO the quality issues with Torflow derive largely from a sampling rate too slow to keep pace with changes in the network and too slow to avoid relay-load overshoot. It's like driving a track with one's eyes open only one second out of every ten: you are going to bounce off walls.
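One way to picture the suggested scan-everything-hourly-and-average-in approach is the sketch below. Everything here is hypothetical: the function names, the `measure` callback, and the smoothing weight are all assumptions used purely to illustrate the idea of fast full-network passes with no retries and smoothed estimates.

```python
ALPHA = 0.25  # smoothing weight for averaging-in new results (arbitrary)

def scan_cycle(relays, measure, estimates):
    """One hypothetical hourly cycle: attempt every relay once, skip
    failures without retrying, and fold successes into a running
    exponential moving average.

    `measure(relay)` returns a bandwidth in bytes/s, or None when the
    path failed.  `estimates` maps relay -> smoothed estimate and is
    updated in place.
    """
    for relay in relays:
        result = measure(relay)
        if result is None:
            continue  # little-to-no effort retrying failed paths
        prev = estimates.get(relay)
        if prev is None:
            estimates[relay] = result
        else:
            estimates[relay] = (1 - ALPHA) * prev + ALPHA * result
    return estimates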
The concern: A stated design goal of SBWS is that it should not rely on relay bandwidth self-measure. Only recently have I thought about this carefully, and I am now close to convinced it cannot be done. I will refrain from arguing the point in detail and instead invite the project participants to think hard about whether it really can be achieved, and how that might be accomplished if so.
I offer as evidence this spreadsheet, in which an attempt is made to calculate votes using a single constant (the whole-network average of self-reported bandwidth, chosen arbitrarily) as the basis for votes, adjusting it with Torflow progressive offset values from Maatuska (Tom Ritter kindly publishes the vote documents from the BWAuth he operates). While Torflow has its issues, I am certain that, in general, the ratio measurements it produces are decent enough to rely on for this type of what-if analysis.
bwscan_cnsns.20180429-1345-plus4.xlsx
An earlier version of the spreadsheet and related commentary can be found in ticket 17810:
#17810: TorFlow should ignore self-reported bandwidths when measuring relays
https://trac.torproject.org/projects/tor/ticket/17810#comment:13
Good luck with the project! It will be a challenge and a pleasure, I am sure.