Conversation
pastly
left a comment
There was a problem hiding this comment.
Important thoughts:
Can we use the same time units for both the data files and the v3bw files? One is days, the other is minutes. If we do, then _get_older_files_than doesn't have to change. Alternatively we could pass it a timedelta object instead of a number of days/minutes (but I think I prefer the "use the same unit of time" solution better).
I chose very high default values on purpose so that users could access old data easily. It seems the valid/stale/rotten times on v3bw files defaults to close to the minimum usable. Can we make the defaults much larger?
The option valid_mins_v3bw_files is unused from what I can tell. Further, I don't think we need it. What would a "valid" v3bw file even be? I suggest this option be removed. If needed in order to keep the code generic, I bet we could assume valid_time == stale_time.
_check_validity_periods is almost exactly the same as some code in main. I bet there's a way to write this such that code isn't duplicated so much.
Other thoughts:
It might be cool to have a generic option for stale_days and rotten_days, and then have more specific ones for both data files and v3bw files similar to what we do for log levels. I'm picturing a config like this, but with appropriate comments too.
[cleanup]
stale_days = 10
data_stale_days = ${stale_days}
v3bw_stale_days = ${stale_days}
rotten_days = 90
data_rotten_days = ${rotten_days}
v3bw_rotten_days = ${rotten_days}
I would merge a branch without this change; I just thought I'd mention it in case you think it's handy and easy enough to implement.
I considered that, but while we need at least 5 days of data files to generate a v3bw files, we only need the last hour v3bw to be available to be read by Tor [0] and "old" v3bw files doesn't influence on the newer v3bw files
would be changing days by minutes (regarding data files) a possible solution?.
as commented above
it would be the file that Tor needs to read, which is either the one being using in the current vote or in the consensus, what is New idea: instead of having
If we implement the "new idea" then i don't think we need these configuration options for v3bw files
yes, i didn't modify the data files code to use it, but then it could be removed there. However, let's see what we think about the "new idea"
This could work, though see comments on v3bw files only needed for minutes and the "new idea"
Thanks for the comments! [0] https://github.com/torproject/torspec/pull/26/files#diff-1a245b60e913b59d7e986bd215f5d898R2575, https://trac.torproject.org/projects/tor/ticket/26740#comment:3 |
|
IMO sbws should not assume certain commands are being run at certain times, nor should Tor do anything more than expect sbws to dump its latest v3bw file at a standard location. Therefore I don't like the idea about having two symlinks, one intending to be pointing towards the most recently probably used v3bw file and one intending to be pointing toward the newest probably about-to-be-used v3bw file. There are too many ways for the semantics to become false. If for some reason we think it would be useful to have two symlinks that doesn't have inherit semantic meaning, I would be okay with merging that I guess. If, for example, they are named
.
True, but old v3bw files might be interesting to researchers fetching them from the bwauths. Furthermore, a given sbws instance may be operated by a non-bwauth, not archived by collecTor, and never even have its v3bw files read by Tor. (Looks like tjr has uncompressed copies dating back to March Of course, despite the new options being measured in minutes and having small default values, they can be changed by the user to have large values like I'm advocating for. That said, I don't think it's too crazy to have (for example) the last day's worth of v3bw files on disk uncompressed, the last week's worth on disk compressed, and all older v3bw files deleted. This gives researchers and archivers plenty of time to grab files from
I would be okay with changing the units on the data file options from days to minutes as long as the default values stay large. Also see the previous section (right before this quote) where I argue that measuring stale/rotten time for v3bw files in whole non-fractional days is fine or even desired. If @teor2345 is able to give their thoughts on this in the next few days, I'd also be glad to hear what they think. |
|
Matt Traudt:
IMO sbws should not assume certain commands are being run at certain times, nor should Tor do anything more than expect sbws to dump its latest v3bw file at a standard location. `sbws generate` has no way of knowing if its penultimate v3bw file was actually read by Tor, nor does it know if Tor is running and going to read the latest v3bw file, nor does it have any idea about how often it is being executed. It is **probably** being executed once an hour because that's what we suggest operators do, but if the code starts assuming things like this, I'm afraid this will lead to weird bugs.
Therefore I don't like the idea about having two symlinks, one intending to be pointing towards the most recently **probably** used v3bw file and one intending to be pointing toward the newest **probably** about-to-be-used v3bw file. There are too many ways for the semantics to become false.
If for some reason we think it would be useful to have two symlinks that doesn't have inherit semantic meaning, I would be okay with merging that I guess. If, for example, they are named `latest.v3bw` and `latest-1.v3bw`.
ok, i agree we don't need 2 symlinks.
Still with only one, we're doing all the assumptions you mention, since
it's not sbws business to control what Tor or the system configuration
(ie crontab or system package) is doing.
Yet, i think the "new idea" can still be:
``
sbws cleanup --v3bw => delete all the *.v3bw files that are not
symlinked by latest.v3bw
``
And if a bwauth operator wants the all v3bw files, then as simple as
*not* running ``cleanup --v3bw``
|
|
There's a lot of content here, and github threading doesn't do nested replies, so I'm going to comment by topic instead. Overall Goals sbws should not fill up the disk, because that's a common bwauth failure mode. Therefore, all files generated by sbws should be cleaned up by default. (Unless they are very small.) The design (and the defaults) should not surprise operators. Choosing Units
Choosing Defaults
Naming Options
Tor Timing Tor should only read the bandwidth file once per vote, to avoid race conditions: A bandwidth file generator just needs to atomically generate the file: |
|
If we use standard |
|
I reverted the modifications to cleanup, to leave at @pastly choice how to clean results files. |
pastly
left a comment
There was a problem hiding this comment.
I can make the requested changes if you'd like, as I'd also take the opportunity to refactor my code to use yours.
| _, ext = os.path.splitext(fname) | ||
| if ext not in extensions: | ||
| log.debug('Ignoring %s because it doesn\'t have extension ' | ||
| '%s', fname, ext) |
There was a problem hiding this comment.
"Because it doesn't have an extension in [the list of extensions]"
There was a problem hiding this comment.
ext is the extension the file does have. While extensions is the list of allowed extensions. The second half of my comment above was pointing out we're logging
Ignoring foo.bin becuase it doesn't have extension .bin
When we should be logging
Ignoring foo.bin because its extension is not in ['.v3bw', 'v3bw.gz']
| # first delete so that the files to be deleted are not compressed first | ||
| files_to_delete = _get_files_mtime_older_than(v3bw_dname, | ||
| delete_after_days, | ||
| ['.v3bw']) |
There was a problem hiding this comment.
This should also delete .v3bw.gz files.
|
Made the changes you suggest |
Not in a nice way, but what i came up with without having to refactor much.