
Run cleaner concurrently also inside one archive #3097

Open

pmoravec opened this issue Dec 29, 2022 · 25 comments

@pmoravec
Contributor

sos report --cleaner cleans the generated report sequentially in a single thread, which prolongs its already "naturally slow" execution. We should run it concurrently, similarly to how sos report executes plugins.

I see two approaches here:

  • run individual parsers (IP, MAC, username,..) concurrently, with a lock on each file they update. This is required to prevent concurrent modification of the same file by multiple parsers.
  • or run the parsers sequentially, and for each parser clean the files in parallel

Currently, cleaner iterates over all files and runs the individual parsers sequentially, one file after another. That is quite far from either of the approaches above.

Also, sos clean --jobs=4 already declares concurrency over individual archives that can be cleaned in parallel. This additional concurrency (with an assumed default of 4 workers) should be aligned with that one - we should not end up with 4*4 concurrent workers running in parallel.

Despite the above obstacles to the proposal, I think the justification for the improvement is solid: the use cases where cleaner is run on a single archive dominate, and the improvement would speed them up significantly.
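For illustration, here is a minimal sketch of the second approach (parsers sequential, files cleaned in parallel), with hypothetical helper names and a stand-in regex - not the actual sos code:

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one parser's compiled regex (sos keeps a set per parser).
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def clean_file_for_parser(path, regex, mapping, lock):
    """Apply a single parser to one file, sharing the obfuscation map."""
    with open(path) as f:
        content = f.read()
    for match in set(regex.findall(content)):
        with lock:  # keep the shared obfuscation map consistent across threads
            replacement = mapping.setdefault(match, 'obfuscated%d' % len(mapping))
        content = content.replace(match, replacement)
    with open(path, 'w') as f:
        f.write(content)


def clean_archive(files, parsers, workers=4):
    """Approach 2: parsers run one after another, files in parallel per parser."""
    mapping, lock = {}, threading.Lock()
    for regex in parsers:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for path in files:
                pool.submit(clean_file_for_parser, path, regex, mapping, lock)
    return mapping


# usage (hypothetical file list):
# clean_archive(['sos_commands/networking/ip_addr'], [IP_RE])
```

Note that with threads the GIL means mostly the file I/O runs in parallel, which is one reason the discussion further below leans towards processes.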

@TurboTurtle
Member

I took a stab at moving our concurrency to being process-based a while back, and the direction I took there was to try to assign each archive to a specific process. There were a few challenges there - in no particular order:

  • We need to keep the parsers sync'd across processes, so that obfuscations are kept consistent across both archives and files.
  • Because of this, we need to use shared memory objects for holding those obfuscations. For multiple archives this is fine, but in the case of single archive cleaning it actually significantly harms performance.
  • Logging is performed via the archive abstractions - but logging streams cannot be serialized, so the pickling that occurs when forking out to a new process fails. There was/is some ambiguity here, as newer versions of Python don't seem to have the same problem (the new process initializes without issue), but core Python maintainers have stated that shouldn't be happening.

That being said, I'm not of the opinion that we have to have parallelism at the archive level. If we instead move to file-level parallelism, that's fine - we just need to be able to report when we've finished the last file in an archive and when we've started a new archive's file list.

This also doesn't have to use a specific library, but whatever we use does have to be available for Python 3.6, for our downstreams that use that version.
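To illustrate the shared-memory point above, here is a minimal sketch (hypothetical helpers and a stand-in regex, not the sos implementation) of keeping the obfuscation map consistent across processes with a Manager-backed dict; every lookup goes through the proxy (IPC), which is where the single-archive performance hit comes from:

```python
import multiprocessing
import re

# Stand-in for one parser's compiled regex.
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def clean_file(path, shared_map, lock):
    """Obfuscate one file, reusing/adding entries in the shared map."""
    with open(path) as f:
        content = f.read()
    for match in set(IP_RE.findall(content)):
        with lock:
            # Every lookup/update crosses process boundaries via the proxy.
            if match not in shared_map:
                shared_map[match] = 'obfuscated_ip_%d' % len(shared_map)
            replacement = shared_map[match]
        content = content.replace(match, replacement)
    with open(path, 'w') as f:
        f.write(content)


if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_map = manager.dict()   # obfuscations shared across all processes
    lock = manager.Lock()
    files = ['sos_commands/networking/ip_addr']   # hypothetical file list
    with multiprocessing.Pool(processes=4) as pool:
        pool.starmap(clean_file, [(f, shared_map, lock) for f in files])
```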

@NikhilKakade-1
Contributor

@TurboTurtle @pmoravec, any new developments to fix it?

@pmoravec
Contributor Author

@TurboTurtle @pmoravec, any new developments to fix it?

Nothing I am aware of.

@NikhilKakade-1
Contributor

@TurboTurtle @pmoravec, any update?
Is #3262 somehow related to it?

@TurboTurtle
Member

@NikhilKakade-1 the short update is - no update.

I've since left RH and my time to dedicate to sos is as such much reduced. Before I left, however, there was some discussion on moving to a queue-based system for obfuscations. The basic idea is that we'd have multiple processes (as opposed to threads) spawned which trawl through an archive/file looking for matches via the parsers. On a match, an item is added to a queue running in the main thread, which handles the obfuscation and then returns the obfuscated match to the process waiting for it.

While mostly straightforward in theory, this is a major change to implement and test. I am not sure when (or if) I'd have the time available to really dedicate to this effort myself in the near future. We (as in the sos project) would likely need resource(s) from our commercial sponsors (namely RH/Canonical) to be able to get this kind of work hammered out in a reasonable timeframe.
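To make that more concrete, here is a rough sketch of the queue-based idea as described (hypothetical, not an agreed design): worker processes scan files and send raw matches to the main process, which owns the mapping and sends the obfuscated value back on the requesting worker's reply queue:

```python
import multiprocessing
import re

HOST_RE = re.compile(r'\b[a-z][a-z0-9.-]*\.example\.com\b')   # stand-in parser


def worker(worker_id, files, request_q, reply_q):
    """Scan files, ask the main process for obfuscations, apply them."""
    for path in files:
        with open(path) as f:
            content = f.read()
        for match in set(HOST_RE.findall(content)):
            request_q.put((worker_id, match))
            content = content.replace(match, reply_q.get())
        with open(path, 'w') as f:
            f.write(content)
    request_q.put((worker_id, None))   # signal this worker is done


def clean(file_batches):
    request_q = multiprocessing.Queue()
    reply_qs, procs = [], []
    for wid, files in enumerate(file_batches):
        reply_qs.append(multiprocessing.Queue())
        p = multiprocessing.Process(target=worker,
                                    args=(wid, files, request_q, reply_qs[wid]))
        p.start()
        procs.append(p)

    mapping = {}    # single source of truth, owned by the main process
    finished = 0
    while finished < len(procs):
        wid, match = request_q.get()
        if match is None:
            finished += 1
            continue
        if match not in mapping:
            mapping[match] = 'host%d' % len(mapping)
        reply_qs[wid].put(mapping[match])

    for p in procs:
        p.join()
    return mapping
```

In sos itself this would sit behind the parser and mapping abstractions rather than a single regex; the message flow is the part being discussed here.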

@NikhilKakade-1
Contributor

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

@pmoravec
Contributor Author

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

No promise here. We might have some more resources for sos development in general, but especially for such a major change, I don't want to raise expectations we might not be able to meet.

@arif-ali
Member

arif-ali commented Feb 8, 2024

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

I would love to see better parallelism and concurrency, but I don't have the bandwidth to take on anything like that. And, as has been raised, this would require significant change.

@ijajmulani

Any update on this issue? I'm also facing this issue.

@pmoravec
Contributor Author

Any update on this issue? I'm also facing this issue.

Apart from some sporadic discussion at https://github.com/orgs/sosreport/discussions/3476 , no update. There is no final consensus on the implementation approach (Jake's original idea is followed by two other ones), and no workforce for it.

@NikhilKakade-1
Contributor

@TurboTurtle / @pmoravec, to address this, I've been contemplating the implementation of a wrapper script.

The script starts by retrieving a list of enabled plugins using the sos report -l command. It then proceeds to execute each plugin individually, utilizing the -o flag along with the --clean option, allowing for concurrent execution. This approach facilitates direct cleaning of the files collected by each plugin, potentially streamlining the data collection and obfuscation process. With this setup, there's no need to wait for the completion of each plugin before initiating the cleaning process. Additionally, dividing the process into multiple plugins enables parallelism on a batch basis, further enhancing efficiency. Afterward, the resulting multiple tarballs can be consolidated into a single tarball.
However, progress is currently hindered by an issue identified as #3433.
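A rough, purely illustrative sketch of that wrapper idea (plugin discovery and tarball merging omitted; the plugin list below is a made-up placeholder):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical plugin list; the wrapper would parse it from `sos report -l`.
PLUGINS = ['networking', 'kernel', 'logs']


def collect_and_clean(plugin):
    """Run a single plugin on its own and clean its output right away."""
    cmd = ['sos', 'report', '--batch', '-o', plugin, '--clean',
           '--label', plugin]   # --label keeps the per-plugin tarballs apart
    return subprocess.run(cmd, check=True)


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(collect_and_clean, PLUGINS))

# The resulting per-plugin tarballs would still need to be merged back into
# a single archive afterwards.
```

Whether multiple sos report runs can safely execute in parallel on one host is part of what would need validating.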

I'm looking forward to receiving your inputs and suggestions.

Thanks

@TurboTurtle
Member

That wouldn't be a design that gets picked up by us. Too many moving parts, and the reassembly of potentially dozens of archives will, by nature, always be fragile.

Also, as you note, the obfuscation of an archive is based on the contents of the archive. If we want to obfuscate hostnames, we need to know the hostname of the system so we can identify other hostnames we would also want to obfuscate based on the FQDN - and that requires either the information to be present in the archive, to be passed by the user at runtime, or for that information to already be recorded in an obfuscation map in /etc/sos/cleaner/default (or similar saved mapping location). This doesn't lend itself well to wrappers that attempt to fragment the workflow.

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

@NikhilKakade-1
Contributor

NikhilKakade-1 commented Apr 10, 2024

I think we can have the required file/cmd output at runtime with --clean; see #3433 (comment).

@pmoravec
Contributor Author

pmoravec commented Apr 15, 2024

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

What about the alternative approach I suggested in the discussion? The process-based concurrency approach might be the most elegant solution, but we have been trying to find a volunteer for it for approximately one year, unsuccessfully. Implementing my approach should be much easier (please correct me if my estimation is wrong) and the result could still be acceptable as a long-term solution. Well, with two potential gotchas:

  1. some UX regression from moving from obfuscated strings like host7 to strings like host-4B9F4B0B, which are less human-readable
  2. using a salt in clustered environments

I don't insist on this idea. I am in for whatever is feasible from the main angles (a sufficient solution, robust enough, somebody who will implement it, reasonably user-friendly). I am just asking for some assessment or discussion.
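For reference, here is a minimal sketch of the kind of deterministic, salted obfuscation being discussed (my own illustration, not an agreed design); with the same salt, every host and process derives the same token without any coordination or shared map:

```python
import hashlib

SALT = b'per-deployment-secret'   # would have to be stored and shared consistently


def obfuscate_hostname(hostname):
    """Deterministically map a hostname to an opaque host-XXXXXXXX token."""
    digest = hashlib.sha256(SALT + hostname.encode()).hexdigest()
    return 'host-%s' % digest[:8].upper()


# The same input and salt always produce the same token.
print(obfuscate_hostname('node1.example.com'))
```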

@NikhilKakade-1
Contributor

NikhilKakade-1 commented Apr 25, 2024

https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/__init__.py#L97

The time complexity of the provided function becomes:

Worst Case: O(k * n * m)
Best Case: O(k * m)

In my opinion, optimizing the data masking process is essential. In terms of computational complexity, the worst-case scenario is O(k * n * m), where k is the number of lines, n is the number of compiled regexes in self.mapping.compiled_regexes, and m is the average length of the lines. Conversely, the best case is O(k * m).
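For context, the loop being analyzed has roughly this shape (a simplified illustration, not the actual sos code): each of the k lines is tested against each of the n compiled regexes, and every scan is linear in the line length m:

```python
import re


def clean_lines(lines, compiled_regexes, mapping):
    """Simplified shape of the per-line parsing/masking loop."""
    cleaned = []
    for line in lines:                              # k lines
        for regex in compiled_regexes:              # n compiled regexes
            for match in set(regex.findall(line)):  # each scan is O(m)
                replacement = mapping.setdefault(match,
                                                 'obfuscated%d' % len(mapping))
                line = line.replace(match, replacement)
        cleaned.append(line)
    return cleaned


print(clean_lines(['ping 192.168.0.1 and 192.168.0.2'],
                  [re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')], {}))
```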

I've been mulling over a few questions.

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?
  2. What factors should be considered when deciding between processing find and replace operations line by line versus in batches?

@pmoravec
Contributor Author

pmoravec commented May 6, 2024

The complexity theory approach reminds me of my university studies :)

We don't have any cleanup of mapping files. It could make sense, but how/when should we decide cleanup is done? Every month? When the mapping is bigger than X records? Remove selected records when they have not been found & replaced in the past Y cleaner runs? Each heuristic has its pros and cons. Anyway, that is a topic independent of the limited cleaner concurrency (and one I am definitely not against).

Anyway, the number of compiled regexes is usually by far the smallest number in that equation. We might do better by preventing cleaners from touching files we know are "safe" (https://github.com/sosreport/sos/blob/main/sos/cleaner/__init__.py#L775-L780), and optionally skipping files if no parser is applicable. For example, can /var/lib/selinux contain any sensitive info? Or /sys/fs/cgroup, /proc/sys or /sys/kernel? I expect some of these dirs can contain sensitive info, I am not an expert on any of them, and we should rather be conservative than risk skipping obfuscation of anything here, but maybe there is also room for improvement here..? Again, a topic independent of the original "make cleaner (more) concurrent".
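Purely to illustrate the shape of such a skip list (whether these particular paths are actually safe is exactly the open question above):

```python
# Hypothetical prefixes a cleaner could skip entirely; NOT a vetted list.
SAFE_PREFIXES = ('/sys/fs/cgroup/', '/proc/sys/', '/sys/kernel/')


def should_clean(path):
    """Return False for files under prefixes considered safe to skip."""
    return not path.startswith(SAFE_PREFIXES)


print(should_clean('/sys/fs/cgroup/memory.stat'))   # False: would be skipped
print(should_clean('/etc/hosts'))                   # True: still cleaned
```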

@TrevorBenson
Member

TrevorBenson commented Jun 26, 2024

I've been mulling over a few questions.

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?

Yes? However, retaining the same obfuscation is very useful in practice.

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Separately

I like the deterministically computed hash approach @pmoravec discusses in #3476_comment-8501581, especially if the problem of using separate salts per host in a single collection is solved. As mentioned, without solving this it presents a problem for a general analysis of separate archives in an sos collection.

The difficulty can also increase exponentially when analyzing cluster configurations, especially for clusters that continue to grow their total number of members. A different salt between separate collections would also make it quite difficult to compare collections from different times against one another, which can also be useful in practice.

@pmoravec
Contributor Author

using separate salts per host in a single collection is solved.

Assuming the salt is somehow stored on each host (and just not shared in the sosreport, of course), sos collect --clean can read the salt from each host before (or after?) collecting the sosreport archive.

This still does not solve use cases like:

  • "generate sosreport on one host, then run cleaner on another", where salt from another host is used to sosreport from the first host
  • sos collect --clean fetches a sosreport from a host that has not generated salt yet as no cleaner has been run there, so a local (new?) salt is used for that sosreport; later on the host runs sos clean that generates a new different salt..? (this has solution: when collector detects a host does not have a salt, it generates one and saves it for the host)

There can be other scenarios I haven't imagined, which makes the idea a bit fragile. That's the con, compared to the Queue() idea described in https://github.com/orgs/sosreport/discussions/3476#discussioncomment-8225987 . The pro is a much simpler implementation.

I would agree with either approach (if we are certain it has no gotchas and we have somebody who can dedicate time to implementing it). I don't need my idea to be the one implemented; I am very pragmatic - let's implement anything we can agree on.

@NikhilKakade-1
Contributor

Anyway, the number of compiled regexes is usually by far the smallest number in that equation. We might do better by preventing cleaners from touching files we know are "safe"

Skipping folders like /sys during masking seems reasonable, but despite containing a large number of files, their overall size is small, so I believe this would not effectively increase performance. The time required for masking is directly proportional to the number of substitutions made and to the size of the files, as we check line by line.

@NikhilKakade-1
Contributor

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any issue created? If not, can we create one so that we won't miss it?

@pmoravec
Contributor Author

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any issue created? If not, can we create one so that we won't miss it?

This has been implemented already; see the --skip-cleaning-files option and #3520.

@TrevorBenson
Member

TrevorBenson commented Jul 18, 2024

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Is there any issue created? If not, can we create one so that we won't miss it?

This has been implemented already; see the --skip-cleaning-files option and #3520.

I don't believe these are the same topic.

Where my statement:

This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

was a response to bullet point 1 of this comment:

  1. Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?

My statement was intended to mean that I did not believe we should automatically clean up mapping files (i.e. remove regex mappings from the map file that had not been seen/used in a long time). I felt that even if tracking the temperature (recent usage of each obfuscation mapping) was added, the removal of a previous regex mapping should be left up to the user to enable when they want to remove old regex mappings.

If we have a consistent mapping as discussed in other threads, this is probably not a big deal, as recreating the obfuscation would result in an identical mapping anyway, as long as the salt was also stored and did not change.

@NikhilKakade-1
Contributor

Thanks @TrevorBenson, yep, I was referring to this part.

@pmoravec
Contributor Author

Ah I see, sorry for my confusion. Yes, it is worth raising a separate issue for the periodic cleanup of mapping files.

@TrevorBenson
Member

I went ahead and put together #3715
