[#220] avoid resource leaking in scanners #236

Draft
wants to merge 2 commits into base: master

Conversation

YuriRomanowski
Contributor

Description

[#220] Avoid resource leaking in scanners

Problem: Currently we use lazy readFile, and xrefcheck opens all
the files for reading immediately, while file descriptors are a
finite resource.

Solution: Replace lazy readFile with the strict version.

To be done: understand the issues with parallelism.
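For context, a minimal sketch of the distinction (an illustrative helper, not the actual xrefcheck code):

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- With the lazy Prelude `readFile`, the handle stays open until the returned
-- String is fully forced, so scanning many files at once can exhaust the
-- process's file descriptor limit. A strict read consumes the whole file and
-- closes the descriptor before returning (handling of invalid UTF-8 omitted):
readFileStrict :: FilePath -> IO T.Text
readFileStrict = fmap T.decodeUtf8 . BS.readFile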

Related issue(s)

Fixes #220

✅ Checklist for your Pull Request

Ideally a PR has all of the checkmarks set.

If something in this list is irrelevant to your PR, you should still set this
checkmark indicating that you are sure it is dealt with (be that by irrelevance).

Related changes (conditional)

  • Tests

    • If I added new functionality, I added tests covering it.
    • If I fixed a bug, I added a regression test to prevent the bug from
      silently reappearing again.
  • Documentation

    • I checked whether I should update the docs and did so if necessary:
  • Public contracts

    • Any modifications of public contracts comply with the Evolution
      of Public Contracts
      policy.
    • I added an entry to the changelog if my changes are visible to the users
      and
    • provided a migration guide for breaking changes if possible

Stylistic guide (mandatory)

✓ Release Checklist

  • I updated the version number in package.yaml.
  • I updated the changelog and moved everything
    under the "Unreleased" section to a new section for this release version.
  • (After merging) I edited the auto-release.
    • Change the tag and title using the format vX.Y.Z.
    • Write a summary of all user-facing changes.
    • Deselect the "This is a pre-release" checkbox at the bottom.
  • (After merging) I updated xrefcheck-action.
  • (After merging) I uploaded the package to hackage.

Problem: Currently we use lazy `readFile`, and xrefcheck opens all
the files for reading immediately, while file descriptors are a
finite resource.

Solution: Replace lazy `readFile` with the strict version.
@YuriRomanowski
Contributor Author

When I tried to parallelize file parsing, I got some strange issues. I ran the program on a repository with ~100 md files, with 8 cores available. Here is the timing info:

start scan repo; time: 0.00s
end scan repo; time: 0.47s
start verify repo; time: 0.47s
end verify repo; time: 1.88s

And here is the threadscope profile:
[screenshot]

We see that CPU activity is not very high, although there is work to do. Something prevents the threads from working normally... This is what can be found in the eventlog for that time:
[screenshot]

There are a lot of foreign calls, although there seems to be nothing in my code that could result in such behaviour. I tried to run our program under the strace utility, and it showed this:
[screenshot]
(The Russian caption means "Resource temporarily unavailable".)

Googling this turns up one Stack Overflow entry, but I didn't understand much there 😄
click
Any ideas why this can happen? Maybe my code is written in the wrong way?

@YuriRomanowski changed the title from "Yuri romanowski/#220 avoid resource leaking in scanners" to "[#220] avoid resource leaking in scanners" on Dec 12, 2022
@Martoon-00
Member

Woow, elaborate investigation 👍

However, what exactly do you think is wrong?

I'll lay out my thoughts defending the idea that everything is fine, and you can show where I'm wrong.

In the threadscope report, this looks like what I would expect: low CPU consumption during the scanning stage, because it is computation interleaved with I/O (and I/O is not shown on the chart).

It may be reasonable that I/O takes more time than parsing, since the former is reading from the disk, and the latter is reading from memory plus some logic written in C that cmark-gfm uses internally and which thus can be fast. (Btw, are you on an SSD?)

On what you see in the eventlog: by default, foreign calls cannot be interrupted; at that moment the runtime system gives up on trying to manage anything until the foreign call is finished. Maybe that's why we see those messages, but I think they are fine; they may just mean that control is passed to the C logic while everything is fine with the threads (I'm not sure here, though; it may be worth googling the exact meaning of that eventlog message).

On the strace output: as far as I understand, EAGAIN in our case likely means "the pipe is still empty, the read syscall cannot return anything, please try again later", and this matches the explanation above that the CPU is ready to read input at a higher rate than the device can provide.

If my understanding is correct, most of the witnessed things will still be there even if we try parsing with no parallelization, as before.


If you had some intuition that told you "something is wrong here" or "things are too slow" that pushed you into this investigation, could you share it too?

@YuriRomanowski
Contributor Author

YuriRomanowski commented Dec 12, 2022

In the threadscope report, this looks like what I would expect: low CPU consumption during the scanning stage, because it is computation interleaved with I/O (and I/O is not shown on the chart).

Well, I also tried to write the code so that all of the files were read at once and then only pure computations were performed in parallel, but the results were the same. I can show the same graphs for this case later.

If you had some intuition that told you "something is wrong here" or "things are too slow" that pushed you into this investigation, could you share it too?

In my mind, when we have a big list of pure tasks, they can be parallelized very effectively, using ~100% load on all available CPUs. But here, although there are several CPUs, the total activity is not higher (on average) than a single CPU at 100% load (I'm looking at the dashed lines on the activity graph), and this is not really satisfying. When I tried to read all the files first and then parse them in parallel, the reading was performed by one thread that took 100% of a CPU, but the total activity was lower for the subsequent parsing stage, even with 8 CPUs. You can even see how the activity grows after 0.47 seconds (the start of repo verification). Also, I don't like the long run of errors in the system call log; it looks like busy waiting, something that good programs usually (in my mind) don't want to do. But the GHC runtime is written by highly skilled people, so this may indicate that we are using their creation in the wrong way. So I believe there is a problem we could solve, although I don't quite understand what it actually is.

Well, I also forgot that our raw md parsing is performed by a C library... Maybe that can influence this, too.
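To make the setup concrete, here is a minimal sketch (illustrative names, not the actual xrefcheck code) of the "read all files strictly first, then parse in parallel with sparks" variant I described above:

import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Data.Text (Text)
import qualified Data.Text.IO as T

-- Read every file strictly up front, then spark the pure parsing step for
-- each file. `parse` and the result type are placeholders.
scanAll :: NFData a => (Text -> a) -> [FilePath] -> IO [a]
scanAll parse paths = do
  contents <- traverse T.readFile paths               -- strict, sequential I/O
  pure (map parse contents `using` parList rdeepseq)  -- one spark per file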

@Martoon-00
Member

Also, I don't like the long run of errors in the system call log; it looks like busy waiting

Yeah, that's true. I'm also not sure why it is done this way.

Well, I also tried to write the code so that all of the files were read at once and then only pure computations were performed in parallel

Oh, that's a very good setup for testing; then we have to investigate what happens there. I'll try to think about it, and if you have more info to share, it is welcome.

@aeqz
Contributor

aeqz commented Dec 13, 2022

I'm not sure if this will reveal anything, but could you try the following in order to check if the parallel performance looks more like what you expected?

Derive NFData for ScanResult and then, in the scanRepo function, replace

return . ScanResult errs $ RepoInfo
  { riFiles = M.fromList $ processedFiles <> notProcessedFiles
  , riDirectories = M.fromList
      $ map (, TrackedDirectory) trackedDirs
      <> map (, UntrackedDirectory) untrackedDirs
  , riRoot
  }

by

let result = ScanResult errs $ RepoInfo
      { riFiles = M.fromList $ processedFiles <> notProcessedFiles
      , riDirectories = M.fromList
                        $ map (, TrackedDirectory) trackedDirs
                        <> map (, UntrackedDirectory) untrackedDirs
      , riRoot
      }

evaluateNF result
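For reference, evaluateNF forces the value to normal form inside IO before returning it, so the deferred parsing work completes within scanRepo instead of at later use sites. Roughly (a sketch, assuming a relude-style definition):

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Approximation of evaluateNF: fully force the value before returning it.
evaluateNF' :: NFData a => a -> IO a
evaluateNF' = evaluate . force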

@@ -118,7 +119,8 @@ data ScanErrorDescription
   | FileErr
   | ParagraphErr Text
   | UnrecognisedErr Text
-  deriving stock (Show, Eq)
+  deriving stock (Show, Eq, Generic)
+  deriving anyclass (NFData)
Contributor

@aeqz Dec 13, 2022


You can derive NFData as in the Xrefcheck.Core module if you prefer to follow that style:

instance NFData Position
instance NFData Reference
instance NFData AnchorType
instance NFData Anchor
instance NFData FileInfo
instance NFData RepoInfo
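These bare instances pick up deepseq's Generic-based default for rnf, so each of those types also needs a derived Generic instance. A minimal sketch with a made-up type:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DerivingStrategies #-}

import Control.DeepSeq (NFData)
import GHC.Generics (Generic)

-- Hypothetical type for illustration: the bare `instance NFData SomePos`
-- compiles because deepseq's default rnf implementation goes through Generic.
data SomePos = SomePos { spLine :: Int, spColumn :: Int }
  deriving stock (Show, Eq, Generic)

instance NFData SomePos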

@YuriRomanowski
Contributor Author

I tried to force evaluation of the CMarkGFM part first, and it looks like this. The highlighted area corresponds to parsing with CMarkGFM, and then sparks are created to do the work in pure Haskell. In the CMarkGFM area there are a lot of foreign calls; it seems these are just calls to the C library :) It's a good question whether we can parallelize foreign calls; in some circumstances we should be able to (see the sketch below). Or we can stop here and just run the parallel computations interleaved with foreign calls. Anyway, we do get some speedup, but not a big one.
[screenshot]
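For reference, the relevant GHC distinction is between safe and unsafe foreign imports: only safe calls release the capability, so other Haskell threads can keep running while the C code works. A minimal sketch with a made-up C symbol (not the actual cmark-gfm binding):

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr)

-- Hypothetical C parser symbol, purely for illustration: the `safe` import
-- releases the capability for the duration of the call, while the `unsafe`
-- import blocks the capability until the C function returns.
foreign import ccall safe "my_parse"
  cParseSafe :: CString -> CSize -> IO (Ptr ())

foreign import ccall unsafe "my_parse"
  cParseUnsafe :: CString -> CSize -> IO (Ptr ())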

@Martoon-00
Member

Interesting. I guess foreign calls should be parallelizable...

So from the recent picture we can tell that if you don't use sparks for commonmarkToNode but use them for all the following parsing, then we see the expected high usage of CPU cores for that "following parsing".

I'd really like to see the threadscope report from this case:

Well, I also tried to write the code so that all of the files were read at once and then only pure computations were performed in parallel, but the results were the same. I can show the same graphs for this case later.

@Martoon-00
Member

Although the 0.07 seconds that this unparallelized call to cmark-gfm currently takes is really not that much, I agree we could stop optimizing here if we get stuck on it.

@YuriRomanowski
Contributor Author

Although the 0.07 seconds that this unparallelized call to cmark-gfm currently takes is really not that much, I agree we could stop optimizing here if we get stuck on it.

I'm sorry for confusing you, but the highlighted period is not the full period of CMark parsing, just an example.

Well, I also tried to write the code so that all of the files were read at once and then only pure computations were performed in parallel, but the results were the same. I can show the same graphs for this case later.

Yep, later I'll show more accurate examples for different cases.

@Martoon-00
Member

but the highlighted period is not the full period of CMark parsing, just an example

Oh I see, okay. Then we should try hard I suppose 😸


That's just a wild guess, but as one of the measures I would try replacing sparks with IO threading; maybe sparks really do have problems with running foreign calls in parallel.

Mm, to keep this organized, I'd also suggest preserving all your attempts in separate branches, maybe called yuri/#220-experiment-xxx, and uploading your reports there as separate commits too. This way someone else can play with the approaches you tried.
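A minimal sketch of what the IO-threading variant could look like (assuming the async package; names are illustrative, not a concrete proposal for this PR): each file is read and parsed on its own IO thread, with the result forced to normal form inside that thread.

import Control.Concurrent.Async (mapConcurrently)
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Text (Text)
import qualified Data.Text.IO as T

-- Read and parse each file on its own IO thread instead of relying on sparks;
-- forcing to normal form keeps the work inside the worker thread. Note the
-- concurrency here is unbounded, so a real implementation would likely want
-- to limit the number of simultaneously open files.
scanAllThreaded :: NFData a => (Text -> a) -> [FilePath] -> IO [a]
scanAllThreaded parse paths =
  mapConcurrently (\path -> evaluate . force . parse =<< T.readFile path) paths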

@Martoon-00
Member

On the current state of this PR:

  • We still need to understand whether we can get higher CPU consumption.
  • In Avoid resource leaking in scanners #220 I'm asking to fully document the behaviour of ScanAction. If I write my own scanner, what should I do to make it strict on reading and then run its parsing in parallel with parsing other files? What should I do to read + parse on the fly?

@YuriRomanowski I'm not sure whether you will manage to work on this PR during the remaining time, but that's fine; we can transfer this PR to someone else.

In case you manage to allocate time for this, I suggest asking in the Haskell channel in Slack; maybe someone will manage to explain what we observe. If not, let's accept that we did quite a lot to make it performant and stop with what we have gained.
