6.13.16 scour attempts to lstat a file deleted via another thread, race condition #103

Open
akrherz opened this issue Oct 28, 2021 · 7 comments

@akrherz
Contributor

akrherz commented Oct 28, 2021

I am using LDM 6.13.16 on CentOS 8 Stream (64-bit). Since upgrading to this release, I sometimes get errors like the following from ldmadmin scour:

20211028T090217.262441Z scour[118662] scour.c:scourFilesAndDirs:291 ERROR lstat("/data/gempak/nexrad/NIDS/LVX/N3K/N3K_20211026_0045") failed: No such file or directory

Out of thousands of files deleted, I see only one or two errors reported on some days, and none on others. I know that you recently rewrote scour in C rather than Perl; perhaps there is a threading race condition in how files are deleted?

The /data path is NFS-mounted, so perhaps the trouble lies there. I verified that I am running only one scour process from cron, and this is my scour.conf:

/mesonet/data/gempak/model      1
/data/gempak/model              10
/data/gempak/nexrad             2
/data/gempak                    8
/data/rcm                       7
/data/text                      14

Thanks.

@mustbei
Contributor

mustbei commented Oct 29, 2021

Hi Daryl,

This does indeed look like a race condition. Your scour.conf has overlapping directory entries (lines 3 and 4 in this case). The scour program launches a thread for each line. Therefore, by the time one thread reads a file under one directory, that file may already have been seen and deleted by the other thread, leaving the first thread to report the ERROR above. Note that the ages for these two directory entries are 2 and 8. Therefore, the missing file under /data/gempak/nexrad must have been age 8 or older. One way of preventing this rare case is to lock the resource.
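The overlap described above can be checked mechanically. The following is a minimal sketch in Python, not part of LDM: the function name `find_overlaps` is hypothetical, and the entries are copied from the scour.conf posted in the report. It lists every entry whose directory sits beneath another entry's directory.

```python
from pathlib import PurePosixPath

# Entries copied from the scour.conf in the report: (directory, days).
ENTRIES = [
    ("/mesonet/data/gempak/model", 1),
    ("/data/gempak/model", 10),
    ("/data/gempak/nexrad", 2),
    ("/data/gempak", 8),
    ("/data/rcm", 7),
    ("/data/text", 14),
]


def find_overlaps(entries):
    """Return (ancestor, descendant) pairs where one entry contains another."""
    paths = [PurePosixPath(directory) for directory, _days in entries]
    overlaps = []
    for outer in paths:
        for inner in paths:
            # `parents` holds every ancestor directory of `inner`.
            if outer != inner and outer in inner.parents:
                overlaps.append((str(outer), str(inner)))
    return overlaps


for outer, inner in find_overlaps(ENTRIES):
    print(f"{inner} is also covered by {outer}")
```

With the posted configuration, this reports both /data/gempak/model and /data/gempak/nexrad as lying beneath /data/gempak, so line 2 overlaps line 4 in the same way line 3 does.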

Best regards,
--Mustapha

@akrherz
Contributor Author

akrherz commented Oct 29, 2021

Greetings, thanks for the response. The file is less than 8 days old, as you can see from the timestamp in the filename. So yes, the lstat was perhaps attempting to look up a file that had already been deleted by the other thread.
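The age argument can be worked through from the two timestamps in the log line. This is a small sketch under two assumptions: the filename stamp `20211026_0045` is UTC in `YYYYMMDD_HHMM` form, and it approximates the file's modification time. The helper name `age_in_days` is hypothetical.

```python
from datetime import datetime, timezone


def age_in_days(file_stamp, log_stamp):
    """Days between a YYYYMMDD_HHMM filename stamp and an LDM log timestamp."""
    created = datetime.strptime(file_stamp, "%Y%m%d_%H%M").replace(tzinfo=timezone.utc)
    logged = datetime.strptime(log_stamp, "%Y%m%dT%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return (logged - created).total_seconds() / 86400


# Stamps taken from the error line in the original report.
age = age_in_days("20211026_0045", "20211028T090217.262441Z")
print(f"{age:.2f} days")
```

Under those assumptions the file was roughly 2.35 days old when the error was logged: past the 2-day limit for /data/gempak/nexrad but well short of the 8-day limit for /data/gempak.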

@akrherz akrherz changed the title from "6.13.16 scour appears to attempt deleting files twice, rarely" to "6.13.16 scour attempts to lstat a file deleted via another thread, race condition" on Oct 29, 2021
@sebenste

I have seen this occur in the "old" way of scouring as well in LDM 6.13.10 and earlier, but it's now a moot point.

@akrherz
Contributor Author

akrherz commented Nov 18, 2021

Perhaps a command line switch could be offered to disable threaded scouring? Or maybe this particular error could be sent to a lower priority log level?

@mustbei
Contributor

mustbei commented Nov 18, 2021

The new scour program spawns as many threads as there are directory entries in scour.conf. Therefore, to make it single-threaded without a code change, it suffices to provide one directory entry at a time, which ensures non-concurrency. It is also possible to enforce sequential processing with a minor code change and a switch, if warranted. Setting this error to a lower-priority log level is also possible and requires only a minimal code change.
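The two options mentioned here, sequential processing and a lower log level for this error, can be illustrated together. This is a Python sketch of the idea only, not the scour.c implementation: `scour_tree`, `scour_sequentially`, and the logger name are hypothetical, and it assumes each entry removes regular files older than its age anywhere beneath its directory.

```python
import logging
import os
import time

log = logging.getLogger("scour_sketch")


def scour_tree(root, max_age_days, now=None):
    """Delete files under root older than max_age_days; return how many."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    os.unlink(path)
                    removed += 1
            except FileNotFoundError:
                # Something else removed the file first. That is the
                # desired end state, so log quietly instead of as an ERROR.
                log.debug("already gone: %s", path)
    return removed


def scour_sequentially(entries, now=None):
    """Process entries one at a time so no two passes run concurrently."""
    return sum(scour_tree(root, days, now) for root, days in entries)
```

Catching `FileNotFoundError` covers both the `lstat` and the `unlink` call, so the sketch stays quiet even if another process, rather than another thread, deletes the file in between.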

@semmerson
Collaborator

@akrherz Or one could modify their scour(1) configuration-file to avoid overlapping entries.

@akrherz
Contributor Author

akrherz commented Nov 18, 2021

> @akrherz Or one could modify their scour(1) configuration-file to avoid overlapping entries.

Agreed, but that is brittle, as I may add a new folder and forget to add a custom entry for it, and it is very annoying to have to add one entry for each sub-folder. Additionally, overlapping entries make total sense in my mind.

I have a blanket policy that anything in /data/gempak is at most 10 days old, and that anything in /data/gempak/nexrad is at most 2 days old.
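How overlapping entries combine can be sketched as follows. The assumption, not confirmed in this thread, is that each entry independently removes files older than its age anywhere beneath its directory, so the effective limit for a file is the smallest age among the entries that cover it. The name `effective_days` is hypothetical, and the values come from the scour.conf posted in the report.

```python
from pathlib import PurePosixPath

# Directory -> days, from the scour.conf posted in the report.
ENTRIES = {
    "/mesonet/data/gempak/model": 1,
    "/data/gempak/model": 10,
    "/data/gempak/nexrad": 2,
    "/data/gempak": 8,
    "/data/rcm": 7,
    "/data/text": 14,
}


def effective_days(file_path, entries):
    """Smallest age among entries whose directory contains file_path."""
    parents = PurePosixPath(file_path).parents
    ages = [days for directory, days in entries.items()
            if PurePosixPath(directory) in parents]
    return min(ages) if ages else None


print(effective_days("/data/gempak/nexrad/NIDS/LVX/N3K/N3K_20211026_0045", ENTRIES))
print(effective_days("/data/gempak/model/example_file", ENTRIES))
```

Under that assumption the nexrad file gets a 2-day limit, as intended. The second path is a made-up example; it gets 8 days rather than 10, because the broader /data/gempak entry would remove it first.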
