job_table() returning empty dataframe #517

Closed
Leimeroth opened this issue Nov 15, 2021 · 37 comments · Fixed by #519
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@Leimeroth (Member) commented Nov 15, 2021

After updating conda packages and pyiron versions, pr.job_table() returns an empty dataframe. I have absolutely no clue what causes this problem.
The database seems to be completely fine. I tested it using sqlite3's PRAGMA integrity_check, by connecting to my database backup and trying pr.job_table() (still an empty df), and by looking at the database in sqlitebrowser.
I also tried to downgrade sqlalchemy from 1.4.27 to 1.4.26 and checked that the connection in the Settings is correct.
If I create new jobs they get a correct id (starting with a number over 300000, not with 0) and the new jobs are shown correctly in the dataframe, but all jobs from before the update are missing.

@jan-janssen @pmrv mentioning you to get you notified. If you have any idea what causes this or how to fix it, please let me know; this really messes with my work.

@Leimeroth added the bug and help wanted labels Nov 15, 2021
@jan-janssen (Member)

Did you use the pyiron base version from the Git repository or from conda? @liamhuber Can this be related to the recent database changes?

@jan-janssen (Member)

In addition, if the new jobs appear but the old ones do not, can you post the database entries so we can compare how they differ? You can use https://sqlitebrowser.org to open the SQLite files in a graphical interface.

@liamhuber (Member)

I touched how the database access strings are parsed, so it's definitely possible this introduced a bug. There is almost certainly not a problem when using SQLite, as I've been running the new code on my local machine for quite some time without trouble. I think I've done stuff on the cluster, in which case Postgres is probably OK? I haven't done anything with MySQL though, so I couldn't say one way or another there.

@liamhuber (Member)

Aha @Leimeroth, I just remembered a big change: as before, we use XOR logic to control the configuration, but we changed the order of priority! It used to be user input XOR config file XOR environment variables. Now it's user > env > config. So if you have any pyiron env vars (except PYIRONCONFIG), you'll get those plus the defaults, ignoring your config file entirely. If you have your database defined in a config file, please check whether you have any PYIRON vars in os.environ that would screw you up.

@liamhuber (Member)

Ugh, phone spelling, sorry

@Leimeroth (Member, Author) commented Nov 15, 2021

Did you use the pyiron base version from the Git repository or from conda? @liamhuber Can this be related to the recent database changes?

Git versions

In addition, if the new jobs appear but the old ones do not, can you post the database entries so we can compare how they differ? You can use https://sqlitebrowser.org to open the SQLite files in a graphical interface.

For new jobs (those I see in pr.job_table() now) the projectpath value is NULL and the complete path is stored in the project column. For old jobs, projectpath is /home/pyiron/projects/ and project is the remaining part of the working directory path.
I don't know why this changed.

I touched how the database access strings are parsed, so it's definitely possible this introduced a bug. There is almost certainly not a problem when using SQLite, as I've been running the new code on my local machine for quite some time without trouble. I think I've done stuff on the cluster, in which case Postgres is probably OK? I haven't done anything with MySQL though, so I couldn't say one way or another there.

I am using sqlite

Aha @Leimeroth, I just remembered a big change: as before, we use XOR logic to control the configuration, but we changed the order of priority! It used to be user input XOR config file XOR environment variables. Now it's user > env > config. So if you have any pyiron env vars (except PYIRONCONFIG), you'll get those plus the defaults, ignoring your config file entirely. If you have your database defined in a config file, please check whether you have any PYIRON vars in os.environ that would screw you up.

I can't find any pyiron-related environment variable, and I don't know where it would come from. It is possible that I overlooked something, but checking with

import os

# look for any pyiron-related entry in the environment
for k, v in os.environ.items():
    if "pyiron" in k.lower() or "pyiron" in v.lower():
        print(k, v)

also didn't find anything besides my PYTHONPATH entries pointing to the git pyiron versions.
EDIT: Also Settings().config seems correct, which should show env variables I assume?

@Leimeroth (Member, Author)

I did some tests and I can get the full job_dict when running:
pr.db._job_dict(sql_query=None, user=None, project_path="", recursive=True)
but when I do
pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True)
I only get the new jobs.

In the _job_dict() method the "project" column is set to project_path, which I find confusing tbh.

if project_path == "./":
    project_path = ""
if recursive:
    dict_clause["project"] = str(project_path) + "%"
else:
    dict_clause["project"] = str(project_path)

But maybe this also confused someone else and is set to a different value somewhere?
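
To make the suspected mismatch concrete, here is a rough sketch of the string logic only (paths are made up, and it assumes pr.project_path now resolves to an absolute path, consistent with the new rows being returned):

old_project = "iron/defects/"                          # old row: relative path, root sits in projectpath
new_project = "/home/pyiron/projects/iron/defects/"    # new row: absolute path, projectpath is NULL

# what _job_dict builds for the recursive case
clause = str("/home/pyiron/projects/iron/defects/") + "%"

# the LIKE pattern only matches rows whose "project" column starts with the absolute path,
# so the old (relative) entries silently drop out of the job table
print(old_project.startswith(clause[:-1]))   # False
print(new_project.startswith(clause[:-1]))   # True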

@liamhuber (Member)

I am using sqlite

Also Settings().config seems correct, which should show env variables I assume?

OK, yes, if both of those are the case then you should be looking at the right database, so no worries there. And your snippet checking for env contamination looks good to me.

I did some tests and I can get the full job_dict when running:
pr.db._job_dict(sql_query=None, user=None, project_path="", recursive=True)
but when I do
pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True)
I only get the new jobs.

I still can't rule out that my changes caused this, but it sounds less likely. I haven't touched pr.db._job_dict (DatabaseAccess._job_dict), so as long as your SQL connection string still checks out I don't see my updates doing this.

Git versions

Do you know where your HEAD was before this? I think this bug might be nasty enough to warrant running a git bisection. Once we know the exact commit that breaks things we'll stand a chance, but right now the behaviour is extremely mysterious to me.

@pmrv (Contributor) commented Nov 16, 2021

Might be something I messed up in #336, but it's weird that it only comes up today.

@Leimeroth (Member, Author)

Git bisecting shows that #486 introduced the bug. I still think this could be some mix-up with project and project path, since all new entries have an empty project_path and the full path up to the working directory is stored under project instead.

@liamhuber (Member)

Git bisecting shows that #486 introduced the bug. I still think this could be some mix-up with project and project path, since all new entries have an empty project_path and the full path up to the working directory is stored under project instead.

Perfect, thanks!

Yeah, your description is reasonable. I'm totally stumped right now because when I look at the diff I don't see anything in that direction getting touched. But now that we know the PR the problem must be soluble! I'll keep looking at it today.

@liamhuber (Member)

@pmrv you mentioned reproducing the error when you use the staging database...do you invoke one of the switch_to or update methods after starting your pyiron session in order to access this other DB?

@liamhuber (Member)

I can use this pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True) trick (and the same call with project_path="") to reproduce the smaller/larger table on my system now as well.

Both older and newer jobs have 'projectpath': None, but the older ones have 'project': '/Users/.../notebooks/scratch/cluster_1024_hdf5/', while the newer ones don't have this job tail and look only like 'project': '/Users/.../notebooks/scratch/'

I also note that project_path="" gets me a global table from all sorts of folders on my machine, not just the directory path where I have my notebook.

@pmrv (Contributor) commented Nov 16, 2021

@pmrv you mentioned reproducing the error when you use the staging database...do you invoke one of the switch_to or update methods after starting your pyiron session in order to access this other DB?

I did it by updating the sql_host entry in the user dict of Settings.update().

@liamhuber (Member)

I did it by updating the sql_host entry in the user dict of Settings.update().

OK, then I really don't know what should be happening. Calling Settings.update should only update the settings and not reset the DatabaseManager, i.e. I would have expected your database to stay the old one. state.update() first updates the settings and then resets the database and queue adapter.

However, while this properly updates state.database.database, it is then out of sync with pr.db, which still points to the old (not disconnected) instance! I'm working on patching this in #519
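
A rough sketch of the stale-reference problem (the update call and its argument are taken from the comments above and should be read as pseudocode, not the verified API):

from pyiron_base import Project, state

pr = Project("scratch")                    # pr.db is bound to the current database instance
state.update({"sql_host": "staging-db"})   # resets state.database.database to point at the new host ...
# ... but pr.db still references the old, never-disconnected database object,
# so pr.job_table() keeps querying the previous database until pr is recreated.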

@Leimeroth (Member, Author)

Both older and newer jobs have 'projectpath': None, but the older ones have 'project': '/Users/.../notebooks/scratch/cluster_1024_hdf5/', while the newer ones don't have this job tail and look only like 'project': '/Users/.../notebooks/scratch/'

This confuses me, because ALL my old jobs have a project_path and their project is never an absolute path (i.e. it never starts with /). Whether the project entry ends with _hdf5 or not depends on whether the jobs are child jobs, but not at all on whether they are new or old.

@liamhuber (Member)

Whether the project entry ends with _hdf5 or not depends on whether the jobs are child jobs, but not at all on whether they are new or old.

Aha, great point! Yes, I think my 'cluster' job is a child, so that at least makes sense.

This confuses me, because ALL my old jobs have a project_path and their project is never an absolute path (i.e. it never starts with /).

This is still a mystery to me too

@liamhuber (Member)

The default value of Settings().configuration['project_check_enabled'] got switched from True to False. I don't see any direct influence on the project path values yet, but this is the best lead I have so far. I'll look into it more, but I'm in a meeting I need to listen in to, so I wanted to share the hint since I can't yet fully focus on this.

@liamhuber (Member)

The default value of Settings().configuration['project_check_enabled'] got switched from True to False. I don't see any direct influence on the project path values yet, but this is the best lead I have so far. I'll look into it more, but I'm in a meeting I need to listen in to, so I wanted to share the hint since I can't yet fully focus on this.

Yup, this then gets used in DatabaseManager.top_path... looking suspicious
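
My rough mental model of what this means (a sketch of the suspected logic, not the actual implementation):

def top_path(full_path, project_paths, project_check_enabled):
    # with the check disabled no root is found at all ...
    if not project_check_enabled:
        return None
    # ... otherwise the first configured project path containing full_path becomes the root
    for path in project_paths:
        if full_path.startswith(path):
            return path
    return None

# the returned root goes into the "projectpath" column and the remainder into "project";
# with None, the whole absolute path ends up in "project", which is exactly what the new rows show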

@Leimeroth (Member, Author)

I added PROJECT_CHECK_ENABLED=True to my .pyiron file and now it works again, so I guess this causes the problem
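
For reference, the relevant part of the .pyiron file then looks roughly like this (all other entries omitted, the path is a placeholder):

[DEFAULT]
PROJECT_CHECK_ENABLED = True
PROJECT_PATHS = /home/pyiron/projects/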

@liamhuber (Member)

I added PROJECT_CHECK_ENABLED=True to my .pyiron file and now it works again, so I guess this causes the problem

Man, that is really bizarre. I did the same (well, actually I reverted the default as seen over in #519), but my job table still shows None for the projectpath.

At least this relieves some time pressure in solving this now that it's working for you again...

@niklassiemer (Member)

We really need more simple tests. I saw @liamhuber added some tests:

self.s.configuration["project_check_enabled"] = False
self.assertIs(self.dbm.top_path(self.project_path + "/test"), None)
self.s.configuration["project_check_enabled"] = True
self.s.configuration["project_paths"] = [self.s.convert_path_to_abs_posix(getcwd())]
self.s.configuration["disable_database"] = False # Otherwise has the chance to override project_check_enabled..
self.assertTrue(self.dbm.top_path(self.project_path + "/test") in self.project_path)

However, they check whether the PROJECT_PATH is correctly added to these paths or not. I would like to have a test checking for the correct split of the whole path into root_path and project_path (in terms of the pr attribute names, not the ones in the db), roughly along the lines of the sketch below. Actually, whenever we find a bug we should probably also write a test that makes our CI fail 😃
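
Such a test could look roughly like this (just a sketch; it assumes root_path and project_path recombine to the full project path and skips cleanup of the created directory):

import os
from unittest import TestCase

from pyiron_base import Project

class TestPathSplit(TestCase):
    def test_root_and_project_path(self):
        full_path = os.path.join(os.getcwd(), "path_split_test") + "/"
        pr = Project(full_path)
        # the root (stored as "projectpath" in the db) plus the relative part
        # (the db "project" column) should reconstruct the full path again
        self.assertEqual(os.path.join(pr.root_path, pr.project_path), full_path)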

@liamhuber (Member)

I would like to have a test checking for the correct split of the whole path into root_path and project_path (in terms of the pr attribute names, not the ones in the db). Actually, whenever we find a bug we should probably also write a test that makes our CI fail

💯 Does anyone want to do this? I won't have time to look at it this week. Since I revert the default behaviour over in #519, and since it's now working for @Leimeroth I guess we can close? I am also amenable to keeping it open until someone adds a test for the project attributes.

@niklassiemer linked a pull request Nov 16, 2021 that will close this issue
@niklassiemer (Member)

Tests in #520 😄

@pmrv (Contributor) commented Nov 18, 2021

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

@liamhuber (Member)

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

Fix is sitting in #519

@pmrv (Contributor) commented Nov 18, 2021

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

Fix is sitting in #519

Great! Testing the robot now with a merge of #519 and #518 on ~600 jobs, so we get an idea of how long it will take.

@pmrv (Contributor) commented Nov 18, 2021

It's still running, but the short news is that it will take forever to convert all files in the current way. I'm currently waiting for more than half an hour on the rewrite of a single ~100MB file.

I did take the opportunity to run strace on the robot while it rewrites, and it shows that our current approach of opening and closing the HDF file for every access is killing performance. Here's a profile of the system calls issued during the rewrite:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.36   43.468292           1  44903039           pread64
 16.21   10.159870           5   2083523           openat
  5.48    3.433824           2   2083524           close
  5.21    3.265076           2   2083523           lstat
  1.88    1.178396           1   2083523           flock
  1.63    1.021626           0   2088343           fstat
  0.12    0.076456           2     45517           pwrite64
  0.05    0.031685           1     21203           stat
  0.02    0.010530           1      7485           munmap
  0.01    0.007523          13       599        33 futex
  0.01    0.005814           1      7485           mmap
  0.01    0.005560           8       722           ftruncate
  0.00    0.003133           1      4820           read
  0.00    0.000006           2         3           brk
------ ----------- ----------- --------- --------- ----------------
100.00   62.667791              55413309        33 total

i.e. a good 20% of the runtime is spent opening and closing files! Notice also the disproportionate amount of reading vs. writing, which suggests we do a lot of auxiliary reading before actually fetching and writing data.
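
To illustrate what those openat/close counts mean in practice (h5py is used for illustration only; pyiron goes through h5io, so this is not the actual code path):

import h5py

def read_items_reopen(fname, keys):
    # the pattern the trace points at: one open/close pair per accessed item
    out = []
    for k in keys:
        with h5py.File(fname, "r") as f:
            out.append(f[k][()])
    return out

def read_items_keep_open(fname, keys):
    # keeping the file open for the whole rewrite avoids most of that overhead
    with h5py.File(fname, "r") as f:
        return [f[k][()] for k in keys]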

@niklassiemer (Member)

Well, the imbalance between read and write is kind of expected with the awful 'read array and then array(to_list)' before we actually have the data... However, the difference in the number of calls is also big...

@pmrv (Contributor) commented Nov 18, 2021

Well, the imbalance between read and write is kind of expected with the awful 'read array and then array(to_list)' before we actually have the data... However, the difference in the number of calls is also big...

The read/write system calls are purely for reading from the file, so I'd be surprised, but not knowing how h5io works, it's not so clear to me.

Anyway, I've updated the robot to

  1. track how many files it has processed so far
  2. check whether the file actually needs to be rewritten

Once I've converted all of those 600 jobs (which tqdm estimates at ~1h), this should be good to go from my side. Still, it would be nice if someone could test the script on some of their data before we unleash it onto the cluster.

@pmrv (Contributor) commented Nov 18, 2021

Update: It took around 1h 30min for 1.3GB of data combined. I didn't see any errors, so I think it's good to go, but will try to move to more data now and see if anything happens.

@niklassiemer (Member)

Thanks for testing this rigorously! I just do not have data to test 😆

@niklassiemer (Member)

And if we patch the robot to be faster in an upcoming release, this is fine from my perspective.

@pmrv (Contributor) commented Nov 18, 2021

Currently testing on my full data set and patching small errors that come up, will update before I leave the office.

@niklassiemer (Member)

OK, please just update #518 with these patches and merge. Then I will make a new release this evening, also lifting the h5io restrictions on conda 🎉

@pmrv (Contributor) commented Nov 18, 2021

I pushed it accidentally to #525, but since it includes #518 I suppose this will still work for you.

My extended testing shows two failure modes that I have now addressed in the script:

  1. Sometimes a job is in the database, but the HDF5 file does not exist anymore. I simply skip those, but it might be nice to have a utility that cleans up this kind of "dangling" job.
  2. A large number of jobs fail the rewrite with an error like could not broadcast input array from shape (4,3) into shape (4,). These are jobs that are definitely affected and should be rewritten, but I don't understand the issue right now. It seems like we have to wait another day. EDIT: Actually all of these jobs seem to be Murnaghan child jobs; it might be that their weird HDF5 layout gives the rewrite trouble. If you have time to look at it, it might make sense to run a Murnaghan and try the robot on it.

Still, there was a large number of unaffected jobs in my ~100GB of data, so we can actually be a bit optimistic about doing the conversion overnight.

@niklassiemer (Member)

Yes, I will cherry-pick them over to the other branch and get everything merged and released!
