job_table() returning empty dataframe #517

Closed
Leimeroth opened this issue Nov 15, 2021 · 37 comments · Fixed by #519
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@Leimeroth (Member) commented Nov 15, 2021

After updating conda packages and pyiron versions, pr.job_table() returns an empty dataframe. I have absolutely no clue what causes this problem.
The database seems to be completely fine. I tested it using sqlite3's PRAGMA integrity_check, by connecting to my database backup and trying pr.job_table() (still an empty df), and by looking at the database in sqlitebrowser.
I also tried to downgrade sqlalchemy from 1.4.27 to 1.4.26 and checked that the connection in the Settings is correct.
If I create new jobs they get a correct id (starting with a number over 300000, not with 0) and the new jobs are shown correctly in the dataframe, but all jobs from before the update are missing.

@jan-janssen @pmrv mentioning you to get you notified. If you have any idea what causes this or how to fix it, please let me know; this really messes with my work.

@Leimeroth added the bug and help wanted labels Nov 15, 2021
@jan-janssen (Member)

Did you use the pyiron base version from the Git repository or from conda? @liamhuber Can this be related to the recent database changes?

@jan-janssen (Member)

In addition, if the new jobs appear but the old ones do not, can you post the database entries so we can compare how they differ? You can use https://sqlitebrowser.org to open the SQLite files in a graphical interface.

@liamhuber (Member)

I touched how the database access strings are parsed, so it's definitely possible this introduced a bug. There is almost certainly not a problem when using SQLite, as I've been running the new code on my local machine for quite some time without trouble. I think I've done stuff on the cluster, in which case Postgres is probably OK? I haven't done anything with MySQL though, so I couldn't say one way or another there.

@liamhuber (Member)

Aha @Leimeroth, I just remembered a big change: as before, we use XOR logic to control the configuration, but we changed the order of priority! It used to be user input XOR config file XOR environment variables. Now it's user > env > config. So if you have any pyiron env vars (except PYIRONCONFIG), you'll get those plus the defaults, ignoring your config file entirely. If you have your database defined in a config file, please check whether you have any PYIRON vars in os.environ that would screw you up.

@liamhuber (Member)

Ugh, phone spelling, sorry

@Leimeroth (Member, Author) commented Nov 15, 2021

Did you use the pyiron base version from the Git repository or from conda? @liamhuber Can this be related to the recent database changes?

Git versions

In addition, if the new jobs appear but the old ones do not, can you post the database entries so we can compare how they differ? You can use https://sqlitebrowser.org to open the SQLite files in a graphical interface.

For new jobs (those I see in pr.job_table() now) the projectpath value is NULL and the complete path is stored in the project column. For old jobs, projectpath is /home/pyiron/projects/ and project is the remaining part of the working directory path.
I don't know why this changed.

I touched how the database access strings are parsed, so it's definitely possible this introduced a bug. There is almost certainly not a problem when using SQLite, as I've been running the new code on my local machine for quite some time without trouble. I think I've done stuff on the cluster, in which case Postgres is probably OK? I haven't done anything with MySQL though, so I couldn't say one way or another there.

I am using sqlite

Aha @Leimeroth, I just remembered a big change: as before, we use XOR logic to control the configuration, but we changed the order of priority! It used to be user input XOR config file XOR environment variables. Now it's user > env > config. So if you have any pyiron env vars (except PYIRONCONFIG), you'll get those plus the defaults, ignoring your config file entirely. If you have your database defined in a config file, please check whether you have any PYIRON vars in os.environ that would screw you up.

I can't find any pyiron-related environment variable, and I don't know where it would come from. It is possible that I overlooked something, but checking with

import os

# look for any pyiron-related entry in the environment
for k, v in os.environ.items():
    if "pyiron" in k.lower() or "pyiron" in v.lower():
        print(k, v)

also didn't find anything besides my PYTHONPATH entries pointing to the git pyiron versions.
EDIT: Also Settings().config seems correct, which should show env variables I assume?

@Leimeroth (Member, Author)

I did some tests and I can get the full job_dict when running:
pr.db._job_dict(sql_query=None, user=None, project_path="", recursive=True)
but when I do
pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True)
I only get the new jobs.

In the _job_dict() method the "project" column is set to project_path, which I find confusing tbh.

if project_path == "./":
    project_path = ""
if recursive:
    dict_clause["project"] = str(project_path) + "%"
else:
    dict_clause["project"] = str(project_path)

But maybe this also confused someone else and is set to a different value somewhere?
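
To make the suspected mismatch concrete, here is a rough sketch of the string logic only (paths are made up, and it assumes pr.project_path now resolves to an absolute path, consistent with the new rows being returned):

old_project = "iron/defects/"                          # old row: relative path, root sits in projectpath
new_project = "/home/pyiron/projects/iron/defects/"    # new row: absolute path, projectpath is NULL

# what _job_dict builds for the recursive case
clause = str("/home/pyiron/projects/iron/defects/") + "%"

# the LIKE pattern only matches rows whose "project" column starts with the absolute path,
# so the old (relative) entries silently drop out of the job table
print(old_project.startswith(clause[:-1]))   # False
print(new_project.startswith(clause[:-1]))   # True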

@liamhuber (Member)

I am using sqlite

Also Settings().config seems correct, which should show env variables I assume?

OK, yes, if both of those are the case then you should be looking at the right database, so no worries there. And your snippet checking for env contamination looks good to me.

I did some tests and I can get the full job_dict when running:
pr.db._job_dict(sql_query=None, user=None, project_path="", recursive=True)
but when I do
pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True)
I only get the new jobs.

I still can't rule out that my changes caused this, but it sounds less likely. I haven't touched pr.db._job_dict (DatabaseAccess._job_dict), so as long as your SQL connection string still checks out I don't see my updates doing this.

Git versions

Do you know where your HEAD was before this? I think this bug might be nasty enough to warrant running a git bisection. Once we know the exact commit that breaks things we'll stand a chance, but right now the behaviour is extremely mysterious to me.

@pmrv (Contributor) commented Nov 16, 2021

Might be something I messed up in #336, but it's weird that it only comes up today.

@Leimeroth (Member, Author)

Git bisecting shows that #486 introduced the bug. I still think this could be some mix-up with project and project path, since all new entries have an empty project_path and the full path up to the working directory is stored under project instead.

@liamhuber (Member)

Git bisecting shows that #486 introduced the bug. I still think this could be some mix-up with project and project path, since all new entries have an empty project_path and the full path up to the working directory is stored under project instead.

Perfect, thanks!

Yeah, your description is reasonable. I'm totally stumped right now because when I look at the diff I don't see anything in that direction getting touched. But now that we know the PR the problem must be soluble! I'll keep looking at it today.

@liamhuber (Member)

@pmrv you mentioned reproducing the error when you use the staging database...do you invoke one of the switch_to or update methods after starting your pyiron session in order to access this other DB?

@liamhuber (Member)

I can use this pr.db._job_dict(sql_query=None, user=None, project_path=pr.project_path, recursive=True) trick (and the same call with project_path="") to reproduce the smaller/larger table on my system now as well.

Both older and newer jobs have 'projectpath': None, but the older ones have 'project': '/Users/.../notebooks/scratch/cluster_1024_hdf5/', while the newer ones don't have this job tail and look only like 'project': '/Users/.../notebooks/scratch/'

I also note that project_path="" gets me a global table from all sorts of folders on my machine, not just the directory path where I have my notebook.

@pmrv (Contributor) commented Nov 16, 2021

@pmrv you mentioned reproducing the error when you use the staging database...do you invoke one of the switch_to or update methods after starting your pyiron session in order to access this other DB?

I did it by updating the sql_host entry in the user dict of Settings.update().

@liamhuber (Member)

I did it by updating the sql_host entry in the user dict of Settings.update().

OK, then I really don't know what should be happening. Calling Settings.update should only update the settings and not reset the DatabaseManager, i.e. I would have expected your database to stay the old one. state.update() first updates the settings and then resets the database and queue adapter.

However, while this properly updates state.database.database, it is then out of sync with pr.db, which still points to the old (not disconnected) instance! I'm working on patching this in #519
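
A rough sketch of the stale-reference problem (the update call and its argument are taken from the comments above and should be read as pseudocode, not the verified API):

from pyiron_base import Project, state

pr = Project("scratch")                    # pr.db is bound to the current database instance
state.update({"sql_host": "staging-db"})   # resets state.database.database to point at the new host ...
# ... but pr.db still references the old, never-disconnected database object,
# so pr.job_table() keeps querying the previous database until pr is recreated.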

@Leimeroth (Member, Author)

Both older and newer jobs have 'projectpath': None, but the older ones have 'project': '/Users/.../notebooks/scratch/cluster_1024_hdf5/', while the newer ones don't have this job tail and look only like 'project': '/Users/.../notebooks/scratch/'

This confuses me, because ALL my old jobs have a project_path and their project is never an absolute path (i.e. it never starts with /). Whether the project entry ends with _hdf5 or not depends on whether the jobs are child jobs, but not at all on whether they are new or old.

@liamhuber (Member)

Whether the project entry ends with _hdf5 or not depends on whether the jobs are child jobs, but not at all on whether they are new or old.

Aha, great point! Yes, I think my 'cluster' job is a child, so that at least makes sense.

This confuses me, because ALL my old jobs have a project_path and their project is never an absolute path (i.e. it never starts with /).

This is still a mystery to me too

@liamhuber (Member)

The default value of Settings().configuration['project_check_enabled'] got switched from True to False. I don't see any direct influence on the project path values yet, but this is the best lead I have so far. I'll look into it more, but I'm in a meeting I need to listen in to, so I wanted to share the hint since I can't yet fully focus on this.

@liamhuber (Member)

The default value of Settings().configuration['project_check_enabled'] got switched from True to False. I don't see any direct influence on the project path values yet, but this is the best lead I have so far. I'll look into it more, but I'm in a meeting I need to listen in to, so I wanted to share the hint since I can't yet fully focus on this.

Yup, this then gets used in DatabaseManager.top_path... looking suspicious
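
My rough mental model of what this means (a sketch of the suspected logic, not the actual implementation):

def top_path(full_path, project_paths, project_check_enabled):
    # with the check disabled no root is found at all ...
    if not project_check_enabled:
        return None
    # ... otherwise the first configured project path containing full_path becomes the root
    for path in project_paths:
        if full_path.startswith(path):
            return path
    return None

# the returned root goes into the "projectpath" column and the remainder into "project";
# with None, the whole absolute path ends up in "project", which is exactly what the new rows show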

@Leimeroth (Member, Author)

I added PROJECT_CHECK_ENABLED=True to my .pyiron file and now it works again, so I guess this causes the problem
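
For reference, the relevant part of the .pyiron file then looks roughly like this (all other entries omitted, the path is a placeholder):

[DEFAULT]
PROJECT_CHECK_ENABLED = True
PROJECT_PATHS = /home/pyiron/projects/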

@liamhuber (Member)

I added PROJECT_CHECK_ENABLED=True to my .pyiron file and now it works again, so I guess this causes the problem

Man, that is really bizarre. I did the same (well, actually I reverted the default as seen over in #519), but my job table still shows None for the projectpath.

At least this relieves some time pressure in solving this now that it's working for you again...

@niklassiemer (Member)

We really need more simple tests. I saw @liamhuber added some tests:

self.s.configuration["project_check_enabled"] = False
self.assertIs(self.dbm.top_path(self.project_path + "/test"), None)
self.s.configuration["project_check_enabled"] = True
self.s.configuration["project_paths"] = [self.s.convert_path_to_abs_posix(getcwd())]
self.s.configuration["disable_database"] = False # Otherwise has the chance to override project_check_enabled..
self.assertTrue(self.dbm.top_path(self.project_path + "/test") in self.project_path)

However, they check whether the PROJECT_PATH is correctly added to these paths or not. I would like to have a test checking for the correct split of the whole path into root_path and project_path (in terms of the pr attribute names, not the ones in the db), roughly along the lines of the sketch below. Actually, whenever we find a bug we should probably also write a test that makes our CI fail 😃
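
Such a test could look roughly like this (just a sketch; it assumes root_path and project_path recombine to the full project path and skips cleanup of the created directory):

import os
from unittest import TestCase

from pyiron_base import Project

class TestPathSplit(TestCase):
    def test_root_and_project_path(self):
        full_path = os.path.join(os.getcwd(), "path_split_test") + "/"
        pr = Project(full_path)
        # the root (stored as "projectpath" in the db) plus the relative part
        # (the db "project" column) should reconstruct the full path again
        self.assertEqual(os.path.join(pr.root_path, pr.project_path), full_path)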

@liamhuber (Member)

I would like to have a test checking for the correct split of the whole path into root_path and project_path (in terms of the pr attribute names, not the ones in the db). Actually, whenever we find a bug we should probably also write a test that makes our CI fail

💯 Does anyone want to do this? I won't have time to look at it this week. Since I revert the default behaviour over in #519, and since it's now working for @Leimeroth I guess we can close? I am also amenable to keeping it open until someone adds a test for the project attributes.

@niklassiemer linked a pull request Nov 16, 2021 that will close this issue
@niklassiemer (Member)

Tests in #520 😄

@pmrv (Contributor) commented Nov 18, 2021

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

@liamhuber (Member)

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

Fix is sitting in #519

@pmrv (Contributor) commented Nov 18, 2021

This issue is also present now on the cluster, but setting PROJECT_CHECK_ENABLED in the config does fix it. I suppose we'll need a code fix for this though, before we push the next update.

Fix is sitting in #519

Great! Testing the robot now with a merge of #519 and #518 on ~600 jobs, so we get an idea of how long it will take.

@pmrv (Contributor) commented Nov 18, 2021

It's still running, but the short news is that it will take forever to convert all files in the current way. I'm currently waiting for more than half an hour on the rewrite of a single ~100MB file.

I did take the opportunity to run strace on the robot while it rewrites, and it shows that our current approach of opening and closing the HDF file for every access is killing performance. Here's a profile of the system calls issued during the rewrite:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.36   43.468292           1  44903039           pread64
 16.21   10.159870           5   2083523           openat
  5.48    3.433824           2   2083524           close
  5.21    3.265076           2   2083523           lstat
  1.88    1.178396           1   2083523           flock
  1.63    1.021626           0   2088343           fstat
  0.12    0.076456           2     45517           pwrite64
  0.05    0.031685           1     21203           stat
  0.02    0.010530           1      7485           munmap
  0.01    0.007523          13       599        33 futex
  0.01    0.005814           1      7485           mmap
  0.01    0.005560           8       722           ftruncate
  0.00    0.003133           1      4820           read
  0.00    0.000006           2         3           brk
------ ----------- ----------- --------- --------- ----------------
100.00   62.667791              55413309        33 total

i.e. a good 20% of the runtime is spent opening and closing files! Notice also the disproportionate amount of reading vs. writing, which suggests we do a lot of auxiliary reading before actually fetching and writing data.
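
To illustrate what those openat/close counts mean in practice (h5py is used for illustration only; pyiron goes through h5io, so this is not the actual code path):

import h5py

def read_items_reopen(fname, keys):
    # the pattern the trace points at: one open/close pair per accessed item
    out = []
    for k in keys:
        with h5py.File(fname, "r") as f:
            out.append(f[k][()])
    return out

def read_items_keep_open(fname, keys):
    # keeping the file open for the whole rewrite avoids most of that overhead
    with h5py.File(fname, "r") as f:
        return [f[k][()] for k in keys]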

@niklassiemer (Member)

Well, the imbalance between read and write is kind of expected with the awful 'read array and then array(to_list)' before we actually have the data... However, the difference in the number of calls is also big...

@pmrv (Contributor) commented Nov 18, 2021

Well, the imbalance between read and write is kind of expected with the awful 'read array and then array(to_list)' before we actually have the data... However, the difference in the number of calls is also big...

The read/write system calls are purely for reading from the file, so I'd be surprised, but not knowing how h5io works, it's not so clear to me.

Anyway, I've updated the robot to

  1. track how many files it has processed so far
  2. check whether the file actually needs to be rewritten

Once I've converted all of those 600 jobs (which tqdm estimates at ~1h), this should be good to go from my side. Still, it would be nice if someone could test the script on some of their data before we unleash it onto the cluster.

@pmrv (Contributor) commented Nov 18, 2021

Update: It took around 1h 30min for 1.3GB of data combined. I didn't see any errors, so I think it's good to go, but will try to move to more data now and see if anything happens.

@niklassiemer (Member)

Thanks for testing this rigorously! I just do not have data to test 😆

@niklassiemer (Member)

And if we patch the robot to be faster in an upcoming release, this is fine from my perspective.

@pmrv (Contributor) commented Nov 18, 2021

Currently testing on my full data set and patching small errors that come up, will update before I leave the office.

@niklassiemer (Member)

OK, please just update #518 with these patches and merge. Then I will make a new release this evening, also lifting the h5io restrictions on conda 🎉

@pmrv (Contributor) commented Nov 18, 2021

I pushed it accidentally to #525, but since it includes #518 I suppose this will still work for you.

My extended testing shows two failure modes that I have now addressed in the script:

  1. Sometimes a job is in the database, but the HDF5 file does not exist anymore. I simply skip those, but it might be nice to have a utility that cleans up this kind of "dangling" job.
  2. A large number of jobs fail the rewrite with an error like could not broadcast input array from shape (4,3) into shape (4,). These are jobs that are definitely affected and should be rewritten, but I don't understand the issue right now. It seems like we have to wait another day. EDIT: Actually all of these jobs seem to be Murnaghan child jobs; it might be that their weird HDF5 layout gives the rewrite trouble. If you have time to look at it, it might make sense to run a Murnaghan and try the robot on it.

Still, there was a large number of unaffected jobs in my ~100GB of data, so we can actually be a bit optimistic about doing the conversion overnight.

@niklassiemer (Member)

Yes, I will cherry-pick them over to the other branch and get everything merged and released!
