
Fix batch processing #26

Closed
wants to merge 2 commits into from

Conversation

nurikk (Contributor) commented Jul 1, 2021

This PR fixes an issue with processing huge tables. It calls import_data inside a loop that calls fetchmany and clears the data and StringIO buffers after each iteration (a rough sketch follows below).

Depends on #25
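
For reference, a minimal sketch of the batching approach described above, assuming psycopg2. The batch size, the query, and the way each row is written to the buffer are placeholders rather than the exact PR code; import_data is the project's existing helper from pganonymizer.utils:

import io

from pganonymizer.utils import import_data

BATCH_SIZE = 10000  # placeholder; the actual batch size used in the PR may differ

def anonymize_in_batches(connection, table, columns, primary_key):
    """Fetch rows in chunks with fetchmany(), import each chunk via
    import_data(), and release the buffer before the next iteration
    so memory use stays bounded."""
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM {0};'.format(table))
    while True:
        rows = cursor.fetchmany(BATCH_SIZE)
        if not rows:
            break
        data = io.StringIO()
        for row in rows:
            # Anonymization of the individual columns is omitted here;
            # the (anonymized) row is written to the buffer for import.
            data.write('\t'.join(str(value) for value in row) + '\n')
        data.seek(0)
        import_data(connection, columns, table, primary_key, data)
        data.close()  # clear the buffer so the next batch starts empty
    cursor.close()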

nurikk mentioned this pull request Jul 2, 2021
hkage (Contributor) commented Jul 2, 2021

Hi,

I tested your changes against a database of roughly 1.6 GB with tables of up to ~370k rows each. The anonymization breaks at some point with an OutOfMemory error. Have you encountered similar issues?

I'm running the anonymization with Python 3.9 in a virtualenv on Ubuntu Linux 21.04 with 24 GB of RAM.

postgres=# SELECT version();
                                                           version                                                            
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 10.7 (Ubuntu 10.7-0ubuntu0.18.10.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.2.0-7ubuntu1) 8.2.0, 64-bit
(1 row)
Anonymizing |█                               | 10000/292059
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/__main__.py", line 18, in <module>
    main()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/__main__.py", line 10, in main
    main()
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/cli.py", line 71, in main
    anonymize_tables(connection, schema.get('tables', []), verbose=args.verbose)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 40, in anonymize_tables
    build_and_then_import_data(connection, table_name, primary_key, columns, excludes,
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 83, in build_and_then_import_data
    import_data(connection, columns, table, primary_key, data)
  File "/home/henning/Projekte/postgresql-anonymizer/pganonymizer/utils.py", line 158, in import_data
    cursor.execute('DROP TABLE %s;' % temp_table)
psycopg2.errors.OutOfMemory: out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

nurikk (Contributor, Author) commented Jul 2, 2021

> The anonymization breaks at some point with an OutOfMemory error. Have you encountered similar issues?

No, I'm not seeing such issues. I assume everything works fine without the proposed changes?

hkage (Contributor) commented Jul 2, 2021

> No, I'm not seeing such issues. I assume everything works fine without the proposed changes?

Yes, as soon as I switch to the development branch, the anonymization works fine and completes after a couple of minutes.

I haven't taken a deeper look into PostgreSQL's max_locks_per_transaction setting, but it seems the number of locks taken during the anonymization exceeds the default. I tried to split the cursor operations into multiple transactions, but that resulted in errors when deleting the temporary tables (at https://github.com/rheinwerk-verlag/postgresql-anonymizer/blob/development/pganonymizer/utils.py#L157).
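
For reference, a minimal sketch (assuming psycopg2 and placeholder connection parameters) for checking the current setting; the PostgreSQL default is 64, and raising it requires editing postgresql.conf and restarting the server:

import psycopg2

# Placeholder connection parameters; adjust to your environment.
connection = psycopg2.connect(dbname='postgres')
with connection.cursor() as cursor:
    cursor.execute('SHOW max_locks_per_transaction;')
    print(cursor.fetchone()[0])  # 64 unless the default has been changed
connection.close()

# Raising the value happens in postgresql.conf (server restart required), e.g.:
#   max_locks_per_transaction = 256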

nurikk (Contributor, Author) commented Jul 2, 2021

I'll fix this issue

hkage (Contributor) commented Jul 2, 2021

> I'll fix this issue

Thanks. I can provide additional debug/system information if that helps.

nurikk (Contributor, Author) commented Jul 2, 2021

Hm, I can't reproduce it on my test DB.
Can you try to test against this PR? It already includes the changes from this PR and might resolve the psycopg2.errors.OutOfMemory: out of shared memory issue.

nurikk (Contributor, Author) commented Jul 5, 2021

Superseded by #27

nurikk closed this Jul 5, 2021
hkage (Contributor) commented Jul 5, 2021

> Can you try to test against this PR? It already includes the changes from this PR and might resolve the psycopg2.errors.OutOfMemory: out of shared memory issue.

Sure, I will check it with PR 27. Thanks!
