fix #537 ValueError race condition when running multiprocessing with describe1d #549

Merged
merged 5 commits into from Aug 24, 2020

chanedwin
Collaborator

References issue #537

Problem:
A ValueError is raised when running ProfileReport on large datasets with multiprocessing enabled (pool_size > 1). The likely cause is the series.fillna(np.nan, inplace=True) call in summary.py, which appears to mutate the underlying DataFrame in place through the passed series reference. With several workers running, two of them can try to write to the DataFrame at the same time, a race condition that surfaces as the ValueError. This would also explain why setting pool_size to 1 fixes the issue and why the error does not always occur: you probably need enough data and enough workers to hit the race.
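To make the side effect concrete, here is a minimal sketch of how an in-place fillna inside a helper is visible to the caller, while reassignment is not. The helper names are hypothetical, and a fill value of 0.0 is used instead of np.nan only so that the mutation is observable; this is not the code from summary.py.

```python
import numpy as np
import pandas as pd


def fill_in_place(series: pd.Series) -> None:
    # Hypothetical helper: mutates the very object the caller passed in.
    series.fillna(0.0, inplace=True)


def fill_by_reassignment(series: pd.Series) -> pd.Series:
    # Hypothetical helper: rebinds the local name to a new Series,
    # so the caller's object is left untouched.
    series = series.fillna(0.0)
    return series


shared = pd.Series([1.0, np.nan, 3.0])

filled = fill_by_reassignment(shared)
# The caller's Series still contains its NaN at this point.
missing_after_reassignment = int(shared.isna().sum())

fill_in_place(shared)
# Now the caller's Series has been changed behind its back.
missing_after_in_place = int(shared.isna().sum())
```

When several workers receive references into the same data, the second pattern is the one that lets them write to shared state concurrently.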

Solution:
Replace series.fillna(np.nan, inplace=True) with series = series.fillna(np.nan), which avoids any side effects from mutation.
Add a test case that runs describe1d with multiprocessing enabled, to cover the multiprocessing code path.
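The shape of such a test can be sketched as follows. This is an illustration under assumptions, not the test added in tests/issues/test_issue537.py: summarize is a hypothetical stand-in for describe1d, summarize_all is a hypothetical driver, and a thread pool is assumed as the worker pool.

```python
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd


def summarize(series: pd.Series) -> dict:
    """Hypothetical stand-in for describe1d: counts missing values."""
    # Reassign instead of mutating, so the caller's data is not touched.
    series = series.fillna(np.nan)
    return {"name": series.name, "n_missing": int(series.isna().sum())}


def summarize_all(df: pd.DataFrame, pool_size: int) -> list:
    """Hypothetical driver: summarize every column using a worker pool."""
    columns = [df[name] for name in df.columns]
    with ThreadPool(pool_size) as pool:
        # map() returns results in the same order as the input columns.
        return pool.map(summarize, columns)


df = pd.DataFrame({f"col_{i}": [1.0, np.nan, 3.0, np.nan] for i in range(20)})
before = df.copy()

results = summarize_all(df, pool_size=4)
```

A test built this way would check two things: every column reports the expected summary, and df still equals the copy taken before the pool ran.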

codecov bot commented Aug 19, 2020

Codecov Report

Merging #549 into develop will decrease coverage by 0.10%.
The diff coverage is 94.73%.

@@             Coverage Diff             @@
##           develop     #549      +/-   ##
===========================================
- Coverage    87.73%   87.62%   -0.11%     
===========================================
  Files          120      121       +1     
  Lines         3115     3152      +37     
===========================================
+ Hits          2733     2762      +29     
- Misses         382      390       +8     
Flag        Coverage Δ
#examples   31.05% <0.00%> (-0.02%) ⬇️
#issue      76.63% <94.73%> (+0.02%) ⬆️
#unit       84.86% <100.00%> (+<0.01%) ⬆️

Impacted Files                           Coverage Δ
tests/issues/test_issue537.py            94.44% <94.44%> (ø)
src/pandas_profiling/model/summary.py    89.28% <100.00%> (+0.04%) ⬆️
tests/issues/test_issue437.py            60.00% <0.00%> (-40.00%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa25e44...8c68ab7.

@chanedwin chanedwin closed this Aug 19, 2020
@chanedwin chanedwin reopened this Aug 19, 2020
@chanedwin chanedwin closed this Aug 20, 2020
@chanedwin chanedwin reopened this Aug 20, 2020
Collaborator Author

chanedwin commented Aug 20, 2020

I double-checked the Travis build: it fails in test_issue147.py on the reset_index() call (error: cannot convert float NaN to integer). This is actually a bug in pandas 1.1.0, reported in pandas-dev/pandas#35657:

On pandas 1.1.0, I'm getting a ValueError exception when calling dataframe.reset_index() under the following conditions:

- Input dataframe is empty
- Multiindex from multiple columns, at least one of which is a datetime

The exception message is ValueError: cannot convert float NaN to integer.
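The conditions quoted above can be sketched as a small check. This is built only from that description and is not the reproducer from the linked pandas issue; the column names and the reset_empty_multiindex helper are made up for illustration. On an affected pandas version the call is expected to raise, while on other versions it should simply return an empty frame.

```python
import pandas as pd


def reset_empty_multiindex() -> str:
    """Hypothetical check: reset_index() on an empty, MultiIndexed frame."""
    empty = pd.DataFrame(
        {
            "key": pd.Series([], dtype="int64"),
            "when": pd.Series([], dtype="datetime64[ns]"),
            "value": pd.Series([], dtype="float64"),
        }
    ).set_index(["key", "when"])  # MultiIndex with one datetime level

    try:
        empty.reset_index()
    except ValueError as exc:
        # Reported behaviour on pandas 1.1.0.
        return f"ValueError: {exc}"
    return "ok"


outcome = reset_empty_multiindex()
print(pd.__version__, outcome)
```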

@sbrugman sbrugman merged commit 577b230 into ydataai:develop Aug 24, 2020
chanedwin added a commit to chanedwin/pandas-profiling that referenced this pull request Oct 11, 2020
…g with describe1d (ydataai#549)

* include tests for issue 537
* fix hidden side effect from previous series.fillna(inplace=True) call by explicitly dropping na
chanedwin added a commit to chanedwin/pandas-profiling that referenced this pull request Oct 11, 2020
…e, tests and CI up, and with visions integration pulled in

Update integrations.rst (ydataai#544)

fix ydataai#537  ValueError race condition when running multiprocessing with describe1d (ydataai#549)

* include tests for issue 537
* fix hidden side effect from previous series.fillna(inplace=True) call by explicitly dropping na

Give visibility to our support (ydataai#536)

* Add support mention

Change formatters for overview (ydataai#535)

Fix 523 (ydataai#533)

* Fix 523

Incompatible with pandas 1.1.0 (ydataai#557)

Notebook update instructions (ydataai#556)

Fix 545 and test pandas 1.0.5 and >=1.1 (ydataai#558)

* Fix 545 and test pandas 1.0.5 and >=1.1

Bump visions[type_image_path] from 0.4.4 to 0.5.0 (ydataai#547)

Bumps [visions[type_image_path]](https://github.com/dylan-profiler/visions) from 0.4.4 to 0.5.0.
- [Release notes](https://github.com/dylan-profiler/visions/releases)
- [Commits](dylan-profiler/visions@v0.4.4...0.5.0)

Update frequent issues (ydataai#564)

Fix warning from cmap (ydataai#565)

Feature/distinct unique (ydataai#566)

* Fix ydataai#539

v2.9.0 details (ydataai#567)

[skip ci] Code formatting

Visions integration

Build summary from graph structure

Fix a few more tests

Typeset changes + test updates

Type checking

Correlations

Handler, warning structure, random sample, test fix

Test fix

Fixes

Fix warning

Captions missing diagrams

Fix 51

Unhashable

Process comments

Fix tests

Update messages.py

Add threshold to all correlation configs

Remove unused renderers (ydataai#580)

* Remove unused rendered

Update README.md

Fix check for infinite values (ydataai#588)

* Fix check for infinite values

Bump visions[type_image_path] from 0.5.0 to 0.6.0

Bumps [visions[type_image_path]](https://github.com/dylan-profiler/visions) from 0.5.0 to 0.6.0.
- [Release notes](https://github.com/dylan-profiler/visions/releases)
- [Commits](dylan-profiler/visions@0.5.0...v0.6.0)

Signed-off-by: dependabot-preview[bot] <support@dependabot.com>

Update get_scatter_matrix for sparse dataframes

For a dataframe like:

	A	B	C
0	1.0	7.0	NaN
1	2.0	8.0	NaN
2	3.0	9.0	NaN
3	4.0	NaN	13.0
4	5.0	NaN	14.0
5	6.0	NaN	15.0
6	NaN	10.0	16.0
7	NaN	11.0	17.0
8	NaN	12.0	18.0

the 'Interactions' tab would not display any data (because every row contains at least one NaN), even though every pair of columns has valid data to plot.
This change allows columns A, B, and C to be plotted pairwise against each other by removing only the rows that contain NaNs in the two columns being plotted.
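The difference between dropping NaN rows across the whole frame and dropping them per pair can be sketched with the example frame above. This is a minimal illustration of the idea, not the code in plot.py; the variable names are made up.

```python
from itertools import combinations

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan, nan, nan],
        "B": [7.0, 8.0, 9.0, nan, nan, nan, 10.0, 11.0, 12.0],
        "C": [nan, nan, nan, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    }
)

# Dropping rows with any NaN leaves nothing to plot.
rows_left_globally = len(df.dropna())

# Dropping NaN rows per pair of columns keeps the rows valid for that pair.
rows_left_per_pair = {
    (x, y): len(df[[x, y]].dropna()) for x, y in combinations(df.columns, 2)
}
```

Here every row has a NaN somewhere, so the global drop yields an empty frame, while each of the three column pairs keeps three plottable rows.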

Update plot.py

Notation
chanedwin added a commit to chanedwin/pandas-profiling that referenced this pull request Oct 22, 2020
…g with describe1d (ydataai#549)

* include tests for issue 537
* fix hidden side effect from previous series.fillna(in_place=True) call by expliciting dropping na
@chanedwin chanedwin deleted the develop branch October 25, 2020 06:04
chanedwin added a commit to chanedwin/pandas-profiling that referenced this pull request Jan 13, 2021
…e, tests and CI up, and with visions integration pulled in
