Improve types handling #82

conradoqg · 2018-01-02T23:46:12Z

The goal of this PR is to stabilize the handling of different types and to improve tests. The changes are:

value_counts, nuniques and type inference are cached;
boolean variables are treated correctly, e.g. True, False, 0, 1, with or without NaN;
types like List, Tuple, Dict are now officially unsupported until we improve them;
mixed columns are also correctly handled;
add correlation threshold option and check recoded correlation parameter (memory heavy)
add variable anchor to improve navigation and small UX improvements (works both in stand-alone HTML and Jupyter notebook by using scrollTo instead of real link anchors)
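
As a rough illustration of the boolean rule in the list above, the sketch below treats a column as boolean when all of its non-missing values are drawn from True, False, 0 and 1. The helper name `looks_boolean` is hypothetical and not taken from this PR, and it works on plain Python lists rather than pandas Series to stay self-contained.

```python
import math


def looks_boolean(values):
    """Return True if every non-missing value is one of True, False, 0, 1.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    seen_any = False
    for value in values:
        # Skip missing values: None or a float NaN.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        seen_any = True
        # True == 1 and False == 0 in Python, so one membership test covers both.
        if value not in (0, 1):
            return False
    # A column holding only missing values is not considered boolean here.
    return seen_any


print(looks_boolean([True, False, float("nan")]))  # True
print(looks_boolean([0, 1, 1.0, None]))            # True
print(looks_boolean([0, 1, 2]))                    # False
print(looks_boolean(["yes", "no"]))                # False
```

Treating an all-missing column as non-boolean is a choice made for this sketch only.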

This should solve #76 #77 #70 #44 #29 #66

Please review my changes and, if you agree, merge them.

Best

romainx · 2018-01-04T06:06:22Z

Great work!!
I will have a closer look by the end of the day and merge it.
If fixes are needed, we will do them later.

Thanks for your work

conradoqg · 2018-01-04T17:48:52Z

Thanks for your time @romainx

romainx

Hello,

Thanks a lot for this work.
I have reviewed the code and here are my remarks.
As I said, I will perform the changes after the merge.

The unit test fails due to the cache used. It seems not to be cleared in some cases. It's better to clear the cache at the beginning of describe. I'm not a big fan of caching in a global variable, but we can keep it as is. It's simple to remove if it causes problems.
Using type as a variable name, as in get_vartype, is not good practice since it can hide the built-in type function. We should use vartype instead.
I will take the new keyword arguments into account in the ProfileReport docstring. This is important since the help description will appear in Jupyter notebooks.

I've not tried the unsupported type -> I will try to build a test case for it.
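
To make the cache remark concrete, here is a minimal sketch of a module-level cache keyed by column name that is cleared at the start of each describe call. All names (`_VALUE_COUNTS_CACHE`, `clear_cache`, `cached_value_counts`) and this toy `describe` are assumptions for illustration, not the project's actual code.

```python
from collections import Counter

# Module-level cache keyed by column name (hypothetical; mirrors the
# "global variable" approach discussed in the review).
_VALUE_COUNTS_CACHE = {}


def clear_cache():
    """Drop every cached result so stale entries cannot leak between runs."""
    _VALUE_COUNTS_CACHE.clear()


def cached_value_counts(name, values):
    """Count values for a column, reusing the cached result for that name."""
    if name not in _VALUE_COUNTS_CACHE:
        _VALUE_COUNTS_CACHE[name] = Counter(values)
    return _VALUE_COUNTS_CACHE[name]


def describe(columns):
    """Toy stand-in for a describe step: distinct count per column."""
    # Clearing first means two datasets that share column names
    # never see each other's cached counts.
    clear_cache()
    return {name: len(cached_value_counts(name, values))
            for name, values in columns.items()}


first = describe({"x": [1, 2, 2]})
second = describe({"x": [1, 2, 3, 4]})
print(first, second)  # {'x': 2} {'x': 4}
```

Without the `clear_cache()` call, the second run would reuse the counts cached for the first "x" column, which is the kind of cross-test leakage described above.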

conradoqg · 2018-01-04T21:21:36Z

Hey, thanks for your observations!

> The unit test fails due to the cache used. It seems not to be cleared in some cases. It's better to clear the cache at the beginning of describe. I'm not a big fan of caching in a global variable, but we can keep it as is. It's simple to remove if it causes problems.

I'm also not a fan of caching in global variables; indeed, this part needs a good refactoring. I tried to keep this first version simple before doing a refactoring. I evaluated some cache strategies such as lru_cache (built into Python 3), but it doesn't work since the function parameter is not a base type. (I also tried two other packages, but most of them rely on the parameter type.)

Interestingly, the cache didn't cause failures in the tests here; I'm not sure why.
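
The lru_cache limitation mentioned above can be reproduced with the standard library alone: `functools.lru_cache` builds its cache key from the call arguments, so they must be hashable. In this sketch a plain list stands in for an unhashable argument (a pandas Series is unhashable too), and `distinct_count` is a hypothetical function name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def distinct_count(values):
    """Number of distinct values; the argument must be hashable to be cached."""
    return len(set(values))


# A tuple is hashable, so the call is cached normally.
print(distinct_count((1, 2, 2)))  # 2

# A list is not hashable, so lru_cache cannot build a cache key for it.
try:
    distinct_count([1, 2, 2])
except TypeError as exc:
    print("TypeError:", exc)
```

Keying a hand-written cache on something hashable, such as the series name, sidesteps this, at the cost of having to clear the cache between datasets.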

> Using type as a variable name, as in get_vartype, is not good practice since it can hide the built-in type function. We should use vartype instead.

I agree.
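
A small sketch of why the name matters: once a parameter is called `type`, the built-in is no longer reachable inside that function. Both function names below are hypothetical and only illustrate the shadowing issue.

```python
def describe_value_shadowed(value, type):
    # The parameter named `type` hides the built-in inside this function,
    # so type(value) tries to call whatever object was passed in.
    try:
        return type(value).__name__
    except TypeError as exc:
        return "TypeError: {}".format(exc)


def describe_value(value, vartype):
    # With a distinct parameter name, the built-in type() is still available.
    return "{} ({})".format(vartype, type(value).__name__)


print(describe_value_shadowed(3.5, "NUM"))  # a TypeError message: a str is not callable
print(describe_value(3.5, "NUM"))           # NUM (float)
```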

> I will take the new keyword arguments into account in the ProfileReport docstring. This is important since the help description will appear in Jupyter notebooks.

Oops, my mistake.
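
For context, a docstring that lists keyword arguments is what `help()` (or the `?` suffix in Jupyter) displays. The toy class below is only a sketch: the class name, the parameter names `correlation_threshold` and `check_recoded`, and their defaults are assumptions made for illustration, not the documented ProfileReport signature.

```python
class ReportSketch(object):
    """Toy report object (hypothetical, for illustration only).

    Parameters
    ----------
    correlation_threshold : float
        Assumed name and default: correlation above which a variable is flagged.
    check_recoded : bool
        Assumed name and default: whether to run the memory-heavy recoded check.
    """

    def __init__(self, correlation_threshold=0.9, check_recoded=False):
        self.correlation_threshold = correlation_threshold
        self.check_recoded = check_recoded


# help(ReportSketch) prints the docstring above; here we print it directly.
print(ReportSketch.__doc__)
```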

> I've not tried the unsupported type -> I will try to build a test case for it.

I added some test cases for the unsupported types; I'm not sure whether you mean adding a test case for that or adding more tests.
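
A test for unsupported types might look roughly like the sketch below. `infer_vartype` and the `UNSUPPORTED`/`SUPPORTED` labels are hypothetical stand-ins, not the project's functions or constants; the point is only that columns holding list, tuple or dict values get flagged rather than profiled.

```python
UNSUPPORTED = "UNSUPPORTED"  # hypothetical label
SUPPORTED = "SUPPORTED"      # hypothetical label


def infer_vartype(values):
    """Hypothetical stand-in: flag columns holding list, tuple or dict values."""
    if any(isinstance(v, (list, tuple, dict)) for v in values):
        return UNSUPPORTED
    return SUPPORTED


def test_containers_are_unsupported():
    for column in ([[1, 2], [3]], [(1, 2), (3,)], [{"a": 1}, {}]):
        assert infer_vartype(column) == UNSUPPORTED


def test_scalars_are_supported():
    assert infer_vartype([1, 2.5, "x"]) == SUPPORTED


# Run the checks directly so the sketch works without a test runner.
test_containers_are_unsupported()
test_scalars_are_supported()
print("all checks passed")
```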

Best

romainx · 2018-01-04T21:24:33Z

Hello,

Thanks for the update. I have just performed the changes; you can review them.
I may need your help to close all the corresponding issues: #76 #77 #70 #44 #29 #66

I will have a look.

Many thanks

conradoqg · 2018-01-04T22:11:27Z

OK, I'm responding to each issue with the related fix/improvement done by this PR.

Best

conradoqg · 2018-01-04T22:32:56Z

Done. I explained in each issue what was solved and requested that the issue be closed. In some issues I asked you to open a new issue for a specific subject discussed there (to separate the original problem from new suggestions).

Best

conradoqg · 2018-01-05T02:18:45Z

I also found that this change may solve issues #57 and #34; I'm responding to them.

conradoqg added 4 commits January 1, 2018 21:45

Merge pull request #3 from JosPolfliet/master

a861cc0

Cache distinct by series name

4b32fd1

Improve handling of boolean variables

0762fd5

Improve handling of unsupported variables

79e71b9

conradoqg force-pushed the improve-types-handling branch from e578fff to 79e71b9 Compare January 3, 2018 00:50

Add correlation threshold option and check recoded correlation parameter (memory heavy)

06aab04

conradoqg force-pushed the improve-types-handling branch from f5ba059 to 06aab04 Compare January 3, 2018 01:55

Add variable anchor to improve navigation and small UX improvements

409e233

romainx self-requested a review January 3, 2018 04:53

conradoqg mentioned this pull request Jan 4, 2018

Add in html report the image of correlation matrix #52

Closed

romainx added the enhancement label Jan 4, 2018

romainx reviewed Jan 4, 2018

romainx merged commit dea9f1b into ydataai:master Jan 4, 2018

romainx added a commit that referenced this pull request Jan 4, 2018

Some fixes after the merge #82 (see review for detail)

401ca63

conradoqg deleted the improve-types-handling branch January 4, 2018 22:47

This was referenced Jan 5, 2018

Low Memory option? #57

Closed

Gnome session dies on pandas_profiling.ProfileReport(df) #34

Closed

This was referenced Jan 6, 2018

RecursionError: maximum recursion depth exceeded while calling a Python object #84

Closed

Release version 1.4.1 / 1.5 #86

Closed

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this pull request Oct 11, 2020

Merge pull request ydataai#82 from conradoqg/improve-types-handling

05a24ed

Improve types handling

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this pull request Oct 11, 2020

Some fixes after the merge ydataai#82 (see review for detail)

37a9e72
