Improve types handling #82

Merged: 6 commits, Jan 4, 2018

Conversation

conradoqg
Contributor

@conradoqg conradoqg commented Jan 2, 2018

The goal of this PR is to stabilize the handling of different types and to improve tests. The changes are:

  • value_counts, nuniques and type inference are cached;
  • boolean variants are handled correctly, e.g. True, False, 0, 1, with or without NaN;
  • types like List, Tuple, Dict are now officially unsupported until we improve them;
  • mixed columns are also correctly handled;
  • add a correlation threshold option and a check recoded correlation parameter (memory heavy);
  • add variable anchors to improve navigation, plus small UX improvements (works in both stand-alone HTML and Jupyter notebooks by using scrollTo instead of real link anchors).
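
The boolean case in the list above could look roughly like the following. This is an illustrative sketch on plain Python lists, not the code from this PR; `is_missing` and `looks_boolean` are made-up names. Floats are accepted because pandas upcasts an integer 0/1 column that contains NaN to float.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def looks_boolean(values):
    """Hypothetical helper: True when every non-missing value is True/False/0/1."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        # A column holding only missing values gives no evidence of being boolean.
        return False
    # Strings such as "1" are rejected by the isinstance check.
    return all(isinstance(v, (bool, int, float)) and v in (0, 1) for v in present)
```

For example, `looks_boolean([1.0, 0.0, float("nan")])` is true, while `looks_boolean([0, 1, 2])` is false.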

This should solve #76, #77, #70, #44, #29 and #66.

Please review my changes and, if you agree, merge them.

Best

@romainx
Contributor

romainx commented Jan 4, 2018

Great work!!
I will have a closer look by the end of the day and merge it.
If a fix is needed, we will do it later.

Thanks for your work

@conradoqg
Contributor Author

Thanks for your time @romainx

Contributor

@romainx romainx left a comment

Hello,

Thanks a lot for this work.
I have reviewed the code and here are my remarks.
As I said, I will perform the changes after the merge.

  • The unit test fails because of the cache, which seems not to be cleared in some cases. It's better to clear the cache at the beginning of describe. I'm not a big fan of caching in a global variable, but we can keep it as is; it's simple to remove if it causes problems.

  • Using type as a variable name, as in get_vartype, is not good practice since it can hide the built-in type function. We should use vartype instead.

  • I will take into account new keyword arguments in the ProfileReport docstring. It's important since the help description will appear in Jupyter notebooks.
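
A minimal sketch of the first two remarks, assuming the cache is a module-level dictionary keyed by column name. Every name except get_vartype and describe is hypothetical, and the classification logic is a placeholder rather than the project's code.

```python
# Module-level cache (hypothetical layout), keyed by column name.
_VALUE_COUNTS_CACHE = {}


def clear_cache():
    """Drop all memoized results so one run cannot leak into the next."""
    _VALUE_COUNTS_CACHE.clear()


def get_value_counts(name, values):
    """Count occurrences of each value, memoized by column name."""
    if name not in _VALUE_COUNTS_CACHE:
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        _VALUE_COUNTS_CACHE[name] = counts
    return _VALUE_COUNTS_CACHE[name]


def get_vartype(name, values):
    """Classify a column; the local is `vartype`, so built-in `type` stays usable."""
    distinct = len(get_value_counts(name, values))
    if distinct <= 1:
        vartype = "CONST"
    elif all(type(v) in (int, float) for v in values):
        vartype = "NUM"
    else:
        vartype = "CAT"
    return vartype


def describe(columns):
    """Clear the cache first, then classify each column."""
    clear_cache()
    return {name: get_vartype(name, values) for name, values in columns.items()}
```

Without the `clear_cache()` call, describing `{"a": [1, 2, 3]}` and then `{"a": [5, 5, 5]}` would reuse the stale counts for "a" and misclassify the second column.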

I haven't tried the unsupported types yet; I will try to build a test case for them.

@romainx romainx merged commit dea9f1b into ydataai:master Jan 4, 2018
@conradoqg
Contributor Author

conradoqg commented Jan 4, 2018

Hey, thanks for your observations!

> The unit test fails because of the cache, which seems not to be cleared in some cases. It's better to clear the cache at the beginning of describe. I'm not a big fan of caching in a global variable, but we can keep it as is; it's simple to remove if it causes problems.

I'm also not a fan of caching in global variables; this part does need a good refactoring. I tried to keep this first version simple before refactoring. I evaluated some cache strategies, such as lru_cache (built into Python 3), but it doesn't work since the function parameter is not a base type. (I also tried two other packages, but most of them rely on the parameter type.)
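
The lru_cache limitation can be reproduced with a small sketch. It assumes the argument is unhashable, as mutable containers are (pandas Series and DataFrame objects are unhashable too); `Column`, `count_distinct`, `try_cached` and `count_distinct_hashable` are hypothetical names, not functions from this PR.

```python
from functools import lru_cache


class Column:
    """Stand-in for a mutable container; like list or dict, it is unhashable."""

    __hash__ = None

    def __init__(self, values):
        self.values = list(values)


@lru_cache(maxsize=None)
def count_distinct(column):
    return len(set(column.values))


def try_cached(column):
    """Call the cached function and report the error instead of raising."""
    try:
        return count_distinct(column)
    except TypeError as exc:
        # lru_cache hashes its arguments to build the cache key.
        return "TypeError: {}".format(exc)


@lru_cache(maxsize=None)
def count_distinct_hashable(values):
    """Works with lru_cache because the caller passes a hashable tuple."""
    return len(set(values))
```

Converting the data to a tuple makes the call cacheable, but the conversion itself costs time and memory on large columns, which is one reason a hand-rolled cache can be simpler.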

Interestingly, the cache didn't cause failures in the tests here; I'm not sure why.

> Using type as a variable name, as in get_vartype, is not good practice since it can hide the built-in type function. We should use vartype instead.

I agree.

> I will take into account new keyword arguments in the ProfileReport docstring. It's important since the help description will appear in Jupyter notebooks.

Oops, my mistake.

> I haven't tried the unsupported types yet; I will try to build a test case for them.

I added some test cases for the unsupported types; I'm not sure whether you meant adding a test case for that or adding more tests.

Best

@romainx
Contributor

romainx commented Jan 4, 2018

Hello,

Thanks for the update. I have just performed the changes; you can review them.
I may need your help to close all the corresponding issues: #76, #77, #70, #44, #29 and #66.

I will have a look.

Many thanks

@conradoqg
Contributor Author

OK, I'm responding to each issue with the related fix/improvement made by this PR.

Best

@conradoqg
Contributor Author

Done. In each issue I explained what was solved and asked for the issue to be closed. In some issues I asked you to open a new issue for a specific subject discussed there, to separate the original problem from new suggestions.

Best

@conradoqg conradoqg deleted the improve-types-handling branch January 4, 2018 22:47
@conradoqg
Contributor Author

conradoqg commented Jan 5, 2018

I also found that this change may solve issues #57 and #34; I'm responding to them.

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this pull request Oct 11, 2020