Plot categorical data #100
@victor-torres @ejulio @raphapassini please check the layout and let me know how you like it.
Codecov Report
@@ Coverage Diff @@
## master #100 +/- ##
==========================================
+ Coverage 78.45% 78.98% +0.53%
==========================================
Files 23 23
Lines 1564 1599 +35
Branches 273 280 +7
==========================================
+ Hits 1227 1263 +36
Misses 290 290
+ Partials 47 46 -1
Really interesting and helpful feature for me.
What do you think about extending it:
- for > 11 unique elements as additional method and per column?
- adding filter - for example filtering only records for a given category?
How will it look per column? There's
What is the use case here? This can be achieved with pandas.
The use case would be: if you have products nested in different categories, you could check the distribution of the other fields. Let's say the category is food: check what the categorical data looks like among the other fields. This gives you the ability to dive deeper into the data. As for the first point, I didn't know about this option. Thanks.
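As a rough illustration of the "this can be achieved with pandas" point, here is a minimal sketch of filtering records by one category and then looking at the distribution of another field. The column names (`category`, `availability`) and the helper `distribution_within` are hypothetical and not part of arche.

```python
import pandas as pd

# Hypothetical items; the column names are illustrative only.
items = pd.DataFrame(
    {
        "category": ["food", "food", "toys", "food", "toys"],
        "availability": ["in stock", "out of stock", "in stock", "in stock", None],
    }
)


def distribution_within(df, filter_column, filter_value, target_column):
    """Count values of target_column among rows where filter_column equals filter_value."""
    subset = df[df[filter_column] == filter_value]
    # dropna=False keeps missing values visible as their own bucket.
    return subset[target_column].value_counts(dropna=False)


food_availability = distribution_within(items, "category", "food", "availability")
print(food_availability)
```

The same pattern works for any pair of columns, which is probably why a dedicated filter option may not be needed in the rule itself.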
I just made an update introducing
That's it, please review.
Just minor comments
src/arche/rules/result.py
Outdated
"""
data = []
for vc in values_counts:
    data.extend(
Maybe a nested `for` loop would read better.
Also, `+` might be better than `.extend`.
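To make the suggestion concrete, here is a small sketch of flattening several value-count Series with a nested `for` loop. The row shape `(field, value, count)` and the function name `to_rows` are assumptions for illustration; the actual structure built in `result.py` is not shown in this thread.

```python
import pandas as pd


def to_rows(value_counts):
    """Flatten value-count Series into (field, value, count) tuples with a nested loop."""
    rows = []
    for vc in value_counts:  # one Series per field
        for value, count in vc.items():  # one entry per unique value
            rows.append((vc.name, value, int(count)))
    return rows


# Hand-built Series standing in for df[column].value_counts() results.
color_counts = pd.Series([3, 1], index=["red", "blue"], name="color")
size_counts = pd.Series([2], index=["S"], name="size")

rows = to_rows([color_counts, size_counts])
print(rows)
```

Whether this reads better than a single `.extend` call with a comprehension is a matter of taste; the nested loop mainly makes the two levels of iteration explicit.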
"""
result = Result("Categories")

result.stats = [
If there's no cache, then you're calculating `value_counts` twice here.
Possible solution:
result.stats = map(lambda c: df[c].value_counts(dropna=False), df)
result.stats = [vc for vc in result.stats if len(vc) <= max_uniques]
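The idea behind the two lines above can be sketched as a standalone function: compute each column's value counts once, then keep only the low-cardinality ones. The function name `categorical_value_counts` and the default `max_uniques=10` are assumptions for this sketch, not confirmed arche API.

```python
import pandas as pd


def categorical_value_counts(df, max_uniques=10):
    """Return value counts for columns with at most max_uniques distinct values.

    Each column's value_counts is computed exactly once; the generator feeds
    the filtering comprehension lazily.
    """
    counts = (df[column].value_counts(dropna=False) for column in df.columns)
    return [vc for vc in counts if len(vc) <= max_uniques]


# Hypothetical data: "id" is high-cardinality, "color" is categorical.
items = pd.DataFrame({"id": range(12), "color": ["red", "blue", None] * 4})
stats = categorical_value_counts(items, max_uniques=3)
print(stats)
```

Note that with `dropna=False` a missing value counts as one of the unique values, so a column with two real values plus NaN has length 3 here.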
This PR (almost) implements the categorical and enum task and doesn't require any schema.
![Screenshot 2019-06-04 at 16 32 12](https://user-images.githubusercontent.com/10396557/58911718-6e248d00-86e6-11e9-88a0-64d1500f9f50.png)
![Screenshot 2019-06-04 at 16 32 54](https://user-images.githubusercontent.com/10396557/58911719-6ebd2380-86e6-11e9-8747-ca18756b6abf.png)
Basically, I plot all categorical data on one graph: one bar per field, and each bar shows the distribution of unique values. Categorical data is any column with < 11 (default for `max_uniques`) unique values. This rule can be used separately, where one can specify `max_uniques`.

I replaced the colours; initially it looked like:
![Screenshot 2019-05-29 at 18 43 58](https://user-images.githubusercontent.com/10396557/58596357-bc96df00-8241-11e9-8dce-ad1711946ae1.png)
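For readers who want to reproduce the "one bar per field" layout outside of arche, here is a hedged sketch of preparing the data for such a stacked bar chart. It assumes each bar is normalized to sum to 1, which the description does not state explicitly; `stacked_bar_data` and its output columns are hypothetical names, and the plotting step itself is left out.

```python
import pandas as pd


def stacked_bar_data(df, max_uniques=10):
    """Build long-form rows (field, value, share): one bar per field, one segment per value."""
    rows = []
    for column in df.columns:
        shares = df[column].value_counts(normalize=True, dropna=False)
        if len(shares) > max_uniques:
            continue  # too many unique values to treat the column as categorical
        for value, share in shares.items():
            rows.append({"field": column, "value": str(value), "share": float(share)})
    return pd.DataFrame(rows, columns=["field", "value", "share"])


# Hypothetical data: only "color" is low-cardinality enough to be plotted.
sample = pd.DataFrame({"id": range(12), "color": ["red", "blue", None] * 4})
bars = stacked_bar_data(sample, max_uniques=3)
print(bars)
```

A long-form table like this can be handed to most plotting libraries, using `field` for the bar, `value` for the segment colour, and `share` for the segment height.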