Plot categorical data #100

manycoding · 2019-05-29T22:44:27Z

This one implements the categorical and (almost) the enum task and doesn't require any schema.

Basically, I plot all categorical data on one graph: one bar per field, with each bar showing the distribution of unique values. A column counts as categorical if it has < 11 unique values (the default for max_uniques).
This rule can also be used separately, where one can specify max_uniques.
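
For illustration, here is a minimal sketch of the idea described above: pick the columns with few distinct values and compute the share of each value, which is what one bar would show. `categorical_shares` is a hypothetical helper written for this example, not arche's code, and counting missing values as a distinct value is an assumption.

```python
import pandas as pd


def categorical_shares(df: pd.DataFrame, max_uniques: int = 10) -> dict:
    """Map each low-cardinality column to the share of each of its values.

    Hypothetical helper for illustration only. Missing values are treated
    as a value of their own (an assumption).
    """
    shares = {}
    for column in df.columns:
        if df[column].nunique(dropna=False) <= max_uniques:
            shares[column] = df[column].value_counts(normalize=True, dropna=False)
    return shares


df = pd.DataFrame(
    {
        "availability": ["in stock", "in stock", "sold out", "in stock"],
        "url": ["a.com/1", "a.com/2", "a.com/3", "a.com/4"],
    }
)
shares = categorical_shares(df, max_uniques=3)
print(list(shares))  # "url" has 4 unique values, so it is skipped
print(shares["availability"])
```

Each entry in `shares` corresponds to one bar, and the values inside it to the segments of that bar.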

I replaced the colours, initially it looked like:

manycoding · 2019-05-29T22:45:00Z

@victor-torres @ejulio @raphapassini please check the layout and let me know how you like it.

codecov · 2019-05-29T22:47:55Z

Codecov Report

Merging #100 into master will increase coverage by 0.53%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
+ Coverage   78.45%   78.98%   +0.53%     
==========================================
  Files          23       23              
  Lines        1564     1599      +35     
  Branches      273      280       +7     
==========================================
+ Hits         1227     1263      +36     
  Misses        290      290              
+ Partials       47       46       -1

Impacted Files	Coverage Δ
src/arche/rules/result.py	`99.12% <100%> (+0.15%)`	⬆️
src/arche/rules/category.py	`100% <100%> (ø)`	⬆️
src/arche/arche.py	`85.21% <100%> (+0.1%)`	⬆️
src/arche/readers/items.py	`85.24% <100%> (+1.17%)`	⬆️
src/arche/report.py	`97.72% <0%> (+1.13%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 454822c...87c4579. Read the comment docs.

src/arche/readers/items.py

ivankivanov

Really interesting and helpful feature for me.
What do you think about extending it:

for > 11 unique elements as additional method and per column?
adding filter - for example filtering only records for a given category?

manycoding · 2019-05-30T15:24:50Z

for > 11 unique elements as additional method and per column?

How would it look per column? There's a category tag which does a similar thing. I think I got it: it's for when someone is interested in a particular distribution and wants to see only the plot. I am not sure if it's hard with plotly/cufflinks, let me check.

adding filter - for example filtering only records for a given category?

What is the use case here? This can be achieved with pandas.

ivankivanov · 2019-05-30T18:23:06Z

adding filter - for example filtering only records for a given category?
What is the use case here? This can be achieved with pandas.

The use case would be:

If you have products nested in different categories, you could check the distribution of the other fields. Let's say the category is food: check what the categorical data looks like among the other fields. This would give you the ability to dive deeper into the data.

For the first point, I didn't know about this option. Thanks.

manycoding · 2019-05-30T20:58:58Z

I just made an update introducing max_uniques and max_uniques_ratio:

If nothing is changed, report_all() will return a graph with the defaults:
arche.rules.category.get_categories(df, max_uniques=10, max_uniques_ratio=0.5)
Or, if someone wants to change the parameters:
arche.rules.category.get_categories(df[["very interesting column"]], max_uniques=100)
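
The thread does not spell out how the two thresholds combine. As a hedged sketch, assuming `max_uniques_ratio` caps the number of unique values relative to the number of rows and that both limits must hold, the check could look like this. `is_categorical` is a hypothetical name, not part of arche.

```python
import pandas as pd


def is_categorical(
    series: pd.Series, max_uniques: int = 10, max_uniques_ratio: float = 0.5
) -> bool:
    """Decide whether a column looks categorical.

    Assumption: the ratio is unique values divided by row count, and a
    column must satisfy both limits. This is a guess at the semantics.
    """
    if len(series) == 0:
        return False
    n_unique = series.nunique(dropna=False)
    return n_unique <= max_uniques and n_unique / len(series) <= max_uniques_ratio


colors = pd.Series(["red", "blue", "red", "blue"])   # 2 unique / 4 rows = 0.5
names = pd.Series(["ann", "bob", "cat", "ann"])      # 3 unique / 4 rows = 0.75
print(is_categorical(colors), is_categorical(names))
```

Under this reading the ratio guards against small datasets where almost every value is unique yet the absolute count is still below `max_uniques`.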

If you have products nested in different categories, you could check the distribution of the other fields. Let's say the category is food: check what the categorical data looks like among the other fields. This would give you the ability to dive deeper into the data.

With pandas it's easy:
arche.rules.category.get_categories(df[df["category"] == "food"][["column", "column2"]])
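
To make the filtering step concrete, here is a small pandas-only sketch with made-up data. The column names `category`, `brand` and `price` are invented for the example; the resulting frame is the kind of input the call above would receive.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["food", "food", "toys", "food"],
        "brand": ["acme", "acme", "zed", "bolt"],
        "price": [1.0, 2.0, 3.0, 4.0],
    }
)

# Keep only the rows of one category, and only the columns of interest.
food = df.loc[df["category"] == "food", ["brand"]]

# Distribution of the remaining field within that category.
brand_counts = food["brand"].value_counts()
print(brand_counts)
```

`df.loc[mask, columns]` does the row and column selection in one step, which is equivalent to the chained `df[mask][columns]` form used in the comment.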

manycoding · 2019-05-31T16:47:03Z

Percentages (distribution) or numbers as shown above?

manycoding · 2019-06-03T17:59:20Z

That's it, please review.
docs https://arche.readthedocs.io/en/plot_categorical_data/nbs/Rules.html
Docs seem broken at the moment #101

ejulio

Just minor comments

ejulio · 2019-06-03T19:32:11Z

src/arche/rules/result.py

+        """
+        data = []
+        for vc in values_counts:
+            data.extend(


Maybe a nested for loop would read better

Also, + might be better than .extend

victor-torres · 2019-06-18T20:02:39Z

src/arche/rules/category.py

+    """
+    result = Result("Categories")
+
+    result.stats = [


If there's not a cache, then you're calculating value_counts twice here.

Possible solution:

result.stats = map(lambda c: df[c].value_counts(dropna=False), df)
result.stats = [vc for vc in result.stats if len(vc) <= max_uniques]
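
A standalone variant of the suggestion above, as a sketch: a generator computes `value_counts` lazily, so each column is counted exactly once before the length filter is applied. `categorical_value_counts` is a hypothetical function name, and results are keyed by column so the example does not depend on how pandas names the returned Series.

```python
import pandas as pd


def categorical_value_counts(df: pd.DataFrame, max_uniques: int = 10) -> dict:
    """Compute value counts once per column and keep the small ones."""
    # The generator yields (column, counts) pairs one at a time, so no
    # column is counted twice.
    counted = (
        (column, df[column].value_counts(dropna=False)) for column in df
    )
    return {column: vc for column, vc in counted if len(vc) <= max_uniques}


df = pd.DataFrame({"color": ["red", "red", None], "id": [1, 2, 3]})
stats = categorical_value_counts(df, max_uniques=2)
print(list(stats))  # "id" has 3 unique values and is filtered out
```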

manycoding added 3 commits May 28, 2019 18:26

Categorize columns

71d0a18

Add categories rule, implements #17 #18

24141da

Refactor

3a6cb0b

manycoding added the Type: Rule label May 29, 2019

manycoding requested review from raphapassini, ejulio, victor-torres and ivankivanov May 29, 2019 22:44

ivankivanov reviewed May 30, 2019

ivankivanov approved these changes May 30, 2019

Add max_uniques and max_uniques_percentage parameters

32eae37

Rename max_uniques_ratio

034cfeb

manycoding added 7 commits May 31, 2019 16:55

Return only stats, add tests

ab44900

Output tensors comparison error

2a59592

Add plot builder for categories

0faffc0

Remove redundant ratio

a5e310f

Remove max_x from layout

be5045b

Check real values in test_report_all

0d325a7

Add example to rules notebook

87ca00f

ejulio approved these changes Jun 3, 2019

Truncated legend to max 30 symbols

87c4579

manycoding merged commit 56d8da3 into master Jun 3, 2019

manycoding deleted the plot_categorical_data branch June 3, 2019 23:06

manycoding mentioned this pull request Jun 5, 2019

Show boolean distribution graph for one job #55

Closed

manycoding mentioned this pull request Jun 18, 2019

value_counts is slow for nested columns #115

Closed

victor-torres reviewed Jun 18, 2019
