Plot categorical data #100

Merged: manycoding merged 13 commits into master from plot_categorical_data on Jun 3, 2019

Contributor

@manycoding manycoding commented May 29, 2019

This PR implements the categorical and (almost) the enum task and doesn't require a schema.
Screenshot 2019-06-04 at 16 32 12
Screenshot 2019-06-04 at 16 32 54

Basically, I plot all categorical data on one graph: one bar per field, with each bar showing the distribution of its unique values. A column counts as categorical if it has < 11 unique values (the default for max_uniques).
This rule can also be used separately, where one can specify max_uniques.

I replaced the colours; initially it looked like this:
Screenshot 2019-05-29 at 18 43 58

@manycoding
Contributor Author

@victor-torres @ejulio @raphapassini please check the layout and let me know how you like it.

@codecov

codecov bot commented May 29, 2019

Codecov Report

Merging #100 into master will increase coverage by 0.53%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
+ Coverage   78.45%   78.98%   +0.53%     
==========================================
  Files          23       23              
  Lines        1564     1599      +35     
  Branches      273      280       +7     
==========================================
+ Hits         1227     1263      +36     
  Misses        290      290              
+ Partials       47       46       -1
Impacted Files                 Coverage Δ
src/arche/rules/result.py      99.12% <100%> (+0.15%) ⬆️
src/arche/rules/category.py    100% <100%> (ø) ⬆️
src/arche/arche.py             85.21% <100%> (+0.1%) ⬆️
src/arche/readers/items.py     85.24% <100%> (+1.17%) ⬆️
src/arche/report.py            97.72% <0%> (+1.13%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 454822c...87c4579.

Member

@ivankivanov ivankivanov left a comment

This is a really interesting and helpful feature for me.
What do you think about extending it:

  • for > 11 unique elements as additional method and per column?
  • adding filter - for example filtering only records for a given category?

@manycoding
Contributor Author

manycoding commented May 30, 2019

  • for > 11 unique elements as additional method and per column?

How would it look per column? There's a category tag which does a similar thing. I think I get it: it's for when someone is interested in a particular distribution and wants to see only the plot. I am not sure whether that's hard with plotly/cufflinks; let me check.

  • adding filter - for example filtering only records for a given category?

What is the use case here? This can be achieved with pandas.

@ivankivanov
Member

adding filter - for example filtering only records for a given category?
What is the use case here? This can be achieved with pandas.

The use case would be:

If you have products nested in different categories, you could check the distribution of the other fields within one category. Let's say the category is food: check what the categorical data looks like among the other fields. This gives you the ability to dive deeper into the data.

As for the first point, I didn't know about this option. Thanks.

@manycoding
Contributor Author

I just made an update introducing max_uniques and max_uniques_ratio:

  1. If nothing is changed, report_all() returns the graph with the defaults:
    arche.rules.category.get_categories(df, max_uniques=10, max_uniques_ratio=0.5)
  2. Or if someone wants to change the parameters:
    arche.rules.category.get_categories(df[["very interesting column"]], max_uniques=100)
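
A minimal sketch of how the two thresholds could interact, written in plain pandas. This is not arche's implementation: `likely_categories` is a hypothetical helper, and reading `max_uniques_ratio` as "unique values divided by number of rows" is an assumption.

```python
import pandas as pd


def likely_categories(df, max_uniques=10, max_uniques_ratio=0.5):
    """Return {column: value_counts} for columns that look categorical.

    Hypothetical helper, not part of arche. Assumes a column qualifies when
    it has at most `max_uniques` distinct values AND its distinct values make
    up at most `max_uniques_ratio` of the rows.
    """
    if df.empty:
        return {}
    selected = {}
    for column in df.columns:
        # Compute value_counts once and reuse it for both checks.
        counts = df[column].value_counts(dropna=False)
        ratio = len(counts) / len(df)
        if len(counts) <= max_uniques and ratio <= max_uniques_ratio:
            selected[column] = counts
    return selected


df = pd.DataFrame(
    {
        "category": ["food", "food", "toys", "toys", "food", "toys"],
        "in_stock": [True, True, False, True, True, True],
        "sku": ["a1", "b2", "c3", "d4", "e5", "f6"],  # unique per row
    }
)
cats = likely_categories(df)
print(sorted(cats))  # "sku" is skipped because every row has its own value
```

Under these assumptions the ratio check keeps ID-like columns out of the plot even when a small dataset has fewer rows than `max_uniques`.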

If you have products nested in different categories, you could check the distribution of the other fields within one category. Let's say the category is food: check what the categorical data looks like among the other fields. This gives you the ability to dive deeper into the data.

  1. With pandas it's easy
    arche.rules.category.get_categories(df[df["category"] == "food"][["column", "column2"]])
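
The filtering step itself can be sketched in pandas alone. The frame and the column names `brand` and `size` below are made up for illustration; only the "filter rows by category, then pick columns" pattern comes from the line above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "category": ["food", "food", "toys", "food", "toys"],
        "brand": ["acme", "acme", "zed", "bolt", "zed"],
        "size": ["S", "M", "M", "S", "L"],
    }
)

# Keep only the rows of one category and only the columns of interest.
# .loc does the row filter and the column selection in a single step.
food = df.loc[df["category"] == "food", ["brand", "size"]]

# Distribution of each remaining field within the "food" category.
for column in food.columns:
    print(food[column].value_counts(dropna=False))
```

The resulting `food` frame is the kind of subset that would be passed to `get_categories` as in the call above.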

@manycoding
Contributor Author

Percentages (distribution) or numbers as shown above?
Screenshot 2019-05-31 at 12 46 35
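
For reference, the two options differ by a single pandas argument. This is a sketch on an invented series, not the plotting code from this PR:

```python
import pandas as pd

colours = pd.Series(["red", "blue", "red", None, "red"], name="colour")

# Absolute numbers per unique value; dropna=False keeps missing values as a bucket.
counts = colours.value_counts(dropna=False)

# Fractions of the total, summing to 1 (the "distribution" option).
shares = colours.value_counts(dropna=False, normalize=True)

# The same fractions expressed as percentages.
percentages = (shares * 100).round(1)

print(counts)
print(percentages)
```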

@manycoding
Contributor Author

manycoding commented Jun 3, 2019

That's it, please review.
Docs: https://arche.readthedocs.io/en/plot_categorical_data/nbs/Rules.html
The docs seem to be broken at the moment, see #101.

@ejulio ejulio left a comment

Just minor comments

"""
data = []
for vc in values_counts:
    data.extend(

Maybe a nested for loop would read better

Also, + might be better than .extend
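
To compare the styles being discussed, here is a sketch that flattens a list of value counts three ways. The reviewed snippet is truncated, so what it actually passes to `.extend` is not shown; the `(field, value, count)` tuple layout and the sample data are invented for illustration.

```python
import pandas as pd

values_counts = [
    pd.Series({"food": 3, "toys": 2}, name="category"),
    pd.Series({"yes": 4, "no": 1}, name="in_stock"),
]

# Variant 1: .extend with an inner generator.
data = []
for vc in values_counts:
    data.extend((vc.name, value, int(count)) for value, count in vc.items())

# Variant 2: an explicit nested for loop.
nested = []
for vc in values_counts:
    for value, count in vc.items():
        nested.append((vc.name, value, int(count)))

# Variant 3: list concatenation with +, building a new list on each iteration.
joined = []
for vc in values_counts:
    joined = joined + [(vc.name, value, int(count)) for value, count in vc.items()]

print(data == nested == joined)
```

All three build the same list; the choice is mostly about readability, although `+` copies the accumulated list on every iteration while `.extend` appends in place.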

@manycoding manycoding merged commit 56d8da3 into master Jun 3, 2019
@manycoding manycoding deleted the plot_categorical_data branch June 3, 2019 23:06
"""
result = Result("Categories")

result.stats = [

If there's no cache, then you're calculating value_counts twice here.

Possible solution:

result.stats = map(lambda c: df[c].value_counts(dropna=False), df)
result.stats = [vc for vc in result.stats if len(vc) <= max_uniques]
