wordcloud for categorical features #129

Closed
kodonnell opened this issue Jun 28, 2018 · 5 comments
Labels
feature request 💬 Requests for new features question/discussion ❓ Open discussion

Comments

@kodonnell

I've just come across this project, and it looks great - better than things we've built (internally) in the past. That said, one thing we found quite useful was having a wordcloud to display frequencies of unique categorical values (i.e. the size of the text corresponds to the frequency). For example:

[image: example wordcloud]

For numbers people, it'll be of lesser value - though you could argue it's easier to communicate the key results quickly and in less space. For non-technical people (e.g. clients) it's often a lot more useful. Other comments:

  • good: there are existing JS libraries so it should be easy to do. You can make them look really swish - even having them in a specific shape, etc.
  • bad: it can be hard to do 'well' in all cases e.g. handling long strings, scaling linearly vs logarithmically (depending on frequencies), do you display the text in each row as an entry or split it into words and count those, etc.
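As a rough illustration of two of the choices in the "bad" bullet (count whole entries or split into words, and scale linearly or logarithmically), here is a minimal Python sketch. The function names, the font-size range, and the sample data are all hypothetical and are not taken from pandas-profiling or any wordcloud library.

```python
import math
from collections import Counter


def count_values(rows, split_words=False):
    """Count whole cell values, or individual words if split_words is True."""
    if split_words:
        return Counter(word for row in rows for word in row.split())
    return Counter(rows)


def font_sizes(counts, min_size=10, max_size=60, log_scale=False):
    """Map each count onto [min_size, max_size], linearly or logarithmically."""
    transform = math.log1p if log_scale else float
    scaled = {value: transform(n) for value, n in counts.items()}
    low, high = min(scaled.values()), max(scaled.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all counts match
    return {
        value: min_size + (max_size - min_size) * (s - low) / span
        for value, s in scaled.items()
    }


# Hypothetical, heavily skewed "country" column.
rows = ["Australia"] * 900 + ["New Zealand"] * 90 + ["Fiji"] * 10
counts = count_values(rows)
linear = font_sizes(counts)
logarithmic = font_sizes(counts, log_scale=True)
```

With skewed data like this, linear scaling leaves the mid-frequency value almost as small as the rarest one, while log scaling spreads the sizes out. Splitting into words would also count "New" and "Zealand" separately, which is probably not what you want for a country column.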

I'm not recommending it per se - just an idea, and we'll see what others think.

@kodonnell kodonnell changed the title suggestion: wordcloud for features idea: wordcloud for categorical features Jun 28, 2018
@andyreagan

Hey this is a cool idea!

I would recommend using a graphic that shows information in a more accessible way than a word cloud, like even a histogram of the TF-IDF values by word.
(Or, scrambling the words on the histogram, you get the word cloud).
Here's an example of such a graphic: https://hedonometer.org/index.html?date=2018-07-04

@kodonnell
Author

I guess it depends on what you're trying to communicate. Wordclouds are great for showing a bunch of information in a pleasing way, and often communicating an 'obvious' but not exactly quantified message (e.g. for a wordcloud of country: "your customers live in 5 different countries, but mostly in Australia", or for gender "there are two genders, split roughly evenly"). For some purposes, they're a faster method of communication - e.g. I can skim over a wordcloud of gender in well under a second and extract all I care about (are M/F there, are they roughly even, etc.), and move on. Looking into normal graphs takes a little more work, and it's not obvious from skimming the graph (until you read e.g. the bar graph labels) what you're looking at. Wordclouds are also pretty easy - nearly anyone (non-technical included) will understand what they're trying to convey, without requiring previous experience to interpret the graph (which, e.g. the one you sent would require - as a technical person, after a 5 second skim I wasn't sure what it was saying, or what the assumptions were, etc.).

However, they're not good for anything more nuanced (i.e. where you want to do more than just skim over it - or you've got much more complicated data). For that, I agree that there are better solutions (including the one you mentioned). But, as above, it also depends on whether you're handling short or long strings, etc.

I guess it depends on the users of pandas-profiling.

@sbrugman sbrugman changed the title idea: wordcloud for categorical features wordcloud for categorical features May 29, 2019
@sbrugman sbrugman added the feature request 💬 Requests for new features label May 29, 2019
@sbrugman sbrugman added the getting started ☝ Straight-forward for beginning contributors label Jul 24, 2019
@pybokeh

pybokeh commented Jul 27, 2019

Word clouds look nice, but I'm not a fan of word clouds to be honest. Sure it is easy to spot the word with largest count frequency. But more difficult to discern the 2nd largest, 3rd largest, etc word frequency. I'd rather see a simple horizontal Pareto bar chart. Function over aesthetics especially for a library that is already having to do a lot of computations.
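For comparison, the horizontal Pareto bar chart suggested here can be sketched in a few lines of plain Python. This is only an illustration under assumed names (`pareto_rows`, `render`) and made-up data. It draws text bars rather than using a plotting library, and it is not how pandas-profiling renders anything.

```python
from collections import Counter


def pareto_rows(values, top_n=10):
    """Return (value, count, cumulative share) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for value, n in counts.most_common(top_n):
        running += n
        rows.append((value, n, running / total))
    return rows


def render(rows, width=40):
    """Draw one text bar per value, scaled so the largest count fills `width`."""
    largest = rows[0][1]
    label_width = max(len(str(value)) for value, _, _ in rows)
    lines = []
    for value, n, cumulative in rows:
        bar = "#" * max(1, round(width * n / largest))
        lines.append(f"{str(value):<{label_width}} {bar} {n} ({cumulative:.0%})")
    return "\n".join(lines)


# Hypothetical "first name" column.
names = ["John"] * 50 + ["Bob"] * 30 + ["Alice"] * 15 + ["Zed"] * 5
table = pareto_rows(names)
print(render(table))
```

Because the rows are sorted and carry a cumulative share, the 2nd and 3rd most frequent values are readable directly, which is the property a word cloud lacks.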

@sbrugman sbrugman added question/discussion ❓ Open discussion and removed getting started ☝ Straight-forward for beginning contributors labels Jul 28, 2019
@kodonnell
Author

"Function over aesthetics"

So, this needs a bit of care.

  1. If no one wants to look at outputs (aesthetics) it doesn't matter how functional it is. I made (make?) this mistake often, and was hence often the only one that used the tools I made.
  2. Functional for who? As above, wordclouds can be more functional at conveying a message to a particular audience than bar charts. I mean, why do you prefer a bar chart over the raw numbers in a table? (Or even the raw data?) Because it communicates a particular message quickly? Wordclouds can do that too, and sometimes better for a person, given communication depends on who's being communicated to. I guess the purpose is to have options so that "function" can be defined by the user, and they decide their best mode of communication.
  3. As in the above post (see more examples), wordclouds can be more functional, whoever the user. Maybe you prefer the bar chart over raw data because it "quickly" tells you "John" is about 5% more popular than "Bob" as a name (though you're OK not knowing if it's 4% or 6% which you'd get from raw data), whereas a wordcloud would just indicate they're both similar-ish (though you could read this more quickly). But a wordcloud could also tell me that there are only 10 different names, which a bar graph doesn't tell me. Or (more quickly than a bar chart) that most of the values are "John Doe" and everything else is immaterial. It also gives a lot more context about a column (very quickly), e.g. maybe this column is marked as numeric because all the values are "$10.00" etc. (which is harder to detect in a bar graph because it can appear like the units). In addition, when you're profiling 1000 columns, it's my experience that it's a lot easier to do this at speed with a wordcloud (e.g. often I'd just eyeball and look for any unexpected weird disparities - e.g. a lot of "Unknown" or weird imbalances - or expected disparities - e.g. only M/F for gender.). I.e. for me I sometimes find them more functional.
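The signals described in point 3 (how many distinct values there are, whether one value dominates, whether a column is really currency strings) can also be computed directly. This is a minimal sketch, assuming a hypothetical `skim_column` helper, an arbitrary 50% threshold for "dominated", and a simple regular expression for values like "$10.00". None of these come from pandas-profiling.

```python
import re
from collections import Counter

# Hypothetical pattern for values such as "$10.00" or "$5".
CURRENCY = re.compile(r"^\$\d+(\.\d{2})?$")


def skim_column(values, dominant_share=0.5):
    """Summarise a non-empty column the way a quick eyeball check would."""
    counts = Counter(values)
    total = sum(counts.values())
    top_value, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "top_value": top_value,
        "top_share": top_count / total,
        "dominated": top_count / total >= dominant_share,
        "all_currency": all(CURRENCY.match(str(v)) is not None for v in counts),
    }


names = skim_column(["John Doe"] * 8 + ["Jane Roe", "Sam Poe"])
prices = skim_column(["$10.00", "$5.50", "$10.00"])
```

Whether numbers like these or a picture communicates faster is exactly the audience question raised above. The sketch only shows that the same information is cheap to compute.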

Also, to confirm - I'm not overly sold on them. If I had to pick just one (for all text types) I'd probably go with a horizontal bar over a wordcloud.

"But more difficult to discern the 2nd largest, 3rd largest, etc word frequency"

I guess this depends on what you're looking at - for me, if I'm looking at a bunch of e.g. first names, all I really care about is that a) they all look like names and b) there aren't any massively over-represented values like "John Doe". I don't think I've ever had any use for "John" being slightly more popular than "Peter". Often the columns are industry specific codes, so I don't even know what they mean, and I wouldn't want to be digging that deep at a profiling stage.

Anyway, I guess where I'm coming from is that I don't want "function over aesthetics" to shut down productive discussion. (For example, if someone really wanted wordclouds enough - not me - they could turn some of the above points into a more structured proposal, and give examples of wordclouds against bar charts in different scenarios ... and give a concrete plan for how this could be incorporated into the UI etc.)

@github-actions

Stale issue

4 participants