Implement tf*idf-like word frequency counting #42
Comments
I took a first stab at using normalization-based word sorting: 503ab88

The problem right now is that there are many words that aren't in [...]. A combination of filtering words from [...]

All in all, I think this is a promising method for filtering out "uninteresting" words like [...]. A good temporary solution might be to just add words like [...]
I don't think [...]
Well, the problem is that it shows up on nearly every MUW as the top word. On Saturday, March 23, 2013, Bryce Boe wrote:
Randal S. Olson, Computer Science PhD Student
Oops, ignore the above commit. Wrong issue...
Divide the words by their usual occurrence frequency, something like tf*idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This would probably eliminate words like "people"/"person" from the word clouds, since they are commonly used in any context. What's more interesting for these word clouds are words that are used more often in the specific subreddit than in all subreddits combined.
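To make the idea concrete, here is a rough sketch of that kind of scoring. It is not code from this project: `relative_frequencies`, `score_words`, and the `min_background` guard are hypothetical names chosen for illustration, and it uses a plain frequency ratio (share of a word in one subreddit divided by its share across all subreddits) rather than textbook tf-idf.

```python
from collections import Counter


def relative_frequencies(words):
    """Map each word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def score_words(subreddit_words, all_words, min_background=1e-9):
    """Rank words by how over-represented they are in one subreddit.

    Hypothetical helper: the score is the word's relative frequency in the
    subreddit divided by its relative frequency in the combined corpus.
    min_background (an assumption) avoids dividing by zero for words that
    are missing from the combined corpus.
    """
    local = relative_frequencies(subreddit_words)
    background = relative_frequencies(all_words)
    scores = {
        word: freq / max(background.get(word, 0.0), min_background)
        for word, freq in local.items()
    }
    # Highest ratio first: words most specific to this subreddit.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this scoring, a word that is common everywhere (like "people") gets a ratio near or below 1 and sinks to the bottom, while words used mostly in the one subreddit rise to the top. Very rare words can get inflated scores, so a minimum count threshold would probably be needed in practice.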