Implement tf*idf-like word frequency counting #42

Closed
rhiever opened this issue Mar 10, 2013 · 5 comments
Comments

@rhiever
Owner

rhiever commented Mar 10, 2013

Divide each word's count by its usual occurrence frequency, something like tf*idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

This would probably eliminate words like "people"/"person" from the word clouds, since they are commonly used in any context. What's more interesting for these word clouds are words that are used more often in the specific subreddit than in all subreddits combined.
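A minimal sketch of that idea, assuming word counts are already available for one subreddit and for all subreddits combined. The function names and the toy counts are hypothetical and are not taken from the project's code; the score is a plain ratio of relative frequencies rather than textbook tf*idf.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw word counts into fractions of the total word count."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def distinctiveness(subreddit_counts, all_counts):
    """Score each word by how much more often it appears in one subreddit
    than in all subreddits combined (ratio of relative frequencies)."""
    sub_freq = relative_frequencies(subreddit_counts)
    all_freq = relative_frequencies(all_counts)
    # Assumption: every subreddit word is present in all_counts, because
    # all_counts is meant to include this subreddit's own text.
    return {word: sub_freq[word] / all_freq[word] for word in sub_freq}


# Toy counts, made up for illustration only.
subreddit = Counter({"people": 50, "esports": 30, "game": 20})
everything = Counter({"people": 500, "esports": 40, "game": 160, "cat": 300})

scores = distinctiveness(subreddit, everything)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these toy numbers, "people" scores about 1 (used as often in the subreddit as everywhere else) and falls to the bottom of the ranking, while "esports" rises to the top.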

@rhiever
Owner Author

rhiever commented Mar 23, 2013

I took a first stab at using normalization-based word sorting: 503ab88

The problem right now is that there are many words that aren't in common-word-freqs.txt that actually are common words, especially contractions (e.g., don't). I was tempted to just throw out all words that don't show up in common-word-freqs.txt, but then the script will miss out on all kinds of novel words that the subreddit could be using (e.g., esports).

A combination of filtering words from common-words.txt and normalization worked so-so. Some words that are actually relevant to the subreddit got normalized, whereas others (e.g., probably, IIRC) didn't. That made for a weird word cloud.
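A rough sketch of what such a combination could look like; this is not the code from commit 503ab88. It assumes the stop words and the frequency table have already been loaded into a set and a dict (the on-disk format of common-words.txt and common-word-freqs.txt is not described here), and `default_freq` is a hypothetical fallback for words missing from the table.

```python
def score_words(counts, stop_words, common_freqs, default_freq=1e-6):
    """Drop stop words, then divide each remaining count by the word's
    general-usage frequency.

    default_freq is a hypothetical fallback for words absent from the
    frequency table; its value strongly affects how unlisted words rank.
    """
    scores = {}
    for word, count in counts.items():
        key = word.lower()
        if key in stop_words:
            continue  # filtered outright, like an entry in common-words.txt
        scores[word] = count / common_freqs.get(key, default_freq)
    return scores


# Toy data, made up for illustration only.
counts = {"the": 900, "people": 120, "don't": 80, "esports": 40}
stop_words = {"the"}
common_freqs = {"the": 0.05, "people": 0.001}

scores = score_words(counts, stop_words, common_freqs, default_freq=0.0001)
```

In this toy run the contraction "don't" outranks both "esports" and "people" purely because it is missing from the table, which is the failure mode described above.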

All in all, I think this is a promising method for filtering out "uninteresting" words like people, but we're still hampered by not having a comprehensive common-word-freqs.txt. Maybe we could construct one by doing a huge scrape across reddit, rather than using an already established list (those are usually compiled from books).
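Building such a table from scraped text could be sketched as follows. The tokenizer, the function name, and the sample comments are all assumptions for illustration; the scraping itself is left out, and a usable table would need far more text than this.

```python
import re
from collections import Counter

# Keeps apostrophes so contractions such as "don't" stay one token.
TOKEN_RE = re.compile(r"[a-z']+")


def build_common_freqs(texts):
    """Build a word -> relative-frequency table from an iterable of
    comment strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Stand-in comments; a real table would need a very large scrape.
comments = [
    "I don't think people care",
    "People don't watch esports",
]
freqs = build_common_freqs(comments)
```

Because the table comes from reddit comments themselves, contractions and reddit-specific terms get real frequencies instead of falling through as unknown words.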

A good temporary solution might be to just add words like people to common-words.txt. Have there been any other "uninteresting" words that have been showing up regularly across many subreddits?

@bboe
Contributor

bboe commented Mar 23, 2013

> A good temporary solution might be to just add words like people to common-words.txt. Have there been any other "uninteresting" words that have been showing up regularly across many subreddits?

I don't think people is uninteresting. I think the stop-word list we have is perfectly suitable.

@rhiever
Owner Author

rhiever commented Mar 23, 2013

Well, the problem is that it shows up on nearly every MUW as the top word.


Randal S. Olson

Computer Science PhD Student
Michigan State University
E-mail: rso@randalolson.com
http://www.randalolson.com

@rhiever
Owner Author

rhiever commented Nov 23, 2013

Oops, ignore the above commit. Wrong issue...
