Implement tf*idf-like word frequency counting #42

rhiever · 2013-03-10T18:44:05Z

Divide the words by their usual occurrence frequency, something like tf*idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

This would probably eliminate words like "people"/"person" from the word clouds, since they are commonly used in any context. What's more interesting for these word clouds are words that are used more often in the specific subreddit than in all subreddits combined.
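A minimal sketch of that idea, assuming the per-subreddit word counts and a table of usual (background) relative frequencies are already available as dictionaries. The function name `relative_scores` and the sample numbers are hypothetical, not taken from the script.

```python
from collections import Counter


def relative_scores(subreddit_counts, background_freqs):
    """Score each word by its subreddit frequency divided by its usual frequency.

    Words missing from background_freqs are skipped (an assumption of this sketch).
    """
    total = sum(subreddit_counts.values())
    scores = {}
    for word, count in subreddit_counts.items():
        usual = background_freqs.get(word)
        if not usual:
            continue  # no background frequency known for this word
        scores[word] = (count / total) / usual
    return scores


# Made-up numbers, for illustration only.
counts = Counter({"people": 50, "zerg": 30, "game": 20})
background = {"people": 0.05, "zerg": 0.0005, "game": 0.01}

scores = relative_scores(counts, background)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these numbers, "people" has the highest raw count but ends up ranked last, because it is about as common in the subreddit as it is everywhere else.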

rhiever · 2013-03-11T21:14:00Z

Links for common word frequencies:

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/10001-20000

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/20001-30000

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/30001-40000

rhiever · 2013-03-23T04:40:20Z

I took a first stab at using normalization-based word sorting: 503ab88

The problem right now is that there are many words that aren't in common-word-freqs.txt that actually are common words, especially contractions (e.g., don't). I was tempted to just throw out all words that don't show up in common-word-freqs.txt, but then the script will miss out on all kinds of novel words that the subreddit could be using (e.g., esports).

A combination of filtering words from common-words.txt and normalization worked so-so. Some words that are actually relevant to the subreddit got normalized, whereas others (e.g., probably, IIRC) didn't. That made for a weird word cloud.
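Roughly, the combination looks like the sketch below. It is not the code from the commit: `score_words`, the `floor` parameter, and the sample data are all hypothetical. The floor is one possible way to keep words that are missing from the frequency list instead of throwing them out, and its value strongly influences how high those words rank.

```python
def score_words(counts, background_freqs, stop_words, floor=1e-6):
    """Drop stop words, then divide each remaining count by its usual frequency.

    Words absent from background_freqs fall back to `floor` (an assumed value),
    so novel words keep a score rather than being discarded.
    """
    scores = {}
    for word, count in counts.items():
        key = word.lower()
        if key in stop_words:
            continue  # filtered outright, e.g. contractions added by hand
        scores[word] = count / background_freqs.get(key, floor)
    return scores


# Made-up data, for illustration only.
counts = {"people": 40, "don't": 25, "esports": 5}
background = {"people": 0.002}
stop_words = {"don't"}

scores = score_words(counts, background, stop_words)
```

Here "don't" is removed by the stop-word filter, "people" is normalized by its known frequency, and "esports" gets the floor, which pushes it above "people" despite the much lower count. Any common word missing from both lists would be boosted the same way, which is the weakness described above.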

All in all, I think this is a promising method for filtering out "uninteresting" words like people, but we're still hampered by not having a comprehensive common-word-freqs.txt. Maybe we could construct one from doing a huge scrape across reddit, rather than using an already established version (usually taken from books).
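Building such a list could look something like the sketch below, assuming the comment text from the scrape has already been collected into a list of strings. The tokenizer pattern and the function name `build_background_freqs` are assumptions, and the on-disk format of common-word-freqs.txt is deliberately left out.

```python
import re
from collections import Counter

# Keeps an apostrophe suffix so contractions like "don't" stay whole.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def build_background_freqs(texts):
    """Return each word's share of all words seen across the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Tiny stand-in for a reddit-wide scrape.
sample = ["I don't think people play esports", "People don't agree"]
freqs = build_background_freqs(sample)
```

Because the table comes from reddit text rather than books, contractions and reddit-specific words end up with real frequencies instead of being missing.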

A good temporary solution might be to just add words like people to common-words.txt. Have there been any other "uninteresting" words that have been showing up regularly across many subreddits?

bboe · 2013-03-23T06:24:43Z

A good temporary solution might be to just add words like people to common-words.txt. Have there been any other "uninteresting" words that have been showing up regularly across many subreddits?

I don't think people is uninteresting. I think the stop-word list we have is perfectly suitable.

rhiever · 2013-03-23T13:43:58Z

Well, the problem is that it shows up on nearly every MUW as the top word.

Randal S. Olson

Computer Science PhD Student
Michigan State University
E-mail: rso@randalolson.com
http://www.randalolson.com

rhiever · 2013-11-23T15:09:33Z

Oops, ignore the above commit. Wrong issue...

rhiever closed this as completed in 998bfd4 Nov 23, 2013
