Press & Presentations
BuzzFeed News : Meet The Man Who Gamed Reddit With A Bot
from the author Travis Hoppe
It is an exciting time right now if you're interested in Machine Learning. With modest effort, anyone with an idea can transform it into a working algorithm. I've been a fan of the subreddit r/today-I-learned and I always found it interesting that top posts would build upon my current knowledge and append a new factoid. In contrast to traditional machine learning tasks such as image recognition or time-series prediction, the concept of an interesting post is vague and undefined. This makes it an exciting topic to study!
The metric for a successful post on reddit is the upvote. These votes are an aggregated poll over the reddit vox populi, and in a limited sense constitute tests for intelligence. In the TIL subreddit especially, this requires higher order cognitive skills from the Bloom Taxonomy like Knowledge, Synthesis and Evaluation. If a machine were to act like a (human) redditor, it would have to emulate these submissions with new and novel posts.
In this context
u/possible_urban_king passes the Turing test.
Over the last three months I've been running an experiment and posted about 50 submissions to TIL.
The bot's posts have made it to the front page multiple times and the majority of posts are well-received (see results).
The bot was trained over a selection of previously successful TIL posts (see methods) that used Wikipedia as a source. Classification worked well, sometimes too well. I found that media characters (books, movies, etc...) were disproportionately tagged as interesting. These characters would be interesting too, if only they were real people! Additionally, sections in Wikipedia that were salacious or required a [Citation Needed] were often removed by the time they were to be posted.
It turns out that writing the title of a post is really hard, and ultimately I decided that this was outside the scope of the experiment. In all of the posts, I wrote the title and submitted by hand. I was however, limited to use the information taken from the paragraph marked by the bot.
- Which algorithm/classifier?
- Why the name
It's a colorless green idea.
In the interests of scientific reproducibility, all of the code used in the experiment is hosted in this project. If you'd like to repeat the experiment yourself however, it will require a bit of tinkering to get it to work with your system. A zipped sqlite3 database of the raw paragraphs marked as interesting can be found in db/report.db.bz2. Feel free to fork and do whatever you like with this repo as long you follow the CC Attribution 3.0 license.
Supervised machine learning requires a massive tagged collection of high-quality data to be effective. Fortunately the past submissions of to r/TIL have done just that. Redditors have carefully curated a selection of posts that they collectively find interesting through their voting system. We can filter these posts to just those that point to Wikipedia as a source. This way, the source of each post uses a somewhat standardized language and grammar.
Initially I started with the top 1000 posts of all-time (due to an API restriction in reddit's search) using praw. Ultimately however, I extended that to all posts that had a score of > 1000 in the years 2013 and 2014 (resulting in about 5000 quality TIL posts) using an alternate database.
From here it is relatively easy to download a parsed down versed of the wiki page linked to by the reddit post.
Somehow, we have to link the pithy one-line TIL title to the correct paragraph in the Wikipedia article. This is a non trivial task, as simple word frequencies are not enough. Ultimately I settled on a sort of "word-entropy". That is, each paragraph was stripped to it's unique words and these sets all formed a frequency vector for each paragraph. These vectors were normalized so that the unique words in each paragraph carried more weight. Then we took the title of the TIL post and compared it to the vectors of each paragraph settling on the paragraph with the closest match. This turns out to work surprisingly well.
Additionally, I saved the non-matching paragraphs as some useful false positives.
The next step was to prep the Wikipedia corpus.
Using a full XML corpus of Wikipedia (not provided and parsed with
bs4), I tokenized and stemmed each paragraph of text for each article. This uses both
nltk for the word tokenization & stop words and the porter2 stemmer from the aptly named package
This creates a rather massive SQLite database with each paragraph and the associated meta-data (like title, paragraph number, word-entropy, ...). Since there are many millions of assorted paragraphs (and I assume very few of them are interesting), I am going to use a random sampling of some of these as True Negatives in my machine learning.
Initially, I experimented with a simple word frequency as my feature vector. While this works for toy problems, the corpus of Wikipedia needed a smarter way to condense down the data. Fortunately, a neat textual feature generator, Word2Vec (developed by Google) is available in
Using Word2Vec requires two complete passes over the data, though it allows you to use an iterator making the memory requirements rather small.
Here, perhaps lies the most contentious part of the project, the construction of the classifier. In the end, I settled for the Extremely Random Trees implementation in
scikit-learn. This classifier, while fairly poor at detecting new true positives at about 10%, was extremely proficient at marking the true negatives. Since the assumption is that most of Wikipedia is, in fact, quite boring, this will help narrow down the results immensely.
Training classifier Test Accuracy: 0.878 Test Accuracy on TP: 0.116 Test Accuracy on TN: 0.998
With the classifier solved, the next step is score each and every paragraph in Wikipedia. The classifier marks about 6 per 10000 as potential candidates.
With the positives marked, we need to prepare the potentially interesting things to a human-readable format! Report starts building a new database that contains only the positive entries and the associated Wikipedia text from the original source.
Nobody likes a repost (unless it's better, or more aptly timed...), so we need to find out what has already been posted to reddit.
To do so, we need a proper search name of the Wikipedia article.
mediawiki-utils can do this, but stupidly requires python3.
Thus the cross-reference program makes a system call to properly encode name as a search query for reddit.
We then take the top search result (if exists) and store it; this info will serve as the criteria for a post/repost.
With the potential TIL candidates identified, let's find the best time to post! Note that we are going to posit that post time has a casual relationship with the ultimate score. Since reddit is dynamic and viewership is dependent on a steady-stream of upvotes, this should be a reasonable assumption. Going back over our training set, we can map the distribution of times for a r/TIL post:
it seems like the sweet spot for a submission is between 9AM-11AM!
What about the bottom r/TIL posts, those that had a score of < 1000? Considering only the ones we found with our algorithm, the posting time is dramatically different:
Since we are going to have a few false positives, I setup a simple script to help determine quality TIL's. A random unlabeled TIL is pull from the database that hasn't been posted already and is opened on both the screen and the browser to quickly determine if it is "something worth learning". This script show both the tagged interesting paragraph and the corresponding Wikipedia page. There is a simple prompt that allows you to mark an item to post later.