A demonstration of Bayesian filtering performance in Ruby.
Quick and dirty for benchmarking, no tests, sue me.
Clone this repository and run bundle install.
Before you can run your tests you need to create a corpus for the Bayesian filter to work from.
To do that, save a set of spam and ham emails in the appropriate directories, for example ham emails in the
training/ham folder. (You can copy a whole folder structure
if you wish; the trainer looks through the folders recursively.)
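The recursive lookup described above could be sketched roughly as follows. This is only an illustration, not the repository's trainer: the helper name training_files is hypothetical, and it assumes every regular file under the folder is one email.

```ruby
# Hypothetical helper (not part of this repository): returns every regular
# file under +dir+, however deeply nested, in sorted order.
def training_files(dir)
  Dir.glob(File.join(dir, "**", "*")).select { |path| File.file?(path) }.sort
end

# Example: list the ham emails if the folder exists.
if Dir.exist?("training/ham")
  puts "#{training_files('training/ham').size} ham emails found"
end
```

Note that Dir.glob skips dotfiles by default, which is usually what you want for a folder of emails.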
A good sample dataset to use is the Enron Email Dataset available at http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz ( http://www.cs.cmu.edu/~enron/ )
This dataset has approximately 19,000 ham emails and 33,000 spam emails.
Once you have created the folders run
to create your corpus. This may take a while depending on the size of your source data. Once it has finished, you should have a corpus file of about 17MB.
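For readers unfamiliar with what such a corpus holds, here is a minimal sketch of a naive Bayes corpus: per-class word counts plus a classifier with add-one smoothing. The class name TinyCorpus and its methods are assumptions made for illustration; the filter in this repository may tokenize, store, and score differently.

```ruby
# Hypothetical corpus (illustration only): per-class word counts plus the
# number of documents seen for each class.
class TinyCorpus
  LABELS = %i[spam ham].freeze

  def initialize
    @word_counts = LABELS.to_h { |label| [label, Hash.new(0)] }
    @doc_counts  = LABELS.to_h { |label| [label, 0] }
  end

  # Lowercase the text and split it into simple word tokens.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end

  # Record one email under the given label (:spam or :ham).
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each { |word| @word_counts[label][word] += 1 }
  end

  # Log-probability of the text under a label, with add-one smoothing so
  # that unseen words do not drive the probability to zero.
  def score(label, text)
    counts = @word_counts[label]
    total  = counts.values.sum
    vocab  = [@word_counts.values.flat_map(&:keys).uniq.size, 1].max
    prior  = Math.log((@doc_counts[label] + 1.0) / (@doc_counts.values.sum + LABELS.size))
    tokenize(text).sum(prior) do |word|
      Math.log((counts.fetch(word, 0) + 1.0) / (total + vocab))
    end
  end

  # Pick the label with the higher score.
  def classify(text)
    LABELS.max_by { |label| score(label, text) }
  end
end

corpus = TinyCorpus.new
corpus.train(:spam, "buy cheap pills now")
corpus.train(:ham,  "meeting agenda for monday")
puts corpus.classify("cheap pills")
```

Working in log space avoids floating-point underflow when an email contains many words.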
To test the filter's performance, we take emails from the original dataset and feed them through the filter. This can be done using rake; for 1,000 emails use
for 10,000 emails use
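As a rough idea of what such a performance run measures, the sketch below times a classifier over a batch of emails using Ruby's standard Benchmark module. The helper name measure_throughput and the stand-in classifier are hypothetical; the rake tasks in this repository may measure and report things differently.

```ruby
require "benchmark"

# Hypothetical helper (illustration only): runs the given block once per
# email and reports the elapsed wall-clock time and throughput.
def measure_throughput(emails, &classifier)
  elapsed = Benchmark.realtime do
    emails.each { |email| classifier.call(email) }
  end
  { emails: emails.size, seconds: elapsed, per_second: emails.size / elapsed }
end

# Stand-in classifier: any object that maps an email body to a label works.
sample = Array.new(1000) { |i| "message number #{i} about cheap pills" }
result = measure_throughput(sample) { |email| email.include?("pills") ? :spam : :ham }
puts format("%d emails in %.4fs", result[:emails], result[:seconds])
```

Swapping the stand-in block for a call into the real filter gives a comparable number for each batch size.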
- Fork it ( https://github.com/[my-github-username]/dont_bayes_me_bro/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request