sausheong/naive-bayes

Merge pull request #1 from vpereira/master

`ruby 1.9 + doc + Gemfile`
2 parents 6ff9794 + 6595085 commit b910e1a34ff077872b5db7f71d8142abb1a7cf46 committed
Showing with 197 additions and 205 deletions.
1. +3 −0 Gemfile
2. +10 −0 Gemfile.lock
3. +4 −3 bayes_test.rb
4. +2 −0 naive_bayes.rb
3 Gemfile
@@ -0,0 +1,3 @@
+source :rubygems
+
+gem "stemmer"
10 Gemfile.lock
@@ -0,0 +1,10 @@
+GEM
+  remote: http://rubygems.org/
+  specs:
+    stemmer (1.0.1)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  stemmer
@@ -1,6 +1,6 @@
- I first learnt about probability when I was in secondary school. As with all the other topics in Maths, it was just another bunch of formulas to memorize and regurgitate to apply to exam questions. Although I was curious if there was any use for it beyond calculating the odds for gambling, I didn't manage to find out any. As with many things in my life, things pop up at unexpected places and I stumbled on it again as I started on machine learning and naive Bayesian classifiers.
+I first learnt about probability when I was in secondary school. As with all the other topics in Maths, it was just another bunch of formulas to memorize and regurgitate to apply to exam questions. Although I was curious if there was any use for it beyond calculating the odds for gambling, I didn't manage to find out any. As with many things in my life, things pop up at unexpected places and I stumbled on it again as I started on machine learning and naive Bayesian classifiers.
-A classifier is exactly that -- it's something that classifies other things. A classifier is a function that takes in a set of data and tells us which category or classification the data belongs to. A naive Bayesian classifier is a type of learning classifier, meaning that you can continually train it with more data and it will be better at its job. It's called Bayesian because it uses Bayes' Law, a mathematical theorem that talks about conditional probabilities of events, to determine how to classify the data. The classifier is called 'naive' because it assumes each event (in this case the data) to be totally unrelated to the others. That's a very simplistic view, but in practice it has proven to be surprisingly accurate. Also, because it's relatively simple to implement, it's quite popular. Among its more well-known uses are email spam filters.
+A classifier is exactly that -- it's something that classifies other things.
+A classifier is a function that takes in a set of data and tells us which category or classification the data belongs to. A naive Bayesian classifier is a type of learning classifier, meaning that you can continually train it with more data and it will be better at its job. It's called Bayesian because it uses [Bayes Law](http://en.wikipedia.org/wiki/Bayes%27_theorem), a mathematical theorem that talks about conditional probabilities of events, to determine how to classify the data. The classifier is called 'naive' because it assumes each event (in this case the data) to be totally unrelated to the others. That's a very simplistic view, but in practice it has proven to be surprisingly accurate. Also, because it's relatively simple to implement, it's quite popular. Among its more well-known uses are email spam filters.

So what's Bayes' Law and how can it be used to categorize data? As mentioned, Bayes' Law describes conditional probabilities. An example of conditional probability is the probability of an event A happening given that another event B has happened. This is usually written as Pr(A | B), which is read as the probability of A, given B. To classify a document, we ask -- given a particular text document, what's the probability that it belongs to this category? When we find the probabilities of the given document in all categories, the classifier picks the category with the highest probability and announces it as the winner, that is, the document most probably belongs to that category.

@@ -25,141 +25,125 @@
In other words, the probability that a document exists, given a category, is the

Now that we know Pr(document|category) let's look at Pr(category). This is simply the probability of any document being in this category (instead of being in another category). This is the number of documents used to train this category over the total number of documents used to train all categories.
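To make the conditional-probability idea concrete, here is a small numeric sketch in Ruby. The spam/"free" percentages are invented purely for illustration and are not from the article:

```ruby
# Bayes' Law: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)
# Hypothetical numbers: 30% of all mail is spam; the word "free"
# appears in 60% of spam and in 10% of non-spam mail.
pr_spam      = 0.3
pr_free_spam = 0.6  # Pr("free" | spam)
pr_free_ham  = 0.1  # Pr("free" | not spam)

# Pr("free") via the law of total probability
pr_free = pr_free_spam * pr_spam + pr_free_ham * (1 - pr_spam)

# Pr(spam | "free") -- the conditional probability we actually want
pr_spam_free = pr_free_spam * pr_spam / pr_free
puts pr_spam_free.round(2)  # => 0.72
```

So even though only 30% of mail is spam, seeing the word "free" raises the odds of spam to 72% -- the same flip from Pr(word|category) to Pr(category|word) that the classifier below relies on.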
So that's the basic idea behind naive Bayesian classifiers. With that, I'm going to show you how to write a simple classifier in Ruby. There is already a rather popular Ruby implementation by Lucas Carlson called the Classifier gem (http://classifier.rubyforge.org) which you can use readily, but let's write our own classifier instead. We'll be creating a class named NaiveBayes, in a file called naive_bayes.rb. This classifier will be used to classify text into different categories. Let's recap how this classifier will be used:

1. First, tell the classifier how many categories there will be
2. Next, train the classifier with a number of documents, while indicating which category those documents belong to
3. Finally, pass the classifier a document and it should tell us which category it thinks the document should be in

Now let's run through the public methods of the NaiveBayes class, which should map to the 3 actions above:

1. Provide the categories you want to classify the data into
2. Train the classifier by feeding it data
3. Do the real work, that is, classify given data
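The three public methods above could look like this in miniature. This is a self-contained sketch, not the actual naive_bayes.rb from the repository: the method names (`train`, `probabilities`, `classify`) and the array-of-categories constructor are taken from the calls visible in bayes_test.rb, but the real implementation also stems words with the stemmer gem, and the 0.1 floor for unseen words is an assumption of this sketch:

```ruby
# Minimal naive Bayesian text classifier (sketch, no stemming).
class NaiveBayes
  # 1. Provide the categories you want to classify the data into
  def initialize(categories)
    @words = Hash.new { |h, k| h[k] = Hash.new(0) }  # category => word => count
    @docs  = Hash.new(0)                             # category => document count
    categories.each { |c| @words[c]; @docs[c] = 0 }
  end

  # 2. Train the classifier by feeding it a labelled document
  def train(category, text)
    @docs[category] += 1
    tokenize(text).each { |word| @words[category][word] += 1 }
  end

  # Pr(category|document) up to a constant factor, for every category:
  # Pr(category) times the product of Pr(word|category) over the words.
  def probabilities(text)
    total_docs = @docs.values.reduce(0, :+).to_f
    @docs.keys.each_with_object({}) do |category, probs|
      total_words = @words[category].values.reduce(0, :+).to_f
      score = @docs[category] / total_docs
      tokenize(text).each do |word|
        count = @words[category][word]
        # small floor for unseen words so one zero doesn't kill the product
        score *= count > 0 ? count / total_words : 0.1 / total_words
      end
      probs[category] = score
    end
  end

  # 3. Do the real work: the category with the highest probability wins
  def classify(text)
    probabilities(text).max_by { |_, p| p }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

b = NaiveBayes.new(["interesting", "not_interesting"])
b.train("interesting", "Google search hit by human error, results flagged harmful")
b.train("not_interesting", "Cloudy weather today with light rain expected")
puts b.classify("Google flagged search results")  # prints "interesting"
```

The per-category word counts are all the "learning" there is: training just increments counters, and classification turns those counters back into probabilities, which is why this kind of classifier can be trained incrementally.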
7 bayes_test.rb
@@ -1,7 +1,8 @@
-require 'naive_bayes'
+#encoding: utf-8
+require './naive_bayes'
 require 'pp'
-b = NaiveBayes.new("interesting", "not_interesting")
+b = NaiveBayes.new(["interesting", "not_interesting"])
 b.train("interesting","'Human error' hits Google search Google's search service has been hit by technical problems, with users unable to access search results. For a period on Saturday, all search results were flagged as potentially harmful, with users warned that the site \"may harm your computer\". Users who clicked on their preferred search result were advised to pick another one. Google attributed the fault to human error and said most users were affected for about 40 minutes. \"What happened? Very simply, human error,\" wrote Marissa Mayer, vice president, search products and user experience, on the Official Google Blog. The internet search engine works with stopbadware.org to ascertain which sites install malicious software on people's computers and merit a warning. Stopbadware.org investigates consumer complaints to decide which sites are dangerous. The list of malevolent sites is regularly updated and handed to Google. When Google updated the list on Saturday, it mistakenly flagged all sites as potentially dangerous. \"We will carefully investigate this incident and put more robust file checks in place to prevent it from happening again,\" Ms Mayer wrote.")

@@ -79,4 +80,4 @@
 pp b.probabilities(text7)
 puts "text7 > " + b.classify(text7)
-puts "text7 > " + b.classify(text7)
+puts "text7 > " + b.classify(text7)
2 naive_bayes.rb
@@ -1,4 +1,6 @@
+#encoding: utf-8
 require 'rubygems'
+require "bundler/setup"
 require 'stemmer'

 class NaiveBayes