GitHub - yob/classifier: Experiments in text classification

Experiments in text classification, inspired by reading an introduction to Naive Bayes.

My focus was on learning the fundamental algorithms rather than building a good general-purpose classification library. There are tools out there that are suitable for real-world use; this is not one of them.

Usage

Basic Naive Bayes classification:

require 'classifier'

classifier = Classifier::NaiveBayes.new(categories: ["spam", "ham"])
classifier.train(category: "spam", features: ["cheap", "viagra"])
classifier.train(category: "ham", features: ["cheap", "fruit"])
puts classifier.classify("fruit").inspect
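
For readers new to the algorithm, here is a minimal stand-alone sketch of how a Naive Bayes score could be computed for the example above. It does not use this library, and it assumes add-one (Laplace) smoothing and log probabilities; the library's actual formula may differ. The method names `train` and `scores` are hypothetical.

```ruby
# Hypothetical sketch of Naive Bayes scoring; not this library's implementation.
# Assumes add-one (Laplace) smoothing, which the library may or may not use.

# Builds per-category feature counts and per-category document counts.
def train(docs)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => feature => count
  doc_totals = Hash.new(0)                         # category => number of documents
  docs.each do |category, features|
    doc_totals[category] += 1
    features.each { |f| counts[category][f] += 1 }
  end
  [counts, doc_totals]
end

# Returns category => log score for the given features.
def scores(counts, doc_totals, features)
  vocab_size = counts.values.flat_map(&:keys).uniq.size
  total_docs = doc_totals.values.sum
  doc_totals.keys.to_h do |category|
    features_in_category = counts[category].values.sum
    # log prior, then one smoothed log likelihood per feature
    log_score = Math.log(doc_totals[category].to_f / total_docs)
    features.each do |f|
      log_score += Math.log((counts[category][f] + 1.0) / (features_in_category + vocab_size))
    end
    [category, log_score]
  end
end

counts, doc_totals = train([["spam", %w[cheap viagra]], ["ham", %w[cheap fruit]]])
result = scores(counts, doc_totals, ["fruit"])
best = result.max_by { |_, score| score }.first
puts best # prints "ham"
```

With three distinct features in total, "fruit" scores (1 + 1) / (2 + 3) under ham and (0 + 1) / (2 + 3) under spam, so ham wins when the priors are equal.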

Retraining

An optional doc_id argument can be provided to the train method. On the second and subsequent times a doc_id is provided, the existing data for that document is replaced.

This allows for workflows where a human might re-classify documents over time.

classifier = Classifier::NaiveBayes.new(categories: ["spam", "ham"])
classifier.train(doc_id: "1", category: "spam", features: ["fruit", "vegetables"])
classifier.train(doc_id: "2", category: "spam", features: ["cheap", "viagra"])
classifier.train(doc_id: "3", category: "ham", features: ["cheap", "fruit"])
classifier.train(doc_id: "1", category: "ham", features: ["fruit", "vegetables"])
puts classifier.classify("fruit").inspect
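
The effect of retraining on the underlying counts can be illustrated without the library. The sketch below keys documents by doc_id in a plain Hash, so a second call with the same id overwrites the first. The `train` lambda and the Hash layout are assumptions made for illustration, not the library's internals.

```ruby
# Hypothetical sketch: documents keyed by doc_id, so retraining overwrites.
docs = {}
train = lambda do |doc_id:, category:, features:|
  docs[doc_id] = { category: category, features: features }
end

train.call(doc_id: "1", category: "spam", features: %w[fruit vegetables])
train.call(doc_id: "2", category: "spam", features: %w[cheap viagra])
train.call(doc_id: "3", category: "ham", features: %w[cheap fruit])
# Same doc_id as the first call: the earlier "spam" entry is replaced.
train.call(doc_id: "1", category: "ham", features: %w[fruit vegetables])

# Feature tallies per category after the re-classification.
per_category = docs.values
                   .group_by { |doc| doc[:category] }
                   .transform_values { |ds| ds.flat_map { |doc| doc[:features] }.tally }

puts per_category.inspect
```

After the last call, "fruit" and "vegetables" count only towards ham, and the store still holds three documents rather than four.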

Testing Classifier Accuracy

To test the performance of a classifier, first build a CSV data file with human classified features like this:

category1,feature1,feature2
category2,feature1,feature3

Then run the test harness:

ruby -I lib bin/classify-nb input.csv

To run multiple files through the test harness:

find . -name "input*.csv" | xargs -L1 ruby -I lib bin/classify-nb
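
As an illustration of the input format, the following sketch parses rows of the shape shown above and measures the fraction of rows a predictor labels correctly. It is not the code in bin/classify-nb, whose internals are not described here. The helper names `parse_rows` and `accuracy` are hypothetical, and the plain `split` ignores CSV quoting.

```ruby
# Hypothetical sketch; not the test harness shipped in bin/classify-nb.

# Parses "category,feature1,feature2" lines. Plain split is a simplification
# that does not handle quoted fields.
def parse_rows(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    category, *features = line.split(",")
    { category: category, features: features }
  end
end

# Fraction of rows for which the block's prediction matches the human label.
def accuracy(rows)
  return 0.0 if rows.empty?
  correct = rows.count { |row| yield(row[:features]) == row[:category] }
  correct.to_f / rows.size
end

rows = parse_rows("category1,feature1,feature2\ncategory2,feature1,feature3\n")
# Stand-in predictor that always answers "category1".
puts accuracy(rows) { |_features| "category1" } # prints 0.5
```

In practice the block would call a trained classifier instead of the stand-in predictor.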

Custom Persistence

By default, classifier data is stored in memory and will be lost when the Ruby process ends. That's probably fine for experimenting, but to use a classifier in a real-world situation you'll likely want to store the data somewhere persistent like a database or file.

my_store = MyCustomStore.new
classifier = Classifier::NaiveBayes.new(store: my_store, categories: ["spam", "ham"])
classifier.train(category: "spam", features: ["cheap", "viagra"])
classifier.train(category: "ham", features: ["cheap", "fruit"])
puts classifier.classify("fruit").inspect

A store object MUST implement the following contract.

# Stores a new document. The return value is ignored.
add_document(doc_id, category, features)

# Return a BigDecimal that indicates the number of times 'feature' occurred
# in documents classified in 'category'
count_feature_in_category(category, feature)

# Return a BigDecimal that indicates the total number of features in documents
# classified in 'category'
count_features_in_category(category)

# Return a BigDecimal that indicates the number of unique features seen in
# any document from all categories
count_uniq_features

# Return a BigDecimal indicating the number of documents that have been
# classified in 'category'
count_documents_in_category(category)

# Return a BigDecimal indicating the number of documents that have been
# classified in any category
count_documents
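
To make the contract concrete, here is a minimal sketch of a store backed by a plain Hash. The class name `HashStore` is hypothetical and this is not the library's Classifier::NaiveBayesMemoryStore. It assumes that replacing a document with a repeated doc_id is the store's job, and that a nil doc_id may be passed when none was given to train; neither is confirmed by the contract above.

```ruby
require 'bigdecimal'

# Hypothetical store implementing the contract above with a Hash.
class HashStore
  def initialize
    @docs = {}    # doc_id => [category, features]
    @next_id = 0
  end

  # Stores a document. A repeated doc_id replaces the earlier entry.
  # Assumption: a nil doc_id gets a generated id.
  def add_document(doc_id, category, features)
    doc_id ||= "auto-#{@next_id += 1}"
    @docs[doc_id] = [category, features.dup]
    nil
  end

  def count_feature_in_category(category, feature)
    BigDecimal(features_in(category).count(feature))
  end

  def count_features_in_category(category)
    BigDecimal(features_in(category).size)
  end

  def count_uniq_features
    BigDecimal(@docs.values.flat_map { |_, features| features }.uniq.size)
  end

  def count_documents_in_category(category)
    BigDecimal(@docs.values.count { |cat, _| cat == category })
  end

  def count_documents
    BigDecimal(@docs.size)
  end

  private

  # All features (with repeats) from documents in the given category.
  def features_in(category)
    @docs.values.select { |cat, _| cat == category }.flat_map { |_, features| features }
  end
end

store = HashStore.new
store.add_document("1", "spam", %w[cheap viagra])
store.add_document("2", "ham", %w[cheap fruit])
puts store.count_feature_in_category("ham", "fruit").to_i # prints 1
```

Every count is recomputed from the stored documents, which is slow but keeps the replace-on-retrain behaviour trivially correct.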

A sample store that persists to a database using the sequel gem is also available:

require 'classifier'
require 'classifier/naive_bayes_sequel_store'

db = Sequel.connect(ENV["DATABASE_URL"])
db_store = Classifier::NaiveBayesSequelStore.new(db, :naive_bayes_table)
classifier = Classifier::NaiveBayes.new(store: db_store, categories: ["spam", "ham"])
classifier.train(category: "spam", features: ["cheap", "viagra"])
classifier.train(category: "ham", features: ["cheap", "fruit"])
puts classifier.classify("fruit").inspect

The behaviour of your custom store must be accurate, or the classifier will produce inaccurate results. It's recommended that you write tests for the store that compare its behaviour to the reference in-memory store: Classifier::NaiveBayesMemoryStore.
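
One possible shape for such a comparison is sketched below: feed the same documents to both stores, then call every method in the contract on each and collect the calls whose results differ. The helper name `store_mismatches` is hypothetical, and the constructor arguments in the usage comment are assumptions.

```ruby
# Hypothetical helper: returns the contract calls on which two stores disagree.
# docs is an array of [doc_id, category, features] triples.
def store_mismatches(reference, candidate, docs)
  docs.each do |doc_id, category, features|
    reference.add_document(doc_id, category, features)
    candidate.add_document(doc_id, category, features)
  end

  categories = docs.map { |_, category, _| category }.uniq
  features = docs.flat_map { |_, _, fs| fs }.uniq

  # Every query in the contract, for every category and feature seen.
  checks = [[:count_documents], [:count_uniq_features]]
  categories.each do |category|
    checks << [:count_documents_in_category, category]
    checks << [:count_features_in_category, category]
    features.each { |f| checks << [:count_feature_in_category, category, f] }
  end

  checks.reject do |name, *args|
    reference.public_send(name, *args) == candidate.public_send(name, *args)
  end
end

# Usage inside a spec (constructor arguments are assumptions):
#   docs = [["1", "spam", %w[cheap viagra]], ["1", "ham", %w[cheap fruit]]]
#   mismatches = store_mismatches(Classifier::NaiveBayesMemoryStore.new,
#                                 MyCustomStore.new, docs)
#   expect(mismatches).to be_empty
```

Including a repeated doc_id in the sample documents also exercises the retraining path.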

Specs

There are a number of specs that demonstrate intended behaviour. You can run them via rspec:

bundle exec rspec --format doc

TODO

add a threshold so classification by a small margin can be flagged or ignored
expand README with more examples
- text classification: tokenisation, stemming, stop words, etc
- feature selection
- writing a custom store
should we accept training data with no features?
should we attempt to classify with no features?
consider adding a way for documents to be removed from a store

