Supervised machine learning model that determines whether an input is negative or positive, based on the dataset it has been trained on.

samulieronen/bayesian_classifier

Bayesian Classifier


This is a program that takes an input and tries its best to determine whether the input is negative or positive.
Accuracy is currently 89.7%, which is quite good. The dataset contains approximately 2000 reviews in total.

The dataset used is not my own; it belongs to the Association for Computational Linguistics (ACL), 2007.

A more detailed readme is also coming up!
This is an ongoing part-time project, so it might take a while to update this.

Bayesian What?

In statistics, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features.

Bayes' theorem states:

P(A|B) = P(B|A) * P(A) / P(B)

Where:

  • A is our word.
  • B is the class, either positive or negative.

In short, this program takes a string and tries to determine whether it is positive or negative, based on probability. For each word in the sentence, it calculates the probability of that word belonging to positive or to negative text, combines the per-word probabilities for each class, and the class with the highest probability wins.
The naive part is the assumption that every word is independent of the others, so each word is examined on its own.
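
The idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code in byers.py: the function names (`train`, `log_scores`, `classify`), the toy training data, the whitespace tokenization, and the use of add-one (Laplace) smoothing with log probabilities are all assumptions made for the example.

```python
import math
from collections import Counter

LABELS = ("pos", "neg")

def train(labeled_docs):
    """Count words per class from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def log_scores(text, word_counts, doc_counts):
    """Return log P(class) + sum of log P(word | class) for each class."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores

def classify(text, word_counts, doc_counts):
    """Pick the class with the highest score."""
    scores = log_scores(text, word_counts, doc_counts)
    return max(scores, key=scores.get)

# Toy data, invented for this example only.
TOY_DATA = [
    ("great battery and great sound", "pos"),
    ("works well love it", "pos"),
    ("terrible battery broke fast", "neg"),
    ("bad sound do not buy", "neg"),
]

word_counts, doc_counts = train(TOY_DATA)
print(classify("great sound", word_counts, doc_counts))
print(classify("terrible bad", word_counts, doc_counts))
```

Summing logarithms instead of multiplying raw probabilities avoids numeric underflow on long inputs; the winning class is the same either way.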

This is a supervised machine learning model.
Supervised means that it will not learn on its own; it can only learn from the labeled data it has been fitted to.
This classifier is based on electronics reviews from Amazon.

Improvements

  • Maybe a bigram approach (single words are weak features, 2-word sequences work better?)
  • Dataset improvements, is there a better one?
  • Language processing improvements
  • Better regexing or parsing
  • Better lemmatization (with tags?)
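
As a rough illustration of the 2-word sequence idea in the list above, the sketch below builds features from single words plus adjacent word pairs. The helper names `ngrams` and `features` are hypothetical and not part of this repository, and the whitespace tokenization is an assumption.

```python
def ngrams(tokens, n=2):
    """Return all runs of n adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text):
    """Single words plus 2-word sequences, so 'not good' becomes its own feature."""
    tokens = text.lower().split()
    return tokens + ngrams(tokens, 2)

print(features("not good at all"))
```

With pairs as features, a phrase such as "not good" can be counted separately from "good", which single-word counts cannot distinguish.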

How to run

To build the database, run:

$ python3 build_dataset.py

Once the database has been built, you don't have to build it again.

Then you can run the classifier:

$ python3 byers.py "text-to-classify"
(Option --proba) Displays probabilities of text being neg / pos.
(Option --benchmark) Displays classifier accuracy in % measured by an independent dataset.
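
For context, an accuracy figure like the one --benchmark reports is normally the share of held-out examples that are classified correctly. The sketch below shows that calculation only; `accuracy_percent`, `toy_predict` and the `HELD_OUT` examples are invented for illustration, and how byers.py actually measures accuracy is not shown here.

```python
def accuracy_percent(predict, labeled_examples):
    """Percentage of (text, label) pairs for which predict(text) == label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return 100.0 * correct / len(labeled_examples)

def toy_predict(text):
    """Stand-in classifier: anything containing the word 'bad' is negative."""
    return "neg" if "bad" in text.lower().split() else "pos"

# Held-out examples the stand-in classifier was never fitted to (invented).
HELD_OUT = [
    ("bad screen", "neg"),
    ("nice screen", "pos"),
    ("awful cable", "neg"),
    ("good value", "pos"),
]

print(accuracy_percent(toy_predict, HELD_OUT))
```

The examples used for measuring must be kept separate from the training data, otherwise the reported accuracy is overly optimistic.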

Requirements

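You will need Python 3. It cannot be installed with pip; install it from python.org or your system's package manager, then check that it is available:

$ python3 --version

You will also need NLTK and its WordNet corpus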

$ pip3 install nltk
$ python3
>>> import nltk
>>> nltk.download("wordnet")

And scikit-learn

$ pip3 install scikit-learn
