An analysis of the `movie_reviews` dataset included in the `nltk` corpus. I will probably add some buzzwords here later on.
- An implementation of `nltk.NaiveBayesClassifier` trained against 5000 movie reviews. Implemented in `nltkNB.ipynb` (see the sketch after this list)
- Using `sklearn`
  - Naive Bayes: `MultinomialNB`, `BernoulliNB`
  - Linear Model: `LogisticRegression`, `SGDClassifier`
  - SVM: `SVC`, `LinearSVC`, `NuSVC`

  Implemented in `scikitlearnNB.ipynb`
- A voting system that picks the best result out of all the learning methods above. Implemented in `voting_process.ipynb` (a minimal sketch follows this list)
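For context, here is a minimal sketch of the kind of pipeline the notebooks above build: bag-of-words presence features over the `movie_reviews` corpus (stopwords removed), an `nltk.NaiveBayesClassifier`, and a scikit-learn estimator wrapped with nltk's `SklearnClassifier`. The feature count, train/test split, and helper names (`document_features`, `word_features`) are illustrative assumptions, not the repo's exact code.

```python
import random

import nltk
from nltk.corpus import movie_reviews, stopwords
from nltk.classify import NaiveBayesClassifier, accuracy
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Presence/absence features for the most frequent non-stopword tokens.
# The 3000 features and the 1900/100 split below are illustrative choices.
stop = set(stopwords.words('english'))
freqs = nltk.FreqDist(w.lower() for w in movie_reviews.words()
                      if w.isalpha() and w.lower() not in stop)
word_features = [w for w, _ in freqs.most_common(3000)]

def document_features(words):
    present = set(words)
    return {w: (w in present) for w in word_features}

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

featuresets = [(document_features(words), label) for words, label in documents]
train_set, test_set = featuresets[:1900], featuresets[1900:]

# Plain nltk Naive Bayes (the kind of thing nltkNB.ipynb presumably does)
nltk_nb = NaiveBayesClassifier.train(train_set)
print("nltk NaiveBayesClassifier:", accuracy(nltk_nb, test_set))

# Any scikit-learn estimator can be trained on the same featuresets through
# nltk's SklearnClassifier wrapper (roughly what scikitlearnNB.ipynb covers)
sk_nb = SklearnClassifier(MultinomialNB()).train(train_set)
print("MultinomialNB:", accuracy(sk_nb, test_set))
```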
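And a minimal sketch of one way the voting step could work: a simple majority vote over the individual classifiers, with the vote share as a rough confidence score. The `VoteClassifier` name and interface are assumptions for illustration; `voting_process.ipynb` may do this differently.

```python
from statistics import mode

class VoteClassifier:
    """Hypothetical majority-vote wrapper over already-trained nltk-style classifiers."""

    def __init__(self, *classifiers):
        # Works best with an odd number of classifiers so votes cannot tie.
        self.classifiers = classifiers

    def classify(self, features):
        # Each underlying classifier exposes .classify(); take the most common label.
        votes = [c.classify(features) for c in self.classifiers]
        return mode(votes)

    def confidence(self, features):
        # Fraction of classifiers that agree with the winning label.
        votes = [c.classify(features) for c in self.classifiers]
        return votes.count(mode(votes)) / len(votes)

# e.g. combining classifiers trained as in the sketch above:
# voted = VoteClassifier(nltk_nb, sk_nb)
# print(voted.classify(test_set[0][0]), voted.confidence(test_set[0][0]))
```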
| Classifier | Accuracy achieved |
|---|---|
| `nltk.NaiveBayesClassifier` | 73.0% |
| **ScikitLearn implementations** | |
| `BernoulliNB` | 72.0% |
| `MultinomialNB` | 76.0% |
| `LogisticRegression` | 74.0% |
| `SGDClassifier` | 69.0% |
| `SVC` | 48.0% |
| `LinearSVC` | 74.0% |
| `NuSVC` | 74.0% |
The simplest (and suggested) way is to install the required packages and their dependencies using either anaconda or miniconda.

After that you can do

```bash
$ conda update conda
$ conda install scikit-learn nltk
```
The dataset used in this package is bundled along with the nltk
package.
Run your Python interpreter

```python
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('movie_reviews')
```
NOTE: You can check system-specific installation instructions on the official nltk website
Check that everything is in order by running your interpreter again and importing these:

```python
>>> import nltk
>>> from nltk.corpus import stopwords, movie_reviews
>>> import sklearn
>>>
```

If these imports work for you, then you are good to go!
- Clone the repo

```bash
$ git clone https://github.com/prodicus/movieReviewsAnalysis
$ cd movieReviewsAnalysis

## run the ipython notebook server
$ ipython notebook
```
- Order of running
  1. `nltkNB.ipynb`
  2. `scikitlearnNB.ipynb`
  3. `voting_process.ipynb`
- Hack away!
"So what, Well this is pretty basic!"
Yes, it is but hey we all do start somewhere right?
Psst. I am working on a spam filtering system. You know the one in which you paste an email and then it tells you whether it is a spam or not.
You can follow me on twitter @tasdikrahman to keep tabs on it.
Hacked together by Tasdik Rahman under the MIT License
You can find a copy of the License at http://prodicus.mit-license.org/