An analysis of the `movie_reviews` data set included in the nltk corpus. I will probably add some buzzwords here later on.
- An implementation of `nltk.NaiveBayesClassifier` trained against 5000 movie reviews. Implemented in `nltkNB.ipynb`
- Using `sklearn`
  - Naive Bayes:
    - `MultinomialNB`
    - `BernoulliNB`
  - Linear Model
    - `LogisticRegression`
    - `SGDClassifier`
  - SVM
    - `SVC`
    - `LinearSVC`
    - `NuSVC`

  Implemented in `scikitlearnNB.ipynb`
- Implemented a voting system to choose the best out of all the learning methods. Implemented in `voting_process.ipynb`
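The voting step can be sketched roughly as follows. This is a minimal illustration that assumes each classifier casts one label per review and the most frequent label wins; it is not taken from `voting_process.ipynb`, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(predictions):
    """Return (winning_label, confidence) for a single review.

    `predictions` holds one label per classifier, e.g. ["pos", "neg", "pos"].
    Confidence is the share of classifiers that agreed with the winner.
    """
    if not predictions:
        raise ValueError("need at least one prediction to vote")
    counts = Counter(predictions)
    # most_common(1) gives the label with the highest count; labels with
    # equal counts keep the order in which they were first seen.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical predictions from five classifiers for one review.
votes = ["pos", "neg", "pos", "pos", "neg"]
label, confidence = majority_vote(votes)
print(label, confidence)  # pos 0.6
```

Using an odd number of voters with two labels avoids ties altogether, which is one reason a voting ensemble might leave out one of the classifiers.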
| Classifiers | Accuracy achieved |
|---|---|
| `nltk.NaiveBayesClassifier` | 73.0% |
| **ScikitLearn Implementations** | |
| `BernoulliNB` | 72.0% |
| `MultinomialNB` | 76.0% |
| `LogisticRegression` | 74.0% |
| `SGDClassifier` | 69.0% |
| `SVC` | 48.0% |
| `LinearSVC` | 74.0% |
| `NuSVC` | 74.0% |
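For reference, an accuracy figure like the ones in the table is usually the percentage of held-out reviews whose predicted label matches the true label. The sketch below assumes that definition; the notebooks may compute it differently (for instance through `nltk.classify.accuracy`), and the labels shown are made up for illustration.

```python
def accuracy(predicted, actual):
    """Return the percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Made-up labels for four test reviews; three of the four match.
predicted = ["pos", "neg", "pos", "pos"]
actual = ["pos", "neg", "neg", "pos"]
print(accuracy(predicted, actual))  # 75.0
```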
The simplest (and suggested) way is to install the required packages and their dependencies using either Anaconda or Miniconda.

After that you can do

```sh
$ conda update conda
$ conda install scikit-learn nltk
```

The dataset used in this package is distributed with the nltk data collection. Run your Python interpreter and download it:

```python
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('movie_reviews')
```

NOTE: You can check system-specific installation instructions on the official nltk website.

Check that everything is good so far by running your interpreter again and importing these:

```python
>>> import nltk
>>> from nltk.corpus import stopwords, movie_reviews
>>> import sklearn
>>>
```

If these imports work for you, then you are good to go!
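If you would rather script the check than type the imports by hand, something like the following could work. It is only a sketch: `missing_packages` is a hypothetical helper, and it only tests that the top-level packages can be located, not that the nltk data sets have been downloaded.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages this project relies on.
required = ["nltk", "sklearn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```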
- Clone the repo

```sh
$ git clone https://github.com/prodicus/movieReviewsAnalysis
$ cd movieReviewsAnalysis
## run the ipython server
$ ipython notebook
```

- Order of running
  - `nltkNB.ipynb`
  - `scikitlearnNB.ipynb`
  - `voting_process.ipynb`
- Hack away!
"So what, Well this is pretty basic!"
Yes, it is, but hey, we all start somewhere, right?
Psst. I am working on a spam filtering system. You know, the kind where you paste in an email and it tells you whether it is spam or not.
You can follow me on twitter @tasdikrahman to keep tabs on it.
Hacked together by Tasdik Rahman under the MIT License
You can find a copy of the License at http://prodicus.mit-license.org/