Natural Language Processing of Zooniverse Talk Data
Python script to train a Naive Bayes classifier with NLTK, based on https://github.com/abromberg/sentiment_analysis_python.
The classifier is trained on 1.6M tweets pre-processed at Stanford and available at http://help.sentiment140.com/for-students. Other training data can also be used, but is not saved in the repo's training-data folder because it is too large.
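As an illustration of the training input, the sketch below converts rows in the Sentiment140 layout (polarity, id, date, query, user, text; polarity 0 for negative and 4 for positive) into labelled bag-of-words featuresets. The helper names `bag_of_words` and `labeled_featuresets` and the two sample rows are made up for this example, and the whitespace tokenisation is an assumption rather than what the script necessarily does.

```python
import csv
import io

# Sentiment140 rows have six fields: polarity, id, date, query, user, text.
# The training file uses polarity "0" for negative and "4" for positive.
POLARITY_TO_LABEL = {"0": "neg", "4": "pos"}


def bag_of_words(text):
    """Turn a comment into a {word: True} dict (hypothetical helper)."""
    return {word: True for word in text.lower().split()}


def labeled_featuresets(rows):
    """Yield (featureset, label) pairs, skipping any other polarity values."""
    for row in rows:
        label = POLARITY_TO_LABEL.get(row[0])
        if label is not None:
            yield bag_of_words(row[5]), label


# Two made-up rows in the Sentiment140 layout, standing in for the real file.
sample = io.StringIO(
    '"4","1","Mon Apr 06 2009","NO_QUERY","user_a","Love this galaxy"\n'
    '"0","2","Mon Apr 06 2009","NO_QUERY","user_b","Blurry and boring"\n'
)
training = list(labeled_featuresets(csv.reader(sample)))
print(training)

# With NLTK installed, a list in this shape can be passed to the trainer:
#   classifier = nltk.NaiveBayesClassifier.train(training)
```

A list of `(featureset, label)` tuples is the input shape `nltk.NaiveBayesClassifier.train` expects, so the same conversion should carry over to other training sets as long as their labels are mapped to `pos` and `neg`.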
The script and HTML template are designed for specific Zooniverse data, extracted from the Zooniverse discussion platform 'Talk'. Please contact email@example.com for more information.
Inputs are a CSV dump of text comments, plus NLTK and the training data. Outputs are a CSV file of sentiment scores and HTML files showing positive and negative comments.
Run it with the input filename as a parameter, e.g.
python process_comments.py example_input_file.csv
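The CSV-in, CSV-out flow can be sketched roughly as follows. This is not the code in `process_comments.py`: the column names `zooniverse_id` and `body`, the `_scores.csv` output suffix, and the placeholder `score_comment` function are all assumptions made for the example.

```python
import csv
import os


def score_comment(text):
    """Placeholder: the real script would ask the trained classifier instead."""
    return {"pos": 0.5, "neg": 0.5}


def main(argv):
    """Score every comment in the CSV named by argv[1]; return the output path."""
    if len(argv) != 2:
        raise SystemExit("usage: python process_comments.py <input_file.csv>")
    in_path = argv[1]
    # Assumed naming scheme: example.csv -> example_scores.csv
    out_path = os.path.splitext(in_path)[0] + "_scores.csv"
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["zooniverse_id", "pos", "neg"])
        writer.writeheader()
        for row in csv.DictReader(src):
            scores = score_comment(row["body"])
            writer.writerow({"zooniverse_id": row["zooniverse_id"], **scores})
    return out_path
```

Calling `main(sys.argv)` under an `if __name__ == "__main__":` guard would connect this sketch to a command line like the one above.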
The most positive sentiment images from Galaxy Zoo based on Talk threads with 5 or more comments.

The most positive sentiment images from Snapshot Serengeti based on Talk threads with 5 or more comments.
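One way to pick out the most positive threads with 5 or more comments is sketched below. Averaging the per-comment positive scores is an assumption, since the aggregation the script uses is not described here, and `rank_threads` and the sample IDs are hypothetical.

```python
from collections import defaultdict

MIN_COMMENTS = 5  # threshold named in the text above


def rank_threads(scored_comments, min_comments=MIN_COMMENTS):
    """Return (zooniverse_id, mean_pos, n_comments) tuples, most positive first."""
    by_thread = defaultdict(list)
    for zoo_id, pos in scored_comments:
        by_thread[zoo_id].append(pos)
    ranked = [
        (zoo_id, sum(scores) / len(scores), len(scores))
        for zoo_id, scores in by_thread.items()
        if len(scores) >= min_comments  # drop short threads
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up scores: the third thread has only two comments and is filtered out.
comments = (
    [("AGZ0001", 0.9)] * 5 + [("AGZ0002", 0.6)] * 6 + [("AGZ0003", 1.0)] * 2
)
top = rank_threads(comments)
print([zoo_id for zoo_id, _, _ in top])
```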
Each image links to its Talk page and is shown with:
- Zooniverse ID in the top-left
- Number of comments in the top-right
- Positive and Negative scores in the bottom-left (colour-coded)
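The layout listed above could be produced by a template along these lines. It is a minimal sketch and not the repo's HTML template: the class names, the green and red colours, the example URLs, and the `render_card` helper are all assumptions.

```python
from html import escape

# One linked image with its three overlays (hypothetical markup).
CARD = """<a class="card" href="{talk_url}">
  <img src="{image_url}" alt="{zoo_id}">
  <span class="top-left">{zoo_id}</span>
  <span class="top-right">{n_comments}</span>
  <span class="bottom-left">
    <span style="color: green">{pos:.2f}</span>
    <span style="color: red">{neg:.2f}</span>
  </span>
</a>"""


def render_card(zoo_id, talk_url, image_url, n_comments, pos, neg):
    """Fill the card template, escaping the values that end up in HTML."""
    return CARD.format(
        zoo_id=escape(zoo_id),
        talk_url=escape(talk_url),
        image_url=escape(image_url),
        n_comments=int(n_comments),
        pos=pos,
        neg=neg,
    )


print(render_card(
    "AGZ0001",
    "https://example.org/talk/AGZ0001",
    "https://example.org/images/AGZ0001.jpg",
    7, 0.82, 0.18,
))
```

Escaping the ID and URLs keeps stray characters in the data from breaking the generated page.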