
WikiSentiment

automatic categorization of user interactions in Wikipedia

Homepage: http://github.com/whym/wikisentiment
Contact: http://whym.org

Overview

Preprocessing and training:

  1. For each entry:

    • Extract the raw features and store them in MongoDB:

      {
        "entry": {
          "rev_id":   2894772,
          "title": "Yosri",
          "content": {
            "added": ["Hi This is ...."],
            "removed": []
          },
          "comment": "Hi This is ....",
          "timestamp": "...",
          "sender": {},
          "receiver": {}
        },
        "labels": {
           "debate":  false,
           "other":   false,
           "template": true,
           "welcome":  true,
           "suggest":  true,
           "invite":  false,
           "minor":   false,
           "vandal":  false
        },
        "features": {
          "ngram":   {"type": "assoc", "values": {...}},
          "SentiWN": {"type": "assoc", "values": {...}},
          ...
        },
        "vector": {
          "1": true,
          "2": true,
          "101": true,
          ...
        },
        ...
      }
      
  2. Convert the raw features into vectors and update all entries in MongoDB. (A different selection of features and/or hash kernels may be used here.)

  3. For each entry, add it to the training set.

  4. Train a classifier with the training set.

  5. Output the resulting model.
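
Step 2 above (raw features to vector) can be sketched roughly as follows. This is only an illustration of the hashing idea, not the project's actual code: the function name `hash_features`, the bucket count, and the use of MD5 from the standard library in place of the `murmur` module are all assumptions made to keep the example self-contained.

```python
import hashlib


def hash_features(raw_features, num_buckets=2 ** 18):
    """Map raw features to a sparse vector of active indices.

    raw_features is assumed to look like the "features" field above:
        {"ngram": {"type": "assoc", "values": {"hi this": 1, ...}}, ...}
    The result mirrors the "vector" field: {"1": True, "101": True, ...}
    with 1-based indices stored as strings.
    """
    vector = {}
    for group, spec in sorted(raw_features.items()):
        for name, value in sorted(spec.get("values", {}).items()):
            if not value:
                continue  # skip features that are absent or zero
            key = (group + ":" + name).encode("utf-8")
            # MD5 is a stand-in hash here; the project lists murmur instead.
            digest = hashlib.md5(key).digest()
            index = int.from_bytes(digest[:4], "big") % num_buckets + 1
            vector[str(index)] = True
    return vector


if __name__ == "__main__":
    features = {"ngram": {"type": "assoc", "values": {"hi this": 1, "welcome": 1}}}
    print(hash_features(features, num_buckets=1000))
```

Prefixing each feature name with its group keeps, say, an n-gram and a SentiWN feature with the same name from being hashed to the same key; distinct keys can still collide in the same bucket, which is the usual trade-off of a hash kernel.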

Testing:

  1. Load the model and construct a classifier.
  2. For each entry, output it and the label predicted by the classifier.
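
The testing loop might look something like the sketch below. The classifier is passed in as a plain callable (`predict_one`, a hypothetical name) so the example runs without liblinear; in practice it would presumably wrap the loaded liblinear model. `toy_predict` and the second sample entry are made up for illustration.

```python
def label_entries(entries, predict_one):
    """Yield (rev_id, predicted_label) for each stored entry.

    entries is an iterable of documents shaped like the MongoDB example
    in the Overview; predict_one takes a sparse vector {index: value}.
    """
    for doc in entries:
        # The stored vector uses string keys; convert them to integers.
        x = {int(i): 1.0 for i, on in doc.get("vector", {}).items() if on}
        yield doc["entry"]["rev_id"], predict_one(x)


def toy_predict(x):
    """Stand-in classifier: predicts True when feature 101 is active."""
    return 101 in x


if __name__ == "__main__":
    entries = [
        {"entry": {"rev_id": 2894772}, "vector": {"1": True, "101": True}},
        {"entry": {"rev_id": 1}, "vector": {"2": True}},  # hypothetical entry
    ]
    for rev_id, label in label_entries(entries, toy_predict):
        print(rev_id, label)
```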

Usage

  1. Obtain a list of revision IDs or a list of actual messages as CSV.
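
As a rough illustration of the first step, the sketch below reads revision IDs from a CSV file. The layout is an assumption (one ID per row in the first column, with an optional header row), and `read_revision_ids` is a hypothetical helper, not part of the project.

```python
import csv
import io


def read_revision_ids(csv_file, column=0):
    """Return the integer revision IDs found in one column of a CSV file.

    Rows that are empty or whose cell is not a plain number (for example
    a header row) are skipped.
    """
    ids = []
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue
        cell = row[column].strip()
        if cell.isdigit():
            ids.append(int(cell))
    return ids


if __name__ == "__main__":
    # In-memory sample; the second ID is made up for illustration.
    sample = io.StringIO("rev_id\n2894772\n\n12345\n")
    print(read_revision_ids(sample))
```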

Requirements

The following Python modules are required.

  • urllib2
  • pymongo
  • nltk (wordnet)
  • murmur
  • liblinear, liblinearutil

Todo

  • Support exporting and importing models
  • Efficient pipelining of Wikipedia API calls, feature extraction, and database inserts in a producer-consumer style
  • Add a visualization script for error analysis.
  • Support other languages
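
For the pipelining item above, one possible shape of a producer-consumer setup is sketched below using only the standard library. Nothing here exists in the project yet: `run_pipeline` and the `fetch`, `extract`, and `store` callables are hypothetical placeholders for the Wikipedia API call, the feature extraction, and the database insert.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(rev_ids, fetch, extract, store, maxsize=16):
    """Fetch revisions in a background thread while the caller's thread
    extracts features and stores the results. Returns the number stored."""
    fetched = queue.Queue(maxsize)  # bounded, so fetching cannot run far ahead

    def producer():
        try:
            for rev_id in rev_ids:
                fetched.put(fetch(rev_id))
        finally:
            fetched.put(_DONE)  # always signal completion, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    count = 0
    while True:
        item = fetched.get()
        if item is _DONE:
            break
        store(extract(item))
        count += 1
    worker.join()
    return count


if __name__ == "__main__":
    stored = []
    n = run_pipeline(
        [1, 2, 3],
        fetch=lambda rev_id: {"rev_id": rev_id},
        extract=lambda doc: dict(doc, features={}),
        store=stored.append,
    )
    print(n, stored)
```

The bounded queue is the main design choice: it lets network-bound fetching overlap with extraction and inserts while capping how many fetched revisions sit in memory.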

See also
