# Task A

This notebook explores how feature selection affects performance on Task A, using the pipeline runner to run different classifiers.

Preprocessor is a module that holds the functions tokenise and extract_features. Keeping them in a module allows them to be compiled for faster execution and keeps the notebook clear.

PipelineRunner is a module written for this exercise. It wraps training-set extraction, 10-fold cross-validation, and testing in functions for ease of use.

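First we import our modules Preprocessor.py and PipelineRunner.py. The explicit py_compile step is left commented out, because importing a module already compiles it on first use.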

In [1]:
# import py_compile
# py_compile.compile("Preprocessor.py")
# py_compile.compile("PipelineRunner.py")

import Preprocessor
import PipelineRunner

## Baseline Attempt: Classify with NaiveBayesClassifier
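The original baseline used nltk.NaiveBayesClassifier (kept below as commented-out code). The current version trains and tests scikit-learn's MultinomialNB through the pipeline function.

First define the tokeniser and feature extractor for this task.  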
### Tokeniser: tokenise the given text into a list of tokens

In [2]:
from functools import partial

# bind the tokeniser options once so callers only need to pass the text
tokenise = partial(Preprocessor.tokenise,
                   more_instances=1, lemmatization=True)
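
To make the role of `partial` concrete, here is a minimal sketch. The `toy_tokenise` function and its `lowercase` and `min_length` options are hypothetical stand-ins, not the real `Preprocessor.tokenise` or its parameters. The point is only that `partial` fixes keyword arguments once, so later code can call the tokeniser with the text alone.

```python
from functools import partial


def toy_tokenise(text, lowercase=False, min_length=1):
    """Hypothetical tokeniser: split on whitespace, then apply two options."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # drop tokens shorter than min_length
    return [t for t in tokens if len(t) >= min_length]


# Bind the options once; the resulting callable takes only the text.
toy = partial(toy_tokenise, lowercase=True, min_length=2)

print(toy("I LOVE this phone"))  # ['love', 'this', 'phone']
```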

### Pipeline with Feature extractor and NaiveBayesClassifier

In [3]:
# import nltk

# classifier, tf_cm, tf_gold, tf_result, dev_cm, dev_gold, dev_result = \
#     PipelineRunner.runAllTaskA(tokenise, extract_features, nltk.NaiveBayesClassifier.train)

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
        ('ngram', Preprocessor.extract_ngram()),
        ('dict', DictVectorizer()),
        ('nb', MultinomialNB())
    ])

pipeline, tf_mfs, tf_gold, tf_result, dev_mfs, dev_gold, dev_result = \
    PipelineRunner.runAllTaskA(pipeline, tokenise)


>>>Start 10 fold validation:
test range: [0, 765] accuracy: 0.766318537859
test range: [766, 1531] accuracy: 0.736292428198
test range: [1532, 2297] accuracy: 0.781984334204
test range: [2298, 3063] accuracy: 0.783289817232
test range: [3064, 3828] accuracy: 0.802614379085
test range: [3829, 4593] accuracy: 0.743790849673
test range: [4594, 5358] accuracy: 0.721568627451
test range: [5359, 6123] accuracy: 0.766013071895
test range: [6124, 6888] accuracy: 0.759477124183
test range: [6889, 7653] accuracy: 0.801307189542
         |    n         p |
         |    e    n    o |
         |    g    e    s |
         |    a    u    i |
         |    t    t    t |
         |    i    r    i |
         |    v    a    v |
         |    e    l    e |
---------+----------------+
negative |<1457>   1 1039 |
 neutral |   86   <.> 296 |
positive |  367    .<4408>|
---------+----------------+
(row = reference; col = test)

positive f-measure: 0.838182
neutral f-measure: 0.000000
negative f-measure: 0.6
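
The `test range` lines above are consistent with splitting the 7654 training examples into 10 contiguous folds, where the first `n % k` folds receive one extra example. The sketch below reproduces those ranges under that assumption; `fold_ranges` is a hypothetical helper, not the code inside PipelineRunner.

```python
def fold_ranges(n_samples, n_folds=10):
    """Return inclusive (start, end) index pairs for contiguous folds.

    The first n_samples % n_folds folds receive one extra sample, so fold
    sizes differ by at most one.
    """
    base, extra = divmod(n_samples, n_folds)
    ranges = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


for start, end in fold_ranges(7654):
    print("test range: [%d, %d]" % (start, end))
```

Each fold is held out once for testing while the remaining nine folds are used for training.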

## Final Attempt: Feature Union and SGDClassifier
SGDClassifier fits regularized linear models with stochastic gradient descent (SGD) learning.  
This classifier turns out to be faster than SVC(kernel='linear') while giving comparable results.

In [4]:
import PipelineRunner
import Preprocessor

reload(PipelineRunner)
reload(Preprocessor)

from functools import partial

tokenise = partial(Preprocessor.tokenise, \
                   more_instances=1, lemmatization=True)

# load lexicon transformers for faster startup
lexicon_liuhu = Preprocessor.lexicon_liuhu()
lexicon_emoticon = Preprocessor.lexicon_emoticon()
lexicon_NRC_unigram = Preprocessor.lexicon_NRC_unigram()
lexicon_NRC_bigram = Preprocessor.lexicon_NRC_bigram()
WE_GloVe_Twitter = Preprocessor.WE_GloVe_Twitter()

In [5]:
import sklearn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.decomposition import TruncatedSVD

# a list of features to be combined in FeatureUnion
# individual features can be commented out to try different combinations
# make sure to also edit weights to match the features used
feature_list = [
                ('unigram_tfidf', Pipeline([
                    ('ngram', Preprocessor.extract_ngram(ngram_range=(1,1), hashing=True)),
                    ('dict', DictVectorizer()),
                    ('tfidf', TfidfTransformer())
                ])),
                ('kskip_bigram_tfidf', Pipeline([
                    ('ngram', Preprocessor.extract_kskip_bigram(skip=1, hashing=True)),
                    ('dict', DictVectorizer()),
                    ('tfidf', TfidfTransformer())
                ])),
                ('trigram_tfidf', Pipeline([
                    ('ngram', Preprocessor.extract_ngram(ngram_range=(3,3), hashing=True)),
                    ('dict', DictVectorizer()),
                    ('tfidf', TfidfTransformer())
                ])),
#                   sentiwordnet is slow to run and does not improve results much; comment out this block to skip it
                ('lexi_sentiwordnet', Pipeline([
                    ('swn', Preprocessor.lexicon_sentiwordnet(feature='score')),
                    ('normalise', preprocessing.MinMaxScaler())
                ])),
                ('lexi_liuhu', Pipeline([
                    ('liuhu', lexicon_liuhu),
                    ('normalise', preprocessing.Normalizer())
                ])),
                ('lexi_emoticon', Pipeline([
                    ('emoticon', lexicon_emoticon),
                    ('normalise', preprocessing.Normalizer())
                ])),
                ('lexicon_NRC_unigram', Pipeline([
                    ('nrc1', lexicon_NRC_unigram),
                    ('normalise', preprocessing.Normalizer())
                ])),
                ('lexicon_NRC_bigram', Pipeline([
                    ('bigram', Preprocessor.extract_kskip_bigram(skip=3)),
                    ('nrc2', lexicon_NRC_bigram),
                    ('normalise', preprocessing.Normalizer())
                ])),
                ('WE_GloVe', Pipeline([
                    ('glove', WE_GloVe_Twitter),
                    ('normalise', preprocessing.MinMaxScaler())
                ])),
                ('related', Pipeline([
                    ('trl', Preprocessor.extract_tweeter_related()),
                    ('normalise', preprocessing.MinMaxScaler())
                ]))
               ]

# weights for the features in feature_list, keyed by feature name; a feature without an entry defaults to 1.0
weights = {
    'unigram_tfidf':       1.0,
    'kskip_bigram_tfidf':  0.4,
    'trigram_tfidf':       0.5,
    'lexi_sentiwordnet':   0.3,
    'lexi_liuhu' :         0.5,
    'lexi_emoticon':       0.2,
    'lexicon_NRC_unigram': 0.2,
    'lexicon_NRC_bigram' : 0.2,
    'WE_GloVe'  :          0.4,
    'related' :            0.2
}

# n_iter is the argument name in older scikit-learn releases; newer releases use max_iter instead
clf = ('SGD', SGDClassifier(n_iter=50, average=10))
# clf = ('SVC', SVC(kernel='linear'))

# combine features and classifier to pipeline
pipeline = Pipeline([
        ('features', FeatureUnion(
                transformer_list = feature_list, 
                transformer_weights = weights,
        )),
        clf ])

# use our own PipelineRunner to perform testing 
pipeline, tf_mfs, tf_gold, tf_result, dev_mfs, dev_gold, dev_result = \
    PipelineRunner.runAllTaskA(pipeline, tokenise)


>>>Start 10 fold validation:
test range: [0, 765] accuracy: 0.825065274151
test range: [766, 1531] accuracy: 0.828981723238
test range: [1532, 2297] accuracy: 0.860313315927
test range: [2298, 3063] accuracy: 0.83681462141
test range: [3064, 3828] accuracy: 0.870588235294
test range: [3829, 4593] accuracy: 0.806535947712
test range: [4594, 5358] accuracy: 0.786928104575
test range: [5359, 6123] accuracy: 0.822222222222
test range: [6124, 6888] accuracy: 0.833986928105
test range: [6889, 7653] accuracy: 0.864052287582
         |    n         p |
         |    e    n    o |
         |    g    e    s |
         |    a    u    i |
         |    t    t    t |
         |    i    r    i |
         |    v    a    v |
         |    e    l    e |
---------+----------------+
negative |<2009>  16  472 |
 neutral |  107  <51> 224 |
positive |  421   34<4320>|
---------+----------------+
(row = reference; col = test)

positive f-measure: 0.882443
neutral f-measure: 0.211180
negative f-measure: 0.79
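
As a reminder of what `transformer_weights` does in the cell above: `FeatureUnion` multiplies each transformer's output by its weight and then concatenates the blocks column-wise. The sketch below imitates that for a single sample with plain Python lists. The feature values are made up for illustration, and `weighted_union` is a hypothetical helper, not scikit-learn code.

```python
def weighted_union(blocks, weights):
    """Scale each named feature block by its weight, then concatenate.

    blocks  : list of (name, list of floats) for one sample
    weights : dict mapping name -> weight; missing names default to 1.0
    """
    combined = []
    for name, values in blocks:
        w = weights.get(name, 1.0)
        combined.extend(w * v for v in values)
    return combined


# Made-up feature values for one tweet.
blocks = [
    ("unigram_tfidf", [0.5, 0.0, 0.5]),
    ("lexi_liuhu", [1.0, 0.0]),
    ("related", [2.0]),
]
weights = {"unigram_tfidf": 1.0, "lexi_liuhu": 0.5}

print(weighted_union(blocks, weights))  # [0.5, 0.0, 0.5, 0.5, 0.0, 2.0]
```

Lowering a weight shrinks that block's values, which tends to reduce its influence on a regularized linear model such as SGDClassifier.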

### Observation of Macro F-score from 10-fold CV and Dev set
PipelineRunner calculates the macro F-measure as:  
(f-score_for_positive + f-score_for_negative)/2  
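
As a worked example, the per-class F-measures and the macro score can be recomputed from the confusion matrix printed above (rows are reference labels, columns are predictions). This is a minimal sketch with hypothetical helper names, not PipelineRunner's implementation.

```python
LABELS = ["negative", "neutral", "positive"]

# Confusion matrix from the SGDClassifier run (row = reference, col = test).
CONFUSION = [
    [2009, 16, 472],   # reference negative
    [107, 51, 224],    # reference neutral
    [421, 34, 4320],   # reference positive
]


def f_measure(matrix, index):
    """F1 for one class: harmonic mean of precision and recall."""
    true_pos = matrix[index][index]
    predicted = sum(row[index] for row in matrix)   # column total
    actual = sum(matrix[index])                     # row total
    if true_pos == 0:
        return 0.0
    precision = true_pos / float(predicted)
    recall = true_pos / float(actual)
    return 2 * precision * recall / (precision + recall)


scores = {label: f_measure(CONFUSION, i) for i, label in enumerate(LABELS)}
# Only the positive and negative classes enter the macro score.
macro = (scores["positive"] + scores["negative"]) / 2

for label in LABELS:
    print("%s f-measure: %f" % (label, scores[label]))
print("macro f-measure (positive and negative only): %f" % macro)
```

The positive and neutral values agree with the printed 0.882443 and 0.211180; the neutral class does not contribute to the macro score.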

The experiment above gives two macro F-scores for judging performance, both measured on unseen data: one from 10-fold cross-validation and one from the dev set. The dev-set score may be the more realistic of the two, because the classifier is trained on the entire training set and tested on a separate, unseen dev set. It is very close to the 10-fold cross-validation score, only slightly lower.

We can average the two scores so that different settings and classifiers are easier to compare. (Note that this average is NOT itself the macro F-score obtained by averaging the positive and negative classes.)

In [6]:
(tf_mfs + dev_mfs)/2

0.8377570325287683

## Run Classifier on test set and write to result file

In [7]:
# write result to file
test_set = PipelineRunner.getTrainingSetA(PipelineRunner.twitter_test_A_path, tokenise)
result = list(pipeline.predict(test_set['tokens']))

assert len(result)==len(test_set['tokens'])

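# copy the first four columns of each test line and append the predicted label
with open('result/test-A-final.txt', 'w') as resultfile:
    lineno = 0
    with open(PipelineRunner.twitter_test_A_path) as tsvfile:
        for aline in tsvfile:
            line = aline.strip().split('\t')
            resultfile.write('\t'.join(line[0:4] + [result[lineno]]) + '\n')
            lineno += 1
    # every prediction must have been written exactly once
    assert len(result) == lineno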