This notebook explores feature engineering for text classification.  Your task is to create two new feature functions (like `dictionary_feature` and `unigram_feature` below), and include them in the `build_features` function.  A check grade will be given to generic features that apply across arbitrary text classification problems (e.g., a feature for bigrams); check+ will be given for at least one feature that reveals your own understanding of the data. What features do you think will help for your particular problem? Your grade is *not* tied to whether accuracy goes up or down, so be creative!  You are free to read in any other external resources you like (dictionaries, document metadata, etc.)

You are free to use any of the following datasets for this exercise, or to use your own (if you have your own labeled data, I would encourage you to use it!).  If you use your own data, just be sure to format it like the examples below; each directory has a `train.tsv`, `dev.tsv` and `test.tsv` file, where each file is tab-separated (label in the first column and text in the second column).

* [Sentiment Analysis](https://ai.stanford.edu/~amaas/data/sentiment/) (Positive/Negative): `data/lmrd`
* [Congressional Speech](https://www.cs.cornell.edu/home/llee/data/convote.html) (Democrat/Republican): `data/convote`
* Library of Congress Subject Classification ([21 categories](https://en.wikipedia.org/wiki/Library_of_Congress_Classification)): `data/loc`
 

**Q0**: Briefly describe your data (including the categories you're predicting).  If you're using your own data, tell us about it; if you're using one of the datasets above, tell us something that shows you've looked at the data.

For this homework, I focus on the Large Movie Review Dataset (`LMRD`) and the binary classification of positive versus negative sentiment in the reviews. Each subset (train, dev, and test) contains an equal number of positive and negative reviews. I was a bit surprised to find that the test set contains more reviews than the training set (25,000 versus 20,000). Altogether, `LMRD` includes 50,000 labeled movie reviews from IMDb. According to the original article, the authors include at most 30 reviews per movie to ensure that a broad range of movies is represented.

In [30]:
import sys
from collections import Counter
import operator
from sklearn import preprocessing
from sklearn import linear_model

from nltk import word_tokenize

import pandas as pd
from scipy import sparse
import numpy as np

In [31]:
# Additional imports for the significance test (everything else is imported above)
from math import sqrt
from scipy.stats import norm

In [32]:
# Count the number of reviews per label in each subset
def count(data):
    df = pd.read_csv("../data/lmrd/"+data+".tsv", sep="\t", header=None)
    counts = df.iloc[:,0].value_counts()
    # Map the short labels in the data to readable names for printing
    names = {"pos": "positive", "neg": "negative"}
    tag1, tag2 = counts.index[0], counts.index[1]
    # Use .iloc for positional access; integer keys on a label index are ambiguous
    val1, val2 = counts.iloc[0], counts.iloc[1]
    print(f"The {data} subset contains {val1} {names.get(tag1, tag1)} reviews and {val2} {names.get(tag2, tag2)} reviews.")

In [33]:
count("train")
count("test")
count("dev")

The train subset contains 10000 positive reviews and 10000 negative reviews.
The test subset contains 12500 positive reviews and 12500 negative reviews.
The dev subset contains 2500 positive reviews and 2500 negative reviews.


In [34]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=word_tokenize(cols[1])
            X.append(text)
            Y.append(label)
    return X, Y

In [35]:
# Change this to the directory with the data you will be using.
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="../data/lmrd"

In [36]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

In [37]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    accuracy=correct/len(devY)
    print("%s\t%.3f" % (majority, accuracy))
    # return the accuracy so callers can use it as a baseline
    return accuracy

Here we'll create two feature classes -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.

In [38]:
# start with a fixed dict to look up keywords occurring in the review
pos_dictionary = set(["like", "love", "good"])
neg_dictionary = set(["dislike", "hate", "bad"])

def sentiment_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in pos_dictionary:
            feats["word_in_pos_dictionary"]=1
        if word in neg_dictionary:
            feats["word_in_neg_dictionary"]=1
    return feats

In [39]:
# add a unigram feature set
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats

**Q1**: Add first new feature function here.  Describe your feature and why you think it will help.

I keep the unigram feature set and add a bigram feature set to capture two-word sequences that help distinguish positive from negative reviews. One potential advantage of bigrams is that they differentiate between phrases like `don't dislike` and `dislike`. The former contains a double negation, so a unigram-only model that sees `dislike` may misclassify the review as negative.
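
The contrast can be sketched on a toy example. This is a minimal illustration, assuming NLTK-style tokenization in which `don't` is split into `do` and `n't`; the token lists are hand-written and `bigrams` is a hypothetical helper, not part of the notebook.

```python
# Hand-tokenized toy reviews (assumed to mimic how NLTK splits "don't").
mild = ["I", "do", "n't", "dislike", "it"]
harsh = ["I", "dislike", "it"]

def bigrams(tokens):
    """Return the set of adjacent token pairs, each joined by a space."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}

# Both reviews share the unigram "dislike" ...
shared_unigrams = set(mild) & set(harsh)
# ... but only the first one contains the negated bigram.
only_in_mild = bigrams(mild) - bigrams(harsh)

print(sorted(shared_unigrams))
print(sorted(only_in_mild))
```

The unigram `dislike` is present in both reviews, whereas the bigram `n't dislike` appears only in the mild one, which gives the classifier a separate weight to learn for the negated form.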

In [40]:
# add a bigram feature set
def new_feature_class_one(tokens):
    feats={}  
    for i in range(len(tokens) - 1):
        bigram = (tokens[i], tokens[i + 1])  
        bigram = ' '.join(bigram)
        feats["BIGRAM_%s" % bigram] = 1
    return feats

**Q2**: Add second new feature function here. Describe your feature and why you think it will help.

For the second feature set, I use keyword dictionaries to look up emphatic tokens, mostly words written in all caps, which on closer inspection of the data occur frequently in both positive and negative reviews. These tokens stand out because writers use them to express strong opinions about the movie. In positive reviews, `WOW` and `ABSOLUTELY` are among the most common all-caps tokens. In negative reviews, `NOT`, `BUT`, and the question mark `?` occur frequently to express strong dissatisfaction with the movie. A unigram-only model treats all tokens alike, so it might not single out these emphatic tokens.
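
A more general alternative to a fixed keyword list would be to flag any token written entirely in capitals. The sketch below is an illustration under stated assumptions, not the feature used in this notebook: `allcaps_feature` and its `min_len` threshold are hypothetical names introduced here.

```python
def allcaps_feature(tokens, min_len=2):
    """Flag documents that contain at least one all-caps word.

    min_len is a hypothetical threshold that skips one-letter tokens
    such as "I" and "A", which are capitalized by convention.
    """
    feats = {}
    for tok in tokens:
        # isalpha() excludes punctuation and numbers; isupper() checks the case
        if tok.isalpha() and tok.isupper() and len(tok) >= min_len:
            feats["ALLCAPS_present"] = 1
            break
    return feats

print(allcaps_feature(["I", "was", "NOT", "impressed", "."]))
print(allcaps_feature(["I", "liked", "it", "."]))
```

Unlike the dictionaries, this variant carries no polarity of its own, so whether it helps would depend on whether shouting correlates with one label in the data.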

In [41]:
# add a keyword feature set
pos_cap_dict = set(["WOW", "ABSOLUTELY", "SO", "EVER"])
neg_cap_dict = set(["NOT", "VERY", "BUT", "?", "ONLY", "NO", "WTF", "...", "SPENT", "$"])

def new_feature_class_two(tokens):
    feats={}
    for word in tokens:
        if word in pos_cap_dict:
            feats["allcaps_in_pos_dict"]=1
        if word in neg_cap_dict:
            feats["allcaps_in_neg_dict"]=1
    return feats

This is the main function we'll use to aggregate together all of the information from different feature classes.  Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning.  (Here you want to make sure that the keys each feature function is creating are unique so they don't get clobbered by other functions).
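
The clobbering risk can be seen in a small sketch. The feature functions here are hypothetical and exist only to show what `dict.update` does when two functions emit the same key.

```python
def count_exclaims(tokens):
    # Hypothetical feature that uses a generic key.
    return {"count": tokens.count("!")}

def count_questions(tokens):
    # Hypothetical feature that reuses the same generic key.
    return {"count": tokens.count("?")}

def count_questions_prefixed(tokens):
    # Same feature, but with a prefix that keeps the key unique.
    return {"QUESTION_count": tokens.count("?")}

tokens = ["What", "?", "!", "!"]

clobbered = {}
for fn in (count_exclaims, count_questions):
    clobbered.update(fn(tokens))   # the second update overwrites "count"

safe = {}
for fn in (count_exclaims, count_questions_prefixed):
    safe.update(fn(tokens))        # both keys survive

print(clobbered)
print(safe)
```

Prefixing every key with the name of its feature class, as `UNIGRAM_` and `BIGRAM_` do in this notebook, is a simple way to avoid the collision.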

In [42]:
def build_features(trainX, feature_functions):
    data=[]
    for tokens in trainX:
        feats={}

        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [43]:
# This helper function converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [44]:
# This helper function converts a dictionary of feature names to a sparse representation
# that we can fit in a scikit-learn model.  This is important because almost all feature 
# values will be 0 for most documents (note: why?), and we don't want to save them all in 
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
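
To see why a sparse format matters, it helps to count how many cells of the document-by-feature matrix are actually nonzero. This is a toy sketch in plain Python with made-up documents; the variable names are illustrative.

```python
# Three made-up documents, already converted to feature dicts.
docs = [
    {"UNIGRAM_great": 1, "UNIGRAM_movie": 1},
    {"UNIGRAM_awful": 1, "UNIGRAM_movie": 1},
    {"UNIGRAM_boring": 1, "UNIGRAM_plot": 1, "UNIGRAM_movie": 1},
]

# Every distinct feature name becomes one column of the matrix.
vocab = sorted({feat for doc in docs for feat in doc})

total_cells = len(docs) * len(vocab)
nonzero_cells = sum(len(doc) for doc in docs)
density = nonzero_cells / total_cells

print(len(vocab), total_cells, nonzero_cells)
print(round(density, 3))
```

Each document only uses its own words, while the number of columns grows with the vocabulary of the whole corpus, so the fraction of nonzero cells shrinks as more documents are added.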

In [45]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY)) 
    return logreg, feature_vocab
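
Because the vocabulary is built from the training features only, a feature that appears only in the dev data has no column and is silently dropped. A minimal sketch with made-up feature dicts (all names here are illustrative):

```python
train_feats = [{"UNIGRAM_good": 1}, {"UNIGRAM_bad": 1}]
dev_feats = [{"UNIGRAM_good": 1, "UNIGRAM_mediocre": 1}]

# Assign a column index to each feature seen in training.
vocab = {}
for doc in train_feats:
    for feat in doc:
        vocab.setdefault(feat, len(vocab))

# Map dev documents onto those columns, skipping unseen features.
dev_rows = [
    {vocab[feat]: value for feat, value in doc.items() if feat in vocab}
    for doc in dev_feats
]

print(vocab)
print(dev_rows)
```

Here `UNIGRAM_mediocre` never receives an index, so the dev document is represented by `UNIGRAM_good` alone. This mirrors what happens at prediction time, when the model cannot have learned a weight for a feature it never saw.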

In [46]:
def print_weights(clf, vocab, n=10):

    reverse_vocab=[None]*len(clf.coef_[0])
    for k in vocab:
        reverse_vocab[vocab[k]]=k

    if len(clf.classes_) == 2:
        
        weights=clf.coef_[0]
        print("Features predicting negative reviews:")
        for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
            print("%.3f\t%s" % (weight, feature))

        print()
        print("Features predicting positive reviews:")
        for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
            print("%.3f\t%s" % (weight, feature))

    else:  
        for i, cat in enumerate(clf.classes_):

            weights=clf.coef_[i]

            for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
                print("%s\t%.3f\t%s" % (cat, weight, feature))
            print()

In [47]:
baseline=majority_class(trainY,devY)

pos	0.500


In [48]:
# This function trains a model and returns the predicted and true labels for the dev data
def evaluate(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    predictions=logreg.predict(devX_ids)
    return (predictions, devY)

def binomial_test(predictions, truth, baseline, significance_level=0.95):
    correct=[]
    for pred, gold in zip(predictions, truth):
        correct.append(int(pred==gold))
        
    success_rate=np.mean(correct)

    # two-tailed test
    critical_value=(1-significance_level)/2
    # ppf finds z such that p(X < z) = critical_value
    z_alpha=-1*norm.ppf(critical_value)
    print("Critical value: %.3f\tz_alpha: %.3f" % (critical_value, z_alpha))
    
    # the standard error is the square root of the variance/sample size
    # the variance for a binomial test is p*(1-p)
    standard_error=sqrt((success_rate*(1-success_rate))/len(correct))

    Z=(success_rate-baseline)/standard_error
    lower=success_rate-z_alpha*standard_error
    upper=success_rate+z_alpha*standard_error
    # two-tailed p-value: count the probability mass in both tails
    pval=2*norm.cdf(-abs(Z))
    print ("Accuracy: %.3f, n = %s" % (success_rate, len(correct)))
    print("%s%% Confidence interval: [%.3f,%.3f]" % (significance_level*100, lower, upper))

    print("Z score: %.3f" % Z)
    print("p-value: %.5f" % pval)

    print ("Critical region corresponding to z_alpha=[%.3f,%.3f]: [%.3f, %.3f]" % (-z_alpha, z_alpha, baseline-z_alpha*standard_error, baseline+z_alpha*standard_error))
    print ("Can we reject the null that %.3f is no different from %.3f at the %s%% confidence level? %s" % (success_rate, baseline, significance_level*100, "Yes" if Z < -z_alpha or Z > z_alpha else "No"))
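
As a worked example of this test, the numbers reported later in this notebook can be plugged in directly: an accuracy of 0.887 on the 5,000 dev reviews against the majority-class baseline of 0.5. The sketch below re-derives the same quantities with the standard library's `statistics.NormalDist` rather than `scipy.stats.norm`; `z_test` is a hypothetical helper name, and the p-value here is two-tailed (twice the tail probability).

```python
from math import sqrt
from statistics import NormalDist

def z_test(accuracy, n, baseline, confidence=0.95):
    """Normal-approximation test of an accuracy against a fixed baseline."""
    std_normal = NormalDist()
    # z such that the two tails together hold (1 - confidence) of the mass
    z_alpha = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    # standard error of a proportion: sqrt(p * (1 - p) / n)
    standard_error = sqrt(accuracy * (1 - accuracy) / n)
    z = (accuracy - baseline) / standard_error
    lower = accuracy - z_alpha * standard_error
    upper = accuracy + z_alpha * standard_error
    p_value = 2 * std_normal.cdf(-abs(z))
    return z, (lower, upper), p_value

z, (lower, upper), p_value = z_test(accuracy=0.887, n=5000, baseline=0.5)
print("Z = %.1f, CI = [%.3f, %.3f]" % (z, lower, upper))
```

With these inputs the standard error is about 0.0045, so the 95% interval is roughly [0.878, 0.896] and lies far above the 0.5 baseline.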

Explore the impact of different feature functions by evaluating them below:

In [18]:
# run classification on sentiment_dict_feature
features=[sentiment_dictionary_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.610


If you want to print the coefficients for any of the models you train, you can do so like this.

In [19]:
# print_weights(clf, vocab)

In [28]:
# run classification on unigram_feature
features=[unigram_feature]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.887


In [21]:
print("Get top features from unigram feature set: \n")
print_weights(clf, vocab)

Get top features from unigram feature set: 

Features predicting negative reviews:
-2.606	UNIGRAM_4/10
-2.478	UNIGRAM_worst
-2.158	UNIGRAM_waste
-1.847	UNIGRAM_poorly
-1.726	UNIGRAM_awful
-1.689	UNIGRAM_3/10
-1.684	UNIGRAM_disappointment
-1.536	UNIGRAM_disappointing
-1.420	UNIGRAM_boring
-1.351	UNIGRAM_Avoid

Features predicting positive reviews:
2.685	UNIGRAM_7/10
1.583	UNIGRAM_wonderfully
1.510	UNIGRAM_8/10
1.459	UNIGRAM_7
1.453	UNIGRAM_excellent
1.433	UNIGRAM_perfect
1.421	UNIGRAM_Excellent
1.411	UNIGRAM_refreshing
1.257	UNIGRAM_10/10
1.251	UNIGRAM_8


In [22]:
# run classification on bigram_feature 
features=[new_feature_class_one]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.882


In [23]:
print("Get top features from bigram feature set: \n")
print_weights(clf, vocab)

Get top features from bigram feature set: 

Features predicting negative reviews:
-2.439	BIGRAM_the worst
-1.469	BIGRAM_waste of
-1.206	BIGRAM_awful .
-1.199	BIGRAM_bad .
-1.068	BIGRAM_. 4/10
-0.963	BIGRAM_boring .
-0.963	BIGRAM_at all
-0.930	BIGRAM_terrible .
-0.914	BIGRAM_not worth
-0.902	BIGRAM_so bad

Features predicting positive reviews:
1.181	BIGRAM_. Great
1.117	BIGRAM_the best
1.108	BIGRAM_my favorite
0.924	BIGRAM_is great
0.914	BIGRAM_excellent .
0.903	BIGRAM_love this
0.901	BIGRAM_an excellent
0.888	BIGRAM_a great
0.885	BIGRAM_is excellent
0.876	BIGRAM_on DVD


I'm a bit surprised that running classification on unigram and bigram features individually yields very similar accuracy (0.887 versus 0.882). Moreover, explicit review scores turn out to matter in predicting binary sentiment. For instance, low scores such as 3/10 or 4/10 often appear in negative reviews, while higher scores such as 7/10 and 8/10 are among the strongest features for predicting positive reviews. I believe this is something people often overlook when classifying reviews tied to a scoring mechanism: the reviewer's sentiment toward the movie is expressed directly through the score.
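
One way to act on this observation would be a dedicated feature that reads explicit ratings out of the text. The sketch below is hypothetical and was not run on the data: `rating_feature`, the `N/10` pattern, and the cutoff of 5 are all assumptions chosen for illustration. It relies on the tokenizer keeping a score such as `7/10` as a single token, which the `UNIGRAM_7/10` features above suggest it does.

```python
import re

# Matches tokens such as "7/10" or "10/10" (assumed rating format).
RATING_PATTERN = re.compile(r"^(\d{1,2})/10$")

def rating_feature(tokens, cutoff=5):
    """Flag whether a review states a score above, or at/below, the cutoff."""
    feats = {}
    for tok in tokens:
        match = RATING_PATTERN.match(tok)
        if match is None:
            continue
        score = int(match.group(1))
        if score > 10:
            continue  # not a plausible rating, e.g. "25/10"
        if score > cutoff:
            feats["RATING_high"] = 1
        else:
            feats["RATING_low"] = 1
    return feats

print(rating_feature(["A", "solid", "7/10", "."]))
print(rating_feature(["Avoid", ".", "3/10"]))
```

Bucketing the scores into two features lets rare ratings share a weight with common ones, instead of each score needing enough examples to learn its own unigram weight.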

In [24]:
# run classification on keyword_feature 
features=[new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.598


In [25]:
print("Get top features from all cap feature set: \n")
print_weights(clf, vocab)

Get top features from all cap feature set: 

Features predicting negative reviews:
-0.723	allcaps_in_neg_dict
-0.487	allcaps_in_pos_dict

Features predicting positive reviews:
-0.487	allcaps_in_pos_dict
-0.723	allcaps_in_neg_dict


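The accuracy of the second feature set, which relies on the all-caps keyword dictionaries, is low (0.598). Because this feature set has only two features, both appear in each of the two lists above. Both coefficients are negative and differ only in magnitude, so the presence of either feature pushes the prediction toward the negative class. This suggests that the positive dictionary does not signal positive sentiment as intended, and that this feature set on its own does little to separate the two labels.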
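
With only two binary features, the model sees just four distinct inputs, so it can produce at most four distinct scores. The sketch below enumerates them using the two coefficients printed above. The fitted intercept is not shown in the notebook, so the value used here is a made-up placeholder.

```python
from itertools import product
from math import exp

w_neg_dict = -0.723   # allcaps_in_neg_dict, from the printed weights
w_pos_dict = -0.487   # allcaps_in_pos_dict, from the printed weights
intercept = 0.5       # hypothetical: the fitted intercept is not reported

def prob_positive(has_neg, has_pos):
    """Logistic-regression probability of the positive class."""
    log_odds = intercept + w_neg_dict * has_neg + w_pos_dict * has_pos
    return 1 / (1 + exp(-log_odds))

# Enumerate every possible combination of the two binary features.
scores = {
    (has_neg, has_pos): prob_positive(has_neg, has_pos)
    for has_neg, has_pos in product((0, 1), repeat=2)
}
for combo, p in scores.items():
    print(combo, round(p, 3))
```

Whatever the true intercept is, a review with neither feature gets the highest positive probability and each feature present lowers it, which is why the positive dictionary cannot act as evidence for a positive review in this model.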

In [26]:
features=[new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.883


In [27]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
clf, vocab=pipeline(trainX, devX, trainY, devY, features)

Accuracy: 0.899


In [28]:
print_weights(clf, vocab)

Features predicting negative reviews:
-1.580	UNIGRAM_worst
-1.411	UNIGRAM_awful
-1.272	UNIGRAM_boring
-1.269	UNIGRAM_waste
-1.231	BIGRAM_the worst
-1.104	UNIGRAM_poor
-1.099	UNIGRAM_terrible
-1.085	UNIGRAM_bad
-1.054	UNIGRAM_4/10
-0.991	UNIGRAM_poorly

Features predicting positive reviews:
1.250	UNIGRAM_excellent
1.094	UNIGRAM_perfect
1.000	UNIGRAM_7/10
0.900	UNIGRAM_amazing
0.887	UNIGRAM_7
0.822	UNIGRAM_wonderful
0.793	UNIGRAM_great
0.713	BIGRAM_. Great
0.712	UNIGRAM_true
0.712	BIGRAM_better than


Here, the final model, which includes all three feature sets, yields the highest classification accuracy (0.899). The top features are a mix of unigrams and bigrams, typically adjectives and review scores. It is interesting that comparative (e.g. `better than`) and superlative (e.g. `the worst`) constructions appear in the list; these are defining expressions of the reviewer's sentiment toward the movie.