<a href="https://colab.research.google.com/github/rgrudt/CS6120-PS1/blob/master/PS1_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this experiment, you will explore the accuracy of sentiment classificaiton using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents. For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.

Then, you will implement `createFancyFeatures`, which can specify at any other features you choose to help improve performance on the classification task.

The two code blocks at the end train and evaluate two models—logistic regression with L1 and L2 regularization—using your featurization functions. Besides held-out classification accuracy with 10-fold cross-validation, you will also see the features in each class given high weights by the model.

A helpful resource for getting up to speed with vector representations of documents is the first two chapters of Delip Rao and Brian McMahan, _Natural Language Processing with PyTorch_, O'Reilly, 2019.  You should be able to <a href="https://learning.oreilly.com/library/view/natural-language-processing/9781491978221/">read it online</a> via the Northeastern Library's subscription using a <tt>northeastern.edu</tt> email address.

In [1]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [2]:
# libraries needed for fancy feature extraction
from sklearn.feature_extraction import DictVectorizer
from collections import Counter, OrderedDict

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
import pandas as pd

nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus

This is where you will implement two functions to featurize the data.

In [4]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
  v = DictVectorizer()
  # for each document count the number of occurances of each word containing 1+ alphabetic characters
  texts = v.fit_transform(Counter([item for item in corpus[doc]['text'].split(' ') if item.islower()]) for doc in range(len(corpus)))
  # find the class for each document
  classes = [corpus[i]['class'] for i in range(len(corpus))]
  # vocabulary list for matrix
  vocab = list(v.vocabulary_.keys())
  return texts,classes,vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.

# definitions used to process data
stop = stopwords.words('english')

def createFancyFeatures(corpus):
  # initialize dictionary for the features for all of the documents
  word_dict = []
  for doc in range(len(corpus)):
    # initial list of words for current document and 
    word_list = []
    # keep track of previous word in order to make bigrams
    prev_word = np.nan

    # split corpus on spaces and '-' loop through each word
    for word in re.split(' |-',corpus[doc]['text']):
      # if the word contains atleast one alphabetical character and is not in stop word
      if word.islower() and word not in stop:
        word_list += [word]
        # if the previous word is not null, then combine to make bigram
        if not pd.isnull(prev_word):
          word_list +=[prev_word +' '+ word]
        prev_word = word
      else:
        prev_word = np.nan
    counting = Counter(word_list)
    word_dict += [counting]

  # create matrix of counted features
  v = DictVectorizer()
  texts = v.fit_transform(word_dict)
  classes = [corpus[i]['class'] for i in range(len(corpus))]
  vocab = list(v.vocabulary_.keys())

  return texts,classes,vocab

In [5]:
#given a numpy matrix representation of the features for the training set, the 
# vector of true classes for each example, and the vocabulary as described 
# above, this computes the accuracy of the model using leave one out cross 
# validation and reports the most indicative features for each class

def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than 26
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 normalization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)

In [6]:
corpus = readReviews()

Run the following to train and evaluate two models using basic features:

In [7]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.816000
The most informative terms for pos are: ['palace', 'boobies', 'supplicate', '_hope', "billie's", 'newsman', 'camelia', 'francisco', 'pepto', 'rendered', "ripley's", 'cusackian', 'footsy', 'four-letter', 'up-and-up', 'shel', 'pirate', 'scratched', 'half-crazed', 'callous']
The most informative terms for neg are: ['hyper-colorized', 'ifans', 'mith', 'convenience', 'dread-factor', 'intones', "singleton's", 'ex-rolling', 'skies', 'hugely', "`kundun'", 'factory-worker-turned-prostitute', 'ti', 'imports', 'eighteenth', "wasn't", 'representing', 'chunky', 'nickolas', 'lifelessness']
----------L2 Norm-----------
The model's average accuracy is 0.833500
The most informative terms for pos are: ['four-letter', 'sample', 'bobo', 'boobies', 'thelma', 'up-and-up', 'prose', "'blair", 'effects-heavy', 'pepto', 'shel', 'supplicate', 'jade', '_hope', 'freeway', "mouse's", 'spyglass', 'schulz', 'orbit', 'rendered']
The most informative

Run the following to train and evaluate two models using extended features:

In [8]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.822000
The most informative terms for pos are: ['writer/director/star', 'life hell', 'astounding cate', 'dinos', 'home theater', 'fulfilled', 'job demonstrating', 'spent avidly', 'awful snake', 'spielberg', 'accidents', 'strangely insulting', 'glee path', "bunny's debt", 'superior abilities', 'wildly uneven', "judge's plot", 'quake', 'crime way', 'ironically clueless']
The most informative terms for neg are: ['physically backpedal', 'beautiful witch', 'errant ways', 'spark develops', 'chest beating', 'twenty something', 'big note', 'facing loss', 'formulaism', 'great experiment', 'titilation', 'wes anderson', 'mid 70s', 'director/actor/co writer', 'telecommunications industry', 'jarring christopher', "soran's plan", 'turned vampire', 'talking trash', 'ridiculous outfit']
----------L2 Norm-----------
The model's average accuracy is 0.852500
The most informative terms for pos are: ['quake', '`there', 'bandit king', 'accidents

Adding bigrams, removing stop words, and spliting words on '-' (e.g. up-and-up), improves the model accuracy.  