### Steps:
- 1. tokenize it and choose whether to do further pre-processing or filtering.
- 2. The second step is to produce the features in the notation of the NLTK.
    - a. write feature functions
    - b. start with the “bag-of-words” features where you collect all the words in the corpus and select some number of most frequent words to be the word features.
- 3. use the NLTK Naïve Bayes classifier to train and test a classifier on your feature sets. You should use cross-validation to obtain precision, recall and F-measure scores.
    - a. you can choose to produce the features as a csv file and use sklearn to train and test a classifier, using cross-validation scores.
- 4. For a base level completion of experiments, carry out at least several experiments where you
use two different sets of features and compare the results.
EXAMPLE: Take the
unigram word features as a baseline and see if the features you designed improve the accuracy of
the classification.

Some of the types of experiments:
- filter by stopwords or other pre-processing methods
- representing negation (if using twitter data, note the difference in tokenization)
- using a sentiment lexicon with scores or counts: Subjectivity
- different sizes of vocabularies

- POS tag features
- 5. define at least one “new” feature function not given in class. Also you should try to
combine some of the earlier features, e.g. to use unigrams, bigrams, POS tag counts, and
sentiment word counts all in one feature set. Examples of new features:
    - Use the LIWC sentiment lexicon
    - combine the use of sentiment lexicons
    - use a different representation of negation, for example, carrying the scope of the negation
work over to the next punctuation
- 6. Do something from this list:
using Sklearn classifiers with features produced in NLTK.
- • using an additional type of lexicon besides Subjectivity or LIWC
- • in addition to using cross-validation on the training set, train the classifier on the entire
training set and test it on a separately available test set (only the SemEval data has these)
o note that you must save the vocabulary from the training set and use the same for
creating feature sets for the test data

- • implement additional features
    - in the email dataset, use word frequency or tfidf scores as the values of the word
features, instead of Boolean values
    - use POS tagging from the ARK on Twitter
    - twitter emoticons or other features based on internet usage or informal text, such
as repeated letters, or all caps words

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# print all outputs, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
import pandas as pd
import sklearn as sk
from nltk import *
import re
import numpy as np
import random
import matplotlib.pyplot as plt
from nltk import sent_tokenize
from prettytable import PrettyTable

In [4]:
test = pd.read_table('./sentiment-analysis-on-movie-reviews/test.tsv')
train = pd.read_table('./sentiment-analysis-on-movie-reviews/train.tsv')

In [5]:
# drop sendenceId dupes
train.drop_duplicates(subset="SentenceId", keep="first", inplace=True)

### Add more columns to analyze the data

Clone train dataframe and add additional columns

In [6]:
train_additional_cols = pd.DataFrame(train)

train_additional_cols["word_tokens"] = np.nan
train_additional_cols["phrase_length"] = np.nan
train_additional_cols["POS_tags"] = np.nan


Remove all rows where Phrase is 1 char long

In [7]:
train_additional_cols[:76]

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,,,
63,64,2,"This quiet , introspective and entertaining in...",4,,,
81,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,,,
116,117,4,A positively thrilling combination of ethnogra...,3,,,
156,157,5,Aggressive self-glorification and a manipulati...,1,,,
...,...,...,...,...,...,...,...
1937,1938,72,-LRB- Scherfig -RRB- has made a movie that wil...,3,,,
1965,1966,73,-LRB- An -RRB- absorbing documentary .,3,,,
1972,1973,74,Reeks of rot and hack work from start to finish .,2,,,
1983,1984,75,Plays like a series of vignettes -- clips of a...,1,,,


In [8]:
for i, phrase in enumerate(train_additional_cols['Phrase']):
    # just a single letter or space
    if len(phrase)==1:
        train_additional_cols.drop(train_additional_cols.index[i], inplace=True)

In [9]:
train_additional_cols.reset_index(drop=True, inplace=True)

In [10]:
# stop words from NLTK
nltk_stop_words = corpus.stopwords.words('english')

In [11]:
# function that takes a word and returns true if it consists only of non-alphabetic characters 
def alpha_filter(w):
  # pattern to match word of non-alphabetical characters
  pattern = re.compile('^[^a-z]+$')
  if pattern.match(w):
    return True
  else:
    return False

In [12]:
# removes stopwords and puctuation from text provided
def remove_stopwords_and_punct(sentence):
    for word in sentence:
        if word in nltk_stop_words:
            if alpha_filter(word):
                sentense = sentence.remove(word)
    return sentence

This will loop will:
- 1. tokenize 
- 2. set to lowercase
- 3. removes puctuation and stop words

In [13]:
 train_additional_cols_num_rows = len(train_additional_cols.index)

In [14]:
for index in range(train_additional_cols_num_rows):
    phrase = train_additional_cols['Phrase'][index]
    sent_token = sent_tokenize(phrase.lower())
#     sent_token comes as a list, we need a string
    word_token = word_tokenize(sent_token[0])
    word_token_clean = remove_stopwords_and_punct(word_token)
    train_additional_cols["word_tokens"][index] = word_token_clean
    train_additional_cols["phrase_length"][index] = len(phrase)
    

In [15]:
# check data to make sure tokens match sentences at the beginning and end
train_additional_cols.head()
train_additional_cols.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,"[a, series, of, escapades, demonstrating, the,...",188.0,
1,64,2,"This quiet , introspective and entertaining in...",4,"[this, quiet, ,, introspective, and, entertain...",74.0,
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"[even, fans, of, ismail, merchant, 's, work, ,...",100.0,
3,117,4,A positively thrilling combination of ethnogra...,3,"[a, positively, thrilling, combination, of, et...",152.0,
4,157,5,Aggressive self-glorification and a manipulati...,1,"[aggressive, self-glorification, and, a, manip...",60.0,


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
8522,155985,8540,... either you 're willing to go with this cla...,2,"[..., either, you, 're, willing, to, go, with,...",82.0,
8523,155998,8541,"Despite these annoyances , the capable Claybur...",2,"[despite, these, annoyances, ,, the, capable, ...",152.0,
8524,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1,"[-lrb-, tries, -rrb-, to, parody, a, genre, th...",81.0,
8525,156032,8543,The movie 's downfall is to substitute plot fo...,1,"[the, movie, 's, downfall, is, to, substitute,...",61.0,
8526,156040,8544,"The film is darkly atmospheric , with Herrmann...",2,"[the, film, is, darkly, atmospheric, ,, with, ...",137.0,


#### POS tagging, grammar rules phrase extraction

This loop will tag each list of tokens

In [16]:
for index, token_list in enumerate(train_additional_cols['word_tokens']):
    pos_tokens = pos_tag(token_list)
    train_additional_cols["POS_tags"][index] = pos_tokens

In [17]:
# check tags at the beginning and end
train_additional_cols.head()
train_additional_cols.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,"[a, series, of, escapades, demonstrating, the,...",188.0,"[(a, DT), (series, NN), (of, IN), (escapades, ..."
1,64,2,"This quiet , introspective and entertaining in...",4,"[this, quiet, ,, introspective, and, entertain...",74.0,"[(this, DT), (quiet, JJ), (,, ,), (introspecti..."
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"[even, fans, of, ismail, merchant, 's, work, ,...",100.0,"[(even, RB), (fans, NNS), (of, IN), (ismail, J..."
3,117,4,A positively thrilling combination of ethnogra...,3,"[a, positively, thrilling, combination, of, et...",152.0,"[(a, DT), (positively, RB), (thrilling, VBG), ..."
4,157,5,Aggressive self-glorification and a manipulati...,1,"[aggressive, self-glorification, and, a, manip...",60.0,"[(aggressive, JJ), (self-glorification, NN), (..."


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
8522,155985,8540,... either you 're willing to go with this cla...,2,"[..., either, you, 're, willing, to, go, with,...",82.0,"[(..., :), (either, CC), (you, PRP), ('re, VBP..."
8523,155998,8541,"Despite these annoyances , the capable Claybur...",2,"[despite, these, annoyances, ,, the, capable, ...",152.0,"[(despite, IN), (these, DT), (annoyances, NNS)..."
8524,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1,"[-lrb-, tries, -rrb-, to, parody, a, genre, th...",81.0,"[(-lrb-, JJ), (tries, NNS), (-rrb-, VBP), (to,..."
8525,156032,8543,The movie 's downfall is to substitute plot fo...,1,"[the, movie, 's, downfall, is, to, substitute,...",61.0,"[(the, DT), (movie, NN), ('s, POS), (downfall,..."
8526,156040,8544,"The film is darkly atmospheric , with Herrmann...",2,"[the, film, is, darkly, atmospheric, ,, with, ...",137.0,"[(the, DT), (film, NN), (is, VBZ), (darkly, JJ..."


In [18]:
# an ADJPH chunk should be formed whenever the chunker finds adverb (RB) followed by an adjective (JJ).
grammar_adjph = "ADJPH: {<RB.?>+<JJ.?>}"
# an ADVPH chunk should be formed whenever the chunker finds 2 consecutive adverbs ('RB')
grammar_advph = "ADVPH: {<RB>+<RB>}"
# an VBPH chunk should be formed whenever the chunker finds verb (VB) followed by a noun (NN).
grammar_vbph = "VBPH: {<VB.?>+<NN.?>}"
# an NPH chunk should be formed whenever the chunker finds a determiner (DT) followed by a noun (NN). We simply choose to define noun phrase as determiner followed by a noun of any kind.
grammar_nph = "NPH: {<DT>+<NN.?>}"

In [19]:
# function toimport the nltk parser to process each sentence
def create_chunk_parser(grammar_rules):
    return RegexpParser(grammar_rules)

This function will do the following:
- parse text based on parser provided
- get the actual phrase
- calculate frequency for most_common phrases (log only, we can add a return value)
- show the length of phrases sentences (log only)

In [20]:
def parse_phrases(sent, chunk_parser, label):
    tags = []
    phrases = []

    if len(sent) > 0:
#         print('sent', sent)
        tree = chunk_parser.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == label:
                tags.append(subtree)
    # Visualizing the actual phrase
    for sent in tags:
        temp = ''
        for w, t in sent:
            temp += w+ ' '    
        phrases.append(temp)
    print('phrases: ', phrases)
    # top 10 phrases
    freq = FreqDist(phrases)
    print('Top phrases by frequency: ')
    for word, freq in freq.most_common(10):
        print(word, freq)
    print("Length of {label} phrase sentences: ".format(label=label), len(tags))
    return phrases

##### Create parsers and parse the texts using the rules/parsers defined

In [21]:
adjph_parser = create_chunk_parser(grammar_adjph)
advph_parser = create_chunk_parser(grammar_advph)
vbph_parser = create_chunk_parser(grammar_vbph)
nph_parser = create_chunk_parser(grammar_nph)

Add more cols for grammar parser results

In [22]:
train_additional_cols["adjph"] = np.nan
train_additional_cols["advph"] = np.nan
train_additional_cols["vbph"] = np.nan
train_additional_cols["nph"] = np.nan

In [23]:
for index in range(train_additional_cols_num_rows):
    pos_tree = train_additional_cols['POS_tags'][index]
    if parse_phrases(pos_tree, adjph_parser, "ADJPH"):
        train_additional_cols["adjph"][index] = parse_phrases(pos_tree, adjph_parser, "ADJPH")
    else:
        train_additional_cols["adjph"][index] = 'no ADJPH phrases detected'

phrases:  ['also good ']
Top phrases by frequency: 
also good  1
Length of ADJPH phrase sentences:  1
phrases:  ['also good ']
Top phrases by frequency: 
also good  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['nearly epic ']
Top phrases by frequency: 
nearly epic  1
Length of ADJPH phrase sentences:  1
phrases:  ['nearly epic ']
Top phrases by frequency: 
nearly epic  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['so thick ']
Top phrases by frequency: 
so thick  1
Length of ADJPH phrase sentences:  1
phrases:  ['so thick ']
Top phrases by frequency: 
so thick  1
Length of ADJPH phrase sentences:  1

Top phrases by frequency: 
quite enough  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['delicately complex ']
Top phrases by frequency: 
delicately complex  1
Length of ADJPH phrase sentences:  1
phrases:  ['delicately complex ']
Top phrases by frequency: 
delicately complex  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase s

phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['as estrogen-free ']
Top phrases by frequency: 
as estrogen-free  1
Length of ADJPH phrase sentences:  1
phrases:  ['as estrogen-free ']
Top phrases by frequency: 
as estrogen-free  1
Length of ADJPH phrase sentences:  1
phrases:  ['admittedly limited ']
Top phrases by frequency: 
admittedly limited  1
Length of ADJPH phrase sentences:  1
phrases:  ['admittedly limited ']
Top phrases by frequency: 
admittedly limited  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['not entirely memorable ', 'certainly easy ']
Top phrases by frequency: 
not entirely memorable  1
certainly easy  1
Length of ADJPH phrase sentences:  2
phrases:  ['not entirely memorable ', 'certainly easy '

Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['otherwise excellent ']
Top phrases by frequency: 
otherwise excellent  1
Length of ADJPH phrase sentences:  1
phrases:  ['otherwise excellent ']
Top phrases by frequency: 
otherwise excellent  1
Length of ADJPH phrase sentences:  1
phrases:  ['extremely funny ']
Top phrases by frequency: 
extremely funny  1
Length of ADJPH phrase sentences:  1
phrases:  ['extremely funny ']
Top phrases by frequency: 
extremely funny  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['too much ']
Top phrases by frequency: 
too much  1
Length of ADJPH phrase sentences:  1
phrases:  ['too much ']
Top phrases by frequency: 
too much  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequ

phrases:  ['very much ']
Top phrases by frequency: 
very much  1
Length of ADJPH phrase sentences:  1
phrases:  ['very much ']
Top phrases by frequency: 
very much  1
Length of ADJPH phrase sentences:  1
phrases:  ['little more dramatic ', 'more editing ']
Top phrases by frequency: 
little more dramatic  1
more editing  1
Length of ADJPH phrase sentences:  2
phrases:  ['little more dramatic ', 'more editing ']
Top phrases by frequency: 
little more dramatic  1
more editing  1
Length of ADJPH phrase sentences:  2
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['easily skippable ']
Top phrases by frequency: 
easily skippable  1
Length of ADJPH ph

phrases:  ['much colorful ']
Top phrases by frequency: 
much colorful  1
Length of ADJPH phrase sentences:  1
phrases:  ['much colorful ']
Top phrases by frequency: 
much colorful  1
Length of ADJPH phrase sentences:  1
phrases:  ['just different ']
Top phrases by frequency: 
just different  1
Length of ADJPH phrase sentences:  1
phrases:  ['just different ']
Top phrases by frequency: 
just different  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ["n't very bright "]
Top phrases by frequency: 
n't very bright  1
Length of ADJPH phrase sentences:  1
phrases:  ["n't very bright "]
Top phrases by frequency: 
n't very bright  1
Length of ADJPH phrase sentences:  1
phrases:  ['so tame ', 'even slightly wised-up ']
Top phrases by frequency: 
so tame  1
even slightly wised-up  1
Length of ADJPH phrase sentences:  2
phrases:  ['so tame ', 'even slightly wised-up ']
Top phrases by frequency: 
so tame  1
even slightl

phrases:  ['relentlessly globalizing ']
Top phrases by frequency: 
relentlessly globalizing  1
Length of ADJPH phrase sentences:  1
phrases:  ['relentlessly globalizing ']
Top phrases by frequency: 
relentlessly globalizing  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['truly good ']
Top phrases by frequency: 
truly good  1
Length of ADJPH phrase sentences:  1
phrases:  ['trul

In [None]:
# see how adjective phrase compares to the whole review
for index in range(train_additional_cols_num_rows):
    if train_additional_cols['adjph'][index]!="no ADJPH phrases detected":
        print(index)
        print(train_additional_cols['Phrase'][index], train_additional_cols['Sentiment'][index], train_additional_cols['adjph'][index])
        print('------next item-----------')

This function will print top 50 tags and their frequency based on POS tag list provided provided ie we can supply adjective POS token tree and tagged text.

In [None]:
def get_top50_pos_tokens(pos_list, taggedtext):
    pos_tokens = []
    freq_table = PrettyTable(['word', 'frequency'])
    for sentence in taggedtext:
        for word, pos in sentence:
            if pos in pos_list:
                if len(word)>1:
                    pos_tokens.append(word)
    freq_pos = FreqDist(pos_tokens)
    for word, freq in freq_pos.most_common(50):
        freq_table.add_row([word, freq])
    print(freq_table)

### Next steps

- 1. We can incorporate frequency you worked on
- 2. Add adverb, verb, noun phrase analysis (rules already set up, I did adjective phrases as an example)
- 3. Run Naive Bayes and maybe another model to predict