### Steps:
- 1. tokenize it and choose whether to do further pre-processing or filtering.
- 2. The second step is to produce the features in the notation of the NLTK.
    - a. write feature functions
    - b. start with the “bag-of-words” features where you collect all the words in the corpus and select some number of most frequent words to be the word features.
- 3. use the NLTK Naïve Bayes classifier to train and test a classifier on your feature sets. You should use cross-validation to obtain precision, recall and F-measure scores.
    - a. you can choose to produce the features as a csv file and use sklearn to train and test a classifier, using cross-validation scores.
- 4. For a base level completion of experiments, carry out at least several experiments where you
use two different sets of features and compare the results.
EXAMPLE: Take the
unigram word features as a baseline and see if the features you designed improve the accuracy of
the classification.

Some of the types of experiments:
- filter by stopwords or other pre-processing methods
- representing negation (if using twitter data, note the difference in tokenization)
- using a sentiment lexicon with scores or counts: Subjectivity
- different sizes of vocabularies

- POS tag features
- 5. define at least one “new” feature function not given in class. Also you should try to
combine some of the earlier features, e.g. to use unigrams, bigrams, POS tag counts, and
sentiment word counts all in one feature set. Examples of new features:
    - Use the LIWC sentiment lexicon
    - combine the use of sentiment lexicons
    - use a different representation of negation, for example, carrying the scope of the negation
work over to the next punctuation
- 6. Do something from this list:
using Sklearn classifiers with features produced in NLTK.
- • using an additional type of lexicon besides Subjectivity or LIWC
- • in addition to using cross-validation on the training set, train the classifier on the entire
training set and test it on a separately available test set (only the SemEval data has these)
o note that you must save the vocabulary from the training set and use the same for
creating feature sets for the test data

- • implement additional features
    - in the email dataset, use word frequency or tfidf scores as the values of the word
features, instead of Boolean values
    - use POS tagging from the ARK on Twitter
    - twitter emoticons or other features based on internet usage or informal text, such
as repeated letters, or all caps words

In [87]:
import warnings
warnings.filterwarnings('ignore')

In [88]:
# print all outputs, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [89]:
import pandas as pd
import sklearn as sk
from nltk import *
import re
import numpy as np
import random
import matplotlib.pyplot as plt
from nltk import sent_tokenize
from prettytable import PrettyTable

In [90]:
test = pd.read_table('./sentiment-analysis-on-movie-reviews/test.tsv')
train = pd.read_table('./sentiment-analysis-on-movie-reviews/train.tsv')

In [91]:
# drop sendenceId dupes
train.drop_duplicates(subset="SentenceId", keep="first", inplace=True)

### Add more columns to analyze the data

Clone train dataframe and add additional columns

In [92]:
train_additional_cols = pd.DataFrame(train)

train_additional_cols["word_tokens"] = np.nan
train_additional_cols["phrase_length"] = np.nan
train_additional_cols["POS_tags"] = np.nan


Remove all rows where Phrase is 1 char long

In [93]:
train_additional_cols[:76]

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,,,
63,64,2,"This quiet , introspective and entertaining in...",4,,,
81,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,,,
116,117,4,A positively thrilling combination of ethnogra...,3,,,
156,157,5,Aggressive self-glorification and a manipulati...,1,,,
...,...,...,...,...,...,...,...
1937,1938,72,-LRB- Scherfig -RRB- has made a movie that wil...,3,,,
1965,1966,73,-LRB- An -RRB- absorbing documentary .,3,,,
1972,1973,74,Reeks of rot and hack work from start to finish .,2,,,
1983,1984,75,Plays like a series of vignettes -- clips of a...,1,,,


In [94]:
for i, phrase in enumerate(train_additional_cols['Phrase']):
    # just a single letter or space
    if len(phrase)==1:
        train_additional_cols.drop(train_additional_cols.index[i], inplace=True)

In [95]:
train_additional_cols.reset_index(drop=True, inplace=True)

In [96]:
# stop words from NLTK
nltk_stop_words = corpus.stopwords.words('english')

In [97]:
# function that takes a word and returns true if it consists only of non-alphabetic characters 
def alpha_filter(w):
  # pattern to match word of non-alphabetical characters
  pattern = re.compile('^[^a-z]+$')
  if pattern.match(w):
    return True
  else:
    return False

In [98]:
# removes stopwords and puctuation from text provided
def remove_stopwords_and_punct(word_token_list):
    clean_word_token_list = []
    for word_token in word_token_list:
        print('word_token', word_token)
        if (word_token in nltk_stop_words) or (alpha_filter(word_token)):
            continue
        else:
            clean_word_token_list.append(word_token)
    return clean_word_token_list

This will loop will:
- 1. tokenize 
- 2. set to lowercase
- 3. removes puctuation and stop words

In [99]:
 train_additional_cols_num_rows = len(train_additional_cols.index)

In [100]:
for index in range(train_additional_cols_num_rows):
    phrase = train_additional_cols['Phrase'][index]
    sent_token = sent_tokenize(phrase.lower())
#     sent_token comes as a list, we need a string
    word_token = word_tokenize(sent_token[0])
    word_token_clean = remove_stopwords_and_punct(word_token)
    train_additional_cols["word_tokens"][index] = word_token_clean
    train_additional_cols["phrase_length"][index] = len(phrase)
    

word_token a
word_token series
word_token of
word_token escapades
word_token demonstrating
word_token the
word_token adage
word_token that
word_token what
word_token is
word_token good
word_token for
word_token the
word_token goose
word_token is
word_token also
word_token good
word_token for
word_token the
word_token gander
word_token ,
word_token some
word_token of
word_token which
word_token occasionally
word_token amuses
word_token but
word_token none
word_token of
word_token which
word_token amounts
word_token to
word_token much
word_token of
word_token a
word_token story
word_token .
word_token this
word_token quiet
word_token ,
word_token introspective
word_token and
word_token entertaining
word_token independent
word_token is
word_token worth
word_token seeking
word_token .
word_token even
word_token fans
word_token of
word_token ismail
word_token merchant
word_token 's
word_token work
word_token ,
word_token i
word_token suspect
word_token ,
word_token would
word_token have
wor

word_token a
word_token vile
word_token ,
word_token incoherent
word_token mess
word_token ...
word_token a
word_token scummy
word_token ripoff
word_token of
word_token david
word_token cronenberg
word_token 's
word_token brilliant
word_token `
word_token videodrome
word_token .
word_token '
word_token first
word_token ,
word_token for
word_token a
word_token movie
word_token that
word_token tries
word_token to
word_token be
word_token smart
word_token ,
word_token it
word_token 's
word_token kinda
word_token dumb
word_token .
word_token i
word_token was
word_token sent
word_token a
word_token copyof
word_token this
word_token film
word_token to
word_token review
word_token on
word_token dvd
word_token .
word_token ...
word_token silly
word_token humbuggery
word_token ...
word_token is
word_token n't
word_token as
word_token sharp
word_token as
word_token the
word_token original
word_token ...
word_token despite
word_token some
word_token visual
word_token virtues
word_token ,
word_tok

word_token what
word_token sets
word_token this
word_token romantic
word_token comedy
word_token apart
word_token from
word_token most
word_token hollywood
word_token romantic
word_token comedies
word_token is
word_token its
word_token low-key
word_token way
word_token of
word_token tackling
word_token what
word_token seems
word_token like
word_token done-to-death
word_token material
word_token .
word_token the
word_token film
word_token 's
word_token constant
word_token mood
word_token of
word_token melancholy
word_token and
word_token its
word_token unhurried
word_token narrative
word_token are
word_token masterfully
word_token controlled
word_token .
word_token barry
word_token sonnenfeld
word_token owes
word_token frank
word_token the
word_token pug
word_token big
word_token time
word_token one
word_token gets
word_token the
word_token impression
word_token the
word_token creators
word_token of
word_token do
word_token n't
word_token ask
word_token do
word_token n't
word_token tell

word_token a
word_token film
word_token without
word_token surprise
word_token geared
word_token toward
word_token maximum
word_token comfort
word_token and
word_token familiarity
word_token .
word_token definitely
word_token a
word_token crowd-pleaser
word_token ,
word_token but
word_token then
word_token ,
word_token so
word_token was
word_token the
word_token roman
word_token colosseum
word_token .
word_token here
word_token is
word_token a
word_token vh1
word_token behind
word_token the
word_token music
word_token special
word_token that
word_token has
word_token something
word_token a
word_token little
word_token more
word_token special
word_token behind
word_token it
word_token :
word_token music
word_token that
word_token did
word_token n't
word_token sell
word_token many
word_token records
word_token but
word_token helped
word_token change
word_token a
word_token nation
word_token .
word_token kids
word_token will
word_token love
word_token its
word_token fantasy
word_token and

word_token a
word_token buoyant
word_token romantic
word_token comedy
word_token about
word_token friendship
word_token ,
word_token love
word_token ,
word_token and
word_token the
word_token truth
word_token that
word_token we
word_token 're
word_token all
word_token in
word_token this
word_token together
word_token .
word_token davis
word_token ...
word_token gets
word_token vivid
word_token performances
word_token from
word_token her
word_token cast
word_token and
word_token pulls
word_token off
word_token some
word_token deft
word_token ally
word_token mcbeal-style
word_token fantasy
word_token sequences
word_token .
word_token the
word_token sundance
word_token film
word_token festival
word_token has
word_token become
word_token so
word_token buzz-obsessed
word_token that
word_token fans
word_token and
word_token producers
word_token descend
word_token upon
word_token utah
word_token each
word_token january
word_token to
word_token ferret
word_token out
word_token the
word_token n

word_token the
word_token film
word_token is
word_token ultimately
word_token about
word_token as
word_token inspiring
word_token as
word_token a
word_token hallmark
word_token card
word_token .
word_token with
word_token little
word_token visible
word_token talent
word_token and
word_token no
word_token energy
word_token ,
word_token colin
word_token hanks
word_token is
word_token in
word_token bad
word_token need
word_token of
word_token major
word_token acting
word_token lessons
word_token and
word_token maybe
word_token a
word_token little
word_token coffee
word_token .
word_token the
word_token film
word_token oozes
word_token craft
word_token .
word_token stevens
word_token '
word_token vibrant
word_token creative
word_token instincts
word_token are
word_token the
word_token difference
word_token between
word_token this
word_token and
word_token countless
word_token other
word_token flicks
word_token about
word_token guys
word_token and
word_token dolls
word_token .
word_token we

word_token nights
word_token feels
word_token more
word_token like
word_token a
word_token quickie
word_token tv
word_token special
word_token than
word_token a
word_token feature
word_token film
word_token ...
word_token it
word_token 's
word_token not
word_token even
word_token a
word_token tv
word_token special
word_token you
word_token 'd
word_token bother
word_token watching
word_token past
word_token the
word_token second
word_token commercial
word_token break
word_token .
word_token while
word_token there
word_token are
word_token times
word_token when
word_token the
word_token film
word_token 's
word_token reach
word_token exceeds
word_token its
word_token grasp
word_token ,
word_token the
word_token production
word_token works
word_token more
word_token often
word_token than
word_token it
word_token does
word_token n't
word_token .
word_token richly
word_token entertaining
word_token and
word_token suggestive
word_token of
word_token any
word_token number
word_token of
word_to

word_token this
word_token sade
word_token is
word_token hardly
word_token a
word_token perverse
word_token ,
word_token dangerous
word_token libertine
word_token and
word_token agitator
word_token --
word_token which
word_token would
word_token have
word_token made
word_token for
word_token better
word_token drama
word_token .
word_token `
word_token all
word_token in
word_token all
word_token ,
word_token reign
word_token of
word_token fire
word_token will
word_token be
word_token a
word_token good
word_token -lrb-
word_token successful
word_token -rrb-
word_token rental
word_token .
word_token '
word_token sunshine
word_token state
word_token surveys
word_token the
word_token landscape
word_token and
word_token assesses
word_token the
word_token issues
word_token with
word_token a
word_token clear
word_token passion
word_token for
word_token sociology
word_token .
word_token with
word_token the
word_token same
word_token sort
word_token of
word_token good-natured
word_token fun
word

word_token why
word_token would
word_token anyone
word_token cast
word_token the
word_token magnificent
word_token jackie
word_token chan
word_token in
word_token a
word_token movie
word_token full
word_token of
word_token stunt
word_token doubles
word_token and
word_token special
word_token effects
word_token ?
word_token koury
word_token frighteningly
word_token and
word_token honestly
word_token exposes
word_token one
word_token teenager
word_token 's
word_token uncomfortable
word_token class
word_token resentment
word_token and
word_token ,
word_token in
word_token turn
word_token ,
word_token his
word_token self-inflicted
word_token retaliation
word_token .
word_token terrific
word_token performances
word_token ,
word_token great
word_token to
word_token look
word_token at
word_token ,
word_token and
word_token funny
word_token .
word_token an
word_token unorthodox
word_token little
word_token film
word_token noir
word_token organized
word_token crime
word_token story
word_token t

word_token the
word_token plot
word_token twists
word_token give
word_token i
word_token am
word_token trying
word_token to
word_token break
word_token your
word_token heart
word_token an
word_token attraction
word_token it
word_token desperately
word_token needed
word_token .
word_token an
word_token action\/thriller
word_token of
word_token the
word_token finest
word_token kind
word_token ,
word_token evoking
word_token memories
word_token of
word_token day
word_token of
word_token the
word_token jackal
word_token ,
word_token the
word_token french
word_token connection
word_token ,
word_token and
word_token heat
word_token .
word_token the
word_token drama
word_token discloses
word_token almost
word_token nothing
word_token .
word_token it
word_token 's
word_token time
word_token to
word_token let
word_token your
word_token hair
word_token down
word_token --
word_token greek
word_token style
word_token .
word_token wilco
word_token fans
word_token will
word_token have
word_token a
w

word_token it
word_token 's
word_token like
word_token an
word_token all-star
word_token salute
word_token to
word_token disney
word_token 's
word_token cheesy
word_token commercialism
word_token .
word_token gooding
word_token is
word_token the
word_token energetic
word_token frontman
word_token ,
word_token and
word_token it
word_token 's
word_token hard
word_token to
word_token resist
word_token his
word_token enthusiasm
word_token ,
word_token even
word_token if
word_token the
word_token filmmakers
word_token come
word_token up
word_token with
word_token nothing
word_token original
word_token in
word_token the
word_token way
word_token of
word_token slapstick
word_token sequences
word_token .
word_token the
word_token path
word_token ice
word_token age
word_token follows
word_token most
word_token closely
word_token ,
word_token though
word_token ,
word_token is
word_token the
word_token one
word_token established
word_token by
word_token warner
word_token bros.
word_token giant
wo

word_token the
word_token sinister
word_token inspiration
word_token that
word_token fuelled
word_token devito
word_token 's
word_token early
word_token work
word_token is
word_token confused
word_token in
word_token death
word_token to
word_token smoochy
word_token into
word_token something
word_token both
word_token ugly
word_token and
word_token mindless
word_token .
word_token the
word_token leads
word_token we
word_token are
word_token given
word_token here
word_token are
word_token simply
word_token too
word_token bland
word_token to
word_token be
word_token interesting
word_token .
word_token as
word_token giddy
word_token and
word_token whimsical
word_token and
word_token relevant
word_token today
word_token as
word_token it
word_token was
word_token 270
word_token years
word_token ago
word_token .
word_token the
word_token skirmishes
word_token for
word_token power
word_token waged
word_token among
word_token victims
word_token and
word_token predators
word_token settle
word_t

word_token although
word_token it
word_token includes
word_token a
word_token fair
word_token share
word_token of
word_token dumb
word_token drug
word_token jokes
word_token and
word_token predictable
word_token slapstick
word_token ,
word_token ``
word_token orange
word_token county
word_token ``
word_token is
word_token far
word_token funnier
word_token than
word_token it
word_token would
word_token seem
word_token to
word_token have
word_token any
word_token right
word_token to
word_token be
word_token .
word_token a
word_token sleep-inducing
word_token thriller
word_token with
word_token a
word_token single
word_token twist
word_token that
word_token everyone
word_token except
word_token the
word_token characters
word_token in
word_token it
word_token can
word_token see
word_token coming
word_token a
word_token mile
word_token away
word_token .
word_token expanded
word_token to
word_token 65
word_token minutes
word_token for
word_token theatrical
word_token release
word_token ,
wor

word_token lawrence
word_token plumbs
word_token personal
word_token tragedy
word_token and
word_token also
word_token the
word_token human
word_token comedy
word_token .
word_token romething
word_token 's
word_token really
word_token wrong
word_token with
word_token this
word_token ricture
word_token !
word_token performances
word_token all
word_token around
word_token are
word_token tops
word_token ,
word_token with
word_token the
word_token two
word_token leads
word_token delivering
word_token oscar-caliber
word_token performances
word_token .
word_token while
word_token van
word_token wilder
word_token may
word_token not
word_token be
word_token the
word_token worst
word_token national
word_token lampoon
word_token film
word_token ,
word_token it
word_token 's
word_token far
word_token from
word_token being
word_token this
word_token generation
word_token 's
word_token animal
word_token house
word_token .
word_token it
word_token 's
word_token not
word_token particularly
word_token

word_token you
word_token 'll
word_token get
word_token the
word_token enjoyable
word_token basic
word_token minimum
word_token .
word_token a
word_token shimmeringly
word_token lovely
word_token coming-of-age
word_token portrait
word_token ,
word_token shot
word_token in
word_token artful
word_token ,
word_token watery
word_token tones
word_token of
word_token blue
word_token ,
word_token green
word_token and
word_token brown
word_token .
word_token the
word_token salton
word_token sea
word_token has
word_token moments
word_token of
word_token inspired
word_token humour
word_token ,
word_token though
word_token every
word_token scrap
word_token is
word_token of
word_token the
word_token darkest
word_token variety
word_token .
word_token bears
word_token about
word_token as
word_token much
word_token resemblance
word_token to
word_token the
word_token experiences
word_token of
word_token most
word_token battered
word_token women
word_token as
word_token spider-man
word_token does
word_

word_token great
word_token fun
word_token both
word_token for
word_token sports
word_token aficionados
word_token and
word_token for
word_token ordinary
word_token louts
word_token whose
word_token idea
word_token of
word_token exercise
word_token is
word_token climbing
word_token the
word_token steps
word_token of
word_token a
word_token stadium-seat
word_token megaplex
word_token .
word_token an
word_token excruciating
word_token demonstration
word_token of
word_token the
word_token unsalvageability
word_token of
word_token a
word_token movie
word_token saddled
word_token with
word_token an
word_token amateurish
word_token screenplay
word_token .
word_token trying
word_token to
word_token make
word_token head
word_token or
word_token tail
word_token of
word_token the
word_token story
word_token in
word_token the
word_token hip-hop
word_token indie
word_token snipes
word_token is
word_token enough
word_token to
word_token give
word_token you
word_token brain
word_token strain
word_to

word_token ...
word_token a
word_token bland
word_token murder-on-campus
word_token yawner
word_token .
word_token the
word_token only
word_token problem
word_token is
word_token that
word_token ,
word_token by
word_token the
word_token end
word_token ,
word_token no
word_token one
word_token in
word_token the
word_token audience
word_token or
word_token the
word_token film
word_token seems
word_token to
word_token really
word_token care
word_token .
word_token the
word_token most
word_token consistently
word_token funny
word_token of
word_token the
word_token austin
word_token powers
word_token films
word_token .
word_token this
word_token franchise
word_token has
word_token not
word_token spawned
word_token a
word_token single
word_token good
word_token film
word_token .
word_token one
word_token of
word_token the
word_token most
word_token incoherent
word_token features
word_token in
word_token recent
word_token memory
word_token .
word_token a
word_token compelling
word_token yarn


word_token leave
word_token it
word_token to
word_token the
word_token french
word_token to
word_token truly
word_token capture
word_token the
word_token terrifying
word_token angst
word_token of
word_token the
word_token modern
word_token working
word_token man
word_token without
word_token turning
word_token the
word_token film
word_token into
word_token a
word_token cheap
word_token thriller
word_token ,
word_token a
word_token dumb
word_token comedy
word_token or
word_token a
word_token sappy
word_token melodrama
word_token .
word_token although
word_token the
word_token film
word_token boils
word_token down
word_token to
word_token a
word_token lightweight
word_token story
word_token about
word_token matchmaking
word_token ,
word_token the
word_token characters
word_token make
word_token italian
word_token for
word_token beginners
word_token worth
word_token the
word_token journey
word_token will
word_token probably
word_token be
word_token one
word_token of
word_token those
word_

word_token it
word_token hates
word_token its
word_token characters
word_token .
word_token so
word_token fiendishly
word_token cunning
word_token that
word_token even
word_token the
word_token most
word_token jaded
word_token cinema
word_token audiences
word_token will
word_token leave
word_token the
word_token auditorium
word_token feeling
word_token dizzy
word_token ,
word_token confused
word_token ,
word_token and
word_token totally
word_token disorientated
word_token .
word_token like
word_token the
word_token series
word_token ,
word_token the
word_token movie
word_token is
word_token funny
word_token ,
word_token smart
word_token ,
word_token visually
word_token inventive
word_token ,
word_token and
word_token most
word_token of
word_token all
word_token ,
word_token alive
word_token .
word_token jacquot
word_token has
word_token filmed
word_token the
word_token opera
word_token exactly
word_token as
word_token the
word_token libretto
word_token directs
word_token ,
word_token i

word_token few
word_token films
word_token have
word_token captured
word_token the
word_token chaos
word_token of
word_token an
word_token urban
word_token conflagration
word_token with
word_token such
word_token fury
word_token ,
word_token and
word_token audience
word_token members
word_token will
word_token leave
word_token feeling
word_token as
word_token shaken
word_token as
word_token nesbitt
word_token 's
word_token cooper
word_token looks
word_token when
word_token the
word_token bullets
word_token stop
word_token flying
word_token .
word_token those
word_token with
word_token a
word_token modicum
word_token of
word_token patience
word_token will
word_token find
word_token in
word_token these
word_token characters
word_token '
word_token foibles
word_token a
word_token timeless
word_token and
word_token unique
word_token perspective
word_token .
word_token an
word_token unflinching
word_token ,
word_token complex
word_token portrait
word_token of
word_token a
word_token modern


word_token painfully
word_token padded
word_token .
word_token it
word_token 's
word_token worth
word_token taking
word_token the
word_token kids
word_token to
word_token .
word_token i
word_token 'm
word_token not
word_token sure
word_token which
word_token half
word_token of
word_token dragonfly
word_token is
word_token worse
word_token :
word_token the
word_token part
word_token where
word_token nothing
word_token 's
word_token happening
word_token ,
word_token or
word_token the
word_token part
word_token where
word_token something
word_token 's
word_token happening
word_token ,
word_token but
word_token it
word_token 's
word_token stupid
word_token .
word_token this
word_token movie
word_token has
word_token a
word_token strong
word_token message
word_token about
word_token never
word_token giving
word_token up
word_token on
word_token a
word_token loved
word_token one
word_token ,
word_token but
word_token it
word_token 's
word_token not
word_token an
word_token easy
word_token mo

word_token an
word_token indispensable
word_token peek
word_token at
word_token the
word_token art
word_token and
word_token the
word_token agony
word_token of
word_token making
word_token people
word_token laugh
word_token .
word_token judging
word_token by
word_token those
word_token standards
word_token ,
word_token `
word_token scratch
word_token '
word_token is
word_token a
word_token pretty
word_token decent
word_token little
word_token documentary
word_token .
word_token does
word_token point
word_token the
word_token way
word_token for
word_token adventurous
word_token indian
word_token filmmakers
word_token toward
word_token a
word_token crossover
word_token into
word_token nonethnic
word_token markets
word_token .
word_token a
word_token direct-to-void
word_token release
word_token ,
word_token heading
word_token nowhere
word_token .
word_token whether
word_token jason
word_token x
word_token is
word_token this
word_token bad
word_token on
word_token purpose
word_token is
wor

word_token it
word_token turns
word_token out
word_token to
word_token be
word_token smarter
word_token and
word_token more
word_token diabolical
word_token than
word_token you
word_token could
word_token have
word_token guessed
word_token at
word_token the
word_token beginning
word_token .
word_token in
word_token terms
word_token of
word_token execution
word_token this
word_token movie
word_token is
word_token careless
word_token and
word_token unfocused
word_token .
word_token your
word_token nightmares
word_token ,
word_token on
word_token the
word_token other
word_token hand
word_token ,
word_token will
word_token be
word_token anything
word_token but
word_token .
word_token a
word_token slick
word_token ,
word_token engrossing
word_token melodrama
word_token .
word_token this
word_token picture
word_token is
word_token murder
word_token by
word_token numbers
word_token ,
word_token and
word_token as
word_token easy
word_token to
word_token be
word_token bored
word_token by
word_t

word_token the
word_token movie
word_token ends
word_token with
word_token outtakes
word_token in
word_token which
word_token most
word_token of
word_token the
word_token characters
word_token forget
word_token their
word_token lines
word_token and
word_token just
word_token utter
word_token `
word_token uhhh
word_token ,
word_token '
word_token which
word_token is
word_token better
word_token than
word_token most
word_token of
word_token the
word_token writing
word_token in
word_token the
word_token movie
word_token .
word_token frei
word_token assembles
word_token a
word_token fascinating
word_token profile
word_token of
word_token a
word_token deeply
word_token humanistic
word_token artist
word_token who
word_token ,
word_token in
word_token spite
word_token of
word_token all
word_token that
word_token he
word_token 's
word_token witnessed
word_token ,
word_token remains
word_token surprisingly
word_token idealistic
word_token ,
word_token and
word_token retains
word_token an
word_t

word_token ...
word_token the
word_token maudlin
word_token way
word_token its
word_token story
word_token unfolds
word_token suggests
word_token a
word_token director
word_token fighting
word_token against
word_token the
word_token urge
word_token to
word_token sensationalize
word_token his
word_token material
word_token .
word_token too
word_token many
word_token scenarios
word_token in
word_token which
word_token the
word_token hero
word_token might
word_token have
word_token an
word_token opportunity
word_token to
word_token triumphantly
word_token sermonize
word_token ,
word_token and
word_token too
word_token few
word_token that
word_token allow
word_token us
word_token to
word_token wonder
word_token for
word_token ourselves
word_token if
word_token things
word_token will
word_token turn
word_token out
word_token okay
word_token .
word_token a
word_token flawed
word_token film
word_token but
word_token an
word_token admirable
word_token one
word_token that
word_token tries
word_

word_token it
word_token gets
word_token bogged
word_token down
word_token by
word_token hit-and-miss
word_token topical
word_token humour
word_token before
word_token getting
word_token to
word_token the
word_token truly
word_token good
word_token stuff
word_token .
word_token between
word_token bursts
word_token of
word_token automatic
word_token gunfire
word_token ,
word_token the
word_token story
word_token offers
word_token a
word_token trenchant
word_token critique
word_token of
word_token capitalism
word_token .
word_token in
word_token addition
word_token to
word_token gluing
word_token you
word_token to
word_token the
word_token edge
word_token of
word_token your
word_token seat
word_token ,
word_token changing
word_token lanes
word_token is
word_token also
word_token a
word_token film
word_token of
word_token freshness
word_token ,
word_token imagination
word_token and
word_token insight
word_token .
word_token lasker
word_token 's
word_token canny
word_token ,
word_token med

In [101]:
# check data to make sure tokens match sentences at the beginning and end
train_additional_cols.head()
train_additional_cols.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,"[series, escapades, demonstrating, adage, good...",188.0,
1,64,2,"This quiet , introspective and entertaining in...",4,"[quiet, introspective, entertaining, independe...",74.0,
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"[even, fans, ismail, merchant, 's, work, suspe...",100.0,
3,117,4,A positively thrilling combination of ethnogra...,3,"[positively, thrilling, combination, ethnograp...",152.0,
4,157,5,Aggressive self-glorification and a manipulati...,1,"[aggressive, self-glorification, manipulative,...",60.0,


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
8522,155985,8540,... either you 're willing to go with this cla...,2,"[either, 're, willing, go, claustrophobic, con...",82.0,
8523,155998,8541,"Despite these annoyances , the capable Claybur...",2,"[despite, annoyances, capable, clayburgh, tamb...",152.0,
8524,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1,"[-lrb-, tries, -rrb-, parody, genre, 's, alrea...",81.0,
8525,156032,8543,The movie 's downfall is to substitute plot fo...,1,"[movie, 's, downfall, substitute, plot, person...",61.0,
8526,156040,8544,"The film is darkly atmospheric , with Herrmann...",2,"[film, darkly, atmospheric, herrmann, quietly,...",137.0,


#### POS tagging, grammar rules phrase extraction

This loop will tag each list of tokens

In [102]:
for index, token_list in enumerate(train_additional_cols['word_tokens']):
    pos_tokens = pos_tag(token_list)
    train_additional_cols["POS_tags"][index] = pos_tokens

In [103]:
# check tags at the beginning and end
train_additional_cols.head()
train_additional_cols.tail()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
0,1,1,A series of escapades demonstrating the adage ...,1,"[series, escapades, demonstrating, adage, good...",188.0,"[(series, NN), (escapades, VBZ), (demonstratin..."
1,64,2,"This quiet , introspective and entertaining in...",4,"[quiet, introspective, entertaining, independe...",74.0,"[(quiet, JJ), (introspective, JJ), (entertaini..."
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"[even, fans, ismail, merchant, 's, work, suspe...",100.0,"[(even, RB), (fans, NNS), (ismail, VBP), (merc..."
3,117,4,A positively thrilling combination of ethnogra...,3,"[positively, thrilling, combination, ethnograp...",152.0,"[(positively, RB), (thrilling, VBG), (combinat..."
4,157,5,Aggressive self-glorification and a manipulati...,1,"[aggressive, self-glorification, manipulative,...",60.0,"[(aggressive, JJ), (self-glorification, NN), (..."


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags
8522,155985,8540,... either you 're willing to go with this cla...,2,"[either, 're, willing, go, claustrophobic, con...",82.0,"[(either, DT), ('re, VBP), (willing, JJ), (go,..."
8523,155998,8541,"Despite these annoyances , the capable Claybur...",2,"[despite, annoyances, capable, clayburgh, tamb...",152.0,"[(despite, IN), (annoyances, NNS), (capable, J..."
8524,156022,8542,-LRB- Tries -RRB- to parody a genre that 's al...,1,"[-lrb-, tries, -rrb-, parody, genre, 's, alrea...",81.0,"[(-lrb-, JJ), (tries, NNS), (-rrb-, VBP), (par..."
8525,156032,8543,The movie 's downfall is to substitute plot fo...,1,"[movie, 's, downfall, substitute, plot, person...",61.0,"[(movie, NN), ('s, POS), (downfall, NN), (subs..."
8526,156040,8544,"The film is darkly atmospheric , with Herrmann...",2,"[film, darkly, atmospheric, herrmann, quietly,...",137.0,"[(film, NN), (darkly, RB), (atmospheric, JJ), ..."


In [104]:
# an ADJPH chunk should be formed whenever the chunker finds adverb (RB) followed by an adjective (JJ).
grammar_adjph = "ADJPH: {<RB.?>+<JJ.?>}"
# an ADVPH chunk should be formed whenever the chunker finds 2 consecutive adverbs ('RB')
grammar_advph = "ADVPH: {<RB>+<RB>}"
# an VBPH chunk should be formed whenever the chunker finds verb (VB) followed by a noun (NN).
grammar_vbph = "VBPH: {<VB.?>+<NN.?>}"
# an NPH chunk should be formed whenever the chunker finds a determiner (DT) followed by a noun (NN). We simply choose to define noun phrase as determiner followed by a noun of any kind.
grammar_nph = "NPH: {<DT>+<NN.?>}"

In [105]:
# function toimport the nltk parser to process each sentence
def create_chunk_parser(grammar_rules):
    return RegexpParser(grammar_rules)

This function will do the following:
- parse text based on parser provided
- get the actual phrase
- calculate frequency for most_common phrases (log only, we can add a return value)
- show the length of phrases sentences (log only)

In [106]:
def parse_phrases(sent, chunk_parser, label):
    tags = []
    phrases = []

    if len(sent) > 0:
#         print('sent', sent)
        tree = chunk_parser.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == label:
                tags.append(subtree)
    # Visualizing the actual phrase
    for sent in tags:
        temp = ''
        for w, t in sent:
            temp += w+ ' '    
        phrases.append(temp)
    print('phrases: ', phrases)
    # top 10 phrases
    freq = FreqDist(phrases)
    print('Top phrases by frequency: ')
    for word, freq in freq.most_common(10):
        print(word, freq)
    print("Length of {label} phrase sentences: ".format(label=label), len(tags))
    return phrases

##### Create parsers and parse the texts using the rules/parsers defined

In [107]:
adjph_parser = create_chunk_parser(grammar_adjph)
advph_parser = create_chunk_parser(grammar_advph)
vbph_parser = create_chunk_parser(grammar_vbph)
nph_parser = create_chunk_parser(grammar_nph)

Add more cols for grammar parser results

In [108]:
train_additional_cols["adjph"] = np.nan
train_additional_cols["advph"] = np.nan
train_additional_cols["vbph"] = np.nan
train_additional_cols["nph"] = np.nan

In [109]:
for index in range(train_additional_cols_num_rows):
    pos_tree = train_additional_cols['POS_tags'][index]
    if parse_phrases(pos_tree, adjph_parser, "ADJPH"):
        train_additional_cols["adjph"][index] = parse_phrases(pos_tree, adjph_parser, "ADJPH")
    else:
        train_additional_cols["adjph"][index] = 'no ADJPH phrases detected'

phrases:  ['also good ']
Top phrases by frequency: 
also good  1
Length of ADJPH phrase sentences:  1
phrases:  ['also good ']
Top phrases by frequency: 
also good  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['nearly epic ']
Top phrases by frequency: 
nearly epic  1
Length of ADJPH phrase sentences:  1
phrases:  ['nearly epic ']
Top phrases by frequency: 
nearly epic  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Lengt

phrases:  ['perhaps snuck ']
Top phrases by frequency: 
perhaps snuck  1
Length of ADJPH phrase sentences:  1
phrases:  ['perhaps snuck ']
Top phrases by frequency: 
perhaps snuck  1
Length of ADJPH phrase sentences:  1
phrases:  ['far disappointing ']
Top phrases by frequency: 
far disappointing  1
Length of ADJPH phrase sentences:  1
phrases:  ['far disappointing ']
Top phrases by frequency: 
far disappointing  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['decidedly unoriginal ']
Top phrases by frequency: 
decidedly unoriginal  1
Length of ADJPH phrase sentences:  1
phrases:  ['decidedly unoriginal ']
Top phrases by frequency: 
decidedly unoriginal  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase se

phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ["n't much "]
Top phrases by frequency: 
n't much  1
Length of ADJPH phrase sentences:  1
phrases:  ["n't much "]
Top phrases by frequency: 
n't much  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['silly stupid ']
Top phrases by frequency: 
silly stupid  1
Length of ADJPH phrase sentences:  1
phrases:  ['silly stupid ']
Top phrases by frequency: 
silly stupid  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['merely -lrb- ', 'repeatedly undercut ']
Top phrases by frequency: 
merely -lrb-  1
repeatedly undercut  1
Length of ADJPH phrase sentences:  2
phrases:  ['merely -lrb- ', 'repeatedly unde

Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['nearly impossible ']
Top phrases by frequency: 
nearly impossible  1
Length of ADJPH phrase sentences:  1
phrases:  ['nearly impossible ']
Top phrases by frequency: 
nearly impossible  1
Length of ADJPH phrase sentences:  1
phrases:  ['amy matthew ']
Top phrases by frequency: 
amy matthew  1
Length of ADJPH phrase sentences:  1
phrases:  ['amy matthew ']
Top phrases by frequency: 
amy matthew  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['thoroughly entertaining ']
Top phrases by frequency: 
thoroughly entertaining  1
Length of ADJPH phrase sentences:  1
phrases:  ['thoroughly entertaining ']
Top phrases by frequency: 
thoroughly entertaining  1
Length of ADJPH phrase sentences: 

phrases:  ['tart little ']
Top phrases by frequency: 
tart little  1
Length of ADJPH phrase sentences:  1
phrases:  ['tart little ']
Top phrases by frequency: 
tart little  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  ['ultimately worthwhile ']
Top phrases by frequency: 
ultimately worthwhile  1
Length of ADJPH phrase sentences:  1
phrases:  ['ultimately worthwhile ']
Top phrases by frequency: 
ultimately worthwhile  1
Length of ADJPH phrase sentences: 

phrases:  ['absolutely amazing ']
Top phrases by frequency: 
absolutely amazing  1
Length of ADJPH phrase sentences:  1
phrases:  ['absolutely amazing ']
Top phrases by frequency: 
absolutely amazing  1
Length of ADJPH phrase sentences:  1
phrases:  ['really quite funny ']
Top phrases by frequency: 
really quite funny  1
Length of ADJPH phrase sentences:  1
phrases:  ['really quite funny ']
Top phrases by frequency: 
really quite funny  1
Length of ADJPH phrase sentences:  1
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH phrase sentences:  0
phrases:  []
Top phrases by frequency: 
Length of ADJPH ph

In [110]:
train_additional_cols.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,word_tokens,phrase_length,POS_tags,adjph,advph,vbph,nph
0,1,1,A series of escapades demonstrating the adage ...,1,"[series, escapades, demonstrating, adage, good...",188.0,"[(series, NN), (escapades, VBZ), (demonstratin...",[also good ],,,
1,64,2,"This quiet , introspective and entertaining in...",4,"[quiet, introspective, entertaining, independe...",74.0,"[(quiet, JJ), (introspective, JJ), (entertaini...",no ADJPH phrases detected,,,
2,82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"[even, fans, ismail, merchant, 's, work, suspe...",100.0,"[(even, RB), (fans, NNS), (ismail, VBP), (merc...",no ADJPH phrases detected,,,
3,117,4,A positively thrilling combination of ethnogra...,3,"[positively, thrilling, combination, ethnograp...",152.0,"[(positively, RB), (thrilling, VBG), (combinat...",no ADJPH phrases detected,,,
4,157,5,Aggressive self-glorification and a manipulati...,1,"[aggressive, self-glorification, manipulative,...",60.0,"[(aggressive, JJ), (self-glorification, NN), (...",no ADJPH phrases detected,,,


In [111]:
# see how adjective phrase compares to the whole review
for index in range(train_additional_cols_num_rows):
    if train_additional_cols['adjph'][index]!="no ADJPH phrases detected":
        print(index)
        print(train_additional_cols['Phrase'][index], train_additional_cols['Sentiment'][index], train_additional_cols['adjph'][index])
        print('------next item-----------')

0
A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1 ['also good ']
------next item-----------
5
A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 4 ['nearly epic ']
------next item-----------
13
Fresnadillo has something serious to say about the ways in which extravagant chance can distort our perspective and throw us off the path of good sense . 3 ['path good ']
------next item-----------
17
Like a less dizzily gorgeous companion to Mr. Wong 's In the Mood for Love -- very much a Hong Kong movie despite its mainland setting . 2 ['dizzily gorgeous ', 'love much hong ']
------next item-----------
25
It arrives with an impeccable pedigree , mongrel pep , and almost indecipherable plot complications . 2 ['almost indecipherable ']
------next item-----------
29
More vaudeville s

Among the many pleasures are the lively intelligence of the artists and their perceptiveness about their own situations . 3 ['lively intelligence ']
------next item-----------
4874
Diaz , Applegate , Blair and Posey are suitably kooky which should appeal to women and they strip down often enough to keep men alert , if not amused . 3 ['often enough ']
------next item-----------
4875
Plotless collection of moronic stunts is by far the worst movie of the year . 0 ['far worst ']
------next item-----------
4883
Noyce creates a film of near-hypnotic physical beauty even as he tells a story as horrifying as any in the heart-breakingly extensive annals of white-on-black racism . 2 ['heart-breakingly extensive ']
------next item-----------
4885
Melds derivative elements into something that is often quite rich and exciting , and always a beauty to behold . 4 ['often quite rich ']
------next item-----------
4888
It is n't that Stealing Harvard is a horrible movie -- if only it were that grand a f

This function will print top 50 tags and their frequency based on POS tag list provided provided ie we can supply adjective POS token tree and tagged text.

In [112]:
def get_top50_pos_tokens(pos_list, taggedtext):
    pos_tokens = []
    freq_table = PrettyTable(['word', 'frequency'])
    for sentence in taggedtext:
        for word, pos in sentence:
            if pos in pos_list:
                if len(word)>1:
                    pos_tokens.append(word)
    freq_pos = FreqDist(pos_tokens)
    for word, freq in freq_pos.most_common(50):
        freq_table.add_row([word, freq])
    print(freq_table)

### Next steps

####  Frequency  work

In [113]:
#Create a function that pulls the most frequent word in each review
def mostFreq(corpus):
    freqDist = FreqDist(corpus)
    freqList = list(freqDist.keys())
    try: 
        first =  freqList[0]
    except IndexError: 
        first = "None"
    return first

#### Add adverb, verb, noun phrase analysis (rules already set up, I did adjective phrases as an example)

#### Run Naive Bayes to predict negative/positive review

- with stop words
- without stop words
- maybe one where we use only the POS extracted phrase associated with it? (could be optional)