# Visualizing Phrase Trees
## 1. Overview

The Rotten Tomatoes movie review dataset contains a random sample of individual sentences, each scored for sentiment on a 5-point scale (0-4). Each root sentence has also been parsed grammatically, and every constituent phrase at every level of the resulting phrase tree has been scored for sentiment and added to the dataset under the root sentence. The hypothesis is that including these constituents in the model can offset weaknesses of the traditional "bag-of-words" approach that stem from the loss of grammatical information (e.g., scope of negation, polarity reversal). 

In this notebook I will provide a method to reconstruct and visualize the phrase tree structure using the NLTK package. 

## 2. Data Exploration

The data has already been split into separate train and test datasets, which are loaded into pandas below.

In [None]:
import pandas as pd 

train_file = '../input/sentiment-analysis-on-movie-reviews/train.tsv.zip'
test_file = '../input/sentiment-analysis-on-movie-reviews/test.tsv.zip'

train = pd.read_csv(train_file, delimiter = '\t', compression = 'zip')
test = pd.read_csv(test_file, delimiter = '\t', compression = 'zip')

In [None]:
train.shape

In [None]:
test.shape

I am also going to add a new column to the dataframe that counts the number of words (N) within the phrase. 

In [None]:
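from nltk import RegexpTokenizer

# Build the tokenizer once; r"\w+" keeps runs of word characters and drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")

def n_grams(phrase):
    """Return the number of word tokens in a phrase."""
    return len(tokenizer.tokenize(phrase))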

train['N'] = train['Phrase'].apply(n_grams)
test['N'] = test['Phrase'].apply(n_grams)
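
As a quick illustration of what this count includes, the sketch below uses the standard-library `re.findall` with the same `\w+` pattern. I am assuming it tokenizes these inputs the same way as the NLTK tokenizer; the helper name `count_words` and the sample phrases are made up for the example.

```python
import re

def count_words(phrase):
    # Runs of word characters are tokens; punctuation and spaces are skipped.
    return len(re.findall(r"\w+", phrase))

samples = ["is n't funny .", "A series of escapades", "..."]
counts = [count_words(s) for s in samples]
print(counts)
```

Note that a token such as `n't` is split at the apostrophe and counted as two words, and a punctuation-only phrase gets a count of zero.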

Now we can examine the first rows of the dataframe.

In [None]:
train.head()

The following histogram shows the distribution of phrase lengths in the dataset. The maximum phrase length in the train data is 48. The distribution is heavily skewed towards short phrases because of the recursive nature of the data (each smaller unit is part of a larger unit, so every long phrase contributes many short ones).

In [None]:
train['N'].hist(bins = 20)
train['N'].max()
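
To see why short phrases dominate, consider an idealized case (my assumption, not something taken from the dataset): a sentence of n words parsed into a strictly binary tree has n single-word leaves and n - 1 longer constituents. The sketch below counts constituent lengths for a balanced binary split; `span_lengths` is a hypothetical helper.

```python
from collections import Counter

def span_lengths(n):
    """Count constituent lengths for a balanced binary tree over n words."""
    counts = Counter({n: 1})          # the constituent covering all n words
    if n > 1:
        left = n // 2
        counts += span_lengths(left)      # left child and its descendants
        counts += span_lengths(n - left)  # right child and its descendants
    return counts

lengths = span_lengths(8)
print(sorted(lengths.items()))
```

For an 8-word sentence this gives 15 constituents, 8 of which are single words, so more than half of the rows would have N = 1 even before any other effect.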

The maximum phrase length in the test data is 52. Note that the distribution seems to be weighted more towards short phrases than in the train set.

In [None]:
test['N'].hist(bins = 20)
test['N'].max()

Phrases are identified by both a **SentenceId** and a **PhraseId**. Since the root sentence is the first phrase in each group, we can isolate the root sentences by calling **groupby** on the SentenceId and then taking the first row of each group with **first()**. For both the train and test sets, the distribution of sentence lengths is closer to a normal distribution.

In [None]:
train_sentences = train.groupby(['SentenceId']).first().reset_index()
train_sentences['N'].hist(bins = 20)
train_sentences.shape

In [None]:
test_sentences = test.groupby(['SentenceId']).first().reset_index()
test_sentences['N'].hist(bins = 20)
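
Taking the first row relies on the root sentence really being listed first. A minimal way to check that assumption is to compare the length of the first phrase in each group with the longest phrase in the same group. The toy frame below is made up for illustration; on the real data the same calls would be applied to `train`.

```python
import pandas as pd

toy = pd.DataFrame({
    "SentenceId": [1, 1, 1, 2, 2, 3, 3],
    "Phrase": ["good movie", "good", "movie",
               "not bad", "bad",
               "a", "a film"],  # sentence 3: the first row is not the longest
})

toy["chars"] = toy["Phrase"].str.len()
by_sentence = toy.groupby("SentenceId")["chars"]
first_len = by_sentence.first()
max_len = by_sentence.max()

# Sentences whose first phrase is shorter than some later phrase.
suspect_ids = first_len[first_len < max_len].index.tolist()
print(suspect_ids)
```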

## 3. Visualizing Trees

A phrase tree structure is a way of representing the hierarchical grammatical relationships between the constituents of a sentence. The **nltk.Tree** class includes methods to both construct and visualize trees. For instance, the following code builds a tree to represent the simple sentence *Poor John ran away*, with the nodes labelled with PoS tags. (Note that NLTK also includes a **draw** method, but the resulting trees cannot be viewed inline within the notebook.) 

In [None]:
from nltk import Tree
sent =  "(S (NP (A Poor ) (N John)) (VP (V ran ) (Adv away)))"
tree = Tree.fromstring(sent)
tree.pretty_print()

Our data includes a list of every constituent phrase related to a root sentence, but the hierarchical structure has not been preserved. To reconstruct the tree, I will simply find the location of each subphrase within the root sentence and add parentheses at the beginning and end of the constituent to define the node. Since we do not have node labels, I will use the sentiment score of the phrase as the label.  

In [None]:
train.loc[(train['SentenceId'] == 2)]

The following code builds separate lists for each constituent phrase and its sentiment and then adds parentheses and a label to the root sentence to mark the nodes.

In [None]:
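# Select the phrases and sentiment labels for one sentence; the first row is the root.
sentence = train.loc[train['SentenceId'] == 2]
phrases = sentence['Phrase'].to_list()
sentiments = sentence['Sentiment'].to_list()
root = phrases[0]
for p, s in zip(phrases, sentiments):
    # Find the phrase in the partially bracketed string (first match wins).
    start = root.index(p)
    # The inserted "(<label> " shifts the end of the phrase by len(label) + 2.
    end = start + len(p) + len(str(s)) + 2
    root = root[:start] + '(' + str(s) + ' ' + root[start:]
    root = root[:end] + ')' + root[end:]
    print(root)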

Finally, the tree structure can be visualized using the pretty_print method.

In [None]:
tree = Tree.fromstring(root)
tree.pretty_print()
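
To follow the index arithmetic without the dataset, here is the same bracketing step applied to a made-up three-word sentence. The function name `bracket`, the phrases, and the labels are all invented for the illustration; only the string manipulation mirrors the loop above.

```python
def bracket(phrases, labels):
    """Wrap each phrase in '(label phrase)' inside the root string (phrases[0])."""
    root = phrases[0]
    for p, s in zip(phrases, labels):
        start = root.index(p)                   # first occurrence in the current string
        end = start + len(p) + len(str(s)) + 2  # account for the inserted "(label "
        root = root[:start] + "(" + str(s) + " " + root[start:]
        root = root[:end] + ")" + root[end:]
    return root

toy_phrases = ["not very good", "not", "very good", "very", "good"]
toy_labels = [1, 2, 3, 2, 3]
toy_tree = bracket(toy_phrases, toy_labels)
print(toy_tree)  # (1 (2 not) (3 (2 very) (3 good)))
```

The resulting string is in the bracketed format that `Tree.fromstring` reads, so it can be drawn with `pretty_print` in the same way as the real sentence.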

## 4. Examining Tree Structures in the Full Datasets

Unfortunately, when we try to reconstruct the tree structure for the full train set, it becomes obvious that the data needs a significant amount of cleaning. The following code groups the data by sentence id and applies the method demonstrated above to add parentheses and node labels around the constituents. If a subphrase is not found in the root sentence, the function returns the string "error" rather than the tree structure.  

In [None]:
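def phrase_tree(phrase_group):
    """Bracket every constituent of one sentence; return 'error' if any is not found."""
    phrases = phrase_group['Phrase'].to_list()
    sentiments = phrase_group['Sentiment'].to_list()
    root = phrases[0]
    for p, s in zip(phrases, sentiments):
        try:
            start = root.index(p)
        except ValueError:
            # Stop at the first missing phrase. Continuing would search inside the
            # string 'error' itself, where a phrase such as "or" would match.
            return 'error'
        end = start + len(p) + len(str(s)) + 2
        root = root[:start] + '(' + str(s) + ' ' + root[start:]
        root = root[:end] + ')' + root[end:]
    return root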

train_trees = []
# Group by the column name (not a one-element list) so each key is a plain SentenceId.
train_groups = train.groupby('SentenceId')
for key, group in train_groups:
    root = phrase_tree(group)
    train_trees.append((key, root))
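
One caveat of `str.index` is that it matches the first occurrence of the characters, even inside a longer word. The sketch below shows the issue on a made-up sentence, together with one possible refinement (my own suggestion, not part of the method above): a hypothetical helper `find_phrase` that only accepts matches bounded by whitespace, a closing parenthesis, or the ends of the string.

```python
import re

def find_phrase(text, phrase):
    """Return the start of the first token-aligned match of phrase, or -1."""
    # Left edge: start of string or whitespace. Right edge: end, whitespace or ')'.
    pattern = r"(?<!\S)" + re.escape(phrase) + r"(?![^\s)])"
    match = re.search(pattern, text)
    return match.start() if match else -1

text = "(2 that was a mess)"
naive = text.index("a")           # lands inside "that"
aligned = find_phrase(text, "a")  # lands on the standalone token
print(naive, aligned)
```

Whether this matters for the real data depends on the order in which the subphrases are listed, so I would treat it as something to verify rather than a confirmed source of errors.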

When we count the sentences that caused an error, we find that 71 of the 8529 sentences (about 0.8%) are missing the root sentence (and possibly other subphrases). 

In [None]:
# Count the sentences whose reconstruction failed.
errors = sum(1 for _, root in train_trees if root == 'error')
print(errors)
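
If you want the ids themselves rather than just the count, a list comprehension over the same (id, tree) pairs is enough. The pairs below are made-up stand-ins with the same shape as `train_trees`.

```python
# Made-up (SentenceId, tree) pairs in the same shape as train_trees.
example_trees = [
    (1, "(1 (2 A) (2 series))"),
    (2, "error"),
    (3, "(3 (2 good))"),
    (4, "error"),
]

error_ids = [sentence_id for sentence_id, tree_str in example_trees
             if tree_str == "error"]
error_rate = len(error_ids) / len(example_trees)
print(error_ids, f"{error_rate:.1%}")
```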

The following example shows one of the sentences that threw an error. 

In [None]:
train.loc[(train['SentenceId'] == 8382)]

## 5. Conclusions

Both the train and test datasets have already been parsed, so it is not necessary for you to implement a grammatical analysis. Nevertheless, understanding the tree structure has a few implications for the sentiment analysis problem. First, if you are going to separate a validation set from the train data, you should keep subphrases together with their root sentence, or you will introduce a leakage problem. Second, be aware that the automated parsing seems to have left the data a bit messy, so be sure to clean it up. 
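
As a minimal sketch of the first point, the split can be made on sentence ids instead of on rows, so that every phrase of a sentence lands on the same side. The helper `split_by_sentence`, its parameters, and the toy frame are assumptions for illustration; on the real data you would pass `train`.

```python
import random
import pandas as pd

def split_by_sentence(df, val_fraction=0.2, seed=0):
    """Split df into (train_part, val_part) without splitting any SentenceId."""
    ids = sorted(df["SentenceId"].unique().tolist())
    random.Random(seed).shuffle(ids)           # reproducible shuffle of sentence ids
    n_val = max(1, int(len(ids) * val_fraction))
    val_ids = set(ids[:n_val])
    in_val = df["SentenceId"].isin(val_ids)    # row mask: True for validation rows
    return df[~in_val], df[in_val]

toy = pd.DataFrame({
    "SentenceId": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "Phrase": list("abcdefghij"),
})
train_part, val_part = split_by_sentence(toy, val_fraction=0.25)
shared = set(train_part["SentenceId"]) & set(val_part["SentenceId"])
print(len(train_part), len(val_part), shared)
```

Because the mask is built from sentence ids, no sentence can appear in both parts. scikit-learn's `GroupShuffleSplit` offers the same kind of grouped split if you prefer a library routine.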