# Representing Text In Pandas
<img src="images/churchill2.jpg" height="200" width="200" align="left">
<center>
<h3>In This Worksheet</h3> We will use what we learned about parsing text with nltk to load a speech into a pandas DataFrame, so that we can keep track of information about each word as we apply different nltk tools.
<h3>The Data</h3> <strong>Speech, Blood Toil Tears and Sweat</strong><br><i>Sir Winston Churchill, May 13th 1940</i><br>
This is Churchill's first speech as the prime minister of Great Britain.<br>
https://www.youtube.com/watch?v=8TlkN-dcDCk
</center>

## Brief Intro to pandas
In case you are not familiar with it, pandas (http://pandas.pydata.org/) is a Python package that provides powerful data structures that make it easy to work with structured data.  

According to their site, <i>pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure</i>

## What We Will Do With Pandas

This module will cover how to use our new nltk tools to put a speech into a structured form, the pandas DataFrame.  The DataFrame is like a table, but with a lot of powerful built-in functionality.  It has two indexes, one for the rows and one for the columns.
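As a minimal sketch of the two indexes, the toy DataFrame below (hypothetical values, not read from the speech file) exposes its row labels through `index` and its column labels through `columns`, and a single cell can be addressed by giving one label from each.

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from the speech file.
df = pd.DataFrame({'token': ['blood', 'toil', 'tears'], 'length': [5, 4, 5]})

row_labels = df.index.tolist()       # labels of the row index
column_labels = df.columns.tolist()  # labels of the column index
print(row_labels)
print(column_labels)

# Address one cell with a row label and a column label.
cell = df.loc[1, 'token']
print(cell)
```

When no row labels are supplied, pandas numbers the rows from 0, which is why the tables in this worksheet start at row 0.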

With the DataFrame, we will be able to attach attributes to words and sentences and calculate quick metrics.  We will show how to do a few of the vocabulary operations we learned last time easily in pandas.  Some things, though, especially those dealing with contexts and n-grams, are still much easier to do in nltk.

In [25]:
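fp = 'speeches/Churchill-Blood.txt'
# Read the whole speech and lowercase it so that 'The' and 'the' count as the same token.
with open(fp) as f:
    speech = f.read().lower()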

Before, we tokenized the speech by sentences and then words.  But what if we want to do both and keep it all in one data structure?  This is no big deal when we iterate through the sentences and then store our information in pandas.

We will track the sentence a word belongs to by creating a sent_id.  We will then iterate through each sentence, and through the words within it, building one record per token.

In [26]:
from nltk import sent_tokenize, word_tokenize

rows = []
sent_id = 0
for sentence in sent_tokenize(speech):
    for word in word_tokenize(sentence):
        info = {'token':word, 'sent_id':sent_id}
        rows.append(info)
    sent_id += 1
    
rows[0:5]

[{'sent_id': 0, 'token': 'on'},
 {'sent_id': 0, 'token': 'friday'},
 {'sent_id': 0, 'token': 'evening'},
 {'sent_id': 0, 'token': 'last'},
 {'sent_id': 0, 'token': 'i'}]

This loop leaves us with a list of dictionaries, each with a 'token' and 'sent_id' key.  Pandas understands how to turn this list into a DataFrame with columns 'token' and 'sent_id'.  Let's look!

In [27]:
import pandas as pd

parsed_speech = pd.DataFrame(rows)
parsed_speech.head()

   sent_id    token
0        0       on
1        0   friday
2        0  evening
3        0     last
4        0        i


Now let's quickly demo some pandas commands while answering some questions about our speech.

### How many tokens are in this speech?

In [28]:
len(parsed_speech)

698

### How many sentences are in this speech?

In [29]:
len(parsed_speech.sent_id.unique())

36
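The same count can be obtained more directly. The sketch below uses a small hypothetical table in place of `parsed_speech` and shows `nunique()`, which counts distinct values in one step, together with `groupby(...).size()`, which gives the number of tokens in each sentence.

```python
import pandas as pd

# Hypothetical stand-in for parsed_speech: six tokens spread over three sentences.
toy = pd.DataFrame({'sent_id': [0, 0, 1, 1, 1, 2],
                    'token': ['a', 'b', 'c', 'd', 'e', 'f']})

n_via_len = len(toy.sent_id.unique())      # the approach used above
n_via_nunique = toy['sent_id'].nunique()   # the same count in one call

# Number of tokens in each sentence, indexed by sent_id.
tokens_per_sentence = toy.groupby('sent_id').size()
print(n_via_nunique)
print(tokens_per_sentence)
```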

### What are the most common tokens in the speech?

In [30]:
parsed_speech.token.value_counts().head(10)

the     45
.       34
,       28
of      25
to      21
and     21
i       18
in      15
be      14
that    13
Name: token, dtype: int64

We have run into the stop word problem again.  Let's write a function really quickly that identifies whether or not a token is a stop word, and another that says whether a token is punctuation.

In [31]:
from nltk.corpus import stopwords
import string

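# Build the stop word set once; set membership is much faster than rebuilding a list on every call.
STOPS = set(stopwords.words('english'))

def is_stopword(token):
    return token in STOPS

def is_punctuation(token):
    # Substring test against string.punctuation, so it is True for any single punctuation character.
    return token in string.punctuation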

print(is_stopword('the'))
print(is_punctuation('!'))

True
True


Each column in a pandas DataFrame is called a Series.  Every Series has a method called apply that lets us apply a function to each value and get back a new Series holding the results.  This makes transformations really easy!  We can create a new column in a DataFrame, or overwrite an existing one, by assigning the result to ```df['column_name']```.

Let's see it in action!

In [32]:
parsed_speech['is_stop'] = parsed_speech.token.apply(is_stopword)
parsed_speech['is_stop'].head()

0     True
1    False
2    False
3    False
4     True
Name: is_stop, dtype: bool
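For a plain membership test like this one, pandas also offers `isin`, which checks every value against a collection without calling a Python function per row. The sketch below compares the two approaches; it assumes a tiny made-up stop list (`toy_stops`) rather than the nltk one, so it runs without any corpus downloads.

```python
import pandas as pd

# Hypothetical three-word stop list, used here instead of nltk's English list.
toy_stops = {'on', 'i', 'the'}
tokens = pd.Series(['on', 'friday', 'evening', 'last', 'i'], name='token')

# apply calls the function once per value...
via_apply = tokens.apply(lambda t: t in toy_stops)
# ...while isin performs the same membership test in a single call.
via_isin = tokens.isin(toy_stops)

print(via_isin.tolist())
print(via_apply.equals(via_isin))
```

`apply` remains the more general tool, since it accepts any function; `isin` is only a shortcut for the "is this value in that collection" case.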

In [36]:
parsed_speech['is_punct'] = parsed_speech['token'].apply(is_punctuation)
parsed_speech.head(10)

   sent_id     token  is_stop  is_punct
0        0        on     True     False
1        0    friday    False     False
2        0   evening    False     False
3        0      last    False     False
4        0         i     True     False
5        0  received    False     False
6        0      from     True     False
7        0       his     True     False
8        0   majesty    False     False
9        0       the     True     False


### What are the most common words in the speech?
Now that we have columns stating whether a token is a stop word or punctuation, we can filter our data set down to the tokens that are neither.  We do this with pandas' boolean selection, using the syntax:

```df[condition]```

First we need to define our condition: rows where both ```is_stop``` and ```is_punct``` are False.

In [41]:
condition = ~parsed_speech['is_stop'] & ~parsed_speech['is_punct']
condition.head()

0    False
1     True
2     True
3     True
4    False
dtype: bool

What comes back is a Series of True/False values, one per row.  The selection syntax uses it to decide whether to keep each row of the DataFrame, matching on the index (think of this as the row id).

In [42]:
no_stops = parsed_speech[ condition ]
no_stops.head()

   sent_id     token  is_stop  is_punct
1        0    friday    False     False
2        0   evening    False     False
3        0      last    False     False
5        0  received    False     False
8        0   majesty    False     False
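Notice that the surviving rows keep their original index labels, so the numbering has gaps. The sketch below reproduces this on a small hypothetical table and shows `reset_index(drop=True)`, one way to renumber the rows from 0 if consecutive labels are wanted.

```python
import pandas as pd

# Hypothetical stand-in for parsed_speech with a single flag column.
toy = pd.DataFrame({'token': ['on', 'friday', 'evening', 'last', 'i'],
                    'is_stop': [True, False, False, False, True]})

# Keep only the rows where is_stop is False; the original labels survive.
kept = toy[~toy['is_stop']]
print(kept.index.tolist())

# Discard the old labels and renumber from 0.
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

Keeping the original labels is often useful, because each row can still be traced back to its position in the full speech.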


And now we can get our top words!

In [44]:
no_stops.token.value_counts().head(10)

house             6
victory           5
war               5
us                4
many              4
survival          4
ministers         3
may               3
administration    3
hope              3
Name: token, dtype: int64

We can easily do this all in one line of code as well:

In [45]:
parsed_speech[ ~parsed_speech['is_stop'] & ~parsed_speech['is_punct'] ].token.value_counts().head(10)

house             6
victory           5
war               5
us                4
many              4
survival          4
ministers         3
may               3
administration    3
hope              3
Name: token, dtype: int64

### What are the most common types of punctuation in the speech?

In [48]:
parsed_speech[ parsed_speech.is_punct ].token.value_counts()

.    34
,    28
?     2
-     2
:     1
'     1
Name: token, dtype: int64

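### What questions does Churchill ask?

Every question ends in a question mark, so we first find the sent_id of each sentence that contains a '?' token, then pull all of the tokens that belong to those sentences.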

In [54]:
q_sent_ids = parsed_speech[ parsed_speech.token == '?' ].sent_id.tolist()
q_sent_ids

[23, 27]

In [62]:
question_tokens = parsed_speech[ parsed_speech.sent_id.isin(q_sent_ids) ]
question_tokens

     sent_id   token  is_stop  is_punct
479       23     you     True     False
480       23     ask    False     False
481       23       ,    False      True
482       23    what     True     False
483       23      is     True     False
484       23     our     True     False
485       23  policy    False     False
486       23       ?    False      True
542       27     you     True     False
543       27     ask    False     False


In [68]:
def concat_tokens(df):
    tokens = df.token.tolist()
    return ' '.join(tokens)

In [70]:
question_tokens.groupby('sent_id').apply(concat_tokens)

sent_id
23    you ask , what is our policy ?
27       you ask , what is our aim ?
dtype: object
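The grouping step works by splitting the rows by sent_id, running the function on each group, and stitching the results back together. As a hedged alternative sketch, the same joining can be done by selecting the token column of each group and passing `' '.join` to `agg`, which avoids the helper function. The table below is hypothetical and stands in for `question_tokens`.

```python
import pandas as pd

# Hypothetical stand-in for question_tokens: two short "sentences".
toy = pd.DataFrame({'sent_id': [0, 0, 0, 1, 1],
                    'token': ['what', 'is', '?', 'victory', '.']})

# For each sent_id, join that group's tokens with single spaces.
sentences = toy.groupby('sent_id')['token'].agg(' '.join)
print(sentences)
```

The result is a Series indexed by sent_id, one joined string per sentence.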

It is also really easy to save DataFrames to CSV.  Let's preserve our work.

In [1]:
parsed_speech.to_csv('data/parsed_churchill_blood.csv')

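One thing to keep in mind when loading the file again: by default `to_csv` also writes the row index as a first column with no header. The sketch below round-trips a small hypothetical table through an in-memory buffer (so no file or `data/` directory is needed) and shows that `read_csv` names that column `Unnamed: 0` unless `index_col=0` is passed.

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for parsed_speech.
toy = pd.DataFrame({'sent_id': [0, 0], 'token': ['on', 'friday']})

# With no path, to_csv returns the CSV text; the index is written by default.
csv_text = toy.to_csv()

# Reading it back naively turns the saved index into an extra column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Telling read_csv that the first column is the index restores the original shape.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())
```

Passing `index=False` to `to_csv` is another option when the row labels do not need to be preserved.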