# Text Cleaning

**In this Python script, we will clean the data.**

This means we will *filter* and *transform* the data to include only the elements of interest, and we will return it in a format that makes it easy to analyze later.

**Goals**
<ol>
    <li> Tokenize the text into words </li>
    <li> Remove punctuation and uninformative words </li>
    <li> Convert all words to lowercase </li>
    <li> Read a frequency distribution </li>
</ol>

In [None]:
###########
## Setup ##
###########

# Each new notebook starts with a fresh environment, so we have to set it up again.

# Import packages
import matplotlib.pyplot as plt
import numpy as np
import pickle
from scipy import special
from nltk import FreqDist, Text, word_tokenize
from nltk.corpus import stopwords

# This helps matplotlib and Jupyter cooperate in showing graphics
%matplotlib inline

# The relative path to our text file.
# In other words, where the file is relative to this Jupyter notebook.
pathToFile = 'texts/walden.txt'

# Open Walden and read it (hence the 'r') into the variable 'file'
file = open( pathToFile, 'r')


## Tokenization

Tokenization is the process of breaking down a big text into units of interest, usually words.

But, what is a word?  This might seem like a weird question, but consider the example "black sheep", defined by the Cambridge English Dictionary below:

>*Someone who embarrasses a group or family because the person is different or has gotten into trouble.*

Based on this, you might say that the 'sheep' in 'black sheep' is different than the 'sheep' in the phrase 'my pet sheep', and you have the beginnings of a good argument that 'black sheep' is one word, not two.

> NLTK (and most text analytics tools) will tokenize "my black sheep uncle owns a black cat" as `['my', 'black', 'sheep', 'uncle', 'owns', 'a', 'black', 'cat']`. **We have a hunch that this is wrong, but whether it matters for your analysis is up to you.**
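
If you did want "black sheep" to count as a single token, one option is to merge known phrases after tokenizing. The sketch below is a minimal illustration in plain Python, not part of the workshop code: `merge_phrases` and the `phrases` set are hypothetical names, and it only handles two-word phrases. (NLTK also provides an `MWETokenizer` class in `nltk.tokenize` for this kind of task.)

```python
# Hypothetical helper (not an NLTK function): merge known two-word phrases into single tokens.
def merge_phrases(tokens, phrases):
    """Join adjacent token pairs that appear in `phrases` with an underscore."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            # The pair is a known phrase, so emit it as one token and skip both words
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

example_tokens = ['my', 'black', 'sheep', 'uncle', 'owns', 'a', 'black', 'cat']
phrases = {('black', 'sheep')}  # assumed list of phrases we care about

merged_tokens = merge_phrases(example_tokens, phrases)
print(merged_tokens)
# ['my', 'black_sheep', 'uncle', 'owns', 'a', 'black', 'cat']
```

Note that the second 'black' (in 'black cat') is left alone, because only the exact pair 'black sheep' is in the phrase list.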

For the purposes of today, we will ignore these kinds of issues, but you should keep in mind that NLTK is built to work for most purposes, not necessarily your purposes. Before you begin any text analysis project or experiment, you must first define exactly what kind of elements you are interested in, so you can identify those in continuous text.

Without more information, a tokenizer will usually split a text into tokens based on spaces and punctuation, a strategy that works for most words.
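
To make the "spaces and punctuation" strategy concrete, here is a rough sketch using only Python's built-in `re` module. This is not NLTK's algorithm (`word_tokenize()` is considerably more careful); `simple_tokenize` is a hypothetical name used only for illustration.

```python
import re

def simple_tokenize(text):
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample_tokens = simple_tokenize("My uncle owns a cat, a dog, and two sheep.")
print(sample_tokens)
# ['My', 'uncle', 'owns', 'a', 'cat', ',', 'a', 'dog', ',', 'and', 'two', 'sheep', '.']

# A naive rule like this splits hyphenated words into three pieces
hyphen_tokens = simple_tokenize("so-called")
print(hyphen_tokens)
# ['so', '-', 'called']
```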

>**Run the code chunk below to see how NLTK tokenizes Walden using its default settings.**

In [None]:
#################################
## Read-in and tokenize Walden ##
#################################

# Read the Walden file into the variable 'walden_raw'
walden_raw   = file.read()

# Tokenize 'walden_raw' with the NLTK function word_tokenize() and save the result to the variable 'tokens'
tokens = word_tokenize( walden_raw )

# Print the result of the tokenization
print( tokens )

# Close the Walden text file to release the file handle
# We don't need the open file anymore; its contents are now in 'walden_raw'.
file.close()


### Assessing the tokenization

Take a minute to look through the tokens that NLTK found. Is there anything there that might get in the way of an analysis of the text?


## Filtering non-alphabetical characters

Filtering usually entails checking each element of a list against some condition. For example, if we want to see whether a particular token (usually equivalent to the idea of a word) includes non-alphabetical characters, we check every character in the token to see whether it is a letter (uppercase or lowercase).

This is what Python's built-in string method `isalpha()` does: it returns `True` only if every character in the string is a letter.
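
As a quick illustration, the sketch below applies `str.isalpha()` to a handful of made-up example tokens (they are not drawn from the Walden output). Note that `isalpha()` is Unicode-aware, so accented letters also count as alphabetical.

```python
# Made-up example tokens, chosen to show each kind of result
examples = ['pond', 'Walden', ',', '1.73', 'so-called', 'caf\u00e9']

# Map each token to the result of isalpha()
alpha_flags = {token: token.isalpha() for token in examples}

for token, is_alpha in alpha_flags.items():
    print(repr(token), is_alpha)
# 'pond' True
# 'Walden' True
# ',' False
# '1.73' False
# 'so-called' False
# 'café' True
```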

>**In the next code chunk we find the 100 most frequent tokens that include characters outside of the alphabet.** This will help us determine whether we should filter non-alphabetical characters. Follow along with the comments in the code to learn more about each step of the process.


In [None]:
##################################################
## Find tokens with characters that are not A-Z ##
##################################################

# Check each token in Walden and keep it only if it includes at least one non-alphabetical character
notalpha = [ token for token in tokens if not token.isalpha() ]

# Count the frequency of each token.
# Answers the question, How many times does each token appear across the corpus?
# Save the result to the 'freqs' variable
freqs = FreqDist(notalpha)

# Print the 100 most frequent tokens with non-alphabetical characters
print( freqs.most_common(100) )

### So, should we apply the filter to remove tokens with non-alphabetical characters?

**It looks like there are a few kinds of tokens here**, all of which would be removed from our dataset if we filtered using `isalpha()`:
<ul>
    <li> Punctuation (e.g. commas, which appear 8484 times) </li>
    <li> Numbers (e.g. '1.73', which appears twice) </li>
    <li> Hyphenated words (e.g. 'so-called', which appears 7 times) </li>
</ul>

For the purposes of this workshop, let's assume we want to remove all of these tokens from our data. However, it is worth thinking about how you could preprocess the text before tokenizing it (for example, by handling hyphens) so that fewer useful tokens are lost to this filter.
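
One possible way to do this, sketched here under simplifying assumptions, is to replace hyphens with spaces before tokenizing, so that each half of a hyphenated word survives the `isalpha()` filter. The example uses `str.split()` as a stand-in for a real tokenizer, and the sample string is made up. Whether splitting 'so-called' into 'so' and 'called' is appropriate depends on your analysis.

```python
sample_text = "a so-called well-known pond"

# Without preprocessing: hyphenated tokens fail isalpha() and are dropped
without_prep = [t for t in sample_text.split() if t.isalpha()]
print(without_prep)
# ['a', 'pond']

# With preprocessing: replace hyphens with spaces first, then tokenize and filter
with_prep = [t for t in sample_text.replace('-', ' ').split() if t.isalpha()]
print(with_prep)
# ['a', 'so', 'called', 'well', 'known', 'pond']
```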

>**The next code chunk removes tokens that include non-alphabetical characters.**

In [None]:
####################################################
## Remove tokens with characters that are not A-Z ##
####################################################

# Read the list of tokens and only return each token if it's all alphabetical
# We write the result to a new variable, tokens_clean, which we will overwrite until it's cleaned.
tokens_clean = [ token for token in tokens if token.isalpha() ]


## Converting text to lowercase

**NLTK considers "Amber" and "amber" to be different tokens.** While it might be useful to make a distinction between these tokens (for example, for identifying some proper names), usually we choose to make all characters lowercase.
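
The effect is easy to see on a tiny, made-up token list: the number of distinct tokens shrinks once case is ignored.

```python
# Made-up example tokens
case_tokens = ['Amber', 'amber', 'Amber', 'pond']

# Distinct tokens when case matters
distinct_cased = set(case_tokens)
print(len(distinct_cased))
# 3

# Distinct tokens after converting everything to lowercase
distinct_lowered = {t.lower() for t in case_tokens}
print(len(distinct_lowered))
# 2
```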

We can convert all of the text to lowercase by calling the `lower()` method on each token. There are more nuanced ways to do this, such as treating capitalization at the beginning of sentences differently from capitalization within sentences, that might be more appropriate for your data. You will see an example of this selective transformat

>**For our purposes, we will just convert all tokens to lowercase in the next code chunk.**

In [None]:
#################################
## Convert tokens to lowercase ##
#################################

# Use another list comprehension to save the lowercase version to tokens_clean (overwriting the original)
tokens_clean = [ token.lower() for token in tokens_clean ]

# Print the result
print( tokens_clean )


## Frequency distributions

The huge list of words above is impossible to interpret on its own. One simple way to analyze these data is to get the frequency (= count) of each token.
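
As a minimal sketch of what "frequency" means here, the example below counts a made-up token list with `collections.Counter` from the standard library. NLTK's `FreqDist` is a subclass of `Counter`, so `most_common()` behaves the same way on it.

```python
from collections import Counter

# Made-up example tokens
toy_tokens = ['pond', 'ice', 'pond', 'winter', 'pond', 'ice']

# Count how many times each token appears
toy_freqs = Counter(toy_tokens)

# The two most frequent tokens, as (token, count) pairs
top_two = toy_freqs.most_common(2)
print(top_two)
# [('pond', 3), ('ice', 2)]
```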

### Zipf's Law

The basic properties of the distribution of words in English were first described by Zipf (1932) in what would become known as Zipf's Law.

While I will spare you the mathematical underpinnings, the gist is that only a handful of words in English are relatively frequent while the vast majority are infrequent.
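
To get a feel for how lopsided this is, here is a small sketch of an idealized case. It assumes a vocabulary of 1000 words whose frequencies are proportional to 1/rank (an exponent of 1); both numbers are assumptions chosen for illustration, not measurements from Walden, and `zipf_shares` is a hypothetical helper.

```python
def zipf_shares(n_words, exponent=1.0):
    """Return the share of all tokens taken by each rank under an idealized Zipf law."""
    weights = [1 / rank ** exponent for rank in range(1, n_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = zipf_shares(1000)

# Share of all tokens accounted for by the 10 most frequent words
top10_share = sum(shares[:10])
print(f"{top10_share:.2f}")
# 0.39
```

In this idealized setting, 1% of the vocabulary accounts for roughly 39% of all tokens.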

If we did our job well, the plot below should support that conclusion.

> The next code block calculates the frequency for every word in ``tokens_clean`` and plots the 50 most frequent.

*Reference: Zipf, G. K., “Selected Studies of the Principle of Relative Frequency in Language,” Cambridge, MA: Harvard Univ. Press, 1932.*

In [None]:
#############################################
## Generate a token frequency distribution ##
#############################################

# This line specifies the size of the figure as our Jupyter Notebook will print it
plt.rcParams["figure.figsize"] = [16,7]

# Calculate the frequency for each token from the book and save it to the variable 'frequencies'
frequencies = FreqDist( tokens_clean )

# Call 'frequencies' with the method 'plot' to generate a frequency plot of the 50 most frequent words
walden_plot = frequencies.plot( 50 )

# The line above saves the plot to the variable 'walden_plot'. This line outputs it, so it will appear below.
walden_plot

# Also print the raw list
frequencies.most_common(50)

### Compare this plot to the theoretical Zipf's Law distribution

**The plot seems to confirm the hypothesis that the frequency distribution for Walden tokens follows Zipf's law.**

Next we will generate a Zipf distribution using the random sampling functions in `numpy`, so we can compare the shapes of the distributions more directly.

>**The next code block generates and displays a plot of the distribution Zipf proposed for English word frequency. How similar is it to the frequency distribution for tokens in Walden?**

In [None]:
##################################
## Generate a Zipf distribution ##
##################################

# We will randomly draw 1000 samples from the Zipf distribution with the parameter 'a'
a = 1.1
s = np.random.zipf(a, 1000)

# Truncate at x=20 and plot the density of the distribution
# This is because most of the 'action' in this distribution is at small values
count, bins, ignored = plt.hist(s[s<20], 20, histtype = 'step', fill = None, density=True)

# These are the values (1 through 19) at which we evaluate the theoretical curve
x = np.arange(1., 20.)

# The theoretical Zipf probability of each value is x**(-a) divided by the Riemann zeta function of 'a'
# (special.zeta(a, 1) is the zeta function itself; special.zetac(a) would return zeta minus 1)
y = x**(-a) / special.zeta(a, 1)

# Plot the theoretical curve, rescaled so that its maximum is 1
plt.plot(x, y/max(y), linewidth=2, color='r')

# Show the plot
plt.show()


### Hopefully, they look similar to you too.

## Removing stopwords

The plot above shows we replicated Zipf's 1932 finding for Walden!

One implication of this is that **most of the tokens in our dataset are not informative** because most of the high-frequency tokens in Walden are high frequency in every other English text. If we want to learn more about Walden, then we need to focus on lower frequency tokens.

In other words, if we followed every step in this tutorial so far with a different text, the most frequent tokens would likely be much the same. If we are asking questions about English in general this might be useful, but instead of settling for this general dataset, we will remove the most common tokens to make our data more representative of Walden.

Luckily, NLTK has many premade lists of very common tokens, and we already installed the lists! If you want to see the English list, just run the line `set(stopwords.words('english'))`.

>**The next code block removes stopwords. This step looks a lot like the filtering we did on non-alphabetical tokens.**

In [None]:
######################
## Remove stopwords ##
######################

# Build the NLTK stopword set once, so it is not rebuilt for every token
stop_words = set(stopwords.words('english'))

# Check every token in 'tokens_clean' against the stopword set and only keep tokens NOT in it
tokens_clean = [t for t in tokens_clean if t not in stop_words]

# Generate a new frequency distribution plot with the 50 most frequent words remaining after filtering
# This follows the same steps as before: We count the frequency of each token, we generate a plot of the top 50, and then we display them.
frequencies = FreqDist( tokens_clean )

# Make plot with top 50 tokens
walden_plot = frequencies.plot( 50 )

# Show the plot
walden_plot

# Also print the raw list
frequencies.most_common(50)

### Cleaning is time consuming and imperfect

**As you can tell from the words in the plot above, we now have a fairly clean and informative data set!**

For example, we can tell just by looking at this that Walden may have a lot to do with the natural world (water, pond, ice, winter, nature, world) and was perhaps about a man living in nature (man, men, time, house).

Of course, there are other frequent tokens that might provide less information for your analysis (would, may, though, two, us). You can either filter these words out using *lists*, as we did above for non-alphabetical tokens and for stopwords, *or* you can filter out entire classes of words (e.g. pronouns) from your analysis. This will be detailed in the next section.
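
The list-based option can be sketched as follows. The `extra_stopwords` set uses the words mentioned above; the sample token list is made up for illustration and stands in for `tokens_clean`.

```python
# Additional words we have decided are uninformative for our analysis
extra_stopwords = {'would', 'may', 'though', 'two', 'us'}

# Made-up stand-in for a cleaned token list
sample_clean = ['water', 'would', 'pond', 'may', 'us', 'winter']

# Keep only the tokens that are NOT in our custom list
sample_filtered = [t for t in sample_clean if t not in extra_stopwords]
print(sample_filtered)
# ['water', 'pond', 'winter']
```

Using a `set` rather than a `list` for the words to remove keeps each membership check fast, which matters for long texts.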

First we will save our progress in the **pickle format**, which is a convenient way to store Python objects (such as our list of tokens) on disk so they can be loaded again later.
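
The sketch below shows a full round trip on a made-up list: writing it with `pickle.dump()` and reading it back with `pickle.load()`. It writes to a temporary directory (an assumption made so the example does not touch the workshop's folders), and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Made-up example data
sample_list = ['water', 'pond', 'ice']

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_tokens.pkl')  # hypothetical file name

    # 'wb' = write in binary mode, which pickle requires
    with open(sample_path, 'wb') as f:
        pickle.dump(sample_list, f)

    # 'rb' = read in binary mode
    with open(sample_path, 'rb') as f:
        restored = pickle.load(f)

# The restored object is equal to the original
print(restored == sample_list)
# True
```

Only load pickle files from sources you trust, because loading a pickle can execute arbitrary code.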

In [None]:
#######################
## Save our progress ##
#######################

# Write the tokens list to the /working/ folder
with open('working/walden_clean_tokens.pkl', 'wb') as f:
    pickle.dump( tokens_clean, f )

# Also write a Text object (unique to NLTK) to the /working/ folder 
with open('working/walden_text.pkl', 'wb') as f:
    pickle.dump( Text( tokens ), f )
    
# Also write the text to the /working/ folder 
with open('working/walden_raw.pkl', 'wb') as f:
    pickle.dump( walden_raw, f )
    

# Next: Text Analysis

The next section will show the key features of NLTK's text analysis functions. We will compare the meanings of tokens, find similar tokens, analyze the sounds of Walden, and learn two ways to find the part of speech of each token (noun, verb, and so on).

# Code it: Tokenize Shakespeare

**Adapt the code block below to tokenize the complete works of William Shakespeare.**

* Below I give the filepath to a plaintext version of the complete works of William Shakespeare.
* For the first part of this activity you will tokenize the text.
* For the second part, you will generate a frequency distribution plot for the text.

Keep in mind that it will take a little while to run these scripts because this text file is larger than Walden.
*Tokenization takes about 10 seconds and cleaning + plotting takes about two minutes.*

Answers are in /answer_keys/.

In [None]:
##################################################################
## ## ## ## > Code it < ## ## ## ##                              #
################################### "Cleaning_Shakespeare_Timed" #
## Sample answer in /answer_keys ##                              #
##################################################################

######################
## Read-in the file ##
######################

# What is the path to the complete works of Shakespeare?
shakespeare_path = 'texts/shakespeare.txt'

# Open the file at the specified path
shakespeare_file = open( shakespeare_path, 'r')

# Read the file as raw text by calling the file's read() method
shakespeare_raw = 

# Close the original file with the close() function
shakespeare_file.close()

#######################
## Tokenize the file ##
#######################

# Tokenize the raw text using word_tokenize()
shakespeare_tokens = 

# Print a list of the tokens


# Code it, continued: Plot the token frequency distribution for Shakespeare

**Adapt the code block below to plot the 50 most frequent tokens in the complete works of William Shakespeare.**

In [None]:
##############
## Cleaning ##
##############

# Remove non-alphabetical characters using list comprehension
shakespeare_tokens = [ token for token in shakespeare_tokens if token.isalpha() ]

# Convert the text to lowercase using list comprehension
shakespeare_tokens = [ token.lower() for token in shakespeare_tokens ]

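# Filter out stopwords using list comprehension (the stopword set is built once, before the comprehension)
shakespeare_stopwords = set(stopwords.words('english'))
shakespeare_tokens = [token for token in shakespeare_tokens if token not in shakespeare_stopwords]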

####################
## Frequency plot ##
####################
# Generate a frequency distribution for the tokens


# Display plot with top 50 tokens

