## Part II: Preparing Text for Analysis

*Note: This is part of a series on the computational analysis of open-ended survey questions. Part one covers "Writing Open-Ended Survey Questions for Computational Analysis."*

If you read our first post, you may remember that Evolytics was asked to analyze approximately 68,000 open-ended responses to nine survey questions.  These questions included asking survey participants to list competitor brands they had tried, rate the competitors, and describe their rationale for the rating. 

In this post, we'll talk about how to prepare text for analysis. The techniques discussed here are general and apply to any form of text, not just survey responses.

### Getting Started with Natural Language Processing

Before preparing our text for analysis, we need to define three terms common in natural language processing. First, a *corpus* is the collection of documents we are analyzing. Second, a *document* is any single text that is subject to analysis: a report, a social media post, or, in our case, an open-ended survey response. Finally, *tokens* are meaningful groupings of characters, often words or parts of words.

When preparing a document for analysis, we tokenize it, or break it apart into discrete tokens. In many models, it is the presence and frequency of tokens that characterize a document. However, not all tokens are informative. For example, some tokens (e.g., "a", "of", "the") are extremely common and have little purpose other than tying the sentence together grammatically. In linguistics, these are referred to as [function words](https://en.wikipedia.org/wiki/Function_word). Typically, function words are included in a *stopword* list of tokens we wish to discard because they have little value to statistical models.
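
To make the idea of "presence and frequency of tokens" concrete, here is a minimal sketch that uses only the standard library. The stopword set and the two example documents are made up for illustration, and the letters-only regex is a simplification, not the tokenizer we use later in this post.

```python
import re
from collections import Counter

# Tiny hand-picked stopword set, for illustration only (not NLTK's list).
STOPWORDS = {"a", "of", "the", "and", "is"}

def bag_of_words(document):
    """Lowercase a document, split it into alphabetic tokens, and count the non-stopwords."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tok for tok in tokens if tok not in STOPWORDS)

# A toy corpus: a list of documents.
corpus = [
    "The price is high, and the price of shipping is high too.",
    "A fair price.",
]

counts = [bag_of_words(doc) for doc in corpus]
for doc, count in zip(corpus, counts):
    print(doc, "->", dict(count))
```

Each document is now characterized by which tokens it contains and how often, with the function words gone.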
 
It is common to [*stem* or *lemmatize*](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) tokens to standardize them further. Stemming strips the end of a word to approximate its root. For example, a stemmer may reduce "smarter" and "smartest" to "smart". A drawback of stemming is that it may not return proper words. For instance, "accelerating" becomes "acceler". In contrast, lemmatization attempts to return the base form of a word as it would appear in a dictionary. With lemmatization, "women", "Woman's", and "Womanly" become "woman", while "is" and "are" become "be". However, you must tag a token's part of speech (noun, adjective, verb, etc.) to know how a given word should be lemmatized. Since stemming and lemmatization serve the same purpose, choose only one. Furthermore, don't assume that you *have* to stem or lemmatize; try your models without either and see how they perform.
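
The contrast between the two approaches can be sketched with a toy example. Note the assumptions: `naive_stem` is a deliberately crude suffix stripper and `LEMMA_TABLE` is a hypothetical three-entry lookup; neither is the Snowball stemmer or the WordNet lemmatizer used later in this post, and real tools will give different results for some words.

```python
# Suffixes are tried in this order; the list is illustrative, not a real stemming algorithm.
SUFFIXES = ("ing", "est", "er", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary lookup standing in for a real lemmatizer.
LEMMA_TABLE = {"is": "be", "are": "be", "women": "woman"}

def naive_lemma(word):
    """Return the dictionary form if the word is in the lookup table, else the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["smarter", "smartest", "accelerating", "women", "is"]:
    print(f"{w:>12}  stem={naive_stem(w):<10}  lemma={naive_lemma(w)}")
```

Stripping suffixes is cheap but can leave non-words such as "accelerat", and it can never turn "is" into "be". A lookup returns real words but needs linguistic knowledge about each form.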

Finally, you may wish to remove things such as numbers, punctuation, or URLs. Doing so can be especially helpful for data drawn from the web, which may not conform to standard grammar. However, let me issue two warnings about applying these functions indiscriminately. First, altering your text can change how a part-of-speech classifier tags a given token, which in turn can affect lemmatization. As a result, I caution against aggressively filtering text before lemmatization, although you can certainly do so afterwards. Second, depending upon your corpus, numbers, punctuation, and URLs may be informative. For example, while numbers by themselves are often uninformative, stripping them from your tokens may prevent you from identifying things such as the [Unicode code points for emojis](https://unicode.org/emoji/charts/full-emoji-list.html).
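
A quick way to see the second warning in action is to strip digits from a handful of tokens and check what survives. The token list below is invented for illustration; "U+1F600" is the code point notation for the grinning face emoji.

```python
import string

# Translation table that deletes every ASCII digit.
strip_digits = str.maketrans("", "", string.digits)

tokens = ["great", "U+1F600", "2nd", "5"]
stripped = [tok.translate(strip_digits) for tok in tokens]

for before, after in zip(tokens, stripped):
    print(repr(before), "->", repr(after))
```

The emoji code point collapses to "U+F", which no longer identifies anything, and "5" becomes an empty string. Whether that loss matters depends on your corpus, so it is worth inspecting a sample before and after filtering.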

### Tokenizing Documents 

Below we've defined several useful functions for text cleaning. We're using Python's [Natural Language Toolkit (NLTK)](https://www.nltk.org/) which contains a number of tools and models for text processing.  If you've never used NLTK before, run the first cell below. 

In [1]:
# only run if you haven't previously installed these NLTK tools.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ssanders/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ssanders/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ssanders/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ssanders/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/ssanders/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/ssanders/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

#### Stemming and Tokenization 

Now let's talk about what we're doing to tokenize and filter our text. First, we use string methods from the Python standard library to put text in lowercase. We also filter out punctuation and numbers using Python's [str.maketrans()](https://stackoverflow.com/a/41536036) and translate() methods. maketrans() creates the map between characters, and translate() applies that map to a string. Here we delete digits outright and replace each punctuation character with a space, which we then strip from the token.
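
The difference between mapping punctuation to a space and deleting it can be seen in a small standalone sketch. The example string is made up, and the table names are ours.

```python
import string

# Two translation tables built with str.maketrans:
to_space = str.maketrans({ch: " " for ch in string.punctuation})  # punctuation -> space
to_nothing = str.maketrans("", "", string.punctuation)            # punctuation -> deleted

text = "well-made,but pricey!"

spaced = text.translate(to_space)
deleted = text.translate(to_nothing)

print(spaced.split())   # words stay separate
print(deleted.split())  # neighbouring words are fused together
```

Replacing with a space keeps "well", "made", and "but" apart, while deleting fuses them into "wellmadebut". Which behavior you want depends on where in the pipeline the step runs.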

Second, we're using [regular expressions](https://docs.python.org/3/library/re.html) (regex) to find and remove URLs. If you're reading this post, you're likely familiar with regex, but if not, all you need to know is that it is a way of defining patterns. Here we're using it to remove URLs by searching for "http" followed by any run of non-whitespace characters and substituting an empty string in its place.
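
Here is that pattern applied to a made-up sentence, as a small sketch. The whitespace clean-up at the end is our addition and is not part of the function defined in this post.

```python
import re

URL_PATTERN = re.compile(r"http\S+")  # "http" plus everything up to the next whitespace

text = "Loved it! See https://example.com/review?id=7 and www.example.com for more."

without_urls = URL_PATTERN.sub("", text)
# Removing a URL leaves a double space behind, so collapse runs of whitespace.
cleaned = re.sub(r"\s+", " ", without_urls).strip()

print(cleaned)
```

Note that the bare "www.example.com" survives because it does not begin with "http". If your corpus contains links in that form, the pattern would need to be extended.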

Third, we're using the WordPunctTokenizer from NLTK, which splits text into a Python list of alphabetic and non-alphabetic tokens. We can then use Python [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to filter the token list, removing single-character tokens and tokens found in our stopword list.
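
To show what this kind of tokenization and filtering does without requiring NLTK, the sketch below uses a plain regular expression. To our understanding, `\w+|[^\w\s]+` is the pattern WordPunctTokenizer is built on, but treat that as an assumption; the three-word stopword set is invented for the example.

```python
import re

# Runs of word characters, or runs of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

STOPWORDS = {"i", "it", "too"}  # toy list, for illustration only

text = "I don't like it: too sweet!"
tokens = WORD_PUNCT.findall(text.lower())

# List comprehension: drop single-character tokens and stopwords.
kept = [tok for tok in tokens if len(tok) > 1 and tok not in STOPWORDS]

print(tokens)
print(kept)
```

Two things are worth noticing. The contraction "don't" is split into "don", "'", and "t", so a fragment such as "don" can survive unless your stopword list covers it. Also, a run of punctuation such as "!!" is longer than one character and would pass the length filter here, which is why the tokenize function in this post strips punctuation from each token before filtering.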

Finally, we are using the SnowballStemmer from NLTK to stem each token as discussed above.  

In [2]:
import re
import string
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize.regexp import WordPunctTokenizer
from nltk.corpus import stopwords


#Remove urls. 
def remove_urls(text): 
    return re.sub(r"http\S+", "", text)

#Remove punctuation. Note- This leaves a space so it plays nice w/ nltk's stopword list.
def remove_punctuation(s):
    table = str.maketrans({ch: ' ' for ch in string.punctuation})  # this line determines what the punct. is replaced with.
    return s.translate(table)


#Remove numbers. 
def remove_numbers(s): 
    remove_digits = str.maketrans('', '', string.digits)
    return s.translate(remove_digits)


# Stems tokens. 
def stem_tokens(tokens, stemmer=SnowballStemmer("english", ignore_stopwords=True)): 
    return [stemmer.stem(tkn) for tkn in tokens]


#Tokenize texts.  Note- It is possible to comment out steps to change how tokenization occurs. 
def tokenize(text, stem=False):
    text = remove_urls(text) # removes urls 
    text = remove_numbers(text) # removes numbers 
    text = text.lower() # sets to lowercase
    text = text.replace('-', '') # removes hyphens  
    tkns = tokenizer.tokenize(text) # tokenizes text
    tkns = [remove_punctuation(tkn).strip() for tkn in tkns] #strips punctuation
    
    tkns = [tkn for tkn in tkns if tkn not in sw] # filters using stopwords (and empty strings)
    tkns = [tkn for tkn in tkns if len(tkn) > 1] # no single character tkns

    # stems tkns, then filters again in case a stem matches a stopword
    if stem: 
        tkns = stem_tokens(tkns)
        tkns = [tkn for tkn in tkns if tkn not in sw]
    return tkns

tokenizer = WordPunctTokenizer()

# Creates stopword list from NLTK.
sw = stopwords.words("english") + ['']

Now we're going to do the actual tokenization. After tokenization, our corpus will be a list of documents, with each document being a list of tokens. Note that we also create a map of our indices. This lets us link each tokenized document back to its original, untokenized version: documents are tracked positionally, so if we need to remove a document during model building, we remove the entry at the same position from the index as well. Keeping a copy of the original documents lets us inspect them after creating a topic model, which assists in interpretation, and the indices allow us to join the results of any text analysis back to the parent survey dataset for other analyses, such as slicing by demographics.

In [3]:
# Toy corpus for our example
doc_examples = ['We use data, statistical algorithms, and machine learning to help you make business decisions and targeted digital marketing efforts based on potential outcomes.', 
                'With propensity modeling, you can predict the likelihood of a visitor, lead, or current customer to perform a certain action on your website (i.e. browse your site, click a CTA, pick up their phone to call).',
                'Once you can anticipate future customer and user behavior, you can plan for possible challenges and obstacles you’ll need to help that customer or user overcome.', 
                'We believe in the power of data to affect change and help make a difference in the world. We focus on web analytics and marketing optimization for business evolution ' \
                'and brand growth. As a full-service data analytics consulting company, we partner with clients to activate their data with best-in-class analytics tool implementation, ' \
                'meaningful insights, and expert training. Founded in 2005, we serve clients across different industries. We get to know your business right away so the consulting help ' \
                'we offer you is catered to your specific needs and goals. We don’t just sell you a service; we engage with your business as a partner. We care about your success, and we ' \
                'want to help your business thrive. ', 
                "Evolytics can help you with your data science needs."
               ]

print("Number of Documents: ", len(doc_examples))

Number of Documents:  5


In [5]:
docs = []
orig_docs = []
doc_index = []
 
for i, d in enumerate(doc_examples):
    orig_docs.append(d)
    docs.append(tokenize(d, stem=True))
    doc_index.append(i)

i = 0 
print("Original Document: ", orig_docs[i])
print('\n')
print("Tokenized & Stemmed Document: ", docs[i])
print('\n')
print("Document Index: ", doc_index[i])

Original Document:  We use data, statistical algorithms, and machine learning to help you make business decisions and targeted digital marketing efforts based on potential outcomes.


Tokenized & Stemmed Document:  ['use', 'data', 'statist', 'algorithm', 'machin', 'learn', 'help', 'make', 'busi', 'decis', 'target', 'digit', 'market', 'effort', 'base', 'potenti', 'outcom']


Document Index:  0
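
The index map earns its keep once documents start dropping out. The sketch below is a hypothetical illustration: the `records` list, its field names, the `simple_tokenize` helper, and the token count standing in for model output are all invented, and it uses only the standard library.

```python
import re

# Hypothetical parent dataset: one row per survey respondent.
records = [
    {"respondent_id": 101, "age_group": "18-24", "response": "Great taste, fair price"},
    {"respondent_id": 102, "age_group": "25-34", "response": "!!!"},
    {"respondent_id": 103, "age_group": "35-44", "response": "Too expensive"},
]

def simple_tokenize(text):
    """Stand-in tokenizer for this sketch: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

docs, doc_index = [], []
for i, rec in enumerate(records):
    tokens = simple_tokenize(rec["response"])
    if tokens:  # skip documents that are empty after cleaning
        docs.append(tokens)
        doc_index.append(i)  # remember which original row this document came from

# Pretend "model output": one value per surviving document.
n_tokens = [len(d) for d in docs]

# Join the results back to the parent rows through the index map.
joined = [{**records[i], "n_tokens": n} for i, n in zip(doc_index, n_tokens)]

for row in joined:
    print(row)
```

Because `docs` and `doc_index` are always appended to together, position k in one still refers to position k in the other, even though the second response was dropped.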


#### Parts-of-Speech Tagging, Lemmatization, and Tokenization 

The above approach works well if you wish to stem tokens, but lemmatization requires slightly different handling. Specifically, we need to tag each token's part of speech using NLTK's pos_tag function. Unfortunately, the part-of-speech tags ([see here for a full list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)) returned by nltk.pos_tag are not those accepted by the WordNet lemmatizer. For example, NLTK returns a tag of "PRP" (i.e., personal pronoun) for the word "We", but the WordNet lemmatizer accepts only a handful of simpler tags such as "n" (i.e., noun). Therefore, we define a helper function, wordnet_pos, that converts the tag to the correct format and falls back to noun for tags without a WordNet equivalent.

In [10]:
from nltk.corpus import wordnet
from nltk import WordNetLemmatizer

i = 0

def wordnet_pos(pos):
    """Converts Penn Treebank PoS to Wordnet PoS
    
    Arg: 
        pos(str): Penn treebank PoS tag
        
    Returns: 
        str: wordnet PoS tag
    """
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Returns noun if not found to avoid lemmatization error. 
    return tag_dict.get(pos[0], wordnet.NOUN)
    

tokenized_sentence = nltk.word_tokenize(doc_examples[i]) #tokenizing sentence 
print("Tokenized Sentence: ", tokenized_sentence)
tag_sent = nltk.pos_tag(tokenized_sentence) # tagging pos 

print('\n')
print("Penn Treebank PoS: ", tag_sent)

words = [(word[0], wordnet_pos(word[1])) for word in tag_sent] # converting to wordnet pos

print('\n')
print("Wordnet PoS: ", words)


Tokenized Sentence:  ['We', 'use', 'data', ',', 'statistical', 'algorithms', ',', 'and', 'machine', 'learning', 'to', 'help', 'you', 'make', 'business', 'decisions', 'and', 'targeted', 'digital', 'marketing', 'efforts', 'based', 'on', 'potential', 'outcomes', '.']


Penn Treebank PoS:  [('We', 'PRP'), ('use', 'VBP'), ('data', 'NNS'), (',', ','), ('statistical', 'JJ'), ('algorithms', 'NN'), (',', ','), ('and', 'CC'), ('machine', 'NN'), ('learning', 'NN'), ('to', 'TO'), ('help', 'VB'), ('you', 'PRP'), ('make', 'VB'), ('business', 'NN'), ('decisions', 'NNS'), ('and', 'CC'), ('targeted', 'JJ'), ('digital', 'JJ'), ('marketing', 'NN'), ('efforts', 'NNS'), ('based', 'VBN'), ('on', 'IN'), ('potential', 'JJ'), ('outcomes', 'NNS'), ('.', '.')]


Wordnet PoS:  [('We', 'n'), ('use', 'v'), ('data', 'n'), (',', 'n'), ('statistical', 'a'), ('algorithms', 'n'), (',', 'n'), ('and', 'n'), ('machine', 'n'), ('learning', 'n'), ('to', 'n'), ('help', 'v'), ('you', 'n'), ('make', 'v'), ('business', 'n'), ('d

You may note that the above output contains punctuation.  After we lemmatize our tokens, we can use some of the general filtering techniques or functions we discussed above to clean our documents and remove punctuation and stopwords. The code below demonstrates this and presents us with a clean set of tokens.  However, depending upon your analysis you may wish to retain your PoS tags.  For example, [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) (i.e., the identification of a person, location, organization, or product in unstructured text) requires PoS tags. 

In [11]:
from nltk import sent_tokenize

def wordnet_pos(pos):
    """Converts Penn Treebank PoS to Wordnet PoS
    
    Arg: 
        pos(str): Penn treebank PoS tag
        
    Returns: 
        str: wordnet PoS tag
    """
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Returns noun if not found to avoid lemmatization error. 
    return tag_dict.get(pos[0], wordnet.NOUN)

    
lemmatizer = WordNetLemmatizer() # initializing lemmatizer
tokenized_docs = []

for doc in doc_examples: 
    d = []
    sentences = nltk.sent_tokenize(doc) # creates a list of sentences
    for sentence in sentences:
        tokenized_sentence = nltk.word_tokenize(sentence) #tokenizes sentence
        tagged = nltk.pos_tag(tokenized_sentence) # pos tagging
        for tkn in tagged: 
            if (tkn[0].lower() not in sw and tkn[0] not in string.punctuation): #filtering punct & stopwords (lowercased because NLTK's stopword list is lowercase)
                lemma_tkn = lemmatizer.lemmatize(word=tkn[0], pos=wordnet_pos(tkn[1])) #lemmatization
                d.append(lemma_tkn)
    tokenized_docs.append(d)

print("Lemmatized Tokens: ", tokenized_docs[0])

Lemmatized Tokens:  ['use', 'data', 'statistical', 'algorithm', 'machine', 'learning', 'help', 'make', 'business', 'decision', 'targeted', 'digital', 'marketing', 'effort', 'base', 'potential', 'outcome']


## Conclusion 

In this post, we've defined some basic terms for natural language processing and discussed how to prepare text for analysis using Python's Natural Language Toolkit. Specifically, we discussed how to tokenize and filter text, how to stem and lemmatize tokens, and how to tag the parts of speech of tokens for subsequent analysis. Although our context is the analysis of open-ended survey questions, the above techniques will work with any body of text.

In our next post, we'll talk about how to detect duplicate texts and how to do named entity recognition to identify people, locations, and products in text.   