# Computational Text Analysis Fundamentals


## Tokenization
Tokenization is the process of dividing a text into "tokens": words, parts of words, phrases, punctuation marks, or even sentences. The most common tokenization is at the level of words.

In [1]:
sample_text = """
Well the design has developed a wee bit since you saw it last time, 
the design obviously is still in exactly the same place but 
the design is extended to actually include the actual cremator facility,
so if I can start with this particular drawing, you’ve seen a version
of this drawing before. Basically we’re arriving in the new car park
in this area and from the car park we’ll enter the building through a
waiting area. This leads us to the first query I have because there was
some discussion about whether you wanted the size of the waiting room
increased. At the moment it’s exactly on brief, but it does look kind of
small to my eye in relation to the size of the project.
"""

### Simple approach: split at the spaces
The simplest strategy is to "split" the text wherever whitespace occurs.

In [2]:
tokens = sample_text.split()
print(tokens)

['Well', 'the', 'design', 'has', 'developed', 'a', 'wee', 'bit', 'since', 'you', 'saw', 'it', 'last', 'time,', 'the', 'design', 'obviously', 'is', 'still', 'in', 'exactly', 'the', 'same', 'place', 'but', 'the', 'design', 'is', 'extended', 'to', 'actually', 'include', 'the', 'actual', 'cremator', 'facility,', 'so', 'if', 'I', 'can', 'start', 'with', 'this', 'particular', 'drawing,', 'you’ve', 'seen', 'a', 'version', 'of', 'this', 'drawing', 'before.', 'Basically', 'we’re', 'arriving', 'in', 'the', 'new', 'car', 'park', 'in', 'this', 'area', 'and', 'from', 'the', 'car', 'park', 'we’ll', 'enter', 'the', 'building', 'through', 'a', 'waiting', 'area.', 'This', 'leads', 'us', 'to', 'the', 'first', 'query', 'I', 'have', 'because', 'there', 'was', 'some', 'discussion', 'about', 'whether', 'you', 'wanted', 'the', 'size', 'of', 'the', 'waiting', 'room', 'increased.', 'At', 'the', 'moment', 'it’s', 'exactly', 'on', 'brief,', 'but', 'it', 'does', 'look', 'kind', 'of', 'small', 'to', 'my', 'eye', '
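It may help to see why `split()` is called without an argument here. The short string below is a made-up stand-in for `sample_text`, containing a double space and a newline; the sketch compares the two calling styles.

```python
# A made-up string with a double space and a newline (not from the sample text).
text = "the design  has\ndeveloped a wee bit,"

# With no argument, split() breaks on any run of whitespace (spaces, tabs, newlines).
whitespace_tokens = text.split()
print(whitespace_tokens)

# With an explicit " ", only single spaces count, so empty strings and
# embedded newlines can end up inside the result.
space_tokens = text.split(" ")
print(space_tokens)
```

In both cases the trailing comma stays attached to `bit,`, which is the limitation the next section addresses.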

### Preferred approach: use an existing library
Note the punctuation in the output above: it is still attached to the preceding word, as in `'time,'` and `'before.'`.
Contractions such as `you’ve` are also kept as single tokens.
There are different ways to separate punctuation and contractions, but we can use an existing library, the [Natural Language Toolkit or NLTK](https://www.nltk.org/index.html#). To generate tokens, we will use its [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize) function. `word_tokenize` relies on NLTK's Punkt tokenizer data; if the data is missing, NLTK raises a `LookupError` that names the resource to fetch with `nltk.download`.

In [3]:
from nltk import word_tokenize
tokens = word_tokenize(sample_text)
print(tokens)

['Well', 'the', 'design', 'has', 'developed', 'a', 'wee', 'bit', 'since', 'you', 'saw', 'it', 'last', 'time', ',', 'the', 'design', 'obviously', 'is', 'still', 'in', 'exactly', 'the', 'same', 'place', 'but', 'the', 'design', 'is', 'extended', 'to', 'actually', 'include', 'the', 'actual', 'cremator', 'facility', ',', 'so', 'if', 'I', 'can', 'start', 'with', 'this', 'particular', 'drawing', ',', 'you', '’', 've', 'seen', 'a', 'version', 'of', 'this', 'drawing', 'before', '.', 'Basically', 'we', '’', 're', 'arriving', 'in', 'the', 'new', 'car', 'park', 'in', 'this', 'area', 'and', 'from', 'the', 'car', 'park', 'we', '’', 'll', 'enter', 'the', 'building', 'through', 'a', 'waiting', 'area', '.', 'This', 'leads', 'us', 'to', 'the', 'first', 'query', 'I', 'have', 'because', 'there', 'was', 'some', 'discussion', 'about', 'whether', 'you', 'wanted', 'the', 'size', 'of', 'the', 'waiting', 'room', 'increased', '.', 'At', 'the', 'moment', 'it', '’', 's', 'exactly', 'on', 'brief', ',', 'but', 'it',
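For comparison, here is a minimal sketch of a hand-written alternative using the standard `re` module. The pattern and the name `regex_tokenize` are illustrative assumptions, not part of NLTK, and the behaviour differs from `word_tokenize`: contractions are kept whole rather than split at the apostrophe.

```python
import re

# Hypothetical pattern: a run of word characters, optionally joined by a
# straight or curly apostrophe, OR any single character that is neither a
# word character nor whitespace (i.e. a punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Return word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

example = "you’ve seen a version of this drawing before."
regex_tokens = regex_tokenize(example)
print(regex_tokens)
# ['you’ve', 'seen', 'a', 'version', 'of', 'this', 'drawing', 'before', '.']
```

A pattern like this is easy to adapt, but a library tokenizer handles many more edge cases (abbreviations, quotes, hyphenation) than a one-line regular expression.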

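### Removing punctuation
Different approaches can be used to remove punctuation. A convenient one is Python's built-in [string](https://docs.python.org/3/library/string.html) module, whose `string.punctuation` constant lists the standard ASCII punctuation characters. The curly apostrophe `’` in our text is not ASCII, so we add it to the list ourselves.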

In [4]:
import string
punctuations = string.punctuation + '’'
tokens_without_puncts = [word for word in tokens if word not in punctuations]
print(tokens_without_puncts)

['Well', 'the', 'design', 'has', 'developed', 'a', 'wee', 'bit', 'since', 'you', 'saw', 'it', 'last', 'time', 'the', 'design', 'obviously', 'is', 'still', 'in', 'exactly', 'the', 'same', 'place', 'but', 'the', 'design', 'is', 'extended', 'to', 'actually', 'include', 'the', 'actual', 'cremator', 'facility', 'so', 'if', 'I', 'can', 'start', 'with', 'this', 'particular', 'drawing', 'you', 've', 'seen', 'a', 'version', 'of', 'this', 'drawing', 'before', 'Basically', 'we', 're', 'arriving', 'in', 'the', 'new', 'car', 'park', 'in', 'this', 'area', 'and', 'from', 'the', 'car', 'park', 'we', 'll', 'enter', 'the', 'building', 'through', 'a', 'waiting', 'area', 'This', 'leads', 'us', 'to', 'the', 'first', 'query', 'I', 'have', 'because', 'there', 'was', 'some', 'discussion', 'about', 'whether', 'you', 'wanted', 'the', 'size', 'of', 'the', 'waiting', 'room', 'increased', 'At', 'the', 'moment', 'it', 's', 'exactly', 'on', 'brief', 'but', 'it', 'does', 'look', 'kind', 'of', 'small', 'to', 'my', 'ey
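Tokenizers can also emit multi-character punctuation tokens such as `...` or `--`, which a test against individual punctuation characters may miss. One possible alternative, sketched below with a hypothetical helper `has_alphanumeric` and a made-up token list, is to keep only tokens that contain at least one letter or digit.

```python
def has_alphanumeric(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Made-up tokens for illustration, mixing words and punctuation.
mixed_tokens = ["drawing", ",", "...", "--", "’", "ve", "2nd", "."]

kept = [token for token in mixed_tokens if has_alphanumeric(token)]
print(kept)  # ['drawing', 've', '2nd']
```

Note that this keeps tokens with internal punctuation, such as hyphenated words; whether that is desirable depends on the analysis.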

## Counting Words
Almost all subsequent processing involves counting words at some level. A simple way to count words is the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) class from Python's built-in [collections](https://docs.python.org/3/library/collections.html) module.

In [5]:
from collections import Counter
word_counts = Counter(tokens_without_puncts)

In [6]:
print(word_counts)

Counter({'the': 14, 'in': 4, 'to': 4, 'of': 4, 'design': 3, 'a': 3, 'you': 3, 'it': 3, 'this': 3, 'is': 2, 'exactly': 2, 'but': 2, 'I': 2, 'drawing': 2, 'we': 2, 'car': 2, 'park': 2, 'area': 2, 'waiting': 2, 'size': 2, 'Well': 1, 'has': 1, 'developed': 1, 'wee': 1, 'bit': 1, 'since': 1, 'saw': 1, 'last': 1, 'time': 1, 'obviously': 1, 'still': 1, 'same': 1, 'place': 1, 'extended': 1, 'actually': 1, 'include': 1, 'actual': 1, 'cremator': 1, 'facility': 1, 'so': 1, 'if': 1, 'can': 1, 'start': 1, 'with': 1, 'particular': 1, 've': 1, 'seen': 1, 'version': 1, 'before': 1, 'Basically': 1, 're': 1, 'arriving': 1, 'new': 1, 'and': 1, 'from': 1, 'll': 1, 'enter': 1, 'building': 1, 'through': 1, 'This': 1, 'leads': 1, 'us': 1, 'first': 1, 'query': 1, 'have': 1, 'because': 1, 'there': 1, 'was': 1, 'some': 1, 'discussion': 1, 'about': 1, 'whether': 1, 'wanted': 1, 'room': 1, 'increased': 1, 'At': 1, 'moment': 1, 's': 1, 'on': 1, 'brief': 1, 'does': 1, 'look': 1, 'kind': 1, 'small': 1, 'my': 1, 'eye
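Printing a whole `Counter` quickly becomes unwieldy. A short sketch of two `Counter` features that help, using a small made-up token list rather than the sample text: `most_common(n)` returns the `n` most frequent items, and looking up a missing key returns 0 instead of raising an error.

```python
from collections import Counter

# A small made-up token list for illustration.
toy_counts = Counter(["the", "design", "the", "car", "park", "the", "design"])

# The two most frequent tokens, as (token, count) pairs in descending order.
top_two = toy_counts.most_common(2)
print(top_two)  # [('the', 3), ('design', 2)]

# A token that never occurred simply has a count of 0.
print(toy_counts["cremator"])  # 0
```

The same calls should work on `word_counts`, for example `word_counts.most_common(10)`.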

## Letter case
Looking at the list of unique words in the sample text, we can see that capitalised words are treated as distinct from their lowercase forms: `This` and `this` are counted separately.
We may or may not want this.

In [7]:
non_repeating_words = word_counts.keys()
print(sorted(non_repeating_words))

['At', 'Basically', 'I', 'This', 'Well', 'a', 'about', 'actual', 'actually', 'and', 'area', 'arriving', 'because', 'before', 'bit', 'brief', 'building', 'but', 'can', 'car', 'cremator', 'design', 'developed', 'discussion', 'does', 'drawing', 'enter', 'exactly', 'extended', 'eye', 'facility', 'first', 'from', 'has', 'have', 'if', 'in', 'include', 'increased', 'is', 'it', 'kind', 'last', 'leads', 'll', 'look', 'moment', 'my', 'new', 'obviously', 'of', 'on', 'park', 'particular', 'place', 'project', 'query', 're', 'relation', 'room', 's', 'same', 'saw', 'seen', 'since', 'size', 'small', 'so', 'some', 'start', 'still', 'the', 'there', 'this', 'through', 'time', 'to', 'us', 've', 'version', 'waiting', 'wanted', 'was', 'we', 'wee', 'whether', 'with', 'you']


### Converting all text to lowercase

We simply use the `.lower()` method to convert all text to lowercase.

In [8]:
lowercase_tokens = [word.lower() for word in tokens_without_puncts]
word_counts_lowercase = Counter(lowercase_tokens)
print(word_counts_lowercase)

Counter({'the': 14, 'in': 4, 'to': 4, 'this': 4, 'of': 4, 'design': 3, 'a': 3, 'you': 3, 'it': 3, 'is': 2, 'exactly': 2, 'but': 2, 'i': 2, 'drawing': 2, 'we': 2, 'car': 2, 'park': 2, 'area': 2, 'waiting': 2, 'size': 2, 'well': 1, 'has': 1, 'developed': 1, 'wee': 1, 'bit': 1, 'since': 1, 'saw': 1, 'last': 1, 'time': 1, 'obviously': 1, 'still': 1, 'same': 1, 'place': 1, 'extended': 1, 'actually': 1, 'include': 1, 'actual': 1, 'cremator': 1, 'facility': 1, 'so': 1, 'if': 1, 'can': 1, 'start': 1, 'with': 1, 'particular': 1, 've': 1, 'seen': 1, 'version': 1, 'before': 1, 'basically': 1, 're': 1, 'arriving': 1, 'new': 1, 'and': 1, 'from': 1, 'll': 1, 'enter': 1, 'building': 1, 'through': 1, 'leads': 1, 'us': 1, 'first': 1, 'query': 1, 'have': 1, 'because': 1, 'there': 1, 'was': 1, 'some': 1, 'discussion': 1, 'about': 1, 'whether': 1, 'wanted': 1, 'room': 1, 'increased': 1, 'at': 1, 'moment': 1, 's': 1, 'on': 1, 'brief': 1, 'does': 1, 'look': 1, 'kind': 1, 'small': 1, 'my': 1, 'eye': 1, 'rela

## Stemming and Lemmatization
Note how `actual` and `actually` are treated separately. This may be necessary, or not, depending on the requirements of the analysis. If the base form of the word is to be obtained, we either have to "stem" the word (remove suffixes) or "lemmatize" the word (convert to base form).

### Stemming

Stemming is the simpler of the two: a fixed set of rules strips inflectional suffixes from a word. It does not always work, as the two examples below show: the suffix of `discussion` is removed, but the irregular form `went` is left unchanged.

In [9]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print("discussion :", stemmer.stem("discussion"))
print("went :", stemmer.stem("went"))

discussion : discuss
went : went


Applying this approach to our word list...

In [10]:
stems = [stemmer.stem(word) for word in word_counts_lowercase]
print(stems)

['well', 'the', 'design', 'ha', 'develop', 'a', 'wee', 'bit', 'sinc', 'you', 'saw', 'it', 'last', 'time', 'obvious', 'is', 'still', 'in', 'exactli', 'same', 'place', 'but', 'extend', 'to', 'actual', 'includ', 'actual', 'cremat', 'facil', 'so', 'if', 'i', 'can', 'start', 'with', 'thi', 'particular', 'draw', 've', 'seen', 'version', 'of', 'befor', 'basic', 'we', 're', 'arriv', 'new', 'car', 'park', 'area', 'and', 'from', 'll', 'enter', 'build', 'through', 'wait', 'lead', 'us', 'first', 'queri', 'have', 'becaus', 'there', 'wa', 'some', 'discuss', 'about', 'whether', 'want', 'size', 'room', 'increas', 'at', 'moment', 's', 'on', 'brief', 'doe', 'look', 'kind', 'small', 'my', 'eye', 'relat', 'project']
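To give a feel for what "a set of rules" means, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies many more rules in several passes; the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made for this sketch only.

```python
# Hypothetical suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

words = ["developed", "waiting", "leads", "exactly", "went", "this"]
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)
# ['develop', 'wait', 'lead', 'exact', 'went', 'thi']
```

Even this toy version shows both weaknesses seen in the output above: irregular forms such as `went` are untouched, and blind suffix removal turns `this` into the non-word `thi`.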


### Lemmatization

Lemmatization is more sophisticated: it uses grammatical information to determine the base form (the "lemma") of a word. For this, we also need to label each word with its part of speech.

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

lemmatize = WordNetLemmatizer().lemmatize
print("discussion :", lemmatize("discussion", pos="n"))
print("went :", lemmatize('went', pos='v'))

discussion : discussion
went : go


Applying it to our list...

In [12]:
tagged_tokens = pos_tag(lowercase_tokens)
lemmas_list = []

for word, tag in tagged_tokens:
    # Use the name `tag` so that the imported pos_tag function is not overwritten,
    # which would make this cell fail when run a second time.
    if tag.startswith("V"):    # verbs
        lemma = lemmatize(word, pos="v")
    elif tag.startswith("R"):  # adverbs
        lemma = lemmatize(word, pos="r")
    elif tag.startswith("N"):  # nouns
        lemma = lemmatize(word, pos="n")
    elif tag.startswith("J"):  # adjectives
        lemma = lemmatize(word, pos="a")
    else:                      # anything else: use the lemmatizer's default
        lemma = lemmatize(word)
    lemmas_list.append(lemma)

print(lemmas_list)

['well', 'the', 'design', 'have', 'develop', 'a', 'wee', 'bit', 'since', 'you', 'saw', 'it', 'last', 'time', 'the', 'design', 'obviously', 'be', 'still', 'in', 'exactly', 'the', 'same', 'place', 'but', 'the', 'design', 'be', 'extend', 'to', 'actually', 'include', 'the', 'actual', 'cremator', 'facility', 'so', 'if', 'i', 'can', 'start', 'with', 'this', 'particular', 'drawing', 'you', 've', 'see', 'a', 'version', 'of', 'this', 'draw', 'before', 'basically', 'we', 're', 'arrive', 'in', 'the', 'new', 'car', 'park', 'in', 'this', 'area', 'and', 'from', 'the', 'car', 'park', 'we', 'll', 'enter', 'the', 'building', 'through', 'a', 'wait', 'area', 'this', 'lead', 'u', 'to', 'the', 'first', 'query', 'i', 'have', 'because', 'there', 'be', 'some', 'discussion', 'about', 'whether', 'you', 'want', 'the', 'size', 'of', 'the', 'wait', 'room', 'increase', 'at', 'the', 'moment', 'it', 's', 'exactly', 'on', 'brief', 'but', 'it', 'do', 'look', 'kind', 'of', 'small', 'to', 'my', 'eye', 'in', 'relation', '

What difference do you see between the two lists?
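The chain of `startswith` checks in the lemmatization loop can also be written as a small lookup helper. This is one possible refactoring, not NLTK functionality: the name `penn_to_wordnet` is hypothetical, and the fallback to `"n"` mirrors the default `pos` of `WordNetLemmatizer.lemmatize`.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
TAG_PREFIX_TO_WORDNET = {
    "V": "v",  # verbs (VB, VBD, VBG, ...)
    "R": "r",  # adverbs (RB, RBR, ...)
    "N": "n",  # nouns (NN, NNS, ...)
    "J": "a",  # adjectives (JJ, JJR, ...)
}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], "n")

example_tags = ["VBD", "RB", "JJR", "NNS", "DT"]
wordnet_codes = [penn_to_wordnet(tag) for tag in example_tags]
print(wordnet_codes)  # ['v', 'r', 'a', 'n', 'n']
```

With such a helper, the loop body reduces to a single call of the form `lemmatize(word, pos=penn_to_wordnet(tag))`.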

## Extracting n-grams from text

For identifying commonly-used phrases in a given text, you need to capture all possible occurrences of word sequences of the length that interests you.

### Bigrams
If the sequence is of two words, it is called a bigram. NLTK provides a function called `bigrams` that extracts them from a list of tokens.

In [13]:
from nltk.util import bigrams
bigrams_from_text = list(bigrams(lowercase_tokens))
print(bigrams_from_text)

[('well', 'the'), ('the', 'design'), ('design', 'has'), ('has', 'developed'), ('developed', 'a'), ('a', 'wee'), ('wee', 'bit'), ('bit', 'since'), ('since', 'you'), ('you', 'saw'), ('saw', 'it'), ('it', 'last'), ('last', 'time'), ('time', 'the'), ('the', 'design'), ('design', 'obviously'), ('obviously', 'is'), ('is', 'still'), ('still', 'in'), ('in', 'exactly'), ('exactly', 'the'), ('the', 'same'), ('same', 'place'), ('place', 'but'), ('but', 'the'), ('the', 'design'), ('design', 'is'), ('is', 'extended'), ('extended', 'to'), ('to', 'actually'), ('actually', 'include'), ('include', 'the'), ('the', 'actual'), ('actual', 'cremator'), ('cremator', 'facility'), ('facility', 'so'), ('so', 'if'), ('if', 'i'), ('i', 'can'), ('can', 'start'), ('start', 'with'), ('with', 'this'), ('this', 'particular'), ('particular', 'drawing'), ('drawing', 'you'), ('you', 've'), ('ve', 'seen'), ('seen', 'a'), ('a', 'version'), ('version', 'of'), ('of', 'this'), ('this', 'drawing'), ('drawing', 'before'), ('bef

### Counting bigrams
It is then simply a matter of counting occurrences, as we did with words.

In [14]:
bigram_counts = Counter(bigrams_from_text)
print(bigram_counts)

Counter({('the', 'design'): 3, ('car', 'park'): 2, ('to', 'the'): 2, ('the', 'size'): 2, ('size', 'of'): 2, ('of', 'the'): 2, ('well', 'the'): 1, ('design', 'has'): 1, ('has', 'developed'): 1, ('developed', 'a'): 1, ('a', 'wee'): 1, ('wee', 'bit'): 1, ('bit', 'since'): 1, ('since', 'you'): 1, ('you', 'saw'): 1, ('saw', 'it'): 1, ('it', 'last'): 1, ('last', 'time'): 1, ('time', 'the'): 1, ('design', 'obviously'): 1, ('obviously', 'is'): 1, ('is', 'still'): 1, ('still', 'in'): 1, ('in', 'exactly'): 1, ('exactly', 'the'): 1, ('the', 'same'): 1, ('same', 'place'): 1, ('place', 'but'): 1, ('but', 'the'): 1, ('design', 'is'): 1, ('is', 'extended'): 1, ('extended', 'to'): 1, ('to', 'actually'): 1, ('actually', 'include'): 1, ('include', 'the'): 1, ('the', 'actual'): 1, ('actual', 'cremator'): 1, ('cremator', 'facility'): 1, ('facility', 'so'): 1, ('so', 'if'): 1, ('if', 'i'): 1, ('i', 'can'): 1, ('can', 'start'): 1, ('start', 'with'): 1, ('with', 'this'): 1, ('this', 'particular'): 1, ('parti

### Generalizing to n-grams
We use a similar utility called `ngrams` to generalize this idea to word sequences of any length.

In [15]:
from nltk.util import ngrams
trigrams_from_text = list(ngrams(lowercase_tokens, 3))
trigram_counts = Counter(trigrams_from_text)
print(trigram_counts)

Counter({('the', 'size', 'of'): 2, ('size', 'of', 'the'): 2, ('well', 'the', 'design'): 1, ('the', 'design', 'has'): 1, ('design', 'has', 'developed'): 1, ('has', 'developed', 'a'): 1, ('developed', 'a', 'wee'): 1, ('a', 'wee', 'bit'): 1, ('wee', 'bit', 'since'): 1, ('bit', 'since', 'you'): 1, ('since', 'you', 'saw'): 1, ('you', 'saw', 'it'): 1, ('saw', 'it', 'last'): 1, ('it', 'last', 'time'): 1, ('last', 'time', 'the'): 1, ('time', 'the', 'design'): 1, ('the', 'design', 'obviously'): 1, ('design', 'obviously', 'is'): 1, ('obviously', 'is', 'still'): 1, ('is', 'still', 'in'): 1, ('still', 'in', 'exactly'): 1, ('in', 'exactly', 'the'): 1, ('exactly', 'the', 'same'): 1, ('the', 'same', 'place'): 1, ('same', 'place', 'but'): 1, ('place', 'but', 'the'): 1, ('but', 'the', 'design'): 1, ('the', 'design', 'is'): 1, ('design', 'is', 'extended'): 1, ('is', 'extended', 'to'): 1, ('extended', 'to', 'actually'): 1, ('to', 'actually', 'include'): 1, ('actually', 'include', 'the'): 1, ('include', '
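Under the hood, an n-gram extractor is just a sliding window over the token list. A minimal pure-Python sketch of the idea follows; the name `sliding_ngrams` is hypothetical and this is not how NLTK implements `ngrams`.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are shifted copies of the list;
    # zipping them pairs each token with its n-1 successors and stops when the
    # shortest copy runs out.
    return list(zip(*(tokens[i:] for i in range(n))))

phrase = ["the", "size", "of", "the", "waiting", "room"]
phrase_trigrams = sliding_ngrams(phrase, 3)
print(phrase_trigrams)
# [('the', 'size', 'of'), ('size', 'of', 'the'),
#  ('of', 'the', 'waiting'), ('the', 'waiting', 'room')]
```

A list of `k` tokens yields `k - n + 1` n-grams, so the six tokens above produce four trigrams and five bigrams.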

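## Comparing lemmatization and stemming
To see the difference between the two approaches more directly, we can apply both to a few words related to `draw`.

In [16]: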
test_words = ['draw', 'drawing', 'drew', 'drawer', 'drawn']
lemmas_test = [lemmatize(w, 'v') for w in test_words]
print(lemmas_test)

['draw', 'draw', 'draw', 'drawer', 'draw']


In [17]:
stems_test = [stemmer.stem(w) for w in test_words]
print(stems_test)

['draw', 'draw', 'drew', 'drawer', 'drawn']


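## Removing stop words
Very frequent function words such as `the`, `in` and `of` dominate the word counts while carrying little content. NLTK includes a list of such "stop words" for English.

In [18]: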
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords') # uncomment and run this line the first time you run this code.
stop_words = set(stopwords.words('english'))
print(stop_words)

{"that'll", 'through', 'doesn', "needn't", 'before', 'again', 're', 'nor', 'can', "should've", 'needn', 'wasn', 'yourself', 'had', 'i', 'to', 'down', 'a', 'hasn', 'haven', 'than', 'not', 'it', 'doing', 'off', 'this', 'didn', 'yours', 'should', 'o', 'be', 'itself', 'whom', "isn't", 'such', 'but', 'my', 'what', 'now', 'mightn', 'up', 'some', 's', 'in', "haven't", 'hers', 'below', 'there', 'other', 'just', 'these', 'shan', 'am', 'weren', "weren't", 'will', "wasn't", 'aren', 'where', 'shouldn', 'so', 'do', 'them', 'most', 't', 'or', "you're", "you'll", 'has', 'by', "she's", 'and', "mustn't", "doesn't", 'they', 'having', 'ma', 'here', 'until', "aren't", 'as', "you've", 'own', 'were', 'is', 'we', 'all', 'if', 'under', "don't", 'are', 'his', 'm', 'd', "hasn't", "mightn't", 'you', 'against', 'that', 'at', 'me', "didn't", 'same', 'its', 'of', 'does', 'their', 'into', 'from', 'an', 'more', 'which', 'for', 'himself', "wouldn't", 'no', 'themselves', "you'd", 'ours', 'after', "it's", 'her', 'our', 

We then drop every token that appears in the stop-word list and count again.

In [19]:
lowercase_tokens_nostop = [word for word in lowercase_tokens if word not in stop_words]
word_counts_lowercase_nostop = Counter(lowercase_tokens_nostop)
print(word_counts_lowercase_nostop)

Counter({'design': 3, 'exactly': 2, 'drawing': 2, 'car': 2, 'park': 2, 'area': 2, 'waiting': 2, 'size': 2, 'well': 1, 'developed': 1, 'wee': 1, 'bit': 1, 'since': 1, 'saw': 1, 'last': 1, 'time': 1, 'obviously': 1, 'still': 1, 'place': 1, 'extended': 1, 'actually': 1, 'include': 1, 'actual': 1, 'cremator': 1, 'facility': 1, 'start': 1, 'particular': 1, 'seen': 1, 'version': 1, 'basically': 1, 'arriving': 1, 'new': 1, 'enter': 1, 'building': 1, 'leads': 1, 'us': 1, 'first': 1, 'query': 1, 'discussion': 1, 'whether': 1, 'wanted': 1, 'room': 1, 'increased': 1, 'moment': 1, 'brief': 1, 'look': 1, 'kind': 1, 'small': 1, 'eye': 1, 'relation': 1, 'project': 1})
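The stop-word filter above operates on `lowercase_tokens` for a reason: the entries in the stop-word list are lowercase, so capitalised tokens would slip through. A small sketch of the effect, using a made-up five-word stop list as a stand-in for NLTK's much longer one:

```python
# Hypothetical tiny stop list; NLTK's English list is far longer.
SMALL_STOP_WORDS = {"the", "is", "at", "it", "s"}

raw_tokens = ["At", "the", "moment", "it", "s", "exactly", "on", "brief"]

# Filtering the raw tokens misses 'At' because only 'at' is in the list.
without_lowercasing = [t for t in raw_tokens if t not in SMALL_STOP_WORDS]
print(without_lowercasing)  # ['At', 'moment', 'exactly', 'on', 'brief']

# Lowercasing first lets the membership test match as intended.
with_lowercasing = [t.lower() for t in raw_tokens if t.lower() not in SMALL_STOP_WORDS]
print(with_lowercasing)  # ['moment', 'exactly', 'on', 'brief']
```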
