# Computational Text Analysis Fundamentals

## Tokenization
Tokenization is the process of dividing a text into "tokens": words, parts of words, phrases, punctuation marks, or even sentences. The most common tokenization is at the level of words.

You can type or paste your own text into the cell below, but make sure to keep the three double quotes (`"""`) at the start and the end.

In [None]:
sample_text = """
Well the design has developed a wee bit since you saw it last time, 
the design obviously is still in exactly the same place but 
the design is extended to actually include the actual cremator facility,
so if I can start with this particular drawing, you’ve seen a version
of this drawing before. Basically we’re arriving in the new car park
in this area and from the car park we’ll enter the building through a
waiting area. This leads us to the first query I have because there was
some discussion about whether you wanted the size of the waiting room
increased. At the moment it’s exactly on brief, but it does look kind of
small to my eye in relation to the size of the project.
"""

### Simple approach: split at the spaces
A common strategy is to "split" the text wherever one encounters spaces.

In [None]:
tokens = sample_text.split()
print(tokens)
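To see the limitation of this approach on a smaller scale, here is a minimal sketch using a short made-up sentence. The names `demo_sentence` and `demo_tokens` are illustrative and not part of the notebook. Called without arguments, `split()` breaks the text at any run of whitespace (including newlines), but punctuation stays attached to the neighbouring word.

```python
# A short illustrative sentence, loosely based on the sample text.
demo_sentence = "At the moment it's exactly on brief,\nbut it does look small."

# split() with no arguments splits on any whitespace: spaces, tabs, newlines.
demo_tokens = demo_sentence.split()
print(demo_tokens)
```

In the printed list, `brief,` and `small.` still carry their punctuation, and `it's` is kept as a single token.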

### Preferred approach: use an existing library
Note the punctuation in the above output: it is still attached to the preceding word.
Also note that contractions like `you’ve` are retained as they are.
There are different ways to separate punctuation, contractions, etc., but thankfully we can use an existing library called [Natural Language Toolkit or NLTK](https://www.nltk.org/index.html#). To generate tokens, we will use a function called [word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize).

Pay attention to the commented code below (lines starting with a `#`) and, when applicable, uncomment it by deleting the `#`. A commented line of code is ignored by the system and is not executed.

You can also toggle the commenting of any line. To do so, place your cursor on that line and hold down `Ctrl` and press `/` (for Windows/Linux) or hold down `⌘` and press `/` (for Macs).

In [None]:
import nltk
# Uncomment the line below if you have not yet downloaded the punkt_tab tokenizer data.
# nltk.download('punkt_tab')  
from nltk import word_tokenize
tokens = word_tokenize(sample_text)
print(tokens)
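For intuition about what a tokenizer does beyond splitting at spaces, the sketch below uses a regular expression from Python's standard `re` module. This is not how `word_tokenize` works internally; it is a rough approximation, and the function name `simple_tokenize` is hypothetical. It treats every run of word characters as one token and every other non-space symbol as its own token.

```python
import re

def simple_tokenize(text):
    """Rough tokenizer: runs of word characters, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

demo_tokens = simple_tokenize("It’s exactly on brief, but it looks small.")
print(demo_tokens)
```

Here the comma and the full stop become separate tokens, and the curly apostrophe splits `It’s` into three pieces. A dedicated library handles many more cases (abbreviations, contractions, quotation marks) than this sketch does.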

#### Note:
`import` statements (such as the ones you see above) are conventionally placed in the first cell of a Jupyter Notebook (or in the first few lines of a Python program). However, we break from convention here to show where a particular library is used in the code.

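### Removing Punctuation
Different approaches can be used to remove punctuation. A helpful way is to use another library called [string](https://docs.python.org/3/library/string.html), which provides `string.punctuation`, a string of the standard ASCII punctuation characters. The curly apostrophe (`’`) used in the sample text is not an ASCII character, so the code below adds it by hand.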

In [None]:
import string
punctuations = string.punctuation + '’'
tokens_without_puncts = [word for word in tokens if word not in punctuations]
print(tokens_without_puncts)
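One caveat: because `punctuations` is a single string, `word not in punctuations` is a substring test, so a multi-character token such as `...` is kept. The sketch below shows one possible alternative, under the assumption that a token should be dropped when every one of its characters is punctuation. The names `punct_chars` and `is_punctuation` are illustrative.

```python
import string

# Assumption: the curly apostrophe is added by hand, as in the cell above.
punct_chars = set(string.punctuation + "’")

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in punct_chars for ch in token)

demo_tokens = ["wait", "...", "it", "’", "s", ",", "``", "e-mail"]
kept = [tok for tok in demo_tokens if not is_punctuation(tok)]
print(kept)
```

Tokens that merely contain punctuation, such as `e-mail`, are kept, since only tokens made up entirely of punctuation are removed.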

## Counting Words
Almost all subsequent processing is about counting words at some level. A simple way to count words is to use another library called [collections](https://docs.python.org/3/library/collections.html), and a class in that library called [Counter](https://docs.python.org/3/library/collections.html#collections.Counter).

In [None]:
from collections import Counter
word_counts = Counter(tokens_without_puncts)

In [None]:
print(word_counts)
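A `Counter` behaves like a dictionary that maps each item to its count. As a small sketch on a made-up token list (the name `demo_counts` is illustrative), two features are often handy: `most_common(n)` returns the `n` most frequent items, and looking up an item that never occurred returns `0`.

```python
from collections import Counter

demo_counts = Counter(["the", "design", "the", "is", "the", "design"])

# most_common(n) returns the n highest counts as (word, count) pairs.
print(demo_counts.most_common(2))   # [('the', 3), ('design', 2)]

# Looking up a word that never occurred returns 0 rather than raising an error.
print(demo_counts["cremator"])      # 0
```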

## Letter case
Looking at the list of unique words in the sample text, we can see that words differing only in capitalisation (for example, `The` and `the`) are counted as different words.
We may or may not want this.

In [None]:
non_repeating_words = word_counts.keys()
print(sorted(non_repeating_words))

### Converting all text to lowercase

We simply use the `.lower()` method to convert all text to lowercase.

In [None]:
lowercase_tokens = [word.lower() for word in tokens_without_puncts]
word_counts_lowercase = Counter(lowercase_tokens)
print(word_counts_lowercase)
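The effect on the counts is easiest to see with a tiny made-up list. In this sketch (variable names are illustrative), the same five tokens are counted before and after lowercasing.

```python
from collections import Counter

demo_tokens = ["The", "design", "the", "Design", "THE"]

# Case-sensitive: every spelling variant is its own key.
cased_counts = Counter(demo_tokens)
# Case-insensitive: variants collapse onto the lowercase form.
lowered_counts = Counter(tok.lower() for tok in demo_tokens)

print(cased_counts)
print(lowered_counts)
```

The case-sensitive counter has five keys with a count of one each, while the lowercase counter has only two keys, `the` and `design`.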

## Stemming and Lemmatization
Note how `actual` and `actually` are treated separately. This may be necessary, or not, depending on the requirements of the analysis. If the base form of the word is to be obtained, we either have to "stem" the word (remove suffixes) or "lemmatize" the word (convert to base form).

### Stemming

This is the simpler approach, in which a fixed set of rules is used to strip inflections from words. It does not always work, as you can see from the two examples below.

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print("discussion :", stemmer.stem("discussion"))
print("went :", stemmer.stem("went"))

Applying this approach to our word list...

In [None]:
stems = [stemmer.stem(word) for word in word_counts_lowercase]
print(stems)
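To get a feel for why rule-based stemming sometimes misfires, here is a deliberately naive suffix stripper. It is a toy, not the Porter algorithm: the suffix list `SUFFIXES` and the function `toy_stem` are hypothetical and far simpler than what a real stemmer uses.

```python
# Hypothetical suffix list for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "er", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["draw", "drawing", "drawer", "drew", "went"]:
    print(w, "->", toy_stem(w))
```

The rules reduce `drawing` to `draw`, but they also reduce `drawer` to `draw` (even when it means a piece of furniture) and leave irregular forms such as `drew` and `went` untouched. Rules that look only at spelling cannot tell these cases apart.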

### Lemmatization

This is a slightly more sophisticated version, where grammatical considerations are used to determine the base form of the word. For this, we also need to label the word with its appropriate part of speech.

In [None]:
# Comment out the line below after the first time you run this code.
nltk.download('wordnet')

# Comment out the line below after the first time you run this code.
nltk.download('averaged_perceptron_tagger_eng')

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

lemmatize = WordNetLemmatizer().lemmatize
print("discussion :", lemmatize("discussion", pos="n"))
print("went :", lemmatize('went', pos='v'))

Applying it to our list...

In [None]:
tagged_tokens = pos_tag(lowercase_tokens)
lemmas_list = []

# Name the loop variable `tag` so that the imported pos_tag function is not
# overwritten (otherwise re-running this cell fails).
for word, tag in tagged_tokens:
    if tag.startswith("V"):      # verb
        lemma = lemmatize(word, pos="v")
    elif tag.startswith("R"):    # adverb
        lemma = lemmatize(word, pos="r")
    elif tag.startswith("N"):    # noun
        lemma = lemmatize(word, pos="n")
    elif tag.startswith("J"):    # adjective
        lemma = lemmatize(word, pos="a")
    else:                        # anything else: use the lemmatizer's default
        lemma = lemmatize(word)
    lemmas_list.append(lemma)

print(lemmas_list)

What difference do you see between the two lists?
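The chain of checks in the lemmatization loop translates the tagger's labels (Penn Treebank tags such as `VBD` or `NNS`) into the single-letter codes the lemmatizer expects. As a sketch, the same mapping could be factored into a small helper. The function name `penn_to_wordnet` is hypothetical and not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet part-of-speech letter."""
    prefix_map = {"V": "v", "R": "r", "N": "n", "J": "a"}
    # Fall back to noun, which is also the default of WordNetLemmatizer.lemmatize.
    return prefix_map.get(tag[:1], "n")

for tag in ["VBD", "RB", "NNS", "JJ", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With such a helper, the body of the loop could reduce to a single call along the lines of `lemmatize(word, pos=penn_to_wordnet(tag))`.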

## Extracting n-grams from text

For identifying commonly-used phrases in a given text, you need to capture all possible occurrences of word sequences of the length that interests you.

### Bigrams
If the sequence is two words long, it is called a bigram. NLTK provides a function called `bigrams` that extracts every pair of adjacent tokens.

In [None]:
from nltk.util import bigrams
bigrams_from_text = list(bigrams(lowercase_tokens))
print(bigrams_from_text)
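Conceptually, a bigram list pairs each token with the one that follows it. The sketch below reproduces that idea in plain Python with `zip`, on a made-up token list. It illustrates the concept and is not a description of NLTK's implementation.

```python
demo_tokens = ["the", "design", "is", "extended"]

# Pair each token with its successor; zip stops when the shorter input runs out.
demo_bigrams = list(zip(demo_tokens, demo_tokens[1:]))
print(demo_bigrams)
# [('the', 'design'), ('design', 'is'), ('is', 'extended')]
```

Four tokens yield three bigrams: each bigram is a tuple, and consecutive bigrams overlap by one word.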

### Counting bigrams
It is then simply a matter of counting the number of occurrences, similar to what we did with words.

In [None]:
bigram_counts = Counter(bigrams_from_text)
print(bigram_counts)

### Generalizing to n-grams
We use a similar utility called `ngrams` to generalize this idea to word sequences of any length. Its second argument is the length of the sequence, so `3` gives trigrams.

In [None]:
from nltk.util import ngrams
trigrams_from_text = list(ngrams(lowercase_tokens, 3))
trigram_counts = Counter(trigrams_from_text)
print(trigram_counts)
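The general case can be pictured as a window of `n` tokens sliding over the text one position at a time, so a text of `k` tokens contains `k - n + 1` n-grams. The following is a minimal plain-Python sketch of that idea on a made-up sentence. The function name `sliding_ngrams` is hypothetical.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

demo_tokens = "the design is still the design is extended".split()
demo_trigrams = sliding_ngrams(demo_tokens, 3)

print(len(demo_trigrams))                       # 8 tokens give 8 - 3 + 1 = 6 trigrams
print(Counter(demo_trigrams).most_common(1))    # [(('the', 'design', 'is'), 2)]
```

The repeated phrase "the design is" shows up as the only trigram with a count above one, which is how recurring phrases can be picked out of a longer text.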

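As a further comparison of stemming and lemmatization, the two cells below apply both to a family of related words.

In [None]: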
test_words = ['draw', 'drawing', 'drew', 'drawer', 'drawn']
lemmas_test = [lemmatize(w, 'v') for w in test_words]
print(lemmas_test)

In [None]:
stems_test = [stemmer.stem(w) for w in test_words]
print(stems_test)

## Stop Words
Some words carry less "information" than others. Very common words that contribute little to the meaning of a text, such as "the", "is", and "of", are called "stop words". NLTK ships with ready-made lists of stop words for several languages, including English.

In [None]:
import nltk
from nltk.corpus import stopwords

# Comment out the line below after the first time you run this code.
nltk.download('stopwords') 

stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:
lcase_tokens_ns = [word for word in lowercase_tokens if word not in stop_words]
word_counts_lcase_ns = Counter(lcase_tokens_ns)
print(word_counts_lcase_ns)

What difference do you see in the word counts?
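The filtering step itself is an ordinary membership test against a set. The sketch below shows it in isolation with a tiny hand-picked stop list. The set `demo_stop_words` is hypothetical and much shorter than NLTK's English list.

```python
# Hypothetical, hand-picked stop list for illustration only.
demo_stop_words = {"the", "is", "in", "a", "to"}

demo_tokens = ["the", "design", "is", "still", "in", "the", "same", "place"]

# Keep only the tokens that are not in the stop list.
content_tokens = [tok for tok in demo_tokens if tok not in demo_stop_words]
print(content_tokens)
# ['design', 'still', 'same', 'place']
```

Storing the stop words in a `set` rather than a list keeps each membership test fast, which matters once the text grows to the size of a book.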

Try playing with other inputs in the notebook above, and when you are comfortable with most of the commands, move on to the next notebook: making sequential word clouds from a book.