# INFO 3350/6350

## Lecture 01: Tokenization and word counts

## To do

* Readings for next week (see [syllabus schedule](https://github.com/wilkens-teaching/info3350-f23/blob/main/schedule.md))
  * Mon: Reagan et al.+
  * Weds: Ramsay (Canvas), Healy, Rambsy
  * Come to lecture prepared: have a question, thought, or connection between the readings
  * Responses by Tuesday at 4pm for students with NetIDs `a*-g*`; see [instructions on Canvas](https://canvas.cornell.edu/courses/57246/discussion_topics)
      * Respond to **Wednesday** reading, not Monday
* Go to section this week!
    * If you hope to switch sections, go to desired section and ask if anyone will swap. You might also try posting to Ed with your request.
        * If yes, email [courses@cis.cornell.edu](mailto:courses@cis.cornell.edu) with both NetIDs and they will make the swap
    * If not yet enrolled from waitlist, attend any section
    
## The question: How do we turn books into data?

What specific things might we do to make books into computable objects? 30 seconds with the person next to you ...

## Definition

What is a token?

* The **smallest individually meaningful unit of a document.** Roughly, a word.
* But ... as soon as you see "meaningful," you know it's going to be a matter of interpretation.
  * *Every single thing you do in text analysis is an interpretive intervention!*
* Not all tokens are (single) words. For example:
  * **Contractions**. `"I'm"` or `"can't"`. One token or two?
  * **Phrases.** `"San Francisco"` or `"Cornell University"`. Two tokens or one?
    * These are examples of "named entities." We'll revisit them later in the semester.
  * **Punctuation.** Count it at all? Is `"this"` the same token as `"this!"`? Is `"."` or `";"` a token on its own?
  * **Domain-specific terms.** `"@user"`, `"COVID-19"`, etc.

## Why tokenize?

Words suggest meaning. This is the wager and the starting point of many text analysis methods.
  * If we can identify words, we can count them.
  * Words are small enough to recur, so not all counts are `1` (which isn't very informative)
    * Hence we can compare word counts across contexts
    * Compare to sentences (or paragraphs, or full documents), which are often globally unique
  * If we can count words, we can quantify (aspects of) a text that contains those words.
  * **If we can quantify a text, we can compute with it.**
  * **This is the most common way that text becomes data!**

Note that quantifying a text isn't the same thing as being *correct* about what that text means, nor is meaning solely a function of word counts(!).

Tokenization is part of the more-or-less standard text-processing workflow. Other parts of that workflow might include:
  * Case regularization/folding
  * Punctuation removal
  * Lemmatization or stemming
  * Sentence segmentation
  * and more ...
  
## State of the art

A decade ago, using raw tokens for NLP tasks was the best we could do. Today, we generally use static or contextual word *embeddings* in place of tokens. We'll talk about this at length in the second half of the course, but the underlying idea is the same. Words and embeddings are proxies for meaning (which is what we ultimately care about, but is never directly accessible to us). Embeddings are just a way to capture more of the specific meaning of a word as it is used in a given language (static) or linguistic context (contextual).

## Tokenization can be domain-specific

Note that today's reading assumed some special interests:

* Twitter(like) texts
* Sentiment as target phenomenon

So it worked hard to capture Twitter handles, hashtags, smilies, URLs, etc.

The "right" way to tokenize depends on your project, on what is meaningful *in context*.
If you have different data or different phenomena to investigate, you might tokenize differently.
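
As a rough illustration of what domain-specific tokenization can look like (this is not the tokenizer from the reading), the sketch below tries URLs, @handles, and #hashtags before falling back to ordinary words and punctuation. The pattern, the `tweet_tokenize` name, and the sample tweet are all made up for this example.

```python
import re

# Hypothetical pattern for illustration. Alternatives are tried left to right,
# so the more specific forms (URLs, handles, hashtags) come first.
TWEET_PATTERN = re.compile(r"""
    https?://\S+        # URLs
  | [@#]\w+             # @handles and #hashtags
  | \w+(?:[-']\w+)*     # words, allowing internal hyphens and apostrophes
  | [^\w\s]             # any other single non-space character
""", re.VERBOSE)

def tweet_tokenize(text):
    """Return the tokens matched by TWEET_PATTERN, in order."""
    return TWEET_PATTERN.findall(text)

tweet = "@cornell loves #NLP! See https://example.com"
print(tweet_tokenize(tweet))
# ['@cornell', 'loves', '#NLP', '!', 'See', 'https://example.com']

# For comparison, a plain word-character pattern loses the markers
print(re.findall(r"\w+", tweet))
# ['cornell', 'loves', 'NLP', 'See', 'https', 'example', 'com']
```

Whether `#NLP` and `NLP` should count as the same token is exactly the kind of project-dependent decision described above.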

## Approach 1: Split on whitespace

A simple, naïve approach, workable for quick-and-dirty work with many Western languages.

Consider the sentence:

> Cornell is a private, Ivy League university and the land-grant university for New York state.

How many tokens does this sentence contain? (count them for yourself)

In [1]:
cornell = 'Cornell is a private, Ivy League university and the land-grant university for New York state.'
tokens = cornell.split()
print(tokens)
print("Number of tokens:", len(tokens))

['Cornell', 'is', 'a', 'private,', 'Ivy', 'League', 'university', 'and', 'the', 'land-grant', 'university', 'for', 'New', 'York', 'state.']
Number of tokens: 15


Notice: `private,` `land-grant` `state.` These aren't wrong *per se*, but ...

Maybe we could do better if we just took non-space, non-punctuation strings.

In [2]:
import re
word_pattern = re.compile(r"[\w]+") # raw string, so Python leaves the backslash alone
tokens_re = word_pattern.findall(cornell)
print(tokens_re)
print("Number of tokens:", len(tokens_re))

['Cornell', 'is', 'a', 'private', 'Ivy', 'League', 'university', 'and', 'the', 'land', 'grant', 'university', 'for', 'New', 'York', 'state']
Number of tokens: 16


### Regular expressions

A totally inadequate mini-introduction to an important but annoyingly complex technology.

* What is a regular expression (regex)?
  * A sequence of characters that defines a search pattern.
  * That is, it's a text search or matching language.
  * Notoriously unreadable and difficult to parse by eye.
  
Consider the line above:

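```
word_pattern = re.compile(r"[\w]+")
```

The search pattern here is any sequence of one or more (`+`) uninterrupted "word" characters that occurs anywhere in a string. `\w` matches upper- and lowercase letters, digits, and the underscore; in Python 3 it also matches non-ASCII letters and digits by default. The `r` prefix marks the pattern as a raw string, so that Python passes the backslash through to the regex engine unchanged. Quantifiers like `+` are "greedy" by default, so they continue matching character by character until their condition is not met.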

In [3]:
for word in ['t', 'the', 'these', "these'uns", "these ones"]:
    print(word_pattern.findall(word))

['t']
['the']
['these']
['these', 'uns']
['these', 'ones']


`re` is Python's regular expression library. `compile` prepares the regular expression for use with text inputs.

A few other useful bits of regex syntax:

* `.` (period) = any character
* `\s` = whitespace character (space, tab, newline, etc.)
* `\d` = digit
* `[abc]` = any character in the set {a, b, c}.
* `[^abc]` = negation, any character *except* a, b, or c.
* `A*` = zero or more occurrences of the character A; `+` = one or more, `?` = zero or one.
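* `\A` and `^` = match only at the start of a string (with `re.MULTILINE`, `^` also matches at the start of each line); `\Z` and `$` = match only at the end of a string (with `re.MULTILINE`, `$` also matches at the end of each line).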
* `\` (backslash) = escape the next character; `\.` = period, not wildcard.
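
To make the syntax above concrete, here is a small sketch that exercises several of these pieces with `re.findall`. The sample strings and variable names are invented for illustration.

```python
import re

text = "Call 555-0199 by 5 p.m.\nThen rest."

digits = re.findall(r"\d+", text)                   # \d+ : runs of digits
vowels = re.findall(r"[aeiou]", "token")            # [...] : any character in the set
non_vowels = re.findall(r"[^aeiou]", "token")       # [^...] : any character not in the set
spellings = re.findall(r"colou?r", "color colour")  # ? : zero or one 'u'
literal = re.findall(r"p\.m\.", text)               # \. : a literal period, not a wildcard
line_starts = re.findall(r"^\w+", text, re.MULTILINE)     # ^ : start of each line
string_start = re.findall(r"\A\w+", text, re.MULTILINE)   # \A : start of the string only

print(digits)        # ['555', '0199', '5']
print(vowels)        # ['o', 'e']
print(non_vowels)    # ['t', 'k', 'n']
print(spellings)     # ['color', 'colour']
print(literal)       # ['p.m.']
print(line_starts)   # ['Call', 'Then']
print(string_start)  # ['Call']
```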

There's a lot more to this. Take a look at the code linked from today's reading, and/or consult a [regex cheat sheet](https://learnbyexample.github.io/cheatsheet/python/python-regex-cheatsheet/).

Why use regular expressions?
  * A powerful way to find/match/extract substrings from strings and texts.
  * Can use regexes to build robust custom tokenizers (as in the reading for today)
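
As one possible sketch of a custom regex tokenizer (an illustration only, not the approach from the reading), the pattern below keeps hyphenated words and contractions together and emits each punctuation mark as its own token. The `simple_tokenize` name is hypothetical.

```python
import re

# Words may contain internal hyphens or apostrophes; anything else that is
# not whitespace becomes a single-character token.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

cornell = 'Cornell is a private, Ivy League university and the land-grant university for New York state.'
tokens_custom = simple_tokenize(cornell)
print(tokens_custom)
print("Number of tokens:", len(tokens_custom))  # 17

print(simple_tokenize("can't, I'm"))  # ["can't", ',', "I'm"]
```

Here `land-grant` stays whole and the comma and period become separate tokens. Contractions are kept as single tokens, which is a choice, not a necessity.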

### NLTK

The Natural Language Toolkit (NLTK) is a full-featured Python NLP library. It includes a bunch of tokenizers, nearly all of them extensible, that will probably perform better than whatever you can hack together for your project.

Let's try it:

In [4]:
from nltk import word_tokenize
tokens_nltk = word_tokenize(cornell)
print(tokens_nltk)
print('Number of tokens:', len(tokens_nltk))

['Cornell', 'is', 'a', 'private', ',', 'Ivy', 'League', 'university', 'and', 'the', 'land-grant', 'university', 'for', 'New', 'York', 'state', '.']
Number of tokens: 17


In [5]:
word_tokenize("can't, I'm")

['ca', "n't", ',', 'I', "'m"]

Note that NLTK treats word-terminal punctuation as a token and is smart about contractions.

## Non-English/Non-Western text

Splitting on whitespace can be a very bad approach if Western typographic conventions don't apply!

If you don't know the language:

* Ask if you should be doing the work
* Lean on libraries

### Example from the *New York Times*

In a [recent *Times* article](https://www.nytimes.com/2020/09/03/sports/soccer/premier-league-china-contract-television.html) on football broadcasting rights, we find this sentence:

**Chinese**

> 因受新型冠状病毒危机对足球和其他体育赛事的持续影响，早已面临越来越多亏损的英格兰超级足球联赛周四宣布，因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。

**English translation**

> The English Premier League, already facing mounting losses because of the continued impact of the coronavirus crisis on soccer and other sporting events, announced on Thursday that it had canceled its most lucrative overseas broadcast contract after it was unable to resolve a dispute with its Chinese partner.

Our previous tokenization strategy doesn't work well in this case:

In [6]:
# Strings
zh = '因受新型冠状病毒危机对足球和其他体育赛事的持续影响，早已面临越来越多亏损的英格兰超级足球联赛周四宣布，因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。'
en = 'The English Premier League, already facing mounting losses because of the continued impact of the coronavirus crisis on soccer and other sporting events, announced on Thursday that it had canceled its most lucrative overseas broadcast contract after it was unable to resolve a dispute with its Chinese partner.'

# Naive approach to tokenization
zh_tokens_bad = zh.split()
print(zh_tokens_bad)
print('Number of Chinese tokens:', len(zh_tokens_bad))

# English version
en_tokens = en.split()
print('Number of English tokens:', len(en_tokens))

['因受新型冠状病毒危机对足球和其他体育赛事的持续影响，早已面临越来越多亏损的英格兰超级足球联赛周四宣布，因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。']
Number of Chinese tokens: 1
Number of English tokens: 48
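
The regex approach from earlier does not rescue us either. As a quick sketch using the last two clauses of the sentence, `\w+` only breaks the text at punctuation, while the other obvious fallback, one token per character, splits every multi-character word apart. The variable names are made up for this example.

```python
import re

# Last two clauses of the Chinese sentence above
zh_part = '因为无法解决与中国合作伙伴的纠纷，已终止了其最赚钱的海外转播合同。'

# \w matches Chinese characters, so we get one "token" per clause
chunks = re.findall(r"\w+", zh_part)
print(chunks)
print("Regex chunks:", len(chunks))

# Character-level fallback: every non-punctuation character is its own token
chars = [ch for ch in zh_part if ch.isalnum()]
print("Single-character tokens:", len(chars))
```

Neither result corresponds to words, which is the gap a dedicated segmenter is meant to fill.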


### The `jieba` tokenizer

See the [`jieba` project GitHub page](https://github.com/fxsjy/jieba) for documentation (in Chinese and in English). `jieba` is one of the packages we installed in our virtual environment.

In [7]:
# A better approach to tokenizing Chinese-language text
import jieba
zh_tokens_better = [token for token in jieba.cut(zh)]
print(zh_tokens_better)
print("Number of Chinese tokens:", len(zh_tokens_better))

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/xd/m092nj891q71xlv9zcn1sd8r0000gn/T/jieba.cache
Loading model cost 0.292 seconds.
Prefix dict has been built successfully.


['因受', '新型', '冠状病毒', '危机', '对', '足球', '和', '其他', '体育赛事', '的', '持续', '影响', '，', '早已', '面临', '越来越', '多', '亏损', '的', '英格兰', '超级', '足球联赛', '周四', '宣布', '，', '因为', '无法', '解决', '与', '中国', '合作伙伴', '的', '纠纷', '，', '已', '终止', '了', '其', '最', '赚钱', '的', '海外', '转播', '合同', '。']
Number of Chinese tokens: 45


## Counting words

We often want to count the number of occurrences of each *unique* type of token in a text.

Note that '**type**' is a quasi-technical word that means "unique token form." The sentence:

> The cat is a cat.

... contains five tokens (setting aside the final period), but only four types. We find the same term (and concept) in the measure of **type-token ratio** (TTR), the number of types divided by the number of tokens, which we can use to measure the lexical diversity of a text. Note that "lexical diversity" does not equal "sophistication" or "value." Gertrude Stein's poetry has low lexical diversity. Hemingway's is surprisingly high. Pulp fiction often has (much) higher TTR than does "literary" fiction.

Anyway ... if we count tokens, throwing away word order, we have transformed our text(s) into a so-called "**bag of words**."
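
Before moving on, the type-token ratio mentioned above can be sketched in a few lines. The `type_token_ratio` function is a hypothetical helper, and the example lowercases the sentence and drops its period so that the counts match the five tokens and four types described above.

```python
def type_token_ratio(tokens):
    """Number of unique types divided by the total number of tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)

cat_tokens = "the cat is a cat".split()
print(len(cat_tokens), "tokens,", len(set(cat_tokens)), "types")
print("TTR:", type_token_ratio(cat_tokens))  # 4 / 5 = 0.8
```

One caution: TTR tends to fall as texts get longer, because common words keep repeating, so comparisons are safest between samples of similar length.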

### Bags of words

A bag of words is a **representation** of a text in the same way that a photograph or a story might be a representation of a person. It's a way of looking at the text, useful for some purposes, terribly inadequate for others. 

A bag of words is neither a good nor a bad representation *in the abstract*, because there is no such thing as an abstractly (or universally) good or bad representation. Goodness and badness only apply to the suitability of a representation for a particular purpose in a specific context.
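
A tiny sketch of what the bag-of-words representation discards: two sentences with opposite meanings produce identical bags once word order is thrown away. The example sentences are invented.

```python
from collections import Counter

bag_1 = Counter("dog bites man".split())
bag_2 = Counter("man bites dog".split())

print(bag_1)
print(bag_1 == bag_2)  # True: same counts, different meanings
```

For a purpose like identifying a text's topic, that loss may not matter. For working out who bit whom, it matters a great deal.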

### Let's count some types and tokens

Approach: Iterate over a list of tokens, counting the number of times we see each unique type.

In [8]:
from collections import Counter
cornell_counter = Counter() # easier than using a dict (why?)
for token in tokens_nltk:
    cornell_counter[token] += 1
cornell_counter

Counter({'Cornell': 1,
         'is': 1,
         'a': 1,
         'private': 1,
         ',': 1,
         'Ivy': 1,
         'League': 1,
         'university': 2,
         'and': 1,
         'the': 1,
         'land-grant': 1,
         'for': 1,
         'New': 1,
         'York': 1,
         'state': 1,
         '.': 1})

In [9]:
cornell_counter['university']

2

In [10]:
cornell_counter.most_common(2)

[('university', 2), ('Cornell', 1)]

In [11]:
cornell_counter['coffee']

0
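
One answer to the "why?" in the comment above, as a sketch with a made-up token list: a plain `dict` needs explicit handling the first time it sees each type, and it raises a `KeyError` when asked about a type it has never seen. A `Counter` treats missing keys as zero.

```python
from collections import Counter

sample_tokens = ['the', 'cat', 'is', 'a', 'cat']

# Plain dict: supply a default of 0 for the first sighting of each type
counts_dict = {}
for token in sample_tokens:
    counts_dict[token] = counts_dict.get(token, 0) + 1

# Counter: can consume the whole list at once
counts_counter = Counter(sample_tokens)

print(counts_dict == dict(counts_counter))  # True

print(counts_counter['coffee'])  # 0, no error
try:
    counts_dict['coffee']
except KeyError:
    print("A plain dict raises KeyError for unseen types")
```

`Counter` also provides `most_common()`, which a plain `dict` does not.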

## Way more words!

Let's count the words in *Moby-Dick* (Herman Melville, 1851), sometimes called "the great American novel." It's long: 500 pages or more, depending on the edition. 

In [12]:
# path to text file (available on course GitHub)
import os
fn = os.path.join('..', 'data', 'texts', 'A-Melville-Moby_Dick-1851-M.txt')

In [13]:
# examine the file path constructed above
fn

'../data/texts/A-Melville-Moby_Dick-1851-M.txt'

In [14]:
%%time
# naive but fast
moby_fast = Counter()
with open(fn, 'r') as f:
    for line in f: # read one line at a time for memory efficiency
        mtokens = line.strip().split() # strip newlines, split on space
        for token in mtokens:
            moby_fast[token] += 1
moby_fast.most_common(10)

CPU times: user 68.9 ms, sys: 6.14 ms, total: 75.1 ms
Wall time: 84.7 ms


[('the', 13603),
 ('of', 6475),
 ('and', 5881),
 ('a', 4473),
 ('to', 4439),
 ('in', 3825),
 ('that', 2680),
 ('his', 2415),
 ('I', 1724),
 ('with', 1645)]

In [15]:
%%time
# better but slower
moby_nltk = Counter()
with open(fn, 'r') as f:
    for line in f:
        mtokens = word_tokenize(line)
        for token in mtokens:
            moby_nltk[token] += 1
moby_nltk.most_common(10)

CPU times: user 608 ms, sys: 12.1 ms, total: 620 ms
Wall time: 624 ms


[(',', 19204),
 ('the', 13715),
 ('.', 7432),
 ('of', 6513),
 ('and', 6010),
 ('a', 4546),
 ('to', 4515),
 (';', 4173),
 ('in', 3909),
 ('that', 2981)]

In [16]:
# Total wordcount
print("Number of words in Moby-Dick (per split):", sum(moby_fast.values()))
print("Number of words in Moby-Dick (per NLTK): ", sum(moby_nltk.values()))

Number of words in Moby-Dick (per split): 212014
Number of words in Moby-Dick (per NLTK):  255370


## Stopwords and other processing steps

Notice that the most frequently occurring words in *Moby-Dick* don't carry much meaning on their own.

These high-frequency tokens are sometimes called *stopwords*. More generally, stopwords are any words that one wants to remove from one's token counts, usually because they are too common to tell us much about a particular text.

In [17]:
# Work with sample stopwords
import string
from nltk.corpus import stopwords

stops = stopwords.words('english') # NLTK's short list of English stopwords
print("Base stoplist:", stops)

for punct in string.punctuation:
    stops.append(punct) # Add punctuation marks to stoplist
print("\nOur stoplist:", stops)

Base stoplist: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 

In [18]:
for stop in stops:
    del moby_nltk[stop]
moby_nltk.most_common(10)

[('I', 2121),
 ("'s", 1731),
 ('--', 1714),
 ("''", 1565),
 ('``', 1529),
 ('one', 881),
 ('whale', 789),
 ('But', 703),
 ('The', 609),
 ('like', 558)]

In [19]:
print("Number of words in Moby-Dick (per NLTK, minus stopwords): ", sum(moby_nltk.values()))

Number of words in Moby-Dick (per NLTK, minus stopwords):  123712


**NB.** Consider the difference in wordcount after removing stopwords ...

### More processing

Consider how and why you might do each of the following:

* Case regularization (all lower case?)
  * `lower()` string method
* Punctuation removal
  * At what point(s) in the process?
  * (Dis)advantages of each?
* Lemmatization
  * `from nltk.stem import WordNetLemmatizer`
  * Stemming is faster but less accurate
  * Note that lemmatization benefits from knowledge of token part of speech. Part-of-speech taggers benefit from case and punctuation information.

In [20]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print('am ->', lemmatizer.lemmatize('am'))
print('am (verb) ->', lemmatizer.lemmatize('am', pos='v'))
print('wolves ->', lemmatizer.lemmatize('wolves'))

am -> am
am (verb) -> be
wolves -> wolf
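
The order of these steps matters. NLTK's stoplist is all lowercase, which is why `'But'` and `'The'` survived stopword removal above. The sketch below folds case before checking a stoplist and drops punctuation-only tokens. The `clean` function, the tiny stoplist, and the token list are all invented for illustration.

```python
import string
from collections import Counter

tiny_stops = {'the', 'but', 'a', 'of'}  # hypothetical mini stoplist
raw_tokens = ['The', 'whale', ',', 'the', 'White', 'Whale', '!',
              'But', 'a', 'whale', 'of', 'a', 'tale', '.']

def clean(tokens, stops):
    """Lowercase each token, then drop stopwords and punctuation-only tokens."""
    kept = []
    for token in tokens:
        folded = token.lower()
        if folded in stops:
            continue  # stopword, once case is folded
        if all(ch in string.punctuation for ch in folded):
            continue  # token consists only of punctuation
        kept.append(folded)
    return kept

cleaned_counts = Counter(clean(raw_tokens, tiny_stops))
print(cleaned_counts)  # whale: 3, white: 1, tale: 1
```

Folding case merges `'Whale'` with `'whale'`, which may or may not be what you want. If you plan to use a part-of-speech tagger or lemmatizer, it is usually safer to run it before throwing away case and punctuation.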


## A word on seeking help in this class ...

### Ed policies

* Ed is the first-best place to ask questions!
  * If you have a question, other people probably do, too
* You may post anonymously (to other students), but staff can always see your name
* Modest extra credit for high-quality participation on Ed, especially answering (correctly) other students' questions
* Staff wait 24 hours to respond (by policy) unless the matter is urgent
  * Homework due tomorrow is not urgent

### How to ask good questions

* More info, carefully curated, is better than less info
  * What homework? What code? **What are you trying to do?** What does your data look like? What output did you get? What did you expect to happen? What error message? Where does the data live? What did you try? **What does the documentation say?** ...
  * Craft a minimal example demonstrating the problem (this often helps you solve it on your own)
* Remember that TAs (and even professors) are human
  * They want to help. Respect, gratitude, and patience go a long way.
  * Evidence that you've made an earnest attempt to solve the problem on your own also goes a long way.
  
### Whom and where to ask

**When in doubt -> Ed!** Always open, async, monitored by staff, extra credit.

* It's 3am and my code won't work -> Maybe sleep on it?
* It's 9am and my code won't work -> Ed, office hours
* It's the next day and no one can figure it out -> Office hours (grad)
* I'm looking for project/HW partners -> Ed, lecture, [Learning Strategies Center](https://lsc.cornell.edu/)
* I'm thinking about grad school -> Office hours (grad, Prof. Wilkens)
* I have questions about the major, future classes, or careers -> Prof. Wilkens OH
* I'm really struggling (in the class, in general) -> Any staff, any time