This notebook outlines several methods for tokenizing text into words (and sentences) and highlights the differences between them:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

In [1]:
import nltk, re, json
import spacy
from collections import Counter

In [2]:
# If you haven't downloaded the sentence segmentation model before, do so here
nltk.download("punkt")

[nltk_data] Downloading package punkt to /Users/dbamman/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# Only tokenization is used below, so disable NER and the parser.
# The tagger stays enabled because spaCy lemmatization would need it.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

In [22]:
def read_tweets_from_json(filename):
    # The file holds one JSON object per line; keep only each tweet's text
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            tweet=json.loads(line)
            tweets.append(tweet["full_text"])
    return tweets

potus_tweets.json contains 1384 tweets from the Twitter @POTUS account, downloaded on 8/30/21.
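
Because `read_tweets_from_json` calls `json.loads` on each line, it expects one JSON object per line, each with a `full_text` field. The sketch below illustrates that layout with two made-up records held in an in-memory file; the `id` field and the tweet texts are assumptions for illustration, not the contents of the real data file.

```python
import io
import json

# Two made-up records standing in for lines of the real file (assumption)
fake_file = io.StringIO(
    '{"id": 1, "full_text": "First made-up tweet."}\n'
    '{"id": 2, "full_text": "Second made-up tweet, with a link: https://example.com"}\n'
)

# Parse each non-empty line as its own JSON object and keep the text field
texts = [json.loads(line)["full_text"] for line in fake_file if line.strip()]
print(texts)
```

A file that instead holds a single JSON array of tweets would need `json.load(file)` on the whole file rather than line-by-line parsing.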

In [23]:
filename="../data/potus_tweets.json"

In [24]:
tweets=read_tweets_from_json(filename)

In [25]:
whitespace_tokens=[]
for tweet in tweets:
    whitespace_tokens.append(tweet.split())
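
As a quick illustration of what whitespace tokenization does, `str.split()` with no arguments splits on any run of whitespace (including newlines) and leaves punctuation attached to the neighboring word. The sample string below is made up for this sketch.

```python
# Made-up sample modeled on a tweet with a blank line before the link
sample = "Now, our 20-year presence has ended.\n\nMy full statement: https://t.co/kfLkzQtEzp"

tokens = sample.split()
print(tokens)
```

Here "Now," and "ended." keep their punctuation, while the URL survives intact only because it contains no whitespace.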

In [26]:
nltk_tokens=[]
for tweet in tweets:
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

In [27]:
nltk_casual_tokens=[]
for tweet in tweets:
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

In [28]:
spacy_tokens=[]
for tweet in tweets:
    spacy_tokens.append([token.text for token in nlp(tweet)])

In [29]:
# Shorter version of http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py

# The order here is important (match from first to last)

# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)

In [30]:
extensible_tokens=[]
for tweet in tweets:
    extensible_tokens.append(my_extensible_tokenize(tweet))
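
To see why the order of the alternatives matters, recall that Python's `re` tries the branches of `a|b|c` left to right at each position and takes the first one that matches. The sketch below uses simplified stand-ins for the patterns above (the variable names are hypothetical) and joins them in two different orders.

```python
import re

# Simplified stand-ins for the patterns above (assumption: not the full set)
username = r"(?:@\w+)"
word = r"(?:\w+)"
other = r"(?:\S)"

# Most specific pattern first, catch-all last
specific_first = re.compile("|".join([username, word, other]))
# Catch-all first: it wins at every non-whitespace character
generic_first = re.compile("|".join([other, word, username]))

text = "Thanks @WhiteHouse!"
specific_tokens = specific_first.findall(text)
generic_tokens = generic_first.findall(text)
print(specific_tokens)
print(generic_tokens)
```

With the catch-all `\S` first, every token is a single character, which is why the most specific patterns belong at the front of the tuple.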

Q1: Write a function to print out the first 5 tokenized tweets from each of the five tokenizers above. Examine those tweets; how would you characterize the differences?

In [31]:
for idx, (one, two, three, four, five) in enumerate(zip(nltk_tokens, nltk_casual_tokens, spacy_tokens, whitespace_tokens, extensible_tokens)):
    if idx >= 5:
        break
    print("NLTK      :\t%s" % ' '.join(one))
    print("CASUAL    :\t%s" % ' '.join(two))
    print("SPACY     :\t%s" % ' '.join(three))
    print("WHITESPACE:\t%s" % ' '.join(four))
    print("EXTENSIBLE:\t%s" % ' '.join(five))


    print()


NLTK      :	The past 17 days have seen our troops execute the largest airlift in US history . They have done it with unmatched courage , professionalism , and resolve . Now , our 20-year military presence in Afghanistan has ended . My full statement : https : //t.co/kfLkzQtEzp
CASUAL    :	The past 17 days have seen our troops execute the largest airlift in US history . They have done it with unmatched courage , professionalism , and resolve . Now , our 20 - year military presence in Afghanistan has ended . My full statement : https://t.co/kfLkzQtEzp
SPACY     :	The past 17 days have seen our troops execute the largest airlift in US history . They have done it with unmatched courage , professionalism , and resolve . Now , our 20 - year military presence in Afghanistan has ended . 

 My full statement : https://t.co/kfLkzQtEzp
WHITESPACE:	The past 17 days have seen our troops execute the largest airlift in US history. They have done it with unmatched courage, professionalism, and resolve
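
Since Q1 asks for a function, one possible way to package the loop above is sketched here. The name `format_tokenizations` and its dictionary argument are assumptions made for this sketch; it returns the labelled lines instead of printing them so the result is easy to inspect.

```python
def format_tokenizations(named_tokenizations, n=5):
    """Return one labelled line per tokenizer for each of the first n texts."""
    width = max(len(name) for name in named_tokenizations)
    lines = []
    # zip(*...) walks all tokenizations in parallel, one text at a time
    for idx, per_text in enumerate(zip(*named_tokenizations.values())):
        if idx >= n:
            break
        for name, tokens in zip(named_tokenizations, per_text):
            lines.append("%s:\t%s" % (name.upper().ljust(width), " ".join(tokens)))
        lines.append("")  # blank line between texts
    return lines

# Tiny made-up tokenizations of two texts
demo = {
    "whitespace": [["Hello,", "world!"], ["Bye."]],
    "regex": [["Hello", ",", "world", "!"], ["Bye", "."]],
}
demo_lines = format_tokenizations(demo, n=1)
print("\n".join(demo_lines))
```

In the notebook, the dictionary would map labels to `nltk_tokens`, `nltk_casual_tokens`, `spacy_tokens`, `whitespace_tokens`, and `extensible_tokens`.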

Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.

In [32]:
def compare(one_tokens, two_tokens):
    
    one_counts=Counter()
    two_counts=Counter()

    for sentence in one_tokens:
        for token in sentence:
            one_counts[token]+=1
        
    for sentence in two_tokens:
        for token in sentence:
            two_counts[token]+=1
        
    missing_from_one=Counter()
    missing_from_two=Counter()
    
    for word_type in one_counts:
        if word_type not in two_counts:
            missing_from_two[word_type]=one_counts[word_type]
        
    for word_type in two_counts:
        if word_type not in one_counts:
            missing_from_one[word_type]=two_counts[word_type]

    print ("Tweet counts -- one: %s, two: %s" % (len(one_tokens), len(two_tokens)))
    print ("\nNot in one:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_one.most_common(20)))
    print ("\nNot in two:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_two.most_common(20)))


In [33]:
compare(nltk_casual_tokens, nltk_tokens)

Tweet counts -- one: 1384, two: 1384

Not in one:
https	767
@	142
COVID-19	121
WhiteHouse	40
's	38
U.S.	30
#	17
//t.co/4MYpWqXVVo	16
're	13
amp	12
;	12
LGBTQ+	11
//t.co/gRX1fGFEzj	9
Dr.	9
VP	8
n't	8
JustinTrudeau	6
FLOTUS	6
St.	5
TeamUSA	4

Not in two:
…	47
U	41
S	41
@WhiteHouse	40
https://t.co/4MYpWqXVVo	16
8/	12
+	12
That's	11
LGBTQ	11
cannot	10
https://t.co/gRX1fGFEzj	9
Dr	9
@VP	8
H	8
It's	8
K	7
R	6
@JustinTrudeau	6
@FLOTUS	6
We're	5
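
A more compact variant of the comparison is sketched below. It returns the two counters instead of printing them, which makes the result easier to reuse or test. The function name `unique_token_counts` and the toy inputs are assumptions for this sketch.

```python
from collections import Counter

def unique_token_counts(one_tokens, two_tokens):
    """Return (only_in_one, only_in_two): counts of token types unique to each side."""
    one_counts = Counter(tok for sent in one_tokens for tok in sent)
    two_counts = Counter(tok for sent in two_tokens for tok in sent)
    only_in_one = Counter({t: c for t, c in one_counts.items() if t not in two_counts})
    only_in_two = Counter({t: c for t, c in two_counts.items() if t not in one_counts})
    return only_in_one, only_in_two

# Made-up tokenizations mimicking the casual vs. Treebank differences above
casual_like = [["@VP", "That's", "right"], ["@VP", "agrees"]]
treebank_like = [["@", "VP", "That", "'s", "right"], ["@", "VP", "agrees"]]

only_one, only_two = unique_token_counts(casual_like, treebank_like)
print(only_one.most_common(20))
print(only_two.most_common(20))
```

Tokens shared by both tokenizations, such as "right", drop out of both counters, so what remains reflects where the tokenizers disagree.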


Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.

In [34]:
count=0.
num_sents=0
for tweet in tweets:
    for sent in nltk.sent_tokenize(tweet):
        count+=len(nltk.word_tokenize(sent))
        num_sents+=1
print("Sents: %s, Tokens/sent: %.1f" % (num_sents, (count/num_sents)))


Sents: 3567, Tokens/sent: 14.8
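
The same computation can be written as a function that takes the sentence and word tokenizers as arguments, which makes it easy to swap tokenizers and compare the resulting statistics. This is a sketch: `sentence_stats` is a hypothetical name, and the regex sentence splitter and `str.split` below are naive stand-ins so the example runs without NLTK.

```python
import re

def sentence_stats(texts, sent_tokenize, word_tokenize):
    """Return (number of sentences, average tokens per sentence)."""
    num_sents = 0
    num_tokens = 0
    for text in texts:
        for sent in sent_tokenize(text):
            num_sents += 1
            num_tokens += len(word_tokenize(sent))
    avg = num_tokens / num_sents if num_sents else 0.0
    return num_sents, avg

# Naive stand-in: split after ., ! or ? when followed by whitespace
def naive_sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+", text)

texts = ["One two three. Four five.", "Six."]
num_sents, avg = sentence_stats(texts, naive_sent_tokenize, str.split)
print("Sents: %s, Tokens/sent: %.1f" % (num_sents, avg))
```

In the notebook, passing `nltk.sent_tokenize` and `nltk.word_tokenize` in place of the stand-ins should reproduce the figures reported above.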


Q4 (check-plus): Modify the extensible tokenizer above to keep URLs together (e.g., www.google.com or http://www.google.com).

In [35]:
# Keep usernames together (any token starting with @, followed by A-Z, a-z, 0-9, or _)
regexes=(r"(?:@[\w_]+)",

# Keep hashtags together (any token starting with #, followed by A-Z, a-z, 0-9, _, or -)
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

# Keep urls together
r"(?:https?:\S+)",
r"(?:www\.\S+)",
  
# Keep words with apostrophes, hyphens and underscores together
r"(?:[a-z][a-z’'\-_]+[a-z])",

# Keep all other sequences of A-Z, a-z, 0-9, _ together
r"(?:[\w_]+)",

# Everything else that's not whitespace
r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

In [36]:
print ('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html
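
One caveat with `https?:\S+` is that it also swallows punctuation that directly follows a URL. The sketch below contrasts it with a hypothetical variant that stops before trailing sentence punctuation; the variant is an illustration under that assumption, not part of the tokenizer above.

```python
import re

# The URL pattern used above: everything up to the next whitespace
url_greedy = re.compile(r"https?:\S+")

# Hypothetical variant: a lazy match that must be followed by optional
# closing punctuation and then whitespace or the end of the string
url_trimmed = re.compile(r"https?:\S+?(?=[.,;:!?)\]]*(?:\s|$))")

text = "Read the syllabus (http://www.google.com), then email me."
greedy_urls = url_greedy.findall(text)
trimmed_urls = url_trimmed.findall(text)
print(greedy_urls)
print(trimmed_urls)
```

The trade-off is that the trimmed variant also strips punctuation that genuinely belongs to a URL, such as a closing parenthesis at the very end of a link.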
