# Seminar Notebook 1.3: Preprocessing

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers preprocessing texts and creating document feature matrices.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [56]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek02") # or whatever path you want
os.makedirs(sdir, exist_ok=True) # create the folder (and any missing parent folders) if needed

In this notebook, we will examine a corpus of U.S. President Donald Trump's tweets from January 2017 through June 2018. These are contained in a JSON file called `trump-tweets.json`, available on the course GitHub page. The following code chunk will download this file and save it to `sdir`:

In [57]:
import requests

# Where is the remote file?
rfile = "https://raw.githubusercontent.com/lse-my459/data/master/trump-tweets.json"

# Where will we store it locally?
lfile = os.path.join(sdir, os.path.basename(rfile))

# Check if you have the file yet and if not, download it to correct location
if not os.path.exists(lfile):
    r = requests.get(rfile) # make GET request for the remote file
    r.raise_for_status()    # raise exception if there's an HTTP error
    
    # Write the raw bytes received from the server to the local file path
    with open(lfile, "wb") as f:
        f.write(r.content)

## Working with collections of texts

We begin by loading the tweets from the downloaded file. The file is a JSON file, but it uses a somewhat unusual layout: each tweet is contained within a JSON object stored on its own line of the file. You should familiarise yourself with JSON formats by reading <https://en.wikipedia.org/wiki/JSON>. You should also familiarise yourself with Python's `json` module by reviewing parts of the documentation at <https://docs.python.org/3/library/json.html>.

In [58]:
import json

# Create a list to hold data for each tweet
tweets = []

# Read the data line by line, storing each line as a new element in the `tweets` list
with open(lfile, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip() # remove whitespace at beginning/end
        if line: # if line object is not empty
            tweets.append(json.loads(line))
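
To see why the file is read line by line rather than parsed in one go, here is a minimal sketch using a made-up two-record string. The name `sample` and its contents are hypothetical and are not taken from the tweet file.

```python
import json

# Hypothetical two-record sample in the one-object-per-line layout
sample = '{"id_str": ["1"], "full_text": ["first"]}\n{"id_str": ["2"], "full_text": ["second"]}\n'

# Parsing the whole string at once fails, because it is not a single JSON document
try:
    json.loads(sample)
    parsed_whole = True
except json.JSONDecodeError:
    parsed_whole = False

# Parsing each non-empty line separately works
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(parsed_whole)
print([r["full_text"][0] for r in records])
```

Each line is a complete JSON object on its own, which is why the loop above calls `json.loads()` once per line.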

Let's look at the first tweet in this list of tweets, just to get a sense for how the data looks. You will see that each tweet is represented by a `dict` object, where each key-value pair represents a specific piece of information relating to the tweet.

In [59]:
tweets[0]

{'created_at': ['Fri Jun 22 19:40:20 +0000 2018'],
 'id': [1.01024612682035e+18],
 'id_str': ['1010246126820347906'],
 'full_text': ['We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY'],
 'truncated': [False],
 'display_text_range': [[0], [280]],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': ['https://t.co/ZjXESYAcjY'],
    'expanded_url': ['https://www.pscp.tv/w/bf1GFzFvTlFsTFJub1dwUXd8MWpNSmdFVll5ZUFLTAWuHc0BMMKeCOoDRCPmtIftVLaFLQVwfSLoC_C0SbzX?t=9m9s'],
    'display_url': ['pscp.tv/w/bf1GFzFvTlFs…'],
    'indices': [[257], [280]]}]},
 'source': ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'],
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_

Two pieces of information are particularly useful: the text of the tweet and the date and time it was posted. Let's look at just those for the first tweet.

In [60]:
print(tweets[0]["created_at"])
print(tweets[0]["full_text"])

['Fri Jun 22 19:40:20 +0000 2018']
['We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY']


For now, let's just focus on the text of the tweets and ignore the dates. We'll extract the text from each tweet and save it as an element of a list.

In [61]:
tweet_texts = [t['full_text'][0] for t in tweets]

# Show first few tweets
for tweet in tweet_texts[:5]:
    print(tweet)

We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY
Amy Kremer, Women for Trump, was so great on @foxandfriends. Brave and very smart, thank you Amy! @AmyKremer
Thank you South Carolina. Now let’s get out tomorrow and VOTE for @HenryMcMaster! https://t.co/5xlz0wfMfu
Just watched @SharkGregNorman on @foxandfriends. Said “President is doing a great job. All over the world, people want to come back to the U.S.” Thank you Greg, and you’re looking and doing great!
Russia continues to say they had nothing to do with Meddling in our Election! Where is the DNC Server, and why didn’t Shady James Comey and the now disgraced FBI agents take and closely examine it? Why isn’t Hillary/Russia being looked at? So many questions, so much corruption!


Let's dig into the data a little bit more. Given the source of the dataset, we can expect that there will be many tweets mentioning topics such as immigration or health care. We can search for patterns in strings using the `in` operator or by using `re` functions. Note that since we have a collection of texts, we will need to iterate over each one to find tweets on these topics.

In [62]:
# Print the first five tweets that mention immigration
[x for x in tweet_texts if 'immigration' in x.lower()][0:5]

['We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY',
 'RT @realDonaldTrump: We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citize…',
 '....If this is done, illegal immigration will be stopped in it’s tracks - and at very little, by comparison, cost. This is the only real answer - and we must continue to BUILD THE WALL!',
 'HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR IMMIGRATION BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON VOTE TODAY, EVEN THOUGH THE DEMS WON’T LET IT PASS IN THE SENATE. PASSAGE WILL SHOW THAT WE WANT STRONG BORDERS &amp; SECURITY WHILE THE DEMS WANT OPEN BORDERS = CRIME.  WIN!',
 '....Our Immigration policy, laughed at all over the world, is very unfair to all of

The `in` operator is a fast way to search, but less powerful than regex.
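
As a rough illustration of that trade-off, the sketch below compares the two approaches on three made-up strings. The list `texts` and the pattern are illustrative assumptions, not part of the tweet corpus.

```python
import re

# Hypothetical example strings (not from the tweet corpus)
texts = ["Immigration reform now", "Immigrants welcome", "A trade deal"]

# Substring search with `in`: one fixed string at a time
with_in = [t for t in texts if "immigration" in t.lower()]

# Regex search: alternatives, word boundaries and case-insensitivity in one pattern
pattern = re.compile(r"\bimmigra(?:tion|nts?)\b", flags=re.IGNORECASE)
with_re = [t for t in texts if pattern.search(t)]

print(with_in)
print(with_re)
```

The `in` search only finds the first string, while the regex also picks up the second one because the pattern allows for several word endings.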

In [63]:
# Use re.search() instead: slower than `in`, but it supports patterns
import re
[x for x in tweet_texts if re.search(r'(immigration|immigrant)', x.lower())][0:5]

['We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citizens permanently separated from their loved ones b/c they were killed by criminal illegal aliens. These are the families the media ignores...https://t.co/ZjXESYAcjY',
 'RT @realDonaldTrump: We are gathered today to hear directly from the AMERICAN VICTIMS of ILLEGAL IMMIGRATION. These are the American Citize…',
 '....If this is done, illegal immigration will be stopped in it’s tracks - and at very little, by comparison, cost. This is the only real answer - and we must continue to BUILD THE WALL!',
 'HOUSE REPUBLICANS SHOULD PASS THE STRONG BUT FAIR IMMIGRATION BILL, KNOWN AS GOODLATTE II, IN THEIR AFTERNOON VOTE TODAY, EVEN THOUGH THE DEMS WON’T LET IT PASS IN THE SENATE. PASSAGE WILL SHOW THAT WE WANT STRONG BORDERS &amp; SECURITY WHILE THE DEMS WANT OPEN BORDERS = CRIME.  WIN!',
 '....Our Immigration policy, laughed at all over the world, is very unfair to all of

And if we want to know the total proportion of tweets mentioning immigrants or immigration, we can do the following.

In [64]:
num = len([x for x in tweet_texts if re.search(r'(immigration|immigrant)', x.lower())])
den = len(tweet_texts)
print(num/den)

0.02353854112778065


## Storing a collection of texts in a table

Above we have a collection of texts stored in a list. This is a very simple way to store texts. However, when managing large data projects with lots of texts, we might want to use a more "structured" data storage system. We will now store the tweet data in a Pandas data frame so that we can access it using standard Pandas data wrangling techniques for tabular data. Since Pandas is the most common module for working with tabular data in Python, we will expect you to be broadly familiar with how to use Pandas tabular objects to do data analysis.

The dataframe will contain one column called `text` with the text of each tweet. Each row will be indexed by the date and time the tweet was posted. 

In [65]:
import pandas as pd

tf = pd.DataFrame({"datetime" : [x["created_at"][0] for x in tweets], 
                   "text" : [x["full_text"][0] for x in tweets]})
tf.head()

|  | datetime | text |
|---|---|---|
| 0 | Fri Jun 22 19:40:20 +0000 2018 | We are gathered today to hear directly from th... |
| 1 | Thu Jun 28 11:32:09 +0000 2018 | Amy Kremer, Women for Trump, was so great on @... |
| 2 | Tue Jun 26 01:35:03 +0000 2018 | Thank you South Carolina. Now let’s get out to... |
| 3 | Thu Jun 28 11:38:40 +0000 2018 | Just watched @SharkGregNorman on @foxandfriend... |
| 4 | Thu Jun 28 11:25:15 +0000 2018 | Russia continues to say they had nothing to do... |


Our corpus is now in tabular form. Let's do a little cleanup. First, let's convert the `datetime` column into a proper date format. Then, we'll sort the dataset by `datetime`.

In [66]:
tf["datetime"] = pd.to_datetime(tf["datetime"])   # Convert to datetime format
tf = tf.sort_values("datetime")
tf = tf.reset_index(drop=True)
tf.head()


|  | datetime | text |
|---|---|---|
| 0 | 2017-01-01 05:00:10+00:00 | TO ALL AMERICANS-\n#HappyNewYear &amp; many bl... |
| 1 | 2017-01-01 05:39:13+00:00 | RT @DanScavino: On behalf of our next #POTUS &... |
| 2 | 2017-01-01 05:43:23+00:00 | RT @Reince: Happy New Year + God's blessings t... |
| 3 | 2017-01-01 05:44:17+00:00 | RT @EricTrump: 2016 was such an incredible yea... |
| 4 | 2017-01-01 06:49:33+00:00 | RT @DonaldJTrumpJr: Happy new year everyone. #... |
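
The `created_at` strings appear to follow one fixed layout, so they can also be parsed with an explicit format string. The sketch below uses only the standard library. The constant `TWITTER_FMT` is a name introduced here, and the format is an assumption based on the few values displayed in this notebook. Note that `%a` and `%b` depend on the locale, so this assumes English day and month names.

```python
from datetime import datetime, timezone

# Format string matching the `created_at` values shown above (assumed to hold
# for the whole file; only a handful of values are displayed in this notebook)
TWITTER_FMT = "%a %b %d %H:%M:%S %z %Y"

parsed = datetime.strptime("Fri Jun 22 19:40:20 +0000 2018", TWITTER_FMT)
print(parsed.isoformat())
```

Because the format includes `%z`, the result is timezone-aware, which matches the `+00:00` suffix in the table above.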


As the previous code demonstrates, a key advantage of keeping our texts in a DataFrame is that we can (efficiently) perform manipulations across all documents in the corpus. For example, we can quickly extract specific strings:

In [67]:
tf["text"].str.extractall(r"(AMERICA)")

|  | match | 0 |
|---|---|---|
| 0 | 0 | AMERICA |
| 1 | 0 | AMERICA |
| 6 | 0 | AMERICA |
| 19 | 0 | AMERICA |
| 45 | 0 | AMERICA |
| ... | ... | ... |
| 3769 | 0 | AMERICA |
| 3778 | 0 | AMERICA |
| 3796 | 0 | AMERICA |
| 3850 | 0 | AMERICA |


We can also find all the documents containing specific text, and use it to subset our dataframe:

In [68]:
tf.loc[tf["text"].str.contains(r"(?:immigrant|immigration)"), :]

|  | datetime | text |
|---|---|---|
| 198 | 2017-01-29 21:45:58+00:00 | The joint statement of former presidential can... |
| 199 | 2017-01-29 21:49:32+00:00 | ...Senators should focus their energies on ISI... |
| 217 | 2017-02-02 03:55:49+00:00 | Do you believe it? The Obama Administration ag... |
| 329 | 2017-02-19 21:57:01+00:00 | My statement as to what's happening in Sweden ... |
| 330 | 2017-02-20 14:15:42+00:00 | Give the public a break - The FAKE NEWS media ... |
| ... | ... | ... |
| 3791 | 2018-06-23 16:57:48+00:00 | Heading to Nevada to talk trade and immigratio... |
| 3793 | 2018-06-23 17:05:33+00:00 | It’s very sad that Nancy Pelosi and her sideki... |
| 3799 | 2018-06-24 15:02:02+00:00 | We cannot allow all of these people to invade ... |
| 3808 | 2018-06-25 12:36:29+00:00 | Such a difference in the media coverage of the... |


Now we will save this corpus in tabular `.csv` format.

In [69]:
corpus_file = os.path.join(sdir, "tweet-corpus.csv")
if not os.path.exists(corpus_file):
    tf.to_csv(corpus_file, index = False, header=True)

**Important warning:** try not to open CSV files in other applications (like Microsoft Excel or Apple Numbers). These applications often make changes to your data without your knowledge, which can cause you problems down the road! If you heed this advice, your future self will thank you!

## Preprocessing a text

We will now see how to perform standard pre-processing steps covered in lecture. We begin by preprocessing a specific text. Our running example will be the first tweet in the Trump tweet dataset.

In [70]:
from pprint import pprint
first_tweet = tf["text"].iloc[0]
pprint(first_tweet, width = 80) # Limit width for printing

('TO ALL AMERICANS-\n'
 '#HappyNewYear &amp; many blessings to you all! Looking forward to a '
 'wonderful &amp; prosperous 2017 as we work together to #MAGA🇺🇸 '
 'https://t.co/UaBFaoDYHe')


### Tokenising a text

When we use the bag of words model to analyse text for whitespace-delimited languages (like English), we tokenise text on the whitespace. 

In [71]:
tokens = re.split(r"\s+", first_tweet)
tokens

['TO',
 'ALL',
 'AMERICANS-',
 '#HappyNewYear',
 '&amp;',
 'many',
 'blessings',
 'to',
 'you',
 'all!',
 'Looking',
 'forward',
 'to',
 'a',
 'wonderful',
 '&amp;',
 'prosperous',
 '2017',
 'as',
 'we',
 'work',
 'together',
 'to',
 '#MAGA🇺🇸',
 'https://t.co/UaBFaoDYHe']
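
One detail worth knowing when splitting on `\s+`: if a text starts or ends with whitespace, `re.split()` returns empty strings at the edges. The sketch below uses a made-up string (`text` is hypothetical) to show this alongside two ways of avoiding it.

```python
import re

# Hypothetical text with leading and trailing whitespace
text = "  Hello   world\n"

print(re.split(r"\s+", text))          # empty strings appear at both edges
print(re.split(r"\s+", text.strip()))  # stripping first avoids them
print(text.split())                    # str.split() with no argument also avoids them
```

Dropping empty strings at the end of the cleaning steps, as done later in this notebook, handles the same problem.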

### Cleaning up and formatting tokens

When you clean and re-format tokens, you have to make some judgements about how to do it. Below is a sequence of cleaning steps based on the kind of data we're working with (Twitter data). In general, the cleaning steps you take should be appropriate to the dataset you are using.

In [72]:
# Note: the range A-z also matches the characters [ \ ] ^ _ ` that sit between Z and a in ASCII
tokens = [re.sub(r"https?\://[^ ]+", "", x) for x in tokens] # drop urls
tokens = [re.sub(r"\&\#?[A-z]+;", "", x) for x in tokens]    # drop html entities such as &amp;
tokens = [re.sub(r"[“”]", '"', x) for x in tokens]    # straighten curly double quotes
tokens = [re.sub(r"[‘’]", "'", x) for x in tokens]    # straighten curly single quotes
tokens = [re.sub(r"[–—]", "-", x) for x in tokens]    # replace en/em dashes with hyphens
tokens = [re.sub(r"[^A-z0-9\-#@':/_\.\$% ]", "", x) for x in tokens]    # drop unneeded characters
tokens = [re.sub(r"(^[^A-z0-9#@\$]|[^A-z0-9%]$)", "", x) for x in tokens]    # trim one stray character from each end
tokens = [x for x in tokens if all(re.search(r"[A-z#@\-]", y) for y in x)]    # drop tokens containing digits or other non-word characters
tokens = [x for x in tokens if x != ""]    # drop empty strings
tokens

['TO',
 'ALL',
 'AMERICANS',
 '#HappyNewYear',
 'many',
 'blessings',
 'to',
 'you',
 'all',
 'Looking',
 'forward',
 'to',
 'a',
 'wonderful',
 'prosperous',
 'as',
 'we',
 'work',
 'together',
 'to',
 '#MAGA']

### Standardising cases

In this application, we want to treat words similarly regardless of their capitalisation. For example, the phrase "MAKE AMERICA GREAT AGAIN" should be treated as equivalent to "Make America Great Again" or even "make america great again". To accomplish this you can make all tokens lowercase.

In [73]:
tokens = [x.lower() for x in tokens]
tokens

['to',
 'all',
 'americans',
 '#happynewyear',
 'many',
 'blessings',
 'to',
 'you',
 'all',
 'looking',
 'forward',
 'to',
 'a',
 'wonderful',
 'prosperous',
 'as',
 'we',
 'work',
 'together',
 'to',
 '#maga']

Of course, in some situations, you might _not_ want to standardise capitalisation. For example, consider the following tweet, which contains both "US" (an abbreviation for United States) and the word "us". If you make this tweet all lowercase, then "US" and "us" are treated as the same token.

In [74]:
pprint(tf["text"].iloc[1568])

('RT @TravelGov: Continue to notify us of US citizens overseas impacted by '
 '#HurricaneIrma &amp; #HurricaneJose. https://t.co/EuIpTB144z https://t…')
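
One possible workaround, sketched below, is to lowercase everything except a short list of tokens whose capitalisation carries meaning. The set `KEEP_CASE` and the function `lower_except` are hypothetical names introduced for illustration, and the approach is only one option among several.

```python
# Hypothetical set of tokens whose capitalisation carries meaning
KEEP_CASE = {"US", "UN", "WHO"}

def lower_except(tokens, keep=KEEP_CASE):
    """Lowercase every token unless it appears verbatim in `keep`."""
    return [t if t in keep else t.lower() for t in tokens]

print(lower_except(["Continue", "to", "notify", "us", "of", "US", "citizens"]))
```

This keeps "US" and "us" apart, but it would misfire on text written entirely in capitals, where "US" may simply be the word "us".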


### Remove stop words

The `nltk` module has a bunch of tools for NLP tasks. For example, if you import `stopwords` from the `nltk.corpus` submodule, you can access a list of common stop words in several languages. You will need to install `nltk` before running the following code. (Also note that the first time you run it, you will need to download `stopwords`.)

In [75]:
from nltk.corpus import stopwords

try: 
    ## Try to extract list
    sw = stopwords.words('english')
except LookupError:
    ## If not available, download it
    import nltk
    nltk.download('stopwords')
    sw = stopwords.words('english')

## Make all the stop words lowercase (since all our tokens are lowercase)
sw = [x.lower() for x in sw]

## Show the first 10
sw[0:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

Now, we will remove all stop words from the tweet.

In [76]:
tokens = [x for x in tokens if x not in sw]
tokens

['americans',
 '#happynewyear',
 'many',
 'blessings',
 'looking',
 'forward',
 'wonderful',
 'prosperous',
 'work',
 'together',
 '#maga']
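
When the same filter is applied to many documents, it can help to convert the stop word list to a `set` first, since checking membership in a set does not scan the whole list. The sketch below uses a hypothetical mini stop word list (`stop_list`) rather than the `nltk` one, so that it runs on its own.

```python
# Hypothetical mini stop word list (the real one comes from nltk above)
stop_list = ["to", "all", "you", "a", "as", "we"]
stop_set = set(stop_list)   # set membership tests do not scan the whole list

sample_tokens = ["to", "all", "americans", "many", "blessings", "to", "you", "all"]
kept = [t for t in sample_tokens if t not in stop_set]
print(kept)
```

The result is the same as filtering against the list; only the lookup cost changes.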

### Stemming

The `nltk.stem` submodule has tools available for both stemming and lemmatising texts. Here, we will use the standard English-language version of the Snowball stemmer.

In [77]:
from nltk.stem import snowball
sstemmer = snowball.SnowballStemmer("english")
tokens = [sstemmer.stem(x) for x in tokens]
tokens

['american',
 '#happynewyear',
 'mani',
 'bless',
 'look',
 'forward',
 'wonder',
 'prosper',
 'work',
 'togeth',
 '#maga']

### Edit distance

Sometimes you need to calculate the edit distance of two strings. For example, you might use it to find and correct important typos in your corpus. While we will not use it on the Trump tweets, the following code chunk shows you how you can calculate it.

In [78]:
from nltk.metrics.distance import edit_distance

edit_distance("kitten", "sitting")

3
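
To make the calculation less of a black box, here is a minimal pure-Python sketch of the standard dynamic-programming approach to Levenshtein distance, with each insertion, deletion and substitution costing 1. The function name `levenshtein` is introduced here for illustration and is not part of `nltk`.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # `prev` holds the previous row of the distance table
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of `a` into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

For "kitten" and "sitting" this gives 3: substitute k with s, substitute e with i, and insert g.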

### Calculate token frequency

Once we have a list of preprocessed tokens, we can then count the number of times each token appears in the list. In the `collections` module, the `Counter` class allows you to count the number of times an element appears in a list. Here is a simple example:

In [79]:
from collections import Counter
a_list = ["A", "a", "b", "b", 1, 17, "A", "z", "A"]
Counter(a_list)

Counter({'A': 3, 'b': 2, 'a': 1, 1: 1, 17: 1, 'z': 1})

Of course, we can use `Counter` to calculate token frequency in our preprocessed list of tokens:

In [80]:
tok_counts = Counter(tokens)
tok_counts

Counter({'american': 1,
         '#happynewyear': 1,
         'mani': 1,
         'bless': 1,
         'look': 1,
         'forward': 1,
         'wonder': 1,
         'prosper': 1,
         'work': 1,
         'togeth': 1,
         '#maga': 1})

In the case of the tweet we're analysing, each token only appears once. But that is not always guaranteed!
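
`Counter` objects have a few other conveniences that are handy for token counts. The sketch below uses two made-up token lists (`doc_a` and `doc_b` are hypothetical) to show adding counters together, ranking tokens, and looking up a token that never occurs.

```python
from collections import Counter

# Hypothetical token lists for two short documents
doc_a = Counter(["tax", "cut", "tax", "job"])
doc_b = Counter(["tax", "job"])

total = doc_a + doc_b            # adding Counters sums the counts key by key
print(total.most_common(2))      # the two most frequent tokens
print(total["wall"])             # missing keys count as zero
```

Here "tax" appears three times and "job" twice across the two documents, and looking up "wall" returns 0 rather than raising an error.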

## Preprocessing a whole corpus

The steps above were useful for seeing how an individual text is preprocessed. However, in most applications, you will want to preprocess a whole corpus of documents all at once. Recall that we stored our documents (tweets) as a column of a Pandas `DataFrame`. We'll now add a new column where we'll store preprocessed tokens for the documents. We'll do this by applying our preprocessing steps in order.

First we tokenise on whitespace using the `.str.split()` method available for Pandas DataFrame columns.

In [81]:
tf["preprocessed"] = tf["text"].str.split(r"\s+")
tf["preprocessed"]

0       [TO, ALL, AMERICANS-, #HappyNewYear, &amp;, ma...
1       [RT, @DanScavino:, On, behalf, of, our, next, ...
2       [RT, @Reince:, Happy, New, Year, +, God's, ble...
3       [RT, @EricTrump:, 2016, was, such, an, incredi...
4       [RT, @DonaldJTrumpJr:, Happy, new, year, every...
                              ...                        
3861    [Today,, we, broke, ground, on, a, plant, that...
3862    [AMERICA, IS, OPEN, FOR, BUSINESS!, https://t....
3863    [Prior, to, departing, Wisconsin,, I, was, bri...
3864    [Before, going, any, further, today,, I, want,...
3865    [Six, months, after, our, TAX, CUTS,, more, th...
Name: preprocessed, Length: 3866, dtype: object

Next, we'll apply a set of formatting rules using the `.apply()` method, which "applies" a function to each cell in the "preprocessed" column.

In [82]:
## Drop urls
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if not re.search(r"https?\://.+", x)])
# Keep only tokens that start with a letter (optionally preceded by # or @)
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if re.search(r"^[\#@]?[A-z]", x)])
# Strip trailing punctuation
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r"[^A-z0-9]+$", '', x) for x in doc_tokens])
# Drop double quotes
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r'[“”"]', '', x) for x in doc_tokens])
# Reformat "curly" apostrophes and formatted dashes
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r"[‘’]", "'", x) for x in doc_tokens])
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r"[–—]", "-", x) for x in doc_tokens])
# Drop empty strings
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if x != ""])
tf["preprocessed"].head()

0    [TO, ALL, AMERICANS, #HappyNewYear, many, bles...
1    [RT, @DanScavino, On, behalf, of, our, next, #...
2    [RT, @Reince, Happy, New, Year, God's, blessin...
3    [RT, @EricTrump, was, such, an, incredible, ye...
4    [RT, @DonaldJTrumpJr, Happy, new, year, everyo...
Name: preprocessed, dtype: object
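
The same rules can also be bundled into one function that takes a list of tokens and returns the cleaned list, which makes them easier to test on a single document. The sketch below is one way to do that. The function name `clean_tokens` is introduced here, and it uses `[A-Za-z]` rather than `[A-z]`, so its results may differ slightly from the cell above for tokens containing characters such as `_` or `^`.

```python
import re

def clean_tokens(doc_tokens):
    """Apply the cleaning rules to one list of tokens, in the same order as above."""
    out = [x for x in doc_tokens if not re.search(r"https?://.+", x)]  # drop URLs
    out = [x for x in out if re.search(r"^[#@]?[A-Za-z]", x)]          # keep tokens starting with a letter (after optional # or @)
    out = [re.sub(r"[^A-Za-z0-9]+$", "", x) for x in out]              # strip trailing punctuation
    out = [re.sub(r'[“”"]', "", x) for x in out]                       # drop double quotes
    out = [re.sub(r"[‘’]", "'", x) for x in out]                       # straighten curly apostrophes
    out = [re.sub(r"[–—]", "-", x) for x in out]                       # replace formatted dashes
    return [x for x in out if x != ""]                                 # drop empty strings

sample = ["TO", "ALL", "AMERICANS-", "#HappyNewYear", "&amp;", "2017", "https://t.co/UaBFaoDYHe"]
print(clean_tokens(sample))
```

A function like this could then be passed directly to `.apply()`, for example `tf["preprocessed"].apply(clean_tokens)`, in place of the chain of lambdas.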

Now, we will make all the tokens lowercase (even though we know this is not always ideal, see above).

In [83]:
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x.lower() for x in doc_tokens])
tf["preprocessed"].head()

0    [to, all, americans, #happynewyear, many, bles...
1    [rt, @danscavino, on, behalf, of, our, next, #...
2    [rt, @reince, happy, new, year, god's, blessin...
3    [rt, @erictrump, was, such, an, incredible, ye...
4    [rt, @donaldjtrumpjr, happy, new, year, everyo...
Name: preprocessed, dtype: object

Now, we will remove stop words.

In [84]:
from nltk.corpus import stopwords
sw = [x.lower() for x in stopwords.words('english')]
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if x not in sw])
tf["preprocessed"].head()

0    [americans, #happynewyear, many, blessings, lo...
1    [rt, @danscavino, behalf, next, #potus, @teamt...
2    [rt, @reince, happy, new, year, god's, blessin...
3    [rt, @erictrump, incredible, year, entire, fam...
4    [rt, @donaldjtrumpjr, happy, new, year, everyo...
Name: preprocessed, dtype: object

Next, stem each token using the Snowball stemmer.

In [85]:
from nltk.stem import snowball
sstemmer = snowball.SnowballStemmer("english")
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [sstemmer.stem(x) if not x[0] in ["#", "@"] else x for x in doc_tokens])
tf["preprocessed"].head()

0    [american, #happynewyear, mani, bless, look, f...
1    [rt, @danscavino, behalf, next, #potus, @teamt...
2    [rt, @reince, happi, new, year, god, bless, lo...
3    [rt, @erictrump, incred, year, entir, famili, ...
4    [rt, @donaldjtrumpjr, happi, new, year, everyo...
Name: preprocessed, dtype: object

Finally, we will calculate token frequency for each list of preprocessed tokens using the `Counter` class, applying it to the column using `.map()`.

In [86]:
tf["term_freqs"] = tf["preprocessed"].map(Counter)
tf["term_freqs"].head()

0    {'american': 1, '#happynewyear': 1, 'mani': 1,...
1    {'rt': 1, '@danscavino': 1, 'behalf': 1, 'next...
2    {'rt': 1, '@reince': 1, 'happi': 1, 'new': 1, ...
3    {'rt': 1, '@erictrump': 1, 'incred': 1, 'year'...
4    {'rt': 1, '@donaldjtrumpjr': 1, 'happi': 1, 'n...
Name: term_freqs, dtype: object

## Collocations

Above we tokenised on whitespace, which gave us unigrams as our core feature of interest. Of course, we might want to define our features differently and use bigrams, trigrams or other n-grams. Why? We might care about word ordering in our analysis and n-grams give us a crude way to include word order information in our document features. First, let's see how we can extract all the ngrams of a certain size from a single list of tokens representing one document. In this case, we'll extract bigrams from the list of already-processed tokens for Trump's first tweet.

In [87]:
from nltk.util import ngrams
list(ngrams(tokens, 2))

[('american', '#happynewyear'),
 ('#happynewyear', 'mani'),
 ('mani', 'bless'),
 ('bless', 'look'),
 ('look', 'forward'),
 ('forward', 'wonder'),
 ('wonder', 'prosper'),
 ('prosper', 'work'),
 ('work', 'togeth'),
 ('togeth', '#maga')]
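
To see what is going on under the hood, n-grams can also be built with plain Python by zipping a list of tokens against shifted copies of itself. This is a sketch of the idea rather than how `nltk` implements it. The function name `make_ngrams` is introduced here for illustration.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest of the shifted lists, so no partial n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

example = ["make", "america", "great", "again"]
print(make_ngrams(example, 2))
print(make_ngrams(example, 3))
```

A list of four tokens yields three bigrams and two trigrams.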

A more common use case for ngrams is identifying common collocations that the analyst may want to treat as a single token. Let's extract the top bigrams from the entire corpus and list them below.

In [88]:
top_bigrams = tf["preprocessed"].apply(lambda x: list(ngrams(x,2))).to_list()
top_bigrams = Counter(b for lst in top_bigrams for b in lst)
top_bigrams = pd.DataFrame(top_bigrams.items(), columns=["bigram", "count"])
top_bigrams.sort_values("count", ascending=False)[0:20].reset_index(drop=True)

|  | bigram | count |
|---|---|---|
| 0 | (fake, news) | 213 |
| 1 | (tax, cut) | 130 |
| 2 | (north, korea) | 114 |
| 3 | (unit, state) | 101 |
| 4 | (make, america) | 99 |
| 5 | (america, great) | 92 |
| 6 | (white, hous) | 78 |
| 7 | (witch, hunt) | 75 |
| 8 | (look, forward) | 71 |
| 9 | (great, honor) | 69 |
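
Once you have decided which collocations to treat as single tokens, one simple way to apply that decision is to join each matching pair of adjacent tokens. The sketch below assumes an underscore as the joining character. The function `merge_collocations` and the set `chosen` are hypothetical names introduced for illustration.

```python
def merge_collocations(tokens, collocations):
    """Join adjacent token pairs listed in `collocations` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append("_".join(pair))
            i += 2          # skip both tokens of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

chosen = {("fake", "news"), ("north", "korea")}   # hand-picked from the table above
print(merge_collocations(["the", "fake", "news", "on", "north", "korea"], chosen))
```

The pairs are matched left to right, so overlapping collocations are resolved in favour of the one that starts first.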


## Making a document feature matrix

Now that we have counted the (preprocessed) tokens from each document, we can create a document feature matrix. One way to do this (which is quite easy) is to create a Pandas `DataFrame`:

In [89]:
# Create the DataFrame from the Counters
dfm = pd.DataFrame(tf["term_freqs"].to_list())
# Fill all missing values with zeroes and convert all data to integers
dfm = dfm.fillna(0).astype(int)
dfm.head()

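|  | american | #happynewyear | mani | bless | look | forward | wonder | prosper | work | togeth | ... | kremer | @amykremer | @sharkgregnorman | gig | mueller/comey | collusion)...and | groundbreak | discov | newsroom | conscienc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |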
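
The logic behind this table can be sketched without Pandas: collect the vocabulary, then write one row per document with a count for every vocabulary entry, using zero where a token does not occur. The example below uses three made-up documents. The names `term_freqs`, `vocab` and `rows` are introduced here for illustration.

```python
from collections import Counter

# Hypothetical term counts for three tiny documents
term_freqs = [Counter({"tax": 2, "cut": 1}), Counter({"job": 1}), Counter({"tax": 1, "job": 3})]

# Vocabulary: every distinct token, in order of first appearance
vocab = list(dict.fromkeys(tok for doc in term_freqs for tok in doc))

# One row per document, one column per vocabulary entry; absent tokens count as 0
rows = [[doc[tok] for tok in vocab] for doc in term_freqs]

print(vocab)
for row in rows:
    print(row)
```

Every document gets a value for every token in the vocabulary, which is why most cells end up as zeroes.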


Note that we do not have any way to know which row corresponds to which tweet. So, we will redefine the `DataFrame` index to be a tweet's date and time:

In [90]:
dfm.index = tf["datetime"]

We can now examine both the shape of the DFM and a subset of its rows and columns.

In [91]:
print(dfm.shape)
print(dfm.iloc[:6,:6])

(3866, 5949)
                           american  #happynewyear  mani  bless  look  forward
datetime                                                                      
2017-01-01 05:00:10+00:00         1              1     1      1     1        1
2017-01-01 05:39:13+00:00         0              1     0      0     0        0
2017-01-01 05:43:23+00:00         0              0     0      1     1        1
2017-01-01 05:44:17+00:00         0              0     0      0     0        0
2017-01-01 06:49:33+00:00         0              0     0      0     0        0
2017-01-01 06:49:49+00:00         0              0     0      0     0        0


Now that we've done all this work, we should save it! Again, try to avoid opening this CSV in Excel (or Numbers)!

In [92]:
dfm_file = os.path.join(sdir, "tweet-dfm.csv")
dfm.to_csv(dfm_file)

The approach above works fine for a relatively small corpus. As the corpus grows, however, representing document feature matrices as Pandas `DataFrame` objects becomes computationally inefficient. Document feature matrices are typically extremely wide (tens of thousands of columns) and mostly sparse (mostly zeroes). `DataFrame` objects are designed for heterogeneous, human-readable tables, not for large, sparse numeric matrices. As a result, they incur substantial memory and performance overhead at scale.

That said, using the process above on a relatively small corpus (like the Trump tweets) will yield a `DataFrame` object that is pretty big:

In [93]:
mem_used = dfm.memory_usage(deep=True).sum()
"DFM as Pandas DataFrame: " + str(round(mem_used / 1024**2, 2)) + " MB"

'DFM as Pandas DataFrame: 175.5 MB'

And the file size of the DFM we saved as a csv is also big:

In [94]:
"DFM as CSV file: " + str(round(os.path.getsize(dfm_file) / 1024**2, 2)) + " MB"

'DFM as CSV file: 44.01 MB'

## Vectorisers 

In many research applications with larger corpora, it is often a good idea to represent document feature matrices in more efficient ways. The core problem is that DFMs are usually sparse matrices (or arrays) "where only a few locations in the array have any data, most of the locations are considered as “empty”" ([Source](https://docs.scipy.org/doc/scipy/tutorial/sparse.html)). As you just saw, even this relatively small corpus created sparse DFMs that were very large (in memory and in storage).

A very common approach is to store DFMs as objects that are optimised for storing sparse arrays. [SciPy](https://scipy.org/) offers a set of tools in its `scipy.sparse` submodule, which are very popular for text. The `sklearn.feature_extraction` submodule offers two options for creating efficient sparse SciPy arrays from the kind of text data we use in this class: `DictVectorizer` and `CountVectorizer` (the latter in `sklearn.feature_extraction.text`).

In [95]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

For our purposes, `DictVectorizer` is quite convenient, as it allows us to take our column of `Counter` objects from above and create a sparse array easily as follows.

In [96]:
# Step 1: initialise a DictVectorizer object, which we'll call `dv`
dv = DictVectorizer()
# Step 2: create the array from our data
dfm_sparse = dv.fit_transform(tf["term_freqs"].to_list())

You can see below that we have now created a `csr_matrix` object, containing data from a DFM with 3866 rows and 5949 columns:

In [97]:
print(type(dfm_sparse))
print(dfm_sparse.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(3866, 5949)


You can also see that this object is _much_ smaller than the Pandas `DataFrame` we made above, even though it contains the same data!

In [98]:
mem_used = (dfm_sparse.data.nbytes + dfm_sparse.indices.nbytes + dfm_sparse.indptr.nbytes)
"DFM as csr_matrix: " + str(round(mem_used / 1024**2, 2)) + " MB"

'DFM as csr_matrix: 0.59 MB'

That said, it is more awkward to "see" the data as a `csr_matrix` object:

In [99]:
print(dfm_sparse)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 50274 stored elements and shape (3866, 5949)>
  Coords	Values
  (0, 96)	1.0
  (0, 151)	1.0
  (0, 883)	1.0
  (0, 1200)	1.0
  (0, 2533)	1.0
  (0, 3470)	1.0
  (0, 3544)	1.0
  (0, 4404)	1.0
  (0, 5424)	1.0
  (0, 5876)	1.0
  (0, 5881)	1.0
  (1, 96)	1.0
  (1, 209)	1.0
  (1, 381)	1.0
  (1, 660)	1.0
  (1, 881)	1.0
  (1, 1130)	1.0
  (1, 2943)	1.0
  (1, 3881)	1.0
  (1, 4732)	1.0
  (2, 590)	1.0
  (2, 595)	1.0
  (2, 881)	1.0
  (2, 1200)	1.0
  (2, 2533)	1.0
  :	:
  (3864, 2922)	1.0
  (3864, 3570)	1.0
  (3864, 3833)	1.0
  (3864, 3880)	1.0
  (3864, 4236)	1.0
  (3864, 4940)	1.0
  (3864, 4941)	1.0
  (3864, 5422)	1.0
  (3864, 5436)	1.0
  (3864, 5764)	1.0
  (3864, 5925)	1.0
  (3865, 261)	1.0
  (3865, 746)	1.0
  (3865, 1226)	1.0
  (3865, 1681)	1.0
  (3865, 1802)	1.0
  (3865, 3691)	1.0
  (3865, 3752)	1.0
  (3865, 4153)	1.0
  (3865, 4470)	1.0
  (3865, 4514)	1.0
  (3865, 4641)	1.0
  (3865, 4985)	1.0
  (3865, 5315)	1.0
  (3865, 5882)	1.0


This sparse array format stores each non-zero value from the DFM in a list of values, where each one is associated with a cell, whose index is shown under "Coords". For example, the first line above says that in a "regular" DFM, there would be a value of 1 in the cell of the matrix in row 0 and column 96. Row 0 corresponds to the first document in the DFM, but what is column 96? 

Unfortunately, the sparse array object does not store the feature names corresponding to the columns. But, the `DictVectorizer` object we created does:

In [100]:
vocabulary = dv.get_feature_names_out()
vocabulary

array(['#afghanistan', '#afghanstrategy', '#ahca', ..., 'zoo', 'zte',
       'zuker'], shape=(5949,), dtype=object)

Then, to find out which feature corresponds to column 96, we can look at:

In [101]:
vocabulary[96]

'#happynewyear'

So, the first document in the DFM (row 0) contains the token `#happynewyear` once. You can access the coordinates of the non-zero token counts in the DFM as follows:

In [102]:
dfm_sparse.nonzero()

(array([   0,    0,    0, ..., 3865, 3865, 3865],
       shape=(50274,), dtype=int32),
 array([  96,  151,  883, ..., 4985, 5315, 5882],
       shape=(50274,), dtype=int32))

And you can access the non-zero values as follows:

In [103]:
dfm_sparse.data

array([1., 1., 1., ..., 1., 1., 1.], shape=(50274,))
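
Under the hood, a `csr_matrix` keeps three arrays: `data` (the non-zero values), `indices` (the column index of each value) and `indptr` (where each row starts and ends within the other two). A small sketch with a made-up 3×4 matrix shows how they fit together; `toy` is a hypothetical name used only for illustration.

```python
import numpy as np
from scipy import sparse

# Made-up matrix: row 1 is entirely zero
toy = sparse.csr_matrix(np.array([
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]))

print(toy.data)     # non-zero values, read row by row
print(toy.indices)  # column index of each non-zero value
print(toy.indptr)   # row i occupies positions indptr[i]:indptr[i+1]

# Recover the non-zero (column, value) pairs of row 0 by hand
start, end = toy.indptr[0], toy.indptr[1]
row0 = list(zip(toy.indices[start:end].tolist(), toy.data[start:end].tolist()))
print(row0)
```

Because row 1 has no non-zero values, its start and end positions in `indptr` coincide, so it costs almost nothing to store.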

If you would like to see the non-zero values for the first document in the DFM (row 0), you can index the matrix by row:

In [104]:
print(dfm_sparse[0])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 11 stored elements and shape (1, 5949)>
  Coords	Values
  (0, 96)	1.0
  (0, 151)	1.0
  (0, 883)	1.0
  (0, 1200)	1.0
  (0, 2533)	1.0
  (0, 3470)	1.0
  (0, 3544)	1.0
  (0, 4404)	1.0
  (0, 5424)	1.0
  (0, 5876)	1.0
  (0, 5881)	1.0


We can also map that row's column indices back to feature names to show its token counts, which look correct!

In [105]:
[(x[0], int(x[1])) for x in zip(vocabulary[dfm_sparse[0].indices], dfm_sparse[0].data)]

[('#happynewyear', 1),
 ('#maga', 1),
 ('american', 1),
 ('bless', 1),
 ('forward', 1),
 ('look', 1),
 ('mani', 1),
 ('prosper', 1),
 ('togeth', 1),
 ('wonder', 1),
 ('work', 1)]
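
The same lookup can be wrapped in a small helper that works for any row. The sketch below uses a made-up matrix and vocabulary; `row_token_counts` is a hypothetical helper name, not a function from SciPy or scikit-learn.

```python
import numpy as np
from scipy import sparse

def row_token_counts(dfm, vocabulary, i):
    """Return (feature, count) pairs for the non-zero entries in row i of a CSR matrix."""
    row = dfm[i]  # a 1 x n_features sparse matrix
    return [(str(f), int(c)) for f, c in zip(vocabulary[row.indices], row.data)]

# Made-up vocabulary and DFM with two documents
toy_vocab = np.array(["bless", "forward", "look", "work"], dtype=object)
toy_dfm = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]))

print(row_token_counts(toy_dfm, toy_vocab, 0))
print(row_token_counts(toy_dfm, toy_vocab, 1))
```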

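What if you want to go back to a more standard tabular format, for example to save the DFM or to preview it more easily? You can convert the sparse matrix to a dense NumPy array with `.toarray()`, and then wrap that array in a Pandas `DataFrame`: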

In [106]:
dfm_sparse.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(3866, 5949))

In [107]:
pd.DataFrame(dfm_sparse.toarray(), columns=vocabulary)

Unnamed: 0,#afghanistan,#afghanstrategy,#ahca,#alconv2017,#amazonwashingtonpost,#america,#americafirst,#americafirst🇺🇸#unga,#americanheroes,#anthem,...,yrs,yulin,yuma,zero,zink,zito,zone,zoo,zte,zuker
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3864,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
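
Note that `toarray()` builds a dense array with every zero stored explicitly, so it gives up the memory savings of the sparse format. If you want labelled columns without densifying, one option is Pandas' sparse accessor. This is a sketch on made-up data, assuming a reasonably recent Pandas version that provides `pd.DataFrame.sparse.from_spmatrix`; the names `toy_vocab`, `toy_dfm` and `toy_df` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up vocabulary and DFM with two documents
toy_vocab = ["bless", "forward", "look", "work"]
toy_dfm = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]))

# Columns are stored as Pandas sparse arrays rather than dense ones
toy_df = pd.DataFrame.sparse.from_spmatrix(toy_dfm, columns=toy_vocab)
print(toy_df)
print(toy_df.sparse.density)  # share of cells that hold a stored (non-zero) value
```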


You can also save sparse matrices using the `save_npz()` function from the `scipy.sparse` submodule. Recall that the sparse matrix object does not contain the feature names. So we need to save two objects: the sparse matrix with the data, and a list of the feature names.

In [108]:
from scipy import sparse

sparse_dfm_file = os.path.join(sdir, 'tweet-dfm-sparse.npz')
sparse.save_npz(sparse_dfm_file, dfm_sparse)

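features_file = os.path.join(sdir, 'tweet-dfm-features.txt')
# Feature names include non-ASCII characters (e.g. emoji), so set the encoding explicitly
with open(features_file, 'w', encoding='utf-8') as f:
    f.write("\n".join(vocabulary))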
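
To read the DFM back in later, you would reverse both steps. Here is a minimal round-trip sketch on made-up data, writing to a temporary directory; the file names and the `toy_` and `loaded_` variables are placeholders.

```python
import os
import tempfile
import numpy as np
from scipy import sparse

# Made-up vocabulary and DFM with two documents
toy_vocab = np.array(["#maga", "bless", "work"], dtype=object)
toy_dfm = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp:
    dfm_path = os.path.join(tmp, "toy-dfm.npz")
    vocab_path = os.path.join(tmp, "toy-features.txt")

    # Save the matrix and the feature names separately
    sparse.save_npz(dfm_path, toy_dfm)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab))

    # Load them back: one feature name per line, in column order
    loaded_dfm = sparse.load_npz(dfm_path)
    with open(vocab_path, "r", encoding="utf-8") as f:
        loaded_vocab = np.array(f.read().split("\n"), dtype=object)

print((loaded_dfm != toy_dfm).nnz == 0)       # True if no cell differs
print(list(loaded_vocab) == list(toy_vocab))  # True if the feature names match
```

Storing one feature name per line only works if no feature name itself contains a newline character.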

In this course, you will need to work with DFMs stored both as Pandas `DataFrame` objects _and_ as sparse array objects. For the most part, you will be able to work with both data types using similar(ish) syntax. Moreover, if your computer can handle it and you do not like working with sparse array objects, you _can_ convert them to Pandas `DataFrame` objects using the process described above.  

### Text to sparse array

Above, we used our own preprocessed `Counter` objects to create a sparse array. This allowed us to have full control over the preprocessing steps. In general, you _should_ fully control the preprocessing steps. However, it is worth knowing that scikit-learn's `CountVectorizer` class can tokenise a collection of raw texts and build a sparse DFM in one step, using its own default preprocessing (lowercasing and a simple regular-expression tokeniser):

In [109]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(tf["text"].tolist())
print(vec.get_feature_names_out())
print(X)

['00' '000' '00am' ... '歴史的な日本訪問は' '間違いなく' '非常に重要な点で認識を一致させることができました']
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 87103 stored elements and shape (3866, 9226)>
  Coords	Values
  (0, 8168)	4
  (0, 697)	2
  (0, 744)	1
  (0, 3743)	1
  (0, 753)	2
  (0, 5068)	1
  (0, 1237)	1
  (0, 9119)	1
  (0, 4928)	1
  (0, 3338)	1
  (0, 8949)	1
  (0, 6421)	1
  (0, 121)	1
  (0, 915)	1
  (0, 8810)	1
  (0, 8956)	1
  (0, 8175)	1
  (0, 5027)	1
  (0, 3995)	1
  (0, 1758)	1
  (0, 8374)	1
  (1, 3743)	1
  (1, 753)	1
  (1, 3995)	3
  (1, 1758)	2
  :	:
  (3864, 1944)	1
  (3864, 4694)	1
  (3865, 3995)	1
  (3865, 1758)	1
  (3865, 5870)	1
  (3865, 766)	1
  (3865, 7982)	1
  (3865, 6015)	1
  (3865, 3780)	1
  (3865, 5367)	1
  (3865, 8060)	1
  (3865, 625)	1
  (3865, 5268)	1
  (3865, 2153)	1
  (3865, 5359)	1
  (3865, 8959)	1
  (3865, 6674)	1
  (3865, 1985)	1
  (3865, 7477)	1
  (3865, 530)	1
  (3865, 7984)	1
  (3865, 6595)	1
  (3865, 1