# Seminar Notebook 2.1: Making a DFM

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers creating a DFM using a corpus of news articles.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [23]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04")
os.makedirs(sdir, exist_ok=True)  # also creates missing parent folders

We will begin by loading a corpus of news articles published in the Guardian during 2016. This corpus was sourced from the `{quanteda}` R package, see: <https://tutorials.quanteda.io/machine-learning/topicmodel/>. You can get a `.csv` version of the file from the `data` repo in the course GitHub. The following code chunk will download this file.

In [24]:
rfile = 'https://raw.githubusercontent.com/lse-my459/data/refs/heads/main/corpus_guardian_2016.csv'
lfile = os.path.join(sdir, os.path.basename(rfile))
if not os.path.exists(lfile):
    import requests
    r = requests.get(rfile)
    r.raise_for_status()
    # "wb" opens the file for writing in binary mode, so the raw bytes are saved unchanged
    with open(lfile, "wb") as f:
        f.write(r.content)


### Loading and preprocessing the corpus

In the next cell, we load and clean the corpus.

In [25]:
import pandas as pd

tf = pd.read_csv(lfile, dtype="object")  # read every column as raw strings so we control type conversion ourselves
tf["datetime"] = tf["date"] + " " + tf["edition"]
tf["datetime"] = tf["datetime"].str.replace("GMT", "")
tf["datetime"] = pd.to_datetime(tf["datetime"])
tf = tf.loc[:,['datetime','texts']]
tf = tf.sort_values("datetime")  # sorting only works correctly because datetime is no longer a string
tf = tf.reset_index(drop=True)
tf.head()


Unnamed: 0,datetime,texts
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...


Next, we will work through standard pre-processing steps. Keep in mind that we are looking at a corpus of news articles, and that may affect how we pre-process. Let's start by tokenising on white space.

In [26]:
tf["preprocessed"] = tf["texts"].str.split(r"\s+")  # split on runs of whitespace
tf.head()

Unnamed: 0,datetime,texts,preprocessed
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...,"[A, second, man, has, died, as, a, result, of,..."
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...,"[As, tensions, continue, in, Chicago, over, th..."
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...,"[Academic, journals, have, begun, withholding,..."
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...,"[The, brutal, propaganda, video, released, by,..."
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...,"[A, cache, of, 13, weapons, has, been, discove..."
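
To see what whitespace tokenisation does and does not do, here is a minimal sketch on a made-up sentence (the sentence is illustrative and not taken from the corpus). It uses `re.split` from the standard library on a single string, as a stand-in for the pandas `.str.split` call above.

```python
import re

# A made-up sentence, used only for illustration
text = "A second man has died,  police said."

# Split on runs of one or more whitespace characters
tokens = re.split(r"\s+", text)
print(tokens)
```

Punctuation stays attached to the neighbouring word (`died,` and `said.`), which is why a separate cleaning step is needed later.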


Next, we make all words lowercase.

In [27]:
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x.lower() for x in doc_tokens])
tf.head()

Unnamed: 0,datetime,texts,preprocessed
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...,"[a, second, man, has, died, as, a, result, of,..."
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...,"[as, tensions, continue, in, chicago, over, th..."
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...,"[academic, journals, have, begun, withholding,..."
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...,"[the, brutal, propaganda, video, released, by,..."
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...,"[a, cache, of, 13, weapons, has, been, discove..."


Next, we will clean up the text by removing extraneous punctuation and dropping tokens we don't want. How exactly to do this involves a lot of analyst discretion.

In [28]:
import re

# Strip an opening bracket at the start and a closing bracket at the end of each token
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r"(^[\[\(\{]|[\]\)\}]$)","", x) for x in doc_tokens])
# Keep only tokens that begin with a letter
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if re.search(r"^[A-Za-z]", x)])
# Keep only tokens that have no numbers in them
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if not re.search(r"[0-9]", x)])
# Remove all other punctuation, except dashes and apostrophes
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [re.sub(r'[^A-Za-z\-\']', '', x) for x in doc_tokens])
# Drop strings with article publishing time info
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if not re.search(r"^(.*\-time|updated\-.*|gmt|bst)$", x)])
# Drop empty strings
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if x != ""])
tf.loc[:,["texts", "preprocessed"]].head()

Unnamed: 0,texts,preprocessed
0,A second man has died as a result of heavy flo...,"[a, second, man, has, died, as, a, result, of,..."
1,As tensions continue in Chicago over the handl...,"[as, tensions, continue, in, chicago, over, th..."
2,Academic journals have begun withholding the g...,"[academic, journals, have, begun, withholding,..."
3,The brutal propaganda video released by Islami...,"[the, brutal, propaganda, video, released, by,..."
4,A cache of 13 weapons has been discovered in t...,"[a, cache, of, weapons, has, been, discovered,..."
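
To make the effect of each cleaning rule easier to follow, the sketch below applies the same regular expressions, in the same order, to a small list of made-up lowercase tokens. The helper name `clean_tokens` and the toy tokens are assumptions for illustration only.

```python
import re

def clean_tokens(tokens):
    """Apply the cleaning rules from the cell above to one list of lowercase tokens."""
    # Strip a leading opening bracket and a trailing closing bracket
    tokens = [re.sub(r"(^[\[\(\{]|[\]\)\}]$)", "", x) for x in tokens]
    # Keep only tokens that begin with a letter
    tokens = [x for x in tokens if re.search(r"^[A-Za-z]", x)]
    # Keep only tokens that contain no digits
    tokens = [x for x in tokens if not re.search(r"[0-9]", x)]
    # Remove all other punctuation, except dashes and apostrophes
    tokens = [re.sub(r"[^A-Za-z\-']", "", x) for x in tokens]
    # Drop tokens that look like publishing-time metadata
    tokens = [x for x in tokens if not re.search(r"^(.*\-time|updated\-.*|gmt|bst)$", x)]
    # Drop empty strings
    return [x for x in tokens if x != ""]

# Made-up tokens chosen to trigger each rule
toy = ["(police", "said)", "13", "b2b", "u.s.", "don't", "well-known,",
       "last-modified-time", "gmt", "£5m"]
cleaned = clean_tokens(toy)
print(cleaned)
```

Note that `u.s.` becomes `us`, which then collides with the pronoun. This is one example of the discretion involved in choosing cleaning rules.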


We will remove English stop words from this corpus.

In [29]:
from nltk.corpus import stopwords

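# Requires a one-time nltk.download("stopwords") if the stop word list is not yet installed
sw = set(x.lower() for x in stopwords.words("english"))  # a set makes the membership test below fast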

tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [x for x in doc_tokens if x not in sw])
tf.head()

Unnamed: 0,datetime,texts,preprocessed
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...,"[second, man, died, result, heavy, flooding, s..."
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...,"[tensions, continue, chicago, handling, police..."
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...,"[academic, journals, begun, withholding, geogr..."
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...,"[brutal, propaganda, video, released, islamic,..."
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...,"[cache, weapons, discovered, possession, gunma..."


Then, we will use the Snowball stemmer to create equivalence classes of tokens.

In [31]:
from nltk.stem import snowball
sstemmer = snowball.SnowballStemmer("english")
tf["preprocessed"] = tf["preprocessed"].apply(lambda doc_tokens: [sstemmer.stem(x) for x in doc_tokens])
tf.head()

Unnamed: 0,datetime,texts,preprocessed
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...,"[second, man, die, result, heavi, flood, scotl..."
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...,"[tension, continu, chicago, handl, polic, shoo..."
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...,"[academ, journal, begun, withhold, geograph, l..."
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...,"[brutal, propaganda, video, relea, islam, stat..."
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...,"[cach, weapon, discov, possess, gunman, open, ..."


Finally, we can apply a `Counter` to the preprocessed tokens in `tf` to get token counts.

In [32]:
from collections import Counter
tf["term_freqs"] = tf["preprocessed"].map(Counter)
tf.head()

Unnamed: 0,datetime,texts,preprocessed,term_freqs
0,2016-01-01 17:24:00,A second man has died as a result of heavy flo...,"[second, man, die, result, heavi, flood, scotl...","{'second': 1, 'man': 5, 'die': 2, 'result': 1,..."
1,2016-01-01 18:37:00,As tensions continue in Chicago over the handl...,"[tension, continu, chicago, handl, polic, shoo...","{'tension': 1, 'continu': 1, 'chicago': 7, 'ha..."
2,2016-01-01 22:01:00,Academic journals have begun withholding the g...,"[academ, journal, begun, withhold, geograph, l...","{'academ': 2, 'journal': 4, 'begun': 1, 'withh..."
3,2016-01-04 00:50:00,The brutal propaganda video released by Islami...,"[brutal, propaganda, video, relea, islam, stat...","{'brutal': 1, 'propaganda': 7, 'video': 13, 'r..."
4,2016-01-04 10:59:00,A cache of 13 weapons has been discovered in t...,"[cach, weapon, discov, possess, gunman, open, ...","{'cach': 1, 'weapon': 4, 'discov': 1, 'possess..."
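
As a quick illustration of what `Counter` produces for a single document, here is a minimal sketch using a made-up token list (the tokens are illustrative only).

```python
from collections import Counter

# Made-up stemmed tokens for one document
tokens = ["flood", "scotland", "flood", "man", "flood"]

term_freqs = Counter(tokens)
print(term_freqs)

# Tokens that never occur have a count of zero rather than raising an error
print(term_freqs["flood"], term_freqs["chicago"])
```

A `Counter` is a subclass of `dict` that maps each token to its count, which makes it a convenient input for tools that expect one dictionary of feature values per document.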


### Creating a document feature matrix (DFM)

We will now create a document feature matrix (DFM). We will use tools in `sklearn` to create a DFM in a sparse matrix format.

In [37]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dfm = dv.fit_transform(tf["term_freqs"].to_list())
print(dfm)
dfm.shape 

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 552004 stored elements and shape (1959, 33975)>
  Coords	Values
  (0, 91)	2.0
  (0, 118)	2.0
  (0, 204)	1.0
  (0, 261)	2.0
  (0, 289)	1.0
  (0, 470)	4.0
  (0, 513)	1.0
  (0, 597)	2.0
  (0, 815)	3.0
  (0, 939)	1.0
  (0, 940)	1.0
  (0, 970)	1.0
  (0, 1020)	1.0
  (0, 1625)	5.0
  (0, 1956)	1.0
  (0, 2333)	1.0
  (0, 2343)	1.0
  (0, 2479)	3.0
  (0, 2565)	1.0
  (0, 2702)	1.0
  (0, 2759)	1.0
  (0, 3407)	2.0
  (0, 3535)	1.0
  (0, 3699)	1.0
  (0, 3844)	1.0
  :	:
  (1958, 30119)	1.0
  (1958, 30133)	3.0
  (1958, 30181)	1.0
  (1958, 30265)	1.0
  (1958, 30328)	1.0
  (1958, 30385)	4.0
  (1958, 30393)	1.0
  (1958, 30435)	1.0
  (1958, 30751)	1.0
  (1958, 31017)	1.0
  (1958, 31152)	1.0
  (1958, 31949)	2.0
  (1958, 31972)	2.0
  (1958, 32542)	1.0
  (1958, 32557)	1.0
  (1958, 32608)	1.0
  (1958, 32703)	1.0
  (1958, 32752)	3.0
  (1958, 32812)	3.0
  (1958, 32850)	2.0
  (1958, 33008)	1.0
  (1958, 33393)	9.0
  (1958, 33451)	1.0
  (1958, 33459)	3.0
 

(1959, 33975)

We will also be sure to extract and preserve the vocabulary from the `DictVectorizer` object.

In [41]:
vocabulary = dv.get_feature_names_out()
vocabulary[33459]

'would'

### Trimming the DFM

Notice that the DFM has nearly 34,000 features (33,975). This is a _very_ sparse DFM. We will want to reduce its size by "trimming" it to remove features (i.e., weight them by zero). We will use the recommendation in [this tutorial](https://tutorials.quanteda.io/machine-learning/topicmodel/) and only keep features that are both (1) in the top 20% of total term frequency, and (2) used in no more than 10% of documents. The intuition is similar to tf-idf weighting: we favour terms that are frequent overall but concentrated in relatively few documents.

In [None]:
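import numpy as np

# Total term frequency per feature; .A1 flattens the 1 x n matrix returned by .sum() into a 1-D array
ttf = dfm.sum(axis=0).A1
# Document frequency: the number of documents in which each feature appears
docf = (dfm > 0).sum(axis=0).A1

ttf_cutoff = np.quantile(ttf, 0.80)
docf_cutoff = dfm.shape[0] * 0.1

# Boolean mask over columns: keep features that satisfy both criteria
keep = (ttf >= ttf_cutoff) & (docf <= docf_cutoff)
dfm = dfm[:, keep]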
dfm.shape

(1959, 6202)

Warning: while we have trimmed the DFM, we have _not_ yet trimmed the corresponding `vocabulary` object, which we do next.
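
One way to keep the vocabulary aligned with the trimmed DFM is to store the boolean column mask in a variable and apply it to both objects. The sketch below shows this on made-up data. The names `keep`, `dfm_toy` and `vocab_toy` are illustrative, and the looser cutoffs (median term frequency, at most 50% of documents) are chosen only so that the toy result is not empty.

```python
import numpy as np
from scipy import sparse

# Toy DFM: 4 documents x 4 features (made-up counts)
vocab_toy = np.array(["said", "flood", "chicago", "polic"])
dfm_toy = sparse.csr_matrix(np.array([
    [2, 0, 1, 0],
    [1, 0, 0, 3],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float))

# Total term frequency and document frequency per feature, as 1-D arrays
ttf = np.asarray(dfm_toy.sum(axis=0)).ravel()
docf = np.asarray((dfm_toy > 0).sum(axis=0)).ravel()

# One boolean mask over columns, reused for the matrix and the vocabulary
keep = (ttf >= np.quantile(ttf, 0.5)) & (docf <= dfm_toy.shape[0] * 0.5)

dfm_trimmed = dfm_toy[:, keep]
vocab_trimmed = vocab_toy[keep]

print(dfm_trimmed.shape, vocab_trimmed)
```

Here `said` is dropped because it appears in every document, `flood` and `chicago` are dropped because they are too rare, and only `polic` survives. Because the same mask indexes both objects, column `j` of the trimmed matrix still corresponds to entry `j` of the trimmed vocabulary.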

Let's quickly look at the top 20 features in this DFM to get a sense of how words are used in this corpus.
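
A minimal sketch of how the top features could be listed, assuming a 1-D array of total term frequencies that is aligned with the vocabulary. The toy arrays and the names `vocab_toy`, `ttf_toy` and `k` are made up for illustration; for the real DFM, `k` would be 20.

```python
import numpy as np

# Made-up vocabulary and matching total term frequencies
vocab_toy = np.array(["chicago", "flood", "polic", "scotland"])
ttf_toy = np.array([1.0, 5.0, 3.0, 1.0])

k = 2  # number of top features to show

# Indices of the k largest totals, in descending order (stable sort keeps ties in vocabulary order)
order = np.argsort(-ttf_toy, kind="stable")[:k]
top = [(str(vocab_toy[i]), float(ttf_toy[i])) for i in order]
print(top)
```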

Now, let's create a wordcloud object.

Finally, we'll plot the wordcloud object using `matplotlib`.

Now that we've created this DFM, let's save it for further use. We'll save the corpus file as well, so that we can access the texts if and when needed.
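
One possible way to store a sparse DFM and its vocabulary is sketched below, using `scipy.sparse.save_npz` for the matrix and a plain text file for the vocabulary. The file names, the temporary folder and the toy objects are assumptions for illustration and are not necessarily what this course uses.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Made-up DFM and vocabulary standing in for the real objects
dfm_toy = sparse.csr_matrix(np.array([[0.0, 2.0], [1.0, 0.0]]))
vocab_toy = np.array(["flood", "polic"])

# Hypothetical output location; in the notebook this would be a folder such as sdir
out_dir = tempfile.mkdtemp()
dfm_path = os.path.join(out_dir, "dfm_toy.npz")
vocab_path = os.path.join(out_dir, "vocab_toy.txt")

# Save the sparse matrix in compressed .npz format and the vocabulary one term per line
sparse.save_npz(dfm_path, dfm_toy)
with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab_toy))

# Load both back to check the round trip
dfm_back = sparse.load_npz(dfm_path)
with open(vocab_path, encoding="utf-8") as f:
    vocab_back = f.read().split("\n")

print(dfm_back.shape, vocab_back)
```

Saving the vocabulary alongside the matrix matters because the sparse file stores only numbers, and the column order is meaningless without the matching feature names.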