title: Construct a FCM
weight: 10
draft: false

A feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. It is a special object in quanteda, but behaves similarly to a DFM.
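To make the idea of a co-occurrence count concrete, here is a minimal sketch on two made-up documents. The object names (toks_toy, fcmat_toy) and the texts are illustrative assumptions, not part of the tutorial data. With the default document context, "a" and "b" appear together in both documents, so their cell holds 2.

```r
library(quanteda)

# Two tiny made-up documents (illustrative only)
toks_toy <- tokens(c(d1 = "a b c", d2 = "a b d"))

# By default, fcm() counts co-occurrences within each document
fcmat_toy <- fcm(toks_toy, context = "document")

# Convert to a base matrix to inspect single cells
m_toy <- as.matrix(fcmat_toy)
m_toy["a", "b"]  # "a" and "b" share both documents
m_toy["c", "d"]  # "c" and "d" never share a document
```

By default only the upper triangle is stored, so each pair is looked up with the earlier feature as the row.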

require(quanteda)
require(quanteda.corpora)
corp_news <- download('data_corpus_guardian')

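When a corpus is large, you have to select features of a DFM before constructing a FCM, because a FCM has one row and one column per feature, so its number of cells grows with the square of the number of features.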

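# dfm() on a corpus is deprecated in quanteda >= 3, so tokenize first
toks_news <- tokens(corp_news, remove_punct = TRUE)
dfmat_news <- dfm_remove(dfm(toks_news), pattern = stopwords('en'))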
dfmat_news <- dfm_remove(dfmat_news, pattern = c('*-time', 'updated-*', 'gmt', 'bst'))
dfmat_news <- dfm_trim(dfmat_news, min_termfreq = 100)

topfeatures(dfmat_news)
##       said     people        one        new       also         us 
##      28413      11169       9884       8024       7901       7091 
##        can government       year       last 
##       6972       6821       6570       6335
nfeat(dfmat_news)
## [1] 4209
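The effect of trimming on the size of the resulting FCM can be sketched on a toy example. The texts and object names below are made up for illustration. Keeping only features that occur at least twice reduces four features to two, and the FCM shrinks from a potential 4 x 4 to 2 x 2.

```r
library(quanteda)

# Made-up documents: "a" occurs 3 times, "b" twice, "c" and "d" once
toks_toy <- tokens(c(d1 = "a a b c", d2 = "a b d"))
dfmat_toy <- dfm(toks_toy)

# Keep only features with a total frequency of at least 2
dfmat_small <- dfm_trim(dfmat_toy, min_termfreq = 2)

# The FCM has one row and one column per remaining feature
fcmat_small <- fcm(dfmat_small)

nfeat(dfmat_toy)
nfeat(dfmat_small)
dim(fcmat_small)
```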

You can construct a FCM from a DFM or a tokens object using fcm(). topfeatures() returns the most frequently co-occurring words.

fcmat_news <- fcm(dfmat_news)
dim(fcmat_news)
## [1] 4209 4209
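When fcm() is given a tokens object instead of a DFM, word order is still available, so co-occurrence can be restricted to a window of neighboring tokens rather than the whole document. A minimal sketch, assuming a made-up one-line text (toks_win and fcmat_win are illustrative names): with a window of 1, only adjacent tokens count as co-occurring.

```r
library(quanteda)

# A single made-up document (illustrative only)
toks_win <- tokens("a b c d")

# Count co-occurrences only between tokens at most 1 position apart
fcmat_win <- fcm(toks_win, context = "window", window = 1)

m_win <- as.matrix(fcmat_win)
m_win["a", "b"]  # adjacent tokens
m_win["a", "c"]  # two positions apart, outside the window
```

A wider window moves the counts toward the document-level result, which is what fcm() computes from a DFM.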

You can select features of a FCM using fcm_select().

feat <- names(topfeatures(fcmat_news, 50))
fcmat_news_select <- fcm_select(fcmat_news, pattern = feat)
dim(fcmat_news_select)
## [1] 50 50

A FCM can be used to train word embedding models with the text2vec package, or to visualize a semantic network with textplot_network().

size <- log(colSums(dfm_select(dfmat_news, feat)))
set.seed(144)
textplot_network(fcmat_news_select, min_freq = 0.8, vertex_size = size / max(size) * 3)