---
title: "Replication of Chapter 5 of _Quantitative Social Science: An Introduction_"
author: Stefan Müller and Kenneth Benoit
output:
  html_document:
    toc: true
---

```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = FALSE,
                      comment = "##")
```

```{r, message = FALSE}
library("quanteda")
```

In this vignette we show how the **quanteda** package can be used to replicate the text analysis part (Chapter 5.1) from Kosuke Imai's book [*Quantitative Social Science: An Introduction*](http://qss.princeton.press) (Princeton: Princeton University Press, 2017).

## Download the Corpus

To get the textual data, you first need to install the **qss** package that accompanies the book.

```{r eval = FALSE}
devtools::install_github("kosukeimai/qss-package", build_vignettes = TRUE)
```
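
As an optional check of our own, you can verify that the bundled Federalist texts can be located before importing them; the `extdata/federalist` path is the one used in the import step below.

```{r eval = FALSE}
# list a few of the plain-text Federalist Papers bundled with the qss package;
# an empty result means the package (or its extdata directory) was not found
head(list.files(system.file("extdata/federalist", package = "qss")))
```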

## Section 5.1.1: The Disputed Authorship of ‘The Federalist Papers’

First, we use the **readtext** package to import the Federalist Papers as a data frame and create a **quanteda** corpus.

```{r}
library("readtext")
# use readtext package to import all documents as a data frame
corpus_texts <- readtext(system.file("extdata/federalist/", package = "qss"))
# create docvar with number of paper
corpus_texts$paper_number <- paste("No.", seq_len(nrow(corpus_texts)), sep = " ")
# transform to a quanteda corpus object
corpus_raw <- corpus(corpus_texts, text_field = "text", docid_field = "paper_number")
# create docvar with the numeric paper number (used in Section 5.1.4)
docvars(corpus_raw, "paper_numeric") <- seq_len(ndoc(corpus_raw))
# create docvar with authorship (used in Section 5.1.4)
docvars(corpus_raw, "author") <- factor(NA, levels = c("madison", "hamilton"))
docvars(corpus_raw, "author")[c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)] <- "hamilton"
docvars(corpus_raw, "author")[c(10, 14, 37:48, 58)] <- "madison"
```
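
As a quick sanity check of our own (not part of the book's code), we can tabulate the new authorship docvar to see how many papers are attributed to each author and how many remain unassigned:

```{r eval = FALSE}
# count papers per assigned author, keeping the unassigned (NA) papers visible
table(docvars(corpus_raw, "author"), useNA = "ifany")
```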

```{r}
# inspect the first 240 characters of Paper No. 10
texts(corpus_raw)[10] %>%
    stringi::stri_sub(1, 240) %>%
    cat()
```
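
For a broader overview of the corpus, `summary()` reports the number of types, tokens, and sentences per paper along with the docvars; this is an optional aside of our own, not part of the book's replication:

```{r eval = FALSE}
# summarise the first five Federalist Papers in the corpus
summary(corpus_raw, n = 5)
```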

## Section 5.1.2: Document-Term Matrix

Next, we transform the corpus to a document-feature matrix. `dfm_prep` (used in Sections 5.1.4 and 5.1.5) is a dfm in which numbers and punctuation have been removed, and in which terms have been converted to lowercase. In `dfm_papers`, the words have also been stemmed and a standard set of stopwords removed.

```{r}
# transform corpus to a document-feature matrix
dfm_prep <- tokens(corpus_raw, remove_numbers = TRUE, remove_punct = TRUE) %>%
    dfm(tolower = TRUE)

# remove stop words and stem words
dfm_papers <- dfm_prep %>%
    dfm_remove(stopwords("en")) %>%
    dfm_wordstem("en")

# inspect
dfm_papers

# sort into alphabetical order of features, to match the book example
dfm_papers <- dfm_papers[, order(featnames(dfm_papers))]

# inspect some documents in the dfm
head(dfm_papers, nf = 8)
```

The **tm** package considers features such as "1st" to be numbers, whereas **quanteda** does not. We can remove these easily using a regular expression:

```{r}
dfm_papers <- dfm_remove(dfm_papers, "[0-9]", valuetype = "regex", verbose = TRUE)
head(dfm_papers, nf = 8)
```
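
To confirm that no digit-containing features remain after this step, one can run the following optional check (our own addition):

```{r eval = FALSE}
# TRUE would indicate that some features still contain a digit
any(grepl("[0-9]", featnames(dfm_papers)))
```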

## Section 5.1.3: Topic Discovery

We can use the `textplot_wordcloud()` function to plot word clouds of the most frequent words in Papers 12 and 24.

```{r warning=FALSE, fig.width = 8, fig.height = 8}
set.seed(100)
library("quanteda.textplots")
textplot_wordcloud(dfm_papers[c("No. 12", "No. 24"), ],
                   max.words = 50, comparison = TRUE)
```

Since **quanteda** cannot do stem completion, we will skip that part.
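
If an approximation is nevertheless desired, one can build a simple stem completion table by hand: for each stem, look up the most frequent unstemmed term in `dfm_prep` that maps to it. This is a minimal sketch of our own, assuming the **SnowballC** package (the stemmer underlying `dfm_wordstem()`) is installed:

```{r eval = FALSE}
# a hand-rolled approximation of stem completion: map each stem to the most
# frequent unstemmed term in dfm_prep that produces that stem
library("SnowballC")
terms <- featnames(dfm_prep)                  # unstemmed, lowercased terms
freqs <- colSums(dfm_prep)                    # total frequency of each term
stems <- wordStem(terms, language = "english")
completions <- tapply(seq_along(terms), stems,
                      function(i) terms[i][which.max(freqs[i])])
head(completions)
```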

Next, we identify clusters of similar essays based on term frequency-inverse document frequency (*tf-idf*) and apply the $k$-means algorithm to the weighted dfm.

```{r}
# tf-idf calculation
dfm_papers_tfidf <- dfm_tfidf(dfm_papers, base = 2)

# 10 most important words for Paper No. 12
topfeatures(dfm_papers_tfidf[12, ], n = 10)

# 10 most important words for Paper No. 24
topfeatures(dfm_papers_tfidf[24, ], n = 10)
```
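
Under `dfm_tfidf()`'s default weighting schemes, each cell is (if we read the defaults correctly) the raw term count multiplied by the logarithm, here base 2, of the number of documents divided by the feature's document frequency. A small hedged check of our own, using an arbitrary feature:

```{r eval = FALSE}
# recompute the tf-idf weight of one cell by hand and compare with dfm_tfidf();
# the feature chosen here is arbitrary
feat <- featnames(dfm_papers_tfidf)[1]
manual <- as.numeric(dfm_papers[12, feat]) *
    log2(ndoc(dfm_papers) / docfreq(dfm_papers)[feat])
c(manual = manual, quanteda = as.numeric(dfm_papers_tfidf[12, feat]))
```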

We can replicate the book's clustering of Hamilton's essays with the $k$-means algorithm as follows:

```{r}
k <- 4  # number of clusters

# subset the Federalist Papers written by Hamilton
dfm_papers_tfidf_hamilton <- dfm_subset(dfm_papers_tfidf, author == "hamilton")

# run k-means
km_out <- stats::kmeans(dfm_papers_tfidf_hamilton, centers = k)
km_out$iter  # check the convergence; number of iterations may vary
colnames(km_out$centers) <- featnames(dfm_papers_tfidf_hamilton)

for (i in 1:k) {  # loop for each cluster
    cat("CLUSTER", i, "\n")
    cat("Top 10 words:\n")  # 10 most important terms at the centroid
    print(head(sort(km_out$centers[i, ], decreasing = TRUE), n = 10))
    cat("\n")
    cat("Federalist Papers classified: \n")  # extract essays classified
    print(docnames(dfm_papers_tfidf_hamilton)[km_out$cluster == i])
    cat("\n")
}
```
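
Because $k$-means starts from randomly chosen centroids, the cluster assignments and the iteration count reported above can change between runs. One optional way to make the result reproducible, sketched here as our own suggestion rather than part of the book's code, is to fix the random seed (and use several random starts) immediately before the call:

```{r eval = FALSE}
# fix the RNG state right before k-means so that the clustering is reproducible;
# nstart = 10 restarts from ten random centroid sets and keeps the best solution
set.seed(1234)
km_out <- stats::kmeans(dfm_papers_tfidf_hamilton, centers = k, nstart = 10)
table(km_out$cluster)
```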

## Section 5.1.4: Authorship Prediction

Next, we want to predict the authorship of the Federalist Papers whose author is unknown. Because the topics of the Papers differ considerably, Imai focuses on 10 articles, prepositions, and conjunctions to predict authorship.

```{r}
# term frequency per 1000 words
tfm <- dfm_weight(dfm_prep, "prop") * 1000

# select words of interest
words <- c("although", "always", "commonly", "consequently",
           "considerable", "enough", "there", "upon", "while", "whilst")
tfm <- dfm_select(tfm, words, valuetype = "fixed")

# average among Hamilton/Madison essays
tfm_ave <- dfm_group(dfm_subset(tfm, !is.na(author)), groups = author) /
    as.numeric(table(docvars(tfm, "author")))

# bind docvars from the corpus and tfm into a data frame
author_data <- data.frame(docvars(corpus_raw), convert(tfm, to = "data.frame"))

# create a numeric variable that takes the value 1 for Hamilton's essays,
# -1 for Madison's essays, and NA for the essays with unknown authorship
author_data$author_numeric <- ifelse(author_data$author == "hamilton", 1,
                                     ifelse(author_data$author == "madison", -1, NA))

# use only the known authors for the training set
author_data_known <- na.omit(author_data)

hm_fit <- lm(author_numeric ~ upon + there + consequently + whilst,
             data = author_data_known)
hm_fit
hm_fitted <- fitted(hm_fit)  # fitted values
sd(hm_fitted)
```
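
Since Hamilton's essays are coded 1 and Madison's -1, a positive fitted value points toward Hamilton and a negative one toward Madison. As a hedged preview of the evaluation in the next section (our own addition), we can cross-tabulate the sign of the fitted values against the known authors:

```{r eval = FALSE}
# cross-tabulate the sign of the fitted value (+1 = Hamilton, -1 = Madison)
# against the known authorship of the training essays
table(sign(hm_fitted), author_data_known$author)
```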

## Section 5.1.5: Cross-Validation

We now assess how well the model fits the data by classifying each essay based on its fitted value.

```{r}
# proportion of correctly classified essays by Hamilton
mean(hm_fitted[author_data_known$author == "hamilton"] > 0)

# proportion of correctly classified essays by Madison
mean(hm_fitted[author_data_known$author == "madison"] < 0)

n <- nrow(author_data_known)
hm_classify <- rep(NA, n)  # a container vector with missing values

for (i in 1:n) {
    # fit the model to the data after removing the ith observation
    sub_fit <- lm(author_numeric ~ upon + there + consequently + whilst,
                  data = author_data_known[-i, ])  # exclude ith row
    # predict the authorship for the ith observation
    hm_classify[i] <- predict(sub_fit, newdata = author_data_known[i, ])
}

# proportion of correctly classified essays by Hamilton
mean(hm_classify[author_data_known$author == "hamilton"] > 0)

# proportion of correctly classified essays by Madison
mean(hm_classify[author_data_known$author == "madison"] < 0)

disputed <- c(49, 50:57, 62, 63)  # 11 essays with disputed authorship
tf_disputed <- dfm_subset(tfm, is.na(author)) %>%
    convert(to = "data.frame")

author_data$prediction <- predict(hm_fit, newdata = author_data)
author_data$prediction  # predicted values
```
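
For the 11 disputed essays specifically, the predicted values can be pulled out directly, assuming (as elsewhere in this vignette) that the rows of `author_data` are ordered by paper number; negative values point toward Madison and positive values toward Hamilton:

```{r eval = FALSE}
# predicted values for the 11 essays with disputed authorship
author_data$prediction[disputed]
```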

Finally, we plot the predicted values for each Federalist Paper with the **ggplot2** package.

```{r, fig.width = 8, fig.height = 6}
author_data$author_plot <- ifelse(is.na(author_data$author), "unknown",
                                  as.character(author_data$author))

library("ggplot2")
ggplot(data = author_data, aes(x = paper_numeric,
                               y = prediction,
                               shape = author_plot,
                               colour = author_plot)) +
    geom_point(size = 2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    labs(x = "Federalist Papers", y = "Predicted values") +
    theme_minimal() +
    theme(legend.title = element_blank())
```