Commit: Updated handouts

vanatteveldt committed Jun 1, 2016
1 parent 478646e commit 69afdc6
Showing 13 changed files with 259 additions and 128 deletions.
3 changes: 3 additions & 0 deletions 2_playing.Rmd
@@ -1,3 +1,6 @@
---
output: pdf_document
---
```{r, echo=FALSE}
cat(paste("(C) (cc by-sa) Wouter van Atteveldt, file generated", format(Sys.Date(), format="%B %d %Y")))
```
70 changes: 61 additions & 9 deletions 2_playing.html

Large diffs are not rendered by default.

Binary file added 2_playing.pdf
Binary file not shown.
22 changes: 0 additions & 22 deletions 3_organizing.Rmd
@@ -189,25 +189,3 @@ head(subset, n=10)
```


Good practice: self-contained scripts
====

Using R is programming, and one of the most important parts of programming is managing your source code.
An important thing to realize is that your code will be written only once, but read many times over.
Spending twice as much time to make the code well organized and readable might feel like a waste of time,
but you (or your colleagues and students) will be very grateful when reading it again.
Especially since research code is often left untouched for months until it is time to revise an article,
it is very important to make sure that you (and ideally the readers and reviewers of the article) can understand the code.

There are no simple rules for writing readable code, and what is readable to one person can be quite cryptic to another.
Nevertheless, here are four tips that I can offer and that I expect you to incorporate in your assignments:

1. Use descriptive variable names. Use `income` (or better: `income.top.percent`) rather than `i`.
2. Use comments where needed, especially to explain decisions, assumptions, and possible problems.
In R, everything on a line after a `#` is a comment, i.e. it is completely skipped by R.
3. Often, when doing an analysis you are not quite sure where you will end up, so you write a lot of code that turns out not to be needed. When your analysis is done, take a moment to reorganize the code, remove redundancies, et cetera. It is often best to start a new file and copy-paste the relevant bits (adding comments where needed). Assume that your code will be reviewed, even if it is not, because you are sure to read it again later and wonder why and how you did certain things.
4. Finally, try to write what I call 'self-contained scripts' (see the sketch below). The script should start with some kind of data gathering command such as `download.file` or `read.csv`, and end with your analyses. You should be able to clear your environment, run the code from top to bottom, and arrive at the same results. In fact, when cleaning up my code I often do just that: clean up part of the code, clear everything, re-run, and check the results. This is also important for reproducibility, as being able to run the whole code and get the same results is the only guarantee that the code in fact produced those results.

We will come across some tools that make these things easier, such as defining your own functions and working with knitr, but the most important thing is to accept that your code is part of your product and to take the time to polish it a bit.
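
For example, a self-contained script could look something like this (a minimal sketch; the URL and column names are hypothetical):

```{r, eval=F}
# Data gathering: everything the script needs is loaded at the top
# (the URL and column names here are hypothetical)
income = read.csv("http://example.com/income.csv")

# Analysis: derived variables get descriptive names
income.top.percent = income$top.income / income$total.income * 100
summary(income.top.percent)
hist(income.top.percent)
```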


Binary file modified 3_organizing.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion amcat.Rmd
@@ -119,7 +119,7 @@ head(tokens)
So you can see this lemmatizes words (i.e. reduces them to their dictionary form) and gives their part of speech (noun, verb, etc.).
Let's plot only the names:

```{r, message=F}
```{r, warning=F, message=F}
subset = tokens[tokens$pos == "name", ]
dtm = dtm.create(subset$aid, subset$lemma)
dtm.wordcloud(dtm, nterms = 200)
Binary file modified amcat.pdf
Binary file not shown.
88 changes: 88 additions & 0 deletions comparing.Rmd
@@ -0,0 +1,88 @@
---
title: "Comparing corpora"
author: "Wouter van Atteveldt"
date: "June 1, 2016"
output: pdf_document
---

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

Comparing corpora
----

Another useful thing we can do is comparing two corpora:
Which words or names are mentioned more in e.g. Bush's speeches than in Obama's?

This uses functions from the `corpustools` package, which you can install directly from github
(you only need to do this once per computer):

```{r, eval=F}
install.packages("devtools")
devtools::install_github("kasperwelbers/corpus-tools")
```

For this handout, we will use the State of the Union speeches contained in the `corpustools` package,
and create a document term matrix (DTM) from all names and nouns in the speeches by Bush and Obama:

```{r, message=F}
library(corpustools)
data(sotu)
dtm = with(subset(sotu.tokens, pos1 %in% c("M", "N")),
           dtm.create(documents=aid, terms=lemma))
```

Now, we can create separate DTMs for Bush and Obama.
To do this, we split the DTM: we select document ids using the `headline` column in the metadata from `sotu.meta`, and then use the `dtm.filter` function:


```{r}
head(sotu.meta)
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
```

So how can we check which words are more frequent in Bush's speeches than in Obama's?
The function `corpora.compare` provides this functionality, given two document-term matrices:

```{r}
cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
```

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush).
Also, it gives the relative frequency in these corpora (normalizing for total corpus size)
and the overrepresentation in the 'x' corpus and the chi-squared value for that overrepresentation.
So, Bush used the word 'terrorist' 105 times, while Obama used it only 13 times; in relative terms, Bush used it about four times as often, which is highly significant.
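
To see where the chi-squared value comes from, here is a rough manual check for a single term (a sketch, not necessarily the package's exact computation; the count column names `termfreq.x` and `termfreq.y` are assumptions based on the description above):

```{r, eval=F}
# 2x2 table: this term vs. all other terms, in the 'x' (Obama) and 'y' (Bush) corpora
n.x = sum(cmp$termfreq.x)
n.y = sum(cmp$termfreq.y)
t = cmp[cmp$term == "terrorist", ]
tab = matrix(c(t$termfreq.x, n.x - t$termfreq.x,
               t$termfreq.y, n.y - t$termfreq.y), nrow=2)
# without continuity correction; may still differ slightly from t$chi
# depending on any smoothing the package applies
chisq.test(tab, correct=F)$statistic
```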

Which words did Obama use most compared to Bush?

```{r}
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
```

So, while Bush talks about freedom, war, and terror, Obama talks more about industry, banks and education.

Let's make a word cloud of Obama's words, with size indicating the chi-squared overrepresentation:

```{r, warning=F}
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
```

And Bush:

```{r, warning=F}
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
```

Note that the warnings given by these commands are relatively harmless: they mean that some words were skipped because no good place could be found for them in the word cloud.
Binary file added comparing.pdf
Binary file not shown.
64 changes: 4 additions & 60 deletions corpus.Rmd
@@ -1,6 +1,6 @@
---
title: "Corpus analysis: the document-term matrix"
output: html_document
title: 'Corpus analysis: the document-term matrix'
output: pdf_document
---

=========================================
@@ -15,7 +15,8 @@ In R, these matrices are provided by the `tm` (text mining) package.
Although this package provides many functions for loading and manipulating these matrices,
using them directly is relatively complicated.

Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a data frame. To create a term document matrix from a simple data frame with a 'text' column, use the `create_matrix` function (with removeStopwords=F to make sure all words are kept):
Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a data frame.
To create a document-term matrix from a simple data frame with a 'text' column, use the `create_matrix` function (with `removeStopwords=F` to make sure all words are kept):

```{r,message=F}
library(RTextTools)
@@ -205,60 +206,3 @@ to visualize the top words as a word cloud:
```{r, warning=F}
dtm.wordcloud(dtm_filtered)
```

Comparing corpora
----

Another useful thing we can do is comparing two corpora:
Which words or names are mentioned more in e.g. Bush's speeches than in Obama's?

To do this, we split the DTM into separate DTMs for Bush and Obama.
For this, we select document ids using the `headline` column in the metadata from `sotu.meta`, and then use the `dtm.filter` function:


```{r}
head(sotu.meta)
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
```

So how can we check which words are more frequent in Bush's speeches than in Obama's?
The function `corpora.compare` provides this functionality, given two document-term matrices:

```{r}
cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
```

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush).
Also, it gives the relative frequency in these corpora (normalizing for total corpus size)
and the overrepresentation in the 'x' corpus and the chi-squared value for that overrepresentation.
So, Bush used the word 'terrorist' 105 times, while Obama used it only 13 times; in relative terms, Bush used it about four times as often, which is highly significant.

Which words did Obama use most compared to Bush?

```{r}
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
```

So, while Bush talks about freedom, war, and terror, Obama talks more about industry, banks and education.

Let's make a word cloud of Obama's words, with size indicating the chi-squared overrepresentation:

```{r, warning=F}
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
```

And Bush:

```{r, warning=F}
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
```

Note that the warnings given by these commands are relatively harmless: they mean that some words were skipped because no good place could be found for them in the word cloud.
120 changes: 90 additions & 30 deletions lda.html

Large diffs are not rendered by default.

18 changes: 12 additions & 6 deletions twitter_facebook.Rmd
@@ -2,7 +2,11 @@
output: pdf_document
---
Using API's from R: Twitter, Facebook, and NY Times
=========================
========================

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

```{r include=FALSE, cache=FALSE}
library(twitteR)
@@ -66,6 +70,8 @@ As the following simple example shows, you can search for keywords and get a list

```{r}
tweets = searchTwitteR("#Trump2016", resultType="recent", n = 10)
tweets[[1]]
tweets[[1]]$text
```
@@ -85,7 +91,6 @@ For querying facebook, we can use Pablo Barbera's `Rfacebook` package, which we

```{r, eval=F}
devtools::install_github("pablobarbera/Rfacebook", subdir="Rfacebook")
install.packages("Rfacebook")
library(Rfacebook)
```
To get a permanent Facebook OAuth token, there are a number of steps you need to take
@@ -142,7 +147,7 @@ This will have returned the first 'page' of 10 results, which we can convert to

```{r}
arts = plyr::ldply(res$data, function(x) c(headline=x$headline$main, date=x$pub_date))
arts
head(arts)
```

## APIs and rate limits
@@ -204,13 +209,14 @@ This tells us that we need to do a GET request to the articlesearch end point, s
```{r, results='hold'}
library(httr)
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
r = httr::GET(url, query=list(`api-key`=nyt_api_key, q="clinton"))
r = httr::GET(url, query=list("api-key"=nyt_api_key, q="clinton"))
status_code(r)
```

The status code 200 indicates "OK", other status codes generally indicate a problem,
such as an invalid API key.
The results are retrieved as a json-dictionary, which is accessible in R as a list through the `content` function in `httr`.
such as an invalid API key (search for 'HTTP status codes' for an overview).
The results are retrieved as a json-dictionary, which is accessible in R as a list through the `content` function in `httr`,
which identifies the data type based on the headers and converts it.
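
For example, a minimal defensive pattern could look as follows (a sketch; `httr` also provides `stop_for_status` for the same check):

```{r, eval=F}
r = httr::GET(url, query=list("api-key"=nyt_api_key, q="clinton"))
# fail early with an informative error if the request did not succeed
if (status_code(r) != 200) stop("Request failed with status ", status_code(r))
data = content(r)  # parsed from JSON into nested R lists
```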
The API documentation linked above contains a list of these fields, but you can also inspect the list itself from R:

```{r}
Binary file modified twitter_facebook.pdf
Binary file not shown.
