Commit: Updated handouts

vanatteveldt committed Jun 1, 2016
1 parent 478646e commit 69afdc6
Showing 13 changed files with 259 additions and 128 deletions.
3 changes: 3 additions & 0 deletions 2_playing.Rmd
@@ -1,3 +1,6 @@
---
output: pdf_document
---
```{r, echo=FALSE}
cat(paste("(C) (cc by-sa) Wouter van Atteveldt, file generated", format(Sys.Date(), format="%B %d %Y")))
```
70 changes: 61 additions & 9 deletions 2_playing.html

Large diffs are not rendered by default.

Binary file added 2_playing.pdf
Binary file not shown.
22 changes: 0 additions & 22 deletions 3_organizing.Rmd
@@ -189,25 +189,3 @@ head(subset, n=10)
```


Good practice: self-contained scripts
====

Using R is programming, and one of the most important parts of programming is managing your source code.
An important thing to realize is that your code will be written only once, but read many times over.
Spending twice as much time to make the code well organized and readable might feel like a waste of time,
but you (or your colleagues and students) will be very grateful when reading it again.
Especially since research code is often left untouched for months until it is time to revise an article,
it is very important to make sure that you (and ideally the readers and reviewers of the article) can understand the code.

There are no simple rules for writing readable code, and what is readable to one person can be quite cryptic to another.
Nevertheless, here are four tips that I can offer and that I expect you to incorporate in your assignments:

1. Use descriptive variable names. Use `income` (or better: `income.top.percent`) rather than `i`.
2. Use comments where needed, especially to explain decisions, assumptions, and possible problems.
In R, everything on a line after a `#` is a comment, i.e. it is completely skipped by R.
3. Often, when doing an analysis you are not quite sure where you will end up, so you write a lot of code that turns out not to be needed. When your analysis is done, take a moment to reorganize the code, remove redundancies, et cetera. It is often best to start a new file and copy-paste the relevant bits (adding comments where needed). Assume that your code will be reviewed, even if it is not, because you are sure to read it again later and wonder why and how you did certain things.
4. Finally, try to write what I call 'self-contained scripts' (see the sketch below). The script should start with some kind of data gathering command such as `download.file` or `read.csv`, and end with your analyses. You should be able to clear your environment, run the code from top to bottom, and arrive at the same results. In fact, when cleaning up my code I often do just that: clean up part of the code, clear everything, re-run, and check the results. This is also important for reproducibility, as being able to run the whole code and get the same results is the only guarantee that the code in fact produced those results.

We will come across some tools that make these things easier, such as defining your own functions and working with knitr, but the most important thing is to accept that your code is part of your product and to take the time to polish it a bit.
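
For example, a self-contained script could look something like this (a minimal sketch; the URL and column names are hypothetical):

```{r, eval=F}
# Data gathering: everything the script needs is loaded at the top
# (the URL and column names here are hypothetical)
income = read.csv("http://example.com/income.csv")

# Analysis: derived variables get descriptive names
income.top.percent = income$top.income / income$total.income * 100
summary(income.top.percent)
hist(income.top.percent)
```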


Binary file modified 3_organizing.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion amcat.Rmd
@@ -119,7 +119,7 @@ head(tokens)
So you can see this lemmatizes words (i.e. reduces them to their dictionary form) and gives their part of speech (noun, verb, etc.).
Let's plot only the names:

```{r, message=F}
```{r, warning=F, message=F}
subset = tokens[tokens$pos == "name", ]
dtm = dtm.create(subset$aid, subset$lemma)
dtm.wordcloud(dtm, nterms = 200)
Binary file modified amcat.pdf
Binary file not shown.
88 changes: 88 additions & 0 deletions comparing.Rmd
@@ -0,0 +1,88 @@
---
title: "Comparing corpora"
author: "Wouter van Atteveldt"
date: "June 1, 2016"
output: pdf_document
---

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

Comparing corpora
----

Another useful thing we can do is comparing two corpora:
Which words or names are mentioned more in e.g. Bush's speeches than in Obama's?

This uses functions from the `corpustools` package, which you can install directly from github
(you only need to do this once per computer):

```{r, eval=F}
install.packages("devtools")
devtools::install_github("kasperwelbers/corpus-tools")
```

For this handout, we will use the State of the Union speeches contained in the `corpustools` package,
and create a document term matrix (DTM) from all names and nouns in the speeches by Bush and Obama:

```{r, message=F}
library(corpustools)
data(sotu)
dtm = with(subset(sotu.tokens, pos1 %in% c("M", "N")),
           dtm.create(documents=aid, terms=lemma))
```

Now, we can create separate DTMs for Bush and Obama.
To do this, we split the DTM: we select document ids using the `headline` column in the metadata from `sotu.meta`, and then use the `dtm.filter` function:


```{r}
head(sotu.meta)
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
```

So how can we check which words are more frequent in Bush's speeches than in Obama's?
The function `corpora.compare` provides this functionality, given two document-term matrices:

```{r}
cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
```

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush).
Also, it gives the relative frequency in these corpora (normalizing for total corpus size)
and the overrepresentation in the 'x' corpus and the chi-squared value for that overrepresentation.
So, Bush used the word 'terrorist' 105 times, while Obama used it only 13 times; in relative terms, Bush used it about four times as often, which is highly significant.
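
To see where the chi-squared value comes from, here is a rough manual check for a single term (a sketch, not necessarily the package's exact computation; the count column names `termfreq.x` and `termfreq.y` are assumptions based on the description above):

```{r, eval=F}
# 2x2 table: this term vs. all other terms, in the 'x' (Obama) and 'y' (Bush) corpora
n.x = sum(cmp$termfreq.x)
n.y = sum(cmp$termfreq.y)
t = cmp[cmp$term == "terrorist", ]
tab = matrix(c(t$termfreq.x, n.x - t$termfreq.x,
               t$termfreq.y, n.y - t$termfreq.y), nrow=2)
# without continuity correction; may still differ slightly from t$chi
# depending on any smoothing the package applies
chisq.test(tab, correct=F)$statistic
```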

Which words did Obama use most compared to Bush?

```{r}
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
```

So, while Bush talks about freedom, war, and terror, Obama talks more about industry, banks and education.

Let's make a word cloud of Obama's words, with size indicating the chi-squared overrepresentation:

```{r, warning=F}
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
```

And Bush:

```{r, warning=F}
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
```

Note that the warnings given by these commands are relatively harmless: they mean that some words were skipped because no good place could be found for them in the word cloud.
Binary file added comparing.pdf
Binary file not shown.
64 changes: 4 additions & 60 deletions corpus.Rmd
@@ -1,6 +1,6 @@
---
title: "Corpus analysis: the document-term matrix"
output: html_document
title: 'Corpus analysis: the document-term matrix'
output: pdf_document
---

=========================================
@@ -15,7 +15,8 @@ In R, these matrices are provided by the `tm` (text mining) package.
Although this package provides many functions for loading and manipulating these matrices,
using them directly is relatively complicated.

Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a data frame. To create a term document matrix from a simple data frame with a 'text' column, use the `create_matrix` function (with removeStopwords=F to make sure all words are kept):
Fortunately, the `RTextTools` package provides an easy function to create a document-term matrix from a data frame.
To create a document-term matrix from a simple data frame with a 'text' column, use the `create_matrix` function (with `removeStopwords=F` to make sure all words are kept):

```{r,message=F}
library(RTextTools)
@@ -205,60 +206,3 @@ to visualize the top words as a word cloud:
```{r, warning=F}
dtm.wordcloud(dtm_filtered)
```

Comparing corpora
----

Another useful thing we can do is comparing two corpora:
Which words or names are mentioned more in e.g. Bush's speeches than in Obama's?

To do this, we split the DTM into separate DTMs for Bush and Obama.
For this, we select document ids using the `headline` column in the metadata from `sotu.meta`, and then use the `dtm.filter` function:


```{r}
head(sotu.meta)
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)
```

So how can we check which words are more frequent in Bush's speeches than in Obama's?
The function `corpora.compare` provides this functionality, given two document-term matrices:

```{r}
cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
```

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush).
Also, it gives the relative frequency in these corpora (normalizing for total corpus size)
and the overrepresentation in the 'x' corpus and the chi-squared value for that overrepresentation.
So, Bush used the word 'terrorist' 105 times, while Obama used it only 13 times; in relative terms, Bush used it about four times as often, which is highly significant.

Which words did Obama use most compared to Bush?

```{r}
cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
```

So, while Bush talks about freedom, war, and terror, Obama talks more about industry, banks and education.

Let's make a word cloud of Obama's words, with size indicating the chi-squared overrepresentation:

```{r, warning=F}
obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)
```

And Bush:

```{r, warning=F}
bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)
```

Note that the warnings given by these commands are relatively harmless: they mean that some words were skipped because no good place could be found for them in the word cloud.
120 changes: 90 additions & 30 deletions lda.html

Large diffs are not rendered by default.

18 changes: 12 additions & 6 deletions twitter_facebook.Rmd
@@ -2,7 +2,11 @@
output: pdf_document
---
Using API's from R: Twitter, Facebook, and NY Times
=========================
========================

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

```{r include=FALSE, cache=FALSE}
library(twitteR)
@@ -66,6 +70,8 @@ As the following simple example shows, you can search for keywords and get a list

```{r}
tweets = searchTwitteR("#Trump2016", resultType="recent", n = 10)
tweets[[1]]
tweets[[1]]$text
```
@@ -85,7 +91,6 @@ For querying facebook, we can use Pablo Barbera's `Rfacebook` package, which we

```{r, eval=F}
devtools::install_github("pablobarbera/Rfacebook", subdir="Rfacebook")
install.packages("Rfacebook")
library(Rfacebook)
```
To get a permanent Facebook OAuth token, there are a number of steps you need to take
@@ -142,7 +147,7 @@ This will have returned the first 'page' of 10 results, which we can convert to

```{r}
arts = plyr::ldply(res$data, function(x) c(headline=x$headline$main, date=x$pub_date))
arts
head(arts)
```

## APIs and rate limits
@@ -204,13 +209,14 @@ This tells us that we need to do a GET request to the articlesearch end point, s
```{r, results='hold'}
library(httr)
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
r = httr::GET(url, query=list(`api-key`=nyt_api_key, q="clinton"))
r = httr::GET(url, query=list("api-key"=nyt_api_key, q="clinton"))
status_code(r)
```

The status code 200 indicates "OK", other status codes generally indicate a problem,
such as an invalid API key.
The results are retrieved as a json-dictionary, which is accessible in R as a list through the `content` function in `httr`.
such as an invalid API key (search for 'HTTP status codes' for an overview).
The results are retrieved as a json-dictionary, which is accessible in R as a list through the `content` function in `httr`,
which identifies the data type based on the headers and converts it.
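
For example, a minimal defensive pattern could look as follows (a sketch; `httr` also provides `stop_for_status` for the same check):

```{r, eval=F}
r = httr::GET(url, query=list("api-key"=nyt_api_key, q="clinton"))
# fail early with an informative error if the request did not succeed
if (status_code(r) != 200) stop("Request failed with status ", status_code(r))
data = content(r)  # parsed from JSON into nested R lists
```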
The API documentation linked above contains a list of these fields, but you can also inspect the list itself from R:

```{r}
Binary file modified twitter_facebook.pdf
Binary file not shown.
