## Parsing JSON files from the Twitter API

This notebook was posted by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com).

The Twitter APIs, like many other web services, return data in [JSON](https://www.copterlabs.com/json-what-it-is-how-it-works-how-to-use-it/), a widely used data-interchange format.

This notebook shows how to [parse](https://en.wikipedia.org/wiki/Parsing) such data into R data frames, which are easier to work with for further analysis or CSV export.

In [None]:
# Required libraries
library(tidyverse)
library(jsonlite)

### Ingest the JSON

In [None]:
# Read the JSON file; stream_in() expects one JSON object (here: one tweet) per line.
# This can take a while for large files; running it in a terminal gives better progress information.
tweets <- stream_in(file("file.json"))
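To make the expected input format concrete, here is a minimal sketch that feeds three made-up, tweet-like records to `stream_in()`. The records and the `toy` object are hypothetical, and `textConnection()` simply stands in for `file("file.json")` so that the example runs without a file on disk.

```r
library(jsonlite)

# Three hypothetical tweet-like records, one JSON object per line
ndjson <- c(
  '{"created_at":"Mon Jan 01 10:00:00 +0000 2018","text":"first","user":{"screen_name":"alice","location":"Umea"}}',
  '{"created_at":"Mon Jan 01 10:05:00 +0000 2018","text":"second","user":{"screen_name":"bob","location":null}}',
  '{"created_at":"Mon Jan 01 10:07:00 +0000 2018","text":"third","user":{"screen_name":"carol","location":"Lund"}}'
)

# stream_in() reads from a connection; verbose = FALSE silences the progress output
toy <- stream_in(textConnection(ndjson), verbose = FALSE)

nrow(toy)        # one row per input line
class(toy$user)  # the nested "user" object becomes a data frame column
```

A file that holds a single large JSON array instead of one object per line would normally be read with `fromJSON()` rather than `stream_in()`.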

### Inspect fields
Below, we look at the top-level variable names in the parsed JSON. Some of these variables are themselves data frames, with further variables nested inside them.

In [None]:
names(tweets)

In [None]:
class(tweets$created_at) # a character variable, not a dataframe

In [None]:
class(tweets$user) # a dataframe
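Rather than checking columns one at a time, the nested ones can be listed in a single step. The sketch below uses a small made-up sample (`toy`, hypothetical) so that it runs on its own; the same `vapply()` call should work on the `tweets` object.

```r
library(jsonlite)

# Hypothetical two-record sample with one nested object ("user")
toy <- fromJSON('[
  {"text": "first",  "user": {"screen_name": "alice"}},
  {"text": "second", "user": {"screen_name": "bob"}}
]')

# TRUE for every column that is itself a data frame
is_nested <- vapply(toy, is.data.frame, logical(1))

# Names of the nested columns
names(toy)[is_nested]
```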

### The `flatten` function in `jsonlite`
In a nested data frame, one or more of the columns are themselves data frames. Such structures frequently appear when parsing JSON data from the web. We can flatten them into a regular two-dimensional table.

In [None]:
# purrr (attached by tidyverse) also exports flatten(), so we name the package explicitly
tweets_flat <- jsonlite::flatten(tweets, recursive = TRUE)
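The effect of the `recursive` argument is easiest to see on a small example. The following sketch uses a made-up, two-level nested sample (`nested` and its field names are hypothetical) and compares the column names produced by each setting.

```r
library(jsonlite)

# Hypothetical sample: "user" is nested, and "entities" is nested inside "user"
nested <- fromJSON('[
  {"text": "first",  "user": {"screen_name": "alice", "entities": {"url": "a.example"}}},
  {"text": "second", "user": {"screen_name": "bob",   "entities": {"url": "b.example"}}}
]')

names(nested)  # top-level names only

# Expand one level: "user.entities" is still a data frame column
one_level <- jsonlite::flatten(nested, recursive = FALSE)
names(one_level)

# Expand all levels: nested names are joined with dots
all_levels <- jsonlite::flatten(nested, recursive = TRUE)
names(all_levels)
```

Columns that hold lists rather than data frames (arrays of varying length, for example) are not expanded by `flatten()` and remain list columns.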

##### Make it a [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)

In [None]:
tweets_tbl <- as_tibble(tweets_flat) # convert to a tibble (as_data_frame() is deprecated)
twts <- tweets_tbl

###### Keep flattening

In [None]:
# We now have many more variables
names(twts)

In [None]:
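# flatten() with recursive = TRUE has already expanded every nested data frame,
# so a second pass should leave the column names unchanged (list columns are not expanded)
twts <- jsonlite::flatten(twts)
names(twts)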

In [None]:
class(twts$coordinates.coordinates) # a nested variable exposed by flattening

In [None]:
twts$retweeted_status.user.description # we can inspect any column (for example this one)

## Export custom csvs

In [None]:
# DATE AND TEXT
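# tibble() replaces the deprecated data_frame(); gsub() strips line breaks from the tweet text
date_text <- tibble(date = twts$created_at, text = gsub("[\r\n]", "", twts$text))
date_text
write.csv(date_text, file = "date_text.csv", row.names = FALSE)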

In [None]:
# SOURCE AND TARGET
source_target <- tibble(source = twts$user.screen_name, target = twts$in_reply_to_screen_name)
source_target <- source_target %>%
    filter(!is.na(target)) # filter away lines where target is NA
source_target
write.csv(source_target, file = "source_target.csv", row.names = FALSE)

In [None]:
# SELF-REPORTED vs GEO-TAGGED LOCATION
place_place <- tibble(self_reported = twts$user.location, geo_tagged = twts$place.full_name)
place_place <- place_place %>%
    filter(!is.na(geo_tagged)) %>% # filter away lines where geotag is NA
    filter(!is.na(self_reported)) # filter away lines where self-reported location is NA
place_place
write.csv(place_place, file = "place_place.csv", row.names = FALSE)

And so on ...