<font size="6"><b>WORKING WITH JSON AND NESTED LIST OBJECTS</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(jqr)
library(listviewer)
library(jsonlite)
library(rlist)
library(pipeR)
library(data.tree)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

# Working with JSON Objects

- JSON is a hierarchical data format for storing and querying data that does not fit naturally into rows and columns.

- Let's say we are tracking our contact data in a csv file:

```
Lastname, Firstname, Phone Number

Membrey, Peter, +852 1234 5678

Thielen, Wouter, +81 1234 5678
```

- If one of the contacts has more than one phone number, we have to create a new column:

```
Lastname, Firstname, Phone Number1, Phone Number2

Membrey, Peter, +852 1234 5678, +44 1234 565 555

Thielen, Wouter, +81 1234 5678
```

- But suppose we have millions of records with tens of fields, and a few exceptional records hold many values for some fields, e.g. 10 telephone numbers.

- The JSON format is a remedy for this kind of flexibility problem and a natural fit for hierarchical data.

- Integrity rules are softer for JSON data: records are not forced to share the same set of fields.

- JSON stands for "JavaScript Object Notation"
- In JSON-based data stores, each record is called a "document"
- Let's write the first record as a JSON document:

In [None]:
record1 <- '{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}'

R treats this as a regular character string:

In [None]:
record1

In the structure of a JSON object:

- Each document (equivalent to a row in an RDBMS) is delimited by curly braces {}
- And all values are given as "key" and "value" pairs:

```json
{
  "firstname": "Peter",
  "lastname": "Membrey",
  "phone_numbers": [
    "+852 1234 5678",
    "+44 1234 565 555"
  ]
}
```

- firstname is the key, "Peter" is the value, and so on

- We also have arrays of values for a single key, delimited by square brackets []
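As a small sketch of this flexibility, the snippet below defines a second contact that has no phone numbers but carries an extra `email` key, and checks with `jsonlite::validate` that both texts are well-formed JSON. `record2` and its values are made up for illustration and are not part of the original data.

```r
library(jsonlite)

# First contact, as in the text above
record1 <- '{"firstname": "Peter", "lastname": "Membrey",
             "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'

# Hypothetical second contact: no phone numbers, but an extra key
record2 <- '{"firstname": "Jane", "lastname": "Doe",
             "email": "jane.doe@example.com"}'

# validate() returns TRUE for well-formed JSON text
valid1 <- validate(record1)
valid2 <- validate(record2)

# A missing closing brace makes the text invalid
valid_broken <- validate('{"firstname": "Peter"')
```

Both records are valid even though their keys differ, which is what a fixed column layout cannot express.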

## Converting JSON to list

The `fromJSON` function from the `jsonlite` package converts a JSON object into an R list:

In [None]:
record1l <- fromJSON(record1)

In [None]:
record1l

And the `jsonedit` function from the `listviewer` package creates an interactive, foldable, pretty-printed representation of a list object:

In [None]:
jsonedit(record1l, mode = "form")

This is a convenient way to explore a JSON object interactively in R.
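When an interactive widget is not available (for example in a script), the parsed list can still be inspected with base R. The following is a minimal sketch that rebuilds the same record so it runs on its own; `second_phone` and `n_phones` are illustrative names.

```r
library(jsonlite)

record1 <- '{"firstname": "Peter", "lastname": "Membrey",
             "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'
record1l <- fromJSON(record1)

# Compact, non-interactive overview of the list
str(record1l)

# Elements are reached with the usual list operators
second_phone <- record1l$phone_numbers[2]
n_phones <- length(record1l$phone_numbers)
```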

## Querying JSON with JQ

jq is a command-line parser and query tool for JSON that produces nicely formatted output.

You can have more info on jq following the links:

[The Home Page](https://stedolan.github.io/jq/)

[Tutorial](https://stedolan.github.io/jq/tutorial/)

[Manual](https://stedolan.github.io/jq/manual/)

The `jqr` package is an R interface to jq.

The `jq` function from the `jqr` package provides a low-level interface for querying JSON objects:

"." returns the whole record:

In [None]:
jq(record1, ".")
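Beyond ".", a few more basic filters can be tried on the same record. This is a small sketch, with the record repeated so the block is self-contained; the variable names are illustrative.

```r
library(jqr)

record1 <- '{"firstname": "Peter", "lastname": "Membrey",
             "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'

# Value of a single key
first_name <- jq(record1, ".firstname")

# jq arrays are zero-based, unlike R vectors
first_phone <- jq(record1, ".phone_numbers[0]")

# Filters are chained with the jq pipe "|"
n_phones <- jq(record1, ".phone_numbers | length")
```

The results are JSON text, so string values keep their double quotes.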

## A real json database example: UN COMTRADE

- We will be using a part of the UN COMTRADE database:

[UN COMTRADE](https://comtrade.un.org/)

UN COMTRADE is the widest and most comprehensive database on international trade:

- 250+ reporter countries
- 290+ partner countries
- 6500+ commodity codes
- 50+ years of history
- Both imports and exports
- Both values and quantities!

Let's define a path variable:

In [None]:
comtrade_path <- "~/databb/json/comtrade_s1"

And list files:

In [None]:
list.files(comtrade_path)

- classificationS1.json lists the item classification according to the SITC Rev. 1 scheme
- reporterAreas.json and partnerAreas.json list the countries and their respective codes
- data files are under the 2010 directory

### Reporters

Import reporterAreas file as list:

In [None]:
reporter <- jsonlite::fromJSON(paste(comtrade_path, "reporterAreas.json", sep = "/"))

In [None]:
str(reporter)

Note that for structures that are not deeply nested, `fromJSON` automatically simplifies the data into a data frame.

`jsonedit` again provides a collapsible, interactive view of JSON and similar hierarchical data types:
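The simplification can be switched off to see what the data looks like without it. The sketch below uses a made-up miniature of the reporter file (only the code 792 for Turkey is taken from this notebook) and compares the default behaviour with `simplifyVector = FALSE`.

```r
library(jsonlite)

# Made-up miniature of the reporter file
toy <- '{"more": false,
         "results": [{"id": "all", "text": "All"},
                     {"id": "792", "text": "Turkey"}]}'

simplified <- fromJSON(toy)                          # default: simplifyVector = TRUE
nested     <- fromJSON(toy, simplifyVector = FALSE)  # keep the raw nesting

class(simplified$results)   # a data frame with columns id and text
class(nested$results)       # a plain list with one element per record
```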

In [None]:
listviewer::jsonedit(reporter, mode = "form")

And we can change to text JSON representation again:

In [None]:
reporterj <- toJSON(reporter)

In [None]:
reporterj
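One detail worth knowing about `toJSON`: by default it writes length-one R vectors as one-element JSON arrays, and `auto_unbox = TRUE` turns them into scalars. A minimal sketch on a made-up list `x`:

```r
library(jsonlite)

x <- list(name = "Peter", phones = c("+852 1234 5678", "+44 1234 565 555"))

boxed   <- toJSON(x)                     # scalars become one-element arrays
unboxed <- toJSON(x, auto_unbox = TRUE)  # scalars stay scalars

# Parsing the text again recovers the values
back <- fromJSON(unboxed)
```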

Now let's traverse through this document to list country texts:

In [None]:
jq(reporterj, '.results[].text')

Or traversing the list object:

In [None]:
reporter$results$text

And let's list the country codes:

In [None]:
jq(reporterj, '.results[].id')

In [None]:
reporter$results$id

- Separate lists of country names and ids do not mean much.
- Suppose we want to find the country code of Turkey:

In [None]:
jq(reporterj, '.results[] | select(.text == "Turkey") | .id')

In [None]:
reporter$results %>% filter(text == "Turkey") %>% pull(id)
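If many such lookups are needed, a named vector can serve as a simple dictionary. The sketch below assumes a data frame shaped like `reporter$results`; the rows are illustrative stand-ins, and `code_by_name` and `name_by_code` are hypothetical helper names.

```r
# Illustrative stand-in for reporter$results
results <- data.frame(id   = c("12", "792", "842"),
                      text = c("Algeria", "Turkey", "USA"),
                      stringsAsFactors = FALSE)

# Named vector: country name -> code
code_by_name <- setNames(results$id, results$text)
turkey_code <- code_by_name[["Turkey"]]

# Reverse lookup: code -> country name
name_by_code <- setNames(results$text, results$id)
partner_name <- name_by_code[["12"]]
```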

### Classification

Now let's go through the classification file:

In [None]:
classification <- jsonlite::fromJSON(paste(comtrade_path, "classificationS1.json", sep = "/"))

In [None]:
listviewer::jsonedit(classification, mode = "form")

In [None]:
classificationj <- toJSON(classification)

Now we will filter for those entries whose text includes "textile" and whose code has exactly 3 digits:

In [None]:
jq(classificationj, '.results[] | select((.id|test("^[0-9]+$")) and (.text|test("(?i)textile"))) |
select((.id|tonumber < 1000) and (.id|tonumber > 99)) | .text')

See how it works:

- We keep only id values that are purely numeric (which excludes ALL, TOTAL, AG1..AG5) and whose text contains "textile", case-insensitively
- We keep id values larger than 99 and smaller than 1000, that is, three-digit codes
- We return the text
- The text already starts with the id, separated from the description by " - ", so the id does not need to be returned separately

And we can do the same through the list object and the data.frame inside:

In [None]:
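# Same filter on the data frame: numeric ids whose text mentions "textile"
classification$results %>%
  filter(str_detect(id, "^\\d+$") & str_detect(text, "(?i)textile")) %>%
  mutate(id = as.integer(id)) %>%
  filter(id %between% c(100, 999)) %>%  # %between% comes from data.table
  pull(text)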
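The individual filtering steps can be checked one at a time on a small table. The following base R sketch uses a made-up data frame `toy` (only the entry for code 651 follows this notebook; the other rows are invented) and keeps each condition in its own logical vector.

```r
# Made-up miniature of classification$results
toy <- data.frame(
  id   = c("TOTAL", "65", "651", "6513", "841"),
  text = c("Total of all commodities", "65 - Textile yarn, fabrics",
           "651 - Textile yarn and thread", "6513 - Cotton yarn",
           "841 - Clothing"),
  stringsAsFactors = FALSE
)

# Step 1: id is purely numeric, text mentions "textile" in any case
is_numeric  <- grepl("^[0-9]+$", toy$id)
has_textile <- grepl("textile", toy$text, ignore.case = TRUE)

# Step 2: the numeric id has exactly three digits
numeric_id   <- suppressWarnings(as.integer(toy$id))
three_digits <- !is.na(numeric_id) & numeric_id > 99 & numeric_id < 1000

# Step 3: return the text of the surviving rows
picked <- toy$text[is_numeric & has_textile & three_digits]

# Optional: replace the " - " separator with a tab for tab-separated output
picked_tab <- sub(" - ", "\t", picked, fixed = TRUE)
```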

### Data files

Now let's go through the actual data files that includes trade volumes:

In [None]:
comfiles <- list.files(paste(comtrade_path, "2010", sep = "/"), full.names = TRUE)

In [None]:
comfiles

See that the files are gzipped

Let's extract the parts that represent the reporter and partner country codes:

In [None]:
comnames <- str_extract(comfiles, "(?<=2010_)\\d+_\\d+")
comnames

Both `readLines` and `fromJSON` can read gzipped files directly:

In [None]:
comj <- lapply(comfiles, readLines)

In [None]:
coml <- lapply(comfiles, fromJSON)
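To try this without the COMTRADE files, the sketch below writes a tiny made-up JSON document to a gzipped temporary file and reads it back. It uses explicit `gzfile()` connections so that the decompression step is visible; this is an illustrative variant, not the way the files are read above.

```r
library(jsonlite)

# Write a tiny, made-up JSON document to a gzipped temporary file
tmp <- tempfile(fileext = ".json.gz")
con <- gzfile(tmp, "w")
writeLines('{"dataset": [{"cmdCode": "651", "TradeValue": 100}]}', con)
close(con)

# Raw JSON text, read through an explicit gzfile() connection
con <- gzfile(tmp)
txt <- readLines(con)
close(con)

# Parsed list, again via a gzfile() connection (fromJSON accepts connections)
parsed <- fromJSON(gzfile(tmp))

unlink(tmp)
```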

In [None]:
names(comj) <- comnames

In [None]:
names(coml) <- comnames

In [None]:
comj %>% str

- The code after the first underscore is the reporter's and the code after the second underscore is the partner country's code

- So these are the files for which Turkey is either the reporter or the partner
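The reporter and partner codes can be separated once the names are extracted. This sketch uses hypothetical file names that merely follow a `2010_<reporter>_<partner>` pattern (the real file names may differ) and applies the same look-behind expression with base R.

```r
# Hypothetical file names following the 2010_<reporter>_<partner> pattern
files <- c("data/2010_792_12.json.gz", "data/2010_12_792.json.gz")

# Same look-behind pattern as above, here with base R (needs perl = TRUE)
codes <- regmatches(files, regexpr("(?<=2010_)\\d+_\\d+", files, perl = TRUE))

# Split each "reporter_partner" pair into two columns
parts <- do.call(rbind, strsplit(codes, "_", fixed = TRUE))
pairs <- data.frame(reporter = parts[, 1], partner = parts[, 2],
                    stringsAsFactors = FALSE)
```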

Let's take only a part:

In [None]:
tradedataj <- comj$`792_12`

In [None]:
tradedata <- coml$`792_12`

In [None]:
listviewer::jsonedit(tradedata, mode = "form")

Descriptions for several variables are:

- cmdCode: Commodity code
- CmdDesc: Commodity description
- IsLeaf: Basic code without children
- Parentcode: Parent (higher-level) code of that commodity code
- pfDesc: Commodity classification
- PfCode: Commodity classification code
- yr: Year
- rtCode: Reporter code
- ptCode: Partner code
- qtCode: Quantity code

Now, from this file, in which Turkey is the reporter, let's extract the TradeValue of exports (rgCode is 2) for commodity code 651 (Textile yarn and thread).

We will report:
- ptTitle (name of partner country)
- TradeValue
- TradeQuantity 

In [None]:
jq(tradedataj, '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
"\\(.ptTitle) \\(.TradeValue) \\(.TradeQuantity)"')

In [None]:
tradedata$dataset %>%
filter(cmdCode == 651 & rgCode == 2) %>%
dplyr::select(ptTitle, TradeValue, TradeQuantity)
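The jq query relies on string interpolation: inside a jq string, `\(expr)` is replaced by the value of `expr`, and in R source the backslash has to be doubled. A minimal sketch on a made-up document `toy` (the trade values are invented):

```r
library(jqr)

# Made-up miniature of a trade data file
toy <- '{"dataset": [
  {"cmdCode": "651", "rgCode": 2, "ptTitle": "Algeria", "TradeValue": 100},
  {"cmdCode": "651", "rgCode": 1, "ptTitle": "Algeria", "TradeValue": 40},
  {"cmdCode": "841", "rgCode": 2, "ptTitle": "Algeria", "TradeValue": 70}
]}'

# "\\(" in R source is the two characters \( that jq uses for interpolation
out <- jq(toy, '.dataset[] | select(.cmdCode == "651" and .rgCode == 2) |
                "\\(.ptTitle) \\(.TradeValue)"')
```

Only the first record satisfies both conditions, so `out` holds a single JSON string.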

We can do the same for all parts of the list:

In [None]:
datasetl <- coml %>% list.select(dataset) %>% purrr::flatten()

In [None]:
lapply(datasetl, function(x) x  %>%
filter(cmdCode == 651 & rgCode == 2) %>%
dplyr::select(ptTitle, TradeValue, TradeQuantity)) %>% rbindlist

Or, for lists that can be bound into a single data.frame/data.table:

In [None]:
datasetl %>% rbindlist %>%
filter(cmdCode == 651 & rgCode == 2) %>%
dplyr::select(ptTitle, TradeValue, TradeQuantity)
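When the tables are bound together, it can be useful to remember which file each row came from. The sketch below assumes a list shaped like `coml` (here a made-up stand-in called `toy`) and uses the `idcol` argument of `rbindlist` to store the list names in a column.

```r
library(data.table)

# Made-up stand-in for coml: two "files", each with a dataset data frame
toy <- list(
  `792_12` = list(dataset = data.frame(cmdCode = c("651", "841"),
                                       rgCode = c(2, 2),
                                       TradeValue = c(100, 70),
                                       stringsAsFactors = FALSE)),
  `12_792` = list(dataset = data.frame(cmdCode = "651", rgCode = 2,
                                       TradeValue = 55,
                                       stringsAsFactors = FALSE))
)

# Pull the dataset out of every element, keeping the list names
datasets <- lapply(toy, function(x) x$dataset)

# idcol stores the list names, so each row keeps track of its source file
combined <- rbindlist(datasets, idcol = "file")
exports_651 <- combined[cmdCode == "651" & rgCode == 2]
```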

# Advanced list operations

Let's do some advanced operations on highly nested list objects using functions from `purrr` and `rlist` packages:

In [None]:
coml %>% jsonedit(mode = "form")

We have a single larger list of six smaller list objects.

Each object contains a nested list called validation, alongside the dataset table.

Using a hack from this stackoverflow answer:

https://stackoverflow.com/a/51611498

In [None]:
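# Nesting depth: 0 for non-list values, 1 + the deepest child for lists
depth <- function(x) {
  if (!is.list(x)) return(0)
  if (length(x) == 0) return(1)
  1 + max(sapply(x, depth))
}

# Keep the nesting, but turn each innermost list into one named node per element
toTree <- function(x) {
  if (depth(x) > 1) {
    lapply(x, toTree)
  } else {
    lapply(names(x), function(nm) list(name = nm))
  }
}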
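To see what a depth function of this kind computes, here is a self-contained sketch on a made-up nested list. `list_depth` is a hypothetical name for an if/else variant of the same idea, and treating an empty list as depth 1 is an assumption of this sketch.

```r
# Nesting depth of a list: 0 for non-list values, 1 + deepest child otherwise
list_depth <- function(x) {
  if (!is.list(x)) return(0)
  if (length(x) == 0) return(1)  # assumption: an empty list counts as one level
  1 + max(vapply(x, list_depth, numeric(1)))
}

# Made-up nested list: the top level, then count, then started
toy <- list(status = "ok",
            count = list(value = 3, started = list(at = "now")))

d_scalar <- list_depth("ok")
d_toy    <- list_depth(toy)
```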

We can draw the structure of the first validation object as such using `data.tree` package:

In [None]:
suppressWarnings(dt <- data.tree::FromListSimple(toTree(coml[[1]]$validation), nodeName = "x"))

In [None]:
plot(dt)

The `value` node under `count`, which is part of the `validation` node, holds the number of records in `dataset`.

Let's first extract only those parts from the larger object as a list:

In [None]:
valuel <- coml %>% rlist::list.select(validation$count$value)

In [None]:
valuel

And as a simple vector:

In [None]:
unlist(valuel)
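The same extraction can be written without `rlist`. The following sketch assumes a list shaped like `coml` (a made-up stand-in called `toy` with invented counts) and uses `purrr::pluck` to walk down the path validation, count, value.

```r
library(purrr)

# Made-up stand-in for coml with only the nodes needed here
toy <- list(
  `792_12` = list(validation = list(count = list(value = 1500))),
  `12_792` = list(validation = list(count = list(value = 300)))
)

# pluck() follows the path one name at a time; vapply() returns a named vector
values <- vapply(toy,
                 function(x) pluck(x, "validation", "count", "value"),
                 numeric(1))
```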

Now let's keep only those parts of the list whose value is above 1000.

Note that the pipe (`%>>%`) from the pipeR package can be used interchangeably with the tidyverse pipe (`%>%`) for rlist functions:

In [None]:
coml2 <- coml %>>% rlist::list.filter(validation$count$value > 1000)

In [None]:
coml2 %>% names

In [None]:
coml2 %>% jsonedit(mode = "form")
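For comparison, base R's `Filter` expresses the same selection with an explicit predicate function. A minimal sketch on a made-up stand-in for `coml` (the counts are invented):

```r
# Made-up stand-in for coml with only the nodes needed here
toy <- list(
  `792_12` = list(validation = list(count = list(value = 1500))),
  `12_792` = list(validation = list(count = list(value = 300)))
)

# Keep the elements for which the predicate returns TRUE
big <- Filter(function(x) x$validation$count$value > 1000, toy)
big_names <- names(big)
```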

Now let's flatten the list one level so that the nodes under validation (status, message, count, datasetTimer) replace validation using `purrr::list_flatten`:

In [None]:
coml %>% lapply(purrr::list_flatten) %>% jsonedit(mode = "form")

Or we can repeat it so that two levels are flattened:

In [None]:
coml %>% lapply(purrr::list_flatten) %>% lapply(purrr::list_flatten) %>% jsonedit(mode = "form")
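The effect on the names is easiest to see on a small object. This sketch assumes purrr 1.0 or later and a made-up element `toy` shaped like one entry of `coml`; inner names are joined to outer names with an underscore, and the data frame is left untouched.

```r
library(purrr)

# Made-up stand-in for one element of coml
toy <- list(
  validation = list(status = "ok", count = list(value = 2)),
  dataset    = data.frame(x = 1:2)
)

once  <- list_flatten(toy)    # removes one level of nesting
twice <- list_flatten(once)   # removes the next level as well

names_once  <- names(once)
names_twice <- names(twice)
```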

What if we want a single data.table under each of the major nodes of the larger list? We can use `rlist::list.flatten`:

In [None]:
com_dtl <- coml %>% lapply(rlist::list.flatten) %>% lapply(as.data.table)

In [None]:
com_dtl %>% jsonedit(mode = "form")

While we had a table of 35 columns along with the nested validation object in the original version:

In [None]:
coml[[1]]$dataset

In the flattened version, we only have a data.table object of 47 columns, 12 of which come from the flattening of the validation object:

In [None]:
com_dtl[[1]]
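Where the extra columns come from can be traced on a small example. The sketch below uses a made-up element `toy` with two validation leaves and a two-column dataset: `rlist::list.flatten` turns every leaf, including each data frame column, into a top-level item, and `as.data.table` then recycles the length-one validation values across the rows.

```r
library(rlist)
library(data.table)

# Made-up stand-in for one element of coml
toy <- list(
  validation = list(status = "ok", count = list(value = 2L)),
  dataset    = data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
)

flat <- list.flatten(toy)   # one item per leaf; names are joined with "."
dt   <- as.data.table(flat) # length-one items are recycled to two rows
```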

We can combine all data.tables in the list into one large object using `rbindlist` as usual:

In [None]:
com_dt <- com_dtl %>% rbindlist

In [None]:
com_dt

Note that I sometimes use package namespaces (packagename::functionname()) just to show where those functions come from. Provided that you load the packages with the library() function, you do not have to use namespaces.
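The namespace form also works when a package is installed but not attached, as in this minimal sketch (the list passed in is made up):

```r
# No library() call is needed when the function is prefixed with its package
json <- jsonlite::toJSON(list(a = 1), auto_unbox = TRUE)
```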