# JSON FORMAT, NOSQL AND MONGODB

Note: For convenience, open a terminal by clicking New > Terminal on the Jupyter Home page.

JSON is a hierarchical data format. It allows data that does not fit neatly into rows and columns to be stored and queried.

Let's say we are tracking our contact data in a CSV file:

Lastname, Firstname, Phone Number

Membrey, Peter, +852 1234 5678

Thielen, Wouter, +81 1234 5678

If one of the contacts has more than one phone number, we have to create a new column:

Lastname, Firstname, Phone Number1, Phone Number2

Membrey, Peter, +852 1234 5678, +44 1234 565 555

Thielen, Wouter, +81 1234 5678

But suppose we have millions of records with tens of fields, and a few exceptional records hold many values for some fields, e.g. 10 telephone numbers.
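The difficulty can be made concrete with a minimal sketch. It assumes a hypothetical temporary file holding the rows above and counts the comma-separated fields per row with `awk`; the rows no longer agree on the number of columns:

```shell
# Hypothetical sample file reproducing the rows above
contacts=$(mktemp)
cat > "$contacts" <<'EOF'
Lastname, Firstname, Phone Number1, Phone Number2
Membrey, Peter, +852 1234 5678, +44 1234 565 555
Thielen, Wouter, +81 1234 5678
EOF

# Print the number of comma-separated fields in each row
field_counts=$(awk -F',' '{ print NF }' "$contacts")
echo "$field_counts"

rm -f "$contacts"
```

The last row has one field fewer than the header, so every tool reading the file has to cope with missing columns.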

The JSON format is a remedy for this kind of flexibility problem and for hierarchical data in general.

Integrity rules are softer when handling JSON data: documents do not all have to carry the same fields.

JSON stands for "JavaScript Object Notation".

In document databases such as MongoDB, each JSON record is called a "document".

Let's write the first record as a JSON document:

In [None]:
record1='{
"firstname": "Peter",
"lastname": "Membrey",
"phone_numbers": [
"+852 1234 5678",
"+44 1234 565 555"
]
}'

echo $record1

Echoed like this, the JSON is collapsed onto a single line, which makes the structure hard to read.
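The single-line output comes from the shell rather than from JSON itself: an unquoted variable is split into words and rejoined with single spaces. A small sketch, using a hypothetical variable `sample`, shows the difference that quoting makes:

```shell
# Hypothetical two-key document spread over several lines
sample='{
"firstname": "Peter",
"lastname": "Membrey"
}'

# Unquoted: word splitting collapses the line breaks into spaces
unquoted=$(echo $sample)
# Quoted: the line breaks are preserved
quoted=$(echo "$sample")

echo "$unquoted" | wc -l
echo "$quoted" | wc -l
```

The first count is one line and the second is four, even though both commands print the same variable.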

We may use online JSON parsers for this purpose.

You can copy and paste the above string into the input pane:

http://jsonparseronline.com/

Or better, we will use a handy tool called "jq" for this purpose:

In [None]:
echo $record1 | jq .

jq is a command-line parser and query tool for JSON that also pretty-prints its output.

You can find more information on jq at the following links:

[The Home Page](https://stedolan.github.io/jq/)

[Tutorial](https://stedolan.github.io/jq/tutorial/)

[Manual](https://stedolan.github.io/jq/manual/)

Each document (equivalent to a row in an RDBMS) is delimited by curly braces "{ }", and all values are given as "key": "value" pairs:

```json
{
  "firstname": "Peter",
  "lastname": "Membrey",
  "phone_numbers": [
    "+852 1234 5678",
    "+44 1234 565 555"
  ]
}
```

firstname is the key, "Peter" is the value, and so on.

We also have arrays of values for a single key, delimited by square brackets []
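As a short sketch of how keys and arrays are addressed, `jq` can pull individual values out of the document above. The variable names are only illustrative:

```shell
doc='{"firstname": "Peter", "lastname": "Membrey", "phone_numbers": ["+852 1234 5678", "+44 1234 565 555"]}'

# Value of a single key (-r prints the raw string without quotes)
firstname=$(echo "$doc" | jq -r '.firstname')
# Number of elements in the array
phone_count=$(echo "$doc" | jq '.phone_numbers | length')
# Second element of the array (indices start at 0)
second_phone=$(echo "$doc" | jq -r '.phone_numbers[1]')

echo "$firstname has $phone_count numbers; the second is $second_phone"
```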

What is more powerful about the JSON format is that you can embed documents inside other documents (we print the data here in three visual formats):

In [None]:
record1b='{
"firstname": "Peter",
"lastname": "Membrey",
"numbers": [
{
"phone": "+852 1234 5678"
},
{
"fax": "+44 1234 565 555"
}
]
}'

echo -e $record1b "\n"

echo -e "$record1b\n"

echo $record1b | jq .

See that the phone and fax numbers are now inside embedded documents.
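Embedded documents are reached by chaining keys and array positions. The following is a minimal sketch with `jq`, using the same record in compact form (the variable names are illustrative):

```shell
doc='{"firstname": "Peter", "lastname": "Membrey", "numbers": [{"phone": "+852 1234 5678"}, {"fax": "+44 1234 565 555"}]}'

# Follow the path: key "numbers", first array element, key "phone"
phone=$(echo "$doc" | jq -r '.numbers[0].phone')
# List the key names used inside each embedded document
kinds=$(echo "$doc" | jq -r '.numbers[] | keys[]')

echo "$phone"
echo "$kinds"
```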

Multiple documents are grouped into "collections".

A "collection" in NoSQL terminology is analogous to a "table" in RDBMS jargon: it is a set of similar items (documents with similar key-value pairs).

MongoDB assigns a unique "_id" to each document. This _id key is similar to a primary key in an RDBMS.

Note that JSON is a text format, while MongoDB stores the data as BSON (Binary JSON).

BSON is faster to traverse.

Although a binary format usually occupies less space than a text (ASCII or UTF-8) format, extra data such as indices may make BSON heavier in terms of space.

Now, before we go on to MongoDB, let's introduce our JSON dataset.

We will be using a part of the UN COMTRADE database:

[UN COMTRADE](https://comtrade.un.org/)

UN COMTRADE is the widest and most comprehensive database on international trade:

- 250+ reporter countries
- 290+ partner countries
- 6500+ commodity codes
- 50+ years of history
- Both imports and exports
- Both values and quantities!

First make the paths easier to write:

In [None]:
weekpath=/home/bda505/mef-bigdata/week_06

In [None]:
cat $weekpath/comtrade_tables/classificationH4.json | leafpad &

The head of the file is:

```json
{
	"more": false,
	"minimumInputLength": 2,
	"classCode": "H4",
	"className": "HS2012",
	"results": [{
		"id": "ALL",
		"text": "All HS2012 categories",
		"parent": "#"
	},
	{
		"id": "TOTAL",
		"text": "Total of all HS2012 commodities",
		"parent": "#"
	},
```

A nicer way to page through JSON data is to use jq and less (but this won't work inside Jupyter):

jq . $weekpath/comtrade_tables/classificationH4.json -C | less -R

or

cat $weekpath/comtrade_tables/classificationH4.json | jq . -C | less -R


Extract the item codes:

In [None]:
cat $weekpath/comtrade_tables/classificationH4.json | \
jq '(.results | .[] | .id)' | \
head

And their definitions:

In [None]:
cat $weekpath/comtrade_tables/classificationH4.json | \
jq '(.results | .[] | .text)' | \
head

To see how many ids are available in the database:

In [None]:
cat $weekpath/comtrade_tables/classificationH4.json | \
jq '(.results | .[] | .id)' | \
wc -l

Now let's see the reporter countries:

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | leafpad &

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | \
jq '(.results | .[] | .id)' | \
head

If we don't want the quotes:

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | \
jq '(.results | .[] | .id)' | \
sed 's/"//g' | \
head

And country names:

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | \
jq '(.results | .[] | .text)' | \
sed 's/"//g' | \
head
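As an alternative to stripping the quotes with `sed`, jq has a raw-output flag, `-r`. The sketch below uses a hypothetical two-entry sample in the same shape as the area tables, so that it runs without the course files:

```shell
# Hypothetical two-entry sample in the same shape as reporterAreas.json
sample='{"results": [{"id": "276", "text": "Germany"}, {"id": "792", "text": "Turkey"}]}'

# -r (--raw-output) prints strings without the surrounding quotes
ids=$(echo "$sample" | jq -r '.results[].id')
# The same result with the sed approach used above
ids_sed=$(echo "$sample" | jq '.results[].id' | sed 's/"//g')

echo "$ids"
```

Both variables hold the same two codes. One difference is that `-r` keeps any quote characters that are part of a value itself, which the `sed` filter would delete.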

And their counts:

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | \
jq '(.results | .[] | .id)' | \
wc -l

And for the partner countries:

In [None]:
cat $weekpath/comtrade_tables/partnerAreas.json | \
jq '(.results | .[] | .id)' | \
wc -l

What can be done with this database?

[Atlas of Economic Complexity](http://atlas.cid.harvard.edu/)

It is a very comprehensive study that aims to calculate the "complexity" (sophistication) level of economies and products

The UN COMTRADE database is freely and publicly available, with some technical limitations documented in the API information:

[The UN Comtrade data extraction API](https://comtrade.un.org/data/doc/api/)

- Each query can at most return 50K values
- An IP can only send 100 requests per hour

The best approach is to send one query per reporter / partner pair:

255 * 293 = 74715 queries

And use some proxies to stretch the limit!
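To see why a single IP is not enough, the arithmetic can be sketched in the shell, assuming one query per pair and the 100-requests-per-hour limit quoted above:

```shell
reporters=255
partners=293
requests_per_hour=100   # limit per IP, as quoted above

queries=$((reporters * partners))
# Round up: a partially used hour still has to be waited for
hours=$(( (queries + requests_per_hour - 1) / requests_per_hour ))
days=$(( (hours + 23) / 24 ))

echo "$queries queries, about $hours hours ($days days) from a single IP"
```

With these figures the download takes 748 hours, which rounds up to 32 days of continuous querying from one address.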

In fact, the number of pairs that actually trade with each other is much smaller than that:

In [None]:
dataset04=/home/bda505/mef/04/comtrade_2015

In [None]:
ls $dataset04/gz | head

These are the kinds of files we have.

And the count is:

In [None]:
ls $dataset04/gz | wc -l

So our database covers ALL INTERNATIONAL TRADE IN THE MOST DETAILED FORM IN 2015, in 27924 files.

Now let's view one of them. But first, let's get some country codes:

In [None]:
cat $weekpath/comtrade_tables/reporterAreas.json | \
jq '(.results | .[] | select(.text == "Turkey") | {id, text})'

In [None]:
cat $weekpath/comtrade_tables/partnerAreas.json | \
jq '(.results | .[] | select(.text == "Germany") | {id, text})'
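The `select` filter used in these two cells can be tried on a small input. This is a sketch on a hypothetical two-entry sample shaped like the area tables; the `-c` flag only makes jq print the result on one line:

```shell
# Hypothetical sample in the same shape as the area tables
sample='{"results": [{"id": "276", "text": "Germany"}, {"id": "792", "text": "Turkey"}]}'

# select() keeps only the elements for which the condition is true;
# {id, text} then builds a new object from those two keys
match=$(echo "$sample" | jq -c '.results[] | select(.text == "Turkey") | {id, text}')
echo "$match"
```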

Please copy this into a terminal:

zcat $dataset04/gz/2015_792_276.json.gz | jq . -C | less -R

Now, how many items did Turkey export to Germany in 2015?

First, the descriptions:

In [None]:
zcat $dataset04/gz/2015_792_276.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Export") | .cmdDescE)' | head

And the codes:

In [None]:
zcat $dataset04/gz/2015_792_276.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Export") | .cmdCode)' | head

In [None]:
zcat $dataset04/gz/2015_792_276.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Export") | .cmdCode)' | wc -l

So Turkey exported commodities to Germany under 4223 headings in 2015.

And what about imports?

In [None]:
zcat $dataset04/gz/2015_792_276.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Import") | .cmdCode)' | wc -l

Turkey imported commodities from Germany under 6053 headings in 2015

Is it possible to double check this information from the counterpart's reported figures?

Yes, provided that both countries are reporters, we can just swap the country codes!

Germany's imports are Turkey's exports:

In [None]:
zcat $dataset04/gz/2015_276_792.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Import") | .cmdCode)' | wc -l

And Germany's exports are Turkey's imports:

In [None]:
zcat $dataset04/gz/2015_276_792.json.gz | jq '(.dataset | .[] | select(.rgDesc == "Export") | .cmdCode)' | wc -l

Note that Germany reports more items in both exports and imports.
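The swap of country codes can be sketched with plain shell string handling. This assumes, as the two file names above suggest, that the names follow the pattern `year_reporter_partner.json.gz`; the variable names are illustrative only:

```shell
file=2015_792_276.json.gz

# Strip the extension, then split the remainder on underscores
base=${file%.json.gz}
year=${base%%_*}
rest=${base#*_}
reporter=${rest%%_*}
partner=${rest#*_}

# The counterpart's file has reporter and partner exchanged
mirror="${year}_${partner}_${reporter}.json.gz"
echo "$mirror"
```

Not every mirror file is guaranteed to exist, since the partner may not be a reporter.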

Now an exercise for you:

From the file $weekpath/comtrade_tables/classificationH4.json, get the text of all items whose parent is:

{
      "id": "1104",
      "text": "1104 - Cereal grains otherwise worked (e.g. hulled, rolled, flaked, pearled, sliced or kibbled) except rice of heading no. 1006; germ of cereals whole, rolled, flaked or ground",
      "parent": "11"
    },


Now it is time to unzip the gz files and import them into MongoDB.

**Note: first check whether the files have already been gunzipped, since unzipping takes a long time!**

In [None]:
ls $dataset04/json | wc -l

First create a folder for the gunzipped JSON files (the "-p" flag avoids an error if the folder already exists):

In [None]:
mkdir -p $dataset04/json

Gunzip the gz files while keeping the zipped originals. Note the "-k" (keep) flag; the "-f" flag forces overwriting of existing output files:

In [None]:
gunzip -fk $dataset04/gz/*

Move the unzipped JSON files to the json directory:

In [None]:
mv $dataset04/gz/*.json $dataset04/json
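The check and the unzip steps can be combined so that the slow part runs only when needed. The sketch below works on a throwaway directory with one hypothetical sample file instead of the real dataset:

```shell
# Throwaway directory standing in for $dataset04 (hypothetical sample data)
demo=$(mktemp -d)
mkdir -p "$demo/gz"
echo '{"dataset": []}' > "$demo/gz/2015_792_276.json"
gzip "$demo/gz/2015_792_276.json"

# Unzip only if the json directory is missing or empty
if [ -z "$(ls -A "$demo/json" 2>/dev/null)" ]; then
    mkdir -p "$demo/json"
    gunzip -fk "$demo"/gz/*.gz        # -k keeps the .gz files
    mv "$demo"/gz/*.json "$demo/json"
fi

json_count=$(ls "$demo/json" | wc -l | tr -d ' ')
gz_count=$(ls "$demo/gz" | wc -l | tr -d ' ')
echo "json: $json_count, gz: $gz_count"
rm -rf "$demo"
```

Both counts are 1 here: the JSON file lands in the json directory and the zipped original stays in place.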

And import the data into a new database and collection:

In [None]:
cd $dataset04/json
for file in *.json; do
    mongoimport --db comtrade --collection 2015 --file "$file"
done

And fire up the MongoDB client Robomongo:

In [None]:
robomongo &

See that the collection has been created.

Some simple queries that you can try:

db.getCollection('2015').find({'dataset.ptCode':288})

db.getCollection('2015').find({'dataset.rtTitle':"Bulgaria"})

db.getCollection('2015').find({ 'dataset.rtTitle' : "Bulgaria", $and : [{ 'dataset.ptTitle' : "Ghana"} ] } )