# JSON

## JSON format

**JSON** is a file format that uses human-readable text to transmit data objects.

JSON stands for *JavaScript Object Notation.*

JSON is easy for computers to parse and for humans to read.

JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data.

JSON is usually more compact than XML.

![](https://pics.me.me/json-statham-new-json-object-44070723.png)

## Data types and syntax

A JSON string can contain an __object__, which starts with `{` and ends with `}`. Such an object resembles a Python *dictionary*: it consists of attribute–value pairs separated by commas. For example:

In [None]:
{"first_name": "Guido", "last_name":"Rossum"}

A JSON string can contain an __array__, which starts with `[` and ends with `]`. An array resembles a Python list: all the values are separated by commas. For example:

In [None]:
["Guido van Rossum", "Diana Clarke", "Naomi Ceder", "Van Lindberg", "Ewa Jodlowska"]

JSON's basic data types are:
* number (an integer or a float)
* string (strings are delimited with double-quotation marks and support a backslash escaping syntax)
* boolean (`true` or `false`)
* array (an ordered list of zero or more values, each of which may be of any type. Arrays use square bracket notation with comma-separated elements)
* object (an unordered collection of name–value pairs in curly braces; the names must be strings)
* null (an empty value, written as the word `null`)
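
The list above can be checked against Python directly. The following is a minimal sketch, assuming a made-up JSON text (`sample`) that uses each basic type once; it prints the Python type that `json.loads` produces for every value.

```python
import json

# A made-up JSON text that uses every basic data type at least once.
sample = (
    '{"count": 3, "ratio": 0.5, "name": "Guido", '
    '"active": true, "tags": ["a", "b"], "extra": {"k": null}}'
)

parsed = json.loads(sample)

# Print each top-level key with the Python type its value was converted to.
for key, value in parsed.items():
    print(key, type(value).__name__)
```

Note that the JSON words `true` and `null` arrive in Python as `True` and `None`.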

To include special characters in a string, escape them with a backslash `\`, e.g., `\"` or `\r\n`. If you want to, you can find all the rules here: http://www.json.org/
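
To see the escaping in action, here is a small sketch that uses Python's standard `json` module. The sample sentence is made up; the point is only that the quotation marks and the line break come out escaped and are restored when the string is parsed again.

```python
import json

# A made-up Python string with double quotes and a Windows-style line break.
text = 'She said "hi"\r\nand left'

encoded = json.dumps(text)     # JSON text: special characters are escaped
decoded = json.loads(encoded)  # back to the original Python string

print(encoded)
print(decoded == text)
```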

One could mistake a JSON string for a Python data type, e.g., a dictionary, but they are not the same thing.
First, JSON is not code, it is text. Second, not every valid Python object forms valid JSON. For example, the following is a valid Python dictionary but not valid JSON: `{(1, 'a'): u'12345'}`. (Can you think of more examples?)
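
The difference also works in the other direction: some texts look like Python literals but are not JSON, and some JSON is not Python syntax. The sketch below tries three made-up candidate texts with `json.loads` and records which ones parse.

```python
import json

candidates = [
    '{"a": true, "b": null}',  # valid JSON, but true/null are not Python names
    "{'a': 1}",                # valid Python, but JSON requires double quotes
    '{"a": [1, 2,]}',          # trailing comma: accepted by Python, not by JSON
]

results = []
for text in candidates:
    try:
        json.loads(text)
        results.append(True)
    except json.JSONDecodeError:
        results.append(False)

print(results)
```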

Here is a longer example of a JSON string:

In [None]:
{"organisation": "Python Software Foundation",
 "officers": [
            {"first_name": "Guido", "last_name":"Rossum", "position":"president"},
            {"first_name": "Diana", "last_name":"Clarke", "position":"chair"},
            {"first_name": "Naomi", "last_name":"Ceder", "position":"vice chair"},
            {"first_name": "Van", "last_name":"Lindberg", "position":"vice chair"},
            {"first_name": "Ewa", "last_name":"Jodlowska", "position":"director of operations"}
            ],
"type": "non-profit",
"country": "USA",
"founded": 2001,
"members": 244,
"budget": 750000,
"url": "www.python.org/psf/"}

## json module

Python has a standard `json` module. You will most likely need the following functions:

* `loads` - to convert a JSON string into a Python object, usually a dictionary or a list. This function takes one obligatory argument: a JSON string.
* `dumps` - to convert a Python object, such as a dictionary or a list, into a JSON string. This function takes one obligatory argument: the object.
* `load` - to read JSON from a file and convert it into a Python object. This function takes one obligatory argument: a file.
* `dump` - to convert a Python object into JSON and save it into a file. This function takes two obligatory arguments: a Python object and a file, in that order.

The word "file" refers to any file-like object: anything with a `.read()` method for `load`, or a `.write()` method for `dump`.
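
A quick way to try `dump` and `load` without touching the disk is an in-memory file-like object. This is a minimal sketch, assuming `io.StringIO` as the "file" and a made-up `record` dictionary; it also shows the argument order of each call.

```python
import io
import json

record = {"name": "Guido", "languages": ["Python"]}  # made-up sample data

buffer = io.StringIO()        # an in-memory file-like object
json.dump(record, buffer)     # the object comes first, then the file
buffer.seek(0)                # rewind before reading
restored = json.load(buffer)  # a single argument: the file

print(restored == record)
```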

## An example

Let's convert our JSON string into a Python object:

In [1]:
json_string = """{"organisation": "Python Software Foundation",
                 "officers": [
                            {"first_name": "Guido", "last_name":"Rossum", "position":"president"},
                            {"first_name": "Diana", "last_name":"Clarke", "position":"chair"},
                            {"first_name": "Naomi", "last_name":"Ceder", "position":"vice chair"},
                            {"first_name": "Van", "last_name":"Lindberg", "position":"vice chair"},
                            {"first_name": "Ewa", "last_name":"Jodlowska", "position":"director of operations"}
                            ],
                "type": "non-profit",
                "country": "USA",
                "founded": 2001,
                "members": 244,
                "budget": 750000,
                "url": "www.python.org/psf/"}"""

In [2]:
import json

data = json.loads(json_string)
print(type(data))  # let's print out the type of the object to make sure that that's a dictionary and not a string

<class 'dict'>


In [3]:
from pprint import pprint

pprint(data) # let's take a look at the dictionary

{'budget': 750000,
 'country': 'USA',
 'founded': 2001,
 'members': 244,
 'officers': [{'first_name': 'Guido',
               'last_name': 'Rossum',
               'position': 'president'},
              {'first_name': 'Diana',
               'last_name': 'Clarke',
               'position': 'chair'},
              {'first_name': 'Naomi',
               'last_name': 'Ceder',
               'position': 'vice chair'},
              {'first_name': 'Van',
               'last_name': 'Lindberg',
               'position': 'vice chair'},
              {'first_name': 'Ewa',
               'last_name': 'Jodlowska',
               'position': 'director of operations'}],
 'organisation': 'Python Software Foundation',
 'type': 'non-profit',
 'url': 'www.python.org/psf/'}


In [4]:
# let's print the keys of the dictionary
for key in data: 
    print(key, end=' ')

organisation officers type country founded members budget url 

In [5]:
# now let's convert a python dictionary into a json string

d = {"John": 51, "Kate": 12, "Bill": 27}
json_string = json.dumps(d)
print(type(json_string)) # let's make sure that our data has now become a string

<class 'str'>


In [6]:
# let's print out the string
print(json_string)

{"John": 51, "Kate": 12, "Bill": 27}


In [7]:
# let's do the same with an array
arr = ['hello', 'world']
json_string = json.dumps(arr)
print(type(json_string)) 
print(json_string)

<class 'str'>
["hello", "world"]


In [8]:
# not all valid python objects can be converted into a valid json 
d = {("A", 21): "John"}
json_string = json.dumps(d)
print(json_string)

TypeError: keys must be a string
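
If such a dictionary has to be saved anyway, the keys need to become strings first. The sketch below catches the error and then applies one possible workaround, joining the parts of each tuple with an underscore. The separator is an arbitrary choice for illustration, and the exact wording of the `TypeError` message differs between Python versions.

```python
import json

d = {("A", 21): "John"}

try:
    json.dumps(d)
except TypeError as error:
    print("cannot serialise:", error)

# Hypothetical workaround: turn each tuple key into a single string key.
converted = {"_".join(str(part) for part in key): value for key, value in d.items()}
json_string = json.dumps(converted)
print(json_string)
```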

In [9]:
# writing the data into a file

d = {'абв': 1, 'где': 2, 'ёжз': 3}

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(d, f)

# the result (the contents of the file):

{"\u0433\u0434\u0435": 2, "\u0430\u0431\u0432": 1, "\u0451\u0436\u0437": 3}

{'где': 2, 'абв': 1, 'ёжз': 3}

In [10]:
# let's include the ensure_ascii parameter:

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(d, f, ensure_ascii = False)

# the result:

{"где": 2, "абв": 1, "ёжз": 3}

{'где': 2, 'абв': 1, 'ёжз': 3}

In [11]:
# let's include the indent parameter (the number stands for the number of spaces in the indentation):

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(d, f, ensure_ascii = False, indent = 4)

# the result

{
    "абв": 1,
    "где": 2,
    "ёжз": 3
}

{'абв': 1, 'где': 2, 'ёжз': 3}
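
Whatever formatting options are used for writing, `json.load` should give back the same dictionary. Here is a small round-trip sketch; it writes to a temporary folder (an assumption made so that the example does not overwrite an existing `data.json`).

```python
import json
import os
import tempfile

d = {'абв': 1, 'где': 2, 'ёжз': 3}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'data.json')
    # write with readable Cyrillic and a 4-space indentation
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(d, f, ensure_ascii=False, indent=4)
    # read the file back into a Python object
    with open(path, encoding='utf-8') as f:
        restored = json.load(f)

print(restored == d)
```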

## How to check the validity of a json?

When you work with a long JSON string, it is hard to spot a mistake. To make sure your JSON is valid, use one of the following tools:
* https://jsonlint.com/
* https://jsoncompare.com/ 
* http://www.jsonschemavalidator.net/
* https://jsonformatter.curiousconcept.com/#
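
A rough check is also possible in Python itself, since `json.loads` raises `json.JSONDecodeError` on invalid input. The helper below is a sketch, not a replacement for the tools above; the name `check_json` is made up for this example.

```python
import json

def check_json(text):
    """Return (True, None) if text parses as JSON, else (False, error location and message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return False, f"line {error.lineno}, column {error.colno}: {error.msg}"
    return True, None

print(check_json('{"a": 1}'))
print(check_json('{"a": 1,}'))
```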

## JSON in the wild

### 1. Sending the data from the server to the browser



As an example, let's use GitHub. For instance, let's determine the number of repositories of a given GitHub user.

In [1]:
import json
import urllib.request

user = "dashapopova"  # the user
url = 'https://api.github.com/users/%s/repos' % user  # the link to the JSON

response = urllib.request.urlopen(url)  # sending a request to the server and getting an answer
text = response.read().decode('utf-8')  # reading the answer into a string
data = json.loads(text) # converting the json string into a python object

print(len(data))  # printing the number of the user's repositories
for i in data:
    print(i["name"]) # printing the names of all the repositories

16
CM
CompLex
CompSemantics
Corpus_methods_LangPolicy_2020
Data-Analysis-Python-II
FunctionalModelsCompLing
FunctionalModelsDH2021
Interactive-Dictionary
Intro-to-R
pp
Preprocessing
Programming-Basics
python
Python101
Slovo-dnja
test_app
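
The parsing step can be tried without a network connection. The sketch below assumes a made-up, heavily shortened response body (`sample_response`) with the same overall shape as above: a JSON array of objects that each have a `name` key. The helper name `repo_names` is hypothetical.

```python
import json

# Made-up response body: a real response has many more keys per repository.
sample_response = '[{"name": "python"}, {"name": "Intro-to-R"}, {"name": "CM"}]'

def repo_names(response_text):
    """Parse the JSON text of a repository list and return the repository names."""
    repos = json.loads(response_text)
    return [repo["name"] for repo in repos]

names = repo_names(sample_response)
print(len(names))
print(names)
```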


## Twitter

The documentation for the Twitter JSON format: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

In [2]:
# reading the data from the twitter.json file (one JSON object per line)
twitter = []
with open('twitter.json', encoding='utf-8') as f:
    for line in f:
        twitter.append(json.loads(line))
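
Files of this kind hold one JSON object per line, so each line is parsed on its own. The following sketch shows the same idea on a made-up in-memory sample and additionally skips blank lines, which would otherwise make `json.loads` fail. The helper name `read_json_lines` is hypothetical.

```python
import io
import json

def read_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        records.append(json.loads(line))
    return records

# Made-up sample: a tweet-like record, a blank line, and a delete record.
sample = io.StringIO('{"id": 1, "lang": "en"}\n\n{"delete": {"status": {"id": 2}}}\n')
records = read_json_lines(sample)
print(len(records))
```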

In [3]:
# how many tweets do we have?
len(twitter)

2556

In [4]:
twitter[0]

{'created_at': 'Wed Oct 03 05:00:00 +0000 2018',
 'id': 1047350533454012417,
 'id_str': '1047350533454012417',
 'text': 'RT @ELISSEsifieds: Nothing can stop us from supporting you. When we say all the way, it will be indeed. Hello Elissesifieds Cebu. \nThank yo…',
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'truncated': False,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 937522240488443905,
  'id_str': '937522240488443905',
  'name': "DonnaArabe'Efieds🇸🇦",
  'screen_name': 'ArabiDonna',
  'location': 'Jubail Industrial City, Kingdo',
  'url': None,
  'description': "Don't think too hard,just have fun with it.",
  'translator_type': 'none',
  'protected': False,
  'verified': False,
  'followers_count': 104,
  'friends_count': 132,
  'listed_count': 0,
  'favourites_count': 8372,
  'statuses_count':

In [5]:
twitter[-1]

{'delete': {'status': {'id': 729296586933678081,
   'id_str': '729296586933678081',
   'user_id': 2740164192,
   'user_id_str': '2740164192'},
  'timestamp_ms': '1538542860162'}}

In [6]:
ok = twitter[0]

In [7]:
# the language of the tweet
ok['lang']

'en'

In [8]:
# the text of the tweet
ok['retweeted_status']['extended_tweet']['full_text']

'Nothing can stop us from supporting you. When we say all the way, it will be indeed. Hello Elissesifieds Cebu. \nThank you guys, for rising so early just to visit Elisse on her last taping day sa humble place nyo💓 https://t.co/cLpJ2ifyGw'
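
Not every record has all of these nested keys: delete records, for example, have neither `retweeted_status` nor `text`. Below is a sketch of a more defensive lookup. The function name `full_text` and the sample dictionaries are made up; the key names are the ones used in the cell above.

```python
def full_text(tweet):
    """Return the longest available text of a tweet, or None if there is none."""
    # For a retweet, look at the original tweet; otherwise at the tweet itself.
    original = tweet.get('retweeted_status', tweet)
    extended = original.get('extended_tweet')
    if extended and 'full_text' in extended:
        return extended['full_text']
    return original.get('text')

# Made-up records that imitate the three shapes seen in the data.
retweet = {'text': 'RT @x: Nothing can…',
           'retweeted_status': {'text': 'Nothing can…',
                                'extended_tweet': {'full_text': 'Nothing can stop us.'}}}
plain = {'text': 'hello'}
deleted = {'delete': {'status': {'id': 2}}}

print(full_text(retweet))
print(full_text(plain))
print(full_text(deleted))
```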

In [9]:
# the share of deleted tweets (a fraction between 0 and 1)
sum('delete' in t for t in twitter)/len(twitter)

0.14162754303599373

In [10]:
langs = [t['lang'] for t in twitter if 'lang' in t]

In [11]:
# the most popular languages of the tweets
from collections import Counter
Counter(langs).most_common(10)

[('en', 719),
 ('ja', 438),
 ('es', 173),
 ('ko', 149),
 ('th', 123),
 ('ar', 119),
 ('und', 117),
 ('in', 71),
 ('pt', 69),
 ('fr', 35)]

In [12]:
# Do we have tweets from the same user?
cnt = Counter([t['user']['id'] for t in twitter if 'user' in t])
for key in sorted(cnt, key=cnt.get, reverse=True):
    if cnt[key] > 1:
        print(cnt[key], key)

2 992084216350294016
2 581282101
2 2317193324
2 978499715657445377
2 2245928100
2 993031040
2 290401936
2 3067130479
2 849417895109156869
2 958056194366754816
2 1017442172495331328
2 702487896935104513
2 995683537197158401
2 121016179
2 947288315375394817
2 2464271844
2 1009443285176340482
2 860202971266772992
2 2734975298
2 4311188534
2 1290792062
2 897067178754686976
2 4179415159
2 772081812109570048
2 1006114081739288577


In [13]:
# Top 20 hashtags
hashtags = []
for t in twitter:
    if 'entities' in t:
        if 'hashtags' in t['entities'] and t['entities']['hashtags']:
            hashtags.extend([i['text'] for i in t['entities']['hashtags']])
Counter(hashtags).most_common(20)

[('BTS', 17),
 ('방탄소년단', 13),
 ('AMAs', 11),
 ('人気投票ガチャ', 8),
 ('태형', 7),
 ('뷔', 6),
 ('BTSinChicago', 5),
 ('BTSLoveYourselfTour', 5),
 ('오늘의방탄', 5),
 ('PledgeForSwachhBharat', 5),
 ('MPN', 5),
 ('PCAs', 4),
 ('V', 4),
 ('시카고1회차공연', 4),
 ('เป๊กผลิตโชค', 4),
 ('JIMIN', 4),
 ('running', 3),
 ('NCT', 3),
 ('지민', 3),
 ('WajahmuPlastik', 3)]

In [14]:
# preprocessing the text of the original tweets (not retweets)
from string import punctuation
texts = []
for t in twitter:
    if 'retweeted_status' in t:
        continue  # skip retweets
    if 'text' in t and t.get('lang') == 'en':  # .get avoids a KeyError if 'lang' is missing
        texts.append(' '.join([w.strip(punctuation) for w in t['text'].lower().split()]))

all_tweets = ' '.join(texts)
d = Counter(all_tweets.split())
d.most_common(25)

[('the', 125),
 ('to', 86),
 ('a', 75),
 ('i', 73),
 ('and', 64),
 ('is', 50),
 ('you', 48),
 ('of', 45),
 ('for', 42),
 ('it', 41),
 ('in', 38),
 ('that', 33),
 ('this', 31),
 ('my', 30),
 ('me', 27),
 ('be', 26),
 ('on', 26),
 ('are', 21),
 ('what', 20),
 ('so', 20),
 ('with', 20),
 ('have', 19),
 ('not', 17),
 ('more', 17),
 ('but', 17)]

In [15]:
# print out the top 10 users with the most followers 
d = {}
for t in twitter:
    if 'user' in t:
        d[t['user']['name']] = t['user']['followers_count']
for key in sorted(d, key=d.get, reverse=True)[:10]:
    print(d[key], key)

2521403 Filosofía♕
1491309 FITNESS Magazine
1206759 malaysiakini.com
1137374 NYT Science
625463 Gramática
392472 TGRT Haber
383698 The Sun Football ⚽
374222 Melbourne, Australia
318189 Roznama Express
311319 💞 ცųཞɠɛཞცơơɠıɛ 💞


In [16]:
# top 20 sources of the tweets
import re

reg = re.compile('^.*?>(.*?)<.*?$')

sources = []
for t in twitter:
    if 'source' in t:
        res = reg.findall(t['source'])
        if res:
            sources.extend(res)
Counter(sources).most_common(20)

[('Twitter for iPhone', 800),
 ('Twitter for Android', 695),
 ('Twitter Web Client', 140),
 ('twittbot.net', 122),
 ('Twitter Lite', 51),
 ('Twitter for iPad', 28),
 ('TweetDeck', 23),
 ('Facebook', 17),
 ('IFTTT', 14),
 ('تطبيق قرآني', 10),
 ('dlvr.it', 10),
 ('Buffer', 8),
 ('Google', 8),
 ('autotweety.net', 7),
 ('Hootsuite Inc.', 7),
 ('WordPress.com', 6),
 ('Twittascope', 6),
 ('Botbird tweets', 6),
 ('تطبيق دعـاء', 5),
 ('Zapier.com', 5)]
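
To see what the regular expression does, it helps to run it on a single value. The sketch below applies the same pattern to the `source` string of the first tweet shown earlier and to a made-up string without any tags.

```python
import re

reg = re.compile('^.*?>(.*?)<.*?$')

# The 'source' value of the first tweet shown earlier.
source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# The group captures the text between the first '>' and the next '<'.
print(reg.findall(source))

# A string without tags yields no match, hence the `if res:` check above.
print(reg.findall('no tags here'))
```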