# Parsing JSON and XML

JSON and XML are structured data formats. XML -- eXtensible Markup Language -- was once very commonly used for text encoding. It's powerful, flexible, and still widely encountered in archival text projects. It's also comparatively complicated to use.

JSON -- JavaScript Object Notation -- is a newer, looser, less complicated format for storing arbitrary data. It's typically used for data exchange by Web applications and for non-full-text and text-adjacent data like word lists, bibliographic records, etc. Once parsed, a JSON object looks, feels, and operates much like a Python dictionary.

Here's how to read and work with each format in Python ...

## JSON

In [1]:
import json

s = """
    {
        "fruit": "Apple",
        "size": "Large",
        "color": "Red"
    }
"""
print(type(s), s)

# json.loads ("load string") parses JSON text into Python objects
j = json.loads(s)
print(type(j), j)

<class 'str'> 
    {
        "fruit": "Apple",
        "size": "Large",
        "color": "Red"
    }

<class 'dict'> {'fruit': 'Apple', 'size': 'Large', 'color': 'Red'}
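
JSON's own literals differ slightly from Python's. As a quick sketch (the sample string and the names `text` and `data` are made up for illustration and are not part of the cells above), `json.loads` converts `true`, `false`, and `null` to `True`, `False`, and `None`, and JSON arrays to lists:

```python
import json

# Hypothetical sample record, chosen to show each JSON literal type
text = '{"name": "Apple", "tart": true, "organic": false, "notes": null, "sizes": [1, 2.5]}'

data = json.loads(text)

# Show the Python type that each JSON value was converted to
for key, value in data.items():
    print(key, type(value).__name__, value)
```

Going the other way, `json.dumps` writes the JSON spellings back out, so a round trip preserves the values.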


In [2]:
j['fruit']

'Apple'
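
Because the parsed object is an ordinary dictionary, a missing key raises `KeyError` just as it would for any other dict. A minimal sketch of the usual ways to guard against that, assuming a small record like the one above (the default value `0` is an arbitrary choice for illustration):

```python
import json

j = json.loads('{"fruit": "Apple", "size": "Large", "color": "Red"}')

# Membership test: is the key present at all?
print("price" in j)

# dict.get returns None (or a supplied default) instead of raising
print(j.get("price"))
print(j.get("price", 0))

# Plain indexing raises KeyError for a missing key
try:
    j["price"]
except KeyError as err:
    print("Missing key:", err)
```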

In [3]:
j['price'] = 1
# json.dumps ("dump string") serializes Python objects back to JSON text
s = json.dumps(j)
print(j)
print(s)

{'fruit': 'Apple', 'size': 'Large', 'color': 'Red', 'price': 1}
{"fruit": "Apple", "size": "Large", "color": "Red", "price": 1}
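
The single-line output of `json.dumps` gets hard to read as records grow. One option, sketched here with an illustrative `record` variable rather than the notebook's `j`, is to pass the `indent` and `sort_keys` arguments:

```python
import json

record = {"fruit": "Apple", "size": "Large", "color": "Red", "price": 1}

# indent=2 puts each key on its own line; sort_keys=True orders keys alphabetically
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)
```

The indented string is still valid JSON, so `json.loads(pretty)` returns the same dictionary.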


In [4]:
j['details'] = {}
j['details']['variety'] = 'Braeburn'
j['details']['tart'] = True
j

{'fruit': 'Apple',
 'size': 'Large',
 'color': 'Red',
 'price': 1,
 'details': {'variety': 'Braeburn', 'tart': True}}

In [5]:
j['details']

{'variety': 'Braeburn', 'tart': True}

In [6]:
j['details']['tart']

True
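
In practice JSON usually lives in files rather than in string literals. The sketch below assumes a throwaway temporary directory and a hypothetical file name, `apple.json`. It uses `json.dump` and `json.load` (no trailing `s`), which work on file objects instead of strings:

```python
import json
import os
import tempfile

record = {
    "fruit": "Apple",
    "price": 1,
    "details": {"variety": "Braeburn", "tart": True},
}

# Hypothetical file location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "apple.json")

# Write the nested dictionary to disk as JSON
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)

# Read it back; nested objects come back as nested dictionaries
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == record)
print(loaded["details"]["tart"])
```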

## XML

XML looks kind of like HTML, i.e., it's made up of tagged text. This is generally good, though it can get tricky in real life because of issues with nested tags. We'll ignore that complication, since we won't be doing any encoding, just decoding and processing.

In [7]:
x = """
<note>
    <to>Bob</to>
    <from>Alice</from>
    <heading>Test!</heading>
    <body>
        <p>This is the first paragraph.</p>
        <p type='special'>This is another paragraph.</p>
        <p>And a third p element.</p>
    </body>
</note>
"""

In [8]:
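from bs4 import BeautifulSoup

# The built-in "html.parser" is good enough for this simple example.
# For real XML, BeautifulSoup's "xml" parser (which requires lxml) is the safer choice.
soup = BeautifulSoup(x, "html.parser")
print(soup.prettify())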

<note>
 <to>
  Bob
 </to>
 <from>
  Alice
 </from>
 <heading>
  Test!
 </heading>
 <body>
  <p>
   This is the first paragraph.
  </p>
  <p type="special">
   This is another paragraph.
  </p>
  <p>
   And a third p element.
  </p>
 </body>
</note>



In [9]:
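# Pull some data and do things with it.

author = soup.find('from').get_text()  # 'from' is a Python keyword, so soup.from is a syntax error
recipient = soup.to.get_text()  # 'to' is not reserved, so the attribute shortcut works
wc = len(soup.body.get_text().split())  # whitespace-delimited word count of the body
pars = len(soup.body.find_all('p'))  # number of <p> elements inside <body>
spec = soup.find(type="special").get_text()  # first tag whose type attribute is "special"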

print("Author:", author)
print("Recipient:", recipient)
print("Wordcount:", wc)
print("Body paragraphs:", pars)
print("Special content:", spec)

Author: Alice
Recipient: Bob
Wordcount: 14
Body paragraphs: 3
Special content: This is another paragraph.


In [10]:
# Iterate over all instances of a tag type
for par in soup.find_all('p'):
    print(par)
    print(par.get_text())
    print()

<p>This is the first paragraph.</p>
This is the first paragraph.

<p type="special">This is another paragraph.</p>
This is another paragraph.

<p>And a third p element.</p>
And a third p element.
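
The same lookups can also be done without BeautifulSoup. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which is a different tool from the one used above, and repeats the sample note so that it runs on its own. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

x = """
<note>
    <to>Bob</to>
    <from>Alice</from>
    <heading>Test!</heading>
    <body>
        <p>This is the first paragraph.</p>
        <p type='special'>This is another paragraph.</p>
        <p>And a third p element.</p>
    </body>
</note>
"""

# fromstring returns the root element, here <note>
root = ET.fromstring(x.strip())

author = root.find("from").text   # tag names are plain strings, so 'from' is fine
recipient = root.findtext("to")   # shortcut for find(...).text
body = root.find("body")
pars = body.findall("p")          # direct <p> children of <body>

# itertext() yields every text fragment under <body>, including whitespace
wc = len(" ".join(body.itertext()).split())

# Limited XPath support: select the <p> whose type attribute is 'special'
special = body.find("p[@type='special']").text

# Attributes are read with .get(), which returns None when absent
types = [p.get("type") for p in pars]

print("Author:", author)
print("Recipient:", recipient)
print("Wordcount:", wc)
print("Body paragraphs:", len(pars))
print("Special content:", special)
print("Type attributes:", types)
```

Unlike `html.parser`, ElementTree is a strict XML parser: malformed input raises `xml.etree.ElementTree.ParseError` rather than being silently repaired.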

