# Storing / Sending / Receiving data

Data serialization translates data into a format that can be stored or transmitted. When the data is later read or received, the original structure can be reconstructed on the other side. We'll look at Python's native serialization tools, pickle and shelve, followed by standard formats such as JSON and XML.

Lesson Goals
- Review Pickling
- Review Shelves
- Review JSON
- Review XML

https://en.wikipedia.org/wiki/Serialization

https://docs.python-guide.org/scenarios/serialization/

## Pickling

According to Python.org: 'The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.'

Source: https://docs.python.org/3/library/pickle.html

Additional reading: https://www.datacamp.com/community/tutorials/pickle-python-tutorial
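To make the quoted definition concrete, here is a minimal sketch that pickles an object to a byte stream in memory and rebuilds it, without touching a file. The `order` dictionary is made-up sample data, not part of the lesson's cells.

```python
import pickle

# Hypothetical sample data used only for this illustration
order = {'drink': 'water', 'sides': ['fries', 'salad']}

# Pickling: object hierarchy -> byte stream
payload = pickle.dumps(order)

# Unpickling: byte stream -> a new object hierarchy with equal contents
restored = pickle.loads(payload)

print(type(payload).__name__)   # the pickled form is a bytes object
print(restored == order)        # same contents
print(restored is order)        # but a distinct object
```

The file-based cells that follow use `pickle.dump` and `pickle.load`, which do the same thing against an open binary file instead of a bytes object.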

In [1]:
import pickle

In [2]:
my_food = {
    'drink': 'Rip It',
    'sides': 'fries',
    'sandwich': ['burger','bun','lettuce','tomato','pickle','ketchup']
}

In [3]:
filename = 'food.pickle'
with open(filename,'wb') as filehandle:
    pickle.dump(my_food, filehandle)
    # Write a pickled representation of my_food to the open binary file

In [4]:
# You will get the contents now
my_food

{'drink': 'Rip It',
 'sides': 'fries',
 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}

Go to Kernel -> Restart. You will be warned that all variables will be lost.

In [5]:
# After the restart, my_food is no longer defined, so this raises a NameError
print(my_food)

{'drink': 'Rip It', 'sides': 'fries', 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}


##### Exercise 1

Refresh yourself on the try/except functionality and print a clean error message when the NameError exception occurs

In [6]:
# Print my_food, or a clean message if the name is not defined
try:
    print(my_food)
except NameError:
    print('There is a NameError in the code')

{'drink': 'Rip It', 'sides': 'fries', 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}


In [7]:
import pickle
# Open the file for reading in binary mode and load the contents
with open('food.pickle', 'rb') as f:
    my_food = pickle.load(f)

In [8]:
my_food

{'drink': 'Rip It',
 'sides': 'fries',
 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}

Voila. You got your data back.

Let's take a moment to consider the warning from the Python documentation: 'Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.'

The next cell shows why you should vet any data you load with pickle, and why pickled data sent over a network needs particular care: unpickling the byte string below runs a shell command.

Further reading: https://blog.nelhage.com/2011/03/exploiting-pickle/

In [9]:
# run me I won't hurt I promise
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")

# look in your jupyter notebook command prompt

0
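One mitigation described in the Python documentation is to restrict which globals an unpickler may load by overriding `Unpickler.find_class`. The sketch below follows that idea; the names `SAFE_BUILTINS`, `RestrictedUnpickler`, and `restricted_loads` are illustrative choices, and the allow-list is only an example. It narrows the attack surface but is not a guarantee of safety, so the advice to avoid untrusted pickles still stands.

```python
import builtins
import io
import pickle

# Illustrative allow-list; choose the globals your own data really needs
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global not on the allow-list."""

    def find_class(self, module, name):
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but using the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and strings need no globals, so they load normally
print(restricted_loads(pickle.dumps({'sides': 'fries', 'count': [1, 2]})))

# The payload from the cell above asks for os.system and is rejected
try:
    restricted_loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
except pickle.UnpicklingError as err:
    print('Blocked:', err)
```

For data that crosses a trust boundary, a format such as JSON, which cannot name callables at all, is usually the simpler choice.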

## Shelve

According to Python.org 'A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects — anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings.
...
Because the shelve module is backed by pickle, it is insecure to load a shelf from an untrusted source. Like with pickle, loading a shelf can execute arbitrary code.'

Source: https://docs.python.org/3/library/shelve.html

In [11]:
# Import shelve and open one up
import shelve
filename = 'food.shelve'
d = shelve.open(filename)

In [12]:
# store my_food in shelf_of_food
d['shelf_of_food'] = my_food

In [13]:
# we can also access it directly, just like an in-memory dictionary
d['shelf_of_food']

{'drink': 'Rip It',
 'sides': 'fries',
 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}

In [15]:
# we can delete keys... or can we?
del d['shelf_of_food']['drink']

In [16]:
# show the contents again, did it work?
d['shelf_of_food']

{'drink': 'Rip It',
 'sides': 'fries',
 'sandwich': ['burger', 'bun', 'lettuce', 'tomato', 'pickle', 'ketchup']}

In [19]:
# close it
d.close()
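The deletion above had no lasting effect because, by default, `d[key]` unpickles a fresh copy of the stored value on each access, so changes to that copy are never written back. The sketch below shows the behaviour and two ways around it, using a throwaway shelf in a temporary directory; the file name `demo.shelve` and the `meal` data are made up for illustration.

```python
import os
import shelve
import tempfile

# Work in a temporary directory so the lesson's own food.shelve is untouched
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.shelve')  # hypothetical file name

    # Default (writeback=False): each d[key] access returns a new copy
    with shelve.open(path) as d:
        d['meal'] = {'drink': 'water', 'sides': 'fries'}
        del d['meal']['drink']      # changes a temporary copy only
        unchanged = d['meal']

        # Option 1: read, modify, then assign back under the same key
        meal = d['meal']
        del meal['drink']
        d['meal'] = meal
        after_reassign = d['meal']

    # Option 2: writeback=True caches accessed entries and
    # writes them back when the shelf is synced or closed
    with shelve.open(path, writeback=True) as d:
        d['meal']['dessert'] = 'pie'

    with shelve.open(path) as d:
        after_writeback = d['meal']

print(unchanged)
print(after_reassign)
print(after_writeback)
```

`writeback=True` is convenient but keeps every accessed entry in memory and can make closing slower, so explicit reassignment may be preferable for large shelves.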

In [None]:
# Open it up again so we can actually add to entries/delete entries
# <- remember kwargs?

##### Exercise 2

Demonstrate that you can manipulate a shelf.  Append 'mustard' to the sandwich key in this shelf.  Remember that the sandwich key in your dictionary contains a list, so you can use list methods (https://docs.python.org/3/tutorial/datastructures.html) to add an entry.

In [None]:
# NOT IMPLEMENTED YET - Append 'mustard' to the sandwich key

In [None]:
# now things work as expected

In [None]:
# close it

## JSON

According to Python.org: 'JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).

json exposes an API familiar to users of the standard library marshal and pickle modules.'

Source: https://docs.python.org/3/library/json.html
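As a rough illustration of how closely the `json` API mirrors `pickle`, the sketch below round-trips a dictionary through a string and through a file. The sample data and the file name `food.json` are assumptions for this example; note that JSON is text, so the file is opened in `'w'`/`'r'` mode rather than `'wb'`/`'rb'`.

```python
import json
import os
import tempfile

# Hypothetical sample data shaped like the earlier pickle example
my_food = {'drink': 'water', 'sandwich': ['burger', 'bun'], 'to_go': True, 'coupon': None}

# dumps/loads: Python object <-> JSON text (a str, not bytes)
text = json.dumps(my_food, indent=2)
restored = json.loads(text)

# JSON has no tuple type, so tuples come back as lists
pair = json.loads(json.dumps({'point': (1, 2)}))

# dump/load: the same conversion against a text-mode file object
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'food.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(my_food, f)
    with open(path, encoding='utf-8') as f:
        from_file = json.load(f)

print(text)
print(restored == my_food, from_file == my_food)
print(pair)
```

Unlike pickle, JSON only covers basic types (objects, arrays, strings, numbers, booleans, null), which is part of why it is safer to exchange with other systems.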

##### Exercise 3

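JSON will likely be the most common serialized data format you work with.  Following the Real Python tutorial linked below, filter the TODOs loaded from the JSON response and write the filtered TODOs to a file. Opening a file handle and writing to it works much as it did earlier, except that JSON is text, so open the file in text mode ('w') rather than binary mode ('wb').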

https://realpython.com/python-json/

In [None]:
import json
import requests

In [None]:
response = requests.get("https://jsonplaceholder.typicode.com/todos")
todos = json.loads(response.text)

In [None]:
# NOT IMPLEMENTED YET - write filtered TODOs to file

## XML

XML stands for eXtensible Markup Language. It was designed to store and transport data and to be both human- and machine-readable. Accordingly, the design goals of XML emphasize simplicity, generality, and usability across the Internet. The XML file parsed in this tutorial is an RSS feed.

Source: https://www.geeksforgeeks.org/xml-parsing-python/

Additional reading: https://docs.python.org/3/library/xml.html
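Before working with a live feed, it can help to see the parsing step on a small document held in memory. The sketch below uses `xml.etree.ElementTree` on a made-up RSS snippet; the feed contents and example.com URLs are invented, and real feeds carry more fields. Namespaced tags such as `media:content` appear in ElementTree with the namespace URI in braces.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS document used only for illustration
rss = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
      <media:content url="https://example.com/1.jpg"/>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)   # root is the <rss> element

items = []
for item in root.findall('./channel/item'):
    # Namespaced child: '{namespace-uri}localname'
    media = item.find('{http://search.yahoo.com/mrss/}content')
    items.append({
        'title': item.findtext('title'),
        'link': item.findtext('link'),
        'media': media.get('url') if media is not None else None,
    })

print(items)
```

`ET.parse(filename)` followed by `getroot()` gives the same kind of root element when the XML lives in a file.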

##### Exercise 4

Parse and save the contents of the RSS feed into a CSV file based on the GeeksForGeeks exercise.

In [None]:
# Based on the GeeksForGeeks example: download the feed, parse it, write a CSV

"""Here, we first create an HTTP response object by sending an HTTP request to the URL of the RSS feed.
The content of the response contains the XML file data, which we save as topnewsfeed.xml in
our local directory."""


# Python code to illustrate parsing of XML files
# importing the required modules
import csv
import requests
import xml.etree.ElementTree as ET


def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from given url
    resp = requests.get(url)
    # stop early on an HTTP error instead of saving an error page as XML
    resp.raise_for_status()
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)


def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object media:content
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                # keep the text as str; in Python 3, encoding to bytes here
                # would write b'...' literals into the CSV, and empty
                # elements (text is None) would raise an AttributeError
                news[child.tag] = child.text
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems


def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']
    # writing to csv file; the csv module expects newline=''
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        # creating a csv dict writer object; tags not listed in fields are skipped
        writer = csv.DictWriter(csvfile, fieldnames=fields, extrasaction='ignore')
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)


def main():
    # load rss from web to update existing xml file
    loadRSS()
    # parse xml file
    newsitems = parseXML('topnewsfeed.xml')
    # store news items in a csv file
    savetoCSV(newsitems, 'topnews.csv')


if __name__ == "__main__":
    # calling main function
    main()