# 6 - Data Encoding and Processing

## Reading and Writing CSV Data

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
df

,A,B,C,D
0,0.578421,1.21362,-0.732249,-2.849113
1,0.07596,0.735471,-0.245389,0.982651
2,1.146799,1.763106,1.404748,-1.197465
3,-1.060406,-1.082608,1.003626,-0.990171
4,-0.428314,-1.117124,0.78425,-0.434769


In [3]:
file_name = "test_data.csv"

df.to_csv(file_name)

In [4]:
import os

os.path.isfile(file_name)

True

In [5]:
dfr = pd.read_csv(file_name)
dfr

,Unnamed: 0,A,B,C,D
0,0,0.578421,1.21362,-0.732249,-2.849113
1,1,0.07596,0.735471,-0.245389,0.982651
2,2,1.146799,1.763106,1.404748,-1.197465
3,3,-1.060406,-1.082608,1.003626,-0.990171
4,4,-0.428314,-1.117124,0.78425,-0.434769


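pandas provides a clean way to read and write CSV files. Note that to_csv() also writes the index as an unnamed first column, which read_csv() reads back as the column Unnamed: 0; passing index=False to to_csv() or index_col=0 to read_csv() avoids this. Alternatively, the csv module from the standard library can be used: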

In [6]:
import csv

with open(file_name) as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        print(row)


['0', '0.5784206468336923', '1.2136204592697122', '-0.7322489731430893', '-2.84911273983835']
['1', '0.07596047207284655', '0.7354709000827151', '-0.2453891691302662', '0.9826508560764818']
['2', '1.146799123292891', '1.763105869585345', '1.404748156315179', '-1.1974647197868011']
['3', '-1.0604061326077359', '-1.0826079764500174', '1.0036262065245922', '-0.9901708378243262']
['4', '-0.42831386072217076', '-1.1171242932657632', '0.7842497569749508', '-0.43476896864505865']


In [7]:
headers

['', 'A', 'B', 'C', 'D']

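A cleaner way to read the data is to wrap each row in a namedtuple, so that fields can be accessed by column name. The empty first header is renamed to index because namedtuple field names must be valid identifiers: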

In [14]:
from collections import namedtuple

rows = []
with open(file_name) as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    headings[0] = "index"
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        rows.append(row)


In [16]:
row = rows[0]
row

Row(index='0', A='0.5784206468336923', B='1.2136204592697122', C='-0.7322489731430893', D='-2.84911273983835')

In [18]:
row.A, row.B

('0.5784206468336923', '1.2136204592697122')
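
Note that the csv module returns every field as a string. Below is a minimal sketch of converting the fields while reading, assuming the same column layout as above (an integer index followed by four floats). The col_types list and the in-memory sample values are illustrative assumptions, not part of the original data.

```python
import csv
import io
from collections import namedtuple

# Small in-memory sample with the same layout as test_data.csv (illustrative values)
sample = io.StringIO(
    ",A,B,C,D\n"
    "0,0.5,1.25,-0.75,-2.0\n"
    "1,0.125,0.75,-0.25,1.0\n"
)

# Assumed converters, one per column: integer index, then four floats
col_types = [int, float, float, float, float]

typed_rows = []
f_csv = csv.reader(sample)
headings = next(f_csv)
headings[0] = "index"  # the empty first header is not a valid field name
Row = namedtuple("Row", headings)
for r in f_csv:
    # Apply each converter to the matching field of the row
    typed_rows.append(Row(*(convert(value) for convert, value in zip(col_types, r))))

print(typed_rows[0])
```

With real data this kind of conversion can fail on missing or malformed values, so it is only appropriate when the column types are known in advance.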

In [20]:
import csv

rows = []
with open(file_name) as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        rows.append(row)


In [21]:
rows

[OrderedDict([('', '0'),
              ('A', '0.5784206468336923'),
              ('B', '1.2136204592697122'),
              ('C', '-0.7322489731430893'),
              ('D', '-2.84911273983835')]),
 OrderedDict([('', '1'),
              ('A', '0.07596047207284655'),
              ('B', '0.7354709000827151'),
              ('C', '-0.2453891691302662'),
              ('D', '0.9826508560764818')]),
 OrderedDict([('', '2'),
              ('A', '1.146799123292891'),
              ('B', '1.763105869585345'),
              ('C', '1.404748156315179'),
              ('D', '-1.1974647197868011')]),
 OrderedDict([('', '3'),
              ('A', '-1.0604061326077359'),
              ('B', '-1.0826079764500174'),
              ('C', '1.0036262065245922'),
              ('D', '-0.9901708378243262')]),
 OrderedDict([('', '4'),
              ('A', '-0.42831386072217076'),
              ('B', '-1.1171242932657632'),
              ('C', '0.7842497569749508'),
              ('D', '-0.43476896864505865')])]

In [23]:
rows[0]["A"]

'0.5784206468336923'

In [24]:
rows[0]["B"]

'1.2136204592697122'
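
The cells above only read with the csv module. For the writing side, here is a minimal sketch using csv.DictWriter. The column names and records are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical column names and records, for illustration only
headers = ["Symbol", "Price", "Shares"]
records = [
    {"Symbol": "AA", "Price": 39.48, "Shares": 100},
    {"Symbol": "AIG", "Price": 71.38, "Shares": 50},
]

# An in-memory buffer stands in for a file here; with a real file,
# open it with newline="" as the csv documentation recommends.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)

# Read the data back; every value comes back as a string
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip[0]["Price"])
```

For plain sequences rather than dictionaries, csv.writer with writerow() and writerows() works the same way.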

## Reading and Writing JSON Data
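The two main functions are json.dumps() and json.loads() from the json module, which convert between Python objects and JSON strings. For files, use json.dump() and json.load() instead.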

In [25]:
import json
data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}

json_str = json.dumps(data)

In [26]:
json_str

'{"name": "ACME", "shares": 100, "price": 542.23}'

In [27]:
data = json.loads(json_str)
data

{'name': 'ACME', 'shares': 100, 'price': 542.23}

In [28]:
file_name = "test_json.json"

with open(file_name, "w") as f:
    json.dump(data, f)


In [29]:
import os

os.path.isfile(file_name)

True

In [31]:
with open(file_name, "r") as f:
    print(json.load(f))


{'name': 'ACME', 'shares': 100, 'price': 542.23}


The format of JSON encoding is almost identical to Python syntax except for a few minor changes. For instance, True is mapped to true, False is mapped to false, and None is mapped to null. 

In [34]:
json.dumps(True)

'true'

In [33]:
json.dumps(False)

'false'

In [35]:
json.dumps(None)

'null'
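
A few other differences show up on a round trip. The sketch below, using a made-up record, shows the indent and sort_keys options for readable output, a tuple coming back as a list, and a non-string dictionary key coming back as a string.

```python
import json

# Made-up record; the "lot" tuple is there to show the tuple -> list change
record = {"name": "ACME", "shares": 100, "price": 542.23, "lot": (1, 2)}

# indent and sort_keys make the encoded text easier to read
pretty = json.dumps(record, indent=4, sort_keys=True)
print(pretty)

# JSON has no tuple type: the tuple is encoded as an array and decoded as a list
decoded = json.loads(pretty)
print(decoded["lot"])

# JSON object keys are always strings, so the integer key 1 becomes "1"
keys_round_trip = json.loads(json.dumps({1: "one"}))
print(keys_round_trip)
```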

## Parsing Simple XML Data
The xml.etree.ElementTree module can be used to extract data from simple XML documents. 

In [41]:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

In [42]:
doc

<xml.etree.ElementTree.ElementTree at 0x24e52886a58>

In [46]:
doc.getroot()

<Element 'rss' at 0x0000024E5287E638>

In [44]:
stop_after = 5
counter = 0

for item in doc.iterfind('channel/item'):
    counter += 1
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')
    print(title)
    print(date)
    print(link)
    print()
    if counter >= stop_after:
        break


PyCoder’s Weekly: Issue #374 (June 25, 2019)
Tue, 25 Jun 2019 19:30:00 +0000
https://pycoders.com/issues/374

Continuum Analytics Blog: How We Made Conda Faster in 4.7
Tue, 25 Jun 2019 16:56:09 +0000
https://www.anaconda.com/how-we-made-conda-faster-4-7/

Real Python: Generating Random Data in Python
Tue, 25 Jun 2019 14:00:00 +0000
https://realpython.com/courses/generating-random-data-python/

Reuven Lerner: Announcing: Python standard library, video explainer
Tue, 25 Jun 2019 09:30:19 +0000
https://lerner.co.il/2019/06/25/announcing-python-standard-library-video-explainer/

Talk Python to Me: #218 Serverless Python functions in Azure
Tue, 25 Jun 2019 08:00:00 +0000
https://talkpython.fm/episodes/show/218/serverless-python-functions-in-azure
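
The same lookups can be tried without network access. The following is a minimal offline sketch that parses a small, made-up RSS string with fromstring(); the feed content and the example.com links are placeholders.

```python
from xml.etree.ElementTree import fromstring

# Made-up feed with the same channel/item structure as the RSS feed above
rss_text = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <pubDate>Tue, 25 Jun 2019 19:30:00 +0000</pubDate>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <pubDate>Tue, 25 Jun 2019 16:56:09 +0000</pubDate>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

# fromstring() returns the root Element directly, not an ElementTree
root = fromstring(rss_text)

# Collect (title, date, link) for every item under the channel
items = [
    (item.findtext("title"), item.findtext("pubDate"), item.findtext("link"))
    for item in root.iterfind("channel/item")
]

for title, date, link in items:
    print(title, date, link)
```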



## Parsing Huge XML Files Incrementally
Any time you are faced with the problem of incremental data processing, you should think of iterators and generators.

In [50]:
from xml.etree.ElementTree import iterparse
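
One way iterparse() can be combined with a generator is sketched below. The iter_rows helper and the sample document are hypothetical; a tiny in-memory document stands in for a huge file.

```python
import io
from xml.etree.ElementTree import iterparse

# Tiny in-memory document standing in for a huge file (hypothetical data)
xml_bytes = b"""<?xml version="1.0"?>
<rows>
  <row><zip>60601</zip></row>
  <row><zip>60614</zip></row>
  <row><zip>60601</zip></row>
</rows>"""

def iter_rows(source, tag):
    """Yield each completed element named `tag`, then release its contents."""
    for event, elem in iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs once the caller asks for the next element:
            # drop the children and text that are no longer needed
            elem.clear()

# Count how often each zip code occurs, one row at a time
counts = {}
for row in iter_rows(io.BytesIO(xml_bytes), "row"):
    zip_code = row.findtext("zip")
    counts[zip_code] = counts.get(zip_code, 0) + 1

print(counts)
```

In this simplified version the cleared elements stay attached to the root as empty shells, so memory still grows slowly with the number of rows. Removing each processed element from its parent would avoid that, at the cost of tracking the parent.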

## Turning a Dictionary into XML

In [1]:
from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem


In [2]:
s = { 'name': 'GOOG', 'shares': 100, 'price':490.1 }
s

{'name': 'GOOG', 'shares': 100, 'price': 490.1}

In [3]:
e = dict_to_xml('stock', s)
e

<Element 'stock' at 0x000001F29E15E278>

In [5]:
from xml.etree.ElementTree import tostring

tostring(e)

b'<stock><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'

In [6]:
e.set('_id','1234')
tostring(e)

b'<stock _id="1234"><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'
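
One reason to build Element instances rather than formatting strings by hand is that special characters in values get escaped automatically. A short sketch comparing the two approaches, reusing dict_to_xml from above with a made-up value:

```python
from xml.etree.ElementTree import Element, tostring
from xml.sax.saxutils import escape, unescape

def dict_to_xml(tag, d):
    # Same helper as above, repeated so that this cell runs on its own
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem

d = {"name": "<spam>"}  # made-up value containing markup characters

# Formatting the string by hand leaves the markup characters unescaped
by_hand = "<item><name>{}</name></item>".format(d["name"])

# Element and tostring() escape them automatically
via_element = tostring(dict_to_xml("item", d))

print(by_hand)
print(via_element)

# escape() and unescape() are available for escaping text manually
print(escape("<spam>"), unescape("&lt;spam&gt;"))
```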

## Parsing, Modifying, and Rewriting XML