Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

In [13]:
import io
import os.path

import pandas as pd
import requests
from lxml import etree

datadir = "public_data"
filename = "ind0.xml"  # Text file encoded as UTF-8
path = os.path.join(datadir, filename)

protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/data/{}"

buildURL = lambda s: "{}://{}{}".format(protocol, location, resourcepath.format(s))


def print_tree(node, pretty_print=True, encoding="utf-8"):
    result = etree.tostring(node, pretty_print=pretty_print)
    if isinstance(result, bytes):
        result = result.decode(encoding)
    print(result)

## Building Tree From Existing XML

### Simple Local File

In [53]:
tree = etree.parse(path)
root = tree.getroot()

### Custom Parser Local File

In [47]:
myparser = etree.XMLParser(remove_blank_text=True)

tree = etree.parse(path, myparser)
root = tree.getroot()

### Network Request

In [16]:
response = requests.get(buildURL(filename))
assert response.status_code == 200

fileObj = io.BytesIO(response.content)
tree = etree.parse(fileObj, myparser)  # parse the downloaded bytes, not the local file
root = tree.getroot()
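
Since `etree.parse()` accepts a file-like object as well as a path, the wrapping step can be tried without a network connection. The following is a minimal sketch: `payload` is a hypothetical stand-in for `response.content`, and the import falls back to the standard library's `xml.etree.ElementTree` (an assumption for portability; it supports the same calls used here) if `lxml` is not installed.

```python
import io

try:
    from lxml import etree  # library used in this notebook
except ImportError:  # assumed fallback: stdlib module with the same basic calls
    import xml.etree.ElementTree as etree

# Hypothetical payload standing in for `response.content`
payload = b'<indicators><country code="FRA" name="France"/></indicators>'

file_obj = io.BytesIO(payload)  # wrap the raw bytes in a file-like object
tree = etree.parse(file_obj)    # parse from the buffer rather than from a path
root = tree.getroot()

print(root.tag, len(root))
```

Passing the buffer, rather than a path, is what makes the parsed tree reflect the downloaded content.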

In [11]:
root

<Element indicators at 0x7fa3ad74ee40>

## Basic Operations

As an aid for working with Element nodes, we summarize some of the fundamental operations:

Operation     |  Syntax Hint  |Brief Description
:-------------|:--------------|:-----------------------------------------
Get a Child   | `[index]`     |Access the node's child at index
Get tag       | `.tag`        |Obtain tag of node
Get text      | `.text`       |Obtain text of node up to child node or end tag
Access all attributes | `.attrib` | Obtain dictionary of all of node's xml attributes
Access one attribute | `.get()` | Fetch value for specified attribute, or `None` if not present
Find child node | `.find()` | Search for first child matching search specification (by tag)
Iterator child search | `.iterfind()` | Iterator for all children matching search specification (by tag)
Unconditional Child Iteration | *node* | A node itself can be used as an iterator to obtain all children in document order
Count children | `len(`*node*`)` | Find the number of children of a node
Iterator over descendants | `.iter()` | Iterator over the node and all of its descendants, in document order
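
The operations in the table can be tried on a small inline document. This is a sketch only: the XML string is a trimmed copy of the France entry, and the import falls back to the standard library's `xml.etree.ElementTree` (an assumption for portability) if `lxml` is unavailable.

```python
try:
    from lxml import etree  # library used in this notebook
except ImportError:  # assumed fallback: stdlib module with the same basic calls
    import xml.etree.ElementTree as etree

xml_text = """<indicators>
  <country code="FRA" name="France">
    <timedata year="2007"><pop>64.02</pop><gdp>2657.21</gdp></timedata>
    <timedata year="2017"><pop>66.87</pop><gdp>2586.29</gdp></timedata>
  </country>
</indicators>"""
root = etree.fromstring(xml_text)

country = root[0]                        # get a child by index
tag = country.tag                        # tag of the node
attributes = dict(country.attrib)        # all XML attributes as a dictionary
name = country.get("name")               # one attribute
missing = country.get("region")          # None when the attribute is absent
first = country.find("timedata")         # first matching child
pop_text = first.find("pop").text        # text of a node (always a string)
years = [t.get("year") for t in country.iterfind("timedata")]  # all matching children
child_tags = [child.tag for child in country]                  # unconditional child iteration
n_children = len(country)                # number of children
all_tags = [el.tag for el in root.iter()]  # the node itself, then every descendant

print(tag, name, missing, pop_text, years, n_children)
print(all_tags)
```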


**Q** Print the full tree.  You can use the provided `print_tree()` function.  Try it with different arguments for the named parameters.

In [39]:
print_tree(root)
# raise NotImplementedError()

<indicators>
  <country code="FRA" name="France">
    <timedata year="2007">
      <pop>64.02</pop>
      <gdp>2657.21</gdp>
    </timedata>
    <timedata year="2017">
      <pop>66.87</pop>
      <gdp>2586.29</gdp>
    </timedata>
  </country>
  <country code="GBR" name="United Kingdom">
    <timedata year="2007">
      <pop>61.32</pop>
      <gdp>3084.12</gdp>
    </timedata>
    <timedata year="2017">
      <pop>66.06</pop>
      <gdp>2637.87</gdp>
    </timedata>
  </country>
  <country code="USA" name="United States">
    <timedata year="2007">
      <pop>301.23</pop>
      <gdp>14451.9</gdp>
    </timedata>
    <timedata year="2017">
      <pop>325.15</pop>
      <gdp>19485.4</gdp>
    </timedata>
  </country>
</indicators>



**Q** Get the index 2 child of the root, assign it to `node` and print the tree rooted at `node`.

In [40]:
# YOUR CODE HERE
node = root[2]
print_tree(node)
# raise NotImplementedError()

<country code="USA" name="United States">
    <timedata year="2007">
      <pop>301.23</pop>
      <gdp>14451.9</gdp>
    </timedata>
    <timedata year="2017">
      <pop>325.15</pop>
      <gdp>19485.4</gdp>
    </timedata>
  </country>




**Q** Get the index 1 node of the index 2 node of the root, assign it to `node`, and then find the `gdp`-tagged child of `node`, and assign it to `gdp_node`.

In [41]:
# YOUR CODE HERE
node = root[2][1]
gdp_node = node.find("gdp")
print_tree(node)
print_tree(gdp_node)

# raise NotImplementedError()

<timedata year="2017">
      <pop>325.15</pop>
      <gdp>19485.4</gdp>
    </timedata>
  

<gdp>19485.4</gdp>
    



**Q** Repeat the above, but then obtain the value (based on the text of the gdp_node), and assign to `gdp_value`, and then assign to `value` 10% more than `gdp_value`.  Print this final value.

In [51]:
# YOUR CODE HERE

node = root[2][1]
gdp_value = node.find("gdp").text
print_tree(node)
print(gdp_value, type(gdp_value))

# raise NotImplementedError()

<timedata year="2017">
      <pop>325.15</pop>
      <gdp>19485.4</gdp>
    </timedata>
  

19485.4 <class 'str'>


In [52]:
value = 1.1 * float(gdp_value)  # 10% more than the parsed GDP
print(round(value, 2))

21433.94


**Q** Iterate over the `country` nodes and check each for a case where the letter `'A'` appears in the node's `code` XML attribute.  If found, print the value of the `name` attribute.

In [58]:
# YOUR CODE HERE
for country_node in root:
    if "A" in country_node.get("code"):
        print(country_node.get("name"))
# raise NotImplementedError()

France
United States




**Q** Use nested loops to accumulate a list of **just** the `timedata` nodes for the year "2017".  

Hints:

- Initialize an empty list
- Outer loop will iterate over root's `country` nodes
- Inner loop will iterate over each country node's `timedata` nodes
- For each of these, check for the attribute of the `timedata` node to be equal to the string `"2017"`.  If found, accumulate into the list.

In [59]:
# YOUR CODE HERE

timelist = []
for country_node in root:
    for timedata in country_node.iter("timedata"):
        if timedata.get("year") == "2017":
            timelist.append(timedata)

# raise NotImplementedError()
for node in timelist:
    print_tree(node)

<timedata year="2017">
      <pop>66.87</pop>
      <gdp>2586.29</gdp>
    </timedata>
  

<timedata year="2017">
      <pop>66.06</pop>
      <gdp>2637.87</gdp>
    </timedata>
  

<timedata year="2017">
      <pop>325.15</pop>
      <gdp>19485.4</gdp>
    </timedata>
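
As a cross-check on the nested-loop accumulation, the same selection can be written as a single path expression with an attribute predicate. This is a sketch on a small inline document (a trimmed copy of two countries), not a replacement for the exercise; the import falls back to the standard library's `xml.etree.ElementTree` (an assumption for portability) if `lxml` is unavailable.

```python
try:
    from lxml import etree  # library used in this notebook
except ImportError:  # assumed fallback: stdlib module with the same basic calls
    import xml.etree.ElementTree as etree

xml_text = """<indicators>
  <country code="FRA" name="France">
    <timedata year="2007"><pop>64.02</pop></timedata>
    <timedata year="2017"><pop>66.87</pop></timedata>
  </country>
  <country code="USA" name="United States">
    <timedata year="2007"><pop>301.23</pop></timedata>
    <timedata year="2017"><pop>325.15</pop></timedata>
  </country>
</indicators>"""
root = etree.fromstring(xml_text)

# Accumulation pattern: outer loop over countries, inner loop over timedata children
nested = []
for country in root:
    for timedata in country.iterfind("timedata"):
        if timedata.get("year") == "2017":
            nested.append(timedata)

# Same selection as one path: timedata children of country children with year="2017"
selected = root.findall("country/timedata[@year='2017']")

nested_pops = [t.find("pop").text for t in nested]
selected_pops = [t.find("pop").text for t in selected]
print(nested_pops, selected_pops)
```

Both approaches visit the nodes in document order, so the two lists line up element by element.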
  



**Bonus Q** Write a function

    recursive_printtags(node)
    
that prints the tag of the given node, and then recurses to print the tags of the subtree rooted at each of its children.

In [64]:
# YOUR CODE HERE
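def recursive_printtags(node):
    # Print this node's tag (plus its name attribute, or its text if unnamed)
    print((node.tag.title(), node.attrib.get("name", node.text)))
    # Recurse into each child in document order
    for child in node:
        recursive_printtags(child)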


# raise NotImplementedError()
recursive_printtags(root)

('Indicators', '\n  ')
('Country', 'France')
('Timedata', '\n      ')
('Pop', '64.02')
('Gdp', '2657.21')
('Timedata', '\n      ')
('Pop', '66.87')
('Gdp', '2586.29')
('Country', 'United Kingdom')
('Timedata', '\n      ')
('Pop', '61.32')
('Gdp', '3084.12')
('Timedata', '\n      ')
('Pop', '66.06')
('Gdp', '2637.87')
('Country', 'United States')
('Timedata', '\n      ')
('Pop', '301.23')
('Gdp', '14451.9')
('Timedata', '\n      ')
('Pop', '325.15')
('Gdp', '19485.4')
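
A variation on the same recursion passes the current depth down so that the nesting becomes visible in the output. This is a sketch only: `tag_lines` is a hypothetical helper name, the inline XML is a trimmed copy of one country, and the import falls back to the standard library's `xml.etree.ElementTree` (an assumption for portability) if `lxml` is unavailable.

```python
try:
    from lxml import etree  # library used in this notebook
except ImportError:  # assumed fallback: stdlib module with the same basic calls
    import xml.etree.ElementTree as etree


def tag_lines(node, depth=0):
    """Return one indented line per element, each parent before its children."""
    lines = ["  " * depth + node.tag]
    for child in node:  # a node iterates over its children in document order
        lines.extend(tag_lines(child, depth + 1))
    return lines


root = etree.fromstring(
    '<indicators><country code="FRA" name="France">'
    '<timedata year="2017"><pop>66.87</pop><gdp>2586.29</gdp></timedata>'
    "</country></indicators>"
)
lines = tag_lines(root)
print("\n".join(lines))
```

Returning the lines instead of printing inside the recursion keeps the traversal separate from the display, which makes the function easier to check.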


## From XML to Build a Data Frame

Often, we obtain XML-formatted data, but for manipulation, transformation, and analysis, we need to construct one or more tabular data frames.  This gives us a functional use and practice for our various procedural operations learned today.

In particular, we want to take an XML-based data set of our topnames information and construct a data frame with columns `year`, `sex`, `name`, and `count`.

**Q** Using one of the techniques at the beginning of this notebook, retrieve (from local file or from the network), the xml tree in the resource file `"topnames.xml"`, and assign to `root` the Element at the root of that tree.  Finish by printing the number of children of the root node.

In [65]:
# YOUR CODE HERE
from lxml import etree

path = "public_data/topnames.xml"
tree = etree.parse(path)
root = tree.getroot()
print(len(root))  # number of children of the root node
# raise NotImplementedError()

**Q** Using nested loops, with the outer loop iterating over the children of the topnames root, and the inner iterating over those children's children, print out, inside the inner loop, the value of the year (from the outer node's xml-attribute) and the value of the sex (from the inner node's xml-attribute).  A prefix of the resultant output:
```
1880 Female
1880 Male
1881 Female
1881 Male
1882 Female
```

In [71]:
# YOUR CODE HERE
for year in root[:5]:
    for sex in year.iter("sex"):
        print(year.get("value"), sex.get("value"))
# raise NotImplementedError()

1880 Female
1880 Male
1881 Female
1881 Male
1882 Female
1882 Male
1883 Female
1883 Male
1884 Female
1884 Male


**Q** We saw, from the book, that it is convenient in these cases to collect our row data in a list, with each element of the list being a **dictionary** in which the **keys** are the names of the columns/fields and the **values** contain the value, for that row, of the given field.  Without yet worrying about the `name` and `count` columns, let us build such a List of Dictionaries (LoD) for the year and sex combinations.  A prefix of the LoD would look like the following:

```
[
  {'year': 1880, 'sex': 'Female'}, 
  {'year': 1880, 'sex': 'Male'}, 
  {'year': 1881, 'sex': 'Female'}, 
  {'year': 1881, 'sex': 'Male'}, 
  {'year': 1882, 'sex': 'Female'}, 
  {'year': 1882, 'sex': 'Male'},
  ...
]
```
Augment your code from the last question to build the List of Dictionaries, using the typical accumulation pattern, starting with an empty list named `LoD` and replacing your `print()` with the creation and appending of the dictionary needed for each row.

In [75]:
# YOUR CODE HERE
# raise NotImplementedError()
LoD = []
for year in root:
    for sex in year.iter("sex"):
        d = {"year": int(year.get("value")), "sex": sex.get("value")}
        LoD.append(d)


assert len(LoD) == 278
assert isinstance(LoD, list)
assert isinstance(LoD[0], dict)
assert LoD[0]["year"] == 1880
assert LoD[0]["sex"] == "Female"

print(LoD[:5])

[{'year': 1880, 'sex': 'Female'}, {'year': 1880, 'sex': 'Male'}, {'year': 1881, 'sex': 'Female'}, {'year': 1881, 'sex': 'Male'}, {'year': 1882, 'sex': 'Female'}]


**Q** Finish your build of the LoD by including in each dictionary the value (based on the `.text` attribute) of the `name` child and the `count` child of the `sex` node from the inner loop.

In [79]:
# YOUR CODE HERE
def child_value(node, tag):
    # Return the text of the first child with the given tag, or None if absent
    first_find = node.find(tag)
    if first_find is not None:
        return first_find.text
    return None


LoD = []
for year in root:
    for sex in year.iter("sex"):
        d = {
            "year": int(year.get("value")),
            "sex": sex.get("value"),
            "name": child_value(sex, "name"),
            "count": int(child_value(sex, "count")),
        }
        LoD.append(d)


# raise NotImplementedError()
assert len(LoD) == 278
assert isinstance(LoD, list)
assert isinstance(LoD[0], dict)
assert LoD[0]["year"] == 1880
assert LoD[0]["sex"] == "Female"
assert LoD[0]["name"] == "Mary"
assert LoD[0]["count"] == 7065

LoD[:4]

[{'year': 1880, 'sex': 'Female', 'name': 'Mary', 'count': 7065},
 {'year': 1880, 'sex': 'Male', 'name': 'John', 'count': 9655},
 {'year': 1881, 'sex': 'Female', 'name': 'Mary', 'count': 6919},
 {'year': 1881, 'sex': 'Male', 'name': 'John', 'count': 8769}]
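
For comparison, the element method `.findtext()` looks up a child and returns its text in one call, with an optional default when the child is missing. The sketch below assumes a `year`/`sex`/`name`/`count` layout matching the loops above; the root tag `topnames` is a guess, and the import falls back to the standard library's `xml.etree.ElementTree` (an assumption for portability) if `lxml` is unavailable.

```python
try:
    from lxml import etree  # library used in this notebook
except ImportError:  # assumed fallback: stdlib module with the same basic calls
    import xml.etree.ElementTree as etree

# Assumed layout; the root tag name is hypothetical
xml_text = """<topnames>
  <year value="1880">
    <sex value="Female"><name>Mary</name><count>7065</count></sex>
    <sex value="Male"><name>John</name><count>9655</count></sex>
  </year>
</topnames>"""
root = etree.fromstring(xml_text)

rows = []
for year in root:
    for sex in year.iterfind("sex"):
        rows.append(
            {
                "year": int(year.get("value")),      # attributes are strings
                "sex": sex.get("value"),
                "name": sex.findtext("name"),         # text of the first `name` child
                "count": int(sex.findtext("count")),  # convert text before arithmetic
            }
        )

# A missing child yields None, or the supplied default
no_rank = root[0][0].findtext("rank")
rank_or_default = root[0][0].findtext("rank", default="n/a")

print(rows)
print(no_rank, rank_or_default)
```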

**Q** For the final step, use pandas to construct a data frame from the list of dictionaries; set the index of this data frame to the independent variables of `year` and `sex`, and display the head() of the resultant data frame.

In [80]:
# YOUR CODE HERE
def child_value(node, tag):
    # Return the text of the first child with the given tag, or None if absent
    first_find = node.find(tag)
    if first_find is not None:
        return first_find.text
    return None


LoD = []
for year in root:
    for sex in year.iter("sex"):
        d = {
            "year": int(year.get("value")),
            "sex": sex.get("value"),
            "name": child_value(sex, "name"),
            "count": int(child_value(sex, "count")),
        }
        LoD.append(d)

df = pd.DataFrame(LoD)
df.set_index(["year", "sex"], inplace=True)

df.head(10)

# raise NotImplementedError()

year | sex    | name | count
:----|:-------|:-----|------:
1880 | Female | Mary | 7065
1880 | Male   | John | 9655
1881 | Female | Mary | 6919
1881 | Male   | John | 8769
1882 | Female | Mary | 8148
1882 | Male   | John | 9557
1883 | Female | Mary | 8012
1883 | Male   | John | 8894
1884 | Female | Mary | 9217
1884 | Male   | John | 9388
