Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# XPath

The cell below is the same set-up cell as in the previous in-class worksheet: it gets our data either from a local file or over the network. In this worksheet we use local files, but we still provide the general set-up.

Please execute the cell below. The `print_results` function can be a useful way to print out nodes in a node set matched by an XPath expression.

In [3]:
import io
import os.path

import pandas as pd
import requests
from lxml import etree


def print_tree(node, pretty_print=True, encoding="utf-8"):
    """
    This function prints the subtree indicated by a given
    Element, node, decoding if necessary.

    Parameters:
    node is an Element
    pretty_print is a flag variable, True by default
    encoding is a string, 'utf-8' by default
    """
    result = etree.tostring(node, pretty_print=pretty_print)
    if isinstance(result, bytes):
        result = result.decode(encoding)
    print(result)


def print_results(nodeset):
    """
    This function iterates over all Elements in a given list
    of Elements, printing the tag, text, and attributes of each.

    Parameters:
    nodeset - a list of Elements
    """
    print("Length of nodeset result:", len(nodeset))
    for node in nodeset:
        print("Type:", type(node))
        if type(node) == etree._Element:
            print("Tag:", node.tag)
            print("  Text:", node.text)
            print("  Attrib:", node.attrib)
        else:
            print(node)
        print()


protocol = "http"
location = "personal.denison.edu"
resourcepath = "/~bressoud/datasystems/data/{}"

def buildURL(s):
    """Build the full URL for the data file named s."""
    return "{}://{}{}".format(protocol, location, resourcepath.format(s))

datadir = "public_data"
filename = "ind0.xml"  # Text file encoded as UTF-8
path = os.path.join(datadir, filename)


indtree = etree.parse(path)
indroot = indtree.getroot()

filename = "topnames.xml"
path = os.path.join(datadir, filename)

toptree = etree.parse(path)
toproot = toptree.getroot()

filename = "school.xml"
path = os.path.join(datadir, filename)

schtree = etree.parse(path)
schroot = schtree.getroot()

## Core XPath operations

1. Get root node by xpath 
2. Get **single node** (in a list) by specifying a complete element path. The path can lead to:

    a. an Element node (by leading to its tag)  
    b. the text of a node (via `.../text()`)  
    c. an attribute (via `@`)  
    
3. Match multiple nodes **at a single level of the hierarchy** using an element path to that level. Such a path can lead to:

    a. an Element node (by leading to its tag)  
    b. the text of a node (via `.../text()`)  
    c. an attribute (via `@`)  
    
4. Match all nodes with a particular element + attribute match

    a. Specify the element with the path, and use a **predicate** to specify the attribute match.  
    b. Can also specify a set of attribute values, e.g., using an inequality in a predicate.  
    c. Can do the same with a set of text values.  
    d. Can match at multiple levels of the hierarchy (depending on tag names in your tree), if you use `//`.  
    e. Can match all nodes in the tree (`//node()`), all elements (`//*`), all text nodes (`//text()`), or all attributes (`//@*`)  
    
5. Match all nodes with a particular element + attribute combination 

    a. Can use `and`/`or` keywords inside a predicate.  
    b. Can use `not` inside a predicate, to see if an attribute or child is present or not.  
    c. Can use a **predicate that includes a path**, e.g., to match all nodes whose child (with given element specification) has a text or attribute matching some given value.  
    d. Can use `|` between two XPath expressions to match one or the other (or both).  
    
6. Climb back up the tree from a nodeset by adding a parent specification (`..`) to a set of found nodes

Syntax:

Expression | Meaning
:---------:|:------------
`/`        | When the first character, means the traversal starts at the root of the tree.  If not the first character, is used to separate location-steps in the set of possible traversals.
`.`        | Refers to the current node of the traversal.
`..`       | Means the parent of the current node of the traversal.  Every node other than the root has exactly one parent.
`@`        | Used to reference/match an **attribute** (instead of an element tag).
`[]`       | Used, relative to the node of the current location-step, to specify a predicate (i.e. something that results in a boolean true/false, often involving an attribute of the current node).
`or, and` | Used inside a `[]` expression to specify logical operators.
`\|` (vertical bar) | Used between XPaths in the same string to combine the nodeset results from the first XPath with the nodeset results from the second XPath.
`//` | Matches **all** descendant traversals/paths from the node of the current location-step.
`*` | Matches every element, whatever its tag, at the current location-step level.
`@*` | Matches all the attributes relative to the current location-step level (or predicate, if used with `[]`).
`text()` | Extracts the text of the current node.
`position()` | Refers to the index (1-relative) of a child from its parent.
`contains` | Takes two arguments, a string (an attribute or `text()`) and a substring, and returns a boolean for use in a predicate: true if the string contains the substring.
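
The entries in the table can be tried out on a small document. The sketch below uses a hypothetical `library` tree (not one of the worksheet's data files) and is only meant to show each piece of syntax once.

```python
from lxml import etree

# Hypothetical toy document, invented for illustration.
xml = """
<library>
  <shelf id="A">
    <book year="1999"><title>Red Tree</title></book>
    <book year="2005"><title>Blue Sky</title></book>
  </shelf>
  <shelf id="B">
    <book year="2010"><title>Red Sea</title></book>
  </shelf>
</library>
"""
root = etree.fromstring(xml)

# '/' starts at the root and separates location steps; text() ends at the text.
all_titles = root.xpath("/library/shelf/book/title/text()")

# '[]' holds a predicate; '@' refers to an attribute.
shelf_b = root.xpath("/library/shelf[@id='B']/book/title/text()")

# '//' searches all descendants; contains() tests for a substring.
red = root.xpath("//title[contains(text(), 'Red')]/text()")

# position() is 1-based and counts among the book children of each shelf.
second = root.xpath("/library/shelf/book[position() = 2]/@year")

# '..' climbs to the parent; '@*' matches every attribute.
parent_tag = root.xpath("//book[@year='2010']/..")[0].tag
all_attrs = root.xpath("//@*")

print(all_titles, shelf_b, red, second, parent_tag, len(all_attrs))
```

Only shelf `A` has a second book, so `second` holds a single year even though the path visits both shelves.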

### XPath Exercises

For each of the following exercises, create a correct XPath expression to identify the node set described, then use the `xpath()` method to extract this node set from the relevant tree, storing your answer in a list of Elements `nodeset`.

1. Get root node by xpath 

**Example 1** Create a nodeset consisting of the root node of the `topnames` tree, then use the `print_results` function we created above to print the results. Then use `print_tree` to print the information from the Element in the nodeset. We see that `print_results` gives us a more compact output.

In [2]:
nodeset = toproot.xpath(".")
print_results(nodeset)

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: topnames
  Text: 
  
  Attrib: {}



**Q1** Create a nodeset consisting of the root node of the `indicators0` tree, then use the `print_results` function we created above to print the results. Please mimic the above, but use `indroot.xpath`.

In [5]:
nodeset = indroot.xpath("/indicators")


print_results(nodeset)  # should just have one node, indicators

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: indicators
  Text: 
  
  Attrib: {}



2. Get **single node** (in a list) by specifying a complete element path. The path can lead to:
    a. an Element node (by leading to its tag)
    b. the text of a node (via `.../text()`)
    c. an attribute (via `@`)

When there are multiple children with the same tag, use a predicate to specify the relevant attribute value, e.g. `node/child[@attributeName=...]/grandchild`
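
What `xpath()` returns depends on where the path ends. As a minimal sketch on a hypothetical one-year miniature of a `topnames`-style document, the same predicate-qualified path can end at an Element, at its text, or at an attribute:

```python
from lxml import etree

# Hypothetical miniature document, built inline for illustration.
root = etree.fromstring(
    "<topnames>"
    "<year value='1880'>"
    "<sex value='Female'><name>Mary</name><count>7065</count></sex>"
    "</year>"
    "</topnames>"
)
base = "/topnames/year[@value='1880']/sex[@value='Female']"

elem = root.xpath(base + "/count")[0]         # path ends at a tag: an Element
text = root.xpath(base + "/count/text()")[0]  # path ends at text(): a string
attr = root.xpath(base + "/@value")[0]        # path ends at @: also a string

print(type(elem).__name__, elem.text)
print(isinstance(text, str), int(text))
print(isinstance(attr, str), attr)
```

In every case the result of `xpath()` is a list, which is why each lookup indexes with `[0]` before using the value.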

**Example 2a** Find the count of births for the top female name in 1882. First find the node, then extract the count itself as an integer `c`.

In [6]:
# Finding the relevant node

nodeset = toproot.xpath("/topnames/year[@value='1882']/sex[@value='Female']/count")
print_results(nodeset)

# Extracting the actual count
c = int(nodeset[0].text)
print(c)
print(type(c))
print()

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: count
  Text: 8148
  Attrib: {}

8148
<class 'int'>



**Example 2b** Repeat the above, but using a path that goes all the way to the text of the leaf node in question.

In [7]:
# Alternative way that uses a path to the text of the node
results = toproot.xpath(
    "/topnames/year[@value='1882']/sex[@value='Female']/count/text()"
)
c2 = int(results[0])
print(c2)
print(type(c2))

8148
<class 'int'>


**Q2a** Find the gdp of the USA in 2017. First find the node then extract the gdp itself as a float `f`.

In [13]:
nodeset = indroot.xpath("/indicators/country[@code = 'USA']/timedata[@year='2017']/gdp")
f = float(nodeset[0].text)


print_results(nodeset)
print(f)  # should be 19485.4
print(type(f))  # should be float

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: gdp
  Text: 19485.4
  Attrib: {}

19485.4
<class 'float'>


**Q2b** Solve this problem again, but with an XPath expression that leads to the `text()` of the node in question.

In [15]:
f = float(
    indroot.xpath(
        "/indicators/country[@code = 'USA']/timedata[@year='2017']/gdp/text()"
    )[0]
)


print_results(nodeset)

print(f)  # should be 19485.4
print(type(f))  # should be float

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: gdp
  Text: 19485.4
  Attrib: {}

19485.4
<class 'float'>


**Example 2c** Find the first year in the `topnames` dataset. Note that this information is stored in an attribute, so your path should end at an attribute.

In [17]:
nodeset = toproot.xpath("/topnames/year/@value")
firstyear = nodeset[0]
print(firstyear)

1880


**Q2c** Find the `name` of the first country in `indicators0`.

In [21]:
nodeset = indroot.xpath("/indicators/country/@name")


print(nodeset[:5])  # the first entry should be France

['France', 'United Kingdom', 'United States']


3. Match multiple nodes **at a single level of the hierarchy** using an element path to that level. Such a path can lead to:
    a. an Element node (by leading to its tag)
    b. the text of a node (via `.../text()`)
    c. an attribute (via `@`)

**Example 3a** Find a list of all nodes in `topnames` tagged `count`. This includes both male and female counts, over all years. Print the length of the resulting list, and the first item in it.

In [22]:
nodeset = toproot.xpath("/topnames/year/sex/count")
print(len(nodeset))
print(nodeset[0])
print()
print_results(nodeset[0:1])

278
<Element count at 0x7f9cc3377280>

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: count
  Text: 7065
  Attrib: {}



**Q3a** Find a list of all department names that appear in `school`. Please print the length of your nodeset and the first element in it.

In [24]:
nodeset = schroot.xpath("/school/departments/department/name")


print(len(nodeset))  # should be 36
print(nodeset[0])  # should be an element
print()
print_results(nodeset[0:1])

nametext = [node.text for node in nodeset]
print(nametext[:5])

36
<Element name at 0x7f9cc32c0540>

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: name
  Text: Anthropology and Sociology
  Attrib: {}

['Anthropology and Sociology', 'Art History and Visual Culture', 'Biology', 'Black Studies', 'Chemistry and Biochemistry']


**Example 3b** Find a list of all names that appear in `topnames` (over all years, and both genders). You should achieve a list of strings instead of a list of Elements.

In [25]:
nodeset = toproot.xpath("/topnames/year/sex/name/text()")
print(len(nodeset))  # should be 278
print(nodeset[0])  # should be Mary
print(nodeset[-1])

278
Mary
Liam


**Q3b** Find a list of all instructor last names that appear in `school` (one per `instructor`, ignoring the `department` and `courses` branches). You should achieve a list of strings instead of a list of Elements.

In [26]:
nodeset = schroot.xpath("/school/instructors/instructor/last/text()")


print(len(nodeset))  # should be 292
print(nodeset[20])
print(nodeset[21])

292
Boyd
Schultz


**Example 3c** Extract the list of years from the `topnames` dataset. Note that this information is stored in attributes. You should end up with a list of strings.

In [27]:
nodeset = toproot.xpath("/topnames/year/@value")
print(len(nodeset))
print(nodeset[0])
print(nodeset[-1])

139
1880
2018


**Q3c** Extract the list of subject names from the `school` dataset. Note that this information is stored in attributes. You should end up with a list of strings.

In [33]:
nodeset = schroot.xpath("/school/departments/department/subject/@name")


print(len(nodeset))  # should be 32
print(nodeset[:3])  # Art History
print(nodeset[-1])

32
['Art History', 'Art Studio', 'Biochemistry']
Physics


4. Match all nodes with a particular element + attribute match
    a. Specify the element with the path, and use a **predicate** to specify the attribute match.
    b. Can also specify a set of attribute values, e.g., using an inequality in a predicate.
    c. Can do the same with a set of text values.
    d. Can match at multiple levels of the hierarchy (depending on tag names in your tree), if you use `//`.
    e. Can match all nodes in the tree (`//node()`), all elements (`//*`), all text nodes (`//text()`), or all attributes (`//@*`)
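
Inequalities in a predicate compare numerically under XPath 1.0 rules, even when the bound is written in quotes, while `=` against a quoted value compares strings. The sketch below illustrates this on a hypothetical course list; the tag and attribute names are chosen for illustration only.

```python
from lxml import etree

# Hypothetical course list, invented for illustration.
root = etree.fromstring(
    "<courses>"
    "<course num='99'><hours>1.0</hours></course>"
    "<course num='210'><hours>4.0</hours></course>"
    "<course num='400'><hours>4.0</hours></course>"
    "</courses>"
)

# Attribute compared against a number.
upper = root.xpath("/courses/course[@num >= 400]/@num")

# A quoted bound is still converted to a number for <, <=, >, >=.
# As strings, '99' would sort after '200'; as numbers, 99 < 200.
below = root.xpath("/courses/course[@num < '200']/@num")

# '=' against a quoted value compares strings, so '4' does not match '4.0'.
as_string = root.xpath("/courses/course[hours = '4']/@num")

# '=' against a bare number converts the element's text to a number first.
as_number = root.xpath("/courses/course[hours = 4]/@num")

print(upper, below, as_string, as_number)
```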

**Example 4a** Find a list of all Female names that appear in `topnames` (duplicates allowed). Note that we want data from every year, so we do NOT use a predicate to restrict the years. Please print the length of your nodeset and the first element in it.

In [32]:
nodeset = toproot.xpath("/topnames/year/sex[@value='Female']/name")
print(len(nodeset))
print(nodeset[0])
print()
print_results(nodeset[0:1])

139
<Element name at 0x7f9cc32d0980>

Length of nodeset result: 1
Type: <class 'lxml.etree._Element'>
Tag: name
  Text: Mary
  Attrib: {}



**Q4a** Find a list of all subjects that appear in `school` as part of the Modern Language department. Hint: this department has an `id` of `'LANG'`.

In [37]:
nodeset = schroot.xpath("/school/departments/department[@id = 'LANG']/subject/@name")


print_results(nodeset)  # should have 8 things

Length of nodeset result: 8
Type: <class 'lxml.etree._ElementUnicodeResult'>
Arabic

Type: <class 'lxml.etree._ElementUnicodeResult'>
Chinese

Type: <class 'lxml.etree._ElementUnicodeResult'>
French

Type: <class 'lxml.etree._ElementUnicodeResult'>
German

Type: <class 'lxml.etree._ElementUnicodeResult'>
Japanese

Type: <class 'lxml.etree._ElementUnicodeResult'>
Modern Language

Type: <class 'lxml.etree._ElementUnicodeResult'>
Portuguese

Type: <class 'lxml.etree._ElementUnicodeResult'>
Spanish
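
The `_ElementUnicodeResult` objects shown in this output are strings, so all the usual string methods work on them. By default `lxml` also lets them report where they came from. A small sketch on a hypothetical two-subject department:

```python
from lxml import etree

# Hypothetical fragment, invented for illustration.
root = etree.fromstring(
    "<departments>"
    "<department id='LANG'>"
    "<subject name='French'/><subject name='German'/>"
    "</department>"
    "</departments>"
)
names = root.xpath("/departments/department/subject/@name")
first = names[0]

print(first.upper())                       # ordinary string method
print(first.is_attribute, first.attrname)  # it came from an attribute called 'name'
print(first.getparent().tag)               # the Element that carries the attribute
```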



**Example 4b** Find all `country` nodes in `indicators` where the code contains an 'A'. Hint: in your predicate, use the function `contains()`

In [38]:
nodeset = indroot.xpath("/indicators/country[contains(@code,'A')]")
print_results(nodeset)

Length of nodeset result: 2
Type: <class 'lxml.etree._Element'>
Tag: country
  Text: 
    
  Attrib: {'code': 'FRA', 'name': 'France'}

Type: <class 'lxml.etree._Element'>
Tag: country
  Text: 
    
  Attrib: {'code': 'USA', 'name': 'United States'}



**Q4b** Find all courses in `school` where the course number is 400 or above. This information is stored in an attribute.

In [41]:
nodeset = schroot.xpath("/school/courses/course[@num >= 400]/@subject")


print(len(nodeset))  # should be 145
print_results(nodeset[0:5])

145
Length of nodeset result: 5
Type: <class 'lxml.etree._ElementUnicodeResult'>
ARTH

Type: <class 'lxml.etree._ElementUnicodeResult'>
ARTH

Type: <class 'lxml.etree._ElementUnicodeResult'>
ARTH

Type: <class 'lxml.etree._ElementUnicodeResult'>
ARTS

Type: <class 'lxml.etree._ElementUnicodeResult'>
ARTS



**Example 4c** Find all gdp nodes in `indicators0` where the gdp was less than 3000.

In [42]:
nodeset = indroot.xpath("/indicators/country/timedata/gdp[text()<'3000']")
print_results(nodeset)

Length of nodeset result: 3
Type: <class 'lxml.etree._Element'>
Tag: gdp
  Text: 2657.21
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: gdp
  Text: 2586.29
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: gdp
  Text: 2637.87
  Attrib: {}



**Q4c** Find all city nodes under `instructors` where the city is "Columbus".

In [44]:
nodeset = schroot.xpath("/school/instructors/instructor/city[text() = 'Columbus']")


print(len(nodeset))  # should be 28
print_tree(nodeset[0])

28
<city>Columbus</city>
      



**Example 4d** Find all nodes in `school` tagged `title`. Return a list of elements and print information from the first and last, to see that some "title" nodes refer to course titles and others refer to instructors' titles.

We can distinguish the two types of `title` nodes by naming each of the three subtrees of `school` and then using paths relative to either `instructors` or `courses`.

In [46]:
nodeset = schroot.xpath("//title")
print(len(nodeset))
print_results([nodeset[0]] + [nodeset[-1]])

# Making the three subtrees
dept_subtree = schroot[0]
courses_subtree = schroot[1]
instr_subtree = schroot[2]

# Getting just the "title" nodes under instructors
# Method 1
nodeset = schroot.xpath("/school/instructors//title")
print(len(nodeset))

# Method 2
inst_nodeset = instr_subtree.xpath(".//title")
print(len(inst_nodeset))  # same list as above!

1196
Length of nodeset result: 2
Type: <class 'lxml.etree._Element'>
Tag: title
  Text: Beginning Arabic I
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: title
  Text: Vis. Ass't Prof. PT, Chemistry
  Attrib: {}

292
292
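
The leading `.` in Method 2 matters. A path that starts with `/` or `//` is evaluated from the root of the whole document, whichever Element `xpath()` is called on. A minimal sketch with a hypothetical two-branch tree:

```python
from lxml import etree

# Hypothetical tree with a 'title' tag in two different branches.
root = etree.fromstring(
    "<school>"
    "<courses><course><title>Algebra</title></course></courses>"
    "<instructors><instructor><title>Professor</title></instructor></instructors>"
    "</school>"
)
instructors = root[1]  # the <instructors> subtree

absolute = instructors.xpath("//title/text()")   # starts at the document root
relative = instructors.xpath(".//title/text()")  # starts at <instructors>

print(absolute)
print(relative)
```

Calling `xpath("//title")` on the subtree therefore still finds the course title, while `.//title` stays inside the subtree.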


**Q4d** Use an XPath expression with `//` to find all nodes in `school` tagged `instructorid`. Why does this only match nodes under the `courses` subtree and not under the `instructors` subtree?

In [47]:
nodeset = schroot.xpath("//instructorid")


print(len(nodeset))  # should be 1634
print_tree(nodeset[0])
print_tree(nodeset[-1])

# The tag instructorid only appears under courses
# The tag id appears under instructors

1634
<instructorid>D01349259</instructorid>
      

<instructorid>D01349036</instructorid>
      



**Example 4e** Find the number of nodes, element nodes, text nodes, and attributes, in `indicators`.

In [48]:
print("nodes:          ", len(indroot.xpath("//node()")))
print("element nodes:  ", len(indroot.xpath("//*")))
print("text nodes:     ", len(indroot.xpath("//text()")))
print("attributes:     ", len(indroot.xpath("//@*")))

nodes:           65
element nodes:   22
text nodes:      43
attributes:      12


**Q4e** Find the number of nodes, element nodes, text nodes, and attributes, in `school`.

In [49]:
print("nodes:          ", len(schroot.xpath("//node()")))
print("element nodes:  ", len(schroot.xpath("//*")))
print("text nodes:     ", len(schroot.xpath("//text()")))
print("attributes:     ", len(schroot.xpath("//@*")))

# 38991 nodes, 13008 element nodes, 25983 text, 4006 attrib

nodes:           38991
element nodes:   13008
text nodes:      25983
attributes:      4006
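
In both outputs above, the node count equals the element count plus the text-node count, and attributes are counted separately because `//node()` does not reach them. The sketch below checks this on a hypothetical five-node document that also contains a comment, and shows the XPath `count()` function as an alternative to `len()`.

```python
from lxml import etree

# Hypothetical document: three elements, one text node, one comment, three attributes.
root = etree.fromstring("<a x='1'><b>hi</b><!-- note --><c y='2' z='3'/></a>")

n_nodes = len(root.xpath("//node()"))
n_elems = len(root.xpath("//*"))
n_text = len(root.xpath("//text()"))
n_comments = len(root.xpath("//comment()"))

# count() is evaluated inside XPath and comes back as a float, not a list.
n_attrs = root.xpath("count(//@*)")

print(n_nodes, n_elems, n_text, n_comments, n_attrs)
```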


5. Match all nodes with a particular element + attribute combination

    a. Can use `and`/`or` keywords inside a predicate.  
    b. Can use `not` inside a predicate, to see if an attribute or child is present or not.  
    c. Can use a **predicate that includes a path**, e.g., to match all nodes whose child (with given element specification) has a text or attribute matching some given value.  
    d. Can use `|` between two XPath expressions to match one or the other (or both).
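
As a warm-up, the four techniques in this list can be seen side by side on a hypothetical three-department fragment (the ids and chair names are invented for illustration):

```python
from lxml import etree

# Hypothetical fragment, invented for illustration.
root = etree.fromstring(
    "<departments>"
    "<department id='MATH'><division>Sciences</division><chair>Lee</chair></department>"
    "<department id='ART'><division>Fine Arts</division></department>"
    "<department id='BIOL'><division>Sciences</division><chair>Ortiz</chair></department>"
    "</departments>"
)

# a. 'or' inside a predicate
either = root.xpath("/departments/department[@id='MATH' or @id='BIOL']/@id")

# b. not(...) tests that a child is absent
no_chair = root.xpath("/departments/department[not(chair)]/@id")

# c. a predicate containing a path, combined with 'and'
sciences = root.xpath(
    "/departments/department[division[text()='Sciences'] and chair]/@id"
)

# d. '|' merges the results of two complete XPath expressions
union = root.xpath(
    "/departments/department[@id='BIOL'] | /departments/department[not(chair)]"
)
union_ids = [dept.get("id") for dept in union]

print(either, no_chair, sciences, union_ids)
```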
    
For this entire set, please refer to `school`.
    
**Example 5a** Find all course titles containing either 'Directed' or 'Independent'.

In [50]:
nodeset = schroot.xpath(
    "/school/courses/course/title[contains(text(),'Directed') or contains(text(),'Independent')]"
)

print(len(nodeset))
print_tree(nodeset[0])

150
<title>Directed Study</title>
      



**Q5a** Find all `departmentid` nodes under `instructors`, where the department is either MATH or BIOL.


In [64]:
nodeset = schroot.xpath(
    "/school/instructors/instructor/departmentid[contains(text(), 'MATH') or contains(text(), 'BIOL')]"
)


print(len(nodeset))  # 29
print_tree(nodeset[0])
for node in nodeset[:2]:
    print_tree(node)

29
<departmentid>MATH</departmentid>
      

<departmentid>MATH</departmentid>
      

<departmentid>MATH</departmentid>
      



**Example 5b** Find all departments that do not have a chair listed.

In [66]:
nodeset = schroot.xpath("/school/departments/department[not(chair)]")
print(len(nodeset))
print_tree(nodeset[0])


**Q5b** Find all departments that do have at least one subject node.

In [72]:
nodeset = schroot.xpath("/school/departments/department[subject]")


print(len(nodeset))  # 11
print_tree(nodeset[0])

11
<department id="ART">
      <name>Art History and Visual Culture</name>
      <division>Fine Arts</division>
      <subject id="ARTH" name="Art History"/>
      <subject id="ARTS" name="Art Studio"/>
    </department>
    



**Example 5c** Find all `course` nodes worth zero credits (i.e. where `hours` is 0.0). Here we are trying to match nodes with tag `course` based on the `text()` of a child with tag `hours`, so the predicate on `course` must include a path that goes one level deeper.

In [73]:
nodeset = schroot.xpath("/school/courses/course[hours[text() = '0.0']]")
print(len(nodeset))
print_tree(nodeset[0])

32
<course subject="BIOL" num="300">
      <title>Biology Assessment I</title>
      <hours>0.0</hours>
      <class id="21089">
        <term>SPRING</term>
        <section>01</section>
        <instructorid>D00122772</instructorid>
      </class>
      <class id="40570">
        <term>FALL</term>
        <section>01</section>
        <instructorid>D00122772</instructorid>
      </class>
    </course>
    



**Q5c** Find all `department` nodes where the `division` is 'Interdisciplinary'.

In [75]:
nodeset = schroot.xpath(
    "/school/departments/department[division[text() = 'Interdisciplinary']]"
)


print(len(nodeset))  # 13
print_tree(nodeset[0])

13
<department id="BLST">
      <name>Black Studies</name>
      <division>Interdisciplinary</division>
    </department>
    



**Example 5d** Find all courses where the subject is "CS" or there is a class section meeting `11:30-12:20 MWF` in the FALL term. This is the kind of query you might do during registration. In the example below, we break it into two separate queries, and use the `|` to get all entries matching either of them.

In [81]:
nodeset = schroot.xpath(
    """
/school/courses/course[@subject='CS'] | 
/school/courses/course[class/meeting[text()='11:30-12:20 MWF'] and term[text()='FALL']]
"""
)
print(len(nodeset))
print_tree(nodeset[0])

17
<course subject="CS" num="110">
      <title>Computing/Digital Media</title>
      <hours>4.0</hours>
      <class id="21709">
        <term>SPRING</term>
        <section>01</section>
        <meeting>10:30-11:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
      <class id="40336">
        <term>FALL</term>
        <section>01</section>
        <meeting>10:30-11:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
      <class id="40744">
        <term>FALL</term>
        <section>02</section>
        <meeting>08:30-09:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
    </course>
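
When a predicate combines several conditions, each relative path inside it is resolved from the node the predicate is attached to. The sketch below, on a hypothetical two-course tree, compares three ways of writing "meets at a given time and runs in the FALL term". The tag names mirror `school`, but the data is invented.

```python
from lxml import etree

# Hypothetical tree: <term> and <meeting> are children of <class>, not <course>.
root = etree.fromstring(
    "<courses>"
    "<course subject='HIST'>"
    "<class><term>FALL</term><meeting>11:30-12:20 MWF</meeting></class>"
    "</course>"
    "<course subject='PHIL'>"
    "<class><term>SPRING</term><meeting>11:30-12:20 MWF</meeting></class>"
    "<class><term>FALL</term><meeting>09:30-10:20 MWF</meeting></class>"
    "</course>"
    "</courses>"
)

# Both tests sit in one predicate on <class>: a single section must pass both.
same_class = root.xpath(
    "/courses/course[class[meeting[text()='11:30-12:20 MWF']"
    " and term[text()='FALL']]]/@subject"
)

# Here 'term' is looked up as a direct child of <course>, where none exists.
course_level = root.xpath(
    "/courses/course[class/meeting[text()='11:30-12:20 MWF']"
    " and term[text()='FALL']]/@subject"
)

# Two separate paths: different sections may satisfy the two tests.
any_class = root.xpath(
    "/courses/course[class/meeting[text()='11:30-12:20 MWF']"
    " and class/term[text()='FALL']]/@subject"
)

print(same_class, course_level, any_class)
```

Which form is appropriate depends on whether the two conditions are meant to hold for the same `class` section.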
    



**Q5d** Find all `course` nodes where either there is a class section whose instructor id is 'D01014580' or the title contains 'Computer'.

In [87]:
nodeset = schroot.xpath(
    """
    /school/courses/course[class/instructorid[text() = 'D01014580']] |
    /school/courses/course[title[contains(text(), 'Computer')]]
"""
)


print(len(nodeset))  # 6
print_tree(nodeset[0])

6
<course subject="CS" num="110">
      <title>Computing/Digital Media</title>
      <hours>4.0</hours>
      <class id="21709">
        <term>SPRING</term>
        <section>01</section>
        <meeting>10:30-11:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
      <class id="40336">
        <term>FALL</term>
        <section>01</section>
        <meeting>10:30-11:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
      <class id="40744">
        <term>FALL</term>
        <section>02</section>
        <meeting>08:30-09:20 MWRF</meeting>
        <instructorid>D01014580</instructorid>
      </class>
    </course>
    



6. Climb back up the tree from a nodeset by adding a parent specification (`..`) to the set of found nodes, or by using the axis `ancestor` to climb up more levels.
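
A short sketch of the difference, using a hypothetical single-course tree: each `..` climbs exactly one level, whereas `ancestor::` can jump straight to a named ancestor or collect the whole chain.

```python
from lxml import etree

# Hypothetical tree, invented for illustration.
root = etree.fromstring(
    "<school><courses>"
    "<course subject='CS'><class><meeting>08:30-09:20 MWF</meeting></class></course>"
    "</courses></school>"
)
meeting = "//meeting[text()='08:30-09:20 MWF']"

parent = root.xpath(meeting + "/..")[0].tag          # one level up
grandparent = root.xpath(meeting + "/../..")[0].tag  # two levels up

# Jump directly to the enclosing <course>, however deep the start node is.
subject = root.xpath(meeting + "/ancestor::course/@subject")

# Every element on the path from the root down to the parent.
chain = [elem.tag for elem in root.xpath(meeting + "/ancestor::*")]

print(parent, grandparent, subject, chain)
```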

**Example 6a** Find the first and last names of all instructors who live in Columbus. Do this via a path to `city` and then backtracking.

In [88]:
nodeset = schroot.xpath(
    """
/school/instructors/instructor/city[text() = 'Columbus']/../first | 
/school/instructors/instructor/city[text() = 'Columbus']/../last
"""
)
print(len(nodeset))
print_tree(nodeset[0])
print_tree(nodeset[1])

56
<first>Mitchell</first>
      

<last>Snay</last>
      



**Q6a** Find all `course` nodes where `title` contains 'Computer'. Do this with an XPath that leads to `title` then backtracks.

In [96]:
nodeset = schroot.xpath("/school/courses/course/title[contains(text(), 'Computer')]/..")


print(len(nodeset))  # 4
print_tree(nodeset[0])

4
<course subject="CS" num="173">
      <title>Intermediate Computer Science</title>
      <hours>4.0</hours>
      <class id="21711">
        <term>SPRING</term>
        <section>01</section>
        <meeting>11:30-12:20 MTWF</meeting>
        <instructorid>D00118952</instructorid>
      </class>
    </course>
    



**Example 6b** Use `ancestor` to find all nodes along the path to the course title of 'Empowering Girls/Literature'.

In [97]:
nodeset = schroot.xpath(
    """
//title[text()='Empowering Girls/Literature']/ancestor::*
"""
)
print_results(nodeset)

Length of nodeset result: 3
Type: <class 'lxml.etree._Element'>
Tag: school
  Text: 
  
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: courses
  Text: 
    
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: course
  Text: 
      
  Attrib: {'subject': 'WMST', 'num': '390'}



**Q6b** Use `ancestor` to find all nodes along the path to the datum where the top male name was 'David'.


In [100]:
nodeset = toproot.xpath(
    "/topnames/year/sex[@value = 'Male']/name[text() = 'David']/ancestor::*"
)


print_results(nodeset)  # 3 on the path

Length of nodeset result: 3
Type: <class 'lxml.etree._Element'>
Tag: topnames
  Text: 
  
  Attrib: {}

Type: <class 'lxml.etree._Element'>
Tag: year
  Text: 
    
  Attrib: {'value': '1960'}

Type: <class 'lxml.etree._Element'>
Tag: sex
  Text: 
      
  Attrib: {'value': 'Male'}



**Q6c** Find all departments that offer courses that meet `'08:30-09:20 MWF'`. Hint: navigate to `meeting` nodes satisfying the condition, then back up (more than once!) to the `course` level, then extract the `subject` attribute.

In [103]:
nodeset = schroot.xpath(
    "/school/courses/course/class/meeting[text() = '08:30-09:20 MWF']/../../@subject"
)


print(len(nodeset))  # 29
print(nodeset[0])

29
BIOL


## Using XPath to Build a Data Frame

We have previously seen how to build a `pandas` dataframe using XML programming, by iterating over the children of an Element. We now show how to accomplish the same goal using XPath. We take as our example `indicators0`. The plan is:

1. Use XPath to give a list of values of country code  
2. For each, use XPath to give a list of years.  
3. For every (code,year) pair, use XPath to find the associated name, pop, and gdp. 

We break this into a series of functions. The reader is encouraged to peek ahead to see how these functions will be used in the final solution.

**Q7** Write a function `getCodeList(root)` that uses an XPath expression to return a list of codes that appear in the `indicators` tree given by the Element `root` (e.g., this could be `ind0` or the full dataset). Your function should return a list of strings.

In [106]:
def getCodeList(root):
    return root.xpath("/indicators/country/@code")


assert getCodeList(indroot) == ["FRA", "GBR", "USA"]

**Q8** Write a function `getName(root,code)` that gets the `name` associated with a given `code` in a tree given by the Element `root`. Your function should return a string. Your function MUST use XPath. For example, `getName(indroot,'FRA')` navigates the path `/indicators/country[@code='FRA']/@name` to extract the name associated with the code 'FRA'. Hint: you can create the correct XPath expression using a format string.

In [115]:
def getName(root, code):
    return root.xpath("/indicators/country[@code = '{}']/@name".format(code))[0]


assert getName(indroot, "FRA") == "France"
getName(indroot, "FRA")

'France'
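
Format strings work well when the inserted value is a simple code. As an alternative sketch, `lxml` also accepts XPath variables: a `$name` placeholder in the expression is filled from a keyword argument to `xpath()`, which avoids quoting problems if the value itself contains a quote. The function name `get_name` and the two-country tree below are hypothetical.

```python
from lxml import etree

# Hypothetical two-country tree, invented for illustration.
root = etree.fromstring(
    "<indicators>"
    "<country code='FRA' name='France'/>"
    "<country code='USA' name='United States'/>"
    "</indicators>"
)


def get_name(root, code):
    """Hypothetical variant of getName that passes code as an XPath variable."""
    result = root.xpath("/indicators/country[@code = $code]/@name", code=code)
    # Return None instead of raising IndexError when the code is unknown.
    return result[0] if result else None


print(get_name(root, "FRA"))
print(get_name(root, "XXX"))
```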

**Q9** Write a function `getYearList(root,code)` that gets the list of years that appear as attributes of the `timedata` children of the country node given by `code` in the tree given by the Element `root`. Your function should return a list of strings, e.g., `['2007','2017']`. Your function MUST use XPath. Hint: you can create the correct XPath expression using a format string.

In [117]:
def getYearList(root, code):
    return root.xpath("/indicators/country[@code = '{}']/timedata/@year".format(code))


assert getYearList(indroot, "FRA") == ["2007", "2017"]
getYearList(indroot, "USA")

['2007', '2017']

**Q10** Write a function `getValue(root,code,year,var)` that gets the value (as a float) of the given variable along the given path. Your function MUST use XPath. For example, `getValue(indroot,'FRA','2007','pop')` traverses the path `/indicators/country[@code='FRA']/timedata[@year='2007']/pop/text()` then casts the resulting string as a float and returns it. Hint: you can create the correct XPath expression using a format string.

In [127]:
def getValue(root, code, year, var):
    return float(
        root.xpath(
            "/indicators/country[@code = '{}']/timedata[@year = '{}']/{}/text()".format(
                code, year, var
            )
        )[0]
    )


print(getValue(indroot, "FRA", "2007", "pop"))  # 64.02
print(getValue(indroot, "USA", "2017", "gdp"))  # 19485.4

64.02
19485.4


**Q11** For the final step, write a function `ind2df(root)` that takes a given Element `root` representing an indicators tree, and produces the corresponding `pandas` dataframe, with index `['code','year']`. Please follow the plan laid out at the start:

1. Invoke `getCodeList()` to get a list of country codes  
2. For each, use `getYearList()` to give a list of years.  
3. Iterate over `(code,year)` pairs, and use `getValue()` to fill a `LoD` with the data.  
4. Pass that `LoD` to pandas to create and return the dataframe.  

Hint: when the root is `indroot`, the first dictionary in my LoD is `{'code': 'FRA', 'name': 'France', 'year': '2007', 'pop': 64.02, 'gdp': 2657.21}`


In [144]:
def ind2df(root):
    LoD = []
    for code in getCodeList(root):
        name = getName(root, code)
        for year in getYearList(root, code):
            # Build a fresh dictionary for every (code, year) pair; reusing
            # one dictionary would make all rows for a country identical.
            row = {
                "code": code,
                "name": name,
                "year": year,
                "pop": getValue(root, code, year, "pop"),
                "gdp": getValue(root, code, year, "gdp"),
            }
            LoD.append(row)

    df = pd.DataFrame(LoD)
    return df.set_index(["code", "year"])


df = ind2df(indroot)
df


In [143]:
assert df.shape == (6, 3)
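
A shape check alone cannot tell whether every row holds its own data. When a list of dictionaries is filled inside a loop, appending the same dictionary object repeatedly yields rows that all show the values from the last iteration. The plain-Python sketch below uses hypothetical values to show the pitfall and the usual remedy.

```python
# Pitfall: one dictionary is mutated and appended on every pass.
shared = []
row = {"code": "FRA"}
for year in ["2007", "2017"]:
    row["year"] = year
    shared.append(row)  # appends a reference to the same object

# Remedy: create a new dictionary on every pass.
fresh = []
for year in ["2007", "2017"]:
    fresh.append({"code": "FRA", "year": year})

print([r["year"] for r in shared])
print([r["year"] for r in fresh])
print(shared[0] is shared[1], fresh[0] is fresh[1])
```

Comparing the first dictionary in the list against an expected value, such as the one given in the hint for Q11, is a quick way to catch this.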