### XML Parser For Stats Functions  
### Sean Warlick  
### July 23, 2016

This notebook is used to develop the code needed to parse the XML returned from the `TruliaStats` library.  We interactively explore the XML to find where the desired data lives.  A notebook was used so that we did not need to call the API repeatedly.  The bulk of the code here will be discarded.



In [1]:
# This is needed to accommodate the new file structure designed on 8/2
import sys
sys.path.append('/Users/SeanWarlick/Documents/GitHub/seattle_housing/python')

In [3]:
# Import Packages
import api_function as api
import xml.etree.ElementTree as et 

In [6]:
state_data = api.state_stats("", "WA", '2016-01-01', '2016-01-09', "API KEY HERE")  # substitute a real API key

In [92]:
print type(state_data)

state_data2 = et.fromstring(state_data) # Create Element from string
state_tree = et.ElementTree(state_data2) # Convert Element to Element Tree

<type 'str'>


In [93]:
print state_data

<?xml version="1.0"?>
<TruliaWebServices><request><Parameter><index>1</index><name>function</name><value>getStateStats</value></Parameter><Parameter><index>2</index><name>city</name><value/></Parameter><Parameter><index>3</index><name>apikey</name><value>53yq9hwphajxg9swpy7cvbkm</value></Parameter><Parameter><index>4</index><name>endDate</name><value>2016-01-09</value></Parameter><Parameter><index>5</index><name>startDate</name><value>2016-01-01</value></Parameter><Parameter><index>6</index><name>library</name><value>TruliaStats</value></Parameter><Parameter><index>7</index><name>state</name><value>WA</value></Parameter><Parameter><index>8</index><name>source</name><value>Webservice</value></Parameter></request><response><TruliaStats><location><stateName>Washington</stateName><stateCode>WA</stateCode><stateURL>http://www.trulia.com/sitemap/Washington-real-estate/</stateURL><heatMapURL>http://www.trulia.com/home_prices/Washington/</heatMapURL></location><trafficStats><trafficStat><date>

### Explore Structure in Response Tags
*TruliaStats* is the child of interest within the response tag.

In [94]:
# Find children of the TruliaStats node
state_response = state_tree.find(".//TruliaStats")

for child in state_response:
    print child.tag, child.attrib, child.text

location {} None
trafficStats {} None
listingStats {} None


There are three nodes under *TruliaStats*.  We will explore these sequentially, starting with trafficStats. 
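
The cell-by-cell exploration that follows could also be condensed into one recursive helper that lists each distinct child tag and how often it occurs. This is only a sketch: `outline` is a hypothetical helper, and `SAMPLE_XML` is a heavily trimmed stand-in for the real response so that the example runs without calling the API. It is written to run under Python 3 as well as the Python 2 used in this notebook.

```python
import xml.etree.ElementTree as et
from collections import Counter

# Hypothetical, heavily trimmed stand-in for the API response.
SAMPLE_XML = """<TruliaWebServices><response><TruliaStats>
<location><stateName>Washington</stateName><stateCode>WA</stateCode></location>
<trafficStats>
<trafficStat><date>2016-01-01</date><percentNationalTraffic>2.0668424598</percentNationalTraffic></trafficStat>
<trafficStat><date>2016-01-02</date><percentNationalTraffic>2.0626666809</percentNationalTraffic></trafficStat>
</trafficStats>
</TruliaStats></response></TruliaWebServices>"""


def outline(element, depth=0):
    """Return indented lines naming each distinct child tag and its count."""
    lines = []
    counts = Counter(child.tag for child in element)
    seen = set()
    for child in element:
        if child.tag in seen:
            continue  # describe repeated siblings only once
        seen.add(child.tag)
        lines.append("{}{} x{}".format("  " * depth, child.tag, counts[child.tag]))
        # Only the first node of each tag is descended into.
        lines.extend(outline(child, depth + 1))
    return lines


root = et.fromstring(SAMPLE_XML)
stats = root.find(".//TruliaStats")
tree_lines = outline(stats)
print("\n".join(tree_lines))
```

Because only the first node of each repeated tag is inspected, the outline assumes sibling nodes with the same tag share one layout, which is what the manual exploration suggests for this response.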

In [95]:
traffic = state_tree.find('.//trafficStats')

In [96]:
for child in traffic:
    print child.tag, child.attrib, child.text

trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None
trafficStat {} None


Beneath the *trafficStats* node we have a group of nodes named *trafficStat*.  Now we look at one of these nodes to see how the data is stored.

In [97]:
traffic_data = traffic.find("trafficStat")

In [98]:
for child in traffic_data:
    print child.tag, child.attrib, child.text

date {} 2016-01-01
percentNationalTraffic {} 2.0668424598


This is as far down as we need to go in the state traffic data.  We can now develop code to parse all of the returned records.

In [99]:
# KEEP THIS CODE
traffic_data_collect = state_tree.findall(".//trafficStat")

storage = {"date":[], "value":[]} # Create storage variable

for i in traffic_data_collect:
    date = i.find("date").text
    storage["date"].append(date)
    
    val = i.find("percentNationalTraffic").text
    storage["value"].append(val)


In [100]:
print storage

{'date': ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08', '2016-01-09'], 'value': ['2.0668424598', '2.0626666809', '1.9174601053', '1.8742901588', '1.7754850671', '1.8107127378', '1.8036938434', '1.8573790791', '1.9520533534']}
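
The values collected above are still strings. As a minimal sketch of how the kept loop might be packaged for reuse, the function below wraps it and converts each traffic share to a float. The name `parse_traffic_stats` is hypothetical, and `SAMPLE_XML` is a trimmed sample built from the first two records printed above rather than a full API response.

```python
import xml.etree.ElementTree as et

# Trimmed sample built from the first two records printed above.
SAMPLE_XML = """<TruliaStats><trafficStats>
<trafficStat><date>2016-01-01</date><percentNationalTraffic>2.0668424598</percentNationalTraffic></trafficStat>
<trafficStat><date>2016-01-02</date><percentNationalTraffic>2.0626666809</percentNationalTraffic></trafficStat>
</trafficStats></TruliaStats>"""


def parse_traffic_stats(xml_string):
    """Collect dates and traffic shares from every trafficStat node."""
    root = et.fromstring(xml_string)
    storage = {"date": [], "value": []}
    for stat in root.findall(".//trafficStat"):
        storage["date"].append(stat.findtext("date"))
        # The XML holds text; convert so the values can be used numerically.
        storage["value"].append(float(stat.findtext("percentNationalTraffic")))
    return storage


traffic = parse_traffic_stats(SAMPLE_XML)
```

Calling `findall` on the root element is enough here, so the intermediate `ElementTree` object is not needed in this sketch.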


Now that we have parsed the traffic data, we need to explore the listing data. 
We'll start by finding the children under *listingStats*. 

In [101]:
listingstats = state_tree.find(".//listingStats")
for child in listingstats:
    print child.tag, child.attrib, child.text

listingStat {} None
listingStat {} None


In [102]:
listing = state_tree.find(".//listingStat")
for child in listing:
    print child.tag, child.attrib, child.text

weekEndingDate {} 2016-01-02
listingPrice {} None


In [103]:
listingPrice = state_tree.find(".//listingPrice")
for child in listingPrice:
    print child.tag, child.attrib, child.text

subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None
subcategory {} None


In [104]:
subcat = state_tree.find(".//subcategory")
for child in subcat:
    print child.tag, child.attrib, child.text

type {} All Properties
numberOfProperties {} 10116
medianListingPrice {} 276757
averageListingPrice {} 383939


The data of interest is within the *subcategory* tags.  We also need to keep the text from the *weekEndingDate* tag.  
To parse this correctly we need to go back up to the *listingStat* tag.  From there we obtain the week-ending date, which we attach to each subcategory record.

In [121]:
listing_data = state_tree.findall(".//listingStat")

storage = {"date":[], "type":[], "properties":[], "medianPrice":[], "avgPrice":[]}

for records in listing_data:
    record_date = records.find("weekEndingDate").text # Extract record date
    #print record_date
    categories = records.findall(".//subcategory") # Find all subcategories for the given week-ending date
    
    for category in categories:
        storage["date"].append(record_date) 

        tp = category.find("type").text
        storage["type"].append(tp)
        
        prop = category.find("numberOfProperties").text
        storage["properties"].append(prop)
        
        median = category.find("medianListingPrice").text
        storage["medianPrice"].append(median)
        
        avg = category.find("averageListingPrice").text
        storage["avgPrice"].append(avg)
        
    
    

In [122]:
print storage

{'date': ['2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09'], 'type': ['All Properties', '1 Bedroom Properties', '2 Bedroom Properties', '3 Bedroom Properties', '4 Bedroom Properties', '5 Bedroom Properties', '6 Bedroom Properties', '7 Bedroom Properties', '8 Bedroom Properties', '9 Bedroom Properties', 'All Properties', '1 Bedroom Properties', '2 Bedroom Properties', '3 Bedroom Properties', '4 Bedroom Properties', '5 Bedroom Properties', '6 Bedroom Properties', '7 Bedroom Properties', '8 Bedroom Properties', '9 Bedroom Properties'], 'properties': ['10116', '385', '1847', '4605', '2165', '525', '99', '12', '8', '4', '10354', '396', '1921', '4705', '2216', '531', '90', '7', '9', '5'], 'medianPrice': ['276757', '222936', '209021', '265547', '362697', '425050', '4

Now we just need to turn this dictionary into an array.
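
One possible way to do that with only the standard library is to zip the column lists into row tuples, casting the numeric columns along the way. This is a sketch under assumptions: `to_rows` is a hypothetical helper, and the sample `storage` is trimmed to two records and to the four columns whose values are visible in the output above (`avgPrice` would be handled the same way).

```python
# Trimmed sample in the same shape as the `storage` dictionary printed above.
storage = {
    "date": ["2016-01-02", "2016-01-02"],
    "type": ["All Properties", "1 Bedroom Properties"],
    "properties": ["10116", "385"],
    "medianPrice": ["276757", "222936"],
}

COLUMNS = ["date", "type", "properties", "medianPrice"]
NUMERIC = {"properties", "medianPrice"}


def to_rows(storage, columns, numeric=()):
    """Turn a dict of equal-length column lists into a list of row tuples."""
    lengths = {len(storage[name]) for name in columns}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    converted = [
        [int(v) for v in storage[name]] if name in numeric else list(storage[name])
        for name in columns
    ]
    # zip(*columns) transposes the column lists into one tuple per record.
    return list(zip(*converted))


rows = to_rows(storage, COLUMNS, NUMERIC)
```

Passing the column order explicitly keeps the row layout fixed, since the dictionary itself does not guarantee an ordering in Python 2.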

### Parsing `getCityStats`

In [7]:
x = api.city_stats("Seattle", "WA", '2016-01-01', '2016-01-09', "API KEY HERE")  # substitute a real API key

In [10]:
city_data = et.fromstring(x) # Create Element from string
city_tree = et.ElementTree(city_data) # Convert Element to Element Tree

In [11]:
listing_data = city_tree.findall(".//listingStat")

storage = {"date":[], "type":[], "properties":[], "medianPrice":[], "avgPrice":[]}
for records in listing_data:
    record_date = records.find("weekEndingDate").text # Extract record date
    #print record_date
    categories = records.findall(".//subcategory") # Find all subcategories for the given week-ending date
    
    for category in categories:
        storage["date"].append(record_date) 

        tp = category.find("type").text
        storage["type"].append(tp)
        
        prop = category.find("numberOfProperties").text
        storage["properties"].append(prop)
        
        median = category.find("medianListingPrice").text
        storage["medianPrice"].append(median)
        
        avg = category.find("averageListingPrice").text
        storage["avgPrice"].append(avg)

In [12]:
print storage

{'date': ['2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-02', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09', '2016-01-09'], 'type': ['All Properties', '1 Bedroom Properties', '2 Bedroom Properties', '3 Bedroom Properties', '4 Bedroom Properties', '5 Bedroom Properties', '6 Bedroom Properties', '9 Bedroom Properties', 'All Properties', '1 Bedroom Properties', '2 Bedroom Properties', '3 Bedroom Properties', '4 Bedroom Properties', '5 Bedroom Properties', '6 Bedroom Properties'], 'properties': ['505', '112', '150', '146', '58', '23', '4', '1', '512', '121', '156', '135', '66', '20', '5'], 'medianPrice': ['572496', '494264', '599521', '581407', '735357', '735568', '1849000', '1299922', '578400', '501379', '586200', '601046', '830857', '725107', '1420643'], 'avgPrice': ['758123', '495308', '715344', '752994', '1011508', '1002964', '1868750', '1299922', '751255', '498348', '698709', '794528', '10
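
The state and city responses are parsed with the same loop, so it could be factored into one function that accepts either XML string. The sketch below assumes both responses share the `listingStat` / `subcategory` layout explored above; `parse_listing_stats` and `FIELDS` are hypothetical names, and `SAMPLE_XML` is a trimmed sample built from three of the Seattle records printed above.

```python
import xml.etree.ElementTree as et

# Trimmed sample built from three of the Seattle records printed above.
SAMPLE_XML = """<TruliaStats><listingStats>
<listingStat><weekEndingDate>2016-01-02</weekEndingDate><listingPrice>
<subcategory><type>All Properties</type><numberOfProperties>505</numberOfProperties>
<medianListingPrice>572496</medianListingPrice><averageListingPrice>758123</averageListingPrice></subcategory>
<subcategory><type>1 Bedroom Properties</type><numberOfProperties>112</numberOfProperties>
<medianListingPrice>494264</medianListingPrice><averageListingPrice>495308</averageListingPrice></subcategory>
</listingPrice></listingStat>
<listingStat><weekEndingDate>2016-01-09</weekEndingDate><listingPrice>
<subcategory><type>All Properties</type><numberOfProperties>512</numberOfProperties>
<medianListingPrice>578400</medianListingPrice><averageListingPrice>751255</averageListingPrice></subcategory>
</listingPrice></listingStat>
</listingStats></TruliaStats>"""

# (storage key, child tag inside <subcategory>)
FIELDS = [
    ("type", "type"),
    ("properties", "numberOfProperties"),
    ("medianPrice", "medianListingPrice"),
    ("avgPrice", "averageListingPrice"),
]


def parse_listing_stats(xml_string):
    """Flatten every subcategory into columns, tagged with its week-ending date."""
    root = et.fromstring(xml_string)
    storage = {"date": []}
    for key, _ in FIELDS:
        storage[key] = []
    for record in root.findall(".//listingStat"):
        record_date = record.findtext("weekEndingDate")
        for category in record.findall(".//subcategory"):
            storage["date"].append(record_date)
            for key, tag in FIELDS:
                # findtext returns None when the tag is absent, keeping columns aligned.
                storage[key].append(category.findtext(tag))
    return storage


listing = parse_listing_stats(SAMPLE_XML)
```

Since each row is driven by the subcategories actually present, weeks with a different number of property types (as in the Seattle output, where the first week has eight and the second has seven) need no special handling.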