
# Mining the social web 
## Workout 2. Understanding XML 

- CityU COM5507 201819A - Unit 2: Web data collection
- 24 Oct 2018, Week 8: Mining the social web - data formats 


- Course Instructor: [Dr. Xinzhi Zhang](www.drxinzhizhang.com)  (JOUR, Hong Kong Baptist University) 
  - xzzhang2@gmail.com


- The code in this notebook is adapted from various sources. All code is for educational purposes only and released under the CC1.0. 

## XML basics: structure and finding elements

- tutorial: https://stackabuse.com/reading-and-writing-xml-files-in-python/
- official tutorial: https://docs.python.org/3.7/library/xml.etree.elementtree.html 

- ```Element.find()``` finds the first child with a particular tag, and ```Element.text``` accesses the element's text content. 
- ```Element.get()``` accesses the element's attributes. 


- ```Element.findall()``` finds all elements with a given tag that are direct children of the current element. 
- ```Element.iter()``` creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order.
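
The difference between these four calls is easiest to see on a tiny document. The snippet below is a minimal sketch; the `<root>`, `<group>`, and `<item>` tags are made up for illustration and do not come from the course data.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document: one <item> is a direct child of the root,
# the other is nested one level deeper inside <group>.
doc = ET.fromstring(
    "<root>"
    "<item id='a'>top</item>"
    "<group><item id='b'>nested</item></group>"
    "</root>"
)

first = doc.find("item")                               # first direct child tagged <item>
direct = [e.get("id") for e in doc.findall("item")]    # direct children only
everywhere = [e.get("id") for e in doc.iter("item")]   # whole subtree, document order

print(first.text, first.get("id"))  # top a
print(direct)                       # ['a']
print(everywhere)                   # ['a', 'b']
```

Note that `findall()` misses the nested `<item>`, while `iter()` reaches it.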

In [1]:
import xml.etree.ElementTree as ET # always import this when parsing XML 

In [2]:
data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
     +1 734 303 4456
   </phone>
   <email hide="yes"/>
</person>
'''

In [3]:
tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
# what happens if we print the element itself: print('Name:', tree.find('name'))? 
print('Phone:', tree.find('phone').text) 
# compare: print('Phone:', tree.find('phone').text.strip()) 
print('Attrib', tree.find('email').get('hide'))

Name: Chuck
Phone: 
     +1 734 303 4456
   
Attrib yes
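
The two questions in the comments above can be checked directly. The following is a small sketch on the same `<person>` record; the `<fax>` tag is a hypothetical missing element, added only to show what happens when `find()` matches nothing.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    "<person>"
    "<name>Chuck</name>"
    "<phone type='intl'>\n     +1 734 303 4456\n   </phone>"
    "<email hide='yes'/>"
    "</person>"
)

# find() returns an Element object, not a string; .text holds the text.
name_el = person.find("name")
print(type(name_el).__name__)        # Element

# .text keeps the surrounding whitespace; strip() removes it.
phone = person.find("phone").text.strip()
print(phone)                         # +1 734 303 4456

# An empty element such as <email hide="yes"/> has no text at all.
email_text = person.find("email").text
print(email_text)                    # None

# find() returns None when nothing matches, so calling .text on the result
# would raise AttributeError. findtext() with a default avoids that.
missing = person.find("fax")
fax_text = person.findtext("fax", default="n/a")
print(missing, fax_text)             # None n/a
```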


In [4]:
xml_input = '''
<stuff>
    <users>
        <user x="2">
            <id>001</id>
            <name>Chuck</name>
        </user>
        <user x="7">
            <id>009</id>
            <name>Brent</name>
        </user>
    </users>
</stuff>'''

In [6]:
# xml_input is used instead of input to avoid shadowing Python's built-in input()
stuff = ET.fromstring(xml_input)
lst = stuff.findall('users/user')
print(type(lst))
print('User count:', len(lst))

<class 'list'>
User count: 2


In [7]:
for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get("x"))

Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7


## Another example: from the official tutorial 

In [8]:
country_data_as_string = '''
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>'''

In [9]:
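import xml.etree.ElementTree as ET
# Parse from a file; this assumes the XML from the previous cell has been saved as data.xml
tree = ET.parse('data.xml')
root = tree.getroot()

# Alternatively, parse the string directly (fromstring returns the root element):
#root = ET.fromstring(country_data_as_string)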

In [10]:
for child in root:
    print(child.tag, child.attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [12]:
# Children can be accessed by index: root[0][1] is the second child (<year>) of the first <country>
root[0][1].text

'2008'
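
Positional indexing breaks as soon as the order of children changes. `find()` and `findall()` also accept a small subset of XPath, which is usually more robust. The sketch below uses a trimmed copy of the country data so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the country data above, so the example is self-contained.
root = ET.fromstring("""
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>""")

by_index = root[0][1].text               # second child of the first country
by_path = root.find("country/year").text # first <year> under any <country>

# [@attr='value'] filters on an attribute; [tag='text'] filters on a child's text.
sg_neighbors = [n.get("name")
                for n in root.findall("country[@name='Singapore']/neighbor")]
from_2011 = [c.get("name") for c in root.findall("country[year='2011']")]

print(by_index, by_path)  # 2008 2008
print(sg_neighbors)       # ['Malaysia']
print(from_2011)          # ['Singapore']
```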

In [15]:
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


In [16]:
names = []
directions = []
for neighbor in root.iter('neighbor'):
    name = neighbor.get('name')
    direction = neighbor.get('direction') 
    #print(name, direction)
    names.append(name)
    directions.append(direction)

print(names)
print(directions)

['Austria', 'Switzerland', 'Malaysia', 'Costa Rica', 'Colombia']
['E', 'W', 'N', 'W', 'E']
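
The two lists above lose track of which country each neighbor belongs to, and ElementTree elements do not keep a reference to their parent. One possible way around this, sketched below on a trimmed copy of the data, is to loop over the countries first and collect one record per neighbor. The key names in `records` are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the country data, so the example is self-contained.
root = ET.fromstring("""
<data>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>""")

records = []
for country in root.findall("country"):
    # The outer loop variable plays the role of the "parent" of each neighbor.
    for neighbor in country.findall("neighbor"):
        records.append({
            "country": country.get("name"),
            "neighbor": neighbor.get("name"),
            "direction": neighbor.get("direction"),
        })

for rec in records:
    print(rec)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(records)` if a table is needed.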


## A real example: the Hong Kong LegCo voting
- Open data: https://www.legco.gov.hk/general/chinese/open-legco/cm-201819.html#n1a
- Data source: https://www.legco.gov.hk/yr18-19/chinese/counmtg/voting/cm_vote_20181010.xml
- The voting records published on the Hong Kong LegCo website are a good example of open data and a good dataset for practising data analysis skills. 

In [18]:
import xml.etree.ElementTree as ET  
tree = ET.parse('cm_vote_20181010.xml')  
root = tree.getroot()

In [19]:
root.tag

'legcohk-vote'

In [20]:
root.attrib

{'{http://www.w3.org/2001/XMLSchema-instance}noNamespaceSchemaLocation': '/schema/legcohk-vote-schema.xsd'}
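
The odd-looking key in this output is how ElementTree writes a namespaced name: the namespace URI in curly braces, followed by the local name. The sketch below rebuilds only the root tag by hand to show how to read such an attribute. It assumes the `xsi` prefix is declared on the root element, as is conventional, and it is not a copy of the real file.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hand-written stand-in for the root element of the voting file (an assumption).
doc = ET.fromstring(
    '<legcohk-vote xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="/schema/legcohk-vote-schema.xsd"/>'
)

# The attribute key is "{namespace-uri}local-name"; the doubled braces in the
# f-string produce literal { and } around the URI.
key = f"{{{XSI}}}noNamespaceSchemaLocation"
schema = doc.get(key)

print(key)
print(schema)  # /schema/legcohk-vote-schema.xsd
```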

In [21]:
for child in root:
    print(child.tag, child.attrib) 

meeting {'start-date': '10/10/2018', 'type': 'Council Meeting'}


In [22]:
for member in root.iter('member'):
    print(member.attrib) 

{'name-ch': '梁君彥', 'name-en': 'Andrew LEUNG', 'constituency': 'Functional'}
{'name-ch': '涂謹申', 'name-en': 'James TO', 'constituency': 'Functional'}
{'name-ch': '梁耀忠', 'name-en': 'LEUNG Yiu-chung', 'constituency': 'Functional'}
{'name-ch': '石禮謙', 'name-en': 'Abraham SHEK', 'constituency': 'Functional'}
{'name-ch': '張宇人', 'name-en': 'Tommy CHEUNG', 'constituency': 'Functional'}
{'name-ch': '李國麟', 'name-en': 'Prof Joseph LEE', 'constituency': 'Functional'}
{'name-ch': '林健鋒', 'name-en': 'Jeffrey LAM', 'constituency': 'Functional'}
{'name-ch': '黃定光', 'name-en': 'WONG Ting-kwong', 'constituency': 'Functional'}
{'name-ch': '李慧琼', 'name-en': 'Starry LEE', 'constituency': 'Functional'}
{'name-ch': '陳克勤', 'name-en': 'CHAN Hak-kan', 'constituency': 'Geographical'}
{'name-ch': '陳健波', 'name-en': 'CHAN Kin-por', 'constituency': 'Functional'}
{'name-ch': '梁美芬', 'name-en': 'Dr Priscilla LEUNG', 'constituency': 'Geographical'}
{'name-ch': '黃國健', 'name-en': 'WONG Kwok-kin', 'constituency': 'Geographical

In [23]:
names = []
votes = []
for member in root.iter('member'):
    vote = member.find('vote').text
    name = member.get('name-ch') 
    #print(name, vote)
    votes.append(vote)
    names.append(name)

In [24]:
import pandas as pd

In [25]:
df_vote = {
    'names': names,
    'votes': votes
}

In [26]:
pd_vote = pd.DataFrame.from_dict(df_vote) 
print(pd_vote.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 2 columns):
names    68 non-null object
votes    68 non-null object
dtypes: object(2)
memory usage: 1.1+ KB
None


In [28]:
pd_vote

| | names | votes |
|---|---|---|
| 0 | 梁君彥 | Present |
| 1 | 涂謹申 | Yes |
| 2 | 梁耀忠 | Yes |
| 3 | 石禮謙 | No |
| 4 | 張宇人 | No |
| 5 | 李國麟 | Absent |
| 6 | 林健鋒 | No |
| 7 | 黃定光 | Absent |
| 8 | 李慧琼 | Absent |
| 9 | 陳克勤 | Yes |


## Challenges: 
1. Use the pandas package to examine this voting record and explore any voting patterns. 
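
One possible starting point for this challenge is sketched below. It uses only the ten rows and the constituency values displayed earlier in this notebook, typed in by hand, so the numbers describe that sample and not the full record of 68 members. The `constituency` column is an addition for illustration; the notebook's own `pd_vote` holds only `names` and `votes`.

```python
import pandas as pd

# Hand-typed sample: the first ten members shown in the outputs above.
sample = pd.DataFrame({
    "names": ["梁君彥", "涂謹申", "梁耀忠", "石禮謙", "張宇人",
              "李國麟", "林健鋒", "黃定光", "李慧琼", "陳克勤"],
    "votes": ["Present", "Yes", "Yes", "No", "No",
              "Absent", "No", "Absent", "Absent", "Yes"],
    "constituency": ["Functional"] * 9 + ["Geographical"],
})

# How many members cast each kind of vote?
counts = sample["votes"].value_counts()
print(counts)

# Cross-tabulate votes by constituency to look for group-level patterns.
table = pd.crosstab(sample["constituency"], sample["votes"])
print(table)
```

With the full data, the same two calls can be applied to a DataFrame built from `root.iter('member')`, adding `member.get('constituency')` as a third column.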