# Parsing XML files with BeautifulSoup

Now, we are going to demonstrate how to use BeautifulSoup to extract information from the XML file "Melbourne_bike_share.xml" that we used in the reading materials. For the BeautifulSoup documentation, please refer to its <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all">official website</a>.

In [None]:
from bs4 import BeautifulSoup
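# Open the file in a context manager so the handle is closed after parsing.
with open("./Melbourne_bike_share.xml") as xml_file:
    btree = BeautifulSoup(xml_file, "lxml-xml")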

You can also print out the BeautifulSoup object by calling the <font color="blue">prettify()</font> method.

In [None]:
print(btree.prettify())

It is easy to see that the information we would like to extract is stored in the following tags:
<ul>
<li>id </li>
<li>featurename </li>
<li>terminalname </li>
<li>nbbikes </li>
<li>nbemptydoc </li>
<li>uploaddate </li>
<li>coordinates </li>
</ul>

Each record is stored in "&lt;row&gt; &lt;/row&gt;". To extract information from all of those tags except "coordinates", we use the <font color="blue">find_all()</font> method. Its documentation can be found <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all">here</a>.

In [None]:
featuretags = btree.find_all("featurename")
featuretags
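
To see how <font color="blue">find_all()</font> behaves without loading the full file, here is a minimal sketch on a two-record sample. The sample XML and the variable names are assumptions made for illustration; the real file's layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample, not taken from Melbourne_bike_share.xml.
sample = """<response>
  <row><id>1</id><featurename>Station A</featurename></row>
  <row><id>2</id><featurename>Station B</featurename></row>
</response>"""
soup = BeautifulSoup(sample, "lxml-xml")

all_names = soup.find_all("featurename")            # every match, in document order
first_only = soup.find_all("featurename", limit=1)  # stop after the first match
first_tag = soup.find("featurename")                # a single Tag, or None if absent
missing = soup.find("nosuchtag")                    # no such tag in the sample

print(len(all_names), len(first_only), first_tag.string, missing)
```

The result of <font color="blue">find_all()</font> behaves like a list of tags, so it can be indexed, measured with len(), and iterated over.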

The output shows that <font color="blue">find_all()</font> returns all 50 station names, each still wrapped in its tag. Now, we need to strip the tags and keep only the text stored between them.

In [None]:
for feature in featuretags:
    print(feature.string)
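
One detail worth knowing: the .string attribute only returns text when a tag has a single child. If a tag contains nested markup, .string is None, while get_text() joins all the text inside the tag. The sketch below uses a made-up sample to show the difference; whether this case occurs in the bike share file is not something the notebook establishes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second featurename contains a nested <b> tag.
sample = (
    "<row>"
    "<featurename>Station A</featurename>"
    "<featurename>Station <b>B</b></featurename>"
    "</row>"
)
soup = BeautifulSoup(sample, "lxml-xml")
tags = soup.find_all("featurename")

strings = [t.string for t in tags]    # None for the tag with nested markup
texts = [t.get_text() for t in tags]  # concatenates all text inside each tag

print(strings)
print(texts)
```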

Now, we can put the above code together using a list comprehension.

In [None]:
featurenames = [feature.string for feature in btree.find_all("featurename")]

In [None]:
featurenames

Similarly, we can use the <font color="blue">find_all()</font> method to extract the remaining fields.

In [None]:
nbbikes = [feature.string for feature in btree.find_all("nbbikes")]
nbbikes

In [None]:
NBEmptydoc = [feature.string for feature in btree.find_all("nbemptydoc")]
NBEmptydoc

In [None]:
TerminalNames = [feature.string for feature in btree.find_all("terminalname")]
TerminalNames

In [None]:
UploadDate = [feature.string for feature in btree.find_all("uploaddate")]
UploadDate

In [None]:
ids = [feature.string for feature in btree.find_all("id")]
ids
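
Building one list per tag keeps the fields aligned only if every record contains every tag. As a hedged alternative, the sketch below walks the records one at a time and fills in None for anything missing. It assumes each record is a single, non-nested "row" element; the sample XML and the helper text_or_none are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second record has no <nbbikes> tag.
sample = """<response>
  <row><id>1</id><featurename>Station A</featurename><nbbikes>5</nbbikes></row>
  <row><id>2</id><featurename>Station B</featurename></row>
</response>"""
soup = BeautifulSoup(sample, "lxml-xml")

FIELDS = ("id", "featurename", "nbbikes")


def text_or_none(parent, name):
    """Return the text of the first child tag called `name`, or None if absent."""
    tag = parent.find(name)
    return tag.string if tag is not None else None


# One dictionary per record, so a missing tag cannot shift the other columns.
records = [
    {name: text_or_none(row, name) for name in FIELDS}
    for row in soup.find_all("row")
]
print(records)
```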

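Now, how can we extract the attribute values from the tag called "coordinates"? The latitude and longitude are stored as attributes of the tag rather than as text between tags, so we index each tag like a dictionary.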

In [None]:
lattitudes = [coord["latitude"] for coord in btree.find_all("coordinates")]
lattitudes

In [None]:
longitudes = [coord["longitude"] for coord in btree.find_all("coordinates")]
longitudes
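
Indexing a tag with square brackets raises a KeyError when the attribute is absent. A more forgiving option is the tag's get() method, which returns None instead. The following is a small sketch on a made-up sample in which the second record lacks a longitude; the sample is an assumption, not an excerpt from the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <coordinates> tag has no longitude attribute.
sample = """<rows>
  <row><coordinates latitude="-37.81" longitude="144.94"/></row>
  <row><coordinates latitude="-37.82"/></row>
</rows>"""
soup = BeautifulSoup(sample, "lxml-xml")
coords = soup.find_all("coordinates")

# .attrs exposes all attributes of a tag as a dictionary.
print(coords[0].attrs)

# .get() returns None for a missing attribute instead of raising KeyError.
lats = [c.get("latitude") for c in coords]
lons = [c.get("longitude") for c in coords]
print(lats, lons)
```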

After the extraction, we can put all the information in a Pandas DataFrame.

In [None]:
import pandas as pd 
dataDict = {}
dataDict['Featurename'] = featurenames
dataDict['TerminalName'] = TerminalNames
dataDict['NBBikes'] = nbbikes
dataDict['NBEmptydoc'] = NBEmptydoc
dataDict['UploadDate'] = UploadDate
dataDict['lat'] = lattitudes
dataDict['lon'] = longitudes
df = pd.DataFrame(dataDict, index = ids)
df.index.name = 'ID'
df.head()
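
Everything extracted from the XML is text, so the numeric and date columns would still need converting before any arithmetic or time-based analysis. The sketch below shows one way to do that on a small stand-in DataFrame. The values, the timestamp format, and the name df_demo are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values standing in for the lists extracted above.
df_demo = pd.DataFrame(
    {
        "NBBikes": ["5", "12"],
        "UploadDate": ["2017-01-15T10:30:00", "2017-01-15T10:45:00"],
        "lat": ["-37.81", "-37.82"],
    },
    index=pd.Index(["1", "2"], name="ID"),
)

# Convert text columns to numeric and datetime types.
df_demo["NBBikes"] = pd.to_numeric(df_demo["NBBikes"])
df_demo["lat"] = pd.to_numeric(df_demo["lat"])
df_demo["UploadDate"] = pd.to_datetime(df_demo["UploadDate"])

print(df_demo.dtypes)
```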