# Other data file formats

In this Notebook, you will learn how to work with a variety of other file formats. Details for some file formats are left deliberately sparse. If you find yourself spending a lot of time working with such file formats, feel free to add additional notes to this Notebook, or create a new Notebook to record the recipes you find useful.

## Spreadsheet files (Excel XLS and XLSX files)

Although spreadsheet files are one of the most widely used file formats for sharing data, we have relegated them to this Notebook because we want you to get into the habit of using other file formats to publish and request data yourself.  

Part 7 of the module looks at some of the weaknesses of spreadsheets for analysing and managing data.

As one of the most widely used spreadsheet applications, the file formats used by Microsoft Excel by default are the ones most commonly encountered. Excel spreadsheet files can be recognised from the file extensions *.xls* and *.xlsx*.

You can read a sheet from a spreadsheet file into a *pandas* DataFrame using the `read_excel()` function.

In [1]:
# We can try to import a sheet directly into pandas using the read_excel() function.
# We'll preview only the first three rows to see what it brings in.
import pandas as pd

In [2]:
# The following spreadsheet is taken from the Greater London Authority, London DataStore.
#                     https://londondatastore-upload.s3.amazonaws.com/tfl-buses-type.xls
#                     [retrieved 20/07/15]

#Set the sheet_name parameter to None to load in all the sheets as a dict of DataFrames
# (older versions of pandas called this parameter sheetname)
xl = pd.read_excel('data/tfl-buses-type.xls', sheet_name=None)
xl.keys()

odict_keys(['Metadata', 'Data'])

In [3]:
#Preview the first few rows of the Data sheet
xl['Data'][:3]

Unnamed: 0,Unnamed: 1.1,Number of buses,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
Bus Type,Drive train type,2010.0,2011.0,2012.0,2013.0,2014.0
New Routemaster,Hybrid,0.0,0.0,5.0,8.0,168.0
Routemaster,Diesel,18.0,18.0,19.0,20.0,19.0


In [4]:
#Alternatively, we can read in a single sheet by name
pd.read_excel('data/tfl-buses-type.xls', sheet_name='Data')[:3]

Unnamed: 0,Unnamed: 1.1,Number of buses,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
Bus Type,Drive train type,2010.0,2011.0,2012.0,2013.0,2014.0
New Routemaster,Hybrid,0.0,0.0,5.0,8.0,168.0
Routemaster,Diesel,18.0,18.0,19.0,20.0,19.0


In [5]:
# It looks OK, so let's read the whole sheet:
data = pd.read_excel('data/tfl-buses-type.xls', sheet_name='Data')
data

Unnamed: 0,Unnamed: 1.1,Number of buses,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
Bus Type,Drive train type,2010.0,2011.0,2012.0,2013.0,2014.0
New Routemaster,Hybrid,0.0,0.0,5.0,8.0,168.0
Routemaster,Diesel,18.0,18.0,19.0,20.0,19.0
Artic,Diesel,320.0,260.0,0.0,0.0,0.0
Single deck,Diesel,2676.0,2670.0,2661.0,2608.0,2606.0
,Fuel Cell/Hybrid,0.0,5.0,5.0,5.0,8.0
,Hybrid,27.0,27.0,33.0,28.0,23.0
,Electric,0.0,0.0,0.0,0.0,2.0
Double deck,Diesel,5554.0,5487.0,5787.0,5696.0,5296.0
,Hybrid,29.0,79.0,233.0,352.0,643.0


By inspecting this data, or by opening the spreadsheet using a spreadsheet application or the OpenRefine tool (which is introduced in Part 2 of the module), we can check how many of the first few rows are metadata or blank rows. We can discount a certain number of lines at the top of the sheet using the `skiprows` parameter, or we can specify the row number of the header row explicitly using the `header` parameter and ignore the rows preceding that one. We can also define which columns we wish to import using the `usecols` parameter.
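The following is a minimal sketch of how these options behave. To keep it runnable without the spreadsheet file, it reads made-up CSV text with `read_csv()`, on the assumption that `skiprows`, `header` and `usecols` act in the same way when passed to `read_excel()`. The layout of the sample text is invented for illustration.

```python
import io
import pandas as pd

# Made-up text laid out like a sheet with a title row and an empty row
# above the real header row (CSV stands in for the spreadsheet file).
raw = (
    "Number of buses,,,\n"
    ",,,\n"
    "Bus Type,Drive train type,2013,2014\n"
    "New Routemaster,Hybrid,8,168\n"
    "Routemaster,Diesel,20,19\n"
)

# Option 1: discard the first two lines, then treat the next line as the header.
skipped = pd.read_csv(io.StringIO(raw), skiprows=2)

# Option 2: name the header row directly (counting from 0); rows above it are ignored.
by_header = pd.read_csv(io.StringIO(raw), header=2)

# Option 3: additionally keep only the first and fourth columns.
subset = pd.read_csv(io.StringIO(raw), header=2, usecols=[0, 3])

print(skipped)
print(subset)
```

For the bus spreadsheet itself, the equivalent call would pass the same keyword arguments to `pd.read_excel()` alongside the file path and sheet name.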

NaN values sometimes indicate that cells are empty, or that they contain formulae or other non-value data. In the cells under those containing 'Single deck' and 'Double deck', and alongside the description in the final row, the NaNs are there because the original cells were merged into a single cell spanning several rows or columns.
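One common way to repair the gaps left by merged label cells is to fill each missing label with the last label seen above it. The sketch below is not part of the original recipe: it rebuilds a few rows of the 'Data' sheet by hand and assumes that the label at the top of each merged block applies to every row beneath it.

```python
import pandas as pd

# A few rows from the 'Data' sheet shown above; None marks the cells that
# pandas reports as NaN because they were part of a merged cell.
buses = pd.DataFrame({
    'Bus Type': ['Single deck', None, None, None, 'Double deck', None],
    'Drive train type': ['Diesel', 'Fuel Cell/Hybrid', 'Hybrid',
                         'Electric', 'Diesel', 'Hybrid'],
    '2014': [2606, 8, 23, 2, 5296, 643],
})

# Forward fill copies the last seen label down into the gaps below it.
buses['Bus Type'] = buses['Bus Type'].ffill()
print(buses)
```

Forward filling is only appropriate where a blank genuinely means "same as above"; it should not be applied to columns in which a blank means the value is missing.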

For more information, see the documentation for the [*pandas* `read_excel()` function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html).

### *xlrd*

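The `xlrd` package is a library for reading data from Excel spreadsheet files. It provides lower level access to the contents of a spreadsheet than `pandas` does. Note that `xlrd` only reads files, it does not write them, and that versions from 2.0 onwards read only the older .xls format, not .xlsx.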

For more details see: http://xlrd.readthedocs.io/en/latest/

In [6]:
import xlrd

workbook = xlrd.open_workbook('data/tfl-buses-type.xls')
# The library also allows us to preview the sheet names.
print(workbook.sheet_names())

['Metadata', 'Data']


In [7]:
# By manual inspection of the originally previewed sheet, we can use 
# xlrd to read the metadata from the metadata cell.
# Note that row and column indices are integers, counted from 0, 
# and also note that some cells span multiple rows.
sheet = workbook.sheet_by_name('Data')
sheet.cell_value(rowx=14, colx=0)

'Commentary: The figures supplied reflect the fleet as of March 31, 2014. The number of hybrid buses in the fleet which is tracked through the year, unlike other statistics, now stands at around 860 and is projected to rise to 1,700 (of which 600 will be NBfLs) by 2016.'

## XML Files

Importing XML data into a *pandas* DataFrame has traditionally been a little trickier than importing JSON, because older versions of *pandas* provide no built-in function to support the import (a `read_xml()` function was added in *pandas* 1.3).

Instead, you need to load in a file, parse it using a third party parser such as `lxml`, and then handle the mapping to the DataFrame yourself.
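The load, parse and map steps can be sketched as follows. This is an illustration under stated assumptions rather than a general recipe: the XML layout and element names are invented, and the standard library's `xml.etree.ElementTree` parser is used in place of `lxml` so that the example runs without extra installs (`lxml.etree` offers a similar interface).

```python
import xml.etree.ElementTree as ET

# Made-up XML: one <bus> element per row of the eventual table.
xml_text = """<fleet>
  <bus type="New Routemaster" drivetrain="Hybrid"><count year="2014">168</count></bus>
  <bus type="Routemaster" drivetrain="Diesel"><count year="2014">19</count></bus>
</fleet>"""

root = ET.fromstring(xml_text)

# Map each <bus> element to a flat dict: one dict per row, one key per column.
records = []
for bus in root.findall('bus'):
    count = bus.find('count')
    records.append({
        'Bus Type': bus.get('type'),
        'Drive train type': bus.get('drivetrain'),
        'Year': int(count.get('year')),
        'Number of buses': int(count.text),
    })

print(records)
```

A list of dicts in this shape can be passed straight to `pd.DataFrame(records)` to obtain a DataFrame with one row per `<bus>` element.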

Alternatively, use OpenRefine to parse the elements of the XML document that you are interested in and then save the data out again as a tabular CSV document which is a little easier to import.

We will try to limit our use of XML-based datasets in this module, preferring instead CSV formats for tabular data and JSON for more elaborately structured datasets. You will, however, work with a particular style of XML later in the module when you look at Linked Data and the semantic web.

One thing worth bearing in mind is that popular XML-based formats often have dedicated Python libraries that make it easier to parse them, and to read and write files that use the format. For example, the KML format that is used to transport geographical data (points, lines, boundaries) can be parsed using the `fastkml` library.
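Because KML is itself XML, it can help to see what a dedicated library is doing on your behalf. The sketch below uses the standard library parser on a made-up KML fragment (the placemark names and coordinates are invented) and pulls out each placemark by hand. Note that KML stores each coordinate as longitude, latitude and optional altitude, so the first two values are swapped to obtain (latitude, longitude) pairs.

```python
import xml.etree.ElementTree as ET

# Made-up KML fragment with two point placemarks.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example car park A</name>
      <Point><coordinates>-1.30,50.70,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Example car park B</name>
      <Point><coordinates>-1.16,50.72,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# KML elements live in a namespace, which must be given when searching.
ns = {'kml': 'http://www.opengis.net/kml/2.2'}
root = ET.fromstring(kml_text)

locations_from_kml = {}
for placemark in root.iterfind('.//kml:Placemark', ns):
    name = placemark.find('kml:name', ns).text
    coords = placemark.find('kml:Point/kml:coordinates', ns).text.strip()
    lon, lat = coords.split(',')[:2]          # KML order is lon,lat[,alt]
    locations_from_kml[name] = (float(lat), float(lon))

print(locations_from_kml)
```

A library such as `fastkml` handles the namespaces, the coordinate parsing and the other geometry types (lines, polygons) for you, which is why it is usually the more convenient choice.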

## Working with KML Files

In [8]:
# We can load in data from a KML file (a file format for geographic data sets) and 
# then render it onto a map quite easily.

# For example, in the data directory is a file, 'CarParks.kml' that contains a list of car park 
# locations on the Isle of Wight.
!ls data

CarParks.kml	    IOWcarparlocations.html  tmpfile.csv
document.yaml	    iwCouncilSpending	     tmp.json
IOWcarparlocations  tfl-buses-type.xls	     tmp_snowkuma.json


In [9]:
from fastkml import kml
k = kml.KML()

# We need to open the file as a bytestream - and let the lxml parser 
#          used by the fastkml package identify the encoding itself:
doc = open("data/CarParks.kml",'rb').read()
k.from_string(doc)

# The alternative is to open the file with a UTF-8 encoding to get a Unicode string, 
#   then throw away the first line that now incorrectly declares the encoding to be UTF-8.
#!head -n 3 data/CarParks.kml
#doc = open("data/CarParks.kml", encoding='utf-8')
#lines = '\n'.join(doc.readlines()[1:])
#k.from_string(lines)

# We can parse the locations of the carpark placemarks from the file
locations = dict()
for feature in k.features():
    for placemark in feature.features():
        locations.update({placemark.name: (placemark.geometry.y, placemark.geometry.x)})
list(locations)

['Sea Street, Newport',
 'Brunswick Road, Cowes',
 'Garfield Road, Ryde',
 'Medina Avenue, Newport',
 'Station Avenue, Sandown',
 'The Grove, Ventnor',
 'River Road, Yarmouth',
 'Dudley Road, Ventnor',
 'Atherley Road, Shanklin',
 'Avenue Road, Freshwater',
 'The Duver, Seaview',
 'Freshwater Bay, Freshwater',
 'Moa place, Freshwater',
 'Pound Lane, Ventnor',
 'Vernon Meadow, Shanklin',
 'Little London, Newport',
 'Lugley Street, Newport',
 'Orchardleigh Road, Shanklin',
 'Seaclose Recreation Ground, Newport',
 'Quay Road, Ryde',
 'Maresfield Road, East Cowes',
 'Lind Place, Ryde',
 'Cross Street, Cowes',
 'Church Litten, Newport ',
 'New Street, Newport',
 'Shore Road, Bonchurch',
 'Market Street, Ventnor',
 'La Falaise, Ventnor',
 'Lane End, Bembridge',
 'Esplanade Gardens, Shanklin',
 'St Thomas Street Upper/Lower, Ryde',
 'The Duver, St Helens',
 'Yaverland, Sandown',
 'Chapel Street, Newport',
 'St Johns Road, Sandown',
 'Hope Road, Shanklin',
 'Central (High Street), Ventnor',
 ...]

In [10]:
# Let's quickly map the markers to show how the parser 
#       has pulled out the placemark information:
import folium
# We will look at folium in more detail in Notebooks for Part 5 of the module.

# NOTE: folium uses an external tileset to render the map background appearance.
#       This requires that you have an internet connection when the map is being
#       displayed. It may use cached tile data, but some tiles will be missing if you 
#       change scale by zooming.

# If we know the latitude and longitude at the centre of the map we want to display, 
#    we can set it directly:
carparks = folium.Map(location=[50.68, -1.2667], width = 960, height = 500, zoom_start=11)

# Alternatively, we could calculate it as the mean latitude and longitude 
#     of the points we wish to plot (a handy recipe):
#latSum = lonSum = 0
#for name, location in locations.items():
#    latSum += location[0]
#    lonSum += location[1]
#carparks = folium.Map(location=[latSum/len(locations.items()), 
#                                lonSum/len(locations.items())], 
#                               width = 960, height = 500, zoom_start=11)

# The following loops through the location items, splitting out the car park name
#                    and the location as a pair of latitude and longitude values.
# For each location, it then plots a circle marker on the map with the name as a popup string.

for name, location in locations.items():
    folium.CircleMarker(location=location,
                        popup=name,
                        radius=20,
                        fill_color='blue',
                        fill_opacity=0.2
                   ).add_to(carparks)

# Display the map (this will not display a map if you are offline)
carparks

In [11]:
# Finally we create the HTML file for the map, and display it below.
#   (The HTML file can be opened directly from your browser)

carparks.save('data/IOWcarparlocations.html')
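The commented-out centre calculation in the mapping cell above can be written a little more compactly. This is a sketch using made-up coordinates in the same name to (latitude, longitude) shape as the `locations` dict; the name `sample_locations` is hypothetical. A simple mean is adequate for points that are close together, as they are here.

```python
# Hypothetical name -> (latitude, longitude) pairs, shaped like `locations`.
sample_locations = {
    'Example car park A': (50.70, -1.30),
    'Example car park B': (50.72, -1.16),
    'Example car park C': (50.60, -1.20),
}

# Average the latitudes and the longitudes separately.
mean_lat = sum(lat for lat, lon in sample_locations.values()) / len(sample_locations)
mean_lon = sum(lon for lat, lon in sample_locations.values()) / len(sample_locations)

centre = [mean_lat, mean_lon]
print(centre)
```

The resulting list can be passed as the `location` argument when creating the map, in place of the hand-typed coordinates.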

## YAML

*pandas* does not support YAML imports directly, but it is possible to use a library such as `PyYAML` to load in a YAML file and convert it to a Python dict that can then be transformed to a *pandas* DataFrame.

WARNING: The `yaml.load()` and `yaml.load_all()` functions should not be used to parse arbitrary content from unsafe sources. These functions are capable of creating arbitrary Python objects, including code. The `yaml.safe_load()` and `yaml.safe_load_all()` functions limit that ability to objects that cannot generate executable code.
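The difference can be seen with a small, harmless test. The sketch below passes `yaml.safe_load()` a made-up string that uses a Python-specific tag to request a function call. The safe loader has no constructor for that tag, so it raises an error instead of acting on it, while ordinary data is still parsed as usual.

```python
import yaml

# A made-up "untrusted" document that asks the loader to call a Python function.
untrusted = "!!python/object/apply:os.system ['echo unsafe']"

try:
    yaml.safe_load(untrusted)
    rejected = False
except yaml.YAMLError as err:
    # The safe loader refuses tags that would construct arbitrary Python objects.
    rejected = True
    print('Rejected:', type(err).__name__)

# Plain data is unaffected by the restriction.
plain = yaml.safe_load("width: 800")
print(plain)
```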

As with XML, we will tend *not* to focus on the use of YAML, preferring instead JSON and CSV representations.

In [None]:
import yaml

# yaml.safe_load() accepts a single YAML document as a string and parses it
# to generate a Python dict (as does yaml.load(), but see the warning above):
document = """
image:
    width: 800
    height: 600
    title:  View from 15th Floor
    thumbnail:
        url: http://www.example.com/image/481989943
        height: 125
        width:  100
        animated : false
    IDs:
        - 116
        - 943
        - 234
        - 38793
"""
parsedYAML = yaml.safe_load(document)
parsedYAML

In [None]:
# yaml.load() and yaml.safe_load() will also accept an open file object (a stream)
#     and read from that, converting the contents to a Python dict.

# Note that yaml.load_all(stream) and yaml.safe_load_all(stream) will parse a file 
#  containing a sequence of YAML documents to produce a sequence of dicts.

# Here 'document.yaml' contains a single YAML document.
# The with statement closes the file once it has been read.
with open('data/document.yaml', 'r') as stream:
    parsedFile = yaml.safe_load(stream)
parsedFile
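As a brief illustration of the multi-document case, the sketch below parses a made-up string containing two YAML documents separated by `---`. `yaml.safe_load_all()` returns a generator, so it is wrapped in `list()` to collect the results.

```python
import yaml

# Two made-up YAML documents in one stream, separated by '---'.
multi = """
title: First document
width: 800
---
title: Second document
width: 100
"""

# safe_load_all() yields one Python object per document.
documents = list(yaml.safe_load_all(multi))
print(documents)
```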


In [None]:
# We can also serialise a dict as YAML text using the yaml.dump() function:
print(yaml.dump(parsedYAML))
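The introduction to this section mentions transforming parsed YAML into a *pandas* DataFrame. A minimal sketch of that final step is shown below, assuming the YAML holds a list of mappings with the same keys, one per table row. The car park names and the `spaces` values are made up for illustration.

```python
import pandas as pd
import yaml

# Made-up YAML: a list of mappings, one per row of the eventual table.
rows_yaml = """
- name: Example car park A
  spaces: 120
- name: Example car park B
  spaces: 45
"""

records = yaml.safe_load(rows_yaml)   # parsed to a list of dicts
df = pd.DataFrame(records)            # one row per dict, one column per key
print(df)
```

More deeply nested documents, such as the image example above, do not map onto a table so directly. One option is to flatten them first, for example with `pd.json_normalize()`, which also works on dicts that did not come from JSON.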

For those interested in exploring Python's handling of YAML further, the `PyYAML` library documentation can be found at http://pyyaml.org/wiki/PyYAMLDocumentation.

## Summary
In this Notebook you have seen how to:
1. read .xls and .xlsx spreadsheet files
2. handle XML files
3. read KML files and plot the map data using folium
4. parse YAML data and load it into a Python dict.


## What next?

That completes the coverage of data file formats for this module; we will make extensive use of CSV and JSON formats in the module and may introduce others as we work through different tools and techniques.

Return to the module materials now.