# Introduction

[Databaker](https://github.com/sensiblecodeio/databaker) is an open-source Python library for converting semi-structured spreadsheets into computer-friendly data tables.  The resulting data can be stored in [pandas data tables](http://pandas.pydata.org/) or in the ONS-specific WDA format.

Databaker runs inside the interactive programming environment [Jupyter](http://jupyter.org/) for fast prototyping and development, and relies on [messytables](http://messytables.readthedocs.io/en/latest/) and [xypath](https://github.com/sensiblecodeio/xypath) for its spreadsheet processing.

Install it with the command:

> `pip3 install databaker`

Your main interaction with Databaker is through the Jupyter notebook interface.  There are many tutorials elsewhere online that show you how to master this system.

Once you have a working program that converts a particular spreadsheet style into the output you want, there are ways to rerun the notebook on other spreadsheets externally or from the command line.
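
The text above does not say which mechanism to use for rerunning a notebook. One generic option, offered here as an assumption rather than as Databaker's documented method, is Jupyter's own `nbconvert` tool, which can execute a notebook from the command line. The sketch below only builds the command and the environment. The notebook name `convert.ipynb`, the helper functions, and the `DATABAKER_INPUT` environment variable are all hypothetical, and the notebook itself would have to read that variable to choose its input spreadsheet.

```python
import os
import subprocess


def build_nbconvert_command(notebook, output):
    """Return the argument list that executes `notebook` and saves the result as `output`."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", notebook,
        "--output", output,
    ]


def build_environment(spreadsheet, base_env=None):
    """Copy the environment and add the hypothetical DATABAKER_INPUT variable."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATABAKER_INPUT"] = spreadsheet
    return env


def rerun_notebook(notebook, spreadsheet, output):
    """Execute the notebook against one spreadsheet; requires Jupyter to be installed."""
    subprocess.run(
        build_nbconvert_command(notebook, output),
        env=build_environment(spreadsheet),
        check=True,
    )


# Show the command without running it.
command = build_nbconvert_command("convert.ipynb", "convert_run.ipynb")
print(" ".join(command))
```

Calling `rerun_notebook("convert.ipynb", "other.xls", "convert_run.ipynb")` would then execute the notebook once per spreadsheet, provided the notebook picks up the file name from the environment.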

# Example

Although Databaker can handle spreadsheets of any size, here is a tiny example from the tutorials to illustrate what it does.

In [1]:
from databaker.framework import *

tab = loadxlstabs("example1.xls", "beatles", verbose=False)[0]
savepreviewhtml(tab, verbose=False)


| 0 | 1 | 2 | 3 |
|---|---|---|---|
| Date | 2014.0 | | |
| | | | |
| | Cars | Planes | Trains |
| John | 2.0 | 2.0 | 1.0 |
| Paul | 4.0 | 3.0 | 2.0 |
| Ringo | 4.0 | 1.0 | 3.0 |
| George | 2.0 | 5.0 | 5.0 |


## Conversion segments
Databaker gives you tools to help you write the code to navigate around the spreadsheet and select the cells and their correspondences.  

When you are done, your code will look like the following.

In the notebook preview, you can click on the OBS (observation) cells to see how they connect to the headings.

In [2]:
r1 = tab.excel_ref('B3').expand(RIGHT)
r2 = tab.excel_ref('A3').fill(DOWN)
dimensions = [ 
    HDim(tab.excel_ref('B1'), TIME, CLOSEST, ABOVE), 
    HDim(r1, "Vehicles", DIRECTLY, ABOVE), 
    HDim(r2, "Name", DIRECTLY, LEFT), 
    HDimConst("Category", "Beatles")
]
observations = tab.excel_ref('B4').expand(DOWN).expand(RIGHT).is_not_blank().is_not_whitespace()
c1 = ConversionSegment(observations, dimensions)
savepreviewhtml(c1)


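| 0 | 1 | 2 | 3 |
|---|---|---|---|
| OBS | TIME | Vehicles | Name |

| 0 | 1 | 2 | 3 |
|---|---|---|---|
| Date | 2014.0 | | |
| | | | |
| | Cars | Planes | Trains |
| John | 2.0 | 2.0 | 1.0 |
| Paul | 4.0 | 3.0 | 2.0 |
| Ringo | 4.0 | 1.0 | 3.0 |
| George | 2.0 | 5.0 | 5.0 |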
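
To make the correspondences concrete, here is a plain-Python sketch of what the dimension lookups above amount to for this sheet. It is not Databaker's implementation and does not use its API: `GRID`, `lookup_records`, and the constants are hypothetical names, and the `CLOSEST, ABOVE` lookup is simplified to a fixed cell because the sheet has only one time value.

```python
# The example sheet as rows of cells; None marks an empty cell.
GRID = [
    ["Date", 2014.0, None, None],
    [None, None, None, None],
    [None, "Cars", "Planes", "Trains"],
    ["John", 2.0, 2.0, 1.0],
    ["Paul", 4.0, 3.0, 2.0],
    ["Ringo", 4.0, 1.0, 3.0],
    ["George", 2.0, 5.0, 5.0],
]

HEADER_ROW = 2      # row 3 in Excel terms: Cars, Planes, Trains
NAME_COLUMN = 0     # column A: John, Paul, Ringo, George
TIME_CELL = (1, 0)  # zero-based (x, y) of cell B1


def lookup_records(grid):
    """Pair every non-blank observation cell with its headings."""
    time_x, time_y = TIME_CELL
    records = []
    for y in range(HEADER_ROW + 1, len(grid)):
        for x in range(NAME_COLUMN + 1, len(grid[y])):
            value = grid[y][x]
            if value is None:
                continue  # skip blank cells, in the spirit of is_not_blank()
            records.append({
                "OBS": value,
                "TIME": int(grid[time_y][time_x]),  # the single time cell
                "Vehicles": grid[HEADER_ROW][x],    # heading directly above
                "Name": grid[y][NAME_COLUMN],       # heading directly to the left
                "Category": "Beatles",              # constant dimension
            })
    return records


records = lookup_records(GRID)
for record in records[:3]:
    print(record)
```

Each observation ends up as one record carrying its value plus one entry per dimension, which is the shape of the table that the conversion segment produces.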


## Output in pandas
[Pandas data tables](http://pandas.pydata.org/) provide enormous scope for further processing and cleaning of the data.

To make full use of their power you should become familiar with the pandas [time series functionality](http://pandas.pydata.org/pandas-docs/stable/timeseries.html), which allows you to plot, resample and align multiple data sources at once.


In [3]:
c1.topandas()

TIMEUNIT='Year'


| | OBS | TIME | TIMEUNIT | Vehicles | Name | Category | __x | __y | __tablename |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2014 | Year | Cars | John | Beatles | 1 | 3 | beatles |
| 1 | 2.0 | 2014 | Year | Planes | John | Beatles | 2 | 3 | beatles |
| 2 | 1.0 | 2014 | Year | Trains | John | Beatles | 3 | 3 | beatles |
| 3 | 4.0 | 2014 | Year | Cars | Paul | Beatles | 1 | 4 | beatles |
| 4 | 3.0 | 2014 | Year | Planes | Paul | Beatles | 2 | 4 | beatles |
| 5 | 2.0 | 2014 | Year | Trains | Paul | Beatles | 3 | 4 | beatles |
| 6 | 4.0 | 2014 | Year | Cars | Ringo | Beatles | 1 | 5 | beatles |
| 7 | 1.0 | 2014 | Year | Planes | Ringo | Beatles | 2 | 5 | beatles |
| 8 | 3.0 | 2014 | Year | Trains | Ringo | Beatles | 3 | 5 | beatles |
| 9 | 2.0 | 2014 | Year | Cars | George | Beatles | 1 | 6 | beatles |
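
Judging from the rows above, the `__x` and `__y` columns look like zero-based column and row indices of the source cell: the first row has `__x` = 1 and `__y` = 3, which matches cell B4 (John, Cars). Under that assumption, the small helper below converts an Excel-style reference into the same coordinates, which can help when tracing an output row back to the spreadsheet. `excel_ref_to_xy` is a hypothetical function written for this illustration, not part of Databaker.

```python
import re


def excel_ref_to_xy(ref):
    """Convert an Excel reference such as 'B4' to zero-based (x, y) indices."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a single-cell Excel reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters.upper():
        # Column letters form a base-26 number with A=1 ... Z=26.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column - 1, int(digits) - 1


print(excel_ref_to_xy("B4"))  # (1, 3)
```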


## Output in WDA Observation File
The WDA system at the ONS has been the primary use for this library.  If you need WDA output, the result looks like the following:

In [4]:
print(writetechnicalCSV(None, c1))

observation,data_marking,statistical_unit_eng,statistical_unit_cym,measure_type_eng,measure_type_cym,observation_type,empty,obs_type_value,unit_multiplier,unit_of_measure_eng,unit_of_measure_cym,confidentuality,empty1,geographic_area,empty2,empty3,time_dim_item_id,time_dim_item_label_eng,time_dim_item_label_cym,time_type,empty4,statistical_population_id,statistical_population_label_eng,statistical_population_label_cym,cdid,cdiddescrip,empty5,empty6,empty7,empty8,empty9,empty10,empty11,empty12,dim_id_1,dimension_label_eng_1,dimension_label_cym_1,dim_item_id_1,dimension_item_label_eng_1,dimension_item_label_cym_1,is_total_1,is_sub_total_1,dim_id_2,dimension_label_eng_2,dimension_label_cym_2,dim_item_id_2,dimension_item_label_eng_2,dimension_item_label_cym_2,is_total_2,is_sub_total_2,dim_id_3,dimension_label_eng_3,dimension_label_cym_3,dim_item_id_3,dimension_item_label_eng_3,dimension_item_label_cym_3,is_total_3,is_sub_total_3
2.0,,,,,,,,,,,,,,,,,2014,2014,,Year,,,,,,,,,,,,,0,,Vehicles,

## Further notes
Databaker has been developed by the [Sensible Code Company](http://sensiblecode.io/) under contract to the [Office for National Statistics](https://www.ons.gov.uk/).

The first version was written in 2014 and ran only as a command line script, with previews made via a coloured Excel spreadsheet.  This version still exists under the [version 1.2.0](https://github.com/sensiblecodeio/databaker/tree/1.2.0) tag and the documentation is hosted [here](https://sensiblecodeio.github.io/quickcode-ons-docs/).

This new version was developed at the end of 2015 to take advantage of the interactive programming capabilities of Jupyter and the freedom not to maintain backward compatibility.

See the remaining tutorial notebooks for more details.