# Function reference

This is a run-down of most of the functions in the system.

If a function looks useful but you don't quite understand the description, it should be easy to experiment with its outputs within this interactive programming environment.

In [7]:
from databaker.framework import *

# put your input-output files here
inputfile = "example1.xls"
outputfile = "example1.csv"
previewfile = "preview.html"


# Loading and saving

### tabs = loadxlstabs(inputfile, sheetids="*", verbose=True)
Loads an xls file into a list of tables, which act as bags of cells.
  
  
### savepreviewhtml(tab, htmlfilename=None, verbose=True)

Previews a table (or a list of cell bags or conversion segments that share the same table), either inline or in a separate file.
  
  
### writetechnicalCSV(outputfile, conversionsegments) 

Outputs a WDA format CSV file from a list of conversion segments or pandas dataframes


### readtechnicalCSV(wdafile, bverbose=False, baspandas=True)

Reads an old WDA file into a list of pandas tables, one for each segment.



# Cell bag selection
These functions generally apply to a table as well as a cell bag, but they always output a cell bag.

A cell bag `bag` always has a pointer to its original table `bag.table`.  You can also access the underlying unordered set of cells of a bag as `bag.unordered_cells`.

In [33]:
tab = loadxlstabs(inputfile, sheetids="stones", verbose=True)[0]
print(tab)


Loading example1.xls which has size 7168 bytes
Table names: ['stones']
{<E4 10.0>, <C1 ''>, <D3 'Rocks'>, <D7 'shale'>, <A2 'Date'>, <C6 'yes'>, <A3 'Year'>, <D2 ''>, <D5 'granite'>, <E6 2.0>, <A5 ''>, <A9 ''>, <C4 'yes'>, <B9 'Dec'>, <D6 'limestone'>, <A6 1989.0>, <C2 ''>, <C3 'present'>, <E9 8.0>, <C8 'yes'>, <A1 ''>, <B8 'Jun'>, <A4 1972.0>, <C5 'no'>, <D8 'basalt'>, <C9 'yes'>, <E2 ''>, <E1 ''>, <B6 'Feb'>, <E5 30.0>, <C7 'no'>, <D1 ''>, <E8 96.0>, <B7 'Mar'>, <B1 ''>, <E3 'cost'>, <A8 ''>, <E7 88.0>, <B2 ''>, <B3 'Month'>, <B5 'Aug'>, <A7 ''>, <D9 'ice'>, <D4 'chalk'>, <B4 'Jan'>}


### cellbag.is_XXX()
### cellbag.is_not_XXX()

Returns the cells which are, or are not, of kind XXX.
  
Allowable values of XXX: 

> bold, italic, underline, number, date, whitespace, strikeout, any_border, all_border, richtext

These functions can be chained, e.g. `cellbag.is_not_number().is_not_whitespace()`.
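
Chaining works because each call returns another bag. The following is a plain-Python model of that idea, not databaker's implementation: the `Bag` class and its two methods are hypothetical stand-ins, and cells are reduced to `(ref, value)` pairs.

```python
class Bag(frozenset):
    """Hypothetical stand-in for a cell bag: a frozen set of (ref, value) pairs."""

    def is_not_number(self):
        # Keep cells whose value is not an int or float.
        return Bag(c for c in self if not isinstance(c[1], (int, float)))

    def is_not_whitespace(self):
        # Keep cells whose value is not an empty or blank string.
        return Bag(
            c for c in self
            if not (isinstance(c[1], str) and c[1].strip() == "")
        )


bag = Bag({("A4", 1972.0), ("A5", ""), ("B4", "Jan"), ("E4", 10.0)})

# Each call returns a new Bag, so the calls can be chained.
text_cells = bag.is_not_number().is_not_whitespace()
print(sorted(text_cells))  # [('B4', 'Jan')]
```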

In [45]:
cellbag = tab
print("Numbered cells only:", cellbag.is_number())
print()
print("Not numbers:", cellbag.is_not_number())
print()
print("Not numbers and not whitespace:", cellbag.is_not_number().is_not_whitespace())
print()
print("Cells that seem to be a date:", cellbag.is_date())


Numbered cells only: {<A6 1989.0>, <E5 30.0>, <E8 96.0>, <E4 10.0>, <E9 8.0>, <E6 2.0>, <E7 88.0>, <A4 1972.0>}

Not numbers: {<C1 ''>, <D3 'Rocks'>, <D7 'shale'>, <B4 'Jan'>, <A2 'Date'>, <C6 'yes'>, <A3 'Year'>, <D4 'chalk'>, <D2 ''>, <D5 'granite'>, <A9 ''>, <C4 'yes'>, <B9 'Dec'>, <D6 'limestone'>, <C2 ''>, <C3 'present'>, <C8 'yes'>, <A1 ''>, <B8 'Jun'>, <D8 'basalt'>, <C5 'no'>, <C9 'yes'>, <E1 ''>, <B6 'Feb'>, <C7 'no'>, <D1 ''>, <B7 'Mar'>, <B1 ''>, <E3 'cost'>, <A8 ''>, <B2 ''>, <B3 'Month'>, <B5 'Aug'>, <A7 ''>, <D9 'ice'>, <E2 ''>, <A5 ''>}

Not numbers and not whitespace: {<D3 'Rocks'>, <D7 'shale'>, <A2 'Date'>, <C6 'yes'>, <A3 'Year'>, <D5 'granite'>, <C4 'yes'>, <B9 'Dec'>, <D6 'limestone'>, <C3 'present'>, <C8 'yes'>, <B8 'Jun'>, <D8 'basalt'>, <C5 'no'>, <C9 'yes'>, <B6 'Feb'>, <C7 'no'>, <B7 'Mar'>, <E3 'cost'>, <B3 'Month'>, <B5 'Aug'>, <D9 'ice'>, <D4 'chalk'>, <B4 'Jan'>}

Cells that seem to be a date: {<A4 1972.0>, <A6 1989.0>}


### cellbag.filter(word)

Only cells matching this word exactly

### cellbag.filter(function(cell))

Only cells where function(cell) == True


### cellbag.one_of([word1, word2])

Only cells matching one of the words
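
As a rough illustration of these three selections, here is a plain-Python sketch. The `Cell` type and the functions `filter_cells` and `one_of` are hypothetical and only model the behaviour described above; they are not databaker's code.

```python
from collections import namedtuple

# Hypothetical cell type for illustration only.
Cell = namedtuple("Cell", ["ref", "value"])

cells = [Cell("D3", "Rocks"), Cell("D9", "ice"), Cell("E4", 10.0), Cell("E5", 30.0)]


def filter_cells(cells, test):
    """Keep cells whose value equals `test`, or for which `test(cell)` is true."""
    if callable(test):
        return [c for c in cells if test(c)]
    return [c for c in cells if c.value == test]


def one_of(cells, words):
    """Keep cells whose value is any of the given words."""
    return [c for c in cells if c.value in words]


exact = filter_cells(cells, "ice")
# Guard the comparison: `"Rocks" > 20` raises TypeError in Python 3.
big = filter_cells(cells, lambda c: isinstance(c.value, float) and c.value > 20)
either = one_of(cells, ["Rocks", "ice", "mud"])  # "mud" matches nothing

print([c.ref for c in exact])   # ['D9']
print([c.ref for c in big])     # ['E5']
print([c.ref for c in either])  # ['D3', 'D9']
```

The type guard in the predicate is the reason a numeric comparison is usually applied only after narrowing the bag to numbers first.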


### cellbag.regex(regexp)

Only cells matching the regular expression
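
A regular expression selects cells by pattern rather than by a fixed word. The sketch below uses Python's standard `re` module on a plain dictionary of values. It assumes the pattern is matched from the start of the cell text (`re.match` semantics), which may differ from what databaker does internally; `regex_select` is a hypothetical helper.

```python
import re

values = {"B4": "Jan", "B5": "Aug", "B8": "Jun", "D4": "chalk", "E4": 10.0}


def regex_select(values, pattern):
    """Return the refs whose text value matches `pattern` from its start."""
    compiled = re.compile(pattern)
    return sorted(
        ref for ref, value in values.items()
        if isinstance(value, str) and compiled.match(value)
    )


print(regex_select(values, r"J.*"))    # ['B4', 'B8']
print(regex_select(values, r".*a.*"))  # ['B4', 'D4']
```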


### cellbag.excel_ref(ref)

Selects a cell by its Excel reference (column letter followed by row number), where 'A1' is the top left-hand corner.

This also works for single columns or rows (eg 'C', or '3') and ranges (eg 'A2:B3'). 

This way of accessing is not recommended unless you know that the spreadsheet you are working with won't have extra rows or columns inserted or deleted from it.  
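
To make the reference format concrete, here is a small sketch of how such references map to zero-based (column, row) positions. `parse_cell` and `parse_range` are hypothetical helpers written for this page; they handle single cells and ranges only, not whole-column or whole-row references.

```python
import re


def parse_cell(ref):
    """Convert a reference such as 'B4' to zero-based (column, row) = (1, 3)."""
    match = re.fullmatch(r"([A-Z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Column letters are base 26 with A=1, so 'Z' is 26 and 'AA' is 27.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column - 1, int(digits) - 1


def parse_range(ref):
    """Expand 'A2:B4' into the list of (column, row) pairs it covers."""
    start, end = ref.split(":")
    (x0, y0), (x1, y1) = parse_cell(start), parse_cell(end)
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


print(parse_cell("A1"))           # (0, 0)
print(len(parse_range("A2:B4")))  # 6
```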

### cellbag.by_index(n)

Selects a single cell from the cell bag of index n, where n=1 is the first element.  (n can also be a list of integers.)
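
The 1-based indexing can be modelled as follows. This sketch assumes cells are ordered row by row (A1, B1, ... then A2, ...), which is consistent with the example output further down but is not stated by the library; `by_index` here is a hypothetical free function, not the method itself.

```python
def by_index(cells, n):
    """Pick the n-th cell (1-based) in row-by-row order; n is an int or a list of ints."""
    # Sort by row first, then column, so A1, B1, ... come before A2.
    ordered = sorted(cells, key=lambda c: (c[1], c[0]))
    wanted = [n] if isinstance(n, int) else list(n)
    return [ordered[i - 1] for i in wanted]


# Cells as (column, row, label), zero-based, in no particular order.
cells = {(1, 0, "B1"), (0, 1, "A2"), (0, 0, "A1"), (1, 1, "B2")}

print(by_index(cells, 2))       # [(1, 0, 'B1')]
print(by_index(cells, [1, 4]))  # [(0, 0, 'A1'), (1, 1, 'B2')]
```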


### cellbag.assert_one()

Throws an exception if there is not exactly one cell in this bag (useful for validation if your filter above was supposed to return only one cell)


In [132]:
print("Get some matching cells", cellbag.one_of(["Rocks", "ice", "mud"]))
print("A3 is", cellbag.excel_ref("A3"))
print("A2:B4 is", cellbag.excel_ref("A2:B4"))
print()
print("Numbers greater than 20", cellbag.is_number().filter(lambda c: c.value>20))
print("Numbers less than 20", cellbag.is_number().filter(lambda c: c.value<20))
print()
print("The second cell in the whole table is", tab.by_index(2))

Get some matching cells {<D3 'Rocks'>, <D9 'ice'>}
A3 is {<A3 'Year'>}
A2:B4 is {<B3 'Month'>, <A2 'Date'>, <B2 ''>, <A4 1972.0>, <A3 'Year'>, <B4 'Jan'>}

Numbers greater than 20 {<A4 1972.0>, <A6 1989.0>, <E8 96.0>, <E5 30.0>, <E7 88.0>}
Numbers less than 20 {<E6 2.0>, <E4 10.0>, <E9 8.0>}

The second cell in the whole table is {<B1 ''>}


### cellbag1.union(cellbag2)

Union of two bags.  Can also be expressed as `cellbag1 | cellbag2`

### cellbag1.difference(cellbag2)

Difference of two bags.  Can also be expressed as `cellbag1 - cellbag2`

### cellbag1.intersection(cellbag2)

Intersection of two bags.  Can also be expressed as `cellbag1 & cellbag2`
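
These operators behave like those of Python's built-in `set`. The sketch below uses ordinary sets of cell references (not cell bags) to show that union and intersection are symmetric while difference is not.

```python
col_d = {"D3", "D4", "D5"}
row_4 = {"A4", "B4", "C4", "D4"}

union = col_d | row_4
intersection = col_d & row_4
left_only = col_d - row_4   # cells in col_d that are not in row_4
right_only = row_4 - col_d  # swapping the operands gives a different result

print(sorted(union))         # ['A4', 'B4', 'C4', 'D3', 'D4', 'D5']
print(sorted(intersection))  # ['D4']
print(sorted(left_only))     # ['D3', 'D5']
print(sorted(right_only))    # ['A4', 'B4', 'C4']
```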

In [121]:
colC = tab.excel_ref("D3:D5")
rowC = tab.excel_ref("A4:D4")
print("colC", colC)
print("rowC", rowC)
print()
print("Union is", colC.union(rowC))
print("Difference is", colC.difference(rowC))
print("Intersection is", colC.intersection(rowC))
print()
print("Union is", (colC | rowC))
print("Difference is", (colC - rowC))
print("Intersection is", (colC & rowC))


colC {<D5 'granite'>, <D3 'Rocks'>, <D4 'chalk'>}
rowC {<A4 1972.0>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}

Union is {<D5 'granite'>, <A4 1972.0>, <D3 'Rocks'>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}
Difference is {<D5 'granite'>, <D3 'Rocks'>}
Intersection is {<D4 'chalk'>}

Union is {<D5 'granite'>, <A4 1972.0>, <D3 'Rocks'>, <D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}
Difference is {<D5 'granite'>, <D3 'Rocks'>}
Intersection is {<D4 'chalk'>}


### cellbag1.waffle(cellbag2)

Get all cells which have a cell from one bag above them, and the other bag to the side. Note that the two bags are interchangeable without changing the output. You can change the direction from its default (DOWN) by specifying direction=LEFT or similar.

### cellbag1.junction(cellbag2)

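Enumerates the output of waffle as triplets: a cell from the first bag, a cell from the second bag, and the cell where their column and row cross.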
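
The geometry can be sketched with zero-based (column, row) pairs. This models only the default arrangement, with the first bag above and the second bag to the side; it ignores the `direction` argument and the interchangeability of the bags. The free functions `junction` and `waffle` are hypothetical stand-ins for the methods.

```python
def junction(bag1, bag2):
    """Yield (cell1, cell2, crossing); the crossing is in cell1's column and cell2's row."""
    for x1, y1 in sorted(bag1):
        for x2, y2 in sorted(bag2):
            yield (x1, y1), (x2, y2), (x1, y2)


def waffle(bag1, bag2):
    """Return only the crossing cells."""
    return {crossing for _, _, crossing in junction(bag1, bag2)}


# D3 and E4 head two columns; A6 and A7 label two rows.
c = {(3, 2), (4, 3)}
d = {(0, 5), (0, 6)}

print(sorted(waffle(c, d)))  # [(3, 5), (3, 6), (4, 5), (4, 6)], i.e. D6, D7, E6, E7
```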


### cellbag1.same_row(cellbag2)

Get cells in this bag which are in the same row as a cell in the second.

### cellbag1.same_col(cellbag2)

Get cells in this bag which are in the same column as a cell in the second.
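
A minimal sketch of both selections on zero-based (column, row) pairs. The two free functions below are hypothetical and only model the description above.

```python
def same_row(bag1, bag2):
    """Cells of bag1 whose row also occurs in bag2."""
    rows = {y for _, y in bag2}
    return {(x, y) for x, y in bag1 if y in rows}


def same_col(bag1, bag2):
    """Cells of bag1 whose column also occurs in bag2."""
    cols = {x for x, _ in bag2}
    return {(x, y) for x, y in bag1 if x in cols}


column_a = {(0, y) for y in range(9)}  # A1..A9
row_7 = {(x, 6) for x in range(5)}     # A7..E7
c = {(3, 2), (4, 3)}                   # D3 and E4

print(sorted(same_row(column_a, c)))  # [(0, 2), (0, 3)], i.e. A3 and A4
print(sorted(same_col(row_7, c)))     # [(3, 6), (4, 6)], i.e. D7 and E7
```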

In [122]:
c = tab.excel_ref("D3") | tab.excel_ref("E4")
d = tab.excel_ref("A6:A7")
print("Waffle:")
savepreviewhtml([c,d, c.waffle(d)])

Waffle:


0,1,2
item 0,item 1,item 2

0,1,2,3,4
,,,,
Date,,,,
Year,Month,present,Rocks,cost
1972.0,Jan,yes,chalk,10.0
,Aug,no,granite,30.0
1989.0,Feb,yes,limestone,2.0
,Mar,no,shale,88.0
,Jun,yes,basalt,96.0
,Dec,yes,ice,8.0


In [123]:
print("Junction output:")
for s in c.junction(d):
    print("  ", s)

Junction output:
   ({<D3 'Rocks'>}, {<A6 1989.0>}, {<D6 'limestone'>})
   ({<D3 'Rocks'>}, {<A7 ''>}, {<D7 'shale'>})
   ({<E4 10.0>}, {<A6 1989.0>}, {<E6 2.0>})
   ({<E4 10.0>}, {<A7 ''>}, {<E7 88.0>})


In [128]:
print("Cells in column A that are in the same row as", c, "are", tab.excel_ref("A").same_row(c))
print("Cells in row 7 that are in the same column as", c, "are", tab.excel_ref("7").same_col(c))

Cells in column A that are in the same row as {<D3 'Rocks'>, <E4 10.0>} are {<A4 1972.0>, <A3 'Year'>}
Cells in row 7 that are in the same column as {<D3 'Rocks'>, <E4 10.0>} are {<D7 'shale'>, <E7 88.0>}


### cellbag.shift(direction)

Move the selected cells UP, DOWN, LEFT or RIGHT by one cell.

### cellbag.shift((dx, dy))

Move the selected cells dx cells to RIGHT and dy cells DOWN (can have negative values)


### cellbag.fill(direction)

Take all the cells in one direction from the given cellbag

### cellbag.expand(direction)

All the cells in one direction, including the cells of the cellbag itself.

### cellbag.extrude(dx, dy)

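Extends the selection to include, for each selected cell, the cells up to dx columns and dy rows away from it (so `extrude(2, 0)` adds the two cells to the right).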
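
The four movement functions can be modelled on zero-based (column, row) pairs, with directions as (dx, dy) tuples. This is a sketch under assumptions, not the library's code: the table size is hard-coded to the 5 x 9 example sheet, cells that would leave the table are dropped, and `extrude` handles only non-negative `dx` and `dy`.

```python
UP, DOWN, LEFT, RIGHT = (0, -1), (0, 1), (-1, 0), (1, 0)
WIDTH, HEIGHT = 5, 9  # assumed table size: columns A..E, rows 1..9


def in_table(x, y):
    return 0 <= x < WIDTH and 0 <= y < HEIGHT


def shift(bag, direction):
    """Move every cell by (dx, dy), dropping cells that leave the table."""
    dx, dy = direction
    return {(x + dx, y + dy) for x, y in bag if in_table(x + dx, y + dy)}


def fill(bag, direction):
    """All cells reachable by shifting one or more times in `direction`."""
    if direction == (0, 0):
        raise ValueError("direction must be non-zero")
    result, frontier = set(), shift(bag, direction)
    while frontier:
        result |= frontier
        frontier = shift(frontier, direction)
    return result


def expand(bag, direction):
    """Like fill, but also keeps the starting cells."""
    return fill(bag, direction) | set(bag)


def extrude(bag, dx, dy):
    """Add the cells up to dx columns right and dy rows down of each cell."""
    return {
        (x + i, y + j)
        for x, y in bag
        for i in range(dx + 1)
        for j in range(dy + 1)
        if in_table(x + i, y + j)
    }


b4 = {(1, 3)}
print(shift(b4, RIGHT))           # {(2, 3)}, i.e. C4
print(sorted(fill(b4, UP)))       # [(1, 0), (1, 1), (1, 2)], i.e. B1..B3
print(sorted(expand(b4, UP)))     # [(1, 0), (1, 1), (1, 2), (1, 3)]
print(sorted(extrude(b4, 2, 0)))  # [(1, 3), (2, 3), (3, 3)], i.e. B4..D4
```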


In [120]:
c = tab.excel_ref("B4")
print("Shift RIGHT from", c, "is", c.shift(RIGHT))
print("Shift (-1,-2) from", c, "is", c.shift((-1, -2)))
print("Fill UP from", c, "is", c.fill(UP))
print("Expand UP from", c, "is", c.expand(UP))
print()
print("How it works: UP=", UP, "  DOWN=", DOWN, "  LEFT=", LEFT, "  RIGHT=", RIGHT)
print()
print("Extrude two cells rightwards", c.extrude(2,0))

Shift RIGHT from {<B4 'Jan'>} is {<C4 'yes'>}
Shift (-1,-2) from {<B4 'Jan'>} is {<A2 'Date'>}
Fill UP from {<B4 'Jan'>} is {<B3 'Month'>, <B1 ''>, <B2 ''>}
Expand UP from {<B4 'Jan'>} is {<B3 'Month'>, <B4 'Jan'>, <B1 ''>, <B2 ''>}

How it works: UP= (0, -1)   DOWN= (0, 1)   LEFT= (-1, 0)   RIGHT= (1, 0)

Extrude two cells rightwards {<D4 'chalk'>, <C4 'yes'>, <B4 'Jan'>}


# Dimensions
A dimension is simply a cellbag with a label and a lookup direction applied to it.  

### hdim = HDim(cellbag, label, strict=[DIRECTLY|CLOSEST], direction=[ABOVE|BELOW|LEFT|RIGHT])

The main constructor. The `strict` argument is one of:

* CLOSEST (gets the *first* cell in the same column or row as the observation in a specified direction);
* DIRECTLY (gets the *closest* cell in the same column or row as the observation in a specified direction).


### hdim.cellvalobs(cell)

To look up the value of an individual cell.


### hdim.AddCellValueOverride(overridecell, overridevalue)

To add an override so that a given cell, or a given value, is replaced by `overridevalue`

### hdim.discardcellsnotlookedup(observationcells)

To remove header cells that are not seen by this list of observations

### hdim.valueslist()

To extract the list of values that the observations will look up to

### hdim.checkvalues()

To check whether the cell values are the same as an expected set of values

### hdimc = HDimConst(label, value)

To create a constant dimension that will give the same value no matter what the observation

In [133]:
HDimConst

<function databaker.jupybakeutils.HDimConst>



### Special dimensions

* OBS is an observation, a particular piece of data.
* DATAMARKER refers to footnote-like things, for example `(a)` to mark
  approximate data.
* GEOG is the GSS code, e.g. `E04001323`.
* TIME specifies the period, e.g. a year, month or quarter for which the
  observation applies, e.g. `Feb-Apr 1971`.
* TIMEUNIT is the length of the time period, e.g. `Quarter`.
* Also STATPOP, UNITOFMEASURE, UNITMULTIPLIER, MEASURETYPE, STATUNIT.