## Expression Tutorial

This tutorial covers data representation with Hail's expression classes. We will go over Hail's data types and the expressions that represent them, as well as a few features of expressions, such as lazy evaluation and missingness. We will also cover how expressions can refer to fields in a table or matrix table.

As you are working through the tutorial, you can also check out the [expression API](https://hail.is/docs/devel/expressions.html#expressions) for documentation on specific expressions and their methods, or the [expression](https://hail.is/docs/devel/hailpedia/expressions.html) page in the Hailpedia for more information on expressions.

Start by importing the Hail module, which we typically abbreviate as `hl`, and initializing Hail and Spark with the [init](https://hail.is/docs/devel/api.html#hail.init) function:

In [None]:
import hail as hl
hl.init()

### Hail's Data Types

Each object in Python has a data type, which can be accessed with Python's built-in `type` function. Here is a Python string, which has type `str`.

In [None]:
type("Python")

Hail has its own data types for representing data. Here is a Hail string, which we construct with the [str](https://hail.is/docs/devel/functions/core.html?highlight=str#hail.expr.functions.str) function. We can access the string's Hail type with the `dtype` attribute.

In [None]:
hl.str("Hail").dtype

Hail has primitive and container types, as well as a few types specific to the field of genetics.

* primitive types: [int32](https://hail.is/docs/devel/types.html#hail.expr.types.tint32), [int64](https://hail.is/docs/devel/types.html#hail.expr.types.tint64), [float32](https://hail.is/docs/devel/types.html#hail.expr.types.tfloat32), [float64](https://hail.is/docs/devel/types.html#hail.expr.types.tfloat64), [bool](https://hail.is/docs/devel/types.html#hail.expr.types.tbool), [str](https://hail.is/docs/devel/types.html#hail.expr.types.tstr)
* container types: [arrays](https://hail.is/docs/devel/types.html#hail.expr.types.tarray), [sets](https://hail.is/docs/devel/types.html#hail.expr.types.tset), [dicts](https://hail.is/docs/devel/types.html#hail.expr.types.tdict), [tuples](https://hail.is/docs/devel/types.html#hail.expr.types.ttuple), [structs](https://hail.is/docs/devel/types.html#hail.expr.types.tstruct), [intervals](https://hail.is/docs/devel/types.html#hail.expr.types.tinterval)
* genetics types: [locus](https://hail.is/docs/devel/types.html#hail.expr.types.tlocus), [call](https://hail.is/docs/devel/types.html#hail.expr.types.tcall)

Each of these types has its own constructor function, which returns an expression:

In [None]:
hl.str("Hail")

### What is an Expression?

Data types in Hail are represented by [expression](https://hail.is/docs/devel/expressions.html#expressions) classes. Each data type has its own expression class. For example, an integer of type `tint32` is represented by an `Int32Expression`. 

We can construct an integer expression in Hail with the [int32](https://hail.is/docs/devel/functions/constructors.html?highlight=int32#hail.expr.functions.int32) function.

In [None]:
hl.int32(3)

To have Hail infer the type automatically when converting a Python object to a Hail expression, use the [literal](https://hail.is/docs/devel/functions/core.html?highlight=literal#hail.expr.functions.literal) function. Let's try it out on a Python list.

In [None]:
hl.literal(['a', 'b', 'c'])

The Python list is converted to an `ArrayExpression` of type `array<str>`, that is, an array of strings.

### Expressions are Lazy

In languages like Python and R, expressions are evaluated and stored immediately. This is called **eager** evaluation.

In [None]:
1 + 2

Eager evaluation won't work on datasets that won't fit in memory. Consider the UK Biobank BGEN file, which is ~2TB but decompresses to >100TB in memory.

In order to process datasets of this size, Hail uses **lazy** evaluation. When you enter an expression, Hail doesn't execute the expression immediately; it only records what you asked to do.

In [None]:
one = hl.int32(1)
three = one + 2
three
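
To make "recording what you asked to do" concrete, here is a minimal pure-Python sketch of a lazy expression. The `Lazy` class and its `const` and `evaluate` methods are hypothetical names invented for this illustration. It is a toy model of the idea, not how Hail implements expressions.

```python
import operator


class Lazy:
    """Toy lazy expression: records an operation instead of running it."""

    def __init__(self, op, *args):
        self.op = op      # function to apply later, or None for a constant
        self.args = args  # operand nodes, or one plain value for a constant

    @classmethod
    def const(cls, value):
        return cls(None, value)

    def __add__(self, other):
        # Wrap plain Python values so both operands are Lazy nodes.
        if not isinstance(other, Lazy):
            other = Lazy.const(other)
        # Nothing is computed here; we only build a new node.
        return Lazy(operator.add, self, other)

    def evaluate(self):
        # Work happens only when evaluation is explicitly requested.
        if self.op is None:
            return self.args[0]
        return self.op(*(arg.evaluate() for arg in self.args))


one = Lazy.const(1)
three = one + 2            # builds a node; no addition has happened yet
result = three.evaluate()  # the addition runs here
print(result)
```

In this sketch, `three` is a description of a computation, and the addition runs only inside `evaluate`.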

Hail evaluates an expression only when it must. For example:

 - when performing an aggregation
 - when calling the methods [take](https://hail.is/docs/devel/expressions.html?highlight=take#hail.expr.expressions.Expression.take), [collect](https://hail.is/docs/devel/expressions.html?highlight=take#hail.expr.expressions.Expression.collect), and [show](https://hail.is/docs/devel/expressions.html?highlight=take#hail.expr.expressions.Expression.show)
 - when exporting or writing to disk

Hail evaluates expressions by streaming to accommodate very large datasets.

If you want to force the evaluation of an expression, you can do so by accessing its [value](https://hail.is/docs/devel/expressions.html?highlight=take#hail.expr.expressions.Expression.value) attribute. Note that this can only be done on an expression with no index, such as `hl.int32(1) + 2`. If the expression has an index, e.g. `table.idx + 1`, then accessing `value` will fail. The section on indices below explains this concept further.

In [None]:
three.value

The [show](https://hail.is/docs/devel/hail.Table.html?highlight=show#hail.Table.show) method can also be used to evaluate and display the expression.

In [None]:
three.show()

### Missing data

All expressions in Hail can represent missing data. Hail has a [collection of primitive operations](https://hail.is/docs/devel/functions/core.html) for dealing with missingness. 

The [null](https://hail.is/docs/devel/functions/core.html?highlight=null#hail.expr.functions.null) constructor can be used to create a missing expression of a specific type, such as a missing string:

In [None]:
missing_string = hl.null(hl.tstr)

Use [is_defined](https://hail.is/docs/devel/functions/core.html?highlight=is_defined#hail.expr.functions.is_defined) or [is_missing](https://hail.is/docs/devel/functions/core.html?highlight=is_defined#hail.expr.functions.is_missing) to test an expression for missingness.

In [None]:
hl.is_defined(missing_string).value

In [None]:
hl.is_missing(missing_string).value

Expressions handle missingness in the following ways:

* a missing value plus another value is always missing
* a conditional statement with a missing predicate is missing
* when aggregating a sum of values, the missing values are ignored

This is different from Python's treatment of missingness, where `None + 5` raises a `TypeError`. In Hail, `hl.null(hl.tint32) + 5` produces a missing result, not an error.

In [None]:
hl.is_missing(hl.null(hl.tint32) + 5).value
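
The three rules listed above can be mimicked in plain Python, using `None` to stand in for a missing value. The helpers `add`, `cond`, and `sum_defined` below are hypothetical names written for this sketch. They only model the behavior described here and are not part of Hail.

```python
def add(a, b):
    """Addition that propagates missingness: missing + anything is missing."""
    if a is None or b is None:
        return None
    return a + b


def cond(predicate, if_true, if_false):
    """Conditional that is missing when its predicate is missing."""
    if predicate is None:
        return None
    return if_true if predicate else if_false


def sum_defined(values):
    """Sum that skips missing values instead of propagating them."""
    return sum(v for v in values if v is not None)


# Plain Python refuses to add None to a number.
try:
    None + 5
    python_result = "no error"
except TypeError:
    python_result = "TypeError"

missing_sum = add(None, 5)                # missing, not an error
missing_branch = cond(None, "yes", "no")  # missing predicate gives missing
total = sum_defined([1, 2, None])         # the missing value is ignored
print(python_result, missing_sum, missing_branch, total)
```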

Here are a few more examples to illustrate how missingness is treated in Hail:

Missingness is ignored in a summation:

In [None]:
hl.sum(hl.array([1, 2, hl.null(hl.tint32)])).value

[or_missing](https://hail.is/docs/devel/functions/core.html?highlight=is_defined#hail.expr.functions.or_missing) takes a predicate and a value. If the predicate is True, it returns the value; otherwise, it returns a missing value. 

In [None]:
x = hl.int32(5)
hl.or_missing(x>0, x).value

In [None]:
print(hl.or_missing(x>10, x).value)

### Indices

Expressions carry another piece of information: indices. Indices record the `Table` or `MatrixTable` to which the expression refers, and the axes over which the expression can vary.

Let's see some examples from the 1000 Genomes dataset:

In [None]:
hl.utils.get_1kg('data/')

In [None]:
mt = hl.read_matrix_table('data/1kg.mt')
mt

Let's add a global field.

In [None]:
mt = mt.annotate_globals(dataset = '1kg')

We can examine any field of the matrix table with the [describe](https://hail.is/docs/devel/expressions.html?highlight=describe#hail.expr.expressions.Expression.describe) method. Notice that the field we just added has no indices, because it is a global field.

In [None]:
mt.dataset.describe()

The `locus` field is a row field, so it will be indexed by `row`. 

In [None]:
mt.locus.describe()

Likewise, a column field `s` will be indexed by `column`.

In [None]:
mt.s.describe()

And finally, an entry field `GT` will be indexed by both the `row` and `column`.

In [None]:
mt.GT.describe()

Expressions like `locus`, `s`, and `GT` above do not have a single value, but rather a value that varies across rows or columns of `mt`. Therefore, accessing `value` on these expressions will lead to an error.
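
One way to picture indices is as a set of axes attached to each expression. The sketch below is a toy model under that assumption: the `Indexed` class and its `combine` and `has_single_value` methods are hypothetical names for this illustration, not Hail's API. It tracks only which axes an expression varies over and treats an expression as having a single value only when that set is empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indexed:
    """Toy expression that tracks only the axes it varies over."""

    name: str
    axes: frozenset = frozenset()

    def combine(self, other):
        # The result varies over every axis that either operand varies over.
        return Indexed(f"({self.name}, {other.name})", self.axes | other.axes)

    def has_single_value(self):
        # Only an expression with no axes has one value to return.
        return not self.axes


dataset = Indexed("dataset")                  # global field: no axes
locus = Indexed("locus", frozenset({"row"}))  # row field
s = Indexed("s", frozenset({"column"}))       # column field

entry_like = locus.combine(s)  # varies over both axes, like an entry field
print(sorted(entry_like.axes))
print(dataset.has_single_value(), entry_like.has_single_value())
```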

Global fields don't vary across rows or columns, so they have a `value`:

In [None]:
mt.dataset.value

### `show`, `take`, and `collect`

Although expressions with indices do not have a single realizable `value` (accessing `value` will fail), you can use `show` to print the first few values, `take` to localize the first few values into a Python list, or `collect` to localize all values into a Python list.

`show` and `take` grab the first 10 rows by default, but you can specify a number of rows to grab.

In [None]:
mt.s.show()

In [None]:
mt.s.take(5)

You can [collect](https://hail.is/docs/devel/expressions.html?highlight=collect#hail.expr.expressions.Expression.collect) an expression to localize all of its values, for example to get a list of all sample IDs in a dataset.

But be careful -- don't `collect` more data than can fit in memory!

In [None]:
all_sample_ids = mt.s.collect()
all_sample_ids[:5]
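
The memory difference between taking a few values and collecting all of them can be illustrated with a plain Python generator. This is only an analogy: `sample_ids` is a hypothetical stream invented for the sketch, and Hail's own execution is distributed and works differently.

```python
from itertools import islice


def sample_ids(n):
    """Hypothetical stream of sample IDs, produced one at a time."""
    for i in range(n):
        yield f"sample_{i}"


# take-style: pull only the first 5 values; the rest are never produced.
first_five = list(islice(sample_ids(1_000_000), 5))

# collect-style: materialize every value in memory at once.
everything = list(sample_ids(1_000))

print(first_five)
print(len(everything))
```

The first call stops after five items however long the stream is, while the second holds every item in a list, which is why collecting is only safe when the full result fits in memory.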

### Learning more

Hail has a suite of [functions](https://hail.is/docs/devel/functions/index.html) to transform and build expressions.

For further documentation on expressions, see the [expression API](https://hail.is/docs/devel/expressions.html) and the [expression](https://hail.is/docs/devel/hailpedia/expressions.html) page. 