## Hail Issues

This is a scratch notebook for collecting thoughts on the Hail API, along with potential bugs.

In [1]:
import hail as hl
import pandas as pd
hl.init()

Running on Apache Spark version 2.4.4
SparkUI available at http://2e4e0c6972f9:4042
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.32-a5876a0a2853
LOGGING: writing to /home/eczech/repos/gwas-analysis/notebooks/organism/canine/hail-20200223-2235-0.2.32-a5876a0a2853.log


## Questions

- How do you visualize an EntryExpression as a matrix?

Use a `BlockMatrix`?

```python
hl.linalg.BlockMatrix.from_entry_expr(
    hl.case().when(hl.is_defined(mt.GT),mt.GT.n_alt_alleles()).default(-1)
).to_numpy()
```

Use collect?

```python
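import numpy as np

# Is the reshape here always going to work?
# collect() returns a flat list (None for missing calls); mt.count() is (n_rows, n_cols)
flat = np.array(mt.GT.n_alt_alleles().collect())
np.nan_to_num(flat.reshape(mt.count()).astype(float), nan=-1).astype(int)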
```
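
On the reshape question: the call is only safe if `collect()` returns entries in row-major order (all columns of the first row, then the next row), which is an assumption here and not something confirmed above. A minimal pure-Python sketch of the reshape under that assumption, with a length check that fails loudly instead of silently mis-shaping; `to_matrix` is a hypothetical helper, not part of Hail:

```python
def to_matrix(flat, n_rows, n_cols, missing=-1):
    """Reshape a flat, row-major list of entry values into n_rows x n_cols."""
    if len(flat) != n_rows * n_cols:
        raise ValueError(f"expected {n_rows * n_cols} entries, got {len(flat)}")
    # Missing calls arrive as None; replace them with a sentinel value
    filled = [missing if v is None else v for v in flat]
    # Slice one row's worth of columns at a time (row-major assumption)
    return [filled[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


# Two variants (rows) by three samples (columns), with one missing call
flat = [0, 1, 2, 1, None, 0]
matrix = to_matrix(flat, n_rows=2, n_cols=3)
print(matrix)  # [[0, 1, 2], [1, -1, 0]]
```

If the ordering assumption does not hold, the shape will still be right but the values will land in the wrong cells, so it is worth spot-checking a few entries against the matrix table.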

- How do you get counts of elements in an aggregation?

```python
hl.agg.sum(mt.GT.n_alt_alleles()) / (hl.agg.count_where(hl.is_defined(mt.GT)) * 2)
```
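
The expression above reads as an alternate allele frequency: the total number of alternate alleles divided by twice the number of defined (diploid) calls. A small worked example of the same arithmetic in plain Python, using made-up genotypes where `None` marks a missing call:

```python
# Alternate-allele counts per sample for one variant; None is a missing call
n_alt = [0, 1, 2, None, 1]

# Sum over defined calls only (hl.agg.sum also ignores missing values)
alt_alleles = sum(g for g in n_alt if g is not None)

# Number of defined calls, like hl.agg.count_where(hl.is_defined(mt.GT))
n_called = sum(g is not None for g in n_alt)

# Each diploid call contributes two alleles to the denominator
alt_freq = alt_alleles / (n_called * 2)
print(alt_freq)  # 0.5
```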

- How do you convert an EntryExpression to a Hail Table?
- How do you do a projection or annotation with a key field?  What are the consequences of annotation with altered keys?
- What are good use cases for [hl.filtering_allele_frequency](https://hail.is/docs/0.2/experimental/index.html?highlight=allele_frequency#hail.experimental.filtering_allele_frequency)?  Is this a good way to do AF filtering w/ small sample sizes?
- How do you create a centered, normalized call expression (w/ missing values)?
    - https://github.com/hail-is/hail/blob/master/hail/python/test/hail/methods/test_statgen.py#L1349
```python
# Copied from Hail's test suite: `resource` is a test helper that resolves a
# test data path, and `agg` is `hl.agg`
ds = hl.split_multi_hts(hl.import_vcf(resource('ldprune.vcf'), min_partitions=3))
filtered_ds = ds.filter_rows(hl.is_defined(ds.row_key))
filtered_ds = filtered_ds.annotate_rows(stats=agg.stats(filtered_ds.GT.n_alt_alleles()))
filtered_ds = filtered_ds.annotate_rows(mean=filtered_ds.stats.mean, sd_reciprocal=1 / filtered_ds.stats.stdev)
n_samples = filtered_ds.count_cols()
# Center and scale defined calls; impute missing calls as 0 (the row mean after centering)
normalized_mean_imputed_genotype_expr = (
    hl.cond(hl.is_defined(filtered_ds['GT']),
            (filtered_ds['GT'].n_alt_alleles() - filtered_ds['mean'])
            * filtered_ds['sd_reciprocal'] * (1 / hl.sqrt(n_samples)), 0))
```

- How do you create column aggregates for each row?
```python
mt = hl.balding_nichols_model(5, 100, 1000)
mt = mt.annotate_rows(x=hl.agg.group_by(mt.pop, hl.agg.count()))
mt.x.collect()
# [{0: 18, 1: 16, 2: 23, 3: 22, 4: 21}, ...] # for each row, a dict of pop -> count
```
- How do you create aggregates by rows and cols simultaneously?
    - These give an error:
```python
mt = hl.balding_nichols_model(5, 100, 1000)
mt.group_rows_by(c=mt.locus.contig, p=mt.pop).aggregate(x=hl.agg.mean(mt.GT.n_alt_alleles()))
mt.group_rows_by(contig=mt.locus.contig).group_cols_by(p=mt.pop)
```
    - This gives an incorrect result:
```python
mt = mt.key_rows_by().key_cols_by()
mta = mt.group_rows_by(contig=mt.locus.contig)
mta = mta.aggregate(x=hl.agg.group_by(mta['pop'], hl.agg.count()))
```
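
For the centered, normalized call expression asked about earlier in this list, the Hail snippet computes, per variant, `(g - mean) / (sd * sqrt(n_samples))` for defined calls and 0 for missing ones. A plain-Python sketch of the same arithmetic for a single variant; `normalize_row` is a hypothetical helper, and using the population standard deviation to mirror `hl.agg.stats` is an assumption:

```python
import math
import statistics


def normalize_row(n_alt):
    """Center, scale, and mean-impute one variant's alt-allele counts."""
    called = [g for g in n_alt if g is not None]
    mean = statistics.mean(called)
    # Assumption: population standard deviation, to mirror hl.agg.stats
    sd = statistics.pstdev(called)
    n_samples = len(n_alt)  # all samples, including those with missing calls
    scale = 1 / (sd * math.sqrt(n_samples))
    # A missing call maps to 0, which is the row mean after centering
    return [0.0 if g is None else (g - mean) * scale for g in n_alt]


row = [0, 2, None, 0, 2]
normalized = normalize_row(row)
print(normalized)
```

A monomorphic variant has `sd == 0` and would raise a `ZeroDivisionError` here, so such rows need to be filtered out first.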

- How do you create a matrix table manually?
    - See: https://github.com/hail-is/hail/blob/1958fb9b76a2a37ad5069c430e9bff2824b0d290/hail/python/test/hail/methods/test_statgen.py#L1117
    - Also: https://hail.is/docs/0.2/methods/genetics.html?highlight=ld_matrix#hail.methods.ld_matrix
    
```python
>>> data = [{'v': '1:1:A:C',       'cm': 0.1, 's': 'a', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:1:A:C',       'cm': 0.1, 's': 'd', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'b', 'GT': hl.Call([1, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'c', 'GT': hl.Call([0, 1])},
...         {'v': '1:2000000:G:T', 'cm': 0.9, 's': 'd', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'a', 'GT': hl.Call([0, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'b', 'GT': hl.Call([0, 0])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'c', 'GT': hl.Call([1, 1])},
...         {'v': '2:1:C:G',       'cm': 0.2, 's': 'd', 'GT': hl.null(hl.tcall)}]
>>> ht = hl.Table.parallelize(data, hl.dtype('struct{v: str, s: str, cm: float64, GT: call}'))
>>> ht = ht.transmute(**hl.parse_variant(ht.v))
>>> mt = ht.to_matrix_table(row_key=['locus', 'alleles'], col_key=['s'], row_fields=['cm'])
```

- How well does pc_relate scale with larger datasets?

## API Nuisances

Source mismatch:

In [None]:
# Somewhat reasonable limitation: every expression used in an aggregation must come from
# the table being aggregated, so the result of hl.sample_qc(mt) has to be assigned back
# to mt first. It is still very annoying.
mt.aggregate_cols(hl.expr.aggregators.hist(hl.sample_qc(mt).sample_qc.call_rate, 0, 1, 30))
# 'MatrixTable.aggregate_cols': source mismatch

Plotting:

In [None]:
# Having one bin with a zero is very common and not having the ability to visualize anything 
# on a log scale makes the histogram plotting pretty useless.  Grabbing the intermediate
# Struct object (hist) and trying to mutate is a pain too, not to mention that a transformation
# there would not show up as an inverse transform on the scales.
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

def get_hist_plot(mt):
    mt = hl.sample_qc(mt)
    hist = mt.aggregate_cols(hl.expr.aggregators.hist(mt.sample_qc.call_rate, 0, 1, 30))
    return hl.plot.histogram(hist, legend='CR', title='Call Rate', log=True)
show(get_hist_plot(mt))

#     404     if log:
# --> 405         data.bin_freq = [math.log10(x) for x in data.bin_freq]
#     406         data.n_larger = math.log10(data.n_larger)
#     407         data.n_smaller = math.log10(data.n_smaller)

# ValueError: math domain error
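
The traceback points at `math.log10` being applied to a zero count, which is undefined. One possible workaround, sketched without Hail, is to transform the counts before plotting and map empty bins to a floor value. `safe_log10` and its `floor` default are hypothetical, and as noted above the axis would then show transformed values rather than counts:

```python
import math


def safe_log10(counts, floor=0.0):
    """Return log10 of each count, mapping non-positive counts to `floor`."""
    return [math.log10(c) if c > 0 else floor for c in counts]


bin_freq = [0, 1, 10, 100]
log_freq = safe_log10(bin_freq)
print(log_freq)  # [0.0, 0.0, 1.0, 2.0]
```

With `floor=0.0` an empty bin is indistinguishable from a bin with a single count; a negative floor keeps them apart at the cost of a bar below the axis.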

QC by group:

In [None]:
# Hail examples for producing non-scalar aggregations by group look like this:
dataset_result = (dataset.group_cols_by(dataset.cohort)
    .aggregate_cols(mean_height = hl.agg.mean(dataset.pheno.height))
    .result())

# What is unclear though, is how sample_qc and variant_qc can be applied **within** a group 
# rather than over the entire MT (the signature just accepts MTs)

Aggregations in general:

See https://discuss.hail.is/t/issues-with-sample-and-variant-qc-by-group/1286/7 for a discussion on different aggregation types and why they can be ambiguous.

In [3]:
mt = hl.balding_nichols_model(3, 100, 1000)

2020-02-12 11:36:11 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 100 samples, and 1000 variants...


In [86]:
mtr = mt.rows()
mtr.aggregate(hl.agg.group_by(
   mtr.locus.contig, 
   hl.agg.sum(mt.entries().key_by('locus', 'alleles').index(mtr.key).GT.n_alt_alleles())
))

2020-02-12 12:53:21 Hail: INFO: Coerced sorted dataset
2020-02-12 12:53:21 Hail: INFO: Coerced sorted dataset


{'1': 986}

Selecting key fields:

Why is it not possible to create a projection that includes key fields?

In [106]:
mt = hl.balding_nichols_model(1, 10, 10)
mt.select_cols(mt.sample_idx, mt.pop)

2020-02-12 22:25:04 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...
2020-02-12 22:25:04 Hail: ERROR: Analysis exception: 'MatrixTable.select_cols': cannot overwrite key field 'sample_idx' with annotate, select or drop; use key_by to modify keys.


ExpressionException: 'MatrixTable.select_cols': cannot overwrite key field 'sample_idx' with annotate, select or drop; use key_by to modify keys.

Annotating key fields:

In [110]:
mt = hl.balding_nichols_model(1, 10, 10)
mt.annotate_rows(alleles=hl.reversed(mt.alleles))

2020-02-13 02:25:03 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...
2020-02-13 02:25:03 Hail: ERROR: Analysis exception: 'MatrixTable.annotate_rows': cannot overwrite key field 'alleles' with annotate, select or drop; use key_by to modify keys.


ExpressionException: 'MatrixTable.annotate_rows': cannot overwrite key field 'alleles' with annotate, select or drop; use key_by to modify keys.

### Kinship Estimator Output Format

- `pc_relate` returns `i` and `j` as column-key structs (using whatever column key was provided)
- `identity_by_descent` returns `i` and `j` as plain sample ID strings

Why are they not the same?

## Bugs

### Invalid Aggregation Bugs

In [109]:
# KeyError: 'g' ?
mt = hl.balding_nichols_model(1, 10, 10)
mt.aggregate_rows(hl.agg.group_by(
   mt.pop, hl.agg.sum(mt.GT.n_alt_alleles())
))

2020-02-12 23:22:53 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


KeyError: 'g'

In [108]:
# KeyError: 'va' ?
mt = hl.balding_nichols_model(1, 10, 10)
mt.aggregate_cols(hl.agg.counter(mt.locus))

2020-02-12 23:22:34 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


KeyError: 'va'

### Grouped MT Bug

In [98]:
mt = hl.balding_nichols_model(3, 100, 1000)
mtg = mt.group_rows_by(mt.locus).describe()

2020-02-12 13:43:35 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 100 samples, and 1000 variants...


AttributeError: 'list' object has no attribute 'items'

In [90]:
hl.agg.group_by(mt.pop, hl.agg.count()).describe()

--------------------------------------------------------
Type:
        dict<int32, int64>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7f4031e63a50>
Index:
    [] (aggregated)
--------------------------------------------------------
Includes aggregation with index []
    (Aggregation index may be promoted based on context)
--------------------------------------------------------


### Sorted Array Bug

See: https://github.com/hail-is/hail/issues/8076#issuecomment-584793361

Should be fixed in 0.2.31 and later. Original bug found in 0.2.30

In [4]:
# Was a SEGFAULT in 0.2.30 (traceback further down); this run raises a TypeError instead
mt.aggregate_rows(hl.agg.counter(hl.sorted(mt.alleles)))

TypeError: unhashable type: 'list'

In [2]:
mt = hl.balding_nichols_model(1, 10, 10)
mt.aggregate_rows(hl.agg.counter(hl.delimit(mt.alleles, '|')))

2020-02-13 19:56:31 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


{'A|C': 10}

In [4]:
#mt.aggregate_rows(hl.agg.counter(mt.alleles))
# TypeError: unhashable type: 'list'

In [8]:
mt.aggregate_rows(hl.agg.counter(hl.tuple([mt.alleles[0], mt.alleles[1]])))

{('A', 'C'): 10}
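
The `TypeError` looks like it comes from the Python side: `hl.agg.counter` returns a dict keyed by the counted values, and an `array<str>` key would be converted to a Python `list`, which cannot be a dict key. That reading is inferred from the error message, not confirmed against the Hail source. The same behavior in plain Python, and why the tuple form works:

```python
from collections import Counter

alleles = [["A", "C"], ["A", "C"], ["A", "C"]]

try:
    Counter(alleles)  # lists are mutable, so they are not hashable
except TypeError as err:
    print(err)  # unhashable type: 'list'

# Tuples are hashable, so they can serve as dict keys
counts = Counter(tuple(a) for a in alleles)
print(counts)  # Counter({('A', 'C'): 3})
```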

In [2]:
mt = hl.balding_nichols_model(1, 10, 10)
# What are these counts of?
mt.aggregate_rows(hl.agg.counter(hl.delimit(hl.sorted(mt.alleles), '|')))

2020-02-11 16:11:17 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


{'A|A|A|C|\x0b\x00\x00': 2, 'A|A|A|C|C|C': 8}

In [3]:
# This is fine, they should all be AC alleles
pd.Series(hl.delimit(hl.sorted(mt.alleles), '|').collect()).value_counts()

2020-02-11 16:11:20 Hail: INFO: Coerced sorted dataset


A|C    10
dtype: int64

In [4]:
# SEGFAULT
mt.aggregate_rows(hl.agg.counter(hl.sorted(mt.alleles)))

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37043)
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/IPython/core/intera

Py4JError: An error occurred while calling o59.executeJSON

In [7]:
mt.aggregate_rows(hl.agg.counter(hl.sorted(mt.alleles)))

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43681)
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:43681)

### Bad Messaging Bugs

In [9]:
mt = hl.balding_nichols_model(5, 100, 1000)
# Group rows by a column field (pop). This should not be allowed, but the
# message is unhelpful: "sa not found in ..."?
mt.group_rows_by(c=mt.locus.contig, p=mt.pop).aggregate(x=hl.agg.mean(mt.GT.n_alt_alleles()))

2020-02-15 10:12:20 Hail: INFO: balding_nichols_model: generating genotypes for 5 populations, 100 samples, and 1000 variants...


AssertionError: sa not found in {'global': dtype('struct{bn: struct{n_populations: int32, n_samples: int32, n_variants: int32, n_partitions: int32, pop_dist: array<int32>, fst: array<float64>, mixture: bool}}'), 'va': dtype('struct{locus: locus<GRCh37>, alleles: array<str>, ancestral_af: float64, af: array<float64>}')}