## Hail Issues

This is a scratch notebook for aggregating thoughts on the Hail API as well as potential bugs.

In [1]:
import hail as hl
import pandas as pd
hl.init()

Running on Apache Spark version 2.4.4
SparkUI available at http://a783b4e25167:4041
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.30-2ae07d872f43
LOGGING: writing to /home/eczech/repos/gwas-analysis/notebooks/organism/canine/hail-20200211-1803-0.2.30-2ae07d872f43.log


### API Nuisances

Source mismatch:

In [None]:
# Every expression used in an aggregation must come from the table being aggregated, so the
# result of hl.sample_qc(mt) has to be assigned back to mt before aggregating. This is a
# somewhat reasonable limitation, but it is very annoying.
mt.aggregate_cols(hl.expr.aggregators.hist(hl.sample_qc(mt).sample_qc.call_rate, 0, 1, 30))
# Fails with: 'MatrixTable.aggregate_cols': source mismatch

Plotting:

In [None]:
# Histograms very commonly have at least one empty bin, and log=True fails on zero counts
# (see the traceback below), so the counts cannot be viewed on a log scale at all, which makes
# the histogram plotting pretty useless. Grabbing the intermediate Struct object (hist) and
# mutating it by hand is a pain too, and a manual transformation there would not show up
# as a log scale on the axis anyway.
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

def get_hist_plot(mt):
    mt = hl.sample_qc(mt)
    hist = mt.aggregate_cols(hl.expr.aggregators.hist(mt.sample_qc.call_rate, 0, 1, 30))
    return hl.plot.histogram(hist, legend='CR', title='Call Rate', log=True)
show(get_hist_plot(mt))

#     404     if log:
# --> 405         data.bin_freq = [math.log10(x) for x in data.bin_freq]
#     406         data.n_larger = math.log10(data.n_larger)
#     407         data.n_smaller = math.log10(data.n_smaller)

# ValueError: math domain error
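
The traceback above reduces to `math.log10(0)`, which raises for any empty bin. A minimal sketch of the failure and one possible guard, in plain Python with no Hail involved; `safe_log10_counts` and the sample counts are hypothetical, and mapping empty bins to `None` is just one choice (a plotting layer would still need to skip those bins):

```python
import math

def safe_log10_counts(bin_freq, empty=None):
    """Hypothetical helper: log10-transform bin counts, mapping zero counts to `empty`."""
    return [math.log10(x) if x > 0 else empty for x in bin_freq]

# Made-up bin counts with one empty bin
bin_freq = [0, 1, 10, 100]

# The unguarded transform fails as soon as it reaches the empty bin
try:
    [math.log10(x) for x in bin_freq]
    raised = False
except ValueError:
    raised = True

# The guarded transform keeps the empty bin as a placeholder instead of failing
log_freq = safe_log10_counts(bin_freq)
```

This does not address the second complaint: the axis would still show transformed values rather than a true log scale.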

QC by group:

In [None]:
# Hail examples for producing non-scalar aggregations by group look like this:
dataset_result = (dataset.group_cols_by(dataset.cohort)
    .aggregate_cols(mean_height = hl.agg.mean(dataset.pheno.height))
    .result())

# What is unclear, though, is how sample_qc and variant_qc can be applied **within** a group
# rather than over the entire MT (both signatures accept only a MatrixTable)
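
To pin down what "within a group" would mean for a variant_qc-style statistic, here is a plain-Python sketch of per-variant call rate computed separately for each cohort. It is not Hail code: the toy genotypes, cohort labels, and `call_rate_by_cohort` are all made up for illustration. If I recall the Hail docs correctly, `group_cols_by(...).aggregate(call_rate=hl.agg.fraction(hl.is_defined(mt.GT)))` covers a single statistic like this one, but that has not been run here and it is not a substitute for the full sample_qc / variant_qc structs.

```python
from collections import defaultdict

# Toy data (made up): rows are variants, columns are samples,
# and None marks a missing genotype call.
cohorts = ["a", "a", "b", "b"]
genotypes = [
    [0, 1, None, 2],
    [None, None, 1, 1],
]

def call_rate_by_cohort(genotypes, cohorts):
    """Hypothetical helper: per-variant fraction of non-missing calls within each cohort."""
    result = []
    for row in genotypes:
        called = defaultdict(int)
        total = defaultdict(int)
        for gt, cohort in zip(row, cohorts):
            total[cohort] += 1
            called[cohort] += gt is not None
        result.append({c: called[c] / total[c] for c in total})
    return result

rates = call_rate_by_cohort(genotypes, cohorts)
```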

### Aggregation Bug

In [2]:
mt = hl.balding_nichols_model(1, 10, 10)
mt.aggregate_rows(hl.agg.counter(hl.delimit(mt.alleles, '|')))

2020-02-11 18:03:45 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


{'A|C': 10}

In [4]:
#mt.aggregate_rows(hl.agg.counter(mt.alleles))
# TypeError: unhashable type: 'list'

In [8]:
mt.aggregate_rows(hl.agg.counter(hl.tuple([mt.alleles[0], mt.alleles[1]])))

{('A', 'C'): 10}
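
The `TypeError` above is the one plain Python raises when a list is used as a dict key, which would explain why the tuple version works. Presumably the array keys come back as Python lists when the counter result is converted to a dict; that reading is an assumption and has not been checked against the Hail source. A minimal stand-alone sketch of the same behavior using `collections.Counter` and made-up allele data:

```python
from collections import Counter

# Made-up stand-in for ten rows that all have alleles ['A', 'C']
alleles = [["A", "C"] for _ in range(10)]

# Lists are unhashable, so they cannot be used as counter (dict) keys
try:
    Counter(alleles)
    list_error = None
except TypeError as e:
    list_error = str(e)

# Tuples are hashable, so the same data counts fine once converted
tuple_counts = Counter(tuple(a) for a in alleles)
```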

In [2]:
mt = hl.balding_nichols_model(1, 10, 10)
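# What are these counts of? Every row should produce the key 'A|C' (count 10), as in the
# unsorted version above, but the keys returned here repeat alleles and contain garbage bytes.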
mt.aggregate_rows(hl.agg.counter(hl.delimit(hl.sorted(mt.alleles), '|')))

2020-02-11 16:11:17 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...


{'A|A|A|C|\x0b\x00\x00': 2, 'A|A|A|C|C|C': 8}

In [3]:
# Collecting the same expression and counting client-side is fine: every row is A|C, as expected
pd.Series(hl.delimit(hl.sorted(mt.alleles), '|').collect()).value_counts()

2020-02-11 16:11:20 Hail: INFO: Coerced sorted dataset


A|C    10
dtype: int64

In [4]:
# SEGFAULT: counting the sorted array directly takes down the JVM, so py4j loses its connection
mt.aggregate_rows(hl.agg.counter(hl.sorted(mt.alleles)))

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:37043)
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/IPython/core/intera

Py4JError: An error occurred while calling o59.executeJSON

In [7]:
mt.aggregate_rows(hl.agg.counter(hl.sorted(mt.alleles)))

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43681)
Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused


Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:43681)