# VEP Annotation Example

### List EMR Master Nodes

`~/SageMaker/bin/list-clusters` prints the IP address of each EMR master node in your account and checks Livy connectivity.

In [None]:
%%bash
~/SageMaker/bin/list-clusters

Note the cluster name in the output above. Replace `<CLUSTER_NAME>` below and run the cell to capture the Livy endpoint of that cluster's EMR master node. The master node IP is then extracted from the endpoint; it is used for the Livy connection and Bokeh plot transfers.

In [None]:
%%bash --out LIVY_ENDPOINT
~/SageMaker/bin/list-clusters | grep <CLUSTER_NAME> | awk '{ print $3 }'

In [None]:
%%local
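import re

# Drop the trailing newline captured from the bash cell
LIVY_ENDPOINT = LIVY_ENDPOINT.strip()
# Keep only the IP from an endpoint of the form http://<ip>:<4-digit port>
EMR_MASTER_IP = re.sub(r'http://([0-9.]+):([0-9]{4})', r'\1', LIVY_ENDPOINT)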
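
The regular expression above assumes an endpoint of the form `http://<ip>:<4-digit port>`. As a minimal alternative sketch, the standard library's `urllib.parse` can extract the host without depending on the port width, and it makes an empty result (for example, when `grep` matched no cluster) fail loudly. The endpoint value below is made up for illustration, and `host_from_endpoint` is a hypothetical helper, not part of the quickstart.

```python
from urllib.parse import urlparse


def host_from_endpoint(endpoint: str) -> str:
    """Return the host portion of a Livy endpoint URL such as http://10.0.0.1:8998."""
    host = urlparse(endpoint.strip()).hostname
    if host is None:
        raise ValueError(f"Could not parse a host from {endpoint!r}")
    return host


# Made-up endpoint for illustration; substitute the value captured in LIVY_ENDPOINT.
example_endpoint = "http://10.0.0.1:8998\n"
example_host = host_from_endpoint(example_endpoint)
print(example_host)
```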

Use the Livy endpoint above to start a session, specifying the session name (`-s`), the language (`-l python`), the Livy endpoint (`-u`), and the authentication type (`-t`).

In [None]:
%reload_ext sparkmagic.magics
%spark add -s jsmith -l python -u $LIVY_ENDPOINT -t None

In [None]:
import hail as hl
import hail.expr.aggregators as agg
hl.init(sc)

In [None]:
hl.utils.get_1kg('data/')
mt = hl.read_matrix_table('data/1kg.mt')
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
mt = mt.annotate_cols(**table[mt.s])
mt = hl.sample_qc(mt)

mt.describe()

Downsample the rows and columns to keep this example fast.

In [None]:
# Column annotations and sample QC were already added in the previous cell
# Keep roughly 1% of the variants (rows) and 10% of the samples (columns)
mt = mt.sample_rows(p=0.01, seed=421)
mt = mt.sample_cols(p=0.1, seed=421)
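
`sample_rows` and `sample_cols` keep each row or column independently with probability `p`, so the resulting counts are approximate rather than exact, and a fixed `seed` makes the subset reproducible. The plain-Python sketch below illustrates that behavior with the standard `random` module. It is an analogy only, not Hail's implementation, and `bernoulli_sample` is a hypothetical helper.

```python
import random


def bernoulli_sample(items, p, seed):
    """Keep each item independently with probability p, using a seeded generator."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < p]


rows = list(range(10_000))  # stand-in for variant rows, not real data
first = bernoulli_sample(rows, p=0.01, seed=421)
second = bernoulli_sample(rows, p=0.01, seed=421)

# The same seed reproduces the same subset; the size is only approximately p * n.
print(first == second, len(first))
```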

Annotate the MatrixTable with the [vep()](https://hail.is/docs/0.2/methods/genetics.html#hail.methods.vep) method.  Example VEP JSON configurations were loaded into your Hail S3 bucket during the quickstart deployment.   In this example, we'll use GRCh37 with [LOFTEE](https://github.com/konradjk/loftee).

Substitute `<HAIL_BUCKET>` with the Hail bucket name you selected during quickstart deployment.
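
One way to avoid editing the path by hand is to build it from a variable. The sketch below assumes the same key used in the next cell (`vep-configuration/vep-configuration-GRCh37.json`). Both `vep_config_uri` and the bucket name `my-hail-bucket` are hypothetical and serve only as an illustration.

```python
def vep_config_uri(bucket: str) -> str:
    """Build the S3 URI of the GRCh37 VEP configuration inside the given bucket."""
    bucket = bucket.strip().strip("/")
    # Reject an empty name or a placeholder that was left unedited
    if not bucket or "<" in bucket:
        raise ValueError("Replace the placeholder with your real Hail bucket name")
    return f"s3://{bucket}/vep-configuration/vep-configuration-GRCh37.json"


# "my-hail-bucket" is a placeholder name used only for this illustration.
config_uri = vep_config_uri("my-hail-bucket")
print(config_uri)
```

The resulting string could then be passed as the second argument to `hl.vep`.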

In [None]:
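# VEP configuration uploaded to the Hail bucket during the quickstart deployment
mt = hl.vep(mt, "s3://<HAIL_BUCKET>/vep-configuration/vep-configuration-GRCh37.json")

# Replace the row fields with the fields of the vep struct, then keep only the most severe consequence
mm = mt.select_rows(**mt['vep'])
mm = mm.select_rows('most_severe_consequence')
# Count variants per consequence; the result maps each consequence to its count
consequence = mm.aggregate_rows(agg.counter(mm.most_severe_consequence))
# .get() avoids a KeyError if the downsampled data contains no missense variants
print(f"missense_variant: {consequence.get('missense_variant', 0)}")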
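
`agg.counter` returns a dictionary that maps each distinct value to the number of rows carrying it. A rough local analogy uses `collections.Counter`, as sketched below. The consequence list is invented for illustration and is not output from this dataset.

```python
from collections import Counter

# Invented values for illustration only, not results from the 1KG subset.
most_severe = [
    "intron_variant",
    "missense_variant",
    "intron_variant",
    "synonymous_variant",
    "missense_variant",
    "intron_variant",
]

counts = Counter(most_severe)

# .get() with a default avoids a KeyError when a consequence is absent.
missense = counts.get("missense_variant", 0)
stop_gained = counts.get("stop_gained", 0)

# List consequences from most to least frequent.
ranked = counts.most_common()
print(missense, stop_gained, ranked[0])
```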

Remove the Livy notebook session.

In [None]:
%spark cleanup