# Filter Variants

This walkthrough is adapted from the example notebook and the [tour of scikit-allel](http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html) blog post.

There are many possible approaches to filtering variants. The simplest is to define thresholds on variant attributes such as DP, MQ and QD, and to exclude SNPs that fall outside the defined range (a.k.a. “hard filtering”). This is crude but simple to implement, and in many cases it may suffice, at least for an initial exploration of the data.

In [2]:
import numpy as np
import zarr
import pandas as pd
import dask.array as da
import allel
import scipy
from pprint import pprint
import matplotlib.pyplot as plt
import seaborn as sns
# 'ticks' already uses a white background, so a separate set_style('white') call is redundant
sns.set_style('ticks')
sns.set_context('notebook')
%matplotlib inline

In [3]:
from dask_kubernetes import KubeCluster
cluster = KubeCluster(n_workers=30)
cluster



## Import the Variant Data

In [5]:
import gcsfs

gcs_bucket_fs = gcsfs.GCSFileSystem(project='malariagen-jupyterhub', token='anon', access='read_only')

storage_path = 'ag1000g-release/phase2.AR1/variation/main/zarr/pass/ag1000g.phase2.ar1.pass'
store = gcsfs.mapping.GCSMap(storage_path, gcs=gcs_bucket_fs, check=False, create=False)
callset = zarr.Group(store)

In [6]:
chrom = '3L'
variants = allel.VariantChunkedTable(callset[chrom]['variants'], 
                                     names=['POS', 'REF', 'ALT', 'DP', 'MQ', 'QD'],
                                     index='POS')
variants

|   | POS | REF | ALT | DP | MQ | QD |
|---|---|---|---|---|---|---|
| 0 | 9790 | b'C' | [b'T' b'' b''] | 35484 | 54.96 | 14.26 |
| 1 | 9791 | b'G' | [b'T' b'' b''] | 35599 | 55.0 | 20.52 |
| 2 | 9798 | b'G' | [b'A' b'' b''] | 35561 | 55.01 | 13.74 |
| ... | ... | ... | ... | ... | ... | ... |
| 10640385 | 41956541 | b'C' | [b'A' b'' b''] | 40185 | 57.63 | 30.28 |
| 10640386 | 41956551 | b'G' | [b'A' b'' b''] | 39819 | 58.01 | 8.53 |
| 10640387 | 41956556 | b'T' | [b'A' b'C' b''] | 39174 | 58.37 | 32.66 |


## Filter using a Hard Expression

Define the hard filter using an expression. This is just a string of Python code, which we will evaluate in a moment.

In [7]:
filter_expression = '(QD > 5) & (MQ > 40) & (DP > 15000) & (DP < 30000)'

In [8]:
variant_selection = variants.eval(filter_expression)[:]
variant_selection

array([False, False, False, ..., False, False, False])
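
As a rough illustration of what the expression computes, the sketch below builds the same kind of boolean mask by hand with NumPy. The five rows of values and the array names `qd`, `mq` and `dp` are made up for the example and are not taken from the callset. Each comparison yields a boolean array, and `&` combines them element-wise.

```python
import numpy as np

# Hypothetical attribute values for five variants (illustration only).
qd = np.array([14.26, 3.10, 9.71, 17.23, 12.00])
mq = np.array([54.96, 58.00, 43.68, 59.34, 35.00])
dp = np.array([35484, 29000, 29076, 29345, 20000])

# Same thresholds as filter_expression, written out as NumPy comparisons.
mask = (qd > 5) & (mq > 40) & (dp > 15000) & (dp < 30000)
print(mask)
```

A variant is kept only if every condition holds, so in this toy data the first row fails on the upper DP bound, the second on QD and the last on MQ.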

How many variants do we want to keep?

In [10]:
# Number of variants we're keeping based on filter_expression criteria
np.count_nonzero(variant_selection)

304050

How many variants do we filter out?

In [12]:
# Number of variants we're tossing based on filter_expression criteria
np.count_nonzero(~variant_selection)

10336338
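
A quick sanity check, using only the two counts reported above: they should add up to the number of rows in the unfiltered table, and their ratio gives the fraction of variants retained. The variable names are chosen here for illustration.

```python
n_keep = 304050      # variants passing the filter (count reported above)
n_drop = 10336338    # variants failing the filter (count reported above)

n_total = n_keep + n_drop
frac_keep = n_keep / n_total

print(n_total)               # total number of variants on the chromosome arm
print(f"{frac_keep:.2%}")    # share of variants retained
```

The total is 10640388, which matches the unfiltered table (last row index 10640387), and just under 3% of variants are retained. The rows displayed from the unfiltered table all have DP above 30000, so the upper DP bound may account for much of the exclusion, although confirming that would require looking at the full DP distribution.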



Now that we have our variant filter, let’s make a new variants table with only rows for variants that pass our filter.

In [14]:
variants_pass = variants.compress(variant_selection)
variants_pass



|   | POS | REF | ALT | DP | MQ | QD |
|---|---|---|---|---|---|---|
| 0 | 65172 | b'T' | [b'A' b'' b''] | 29076 | 43.68 | 9.71 |
| 1 | 80433 | b'G' | [b'A' b'' b''] | 29345 | 59.34 | 17.23 |
| 2 | 80434 | b'T' | [b'C' b'' b''] | 29141 | 59.34 | 11.88 |
| ... | ... | ... | ... | ... | ... | ... |
| 304047 | 41949138 | b'G' | [b'T' b'' b''] | 28489 | 57.36 | 9.42 |
| 304048 | 41949139 | b'A' | [b'C' b'' b''] | 28453 | 57.37 | 17.56 |
| 304049 | 41949142 | b'G' | [b'A' b'' b''] | 29176 | 57.39 | 12.18 |
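
The idea behind `compress` can be sketched with plain NumPy: given a boolean mask with one entry per row, keep only the rows where the mask is `True`. The small arrays below are hypothetical and only illustrate the mechanics. The chunked table above applies the same idea across all of its columns.

```python
import numpy as np

# Hypothetical positions and a boolean mask of the same length (illustration only).
pos = np.array([9790, 9791, 65172, 80433, 80434])
mask = np.array([False, False, True, True, True])

# Keep only the entries where the mask is True.
pos_pass = np.compress(mask, pos, axis=0)
print(pos_pass)

# The number of rows kept equals the number of True values in the mask.
print(len(pos_pass) == np.count_nonzero(mask))
```

This is why the filtered table has exactly as many rows as `np.count_nonzero(variant_selection)` reported earlier.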

