<a href="https://colab.research.google.com/gist/Gibbsdavidl/dc257e66867a5f3bb8a6c6f351a633c9/isb-cgc-query_of_the_month-november-2018.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basic BigQuery (BBQ): Table Exploration**

 

## Configuration and Setup
Most of the setup that happens in this section can be kept "hidden" when you run this notebook, with the exception of the *authentication* step below which will require that you sign in with Google and then **Allow** the Google Cloud SDK to access your Google account. You will then have to copy/paste a "verification code" (which looks something like ``KYuEQglK4/tACQe47cc_GZQa-KGm6JgJY8BiFyX0I27G-0cjnsz7Z6Fka``) into a box below.

After completing these steps, if you do not get a "successfully authenticated" message, you will not be able to continue running this notebook. (Because of the nature of Jupyter notebooks, you may also see a "successfully authenticated" message that is left over from a previous, expired session, in which case you will need to re-authenticate.)


### Authenticate with Google
Our first step is to authenticate with Google. To run this notebook, you need to be a member of a Google Cloud Platform (GCP) project and be authorized to run BigQuery jobs in it. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org).

In [0]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as err:
  # report the reason rather than silently swallowing every error
  print('You have not been authenticated:', err)

You have been successfully authenticated!


### Initialize connection to BigQuery
Once you're authenticated, we'll begin getting set up to pull data out of BigQuery.  

The first step is to initialize the BigQuery client.  This requires specifying a Google Cloud Platform (GCP) **project id** in which you have the necessary privileges (also referred to as "roles") to execute queries and access the data used by this notebook.


---



In [0]:
from google.cloud import bigquery

# GCP project in which the queries run by this notebook are executed
project_id = 'isb-cgc-bq'
try:
  bqclient = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as err:
  print('Failure to initialize BigQuery client:', err)

BigQuery client successfully initialized


### Software Configuration

#### Libraries & Setup


In [0]:
import logging
import numpy as np
import pandas as pd
import seaborn as sns
import time

logging.basicConfig ( level=logging.INFO )

#### Python convenience functions

We download and import several 'bbq' convenience functions from GitHub.


In [0]:
!wget https://raw.githubusercontent.com/smrgit/jbox/master/bbq/bbq.py

import bbq

--2019-05-30 22:05:03--  https://raw.githubusercontent.com/smrgit/jbox/master/bbq/bbq.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16775 (16K) [text/plain]
Saving to: ‘bbq.py’


2019-05-30 22:05:04 (1.30 MB/s) - ‘bbq.py’ saved [16775/16775]




**`bbqRunQuery`**: a relatively generic wrapper for executing a BigQuery query, either as a "dry run" or as a "live" run. The call to the `query()` function is inside a `try/except` block, and if it fails we return `None`. Otherwise, a dry run returns an empty dataframe and a live run returns the query results as a dataframe.


**`bbqCheckQueryResults`**: a generic function that makes sure that what was returned is a dataframe, and checks how many rows are in the returned dataframe

**`bbqBuildFieldContentsQuery`**: this Python function constructs a query to examine the _contents_ of a specific field in the specified table

**`bbqBuildRepeatedFieldsQuery`**: this Python function constructs a query to examine how often a _repeated_ field is repeated in the specified table

**`bbqSummarizeQueryResults`**: this Python function expects a very specific type of 'query results' and then 'summarizes' them

**`bbqExploreFieldContents`**: this Python function walks through the schema of the specified table in BigQuery and calls the functions above to explore the contents of each field

**`bbqExploreRepeatedFields`**: this Python function walks through the schema of the specified table in BigQuery and calls the functions above to determine how often _repeated_ fields are repeated (even if a field is defined as a _repeated_ field, it may in fact never be repeated more than once, and it might never be repeated at all)
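To make the idea of a "query builder" concrete, here is a minimal sketch of what a field-contents query could look like. It is not the code in `bbq.py`: the function name `build_field_contents_query`, the output column names `value` and `n`, and the identifier checks are all assumptions made for illustration, and the sketch only handles top-level, non-repeated fields.

```python
import re

# Hypothetical helper, NOT the bbq.py implementation: builds a query that
# counts how often each distinct value of one top-level field occurs.
_TABLE_PART = re.compile(r"^[A-Za-z0-9_\-]+$")
_FIELD_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_field_contents_query(project, dataset, table, field):
    """Return a standard-SQL string counting the values of `field`."""
    # Refuse anything that does not look like a plain identifier, since the
    # names are interpolated directly into the SQL text.
    for part in (project, dataset, table):
        if not _TABLE_PART.match(part):
            raise ValueError(f"unexpected table identifier: {part!r}")
    if not _FIELD_NAME.match(field):
        raise ValueError(f"unexpected field name: {field!r}")

    return (
        f"SELECT {field} AS value, COUNT(*) AS n\n"
        f"FROM `{project}.{dataset}.{table}`\n"
        f"GROUP BY value\n"
        f"ORDER BY n DESC, value"
    )


sql = build_field_contents_query(
    "bigquery-public-data",
    "human_genome_variants",
    "platinum_genomes_deepvariant_variants_20180823",
    "reference_name",
)
print(sql)
```

The resulting string could then be passed to a query-execution wrapper; one row per distinct value, sorted by frequency, is enough to derive summary statistics such as the most common value and the number of unique values.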

## **Exploring a Simple Table**

Let's start by exploring a simple table in BigQuery using the functions defined above.

In [0]:
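## a public table that could be explored instead:
projectName = "bigquery-public-data"
datasetName = "github_repos"
tableName   = "sample_files"

## the assignments below override the ones above, so this is the table
## that is actually explored in the following cells:
projectName = "isb-cgc-bq"
datasetName = "scratch"
tableName = "gvar_annot_20190405"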

In [0]:
## the "exploratory" queries sort and count up unique values, etc, and this type
## of operation does not make sense (and can be time-consuming and expensive) 
## for certain fields (or certain types of fields), so those fields can be 
## passed in as names or types to be *excluded* using the excludedNames or
## excludedTypes lists

excludedNames = [ "path", "id", "symlink_target" ]
excludedTypes = [ ]
xf = bbq.bbqExploreFieldContents ( bqclient, projectName, datasetName, tableName, 
                              excludedNames, excludedTypes )

reference_name                STRING      NULLABLE   (24, 0, 18076178, '10', 3117718, 0.17247661535530354, '1', 'Y', 6, 16)
variant_start                 INTEGER     NULLABLE   (329201, 0, 18076178, 97916198, 16470, 0.0009111439376177863, 3935, 249240538, 3879, 26511)
variant_end                   INTEGER     NULLABLE   (329201, 0, 18076178, 97916199, 16470, 0.0009111439376177863, 3936, 249240539, 3879, 26511)
reference_bases               STRING      NULLABLE   (4, 0, 18076178, 'C', 4676484, 0.2587097781400471, 'A', 'T', 2, 4)
variant_quality               FLOAT       NULLABLE   (26898, 5889935, 12186243, 286.8, 70795, 0.005809419687429505, 30.29, 289763.0, 928, 3998)
alt                           STRING      NULLABLE   (4, 0, 18076178, 'G', 4568004, 0.2527085095090345, 'A', 'T', 2, 4)
aliquot_barcode               STRING      NULLABLE   (10389, 0, 18076178, 'TCGA-W5-AA2J-10A-01D-A46P-09', 32103, 0.0017759838390615538, 'TCGA-02-0003-10A-01D-1490-08', 'TCGA-ZX-AA5X-10A-01D-A42R-09', 17

In [0]:
xr = bbq.bbqExploreRepeatedFields ( bqclient, projectName, datasetName, tableName )

 no REPEATED fields found in this table


## **Exploring a Complex Table**

Now for a table with a significantly more complicated schema which includes RECORDs as well as REPEATED fields.

In [0]:
projectName = "bigquery-public-data"
datasetName = "github_repos"
tableName = "sample_commits"

projectName = "isb-cgc-secure-testing"
datasetName = "VT_test"
tableName = "SixteenVCFs_24mayH"

In [0]:
excludedNames = [ ]
excludedTypes = [ ]
xf = bbq.bbqExploreFieldContents ( bqclient, 
                                   projectName, datasetName, tableName, 
                                   excludedNames, excludedTypes )

reference_name                STRING      NULLABLE   (25, 0, 316773, 'chr1', 31306, 0.09882786727404166, 'chr1', 'chrY', 9, 18)
start_position                INTEGER     NULLABLE   (178278, 0, 316773, 24749170, 8, 2.5254677639824102e-05, 267, 248918270, 43946, 146600)
end_position                  INTEGER     NULLABLE   (178054, 0, 316773, 24749171, 8, 2.5254677639824102e-05, 270, 248918271, 43943, 146376)
reference_bases               STRING      NULLABLE   (4508, 0, 316773, 'G', 103728, 0.3274521502779593, 'A', 'TTTTTTTTGAG', 2, 4)
alternate_bases               RECORD      REPEATED   [2] 
    > alt                     STRING      NULLABLE   (12243, 0, 317065, 'A', 103930, 0.3277876776055383, 'A', 'TTTTTTTTTTG', 2, 4)
    > CSQ                     RECORD      REPEATED   [66] 
        > allele              STRING      NULLABLE   (10639, 0, 2023917, 'A', 657008, 0.32462200772067235, '-', 'TTTTTTTTTTTT', 2, 5)
        > Consequence         STRING      NULLABLE   (119, 0, 2023917, 'intron

In [0]:
xr = bbq.bbqExploreRepeatedFields ( bqclient, projectName, datasetName, tableName )

alternate_bases               RECORD      REPEATED   (3, 0, 316773, 1, 316484, 0.9990876747702614, 1, 3, 1, 1)
    > CSQ                     RECORD      REPEATED   (72, 0, 317065, 3, 38056, 0.12002586220491067, 1, 125, 5, 13)
names                         STRING      REPEATED   (2, 0, 316773, 0, 270998, 0.8554958913796314, 0, 1, 1, 2)
filter                        STRING      REPEATED   (7, 0, 316773, 1, 214380, 0.6767622240531863, 1, 7, 1, 3)
call                          RECORD      REPEATED   always repeated 2 time(s)
    > genotype                INTEGER     REPEATED   (2, 0, 633546, 2, 633511, 0.9999447553926629, 2, 3, 1, 1)
    > AMQ                     INTEGER     REPEATED   (3, 0, 633546, 0, 553090, 0.8730068534881446, 0, 2, 1, 2)
    > BCOUNT                  INTEGER     REPEATED   (2, 0, 633546, 0, 553090, 0.8730068534881446, 0, 4, 1, 2)
    > BQ                      INTEGER     REPEATED   (4, 0, 633546, 0, 439160, 0.693177764519072, 0, 3, 1, 2)
    > DP4                     

## **Platinum Genomes**

The "*Platinum Genomes*" are a set of sequences derived from the three-generation, 17-member CEPH 1463 pedigree.  This reference data set is open-access and is described in detail in the Genome Research [publication](https://genome.cshlp.org/content/27/1/157) by Eberle *et al*.

In this section, we're going to use the previously defined functions to have a look at the Platinum Genome variants *called by Google's DeepVariant method* which have been made available in a publicly accessible [table](https://bigquery.cloud.google.com/table/bigquery-public-data:human_genome_variants.platinum_genomes_deepvariant_variants_20180823?pli=1&tab=schema) in BigQuery.

The size of this table is **19.6 GB** and it has nearly 106 *million* rows of data.  It was created in August 2018.  More information can be found by clicking on the `Details` button in the BigQuery web UI.

In [0]:
projectName = "bigquery-public-data"
datasetName = "human_genome_variants"
tableName = "platinum_genomes_deepvariant_variants_20180823"

The **bbqExploreFieldContents** function walks through the entire table schema and outputs one row per field, with the following information:
- field name
- field type (*eg* STRING, INTEGER, RECORD)
- field mode (*eg* NULLABLE, REPEATED)
- finally comes a summary of the *contents* of this field:
  - if this field is a RECORD, the next field will be a number in brackets, indicating the number of *child* fields within this RECORD
  - otherwise, the following 10 values describe the contents of this field:
    - total number of _unique_ non-null values observed
    - total number of null values observed
    - total number of non-null values observed
    - single most common (non-null) value
    - number of times the most common value is observed
    - fraction of total non-null values that are equal to this most common value
    - 'minimum' observed value (based on numeric or alphabetic sorting)
    - 'maximum' observed value (based on numeric or alphabetic sorting)
    - number of distinct values that represent at least 50% of non-null values
    - number of distinct values that represent at least 90% of non-null values
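As a rough illustration of the 10 summary values listed above, the sketch below computes them for a small in-memory list. It is not the `bbq.py` code (which works from query results), the function name `summarize_field` is hypothetical, and it assumes that the last two numbers are obtained by adding up values from most common to least common until the target fraction is reached.

```python
from collections import Counter


def summarize_field(values):
    """Return the 10-value summary tuple for a list of field values.

    `None` stands in for a null value. Returns None if every value is null.
    """
    non_null = [v for v in values if v is not None]
    n_null = len(values) - len(non_null)
    counts = Counter(non_null)
    if not counts:
        return None

    total = len(non_null)
    # Rank values from most to least common (ties broken by value).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_value, top_count = ranked[0]

    def n_values_covering(fraction):
        # Assumed definition: how many of the most common values are
        # needed to account for at least `fraction` of non-null values.
        running = 0
        for i, (_, count) in enumerate(ranked, start=1):
            running += count
            if running >= fraction * total:
                return i
        return len(ranked)

    return (
        len(counts),             # unique non-null values
        n_null,                  # null values
        total,                   # non-null values
        top_value,               # most common value
        top_count,               # occurrences of the most common value
        top_count / total,       # fraction equal to the most common value
        min(counts),             # 'minimum' observed value
        max(counts),             # 'maximum' observed value
        n_values_covering(0.5),  # distinct values covering >= 50%
        n_values_covering(0.9),  # distinct values covering >= 90%
    )


summary = summarize_field(["A", "A", "A", "C", "G", None])
print(summary)
```

Read this way, a tuple whose last two entries are `2` and `4` (as for `reference_bases` in the outputs in this notebook) says that the two most common values already account for at least half of the non-null values.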
    
**NB**: fields which are always *null* will not be shown at all in this output.  For example, for this table, the **call** record contains 12 fields: name, genotype, phaseset, GQ, DP, MIN_DP, AD, VAF, GL, PL, quality, and filter.  But two of these fields (phaseset and GL) are always null (or missing), and are therefore not contained in the output generated by **bbqExploreFieldContents**.

In [0]:
## the "exploratory" queries sort and count up unique values, etc, and this
## type of operation does not make sense (and can be very time-consuming) for
## certain fields, so those fields can be passed in as names to be *excluded*
## using the `excludedNames` list:

excludedNames = [ "start_position", "end_position" ]
excludedTypes = [ ]
xf = bbq.bbqExploreFieldContents ( bqclient, 
                                   projectName, datasetName, tableName, 
                                   excludedNames, excludedTypes )

reference_name                STRING      NULLABLE   (24, 0, 105923159, 'chrX', 14598875, 0.13782514737877105, 'chr1', 'chrY', 8, 19)
start_position                INTEGER     NULLABLE   (this field was excluded)
end_position                  INTEGER     NULLABLE   (this field was excluded)
reference_bases               STRING      NULLABLE   (67253, 0, 105923159, 'T', 28081226, 0.26510940822676937, 'A', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTGGAG', 2, 4)
alternate_bases               RECORD      REPEATED   [1] 
    > alt                     STRING      NULLABLE   (52211, 0, 118997285, '<*>', 105923159, 0.8901308882803503, '<*>', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT', 1, 2)
quality                       FLOAT       NULLABLE   (830, 0, 105923159, 0.0, 97211191, 0.9177519998247031, 0.0, 150.0, 1, 1)
filter                        STRING      REPEATED   (2, 0, 13234436, 'PASS', 8117976, 0.6133979566639636, 'PASS', 'RefCall', 1, 2)
call                          RECORD      REPEATED   [12] 


In [0]:
xr = bbq.bbqExploreRepeatedFields ( bqclient, 
                                    projectName, datasetName, tableName )

alternate_bases               RECORD      REPEATED   (5, 0, 105923159, 1, 93106707, 0.8790023624578643, 1, 5, 1, 2)
names                         STRING      REPEATED   always repeated 0 time(s)
filter                        STRING      REPEATED   (3, 0, 105923159, 0, 93106707, 0.8790023624578643, 0, 2, 1, 2)
call                          RECORD      REPEATED   (6, 0, 105923159, 1, 72441909, 0.683910012540317, 1, 6, 1, 4)
    > genotype                INTEGER     REPEATED   always repeated 2 time(s)
    > AD                      INTEGER     REPEATED   (5, 0, 182104652, 0, 143555264, 0.7883118988086038, 0, 6, 1, 2)
    > VAF                     FLOAT       REPEATED   (5, 0, 182104652, 0, 143555264, 0.7883118988086038, 0, 5, 1, 2)
    > GL                      FLOAT       REPEATED   always repeated 0 time(s)
    > PL                      INTEGER     REPEATED   (5, 0, 182104652, 3, 143555264, 0.7883118988086038, 3, 21, 1, 2)
    > filter                  STRING      REPEATED   (2, 0, 1821

The variants in this table are not annotated, so the contents of the table are relatively simple.

## **gnomAD Exome data: hg19 release 2.0.2**

In this section, we'll explore data released by the [gnomAD project](https://gnomad.broadinstitute.org/).  Public data files are available from a Google Cloud Storage bucket ([gs://gnomad-public](https://console.cloud.google.com/storage/browser/gnomad-public/)).

The r2.0.2 exome VCF file is available in GCS at `gs://gnomad-public/release/2.0.2/vcf/exomes/gnomad.exomes.r2.0.2.sites.vcf.bgz`

The key difference between tables in Google's **human_variant_annotation** and **human_genome_variants** datasets is whether or not the variants have been annotated using the [**Variant Effect Predictor**](https://uswest.ensembl.org/info/docs/tools/vep/vep_formats.html) option in Google's [**Variant Transforms**](https://cloud.google.com/genomics/docs/how-tos/variant-transforms) tool.

These annotations *dramatically* increase the size of the destination table in BigQuery.  This gnomAD table is **68 GB** in size and contains ~15 million rows, whereas the Platinum Genomes table is less than 20 GB with nearly 106 million rows.
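To put those sizes in perspective, the short calculation below compares the average number of bytes per row in the two tables. The row counts are the totals reported in the exploration outputs in this notebook, and the sketch assumes that 1 GB means 10^9 bytes (the ratio between the two tables does not depend on that choice).

```python
def bytes_per_row(size_gb, n_rows):
    """Average row size in bytes, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 / n_rows


# Table sizes as quoted in the text; row counts as reported in this notebook.
platinum = bytes_per_row(19.6, 105_923_159)
gnomad = bytes_per_row(68, 15_014_473)

print(f"Platinum Genomes: {platinum:,.0f} bytes per row")
print(f"gnomAD exomes:    {gnomad:,.0f} bytes per row")
print(f"ratio:            {gnomad / platinum:.1f}x")
```

Under these assumptions an annotated gnomAD row is on the order of a few kilobytes, while a Platinum Genomes row is under 200 bytes, so each annotated row is more than twenty times larger on average.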


(Note that, at the time this notebook was written, the most recent version of gnomAD was release 2.1.)

In [0]:
projectName = "bigquery-public-data"
datasetName = "human_variant_annotation"
tableName = "gnomad_gnomad_exomes_hg19_release_2_0_2"

We can use the `bq` command-line tool to show information about this table:

In [0]:
!gcloud config set project 'isb-cgc-bq'

Updated property [core/project].


To take a quick anonymous survey, run:
  $ gcloud alpha survey



In [0]:
!bq help

Python script for interacting with BigQuery.


USAGE: bq [--global_flags] <command> [--command_flags] [args]


Any of the following commands:
  cancel, cp, extract, head, help, init, insert, load, ls, mk, mkdef, partition,
  query, rm, shell, show, update, version, wait


cancel     Request a cancel and waits for the job to be cancelled.

           Requests a cancel and then either: a) waits until the job is done if
           the sync flag is set [default], or b) returns immediately if the sync
           flag is not set. Not all job types support a cancel, an error is
           returned if it cannot be cancelled. Even for jobs that support a
           cancel, success is not guaranteed, the job may have completed by the
           time the cancel request is noticed, or the job may be in a stage
           where it cannot be cancelled.

           Examples:
           bq cancel job_id # Requests a cancel and waits until the job is done.
           bq --nosync cancel job_id # Request

In [0]:
!gcloud auth application-default login


The environment variable [GOOGLE_APPLICATION_CREDENTIALS] is set to:
  [/content/adc.json]
Credentials will still be generated to the default location:
  [/content/.config/application_default_credentials.json]
To use these credentials, unset this environment variable before
running your application.

Do you want to continue (Y/n)?  Y

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline


Enter verification code: 4/WgHQ4Gey91fMt_RCzRObQFOASe58WAJOSPP0qJL-aFms-c9vIZ4PmZo

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credential

In [0]:
!bq --project_id isb-cgc show "bigquery-public-data:human_variant_annotation.gnomad_gnomad_exomes_hg19_release_2_0_2"


Welcome to BigQuery! This script will walk you through the 
process of initializing your .bigqueryrc configuration file.

First, we need to set up your credentials if they do not 
already exist.

Credential creation complete. Now we will select a default project.

List of projects:
   #             projectId                     friendlyName           
 ----- ----------------------------- -------------------------------- 
  1     another-test-project-186620   another-test-project            
  2     ultra-envoy-227421            APT Sandbox                     
  3     isb-cgc-01-0010               B-cell Receptor Analysis        
  4     big-data-science              big-data-science                
  5     cgc-05-0042                   biocellion                      
  6     cancer-data-commons           cancer-data-commons             
  7     cancer-genome-atlas           cancer-genome-atlas             
  8     cgc-05-0001                   CGC-05-0001                     
  9   

As the table schema shows, this table has well over 200 fields!
- 124 integer fields (33 of which are *repeated* fields)
- 84 string fields (2 of which are *repeated* fields)
- 43 floating point fields
- 5 boolean fields

Some of these fields are contained within 3 (potentially repeated) records: **alternate_bases**, **alternate_bases.CSQ**, and **call**.

Note that running **bbqExploreFieldContents** on every field of this table would take several hours and would also run out of RAM before completing, so instead we will run it with most fields *excluded*.
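The following sketch illustrates how a schema walk with exclusions might decide what to do with each field. It is not the `bbq.py` implementation: the function name `walk_schema` and the toy schema are made up for illustration, the schema is represented with plain dictionaries rather than the `SchemaField` objects returned by the BigQuery client, and exclusion is assumed to match on the bare field name or the field type.

```python
def walk_schema(fields, excluded_names=(), excluded_types=(), prefix=""):
    """Yield (path, type, mode, action) for every field, depth first.

    `action` is "descend" for RECORDs, "excluded" for skipped fields,
    and "explore" for fields whose contents would be queried.
    """
    for field in fields:
        path = prefix + field["name"]
        if field["type"] == "RECORD":
            yield path, field["type"], field["mode"], "descend"
            # Recurse into the child fields, extending the dotted path.
            yield from walk_schema(field.get("fields", []),
                                   excluded_names, excluded_types,
                                   prefix=path + ".")
        elif field["name"] in excluded_names or field["type"] in excluded_types:
            yield path, field["type"], field["mode"], "excluded"
        else:
            yield path, field["type"], field["mode"], "explore"


# Toy schema (hypothetical subset of a Variant Transforms table).
toy_schema = [
    {"name": "reference_name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "start_position", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "alternate_bases", "type": "RECORD", "mode": "REPEATED",
     "fields": [
         {"name": "alt", "type": "STRING", "mode": "NULLABLE"},
         {"name": "AC", "type": "INTEGER", "mode": "NULLABLE"},
         {"name": "AF", "type": "FLOAT", "mode": "NULLABLE"},
     ]},
]

plan = list(walk_schema(toy_schema,
                        excluded_names=["start_position", "end_position"],
                        excluded_types=["INTEGER", "FLOAT"]))
for path, ftype, mode, action in plan:
    print(f"{path:<24} {ftype:<8} {mode:<9} {action}")
```

Excluding whole types in this way keeps only the STRING fields for exploration, which is what makes the run on a very wide table tractable.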

In [0]:
excludedNames = [ "start_position", "end_position" ]
excludedTypes = [ "INTEGER", "FLOAT" ]
xf = bbq.bbqExploreFieldContents ( bqclient,
                                projectName, datasetName, tableName, 
                                excludedNames, excludedTypes )

reference_name                STRING      NULLABLE   (24, 0, 15014473, '1', 1494265, 0.09952164155212108, '1', 'Y', 8, 18)
start_position                INTEGER     NULLABLE   (this field was excluded)
end_position                  INTEGER     NULLABLE   (this field was excluded)
reference_bases               STRING      NULLABLE   (106583, 0, 15014473, 'G', 4617665, 0.30754759091444633, 'A', 'TTTTTTTTTTTTTTTTTTTTTTTTG', 2, 4)
alternate_bases               RECORD      REPEATED   [93] 
    > alt                     STRING      NULLABLE   (133493, 0, 17009588, 'A', 4857970, 0.28560186172645685, 'A', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTG', 2, 4)
    > AC                      INTEGER     NULLABLE   (this field was excluded)
    > AF                      FLOAT       NULLABLE   (this field was excluded)
    > GQ_HIST_ALT             STRING      NULLABLE   (1534167, 0, 17009588, '0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1', 6471723, 0.3804750003

Note that although the schema identified 84 STRING fields, the output generated by **bbqExploreFieldContents** above (where we skipped over all INTEGER and FLOAT fields) includes only 18 STRING fields. Because fields that are always null are omitted from the output, this means that the majority of the STRING fields are never used.

Now let's see how often the **repeated** fields are repeated, by using the `bbqExploreRepeatedFields` function imported earlier.

In [0]:
xr = bbq.bbqExploreRepeatedFields ( bqclient, 
                                    projectName, datasetName, tableName )

alternate_bases               RECORD      REPEATED   (6, 0, 15014473, 1, 13237344, 0.8816389359786387, 1, 6, 1, 2)
    > CSQ                     RECORD      REPEATED   (84, 0, 17009588, 4, 1740924, 0.10234956896075319, 0, 84, 6, 16)
names                         STRING      REPEATED   (2, 0, 15014473, 1, 9448572, 0.629297611711047, 0, 1, 1, 2)
filter                        STRING      REPEATED   (3, 0, 15014473, 1, 13902880, 0.9259652336781984, 1, 3, 1, 1)
call                          RECORD      REPEATED   always repeated 0 time(s)
    > genotype                INTEGER     REPEATED   (0, 0)
    > AD                      INTEGER     REPEATED   (0, 0)
    > PL                      INTEGER     REPEATED   (0, 0)
GC_AFR                        INTEGER     REPEATED   (7, 0, 15014473, 3, 12900298, 0.8591908620435762, 0, 28, 1, 2)
GC_AMR                        INTEGER     REPEATED   (7, 0, 15014473, 3, 12900298, 0.8591908620435762, 0, 28, 1, 2)
GC_ASJ                        INTEGER     REPEAT

As mentioned earlier, there are 3 records in this Variant Transforms BigQuery schema:
- **alternate_bases** is repeated anywhere between 1 and 6 times, but generally only once or twice
- **alternate_bases.CSQ** is repeated up to 84 times, but generally 16 or fewer times
- and the **call** record is never repeated at all, so that entire part of the schema can be ignored

It might seem odd that there are no **call** records, but this is common when looking at a "variant database".  When looking at specific "sample data", on the other hand, we would expect to see details in the **call** record, including the sample name, genotype, *etc*.

One of the interesting fields inside the **alternate_bases.CSQ** (aka *consequences*) record is called **IMPACT**. Let's write a query to see what the different values are for that field and how often each one occurs.

Note that since it is a field inside a repeated record which is itself inside a repeated record, we have two implicit "unnesting" steps in the **FROM** clause below:
- we first alias the long table name to `t`
- then we implicitly unnest `t.alternate_bases` and alias it as `alt`
- then we implicitly unnest `alt.CSQ` and alias it as `csq`

That then allows us to **SELECT** `csq.IMPACT` and perform **COUNT** and **GROUP BY** operations on that field.
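For readers who find nested loops easier to follow than comma joins, here is a plain-Python analogy of the two unnesting steps followed by the **COUNT** and **GROUP BY**. The rows are toy data invented for illustration (not taken from the gnomAD table), and the function name `count_impacts` is hypothetical.

```python
from collections import Counter

# Toy rows mimicking the nesting: each row has a repeated
# `alternate_bases` record, each of which has a repeated `CSQ` record.
rows = [
    {"reference_name": "1", "alternate_bases": [
        {"alt": "A", "CSQ": [{"IMPACT": "MODIFIER"}, {"IMPACT": "HIGH"}]},
        {"alt": "T", "CSQ": [{"IMPACT": "MODIFIER"}]},
    ]},
    {"reference_name": "2", "alternate_bases": [
        {"alt": "G", "CSQ": []},  # no annotations: contributes nothing
    ]},
]


def count_impacts(rows):
    """Count IMPACT values across every (row, alt, csq) combination."""
    counts = Counter()
    for t in rows:                          # FROM table AS t
        for alt in t["alternate_bases"]:    # , t.alternate_bases AS alt
            for csq in alt["CSQ"]:          # , alt.CSQ AS csq
                counts[csq["IMPACT"]] += 1  # COUNT(*) ... GROUP BY IMPACT
    # ORDER BY n DESC, IMPACT
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


result = count_impacts(rows)
print(result)
```

Each additional unnesting step corresponds to one more nested loop, and an element whose inner array is empty simply produces no output rows, just as it would with the comma join.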

In [0]:
%%bigquery --project isb-cgc-bq

WITH
  t1 AS (
  SELECT
    csq.IMPACT, COUNT(*) AS n
  FROM
    `bigquery-public-data.human_variant_annotation.gnomad_gnomad_exomes_hg19_release_2_0_2` AS t,
    t.alternate_bases AS alt,
    alt.CSQ AS csq
  GROUP BY 1 )
SELECT *
FROM t1
ORDER BY n DESC, 1

Unnamed: 0,IMPACT,n
0,MODIFIER,102675373
1,MODERATE,18959053
2,LOW,12570156
3,HIGH,2113301


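OK, now let's list the distinct HIGH-impact variants in this database, keeping only those with an allele number (AN) of at least 10,000, an allele count (AC) of at least 10, and an allele frequency (AF) either between 0.00001 and 0.05 or between 0.95 and 0.99999.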

In [0]:
%%bigquery --project isb-cgc-bq

WITH
  t1 AS (
  SELECT DISTINCT 
    reference_name, start_position, end_position, reference_bases, AN,
    alt.alt, alt.AC, alt.AF, alt.AF_Male, alt.AF_Female,
    csq.allele, csq.Consequence, csq.IMPACT, csq.SYMBOL
  FROM
    `bigquery-public-data.human_variant_annotation.gnomad_gnomad_exomes_hg19_release_2_0_2` AS t,
    t.alternate_bases AS alt,
    alt.CSQ AS csq
  WHERE
    csq.IMPACT="HIGH" ),
  t2 AS (
  SELECT DISTINCT 
    SYMBOL, reference_name, start_position,
    AN, AC, AF, AF_Male, AF_Female,
    Consequence,
    reference_bases, alt, allele
  FROM
    t1
  WHERE
    (( AF >= 0.00001 AND AF <= 0.05000 )
      OR ( AF >= 0.95000 AND AF <= 0.99999 ))
    AND AN >= 10000 AND AC >= 10 )
SELECT * FROM t2

Unnamed: 0,SYMBOL,reference_name,start_position,AN,AC,AF,AF_Male,AF_Female,Consequence,reference_bases,alt,allele
0,GOLGA6L6,15,20740192,137056,144,0.001051,0.000988,0.001124,frameshift_variant,T,TGC,GC
1,CDCP2,1,54605317,152972,789,0.005158,0.005591,0.004625,frameshift_variant,TG,TGCG,CG
2,FAM83C,20,33874804,242996,16,0.000066,0.000067,0.000064,frameshift_variant,G,GGCCAC,GCCAC
3,CD36,7,80302137,244370,14,0.000057,0.000067,0.000045,frameshift_variant,T,TCAAGC,CAAGC
4,AC064874.1,2,236682978,143494,70,0.000488,0.000615,0.000336,frameshift_variant,T,TGCTG,GCTG
5,KRTAP2-4,17,39221826,127396,26,0.000204,0.000115,0.000311,frameshift_variant,GCC,GGTCC,GT
6,RBM12B,8,94746697,245754,24,0.000098,0.000096,0.000099,frameshift_variant,G,GGGAGCTGCCTGAAGTCCTCCTCT,GGAGCTGCCTGAAGTCCTCCTCT
7,TCHH,1,152084455,123622,47,0.000380,0.000219,0.000581,frameshift_variant,G,GTCCTC,TCCTC
8,CMYA5,5,79025349,244398,11,0.000045,0.000082,0.000000,frameshift_variant,T,TAA,AA
9,HAGHL,16,778960,163440,81,0.000496,0.000424,0.000582,frameshift_variant,G,GGC,GC
