# A note about opening notebooks in shared workspaces <a class="tocSkip">

Do not run or edit master copies of notebooks unless you intend to improve the code. In general, be cautious when editing a notebook in a shared workspace so that you don't overwrite your collaborators' work. Best practice is to test in a cloned workspace with an easily identifiable name.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#BigQuery-Cohort-Examples" data-toc-modified-id="BigQuery-Cohort-Examples-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>BigQuery Cohort Examples</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Get-the-cohort-query" data-toc-modified-id="Get-the-cohort-query-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get the cohort query</a></span></li><li><span><a href="#Call-BigQuery" data-toc-modified-id="Call-BigQuery-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Call BigQuery</a></span></li><li><span><a href="#Join-with-another-Table" data-toc-modified-id="Join-with-another-Table-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Join with another Table</a></span></li><li><span><a href="#Plot" data-toc-modified-id="Plot-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Plot</a></span></li><li><span><a href="#Provenance" data-toc-modified-id="Provenance-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Provenance</a></span></li></ul></div>

# BigQuery Cohort Examples
This notebook provides examples of how to work with cohorts using BigQuery from R. We will be working with the publicly accessible 1000 Genomes Project data.

# Setup

First, be sure to run the notebook **`R environment setup`** in this workspace.

In [None]:
# Load additional R packages needed to run the code
library(reticulate)
library(bigrquery)
library(ggplot2)

In [None]:
# Set the project ID of the cloud project to bill for queries to BigQuery
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
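
`Sys.getenv()` returns an empty string when a variable is unset, so a missing `GOOGLE_PROJECT` would likely only surface later as a failed query. A minimal sketch of an early check follows; the helper name `require_env` is hypothetical and not part of this workspace's setup.

```r
# Hypothetical helper: return the value of an environment variable,
# or stop with a clear message if it is unset or empty.
require_env <- function(name) {
    value <- Sys.getenv(name)  # returns "" when the variable is unset
    if (!nzchar(value)) {
        stop("Environment variable '", name, "' is not set.")
    }
    value
}

# Usage (commented out so the sketch runs outside the workspace too):
# BILLING_PROJECT_ID <- require_env('GOOGLE_PROJECT')
```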

In [None]:
# Authorize bigrquery client
bigrquery::set_service_token(Ronaldo::getServiceAccountKey())

# Get the cohort query

In [None]:
cohort_query <- "SELECT DISTINCT t1.participant_id FROM (SELECT participant_id 
FROM `verily-public-data.human_genome_variants.1000_genomes_participant_info` 
WHERE  ((Super_Population_Description = \"American\"))) t1"
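
The cohort query above hard-codes the super population. As a sketch only, the same query text could be assembled for a given value with `sprintf()`. The function name `build_cohort_query` is hypothetical, and the value is pasted in verbatim, so this suits fixed, trusted strings rather than user input.

```r
# Hypothetical helper that rebuilds the cohort query for one super population.
build_cohort_query <- function(super_population) {
    sprintf(
        paste(
            "SELECT DISTINCT t1.participant_id FROM (SELECT participant_id",
            "FROM `verily-public-data.human_genome_variants.1000_genomes_participant_info`",
            "WHERE ((Super_Population_Description = \"%s\"))) t1",
            sep = "\n"
        ),
        super_population
    )
}

# Reproduce the query used in this notebook and print it for inspection
example_query <- build_cohort_query("American")
cat(example_query)
```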

# Call BigQuery

In [None]:
# Execute the query and return all results into an in-memory table in R
t <- bigrquery::bq_project_query(
    BILLING_PROJECT_ID,
    cohort_query
)
tt <- bigrquery::bq_table_download(t)

Take a peek at the output:

In [None]:
print(tt)

# Join with another Table

In [None]:
# Select each participant's ID and gender from the participant info table
query <- '
SELECT
    DISTINCT participant_id,
    Gender
FROM
    `verily-public-data.human_genome_variants.1000_genomes_participant_info`
'
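# Run the query and download the results into a data frame
table_data <- bigrquery::bq_project_query(
    BILLING_PROJECT_ID,
    query
)
participant_info <- bigrquery::bq_table_download(table_data)
dim(participant_info)

In [None]:
# Left join: keep every cohort participant and attach Gender where a match exists
merged_table <- merge(x = tt, y = participant_info, by="participant_id", all.x = TRUE)
dim(merged_table)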
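
To make the effect of `all.x = TRUE` concrete, here is a small offline sketch using made-up data frames; the participant IDs and values are hypothetical and do not come from the 1000 Genomes tables. Every row of `x` is kept, and rows with no match in `y` get `NA` in the joined columns.

```r
# Toy cohort: three hypothetical participants
toy_cohort <- data.frame(participant_id = c("P1", "P2", "P3"))

# Toy lookup table: P3 is missing, P4 is not in the cohort
toy_info <- data.frame(
    participant_id = c("P1", "P2", "P4"),
    Gender = c("female", "male", "female")
)

# Left join: keep all cohort rows, drop lookup rows with no cohort match
toy_merged <- merge(x = toy_cohort, y = toy_info, by = "participant_id", all.x = TRUE)

dim(toy_merged)                  # 3 rows, 2 columns
sum(is.na(toy_merged$Gender))    # 1 participant (P3) has no Gender value
```

Comparing `dim()` before and after the merge, as the notebook does, is a quick way to confirm that no cohort rows were dropped or duplicated.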

# Plot

In [None]:
# Count cohort participants by gender
grouped <- table(merged_table$Gender)
print(grouped)

# Bar chart of the same counts
g <- ggplot(merged_table, aes(Gender))
g + geom_bar()
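
Because the merge is a left join, `Gender` can contain `NA` for participants without a match, and `table()` omits `NA` values by default. The sketch below uses a made-up vector (not the notebook's data) to show how `useNA = "ifany"` includes them in the counts.

```r
# Hypothetical gender values, including one missing entry
toy_gender <- c("female", "male", NA, "female")

# Default behavior: NA values are left out of the counts
counts_default <- table(toy_gender)

# useNA = "ifany" adds an NA category when missing values are present
counts_with_na <- table(toy_gender, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```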

# Provenance

In [None]:
devtools::session_info()

Copyright 2019 The Broad Institute, Inc., Verily Life Sciences, LLC All rights reserved.

This software may be modified and distributed under the terms of the BSD license. See the LICENSE file for details.