# ISB-CGC Community Notebooks

```
Title:   How to Create a Random Sample in BigQuery
Author:  Lauren Hagen
Created: 2019-10-17
Purpose: Demonstrates how to split a data set into multiple groups randomly with BigQuery
```
***

# How to Create a Random Sample in BigQuery

In this notebook, we will use BigQuery to create random samples, for example to split data into training and test sets for machine learning. We assume that you have set up your GCP project and accessed the ISB-CGC WebApp. If not, please visit [How To Get Started on ISB-CGC](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/HowToGetStartedonISB-CGC.html) or the [Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks).

We will go over two methods to create the subset data:
- `RAND()` Function
- `MOD()` and `FARM_FINGERPRINT` Functions

Before we can begin working with BigQuery, we will need to load the BigQuery module and authenticate ourselves.

In [0]:
from google.cloud import bigquery

In [0]:
!gcloud auth application-default login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=Z6c0Nwb5MzxmnCZrf0GKFkWfYXu4rxFHOC2vTApWSd0&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/tAEUZfQbCQUFY5TRqQgluy7FELjvZwsJNnyFHTBQmgG8q2s9woEWXcQ

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


## `RAND()` Function for Randomly Splitting Data
A simple way to create a random sample with BigQuery is to use the `RAND()` function. `RAND()` generates a pseudo-random `FLOAT64` value in the range [0, 1) for each row, and the query keeps only the rows whose value falls below a chosen cutoff.
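
The row-by-row filtering idea can be sketched in plain Python. This is only an illustration and not how BigQuery implements `RAND()`: the helper name `bernoulli_sample`, the seed, and the placeholder row IDs are all assumptions made for the example.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Mirrors the idea of `WHERE RAND() < fraction`: one pseudo-random number
    in [0, 1) is drawn per row, and the row is kept when it is below the cutoff.
    """
    rng = random.Random(seed)  # the seed is used here only to make the demo repeatable
    return [row for row in rows if rng.random() < fraction]

# Hypothetical row IDs standing in for the rows of a table
rows = [f"row-{i}" for i in range(1000)]
sample = bernoulli_sample(rows, 0.25, seed=42)

print(f"{len(sample)} of {len(rows)} rows kept ({len(sample) / len(rows):.1%})")
```

Because every row is decided independently, the sample size is close to, but usually not exactly, the requested fraction.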

The final query creates a cohort from a BigQuery table, then generates a random sample from the cohort, and finally joins the random sample back to the main cohort so that each row is labeled by whether it belongs to the sample.

To explain this query, we are going to start out with a simple query to create the cohort. Cohorts can also be created beforehand in the WebApp or through other means instead of within this query. For more information on creating cohorts, please see the [ISB-CGC Web Interface (Web App) documentation](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Web-UI.html) and the [Community Notebook Repository](https://github.com/isb-cgc/Community-Notebooks).

In [0]:
%%bigquery cohort --project isb-cgc-02-0001
-- The %%bigquery cell magic must be the first line of the cell
-- Create a query with the cohort information
-- This can be replaced with a reference to your own cohort stored in BigQuery
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"

In [0]:
# View the first 5 rows of the cohort
cohort.head(5)

Unnamed: 0,case_barcode,project_short_name,case_gdc_id
0,TARGET-30-PATSRD,TARGET-NBL,fef92ed0-242b-5564-ad92-6b35c21c3bd5
1,TARGET-30-PATTEF,TARGET-NBL,fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
2,TARGET-30-PARJAR,TARGET-NBL,feb97edc-ce83-5fd5-94e3-261ce244ac52
3,TARGET-30-PAUAZA,TARGET-NBL,fe831368-c7ce-5e2b-b0fd-c35216a7761d
4,TARGET-30-PAIJGC,TARGET-NBL,fe58a8bf-8306-5aaf-99d1-b65c20fedc58


In [0]:
print("This cohort has " + str(len(cohort)) + " cases (rows).")

This cohort has 1180 cases (rows).


The next part of the query creates the random sample. The `WHERE` clause can be adjusted to pull any percentage of the data table into a random sample. For this example, it is set to create a random sample of ~25% of the data.

In [0]:
%%bigquery sample --project isb-cgc-02-0001

--- Create Cohort
WITH table1 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL")

--- Select a random sample that is ~25% of the data
SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num
FROM table1
-- Count the rows in the cohort, take 25% of that count, then divide by the
-- total row count. The ratio reduces to 0.25, the cutoff that RAND() is compared against.
-- To change the percentage, change the 0.25 to whatever fraction you need
WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)

In [0]:
print("The random sample of the cohort with " + str(len(sample)) + " cases which is around " + str(round((len(sample)/len(cohort)*100),1)) + "% of the cohort.")

The random sample of the cohort with 289 cases which is around 24.5% of the cohort.
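
The sample comes out at 24.5% rather than exactly 25% because each row is kept independently with probability 0.25, so the sample size varies from run to run. As a rough check, and assuming the per-row draws are independent, the sample size follows a binomial distribution whose mean and standard deviation can be computed directly:

```python
import math

n_rows = 1180      # size of the cohort reported above
fraction = 0.25    # cutoff used in the WHERE clause

expected = n_rows * fraction                              # mean of a binomial(n, p)
std_dev = math.sqrt(n_rows * fraction * (1 - fraction))   # its standard deviation

print(f"Expected sample size: {expected:.0f} +/- {std_dev:.1f} rows")
# Expected sample size: 295 +/- 14.9 rows
```

The 289 rows returned above are within one standard deviation of the expected 295.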


This query is useful if we just want a smaller sample of the cohort for some initial analysis before moving to a larger data set, but it is not enough if we want two separate subsets of data, one for training a model and one for testing it. The final query joins the random sample back to the main table and preserves the split with a `table_num` column that records which subset each row belongs to: rows in the random sample are labeled 1 and all remaining rows are labeled 2.

In [0]:
%%bigquery dataset --project isb-cgc-02-0001
--- Create Cohort
WITH table1 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 0 as table_num
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"),

--- Select a random sample that is ~25% of the data
table2 AS (
SELECT case_barcode, project_short_name, case_gdc_id, 1 as table_num
FROM table1
-- Count the rows in the cohort, take 25% of that count, then divide by the
-- total row count. The ratio reduces to 0.25, the cutoff that RAND() is compared against.
-- To change the percentage, change the 0.25 to whatever fraction you need
WHERE RAND() < ((SELECT COUNT(*) FROM table1)*0.25)/(SELECT COUNT(*) FROM table1)
)

--- Join the random sample table back to the main table
SELECT a.case_barcode, a.project_short_name, a.case_gdc_id, IFNULL(b.table_num,2) AS table_num
FROM table1 AS a
FULL OUTER JOIN table2 AS b
ON a.case_barcode = b.case_barcode

In [0]:
dataset.head(5)

Unnamed: 0,case_barcode,project_short_name,case_gdc_id,table_num
0,TARGET-30-PAUHHW,TARGET-NBL,ce12dce7-88e2-511c-a1c8-89f96c24b199,2
1,TARGET-30-PANRRN,TARGET-NBL,bb1bd33c-5ca7-5954-b1ca-2c7e4950d8ce,1
2,TARGET-30-PASJZC,TARGET-NBL,717c4844-5ef9-553b-bb90-0d62d976f909,2
3,TARGET-30-PATRJK,TARGET-NBL,67e5b0a5-5af4-5177-9d11-850a0441bb87,2
4,TARGET-30-PAUJLH,TARGET-NBL,666bc1d4-deb1-5251-894c-220885a1f62e,2


In [0]:
print("The final table has " + str(len(dataset)) + " cases.")

The final table has 1180 cases.


In [0]:
# Create a list with the sorted initial barcodes
case_barcode_initial = list(cohort.case_barcode.sort_values())
# Create a list with the sorted final barcodes
case_barcode_final = list(dataset.case_barcode.sort_values())
# Compare the two lists, if TRUE, no barcodes were lost
case_barcode_initial == case_barcode_final 

True
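
Besides checking that no barcodes were lost, it can be useful to count how many rows received each label. The sketch below does this in plain Python on the five `table_num` values shown by `dataset.head(5)` above, copied in by hand as an assumption so that the example runs without BigQuery.

```python
from collections import Counter

# table_num values copied from the dataset.head(5) output above
table_nums = [2, 1, 2, 2, 2]

counts = Counter(table_nums)
total = sum(counts.values())

# 1 marks the random sample; 2 marks every other case
for label, count in sorted(counts.items()):
    print(f"table_num {label}: {count} rows ({count / total:.0%})")
```

On the full DataFrame, `dataset["table_num"].value_counts()` should give the same kind of summary.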

Each run of the query produces a different random sample because `RAND()` generates a new set of random numbers every time it is called. This can be a problem if you want reproducible results each time you run the query. One way to solve this problem is to use `FARM_FINGERPRINT()` with `MOD()`, which we will cover next.

## `MOD()` and `FARM_FINGERPRINT` Functions

`FARM_FINGERPRINT()` computes a fingerprint (a signed `INT64` hash value) of a `STRING` or `BYTES` input, and the fingerprint for a given input never changes. The `MOD()` function returns the remainder when the fingerprint is divided by a chosen number. Because both steps are deterministic, this method always assigns the same rows to each subset.

For example, suppose we want to split our cohort into three approximately equal subsets. We will write a `WHERE` clause that passes `case_barcode` to `FARM_FINGERPRINT()`, takes the absolute value, passes that number to `MOD()` with a divisor of 3, and compares the remainder to 0.

Note: You need to run `FARM_FINGERPRINT()` on a column that has a unique value for each row, or combine two or more columns with `CONCAT` to create a value that is unique for each row.

For more information visit the BigQuery documentation:
- [`FARM_FINGERPRINT()`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#farm_fingerprint)
- [`MOD()`](https://cloud.google.com/dataprep/docs/html/MOD-Function_57344691)
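
The hash-and-remainder idea can be sketched in Python. FarmHash is not part of the Python standard library, so this sketch substitutes SHA-256 from `hashlib` as a stand-in; the bucket numbers it produces will therefore not match what `FARM_FINGERPRINT()` returns in BigQuery. The function name `bucket_for` is hypothetical.

```python
import hashlib

def bucket_for(key: str, n_buckets: int) -> int:
    """Map a key to a bucket in 0..n_buckets-1 that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer (stand-in for the fingerprint).
    # It is unsigned, so no ABS() step is needed here.
    fingerprint = int.from_bytes(digest[:8], "big")
    return fingerprint % n_buckets

# Barcodes taken from the cohort output earlier in the notebook
barcodes = ["TARGET-30-PATSRD", "TARGET-30-PATTEF", "TARGET-30-PARJAR"]

first_run = [bucket_for(b, 3) for b in barcodes]
second_run = [bucket_for(b, 3) for b in barcodes]

# The same key always lands in the same bucket, unlike a RAND()-based split
print(first_run == second_run)  # True
```

A cryptographic hash is used here rather than Python's built-in `hash()` because `hash()` on strings is salted per process by default and would not give repeatable buckets.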

In [0]:
%%bigquery  --project isb-cgc-02-0001
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) = 0

Unnamed: 0,case_barcode,project_short_name,case_gdc_id
0,TARGET-30-PATTEF,TARGET-NBL,fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
1,TARGET-30-PARJAR,TARGET-NBL,feb97edc-ce83-5fd5-94e3-261ce244ac52
2,TARGET-30-PANUVK,TARGET-NBL,fd756a5f-0f9a-57ca-9879-2b18e5fa0b54
3,TARGET-30-PARZHA,TARGET-NBL,fc9d5f9e-af43-5134-95e7-2d434831580b
4,TARGET-30-PAPRXW,TARGET-NBL,fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c
...,...,...,...
367,TARGET-30-PAHSRS,TARGET-NBL,
368,TARGET-30-PAHYBI,TARGET-NBL,
369,TARGET-30-PADGWJ,TARGET-NBL,
370,TARGET-30-PADTYM,TARGET-NBL,


The 3 in the `MOD()` call is what splits the cohort into three subsets. If we wanted a ~20% split of the data, we would change the 3 to a 10 and then keep any row whose remainder is less than or equal to 1.

In [0]:
%%bigquery  --project isb-cgc-02-0001
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) <= 1

Unnamed: 0,case_barcode,project_short_name,case_gdc_id
0,TARGET-30-PASWIJ,TARGET-NBL,fe0b727f-3843-5b70-b9c1-8a207b837fc4
1,TARGET-30-PARABN,TARGET-NBL,fc9d0307-a5f5-51c4-ad9e-6e9ed60f0eba
2,TARGET-30-PAPRXW,TARGET-NBL,fc3288a3-ad98-5be3-ab70-e02bf4b8fc7c
3,TARGET-30-PAREAG,TARGET-NBL,fb071e74-40dc-5f67-b4c3-e61dd2c2ef88
4,TARGET-30-PASKSX,TARGET-NBL,faaed289-3f28-5a16-b18b-5ee164f05f50
...,...,...,...
216,TARGET-30-PAMXWK,TARGET-NBL,
217,TARGET-30-PAHZRF,TARGET-NBL,
218,TARGET-30-PADKLJ,TARGET-NBL,
219,TARGET-30-PAIFAU,TARGET-NBL,
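
The ~20% figure follows from counting remainders: with a divisor of 10 the possible remainders are 0 through 9, and `<= 1` keeps two of them. A small sketch of that arithmetic, assuming the fingerprints are spread roughly evenly across the remainders (the helper name `selected_fraction` is illustrative):

```python
def selected_fraction(modulus: int, max_remainder: int) -> float:
    """Fraction of the remainders 0..modulus-1 that are <= max_remainder."""
    kept = sum(1 for r in range(modulus) if r <= max_remainder)
    return kept / modulus

print(selected_fraction(10, 1))  # 0.2, matching MOD(..., 10) <= 1
print(selected_fraction(3, 0))   # one of three remainders, matching MOD(..., 3) = 0
```

The actual share in any given cohort will differ slightly from this ideal fraction, since a finite set of barcodes does not fill the remainders perfectly evenly.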


With this method, we can also create a query that will retrieve the remaining subset of data. For the first example, we would change the `=` to `!=` as shown below:

In [0]:
%%bigquery  --project isb-cgc-02-0001
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) != 0

Unnamed: 0,case_barcode,project_short_name,case_gdc_id
0,TARGET-30-PATSRD,TARGET-NBL,fef92ed0-242b-5564-ad92-6b35c21c3bd5
1,TARGET-30-PAUAZA,TARGET-NBL,fe831368-c7ce-5e2b-b0fd-c35216a7761d
2,TARGET-30-PAIJGC,TARGET-NBL,fe58a8bf-8306-5aaf-99d1-b65c20fedc58
3,TARGET-30-PASWIJ,TARGET-NBL,fe0b727f-3843-5b70-b9c1-8a207b837fc4
4,TARGET-30-PANBMJ,TARGET-NBL,fdfb389d-eb9a-5014-b391-9cd5f908720d
...,...,...,...
803,TARGET-30-PAKKMP,TARGET-NBL,
804,TARGET-30-PADKLJ,TARGET-NBL,
805,TARGET-30-PAKHWS,TARGET-NBL,
806,TARGET-30-PAHPEL,TARGET-NBL,


For the second example, we would change the `<=` to `>` as shown below:

In [0]:
%%bigquery  --project isb-cgc-02-0001
SELECT case_barcode, project_short_name, case_gdc_id
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL" AND MOD(ABS(FARM_FINGERPRINT(case_barcode)),10) > 1

Unnamed: 0,case_barcode,project_short_name,case_gdc_id
0,TARGET-30-PATSRD,TARGET-NBL,fef92ed0-242b-5564-ad92-6b35c21c3bd5
1,TARGET-30-PATTEF,TARGET-NBL,fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb
2,TARGET-30-PARJAR,TARGET-NBL,feb97edc-ce83-5fd5-94e3-261ce244ac52
3,TARGET-30-PAUAZA,TARGET-NBL,fe831368-c7ce-5e2b-b0fd-c35216a7761d
4,TARGET-30-PAIJGC,TARGET-NBL,fe58a8bf-8306-5aaf-99d1-b65c20fedc58
...,...,...,...
954,TARGET-30-PAHXGX,TARGET-NBL,
955,TARGET-30-PAKEVZ,TARGET-NBL,
956,TARGET-30-PAKKMP,TARGET-NBL,
957,TARGET-30-PAKHWS,TARGET-NBL,
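
The two queries are complements because every non-NULL remainder satisfies exactly one of `<= 1` and `> 1`. The sketch below checks this for the ten possible remainders. It also models the standard SQL rule that a comparison with NULL is not true, so a row with a NULL key would appear in neither result. Representing NULL as Python `None` and the helper names are assumptions of the example.

```python
def in_small_subset(remainder):
    """Model `remainder <= 1`; a NULL (None) remainder matches nothing."""
    return remainder is not None and remainder <= 1

def in_large_subset(remainder):
    """Model `remainder > 1`; a NULL (None) remainder matches nothing."""
    return remainder is not None and remainder > 1

remainders = list(range(10))

# Every non-NULL remainder falls in exactly one of the two subsets
assert all(in_small_subset(r) != in_large_subset(r) for r in remainders)

# A NULL remainder falls in neither, so such a row would be missing from both queries
assert not in_small_subset(None) and not in_large_subset(None)
```

If the two subsets do not add up to the size of the cohort, checking the key column for NULL values is a reasonable first step.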


If you want to label the entire cohort and only create one query instead of two, a column can be added to label each case with the subset number.

In [0]:
%%bigquery  --project isb-cgc-02-0001
SELECT case_barcode, project_short_name, case_gdc_id, MOD(ABS(FARM_FINGERPRINT(case_barcode)),3) as subset
FROM `isb-cgc.TARGET_bioclin_v0.Clinical`
WHERE project_short_name = "TARGET-NBL"

Unnamed: 0,case_barcode,project_short_name,case_gdc_id,subset
0,TARGET-30-PATSRD,TARGET-NBL,fef92ed0-242b-5564-ad92-6b35c21c3bd5,1
1,TARGET-30-PATTEF,TARGET-NBL,fef13b6c-d5e9-5ffa-9f55-2404f2f99eeb,0
2,TARGET-30-PARJAR,TARGET-NBL,feb97edc-ce83-5fd5-94e3-261ce244ac52,0
3,TARGET-30-PAUAZA,TARGET-NBL,fe831368-c7ce-5e2b-b0fd-c35216a7761d,2
4,TARGET-30-PAIJGC,TARGET-NBL,fe58a8bf-8306-5aaf-99d1-b65c20fedc58,2
...,...,...,...,...
1175,TARGET-30-PADKLJ,TARGET-NBL,,2
1176,TARGET-30-PAKHWS,TARGET-NBL,,2
1177,TARGET-30-PAHPEL,TARGET-NBL,,2
1178,TARGET-30-PAIFAU,TARGET-NBL,,2
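
With every case labeled, the rows can be separated by label once they are in Python. The sketch below uses plain Python on the first five rows of the output above, copied in by hand as an assumption so that it runs without BigQuery. Which subsets serve as training or test data is a hypothetical choice made for the example.

```python
from collections import defaultdict

# (case_barcode, subset) pairs copied from the first rows of the output above
labelled = [
    ("TARGET-30-PATSRD", 1),
    ("TARGET-30-PATTEF", 0),
    ("TARGET-30-PARJAR", 0),
    ("TARGET-30-PAUAZA", 2),
    ("TARGET-30-PAIJGC", 2),
]

# Group barcodes by their subset label
by_subset = defaultdict(list)
for barcode, subset in labelled:
    by_subset[subset].append(barcode)

# Hypothetical choice: hold out subset 0 for testing, train on subsets 1 and 2
test_cases = by_subset[0]
train_cases = by_subset[1] + by_subset[2]

print("train:", train_cases)
print("test: ", test_cases)
```

To work with the full result instead, the query cell can be given a variable name, as in the earlier `%%bigquery cohort` cell, so that the result is stored as a DataFrame.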


The subsets can then be filtered and manipulated in Python. Please let us know if you have questions by emailing feedback@isb-cgc.org.