# Sampling Data
`bqt` supports creating sampled tables based on either a query or a table.

Supported types of sampling are:

### Pure Random Sampling
This is the most basic form where each row has the same chance of being included in the sampled dataset. Available methods are:
* **`bqt.sample_table_random()`:** create a random sample given a table and wait until it's done
* **`bqt.sample_table_random_async()`:** create a random sample given a table asynchronously
* **`bqt.sample_query_random()`:** create a random sample from results of the given query and wait until it's done

All these methods have a similar signature. The main inputs are:
* `table`, `dataset`, `project`: indicate where the source data is when sampling a table
* `partition_start` and `partition_end`: limit the range of partitions to sample when you want to sample across partitions
* `ratio` and `num_rows`: exactly one of these must be provided. You can either sample a ratio of the original table or specify how many rows you want.
* `dest_table`, `dest_dataset`, `dest_project`: indicate where to store the sampled data
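
The difference between `ratio` and `num_rows` can be illustrated with a small sketch in plain Python. This is not how `bqt` is implemented; `sample_rows` and its `seed` argument are hypothetical names used only to show the two modes on an in-memory list.

```python
import random


def sample_rows(rows, ratio=None, num_rows=None, seed=0):
    """Illustrative only: uniform random sample by ratio or by row count."""
    # Exactly one of `ratio` and `num_rows` must be given.
    if (ratio is None) == (num_rows is None):
        raise ValueError("provide exactly one of ratio or num_rows")
    rng = random.Random(seed)
    if ratio is not None:
        # Keep each row independently with probability `ratio`,
        # so the sample size is only approximately ratio * len(rows).
        return [row for row in rows if rng.random() < ratio]
    # Otherwise draw a fixed number of rows without replacement.
    return rng.sample(rows, min(num_rows, len(rows)))


rows = list(range(1000))
by_count = sample_rows(rows, num_rows=50)  # exactly 50 distinct rows
by_ratio = sample_rows(rows, ratio=0.1)    # roughly 10% of the rows
```

In this sketch a ratio gives a sample whose size varies from run to run, while a row count gives an exact size.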

### Stable Random Sampling
These functions assign a chance of being included in the sample to a group of rows instead of to every single row. They are useful, for example, if you want to sample a table that has user-level information and want to include either all events for a user or none of them. Pure random sampling would include only some of the events for any given user, because it has no notion of how events are grouped. This method does.

Methods are:
* **`bqt.sample_table_stable_random()`:** create a stable random sample given a table and wait until it's done
* **`bqt.sample_table_stable_random_async()`:** create a stable random sample given a table asynchronously
* **`bqt.sample_query_stable_random()`:** create a stable random sample from results of the given query and wait until it's done

All these methods have a similar signature. The inputs are the same as for pure random sampling, except that you can now specify your row grouping:
* `columns` is one or more columns in the table or query that should be used to group rows together when sampling
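
One common way to get this all-or-nothing behavior is to hash the grouping key and keep a group when its hash falls below the ratio. The sketch below assumes that approach for illustration only; `bqt`'s actual mechanism may differ, and `keep_group` is a hypothetical helper.

```python
import hashlib


def keep_group(key, ratio):
    """Illustrative only: map a group key to a stable number in [0, 1)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a fraction of 2**32.
    bucket = int(digest[:8], 16) / 2 ** 32
    return bucket < ratio


# Three events per user; a user's events are kept together or dropped together.
events = [{"user_id": u, "event": e} for u in range(200) for e in range(3)]
sample = [row for row in events if keep_group(row["user_id"], ratio=0.1)]
```

Because the decision depends only on the key, the same users are selected every time the sample is rebuilt in this sketch.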

### Stratified Random Sampling
These functions are useful when you have unbalanced groups of data but want to sample them uniformly. For example, if you have data by country but each country has a different volume of events, you can use these methods to sample the data and be sure that the final sample includes a similar proportion for all countries.

Methods are:
* **`bqt.sample_table_stratified()`:** create a stratified random sample given a table and wait until it's done
* **`bqt.sample_table_stratified_async()`:** create a stratified random sample given a table asynchronously
* **`bqt.sample_query_stratified()`:** create a stratified random sample from results of the given query and wait until it's done

All these methods have a similar signature. The inputs are the same as for pure random sampling, except that you can now specify the column used to balance the sample:
* `strata` is one or more columns in the table or query that should be used to balance the sampling
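
The balancing idea can be sketched in plain Python by drawing the same number of rows from every stratum. This is a conceptual illustration, not `bqt`'s implementation; `stratified_sample`, its `seed` argument, and the choice to split `num_rows` evenly are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, strata, num_rows, seed=0):
    """Illustrative only: draw an equal number of rows from each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strata]].append(row)
    if not groups:
        return []
    # Split the requested total evenly across the strata.
    per_group = num_rows // len(groups)
    sample = []
    for members in groups.values():
        # A stratum smaller than its share contributes all of its rows.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Very different event volumes per country.
volumes = {"US": 900, "SE": 90, "IS": 30}
rows = [{"country": c, "event": i} for c, n in volumes.items() for i in range(n)]
sample = stratified_sample(rows, strata="country", num_rows=60)
```

Here each country contributes 20 rows to the sample even though the source volumes differ by a factor of 30.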

In [1]:
from bqt import bqt

Example usage:

In [None]:
# sample the given query at 10% and store it in `analytics-mafia.behrooza.test_sample_query`
bqt.sample_query_random(
    "SELECT * FROM `ad-veritas.oculus_impressions_base_v2.oculus_impressions_base_v2_20181001`",
    ratio=0.1, dest_table='test_sample_query', dest_dataset='behrooza', dest_project='analytics-mafia'
)

# sample the table:
#    `ad-veritas.oculus_impressions_base_v2.oculus_impressions_base_v2_YYYYMMDD`
# over user_id and over partitions:
#     from 2018-10-01 to 2018-10-10
# at 10% and store it in
#     `analytics-mafia.behrooza.test_stable_random_YYYYMMDD`
#
# NOTE: `ignore_exists=True` will resample and recreate the table
bqt.sample_table_stable_random(
    'oculus_impressions_base_v2_YYYYMMDD', 'oculus_impressions_base_v2', project='ad-veritas',
    partition_start='2018-10-01', partition_end='2018-10-10', ratio=0.1,
    dest_table='test_stable_random_YYYYMMDD', dest_dataset='behrooza', dest_project='analytics-mafia',
    columns='user_id', ignore_exists=True
)

# sample the table:
#    `ad-veritas.oculus_impressions_base_v2.oculus_impressions_base_v2_YYYYMMDD`
# making sure each `ad_unit` has the same number of rows sampled, over partitions:
#     from 2018-10-01 to 2018-10-10
# keeping a total of 10000 rows and store it in
#     `analytics-mafia.behrooza.test_stratified_YYYYMMDD`
bqt.sample_table_stratified(
    'oculus_impressions_base_v2_YYYYMMDD', 'oculus_impressions_base_v2', project='ad-veritas',
    partition_start='2018-10-01', partition_end='2018-10-10', num_rows=10000,
    dest_table='test_stratified_YYYYMMDD', dest_dataset='behrooza', dest_project='analytics-mafia',
    strata='ad_unit'
)