<a href="https://colab.research.google.com/github/smenaaliaga/tesis_magister/blob/main/ej_google_bigquery_mimiciv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before you begin


1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2.   [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3.   [Enable BigQuery](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) APIs for the project.


### Provide your credentials to the runtime

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

## Optional: Enable data table display

Colab includes the `google.colab.data_table` package, which displays large pandas DataFrames as interactive data tables.
Enable it with:

In [2]:
%load_ext google.colab.data_table

If you would prefer to return to the classic pandas DataFrame display, disable the extension by running:
```python
%unload_ext google.colab.data_table
```

# Use BigQuery via magics

The `google.cloud.bigquery` library includes the `%%bigquery` cell magic, which runs a query and either displays the result or saves it to a variable as a `DataFrame`.

In [None]:
%%bigquery --project mimic-356201
-- Display query output immediately
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10000032,F,52,2180,2014 - 2016,2180-09-09
1,10000048,F,23,2126,2008 - 2010,
2,10000068,F,19,2160,2008 - 2010,
3,10000084,M,72,2160,2017 - 2019,2161-02-13


In [None]:
%%bigquery --project mimic-356201 df
-- Save the query result in the variable df
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id

In [None]:
df

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10000032,F,52,2180,2014 - 2016,2180-09-09
1,10000048,F,23,2126,2008 - 2010,
2,10000068,F,19,2160,2008 - 2010,
3,10000084,M,72,2160,2017 - 2019,2161-02-13


# Use BigQuery through google-cloud-bigquery

See [BigQuery documentation](https://cloud.google.com/bigquery/docs) and [library reference documentation](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html).

The examples below query the `patients` table of the MIMIC-IV `mimiciv_hosp` dataset (`physionet-data.mimiciv_hosp.patients`), which holds one row per patient with the columns `subject_id`, `gender`, `anchor_age`, `anchor_year`, `anchor_year_group`, and `dod`.


### Declare the Cloud project ID which will be used throughout this notebook

In [3]:
project_id = 'mimic-356201'

In [5]:
from google.cloud import bigquery

client = bigquery.Client(project=project_id)

In [6]:
# Read anchor_age from the first row (lowest subject_id) of the result
first_age = client.query('''
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id
''').to_dataframe().anchor_age[0]

In [7]:
first_age

52
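
The cell above downloads every column of every matching row just to read one value. A lighter variant, sketched here on the assumption that only `anchor_age` is needed, asks BigQuery for a single column and a single row. The helper name `first_anchor_age` and its `max_subject_id` parameter are illustrative and not part of the original notebook; `client` is expected to behave like the `bigquery.Client` created earlier.

```python
def first_anchor_age(client, max_subject_id=10000100):
    """Return anchor_age for the lowest subject_id below max_subject_id.

    Hypothetical helper: `client` only needs a BigQuery-style
    `query(sql).result()` interface, as `bigquery.Client` provides.
    """
    # int() rejects non-numeric input before it reaches the SQL text
    sql = '''
    SELECT anchor_age
    FROM `physionet-data.mimiciv_hosp.patients`
    WHERE subject_id < %d
    ORDER BY subject_id
    LIMIT 1
    ''' % int(max_subject_id)

    # result() waits for the job and yields rows with attribute access
    for row in client.query(sql).result():
        return row.anchor_age
    return None  # no patient matched the filter
```

With the client defined above, `first_anchor_age(client)` should return the same value as `first_age` while transferring far less data.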

### Sample approximately 2000 random rows

In [10]:
sample_count = 2000

# Count all rows so the sampling probability can be computed
row_count = client.query('''
SELECT COUNT(*) AS total
FROM `physionet-data.mimiciv_hosp.patients`
''').to_dataframe().total[0]

# Keep each row independently with probability sample_count / row_count
df = client.query('''
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE RAND() < %d/%d
''' % (sample_count, row_count)).to_dataframe()

print('Full dataset has %d rows' % row_count)

Full dataset has 315460 rows
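
Because `WHERE RAND() < sample_count/row_count` keeps each row independently, the sample size varies from run to run instead of being exactly 2000. The sketch below uses only the standard library to estimate the expected size and its spread under a binomial model. The function name `expected_sample_size` is introduced here for illustration.

```python
import math

def expected_sample_size(sample_count, row_count):
    """Mean and standard deviation of the sample size under a binomial model."""
    p = sample_count / row_count              # per-row keep probability
    mean = row_count * p                      # equals sample_count
    std = math.sqrt(row_count * p * (1 - p))  # binomial standard deviation
    return mean, std

mean, std = expected_sample_size(2000, 315460)
print('Expected about %.0f rows, give or take %.0f' % (mean, std))
```

A sample of 1948 rows, as reported by `df.describe()` in this notebook, sits roughly 1.2 standard deviations below the expected 2000, which is well within normal variation.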


### Describe the sampled data

In [11]:
df.describe()

Unnamed: 0,subject_id,anchor_age,anchor_year
count,1948.0,1948.0,1948.0
mean,15113830.0,48.720226,2151.50308
std,2936800.0,20.715359,23.706961
min,10001020.0,18.0,2110.0
25%,12585700.0,29.0,2131.0
50%,15138560.0,48.0,2152.0
75%,17749130.0,65.0,2172.0
max,19996870.0,91.0,2207.0


### View the first 10 rows

In [12]:
df.head(10)

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,18885495,M,20,2110,2008 - 2010,
1,14351781,M,22,2110,2014 - 2016,
2,15108583,M,22,2110,2008 - 2010,
3,16779283,M,22,2110,2008 - 2010,
4,18739815,M,25,2110,2008 - 2010,
5,15466940,M,27,2110,2014 - 2016,
6,18455345,M,27,2110,2008 - 2010,
7,13893519,F,39,2110,2008 - 2010,
8,15456338,M,40,2110,2008 - 2010,
9,11498798,M,43,2110,2014 - 2016,


In [13]:
# 10 oldest patients in the sample, by anchor_age
df.sort_values('anchor_age', ascending=False).head(10)[['subject_id', 'gender', 'anchor_age']]

Unnamed: 0,subject_id,gender,anchor_age
730,17238343,F,91
277,11669818,F,91
215,11977338,F,91
937,17502587,M,91
1373,17713926,F,91
1876,11924161,F,91
960,17317600,M,91
1136,14014101,F,91
1137,18851418,M,91
1752,15949804,M,91


# Use BigQuery through pandas-gbq

The `pandas-gbq` library is a community-led project from the pandas community. It covers basic functionality, such as writing a DataFrame to BigQuery and running a query, but as a third-party library it may not handle all BigQuery features or use cases.

[Pandas GBQ Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_gbq.html)

In [14]:
import pandas as pd

# Run the query through pandas-gbq and load the result into a DataFrame

df = pd.io.gbq.read_gbq('''
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id
''', project_id=project_id, dialect='standard')

df.head()

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10000032,F,52,2180,2014 - 2016,2180-09-09
1,10000048,F,23,2126,2008 - 2010,NaT
2,10000068,F,19,2160,2008 - 2010,NaT
3,10000084,M,72,2160,2017 - 2019,2161-02-13


# Syntax highlighting

`google.colab.syntax` adds syntax highlighting to Python string literals that are later used as queries.

In [15]:
from google.colab import syntax

query = syntax.sql('''
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id
''')

pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')

Unnamed: 0,subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
0,10000032,F,52,2180,2014 - 2016,2180-09-09
1,10000048,F,23,2126,2008 - 2010,NaT
2,10000068,F,19,2160,2008 - 2010,NaT
3,10000084,M,72,2160,2017 - 2019,2161-02-13
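
Outside Colab the `google.colab` package is not available, so the import above fails. A minimal fallback is an identity function. This assumes `syntax.sql` returns its argument unchanged and exists only so the editor can highlight the string. The name `sql` below is chosen for this sketch.

```python
try:
    # Available inside Colab runtimes
    from google.colab import syntax
    sql = syntax.sql
except ImportError:
    # Assumed-equivalent fallback elsewhere: pass the query through untouched
    def sql(query):
        return query

query = sql('''
SELECT *
FROM `physionet-data.mimiciv_hosp.patients`
WHERE subject_id < 10000100
ORDER BY subject_id
''')
```

The resulting `query` string can be passed to `read_gbq` or `client.query` exactly as before.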
