### Feature Extraction

Now that we have defined a cohort, we can extract features from the database for machine learning.

This library can efficiently extract a count-based feature representation from BigQuery. Here, we explore some of those capabilities.

In [9]:
import pandas as pd
import os
import glob
import joblib
from prediction_utils.extraction_utils.featurizer import BigQueryOMOPFeaturizer

In [10]:
# Configuration for the extraction
config_dict = {
    "gcloud_project": "som-nero-phi-nigam-starr",
    "dataset_project": "som-rit-phi-starr-prod",
    "rs_dataset_project": "som-nero-phi-nigam-starr",
    "dataset": "starr_omop_cdm5_deid_1pcent_lite_latest",
    "rs_dataset": "temp_dataset",
    "data_path": "/share/pi/nigam/projects/prediction_utils/scratch/",
    "features_by_analysis_path": "features_by_analysis",
    "merged_name": "merged_features_binary",
    "cohort_name": "vignette_cohort_filtered",
    "row_id_field": "prediction_id",
    "index_date_field": "admit_date",
    "time_bins": [-365, -180, -90, -30, 0],
    "include_all_history": True,
    "overwrite": True,
    "binary": True,
}

With the above configuration, we can now create a featurizer object:

In [11]:
featurizer = BigQueryOMOPFeaturizer(**config_dict)

The featurizer object's primary methods are `featurize` and `merge_features`.

`featurize` performs a series of extractions, each labeled by an analysis ID (passed via `analysis_ids`) and a time bin.

Here is the list of `analysis_ids` that are currently defined:

In [12]:
featurizer.valid_queries

['condition_occurrence',
 'drug_exposure',
 'device_exposure',
 'measurement',
 'procedure_occurrence',
 'note_type',
 'observation',
 'note_nlp',
 'measurement_range',
 'gender',
 'race',
 'ethnicity',
 'age_group',
 'measurement_bin']

Some of these analyses can be binned over time. If `time_bins` is defined, the extraction runs separately over each time bin. If `include_all_history` is True, features are also extracted over the full history as one additional bin.

We can now run the featurizer, selecting only the analyses that we would like.
Let's generate features for `gender`, `age_group`, `drug_exposure`, and `condition_occurrence`.
With the configuration specified above, time-binned analyses such as `drug_exposure` produce five bins: the four intervals defined by `time_bins` plus the full-history bin.

In [13]:
featurizer.featurize(
    analysis_ids=['gender', 'age_group', 'drug_exposure', 'condition_occurrence']
)

Now let's inspect the results. The data was written to `os.path.join(config_dict['data_path'], config_dict['features_by_analysis_path'])`:

In [14]:
os.listdir(os.path.join(config_dict['data_path'], config_dict['features_by_analysis_path']))

['gender', 'age_group', 'drug_exposure', 'condition_occurrence']

In [15]:
print('Contents of gender')
print(os.listdir(
    os.path.join(
        config_dict['data_path'], 
        config_dict['features_by_analysis_path'],
        'gender'
    )
))
print('Contents of drug_exposure')
print(os.listdir(
    os.path.join(
        config_dict['data_path'], 
        config_dict['features_by_analysis_path'],
        'drug_exposure'
    )
))
print('Contents of drug_exposure/bin_-36500_-1')
print(os.listdir(
    os.path.join(
        config_dict['data_path'], 
        config_dict['features_by_analysis_path'],
        'drug_exposure',
        'bin_-36500_-1'
    )
))

Contents of gender
['static']
Contents of drug_exposure
['bin_-36500_-1', 'bin_-90_-31', 'bin_-365_-181', 'bin_-30_-1', 'bin_-180_-91']
Contents of drug_exposure/bin_-36500_-1
['features_42.parquet']
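
The bin directory names in the listing above appear to follow a simple pattern: each consecutive pair in `time_bins` becomes an interval that ends one day before the next bin starts, and the full-history bin starts 36500 days before the index date. The sketch below reproduces those names under that assumption. The helper `bin_directory_names` and its `history_start` parameter are hypothetical, inferred from the listing, and are not part of the library's API.

```python
def bin_directory_names(time_bins, include_all_history=True, history_start=-36500):
    """Reproduce the bin directory names seen in the listing above.

    Assumption: each bin covers [start, next_start - 1] days relative to the
    index date, and the full-history bin begins at `history_start` days.
    """
    names = []
    if include_all_history:
        # Full history: from history_start up to the day before the last edge
        names.append(f"bin_{history_start}_{time_bins[-1] - 1}")
    # One bin per consecutive pair of edges
    for start, next_start in zip(time_bins[:-1], time_bins[1:]):
        names.append(f"bin_{start}_{next_start - 1}")
    return names


names = bin_directory_names([-365, -180, -90, -30, 0])
print(names)
```

With the `time_bins` from the configuration, this yields the same five names as the `drug_exposure` listing, although `os.listdir` returns them in arbitrary order.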


The `merge_features` method can now be used to merge the individual features into a single feature representation. It also constructs a vocabulary that maps each unique feature to a unique numeric identifier in `0..vocab_size-1`, where `vocab_size` is the number of unique features.
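
To make the vocabulary idea concrete, here is a minimal sketch of assigning each unique feature a column index in `0..vocab_size-1`. It illustrates the concept only and is not the library's implementation; the function `build_vocab` and the toy feature strings are assumptions made for this example.

```python
def build_vocab(feature_ids):
    """Map each unique feature id to an integer in 0..vocab_size-1.

    Ids are numbered in order of first appearance (an arbitrary choice
    for this sketch; the library may order features differently).
    """
    unique_in_order = dict.fromkeys(feature_ids)  # de-duplicates, keeps order
    return {feature_id: col_id for col_id, feature_id in enumerate(unique_in_order)}


# Toy feature ids; the strings are illustrative only
toy_feature_ids = ["8532_gender", "age_group_5_age_group", "8507_gender", "8532_gender"]
toy_vocab = build_vocab(toy_feature_ids)
vocab_size = len(toy_vocab)
print(toy_vocab, vocab_size)
```

Each distinct feature receives exactly one identifier, so the identifiers can serve directly as column indices of a feature matrix.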

The merge procedure can write the results as either a SciPy CSR sparse matrix or a Parquet dataset. CSR is recommended for datasets that fit in memory; use Parquet when the data is larger than memory.
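
As a rough illustration of the CSR representation, the sketch below assembles a small sparse matrix from (row, column, value) triples with `scipy.sparse.csr_matrix`. The toy indices and values are made up for the example, and this is not a description of how `merge_features` builds its output internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triples: row index = example, column index = feature, value = count
row_ids = np.array([0, 0, 1, 2])
col_ids = np.array([0, 1, 2, 0])
values = np.array([1, 1, 1, 1])

# Only the nonzero entries are stored; the shape is (n_examples, vocab_size)
toy_features = csr_matrix((values, (row_ids, col_ids)), shape=(3, 3))

print(toy_features.nnz)        # number of stored elements
print(toy_features.toarray())  # dense view, only sensible for small matrices
```

Because only nonzero entries are stored, this format stays compact when most features are absent for most examples.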

We will run the merge procedure to generate the CSR sparse result and skip the Parquet output, since the example data is small.

In [16]:
featurizer.merge_features(
    create_sparse=True,
    create_parquet=False,
    existing_vocab_path=None,
    **config_dict
)

Now let's inspect the results:

In [17]:
print(os.listdir(os.path.join(config_dict['data_path'], config_dict['merged_name'])))
print('Contents of vocab')
print(os.listdir(os.path.join(config_dict['data_path'], config_dict['merged_name'], 'vocab')))
print('Contents of features_sparse')
print(os.listdir(os.path.join(config_dict['data_path'], config_dict['merged_name'], 'features_sparse')))

['vocab', 'features_sparse']
Contents of vocab
['vocab.parquet']
Contents of features_sparse
['features.gz', 'features_row_id_map.parquet']


Now let's load the results to explore the contents:

In [18]:
features = joblib.load(
    os.path.join(config_dict['data_path'], config_dict['merged_name'], 'features_sparse', 'features.gz')
)

row_id_map = pd.read_parquet(
    os.path.join(config_dict['data_path'], config_dict['merged_name'], 'features_sparse', 'features_row_id_map.parquet')
)

vocab = pd.read_parquet(
    os.path.join(config_dict['data_path'], config_dict['merged_name'], 'vocab', 'vocab.parquet')
)


In [19]:
features

<2754x23159 sparse matrix of type '<class 'numpy.int64'>'
	with 155909 stored elements in Compressed Sparse Row format>
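
The summary above shows why a sparse format is a good fit here. A quick calculation using the numbers from that output gives the fraction of cells that are actually stored. The variable names are chosen for this example; on the loaded matrix, the same quantities are available as `features.shape` and `features.nnz`.

```python
# Numbers taken from the sparse matrix summary above
n_rows, n_cols, n_stored = 2754, 23159, 155909

n_cells = n_rows * n_cols     # size of the equivalent dense matrix
density = n_stored / n_cells  # fraction of cells that are stored

print(f"{n_cells:,} cells in total, {density:.2%} stored")
```

Roughly 0.24% of the cells are stored, so the dense equivalent would consist almost entirely of zeros.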

In [20]:
row_id_map.head()

   features_row_id         prediction_id
0                0  -3656122440376071607
1                1  -1038042919890479280
2                2   -735874405745870554
3                3  -5741866840589643302
4                4  -1359111829118906538


In [21]:
vocab.head()

   col_id              feature_id
0       0             8532_gender
1       1             8507_gender
2       2   age_group_5_age_group
3       3  age_group_14_age_group
4       4  age_group_17_age_group
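
The three objects fit together as follows: `row_id_map` links each `prediction_id` to a row of `features`, and `vocab` links each column of `features` to a `feature_id`. The sketch below shows one way to look up the features present for a given prediction, using small stand-in objects with the same column names. The helper `active_features`, the toy matrix, and the toy identifiers `111` and `222` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins that mirror the structure of the loaded objects
toy_features = csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
toy_row_id_map = pd.DataFrame(
    {"features_row_id": [0, 1], "prediction_id": [111, 222]}
)
toy_vocab = pd.DataFrame(
    {"col_id": [0, 1, 2],
     "feature_id": ["8532_gender", "8507_gender", "age_group_5_age_group"]}
)


def active_features(prediction_id, features, row_id_map, vocab):
    """Return the feature ids with nonzero values for one prediction_id."""
    # prediction_id -> row index of the sparse matrix
    matches = row_id_map.loc[row_id_map["prediction_id"] == prediction_id]
    row = int(matches["features_row_id"].iloc[0])
    # Column indices of the stored entries in that row, in ascending order
    col_ids = np.sort(features[row].indices)
    # Column index -> feature id
    return vocab.set_index("col_id").loc[col_ids, "feature_id"].tolist()


result = active_features(111, toy_features, toy_row_id_map, toy_vocab)
print(result)
```

The same function should apply to the real `features`, `row_id_map`, and `vocab` loaded above, provided their columns are named as shown in the outputs.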
