# Count Featurization And Models

FEMR contains several utilities to implement common featurization strategies.

[CountFeaturizer](https://github.com/som-shahlab/femr/blob/main/src/femr/featurizers/featurizers.py#L180) is the main class and it documents the various supported options.

To use the featurizers, you must construct a featurizer list, preprocess the featurizers, and then featurize.

In [1]:
import pickle
import femr.featurizers
import femr.labelers


# Load some labels
labels = femr.labelers.load_labeled_patients('input/labels.csv')
    
# Define our featurizer

# Note that we are using both ages and counts here
age = femr.featurizers.AgeFeaturizer(is_normalize=False)
count = femr.featurizers.CountFeaturizer(string_value_combination=True)
featurizer_age_count = femr.featurizers.FeaturizerList([age, count])

# Preprocess the featurizers (this includes steps such as normalizing age)
featurizer_age_count.preprocess_featurizers("input/extract", labels)

# Actually do the featurization
results = featurizer_age_count.featurize("input/extract", labels)

In [2]:
# Results consist of four components: the feature matrix, the patient ids, the label values, and the prediction times

features, patient_ids, label_values, prediction_times = results

print(features[0,:], patient_ids[0], label_values[0], prediction_times[0])

  (0, 0)	20.013699
  (0, 1)	1.0
  (0, 3376)	1.0 3 False 1990-01-07T00:00:00.000000


# Data Splitting

FEMR contains utilities for hash-based patient splitting, where splits are determined by a hash value of the patient id.

This approach is deterministic but approximate: the split sizes only roughly match the requested proportions. In exchange, the assignment is stable and scales well.

database.compute_split(seed, pid) returns a pseudo-random number between 0 and 99 (inclusive) to help construct splits.
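To illustrate the general idea of hash-based splitting, here is a minimal sketch using only the Python standard library. The helper `hash_split` is hypothetical and is not FEMR's implementation; the choice of SHA-256 and the way the seed is combined with the patient id are assumptions made for illustration only.

```python
import hashlib


def hash_split(seed: int, patient_id: int) -> int:
    """Hypothetical helper: map (seed, patient_id) to a stable bucket in [0, 99]."""
    key = f"{seed}:{patient_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce it to 100 buckets.
    return int.from_bytes(digest[:8], "big") % 100


# Synthetic patient ids, used here only to demonstrate the split.
example_pids = list(range(1000))
buckets = [hash_split(97, pid) for pid in example_pids]

# Buckets below 70 go to train, the rest to test (roughly a 70/30 split).
train_ids = [pid for pid, b in zip(example_pids, buckets) if b < 70]
test_ids = [pid for pid, b in zip(example_pids, buckets) if b >= 70]

print(len(train_ids), len(test_ids))
```

Because the bucket depends only on the seed and the patient id, a patient lands in the same split every time, regardless of which other patients are present. The observed train fraction is close to, but generally not exactly, the requested 70%.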


In [3]:
import femr.datasets
import numpy as np

database = femr.datasets.PatientDatabase("input/extract")

percent_train = .70
split_seed = 97

hashed_pids = np.array([database.compute_split(split_seed, pid) for pid in patient_ids])
train_pids_idx = np.where(hashed_pids < (percent_train * 100))[0]
test_pids_idx = np.where(hashed_pids >= (percent_train * 100))[0]

X_train, y_train = (
    features[train_pids_idx],
    label_values[train_pids_idx],
)
X_test, y_test = features[test_pids_idx], label_values[test_pids_idx]

# Building Models

The generated features can then be used to build your standard models. In this case we construct both logistic regression and XGBoost models and evaluate them.

Performance is perfect because our task (predicting gender) is fully determined by the features.

In [4]:
import xgboost as xgb
import sklearn.linear_model
import sklearn.metrics
import sklearn.preprocessing

def run_analysis(title: str, y_train, y_train_proba, y_test, y_test_proba):
    print(f"---- {title} ----")
    print("Train:")
    print_metrics(y_train, y_train_proba)
    print("Test:")
    print_metrics(y_test, y_test_proba)

def print_metrics(y_true, y_proba):
    y_pred = y_proba > 0.5
    auroc = sklearn.metrics.roc_auc_score(y_true, y_proba)
    aps = sklearn.metrics.average_precision_score(y_true, y_proba)
    accuracy = sklearn.metrics.accuracy_score(y_true, y_pred)
    f1 = sklearn.metrics.f1_score(y_true, y_pred)
    print("\tAUROC:", auroc)
    print("\tAPS:", aps)
    print("\tAccuracy:", accuracy)
    print("\tF1 Score:", f1)


# Logistic regression
# MaxAbsScaler is well suited to sparse data: see https://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data
# Fit the scaler on the training set only, then apply it to both splits.
scaler = sklearn.preprocessing.MaxAbsScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = sklearn.linear_model.LogisticRegressionCV(penalty="l2", solver="liblinear").fit(X_train_scaled, y_train)
y_train_proba = model.predict_proba(X_train_scaled)[:, 1]
y_test_proba = model.predict_proba(X_test_scaled)[:, 1]
run_analysis("Logistic Regression", y_train, y_train_proba, y_test, y_test_proba)


# XGBoost
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
y_train_proba = model.predict_proba(X_train)[:, 1]
y_test_proba = model.predict_proba(X_test)[:, 1]
run_analysis("XGBoost", y_train, y_train_proba, y_test, y_test_proba)

---- Logistic Regression ----
Train:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
Test:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
---- XGBoost ----
Train:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
Test:
	AUROC: 1.0
	APS: 1.0
	Accuracy: 1.0
	F1 Score: 1.0
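
To see why a label that is fully determined by the features produces an AUROC of exactly 1.0, it helps to recall that AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes this directly with the standard library. The helper `pairwise_auroc` and the toy scores are hypothetical and are meant only as an illustration, not as a replacement for sklearn.metrics.roc_auc_score.

```python
def pairwise_auroc(y_true, scores):
    """Hypothetical helper: AUROC as the fraction of (positive, negative)
    pairs in which the positive example is scored higher (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [True, False, True, False]

# Every positive is scored above every negative.
separable_scores = [0.9, 0.1, 0.8, 0.2]
# One positive (0.3) is scored below one negative (0.6).
overlapping_scores = [0.9, 0.1, 0.3, 0.6]

print(pairwise_auroc(toy_labels, separable_scores))    # 1.0
print(pairwise_auroc(toy_labels, overlapping_scores))  # 0.75
```

When the model can separate the classes completely, as in the gender task here, every positive outranks every negative and the metric saturates at 1.0 on both train and test.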
