# Example usage

This notebook demonstrates how to use `fasterrisk` to generate sparse risk scoring systems.

## Download and Read Sample Data

### Imports

In [5]:
!pip3 install fasterrisk



In [7]:
from fasterrisk.fasterrisk import RiskScoreOptimizer, RiskScoreClassifier
from fasterrisk.utils import download_file_from_google_drive
import os.path

import numpy as np
import pandas as pd
import time


### Download Sample Data

In [2]:
train_data_file_path = "../tests/adult_train_data.csv"
test_data_file_path = "../tests/adult_test_data.csv"

if not os.path.isfile(train_data_file_path):
    download_file_from_google_drive('1nuWn0QVG8tk3AN4I4f3abWLcFEP3WPec', train_data_file_path)
if not os.path.isfile(test_data_file_path):
    download_file_from_google_drive('1TyBO02LiGfHbatPWU4nzc8AndtIF-7WH', test_data_file_path)



### Read Sample Data

In [3]:
train_df = pd.read_csv(train_data_file_path)
train_data = np.asarray(train_df)
X_train, y_train = train_data[:, 1:], train_data[:, 0]

test_df = pd.read_csv(test_data_file_path)
test_data = np.asarray(test_df)
X_test, y_test = test_data[:, 1:], test_data[:, 0]

## Train Risk Score Models

### Create RiskScoreOptimizer and Perform Optimization

In [4]:
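sparsity = 5      # maximum number of features with nonzero coefficients
parent_size = 10  # number of candidate solutions retained during beam search

RiskScoreOptimizer_m = RiskScoreOptimizer(X=X_train, y=y_train, k=sparsity, parent_size=parent_size)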

In [5]:
start_time = time.time()
RiskScoreOptimizer_m.optimize()
print("Optimization takes {:.2f} seconds.".format(time.time() - start_time))

Optimization takes 14.04 seconds.


## Get Risk Score Models

In [6]:
multipliers, sparseDiversePool_beta0_integer, sparseDiversePool_betas_integer = RiskScoreOptimizer_m.get_models()
print("We generate {} risk score models from the sparse diverse pool".format(len(multipliers)))

(26049, 50)
We generate 50 risk score models from the sparse diverse pool


### Access the First Risk Score Model

In [7]:
model_index = 0 # first model
multiplier = multipliers[model_index]
intercept = sparseDiversePool_beta0_integer[model_index]
coefficients = sparseDiversePool_betas_integer[model_index]

### Use the First Risk Score Model for Prediction

In [8]:
RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)

In [9]:
y_test_pred = RiskScoreClassifier_m.predict(X_test)
print("y_test are predicted to be {}".format(y_test_pred))

y_test are predicted to be [-1 -1 -1 ... -1 -1 -1]
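
As a rough illustration of what a risk score prediction involves, the sketch below scores one sample by adding the points of its active binary features and takes the sign of the shifted total. The point values are copied from the first model card printed in this notebook. The intercept of -3 and the strict `> 0` decision rule are assumptions made for this sketch, not values read from the library.

```python
# Points per feature, copied from the first model card in this notebook.
POINTS = {
    "Age_22_to_29": -2,
    "HSDiploma": -2,
    "NoHS": -4,
    "Married": 4,
    "AnyCapitalGains": 3,
}
ASSUMED_INTERCEPT = -3  # assumption: chosen so that a total score of 3 sits on the boundary


def total_score(active_features):
    """Sum the points of every binary feature that equals 1 for this sample."""
    return sum(POINTS[name] for name in active_features)


def predict_label(active_features):
    """Return +1 if the shifted score is positive, else -1 (assumed tie rule)."""
    return 1 if total_score(active_features) + ASSUMED_INTERCEPT > 0 else -1


married_with_gains = ["Married", "AnyCapitalGains"]
young_no_hs = ["Age_22_to_29", "NoHS"]

print(total_score(married_with_gains), predict_label(married_with_gains))
print(total_score(young_no_hs), predict_label(young_no_hs))
```

The two hypothetical samples land on opposite sides of the boundary, which is why a test set dominated by low-scoring rows yields mostly `-1` predictions.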


In [10]:
y_test_pred_prob = RiskScoreClassifier_m.predict_prob(X_test)
print("The risk probabilities of having y_test to be +1 are {}".format(y_test_pred_prob))

The risk probabilities of having y_test to be +1 are [0.13308868 0.34872682 0.34872682 ... 0.04216029 0.34872682 0.04216029]


### Print the First Model Card

In [11]:
X_featureNames = list(train_df.columns[1:])

RiskScoreClassifier_m.reset_featureNames(X_featureNames)
RiskScoreClassifier_m.print_model_card()

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
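
The SCORE/RISK rows of the card are consistent with a logistic link applied to a shifted, rescaled score. The following sketch reproduces several entries under that reading. The intercept of -3 and the multiplier of 1.6 are assumptions picked to match the printed card to one decimal place; they are not read from the fitted model, whose exact multiplier may differ slightly.

```python
import math

# Assumptions for this sketch: values chosen to match the printed card,
# not taken from the fitted model.
ASSUMED_INTERCEPT = -3
ASSUMED_MULTIPLIER = 1.6


def risk(score):
    """Logistic risk for an integer total score."""
    z = (score + ASSUMED_INTERCEPT) / ASSUMED_MULTIPLIER
    return 1.0 / (1.0 + math.exp(-z))


for score in (-2, 0, 2, 3, 7):
    print(f"SCORE {score:>3} -> RISK {100 * risk(score):5.1f}%")
```

Under these assumed values, a score of 3 gives exactly 50% risk, and the other scores round to the percentages shown in the card.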


### Print the Top 10 Model Cards from the Pool and Their Performance Metrics

In [12]:
num_models = min(10, len(multipliers))

for model_index in range(num_models):
    multiplier = multipliers[model_index]
    intercept = sparseDiversePool_beta0_integer[model_index]
    coefficients = sparseDiversePool_betas_integer[model_index]

    RiskScoreClassifier_m = RiskScoreClassifier(multiplier, intercept, coefficients)
    RiskScoreClassifier_m.reset_featureNames(X_featureNames)
    RiskScoreClassifier_m.print_model_card()

    train_loss = RiskScoreClassifier_m.compute_logisticLoss(X_train, y_train)
    train_acc, train_auc = RiskScoreClassifier_m.get_acc_and_auc(X_train, y_train)
    test_acc, test_auc = RiskScoreClassifier_m.get_acc_and_auc(X_test, y_test)

    print("The logistic loss on the training set is {}".format(train_loss))
    print("The training accuracy and AUC are {:.3f}% and {:.3f}".format(train_acc*100, train_auc))
    print("The test accuracy and AUC are {:.3f}% and {:.3f}\n".format(test_acc*100, test_auc))

The Risk Score is:
1.            Age_22_to_29     -2 point(s) |   ...
2.               HSDiploma     -2 point(s) | + ...
3.                    NoHS     -4 point(s) | + ...
4.                 Married      4 point(s) | + ...
5.         AnyCapitalGains      3 point(s) | + ...
                                     SCORE | =    
SCORE |  -8.0  |  -6.0  |  -5.0  |  -4.0  |  -3.0  |  -2.0  |  -1.0  |
RISK  |   0.1% |   0.4% |   0.7% |   1.2% |   2.3% |   4.2% |   7.6% |
SCORE |   0.0  |   1.0  |   2.0  |   3.0  |   4.0  |   5.0  |   7.0  |
RISK  |  13.3% |  22.3% |  34.9% |  50.0% |  65.1% |  77.7% |  92.4% |
The logistic loss on the training set is 9798.652346518873
The training accuracy and AUC are 82.575% and 0.862
The test accuracy and AUC are 81.787% and 0.856

The Risk Score is:
1.               HSDiploma     -2 point(s) |   ...
2.                    NoHS     -4 point(s) | + ...
3.                 Married      4 point(s) | + ...
4.    WorkHrsPerWeek_lt_40     -2 point(s) | + ...
5.  
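
To make the three reported metrics concrete, here is a minimal standard-library sketch of logistic loss, accuracy, and AUC on toy data. It assumes labels in {-1, +1}, a loss that is summed rather than averaged over samples (an assumption suggested by the magnitude of the value printed above), and a strict `> 0` decision rule. The helper names and toy values are hypothetical and are not part of `fasterrisk`.

```python
import math


def logistic_loss(scores, labels):
    """Sum over samples of log(1 + exp(-y * z)); labels are -1 or +1."""
    return sum(math.log1p(math.exp(-y * z)) for z, y in zip(scores, labels))


def accuracy(scores, labels):
    """Fraction of samples whose predicted sign matches the label."""
    predictions = [1 if z > 0 else -1 for z in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    positives = [z for z, y in zip(scores, labels) if y == 1]
    negatives = [z for z, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scaled scores z = (intercept + score) / multiplier and true labels.
toy_scores = [2.5, 0.625, -0.625, -1.875, 0.0]
toy_labels = [1, 1, -1, -1, 1]

print(f"loss = {logistic_loss(toy_scores, toy_labels):.3f}")
print(f"accuracy = {accuracy(toy_scores, toy_labels):.2f}")
print(f"AUC = {auc(toy_scores, toy_labels):.2f}")
```

Accuracy depends on the decision threshold while AUC depends only on the ranking of the scores, so the two can disagree, as they do for the last toy sample.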

## Additional Tutorial on Binarizing Continuous Features

If your data contains continuous features, we recommend converting them to binary features as a preprocessing step, which makes the final model more interpretable. The public PIMA dataset is used below to show how.

### Download the PIMA Dataset

In [13]:
pima_original_data_file_path = "../tests/pima_original_data.csv"
if not os.path.isfile(pima_original_data_file_path):
    download_file_from_google_drive('184JhmJiSEUiBCo8ySAD8adDn_S9rjmjM', pima_original_data_file_path)

pima_original_data_df = pd.read_csv(pima_original_data_file_path)
X_original_df = pima_original_data_df.drop(columns="Outcome") # drop the Outcome column, which stores the y label for this binary classification problem

X_original_df

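|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 |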


### Convert the Dataframe with Continuous Features to a New Dataframe with Binary Features

In [14]:
from fasterrisk.binarization_util import convert_continuous_df_to_binary_df

X_binarized_df = convert_continuous_df_to_binary_df(X_original_df)
X_binarized_df

Converting continuous features to binary features in the dataframe......
If a feature has more than 100 unique values, we pick the thresholds by selecting 100 quantile points. You can change the number of thresholds by passing another specified number: convert_continuous_df_to_binary_df(df, num_quantiles=50).
Finish converting continuous features to binary features......


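|     | Pregnancies<=0 | Pregnancies<=1 | Pregnancies<=2 | Pregnancies<=3 | Pregnancies<=4 | Pregnancies<=5 | Pregnancies<=6 | Pregnancies<=7 | Pregnancies<=8 | Pregnancies<=9 | ... | Age<=62 | Age<=63 | Age<=64 | Age<=65 | Age<=66 | Age<=67 | Age<=68 | Age<=69 | Age<=70 | Age<=72 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 764 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 765 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 766 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |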


You can then use `X_binarized_df` as your new design matrix and pass it as input to the FasterRisk algorithm!
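
To show what the `feature<=threshold` columns encode, the sketch below binarizes a single numeric column in plain Python. It is a simplified stand-in and not the library's implementation: `binarize_column` is a hypothetical helper, and using every unique value except the largest as a threshold is an assumption that skips the quantile selection applied to features with many unique values.

```python
def binarize_column(name, values):
    """Map one numeric column to {"name<=t": [0/1, ...]} indicator columns."""
    # Assumption: the largest value is dropped because its column would be all ones.
    thresholds = sorted(set(values))[:-1]
    return {
        f"{name}<={t}": [1 if v <= t else 0 for v in values]
        for t in thresholds
    }


# First five Pregnancies values from the PIMA table in this notebook.
pregnancies = [6, 1, 8, 1, 0]
binary_columns = binarize_column("Pregnancies", pregnancies)

for column_name, indicators in binary_columns.items():
    print(column_name, indicators)
```

Each indicator column is monotone in the threshold: once a row satisfies `Pregnancies<=t`, it also satisfies every larger threshold, which matches the runs of ones in the binarized table.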