## Example usage ##


### Set the experiment parameters: ###

1. `ds` is the name of the dataset; the corresponding CSV file must be located in the `data/` folder.
2. `c` ($c\in[0,1]$) is the label frequency $P(S=1|Y=1)$.
3. `g` ($g\geq 0$) is a parameter of labelling schemes S1-S3. It describes how far the generated PU data are from the null hypothesis of SCAR: $g=0$ means that the PU data follow the SCAR assumption, whereas $g>0$ means that the PU data follow SAR.
4. `label_scheme` is the labelling scheme. Scheme S0 is SCAR, whereas S1-S3 are the SAR schemes considered in the paper.
5. `stat` is the type of test statistic. Available options are `kl`, `klcov`, `ks` and `nb`.
6. `B` is the number of repetitions used to approximate the distribution of the test statistic under H0. The higher the value of `B`, the better the approximation, but the greater the computational cost.
7. `clf` is the base classifier. Any scikit-learn classifier can be used.
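
To make the roles of `c` and the SCAR assumption concrete, here is a minimal sketch that labels synthetic data under SCAR (the $g=0$ case): every positive example is labelled with the same probability `c`, regardless of its features. The helper `scar_labels` and the synthetic `y_demo` are illustrative assumptions and are not part of this repository; `make_pu_labels` may be implemented differently.

```python
import numpy as np


def scar_labels(y, c, rng):
    """Hypothetical helper: label each positive example with probability c (SCAR)."""
    y = np.asarray(y)
    # One Bernoulli(c) draw per example; negatives are never labelled
    draws = rng.random(y.shape[0]) < c
    return (y == 1) & draws


rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=20000)  # synthetic binary classes
s_demo = scar_labels(y_demo, c=0.3, rng=rng).astype(int)

# Since P(S=1) = c * P(Y=1), the ratio of the two sample means estimates c
c_hat_demo = s_demo.mean() / y_demo.mean()
print(round(c_hat_demo, 2))
```

The same identity, $c = P(S=1)/P(Y=1)$, is what the ratio `hat_s/hat_y` in the test step relies on.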

In [16]:
import numpy as np
import pandas as pd
from labelling import make_pu_labels
from utils import make_binary_class, remove_collinear
from scar import make_scar_test
from sklearn.ensemble import RandomForestClassifier


ds = 'Breast-w'
c = 0.3
g = 5
label_scheme = "S1"
stat = 'ks'
B = 200
clf = RandomForestClassifier()

### Load and prepare data: ###

In [17]:

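# Read the dataset from the data/ folder
df_name = 'data/' + ds + '.csv'
df = pd.read_csv(df_name, sep=',')
# Drop the 'BinClass' column if it is present
if 'BinClass' in df.columns:
    del df['BinClass']
df = df.to_numpy()
# The last column is the class variable; the remaining columns are features
p = df.shape[1] - 1
y = df[:, p]
y = make_binary_class(y)
X = df[:, 0:p]
# Keep only the columns selected by remove_collinear (threshold 0.95)
selected_columns = remove_collinear(X, 0.95)
X = X[:, selected_columns]
# Fraction of positive examples, used as an estimate of P(Y=1)
hat_y = np.mean(y)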
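
For readers wondering what a collinearity filter of this kind might look like, the sketch below greedily keeps columns whose absolute Pearson correlation with every previously kept column does not exceed a threshold. It is a hypothetical stand-in written for illustration: the name `drop_correlated_columns`, the greedy rule, and the toy data are assumptions, and the repository's `remove_collinear` may work differently. The sketch also assumes that no column is constant.

```python
import numpy as np


def drop_correlated_columns(X, threshold):
    """Hypothetical stand-in: return indices of columns to keep."""
    X = np.asarray(X, dtype=float)
    # Absolute Pearson correlation between columns (columns are variables)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with any kept column
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return np.array(kept)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# The third column is almost a copy of the first one
X_demo = np.column_stack([a, b, a + 0.01 * rng.normal(size=500)])
kept_demo = drop_correlated_columns(X_demo, 0.95)
print(kept_demo)
```

The returned index array can be used in the same way as `selected_columns`, that is, `X_demo[:, kept_demo]`.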

### Create PU dataset: ###

In [18]:
s, ex_true, a = make_pu_labels(X,y,label_scheme=label_scheme,c=c,g=g)

### Run SCAR test: ###

In [19]:
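# Estimate the label frequency: c = P(S=1)/P(Y=1)
hat_s = np.mean(s)
hat_c = hat_s/hat_y
# Run the test at significance level alpha = 0.05
reject, pv, Tstat, Tstat0 = make_scar_test(X,s,hat_c,clf,B=B,alpha=0.05,stat=stat)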
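
The outputs of `make_scar_test` can be read as a Monte Carlo test: `Tstat` holds the statistics generated under H0 and `Tstat0` is the observed one. The sketch below shows one common way of turning such values into a p-value and a decision, assuming that large values of the statistic speak against H0. The helper `monte_carlo_decision`, the add-one p-value convention, and the toy numbers are assumptions made for illustration; `make_scar_test` may use a different convention.

```python
import numpy as np


def monte_carlo_decision(t_null, t_obs, alpha=0.05):
    """Hypothetical helper: right-tailed Monte Carlo test."""
    t_null = np.asarray(t_null, dtype=float)
    # Add-one convention keeps the p-value strictly positive
    p_value = (1 + np.sum(t_null >= t_obs)) / (1 + t_null.size)
    # Reject H0 when the p-value does not exceed alpha
    reject = int(p_value <= alpha)
    return reject, p_value


rng = np.random.default_rng(0)
# Toy null statistics, not output of the actual test
t_null_demo = rng.uniform(0.3, 0.8, size=200)
reject_demo, p_demo = monte_carlo_decision(t_null_demo, t_obs=0.82)
print(reject_demo, p_demo)
```

With `B` generated statistics, the smallest p-value this convention can produce is $1/(B+1)$, which is one reason why a larger `B` gives a finer-grained test.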

### Analyze the results ###

In [20]:
print("H0 rejected (yes=1, no = 0): ", reject, '\n')
print("p-value: ", pv, '\n')
print("Observed test statistic: ", Tstat0, '\n')
print("95% quantile of the generated null distribution of the test statistic:", np.quantile(Tstat, 0.95), '\n')
print("Generated null distribution of the test statistic: ", Tstat, '\n')

H0 rejected (yes=1, no = 0):  1 

p-value:  0.005 

Observed test statistic:  0.8208852005532504 

95% quantile of the generated null distribution of the test statistic: 0.6961032494476473 

Generated null distribution of the test statistic:  [0.62352557 0.6846473  0.37561298 0.48106518 0.62245851 0.44628306
 0.78843234 0.54309694 0.47656694 0.58290877 0.57963219 0.60596281
 0.54646803 0.43171374 0.42279391 0.47360907 0.48547718 0.45298149
 0.57247265 0.48557042 0.57730946 0.39579063 0.63007511 0.51175657
 0.7627683  0.44892536 0.68532854 0.46125639 0.46013098 0.56093033
 0.57969841 0.53840704 0.65343047 0.60611274 0.51789114 0.52061558
 0.39992937 0.52192661 0.58636633 0.41554633 0.40083987 0.53457087
 0.59421087 0.48158081 0.47950207 0.5119586  0.43672199 0.54070876
 0.54199456 0.36209941 0.5947284  0.54717447 0.46223123 0.43623456
 0.54086438 0.59939207 0.54414745 0.58563854 0.53052756 0.54209313
 0.47699802 0.71509403 0.54220598 0.55681818 0.5560166  0.62201146
 0.41200879 0.582257