In [1]:
import pandas as pd
import numpy as np

#### Here we show how to prepare a UCI dataset
To this end, we show an example dataset -- [`audiology`](https://archive.ics.uci.edu/ml/datasets/Audiology+%28Standardized%29)

First, let us read the data. We combine the train and test files into a single dataset because we later re-split it at random, repeatedly.

In [2]:
df = pd.read_csv("./audiology.data.csv", header = None)
df2 = pd.read_csv("./audiology.test.csv", header = None)
df = pd.concat([df, df2], axis = 0, ignore_index = True) # stack train and test, relabel rows 0..n-1
del df2
print("Dataset shape:", df.shape)

Dataset shape: (226, 71)
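
One detail worth checking when stacking the two files: `pd.concat` keeps each frame's original row labels unless `ignore_index=True` is passed, so the combined frame can carry duplicate labels. A minimal sketch on made-up toy frames (the names `train` and `test` and their values are illustrative, not the audiology data):

```python
import pandas as pd

# Toy stand-ins for the train and test files (values are made up).
train = pd.DataFrame({0: ["f", "t"], 1: ["mild", "normal"]})
test = pd.DataFrame({0: ["t"], 1: ["severe"]})

# Default behaviour: row labels 0, 1 from `train` and 0 from `test` are kept.
stacked = pd.concat([train, test], axis=0)
print(stacked.index.tolist())   # [0, 1, 0]

# With ignore_index=True the rows are relabelled 0..n-1.
combined = pd.concat([train, test], axis=0, ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```

Unique row labels matter mostly for later column-wise concatenation, which aligns frames on their index.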


Have a look at the dataset.

In [3]:
df.head(3)

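|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | f | mild | f | normal | normal | ? | t | ? | f | f | ... | f | f | normal | t | a | f | f | f | p1 | cochlear_unknown |
| 1 | f | moderate | f | normal | normal | ? | t | ? | f | f | ... | f | f | normal | t | a | f | f | f | p2 | cochlear_unknown |
| 2 | t | mild | t | ? | absent | mild | t | ? | f | f | ... | f | f | normal | t | as | f | f | f | p3 | mixed_cochlear_age_fixation |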


Manual step: column 69 is, as explained on the UCI webpage, a unique row identifier. It carries no predictive information, so we drop it.

In [4]:
df = df.drop(columns = [69])
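
Before dropping an identifier column, it can be reassuring to look at how many distinct values it holds. The sketch below uses a made-up toy frame and a hypothetical helper `drop_identifier`; it is not part of the original pipeline, and whether the real identifiers stay unique once both files are combined is not verified here.

```python
import pandas as pd

def drop_identifier(df, col):
    """Drop `col` and report how many distinct values it had (hypothetical helper)."""
    n_unique = df[col].nunique()
    return df.drop(columns=[col]), n_unique

# Toy frame: column 1 plays the role of the row identifier.
toy = pd.DataFrame({0: ["f", "t", "f"], 1: ["p1", "p2", "p3"], 2: ["a", "b", "a"]})
toy_clean, n_ids = drop_identifier(toy, 1)

print(n_ids == len(toy))           # True: one distinct identifier per row
print(toy_clean.columns.tolist())  # [0, 2]
```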

Rename columns as 1,2, ...

In [5]:
df.columns = np.arange(1, df.shape[1] + 1, 1) # columns from 1 to 70 (the last one is the target)

Check whether the target (the last column) is multi-class.

In [6]:
target_loc = df.shape[1]
df[target_loc].value_counts(normalize = True)

cochlear_age                        0.252212
cochlear_unknown                    0.212389
cochlear_age_and_noise              0.097345
normal_ear                          0.097345
cochlear_poss_noise                 0.088496
mixed_cochlear_unk_fixation         0.039823
possible_menieres                   0.035398
conductive_fixation                 0.026549
possible_brainstem_disorder         0.017699
mixed_cochlear_age_otitis_media     0.017699
otitis_media                        0.017699
mixed_cochlear_unk_ser_om           0.013274
mixed_cochlear_unk_discontinuity    0.008850
mixed_cochlear_age_s_om             0.008850
conductive_discontinuity            0.008850
mixed_poss_noise_om                 0.008850
cochlear_noise_and_heredity         0.008850
retrocochlear_unknown               0.008850
mixed_cochlear_age_fixation         0.008850
bells_palsy                         0.004425
poss_central                        0.004425
mixed_poss_central_om               0.004425
acoustic_n

Binarize the target: majority class (1) versus all other classes (0).

In [7]:
majority_class = df[target_loc].value_counts(normalize = True).head(1).index[0]
df[target_loc] = (df[target_loc] == majority_class).astype(int).astype(str)
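
As a small illustration of the majority-versus-rest encoding, here is a sketch on made-up labels (the series `labels` is hypothetical). It uses `idxmax()` on the counts, which is an equivalent way to pick the most frequent label; ties would be broken by whichever label comes first in the counts.

```python
import pandas as pd

# Made-up labels; "a" is the most frequent class.
labels = pd.Series(["a", "b", "a", "c", "a", "b"])

# idxmax() on the counts returns the most frequent label.
majority = labels.value_counts().idxmax()

# 1 for the majority class, 0 for everything else, stored as strings
# to mirror the cell above.
binary = (labels == majority).astype(int).astype(str)
print(majority)         # a
print(binary.tolist())  # ['1', '0', '1', '0', '1', '0']
```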

Now, notice that some features have only two possible values, so we do not need to one-hot encode them: a single 0/1 column suffices. For the others, we introduce binary dummy variables (we do **not** drop the last dummy here; this is dealt with in the Julia code). Let us keep a list of the feature groups that result from the one-hot encoding.

In [8]:
running_count = 0
groups_string = ""
for i in df.nunique(axis = 0):
    if i >= 3:
        groups_string = groups_string + str(running_count + 1) +"-"+str(running_count + i) +  ","
        running_count = (running_count + i)
    else:
        groups_string = groups_string + str(running_count + 1)+"-"+str(running_count + 1) + ","
        running_count = (running_count + 1)
print("The features (and the label at the end) are encoded by the following variables\n",  groups_string[:-1])

The features (and the label at the end) are encoded by the following variables
 1-1,2-6,7-7,8-11,12-15,16-20,21-21,22-24,25-25,26-26,27-27,28-28,29-29,30-30,31-31,32-32,33-33,34-34,35-35,36-36,37-37,38-38,39-39,40-40,41-41,42-42,43-43,44-44,45-45,46-46,47-47,48-48,49-49,50-50,51-51,52-52,53-53,54-54,55-55,56-56,57-57,58-58,59-59,60-60,61-61,62-62,63-63,64-64,65-65,66-66,67-67,68-68,69-69,70-70,71-71,72-72,73-73,74-74,75-78,79-82,83-83,84-84,85-85,86-92,93-93,94-98,99-99,100-100,101-101,102-102
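
To make the range bookkeeping concrete, the same logic can be run on a tiny made-up frame. The function name `group_ranges` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def group_ranges(df):
    """Return the 'start-end' ranges of the encoded columns, one per original column."""
    ranges, running = [], 0
    for n_unique in df.nunique(axis=0):
        # Columns with fewer than 3 values occupy a single encoded column.
        width = int(n_unique) if n_unique >= 3 else 1
        ranges.append(f"{running + 1}-{running + width}")
        running += width
    return ",".join(ranges)

toy = pd.DataFrame({
    1: ["t", "f", "t", "f"],                  # 2 values -> 1 column
    2: ["mild", "normal", "severe", "mild"],  # 3 values -> 3 columns
    3: ["0", "1", "1", "0"],                  # 2 values -> 1 column
})
print(group_ranges(toy))  # 1-1,2-4,5-5
```

Each range tells which encoded columns belong to one original column, so the last range always points at the (binarized) label.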


Save this string as a "key" file, since it will be used when referring to the original features. The file is named `audiologyBIN-cooked-key`, for the following reasons:
- `audiology`: name of the UCI dataset
- `BIN`: only added if the dataset has been binarized, that is, its target was originally multi-class
- `-cooked`: means we processed the data after downloading it from the UCI repository
- `-key`: used for the files that specify the map from original features to one-hot dummies.

In [9]:
# the context manager closes the file even if the write fails
with open("audiologyBIN-cooked-key.csv", "w") as text_file:
    text_file.write(groups_string[:-1])
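
For reference, a consumer of the key file could turn the string back into integer ranges along the following lines. This is only a sketch: the function `parse_key` is hypothetical, and the actual Julia code that reads the key is not shown here.

```python
def parse_key(key):
    """Turn '1-1,2-6,...' into a list of (start, end) integer pairs."""
    groups = []
    for chunk in key.split(","):
        start, end = chunk.split("-")
        groups.append((int(start), int(end)))
    return groups

key = "1-1,2-6,7-7,8-11"  # first four groups of the key printed above
groups = parse_key(key)
print(groups)  # [(1, 1), (2, 6), (7, 7), (8, 11)]

# Widths recover the number of encoded columns per original feature.
print([end - start + 1 for start, end in groups])  # [1, 5, 1, 4]
```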

Now we one-hot encode the dataset.

In [10]:
df_new = pd.DataFrame() #empty dataframe
for col in range(1,len(df.columns) + 1): #iterate over every column
    if df[col].nunique() >= 3: #if there are more than 2 unique values
        df_new = pd.concat([df_new, pd.get_dummies(data = df[col])], axis = 1) #standard one-hot encoding
    else: #means the original feature has 2 unique values, then we keep it as it is, but convert it to 0/1 structure
        df_new = pd.concat([df_new, pd.get_dummies(data = df[col], drop_first = True)], axis = 1)
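
The encoding rule can be checked on a small made-up frame. The sketch below is a variant, not the cell above: it collects the pieces in a list and concatenates once, and the toy data and variable names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    1: ["t", "f", "t", "f"],
    2: ["mild", "normal", "severe", "mild"],
    3: ["0", "1", "1", "0"],
})

pieces = []
for col in toy.columns:
    binary = toy[col].nunique() < 3
    # drop_first=True keeps a single 0/1 column for two-valued features.
    pieces.append(pd.get_dummies(toy[col], drop_first=binary))

encoded = pd.concat(pieces, axis=1).astype(int)
encoded.columns = range(1, encoded.shape[1] + 1)
print(encoded.shape)             # (4, 5)
print(encoded.iloc[0].tolist())  # [1, 1, 0, 0, 0]
```

The five resulting columns correspond to one dummy for the first feature, three for the second, and one for the third.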

Rename the columns, convert the entries to integers, and replace 0s with -1s to match the $\pm 1$ encoding used in our paper.

In [11]:
df_new.columns = np.arange(1, df_new.shape[1] + 1, 1)
df_new = df_new.astype("int") # astype returns a copy, so assign it back
df_new = df_new.replace([0], -1)
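
On a 0/1 integer frame, the replacement of 0 by -1 is the affine map $x \mapsto 2x - 1$. A minimal sketch on a made-up frame (the names `encoded` and `signed` are illustrative):

```python
import pandas as pd

encoded = pd.DataFrame({1: [1, 0, 1], 2: [0, 0, 1]})

# Affine map: 0 -> -1 and 1 -> +1.
signed = 2 * encoded - 1
print(signed.values.tolist())  # [[1, -1], [-1, -1], [1, 1]]

# Element-wise comparison with the replace-based version.
same = (signed.values == encoded.replace(0, -1).values).all()
print(bool(same))  # True
```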

Check if everything went as planned.

In [12]:
np.sum(np.sum(np.abs(df_new) == 1)) == df_new.shape[0] * df_new.shape[1] #all ones?

True
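
An equivalent and arguably more direct check is to test membership in $\{-1, +1\}$. The helper `is_pm_one` below is hypothetical and shown on made-up frames:

```python
import pandas as pd

def is_pm_one(df):
    """True if every entry of `df` is either -1 or +1."""
    return bool(df.isin([-1, 1]).all().all())

good = pd.DataFrame({1: [1, -1], 2: [-1, -1]})
bad = pd.DataFrame({1: [1, 0], 2: [-1, -1]})
print(is_pm_one(good))  # True
print(is_pm_one(bad))   # False
```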

Now we can save it.

In [13]:
df_new.to_csv('./audiologyBIN-cooked.csv', header=False, index = False) # BIN: the target was binarized