# Active Learning

In [1]:
import pandas as pd
import numpy as np
import uuid
import random

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, classification_report

## Generate Data

In case this needs to be said: if you're using this approach on a problem of your own, don't bother generating a dataset; use the one you're working with. I'm only generating a dataset for simplicity and reproducibility, and to outline how such a problem could be solved using active learning.

We are going to generate fake vital statistics for people. The function below randomly generates height, weight and age values, along with a gender label that is missing for most rows.

In [2]:
def generate_data(n = 1000):
    '''
    This function will simulate the generation of gender data. It will randomly create
    columns for the height (cm), weight (lbs), age and gender. The gender column
    will not be completely filled out; only a sample of rows will carry a gender label.
    
    params:
        n (Integer) : The number of rows you want to synthesize
        
    returns:
        A dataframe with the columns of uuid, height, weight, age and gender.
            - uuid (UUID4) : A unique identifier to the user
            - height (Integer) : The height of the user in cm
            - weight (Integer) : The weight of the user in pounds (lbs)
            - age (Integer) : The age of the user
            - gender (String) : The gender associated to the user if known
        
    example:
        gender_df = generate_data(n = 1000)
    '''
    # we have more np.nan than Male or Female so that we can skew majority of the
    # data to be missing
    genders = ['Male', 'Female', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
    height_range = (50, 200)
    weight_range = (30, 250)
    age_range = (3, 70)
    
    d = pd.DataFrame(
        {
            'uuid' : [uuid.uuid4() for _ in range(n)],
            'height' : [random.randint(height_range[0], height_range[1]) for _ in range(n)],
            'weight' : [random.randint(weight_range[0], weight_range[1]) for _ in range(n)],
            'age' : [random.randint(age_range[0], age_range[1]) for _ in range(n)],
            'gender' : [random.choice(genders) for _ in range(n)]
        }
    ).drop_duplicates()
    d = d.set_index('uuid')
    return d

In [3]:
gender_df = generate_data(n = 1000)
gender_df.shape

(1000, 4)
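
Note that `generate_data` draws from Python's global `random` state without a seed, so the rows differ on every run. A minimal sketch of one way to make the numeric columns repeatable, assuming a dedicated `random.Random` instance and an arbitrary seed of 42. The helper `generate_rows` is hypothetical and not part of this notebook, and `uuid.uuid4()` is not controlled by this seed.

```python
import random

def generate_rows(n=5, seed=42):
    # Hypothetical helper: a private generator, so the global `random` state is untouched
    rng = random.Random(seed)
    # Mostly missing labels, as in generate_data (None used here in place of np.nan)
    genders = ['Male', 'Female'] + [None] * 6
    return [
        {
            'height': rng.randint(50, 200),
            'weight': rng.randint(30, 250),
            'age': rng.randint(3, 70),
            'gender': rng.choice(genders),
        }
        for _ in range(n)
    ]

# Two calls with the same seed produce identical rows
first = generate_rows(n=5, seed=42)
second = generate_rows(n=5, seed=42)
print(first == second)
```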

In [4]:
gender_df.head()

uuid,height,weight,age,gender
b272e16c-4577-440f-ab47-68a417013f66,160,224,8,Male
48b02f2a-2eef-499f-bb25-40c4a5716db5,190,241,9,Female
528dd869-7579-40d0-a41f-50daf8aad6ed,168,165,34,
db3c577e-eb7d-4403-8d1d-58bffa80c243,50,49,59,
fa92a381-8a95-4fd6-b828-4be9709aab32,92,203,53,Female


Of course, since the data is randomly generated, some labelled rows will look odd, such as a 39-year-old man who weighs 214 pounds and is 78 centimeters tall.

In [5]:
gender_df.describe()

,height,weight,age
count,1000.0,1000.0,1000.0
mean,125.264,140.397,36.456
std,43.863633,62.880262,19.188597
min,50.0,30.0,3.0
25%,87.0,85.0,19.75
50%,128.0,140.5,36.0
75%,163.0,194.0,52.0
max,200.0,250.0,70.0


## Annotate Data

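Here people would usually annotate their datasets with labels. In this example a small portion of the rows already carries a gender label, so we only need to separate the labelled rows from the unlabelled ones.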

In [6]:
gender_df['gender'].value_counts()

Female    138
Male      123
Name: gender, dtype: int64

In [7]:
train_df = gender_df[~gender_df['gender'].isna()].copy()
pred_df = gender_df[gender_df['gender'].isna()].copy()
print(train_df.shape, pred_df.shape)

(261, 4) (739, 4)


## Train Model

In [8]:
# feature, target breakdown
ft_cols = train_df.drop(columns = ['gender']).columns.tolist()
target_col = 'gender'

# train test split
x = train_df[ft_cols].values
y = train_df[target_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x, 
    y,
    test_size = 0.3
)
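
The split above is random and unseeded, so the class balance of the test set and the reported metrics vary between runs. A sketch of a repeatable, class-balanced split, assuming the `stratify` and `random_state` arguments of `train_test_split` and a toy array in place of `train_df` (the seed values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled rows: 100 samples, 3 features, 60/40 class mix
rng = np.random.default_rng(0)
x = rng.integers(0, 200, size=(100, 3))
y = np.array(['Female'] * 60 + ['Male'] * 40)

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.3,
    stratify=y,        # keep the Female/Male ratio the same in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(x_train.shape, x_test.shape)
print(dict(zip(*np.unique(y_test, return_counts=True))))
```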

In [9]:
# GBC classifier
clf = GradientBoostingClassifier()

# train the model
clf.fit(x_train, y_train)

GradientBoostingClassifier()

## Model Accuracy

In [10]:
def clf_eval(clf, x_test, y_test):
    '''
    This function will evaluate a sk-learn multi-class classification model based on its
    x_test and y_test values
    
    params:
        clf (Model) : The model you wish to evaluate the performance of
        x_test (Array) : Result of the train test split
        y_test (Array) : Result of the train test split
    
    returns:
        This function will return the following evaluation metrics:
            - Accuracy Score
            - Matthews Correlation Coefficient
            - Classification Report
            - Confusion Matrix
    
    example:
        clf_eval(
            clf,
            x_test,
            y_test
        )
    '''
    y_pred = clf.predict(x_test)
    y_true = y_test
    
    test_acc = accuracy_score(y_true, y_pred)
    print("Testing Accuracy : ", test_acc)
    
    print("MCC Score : ", matthews_corrcoef(y_true, y_pred))
    
    print("Classification Report : ")
    print(classification_report(y_true, y_pred))
    
    # confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
    print(confusion_matrix(y_true, y_pred))
    
clf_eval(
    clf,
    x_test,
    y_test
)

Testing Accuracy :  0.46835443037974683
MCC Score :  -0.05512488583891162
Classification Report : 
              precision    recall  f1-score   support

      Female       0.54      0.44      0.49        45
        Male       0.40      0.50      0.45        34

    accuracy                           0.47        79
   macro avg       0.47      0.47      0.47        79
weighted avg       0.48      0.47      0.47        79

[[20 25]
 [17 17]]
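
To see where the MCC score comes from, it can be recomputed from the four confusion-matrix counts. A minimal sketch, treating `Female` as the positive class (an arbitrary choice, since binary MCC is symmetric in the two classes) and reading the counts off the results above: 20 of the 45 `Female` rows and 17 of the 34 `Male` rows were predicted correctly.

```python
import math

# Counts taken from the evaluation above, with Female as the positive class
tp = 20  # Female predicted as Female
fn = 25  # Female predicted as Male
fp = 17  # Male predicted as Female
tn = 17  # Male predicted as Male

# Matthews correlation coefficient for the binary case
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 4), round(mcc, 4))  # 0.4684 -0.0551
```

An MCC close to 0 means the predictions are no better than chance. That is what we should expect here, because the gender labels were assigned at random and do not depend on height, weight or age.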


## Predict

In [16]:
# predict all rows in one call instead of calling the model once per row
x_pred = pred_df[ft_cols].values
pred_df['pred_proba'] = [dict(zip(clf.classes_, p)) for p in clf.predict_proba(x_pred)]
pred_df['pred'] = clf.predict(x_pred)

In [17]:
pred_df

uuid,height,weight,age,gender,pred_proba,pred
528dd869-7579-40d0-a41f-50daf8aad6ed,168,165,34,,"{'Female': 0.4206269557792154, 'Male': 0.57937...",Male
db3c577e-eb7d-4403-8d1d-58bffa80c243,50,49,59,,"{'Female': 0.9552707514441507, 'Male': 0.04472...",Female
a93968fe-db43-46c8-9603-2ca581b02d65,137,187,15,,"{'Female': 0.6346480152115866, 'Male': 0.36535...",Female
bec3e5ec-784c-4726-9cda-e39db1901f76,133,141,37,,"{'Female': 0.266693073556865, 'Male': 0.733306...",Male
c657cf01-067a-4e63-901d-47b79b4b7000,121,178,40,,"{'Female': 0.9471261555316062, 'Male': 0.05287...",Female
...,...,...,...,...,...,...
9985c101-420e-4fe4-bbe1-5e2169b5d17e,83,136,33,,"{'Female': 0.4351914446825188, 'Male': 0.56480...",Male
74ebf7de-042b-4239-be9c-48d3add7afaa,109,63,44,,"{'Female': 0.4371760025439071, 'Male': 0.56282...",Male
7564b71c-feb2-4e29-9900-908465616fbd,86,158,9,,"{'Female': 0.7159971112961241, 'Male': 0.28400...",Female
658aa860-905c-4dd9-815b-1f1a798be681,54,59,58,,"{'Female': 0.9476437445404257, 'Male': 0.05235...",Female


## Get Annotated & To Annotate Predictions

Predictions with a low predicted probability will be annotated manually by the user, while predictions with a high predicted probability will be assumed to be correct. This lets us grow the labelled data and retrain a new model using the most confident results of the previous model.

In [18]:
# low-confidence threshold: maximum predicted probability <= 0.6
low_th = 0.6

# high-confidence threshold: maximum predicted probability >= 0.9
high_th = 0.9

def check_preds(pred_dct, low_th = low_th, high_th = high_th):
    '''
    This function will check the dictionary associated to the prediction probabilities
    generated by the model. It will return either `annotate`, `annotated` or `neither`
    as a result.
    
    params:
        pred_dct (Dictionary) : The keys are the classes, values are the predicted proba
        low_th (Float) : The low prediction proba threshold
        high_th (Float) : The high prediction proba threshold
        
    returns:
        This function will return the following:
            `annotate` : if the maximum value in the pred_dct is less than or equal to
                         the low_th
            `annotated` : if the maximum value in the pred_dct is greater than or equal to
                          the high_th
            'neither' : If it does not fall in the other two ranges
            
    example:
        pred_df['annotate_decision'] = pred_df['pred_proba'].apply(check_preds)
    '''
    
    max_val = max(pred_dct.values())
    if max_val <= low_th:
        return 'annotate'
    elif max_val >= high_th:
        return 'annotated'
    else:
        return 'neither'

In [19]:
pred_df['annotate_decision'] = pred_df['pred_proba'].apply(check_preds)

In [21]:
pred_df['annotate_decision'].value_counts()

neither      464
annotate     194
annotated     81
Name: annotate_decision, dtype: int64
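
The same three-way decision can also be made on the whole probability matrix at once. A sketch assuming NumPy's `np.select` and a small hand-written array standing in for the output of `clf.predict_proba` (the values are illustrative, not model output):

```python
import numpy as np

low_th = 0.6
high_th = 0.9

# Stand-in for clf.predict_proba(...): one row per sample, one column per class
proba = np.array([
    [0.42, 0.58],
    [0.96, 0.04],
    [0.63, 0.37],
    [0.27, 0.73],
])

# Confidence of each prediction is the largest class probability in its row
max_proba = proba.max(axis=1)

# The first matching condition wins; rows matching neither get the default
decision = np.select(
    [max_proba <= low_th, max_proba >= high_th],
    ['annotate', 'annotated'],
    default='neither',
)
print(decision.tolist())
```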

## Update Labelled Data
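
One way this step could look, following the plan described earlier: rows marked `annotated` keep the model's prediction as their label, rows marked `annotate` receive a label from a person, and both groups are appended to the training data. The sketch below uses toy frames in place of `train_df` and `pred_df`, and `ask_human` is a hypothetical placeholder for whatever manual annotation process is used.

```python
import pandas as pd

# Toy stand-ins for train_df and pred_df from the notebook
train_df = pd.DataFrame(
    {'height': [160, 92], 'weight': [224, 203], 'age': [8, 53], 'gender': ['Male', 'Female']},
    index=['a', 'b'],
)
pred_df = pd.DataFrame(
    {
        'height': [168, 50, 137],
        'weight': [165, 49, 187],
        'age': [34, 59, 15],
        'pred': ['Male', 'Female', 'Female'],
        'annotate_decision': ['annotate', 'annotated', 'neither'],
    },
    index=['c', 'd', 'e'],
)
ft_cols = ['height', 'weight', 'age']

def ask_human(row):
    # Hypothetical placeholder: a real workflow would collect this label from an annotator
    return 'Female'

# High-confidence rows: trust the model's prediction
confident = pred_df[pred_df['annotate_decision'] == 'annotated'].copy()
confident['gender'] = confident['pred']

# Low-confidence rows: take the label from the manual annotation step
uncertain = pred_df[pred_df['annotate_decision'] == 'annotate'].copy()
uncertain['gender'] = uncertain.apply(ask_human, axis=1)

# Grow the labelled set; everything else stays in the unlabelled pool
new_labels = pd.concat([confident, uncertain])[ft_cols + ['gender']]
train_df = pd.concat([train_df, new_labels])
pred_df = pred_df.drop(index=new_labels.index)

print(train_df.shape, pred_df.shape)
```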

## Retrain Model
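
Retraining then repeats the earlier training step on the enlarged labelled set. A sketch on synthetic data, assuming the newly labelled rows are simply stacked onto the original training arrays (the array names, sizes and seeds here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: the original labelled rows and the newly labelled rows
x_old = rng.integers(0, 250, size=(50, 3))
y_old = np.array(['Female', 'Male'] * 25)
x_new = rng.integers(0, 250, size=(20, 3))
y_new = np.array(['Female', 'Male'] * 10)

# Stack old and new labelled data, then fit a fresh model on all of it
x_all = np.vstack([x_old, x_new])
y_all = np.concatenate([y_old, y_new])

clf = GradientBoostingClassifier(random_state=0)
clf.fit(x_all, y_all)

print(x_all.shape, list(clf.classes_))
```

Labels accepted from the model's own high-confidence predictions can reinforce its mistakes, so it is probably safest to evaluate the retrained model only on rows whose labels were verified by a person.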

## Model Accuracy

## Concluding Remarks

---