# Active Learning

In [1]:
import pandas as pd
import numpy as np
import uuid
import random

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef, confusion_matrix, classification_report

## Problem Statement

The business problem we are going to solve is one of demographic modelling. Given some vital statistics for the users on our platform, we want to identify each user's gender. We have a large amount of data, but little to none of it is labelled. The goal is to build a model that identifies the gender of a user.

## Generate Data

In case this needs to be said, if you're using this approach to solve a problem relevant to you, don't bother generating a dataset; use the one you're working with. I'm only generating a dataset for simplicity and to outline a use case showing how such a problem could be solved using active learning.

We are going to generate fake vital statistics for humans. The function we're building will randomly generate fake height, weight and age data. The DataFrame returned by the function outlined below has the user's uuid as its index and the remaining fields as columns:
- uuid (UUID4) : A unique identifier to the user  
- height (Integer) : The height of the user in cm  
- weight (Integer) : The weight of the user in pounds (lbs)  
- age (Integer) : The age of the user  
- gender (String) : The gender associated to the user if known  

In [2]:
def generate_data(n = 1000):
    '''
    This function will simulate the generation of gender data. It will randomly create
    columns associated to the height (cm), weight, age and gender. The gender column
    will not be completely filled out, it will have a few sample rows labelled to a gender.
    
    params:
        n (Integer) : The number of rows you want to synthesize
        
    returns:
        A dataframe with the columns of uuid, height, weight, age and gender.
            - uuid (UUID4) : A unique identifier to the user
            - height (Integer) : The height of the user in cm
            - weight (Integer) : The weight of the user in pounds (lbs)
            - age (Integer) : The age of the user
            - gender (String) : The gender associated to the user if known
        
    example:
        gender_df = generate_data(n = 1000)
    '''
    # we have more np.nan than Male or Female so that we can skew majority of the
    # data to be missing
    genders = ['Male', 'Female', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
    height_range = (50, 200)
    weight_range = (30, 250)
    age_range = (3, 70)
    
    d = pd.DataFrame(
        {
            'uuid' : [uuid.uuid4() for _ in range(n)],
            'height' : [random.randint(height_range[0], height_range[1]) for _ in range(n)],
            'weight' : [random.randint(weight_range[0], weight_range[1]) for _ in range(n)],
            'age' : [random.randint(age_range[0], age_range[1]) for _ in range(n)],
            'gender' : [random.choice(genders) for _ in range(n)]
        }
    ).drop_duplicates()
    d = d.set_index('uuid')
    return d

In [3]:
gender_df = generate_data(n = 1000)
gender_df.shape

(1000, 4)
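
The generator above draws from Python's `random` module without a seed, so each run produces a different dataset. If you want repeatable draws, one option is a seeded `random.Random` instance. The sketch below is a minimal illustration under that assumption: `draw_heights` is a hypothetical helper that mirrors only the `height` column, not part of the notebook. Note that `uuid.uuid4()` is not controlled by this seed, so the identifiers would still differ between runs.

```python
import random


def draw_heights(n, seed=None):
    # Hypothetical helper: mirrors how the 'height' column is drawn above,
    # but from a dedicated generator that can be seeded.
    rng = random.Random(seed)
    return [rng.randint(50, 200) for _ in range(n)]


first = draw_heights(5, seed=42)
second = draw_heights(5, seed=42)

# The same seed yields the same sequence of draws.
print(first == second)  # True
```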

In [4]:
gender_df.head()

| uuid | height | weight | age | gender |
|---|---|---|---|---|
| e64f4c9c-1132-4a95-b0a9-6ae6d54bd43c | 144 | 217 | 42 | NaN |
| 667294ec-c6c6-47b8-86cd-b0987d23f19a | 93 | 225 | 52 | NaN |
| 77a163ae-05a3-4a87-96d9-b68103af541d | 106 | 206 | 39 | Female |
| 7bdafd42-cfd5-47d1-a881-91c9f0ab6c18 | 142 | 161 | 58 | NaN |
| c348e0fc-64ff-43ba-b50d-632585073e28 | 68 | 245 | 68 | NaN |


Of course, since we randomly generated the data, we will have weird labelled values, like the 39 year old woman above who weighs 206 pounds and is 106 centimeters tall.

In [5]:
gender_df.describe()

| | height | weight | age |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 125.906 | 140.664 | 36.735 |
| std | 43.570007 | 64.555329 | 19.603037 |
| min | 50.0 | 30.0 | 3.0 |
| 25% | 88.0 | 84.0 | 19.0 |
| 50% | 126.0 | 139.0 | 38.0 |
| 75% | 163.0 | 199.0 | 53.0 |
| max | 200.0 | 250.0 | 70.0 |


## Annotate Data

Here you would normally annotate the dataset by hand. For the sake of speed and simplicity, I'm going to use the randomly assigned gender values instead. There are various sampling strategies (like purposeful sampling and convenience sampling) you can look into to decide which data points to annotate. Purposeful sampling means deliberately selecting the data points expected to be most informative to the model. Convenience sampling means selecting the data points that are easiest to label. Be aware that both of these methods introduce a bias into the labelling process and could improve or hurt model performance compared to random sampling; which one depends heavily on the problem you're trying to solve.
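
For comparison, the random sampling baseline mentioned above can be sketched with `DataFrame.sample`. This is only an illustration: `pool` is a made-up stand-in for an unlabelled dataset, and the batch size of 10 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical unlabelled pool; in practice this would be your own data.
pool = pd.DataFrame({
    'height': range(100, 200),
    'weight': range(50, 150),
})

# Pick a random batch of rows to send for manual annotation.
to_annotate = pool.sample(n=10, random_state=0)

# Everything that was not picked stays in the unlabelled pool.
remaining = pool.drop(to_annotate.index)

print(to_annotate.shape, remaining.shape)  # (10, 2) (90, 2)
```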

In [6]:
gender_df['gender'].value_counts()

Female    129
Male      122
Name: gender, dtype: int64

In [7]:
train_df = gender_df[~gender_df['gender'].isna()].copy()
pred_df = gender_df[gender_df['gender'].isna()].copy()
print(train_df.shape, pred_df.shape)

(251, 4) (749, 4)


## Train Model

I'm going to use a Gradient Boosting model to train the binary classifier. In a real world situation you should try many different classifiers (like SVC, Logistic Regression, KNN, XGBoost, etc.) and evaluate the performance of each. You should also tune the hyperparameters of these models to optimize their performance.
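
As a rough sketch of what trying several classifiers could look like, the snippet below compares three scikit-learn models with 5-fold cross-validation. The data is a synthetic stand-in generated with NumPy (an assumption made so the snippet runs on its own), and the choice of candidates, `cv=5` and the accuracy metric are illustrative rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the labelled features and targets.
rng = np.random.default_rng(0)
x = rng.integers(30, 250, size=(250, 3))
y = rng.choice(['Male', 'Female'], size=250)

# Candidate models to compare; the names are arbitrary labels.
candidates = {
    'gbc': GradientBoostingClassifier(random_state=0),
    'logreg': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(n_neighbors=5),
}

# Mean cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, x, y, cv=5, scoring='accuracy').mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

Since the labels here are random, all three scores should hover around chance level; on real data the ranking is what you would act on.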

In [8]:
# feature, target breakdown
ft_cols = train_df.drop(columns = ['gender']).columns.tolist()
target_col = 'gender'

# train test split
x = train_df[ft_cols].values
y = train_df[target_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x, 
    y,
    test_size = 0.3
)

In [9]:
# GBC classifier
clf = GradientBoostingClassifier()

# train the model
clf.fit(x_train, y_train)

GradientBoostingClassifier()

## Model Accuracy

In [10]:
def clf_eval(clf, x_test, y_test):
    '''
    This function will evaluate a sk-learn classification model based on its
    x_test and y_test values
    
    params:
        clf (Model) : The model you wish to evaluate the performance of
        x_test (Array) : Result of the train test split
        y_test (Array) : Result of the train test split
    
    returns:
        None. This function prints the following evaluation metrics:
            - Accuracy Score
            - Matthews Correlation Coefficient
            - Classification Report
    
    example:
        clf_eval(
            clf,
            x_test,
            y_test
        )
    '''
    # predict once and reuse the predictions for every metric
    y_pred = clf.predict(x_test)
    
    print("Testing Accuracy : ", accuracy_score(y_test, y_pred))
    
    print("MCC Score : ", matthews_corrcoef(y_test, y_pred))
    
    print("Classification Report : ")
    print(classification_report(y_test, y_pred))
    
    
clf_eval(
    clf,
    x_test,
    y_test
)

Testing Accuracy :  0.5526315789473685
MCC Score :  0.10277777777777777
Classification Report : 
              precision    recall  f1-score   support

      Female       0.57      0.57      0.57        40
        Male       0.53      0.53      0.53        36

    accuracy                           0.55        76
   macro avg       0.55      0.55      0.55        76
weighted avg       0.55      0.55      0.55        76
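
The accuracy and MCC reported above can be reproduced by hand. The counts in the sketch below are not notebook output; they are inferred from the classification report (23 of the 40 Female rows and 19 of the 36 Male rows classified correctly), so treat them as a reconstruction. Female is treated as the positive class purely for naming purposes.

```python
import math

# Counts reconstructed from the classification report (an assumption).
tp, fn = 23, 17  # Female rows predicted as Female / as Male
tn, fp = 19, 17  # Male rows predicted as Male / as Female

# Accuracy: share of all rows classified correctly.
accuracy = (tp + tn) / (tp + fn + tn + fp)

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(round(accuracy, 4), round(mcc, 4))  # 0.5526 0.1028
```

An MCC of 0 corresponds to chance-level predictions and 1 to perfect agreement, so a value near 0.10 says the model is barely better than guessing.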



## Predict

In [11]:
# score every unlabelled row in a single call rather than once per row
x_pred = pred_df[ft_cols].values
pred_df['pred_proba'] = [dict(zip(clf.classes_, p)) for p in clf.predict_proba(x_pred)]
pred_df['pred'] = clf.predict(x_pred)

## Get Annotated & To Annotate Predictions

Predictions whose highest class probability is low will be annotated manually by the user, while predictions whose highest class probability is high will be assumed to be correct and accepted as labels. This allows us to grow our labelled data and retrain a new model on the human labels plus the most confident predictions of the previous model.

In [12]:
# predictions with a max probability <= 0.6 are sent for manual annotation
low_th = 0.6

# predictions with a max probability >= 0.9 are accepted as labels
high_th = 0.9

def check_preds(pred_dct, low_th = low_th, high_th = high_th):
    '''
    This function will check the dictionary associated to the prediction probabilities
    generated by the model. It will return either `annotate`, `annotated` or `neither`
    as a result.
    
    params:
        pred_dct (Dictionary) : The keys are the classes, values are the predicted proba
        low_th (Float) : The low prediction proba threshold
        high_th (Float) : The high prediction proba threshold
        
    returns:
        This function will return the following:
            `annotate` : if the maximum value in the pred_dct is less than or equal to
                         the low_th
            `annotated` : if the maximum value in the pred_dct is greater than or equal to
                          the high_th
            'neither' : If it does not fall in the other two ranges
            
    example:
        pred_df['annotate_decision'] = pred_df['pred_proba'].apply(check_preds)
    '''
    
    max_val = max(list(pred_dct.values()))
    if max_val <= low_th:
        return 'annotate'
    elif max_val >= high_th:
        return 'annotated'
    else:
        return 'neither'

In [13]:
pred_df['annotate_decision'] = pred_df['pred_proba'].apply(check_preds)

In [14]:
pred_df['annotate_decision'].value_counts()

neither      481
annotate     210
annotated     58
Name: annotate_decision, dtype: int64
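
The same thresholding can also be applied directly to the probability array instead of a column of dictionaries. The sketch below is one way to do that with NumPy; `decide` is a hypothetical name, and the small `proba` array is made-up input standing in for the output of `predict_proba`.

```python
import numpy as np


def decide(proba, low_th=0.6, high_th=0.9):
    # Hypothetical vectorized counterpart of check_preds: takes an array of
    # shape (n_rows, n_classes) and returns one decision per row.
    max_proba = np.asarray(proba).max(axis=1)
    return np.where(
        max_proba <= low_th,
        'annotate',
        np.where(max_proba >= high_th, 'annotated', 'neither'),
    )


# Made-up probabilities for three rows and two classes.
proba = np.array([
    [0.55, 0.45],
    [0.95, 0.05],
    [0.25, 0.75],
])

decisions = decide(proba)
print(decisions.tolist())  # ['annotate', 'annotated', 'neither']
```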

In [15]:
annotated_df = pred_df[pred_df['annotate_decision'] == 'annotated'].copy()
annotate_df = pred_df[pred_df['annotate_decision'] == 'annotate'].copy()

I'm just going to randomly assign a gender for simplicity (since this is randomly generated data), but in reality you should label these rows manually.

In [16]:
gender_choices = ['Male', 'Female', np.nan, np.nan, np.nan]
annotate_df['gender'] = [random.choice(gender_choices) for _ in range(annotate_df.shape[0])]
annotate_df = annotate_df[~annotate_df['gender'].isna()].copy()

## Update Labelled Data

In [17]:
annotated_df['gender'] = annotated_df['pred']

In [18]:
annotated_df = pd.concat([annotated_df, annotate_df, train_df])

In [19]:
annotated_df.shape

(396, 7)

## Retrain Model

In [20]:
# feature, target breakdown
ft_cols = ['height', 'weight', 'age']
target_col = 'gender'

# train test split
x = annotated_df[ft_cols].values
y = annotated_df[target_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x, 
    y,
    test_size = 0.3
)

In [21]:
# GBC classifier
clf = GradientBoostingClassifier()

# train the model
clf.fit(x_train, y_train)

GradientBoostingClassifier()

## Model Accuracy

In [22]:
clf_eval(
    clf,
    x_test,
    y_test
)

Testing Accuracy :  0.5294117647058824
MCC Score :  0.06628740703572837
Classification Report : 
              precision    recall  f1-score   support

      Female       0.58      0.49      0.53        65
        Male       0.48      0.57      0.53        54

    accuracy                           0.53       119
   macro avg       0.53      0.53      0.53       119
weighted avg       0.54      0.53      0.53       119
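
The two evaluations above use different test sets, and the second one contains labels predicted by the first model. One possible way to make the rounds comparable, which the notebook does not do, is to hold out a test set once from the human-labelled rows and score both models on it. The sketch below illustrates that idea on synthetic NumPy data; the array names, the 0.9 threshold and the fixed `random_state` values are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the labelled rows and the unlabelled pool.
x_labelled = rng.integers(30, 250, size=(250, 3))
y_labelled = rng.choice(['Male', 'Female'], size=250)
x_unlabelled = rng.integers(30, 250, size=(750, 3))

# Hold out the test set once, from human-labelled rows only.
x_train, x_test, y_train, y_test = train_test_split(
    x_labelled, y_labelled, test_size=0.3, random_state=0
)

# Round 1: train on the labelled training rows.
clf_1 = GradientBoostingClassifier(random_state=0).fit(x_train, y_train)
acc_1 = accuracy_score(y_test, clf_1.predict(x_test))

# Accept confident predictions as labels and add them to the training rows.
proba = clf_1.predict_proba(x_unlabelled)
confident = proba.max(axis=1) >= 0.9
pseudo_labels = clf_1.classes_[proba.argmax(axis=1)][confident]
x_round_2 = np.vstack([x_train, x_unlabelled[confident]])
y_round_2 = np.concatenate([y_train, pseudo_labels])

# Round 2: retrain, then score on the same untouched test set.
clf_2 = GradientBoostingClassifier(random_state=0).fit(x_round_2, y_round_2)
acc_2 = accuracy_score(y_test, clf_2.predict(x_test))

print(f'round 1: {acc_1:.3f}, round 2: {acc_2:.3f}')
```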



## Caveats

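Be aware that the example above uses randomly generated data. Its purpose is to show the active learning workflow and how to apply it when only a limited amount of labelled data is available. Because the labels are random, the model is essentially learning noise: in the run shown, test accuracy moved from 0.55 to 0.53 and MCC from 0.10 to 0.07 after retraining, which is no improvement and is what you would expect from chance. Also note that the second test set includes labels predicted by the first model, so the two scores are not directly comparable. Take these results with a grain of salt.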

---