# **DSFM Workshop**: Model Interpretation

---

## **Section 3**: Active learning

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

---

## **Overview**

Labeled observations tend to be scarce in the business world and time-consuming to annotate. Hence, understanding whether it makes sense to obtain more labeled observations and, if so, which observations to label next can be of great use.

Which new labeled observations would be most useful to the model? You often want to label observations in the uncertain region, i.e. where the model cannot confidently assign either class. For example, in binary classification, the uncertain region might consist of unlabeled observations with a predicted probability around 0.5. The approach used to decide which observations to label next is generally known as the "querying strategy".
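
To make the "uncertain region" concrete, here is a minimal sketch that ranks unlabeled observations by how close their predicted probability is to 0.5. The probabilities and the variable names (`proba`, `n_to_label`) are made up for illustration and do not come from the credit dataset used later.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
# for six unlabeled observations (illustrative values only)
proba = np.array([0.02, 0.48, 0.91, 0.55, 0.30, 0.97])

# For binary classification, distance from 0.5 measures confidence:
# the smaller the distance, the more uncertain the model is
distance = np.abs(proba - 0.5)

# Pick the observations the model is least sure about
n_to_label = 2
query_idx = np.argsort(distance)[:n_to_label]
print(query_idx)
```

Observations 1 and 3 (probabilities 0.48 and 0.55) are selected, because they sit closest to the decision boundary.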

The general workflow in active learning looks as follows (from the modAL GitHub repository): 

<img src="https://camo.githubusercontent.com/23e6a639d055d5ce91e89c12b27c008066a0d314/68747470733a2f2f6d6f64616c2d707974686f6e2e72656164746865646f63732e696f2f656e2f6c61746573742f5f696d616765732f6163746976652d6c6561726e696e672e706e67" width=500>

[Image source](https://camo.githubusercontent.com/23e6a639d055d5ce91e89c12b27c008066a0d314/68747470733a2f2f6d6f64616c2d707974686f6e2e72656164746865646f63732e696f2f656e2f6c61746573742f5f696d616765732f6163746976652d6c6561726e696e672e706e67)
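
The workflow can also be sketched in a few lines of code. The example below is an illustration under stated assumptions, not the modAL implementation: it uses synthetic two-dimensional data, a hand-written logistic regression (`fit_logistic`, `predict_proba` are hypothetical helpers), and treats the known labels `y` as the "oracle" that answers each query.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: two Gaussian blobs in 2D (illustrative only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 2)),
               rng.normal(+1.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

def fit_logistic(X, y, n_steps=500, lr=0.1):
    """Fit a plain logistic regression with gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability of class 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Start with a few labeled observations; the rest form the unlabeled pool
labeled = [int(i) for i in rng.choice(len(X), size=10, replace=False)]
pool = [i for i in range(len(X)) if i not in labeled]

for step in range(20):
    w = fit_logistic(X[labeled], y[labeled])           # 1. train on labeled data
    proba = predict_proba(w, X[pool])                  # 2. score the unlabeled pool
    pick = pool[int(np.argmin(np.abs(proba - 0.5)))]   # 3. query the most uncertain observation
    labeled.append(pick)                               # 4. the oracle reveals y[pick]
    pool.remove(pick)                                  # 5. remove it from the pool

# Refit on all labeled data and evaluate
w = fit_logistic(X[labeled], y[labeled])
accuracy = np.mean((predict_proba(w, X) >= 0.5) == y)
print('Labeled: {}, pool: {}, accuracy: {:.3f}'.format(len(labeled), len(pool), accuracy))
```

Each pass through the loop corresponds to one cycle of the diagram: train, query, label, and retrain.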

## **Learning goals**

- Understand the active learning workflow
- Learn about different querying strategies (random vs. uncertainty sampling) to prioritize the unlabeled pool of observations
- Experiment with `modAL`, a flexible package for active learning in Python
- Explore a user-friendly annotation tool with active learning and transfer learning built-in: `prodi.gy`

## **Useful resources**

- Dataiku article on experimenting with different [Python packages for active learning](https://blog.dataiku.com/a-proactive-look-at-active-learning-packages)
- `modAL` GitHub repository containing many [examples](https://github.com/modAL-python/modAL)


---

<img src="https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg" width="500" height="600" align="center"/>


[Image source](https://greendayonline.com/wp-content/uploads/2017/03/Recovering-From-Student-Loan-Default.jpg)


In [None]:
# Ensure that all packages are installed 
# import sys
# !{sys.executable} -m pip install modAL

## **Part 1:** Load data

We will try to predict the probability of defaulting on a credit card account at a Taiwanese bank. A credit card default happens when a customer fails to pay the minimum due on a credit card bill for more than 6 months. 

We will use a dataset with 30,000 observations (Source: *Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.*). Each observation represents an account at the bank at the end of October 2005. The target variable to predict is `customer_default`, i.e. whether the customer will default in the following month (1 = Yes or 0 = No); we renamed it from `default_payment_next_month`. The dataset also includes 23 explanatory features.

Variables are defined as follows:

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| customer_default | Binary        | 1 = default in following month; 0 = no default 
| LIMIT_BAL        | Continuous    | Credit limit   
| SEX              | Categorical   | 1 = male; 2 = female
| EDUCATION        | Categorical   | 1 = graduate school; 2 = university; 3 = high school; 4 = others
| MARRIAGE         | Categorical   | 0 = unknown; 1 = married; 2 = single; 3 = others
| AGE              | Continuous    | Age in years  
| PAY1             | Categorical   | Repayment status in September, 2005 
| PAY2             | Categorical   | Repayment status in August, 2005 
| PAY3             | Categorical   | Repayment status in July, 2005 
| PAY4             | Categorical   | Repayment status in June, 2005 
| PAY5             | Categorical   | Repayment status in May, 2005 
| PAY6             | Categorical   | Repayment status in April, 2005 
| BILL_AMT1        | Continuous    | Balance in September, 2005  
| BILL_AMT2        | Continuous    | Balance in August, 2005  
| BILL_AMT3        | Continuous    | Balance in July, 2005  
| BILL_AMT4        | Continuous    | Balance in June, 2005 
| BILL_AMT5        | Continuous    | Balance in May, 2005  
| BILL_AMT6        | Continuous    | Balance in April, 2005  
| PAY_AMT1         | Continuous    | Amount paid in September, 2005
| PAY_AMT2         | Continuous    | Amount paid in August, 2005
| PAY_AMT3         | Continuous    | Amount paid in July, 2005
| PAY_AMT4         | Continuous    | Amount paid in June, 2005
| PAY_AMT5         | Continuous    | Amount paid in May, 2005
| PAY_AMT6         | Continuous    | Amount paid in April, 2005

In [None]:
# Load the dataset
import pandas as pd

df = pd.read_csv('data/credit_data.csv')

print(df.shape)
df.head()

In [None]:
# Percentage of customers who default
count = df['customer_default'].value_counts()
percentage_default = count[1] / (count[1] + count[0]) * 100
print('{}% of customers default'.format(round(percentage_default, 4)))

In [None]:
# Divide data into labeled data and unlabeled data (we pretend)
# Shuffle data
SEED = 7
df = df.sample(len(df), replace=False, random_state=SEED)

# Split df into five equal-sized parts (only the first three are used below)
import numpy as np
df1, df2, df3, df4, df5 = np.array_split(df, 5)

# DF1: labeled data
# DF2: pool of unlabeled data
# DF3: test data for evaluation
print(df1.shape)
print(df2.shape)
print(df3.shape)

In [None]:
df1.head()

## **Part 2:** Active learning for classification

In this classification example, we use the `entropy_sampling` querying strategy. In this strategy, the "next best" unlabeled observation is the one for which the predicted probabilities have the highest entropy. 
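
As a rough illustration of the idea (not modAL's own implementation), the sketch below computes the Shannon entropy of each row of predicted class probabilities and selects the row with the highest value. The `entropy` helper and the probabilities are assumptions made for this example.

```python
import numpy as np

def entropy(proba):
    """Shannon entropy (in nats) of each row of class probabilities."""
    proba = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(proba * np.log(proba), axis=1)

# Hypothetical predicted class probabilities for four unlabeled observations
proba = np.array([
    [0.99, 0.01],
    [0.50, 0.50],
    [0.80, 0.20],
    [0.35, 0.65],
])

scores = entropy(proba)
query_idx = int(np.argmax(scores))  # observation with the highest entropy
print(scores.round(4), query_idx)
```

The second observation (index 1) is selected: a 50/50 prediction has the maximum possible entropy for two classes, ln(2) ≈ 0.693.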

The `modAL` package includes other querying strategies, including **margin sampling**, which selects the instances where the difference between the predicted probabilities of the most likely and second most likely classes is smallest. Other querying strategies include **expected model change** (label observations that would most change the current model), **expected error reduction**, and **query by committee** (train different models on labeled data and select the unlabeled observations where the predictions diverge the most).
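
A minimal sketch of the margin computation, assuming a three-class problem with made-up probabilities. The `margin` helper is hypothetical and only illustrates the definition above; it is not taken from `modAL`.

```python
import numpy as np

def margin(proba):
    """Difference between the largest and second-largest class probability per row."""
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

# Hypothetical predicted probabilities for four observations and three classes
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.38, 0.22],
    [0.34, 0.33, 0.33],
    [0.50, 0.05, 0.45],
])

margins = margin(proba)
query_idx = int(np.argmin(margins))  # smallest margin = most ambiguous
print(margins.round(2), query_idx)
```

The third observation (index 2) has the smallest margin, so it would be queried first. Note that the last row has a small margin even though one class is clearly ruled out, which is the kind of case where margin and entropy can rank observations differently.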

Let's investigate how the model performance changes with active learning. Note that due to the small number of training observations, the random seed plays an unrealistically important role in the model performance we obtain.

### Train initial model

In [None]:
from modAL.models import ActiveLearner
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from modAL.uncertainty import entropy_sampling

# Only train on few examples
df1_small = df1.sample(30, random_state = SEED)

# Split features from target variable
X_df1 = df1_small.drop(columns=['customer_default'], inplace=False).values
y_df1 = df1_small['customer_default'].values

# initializing the learner
learner = ActiveLearner(
    estimator = GradientBoostingClassifier(random_state = SEED),
    query_strategy = entropy_sampling,
    X_training = X_df1, y_training = y_df1
)

df1_small['customer_default'].value_counts()

In [None]:
# Split features from target variable
X_df3 = df3.drop(columns=['customer_default'], inplace=False).values
y_df3 = df3['customer_default'].values


In [None]:
# Performance of the initial classifier
pred = learner.predict_proba(X_df3)
print('AUC: {}'.format(roc_auc_score(y_df3, pred[:,1])))

### Query additional labeled observations

In [None]:
n_queries = 20

# Split features from target variable
X_df2 = df2.drop(columns=['customer_default'], inplace=False).values
y_df2 = df2['customer_default'].values

for idx in range(n_queries):
    query_idx, query_instance = learner.query(X_df2)
    print('Observation {}'.format(query_idx[0]).ljust(20) +  'prediction {}'.format(round(learner.predict_proba(X_df2[query_idx])[0][1], 4)))
    # Oracle labels the observation selected by our querying strategy
    learner.teach(X_df2[query_idx], y_df2[query_idx])

In [None]:
# Performance of the classifier after active learning
pred = learner.predict_proba(X_df3)
print('AUC: {}'.format(roc_auc_score(y_df3, pred[:,1])))

### Questions based on the Dataiku article

For the article, click [here](https://blog.dataiku.com/a-proactive-look-at-active-learning-packages)

1. What querying strategy do you think works best without any labeled data?
2. Do you think active learning is more effective for classification problems with many classes or with few classes?
3. What's a good way to select between different querying strategies?
    
- Answer 1: random strategy
- Answer 2: active learning is more useful in situations with many classes
- Answer 3: no unifying framework exists at the moment

## **Part 3:** Active learning for regression

We now look at a time-series regression problem. The visualizations show that learning from only a few data points results in poor performance. We then compare querying random observations with querying the observations that have the highest predicted uncertainty, to show that the querying strategy affects how effectively newly labeled observations improve model performance.

Note that we use a Gaussian process regression (GPR) as our model. GPR is a nonparametric, Bayesian approach to regression that provides uncertainty measurements for predictions, which is the main reason we are using this model here. Nonparametric simply means that the model is not bound to a particular functional form (e.g. linear, quadratic).
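
To see where the uncertainty measurements come from, the sketch below computes a GP posterior mean and standard deviation by hand with NumPy. It assumes a fixed RBF kernel with length scale 1, a zero prior mean, and a known noise variance; scikit-learn's `GaussianProcessRegressor` additionally tunes the kernel hyperparameters, which this sketch does not attempt. The `rbf` helper and the data are illustrative only.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF (squared-exponential) kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

# Hypothetical training data: three observations of sin(x)
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.sin(x_train)
noise = 0.1 ** 2  # assumed observation noise variance

# Test inputs: one inside the training range, one far away from it
x_test = np.array([2.5, 10.0])

K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf(x_test, x_train)

# GP posterior mean and standard deviation of the latent function
mean = K_s @ np.linalg.solve(K, y_train)
cov = rbf(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print('mean:', mean.round(3))
print('std: ', std.round(3))
```

The standard deviation is small at x = 2.5, which is surrounded by training points, and close to the prior value of 1 at x = 10, where the model has seen no data. Uncertainty sampling exploits exactly this difference. The notebook cells that follow use scikit-learn's `GaussianProcessRegressor` rather than this hand-written version.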

Code adapted from: [modAL on GitHub](https://github.com/modAL-python/modAL/blob/master/examples/active_regression.py)

In [None]:
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF

In [None]:
# Generating the data
SEED = 7
STD = 0.3
np.random.seed(SEED)
X = np.random.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X) + np.random.normal(scale=STD, size=X.shape)

FIGSIZE = (10, 6)
MARKERSIZE = 10
LINEWIDTH = 3
DPI = 120

# Plotting the initial estimation
with plt.style.context('seaborn-white'):
    plt.figure(figsize=FIGSIZE, dpi=DPI)
    x = np.linspace(0, 20, 1000)
    y_true = np.sin(x)
    plt.plot(x, y_true, c='k', linewidth=LINEWIDTH, label='True DGP')
    plt.fill_between(x, y_true - STD, y_true + STD, alpha=0.2, color='gray', label='+/- 1 SD')
    plt.scatter(X, y, c='k', s=MARKERSIZE, label='Sampled data')
    plt.title('Randomly sampled data')
    plt.grid()
    plt.legend()
    plt.show()

In [None]:
import numpy as np
np.random.seed(SEED)
from sklearn.metrics import mean_squared_error
from math import sqrt

# Query strategy: select observations at random
def random_sampling(classifier, X_pool):
    n_samples = len(X_pool)
    query_idx = np.random.choice(range(n_samples))
    return query_idx, X_pool[query_idx]

# Assembling initial training set with few data points
n_initial = 5
COLOR = 'cornflowerblue'
initial_idx = np.random.choice(range(len(X)), size=n_initial, replace=False)
X_initial, y_initial = X[initial_idx], y[initial_idx]

# Defining the kernel for the Gaussian process
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

# Initializing the active learner
regressor = ActiveLearner(
    estimator = GaussianProcessRegressor(kernel=kernel, random_state=SEED),
    query_strategy = random_sampling,
    X_training = X_initial.reshape(-1, 1), y_training=y_initial.reshape(-1, 1)
)

pred, std = regressor.predict(X.reshape(-1,1), return_std=True)
rmse = sqrt(mean_squared_error(np.ravel(y), np.ravel(pred)))  # ravel handles both 1-D and 2-D predictions
print('RMSE with initial data: {}'.format(round(rmse, 4)))

# plotting the initial estimation
with plt.style.context('seaborn-white'):
    x = np.linspace(0, 20, 1000)
    
    # DGP
    plt.figure(figsize=FIGSIZE, dpi=DPI)
    plt.plot(x, y_true, c='k', linewidth=LINEWIDTH, label='True DGP', alpha=0.2)
    plt.fill_between(x, y_true - STD, y_true + STD, alpha=0.2, color='gray', label='+/- 1 SD')
    plt.scatter(X, y, c='k', s=MARKERSIZE, label='Sampled data', alpha=0.2)
    
    # Prediction
    pred, std = regressor.predict(x.reshape(-1,1), return_std=True)
    plt.plot(x, pred, color=COLOR)
    plt.fill_between(x, pred.reshape(-1, )-std, pred.reshape(-1, )+std, alpha=0.2, color=COLOR)
    # Plot initial data
    plt.scatter(X_initial, y_initial, s=MARKERSIZE+30, color=COLOR, label='Initial data')
    plt.title('Initial estimation based on %d points' % n_initial)
    plt.grid()
    plt.legend()
    plt.show()
    

### Random sampling

In [None]:
# Initializing the active learner
regressor = ActiveLearner(
    estimator = GaussianProcessRegressor(kernel=kernel, random_state=SEED),
    query_strategy = random_sampling,
    X_training = X_initial.reshape(-1, 1), y_training=y_initial.reshape(-1, 1)
)
    
# Active learning
n_queries = 10
x = np.linspace(0, 20, 1000)
X_new, y_new = [], []
predictions, stds = [], []
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))
    pred, std = regressor.predict(x.reshape(-1,1), return_std=True)
    
    X_new.append(X[query_idx].reshape(1, -1))
    y_new.append(y[query_idx].reshape(1, -1))
    predictions.append(pred)
    stds.append(std)
    
# Plotting after active learning
for i in range(len(X_new)):
    
    with plt.style.context('seaborn-white'):

        # DGP
        plt.figure(figsize=FIGSIZE, dpi=DPI)
        plt.plot(x, y_true, c='k', linewidth=LINEWIDTH, label='True DGP', alpha=0.2)
        plt.fill_between(x, y_true - STD, y_true + STD, alpha=0.2, color='gray', label='+/- 1 SD')
        plt.scatter(X, y, c='k', s=MARKERSIZE, label='Sampled data', alpha=0.2)

        # Prediction
        plt.plot(x, predictions[i], color='red', label='Latest estimate')
        # Previous estimates
        for j in range(i):
            plt.plot(x, predictions[j], color='red', linestyle='dashed', linewidth=2, alpha=(j+1)/(i+1)*0.5)

        plt.fill_between(x, predictions[i].reshape(-1, ) - stds[i], predictions[i].reshape(-1, ) + stds[i], alpha=0.3, color='red')
        # Plot latest observation
        plt.scatter(X_new[i:i+1], y_new[i:i+1], marker='*', s=MARKERSIZE+200, color='red', label='Latest queried obs.')
        # Plot previous observations
        if i > 0: plt.scatter(X_new[:i], y_new[:i], s=MARKERSIZE+30, color='red', label='Previously queried obs.')
        # Plot initial data 
        plt.scatter(X_initial, y_initial, s=MARKERSIZE+30, color=COLOR, label='Initial data', alpha=0.5)
        plt.title('RANDOM SAMPLING: Estimation after an additional %d observation(s)' % (i+1))
        plt.grid()
        plt.legend(loc='lower left')
        plt.show()

pred, std = regressor.predict(X.reshape(-1,1), return_std=True)
rmse = sqrt(mean_squared_error(np.ravel(y), np.ravel(pred)))  # ravel handles both 1-D and 2-D predictions
print('RMSE with {} new observations: {}'.format(n_queries, round(rmse, 4)))

### Uncertainty sampling

In [None]:
# Query strategy: select observation with the largest standard deviation
def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    query_idx = np.argmax(std)
    return query_idx, X[query_idx]

# Initializing the active learner
regressor = ActiveLearner(
    estimator = GaussianProcessRegressor(kernel=kernel, random_state=SEED),
    query_strategy = GP_regression_std,
    X_training = X_initial.reshape(-1, 1), y_training=y_initial.reshape(-1, 1)
)
    
# Active learning
n_queries = 10
x = np.linspace(0, 20, 1000)
X_new, y_new = [], []
predictions, stds = [], []
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))
    pred, std = regressor.predict(x.reshape(-1,1), return_std=True)
    
    X_new.append(X[query_idx].reshape(1, -1))
    y_new.append(y[query_idx].reshape(1, -1))
    predictions.append(pred)
    stds.append(std)
    
# Plotting after active learning
for i in range(len(X_new)):

    with plt.style.context('seaborn-white'):

        # DGP
        plt.figure(figsize=FIGSIZE, dpi=DPI)
        plt.plot(x, y_true, c='k', linewidth=LINEWIDTH, label='True DGP', alpha=0.2)
        plt.fill_between(x, y_true - STD, y_true + STD, alpha=0.2, color='gray', label='+/- 1 SD')
        plt.scatter(X, y, c='k', s=MARKERSIZE, label='Sampled data', alpha=0.2)

        # Prediction
        plt.plot(x, predictions[i], color='red', label='Latest estimate')
        # Previous estimates
        for j in range(i):
            plt.plot(x, predictions[j], color='red', linestyle='dashed', linewidth=2, alpha=(j+1)/(i+1)*0.5)

        plt.fill_between(x, predictions[i].reshape(-1, )-stds[i], predictions[i].reshape(-1, )+stds[i], alpha=0.3, color='red')
        # Plot latest observation
        plt.scatter(X_new[i:i+1], y_new[i:i+1], marker='*', s=MARKERSIZE+200, color='red', label='Latest queried obs.')
        # Plot previous observations
        if i > 0: plt.scatter(X_new[:i], y_new[:i], s=MARKERSIZE+30, color='red', label='Previously queried obs.')
        # Plot initial data 
        plt.scatter(X_initial, y_initial, s=MARKERSIZE+30, color=COLOR, label='Initial data', alpha=0.5)
        plt.title('UNCERTAINTY SAMPLING: Estimation after an additional %d observation(s)' % (i+1))
        plt.grid()
        plt.legend(loc='lower left')
        plt.show()

pred, std = regressor.predict(X.reshape(-1,1), return_std=True)
rmse = sqrt(mean_squared_error(np.ravel(y), np.ravel(pred)))  # ravel handles both 1-D and 2-D predictions
print('RMSE with {} new observations: {}'.format(n_queries, round(rmse, 4)))

## **Part 4:** Prodigy annotation tool demo

Prodigy is an annotation tool with active learning built in. It was developed by Explosion AI, the makers of the `spaCy` NLP package. Let's look at the Prodigy demo to see what an efficient annotation workflow can look like.

Of course, there are numerous alternatives to doing your own annotation, including Amazon Mechanical Turk.

[Click here for the Prodigy demo](https://prodi.gy/demo)

## **Bonus questions**:

1. In what scenario might a "margin sampling" querying strategy work well?
2. What's another simple approach to answering "Is it worth collecting more data?"

- Answer 1: multi-class problems
- Answer 2: learning curve
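
A learning curve can be sketched as follows: train the same model on increasingly large subsets of the labeled data and track the error on a held-out validation set. The example uses synthetic data and a simple linear fit with `np.polyfit`; the data, the training sizes, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: noisy linear relationship (illustrative only)
X = rng.uniform(0, 10, size=600)
y = 2.0 * X + 1.0 + rng.normal(scale=2.0, size=X.shape)

# Hold out the last 100 observations for validation
X_train, y_train = X[:500], y[:500]
X_val, y_val = X[500:], y[500:]

train_sizes = [5, 10, 20, 50, 100, 200, 500]
val_rmse = []
for n in train_sizes:
    # Fit a straight line on the first n training observations
    slope, intercept = np.polyfit(X_train[:n], y_train[:n], deg=1)
    pred = slope * X_val + intercept
    val_rmse.append(float(np.sqrt(np.mean((y_val - pred) ** 2))))

for n, rmse in zip(train_sizes, val_rmse):
    print('{:>4} training observations: validation RMSE = {:.3f}'.format(n, rmse))
```

If the validation error has flattened out at the largest training sizes, collecting more labels is unlikely to help much; if it is still falling, more data may be worth the annotation cost.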