# **Expected Value**

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

This illustration demonstrates how the expected value of weighted errors changes as one shifts the probability threshold for a positive prediction on a linear probability model. As the cost of a false positive (FP) and/or the cost of a false negative (FN) changes in relative terms, the optimal probability threshold for a positive prediction will also shift. 

  * Users can visualize this effect manually by changing the values of the `C_FP` variable (cost of a false positive) and the `C_FN` variable (cost of a false negative) in the code below, and then re-running the code.

  * Users can also visualize the effects of changing costs in a live, interactive manner by running `expected-value.py` in a Python console (see Part 5).

The illustration uses data from a Taiwanese bank with 30,000 observations (Source: *Yeh, I. C., & Lien, C. H. (2009)*). The target variable to predict is `customer_default`, i.e., whether the customer will default in the following month (1 = Yes, 0 = No). The dataset also includes 23 explanatory features, listed in the table below.

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| customer_default | Binary        | 1 = default in following month; 0 = no default 
| LIMIT_BAL        | Continuous    | Credit limit   
| SEX              | Categorical   | 1 = male; 2 = female
| EDUCATION        | Categorical   | 1 = graduate school; 2 = university; 3 = high school; 4 = others
| MARRIAGE         | Categorical   | 0 = unknown; 1 = married; 2 = single; 3 = others
| AGE              | Continuous    | Age in years  
| PAY1             | Categorical   | Repayment status in September, 2005 
| PAY2             | Categorical   | Repayment status in August, 2005 
| PAY3             | Categorical   | Repayment status in July, 2005 
| PAY4             | Categorical   | Repayment status in June, 2005 
| PAY5             | Categorical   | Repayment status in May, 2005 
| PAY6             | Categorical   | Repayment status in April, 2005 
| BILL_AMT1        | Continuous    | Balance in September, 2005  
| BILL_AMT2        | Continuous    | Balance in August, 2005  
| BILL_AMT3        | Continuous    | Balance in July, 2005  
| BILL_AMT4        | Continuous    | Balance in June, 2005 
| BILL_AMT5        | Continuous    | Balance in May, 2005  
| BILL_AMT6        | Continuous    | Balance in April, 2005  
| PAY_AMT1         | Continuous    | Amount paid in September, 2005
| PAY_AMT2         | Continuous    | Amount paid in August, 2005
| PAY_AMT3         | Continuous    | Amount paid in July, 2005
| PAY_AMT4         | Continuous    | Amount paid in June, 2005
| PAY_AMT5         | Continuous    | Amount paid in May, 2005
| PAY_AMT6         | Continuous    | Amount paid in April, 2005

-------------

## **Part 0**: Setup

In [None]:
# import all packages 
import pandas as pd
import numpy as np

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model    import LinearRegression

# plotting 
import matplotlib 
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# define all constants
FONTSIZE  = 20
FIGSIZE   = (12, 12)
LINEWIDTH = 3

## **Part 1**: Load data

In [None]:
# load data
data = pd.read_csv('data/credit_data.csv')

# Select target
y = np.array(data['customer_default'])

# Select features (every column except the target), keeping the original column order
features = [col for col in data.columns if col != 'customer_default']
X = data.loc[:, features]

# Divide data into a training set and a testing set using the train_test_split() function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=1, stratify=y)

print('X_train: \t{}'.format(X_train.shape))
print('y_train: \t{}\n'.format(y_train.shape))
print('X_test: \t{}'.format(X_test.shape))
print('y_test: \t{}'.format(y_test.shape))

## **Part 2**: Fit a linear probability model (LPM)

In [None]:
# Fit an OLS linear regression
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
y_hat_ols_prob = ols_model.predict(X_test)  # Predict the probability
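
Because an LPM is an ordinary least-squares fit on a 0/1 target, its predictions are not constrained to lie in [0, 1]. The following is a minimal sketch on synthetic data (the data, the seed, and the names `X_demo`, `y_demo`, `lpm`, and `scores` are illustrative assumptions, not part of the credit dataset or the original notebook) that measures how often this happens and shows that clipping does not alter thresholded predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): one feature and a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 1))
p_true = 1.0 / (1.0 + np.exp(-3.0 * X_demo[:, 0]))
y_demo = (rng.random(1000) < p_true).astype(int)

# Fit an LPM: plain OLS regression on the 0/1 target
lpm = LinearRegression().fit(X_demo, y_demo)
scores = lpm.predict(X_demo)

# Share of fitted values that are not valid probabilities
outside = np.mean((scores < 0) | (scores > 1))
print('Share of predictions outside [0, 1]: {:.3f}'.format(outside))

# Clipping maps the scores into [0, 1]; for any threshold strictly
# between 0 and 1 the thresholded predictions stay the same
clipped = np.clip(scores, 0.0, 1.0)
same = np.array_equal(scores >= 0.5, clipped >= 0.5)
print('Predictions at threshold 0.5 unchanged by clipping: {}'.format(same))
```

Since the threshold sweeps in this notebook only use thresholds between 0.01 and 0.99, out-of-range scores do not affect the resulting classifications; they matter only if the scores are interpreted as probabilities.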

## **Part 3**: Optimize the LPM based on accuracy

Here we show how different thresholds result in different accuracies. 

In [None]:
# Find the optimal threshold when all errors cost the same (i.e., maximize accuracy)
results = []
for i in range(1, 100):
    threshold = 0.01 * i
    y_hats    = [int(v >= threshold) for v in y_hat_ols_prob]
    correct   = [int(r[0]==r[1]) for r in zip(y_test, y_hats)]
    accuracy  = sum(correct)/len(correct)
    results.append( (accuracy, threshold) )
optimal_p   = sorted(results, reverse=True)[0][1]
optimal_acc = sorted(results, reverse=True)[0][0]
print('Optimal probability threshold = {} with accuracy = {}\n'.format(round(optimal_p, 4), round(optimal_acc, 4)))

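# Use dedicated names so the target array y is not overwritten
accuracies, thresholds = zip(*results)
font = {'size'   : FONTSIZE}
matplotlib.rc('font', **font)
plt.figure(figsize=FIGSIZE)
plt.plot(thresholds, accuracies, linewidth=LINEWIDTH)
plt.vlines(optimal_p, ymin=0, ymax=1, colors=['red'], linewidth=LINEWIDTH)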
plt.ylabel('Accuracy')
plt.xlabel('Probability Threshold')
plt.grid()


## **Part 4**: Optimize the LPM based on costs of incorrect predictions

Here we show how assigning different costs to false positives and false negatives can yield an optimal threshold that differs from the one that maximizes accuracy.
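
As a sketch of the quantity minimized in this part, the weighted error count at a threshold $t$ can be written as follows, where $\hat{p}_i$ denotes the LPM prediction for observation $i$ and $y_i$ its true label (this notation is introduced here for illustration and does not appear in the code):

$$
\mathrm{Cost}(t) = C_{FP} \cdot \#\{i : \hat{p}_i \ge t,\ y_i = 0\} + C_{FN} \cdot \#\{i : \hat{p}_i < t,\ y_i = 1\}
$$

If the predictions were well-calibrated probabilities, predicting positive would be the cheaper choice whenever $(1 - p)\,C_{FP} \le p\,C_{FN}$, which gives the reference threshold

$$
t^{*} = \frac{C_{FP}}{C_{FP} + C_{FN}}
$$

For `C_FP = 1` and `C_FN = 4` this is $t^{*} = 0.2$. LPM outputs are not guaranteed to be calibrated, so the empirically optimal threshold on the test set may differ from this value.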

In [None]:
# Set costs
C_FP = 1
C_FN = 4

# Find the optimal threshold with specific costs 
results = []
for i in range(1, 100):
    threshold = 0.01 * i
    y_hats = [int(v >= threshold) for v in y_hat_ols_prob]
    errors = []
    for r in zip(y_test, y_hats):
        actual_value = r[0]
        predicted_value = r[1]
        # Incorrect prediction
        if predicted_value != actual_value:
            
            # False positive
            if predicted_value:
                errors.append(C_FP)
                
            # False negative
            else:
                errors.append(C_FN)
        # Correct prediction
        else:
            errors.append(0)
    total_error = sum(errors)
    results.append( (total_error, threshold) )
optimal_p    = sorted(results)[0][1]
optimal_cost = sorted(results)[0][0]  # lowest total weighted cost
print('Optimal probability threshold = {} with total cost = {}\n'.format(round(optimal_p, 4), round(optimal_cost, 4)))

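costs, thresholds = zip(*results)
plt.figure(figsize=FIGSIZE)
plt.plot(thresholds, costs, linewidth=LINEWIDTH)
# ymax must be a scalar: max(results) would return a (cost, threshold) tuple
plt.vlines(optimal_p, ymin=0, ymax=max(costs), colors=['red'], linewidth=LINEWIDTH)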
plt.ylabel('Weighted Count of Error')
plt.xlabel('Probability Threshold')
plt.grid()
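
The nested loops above can also be expressed with NumPy broadcasting. The following is a minimal sketch, not the notebook's own implementation: the function name `weighted_cost_curve` and the toy labels and scores are hypothetical and chosen only to make the result easy to check by hand.

```python
import numpy as np

def weighted_cost_curve(y_true, scores, c_fp, c_fn, thresholds):
    """Return the total weighted error cost at each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Rows correspond to thresholds, columns to observations
    y_pred = scores[None, :] >= thresholds[:, None]
    false_pos = (y_pred & ~y_true[None, :]).sum(axis=1)
    false_neg = (~y_pred & y_true[None, :]).sum(axis=1)
    return c_fp * false_pos + c_fn * false_neg

# Toy data (hypothetical): four negatives followed by three positives
y_toy = [0, 0, 0, 0, 1, 1, 1]
s_toy = [0.05, 0.15, 0.25, 0.45, 0.20, 0.50, 0.90]
grid = np.array([0.1, 0.3, 0.6])

# Equal costs: minimizing cost is the same as maximizing accuracy
costs_equal = weighted_cost_curve(y_toy, s_toy, c_fp=1, c_fn=1, thresholds=grid)
# A false negative costs four times as much as a false positive
costs_weighted = weighted_cost_curve(y_toy, s_toy, c_fp=1, c_fn=4, thresholds=grid)

best_equal = grid[np.argmin(costs_equal)]
best_weighted = grid[np.argmin(costs_weighted)]
print(costs_equal, best_equal)        # [3 2 2] 0.3
print(costs_weighted, best_weighted)  # [3 5 8] 0.1
```

In this toy example, making false negatives more expensive moves the cost-minimizing threshold from 0.3 down to 0.1: the model flags more cases as positive to avoid the costlier error. Note that `np.argmin` returns the first minimum on ties, whereas `sorted(results)[0]` in the cell above breaks ties by the smaller threshold, which coincides here because the grid is ascending.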


Switch to `scripts/i2b-lpm.py` to see how the LPM errors change with different costs for false predictions.

## **Part 5**: Visualize the effects of changing the cost structure LIVE

Use a Python console to run `expected-value.py`. You can then adjust the cost structure and see how the expected value (weighted count of errors) changes, and with it the optimal probability threshold for the model.