## Profit-Gain Framework for ML Model Evaluation

This is a first attempt to demonstrate the Profit-Gain framework. Machine-learning and deep-learning model selection and comparison have long relied on measures of accuracy. However, accuracy metrics alone do not allow for a practical cost-benefit analysis of whether to use one model over another. The Profit-Gain framework aims to address this problem through a simple method of evaluation.

In the first section, we introduce the framework. In the next section, we use it to evaluate a few different classifiers.
*****

## The Profit-Gain Framework


##### Annual Cost of Meeting Data Processing Quota

Suppose our quota requirement is to process $D$ data points per second. Since the model takes $T$ seconds to process a single data point, we need $DT$ parallel instances of the model to fulfill this requirement.

Each instance of the model takes up $M$ megabytes of memory, and each megabyte costs $d$ dollars per second on the cloud. Therefore, the memory required to fulfill our requirement is $MDT$ megabytes in total, and the cost of hosting this is $d \,(MDT)$ dollars per second.

From this, we can extrapolate that our annual cost is $kMT$, where $k$ is a constant that depends only on $D$ and $d$ (the product $dD$ times the number of seconds in a year).
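As a rough illustration of the cost argument above, the following sketch computes the annual cost step by step and compares it with $kMT$. The function name `annual_cost`, the constant `SECONDS_PER_YEAR`, and all numeric inputs are hypothetical, and $d$ is assumed to be a price per megabyte per second.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_cost(D, T, M, d):
    """Annual hosting cost, following the derivation in the text."""
    instances = D * T                 # parallel instances needed for the quota
    memory_mb = M * instances         # total memory, M*D*T megabytes
    cost_per_second = d * memory_mb   # d is assumed to be dollars per MB per second
    return cost_per_second * SECONDS_PER_YEAR


# Hypothetical inputs, chosen only for illustration.
D, T, M, d = 1000.0, 0.002, 50.0, 1e-8

k = d * D * SECONDS_PER_YEAR          # the constant k depends only on D and d
direct = annual_cost(D, T, M, d)
via_k = k * M * T

print(direct, via_k)
```

Both values agree up to floating-point rounding, which is why only the product $MT$ matters when two models are compared under the same quota and price.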

##### Model Revenue

As before, we are required to process $D$ data points per second. We make $R$ dollars from each accurate prediction, and our model accuracy is $A$. Therefore, the revenue generated by our model is $ADR$ dollars per second. Note that a perfect model ($A = 1$) yields a revenue of $DR$ dollars per second.

Annual model revenue is thus $jAR$, where $j$ is a constant that depends on $D$.
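The revenue side can be sketched the same way. Here `annual_revenue`, `SECONDS_PER_YEAR`, and the numeric inputs are hypothetical, and $j$ is taken to be $D$ times the number of seconds in a year, which is one reading consistent with the text.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_revenue(D, A, R):
    """Annual revenue: D points/s, a fraction A correct, R dollars per correct one."""
    revenue_per_second = A * D * R
    return revenue_per_second * SECONDS_PER_YEAR


# Hypothetical inputs, chosen only for illustration.
D, A, R = 1000.0, 0.6, 1e-6

j = D * SECONDS_PER_YEAR              # assumed form of the constant j
revenue = annual_revenue(D, A, R)
perfect = annual_revenue(D, 1.0, R)   # a perfect model earns D*R per second

print(revenue, j * A * R, perfect)
```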

##### Evaluating the Baseline

Let's assume the baseline model yields no profit. 
Therefore, annual costs equal revenues.

In other words,  $$k MT \hspace{3mm} \text{(annual cost)} = j A R \hspace{3mm} \text{(annual revenue)}.$$
We can reformulate this as $RA = hMT$, where $h= k\,/\,j$.

Hence we can solve for $R$:
$$R = \frac{hMT}{A}$$

which is the revenue generated from a single correctly processed data point for a baseline model.
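To make the break-even condition concrete, here is a small sketch that solves for $R$ and checks that annual cost then equals annual revenue. The helper `break_even_revenue` and every number in it are hypothetical.

```python
import math


def break_even_revenue(k, j, M, T, A):
    """R = h*M*T / A with h = k / j, as derived above."""
    h = k / j
    return h * M * T / A


# Hypothetical constants and baseline measurements.
k, j = 315.36, 3.1536e10
M, T, A = 50.0, 0.002, 0.6

R = break_even_revenue(k, j, M, T, A)

# With this R the baseline earns exactly what it costs.
annual_cost = k * M * T
annual_revenue = j * A * R
print(R, annual_cost, annual_revenue)
```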

##### Comparing ML Models

Assume that we have some fancy new ML Model.  It performs with an accuracy of $A_2$ where $A_2 > A$ and $A$ is the baseline accuracy.

The annual revenue of this new model equals 

$$j A_2 R = j A_2 \big(\frac{h MT}{A} \big) = j h MT (A_2/A) = j (k /j) MT (A_2/A)  =  k MT (A_2/A)$$

Furthermore, the model has its own $M_2$ and $T_2$. The annual cost of the model is $kM_2T_2$. 

Therefore, the model profit equals $kMT(A_2 / A) - kM_2T_2 = k(MT(A_2 / A) - M_2T_2)$. 

Let $C = MT$ and $C_2 = M_2T_2$.
Thus, the profit of this new model equals

$$k(C(A_2 / A) - C_2).$$
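The profit expression can be checked numerically. In this sketch the function `profit` and all values are hypothetical; it only illustrates that the baseline breaks even by construction and that a candidate is profitable when $C(A_2/A) > C_2$.

```python
import math


def profit(k, C, A, C_2, A_2):
    """Annual profit k * (C * (A_2 / A) - C_2) of a new model."""
    return k * (C * (A_2 / A) - C_2)


# Hypothetical values: k, the baseline (C, A), and a candidate (C_2, A_2).
k = 315.36
C, A = 0.1, 0.6
C_2, A_2 = 0.12, 0.9

baseline_profit = profit(k, C, A, C, A)        # baseline against itself
candidate_profit = profit(k, C, A, C_2, A_2)   # 0.1 * 1.5 = 0.15 exceeds 0.12

print(baseline_profit, candidate_profit)
```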

##### Maximum Profit 
Recall that $C$ is the cost of our baseline model. We assume that the cost of the new model never drops below the baseline cost ($C_2 \ge C$). Thus there is a ceiling on our profit: it is reached when $A_2 = 1.0$ (we achieve perfect accuracy) and $C_2 = C$ (costs do not go up with the new model).

In other words, our maximum possible profit is $kC / A - kC = kC(1/A - 1)$.

The fraction of the maximum profit achieved is

$$\frac{k(C(A_2 / A) - C_2)}{kC(1/A - 1)}= \frac{C(A_2 / A) - C_2} { C(1/A - 1)} .$$

Multiplying the numerator and the denominator by $A$ yields

$$\frac{CA_2 - C_2A}{C(1 - A)}.$$

At baseline ($A_2 = A$ and $C_2 = C$), this metric equals zero.

When $CA_2 - C_2 A < 0$, this metric is less than zero.

In [1]:
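def profit_gain_metric(C_2, A_2):
    """Fraction of the maximum profit achieved by a model with cost C_2 and accuracy A_2."""
    # A (baseline accuracy) and C (baseline cost) are module-level values set later
    # in the notebook; reading them does not require a `global` statement.
    return (C*A_2 - C_2*A) / (C*(1 - A))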
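The three properties of the metric noted above (zero at baseline, one at the ceiling, negative when $CA_2 < C_2A$) can be checked with a variant that takes the baseline as explicit arguments. The name `profit_gain` and the numbers are hypothetical; this is a sketch, not a replacement for the notebook's function.

```python
import math


def profit_gain(C, A, C_2, A_2):
    """Fraction of maximum profit: (C*A_2 - C_2*A) / (C*(1 - A))."""
    return (C * A_2 - C_2 * A) / (C * (1 - A))


# Hypothetical baseline cost and accuracy.
C, A = 0.1, 0.6

at_baseline = profit_gain(C, A, C, A)    # same cost, same accuracy
at_ceiling = profit_gain(C, A, C, 1.0)   # perfect accuracy, no extra cost
costly = profit_gain(C, A, 0.5, 0.65)    # small accuracy gain at 5x the cost

print(at_baseline, at_ceiling, costly)
```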

*****

## Demonstrating the Framework

Here we demonstrate the framework using four different classifiers:
* Logistic Regression (Baseline)
* Random Forest with 1000 trees 
* Random Forest with 1000 trees (with Truncated SVD)
* BERT 

##### Loading Dataset 
We use the 20 Newsgroups dataset. Background information on this dataset can be [found here](http://qwone.com/~jason/20Newsgroups/).


In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

mydata_train = fetch_20newsgroups(subset='train', shuffle=True, 
                                  remove = ('headers', 'footers', 'quotes'), 
                                  random_state=42)

mydata_test = fetch_20newsgroups(subset='test', shuffle=True,
                                 remove = ('headers', 'footers', 'quotes'), 
                                 random_state=42)

print('size of training set: %s' % (len(mydata_train ['data'])))
print('size of validation set: %s' % (len(mydata_test['data'])))

size of training set: 11314
size of validation set: 7532


This is a quick DataFrame view of this data:

In [3]:
import pandas as pd
mydata_train_df = pd.DataFrame({'data': mydata_train.data, 'target': mydata_train.target})
mydata_train_df.head()

| | data | target |
|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 |
| 1 | A fair number of brave souls who upgraded thei... | 4 |
| 2 | well folks, my mac plus finally gave up the gh... | 4 |
| 3 | \nDo you have Weitek's address/phone number? ... | 1 |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 |


##### Establishing the Baseline: Logistic Regression

Here we establish our baseline model, a logistic regression in a pipeline that incorporates TF-IDF and Truncated SVD.

This model will be used to establish the baseline cost $C=MT$ and baseline accuracy $A$.

First we create the pipeline:

In [4]:
# used for all classifier pipelines 
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

# creating the logistic regression pipeline
from sklearn.linear_model import LogisticRegression

lr_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('svd', TruncatedSVD(n_components=300)),
        ('clf', LogisticRegression(multi_class='auto', solver='lbfgs'))])

# fitting the data
lr_pipeline.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                ('svd',
                 TruncatedSVD(algorithm='randomized', n_components=300,
                              n_iter=5, random_state=None, tol=0.0)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fi

##### Setting Baseline Accuracy
Now we establish the baseline accuracy $A$ by scoring this classifier on the test data.

In [5]:
A = lr_pipeline.score(mydata_test.data, mydata_test.target)
A

0.6165693043016464

##### Setting Baseline Cost

Recall again that the cost variable $C$ is a product of $M$ and $T$ where $M$ is the size of the model (in megabytes) and $T$ is the time it takes the model to compute a single data point.  


We first measure the runtime of computing predictions on the test dataset. Here we use the `%timeit` magic, which is built on the `timeit` module.

Recall that the `-n` option controls how many executions are done for each timing, which makes each timing representative, while the `-r` (repeat) option controls how many timings are taken, which makes the statistics more reliable.
The [documentation](https://docs.python.org/3/library/timeit.html#timeit.Timer.repeat) for `timeit` states the following:

> Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. *However, this is not very useful*. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. 

Therefore we take the best (minimum) runtime among our measurements.

In [6]:
import timeit
results = %timeit -r 3 -n 100 -o lr_pipeline.predict(mydata_test.data)

1.99 s ± 181 ms per loop (mean ± std. dev. of 3 runs, 100 loops each)


In [7]:
results.best 

1.85574203334

Now we use the measured minimum runtime to calculate $T$. 

In [8]:
#if you want to skip the timeit part 
#min_runtime=3.27452979408
min_runtime = results.best


n_data_pts = len(mydata_test['data'])
T = min_runtime / n_data_pts
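Outside IPython, a similar measurement can be sketched with `timeit.repeat` from the standard library. This assumes that `results.best` is the minimum run time divided by the number of loops. The function `predict_all` and the data are hypothetical stand-ins for the pipeline and the test set.

```python
import timeit


def predict_all(data):
    """Hypothetical stand-in for `pipeline.predict(data)`."""
    return [len(item) % 2 for item in data]


data = ["some text"] * 1000           # hypothetical test set
n_loops, n_repeats = 100, 3           # mirrors `-n 100 -r 3`

# Each entry is the total time of one run of n_loops calls.
run_totals = timeit.repeat(lambda: predict_all(data),
                           repeat=n_repeats, number=n_loops)

min_runtime = min(run_totals) / n_loops   # best time for one pass over the data
T = min_runtime / len(data)               # seconds per data point
print(min_runtime, T)
```

Taking the minimum rather than the mean follows the `timeit` documentation quoted above.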

Now we use `asizeof` from the `pympler` package, which is the best tool we found to measure the static memory of a given python object.  

In [9]:
from pympler import asizeof
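M = asizeof.asizeof(lr_pipeline) # in bytes
# NOTE: 10*6 evaluates to 60, so this yields bytes/60 rather than MB
# (true MB would need 10**6). Every model below uses the same factor,
# so the profit-gain metric, a ratio of costs, is unaffected.
M /= 10*6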
M

248049.33333333334
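A note on units: the expression `10*6` in the cell above is a multiplication, so the value printed is the size in bytes divided by 60. The sketch below shows the difference and checks that rescaling every cost by the same factor leaves the profit-gain metric unchanged. The byte count is inferred from the printed output (248049.33... times 60); `profit_gain` and the other numbers are hypothetical.

```python
import math

# In Python, * is multiplication and ** is exponentiation.
print(10 * 6, 10 ** 6)

size_bytes = 14_882_960               # inferred from the output above
print(size_bytes / (10 * 6))          # the scaling used in this notebook
print(size_bytes / 10 ** 6)           # the same size in megabytes


def profit_gain(C, A, C_2, A_2):
    """Profit-gain metric with the baseline passed explicitly (hypothetical helper)."""
    return (C * A_2 - C_2 * A) / (C * (1 - A))


# Rescaling both costs by the same factor leaves the metric unchanged.
C, A, C_2, A_2 = 61.0, 0.62, 700.0, 0.63   # hypothetical values
scale = 60 / 10 ** 6                       # converts bytes/60 into megabytes
original = profit_gain(C, A, C_2, A_2)
rescaled = profit_gain(C * scale, A, C_2 * scale, A_2)
print(original, rescaled)
```

So the absolute cost values in this notebook are not in megabyte-seconds, but the comparisons between models are still valid.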

Finally, we have our baseline cost, $C$.   

In [10]:
C = M*T
C

61.11465403725857

****
##### Profit-Gain Evaluation for a Random Forest Classifier

The first classifier we try is a random forest classifier with 1000 trees. 

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', RandomForestClassifier(n_estimators=1000))])

rf_pipeline.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

Now we determine the accuracy ($A_2$) and cost ($C_2$) for a random forest classifier with 1,000 trees.

In [12]:
A_rf = rf_pipeline.score(mydata_test.data, mydata_test.target)
A_rf

0.6299787573021773

The accuracy is slightly higher than that of our baseline model.

In [13]:
results = %timeit -r 3 -n 10 -o rf_pipeline.predict(mydata_test.data)

21.5 s ± 817 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [14]:
# determining T
min_runtime = results.best
T_rf = min_runtime / n_data_pts

# determining M
M_rf = asizeof.asizeof(rf_pipeline) # in bytes
M_rf /= 10*6 # bytes/60 (10*6 == 60, not 10**6 for MB); same factor as the baseline

#calculate Cost for this classifier
C_rf = M_rf*T_rf

Judging by the runtime alone, the cost of this model is significantly higher than that of the baseline model.

In [15]:
C_rf

708.6065525032902

In [16]:
profit_gain_metric(C_rf, A_rf)

-17.001669933560045

We store this profit-gain metric so that the classifiers can be compared later.

In [17]:
result = {}
result['Random Forest'] = profit_gain_metric(C_rf, A_rf)

****
##### Profit-Gain Evaluation for a Random Forest Classifier with Truncated SVD

In [18]:
rf_pipeline2 = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('svd', TruncatedSVD(n_components=300)),
        ('clf', RandomForestClassifier(n_estimators=1000))])

rf_pipeline2.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

In [19]:
A_rf2 = rf_pipeline2.score(mydata_test.data, mydata_test.target)
A_rf2

0.5330589484864577

In [20]:
results = %timeit -r 3 -n 10 -o rf_pipeline2.predict(mydata_test.data)

13.1 s ± 2.69 s per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [21]:
# determining T
min_runtime = results.best
T_rf2 = min_runtime / n_data_pts

# determining M
M_rf2 = asizeof.asizeof(rf_pipeline2) # in bytes
M_rf2 /= 10*6 # bytes/60 (10*6 == 60, not 10**6 for MB); same factor as the baseline

#calculate Cost for this classifier
C_rf2 = M_rf2*T_rf2
C_rf2

330.59489917174403

In [22]:
profit_gain_metric(C_rf2, A_rf2)

-7.308293489008655

In [23]:
result['Random Forest 2'] = profit_gain_metric(C_rf2, A_rf2)

****
##### Profit-Gain Evaluation for a Random Forest Classifier, 100 trees only


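Question: a random forest with fewer trees and no Truncated SVD step could have a cost below the baseline cost because of its shorter runtime. Is such a case ($C_2 < C$) disallowed under this framework, given the assumption that $C_2 \ge C$? Note that in the run recorded below, the measured cost (about 102) is still above the baseline cost (about 61).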

In [24]:
rf_pipeline3 = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', RandomForestClassifier(n_estimators=100))])

rf_pipeline3.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

In [25]:
A_rf3 = rf_pipeline3.score(mydata_test.data, mydata_test.target)
A_rf3

0.5896176314391928

In [26]:
results = %timeit -r 3 -n 10 -o rf_pipeline3.predict(mydata_test.data)

3.42 s ± 317 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [27]:
# determining T
min_runtime = results.best
T_rf3 = min_runtime / n_data_pts

# determining M
M_rf3 = asizeof.asizeof(rf_pipeline3) # in bytes
M_rf3 /= 10*6 # bytes/60 (10*6 == 60, not 10**6 for MB); same factor as the baseline

#calculate Cost for this classifier
C_rf3 = M_rf3*T_rf3

C_rf3

101.96664650886125

In [28]:
profit_gain_metric(C_rf3, A_rf3)

-1.1451780996034888

In [29]:
result['Random Forest 3'] = profit_gain_metric(C_rf3, A_rf3)

In [30]:
C_rf3

101.96664650886125

****
##### Profit-Gain Evaluation for SVM

Here we employ a linear support vector machine (SVM), as done in the scikit-learn tutorial.

In [44]:
from sklearn.linear_model import SGDClassifier

svm_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', SGDClassifier(loss='hinge', penalty='l2',
                              alpha=1e-3, random_state=42,
                              max_iter=5, tol=None))])

svm_pipeline.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='...
                ('clf',
                 SGDClassifier(alpha=0.001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', lo

In [45]:
A_svm = svm_pipeline.score(mydata_test.data, mydata_test.target)
A_svm

0.6829527349973447

In [46]:
results = %timeit -r 3 -n 10 -o svm_pipeline.predict(mydata_test.data)

1.82 s ± 184 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [47]:
# determining T
min_runtime = results.best
T_svm = min_runtime / n_data_pts

# determining M
M_svm = asizeof.asizeof(svm_pipeline) # in bytes
M_svm /= 10*6 # bytes/60 (10*6 == 60, not 10**6 for MB); same factor as the baseline

#calculate Cost for this classifier
C_svm = M_svm*T_svm

C_svm

116.14750826236896

In [48]:
profit_gain_metric(C_svm, A_svm)

-1.274880276495956
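To compare the classifiers side by side, the metric can be recomputed from the accuracy and cost values recorded in the outputs above. This is a standalone sketch. The dictionary `recorded` and the key `'SVM'` are names introduced here for illustration, and the numbers are copied from the cells above.

```python
# Baseline values recorded earlier in this notebook.
A = 0.6165693043016464
C = 61.11465403725857

# (cost, accuracy) pairs recorded for each classifier.
recorded = {
    'Random Forest': (708.6065525032902, 0.6299787573021773),
    'Random Forest 2': (330.59489917174403, 0.5330589484864577),
    'Random Forest 3': (101.96664650886125, 0.5896176314391928),
    'SVM': (116.14750826236896, 0.6829527349973447),
}


def profit_gain_metric(C_2, A_2):
    """Same formula as the notebook's metric, using the baseline A and C above."""
    return (C * A_2 - C_2 * A) / (C * (1 - A))


scores = {name: profit_gain_metric(C_2, A_2)
          for name, (C_2, A_2) in recorded.items()}

# Rank from best (highest metric) to worst.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f'{name}: {score:.3f}')
```

All four metrics are negative, so under this framework none of these classifiers recovers its extra cost relative to the logistic regression baseline. The SVM has the highest accuracy, but its cost places it slightly below the 100-tree random forest.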

## Saving the Logistic Regression Pipeline for Future Use


In [54]:
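import pickle
# pickle.dump returns None, so its result is not assigned;
# the `with` block closes the file once the dump completes
with open('lr_pipeline.pkl', 'wb') as file:
    pickle.dump(lr_pipeline, file)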

In [58]:
with open('lr_pipeline.pkl', 'rb') as file:
    test = pickle.load(file)

In [59]:
test.score(mydata_test.data, mydata_test.target)

0.6165693043016464
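The reloaded pipeline reproduces the baseline accuracy, which suggests the round trip preserved the fitted model. The same save-and-load pattern can be sketched without scikit-learn. Here the `model` dictionary is a hypothetical stand-in for a fitted pipeline, and a temporary directory is used so that no file is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline: any picklable object works.
model = {'name': 'lr_pipeline', 'n_components': 300}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')

    # The context manager closes the file even if dump() raises.
    with open(path, 'wb') as fh:
        pickle.dump(model, fh)

    with open(path, 'rb') as fh:
        restored = pickle.load(fh)

print(restored)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.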