<a href="https://colab.research.google.com/github/sho-kawano/profit-gain/blob/master/Main_Notebook_20_NewsGroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
!pip install ktrain
!pip install pympler
!pip install memory_profiler



## Profit-Gain Framework for ML Model Evaluation

This notebook is a first attempt to demonstrate the Profit-Gain framework.  Machine-learning and deep-learning model selection and comparison have long relied on measures of accuracy.  However, accuracy metrics alone do not allow for a practical cost-benefit analysis when choosing one model over another.  The Profit-Gain framework aims to address this problem through a simple method of evaluation.

In the first section, we introduce the framework.  Then we use it to evaluate a few different classifiers in the next section.
*****

## The Profit-Gain Framework


##### Annual Cost of Meeting Data Processing Quota

Say that our quota requirement is to process $D$ data points per second.  As the model takes $T$ seconds to process a single data point, we need $DT$ parallel instances of the model to fulfill this requirement.

Each instance of the model takes up $M$ megabytes of memory, and each megabyte costs $d$ dollars per second on the cloud.  Therefore, the memory required to fulfill our requirement is $MDT$ megabytes in total, and the cost of hosting this is $d \,(MDT)$ per second.

Multiplying by the number of seconds in a year, our annual cost is $kMT$, where $k$ is a constant that depends only on $D$ and $d$.
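
As a rough illustration of the cost argument, the sketch below computes the annual cost both step by step and as $kMT$. All numeric inputs are hypothetical, and taking $k = d \cdot D \cdot (\text{seconds per year})$ is my reading of "a constant based on $D$ and $d$", not something fixed by the framework.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def annual_cost(D, T, M, d):
    """Annual hosting cost for a quota of D data points per second."""
    instances = D * T                  # parallel model instances needed
    memory_mb = M * instances          # total memory footprint in MB
    cost_per_second = d * memory_mb    # dollars per second
    return cost_per_second * SECONDS_PER_YEAR


# Hypothetical inputs, chosen only for illustration
D = 100      # data points per second
T = 0.005    # seconds per data point
M = 45.0     # megabytes per model instance
d = 1e-9     # dollars per megabyte per second (assumed pricing unit)

k = d * D * SECONDS_PER_YEAR           # assumed form of the constant k
print(annual_cost(D, T, M, d))         # step-by-step cost
print(k * M * T)                       # same quantity written as k*M*T
```

Because $k$ does not depend on the model, only the product $MT$ matters when comparing the cost of two models.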

##### Model Revenue

As before, we are required to process $D$ data points per second. We make $R$ dollars from each accurate prediction, and our model accuracy is $A$. Therefore, the revenue generated by our model is $ADR$ per second. Note that a perfect model ($A = 1$) yields a revenue of $DR$ per second.

Annual model revenue is thus $jAR$, where $j$ is a constant that depends only on $D$.


##### Evaluating the Baseline

Let's assume the baseline model yields no profit.
Therefore, annual costs equal revenues.

In other words,  $$k MT \hspace{3mm} \text{(annual cost)} = j A R \hspace{3mm} \text{(annual revenue)}.$$
We can reformulate this as $RA = hMT$ where $h= k\,/\,j$.

Hence we can solve for $R$
$$R = \frac{hMT}{A}$$

which is the revenue generated from a single correctly processed data point for a baseline model.
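
The break-even assumption can be checked numerically. The sketch below uses hypothetical inputs and assumes $k = d \cdot D \cdot s$ and $j = D \cdot s$, where $s$ is the number of seconds in a year; under that reading $h = k/j$ reduces to $d$.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Hypothetical baseline, chosen only for illustration
D = 100      # data points per second
T = 0.005    # seconds per data point
M = 45.0     # megabytes per model instance
d = 1e-9     # dollars per megabyte per second (assumed)
A = 0.63     # baseline accuracy

k = d * D * SECONDS_PER_YEAR   # assumed cost constant
j = D * SECONDS_PER_YEAR       # assumed revenue constant
h = k / j                      # reduces to d under these assumptions

R = h * M * T / A              # break-even revenue per correct prediction

annual_cost = k * M * T
annual_revenue = j * A * R
print(annual_cost, annual_revenue)  # the two should agree up to rounding
```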

##### Comparing ML Models

Assume that we have some fancy new ML Model.  It performs with an accuracy of $A_2$ where $A_2 > A$ and $A$ is the baseline accuracy.

The annual revenue of this new model equals

$$j A_2 R = j A_2 \big(\frac{h MT}{A} \big) = j h MT (A_2/A) = j (k /j) MT (A_2/A)  =  k MT (A_2/A)$$

Furthermore, the model has its own $M_2$ and $T_2$. The annual cost of the model is $kM_2T_2$.

Therefore, the model profit equals $kMT(A_2 / A) - kM_2T_2 = k(MT(A_2 / A) - M_2T_2)$.

Let $C = MT$ and $C_2 = M_2T_2$.
Thus, the profit of this new model equals

$$k(C(A_2 / A) - C_2).$$

##### Maximum Profit
Recall that $C$ is our baseline cost. We assume the new model's cost never drops below the baseline ($C_2 \ge C$). Thus there is a ceiling to our profit! It is reached when $A_2 = 1.0$ (we achieve perfect accuracy) and $C_2 = C$ (costs do not go up with the new model).

In other words, our maximum possible profit is $kC / A - kC$.
Therefore the maximum profit is $kC(1/A - 1)$

The fraction of the maximum profit achieved is

$$\frac{k(C(A_2 / A) - C_2)}{kC(1/A - 1)}= \frac{C(A_2 / A) - C_2} { C(1/A - 1)} .$$

Multiplying the numerator and the denominator by $A$ yields

$$\frac{CA_2 - C_2A}{C(1 - A)}.$$

At baseline, this metric equals zero.

When $CA_2 - C_2 A < 0$, the metric is less than zero, meaning the new model is less profitable than the baseline.

In [None]:
def profit_gain_metric(C_2, A_2):
    global A, C
    result = C*A_2 - C_2*A
    result /= C*(1-A)

    return result
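
As a sanity check on the derivation, here is a minimal sketch of the same metric written without globals (the four-argument signature is my own, not the notebook's), with hypothetical inputs. It confirms that the metric is zero at the baseline, one at the ceiling, and equal to profit divided by maximum profit for an arbitrary $k$.

```python
def profit_gain(C, A, C_2, A_2):
    """Fraction of the maximum profit achieved by a model with cost C_2
    and accuracy A_2, relative to a baseline with cost C and accuracy A."""
    if not 0 <= A < 1:
        raise ValueError("baseline accuracy A must satisfy 0 <= A < 1")
    return (C * A_2 - C_2 * A) / (C * (1 - A))


# Hypothetical baseline and candidate, chosen only for illustration
C, A = 0.25, 0.6
C_2, A_2 = 0.4, 0.8

at_baseline = profit_gain(C, A, C, A)     # same cost, same accuracy
at_ceiling = profit_gain(C, A, C, 1.0)    # same cost, perfect accuracy

# The constant k cancels, so any positive value gives the same ratio
k = 3.0
profit = k * (C * (A_2 / A) - C_2)
max_profit = k * C * (1 / A - 1)

print(at_baseline, at_ceiling)
print(profit_gain(C, A, C_2, A_2), profit / max_profit)
```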


*****

## Demonstrating the Framework




Here we demonstrate the framework using six different classifiers:
1. Logistic Regression (Baseline)
2. Random Forest (1000 trees)
3. Random Forest (10 trees)
4. SVM
5. BERT
6. DistilBERT (Hugging Face Transformer)

##### Loading Dataset


Again, we use the 20 Newsgroups data.  Background information on this dataset can be [found here](http://qwone.com/~jason/20Newsgroups/).

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

mydata_train = fetch_20newsgroups(subset='train', shuffle=True,
                                  remove = ('headers', 'footers', 'quotes'),
                                  random_state=42)

mydata_test = fetch_20newsgroups(subset='test', shuffle=True,
                                 remove = ('headers', 'footers', 'quotes'),
                                 random_state=42)

print('size of training set: %s' % (len(mydata_train['data'])))
print('size of validation set: %s' % (len(mydata_test['data'])))

size of training set: 11314
size of validation set: 7532


This is a quick DataFrame view of this data:

In [None]:
import pandas as pd
mydata_train_df = pd.DataFrame({'data': mydata_train.data, 'target': mydata_train.target})
mydata_train_df.head()

| | data | target |
|---|---|---|
| 0 | I was wondering if anyone out there could enli... | 7 |
| 1 | A fair number of brave souls who upgraded thei... | 4 |
| 2 | well folks, my mac plus finally gave up the gh... | 4 |
| 3 | \nDo you have Weitek's address/phone number? ... | 1 |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | 14 |


##### Establishing the Baseline: Logistic Regression



Here we establish our baseline model, a logistic regression whose pipeline incorporates TF-IDF and a truncated SVD.

This model will be used to establish the baseline cost $C=MT$ and baseline accuracy $A$.  We first calculate the weight of each class (the fraction of training documents belonging to it), which is intended to improve predictive accuracy.



In [None]:
weights = mydata_train_df.groupby(mydata_train_df.target).count()
weights.data /= len(mydata_train_df)
weights = weights.to_dict()
weights = weights['data']
weights

{0: 0.04242531377054976,
 1: 0.05161746508750221,
 2: 0.0522361675799894,
 3: 0.05214778150963408,
 4: 0.05108714866537034,
 5: 0.05241293972070002,
 6: 0.05170585115785752,
 7: 0.05250132579105533,
 8: 0.05285487007247658,
 9: 0.052766484002121264,
 10: 0.0530316422131872,
 11: 0.05258971186141064,
 12: 0.0522361675799894,
 13: 0.05250132579105533,
 14: 0.05241293972070002,
 15: 0.052943256142831886,
 16: 0.048258794414000356,
 17: 0.04984974368039597,
 18: 0.041099522715220084,
 19: 0.033321548523952624}
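
For reference, the weights above are plain class frequencies. The sketch below reproduces that calculation on a toy label list using only the standard library, and shows the inverse-frequency alternative next to it. As far as I know, `n_samples / (n_classes * count)` is the heuristic scikit-learn applies for `class_weight='balanced'`; the toy labels and helper names are hypothetical.

```python
from collections import Counter


def class_frequencies(labels):
    """Fraction of samples per class (the weighting used in this notebook)."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in sorted(counts.items())}


def inverse_frequency_weights(labels):
    """Weights of n_samples / (n_classes * count): rare classes weigh more."""
    counts = Counter(labels)
    n, n_classes = len(labels), len(counts)
    return {label: n / (n_classes * count)
            for label, count in sorted(counts.items())}


# Toy labels, chosen only for illustration
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]

print(class_frequencies(labels))
print(inverse_frequency_weights(labels))
```

Frequency weights give slightly more influence to common classes, while inverse-frequency weights do the opposite; which choice helps on this dataset would need to be tested.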

I used the following settings with the TF-IDF step:

*   `min_df` is the minimum number of documents a word must be present in to be kept.  I set this to 5.
*   `norm` is set to `l2`, to ensure all our feature vectors have a Euclidean norm of 1.
*   `ngram_range` is set to `(1, 2)` to indicate that we want to consider both unigrams and bigrams.
*   `stop_words` is set to `"english"` to remove common stop words ("a", "the", ...) and reduce the number of noisy features.

The classifier itself takes `class_weight`, which is set to the class weights computed above.

We also use a `StandardScaler`, which helps classifiers like logistic regression.

Now we create the pipeline:

In [None]:
# used for all classifier pipelines
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# creating the logistic regression pipeline
from sklearn.linear_model import LogisticRegression

lr_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                                  encoding='latin-1', ngram_range=(1, 2),
                                  stop_words='english')),
        ('svd', TruncatedSVD(n_components=100)),
        ('std', StandardScaler(with_mean=False)),
        ('clf', LogisticRegression(multi_class='auto',
                                   max_iter=1000,
                                   class_weight=weights,
                                   solver='lbfgs'))])

# fitting the pipeline on the training data
lr_pipeline.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='latin-1', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=5, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=True,
                                 token_pat...
                                                  14: 0.05241293972070002,
                                                  15: 0.052943256142831886,
                                                  16: 0.048258794414000356,
                                                  17: 0.04984974368039597,


*Setting Baseline Accuracy*
Now we establish the baseline accuracy $A$ by scoring this classifier on the test data.

In [None]:
A = lr_pipeline.score(mydata_test.data, mydata_test.target)
A

0.6328996282527881

*Setting Baseline Cost*

Recall again that the cost variable $C$ is a product of $M$ and $T$ where $M$ is the size of the model (in megabytes) and $T$ is the time it takes the model to compute a single data point.  


We first measure the runtime of computing a prediction for a single data point drawn from the testing dataset.  Here we use the `%timeit` magic, which is built on the `timeit` module.

The `-n` parameter controls how many executions are done for each timing, which gives representative timings; the `-r` (repeat) parameter controls how many timings are done, which gives accurate statistics.  
The [documentation](https://docs.python.org/3/library/timeit.html#timeit.Timer.repeat) for `timeit` states the following:

> Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. *However, this is not very useful*. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.

Therefore we take the best (minimum) runtime out of the measurements.
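
Outside of a notebook, the same "best of several repeats" measurement can be sketched with the standard `timeit.repeat` function. The helper name and the toy workload below are hypothetical stand-ins for a real `predict` call.

```python
import timeit


def best_time_per_call(func, number=1000, repeat=5):
    """Return the minimum per-call runtime (seconds) over `repeat` timings,
    each of which executes `func` `number` times."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


def toy_predict():
    # Hypothetical stand-in for pipeline.predict([...]) on one data point
    return sum(i * i for i in range(100))


T_toy = best_time_per_call(toy_predict, number=1000, repeat=5)
print(f"{T_toy * 1e3:.4f} ms per call")
```

Each entry returned by `timeit.repeat` is the total time for `number` executions, so dividing the minimum by `number` gives a per-call figure comparable to the "best of 5" value reported by `%timeit`.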

In [None]:
import random
# times the prediction for a single random data point: 1000 loops x 5 repeats
results = %timeit -r 5 -n 1000 -o lr_pipeline.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 4.99 ms per loop


In [None]:
results.best

0.0049936129030002125

Now we use the measured minimum runtime to calculate $T$.

In [None]:
#if you want to skip the timeit part
min_runtime = results.best
T = min_runtime

Now we use pickling to measure the size of the pipeline, as suggested on [Stack Overflow](https://stackoverflow.com/questions/45601897/how-to-calculate-the-actual-size-of-a-fit-trained-model-in-sklearn).

In [None]:
import pickle
import sys

p = pickle.dumps(lr_pipeline)
M = sys.getsizeof(p)
M /= 10**6 # convert to MB
M

45.422564
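
A small sketch of the size measurement on a toy object (a hypothetical stand-in for a fitted pipeline) is shown below. It also compares `sys.getsizeof` with `len`: on CPython, `sys.getsizeof` includes the small fixed overhead of the `bytes` object itself, so the two figures differ by a few bytes, which is negligible at the megabyte scale used here.

```python
import pickle
import sys

# Hypothetical stand-in for a fitted model
toy_model = {"weights": [0.1] * 10_000, "labels": list(range(20))}

payload = pickle.dumps(toy_model)

size_payload = len(payload)            # bytes of serialized data only
size_object = sys.getsizeof(payload)   # serialized data plus object overhead

print(size_payload / 10**6, "MB (len)")
print(size_object / 10**6, "MB (sys.getsizeof)")

# Round-trip check: the payload really contains the whole object
assert pickle.loads(payload) == toy_model
```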

Finally, we have our baseline cost, $C$.   

In [None]:
C = M*T
C

0.22682270167775295

In [None]:
result = []
#               model                 p-g metric,       accuracy, cost, memory, T
result.append(['Logistic Regression',  profit_gain_metric(C, A), A, C,  M, T])


##### Random Forest Classifier (1000 trees)

In [None]:
from sklearn.ensemble import RandomForestClassifier
# rf_pipeline = Pipeline([
#         ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
#                                   encoding='latin-1', ngram_range=(1, 2),
#                                   stop_words='english')),
#         ('svd', TruncatedSVD(n_components=100)),
#         ('clf', RandomForestClassifier(n_estimators=1000,
#                                        class_weight=weights))])

# rf_pipeline.fit(mydata_train.data, mydata_train.target)

Here we pickle this model to save it for future use.

In [None]:
# import pickle
# filename = '/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/RF100_predictor/rf_pipeline.sav'
# pickle.dump(rf_pipeline, open(filename, 'wb'))

Load the pickled model.

In [None]:
import pickle
filename = '/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/RF100_predictor/rf_pipeline.sav'
rf_pipeline =  pickle.load(open(filename, 'rb'))

In [None]:
A_rf = rf_pipeline.score(mydata_test.data, mydata_test.target)
A_rf

0.5882899628252788

In [None]:
results = %timeit -r 5 -n 1000 -o rf_pipeline.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 69.7 ms per loop


In [None]:
# determining T
min_runtime = results.best
T_rf = min_runtime

# determining M
p = pickle.dumps(rf_pipeline)
M_rf = sys.getsizeof(p)
M_rf /= 10**6 # convert to MB
M_rf

#calculate Cost for this classifier
C_rf = M_rf * T_rf
C_rf

86.98710487480464

In [None]:
profit_gain_metric(C_rf, A_rf)

-659.5754373374436

In [None]:
#               model              profit-gain metric,         accuracy, cost, mem, time
result.append(['RF 1000 trees',  profit_gain_metric(C_rf, A_rf), A_rf, C_rf, M_rf, T_rf ] )

##### Random Forest Classifier (10 trees)


In [None]:
rf_pipeline2 = Pipeline([
        ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                                  encoding='latin-1', ngram_range=(1, 2),
                                  stop_words='english')),
        ('svd', TruncatedSVD(n_components=100)),
        ('clf', RandomForestClassifier(n_estimators=10,
                                       class_weight=weights))])

rf_pipeline2.fit(mydata_train.data, mydata_train.target)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='latin-1', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=5, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=True,
                                 token_pat...
                                                      18: 0.041099522715220084,
                                                      19: 0.033321548523952624},
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
      

In [None]:
A_rf2 = rf_pipeline2.score(mydata_test.data, mydata_test.target)
A_rf2

0.47703133297928835

In [None]:
results = %timeit  -r 5 -n 1000 -o rf_pipeline2.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 6.66 ms per loop


In [None]:
# determining T
min_runtime = results.best
T_rf2 = min_runtime

# determining M
p = pickle.dumps(rf_pipeline2)
M_rf2 = sys.getsizeof(p)
M_rf2 /= 10**6 # convert to MB
M_rf2

#calculate Cost for this classifier
C_rf2 = M_rf2 * T_rf2

C_rf2

0.3815060638589327

In [None]:
profit_gain_metric(C_rf2, A_rf2)

-1.6003217767875293

In [None]:
#               model
result.append(['RF  10 trees',  profit_gain_metric(C_rf2, A_rf2), A_rf2, C_rf2, M_rf2, T_rf2] )


##### SVM

Here we employ a linear support vector machine (SVM), as done in the scikit-learn tutorial.

In [None]:
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                                  encoding='latin-1', ngram_range=(1, 2),
                                  stop_words='english')),
        ('svd', TruncatedSVD(n_components=100)),
        ('std', StandardScaler(with_mean=False)),
        ('clf', LinearSVC(random_state=0, tol=1e-5))])

svm_pipeline.fit(mydata_train.data, mydata_train.target)



Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='latin-1', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=5, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=True,
                                 token_pat...
                 TruncatedSVD(algorithm='randomized', n_components=100,
                              n_iter=5, random_state=None, tol=0.0)),
                ('std',
                 StandardScaler(copy=True, with_mean=False, with_std=True)),
                ('clf',
                 LinearSVC(C=1.0, c

In [None]:
A_svm = svm_pipeline.score(mydata_test.data, mydata_test.target)
A_svm

0.6323685608072225

In [None]:
results = %timeit  -r 5 -n 1000 -o svm_pipeline.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 5.35 ms per loop


In [None]:
# determining T
min_runtime = results.best
T_svm = min_runtime

# used for the old approach - this takes too long for some models
# T_svm = min_runtime / n_data_pts

# determining M
p = pickle.dumps(svm_pipeline)
M_svm = sys.getsizeof(p)
M_svm /= 10**6 # convert to MB
M_svm

#calculate Cost for this classifier
C_svm = M_svm * T_svm

C_svm

0.2431809704121839

In [None]:
profit_gain_metric(C_svm, A_svm)

-0.125783783815314

In [None]:
#               model,  profit-gain metric,         accuracy, cost
result.append(['SVM',  profit_gain_metric(C_svm, A_svm), A_svm, C_svm, M_svm, T_svm] )

##### BERT (Full Model)



Here we load the BERT model trained in another notebook. We established there that the accuracy was 87%, a substantial improvement over the baseline model.

In [None]:
# import random
# A = 0.6202867764206054
# C = 0.36625090748935357

In [None]:
import ktrain
bert_predictor = ktrain.load_predictor('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/BERT_predictor')
A_bert = 0.87

In [None]:
results = %timeit -r 5 -n 1000 -o bert_predictor.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 74.7 ms per loop


I use `os` to find the size of the saved model and the preprocessor.

As you can verify from the models saved in the Google Drive folder, the size of the BERT model is around 1.3 GB.

In [None]:
import os
model = os.path.getsize('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/BERT_predictor/tf_model.h5')
prepoc = os.path.getsize('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/BERT_predictor/tf_model.preproc')

M_bert = model + prepoc
M_bert /= 10**6 # convert to MB
M_bert

1314.291219

In [None]:
# determining T
min_runtime = results.best
T_bert = min_runtime

Finally we have the cost of the BERT model, $C_{bert} = M_{bert}T_{bert}$.

In [None]:
C_bert = T_bert*M_bert
C_bert

98.12523454285277

In [None]:
profit_gain_metric(C_bert, A_bert)

-743.4675581187139

In [None]:
#               model,  profit-gain metric,                      accuracy, cost
result.append(['BERT',  profit_gain_metric(C_bert, A_bert), A_bert, C_bert, M_bert, T_bert] )

##### DistilBERT (Hugging Face)

Here we load a Hugging Face transformer that was trained in [another notebook](https://colab.research.google.com/drive/1gS_h_HlZNaualX9w2Rlz5w2WvsN1fpA_?usp=sharing).  We established in that notebook that the accuracy was 71%.

In [None]:
dbert_predictor = ktrain.load_predictor('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/DistilBERT_predictor')
A_dbert = 0.71

In [None]:
results = %timeit -r 5 -n 1000 -o dbert_predictor.predict([random.choice(mydata_test.data)])

1000 loops, best of 5: 94.1 ms per loop


In [None]:
# determining T
min_runtime = results.best
T_dbert = min_runtime

In [None]:
model = os.path.getsize('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/DistilBERT_predictor/tf_model.h5')
prepoc = os.path.getsize('/content/gdrive/My Drive/Colab Notebooks/ML_Profit-Gain/20NewsGroups/DistilBERT_predictor/tf_model.preproc')

M_dbert = model + prepoc
M_dbert /= 10**6 # convert to MB
M_dbert

268.006463

In [None]:
C_dbert = T_dbert*M_dbert
C_dbert

25.221973338559305

In [None]:
profit_gain_metric(C_dbert, A_dbert)

-189.7749495404849

In [None]:
#               model,  profit-gain metric,                      accuracy, cost
result.append(['DistilBERT',  profit_gain_metric(C_dbert, A_dbert), A_dbert, C_dbert, M_dbert, T_dbert] )

## Preliminary Results




In [None]:
# Create the pandas DataFrame
df = pd.DataFrame(result, columns = ['model', 'profit-gain', 'accuracy', 'cost', 'memory', 'runtime'])
df

| | model | profit-gain | accuracy | cost | memory | runtime |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0 | 0.6329 | 0.226823 | 45.422564 | 0.004994 |
| 1 | RF 1000 trees | -659.575437 | 0.58829 | 86.987105 | 1247.129869 | 0.06975 |
| 2 | RF  10 trees | -1.600322 | 0.477031 | 0.381506 | 57.320606 | 0.006656 |
| 3 | SVM | -0.125784 | 0.632369 | 0.243181 | 45.422239 | 0.005354 |
| 4 | BERT | -743.467558 | 0.87 | 98.125235 | 1314.291219 | 0.07466 |
| 5 | DistilBERT | -189.77495 | 0.71 | 25.221973 | 268.006463 | 0.09411 |


These are the preliminary results.  Notably, every classifier performed worse than the baseline (all non-baseline metrics are negative).  The results suggest that this framework heavily penalizes high cost even when predictive accuracy improves significantly: BERT, for example, has roughly 430 times the cost of the baseline.  The accuracy of the random forest classifiers should probably be scrutinized, given that both score below the baseline.  Interestingly, SVM is the top performer among the non-baseline classifiers.



*   Accuracy is based on validation on the test dataset.
*   Memory is in Megabytes.
*   Runtime (in seconds) was established by timing the execution on a single data point (5 trials of 1000 repeated measurements on a GPU).  I chose the "best" execution time as the statistic here, as suggested by the `timeit` documentation.
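
To make the size of the penalty concrete, the sketch below recomputes the metric from the rounded values in the results table and derives a break-even cost. Setting the numerator $CA_2 - C_2A$ to zero gives $C_2 = C A_2 / A$, the largest cost at which a model with accuracy $A_2$ still matches the baseline. The function names are my own, and the inputs are the rounded table values, so the outputs only approximate the figures reported above.

```python
def profit_gain(C, A, C_2, A_2):
    """Profit-gain metric relative to a baseline with cost C and accuracy A."""
    return (C * A_2 - C_2 * A) / (C * (1 - A))


def break_even_cost(C, A, A_2):
    """Largest cost C_2 for which the profit-gain metric is non-negative."""
    return C * A_2 / A


# Baseline values, rounded as in the results table
C, A = 0.226823, 0.6329

# (cost, accuracy) pairs, rounded as in the results table
candidates = {
    "BERT": (98.125235, 0.87),
    "DistilBERT": (25.221973, 0.71),
}

summary = {}
for name, (C_2, A_2) in candidates.items():
    metric = profit_gain(C, A, C_2, A_2)
    limit = break_even_cost(C, A, A_2)
    summary[name] = (metric, limit)
    print(f"{name}: metric={metric:.2f}, "
          f"break-even cost={limit:.4f}, measured cost={C_2:.2f}")
```

With these rounded inputs, BERT would need a cost of about 0.31 (roughly 1.4 times the baseline) to break even, while its measured cost is about 98.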









In [None]:
df.sort_values('profit-gain', ascending=False)

| | model | profit-gain | accuracy | cost | memory | runtime |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0 | 0.6329 | 0.226823 | 45.422564 | 0.004994 |
| 3 | SVM | -0.125784 | 0.632369 | 0.243181 | 45.422239 | 0.005354 |
| 2 | RF  10 trees | -1.600322 | 0.477031 | 0.381506 | 57.320606 | 0.006656 |
| 5 | DistilBERT | -189.77495 | 0.71 | 25.221973 | 268.006463 | 0.09411 |
| 1 | RF 1000 trees | -659.575437 | 0.58829 | 86.987105 | 1247.129869 | 0.06975 |
| 4 | BERT | -743.467558 | 0.87 | 98.125235 | 1314.291219 | 0.07466 |
