# Understanding and Correcting Bias in Modern Machine Learning



A few years ago, I was made aware of a new software product that corrected bias in machine learning models.  My initial response?  "What the heck is bias in machine learning models?"  I seriously had no idea.  Of course, there are different types of bias when doing Data Science, but I didn't understand what problem the product was solving.  Since then, I have seen the term "machine learning model bias" used repeatedly.  I decided I couldn't stay in the dark forever, so I set out to do a little research on the topic.

After doing some research, I think I understand what most people mean when they say a machine learning model has "bias".  I say most because many people use the term and have no idea what they are talking about.  "Bias" is not a new problem, just a new name (or at least new to me).  I don't like that we call the issue at hand "bias".  The term is too general and applies to multiple problems in multiple contexts.  For example, you can have confirmation bias when you misinterpret data.  Also, I learned many years ago in statistics that an estimator is biased when its expected value differs from the population parameter.  Terms like heteroskedasticity are hard to say, but there is no doubt what I mean when I use that specific word.  Overusing simple words like "bias" just confuses everyone.
 

Generally, when someone talks about "bias" in a machine learning model, it is in the context of gender, racial, or ethnic discrimination.  If a machine learning model reflects historical discrimination against a cohort of people, it is "biased".  It isn't that machine learning is prejudiced against anyone, but the people, organizations, or social constructs that generated the historical data were.  This means that a machine learning model based on a history of excluding specific cohorts will replicate that behavior.

Here in North America, the statement above edges into a bit of controversy.  The idea of historical discrimination and how it impacts life today is a  sensitive topic.  I'll do my best to steer clear of saying anything controversial and attempt to focus on the use cases from an objective point of view.  

To stay neutral and non-political, I will examine this issue from a broader perspective.  As I mentioned, the term "bias" is somewhat new to me, but the general idea that the historical record may not meet your current objective isn't.  That's the core issue.  Machine learning models can be problematic when organizations seek to achieve new goals that differ from the past.  For example, in the case of gender discrimination in tech hiring, if the historical record of hires lacks females, a machine learning model may conclude that females are not worthy candidates and predict them as less hireable.   The model does not know that women were denied employment in tech roles based on their gender, not their ability to do the job.  Machine Learning is a matter-of-fact kind of thing.  It can not automatically adjust itself to correct the fact that previous managers failed to hire qualified females.  Note, this is only a problem if the organization wants more females in its workforce.  If it is happy hiring men as it did in the past, a machine learning model using the historical data will be acceptable.  

So, this article is about handling situations when your historical repository does not reflect how you want to do business today. 

Generally, there are three ways you can adjust your models when the current objectives differ from the historical data.  One, you can monkey with the scoring data.  Two, you can monkey with the modeling sample.  Or, three, you can monkey with the output data.

Note that none of these three fixes is perfect.  Each is just a way to get actionable results in a less-than-ideal situation.  Which method makes the most sense depends on the context of the problem.

I will be looking at two different use cases to highlight the methods of correcting "bias".  The first is new market expansion.  This is a classic example of when the historical record does not reflect the current objectives of the business.  The second is a discrimination example, where a particular group has had minimal access historically.  




Note that all of the data used in this article are 100% fake.  I created the data, and 100% own it. 

## Table of contents

1. [Getting Setup](#10)<br>
    1.1 [Import Relevant Libraries](#11)<br>

2. [New Market Expansion](#20)<br>
    2.1 [Import Data From GitHub](#21)<br>
    2.2 [Examining the Data](#22)<br>
    2.3 [Machine Learning Model on the Raw Data](#23)<br>
    2.4 [Monkey with the Scoring Data](#24)<br>
    2.5 [Monkey with the Sample](#25)<br>
    
3. [Historical Discrimination](#30)<br>
    3.1 [Pull in the data from GitHub](#31)<br>
    3.2 [Examine the results](#32)<br>
    3.3 [Build a Machine Learning Model to predict the best prospects](#33)<br>
    3.4 [Monkey with the Output Data](#34)<br>
    
4. [Conclusion](#40)<br>
 

## 1.0 Getting Setup <a id="10"></a>

 #### 1.1 Import relevant libraries <a id="11"></a>

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.metrics import mean_squared_error

from IPython.display import Image
from IPython.core.display import HTML 


#Uncomment these options if you want to expand the number of rows and columns you see in the notebook.
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

##  2.0 New Market Expansion <a id="20"></a>

Barney BA and Associates (BBAA) operates a successful carpet cleaning business.  Their business model is pretty simple.  Their marketing budget is limited, and they have found great success by mailing coupons to prospects.  Consumers open their mailbox, find a coupon, realize their carpets are dirty, and call BBAA for an appointment.  Because the budget is limited, BBAA cannot mail coupons to every household in their target market.  Instead, they use a machine learning model to predict which consumers are most likely to need a carpet cleaning, and they only mail to the top 30% of the prospects.  Focusing on the 30% most likely to buy lets them get the most out of their marketing dollars.

Karma, Yuba, and Yarnaby are three suburban communities in a larger metro area.  BBAA focuses on Yuba and Karma, but they are looking to expand into Yarnaby.  Note that BBAA has not spent any marketing dollars in Yarnaby or focused on doing business there.  They do have a few customers in Yarnaby, however.  These customers came from referrals, or they were already BBAA customers in Karma or Yuba and then moved to Yarnaby.

Consistent with their business model, BBAA would like machine learning to guide their market expansion.

##### 2.1 Import Data From GitHub <a id="21"></a>

In [2]:
!rm markets.csv
!wget https://raw.githubusercontent.com/shadgriffin/ml_bias/main/markets.csv
    
df = pd.read_csv("markets.csv", sep=",", header=0)



--2022-01-07 22:31:32--  https://raw.githubusercontent.com/shadgriffin/ml_bias/main/markets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5809035 (5.5M) [text/plain]
Saving to: ‘markets.csv’


2022-01-07 22:31:32 (176 MB/s) - ‘markets.csv’ saved [5809035/5809035]
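If you are not running in a notebook with shell access, the download can also be done from Python.  The helper below is a minimal sketch, not part of the original notebook: `load_csv_cached` is a name I made up, and unlike the `!rm` / `!wget` pair above it reuses a local copy when one already exists instead of downloading again.

```python
import os
import urllib.request

import pandas as pd


def load_csv_cached(url, path):
    """Download `url` to `path` only if the file is not already there, then read it."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return pd.read_csv(path, sep=",", header=0)


# Usage (commented out so the block itself does not need network access):
# df = load_csv_cached(
#     "https://raw.githubusercontent.com/shadgriffin/ml_bias/main/markets.csv",
#     "markets.csv",
# )
```

Delete the local file first if you want to force a fresh download.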



##### 2.2 Examining the Data <a id="22"></a>

Let's take a look at the data.

In [3]:
df.head()

Unnamed: 0,CUST_ID,THEPREDICTOR,MARKET,BUY
0,10000275,-98.151992,Yarnaby,1
1,10000617,-98.991996,Yarnaby,1
2,10000716,-99.051996,Yarnaby,1
3,10000823,-97.779991,Yarnaby,1
4,10001492,-97.854991,Yarnaby,1


The dataset is pretty simple.  Again, it is 100% fake. 

CUST_ID -- Unique Identifier of the prospect

THEPREDICTOR -- This is the primary independent variable of the model.

MARKET -- The market of the prospect.  Can be Yuba, Yarnaby, or Karma

BUY -- The dependent variable.  A '1' means the prospect purchased carpet cleaning in the past, a '0' means they did not.

The three markets have about the same number of prospects.

In [4]:
df_count = pd.DataFrame(df.groupby(['MARKET'])['BUY'].count())
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,54975
Yarnaby,53006
Yuba,54908


As expected, the conversion rate and the number of conversions in Yarnaby are much lower than in other markets.

There are fewer conversions in Yarnaby because BBAA has never focused on doing business there.  If they have never tried to sell in Yarnaby, they will not have many sales there.  Building a model on historical data will not facilitate selling into a new market.

In [5]:
df_count = pd.DataFrame(df.groupby(['MARKET'])['BUY'].mean())
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,0.041237
Yarnaby,0.000434
Yuba,0.040959


In [6]:
df_count = pd.DataFrame(df.groupby(['MARKET'])['BUY'].sum())
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,2267
Yarnaby,23
Yuba,2249


So, we have three markets.  Two markets are mature, and one is new, with few customers.  Let's see what our prospect list looks like if we build a machine learning model with the existing data.  In the next section, we will create a machine learning model from the historical record and identify the top 30% of prospects that are likely to buy carpet cleaning.  Once we have done this, we can mail them coupons.

##### 2.3 Machine Learning Model on the Raw Data <a id="23"></a>

Create a dummy variable that identifies prospects in Yarnaby.  We will use this as an independent variable in the model.

In [7]:
df['YARNABY'] = np.where(((df.MARKET == 'Yarnaby')), 1, 0)

Define the Independent and dependent variables.

In [8]:
X=pd.DataFrame(df[['THEPREDICTOR','YARNABY']])
y=pd.DataFrame(df['BUY'])

Define the model

In [9]:
xgb21 = XGBClassifier(objective = 'binary:logistic',use_label_encoder=False);

Fit the model.

In [10]:
xgb21.fit(X,y,eval_metric='auc')

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=56, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

Score the data and evaluate the accuracy.

In [11]:
phat= xgb21.predict_proba(X);
yhat= xgb21.predict(X);
df_phat=pd.DataFrame(phat)
df_phat=df_phat.rename(columns={0: "P_NOBUY", 1:"P_BUY"})
df_yhat=pd.DataFrame(yhat)
print("\nModel Report")
print("Accuracy : %.4g" % metrics.accuracy_score(y, yhat))
print("AUC Score: %f" % metrics.roc_auc_score(y, yhat))    


Model Report
Accuracy : 0.9906
AUC Score: 0.898325
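One caveat about the report above: `roc_auc_score` is given the hard 0/1 predictions from `predict`, while AUC is normally computed from the predicted probabilities.  The toy example below (made-up labels and scores, not the article's data) shows that the two inputs can give different numbers.  In the notebook, the probability-based version would presumably be `metrics.roc_auc_score(y, phat[:, 1])`; I have not rerun the model, so the figure printed above is left as is.

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 0, 0, 1, 1])
p_buy = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.9])

# Hard labels at a 0.5 cutoff, similar to what predict() returns.
y_hat = (p_buy >= 0.5).astype(int)

auc_from_labels = metrics.roc_auc_score(y_true, y_hat)
auc_from_probs = metrics.roc_auc_score(y_true, p_buy)

print("AUC from hard labels  : %f" % auc_from_labels)   # 0.625000
print("AUC from probabilities: %f" % auc_from_probs)    # 0.875000
```

Hard labels collapse the ranking into two groups, so the score only reflects a single cutoff rather than how well the model orders prospects.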


Append the scores to the original data

In [12]:
dfx = pd.concat([df, df_phat], axis=1)



Assign records to deciles.  The firm will send coupons to the top 30%.

In [13]:
#Add a tiny random number to the probability to break ties
dfx['bingbong'] = (np.random.randint(0, 100, df.shape[0]))/100000000000000000

dfx['P_BUY']=(dfx['P_BUY']+dfx['bingbong'])

#Create deciles based on P_BUY
dfx['DECILE'] = pd.qcut(dfx['P_BUY'], 10, labels=np.arange(100, 0, -10))
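The tiny random number is there because `pd.qcut` needs distinct bin edges, and tree models often give many records the exact same probability.  An alternative that does not depend on the size of the jitter is to cut on ranks instead of raw scores.  This is a sketch on made-up scores, not a change to the notebook; note that `rank(method="first")` breaks ties by row order, so shuffle the rows first if you want ties broken at random.

```python
import numpy as np
import pandas as pd

# Toy scores with many ties (made up for illustration).
scores = pd.Series([0.1] * 12 + [0.5] * 5 + [0.9] * 3, name="P_BUY")

# rank(method="first") assigns distinct ranks 1..n, breaking ties by row order,
# so qcut can always find ten unique bin edges.
ranks = scores.rank(method="first")
deciles = pd.qcut(ranks, 10, labels=np.arange(100, 0, -10)).astype(int)

# Each of the ten deciles ends up with two of the twenty records.
print(deciles.value_counts().sort_index())
```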


Look at the accuracy of the model based on the deciles.  (10 is the highest, 100 is the lowest.)
The bulk of buyers are in the top deciles, which is good.  Also, the prospects in the expansion market are concentrated in the lower deciles.

In [14]:
df_sum = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].sum())

df_count = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].count())
df_mean = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].mean())
df_phat = pd.DataFrame(dfx.groupby(['DECILE'])['P_BUY'].mean())
dfx['Yarnaby'] = np.where(((dfx.MARKET == 'Yarnaby')), 1, 0)
df_yarnaby = pd.DataFrame(dfx.groupby(['DECILE'])['Yarnaby'].sum())

dfx.head()


df_out = pd.concat([df_mean, df_count,df_sum,df_yarnaby,df_phat], axis=1)

df_out.columns = ['Historical Conversion Rate', 'Total Prospects', 'Historical Conversions','Prospects in Yarnaby','Predicted Conversion Rate']
df_out

Unnamed: 0_level_0,Historical Conversion Rate,Total Prospects,Historical Conversions,Prospects in Yarnaby,Predicted Conversion Rate
DECILE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
100,0.0,16290,0,16290,1.9e-05
90,0.0,16289,0,16289,2.9e-05
80,0.0,16288,0,16288,0.000512
70,0.000614,16292,10,2941,0.002875
60,0.001781,16286,29,394,0.004655
50,0.003745,16288,61,294,0.00595
40,0.0062,16290,101,200,0.007212
30,0.009944,16291,162,202,0.009072
20,0.014243,16289,232,27,0.011452
10,0.242171,16286,3944,81,0.236942
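To put a number on "the bulk of buyers are in the top deciles", here is the share of historical conversions that falls in the top three deciles.  The counts are copied from the table above; nothing here is new data.

```python
# Historical conversions per decile, copied from the table above
# (decile 10 is the highest-scored group, decile 100 the lowest).
conversions = {100: 0, 90: 0, 80: 0, 70: 10, 60: 29,
               50: 61, 40: 101, 30: 162, 20: 232, 10: 3944}

total = sum(conversions.values())
top_three = sum(n for decile, n in conversions.items() if decile <= 30)
share = top_three / total

print("Total conversions          :", total)       # 4539
print("In the top 3 deciles       :", top_three)   # 4338
print("Share captured by top 30%  :", round(share, 4))
```

So mailing the top 30% covers roughly 96% of the historical buyers, almost none of whom are in Yarnaby.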


Identify the top 30% and examine the distribution across markets.


In [15]:
dfx['DECILE']=dfx['DECILE'].astype(int)

In [16]:
df_x=dfx[dfx['DECILE']<=30]

In [17]:
df_count = pd.DataFrame(df_x.groupby(['MARKET'])['BUY'].count())

In [18]:
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,24261
Yarnaby,310
Yuba,24295


See the problem?  The historical data has almost no sales from Yarnaby, and without doing something, they never will.  It becomes a "self-fulfilling prophecy".  If you use historical sales to predict future sales, you will never have many sales in a new market.  

Unless we do something here, we will never be successful in our new market.



##### 2.4 Monkey  with the scoring data <a id="24"></a>

The most straightforward thing to do here is to apply the model to Yarnaby but pretend that the prospects are in either Yuba or Karma.  That is, recode the "YARNABY" variable so that all records (even those in Yarnaby) have a value of 0.  Changing the market of the Yarnaby records means that when we apply the model, we will score every record as if it were in Yuba or Karma, even if it isn't.

In [19]:
df['YARNABY']=0

In [20]:
#Define the independent and dependent variable.
X=pd.DataFrame(df[['THEPREDICTOR','YARNABY']])
y=pd.DataFrame(df['BUY'])

#Score the data 
phat= xgb21.predict_proba(X);

df_phat=pd.DataFrame(phat)
df_phat=df_phat.rename(columns={0: "P_NOBUY", 1:"P_BUY"})

#Append the scores to the original data

dfx = pd.concat([df, df_phat], axis=1)
#Assign records to deciles. The firm will send coupons to the top 30%.
dfx['bingbong'] = (np.random.randint(0, 100, df.shape[0]))/100000000000000000

dfx['P_BUY']=(dfx['P_BUY']+dfx['bingbong'])

#Identify the top 30% and examine the distribution across markets.
dfx['DECILE'] = pd.qcut(dfx['P_BUY'], 10, labels=np.arange(100, 0, -10))
dfx['DECILE']=dfx['DECILE'].astype(int)
df_x=dfx[dfx['DECILE']<=30]
df_count = pd.DataFrame(df_x.groupby(['MARKET'])['BUY'].count())
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,16893
Yarnaby,15072
Yuba,16902


The results of monkeying with the scoring data appear above.  Note that there are now a significant number of prospects in Yarnaby.

A word of caution, however.  This method assumes that all market factors exist in one variable and that by changing the variable, you can remove all of the market effects.  In practice, this is rarely the case.  Data used as inputs in a machine learning model are almost always correlated.  In classical regression methods, we call this collinearity or multi-collinearity.  So, even if you change the market for Yarnaby prospects, there will likely be some lingering market effect in other variables.  A good example is household income.  Income and market are typically collinear, given that households with similar incomes tend to live by one another. 
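Here is a small simulation of that lingering market effect.  Everything in it is hypothetical: the article's dataset has no income column, so `income` is simulated to be higher in Yarnaby purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: INCOME is not in the article's dataset. It is simulated
# here so that Yarnaby households earn more on average.
yarnaby = rng.integers(0, 2, size=n)
income = 50 + 20 * yarnaby + rng.normal(0, 10, size=n)

# The market dummy and income are strongly correlated.
corr = np.corrcoef(yarnaby, income)[0, 1]
print("Correlation between YARNABY and INCOME: %.2f" % corr)

# Recoding the dummy to 0 changes nothing about income, so the gap between
# the true markets is still there for a model that also uses INCOME.
gap = income[yarnaby == 1].mean() - income[yarnaby == 0].mean()
print("Income gap between the markets: %.1f" % gap)
```

If a model used both variables, setting the dummy to 0 would remove only part of the market signal.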

Also, it is essential to point out that the viability of the Yarnaby market is unknown.  We are making a pretty big assumption that it is as viable as the other markets, but BBAA won't know that until they start selling in the market.

This use case is what I would refer to as a batch use case.  That is, we are scoring the complete set of prospects and then taking action by focusing on the top 30%.  Sometimes your deployment would be what I would call real-time.  That is, you will need to take action on a single prospect without the benefit of aggregating a larger cohort.  For example, you could be deploying this model in an application that scores prospects one at a time instead of thousands at a time.  

Monkeying with the scoring data lends itself to a real-time deployment, given that the scores for all three markets are comparable.  In other words, the average score or predicted value will be about the same across all three markets.  Or, a good prospect will have roughly the same value, regardless of their market.
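A real-time version of this idea could look something like the sketch below.  `StandInModel` and `score_prospect` are names I made up; the stand-in only mimics `predict_proba` with an arbitrary formula so the example runs without training anything.  In practice you would pass the fitted classifier instead.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical stand-in for the fitted classifier.

    It only mimics predict_proba so the example runs without training.
    """

    def predict_proba(self, X):
        # Made-up logistic score that penalizes YARNABY == 1.
        z = 0.01 * X["THEPREDICTOR"].to_numpy() - 5.0 * X["YARNABY"].to_numpy()
        p_buy = 1.0 / (1.0 + np.exp(-z))
        return np.column_stack([1.0 - p_buy, p_buy])


def score_prospect(model, the_predictor):
    """Score one prospect as if it lived in Yuba or Karma (YARNABY forced to 0)."""
    row = pd.DataFrame({"THEPREDICTOR": [the_predictor], "YARNABY": [0]})
    return float(model.predict_proba(row)[0, 1])


model = StandInModel()
p = score_prospect(model, -98.2)
print(p)
```

Because the market is fixed inside the function, a single prospect can be scored on its own, with no need to see the rest of the file.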



##### 2.5 Monkey  with the Sample<a id="25"></a>

Another way to address this problem is to resample the original data before building any machine learning model. Again, this is not perfect but does make sense in some situations. Currently, the historical purchase rate for Yarnaby is substantially lower than the other two markets. If we resample Yarnaby to have a similar historical purchase rate, a model should reflect this when applied to prospect data.

Read in the csv.

In [21]:
df = pd.read_csv("markets.csv", sep=",", header=0)


Create a dummy variable for records in Yarnaby

In [22]:
df['YARNABY'] = np.where(((df.MARKET == 'Yarnaby')), 1, 0)

Put the records in random order

In [23]:
df['loopy'] = (np.random.randint(0, 1000, df.shape[0]))/1000
df=df.sort_values(by=['loopy'])
df=df.reset_index(drop=True)

Create a separate data frame for each market based on the "Buy" variable.

In [24]:
dfyarnb=df[(df['MARKET']=='Yarnaby') & (df['BUY']==1)]
dfyarnn=df[(df['MARKET']=='Yarnaby') & (df['BUY']==0)]
dfyubab=df[(df['MARKET']=='Yuba') & (df['BUY']==1)]
dfyuban=df[(df['MARKET']=='Yuba') & (df['BUY']==0)]
dfkarmab=df[(df['MARKET']=='Karma') & (df['BUY']==1)]
dfkarman=df[(df['MARKET']=='Karma') & (df['BUY']==0)]

For each Buyer in each market, randomly select 20 non-buyers.  The conversion rate will be the same for each market in our new data frame.

In [25]:
a=(dfyarnb['CUST_ID'].count())*20
b=(dfyubab['CUST_ID'].count())*20
c=(dfkarmab['CUST_ID'].count())*20


dfyarnn=dfyarnn.sort_values(by=['loopy'])


dfyarnn=dfyarnn.head(a)
dfyuban=dfyuban.head(b)
dfkarman=dfkarman.head(c)
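The same downsampling can be written as one reusable function.  This is a sketch of an alternative, not the code used to produce the results in this article: `downsample_non_buyers` is a hypothetical helper, and it uses `DataFrame.sample` with a seed instead of sorting on a random column.

```python
import pandas as pd


def downsample_non_buyers(df, group_col, target_col="BUY", ratio=20, seed=0):
    """Keep every buyer and, within each group, a random `ratio` non-buyers per buyer."""
    pieces = []
    for _, grp in df.groupby(group_col):
        buyers = grp[grp[target_col] == 1]
        non_buyers = grp[grp[target_col] == 0]
        # Cap at the number available so small groups do not raise an error.
        n_keep = min(len(non_buyers), ratio * len(buyers))
        pieces.append(buyers)
        pieces.append(non_buyers.sample(n=n_keep, random_state=seed))
    return pd.concat(pieces).reset_index(drop=True)


# Toy data (made up): market A has 2 buyers in 100 rows, market B has 1 in 100.
toy = pd.DataFrame({
    "MARKET": ["A"] * 100 + ["B"] * 100,
    "BUY": [1, 1] + [0] * 98 + [1] + [0] * 99,
})
balanced = downsample_non_buyers(toy, "MARKET", ratio=20)
rates = balanced.groupby("MARKET")["BUY"].mean()
print(rates)
```

Changing `seed` gives a different random draw of non-buyers, which is handy for checking how stable the results are.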

Concatenate the adjusted data frames into a complete data frame.

In [26]:
dfz = pd.concat([dfyarnn, dfyarnb, dfyuban, dfyubab, dfkarman, dfkarmab])

Reset the Indexes

In [27]:
dfz=dfz.sort_values(by=['CUST_ID'])
dfz=dfz.reset_index(drop=True)

df=df.sort_values(by=['CUST_ID'])
df=df.reset_index(drop=True)

The original dataframe has 162,889 records.

In [28]:
df.shape

(162889, 6)

The new data frame with a consistent conversion rate in each market has 95,319 records.

In [29]:
dfz.shape

(95319, 6)

Note, now all markets have the same conversion rate.

In [30]:
zss = pd.DataFrame(dfz.groupby(['MARKET'])['BUY'].mean())
zss

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,0.047619
Yarnaby,0.047619
Yuba,0.047619
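The 0.047619 figure and the 95,319 row count both follow directly from the 20-to-1 ratio.  A quick check, using the buyer counts per market from section 2.2:

```python
ratio = 20                       # non-buyers kept per buyer
buyers = 2267 + 23 + 2249        # Karma + Yarnaby + Yuba buyers

conversion_rate = 1 / (1 + ratio)
rows_in_sample = buyers * (1 + ratio)

print(round(conversion_rate, 6))   # 0.047619
print(rows_in_sample)              # 95319
```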


Define the dependent and independent variables for the modeling and scoring data set.  Note that we will build the model on the data frame with the adjusted sample, and we will apply that model to the complete record data set.

In [31]:
#Define the scoring data set
X=pd.DataFrame(df[['THEPREDICTOR','YARNABY']])
y=pd.DataFrame(df['BUY'])
#Define the modeling data set
X1=pd.DataFrame(dfz[['THEPREDICTOR','YARNABY']])
y1=pd.DataFrame(dfz['BUY'])

In [32]:
#Define the Model
xgb21 = XGBClassifier(objective = 'binary:logistic',use_label_encoder=False);
#Fit the Model to the adjusted sample
xgb21.fit(X1,y1,eval_metric='auc')
#apply or score the model against the original dataframe
phat= xgb21.predict_proba(X);
#write to a dataframe
df_phat=pd.DataFrame(phat)
#Rename the columns in the scored data set
df_phat=df_phat.rename(columns={0: "P_NOBUY", 1:"P_BUY"})
#concatenate the scored data to the original non-augmented data frame.
dfx = pd.concat([df, df_phat], axis=1)
#Examine the results
dfx.head()

Unnamed: 0,CUST_ID,THEPREDICTOR,MARKET,BUY,YARNABY,loopy,P_NOBUY,P_BUY
0,10000000,-98.361993,Karma,0,0,0.539,0.994262,0.005738
1,10000001,-91.149964,Yuba,0,0,0.763,0.992324,0.007676
2,10000002,-99.321997,Yuba,0,0,0.34,0.990993,0.009007
3,10000003,-99.597998,Yarnaby,0,1,0.862,0.854623,0.145377
4,10000004,-99.408998,Karma,0,0,0.136,0.990805,0.009195


In [33]:
#Evaluate the accuracy of the model built on the augmented data on the original data
yhat= xgb21.predict(X);


df_yhat=pd.DataFrame(yhat)
print("\nModel Report")
print("Accuracy : %.4g" % metrics.accuracy_score(y, yhat))
print("AUC Score: %f" % metrics.roc_auc_score(y, yhat))    


Model Report
Accuracy : 0.9849
AUC Score: 0.895927


Place the newly scored records into deciles.

In [34]:
#assign a tiny random number to the probability to break ties
dfx['bingbong'] = (np.random.randint(0, 100, dfx.shape[0]))/100000000000000000

dfx['P_BUY']=(dfx['P_BUY']+dfx['bingbong'])

#Create deciles based on P_BUY
dfx['DECILE'] = pd.qcut(dfx['P_BUY'], 10, labels=np.arange(100, 0, -10))


Select the top 3 deciles.

In [35]:
dfx['DECILE']=dfx['DECILE'].astype(int)
df_x=dfx[dfx['DECILE']<=30]

In [36]:
df_count = pd.DataFrame(df_x.groupby(['MARKET'])['BUY'].count())

In [37]:
df_count

Unnamed: 0_level_0,BUY
MARKET,Unnamed: 1_level_1
Karma,15702
Yarnaby,17631
Yuba,15526


Note the distribution of the top 3 deciles across the various markets.  The number of records in each market is about the same, roughly anyway.  

Yarnaby has about two thousand more than either Karma or Yuba.  There are a few reasons this could happen.  One, it is possible that Yarnaby is a better market than the other two, and as BBAA starts selling in Yarnaby, they will realize better sales than in their existing markets.  While that is possible, I kind of doubt that to be the case here.  Two, this method relies on sampling to balance the conversion rates between the markets, and sampling, as you know, comes with sampling error.  With only 23 historical buyers in Yarnaby, the Yarnaby part of the modeling sample is small.  My guess is that if we drew a different random selection of records, the results might look different.
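To get a feel for how much a random draw of this size can move around, the sketch below repeats a draw of 460 records (23 buyers times 20 non-buyers) with ten different seeds.  The population is simulated, not the article's data, so only the pattern matters, not the specific numbers.

```python
import numpy as np

# Hypothetical population of non-buyer predictor values (simulated).
population = np.random.default_rng(42).normal(loc=-98.0, scale=1.0, size=50_000)

n_draw = 23 * 20  # 23 Yarnaby buyers times 20 non-buyers each = 460 rows

# Repeat the draw with different seeds and record the sample mean each time.
means = []
for seed in range(10):
    rng = np.random.default_rng(seed)
    sample = rng.choice(population, size=n_draw, replace=False)
    means.append(sample.mean())

print("Lowest sample mean : %.3f" % min(means))
print("Highest sample mean: %.3f" % max(means))
```

Rerunning the whole resample-fit-score pipeline over several seeds, and comparing the Yarnaby counts in the top three deciles, would be the more direct version of this check.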

Again, we don't know for sure because we don't know the actual conversion rate in Yarnaby.  We assume that it will be the same as it is in the other two markets.

Like the previous example above, this method lends itself to real-time scoring, given that the model output is comparable across the three markets.   
This method also lends itself to situations when you need to resample your data anyway.  For example, if you have a tiny (less than 1%) purchase rate, you may decide to balance or resample your data anyway.

##  3.0 Historical Discrimination <a id="30"></a>

Ed and Son's Dog Bathing Incorporated is dedicated to cleaning canines in the tri-state area.  Ed started the business thirty years ago and added his son as a business partner twelve years ago.  Ed, getting up in age, is now looking to retire.  Once he retires, his son Elmo will continue the family business for another generation.

Ed's business includes all breeds of dogs except for one, the Basset Hound.  Ed hates basset hounds.  He thinks they are stubborn, their ears are too long, and they insist on stealing humans' food.  He just doesn't like basset hounds.  In fact, you could say Ed has actively discriminated against basset hounds in his thirty years of dog washing, refusing to do business with them.

Now that Ed is heading into retirement, Elmo is looking to expand the business.  One easy way to grow the market is to start bathing basset hounds.  Another way Elmo plans to grow the company is to use a machine learning model to identify prospects for dog washing in the tri-state area.  Elmo will use historical data to understand who purchased dog washes in the past and then use this learning to find new prospects.  Elmo plans to apply a model to the universe of candidates, then call the top 30% to see if they want a dog wash.

##### 3.1 Pull in the data from GitHub.<a id="31"></a>

In [38]:
!rm breeds.csv
!wget https://raw.githubusercontent.com/shadgriffin/ml_bias/main/breeds.csv
    
df = pd.read_csv("breeds.csv", sep=",", header=0)

rm: cannot remove 'breeds.csv': No such file or directory
--2022-01-07 22:39:26--  https://raw.githubusercontent.com/shadgriffin/ml_bias/main/breeds.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4455824 (4.2M) [text/plain]
Saving to: ‘breeds.csv’


2022-01-07 22:39:26 (121 MB/s) - ‘breeds.csv’ saved [4455824/4455824]



##### 3.2 Examine the results <a id="32"></a>

The dataset is pretty simple.  Again, it is 100% fake. 

CUST_ID -- Unique Identifier of the prospect

THEPREDICTOR -- This is the primary independent variable of the model.

BREED -- The breed of the prospect.  It can be Basset Hound or Other.

BUY -- The dependent variable.  A '1' means the prospect purchased a dog wash in the past; a '0' means they did not.

In [39]:
df.head()

Unnamed: 0,CUST_ID,THEPREDICTOR,BREED,BUY
0,10000007,-97.248989,Basset Hound,0
1,10000021,-98.154992,Basset Hound,0
2,10000028,-98.442994,Basset Hound,0
3,10000036,-98.415993,Basset Hound,0
4,10000037,-99.681999,Basset Hound,0


Looking at the historical data, it is evident that basset hounds have not been a priority.  There are only fourteen basset hound customers in the historical data.  More than likely, these are mixed breeds.

In [40]:
df['wookie']=1

df_count = pd.DataFrame(df.groupby(['BREED','wookie'])['BUY'].count())
df_sum = pd.DataFrame(df.groupby(['BREED','wookie'])['BUY'].sum())
df_predictor = pd.DataFrame(df.groupby(['BREED','wookie'])['THEPREDICTOR'].mean())
df_total = pd.DataFrame(df.groupby(['wookie'])['BUY'].count())

df_count=df_count.rename(columns={"BUY": "Total Prospects"})
df_sum=df_sum.rename(columns={"BUY": "Buyers"})
df_total=df_total.rename(columns={"BUY": "All Breeds"})

df_out = pd.concat([ df_count,df_sum,df_predictor], axis=1)

df_total=df_total.reset_index()
df_out=df_out.reset_index()

df_out = pd.merge(df_out, df_total, on="wookie")
df_out['Conversion Rate']=df_out['Buyers']/df_out['Total Prospects']
df_out['Percent of Prospects']=df_out['Total Prospects']/df_out['All Breeds']

df_out=df_out.drop(columns='wookie')
df_out

Unnamed: 0,BREED,Total Prospects,Buyers,THEPREDICTOR,All Breeds,Conversion Rate,Percent of Prospects
0,Basset Hound,13663,14,-98.29528,123903,0.001025,0.110272
1,Other,110240,6438,597.421517,123903,0.0584,0.889728
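The summary above can also be built in one pass with named aggregation, without the helper column and the merge.  This is a sketch on a made-up ten-row stand-in for the breeds data; the output column names are my own choice.

```python
import pandas as pd

# Toy stand-in for the breeds data (made up for illustration).
toy = pd.DataFrame({
    "BREED": ["Basset Hound"] * 2 + ["Other"] * 8,
    "BUY": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "THEPREDICTOR": [-98.0, -99.0, 5.0, 1.0, 2.0, 6.0, 0.0, 3.0, 4.0, 1.0],
})

# One row per breed: counts, buyers, conversion rate, and mean predictor.
summary = toy.groupby("BREED").agg(
    total_prospects=("BUY", "count"),
    buyers=("BUY", "sum"),
    conversion_rate=("BUY", "mean"),
    mean_predictor=("THEPREDICTOR", "mean"),
)
summary["percent_of_prospects"] = summary["total_prospects"] / len(toy)
print(summary)
```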


##### 3.3 Build a Machine Learning Model to predict the best prospects.<a id="33"></a>

Create a dummy variable to identify Basset Hounds.

In [41]:
df['BASSET'] = np.where(((df.BREED == 'Basset Hound')), 1, 0)

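Define the independent and dependent variables.  Note that THEPREDICTOR is the only input to the model; the BASSET dummy is used only for reporting.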

In [42]:
X=pd.DataFrame(df[['THEPREDICTOR']])
y=pd.DataFrame(df['BUY'])

Define and fit the model.

In [43]:
xgb21 = XGBClassifier(objective = 'binary:logistic',use_label_encoder=False);
xgb21.fit(X,y,eval_metric='auc')

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=56, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [44]:
#Score the original data with the model
phat= xgb21.predict_proba(X);
#Put the results in a dataframe
df_phat=pd.DataFrame(phat)
#add variable labels
df_phat=df_phat.rename(columns={0: "P_NOBUY", 1:"P_BUY"})
#append values to the original data
dfx = pd.concat([df, df_phat], axis=1)

Evaluate the accuracy of the model

In [45]:

yhat= xgb21.predict(X);


df_yhat=pd.DataFrame(yhat)
print("\nModel Report")
print("Accuracy : %.4g" % metrics.accuracy_score(y, yhat))
print("AUC Score: %f" % metrics.roc_auc_score(y, yhat))    


Model Report
Accuracy : 0.9846
AUC Score: 0.926639


Place the records in Deciles

In [46]:
#Add a tiny random number to the probability to break ties
dfx['bingbong'] = (np.random.randint(0, 100, df.shape[0]))/100000000000000000

dfx['P_BUY']=(dfx['P_BUY']+dfx['bingbong'])

#Create deciles based on P_BUY
dfx['DECILE'] = pd.qcut(dfx['P_BUY'], 10, labels=np.arange(100, 0, -10))


Evaluate the deciles.  The model does an excellent job of identifying prospects, but note that nearly all the basset hounds are in the bottom two deciles.

In [47]:
df_sum = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].sum())
df_count = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].count())
df_mean = pd.DataFrame(dfx.groupby(['DECILE'])['BUY'].mean())
df_phat = pd.DataFrame(dfx.groupby(['DECILE'])['P_BUY'].mean())
dfx['BASSET'] = np.where(((dfx.BREED == 'Basset Hound')), 1, 0)
df_yarnaby = pd.DataFrame(dfx.groupby(['DECILE'])['BASSET'].sum())

dfx.head()


df_out = pd.concat([df_mean, df_count,df_sum,df_yarnaby,df_phat], axis=1)
df_out

Unnamed: 0_level_0,BUY,BUY,BUY,BASSET,P_BUY
DECILE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
100,0.0,12395,0,12395,0.000399
90,0.001211,12387,15,1025,0.00375
80,0.003309,12391,41,21,0.005289
70,0.005166,12389,64,15,0.006207
60,0.007099,12396,88,47,0.007244
50,0.007911,12388,98,11,0.008111
40,0.009037,12394,112,5,0.009018
30,0.011387,12382,141,0,0.010288
20,0.015577,12390,193,1,0.013065
10,0.460011,12391,5700,143,0.457366


Select only the top 3 deciles.

In [48]:
dfx['DECILE']=dfx['DECILE'].astype(int)
df_x=dfx[dfx['DECILE']<=30]

Evaluate the distribution of top records by breed.

There are not many basset hounds in the top three deciles.  Again, hopefully, the problem is apparent.  The machine learning model doesn't know Ed actively discriminated against basset hounds.  It has no way to know this.  If you use the historical data to identify the best prospects, basset hounds will never make the cut.  This isn't because they are perpetually clean.  It is because they were discriminated against in the historical data.

In [49]:
df_count = pd.DataFrame(df_x.groupby(['BREED'])['THEPREDICTOR'].count())
df_count

Unnamed: 0_level_0,THEPREDICTOR
BREED,Unnamed: 1_level_1
Basset Hound,144
Other,37019


##### 3.4 Monkey with the Output Data. <a id="34"></a>

Previously, we examined ways of dealing with "Machine Learning Bias" or situations where your historical data does not reflect your current objectives.  We saw how you could alter the data or the sample to bring the machine learning model closer to what you hope to achieve.  Another approach is to monkey with the output data.  In this case, you build the machine learning model on the original data but then change how you use it.  Below is a concrete example of how to do this.

Take the data from the model and separate each breed into a different data frame.

In [50]:
df_boogie=dfx[['CUST_ID','P_BUY','BREED','BUY']]

df_b=df_boogie[df_boogie['BREED']=='Basset Hound'].copy()
df_e=df_boogie[df_boogie['BREED']!='Basset Hound'].copy()


Remember, the goal is to contact 30% of the prospects. Note that Basset Hounds represent about 11% of the total population. Here is how to ensure that Bassets reflect about 11% of the final targeted prospect file.

Assign deciles within each breed.

In [51]:
df_b['DECILE'] = pd.qcut(df_b['P_BUY'], 10, labels=np.arange(100, 0, -10))
df_e['DECILE'] = pd.qcut(df_e['P_BUY'], 10, labels=np.arange(100, 0, -10))

df_b['DECILE']=df_b['DECILE'].astype(int)
df_e['DECILE']=df_e['DECILE'].astype(int)
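Splitting into one data frame per breed works fine with two groups.  With more groups, a percentile rank computed within each group gets you to roughly the same place without the split.  A sketch on made-up scores, offered as an alternative rather than as what the notebook does:

```python
import pandas as pd

# Toy scores (made up): bassets all score lower than the other breeds.
toy = pd.DataFrame({
    "BREED": ["Basset Hound"] * 10 + ["Other"] * 10,
    "P_BUY": [i / 100 for i in range(1, 11)] + [i / 10 for i in range(1, 11)],
})

# Percentile rank within each breed; the best score in a breed gets the smallest value.
toy["PCT_RANK"] = toy.groupby("BREED")["P_BUY"].rank(pct=True, ascending=False)

# Keep the top 30% of each breed.
selected = toy[toy["PCT_RANK"] <= 0.30]
counts = selected.groupby("BREED")["P_BUY"].count()
print(counts)
```

Each breed contributes its own top 30%, even though every basset scores below every other dog in this toy data.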



Append the records back together into one dataframe.

In [52]:
df_boogie = pd.concat([df_b, df_e])

Select the top 3 deciles.

In [53]:
df_boogie=df_boogie[df_boogie['DECILE']<=30]

Examine the results.

In [54]:
df_count = pd.DataFrame(df_boogie.groupby(['BREED'])['P_BUY'].count())
df_count

Unnamed: 0_level_0,P_BUY
BREED,Unnamed: 1_level_1
Basset Hound,4090
Other,33070


Using the method above, we can ensure that Basset Hounds are represented proportionally to their presence in the original prospect file, even though they have experienced historical discrimination. 

What we are doing in this latest example is establishing a quota.  The methodology above will ensure that the at-risk group is represented proportionally to their presence in the population.  You could easily change this so that Basset Hounds represent 1%, 30%, or 50% of the targeted group.  Again, it just depends on your objective.
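If you want a share other than the group's share of the population, the quota can be made an explicit parameter.  The function below is a hypothetical sketch (`select_with_quota` and its arguments are my own names): it takes the best-scored records from the quota group and fills the rest of the list from everyone else.

```python
import pandas as pd


def select_with_quota(df, n_total, quota_share,
                      group_col="BREED", group_value="Basset Hound",
                      score_col="P_BUY"):
    """Pick n_total rows so that about quota_share of them come from group_value."""
    in_group = df[df[group_col] == group_value]
    others = df[df[group_col] != group_value]
    # Never ask for more rows than a group actually has.
    n_group = min(round(n_total * quota_share), len(in_group))
    n_other = min(n_total - n_group, len(others))
    return pd.concat([
        in_group.nlargest(n_group, score_col),
        others.nlargest(n_other, score_col),
    ])


# Toy scores (made up): 20 bassets with low scores, 80 other dogs with higher scores.
toy = pd.DataFrame({
    "BREED": ["Basset Hound"] * 20 + ["Other"] * 80,
    "P_BUY": [i / 1000 for i in range(20)] + [0.5 + i / 1000 for i in range(80)],
})
picked = select_with_quota(toy, n_total=30, quota_share=0.5)
breed_counts = picked["BREED"].value_counts()
print(breed_counts)
```

Here half of the 30 selected records are bassets, even though they are only 20% of the toy population and score lowest overall.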

Monkeying with the output is another way to correct "Machine Learning Bias".  This method could make sense if you are looking for exact numbers.  For example, you might want 20% of the final targeted list to be basset hounds.  This method also lends itself to batch deployments, given that you have to aggregate the data to know how to adjust it.

## 4.0 Conclusion <a id="40"></a>

In this article, I have laid out scenarios where building a machine learning model on your historical data will not meet your current business objectives. Many call this "Machine Learning Bias".  If you encounter "Machine Learning Bias," you have a problem with your data, and there is no sure-fire way to fix data that is missing something you need. This article presented three ways to deal with "Machine Learning Bias". Hopefully, I made clear how you can use these techniques to minimize the impact of this kind of problem.

As far as which of the three techniques is best, it is tough to say. It depends on the specific problem you are looking to solve. I would say that generally, if you are looking to deploy your model in real-time via an application, then the first two techniques (monkeying with the scoring data and monkeying with the sample) are probably the best. The third (monkeying with the output data) is probably the best if you are doing a batch deployment. Again, it depends on your objectives and the size of the gap in the historical data.  I'd suggest a bit of trial and error with all three methods.

I hope this was helpful.

### Author

**Shad Griffin** is a Data Scientist headquartered in a Treehouse sitting on top of Idiot's Hill in Denton, Texas.