In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
import requests

## Loading Data
This dataset was provided by [pplonski](https://github.com/pplonski) in a [GitHub repository](https://github.com/pplonski/datasets-for-start/tree/master/adult).

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)

X = df.drop('income', axis=1)
y = df['income']

df.head()

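|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |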


Split the data into a training set (70% of the rows) and a test set (30%). Fixing `random_state` makes the split reproducible.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=4321)
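
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a small made-up frame. The `toy` data and the `X_toy`/`y_toy` names are illustrative assumptions, not part of the Adult dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two classes (not the Adult dataset).
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 53],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K",
               "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})
X_toy = toy.drop("income", axis=1)
y_toy = toy["income"]

# 30% of 10 rows go to the test set, the remaining rows to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4321
)
print(len(X_tr), len(X_te))

# Repeating the call with the same seed selects the same rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4321
)
print(X_tr.index.equals(X_tr2.index))
```

The same arithmetic accounts for the row counts reported further down: 22792 training rows and 9769 test rows.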

## Data Pre-Processing

The training algorithm we will use is **Random Forest** from `sklearn`, which cannot handle missing values or categorical data directly. First, we fill the missing values in each feature with that feature's mode (its most common value).

### Fill Missing Values

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22792 entries, 15797 to 9021
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             22792 non-null  int64 
 1   workclass       22792 non-null  object
 2   fnlwgt          22792 non-null  int64 
 3   education       22792 non-null  object
 4   education-num   22792 non-null  int64 
 5   marital-status  22792 non-null  object
 6   occupation      22792 non-null  object
 7   relationship    22792 non-null  object
 8   race            22792 non-null  object
 9   sex             22792 non-null  object
 10  capital-gain    22792 non-null  int64 
 11  capital-loss    22792 non-null  int64 
 12  hours-per-week  22792 non-null  int64 
 13  native-country  22792 non-null  object
dtypes: int64(6), object(8)
memory usage: 2.6+ MB


In [5]:
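# mode() returns a DataFrame; its first row holds the most common value of each column
train_mode = dict(X_train.mode().iloc[0])
# Fill missing values column by column with the training-set mode
X_train = X_train.fillna(train_mode)
X_train.info()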

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22792 entries, 15797 to 9021
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             22792 non-null  int64 
 1   workclass       22792 non-null  object
 2   fnlwgt          22792 non-null  int64 
 3   education       22792 non-null  object
 4   education-num   22792 non-null  int64 
 5   marital-status  22792 non-null  object
 6   occupation      22792 non-null  object
 7   relationship    22792 non-null  object
 8   race            22792 non-null  object
 9   sex             22792 non-null  object
 10  capital-gain    22792 non-null  int64 
 11  capital-loss    22792 non-null  int64 
 12  hours-per-week  22792 non-null  int64 
 13  native-country  22792 non-null  object
dtypes: int64(6), object(8)
memory usage: 2.6+ MB
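
As a small illustration of the mode-filling step, the sketch below applies the same `mode().iloc[0]` and `fillna` calls to a made-up frame. The `train` and `test` frames and their values are hypothetical. Reusing the training modes on held-out rows is shown only as an option; this notebook does not do that.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames with gaps (values are made up).
train = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "hours-per-week": [40, 40, 13, np.nan],
})
test = pd.DataFrame({
    "workclass": [np.nan, "Self-emp-inc"],
    "hours-per-week": [np.nan, 50],
})

# mode() ignores NaN by default; row 0 holds the first mode of each column.
train_mode = dict(train.mode().iloc[0])

# fillna accepts a {column: value} mapping and fills each column separately.
train_filled = train.fillna(train_mode)

# Optional: reuse the statistics learned from the training data on new rows,
# so nothing is computed from the held-out data itself.
test_filled = test.fillna(train_mode)

print(train_filled)
print(test_filled)
```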


### Perform A/B Test

Send the first 100 rows of the randomly shuffled training data to the prediction endpoint and, for each request, submit the actual label as feedback.

In [58]:
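for i in range(100):
    # Serialize the row to a JSON object string (NaN becomes null)
    input_data = X_train.iloc[i].to_json()
    target = y_train.iloc[i]
    # Send the string as the raw body; json=input_data would JSON-encode it a second time
    r = requests.post(
        "http://127.0.0.1:8000/api/v1/income_classifier/predict?status=ab_testing",
        data=input_data,
        headers={"Content-Type": "application/json"},
    )
    response = r.json()
    # provide feedback
    requests.put(f"http://127.0.0.1:8000/api/v1/mlrequests/{response['request_id']}", data={"feedback": target})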
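
The predict-then-feedback round trip can also be wrapped in a small helper that fails loudly on HTTP errors. This is only a sketch: the function name `predict_and_give_feedback`, the `API` constant, and the `timeout` default are assumptions, and `payload` is assumed to be a plain dict of feature values. The endpoint paths and the `request_id` field are taken from the loop above.

```python
API = "http://127.0.0.1:8000/api/v1"  # base URL used in this notebook


def predict_and_give_feedback(session, payload, target, timeout=10):
    """POST one row for prediction, then PUT the true label as feedback.

    `session` is any object with requests-style post/put methods,
    for example requests.Session(). Returns the server's request_id.
    """
    r = session.post(
        f"{API}/income_classifier/predict?status=ab_testing",
        json=payload,      # a dict is serialized to a JSON object
        timeout=timeout,
    )
    r.raise_for_status()   # stop on 4xx/5xx instead of reading a bad body
    request_id = r.json()["request_id"]

    fb = session.put(
        f"{API}/mlrequests/{request_id}",
        data={"feedback": target},
        timeout=timeout,
    )
    fb.raise_for_status()
    return request_id


# Usage (requires the API server to be running):
#   import json, requests
#   with requests.Session() as s:
#       row = json.loads(X_train.iloc[0].to_json())
#       predict_and_give_feedback(s, row, y_train.iloc[0])
```

Passing the session in as an argument reuses one connection across the 100 requests and lets the helper be exercised with a stand-in object when no server is available.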

Repeat the test with the first 100 rows of the test data, this time without filling in the missing values.

In [8]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9769 entries, 10971 to 26822
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             9769 non-null   int64 
 1   workclass       9227 non-null   object
 2   fnlwgt          9769 non-null   int64 
 3   education       9769 non-null   object
 4   education-num   9769 non-null   int64 
 5   marital-status  9769 non-null   object
 6   occupation      9223 non-null   object
 7   relationship    9769 non-null   object
 8   race            9769 non-null   object
 9   sex             9769 non-null   object
 10  capital-gain    9769 non-null   int64 
 11  capital-loss    9769 non-null   int64 
 12  hours-per-week  9769 non-null   int64 
 13  native-country  9587 non-null   object
dtypes: int64(6), object(8)
memory usage: 1.1+ MB


In [59]:
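for i in range(100):
    # Serialize the row to a JSON object string (missing values become null)
    input_data = X_test.iloc[i].to_json()
    target = y_test.iloc[i]
    # Send the string as the raw body; json=input_data would JSON-encode it a second time
    r = requests.post(
        "http://127.0.0.1:8000/api/v1/income_classifier/predict?status=ab_testing",
        data=input_data,
        headers={"Content-Type": "application/json"},
    )
    response = r.json()
    # provide feedback
    requests.put(f"http://127.0.0.1:8000/api/v1/mlrequests/{response['request_id']}", data={"feedback": target})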
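
Because the test rows still contain missing values, it helps to see what actually goes over the wire. The sketch below uses a made-up `row` to show that `Series.to_json()` writes a missing value as JSON `null`, whereas the standard `json.dumps` writes a bare `NaN` token, which is not valid JSON. How the server treats `null` depends on its own preprocessing and is not shown here.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical row with one missing categorical value.
row = pd.Series({
    "age": 39,
    "workclass": np.nan,
    "education": "Bachelors",
    "hours-per-week": 40,
})

# pandas serializes NaN as null, so the result is valid JSON.
as_json = row.to_json()
print(as_json)

# Parsing it back yields None for the missing field.
payload = json.loads(as_json)
print(payload["workclass"] is None)

# The standard library emits NaN by default, which strict JSON parsers reject.
naive = json.dumps({"workclass": float("nan")})
print(naive)
```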