# Multi-class Classification on Anonymized 'Adult' Dataset

This notebook performs and analyses multi-class classification on a k-anonymized version of the 'Adult' dataset from the UCI Machine Learning Repository.

k-anonymity is a property of a dataset under which the information for each entry cannot be distinguished from that of at least *k-1* other entries in the dataset. The algorithm used to k-anonymize the 'Adult' dataset is SaNGreeA, a variant of greedy clustering. Our later experiments use 10 different k-values; in this notebook we focus on preprocessing methods, with the aim of obtaining results similar to those in the paper "DO NOT DISTURB? Classifier Behavior on Perturbed Datasets".
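
To make the definition concrete, the following is a minimal sketch of a k-anonymity check. It is not part of SaNGreeA; the helper name `is_k_anonymous`, the toy data, and the choice of `age` and `sex` as quasi-identifiers are assumptions made purely for illustration.

```python
import pandas as pd


def is_k_anonymous(df, quasi_identifiers, k):
    """Hypothetical helper: True if every combination of quasi-identifier
    values is shared by at least k rows."""
    # dropna=False keeps suppressed values (read as NaN) as their own group
    group_sizes = df.groupby(quasi_identifiers, dropna=False).size()
    return bool(group_sizes.min() >= k)


# Toy data (illustrative only): two equivalence classes of sizes 3 and 2
toy = pd.DataFrame({
    "age": ["[22 - 51]"] * 3 + ["[52 - 70]"] * 2,
    "sex": ["Male"] * 3 + ["Female"] * 2,
    "marital-status": ["Never-married", "Divorced", "Never-married",
                       "Widowed", "Divorced"],
})

print(is_k_anonymous(toy, ["age", "sex"], k=2))  # smallest group has 2 rows
print(is_k_anonymous(toy, ["age", "sex"], k=3))  # fails: one group has only 2 rows
```

The sensitive column (`marital-status` here) is deliberately left out of the grouping, since k-anonymity constrains only the quasi-identifiers.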

We use 4 classifiers:
<ol>
    <li>Gradient Boosting</li>
    <li>Random Forest</li>
    <li>Logistic Regression</li>
    <li>Linear SVC</li>
</ol>

In [1]:
# This is a multiclass classification of anonymized Adult datasets on target 'marital-status' for 4 classifiers:
# Gradient Boosting
# Linear SVC
# Logistic Regression
# Random Forest
# k = {3, 7, 11, 15, 19, 23, 27, 31, 35, 100}

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import math
import numpy as np
import pickle

from sklearn import metrics, preprocessing, model_selection
from sklearn.ensemble import GradientBoostingClassifier as GradientBoosting, RandomForestClassifier as RandomForest
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression


In [3]:
def read_anon_data(filename):
    filepath = "../output/marital-status/"
    filepath += filename
    dataset = pd.read_csv(filepath, sep=r'\s*,\s*', na_values="*", engine='python', index_col=False)
    return dataset

Let's load and have a look at our anonymous data.

In [4]:
k = 11

In [5]:
dataset = read_anon_data("anonymized_equal_weights_k_" + str(k) + ".csv")
dataset.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass,native-country,sex,race,relationship,occupation,income,marital-status
0,[22 - 51],[12 - 13],[0 - 2174],0,[37 - 47],State-gov,United-States,Male,White,Not-in-family,Adm-clerical,<=50K,Never-married
1,[22 - 51],[12 - 13],[0 - 2174],0,[37 - 47],State-gov,United-States,Male,White,Not-in-family,Adm-clerical,<=50K,Never-married
2,[22 - 51],[12 - 13],[0 - 2174],0,[37 - 47],State-gov,United-States,Male,White,Not-in-family,Adm-clerical,<=50K,Divorced
3,[22 - 51],[12 - 13],[0 - 2174],0,[37 - 47],State-gov,United-States,Male,White,Not-in-family,Adm-clerical,<=50K,Never-married
4,[22 - 51],[12 - 13],[0 - 2174],0,[37 - 47],State-gov,United-States,Male,White,Not-in-family,Adm-clerical,<=50K,Never-married


In [6]:
# Preprocessing
def number_encode_features(ds):
    result = ds.copy()
    encoders = {}
    for feature in result.columns:
        # String columns have dtype object (the np.object alias was removed from NumPy)
        if result.dtypes[feature] == object:
            encoders[feature] = preprocessing.LabelEncoder()
            result[feature] = encoders[feature].fit_transform(result[feature].astype(str))
    return result, encoders

dataset_encoded, encoders = number_encode_features(dataset)

In [7]:
# Target will be 'marital-status'
y = dataset_encoded['marital-status']
X = dataset_encoded.drop('marital-status', axis=1)

In [8]:
# Scoring
def f1_micro(clf, X, y):
    # cross validation scores on number encoded data
    scores = model_selection.cross_val_score(clf, X, y, cv=10, scoring='f1_micro')
    print("F1 score: %0.2f (+/- %0.2f)" 
          % (scores.mean(), scores.std() * 2))
    return scores.mean()
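
A note on the metric: when every sample has exactly one label, as with 'marital-status', micro-averaged F1 equals plain accuracy, because each misclassification counts as one false positive and one false negative. The sketch below checks this on made-up labels; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up three-class labels (illustrative only)
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

# Both values are 6 correct predictions out of 8
print(micro_f1, accuracy)
```

The scores reported below can therefore be read as cross-validated accuracies.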

In [9]:
scores = {}

## 1. Gradient Boosting

In [10]:
# Gradient Boosting
clf = GradientBoosting(random_state=0)

In [11]:
scores['Gradient Boosting'] = f1_micro(clf, X, y)

F1 score: 0.80 (+/- 0.04)


## 2. Random Forest

In [12]:
# Random Forest
clf = RandomForest(random_state=0)

In [13]:
scores['Random Forest'] = f1_micro(clf, X, y)

F1 score: 0.72 (+/- 0.07)


## 3. Logistic Regression

In [14]:
# Logistic Regression
clf = LogisticRegression(random_state=0)

In [15]:
scores['Logistic Regression'] = f1_micro(clf, X, y)

F1 score: 0.68 (+/- 0.04)


## 4. Linear SVC

In [16]:
# Linear SVC - binary attributes needed
clf = LinearSVC(random_state=0)

In [None]:
#f1_micro(clf, X, y)

In [None]:
del y, X, dataset_encoded, encoders

In [None]:
# we can try with binary encoded features
# Target will be 'marital-status'
y = dataset['marital-status']
X = dataset.drop('marital-status', axis=1)
X.head()

In [None]:
X = pd.get_dummies(X)
X.shape
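
The difference between the two encodings can be seen on a tiny example. This is a sketch on a made-up `workclass` column, not the notebook's data. Label encoding maps each category to an integer, which a linear model such as Linear SVC would treat as an ordered magnitude, whereas `pd.get_dummies` creates one binary column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column (illustrative only)
toy = pd.DataFrame({"workclass": ["State-gov", "Private", "Self-emp", "Private"]})

# Label encoding: one integer per category, assigned in sorted order
label_encoded = LabelEncoder().fit_transform(toy["workclass"])

# One-hot (binary) encoding: one indicator column per category
one_hot = pd.get_dummies(toy)

print(list(label_encoded))    # [2, 0, 1, 0]
print(list(one_hot.columns))  # ['workclass_Private', 'workclass_Self-emp', 'workclass_State-gov']
```

With one-hot encoding the number of columns grows with the number of distinct values, which is why `X.shape` is worth inspecting after the transformation.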

In [None]:
scores['Linear SVC'] = f1_micro(clf, X, y)

### Saving scores

In [None]:
filename = '../output/marital-status/classification-res/adult_multiclass_k' + str(k)
outfile = open(filename,'wb')

In [None]:
pickle.dump(scores, outfile)
outfile.close()
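
For later comparison across k-values, the saved scores can be read back with `pickle.load`. The sketch below does a round trip through a temporary file; the path and the score values are placeholders, not the notebook's actual output location or results. Using `with` blocks closes the file even if an error occurs.

```python
import os
import pickle
import tempfile

# Placeholder scores (illustrative values only)
example_scores = {"Gradient Boosting": 0.5, "Random Forest": 0.25}

# Temporary location, assumed for this example only
path = os.path.join(tempfile.mkdtemp(), "adult_multiclass_k11")

with open(path, "wb") as outfile:
    pickle.dump(example_scores, outfile)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == example_scores)
```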