# ACRA vs Naive-Bayes Algorithm 

In this notebook, we build the **ACRA** algorithm for good word insertion (GWI) attacks. First, we define all necessary functions; then we test the algorithm on artificial and real datasets and compare its performance against utility-sensitive Naive Bayes.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import beta
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from joblib import Parallel, delayed
import multiprocessing

## ACRA functions

In this section we define all the functions needed to build the **ACRA** algorithm.

### Train Raw Naive-Bayes

This function trains a raw Naive Bayes classifier on a given training set. In particular, it estimates all relevant parameters, such as the likelihoods and the prior distributions. Its inputs are:

   * `X_train`: An array where each row is a given email, in the bag-of-words representation (1 if word present, 0 else).
    
   * `y_train`: An array containing the labels of each email of `X_train` (1 if the email is spam, 0 if ham).

This function returns an `sklearn.naive_bayes.BernoulliNB` object with all relevant information.

In [2]:
def trainRawNB(X_train, y_train):
    from sklearn.naive_bayes import BernoulliNB
    clf = BernoulliNB(alpha=1.0e-10)
    clf.fit(X_train, y_train)
    return(clf)

### Get priors

This function returns the priors $p_C(y = 1)$ and $p_C(y = 0)$. The input is `obj`, an `sklearn.naive_bayes.BernoulliNB` object. The result is an array whose first element is $p_C(y = 0)$ and whose second element is $p_C(y = 1)$.

In [3]:
def priors(obj):
    return(np.exp(obj.class_log_prior_ ))

### Calculate posterior for given instance

For a given instance (email) $X$ and given classifier `obj` (`sklearn.naive_bayes.BernoulliNB` object) , this function returns $p_C(X|y)p_C(y)$ for $y \in \lbrace 0, 1 \rbrace$. 
In particular, it returns an array whose first element is $p_C(X|0)p_C(0)$ and the second $p_C(X|1)p_C(1)$.

In [4]:
def xposterior(X, obj):
    # _joint_log_likelihood is a private scikit-learn method; it returns
    # log p(X|y) + log p(y) for each class, without normalization.
    return(np.exp(obj._joint_log_likelihood(X)))
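To make the returned quantity concrete, the following sketch reproduces $p_C(X|y)p_C(y)$ by hand for a two-word vocabulary, assuming the standard Bernoulli naive Bayes likelihood with Laplace smoothing. The toy data and the names `theta` and `joint` are illustrative assumptions, and the smoothing value differs from the `alpha=1.0e-10` used in `trainRawNB`.

```python
import numpy as np

# Toy training set: two words, labels 1 = spam, 0 = ham (illustrative only).
X_train = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_train = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing, chosen here to avoid zero probabilities

classes = np.array([0, 1])
prior = np.array([(y_train == c).mean() for c in classes])
# theta[c, j] = smoothed probability that word j appears in class c
theta = np.array([
    (X_train[y_train == c].sum(axis=0) + alpha) / ((y_train == c).sum() + 2 * alpha)
    for c in classes
])

x = np.array([1, 0])  # email containing only the first word
likelihood = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
joint = likelihood * prior  # [p(x|0)p(0), p(x|1)p(1)]
print(joint)
```

The two entries are unnormalized; dividing by their sum would give the usual posterior probabilities.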

### Compute $\mathcal{X}'$ for a given x'

This function computes the set of possible originating instances of a given instance $x'$ under the 1-GWI attack strategy, that is, $x'$ itself plus every instance obtained by removing one of its words. It returns an array containing all these instances.

In [5]:
def getXp(xp):
    aux = np.ones( ( xp.shape[1] , xp.shape[1] ) )
    np.fill_diagonal(aux, 0)

    return( np.unique(np.logical_and(xp, aux).astype(int), axis=0) ) 
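As a quick check of what the function returns, here is a small worked example on a three-word email. The function body is copied from the cell above so that the snippet runs on its own; the input vector is an illustrative assumption.

```python
import numpy as np

def getXp(xp):
    # Same logic as above: row i of the result is xp with word i removed.
    aux = np.ones((xp.shape[1], xp.shape[1]))
    np.fill_diagonal(aux, 0)
    return np.unique(np.logical_and(xp, aux).astype(int), axis=0)

xp = np.array([[1, 1, 0]])   # observed email contains words 0 and 1
origins = getXp(xp)
print(origins)
```

The result has three rows: the two emails obtained by removing one of the present words, and `xp` itself (the case in which no word was inserted).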

### Get Random Utilities $U_A(y_C,y,a)$ for a given set of attacks

This function generates random attacker utilities for a set of attacks, with the gain sampled from a gamma distribution. That is, it generates samples from $U_A(y_C,y,a)$.
The inputs are:

* `yc`: $y_C$ the classification result.
* `y`: $y$ the true label.
* `a`: array containing number of words added in each attack.

In [6]:
def randut(yc,y,a):
    
    d = len(a)
    # if y label is malicious and yc label is malicious
    if( (y == 1) and (yc == 1) ):
        Y = - np.random.gamma(shape = np.repeat(2500.0, d), scale = np.repeat(1.0/500.0, d))
        
    else:    
        # if y label is malicious and yc label innocent
        if( (y == 1) and (yc == 0) ):
            Y = np.random.gamma(shape = np.repeat(2500.0, d), scale = np.repeat(1.0/500.0, d))
        # if y label is innocent and yc label is malicious OR y label is innocent and yc label is innocent  
        else:
            Y = np.repeat(yc*y, d)
    
    
    # Generate random cost of implementing attack
    B = a*np.random.uniform(high = 0.6, low = 0.4, size = 1)
    
    # Risk proneness
    rho = np.random.uniform(high = 0.6, low = 0.4, size = 1)
  
    return (np.exp( rho * (Y - B) ))

### Get random probabilities $P_{a(x)}^A$ for given set of instances

First we define some auxiliary functions that are used later.

#### Get $r_a$ for given set of instance

For a given set of instances, this function computes the mean of the beta distribution to be used later. 
The inputs are:
* `X`: a 2D-array containing the set of instances.
* `obj`: the classifier (`sklearn.naive_bayes.BernoulliNB` object).

It returns an array containing $r_a$ for each email.

In [7]:
def getRa(X, obj):
    ra = obj.predict_proba(X)[:,1]
    ra[ra==1.0] -= 0.0001
    return(ra)

For a given set of instances, this function computes the mean of the beta distribution to be used later (alternative form). 
The inputs are:
* `X`: a 2D-array containing the set of instances.
* `obj`: the classifier (`sklearn.naive_bayes.BernoulliNB` object).
* `n`: an integer indicating the number of word changes of the attacks (currently unused, since `getXp` only implements 1-GWI).

It returns an array containing $r_a$ for each email.

In [8]:
def getRa2(X, obj, n):

    def aux(Z, obj, n):
        q = np.sum( np.apply_along_axis(lambda x: xposterior(x.reshape(1, -1), obj)[0,1],\
                                        1, getXp(Z) ) )  # getXp takes a single argument
        return ( q / (q + xposterior(Z, obj)[0, 0]) )

    ra = np.apply_along_axis( lambda x : aux(x.reshape(1 , -1), obj, n), 1, X)
    ra[ra==1.0] -= 0.0001
    return(ra)


#### Get $\delta_1$ and $\delta_2$ for given set of $r_a$ and $var$

This function returns the shape parameters of the beta distribution for a given set of means `ra` and a variance equal to $var\cdot ra \min \big(\frac{ra(1-ra)}{1+ra},\frac{(1-ra)^2}{2-ra}\big)$, where `var` is a scaling factor.

In [9]:
def deltas(ra, var):
   
    deltas = np.zeros((len(ra),2))
    
    for i in range(len(ra)):
        # s2: proportion of the maximum variance of a convex beta
        s2 = var * ra[i] * min(ra[i] * (1.0 - ra[i]) / (1.0 + ra[i]),
                               (1.0 - ra[i])**2 / (2.0 - ra[i]))
        deltas[i][0] = ( ( 1.0 - ra[i] ) / s2 - 1.0 / ra[i]) * ra[i]**2
        deltas[i][1] = deltas[i][0] * ( 1.0/ra[i] - 1.0 )
 
    return(deltas)
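The two formulas for the shape parameters are a method-of-moments fit: they are chosen so that the beta distribution has mean `ra` and variance `s2`. The sketch below checks this with the closed-form beta moments. `beta_shapes` is a hypothetical vectorized variant of `deltas`, introduced only for this illustration.

```python
import numpy as np

def beta_shapes(ra, var):
    # Vectorized variant of the loop above (hypothetical helper name).
    ra = np.asarray(ra, dtype=float)
    s2 = var * ra * np.minimum(ra * (1 - ra) / (1 + ra), (1 - ra) ** 2 / (2 - ra))
    d1 = ((1 - ra) / s2 - 1 / ra) * ra ** 2
    d2 = d1 * (1 / ra - 1)
    return d1, d2, s2

ra = np.array([0.3, 0.9])
d1, d2, s2 = beta_shapes(ra, var=0.5)

# Closed-form mean and variance of Beta(d1, d2)
mean = d1 / (d1 + d2)
variance = d1 * d2 / ((d1 + d2) ** 2 * (d1 + d2 + 1))
print(np.allclose(mean, ra), np.allclose(variance, s2))
```

Note that `ra` must lie strictly between 0 and 1 for the formulas to be defined, which is why `getRa` nudges values equal to 1.0 downwards.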

#### Compute $a(x)$ for all $a \in \mathcal{A}(X)$

For a given email $X$, this function computes $\mathcal{A}(X)$ under some attack strategy (one word insertion, in this particular case). For each $a \in \mathcal{A}(X)$, it computes $a(X)$, and returns an array containing all $a(X)$.

In [10]:
def getxax(X):
    aux = np.logical_or(X, np.identity(X.shape[1])).astype(int)
    return( np.unique(np.insert(aux, 0, X, 0), axis = 0) )
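A small worked example may help to see the shape of the output. The function body is copied from the cell above so that the snippet is self-contained; the three-word email is an illustrative assumption.

```python
import numpy as np

def getxax(X):
    # Same logic as above: the original email plus every one-word insertion.
    aux = np.logical_or(X, np.identity(X.shape[1])).astype(int)
    return np.unique(np.insert(aux, 0, X, 0), axis=0)

x = np.array([[1, 0, 0]])
attacked = getxax(x)
n_added = np.sum(attacked - x, axis=1)  # words inserted by each attack
print(attacked)
print(n_added)
```

The first row is the unmodified email (the "do nothing" attack), and each remaining row adds exactly one word. The vector `n_added` is the quantity that `pxaxp` later calls `distances`.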

#### Generate from $P_{a(x)}^A \sim \beta e(\delta_1^a, \delta_2^a)$, for a given set of $\delta_1^a$ and $\delta_2^a$

For a given array `deltas`, where `deltas[:,0]` corresponds to $\delta_1$ and `deltas[:,1]` to $\delta_2$, this function generates one sample from the beta distribution for each pair of deltas. 

In [11]:
def randprob(deltas):
    return( np.random.beta(deltas[:,0], deltas[:,1]) )

### Compute $p_C(a_{x \rightarrow x'}|x,+)$ for a pair $x$ and $x'$ and given `var`

This is the main function of the ACRA algorithm. For a pair of instances $x$ and $x'$, it returns the probability that the attacker, holding instance $x$, will execute the attack that transforms $x$ into $x'$.

In [12]:
def pxaxp(x, xp, obj, var, K = 1000):

    # First we compute the set of all a(x) for the given x, and store them in the array aX.
    aX = getxax(x)
    
    # We store in ix the index of the element of aX that coincides with xp, i.e. the index of the attack
    # connecting x with xp.
    ix = np.where(np.all(aX == xp, axis=1))[0]
    
    # We compute the deltas of the instances in aX and store them in d
    d = deltas(getRa(aX, obj), var)

    # We compute the distances between the attacked instances (those in aX) and the original instance x 
    distances = np.sum(aX - x, axis=1)
    
    # We start the simulation
    distribution = np.zeros( len(distances) ) #here we will store the number of times each attack is maximum
                            
    for i in range(K):                    
        PA = randprob(d)
        psi = PA * randut(1,1,distances) + (1.0 - PA)* randut(0,1,distances) 
        distribution[np.argmax(psi)] += 1
                            
        

    return( sum(distribution[ix])/K )
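The simulation loop can be hard to picture without a classifier at hand, so here is a minimal standalone sketch of the same Monte Carlo idea. It assumes three candidate attacks with made-up spam probabilities `r_a` (in the notebook these come from `getRa`), and `sample_utility` is a hypothetical helper that mirrors `randut` for spam emails only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: spam probability r_a the classifier assigns to three
# candidate attacked emails, and the number of words each attack inserts.
r_a = np.array([0.95, 0.20, 0.90])
n_words = np.array([0, 1, 1])
var = 0.5

# Beta shape parameters with mean r_a, as in deltas()
s2 = var * r_a * np.minimum(r_a * (1 - r_a) / (1 + r_a), (1 - r_a) ** 2 / (2 - r_a))
d1 = ((1 - r_a) / s2 - 1 / r_a) * r_a ** 2
d2 = d1 * (1 / r_a - 1)

def sample_utility(sign, n_words):
    # Mirrors randut() for spam: gamma gain (sign = -1 if detected, +1 if
    # not), minus a random per-word cost, passed through exp(rho * .).
    gain = sign * rng.gamma(2500.0, 1.0 / 500.0, size=len(n_words))
    cost = n_words * rng.uniform(0.4, 0.6)
    rho = rng.uniform(0.4, 0.6)
    return np.exp(rho * (gain - cost))

K = 2000
counts = np.zeros(len(r_a))
for _ in range(K):
    p_detect = rng.beta(d1, d2)
    psi = p_detect * sample_utility(-1, n_words) + (1 - p_detect) * sample_utility(+1, n_words)
    counts[np.argmax(psi)] += 1

attack_probs = counts / K   # estimated probability of each attack
print(attack_probs)
```

With these inputs, almost all of the probability mass falls on the second attack, the only one that makes the email look like ham. `pxaxp` returns the entry of this vector that corresponds to the observed $x'$.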

## Utility sensitive Naive-Bayes label

For given emails, given utilities, and a given `sklearn.naive_bayes.BernoulliNB` classifier, this function returns the labels of the emails computed with the utility-sensitive NB algorithm, i.e. the labels that maximize the expected utility.

The inputs are
* `Xp`: array containing the instance to predict on.
* `obj`: the classifier.
* `ut`: an array containing the utilities, `ut[i,j]` $= u_C(y_C = i, y = j)$

The output is an array containing the predicted label of each email, 1 for spam and 0 for ham.


In [13]:
def nbusXlabel(Xp, obj, ut):
    aux = np.dot(ut, xposterior(Xp,obj).transpose())
    return(np.argmax(aux, axis=0))
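The decision rule is a single matrix product followed by an `argmax`. The sketch below works through it on two illustrative posteriors, using the utility matrix that appears later in the real-data experiment; the posterior values themselves are made up for the example.

```python
import numpy as np

# ut[i, j] = u_C(y_C = i, y = j); values taken from the real-data experiment
ut = np.array([[1, -1], [-10, 1]])

# Illustrative (normalized) posteriors: columns are p(ham|x), p(spam|x)
posterior = np.array([[0.30, 0.70],
                      [0.05, 0.95]])

expected_utility = ut @ posterior.T      # shape (2 decisions, n emails)
labels = np.argmax(expected_utility, axis=0)
print(expected_utility)
print(labels)
```

The first email is labeled ham even though spam is more likely, because misclassifying ham as spam costs 10. Only the second email, with a spam probability of 0.95, is labeled spam. Since `argmax` is invariant to a positive rescaling of each column, the unnormalized values returned by `xposterior` give the same labels.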
    

## ACRA label

For a given email, given `sklearn.naive_bayes.BernoulliNB` classifier and given `var` this function returns the ACRA posteriors $p_C(X'|y)p_C(y)$ for $y \in \lbrace 0, 1 \rbrace$. 
In particular, it returns an array whose first element is $p_C(X'|0)p_C(0)$ and the second $p_C(X'|1)p_C(1)$.

The inputs are
* `Xp`: array containing the instance to predict on.
* `obj`: the classifier.
* `var`: variance = $var\cdot ra \min \big(\frac{ra(1-ra)}{1+ra},\frac{(1-ra)^2}{2-ra}\big)$



In [14]:
def ACRAposterior(Xp, obj, var):
    aux = getXp(Xp)
    total = 0  # avoid shadowing the built-in sum
    for i in range(aux.shape[0]):
        total += pxaxp(aux[[i],:],Xp,obj,var)*xposterior(aux[[i],:], obj)[0,1]
    return(np.array( [ xposterior(Xp, obj)[0,0] , total ] ))
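Reading off the code, the spam component returned by `ACRAposterior` can be written as the sum below, where $\mathcal{X}'$ is the set returned by `getXp` and each attack probability is the Monte Carlo estimate computed by `pxaxp`. This is a restatement of the implementation under the 1-GWI assumption, not an independent derivation.

$$
p_C(x'|1)\,p_C(1) = \sum_{x \in \mathcal{X}'} p_C(a_{x \rightarrow x'}|x,+)\; p_C(x|1)\,p_C(1)
$$

The ham component is simply $p_C(x'|0)\,p_C(0)$ as given by `xposterior`, since the attacker is assumed not to modify ham emails.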
    
        

The same computation for multiple instances, parallelized across CPU cores.

In [15]:
def posteriorInput(i, Xp, obj, var):
    return ACRAposterior(Xp[[i],:], obj, var)

In [16]:
def ACRAparPosterior(Xp, obj, var):
    inputs = range(Xp.shape[0])
    num_cores = multiprocessing.cpu_count()
    # Posteriors do not depend on the utilities, so only Xp, obj and var are passed
    result = Parallel(n_jobs=num_cores)(delayed(posteriorInput)(i, Xp, obj, var) for i in inputs)
    return(np.array(result))

For a given email, this function returns the ACRA label.
The inputs:
* `Xp`: array containing the instance to predict on.
* `obj`: the classifier.
* `var`: variance = $var\cdot ra \min \big(\frac{ra(1-ra)}{1+ra},\frac{(1-ra)^2}{2-ra}\big)$
* `ut`: an array containing the utilities, `ut[i,j]` $= u_C(y_C = i, y = j)$

The output is 0 for ham, 1 for spam

In [17]:
def ACRA(Xp,obj, ut, var):
    aux = np.dot(ut, ACRAposterior(Xp,obj, var).transpose())
    return(np.argmax(aux, axis=0))

The following function transforms posteriors into labels for a given utility.

In [18]:
def ACRAlabel(posterior, ut):
    
    aux = np.dot(ut, posterior.transpose())
        
    return(np.argmax(aux, axis = 0))
        

## Attacker simulation

In order to test the ACRA algorithm, we need an attacked test set. Since there are no benchmarks for that purpose, we generate it artificially by simulating the attacker's behaviour. 

As a first step, we can simulate the attacker using the same assumptions we use to solve the classifier's problem, but removing the uncertainty that is not present from the attacker's point of view. Therefore, the attacker will not change ham emails. For each spam email he will solve

$$
\arg\max_a \big[u_A(+,+,a)-u_A(-,+,a)\big]p_{a(x)}^A + u_A(-,+,a)
$$

Now the utilities are no longer random. Specifically, we use the same ones as in the classifier's problem, but collapse every probability distribution to its mean value.

$p_{a(x)}^A$ will be the probability given by the naive Bayes classifier.
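As a sanity check on the "collapse to the mean" step, the sketch below samples the three random ingredients with the parameters used in `randut` and prints their empirical means. The sample size and seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gain = rng.gamma(shape=2500.0, scale=1.0 / 500.0, size=n)  # as in randut
cost_per_word = rng.uniform(0.4, 0.6, size=n)
rho = rng.uniform(0.4, 0.6, size=n)

print(gain.mean(), cost_per_word.mean(), rho.mean())
```

The means are close to 5, 0.5 and 0.5 (the gamma mean is shape times scale, 2500/500 = 5), which are the constants used by the deterministic attacker utility `adversarialUt`.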


First we define the utility function

In [19]:
def adversarialUt(yc,y,a):
    
    d = len(a)
    # if y label is malicious and yc label is malicious
    if( (y == 1) and (yc == 1) ):
        Y = np.repeat(-5.0, d)
        
    else:    
        # if y label is malicious and yc label innocent
        if( (y == 1) and (yc == 0) ):
            Y = np.repeat(5.0, d)
        # if y label is innocent and yc label is malicious OR y label is innocent and yc label is innocent  
        else:
            Y = np.repeat(0.0, d)
    
    
    # Deterministic cost of implementing the attack (mean of the random cost)
    B = a*np.repeat(0.5, d)
    
    # Risk proneness (mean of the random value)
    rho = np.repeat(0.5, d)
  
    return (np.exp( rho * (Y - B) ))

The following function performs an attack over a given email x, using the previous deterministic model.

In [20]:
def attackit(X, y, obj):
    if y == 0:
        return(X)
    else:
        possibleAttacks = getxax(X)
        pr = getRa(possibleAttacks, obj)
        distances = np.sum(possibleAttacks - X, axis=1)
        psi = pr * adversarialUt(1,1,distances) + (1.0 - pr)* adversarialUt(0,1,distances) 
        return(possibleAttacks[np.argmax(psi),:])

This function attacks a whole set of emails.

In [21]:
def attack(X, y, obj):
    att = np.zeros(X.shape, dtype=int)
    for i in range(len(y)):
        att[i,:] = attackit(X[[i],:], y[i], obj)
    return(att)

## Experiment with artificial data

In this section, we generate artificial data to test the ACRA algorithm. For that purpose we use the `mailGenerator` function. This function generates k-word emails in bag-of-words representation, i.e. each email is represented by a vector of 1's and 0's indicating whether the corresponding word is present in the email. In addition, the last entry indicates whether the email is spam (1) or ham (0).

Each email is generated as follows.
A random email is produced by flipping a coin for each word (heads = the corresponding word appears, tails = it does not), and the draw is repeated until exactly one of the spam word and the fixing word is present. `spamWord` is an integer indicating the position of the spam word, and `fixingWord` the position of the fixing word. An email containing the spam word (and not the fixing word) is labeled spam with high probability (`pAux`); an email containing the fixing word (and not the spam word) is labeled spam with low probability (1 - `pAux`). Thus, the fixing word is an extremely good word to add to a spam email in order to make it look like ham. The only useful attack is then to add the fixing word (the rest of the words are just noise), and attacked emails have the form $(1,1, \cdots)$, with 1's in both the spam and the fixing word. Since we only consider exploratory attacks, the training set is clean and no attacked emails can appear in it.

In [22]:
def mailGenerator(k = 5, spamWord = 0, fixingWord = 1, pAux = 0.9):
    
    email = np.random.binomial(1, 0.5, size = k)
    while (email[spamWord] == 1 and email[fixingWord] == 1) or \
        (email[spamWord] == 0 and email[fixingWord] == 0):
        email = np.random.binomial(1, 0.5, size = k)
    
    if email[spamWord] == 1 and email[fixingWord] == 0:
        
        if np.random.binomial(1, pAux) == 1:
            email = np.insert(email, len(email), 1)
            
        else:
            email = np.insert(email, len(email), 0)
        
    else:
        if np.random.binomial(1, pAux) == 1:
            email = np.insert(email, len(email), 0)
            
        else:
            email = np.insert(email, len(email), 1)
            
           # if email[spamWord] == 1 and email[fixingWord] == 1:
           #     
           #     if np.random.binomial(1, probFix) == 1:
           #         email = np.insert(email, len(email), 0)
           #         
           #     else:
           #             email = np.insert(email, len(email), 1)
           #             
           # else:
                
          #    email = np.insert(email, len(email), 0)
                
    return(email)

In [23]:
def dataSetGen(n = 1000, k = 5, spamWord = 0, fixingWord = 1):
    
    cols = [None] * (k+1)
    for i in range(k):
        cols[i] = "W" + str(i)
    cols[ k ] = "spam"
    
    data = pd.DataFrame(columns=cols)
    
    for i in range(n):
        data.loc[i] = list(mailGenerator(k, spamWord, fixingWord))
        
    return(data)
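Building the `DataFrame` one row at a time is slow for large `n`. As a possible alternative, the sketch below draws the whole dataset at once with NumPy. `generate_emails` is a hypothetical helper, not part of the notebook; it assumes the same generative process as `mailGenerator` (exactly one of the two special words present, label agreeing with the spam word with probability `p_aux`).

```python
import numpy as np

def generate_emails(n=1000, k=5, spam_word=0, fixing_word=1, p_aux=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, k))        # noise words: fair coin flips
    has_spam_word = rng.integers(0, 2, size=n)
    X[:, spam_word] = has_spam_word            # exactly one of the two
    X[:, fixing_word] = 1 - has_spam_word      # special words is present
    agrees = rng.random(n) < p_aux             # label follows the spam word
    y = np.where(agrees, has_spam_word, 1 - has_spam_word)
    return X, y

X, y = generate_emails(n=20000)
print("Prevalence:", y.mean())
print("P(spam | spam word):", y[X[:, 0] == 1].mean())
```

Sampling the spam word directly replaces the rejection loop: conditioning fair coin flips on "exactly one special word" leaves each of the two cases equally likely.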


### Generate and preprocess the data

In this section we generate the data, and divide into training and test set.

In [36]:
spamData = dataSetGen()
spamData = shuffle(spamData)
spamData = spamData.astype("float64")
spamData

Unnamed: 0,W0,W1,W2,W3,W4,spam
452,1.0,0.0,1.0,0.0,1.0,1.0
904,0.0,1.0,0.0,0.0,0.0,0.0
101,0.0,1.0,0.0,1.0,1.0,0.0
143,1.0,0.0,0.0,1.0,1.0,1.0
362,0.0,1.0,0.0,0.0,1.0,0.0
54,1.0,0.0,0.0,0.0,0.0,1.0
915,0.0,1.0,0.0,1.0,0.0,0.0
252,0.0,1.0,0.0,1.0,1.0,0.0
735,1.0,0.0,1.0,0.0,0.0,1.0
112,1.0,0.0,1.0,0.0,1.0,1.0


Let's calculate the prevalence of spam in the dataset.

In [37]:
p = sum(spamData.spam)/len(spamData)
print("Prevalence: ", p)

Prevalence:  0.476


Now we create the training and test sets by sampling at random from the whole dataset. The parameter `q` indicates the proportion of emails in the test set. We store the feature values in the array `X` and the labels in `y`.

In [38]:
X = spamData.drop("spam", axis=1).values
y = spamData.spam.values
q = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=q, random_state=42)

## Train Naive-Bayes and test in clean data

We train the algorithm and predict over the test set. We also check performance over the clean test set.

In [39]:
clf = trainRawNB(X_train, y_train)
y_pred = clf.predict(X_test)
print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))
print("Accuracy Score: ", accuracy_score(y_test, y_pred))

Confusion Matrix: 
[[97 11]
 [12 80]]
Accuracy Score:  0.885


## ACRA vs Naive-Bayes in attacked test data

In this section we compare both algorithms using different utilities. First we generate the attacked test set.

In [40]:
X_testAtt = attack(X_test, y_test, clf)
X_testAtt

array([[0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [1, 1, 1, 1, 0],
       [0, 1, 0, 0, 1],
       [1, 1, 1, 0, 0],
       [0, 1, 0, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1],
       [0, 1, 0, 1, 0],
       [0, 1, 0, 1, 1],
       [0, 1, 1, 1, 0],
       [1, 1, 0, 0, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1],
       [0, 1, 1, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 0],
       [1, 1, 0, 0, 0],
       [1, 1, 1, 1, 0],
       [1, 1, 0, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 1, 1, 0, 0],
       [0, 1, 0, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0],
       [1, 1, 0, 1, 1],
       [0, 1, 0, 0, 1],
       [0, 1, 1, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [1, 1, 0,

Now we calculate ACRA and utility sensitive naive bayes labels for a given utility.

In [41]:
ut = np.array([[1,0],[0,1]])

y_nbus = nbusXlabel(X_testAtt, clf, ut)
xxyy = ACRAparPosterior(X_testAtt, clf, var = 0.5)
y_ACRA = ACRAlabel(xxyy, ut)
print("Confusion Matrix NB: ")
print(confusion_matrix(y_test, y_nbus))
print("Accuracy Score NB: ", accuracy_score(y_test, y_nbus))

print("Confusion Matrix ACRA: ")
print(confusion_matrix(y_test, y_ACRA))
print("Accuracy Score ACRA: ", accuracy_score(y_test, y_ACRA))

Confusion Matrix NB: 
[[97 11]
 [92  0]]
Accuracy Score NB:  0.485
Confusion Matrix ACRA: 
[[108   0]
 [ 12  80]]
Accuracy Score ACRA:  0.94
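The confusion matrices above can be unpacked with a little arithmetic. The sketch below recomputes the accuracy and the fraction of spam detected from the reported counts; `accuracy` and `spam_recall` are hypothetical helper names, and the row/column convention is that of `sklearn.metrics.confusion_matrix` (rows are true labels, columns are predictions).

```python
import numpy as np

# Confusion matrices reported above (rows: true ham/spam, cols: predicted)
cm_nb = np.array([[97, 11], [92, 0]])
cm_acra = np.array([[108, 0], [12, 80]])

def accuracy(cm):
    return np.trace(cm) / cm.sum()

def spam_recall(cm):
    return cm[1, 1] / cm[1].sum()

print(accuracy(cm_nb), spam_recall(cm_nb))
print(accuracy(cm_acra), spam_recall(cm_acra))
```

On the attacked test set, utility-sensitive Naive Bayes detects none of the 92 spam emails, while ACRA recovers 80 of them.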


## Experiment with real data

### Read and preprocess the data

The data consists of 4601 emails in bag-of-words representation, i.e. each email is represented by a vector of 1's and 0's indicating whether the corresponding word is present in the email. In addition, the last column indicates whether the email is spam (1) or ham (0). We read the data from the file located in the `data/` directory.

In [42]:
dataPath = "data/"
spamData = pd.read_csv(dataPath + "uciData.csv")
spamData = shuffle(spamData)
spamData

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_edu,word_freq_table,word_freq_conference,char_freq_,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,spam
2703,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1905,0,0,0,0,1,0,0,1,0,0,...,1,0,0,1,1,1,1,0,0,0
3740,1,0,0,0,0,1,0,1,0,1,...,0,0,0,1,1,0,1,1,0,0
4165,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
201,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,1,0,1,1,0,1
754,0,1,1,0,1,0,1,1,0,0,...,0,0,0,0,0,0,1,1,1,1
1953,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1326,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3987,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4226,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


Let's calculate the prevalence of spam in the dataset.

In [43]:
p = sum(spamData.spam)/len(spamData)
print("Prevalence: ", p)

Prevalence:  0.394044772875


Now we create the training and test sets by sampling at random from the whole dataset. The parameter `q` indicates the proportion of emails in the test set. We store the feature values in the array `X` and the labels in `y`.

In [44]:
X =  spamData.drop("spam", axis=1).values
y = spamData.spam.values
q = 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=q)

## Train Naive-Bayes

We train the algorithm and predict over the test set. We also check performance.

In [45]:
clf = trainRawNB(X_train, y_train)
ut = np.array([[1,-1],[-10,1]])
y_pred = nbusXlabel(X_test, clf, ut)
print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))
print("Accuracy Score: ", accuracy_score(y_test, y_pred))

Confusion Matrix: 
[[668  35]
 [103 345]]
Accuracy Score:  0.880104257168


## ACRA vs Naive-Bayes in real attacked test data

In this section we compare both algorithms using different utilities. First we generate the attacked test set.

In [46]:
X_testAtt = attack(X_test, y_test, clf)
X_testAtt

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       ..., 
       [1, 0, 1, ..., 1, 1, 0],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 0]])

Now we calculate ACRA and utility sensitive naive bayes labels for a given utility.

In [47]:
ut = np.array([[1,-1],[-10,1]])

y_nbus = nbusXlabel(X_testAtt, clf, ut)
xxyy = ACRAparPosterior(X_testAtt, clf, var = 0.5)

y_ACRA = ACRAlabel(xxyy, ut)
print("Confusion Matrix NB: ")
print(confusion_matrix(y_test, y_nbus))
print("Accuracy Score NB: ", accuracy_score(y_test, y_nbus))

print("Confusion Matrix ACRA: ")
print(confusion_matrix(y_test, y_ACRA))
print("Accuracy Score ACRA: ", accuracy_score(y_test, y_ACRA))


Confusion Matrix NB: 
[[668  35]
 [178 270]]
Accuracy Score NB:  0.814943527368
Confusion Matrix ACRA: 
[[684  19]
 [ 89 359]]
Accuracy Score ACRA:  0.906168549088


The next function collects the results in a dataframe and writes it to a CSV file in the `results/` directory.

In [48]:
def write_to_csv(name, X, NBC_post, ACRA_post, NB_post, y_NBC, y_ACRA, y_NB, y_test):
    cols = [None] * (X.shape[1] + 10)
    for i in range(X.shape[1]):
        cols[i] = "W" + str(i)
    cols[ X.shape[1] ] = "NBCpost0"
    cols[ X.shape[1] + 1 ] = "NBCpost1"
    cols[ X.shape[1] + 2 ] = "ACRApost0"
    cols[ X.shape[1] + 3 ] = "ACRApost1"
    cols[ X.shape[1] + 4 ] = "NBpost0"
    cols[ X.shape[1] + 5 ] = "NBpost1"
    cols[ X.shape[1] + 6 ] = "NBClab"
    cols[ X.shape[1] + 7 ] = "ACRAlab"
    cols[ X.shape[1] + 8 ] = "NBlab"
    cols[ X.shape[1] + 9 ] = "spam"


    bigResult = pd.DataFrame(columns=cols)

    for i in range(y_test.shape[0]):
        bigResult.loc[i] = list(np.concatenate( (X[i,:], NBC_post[i,:], ACRA_post[i,:], \
                                                 NB_post[i,:], y_NBC[[i]],y_ACRA[[i]], \
                                                 y_NB[[i]], y_test[[i]]) , axis = 0))
       

    bigResult.to_csv("results/" + name)

## Big Experiment

In this section we repeat the experiment `N` times, each with a different training-test split, and for each split we run ACRA over a grid of 9 values of the `var` parameter. We first read the data.

In [51]:
dataPath = "data/"
bigSpam = pd.read_csv(dataPath + "uciData.csv")
bigSpam = shuffle(bigSpam)

We define the grid of values of `var` and the number of repetitions `N`.

In [52]:
var_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] # Different values of var
N = 1 # Times to repeat experiment

We perform the experiment

In [53]:
for i in range(N):
    
    #Split Training-Test
    X = bigSpam.drop("spam", axis=1).values
    y = bigSpam.spam.values
    q = 0.25
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=q)
    
    #Train NB
    clf = trainRawNB(X_train, y_train)
    
    NBC_post = xposterior(X_test, clf)
    ut = np.array([[1,-4],[-1,1]])
    y_NBC = nbusXlabel(X_test, clf, ut)
    
    ## Attack test set
    X_testAtt = attack(X_test, y_test, clf)
    NB_post = xposterior(X_testAtt, clf)
    y_NB = nbusXlabel(X_testAtt, clf, ut)
    
    ## ACRA loop
    
    for j in var_grid:
        
        ACRA_post = ACRAparPosterior(X_testAtt, clf, var = j)
        y_ACRA = ACRAlabel(ACRA_post, ut)
        name = "BE" + "N" + str(i) + "var" + str(j*100)  + ".csv"
        # Pass the attacked test set so that the rows match the posteriors and labels
        write_to_csv(name, X_testAtt, NBC_post, ACRA_post, NB_post, y_NBC, y_ACRA, y_NB, y_test)
        