## Boolean Learning Example

Let us define the unknown target function, $f:\mathcal{X} \to \mathcal{Y}$. The training set
is $\left\{(x,y)\right\}$, which means that we only see the function's
inputs and outputs. The hypothesis set $\mathcal{H}$ is the set of all
possible guesses at $f$. This is the set from which we will ultimately
draw our final estimate, $g$. **The machine learning problem is
how to derive the best element from the hypothesis set by using the
training set.**

Suppose $\mathcal{X}$ consists of all four-bit vectors (i.e.,
$\mathcal{X}=\left\{0000,0001,\ldots,1111\right\}$), as constructed in the code below:

In [None]:
import pandas as pd
import numpy as np
from pandas import DataFrame
df=DataFrame(index=pd.Index(['{0:04b}'.format(i) for i in range(2**4)],
                            dtype='str',
                            name='x'),columns=['f'])

**Programming Tip.**

The string specification above uses Python's format specification
mini-language. Here, `04b` says to convert the integer into a binary
(`b`) representation with a fixed width of four characters (`4`),
padded on the left with zeros (`0`).
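As a small illustration of the tip above, the same format specification can be applied in three interchangeable ways. The variable names here are illustrative only and do not appear in the original code.

```python
# Three equivalent ways to render an integer as a zero-padded,
# four-character binary string.
i = 5
a = '{0:04b}'.format(i)   # str.format, as used in the text
b = format(i, '04b')      # built-in format() with the same spec
c = f'{i:04b}'            # f-string (Python 3.6+)

# All sixteen four-bit labels, built the same way as the DataFrame index.
labels = [format(k, '04b') for k in range(2**4)]
print(a, b, c)
print(labels[0], labels[-1], len(labels))
```

Any of the three forms could be used to build the index; the text's choice of `str.format` is kept throughout for consistency.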



Next, we define the target function $f$, which checks whether
the number of zeros in the binary representation exceeds the
number of ones. If so, the function outputs `1`; otherwise it outputs `0`
(i.e., $\mathcal{Y}=\left\{0,1\right\}$).

In [None]:
df.f=np.array(df.index.map(lambda i:i.count('0')) 
               > df.index.map(lambda i:i.count('1')),dtype=int)
df # show all the input vectors and target values

      f
x
0000  1
0001  1
0010  1
0011  0
0100  1
0101  0
0110  0
0111  0
1000  1
1001  0


Let's suppose that the first eight elements from
$\mathcal{X}$ are twice as likely as the last eight. The following code is a
function that generates elements from $\mathcal{X}$ according to this
distribution.

In [None]:
def get_sample(n=1):
    # indices 0-7 appear twice in the pool, so each is twice as likely as 8-15
    if n==1:
        return '{0:04b}'.format(np.random.choice(list(range(8))*2+list(range(8,16))))
    else:
        return [get_sample(1) for _ in range(n)]

**Programming Tip.**

The function that returns random samples uses the
`np.random.choice` function from Numpy, which takes samples (with replacement)
from the given iterable. Because we want the first eight numbers to be twice
as frequent as the rest, we simply repeat them in the iterable using
`list(range(8))*2`. Recall that multiplying a Python list by an integer duplicates
the entire list by that integer. It does not do element-wise multiplication as
with Numpy arrays. Note that in Python 3 a `range` object must first be converted
to a list, because `range(8)*2` raises a `TypeError`. If we wanted the first eight
to be 10 times more frequent, then we would use `list(range(8))*10`, for example.
This is a simple but powerful technique that requires very little code. Note that
the `p` keyword argument in `np.random.choice` also provides an explicit way to
specify more complicated distributions.
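The `p` keyword mentioned in the tip can be sketched as follows. This is a hypothetical alternative, not the function used in the text: the name `get_sample_p` and the `weights` array are assumptions introduced for illustration, and for a given seed its draws will not generally match those of `get_sample`.

```python
import numpy as np

# Relative weights: the first eight elements are twice as likely as the last eight.
weights = np.array([2] * 8 + [1] * 8, dtype=float)
p = weights / weights.sum()   # normalize so the probabilities sum to one

def get_sample_p(n=1):
    """Hypothetical variant of get_sample that passes explicit probabilities."""
    draws = np.random.choice(16, size=n, p=p)     # integers in 0..15
    return [format(int(k), '04b') for k in draws]  # four-bit string labels

np.random.seed(0)  # seed chosen arbitrarily for this sketch
samples = get_sample_p(5)
print(samples)
```

The weighted form scales better when the desired probabilities are not simple integer ratios, at the cost of a few more lines.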



 The next block applies the function definition $f$ to the
sampled data to generate the training set consisting of 5 elements.

In [None]:
np.random.seed(12) # for reproduction
train=df.loc[get_sample(5),'f'] # 5-element training set
train.index.unique().shape    # how many unique elements?

(4,)

Notice that even though there are 5 elements, there is redundancy
because these are drawn with replacement according to an underlying
probability distribution. Under the assumption that the prediction will be
used in an environment that is governed by the same distribution, the
predictor must expect to see elements from outside the training set as well
as from inside it.

In [None]:
df['g']=df.loc[train.index.unique(),'f']
df.g

x
0000    NaN
0001    NaN
0010    1.0
0011    0.0
0100    NaN
0101    NaN
0110    0.0
0111    NaN
1000    NaN
1001    0.0
1010    NaN
1011    NaN
1100    NaN
1101    NaN
1110    NaN
1111    NaN
Name: g, dtype: float64

Note that there are `NaN` symbols where the training set had
no values. For definiteness, we fill these in with zeros, although we
can fill them with anything we want so long as whatever we do is not
determined by the training set.
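To quantify how much of the sampling distribution the training set covers, the probabilities of the four observed elements can be summed directly. This is a sketch under the distribution stated earlier (first eight elements twice as likely as the last eight); the list `seen` is read off the non-`NaN` rows of `df.g` above, and the names `prob`, `p_inside`, and `p_outside` are illustrative.

```python
from fractions import Fraction

# Sampling distribution from the text: elements 0..7 have weight 2,
# elements 8..15 have weight 1, for a total weight of 24.
prob = {format(k, '04b'): Fraction(2 if k < 8 else 1, 24) for k in range(16)}

# Unique training elements (the non-NaN rows of df.g).
seen = ['0010', '0011', '0110', '1001']

p_inside = sum(prob[x] for x in seen)   # chance a fresh draw is in the training set
p_outside = 1 - p_inside                # chance it is not
print(p_inside, p_outside)
```

Under these assumptions, a fresh draw lands in the training set with probability 7/24, so most deployed inputs are decided by the fill-in rule rather than by the training data.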

In [None]:
df['g']=df['g'].fillna(0) # final specification of g (avoids chained in-place fill)
df.g

x
0000    0.0
0001    0.0
0010    1.0
0011    0.0
0100    0.0
0101    0.0
0110    0.0
0111    0.0
1000    0.0
1001    0.0
1010    0.0
1011    0.0
1100    0.0
1101    0.0
1110    0.0
1111    0.0
Name: g, dtype: float64

Now, let's pretend we have deployed this and generate some
test data.

In [None]:
np.random.seed(12) # for reproduction
test= df.loc[get_sample(150),'f']
(df.loc[test.index,'g'] != test).mean()

0.32

The result shows the error rate, given the probability
mechanism that is generating the data. The following Pandas-fu
compares the overlap between the training set and the test set in the
context of all possible data. The `NaN` values mark the rows where
the test data contain items absent from the training data. Recall that
$g$ returns zero for these items. As shown, sometimes this works
in its favor, and sometimes not.

In [None]:
pd.concat([test.groupby(level=0).mean(), 
           train.groupby(level=0).mean()],
          axis=1,
          keys=['test','train'])

      test  train
0000     1    NaN
0001     1    NaN
0010     1    1.0
0011     0    0.0
0100     1    NaN
0101     0    NaN
0110     0    0.0
0111     0    NaN
1000     1    NaN
1001     0    0.0


**Programming Tip.**

The `pd.concat` function concatenates the two `Series` objects in the
list. The `axis=1` argument joins the two objects along the columns, and
each newly created column is named according to the given `keys`. The
`level=0` in the `groupby` for each of the `Series` objects means
group along the index. Because the index corresponds to the 4-bit
elements, this accounts for repetition in the elements. The `mean`
aggregation function then computes the value of the function for each
4-bit element. Because all function values in each respective group are
identical, the `mean` just picks out that value, since the average of a
list of constants is that constant.
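As a cross-check on the empirical error rate of 0.32 reported earlier, the exact error rate of this $g$ under the stated sampling distribution can be computed by summing the probabilities of the inputs where $f$ and $g$ disagree. This is a sketch, not part of the original notebook: `f` restates the target rule from the text, `g` encodes the filled-in hypothesis shown above (one only at `0010`), and `prob` and `exact_error` are illustrative names.

```python
from fractions import Fraction

def f(x):
    """Target from the text: 1 if zeros outnumber ones in the bit string."""
    return int(x.count('0') > x.count('1'))

def g(x):
    """Hypothesis from the 5-sample training set: 1 only at '0010', else 0."""
    return int(x == '0010')

X = [format(k, '04b') for k in range(16)]
# First eight elements are twice as likely as the last eight.
prob = {x: Fraction(2 if int(x, 2) < 8 else 1, 24) for x in X}

# Probability mass of the inputs on which g disagrees with f.
exact_error = sum(prob[x] for x in X if f(x) != g(x))
print(exact_error, float(exact_error))
```

The disagreements occur at `0000`, `0001`, `0100`, and `1000`, giving an exact error of 7/24 (about 0.29), which is consistent with the value estimated from 150 test samples.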

The size of the training set is key here: the bigger the training set, the less
likely it is that real-world data will fall outside of it, and the
better $g$ will perform.

In [None]:
np.random.seed(12) # for reproduction
train=df.loc[get_sample(12),'f'] 
del df['g']   
df['g']=df.loc[train.index.unique(),'f']
df['g']=df['g'].fillna(0) # final specification of g (avoids chained in-place fill)
np.random.seed(30) # for reproduction
test= df.loc[get_sample(150),'f'] 
(df.loc[test.index,'g'] != df.loc[test.index,'f']).mean() # error rate

0.12666666666666668

In [None]:
pd.concat([test.groupby(level=0).mean(), 
           train.groupby(level=0).mean()],
          axis=1,
          keys=['test','train'])

      test  train
0000     1    NaN
0001     1    NaN
0010     1    1.0
0011     0    0.0
0100     1    1.0
0101     0    0.0
0110     0    0.0
0111     0    NaN
1000     1    1.0
1001     0    0.0
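The effect of training-set size can also be examined in expectation rather than for one random draw. Under the fill-with-zero rule, $g$ errs at $x$ exactly when $f(x)=1$ and $x$ never appeared among the $n$ independent training draws, which happens with probability $(1-p(x))^n$. Averaging over both the training set and a fresh test draw gives $\sum_{x:f(x)=1} p(x)\,(1-p(x))^n$. The sketch below evaluates this expression; the function name `expected_error` is an assumption introduced here, and the formula relies on training and test draws being independent.

```python
def f(x):
    """Target from the text: 1 if zeros outnumber ones in the bit string."""
    return int(x.count('0') > x.count('1'))

X = [format(k, '04b') for k in range(16)]
# First eight elements are twice as likely as the last eight.
prob = {x: (2 if int(x, 2) < 8 else 1) / 24 for x in X}

def expected_error(n):
    """Expected error of the fill-with-zero hypothesis for n training draws."""
    # An element with f(x) = 1 is misclassified only if it was never sampled.
    return sum(p * (1 - p) ** n for x, p in prob.items() if f(x) == 1)

for n in (5, 12, 50):
    print(n, round(expected_error(n), 3))
```

With no training data, the expression reduces to the probability that $f(x)=1$, and it shrinks monotonically as $n$ grows, which matches the trend seen in the two experiments above.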
