# You know I am strictly Bayesian Right?

This notebook investigates whether a Bayesian framework predicts better than a normal model. The results up to now show significant gains (LB score 11.98 -> 0.74) from employing this framework. I shall be working on implementing multiple models and also on a version that uses `predict_proba` rather than `predict`. I expect to see significant gains when I switch to `predict_proba`.

## 1. Hypothesis Testing

We are tasked to generate the probability that a house has `low`, `medium` or `high` interest, given the feature set provided. We need to train a model, and figure out the probabilities:

$$P[low]=?, P[medium]=?, P[high]=?$$

What we want is the best method of coming up with these probabilities. So, how do we do that? We want to see if there is a method that improves on a basic machine learning (ML) estimate. To that end, we want a framework that allows us to treat ML results as part of a Bayesian inference problem.

### 1.1. Some Definitions

Let us start with the definitions of some terms which will be used for the rest of this report. First, let's write down the preliminaries:

| value | definition
|-------|---
| $H_i$ | The hypothesis that we are testing. $i$ is the subscript that is one of $low$, $medium$, $high$. Hence, it will be useful to look at values such as $H_{high}$, $H_{medium}$, and $H_{low}$. $H_{low}$ for example is the *hypothesis* that the house under scrutiny has low interest. 
| $P(H_i)$ | This is the prior probability that the current house has interest level $i$. Hence, $P(H_{low})$ is the probability that the current house has low interest. 

### 1.2. Now for the Math

Now, we are also given some features, for each house. Let us call them $\mathbf{f}$. Let us say that there are $M$ features, which can be written as:

$$\mathbf{f} = [f_0, f_1, \ldots, f_{M-1}]$$

Typically, the normal procedure that Bayesian people follow is something like the following: 

$$P(H_i|\mathbf{f}) = P(H_i) \frac {P(\mathbf{f}|H_i) } {P(\mathbf{f})} $$

As the saying goes, we are *updating* the prior knowledge with new information. Note that $P(H_i|\mathbf{f})$ is what we are seeking to find in this competition. However, I want to propose a modified version of this method to see if we can gain from ML techniques. 
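To make the update rule concrete, here is a minimal sketch with made-up numbers. The function name `bayes_update` and all three input values are assumptions chosen for illustration; none of them come from the competition data.

```python
def bayes_update(prior, likelihood, evidence):
    """Return P(H|f) = P(H) * P(f|H) / P(f)."""
    if evidence <= 0:
        raise ValueError("evidence P(f) must be positive")
    return prior * likelihood / evidence


# Hypothetical numbers, chosen only for illustration.
prior = 0.25       # P(H_i)
likelihood = 0.6   # P(f | H_i)
evidence = 0.3     # P(f)

posterior = bayes_update(prior, likelihood, evidence)
print(posterior)
```

Here the features are twice as likely under $H_i$ as they are overall, so the prior is doubled.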

### 1.3. Throwing ML in the Mix

Let us say that we train a model $m$ (for example, a Random Forest model) which makes a prediction of $P(H_i)$, using the features $\mathbf{f}$. This can be written as:

$$ m(\mathbf{f}) \rightarrow \hat{P}(H_i)$$

We shall use the symbol $m$ in place of $ m(\mathbf{f})$ and $\hat{P}(H_i)$, because they *all* mean the same thing. $\hat{P}(H_i)$ is one *estimate* of $P(H_i)$. Let us look at this from the Bayesian perspective...

$$P(H_i| m) = P(H_i) \frac {P(m|H_i) } {P(m)} = P(H_i) \frac {P(m|H_i) } { P(m|H_i) P(H_i) + P(m|\bar{H_i}) P( \bar{H_i}) } $$

Note that $\bar{H_i}$ is the *complement* of $H_i$. So, $\bar{H_{low}}$ is the condition that the current house has either high or medium interest.
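The expanded denominator can be sketched in a few lines. The function name `posterior_from_model` and the example numbers are hypothetical; the only thing taken from the text is the formula itself, with $P(\bar{H_i}) = 1 - P(H_i)$.

```python
def posterior_from_model(prior, p_m_given_h, p_m_given_not_h):
    """P(H|m) with P(m) expanded by the law of total probability."""
    # P(m) = P(m|H) P(H) + P(m|not H) P(not H)
    p_m = p_m_given_h * prior + p_m_given_not_h * (1.0 - prior)
    if p_m == 0:
        raise ValueError("P(m) is zero; this prediction can never occur")
    return prior * p_m_given_h / p_m


# Hypothetical values: the model says "1" for 90% of the true positives
# and for 20% of the true negatives.
prior = 0.7
post = posterior_from_model(prior, p_m_given_h=0.9, p_m_given_not_h=0.2)
print(post)
```

When the model is more likely to output this prediction under $H_i$ than under $\bar{H_i}$, the posterior ends up above the prior; when both likelihoods are equal, the prediction carries no information and the prior is returned unchanged.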

### 1.4. Why Stop at One Model?

It is entirely possible that we train not one model, but two, or three, or even a hundred. People frequently do, and either use some form of voting classifier, or some form of averaging for obtaining the final result. However, why not improve the Bayesian paradigm itself? Let us say that we have a bunch of models $\mathbf{m} = [m_0, m_1, \ldots, m_{N-1}]$. Then, each of these models can give us a bunch of predictions on the testing data $\mathbf{\hat{P}}(H_i)$. The above equation changes to:

$$P(H_i| \mathbf{m}) = P(H_i) \frac {P(\mathbf{m}|H_i)} {P(\mathbf{m})}$$

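This is not trivial to solve. However, it is not impossible either. We can, for the sake of simplicity, use the naive assumption of model independence, so that both the joint likelihood $P(\mathbf{m}|H_i)$ and the joint evidence $P(\mathbf{m})$ factor into products over the individual models. In that case, we can try the following: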

$$P(H_i| \mathbf{m}) = P(H_i) \frac { \prod_{j=0}^{N-1} P(m_j|H_i)} { \prod_{j=0}^{N-1} P(m_j) }$$

Ok, so we have some form of a mathematical construct. 
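As a rough sketch, the naive product form can be written as below. The function name `naive_combine` and the numbers for the two models are hypothetical. Note that because real models are rarely independent, the result is not guaranteed to stay below 1, so it is probably best read as a score that may need renormalising across the three classes (that renormalisation is my assumption, not something derived above).

```python
import math


def naive_combine(prior, p_m_given_h, p_m):
    """P(H) * prod_j P(m_j|H) / prod_j P(m_j) under naive independence."""
    if len(p_m_given_h) != len(p_m):
        raise ValueError("need one likelihood and one evidence term per model")
    numerator = math.prod(p_m_given_h)
    denominator = math.prod(p_m)
    if denominator == 0:
        raise ValueError("at least one P(m_j) is zero")
    return prior * numerator / denominator


# Two hypothetical models that both predict "1" for the current house.
score = naive_combine(prior=0.7,
                      p_m_given_h=[0.9, 0.8],   # P(m_j = 1 | H)
                      p_m=[0.69, 0.65])         # P(m_j = 1)
print(score)
```

With these particular numbers the score comes out larger than 1, which illustrates why the independence assumption should be treated as an approximation.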

### 1.5. Summary

Bayesian inference lends itself naturally to the current setting. The rest of the document will look at some possible ways of approximating the different values in the equations above. If anyone has some good ideas for improving the methods already described, do share, so that all of us can benefit :).


## 2. The Preliminaries

First, let us consider some of the preliminaries. We shall consider what we are doing using what we already know. 

The data from the competition has already been split into the training and the test sets, with the output values stored separately, and put into its own folder `localFiles\`. There are three files within this folder:

 - `Xtest.csv`  
 - `Xtrain.csv`  
 - `y.csv`

### 2.1. One-Hot-Encoding prediction

This is relatively simple, and thus we shall not delve into it too much. It has already been done and saved in the file `localFiles\y.csv`.

```python
>>> import pandas as pd
>>> ys = pd.read_csv('localFiles\y.csv')
>>> ys.head()
   high  medium  low
0   0.0     1.0  0.0
1   0.0     0.0  1.0
2   1.0     0.0  0.0
3   0.0     0.0  1.0
4   0.0     0.0  1.0
```
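For readers who want to reproduce this step, one possible way to build such a table is sketched below. This is an assumption about how the file could have been produced, not a record of how it was: the column name `interest_level` and the five labels are hypothetical, picked only so that the output mirrors the head shown above.

```python
import pandas as pd

# Hypothetical raw labels (assumed column name: interest_level).
labels = pd.Series(['medium', 'low', 'high', 'low', 'low'],
                   name='interest_level')

# One indicator column per class, as floats, in a fixed column order.
ys_sketch = pd.get_dummies(labels).astype(float)[['high', 'medium', 'low']]
print(ys_sketch)
```

Every row contains exactly one `1.0`, so each column can be used directly as a binary target for one class.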

## 3. Equations to Programs ...

Let us find out how we can find the different values in the equations above. What do we need to find?

 1. $P(H_i)$, $P(\bar{H_i})$,
 2. $P(m|H_i)$, $P(m|\bar{H_i})$, and $P(m)$

That's it. Only five values. From Section 1.3 it would appear that we could calculate $P(m)$ from the other values. However, as we shall see later, computing $P(m)$ directly is relatively straightforward, so that is what we shall do. In that case, we don't need to calculate $P(m|\bar{H_i})$ explicitly.

### 3.1. Finding $P(H_i)$, and $P(\bar{H_i})$

Without any information about the houses, we could simply say that every interest level is equally likely. That would make $P(H_i) = 1/3$ for each of the three levels. However, we do know from the training data that the classes are significantly skewed, so we estimate $P(H_i)$ and $P(\bar{H_i})$ from the class frequencies instead:

```python
>>> for c in ys.columns:
...    print c, sum(ys[c] == 1)*1.0 / len(ys[c]), sum(ys[c] == 0)*1.0 / len(ys[c])
high 0.0777881342195 0.922211865781
medium 0.227528772897 0.772471227103
low 0.694683092884 0.305316907116
```

So, we get our first set of priors:

| $H_i$        | $P(H_i)$        | $P(\bar{H_i})$ 
|--------------|-----------------|------------------
| $H_{high}$   | 0.0777881342195 | 0.922211865781
| $H_{medium}$ | 0.227528772897  | 0.772471227103
| $H_{low}$    | 0.694683092884  | 0.305316907116

That was easy! 
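Since each column of the one-hot table holds only zeros and ones, its mean is the class frequency, which gives a compact way of getting the same priors. The sketch below uses a small made-up table (`ys_toy`, ten hypothetical houses) rather than the real `y.csv`.

```python
import pandas as pd

# Hypothetical one-hot labels for ten houses: 1 high, 2 medium, 7 low.
ys_toy = pd.DataFrame({
    'high':   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'medium': [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    'low':    [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
}, dtype=float)

priors = ys_toy.mean()        # P(H_i): fraction of ones per column
complements = 1.0 - priors    # P(not H_i)
print(pd.DataFrame({'P(H)': priors, 'P(not H)': complements}))
```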

### 3.2. Finding $P(m|H_i)$, $P(m|\bar{H_i})$, and $P(m)$

Let us think about what this is. We shall look at $P(m|H_i)$ first. It will soon be clear that $P(m|\bar{H_i})$ follows trivially from this same example. I'll be honest here. This is not trivial, and at some points, you do need to make leaps of faith.  When you do, I will clearly enunciate them, so you can make your own decisions of how to deal with them. Let's first break this down real quick so it will be easier to follow:

#### 3.2.1. First what is $m$ and how do we obtain it?

Well, as we have seen previously, the definition for $m$ is:

$$ m(\mathbf{f}) \rightarrow \hat{P}(H_i)$$

Now note that we have used $m$ in two different ways. It represents the 

 1. *model* $m$ as well as the 
 2. *prediction* that the model gives, given the featureset $\mathbf{f}$ (i.e. $m(\mathbf{f})$).

For convenience, these are used interchangeably, since writing $m$ is so much easier than writing out its full counterpart. Which definition of $m$ we are using will generally be clear from the context. In this and the subsequent sections, $m$ will refer to the *prediction*.

So how do we get a model? Let's take a Random Forest model for example, and train it ...

```python
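import pandas as pd
from sklearn.ensemble import RandomForestClassifier

Xtest  = pd.read_csv('localFeatures/Xtest.csv')
Xtrain = pd.read_csv('localFeatures/Xtrain.csv')
y      = pd.read_csv('localFeatures/y.csv')

# Binary target: 1 if the house has low interest, 0 otherwise.
model = RandomForestClassifier()
model.fit(Xtrain, y['low'])  # fit on the training features, not on Xtest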
```

> Question: Why U No Use PEP8? (cringe!!!)
>  
>  Ans: I find it easier to read my code when it is not linted. 

**Note:** *In this case, we are only predicting the values for the case where the interest is low. We need to redo this calculation for each interest level.*

Here `model` represents $m$, our model (not a prediction). The *prediction* for the `i`<sup>th</sup> house in the training dataset, $m(\mathbf{f})$, would then be given by either `model.predict(Xtrain.iloc[[i], :])` or `model.predict_proba(Xtrain.iloc[[i], :])[:, 1]`. This is either a `1` or `0` for the discrete case, or a floating point number for the continuous case. We shall look at the discrete one first as it is easier to wrap our heads around.
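The difference between the two kinds of prediction is easiest to see on a toy problem. The sketch below uses synthetic data (`X_toy`, `y_toy` and `toy_model` are hypothetical stand-ins for `Xtrain`, `y['low']` and `model`) and assumes a reasonably recent pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: the label depends only on the first feature.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.rand(200, 3), columns=['f0', 'f1', 'f2'])
y_toy = (X_toy['f0'] > 0.5).astype(int)

toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(X_toy, y_toy)

i = 0
row = X_toy.iloc[[i], :]                    # double brackets keep the row 2-D

hard = toy_model.predict(row)               # discrete case: array holding 0 or 1
soft = toy_model.predict_proba(row)[:, 1]   # continuous case: P(class == 1)
print(hard, soft)
```

Column `1` of `predict_proba` corresponds to the label `1` here because the classes are ordered as in `toy_model.classes_`, which is `[0, 1]` for this target.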

#### 3.2.2. The discrete case for $P(m|H_i)$

Now let us suppose that, given a set of features $\mathbf{f}$, our model *predicts* a value of `1`. We know (from Section 3.1.) that before the prediction, the probability that the house has low interest is approximately 0.69. Given that our model also predicts low interest for this house, how do we *update* this probability?

For this, we need two quantities: $P(m|H_i)$ and $P(m)$. We start with $P(m|H_i)$; $P(m)$ is covered in Section 3.2.3. In this specific case, we want to find

$$ P( \hat{P}(H_i) = 1 | H_{low} = 1 ) $$

From the training set, we can easily calculate this quantity ...

```python
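# Predictions for the training rows whose true label is low interest.
yHat = model.predict(Xtrain.loc[y['low'] == 1, :])
# Fraction of those rows for which the model predicts 1.
P_m_given_H = sum(yHat == 1)*1.0/len(yHat)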
```

If the current model predicts a `0` on the other hand, the equation that we are interested in evaluating is the following:

$$    P( H_i = 1| \hat{P}(H_i) = 0 )  =  P(H_i = 1) \frac {P( \hat{P}(H_i) = 0 |H_i = 1) } {P( \hat{P}(H_i) = 0 )} $$

Under these circumstances, the quantity that we would be interested in is ${P( \hat{P}(H_i) = 0 |H_i = 1) }$, the code for which can be written as:

```python
yHat = model.predict(Xtrain.loc[y['low'] == 1, :])
# Fraction of the truly low-interest rows for which the model predicts 0.
P_m_given_H = sum(yHat == 0)*1.0/len(yHat)
```

These two pieces of code are essentially the same. Which one we use depends on the prediction made by the current model for the current house.

```python
pred = 1 # This is the current prediction

yHat = model.predict(Xtrain.loc[y['low'] == 1, :])
# Fraction of the truly low-interest rows that receive this prediction.
P_m_given_H = sum(yHat == pred)*1.0/len(yHat)
```
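The same calculation can be checked by hand on a tiny example. The arrays `y_true` and `y_hat` below are made up (ten hypothetical houses) and the helper name `p_m_given_h` is my own; no model is involved, only the counting step.

```python
import numpy as np

# Hypothetical labels (1 = low interest) and hard predictions for ten houses.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])


def p_m_given_h(y_true, y_hat, pred):
    """Fraction of the truly positive rows that receive prediction `pred`."""
    positives = (y_true == 1)
    return np.mean(y_hat[positives] == pred)


p1 = p_m_given_h(y_true, y_hat, pred=1)   # 5 of the 7 positives get a 1
p0 = p_m_given_h(y_true, y_hat, pred=0)   # the other 2 get a 0
print(p1, p0)
```

Since a hard prediction is always either `0` or `1`, the two values add up to one.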


#### 3.2.3. The discrete case for $P(m)$

This is much simpler of course. We are looking for the quantities:

 - $ P( \hat{P}(H_i) = 1 ) $, and 
 - $ P( \hat{P}(H_i) = 0 ) $

This is simply given by the following lines of code:

```python
pred = 1 # This is the current prediction

yHatAll = model.predict(Xtrain) # Predict for all values.
P_m = sum(yHatAll == pred)*1.0/len(yHatAll)
```
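As a sanity check, counting $P(m)$ directly should agree with building it from the conditional pieces, $P(m) = P(m|H_i) P(H_i) + P(m|\bar{H_i}) P(\bar{H_i})$. The sketch below verifies this on made-up arrays (`y_true` and `y_hat` are hypothetical, ten houses, no model involved).

```python
import numpy as np

# Hypothetical labels (1 = low interest) and hard predictions for ten houses.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

pred = 1  # the prediction whose probability we want

# Direct count over all rows.
p_m_direct = np.mean(y_hat == pred)

# The same quantity through the law of total probability.
p_h = np.mean(y_true == 1)
p_m_given_h = np.mean(y_hat[y_true == 1] == pred)
p_m_given_not_h = np.mean(y_hat[y_true == 0] == pred)
p_m_total = p_m_given_h * p_h + p_m_given_not_h * (1.0 - p_h)

print(p_m_direct, p_m_total)
```

Both routes give the same number, which is why the direct count is enough and $P(m|\bar{H_i})$ never has to be computed separately.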

#### 3.2.4. Summary

For *every* model that we train, we want to find the values of the following quantities:

 - $ P( \hat{P}(H_i) = 1 | H_{low} = 1 ) $,
 - $ P( \hat{P}(H_i) = 0 | H_{low} = 1 ) $
 - $ P( \hat{P}(H_i) = 1 ) $, and
 - $ P( \hat{P}(H_i) = 0 ) $

Note that these quantities are *independent* of the testing set, and thus can be calculated immediately after training the model.
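The four quantities can be folded into one update factor per possible prediction, $P(m|H_i) / P(m)$. The helper below is a sketch under my own naming (`update_factors` is hypothetical), run on made-up arrays. It returns `nan` for a prediction the model never makes, and it assumes that at least one row is truly positive.

```python
import numpy as np


def update_factors(y_true, y_hat):
    """Return {pred: P(m = pred | H) / P(m = pred)} for pred in (0, 1)."""
    y_true = np.asarray(y_true)
    y_hat = np.asarray(y_hat)
    factors = {}
    for pred in (0, 1):
        p_m = np.mean(y_hat == pred)
        if p_m == 0:
            factors[pred] = float('nan')  # this prediction never occurs
            continue
        p_m_given_h = np.mean(y_hat[y_true == 1] == pred)
        factors[pred] = p_m_given_h / p_m
    return factors


# Hypothetical labels (1 = low interest) and hard predictions for ten houses.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat  = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

factors = update_factors(y_true, y_hat)
prior = np.mean(y_true == 1)
posterior_if_1 = prior * factors[1]   # P(H | m = 1)
print(factors, posterior_if_1)
```

On these toy arrays the updated value for a prediction of `1` is 5/6, which is exactly the share of houses predicted as `1` that really are positive. In other words, on the data the factors were computed from, the update reproduces the empirical hit rate of each prediction.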

## 4. A Simple Test Script

Let us first see whether a simple test script shows an improvement in scores when using a Bayesian inference methodology.

```python
import random

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestClassifier

if __name__ == '__main__':

    print('Reading data ...')
    Xtest  = pd.read_csv('localFeatures/Xtest.csv')
    Xtrain = pd.read_csv('localFeatures/Xtrain.csv')
    y      = pd.read_csv('localFeatures/y.csv')
    test   = pd.read_json('../../data/test.json')

    predictClasses = ['high', 'medium', 'low']

    Xtest.fillna(Xtest.mean(), inplace=True)

    print('Fitting a different model for each class ...')

    # Randomly drawn hyperparameters, shared by all three models.
    params = {
        'min_samples_split' : random.randint(4, 10),
        'max_depth'         : random.randint(15, 50),
        'n_estimators'      : random.randint(100, 500),
        'n_jobs'            : -1,
    }

    resultNormal   = {}
    resultBayesian = {}

    for p in predictClasses:

        print('Training model: ', p)

        # Train a random model ...
        model = RandomForestClassifier(**params)
        model.fit(Xtrain, y[p])

        # P(m|H): predictions on the training rows where H is true
        P_m_given_H = {}
        yHat = model.predict(Xtrain.loc[y[p] == 1, :])
        P_m_given_H[0] = sum(yHat == 0)*1.0/len(yHat)
        P_m_given_H[1] = sum(yHat == 1)*1.0/len(yHat)

        # P(m): predictions on all training rows
        P_m = {}
        yHatAll = model.predict(Xtrain)
        P_m[0] = sum(yHatAll == 0)*1.0/len(yHatAll)
        P_m[1] = sum(yHatAll == 1)*1.0/len(yHatAll)

        # This is the factor by which the prior is going
        # to be changed, given that a 1 or a 0 is predicted
        Factor = {}
        Factor[0] = P_m_given_H[0] / P_m[0]
        Factor[1] = P_m_given_H[1] / P_m[1]

        print('Factor  :', Factor)
        print('P(m)    :', P_m)
        print('P(m|H)  :', P_m_given_H)

        # Prior P(H_i) from the class frequency in the training labels
        Hi = sum(y[p] == 1)*1.0 / len(y[p])

        yHat = model.predict(Xtest)

        resultNormal[p] = yHat
        resultBayesian[p] = Hi * np.array([Factor[m] for m in yHat])

    resultBayesian['listing_id'] = test['listing_id']
    resultBayesian = pd.DataFrame(resultBayesian)[['listing_id', 'high', 'medium', 'low']]

    resultNormal['listing_id'] = test['listing_id']
    resultNormal = pd.DataFrame(resultNormal)[['listing_id', 'high', 'medium', 'low']]

    print(resultBayesian.head())
    print(resultNormal.head())

    resultBayesian.to_csv('results/bayesian1.csv', index=False)
    resultNormal.to_csv('results/normal.csv', index=False)
```

| Method    |  LB Score
|-----------|---------
| Normal    | 11.98269
| Bayesian  | 0.74782

## 5. Multiple Models

[To do] I shall complete this section in a couple of days ...

## 6. Conclusion

We have seen in this article how we can incorporate machine learning models into a Bayesian framework. The good news about this framework is that it places no restriction on the type of model you use, or on how different the models are. In fact, because of the assumption of independence, the more different your models are, the better off you should be. The other interesting thing about this approach is that it changes your prior *in proportion* to how good each model is. So if a particular model is very accurate, then this method should theoretically enhance that.

#### More to come 

