# Data Mining 2016 - Know your Toolbelt

In the previous class we looked at running experiments in batch, both to show a significant improvement (Experimenter) and to automatically find optimal parameter settings (GridSearch). We saw that both have pros and cons compared to the standard Explorer. In this class, we are going to examine our options a bit more in terms of classifiers and feature selection.

## 0.1 - Prepare your Data

For this practical, you can work on any set we have used before. The goal will again be to maximize performance on these tasks and see which classifiers will perform best.

## 0.2 - Refresher on WEKA

What we've seen until now is that WEKA comes with different windows, filters, classifiers, and settings to work with data. Although we've only covered the basics so far, these should generalize to *any other* software that you might use in the future.

We showed you that in the Experimenter it is possible to run a multitude of tests under different random conditions (the order of the data, for example) and test whether the differences between classifier A and B, C, and D are significant (using a paired $t$-test). While this is not common (as it adds another factor of time to your experiments), if you are ever asked to prove that your results are significantly better, this is the way to do it. Note that you lose the extensive report that WEKA normally provides in a window, but you can still output this information to `.csv`, for example (under `Setup`, `Results destination`). This view can be used to test different classifiers, as well as different parameter settings.
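To get a feeling for what such a test computes, here is a minimal sketch of the plain paired $t$ statistic in Python. The accuracies are made-up illustrative numbers, not results from any dataset in this course, and the function name `paired_t_statistic` is our own. WEKA's Experimenter may use a corrected variant of this test, so treat this as the textbook version only.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Plain paired t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error

# Hypothetical accuracies (in %) of two classifiers on the same five runs.
acc_a = [90, 92, 91, 93, 94]
acc_b = [88, 91, 89, 92, 91]

t = paired_t_statistic(acc_a, acc_b)
print(f"t = {t:.3f} with {len(acc_a) - 1} degrees of freedom")
```

The statistic is then compared against a $t$ distribution with $n - 1$ degrees of freedom to decide whether the difference is significant.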

We also looked at automatic parameter tuning and how to set it up in WEKA. Note that by default, WEKA doesn't report the performance of the individual parameter settings (also not in Experimenter mode), so you lose this information. This is sometimes an issue: when, for example, $k = 3$ and $k = 10$ perform only slightly differently (e.g. an accuracy of 91.3% and 91% respectively), WEKA will opt for the best-scoring setting without considering the complexity of the model.

## 1 - Naive Bayes

In the lectures you were introduced to three new classifiers: Naive Bayes, Random Forests, and SVMs. While the latter two are very complex, Naive Bayes is actually something you can do by hand fairly easily to get a better intuition of what it's doing. That's exactly what we will do in this part. You are given the following instances:

|    | sound | action| label |
| -- | ----- | ----- | ----- |
| 1  | bark  | run   | dog   |
| 2  | meow  | run   | cat   |
| 3  | meow  | jump  | cat   |
| 4  | bark  | run   | dog   |
| 5  | bark  | jump  | dog   |
| 6  | meow  | run   | cat   |
| 7  | bark  | run   | dog   |
| 8  | meow  | jump  | cat   |
| 9  | bark  | run   | dog   |
| 10 | bark  | run   | dog   |
| 11 | bark  | jump  | cat   |
| 12 | meow  | run   | cat   |


Remember that Bayes theorem states the following:

### $P(c \mid x) = \frac{P( x \mid c ) P(c)}{P(x)}$

Where $c$ is a class (e.g. $\text{dog}$), and $x$ is a feature value (e.g. $\text{jump}$), so for example $P(\text{dog} \mid \text{jump}) = (P(\text{jump} \mid \text{dog}) \cdot P(\text{dog})) / P(\text{jump})$. Intuitively, this is the number of occurrences of a jumping dog divided by the total number of jumps, so $P(\text{dog} \mid \text{jump}) = 1 / 4 = 0.25$. Or, if you run the whole formula: $(1 / 6 \cdot 6 / 12) / (4 / 12) = 0.25$.
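If you want to verify this calculation, the following is a small Python sketch that counts directly in the table above. The variable names are our own, and `Fraction` is used only to keep the numbers exact.

```python
from fractions import Fraction

# Training instances from the table above: (sound, action, label)
DATA = [
    ("bark", "run", "dog"), ("meow", "run", "cat"), ("meow", "jump", "cat"),
    ("bark", "run", "dog"), ("bark", "jump", "dog"), ("meow", "run", "cat"),
    ("bark", "run", "dog"), ("meow", "jump", "cat"), ("bark", "run", "dog"),
    ("bark", "run", "dog"), ("bark", "jump", "cat"), ("meow", "run", "cat"),
]

n = len(DATA)
dogs = [row for row in DATA if row[2] == "dog"]

p_dog = Fraction(len(dogs), n)                                  # P(dog)
p_jump = Fraction(sum(1 for row in DATA if row[1] == "jump"), n)  # P(jump)
p_jump_given_dog = Fraction(
    sum(1 for row in dogs if row[1] == "jump"), len(dogs)
)                                                               # P(jump | dog)

# Bayes' theorem: P(dog | jump) = P(jump | dog) * P(dog) / P(jump)
p_dog_given_jump = p_jump_given_dog * p_dog / p_jump
print(p_dog_given_jump)  # 1/4
```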

For Naive Bayes, however, we assume that all inputs are independent of each other given the class. As such, we only need to calculate $P(x \mid c)$ per feature value (e.g. $P(\text{jump} \mid \text{dog})$, which is $1 / 6$). In the training phase, the classifier calculates all these probabilities from the training data, so it can quickly look them up when applying the formula in the prediction phase.

### Tasks

1. Do the same for all possible combinations in our training data: 
    - $P(\text{jump} \mid \text{dog})$
    - $P(\text{jump} \mid \text{cat})$
    - $P(\text{run} \mid \text{dog})$
    - $P(\text{run} \mid \text{cat})$
    - $P(\text{bark} \mid \text{dog})$
    - $P(\text{bark} \mid \text{cat})$
    - $P(\text{meow} \mid \text{dog})$
    - $P(\text{meow} \mid \text{cat})$
    - $P(\text{cat})$
    - $P(\text{dog})$
2. You've now trained Naive Bayes. Do the probabilities already give you some information?
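Once you have done the calculations by hand, you can check them with a sketch like the one below. It builds the lookup tables that the training phase produces. The function name `train` is our own, and keying the table on `(value, label)` assumes that no value occurs in more than one feature, which holds for this toy dataset.

```python
from collections import Counter
from fractions import Fraction

# Training instances from the table above: (sound, action, label)
DATA = [
    ("bark", "run", "dog"), ("meow", "run", "cat"), ("meow", "jump", "cat"),
    ("bark", "run", "dog"), ("bark", "jump", "dog"), ("meow", "run", "cat"),
    ("bark", "run", "dog"), ("meow", "jump", "cat"), ("bark", "run", "dog"),
    ("bark", "run", "dog"), ("bark", "jump", "cat"), ("meow", "run", "cat"),
]

def train(rows):
    """Return priors P(c) and likelihoods P(x | c) as exact fractions."""
    class_counts = Counter(row[-1] for row in rows)
    # How often each feature value occurs together with each label.
    pair_counts = Counter((value, row[-1]) for row in rows for value in row[:-1])
    values = sorted({value for row in rows for value in row[:-1]})

    priors = {c: Fraction(k, len(rows)) for c, k in class_counts.items()}
    # Include combinations that never occur, so they show up with probability 0.
    likelihoods = {
        (value, c): Fraction(pair_counts[(value, c)], class_counts[c])
        for value in values
        for c in class_counts
    }
    return priors, likelihoods

priors, likelihoods = train(DATA)
for (value, c), p in sorted(likelihoods.items()):
    print(f"P({value} | {c}) = {p}")
for c, p in sorted(priors.items()):
    print(f"P({c}) = {p}")
```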

We are given a new instance, of which we don't know the label. We want to predict if it's a cat or a dog using Naive Bayes:

|    | sound | action| label |
| -- | ----- | ----- | ----- |
| 13 | bark  | jump  | ?     |

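The probability of a certain class (or label) ($c$) given an instance ($X$) is proportional to the following (given the simplified naive assumption). The denominator $P(X)$ is the same for every class, so we can leave it out when we only want to compare classes:

### $P(c \mid X) \propto P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot \ldots \cdot P(x_n \mid c) \cdot P(c)$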

### Tasks

3. Calculate the (unnormalized) probability per class using the feature values:
    - $P(\text{cat} \mid X) \propto P(\text{bark} \mid \text{cat}) \cdot P(\text{jump} \mid \text{cat}) \cdot P(\text{cat})$
    - $P(\text{dog} \mid X) \propto P(\text{bark} \mid \text{dog}) \cdot P(\text{jump} \mid \text{dog}) \cdot P(\text{dog})$
4. What class is our instance?
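After answering the tasks by hand, you can compare your result with the sketch below. It computes the score per class for instance 13 by counting directly in the training table, and also normalizes the scores so they sum to 1. The function name `score` is our own; this is an illustration of the formula above, not how WEKA implements `NaiveBayes`.

```python
from fractions import Fraction

# Training instances from the table above: (sound, action, label)
DATA = [
    ("bark", "run", "dog"), ("meow", "run", "cat"), ("meow", "jump", "cat"),
    ("bark", "run", "dog"), ("bark", "jump", "dog"), ("meow", "run", "cat"),
    ("bark", "run", "dog"), ("meow", "jump", "cat"), ("bark", "run", "dog"),
    ("bark", "run", "dog"), ("bark", "jump", "cat"), ("meow", "run", "cat"),
]

def score(instance, label, rows=DATA):
    """Unnormalized Naive Bayes score: P(x_1 | c) * ... * P(x_n | c) * P(c)."""
    in_class = [row for row in rows if row[-1] == label]
    result = Fraction(len(in_class), len(rows))  # prior P(c)
    for i, value in enumerate(instance):
        matches = sum(1 for row in in_class if row[i] == value)
        result *= Fraction(matches, len(in_class))  # likelihood P(x_i | c)
    return result

instance = ("bark", "jump")  # instance 13 from the table above
scores = {label: score(instance, label) for label in ("cat", "dog")}
total = sum(scores.values())

for label, s in scores.items():
    print(f"{label}: score = {s}, normalized = {s / total}")

prediction = max(scores, key=scores.get)
print("prediction:", prediction)
```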

## 2 - Naive Bayes, Random Forests and SVMs

Apply the three different classifiers to the problems you have done before, under the same settings as you ran $k$-NN and decision trees.

- Naive Bayes: `bayes -> NaiveBayes`
- Random Forest: `trees -> RandomForest`
- SVM: `functions -> SMO`

## 3 - PCA and Feature Selection

PCA and other feature selection methods can be applied to your data as a filter. However, the easiest, non-permanent way of doing feature selection is to use a meta-classifier.

- Under `Classify` click `Choose`.
- Navigate to `meta -> AttributeSelectedClassifier`.
- Click the name next to `Choose` to select a classifier to run.
- You can select `PrincipalComponents` as an evaluator to run PCA.
- Within `PrincipalComponents` you can also change the variance it covers (less variance covered means fewer features).
- Note that if you click `OK` and `Start`, it will warn you that you need to change `search` to `Ranker`.
- Go back into the menu, and change `BestFirst` to `Ranker`.
- Run your classifier.
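To see why covering less variance leads to fewer features, consider the following sketch. It keeps the smallest number of principal components whose share of the total variance reaches a threshold. The eigenvalues are hypothetical, the function name `components_to_keep` is our own, and WEKA's exact stopping rule may differ in detail.

```python
def components_to_keep(eigenvalues, variance_covered=0.95):
    """Smallest number of leading components whose variance share reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_covered:
            return count
    return len(ordered)

# Hypothetical variances (eigenvalues) of five principal components.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

for threshold in (0.95, 0.9, 0.75, 0.5):
    kept = components_to_keep(eigenvalues, threshold)
    print(f"variance covered {threshold}: keep {kept} component(s)")
```

Lowering the threshold drops the components with the least variance first, which is not necessarily the same as dropping the least useful ones for classification.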

### Tasks

1. Can you tell from the output whether PCA reduced the number of features?
2. Does PCA improve the performance of your classifier?
3. What do you think will happen if you start tweaking the covered variance to maximize performance?
4. Try to apply PCA to a dataset that comes with a separate test set (IMDB for example). How does optimizing affect your performance on both CV and test?