# Social Data Mining 2016 - Practical 5: Know your Practices

Up until now we've touched upon many factors within a data mining pipeline that need to be taken into consideration to accurately interpret the predictions made by our models. We discussed `metrics`, `baselines`, different `data splits` (train, test, dev), `parameter tuning`, `overfitting and underfitting`, and `dimensionality reduction`. These are all important variables in your set-up, and therefore require a thorough understanding of best practices for different scenarios. This practical focuses on working through these scenarios.

## The Data

For this practical, we'll work with some imaginary data. In contrast with the 'softer' first part of the practicals, where we were mainly interested in the implications the data might have for any predictions we make, we will now look at the models themselves. We assume that we have already thought through the first part, so the characteristics of the data matter less at this point.

What we do know is: the data has 500,000 instances, with a mix of numeric and nominal features. There are two potential targets for prediction: one is numeric, and one is nominal.
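To make the cases easier to experiment with, here is a minimal sketch of what such imaginary data could look like. It is scaled down to 1,000 instances, and every name in it (`make_imaginary_data`, `age`, `colour`, `score`, `group`) is a hypothetical placeholder, not part of the practical's actual data.

```python
import random

def make_imaginary_data(n=1000, seed=0):
    """Generate n instances with numeric and nominal features and two targets."""
    rng = random.Random(seed)
    colours = ["red", "green", "blue"]  # hypothetical nominal feature values
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 80)      # numeric feature
        colour = rng.choice(colours)   # nominal feature
        # Numeric target: linear in `age` plus Gaussian noise.
        score = 0.5 * age + rng.gauss(0, 5)
        # Nominal target: loosely dependent on the numeric feature.
        group = "high" if age + rng.gauss(0, 10) > 50 else "low"
        rows.append({"age": age, "colour": colour,
                     "score": score, "group": group})
    return rows

data = make_imaginary_data()
print(len(data), data[0])
```

The numeric target (`score`) would suit a regression model and the nominal target (`group`) a classifier, mirroring the two prediction tasks described above.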

## Cases

Try to identify and fix the **mistakes** in the methods for the cases below, given the description of each case and what you know of the concepts mentioned earlier.

---

### Case 1

We sampled about 20% of the total data. We apply standard linear regression and get an error of 0.05. We're happy with the result and report it as is.
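One way to sanity-check a reported regression error is to compute it on held-out data and compare it against a trivial predictor. The sketch below assumes a single numeric feature, mean absolute error (MAE) as the metric, and a mean-value baseline; the case itself names none of these, and the data is synthetic.

```python
import random

rng = random.Random(0)
# Hypothetical data: one numeric feature x, numeric target y = 2x + noise.
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

# Hold out the last 20% as a test set (the data is already in random order).
cut = int(0.8 * len(xs))
x_train, y_train = xs[:cut], ys[:cut]
x_test, y_test = xs[cut:], ys[cut:]

# Closed-form simple least squares, fitted on the training part only.
mean_x = sum(x_train) / len(x_train)
mean_y = sum(y_train) / len(y_train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
         / sum((x - mean_x) ** 2 for x in x_train))
intercept = mean_y - slope * mean_x

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

model_mae = mae(y_test, [slope * x + intercept for x in x_test])
# Baseline: always predict the mean of the training targets.
baseline_mae = mae(y_test, [mean_y] * len(y_test))
print(f"model MAE: {model_mae:.2f}, mean-baseline MAE: {baseline_mae:.2f}")
```

An error value only becomes interpretable once you know which metric it is, which data it was measured on, and what a trivial predictor scores on the same data.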

---

### Case 2

We use all of the data, but split off 10% for testing our model. We start applying $k$-NN and tuning $k$ manually until we get the highest accuracy score on the train set ($k$=50, accuracy=0.80). We then apply it to the test set and get a score of 0.85.
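For comparison, the sketch below tunes $k$ on a separate dev split and touches the test split only once. It uses a small hand-written $k$-NN on synthetic two-cluster data; the split sizes, the candidate values for $k$, and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)
# Hypothetical data: two noisy 2-D clusters labelled "a" and "b".
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for c, label in [(0.0, "a"), (2.0, "b")] for _ in range(150)]
rng.shuffle(data)

# 60% train, 20% dev (for tuning k), 20% test (used once, at the end).
train, dev, test = data[:180], data[180:240], data[240:]

def knn_predict(train_set, point, k):
    """Return the majority label among the k training points nearest to `point`."""
    nearest = sorted(
        train_set,
        key=lambda item: (item[0][0] - point[0]) ** 2
                         + (item[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train_set, eval_set, k):
    """Fraction of `eval_set` points that k-NN labels correctly."""
    hits = sum(knn_predict(train_set, x, k) == y for x, y in eval_set)
    return hits / len(eval_set)

# Choose k by dev accuracy, never by train or test accuracy.
best_k = max([1, 3, 5, 11, 21], key=lambda k: accuracy(train, dev, k))
test_acc = accuracy(train, test, best_k)
print(f"best k on dev: {best_k}, test accuracy: {test_acc:.2f}")
```

Note that scoring $k$-NN on its own training points is not informative for small $k$: with $k=1$ every training point is its own nearest neighbour.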

---

### Case 3

We add a baseline to Case 2 for comparison. With the exact same procedure as in Case 2, we get a baseline accuracy score of 0.50. We're happy and report it as is.
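The case does not say which baseline is used. As one possibility, the sketch below implements a majority-class baseline: it learns the most frequent label from the training split and is scored on the same test split as the model. The label names and class proportions are made up.

```python
import random
from collections import Counter

rng = random.Random(0)
# Hypothetical nominal target with three unevenly distributed classes.
labels = rng.choices(["a", "b", "c"], weights=[0.6, 0.3, 0.1], k=1000)

# Same 90/10 split the model would get.
train_labels, test_labels = labels[:900], labels[900:]

# The baseline "trains" by finding the most frequent training label ...
majority_label = Counter(train_labels).most_common(1)[0][0]
# ... and is evaluated on exactly the same test set as the real model.
baseline_acc = sum(y == majority_label for y in test_labels) / len(test_labels)
print(f"majority label: {majority_label}, baseline accuracy: {baseline_acc:.2f}")
```

A majority baseline has no parameters to tune, so its score depends only on the class distribution, which makes it a useful reference point for any tuned model.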

---

### Case 4

We set up the necessary splits, tune $k$ fairly and see that both our test and train scores are now lower (0.80 and 0.70). However, we are still above baseline performance (0.50). We see that the final $k$ value settled on was $k=3$. We try again manually with $k=5$. We now have 0.90 for train and 0.89 for test.

---

### Case 5

We notice that some of the labels we use for classification occur only rarely, and when we inspect the confusion matrix we see that they are almost always predicted incorrectly. We remove them and run the experiment again.

---

### Case 6

We apply PCA to our entire dataset to reduce the size of the feature space. Afterwards, we run our set-up of Case 4 again.

---

### Case 7

Suppose the data we have been using up until now are behavioural attributes of persons recorded by a security camera at an airport. They have binary labels: potential threat, yes or no. We have 98% harmless travellers and 2% potential threats. We use accuracy to evaluate our classifier against a majority baseline.
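To see how accuracy behaves with this class distribution, the sketch below scores a majority baseline on 1,000 hypothetical travellers with the same 98% / 2% split. Recall on the threat class is shown alongside as one possible alternative metric; the case itself does not prescribe it.

```python
# 1,000 hypothetical travellers: 2% potential threat (1), 98% harmless (0).
y_true = [1] * 20 + [0] * 980

# Majority baseline: always predict "harmless".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the threat class: fraction of real threats that were flagged.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy: {accuracy:.2f}, threat recall: {recall:.2f}")
# prints: accuracy: 0.98, threat recall: 0.00
```

The baseline reaches 0.98 accuracy without flagging a single threat, so a classifier has very little room to look better than the baseline on this metric.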