# Social Data Mining 2016 - Practical 5: Know your Practices

Up until now we've touched upon many factors within a data mining pipeline that need to be taken into consideration to accurately interpret the predictions made by our models. We discussed `metrics`, `baselines`, different `data splits` (train, test, dev), `parameter tuning`, `overfitting and underfitting`, and `dimensionality reduction`. These are all important variables in your entire set-up, and therefore require a thorough understanding of best practices with regard to different scenarios. This practical will focus on considering these different scenarios.



## The Data

For this practical, we'll work with some imaginary data. In contrast with the 'softer' first part of the practicals, where we were mainly interested in what implications the data might pose to any predictions that we might perform, we will now look at the models themselves. We assume that we have already thought through the first part, so the characteristics of the data matter less at this point.

What we do know is: the data has 500,000 instances, with a mix of numeric and nominal features. There are two potential targets for prediction; one is numeric, and one is nominal.

## Cases

Try to find and fix the **mistakes** in the methods for the cases below, given the description of each case and what you know of the concepts mentioned earlier.

--- 

### Case 1

We sampled about 20% of the total data. We apply standard linear regression and get an error of 0.05. We're happy with the result and report it as is.

<details>

<summary><h4>Solution</h4></summary>
<blockquote>
<p>There are many issues with this particular example, most of which will become more apparent once we progress through all of these cases. The most important one, however, is that of the prediction set-up.</p>

<p>In this case we train a regression model without any mention of a validation or test set. So, this very small error is a direct result of evaluating predictions on previously seen data. As such, we measure how well the model reproduces the target values it was fitted on rather than how well it predicts new ones; the error tells us nothing about whether the model generalizes.</p>

<p>Sampling can't be regarded as a mistake per se; despite the fact that our model covers less data, there might be some restrictions that limit us to only using this much (e.g. computation time, size of the data). We do however need to realize that data reduction through sampling makes our model less robust (i.e. less able to generalize well), and will not give us a lot of confidence in how well it might perform on a bigger variety of data. If you can, use all the data you can process.</p>
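
As a minimal sketch of the missing hold-out step, the snippet below fits a one-feature least-squares line on a training split and reports the error on both splits. The toy data, the helper names (`train_test_split`, `fit_line`, `mean_squared_error`) and the 10% test fraction are assumptions made for illustration; they are not part of the practical's data.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows and split off a held-out test portion."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for y = a * x + b on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(rows, a, b):
    """Average squared difference between targets and predictions."""
    return sum((y - (a * x + b)) ** 2 for x, y in rows) / len(rows)


# Hypothetical toy data: y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(1000):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + rng.gauss(0, 0.5)))

train, test = train_test_split(data, test_fraction=0.1)
a, b = fit_line(train)  # the model only ever sees the training split
train_mse = mean_squared_error(train, a, b)
test_mse = mean_squared_error(test, a, b)

print(f"train MSE: {train_mse:.3f}  (seen data)")
print(f"test MSE:  {test_mse:.3f}  (unseen data, the number to report)")
```

On data this simple the two errors will be close; the point is only that the reported number comes from instances the model was not fitted on.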

</blockquote>

</details>

---

### Case 2

We use all of the data, but split off 10% for testing our model. We start applying $k$-NN and tuning $k$ manually until we get the highest accuracy score on the train set ($k$=50, accuracy=0.80). We then apply it to the test set and get a score of 0.85.

<details>

<summary><h4>Solution</h4></summary>
<blockquote>
<p>In this case, $k$ is tuned by evaluating performance on the training set. This makes as little sense here as it did in the previous case. We can also observe that tuning resulted in a very high $k$ value of 50, which means the model could be underfitted (the higher score on the test set points in the same direction). Ideally, we would validate our tuned performance on a separate set.</p>

<p>Moreover, and this also holds for the previous case, there is no baseline! Even if we did everything correctly, we don't know whether a baseline might have reached an accuracy of 0.90 or so, which would have left us with the conclusion that the model learned nothing.</p>
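
A majority baseline takes only a few lines to compute. The sketch below uses an invented label distribution (the labels `"a"` and `"b"` and their counts are hypothetical) to show how such a baseline could reach 0.90 accuracy without learning anything from the features.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical label distribution, invented for illustration.
train_labels = ["a"] * 900 + ["b"] * 100
test_labels = ["a"] * 90 + ["b"] * 10

# The baseline ignores all features and always predicts the majority label.
majority = majority_baseline(train_labels)
baseline_acc = accuracy(test_labels, [majority] * len(test_labels))

print(majority, baseline_acc)  # a 0.9
```

Any model score should be read next to this number: 0.85 looks good in isolation, but not against a baseline of 0.90.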
</blockquote>

</details>

---

### Case 3

We add a baseline to Case 2 to compare. With the exact same procedure (as in Case 2) we get a baseline accuracy score of 0.50 on the test set. We're happy and report it as is.

<details>
<summary><h4>Solution</h4></summary>
<blockquote>
<p>We score much higher than the baseline, which suggests that our model is learning something from the data. However, $k$ was still tuned on the training set, so we still need to create that development/validation set to make sure that our tuning is sound before we report anything.</p>
</blockquote>

</details>

---

### Case 4

We set up the necessary splits, tune $k$ fairly and see that both our test and train scores are now lower (0.80 and 0.70). However, we are still above baseline performance (0.50). We see that the final $k$ value settled on was $k=3$. However, when we manually try $k=5$, we get 0.79 for train and 0.78 for test. We settle on $k=5$.

<details>
<summary><h4>Solution</h4></summary>
<blockquote>
<p>For $k$ to be tuned fairly, tuning was conducted on a validation split. The scores are lower, but that's to be expected, as we're now actually doing predictions on unseen data (both validation and test). The fact that we're scoring better than a baseline classifier (such as majority) indicates that the model is actually learning something from the data.</p>

<p>The mistake here is that we go back and tune again after having looked at the performance on the <em>test set</em>. We now use our newly acquired knowledge of this 'new data' to pick $k$, so the test set is no longer unseen. As a result, its score is no longer an honest estimate of how the model would perform on new data.</p>
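
The intended order of operations can be sketched as follows: tune $k$ on the development split, then touch the test split exactly once. This is a toy illustration under stated assumptions, not the practical's actual set-up: the one-feature data, the 20% label noise, the split sizes and the candidate values for $k$ are all hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def accuracy(train, rows, k):
    """Accuracy of k-NN (fitted on train) when predicting the given rows."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Hypothetical toy data: one numeric feature, binary label with 20% noise.
rng = random.Random(0)
data = []
for _ in range(600):
    x = rng.uniform(0, 1)
    label = "pos" if x > 0.5 else "neg"
    if rng.random() < 0.2:
        label = "neg" if label == "pos" else "pos"
    data.append((x, label))

rng.shuffle(data)
train, dev, test = data[:400], data[400:500], data[500:]

# Tune k on the development set only (odd values avoid tied votes).
candidates = [1, 3, 5, 11, 21, 51]
dev_scores = {k: accuracy(train, dev, k) for k in candidates}
best_k = max(dev_scores, key=dev_scores.get)  # smallest k wins on ties

# Touch the test set exactly once, with the k chosen above.
test_score = accuracy(train, test, best_k)

print(f"best k on dev: {best_k}, dev accuracy: {dev_scores[best_k]:.2f}")
print(f"test accuracy (report this, do not tune further): {test_score:.2f}")
```

If the test score is disappointing, the fair options are to report it as is or to collect a fresh test set, not to try another $k$ against the same one.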
</blockquote>

</details>

---



### Case 5

We decide that some of the labels we use for classification actually have a low occurrence, and when we inspect the confusion matrix we see that they are almost always guessed incorrectly. We remove them and run the experiment again.

<details>
<summary><h4>Solution</h4></summary>
<blockquote>
<p>Again, we're using information that we did not have access to prior to conducting these experiments, just to increase our score. That these labels are hard to predict doesn't mean that they should be excluded!</p>
</blockquote>

</details>

---



### Case 6

We apply PCA to our entire dataset to reduce the size of the feature space. Afterwards, we run our set-up of Case 4 again.

<details>
<summary><h4>Solution</h4></summary>
<blockquote>
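<p>PCA, like any component in our pipeline (even some preprocessing ones), should only be fitted on the training set to keep the evaluation fair; the fitted transformation is then applied unchanged to the development and test sets. Fitting it on the entire dataset leaks information about the held-out instances into the model. The number of components is typically also selected using the development set.</p>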
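
To make the fit/apply distinction concrete, here is a small sketch of a one-component PCA on two features, fitted on the training split and then reused for the test split. It relies on the closed-form principal axis of a 2x2 covariance matrix, so it only works for two features; the toy data and the helper names `fit_pca_2d` and `transform` are hypothetical.

```python
import math
import random


def fit_pca_2d(rows):
    """Fit a one-component PCA on 2-D points: return the mean and leading axis."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows) / n
    syy = sum((y - mean_y) ** 2 for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / n
    # Closed-form direction of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def transform(rows, mean, axis):
    """Project points onto the fitted axis, centring with the fitted mean."""
    return [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1] for x, y in rows]


# Hypothetical toy data: two strongly correlated features.
rng = random.Random(1)
data = []
for _ in range(500):
    x = rng.gauss(0, 1)
    data.append((x, x + rng.gauss(0, 0.1)))
train, test = data[:400], data[400:]

# Fit on the training split only, then reuse the fitted parameters everywhere.
mean, axis = fit_pca_2d(train)
train_reduced = transform(train, mean, axis)
test_reduced = transform(test, mean, axis)  # no refitting on test data

print(f"axis fitted on train: ({axis[0]:.2f}, {axis[1]:.2f})")
```

A development split would be handled exactly like `test` here: transformed with the mean and axis learned from `train`, never used to learn them.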
</blockquote>

</details>

---



### Case 7

Suppose the data we have been using up until now are behavioural attributes of persons recorded by a security camera at an airport. They have a binary label: potential threat, yes / no. We have 98% harmless travellers, and 2% potential threats. We use accuracy to evaluate our classifier against a majority baseline.

<details>
<summary><h4>Solution</h4></summary>
<blockquote>
<p>Majority baseline will be a very hard baseline to beat (given that 98% is negative), which is good! However, accuracy will not reflect our performance very well; any classifier that beats the baseline scores above 98%, which might give us (or someone else) the illusion that we're doing far better than we actually are. Generally, we would want to focus on the correctly classified POSITIVE instances in this regard.</p>

<p>Additionally, this is a typical case where precision and recall become important metrics. Are we going for as few false alarms as possible (high precision), or are we interested in capturing as many threats as possible even though we might have to harass innocent civilians for it (high recall)?</p>
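
A small worked example shows how little accuracy says here. The test set below follows the 98% / 2% split from the case, but the classifier's predictions (10 of 20 threats caught, 5 false alarms) are invented for illustration, as are the helper names.

```python
def confusion_counts(true_labels, predicted_labels, positive="threat"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def report(true_labels, predicted_labels):
    """Return (accuracy, precision, recall); empty denominators count as 0.0."""
    tp, fp, fn, tn = confusion_counts(true_labels, predicted_labels)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# 1000 travellers with the split from the case: 980 harmless, 20 threats.
true_labels = ["harmless"] * 980 + ["threat"] * 20

# Majority baseline: always predict "harmless".
baseline = ["harmless"] * 1000

# Hypothetical classifier: 5 false alarms, 10 of the 20 threats caught.
classifier = (["harmless"] * 975 + ["threat"] * 5
              + ["threat"] * 10 + ["harmless"] * 10)

for name, preds in [("baseline", baseline), ("classifier", classifier)]:
    acc, prec, rec = report(true_labels, preds)
    print(f"{name:<10} accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

In this invented scenario accuracy moves only from 0.980 to 0.985, while precision (about 0.667) and recall (0.5) make clear that half of the threats are still missed.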
</blockquote>

</details>
