# Social Data Mining 2017 (Spring) - Practical 2: Know your Predictions

This session focuses on making predictions, assessing the correctness of these predictions, and trying to partly improve them. You are going to use the **IMDB 5000 Movie Dataset**.

- Information: [task & data](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset)
- Data: [link](https://raw.githubusercontent.com/tcsai/social-data-mining/master/data/imdb.csv)

Again: please make sure you understand the dataset and the task before beginning. In this practical we will focus on predictive regression. Apart from getting to know our data, we want to try and solve the task we are assigned.

---

![img1](https://kaggle2.blob.core.windows.net/datasets-images/138/287/229bfb5d3dd1a49cc5ac899c45ca2213/dataset-cover.png)

## 1 - Interpreting the Data

The IMDB dataset was scraped from the internet and is provided as-is. Please note that the provided version has an extra feature: `quality`. It was derived from the IMDB score and is structured as follows:

    1-3:  very bad
    3-5:  bad
    5-7:  okay
    7-8:  good
    8-10: very good
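
The listing does not say which bin the boundary scores (3, 5, 7 and 8) fall into. The sketch below shows one possible mapping, under the assumption that each bin includes its lower bound and excludes its upper bound; the function name `quality_label` is made up for illustration and is not part of the dataset.

```python
def quality_label(score: float) -> str:
    """Map an IMDB score to a quality label.

    Assumption: a bin includes its lower bound and excludes its upper
    bound (the last bin includes 10). The provided data may have been
    binned differently at the boundaries.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score < 3:
        return "very bad"
    if score < 5:
        return "bad"
    if score < 7:
        return "okay"
    if score < 8:
        return "good"
    return "very good"


print(quality_label(7.5))  # falls in the 7-8 bin
```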

### Tasks

- Which prediction tasks can you come up with given this dataset?
- Inspect the raw data; can you see problems with it?
- Visualize year against budget (year * budget) and label the instances by movie title. Which movie has the highest budget?
- Look up the movie; is the information correct?

---

![imgproc](https://image.slidesharecdn.com/datapreprocessing-150127194908-conversion-gate01/95/data-preprocessing-3-638.jpg?cb=1422388220)

## 2 - Preprocessing

Data can be noisy and may not correctly represent what we want it to, which hampers the tools we are using. In Orange's case, some information (such as that in the visualizations) can show up incorrectly if no proper preprocessing is conducted beforehand. Preprocessing is done with the `Preprocessing` widget, generally placed between your data and anything else. It presents several options, which can be dragged to the right side of the screen to activate them. For now, we will only focus on `Impute`.

### Tasks

- Remove any items in the list on the right.
- Select Impute Missing Values. Consider the options. Which one would you prefer given this dataset?
- Does the information now plot correctly?
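
To make the idea of imputation concrete, here is a minimal sketch of one common strategy: replacing each missing value with the mean of the observed values in that column. It only illustrates the concept and is not how Orange implements it; the function name `impute_mean` and the toy column are made up.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up toy column, not taken from the IMDB data.
print(impute_mean([10.0, None, 30.0]))  # the gap is filled with 20.0
```

Whether the mean is a sensible replacement depends on the feature: for a skewed feature, a few extreme values can pull the mean far away from a typical value.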

---

![imgfilter](https://upload.wikimedia.org/wikipedia/commons/2/25/Crystal_Project_Filter.png)



## 3 - Filtering Features

Given that we have also established issues with some of the features, it is preferable to filter the data accordingly. You can select instances based on a rule over their feature values in the `Select Rows` widget. In turn, you can set up the prediction task as determined in the first task by using the `Select Columns` widget (both under Data). These should be chained **after** each other.

### Tasks

- Filter movies to be USA only.
- Remove any features that can't be determined pre-release.
- How do you think certain features affect predictions? What would including `quality` do, specifically?
- How did your prediction set-up change?
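
Outside Orange, the same two steps can be sketched in a few lines of Python. The column names (`country`, `budget`, `gross`, `imdb_score`) are assumptions about the CSV header, the rows are made up, and the helper functions merely borrow the widget names for readability.

```python
def select_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row.get(column) == value]


def select_columns(rows, features, target):
    """Keep only the chosen feature columns plus the target column."""
    keep = [*features, target]
    return [{name: row[name] for name in keep} for row in rows]


# Made-up toy rows, not taken from the IMDB data.
movies = [
    {"country": "USA", "budget": 100, "gross": 250, "imdb_score": 6.5},
    {"country": "UK", "budget": 80, "gross": 90, "imdb_score": 7.1},
    {"country": "USA", "budget": 20, "gross": 15, "imdb_score": 5.2},
]

usa_only = select_rows(movies, "country", "USA")
# `gross` is only known after release, so it is left out of the features.
dataset = select_columns(usa_only, features=["budget"], target="imdb_score")
print(dataset)
```

The order matters here: the row filter needs the `country` column, so rows are filtered before that column is dropped.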

---

![imgreg](http://www.vias.org/science_cartoons/img/gm_regression.jpg)


## 3 - Prediction

To apply Linear Regression to our data, we use the `Linear Regression` widget (found under Regression). Chain the widget after your filtering components. You can use the `Data Table` widget (again chained after it) to interpret the coefficients and the bias. Under Evaluate, you can find the `Test & Score` widget. Connect both your data (from `Select Columns`) and your model (from `Linear Regression`) to it.

![imgtest](http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png)

The scores will give you an estimate of the errors on our data. We go into detail about this topic in the next lecture. For now, it is good to know that to get these scores, the model first fits the best possible regression line on some part of the data (the training data). Once this line is determined, the model is given a set of new, unseen data (the test data) and asked to make predictions for it. The error is then computed by comparing these predictions to the actual values. Further insight into this process is given in section 5.
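
The fit-then-test procedure described above can be sketched for a single feature as follows. This is a simplified illustration with made-up numbers and one fixed split; `Test & Score` may use a different sampling scheme, and the helper names `fit_line` and `mse` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(predicted, actual):
    """Mean squared error between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Hold out the last two instances as unseen test data.
train_x, test_x = xs[:4], xs[4:]
train_y, test_y = ys[:4], ys[4:]

slope, intercept = fit_line(train_x, train_y)              # uses training data only
predictions = [slope * x + intercept for x in test_x]      # predict on unseen data
test_error = mse(predictions, test_y)
print(slope, intercept, test_error)
```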

### Tasks

- Do the coefficients give you any information?
- How can you interpret the error scores?

---

![imgana](https://inquiryintoinquiry2012.files.wordpress.com/2012/11/analysis1.jpg)


## 4 - Interpreting Scores

One way to determine feature importance for regression is to gradually increase the number of features used in the model and see how much each one decreases the error. For this, we also need some form of **baseline**: a 'stupid' method of prediction to compare against. For regression, we generally use the `Mean Learner` widget (it computes the average over all target values and always predicts this for each instance). Connect the `Mean Learner` to `Test & Score`; it should show up in the results as well.
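
As a rough illustration of what such a baseline does, the sketch below always predicts the mean of the training targets, whatever the input features are. The numbers are made up and the function name `mean_baseline` is hypothetical; it is not Orange's implementation.

```python
from math import sqrt


def mean_baseline(train_targets):
    """Return a predictor that ignores its input and predicts the training mean."""
    mean_value = sum(train_targets) / len(train_targets)
    return lambda _features: mean_value


def rmse(predicted, actual):
    """Root mean squared error between predictions and actual values."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Made-up toy targets, not taken from the IMDB data.
train_targets = [6.0, 7.0, 8.0]
test_targets = [5.0, 9.0]

predict = mean_baseline(train_targets)
baseline_predictions = [predict(None) for _ in test_targets]
baseline_error = rmse(baseline_predictions, test_targets)
print(baseline_error)
```

A regression model is only useful if its error on the same test data is clearly lower than this baseline error.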

### Tasks

- Use the `Select Columns` widget you chained before to remove all but one feature and interpret the `Test & Score` results. Then add one feature, interpret again, and repeat.
- What is the most informative feature in this regression?

---

![imghouse](https://archive.ics.uci.edu/ml/assets/MLimages/Large48.jpg)

## 5 - Take Home Assignment: Housing & Predictions

This assignment uses the **UCI Housing Dataset**. More info can be found [here](https://archive.ics.uci.edu/ml/datasets/Housing).

The coefficients and bias per feature fitted by a regression model on **training data** are as follows:

```
     -0.1084 * crime-rate +
      0.0458 * zoned +
      2.7187 * charles +
    -17.376  * nitric-oxide +
      3.8016 * rooms +
     -1.4927 * employment-center +
      0.2996 * radial-highways +
     -0.0118 * property-tax +
     -0.9465 * pupil-teach-ratio +
      0.0093 * proportion-black-families +
     -0.5226 * poor-people +
     36.3411
```
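
If you want to check your manual calculation afterwards, the formula can be written as a small function. This is only a sketch: the names `COEFFICIENTS`, `BIAS` and `predict_median_value` are made up, and the weights are copied from the listing above. Note that `industry` and `age` have no coefficient in the listing, so they do not contribute to the prediction.

```python
# Weights copied from the listing above; features without a listed
# coefficient (industry, age) are simply not used.
COEFFICIENTS = {
    "crime-rate": -0.1084,
    "zoned": 0.0458,
    "charles": 2.7187,
    "nitric-oxide": -17.376,
    "rooms": 3.8016,
    "employment-center": -1.4927,
    "radial-highways": 0.2996,
    "property-tax": -0.0118,
    "pupil-teach-ratio": -0.9465,
    "proportion-black-families": 0.0093,
    "poor-people": -0.5226,
}
BIAS = 36.3411


def predict_median_value(features):
    """Weighted sum of the feature values plus the bias."""
    return BIAS + sum(
        weight * features[name] for name, weight in COEFFICIENTS.items()
    )


# With every feature set to zero, only the bias remains.
all_zero = {name: 0.0 for name in COEFFICIENTS}
print(predict_median_value(all_zero))
```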

You are provided with the following test data:

| crime-rate | zoned | industry | charles | nitric-oxide | rooms | age | employment-center | radial-highways | property-tax | pupil-teach-ratio | proportion-black-families | poor-people |
| ---      | ---  | ---    | - | ---    | ---    | ---    | ---    | ---  | ---   | ---   | ---    | ---   |
| 25.04610 | 0.00 | 18.100 | 0 | 0.6930 | 5.9870 | 100.00 | 1.5888 | 24   | 666.0 | 20.20 | 396.90 | 26.77 | 
| 14.23620 | 0.00 | 18.100 | 0 | 0.6930 | 6.3430 | 100.00 | 1.5741 | 24   | 666.0 | 20.20 | 396.90 | 20.32 | 
| 9.59571  | 0.00 | 18.100 | 0 | 0.6930 | 6.4040 | 100.00 | 1.6390 | 24   | 666.0 | 20.20 | 376.11 | 20.31 | 

- Use the given formula to predict the `median-value` (by hand) for the feature vectors above.

Given these **actual** median values (so the actual price of the houses):

| median-value |
|------------- |
| 5.60         |
| 7.20         |
| 12.10        |


- Use the predicted `median-value` to calculate the Root Mean Squared Error with respect to the **actual** median-value in the table above. For each of the feature vectors, subtract the `actual` value ($y_t$) from the `predicted` value ($\hat{y}_t$) and square the difference. Then take the sum over all these values (there should be 3, one per instance), divide it by the number of predictions (3), and take the square root of the result. <br><br> Or: <br><br> $\text{RMSE} = \sqrt{ \frac{ \sum^n_{t=1}(\hat{y}_t - y_t)^2}{n}}$
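
As a worked illustration of the formula with made-up numbers (not the assignment data), suppose $n = 2$ with predictions $\hat{y}_1 = 2$, $\hat{y}_2 = 4$ and actual values $y_1 = 1$, $y_2 = 6$:

$$\text{RMSE} = \sqrt{\frac{(2 - 1)^2 + (4 - 6)^2}{2}} = \sqrt{\frac{1 + 4}{2}} = \sqrt{2.5} \approx 1.58$$

Because the differences are squared, it does not matter whether a prediction is too high or too low, and the larger miss (2 versus 1) dominates the result.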