# Social Data Mining 2016 - Practical 3: Know your Results

In this session we are going to focus on classification, compare it to regression, and assess our results in light of our data.

- Information: [task & data](http://labrosa.ee.columbia.edu/millionsong/pages/example-track-description)
- Data: [all](https://raw.githubusercontent.com/tcsai/social-data-mining/master/data/songs.csv), [50+ genre frequency](https://raw.githubusercontent.com/tcsai/social-data-mining/master/data/songs50themes.csv)

Again: please make sure you understand the dataset and the task before beginning. In this practical we will focus on predictive classification. Apart from getting to know our data, we want to use data mining techniques to judge its value.

---

![imgaws](http://usr.audioasylum.com/images/4/44493/MISC-121a.jpg)

## 1 - Interpreting the Data

The Million Song Dataset consists of 1,000,000 (that was implied) tracks with meta-data from a wide range of platforms: [Labrosa](http://labrosa.ee.columbia.edu/), [The Echo Nest](http://the.echonest.com/), [Last.fm](http://www.last.fm/), etc. As most of it consists of specialized representations of audio data ([like this](https://upload.wikimedia.org/wikipedia/commons/thumb/2/25/ChromaFeatureCmajorScaleScoreAudioColor.png/400px-ChromaFeatureCmajorScaleScoreAudioColor.png)), we will focus on simple numeric and nominal data only:

    artist_name,duration,artist_terms,tempo,key,energy,title,year,release,time_signature,artist_familiarity,loudness,song_hotttnesss,artist_hotttnesss,danceability
    
### Tasks    
 
- Which prediction tasks (both classification and regression) do you think are feasible, looking at the data?
- Can you find interesting visualizations?
- Do you see any issues that the data poses for interpretation?

---

![imgmus](http://il7.picdn.net/shutterstock/videos/6317468/thumb/1.jpg)

## 2 - Preprocessing

You've been provided with two files. One contains a subsample (10,000) of all songs. The other is a subsample of that subsample, and only contains the top 50 genres (each with roughly 50 or more occurrences of that label). Use them both during this assignment.

> Filtering by label was done in Python; this is not something Orange implements out of the box (unless you use the Python Script widget for it).
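
As a rough idea of what such a filtering step could look like, here is a minimal sketch. It is not the script that produced the provided files: the helper `filter_by_label_frequency`, the toy rows, and the assumption that each song carries a single label in `artist_terms` are all hypothetical.

```python
from collections import Counter


def filter_by_label_frequency(rows, label_key, min_count):
    """Keep only rows whose label occurs at least `min_count` times."""
    # Count how often each label occurs across all rows.
    counts = Counter(row[label_key] for row in rows)
    # Keep the rows whose label is frequent enough.
    return [row for row in rows if counts[row[label_key]] >= min_count]


# Toy data (hypothetical): assumes one label per song in `artist_terms`.
rows = [
    {"title": "a", "artist_terms": "rock"},
    {"title": "b", "artist_terms": "rock"},
    {"title": "c", "artist_terms": "jazz"},
]

kept = filter_by_label_frequency(rows, "artist_terms", min_count=2)
print([row["title"] for row in kept])  # ['a', 'b']
```

With the real data, `min_count` would be set to something around 50, in line with the description of the second file.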

### Tasks

- Inspect the data, then identify and tackle any issues.
- Compare visualizations between the two sets. What are the advantages and disadvantages of filtering based on genre?

---

![imgsmash](http://il8.picdn.net/shutterstock/videos/5799335/thumb/3.jpg?i10c=img.resize(height:160))

## 3 - Constructing Features

Binning (reducing multiple values to a limited set) is a common method of altering the data to be interpreted. In Orange you can apply this type of feature construction with the [`Feature Constructor` widget](http://docs.orange.biolab.si/3/visual-programming/widgets/data/featureconstructor.html).

The widget works with indexing and logical rules. You code the values you want the feature to take with an index number (starting from 0). Afterwards, in the `Values` field, you enter the actual values in index order. So say that I want to label all music up to and including 2000 as `classic`, and everything after as `modern`. It will look something like this:

    Discrete         period            0 if year > 2000.0 else 1

                     Values            modern, classic

- What problems do you foresee with this technique, and how will it limit the predictions?
- Try to add another label (divide between classic and retro for example).
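
The index-to-value mapping used by the widget can be sketched in plain Python. This is only an illustration of the logic, not Orange code; the names `VALUES`, `period_index`, and `period_label` are made up for this example.

```python
# The `Values` field: the position in the list is the index number.
VALUES = ["modern", "classic"]


def period_index(year):
    # Same expression as entered in the Feature Constructor example.
    return 0 if year > 2000.0 else 1


def period_label(year):
    # Look up the actual value that belongs to the index.
    return VALUES[period_index(year)]


years = [1969, 1999, 2000, 2001, 2010]
print([period_label(y) for y in years])
# ['classic', 'classic', 'classic', 'modern', 'modern']
```

Note how the boundary year 2000 ends up in `classic`, because the rule uses `>` rather than `>=`.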

---

![imglab](http://thebrainchilddesign.com/wp-content/uploads/2013/08/photodune-1410498-chemistry-or-biology-laborotary-equipment-m1.jpg)


## 4 - Prediction and Interpretation

Set up both the 50 genre file (Data 1) and the original file (doing the same preprocessing) (Data 2). Also include one 50 genre file where the `year` is **not** binned (but rather changed to nominal at the input) (Data 3). Apply regression on Data 3, and $k$-NN (under classify -> Nearest Neighbours).

- Given our new feature, which feature should we delete now? Why?
- Compare performance across the classifiers and regression. Are you able to compare them?
- What does this tell you about the data?

---

![imgknn](http://webdataanalysis.net/wp-content/uploads/2011/10/KNN.png)

## 5 - Take Home Assignment: DIY $k$-NN

- Think of 6 instances (with features and feature values) of data yourself. Be creative!
- Use 1 of these instances as test data.
- Normalize the vectors with the $L_2$ norm, and calculate the dot product between your test data and the other 5 vectors.
- Which of the vectors are the 2 nearest neighbours?

Remember that the $L_2$ norm of a vector is the square root of the sum of its squared elements. So, given two vectors (`[1, 5, 10]`, `[2, 1, 0]`), normalizing the first one works as follows:

    norm = sqrt(1^2 + 5^2 + 10^2)
    normalized vector = [1 / norm, 5 / norm, 10 / norm]

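The dot product is the sum of the element-wise products of the two vectors. For $L_2$-normalized vectors, a larger dot product means the vectors are more similar (i.e. closer neighbours). In a **non-normalized example** (note that you do have to normalize):

`dot = (1 * 2 + 5 * 1 + 10 * 0)`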
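
To check your manual calculations, the two steps can be sketched in a few lines of Python. This is a minimal illustration using the two example vectors from above; the function names `l2_normalize` and `dot` are chosen for this sketch only.

```python
import math


def l2_normalize(vector):
    # L2 norm: square root of the sum of squared elements.
    # Assumes the vector is not all zeros (that would divide by zero).
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    # Dot product: sum of the element-wise products.
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([1, 5, 10])
b = l2_normalize([2, 1, 0])

similarity = dot(a, b)
print(round(similarity, 4))  # 0.2789
```

For the assignment you would compute this value between your test instance and each of the other five vectors, and compare the results.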

Submit your answers to the Discussion Board and **check one other** assignment on Blackboard before the scheduled deadline.