# Activity 13 - Dimensionality Reduction

***
##### CS 434 - Data Mining and Machine Learning
##### Oregon State University-Cascades
***

# Load packages

In [0]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import pearsonr, spearmanr

# Dataset

[Geographical Original of Music Data Set](http://archive.ics.uci.edu/ml/datasets/Geographical+Original+of+Music)

The dataset was built from a personal collection of 1059 tracks covering 33 countries/areas. The music used is traditional, ethnic or 'world' only, as classified by the publishers of the product on which it appears. Western music is not included because its influence is global, and what we seek are the aspects of music that most influence location. Thus, being able to specify a location with strong influence on the music is central.

The geographical location of origin was collected manually from the CD sleeve notes, and when this information was inadequate we searched other information sources. The location data is limited in precision to the country of origin.

The country of origin was determined by the artist's or artists' main country/area of residence. Any track that had ambiguous origin is not included. We have taken the position of each country's capital city (or the province of the area) by latitude and longitude as the absolute point of origin.

The program MARSYAS was used to extract audio features from the wave files. We used the default MARSYAS settings in single vector format (**68 features**) to estimate the performance with basic timbral information covering the entire length of each track. No feature weighting or pre-filtering was applied. All features were transformed to have a mean of 0 and a standard deviation of 1.

[Source](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7023456)

### Attributes

The first 68 columns are audio features of the track, and the last two columns are the origin of the music, represented by latitude and longitude.

In [0]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00315/Geographical%20Original%20of%20Music.zip'
zip_file = '\'Geographical Original of Music\'.zip'
dat_file = 'Geographical Original of Music/default_features_1059_tracks.txt'

*** 
# Exercise #1 - Load data
*** 

##### 1.0 Function to map degrees to radians (provided)

In [0]:
# radians = degrees * PI / 180
def deg_to_rad(dr):
    return (dr*math.pi)/180
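
As a quick sanity check, the conversion can be compared against the standard library's `math.radians`. The sketch below repeats the helper so it runs on its own and uses the Bend coordinates that appear in `plot_examples`; the variable names `bend_lat` and `bend_lon` are illustrative only.

```python
import math

def deg_to_rad(dr):
    # same conversion as the provided helper: radians = degrees * pi / 180
    return (dr * math.pi) / 180

# Bend, Oregon in degrees (latitude, longitude), as used in plot_examples
bend_lat = deg_to_rad(44.058)
bend_lon = deg_to_rad(-121.315)

# both values should agree with math.radians up to floating-point error
print(math.isclose(bend_lat, math.radians(44.058)))
print(math.isclose(bend_lon, math.radians(-121.315)))
```

The converted values fall inside the axis limits set in `plot_examples` (longitude within ±3, latitude within ±1.5), which is why the plot is drawn in radians.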

##### 1.1 `wget` the `url`

In [0]:
# fetch file, then comment out this line
print('your code here')

##### 1.2 Unzip `zip_file`

In [0]:
# unzip, then comment out this line
print('tu codigo aqui')

##### 1.3 Read the `dat_file` into new dataframe `df`.

In [0]:
# load the dataset into a dataframe
print('votre code ici')

> **Note**: unfortunately we don't know the names of the 68 audio features, so we won't have column names for the features. But we do know the (two) outputs.

##### 1.4 Rename the last two columns to be `latitude` and `longitude`.

In [0]:
# rename last two columns to be 'latitude' and 'longitude'
print('burada kodun')

##### 1.5 Map the `latitude` and `longitude` to radians

In [0]:
# map latitude and longitude to radians
print('il tuo codice qui')
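
Steps 1.3 to 1.5 can be sketched on a tiny stand-in for the data file. This is one possible approach rather than the expected solution: it assumes the file is comma-separated with no header row, and the two sample rows (three features instead of 68) are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data file: 3 features, then latitude, longitude.
# Assumption: comma-separated values with no header row.
sample = io.StringIO(
    '0.1,-0.2,0.3,44.058,-121.315\n'
    '-0.4,0.5,-0.6,-15.75,-47.95\n'
)

# header=None keeps the first row as data and numbers the columns 0..n-1
df = pd.read_csv(sample, header=None)

# rename the last two columns by position, leaving the feature columns unnamed
df = df.rename(columns={df.columns[-2]: 'latitude',
                        df.columns[-1]: 'longitude'})

# np.radians computes degrees * pi / 180 element-wise
df[['latitude', 'longitude']] = np.radians(df[['latitude', 'longitude']])
print(df)
```

Applying the provided `deg_to_rad` with `Series.map` would give the same values as `np.radians`.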

*** 
# Exercise #2 - Prepare dataset
*** 

##### 2.0 Function to plot latitude and longitude (provided)

In [0]:
# plot latitude vs longitude
def plot_examples(x, y):
  plt.title('Latitude vs Longitude')
  plt.xlabel('Longitude')
  plt.ylabel('Latitude')
  plt.xlim(-3,3)
  plt.ylim(-1.5,1.5) 
  plt.annotate('Bend', xy=(deg_to_rad(-121.315), deg_to_rad(44.058)), xycoords='data',
              xytext=(0.25, 0.85), textcoords='figure fraction',
              arrowprops=dict(arrowstyle="->"))
  plt.scatter(x, y, s=10, alpha=.1)
  plt.show()

##### 2.1 Split `X` and `y`

* Save the 68 features as `X`
* Save the `latitude` as `y_lat` 
* Save the `longitude` as `y_lon` 

In [0]:
# split X and y_lat/y_lon
print('您的代码在这里')
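
One way to do the split is by column position, since only the two output columns have names. The sketch below uses a random stand-in frame with the same column layout (68 features followed by `latitude` and `longitude`); the data itself is made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in frame: 5 rows, 68 unnamed features plus the two outputs
df = pd.DataFrame(rng.normal(size=(5, 70)))
df = df.rename(columns={68: 'latitude', 69: 'longitude'})

X = df.iloc[:, :-2]        # every column except the last two
y_lat = df['latitude']
y_lon = df['longitude']

print(X.shape, y_lat.shape, y_lon.shape)
```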

##### 2.2 Print shape of `X`, `y_lat`, `y_lon`

In [0]:
# print shapes of data
print('seu código aqui')

##### Self Check

In [0]:
assert X.shape[1]==68 and y_lat.shape[0] == 1059 and y_lon.shape[0] == 1059

##### 2.3 Plot locations

* latitude (`y-axis`) 
* longitude (`x-axis`)

In [0]:
# plot_examples
print('dein Code hier')

> The 1059 tracks represent 33 different geographic locations.

##### 2.4 Function to train and test


* `KFold` cross-validation
  * `shuffle=False`
  * `n_splits=k` (default `k=10`)
* save predictions (across folds) and return `y_preds`  

In [0]:
# train and test
def train_and_test(clf, X, y, k=10):  
  # k-fold cross-validation
  print('ここにあなたのコード')
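
A minimal sketch of such a function follows, shown on synthetic data. It is one possible shape rather than the reference solution: the name `train_and_test_sketch`, the per-fold print format, and the demo regressor settings are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

def train_and_test_sketch(clf, X, y, k=10):
    X, y = np.asarray(X), np.asarray(y)
    y_preds = np.zeros(len(y))
    kf = KFold(n_splits=k, shuffle=False)
    for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        # fit() refits from scratch, so nothing carries over between folds
        clf.fit(X[train_idx], y[train_idx])
        y_preds[test_idx] = clf.predict(X[test_idx])
        # Pearson's r between true and predicted values on the held-out fold
        r, _ = pearsonr(y[test_idx], y_preds[test_idx])
        print(f'Fold: {fold:2d}, r: {r:.3f}')
    return y_preds

# synthetic demo: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=200)
y_preds = train_and_test_sketch(KNeighborsRegressor(n_neighbors=5), X_demo, y_demo)
```

Because the estimator is fitted in place, it is left fitted on the last fold's training split after the call, which lets a fitted step be inspected afterwards.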

*** 
# Exercise #3 - $k$-Nearest Neighbor
*** 

##### 3.1 Construct a `KNeighborsRegressor` (see [api](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html))

* 19 neighbors
* Chebyshev distance

In [0]:
# knn regressor
print('ваш код здесь')

##### 3.2 Build a pipeline

* use `Pipeline` constructor and name the steps
  * Standardize
  * `KNeighborsRegressor` from 3.1

In [0]:
# build a Pipeline
print('do chód anseo')
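
The regressor from 3.1 and the pipeline from 3.2 might look like the sketch below. The step names `'scaler'` and `'knn'` are arbitrary choices, and the data is synthetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 19 neighbors; Chebyshev distance is the largest absolute coordinate difference
knn = KNeighborsRegressor(n_neighbors=19, metric='chebyshev')

# named steps: standardize first, then regress
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', knn),
])

# synthetic demo data (needs at least 19 training rows for 19 neighbors)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=100)

pipe.fit(X_demo, y_demo)
preds = pipe.predict(X_demo[:3])
print(preds)
```

Putting the scaler inside the pipeline means it is refitted on each training split during cross-validation, so the held-out fold does not influence the scaling.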

##### 3.3 Predict `latitude` and print Pearson's $r$

In [0]:
# train and test latitude
print('nambari yako hapa')

##### Self Check

```
Fold:  1, r: 0.282
Fold:  2, r: 0.330
Fold:  3, r: 0.513
Fold:  4, r: 0.379
Fold:  5, r: 0.576
Fold:  6, r: 0.636
Fold:  7, r: 0.476
Fold:  8, r: 0.265
Fold:  9, r: 0.413
Fold: 10, r: 0.418
```

##### 3.4 Predict `longitude` and print Pearson's $r$

In [0]:
# train and test longitude
print('여기 코드')

*** 
# Exercise #4 - Principal Component Analysis
*** 

In this section, we re-run our $k$-Neighbors Regression, but add PCA to the pipeline. Use your `train_and_test` function with CV of $k=10$.

##### 4.1 Build another Pipeline (for latitude)

* use `Pipeline` constructor and name the steps
  * name pipeline `pipe_pca_lat`
  * Standardize
  * PCA with `n_components =` ${\text{n_features} \over 2} $
    * `random_state=1`
  * `KNeighborsRegressor` from 3.1

In [0]:
# pipeline with pca
print('din kod här')
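
A sketch of a three-step pipeline with PCA between the scaler and the regressor is shown below on synthetic data with 68 features. The name `pipe_pca_demo` and the step name `'scaler'` are assumptions; the `'pca'` step name matches what the Self Check below indexes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

n_features = 68

pipe_pca_demo = Pipeline([
    ('scaler', StandardScaler()),
    # keep half as many components as there are input features
    ('pca', PCA(n_components=n_features // 2, random_state=1)),
    ('knn', KNeighborsRegressor(n_neighbors=19, metric='chebyshev')),
])

# synthetic demo data: PCA needs at least n_components rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, n_features))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=120)

pipe_pca_demo.fit(X_demo, y_demo)
print(pipe_pca_demo['pca'].n_components)
print(pipe_pca_demo['pca'].components_.shape)
```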

##### Self Check

In [0]:
assert pipe_pca_lat['pca'].n_components == round(64393.7622091 ** (1./3.14))

##### 4.2 Build another Pipeline (for longitude)

* use `Pipeline` constructor and name the steps
  * name pipeline `pipe_pca_lon`
  * Standardize
  * PCA with `n_components =` ${\text{n_features} \over 2} $
    * `random_state=1`
  * `KNeighborsRegressor` from 3.1

In [0]:
# pipeline with pca
print('यहाँ आपका कोड')

##### 4.3 Predict `latitude` and print Pearson's $r$

In [0]:
# train and test latitude
print('الكود الخاص بك هنا')

##### Self Check

```
Fold:  1, r: 0.409
Fold:  2, r: 0.325
Fold:  3, r: 0.522
Fold:  4, r: 0.314
Fold:  5, r: 0.574
Fold:  6, r: 0.610
Fold:  7, r: 0.480
Fold:  8, r: 0.270
Fold:  9, r: 0.431
Fold: 10, r: 0.382
```

##### 4.4 Predict `longitude` and print Pearson's $r$

In [0]:
# train and test longitude
print('הקוד שלך כאן')

*** 
# Exercise #5 - Explained Variance
*** 

##### 5.1 Print explained variance from the `pipe_pca_lat`

In [0]:
# explain variance
print('mã của bạn ở đây')

##### 5.2 Calculate the sum of explained variance (lat)

In [0]:
# calculate sum of explained variance (lat)
print('τον κωδικό σας εδώ')
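
After a pipeline has been fitted, the PCA step exposes the variance captured by each component. The sketch below assumes "explained variance" means the per-component fraction in `explained_variance_ratio_` (the absolute values are in `explained_variance_`), and it uses a small synthetic pipeline rather than `pipe_pca_lat`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))

# hypothetical demo pipeline; the real one would also end in a regressor
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=4, random_state=1)),
])
pipe_demo.fit(X_demo)

# fraction of total variance captured by each retained component
ratios = pipe_demo['pca'].explained_variance_ratio_
print(ratios)

# total fraction of variance retained after reducing 8 features to 4
total = ratios.sum()
print(total)
```

The sum is at most 1 and shows how much of the standardized features' variance survives the reduction.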

##### 5.3 Plot explained variance (lat)

* bar chart of 'Individual explained variance'

In [0]:
# plot variance (lat)
print('รหัสของคุณที่นี่')
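
A bar chart of the individual explained variance could be drawn as in the sketch below. The data is synthetic, the axis labels are assumptions, and the PCA is fitted directly instead of being taken from `pipe_pca_lat`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data whose columns have different scales, so components differ in size
X_demo = rng.normal(size=(200, 8)) * np.arange(1, 9)

pca = PCA(n_components=4, random_state=1).fit(X_demo)
ratios = pca.explained_variance_ratio_

fig, ax = plt.subplots()
# one bar per principal component, numbered from 1
bars = ax.bar(np.arange(1, len(ratios) + 1), ratios,
              label='Individual explained variance')
ax.set_xlabel('Principal component')
ax.set_ylabel('Explained variance ratio')
ax.legend()
plt.show()
```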

##### 5.4 Print explained variance from the `pipe_pca_lon`

In [0]:
# explain variance
print('a kódod itt')

##### 5.5 Calculate the sum of explained variance (lon)

In [0]:
# calculate sum of explained variance (lon)
print('uw code hier')

##### 5.6 Plot explained variance (lon)

In [0]:
# plot variance (lon)
print('உங்கள் குறியீடு இங்கே')

<img src="https://66.media.tumblr.com/dded9d1a2bf2068f92af9f7a9b6b5451/tumblr_p6s3hbPzgV1vd8jsjo1_500.gifv" width="300">