## Exercise 2: Correlations and preprocessing

Before applying machine learning algorithms, we take a closer look at further preprocessing and analysis steps. This exercise mainly deals with scaling, dimensionality reduction, correlation measures, and the distribution of data points.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

In [None]:
df = pd.read_csv('data/fifa1.csv')

### correlation vs. causality

A common step in data analytics is to investigate correlations between variables. Some of these correlations can be traced back to an obvious causal relationship, while others cannot. Create scatter plots of the feature pair 'Ball control' and 'Dribbling' and the feature pair 'Positioning' and 'Penalties'. Do not forget to label the axes of the plots and add suitable titles.

In addition, calculate and print the Pearson correlation of each feature pair (you can also add the correlation to the title of the plot for a better overview). To calculate it, you can use, for example:

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.corr.html
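
To illustrate how the two linked functions are called, here is a minimal sketch on synthetic data. The columns `feature_a` and `feature_b` and the numbers used to generate them are assumptions made up for this example; they are not taken from the FIFA data set.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic stand-ins for two rating columns (hypothetical, not the FIFA data).
x = rng.normal(60, 10, size=500)
y = 0.8 * x + rng.normal(0, 5, size=500)
demo = pd.DataFrame({"feature_a": x, "feature_b": y})

# Two equivalent ways to obtain the Pearson correlation coefficient.
r_scipy, p_value = pearsonr(demo["feature_a"], demo["feature_b"])
r_pandas = demo["feature_a"].corr(demo["feature_b"])  # method="pearson" is the default

# Scatter plot with labeled axes and the correlation in the title.
fig, ax = plt.subplots()
ax.scatter(demo["feature_a"], demo["feature_b"], s=5)
ax.set_xlabel("feature_a")
ax.set_ylabel("feature_b")
ax.set_title(f"feature_a vs. feature_b (Pearson r = {r_scipy:.2f})")

print(f"scipy: {r_scipy:.4f}, pandas: {r_pandas:.4f}")
```

Both calls should agree up to floating-point precision; `pearsonr` additionally returns a p-value.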

In [None]:
#TODO: Make scatter plots, calculate the Pearson correlation, think about it.

### different correlation measures

Besides the Pearson correlation, there are many more correlation measures with different properties. We want to have a closer look at one of them, the so-called Spearman correlation:

https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

Create a scatter plot of the features 'Overall' and 'Value Euro'. In addition, calculate both the Pearson and the Spearman correlation of the two features. To calculate them, you can use, for example:

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.corr.html
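
The difference between the two measures can be sketched on synthetic data. The example assumes a monotone but strongly nonlinear relation between a hypothetical `rating` and a hypothetical `value`; whether the FIFA features behave similarly is for you to check.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)

# Hypothetical data: value grows exponentially with rating (an assumption).
rating = rng.uniform(50, 95, size=1000)
value = np.exp(0.2 * rating)

# Pearson measures linear association on the raw values.
r_pearson, _ = pearsonr(rating, value)
# Spearman is the Pearson correlation of the ranks, so it only
# depends on the ordering of the values.
r_spearman, _ = spearmanr(rating, value)

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because the relation is strictly increasing, the ranks of both variables coincide and the Spearman coefficient is essentially 1, while the Pearson coefficient stays noticeably lower.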

##### Questions:

Can you give a short explanation of the results? 

Can you also imagine dependencies between features that neither of the two correlation measures can detect? Give a simple example.

In [None]:
#TODO: Calculate the correlation measures, make a scatter plot and comment on the questions

### investigate distributions

Create a histogram of the feature *interceptions* using 100 bins and the parameter *density = True* (older matplotlib versions called this parameter *normed*; it has since been removed). In addition, plot a normal distribution with the same mean and variance as a line plot in the same diagram (it is recommended to use different colors).
For plotting the normal distribution you can use:

https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linspace.html

https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.norm.html

What do you observe?
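
A minimal sketch of the plotting mechanics, using a synthetic sample instead of the real feature. The bimodal shape of `sample` is an assumption chosen only to make the mismatch with a normal curve visible; it is not a statement about the FIFA data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(2)

# Synthetic, deliberately bimodal sample (hypothetical stand-in for a feature).
sample = np.concatenate([rng.normal(20, 5, 3000), rng.normal(60, 8, 7000)])
mu, sigma = sample.mean(), sample.std()

fig, ax = plt.subplots()
# density=True rescales the bars so that the total histogram area is 1,
# which makes the histogram comparable to a probability density.
counts, edges, _ = ax.hist(sample, bins=100, density=True,
                           color="tab:blue", label="histogram")

# Normal density with the same mean and standard deviation as the sample.
xs = np.linspace(sample.min(), sample.max(), 500)
ax.plot(xs, norm.pdf(xs, loc=mu, scale=sigma),
        color="tab:red", label="normal pdf")

ax.set_xlabel("value")
ax.set_ylabel("density")
ax.legend()
```

Note that `norm.pdf` expects the standard deviation as `scale`, not the variance.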

In [None]:
#TODO: Compute mean and standard deviation, plot both charts in one diagram, what can you see?

### PCA

A common method for dimensionality reduction is Principal Component Analysis (PCA), which you already know from the lecture:

https://en.wikipedia.org/wiki/Principal_component_analysis

Create scatter plots of the features 'Dribbling' and 'Ball control' before and after applying PCA: 

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Before doing so, apply z-normalization to both features:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale

What do you observe? Can you name one advantage and one disadvantage of using PCA?
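
The order of operations (scale first, then apply PCA) can be sketched as follows. The two correlated columns `a` and `b` are synthetic and purely hypothetical; the sketch prints summary numbers and leaves the scatter plots to you.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(3)

# Two strongly correlated synthetic features (hypothetical stand-ins).
a = rng.normal(60, 15, size=1000)
b = 0.9 * a + rng.normal(0, 5, size=1000)
X = np.column_stack([a, b])

# z-normalization: each column gets zero mean and unit variance.
X_scaled = scale(X)

# Keep both components so the data is only rotated, not reduced.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

corr_before = np.corrcoef(X_scaled, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_pca, rowvar=False)[0, 1]

print(f"correlation before PCA: {corr_before:.3f}")
print(f"correlation after PCA:  {corr_after:.3f}")
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Plotting `X_scaled[:, 0]` against `X_scaled[:, 1]` and `X_pca[:, 0]` against `X_pca[:, 1]` gives the "before" and "after" scatter plots for this synthetic data.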

In [None]:
#TODO: Scale the features, apply PCA to them and don't forget to comment on the questions.