# Visualization

In this notebook, the objective is to practise with the main visualization tools and to interpret the resulting graphs.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

# EDA - Dataset loading and transformation
The file customers.csv contains a list of customers with their total spend per item. The objective of this lab is to cluster the customers and to experiment with dimensionality reduction and visualization.

Q1. Load the data and describe elementary statistics.

In [None]:
data = pd.read_csv("customers.csv")
# STUDENT

Q2. Draw:
- the scatter matrix of the pairwise features
- a violin plot of each feature
- the correlation matrix

Use the seaborn package which provides the required functions.


In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


In [None]:
#[STUDENT]


Q3. What can you deduce from those graphs? What transformation is required? Apply it and redraw the graphs.

In [None]:
#[STUDENT]


In [None]:
#[STUDENT]

## PCA
The goal of PCA is to find rotated axes (principal components) along which the variance of the data is maximal. By projecting the data onto the first few of these axes, we can represent the data in fewer dimensions without losing much information. Indeed, minimising the mean squared error (MSE) between the data and their projection is equivalent to maximising the variance of the projected data (see lesson).
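
As a short reminder of why this equivalence holds, here is a sketch for a single axis. The notation is introduced here for illustration only: $x_i$ are the centred samples and $m$ is a unit-norm direction ($\|m\|=1$).

$$
\frac{1}{n}\sum_{i=1}^n \left\| x_i - (m^T x_i)\, m \right\|^2
= \frac{1}{n}\sum_{i=1}^n \|x_i\|^2 \;-\; \frac{1}{n}\sum_{i=1}^n (m^T x_i)^2
$$

The first term on the right does not depend on $m$, and the second is the variance of the data projected on $m$ (the projected mean is zero because the data are centred). Minimising the left-hand side over $m$ therefore amounts to maximising the projected variance.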
 
First, we are going to carry out a PCA "by hand" on some toy data: ```mystery_data.npy```. This consists in the following steps:
- Centre the columns of the data: $X_{i,j} = X_{i,j}-\frac{1}{n} \sum_{k=1}^n X_{k,j}$
- Calculate the covariance matrix: $\Sigma_X = \frac{1}{n-1} X^T X$
- Calculate the eigenvalues and eigenvectors of $\Sigma_X$. Use the ```np.linalg.eig``` function
- Arrange the eigenvectors $\left\{m_1,m_2,\dots, m_p\right\}$ in order of decreasing eigenvalues. **Note**: The ```np.argsort``` function only sorts in increasing order, so you need to find a way to invert this order
- Project the data onto the first principal components (take two here). 
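
The steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the expected solution: the array `X_toy` and all other names are made up for the example, and `np.linalg.eigh` (the variant for symmetric matrices, which always returns real values) is used here in place of `np.linalg.eig`.

```python
import numpy as np

# Synthetic stand-in for the lab data (an assumption for illustration):
# 200 samples, 3 variables with deliberately different spreads.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# 1. Centre each column.
X_c = X_toy - X_toy.mean(axis=0)

# 2. Covariance matrix (p x p), using the 1/(n-1) normalisation.
n_toy = X_c.shape[0]
cov = X_c.T @ X_c / (n_toy - 1)

# 3. Eigen-decomposition of the symmetric covariance matrix.
#    eigh returns the eigenvalues in increasing order.
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. argsort sorts in increasing order; reversing the index array
#    with [::-1] gives decreasing eigenvalues.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# 5. Project the centred data onto the first two principal components.
M = eigvecs[:, :2]
X_proj = X_c @ M
print(X_proj.shape)
```

A useful sanity check is that the variance of each projected column equals the corresponding eigenvalue.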

We load the data first:

In [None]:
X = np.load('mystery_data.npy')

n = X.shape[0]
p = X.shape[1]

print("Number of rows (samples):", n)
print("Number of variables:", p)

Pick two variables (columns) of this data at random, and display the data using a scatter plot. Can you see any meaningful structure?

In [None]:
# STUDENT

Now, carry out the PCA "by hand" to calculate the first two principal components $m_1$ and $m_2$, and project the data onto them. Display the projected data. Can you see some structure? :)

In [1]:
#STUDENT

## PCA
The correlation matrix of the customers data shows that some features are correlated. Two features which are correlated can be summarised/compressed with one feature. Take, for example, distance from the sun and radiation amplitude. One is determined by the other, so it is a waste of features to store them both. In real-world situations (not like the previous toy data), the goal of the PCA is to determine these correlations for the purpose of reducing dimensionality.  
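
To make this concrete, here is a small sketch with two made-up features, where the second is almost entirely determined by the first. The data and variable names are assumptions for illustration, not part of the customers dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic features: b is a noisy linear function of a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)
pair = np.column_stack([a, b])

# Fit a PCA keeping both components and look at the share of
# variance carried by each one.
pca_pair = PCA(n_components=2).fit(pair)
ratio_pair = pca_pair.explained_variance_ratio_
print(ratio_pair)
```

Almost all of the variance ends up on the first component, so the pair of features can be compressed into a single one with very little loss.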

Q4. Use a PCA to reduce the dimensionality. From now on, we will use pre-coded functions of ```scikit-learn``` (see further on for explanations). Do the following:

1. Carry out the PCA
2. Display the "explained variance" (see lesson for more explanations on this) for each axis
3. Look at the contribution of each dimension to each new axis
4. Draw the first two axes and the features in a biplot.

What can you conclude from the biplot?

**How to proceed**
We will use the PCA function from scikit-learn [see here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). Look at the documentation and try to understand the different functions and variables of the PCA object.

First, create an object which will contain the result of the PCA. In the PCA() function, we can also specify parameters such as the number of components to be retained (n_components - here, we keep all of them).
Next, we fit the data with the fit() function of the previously created object. In this function, we pass the data to be used (quantitative variables only with no missing data).


The pca object now exposes a number of attributes and methods. There is the list of "explained variances":
- ```explained_variance_```,
which holds the eigenvalues $\lambda_i$ of the covariance matrix along each dimension, and the "explained variance ratio", defined by:
- $\frac{\lambda_i}{\sum_j \lambda_j}$, ```explained_variance_ratio_```

The principal components can also be seen using:
- ```components_```.
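
The link between these attributes and the "by hand" computation can be checked on synthetic data. The sketch below uses a made-up array `Z` and hypothetical variable names; it only illustrates the attributes listed above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with correlated columns (illustration only).
rng = np.random.default_rng(1)
Z = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

# n_components is left at its default, so all components are kept.
pca_demo = PCA()
pca_demo.fit(Z)

# Eigenvalues of the sample covariance matrix, in decreasing order.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]

print(pca_demo.explained_variance_)        # compare with cov_eigvals
print(cov_eigvals)
print(pca_demo.explained_variance_ratio_)  # lambda_i / sum_j lambda_j
print(pca_demo.components_.shape)          # (n_components, n_features)
```

Each row of `components_` is one principal axis expressed in the original feature space.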

Create this object now:

In [1]:
#[STUDENT]


Print the sum of the explained variance ratios. What does this give? Explain this result.

In [None]:
#[STUDENT]

Q5. Make a summary table, with explained variances, simple and cumulative explained variance proportions.

In [None]:
#[STUDENT]

Q6. Plot these proportions of variances explained (in percentage).

In [None]:
#[STUDENT]

Q7. Calculate the coordinates of the individuals (samples) in the new basis given by the principal components, using the transform() function of the pca object. Keep the appropriate number of dimensions (see lesson) and build a table with the corresponding dimensions.

In [None]:
#[STUDENT]

Q8. Plot the data on the first factorial plane (the first two principal components).

In [None]:
#[STUDENT]
plt.scatter(pca_data['PC1'], pca_data['PC2'])

The circle of correlation is a visual way to represent the data and its correlations using the first two principal components of the PCA.

Here is a code to do this:

In [None]:
n = data.shape[0] # nb of individuals
p = data.shape[1] # nb of variables
print(n, '  ', p)
eigval = (n-1) / n * pca.explained_variance_ # eigen values
sqrt_eigval = np.sqrt(eigval)
corvar = np.zeros((p,p)) # empty matrix for coordinates
for k in range(p):
    corvar[:,k] = pca.components_[k,:] * sqrt_eigval[k]
# convert the coordinates to a DataFrame
coordvar = pd.DataFrame({'id': data.columns, 'COR_1': corvar[:,0], 'COR_2': corvar[:,1]})
coordvar

In [None]:
# Create an empty figure (axes between -1 and 1, plus the title)
fig, axes = plt.subplots(figsize = (6,6))
fig.suptitle("Correlation circle")
axes.set_xlim(-1, 1)
axes.set_ylim(-1, 1)
# Add the axes
axes.axvline(x = 0, color = 'lightgray', linestyle = '--', linewidth = 1)
axes.axhline(y = 0, color = 'lightgray', linestyle = '--', linewidth = 1)
# Add the variable names
for j in range(p):
    axes.text(coordvar["COR_1"][j],coordvar["COR_2"][j], coordvar["id"][j])
# Add the circle
plt.gca().add_artist(plt.Circle((0,0),1,color='blue',fill=False))

plt.show()
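
The coordinates stored in `corvar` are meant to be the correlations between each original variable and each principal component. A minimal check on synthetic data is sketched below. It assumes the PCA is fitted on standardised data (zero mean, unit population standard deviation), and all names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (illustration only).
rng = np.random.default_rng(2)
raw = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))

# Standardise with the population standard deviation (ddof=0).
Zs = (raw - raw.mean(axis=0)) / raw.std(axis=0)
n_s, p_s = Zs.shape

pca_s = PCA().fit(Zs)
scores = pca_s.transform(Zs)

# Same formula as the cell above: component entry times the square
# root of the (population) eigenvalue.
eigval_s = (n_s - 1) / n_s * pca_s.explained_variance_
loadings = pca_s.components_.T * np.sqrt(eigval_s)  # row j: variable, column k: PC

# Direct computation of corr(variable j, PC k) for comparison.
direct = np.array([[np.corrcoef(Zs[:, j], scores[:, k])[0, 1]
                    for k in range(p_s)]
                   for j in range(p_s)])
print(np.allclose(loadings, direct))
```

Because these coordinates are correlations, every variable lies inside the unit circle, and a variable close to the circle is well represented by the first two components.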

You can also combine variables and individuals as follows:

In [None]:
# Contribution of each original feature to the first principal component
plt.bar(x=range(pca.components_.shape[1]),height=pca.components_[0])

In [None]:
plt.scatter(pca_data['PC1'], pca_data['PC2'], s=5)
for i in range(pca.components_.shape[1]):
    plt.arrow(0,0,pca.components_[0,i]*10,pca.components_[1,i]*10,alpha=0.5)
    plt.text(pca.components_[0,i]*10,pca.components_[1,i]*10,log_data.columns[i])

Q8 (continued). Comment on the results obtained.

#[STUDENT]

# Other visualizations

Q9. PCA is a **linear** dimension reduction tool: the new dimensions are calculated as linear combinations of the initial (canonical) dimensions. This is only useful when the data show linear correlations, that is, when they are spread along linear axes. This is the case if the data can be modelled as a multi-dimensional Gaussian distribution (blobs stretched in different directions). If the data have some other, non-linear shape, PCA may not be useful. In this case, we may use "non-linear" dimension reduction tools, which apply non-linear operators.
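
A toy case where a linear projection fails is sketched below, using two concentric rings generated for the purpose. The data and names are assumptions for illustration. The non-linear view is a handcrafted feature (the distance to the centre), not one of the manifold methods discussed in this section.

```python
import numpy as np

# Two concentric rings of radius 1 and 3 (synthetic, illustration only).
rng = np.random.default_rng(3)
angles = rng.uniform(0, 2 * np.pi, size=400)
radii = np.repeat([1.0, 3.0], 200) + rng.normal(scale=0.05, size=400)
ring_label = np.repeat([0, 1], 200)
rings = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# Linear view: project onto the leading principal axis.
rings_c = rings - rings.mean(axis=0)
_, vecs = np.linalg.eigh(np.cov(rings_c, rowvar=False))
proj = rings_c @ vecs[:, -1]  # eigh sorts ascending: last column leads

# Non-linear view: distance of each point to the centre.
dist = np.linalg.norm(rings_c, axis=1)

inner_proj, outer_proj = proj[ring_label == 0], proj[ring_label == 1]
inner_dist, outer_dist = dist[ring_label == 0], dist[ring_label == 1]

proj_overlap = (inner_proj.max() > outer_proj.min()
                and outer_proj.max() > inner_proj.min())
dist_overlap = inner_dist.max() > outer_dist.min()
print("projection ranges overlap:", proj_overlap)
print("distance ranges overlap:", dist_overlap)
```

The two rings are mixed together along any single linear axis, while a non-linear coordinate separates them cleanly.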

Apply the Isomap, MDS and t-SNE algorithms to visualize the transformed data. Feel free to change the parameters of each visualization method. Draw conclusions from this study.

To obtain a readable visualization, we first need to perform a clustering to categorize the data. The code is provided below. You can then specify the colors using a predefined palette in seaborn.

In [None]:
import seaborn as sns

palette = sns.color_palette() # Default color palette

#for having a palette with 10 colors
palette10= sns.color_palette('husl', 10)


In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Cluster the PCA coordinates into 8 groups
km = KMeans(n_clusters=8)
pred = km.fit_predict(pca_data)

# Generate a colormap with 8 colors (one per cluster)
cmap = plt.get_cmap('tab20', 8)
# Get colors for each cluster
colors = [cmap(i) for i in pred]

plt.scatter(pca_data['PC1'], pca_data['PC2'], color=colors, s=5)
real_centers = np.exp(pca.inverse_transform(km.cluster_centers_))
fig, axs = plt.subplots(km.n_clusters // 2, 2, sharey=True, sharex=True)
for i, k in enumerate(real_centers):
    # Get the color for the current cluster
    color = cmap(i)
    axs.flatten()[i].bar(range(len(k)), k, color=color)
    axs.flatten()[i].set_xticks(range(len(k)))
    axs.flatten()[i].set_xticklabels(data.columns, rotation="vertical")

In [None]:
#ISOMAP
from sklearn.manifold import Isomap
isomap = Isomap(n_components=2)
isomap_data = isomap.fit_transform(pca_data)
plt.scatter(isomap_data[:, 0], isomap_data[:, 1], color=colors, s=5)


In [None]:
#MDS
from sklearn.manifold import MDS
mds = MDS(n_components=2)
mds_data = mds.fit_transform(pca_data)
plt.scatter(mds_data[:, 0], mds_data[:, 1], color=colors, s=5)

In [None]:
#TSNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
tsne_data = tsne.fit_transform(pca_data)
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], color=colors, s=5)