# Principal Component Analysis 

### Data import and preparation

In [2]:
import sklearn.preprocessing  # import the submodule explicitly; a bare `import sklearn` does not guarantee it is loaded
from sklearn.decomposition import PCA
import pandas
import numpy as np
# load both tables and stack them into one; ignore_index=True renumbers the rows
redwinedata = pandas.read_csv('data/winequality-red.csv', sep=';')
whitewinedata = pandas.read_csv('data/winequality-white.csv', sep=';')
# DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
concat_data = pandas.concat([redwinedata, whitewinedata], ignore_index=True)
# drop the quality label and standardize each feature (zero mean, unit variance)
concat_data = concat_data.drop('quality', axis=1)
winearray = concat_data.values
winearray_norm = sklearn.preprocessing.scale(winearray)
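
As a rough illustration of what the standardization step does, the sketch below mimics `scale` with plain NumPy on a tiny made-up array. The values in `toy` are hypothetical and are not taken from the wine data; the point is only that every column ends up with mean 0 and standard deviation 1, so features measured on large scales do not dominate the PCA.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales
toy = np.array([[7.4, 34.0],
                [7.8, 67.0],
                [6.3, 132.0],
                [8.1, 97.0]])

# Column-wise standardization: subtract the mean, divide by the population std (ddof=0)
toy_norm = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_norm.mean(axis=0))  # approximately 0 for every column
print(toy_norm.std(axis=0))   # approximately 1 for every column
```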

### How many of the Principal Components should be used?

In [6]:
pca = PCA()
pca.fit(winearray_norm)
print("First three PC:")
print(pca.components_[:3]) 
print("Percent of Variance each PC accounts for:")
print(pca.explained_variance_ratio_) 


import plotly.graph_objs as go

# cumulative share of the variance explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)

trace1 = go.Scatter(
    x=np.arange(1, len(cum_var) + 1),  # one x value per component (11 here, not 10)
    y=cum_var,
    fill='tozeroy'
)
layout = go.Layout(
    title='Cumulative explained variance ratio',
    font=dict(size=16),
    xaxis=dict(title='Number of principal components'),
    yaxis=dict(title='Cumulative explained variance ratio')
)

# plotly.plotly moved to the separate chart_studio package in plotly 4;
# fig.show() renders the figure inline without an online account
fig = go.Figure(data=[trace1], layout=layout)
fig.show()


First three PC:
[[-0.2387989  -0.3807575   0.15238844  0.34591993 -0.29011259  0.43091401
   0.48741806 -0.04493664 -0.21868644 -0.29413517 -0.10643712]
 [ 0.33635454  0.11754972  0.1832994   0.32991418  0.31525799  0.0719326
   0.08726628  0.58403734 -0.155869    0.19171577 -0.46505769]
 [ 0.4343013  -0.30725942  0.59056967 -0.16468843 -0.0166791  -0.13422395
  -0.1074623  -0.17560555 -0.45532412  0.07004248  0.26110053]]
Percent of Variance each PC accounts for:
[0.2754426  0.22671146 0.14148609 0.08823201 0.06544317 0.05521016
 0.04755989 0.04559184 0.03063855 0.02069961 0.00298462]
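
For intuition about where these ratios come from: each one is the variance along a principal component (an eigenvalue of the covariance matrix of the standardized data) divided by the total variance. The sketch below reproduces that calculation with NumPy on synthetic data; the random data, the mixing matrix and the seed are assumptions for illustration only and have nothing to do with the wine tables.

```python
import numpy as np

# Synthetic, correlated data (hypothetical): 200 samples, 3 features
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Standardize, then take the eigenvalues of the covariance matrix
X = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order, so reverse it

# Share of the total variance carried by each principal component
ratio = eigvals / eigvals.sum()
print(ratio)
```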


Based on this graph, I would keep 4 principal components: together they explain about 73% of the variance while reducing the number of dimensions from 11 to 4.
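
The same choice can be checked numerically instead of read off the plot. The sketch below uses the explained variance ratios printed above and looks for the smallest number of components whose cumulative share reaches a threshold. The 70% threshold is an assumption made for this example, not a rule from the analysis.

```python
import numpy as np

# Explained variance ratios as printed in the output above
ratios = np.array([0.2754426, 0.22671146, 0.14148609, 0.08823201, 0.06544317,
                   0.05521016, 0.04755989, 0.04559184, 0.03063855, 0.02069961,
                   0.00298462])

threshold = 0.70  # assumed cut-off, chosen only for illustration
cum = np.cumsum(ratios)

# index of the first cumulative value >= threshold, converted to a component count
k = int(np.searchsorted(cum, threshold)) + 1

print(k, round(float(cum[k - 1]), 4))
```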

### Feature-Composition of the most important PC
We now look at the loadings, i.e. the weight each feature has in each principal component, to see which features contribute most to the components we keep. This gives us a hint as to which features we could drop from our analysis.

In [None]:
# see https://stackoverflow.com/questions/23294616/how-to-use-scikit-learn-pca-for-features-reduction-and-know-which-features-are-d
pca = PCA(n_components=4)
pca.fit(winearray_norm)
#print(pca.components_)

In [None]:
# first take absolute values and then normalize this data to get a better overview
comp1 = sklearn.preprocessing.scale(np.absolute(pca.components_[0]))
comp2 = sklearn.preprocessing.scale(np.absolute(pca.components_[1]))
comp3 = sklearn.preprocessing.scale(np.absolute(pca.components_[2]))
comp4 = sklearn.preprocessing.scale(np.absolute(pca.components_[3]))

df = pandas.DataFrame([comp1,comp2, comp3, comp4])
print(df.describe().drop(['count', '25%', '50%', '75%']).round(2))
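
The per-feature means in this summary can be turned into a ranking with `argsort` rather than by reading the table by eye. The sketch below shows the idea on a small made-up loading matrix (2 components, 4 features); the numbers are hypothetical, and the row-wise standardization is written out in NumPy to mirror what `scale` does to each 1-D component.

```python
import numpy as np

# Hypothetical loadings: rows are components, columns are features
loadings = np.array([[0.1, -0.3, 0.5, -0.7],
                     [0.5, 0.1, -0.3, 0.7]])

abs_loadings = np.abs(loadings)

# Standardize each component (row) to zero mean and unit variance
row_mean = abs_loadings.mean(axis=1, keepdims=True)
row_std = abs_loadings.std(axis=1, keepdims=True)
standardized = (abs_loadings - row_mean) / row_std

# Average over components, then sort features from most to least influential
mean_score = standardized.mean(axis=0)
ranking = np.argsort(-mean_score)

print(ranking.tolist())
```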

Looking at the summary above, the features (columns) with the highest mean across all 4 components have the biggest impact.
In this case the ranking would be: 9, 8, 0, 2, 5, 1, 3, 6, 10, 4, 7
In addition, each component could be weighted by its explained variance ratio, because the PCs that explain more of the variance should have more say in deciding which features are important.
So let's repeat the previous step with the explained variance ratio as weights.


In [None]:
#take the explained variance ratio array and use it to weight the component arrays
comp1_weighted = pca.explained_variance_ratio_[0] * comp1
comp2_weighted = pca.explained_variance_ratio_[1] * comp2
comp3_weighted = pca.explained_variance_ratio_[2] * comp3
comp4_weighted = pca.explained_variance_ratio_[3] * comp4
df_weighted = pandas.DataFrame([comp1_weighted,comp2_weighted, comp3_weighted, comp4_weighted])
print(df_weighted.describe().drop(['count', '25%', '50%', '75%']).round(4))
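
To see how such weights can reorder the features, here is a minimal sketch on a made-up loading matrix with two components. Both the loadings and the weights `[0.7, 0.2]` are hypothetical stand-ins for `pca.components_` and `pca.explained_variance_ratio_`. In this toy case the two least important features swap places once the first component is given more weight.

```python
import numpy as np

# Hypothetical loadings (rows: components, columns: features) and variance ratios
loadings = np.array([[0.1, -0.3, 0.5, -0.7],
                     [0.5, 0.1, -0.3, 0.7]])
weights = np.array([0.7, 0.2])

# Absolute loadings, standardized per component
abs_loadings = np.abs(loadings)
standardized = (abs_loadings - abs_loadings.mean(axis=1, keepdims=True)) \
    / abs_loadings.std(axis=1, keepdims=True)

# Rank features with and without the per-component weights
unweighted_rank = np.argsort(-standardized.mean(axis=0))
weighted_rank = np.argsort(-(weights[:, None] * standardized).mean(axis=0))

print(unweighted_rank.tolist())
print(weighted_rank.tolist())
```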

Although this ranking is quite different, I will compare the two to make a decision:  
0, 3, 6, 1, 5, 9, 8, 2, 4, 10, 7 - weighted     
9, 8, 0, 2, 5, 1, 3, 6, 10, 4, 7 - unweighted    

Features 4, 7 and 10 are the last three in both rankings, so it can be argued that they can be dropped (indices start at 0).
A look at the head of the data shows which features these are:

In [7]:
concat_data.head(1)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 |
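
The lookup from indices to feature names can also be done in code. The sketch below takes the two rankings and the column names exactly as listed above, keeps the features that sit in the bottom three of both rankings, and maps them to their names. The cut-off of three (`n_drop`) is an assumption that matches the choice made in the text.

```python
# Rankings as listed above (most important feature first)
weighted = [0, 3, 6, 1, 5, 9, 8, 2, 4, 10, 7]
unweighted = [9, 8, 0, 2, 5, 1, 3, 6, 10, 4, 7]

# Column names in the order shown by concat_data.head(1)
columns = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
           "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
           "pH", "sulphates", "alcohol"]

n_drop = 3  # assumed number of low-ranked features to consider

# Features that are among the last n_drop entries of both rankings
drop_idx = sorted(set(weighted[-n_drop:]) & set(unweighted[-n_drop:]))
drop_names = [columns[i] for i in drop_idx]

print(drop_idx, drop_names)
```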


Therefore chlorides (4), density (7) and alcohol (10) can be dropped.
Compared with exercise a), this only partly reflects how important those features are for telling white and red wines apart, so the PCA must also have picked up other "underlying" factors.