**This notebook demonstrates Principal Component Analysis (PCA) on the glass types dataset. PCA is widely used in machine learning to simplify data by reducing its dimensionality, but I believe it is also very useful as a visual tool for simpler analyses, especially in the field of consumer behaviour.**

Firstly, let's import the data and have a quick look at it:

In [None]:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

db = pd.read_csv('/kaggle/input/glass/glass.csv')
db.head()

In [None]:
db.info()

Checking the correlations between the glass components:

In [None]:
corr_matrix = db.corr(method = 'pearson')
corr_matrix 

In [None]:
# Reuse the correlation matrix computed in the previous cell
ax = sns.heatmap(corr_matrix)

Next I'll separate the nine feature columns, which PCA will combine into components, from the glass type, which can be treated as the target:

In [None]:
db1 = db.iloc[:,0:9]
db_target = db[["Type"]]
print(db1.head())
print(db_target.head())

Now I'll run PCA, with two components to begin with:

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
db_pca = pca.fit_transform(db1)
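
One thing worth keeping in mind: PCA is driven by variance, so columns measured on larger scales can dominate the components when the raw values are used, as in the cell above. Below is a minimal sketch of that effect, assuming standardization with scikit-learn's `StandardScaler` is acceptable for the analysis. The data is synthetic (two made-up columns), not the glass table.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table (NOT the glass data):
# two independent features, one on a much larger scale than the other.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=70.0, scale=5.0, size=200),   # large-scale feature
    rng.normal(loc=1.5, scale=0.01, size=200),   # small-scale feature
])

# PCA on raw values: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each column to mean 0 and variance 1.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

In this notebook the equivalent would be to pass `StandardScaler().fit_transform(db1)` to `pca.fit_transform` instead of `db1`; whether that is preferable depends on whether the original units should carry weight in the result.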

Let's have a look at the eigenvectors:

In [None]:
pca.components_ 

Now let's see how much each original variable contributes to each of the two components:

In [None]:
df = pd.DataFrame(pca.components_, columns=list(db1.columns))
df.head()
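
As a sanity check on what `pca.components_` holds, the sketch below compares it with an eigen-decomposition of the sample covariance matrix computed with NumPy. The data is synthetic and the variable names are illustrative; none of it comes from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (NOT the glass data): 100 rows, 3 columns,
# the first two of which are strongly correlated.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2.0 * base + 0.5 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

pca = PCA(n_components=2).fit(X)

# Eigen-decomposition of the sample covariance matrix (columns are variables).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]        # eigh returns eigenvalues in ascending order
top_vals = eigvals[order][:2]
top_vecs = eigvecs[:, order][:, :2].T    # rows are eigenvectors, like components_

# Eigenvectors are only defined up to sign, so compare absolute values.
same_vectors = np.allclose(np.abs(pca.components_), np.abs(top_vecs))
same_variances = np.allclose(pca.explained_variance_, top_vals)
print(same_vectors, same_variances)
```

Each row of `components_` is therefore a direction in the original variable space, and the table above simply labels its entries with the column names.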


And the most important part: how much of the variance do these two components explain?

In [None]:
pca.explained_variance_ratio_ 

The first component explains about 47% of the variance and the second about 26%, so a little over 70% together. For a machine learning project I would go further and try adding more components, but right now I'm more interested in how to represent this visually, which is what the next steps do.
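
To decide how many more components would be worth adding, one common approach is to look at the cumulative explained variance. Here is a minimal sketch on synthetic data; the 90% target is an arbitrary assumption, not a recommendation for the glass data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (NOT the glass data): 9 columns built from
# 3 underlying factors plus a little noise.
rng = np.random.default_rng(2)
factors = rng.normal(size=(150, 3))
mixing = rng.normal(size=(3, 9))
X = factors @ mixing + 0.1 * rng.normal(size=(150, 9))

# Fit with all components kept, then accumulate the variance ratios.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.90  # hypothetical threshold
n_needed = int(np.argmax(cumulative >= target)) + 1

print(cumulative.round(3))
print("components needed:", n_needed)
```

With the notebook's data, fitting `PCA()` on `db1` without `n_components` would give the ratios for all nine components.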

In [None]:
main_db = pd.DataFrame(data=db_pca,
                       columns=['principal component 1', 'principal component 2'])

In [None]:
final_db = pd.concat([main_db, db_target], axis = 1)
final_db.head()

Now let's have a look at the graphical representation:

In [None]:

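plt.figure(figsize=(8,6))
# Colour each point by glass type and keep the handle to build a legend from it
scatter = plt.scatter(db_pca[:,0], db_pca[:,1], c=final_db['Type'])

plt.xlabel('First principal component')
plt.ylabel('Second principal component')

# One legend entry per glass type (a bare plt.legend() finds no labelled artists here)
plt.legend(*scatter.legend_elements(), title='Type')

plt.show()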

For anyone more interested in a visual explanation of PCA, it helps to see which elements make up each of the two components. I haven't yet found a Python equivalent of R's biplot, so I used a function found on GitHub that helps build one (https://github.com/teddyroland/python-biplot). I will replicate this analysis in R and hopefully it will look even better!

In [None]:
#======== TEST WITH FUNCTION REPLICATING BIPLOT FUNCTION FROM R =========#

xvector = pca.components_[0]  # loadings of each variable on the first component
yvector = pca.components_[1]  # loadings of each variable on the second component
scores = pca.transform(db1)   # see 'prcomp(my_data)$x' in R
xs = scores[:,0]
ys = scores[:,1]

In [None]:
# Draw one arrow per original variable, stretched to the range of the scores
for i in range(len(xvector)):
    plt.arrow(0, 0, xvector[i]*max(xs), yvector[i]*max(ys),
              color='r', width=0.0005, head_width=0.0025)
    plt.text(xvector[i]*max(xs)*1.2, yvector[i]*max(ys)*1.2,
             db1.columns[i], color='r')

# Plot each observation and label it with its row index
for i in range(len(xs)):
    plt.plot(xs[i], ys[i], 'bo')
    plt.text(xs[i]*1.2, ys[i]*1.2, db1.index[i], color='b')


plt.show()
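
The two cells above could also be wrapped in a small reusable function. What follows is only a sketch of one possible layout, not the function from the linked repository: the name `biplot`, its parameters, and the arrow scaling are assumptions made for illustration, and the demo data is synthetic.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

def biplot(scores, loadings, feature_names, ax=None):
    """Scatter the first two PC scores and overlay one arrow per feature.

    scores:   array (n_samples, 2), e.g. the output of pca.transform
    loadings: array (2, n_features), e.g. pca.components_[:2]
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(scores[:, 0], scores[:, 1], s=15, alpha=0.6)
    # Stretch the arrows so they are visible against the spread of the scores.
    scale_x = np.abs(scores[:, 0]).max()
    scale_y = np.abs(scores[:, 1]).max()
    head = 0.03 * max(scale_x, scale_y)
    for (lx, ly), name in zip(loadings.T, feature_names):
        ax.arrow(0, 0, lx * scale_x, ly * scale_y, color="r", head_width=head)
        ax.text(lx * scale_x * 1.1, ly * scale_y * 1.1, name, color="r")
    ax.set_xlabel("First principal component")
    ax.set_ylabel("Second principal component")
    return ax

# Demo on synthetic data (NOT the glass data), with PCA done via SVD.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
demo_scores = Xc @ Vt[:2].T
ax = biplot(demo_scores, Vt[:2], ["a", "b", "c", "d"])
```

In the notebook, the call would presumably be `biplot(db_pca, pca.components_, db1.columns)`. Drawing all points with a single `scatter` call avoids the per-point loop, at the cost of dropping the row-index labels.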

In [None]:
# To save the figure to a file, call this before plt.show() in the cell above:
# plt.savefig('YourPathHere')