**Let's look at Principal Component Analysis (PCA) through 2D and 3D representations:**

In [None]:

import numpy as np 
import pandas as pd 


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



The data I'm testing PCA on contains features of song extracts from multiple genres:

In [None]:
db = pd.read_csv('/kaggle/input/musicfeatures/data.csv')
db.head()

Let's check the variable types and whether there are any missing values:

In [None]:
print(db.info())
print(db.shape)
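
As a complement to `info()`, missing values can also be counted per column directly. The snippet below is a minimal sketch on a hypothetical toy frame (the column values are made up and only stand in for `db`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `db`; the values are invented.
toy = pd.DataFrame({
    "tempo": [103.4, 95.7, np.nan, 151.9],
    "beats": [50, 44, 75, 70],
    "label": ["blues", "rock", "jazz", None],
})

# Number of missing values in each column.
na_counts = toy.isna().sum()
print(na_counts)

# Column dtypes, a compact version of what info() reports.
print(toy.dtypes)
```

On the real data the same call would be `db.isna().sum()`.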

In [None]:
db['label'].value_counts()  # the genres are equally represented

Now I will keep only the numeric columns:

In [None]:
db1 = db.iloc[:,1:29]
db1.info()
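
The positional slice above depends on the column order. An alternative, sketched here on a hypothetical toy frame (the column names are assumptions for illustration), is to select columns by dtype:

```python
import pandas as pd

# Hypothetical toy frame: one identifier column, two numeric features, one label.
toy = pd.DataFrame({
    "filename": ["a.au", "b.au", "c.au"],
    "tempo": [103.4, 95.7, 151.9],
    "beats": [50, 44, 75],
    "label": ["blues", "rock", "jazz"],
})

# Keep only numeric columns, regardless of where they sit in the frame.
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))
```

On the real data this would read `db.select_dtypes(include="number")`, assuming every numeric column is a feature you want to keep.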

Let's test PCA, first with 3 components:

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
db_pca = pca.fit_transform(db1)

Now let's check how much of the variance each component explains:

In [None]:
pca.explained_variance_ratio_

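Not bad: the first component alone explains about 98% of the variance. Keep in mind that the features were not standardized before fitting, so features measured on large scales can dominate this figure.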
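
To illustrate how feature scale affects the explained variance ratio, here is a minimal sketch on synthetic data (not the music dataset). It compares PCA on raw features with PCA after `StandardScaler`; the three independent features and their scales are assumptions chosen for the demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three independent features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1000.0, 500),  # large-scale feature
    rng.normal(0, 1.0, 500),     # unit-scale feature
    rng.normal(0, 0.01, 500),    # small-scale feature
])

# PCA on the raw features: the large-scale feature dominates.
raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the notebook's data the equivalent would be fitting `PCA` on `StandardScaler().fit_transform(db1)`; whether that is preferable depends on whether the original units should carry weight in the analysis.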

In [None]:
principalDf = pd.DataFrame(
    data=db_pca,
    columns=['principal component 1', 'principal component 2', 'principal component 3'],
)

In [None]:
finalDf = pd.concat([principalDf, db['label']], axis = 1)
finalDf.head()

# 2D Representation

In [None]:
import plotly.express as px
fig = px.scatter(db_pca, x = 0, y = 1, color = finalDf['label'])
fig.show()

# 3D Representation

In [None]:
fig = px.scatter_3d(db_pca, x = 0, y = 1, z = 2, color = finalDf['label'])
fig.show()

# 2D Representation with loadings

Now let's look at the loadings to better understand how much each original variable contributes to the components:

*(beautiful code I found on the Plotly page)*

In [None]:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
features = db1.columns

fig = px.scatter(db_pca, x=0, y=1, color= finalDf['label'])
 
for i, feature in enumerate(features):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1]
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()

Rolloff, spectral_centroid and bandwidth have the longest loading vectors, so they seem to be the most important variables for the first two components.
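
Reading importance off the plot can be backed up numerically by ranking features by the length of their loading vector. This is a minimal sketch on synthetic data; the feature names `big`, `medium` and `small` are hypothetical and only illustrate the computation:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic frame with three independent features on different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "big": rng.normal(0, 100.0, 200),
    "medium": rng.normal(0, 10.0, 200),
    "small": rng.normal(0, 1.0, 200),
})

pca = PCA(n_components=2).fit(toy)

# Same loadings definition as in the plot: component weights scaled by
# the standard deviation of each component.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Length of each feature's loading vector in the PC1/PC2 plane.
lengths = pd.Series(np.linalg.norm(loadings, axis=1), index=toy.columns)
ranking = lengths.sort_values(ascending=False)
print(ranking)
```

Applied to the notebook's objects, the same ranking could be computed from `loadings[:, :2]` with `db1.columns` as the index.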