# Human Activity Classification with PCA

We will use the same dataset as the in-class demonstration, but look more closely at how the decision tree's performance changes with the number of principal components.

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix


# Paths to the HAR files
filename_features = "/content/features.txt"
filename_labels = "/content/activity_labels.txt"

filename_subtrain = "/content/subject_train.txt"
filename_xtrain = "/content/X_train.txt"
filename_ytrain = "/content/y_train.txt"

filename_subtest = "/content/subject_test.txt"
filename_xtest = "/content/X_test.txt"
filename_ytest = "/content/y_test.txt"

# sep="#" is a separator not expected in the file, so each line is read whole as one feature name
features = pd.read_csv(filename_features, header=None, names=['nome_var'], sep="#")
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas versions
labels = pd.read_csv(filename_labels, sep=r"\s+", header=None, names=['cod_label', 'label'])

subject_train = pd.read_csv(filename_subtrain, header=None, names=['subject_id'])
X_train = pd.read_csv(filename_xtrain, sep=r"\s+", header=None, names=features['nome_var'])
y_train = pd.read_csv(filename_ytrain, header=None, names=['cod_label'])

subject_test = pd.read_csv(filename_subtest, header=None, names=['subject_id'])
X_test = pd.read_csv(filename_xtest, sep=r"\s+", header=None, names=features['nome_var'])
y_test = pd.read_csv(filename_ytest, header=None, names=['cod_label'])

## PCA with standardized variables

A note on variable scale:

**Variables measured on very different scales** can distort a principal component analysis. PCA treats variance as information, so a monetary variable such as salary will typically vary by orders of magnitude more than number of children, length of employment, or any dummy variable. The high-variance variables then tend to "dominate" the analysis. In such cases it is common to standardize the variables first.

Run two principal component analyses on the HAR dataset, with and without standardization, and compare:

- The variance explained by each component
- The cumulative variance explained by component
- The percentage of variance explained by each component
- The cumulative percentage of variance by component
- How many components would you choose in each case to explain 90% of the variance?
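The scale effect can be seen on a toy example. The sketch below uses made-up data (a hypothetical `salary` column and a hypothetical `children` column, generated independently) and compares the share of variance captured by each component with and without standardization. It is only an illustration, not part of the HAR analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: two independent variables on very different scales.
salary = rng.normal(loc=5000.0, scale=1500.0, size=n)   # monetary units
children = rng.integers(0, 5, size=n).astype(float)     # small counts
X = np.column_stack([salary, children])

# Without standardization the first component is essentially "salary".
ratio_raw = PCA().fit(X).explained_variance_ratio_

# After standardization each variable has unit variance, so neither dominates.
X_std = StandardScaler().fit_transform(X)
ratio_std = PCA().fit(X_std).explained_variance_ratio_

print("raw:         ", ratio_raw.round(4))
print("standardized:", ratio_std.round(4))
```

On the raw data nearly all the variance lands on the first component simply because salary is measured in large units. After standardization the two components share the variance roughly evenly, as expected for independent variables.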
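One way to line up the quantities in the list above is a single table per fitted PCA. The sketch below is a minimal version: `variance_summary` is a hypothetical helper (not from the course material) and `X_demo` is synthetic data standing in for the HAR features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def variance_summary(pca):
    """Tabulate the four variance measures of a fitted PCA (hypothetical helper)."""
    var = pca.explained_variance_
    pct = pca.explained_variance_ratio_ * 100
    return pd.DataFrame(
        {
            "variance": var,
            "cumulative_variance": np.cumsum(var),
            "percent": pct,
            "cumulative_percent": np.cumsum(pct),
        },
        index=pd.RangeIndex(1, len(var) + 1, name="component"),
    )


# Synthetic stand-in for the HAR features: five columns with decreasing scale.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

summary = variance_summary(PCA().fit(X_demo))

# Position (1-based) of the first component whose cumulative percentage reaches 90%.
n_90 = int(np.argmax(summary["cumulative_percent"].to_numpy() >= 90) + 1)

print(summary.round(2))
print("components for 90%:", n_90)
```

Calling the same helper on the PCA fitted with standardization and on the one fitted without it gives two tables that can be compared row by row.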

In [25]:
%%time

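def padroniza(s):
    # Standardize a column to mean 0 and unit standard deviation.
    # Constant columns (std == 0) are returned unchanged to avoid dividing by zero.
    # Note: pandas' std() uses ddof=1, whereas StandardScaler uses ddof=0.
    if s.std() > 0:
        s = (s - s.mean())/s.std()
    return s

# X_train is already a DataFrame, so apply works column by column (axis=0)
X_train_pad = X_train.apply(padroniza, axis=0)
X_train_pad.head()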

CPU times: user 466 ms, sys: 0 ns, total: 466 ms
Wall time: 475 ms


| | 1 tBodyAcc-mean()-X | 2 tBodyAcc-mean()-Y | 3 tBodyAcc-mean()-Z | 4 tBodyAcc-std()-X | 5 tBodyAcc-std()-Y | 6 tBodyAcc-std()-Z | 7 tBodyAcc-mad()-X | 8 tBodyAcc-mad()-Y | 9 tBodyAcc-mad()-Z | 10 tBodyAcc-max()-X | ... | 552 fBodyBodyGyroJerkMag-meanFreq() | 553 fBodyBodyGyroJerkMag-skewness() | 554 fBodyBodyGyroJerkMag-kurtosis() | 555 angle(tBodyAccMean,gravity) | 556 angle(tBodyAccJerkMean),gravityMean) | 557 angle(tBodyGyroMean,gravityMean) | 558 angle(tBodyGyroJerkMean,gravityMean) | 559 angle(X,gravityMean) | 560 angle(Y,gravityMean) | 561 angle(Z,gravityMean) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.200628 | -0.063678 | -0.4196 | -0.868755 | -0.939377 | -0.737479 | -0.859758 | -0.938955 | -0.766385 | -0.855978 | ... | -0.795305 | 0.025958 | -0.27638 | -0.360579 | 0.062935 | -0.778374 | -0.026079 | -0.687172 | 0.407918 | -0.007567 |
| 1 | 0.055944 | 0.031484 | -0.253891 | -0.875366 | -0.923839 | -0.849247 | -0.868472 | -0.921936 | -0.84887 | -0.8713 | ... | 0.130605 | -0.897296 | -0.767938 | 0.133002 | -0.02146 | -1.218722 | 1.484369 | -0.694091 | 0.409089 | 0.007875 |
| 2 | 0.07351 | -0.043414 | -0.076289 | -0.86898 | -0.907698 | -0.893724 | -0.863078 | -0.898793 | -0.89664 | -0.863264 | ... | 1.152257 | -0.26086 | -0.438286 | -0.377815 | 0.391949 | 0.151197 | 1.704085 | -0.702191 | 0.41026 | 0.026501 |
| 3 | 0.066691 | -0.208407 | -0.249695 | -0.870566 | -0.939959 | -0.921743 | -0.864445 | -0.93806 | -0.925216 | -0.863264 | ... | 1.112694 | 0.591005 | 0.463123 | -0.135016 | -0.033635 | 1.037781 | -1.002951 | -0.701636 | 0.414622 | 0.031712 |
| 4 | 0.030467 | 0.027585 | -0.10984 | -0.875128 | -0.934815 | -0.921281 | -0.867325 | -0.931726 | -0.927965 | -0.870201 | ... | -0.149567 | -0.138505 | -0.240296 | 0.340383 | 0.268468 | 1.125841 | -1.276196 | -0.700104 | 0.425434 | 0.045222 |


In [45]:
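# Standardize the data: fit the scaler on the training set only
scaler = StandardScaler()
X_train_pad = scaler.fit_transform(X_train)
# Reuse the training mean and standard deviation on the test set
X_test_pad = scaler.transform(X_test)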
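When standardizing, the mean and standard deviation should come from the training set and be reused unchanged on the test set. The sketch below shows that pattern on synthetic data (the `train` and `test` frames are made up for illustration). It also checks how a manual standardization with pandas relates to `StandardScaler`, since pandas computes the standard deviation with `ddof=1` by default and `StandardScaler` uses `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
cols = ["a", "b", "c"]
train = pd.DataFrame(rng.normal(10.0, 2.0, size=(100, 3)), columns=cols)
test = pd.DataFrame(rng.normal(10.0, 2.0, size=(40, 3)), columns=cols)

scaler = StandardScaler().fit(train)     # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # reuse them; no fit_transform on the test set

# A manual version with the population standard deviation (ddof=0) matches StandardScaler.
manual_ddof0 = (train - train.mean()) / train.std(ddof=0)
same_as_scaler = np.allclose(manual_ddof0.to_numpy(), train_scaled)

# With the pandas default (ddof=1) the result differs by a factor sqrt((n - 1) / n).
n = len(train)
manual_ddof1 = (train - train.mean()) / train.std()
differs_by_factor = np.allclose(
    manual_ddof1.to_numpy(), train_scaled * np.sqrt((n - 1) / n)
)

print(same_as_scaler, differs_by_factor)
```

For a dataset with thousands of rows the `ddof` factor is very close to 1, so the two standardizations give almost the same PCA. The train/test separation matters more: the transformed test columns will not have exactly zero mean, and that is expected.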

In [27]:
# PCA without standardization
pca = PCA()
X_train_pca = pca.fit_transform(X_train)

In [28]:
# PCA with standardization
pca_pad = PCA()
X_train_pca_pad = pca_pad.fit_transform(X_train_pad)

In [29]:
# Variance explained by each component
explained_variance = pca.explained_variance_
explained_variance_pad = pca_pad.explained_variance_

In [30]:
# Fraction of total variance explained by each component (not yet cumulative)
explained_variance_ratio = pca.explained_variance_ratio_
explained_variance_ratio_pad = pca_pad.explained_variance_ratio_

In [31]:
# Percentage of variance explained by each component
explained_variance_percent = explained_variance_ratio * 100
explained_variance_percent_pad = explained_variance_ratio_pad * 100

In [32]:
# Cumulative percentage of variance by component
cumulative_variance_percent = np.cumsum(explained_variance_percent)
cumulative_variance_percent_pad = np.cumsum(explained_variance_percent_pad)

In [33]:
# Number of components needed to explain 90% of the variance
n_components_90 = np.argmax(cumulative_variance_percent >= 90) + 1
n_components_90_pad = np.argmax(cumulative_variance_percent_pad >= 90) + 1
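As a cross-check on the `np.argmax` approach, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps the smallest number of components whose cumulative variance ratio exceeds that fraction. The sketch below compares both routes on synthetic data. `X_demo` is made up, and `svd_solver="full"` is set explicitly as a precaution.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic data: 20 columns with steadily decreasing scale.
X_demo = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.5, 20)

# Manual route: first position where the cumulative ratio reaches 0.90.
cum = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_manual = int(np.argmax(cum >= 0.90) + 1)

# Built-in route: a float n_components asks PCA to pick the count itself.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_demo)

print(n_manual, pca_90.n_components_)
```

One caveat of the manual route: `np.argmax` returns 0 when no entry is `True`, so the `+ 1` result is only meaningful when the threshold is actually reached. That is always the case for a full PCA, whose cumulative ratio ends at 1.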

### Without standardization

In [35]:
# Number of components needed to explain 90% of the variance:
n_components_90

34

In [36]:
# Cumulative percentage of variance explained, by component:
cumulative_variance_percent

array([ 62.55443998,  67.4674627 ,  71.58893016,  73.46388628,
        75.15874627,  76.43081555,  77.60750069,  78.67647386,
        79.64585363,  80.50387181,  81.26617372,  81.93861938,
        82.51803897,  83.07591961,  83.57484534,  84.04978297,
        84.51698308,  84.94860094,  85.37431612,  85.78471116,
        86.17871356,  86.55402287,  86.90645036,  87.24580979,
        87.57794878,  87.89737757,  88.19915672,  88.49093929,
        88.78050925,  89.06243704,  89.33914119,  89.60253624,
        89.85784293,  90.09370881,  90.32436112,  90.54800929,
        90.77095742,  90.9812334 ,  91.18962632,  91.39440007,
        91.58725653,  91.77613615,  91.95731641,  92.13678911,
        92.30911678,  92.46931872,  92.62635821,  92.78298558,
        92.93595543,  93.08630671,  93.23142442,  93.37206458,
        93.50888964,  93.63574755,  93.76075366,  93.88049584,
        93.99861565,  94.11361054,  94.22669303,  94.33636258,
        94.44406665,  94.54896693,  94.65286105,  94.75

### With standardization

In [37]:
# Number of components needed to explain 90% of the variance:
n_components_90_pad

63

In [38]:
# Cumulative percentage of variance explained, by component:
cumulative_variance_percent_pad

array([ 50.78117229,  57.36185256,  60.16828933,  62.67224208,
        64.56052709,  66.28453351,  67.65554498,  68.85462266,
        69.85048217,  70.81556876,  71.67562041,  72.47590136,
        73.23989773,  73.88522665,  74.517551  ,  75.11727309,
        75.70402339,  76.27943078,  76.84735183,  77.37464761,
        77.87501053,  78.36341894,  78.84162472,  79.31018765,
        79.75947691,  80.18050415,  80.59848284,  81.00405321,
        81.39257737,  81.77959542,  82.1455543 ,  82.50010768,
        82.84805028,  83.18523739,  83.51491439,  83.84312944,
        84.16365892,  84.45927386,  84.74599627,  85.03107082,
        85.29983714,  85.565457  ,  85.82886299,  86.08771359,
        86.33676846,  86.58372249,  86.82440241,  87.06051748,
        87.29079634,  87.51836358,  87.73852828,  87.95199527,
        88.15969972,  88.36219634,  88.56197578,  88.75972641,
        88.95400004,  89.1442372 ,  89.33230209,  89.51851806,
        89.6999846 ,  89.87736559,  90.05345066,  90.22

## Tree with PCA

Fit two decision trees using 10 principal components: one on the standardized data and one on the unstandardized data. Use `ccp_alpha=0.001`.

Compare the accuracy on the training and test sets.
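One way to organize this comparison is a scikit-learn pipeline, which fits the scaler and the PCA on the training split and only transforms the test split. The sketch below also varies the number of components. It runs on synthetic data from `make_classification` as a stand-in for HAR, and `random_state=0` is an added assumption for reproducibility, so the numbers it prints say nothing about the HAR results.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HAR data (assumption: 40 features, 3 classes).
X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for n in (2, 5, 10, 20):
    variants = {
        "raw": [PCA(n_components=n)],
        "standardized": [StandardScaler(), PCA(n_components=n)],
    }
    for name, steps in variants.items():
        tree = DecisionTreeClassifier(ccp_alpha=0.001, random_state=0)
        model = make_pipeline(*steps, tree)
        model.fit(X_tr, y_tr)  # scaler and PCA are fitted on the training split only
        results[(name, n)] = (
            accuracy_score(y_tr, model.predict(X_tr)),
            accuracy_score(y_te, model.predict(X_te)),  # test split is only transformed
        )

for (name, n), (acc_tr, acc_te) in results.items():
    print(f"{name:>12} n={n:>2}  train={acc_tr:.3f}  test={acc_te:.3f}")
```

Because every preprocessing step lives inside the pipeline, there is no separate `fit_transform` call that could accidentally be applied to the test data.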

In [46]:
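# Unstandardized data: fit PCA on the training set, then project the test set
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Standardized data: a separate PCA, again fitted on the training set only
pca_pad = PCA(n_components=10)
X_train_pca_pad = pca_pad.fit_transform(X_train_pad)
X_test_pca_pad = pca_pad.transform(X_test_pad)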

### Without standardization

In [40]:
%%time
tree_non_standard = DecisionTreeClassifier(ccp_alpha=0.001)
tree_non_standard.fit(X_train_pca, y_train)
y_train_pred_non_standard = tree_non_standard.predict(X_train_pca)
y_test_pred_non_standard = tree_non_standard.predict(X_test_pca)

CPU times: user 166 ms, sys: 138 µs, total: 166 ms
Wall time: 429 ms


In [41]:
accuracy_train_non_standard = accuracy_score(y_train, y_train_pred_non_standard)
accuracy_test_non_standard = accuracy_score(y_test, y_test_pred_non_standard)

### With standardization

In [47]:
tree_standard = DecisionTreeClassifier(ccp_alpha=0.001)
tree_standard.fit(X_train_pca_pad, y_train)
y_train_pred_standard = tree_standard.predict(X_train_pca_pad)
y_test_pred_standard = tree_standard.predict(X_test_pca_pad)

In [48]:
accuracy_train_standard = accuracy_score(y_train, y_train_pred_standard)
accuracy_test_standard = accuracy_score(y_test, y_test_pred_standard)

In [49]:
print("Training accuracy without standardization:", accuracy_train_non_standard)
print("Test accuracy without standardization:", accuracy_test_non_standard)
print("Training accuracy with standardized data:", accuracy_train_standard)
print("Test accuracy with standardized data:", accuracy_test_standard)

Training accuracy without standardization: 0.8930903155603918
Test accuracy without standardization: 0.8143875127248049
Training accuracy with standardized data: 0.8596300326441785
Test accuracy with standardized data: 0.3128605361384459

The 0.31 test accuracy for the standardized data should not be read as an effect of standardization. These outputs were recorded from a run in which the scaler and the PCA were fit again on the test set (`fit_transform` on `X_test` and `X_test_pad`), so the test components were most likely not expressed on the axes the tree was trained on. The comparison is only meaningful when the test set is projected with the objects fitted on the training set.
