# Performance Tests

## K-Means

We now measure execution speedup by comparing K-Means running on the CPU against K-Means running on the GPU.

This time we use much larger, non-trivial datasets, unlike the [*Iris* Data Set](https://archive.ics.uci.edu/ml/datasets/Iris) used earlier, which served only as a proof of concept and a correctness test.

The goal is to check whether the performance gains of the GPU-parallelized version shrink, plateau, or grow as the number of instances or the dimensionality of the dataset increases.

### Common Code

In [21]:
import kMeans as km
import pandas as pd

import time
import os

import importlib
importlib.reload(km)

# Testing imports
print(km.kMeansCPU)
print(km.kMeansGPU)

<function kMeansCPU at 0x7f8a551f3ba0>
<function kMeansGPU at 0x7f8a52f33e20>


### Dataset 1 (N > 1,000, D = 7): Rice (Cammeo and Osmancik)

This test uses the **[Rice (Cammeo and Osmancik)](https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik)** dataset, which contains morphological features of rice grains of two species, extracted from photographs of the grains. It has **7 variables (D = 7)** and **3,810 instances**.

The dataset also includes class labels giving the actual species of each grain: **Cammeo** or **Osmancik**. There are therefore **2 groups of data (K = 2)**.

The data is in the file `Rice_Cammeo_Osmancik.arff`, inside the dataset archive `rice+cammeo+and+osmancik.zip` (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/545/rice+cammeo+and+osmancik.zip)).

#### Code

In [43]:
# New global variables
K = 2
MAX_ITERATIONS = 5000
PLOT_RESULTS = False
DEBUG = False

COMMENT_CHAR = '%'
ALTERNATIVE_COMMENT_CHARS = ['@']

datasetFilePath = './Rice_Cammeo_Osmancik.csv'

if not os.path.exists(datasetFilePath):
    with \
        open('./rice+cammeo+and+osmancik/Rice_Cammeo_Osmancik.arff', 'r') as file,\
        open(datasetFilePath, 'w') as fileAsCsv:

        for line in file:
            if line[0] in ALTERNATIVE_COMMENT_CHARS:
                fileAsCsv.write(COMMENT_CHAR + ' ' + line[1:])
            else:
                fileAsCsv.write(line)

columnNames = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length', 'Eccentricity', 'Convex_Area', 'Extent', 'Class']

# Read the dataset from the CSV file (lines starting with COMMENT_CHAR are skipped)
dataset = pd.read_csv(datasetFilePath, names=columnNames, sep=',', skip_blank_lines=True, comment=COMMENT_CHAR)

dataset = dataset.drop(columns=['Class'])

print(dataset)

       Area   Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     15231  525.578979         229.749878          85.093788      0.928882  \
1     14656  494.311005         206.020065          91.730972      0.895405   
2     14634  501.122009         214.106781          87.768288      0.912118   
3     13176  458.342987         193.337387          87.448395      0.891861   
4     14688  507.166992         211.743378          89.312454      0.906691   
...     ...         ...                ...                ...           ...   
3805  11441  415.858002         170.486771          85.756592      0.864280   
3806  11625  421.390015         167.714798          89.462570      0.845850   
3807  12437  442.498993         183.572922          86.801979      0.881144   
3808   9882  392.296997         161.193985          78.210480      0.874406   
3809  11434  404.709991         161.079269          90.868195      0.825692   

      Convex_Area    Extent  
0           15617  0.
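
The ARFF-to-CSV step in the cell above can be tried on a small in-memory example. This is only a sketch: the ARFF text and the helper name `arff_to_csv_text` are made up for illustration, and the approach (turning `@` header lines into `%` comments so that `pd.read_csv` skips them) mirrors the notebook's conversion rather than a full ARFF parser.

```python
import io

import pandas as pd

# Hypothetical ARFF content, much smaller than the real Rice file.
ARFF_TEXT = """@RELATION toy
@ATTRIBUTE Area INTEGER
@ATTRIBUTE Extent REAL
@ATTRIBUTE Class {Cammeo,Osmancik}
@DATA
15231,0.57,Cammeo
14656,0.61,Osmancik
"""


def arff_to_csv_text(text, header_char='@', comment_char='%'):
    """Turn ARFF header lines into comment lines so a CSV reader can skip them."""
    converted = []
    for line in text.splitlines():
        if line.startswith(header_char):
            converted.append(comment_char + ' ' + line[1:])
        else:
            converted.append(line)
    return '\n'.join(converted) + '\n'


csv_text = arff_to_csv_text(ARFF_TEXT)

# Fully commented lines are ignored by read_csv, leaving only the data rows.
toy = pd.read_csv(io.StringIO(csv_text), names=['Area', 'Extent', 'Class'],
                  sep=',', skip_blank_lines=True, comment='%')
print(toy)
```

This works only because the header lines begin with `@` and the data rows are plain comma-separated values; ARFF files with quoted strings or sparse rows would need a real ARFF reader.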

In [44]:
# Min-max normalization, rescaling every column to the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (treated and normalized, interval [1, 10]) #####\n{datasetTreated}')

##### Dataset (treated and normalized, interval [1, 10]) #####
          Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     7.083436   8.913085           9.110943           5.791756      8.992095  \
1     6.627970   7.426854           6.832784           7.035968      7.227819   
2     6.610544   7.750595           7.609142           6.293120      8.108617   
3     5.455642   5.717221           5.615196           6.233153      7.041041   
4     6.653318   8.037925           7.382246           6.582591      7.822599   
...        ...        ...                ...                ...           ...   
3805  4.081324   3.697823           3.421444           5.916006      5.587521   
3806  4.227073   3.960771           3.155323           6.610732      4.616211   
3807  4.870269   4.964124           4.677767           6.111975      6.476267   
3808  2.846418   2.577921           2.529299           4.501406      6.121153   
3809  4.075779   3.167936           2.518285  
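
The normalization above maps each column linearly so that its minimum becomes 1 and its maximum becomes 10. A minimal sketch of the same formula on a toy DataFrame follows; the helper name `min_max_scale` and the toy values are assumptions for illustration and are not part of the `kMeans` module.

```python
import pandas as pd


def min_max_scale(df, low=1.0, high=10.0):
    """Linearly rescale each column so its minimum maps to `low` and its maximum to `high`."""
    span = df.max() - df.min()  # per-column range
    return (df - df.min()) / span * (high - low) + low


# Toy data only; the column names are borrowed from the Rice dataset.
toy = pd.DataFrame({'Area': [9882, 11434, 15231], 'Extent': [0.50, 0.60, 0.80]})
scaled = min_max_scale(toy)
print(scaled)
```

A column whose values are all equal has a range of zero, so this formula yields NaN for it; such columns would need to be dropped or handled separately before clustering.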

In [36]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

NUMBER_OF_RUNS = 50

totalExecTime = 0

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    totalExecTime += elapsedTime
    print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f'Final average execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')


Execution time for K-Means CPU run #1: 0.09755229949951172
Average execution time for K-Means CPU until now: 0.09755229949951172
Execution time for K-Means CPU run #2: 0.13881874084472656
Average execution time for K-Means CPU until now: 0.11818552017211914
Execution time for K-Means CPU run #3: 0.1493539810180664
Average execution time for K-Means CPU until now: 0.12857500712076822
Execution time for K-Means CPU run #4: 0.13250446319580078
Average execution time for K-Means CPU until now: 0.12955737113952637
Execution time for K-Means CPU run #5: 0.11022138595581055
Average execution time for K-Means CPU until now: 0.12569017410278321
Execution time for K-Means CPU run #6: 0.1188056468963623
Average execution time for K-Means CPU until now: 0.12454275290171306
Execution time for K-Means CPU run #7: 0.13318800926208496
Average execution time for K-Means CPU until now: 0.12577778952462332
Execution time for K-Means CPU run #8: 0.12111377716064453
Average execution time for K-Means CPU u
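
The timing loop is repeated almost verbatim for the CPU and GPU runs, so it could be factored into a helper. The sketch below is one possible shape, not the notebook's actual code: the name `benchmark` and its `warmup` parameter are assumptions, and it uses `time.perf_counter()`, which is intended for measuring short durations, in place of `time.time()`.

```python
import statistics
import time


def benchmark(fn, runs=50, warmup=0):
    """Call `fn` `runs` times and return summary statistics of the wall-clock timings."""
    for _ in range(warmup):
        fn()  # optional untimed calls (the notebook's loop does not do this)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        'runs': runs,
        'mean': statistics.mean(timings),
        'stdev': statistics.stdev(timings) if runs > 1 else 0.0,
        'min': min(timings),
        'max': max(timings),
    }


# Stand-in workload so the sketch runs without the kMeans module.
stats = benchmark(lambda: sum(range(100_000)), runs=5)
print(f"mean={stats['mean']:.6f}s stdev={stats['stdev']:.6f}s")
```

In this notebook the callable would wrap the real call, for example `lambda: km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)`, and the same helper could then be reused for `km.kMeansGPU`.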

In [45]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

NUMBER_OF_RUNS = 50

totalExecTime = 0

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    totalExecTime += elapsedTime
    print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f'Final average execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')

Execution time for K-Means GPU run #1: 0.03649449348449707
Average execution time for K-Means GPU until now: 0.03649449348449707
Execution time for K-Means GPU run #2: 0.04213213920593262
Average execution time for K-Means GPU until now: 0.039313316345214844
Execution time for K-Means GPU run #3: 0.04639744758605957
Average execution time for K-Means GPU until now: 0.04167469342549642
Execution time for K-Means GPU run #4: 0.025108814239501953
Average execution time for K-Means GPU until now: 0.0375332236289978
Execution time for K-Means GPU run #5: 0.03381943702697754
Average execution time for K-Means GPU until now: 0.03679046630859375
Execution time for K-Means GPU run #6: 0.030785799026489258
Average execution time for K-Means GPU until now: 0.035789688428243004
Execution time for K-Means GPU run #7: 0.0269010066986084
Average execution time for K-Means GPU until now: 0.03451987675258091
Execution time for K-Means GPU run #8: 0.03560638427734375
Average execution time for K-Means G

### Dataset 2 (N > 10,000)

[TBA]

#### Code

In [15]:
# TBA

### Dataset 3 (N > 100,000)

[TBA]

#### Code

In [16]:
# TBA

### Dataset 4 (N > 1,000,000, D = 8): WESAD

This test uses a subset of the **[WESAD (Wearable Stress and Affect Detection)](https://archive.ics.uci.edu/dataset/465/wesad+wearable+stress+and+affect+detection)** dataset, which contains physiological and motion data from several sensors in *wearable* devices worn by 15 different subjects during laboratory tests. One device was worn on the chest and another on the wrist.

The dataset also includes class labels that assign moments of the tests to one of three affective states of the subject: **baseline**, **stress**, or **amusement**. There are therefore **3 groups of data (K = 3)**.

The subset used contains only data from the **chest-worn device** and only from **subject #4**. This subset has **8 variables (D = 8)** and **4,588,552 instances**, each one a reading taken over the course of the laboratory test (sampled at 700 Hz).

This subset is in the file `S4/S4_respiban.txt`, inside the dataset archive `WESAD.zip` (also available as a direct download [at this link](https://uni-siegen.sciebo.de/s/HGdUkoNlW1Ub0Gx/download)).

#### Code

In [5]:
# New global variables
K = 3
MAX_ITERATIONS = 500
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './WESAD/S4/S4_respiban.txt'
columnNames = ['index', 'DI', 'ECG', 'EDA', 'EMG', 'TEMP', 'spatialX', 'spatialY', 'spatialZ', 'RESPIRATION', '_ignore_']

# Read the dataset from the tab-separated file (header lines starting with '#' are skipped)
dataset = pd.read_csv(datasetFilePath, names=columnNames, sep='\t', index_col=0, skip_blank_lines=True, comment='#')

dataset = dataset.drop(columns=['DI', '_ignore_'])

print(dataset)

           ECG   EDA    EMG   TEMP  spatialX  spatialY  spatialZ  RESPIRATION
index                                                                        
0        34487  2844  32819  27563     37495     32437     31921        33292
1        34274  2869  32481  27560     37485     32433     31935        33295
2        33960  2774  32431  27557     37471     32445     31927        33293
3        33737  2767  32561  27555     37485     32433     31925        33308
4        33602  2768  32696  27562     37487     32429     31909        33300
...        ...   ...    ...    ...       ...       ...       ...          ...
4588548  33272  6470  32721  26727     37539     32597     32256        31863
4588549  33389  6467  32360  26726     37543     32583     32253        31865
4588550  33497  6456  32357  26719     37530     32598     32243        31857
4588551  33499  6450  32175  26733     37539     32585     32263        31855
4588552  33425  6445  32340  26753     37525     32595     32237
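
The read above names 11 columns and then drops `DI` and `_ignore_`, leaving the 8 sensor variables. A small in-memory sketch of that pattern follows. It assumes each data row ends with a trailing tab, which would explain the empty `_ignore_` column; that detail is a guess about the file layout and is not confirmed here. The sample rows are shortened to two sensor columns and are purely illustrative.

```python
import io

import pandas as pd

# Hypothetical, shortened sample: a '#' header line, then tab-separated rows
# that end with a trailing tab (assumed layout).
SAMPLE = (
    "# hypothetical header line\n"
    "0\t0\t34487\t2844\t\n"
    "1\t0\t34274\t2869\t\n"
)

column_names = ['index', 'DI', 'ECG', 'EDA', '_ignore_']

toy = pd.read_csv(io.StringIO(SAMPLE), names=column_names, sep='\t',
                  index_col=0, skip_blank_lines=True, comment='#')

# The trailing tab produces an all-NaN last column, dropped together with 'DI'.
toy = toy.drop(columns=['DI', '_ignore_'])
print(toy)
```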

In [18]:
# Min-max normalization, rescaling every column to the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (treated and normalized, interval [1, 10]) #####\n{datasetTreated}')

##### Dataset (treated and normalized, interval [1, 10]) #####
              ECG       EDA       EMG      TEMP  spatialX  spatialY  spatialZ   
index                                                                           
0        5.735295  1.508909  4.175522  8.925593  5.800000  5.203079  4.600799  \
1        5.706038  1.527215  3.927640  8.903516  5.789437  5.196481  4.607789   
2        5.662907  1.457652  3.890971  8.881439  5.774648  5.216276  4.603795   
3        5.632276  1.452526  3.986310  8.866721  5.789437  5.196481  4.602796   
4        5.613733  1.453258  4.085316  8.918234  5.791549  5.189883  4.594808   
...           ...       ...       ...       ...       ...       ...       ...   
4588548  5.568405  4.164022  4.103651  2.773508  5.846479  5.467009  4.768057   
4588549  5.584475  4.161826  3.838902  2.766149  5.850704  5.443915  4.766559   
4588550  5.599310  4.153771  3.836701  2.714636  5.836972  5.468658  4.761567   
4588551  5.599585  4.149378  3.703227  2.81766

In [6]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

# NUMBER_OF_RUNS = 50

# totalExecTime = 0

# for rep in range(1, NUMBER_OF_RUNS + 1):
#     startTime = time.time()
#     km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
#     # print(f'Results:\n \n{resultsCPU}\n ')
#     elapsedTime = time.time() - startTime
#     print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
#     totalExecTime += elapsedTime
#     print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

# print(f'Final average execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')


In [20]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

# NUMBER_OF_RUNS = 50

# totalExecTime = 0

# for rep in range(1, NUMBER_OF_RUNS + 1):
#     startTime = time.time()
#     km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
#     # print(f'Results:\n \n{resultsGPU}\n ')
#     elapsedTime = time.time() - startTime
#     print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
#     totalExecTime += elapsedTime
#     print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

# print(f'Final average execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')

#### Results

> Full results are available in the file `code/examples-and-tests/speedupTestsRawResults.txt`

| |Mean time (50 runs)|Mean speedup|
|-|-|-|
|K-Means CPU|~129.81 s|-|
|K-Means GPU|~27.87 s|~4.65x|
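
Speedup here is the ratio of the mean CPU time to the mean GPU time. A minimal sketch of that arithmetic, using the rounded averages from the table above, follows; the function name `speedup` is an assumption for illustration.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError('accelerated_seconds must be positive')
    return baseline_seconds / accelerated_seconds


# Rounded mean times taken from the results table (in seconds).
cpu_mean = 129.81
gpu_mean = 27.87

print(f'{speedup(cpu_mean, gpu_mean):.2f}x')
```

With these rounded inputs the ratio comes out at roughly 4.66, in line with the reported ~4.65x, which was presumably computed from the unrounded timings.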

### Dataset 5 (N > 10,000,000)

In [21]:
# TBA