# Performance Tests

## K-Means

We now measure the execution speedup by comparing K-Means running on the CPU against K-Means running on the GPU.
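The measurement idea can be sketched as a small, generic timing helper. This is only an illustration under stated assumptions: the name `time_runs` is hypothetical, and it uses `time.perf_counter()` (a monotonic clock) rather than the `time.time()` calls used in this notebook's `runTests` function.

```python
import time
from typing import Callable, Dict


def time_runs(fn: Callable[[], object], runs: int = 10) -> Dict[str, float]:
    """Call fn `runs` times and return min / avg / max wall-clock seconds."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "avg": sum(durations) / len(durations),
        "max": max(durations),
    }


# Toy workload standing in for a K-Means call (hypothetical example)
stats = time_runs(lambda: sum(range(100_000)), runs=5)
print(stats)
```

Reporting the minimum and maximum alongside the average makes it easier to spot outlier runs such as a slow first execution.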

This time we use much larger, non-trivial datasets, unlike the [*Iris* Data Set](https://archive.ics.uci.edu/ml/datasets/Iris), which was used earlier only as a proof of concept and correctness test.

The goal is to check whether the performance gains of the GPU-parallelized version shrink, plateau, or grow as the number of instances or the dimensionality of the dataset increases.
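One simple way to quantify such a gain is the ratio between the average CPU time and the average GPU time. The sketch below assumes that definition; the function name `speedup` is hypothetical and the timings in the usage line are placeholders, not measurements from this notebook.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Speedup of the GPU run relative to the CPU run (values > 1 mean the GPU was faster)."""
    if cpu_seconds <= 0 or gpu_seconds <= 0:
        raise ValueError("times must be positive")
    return cpu_seconds / gpu_seconds


# Placeholder timings, NOT measurements from this notebook
print(speedup(2.0, 0.5))  # 4.0
```

Computing this ratio for each dataset and comparing the values shows whether the gain shrinks, plateaus, or grows with dataset size.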

We also check whether classification accuracy differs between the two versions. This is done with the `getClassificationHits()` function, explained in more depth in the Jupyter notebooks `kMeansCPU.ipynb` and `kMeansGPU.ipynb`.
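For intuition only, here is one common way to count hits for a clustering result: since cluster ids are arbitrary, try every one-to-one mapping from cluster id to class and keep the best match count. This is an assumption-based sketch and not the implementation of `getClassificationHits()`; the name `count_hits` is hypothetical and it assumes cluster ids are integers in `0..K-1`.

```python
from itertools import permutations
from typing import Sequence


def count_hits(cluster_ids: Sequence[int], true_labels: Sequence[str], classes: Sequence[str]) -> int:
    """Best number of matches over every one-to-one mapping of cluster id -> class."""
    best = 0
    for perm in permutations(classes):
        # perm[c] is the class name tentatively assigned to cluster c
        hits = sum(1 for c, y in zip(cluster_ids, true_labels) if perm[c] == y)
        best = max(best, hits)
    return best


clusters = [0, 0, 1, 1, 1]
labels = ["b", "b", "a", "a", "b"]
print(count_hits(clusters, labels, ["a", "b"]))  # 4
```

Trying all permutations costs K! evaluations, which is acceptable for the K = 2 and K = 3 cases used here but would not scale to many clusters.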

### Common Code

In [16]:
import kMeans as km
import pandas as pd

import time
import os

import importlib
importlib.reload(km)

# Testing imports
print(km.kMeansCPU)
print(km.kMeansGPU)
print(km.getClassificationHits)

# If true, the tests also count the hits of each algorithm's results. This can take a VERY long time (>4h per run on dataset 5)!
TEST_CORRECTEDNESS = False

# Effectively infinite float value, used as the initial value of the "fastestExecTime" variable
FLOAT_MAX = float('inf')

# Configuring Numba not to report low-occupancy warnings for the GPU's streaming multiprocessors (SMs)
# Leaving these warnings enabled occasionally adds considerable overhead to some K-Means GPU runs
%set_env NUMBA_CUDA_LOW_OCCUPANCY_WARNINGS 0

# Function that runs the tests
def runTests(mode:str='GPU', runs:int=10, countHits:bool=True):
    '''This function depends on several variables declared earlier in this notebook, so it is useless outside of it!'''

    mode = mode.upper()
    if mode == 'CPU': kMeans = km.kMeansCPU
    elif mode == 'GPU': kMeans = km.kMeansGPU
    else: raise ValueError('Unknown mode!')

    totalExecTime = 0.0
    slowestExecTime = -1.0
    fastestExecTime = FLOAT_MAX

    totalHits = 0
    totalHitsTime = 0.0

    for rep in range(1, runs + 1):
        startTime = time.time()
        results = kMeans(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
        elapsedTime = time.time() - startTime
        if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
        if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
        totalExecTime += elapsedTime
        print(f'Execution K-Means {mode} run #{rep}: {elapsedTime}; curr avg: {totalExecTime / rep}; ', end='')

        if countHits:
            # Checking hits
            # Converting from NumPy arrays to pandas DataFrames, if needed
            if not isinstance(results, pd.DataFrame): results = pd.DataFrame(results)
            startTime = time.time()
            hits, _, _  = km.getClassificationHits(results, dataset, classColumnName, classes, debug=DEBUG)
            elapsedTime = time.time() - startTime
            totalHits += hits
            totalHitsTime += elapsedTime
            print(f'Hits: {hits} (done in {elapsedTime:.4f}); curr avg hits: {totalHits / rep}', end='')

        print('\n', end='')

    print(f' \nAvg exec K-Means {mode}: {totalExecTime / runs}')
    print(f'Max exec K-Means {mode}: {slowestExecTime}')
    print(f'Min exec time K-Means {mode}: {fastestExecTime}')

    if countHits:
        print(f' \nAverage hits: {totalHits / runs}')
        print(f'Avg exec K-Means {mode} classificationHits(): {totalHitsTime / runs}')

<function kMeansCPU at 0x771247f6e840>
<function kMeansGPU at 0x771247f6f880>
<function getClassificationHits at 0x77124c153f60>


### Dataset 1 (N > 1,000, D = 7, K = 2): Rice (Cammeo and Osmancik)

This test uses the **[Rice (Cammeo and Osmancik)](https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik)** dataset, which describes morphological features of rice grains from two species, extracted from photographs of the grains. It has **7 variables (D = 7)** and **3,810 instances**.

The dataset also includes class information giving the true species of each grain: **Cammeo** or **Osmancik**. There are therefore **2 groups of data (K = 2)**.

The data is in the file `Rice_Cammeo_Osmancik.arff` inside the dataset's `rice+cammeo+and+osmancik.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/545/rice+cammeo+and+osmancik.zip)).
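Before clustering, every dataset in this notebook is rescaled with min-max normalization to the interval [1, 10]. A minimal sketch of that step as a reusable helper follows. The function name `min_max_scale` and the guard for constant columns are assumptions added for illustration and are not part of the notebook, whose inline formula would produce NaN for a constant column.

```python
import pandas as pd


def min_max_scale(df: pd.DataFrame, low: float = 1.0, high: float = 10.0) -> pd.DataFrame:
    """Rescale every column linearly so its minimum maps to `low` and its maximum to `high`."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column would divide by zero; this sketch maps it to `low` instead.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range * (high - low) + low


# Toy data (hypothetical), with one constant column
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 2.0, 2.0]})
scaled = min_max_scale(toy)
print(scaled)
```

Putting every variable on the same scale keeps variables with large raw magnitudes (such as `Area`) from dominating the distance computations in K-Means.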

#### Code

In [17]:
# New global variables
K = 2
MAX_ITERATIONS = 60
PLOT_RESULTS = False
DEBUG = False

COMMENT_CHAR = '%'
ALTERNATIVE_COMMENT_CHARS = ['@']

datasetFilePath = './Rice_Cammeo_Osmancik.csv'

# Processing the .arff file and converting it into a valid .csv file (with commented lines)
if not os.path.exists(datasetFilePath):
    with \
        open('./rice+cammeo+and+osmancik/Rice_Cammeo_Osmancik.arff', 'r') as file,\
        open(datasetFilePath, 'w') as fileNew:

        for line in file:
            if line[0] in ALTERNATIVE_COMMENT_CHARS:
                fileNew.write(COMMENT_CHAR + ' ' + line[1:])
            else:
                fileNew.write(line)

columnNames = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length', 'Eccentricity', 'Convex_Area', 'Extent', 'Class']

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep=',', skip_blank_lines=True, comment=COMMENT_CHAR)

datasetTreated = dataset.drop(columns=['Class'])
print(datasetTreated)

classColumnName = 'Class'
classes = dataset[classColumnName].unique()

print(f'Classes (from column "{classColumnName}"): {classes}')

       Area   Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     15231  525.578979         229.749878          85.093788      0.928882  \
1     14656  494.311005         206.020065          91.730972      0.895405   
2     14634  501.122009         214.106781          87.768288      0.912118   
3     13176  458.342987         193.337387          87.448395      0.891861   
4     14688  507.166992         211.743378          89.312454      0.906691   
...     ...         ...                ...                ...           ...   
3805  11441  415.858002         170.486771          85.756592      0.864280   
3806  11625  421.390015         167.714798          89.462570      0.845850   
3807  12437  442.498993         183.572922          86.801979      0.881144   
3808   9882  392.296997         161.193985          78.210480      0.874406   
3809  11434  404.709991         161.079269          90.868195      0.825692   

      Convex_Area    Extent  
0           15617  0.

In [18]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((datasetTreated - datasetTreated.min()) / (datasetTreated.max() - datasetTreated.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
          Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     7.083436   8.913085           9.110943           5.791756      8.992095  \
1     6.627970   7.426854           6.832784           7.035968      7.227819   
2     6.610544   7.750595           7.609142           6.293120      8.108617   
3     5.455642   5.717221           5.615196           6.233153      7.041041   
4     6.653318   8.037925           7.382246           6.582591      7.822599   
...        ...        ...                ...                ...           ...   
3805  4.081324   3.697823           3.421444           5.916006      5.587521   
3806  4.227073   3.960771           3.155323           6.610732      4.616211   
3807  4.870269   4.964124           4.677767           6.111975      6.476267   
3808  2.846418   2.577921           2.529299           4.501406      6.121153   
3809  4.075779   3.167936           2.518285  

In [19]:

# * ####################################
# * Running K-Means CPU
# * ####################################

runTests('CPU', 100, TEST_CORRECTEDNESS)

Execution K-Means CPU run #1: 0.12076354026794434; curr avg: 0.12076354026794434; 
Execution K-Means CPU run #2: 0.05366396903991699; curr avg: 0.08721375465393066; 
Execution K-Means CPU run #3: 0.05398058891296387; curr avg: 0.07613603274027507; 
Execution K-Means CPU run #4: 0.10972189903259277; curr avg: 0.08453249931335449; 
Execution K-Means CPU run #5: 0.06972193717956543; curr avg: 0.08157038688659668; 
Execution K-Means CPU run #6: 0.0933084487915039; curr avg: 0.08352673053741455; 
Execution K-Means CPU run #7: 0.0696258544921875; curr avg: 0.08154089110238212; 
Execution K-Means CPU run #8: 0.08548259735107422; curr avg: 0.08203360438346863; 
Execution K-Means CPU run #9: 0.10138416290283203; curr avg: 0.08418366644117567; 
Execution K-Means CPU run #10: 0.10966610908508301; curr avg: 0.0867319107055664; 
Execution K-Means CPU run #11: 0.09969568252563477; curr avg: 0.0879104354164817; 
Execution K-Means CPU run #12: 0.06181025505065918; curr avg: 0.0857354203859965; 
Execut

In [20]:

# * ####################################
# * Running K-Means GPU
# * ####################################

runTests('GPU', 100, TEST_CORRECTEDNESS)

Execution K-Means GPU run #1: 0.01478266716003418; curr avg: 0.01478266716003418; 
Execution K-Means GPU run #2: 0.01815009117126465; curr avg: 0.016466379165649414; 
Execution K-Means GPU run #3: 0.013356447219848633; curr avg: 0.01542973518371582; 
Execution K-Means GPU run #4: 0.018792152404785156; curr avg: 0.016270339488983154; 
Execution K-Means GPU run #5: 0.036507606506347656; curr avg: 0.020317792892456055; 
Execution K-Means GPU run #6: 0.022976160049438477; curr avg: 0.020760854085286457; 
Execution K-Means GPU run #7: 0.01458883285522461; curr avg: 0.019879136766706194; 
Execution K-Means GPU run #8: 0.02637457847595215; curr avg: 0.02069106698036194; 
Execution K-Means GPU run #9: 0.02113795280456543; curr avg: 0.020740720960828993; 
Execution K-Means GPU run #10: 0.0188901424407959; curr avg: 0.020555663108825683; 
Execution K-Means GPU run #11: 0.021135568618774414; curr avg: 0.020608381791548294; 
Execution K-Means GPU run #12: 0.019621610641479492; curr avg: 0.02052615

### Dataset 2 (N > 10,000, D = 8, K = 2): HTRU2

This test uses the **[HTRU2 (High Time Resolution Universe 2)](https://archive.ics.uci.edu/dataset/372/htru2)** dataset, which describes broadband radio signal emissions recorded by radio telescopes. It is one of the products of the search for pulsars: rapidly rotating neutron stars that emit broadband radio signals detectable from Earth. It has **8 variables (D = 8)** and **17,898 instances**.

The dataset also includes class information marking each reading as **positive** or **negative**, that is, whether the candidate signal actually originates from a pulsar. There are therefore **2 groups of data (K = 2)**.

The data is in the file `HTRU_2.csv` inside the dataset's `HTRU_2.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/372/htru2.zip)).

#### Code

In [21]:
# New global variables
K = 2
MAX_ITERATIONS = 60
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './htru2/HTRU_2.csv'

columnNames = ['mean_IP', 'std_dev_IP', 'exc_kurt_IP', 'skew_IP', 'mean_DM_SNR', 'std_dev_DM_SNR', 'exc_kurt_DM_SNR', 'skew_DM_SNR', 'is_positive']

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep=',')

datasetTreated = dataset.drop(columns=['is_positive'])
print(datasetTreated)

classColumnName = 'is_positive'
classes = dataset[classColumnName].unique()

print(f'Classes (from column "{classColumnName}"): {classes}')

          mean_IP  std_dev_IP  exc_kurt_IP   skew_IP  mean_DM_SNR   
0      140.562500   55.683782    -0.234571 -0.699648     3.199833  \
1      102.507812   58.882430     0.465318 -0.515088     1.677258   
2      103.015625   39.341649     0.323328  1.051164     3.121237   
3      136.750000   57.178449    -0.068415 -0.636238     3.642977   
4       88.726562   40.672225     0.600866  1.123492     1.178930   
...           ...         ...          ...       ...          ...   
17893  136.429688   59.847421    -0.187846 -0.738123     1.296823   
17894  122.554688   49.485605     0.127978  0.323061    16.409699   
17895  119.335938   59.935939     0.159363 -0.743025    21.430602   
17896  114.507812   53.902400     0.201161 -0.024789     1.946488   
17897   57.062500   85.797340     1.406391  0.089520   188.306020   

       std_dev_DM_SNR  exc_kurt_DM_SNR  skew_DM_SNR  
0           19.110426         7.975532    74.242225  
1           14.860146        10.576487   127.393580  
2        

In [22]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((datasetTreated - datasetTreated.min()) / (datasetTreated.max() - datasetTreated.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
        mean_IP  std_dev_IP  exc_kurt_IP   skew_IP  mean_DM_SNR   
0      7.492075    4.759187     2.485386  1.140645     1.120440  \
1      5.658651    5.148176     3.118736  1.164410     1.059040   
2      5.683117    2.771815     2.990246  1.366092     1.117270   
3      7.308394    4.940954     2.635746  1.148810     1.138310   
4      4.994689    2.933627     3.241398  1.375405     1.038944   
...         ...         ...          ...       ...          ...   
17893  7.292961    5.265529     2.527670  1.135690     1.043698   
17894  6.624482    4.005425     2.813468  1.272336     1.653146   
17895  6.469407    5.276293     2.841869  1.135059     1.855621   
17896  6.236795    4.542553     2.879693  1.227544     1.069897   
17897  3.469156    8.421307     3.970340  1.242264     8.585104   

       std_dev_DM_SNR  exc_kurt_DM_SNR  skew_DM_SNR  
0            2.023125         3.654872     1.575009  
1            1.652719   

In [23]:

# * ####################################
# * Running K-Means CPU
# * ####################################

runTests('CPU', 100, TEST_CORRECTEDNESS)

Execution K-Means CPU run #1: 0.3878934383392334; curr avg: 0.3878934383392334; 
Execution K-Means CPU run #2: 0.24135899543762207; curr avg: 0.31462621688842773; 
Execution K-Means CPU run #3: 0.3276667594909668; curr avg: 0.3189730644226074; 
Execution K-Means CPU run #4: 0.32759809494018555; curr avg: 0.32112932205200195; 
Execution K-Means CPU run #5: 0.35552310943603516; curr avg: 0.3280080795288086; 
Execution K-Means CPU run #6: 0.2795841693878174; curr avg: 0.3199374278386434; 
Execution K-Means CPU run #7: 0.3373122215270996; curr avg: 0.32241954122270855; 
Execution K-Means CPU run #8: 0.311126708984375; curr avg: 0.32100793719291687; 
Execution K-Means CPU run #9: 0.35654354095458984; curr avg: 0.32495633761088055; 
Execution K-Means CPU run #10: 0.26954221725463867; curr avg: 0.31941492557525636; 
Execution K-Means CPU run #11: 0.3290719985961914; curr avg: 0.32029284130443225; 
Execution K-Means CPU run #12: 0.38916993141174316; curr avg: 0.3260325988133748; 
Execution K-M

In [24]:

# * ####################################
# * Running K-Means GPU
# * ####################################

runTests('GPU', 100, TEST_CORRECTEDNESS)

Execution K-Means GPU run #1: 0.09691810607910156; curr avg: 0.09691810607910156; 
Execution K-Means GPU run #2: 0.06672310829162598; curr avg: 0.08182060718536377; 
Execution K-Means GPU run #3: 0.0672140121459961; curr avg: 0.07695174217224121; 
Execution K-Means GPU run #4: 0.057332754135131836; curr avg: 0.07204699516296387; 
Execution K-Means GPU run #5: 0.06341791152954102; curr avg: 0.0703211784362793; 
Execution K-Means GPU run #6: 0.07718777656555176; curr avg: 0.07146561145782471; 
Execution K-Means GPU run #7: 0.07773590087890625; curr avg: 0.07236136708940778; 
Execution K-Means GPU run #8: 0.0966799259185791; curr avg: 0.0754011869430542; 
Execution K-Means GPU run #9: 0.07704830169677734; curr avg: 0.07558419969346789; 
Execution K-Means GPU run #10: 0.06165194511413574; curr avg: 0.07419097423553467; 
Execution K-Means GPU run #11: 0.07194137573242188; curr avg: 0.07398646528070624; 
Execution K-Means GPU run #12: 0.07722163200378418; curr avg: 0.0742560625076294; 
Execu

### Dataset 3 (N > 100,000, D = 50, K = 2): MiniBooNE

This test uses the **[MiniBooNE Particle Identification](https://archive.ics.uci.edu/dataset/199/miniboone+particle+identification)** dataset, which describes particles detected in the *MiniBooNE* (*Mini Booster Neutrino Experiment*) experiment, conducted at the American laboratory *Fermilab*. Each particle detection is described by **50 real-valued variables (D = 50)**, and there are **129,596 instances in total**.

The first 36,488 instances are electron neutrino detections (signal) and the remaining 93,108 are muon neutrino detections (background noise). The class information in this dataset is therefore implicit, expressed by the order of the instances in the file. Since there are two classes, there are **2 groups of data (K = 2)**.
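Because the class is encoded only by row position, a label column has to be generated from the row index. A minimal vectorized sketch with NumPy follows; the function name `labels_from_order` is hypothetical, and the notebook itself builds the column with a list comprehension instead.

```python
import numpy as np


def labels_from_order(n_rows: int, n_signal: int) -> np.ndarray:
    """Label the first n_signal rows 'signal' and every remaining row 'noise'."""
    if not 0 <= n_signal <= n_rows:
        raise ValueError("n_signal must be between 0 and n_rows")
    return np.where(np.arange(n_rows) < n_signal, "signal", "noise")


# Tiny example; for this dataset the arguments would be 129596 and 36488
labels = labels_from_order(n_rows=5, n_signal=2)
print(labels.tolist())  # ['signal', 'signal', 'noise', 'noise', 'noise']
```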

The data is in the file `MiniBooNE_PID.txt` inside the dataset's `miniboone+particle+identification.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/199/miniboone+particle+identification.zip)).

This dataset required a **pre-processing** step for **outlier removal**. The original file has 130,064 instances in total (36,499 signal and 93,565 noise). However, 468 of them (11 signal and 457 noise) are extreme outliers with the value -999.0 in all 50 variables, probably the result of some detection error. With these outliers present, K-Means created a cluster containing only them, which artificially and substantially reduced the algorithm's execution time. They therefore had to be removed. Note that the problem could have been solved another way: raising K to 3, so that a new cluster would hold only the outliers. That, however, would be computationally more expensive than removing the instances.
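The notebook removes these rows while converting the text file line by line. As an alternative illustration, the same filter can be sketched in pandas once the data is loaded. The function name `drop_sentinel_rows` is hypothetical and the toy frame below is not taken from the dataset.

```python
import pandas as pd


def drop_sentinel_rows(df: pd.DataFrame, sentinel: float = -999.0) -> pd.DataFrame:
    """Drop rows in which every column equals the sentinel value."""
    all_sentinel = (df == sentinel).all(axis=1)
    return df.loc[~all_sentinel].reset_index(drop=True)


# Toy data (hypothetical): only the middle row has the sentinel in every column
toy = pd.DataFrame({"x": [1.0, -999.0, 3.0], "y": [2.0, -999.0, -999.0]})
cleaned = drop_sentinel_rows(toy)
print(len(cleaned))  # 2
```

Since the classes here are defined by row position, the number of removed signal rows must be counted before filtering so that the signal/noise boundary can be adjusted afterwards (from 36,499 to 36,488 in this dataset).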

#### Code

In [25]:
# New global variables
K = 2
MAX_ITERATIONS = 60
PLOT_RESULTS = False
DEBUG = False

COMMENT_CHAR = '#'

# In the original file (before outlier removal), the first 36,499 instances are signal and the rest are noise
N_OF_SIGNAL_LINES = 36499

datasetFilePath = './MiniBooNE_PID.csv'

# Processing the .txt file and converting it into a valid .csv file (skipping the first line, removing leading whitespace, and changing the separator from "  " or " " to ",")
if not os.path.exists(datasetFilePath):
    with \
        open('./MiniBooNE_PID.txt', 'r') as file,\
        open(datasetFilePath, 'w') as fileNew:

        print('Processing MiniBooNE_PID.txt...\n ')

        # Removing outliers with the value -999.0 in all 50 variables. There are 468 such instances
        outlierString = '''-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03'''

        index = 1
        signalInstRemoved = 0
        noiseInstRemoved = 0
        for line in file:
            if index != 1:
                lineToWrite = line.strip(' ').replace('  ', ' ').replace(' ', ',')
                if outlierString not in lineToWrite:
                    fileNew.write(lineToWrite)
                else:
                    if index - 1 <= N_OF_SIGNAL_LINES:
                        # print(f'Instance (signal) #{index - 1} removed...')
                        signalInstRemoved += 1
                    else:
                        # print(f'Instance (noise) #{index - 1} removed...')
                        noiseInstRemoved += 1
            # else:
            #     fileNew.write(COMMENT_CHAR + ' ' + line.strip(' ').replace('  ', ' '))
            index += 1

        print(f'Signal outlier instances removed = {signalInstRemoved}')
        print(f'Noise outlier instances removed = {noiseInstRemoved}\n ')

        print(f'Processed dataset saved in {datasetFilePath} with success!\n ')
else:
    print(f'Processed dataset found in {datasetFilePath}. No need for processing!\n ')

columnNames = [f'id_var_{i}' for i in range(1, 50 + 1) ]

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep=',', skip_blank_lines=True)

# Generating the class column
classColumn = pd.DataFrame(['signal' if idx <= 36488 else 'noise' for idx in range(1, len(dataset) + 1)])
# print(classColumn)

classColumnName = 'class'
dataset.insert(len(dataset.columns), classColumnName, classColumn)
del classColumn

datasetTreated = dataset.drop(columns=[classColumnName])
print(datasetTreated)

classes = ['signal', 'noise']
print(f'Classes (from column "{classColumnName}"): {classes}')

Processed dataset found in ./MiniBooNE_PID.csv. No need for processing!
 
        id_var_1  id_var_2  id_var_3  id_var_4  id_var_5  id_var_6  id_var_7   
0        2.59413  0.468803   20.6916  0.322648  0.009682  0.374393  0.803479  \
1        3.86388  0.645781   18.1375  0.233529  0.030733  0.361239  1.069740   
2        3.38584  1.197140   36.0807  0.200866  0.017341  0.260841  1.108950   
3        4.28524  0.510155  674.2010  0.281923  0.009174  0.000000  0.998822   
4        5.93662  0.832993   59.8796  0.232853  0.025066  0.233556  1.370040   
...          ...       ...       ...       ...       ...       ...       ...   
129591   4.80718  1.451020  174.6920  0.343481  0.002174  0.000000  0.747401   
129592   5.00527  1.501860  129.9270  0.273477  0.006098  0.109769  1.325370   
129593   3.10842  2.178140   56.3651  0.211850  0.000000  0.167382  1.318900   
129594   5.44560  1.845700  103.4630  0.287411  0.015929  0.107495  0.679931   
129595   4.55062  1.341740   80.0887  0.283594

In [26]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((datasetTreated - datasetTreated.min()) / (datasetTreated.max() - datasetTreated.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
        id_var_1  id_var_2  id_var_3  id_var_4  id_var_5  id_var_6  id_var_7   
0       2.368749  1.421131  1.039201  4.103207  5.452597  5.787233  2.158663  \
1       3.038712  1.603309  1.034359  2.834322  6.017928  5.619037  2.542627   
2       2.786482  2.170867  1.068374  2.369263  5.658285  4.335283  2.599170   
3       3.261035  1.463698  2.278040  3.523361  5.438966  1.000000  2.440359   
4       4.132359  1.796021  1.113489  2.824697  5.865742  3.986399  2.975677   
...          ...       ...       ...       ...       ...       ...       ...   
129591  3.536428  2.432206  1.331135  4.399829  5.250969  1.000000  2.077796   
129592  3.640947  2.484539  1.246275  3.403106  5.356339  2.403578  2.911261   
129593  2.640106  3.180688  1.106826  2.525655  5.192588  3.140255  2.901930   
129594  3.873280  2.838481  1.196108  3.601499  5.620371  2.374501  1.980500   
129595  3.401059  2.319715  1.151798  3.547153  5.192588 

In [27]:

# * ####################################
# * Running K-Means CPU
# * ####################################

runTests('CPU', 100, TEST_CORRECTEDNESS)

Execution K-Means CPU run #1: 13.74942922592163; curr avg: 13.74942922592163; 
Execution K-Means CPU run #2: 9.834038972854614; curr avg: 11.791734099388123; 
Execution K-Means CPU run #3: 10.303525924682617; curr avg: 11.29566470781962; 
Execution K-Means CPU run #4: 7.239543676376343; curr avg: 10.281634449958801; 
Execution K-Means CPU run #5: 10.741769075393677; curr avg: 10.373661375045776; 
Execution K-Means CPU run #6: 9.568178176879883; curr avg: 10.239414175351461; 
Execution K-Means CPU run #7: 13.340907335281372; curr avg: 10.68248462677002; 
Execution K-Means CPU run #8: 9.366705179214478; curr avg: 10.518012195825577; 
Execution K-Means CPU run #9: 9.867347955703735; curr avg: 10.445716169145372; 
Execution K-Means CPU run #10: 7.897804260253906; curr avg: 10.190924978256225; 
Execution K-Means CPU run #11: 9.839978456497192; curr avg: 10.159020749005405; 
Execution K-Means CPU run #12: 8.875469446182251; curr avg: 10.052058140436808; 
Execution K-Means CPU run #13: 9.1164

In [28]:

# * ####################################
# * Running K-Means GPU
# * ####################################

runTests('GPU', 100, TEST_CORRECTEDNESS)

Execution K-Means GPU run #1: 1.9322571754455566; curr avg: 1.9322571754455566; 
Execution K-Means GPU run #2: 1.2695777416229248; curr avg: 1.6009174585342407; 
Execution K-Means GPU run #3: 1.686997652053833; curr avg: 1.6296108563741047; 
Execution K-Means GPU run #4: 1.5400104522705078; curr avg: 1.6072107553482056; 
Execution K-Means GPU run #5: 1.8094213008880615; curr avg: 1.6476528644561768; 
Execution K-Means GPU run #6: 1.9311861991882324; curr avg: 1.6949084202448528; 
Execution K-Means GPU run #7: 1.6802430152893066; curr avg: 1.6928133623940604; 
Execution K-Means GPU run #8: 1.5016849040985107; curr avg: 1.6689223051071167; 
Execution K-Means GPU run #9: 2.3544206619262695; curr avg: 1.7450887891981337; 
Execution K-Means GPU run #10: 1.9889860153198242; curr avg: 1.7694785118103027; 
Execution K-Means GPU run #11: 1.622403860092163; curr avg: 1.7561080889268355; 
Execution K-Means GPU run #12: 2.6195828914642334; curr avg: 1.8280643224716187; 
Execution K-Means GPU run #

### Dataset 4 (N > 1,000,000, D = 8, K = 3): WESAD

This test uses a subset of the **[WESAD (Wearable Stress and Affect Detection)](https://archive.ics.uci.edu/dataset/465/wesad+wearable+stress+and+affect+detection)** dataset, which gathers physiological and motion data from several sensors in *wearable* devices worn by 15 different subjects during laboratory tests. One device was worn on the chest and another on the wrist.

The dataset also includes class information assigning moments of the tests to one of three emotional states of the subject: **baseline**, **stress**, or **amusement**. There are therefore **3 groups of data (K = 3)**.
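In this section's code, the three states are assigned by row-index ranges with a Python loop over every row. The same mapping can be sketched in vectorized form with `np.searchsorted`. The boundaries and the label names `base`, `fun`, and `stress` are copied from that code cell, while the function name `labels_for_indices` is hypothetical.

```python
import numpy as np

# Segment boundaries (row indices) and labels copied from the labeling loop in this section's code cell
BOUNDARIES = np.array([1329300, 1926400, 2563400, 4020100])
SEGMENT_LABELS = np.array(["base", "fun", "base", "stress", "base"])


def labels_for_indices(indices: np.ndarray) -> np.ndarray:
    """Return the label of the segment each row index falls in."""
    # side="right" sends an index equal to a boundary to the next segment,
    # matching the `idx < boundary` comparisons used in the loop.
    segment = np.searchsorted(BOUNDARIES, indices, side="right")
    return SEGMENT_LABELS[segment]


probe = np.array([0, 1329299, 1329300, 1926400, 2563400, 4020100])
print(labels_for_indices(probe).tolist())
# ['base', 'base', 'fun', 'base', 'stress', 'base']
```

Calling `labels_for_indices(np.arange(len(dataset)))` would label the whole dataset in one call, avoiding a Python-level loop over several million rows.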

The subset used here contains only data from the **chest-worn device**, and only from **subject #4**. This subset has **8 variables (D = 8)** and **4,588,553 instances**, each one a reading taken over the course of the laboratory test (sampled at 700 Hz).

This subset is in the file `S4/S4_respiban.txt` inside the dataset's `WESAD.zip` archive (also available as a direct download [at this link](https://uni-siegen.sciebo.de/s/HGdUkoNlW1Ub0Gx/download)).

#### Code

In [29]:
# New global variables
K = 3
MAX_ITERATIONS = 60
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './WESAD/S4/S4_respiban.txt'
columnNames = ['index', 'DI', 'ECG', 'EDA', 'EMG', 'TEMP', 'spatialX', 'spatialY', 'spatialZ', 'RESPIRATION', '_ignore_']

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep='\t', index_col=0, skip_blank_lines=True, comment='#')

# datasetTreated = dataset.drop(columns=['DI', '_ignore_'])
# print(datasetTreated)

# classColumnName = 'DI'
# classes = dataset[classColumnName].unique()

# print(f'Classes (from column "{classColumnName}"): {classes}')

# Generating the class column
classColumn = []
for idx in range(len(dataset)):
    classification = None
    if idx < 1329300: classification = 'base'
    elif idx < 1926400: classification = 'fun'
    elif idx < 2563400: classification = 'base' # Medi 1
    elif idx < 4020100: classification = 'stress'
    else: classification = 'base' # Medi 2

    classColumn.append(classification)
classColumn = pd.DataFrame(classColumn)
print(classColumn)

classColumnName = 'class'
dataset.insert(len(dataset.columns), classColumnName, classColumn)
del classColumn

datasetTreated = dataset.drop(columns=['DI', '_ignore_', classColumnName])
print(datasetTreated)

classes = ['base', 'fun', 'stress']
print(f'Classes (from column "{classColumnName}"): {classes}')

            0
0        base
1        base
2        base
3        base
4        base
...       ...
4588548  base
4588549  base
4588550  base
4588551  base
4588552  base

[4588553 rows x 1 columns]
           ECG   EDA    EMG   TEMP  spatialX  spatialY  spatialZ  RESPIRATION
index                                                                        
0        34487  2844  32819  27563     37495     32437     31921        33292
1        34274  2869  32481  27560     37485     32433     31935        33295
2        33960  2774  32431  27557     37471     32445     31927        33293
3        33737  2767  32561  27555     37485     32433     31925        33308
4        33602  2768  32696  27562     37487     32429     31909        33300
...        ...   ...    ...    ...       ...       ...       ...          ...
4588548  33272  6470  32721  26727     37539     32597     32256        31863
4588549  33389  6467  32360  26726     37543     32583     32253        31865
4588550  33497  6456  32

In [30]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((datasetTreated - datasetTreated.min()) / (datasetTreated.max() - datasetTreated.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
              ECG       EDA       EMG      TEMP  spatialX  spatialY  spatialZ   
index                                                                           
0        5.735295  1.508909  4.175522  8.925593  5.800000  5.203079  4.600799  \
1        5.706038  1.527215  3.927640  8.903516  5.789437  5.196481  4.607789   
2        5.662907  1.457652  3.890971  8.881439  5.774648  5.216276  4.603795   
3        5.632276  1.452526  3.986310  8.866721  5.789437  5.196481  4.602796   
4        5.613733  1.453258  4.085316  8.918234  5.791549  5.189883  4.594808   
...           ...       ...       ...       ...       ...       ...       ...   
4588548  5.568405  4.164022  4.103651  2.773508  5.846479  5.467009  4.768057   
4588549  5.584475  4.161826  3.838902  2.766149  5.850704  5.443915  4.766559   
4588550  5.599310  4.153771  3.836701  2.714636  5.836972  5.468658  4.761567   
4588551  5.599585  4.149378  3.703227  2.81766

In [31]:

# * ####################################
# * Running K-Means CPU
# * ####################################

runTests('CPU', 20, TEST_CORRECTEDNESS)

Execution K-Means CPU run #1: 86.61235094070435; curr avg: 86.61235094070435; 
Execution K-Means CPU run #2: 118.04515361785889; curr avg: 102.32875227928162; 
Execution K-Means CPU run #3: 275.24043250083923; curr avg: 159.9659790198008; 
Execution K-Means CPU run #4: 78.44966340065002; curr avg: 139.58690011501312; 
Execution K-Means CPU run #5: 107.69467115402222; curr avg: 133.20845432281493; 
Execution K-Means CPU run #6: 70.05526375770569; curr avg: 122.6829225619634; 
Execution K-Means CPU run #7: 78.03111362457275; curr avg: 116.30409271376473; 
Execution K-Means CPU run #8: 77.76918721199036; curr avg: 111.48722952604294; 
Execution K-Means CPU run #9: 85.16525530815125; curr avg: 108.56256572405498; 
Execution K-Means CPU run #10: 84.75460505485535; curr avg: 106.181769657135; 
Execution K-Means CPU run #11: 85.3818736076355; curr avg: 104.29087001627141; 
Execution K-Means CPU run #12: 137.41615676879883; curr avg: 107.05131057898204; 
Execution K-Means CPU run #13: 138.9652

In [32]:

# * ####################################
# * Running K-Means GPU
# * ####################################

runTests('GPU', 20, TEST_CORRECTEDNESS)

Execution K-Means GPU run #1: 9.656635761260986; curr avg: 9.656635761260986; 
Execution K-Means GPU run #2: 53.54011058807373; curr avg: 31.59837317466736; 
Execution K-Means GPU run #3: 18.038119316101074; curr avg: 27.078288555145264; 
Execution K-Means GPU run #4: 18.211073637008667; curr avg: 24.861484825611115; 
Execution K-Means GPU run #5: 31.860013723373413; curr avg: 26.261190605163574; 
Execution K-Means GPU run #6: 19.56910514831543; curr avg: 25.145843029022217; 
Execution K-Means GPU run #7: 14.027629375457764; curr avg: 23.557526792798722; 
Execution K-Means GPU run #8: 19.392937660217285; curr avg: 23.036953151226044; 
Execution K-Means GPU run #9: 15.596630096435547; curr avg: 22.210250589582657; 
Execution K-Means GPU run #10: 24.279122829437256; curr avg: 22.417137813568115; 
Execution K-Means GPU run #11: 33.63014578819275; curr avg: 23.436502174897626; 
Execution K-Means GPU run #12: 41.626813888549805; curr avg: 24.95236148436864; 
Execution K-Means GPU run #13: 2

#### Results

> Full results are available in the file `code/examples-and-tests/speedupTestsRawResults.txt`

| |Mean time (50 runs)|Mean speedup|
|-|-|-|
|K-Means CPU|~129.81 s|-|
|K-Means GPU|~27.87 s|~4.65x|
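
The speedup in the table is the mean CPU time divided by the mean GPU time. A minimal sketch of that arithmetic, using the rounded means from the table as inputs; the function name `speedup` is an illustrative assumption, not something defined in the notebook.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError('accelerated_seconds must be positive')
    return baseline_seconds / accelerated_seconds


# Rounded mean times taken from the results table
cpu_mean = 129.81
gpu_mean = 27.87

print(f'Speedup: ~{speedup(cpu_mean, gpu_mean):.2f}x')
```

Because the inputs are already rounded to two decimals, the recomputed value may differ from the table in the last digit.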

### Dataset 5 (N > 10,000,000, D = 3, K = 7) - HHAR

This test uses a subset of the **[Heterogeneity Human Activity Recognition (HHAR)](https://archive.ics.uci.edu/dataset/344/heterogeneity+activity+recognition)** dataset, which collects gyroscope and accelerometer motion data from smartphones and smartwatches worn by 9 different users while they performed various physical activities or stayed at rest.

The dataset also includes class labels, assigning each moment of the experiment to one of six activities: **biking**, **sitting**, **standing**, **walking**, **climbing stairs**, and **descending stairs**. There is also a seventh "activity", **null**, covering the stretches of the experiment in which no activity was performed. The data therefore falls into **7 groups (K = 7)**.

The subset used here contains only the readings from the users' **smartphone gyroscopes**. It has **3 variables (D = 3)** and **13,932,632 instances**, each one a reading taken at some point during the experiment.

This subset is stored in the file `Activity recognition exp/Phones_gyroscope.csv`, inside the dataset archive `heterogeneity+activity+recognition.zip` (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/344/heterogeneity+activity+recognition.zip)).
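
If the archive has not been extracted, the CSV could also be read straight from the zip file. The sketch below is one possible approach under stated assumptions: the function name `read_gyroscope_from_zip` is hypothetical, the column names mirror the ones this notebook passes to `pd.read_csv`, and `usecols` is added here only to skip the columns the notebook drops afterwards.

```python
import zipfile

import pandas as pd

# Column names as used in this notebook's loading cell
COLUMN_NAMES = ['index', 'arrival_time', 'creation_Time', 'x', 'y', 'z',
                'user', 'model', 'device', 'gt']


def read_gyroscope_from_zip(zip_path,
                            member='Activity recognition exp/Phones_gyroscope.csv'):
    """Read the gyroscope CSV directly from the dataset archive.

    Only the index, the three axes, and the class column are loaded.
    (Hypothetical helper, not part of the original notebook.)
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as csv_file:
            return pd.read_csv(csv_file, names=COLUMN_NAMES, header=0, sep=',',
                               usecols=['index', 'x', 'y', 'z', 'gt'],
                               index_col='index')


# Example call, assuming the archive sits next to the notebook:
# dataset = read_gyroscope_from_zip('./heterogeneity+activity+recognition.zip')
```

Loading only the needed columns keeps memory use lower for a file with almost 14 million rows, at the cost of not having the metadata columns available later.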

#### Code

In [33]:
# New global variables
K = 7
MAX_ITERATIONS = 60
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './heterogeneity+activity+recognition/Activity recognition exp/Phones_gyroscope.csv'
columnNames = ['index', 'arrival_time', 'creation_Time', 'x', 'y', 'z', 'user', 'model', 'device', 'gt']

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, header=0, sep=',', index_col=0)

datasetTreated = dataset.drop(columns=['arrival_time', 'creation_Time', 'user', 'model', 'device', 'gt'])
print(datasetTreated)

classColumnName = 'gt'
classes = dataset[classColumnName].unique()

print(f'Classes (from column "{classColumnName}"): {classes}')

              x         y         z
index                              
0      0.013748 -0.000626 -0.023376
1      0.014816 -0.001694 -0.022308
2      0.015884 -0.001694 -0.021240
3      0.016953 -0.003830 -0.020172
4      0.015884 -0.007034 -0.020172
...         ...       ...       ...
11306 -0.046844  0.337667  0.134677
11307 -0.117598  0.221777  0.131749
11308 -0.177617  0.056115  0.095152
11309 -0.195183 -0.124429  0.063191
11310 -0.162002 -0.208846  0.043184

[13932632 rows x 3 columns]
Classes (from column "gt"): ['stand' nan 'sit' 'walk' 'stairsup' 'stairsdown' 'bike']
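
The `nan` entry in the output above corresponds to the "null" activity: `Series.unique()` keeps missing values, so the seven entries listed match K = 7. A minimal sketch with made-up labels (not rows from HHAR) shows how the count changes depending on whether missing values are included.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration only
labels = pd.Series(['stand', np.nan, 'sit', 'stand', np.nan, 'walk'], name='gt')

unique_values = labels.unique()            # keeps NaN as its own entry
named_classes = labels.nunique()           # ignores NaN by default
all_groups = labels.nunique(dropna=False)  # counts NaN as one more group

print(f'unique(): {len(unique_values)} entries')
print(f'nunique(): {named_classes} named classes')
print(f'nunique(dropna=False): {all_groups} groups')
```

With these sample labels the three calls report 4, 3, and 4 respectively, so a value of K derived from `nunique()` alone would leave out the unlabeled group.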


In [34]:
# Normalizing the dataset (min-max normalization) so that all values fall within the interval [1, 10]
datasetTreated = ((datasetTreated - datasetTreated.min()) / (datasetTreated.max() - datasetTreated.min())) * 9 + 1

print(f'##### Dataset (treated and normalized, interval [1, 10]) #####\n{datasetTreated}')

##### Dataset (treated and normalized, interval [1, 10]) #####
              x         y         z
index                              
0      3.820436  4.219502  5.217352
1      3.821163  4.219072  5.218209
2      3.821890  4.219072  5.219066
3      3.822618  4.218211  5.219923
4      3.821890  4.216920  5.219923
...         ...       ...       ...
11306  3.779176  4.355790  5.344196
11307  3.730996  4.309101  5.341846
11308  3.690126  4.242361  5.312475
11309  3.678164  4.169625  5.286825
11310  3.700759  4.135616  5.270769

[13932632 rows x 3 columns]
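
The normalization cell above applies min-max scaling per column and then stretches the result to [1, 10]. A small sketch of the same formula on a toy DataFrame makes the mapping easier to follow; `min_max_scale` and its `low`/`high` parameters are illustrative assumptions, not names used in the notebook.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame, low: float = 1.0, high: float = 10.0) -> pd.DataFrame:
    """Rescale every column of `frame` linearly so it spans [low, high]."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # (value - min) / (max - min) lies in [0, 1]; then stretch and shift
    return (frame - col_min) / col_range * (high - low) + low


# Toy data for illustration only
sample = pd.DataFrame({'x': [0.0, 5.0, 10.0], 'y': [-1.0, 0.0, 3.0]})
scaled = min_max_scale(sample)
print(scaled)
```

Each column is scaled independently, so the smallest value of every column becomes 1 and the largest becomes 10. A column whose values are all equal has a range of zero and would come out as NaN with this formula.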


In [35]:

# * ####################################
# * Running K-Means CPU
# * ####################################

runTests('CPU', 10, TEST_CORRECTEDNESS)

Execution K-Means CPU run #1: 1512.5023040771484; curr avg: 1512.5023040771484; 
Execution K-Means CPU run #2: 1512.2519915103912; curr avg: 1512.3771477937698; 
Execution K-Means CPU run #3: 1513.3815248012543; curr avg: 1512.711940129598; 
Execution K-Means CPU run #4: 1507.2979617118835; curr avg: 1511.3584455251694; 
Execution K-Means CPU run #5: 1502.9835982322693; curr avg: 1509.6834760665893; 
Execution K-Means CPU run #6: 1453.6571323871613; curr avg: 1500.345752120018; 
Execution K-Means CPU run #7: 1507.427669286728; curr avg: 1501.3574545724052; 
Execution K-Means CPU run #8: 1507.733592748642; curr avg: 1502.1544718444347; 
Execution K-Means CPU run #9: 1496.4242026805878; curr avg: 1501.517775270674; 
Execution K-Means CPU run #10: 1509.0807752609253; curr avg: 1502.274075269699; 
 
Avg exec K-Means CPU: 1502.274075269699
Max exec K-Means CPU: 1513.3815248012543
Min exec time K-Means CPU: 1453.6571323871613
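
The log above reports, for each run, the elapsed time and the running average, followed by the mean, maximum, and minimum over all runs. The sketch below shows one way such figures could be collected. It is an assumption about the general approach, not the notebook's `runTests` implementation; `time_runs` and the dummy workload are hypothetical.

```python
import time


def time_runs(func, runs=10, label='run'):
    """Call `func` `runs` times and return (average, slowest, fastest) in seconds."""
    if runs < 1:
        raise ValueError('runs must be at least 1')

    total = 0.0
    slowest = 0.0
    fastest = float('inf')
    for run in range(1, runs + 1):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start

        total += elapsed
        slowest = max(slowest, elapsed)
        fastest = min(fastest, elapsed)
        # Running average over the runs completed so far
        print(f'{label} #{run}: {elapsed}; curr avg: {total / run}')

    return total / runs, slowest, fastest


# Dummy workload standing in for a K-Means call
avg, slowest, fastest = time_runs(lambda: sum(range(100_000)), runs=3, label='dummy workload')
print(f'Avg: {avg}\nMax: {slowest}\nMin: {fastest}')
```

Reporting the minimum and maximum alongside the mean helps spot outlier runs, such as run #6 in the CPU log above.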


In [36]:

# * ####################################
# * Running K-Means GPU
# * ####################################

runTests('GPU', 10, TEST_CORRECTEDNESS)

Execution K-Means GPU run #1: 560.9174516201019; curr avg: 560.9174516201019; 
Execution K-Means GPU run #2: 546.3177721500397; curr avg: 553.6176118850708; 
Execution K-Means GPU run #3: 573.3783559799194; curr avg: 560.2045265833536; 
Execution K-Means GPU run #4: 563.0632152557373; curr avg: 560.9191987514496; 
Execution K-Means GPU run #5: 568.268963098526; curr avg: 562.3891516208648; 
Execution K-Means GPU run #6: 563.0007171630859; curr avg: 562.491079211235; 
Execution K-Means GPU run #7: 571.5064053535461; curr avg: 563.7789829458509; 
Execution K-Means GPU run #8: 564.1647207736969; curr avg: 563.8272001743317; 
Execution K-Means GPU run #9: 574.6080060005188; curr avg: 565.0250674883524; 
Execution K-Means GPU run #10: 533.5188543796539; curr avg: 561.8744461774826; 
 
Avg exec K-Means GPU: 561.8744461774826
Max exec K-Means GPU: 574.6080060005188
Min exec time K-Means GPU: 533.5188543796539
