# Performance Tests

## K-Means

We now measure execution speedup by comparing the performance of K-Means running on the CPU with that of K-Means running on the GPU.

This time we use much larger, and therefore non-trivial, datasets, unlike the [*Iris* Data Set](https://archive.ics.uci.edu/ml/datasets/Iris) used earlier, which served only as a proof of concept and a correctness test.

The goal is to test whether the performance gains of the GPU-parallelized version shrink, plateau, or grow as the number of instances or the dimensionality of the dataset increases.
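
One common way to summarize such a gain is the speedup ratio: CPU time divided by GPU time for the same workload. The sketch below only illustrates the arithmetic; the function name `speedup` and the sample timings are placeholders, not measurements from this notebook.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Return how many times faster the GPU run was than the CPU run."""
    if cpu_seconds < 0 or gpu_seconds <= 0:
        raise ValueError("cpu_seconds must be >= 0 and gpu_seconds must be > 0")
    return cpu_seconds / gpu_seconds


# Placeholder timings in seconds, chosen only to illustrate the calculation.
example_cpu_seconds = 2.0
example_gpu_seconds = 0.5

print(f"Speedup: {speedup(example_cpu_seconds, example_gpu_seconds):.2f}x")
```

A value above 1 means the GPU version was faster; a value below 1 means the parallel version was slower than the CPU one for that dataset.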

### Common Code

In [1]:
import kMeans as km
import pandas as pd

import time
import os

import importlib
importlib.reload(km)

# Testing imports
print(km.kMeansCPU)
print(km.kMeansGPU)

# Very large float value, used as the initial value of the "fastestExecTime" variable
FLOAT_32_BIT_MAX = 3.4028237 * (10**38)

<function kMeansCPU at 0x774338d2a5c0>
<function kMeansGPU at 0x774336aca8e0>


### Dataset 1 (N > 1,000, D = 7, K = 2): Rice (Cammeo and Osmancik)

This test uses the **[Rice (Cammeo and Osmancik)](https://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik)** dataset, which describes morphological features of rice grains from two species, extracted from photographs of the grains. It has **7 variables (D = 7)** and **3,810 instances**.

The dataset also includes class labels giving the true species of each grain: **Cammeo** or **Osmancik**. There are therefore **2 groups of data (K = 2)**.

The data is in the file `Rice_Cammeo_Osmancik.arff`, inside the dataset's `rice+cammeo+and+osmancik.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/545/rice+cammeo+and+osmancik.zip)).
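
The code in this section reads the `.arff` file from a folder named `rice+cammeo+and+osmancik`, so the archive has to be unpacked first. A minimal sketch of that step with the standard `zipfile` module follows; the helper name `extract_dataset` and the assumption that the archive sits next to the notebook are illustrative, not part of the original code.

```python
import os
import zipfile


def extract_dataset(zip_path, target_dir):
    """Extract every member of the archive into target_dir and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Assumed local paths: the downloaded archive is in the working directory and is
# unpacked into a folder with the same base name.
ZIP_PATH = './rice+cammeo+and+osmancik.zip'
TARGET_DIR = './rice+cammeo+and+osmancik'

# Only extract when the archive is actually present.
if os.path.exists(ZIP_PATH):
    print(extract_dataset(ZIP_PATH, TARGET_DIR))
```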

#### Code

In [2]:
# New global variables
K = 2
MAX_ITERATIONS = 5000
PLOT_RESULTS = False
DEBUG = False

COMMENT_CHAR = '%'
ALTERNATIVE_COMMENT_CHARS = ['@']

datasetFilePath = './Rice_Cammeo_Osmancik.csv'

# Processing the .arff file and converting it into a valid .csv file (with commented lines)
if not os.path.exists(datasetFilePath):
    with \
        open('./rice+cammeo+and+osmancik/Rice_Cammeo_Osmancik.arff', 'r') as file,\
        open(datasetFilePath, 'w') as fileNew:

        for line in file:
            if line[0] in ALTERNATIVE_COMMENT_CHARS:
                fileNew.write(COMMENT_CHAR + ' ' + line[1:])
            else:
                fileNew.write(line)

columnNames = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length', 'Eccentricity', 'Convex_Area', 'Extent', 'Class']

# Reading the dataset from the file (read_csv opens the path itself, so no explicit open() is needed)
dataset = pd.read_csv(datasetFilePath, names=columnNames, sep=',', skip_blank_lines=True, comment=COMMENT_CHAR)

dataset = dataset.drop(columns=['Class'])

print(dataset)

       Area   Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     15231  525.578979         229.749878          85.093788      0.928882  \
1     14656  494.311005         206.020065          91.730972      0.895405   
2     14634  501.122009         214.106781          87.768288      0.912118   
3     13176  458.342987         193.337387          87.448395      0.891861   
4     14688  507.166992         211.743378          89.312454      0.906691   
...     ...         ...                ...                ...           ...   
3805  11441  415.858002         170.486771          85.756592      0.864280   
3806  11625  421.390015         167.714798          89.462570      0.845850   
3807  12437  442.498993         183.572922          86.801979      0.881144   
3808   9882  392.296997         161.193985          78.210480      0.874406   
3809  11434  404.709991         161.079269          90.868195      0.825692   

      Convex_Area    Extent  
0           15617  0.

In [3]:
# Normalizing the dataset (min-max normalization) so that all values fall in the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (treated and normalized, interval [1, 10]) #####\n{datasetTreated}')

##### Dataset (treated and normalized, interval [1, 10]) #####
          Area  Perimeter  Major_Axis_Length  Minor_Axis_Length  Eccentricity   
0     7.083436   8.913085           9.110943           5.791756      8.992095  \
1     6.627970   7.426854           6.832784           7.035968      7.227819   
2     6.610544   7.750595           7.609142           6.293120      8.108617   
3     5.455642   5.717221           5.615196           6.233153      7.041041   
4     6.653318   8.037925           7.382246           6.582591      7.822599   
...        ...        ...                ...                ...           ...   
3805  4.081324   3.697823           3.421444           5.916006      5.587521   
3806  4.227073   3.960771           3.155323           6.610732      4.616211   
3807  4.870269   4.964124           4.677767           6.111975      6.476267   
3808  2.846418   2.577921           2.529299           4.501406      6.121153   
3809  4.075779   3.167936           2.518285  
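
The min-max rescaling to [1, 10] used above maps each value x to (x - min) / (max - min) * 9 + 1, column by column. The sketch below shows the same arithmetic on a plain list so it runs without pandas; the function name `min_max_scale` and the handling of a constant column are assumptions added for illustration.

```python
def min_max_scale(values, low=1.0, high=10.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    smallest = min(values)
    largest = max(values)
    span = largest - smallest
    if span == 0:
        # Assumption: a constant column is mapped to the lower bound instead of
        # dividing by zero.
        return [low for _ in values]
    return [(v - smallest) / span * (high - low) + low for v in values]


print(min_max_scale([0.0, 5.0, 10.0]))  # [1.0, 5.5, 10.0]
```

With the pandas expression in the cell above, a column whose maximum equals its minimum would produce NaN values, so it is worth checking for such columns before clustering.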

In [4]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

NUMBER_OF_RUNS = 3000

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    # perf_counter() is a monotonic, high-resolution clock intended for measuring intervals
    startTime = time.perf_counter()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.perf_counter() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means CPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means CPU: {fastestExecTime}')

Execution time for K-Means CPU run #1: 0.13989925384521484
Execution time for K-Means CPU run #2: 0.10335850715637207
Execution time for K-Means CPU run #3: 0.04597806930541992
Execution time for K-Means CPU run #4: 0.045931339263916016
Execution time for K-Means CPU run #5: 0.0948173999786377
Execution time for K-Means CPU run #6: 0.05476975440979004
Execution time for K-Means CPU run #7: 0.07042193412780762
Execution time for K-Means CPU run #8: 0.062140703201293945
Execution time for K-Means CPU run #9: 0.054311275482177734
Execution time for K-Means CPU run #10: 0.06257772445678711
Execution time for K-Means CPU run #11: 0.0867910385131836
Execution time for K-Means CPU run #12: 0.07892370223999023
Execution time for K-Means CPU run #13: 0.08675813674926758
Execution time for K-Means CPU run #14: 0.07889842987060547
Execution time for K-Means CPU run #15: 0.10283422470092773
Execution time for K-Means CPU run #16: 0.08688998222351074
Execution time for K-Means CPU run #17: 0.104105
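
In the log above, the first run is the slowest of the runs shown, which is a common warm-up effect. Because a single slow run can pull the mean upward, it can help to discard warm-up runs and to report the median next to the mean. The helper below is a sketch of that idea, not the notebook's own code: the name `benchmark` and its parameters are hypothetical, and a trivial stand-in workload replaces the K-Means call.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=1):
    """Call fn() `warmup` times untimed, then time `runs` calls; return stats in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        'mean': statistics.mean(samples),
        'median': statistics.median(samples),
        'fastest': min(samples),
        'slowest': max(samples),
    }


# Stand-in workload. In this notebook the callable would wrap the real call, e.g.
# lambda: km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False,
#                      plotResults=PLOT_RESULTS, debug=DEBUG)
stats = benchmark(lambda: sum(range(10_000)), runs=20, warmup=2)
print(stats)
```

The median is less sensitive than the mean to a handful of very slow runs, so reporting both gives a fairer picture when execution times vary widely.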

In [5]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

NUMBER_OF_RUNS = 3000

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    # perf_counter() is a monotonic, high-resolution clock intended for measuring intervals
    startTime = time.perf_counter()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.perf_counter() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means GPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means GPU: {fastestExecTime}')



Execution time for K-Means GPU run #1: 0.26082348823547363
Execution time for K-Means GPU run #2: 0.014752626419067383
Execution time for K-Means GPU run #3: 0.021697998046875
Execution time for K-Means GPU run #4: 0.015761375427246094
Execution time for K-Means GPU run #5: 0.013913869857788086
Execution time for K-Means GPU run #6: 0.021639585494995117
Execution time for K-Means GPU run #7: 0.017804622650146484
Execution time for K-Means GPU run #8: 0.013870954513549805
Execution time for K-Means GPU run #9: 0.021091461181640625
Execution time for K-Means GPU run #10: 0.017765522003173828
Execution time for K-Means GPU run #11: 0.01776123046875
Execution time for K-Means GPU run #12: 0.02205514907836914
Execution time for K-Means GPU run #13: 0.01851034164428711
Execution time for K-Means GPU run #14: 0.03151535987854004
Execution time for K-Means GPU run #15: 0.0221099853515625
Execution time for K-Means GPU run #16: 0.02169656753540039
Execution time for K-Means GPU run #17: 0.02198

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #56: 9.054982900619507
Execution time for K-Means GPU run #57: 0.017803192138671875
Execution time for K-Means GPU run #58: 0.033774375915527344
Execution time for K-Means GPU run #59: 0.0226900577545166
Execution time for K-Means GPU run #60: 0.03383827209472656
Execution time for K-Means GPU run #61: 0.021981239318847656
Execution time for K-Means GPU run #62: 0.02061915397644043
Execution time for K-Means GPU run #63: 0.022480249404907227
Execution time for K-Means GPU run #64: 0.01948690414428711
Execution time for K-Means GPU run #65: 0.012254953384399414
Execution time for K-Means GPU run #66: 0.016160011291503906
Execution time for K-Means GPU run #67: 0.014500141143798828
Execution time for K-Means GPU run #68: 0.018715858459472656
Execution time for K-Means GPU run #69: 0.015417337417602539
Execution time for K-Means GPU run #70: 0.021404743194580078
Execution time for K-Means GPU run #71: 0.026509761810302734
Execution time for K-Means GPU r

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #192: 8.992191314697266
Execution time for K-Means GPU run #193: 0.01963520050048828
Execution time for K-Means GPU run #194: 0.021518707275390625
Execution time for K-Means GPU run #195: 0.017786026000976562
Execution time for K-Means GPU run #196: 0.015742063522338867
Execution time for K-Means GPU run #197: 0.02505803108215332
Execution time for K-Means GPU run #198: 0.026518821716308594
Execution time for K-Means GPU run #199: 0.017784833908081055
Execution time for K-Means GPU run #200: 0.024808645248413086
Execution time for K-Means GPU run #201: 0.011966466903686523
Execution time for K-Means GPU run #202: 0.02327871322631836
Execution time for K-Means GPU run #203: 0.019618749618530273
Execution time for K-Means GPU run #204: 0.016585111618041992
Execution time for K-Means GPU run #205: 0.015924692153930664
Execution time for K-Means GPU run #206: 0.019840002059936523
Execution time for K-Means GPU run #207: 0.017590999603271484
Execution time

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #215: 9.103022575378418
Execution time for K-Means GPU run #216: 0.021744489669799805
Execution time for K-Means GPU run #217: 0.025305986404418945
Execution time for K-Means GPU run #218: 0.016403675079345703
Execution time for K-Means GPU run #219: 0.021720170974731445
Execution time for K-Means GPU run #220: 0.02576899528503418
Execution time for K-Means GPU run #221: 0.022989749908447266
Execution time for K-Means GPU run #222: 0.017738819122314453
Execution time for K-Means GPU run #223: 0.021413326263427734
Execution time for K-Means GPU run #224: 0.015563011169433594
Execution time for K-Means GPU run #225: 0.03016209602355957
Execution time for K-Means GPU run #226: 0.01815652847290039
Execution time for K-Means GPU run #227: 0.0160219669342041
Execution time for K-Means GPU run #228: 0.021724939346313477
Execution time for K-Means GPU run #229: 0.028163909912109375
Execution time for K-Means GPU run #230: 0.026732444763183594
Execution time f

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #250: 8.959384202957153
Execution time for K-Means GPU run #251: 0.018108844757080078
Execution time for K-Means GPU run #252: 0.021003246307373047
Execution time for K-Means GPU run #253: 0.014887571334838867
Execution time for K-Means GPU run #254: 0.013985395431518555
Execution time for K-Means GPU run #255: 0.021033287048339844
Execution time for K-Means GPU run #256: 0.02125382423400879
Execution time for K-Means GPU run #257: 0.024248123168945312
Execution time for K-Means GPU run #258: 0.01761627197265625
Execution time for K-Means GPU run #259: 0.015522003173828125
Execution time for K-Means GPU run #260: 0.021287202835083008
Execution time for K-Means GPU run #261: 0.014397382736206055
Execution time for K-Means GPU run #262: 0.01567816734313965
Execution time for K-Means GPU run #263: 0.012374401092529297
Execution time for K-Means GPU run #264: 0.03136110305786133
Execution time for K-Means GPU run #265: 0.023306608200073242
Execution time 

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #348: 8.896193981170654
Execution time for K-Means GPU run #349: 0.028928041458129883
Execution time for K-Means GPU run #350: 0.015949249267578125
Execution time for K-Means GPU run #351: 0.023295879364013672
Execution time for K-Means GPU run #352: 0.02106499671936035
Execution time for K-Means GPU run #353: 0.021740198135375977
Execution time for K-Means GPU run #354: 0.01620626449584961
Execution time for K-Means GPU run #355: 0.022333860397338867
Execution time for K-Means GPU run #356: 0.016432523727416992
Execution time for K-Means GPU run #357: 0.011127710342407227
Execution time for K-Means GPU run #358: 0.01920795440673828
Execution time for K-Means GPU run #359: 0.024985313415527344
Execution time for K-Means GPU run #360: 0.01792430877685547
Execution time for K-Means GPU run #361: 0.023707866668701172
Execution time for K-Means GPU run #362: 0.01969170570373535
Execution time for K-Means GPU run #363: 0.016121387481689453
Execution time f

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #621: 8.92687201499939
Execution time for K-Means GPU run #622: 0.01786947250366211
Execution time for K-Means GPU run #623: 0.020695924758911133
Execution time for K-Means GPU run #624: 0.024612903594970703
Execution time for K-Means GPU run #625: 0.021641016006469727
Execution time for K-Means GPU run #626: 0.017484426498413086
Execution time for K-Means GPU run #627: 0.021262645721435547
Execution time for K-Means GPU run #628: 0.02116680145263672
Execution time for K-Means GPU run #629: 0.021214962005615234
Execution time for K-Means GPU run #630: 0.015606164932250977
Execution time for K-Means GPU run #631: 0.01567673683166504
Execution time for K-Means GPU run #632: 0.027058839797973633
Execution time for K-Means GPU run #633: 0.021472454071044922
Execution time for K-Means GPU run #634: 0.02301502227783203
Execution time for K-Means GPU run #635: 0.01955699920654297
Execution time for K-Means GPU run #636: 0.027424097061157227
Execution time fo

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #682: 8.927640676498413
Execution time for K-Means GPU run #683: 0.03229165077209473
Execution time for K-Means GPU run #684: 0.019463777542114258
Execution time for K-Means GPU run #685: 0.013849496841430664
Execution time for K-Means GPU run #686: 0.02318406105041504
Execution time for K-Means GPU run #687: 0.02311563491821289
Execution time for K-Means GPU run #688: 0.02841973304748535
Execution time for K-Means GPU run #689: 0.021357059478759766
Execution time for K-Means GPU run #690: 0.01896381378173828


  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #691: 8.915256023406982
Execution time for K-Means GPU run #692: 0.028768301010131836
Execution time for K-Means GPU run #693: 0.019489765167236328
Execution time for K-Means GPU run #694: 0.012318849563598633
Execution time for K-Means GPU run #695: 0.02274036407470703
Execution time for K-Means GPU run #696: 0.02890753746032715
Execution time for K-Means GPU run #697: 0.0136871337890625
Execution time for K-Means GPU run #698: 0.028197526931762695
Execution time for K-Means GPU run #699: 0.01779913902282715
Execution time for K-Means GPU run #700: 0.013709306716918945
Execution time for K-Means GPU run #701: 0.016097068786621094
Execution time for K-Means GPU run #702: 0.021571874618530273
Execution time for K-Means GPU run #703: 0.015994787216186523
Execution time for K-Means GPU run #704: 0.023693323135375977
Execution time for K-Means GPU run #705: 0.02294182777404785
Execution time for K-Means GPU run #706: 0.019631385803222656
Execution time fo

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #738: 8.925364017486572
Execution time for K-Means GPU run #739: 0.010785102844238281
Execution time for K-Means GPU run #740: 0.02284073829650879
Execution time for K-Means GPU run #741: 0.01778864860534668
Execution time for K-Means GPU run #742: 0.015615224838256836
Execution time for K-Means GPU run #743: 0.015434741973876953
Execution time for K-Means GPU run #744: 0.0180206298828125
Execution time for K-Means GPU run #745: 0.023805618286132812
Execution time for K-Means GPU run #746: 0.01774120330810547
Execution time for K-Means GPU run #747: 0.01401829719543457
Execution time for K-Means GPU run #748: 0.017816781997680664
Execution time for K-Means GPU run #749: 0.0284881591796875
Execution time for K-Means GPU run #750: 0.026900768280029297
Execution time for K-Means GPU run #751: 0.02380084991455078
Execution time for K-Means GPU run #752: 0.021301746368408203
Execution time for K-Means GPU run #753: 0.019491910934448242
Execution time for K

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1589: 9.116970539093018
Execution time for K-Means GPU run #1590: 0.018215656280517578
Execution time for K-Means GPU run #1591: 0.024006128311157227
Execution time for K-Means GPU run #1592: 0.023111343383789062
Execution time for K-Means GPU run #1593: 0.0292356014251709
Execution time for K-Means GPU run #1594: 0.015377998352050781
Execution time for K-Means GPU run #1595: 0.019525527954101562
Execution time for K-Means GPU run #1596: 0.021470308303833008
Execution time for K-Means GPU run #1597: 0.014039039611816406
Execution time for K-Means GPU run #1598: 0.021303415298461914
Execution time for K-Means GPU run #1599: 0.02032923698425293
Execution time for K-Means GPU run #1600: 0.018604040145874023
Execution time for K-Means GPU run #1601: 0.01417851448059082
Execution time for K-Means GPU run #1602: 0.01697373390197754
Execution time for K-Means GPU run #1603: 0.01775527000427246
Execution time for K-Means GPU run #1604: 0.012107372283935547
E

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1675: 9.088350772857666
Execution time for K-Means GPU run #1676: 0.02137017250061035
Execution time for K-Means GPU run #1677: 0.021468639373779297
Execution time for K-Means GPU run #1678: 0.01986551284790039
Execution time for K-Means GPU run #1679: 0.023015499114990234
Execution time for K-Means GPU run #1680: 0.017630338668823242
Execution time for K-Means GPU run #1681: 0.014511585235595703
Execution time for K-Means GPU run #1682: 0.016047954559326172
Execution time for K-Means GPU run #1683: 0.012081146240234375
Execution time for K-Means GPU run #1684: 0.01378488540649414
Execution time for K-Means GPU run #1685: 0.015745162963867188
Execution time for K-Means GPU run #1686: 0.02887105941772461
Execution time for K-Means GPU run #1687: 0.024610042572021484
Execution time for K-Means GPU run #1688: 0.022385597229003906
Execution time for K-Means GPU run #1689: 0.019372224807739258
Execution time for K-Means GPU run #1690: 0.022458314895629883

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1885: 8.855802774429321
Execution time for K-Means GPU run #1886: 0.01787590980529785
Execution time for K-Means GPU run #1887: 0.024587154388427734
Execution time for K-Means GPU run #1888: 0.023070573806762695
Execution time for K-Means GPU run #1889: 0.014113664627075195
Execution time for K-Means GPU run #1890: 0.022525310516357422
Execution time for K-Means GPU run #1891: 0.021032094955444336
Execution time for K-Means GPU run #1892: 0.010107994079589844
Execution time for K-Means GPU run #1893: 0.017148733139038086
Execution time for K-Means GPU run #1894: 0.013756752014160156
Execution time for K-Means GPU run #1895: 0.028997421264648438
Execution time for K-Means GPU run #1896: 0.029984712600708008
Execution time for K-Means GPU run #1897: 0.01711583137512207
Execution time for K-Means GPU run #1898: 0.017760038375854492
Execution time for K-Means GPU run #1899: 0.021033763885498047


  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1900: 8.853227138519287
Execution time for K-Means GPU run #1901: 0.01760077476501465
Execution time for K-Means GPU run #1902: 0.02440500259399414
Execution time for K-Means GPU run #1903: 0.0178678035736084
Execution time for K-Means GPU run #1904: 0.017463207244873047
Execution time for K-Means GPU run #1905: 0.01788330078125
Execution time for K-Means GPU run #1906: 0.021416425704956055
Execution time for K-Means GPU run #1907: 0.013631582260131836
Execution time for K-Means GPU run #1908: 0.012437105178833008
Execution time for K-Means GPU run #1909: 0.01375722885131836
Execution time for K-Means GPU run #1910: 0.017281532287597656
Execution time for K-Means GPU run #1911: 0.0160672664642334
Execution time for K-Means GPU run #1912: 0.01437687873840332
Execution time for K-Means GPU run #1913: 0.0225677490234375
Execution time for K-Means GPU run #1914: 0.023832082748413086
Execution time for K-Means GPU run #1915: 0.01732635498046875
Execution 

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1921: 8.865094900131226
Execution time for K-Means GPU run #1922: 0.017844200134277344
Execution time for K-Means GPU run #1923: 0.021810531616210938
Execution time for K-Means GPU run #1924: 0.02139592170715332
Execution time for K-Means GPU run #1925: 0.014396429061889648
Execution time for K-Means GPU run #1926: 0.02337503433227539
Execution time for K-Means GPU run #1927: 0.017330169677734375
Execution time for K-Means GPU run #1928: 0.017895936965942383
Execution time for K-Means GPU run #1929: 0.014329195022583008
Execution time for K-Means GPU run #1930: 0.019522905349731445
Execution time for K-Means GPU run #1931: 0.022295713424682617
Execution time for K-Means GPU run #1932: 0.01664900779724121
Execution time for K-Means GPU run #1933: 0.02186131477355957
Execution time for K-Means GPU run #1934: 0.019011735916137695
Execution time for K-Means GPU run #1935: 0.022357940673828125
Execution time for K-Means GPU run #1936: 0.01586174964904785


  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #1945: 9.034074306488037
Execution time for K-Means GPU run #1946: 0.014996767044067383
Execution time for K-Means GPU run #1947: 0.014508247375488281
Execution time for K-Means GPU run #1948: 0.017710447311401367
Execution time for K-Means GPU run #1949: 0.023558616638183594
Execution time for K-Means GPU run #1950: 0.017530441284179688
Execution time for K-Means GPU run #1951: 0.02548956871032715
Execution time for K-Means GPU run #1952: 0.019761323928833008
Execution time for K-Means GPU run #1953: 0.017490386962890625
Execution time for K-Means GPU run #1954: 0.022025108337402344
Execution time for K-Means GPU run #1955: 0.014054536819458008
Execution time for K-Means GPU run #1956: 0.015920400619506836
Execution time for K-Means GPU run #1957: 0.0237119197845459
Execution time for K-Means GPU run #1958: 0.015181303024291992
Execution time for K-Means GPU run #1959: 0.023398160934448242
Execution time for K-Means GPU run #1960: 0.02424287796020507

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2011: 8.918254375457764
Execution time for K-Means GPU run #2012: 0.0163724422454834
Execution time for K-Means GPU run #2013: 0.019359827041625977
Execution time for K-Means GPU run #2014: 0.02353048324584961
Execution time for K-Means GPU run #2015: 0.02251410484313965
Execution time for K-Means GPU run #2016: 0.018163681030273438
Execution time for K-Means GPU run #2017: 0.02433013916015625
Execution time for K-Means GPU run #2018: 0.021500349044799805
Execution time for K-Means GPU run #2019: 0.014090776443481445
Execution time for K-Means GPU run #2020: 0.018027305603027344
Execution time for K-Means GPU run #2021: 0.015778779983520508
Execution time for K-Means GPU run #2022: 0.02348041534423828
Execution time for K-Means GPU run #2023: 0.014422893524169922
Execution time for K-Means GPU run #2024: 0.0297393798828125
Execution time for K-Means GPU run #2025: 0.022706985473632812
Execution time for K-Means GPU run #2026: 0.013930797576904297
Exe

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2077: 8.889272928237915
Execution time for K-Means GPU run #2078: 0.024549007415771484
Execution time for K-Means GPU run #2079: 0.01592540740966797
Execution time for K-Means GPU run #2080: 0.022820472717285156
Execution time for K-Means GPU run #2081: 0.015974044799804688
Execution time for K-Means GPU run #2082: 0.021170616149902344
Execution time for K-Means GPU run #2083: 0.015365123748779297
Execution time for K-Means GPU run #2084: 0.02118396759033203
Execution time for K-Means GPU run #2085: 0.0174410343170166
Execution time for K-Means GPU run #2086: 0.022916555404663086
Execution time for K-Means GPU run #2087: 0.02016592025756836
Execution time for K-Means GPU run #2088: 0.017963409423828125
Execution time for K-Means GPU run #2089: 0.023284912109375
Execution time for K-Means GPU run #2090: 0.0232241153717041
Execution time for K-Means GPU run #2091: 0.024431467056274414
Execution time for K-Means GPU run #2092: 0.018990516662597656
Execu

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2127: 9.050375699996948


  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2128: 8.905558824539185
Execution time for K-Means GPU run #2129: 0.01955270767211914
Execution time for K-Means GPU run #2130: 0.019566774368286133
Execution time for K-Means GPU run #2131: 0.015508413314819336
Execution time for K-Means GPU run #2132: 0.019234657287597656
Execution time for K-Means GPU run #2133: 0.01573324203491211
Execution time for K-Means GPU run #2134: 0.015671253204345703
Execution time for K-Means GPU run #2135: 0.017585277557373047
Execution time for K-Means GPU run #2136: 0.015842437744140625
Execution time for K-Means GPU run #2137: 0.02106022834777832
Execution time for K-Means GPU run #2138: 0.024417400360107422
Execution time for K-Means GPU run #2139: 0.019481182098388672
Execution time for K-Means GPU run #2140: 0.02277088165283203
Execution time for K-Means GPU run #2141: 0.022427082061767578
Execution time for K-Means GPU run #2142: 0.021553754806518555
Execution time for K-Means GPU run #2143: 0.026998043060302734

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2269: 8.926567792892456
Execution time for K-Means GPU run #2270: 0.019616127014160156
Execution time for K-Means GPU run #2271: 0.01600050926208496
Execution time for K-Means GPU run #2272: 0.019124507904052734
Execution time for K-Means GPU run #2273: 0.014428853988647461
Execution time for K-Means GPU run #2274: 0.019948959350585938
Execution time for K-Means GPU run #2275: 0.018793344497680664
Execution time for K-Means GPU run #2276: 0.015917539596557617
Execution time for K-Means GPU run #2277: 0.028474092483520508
Execution time for K-Means GPU run #2278: 0.01712489128112793
Execution time for K-Means GPU run #2279: 0.013656377792358398
Execution time for K-Means GPU run #2280: 0.015420675277709961
Execution time for K-Means GPU run #2281: 0.014279842376708984
Execution time for K-Means GPU run #2282: 0.016444683074951172
Execution time for K-Means GPU run #2283: 0.02303791046142578
Execution time for K-Means GPU run #2284: 0.01782298088073730

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2388: 8.926052570343018
Execution time for K-Means GPU run #2389: 0.018396377563476562
Execution time for K-Means GPU run #2390: 0.023093700408935547
Execution time for K-Means GPU run #2391: 0.016319990158081055
Execution time for K-Means GPU run #2392: 0.024918556213378906
Execution time for K-Means GPU run #2393: 0.02859210968017578
Execution time for K-Means GPU run #2394: 0.017038822174072266
Execution time for K-Means GPU run #2395: 0.02251291275024414
Execution time for K-Means GPU run #2396: 0.017743587493896484
Execution time for K-Means GPU run #2397: 0.017776012420654297
Execution time for K-Means GPU run #2398: 0.013847589492797852
Execution time for K-Means GPU run #2399: 0.023633241653442383
Execution time for K-Means GPU run #2400: 0.018352985382080078
Execution time for K-Means GPU run #2401: 0.02283763885498047
Execution time for K-Means GPU run #2402: 0.017767906188964844
Execution time for K-Means GPU run #2403: 0.01550006866455078

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2419: 8.829914808273315
Execution time for K-Means GPU run #2420: 0.024169921875
Execution time for K-Means GPU run #2421: 0.017633438110351562
Execution time for K-Means GPU run #2422: 0.023102283477783203
Execution time for K-Means GPU run #2423: 0.024627685546875
Execution time for K-Means GPU run #2424: 0.01686382293701172
Execution time for K-Means GPU run #2425: 0.01415252685546875
Execution time for K-Means GPU run #2426: 0.019492626190185547
Execution time for K-Means GPU run #2427: 0.023471593856811523
Execution time for K-Means GPU run #2428: 0.014223575592041016
Execution time for K-Means GPU run #2429: 0.013910055160522461
Execution time for K-Means GPU run #2430: 0.019853830337524414
Execution time for K-Means GPU run #2431: 0.01504063606262207
Execution time for K-Means GPU run #2432: 0.010529756546020508
Execution time for K-Means GPU run #2433: 0.014625310897827148
Execution time for K-Means GPU run #2434: 0.014414072036743164
Executi

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2439: 8.987685203552246
Execution time for K-Means GPU run #2440: 0.03498077392578125
Execution time for K-Means GPU run #2441: 0.01359105110168457
Execution time for K-Means GPU run #2442: 0.024550199508666992
Execution time for K-Means GPU run #2443: 0.02596569061279297
Execution time for K-Means GPU run #2444: 0.01945018768310547
Execution time for K-Means GPU run #2445: 0.017343759536743164
Execution time for K-Means GPU run #2446: 0.022778749465942383
Execution time for K-Means GPU run #2447: 0.015578269958496094
Execution time for K-Means GPU run #2448: 0.019283533096313477
Execution time for K-Means GPU run #2449: 0.023182153701782227
Execution time for K-Means GPU run #2450: 0.016206979751586914


  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2451: 8.865233182907104
Execution time for K-Means GPU run #2452: 0.016146421432495117
Execution time for K-Means GPU run #2453: 0.017151355743408203
Execution time for K-Means GPU run #2454: 0.012029170989990234
Execution time for K-Means GPU run #2455: 0.015772581100463867
Execution time for K-Means GPU run #2456: 0.017694473266601562
Execution time for K-Means GPU run #2457: 0.025754690170288086
Execution time for K-Means GPU run #2458: 0.02245807647705078
Execution time for K-Means GPU run #2459: 0.015792131423950195
Execution time for K-Means GPU run #2460: 0.014054298400878906
Execution time for K-Means GPU run #2461: 0.0175020694732666
Execution time for K-Means GPU run #2462: 0.012672185897827148
Execution time for K-Means GPU run #2463: 0.02307605743408203
Execution time for K-Means GPU run #2464: 0.021412372589111328
Execution time for K-Means GPU run #2465: 0.0180513858795166
Execution time for K-Means GPU run #2466: 0.015722274780273438
E

  meansByClosestCent[centroidIdx] = relevantLogs.mean(axis=0)
  ret = um.true_divide(


Execution time for K-Means GPU run #2477: 8.946741819381714
Execution time for K-Means GPU run #2478: 0.01776289939880371
Execution time for K-Means GPU run #2479: 0.01576972007751465
Execution time for K-Means GPU run #2480: 0.023155689239501953
Execution time for K-Means GPU run #2481: 0.023186206817626953
Execution time for K-Means GPU run #2482: 0.014143943786621094
Execution time for K-Means GPU run #2483: 0.019600868225097656
Execution time for K-Means GPU run #2484: 0.018075227737426758
Execution time for K-Means GPU run #2485: 0.035700321197509766
Execution time for K-Means GPU run #2486: 0.015851259231567383
Execution time for K-Means GPU run #2487: 0.02356576919555664
Execution time for K-Means GPU run #2488: 0.023665666580200195
Execution time for K-Means GPU run #2489: 0.015114784240722656
Execution time for K-Means GPU run #2490: 0.024234771728515625
Execution time for K-Means GPU run #2491: 0.024004697799682617
Execution time for K-Means GPU run #2492: 0.01836395263671875

### Dataset 2 (N > 10.000, D = 8, K = 2) — HTRU2

The **[HTRU2 (High Time Resolution Universe 2)](https://archive.ics.uci.edu/dataset/372/htru2)** dataset was used here. It gathers data on broadband radio signal emissions obtained from radio telescope readings, and is one of the results of a search for pulsars: rapidly rotating neutron stars that emit broadband radio signals detectable from Earth. It has **8 variables (D = 8)** and **17,898 instances**.

This dataset also contains class information, labelling each reading as **positive** or **negative** according to whether the candidate signal actually originates from a pulsar. There are therefore **2 data groups (K = 2)**.

The data are in the file `HTRU_2.csv` inside the dataset's `HTRU_2.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/372/htru2.zip)).

#### Code

In [6]:
# New global variables
K = 2
MAX_ITERATIONS = 5000
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './htru2/HTRU_2.csv'

columnNames = ['mean_IP', 'std_dev_IP', 'exc_kurt_IP', 'skew_IP', 'mean_DM_SNR', 'std_dev_DM_SNR', 'exc_kurt_DM_SNR', 'skew_DM_SNR', 'is_positive']

# Reading the dataset from the file (passing the open handle instead of reopening the path)
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep=',')

dataset = dataset.drop(columns=['is_positive'])

print(dataset)

          mean_IP  std_dev_IP  exc_kurt_IP   skew_IP  mean_DM_SNR   
0      140.562500   55.683782    -0.234571 -0.699648     3.199833  \
1      102.507812   58.882430     0.465318 -0.515088     1.677258   
2      103.015625   39.341649     0.323328  1.051164     3.121237   
3      136.750000   57.178449    -0.068415 -0.636238     3.642977   
4       88.726562   40.672225     0.600866  1.123492     1.178930   
...           ...         ...          ...       ...          ...   
17893  136.429688   59.847421    -0.187846 -0.738123     1.296823   
17894  122.554688   49.485605     0.127978  0.323061    16.409699   
17895  119.335938   59.935939     0.159363 -0.743025    21.430602   
17896  114.507812   53.902400     0.201161 -0.024789     1.946488   
17897   57.062500   85.797340     1.406391  0.089520   188.306020   

       std_dev_DM_SNR  exc_kurt_DM_SNR  skew_DM_SNR  
0           19.110426         7.975532    74.242225  
1           14.860146        10.576487   127.393580  
2        

In [7]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
        mean_IP  std_dev_IP  exc_kurt_IP   skew_IP  mean_DM_SNR   
0      7.492075    4.759187     2.485386  1.140645     1.120440  \
1      5.658651    5.148176     3.118736  1.164410     1.059040   
2      5.683117    2.771815     2.990246  1.366092     1.117270   
3      7.308394    4.940954     2.635746  1.148810     1.138310   
4      4.994689    2.933627     3.241398  1.375405     1.038944   
...         ...         ...          ...       ...          ...   
17893  7.292961    5.265529     2.527670  1.135690     1.043698   
17894  6.624482    4.005425     2.813468  1.272336     1.653146   
17895  6.469407    5.276293     2.841869  1.135059     1.855621   
17896  6.236795    4.542553     2.879693  1.227544     1.069897   
17897  3.469156    8.421307     3.970340  1.242264     8.585104   

       std_dev_DM_SNR  exc_kurt_DM_SNR  skew_DM_SNR  
0            2.023125         3.654872     1.575009  
1            1.652719   
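
The min-max normalization in the cell above maps each column linearly so that its minimum becomes 1 and its maximum becomes 10. Below is a minimal sketch of the same formula on a toy DataFrame. The helper name `min_max_scale` and the handling of constant columns are assumptions of this sketch, not part of the original notebook.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, low: float = 1.0, high: float = 10.0) -> pd.DataFrame:
    """Rescale every column of df linearly so that its values span [low, high]."""
    col_min = df.min()
    col_range = df.max() - col_min
    # A constant column has range 0, and dividing by it would produce NaN.
    # Replacing the range with 1 maps such a column to `low` (assumption of this sketch).
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range * (high - low) + low

# Toy data: column "a" varies, column "b" is constant.
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 2.0, 2.0]})
scaled = min_max_scale(toy)
print(scaled)
```

With `low=1` and `high=10` this reduces to the expression used in the notebook, `(x - min) / (max - min) * 9 + 1`.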

In [8]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

NUMBER_OF_RUNS = 500

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means CPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means CPU: {fastestExecTime}')

Execution time for K-Means CPU run #1: 0.386516809463501
Execution time for K-Means CPU run #2: 0.26863765716552734
Execution time for K-Means CPU run #3: 0.43999195098876953
Execution time for K-Means CPU run #4: 0.24206805229187012
Execution time for K-Means CPU run #5: 0.3245205879211426
Execution time for K-Means CPU run #6: 0.3240029811859131
Execution time for K-Means CPU run #7: 0.4376654624938965
Execution time for K-Means CPU run #8: 0.4672253131866455
Execution time for K-Means CPU run #9: 0.32550883293151855
Execution time for K-Means CPU run #10: 0.29544854164123535
Execution time for K-Means CPU run #11: 0.4377772808074951
Execution time for K-Means CPU run #12: 0.3247861862182617
Execution time for K-Means CPU run #13: 0.4657411575317383
Execution time for K-Means CPU run #14: 0.3246302604675293
Execution time for K-Means CPU run #15: 0.43775010108947754
Execution time for K-Means CPU run #16: 0.3528883457183838
Execution time for K-Means CPU run #17: 0.3526489734649658
E

In [9]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

NUMBER_OF_RUNS = 500

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means GPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means GPU: {fastestExecTime}')

Execution time for K-Means GPU run #1: 0.11395835876464844
Execution time for K-Means GPU run #2: 0.0629584789276123
Execution time for K-Means GPU run #3: 0.08500385284423828
Execution time for K-Means GPU run #4: 0.08057904243469238
Execution time for K-Means GPU run #5: 0.06778788566589355
Execution time for K-Means GPU run #6: 0.069122314453125
Execution time for K-Means GPU run #7: 0.07471370697021484
Execution time for K-Means GPU run #8: 0.05223512649536133
Execution time for K-Means GPU run #9: 0.07316112518310547
Execution time for K-Means GPU run #10: 0.07320189476013184
Execution time for K-Means GPU run #11: 0.07957577705383301
Execution time for K-Means GPU run #12: 0.07831740379333496
Execution time for K-Means GPU run #13: 0.06876659393310547
Execution time for K-Means GPU run #14: 0.09913802146911621
Execution time for K-Means GPU run #15: 0.07914328575134277
Execution time for K-Means GPU run #16: 0.08031511306762695
Execution time for K-Means GPU run #17: 0.0722360610961914
Execution time for K-Means GPU run #18: 0.07373785972595215
Execution time for K-Means GPU run #19: 0.06707906
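
The CPU and GPU cells repeat the same timing loop for every dataset. A hedged sketch of how that loop could be factored into one helper is shown below. The name `benchmark` is hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals. The workload here is a stand-in, not the notebook's K-Means.

```python
import time
from statistics import mean

def benchmark(fn, runs, *args, **kwargs):
    """Call fn(*args, **kwargs) `runs` times; return (mean, slowest, fastest) in seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return mean(times), max(times), min(times)

# Stand-in workload; in the notebook this would be km.kMeansCPU or km.kMeansGPU.
avg, slowest, fastest = benchmark(sum, 5, range(100_000))
print(f'Average: {avg}, slowest: {slowest}, fastest: {fastest}')
```

Collecting the individual times in a list also makes it easy to compute a median or standard deviation later, which can be useful when a few runs are much slower than the rest.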

### Dataset 3 (N > 100.000, D = 50, K = 2) — MiniBooNE

The **[MiniBooNE Particle Identification](https://archive.ics.uci.edu/dataset/199/miniboone+particle+identification)** dataset was used here. It gathers data on particles detected in the *MiniBooNE* (*Mini Booster Neutrino Experiment*) experiment, conducted at the American laboratory *Fermilab*. Each particle detection is described by **50 real-valued variables (D = 50)**, and there are **129,596 instances in total**.

The first 36,488 instances are electron neutrino detections (signal) and the remaining 93,108 are muon neutrino detections (background noise). The class information in this dataset is therefore implicit, expressed by the order of the instances in the file. Since there are two classes, there are **2 data groups (K = 2)**.

The data are in the file `MiniBooNE_PID.txt` inside the dataset's `miniboone+particle+identification.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/199/miniboone+particle+identification.zip)).

This dataset required **preprocessing** to **remove outliers**. The original file has 130,064 instances (36,499 signal and 93,565 noise). However, 468 of them (11 signal and 457 noise) are extreme outliers with the value -999.0 in all 50 variables, probably the result of some detection error. With these outliers present, the algorithm formed a cluster containing only them, which artificially reduced its execution time by a large margin, so they had to be removed. Another approach would have been to raise K to 3, creating a new cluster to hold only the outliers, but that would be computationally more expensive than removing the instances.
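
The outlier criterion described above (every variable equal to -999.0) can also be checked numerically instead of by matching a literal string. The sketch below illustrates the idea on made-up three-variable rows. The helper `is_outlier` and the sample rows are hypothetical, and the notebook's own implementation compares each line against a fixed outlier string.

```python
OUTLIER_VALUE = -999.0

def is_outlier(line: str) -> bool:
    """Return True when every whitespace-separated field on the line equals -999.0."""
    fields = line.split()
    return bool(fields) and all(float(f) == OUTLIER_VALUE for f in fields)

# Made-up three-variable rows standing in for the 50-variable lines of the real file.
rows = [
    " 0.259413E+01  0.468803E+00  0.206916E+02",
    "-0.999000E+03 -0.999000E+03 -0.999000E+03",
    " 0.386388E+01 -0.999000E+03  0.181375E+02",
]
kept = [r for r in rows if not is_outlier(r)]
print(len(kept))
```

A row is dropped only when all of its fields are -999.0, so the third row, which has a single -999.0 value, is kept.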

#### Code

In [10]:
# New global variables
K = 2
MAX_ITERATIONS = 5000
PLOT_RESULTS = False
DEBUG = False

COMMENT_CHAR = '#'

# In the original file, the first 36,499 instances are signal and the rest are background noise
N_OF_SIGNAL_LINES = 36499

datasetFilePath = './MiniBooNE_PID.csv'

# Processing the .txt file and converting it into a valid .csv file (dropping the first line, removing leading whitespace, and changing the separator from "  " or " " to ",")
if not os.path.exists(datasetFilePath):
    with \
        open('./MiniBooNE_PID.txt', 'r') as file,\
        open(datasetFilePath, 'w') as fileNew:

        print('Processing MiniBooNE_PID.txt...\n ')

        # Removing outliers that have the value -999.0 in all 50 variables. There are 468 such instances
        outlierString = '''-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03,-0.999000E+03'''

        index = 1
        signalInstRemoved = 0
        noiseInstRemoved = 0
        for line in file:
            if index != 1:
                lineToWrite = line.strip(' ').replace('  ', ' ').replace(' ', ',')
                if outlierString not in lineToWrite:
                    fileNew.write(lineToWrite)
                else:
                    if index - 1 <= N_OF_SIGNAL_LINES:
                        # print(f'Instance (signal) #{index - 1} removed...')
                        signalInstRemoved += 1
                    else:
                        # print(f'Instance (noise) #{index - 1} removed...')
                        noiseInstRemoved += 1
            # else:
            #     fileNew.write(COMMENT_CHAR + ' ' + line.strip(' ').replace('  ', ' '))
            index += 1

        print(f'Signal outlier instances removed = {signalInstRemoved}')
        print(f'Noise outlier instances removed = {noiseInstRemoved}\n ')

        print(f'Processed dataset saved in {datasetFilePath} with success!\n ')
else:
    print(f'Processed dataset found in {datasetFilePath}. No need for processing!\n ')

columnNames = [f'id_var_{i}' for i in range(1, 50 + 1) ]

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep=',', skip_blank_lines=True)

print(dataset)

Processed dataset found in ./MiniBooNE_PID.csv. No need for processing!
 
        id_var_1  id_var_2  id_var_3  id_var_4  id_var_5  id_var_6  id_var_7   
0        2.59413  0.468803   20.6916  0.322648  0.009682  0.374393  0.803479  \
1        3.86388  0.645781   18.1375  0.233529  0.030733  0.361239  1.069740   
2        3.38584  1.197140   36.0807  0.200866  0.017341  0.260841  1.108950   
3        4.28524  0.510155  674.2010  0.281923  0.009174  0.000000  0.998822   
4        5.93662  0.832993   59.8796  0.232853  0.025066  0.233556  1.370040   
...          ...       ...       ...       ...       ...       ...       ...   
129591   4.80718  1.451020  174.6920  0.343481  0.002174  0.000000  0.747401   
129592   5.00527  1.501860  129.9270  0.273477  0.006098  0.109769  1.325370   
129593   3.10842  2.178140   56.3651  0.211850  0.000000  0.167382  1.318900   
129594   5.44560  1.845700  103.4630  0.287411  0.015929  0.107495  0.679931   
129595   4.55062  1.341740   80.0887  0.283594

In [11]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
        id_var_1  id_var_2  id_var_3  id_var_4  id_var_5  id_var_6  id_var_7   
0       2.368749  1.421131  1.039201  4.103207  5.452597  5.787233  2.158663  \
1       3.038712  1.603309  1.034359  2.834322  6.017928  5.619037  2.542627   
2       2.786482  2.170867  1.068374  2.369263  5.658285  4.335283  2.599170   
3       3.261035  1.463698  2.278040  3.523361  5.438966  1.000000  2.440359   
4       4.132359  1.796021  1.113489  2.824697  5.865742  3.986399  2.975677   
...          ...       ...       ...       ...       ...       ...       ...   
129591  3.536428  2.432206  1.331135  4.399829  5.250969  1.000000  2.077796   
129592  3.640947  2.484539  1.246275  3.403106  5.356339  2.403578  2.911261   
129593  2.640106  3.180688  1.106826  2.525655  5.192588  3.140255  2.901930   
129594  3.873280  2.838481  1.196108  3.601499  5.620371  2.374501  1.980500   
129595  3.401059  2.319715  1.151798  3.547153  5.192588 

In [12]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

NUMBER_OF_RUNS = 200

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means CPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means CPU: {fastestExecTime}')

Execution time for K-Means CPU run #1: 11.72091007232666
Execution time for K-Means CPU run #2: 7.5509233474731445
Execution time for K-Means CPU run #3: 8.744991779327393
Execution time for K-Means CPU run #4: 7.55429482460022
Execution time for K-Means CPU run #5: 8.524061679840088
Execution time for K-Means CPU run #6: 9.246991157531738
Execution time for K-Means CPU run #7: 4.846573352813721
Execution time for K-Means CPU run #8: 8.03544807434082
Execution time for K-Means CPU run #9: 9.236092805862427
Execution time for K-Means CPU run #10: 6.315953731536865
Execution time for K-Means CPU run #11: 12.202248573303223
Execution time for K-Means CPU run #12: 12.681684732437134
Execution time for K-Means CPU run #13: 7.06890082359314
Execution time for K-Means CPU run #14: 9.77237057685852
Execution time for K-Means CPU run #15: 8.992178440093994
Execution time for K-Means CPU run #16: 7.059105634689331
Execution time for K-Means CPU run #17: 10.480403900146484
Execution time for K-Me

In [13]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

NUMBER_OF_RUNS = 200

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means GPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means GPU: {fastestExecTime}')

Execution time for K-Means GPU run #1: 2.381702423095703
Execution time for K-Means GPU run #2: 2.1276016235351562
Execution time for K-Means GPU run #3: 2.3525049686431885
Execution time for K-Means GPU run #4: 2.1725308895111084
Execution time for K-Means GPU run #5: 2.391785144805908
Execution time for K-Means GPU run #6: 2.405285596847534
Execution time for K-Means GPU run #7: 1.6755168437957764
Execution time for K-Means GPU run #8: 1.7822277545928955
Execution time for K-Means GPU run #9: 1.2774362564086914
Execution time for K-Means GPU run #10: 2.5487823486328125
Execution time for K-Means GPU run #11: 2.189342498779297
Execution time for K-Means GPU run #12: 2.275547742843628
Execution time for K-Means GPU run #13: 2.3371646404266357
Execution time for K-Means GPU run #14: 2.0226707458496094
Execution time for K-Means GPU run #15: 3.0340769290924072
Execution time for K-Means GPU run #16: 2.183748483657837
Execution time for K-Means GPU run #17: 1.5326790809631348
Execution ti

### Dataset 4 (N > 1.000.000, D = 8, K = 3) — WESAD

A subset of the **[WESAD (Wearable Stress and Affect Detection)](https://archive.ics.uci.edu/dataset/465/wesad+wearable+stress+and+affect+detection)** dataset was used here. WESAD gathers physiological and motion data from several sensors in *wearable* devices worn by 15 different subjects in laboratory tests. One device was worn on the chest and another on the wrist.

This dataset also contains class information, assigning moments of the tests to one of three emotional states of the subject: **baseline**, **stress** or **amusement**. There are therefore **3 data groups (K = 3)**.

The subset used consists of data from the **chest-worn device** only, and from **subject #4** only. This subset has **8 variables (D = 8)** and **4,588,552 instances**, each one a reading taken over the course of the laboratory test (readings sampled at 700 Hz).

This subset is in the file `S4/S4_respiban.txt` inside the dataset's `WESAD.zip` archive (also available as a direct download [at this link](https://uni-siegen.sciebo.de/s/HGdUkoNlW1Ub0Gx/download)).
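
To give a sense of scale, the instance count and the 700 Hz sampling rate quoted above imply a recording of a little under two hours. The short calculation below shows the arithmetic. The variable names are chosen for this sketch only.

```python
N_INSTANCES = 4_588_552   # readings in the subset, as stated above
SAMPLE_RATE_HZ = 700      # sampling frequency of the chest-worn device

duration_s = N_INSTANCES / SAMPLE_RATE_HZ
duration_min = duration_s / 60
print(f'Recording length: {duration_s:.0f} s (about {duration_min:.0f} min)')
```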

#### Code

In [14]:
# New global variables
K = 3
MAX_ITERATIONS = 5000
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './WESAD/S4/S4_respiban.txt'
columnNames = ['index', 'DI', 'ECG', 'EDA', 'EMG', 'TEMP', 'spatialX', 'spatialY', 'spatialZ', 'RESPIRATION', '_ignore_']

# Reading the dataset from the file (passing the open handle instead of reopening the path)
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, sep='\t', index_col=0, skip_blank_lines=True, comment='#')

dataset = dataset.drop(columns=['DI', '_ignore_'])

print(dataset)

           ECG   EDA    EMG   TEMP  spatialX  spatialY  spatialZ  RESPIRATION
index                                                                        
0        34487  2844  32819  27563     37495     32437     31921        33292
1        34274  2869  32481  27560     37485     32433     31935        33295
2        33960  2774  32431  27557     37471     32445     31927        33293
3        33737  2767  32561  27555     37485     32433     31925        33308
4        33602  2768  32696  27562     37487     32429     31909        33300
...        ...   ...    ...    ...       ...       ...       ...          ...
4588548  33272  6470  32721  26727     37539     32597     32256        31863
4588549  33389  6467  32360  26726     37543     32583     32253        31865
4588550  33497  6456  32357  26719     37530     32598     32243        31857
4588551  33499  6450  32175  26733     37539     32585     32263        31855
4588552  33425  6445  32340  26753     37525     32595     32237

In [15]:
# Normalizing the dataset (min-max normalization) so that all values lie in the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (tratado e normalizado, intervalo [1, 10]) #####\n{datasetTreated}')

##### Dataset (tratado e normalizado, intervalo [1, 10]) #####
              ECG       EDA       EMG      TEMP  spatialX  spatialY  spatialZ   
index                                                                           
0        5.735295  1.508909  4.175522  8.925593  5.800000  5.203079  4.600799  \
1        5.706038  1.527215  3.927640  8.903516  5.789437  5.196481  4.607789   
2        5.662907  1.457652  3.890971  8.881439  5.774648  5.216276  4.603795   
3        5.632276  1.452526  3.986310  8.866721  5.789437  5.196481  4.602796   
4        5.613733  1.453258  4.085316  8.918234  5.791549  5.189883  4.594808   
...           ...       ...       ...       ...       ...       ...       ...   
4588548  5.568405  4.164022  4.103651  2.773508  5.846479  5.467009  4.768057   
4588549  5.584475  4.161826  3.838902  2.766149  5.850704  5.443915  4.766559   
4588550  5.599310  4.153771  3.836701  2.714636  5.836972  5.468658  4.761567   
4588551  5.599585  4.149378  3.703227  2.81766

In [16]:

# * ####################################
# * Running K-Means on the CPU
# * ####################################

NUMBER_OF_RUNS = 100

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means CPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means CPU: {fastestExecTime}')

Execution time for K-Means CPU run #1: 131.12973356246948
Execution time for K-Means CPU run #2: 129.3718237876892
Execution time for K-Means CPU run #3: 92.20838856697083
Execution time for K-Means CPU run #4: 173.8915410041809
Execution time for K-Means CPU run #5: 107.43795824050903
Execution time for K-Means CPU run #6: 77.74150919914246
Execution time for K-Means CPU run #7: 99.81570076942444
Execution time for K-Means CPU run #8: 129.46960282325745
Execution time for K-Means CPU run #9: 77.9063093662262
Execution time for K-Means CPU run #10: 130.74145364761353
Execution time for K-Means CPU run #11: 139.4243927001953
Execution time for K-Means CPU run #12: 147.21596956253052
Execution time for K-Means CPU run #13: 101.85348558425903
Execution time for K-Means CPU run #14: 244.90633630752563
Execution time for K-Means CPU run #15: 86.35177779197693
Execution time for K-Means CPU run #16: 109.0810821056366
Execution time for K-Means CPU run #17: 207.1127369403839
Execution time fo

In [17]:

# * ####################################
# * Running K-Means on the GPU
# * ####################################

NUMBER_OF_RUNS = 100

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means GPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means GPU: {fastestExecTime}')

Execution time for K-Means GPU run #1: 40.835532665252686
Execution time for K-Means GPU run #2: 20.919931173324585
Execution time for K-Means GPU run #3: 25.200390338897705
Execution time for K-Means GPU run #4: 22.40498375892639
Execution time for K-Means GPU run #5: 8.191907405853271
Execution time for K-Means GPU run #6: 47.870601654052734
Execution time for K-Means GPU run #7: 39.49761199951172
Execution time for K-Means GPU run #8: 18.16865634918213
Execution time for K-Means GPU run #9: 33.86043381690979
Execution time for K-Means GPU run #10: 20.986873626708984
Execution time for K-Means GPU run #11: 27.966493368148804
Execution time for K-Means GPU run #12: 28.05574631690979
Execution time for K-Means GPU run #13: 39.56160521507263
Execution time for K-Means GPU run #14: 36.62439489364624
Execution time for K-Means GPU run #15: 28.135660886764526
Execution time for K-Means GPU run #16: 19.71050214767456
Execution time for K-Means GPU run #17: 35.58286762237549
Execution time f

#### Results

> Full results available in the file `code/examples-and-tests/speedupTestsRawResults.txt`

| |Average time (50 runs)|Average speedup|
|-|-|-|
|K-Means CPU|~129.81 s|-|
|K-Means GPU|~27.87 s|~4.65x|
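
The speedup in the table is the ratio between the average CPU time and the average GPU time. A minimal check using the rounded averages from the table is shown below. Because those averages are rounded, the ratio may differ from the reported value in the second decimal place.

```python
cpu_avg = 129.81  # seconds, rounded average CPU time from the table
gpu_avg = 27.87   # seconds, rounded average GPU time from the table

speedup = cpu_avg / gpu_avg
print(f'Speedup: ~{speedup:.2f}x')
```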

### Dataset 5 (N > 10.000.000, D = 3, K = 6) — HHAR

A subset of the **[Heterogeneity Human Activity Recognition (HHAR)](https://archive.ics.uci.edu/dataset/344/heterogeneity+activity+recognition)** dataset was used here. HHAR gathers motion data from the gyroscope and accelerometer of smartphones and smartwatches carried by 9 different users while performing various physical activities or resting.

This dataset also contains class information, assigning moments of the tests to one of six activities: **biking**, **resting (sitting)**, **resting (standing)**, **walking**, **going up stairs** and **going down stairs**. There are therefore **6 data groups (K = 6)**.

The subset used consists of data from the user's **smartphone gyroscope** only. This subset has **3 variables (D = 3)** and **13,932,632 instances**, each one a reading taken over the course of the experiment.

This subset is in the file `Activity recognition exp/Phones_gyroscope.csv` inside the dataset's `heterogeneity+activity+recognition.zip` archive (also available as a direct download [at this link](https://archive.ics.uci.edu/static/public/344/heterogeneity+activity+recognition.zip)).
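
Since only the three gyroscope axes are used, one possible way to keep memory usage down on a file of this size is to ask pandas to load just those columns with the `usecols` parameter of `read_csv`. The sketch below runs on a tiny in-memory stand-in. The header names and the two rows are made up for illustration and are assumed to mirror the column names used in the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for Phones_gyroscope.csv; header and rows are illustrative only.
sample_csv = io.StringIO(
    "index,arrival_time,creation_Time,x,y,z,user,model,device,gt\n"
    "0,1,10,0.013748,-0.000626,-0.023376,a,nexus4,nexus4_1,stand\n"
    "1,2,20,0.014816,-0.001694,-0.022308,a,nexus4,nexus4_1,stand\n"
)

# usecols keeps only the gyroscope axes, so the unused columns are never stored.
gyro = pd.read_csv(sample_csv, usecols=["x", "y", "z"])
print(gyro.shape)
```

This is an alternative to loading every column and dropping the unused ones afterwards, as the notebook does.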

#### Code

In [18]:
# New global variables
K = 6
MAX_ITERATIONS = 4294967296
PLOT_RESULTS = False
DEBUG = False

datasetFilePath = './heterogeneity+activity+recognition/Activity recognition exp/Phones_gyroscope.csv'
columnNames = ['index', 'arrival_time', 'creation_Time', 'x', 'y', 'z', 'user', 'model', 'device', 'gt']

# Reading the dataset from the file
with open(datasetFilePath, 'r') as datasetFile:
    dataset = pd.read_csv(datasetFile, names=columnNames, header=0, sep=',', index_col=0)

dataset = dataset.drop(columns=['arrival_time', 'creation_Time', 'user', 'model', 'device', 'gt'])

print(dataset)

              x         y         z
index                              
0      0.013748 -0.000626 -0.023376
1      0.014816 -0.001694 -0.022308
2      0.015884 -0.001694 -0.021240
3      0.016953 -0.003830 -0.020172
4      0.015884 -0.007034 -0.020172
...         ...       ...       ...
11306 -0.046844  0.337667  0.134677
11307 -0.117598  0.221777  0.131749
11308 -0.177617  0.056115  0.095152
11309 -0.195183 -0.124429  0.063191
11310 -0.162002 -0.208846  0.043184

[13932632 rows x 3 columns]
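
The cell above reads all ten columns and then drops seven of them. As a hedged alternative, the sketch below asks `pd.read_csv` to keep only the index and the three axis columns through its `usecols` parameter, which can reduce peak memory on a file of this size. The helper name `loadGyroscopeAxes` and the two sample rows are hypothetical; only the column layout and the first two `x`, `y`, `z` readings come from the notebook.

```python
import io
import pandas as pd

# Same column names the notebook assigns to Phones_gyroscope.csv
columnNames = ['index', 'arrival_time', 'creation_Time', 'x', 'y', 'z', 'user', 'model', 'device', 'gt']

def loadGyroscopeAxes(source):
    """Read only the index and the x, y, z columns (hypothetical helper)."""
    return pd.read_csv(
        source,
        names=columnNames,   # replace the header names found in the file
        header=0,            # the first line of the file is a header row
        usecols=['index', 'x', 'y', 'z'],
        index_col='index',
    )

# Tiny in-memory stand-in for the real file; the non-axis fields are made up for illustration
SAMPLE_CSV = (
    "Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt\n"
    "0,1,1,0.013748,-0.000626,-0.023376,a,phone,phone_1,stand\n"
    "1,2,2,0.014816,-0.001694,-0.022308,a,phone,phone_1,stand\n"
)

sample = loadGyroscopeAxes(io.StringIO(SAMPLE_CSV))
print(sample)
```

With the real file, `loadGyroscopeAxes(datasetFilePath)` should yield the same three-column frame as the `drop` call above, assuming the file has the ten-column layout listed in `columnNames`.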


In [19]:
# Normalizing the dataset (min-max normalization) so that all values fall within the interval [1, 10]
datasetTreated = ((dataset - dataset.min()) / (dataset.max() - dataset.min())) * 9 + 1

print(f'##### Dataset (treated and normalized, interval [1, 10]) #####\n{datasetTreated}')

##### Dataset (treated and normalized, interval [1, 10]) #####
              x         y         z
index                              
0      3.820436  4.219502  5.217352
1      3.821163  4.219072  5.218209
2      3.821890  4.219072  5.219066
3      3.822618  4.218211  5.219923
4      3.821890  4.216920  5.219923
...         ...       ...       ...
11306  3.779176  4.355790  5.344196
11307  3.730996  4.309101  5.341846
11308  3.690126  4.242361  5.312475
11309  3.678164  4.169625  5.286825
11310  3.700759  4.135616  5.270769

[13932632 rows x 3 columns]
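
To make the normalization step easier to check, here is a minimal sketch of the same min-max formula applied to a toy frame, where the result can be verified by hand. The function name `minMaxScale` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def minMaxScale(frame, low=1.0, high=10.0):
    """Linearly rescale each column so its minimum maps to `low` and its maximum to `high`."""
    colMin = frame.min()
    colRange = frame.max() - colMin
    return (frame - colMin) / colRange * (high - low) + low

# Toy data: each column has a known minimum, midpoint and maximum
toy = pd.DataFrame({'x': [0.0, 5.0, 10.0], 'y': [-2.0, 0.0, 2.0]})
scaled = minMaxScale(toy)
print(scaled)
```

With `low=1` and `high=10` the factor `high - low` is the `9` used in the cell above. A column whose values are all equal has a range of zero, so this formula would produce `NaN` for it; that case does not arise for the gyroscope axes shown here.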


In [20]:

# * ####################################
# * Running K-Means CPU
# * ####################################

NUMBER_OF_RUNS = 5

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsCPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means CPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means CPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means CPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means CPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means CPU: {fastestExecTime}')

Execution time for K-Means CPU run #1: 1464.724158525467
Execution time for K-Means CPU run #2: 1631.5510032176971
Execution time for K-Means CPU run #3: 1388.7622044086456
Execution time for K-Means CPU run #4: 1434.7594995498657
Execution time for K-Means CPU run #5: 1463.6720283031464
 
Average execution time for K-Means CPU: 1476.6937788009643
Slowest execution time for K-Means CPU: 1631.5510032176971
Fastest execution time for K-Means CPU: 1388.7622044086456
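
The CPU and GPU cells repeat the same timing loop. The sketch below shows one way to factor that loop into a helper; `benchmark` is a hypothetical name, and it uses `time.perf_counter()`, a monotonic clock intended for measuring intervals, in place of `time.time()`. A cheap stand-in workload replaces the K-Means call so the example runs on its own.

```python
import time

def benchmark(fn, numberOfRuns, label):
    """Call fn() numberOfRuns times; return (average, slowest, fastest) in seconds."""
    execTimes = []
    for rep in range(1, numberOfRuns + 1):
        startTime = time.perf_counter()
        fn()
        elapsedTime = time.perf_counter() - startTime
        print(f'Execution time for {label} run #{rep}: {elapsedTime}')
        execTimes.append(elapsedTime)
    return sum(execTimes) / len(execTimes), max(execTimes), min(execTimes)

# Stand-in workload so the sketch is self-contained
average, slowest, fastest = benchmark(lambda: sum(range(100_000)), 3, 'toy workload')
print(f'Average: {average}, slowest: {slowest}, fastest: {fastest}')
```

In the notebook, the workload could be passed as `lambda: km.kMeansCPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)`, and likewise for `km.kMeansGPU`.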


In [21]:

# * ####################################
# * Running K-Means GPU
# * ####################################

NUMBER_OF_RUNS = 5

totalExecTime = 0.0
slowestExecTime = -1.0
fastestExecTime = FLOAT_32_BIT_MAX

for rep in range(1, NUMBER_OF_RUNS + 1):
    startTime = time.time()
    km.kMeansGPU(datasetTreated, k=K, maxIter=MAX_ITERATIONS, printIter=False, plotResults=PLOT_RESULTS, debug=DEBUG)
    # print(f'Results:\n \n{resultsGPU}\n ')
    elapsedTime = time.time() - startTime
    print(f'Execution time for K-Means GPU run #{rep}: {elapsedTime}')
    if elapsedTime < fastestExecTime: fastestExecTime = elapsedTime
    if elapsedTime > slowestExecTime: slowestExecTime = elapsedTime
    totalExecTime += elapsedTime
    # print(f'Average execution time for K-Means GPU until now: {totalExecTime / rep}')

print(f' \nAverage execution time for K-Means GPU: {totalExecTime / NUMBER_OF_RUNS}')
print(f'Slowest execution time for K-Means GPU: {slowestExecTime}')
print(f'Fastest execution time for K-Means GPU: {fastestExecTime}')

Execution time for K-Means GPU run #1: 810.9785737991333
Execution time for K-Means GPU run #2: 508.7315237522125
Execution time for K-Means GPU run #3: 465.9330599308014
Execution time for K-Means GPU run #4: 528.6637589931488
Execution time for K-Means GPU run #5: 479.31747007369995
 
Average execution time for K-Means GPU: 558.7248773097992
Slowest execution time for K-Means GPU: 810.9785737991333
Fastest execution time for K-Means GPU: 465.9330599308014
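
As a hedged illustration of how a speedup figure can be derived from the runs printed above, the sketch below divides the CPU time by the GPU time, once using the mean and once using the median. The timings are copied from the outputs of the two cells; the `speedup` helper is a hypothetical name, and the median variant is an addition that the notebook itself does not report.

```python
from statistics import mean, median

# Per-run wall-clock times (seconds) copied from the CPU and GPU outputs above
cpuTimes = [1464.724158525467, 1631.5510032176971, 1388.7622044086456,
            1434.7594995498657, 1463.6720283031464]
gpuTimes = [810.9785737991333, 508.7315237522125, 465.9330599308014,
            528.6637589931488, 479.31747007369995]

def speedup(baseline, accelerated, stat=mean):
    """Ratio of a summary statistic of the baseline times to that of the accelerated times."""
    return stat(baseline) / stat(accelerated)

meanSpeedup = speedup(cpuTimes, gpuTimes)
medianSpeedup = speedup(cpuTimes, gpuTimes, stat=median)
print(f'Mean speedup: {meanSpeedup:.2f}x')
print(f'Median speedup: {medianSpeedup:.2f}x')
```

The first GPU run (about 811 s) is noticeably slower than the other four (about 466 s to 529 s), which pulls the GPU mean upward; the median is less sensitive to a single outlying run, so the two ratios differ.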
