**Definition**

K-Nearest Neighbour (kNN) is a machine learning method used for classification and regression. It rests on the idea that objects with similar features tend to share the same label or target value. To predict, kNN finds the k nearest neighbours of a test sample in feature space. For classification, the majority label among those neighbours becomes the prediction; for regression, their target values are typically averaged. The distance between the test sample and each neighbour is computed with a distance metric such as Euclidean distance.
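As a rough illustration of the idea, here is a minimal sketch of a majority-vote prediction on made-up data. The names `toy_X`, `toy_y`, `toy_query`, and `predict_majority` are hypothetical and are not part of this notebook's implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: four training points in a 2-D feature space.
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
toy_y = np.array(["a", "a", "b", "b"])
toy_query = np.array([1.1, 1.0])

def predict_majority(X_train, y_train, x_query, k):
    # Euclidean distance from the query to every training point.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

toy_prediction = predict_majority(toy_X, toy_y, toy_query, k=3)
print(toy_prediction)
```

With k=3 the two "a" points and one "b" point are the nearest neighbours, so the vote goes to "a".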


In [None]:
import pandas as pd
import numpy as np

**Implementation**
This section implements a simple kNN algorithm using Euclidean distance. The dataset used is the Iris dataset, which contains data for classifying iris flower species.

# Load Dataset

This kNN module uses the Iris flower data, which has 6 columns. Four of them (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm) serve as attributes or features, and the Species column serves as the class label with 3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'. Note that the outputs recorded below come from a different table: 452 rows and 280 columns, including age, sex, height, QRSduration, and class.


In [None]:
# Load Iris Dataset
dwn_url='https://drive.google.com/uc?id=' + '13bI0wkXJ6nxQrZsTModWH2lekMfuwiIB'
df = pd.read_csv(dwn_url)

# Data Exploration

Before moving on, the data is explored first to learn the properties and characteristics of the dataset.


In [None]:
df.head()

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
0,75,0,190,80,91,193,371,174,121,-16,...,0.0,9.0,-0.9,0.0,0.0,0.9,2.9,23.3,49.4,8
1,56,1,165,64,81,174,401,149,39,25,...,0.0,8.5,0.0,0.0,0.0,0.2,2.1,20.4,38.8,6
2,54,0,172,95,138,163,386,185,102,96,...,0.0,9.5,-2.4,0.0,0.0,0.3,3.4,12.3,49.0,10
3,55,0,175,94,100,202,380,179,143,28,...,0.0,12.2,-2.2,0.0,0.0,0.4,2.6,34.6,61.6,1
4,75,0,190,80,88,181,360,177,103,-16,...,0.0,13.1,-3.6,0.0,0.0,-0.1,3.9,25.4,62.8,7


In [None]:
df.tail()

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
447,53,1,160,70,80,199,382,154,117,-37,...,0.0,4.3,-5.0,0.0,0.0,0.7,0.6,-4.4,-0.5,1
448,37,0,190,85,100,137,361,201,73,86,...,0.0,15.6,-1.6,0.0,0.0,0.4,2.4,38.0,62.4,10
449,36,0,166,68,108,176,365,194,116,-85,...,0.0,16.3,-28.6,0.0,0.0,1.5,1.0,-44.2,-33.2,2
450,32,1,155,55,93,106,386,218,63,54,...,-0.4,12.0,-0.7,0.0,0.0,0.5,2.4,25.0,46.6,1
451,78,1,160,70,79,127,364,138,78,28,...,0.0,10.4,-1.8,0.0,0.0,0.5,1.6,21.3,32.8,1


The first exploration step looks up the unique values of the Species column, which will serve as the class label, to see how many classes it contains. The cell as recorded calls `df.height.unique()`, so the output below lists distinct height values rather than class labels.


In [None]:
df.height.unique()

array([190, 165, 172, 175, 169, 160, 162, 168, 167, 170, 150, 171, 158,
       166, 153, 164, 163, 155, 176, 157, 156, 159, 110, 182, 161, 177,
       185, 184, 132, 154, 186, 780, 173, 178, 179, 180, 133, 124, 174,
       149, 130, 608, 105, 188, 181, 146, 120, 152, 127, 148, 119, 138,
       140])
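To inspect the label column itself, something along these lines could be used. This is a sketch on a made-up frame: the name `toy` and its values are hypothetical. Because `class` is a Python keyword, the column has to be accessed with brackets rather than as an attribute.

```python
import pandas as pd

# Hypothetical stand-in for the label column; the values are made up.
toy = pd.DataFrame({"class": [1, 1, 10, 2, 1, 6]})

labels = sorted(toy["class"].unique())  # distinct class labels
counts = toy["class"].value_counts()    # number of rows per label
print(labels)
print(counts)
```

`value_counts()` also shows how unbalanced the classes are, which matters when reading an accuracy figure later on.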

Next, the dataset's metadata is displayed to learn more about the properties of the data.


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Columns: 280 entries, age to class
dtypes: float64(125), int64(155)
memory usage: 988.9 KB


Descriptive summary statistics are displayed to show the distribution of the data, the value ranges, and other statistics that describe the dataset numerically.


In [None]:
df.describe()

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
count,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,...,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0,452.0
mean,46.471239,0.550885,166.188053,68.170354,88.920354,155.152655,367.207965,169.949115,90.004425,33.676991,...,-0.278982,9.048009,-1.457301,0.003982,0.0,0.514823,1.222345,19.326106,29.47323,3.880531
std,16.466631,0.497955,37.17034,16.590803,15.364394,44.842283,33.385421,35.633072,25.826643,45.431434,...,0.548876,3.472862,2.00243,0.050118,0.0,0.347531,1.426052,13.503922,18.493927,4.407097
min,0.0,0.0,105.0,6.0,55.0,0.0,232.0,108.0,0.0,-172.0,...,-4.1,0.0,-28.6,0.0,0.0,-0.8,-6.0,-44.2,-38.6,1.0
25%,36.0,0.0,160.0,59.0,80.0,142.0,350.0,148.0,79.0,3.75,...,-0.425,6.6,-2.1,0.0,0.0,0.4,0.5,11.45,17.55,1.0
50%,47.0,1.0,164.0,68.0,86.0,157.0,367.0,162.0,91.0,40.0,...,0.0,8.8,-1.1,0.0,0.0,0.5,1.35,18.1,27.9,1.0
75%,58.0,1.0,170.0,79.0,94.0,175.0,384.0,179.0,102.0,66.0,...,0.0,11.2,0.0,0.0,0.0,0.7,2.1,25.825,41.125,6.0
max,83.0,1.0,780.0,176.0,188.0,524.0,509.0,381.0,205.0,169.0,...,0.0,23.6,0.0,0.8,0.0,2.4,6.0,88.8,115.9,16.0


In [None]:
df.shape

(452, 280)

# Preprocessing

After exploration comes preprocessing, which prepares the data before it is fed into a machine learning model. The goal is to make sure the data is ready and suitable for modelling. The preprocessing steps are:

Drop the irrelevant 'Id' column from the data frame and shuffle the data to avoid bias. In the cells below the drop is commented out (and names 'sex'), while `df.sample(150)` both shuffles the rows and keeps a random subset of 150 of them.

In [None]:
#df = df.drop('sex', axis=1)

In [None]:
df = df.sample(150).reset_index(drop=True)
df

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
0,22,1,163,50,74,133,370,163,71,66,...,0.0,8.6,-0.6,0.0,0.0,0.4,2.2,18.1,35.2,1
1,45,1,158,65,82,122,336,174,63,38,...,-0.5,7.8,-0.7,0.0,0.0,-0.1,2.5,17.6,40.6,1
2,45,0,170,74,82,163,373,266,87,87,...,-0.4,6.4,-0.6,0.0,0.0,0.5,-0.4,14.1,8.5,3
3,40,1,165,53,81,160,347,154,81,67,...,-0.4,9.6,0.0,0.0,0.0,0.5,2.1,22.7,37.8,1
4,57,0,164,64,89,155,400,121,100,62,...,0.0,14.1,0.0,0.0,0.0,0.5,0.5,45.1,48.6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,58,1,170,75,90,157,382,168,114,52,...,0.0,10.9,-1.8,0.0,0.0,0.5,1.2,22.2,31.0,1
146,35,1,164,94,85,200,385,174,74,48,...,0.0,11.2,-1.3,0.0,0.0,1.0,4.3,26.5,67.7,10
147,28,1,156,52,83,135,359,169,80,34,...,0.0,6.3,-0.9,0.0,0.0,0.5,1.2,11.9,22.7,1
148,45,0,169,67,90,122,336,177,78,81,...,-0.6,8.3,-1.8,0.0,0.0,0.8,1.1,11.7,19.6,1


Split the data into several folds for cross-validation. Each fold pairs a test subset with a training subset made of the remaining rows.


In [None]:
# Split the data into three folds, each a (test, train) pair
fold1 = (df.iloc[0:50].reset_index(drop=True), df.iloc[50:150].reset_index(drop=True))
fold2 = (df.iloc[50:100].reset_index(drop=True), pd.concat([df.iloc[0:50], df.iloc[100:150]]).reset_index(drop=True))
fold3 = (df.iloc[100:150].reset_index(drop=True), df.iloc[0:100].reset_index(drop=True))

test, train = fold2
print(train)

    age  sex  height  weight  QRSduration  PRinterval  Q-Tinterval  Tinterval  \
0    22    1     163      50           74         133          370        163   
1    45    1     158      65           82         122          336        174   
2    45    0     170      74           82         163          373        266   
3    40    1     165      53           81         160          347        154   
4    57    0     164      64           89         155          400        121   
..  ...  ...     ...     ...          ...         ...          ...        ...   
95   58    1     170      75           90         157          382        168   
96   35    1     164      94           85         200          385        174   
97   28    1     156      52           83         135          359        169   
98   45    0     169      67           90         122          336        177   
99   50    0     164      75           85         142          339        157   

    Pinterval  QRS  ...  ch
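The three folds above are written out by hand. A more general version might look like the following sketch. The helper `make_folds` and the frame `toy` are hypothetical names introduced here; the (test, train) ordering mirrors the cell above.

```python
import pandas as pd

def make_folds(df, n_folds):
    """Return a list of (test, train) pairs built from contiguous blocks of df."""
    fold_size = len(df) // n_folds
    folds = []
    for i in range(n_folds):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = df.iloc[start:stop].reset_index(drop=True)
        # Everything outside the test block becomes training data.
        train = pd.concat([df.iloc[:start], df.iloc[stop:]]).reset_index(drop=True)
        folds.append((test, train))
    return folds

toy = pd.DataFrame({"x": range(9)})
toy_folds = make_folds(toy, n_folds=3)
test, train = toy_folds[1]
print(test["x"].tolist(), train["x"].tolist())
```

If the number of rows is not divisible by `n_folds`, the leftover rows at the end only ever appear in the training part in this sketch.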

The data is normalized with min-max normalization so that all features share the same scale.


In [None]:
# Normalizations
def norm(df):
  df = (df - df.min()) / (df.max() - df.min())
  return df
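Min-max normalization maps each value x of a column to (x - min) / (max - min). When every value in a column is identical, max - min is zero and the result is NaN, which would explain the empty columns in the normalized output further down. One possible guard is sketched below; `norm_safe` and `toy` are hypothetical names, and mapping constant columns to 0 is an assumption rather than the notebook's behaviour.

```python
import pandas as pd

def norm_safe(df):
    """Min-max scale each column to [0, 1]; constant columns become 0."""
    col_min = df.min()
    col_range = df.max() - col_min
    # Replace a zero range with 1 so constant columns map to 0 instead of NaN.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
toy_scaled = norm_safe(toy)
print(toy_scaled)
```

Dropping constant columns before modelling is another option, since they carry no information for a distance-based method.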

In [None]:
# Assign features and class
X = df.drop('sex', axis=1)
y = df.age

In [None]:
X = norm(X) # Normalize the feature data
X

Unnamed: 0,age,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,T,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
0,0.2750,0.115308,0.240964,0.102362,0.415625,0.610619,0.291005,0.387978,0.790698,0.664756,...,1.000000,0.364407,0.960000,,,0.631579,0.713043,0.436204,0.477670,0.000000
1,0.5625,0.105368,0.331325,0.165354,0.381250,0.460177,0.349206,0.344262,0.697674,0.627507,...,0.878049,0.330508,0.953333,,,0.368421,0.739130,0.432217,0.512621,0.000000
2,0.5625,0.129225,0.385542,0.165354,0.509375,0.623894,0.835979,0.475410,0.860465,0.842407,...,0.902439,0.271186,0.960000,,,0.684211,0.486957,0.404306,0.304854,0.133333
3,0.5000,0.119284,0.259036,0.157480,0.500000,0.508850,0.243386,0.442623,0.794020,0.656160,...,0.902439,0.406780,1.000000,,,0.684211,0.704348,0.472887,0.494498,0.000000
4,0.7125,0.117296,0.325301,0.220472,0.484375,0.743363,0.068783,0.546448,0.777409,0.819484,...,1.000000,0.597458,1.000000,,,0.684211,0.565217,0.651515,0.564401,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,0.7250,0.129225,0.391566,0.228346,0.490625,0.663717,0.317460,0.622951,0.744186,0.699140,...,1.000000,0.461864,0.880000,,,0.684211,0.626087,0.468900,0.450485,0.000000
146,0.4375,0.117296,0.506024,0.188976,0.625000,0.676991,0.349206,0.404372,0.730897,0.593123,...,1.000000,0.474576,0.913333,,,0.947368,0.895652,0.503190,0.688026,0.600000
147,0.3500,0.101392,0.253012,0.173228,0.421875,0.561947,0.322751,0.437158,0.684385,0.590258,...,1.000000,0.266949,0.940000,,,0.684211,0.626087,0.386762,0.396764,0.000000
148,0.5625,0.127237,0.343373,0.228346,0.381250,0.460177,0.365079,0.426230,0.840532,0.716332,...,0.853659,0.351695,0.880000,,,0.842105,0.617391,0.385167,0.376699,0.000000


In [None]:
X.describe()

Unnamed: 0,age,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,T,...,chV6_QwaveAmp,chV6_RwaveAmp,chV6_SwaveAmp,chV6_RPwaveAmp,chV6_SPwaveAmp,chV6_PwaveAmp,chV6_TwaveAmp,chV6_QRSA,chV6_QRSTA,class
count,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,146.0,...,150.0,150.0,150.0,0.0,0.0,150.0,150.0,150.0,150.0,150.0
mean,0.556083,0.122425,0.34261,0.221627,0.495771,0.597935,0.339118,0.498179,0.700709,0.594517,...,0.920976,0.392062,0.902578,,,0.68807,0.625217,0.450478,0.441924,0.213778
std,0.217192,0.076158,0.113107,0.144227,0.123465,0.156484,0.184458,0.137232,0.159699,0.167624,...,0.154585,0.156282,0.115389,,,0.167338,0.138662,0.122213,0.124424,0.30898
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0
25%,0.425,0.109344,0.289157,0.141732,0.441406,0.518805,0.222222,0.43306,0.601329,0.534384,...,0.878049,0.288136,0.86,,,0.631579,0.565217,0.386164,0.376699,0.0
50%,0.5625,0.119284,0.343373,0.192913,0.496875,0.599558,0.296296,0.5,0.742525,0.613181,...,1.0,0.385593,0.923333,,,0.684211,0.652174,0.435407,0.440777,0.0
75%,0.7125,0.129225,0.408133,0.251969,0.549219,0.675885,0.391534,0.56694,0.799834,0.673352,...,1.0,0.488347,1.0,,,0.789474,0.713043,0.499601,0.510356,0.333333
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0


# KNN Model

The `knn` function predicts the class of a single test sample. kNN has no separate training step, since the training data itself acts as the model. The function takes four arguments: X_train (training features), y_train (class labels of the training data), X_test (the feature vector of the test sample), and k (the number of neighbours to consider).


In [None]:

def euclidean(x1,x2):
  return np.sqrt(np.sum((x1-x2)**2))

This kNN model measures the distance between two vectors x1 and x2 with Euclidean distance, computed as the square root of the sum of squared differences between corresponding elements of the vectors.
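Written out for two vectors with n features (the symbols n and i are notation introduced here), the distance is:

$$
d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} \left(x_{1,i} - x_{2,i}\right)^2}
$$

For example, with $x_1 = (0, 0)$ and $x_2 = (3, 4)$ this gives $\sqrt{3^2 + 4^2} = \sqrt{25} = 5$.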


In [None]:
euclidean(X.iloc[0],X.iloc[1])

2.3614882272636293

In [None]:
def knn(X_train, y_train, X_test, k):
  dist=[]
  for row in range (X_train.shape[0]):
    dist.append(euclidean(X_train.iloc[row],X_test))

  data=X_train.copy()
  data['Dist']=dist
  data['Class']=y_train

  data=data.sort_values(by='Dist').reset_index(drop=True)

  y_pred=data.iloc[:k].Class.mode()

  return y_pred[0]
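One detail of the vote in `knn` is worth spelling out: `Series.mode()` returns every value tied for most frequent, in sorted order, so taking element `[0]` resolves ties in favour of the smallest label. The sketch below shows this on made-up votes; `votes`, `modes`, and `tie_winner` are hypothetical names.

```python
import pandas as pd

# Hypothetical labels of k=5 nearest neighbours: 1 and 10 both appear twice.
votes = pd.Series([10, 1, 10, 1, 3])

modes = votes.mode()   # all tied modes, sorted ascending
tie_winner = modes[0]  # the smallest tied label wins
print(modes.tolist(), tie_winner)
```

Choosing an odd k only prevents ties when there are two classes; with more classes, breaking ties by the closest neighbour is a common alternative.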

# Evaluation

The model is evaluated with cross-validation, using the folds built earlier. Averaging the accuracy over the folds gives a more consistent and representative estimate of the kNN model's performance than a single split would.


In [None]:
def acc(y_pred, y_true):
  return np.sum(y_pred == y_true) / len(y_pred)



In [None]:
def evaluate(fold,k):
  test, train = fold

  X_train, y_train = train.drop('height',axis=1), train.height
  X_test, y_test= test.drop('height', axis=1), test.height

  X_train=norm(X_train)
  X_test=norm(X_test)

  y_preds=[]

  for row in range (X_test.shape[0]):
    y_preds.append(knn(X_train, y_train, X_test.iloc[row],k))

  return (acc(y_preds, y_test))
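In `evaluate`, the training and test sets are each scaled with their own minimum and maximum, so the same raw value can end up at different scaled positions in the two sets. A common alternative is to compute the scaling parameters on the training data and reuse them for the test data. The sketch below assumes that approach; `fit_minmax`, `apply_minmax`, and the toy frames are hypothetical names, not part of the notebook.

```python
import pandas as pd

def fit_minmax(train):
    """Return per-column minimum and range computed on the training data only."""
    col_min = train.min()
    col_range = (train.max() - col_min).replace(0, 1)  # guard constant columns
    return col_min, col_range

def apply_minmax(df, col_min, col_range):
    """Scale df with parameters that were fitted elsewhere."""
    return (df - col_min) / col_range

toy_train = pd.DataFrame({"f": [0.0, 10.0]})
toy_test = pd.DataFrame({"f": [5.0, 20.0]})

col_min, col_range = fit_minmax(toy_train)
toy_train_scaled = apply_minmax(toy_train, col_min, col_range)
toy_test_scaled = apply_minmax(toy_test, col_min, col_range)
print(toy_test_scaled)
```

Test values outside the training range fall outside [0, 1] (here 20.0 maps to 2.0), which is expected with this scheme.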

Cross-validation tests the performance of the kNN model for a chosen value of k, the number of neighbours the model uses. k can be set freely, so the usual approach is to try several values and keep the one that gives the highest accuracy.


In [None]:
k = 10
accs=[]
folds = [fold1,fold2,fold3]

for i in range(len(folds)):
  accs.append(evaluate(folds[i],k))
print(f'K value: {k}, Accuracy: {sum(accs)/3}')

K value: 10, Accuracy: 0.14
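Searching for a good k amounts to repeating the evaluation for several candidates and comparing the scores. The sketch below illustrates this on synthetic data with a single hold-out split. Everything in it (`knn_predict`, `accuracy_for_k`, the generated clusters, the candidate list) is hypothetical; with the notebook's own functions, the same loop would call `evaluate` on each fold for each k.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: two 2-D clusters with integer labels 0 and 1.
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(3.0, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Shuffle, then hold out the last 20 rows for testing.
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]

def knn_predict(X_tr, y_tr, x, k):
    # Indices of the k training points closest to x (Euclidean distance).
    nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
    # Majority vote over non-negative integer labels.
    return np.bincount(y_tr[nearest]).argmax()

def accuracy_for_k(k):
    preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_te])
    return float((preds == y_te).mean())

scores = {k: accuracy_for_k(k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)  # first k with the highest accuracy
print(scores, best_k)
```

When several k values tie, `max` keeps the first one in the candidate list, which here is the smallest.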
