# Introduction

This analysis classifies breast cancer diagnoses (malignant or benign)
from the *Breast Cancer Wisconsin Diagnostic* dataset using the *K-Nearest Neighbors (K-NN)* algorithm.
Principal Component Analysis (PCA) is applied first to reduce the dimensionality of the features. Strictly speaking, PCA is feature extraction rather than feature selection, because it builds new components from linear combinations of all the original features.

# 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# 2. Import Dataset

In [None]:
from google.colab import files

uploaded = files.upload()

Saving wdbc_data.csv to wdbc_data.csv


In [None]:
file_path = 'wdbc_data.csv'
headers = [
    "ID", "Diagnosis", "radius1", "texture1", "perimeter1", "area1", "smoothness1", "compactness1", "concavity1", "concave_points1", "symmetry1", "fractal_dimension1",
    "radius2", "texture2", "perimeter2", "area2", "smoothness2", "compactness2", "concavity2", "concave_points2", "symmetry2", "fractal_dimension2",
    "radius3", "texture3", "perimeter3", "area3", "smoothness3", "compactness3", "concavity3", "concave_points3", "symmetry3", "fractal_dimension3"
]

df = pd.read_csv(file_path, header=None, names=headers)

In [None]:
df.head()

Unnamed: 0,ID,Diagnosis,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


* The first column (ID) is not needed for the analysis and will be dropped.
* The second column (Diagnosis) is the target label (M = Malignant, B = Benign) and will be converted to a numeric value.
* All remaining features will be standardized so that they share the same scale.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  569 non-null    int64  
 1   Diagnosis           569 non-null    object 
 2   radius1             569 non-null    float64
 3   texture1            569 non-null    float64
 4   perimeter1          569 non-null    float64
 5   area1               569 non-null    float64
 6   smoothness1         569 non-null    float64
 7   compactness1        569 non-null    float64
 8   concavity1          569 non-null    float64
 9   concave_points1     569 non-null    float64
 10  symmetry1           569 non-null    float64
 11  fractal_dimension1  569 non-null    float64
 12  radius2             569 non-null    float64
 13  texture2            569 non-null    float64
 14  perimeter2          569 non-null    float64
 15  area2               569 non-null    float64
 16  smoothne

The output shows no missing values: every column listed has 569 non-null entries, matching the 569 rows in the dataset.
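The same check can be made explicit instead of being read off `df.info()`. The sketch below uses a small hypothetical frame (`toy`, with one value deliberately set to NaN) in place of `df`. The column names are borrowed from the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; one entry is deliberately missing.
toy = pd.DataFrame({
    "radius1": [17.99, 20.57, np.nan],
    "texture1": [10.38, 17.77, 21.25],
})

# Count NaN entries per column, then across the whole frame.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

Run against the real `df`, a total of zero would confirm the observation above.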

In [None]:
df.describe()

Unnamed: 0,ID,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


# 3. Preprocessing

## Drop unused columns

In [None]:
df.drop(columns=["ID"], inplace=True)

## Convert the label to numeric

In [None]:
label_encoder = LabelEncoder()
df["Diagnosis"] = label_encoder.fit_transform(df["Diagnosis"])

The Diagnosis column (M = Malignant, B = Benign) is converted to numeric values. `LabelEncoder` assigns codes in sorted order of the class names, so:
* M becomes 1
* B becomes 0
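The mapping can be confirmed from the fitted encoder rather than assumed. This is a minimal sketch on a hypothetical list of labels (`labels`). In the notebook the same inspection would apply to `label_encoder.classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for df["Diagnosis"].
labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {str(cls): int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```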


# Preparation for Feature Selection

In [None]:
X = df.iloc[:, 1:]
y = df["Diagnosis"].astype(int)

The data is split into two parts:
* `y` contains only the target label, i.e. the Diagnosis column
* `X` contains every column except Diagnosis
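Selecting `X` by position with `iloc[:, 1:]` only works because Diagnosis is the first column once ID has been dropped. A name-based alternative, sketched here on a hypothetical three-row frame (`toy`), does not depend on column order.

```python
import pandas as pd

# Hypothetical stand-in for df after the ID column has been dropped.
toy = pd.DataFrame({
    "Diagnosis": [1, 0, 1],
    "radius1": [17.99, 13.54, 20.57],
    "texture1": [10.38, 14.36, 17.77],
})

X_by_position = toy.iloc[:, 1:]              # relies on Diagnosis being column 0
X_by_name = toy.drop(columns=["Diagnosis"])  # independent of column order
y_toy = toy["Diagnosis"].astype(int)

print(list(X_by_name.columns))
```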

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

All columns in X are standardized to zero mean and unit variance. PCA depends on feature variances and K-NN on distances, so without this step features with large numeric ranges (such as area) would dominate those with small ranges (such as smoothness).
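The effect of `StandardScaler` can be verified on a small array. The sketch below uses hypothetical values (`toy`) with two columns on very different scales and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the first column is about 100x larger than the second.
toy = np.array([
    [10.0, 0.1],
    [20.0, 0.2],
    [30.0, 0.3],
])

scaled = StandardScaler().fit_transform(toy)

col_means = scaled.mean(axis=0)  # expected to be approximately 0 per column
col_stds = scaled.std(axis=0)    # expected to be approximately 1 per column
print(col_means, col_stds)
```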

# 4. Feature Selection with PCA

* PCA reduces the dimensionality of the data while retaining most of its variance.
* Ten principal components are used, a number chosen by experiment.
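As an alternative to fixing the number of components by hand, scikit-learn's `PCA` accepts a fraction between 0 and 1 for `n_components` and keeps as many components as are needed to reach that share of the variance. The sketch below is not the notebook's method. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) in place of `wdbc_data.csv`, and the 0.95 threshold is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the uploaded CSV.
X_raw, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

# Keep the smallest number of components explaining at least 95% of the variance.
pca_auto = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_auto.fit_transform(X_std)

print("Components kept:", pca_auto.n_components_)
print("Variance retained:", pca_auto.explained_variance_ratio_.sum())
```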

In [None]:
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)

Next, we look at the proportion of the variance in X that each component explains.

In [None]:
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Explained Variance Ratio: [0.44272026 0.18971182 0.09393163 0.06602135 0.05495768 0.04024522
 0.02250734 0.01588724 0.01389649 0.01168978]
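Summing these ratios shows how much of the total variance the ten components retain together. The sketch below copies the printed values into an array (`ratios`, a name introduced here) and accumulates them.

```python
import numpy as np

# Values copied from the explained variance ratio output above.
ratios = np.array([
    0.44272026, 0.18971182, 0.09393163, 0.06602135, 0.05495768,
    0.04024522, 0.02250734, 0.01588724, 0.01389649, 0.01168978,
])

# Running total: share of variance explained by the first i components.
cumulative = np.cumsum(ratios)
for i, share in enumerate(cumulative, start=1):
    print(f"First {i:2d} components: {share * 100:.2f}%")
```

The first two components already account for about 63% of the variance, and all ten together for about 95%.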


# 5. Classification Model with K-NN and Cross-Validation

* A K-NN model with k=5 is used for classification.
* 5-fold cross-validation is used to evaluate the stability of the model.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(knn, X_pca, y, cv=5)

In [None]:
print(f"Cross-Validation Accuracy: {np.mean(cv_scores) * 100:.2f}% ± {np.std(cv_scores) * 100:.2f}%")

Cross-Validation Accuracy: 96.13% ± 1.32%


Five-fold cross-validation of the K-Nearest Neighbors model gives a mean accuracy of 96.13% with a standard deviation of 1.32 percentage points, i.e. roughly 94.81% to 97.45% within one standard deviation of the mean.
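One caveat: the scaler and PCA above were fitted on the full dataset before cross-validation, so each validation fold influenced the transformation applied to it. A common way to avoid this is to wrap all three steps in a scikit-learn `Pipeline`, so that they are refitted on the training part of every fold. The sketch below is a possible variant, not the notebook's code. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`), where the labels are encoded as 0 = malignant and 1 = benign, the reverse of the encoding used here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for wdbc_data.csv.
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Scaling, PCA and K-NN are fitted together inside each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)

scores = cross_val_score(pipeline, X_raw, y_raw, cv=5)
print(f"Pipeline CV accuracy: {scores.mean() * 100:.2f}% ± {scores.std() * 100:.2f}%")
```

The resulting score may differ slightly from the figure reported above, since the preprocessing no longer sees the validation rows.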

# 6. Model Evaluation on the Test Set

* The data is split into 80% training and 20% testing.
* The K-NN model is trained on the training data and evaluated on the test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Set Accuracy: {accuracy * 100:.2f}%')
print("Classification Report:")
print(classification_report(y_test, y_pred))

Analysis of results:
1. The K-NN model reaches a mean accuracy of 96.13% under *cross-validation*.
2. On the test set, the model's accuracy is 95.61%.
3. PCA reduces the data from 30 features to 10 components while retaining enough information for classification.
4. The K-NN model separates *malignant* from *benign* cases reasonably well.
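Accuracy alone does not show which class the errors fall in, which matters here because a missed malignant case is costlier than a false alarm. A confusion matrix makes that visible. The sketch below is one way to obtain it, under stated assumptions: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`, where 0 = malignant and 1 = benign), adds `stratify` so both splits keep the class proportions, and fits the preprocessing on the training data only. None of these choices come from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for wdbc_data.csv.
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Stratified 80/20 split keeps the malignant/benign ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)
```

The off-diagonal entry in the first row counts malignant cases predicted as benign, the error type most worth monitoring.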