# Diabetes Risk Prediction Project (Simple CSV)

This notebook is designed to read a standard **`diabetes.csv`** file that already has a header row (column names).

**First step:**
Upload your `diabetes.csv` file.

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

# Colab-specific library for uploading files
from google.colab import files
import io

## 2. Upload and Load the Data
Because the file is a standard CSV with a header row, it can be read directly with `pd.read_csv`.

In [None]:
print("Please upload the file 'diabetes.csv':")
uploaded = files.upload()

# Get the name of the uploaded file
filename = next(iter(uploaded))

# Read the CSV (pandas uses the first row as the header by default)
diabetes_dataset = pd.read_csv(io.BytesIO(uploaded[filename]))

print(f"\nSuccessfully read file: {filename}")
print("=== First 5 Rows ===")
diabetes_dataset.head()

Please upload the file 'diabetes.csv':


Saving diabetes.csv to diabetes.csv

Successfully read file: diabetes.csv
=== First 5 Rows ===


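| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |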
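Before going further, it can help to confirm that the loaded file really has the columns the rest of the notebook relies on. The following is a minimal sketch, not part of the original notebook: it rebuilds the first two rows shown above as an in-memory CSV, and the `expected_columns` list and the `missing` check are illustrative names introduced here.

```python
import io

import pandas as pd

# The first two data rows from the preview above, as an in-memory CSV.
sample_csv = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# Columns used later in the notebook (illustrative helper list).
expected_columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# files.upload() returns raw bytes, which is why the notebook wraps them in BytesIO.
df = pd.read_csv(io.BytesIO(sample_csv.encode("utf-8")))

# Fail early with a readable message if a column is absent.
missing = [col for col in expected_columns if col not in df.columns]
if missing:
    raise ValueError(f"Missing expected columns: {missing}")

print(df.shape)  # (2, 9)
```

The same check could be applied to `diabetes_dataset` right after `pd.read_csv`.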


## 3. Brief Data Analysis

In [3]:
# Check the size of the data (rows, columns)
print(f"Data shape: {diabetes_dataset.shape}")

# Descriptive statistics
diabetes_dataset.describe()

Data shape: (768, 9)


| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
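The `min` row reports 0.0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A value of zero is hard to reconcile with a real measurement of glucose or BMI, so these zeros may be placeholders for missing data; that interpretation is an assumption and is not stated by the notebook. A small sketch of how such zeros could be counted, using a subset of columns from the five preview rows shown earlier:

```python
import pandas as pd

# A subset of columns from the five rows printed by head() earlier.
preview = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count how many entries in each column are exactly zero.
zero_counts = (preview == 0).sum()
print(zero_counts)
```

Running the same expression on the corresponding columns of `diabetes_dataset` would show how widespread the zeros are before deciding whether to treat them as missing.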


In [4]:
# Check the balance of the target column (Outcome)
# 0 = non-diabetic, 1 = diabetic
print("Number of samples per label:")
print(diabetes_dataset['Outcome'].value_counts())

Number of samples per label:
Outcome
0    500
1    268
Name: count, dtype: int64
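The counts above imply a moderately imbalanced target, which gives a useful reference point for any accuracy figure: a model that always predicts the majority class would already be right about 65% of the time. A short sketch of that arithmetic, using only the two counts printed above (the variable names are illustrative):

```python
# Label counts taken from the value_counts() output above.
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Share of each class in the dataset.
for label, n in counts.items():
    print(f"Outcome {label}: {n / total:.1%}")

# Accuracy of a trivial model that always predicts the most common label.
majority_baseline = max(counts.values()) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```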


## 4. Data Preprocessing (Standardization)
Rescale every feature to zero mean and unit variance so that features with large numeric ranges (such as Insulin) do not dominate the SVM.

In [5]:
# Separate the features (X) and the label (Y)
X = diabetes_dataset.drop(columns='Outcome')
Y = diabetes_dataset['Outcome']

# Standardize the features
scaler = StandardScaler()
scaler.fit(X)
standardized_data = scaler.transform(X)

X = standardized_data
print("First 5 rows after standardization:")
print(X[:5])

First 5 rows after standardization:
[[ 0.63994726  0.84832379  0.14964075  0.90726993 -0.69289057  0.20401277
   0.46849198  1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 -1.28821221 -0.69289057 -1.10325546
   0.60439732 -0.10558415]
 [-0.84488505 -0.99820778 -0.16054575  0.15453319  0.12330164 -0.49404308
  -0.92076261 -1.04154944]
 [-1.14185152  0.5040552  -1.50468724  0.90726993  0.76583594  1.4097456
   5.4849091  -0.0204964 ]]
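Each value in this array is a z-score, `(x - mean) / std`. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the first printed value (0.63994726, for Pregnancies = 6) from the summary statistics shown earlier; the variable names are illustrative.

```python
import math

n = 768                # number of rows reported by describe()
mean = 3.845052        # mean of Pregnancies from describe()
sample_std = 3.369578  # std of Pregnancies from describe() (ddof=1)

# Convert the sample standard deviation to the population one (ddof=0),
# which is what StandardScaler divides by.
population_std = sample_std * math.sqrt((n - 1) / n)

# z-score of the first row's Pregnancies value (6).
z = (6 - mean) / population_std
print(round(z, 4))  # 0.6399
```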


## 5. Train/Test Split & Modeling
Split the data into a training set (80%) and a test set (20%), stratified by `Outcome`, then train a linear-kernel SVM.

In [6]:
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2, stratify=Y)

# Create and train the SVM model
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)

# Evaluate accuracy (accuracy_score expects y_true first, then y_pred)
train_acc = accuracy_score(Y_train, classifier.predict(X_train))
test_acc = accuracy_score(Y_test, classifier.predict(X_test))

print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy    : {test_acc * 100:.2f}%')

Training accuracy: 78.66%
Test accuracy    : 77.27%
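In section 4 the scaler is fitted on all 768 rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates that pattern on synthetic data from `make_classification`; it is not the notebook's method, the data is not the diabetes dataset, and its accuracy is not comparable to the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 8 features (NOT the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those statistics when scoring the test rows.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_accuracy:.2%}")
```

A fitted pipeline also accepts raw, unscaled inputs in `predict`, which removes the need to call the scaler separately at prediction time.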


## 6. Predicting New Data (Warning Fix)
Here the manual input is converted to a DataFrame with the same column names as the data the scaler was fitted on, so scikit-learn no longer warns about missing feature names.

In [7]:
# Example input (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age)
input_data = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

# --- FIX HERE ---
# Build a DataFrame whose column names EXACTLY match those the scaler was fitted on
nama_kolom = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

# Build a DataFrame from the input data
input_df = pd.DataFrame([input_data], columns=nama_kolom)

# Standardize the input (it now carries column names, so the scaler will not warn)
std_data = scaler.transform(input_df)

# Predict
prediction = classifier.predict(std_data)

print(f"\nPrediction: {prediction[0]}")

if prediction[0] == 0:
    print('Conclusion: the patient is NOT at risk of diabetes')
else:
    print('Conclusion: the patient IS AT RISK of diabetes')


Prediction: 1
Conclusion: the patient IS AT RISK of diabetes
