# Machine Learning - Health Insurance Cost Prediction
This project uses a Decision Tree Classifier to predict the insurance cost category from demographic and lifestyle data.

# Source Dataset
The dataset comes from the [Insurance Health Cost Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance), which contains demographic information and health insurance charges for 1338 individuals.

In [26]:
# Machine Learning Workflow

# Library
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# 1. Training Data
from sklearn.model_selection import train_test_split

# 2. Learning Process
# 3. Knowledge Model
from sklearn.tree import DecisionTreeClassifier

# 4. Testing Data
from sklearn.metrics import classification_report, confusion_matrix

# Read the dataset
insurance = pd.read_csv('insurance.csv')

# Show basic information
print("Dataset Shape:", insurance.shape)
print("\nFirst 5 rows:")
print(insurance.head())
print("\nData Types:")
print(insurance.dtypes)
print("\nMissing Values:")
print(insurance.isnull().sum())

Dataset Shape: (1338, 7)

First 5 rows:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Data Types:
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

Missing Values:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [27]:
# Data Exploration and Preprocessing

# Create the insurance cost category (label)
# Categorize by quartile
insurance['charges_category'] = pd.qcut(insurance['charges'], 
                                         q=4, 
                                         labels=['Low', 'Medium', 'High', 'Very High'],
                                         duplicates='drop')

print("Charges Category Distribution:")
print(insurance['charges_category'].value_counts())
print("\nCharges Statistics:")
print(insurance['charges'].describe())

Charges Category Distribution:
charges_category
Low          335
Very High    335
Medium       334
High         334
Name: count, dtype: int64

Charges Statistics:
count     1338.000000
mean     13270.422265
std      12110.011237
min       1121.873900
25%       4740.287150
50%       9382.033000
75%      16639.912515
max      63770.428010
Name: charges, dtype: float64
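
The quartile edges that `pd.qcut` uses can be inspected directly by passing `retbins=True`. The sketch below is only an illustration: the `charges` values are hypothetical and are not rows from `insurance.csv`. It shows how the four labels map onto the returned bin edges.

```python
import pandas as pd

# Hypothetical toy charges, not taken from insurance.csv
charges = pd.Series([1200.0, 2500.0, 4800.0, 7000.0, 9400.0, 12000.0, 16700.0, 40000.0])

labels = ['Low', 'Medium', 'High', 'Very High']

# retbins=True also returns the five edges that delimit the four quartile bins
categories, bin_edges = pd.qcut(charges, q=4, labels=labels, retbins=True)

print("Bin edges:", bin_edges)
print(categories.value_counts().sort_index())
```

Because the bins are quantile-based, each category receives roughly the same number of rows, which matches the near-equal 335/334 split printed above.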


In [28]:
# Preparing Data for Training

# Separate the features (X) from the target (y)
X = insurance.drop(['charges', 'charges_category'], axis=1)
y = insurance['charges_category']

# Label-encode the string columns 'sex', 'smoker', and 'region'
le_sex = LabelEncoder()
le_smoker = LabelEncoder()
le_region = LabelEncoder()

X['sex'] = le_sex.fit_transform(X['sex'])
X['smoker'] = le_smoker.fit_transform(X['smoker'])
X['region'] = le_region.fit_transform(X['region'])

print("Features after encoding:")
print(X.head())
print("\nLabel (Target):")
print(y.head())

Features after encoding:
   age  sex     bmi  children  smoker  region
0   19    0  27.900         0       1       3
1   18    1  33.770         1       0       2
2   28    1  33.000         3       0       2
3   33    1  22.705         0       0       1
4   32    1  28.880         0       0       1

Label (Target):
0    Very High
1          Low
2          Low
3    Very High
4          Low
Name: charges_category, dtype: category
Categories (4, object): ['Low' < 'Medium' < 'High' < 'Very High']
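
To read the encoded columns, it helps to print the mapping that a `LabelEncoder` learned. The sketch below fits a separate, hypothetical encoder (`le_region_demo`) on the four region names seen in the data; it is not the notebook's `le_region` object, but it should produce the same codes because `LabelEncoder` sorts its classes.

```python
from sklearn.preprocessing import LabelEncoder

# The four region names that appear in the dataset
regions = ['southwest', 'southeast', 'northwest', 'northeast']

# Hypothetical demo encoder, fitted only to show the learned mapping
le_region_demo = LabelEncoder()
le_region_demo.fit(regions)

# classes_ holds the sorted class names; their position is the integer code
region_mapping = {
    str(name): int(code)
    for name, code in zip(le_region_demo.classes_,
                          le_region_demo.transform(le_region_demo.classes_))
}
print(region_mapping)
```

The integer codes follow alphabetical order, so they imply an ordering between regions that has no real meaning. A decision tree can usually cope with this, but one-hot encoding may be a safer choice for other model types.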


In [29]:
# Splitting Data for Training and Testing

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("\nTraining data sample:")
print(X_train.head())
print("\nTesting target sample:")
print(y_test.head())

Training set size: 1070 samples
Testing set size: 268 samples

Training data sample:
      age  sex    bmi  children  smoker  region
560    46    0  19.95         2       0       1
1285   47    0  24.32         0       0       0
1142   52    0  24.86         0       0       2
969    39    0  34.32         5       0       2
486    54    0  21.47         3       0       1

Testing target sample:
764        Medium
887        Medium
890     Very High
1293       Medium
259     Very High
Name: charges_category, dtype: category
Categories (4, object): ['Low' < 'Medium' < 'High' < 'Very High']
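
A plain random split does not guarantee that each category keeps its share in the test set. One possible variation, sketched below on hypothetical balanced data (`X_demo` and `y_demo` are made up for illustration), is to pass `stratify` to `train_test_split` so that class proportions are preserved.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced target: 25 rows per category, 100 rows in total
labels = ['Low', 'Medium', 'High', 'Very High']
y_demo = pd.Series(labels * 25, name='charges_category')
X_demo = pd.DataFrame({'feature': range(len(y_demo))})

# stratify=y_demo keeps the category proportions equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_te.value_counts())
```

With stratification, each category contributes the same number of rows to the test set in this toy case. Whether it changes the results on the real dataset is not shown here.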


In [30]:
# Learning Process (Training)

# Create and train the Decision Tree Classifier
classifier = DecisionTreeClassifier(max_depth=5, random_state=42)
classifier.fit(X_train, y_train)

# Predict on the testing data
y_pred = classifier.predict(X_test)

# Evaluate the model
print("=" * 50)
print("CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(y_test, y_pred))

print("\n" + "=" * 50)
print("CONFUSION MATRIX")
print("=" * 50)
print(confusion_matrix(y_test, y_pred))

CLASSIFICATION REPORT
              precision    recall  f1-score   support

        High       0.85      0.80      0.83        56
         Low       0.87      0.96      0.91        77
      Medium       0.83      0.91      0.87        69
   Very High       0.96      0.79      0.87        66

    accuracy                           0.87       268
   macro avg       0.88      0.87      0.87       268
weighted avg       0.88      0.87      0.87       268


CONFUSION MATRIX
[[45  0  9  2]
 [ 0 74  3  0]
 [ 1  5 63  0]
 [ 7  6  1 52]]
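
The printed matrix has no row or column labels. When `labels` is not passed, `confusion_matrix` orders the classes by sorted label, which here means High, Low, Medium, Very High; the row sums (56, 77, 69, 66) match the support column of the report. The sketch below copies the matrix from the output above, attaches those labels, and recomputes accuracy, recall, and precision. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sorted label order, as used when no explicit labels are given
labels = ['High', 'Low', 'Medium', 'Very High']

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = np.array([
    [45,  0,  9,  2],
    [ 0, 74,  3,  0],
    [ 1,  5, 63,  0],
    [ 7,  6,  1, 52],
])

cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)
print(cm_df)

# Accuracy is the diagonal (correct predictions) over all samples
accuracy = np.trace(cm) / cm.sum()

# Recall divides the diagonal by row sums; precision divides it by column sums
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

print("accuracy:", round(float(accuracy), 2))
print("recall:", np.round(recall, 2))
print("precision:", np.round(precision, 2))
```

Read this way, most of the errors for the High class are rows predicted as Medium (9 of 56), and the Very High class is most often confused with High and Low (7 and 6 of 66).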


In [31]:
# Comparing Prediction Results

# Build a DataFrame for the comparison
results = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred
})

print("Prediction results (first 50 samples):")
print(results.head(50))
print(f"\nCorrect predictions: {(results['Actual'] == results['Predicted']).sum()} of {len(results)}")

Prediction results (first 50 samples):
       Actual  Predicted
0      Medium     Medium
1      Medium     Medium
2   Very High  Very High
3      Medium     Medium
4   Very High  Very High
5         Low     Medium
6         Low        Low
7        High       High
8         Low        Low
9        High       High
10  Very High  Very High
11     Medium     Medium
12        Low        Low
13  Very High  Very High
14  Very High  Very High
15  Very High  Very High
16       High     Medium
17  Very High  Very High
18     Medium     Medium
19  Very High  Very High
20     Medium     Medium
21     Medium     Medium
22        Low        Low
23        Low        Low
24       High       High
25       High       High
26       High       High
27  Very High        Low
28       High     Medium
29        Low        Low
30       High     Medium
31       High       High
32        Low        Low
33     Medium     Medium
34        Low        Low
35     Medium     Medium
36        Low        Low
37     Medium 

In [32]:
# Predictions for the Entire Dataset

# Predict the cost category for all rows. This includes the rows used for
# training, so agreement with charges_category here overstates how well the
# model performs on unseen data.
all_predictions = classifier.predict(X)

# Add the predictions to the dataset
insurance['predicted_category'] = all_predictions

# Export to a CSV file
insurance.to_csv('hasil_prediksi_insurance.csv', index=False)

print("Dataset with predictions:")
print(insurance[['age', 'sex', 'bmi', 'smoker', 'charges', 'charges_category', 'predicted_category']].head(20))
print("\nFile 'hasil_prediksi_insurance.csv' has been saved!")

Dataset with predictions:
    age     sex     bmi smoker      charges charges_category  \
0    19  female  27.900    yes  16884.92400        Very High   
1    18    male  33.770     no   1725.55230              Low   
2    28    male  33.000     no   4449.46200              Low   
3    33    male  22.705     no  21984.47061        Very High   
4    32    male  28.880     no   3866.85520              Low   
5    31  female  25.740     no   3756.62160              Low   
6    46  female  33.440     no   8240.58960           Medium   
7    37  female  27.740     no   7281.50560           Medium   
8    37    male  29.830     no   6406.41070           Medium   
9    60  female  25.840     no  28923.13692        Very High   
10   25    male  26.220     no   2721.32080              Low   
11   62  female  26.290    yes  27808.72510        Very High   
12   23    male  34.400     no   1826.84300              Low   
13   56  female  39.820     no  11090.71780             High   
14   27    male

In [33]:
# Testing the Model with New Data

# Create new sample data to predict
# Format: age, sex, bmi, children, smoker, region (no charges column; the model predicts its category)
new_data = pd.DataFrame({
    'age': [25, 45, 55, 30, 65],
    'sex': ['male', 'female', 'male', 'female', 'male'],
    'bmi': [24.5, 28.3, 32.1, 26.8, 35.2],
    'children': [0, 2, 1, 3, 0],
    'smoker': ['no', 'no', 'yes', 'no', 'yes'],
    'region': ['northeast', 'southwest', 'northwest', 'southeast', 'northeast']
})

print("New Data for Prediction:")
print(new_data)

# Encode the new data with the same encoders that were fitted on the training features
new_data_encoded = new_data.copy()
new_data_encoded['sex'] = le_sex.transform(new_data_encoded['sex'])
new_data_encoded['smoker'] = le_smoker.transform(new_data_encoded['smoker'])
new_data_encoded['region'] = le_region.transform(new_data_encoded['region'])

# Predict the cost category
predictions = classifier.predict(new_data_encoded)

# Show the prediction results
results_new = pd.DataFrame({
    'Age': new_data['age'],
    'Sex': new_data['sex'],
    'BMI': new_data['bmi'],
    'Smoker': new_data['smoker'],
    'Region': new_data['region'],
    'Predicted_Category': predictions
})

print("\n" + "=" * 80)
print("INSURANCE COST CATEGORY PREDICTION RESULTS")
print("=" * 80)
print(results_new)
print("\nLegend:")
print("- Low: low insurance cost")
print("- Medium: medium insurance cost")
print("- High: high insurance cost")
print("- Very High: very high insurance cost")

New Data for Prediction:
   age     sex   bmi  children smoker     region
0   25    male  24.5         0     no  northeast
1   45  female  28.3         2     no  southwest
2   55    male  32.1         1    yes  northwest
3   30  female  26.8         3     no  southeast
4   65    male  35.2         0    yes  northeast

INSURANCE COST CATEGORY PREDICTION RESULTS
   Age     Sex   BMI Smoker     Region Predicted_Category
0   25    male  24.5     no  northeast                Low
1   45  female  28.3     no  southwest             Medium
2   55    male  32.1    yes  northwest          Very High
3   30  female  26.8     no  southeast             Medium
4   65    male  35.2    yes  northeast          Very High

Legend:
- Low: low insurance cost
- Medium: medium insurance cost
- High: high insurance cost
- Very High: very high insurance cost
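
Encoding new rows by hand works, but the encoders and the classifier have to be kept in sync. One possible alternative, sketched below, bundles encoding and the tree into a single scikit-learn `Pipeline`. This is not the notebook's method: the training rows and labels are hypothetical toy values, `OrdinalEncoder` is swapped in for the per-column `LabelEncoder` objects, and the names `preprocess` and `model` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows that reuse the column names of insurance.csv
train = pd.DataFrame({
    'age': [19, 23, 45, 52, 60, 33, 28, 64],
    'sex': ['female', 'male', 'female', 'male', 'female', 'male', 'female', 'male'],
    'bmi': [22.0, 31.5, 27.3, 35.1, 29.9, 24.4, 33.0, 30.2],
    'children': [0, 1, 2, 0, 3, 1, 0, 2],
    'smoker': ['no', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'no'],
    'region': ['northeast', 'southwest', 'northwest', 'southeast',
               'northeast', 'southwest', 'northwest', 'southeast'],
})
# Hypothetical labels, made up for this sketch
toy_labels = ['Low', 'Low', 'Medium', 'Very High', 'Very High', 'Low', 'High', 'Medium']

# Encode the string columns; pass the numeric columns through unchanged
categorical = ['sex', 'smoker', 'region']
preprocess = ColumnTransformer(
    [('cat', OrdinalEncoder(), categorical)],
    remainder='passthrough',
)

model = Pipeline([
    ('preprocess', preprocess),
    ('tree', DecisionTreeClassifier(max_depth=5, random_state=42)),
])
model.fit(train, toy_labels)

# Raw, unencoded rows can be passed straight to predict
new_rows = pd.DataFrame({
    'age': [25, 65],
    'sex': ['male', 'male'],
    'bmi': [24.5, 35.2],
    'children': [0, 0],
    'smoker': ['no', 'yes'],
    'region': ['northeast', 'northeast'],
})
preds = model.predict(new_rows)
print(preds)
```

With this layout the fitted pipeline is the only object that needs to be saved or reused, and a category that was never seen during fitting raises an error instead of being encoded silently.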
