# Number 1

This Python module contains a regression procedure for the *insurance* dataset, sourced from
https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download

The machine learning task in this problem is carried out with the SVR (*Support Vector Regression*) algorithm. Before writing the program, I applied **Label Encoding** to the *insurance* data, converting the following columns:

| No | Column   | Encoding                         |
|----|----------|----------------------------------|
| 1  | `sex`    | 0 means *female*, 1 means *male* |
| 2  | `smoker` | 0 means *no*, 1 means *yes*      |

The data used is in `SOURCE_Number_1.csv`.
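
The encoding in the table can be sketched with pandas as follows. This is only a minimal illustration, not the exact preprocessing I ran: the small `raw` frame and the `SEX_CODES` / `SMOKER_CODES` dictionaries are hypothetical names, and it assumes the raw file stores these columns as the strings shown in the table.

```python
import pandas

# Hypothetical sample in the raw (string) form, before encoding
raw = pandas.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

# Mappings taken from the table above
SEX_CODES = {'female': 0, 'male': 1}
SMOKER_CODES = {'no': 0, 'yes': 1}

# Series.map replaces each string with its integer code
encoded = raw.assign(
    sex=raw['sex'].map(SEX_CODES),
    smoker=raw['smoker'].map(SMOKER_CODES),
)
print(encoded)
```

Any string that is missing from a mapping would become `NaN`, so a quick `encoded.isna().sum()` afterwards is a cheap sanity check.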

In [39]:
# Importing required libraries

# pandas is used for dataset manipulation
import pandas

# scikit-learn is the machine learning library
from sklearn.model_selection import train_test_split # Train/test split
from sklearn.svm import SVR # Support Vector Regression
from sklearn.pipeline import make_pipeline # Pipeline that wraps the model
from sklearn import metrics # Performance metrics for evaluation

In [40]:
# load csv data
dataset = pandas.read_csv('SOURCE_Number_1.csv')

# Preview data from SOURCE_Number_1.csv
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552


In [41]:
# Check if there is any missing data
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   int64  
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   int64  
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(4), object(1)
memory usage: 73.3+ KB


In [42]:
# Count rows per region
dataset['region'].value_counts()

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64

In [43]:
# Perform One Hot Encoding for region data
encode_region = pandas.get_dummies(dataset['region'], prefix = 'region')

# Append the encoded columns to the current table
dataset = pandas.concat([dataset, encode_region], axis = 1)

# Preview current dataset after One Hot Encoding
dataset.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,region_northeast,region_northwest,region_southeast,region_southwest
0,19,0,27.9,0,1,southwest,16884.924,0,0,0,1
1,18,1,33.77,1,0,southeast,1725.5523,0,0,1,0
2,28,1,33.0,3,0,southeast,4449.462,0,0,1,0
3,33,1,22.705,0,0,northwest,21984.47061,0,1,0,0
4,32,1,28.88,0,0,northwest,3866.8552,0,1,0,0
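
As a possible alternative to the two-step approach above, `pandas.get_dummies` can encode a column of a whole frame in one call through its `columns` argument, which also drops the original string column. The sketch below uses the `age` and `region` values of the five preview rows; the `sample` and `one_hot` names are hypothetical. `dtype=int` is set so the indicators are 0/1 as in the table above, since newer pandas versions return booleans by default.

```python
import pandas

# Stand-in frame built from the five preview rows (age and region only)
sample = pandas.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
})

# Encode 'region' in place: the string column is replaced by indicator columns
one_hot = pandas.get_dummies(sample, columns=['region'], prefix='region', dtype=int)
print(one_hot.columns.tolist())
```

Only regions that occur in the data get a column, so this small sample yields no `region_northeast` column.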


In [44]:
# Training Preparation

# Select the input (feature) columns
train_input = dataset[
    [
        'age', 'sex', 'bmi', 'children',
        'smoker', 'region_northwest', 'region_northeast',
        'region_southwest', 'region_southeast'
    ]
]

# Select the output (target) column
train_target = dataset[['charges']]

# Split training and testing data
x_train, x_test, y_train, y_test = train_test_split(train_input, train_target, test_size = 0.2)
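
The split above is drawn at random on every run, so the error values reported later will differ between runs. A minimal sketch of a reproducible variant follows. It uses the five preview rows as a stand-in for the full dataset, and the seed value `42` and the variable names are illustrative assumptions, not part of my original run. It also passes the target as a one-dimensional Series, which is the shape scikit-learn estimators expect for `y`.

```python
import pandas
from sklearn.model_selection import train_test_split

# Stand-in frame built from the five preview rows
demo = pandas.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

features = demo[['age', 'bmi']]
target = demo['charges']  # 1-D Series instead of a one-column DataFrame

# Fixing random_state makes the split repeatable (42 is an arbitrary seed)
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Splitting again with the same seed selects the same rows
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    features, target, test_size=0.2, random_state=42)
print(x_te.index.tolist() == x_te2.index.tolist())
```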

In [45]:
# Train with each of the three kernels required by the question.
for kernel_config in ['linear', 'poly' , 'rbf']:
    print("Starting train using kernel [", kernel_config, "]")

    # Training Session
    regression_model = make_pipeline(SVR(kernel = kernel_config, C = 1.0, epsilon = 0.2))
    regression_model.fit(x_train, y_train)
    regression_output_train = regression_model.predict(x_train)

    # Testing Session
    regression_test_output = regression_model.predict(x_test)

    # Calculate error for training and testing
    # scikit-learn metrics take (y_true, y_pred); MAE and MSE are symmetric, so the values are unchanged
    print("Training Error MAE [", kernel_config, "]: ", metrics.mean_absolute_error(y_train, regression_output_train))
    print("Testing Error MAE [", kernel_config, "]: ", metrics.mean_absolute_error(y_test, regression_test_output))

    print("Training Error MSE [", kernel_config, "]: ", metrics.mean_squared_error(y_train, regression_output_train))
    print("Testing Error MSE [", kernel_config, "]: ", metrics.mean_squared_error(y_test, regression_test_output))

    print("\n\n")



Starting train using kernel [ linear ]
Training Error MAE [ linear ]:  6669.3181694361665
Testing Error MAE [ linear ]:  6152.955913866509
Training Error MSE [ linear ]:  170374514.21149662
Testing Error MSE [ linear ]:  150210749.54572812



Starting train using kernel [ poly ]
Training Error MAE [ poly ]:  8092.947230877517
Testing Error MAE [ poly ]:  7761.531984669477
Training Error MSE [ poly ]:  162367588.9016269
Testing Error MSE [ poly ]:  144660795.05777678



Starting train using kernel [ rbf ]




Training Error MAE [ rbf ]:  8375.503562498256
Testing Error MAE [ rbf ]:  8071.790650365266
Training Error MSE [ rbf ]:  163865960.70465785
Testing Error MSE [ rbf ]:  146717776.88453314
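
The MSE values are in squared units of `charges`, which makes them hard to compare with the MAE. The small sketch below converts the testing MSE values printed above into RMSE, which is in the same units as `charges`. The names `test_mse` and `test_rmse` are illustrative choices, and the numbers are copied from the output above.

```python
import math

# Testing MSE per kernel, copied from the output above
test_mse = {
    'linear': 150210749.54572812,
    'poly': 144660795.05777678,
    'rbf': 146717776.88453314,
}

# RMSE is the square root of MSE, so it is in the same units as charges
test_rmse = {kernel: math.sqrt(mse) for kernel, mse in test_mse.items()}

for kernel, rmse in test_rmse.items():
    print(f"Testing RMSE [ {kernel} ]: {rmse:.2f}")
```

RMSE is never smaller than MAE on the same predictions, and a large gap between the two usually points to a few very large errors dominating the squared metric.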



