# Medical Insurance Charges Prediction

## 1. Install SapientML

In [None]:
!pip install -U sapientml

## 2. Import libraries

In [None]:
import pandas as pd
from sapientml import SapientML
from sapientml.util.logging import setup_logger
from sklearn.metrics import r2_score

## 3. Load Medical Insurance Charges Dataset

The [Medical Insurance Charges Dataset](https://www.kaggle.com/datasets/harishkumardatalab/medical-insurance-price-prediction) includes factors that influence medical expenses, such as the age, gender, and BMI of the policyholders. This tutorial predicts `charges`, the column that holds the medical insurance charges.

In [None]:
train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12617660/train_medical-insurance-prediction.csv")
test_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12617696/test_medical-insurance-prediction.csv")
train_data

## 4. Separate the target column from the test data

In [None]:
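# Keep the ground-truth charges so the predictions can be evaluated later
y_true = test_data["charges"].reset_index(drop=True)
# Remove the target column so that the test data contains only features
test_data.drop(["charges"], axis=1, inplace=True)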
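
As a minimal sketch of this step on made-up data (the three-row frame and its values are hypothetical, not taken from the dataset), `DataFrame.pop` removes the target column and returns it in one call. It is shown here only as an alternative to the `drop(..., inplace=True)` pattern used above:

```python
import pandas as pd

# Hypothetical toy frame standing in for test_data; the values are made up.
toy_test = pd.DataFrame(
    {
        "age": [19, 45, 33],
        "bmi": [27.9, 30.1, 22.7],
        "charges": [1600.0, 8200.0, 4100.0],
    }
)

# pop() removes the column from the frame and returns it as a Series.
toy_y_true = toy_test.pop("charges").reset_index(drop=True)

print(list(toy_test.columns))  # only the feature columns remain
print(toy_y_true.tolist())     # the held-out target values
```

Either way, the target values are kept aside for evaluation and the test frame is left with features only.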

## 5. Generate code

First, instantiate a `SapientML` object with the target columns of the ML task.
In this example, `charges` is the target column, so pass it in a list as the first argument of the constructor.

Second, call `cls.fit()` to generate code for training an ML model and making predictions. It does so by:
1. selecting preprocessors and the three most plausible models,
2. composing their code snippets into three candidate pipelines, and
3. evaluating the pipelines to choose the best one.

In [None]:
cls = SapientML(["charges"])
setup_logger().handlers.clear() # to prevent duplication of logging

cls.fit(train_data)
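
The "evaluate candidates and keep the best" idea in step 3 can be illustrated with a toy sketch. This is not SapientML's internal implementation: the two candidate "models", the data points, and all names below are hypothetical and chosen only to show the selection logic.

```python
def r2(y_true, y_pred):
    """R2 score: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: one-variable least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Made-up data: a training part and a held-out validation part.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
valid_x, valid_y = [7, 8], [14.1, 15.9]

candidates = {"mean_baseline": fit_mean, "linear_fit": fit_line}

# Fit every candidate on the training part and score it on the held-out part.
scores = {}
for name, fit in candidates.items():
    model = fit(train_x, train_y)
    scores[name] = r2(valid_y, [model(x) for x in valid_x])

# Keep the candidate with the highest validation score.
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the straight-line candidate wins because the made-up data is nearly linear. The candidates that SapientML actually considers are complete preprocessing and model pipelines.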

## 6. Prediction

Third, call `cls.predict()` to make predictions on the test data.

In [None]:
y_pred = cls.predict(test_data)
y_pred = y_pred["charges"].rename("charges_pred")

pd.concat([y_pred, y_true], axis=1)

## 7. Show R2 score

Since this is a regression task, you can evaluate the model with the R2 score.

In [None]:
print(f"R2 score: {r2_score(y_true, y_pred)}")
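
To make the metric concrete, here is a small hand-rolled version of the standard R2 definition, `1 - SS_res / SS_tot`, applied to made-up numbers. The helper name `r2_manual` and the sample values are illustrative only. In the notebook itself, `r2_score` from scikit-learn does this work.

```python
def r2_manual(y_true, y_pred):
    """R2 score: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the true values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]      # identical to the true values
mean_only = [2.5, 2.5, 2.5, 2.5]    # always predicts the mean of `actual`

print(r2_manual(actual, perfect))    # 1.0
print(r2_manual(actual, mean_only))  # 0.0
```

A score of 1.0 means the predictions match the true values exactly, 0.0 means the model does no better than predicting the mean, and negative values mean it does worse than that.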

## 8. Get the generated code

The generated code is stored in `cls.model.files`. You can retrieve a specific file by using one of the following filenames as a key:
- `final_script.py`: the best pipeline code for validation
- `final_predict.py`: code for prediction
- `final_train.py`: code for training a model

For further information, please see https://sapientml.readthedocs.io/en/latest/user/usage.html#generated-source-code

In [None]:
print(cls.model.files["final_script.py"].decode("utf-8"))
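
If you want to keep the generated scripts, one option is to write them to disk. The sketch below assumes `cls.model.files` behaves like a dictionary that maps filenames to `bytes`, as the `.decode("utf-8")` call above suggests. It uses a stand-in dictionary with placeholder contents so that it runs on its own, and the helper name `save_generated_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# Stand-in for cls.model.files: filename -> bytes. The contents are placeholders.
files = {
    "final_script.py": b"print('validation')\n",
    "final_train.py": b"print('train')\n",
    "final_predict.py": b"print('predict')\n",
}


def save_generated_files(files, out_dir):
    """Write each (filename, bytes) pair into out_dir and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in files.items():
        target = out_path / name
        target.write_bytes(content)  # contents are bytes, so no encoding step
        written.append(target)
    return written


out_dir = tempfile.mkdtemp()  # temporary directory used only for this demo
written = save_generated_files(files, out_dir)
print(sorted(p.name for p in written))
```

With a fitted model, you would pass `cls.model.files` in place of the stand-in dictionary, provided the assumption about its type holds.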