# Medical cost prediction revisited

Insurance companies need to predict the annual medical cost of an insurance policyholder.

* Target: **annual medical cost**
* Predictors:
    * age, gender, bmi, number of children, smoker/non-smoker, region

Let us load the dataset.

Let's perform **one-hot encoding** for **sex**, **smoker**, **region**.
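As a minimal sketch of this step, the snippet below applies `pd.get_dummies` to a tiny hand-made frame. The rows are made up for illustration and are not taken from the dataset; only the column names `sex`, `smoker`, and `region` come from the text. Using `drop_first=True` is an assumption here, since the notebook does not say whether the first level is dropped.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch.
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 35.2],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
})

# drop_first=True keeps k-1 indicator columns per categorical variable,
# which avoids perfectly collinear columns in a linear model.
encoded = pd.get_dummies(
    df, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
# ['age', 'bmi', 'sex_male', 'smoker_yes', 'region_northwest',
#  'region_southeast', 'region_southwest']
```

Numeric columns pass through unchanged, and each categorical column is replaced by 0/1 indicator columns named `<column>_<level>`.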

Let's separate the feature variables from the target and perform a train-test split.
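The split could look roughly like the following. The data are synthetic, the target column name `charges` is an assumption, and the 80/20 split with `random_state=42` is an illustrative choice rather than the notebook's confirmed setting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance data.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n).round(1),
    "smoker_yes": rng.integers(0, 2, n),
})
# "charges" is a hypothetical name for the annual medical cost column.
df["charges"] = 250 * df["age"] + 20000 * df["smoker_yes"] + rng.normal(0, 2000, n)

X = df.drop(columns="charges")   # feature matrix
y = df["charges"]                # target vector

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```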

## Linear Regression

Let's fit a linear regression model on all variables. We will:
* Fit the model on the training data
* Predict on the test data
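The two steps above can be sketched as follows. To keep the example self-contained, `make_regression` generates synthetic data in place of the insurance features; the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the cost target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)      # estimate coefficients from the training rows only
y_pred = lin_reg.predict(X_test)   # predictions for rows the model has not seen

print(lin_reg.coef_.round(2))
print(y_pred[:3].round(2))
```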

Now we will calculate the root mean squared error (RMSE) and the $R^2$ score.
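To make the two metrics concrete, here is a small worked example on three hand-picked values (not model output), using `mean_squared_error` and `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 1 - 5 / 8

print(f"RMSE = {rmse:.3f}, R^2 = {r2:.3f}")  # RMSE = 1.291, R^2 = 0.375
```

In the notebook's setting, `y_true` would be the test targets and `y_pred` the model's predictions on the test features.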

## KNN for Regression

We need the `KNeighborsRegressor` class from `sklearn.neighbors` to create a KNN model for regression. Recall that we need to specify the number of neighbors.

We will fit the model on the training data and predict on the test data.

Let's again compute the RMSE and $R^2$ score.
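The KNN steps above might look like this end to end. The data are synthetic, and `n_neighbors=5` is an illustrative choice (it is also the scikit-learn default), not necessarily the value used in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the encoded features and the cost target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

knn = KNeighborsRegressor(n_neighbors=5)  # number of neighbors to average over
knn.fit(X_train, y_train)                 # stores the training data
y_pred = knn.predict(X_test)              # mean target of the 5 nearest training rows

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```

With the default `weights="uniform"`, each prediction is the plain average of the targets of the nearest training points. Because KNN relies on distances, features on very different scales (such as age and bmi) are often standardized first.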

## Random Forest for Regression

We can use `RandomForestRegressor()` to create a random forest model for regression. The class is included in the `ensemble` module of `sklearn`.
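A minimal sketch on synthetic data follows. The only arguments set explicitly are `n_estimators` (100 is the scikit-learn default) and `random_state` for repeatability; both are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the cost target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)      # grows 100 trees on bootstrap samples of the training data
y_pred = rf.predict(X_test)   # averages the predictions of the individual trees
print(y_pred[:3].round(2))
```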

Note that `RandomForestRegressor()` has similar arguments to `RandomForestClassifier()`. In the above, most arguments are set to their default values. See the [documentation of RandomForestRegressor](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html) for details.

Now evaluate the performance on the test data.

## XGBoost for Regression

XGBoost stands for Extreme Gradient Boosting and is based on the gradient boosting framework.
* Builds decision trees iteratively, with each tree correcting the residuals (errors) of previous trees.
* Highly optimized for performance with features like parallel computation, tree pruning (max depth), and out-of-core computing for large datasets.

XGBoost is not part of `sklearn`. Use the following command to install the library (it only needs to be installed once on your computer).

```python
!pip install xgboost
```

Now we can import the `xgboost` library and use it to create an XGBoost model.
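A sketch of what this might look like, using the scikit-learn style `XGBRegressor` wrapper on synthetic data. The hyperparameter values are illustrative assumptions, and the import is guarded so that the snippet still runs on a machine where `xgboost` has not been installed yet.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor
except ImportError:
    XGBRegressor = None  # xgboost is a separate install; see the pip command above

# Synthetic stand-in for the encoded features and the cost target.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

if XGBRegressor is not None:
    # Only a few arguments are set; everything else keeps its default.
    xgb_reg = XGBRegressor(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    )
    xgb_reg.fit(X_train, y_train)
    y_pred = xgb_reg.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```

`XGBRegressor` follows the same `fit` / `predict` pattern as the scikit-learn models used earlier, so the evaluation code carries over unchanged.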

In the above, we have omitted most of the arguments, effectively setting them to their default values. See the [documentation of XGBoost](https://xgboost.readthedocs.io/en/stable/parameter.html) for the full list of arguments.

We can use the `plot_importance()` function of the `xgboost` library to visualize the importance of the features.

Arguments to this function include:
* The trained model
* The type of importance:
    * `gain`: Average improvement in model performance brought by a feature
    * `weight`: Number of times a feature is used in splits
    * `cover`: Average number of data samples (instances) impacted by splits that involve a particular feature
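The three importance types can be compared with a sketch like the one below. It fits a small model on synthetic data with made-up column names, reads the raw scores with `get_booster().get_score()`, and draws the corresponding bar charts with `xgb.plot_importance()`. The import guard is there because plotting needs both `xgboost` and `matplotlib` to be installed.

```python
import numpy as np
import pandas as pd

try:
    import matplotlib.pyplot as plt
    import xgboost as xgb
except ImportError:
    xgb = None  # plotting needs both matplotlib and xgboost

# Synthetic stand-in for the insurance data; column names are illustrative.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker_yes": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker_yes"] + rng.normal(0, 2000, n)

scores = {}
if xgb is not None:
    model = xgb.XGBRegressor(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
    for kind in ("gain", "weight", "cover"):
        # Raw numbers behind the chart: one score per feature used in a split.
        scores[kind] = model.get_booster().get_score(importance_type=kind)
        # Bar chart of the same scores; returns a matplotlib Axes.
        xgb.plot_importance(
            model, importance_type=kind, title=f"Feature importance ({kind})"
        )
    print(scores["gain"])
```

If `importance_type` is not given, `plot_importance()` uses `weight`. The rankings can differ between types: a feature used in many shallow refinements can score high on `weight` but low on `gain`.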