# Problem Set 7

See [Visualization Rules](https://datascience.quantecon.org/applications/visualization_rules.html) and [Regression](https://datascience.quantecon.org/applications/regression.html).

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy
import sklearn
import sklearn.model_selection
import sklearn.ensemble

%matplotlib inline

This problem set uses data on insuree characteristics and medical costs. The dataset is in the public domain and was downloaded from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance).

The variables in the data are:
- age: age of the primary beneficiary
- sex: gender of the insurance contractor (female or male)
- bmi: body mass index of the primary beneficiary
- children: number of children (dependents) covered by health insurance
- smoker: whether the primary beneficiary smokes
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
- charges: medical costs billed by health insurance

You will build a model to predict charges given the other variables in the data.


In [3]:
insure = pd.read_csv("https://raw.githubusercontent.com/ubcecon/ECON323_2022_Spring/main/problem_sets/insurance.csv")
insure.head()

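|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |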


## Questions 1-3

These questions are intentionally open-ended. For each one, carefully choose the type of visualization you’ll create. Put some effort into choosing colors, labels, and other formatting.

### Question 1

Create a visualization showing the relationship between smoking and medical costs. 

In [3]:
# your code here

### Question 2

Create a visualization showing the relationship between BMI and medical costs. 

In [4]:
# your code here

### Question 3

Does the relationship between medical costs and BMI vary with gender? Create a visualization to answer this question.

In [5]:
# your code here
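
One way to approach a question like this is to draw one scatter layer and one fitted line per group on shared axes. The sketch below does so on synthetic data; the column names `group`, `x`, and `y` and all values are made up for illustration and are not taken from the insurance data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: two groups with the same slope but different intercepts (made up).
rng = np.random.default_rng(2)
n = 100
demo = pd.DataFrame({
    "group": rng.choice(["a", "b"], size=n),
    "x": rng.uniform(15, 45, size=n),
})
shift = np.where(demo["group"] == "a", 500.0, 0.0)
demo["y"] = 100 * demo["x"] + shift + rng.normal(scale=200, size=n)

fig, ax = plt.subplots(figsize=(6, 4))
for name, sub in demo.groupby("group"):
    # One scatter layer per group, so each group gets its own color and legend entry.
    ax.scatter(sub["x"], sub["y"], alpha=0.6, label=f"group {name}")
    # Least-squares line fitted within the group only.
    slope, intercept = np.polyfit(sub["x"], sub["y"], deg=1)
    xs = np.linspace(sub["x"].min(), sub["x"].max(), 50)
    ax.plot(xs, slope * xs + intercept)

ax.set_xlabel("x (made-up units)")
ax.set_ylabel("y (made-up units)")
ax.set_title("Does the x-y relationship differ by group?")
ax.legend(title="group")
# In a notebook with %matplotlib inline the figure renders automatically.
```

If the fitted lines are roughly parallel, the relationship does not vary much across groups; clearly different slopes would suggest an interaction.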

## Questions 4-6

In these questions you will build and evaluate a model to predict medical costs. 

First, we divide the data into training and testing sets. 

In [4]:
train = insure.sample(frac=0.8, random_state=42)
test = insure.drop(train.index)
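
The `sample`/`drop` pattern above can be checked on a small made-up DataFrame (`toy` and its values are assumptions for illustration): the two pieces should be disjoint and together cover every row.

```python
import pandas as pd

# Toy stand-in for the insurance data (values are made up for illustration).
toy = pd.DataFrame({"age": range(20, 30), "charges": range(1000, 11000, 1000)})

# Same pattern as above: sample 80% of the rows for training ...
toy_train = toy.sample(frac=0.8, random_state=42)
# ... and keep every row whose index label was not sampled for testing.
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
# No index label appears in both pieces.
print(toy_train.index.intersection(toy_test.index).empty)
```

Fixing `random_state` makes the split reproducible, so the training and testing MSEs can be compared across runs.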

Now we create a numeric matrix of features from our dataframe. The formula interface from the patsy package is one convenient method for doing this.

In [5]:
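# Build the outcome vector and the design matrix (dummies and interactions) from the full data.
y, X = patsy.dmatrices("charges ~ C(sex)*(age + children + C(smoker) + C(region)) + age:C(smoker)", insure, return_type='matrix')
y = y.flatten()
# insure has the default 0..n-1 index, so index labels double as row positions here.
y_train = y[train.index]
X_train = X[train.index]
y_test = y[test.index]
X_test = X[test.index]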
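
To see what a formula like the one above expands to, here is a minimal sketch on a made-up DataFrame (the names `toy`, `g`, `x`, and `y` are hypothetical). In patsy, `a*b` is shorthand for `a + b + a:b`, so `C(g)*x` should yield an intercept, a dummy for the non-reference level of `g`, the column `x`, and their interaction.

```python
import numpy as np
import pandas as pd
import patsy

# Made-up data: one numeric regressor and one two-level categorical variable.
toy = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x": [10, 20, 30, 40, 50, 60],
    "g": ["a", "b", "a", "b", "a", "b"],
})

# Left-hand side becomes the outcome, right-hand side the numeric design matrix.
y_toy, X_toy = patsy.dmatrices("y ~ C(g)*x", toy, return_type="matrix")

# design_info records which column corresponds to which term.
column_names = X_toy.design_info.column_names
print(column_names)
print(np.asarray(X_toy))
```

Inspecting `design_info.column_names` on the real `X` is a quick way to confirm which interactions the formula produced.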

### Question 4

Fit a LASSO model to the training data. Use cross-validation to choose the regularization parameter `alpha`. Print the MSE on the training and testing data.

In [23]:
# your code here
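
As a rough illustration of the workflow in Question 4, the sketch below fits a cross-validated LASSO on synthetic data. It assumes `LassoCV` from `sklearn.linear_model` as one possible tool (not necessarily the intended one), and the arrays `X_demo` and `y_demo` are made up.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# Synthetic regression problem: only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Simple holdout split: first 160 rows for training, last 40 for testing.
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# LassoCV tries a path of alpha values and keeps the one with the best 5-fold CV error.
lasso_demo = LassoCV(cv=5).fit(X_tr, y_tr)
print("chosen alpha:", lasso_demo.alpha_)

mse_tr = mean_squared_error(y_tr, lasso_demo.predict(X_tr))
mse_te = mean_squared_error(y_te, lasso_demo.predict(X_te))
print("training MSE:", mse_tr)
print("testing MSE:", mse_te)
```

Note that a patsy design matrix already contains an intercept column, while `LassoCV` fits its own intercept by default, so the coefficient on that column will typically be zero.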

### Question 5

Fit a random forest to the training data. Use cross-validation to select the maximum tree depth. (The provided code shows how to do this.) Print the MSE on the training and testing data.

In [66]:
param_grid = { 'max_depth': [1, 2, 4, 8, 16, None] } # depths to try, you could change this
rf_grid = sklearn.model_selection.GridSearchCV(sklearn.ensemble.RandomForestRegressor(n_estimators=25), 
                                               param_grid, cv=6, scoring='neg_mean_squared_error')
rf_grid.fit(X_train, y_train)
rf_best = rf_grid.best_estimator_ 

# you still need to print the MSE
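
For reference, the sketch below runs the same kind of grid search on synthetic data and then reads off the selected depth, the cross-validated MSE, and the training and testing MSE. The data and the names `grid_demo`, `X_demo`, and `y_demo` are made up; `best_params_` and `best_score_` are attributes of a fitted `GridSearchCV`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Synthetic nonlinear regression problem (made up for illustration).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-2, 2, size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

depths = [1, 4, None]
grid_demo = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=0),
    {"max_depth": depths},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_demo.fit(X_tr, y_tr)

# The score is a negated MSE, so flip the sign to report it as an MSE.
print("selected depth:", grid_demo.best_params_["max_depth"])
print("cross-validated MSE:", -grid_demo.best_score_)

best_demo = grid_demo.best_estimator_
mse_tr = mean_squared_error(y_tr, best_demo.predict(X_tr))
mse_te = mean_squared_error(y_te, best_demo.predict(X_te))
print("training MSE:", mse_tr)
print("testing MSE:", mse_te)
```

Comparing the training MSE with the testing MSE gives a rough sense of how much the selected forest overfits.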

### Question 6 (Optional)

Note: this question is optional and carries no bonus points.

Fit a neural network to the training data. Use cross-validation to select the network architecture (number of layers and widths of layers). Print the MSE for the training and testing data. 

Hint: Use `GridSearchCV` as in the previous question.

In [1]:
# your code here 
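
As one possible starting point for Question 6, the sketch below searches over a few network architectures on synthetic data. It assumes scikit-learn's `MLPRegressor` wrapped in a `Pipeline` with `StandardScaler` (a design choice made here, since neural networks tend to train poorly on unscaled features); the candidate architectures and the data are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic nonlinear regression problem (made up for illustration).
rng = np.random.default_rng(3)
X_demo = rng.uniform(-2, 2, size=(200, 3))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Scaling lives inside the pipeline so it is re-fit on each cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=2000, random_state=0)),
])

# Each tuple lists the layer widths: (10,) is one hidden layer, (10, 10) is two.
architectures = [(10,), (20,), (10, 10)]
nn_grid = GridSearchCV(
    pipe,
    {"mlp__hidden_layer_sizes": architectures},
    cv=3,
    scoring="neg_mean_squared_error",
)
nn_grid.fit(X_tr, y_tr)
nn_best = nn_grid.best_estimator_

print("selected architecture:", nn_grid.best_params_["mlp__hidden_layer_sizes"])
mse_tr = mean_squared_error(y_tr, nn_best.predict(X_tr))
mse_te = mean_squared_error(y_te, nn_best.predict(X_te))
print("training MSE:", mse_tr)
print("testing MSE:", mse_te)
```

The `mlp__` prefix routes the parameter to the pipeline step named `mlp`; without a pipeline, the grid key would simply be `hidden_layer_sizes`.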