---

## 02. Explanations for the observation with the worst prediction

---

This notebook explains the final model's prediction for the test-set observation with the largest difference between the actual and predicted values.

---

### Import packages

In [1]:
import functions as fun
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import dalex as dx
import lime
import lime.lime_tabular

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from ceteris_paribus.explainer import explain
from ceteris_paribus.profiles import individual_variable_profile
from ceteris_paribus.plots.plots import plot_notebook, plot
from ceteris_paribus.select_data import select_neighbours
from sklearn.inspection import permutation_importance

### Load the model to be explained

In [2]:
# Raw string keeps the backslashes literal; the context manager closes the file
with open(r".\..\models\XGB\MEPS_xgb_model_final_v2.pickle", "rb") as pickle_in:
    reg_xgb = pickle.load(pickle_in)

### Read data

In [3]:
path = r".\..\data\MEPS_data_preprocessed"  # raw string avoids invalid escape sequences
X_train, y_train = fun.read_x_y(path + "_train.csv", "HEALTHEXP")
X_test, y_test = fun.read_x_y(path + "_test.csv", "HEALTHEXP")

In [4]:
raw_test_data = pd.read_csv(path + "_test.csv")
X_test_raw = raw_test_data.drop("HEALTHEXP", axis = 1)
y_test_raw = raw_test_data["HEALTHEXP"]

### Find and explore the observation with the worst prediction

In [5]:
y_pred_test = reg_xgb.predict(X_test)
obs_idx = fun.find_nth_obs_idx(y_test, y_pred_test, -1)
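
The helper `fun.find_nth_obs_idx` is not shown in this notebook. A minimal sketch of the idea, assuming it ranks observations by absolute error and that `-1` selects the largest one, could look like the following. The function name `find_nth_error_idx` and the toy arrays are hypothetical and only illustrate the technique.

```python
import numpy as np

def find_nth_error_idx(y_true, y_pred, n=-1):
    """Return the position of the n-th observation when sorted by absolute error (ascending).

    n = -1 gives the worst prediction, n = 0 the best one.
    This is an assumed behaviour, not the actual implementation in `functions`.
    """
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    order = np.argsort(abs_err)  # positions sorted from smallest to largest error
    return int(order[n])

# Toy data for illustration only
y_true_demo = np.array([0.0, 5.0, 7.5, 6.0])
y_pred_demo = np.array([8.5, 5.2, 7.0, 6.1])

worst = find_nth_error_idx(y_true_demo, y_pred_demo, -1)
best = find_nth_error_idx(y_true_demo, y_pred_demo, 0)
print(worst, best)
```

The function returns a position rather than an index label, so it pairs with positional access such as `X_test.iloc[[worst], :]`.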

In [6]:
# Real value (target transformed with log base 3)
y_test[obs_idx]

0.0

In [7]:
# Real value (raw)
y_test_raw[obs_idx]

0

In [8]:
# Predicted value
y_pred_test[obs_idx]

8.55008543808286
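
Because the target is on a log-base-3 scale, the size of this error is easier to judge after converting the prediction back to the raw scale. The sketch below assumes the transform was `log3(x + 1)`, which is consistent with a raw value of 0 mapping to 0.0 but is not confirmed by the notebook. The helper names `to_log3` and `from_log3` are hypothetical.

```python
import numpy as np

def to_log3(x):
    # Assumed forward transform: log base 3 of (x + 1)
    return np.log1p(x) / np.log(3)

def from_log3(y):
    # Inverse of the assumed transform
    return np.power(3.0, y) - 1.0

y_true_log = 0.0
y_pred_log = 8.55008543808286  # prediction reported above

abs_error_log = abs(y_pred_log - y_true_log)
y_pred_raw = from_log3(y_pred_log)

print(f"absolute error on the log scale: {abs_error_log:.2f}")
print(f"prediction on the raw scale: {y_pred_raw:,.0f}")
```

Under that assumption, the model predicts expenditure on the order of twelve thousand for a person whose recorded expenditure is zero.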

In [9]:
# Observation
X_test.iloc[[obs_idx], :]

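| Unnamed: 0 | REGION | AGE31X | GENDER | RACE3 | MARRY31X | EDRECODE | FTSTU31X | ACTDTY31 | HONRDC31 | RTHLTH31 | ... | DFSEE42 | ADSMOK42 | PCS42 | MCS42 | K6SUM42 | PHQ242 | EMPST31 | POVCAT15 | INSCOV15 | INCOME_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3803 | 1 | 76 | 1.0 | 0.0 | 2 | 13 | -1 | 4 | 2 | 5 | ... | 1 | -1 | -1.0 | -1.0 | -1 | -1 | 4 | 5 | 2 | 12345.0 |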


**Meaning of variables and their values:** <br/>
* REGION = 1.0 	 - 	 census region was northeast
* AGE31X = 76.0 	 - 	 76 years old
* GENDER = 1.0 	 - 	 female
* RACE3 = 0.0 	 - 	 
* MARRY31X = 2.0 	 - 	 widowed
* EDRECODE = 13.0 	 - 	 GED or high school degree
* FTSTU31X = -1.0 	 - 	 student status - inapplicable
* ACTDTY31 = 4.0 	 - 	 military full-time active duty - over 59 - inapplicable
* HONRDC31 = 2.0 	 - 	 not honorably discharged from military
* RTHLTH31 = 5.0 	 - 	 perceived health status - poor
* MNHLTH31 = 4.0 	 - 	 perceived mental health status - fair
* HIBPDX = 1.0 	 - 	 high blood pressure diagnosed
* CHDDX = 1.0 	 - 	 coronary heart disease diagnosed
* ANGIDX = 2.0 	 - 	 angina wasn't diagnosed
* MIDX = 2.0 	 - 	 heart attack wasn't diagnosed
* OHRTDX = 2.0 	 - 	 any other heart diseases weren't diagnosed
* STRKDX = 2.0 	 - 	 stroke wasn't diagnosed
* EMPHDX = 1.0 	 - 	 emphysema diagnosed
* CHBRON31 = 1.0 	 - 	 chronic bronchitis diagnosed
* CHOLDX = 1.0 	 - 	 high cholesterol diagnosed
* CANCERDX = 1.0 	 - 	 cancer diagnosed
* DIABDX = 2.0 	 - 	 diabetes wasn't diagnosed
* JTPAIN31 = 1.0 	 - 	 joint pain in the last 12 months
* ARTHDX = 1.0 	 - 	 arthritis diagnosed
* ARTHTYPE = 1.0 	 - 	 type of arthritis - rheumatoid
* ASTHDX = 2.0 	 - 	 asthma wasn't diagnosed
* ADHDADDX = -1.0 	 - 	 ADHD or ADD diagnosis - inapplicable
* PREGNT31 = -1.0 	 - 	 pregnant - inapplicable
* WLKLIM31 = 1.0 	 - 	 has limitation in physical functioning
* ACTLIM31 = 1.0 	 - 	 has a limitation in work, housework, or school
* SOCLIM31 = 1.0 	 - 	 has social limitation
* COGLIM31 = 1.0 	 - 	 has cognitive limitation
* DFHEAR42 = 2.0 	 - 	 no serious difficulty hearing
* DFSEE42 = 1.0 	 - 	 has serious difficulty seeing, even with glasses
* ADSMOK42 = -1.0 	 - 	 doesn't smoke
* PCS42 = -1.0 	 - 	 SAQ physical component summary (SF-12v2, imputed) - inapplicable
* MCS42 = -1.0 	 - 	 mental component summary (SF-12v2, imputed) - inapplicable
* K6SUM42 = -1.0 	 - 	 overall rating of feelings - inapplicable (last 30 days)
* PHQ242 = -1.0 	 - 	 overall rating of feelings - inapplicable (last 2 weeks)
* EMPST31 = 4.0 	 - 	 employment status - ?
* POVCAT15 = 5.0 	 - 	 family income as % of poverty line - high income
* INSCOV15 = 2.0 	 - 	 health insurance coverage indicator 2015 - public only
* INCOME_M = 12345.0 - 	 person total income = 12345.0 
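
The decoding in the list above can also be kept in code, so that any observation can be printed in readable form. The following is a minimal sketch: the `VALUE_LABELS` dictionary and the `describe` function are hypothetical (they are not part of the project's `functions` module), and the labels cover only a few variables copied from the list.

```python
# Hypothetical lookup table: variable name -> {coded value: label}
VALUE_LABELS = {
    "REGION": {1: "northeast"},
    "GENDER": {1: "female"},
    "MARRY31X": {2: "widowed"},
    "RTHLTH31": {5: "poor"},
    "MNHLTH31": {4: "fair"},
    "INSCOV15": {2: "public only"},
}

def describe(row):
    """Return 'NAME = value - label' lines; values without a known label keep only the number."""
    lines = []
    for name, value in row.items():
        label = VALUE_LABELS.get(name, {}).get(value)
        lines.append(f"{name} = {value}" + (f" - {label}" if label else ""))
    return lines

# A few values of the observation shown above
obs = {"REGION": 1.0, "AGE31X": 76.0, "GENDER": 1.0, "MARRY31X": 2.0, "RTHLTH31": 5.0, "INSCOV15": 2.0}

for line in describe(obs):
    print(line)
```

Because `describe` only relies on `.items()`, it should also accept a pandas row such as `X_test.iloc[obs_idx]`.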