# Background to osteoarthritis case study

_taken from [narrative seminar Osteoarthritis by Hunter & Bierma-Zeinstra (2019) in the Lancet](https://github.com/jads-nl/execute-nhs-proms/blob/master/references/hunter2019osteaoarthritis.pdf)._

Hip and knee osteoarthritis is a leading cause of disability and source of societal cost in older adults. Global prevalence of osteoarthritis is increasing and the burden of the disease will rise. The medical cost of osteoarthritis in various high-income countries has been estimated to account for between 1% and 2.5% of the gross domestic product of these countries, with hip and knee joint replacements representing the major proportion of these health-care costs.

Joint replacement surgery is a clinically relevant and cost-effective treatment for end-stage osteoarthritis. The characteristics of end-stage osteoarthritis include joint pain, which disrupts normal sleep patterns and causes a severe reduction in capable walking distance and marked restriction of daily activities. Hence, the aim of knee and hip replacements is to alleviate pain and disability in daily functioning.

However, up to 25% of patients presenting for total joint replacement continue to complain of pain and disability 1 year after well performed surgery. With data available on thousands of patients, the question arises to what extent it is possible to predict treatment success. This could be useful in supporting doctors in deciding whether knee replacement is indicated, and could help give patients a more personalised assessment of what to expect of treatment.

# CRISP-DM phase 1 and 2: Business and Data Understanding

This is day 1 from the [5-day JADS NHS PROMs data science case study](https://github.com/jads-nl/execute-nhs-proms/blob/master/README.md).

## Learning objectives


### Business Understanding
- Determine business objectives
- Assess situation
- Determine data mining and machine learning goals


### Data Understanding: descriptive statistics
- Explore Y
- Define Y by combining the results of the exploration with clinical knowledge
- Assess missing values
- Assess data structure
- Explore correlation plot (X, Y)

### Python
- [Using pandas to explore data](https://realpython.com/pandas-python-explore-dataset/)
- [Fundamental stats to describe your data](https://realpython.com/python-statistics/)
- [Reading and writing files with pandas](https://realpython.com/pandas-read-write-files/)
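- Be aware of some `pandas` pitfalls:
  - Know that most `pandas` operations return a new object by default; it is good practice to assign the result explicitly rather than modify in place
  - Use the nullable `Int64` dtype for integer columns with missing values (N/A)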


## Business Understanding
Assume that the primary reasons to replace a knee are to **i) reduce pain** and **ii) improve daily functioning**. The dataset contains several patient-reported outcome measures (PROMs) that can be used to measure the outcome along these two dimensions:
- Oxford Knee Score (OKS): a 12-item questionnaire that assesses daily functioning, i.e. disability due to knee osteoarthritis. Items are scored from [0,4], where higher is better (no disability).
  - OKS questions on pain and night pain, both on a scale from [0,4]
  - OKS total score, higher is better
- EQ5D: a generic quality-of-life PROM along 5 dimensions on a 3-point Likert scale [1,3], lower is better. The dimensions are problems with activity, anxiety, discomfort, mobility and self-care.
- EQ-VAS: general reported health on a scale from [0,100], higher is better.
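
To make the OKS scale concrete, here is a minimal sketch of how a total score could be derived from the 12 items. It assumes the total is the plain sum of the items (giving a range of 0 to 48); the column names `q1` to `q12` and the two response rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical item columns q1..q12 for two illustrative patients (not real data)
items = [f"q{i}" for i in range(1, 13)]
responses = pd.DataFrame(
    [[4] * 12, [1, 0, 2, 1, 1, 2, 0, 1, 2, 1, 1, 0]],
    columns=items,
)

# Assumption: the total score is the plain sum of the 12 items (0 = worst, 48 = best)
oks_total = responses[items].sum(axis=1)
print(oks_total.tolist())
```

In the dataset itself the totals are already available as `oks_t0_score` and `oks_t1_score`, so this is only meant to show where those numbers come from.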

We will explore the PROMs data to see which target variable Y is meaningful for assessing the outcome of knee replacement. Note that PROMs are measured at T0 (prior to surgery) and T1 (six months after surgery).
- [NHS PROMs methodology](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/nhs/proms_guide_v12.pdf)
- [NHS PROMs data dictionary](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/nhs/proms_data_dictionary.pdf)

## Data Understanding


### Getting started with pandas

In [None]:
import warnings
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_selection import chi2, VarianceThreshold
import sklearn.linear_model

# suppressing warnings for readability
warnings.filterwarnings("ignore")

# To plot pretty figures directly within Jupyter
%matplotlib inline

# choose your own style: https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html
# note: from matplotlib 3.6 onwards this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")

# Go to town with https://matplotlib.org/tutorials/introductory/customizing.html
# plt.rcParams.keys()
mpl.rc("axes", labelsize=14, titlesize=14)
mpl.rc("figure", titlesize=20)
mpl.rc("xtick", labelsize=12)
mpl.rc("ytick", labelsize=12)

# constants for figsize
S = (8, 8)
M = (12, 12)
L = (14, 14)

In [None]:
# import data
df = pd.read_parquet('../input/nhs-proms-case-study/data/interim/knee-provider.parquet')

# alternatively, if you have cloned this repository, read local file
# df = pd.read_parquet('../input/nhs-proms-case-study/data/interim/knee-provider.parquet')

#### Display datatypes

In [None]:
df.info()

### Describing the data

In [None]:
# if you have wide tables, adjust these settings to enable scrolling
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 2)

#### pd.DataFrame.describe()

In [None]:
df.describe().transpose()

### Explore possible outcomes Y

In [None]:
proms = ['oks_t0_score', 'oks_t1_score', 'oks_t0_pain', 'oks_t1_pain', 'oks_t0_night_pain', 'oks_t1_night_pain', 't0_discomfort', 't1_discomfort', 't0_eq_vas', 't1_eq_vas', ]
df.loc[:,proms].describe().transpose().round(1)

In [None]:
# inspect with histograms
df.loc[:,proms].hist(figsize=M);

### Missing values and sentinel values
The histograms shown earlier indicate [sentinel values](https://en.wikipedia.org/wiki/Sentinel_value) are used to encode missing values:
* `9` for individual OKS questions
* `999` for EQ VAS

In [None]:
# pd.value_counts is deprecated in recent pandas versions; use the Series method instead
df.loc[:,['t0_eq_vas', 't1_eq_vas']].apply(pd.Series.value_counts).tail()

In [None]:
_no9 = [col for col in df.columns if col.startswith('oks_t') and not col.endswith('score')]
df.loc[:,_no9].apply(pd.Series.value_counts)

### Example: volume per provider per year

In [None]:
# count number of providers
df.provider_code.nunique()


In [None]:
# volume per provider per year
volume_provider_year = df.groupby(['year', 'provider_code'])['procedure'].count().unstack()
volume_provider_year.iloc[:,0:20]

In [None]:
# first 20 providers in the data
volume_provider_year.iloc[:,0:20].plot(figsize=M);

In [None]:
# select 10 largest providers by 2018/19
_year = (df.year == '2018/19')
top10 = df.loc[_year,:].groupby('provider_code').count()['procedure'].sort_values(ascending=False).head(10)
volume_provider_year.loc[:, list(top10.index)].plot(figsize=M);

### Discussion

#### **Question:** what are relevant considerations to handle NAs?
- imputation with mean/median?
- just drop all?
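
As input for this discussion, the following sketch contrasts the two options on a tiny made-up table (the column names match the dataset, the values do not). Dropping rows loses every patient with at least one gap, whereas median imputation keeps all rows but fills the gaps with a value that was never observed for that patient.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not real data)
toy = pd.DataFrame({
    "t0_eq_vas": [50, 70, np.nan, 90],
    "t1_eq_vas": [60, np.nan, 80, 95],
})

dropped = toy.dropna()               # keep only complete rows
imputed = toy.fillna(toy.median())   # fill each column with its own median

print(f"complete rows: {len(dropped)}, rows after imputation: {len(imputed)}")
```

Which option is appropriate depends on why the values are missing, which is exactly the consideration this question asks about.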

#### **Question:** what would you choose as the primary outcome Y?

## Explore Y

### Y = Improvement in Oxford Knee Score

Given the variance between the pre- and postoperative values, we choose Y as the improvement in the Oxford knee score, i.e.

$${\Delta} {OKS} = OKS_{T1} - OKS_{T0}$$

We now need to handle the missing values first, before proceeding with our analysis.


### Data preparation: first iteration

For now we will just drop all rows with one or more missing value. In the next iteration we will look into more sophisticated methods for imputing missing data.

#### Handling and replacing missing values in pandas
Historically, pandas has used several values to represent missing data: `np.nan` for float data, `np.nan` or `None` for object-dtype data, and `pd.NaT` for datetime-like data. Version 1.0 introduced an experimental `pandas.NA` value. Note that Google Colab still uses version 0.25 (as of 2020-03-15).

`.loc` and `.iloc` are the recommended way to access parts of a `pd.DataFrame`. Note that calling a method such as `.replace()` on such a selection **returns a new object** and leaves the original DataFrame unchanged. Hence, to replace values, best practice is to:
- make a new copy
- do the replacements in the copy with explicit assignment

Note that [pandas can handle missing values in integer columns](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) since version 0.24. This is still experimental, so be aware! Your options for handling missing values in integer columns are:
- Convert to `float64`, which is the most robust, old way to do it
- Convert to the new `Int64` dtype (note the capital `I`!)
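
The two options can be compared side by side on a small made-up series (the values are illustrative, not taken from the dataset). This is a sketch of the difference in dtype and in how the missing value is shown, not a recommendation for either option.

```python
import numpy as np
import pandas as pd

# Illustrative EQ-VAS values (not real data); 999 marks a missing answer
raw = pd.Series([70, 999, 85], name="t0_eq_vas")

as_float = raw.replace(999, np.nan).astype("float64")  # missing value shown as NaN
as_int = raw.replace(999, np.nan).astype("Int64")      # missing value shown as <NA>

print(as_float.dtype, as_int.dtype)
print(as_int.isna().sum())
```

With `float64` the scores are displayed as `70.0` and `85.0`; with `Int64` they remain integers, at the cost of relying on a newer pandas feature.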




In [None]:
# This DOES NOT work: .replace() returns a new object that is never assigned,
# so df itself is unchanged
df.loc[:,['t0_eq_vas', 't1_eq_vas']].replace(999, np.nan).astype('Int64').dtypes
df.loc[:,['t0_eq_vas', 't1_eq_vas']].head()

In [None]:
# .replace() returns a new object, so the result must be assigned explicitly.
# explicitly make new copy
dfc = df.copy()
_vas = ['t0_eq_vas', 't1_eq_vas']
# assign whole columns: in recent pandas versions `.loc[:, cols] = ...` may keep the old dtype
dfc[_vas] = dfc[_vas].replace(999, np.nan).astype('Int64')
dfc.loc[:, _vas].head()

In [None]:
# similarly, change 9 to N/A for OKS
dfc[_no9] = df[_no9].replace(9, np.nan).astype('Int64')
pd.concat([df.loc[:,_no9].isnull().sum(), dfc.loc[:,_no9].isnull().sum()], axis=1, keys=['df', 'dfc'])

#### Dropping `NA`s

In [None]:
# explicit assignment instead of inplace=True
dfc = dfc.dropna()

#### Add `delta_oks_score`

In [None]:
print(f'Raw data:   {df.shape[0]} rows\nNo NA data: {dfc.shape[0]} rows\n # dropped:  {df.shape[0] - dfc.shape[0]} rows')
dfc['delta_oks_score'] = dfc.oks_t1_score - dfc.oks_t0_score

#### Y by provider, year (using boxplots)

In [None]:
# calculate descriptive stats for delta_oks
dfc.groupby(['year', 'provider_code'])['delta_oks_score'].describe().head()

In [None]:
# Otherwise boxplot tries to include all provider codes, since provider_code is a categorical type
dfc.provider_code = dfc.provider_code.astype('str')

In [None]:
_top10_latest = (dfc.provider_code.isin(top10.index)) & (dfc.year=='2018/19')
dfc.loc[_top10_latest,['provider_code', 'delta_oks_score']].boxplot(by=['provider_code'], figsize=L);

In [None]:
# let's look at a single provider and see whether delta_oks_score changes year-on-year
dfc.loc[(dfc.provider_code.isin(['NXM01'])),['year', 'delta_oks_score']].boxplot(by=['year'], figsize=S);

In [None]:
# bonus: how to create small multiples
# https://seaborn.pydata.org/tutorial/axis_grids.html
import seaborn as sns

# the figure size follows from `height` per facet; `gridspec_kws` is ignored when `col_wrap` is set
g = sns.FacetGrid(
    dfc.loc[dfc.provider_code.isin(top10.index), :],
    col="provider_code",
    col_wrap=2,
    height=5,
)
g.map(sns.boxplot, "year", "delta_oks_score", color=".3");

In [None]:
# Extra exercise: is there an association between the size of a provider and Y?
# group providers into deciles by volume
provider_cat = pd.qcut(volume_provider_year.loc['2018/19',:], q=10)

### Discussion

#### **Question:** reconsider our choice of Y
- Is it useful to quantify treatment success?
- What limitations are there?

## Explore (X,Y)


### Visual data exploration: scattermatrix
At this stage of the process, it is often helpful to try different visualizations to explore the data.

In [None]:
# 3-point scales don't look that interesting, let's zoom in on oks_score and eq_vas
# note some functions don't work with nullable integer Int64 (yet), so converting back to int64
oks_vas = ['delta_oks_score', 'oks_t0_score', 'oks_t1_score', 't0_eq_vas', 't1_eq_vas']
pd.plotting.scatter_matrix(dfc.loc[:,oks_vas].astype('int64'), figsize=L, alpha=0.2);

### Data preparation: second iteration

#### Ceiling effect
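Clearly we need to do something to account for the ceiling effect: patients who already score high at T0 have little room left to improve, so `delta_oks_score` alone understates their outcome. In our second iteration we therefore define a binary outcome parameter Y as 'good' when either `delta_oks_score` is above a certain threshold or `oks_t1_score` is above a certain absolute value.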
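
A small worked example may help to see the ceiling effect in numbers. It assumes a maximum total score of 48 (12 items scored 0 to 4) and uses three made-up baseline scores; `max_delta` and `can_reach_plus_14` are hypothetical helper columns for illustration only.

```python
import pandas as pd

OKS_MAX = 48  # assumption: 12 items, each scored 0-4

# Illustrative baseline scores (not real data)
toy = pd.DataFrame({"oks_t0_score": [10, 30, 40]})

# the largest improvement a patient can still achieve
toy["max_delta"] = OKS_MAX - toy["oks_t0_score"]

# can this patient reach an improvement of 14 points at all?
toy["can_reach_plus_14"] = toy["max_delta"] >= 14

print(toy)
```

A patient starting at 40 can improve by at most 8 points, so a threshold on the improvement alone would never label that patient as a good outcome, however well the surgery went.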

In [None]:
def good_outcome(oks_t1, delta_oks):
    """Return True when the T1 score or the improvement exceeds its threshold."""
    return bool(oks_t1 > 43 or delta_oks > 13)

dfc['Y'] = dfc.apply(lambda row: good_outcome(row['oks_t1_score'], row['delta_oks_score']), axis=1)

#### Y by casemix variables (using Chi Square test)
Usually casemix attributes are strong predictors for outcomes. With the chi-squared test we can assess which categorical variables are strongly associated with the outcome. Note that the casemix indicators are encoded with 1 (present) or 9 (missing).

Note that chi2 requires X and Y to be categorical. Research has shown that an improvement in OKS score of approx. 30% is relevant ([van der Wees 2017](https://github.com/dkapitan/jads-nhs-proms/blob/master/references/vanderwees2017patient-reported.pdf)). Hence an increase of +14 points is considered a 'good' outcome.

In [None]:
casemix = ['heart_disease', 'high_bp', 'stroke', 'circulation', 'lung_disease', 'diabetes',
           'kidney_disease', 'nervous_system', 'liver_disease', 'cancer', 'depression', 'arthritis']
# True where the condition is reported (1), False otherwise (9 = missing)
dfc[casemix] = dfc[casemix].eq(1)
chi2_, pval = chi2(dfc.loc[:,casemix], dfc.Y)
chi = pd.DataFrame({'feature': casemix, 'chi2': chi2_, 'p': pval}).sort_values('chi2', ascending=False)
chi
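
To get a feeling for what a chi-squared statistic measures, the sketch below computes the textbook Pearson statistic by hand for one binary indicator against a binary outcome. The counts are made up for illustration. Note that this is the classic contingency-table version; the values returned by `sklearn.feature_selection.chi2` are computed per feature in the library's own way and need not match this number.

```python
import numpy as np
import pandas as pd

# Illustrative counts (not real data): a binary casemix flag against a binary outcome
toy = pd.DataFrame({
    "depression": np.repeat([True, True, False, False], [30, 10, 20, 40]),
    "Y": np.repeat([False, True, False, True], [30, 10, 20, 40]),
})

# observed counts in a 2x2 contingency table
observed = pd.crosstab(toy["depression"], toy["Y"]).to_numpy()

# expected counts if the flag and the outcome were independent
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()

# Pearson chi-squared: squared deviations relative to the expected counts
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 2))
```

The larger the statistic, the further the observed counts are from what independence would predict.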

### Explore some more: correlation plots
Assess whether the outcome Y correlates with the size of the provider.

In [None]:
# group by providers for 2018/19
_provider2019 = dfc.loc[(dfc.year == '2018/19'),:].groupby('provider_code')

# count procedures, calculate mean delta_oks_score and plot
# (select the column before aggregating, so non-numeric columns are not averaged)
(pd.concat([_provider2019['procedure'].count(),
            _provider2019['delta_oks_score'].mean()], axis=1)
   .plot(kind='scatter', x='procedure', y='delta_oks_score', figsize=M)
);

In [None]:
# same approach, using binary Y (the mean of a boolean is the fraction of good outcomes)
(pd.concat([_provider2019['procedure'].count(),
            _provider2019['Y'].mean()], axis=1)
   .plot(kind='scatter', x='procedure', y='Y', figsize=M)
);

**Extra exercise:** construct a funnel plot to assess which providers have a significantly better or poorer outcome, using 95% confidence limits.
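
As a starting point for this exercise, here is a minimal sketch of the control limits behind a funnel plot. It assumes the binary outcome Y, a normal approximation to the binomial distribution, and the volume-weighted overall proportion as the target. The provider summary is made up, and the column names `n`, `p_good`, `lower`, `upper` and `outside` are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative provider summary (not real data): volume and fraction of good outcomes
providers = pd.DataFrame({
    "n": [25, 100, 400],
    "p_good": [0.60, 0.85, 0.74],
})

# overall proportion, weighted by provider volume
p_bar = (providers["n"] * providers["p_good"]).sum() / providers["n"].sum()

# standard error shrinks with volume, which gives the funnel its shape
se = np.sqrt(p_bar * (1 - p_bar) / providers["n"])
providers["lower"] = p_bar - 1.96 * se
providers["upper"] = p_bar + 1.96 * se

# providers outside the 95% limits differ more than chance alone would suggest
providers["outside"] = ~providers["p_good"].between(providers["lower"], providers["upper"])
print(providers)
```

Plotting `p_good` against `n` together with the `lower` and `upper` curves gives the funnel. For small providers the normal approximation is crude, so exact binomial limits may be preferable.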

### Other things to look into
* Are there near-zero variance features (using `sklearn.feature_selection.VarianceThreshold`)?
* Do infeasible combinations of variable values occur in the data (e.g. minors with a driver's license or pregnant males)?
* A tree-model where Y is being predicted using a cluster of related X-variables, such as ROM-items
* Which variables are known / not known at the point of prediction?
* Which domains (work, health, family, lifestyle, therapy, etc etc) are covered?
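
For the first point in the list above, a near-zero variance check can be sketched in plain pandas. The table and the cut-off `THRESHOLD` are made up for illustration; `sklearn.feature_selection.VarianceThreshold` offers the same idea as a reusable transformer.

```python
import pandas as pd

# Illustrative indicators (not real data), encoded as 0/1
toy = pd.DataFrame({
    "cancer": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "high_bp": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
    "constant": [1] * 10,
})

THRESHOLD = 0.1  # hypothetical cut-off, to be chosen per project

# population variance (ddof=0) per column
variances = toy.var(ddof=0)
low_variance = variances[variances < THRESHOLD].index.tolist()
print(low_variance)
```

For a 0/1 indicator the variance equals p(1 - p), so a threshold on the variance is effectively a threshold on how rare the indicator is.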

# Conclusion and reflection

## Discussion of results

Looking at the different results, reflect on the meaning and usefulness of the outputs.
* To what extent are the chosen outcome variables meaningful?
* Does the data provide relevant and sufficiently detailed information to address the key question?
* What are the key uncertainties and unknowns?


## Checklist for results from data understanding process
* assessment of the quality of the data (in terms of outliers and missing values)
* input regarding the moment of prediction
* input for data cleaning (handling missing data; removing variables not known at time of prediction, near-zero variance variables, etc)
* input for feature engineering (adjusting variables based on tree-analyses, based on correlations, based on domain-analysis)
* input for defining the outcome variable Y
* input for defining the project in terms of generalizability (in case of missing Y values)
* input for choosing the project in case there are still multiple options at the table
* input for defining the scope of the project (e.g. limiting to a subgroup to get a better balanced outcome variable)
* a potential revision of the goal of your project
* input for which variables and combination of variables seem particularly relevant within the to-be-developed algorithms 