# **Healthcare: Opioid Analysis**

<body>
<img src="https://thumbor.forbes.com/thumbor/fit-in/1200x0/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fimageserve%2F5dbb4182d85e3000078fddae%2F0x0.jpg"/>
</body>

**Alpha Insurance is a large corporate insurance firm based in the USA. Specialising in healthcare insurance for large hospitals and practices, it has made a name for itself in the healthcare insurance sector. Alpha Insurance has been receiving a large number of healthcare claims because of the recent opioid epidemic. To tackle this, the company wants to know which healthcare professions are more likely to prescribe opioids. It can then offer further advice to potential clients and be aware of potential risk before it materialises.**


***Having collected data on multiple US healthcare practices from the CMS, Alpha Insurance would like you to build a prediction system that identifies practices likely to prescribe opioids.***

## **1. Import Libraries and Data needed**
<body>
<img src="https://offloadmedia.feverup.com/secretldn.com/wp-content/uploads/2016/06/18075319/Libraries-1024x901.jpg" width="500"/>
</body>

Here we are importing all the libraries that we will need to analyse the data.

Libraries contain functions and tools that other programmers have already written, so we don't have to spend hours recreating code. Instead, we call a function by name and it performs all the steps we need.

In [None]:
# Standard library
import os
import pickle
import re
import sys

# Data handling, modelling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn import metrics
%matplotlib inline

# Interactive plotting with Bokeh
from bokeh.io import output_notebook
from bokeh.sampledata import us_states
from bokeh.plotting import figure, show, output_file
from bokeh.palettes import brewer
from bokeh.models import (CDSView, ColorBar, ColumnDataSource,
                          CustomJS, CustomJSFilter,
                          GeoJSONDataSource, HoverTool,
                          LinearColorMapper, Range1d, Slider)

Load in the data...

In [None]:
Prescriber = pd.read_csv(r"https://raw.githubusercontent.com/ssonkol/Medication_Prediction/master/Healthcare%20Analysis%20Tutorial/prescriber-info.csv", header=0, sep=",", nrows = 10000000)
overdoses = pd.read_csv(r"https://raw.githubusercontent.com/ssonkol/Medication_Prediction/master/Healthcare%20Analysis%20Tutorial/overdoses.csv", header=0, sep=",", nrows = 10000000)
opioids = pd.read_csv(r"https://raw.githubusercontent.com/ssonkol/Medication_Prediction/master/Healthcare%20Analysis%20Tutorial/opioids.csv", header=0, sep=",", nrows = 10000000)

### **Check The Data**

What looks interesting in the table?
What predictions can you make with this data?
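
One way to start answering these questions is to summarise a table before reading it row by row. The sketch below uses a small made-up table (the `DRUG.A` column and every value are assumptions, not taken from the CMS data) and a hypothetical helper called `summarise`. The same calls could be pointed at `Prescriber` once it is loaded.

```python
import pandas as pd

# Toy stand-in for the prescriber table; all values are invented for illustration.
toy = pd.DataFrame({
    "NPI": [1, 2, 3, 4],
    "Specialty": ["Dentist", "Dentist", "Nurse Practitioner", "Family Practice"],
    "DRUG.A": [25, 0, 12, 40],  # hypothetical drug column
    "Opioid.Prescriber": [1, 0, 1, 1],
})

def summarise(df, target="Opioid.Prescriber"):
    """Return basic size, missing-value and label statistics for a table."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "share_flagged": float(df[target].mean()),  # fraction of rows labelled 1
    }

summary = summarise(toy)
print(summary)
```

A high or low `share_flagged` is worth noting early, because it affects how impressive a given accuracy score really is.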

In [None]:
Prescriber.head()

In [None]:
overdoses.head()

In [None]:
opioids

## **2. Set Up The Model**

<body>
<img src="https://beyondtheory.co.uk/storage/images/other/2016/08/Beyond-Theory-Data-Analysis-Landing-Page-graphic.png" width="800"/>
</body>

Teaching a model to do something works much like revising a topic yourself:


1.   Look at the material and try to give a correct answer to a question given
2.   Check your answer against the mark scheme
3.   Repeat steps 1 and 2 over a set amount of time to improve your score

This is exactly how we teach models to handle tasks.

In this example, we are creating a predictive model for dentists.


*   X is the data the model looks at to make a prediction - i.e. the material and questions
*   Y is the data the model looks at to see if its prediction was correct - i.e. the mark scheme

Note: 

1 = opioid prescriber


0 = not an opioid prescriber
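
The revise-and-check loop described above can be sketched on made-up data. This is only an illustration under stated assumptions: the features, the labelling rule and the choice of scikit-learn's `LogisticRegression` are all invented for the example and are not the model used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical X: prescription counts for three made-up drugs (the "material").
X_toy = rng.poisson(lam=5, size=(200, 3))
# Hypothetical y: 1 when the first drug is prescribed often (the "mark scheme").
y_toy = (X_toy[:, 0] > 5).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X_toy, y_toy)                  # study the material, repeatedly adjusting
guesses = model.predict(X_toy)           # give an answer to every question
score = accuracy_score(y_toy, guesses)   # check the answers against the mark scheme
print(f"Share of correct answers: {score:.2f}")
```

Here the model is marked on the same rows it studied, which is like sitting an exam you have already seen. The notebook avoids this later by holding back a test set.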







In [None]:
# Select our specialty (.copy() avoids pandas' SettingWithCopyWarning)
dentists = Prescriber[Prescriber['Specialty'] == 'Dentist'].copy()

target = 'Opioid.Prescriber'
# Gender is dropped with the other non-drug columns, so it needs no encoding
X = dentists.drop(columns=['Gender', 'NPI', 'Specialty', 'Credentials', 'State', target])
y = dentists[target]
(X.shape, y.shape)

Now let us take a look at the data!

In [None]:
X.head()

Our training data is essentially a tally of the prescriptions each medical specialist has written for each drug. As you can see, the specialist's gender, NPI, specialty, credentials, and state have been removed.

This is because we don't want the model to take these factors into account; only the drug counts should inform the prediction.

As for our y data, we are only looking at whether or not the specialist is an opioid prescriber.
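
Before training, it can help to check how balanced the labels are, since a lopsided `y` makes plain accuracy look better than it is. The sketch below uses a short made-up label series (the values are assumptions); with the real data the same calls would be applied to `y`.

```python
import pandas as pd

# Made-up labels: 1 = opioid prescriber, 0 = not an opioid prescriber.
y_toy = pd.Series([1, 1, 0, 1, 0, 1, 1, 1], name="Opioid.Prescriber")

counts = y_toy.value_counts()                # rows per class
shares = y_toy.value_counts(normalize=True)  # fraction of rows per class

# Accuracy of a "model" that always guesses the most common class.
majority_baseline = shares.max()

print(counts)
print(f"Always guessing the majority class scores {majority_baseline:.2f}")
```

Any trained model should beat this baseline to be considered useful.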

In [None]:
y.head()

In [None]:
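from sklearn.model_selection import train_test_split

# Gradient-boosted tree classifier; n_estimators is an upper bound that
# cross-validation trims in the next cell
alg = xgb.XGBClassifier(
        learning_rate=0.1,
        n_estimators=1000,
        max_depth=4,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,         # each tree sees 80% of the rows
        colsample_bytree=0.8,  # each tree sees 80% of the columns
        nthread=4,
        objective="binary:logistic",  # output is a probability of class 1
        scale_pos_weight=1,
        seed=27)

# Hold back 25% of the dentists to test the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)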

In [None]:
# XGBoost model: use cross-validation to pick the number of boosting rounds
eval_metrics = ['auc', 'map']  # renamed so it does not shadow sklearn's `metrics` module
xgtrain = xgb.DMatrix(X_train, label=y_train)  # features and labels
param = alg.get_xgb_params()
cvresult = xgb.cv(param,
                  xgtrain,
                  num_boost_round=alg.get_params()['n_estimators'],
                  nfold=7,
                  metrics=eval_metrics,
                  early_stopping_rounds=50)
# Keep only as many trees as cross-validation found useful
alg.set_params(n_estimators=cvresult.shape[0])
# Fit once on the training set (eval_metric is only used when an eval_set
# is given, so it is left out here)
alg.fit(X_train, y_train)
# Show features, rated by F score
features = alg.get_booster().get_fscore()
feat_imp = pd.Series(features).sort_values(ascending=False)
feat_imp[:50].plot(kind='bar', title='Feature Importances', figsize=(9, 6))
plt.ylabel('Feature Importance Score')

## **3. Model Review**

<body>
<img src="https://t3.ftcdn.net/jpg/03/28/54/50/360_F_328545004_Q7tujNu0VpoTXUqlGad4LyEVxpNSeoYu.jpg"/>
</body>


In [None]:
# sort for human readability
import operator
sorted_features = sorted(features.items(), key=operator.itemgetter(1))
print('features by importance', sorted_features)

In [None]:
from sklearn import metrics
# Predict on the held-out test set created earlier
pred = alg.predict(X_test)
predprob = alg.predict_proba(X_test)[:, 1]
# Print model report:
print("Accuracy : %.4g" % metrics.accuracy_score(y_test, pred))
print("AUC Score (Test): %f" % metrics.roc_auc_score(y_test, predprob))
# Compare mean prescription counts of opioid and non-opioid prescribers
mean_dentists = dentists.groupby('Opioid.Prescriber').mean(numeric_only=True)
relevant_stats = [mean_dentists[feature] for feature in features]
pd.DataFrame(relevant_stats).plot(kind="bar", figsize=(30, 10))
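
Accuracy and AUC give a single number each. A confusion matrix shows which kinds of mistakes sit behind them. The sketch below uses made-up labels and predictions (assumptions for illustration only); with the real model, `y_test` and `pred` could be passed in the same way.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions; 1 = opioid prescriber, 0 = not.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of those flagged as prescribers, how many really are.
precision = precision_score(y_true, y_pred)
# Recall: of the real prescribers, how many were flagged.
recall = recall_score(y_true, y_pred)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For an insurer, a false negative (a missed opioid prescriber) and a false positive (a wrongly flagged practice) may carry different costs, so looking at both is usually more informative than accuracy alone.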

In [None]:
top_features = list(features.keys())
top_features

## **Things To Try**

Copy one of the specialties from the list below and apply it to the code where we selected a specialty.

* What drugs are they more likely to prescribe?
*   Are these Opioids?
*   Therefore, are they more likely to prescribe opioids?
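
To make swapping specialties easier, the selection step could be wrapped in a small function. The helper name `build_xy`, the toy table and its `DRUG.A` and `DRUG.B` columns are assumptions made for this sketch; the dropped columns mirror the ones removed in the dentist example.

```python
import pandas as pd

def build_xy(df, specialty, target="Opioid.Prescriber"):
    """Filter one specialty and split it into drug-count features and labels."""
    subset = df[df["Specialty"] == specialty]
    non_features = ["Gender", "NPI", "Specialty", "Credentials", "State", target]
    X = subset.drop(columns=non_features)
    y = subset[target]
    return X, y

# Toy table with invented values, shaped like the prescriber data.
toy = pd.DataFrame({
    "NPI": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "State": ["TX", "NY", "CA", "TX"],
    "Credentials": ["DDS", "NP", "NP", "MD"],
    "Specialty": ["Dentist", "Nurse Practitioner", "Nurse Practitioner", "Family Practice"],
    "DRUG.A": [25, 0, 12, 40],  # hypothetical drug columns
    "DRUG.B": [0, 3, 7, 1],
    "Opioid.Prescriber": [1, 0, 1, 1],
})

X_np, y_np = build_xy(toy, "Nurse Practitioner")
print(X_np.shape, y_np.tolist())
```

With the real data, calling `build_xy(Prescriber, ...)` with a specialty copied from the list would give the `X` and `y` to feed into the same training cells.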





In [None]:
Prescriber['Specialty'].value_counts()