In [None]:
%matplotlib inline

Diagnosis Codes, Procedure Codes, BB USC codes, Oh my
================================================

This is a fairly rich data set.  In addition to the data leaks discussed in the forums,
there are scads of different things to look at if you can get SQL to disgorge them in time.
Our goal is to find groups that satisfy the following criteria:

1. Relatively large group size.
1. Relatively low or high screener rate for the group, albeit preferably low.
1. Easy identification of the group by only a few features.

The first criterion is important for two reasons.  The first is that the number of people who
benefit from an increase in the screening rate of a group is directly proportional to the size
of the group. In turn this translates into more lives saved.  The second is that screener rates
estimated from larger groups are more likely to approximate the screener rate of the general
population meeting that criterion.
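
The second reason can be made concrete with the standard error of a sample proportion. The sketch below assumes simple random sampling, which this data set may not satisfy, and uses made-up group sizes; the helper name `proportion_standard_error` is introduced here for illustration only.

```python
import math

def proportion_standard_error(rate, group_size):
    """Standard error of a sample proportion, assuming simple random sampling."""
    return math.sqrt(rate * (1.0 - rate) / group_size)

# Hypothetical group sizes, all with the same observed screener rate of 0.5.
for n in (100, 1000, 10000):
    print(n, round(proportion_standard_error(0.5, n), 4))
```

Under that assumption the uncertainty shrinks with the square root of the group size, so a group 100 times larger gives an estimate roughly 10 times tighter.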

The second criterion relates to cost. The cost of reaching a specific group is 
proportional to the size of the group whereas the benefit is proportional to the size of the 
subset of non-screeners.  If a group has a high screening rate, then our target is actually
the portion of the population that doesn't fall into the group.  In this case
recall that the screening rate for the target is not one minus the screening rate of the group.
In fact, unless the group is relatively large (think much greater than 1% of the population), the
screening rate of the target is close to that of the general population.

To illustrate, approximately 56% of the patients in the training data have at least one instance of
diagnosis code V72.31, and of those in this group the screener rate is approximately 81%.  Of those
not in this group the screener rate is 24%.  Only 2% of patients are in the group with
at least one instance of diagnosis code 256.4, which has a screener rate of approximately 72%.
However, the screener rate for those not in this group is 55%, which approximates the screening
rate of the population as a whole.
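
The arithmetic behind these figures can be checked with a short sketch. It takes only the rounded percentages quoted above as inputs, so the results are approximate; `complement_rate` is a helper name introduced here for illustration.

```python
def complement_rate(overall_rate, group_share, group_rate):
    """Screener rate among patients outside the group.

    Solves: overall = share * group_rate + (1 - share) * complement_rate
    """
    return (overall_rate - group_share * group_rate) / (1.0 - group_share)

# Overall rate implied by the V72.31 figures (56% of patients at 81%, the rest at 24%).
overall = 0.56 * 0.81 + 0.44 * 0.24

# A 2% group with a 72% rate barely moves the rate of the remaining 98%.
rest_small = complement_rate(overall, 0.02, 0.72)

# A 56% group with an 81% rate leaves the remaining 44% far below the overall rate.
rest_large = complement_rate(overall, 0.56, 0.81)

print(round(overall, 3), round(rest_small, 3), round(rest_large, 3))
```

With these rounded inputs the complement of the small group lands within about a percentage point of the overall rate, while the complement of the large group does not.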

The third criterion comes from the fact that simpler models are easier to implement and are
more likely to generalize.  In this particular case, though, we are looking into statistics and
not specific models.

CAVEAT:  The constraint of having to limit queries may have unintended side effects.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib as mpl
import sqlite3
from scipy import stats

In [None]:
# Any results you write to the current directory are saved as output.
con = sqlite3.connect('../input/database.sqlite')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())


Diagnosis Codes
--------------

From the first two million distinct `diagnosis_code, patient_id` pairs, compute the screening
rate by diagnosis code for the group of patients that has that diagnosis code at
least once.
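
To see what this computation does, here is a toy version on an in-memory database. The table and column names mirror the ones used in this notebook, but the rows are invented and the `LIMIT` is dropped, so treat it as a sketch of the query shape rather than of the real data.

```python
import sqlite3

# Hypothetical miniature tables; values are invented for illustration.
toy = sqlite3.connect(':memory:')
toy.executescript("""
CREATE TABLE patients_train (patient_id INTEGER, is_screener INTEGER);
CREATE TABLE diagnosis (patient_id INTEGER, diagnosis_code TEXT);
INSERT INTO patients_train VALUES (1, 1), (2, 0), (3, 1);
INSERT INTO diagnosis VALUES (1, 'A'), (1, 'A'), (2, 'A'), (3, 'B');
""")

query = """
SELECT diagnosis_code,
       count(diagnosis_code) AS freq,
       avg(is_screener) AS screener_rate
FROM patients_train AS pt
INNER JOIN (SELECT DISTINCT patient_id, diagnosis_code FROM diagnosis) AS ds
    ON pt.patient_id = ds.patient_id
GROUP BY diagnosis_code
ORDER BY freq;
"""
# DISTINCT collapses patient 1's repeated 'A' rows, so freq counts patients, not claims.
rows = toy.execute(query).fetchall()
print(rows)
```

Because of the `DISTINCT`, `freq` is the number of patients with the code at least once, and `screener_rate` is the share of those patients who are screeners.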

In [None]:
db = sqlite3.connect('../input/database.sqlite')
dcode_rates = pd.read_sql_query("SELECT diagnosis_code, count(diagnosis_code) as freq, avg(is_screener) as screener_rate FROM (patients_train as pt INNER JOIN (SELECT DISTINCT patient_id, diagnosis_code from diagnosis LIMIT 2000000) as ds ON pt.patient_id = ds.patient_id) GROUP BY diagnosis_code ORDER BY freq;",db)

From this we select the 300 most frequently used diagnosis codes to look at their screening
rates and relative sizes.

In [None]:
xnum = 25
ynum = 12
num_codes = xnum*ynum
bones = dcode_rates.iloc[-num_codes:].sort_values('screener_rate').copy()
grid = np.meshgrid(range(xnum),range(ynum))
bones['y'] = grid[1].flatten()
bones['x'] = grid[0].flatten()
ann_list = bones[(bones.freq > 900) & (bones.screener_rate > .7)]
print(ann_list)
ann_list2 = bones[(bones.freq > 900) & (bones.screener_rate < .4)]
print(ann_list2)

In the following plot, each code is represented by a circle whose size is proportional to the
relative frequency of the diagnosis code and whose color reflects the screening rate for that code.
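
The position of each circle carries information too, because the codes are sorted by screener rate before being laid out on the grid. A small stand-in for the 25 by 12 layout shows the order in which `np.meshgrid` hands out positions:

```python
import numpy as np

# Small stand-in for the 25 x 12 layout: 3 columns by 2 rows.
grid = np.meshgrid(range(3), range(2))
x = grid[0].flatten()
y = grid[1].flatten()

# Positions are assigned left to right along a row, starting with the bottom row.
print(list(zip(x.tolist(), y.tolist())))
```

Since the rows are sorted by ascending `screener_rate`, the lowest rates fill the bottom row and the highest rates end up in the top row.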

In [None]:
#fig = plt.figure(figsize=(17,12))
#ax = fig.add_subplot(111)
plt.xkcd()
p1 = bones.plot(kind = 'scatter', x = 'x', y = 'y', c = 'screener_rate', s = bones['freq']/10.0, cmap = 'RdBu', figsize=(17,12))
t1 = plt.ylabel(' ')
t1 = p1.yaxis.set_ticklabels([])
for dc,x,y in zip(ann_list['diagnosis_code'],ann_list['x'],ann_list['y']):
    #plt.annotate(dc, xy = (x, y), xytext = (-20, 20 +20*(24-y) ),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*(ynum-y), 10 +10*(ynum-y) ),textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
for dc,x,y in zip(ann_list2['diagnosis_code'],ann_list2['x'],ann_list2['y']):
    #plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*y, -20 -20*(y+1)),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*y, -20 -20*(y+1)),textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
t1 = plt.title('Screener Rates for the most frequently used diagnosis codes',fontsize = 18)


Based on the above criteria we should consider codes in the top and bottom rows, with the
added restriction that groups for codes from the top row need to be larger. Here are the actual
descriptions of the labeled codes.

In [None]:
dc_descript = pd.read_sql_query("SELECT diagnosis_code,diagnosis_description FROM diagnosis_code;",db)

In [None]:
pcodes = bones[['diagnosis_code','freq','screener_rate']].merge(dc_descript, on = 'diagnosis_code')
print("CODES WITH THE HIGHEST SCREENER RATES: ")
print (pcodes.tail(15))
print("CODES WITH THE LOWEST SCREENER RATES: ")
print (pcodes.head(15))

Of the listed codes, the groups associated with 496, 414.01 (perhaps combined with 414.00), 428.0,
584.9, 780.97, 429.3, V72.31, and 616.10 are all worth considering.
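
One possible way to compare candidate groups, following the cost and benefit reasoning of the second criterion, is to rank them by the number of non-screeners they contain. The sketch below uses a made-up frame with the same column names as `bones`; the codes and numbers are invented, and the column `non_screeners` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like `bones`; codes and numbers are invented.
candidates = pd.DataFrame({
    'diagnosis_code': ['X1', 'X2', 'X3'],
    'freq': [1000, 4000, 2000],
    'screener_rate': [0.30, 0.80, 0.50],
})

# Benefit proxy: patients in the group who are not screeners.
candidates['non_screeners'] = candidates['freq'] * (1.0 - candidates['screener_rate'])

# Cost is proportional to freq, so the benefit per patient reached is 1 - screener_rate.
ranked = candidates.sort_values('non_screeners', ascending=False)
print(ranked)
```

This ranking only applies to groups that are targeted directly; for groups with a high screener rate the target is the complement, as discussed under the second criterion.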


Procedure Codes
--------------

Repeating the above analysis with procedure codes:

In [None]:
db = sqlite3.connect('../input/database.sqlite')
pcode_rates = pd.read_sql_query("SELECT procedure_code, count(procedure_code) as freq, avg(is_screener) as screener_rate FROM (patients_train as pt INNER JOIN (SELECT DISTINCT patient_id, procedure_code from procedure LIMIT 2000000) as ds ON pt.patient_id = ds.patient_id) GROUP BY procedure_code ORDER BY freq;",db)

In [None]:
xnum = 25
ynum = 12
num_codes = xnum*ynum
bones = pcode_rates.iloc[-num_codes:].sort_values('screener_rate').copy()
grid = np.meshgrid(range(xnum),range(ynum))
bones['y'] = grid[1].flatten()
bones['x'] = grid[0].flatten()
ann_list = bones[(bones.freq > 900) & (bones.screener_rate > .7)]
print(ann_list)
ann_list2 = bones[(bones.freq > 900) & (bones.screener_rate < .4)]
print(ann_list2)

In [None]:
p1 = bones.plot(kind = 'scatter', x = 'x', y = 'y', c = 'screener_rate', s = bones['freq']/10.0, cmap = 'RdBu', figsize=(17,12))
t1 = plt.ylabel(' ')
t1 = p1.yaxis.set_ticklabels([])
for dc,x,y in zip(ann_list['procedure_code'],ann_list['x'],ann_list['y']):
    #plt.annotate(dc, xy = (x, y), xytext = (-20, 20 +20*(24-y) ),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*(ynum-y), 10 +10*(ynum-y) ),textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
for dc,x,y in zip(ann_list2['procedure_code'],ann_list2['x'],ann_list2['y']):
    #plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*y, -20 -20*(y+1)),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*y, -20 -20*(y+1)),textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
t1 = plt.title('Screener Rates for the most frequently used procedure codes',fontsize = 18)



In [None]:
pc_descript = pd.read_sql_query("SELECT procedure_code,procedure_description FROM procedure_code;",db)

In [None]:
proc_codes = bones[['procedure_code','freq','screener_rate']].merge(pc_descript, on = 'procedure_code')
print("CODES WITH THE HIGHEST SCREENER RATES: ")
print (proc_codes.tail(30))
print("CODES WITH THE LOWEST SCREENER RATES: ")
print (proc_codes.head(30))

Here we see that procedure codes provide an even better source of groups that meet our criteria.
In particular, the groups associated with the 5 procedure codes with the lowest screening rates would
all be good candidates.

BB USC Codes
--------------

Repeating the above analysis with BB USC codes:

In [None]:
db = sqlite3.connect('../input/database.sqlite')
#Sample query for checking that the inside join is correct
#drugjoin = pd.read_sql_query("SELECT DISTINCT patient_id, BB_USC_code, BB_USC_name FROM (prescription as pr JOIN drug ON pr.drug_id = drug.drug_id) LIMIT 2000000;",db)

bcode_rates = pd.read_sql_query("SELECT BB_USC_code, BB_USC_name, count(BB_USC_code) as freq, avg(is_screener) as screener_rate FROM (patients_train as pt INNER JOIN (SELECT DISTINCT patient_id, BB_USC_code, BB_USC_name FROM (prescription as pr JOIN drug ON pr.drug_id = drug.drug_id) LIMIT 2000000) as ds ON pt.patient_id = ds.patient_id) GROUP BY BB_USC_code ORDER BY freq;",db)


In [None]:
#print(drugjoin[drugjoin.patient_id== 113105088])
#print(drugjoin.patient_id.value_counts())
print(bcode_rates.tail(20))

In [None]:
xnum = 25
ynum = 12
num_codes = xnum*ynum
bones = bcode_rates.iloc[-num_codes:].sort_values('screener_rate').copy()
grid = np.meshgrid(range(xnum),range(ynum))
bones['y'] = grid[1].flatten()
bones['x'] = grid[0].flatten()
ann_list = bones[(bones.freq > 900) & (bones.screener_rate > .7)]
print(ann_list)
ann_list2 = bones[(bones.freq > 900) & (bones.screener_rate < .4)]
print(ann_list2)
ann_list2 = ann_list2.sort_values(['x','y'])


In [None]:
p1 = bones.plot(kind = 'scatter', x = 'x', y = 'y', c = 'screener_rate', s = bones['freq']/10.0, cmap = 'RdBu', figsize=(17,12))
t1 = plt.ylabel(' ')
t1 = p1.yaxis.set_ticklabels([])
for dc,x,y,num in zip(ann_list['BB_USC_name'],ann_list['x'],ann_list['y'],range(ann_list.shape[0])):
    #plt.annotate(dc, xy = (x, y), xytext = (-20, 20 +20*(24-y) ),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (x + .5*num, y+1.5),textcoords = 'data', 
        ha = 'left', va = 'bottom',rotation = 45,color='darkslategrey',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0',color='darkslategrey'))
for dc,x,y,num in zip(ann_list2['BB_USC_name'],ann_list2['x'],ann_list2['y'],range(ann_list2.shape[0])):
    #plt.annotate(dc, xy = (x, y), xytext = (-20 + 5*y, -20 -20*(y+1)),textcoords = 'offset points', ha = 'right', va = 'bottom',
    plt.annotate(dc, xy = (x, y), xytext = (1.5*num,-.5),textcoords = 'data', 
        ha = 'right', va = 'top',rotation = 45,color='darkslategrey',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
t1 = plt.title('Screener rates for the most frequently prescribed drugs',fontsize = 18)




Patients having UNLISTED DIALYSIS PROCEDURE INPATIENT/OUTPATIENT
------------------------------------------------------

The six procedure codes with the lowest screener rates are all good candidates under
the above criteria. Of these, UNLISTED DIALYSIS PROCEDURE INPATIENT/OUTPATIENT has
the highest frequency. In this section we explore the group of patients who have had
this procedure at least once.

In [None]:
print(proc_codes.head(10))

In [None]:
dialysis_pats = pd.read_sql_query("SELECT * FROM (patients_train as pt INNER JOIN (SELECT DISTINCT patient_id, procedure_code from procedure WHERE procedure_code = '90999' LIMIT 2000000) as ds ON pt.patient_id = ds.patient_id)",db)

In [None]:
print(dialysis_pats.shape)
print(dialysis_pats.is_screener.mean())

In [None]:
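# Screener rate and patient count per state; mean() avoids integer division on older Pythons.
pdf = dialysis_pats.groupby('patient_state').apply(lambda x: pd.Series({'screener_rate': x.is_screener.mean(), 'n_patients': x.shape[0]}))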

In [None]:
print(pdf)