# Visualization

* We are looking for correlations between the independent variables and the target variable, the likelihood of being readmitted to the hospital, using graphs and plots. 
* This is also a good time to get a better understanding of patient demographics, their experiences at the hospital, medications being used / not used, and any diagnosed conditions.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from matplotlib import rcParams  # pylab is discouraged; rcParams lives in matplotlib itself
rcParams['figure.figsize'] = 12,6

# to avoid warnings
import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")

sns.set(style='darkgrid')
%matplotlib inline

### Import Dataset

- The cleaned dataset was produced in the [first notebook](https://www.kaggle.com/kirshoff/01-exploratory-data-analysis-with-diabetes-dataset).
- We use `diabetic_data_cleaned.csv` here.

In [None]:
data = pd.read_csv('../input/diabetic-data-features/diabetic_data_cleaned.csv', index_col=0)
df = data.copy()
df.head()

In [None]:
features = pd.read_csv('../input/diabetic-data-features/features.csv',index_col='Unnamed: 0')
info = lambda attribute:print(f"{attribute.upper()} : {features[features['Feature']==attribute]['Description'].values[0]}\n")
features.head()

In [None]:
def summary(df):
    """Print the shape and dtype counts, and return a per-column overview table."""
    types = df.dtypes
    counts = df.count()                  # non-null values per column
    uniques = df.nunique(dropna=False)   # distinct values, counting NaN as one
    nulls = df.isnull().sum()
    col_min = df.min()
    col_max = df.max()
    print('Data shape:', df.shape)

    # avoid naming the result `str`, which would shadow the built-in
    overview = pd.concat([types, counts, uniques, nulls, col_min, col_max], axis=1, sort=True)
    overview.columns = ['Types', 'Counts', 'Uniques', 'Nulls', 'Min', 'Max']

    print('___________________________\nData Types:')
    print(overview.Types.value_counts())
    print('___________________________')
    return overview

summary(df)

In [None]:
round(df.describe(), 2)

In [None]:
df.shape

In [None]:
sns.pairplot(df, hue='readmitted');

In [None]:
plt.figure(figsize=(20,10))
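# numeric_only=True skips text columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm");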

### FOCUS ON "readmitted" patients overall

In [None]:
info('readmitted')

In [None]:
def labels(ax):
    # annotate each bar with its share of all rows and its raw count
    for p in ax.patches:
        ax.annotate('{:.1f}%\n{:.0f}'.format(100*p.get_height()/len(df), p.get_height()),
                    (p.get_x()+0.3, p.get_height()-1900), size=11)

ax = sns.countplot(x='readmitted', palette='husl', data=df)
labels(ax)

# sns.catplot(x='readmitted', kind='count', palette='husl', data=df)  # alternative
plt.title('Readmit Rates')
plt.show()

### FOCUS ON "race"

In [None]:
def labels(ax):
    # annotate each bar with its share of all rows and its raw count, centred above the bar
    for bar in ax.patches:
        ax.annotate('{:.1f}%\n{:.0f}'.format(100*bar.get_height()/len(df), bar.get_height()),
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha='center', va='center', size=10,
                    xytext=(0, 8), textcoords='offset points')

rcParams['figure.figsize'] = 12,6
ax = sns.countplot(x='race', hue='readmitted', palette='husl', data=df)
labels(ax)
# sns.catplot(x='race', hue='readmitted', kind='count', palette='husl', data=df, aspect=2, legend_out=False)
plt.title('Patient Demographic Readmissions')
plt.show()

In [None]:
pd.crosstab(df.race, df.readmitted, margins=True, margins_name='Total')

### FOCUS ON "gender"

In [None]:
rcParams['figure.figsize'] = 12,6
ax = sns.countplot(x='gender', hue='readmitted', palette='husl', data=df)
labels(ax)
plt.title('Readmissions by Gender')
plt.show()

In [None]:
pd.crosstab(df.gender, df.readmitted, margins=True, margins_name='Total')

### FOCUS ON "age" groups

In [None]:
ax = sns.countplot(x='age', palette='husl', data=df.sort_values('age'))
labels(ax)
plt.title('Patient Demographics')
plt.show()

> It looks like most patients are older, 50+ years old, though there aren't many patients over 90.

In [None]:
ax = sns.countplot(x='age', hue='readmitted', palette='husl', data=df.sort_values('age'))
labels(ax)
plt.title('Readmits By Age Group')
plt.show()

In [None]:
pd.crosstab(df.age, df.readmitted, margins=True, margins_name='Total').T

> In every age group, more patients are not readmitted than readmitted. The 70-80 age group has the highest number of both readmitted and non-readmitted patients.
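
The counts above depend on how large each age group is. A row-normalised crosstab gives the share of readmissions within each group instead. The sketch below uses a small made-up table and assumes `readmitted` is coded as 0/1; with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "age": ["[50-60)", "[50-60)", "[70-80)", "[70-80)", "[70-80)", "[70-80)"],
    "readmitted": [0, 1, 0, 0, 1, 1],
})

# normalize="index" makes each row sum to 1, so the columns
# are shares within an age group rather than raw counts
rates = pd.crosstab(toy["age"], toy["readmitted"], normalize="index")
print(rates.round(2))
```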

### FOCUS ON "time_in_hospital"

In [None]:
sns.countplot(x='time_in_hospital', palette='muted', data=df)
mean, median = np.mean(df.time_in_hospital), np.median(df.time_in_hospital)
plt.axvline(mean-df.time_in_hospital.min(), color='blue', label=f'mean:{round(mean,2)}')
plt.axvline(median-df.time_in_hospital.min(), color='red', label=f'median:{round(median,2)}')
plt.title('Duration of Hospital Visit in Days')
plt.legend()
plt.show()

> **Does the amount of time spent in the hospital impact a patient's chances of readmission?**

In [None]:
sns.catplot(x='time_in_hospital', hue='readmitted', kind='count', palette='husl', aspect=3, data=df, legend_out=False)
plt.title('Readmission Based on Time in Hospital')
plt.show()

In [None]:
sns.displot(x='time_in_hospital', hue='readmitted', data=df, height=7, aspect=3)
plt.title('Readmission Based on Time in Hospital')
plt.show()

> Based on the graph, patients who spend longer in the hospital appear more likely to be readmitted. Patients who stay more than a week usually have a serious illness or complication that may recur depending on how well they recover, which may be why they return to the hospital.
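
One way to check this reading of the graph is to compute the readmission rate for each length of stay. This is a minimal sketch on made-up values, assuming `readmitted` is coded as 0/1 so that its mean is the readmission rate.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "time_in_hospital": [1, 1, 1, 1, 7, 7, 7, 7],
    "readmitted":       [0, 0, 0, 1, 0, 1, 1, 1],
})

# mean of a 0/1 column is the share of readmitted patients;
# count shows how many stays each rate is based on
rate_by_stay = toy.groupby("time_in_hospital")["readmitted"].agg(["mean", "count"])
print(rate_by_stay)
```

Keeping the `count` column next to the rate helps avoid over-reading rates that rest on only a few patients.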

> **Which age group is spending the most time in hospitals during visits?**

In [None]:
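def box_labels(ax, df, col1, col2):
    # median of col2 per col1 group; groupby sorts the keys, matching the sorted data passed to boxplot
    medians = df.groupby(col1)[col2].median()
    vertical_offset = df[col2].median() * 0.05 # offset from median for display

    for xtick in ax.get_xticks():
        # xtick is the box position (0, 1, ...), so look the median up by position, not by label
        median = medians.iloc[int(xtick)]
        ax.text(xtick, median + vertical_offset, median,
                horizontalalignment='center', size='x-small', color='w', weight='semibold')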

ax = sns.boxplot(x='age', y='time_in_hospital', data=df.sort_values('age'))
box_labels(ax, df.sort_values('age'),'age','time_in_hospital')    
plt.title('Length of Hospital Stay Based on Age')
plt.show()

> **How does time in hospital compare between readmitted and non-readmitted patients?**

In [None]:
ax = sns.boxplot(x='readmitted', y='time_in_hospital', data=df.sort_values('readmitted'))
box_labels(ax, df.sort_values('readmitted'),'readmitted','time_in_hospital') 
plt.title('Length of Hospital Stay for Readmitted Patients')
plt.show()

> Readmitted patients stay longer in the hospital on average compared to those who are not readmitted.
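
A box plot shows the difference but not whether it could be chance. A rank-based test such as Mann-Whitney U is one option for skewed count data like length of stay. The sketch below runs `scipy.stats.mannwhitneyu` on two small made-up samples; the values are hypothetical and only show the call.

```python
import numpy as np
from scipy import stats

# Hypothetical lengths of stay in days (toy values, not the real dataset)
stay_readmitted = np.array([3, 4, 5, 5, 6, 7])
stay_not_readmitted = np.array([1, 2, 2, 3, 3, 4])

# Two-sided test of whether one group tends to have longer stays than the other
result = stats.mannwhitneyu(stay_readmitted, stay_not_readmitted, alternative="two-sided")
print("median readmitted:", np.median(stay_readmitted))
print("median not readmitted:", np.median(stay_not_readmitted))
print("p-value:", round(result.pvalue, 4))
```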

### FOCUS ON "`number of lab procedures`"

In [None]:
info("num_lab_procedures")

In [None]:
rcParams['figure.figsize'] = 25,10
sns.countplot(x='num_lab_procedures', data=df)
mean, median = np.mean(df.num_lab_procedures), np.median(df.num_lab_procedures)
plt.axvline(mean-df.num_lab_procedures.min(), color='blue', label=f'mean:{round(mean,2)}')
plt.axvline(median-df.num_lab_procedures.min(), color='black', label=f'median:{round(median,2)}')
plt.title('Number of Lab Procedures Performed During Visit')
plt.legend()
plt.show()

In [None]:
df.groupby('readmitted')['num_lab_procedures'].describe().round(2)

> **Do the patients with longer hospital stays have more lab tests?**

In [None]:
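def box_labels(ax, df, col1, col2):
    # same helper as before, with a larger font size for the wider figures
    medians = df.groupby(col1)[col2].median()
    vertical_offset = df[col2].median() * 0.05 # offset from median for display

    for xtick in ax.get_xticks():
        # xtick is the box position (0, 1, ...), so look the median up by position, not by label
        median = medians.iloc[int(xtick)]
        ax.text(xtick, median + vertical_offset, median,
                horizontalalignment='center', size=12, color='w', weight='semibold')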

ax = sns.boxplot(x='time_in_hospital', y='num_lab_procedures', data=df.sort_values('time_in_hospital'))
# box_labels(ax, df.sort_values('time_in_hospital'),'time_in_hospital','num_lab_procedures') 
plt.title('Lab Procedures Based on Length of Hospital Visit')
plt.show()

* There is a positive correlation between time spent in the hospital and number of lab tests completed. 
* This makes sense since patients with longer stays had more tests completed to properly diagnose their conditions.
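
The strength of this relationship can be put into a single number with a rank correlation. The sketch below applies `scipy.stats.spearmanr` to a handful of made-up pairs; Spearman is chosen here as an assumption, because both columns are counts and the relationship need not be linear.

```python
from scipy import stats

# Hypothetical toy values for illustration only (not the real dataset)
time_in_hospital = [1, 2, 3, 4, 5, 6]
num_lab_procedures = [20, 35, 30, 50, 55, 70]

# Spearman's rho compares the rank order of the two variables
rho, p_value = stats.spearmanr(time_in_hospital, num_lab_procedures)
print("rho:", round(rho, 3), "p-value:", round(p_value, 4))
```

A value of rho close to 1 means longer stays almost always come with more lab procedures; a value near 0 would mean no monotonic relationship.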

> **Do readmitted patients have more lab tests?**

In [None]:
plt.figure(figsize=(10, 8))
ax = sns.boxplot(x='readmitted', y='num_lab_procedures', data=df.sort_values('readmitted'))
box_labels(ax, df.sort_values('readmitted'),'readmitted','num_lab_procedures') 
plt.title('Lab Procedures for Readmitted Patients')
plt.show()

* The typical number of lab procedures is about the same for readmitted and non-readmitted patients.
* Patients who were not readmitted had slightly fewer lab procedures during their visit.

### FOCUS ON "`number of procedures`" (other than lab)

In [None]:
info('num_procedures')

In [None]:
sns.catplot(x='num_procedures', kind='count', palette='muted', data=df)
mean, median = np.mean(df.num_procedures), np.median(df.num_procedures)
plt.axvline(mean, color='blue', label=f'mean:{round(mean,2)}')
plt.axvline(median, color='black', label=f'median:{round(median,2)}')
plt.title('Number of Procedures Performed (Except Lab)')
plt.legend()
plt.show()

> **Does the number of procedures performed indicate whether a patient will be readmitted?**

In [None]:
def labels(ax):
    # annotate each bar with its share of all rows and its raw count, slightly below the bar top
    for bar in ax.patches:
        ax.annotate('{:.1f}%\n{:.0f}'.format(100*bar.get_height()/len(df), bar.get_height()),
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()-400),
                    ha='center', va='center', size=14,
                    xytext=(0, 8), textcoords='offset points')
        
ax = sns.countplot(x='num_procedures', hue='readmitted', palette='husl', data=df)
labels(ax)
plt.title('Readmits Based on Procedures (Sans Lab)')
plt.show()

### FOCUS ON "number of medications"

In [None]:
info('num_medications')

In [None]:
rcParams['figure.figsize'] = 25,10
sns.countplot(x='num_medications', data=df)
mean, median = np.mean(df.num_medications), np.median(df.num_medications)
plt.axvline(mean-df.num_medications.min(), color='blue', label=f'mean:{round(mean,2)}')
plt.axvline(median-df.num_medications.min(), color='black', label=f'median:{round(median,2)}')
plt.title('Number of Distinct Generic Medications Administered During Visit')
plt.legend()
plt.show()

In [None]:
df.groupby('readmitted')['num_medications'].describe()

> **How many medications are patients receiving during their visit?**

In [None]:
ax = sns.boxplot(x='time_in_hospital', y='num_medications', data=df)
# box_labels(ax, df.sort_values('time_in_hospital'),'time_in_hospital','num_medications')
plt.title('Medications Administered Based on Length of Hospital Visit')
plt.show()

> Patients who spend more time in the hospital receive more medications, but there are a few that receive over 60 different kinds of medications.

> **Do readmitted patients receive more medications during their visit?**

In [None]:
ax = sns.boxplot(x='readmitted', y='num_medications', data=df.sort_values('readmitted'))
box_labels(ax, df.sort_values('readmitted'),'readmitted','num_medications')
plt.title('Medications Administered')
plt.show()

> The distribution is almost equal for readmitted and not readmitted patients, with readmits being slightly higher on average.

### FOCUS ON "`number of outpatient`" visits

In [None]:
info('number_outpatient')

In [None]:
def labels(ax):
    # annotate each bar with its share of all rows and its raw count, above the bar top
    for bar in ax.patches:
        ax.annotate('{:.1f}%\n{:.0f}'.format(100*bar.get_height()/len(df), bar.get_height()),
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()+750),
                    ha='center', va='center', size=16,
                    xytext=(0, 8), textcoords='offset points')
        
ax = sns.countplot(x='number_outpatient',data=df)
labels(ax)
plt.title('Number of Outpatient Visits Prior to Encounter')
plt.show()

In [None]:
# outpatient visit stats
df.groupby('readmitted')['number_outpatient'].describe()

In [None]:
# outpatient visits and readmissions
ax = sns.countplot(x='number_outpatient',data=df, hue='readmitted')
labels(ax)
plt.title('Outpatient Visits and Readmissions')
plt.show()

In [None]:
pd.crosstab(df.readmitted, df.number_outpatient, margins=True, margins_name='Total')

> Most patients did not have any outpatient visits prior to the recorded one.
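
The share of patients with no prior visits can be read directly from the data rather than from the bar heights. This is a minimal sketch on made-up values; the column names match the dataset, but the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "number_outpatient": [0, 0, 0, 1, 0, 2, 0, 0],
    "number_emergency":  [0, 0, 0, 0, 0, 0, 1, 0],
    "number_inpatient":  [0, 1, 0, 0, 2, 0, 0, 0],
})

# (toy == 0) is a boolean table; its column means are the shares of zero-visit patients
zero_share = (toy == 0).mean()
print(zero_share)
```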

### FOCUS ON "`number of emergency`" visits

In [None]:
info('number_emergency')

In [None]:
# plt.figure(figsize=(20,5))
ax = sns.countplot(x='number_emergency', data=df)
labels(ax)
plt.title('Number of Emergency Visits Prior to Encounter')
plt.show()

In [None]:
# emergency visits and readmissions
ax = sns.countplot(x='number_emergency', hue='readmitted', data=df)
labels(ax)
plt.title('Emergency Visits and Readmissions')
plt.show()

> Most patients did not visit the emergency room prior to their recorded visit.

In [None]:
pd.crosstab(df.readmitted, df.number_emergency, margins=True, margins_name='Total')

> **How many emergency visits did patients have prior to this visit?**

In [None]:
plt.figure(figsize=(5, 5))
sns.boxplot(x='readmitted', y='number_emergency', data=df)
plt.title('Readmits for Emergency Visits')
plt.show()

### FOCUS ON "`number of inpatient`" visits

In [None]:
info('number_inpatient') # number of inpatient visits in the preceding year

In [None]:
ax = sns.countplot(x='number_inpatient',data=df)
labels(ax)
plt.title('Number of Inpatient Visits Prior to Encounter')
plt.show()

In [None]:
# inpatient visits and readmissions
ax = sns.countplot(x='number_inpatient', hue='readmitted',data=df)
labels(ax)
plt.title('Inpatient Visits and Readmissions')
plt.show()

> Inpatient visits are not common for most patients prior to this visit.

In [None]:
pd.crosstab(df.readmitted, df.number_inpatient, margins=True, margins_name='Total')

### FOCUS ON "`number of diagnoses`"

In [None]:
info('number_diagnoses')

In [None]:
ax = sns.countplot(x='number_diagnoses',data=df)
mean, median = np.mean(df.number_diagnoses), np.median(df.number_diagnoses)
plt.axvline(mean-df.number_diagnoses.min(), color='blue', label=f'mean:{round(mean,2)}')
plt.axvline(median-df.number_diagnoses.min(), color='red', label=f'median:{round(median,2)}')
plt.title('Number of Diagnoses')
plt.legend()
plt.show()

In [None]:
# number of diagnoses and readmit rate
ax = sns.countplot(x='number_diagnoses', hue='readmitted', palette='Accent', data=df)
# labels(ax)
plt.title('Readmits By Number of Diagnoses')
plt.show()

In [None]:
pd.DataFrame(df.number_diagnoses.describe()).T.round(2)

In [None]:
df.groupby('readmitted')['number_diagnoses'].describe().round(2)

In [None]:
# number of diagnoses
pd.crosstab(df.readmitted, df.number_diagnoses, margins=True, margins_name='Total')

* Most patients have up to nine diagnosed conditions during their visit; only a handful have more than nine.
* Readmitted patients tend to have more diagnosed conditions but their average is only slightly higher than those not readmitted.
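
To judge how small "only slightly higher" is, the difference in means can be scaled by the spread of the data. The sketch below computes Cohen's d with a pooled standard deviation on made-up values; `cohens_d` is a helper written for this illustration, not part of any library, and the choice of effect size is an assumption.

```python
import numpy as np

# Hypothetical numbers of diagnoses (toy values, not the real dataset)
readmitted = np.array([7, 8, 9, 9, 9])
not_readmitted = np.array([6, 7, 8, 9, 9])

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(readmitted, not_readmitted)
print("Cohen's d:", round(d, 3))
```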

> **How many diagnoses do readmitted patients have?**

In [None]:
plt.figure(figsize=(8, 6))
ax = sns.boxplot(x='readmitted', y='number_diagnoses', data=df.sort_values('readmitted'))
box_labels(ax, df.sort_values('readmitted'),'readmitted','number_diagnoses')
plt.title('Number of Diagnoses for Readmitted Patients')
plt.show()

### FOCUS ON "`glucose serum test results`"

In [None]:
info('max_glu_serum')

In [None]:
ax = sns.countplot(x='max_glu_serum', data=df)
labels(ax)
plt.title('Glucose Serum Test Results')
plt.show()

> Since the majority of patients do not have a glucose reading, they will be excluded for the next graph in order to show the readmit rates for patients who do have a reading.

In [None]:
def labels(ax, df=df):
    # annotate each bar with its share of the rows in `df` and its raw count
    for p in ax.patches:
        ax.annotate('{:.1f}%\n{:.0f}'.format(100*p.get_height()/len(df), p.get_height()),
                    (p.get_x()+0.2, p.get_height()-27), size=16)

# exclude patients without a glucose reading
glucose_none = df[df.max_glu_serum != 'None']

# glucose serum results and readmit impact
ax = sns.countplot(x='max_glu_serum', hue='readmitted', palette='Accent', data=glucose_none)
labels(ax,glucose_none)
plt.title('Readmits By Glucose Serum Levels')
plt.show()

Patients with a glucose serum reading over 300 are readmitted about half the time. High blood sugar levels are often dangerous for older patients because of the medical complications involved, so it is understandable that more of these patients return to the hospital for additional care.
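
Whether readmission really differs across glucose categories can be checked with a chi-square test of independence on the crosstab. The sketch below uses `scipy.stats.chi2_contingency` on made-up counts; the category labels `Norm`, `>200` and `>300` and the 0/1 coding of `readmitted` are assumptions about the cleaned data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "max_glu_serum": ["Norm"] * 10 + [">200"] * 10 + [">300"] * 10,
    "readmitted": [1] * 3 + [0] * 7 + [1] * 4 + [0] * 6 + [1] * 5 + [0] * 5,
})

# Contingency table of counts: glucose category by readmission status
table = pd.crosstab(toy["max_glu_serum"], toy["readmitted"])

# Null hypothesis: readmission is independent of the glucose category
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print("chi2:", round(chi2, 3), "dof:", dof, "p-value:", round(p_value, 4))
```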

In [None]:

pd.crosstab(df.readmitted, df.max_glu_serum, margins=True, margins_name='Total')

### FOCUS ON "`A1C results`"

In [None]:
info('A1Cresult')

In [None]:
ax = sns.countplot(x='A1Cresult', palette='husl', data=df)
labels(ax)
plt.title('A1c Test Results')
plt.show()

* Similar to the glucose reading, the majority of patients also do not have an HbA1c test reading.
* In order to understand the impact of A1c tests on readmit rates, patients without a reading will be excluded in the graph below.

In [None]:
# exclude patients without an A1C reading
alc_none = df[df.A1Cresult != 'None']

# A1C results and readmit impact
ax = sns.countplot(x='A1Cresult', hue='readmitted', palette='Accent', data=alc_none)
labels(ax, alc_none)
plt.title('Readmits By A1C Test Results')
plt.show()

In [None]:
pd.crosstab(df.readmitted, df.A1Cresult, margins=True, margins_name='Total')

### FOCUS ON "`change`" column

In [None]:
info('change')

> Change in medications, dosage, or brand.

In [None]:
# change in medications
ax = sns.countplot(x='change', hue='readmitted', data=df)
labels(ax)
plt.title('Change in Diabetic Medications')
plt.show()

In [None]:
pd.crosstab(df.change, df.readmitted, margins=True, margins_name='Total')

> **Who is likely to have a change in medication?**

In [None]:
ax = sns.countplot(x='gender', hue='change', palette='Set2', data=df)
labels(ax)
plt.title('Change in Medication Based on Gender')
plt.show()

In [None]:
pd.crosstab(df.gender, df.change, margins=True, margins_name='Total')

### FOCUS ON "`diabetesMed`"

In [None]:
info('diabetesMed')

In [None]:
ax = sns.countplot(x='diabetesMed', hue='readmitted', data=df)
labels(ax)
plt.title('Prescribed Diabetic Medications During Visit')
plt.show()

In [None]:
pd.crosstab(df.diabetesMed, df.readmitted, margins=True, margins_name='Total')

> **Do readmissions by diabetic medication status differ by gender?**

In [None]:
sns.catplot(x='diabetesMed', hue='readmitted', col='gender', palette='Accent', data=df, kind='count', height=4, aspect=1)
plt.show()

### Medications used by patients

In [None]:
columns=['metformin', 'repaglinide', 'nateglinide',
       'chlorpropamide', 'glimepiride', 'glipizide', 'glyburide',
       'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
       'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin',
       'glipizide-metformin', 'metformin-pioglitazone']

plt.figure(figsize=(26, 26))
for i,col in enumerate(columns):
    plt.subplot(6,3,i+1)
    sns.countplot(x=df[col])

> Insulin shows the most dosage activity of all the diabetic medications, most of which are rarely prescribed.
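
The grid of count plots can be summarised as one number per medication: the share of patients who were prescribed it at all. This sketch assumes each medication column holds `No` when the drug was not prescribed and a dosage status such as `Steady`, `Up` or `Down` otherwise; the toy values are made up.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "metformin": ["No", "Steady", "No", "No", "No", "Up"],
    "insulin":   ["Steady", "Up", "No", "Down", "Steady", "No"],
    "acarbose":  ["No"] * 6,
})

# Any value other than "No" counts as prescribed; the column mean is the share of patients
prescribed_share = (toy != "No").mean().sort_values(ascending=False)
print(prescribed_share.round(2))
```

With the real data, `toy` would be replaced by `df[columns]`, using the `columns` list defined in the cell above.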

In [None]:
info('insulin')

In [None]:
sns.countplot(x='insulin', hue='readmitted', data=df)
plt.title('Readmit Rates by Medication: Insulin')
plt.show()