Through exploratory data analysis, we aim to build a simple decision tree for choosing which drug to give a patient, based on the prescriptions recorded for the patients in the dataset.

# Imports and Reading Data

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/drug-classification/drug200.csv')

df.head()

Each patient has a value for age, sex, blood pressure, cholesterol and the sodium-to-potassium ratio in the blood (Na_to_K). We aim to classify their drug prescription based on these factors.

# Initial Exploration

Initially, we check how many people were given each drug, and how many different drugs there are.

In [None]:
# Pass the column as a keyword: recent seaborn versions reject it as a positional argument
sns.countplot(x='Drug', data=df)
plt.show()

We see that there are 5 different drugs, with most patients given drug Y.

# Data Analysis

We start by plotting the continuous variables in a scatter plot, with each drug type coloured differently. We look for obvious clusters from which we can deduce a drug prescription based on Na_to_K and/or age.

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x='Age', y='Na_to_K', hue='Drug', data=df, s=100)
# Dashed horizontal guide at Na_to_K = 15
plt.axhline(15, color='b', linestyle='--')
plt.show()
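Whether a threshold such as the dashed line at 15 separates one drug from the rest can also be checked numerically rather than by eye. The snippet below is a minimal sketch: `drugs_on_each_side` is a hypothetical helper, and the `toy` rows and single-letter drug labels are invented for illustration, not taken from drug200.csv.

```python
import pandas as pd

# Toy rows invented for illustration only; they are NOT from drug200.csv,
# and the single-letter drug labels are placeholders.
toy = pd.DataFrame({
    "Na_to_K": [25.4, 18.0, 13.1, 10.1, 7.8],
    "Drug": ["Y", "Y", "C", "X", "A"],
})

def drugs_on_each_side(frame, threshold=15):
    """List the distinct drugs above and at-or-below an Na_to_K threshold."""
    above = frame.loc[frame["Na_to_K"] > threshold, "Drug"]
    below = frame.loc[frame["Na_to_K"] <= threshold, "Drug"]
    return sorted(above.unique()), sorted(below.unique())

above, below = drugs_on_each_side(toy)
print("above:", above)
print("at or below:", below)
```

Calling `drugs_on_each_side(df)` on the full table would show whether the two sides share any drug; if the two lists do not overlap, the threshold splits the drugs cleanly.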

We see a clear divide for those who were given drug Y: they all have an Na_to_K value greater than 15. This is therefore the first criterion in our algorithm, and from this point it is enough to consider those with Na_to_K <= 15.

In [None]:
df = df[df.Na_to_K <= 15]
df.head()

We check sex as the next factor. The drugs seem to be allocated about equally between the sexes, so this factor is probably not a good predictor of prescription.

In [None]:
sns.countplot(x='Sex', hue='Drug', data=df)
plt.show()

Next, we check blood pressure for this group. 

In [None]:
sns.countplot(x='BP', hue='Drug', data=df)
plt.show()

A further rule emerges: if BP is normal and Na_to_K is at most 15, drug X is always allocated, so we remove this case from our further analysis. Also notice that high blood pressure patients are given only drug A or B, while low blood pressure patients are given only drug C or X.

In [None]:
df_low = df[df.BP == 'LOW']
df_high = df[df.BP == 'HIGH']
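The counts behind a grouped count plot can also be read as a table, which makes "always" and "exclusively" claims easier to verify. The following is a sketch using `pd.crosstab`; the `toy` frame and its single-letter drug labels are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; NOT from drug200.csv.
# Drug labels are placeholders.
toy = pd.DataFrame({
    "BP": ["NORMAL", "NORMAL", "HIGH", "HIGH", "LOW", "LOW"],
    "Drug": ["X", "X", "A", "B", "C", "X"],
})

# Rows: blood pressure level; columns: drug; cells: number of patients.
counts = pd.crosstab(toy["BP"], toy["Drug"])
print(counts)

# A BP level maps to a single drug when exactly one cell in its row is non-zero.
single_drug = (counts > 0).sum(axis=1) == 1
print(single_drug)
```

Running `pd.crosstab(df.BP, df.Drug)` on the filtered table would give the same view for the real data: a row with one non-zero cell corresponds to a rule that needs no further splitting.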

We consider the age distribution among the high blood pressure group.

In [None]:
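# Fix the drug order once so the curves and legend entries are guaranteed to match
drugs = sorted(df_high.Drug.unique())
for drug in drugs:
    # fill=True replaces the deprecated shade=True
    sns.kdeplot(df_high[df_high.Drug == drug].Age, fill=True, legend=False)
plt.legend(drugs)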
plt.show()

Notice there is a clear difference in age distribution for those assigned these two drugs! The next plot shows this divide more clearly.

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x='Age', y='Na_to_K', hue='Drug', data=df_high, s=100)
# Dashed vertical guide at age 50
plt.axvline(50, color='r', linestyle='--')
plt.show()

So, of those with high BP and Na_to_K <= 15, patients aged 50 or younger are prescribed drug A, while those older than 50 are prescribed drug B. It now remains to consider those with low BP. For these, we look at cholesterol.

In [None]:
sns.countplot(x='Cholesterol', hue='Drug', data=df_low)
plt.show()

We see that if the patient has high cholesterol, we prescribe drug C, while drug X is prescribed to those with normal cholesterol. This completes our analysis!

# Decision Tree

The following decision tree sums up our findings in the EDA, and gives a simple algorithm to decide which drug to prescribe!

In [None]:
im = plt.imread('../input/decision-tree/Blank Diagram.png')
plt.figure(figsize=(20, 20))
plt.imshow(im, cmap='gray')
plt.show()
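The same rules can be written as a small function, which makes them easy to apply to a new patient. This is a sketch of the rules found in the EDA above, not a fitted model: the function name `prescribe` and the single-letter return values are choices made here, and the 'NORMAL' BP code and 'HIGH' cholesterol code are assumptions about how those columns are stored (only 'LOW' and 'HIGH' for BP appear in the code above).

```python
def prescribe(age, bp, cholesterol, na_to_k):
    """Return the drug letter suggested by the rules found in the EDA.

    bp and cholesterol are expected as upper-case strings. 'NORMAL' for BP
    and 'HIGH' for cholesterol are assumed codes, not confirmed by the source.
    """
    # Rule 1: a high sodium-to-potassium ratio decides on its own.
    if na_to_k > 15:
        return "Y"
    # Rule 2: normal blood pressure with Na_to_K <= 15.
    if bp == "NORMAL":
        return "X"
    # Rule 3: high blood pressure is split by age at 50.
    if bp == "HIGH":
        return "A" if age <= 50 else "B"
    # Rule 4: low blood pressure is split by cholesterol.
    if bp == "LOW":
        return "C" if cholesterol == "HIGH" else "X"
    raise ValueError(f"unexpected BP value: {bp!r}")


print(prescribe(age=47, bp="LOW", cholesterol="HIGH", na_to_k=13.1))  # C
print(prescribe(age=61, bp="HIGH", cholesterol="NORMAL", na_to_k=9.4))  # B
```

Unknown BP codes raise an error rather than falling through to a default drug, so a mismatch with the dataset's coding shows up immediately.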

I love how simple this solution is!