# Drug Classification

**Aim: build an ML model that predicts, with high confidence, which drug type (A, B, C, X, Y) should be given to a particular patient based on their characteristics.**

Content:

The target feature is
* Drug type

The feature sets are:
* Age
* Sex
* Blood Pressure Levels (BP)
* Cholesterol Levels
* Sodium-to-Potassium Ratio (Na_to_K)

In [None]:
!pip install pydotplus # installing here for later usage
# import some packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier


# - Data Exploration & Visualization

In [None]:
df= pd.read_csv('../input/drug-classification/drug200.csv')

In [None]:
df.head()

In [None]:
df.describe(include='all').T

In [None]:
df.info()

We see that for the classification model we will need to encode all 'object'-type variables as dummy variables.

Let us first visualize the data to get a rough idea of what we are dealing with.

In [None]:
# distribution of patients by age
sns.set_palette("Paired")
sns.displot(df['Age'])


In [None]:
# one figure, two side-by-side panels (filled in by plt.subplot below)
plt.figure(figsize=(14,6))
plt.subplot(121)
sns.boxplot(y='Age',x='Drug',data=df,  order= ['drugA', 'drugB', 'drugC', 'drugX', 'DrugY']) 
sns.despine()
plt.title('Distribution of Age by Drug')

# distribution of age by cholesterol level
plt.subplot(122)
sns.boxplot(y='Age',x='Cholesterol',data=df)
sns.despine()
plt.title('Distribution of Age by Cholesterol')

Observations:
* Drug B is clearly given to older patients.
* Patients with high cholesterol tend to be older than those with normal cholesterol.

In [None]:
# what about sex

# one figure, two side-by-side panels (filled in by plt.subplot below)
plt.figure(figsize=(14,6))
plt.subplot(121)
df['Sex'].value_counts().plot(kind='pie', autopct='%1.1f%%', shadow=False, startangle=0)
plt.title('Gender Distribution')

plt.subplot(122)
sns.boxplot(y= 'Age',x='Sex',data=df)
sns.despine()

pd.crosstab(index=df['Drug'],columns=df['Sex'],normalize='columns').plot(kind='bar')
plt.title('Drug Distribution by Sex (normalized)')

* There are roughly 5% more male than female patients.
* Female patients are slightly younger than male patients.
* In the normalized bar chart, each drug is given to roughly the same share of female and male patients; the female share may be slightly higher for drug Y.


# - Data Preprocessing

In [None]:
df.shape

In [None]:
# creation of a matrix of features, select everything except drug type
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X.shape

We need to convert the categorical variables into dummy/indicator variables.
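As a minimal sketch of the two encoding options, the toy example below contrasts label encoding with one-hot (dummy) encoding. The `toy` frame and its rows are hypothetical and only reuse the category labels of this dataset. Note that `LabelEncoder` assigns codes in alphabetical order, so the integers do not follow the natural LOW < NORMAL < HIGH ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows using the same category labels as the dataset
toy = pd.DataFrame({
    "Sex": ["F", "M", "M"],
    "BP": ["HIGH", "LOW", "NORMAL"],
})

# Label encoding: one integer per category, assigned in alphabetical order
bp_encoder = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
bp_codes = {
    label: int(code)
    for label, code in zip(bp_encoder.classes_, bp_encoder.transform(bp_encoder.classes_))
}
print(bp_codes)

# One-hot (dummy) encoding: one 0/1 column per category
dummies = pd.get_dummies(toy, columns=["Sex", "BP"], dtype=int)
print(dummies.columns.tolist())
```

A decision tree splits on thresholds, so arbitrary integer codes are usually workable here; for models that treat the codes as magnitudes, one-hot encoding is often the safer choice.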


In [None]:
# pd.get_dummies would be an option (one-hot encoding)
# however, we can also use LabelEncoder (in sklearn)
# note: LabelEncoder assigns integer codes in alphabetical order of the labels
from sklearn import preprocessing

sex = preprocessing.LabelEncoder()
sex.fit(['F', 'M'])
X[:,1] = sex.transform(X[:,1])


BP = preprocessing.LabelEncoder()
BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = BP.transform(X[:,2])


chol = preprocessing.LabelEncoder()
chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = chol.transform(X[:,3]) 


X[0:4]

In [None]:
# now we have a matrix of features that we can actually fit
# still need to create a vector of target variable
y = df["Drug"].values
y[0:5]

# - Classification

A decision tree model has been chosen for this problem.

In [None]:
# so first of all we need to split the dataset
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
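
With only 200 rows and five drug classes, a plain random split can leave a rare class under-represented in the test set. One possible variation, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical toy labels (`y_toy`, `X_toy`) to show that the class ratio is preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# With stratify, the 20-row test set keeps the 80/20 class ratio
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```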

In [None]:
# check shapes
print('X train shape: ', X_train.shape)
print('y train shape: ', y_train.shape)
print('X test shape: ', X_test.shape)
print('y test shape: ', y_test.shape)

In [None]:
# now let's model
# I am choosing a decision tree based on the entropy criterion

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree 
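
To make the entropy criterion concrete, here is a small sketch of how entropy and information gain can be computed for a candidate split. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions; scikit-learn's own implementation is internal and more elaborate.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with two classes, split perfectly into pure children
parent = ["drugA", "drugA", "drugB", "drugB"]
gain = information_gain(parent, parent[:2], parent[2:])
print(entropy(parent), gain)
```

At each node the tree prefers the split with the largest entropy reduction; a 50/50 node that splits into two pure children gains the full 1 bit.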


In [None]:
# now we fit the data
drugTree.fit(X_train,y_train)

In [None]:
# let's see how well the model fits: look at its accuracy

y_pred = drugTree.predict(X_test)

In [None]:
# compare the first few predictions with the true labels
print (y_pred [0:5])
print (y_test [0:5])

In [None]:
from sklearn import metrics
print("Decision Tree's Accuracy: ", metrics.accuracy_score(y_test, y_pred))
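
Accuracy alone does not show which drugs get confused with each other. As a hedged sketch, a confusion matrix and a per-class report can complement it. The labels below (`y_true_toy`, `y_pred_toy`) are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Hypothetical true and predicted labels, for illustration only
y_true_toy = ["drugA", "drugA", "drugB", "drugX", "drugX", "drugX"]
y_pred_toy = ["drugA", "drugB", "drugB", "drugX", "drugX", "drugA"]

labels = ["drugA", "drugB", "drugX"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

acc = accuracy_score(y_true_toy, y_pred_toy)
print(acc)

# Precision, recall and F1 per class
print(classification_report(y_true_toy, y_pred_toy, labels=labels, zero_division=0))
```

On the real data, the same calls could be applied to `y_test` and `y_pred`.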

In [None]:
# let's try a visualization of the decision tree
from  io import StringIO
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

dot_data = StringIO()
filename = "tree.png"
featureNames = df.columns[0:5].tolist()
# class names in the same sorted order the fitted tree uses
targetNames = drugTree.classes_.tolist()
tree.export_graphviz(drugTree, feature_names=featureNames, out_file=dot_data, class_names=targetNames, filled=True, special_characters=True, rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img,interpolation='nearest')
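
If Graphviz or `pydotplus` is not available, a text dump of the learned rules is a lightweight alternative. The sketch below uses scikit-learn's `export_text` on a hypothetical one-feature toy dataset (`X_toy`, `y_toy`, `toy_tree`), not on the tree fitted above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one numeric feature that separates two classes
X_toy = [[10.0], [12.0], [20.0], [25.0]]
y_toy = ["drugX", "drugX", "DrugY", "DrugY"]

toy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Plain-text view of the split rules, one line per node
rules = export_text(toy_tree, feature_names=["Na_to_K"])
print(rules)
```

For the notebook's own model, the equivalent call would pass `drugTree` and the five feature names as a list.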