# Drug Classification
### Zaquin - 27 Jun 2021

This analysis will build a classification model to prescribe patients with the correct pharmaceutical. The data set was retrieved from https://www.kaggle.com/pablomgomez21/drugs-a-b-c-x-y-for-decision-trees.

# Contents
***
<ol>
    <li><a href="#Introduction"> Introduction </a></li>
    <li><a href="#Setup"> Environment Setup </a></li>   
    <li><a href="#ExploratoryDataAnalysis"> Exploratory Data Analysis </a></li>
    <ul style="list-style-type:circle;">
        <li><a href="#Plots"> Exploratory Plots </a></li>
        <li><a href="#CTabs"> Cross-Tab Tables </a></li>
        <li><a href="#Corr"> Correlation Matrix </a></li></ul>
    <li><a href="#Classification"> Classification Models </a></li>
    <ul style="list-style-type:circle;">
        <li><a href="#Logistic"> Logistic Regression </a></li>
        <li><a href="#KNN"> K-Nearest Neighbors </a></li>
        <li><a href="#NB"> Naive-Bayes </a></li>
        <li><a href="#Tree"> Decision Tree </a></li></ul>
    <li><a href="#Results"> Results </a></li>
    <li><a href="#REF"> References </a></li>
</ol>

# 1. Introduction <a id="Introduction"></a>
***
As technology advances, scientists are discovering new pharmaceuticals to treat or cure diseases that were previously thought to be fatal. The rise in new treatments raises another challenge: which treatment is best for the patient? Most drugs have side effects, unintended reactions that can occur after taking the drug. Depending on the patient's health and medical history, these side effects can cause serious health problems. Doctors need to consider these factors when prescribing a drug, but doctors are people too and are subject to human error. In fact, according to a 2014 Harvard study, newly approved prescription medication has a 20% chance of causing serious side effects [1].

This analysis will explore various classification models in order to accurately prescribe patients with the correct pharmaceutical. Each model created will be analyzed for accuracy by using a confusion matrix. The types of classification models are as follows: 

1. Logistic Regression   
2. K-Nearest Neighbors (KNN)   
3. Naive-Bayes   
4. Decision Tree   

Gaussian Naive-Bayes classification will be used to build the Naive-Bayes model. Each classification model's accuracy and confusion matrix will be compared, and the best performing model based on these criteria will be selected. 

Exploratory data analysis will be performed prior to building the classification models. Various plots will be generated to understand the attribute distributions and relationships. Data cleaning and preprocessing will be performed as part of exploratory data analysis.

# 2. Environment Setup <a id="Setup"></a>
***

In [None]:
# Import modules
import pandas as pd
import numpy as np
import seaborn as sns
import pydot
import graphviz
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Load data
drugdf = pd.read_csv("../input/drugs-a-b-c-x-y-for-decision-trees/drug200.csv")
# Sort the labels so their order matches the integer codes LabelEncoder assigns later
alabs = np.sort(drugdf.Drug.unique())
drugdf.tail()

# 3. Exploratory Data Analysis <a id="ExploratoryDataAnalysis"></a>
***

In [None]:
# Descriptive Statistics
drugdf.describe(include="all")

Patient ages range from 15 to 74, and over half of the patients are male. Over 1/3 of the patients have high blood pressure, and over half have high cholesterol. The average sodium potassium ratio is below the optimal range of 30-35. The sodium potassium ratio is an indicator of cardiovascular disease, and a low ratio can indicate chronic stress. Chronic stress increases the production of cortisol, a hormone responsible for cell and tissue breakdown. Low sodium potassium ratios could indicate catabolism, where the body breaks down tissues faster than it regenerates them [2]. It would be interesting to see how blood pressure and cholesterol affect the sodium potassium ratio, as high blood pressure and cholesterol are often associated with high stress levels. DrugY has the highest prescription frequency, at 91 patients (45.5%).
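The frequency and share quoted for the most common drug can be checked directly with `value_counts`. The following is a minimal sketch that assumes a small made-up frame (`toy`) in place of `drugdf`; only the column name `Drug` comes from the data set, and the values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for drugdf["Drug"]; the values are illustrative only
toy = pd.DataFrame({"Drug": ["drugY", "drugX", "drugY", "drugA", "drugY", "drugX"]})

counts = toy["Drug"].value_counts()                # absolute frequency per drug
shares = toy["Drug"].value_counts(normalize=True)  # relative frequency per drug

# Combine both views into one table, aligned on the drug label
summary = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(1)})
print(summary)
```

Because `value_counts` sorts by frequency, the first row of `summary` is the most frequently prescribed drug.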

## 3.1 Exploratory Plots <a id="Plots"></a>
***

In [None]:
# Histograms
# Age
%matplotlib inline
plt.hist(drugdf["Age"], bins=15, alpha=0.85)
plt.xlabel("Patient Age")
plt.ylabel("Count")
plt.title("Histogram of Patient Age")
plt.grid(True)
plt.show()

# Na_to_K
plt.hist(drugdf["Na_to_K"], bins=15, alpha=0.85)
plt.xlabel("Patient Na:K Ratio")
plt.ylabel("Count")
plt.title("Histogram of Patient Na:K Ratio")
plt.grid(True)
plt.show()

The age variable has a relatively uniform distribution, with a few spikes. The sodium potassium ratio distribution is positively skewed.
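The visual impression of skew can be backed by a number using pandas' `Series.skew()`. The sketch below uses synthetic samples rather than the real columns, so the variable names and distributions are assumptions: a value near 0 suggests a symmetric distribution, while a clearly positive value indicates a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical samples: one roughly uniform, one with a long right tail
uniform_like = pd.Series(rng.uniform(15, 74, size=2000))
right_skewed = pd.Series(rng.exponential(scale=8.0, size=2000) + 6)

# Sample skewness: near 0 for symmetric data, positive for right-skewed data
print("uniform-like skew:", round(uniform_like.skew(), 2))
print("right-skewed skew:", round(right_skewed.skew(), 2))
```

On the real data, the same call would be `drugdf["Age"].skew()` and `drugdf["Na_to_K"].skew()`.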

In [None]:
# Pie Charts
# Labels are taken from value_counts().index so each label matches its slice
# Sex
sex_counts = drugdf["Sex"].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(sex_counts, labels=sex_counts.index, autopct="%1.1f%%")
ax1.axis("equal")
ax1.set_title("Patient Sex Breakdown")

# BP
bp_counts = drugdf["BP"].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(bp_counts, labels=bp_counts.index, autopct="%1.1f%%")
ax1.axis("equal")
ax1.set_title("Patient Blood Pressure Breakdown")

# Cholesterol
chol_counts = drugdf["Cholesterol"].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(chol_counts, labels=chol_counts.index, autopct="%1.1f%%")
ax1.axis("equal")
ax1.set_title("Patient Cholesterol Breakdown")

# Drug
drug_counts = drugdf["Drug"].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(drug_counts, labels=drug_counts.index, autopct="%1.1f%%")
ax1.axis("equal")
ax1.set_title("Drug Prescription Rate")
plt.show()

The patients are roughly evenly split between male and female, with a slight male majority, consistent with the descriptive statistics. Over 2/3 of the patients have abnormal blood pressure, and nearly 40% of the patients have high blood pressure. Over half of the patients have high cholesterol as well. DrugY is the most commonly prescribed drug.

## 3.2 Cross-Tab Tables <a id="CTabs"></a>
***
Cross tabulation tables (cross-tab tables) provide another method to uncover relationships in the data set. 
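As a rough illustration of what a cross-tab computes, the sketch below builds one on a tiny made-up frame (`toy`, not the real data). Without `values`, `pd.crosstab` counts the rows in each cell. With `values` and `aggfunc`, it aggregates a third column within each cell.

```python
import pandas as pd

# Hypothetical miniature data set; the values are illustrative only
toy = pd.DataFrame({
    "Drug": ["drugA", "drugA", "drugB", "drugB", "drugB"],
    "Sex":  ["F", "M", "F", "F", "M"],
    "Age":  [30, 40, 60, 70, 65],
})

# Number of patients for each (Drug, Sex) combination
counts = pd.crosstab(toy["Drug"], toy["Sex"])

# Mean Age within each (Drug, Sex) cell
mean_age = pd.crosstab(toy["Drug"], toy["Sex"], values=toy["Age"], aggfunc="mean")

print(counts)
print(mean_age)
```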

### Average Age by Drug and Sex

In [None]:
# Avg Age
dsa = pd.crosstab(drugdf.Drug, drugdf.Sex, values=drugdf.Age, aggfunc="mean")
dsa

The table above shows an interesting relationship between age and the prescribed drug. It would appear that drugA is prescribed to younger patients on average. DrugC, X, and Y are prescribed to middle-aged patients, and drugB is prescribed to more senior patients, on average. It would be interesting to see what other medications the patients are prescribed, as medications can interact and cause serious side effects. It's quite possible that drugB is assigned to patients taking another type of drug to avoid such side effects.

### Average Sodium Potassium Ratio by Drug and Sex

In [None]:
# Avg NaK
dsnak = pd.crosstab(drugdf.Drug, drugdf.Sex, values=drugdf.Na_to_K, aggfunc="mean")
dsnak

DrugY, on average, is prescribed to patients with a higher sodium potassium ratio. The patients prescribed drugY have an average sodium potassium ratio closest to the nominal range, while the patients prescribed the other drugs have a much lower average ratio.

## 3.3 Correlation Matrix <a id="Corr"></a>
***
Prior to creating the correlation matrix, the categorical variables must be converted to numeric. This will serve as the preprocessing for the classification models as well.
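As a small illustration of this encoding step, the sketch below applies `LabelEncoder` to made-up values. The frame `toy` and the `encoders` dictionary are hypothetical helpers, not part of the analysis. Keeping one fitted encoder per column is one way to map the integer codes back to the original labels later, and `classes_` shows that codes are assigned in sorted label order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame; the values are illustrative only
toy = pd.DataFrame({
    "BP": ["HIGH", "LOW", "NORMAL", "LOW"],
    "Drug": ["drugY", "drugC", "drugX", "drugC"],
})

encoders = {}         # hypothetical helper: one fitted encoder per column
encoded = toy.copy()  # copy so the original frame is left untouched
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ is sorted, so integer code i corresponds to classes_[i]
print(list(encoders["Drug"].classes_))
print(encoders["Drug"].inverse_transform(encoded["Drug"]))
```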

In [None]:
# Create preprocessed df
le = LabelEncoder()
drug_pp = drugdf.copy()  # copy so the raw data in drugdf is not overwritten
drug_pp["Sex"] = le.fit_transform(drug_pp["Sex"])
drug_pp["BP"] = le.fit_transform(drug_pp["BP"])
drug_pp["Cholesterol"] = le.fit_transform(drug_pp["Cholesterol"])
drug_pp["Drug"] = le.fit_transform(drug_pp["Drug"])
drug_pp.head()

In [None]:
# Create and display corr matrix
%matplotlib inline
cor_mat = drug_pp.corr()
sns.heatmap(cor_mat, annot=True)
plt.title("Correlation Matrix")
plt.show()

The correlation matrix shows that, of all the attributes, the sodium potassium ratio and blood pressure are the most strongly correlated with the prescribed drug.

# 4. Classification Models <a id="Classification"></a>
***
The data will be split into training and test sets using scikit-learn's train_test_split function, with 20% of the data held out for testing. Each model in this analysis will be fit on the training set and evaluated on the test set.
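To make the split sizes concrete, here is a minimal sketch on random data with the same shape as the preprocessed data set (200 rows, 5 feature columns). The arrays `x_demo` and `y_demo` are made up for illustration. With `test_size=0.2`, 160 rows go to training and 40 to testing, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with 200 rows and 5 feature columns
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 5, size=200)

# Hold out 20% of the rows for testing
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)

print(x_tr.shape, x_te.shape)
```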

In [None]:
# Separate target
x = drug_pp.iloc[:, :-1].values
y = drug_pp.iloc[:, -1].values

# Create test and train sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

## 4.1 Logistic Regression <a id="Logistic"></a>
***

In [None]:
# Create model
lgr = LogisticRegression(random_state=0, max_iter=2000).fit(x_train, y_train)

# Predict with test set
lgr_pred = lgr.predict(x_test)
print("Accuracy Score = {}%".format(round(accuracy_score(y_test, lgr_pred)*100,2)))

In [None]:
# Confusion Matrix
%matplotlib inline
cmat = confusion_matrix(y_test, lgr_pred)
fig, ax = plt.subplots()
sns.heatmap(cmat, annot=True, cmap="BuPu")
ax.set_xticklabels(alabs)
ax.set_yticklabels(alabs)
plt.title("Logistic Regression Confusion Matrix")
plt.show()

In [None]:
# Print classification report
print("Classification Report:"+"\n",classification_report(y_test, lgr_pred), sep="\n")

The logistic regression model performed very well overall, with an accuracy score of 95%. Both of its errors involve drugC and drugX, as shown in the classification report: the precision for row 1 (drugC) is 50% and the recall for row 2 (drugX) is 50%, indicating that the model miscategorized 2 instances of drugX as drugC.
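How one kind of confusion shows up in both columns of a classification report can be sketched on made-up labels. When instances of one class are predicted as another, the predicted class loses precision and the true class loses recall. The arrays below are hypothetical and are not the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: class 1 has 2 true instances, class 2 has 4
y_true = np.array([1, 1, 2, 2, 2, 2])
# Two true class-2 instances are predicted as class 1
y_hat = np.array([1, 1, 1, 1, 2, 2])

# average=None returns one score per class, in the order given by labels
precision = precision_score(y_true, y_hat, average=None, labels=[1, 2])
recall = recall_score(y_true, y_hat, average=None, labels=[1, 2])

print("precision per class:", precision)  # class 1: 2 of 4 predictions are correct
print("recall per class:   ", recall)     # class 2: 2 of 4 true instances are found
```

Here class 1 ends up with 50% precision and class 2 with 50% recall, the same pattern as two instances of one drug being predicted as another.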

## 4.2 K-Nearest Neighbors <a id="KNN"></a>
***
10-fold cross validation will be performed to determine the optimal value for K. Models will be generated for K=1 to K=50, and 10-fold cross validation will be performed for each one, recording the average accuracy score across all 10 folds. The average misclassification error (1 - accuracy) will then be plotted against the K values. Models will be created for the K value(s) with the lowest misclassification error, and the one with the highest accuracy will be selected as the KNN model.

In [None]:
# Create list to store scores
cv_scores = []
ks = list(range(1,51,1))

# Perform CV
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    cscore = cross_val_score(knn, x_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(cscore.mean())

# Convert mean accuracy to misclassification error (1 - accuracy)
mce = [1 - score for score in cv_scores]

# Plot error against K (a single figure, sized for readability)
plt.figure(figsize=(15,10))
plt.title("Optimal number of neighbors", fontsize=20, fontweight='bold')
plt.xlabel("K", fontsize=15)
plt.ylabel("Misclassification Error", fontsize=15)
plt.plot(ks, mce)

plt.show()

Based on the cross validation results, K = 3 and K = 4 yield the lowest average misclassification error. Models with 3 and 4 neighbors will be generated, and the one with the higher accuracy will be selected as the KNN model.
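The K values with the lowest error can also be picked programmatically instead of being read off the plot. The sketch below uses made-up accuracy scores (`cv_demo`) for a handful of K values, so the numbers are illustrative only. It keeps every K whose error ties for the minimum.

```python
import numpy as np

# Hypothetical mean CV accuracies for K = 1..6 (illustrative numbers only)
ks_demo = [1, 2, 3, 4, 5, 6]
cv_demo = [0.66, 0.64, 0.70, 0.70, 0.68, 0.65]

# Misclassification error is 1 - accuracy
errors = 1 - np.array(cv_demo)

# Keep every K whose error ties for the minimum
best_ks = [k for k, err in zip(ks_demo, errors) if np.isclose(err, errors.min())]
print(best_ks)
```

With the notebook's own variables, the same comprehension would run over `ks` and the misclassification errors computed from `cv_scores`.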

### KNN with K = 3
***

In [None]:
# KNN (K=3)
# Create model
knn_3 = KNeighborsClassifier(n_neighbors=3).fit(x_train, y_train)

# Predict with test set
knn_3_pred = knn_3.predict(x_test)
print("Accuracy Score = {}%".format(round(accuracy_score(y_test, knn_3_pred)*100,2)))

In [None]:
# Confusion Matrix
%matplotlib inline
cmat = confusion_matrix(y_test, knn_3_pred)
fig, ax = plt.subplots()
sns.heatmap(cmat, annot=True, cmap="BuPu")
ax.set_xticklabels(alabs)
ax.set_yticklabels(alabs)
plt.title("KNN (K=3) Confusion Matrix")
plt.show()

In [None]:
# Print classification report
print("Classification Report:"+"\n",classification_report(y_test, knn_3_pred), sep="\n")

### KNN with K = 4
***

In [None]:
# KNN (K=4)
# Create model
knn_4 = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)

# Predict with test set
knn_4_pred = knn_4.predict(x_test)
print("Accuracy Score = {}%".format(round(accuracy_score(y_test, knn_4_pred)*100,2)))

In [None]:
# Confusion Matrix
%matplotlib inline
cmat = confusion_matrix(y_test, knn_4_pred)
fig, ax = plt.subplots()
sns.heatmap(cmat, annot=True, cmap="BuPu")
ax.set_xticklabels(alabs)
ax.set_yticklabels(alabs)
plt.title("KNN (K=4) Confusion Matrix")
plt.show()

In [None]:
# Print classification report
print("Classification Report:"+"\n",classification_report(y_test, knn_4_pred), sep="\n")

The KNN model with K=3 had a higher accuracy score than the K=4 model, so the K=3 model is selected as the KNN model. Even so, it performs poorly, with an accuracy score of only 67.5%. DrugX was not correctly categorized in either KNN model, and drugA had the most incorrect classifications in both models.

## 4.3 Naive-Bayes <a id="NB"></a>
***

In [None]:
# Create model
gnb = GaussianNB().fit(x_train, y_train)

# Predict with test set
gnb_pred = gnb.predict(x_test)
print("Accuracy Score = {}%".format(round(accuracy_score(y_test, gnb_pred)*100,2)))

In [None]:
# Confusion Matrix
%matplotlib inline
cmat = confusion_matrix(y_test, gnb_pred)
fig, ax = plt.subplots()
sns.heatmap(cmat, annot=True, cmap="BuPu")
ax.set_xticklabels(alabs)
ax.set_yticklabels(alabs)
plt.title("Naive-Bayes Confusion Matrix")
plt.show()

In [None]:
# Print classification report
print("Classification Report:"+"\n",classification_report(y_test, gnb_pred), sep="\n")

The Naive-Bayes classification model had some issues classifying drugB. The accuracy of the model is 87.5%.

## 4.4 Decision Tree <a id="Tree"></a>
***

In [None]:
# Create model
dtree = DecisionTreeClassifier().fit(x_train, y_train)

# Predict with test set
dtree_pred = dtree.predict(x_test)
print("Accuracy Score = {}%".format(round(accuracy_score(y_test, dtree_pred)*100,2)))

In [None]:
# Confusion Matrix
%matplotlib inline
cmat = confusion_matrix(y_test, dtree_pred)
fig, ax = plt.subplots()
sns.heatmap(cmat, annot=True, cmap="BuPu")
ax.set_xticklabels(alabs)
ax.set_yticklabels(alabs)
plt.title("Decision Tree Confusion Matrix")
plt.show()

In [None]:
# Print classification report
print("Classification Report:"+"\n",classification_report(y_test, dtree_pred), sep="\n")

The decision tree model correctly classified every instance in the test set. Below is a visualization of the decision tree.

In [None]:
# Export decision tree to file
# Pass plain lists; class_names must follow the ascending order of the encoded classes
export_graphviz(dtree, out_file="decis_tree.dot", feature_names=list(drugdf.columns[:-1]), class_names=list(alabs))

# Create png of decision tree
(graph,) = pydot.graph_from_dot_file("decis_tree.dot")
graph.write_png("DecisTree.png")

# Show decision tree
with open("decis_tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

# 5. Results <a id="Results"></a>
***

Comparing the models, the decision tree was the best performing classifier, with an accuracy of 100%. Logistic regression was a close second with an accuracy of 95%. Both the KNN model and the logistic regression model made errors when classifying drugX. This might be because patients prescribed drugX had a very similar average sodium potassium ratio and age to patients prescribed drugC. The Naive-Bayes model struggled to classify drugB, which could be because the patients prescribed drugs A, B, C, and X all had similar average sodium potassium ratios.
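The comparison can be laid out side by side with a small pandas Series. This is only a sketch: the accuracy values are the test-set scores reported in the sections above, typed in by hand rather than recomputed from the fitted models.

```python
import pandas as pd

# Test-set accuracies (%) as reported in the sections above
accuracy = pd.Series(
    {
        "Decision Tree": 100.0,
        "Logistic Regression": 95.0,
        "Naive-Bayes": 87.5,
        "KNN (K=3)": 67.5,
    },
    name="accuracy_pct",
)

# Rank the models from most to least accurate
ranking = accuracy.sort_values(ascending=False)
print(ranking)
print("Best model:", ranking.idxmax())
```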

It would be interesting to get additional data about the patients and the drugs themselves. As stated previously, drugs can interact with each other and cause serious side effects. Other medications the patients are taking could drastically affect the model results. Also, the medical history of the patients, such as cardiovascular issues, cancer diagnoses, and kidney or liver problems, will affect physicians' decisions on which medication to prescribe. This additional information could help fine-tune the models to ensure patient safety. Information about the side effects of each drug would also help to fine-tune the models.

# 6. References <a id="REF"></a>
***

[1] 	D. W. Light, "New Prescription Drugs: A Major Health Risk With Few Offsetting Advantages," 27 June 2014. [Online]. Available: https://ethics.harvard.edu/blog/new-prescription-drugs-major-health-risk-few-offsetting-advantages. [Accessed 27 June 2021].  
[2] 	D. Weatherby, "Know Your Biomarkers: Sodium Potassium Ratio," OptimalDX, 20 May 2019. [Online]. Available: https://www.optimaldx.com/blog/sodium-potassium-ratio/. [Accessed 27 June 2021].

