# **WHICH IS WHICH?**

<img src="https://media.giphy.com/media/l2SpOKqWUZ2KRHne8/giphy.gif">

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/drug-classification/drug200.csv')
print(df.shape)
df.head()

In [None]:
df.info()

Several columns in our data are *objects* (text labels). For data analysis it helps to see the labels, so I will not make them numeric for now.

In [None]:
plt.rcParams["figure.figsize"] = (10,3)

In [None]:
sns.relplot(x='BP', y='Drug', data=df, color='red')
plt.grid()

There seems to be a specific pattern across the blood pressure levels.
1. Drug Y is the basic drug; it is prescribed at every blood pressure level.
1. Only people with high blood pressure use Drug A and Drug B.
1. Drug C is only for people with low blood pressure.
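
The pattern read off the plot can be cross-checked with a contingency table. This is a minimal sketch on a small hypothetical frame called `toy`: the rows and the label spellings are made up for illustration and are not taken from the dataset. In the notebook you would pass `df['BP']` and `df['Drug']` instead.

```python
import pandas as pd

# Hypothetical rows for illustration only; use df in the notebook.
toy = pd.DataFrame({
    "BP":   ["HIGH", "HIGH", "LOW", "NORMAL", "LOW", "HIGH"],
    "Drug": ["drugA", "drugB", "drugC", "drugX", "DrugY", "DrugY"],
})

# Rows: blood pressure level, columns: drug, cells: number of patients.
bp_vs_drug = pd.crosstab(toy["BP"], toy["Drug"])
print(bp_vs_drug)
```

A zero in a cell means no patient with that blood pressure level received that drug, which is what the empty regions of the plot suggest.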

In [None]:
sns.swarmplot(x='Sex', y='Drug', data=df)

Sex makes virtually no difference to the drug selection.

In [None]:
sns.swarmplot(x='Cholesterol', y='Drug', data=df)

1. Drug C is only for people with high cholesterol.
1. The rest of the drugs are used in both scenarios.

In [None]:
sns.swarmplot(x='Drug',y='Age', data=df)
plt.grid()

1. Drug Y seems to be a generic drug; it could be cold medicine or something like that.
1. Drug B is only for people aged 50 and over.
1. Drug A is only for people aged 20 to 50.
1. Drug C is the rarest drug here.

In [None]:
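# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot.
sns.histplot(df['Na_to_K'], kde=True)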

In [None]:
sns.barplot(x='Drug', y='Na_to_K', data=df)

1. Most people have a similar sodium-to-potassium ratio.
1. Every drug except Drug Y is used at around the same sodium-to-potassium ratio. Drug Y is for people with a high Na/K ratio.
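
The bar plot shows the mean ratio per drug; the same comparison can be made numerically with a group-by. This is a minimal sketch on a hypothetical frame `toy` whose values are invented for illustration. In the notebook, `df` would take its place.

```python
import pandas as pd

# Hypothetical values for illustration only; use df in the notebook.
toy = pd.DataFrame({
    "Drug":    ["DrugY", "DrugY", "drugA", "drugX"],
    "Na_to_K": [25.0, 19.0, 10.0, 12.0],
})

# Minimum, mean and maximum Na/K ratio for each drug.
summary = toy.groupby("Drug")["Na_to_K"].agg(["min", "mean", "max"])
print(summary)
```

Looking at the minimum and maximum next to the mean shows whether the drugs overlap in their ratio ranges, which the mean alone does not reveal.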

We have completed our analysis. **Normally** I would drop the *Sex* column because its two classes behave so similarly, **BUT** in this case I will not, because the dataset is already small and I need every bit of information I can get from it.

In [None]:
x = df.drop(['Drug'], axis=1)
y = df['Drug']

As I said at the beginning, I need to get rid of the *objects* and turn them into *numeric* values.

In [None]:
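# One-hot encode the categorical features; Age and Na_to_K are left unchanged.
x = pd.get_dummies(x, columns=['Sex','BP','Cholesterol'])
# One-hot encode the target: one column per drug.
y = pd.get_dummies(y)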
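
To see what one-hot encoding does to the columns, here is a minimal sketch on a two-row hypothetical frame `toy` (made-up rows, not from the dataset). Passing `dtype=int` is my own assumption, added so the output is 0/1 rather than booleans; the notebook cell above does not set it.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Age": [23, 47],
    "Sex": ["F", "M"],
    "BP":  ["HIGH", "LOW"],
})

# Each listed column is replaced by one 0/1 column per category,
# named <column>_<category>; numeric columns pass through untouched.
encoded = pd.get_dummies(toy, columns=["Sex", "BP"], dtype=int)
print(encoded.columns.tolist())
# ['Age', 'Sex_F', 'Sex_M', 'BP_HIGH', 'BP_LOW']
```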

In [None]:
x.head()

In [None]:
y.head()

**We are ready to go.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import metrics

In [None]:
x.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

Because my target data has five dimensions (one one-hot column per drug), I will not be able to use regression models.
If you want to apply regression models to this data, I suggest you *replace* the drug names with numbers such as *1,2,3,4,5* and go on with that. You would then have a one-dimensional target.
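
The replacement suggested above could look like the sketch below. The dictionary `drug_to_number` and the label spellings in it are hypothetical; check `df['Drug'].unique()` for the real spellings before using it. The numbers are arbitrary class labels, not an ordering of the drugs.

```python
import pandas as pd

# Hypothetical target values for illustration only; use df['Drug'] in the notebook.
drugs = pd.Series(["DrugY", "drugA", "drugC", "DrugY"], name="Drug")

# Hypothetical mapping from drug name to an integer class label.
drug_to_number = {"drugA": 1, "drugB": 2, "drugC": 3, "drugX": 4, "DrugY": 5}

# map() looks every name up in the dictionary and returns a 1-D Series.
y_numeric = drugs.map(drug_to_number)
print(y_numeric.tolist())
# [5, 1, 3, 5]
```

Any name missing from the dictionary would come out as `NaN`, so it is worth checking `y_numeric.isna().sum()` before training.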

**DECISION TREE**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
prediction_dt = tree.predict(x_test)
print(classification_report(y_test, prediction_dt))

**RANDOM FOREST**

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
prediction_rf = forest.predict(x_test)
print(classification_report(y_test, prediction_rf))

**ANN**

In [None]:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Flatten
from keras.layers import Dropout

In [None]:
x.shape

In [None]:
model = Sequential([
    Flatten(input_dim=9),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dense(5, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.summary()
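
The model above uses sigmoid outputs with a mean squared error loss. A common alternative when each patient gets exactly one drug is a softmax output with categorical cross-entropy (in Keras, `activation='softmax'` and `loss='categorical_crossentropy'`). The NumPy sketch below only illustrates how the two activations differ; the scores in `logits` are made up.

```python
import numpy as np

# Hypothetical raw scores from the last layer, one per drug.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5])

# Sigmoid squashes each score on its own, so the outputs need not sum to 1.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax normalises across the five scores, giving one probability distribution.
shifted = logits - logits.max()  # subtract the max for numerical stability
softmax = np.exp(shifted) / np.exp(shifted).sum()

print(sigmoid.sum())             # greater than 1 for these scores
print(softmax.sum())             # 1 up to floating point rounding
print(int(np.argmax(softmax)))   # index of the predicted class
```

Both activations pick the same top class here, since each preserves the order of the scores; the difference is in how the outputs can be read as probabilities.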

In [None]:
history = model.fit(x_train, y_train, batch_size=10, validation_data=(x_test, y_test), 
                    epochs=250, verbose=2)

In [None]:
print(history.history.keys())

In [None]:
plt.plot(history.history['accuracy'], label='Accuracy', color='blue')
plt.plot(history.history['loss'], label='Loss', color='red')
plt.title('Training')
plt.legend()

In [None]:
plt.plot(history.history['val_accuracy'], label='Accuracy', color='blue')
plt.plot(history.history['val_loss'], label='Loss', color='red')
plt.title('Validation')
plt.legend()

# **Summary**

* The Decision Tree and Random Forest models reached 100% accuracy. The artificial neural network was also successful, but not as accurate as the other models. The likely reason is that the dataset is not large enough; with a somewhat larger dataset the ANN would probably do better.
* Overall, the data was already in great shape; I just transformed it to fit machine learning and deep learning models.
* That is it for this notebook. I hope you liked it and learned something from it. Take care.

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">