# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Import Libraries</a>  
- <a href='#3'>Import Data</a>   
- <a href='#4'>Exploratory Data Analysis</a>   
 - <a href='#41'>Data header</a>   
 - <a href='#42'>Data shape</a>   
 - <a href='#43'>Describing dataset</a>   
 - <a href='#44'>Dataset info</a>   
 - <a href='#45'>Finding relations between features and target feature</a>    
- <a href='#5'>Data Cleaning</a>   
 - <a href='#51'>Correct format of columns</a>   
 - <a href='#53'>Handle Outliers</a>   
 - <a href='#53'>Handle missing values</a>
 - <a href='#54'>Correlation heatmap</a> 
- <a href='#7'>Feature Engineering</a> 
- <a href='#8'>Data split</a>
- <a href='#9'>Modeling and Evaluation</a>
 - <a href='#91'>Random Forest</a>
 - <a href='#92'>Decision tree</a>
 - <a href='#93'>KNN</a>
 - <a href='#94'>SVM</a>
- <a href='#10'>Conclusion</a>

# <a id='1'>Introduction</a> 

### Data set source:
https://www.kaggle.com/prathamtripathi/drug-classification

### Business problem: 
This is a classification problem: we have to predict which drug fits the requirements of a patient.

### Target feature:
* Drug type

### The feature sets are:
* Age
* Sex
* Blood Pressure Level (BP)
* Cholesterol Level
* Sodium to Potassium Ratio (Na_to_K)

# <a id='2'>Import Libraries</a> 

In [None]:
!pip install pydotplus

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# linear algebra
import numpy as np 
# data processing, CSV file I/O (e.g. pd.read_csv)
import pandas as pd 

#Visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Decision tree visualization
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree

# Data split
from sklearn.model_selection import train_test_split

# ML algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm



#Model evaluation metrics
from sklearn import metrics

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <a id='3'>Import data</a>

In [None]:
df = pd.read_csv('/kaggle/input/drug-classification/drug200.csv')

# <a id='4'>Exploratory data analysis</a>

### <a id='41'>Header of dataset</a>

In [None]:
df.head()

### <a id='42'>Shape of dataset</a>

In [None]:
df.shape

**Dataframe has 200 entries and 6 columns**

### <a id='43'>Description of dataset</a>

In [None]:
df.describe(include = 'all')

* Sex has 2 unique values, and male and female patients are almost equally represented.
* BP has 3 unique values; the most frequent is HIGH, with 77 of 200 entries.
* Cholesterol has 2 unique values with almost equal counts.
* Drug has 5 unique values; the most frequent is DrugY, with 91 of 200 entries.
* The distribution of Na_to_K looks a bit odd: the maximum of 38 might be an outlier.
* The dataset contains no missing values, since the count for every column is 200.
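
The last two observations can also be checked directly instead of being read off the `describe` table. The snippet below is a minimal sketch on a hypothetical three-row sample that only borrows the column names of `drug200.csv`; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row sample with the same column names as drug200.csv.
sample = pd.DataFrame({
    "Age": [23, 47, 61],
    "Sex": ["F", "M", "M"],
    "BP": ["HIGH", "LOW", "NORMAL"],
    "Cholesterol": ["HIGH", "HIGH", "NORMAL"],
    "Na_to_K": [25.355, 13.093, 18.043],
    "Drug": ["DrugY", "drugC", "DrugY"],
})

# Number of missing entries per column; all zeros means there are no gaps.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of distinct values in each categorical column.
unique_per_column = sample[["Sex", "BP", "Cholesterol", "Drug"]].nunique()
print(unique_per_column)
```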

### <a id='44'>Info about dataset</a>

In [None]:
df.info()

**The `info` function confirms which columns have the object dtype and that every entry is filled.**

### <a id='45'>Finding relations between features and target feature</a>

### 1) Age with Drug type

In [None]:
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.catplot(x='Drug', y='Age', data=df)

* DrugY is the most frequent drug and is used across all ages.
* drugC is infrequent and is used across all ages.
* drugX has the second-highest frequency and is used across all ages.
* drugA appears only for patients aged 50 or younger.
* drugB appears only for patients aged 50 or older.

### 2) Sex with Drug type

First, I'll check how patient age is distributed across genders.

In [None]:
plt.figure(figsize = (8,6))
# Keep the Axes object, then set the y-axis limits on it
ax = sns.boxplot(x='Sex', y='Age', data=df)
ax.set(ylim=(0, 80))

The age distribution is very similar for both genders.

In [None]:
df.Sex.value_counts()

The two genders are almost equally represented among patients.

In [None]:
sex_drug = df.groupby('Sex').Drug.value_counts()
sex_drug

In [None]:
sex_drug.unstack(level=0).plot(kind='bar', subplots=False)

The plot shows no clear relationship between a patient's gender and their drug type.
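
One way to put a number on this comparison is to turn the counts into per-gender shares. The sketch below uses hypothetical toy data (not rows from the dataset) in which both genders have the same drug mix; in the notebook, `df["Sex"]` and `df["Drug"]` would take the place of the toy columns, and this would need to run before `Sex` is re-encoded.

```python
import pandas as pd

# Hypothetical toy data; the notebook would pass df["Sex"] and df["Drug"].
toy = pd.DataFrame({
    "Sex":  ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Drug": ["DrugY", "DrugY", "drugX", "drugA",
             "DrugY", "DrugY", "drugX", "drugA"],
})

# normalize="index" turns each row of counts into shares that sum to 1.
shares = pd.crosstab(toy["Sex"], toy["Drug"], normalize="index")
print(shares)

# Largest gap between the two rows; a value near 0 means a similar drug mix.
max_gap = (shares.loc["F"] - shares.loc["M"]).abs().max()
print(max_gap)
```

A small `max_gap` on the real data would support the reading of the bar plot, although it is only a descriptive check and not a statistical test.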

### 3) BP with Drug type

In [None]:
df.BP.value_counts()

In [None]:
tab = pd.crosstab(df['BP'], df['Drug'])
print(tab)

# Divide each row by its total so the bars show the share of each drug per BP level
tab.div(tab.sum(axis=1).astype(float), axis=0).plot(kind="bar", stacked=False)
plt.xlabel('BP')
plt.ylabel('Proportion')

* drugX is used for LOW and NORMAL BP.
* drugC is used only for LOW BP.
* drugB and drugA are used only for HIGH BP.
* DrugY is used for every BP level.

The bar plot above clearly shows a relationship between BP level and drug type.

### 4) Cholesterol with Drug type

In [None]:
df.Cholesterol.value_counts()

In [None]:
tab = pd.crosstab(df['Cholesterol'], df['Drug'])

# Divide each row by its total so the bars show the share of each drug per cholesterol level
tab.div(tab.sum(axis=1).astype(float), axis=0).plot(kind="bar", stacked=False)
plt.xlabel('Cholesterol')
plt.ylabel('Proportion')

All drugs are used at both cholesterol levels except **drugC**, which is used only at the HIGH level.

### 5) Na_to_K with Drug type

In [None]:
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.catplot(x='Drug', y='Na_to_K', data=df)

* For Na_to_K above 15, only DrugY is used.
* The remaining drugs are used in the Na_to_K range from 5 to 15.
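
The threshold read off the plot can be checked by splitting the rows at 15 and listing the drugs on each side. This is a minimal sketch on hypothetical toy rows constructed to follow the pattern described above; the values are illustrative and not taken from `drug200.csv`.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from drug200.csv.
toy = pd.DataFrame({
    "Na_to_K": [25.4, 18.0, 16.3, 13.1, 10.1, 7.8],
    "Drug":    ["DrugY", "DrugY", "DrugY", "drugC", "drugX", "drugA"],
})

# Split the rows at the threshold read off the plot (15 comes from the text).
above = toy[toy["Na_to_K"] > 15]
below = toy[toy["Na_to_K"] <= 15]

# Which drugs occur on each side of the threshold.
drugs_above = set(above["Drug"])
drugs_below = set(below["Drug"])
print(drugs_above, drugs_below)
```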

## Summary

Age, Na_to_K, BP and Cholesterol are all related to Drug type.

# <a id='5'>Data Cleaning</a>

### <a id='51'> Correct format of columns </a>

In [None]:
for col in df:
    print(col)
    print(df[col].unique())
    print()

In [None]:
df["Sex"] = df["Sex"].map({"M": 0, "F":1})
df["BP"] = df["BP"].map({"HIGH" : 3, "NORMAL" : 2, "LOW": 1})
df["Cholesterol"] = df["Cholesterol"].map({"HIGH": 1, "NORMAL" : 0})
df["Drug"] = df["Drug"].map({"DrugY": 0, "drugC": 1, "drugX": 2, "drugA":3, "drugB":4})

Check head of dataframe again

In [None]:
df.head()

Check data type of each column

In [None]:
df.dtypes

Now the dataset contains only numeric dtypes, so it can be used with ML algorithms.
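
One thing to watch with `Series.map` and a dictionary: any label that is missing from the dictionary becomes `NaN` without raising an error. The sketch below shows a quick check for that, using a hypothetical series with one deliberately unknown label (`drugZ`, which does not occur in the dataset).

```python
import pandas as pd

# Hypothetical column with one label ("drugZ") that is missing from the mapping.
drug = pd.Series(["DrugY", "drugC", "drugZ"])
mapping = {"DrugY": 0, "drugC": 1, "drugX": 2, "drugA": 3, "drugB": 4}

encoded = drug.map(mapping)

# Labels that were not in the mapping come back as NaN.
unmapped = drug[encoded.isna()].tolist()
print(unmapped)
```

Running the same `isna()` check on each mapped column of `df` would confirm that every category was covered.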

### <a id='53'> Handle Outliers </a>

### Check outliers for column Age

In [None]:
sns.boxplot(x=df['Age'])

### Check outliers for column Na_to_K

In [None]:
sns.boxplot(x=df['Na_to_K'])

A few data points with values above 30 look like outliers, so we can remove these rows.

In [None]:
df.drop(df[df.Na_to_K > 30].index, inplace=True)

Check boxplot again

In [None]:
sns.boxplot(x=df['Na_to_K'])

Since BP and Cholesterol were converted from categorical columns, there is no need to check them for outliers.
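
The cutoff of 30 used above was chosen by eye from the boxplot. For comparison, here is a sketch of the 1.5 × IQR rule that boxplot whiskers are based on. It is an alternative to the notebook's fixed cutoff, not the method used above, and the values are hypothetical.

```python
import numpy as np

# Hypothetical Na_to_K values; the last one is deliberately extreme.
values = np.array([7.8, 9.4, 10.1, 12.3, 13.1, 15.0, 16.3, 18.0, 38.2])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged, as in a boxplot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```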

### <a id='54'>Correlation heatmap</a>

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(df.corr(), vmax=0.6, square=True, annot=True)

* BP is positively correlated with Drug type.
* Na_to_K is highly negatively correlated with drug type.

# <a id='7'>Feature Engineering</a>

Drop the Sex column, since the EDA showed it had no effect on the target feature.

In [None]:
df.drop('Sex', axis=1, inplace=True)

In [None]:
df.head()

# <a id='8'>Data split</a>

In [None]:
values=df.values
X, y = values[:, :-1], values[:, -1]
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.3, random_state = 2)

In [None]:
print("X_train shape:",X_train.shape)
print("X_test shape:",X_test.shape)
print("y_train shape:",y_train.shape)
print("y_test shape:",y_test.shape)
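
Because the drug classes are imbalanced (DrugY accounts for 91 of the 200 original rows), a plain random split can leave a rare class under-represented in the test set. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical data; it is a possible variation, not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 12 samples of class 0 and 6 of class 1.
X_demo = np.arange(36).reshape(18, 2)
y_demo = np.array([0] * 12 + [1] * 6)

# stratify=y_demo asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=6, random_state=2, stratify=y_demo
)

# Class counts in each part.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```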

# <a id='9'> ML Algorithms </a>

### <a id='91'> Random Forest </a>

### Modeling

In [None]:
rfc = RandomForestClassifier(n_estimators = 9, criterion = 'entropy', random_state=22)

rfc.fit(X_train,y_train)

### Prediction

In [None]:
rf_pred = rfc.predict(X_test)

### Evaluation

In [None]:
print("Accuracy score : ", metrics.accuracy_score(y_test, rf_pred))

print("F1 score: ", metrics.f1_score(y_test, rf_pred, average='weighted') )

print("Jaccard score: ", metrics.jaccard_score(y_test, rf_pred, average='weighted'))

print("recall score: ", metrics.recall_score(y_test, rf_pred, average='weighted'))

print("precision score: ", metrics.precision_score(y_test, rf_pred, average='weighted'))
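
All multi-class metrics above use `average='weighted'`: the metric is computed separately for each class and then averaged, with each class weighted by its number of true samples (its support). The pure-Python sketch below works through that definition for the F1 score on hypothetical labels.

```python
from collections import Counter

# Hypothetical labels for a three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]


def f1_for_class(label):
    """F1 score of one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Support = number of true samples per class; it is the weight of that class.
support = Counter(y_true)
weighted_f1 = sum(f1_for_class(c) * n for c, n in support.items()) / len(y_true)
print(round(weighted_f1, 4))
```

With this weighting, a frequent class such as DrugY influences the reported score more than a rare one.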

In [None]:
rfc_score = {
            'accuracy': metrics.accuracy_score(y_test, rf_pred),
            'f1': metrics.f1_score(y_test, rf_pred, average='weighted'),
            'jaccard': metrics.jaccard_score(y_test, rf_pred, average='weighted'),
            'recall': metrics.recall_score(y_test, rf_pred, average='weighted'),
            'precision': metrics.precision_score(y_test, rf_pred, average='weighted')
        }

### <a id='92'> Decision Tree </a>

### Modeling

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

### Prediction

In [None]:
dt_pred = dtc.predict(X_test)

### Evaluation

In [None]:
print("Accuracy score : ", metrics.accuracy_score(y_test, dt_pred))

print("F1 score: ", metrics.f1_score(y_test, dt_pred, average='weighted') )

print("Jaccard score: ", metrics.jaccard_score(y_test, dt_pred, average='weighted'))

print("recall score: ", metrics.recall_score(y_test, dt_pred, average='weighted'))

print("precision score: ", metrics.precision_score(y_test, dt_pred, average='weighted'))

In [None]:
dt_score = {
            'accuracy': metrics.accuracy_score(y_test, dt_pred),
            'f1': metrics.f1_score(y_test, dt_pred, average='weighted'),
            'jaccard': metrics.jaccard_score(y_test, dt_pred, average='weighted'),
            'recall': metrics.recall_score(y_test, dt_pred, average='weighted'),
            'precision': metrics.precision_score(y_test, dt_pred, average='weighted')
        }

### Visualisation of decision tree

In [None]:
featureNames = ['Age', 'BP', 'Cholesterol','Na_to_K']

In [None]:
dot_data = tree.export_graphviz(dtc,
                                feature_names=featureNames,
                                out_file=None,
                                special_characters=True,
                                filled=True,
                                rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)

filename = "drugTree.png"
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(50,100))
plt.imshow(img,interpolation = 'nearest')
plt.show()

### <a id='93'> K-Nearest Neighbors </a>

### Modeling

Search for best K

In [None]:
Ks = 15
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)
for n in range(1, Ks):
    # Train the model with n neighbors and predict on the test set
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n - 1] = metrics.accuracy_score(y_test, yhat)
    # Standard error of the accuracy estimate
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

mean_acc

Let's visualise it

In [None]:
plt.plot(range(1, Ks), mean_acc, 'g')
# Shaded band: one standard error on either side of the accuracy
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy', '+/- 1 std'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

K = 4 is our choice; I'm not using K = 1 because of overfitting concerns.

In [None]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k)
neigh.fit(X_train,y_train)
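
The search above scores each K on the test set. A common alternative is to pick K by cross-validation on the training data only, so the test set stays untouched until the final evaluation. The sketch below illustrates this on hypothetical synthetic data (`X_demo` and `y_demo` stand in for `X_train` and `y_train`); it is not the procedure used in this notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in for X_train / y_train: two well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
y_demo = np.array([0] * 30 + [1] * 30)

# Mean 5-fold accuracy for each candidate K, computed on training data only.
cv_scores = {}
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# K with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```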

### Prediction

In [None]:
predKNN = neigh.predict(X_test)

### Evaluation

In [None]:
print("Accuracy score : ", metrics.accuracy_score(y_test, predKNN))

print("F1 score: ", metrics.f1_score(y_test, predKNN, average='weighted') )

print("Jaccard score: ", metrics.jaccard_score(y_test, predKNN, average='weighted'))

print("recall score: ", metrics.recall_score(y_test, predKNN, average='weighted'))

print("precision score: ", metrics.precision_score(y_test, predKNN, average='weighted'))

In [None]:
knn_score = {
            'accuracy': metrics.accuracy_score(y_test, predKNN),
            'f1': metrics.f1_score(y_test, predKNN, average='weighted'),
            'jaccard': metrics.jaccard_score(y_test, predKNN, average='weighted'),
            'recall': metrics.recall_score(y_test, predKNN, average='weighted'),
            'precision': metrics.precision_score(y_test, predKNN, average='weighted')
        }

### <a id='94'> SVM </a>

### Modeling

In [None]:
svc = svm.SVC(kernel='rbf', random_state = 22)
svc.fit(X_train, y_train)

### Prediction

In [None]:
predSVC = svc.predict(X_test)

### Evaluation

In [None]:
print("Accuracy score : ", metrics.accuracy_score(y_test, predSVC))

print("F1 score: ", metrics.f1_score(y_test, predSVC, average='weighted') )

print("Jaccard score: ", metrics.jaccard_score(y_test, predSVC, average='weighted'))

print("recall score: ", metrics.recall_score(y_test, predSVC, average='weighted'))

print("precision score: ", metrics.precision_score(y_test, predSVC, average='weighted', zero_division=1))

In [None]:
svm_score = {
            'accuracy': metrics.accuracy_score(y_test, predSVC),
            'f1': metrics.f1_score(y_test, predSVC, average='weighted'),
            'jaccard': metrics.jaccard_score(y_test, predSVC, average='weighted'),
            'recall': metrics.recall_score(y_test, predSVC, average='weighted'),
            'precision': metrics.precision_score(y_test, predSVC, average='weighted', zero_division=1)
        }

# <a id='10'> Conclusion </a>

### Random forest score

In [None]:
print(pd.DataFrame.from_dict(rfc_score, orient = "index",columns=["Score"]))

### Decision tree score

In [None]:
print(pd.DataFrame.from_dict(dt_score, orient = "index",columns=["Score"]))

### KNN score

In [None]:
print(pd.DataFrame.from_dict(knn_score, orient = "index",columns=["Score"]))

### SVM score

In [None]:
print(pd.DataFrame.from_dict(svm_score, orient = "index",columns=["Score"]))
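
The four tables are easier to compare side by side. The sketch below combines score dictionaries into one table with a row per model. The numbers are hypothetical placeholders, not results from this notebook; in practice `rfc_score`, `dt_score`, `knn_score` and `svm_score` would be passed in their place.

```python
import pandas as pd

# Hypothetical placeholder scores; the notebook already holds the real ones
# in rfc_score, dt_score, knn_score and svm_score.
demo_scores = {
    "Random forest": {"accuracy": 0.9, "f1": 0.8},
    "KNN": {"accuracy": 0.7, "f1": 0.6},
}

# Outer keys become columns; transpose so that each model is one row.
comparison = pd.DataFrame(demo_scores).T
print(comparison.sort_values("accuracy", ascending=False))
```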

## Summary

KNN and SVM appear to perform much worse than the other two models. Random forest and decision tree achieve the same scores, and since a random forest is more computationally expensive than a single decision tree, the decision tree would be my choice for this data.
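
KNN and an RBF-kernel SVM both rely on distances between samples, so features on larger scales (here Age and Na_to_K next to the 0/1 Cholesterol code) can dominate the result. Tree-based models do not have this sensitivity. The sketch below shows one way to add scaling in front of the SVM, on hypothetical data; whether it would change the scores on this dataset is not tested here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: the first feature has a much larger range than the second,
# but only the second one separates the classes.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(15, 75, 100), rng.uniform(0, 1, 100)])
y_demo = (X_demo[:, 1] > 0.5).astype(int)

# StandardScaler rescales each feature to zero mean and unit variance
# before the RBF kernel computes distances.
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=22))
scaled_svc.fit(X_demo, y_demo)

# Accuracy on the same data the pipeline was fitted on.
train_accuracy = scaled_svc.score(X_demo, y_demo)
print(train_accuracy)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the same object could be trained on `X_train` and then applied to `X_test`.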