### Drug Classification

**Problem Type: MultiClass Classification**

In this beginner-friendly notebook, I perform exploratory data analysis and then model the data with DecisionTreeClassifier and RandomForestClassifier, evaluated with a StratifiedKFold cross-validation strategy.

I have noted my observations throughout. If you have any questions or suggestions, please ask. Kindly upvote if you find it interesting :)

### PART 1: Exploratory data analysis

In [None]:
#Import libraries
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#reading data
data= pd.read_csv('/kaggle/input/drug-classification/drug200.csv')
print("Dataframe Shape: ",data.shape)

In [None]:
#check data
data.head()

* 5 Features: ['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']
* 1 Target Variable: Drug

In [None]:
# Target variable analysis
data['Drug'].value_counts()

* The dataset is imbalanced, with 5 classes in the target, so a StratifiedKFold cross-validation strategy is needed to keep the class proportions similar across folds.
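
To illustrate why stratification helps with an imbalanced target, here is a minimal sketch on a made-up label vector. The class counts below are hypothetical and are not the real counts from `drug200.csv`; the point is only that every test fold keeps the same class mix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target with 5 classes (made-up counts, 200 rows in total)
y_demo = np.repeat(["DrugY", "drugX", "drugA", "drugB", "drugC"], [90, 50, 25, 20, 15])
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for _, test_idx in skf.split(X_demo, y_demo):
    labels, counts = np.unique(y_demo[test_idx], return_counts=True)
    fold_counts.append(dict(zip(labels, counts)))
    print(fold_counts[-1])  # class counts inside this test fold
```

With a plain KFold, a rare class could end up missing from some folds; stratification is what prevents that here.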

In [None]:
#Feature variables analysis
#Check for missing values
data.isnull().sum()

* No missing values in data.

In [None]:
data.describe()

In [None]:
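# col-Age (histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(data['Age'], kde=True)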

* Ages range from 15 to 74 years. We need to create Age bins for the different age groups.

In [None]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [None]:
data.groupby(['Age', 'Drug']).size()

In [None]:
# col- Sex
data.Sex.value_counts()

In [None]:
data.groupby(['Sex', 'Drug']).size()

* Sex is a low-cardinality nominal variable. One-hot encoding applies here or, since there are only two labels, we can simply binarize it.
* The drugs are almost equally distributed over both sexes.
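
The two encoding options mentioned above can be sketched on a toy Series (the values are illustrative, not taken from the dataset). For a two-label variable both routes end up with the same single 0/1 column.

```python
import pandas as pd

sex = pd.Series(["F", "M", "M", "F"], name="Sex")  # toy values for illustration

# Option 1: binarize with an explicit mapping
sex_binary = sex.map({"F": 0, "M": 1})

# Option 2: one-hot encode; drop_first=True keeps one column for a two-label variable
sex_onehot = pd.get_dummies(sex, prefix="Sex", drop_first=True, dtype=int)

print(sex_binary.tolist())
print(sex_onehot)
```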

In [None]:
# col- BP
data.BP.value_counts()

In [None]:
sns.catplot(x="Drug", y="BP", data=data)

* BP is distinctive in the case of drugC, drugA, and drugB.

In [None]:
data.groupby(['BP', 'Drug']).size()

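* BP (Blood Pressure) is an ordinal categorical variable (its values have a natural order: LOW, NORMAL, HIGH). Label encoding is suitable for this, but note that scikit-learn's LabelEncoder assigns codes in alphabetical order (HIGH=0, LOW=1, NORMAL=2), so the codes do not follow that order unless an explicit mapping is used.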
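
As a small sketch of how the encoding choice affects an ordinal variable, the snippet below compares `LabelEncoder` with an explicit mapping on toy values. The dictionary `bp_order` is my own assumption about the intended order; it is not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

bp = pd.Series(["LOW", "NORMAL", "HIGH", "LOW"])  # toy values for illustration

# LabelEncoder sorts the labels, so codes follow alphabetical order: HIGH=0, LOW=1, NORMAL=2
alphabetical = LabelEncoder().fit_transform(bp)

# Explicit mapping that preserves LOW < NORMAL < HIGH (assumed order)
bp_order = {"LOW": 0, "NORMAL": 1, "HIGH": 2}
ordinal = bp.map(bp_order)

print(alphabetical.tolist())
print(ordinal.tolist())
```

Tree-based models split on thresholds, so they can often cope with either coding, but the explicit mapping keeps the numbers meaningful.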

In [None]:
# col- Cholesterol
data.Cholesterol.value_counts()

In [None]:
data.groupby(['Cholesterol', 'Drug']).size()

* Cholesterol is again an ordinal categorical variable (NORMAL, HIGH), so we use a label encoder.

In [None]:
# col- Na_to_K
print(data.Na_to_K.nunique())
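# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Na_to_K'], kde=True)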

In [None]:
sns.catplot(x="Drug", y="Na_to_K", data=data)

* Cool: when the Na_to_K ratio is > 15, only DrugY is used. This is worth capturing as a new feature.

* Of the 200 rows, 198 have unique Na_to_K values, so the raw ratio is not distinctive or useful on its own. We need to group it into bins to make sense of this data.
* We can observe a deviation from the normal distribution here: the data is positively skewed.

In [None]:
# Positive skewness usually means that both the mean and the median exceed the mode
# mean, median, mode: let's check
print(data.Na_to_K.mean())
print(data.Na_to_K.median())
print(data.Na_to_K.mode()[0])

In [None]:
#skewness and kurtosis
print("Skewness= ", data['Na_to_K'].skew())
print("Kurtosis= ", data['Na_to_K'].kurt())

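* Skewness > 1 suggests that the distribution is highly skewed (positive); values between 0.5 and 1 are usually read as moderate skew.
* pandas `kurt()` returns excess kurtosis (Fisher's definition), for which a normal distribution scores 0, so the value should be compared with 0 rather than 3. A value below 0 suggests thinner tails than the normal distribution and fewer outliers; a value above 0 suggests heavier tails.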
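
To make the reference points for these two statistics concrete, here is a minimal sketch on synthetic samples (not the notebook's data). It shows what pandas reports for a normal sample and for a clearly right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))       # symmetric reference sample
skewed_sample = pd.Series(rng.exponential(size=100_000))  # right-skewed sample

# pandas reports excess kurtosis, so a normal sample lands close to 0 (not 3)
normal_kurt = normal_sample.kurt()
skewed_skew = skewed_sample.skew()
skewed_kurt = skewed_sample.kurt()

print("normal: kurtosis =", round(normal_kurt, 3))
print("skewed: skewness =", round(skewed_skew, 3), "kurtosis =", round(skewed_kurt, 3))
```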

*NOTE: We can apply some kind of transformation technique to make distribution normal.*
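
One common transformation for positively skewed data is the logarithm. The sketch below uses a synthetic log-normal sample as a stand-in for a skewed ratio (the parameters are arbitrary assumptions) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, positively skewed stand-in for a ratio-like feature (not the real Na_to_K values)
ratio = pd.Series(rng.lognormal(mean=2.7, sigma=0.5, size=2000))

log_ratio = np.log(ratio)  # the log transform compresses the long right tail

skew_before = ratio.skew()
skew_after = log_ratio.skew()
print("skewness before:", round(skew_before, 3))
print("skewness after: ", round(skew_after, 3))
```

Whether this helps depends on the model: tree-based classifiers are insensitive to monotonic transformations of a feature, so it mainly matters for linear or distance-based models.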

### PART 2: Data Preparation

In [None]:
data.Age.max()

In [None]:
# feature engg
# Binning Age into Age groups
bins= [13,18,65,80]
labels = ['Teen','Adult','Elderly']
data['AgeGroup'] = pd.cut(data['Age'], bins=bins, labels=labels, right=False)
data.drop('Age', axis=1, inplace=True)
print (data.head())

In [None]:
data.AgeGroup.value_counts()
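
As a quick check of how `pd.cut` treats the bin edges with `right=False`, here is a small sketch on hand-picked ages. Each bin includes its left edge and excludes its right edge, so 18 falls in 'Adult' and 65 in 'Elderly'.

```python
import pandas as pd

ages = pd.Series([15, 17, 18, 64, 65, 74])  # hand-picked boundary cases

# right=False makes the intervals [13, 18), [18, 65), [65, 80)
groups = pd.cut(ages, bins=[13, 18, 65, 80],
                labels=["Teen", "Adult", "Elderly"], right=False)

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```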

#### Now the next challenge is how to group the Sodium to Potassium ratio data. We don't have distinct groups as we did for Age.
#### So, I will group the data based on its percentile distribution.

In [None]:
# Flag rows whose Na_to_K ratio exceeds 15 (1 = yes, 0 = no)
data['is_Na2K_greater15'] = (data['Na_to_K'] > 15).astype(int)

In [None]:
# Na_to_K groups
data['Na_to_K_groups'] = pd.qcut(data['Na_to_K'],
                            q=[0, .2, .4, .6, .8, 1],
                            labels=False)
data.drop('Na_to_K', axis=1, inplace=True)
data.Na_to_K_groups.value_counts()
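
To show what percentile-based grouping does, here is a minimal sketch of `pd.qcut` on synthetic values (a uniform sample chosen arbitrarily, not the real ratios). Quantile edges give equally populated groups, unlike fixed-width bins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
values = pd.Series(rng.uniform(6, 38, size=200))  # synthetic stand-in values

# Five quantile-based groups (quintiles); labels=False returns integer codes 0..4
quintile = pd.qcut(values, q=[0, .2, .4, .6, .8, 1], labels=False)

group_sizes = quintile.value_counts().sort_index()
print(group_sizes)
```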

In [None]:
# Binarize Sex variable (assign the result back; an in-place replace on a selected column is unreliable in recent pandas)
data['Sex'] = data['Sex'].map({'F': 0, 'M': 1})

In [None]:
# Label encoding
# Note: LabelEncoder assigns integer codes in sorted (alphabetical) label order
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
data['BP'] = le.fit_transform(data['BP'])
data['Cholesterol'] = le.fit_transform(data['Cholesterol'])
data['AgeGroup'] = le.fit_transform(data['AgeGroup'])
data.head()

### PART 3: Modelling & Evaluation

In [None]:
data.columns

In [None]:
#features
features = ['Sex', 'BP', 'Cholesterol', 'AgeGroup','is_Na2K_greater15', 'Na_to_K_groups']

In [None]:
#model
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

In [None]:
kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
X = data[features]
y = data.Drug

scores= []
i=1
for train_index,test_index in kf.split(X, y):
    print('Fold no. = ', i)
    
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    #model
    model1 = tree.DecisionTreeClassifier(random_state=42)
    model1.fit(x_train, y_train)
     
    test_pred= model1.predict(x_test)
    test_acc = metrics.accuracy_score(y_test, test_pred)
    print('Accuracy score over test set:',test_acc)
    scores.append(test_acc)    
    
    i+=1
    
#mean score
print()
print('Mean Accuracy for Decision Tree: ', np.mean(scores))

#### We got an accuracy of 0.96 with the Decision Tree (default parameters). Because this is the mean over StratifiedKFold test folds, it is a more reliable estimate than a single train/test split and is less likely to hide overfitting. Of course, this accuracy can likely be improved with hyperparameter tuning and feature selection/feature engineering.
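
Hyperparameter tuning can be sketched with `GridSearchCV` on top of the same stratified folds. The data below is synthetic (generated with `make_classification`) and the parameter grid is an arbitrary assumption for illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class stand-in for the prepared feature matrix
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42)

# Small illustrative grid (hypothetical values)
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 3, 5]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

Note that the best cross-validated score from a search is optimistic; a held-out set or nested cross-validation gives a fairer final estimate.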

#### Let's try another ML model, RandomForestClassifier.


In [None]:
#RandomForestClassifier
kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
X = data[features]
y = data.Drug

scores= []
i=1
for train_index,test_index in kf.split(X, y):
    print('Fold no. = ', i)
    
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    #model
    model2 = ensemble.RandomForestClassifier(random_state=42)
    model2.fit(x_train, y_train)
     
    test_pred= model2.predict(x_test)
    test_acc = metrics.accuracy_score(y_test, test_pred)
    print('Accuracy score over test set:',test_acc)
    scores.append(test_acc)    
    
    i+=1
    
#mean score
print()
print('Mean Accuracy for Random Forest Classifier: ', np.mean(scores))

#### Cool! We got an accuracy of 0.96 with RandomForestClassifier as well, on par with the Decision Tree at this precision.
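
Because the classes are imbalanced, accuracy alone can hide weak performance on the rarer drugs. A possible sketch, on synthetic imbalanced data with made-up class weights, uses `cross_validate` to report balanced accuracy and macro F1 next to plain accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced 5-class data (the weights are hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1,
    weights=[0.45, 0.25, 0.12, 0.10, 0.08], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["accuracy", "balanced_accuracy", "f1_macro"]

results = cross_validate(RandomForestClassifier(random_state=42),
                         X_demo, y_demo, cv=cv, scoring=metric_names)

for name in metric_names:
    print(name, round(results[f"test_{name}"].mean(), 3))
```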

**Let's plot the feature importances and visualize them.**

In [None]:
# Random forest feature importances (model2 is the model fitted on the last fold)
feat_importances = pd.Series(model2.feature_importances_, index=features)
feat_importances.plot(kind='barh')

* The most important feature is is_Na2K_greater15 (derived from the Sodium to Potassium ratio), which we created.
* The second most important feature is blood pressure.
* AgeGroup and Cholesterol have a similar level of importance.
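
Importances taken from a single fitted model reflect only one training fold. A minimal sketch of averaging them across all folds is shown below on synthetic data; the feature names `f0` to `f5` are placeholders, not the notebook's columns.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic 5-class stand-in data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
per_fold = []
for train_idx, _ in cv.split(X_demo, y_demo):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_demo[train_idx], y_demo[train_idx])
    per_fold.append(rf.feature_importances_)  # impurity-based importances for this fold

# Average over folds for a more stable ranking
mean_importances = pd.Series(np.mean(per_fold, axis=0), index=feature_names).sort_values()
print(mean_importances)
```

Calling `mean_importances.plot(kind='barh')` would give the same style of bar chart as above.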

#### Thank you for making it to the end.

#### Kindly upvote and comment if you have any suggestions or queries. Happy learning :)