# Drug Classification

In this notebook we solve the problem of predicting a drug type from the $5$ drug types present in the data:
* drugX
* drugY
* drugC
* drugA
* drugB

This is a *multiclass classification* problem as we have five classes in the target to predict.



<img src = "https://images.theconversation.com/files/358080/original/file-20200915-22-1t5myba.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=926&fit=clip" width = 600 height = 200>

<br>


**Data Attributes**
* Age
* Sex
* Blood Pressure Levels
* Cholesterol Levels
* Na to Potassium Ratio

**Target Feature**
* Drug Type

Roughly, we will follow the structure below:

* Load the data.
* Display useful statistics.
* Build generic functions to detect nulls and missing values.
* Handle missing values.
* Make Visualizations to understand data better.
* Build Models

## Table of Contents

* [Import Libraries](#lib)
* [Load Data](#load_data)
* [Summary Statistics](#summary_stats)
* [Identify Missing or Null Values](#missing_values)
* [EDA & Data Visualization](#eda_data_vis)
* [Encoding Categorical Features](#encoding)
* [Developing Classification Models](#model)
* [Evaluating Classification Models](#evaluate)

<a id ='lib'></a>
# Import Libraries

In [None]:
import os
import numpy as np 
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
import pandas as pd 
pd.options.mode.chained_assignment = None  # default='warn'


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


<a id ='load_data'></a>
# Load Data

In [None]:
drugs_df = pd.read_csv('/kaggle/input/drug-classification/drug200.csv')

print(drugs_df.head(10))

<a id = 'summary_stats'></a>
## Display summary statistics

In [None]:
# Display column names
print(drugs_df.columns)

In [None]:
print(drugs_df.info())

This dataset has more categorical features than numerical ones, so we will need to encode the categorical features before modelling.

In [None]:
print(drugs_df.describe())

The oldest age given is $74$ and the youngest is $15$.

<a id = 'missing_values'></a>
# Investigating Missing Values

In [None]:
# Generic function to calculate missing values, zero values
def calcMissingValues(df: pd.DataFrame):
    '''
    Function to calculate zero,missing and empty values in the dataframe
    
    '''
    # Calculate zero values
    zero_values = (df == 0.0).astype(int).sum(axis = 0)
    
    # Calculate missing values
    missing_vals = df.isnull().sum()
    
    missing_val_percent = round((missing_vals / len(df)) * 100.0 , 2)
    
    df_missing_stat = pd.concat([zero_values , missing_vals , missing_val_percent] , axis = 1)
    
    df_missing_stat = df_missing_stat.rename(columns = {0: 'zero_values' , 1: 'missing_vals' , 2: '%_missing_vals'})
    
    df_missing_stat['data_types'] = df.dtypes
    
    print(df_missing_stat)
    
    
    

In [None]:
calcMissingValues(drugs_df)

As seen, the dataset is clean without any missing values to impute.
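
Had the check reported gaps, the "Handle missing values" step from the outline would be needed. A minimal sketch of one common approach follows, assuming median imputation for numerical columns and mode imputation for categorical ones. The `toy_df` frame and its values are hypothetical and are not taken from the drug data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real drug200 data has none.
toy_df = pd.DataFrame({
    'Age': [23.0, np.nan, 47.0, 61.0],
    'BP': ['HIGH', 'LOW', None, 'LOW'],
})

# Count missing entries per column, the same check calcMissingValues performs.
missing_counts = toy_df.isnull().sum()

# Numerical column: fill with the median. Categorical column: fill with the mode.
filled_df = toy_df.copy()
filled_df['Age'] = filled_df['Age'].fillna(filled_df['Age'].median())
filled_df['BP'] = filled_df['BP'].fillna(filled_df['BP'].mode()[0])

print(missing_counts)
print(filled_df)
```

Whether median and mode are sensible fill values depends on the feature, so this is only a starting point.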

<a id = 'eda_data_vis'> </a>
# EDA & Data Visualization

### Visualize **Age** versus **Drug Type**
* Stripplot
* Boxplot

**Stripplot**

In [None]:
# Visualize age and drug type using strip plot
plt.figure(figsize = (10 , 6))
# Plotting a strip plot to see how a numerical variable is distributed across categories
sns.stripplot(x = 'Drug' , y = 'Age' , data = drugs_df)
plt.title('Distribution of Age & Drug')
plt.show()



The strip plot shows the distribution of a numerical variable for each category. From the plot it looks like *DrugY* and *drugX* are more commonly prescribed or used by the populace.

**Box Plot**

In [None]:
# Visualize age and drug type using a box plot
props = dict(boxes = "orange", whiskers = "black", medians = "green", caps = "gray")
# DataFrame.boxplot creates its own figure, so no separate plt.figure call is needed
drugs_df.boxplot(by = 'Drug' , column = ['Age'] , figsize = (10 , 8) , color = props)
# Remove the automatic "Boxplot grouped by Drug" super-title
plt.suptitle('')
plt.title('Distribution of Age & Drug')
plt.tight_layout()
plt.show()

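The box plot compares the age spread for each drug type. It does not show counts, so the assumption that *DrugY* and *drugX* are most commonly used is checked against the class distribution next.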

### Visualize target variable **Drug**

We now visualize the distribution of the target variable to see if there are any imbalances in class distribution as this is a multiclass classification and any imbalances might affect the outcome.

There will be two plots:
* Bar plot
* Pie Chart

**Bar Plot**

In [None]:
# Get unique class values
print(drugs_df['Drug'].unique())

# Plot a bar chart of the class counts
drugs_df['Drug'].value_counts().plot(kind = 'bar' , color = 'yellow' , figsize = (10 , 8))
plt.xlabel('Drug Type')
plt.ylabel('Drug Type Count')
plt.title('Drug Type Distribution')
plt.show()

**Pie Chart**

In [None]:
# Count the records for each drug type
drug_type = drugs_df.groupby(['Drug']).size()
print(drug_type)

# Labels and values for the pie chart
pie_chart_drug = {'labels': list(drug_type.index.values) , 'vals': list(drug_type)}

colors = ['#b79c4f', '#4fb772', '#eb7d26' , '#77e8c2' , '#99eff2']

# Pull the third slice out of the pie for emphasis
pie_explode = [0 , 0 , 0.3 , 0 , 0]

plt.figure(figsize = (10 , 8))
plt.pie(pie_chart_drug['vals'] , labels = pie_chart_drug['labels'] , explode = pie_explode , colors = colors , shadow = True, startangle = 90 , textprops={'fontsize': 14} , autopct = '%.1f%%')
plt.ylabel('')
plt.title('Drug Type distribution in the data' , fontsize = 20)
plt.tight_layout()
plt.show()

From the two plots we see that *drugA*, *drugB* and *drugC* have relatively few samples. This may affect the predictions. Depending on the evaluation metrics, we could use **SMOTE** to oversample the under-represented classes. However, synthetic oversampling rests on assumptions about the data and needs some domain knowledge, so it should not be applied blindly.
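
A lighter-weight alternative to oversampling (a different technique from SMOTE, named here only for comparison) is to weight classes inversely to their frequency. The sketch below uses scikit-learn's `compute_class_weight` on hypothetical labels. The counts in `toy_labels` are made up and are not the real class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, deliberately imbalanced labels (not the real class counts).
toy_labels = np.array(['DrugY'] * 6 + ['drugX'] * 3 + ['drugA'] * 1)

classes = np.unique(toy_labels)
# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight = 'balanced', classes = classes, y = toy_labels)
class_weights = dict(zip(classes, weights))

print(class_weights)
```

Rare classes receive larger weights. The same scheme can be requested from a classifier directly, for example with `LogisticRegression(class_weight = 'balanced')`.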

### Visualize **Gender** and **Drug**

In [None]:
gender_drug = drugs_df.groupby(['Sex' , 'Drug']).size().reset_index(name = 'value_count')

gender_drug_pivot = pd.pivot_table(
    gender_drug, 
    values = 'value_count',
    index = 'Drug',
    columns = 'Sex'
)


gender_drug_pivot.plot(kind = 'bar' , figsize = (10 , 8) , fontsize = 12 , rot = 360)
plt.xlabel('Drug Type', fontsize = 14)
plt.ylabel('Value' , fontsize = 14)
plt.title('Gender vs Drug Type', fontsize = 16)
plt.tight_layout()
plt.show()

Nothing substantial can be interpreted from plotting Gender vs Drug. There is no bias towards genders for any specific type of Drug.

### Visualize **BP** and **Drug**

Plotting to see if there is any relation between BP and Drug type. The chart will be a grouped bar chart.

In [None]:
# Mean of the numerical columns per drug type
print(drugs_df.groupby(['Drug']).mean(numeric_only = True))

print(drugs_df['BP'].unique())

# Mean of the numerical columns per BP level
print(drugs_df.groupby(['BP']).mean(numeric_only = True))

# Count the records for each BP and drug type combination
bp_drug = drugs_df.groupby(['BP' , 'Drug']).size().reset_index(name = 'value_count')

print(bp_drug)


bp_drug_pivot = pd.pivot_table(bp_drug , values = 'value_count' , columns = 'BP' , index = 'Drug')


bp_drug_pivot.plot(kind = 'bar' , figsize = (10 , 8) , fontsize = 12 , rot = 360)
plt.xlabel('Drug Type', fontsize = 14)
plt.ylabel('Value' , fontsize = 14)
plt.title('BP vs Drug Type', fontsize = 16)
plt.tight_layout()
plt.show()


Most patients with normal BP take *drugX*, while those with high BP predominantly take *DrugY*, with *drugA* and *drugB* being close contenders.

### Visualize **Na_to_K** and **Drug**

In [None]:
print(drugs_df[['Na_to_K' , 'Drug']])

drug_na_k = drugs_df.groupby(['Drug'])['Na_to_K'].mean()

print(drug_na_k)

drug_na_k.plot(kind = 'bar' , color = 'red' , alpha = 0.5 , rot = 360 , fontsize = 14 , figsize = (10 , 8))
plt.xlabel('Drug Type' , fontsize = 15)
plt.ylabel('Na_to_K Avg' , fontsize = 15)
plt.title('Distribution of Drug type under Na_to_K' , fontsize = 15)
plt.tight_layout()
plt.show()

The bar chart suggests that *DrugY* is preferred when the average Na_to_K value exceeds 15, so this feature also plays an important role in classification. We can view the joint distribution of the two variables in a strip plot.
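
One way to check a threshold like this row by row, rather than on averages, is to cross-tabulate a threshold flag against the drug type. The sketch below uses hypothetical rows in `toy_df`. The values are invented for illustration and are not read from `drug200.csv`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from drug200.csv.
toy_df = pd.DataFrame({
    'Na_to_K': [25.4, 13.1, 10.1, 18.0, 7.8, 16.3],
    'Drug': ['DrugY', 'drugC', 'drugX', 'DrugY', 'drugA', 'DrugY'],
})

# Flag each row by whether it is above the threshold suggested by the bar chart.
level = toy_df['Na_to_K'].gt(15).map({True: 'above 15', False: '15 or below'})
level = level.rename('Na_to_K level')

# Count the rows for each combination of flag and drug type.
table = pd.crosstab(level, toy_df['Drug'])
print(table)
```

On the real data the same two lines could be run with `drugs_df` in place of `toy_df`.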

In [None]:
# Visualize Na_to_K and drug type using strip plot
plt.figure(figsize = (10 , 6))
# Plotting a strip plot to see how a numerical variable is distributed across categories
sns.stripplot(x = 'Drug' , y = 'Na_to_K' , data = drugs_df)
plt.xlabel('Drug Type' , fontsize = 12)
plt.ylabel('Na_to_K' , fontsize = 12)
plt.title('Distribution of Na_to_K & Drug')
plt.show()


<a id = 'encoding'></a>
## Encoding Categorical Features

In [None]:
# Get all non-numerical columns
print(drugs_df.select_dtypes(exclude=["number","bool_"]))

### Label Encoding
We can use label encoding for *Sex* as its values have no precedence or hierarchy.
The target feature need not be encoded, as scikit-learn classifiers accept string class labels and encode them internally.

The following columns will be label encoded:
* Sex

In [None]:
from sklearn.preprocessing import LabelEncoder

labelEncoder = LabelEncoder()

In [None]:
# Make a copy of the dataset
drugs_train_df = drugs_df.copy()


In [None]:
drugs_train_df['Sex'] = labelEncoder.fit_transform(drugs_train_df['Sex'])
print(drugs_train_df.loc[0 : 5, 'Sex'])
print(drugs_df.loc[0 : 5, 'Sex'])
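
To see which integer stands for which label, the fitted encoder can be inspected. This is a small sketch on a hypothetical sample of the *Sex* column. `sex_values` is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Sex column.
sex_values = ['F', 'M', 'M', 'F']

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# classes_ holds the original labels in sorted order; each label's position is its code.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels.
print(encoder.inverse_transform(codes))
```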

### Ordinal Encoding

Columns *BP* and *Cholesterol* are ordinal in nature as their values have a natural order (LOW, NORMAL and HIGH). We can use the pandas `map` function to encode these variables ordinally.

The following columns will be ordinally encoded:
* BP
* Cholesterol

In [None]:
# Get the unique values
print('BP: ', drugs_train_df['BP'].unique())
print('Cholesterol: ', drugs_train_df['Cholesterol'].unique())

In [None]:
# Define the mapping from category to rank
ord_dict = {'LOW': 1 , 'NORMAL' : 2, 'HIGH' : 3}
drugs_train_df['BP'] = drugs_train_df['BP'].map(ord_dict)
drugs_train_df['Cholesterol'] = drugs_train_df['Cholesterol'].map(ord_dict)

In [None]:
print('BP: ', drugs_train_df['BP'].unique())
print('Cholesterol: ', drugs_train_df['Cholesterol'].unique())

In [None]:
print(drugs_train_df)

The data keeps its meaning, since the ordinal encoding preserves the order of LOW, NORMAL and HIGH in the key feature columns.
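
An alternative to a hand-written dictionary is an ordered pandas `Categorical`, which makes the intended order explicit. The sketch below is an illustration on a hypothetical sample of the *BP* column, not a replacement for the `map` call used above.

```python
import pandas as pd

# Hypothetical sample of the BP column.
bp = pd.Series(['HIGH', 'LOW', 'NORMAL', 'LOW'])

# Declare the order explicitly so the codes follow LOW, then NORMAL, then HIGH.
levels = ['LOW', 'NORMAL', 'HIGH']
bp_cat = pd.Categorical(bp, categories = levels, ordered = True)

# Codes start at 0; add 1 to match the 1..3 scheme of the dictionary mapping.
bp_encoded = pd.Series(bp_cat.codes + 1, index = bp.index)
print(bp_encoded.tolist())
```

One difference worth knowing: a value outside `levels` gets the code -1 here, whereas `map` with a dictionary yields NaN for it.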

<a id = 'model'></a>
## Building Classification models

**Split into Train and Test data**

In [None]:
# Number of records
print(drugs_train_df.shape)

Writing a train/test split from scratch using NumPy masks is good practice and a useful trick to know. Because the mask is drawn at random, the resulting split is only approximately 80/20.

In [None]:
def splitDataset(x_df: pd.DataFrame , y_df: pd.Series) -> tuple:
    '''
    Function to split a dataset into train and test sets.

    Each row lands in the train set with probability `ratio`, so the
    split sizes are approximate. Returns x_train, y_train, x_test, y_test.
    '''
    ratio = 0.8

    # Boolean mask: True marks a train row, False marks a test row
    mask = np.random.rand(len(x_df)) <= ratio

    x_train = x_df[mask]
    x_test = x_df[~mask]

    y_train = y_df[mask]
    y_test = y_df[~mask]

    return x_train, y_train, x_test, y_test
        

In [None]:
np.random.seed(123)

y_df = drugs_train_df['Drug']
x_df = drugs_train_df.drop(['Drug'] , axis = 1)

x_train, y_train, x_test, y_test = splitDataset(x_df , y_df)

print('X Train Shape: ', x_train.shape)
print('X Test Shape: ', x_test.shape)
print('Y Train Shape: ', y_train.shape)
print('Y Test Shape: ', y_test.shape)
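
The mask-based split does not look at the labels, so a rare class could end up with very few rows in the test set. As a hedged alternative, scikit-learn's `train_test_split` with `stratify` keeps the class proportions the same in both partitions. The data below (`toy_x`, `toy_y`) is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, two classes with a 15:5 imbalance.
toy_x = pd.DataFrame({'feature': np.arange(20)})
toy_y = pd.Series(['DrugY'] * 15 + ['drugX'] * 5)

# stratify keeps the class proportions equal in the train and test partitions.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size = 0.2, stratify = toy_y, random_state = 123
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Note that `train_test_split` returns the pieces in the order x_train, x_test, y_train, y_test, which differs from the order returned by `splitDataset`.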

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Define the model
logistic_regression = LogisticRegression(solver='liblinear')
logistic_regression.fit(x_train , y_train)

y_pred = logistic_regression.predict(x_test)

In [None]:
# Get scores
train_score = logistic_regression.score(x_train , y_train)
test_score = logistic_regression.score(x_test , y_test)

In [None]:
print('Train score: {:.2f}'.format(train_score))
print('Test score: {:.2f}'.format(test_score))
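
With a dataset this small, the score from a single split can vary noticeably with the random seed. A sketch of stratified k-fold cross-validation follows. It runs on synthetic data from `make_classification` standing in for the encoded features (an assumption), and it uses the default solver rather than the `liblinear` solver chosen above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded features (an assumption, not the drug data).
X, y = make_classification(
    n_samples = 200, n_features = 5, n_informative = 3, n_redundant = 0,
    n_classes = 3, n_clusters_per_class = 1, random_state = 0
)

# Five stratified folds: every fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
scores = cross_val_score(LogisticRegression(max_iter = 1000), X, y, cv = cv)

print('Fold accuracies:', scores.round(2))
print('Mean accuracy: {:.2f}'.format(scores.mean()))
```

On the notebook's data, `x_df` and `y_df` could be passed in place of `X` and `y`.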

<a id= "evaluate"></a>
## Evaluating Classification Models

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score , precision_score , recall_score , f1_score
from sklearn.metrics import classification_report

**Confusion Matrix**

In [None]:
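# Use the fitted model's class order so rows and columns can be labelled
class_labels = logistic_regression.classes_
conf_matrix = confusion_matrix(y_test , y_pred , labels = class_labels)

plt.figure(figsize = (10, 8))
# Cell values are counts, so format them as integers
sns.heatmap(conf_matrix, annot = True, fmt = "d", linewidths =.5, square = True, cmap = 'Blues_r',
            xticklabels = class_labels, yticklabels = class_labels)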
plt.ylabel('Actual label' , fontsize = 12)
plt.xlabel('Predicted label' , fontsize = 12)
plt.title('Confusion Matrix' , fontsize = 15)
plt.show()

**Precision, Recall and F1-Score**

In [None]:
# Classification Report
print(classification_report(y_test, y_pred))

The **recall** scores for the various classes are high, which is a good indicator that the model finds most of the actual members of each class. Recall tells us, out of all the actual positive cases, how many were predicted correctly.

$recall = \frac{TP}{TP + FN}$


Precision tells us, out of all the cases predicted as positive, how many are actually positive. These scores also look good.

$precision = \frac{TP}{TP + FP}$
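
In a multiclass setting these quantities are computed per class, treating that class as "positive" and all the others as "negative". A small worked sketch follows, using a hypothetical 3-class confusion matrix. The numbers in `cm` are invented and are not the model's results.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted.
cm = np.array([
    [8, 2, 0],
    [1, 6, 1],
    [0, 1, 9],
])

tp = np.diag(cm)            # correct predictions for each class
fn = cm.sum(axis = 1) - tp  # actual members of the class that were missed
fp = cm.sum(axis = 0) - tp  # members of other classes predicted as this class

recall = tp / (tp + fn)
precision = tp / (tp + fp)

print('Recall:   ', recall.round(2))
print('Precision:', precision.round(2))
```

The row and column convention matches the one used by scikit-learn's `confusion_matrix`.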


**Classification Error or Misclassification Rate**

This tells us how often, overall, the classification is incorrect.

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

$classification{\_}error = \frac{FP + FN}{TP + TN + FP + FN}$

$classification{\_}error = 1 - accuracy$

In [None]:
# Get accuracy score
acc = accuracy_score(y_test , y_pred)
print('Accuracy: {:.2f}'.format(acc))

class_err = 1 - acc
print('Misclassification rate: {:.2f}'.format(class_err))

The misclassification rate is quite low, which makes the model a decent predictor of the different drug types. We could also try to improve the model's performance with hyperparameter tuning using GridSearchCV, but that is more useful on a bigger dataset with more features. Using other classifier models would be superfluous for this dataset with its limited features, though it can easily be attempted as an exercise.
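
For reference, the GridSearchCV tuning mentioned above could look roughly like the sketch below. It searches over the regularization strength `C` of a logistic regression on synthetic data from `make_classification`. The data and the candidate values in `param_grid` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded features (an assumption, not the drug data).
X, y = make_classification(
    n_samples = 200, n_features = 5, n_informative = 3, n_redundant = 0,
    n_classes = 3, n_clusters_per_class = 1, random_state = 0
)

# Candidate values for the inverse regularization strength C (illustrative only).
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored by 5-fold cross-validated accuracy.
search = GridSearchCV(LogisticRegression(max_iter = 1000), param_grid, cv = 5, scoring = 'accuracy')
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best CV accuracy: {:.2f}'.format(search.best_score_))
```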