# Prediction of Drugs Needed

This notebook is a workflow that uses various Python-based machine learning models to predict which drug a person needs.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict which drug a person needs?


# 2. Data

Data set from: https://www.kaggle.com/pablomgomez21/drugs-a-b-c-x-y-for-decision-trees

Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, Cholesterol, and sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to.

This is an example of a multiclass classification problem. You can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.

Data source: IBM

# 3. Evaluation

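The models will be evaluated with classification metrics: accuracy, a classification report (precision, recall, and F1 score), and a confusion matrix.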
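
As a rough illustration of these classification metrics, the sketch below computes accuracy and a per-class report with scikit-learn. The label lists are made-up toy values (an assumption for illustration), not results from this data set.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels (toy values, not from the data set)
toy_true = ['drugA', 'drugB', 'drugA', 'drugC', 'drugB', 'drugA']
toy_pred = ['drugA', 'drugB', 'drugB', 'drugC', 'drugB', 'drugA']

# Accuracy: fraction of predictions that match the true label
toy_accuracy = accuracy_score(toy_true, toy_pred)
print(f'Accuracy: {toy_accuracy:.3f}')

# Precision, recall and F1 score for each class
toy_report = classification_report(toy_true, toy_pred, zero_division=0)
print(toy_report)
```

Accuracy alone can hide poor performance on rare classes, which is why the per-class report is worth reading alongside it.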

# 4. Features

## Inputs / features
1. Age - Age of patient
2. Sex - Sex of the patient
3. BP - Blood Pressure
4. Cholesterol - Cholesterol level
5. Na_to_K - Sodium-to-potassium ratio

## Output / label
6. Drug - Drug that worked for the patient


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/drugs-a-b-c-x-y-for-decision-trees/drug200.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df['Drug'].unique()

In [None]:
df.info()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Drug')
sns.countplot(data=df,x='Drug');

The data set is imbalanced: the drug classes are not equally represented.
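
Besides the count plot, class balance can be checked numerically. The sketch below uses a small made-up Series as a stand-in for the `Drug` column (the values are hypothetical); in the notebook the same calls would apply to `df['Drug']`.

```python
import pandas as pd

# Hypothetical stand-in for the Drug column (toy values, not the real data)
toy_drug = pd.Series(
    ['drugY', 'drugY', 'drugY', 'drugX', 'drugX', 'drugA'], name='Drug'
)

# Absolute count per class, most frequent first
toy_counts = toy_drug.value_counts()

# Share of each class; a balanced 3-class problem would show about 0.33 each
toy_shares = toy_drug.value_counts(normalize=True)

print(toy_counts)
print(toy_shares.round(2))
```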

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of Age')
sns.histplot(data=df,x='Age',bins=30, kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Sex')
sns.countplot(data=df,x='Sex', hue='Drug');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of BP')
sns.countplot(data=df,x='BP', hue='Drug');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Cholesterol')
sns.countplot(data=df,x='Cholesterol', hue='Drug');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Age vs Na_to_K by Drug')
sns.scatterplot(data=df,x='Age', y='Na_to_K', hue='Drug',s=100);

# 5. Modelling

In [None]:
X = df.drop('Drug', axis=1)
y = df['Drug']

In [None]:
X = pd.get_dummies(X, drop_first=True)

In [None]:
X
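
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the rows are hypothetical, chosen only to mimic the categorical columns). It compares the encoded columns with and without `drop_first`.

```python
import pandas as pd

# Hypothetical toy rows that mimic the categorical columns (not the real data)
toy = pd.DataFrame({
    'Age': [23, 47, 61],
    'Sex': ['F', 'M', 'M'],
    'BP': ['HIGH', 'LOW', 'NORMAL'],
})

# One indicator column per category level
toy_full = pd.get_dummies(toy)

# drop_first=True drops the first level of each categorical column,
# so that level is implied when all remaining indicators are off
toy_reduced = pd.get_dummies(toy, drop_first=True)

print(list(toy_full.columns))
print(list(toy_reduced.columns))
```

Dropping one level per column removes redundant information, which mainly matters for linear models such as logistic regression.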

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
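
Because the drug classes are imbalanced, a stratified split is one option worth considering. It is not what this notebook uses; the sketch below only shows the idea on made-up data, where `stratify` keeps the class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 15 'common' samples and 5 'rare' samples
toy_X = pd.DataFrame({'feature': range(20)})
toy_y = pd.Series(['common'] * 15 + ['rare'] * 5)

# stratify=toy_y preserves the 75% / 25% class ratio in train and test
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```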

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
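
The scaler is fit on the training data only and then reused on the test data, so no information from the test set leaks into preprocessing. A minimal sketch with made-up numbers (the arrays and the `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature arrays (toy values)
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses the training mean and standard deviation on the test data
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled)
```

Scaling mainly affects distance-based and linear models such as `KNeighborsClassifier`, `SVC`, and `LogisticRegression`; decision trees split on thresholds and are not sensitive to it.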

## Model imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier

## Baseline Modelling

In [None]:
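def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training data and return its test-set accuracy.

    models: dict mapping a model name to an unfitted scikit-learn style estimator.
    Returns a DataFrame indexed by model name with one 'Score' column, sorted ascending.
    """
    # Seed NumPy's global random state for reproducibility
    np.random.seed(42)

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        # For classifiers, score() returns the mean accuracy on the given data
        model_scores[name] = model.score(X_test, y_test)

    # Convert to a one-column DataFrame and sort from lowest to highest score
    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')

    return model_scores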

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier()}
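
Depending on the installed XGBoost version, `XGBClassifier` and `XGBRFClassifier` may reject string class labels such as the drug names and expect integers starting at 0. If that happens, one possible workaround (an assumption, not part of the original notebook) is to encode the target with scikit-learn's `LabelEncoder`. The sketch uses made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values (toy labels, not the real column)
toy_labels = ['drugB', 'drugA', 'drugC', 'drugA']

toy_encoder = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0, 1, 2, ...
toy_encoded = toy_encoder.fit_transform(toy_labels)
# inverse_transform maps the integers back to the original names
toy_decoded = toy_encoder.inverse_transform(toy_encoded)

print(list(toy_encoder.classes_))
print(list(toy_encoded))
print(list(toy_decoded))
```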

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

It seems the baseline models below reach 100% accuracy on the test set. Since this is decision tree practice, we will build the model using a DecisionTreeClassifier.

1. SVC 	1.000
2. DecisionTreeClassifier 	1.000
3. RandomForestClassifier 	1.000
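
A perfect score on a single small test split can be checked with k-fold cross-validation, which scores the model on several different splits. The sketch below is a hedged illustration on synthetic data from `make_classification` (an assumption, not this data set); in the notebook, the encoded `X` and `y` would take the place of the toy arrays.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in data (hypothetical, not the drug data set)
toy_X, toy_y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=42
)

# 5 stratified folds: each fold keeps roughly the same class proportions
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold
toy_cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), toy_X, toy_y, cv=toy_cv
)

print(toy_cv_scores)
print(f'Mean accuracy: {toy_cv_scores.mean():.3f} (std {toy_cv_scores.std():.3f})')
```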

# 6. Model Evaluation

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

In [None]:
y_preds

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

## Classification Report

In [None]:
print(classification_report(y_test,y_preds))

## Confusion Matrix

In [None]:
# ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test);

The model is performing at 100% accuracy on the test set!
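
To see which rules a fitted tree actually learned, scikit-learn's `export_text` prints them as nested conditions. The sketch below fits a tree on a tiny made-up table (the values and the `toy_` names are hypothetical); in the notebook the fitted `model` could be passed instead.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy rows (not the real data); here Na_to_K alone separates the classes
toy_X = pd.DataFrame({
    'Age': [25, 40, 60, 35, 50, 70],
    'Na_to_K': [20.0, 18.0, 25.0, 9.0, 11.0, 8.0],
})
toy_y = ['drugY', 'drugY', 'drugY', 'drugX', 'drugX', 'drugX']

toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)

# Text rendering of the learned split thresholds and leaf classes
toy_rules = export_text(toy_tree, feature_names=list(toy_X.columns))
print(toy_rules)
```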