# About this Notebook
## Author: Seyedsaman Emami

<hr>


<h3> Dataset </h3>
In this notebook, I analyze the attached dataset.

<h3> Classification problem </h3>
To predict the class labels, I used the gradient boosting model from Friedman's work and applied several metrics and evaluation methods to check the model's performance.

<h4> Metrics </h4>
The main metric I used to measure the model's performance is classification accuracy.
The evaluation methods are:
<ol>
    <li> Accuracy </li>
    <li> Staged Predict </li>
    <li> Confusion matrix </li>
</ol>

<h3>Splitting method</h3>
Stratified K-fold cross-validation

<br>

<img src="https://cdn.mdedge.com/files/s3fs-public/Image/August-2018/pills_520225198_web.jpg" alt="Travel" width="500" height="600">

<hr>
Each code cell is explained in the markdown cell above it.


<h5>If you are interested in this problem and the detailed analysis, you can copy this notebook as follows:</h5>

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1101107%2F8187a9b84c9dde4921900f794c6c6ff9%2FScreenshot%202020-06-28%20at%201.51.53%20AM.png?generation=1593289404499991&alt=media" alt="Copyandedit" width="300" height="300" class="center">

# Table of Contents
* [Importing Libs](#lib)
* [Exploring dataset](#dataset)
* [Feature engineering](#Feature_engineering)
* [Modeling](#modeling)


<a id='lib'></a>
# Import Libraries

In [None]:
import os
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier

warnings.simplefilter("ignore")

<a id='dataset'> </a>
# 1. Dataset

## 1.1. Importing the dataset

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data = pd.read_csv(os.path.join(dirname, filename))
data.head()

## 1.2. Data info

In [None]:
df = data.copy()
data.describe().T.style.bar()

In [None]:
data.info()

In [None]:
print('We have', data.shape[0], 'Rows and', data.shape[1], 'features')

In [None]:
data.columns

## 1.3. Check the missing values

In [None]:
plt.figure(figsize=(23, 3))
sns.heatmap(data.isnull(), yticklabels=False, cbar=True)

Fortunately, the heatmap shows no null values in the dataset.
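
A heatmap can hide a handful of missing cells in a large frame, so a numeric count is a useful cross-check. The sketch below is only an illustration on a hypothetical toy frame (`toy` and its values are assumptions, not the notebook's data); the same two calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Age": [23, 47, np.nan, 61],
    "Sex": ["F", "M", "M", None],
    "Na_to_K": [25.4, 13.1, 10.1, 7.8],
})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only if the whole frame is free of missing values
has_no_missing = not toy.isnull().values.any()
print("No missing values:", has_no_missing)
```

On a frame without missing values, every count is zero and the final flag is `True`.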

<a id='Feature_engineering' > </a>
# 2. Feature engineering 

## 2.1. Identifying datatype

<h4> Returning the numeric features </h4>

In [None]:
# select_dtypes is the public API for picking numeric columns
num_col = data.select_dtypes(include='number').columns.tolist()
print('numeric features:', num_col)

<h4> Returning the categorical features </h4>

In [None]:
# Keep the original column order (a set would make the order nondeterministic)
cat_col = [c for c in data.columns if c not in num_col]
print('categorical features:', cat_col)

## 2.2. Label encoding
<h4>Converting categorical features and class labels to integer codes</h4>

In [None]:
for i in cat_col:
    le = LabelEncoder()
    n = str(i) + '_n'
    df[n] = le.fit_transform(df[i])
    del df[i]
df.head()
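
The loop above uses `LabelEncoder`, which maps each category to a single integer, whereas one-hot encoding creates one indicator column per category. The following sketch contrasts the two on a hypothetical toy column (the frame `toy` and its values are assumptions for illustration only).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; values are illustrative only.
toy = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL", "LOW"]})

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(toy["BP"])
print(label_codes)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["BP"], prefix="BP")
print(one_hot)
```

Tree-based models such as gradient boosting can usually work with integer codes, which is presumably why label encoding is sufficient here; models that interpret the codes as magnitudes may be better served by one-hot columns.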

In [None]:
plt.figure(figsize=(10, 5))
for i, j in enumerate(df.keys()):
    plt.subplot(2, 2+1, i+1)
    plt.boxplot(df[j], 0,'o',showbox=True,
            showfliers=True, showcaps=True, showmeans=True)
    plt.title(j + ' - box plot')

<h4> As we can see, the Na_to_K feature might have outliers, but I will skip outlier treatment in this notebook. </h4>
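
Although outlier treatment is skipped here, the points drawn beyond the box plot whiskers can also be flagged numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in matplotlib's `boxplot`, to hypothetical values (the array `values` is an assumption, not taken from the dataset).

```python
import numpy as np

# Hypothetical values standing in for a numeric column such as Na_to_K.
values = np.array([7.8, 9.4, 10.1, 11.0, 12.3, 13.1, 14.0, 15.2, 16.3, 38.2])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (values < lower) | (values > upper)

print("Bounds:", lower, upper)
print("Flagged values:", values[outlier_mask])
```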

<a id='modeling'></a>
# 3. Modeling

## 3.1. Introducing the dependent and independent variables

In [None]:
X = (df.drop(['Drug_n'], axis=1)).values
y = (df.Drug_n).values
class_n = np.unique(y)
print('X shape:', X.shape, 'y shape:', y.shape, 'class labels:', class_n)

## 3.2. Label histogram

In [None]:
fig = plt.figure(figsize=(20, 3))
ax = plt.axes()
plt.title('class Distribution')
sns.histplot(y, kde=True, color='gray')
plt.xlabel('Drug Type')
plt.ylabel('Numbers')
plt.xticks(class_n)
plt.savefig('hist.jpg', dpi=300)

## 3.3. Training the model

<h4>To split the dataset, I used stratified sampling to generate the train/test indices, which keeps the class distribution approximately the same in every fold, and built ten folds for training the model.
Moreover, the random seed is fixed so that the results are reproducible.
</h4>
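
To illustrate what stratification does, the sketch below splits hypothetical imbalanced labels (`y_toy`, an assumption for illustration) into ten folds and counts the classes in each test fold. Every fold should receive the same class proportions as the full label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

test_counts = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Count how many samples of each class land in this test fold
    test_counts.append(np.bincount(y_toy[test_idx], minlength=2))

print(np.array(test_counts))
```

With a fixed `random_state`, rerunning the split yields the same indices, which is what makes the fold-wise results repeatable.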

In [None]:
y.shape

In [None]:
K = 10   # number of folds
N = 100  # number of boosting iterations

pred_mart = np.zeros((y.shape[0], N))  # staged predictions per sample and iteration
pred_t = np.zeros_like(y)              # final out-of-fold predictions
acc = []                               # test accuracy of each fold

kfold = StratifiedKFold(n_splits=K, shuffle=True, random_state=1)

for k, (train_index, test_index) in enumerate(kfold.split(X, y)):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    mart = GradientBoostingClassifier(max_depth=2,
                                      subsample=0.75,
                                      max_features="sqrt",
                                      learning_rate=0.025,
                                      random_state=1,
                                      n_estimators=N)
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", mart)])
    pipe.fit(x_train, y_train)  # fits the scaler and `mart` in place

    pred_t[test_index] = pipe.predict(x_test)
    acc.append(accuracy_score(y_test, pred_t[test_index]))

    # Reuse the fitted scaler so the staged predictions see the same inputs as the pipeline
    x_test_scaled = pipe.named_steps["scaler"].transform(x_test)
    for i, pred in enumerate(mart.staged_predict(x_test_scaled)):
        pred_mart[test_index, i] = pred

## 3.4. Evaluation

### 3.4.1. Model Accuracy

In [None]:
# accuracy_score returns a fraction in [0, 1]; the % format spec converts it to a percentage
print('Model average accuracy is:', '{0:.2%}'.format(np.mean(acc)))

### 3.4.2. Accuracy per boosting iteration
<h4> Check the accuracy of the ensemble after each boosting iteration, using the staged predictions collected across the test folds.
 </h4>

In [None]:
# One accuracy value per boosting iteration, computed over all out-of-fold predictions
test_score_mart = np.empty(mart.n_estimators_)
for i in range(mart.n_estimators_):
    test_score_mart[i] = accuracy_score(y, pred_mart[:, i])

plt.plot(test_score_mart, '-', label='Accuracy', linewidth=3, color='black')
plt.xlabel('Boosting Iteration')
plt.ylabel('accuracy')
plt.legend(loc=0)
plt.title('Ensemble accuracy per boosting iteration')
plt.grid(True)
plt.show()
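
As a self-contained illustration of how `staged_predict` works, the sketch below fits a small gradient boosting model on synthetic data and records the test accuracy after each iteration. The data from `make_classification` and the hyperparameters are assumptions chosen for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration; not the notebook's dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          stratify=y_toy, random_state=1)

gbc = GradientBoostingClassifier(n_estimators=50, max_depth=2,
                                 learning_rate=0.1, random_state=1)
gbc.fit(X_tr, y_tr)

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_te, y_pred) for y_pred in gbc.staged_predict(X_te)]

print("Accuracy after the first iteration:", staged_acc[0])
print("Accuracy after all iterations:", staged_acc[-1])
```

The last staged value matches the accuracy of `gbc.predict`, because the final stage is the complete ensemble.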

### 3.4.3. Loss curve
<h4> Training loss at each boosting iteration, computed on the in-bag sample. Note that <code>mart</code> here is the model fitted on the last fold. </h4>

In [None]:
plt.plot(mart.train_score_, '-', label='Loss', linewidth=3, color='black')
plt.xlabel('Boosting Iteration')
plt.ylabel('loss')
plt.legend(loc=0)
plt.title('loss curve')
plt.grid(True)
plt.show()

### Model performance

In [None]:
plt.plot(mart.train_score_, '-', label='Loss', linewidth=3, color='blue')
plt.plot(test_score_mart, '-', label='Accuracy', linewidth=3, color='red')
plt.xlabel('Boosting Iteration')
plt.legend(loc=0)
plt.title('Model performance')
plt.grid(True)
plt.show()

### 3.4.4. Confusion Matrix

In [None]:
cf = confusion_matrix(y, pred_t)
# The confusion matrix holds integer counts, so format the annotations as integers
sns.heatmap(cf, cmap='PuBu', annot=True, fmt='d')
plt.xlabel('Predicted values')
plt.ylabel('True labels')
plt.title('MART')
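
Beyond the heatmap, per-class figures can be read directly off the confusion matrix. The sketch below uses a hypothetical 3-class matrix (`cm`, an assumption for illustration) with true labels on the rows and predictions on the columns, which is the layout that `confusion_matrix` returns.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true labels, columns: predictions).
cm = np.array([
    [18, 2, 0],
    [1, 14, 1],
    [0, 3, 11],
])

# Overall accuracy: correctly classified samples (the diagonal) over all samples
overall_accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal divided by the row totals (true counts)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision per class: diagonal divided by the column totals (predicted counts)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print("Accuracy:", overall_accuracy)
print("Recall:", recall_per_class)
print("Precision:", precision_per_class)
```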