# Case 3. First classification experiment
Neural networks for Health Technology Applications<br>
26.2.2020, Sakari Lukkarinen<br>
[Helsinki Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## Introduction

This notebook introduces the text preprocessing functions used to prepare review text for neural networks, and applies them in a first classification experiment.

## Acknowledgments

The dataset is from: [UCI ML Drug Review dataset](https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018).

## Import libraries and read the datasets

In [None]:
# Import the basic libraries (similar start as in Kaggle kernels)
%pylab inline
import time # for timing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from sklearn.model_selection import train_test_split # preprocessing datasets
from tensorflow.keras.preprocessing.text import Tokenizer # text preprocessing
from tensorflow.keras.models import Sequential # modeling neural networks
from tensorflow.keras.layers import Dense, Activation # layers for neural networks
from sklearn.metrics import confusion_matrix, classification_report, cohen_kappa_score # final metrics

# Input data files are available in the "../input/" directory.
import os
print(os.listdir("../input"))
tf.__version__

In [None]:
# Change the default figure size
plt.rcParams['figure.figsize'] = [12, 5]

In [None]:
# Create dataframes train and test
train = pd.read_csv('../input/drugsComTrain_raw.csv')
test = pd.read_csv('../input/drugsComTest_raw.csv')

# Show the first 5 rows of the train set
train.head()

## Text processing

More info: 
- [scikit-learn CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [scikit-learn text feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- keras Tokenizer (imported above as `tensorflow.keras.preprocessing.text.Tokenizer`)


In [None]:
%%time
# Tokenize the text
samples = train['review']
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(samples)

# Encode each review as a binary bag-of-words vector of length 5000
# (1 if the word occurs in the review, 0 otherwise)
data = tokenizer.texts_to_matrix(samples, mode='binary')
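
To make the encoding step concrete, here is a simplified stand-in for what `texts_to_matrix(..., mode='binary')` produces: each review becomes a fixed-length 0/1 vector with a 1 at the index of every vocabulary word it contains. The helper names (`build_word_index`, `binary_matrix`) and the toy reviews are hypothetical, and the tokenization (lowercase, split on whitespace) is cruder than that of the Keras `Tokenizer`. As in Keras, index 0 is left unused and only words with an index below `num_words` are kept.

```python
from collections import Counter

import numpy as np


def build_word_index(texts):
    """Map each word to a rank (1 = most frequent); index 0 stays unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def binary_matrix(texts, word_index, num_words):
    """Return a (len(texts), num_words) 0/1 matrix marking which words occur."""
    matrix = np.zeros((len(texts), num_words))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and idx < num_words:
                matrix[row, idx] = 1.0
    return matrix


# Hypothetical toy reviews, not taken from the dataset
toy_reviews = ["no side effects", "bad side effects", "no effects at all"]
word_index = build_word_index(toy_reviews)
toy_data = binary_matrix(toy_reviews, word_index, num_words=4)
print(word_index)
print(toy_data)
```

With `num_words=4` only the three most frequent words get a column, so rarer words such as "bad" are silently dropped. The same happens to words outside the top of the vocabulary when `num_words = 5000`.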

In [None]:
%%time
# Create five categories
# label = 4, when rating == 10
# label = 3, when rating == 8..9
# label = 2, when rating == 5..7
# label = 1, when rating == 2..4
# label = 0, when rating == 1
# Copy the values so that train['rating'] itself is not overwritten
labels = train['rating'].values.copy()
for i in range(len(labels)):
    x = labels[i]
    if x == 10:
        labels[i] = 4
    elif x >= 8:
        labels[i] = 3
    elif x >= 5:
        labels[i] = 2
    elif x >= 2:
        labels[i] = 1
    else:
        labels[i] = 0
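
The same binning can be written without an explicit loop. The sketch below uses `np.digitize`, which returns for each rating the number of bin edges it has reached. `rating_to_label` is a hypothetical helper name, and the sketch assumes the ratings are integers from 1 to 10.

```python
import numpy as np


def rating_to_label(ratings):
    """Map ratings 1..10 to labels 0..4 using the same thresholds as the loop."""
    # A rating gains one label step for every edge it reaches:
    # 1 -> 0, 2..4 -> 1, 5..7 -> 2, 8..9 -> 3, 10 -> 4
    edges = [2, 5, 8, 10]
    return np.digitize(ratings, edges)


example_ratings = np.arange(1, 11)  # every possible rating, 1..10
example_labels = rating_to_label(example_ratings)
print(dict(zip(example_ratings.tolist(), example_labels.tolist())))
```

Because the function returns a new array, the input ratings are left unchanged, and the vectorized call is typically much faster than a Python loop over all reviews.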

In [None]:
%%time
# Split into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(data, labels, test_size = 0.250, random_state = 2020)

In [None]:
%%time
# Convert outputs to one-hot-coded categoricals
from tensorflow.keras.utils import to_categorical
y_train_cat = to_categorical(y_train)
y_val_cat = to_categorical(y_val)
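
For reference, one-hot encoding integer labels amounts to selecting rows of an identity matrix. The snippet below is a NumPy sketch of that idea, not the Keras implementation; `one_hot` is a hypothetical helper. Passing the number of classes explicitly keeps the output width fixed at 5 even if a class happens to be absent from a split (`to_categorical` accepts a `num_classes` argument for the same purpose).

```python
import numpy as np


def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]


toy_labels = [0, 4, 2]
toy_cat = one_hot(toy_labels, num_classes=5)
print(toy_cat)
# argmax along each row recovers the original integer labels
print(toy_cat.argmax(axis=1))
```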

## Model

In [None]:
# Create a simple sequential model
model = Sequential()
model.add(Dense(256, input_dim = 5000))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['acc'])
model.summary()
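
The parameter counts reported by `model.summary()` can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch of that arithmetic for the layer sizes used above (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters in a fully connected layer with biases."""
    return n_in * n_out + n_out


layer_sizes = [5000, 256, 32, 5]  # input width followed by the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)       # [1280256, 8224, 165]
print(sum(per_layer))  # 1288645
```

Nearly all of the parameters sit in the first layer, a direct consequence of the 5000-wide input.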

## Training

In [None]:
%%time
history = model.fit(x_train, y_train_cat, 
                    epochs = 10, 
                    batch_size = 32,
                    verbose = 1,
                    validation_data = (x_val, y_val_cat))

In [None]:
# Plot the accuracy and loss
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
e = arange(len(acc)) + 1

plot(e, acc, label = 'train')
plot(e, val_acc, label = 'validation')
title('Training and validation accuracy')
xlabel('Epoch')
grid()
legend()

figure()

plot(e, loss, label = 'train')
plot(e, val_loss, label = 'validation')
title('Training and validation loss')
xlabel('Epoch')
grid()
legend()

show()

## Calculate metrics

In [None]:
# Find the predicted values for the validation set
pred = argmax(model.predict(x_val), axis = 1)

In [None]:
# Calculate the classification report
cr = classification_report(y_val, pred)
print(cr)

In [None]:
# Calculate the confusion matrix
# Note the transpose: rows = predicted class, columns = true class
cm = confusion_matrix(y_val, pred).T
print(cm)
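
Because of the `.T`, the printed matrix has predicted classes on the rows and true classes on the columns, the reverse of scikit-learn's default layout. The sketch below builds such a matrix with NumPy for a made-up set of labels and derives per-class precision and recall from it. The helper name and the toy labels are hypothetical.

```python
import numpy as np


def confusion_pred_by_true(y_true, y_pred, num_classes):
    """Count matrix with rows = predicted class, columns = true class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1
    return cm


# Made-up labels for illustration only
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_cm = confusion_pred_by_true(toy_true, toy_pred, num_classes=3)
print(toy_cm)

correct = np.diag(toy_cm)
precision = correct / toy_cm.sum(axis=1)  # row sums: how often each class was predicted
recall = correct / toy_cm.sum(axis=0)     # column sums: how often each class truly occurs
print(precision, recall)
```

In this orientation, precision is read along a row and recall along a column, which should match the corresponding columns of the classification report.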

In [None]:
# Calculate Cohen's kappa, both with linear and quadratic weights
k = cohen_kappa_score(y_val, pred, weights = 'linear')
print(f"Cohen's kappa (linear)    = {k:.3f}")
k2 = cohen_kappa_score(y_val, pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic) = {k2:.3f}")
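
The `weights` argument controls how disagreements are penalized: with no weights every mismatch counts equally, `'linear'` penalizes by the distance between the two labels, and `'quadratic'` by the square of that distance, which suits ordered labels such as these rating classes. The following NumPy sketch of the standard weighted-kappa formula, `kappa = 1 - sum(w * O) / sum(w * E)`, is meant to show the computation rather than to replace `cohen_kappa_score`. `weighted_kappa` and the toy labels are hypothetical.

```python
import numpy as np


def weighted_kappa(y_true, y_pred, num_classes, weights=None):
    """Cohen's kappa with no, 'linear' or 'quadratic' disagreement weights."""
    observed = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if the two label sets were independent (same total count)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    i, j = np.indices((num_classes, num_classes))
    if weights is None:
        w = (i != j).astype(float)       # every disagreement costs 1
    elif weights == "linear":
        w = np.abs(i - j).astype(float)  # cost grows with label distance
    elif weights == "quadratic":
        w = (i - j).astype(float) ** 2   # distant mistakes cost much more
    else:
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    return 1.0 - (w * observed).sum() / (w * expected).sum()


# Made-up ordered labels for illustration only
toy_true = [0, 1, 2, 3, 4, 4, 2, 0]
toy_pred = [0, 1, 2, 4, 4, 3, 2, 1]
for mode in (None, "linear", "quadratic"):
    print(mode, round(weighted_kappa(toy_true, toy_pred, 5, mode), 3))
```

In the toy data every mistake is off by only one class, so the weighted variants score higher than the unweighted one: near misses are treated as partial agreement.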


More info: 
- [sklearn.metrics.cohen_kappa_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html)
- [Cohen's kappa (Wikipedia)](https://en.wikipedia.org/wiki/Cohen%27s_kappa)