<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Case 3 - Patient Drug Review Analysis" data-toc-modified-id="Case-X.-Template-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Case 3 - Patient Drug Review Analysis</a></span></li><li><span><a href="#Background" data-toc-modified-id="Background-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Background</a></span></li><li><span><a href="#Data" data-toc-modified-id="Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data</a></span></li><li><span><a href="#Modes-and-training" data-toc-modified-id="Modes-and-training-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Models and training</a></span></li><li><span><a href="#Results-and-Discussion" data-toc-modified-id="Results-and-Discussion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Results and Discussion</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

# Case 3 - Patient Drug Review Analysis
Samuel Räsänen, Arttu Sundell, Jari Putaansuu<br>
Last edited: 04.03.2020<br>
Neural Networks for Health Technology Applications<br>
[Helsinki Metropolia University of Applied Sciences](http://www.metropolia.fi/en/)<br>

# Background

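This notebook serves as an introduction to text preprocessing functions for neural networks. Patient drug reviews are tokenized, converted to one-hot vectors, and used to train a small dense network that predicts a rating category.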

# Data

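The data comes from the UCI ML Drug Review dataset. Each row contains a drug name, the condition it was taken for, a free-text patient review, a numeric rating, the review date, and a count of how many readers found the review useful.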

In [1]:
# Import the basic libraries (similar start as in Kaggle kernels)
%pylab inline
import time # for timing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from sklearn.model_selection import train_test_split # preprocessing datasets
from tensorflow.keras.preprocessing.text import Tokenizer # text preprocessing
from tensorflow.keras.models import Sequential # modeling neural networks
from tensorflow.keras.layers import Dense, Activation # layers for neural networks
from sklearn.metrics import confusion_matrix, classification_report, cohen_kappa_score # final metrics

# Change the default figure size
plt.rcParams['figure.figsize'] = [12, 5]

tf.__version__

Populating the interactive namespace from numpy and matplotlib


'2.0.0'

In [2]:
# Input data files are available in the "./UCI_ML_Drug_Review_dataset" directory.
import os
print(os.listdir("./UCI_ML_Drug_Review_dataset"))

# Create dataframes train and test
train = pd.read_csv('./UCI_ML_Drug_Review_dataset/drugsComTrain_raw.csv')
test = pd.read_csv('./UCI_ML_Drug_Review_dataset/drugsComTest_raw.csv')

# Show the first 5 rows of the train set
train.head()

['drugsComTest_raw.csv', 'drugsComTrain_raw.csv']


| | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9 | 20-May-12 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8 | 27-Apr-10 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5 | 14-Dec-09 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8 | 3-Nov-15 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9 | 27-Nov-16 | 37 |


In [3]:
# Tokenize the text
samples = train['review']
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(samples)

# Make one hot samples
data = tokenizer.texts_to_matrix(samples, mode='binary')
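
To make the one-hot step above concrete, the following is a minimal pure-Python sketch of a binary bag-of-words encoding. It is not the Keras implementation: the helper names `fit_word_index` and `texts_to_binary_matrix` are hypothetical, punctuation filtering is left out, and ties in word frequency are broken alphabetically as an assumption. It keeps the same general idea that index 0 is reserved and only words ranked below `num_words` are kept.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    # (an assumption of this sketch, not necessarily what Keras does).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """Return one row per text with 1 where a kept word occurs, else 0."""
    matrix = []
    for text in texts:
        row = [0] * num_words  # column 0 stays unused because index 0 is reserved
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is not None and idx < num_words:
                row[idx] = 1
        matrix.append(row)
    return matrix


toy_reviews = ["no side effect", "side effect was bad", "no effect"]
word_index = fit_word_index(toy_reviews)
toy_matrix = texts_to_binary_matrix(toy_reviews, word_index, num_words=4)
print(word_index)
print(toy_matrix)
```

With `num_words=4` only the three most frequent words get a column, so rarer words such as "was" and "bad" are dropped from the encoding. The same trade-off applies to `num_words = 5000` in the cell above.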

In [4]:
# Create five categories
# label = 4, when rating == 10
# label = 3, when rating == 8..9
# label = 2, when rating == 5..7
# label = 1, when rating == 2..4
# label = 0, when rating == 1
# Copy the ratings so that train['rating'] is not overwritten in place
# (otherwise re-running this cell would remap the already mapped labels)
labels = train['rating'].values.copy()
for i in range(len(labels)):
    x = labels[i]
    if x == 10:
        labels[i] = 4
    elif x >= 8:
        labels[i] = 3
    elif x >= 5:
        labels[i] = 2
    elif x >= 2:
        labels[i] = 1
    else:
        labels[i] = 0
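
The thresholds in the cell above can be checked on all ten possible ratings. The sketch below is one way to express the same mapping with the standard library; the names `LABEL_EDGES` and `rating_to_label` are hypothetical and not part of the notebook.

```python
from bisect import bisect_right

# Lower edges of labels 1..4; ratings below the first edge get label 0.
LABEL_EDGES = [2, 5, 8, 10]


def rating_to_label(rating):
    """Map a 1..10 rating to a label 0..4 using the thresholds above."""
    return bisect_right(LABEL_EDGES, rating)


mapping = {rating: rating_to_label(rating) for rating in range(1, 11)}
print(mapping)
```

On a NumPy array, `np.digitize(ratings, [2, 5, 8, 10])` should give the same labels without a Python loop, and it returns a new array instead of modifying the ratings in place.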

In [7]:
# Split into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(data, labels, test_size = 0.25, random_state = 2020)

MemoryError: 
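
A likely cause of the `MemoryError` is the size of the dense one-hot matrix: `texts_to_matrix` returns a float64 array with one row per review and `num_words` columns, and the split needs room for copies of the rows on top of that. The sketch below estimates the footprint for a few element types. The row count is a hypothetical round figure, not the actual size of the training set, and `dense_matrix_gib` is a helper made up for this illustration.

```python
# Bytes per element for a few NumPy dtypes.
ITEM_SIZE = {"float64": 8, "float32": 4, "uint8": 1}


def dense_matrix_gib(n_rows, n_cols, dtype):
    """Size of a dense n_rows x n_cols array in GiB."""
    return n_rows * n_cols * ITEM_SIZE[dtype] / 2**30


n_reviews = 160_000  # hypothetical round figure; use len(train) in the notebook
num_words = 5000     # matches Tokenizer(num_words = 5000)

for dtype in ITEM_SIZE:
    size = dense_matrix_gib(n_reviews, num_words, dtype)
    print(f"{dtype:8s} {size:6.2f} GiB")
```

If the estimate is close to the available memory, possible remedies include casting `data` to a compact type such as `np.uint8` (binary values fit in one byte) before splitting, lowering `num_words`, or working on a subset of the reviews. Whether any of these is sufficient depends on the machine.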

In [None]:
# Convert outputs to one-hot encoded categoricals
from tensorflow.keras.utils import to_categorical
y_train_cat = to_categorical(y_train)
y_val_cat = to_categorical(y_val)
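
As a small illustration of what the conversion above produces, the following sketch builds one-hot rows by hand. The function `one_hot` is a hypothetical stand-in written for this example, not the Keras utility.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


toy_labels = [0, 3, 4]
encoded = one_hot(toy_labels, num_classes=5)
for label, row in zip(toy_labels, encoded):
    print(label, row)
```

When `num_classes` is not given, `to_categorical` infers the width from the largest label it sees. Passing `num_classes=5` explicitly would keep the training and validation targets the same width even if one split happened to lack the highest label.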

# Models and training

The following models were used ...

In [None]:
# Create a simple sequential model
model = Sequential()
model.add(Dense(256, input_dim = 5000))
model.add(Activation('relu'))
model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['acc'])
model.summary()
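
The parameter counts that `model.summary()` reports for dense layers can be reproduced by hand, since each fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below does this for the layer widths used above; `dense_params` is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the widths of the three Dense layers.
layer_sizes = [5000, 256, 32, 5]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print(total)
```

Almost all parameters sit in the first layer, so the vocabulary size (`num_words`) is the main lever on model size in this architecture.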

In [None]:
history = model.fit(x_train, y_train_cat, 
                    epochs = 10, 
                    batch_size = 32,
                    verbose = 1,
                    validation_data = (x_val, y_val_cat))

# Results and Discussion

The following results were achieved ...

In [None]:
# Plot the accuracy and loss
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
e = arange(len(acc)) + 1

plot(e, acc, label = 'train')
plot(e, val_acc, label = 'validation')
title('Training and validation accuracy')
xlabel('Epoch')
grid()
legend()

figure()

plot(e, loss, label = 'train')
plot(e, val_loss, label = 'validation')
title('Training and validation loss')
xlabel('Epoch')
grid()
legend()

show()

## Calculate metrics

In [None]:
# Find the predicted values for the validation set
pred = argmax(model.predict(x_val), axis = 1)

In [None]:
# Calculate the classification report
cr = classification_report(y_val, pred)
print(cr)

In [None]:
# Calculate the confusion matrix
# Transposed so that rows are predicted labels and columns are true labels
cm = confusion_matrix(y_val, pred).T
print(cm)
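
Because the matrix above is transposed, its orientation differs from the scikit-learn default, where rows are true labels and columns are predicted labels. The toy sketch below shows the effect of the transpose with hand-written helpers (`confusion` and `transpose` are hypothetical names) and made-up labels.

```python
def confusion(y_true, y_pred, num_classes):
    """Counts with rows = true label, columns = predicted label."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


toy_true = [0, 0, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 2, 2, 0]

toy_cm = confusion(toy_true, toy_pred, 3)
toy_cm_t = transpose(toy_cm)
print(toy_cm)    # rows = true, columns = predicted
print(toy_cm_t)  # rows = predicted, columns = true
```

In the transposed form, each row sums to the number of times a label was predicted, and each column sums to the number of times it truly occurred.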

In [None]:
# Calculate Cohen's kappa, both with linear and quadratic weights
k = cohen_kappa_score(y_val, pred, weights = 'linear')
print(f"Cohen's kappa (linear)    = {k:.3f}")
k2 = cohen_kappa_score(y_val, pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic) = {k2:.3f}")
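
The `weights` argument of `cohen_kappa_score` accepts `None` (unweighted), `'linear'`, or `'quadratic'`. For ordered labels such as the rating categories here, the weighted variants penalize a prediction more the further it lies from the true label. The following is a minimal sketch of the weighted kappa computation on made-up labels; `weighted_kappa` is a hypothetical helper written for illustration, not the scikit-learn source.

```python
def weighted_kappa(y_true, y_pred, num_classes, weights=None):
    """Cohen's kappa with no, linear, or quadratic disagreement weights."""
    n = len(y_true)
    k = num_classes

    # Observed counts: rows = true label, columns = predicted label.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        if weights is None:
            return 0 if i == j else 1
        if weights == "linear":
            return abs(i - j)
        if weights == "quadratic":
            return (i - j) ** 2
        raise ValueError(f"unknown weights: {weights}")

    # Weighted observed disagreement vs. disagreement expected by chance.
    num = sum(w(i, j) * observed[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * row_tot[i] * col_tot[j] / n
              for i in range(k) for j in range(k))
    return 1 - num / den


toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
for scheme in (None, "linear", "quadratic"):
    print(scheme, round(weighted_kappa(toy_true, toy_pred, 3, scheme), 3))
```

In this toy example one prediction is off by two classes, so the quadratic variant gives the lowest score of the three.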

# Conclusions

To summarize we found out that ...