# Predicting a patient's rating on a drug

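This dataset comes from the UCI Machine Learning Repository. The [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29#) provides patient reviews of specific drugs, together with the related condition and the patient's rating.<br>
The full dataset is split into train (75%) and test (25%) partitions.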

## Part 1: Import packages and describe the dataset

In [243]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
%matplotlib inline

### Attribute Information:

1. **drugName (categorical)**: name of drug
2. **condition (categorical)**: name of condition
3. **review (text)**: patient review
4. **rating (numerical)**: 10 star patient rating
5. **date (date)**: date of review entry
6. **usefulCount (numerical)**: number of users who found review useful

**The training data and testing data have the same columns.**

In [244]:
train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t')
test = pd.read_csv('drugsComTest_raw.tsv', sep='\t')

In [245]:
print('Train shape:',train.shape)
print('Test shape:', test.shape)
train.head()

Train shape: (161297, 7)
Test shape: (53766, 7)


| | Unnamed: 0 | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9.0 | May 20, 2012 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9.0 | November 27, 2016 | 37 |
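
As a quick sanity check on the stated 75%/25% split, the two row counts printed above can be compared directly. This is only arithmetic on the shapes shown in the output; the variable names are made up for the illustration.

```python
# Row counts copied from the shapes printed above
n_train = 161297
n_test = 53766

# Share of all rows that ended up in each partition
train_fraction = n_train / (n_train + n_test)
print(f'Train fraction: {train_fraction:.4f}')
print(f'Test fraction: {1 - train_fraction:.4f}')
```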


### Goal: Predict the patient rating in the testing data
The goal is to predict the `rating` column of the test data, so we split that column out of both train and test and use it as the label.

In [246]:
train_y = pd.DataFrame(train['rating'])
train_x = train.drop(['rating'], axis=1)
test_y = pd.DataFrame(test['rating'])
test_x = test.drop(['rating'], axis=1)

## Part 2: Data Exploration

In [247]:
print('Number of drugs in train:', train_x['drugName'].unique().size)
print('Number of conditions in train:', train_x['condition'].unique().size)

Number of drugs in train: 3436
Number of conditions in train: 885


In [248]:
# List some of the drug names
train_x['drugName'].value_counts().head()

Levonorgestrel                       3657
Etonogestrel                         3336
Ethinyl estradiol / norethindrone    2850
Nexplanon                            2156
Ethinyl estradiol / norgestimate     2117
Name: drugName, dtype: int64

Check whether every ```drugName``` and ```condition``` in the testing data also appears in the training data.

In [249]:
test_x['drugName'].isin(train_x['drugName']).value_counts()

True     53493
False      273
Name: drugName, dtype: int64

In [250]:
test_x['condition'].isin(train_x['condition']).value_counts()

True     53720
False       46
Name: condition, dtype: int64

Almost all ```drugName``` and ```condition``` values in the testing data also appear in the training data. However, 273 test rows have a drug name and 46 test rows have a condition that never appears in the training data.
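
The counts above are per row, not per distinct name. A minimal sketch of the difference, using small made-up Series in place of `train_x['drugName']` and `test_x['drugName']` (the values and variable names here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the train and test drugName columns (hypothetical values)
train_drugs = pd.Series(['A', 'B', 'B', 'C'])
test_drugs = pd.Series(['A', 'D', 'D', 'E'])

# Row-level count, as in the cells above: test rows whose name is unseen
unseen_rows = (~test_drugs.isin(train_drugs)).sum()

# Category-level count: distinct names that are unseen
unseen_names = set(test_drugs.unique()) - set(train_drugs.unique())

print(unseen_rows, sorted(unseen_names))
```

On the real data, the same set difference would show how many distinct drugs and conditions lie behind the unseen rows.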

## Part 3: Data Preprocessing

### Encode ```drugName``` and ```condition```

In [251]:
drug_encode = pd.DataFrame(train_x['drugName'].unique())
drug_encode.columns = ['drugName']
condition_encode = pd.DataFrame(train_x['condition'].unique())
condition_encode.columns = ['condition']
drug_encode.head()

| | drugName |
|---|---|
| 0 | Valsartan |
| 1 | Guanfacine |
| 2 | Lybrel |
| 3 | Ortho Evra |
| 4 | Buprenorphine / naloxone |


In [252]:
drug_dict = {}
for i, name in enumerate(drug_encode['drugName']):
    drug_dict[name] = i
cond_dict = {}
for i, name in enumerate(condition_encode['condition']):
    cond_dict[name] = i

In [253]:
new_drug = []
new_cond = []
for index in range(train_x['drugName'].size):
    new_drug.append(drug_dict[train_x.loc[index]['drugName']])
    new_cond.append(cond_dict[train_x.loc[index]['condition']])
    if index % 20000 == 0:
        print(index)

0
20000
40000
60000
80000
100000
120000
140000
160000


In [254]:
train_x.loc[:, 'drugName'] = new_drug
train_x.loc[:, 'condition'] = new_cond

In [255]:
train_x.head()

| | Unnamed: 0 | drugName | condition | review | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | 206461 | 0 | 0 | "It has no side effect, I take it in combinati... | May 20, 2012 | 27 |
| 1 | 95260 | 1 | 1 | "My son is halfway through his fourth week of ... | April 27, 2010 | 192 |
| 2 | 92703 | 2 | 2 | "I used to take another oral contraceptive, wh... | December 14, 2009 | 17 |
| 3 | 138000 | 3 | 2 | "This is my first time using any form of birth... | November 3, 2015 | 10 |
| 4 | 35696 | 4 | 3 | "Suboxone has completely turned my life around... | November 27, 2016 | 37 |
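
The row-by-row loop above works, but calling `.loc` once per row is slow on 161,297 rows. A possible alternative, sketched here on a made-up frame, is to apply the dictionary to the whole column with `Series.map`. The names `toy` and `toy_dict` are hypothetical and only mirror `train_x` and `drug_dict`.

```python
import pandas as pd

# Toy stand-in for train_x (hypothetical rows)
toy = pd.DataFrame({'drugName': ['Valsartan', 'Guanfacine', 'Lybrel', 'Guanfacine']})

# Same rule as drug_dict: each name gets the index of its first appearance
toy_dict = {name: i for i, name in enumerate(toy['drugName'].unique())}

# Series.map looks up every row in the dictionary in one call
toy['drugName'] = toy['drugName'].map(toy_dict)
print(toy['drugName'].tolist())
```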


In [None]:
test_new_drug = []
test_new_cond = []
for index in range(test_x['drugName'].size):
    try:
        test_new_drug.append(drug_dict[test_x.loc[index]['drugName']])
    except KeyError:
        # Drug not seen in training: assign the next unused integer id.
        # Ids run from 0 to len(drug_dict) - 1, so len(drug_dict) is free.
        drug_dict[test_x.loc[index]['drugName']] = len(drug_dict)
        test_new_drug.append(drug_dict[test_x.loc[index]['drugName']])
    try:
        test_new_cond.append(cond_dict[test_x.loc[index]['condition']])
    except KeyError:
        # Condition not seen in training: same rule as for drugs
        cond_dict[test_x.loc[index]['condition']] = len(cond_dict)
        test_new_cond.append(cond_dict[test_x.loc[index]['condition']])
    # Progress indicator
    if index % 20000 == 0:
        print(index)

In [256]:
test_x.loc[:, 'drugName'] = test_new_drug
test_x.loc[:, 'condition'] = test_new_cond

In [257]:
test_x.head()

| | Unnamed: 0 | drugName | condition | review | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | 163740 | 91 | 9 | "I&#039;ve tried a few antidepressants over th... | February 28, 2012 | 22 |
| 1 | 206473 | 425 | 240 | "My son has Crohn&#039;s disease and has done ... | May 17, 2009 | 17 |
| 2 | 159672 | 351 | 13 | "Quick reduction of symptoms" | September 29, 2017 | 3 |
| 3 | 39293 | 39 | 74 | "Contrave combines drugs that were used for al... | March 5, 2017 | 35 |
| 4 | 97768 | 1339 | 2 | "I have been on this birth control for one cyc... | October 22, 2015 | 4 |
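
The test-set loop above gives every unseen name a fresh integer that the model never saw during training. One alternative, which is not what this notebook does, is to send all unseen names to a single reserved "unknown" id. A minimal sketch on made-up data follows; `toy_dict`, `unknown_id`, and the drug names marked as new are hypothetical.

```python
import pandas as pd

# Hypothetical mapping learned from the training data
toy_dict = {'Valsartan': 0, 'Guanfacine': 1, 'Lybrel': 2}

# One shared id reserved for every name missing from the mapping
unknown_id = len(toy_dict)

# Toy test column containing two names that are not in the mapping
toy_test = pd.Series(['Lybrel', 'NewDrug', 'Valsartan', 'OtherNewDrug'])

# Unmapped names become NaN, which is then replaced by the reserved id
encoded = toy_test.map(toy_dict).fillna(unknown_id).astype(int)
print(encoded.tolist())
```

With this variant the training set would also need at least a few rows carrying the reserved id for the model to learn anything about it, so it is a trade-off rather than a drop-in fix.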


In [258]:
train_x = train_x.drop(['Unnamed: 0', 'review', 'date'], axis=1)
test_x = test_x.drop(['Unnamed: 0', 'review', 'date'], axis=1)

In [259]:
train_x.shape

(161297, 3)

In [260]:
from keras.utils import to_categorical

In [262]:
encode_train_y = to_categorical(train_y)
encode_train_y = np.delete(encode_train_y, 0, 1)
print(encode_train_y.shape)
encode_test_y = to_categorical(test_y)
encode_test_y = np.delete(encode_test_y, 0, 1)
print(encode_test_y.shape)

(161297, 10)
(53766, 10)
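
The 10 columns in the shapes above follow from how one-hot encoding counts classes: for integer labels up to 10, `to_categorical` creates columns for classes 0 through 10, and `np.delete` then removes the unused column 0. The NumPy-only sketch below reproduces that behaviour on a few made-up ratings, assuming ratings are whole numbers from 1 to 10.

```python
import numpy as np

# Assumption: ratings are whole numbers from 1 to 10 (made-up sample)
ratings = np.array([9, 8, 5, 1, 10])

# One row of the identity matrix per label: classes 0..10 give 11 columns
one_hot = np.eye(ratings.max() + 1)[ratings]
print(one_hot.shape)

# Rating 0 never occurs, so drop column 0 and keep 10 columns
one_hot = np.delete(one_hot, 0, axis=1)
print(one_hot.shape)
```

After the deletion, column `k` corresponds to rating `k + 1`.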


## Part 4: Training Model

In [285]:
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K

def custom_activation(x):
    return (K.sigmoid(x) * 10)
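
To see what this activation does to the output range, here is a NumPy re-implementation of the same formula. The function name `scaled_sigmoid` is made up for this sketch and is not part of Keras.

```python
import numpy as np

def scaled_sigmoid(x):
    """NumPy version of custom_activation: sigmoid(x) * 10."""
    return 10.0 / (1.0 + np.exp(-x))

# Strongly negative, zero, and strongly positive pre-activations
x = np.array([-5.0, 0.0, 5.0])
print(scaled_sigmoid(x))
```

The output always lies strictly between 0 and 10, so the network can approach a rating of 10 but never predict it exactly.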

In [287]:
model = Sequential()
model.add(Dense(units=10, input_dim=3))
for i in range(5):
    model.add(Dense(units=10, activation = 'relu'))
model.add(Dense(units=1, activation=custom_activation))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mse'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_99 (Dense)             (None, 10)                40        
_________________________________________________________________
dense_100 (Dense)            (None, 10)                110       
_________________________________________________________________
dense_101 (Dense)            (None, 10)                110       
_________________________________________________________________
dense_102 (Dense)            (None, 10)                110       
_________________________________________________________________
dense_103 (Dense)            (None, 10)                110       
_________________________________________________________________
dense_104 (Dense)            (None, 10)                110       
_________________________________________________________________
dense_105 (Dense)            (None, 1)                 11        
Total para
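
The per-layer parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic for the layer sizes defined above:

```python
# (inputs, units) for each Dense layer in the model defined above
layers = [(3, 10)] + [(10, 10)] * 5 + [(10, 1)]

# Weights (inputs * units) plus biases (one per unit)
params = [n_in * n_out + n_out for n_in, n_out in layers]
print(params)
print(sum(params))
```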

In [289]:
print('Training -----------')
model.fit(train_x, train_y, epochs=50, batch_size=2048, validation_split=0.2)

Training -----------
Train on 129037 samples, validate on 32260 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7fcc7ba92ba8>
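
The sample counts in the log follow from `validation_split=0.2`. As far as I know, Keras holds out the last fraction of the rows (before any shuffling) and truncates the training count to an integer; the sketch below assumes that rule and reproduces the two numbers from the log.

```python
n_samples = 161297
validation_split = 0.2

# Assumed rule: the first int(n * (1 - split)) rows are used for fitting,
# and the remaining rows at the end are used for validation
n_fit = int(n_samples * (1 - validation_split))
n_val = n_samples - n_fit
print(n_fit, n_val)
```

Because the held-out rows come from the end of the data, the validation score only reflects the whole dataset if the rows are not ordered in some meaningful way.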

In [318]:
# model.predict returns an array of shape (n_samples, 1)
pred_y = model.predict(test_x)
# Flatten it to one prediction per test row
pred = pred_y.ravel()

In [320]:
print('RMSE:', np.sqrt(np.mean(np.square(pred - test_y['rating'].values))))

RMSE: 3.2122784675117853
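
For reference, the RMSE expression used above can be wrapped in a small helper and checked on numbers that are easy to verify by hand. The helper name `rmse` and the sample values are made up for this sketch and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical ratings: the errors are -2, 0, 1 and 2
y_true = [9, 8, 5, 1]
y_pred = [7, 8, 6, 3]
print(rmse(y_true, y_pred))
```

The squared errors are 4, 0, 1 and 4, their mean is 2.25, and the square root is 1.5.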
