<a href="https://colab.research.google.com/github/tarun-jethwani/SentimentAnalyses/blob/master/SentimentAnalysesUsingKeras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### The dataset is stored in Google Drive, so the drive must be mounted here before the dataset can be read

In [0]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [0]:
with open('/gdrive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat '/gdrive/My Drive/foo.txt'

Hello Google Drive!

# Importing necessary libraries

In [0]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import word_tokenize
from keras.preprocessing.sequence import pad_sequences
from keras.layers import *
from keras.models import Sequential
import keras
from keras import backend as K 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Using TensorFlow backend.


### Read the Dataset from CSV file 

In [0]:
train_data = pd.read_csv('/gdrive/My Drive/kuc-hackathon-winter-2018/drugsComTrain_raw.csv')
test_data = pd.read_csv('/gdrive/My Drive/kuc-hackathon-winter-2018/drugsComTest_raw.csv')

In [0]:
train_data.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,20-May-12,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,27-Apr-10,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,14-Dec-09,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,3-Nov-15,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,27-Nov-16,37


### Drop Duplicates and NA values

In [0]:
def drop_duplicatesNA(data):
  data.drop_duplicates(inplace=True)  #dropping duplicates
  data.dropna(axis=0,inplace=True)
  return data

In [0]:
train_data = drop_duplicatesNA(train_data)
test_data = drop_duplicatesNA(test_data)

### Creating Input and Output sets

In [0]:
X_train = train_data['review'].to_numpy()
Y_train = train_data['rating'].to_numpy()
X_test = test_data['review'].to_numpy()
Y_test = test_data['rating'].to_numpy()

## Filter Reviews

### 1--> Remove special symbols and punctuation
### 2--> Convert the text to lower case
### 3--> Remove double quotes (")
### 4--> Substitute contractions
### 5--> Remove English stopwords, except no, not and nor (because these words can play an important role in determining the sentiment)

In [0]:
stop_words = set(stopwords.words('english'))

In [0]:
stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('nor')
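
To illustrate why the negations are kept, here is a minimal sketch. It uses a tiny hand-written stop word set and a hypothetical helper `strip_stop_words`; both are assumptions for illustration, not the NLTK list or code from this notebook.

```python
# Tiny illustrative stop word set; the notebook itself uses NLTK's English list.
toy_stop_words = {"the", "is", "this", "no", "not", "nor", "a"}

def strip_stop_words(text, stop_set):
    """Drop every whitespace-separated token that appears in stop_set."""
    return " ".join(w for w in text.split() if w not in stop_set)

review = "this is not a good drug"

# With the full set, the negation disappears and the meaning flips.
without_negation = strip_stop_words(review, toy_stop_words)

# discard() removes an element if present and never raises KeyError.
kept = set(toy_stop_words)
for negation in ("no", "not", "nor"):
    kept.discard(negation)
with_negation = strip_stop_words(review, kept)

print(without_negation)  # good drug
print(with_negation)     # not good drug
```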

### Contractions Mapping 

In [0]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}


In [0]:
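def filter_review(reviews):
  filtered = []
  for x in reviews:
    # Replace every run of non-alphanumeric characters with a single space
    x = re.sub('[^A-Za-z0-9]+', ' ', x)
    x = x.lower()
    # No-op here: double quotes were already replaced by the regex above
    x = re.sub('"', '', x)
    # NOTE: apostrophes were also stripped above, so keys such as "don't" no longer match
    x = ' '.join([contraction_mapping[w] if w in contraction_mapping else w for w in x.split()])
    x = ' '.join([w for w in x.split() if w not in stop_words])
    x = re.sub('[^A-Za-z0-9]+', ' ', x)
    filtered.append(x)
  return np.array(filtered)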
  
  

In [0]:
X_train = filter_review(X_train)
X_test = filter_review(X_test)
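
In `filter_review`, the punctuation is stripped before the contraction lookup, and the raw reviews store apostrophes as the HTML entity `&#039;`, which is why stray `039` tokens appear in the filtered text. The sketch below shows one possible alternative ordering: decode the entities, expand contractions, and only then strip punctuation. The helper `clean_review` and the small dictionaries are hypothetical stand-ins for `contraction_mapping` and the NLTK stop words, and the results reported in this notebook were not produced with it.

```python
import html
import re

# Small illustrative subsets; the notebook uses contraction_mapping and NLTK stop words.
toy_contractions = {"i've": "i have", "don't": "do not", "it's": "it is"}
toy_stop_words = {"i", "have", "a", "it", "is", "do", "the"}

def clean_review(text, contractions, stop_set):
    text = html.unescape(text)            # "&#039;" becomes "'"
    text = text.replace('"', '').lower()  # drop the surrounding double quotes
    # Expand contractions while the apostrophes are still present.
    text = " ".join(contractions.get(w, w) for w in text.split())
    # Only now replace everything that is not a letter or digit with a space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(w for w in text.split() if w not in stop_set)

raw = '"I&#039;ve tried it, but I don&#039;t feel better."'
cleaned = clean_review(raw, toy_contractions, toy_stop_words)
print(cleaned)  # tried but not feel better
```

With this ordering the negation hidden inside "don't" survives as `not`, instead of being split into `don`, `039` and `t`.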

### This is what the filtered reviews look like; we are going to use them as input to train our Sentiment Analyses model

In [0]:
for i in range(10):
  print(i," >", X_train[i])

0  > no side effect take combination bystolic 5 mg fish oil
1  > son halfway fourth week intuniv became concerned began last week started taking highest dose two days could hardly get bed cranky slept nearly 8 hours drive home school vacation unusual called doctor monday morning said stick days see school getting morning last two days problem free much agreeable ever less emotional good thing less cranky remembering things overall behavior better tried many different medications far effective
2  > used take another oral contraceptive 21 pill cycle happy light periods max 5 days no side effects contained hormone gestodene not available us switched lybrel ingredients similar pills ended started lybrel immediately first day period instructions said period lasted two weeks taking second pack two weeks third pack things got even worse third period lasted two weeks 039 end third week still daily brown discharge positive side 039 side effects idea period free tempting alas
3  > first time usi

Looks clean, doesn't it? :>)

In [0]:
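# NOTE: max() compares strings alphabetically, so this is the token count of the alphabetically last review
max_len = len(max(X_train).split())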
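
Calling `max()` on strings returns the alphabetically last one, not the one with the most tokens, so `max_len` is not necessarily the length of the longest review (the model summary further down shows a sequence length of 28). A minimal sketch of the difference, using made-up reviews:

```python
toy_reviews = [
    "zoloft worked",
    "first week rough side effects faded second week",
    "no side effects",
]

# max() without a key compares the strings alphabetically.
alphabetical_last = max(toy_reviews)
len_of_alphabetical_last = len(alphabetical_last.split())

# Taking the maximum over token counts gives the true longest review.
longest_len = max(len(review.split()) for review in toy_reviews)

print(len_of_alphabetical_last)  # 2
print(longest_len)               # 8
```

Padding every review to the true maximum can be expensive, so capping the length at a chosen value is a common compromise; which value suits this dataset is left open here.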

### Subtracting 1 from the ratings maps them from 1-10 to class labels 0-9, so that they match the 10 output classes

In [0]:
Y_train_minus = Y_train - 1
Y_test_minus = Y_test - 1

### Convert to One Hot vector

In [0]:
def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y

In [0]:
Y_train_one_hot = convert_to_one_hot(Y_train_minus, 10)
Y_test_one_hot = convert_to_one_hot(Y_test_minus,10)
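
As a small worked example of the two steps above (shifting the ratings and one-hot encoding them), the sketch below runs the same `convert_to_one_hot` logic on three made-up ratings and then recovers the ratings again:

```python
import numpy as np

def convert_to_one_hot(Y, C):
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(C)[Y.reshape(-1)]

ratings = np.array([1, 10, 5])   # made-up ratings on the 1-10 scale
labels = ratings - 1             # class labels 0-9
one_hot = convert_to_one_hot(labels, 10)

print(one_hot.shape)             # (3, 10)
print(one_hot[1])                # the single 1.0 sits in the last position

# argmax finds the class label; adding 1 maps it back to the rating.
recovered = one_hot.argmax(axis=1) + 1
print(recovered)
```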

### Printing the shapes to verify the dimensions of the transformed vectors

In [0]:
Y_train.shape

(160398,)

In [0]:
Y_train_one_hot.shape

(160398, 10)

### Function which reads the GloVe file, loads the 100D vectors and returns the word2idx, idx2word and word2vec dictionaries

In [0]:
def load_glove():
  
  with open('/gdrive/My Drive/glove.6B/glove.6B.100d.txt', 'r') as f:
    words = set()
    word2vec = {}
    for line in f:
      line = line.strip().split()
      word = line[0]
      words.add(word)
      word2vec[word] = np.array(line[1:], dtype= np.float64)
      
  # Creating word2idx and idx2word dictionaries
  i = 0
  word2idx = {}
  idx2word = {}
  for word in sorted(words):
    word2idx[word] = i
    idx2word[i] = word
    i += 1

  return word2idx, idx2word, word2vec

  

In [0]:
word2idx, idx2word, word2vec = load_glove()
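
For readers unfamiliar with the GloVe text format, each line holds a token followed by its vector components, separated by spaces. The sketch below parses three made-up lines (3 dimensions instead of 100) from an in-memory buffer instead of the real file, and builds the index dictionaries in the same sorted order as `load_glove`:

```python
import io

# Three made-up lines in the GloVe text format; the values are not real embeddings.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "drug -0.5 0.0 0.5\n"
    "! 0.9 0.8 0.7\n"
)

word2vec = {}
for line in fake_glove:
    parts = line.strip().split()
    word2vec[parts[0]] = [float(v) for v in parts[1:]]

# Indices are assigned in sorted order, so "!" ends up with index 0.
word2idx = {word: i for i, word in enumerate(sorted(word2vec))}
idx2word = {i: word for word, i in word2idx.items()}

print(word2idx)  # {'!': 0, 'drug': 1, 'the': 2}
```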

In [0]:
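# Swap the indices of 'eos' and '!' (presumably so that index 0, which is also the padding value, belongs to 'eos')
word2idx['eos'], word2idx['!'] = word2idx['!'], word2idx['eos']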

In [0]:
idx2word = {v:k for k,v in word2idx.items()}

### Adding words from X_train to word2idx and idx2word, if the word is not already in word2idx

In [0]:
for sentence in X_train:
  for word in word_tokenize(sentence):
    if word not in word2idx:
      i = len(word2idx) 
      word2idx[word] = i
      idx2word[i] = word      

### Adding words from X_test to word2idx and idx2word, if the word is not already in word2idx

In [0]:
for sentence in X_test:
  for word in word_tokenize(sentence):
    if word not in word2idx:
      i = len(word2idx) 
      word2idx[word] = i
      idx2word[i] = word  
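
The two loops above follow the same pattern: every unseen token receives the next free index. A minimal sketch on a toy vocabulary, where the helper `extend_vocab` is hypothetical and `str.split` stands in for `word_tokenize` as a simplifying assumption:

```python
# Toy starting vocabulary; the notebook starts from the GloVe vocabulary.
word2idx = {"eos": 0, "drug": 1, "the": 2}
idx2word = {i: w for w, i in word2idx.items()}

def extend_vocab(sentences, word2idx, idx2word):
    """Give every unseen token the next free index, updating both dictionaries in place."""
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2idx:
                i = len(word2idx)
                word2idx[word] = i
                idx2word[i] = word

extend_vocab(["intuniv drug", "lybrel intuniv"], word2idx, idx2word)
print(word2idx)
# {'eos': 0, 'drug': 1, 'the': 2, 'intuniv': 3, 'lybrel': 4}
```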

### Creating Embedding Matrix

In [0]:
vocab_size = len(word2idx)
embedding_dim = 100

In [0]:
embedding_matrix = np.zeros((vocab_size, embedding_dim))
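for word, i in word2idx.items():
  # Use the index stored in word2idx so that each row lines up with its word
  try:
    embedding_matrix[i] = word2vec[word]
  except KeyError:
    # Words without a GloVe vector get a random vector
    embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))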
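
The Embedding layer looks rows up by word index, so row `i` of the matrix has to hold the vector of the word whose `word2idx` value is `i`. A minimal sketch of that invariant on a toy vocabulary with made-up 3-dimensional vectors (the seeded generator `rng` is an assumption added for reproducibility):

```python
import numpy as np

embedding_dim = 3
word2idx = {"eos": 0, "drug": 1, "intuniv": 2}
# Made-up vectors; "intuniv" is deliberately missing to mimic an out-of-vocabulary word.
word2vec = {"eos": np.zeros(3), "drug": np.array([-0.5, 0.0, 0.5])}

rng = np.random.default_rng(0)
embedding_matrix = np.zeros((len(word2idx), embedding_dim))
for word, i in word2idx.items():
    if word in word2vec:
        embedding_matrix[i] = word2vec[word]
    else:
        # Out-of-vocabulary words get a random vector.
        embedding_matrix[i] = rng.normal(scale=0.6, size=embedding_dim)

print(embedding_matrix[word2idx["drug"]])
```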

### Prepare Training Set

1.   Convert each sentence to a sequence of word indices
2.   Convert the entire set of sentences to sequences of indices
3.   Pad the sequences with 0 up to max_len



In [0]:
def indices_from_sentence(sentence):
  return [word2idx[word] for word in word_tokenize(sentence)]

def sequence_set(sentences):
  # Return a plain list: the sequences have different lengths, and recent NumPy
  # versions refuse to build a regular array from ragged input
  return [indices_from_sentence(sentence) for sentence in sentences]
    
def post_padding(input_set, max_len):
  return pad_sequences(input_set,  maxlen=max_len, padding='post')
 

In [0]:
X_tr = sequence_set(X_train)
X_tr = post_padding(X_tr, max_len)
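
To make the effect of `post_padding` concrete, here is a pure-Python stand-in for `pad_sequences(..., maxlen=max_len, padding='post')`. The helper `pad_post` is hypothetical and assumes `max_len > 0`; it mirrors the Keras default `truncating='pre'`, which keeps the last `max_len` tokens of a sequence that is too long.

```python
def pad_post(sequences, max_len, value=0):
    """Pad short sequences at the end and cut long ones at the beginning."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-max_len:]  # keep the tail, like truncating='pre'
        padded.append(seq + [value] * (max_len - len(seq)))
    return padded

result = pad_post([[5, 8], [1, 2, 3, 4, 5, 6]], max_len=4)
print(result)  # [[5, 8, 0, 0], [3, 4, 5, 6]]
```

Any review longer than `max_len` therefore loses its opening tokens before it reaches the model.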

### Now, Building Test Set

**-- Prepare the Test Set (the same way as the Train Set)**

In [0]:
# X_test was already filtered above; running the filter again leaves it unchanged
X_test = filter_review(X_test)

In [0]:
X_test_set = sequence_set(X_test)
X_test_set = post_padding(X_test_set, max_len)

## Defining Model 

In [0]:
def sentiment_analyses_model(vocab_size, embedding_dim):
  
    model = Sequential()
    
    model.add(Embedding(vocab_size, embedding_dim, weights = [embedding_matrix], input_length=max_len, trainable=False))
    
    model.add(LSTM(1024, return_sequences = True))
    
    model.add(LSTM(1024, return_sequences = True))
    
    model.add(LSTM(1024))
    
    model.add(Dense(10, activation='softmax'))
    
    return model

In [0]:
model = sentiment_analyses_model(vocab_size, embedding_dim)








In [0]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 28, 100)           42134200  
_________________________________________________________________
lstm_1 (LSTM)                (None, 28, 1024)          4608000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 28, 1024)          8392704   
_________________________________________________________________
lstm_3 (LSTM)                (None, 1024)              8392704   
_________________________________________________________________
dense_1 (Dense)              (None, 10)                10250     
Total params: 63,537,858
Trainable params: 21,403,658
Non-trainable params: 42,134,200
_________________________________________________________________
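
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formula for an LSTM layer (four gates, each with an input kernel, a recurrent kernel and a bias) and for a Dense layer; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units

embedding = 42134200               # from the summary: vocab_size * 100
vocab_size = embedding // 100      # 421342 words in the vocabulary

lstm_1 = lstm_params(100, 1024)    # 4608000
lstm_2 = lstm_params(1024, 1024)   # 8392704
lstm_3 = lstm_params(1024, 1024)   # 8392704
dense = dense_params(1024, 10)     # 10250

trainable = lstm_1 + lstm_2 + lstm_3 + dense
total = trainable + embedding
print(trainable, total)            # 21403658 63537858
```

The frozen embedding accounts for all of the non-trainable parameters, which is why roughly two thirds of the model's weights are never updated during training.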


In [0]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])




In [0]:
model.fit(X_tr, Y_train_one_hot, epochs = 10, batch_size = 32, shuffle=True)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f7af4930438>

## The model reached 90% training accuracy. Considering the number of model architectures I have tried, this level of accuracy is a reasonable outcome.

## *Accuracy on Test Set*

In [0]:
loss, acc = model.evaluate(X_test_set, Y_test_one_hot)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.674702175016327


### For now, I can be satisfied with the Test Set Accuracy of 67.5%, which is close to 70%,

### considering that this accuracy is measured on data the model has never seen





## Saving the Model
## This saves the architecture of the model along with the trained weights

In [0]:
from keras.models import load_model
model.save('/gdrive/My Drive/sentimenet_analyses_model.h5')

## Load the Trained Model 

In [0]:
drug_rating_model = load_model('/gdrive/My Drive/sentimenet_analyses_model.h5')

## Preparing for Manual Data Analysis

## Manual data analysis can hint at the types of reviews on which the model does not perform well.
## It can show in which direction I should spend my effort to further improve the accuracy.

In [0]:
for i in range(10):
  # Positional indexing: rows were dropped earlier, so index labels may not match positions
  print("Review >", test_data['review'].iloc[i])
  print("Original Rating >", test_data['rating'].iloc[i])

  # Keras predicts on batches, so wrap the single example in a batch of size 1
  test_example = np.array([X_test_set[i]])

  # argmax over the softmax output gives the class label (0-9);
  # predict_classes is not available in newer Keras releases
  y_pred = np.argmax(model.predict(test_example), axis=-1)

  # Add 1 back to turn the class label into a rating (1 was subtracted before training)
  result = 1 + y_pred[0]

  print("Predicted Rating >", result)
  print("_____________________________________________________________________________________________")
  print("\n")
  
  

Review > "I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."
Original Rating > 10
Predicted Rating > 10
_____________________________________________________________________________________________


Review > "My son has Crohn&#039;s disease and has done very well on the Asacol.  He has no complaints and shows no side effects.  He has taken as many as nine tablets per day at one time.  I&#039;ve been very happy with the results, reducing his bouts of diarrhea drastically."
Original Rating > 8
Predicted Rating > 8
_____________________________________________________________________________________________
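
One way to make this manual inspection more systematic is to break the accuracy down by true rating, which shows where the model struggles. The sketch below is an illustration only: the helper `accuracy_by_rating` is hypothetical and the ratings are made up, not predictions from the trained model.

```python
from collections import defaultdict

def accuracy_by_rating(y_true, y_pred):
    """Return {rating: fraction of reviews with that true rating predicted exactly}."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        counts[truth] += 1
        hits[truth] += int(truth == guess)
    return {rating: hits[rating] / counts[rating] for rating in sorted(counts)}

# Made-up ratings purely for illustration, not results from the model.
y_true = [10, 10, 8, 1, 1, 1, 5, 5]
y_pred = [10, 9, 8, 1, 1, 3, 7, 5]

per_rating = accuracy_by_rating(y_true, y_pred)
print(per_rating)
# {1: 0.6666666666666666, 5: 0.5, 8: 1.0, 10: 0.5}
```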

### Hmmm, to be honest, after looking at the above results, I am pretty satisfied with the quality of the ratings I am getting from the model