# Multi-Label Classification of Question 4

## List of contents

- [Importing data](#import)
- [NLP treatment of Q4 responses (x_train, x_test)](#nlp)
- [Data Augmentation of Q4 responses (x_train, x_test)](#augment)
- [Vectorisation of Q4 responses (x_train, x_test)](#vectorise)
- [Multi One-hot Encoding of categories (y_train, y_test)](#onehot)
- [Neural network model (Keras LSTM-Dense Model)](#model)
    - Model structure
    - Model train
    - Model fit
    - Model evaluate
    - Model predict
- [Accuracy analysis](#accuracyanalysis)
    - [Precision | Recall | F1-measure for different thresholds](#f1)
- [Automatic Categorisation of Unseen Data](#unseen)
- [Function for precision, recall and f1score](#accuracygraph)
- [Function for threshold and outputting categories](#categoryoutput)
- [Testing with different parameters, thresholds and optimising](#optimise)
    - [Run 1 - Gaussian Noise](#run1) 
    - [Run 2 - Batch Normalisation](#run2)
    - [Run 3 - Kernel Initialiser](#run3)
    - [Run 4 - Noise + Normalisation + Kernel Initialiser](#run4)
    - [Run 5 - Normalisation + Kernel Initialiser](#run5)
    - [Run 6 - Random search/Grid Search](#run6)
    - [Run 7 - Changing the no of LSTM cells](#run7)
    - [Run 8 - Embedding dimension](#run8)
- [Final Model - Q4_categorise function](#finalmodel)
- [Final Run on 2019 with category outputs](#final)
- [Improvement - Data Augmentation](#augmentation)

In [1]:
## importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# nltk library 
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# tensorflow
import tensorflow as tf

# keras library 
   # preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
   # model
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, Dropout, SpatialDropout1D
from keras.callbacks import EarlyStopping
from keras.utils import plot_model
   # accuracy and optimisation
from keras.layers import GaussianNoise
from keras.layers import BatchNormalization
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

#sklearn
from sklearn.preprocessing import MultiLabelBinarizer # multi onehot encoding 
from sklearn.model_selection import train_test_split # train-test split
from sklearn.metrics import precision_score, recall_score, f1_score

import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roddy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Using TensorFlow backend.


### Importing data <a name="import"></a>

In [2]:
# Function to import datafiles

def import_df(df):
    '''This function imports csv files and is designed
    to import files with unequal dataframes as well
    input: df, csv file
    output: datafile as pandas dataframe'''

    # Delimiter
    data_file_delimiter = ','

    # The max column count a line in the file could have
    largest_column_count = 0

    # Loop the data lines
    with open(df, 'r') as temp_f:
        # Read the lines
        lines = temp_f.readlines()

        for l in lines:
            # Count the column count for the current line
            column_count = len(l.split(data_file_delimiter)) + 1

            # Set the new most column count
            largest_column_count = column_count if largest_column_count < column_count else largest_column_count

    # Close file
    temp_f.close()

    # Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
    column_names = [i for i in range(0, largest_column_count)]

    # Read csv
    datafile = pd.read_csv(df, header=None, delimiter=data_file_delimiter, names=column_names)
    
    return datafile

In [3]:
q4_1617_eng = import_df('Q4_1617_Eng.csv')
q4_1617_noteng = import_df('Q4_1617_NonEng.csv')

q4_1718_eng = import_df('Q4_1718_Eng.csv')
q4_1718_noteng = import_df('Q4_1718_NonEng.csv')

q4_1819_eng = import_df('Q4_1819_Eng.csv')
q4_1819_noteng = import_df('Q4_1819_NonEng.csv')

In [4]:
# Skip row 0 of each imported file

### Year 16-17 Dataset ###
dfq4_1617_eng = q4_1617_eng[1:]        # Eng
dfq4_1617_noteng = q4_1617_noteng[1:]  # Not Eng

### Year 17-18 Dataset ###
dfq4_1718_eng = q4_1718_eng[1:]        # Eng
dfq4_1718_noteng = q4_1718_noteng[1:]  # Not Eng

### Year 18-19 Dataset ###
dfq4_1819_eng = q4_1819_eng[1:]        # Eng
dfq4_1819_noteng = q4_1819_noteng[1:]  # Not Eng

dfq4_all = pd.concat([dfq4_1617_eng,dfq4_1617_noteng,dfq4_1718_eng,dfq4_1718_noteng,dfq4_1819_eng,dfq4_1819_noteng])
dfq4_all = dfq4_all.rename(columns={1:"Response",3:"category1",5:"category2",7:"category3",9:"category4",11:"category5",13:"category6",15:"category7"})

# check null columns and remove null and unwanted columns
# print(dfq4_all.category4.isnull().all())
dfq4_all = dfq4_all.drop([0,10,12,14,16,'category5','category6','category7'],axis=1)

dfq4_all = dfq4_all.rename(columns={2:'just1',4:'just2',6:'just3',8:'just4'})
print('The shape of the dataframe with blank responses is',dfq4_all.shape)

# drop blank responses from columns
dfq4_all = dfq4_all[dfq4_all.category1 != 'BL']
dfq4_all = dfq4_all[dfq4_all.category1 != 'bl']
dfq4_all = dfq4_all[dfq4_all.category1 != 'b;']

dfq4_all = dfq4_all.fillna('NaN')

print('The shape of the dataframe for Q4 is',dfq4_all.shape)

dfq4_all

The shape of the dataframe with blank responses is (453, 9)
The shape of the dataframe for Q4 is (371, 9)


| | Response | just1 | category1 | just2 | category2 | just3 | category3 | just4 | category4 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | to gain practical skills that may be used in t... | to gain practical skills that may be used in t... | US | gain data for analysis | TD | | | | |
| 2 | to learn how to conduct physics experiments | to learn how to conduct physics experiments | UX | | | | | | |
| 3 | To prove theoretical work and to just determin... | To prove theoretical work | TT | determine phenomena in particular not yet expl... | TR | | | | |
| 4 | develop us into autonomous professionals | develop us into autonomous professionals | UI | | | | | | |
| 5 | to learn how physicists think | to learn how physicists think | UP | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65 | to prove something | to prove something | TT | | | | | | |
| 66 | to collect data and investigate a hypothesis | to collect data and | TD | investigate a hypothesis | TT | | | | |
| 67 | to illustrate/prove theories. To learn practic... | to illustrate/prove theories. | TT | To learn practical skills | US | | | | |
| 68 | Prove that a hypothesis is correct or wrong ac... | Prove that a hypothesis is correct or wrong ac... | TT | obtained results | TD | with a defendable method | UX | | |


## NLP treatment of responses - creating x_train and x_test <a name="nlp"></a>
To test whether NLP preprocessing improves the accuracy of the ML algorithm, we will clean the data and pass both the raw and the cleaned versions to the model.

#### RO's function

In [5]:
stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def clean_text(dirty_text):
    '''This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes --> breaks down into words
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words'''
    
    # Lower-casing is done by PorterStemmer.stem below, which returns lower-case stems

    clean = []
    for row in dirty_text:
        # Tokenize
        tokens = nltk.word_tokenize(str(row))
    
        # Keep only words (removes punctuation + numbers)
        token_words = [w for w in tokens if w.isalpha()]
    
        # Stemming
        stemmed_words = [stemming.stem(w) for w in token_words]
    
        # Remove stop words
        meaningful_words = [w for w in stemmed_words if not w in stops]
    
        # Rejoin meaningful stemmed words
        joined_words = ( " ".join(meaningful_words))
        
        clean.append(joined_words)

    return clean

In [6]:
q4response_raw = dfq4_all.Response.tolist()
q4response_clean = clean_text(dfq4_all.Response.tolist())

#print(q4response_raw)
#print(q4response_clean)

## Data Augmentation <a name="augment"></a>

Due to difficulties with the code and time constraints, the data augmentation methods below have been left commented out and unused. However, the methods and the corresponding code, with their sources, have been kept as a starting point for future improvements to the model.

#### 1st augmentation suite:

https://medium.com/opla/text-augmentation-for-machine-learning-tasks-how-to-grow-your-text-dataset-for-classification-38a9a207f88d

Possible models:
1. Synonym Replacement (SR): Randomly
choose n words from the sentence that are not
stop words. Replace each of these words with
one of its synonyms chosen at random.
2. Random Insertion (RI): Find a random synonym of a random word in the sentence that is
not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
3. Random Swap (RS): Randomly choose two
words in the sentence and swap their positions.
Do this n times.
4. Random Deletion (RD): Randomly remove
each word in the sentence with probability p.
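
Of the four operations listed above, Random Swap and Random Deletion need no synonym lexicon, so they are the easiest to try first. The following is a minimal sketch using only the standard library; the function names `random_swap` and `random_deletion` and the fixed seed are assumptions made for illustration and are not taken from the linked article's code.

```python
import random

def random_swap(words, n, rng):
    """Random Swap (RS): swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Random Deletion (RD): drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty response: fall back to one randomly chosen word.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sentence = "to learn how to conduct physics experiments".split()
swapped = random_swap(sentence, n=2, rng=rng)
deleted = random_deletion(sentence, p=0.2, rng=rng)
print(" ".join(swapped))
print(" ".join(deleted))
```

Each augmented response would keep the category labels of the response it was derived from. Whether shuffled or shortened responses still deserve the same labels is a judgement call that should be checked on a sample before training on them.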

### 1: Word/sentence shuffling

In [7]:
# from nltk import word_tokenize
# import random

# q4response_raw = dfq4_all.Response.tolist()
# q4response_clean = clean_text(dfq4_all.Response.tolist())
# aug1 = q4response_clean
# aug2 = q4response_raw


# def augment(sentence,n):
#     new_sentences = []
#     words = word_tokenize(sentence)
#     for i in range(n):
#         random.shuffle(words)
#         new_sentences.append(' '.join(words))
#     new_sentences = list(set(new_sentences))
#     return new_sentences

# c = augment(str(aug1),10)
# print(c)
# d = augment(str(aug2),10)
# print(d)

# # concatenate augmented data to original
# for i in c : 
#     aug1.append(i) 
    
# for i in d : 
#     aug2.append(i) 
  
# # printing concatenated list
# print ("Concatenated clean list: " + str(aug1)) 

# print ("Concatenated raw list: " + str(aug2)) 

### 2: Synonym replacement

In [8]:
# from nltk import word_tokenize
# from nltk.corpus import stopwords

# stoplist = stopwords.words('english')


# def get_synonyms_lexicon(path):
#     synonyms_lexicon = {}
#     # change below, alternatively errors='ignore' instead of encoding
#     text_entries = [l.strip() for l in open(path, encoding = "latin-1").readlines()]
#     for e in text_entries:
#         e = e.split(' ')
#         k = e[0]
#         v = e[1:len(e)]
#         synonyms_lexicon[k] = v
#     return synonyms_lexicon


# def synonym_replacement(sentence, synonyms_lexicon):
#     keys = synonyms_lexicon.keys()
#     words = word_tokenize(sentence)
#     n_sentence = sentence
#     for w in words:
#         if w not in stoplist:
#             if w in keys:
#                 n_sentence = n_sentence.replace(w, synonyms_lexicon[w][0])  # we replace with the first synonym
#     return n_sentence

# #http://paraphrase.org/#/download

# if __name__ == '__main__':
#     text = 'Many customers initiated a return process of the product as it was not suitable for use.' \
#            'It was conditioned in very thin box which caused scratches on the main screen.' \
#            'The involved firms positively answered their clients who were fully refunded.'
#     sentences = text.split('.')
#     sentences.remove('')
#     print(sentences)
#     synonyms_lexicon = get_synonyms_lexicon('./ppdb-2.0--all.gz')
#     for sentence in sentences:
#         new_sentence = synonym_replacement(sentence, synonyms_lexicon)
#         print('%s' % sentence)
#         print('%s' % new_sentence)
#         print('\n')

#### 2nd augmentation suite:

https://github.com/makcedward/nlpaug#augmenter

Uses the third-party nlpaug library.

Example of Textual Augmenter Usage:
- Character Augmenter
    - OCR
    - Keyboard
    - Random
- Word Augmenter
    - Spelling
    - Word Embeddings
    - TF-IDF
    - Contextual Word Embeddings
    - Synonym
    - Antonym
    - Random Word
    - Split
- Sentence Augmenter
    - Contextual Word Embeddings for Sentence

In [9]:
# import os
# os.environ["MODEL_DIR"] = '../model'

## Config

In [10]:
# import nlpaug
# import nlpaug.augmenter.char as nac
# import nlpaug.augmenter.word as naw
# import nlpaug.augmenter.sentence as nas
# import nlpaug.flow as nafc

# from nlpaug.util import Action
# nltk.download('averaged_perceptron_tagger')

In [11]:
# text = 'To improve my practical skills and allow me to conduct experiments independently.'
# print(text)

## Character Augmenter

Augmenting data at the character level. Possible scenarios include image-to-text and chatbots. To recognise text in an image we need an optical character recognition (OCR) model, but OCR introduces errors such as confusing "o" and "0". OcrAug simulates these errors to perform the data augmentation. For chatbots, typos still occur even though most applications come with word correction. KeyboardAug is therefore introduced to simulate this kind of error.

### OCR Augmenter

Substitute character by pre-defined OCR error

In [12]:
# aug = nac.OcrAug()
# augmented_texts = aug.augment(text, n=3)
# print("Original:")
# print(text)
# print("Augmented Texts:")
# print(augmented_texts)

### Keyboard Augmenter

Substitute character by keyboard distance

In [13]:
# aug = nac.KeyboardAug()
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### Random Augmenter

Insert character randomly

In [14]:
# aug = nac.RandomCharAug(action="insert")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Substitute character randomly

In [15]:
# aug = nac.RandomCharAug(action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Swap character randomly

In [16]:
# aug = nac.RandomCharAug(action="swap")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Delete character randomly

In [17]:
# aug = nac.RandomCharAug(action="delete")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

## Word Augmenter

Besides character augmentation, the word level is important as well. We make use of word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fasttext (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. Word2vecAug, GloVeAug and FasttextAug use word embeddings to find the most similar group of words to replace the original word. BertAug, on the other hand, uses a language model to predict a possible target word. WordNetAug uses a statistical approach to find a group of similar words.

### Spelling Augmenter

Substitute word by spelling mistake words dictionary

In [18]:
# aug = naw.SpellingAug(os.environ["MODEL_DIR"] + 'spelling_en.txt')
# augmented_texts = aug.augment(text, n=3)
# print("Original:")
# print(text)
# print("Augmented Texts:")
# print(augmented_texts)

### Word Embeddings Augmenter

Insert word randomly by word embeddings similarity

In [19]:
# # model_type: word2vec, glove or fasttext
# aug = naw.WordEmbsAug(
#     model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',
#     action="insert")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Substitute word by word2vec similarity

In [20]:
# # model_type: word2vec, glove or fasttext
# aug = naw.WordEmbsAug(
#     model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',
#     action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### TF-IDF Augmenter

Insert word by TF-IDF similarity

In [21]:
# aug = naw.TfIdfAug(
#     model_path=os.environ.get("MODEL_DIR"),
#     action="insert")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Substitute word by TF-IDF similarity

In [22]:
# aug = naw.TfIdfAug(
#     model_path=os.environ.get("MODEL_DIR"),
#     action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### Contextual Word Embeddings Augmenter

Insert word by contextual word embeddings (BERT, DistilBERT, RoBERTA or XLNet)

In [23]:
# aug = naw.ContextualWordEmbsAug(
#     model_path='bert-base-uncased', action="insert")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Substitute word by contextual word embeddings (BERT, DistilBERT, RoBERTA or XLNet)

In [24]:
# aug = naw.ContextualWordEmbsAug(
#     model_path='bert-base-uncased', action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

In [25]:
# aug = naw.ContextualWordEmbsAug(
#     model_path='distilbert-base-uncased', action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

In [26]:
# aug = naw.ContextualWordEmbsAug(
#     model_path='roberta-base', action="substitute")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### Synonym Augmenter

Substitute word by WordNet's synonym

In [27]:
# aug = naw.SynonymAug(aug_src='wordnet')
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Substitute word by PPDB's synonym

In [28]:
# aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### Antonym Augmenter

Substitute word by antonym

In [29]:
# aug = naw.AntonymAug()
# _text = 'Good boy'
# augmented_text = aug.augment(_text)
# print("Original:")
# print(_text)
# print("Augmented Text:")
# print(augmented_text)

### Random Word Augmenter

Swap word randomly

In [30]:
# aug = naw.RandomWordAug(action="swap")
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

Delete word randomly

In [31]:
# aug = naw.RandomWordAug()
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

### Split Augmenter

Split word to two tokens randomly

In [32]:
# aug = naw.SplitAug()
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

## Sentence Augmentation

### Contextual Word Embeddings for Sentence Augmenter

Insert sentence by contextual word embeddings (GPT2 or XLNet)

In [33]:
# # model_path: xlnet-base-cased or gpt2
# aug = nas.ContextualWordEmbsForSentenceAug(model_path='xlnet-base-cased')
# augmented_texts = aug.augment(text, n=3)
# print("Original:")
# print(text)
# print("Augmented Texts:")
# print(augmented_texts)

In [34]:
# aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

In [36]:
# aug = nas.ContextualWordEmbsForSentenceAug(model_path='distilgpt2')
# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

## Vectorisation of Q4 responses (x_train, x_test) <a name="vectorise"></a>

In [37]:
def vectorise(response):
    '''Converts a list of text responses, raw or cleaned, into a padded integer matrix.
    Input: Raw or clean data set (no stop words, word stems, meaningful words only)

    1) Assigns an integer value to each word, based on its frequency. Lower frequency = higher integer.
    2) Finds the number of words in the longest response, then pads the shorter responses with leading 0s
       so that all responses have the same length.

    Output: A matrix with all responses as number elements; each row represents a response. '''

    # index 1 is reserved for out-of-vocabulary words
    tokenizer = Tokenizer(oov_token='<OOV>')
    tokenizer.fit_on_texts(response)

    texts_numeric = tokenizer.texts_to_sequences(response)

    # finding the longest response and its position
    len_texts_numeric = list(map(len, texts_numeric))
    response_max = max(len_texts_numeric)                  # max number of words per response
    max_position = len_texts_numeric.index(response_max)   # element number of the longest response
    print('The length of the longest response is:', response_max, ', Its position is:', max_position)

    # pad every response to response_max words
    texts_pad = pad_sequences(texts_numeric, maxlen=response_max)

    return texts_pad

In [38]:
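# The augmentation cells above are commented out, so aug1 and aug2 are not created there.
# Fall back to the un-augmented responses.
aug1 = q4response_clean
aug2 = q4response_raw

xQ4vec_raw = vectorise(aug2)
xQ4vec_clean = vectorise(aug1)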

### Finding maximum number of words

In [None]:
# Note: these values are the total number of characters over all responses.
# They are used as a generous upper bound on the number of distinct words.

# raw data
max_words_raw = len(''.join(aug2))

# clean data
max_words_clean = len(''.join(aug1))

print(max_words_raw)
print(max_words_clean)

## One-Hot Encode With Multiple Labels - creating y_train and y_test <a name="onehot"></a>

In [None]:
onehot = MultiLabelBinarizer()
cat1 = np.array(pd.Series(dfq4_all.category1)).tolist()
cat2 = np.array(pd.Series(dfq4_all.category2)).tolist()
cat3 = np.array(pd.Series(dfq4_all.category3)).tolist()
cat4 = np.array(pd.Series(dfq4_all.category4)).tolist()

yQ4 = np.column_stack((cat1,cat2,cat3,cat4))
yQ4vec = onehot.fit_transform(yQ4)

print('Each column represents the classes:',onehot.classes_)
print('The shape of the y vectors (corresponding to categories) is',yQ4vec.shape)

**Comment** = A value of 1 in a row for a particular column means that the vectorised response belongs to that category. The categories are not mutually exclusive; one y vector may have multiple 1s.

**Comment** = NaN is a bit of an issue. As the responses in the training data do not have equal number of corresponding categories, NaN values are in place at the cells without an entry. Leaving this as an extra column in the one-hot encoded vectors is fine.
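
To make the encoding concrete, here is a plain-Python sketch of multi-hot encoding on three made-up rows of category codes. It mirrors the kind of output `MultiLabelBinarizer` produces (one column per distinct label, in sorted order), but the helper `multi_hot` is hypothetical. Unlike the notebook, it drops the `'NaN'` placeholder during encoding instead of slicing the column off afterwards, which is a design choice of this sketch.

```python
def multi_hot(label_rows, ignore=("NaN",)):
    """Return (classes, vectors) with one 0/1 column per distinct label."""
    # Distinct labels in sorted order, skipping the placeholder for empty cells.
    classes = sorted({lab for row in label_rows for lab in row if lab not in ignore})
    index = {lab: i for i, lab in enumerate(classes)}
    vectors = []
    for row in label_rows:
        vec = [0] * len(classes)
        for lab in row:
            if lab in index:
                vec[index[lab]] = 1
        vectors.append(vec)
    return classes, vectors

# Made-up rows in the layout category1..category4
rows = [["US", "TD", "NaN", "NaN"],
        ["UX", "NaN", "NaN", "NaN"],
        ["TT", "TD", "UX", "NaN"]]

classes, vectors = multi_hot(rows)
print(classes)   # ['TD', 'TT', 'US', 'UX']
for vec in vectors:
    print(vec)   # [1, 0, 1, 0] then [0, 0, 0, 1] then [1, 1, 0, 1]
```

Dropping the placeholder by name avoids relying on the `'NaN'` column being in a particular position of the sorted class list.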

## Neural Network Model <a name="model"></a>

In [None]:
def NNmodel(x,y,x_words,embed_dim,epochs,batch_size):
    '''Neural Network model for training 
       using train and test data
       Inputs:  x, vectorised x data (responses)
                y, vectorised y data (categories)
                x_words, non-vectorised x as a list of str
                embed_dim, dimension of embedding in embedding layer
                epochs, number of epochs in model.fit
                batch_size, batch size in model.fit
       Outputs: model, the trained model
                accr, loss and accuracy of the model on the test data
                xtest, the test data
                ytest, the categories of the test data'''
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = y.shape[1]
   
    ## Upper bound on the vocabulary size ##
    # total character count of all responses, which always exceeds the number of distinct words
    max_nb_words = len(''.join(x_words)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2))    # 150 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
# onehot vectors with NaN category
yQ4vec

In [None]:
# NaN category removed
yQ4vec[:,1:]

### Initial Run

In [None]:
# Parameters
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = aug1
embed_dim = 1000
epochs = 10
batch_size = 50

model, accr, xtest, ytest = NNmodel(x,y,x_words,embed_dim,epochs,batch_size)

#### Plot of model structure

The following function is imported:

**from keras.utils import plot_model**

It requires both *pydot* and *GraphViz* to run. These need to be present on the computer. (If code does not work properly, it can be commented out.)

In [None]:
plot_model(model,show_shapes = True, show_layer_names = True, to_file='model.png')

In [None]:
## Loss and Accuracy
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))

**Comment:** Accuracy increases and loss decreases with increasing number of epochs. 30 epochs gives quite a high accuracy in a reasonable amount of time. The accuracy hits 1.0 at around 50 epochs but takes a little time to get there. 

In [None]:
## Predictions using model
# running on all the x data(both test and train data)
print(model.predict(x))

## Accuracy analysis <a name="accuracyanalysis"></a>

If the output is sparse multi-label, meaning that only a few labels are positive and most are negative, the Keras accuracy metric is inflated by the correctly predicted negative labels. For example, if the prediction is [0, 0, 0, 0, 0, 1] and the actual labels are [0, 0, 0, 0, 0, 0], the accuracy is reported as 5/6 rather than 0, as it should be [1].

To get around this, a _custom accuracy_ can be used instead of _accuracy_. The absolute accuracy is given by:

$$\text{Accuracy} = \frac{\text{No. data instances that are correctly classified}}{\text{total number of data instances}}$$
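
The gap between the two ways of counting can be checked on the six-label example above. This is a minimal sketch in plain Python; `label_accuracy` and `exact_match_accuracy` are illustrative names, not Keras functions.

```python
def label_accuracy(y_true, y_pred):
    """Fraction of individual labels predicted correctly (the inflated score)."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in pairs) / len(pairs)

def exact_match_accuracy(y_true, y_pred):
    """Fraction of responses whose whole label vector is predicted correctly."""
    hits = sum(list(tr) == list(pr) for tr, pr in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [[0, 0, 0, 0, 0, 0]]
y_pred = [[0, 0, 0, 0, 0, 1]]

per_label = label_accuracy(y_true, y_pred)
per_instance = exact_match_accuracy(y_true, y_pred)
print(per_label)     # 5/6, about 0.833
print(per_instance)  # 0.0
```

Exact-match accuracy is strict: a response with three of its four labels right counts the same as one with none right, which is why the per-label scores further down are also worth reporting.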

For example, if y_true is [1, 2, 3, 4] and y_pred is [0, 2, 3, 4] then the accuracy is 3/4 or .75. If the weights were specified as [1, 1, 0, 0] then the accuracy would be 1/2 or .5.

This metric creates two local variables, total and count that are used to compute the frequency with which y_pred matches y_true. This frequency is ultimately returned as binary accuracy: an idempotent operation that simply divides total by count. [2]

**Categorical accuracy**: This metric uses the K.argmax method to compare the index of the maximal true value with the index of the maximal predicted value. In other words, "how often predictions have maximum in the same spot as true values". For example, if y_true is [[0, 0, 1], [0, 1, 0]] and y_pred is [[0.1, 0.9, 0.8], [0.05, 0.95, 0]] then the categorical accuracy is 1/2 or .5. It is intended for one-hot vectors. [5]

`K.mean(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)))`

Here `axis=-1` is the last axis, i.e. the axis running over the elements of each label vector.

`K.argmax` finds the position of the highest-valued element of the prediction, which is then compared with the position of the highest-valued element in the actual set of labels.

This means "how often predictions have maximum in the same spot as true values" [7]
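
The worked example for categorical accuracy can be reproduced without Keras. The sketch below uses plain lists and a hypothetical `argmax` helper. It also shows why the metric is a poor fit for multi-hot labels: only the position of the largest value is compared, so missed second labels go unnoticed.

```python
def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where prediction and truth peak at the same index."""
    hits = [argmax(t) == argmax(p) for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)

# One-hot example from the text
y_true = [[0, 0, 1], [0, 1, 0]]
y_pred = [[0.1, 0.9, 0.8], [0.05, 0.95, 0]]
one_hot_score = categorical_accuracy(y_true, y_pred)
print(one_hot_score)  # 0.5

# Multi-hot case: the third label is missed, yet the row counts as correct
multi_hot_score = categorical_accuracy([[1, 0, 1]], [[0.9, 0.1, 0.2]])
print(multi_hot_score)  # 1.0
```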

**Top-k categorical accuracy**: Top-k categorical accuracy is similar to categorical accuracy, but it measures how often the target class is within the top k predictions, i.e. how often a class is among the k highest-valued indices. It is intended for one-hot vectors. [4] 

**binary accuracy**: For example, if y_true is [1, 1, 0, 0] and y_pred is [0.98, 1, 0, 0.6] then the binary accuracy is 3/4 or .75. If the weights were specified as [1, 0, 0, 1] then the binary accuracy would be 1/2 or .5.

This metric creates two local variables, total and count that are used to compute the frequency with which y_pred matches y_true. This frequency is ultimately returned as binary accuracy: an idempotent operation that simply divides total by count. [6]
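
The binary accuracy example, with and without weights, can be verified by hand. The sketch below assumes a threshold of 0.5 and treats the weights as per-element multipliers; `binary_accuracy` here is a plain-Python stand-in for illustration, not the Keras implementation.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5, weights=None):
    """Weighted fraction of elements where the thresholded prediction equals the label."""
    if weights is None:
        weights = [1] * len(y_true)
    total = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        predicted = 1 if p > threshold else 0
        total += w * (predicted == t)
    return total / sum(weights)

y_true = [1, 1, 0, 0]
y_pred = [0.98, 1, 0, 0.6]

unweighted = binary_accuracy(y_true, y_pred)
weighted = binary_accuracy(y_true, y_pred, weights=[1, 0, 0, 1])
print(unweighted)  # 0.75
print(weighted)    # 0.5
```

Only the last element is wrong (0.6 is thresholded to 1 while the label is 0). The weights [1, 0, 0, 1] keep that miss and one hit, giving 1/2.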

**loss - binary crossentropy**:
Binary crossentropy is a loss function used on problems involving yes/no (binary) decisions. For instance, in multi-label problems, where an example can belong to multiple classes at the same time, the model tries to decide for each class whether the example belongs to that class or not. The output of the network is modelled as independent Bernoulli distributions, one per label. [3]

**Custom metric:**

Similar to categorical accuracy but applied to multi hot vectors.

OR

**Hamming loss**
The fraction of wrongly predicted labels out of the total number of labels.
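
As a small illustration of Hamming loss, the sketch below counts mismatched labels over two made-up multi-hot rows. The function is written in plain Python for clarity and is an assumption of this example rather than part of the model code.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted wrongly."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    total = sum(len(tr) for tr in y_true)
    return wrong / total

y_true = [[1, 0, 1, 0],
          [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0],   # misses the third label
          [0, 1, 1, 0]]   # adds a spurious third label

loss = hamming_loss(y_true, y_pred)
print(loss)  # 0.25
```

Two of the eight labels are wrong, so the loss is 0.25. Lower is better, and 0 means every label of every response was predicted correctly.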

## Precision | Recall | F1-measure for different thresholds <a name="f1"></a>
Three accuracy assessment functions are imported from sklearn. The model is run on xtest, and its predictions are passed through these functions together with ytest to assess their accuracy. 

**precision_score**: The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. [8]

$$\frac{\text{true positive}}{\text{true positive + false positive}}$$

**The best value is 1 and the worst value is 0.**

**recall_score**: The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. [9]

$$\frac{\text{true positive}}{\text{true positive + false negative}}$$

**The best value is 1 and the worst value is 0.**

**f1_score**: The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is: [10]

$$ F1 = \frac{2 * \text{precision} * \text{recall} }{ \text{precision} + \text{recall}} $$
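
To show how the three scores move with the threshold, here is a plain-Python sketch of micro-averaging: true positives, false positives and false negatives are pooled over all labels before the ratios are taken. The probabilities are made up, and `micro_scores` is a hypothetical helper written for this illustration.

```python
def micro_scores(y_true, y_prob, threshold):
    """Micro-averaged precision, recall and F1 at a given threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            predicted = 1 if p >= threshold else 0
            if predicted == 1 and t == 1:
                tp += 1
            elif predicted == 1 and t == 0:
                fp += 1
            elif predicted == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [[1, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.4, 0.3], [0.2, 0.8, 0.6]]

strict = micro_scores(y_true, y_prob, threshold=0.5)
loose = micro_scores(y_true, y_prob, threshold=0.25)
print(strict)  # precision, recall and F1 are all 2/3
print(loose)   # precision 0.6, recall 1.0, F1 0.75
```

Lowering the threshold accepts more labels, so recall rises while precision falls. The threshold with the highest F1 is a reasonable default when neither kind of error is more costly.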


In [None]:
#[11]
predictions=model.predict([xtest])

thresholds=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for val in thresholds:
    pred=predictions.copy()
  
    pred[pred>=val]=1
    pred[pred<val]=0
  
    precision = precision_score(ytest, pred, average='micro')
    recall = recall_score(ytest, pred, average='micro')
    f1 = f1_score(ytest, pred, average='micro')
   
    print("Micro-average quality numbers for threshold {:.1f}".format(val))
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

## Testing on unseen data <a name="unseen"></a>
We will now run our model on the questionnaire data from the academic year 2019-2020 to automatically classify the responses. First, we import the Q4 data. This data does not distinguish between native and non-native English speakers, because the questionnaire question was changed to ask what qualifications students had prior to university, i.e. whether they had done A-Levels. Students' educational background relates more to their expectations than whether their first language is English, and many overseas students take A-Levels without English being their first language. 

#### Preprocessing

In [None]:
# import Q4 data 2019-2020
q4_1920 = import_df('Q4_1920.csv')
### Dataset ###
dfq4_1920 = q4_1920[1:]

dfq4_1920 = dfq4_1920.rename(columns={0:"response"})

#print
#dfq4_1920.response 

q4_1920_clean = clean_text(dfq4_1920.response.tolist())

xQ4_1920_vec = vectorise(q4_1920_clean)

#### Running the model

In [None]:
Q4_cat_vecs = model.predict(xQ4_1920_vec)

## Function for precision, recall and f1score <a name="accuracygraph"></a>
#### Date: 07/03/2020

In [None]:
def accuracy_plot(xtest,ytest):
    '''Outputs the precision, recall and f1score of the
       outcome of the test data when they are tested 
       using the model
       Inputs:  xtest, test data of responses
                ytest, test data of categories
       Output:  plot of scores against threshold'''
    
    predictions=model.predict([xtest])

    thresholds=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
  
    p = np.zeros(len(thresholds))
    r = np.zeros_like(p)
    f = np.zeros_like(p)
    
    for i,val in enumerate(thresholds):
        pred=predictions.copy()
  
        pred[pred>=val]=1
        pred[pred<val]=0
  
        precision = precision_score(ytest, pred, average='micro')
        p[i] = precision
        recall = recall_score(ytest, pred, average='micro')
        r[i] = recall
        f1 = f1_score(ytest, pred, average='micro')
        f[i] = f1
    
    plt.figure()
    plt.plot(thresholds,p,label='Precision')
    plt.plot(thresholds,r,label='Recall')
    plt.plot(thresholds,f,label='F1')
    plt.xlabel('Threshold for accepting column')
    plt.ylabel('Value')
    plt.legend(loc='best') 

In [None]:
accuracy_plot(xtest,ytest)

## Function for threshold and outputting categories <a name="categoryoutput"></a>
#### Date: 07/03/2020

In [None]:
def cat_output(yvecs, threshold):
    '''Takes the resulting multi one-hot encoded vectors and 
       returns corresponding categories for a specified threshold
       Inputs: yvecs, category vector predictions of the model
               threshold, threshold for accepting columns
       Output: cat, array of category labels with the same shape as yvecs
               (empty string where the prediction is below the threshold)
    '''
    onehotvecs = onehot.classes_[1:]
    
    cat = np.zeros((yvecs.shape),dtype='<U2')
    for index,val in np.ndenumerate(yvecs):
        if val >= threshold:
            category = onehotvecs[index[1:]]
            cat[index] = category
    
    return cat
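
The thresholding idea behind `cat_output` can be sketched on a toy example without the trained model. This is only an illustration: `labels_above_threshold`, the three-category subset and the probabilities are hypothetical, and the sketch returns a list of labels per response rather than the fixed-shape array that `cat_output` produces.

```python
def labels_above_threshold(prob_rows, class_names, threshold):
    """For each row of probabilities, keep the class names at or above threshold."""
    return [[name for name, p in zip(class_names, row) if p >= threshold]
            for row in prob_rows]


class_names = ['TD', 'TR', 'TT']            # hypothetical subset of the categories
predictions = [[0.82, 0.10, 0.45],          # hypothetical model outputs
               [0.05, 0.30, 0.20]]

labels = labels_above_threshold(predictions, class_names, threshold=0.4)
print(labels)
# [['TD', 'TT'], []]
```

A response whose probabilities all fall below the threshold ends up with no category, which is why the choice of threshold trades recall against precision.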

## Testing with different parameters, thresholds and optimising <a name="optimise"></a>
### Effect of adding Gaussian Noise - RUN 1 <a name="run1"></a>

We can see that the accuracy on the training data set is much better than on the test set, a possible sign of overfitting.

Noise is used as a regularization method to prevent overfitting. It has an effect similar to creating more samples or resampling the domain, and is useful for small training data sets.

Noise can be added to the input data, as an input layer in the model, or in between layers.

The output of the noise layer has the same shape as its input; the only modification is the addition of noise to the values. As the noise is Gaussian with a mean of zero, the only parameter the layer requires is a standard deviation.
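
The shape-preserving, zero-mean behaviour described above can be sketched without Keras. This is only an illustration using the standard library; `add_gaussian_noise` is a hypothetical helper, not the Keras layer itself. Note that in Keras the `GaussianNoise` layer is only active during training, so predictions on test data are made without noise.

```python
import random
import statistics


def add_gaussian_noise(values, stddev, rng):
    """Return a new list with zero-mean Gaussian noise added to each value."""
    return [v + rng.gauss(0.0, stddev) for v in values]


rng = random.Random(42)          # fixed seed so the sketch is repeatable
embedding = [0.5] * 10000        # stand-in for one flattened layer input
noisy = add_gaussian_noise(embedding, stddev=0.001, rng=rng)

noise = [n - e for n, e in zip(noisy, embedding)]
print(len(noisy) == len(embedding))   # the shape is unchanged
print(statistics.mean(noise))         # close to 0
print(statistics.stdev(noise))        # close to the requested 0.001
```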

**Variables and Parameters:**
- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 1000
- epochs = 10
- batch_size = 50

In [None]:
## new model with noise
def NNmodelNoise(x,y,x_words,embed_dim,epochs,batch_size, max_words):
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.001, input_shape=(max_words,100)))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2))    # 150 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50
max_words = 27 # Gaussian noise parameter

model, accrNoise, xtestNoise, ytestNoise = NNmodelNoise(x,y,x_words,embed_dim,epochs,batch_size, max_words)

In [None]:
accRun1 = accuracy_plot(xtestNoise, ytestNoise)
plt.title('Effect of adding Gaussian noise (LSTM = 150)')
plt.savefig('run1.png')

Adding the Gaussian noise improves recall and the overall F1.

### Adding Batch Normalisation - RUN 2 <a name="run2"></a>
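**BatchNormalization**: normalises the outputs of the previous layer over each mini-batch, so that the following layer sees inputs with a stable distribution regardless of weight updates in earlier layers.

**Variables and Parameters**: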
- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 1000
- epochs = 10
- batch_size = 50
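
To make the batch normalisation definition above concrete, here is a minimal sketch of the normalisation step applied to one feature across a mini-batch. It omits the learnable scale and shift parameters and the moving averages that the Keras layer keeps for inference; `batch_normalise`, the toy activations and the value of `epsilon` are assumptions for illustration.

```python
import math


def batch_normalise(batch, epsilon=1e-3):
    """Normalise one feature across a mini-batch to ~zero mean, ~unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    # epsilon keeps the division stable when the variance is very small
    return [(v - mean) / math.sqrt(variance + epsilon) for v in batch]


activations = [2.0, 4.0, 6.0, 8.0]      # hypothetical outputs of one unit
normalised = batch_normalise(activations)

new_mean = sum(normalised) / len(normalised)
new_var = sum((v - new_mean) ** 2 for v in normalised) / len(normalised)
print(normalised)
print(new_mean, new_var)                # mean ~0, variance just under 1
```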

In [None]:
def NNmodelNorm(x,y,x_words,embed_dim,epochs,batch_size):
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal')) # 150 LSTM cells
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50

model, accrNorm, xtestNorm, ytestNorm = NNmodelNorm(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
accRun2 = accuracy_plot(xtestNorm, ytestNorm)
plt.title('Effect of adding batch normalisation (LSTM = 150)')
plt.savefig('run2.png')

### Adding Kernel Initializer - RUN 3 <a name="run3"></a>
**kernel_initializer**: sets the starting values of a layer's weights by sampling them from a statistical distribution before training begins.

**Variables and Parameters**:
- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 1000
- epochs = 10
- batch_size = 50
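
As a rough picture of what a normal kernel initializer does, the sketch below samples a weight matrix from a zero-mean normal distribution. It is standard-library code, not Keras; `normal_kernel`, the matrix size and the standard deviation of 0.05 are assumptions chosen for illustration.

```python
import random
import statistics


def normal_kernel(n_inputs, n_units, stddev, rng):
    """Sample an n_inputs x n_units weight matrix from N(0, stddev^2)."""
    return [[rng.gauss(0.0, stddev) for _ in range(n_units)]
            for _ in range(n_inputs)]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
kernel = normal_kernel(n_inputs=100, n_units=150, stddev=0.05, rng=rng)

flat = [w for row in kernel for w in row]
print(len(kernel), len(kernel[0]))     # 100 150
print(statistics.mean(flat))           # close to 0
print(statistics.stdev(flat))          # close to 0.05
```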

In [None]:
def NNmodelIni(x,y,x_words,embed_dim,epochs,batch_size):
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal')) # 150 LSTM cells
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50

model, accrIni, xtestIni, ytestIni = NNmodelIni(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
accRun3 = accuracy_plot(xtestIni, ytestIni)
plt.title('Effect of adding Kernel initializer (LSTM = 150)')
plt.savefig('run3.png')

Batch normalisation has not improved the F1 score, but the kernel initializer has.

### Noise + Normalisation + Kernel Initializer - RUN 4 <a name="run4"></a>

In [None]:
def NNmodelAll(x,y,x_words,embed_dim,epochs,batch_size):
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.001, input_shape=(27,100)))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal'))    # 150 LSTM cells
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50

model, accrAll, xtestAll, ytestAll = NNmodelAll(x,y,x_words,embed_dim,epochs,batch_size)


In [None]:
accRun4 = accuracy_plot(xtestAll, ytestAll)
plt.title('Gaussian noise & Batch Normalisation & Kernel Initializer (LSTM = 150)')
plt.savefig('run4.png')

### Normalisation + Initializer - RUN 5 <a name="run5"></a>

In [None]:
def NNmodelNI(x,y,x_words,embed_dim,epochs,batch_size):

    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal'))    # 150 LSTM cells
    model.add(Dropout(0.2))
    model.add(BatchNormalization())     # batch normalisation as in Run 2; no Gaussian noise in this run
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50

model, accrNI, xtestNI, ytestNI = NNmodelNI(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
accRun5 = accuracy_plot(xtestNI, ytestNI)
plt.title('Batch normalisation & Kernel initializer (LSTM = 150)')
plt.savefig('run5.png')

### Random Search/Grid Search to find best parameters - RUN 6 <a name="run6"></a>
This allows us to try different parameter values at random, such as those below, to see which gives the best accuracy. It uses the scikit-learn estimator interface.

There are many different ways to do this...

- **Grid search**: evaluates every possible combination of the parameter values in order to find the best one.
- **Random search**: similar to grid search, but samples points at random from the configuration space. For a fixed number of trials this covers the hyperparameter space more widely, so a good configuration is often found in fewer iterations.
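
The difference between the two strategies above can be sketched with the standard library alone, using the search space from the notebook's random-search cell. No model is trained here; the budget of 10 configurations is an assumption (it is also the default `n_iter` of scikit-learn's `RandomizedSearchCV`).

```python
import itertools
import random

# Search space quoted from the notebook's random-search cell
params = dict(neu=[50, 100, 150, 200, 500],
              std=[0.00001, 0.0001, 0.01],
              batch_size=[40, 60, 80, 100, 120],
              nl=[0, 1, 2],
              nn=[100, 150, 25, 50],
              dropout=[0.1, 0.2, 0.3, 0.4])

names = list(params)

# Grid search: every combination of every value
grid = [dict(zip(names, combo))
        for combo in itertools.product(*(params[n] for n in names))]
print(len(grid))  # 3600

# Random search: a fixed budget of distinct configurations from the same space
rng = random.Random(0)
n_iter = 10       # assumed budget
sampled = rng.sample(grid, n_iter)
for config in sampled[:3]:
    print(config)
```

With 5-fold cross-validation, the full grid would need 3600 x 5 = 18000 model fits, whereas a random search with this budget needs 10 x 5 = 50.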

We first want to create a function that describes our model. Within this we have added a _for loop_ so that the _random search_ applied later can determine the optimal number of layers and neurons per layer.

We then convert the model to a scikit-learn estimator, which runs each parameter configuration through the model. Because every configuration is trained from scratch, we must use a low number of epochs.

In [None]:
## Split data into training and test data ##
xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)


#nl: number of layers, nn: number of neurons
def create_model(nl=1, nn=25, std=0.001, dropout=0.2, recurrent_dropout=0.2, neu=150):
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    model=Sequential()
    model.add(Embedding(max_nb_words, embed_dim))
    
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(std, input_shape=(max_words,100)))
    model.add(LSTM(neu, dropout=dropout, recurrent_dropout=recurrent_dropout))

    for i in range(nl):
        model.add(Dense(nn, activation='sigmoid'))
        
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


#create a model as a sklearn estimator
kerasmodel=KerasClassifier(build_fn=create_model, epochs=3)

In [None]:
#define a series of parameters

#params=dict( batch_size=[80,100,120], nl=[0,1,2], nn=[100,50,25], std=[1,5])
#params=dict(neu=[50,100,150,200,500], std=[0.00001, 0.0001, 0.01], batch_size=[40,60,80,100,120], nl=[0,1,2], nn=[100,150,25,50], dropout=[0.1,0.2,0.3,0.4])

#random_search=RandomizedSearchCV(kerasmodel,param_distributions=params,cv=5)

#print best results
#random_search_results=random_search.fit(x,y)

#results: best model parameters
#print("Best: %f using %s"%(random_search_results.best_score_, random_search_results.best_params_))

**Random search suggests the best parameters are:**

- batch size: 60
- dropout: 0.2
- 1 dense layer
- standard deviation of Gaussian noise: 0.00001
- LSTM with 200 neurons

out of the following search space:

`params=dict(neu=[50,100,150,200,500], std=[0.00001, 0.0001, 0.01], batch_size=[40,60,80,100,120], nl=[0,1,2], nn=[100,150,25,50], dropout=[0.1,0.2,0.3,0.4])`

In [None]:
def NNmodelBest(x,y,x_words,embed_dim,epochs,batch_size,max_words):
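    '''Neural Network model for training 
       using train and test data
       Inputs:  x, vectorised x data (responses)
                y, vectorised y data (categories)
                x_words, non-vectorised x as a list of str
                embed_dim, dimension of embedding in embedding layer
                epochs, epochs in model.fit
                batch_size, batch size in model.fit
                max_words, first dimension passed to the GaussianNoise input_shape
       Outputs: model, the trained model
                accr, accuracy and loss of the model
                xtest, the test data
                ytest, the categories of the test data'''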
    
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.00001, input_shape=(max_words,100)))
    model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))    # 200 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 60

model, accrBest, xtestBest, ytestBest = NNmodelBest(x,y,x_words,embed_dim,epochs,batch_size, max_words=27)

In [None]:
accRun6 = accuracy_plot(xtestBest, ytestBest)
plt.title('Batch_size = 60 & LSTM = 200')
plt.savefig('run6.png')

## Investigating the effect of changing LSTM cells and embed_dim <a name="run7"></a>

### LSTM cells - Run 7 
The number of LSTM cells is increased from 150 to 300.

**Variables and Parameters**:
- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 1000
- epochs = 10
- batch_size = 50

In [None]:
def NNmodel_lstm(x,y,x_words,embed_dim,epochs,batch_size):

    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.001, input_shape=(27,100)))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal'))    # 300 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] # NaN category removed
x_words = q4response_clean
embed_dim = 1000
epochs = 10
batch_size = 50

model, accrLSTM, xtestLSTM, ytestLSTM = NNmodel_lstm(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
accRun7 = accuracy_plot(xtestLSTM,ytestLSTM)
plt.title('Embed_dim = 1000 & LSTM = 300')
plt.savefig('run7.png')

**Comment:** Doubling the number of LSTM cells increased the F1 score slightly and made it flatter across the thresholds. It does not slow the code significantly, so 300 LSTM cells can be used.

### Embed_dim - Run 8 <a name="run8"></a>
**Variables and Parameters**:
- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 3000
- epochs = 10
- batch_size = 50

In [None]:
def NNmodel_embeddim(x,y,x_words,embed_dim,epochs,batch_size):

    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.001, input_shape=(27,100)))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal'))    # 300 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] 
x_words = q4response_clean
embed_dim = 3000
epochs = 10
batch_size = 50

model, accrembed, xtestembed, ytestembed = NNmodel_embeddim(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
accRun8 = accuracy_plot(xtestembed,ytestembed)
plt.title('Embed_dim = 3000 & LSTM = 300')
plt.savefig('run8.png')

Increasing the embedding dimension from 1000 to 3000 increased the F1 score, which peaked at 0.3.

## OPTIMISED MODEL - BATCH SIZE REDUCED <a name="finalmodel"></a>
**Variables and Parameters:**  

- x = xQ4vec_clean
- y = yQ4vec[:,1:] 
- x_words = q4response_clean
- embed_dim = 1500
- epochs = 10
- batch_size = 10

In [None]:
def Q4_categorise(x,y,x_words,embed_dim,epochs,batch_size):
    '''Neural Network model for training 
       using train and test data
       Inputs:  x, vectorised x data (responses)
                y, vectorised y data (categories)
                x_words, non-vectorised x as a list of str
                embed_dim, dimension of embedding in embedding layer
                epochs, epochs in model.fit
                batch_size, batch size in model.fit
       Outputs: model, model summary and training run
                accr, accuracy and loss of the model
                xtest, the test data
                ytest, the categories of the test data'''
    ## Split data into training and test data ##
    xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.10,random_state=42)
    
    ## Number of distinct categories ##
    y_no = len(y[1]) 
   
    ## Finding maximum no of words ##
    a = []
    for i,x in enumerate(x_words):
        a.append(x)
    max_nb_words = len(''.join(a)) + 1      #should be 1 extra at least
    
    ## RNN Model ##
    model = Sequential(name = "RNN_model")
    model.add(Embedding(max_nb_words, embed_dim))
    #adding Gaussian Noise   [10]
    model.add(GaussianNoise(0.001, input_shape=(27,100)))
    model.add(LSTM(300, dropout=0.2, recurrent_dropout=0.2, kernel_initializer='normal'))    # 300 LSTM cells
    model.add(Dropout(0.2))
    model.add(Dense(y_no, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    model.summary() # summary showing parameters
    model.fit(xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_split=0.1) # fit data to model
    
    accr = model.evaluate(xtest,ytest) # accuracy

    return model, accr, xtest, ytest

In [None]:
x = xQ4vec_clean
y = yQ4vec[:,1:] 
x_words = q4response_clean
embed_dim = 1500
epochs = 10
batch_size = 10

model, accr, xtest, ytest = Q4_categorise(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
plot_model(model,show_shapes = True, show_layer_names = True, to_file='modelfinal.png')

In [None]:
final = accuracy_plot(xtest,ytest)
plt.title('Embed_dim = 1500 & Batch_size = 10 & LSTM = 300')
plt.savefig('final.png')

## Final Run on 2019 with category outputs<a name="final"></a>

In [None]:
Q4_cat_vecs = model.predict(xQ4_1920_vec)
#print(Q4_cat_vecs)
categorized = cat_output(Q4_cat_vecs,threshold=0.4) # HS function used

In [None]:
print(categorized[2])

In [None]:
pred = Q4_cat_vecs >= 0.4
classes = onehot.classes_[1:]   # category labels without the NaN class
class_dict = {}
for example in range(len(pred)):
    for class_ in range(len(pred[0])):
        if pred[example][class_]:
            # create the list on first use, then append the category label
            class_dict.setdefault(example, []).append(classes[class_])
print(class_dict)

Categories: 'TD' 'TR' 'TT' 'TU' 'TX' 'UD' 'UI' 'UP' 'US' 'UT' 'UU' 'UX'

In [None]:
# Count how many times each category label appears in the categorised output
categories = ['TD','TR','TT','TU','TX','UD','UI','UP','US','UT','UU','UX']
counts = {c: int(np.sum(categorized == c)) for c in categories}

# Keep the individual counters used by the plotting cell
td, tr, tt, tu, tx, ud, ui, up, us, ut, uu, ux = (counts[c] for c in categories)

In [None]:
## plot of frequencies of each category for 2019-2020
y = [td,tr,tt,tu,tx,ud,ui,up,us,ut,uu,ux]
y_arr = np.array(y)
x_bars = ('TD','TR','TT','TU','TX','UD','UI','UP','US','UT','UU','UX')
plt.bar(x_bars,y_arr,color = ['red','blue','brown','green','orange'])
plt.title('Automatic categorisation of Q4 2019-2020 Data')
plt.ylabel('Number of occurrences')
plt.savefig('q4automatic.png')

## Improvement - Data Augmentation <a name="augmentation"></a>

In [None]:
# print(Q4_categorise)

# model, accr, xtest, ytest = Q4_categorise(x,y,x_words,embed_dim,epochs,batch_size)

# m = Q4_categorise(x,y,x_words,embed_dim,epochs,batch_size)

In [None]:
# from nltk import word_tokenize
# import random

# q4response_raw = dfq4_all.Response.tolist()
# q4response_clean = clean_text(dfq4_all.Response.tolist())
# aug = unaug = q4response_clean

# def augment(sentence,n):
#     new_sentences = []
#     words = word_tokenize(sentence)
#     for i in range(n):
#         random.shuffle(words)
#         new_sentences.append(' '.join(words))
#     new_sentences = list(set(new_sentences))
#     return new_sentences

# a = augment(str(aug),10)
# print(a)

### 2: Synonym replacement

In [None]:
# from nltk import word_tokenize
# from nltk.corpus import stopwords

# stoplist = stopwords.words('english')


# def get_synonyms_lexicon(path):
#     synonyms_lexicon = {}
#     text_entries = [l.strip() for l in open(path, errors='ignore').readlines()]
#     for e in text_entries:
#         e = e.split(' ')
#         k = e[0]
#         v = e[1:len(e)]
#         synonyms_lexicon[k] = v
#     return synonyms_lexicon


# def synonym_replacement(sentence, synonyms_lexicon):
#     keys = synonyms_lexicon.keys()
#     words = word_tokenize(sentence)
#     n_sentence = sentence
#     for w in words:
#         if w not in stoplist:
#             if w in keys:
#                 n_sentence = n_sentence.replace(w, synonyms_lexicon[w][0])  # we replace with the first synonym
#     return n_sentence


# if __name__ == '__main__':
#     text = 'Many customers initiated a return process of the product as it was not suitable for use.' \
#            'It was conditioned in very thin box which caused scratches on the main screen.' \
#            'The involved firms positively answered their clients who were fully refunded.'
#     sentences = text.split('.')
#     sentences.remove('')
#     print(sentences)
#     synonyms_lexicon = get_synonyms_lexicon('./ppdb-2.0-xl-all.gz')
#     for sentence in sentences:
#         new_sentence = synonym_replacement(sentence, synonyms_lexicon)
#         print('%s' % sentence)
#         print('%s' % new_sentence)
#         print('\n')

## References
[1]: https://stackoverflow.com/questions/50686217/keras-how-is-accuracy-calculated-for-multi-label-classification

[2]: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Accuracy

[3]: https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/binary-crossentropy

[4]: https://github.com/sagr4019/ResearchProject/wiki/Keras-accuracy-(metrics)

[5]: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CategoricalAccuracy

[6]: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/BinaryAccuracy

[7]: https://datascience.stackexchange.com/questions/14415/how-does-keras-calculate-accuracy

[8]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

[9]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

[10]: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

[11]: https://medium.com/towards-artificial-intelligence/keras-for-multi-label-text-classification-86d194311d0e

[12]: https://medium.com/opla/text-augmentation-for-machine-learning-tasks-how-to-grow-your-text-dataset-for-classification-38a9a207f88d

[13]: https://github.com/makcedward/nlpaug#augmenter