<center><h1>Social Media Sentiment Analysis</h1>
    <h2>by Rebecca Hinrichs</h2>
    <h3>SUMMER 2023</h3></center>

---

<b>Purpose:</b> This section applies hyperparameter optimization to a sentiment classifier trained on Yelp reviews.

<b>Data:</b> We were provided with a folder `yelp_review_polarity_csv` containing the files `train_small.csv` and `test_small.csv`, which hold pre-split training and testing data. The first column is the class label ("1" or "2" in this example) and the second column is the review text; every entry is enclosed in double quotes ("). We load these files with the Python package `pandas` and store them in pandas DataFrames (comparable to a data frame in R). A DataFrame behaves like a dictionary of columns: each column name is a key that returns that column as a `Series`.
<br><br>
Additionally, we were provided with a folder `glove.6B` containing the pre-trained GloVe word-embedding file `glove.6B.100d.txt`, which is publicly available <a href="https://nlp.stanford.edu/projects/glove/">here</a>. The file maps a vocabulary of 400k words to 100-dimensional vectors.
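
As a rough illustration of the two file layouts described above, the sketch below parses one quoted CSV row and one GloVe-style line. It uses Python's standard `csv` module as a stand-in for `pandas.read_csv`, and the sample rows and the 4-dimensional vector are invented for the example rather than taken from the provided files.

```python
import csv
import io

# Hypothetical sample rows in the Yelp polarity layout: "label","text"
sample_csv = '"1","Food was cold, and the service was slow."\n"2","Loved it!"\n'

# The csv module honors the double quotes, so the comma inside the
# first review does not split the text into extra columns.
rows = list(csv.reader(io.StringIO(sample_csv)))
labels = [int(label) for label, _ in rows]
texts = [text for _, text in rows]

# Hypothetical GloVe-style line: a word followed by space-separated floats
# (the real glove.6B.100d.txt file has 100 floats per line).
sample_glove_line = 'the -0.038 -0.245 0.728 -0.400\n'
word, *coefs = sample_glove_line.split()
vector = [float(c) for c in coefs]

print(labels)
print(texts[0])
print(word, vector)
```

The quoting matters because review text routinely contains commas; without it, a plain comma split would break the two-column structure.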

<b>Approach:</b> We will perform hyperparameter optimization using Python's `HyperOpt` library on a dual-layer LSTM Model, optimizing at least 2 tuning parameters.

--- 

--- 

<center><h2>Data Preparation</h2></center>

--- 

#### Data Import

In [1]:
# Collect the text data from the provided directory
import pandas as pd
# Load file directory
directory = 'yelp_review_polarity_csv/'
train_data = pd.read_csv(str(directory)+'train_small.csv',
                         sep=',', names=['class','text'], on_bad_lines='skip')
test_data = pd.read_csv(str(directory)+'test_small.csv',
                        sep=',', names=['class','text'], on_bad_lines='skip')

# Report data shapes
print('\nTotal Text Samples :: ' + str(len(train_data)+len(test_data)))
print('\nDimensions of Training Data :: ' + str(train_data['class'].shape))
print('Dimensions of Testing Data :: ' + str(test_data['class'].shape) +'\n')


Total Text Samples :: 22000

Dimensions of Training Data :: (20000,)
Dimensions of Testing Data :: (2000,)



#### Single Series Data Analysis (EDA)

In [2]:
# Display a sample of a text from the training data
import numpy as np
pick_me = np.random.randint(0,len(train_data)) # pick a random line of text

# Display a printout of the sample
print('\nSample Text :: \n', train_data['text'][pick_me])
print('\nClass of Sample Text :: ', train_data['class'][pick_me], '\n')


Sample Text :: 
 Food is tasty; however, food establishment REFUSED to honor purchased coupon from Living Social.  DO NOT spend your hard-earned money at India Palace.

Class of Sample Text ::  1 



<center><h2><u>Data Pre-Processing</u></h2></center>

#### Cleanse the input space

In [3]:
# Cleanse text of any missing or corrupt data
print('\nBEFORE FILTERING ::')
print('Dimensions of Training Data :: ' + str(train_data['class'].shape))
print('Dimensions of Testing Data :: ' + str(test_data['class'].shape))

# Remove rows with missing data
train_data = train_data.dropna()
test_data = test_data.dropna()

# Remove rows with no comments
train_data = train_data[train_data.text.apply(lambda x: x !="")]
test_data = test_data[test_data.text.apply(lambda x: x !="")]
print('\nAFTER FILTERING ::')
print('Dimensions of Training Data :: ' + str(train_data['class'].shape))
print('Dimensions of Testing Data :: ' + str(test_data['class'].shape) +'\n')


BEFORE FILTERING ::
Dimensions of Training Data :: (20000,)
Dimensions of Testing Data :: (2000,)

AFTER FILTERING ::
Dimensions of Training Data :: (20000,)
Dimensions of Testing Data :: (2000,)



<u><center><h2>Data→Tensor Transformation</h2></center></u>

#### Vectorize the inputs (x-data)

In [4]:
# Map text data to integer values
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
MAX_NUM_TOKENS = train_data['class'].shape[0]  # vocabulary cap; reuses the training-row count (20k)

tokenizer = Tokenizer(num_words = MAX_NUM_TOKENS)
tokenizer.fit_on_texts(train_data['text'])
print('\nFound %s unique tokens' % len(tokenizer.word_index))
print('Number of Tokens used ::', tokenizer.num_words)
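# NOTE: Series.max() on text returns the lexicographically last review, so this reports
# that review's length; train_data['text'].str.len().max() gives the true maximum
print('Longest Text Sample ::', len(train_data['text'].max()))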

# Vectorize text values
sequences_train = tokenizer.texts_to_sequences(train_data['text'])
sequences_test = tokenizer.texts_to_sequences(test_data['text'])
print("\nImported Sample Text ::\n", train_data['text'][pick_me])
print("\nTokenized Sample Text ::\n", sequences_train[pick_me])

# Pad the token sequences to have equal length
MAX_SENTENCE_LENGTH = 50  # pad or truncate every review to 50 tokens <<-- our pick
x_train = pad_sequences(sequences_train, maxlen=MAX_SENTENCE_LENGTH)
x_test = pad_sequences(sequences_test, maxlen=MAX_SENTENCE_LENGTH)
print('\nPADDED TOKEN SEQUENCES ::')
print('Dimensions of Training Data :: ' + str(x_train.shape))
print('Dimensions of Testing Data :: ' + str(x_test.shape) + '\n')


Found 47081 unique tokens
Number of Tokens used :: 20000
Longest Text Sample :: 1317

Imported Sample Text ::
 Food is tasty; however, food establishment REFUSED to honor purchased coupon from Living Social.  DO NOT spend your hard-earned money at India Palace.

Tokenized Sample Text ::
 [30, 11, 346, 259, 30, 904, 1404, 4, 3074, 1158, 1029, 51, 1336, 2324, 80, 22, 695, 70, 324, 2829, 248, 25, 5152, 2386]

PADDED TOKEN SEQUENCES ::
Dimensions of Training Data :: (20000, 50)
Dimensions of Testing Data :: (2000, 50)
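
By default, Keras's `pad_sequences` pads on the left and also truncates from the left (`padding='pre'`, `truncating='pre'`), so a review longer than 50 tokens keeps only its last 50. The NumPy sketch below imitates that behavior on toy sequences; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of longer sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]  # 'pre' truncation drops tokens from the front
        padded[row, -len(trimmed):] = trimmed  # 'pre' padding leaves zeros on the left
    return padded

toy_sequences = [[30, 11, 346], [1, 2, 3, 4, 5, 6, 7]]
print(pad_pre(toy_sequences, maxlen=5))
```

The first toy row gains two leading zeros and the second loses its first two tokens, which mirrors the leading zeros visible in the vectorized sample review.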



#### Categorize the outputs (y-data)

In [5]:
# Shift the class labels from {1, 2} to {0, 1}
y_train = np.array(train_data['class'])
y_test = np.array(test_data['class'])
print('\nBEFORE CATEGORIZATION:')
print('Training Label Values ::', np.unique(y_train))
print('Testing Label Values ::', np.unique(y_test))
print('Training Label Shape ::', y_train.shape)
print('Testing Label Shape ::', y_test.shape)
y_train -= min(y_train)
y_test -= min(y_test)
NB_CLASSES = int(len(np.unique(y_train))) # number of classes
print('\nAFTER CATEGORIZATION:')
print('Training Label Values ::', np.unique(y_train))
print('Testing Label Values ::', np.unique(y_test))
print('Training Label Shape ::', y_train.shape)
print('Testing Label Shape ::', y_test.shape, '\n')


BEFORE CATEGORIZATION:
Training Label Values :: [1 2]
Testing Label Values :: [1 2]
Training Label Shape :: (20000,)
Testing Label Shape :: (2000,)

AFTER CATEGORIZATION:
Training Label Values :: [0 1]
Testing Label Values :: [0 1]
Training Label Shape :: (20000,)
Testing Label Shape :: (2000,) 



#### Load the pre-trained embeddings (GloVe)

In [6]:
# Load the pre-trained GloVe vectors into a word -> vector dictionary
embeddings_index = {}
with open('glove.6B/glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('\nQuantity of Dictionary Vectors ::', len(embeddings_index))
print('\nSample Dictionary Word Vector :: "the" :: \n', embeddings_index['the'], '\n')


Quantity of Dictionary Vectors :: 400000

Sample Dictionary Word Vector :: "the" :: 
 [-0.038194 -0.24487   0.72812  -0.39961   0.083172  0.043953 -0.39141
  0.3344   -0.57545   0.087459  0.28787  -0.06731   0.30906  -0.26384
 -0.13231  -0.20757   0.33395  -0.33848  -0.31743  -0.48336   0.1464
 -0.37304   0.34577   0.052041  0.44946  -0.46971   0.02628  -0.54155
 -0.15518  -0.14107  -0.039722  0.28277   0.14393   0.23464  -0.31021
  0.086173  0.20397   0.52624   0.17164  -0.082378 -0.71787  -0.41531
  0.20335  -0.12763   0.41367   0.55187   0.57908  -0.33477  -0.36559
 -0.54857  -0.062892  0.26584   0.30205   0.99775  -0.80481  -3.0243
  0.01254  -0.36942   2.2167    0.72201  -0.24978   0.92136   0.034514
  0.46745   1.1079   -0.19358  -0.074575  0.23353  -0.052062 -0.22044
  0.057162 -0.15806  -0.30798  -0.41625   0.37972   0.15006  -0.53212
 -0.2055   -1.2526    0.071624  0.70565   0.49744  -0.42063   0.26148
 -1.538    -0.30223  -0.073438 -0.28312   0.37104  -0.25217   0.016215
 -0

In [7]:
# Build embedding_matrix: row i holds the GloVe vector of the word with tokenizer index i
EMBEDDING_DIM = len(embeddings_index['the'])  # length of each dictionary vector:: 100
embedding_matrix = np.zeros((MAX_NUM_TOKENS, EMBEDDING_DIM))  # placeholder matrix
for word, index in tokenizer.word_index.items():
    if index > MAX_NUM_TOKENS - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
print('\nEmbedding Matrix Shape ::', embedding_matrix.shape)

# Report the embedding vectors for token index 1 and for token index `pick_me`
# (the random sample's row number, reused here as an arbitrary token index)
sample_word = [key for (key, value) in tokenizer.word_index.items() if value == 1]
print('\nWord mapped to integer index = 1 is: ', sample_word)
print('\n', sample_word,' is mapped to word vector\n', embedding_matrix[1,:])
sample_word = [key for (key, value) in tokenizer.word_index.items() if value == pick_me]
print('\nWord mapped to integer index =', pick_me, 'is: ', sample_word)
print('\n', sample_word,'  is mapped to word vector\n', embedding_matrix[pick_me,:], '\n')


Embedding Matrix Shape :: (20000, 100)

Word mapped to integer index = 1 is:  ['the']

 ['the']  is mapped to word vector
 [-0.038194   -0.24487001  0.72812003 -0.39961001  0.083172    0.043953
 -0.39140999  0.3344     -0.57545     0.087459    0.28786999 -0.06731
  0.30906001 -0.26383999 -0.13231    -0.20757     0.33395001 -0.33848
 -0.31742999 -0.48335999  0.1464     -0.37303999  0.34577     0.052041
  0.44946    -0.46970999  0.02628    -0.54154998 -0.15518001 -0.14106999
 -0.039722    0.28277001  0.14393     0.23464    -0.31020999  0.086173
  0.20397     0.52623999  0.17163999 -0.082378   -0.71787    -0.41531
  0.20334999 -0.12763     0.41367     0.55186999  0.57907999 -0.33476999
 -0.36559001 -0.54856998 -0.062892    0.26583999  0.30204999  0.99774998
 -0.80480999 -3.0243001   0.01254    -0.36941999  2.21670008  0.72201002
 -0.24978     0.92136002  0.034514    0.46744999  1.10790002 -0.19358
 -0.074575    0.23353    -0.052062   -0.22044     0.057162   -0.15806
 -0.30798    -0.41624
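
The way the embedding matrix is used downstream can be sketched with a toy vocabulary: each token index selects one row, words missing from the pre-trained vectors keep an all-zero row, and index 0 (the padding value) is also all zeros. Every name and number below is hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional "pre-trained" vectors (the GloVe file used here has 100)
toy_embeddings = {
    'the':  np.array([0.1, 0.2, 0.3]),
    'food': np.array([0.4, 0.5, 0.6]),
}
# Hypothetical tokenizer index: 1 = most frequent word; 0 is reserved for padding
toy_word_index = {'the': 1, 'food': 2, 'yummmy': 3}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:  # words unknown to the embeddings keep a zero row
        toy_matrix[index] = vector

# An embedding layer is, in effect, a row lookup on this matrix
padded_sequence = np.array([0, 0, 2, 1, 3])
embedded = toy_matrix[padded_sequence]
print(embedded.shape)
```

Each padded review of 50 token indices therefore becomes a 50 × 100 array before it reaches the recurrent layers.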

In [8]:
# Describe the data shapes & labels
print('\nTraining Shape  ::', x_train.ndim, 'dimensions', x_train.shape)
print('Training Labels ::', y_train.ndim, 'dimensions', y_train.shape)
print('\nTesting Shape   ::', x_test.ndim, 'dimensions', x_test.shape)
print('Testing Labels  ::', y_test.ndim, 'dimensions', y_test.shape)
print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -')

# Peek at the data & labels
print('\nImported Data')
print('Classes are::\t', np.unique(y_train, return_counts=True)[0])
print(' # per class:\t', np.unique(y_train, return_counts=True)[1])

# Describe the classifiers
label_descriptors = ['negative', 'positive']
print('\n <Label>\t<Descriptor>')
for label, descriptor in zip(np.unique(y_train), label_descriptors):
    print(f'{label:>5}\t\t{descriptor:<10}')
print('- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -')

# Display the sample text as Tokenized Tensor
print('\nSample text sequence (vectorized) ::')
display(x_train[pick_me])


Training Shape  :: 2 dimensions (20000, 50)
Training Labels :: 1 dimensions (20000,)

Testing Shape   :: 2 dimensions (2000, 50)
Testing Labels  :: 1 dimensions (2000,)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Imported Data
Classes are::	 [0 1]
 # per class:	 [10002  9998]

 <Label>	<Descriptor>
    0		negative  
    1		positive  
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Sample text sequence (vectorized) ::


array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,   30,   11,  346,  259,   30,  904, 1404,
          4, 3074, 1158, 1029,   51, 1336, 2324,   80,   22,  695,   70,
        324, 2829,  248,   25, 5152, 2386])

#### Store pre-processed data in `h5` file

In [9]:
# Store tensor data in 'h5' file for later use
import h5py
with h5py.File('generated_files/Final_Hinrichs_source2-data.h5', 'w') as hf:
    dataset_group = hf.create_group('dataset')
    dataset_group.create_dataset('x_train', data=x_train)
    dataset_group.create_dataset('x_test', data=x_test)
    dataset_group.create_dataset('y_train', data=y_train)
    dataset_group.create_dataset('y_test', data=y_test)
    hf.create_dataset('embedding_matrix', data=embedding_matrix)

--- 

<center><h3>Model Construction</h3></center>

---

#### Import Source Data & Libraries

In [1]:
# Import pre-processed tensor data from 'h5' file
import numpy as np
import h5py
with h5py.File('generated_files/Final_Hinrichs_source2-data.h5', 'r') as hf:
    x_train = np.array(hf['dataset/x_train'])
    x_test = np.array(hf['dataset/x_test'])
    y_train = np.array(hf['dataset/y_train'])
    y_test = np.array(hf['dataset/y_test'])
    embedding_matrix = np.array(hf['embedding_matrix'])

# Describe the data shapes (2D integer arrays: samples x tokens)
print('\nTraining Shape ::', x_train.shape)
print('Training Labels ::', y_train.shape)
print('\nTesting Shape ::', x_test.shape)
print('Testing Labels ::', y_test.shape)
print('\nDictionary Shape ::', embedding_matrix.shape)

# Import ML libraries & dependencies
from tensorflow.keras.models import load_model, Sequential
from tensorflow.keras.layers import Embedding, Dense, Activation, Dropout
from tensorflow.keras.layers import LSTM, Bidirectional, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.callbacks import History, EarlyStopping, ModelCheckpoint
print('\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -')
import time  # to track execution time


Training Shape :: (20000, 50)
Training Labels :: (20000,)

Testing Shape :: (2000, 50)
Testing Labels :: (2000,)

Dictionary Shape :: (20000, 100)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


#### Define Global Variables

In [2]:
# Define fixed parameters for model fitting
NB_CLASSES = int(len(set(y_train))) # number of classes:: 2
N_SAMPLES = x_train.shape[0] # number of training samples:: 20k
TEXT_DIM = (None, x_train.shape[1]) # shape of each sequence:: 1x50
NB_EPOCH = 3*x_train.shape[1] # number of training epochs:: 150 (we may adjust later)
filepaths = list()  # to store list of h5 files containing fitted models

# Define the embedding layer from the pre-trained GloVe matrix
MAX_NUM_TOKENS = embedding_matrix.shape[0] # vocabulary size (rows of the matrix):: 20k
EMBEDDING_DIM = embedding_matrix.shape[1]  # length of each word vector:: 100
MAX_SENTENCE_LENGTH = x_train.shape[1]  # padded sequence length in tokens:: 50
embedding_layer = Embedding(MAX_NUM_TOKENS,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SENTENCE_LENGTH,
                            trainable=False)

#### Define Evaluative Functions

In [3]:
# Define function for plotting model performance
import matplotlib
import matplotlib.pyplot as plt
def plotHistory(tuned_model):
    fig, axs = plt.subplots(1,2,figsize=(15,5))

    # Show Training Loss vs Validation Loss (to minimize)
    axs[0].plot(tuned_model.history['loss'], color='magenta')
    axs[0].plot(tuned_model.history['val_loss'], color='orange')
    axs[0].set_title('loss vs epoch')
    #axs[0].set_ylim(0.0,0.7)
    axs[0].set_ylabel('loss')
    axs[0].set_xlabel('epoch')
    axs[0].legend(['training', 'validation'], loc='upper left')

    # Show Training Accuracy vs Validation Accuracy (to maximize)
    axs[1].plot(tuned_model.history['accuracy'], color='magenta')
    axs[1].plot(tuned_model.history['val_accuracy'], color='orange')
    axs[1].set_title('accuracy vs epoch')
    axs[1].set_ylabel('accuracy')
    axs[1].set_xlabel('epoch')
    #axs[1].set_ylim(0.5,1)
    axs[1].legend(['training', 'validation'], loc='upper left')
    plt.show()

# Define function to report best model performance from trials
def getBestModelfromTrials(trials, modelname):
    # Extracts all valid iterations of the hyperoptimization 
    valid_trial_list = [trial for trial in trials
                            if STATUS_OK == trial['result']['status']]
    
    # Extracts obj. function value in all valid iterations of the hyperoptimization 
    losses = [float(trial['result']['loss']) for trial in valid_trial_list]
    
    # Finds the model with the lowest obj. function
    index_having_minimum_loss = np.argmin(losses)
    best_trial_obj = valid_trial_list[index_having_minimum_loss]
    
    # Extracts the model corresponding to the lowest obj. function
    bestest_model_ever = best_trial_obj['result']['Trained_Model']
    modelname = bestest_model_ever._name
    filepath = 'generated_files/saved_bests/Final_Hinrichs_'+str(modelname)+'.json'
    model_json = bestest_model_ever.to_json()
    with open(filepath, 'w') as json_file:
         json_file.write(model_json)
    filepath = 'generated_files/saved_bests/Final_Hinrichs_'+str(modelname)+'_weights.h5'
    bestest_model_ever.save_weights(filepath)
    return best_trial_obj['result']['Trained_Model']
    
# Define function to report model accuracy separately
from sklearn.metrics import accuracy_score
def GetAccuracy(model_name, test_data, test_labels):
    pred_prob = model_name.predict(test_data) # predict probabilities
    pred_labels = np.where(pred_prob > 0.5, 1,0) # predicted class labels
    true_labels = np.where(test_labels > 0.5, 1,0) # convert categorical→class labels
    return accuracy_score(true_labels, pred_labels)
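
The thresholding step in `GetAccuracy` can be checked by hand on a few made-up predictions. The sketch below assumes sigmoid outputs of shape `(n_samples, 1)` and uses plain NumPy in place of `accuracy_score`; the probabilities and labels are hypothetical.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as a model with a
# single output unit would return from predict()
toy_probs = np.array([[0.91], [0.12], [0.55], [0.40]])
toy_true = np.array([1, 0, 0, 0])

# Threshold at 0.5, then flatten to shape (n_samples,) to match the labels
toy_pred = (toy_probs > 0.5).astype(int).ravel()
toy_accuracy = float(np.mean(toy_pred == toy_true))
print(toy_pred, toy_accuracy)
```

Three of the four toy predictions match their labels, so the accuracy is 0.75.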

<u><center><h2>Model Architecture</h2></center></u>

#### Define Objective Function

In [4]:
# Define function to instantiate an RNN Combination model
COUNT = int(0)  # to track model build calls
def model_maker(params, x_data, y_data):
    global COUNT
    COUNT += 1
    modelname = 'hyp_opt_model_'+str(COUNT)
    ## ---->> INSTANTIATE THE MODEL
    start_clock = time.process_time_ns()
    # Build a new Neural Network Model using RNN architectures
    model = Sequential(name = modelname)
    # ---- input layer :: Embedding Layer
    model.add(embedding_layer)
    # # -- layer LSTM #1
    # model.add(LSTM(units = params['num_kernel'],
    #                dropout = params['dropout'],
    #                recurrent_dropout = params['dropout'],
    #                return_sequences = True,  # train on multiple time points
    #                name = 'LSTM_Layer_1'))
    # # -- layer CNN :: got much better results after dropping this layer
    # model.add(Conv1D(filters = params['num_kernel'], #!-can't be variable if first
    #                  kernel_size = params['kernel_size'],
    #                  activation = params['activation_function'],
    #                  name = 'CNN_Layer'))
    # # ---- first MaxPool layer
    # model.add(MaxPooling1D(pool_size = params['size_pooling'],
    #                        strides = params['strides']))
    # model.add(Dropout(rate = params['dropout']))
    # -- layer Bidirectional LSTM
    model.add(Bidirectional(LSTM(units = int(params['num_kernel']/2),
                                 dropout = params['dropout'],
                                 recurrent_dropout = params['dropout'],
                                 return_sequences = True,
                                 name = 'Bidirectional_LSTM_Layer'),
                            merge_mode = 'concat'))
    # -- layer LSTM #2
    model.add(LSTM(units = params['num_kernel'],
                   dropout = params['dropout'],
                   recurrent_dropout = params['dropout'],
                   return_sequences = False,  # return only the final time step
                   name = 'LSTM_Layer_2'))
    # -- output layer (Dense)
    model.add(Flatten())
    model.add(Dense(units = 1,
                    activation = params['activation_function_output'],
                    name = 'Output_Layer'))

    ## ---->> COMPILE THE MODEL
    model.compile(optimizer = params['optimizer'], 
                  loss = params['loss'], 
                  metrics = params['metrics'])
    print(f'Success! {modelname} has been compiled!')

    ## ---->> FIT THE MODEL
    # Functions to stop overfitting
    early_stopping_monitor = EarlyStopping(monitor='val_loss',
                                           patience=params['patience'],
                                           mode='min')
    filepath = 'generated_files/Final_Hinrichs_checkpoint2.h5' # for tracking
    checkpoint = ModelCheckpoint(filepath,
                                 monitor='val_loss',
                                 verbose= params['verbose'],
                                 save_best_only=True)

    # Fit a compiled model with the datasets presented
    tuned_model = model.fit(x_data, y_data,
                            batch_size = params['batch_size'],
                            epochs = params['epochs'], 
                            verbose = params['verbose'],
                            validation_split = params['validation_split'],
                            callbacks = [checkpoint, early_stopping_monitor])

    # Track the model's validation loss history
    keys = tuned_model.history.keys()
    res = [i for i in keys if ('val' in i and 'loss' in i)]
    val_loss = min(tuned_model.history[res[0]])
    print(f'Success! {str(model._name)} has been tuned!')
    end_clock = time.process_time_ns()
    print(f'Execution clocked at {(end_clock-start_clock)*10**(-9)} secs\n')
    return {'loss': val_loss, 'status': STATUS_OK, 'Trained_Model': model}
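
The interaction between `patience` and the value returned as `loss` can be sketched in plain Python. The function below is a simplified stand-in for Keras's `EarlyStopping` (it ignores options such as `min_delta` and `restore_best_weights`), and the validation-loss curve is hypothetical.

```python
def epochs_until_early_stop(val_losses, patience):
    """Return how many epochs run before a patience-based stop on validation loss."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1  # no improvement this epoch
            if wait >= patience:
                return epoch  # training stops after this epoch
    return len(val_losses)  # never triggered: all epochs run

# Hypothetical validation-loss curve
toy_val_losses = [0.50, 0.40, 0.45, 0.41, 0.39, 0.38]
stopped_after = epochs_until_early_stop(toy_val_losses, patience=2)
objective_value = min(toy_val_losses[:stopped_after])
print(stopped_after, objective_value)
```

With `patience=2`, this toy run stops after epoch 4 and reports 0.40, even though later epochs would have improved further. Note also that `model_maker` reports the minimum validation loss while returning the in-memory model from the last epoch; since `restore_best_weights` is not set, those two may not correspond to the same weights.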

#### Define Hyperparameter Optimization Function

In [5]:
# Define the Hyperparameter Optimization Function
from functools import partial  # used below to bind x_data and y_data
from hyperopt import fmin, tpe, STATUS_OK, Trials
def optimize_model(function, x_data, y_data, params, Build=True):
    # Run fmin to determine lowest loss using model parameters
    trials = Trials()
    best_model = fmin(partial(function, # fmin to minimize loss
                              x_data = x_data,
                              y_data = y_data),
                      space = params, # hyperparameter space
                      algo = tpe.suggest, # Tree-structured Parzen Estimator (Bayesian) search
                      max_evals = params['max_evals_build'] if Build else params['max_evals_train'],
                      trials = trials)  # to save output
    
    # extracts all valid iterations of the hyperoptimization 
    valid_trial_list = [trial for trial in trials
                            if STATUS_OK == trial['result']['status']]
    
    # extracts obj. function value in all valid iterations of the hyperoptimization 
    losses = [float(trial['result']['loss']) for trial in valid_trial_list]
    
    # find the one with lowest obj. function
    index_having_minimum_loss = np.argmin(losses)
    best_trial_obj = valid_trial_list[index_having_minimum_loss]
    
    # extracts the model corresponding to the lowest obj. function
    bestest_model = best_trial_obj['result']['Trained_Model']
    modelname = bestest_model._name
    filepath = 'generated_files/saved_bests/Final_Hinrichs_'+str(modelname)+'.json'
    model_json = bestest_model.to_json()
    with open(filepath, 'w') as json_file:
         json_file.write(model_json)
    filepath = 'generated_files/saved_bests/Final_Hinrichs_'+str(modelname)+'_weights.h5'
    bestest_model.save_weights(filepath)
    
    # Report the lowest loss and return the best trial (a dict, not the Trials object)
    least_loss = best_trial_obj['result']['loss']
    print(f'Success! {str(modelname)} has been optimized!' \
          f'Lowest loss {least_loss:.8f}\n')
    return best_trial_obj
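
The best-trial selection inside `optimize_model` can be illustrated without running any training. The records below are hypothetical and only imitate the shape of the entries in a hyperopt `Trials` object, where a successful trial carries the status string `'ok'`.

```python
import numpy as np

# Hypothetical trial records shaped like the entries of a hyperopt Trials object
toy_trials = [
    {'tid': 0, 'result': {'loss': 0.67, 'status': 'ok'}},
    {'tid': 1, 'result': {'status': 'fail'}},  # failed trials carry no loss
    {'tid': 2, 'result': {'loss': 0.31, 'status': 'ok'}},
]

# Keep only successful trials, then pick the one with the smallest loss
valid = [t for t in toy_trials if t['result']['status'] == 'ok']
losses = [float(t['result']['loss']) for t in valid]
best_trial = valid[int(np.argmin(losses))]
print(best_trial['tid'], best_trial['result']['loss'])
```

Filtering before the `argmin` matters because a failed trial has no `loss` key to compare.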

<u><center><h2>Model Training</h2></center></u>

We first tried several candidate architectures, each with fixed hyperparameters and only a few training iterations, to find the one that responded best to our data:
- Conv1D + MaxPool + Dropout → Bidirectional LSTM → LSTM → Dense
- LSTM → Bidirectional LSTM → LSTM → Dense
- Bidirectional LSTM → Dense
- Bidirectional LSTM → LSTM → Dense (chosen)

We chose the last architecture because it produced the lowest loss (~0.24) and the highest accuracy (~90%) in these trial runs.

#### Define Hyperparameter Tuning Variables

In [22]:
# Define fixed hyperparameter values for the baseline build
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll.base import scope
from functools import partial
setup_space = {
    'activation_function': 'relu',
    'activation_function_output': 'sigmoid',
    'batch_size': int(10),
    'dropout': float(.2),
    'epochs': int(100),
    'initializer': 'uniform',
    'kernel_size': int(5),
    'learning_rate' : float(.01),
    'loss': 'binary_crossentropy',
    'max_evals_build': int(1),  # for moderating execution time
    'max_evals_train': int(10),  # during optimization
    'metrics': 'accuracy',
    'num_kernel': int(64),
    'optimizer': 'adam',
    'patience': int(2),
    'size_pooling': int(4),
    'strides': int(2),
    'validation_split': float(.2),
    'verbose': int(0)
}

#### Describe Model Construction

In [41]:
# Describe our chosen model architecture
model_opt['result']['Trained_Model'].summary()

Model: "hyp_opt_model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 100)           2000000   
                                                                 
 bidirectional_7 (Bidirectio  (None, 50, 64)           34048     
 nal)                                                            
                                                                 
 LSTM_Layer_2 (LSTM)         (None, 64)                33024     
                                                                 
 flatten_7 (Flatten)         (None, 64)                0         
                                                                 
 Output_Layer (Dense)        (None, 1)                 65        
                                                                 
Total params: 2,067,137
Trainable params: 67,137
Non-trainable params: 2,000,000
____________________________________
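
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard LSTM weight count (four gates, each with input weights, recurrent weights, and a bias); `lstm_params` is a hypothetical helper written for this check.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * ((input_dim + units) * units + units)

embedding_params = 20000 * 100                   # frozen GloVe matrix
bidirectional_params = 2 * lstm_params(100, 32)  # two directions, 64 // 2 units each
lstm_2_params = lstm_params(64, 64)              # input is the 64-wide concatenation
dense_params = 64 * 1 + 1                        # weights plus bias

trainable = bidirectional_params + lstm_2_params + dense_params
print(embedding_params, bidirectional_params, lstm_2_params, dense_params, trainable)
```

These values match the 2,000,000 non-trainable and 67,137 trainable parameters reported by `summary()`, and the `Flatten` layer contributes none.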

In [34]:
# Compile & fit the model using hyperparameter tuning
model_opt = optimize_model(model_maker, x_train, y_train, setup_space)

Success! hyp_opt_model_3 has been compiled!          
Success! hyp_opt_model_3 has been tuned!             
Execution clocked at 729.34375 secs                  

100%|██████████| 1/1 [14:52<00:00, 892.27s/trial, best loss: 0.31043487787246704]
Success! hyp_opt_model_3 has been optimized!
	Here are the results:

{}


In [52]:
# Demonstrate results
best_model = model_opt['result']['Trained_Model']
best_loss = model_opt['result']['loss']
print(f'\nOur model {best_model._name} achieved minimized', \
      f'loss at {best_loss:.8f}')


Our model hyp_opt_model_3 achieved minimized loss at 0.31043488


<center><h2><u>Model Optimization</u></h2></center>

#### Define Hyperparameter Optimization Variables

In [58]:
# Define Hyperparameter Variable Domain Space
opt_space = {
    'activation_function': hp.choice('activation_function',
                                     ['relu','tanh','sigmoid']),
    'activation_function_output': 'sigmoid',
    'batch_size': scope.int(hp.quniform('batch_size',16,128,16)),
    'dropout': hp.uniform('dropout',.20,.35),
    'epochs': scope.int(hp.quniform('epochs',1,100,1)),
    'initializer': 'uniform',
    'kernel_size': scope.int(hp.quniform('kernel_size',2,5,1)),
    'learning_rate' : hp.loguniform('learning_rate', np.log(0.005), np.log(0.1)),  # sampled, but model_maker never reads it
    'loss': 'binary_crossentropy',
    'max_evals_build': int(1),  # number of hyperopt trials for a quick build
    'max_evals_train': int(10),  # number of hyperopt trials for the full search
    'metrics': 'accuracy',
    'num_kernel': scope.int(hp.quniform('num_kernel',16,128,16)),
    'optimizer': hp.choice('optimizer',['adadelta','adam','rmsprop']),
    'patience': scope.int(hp.quniform('patience',2,8,1)),
    'size_pooling': scope.int(hp.quniform('size_pooling',2,4,1)),
    'strides': scope.int(hp.quniform('strides',1,2,1)),
    'validation_split': float(.2),
    'verbose': int(0)
}
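
To make the search space more concrete, the NumPy sketch below imitates how two of the distributions used above draw values, following the formulas given in the hyperopt documentation: `hp.quniform` returns `round(uniform(low, high) / q) * q`, and `hp.loguniform` returns `exp(uniform(low, high))` with bounds given in log space. The helper names are hypothetical and this is not hyperopt's own sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quniform(low, high, q):
    """Imitates hp.quniform: round(uniform(low, high) / q) * q."""
    return np.round(rng.uniform(low, high) / q) * q

def sample_loguniform(low, high):
    """Imitates hp.loguniform: exp(uniform(low, high)); bounds are in log space."""
    return np.exp(rng.uniform(low, high))

batch_sizes = [int(sample_quniform(16, 128, 16)) for _ in range(1000)]
learning_rates = [sample_loguniform(np.log(0.005), np.log(0.1)) for _ in range(1000)]

# hp.choice records the index of the selected option in a trial's 'vals'
activation_options = ['relu', 'tanh', 'sigmoid']
recorded_index = 1
print(sorted(set(batch_sizes)), activation_options[recorded_index])
```

Every sampled batch size is a multiple of 16 between 16 and 128, and a recorded `hp.choice` value such as `[1]` has to be mapped back through the option list to recover `'tanh'`.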

#### Define the Best Model

In [59]:
# Now compile & fit the model using hyperparameter optimization variable space
model_hyperopt = optimize_model(model_maker, x_train, y_train, opt_space)
print(model_hyperopt)

Success! hyp_opt_model_4 has been compiled!          
Success! hyp_opt_model_4 has been tuned!             
Execution clocked at 1097.765625 secs                

100%|██████████| 1/1 [20:32<00:00, 1232.43s/trial, best loss: 0.6711501479148865]
Success! hyp_opt_model_4 has been optimized!Lowest loss 0.67115015

{'state': 2, 'tid': 0, 'spec': None, 'result': {'loss': 0.6711501479148865, 'status': 'ok', 'Trained_Model': <keras.engine.sequential.Sequential object at 0x0000026AD2BCA3A0>}, 'misc': {'tid': 0, 'cmd': ('domain_attachment', 'FMinIter_Domain'), 'workdir': None, 'idxs': {'activation_function': [0], 'batch_size': [0], 'dropout': [0], 'epochs': [0], 'kernel_size': [0], 'learning_rate': [0], 'num_kernel': [0], 'optimizer': [0], 'patience': [0], 'size_pooling': [0], 'strides': [0]}, 'vals': {'activation_function': [1], 'batch_size': [16.0], 'dropout': [0.33673238191258076], 'epochs': [21.0], 'kernel_size': [3.0], 'learning_rate': [0.014855186628828922], 'num_kernel': [80.0], 'optimiz

In [60]:
# Demonstrate results
best_hyperopt_model = model_hyperopt['result']['Trained_Model']
best_hyperopt_loss = model_hyperopt['result']['loss']
print(f'\nOur model {best_hyperopt_model._name} achieved minimized', \
      f'loss at {best_hyperopt_loss:.8f}')


Our model hyp_opt_model_4 achieved minimized loss at 0.67115015


#### Optimize the Model (max_evals = 10)

In [62]:
# Train the model using optimized parameters with higher iteration count
model_hyperopt = optimize_model(model_maker, x_train, y_train, opt_space, Build=False)
print(model_hyperopt)

Success! hyp_opt_model_5 has been compiled!           
Success! hyp_opt_model_5 has been tuned!              
Execution clocked at 1071.625 secs                    

Success! hyp_opt_model_6 has been compiled!                                          
Success! hyp_opt_model_6 has been tuned!                                             
Execution clocked at 280.578125 secs                                                 

Success! hyp_opt_model_7 has been compiled!                                          
Success! hyp_opt_model_7 has been tuned!                                             
Execution clocked at 1025.046875 secs                                                  

Success! hyp_opt_model_8 has been compiled!                                            
Success! hyp_opt_model_8 has been tuned!                                              
Execution clocked at 1099.921875 secs                                                 

Success! hyp_opt_model_9 has been compiled!        

In [63]:
# Demonstrate results
bestest_hyperopt_model = model_hyperopt['result']['Trained_Model']
bestest_hyperopt_loss = model_hyperopt['result']['loss']
print(f'\nOur model {bestest_hyperopt_model._name} achieved minimized', \
      f'loss at {bestest_hyperopt_loss:.8f}')


Our model hyp_opt_model_7 achieved minimized loss at 0.30679399


---
---

<center><h2>Model Performance Evaluation</h2></center>

Now that our model is tuned and optimized for the lowest validation loss, we can measure its accuracy on the held-out testing data.

In [18]:
# Retrieve saved models from saved trials
from pathlib import Path
from tensorflow.keras.models import model_from_json
best_models = []
best_directory = Path('generated_files/saved_bests/')
# Each model is stored as <name>.json (architecture) plus <name>_weights.h5 (weights)
for json_file in sorted(best_directory.glob('*.json')):
    saved_best = model_from_json(json_file.read_text())
    # Pair the architecture with its weights file, then store the model once
    weights_file = json_file.with_name(json_file.stem + '_weights.h5')
    saved_best.load_weights(str(weights_file))
    best_models.append(saved_best)

In [16]:
# Run prediction of each model against testing data
best_score = 0
for model in best_models:
    score = GetAccuracy(model, x_test, y_test)
    print('\n\t', str(model._name), '.....', sep='')
    print('Model Accuracy ::', score)
    if score > best_score:
        best_model_name = str(model._name)
        best_score = score
        model.save('generated_files/Final_Hinrichs_best-overall.h5')
print('\nThe best-performing model on the Test Data is\n\t -->> ', best_model_name, \
      ' <<-- \nwith an accuracy score of', best_score,'\n')


	hyp_opt_model_3.....
Model Accuracy :: 0.8505

	hyp_opt_model_4.....
Model Accuracy :: 0.623

	hyp_opt_model_7.....
Model Accuracy :: 0.8575

The best-performing model on the Test Data is
	 -->>  hyp_opt_model_7  <<-- 
with an accuracy score of 0.8575 



---
---

<center><h1>Analysis</h1></center>

| Model Name | Validation Loss | Test Accuracy |
| --- | --- | --- |
| hyp_opt_model_3 | 0.3104 | 0.8505 |
| hyp_opt_model_4 | 0.6712 | 0.6230 |
| hyp_opt_model_7 | 0.3068 | 0.8575 |

<h4>The best model is:</h4>
`hyp_opt_model_7`, found with `max_evals = 10`, which reached the lowest validation loss (0.3068) and the highest test accuracy (0.8575) of the three models.
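<h4>The best parameters are:</h4>

- `activation_function`: 'tanh'
- `batch_size`: 112
- `dropout`: 0.3274283406727577
- `epochs`: 54
- `kernel_size`: 3
- `learning_rate`: 0.013220910276634638
- `num_kernel`: 80
- `optimizer`: 'adam'
- `patience`: 4
- `size_pooling`: 2
- `strides`: 2

Only `batch_size`, `dropout`, `epochs`, `num_kernel`, `optimizer`, and `patience` influence the chosen architecture. `activation_function`, `kernel_size`, `size_pooling`, and `strides` belong to the convolutional layers that are commented out in `model_maker`, and `learning_rate` is sampled but never passed to the optimizer.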
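
The loss values in the table above are mean binary cross-entropy values (in nats) rather than percentages. As a point of reference, the sketch below computes that loss for made-up predictions; `binary_cross_entropy` is a hypothetical helper that follows the usual definition, not the exact Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy in nats, with clipping to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

toy_labels = [1, 0, 1, 0]
confident = binary_cross_entropy(toy_labels, [0.9, 0.1, 0.9, 0.1])
uninformed = binary_cross_entropy(toy_labels, [0.5, 0.5, 0.5, 0.5])
print(round(confident, 4), round(uninformed, 4))
```

A model that always predicts 0.5 scores ln 2 ≈ 0.6931, so the 0.6712 reported for `hyp_opt_model_4` is only slightly better than an uninformed guess, which is consistent with its 62.3% test accuracy.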