

    ██████╗ ██╗██╗     ███████╗████████╗███╗   ███╗     ██████╗██████╗  █████╗  ██████╗██╗  ██╗███████╗██████╗ 
    ██╔══██╗██║██║     ██╔════╝╚══██╔══╝████╗ ████║    ██╔════╝██╔══██╗██╔══██╗██╔════╝██║ ██╔╝██╔════╝██╔══██╗
    ██████╔╝██║██║     ███████╗   ██║   ██╔████╔██║    ██║     ██████╔╝███████║██║     █████╔╝ █████╗  ██████╔╝
    ██╔══██╗██║██║     ╚════██║   ██║   ██║╚██╔╝██║    ██║     ██╔══██╗██╔══██║██║     ██╔═██╗ ██╔══╝  ██╔══██╗
    ██████╔╝██║███████╗███████║   ██║   ██║ ╚═╝ ██║    ╚██████╗██║  ██║██║  ██║╚██████╗██║  ██╗███████╗██║  ██║
    ╚═════╝ ╚═╝╚══════╝╚══════╝   ╚═╝   ╚═╝     ╚═╝     ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝
                                                                                                           

---

![alt text](http://www.treasurenet.com/forums/attachment.php?attachmentid=173574&amp;d=1332348453)

---

This module trains a bidirectional long short-term memory (LSTM) network on a dataset consisting of cleartext passwords. The trained network is then used to predict the most likely alterations and/or additions to a given sequence.
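
To make "most likely alterations" concrete, here is a minimal sketch of how per-position predictions could be combined into ranked candidate sequences. It is an illustration under assumptions, not the module's actual scoring code: the function names (`top_candidates`, `rank_alterations`), the example probabilities, and the choice to treat positions as independent are all hypothetical.

```python
import math
from itertools import product

def top_candidates(position_probs, k=2):
    """Return the k most probable (character, probability) pairs for one position."""
    ranked = sorted(position_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

def rank_alterations(per_position_probs, k=2):
    """Enumerate sequences built from the top-k characters at each position.

    Each sequence is scored by the sum of log-probabilities, which assumes
    (for illustration only) that positions are independent of each other.
    """
    choices = [top_candidates(probs, k) for probs in per_position_probs]
    scored = []
    for combo in product(*choices):
        text = ''.join(ch for ch, _ in combo)
        log_prob = sum(math.log(p) for _, p in combo)
        scored.append((text, log_prob))
    # most probable sequence first
    return sorted(scored, key=lambda item: item[1], reverse=True)

# hypothetical network output for a three-character sequence
example = [
    {'p': 0.7, 'P': 0.3},
    {'a': 0.6, '@': 0.4},
    {'s': 0.9, '$': 0.1},
]
ranked = rank_alterations(example)
best_text, best_log_prob = ranked[0]
```

Working in log space keeps the scores numerically stable for longer sequences, where a product of many small probabilities would underflow.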

---


### Assumptions
The dataset is assumed to contain no information other than the cleartext passwords.

The network parameters (*e.g.*, the number of hidden units and the embedding layer size) are defined in the configuration file (`program/config.yml`).


### Code steps
This is the basic flow of the code:

1. read in data
   * clean up data (remove duplicates, NaN values, etc.)  
2. get data characteristics
   * determine number of characters  
   * determine/define longest sequence length  
3. generator
   * tokenization  
   * sliding windows  
4. training
5. sequence
   * for each position in the sequence, predict the most likely candidates  
   * calculate most likely shared candidates  
   * calculate probabilities of overall adjusted sequences  
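
The generator step (tokenization followed by sliding windows) can be sketched roughly as follows. This is a simplified illustration rather than the code in `program.py`: the helper names, the reserved padding id `0`, and the "window of characters followed by the next character" framing are assumptions, and a bidirectional model may well use context on both sides of the target.

```python
def build_vocabulary(passwords):
    """Map each distinct character to an integer id; id 0 is left free for padding."""
    chars = sorted(set(''.join(passwords)))
    return {ch: idx + 1 for idx, ch in enumerate(chars)}

def tokenize(password, vocab):
    """Convert a password into a list of integer character ids."""
    return [vocab[ch] for ch in password]

def sliding_windows(tokens, window):
    """Yield (context, target) pairs: `window` tokens and the token that follows them."""
    for start in range(len(tokens) - window):
        yield tokens[start:start + window], tokens[start + window]

# hypothetical two-entry dataset
passwords = ['abc123', 'pass']
vocab = build_vocabulary(passwords)
tokens = tokenize('abc123', vocab)
pairs = list(sliding_windows(tokens, window=3))
```

Because `sliding_windows` is a generator, the training pairs are produced lazily, which keeps memory use flat even when the password list is large.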


---
# Initial Definitions
---

Import the libraries used in this notebook:

In [None]:
# import libraries
import boto3
import time
import os
import keras

# sagemaker libraries
import sagemaker
from sagemaker.tuner              import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.tensorflow         import TensorFlow
from sagemaker.tensorflow.serving import Model

## Variable definitions

Define all of the variables used in the notebook here:

In [None]:
# specify the S3 bucket parameters
bucket = 'blstm-cracker'
prefix = 'test-run'

# get the session and IAM role information
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# location and name of the program containing all of the code
program_name = 'program.py'
program_path = 'program'

These are the variables related to the model artifacts:

In [None]:
# locations in which to store model artifacts
intermediate_location = 's3://{}/{}/intermediate'.format(bucket, prefix)
output_location       = 's3://{}/{}/output/'.format(bucket, prefix)

# specify the location in S3 containing the dataset
data_path     = 'data/dump.csv'
data_name     = 'train.csv'
key           = os.path.join(prefix, 'train', data_name)
s3_train_data = 's3://{}/{}'.format(bucket, key)

# define the intermediate path where the model artifacts will be stored
inter      = os.path.join(prefix, 'intermediate')
inter_path = 's3://{}/{}'.format(bucket, inter)

# define the output path
out         = os.path.join(prefix, 'output')
output_path = 's3://{}/{}'.format(bucket, out)
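
One caveat with the cell above: `os.path.join` uses the separator of the local operating system, so on Windows the S3 keys would contain backslashes. A possible alternative is sketched below; the helper name `s3_uri` is hypothetical and not part of the notebook.

```python
import posixpath

def s3_uri(bucket, *parts):
    """Build an S3 URI, joining key components with '/' on every platform."""
    key = posixpath.join(*parts)
    return 's3://{}/{}'.format(bucket, key)

# same bucket and prefix values as defined earlier in the notebook
bucket = 'blstm-cracker'
prefix = 'test-run'

train_uri = s3_uri(bucket, prefix, 'train', 'train.csv')
inter_uri = s3_uri(bucket, prefix, 'intermediate')
```

On Linux and macOS, including SageMaker notebook instances, both approaches give the same result; the difference only matters when the notebook is run from a Windows machine.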

These are the variables related to the endpoint:

In [None]:
# path to model artifacts
model_artifacts = 's3://{}/{}/output/cloud-test/output/model.tar.gz'.format(bucket, prefix)

# include the date in the endpoint name
endpoint_name = 'keras-tf-fmnist-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
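
Endpoint names are restricted by SageMaker. The sketch below checks a generated name against the constraints as I understand them from the `CreateEndpoint` API (at most 63 characters; letters, digits, and hyphens only; no leading or trailing hyphen). Treat both the limits and the helper name `make_endpoint_name` as assumptions to verify against the current AWS documentation.

```python
import re
import time

# assumed constraints, taken from the CreateEndpoint API reference
ENDPOINT_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9])*$')
MAX_ENDPOINT_NAME_LENGTH = 63

def make_endpoint_name(base, when=None):
    """Append a UTC timestamp to `base` and validate the resulting name."""
    moment = time.gmtime() if when is None else when
    stamp = time.strftime('%Y-%m-%d-%H-%M-%S', moment)
    name = '{}-{}'.format(base, stamp)
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError('invalid endpoint name: {!r}'.format(name))
    return name

# a fixed timestamp keeps the example reproducible
name = make_endpoint_name('keras-tf-fmnist', time.gmtime(0))
```

Validating the name locally gives an immediate error instead of a failed deployment request.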

The SageMaker session and IAM role were already retrieved above (`sess` and `role`); the cell below only records reference links:

In [None]:
# https://github.com/aws/sagemaker-python-sdk/issues/911
# https://towardsdatascience.com/building-fully-custom-machine-learning-models-on-aws-sagemaker-a-practical-guide-c30df3895ef7

---
# Data
---

## Processing

The next section uploads the dataset to the S3 bucket:

In [None]:
# upload the training data to S3
print('Uploading data to: {}'.format(s3_train_data))
with open(data_path, 'rb') as f:
    boto3.resource('s3').Bucket(bucket).Object(key).put(Body=f)

# configure SageMaker input channel
input_data = {
    'training': sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/csv')
}

---
# Model
---

## Training

Define the hyperparameters for the training job:

In [None]:
hyperparameters={'epochs':       5, 
                 'batch_size':   128,
                 'hidden_units': 100,
                 'training':     s3_train_data}

Define the TensorFlow estimator:

In [None]:
estimator = TensorFlow(entry_point          = program_name, 
                       role                 = role,
                       source_dir           = program_path,
                       model_dir            = intermediate_location,
                       output_path          = output_location,
                       code_location        = intermediate_location,
                       train_instance_count = 1, 
                       train_instance_type  = 'local',
                       framework_version    = '1.12', 
                       py_version           = 'py3',
                       script_mode          = True,
                       hyperparameters      = hyperparameters
                       )
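
In script mode, the hyperparameters are handed to the entry point as command-line arguments, and each input channel is exposed through an environment variable (`SM_CHANNEL_TRAINING` for a channel named `training`). The contents of `program.py` are not shown in this notebook, so the following is only a sketch of how such a script might read them; the function name `parse_args` and the defaults are assumptions.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes to the entry point."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--hidden_units', type=int, default=100)
    # fall back to the channel directory if no explicit location is given
    parser.add_argument('--training', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--model_dir', type=str, default=None)
    # ignore any additional arguments rather than failing on them
    args, _unknown = parser.parse_known_args(argv)
    return args

# simulate the command line that a training job might produce
args = parse_args(['--epochs', '5',
                   '--batch_size', '128',
                   '--hidden_units', '100',
                   '--model_dir', 's3://blstm-cracker/test-run/intermediate'])
```

Using `parse_known_args` keeps the script tolerant of extra arguments, which helps when the same entry point is reused with a different set of hyperparameters.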

Fit the model using the hyperparameters and estimator defined above:

In [None]:
estimator.fit(inputs=input_data, job_name='AJT')

## Hyperparameter Tuning

Define the hyperparameter ranges, the objective metric, and the tuner:

In [None]:
hyperparameter_ranges = {
    'epochs':        IntegerParameter(20, 100),
    'learning-rate': ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'), 
    'batch-size':    IntegerParameter(32, 1024),
    'dense-layer':   IntegerParameter(128, 1024),
    'dropout':       ContinuousParameter(0.2, 0.6)
}

objective_metric_name = 'val_acc'
objective_type        = 'Maximize'
metric_definitions    = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=10,
                            max_parallel_jobs=2,
                            objective_type=objective_type)
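
The metric definition works by applying its regular expression to the training log output. The sketch below shows what the pattern would capture from a Keras-style progress line; the log line itself is a made-up example, not output from this project.

```python
import re

# same pattern as in metric_definitions
metric_regex = 'val_acc: ([0-9\\.]+)'

# hypothetical log line in the style of Keras progress output
log_line = ' - 12s - loss: 0.4321 - acc: 0.8710 - val_loss: 0.5012 - val_acc: 0.8432'

match = re.search(metric_regex, log_line)
val_acc = float(match.group(1)) if match else None

# newer Keras versions log 'val_accuracy' instead, which this pattern does not match
newer_match = re.search(metric_regex, ' - val_loss: 0.5012 - val_accuracy: 0.8432')
```

If the training script logs the metric under a different name, the pattern finds nothing and the tuner has no objective to optimize, so it is worth testing the regex against a real log line first.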

---
# Endpoint
---

Load the model from its artifacts stored on S3 and use this to deploy an endpoint:

In [None]:
# load the model from its artifacts on S3
model = Model(model_data=model_artifacts, role=role)

# deploy an endpoint
predictor = model.deploy(initial_instance_count=1, 
                         instance_type='ml.t2.medium',
                         endpoint_name=endpoint_name)
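
Once the endpoint is running, requests need to be shaped the way the served model expects. The sketch below builds a payload in the row-oriented `instances` format of the TensorFlow Serving REST API. The input shape (a batch of fixed-length windows of character ids) and the helper name `build_request` are assumptions, since the model signature is not shown in this notebook.

```python
import json

def build_request(token_windows):
    """Wrap tokenized windows in the 'instances' format used by TensorFlow Serving."""
    return {'instances': [list(window) for window in token_windows]}

# hypothetical batch of two windows of character ids
payload = build_request([[4, 5, 6], [5, 6, 1]])
body = json.dumps(payload)

# with a live endpoint, the dictionary could be sent as: predictor.predict(payload)
```

The call to `predictor.predict` is left as a comment because it requires a deployed endpoint and AWS credentials.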

Delete the endpoint after it is no longer needed:

In [None]:
predictor.delete_endpoint()