# Plagiarism Detection Model

The goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features you provide the model.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3


In [4]:
# directory created to save the features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_files'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(input_data)

'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


s3://sagemaker-us-east-2-573573149823/plagiarism_files


### Test cell

Test that your data has been successfully uploaded. 

In [5]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_files/test.csv
plagiarism_files/train.csv
time_series/test/test.json
time_series/train/train.json
Test passed!
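
The check above lists every object in the bucket, so it would also pass if only unrelated files (such as the `time_series/` keys in the output) were present. A narrower check could look only at keys under `prefix`. This is a minimal sketch, assuming the boto3 `objects.filter(Prefix=...)` collection method; the helper name `keys_under_prefix` is hypothetical and not part of the notebook.

```python
def keys_under_prefix(s3_bucket, key_prefix):
    """Return the keys of all objects in `s3_bucket` that start with `key_prefix`.

    `s3_bucket` is expected to behave like a boto3 Bucket resource,
    e.g. boto3.resource('s3').Bucket(bucket).
    """
    return [obj.key for obj in s3_bucket.objects.filter(Prefix=key_prefix)]
```

In the notebook this could be called as `keys_under_prefix(boto3.resource('s3').Bucket(bucket), prefix)`, followed by an assertion that the returned list is not empty.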


# Modeling


## Complete a training script 

A typical training script:
* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs)
* Instantiates a model of your design, with any specified hyperparameters
* Trains that model
* Saves the model so that it can be hosted or deployed later

### Defining and training a model
To complete `train.py`:
1. Import any extra libraries you need
2. Define any additional model training hyperparameters using `parser.add_argument`
3. Define a model in the `if __name__ == '__main__':` section
4. Train the model in that same section
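
The hyperparameter step can be sketched as follows. This is an illustrative outline, not the contents of the actual `train.py`: the function name `parse_args`, the default values, and the local fallback directories are assumptions. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables that SageMaker sets inside the training container, and the hyperparameter names mirror those passed to the estimator later in this notebook.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse SageMaker-provided directories and model hyperparameters."""
    parser = argparse.ArgumentParser()

    # Directories: SageMaker sets these environment variables in the training
    # container; the fallback values are assumptions for running locally.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'plagiarism_data'))

    # Hyperparameters: the default values here are illustrative only.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--input_features', type=int, default=3)
    parser.add_argument('--hidden_dim', type=int, default=10)
    parser.add_argument('--output_dim', type=int, default=1)

    # argv=None makes argparse read sys.argv, as it would inside the container.
    return parser.parse_args(argv)
```

Inside the `if __name__ == '__main__':` section, `args = parse_args()` would then expose values such as `args.epochs` and `args.model_dir` to the model definition and training loop.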

Below, `!pygmentize` is used to display an existing `train.py` file. 

In [16]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_pytorch/train.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36moptim[39;49;00m [34mas[39;49;00m [04m[36moptim[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mutils[39;49;00m[04m[36m.[39;49;00m[04m[36mdata[39;49;00m

[37m# imports the model in model.py by name[39;49;00m
[34mfrom[39;49;00m [04m[36mmodel[39;49;00m [34mimport[39;49;00m BinaryClassifier

[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load the PyTorch model from the `model_dir` director

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above.


## Define a PyTorch estimator



In [26]:
from sagemaker.pytorch import PyTorch
# your import and estimator code, here
# specify an output path
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate a pytorch estimator
estimator = PyTorch(entry_point = 'train.py',
                    source_dir = 'source_pytorch',
                    train_instance_type='ml.c4.xlarge',
                    train_instance_count=1,
                    role = role,
                    output_path = output_path,
                    framework_version='1.0',
                    hyperparameters = {'epochs': 120,
                                       'input_features': 3, 
                                       'hidden_dim': 20,
                                       'output_dim': 1},
                    sagemaker_session = sagemaker_session)
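
The estimator above uses SageMaker Python SDK v1 argument names. In SDK v2, `train_instance_type` and `train_instance_count` were renamed to `instance_type` and `instance_count`, and framework estimators expect an explicit `py_version`. A minimal sketch of translating the keyword arguments, assuming these are the only changes needed; the helper `to_v2_kwargs` and the value `py_version='py3'` are illustrative assumptions, not part of the notebook.

```python
# Mapping of SDK v1 estimator argument names to their SDK v2 equivalents.
V1_TO_V2 = {
    'train_instance_type': 'instance_type',
    'train_instance_count': 'instance_count',
}


def to_v2_kwargs(v1_kwargs, py_version='py3'):
    """Return a copy of `v1_kwargs` with v1 argument names replaced by v2 names."""
    v2_kwargs = {V1_TO_V2.get(name, name): value for name, value in v1_kwargs.items()}
    # Add a Python version only if the caller has not already supplied one.
    v2_kwargs.setdefault('py_version', py_version)
    return v2_kwargs


v1_kwargs = {
    'entry_point': 'train.py',
    'source_dir': 'source_pytorch',
    'train_instance_type': 'ml.c4.xlarge',
    'train_instance_count': 1,
    'framework_version': '1.0',
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
```

Under SDK v2, the result could then be passed along with the remaining arguments, for example `PyTorch(role=role, output_path=output_path, **v2_kwargs)`. Whether a given `framework_version` and `py_version` pair is available should be checked against the SDK version in use.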


## Train the estimator

Train your estimator on the training data stored in S3.

In [27]:
%%time

# Train your estimator on S3 training data
estimator.fit({'train': input_data})


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-06-19 16:54:37 Starting - Starting the training job...
2020-06-19 16:54:39 Starting - Launching requested ML instances.........
2020-06-19 16:56:10 Starting - Preparing the instances for training...
2020-06-19 16:56:50 Downloading - Downloading input data...
2020-06-19 16:57:34 Training - Training image download completed. Training in progress.
2020-06-19 16:57:34 Uploading - Uploading generated training model
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-19 16:57:29,825 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-06-19 16:57:29,828 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-19 16:57:29,840 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-06-19 16:57:29,841 sagemaker_pytorch_container.training INFO     Invoking user training scri


2020-06-19 16:57:41 Completed - Training job completed
Training seconds: 51
Billable seconds: 51
CPU times: user 518 ms, sys: 23.5 ms, total: 541 ms
Wall time: 3min 41s


## Deploy the trained model

After training, deploy your model to create a `predictor`. 

In [28]:
%%time

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(model_data= estimator.model_data,
                    entry_point = 'predict.py',
                    source_dir = 'source_pytorch',
                    role = role,                
                    framework_version='1.0')


# deploy your model to create a predictor
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


---------------!
CPU times: user 485 ms, sys: 20.9 ms, total: 506 ms
Wall time: 7min 32s


---
# Evaluating the Model

Once your model is deployed, you can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [29]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of your model

Use the deployed `predictor` to generate predicted class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that your model classified correctly.

In [30]:
# First: generate predicted class labels
import numpy as np
# round the predictor's outputs to 0/1 labels and drop the extra dimension
test_y_preds = np.squeeze(np.round(predictor.predict(test_x)))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [32]:
# Second: calculate the test accuracy

tp = np.logical_and(test_y, test_y_preds).sum()
fp = np.logical_and(1-test_y, test_y_preds).sum()
tn = np.logical_and(1-test_y, 1-test_y_preds).sum()
fn = np.logical_and(test_y, 1-test_y_preds).sum()
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(accuracy)

## print out the array of predicted and true labels, if you want
print(pd.crosstab(test_y, test_y_preds, rownames=['actuals'], colnames=['predictions']))


print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0
predictions  0.0  1.0
actuals              
0             10    0
1              0   15

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
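
The four counts used for accuracy (`tp`, `fp`, `tn`, `fn`) also give precision and recall, which can be more informative than accuracy when the classes are unbalanced. A minimal sketch in plain Python; the function name `binary_metrics` is hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here, not something the notebook specifies.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for sequences of 0/1 labels."""
    # Cast to int so float predictions such as 1.0 compare cleanly with labels.
    pairs = [(int(t), int(p)) for t, p in zip(y_true, y_pred)]

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + fp + tn + fn
    return {
        'accuracy': (tp + tn) / total if total else 0.0,
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

In the notebook this could be called as `binary_metrics(test_y.values, test_y_preds)`.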


----
## Clean up Resources


In [33]:
predictor.delete_endpoint()


### Emptying the S3 bucket

In [None]:
# delete every object in the bucket; the (now empty) bucket itself is kept

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()