Our goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features you provide to the model.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.


---

## Load Data to S3

In the last notebook, we created two files, `train.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, we can upload the locally stored data to S3.

In [24]:
import pandas as pd
import boto3
import sagemaker

In [None]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

Specify the `data_dir` where the `train.csv` file is saved. Decide on a descriptive `prefix` that defines where the data will be uploaded in the default S3 bucket. Finally, create a pointer to the training data by calling `sagemaker_session.upload_data` and passing in the required parameters.

In [26]:
# name of the directory where the features data was saved
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_detection'

# upload all data to S3
input_data = sagemaker_session.upload_data(data_dir, key_prefix=prefix)
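
When `upload_data` is given a directory, each file ends up under the chosen prefix, as the listing in the test cell below this section suggests. The following is a minimal local sketch of that key layout only. `planned_s3_keys` is a hypothetical helper written for illustration, not part of the SageMaker SDK, and it does not contact S3.

```python
import os
import tempfile

def planned_s3_keys(local_dir, key_prefix):
    """Hypothetical helper: list the S3 keys a directory upload would be expected to create."""
    keys = []
    for dirpath, _, filenames in os.walk(local_dir):
        for name in filenames:
            # path of the file relative to the uploaded directory
            rel = os.path.relpath(os.path.join(dirpath, name), start=local_dir)
            # S3 keys always use forward slashes
            keys.append(key_prefix + '/' + rel.replace(os.sep, '/'))
    return sorted(keys)

# stand-in directory with two empty files (assumption: mirrors plagiarism_data)
with tempfile.TemporaryDirectory() as tmp:
    for name in ('train.csv', 'test.csv'):
        open(os.path.join(tmp, name), 'w').close()
    keys = planned_s3_keys(tmp, 'plagiarism_detection')

print(keys)
```

Checking the expected keys locally like this can help catch a wrong `data_dir` or `prefix` before any data is transferred.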

### Test cell

Test that your data has been successfully uploaded. The below cell prints out the items in the S3 bucket and will throw an error if it is empty. 

In [27]:
# confirm that data is in S3 bucket
empty_check = []

for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detection/test.csv
plagiarism_detection/train.csv
Test passed!


---

## Modeling

Now that we've uploaded the training data, it's time to define and train a model!

For binary classification, you can choose one of three routes:
* Use a built-in classification algorithm, like LinearLearner.
* Define a custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* Define a custom PyTorch neural network classifier. 
 
---

### Training a KNN classifier

The `source_sklearn` folder contains a `train.py` training script, which is responsible for training a KNN (k-nearest neighbors) model.

A typical training script:
* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs)
* Instantiates a model of your design, with any specified hyperparameters
* Trains that model
* Finally, saves the model so that it can be hosted/deployed later
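
The hyperparameter-parsing step can be sketched as follows. This is a minimal sketch, not the contents of `train.py`: it assumes SageMaker script mode, where hyperparameters arrive as command-line arguments and the container sets the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables. The function name `parse_args` and the local fallback paths are assumptions for illustration.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse SageMaker-provided locations and model hyperparameters."""
    parser = argparse.ArgumentParser()

    # SageMaker sets these environment variables inside the training container;
    # the fallback values are placeholder local paths (assumptions) for local runs.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'plagiarism_data'))

    # model hyperparameter, passed on the command line as `--n_neighbors 5`
    parser.add_argument('--n_neighbors', type=int, default=5)

    return parser.parse_args(argv)

# simulate the arguments the container would pass to the script
args = parse_args(['--n_neighbors', '7'])
print(args.n_neighbors, args.model_dir, args.data_dir)
```

Accepting `argv` as a parameter keeps the parser easy to exercise outside a training job.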


In [3]:
!pygmentize source_sklearn/train.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

# Import any additional libraries you need to define a model
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn

---
### Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn`.
* **role**: The role ARN, which was specified above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name': value, ..}` passed to the training script as hyperparameters.

In [None]:
# import the sklearn estimator 
from sagemaker.sklearn.estimator import SKLearn

KNN_estimator = SKLearn(entry_point='train.py',
                        source_dir='source_sklearn',
                        role=role,
                        train_instance_count=1,
                        train_instance_type='ml.c4.xlarge',
                        sagemaker_session=sagemaker_session,
                        hyperparameters={
                            'n_neighbors': 5
                        })
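
To make the `n_neighbors` hyperparameter concrete, here is a minimal brute-force sketch of k-nearest-neighbors classification in plain Python. It is an illustration of the idea only, not how Scikit-learn's `KNeighborsClassifier` is implemented. The function `knn_predict` and the toy feature values are assumptions, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, n_neighbors=5):
    """Label `query` by majority vote among its nearest training points."""
    # Euclidean distance from the query to every training row
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_x, train_y)]
    # closest training points first
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:n_neighbors]]
    # the most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# toy similarity features (made-up values): high similarity -> plagiarized (1)
train_x = [[0.9, 0.8], [0.85, 0.9], [0.95, 0.7],
           [0.1, 0.2], [0.2, 0.1], [0.15, 0.05]]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_x, train_y, [0.8, 0.75], n_neighbors=3))
print(knn_predict(train_x, train_y, [0.1, 0.1], n_neighbors=3))
```

A larger `n_neighbors` smooths the decision boundary, while a smaller value makes predictions more sensitive to individual training points.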


## Train the estimator

Train the estimator on the training data stored in S3. This will create a training job that we can monitor in the SageMaker console.

In [37]:
%%time

# Train the estimator on S3 training data
KNN_estimator.fit({'train': input_data})


2019-12-18 12:49:15 Starting - Starting the training job...
2019-12-18 12:49:18 Starting - Launching requested ML instances......
2019-12-18 12:50:45 Starting - Preparing the instances for training.........
2019-12-18 12:52:11 Downloading - Downloading input data
2019-12-18 12:52:11 Training - Downloading the training image..
2019-12-18 12:52:24,852 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-12-18 12:52:24,856 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-12-18 12:52:24,877 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-12-18 12:52:25,154 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2019-12-18 12:52:25,154 sagemaker-containers INFO     Generating setup.cfg
2019-12-18 12:52:25,154 sagemaker-containers INFO     Generating MANIFEST.in
2019-12-18 12:52:25,154

## Deploy the trained model

To deploy a trained model, we'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

In [31]:
%%time

# deploy the model to create a predictor
predictor = KNN_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

--------------------------------------------------------------------------------------!
CPU times: user 440 ms, sys: 31.9 ms, total: 472 ms
Wall time: 7min 13s


---
## Evaluating The Model

Once our model is deployed, you can see how it performs when applied to our test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [33]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

### Determine the accuracy of your model

Use the deployed `predictor` to generate predicted class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1 that indicates the fraction of test data that your model classified correctly.


In [34]:
# First: generate predicted class labels
test_y_preds = predictor.predict(test_x)

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [36]:
# Second: calculate the test accuracy
accuracy = (test_y_preds==test_y).sum()/test_y_preds.size
print(accuracy)

## print out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

0.96

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
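
Accuracy alone does not show which kind of mistake the classifier made. As a small follow-up sketch, the printed labels above can be tallied into confusion-matrix counts in plain Python. The variable names here are chosen for illustration and are not part of the notebook.

```python
# labels copied from the output above
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
             1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
          1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

pairs = list(zip(predicted, actual))
tp = sum(p == 1 and t == 1 for p, t in pairs)  # plagiarized, flagged
tn = sum(p == 0 and t == 0 for p, t in pairs)  # not plagiarized, not flagged
fp = sum(p == 1 and t == 0 for p, t in pairs)  # flagged by mistake
fn = sum(p == 0 and t == 1 for p, t in pairs)  # plagiarized but missed

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print('TP, TN, FP, FN:', tp, tn, fp, fn)  # → TP, TN, FP, FN: 14 10 0 1
print('accuracy:', accuracy)
print('precision:', precision, 'recall:', round(recall, 3))
```

On this test set the single error is a false negative: one plagiarized answer was labeled as not plagiarized, and no original answer was flagged by mistake.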


----
## Clean up Resources

In [39]:
# delete the model endpoint
predictor.delete_endpoint()

### Deleting S3 bucket

When we are *completely* done with training and testing models, we can also empty the S3 bucket. The cell below deletes every object stored in the bucket; the bucket itself remains.

In [40]:
# delete all objects in the bucket
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '8DB70868B62EA604',
   'HostId': 'JLLqVTXGpxUDeccStMh8NJc1Yf2f4wfM43JZH1zYaGTWfZ8vW2tMHMf2WWtRPfZu4UrZb7a6M7M=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'JLLqVTXGpxUDeccStMh8NJc1Yf2f4wfM43JZH1zYaGTWfZ8vW2tMHMf2WWtRPfZu4UrZb7a6M7M=',
    'x-amz-request-id': '8DB70868B62EA604',
    'date': 'Wed, 18 Dec 2019 13:00:02 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-scikit-learn-2019-12-18-12-36-19-266/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-12-18-12-49-15-603/output/model.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-12-18-12-36-19-266/output/model.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-12-18-12-49-15-603/debug-output/training_job_end.ts'},
   {'Key': 'plagiarism_detection/train.csv'},
   {'Key': 'sagemaker-scikit-learn-2019-12-18-12-49-15-603/source/so

### Deleting all models and instances

When we are _completely_ done with this project and do **not** ever want to revisit this notebook, we can choose to delete all of our SageMaker notebook instances and models by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html).

---
## Further Directions

There are many ways to improve or add on to this project; a few ideas are listed below:
* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy the model to a web application.
