# NOTE

We have created this notebook separately for two main reasons. 

1. As we saw earlier, our dataset is quite large. Even though we managed to save it to an S3 bucket, we could not read the whole file in a SageMaker notebook because of our limited resources.

2. Following on from the first point, we decided to shrink our dataset, which also let us balance the classes (spoiler vs. non-spoiler reviews) at the same time. Once we could read the dataset, we hit another blocker: limited computational power and high latency during the preprocessing steps.
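
The two points above can be sketched in a few lines of pandas: read the file in chunks so it never has to fit in memory at once, and keep the same number of rows per class. This is only an illustration under assumptions, not the code we actually ran: the function name `balanced_subset`, the column names `is_spoiler` and `review_text`, and the in-memory demo data are all hypothetical.

```python
import io

import pandas as pd


def balanced_subset(csv_source, label_col, n_per_class, chunksize=1000, seed=0):
    """Stream a CSV in chunks and keep at most n_per_class rows for each label."""
    kept, counts = [], {}
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        for label, group in chunk.groupby(label_col):
            remaining = n_per_class - counts.get(label, 0)
            if remaining > 0:
                take = group.head(remaining)
                counts[label] = counts.get(label, 0) + len(take)
                kept.append(take)
    # Shuffle so the classes are interleaved instead of grouped by chunk
    return pd.concat(kept).sample(frac=1, random_state=seed).reset_index(drop=True)


# Tiny in-memory stand-in for the real file: 90 non-spoilers, 10 spoilers
rows = ["is_spoiler,review_text"]
rows += [f"0,plain review {i}" for i in range(90)]
rows += [f"1,spoiler review {i}" for i in range(10)]
demo_csv = io.StringIO("\n".join(rows))

subset = balanced_subset(demo_csv, "is_spoiler", n_per_class=10, chunksize=25)
class_counts = subset["is_spoiler"].value_counts().to_dict()
```

Note that this sketch keeps the first rows it meets for each class rather than a random sample, so a truly random subset would need a different strategy (for example, sampling within each chunk).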

For these reasons, we decided to create two notebooks: one that explains the whole process in one go, and another that shows how we worked around the problem by handling the preprocessing steps in a separate notebook, leaving this notebook to show the rest of the steps.

We hope we didn't make the whole process hard to follow. :)

In [None]:
bucket = 'imdb-review-dataset'
train_data_key = 'train.csv'
test_data_key = 'test.csv'
word_dict_key = 'word_dict.pkl'
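
The training cell further down passes an `input_data` variable that is not defined in this notebook. One plausible way to build such a location from the bucket and keys above is a plain S3 URI, sketched here. The helper name `s3_uri` is hypothetical, and we assume the files sit at the bucket root.

```python
bucket = 'imdb-review-dataset'
train_data_key = 'train.csv'


def s3_uri(bucket, key=''):
    """Build an s3:// URI for a bucket, or for one object inside it."""
    return f's3://{bucket}/{key}' if key else f's3://{bucket}'


train_uri = s3_uri(bucket, train_data_key)   # location of a single file
bucket_uri = s3_uri(bucket)                  # location of the whole bucket
```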

In [None]:
import os

import pandas as pd
import torch
import torch.utils.data

# Read in only the first 250 rows (data_dir must point to the folder that holds train.csv)
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)
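
To make the expected file layout explicit, here is a NumPy-only sketch of the shapes the cell above works with: the label is in column 0 and the encoded review fills the remaining columns. The random data and the sizes (6 rows, 4 token ids) are made up for illustration. The manual batching mimics what a `DataLoader` does with its defaults (no shuffling, last batch kept even if smaller).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: label in column 0, 4 token ids after it
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    np.column_stack([rng.integers(0, 2, size=6), rng.integers(0, 100, size=(6, 4))])
)

# Same column selection as the cell above, kept as NumPy arrays
labels = frame[[0]].values.astype(np.float32).squeeze()      # shape (6,)
features = frame.drop([0], axis=1).values.astype(np.int64)   # shape (6, 4)

# Sequential batches; the final one holds whatever rows are left
batch_size = 4
batches = [
    (features[i:i + batch_size], labels[i:i + batch_size])
    for i in range(0, len(frame), batch_size)
]
```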

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [None]:
estimator.fit({'training': input_data})

In [None]:
# Deploy the trained model
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
import numpy as np

# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    # Use enough chunks that no chunk holds more than `rows` rows
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))

    return predictions
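
To see how the chunking in `predict` behaves without calling the endpoint, the sketch below swaps in a stub. `StubPredictor` and `predict_in_chunks` are hypothetical names used only here, and the 1100 x 500 input is made up. The chunk-count formula is the same one used above.

```python
import numpy as np


class StubPredictor:
    """Hypothetical stand-in for the endpoint: returns 0.5 per row, records chunk sizes."""

    def __init__(self):
        self.chunk_sizes = []

    def predict(self, array):
        self.chunk_sizes.append(array.shape[0])
        return np.full(array.shape[0], 0.5)


def predict_in_chunks(data, predictor, rows=512):
    # Same chunk count as predict(): floor(n_rows / rows) + 1
    n_chunks = int(data.shape[0] / float(rows) + 1)
    out = np.array([])
    for chunk in np.array_split(data, n_chunks):
        out = np.append(out, predictor.predict(chunk))
    return out


stub = StubPredictor()
data = np.zeros((1100, 500), dtype=np.int64)
result = predict_in_chunks(data, stub)
chunk_sizes = stub.chunk_sizes   # [367, 367, 366]
```

With 1100 rows and `rows=512`, the data is split into three nearly equal chunks rather than two full ones and a remainder, and every chunk stays at or below the 512-row limit.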

In [None]:
predictions = predict(X_test.values)
predictions = [round(num) for num in predictions]

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y, predictions)
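
One detail worth knowing about the rounding step: Python's built-in `round` rounds halves to the nearest even integer, so a score of exactly 0.5 becomes 0. The sketch below compares that with an explicit `>= 0.5` threshold and computes accuracy by hand. The scores and labels are made up, and we assume the model outputs probabilities between 0 and 1.

```python
import numpy as np

# Made-up scores and labels, for illustration only
scores = [0.1, 0.5, 0.7, 0.49]
labels = np.array([0, 1, 1, 0])

# Built-in round(): 0.5 goes to 0 (round half to even)
rounded = [round(s) for s in scores]

# Explicit threshold: 0.5 counts as the positive class
thresholded = (np.asarray(scores) >= 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
accuracy_rounded = float(np.mean(np.asarray(rounded) == labels))
accuracy_thresholded = float(np.mean(thresholded == labels))
```

Scores of exactly 0.5 are rare in practice, so the two approaches usually agree, but the explicit threshold makes the decision rule visible.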

In [None]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'

In [None]:
# Convert test_review into a form usable by the model and save the results in test_data
test_data = review_to_words(test_review)
test_data = [np.array(convert_and_pad(word_dict, test_data)[0])]
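
`review_to_words` and `convert_and_pad` come from the preprocessing notebook and are not shown here. From the call above we only know that `convert_and_pad` returns a tuple whose first element is the encoded review. The sketch below is one plausible shape for such a helper, not our actual implementation: the name `convert_and_pad_sketch`, the ids 0 and 1 for padding and unknown words, the default length of 500, and the toy dictionary are all assumptions.

```python
import numpy as np


def convert_and_pad_sketch(word_dict, tokens, pad=500):
    """Map tokens to integer ids and pad or truncate to a fixed length."""
    NOWORD, INFREQ = 0, 1   # assumed ids: padding and out-of-vocabulary
    padded = [NOWORD] * pad
    for i, word in enumerate(tokens[:pad]):
        padded[i] = word_dict.get(word, INFREQ)
    # Return the sequence and its true (unpadded) length
    return padded, min(len(tokens), pad)


# Toy vocabulary and tokens, for illustration only
toy_dict = {'simplest': 2, 'pleasur': 3, 'life': 4}
ids, length = convert_and_pad_sketch(toy_dict, ['simplest', 'pleasur', 'unseen'], pad=5)

# Same wrapping as the cell above: a list holding one array
batch = [np.array(ids)]
```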

In [None]:
predictor.predict(test_data)