## Practice with PyTorch: LSTMs and Spam/Ham Classifiers
In this tutorial, we'll practice using PyTorch and [Long Short-Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) neural network models to build a spam classifier for emails.

![emails](notebook_diagrams/email_classifier.png)

The reference for this tutorial can be found [here](https://github.com/sijoonlee/spam-ham-walkthrough/blob/master/walkthrough.ipynb).

## 0. Installation Block

### 0.1 Install PyTorch
We'll use Anaconda to install PyTorch on our AWS machines for this tutorial.  If you don't want to install this package through Anaconda, you can also do so through `pip`.

In [None]:
# Activate conda environment
! conda activate local_env

# Install PyTorch in Conda environment
! conda install -c pytorch pytorch
! pip install torchvision

# Check PyTorch version
! pip show torch

# Use matplotlib inline version
%matplotlib inline

### 0.2 Import Packages
Here, since we're processing a lot of (possibly invalid) text data, we'll make use of the `pandas` library.

In [None]:
# For reading file paths
import os

# For processing data
import pandas as pd
from collections import Counter

### 0.3 Download Data

In [None]:
# You can download the data here: http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/index.html

# Download data
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron1.tar.gz
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron2.tar.gz
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron3.tar.gz
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron4.tar.gz
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron5.tar.gz
!wget http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron6.tar.gz

# Now unzip the data into the current directory
!tar -zxvf enron1.tar.gz
!tar -zxvf enron2.tar.gz
!tar -zxvf enron3.tar.gz
!tar -zxvf enron4.tar.gz
!tar -zxvf enron5.tar.gz
!tar -zxvf enron6.tar.gz

## 1. Pre-Processing
Like our computer vision applications, pre-processing of data will be important for this email classification problem as well.

### 1.1 Define Our File Reader 
We'll use our file reader to create our training and testing data for this neural network exercise.

In [None]:
import glob
import numpy as np
import random
import torch

class File_reader(object):
  def __init__(self):
    self.ham = []
    self.spam = []
    self.ham_paths = ["enron1/ham/*.txt", "enron2/ham/*.txt", "enron3/ham/*.txt", "enron4/ham/*.txt", "enron5/ham/*.txt", "enron6/ham/*.txt"]
    self.spam_paths = ["enron1/spam/*.txt", "enron2/spam/*.txt", "enron3/spam/*.txt", "enron4/spam/*.txt", "enron5/spam/*.txt", "enron6/spam/*.txt"]

  def read_file(self, path, minimum_word_count = 3, unnecessary =  ["-", ".", ",", "/", ":", "@"]):
    files  = glob.glob(path)
    content_list = []
    for file in files:
        with open(file, encoding="ISO-8859-1") as f:
            content = f.read()
            if len(content.split()) > minimum_word_count:
              content = content.lower()
              if len(unnecessary) != 0: # compare by value; "is not" tests identity
                  content = ''.join([c for c in content if c not in unnecessary])
              content_list.append(content)
    return content_list

  def cut_before_combine(self, data, max = 5000):
    if max != 0: # compare by value; "is not" tests identity
      if len(data) > max:
        random.shuffle(data)
        data = data[:max]
    return data

  def load_ham_and_spam(self, ham_paths = "default", spam_paths = "default", max = 5000): # 0 for no truncation

    if ham_paths == "default":
      ham_paths = self.ham_paths
    if spam_paths == "default":
      spam_paths = self.spam_paths

    self.ham = [ item for path in ham_paths for item in self.read_file(path) ]
    if max != 0:
      self.ham = self.cut_before_combine(self.ham, max)
    print("ham length ", len(self.ham))

    self.spam = [item for path in spam_paths for item in self.read_file(path) ]
    if max != 0:
      self.spam = self.cut_before_combine(self.spam, max)
    print("spam length ", len(self.spam))

    data = self.ham + self.spam

    ham_label = [0 for _ in range(len(self.ham))]
    spam_label = [1 for _ in range(len(self.spam))]

    label_tensor = torch.as_tensor(ham_label + spam_label, dtype = torch.int16)

    return data, label_tensor

  def print_sample(self, which ="both"): # ham, spam or both
    if which == "ham" or which == "both":
      idx = random.randrange(len(self.ham)) # randint includes its upper bound and could index past the end
      print("----------- ham sample -------------")
      print(self.ham[idx])
    if which == "spam" or which == "both":
      idx = random.randrange(len(self.spam)) # randint includes its upper bound and could index past the end
      print("----------- spam sample -------------")
      print(self.spam[idx])

In [None]:
# Make file reader object
fr = File_reader()

# Use file reader object to get data and labels
data, label = fr.load_ham_and_spam(ham_paths = "default", spam_paths = "default", max = 3000)
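
The cleaning rule inside `read_file` can be tried on in-memory strings without touching the Enron files. The sketch below is a hedged restatement of that rule; the helper name `clean_email` is hypothetical and is not part of the `File_reader` class.

```python
def clean_email(content, minimum_word_count=3,
                unnecessary=("-", ".", ",", "/", ":", "@")):
    """Return the cleaned text, or None when the message is too short to keep."""
    # Messages with minimum_word_count words or fewer are dropped
    if len(content.split()) <= minimum_word_count:
        return None
    # Lowercase first, then remove every character listed in `unnecessary`
    content = content.lower()
    return "".join(c for c in content if c not in unnecessary)

print(clean_email("Hello, World: buy now @ 5/5."))
print(clean_email("too short"))
```

Note that removing a character such as `@` leaves the spaces around it in place; `split()` later collapses them when the text is tokenized.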

### 1.2 Define Vocabulary Objects for Pre-Processing

In [None]:
vocabs = [vocab for seq in data for vocab in seq.split()]
# a = [  word for seq in ["a d","b d","c d"] for word in seq.split() ]
# ['a', 'd', 'b', 'd', 'c', 'd']

vocab_count = Counter(vocabs)
# Count words in the whole dataset

print(vocab_count)
# Counter({'the': 47430, 'to': 35684, 'and': 26245, 'of': 24176, 'a': 19290, 'in': 17442, 'you': 14258, ...

vocab_count = vocab_count.most_common(len(vocab_count))

vocab_to_int = {word : index+2 for index, (word, count) in enumerate(vocab_count)}
vocab_to_int.update({'__PADDING__': 0}) # index 0 for padding
vocab_to_int.update({'__UNKNOWN__': 1}) # index 1 for unknown word such as broken character

print(vocab_to_int)
# {'the': 2, 'to': 3, 'and': 4, 'of': 5, 'a': 6, 'in': 7, 'you': 8, 'for': 9, "'": 10, 'is': 11, ...
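
To see how the frequency-ranked index is built, here is a minimal sketch on a three-sentence toy corpus (not the Enron data). The name `toy_vocab_to_int` is made up for this illustration.

```python
from collections import Counter

corpus = ["a d", "b d", "c d"]  # toy corpus, stands in for `data`
words = [word for seq in corpus for word in seq.split()]
counts = Counter(words)

# Reserve 0 for padding and 1 for unknown words, then number the rest by frequency
toy_vocab_to_int = {"__PADDING__": 0, "__UNKNOWN__": 1}
for index, (word, _count) in enumerate(counts.most_common()):
    toy_vocab_to_int[word] = index + 2

print(toy_vocab_to_int)
```

The most frequent word (`d`) receives index 2, the first index after the two reserved slots.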

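**NOTE**: Notice how balanced the dataset is! With `max = 3000`, the file reader keeps at most 3000 ham and 3000 spam emails (see the `ham length` and `spam length` values printed when the data was loaded).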

### 1.3 Feature Engineering: Tokenization and Vectorization of Text Sequences

In [None]:
# Import pytorch package
import torch

# Tokenize & Vectorize sequences
vectorized_seqs = []
for seq in data: 
  vectorized_seqs.append([vocab_to_int[word] for word in seq.split()])

# Save the lengths of sequences
seq_lengths = torch.LongTensor(list(map(len, vectorized_seqs)))

# Add padding(0): start from an all-zero LongTensor, then copy each sequence in.
# (torch.autograd.Variable is deprecated; plain tensors are used directly.)
seq_tensor = torch.zeros((len(vectorized_seqs), int(seq_lengths.max())), dtype=torch.long)
for idx, (seq, seqlen) in enumerate(zip(vectorized_seqs, seq_lengths)):
  seq_tensor[idx, :seqlen] = torch.LongTensor(seq)
  

print(seq_lengths.max()) # tensor(30772)
print(seq_tensor[0]) # tensor([ 20,  77, 666,  ...,   0,   0,   0])
print(seq_lengths[0]) # tensor(412)

In [None]:
sample = "operations is digging out 2000 feet of pipe to begin the hydro test"

tokenized_sample = [ word for word in sample.split()]
print(tokenized_sample[:3]) # ['operations', 'is', 'digging']

vectorized_sample = [ vocab_to_int.get(word, 1) for word in tokenized_sample] # unknown word in dict marked as 1
print(vectorized_sample[:3]) # [424, 11, 14683]
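
The two steps above (look up each word, then pad to a common length) can be sketched in plain Python lists. The toy vocabulary and the helper names `vectorize` and `pad` are assumptions made for this illustration only.

```python
def vectorize(text, vocab, unknown_index=1):
    # Words missing from the vocabulary map to the unknown index
    return [vocab.get(word, unknown_index) for word in text.split()]

def pad(seqs, padding_index=0):
    # Right-pad every sequence to the length of the longest one
    longest = max(len(s) for s in seqs)
    return [s + [padding_index] * (longest - len(s)) for s in seqs]

toy_vocab = {"__PADDING__": 0, "__UNKNOWN__": 1, "the": 2, "hydro": 3, "test": 4}
vectors = [vectorize("the hydro test", toy_vocab),
           vectorize("the zebra", toy_vocab)]   # "zebra" is not in the vocabulary
lengths = [len(v) for v in vectors]             # true lengths, kept for packing later
padded = pad(vectors)

print(padded)
print(lengths)
```

Keeping `lengths` alongside the padded rows mirrors why the notebook stores `seq_lengths` next to `seq_tensor`.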

### 1.4 Define Our PyTorch DataLoader
DataLoaders are extremely important objects in PyTorch.  They let us customize how data is batched and fed into training and evaluation, and they give us a compact, well-defined place to apply data augmentation.  For a tutorial on how DataLoaders work, visit the link [here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

**NOTE**: The class below is a hand-written iterator rather than a subclass of PyTorch's own `DataLoader`, and data-loading code tends to differ from project to project, so it's not critical that you memorize its structure.  It is only important to know which methods define it and what each one is for:

- `__init__`: This is the constructor method called when this kind of `DataLoader` object is created (instantiated).
- `__iter__`: This method defines how the DataLoader iterates through the tensors it's given.
- `_next_index`: This method defines how the next index of the DataLoader is found.
- `__next__`: This method defines how the next element in the DataLoader is returned.
- `__len__`: This method defines the length of the DataLoader for iteration.

In [None]:
# Import data sampler from pytorch
import torch.utils.data.sampler as splr

# Create custom DataLoader that we'll use for loading training data into our training pipeline
class CustomDataLoader(object):
    
  # Constructor method
  def __init__(self, seq_tensor, seq_lengths, label_tensor, batch_size):
    self.batch_size = batch_size
    self.seq_tensor = seq_tensor
    self.seq_lengths = seq_lengths
    self.label_tensor = label_tensor
    self.sampler = splr.BatchSampler(splr.RandomSampler(self.label_tensor), self.batch_size, False)
    self.sampler_iter = iter(self.sampler)
  
  # This method defines how the DataLoader iterates
  def __iter__(self):
    self.sampler_iter = iter(self.sampler) # reset sampler iterator
    return self
  
  # This method defines how the next index of the DataLoader is found
  def _next_index(self):
    return next(self.sampler_iter) # may raise StopIteration
  
  # This method defines how the next element in the DataLoader is returned
  def __next__(self):
    index = self._next_index()

    subset_seq_tensor = self.seq_tensor[index]
    subset_seq_lengths = self.seq_lengths[index]
    subset_label_tensor = self.label_tensor[index]

    # order by length to use pack_padded_sequence()
    subset_seq_lengths, perm_idx = subset_seq_lengths.sort(0, descending=True)
    subset_seq_tensor = subset_seq_tensor[perm_idx]
    subset_label_tensor = subset_label_tensor[perm_idx]

    return subset_seq_tensor, subset_seq_lengths, subset_label_tensor

  # This method defines the length of the DataLoader for iteration
  def __len__(self):
    return len(self.sampler)
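
The iterator protocol used by `CustomDataLoader` can be sketched without PyTorch. The class below is a simplified stand-in (the name `TinyBatchLoader` and the toy data are hypothetical): it shuffles on every pass, yields batches, and sorts each batch longest-first, which is what `pack_padded_sequence` expects by default.

```python
import random

class TinyBatchLoader:
    def __init__(self, sequences, labels, batch_size, seed=0):
        self.sequences = sequences
        self.labels = labels
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.order = []
        self.position = 0

    def __iter__(self):
        # Reshuffle and rewind at the start of every pass
        self.order = list(range(len(self.sequences)))
        self.rng.shuffle(self.order)
        self.position = 0
        return self

    def __next__(self):
        if self.position >= len(self.order):
            raise StopIteration
        index = self.order[self.position:self.position + self.batch_size]
        self.position += self.batch_size
        # Sort the batch by sequence length, longest first
        index.sort(key=lambda i: len(self.sequences[i]), reverse=True)
        return [self.sequences[i] for i in index], [self.labels[i] for i in index]

    def __len__(self):
        # Number of batches, counting a final partial batch
        return -(-len(self.sequences) // self.batch_size)

toy_sequences = [[5], [7, 8, 9], [4, 4], [6, 6, 6, 6], [3, 2]]
toy_labels = [0, 1, 0, 1, 0]
loader = TinyBatchLoader(toy_sequences, toy_labels, batch_size=2)
for batch_seqs, batch_labels in loader:
    print([len(s) for s in batch_seqs], batch_labels)
```

Because `__iter__` resets the position, the same loader object can be used for one `for` loop per epoch.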



### 1.5 Split Data into Training and Testing
As with the other machine learning frameworks we've analyzed ...

In [None]:
shuffled_idx = torch.randperm(label.shape[0])

seq_tensor = seq_tensor[shuffled_idx]
seq_lengths = seq_lengths[shuffled_idx] # keep lengths aligned with the shuffled sequences
label = label[shuffled_idx]

PCT_TRAIN = 0.7
PCT_VALID = 0.2

length = len(label)

# Specify components of training dataset
train_seq_tensor = seq_tensor[:int(length*PCT_TRAIN)] 
train_seq_lengths = seq_lengths[:int(length*PCT_TRAIN)]
train_label = label[:int(length*PCT_TRAIN)]

# Specify components of validation dataset
valid_seq_tensor = seq_tensor[int(length*PCT_TRAIN):int(length*(PCT_TRAIN+PCT_VALID))] 
valid_seq_lengths = seq_lengths[int(length*PCT_TRAIN):int(length*(PCT_TRAIN+PCT_VALID))] 
valid_label = label[int(length*PCT_TRAIN):int(length*(PCT_TRAIN+PCT_VALID))]

# Specify components of testing dataset
test_seq_tensor = seq_tensor[int(length*(PCT_TRAIN+PCT_VALID)):]
test_seq_lengths = seq_lengths[int(length*(PCT_TRAIN+PCT_VALID)):]
test_label = label[int(length*(PCT_TRAIN+PCT_VALID)):]

# Display datasets
print(train_seq_tensor.shape) # torch.Size([4200, 30772])
print(valid_seq_tensor.shape) # torch.Size([1199, 30772])
print(test_seq_tensor.shape) # torch.Size([601, 30772])
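
The slightly uneven 1199/601 split printed above comes from floating-point truncation: `0.7 + 0.2` is stored as a value just under 0.9, so `int(...)` rounds the boundary down by one. The sketch below reproduces the arithmetic; the helper `split_sizes` is hypothetical, and using `round` is only one possible alternative, not something the notebook does.

```python
def split_sizes(n, pct_train, pct_valid, to_int=int):
    # Boundaries are computed the same way as the slices in the cell above
    train_end = to_int(n * pct_train)
    valid_end = to_int(n * (pct_train + pct_valid))
    return train_end, valid_end - train_end, n - valid_end

print(split_sizes(6000, 0.7, 0.2))                # (4200, 1199, 601)
print(split_sizes(6000, 0.7, 0.2, to_int=round))  # (4200, 1200, 600)
```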


### 1.6 Set Batch Size and Create DataLoaders
We can use our `CustomDataLoader` class defined above as our dataloader for this problem.  **NOTE**: We need to give these DataLoaders a batch size for them to be used in our training procedures.

In [None]:
# Batch size shared by all three loaders (CustomDataLoader draws batches with a RandomSampler on every pass)
batch_size = 80

# Create training data loader
train_loader = CustomDataLoader(train_seq_tensor, train_seq_lengths, train_label, batch_size)

# Create validation data loader
valid_loader = CustomDataLoader(valid_seq_tensor, valid_seq_lengths, valid_label, batch_size)

# Create testing data loader
test_loader = CustomDataLoader(test_seq_tensor, test_seq_lengths, test_label, batch_size)
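
With a batch size of 80 and the final partial batch kept (the `BatchSampler` above is created with `drop_last` set to `False`), the number of steps per pass follows from a ceiling division. A small sketch, assuming the dataset sizes printed in section 1.5; the helper name is made up for this example.

```python
import math

def batches_per_epoch(n_examples, batch_size, drop_last=False):
    # A trailing partial batch counts unless drop_last is set
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)

for name, n in [("train", 4200), ("valid", 1199), ("test", 601)]:
    print(name, batches_per_epoch(n, 80))
```

This is why the step counter in the training log advances by roughly 53 per epoch.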

## 2. Define LSTM Model and Parameters

### 2.1 Define LSTM Model
As mentioned at the beginning of this tutorial, we'll be using an LSTM model to predict whether an email is "spam" or "ham".  Let's define our model below, using PyTorch!  Below is an example where having the ability to customize different features of the network is quite helpful.

In [None]:
# Import nn module and RNN sub-module from PyTorch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Class for our LSTM model
class SpamHamLSTM(nn.Module):
    
    # Constructor method for model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_size, n_layers,\
                 drop_lstm=0.1, drop_out = 0.1):
        
        # Model inherits from nn.Module superclass
        super().__init__()
        
        # Specify other parameters
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # Embedding 
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_lstm, batch_first=True)
        
        # Dropout layer
        self.dropout = nn.Dropout(drop_out)
        
        # Linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        
    # Method for making predictions from inputs to outputs
    def forward(self, x, seq_lengths):

        # Embeddings
        embedded_seq_tensor = self.embedding(x)
                
        # Pack, remove pads
        packed_input = pack_padded_sequence(embedded_seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
        
        # LSTM
        packed_output, (ht, ct) = self.lstm(packed_input, None)
          # https://pytorch.org/docs/stable/_modules/torch/nn/modules/rnn.html
          # If `(h_0, c_0)` is not provided, both **h_0** and **c_0** default to zero

        # Unpack, recover padded sequence
        output, input_sizes = pad_packed_sequence(packed_output, batch_first=True)
       
        # Collect the last valid (non-padded) output of each sequence in the batch
        last_idxs = (input_sizes - 1).to(output.device) # use the tensor's own device rather than a global
        output = torch.gather(output, 1, last_idxs.view(-1, 1).unsqueeze(2).repeat(1, 1, self.hidden_dim)).squeeze(1) # [batch_size, hidden_dim]
        
        # Dropout and fully-connected layer
        output = self.dropout(output)
        output = self.fc(output).squeeze(-1) # [batch_size] when output_size is 1, even for a batch of one
               
        # Sigmoid function
        output = self.sig(output)
        
        return output
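
The `torch.gather` call in `forward` picks, for every sequence, the LSTM output at its last real timestep rather than at the last padded position. The same selection can be written as plain indexing on nested lists; this is only an illustration of the idea, with toy numbers and a hypothetical helper name.

```python
def last_valid_outputs(padded_outputs, lengths):
    # For each sequence, take the output at position (length - 1)
    return [sequence[length - 1] for sequence, length in zip(padded_outputs, lengths)]

# Batch of 2 sequences, padded to 3 timesteps, hidden size 2
padded_outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # true length 3
    [[0.7, 0.8], [0.9, 1.0], [0.0, 0.0]],  # true length 2, last step is padding
]
lengths = [3, 2]

last = last_valid_outputs(padded_outputs, lengths)
print(last)
```

Taking the final column instead would feed the all-zero padded output of the shorter sequence into the classifier.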


### 2.2 Specify Model and Training Hyperparameters
Now we can specify parameters that are critical for our LSTM model and training it effectively.

In [None]:
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int)
embedding_dim = 100 # int(vocab_size ** 0.25) # 15
hidden_dim = 15
output_size = 1
n_layers = 2

# See if we have GPU
device = "cuda" if torch.cuda.is_available() else "cpu" 

# Make network object using custom architecture from above
net = SpamHamLSTM(vocab_size, embedding_dim, hidden_dim, output_size, n_layers, \
                 0.2, 0.2)

# If we have GPU, move network from CPU --> GPU
net = net.to(device)

# Print network
print(net)
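
To get a feel for where the parameters of this model live, the counts can be worked out by hand. The sketch below assumes PyTorch's LSTM layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors) and uses a made-up vocabulary size of 50,000, since the real `vocab_size` depends on the data.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)

def count_params(vocab_size, embedding_dim, hidden_dim, output_size, n_layers):
    embedding = vocab_size * embedding_dim
    # First layer reads embeddings; later layers read the previous layer's hidden state
    lstm = lstm_layer_params(embedding_dim, hidden_dim)
    lstm += (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
    fc = hidden_dim * output_size + output_size
    return embedding, lstm, fc

embedding, lstm, fc = count_params(vocab_size=50_000, embedding_dim=100,
                                   hidden_dim=15, output_size=1, n_layers=2)
print(embedding, lstm, fc)
```

Under these assumptions the embedding table dominates the model size by a wide margin, which is worth keeping in mind when tuning `embedding_dim`.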

### 2.3 Specify Loss, Optimizer, and Scheduler
These are important for training our network efficiently and effectively.

**NOTE**: We didn't explicitly discuss schedulers in the previous tutorial, but if you're interested in learning more about them, you can do so [here](https://pytorch.org/docs/stable/optim.html).  Essentially, these objects make training more stable by dynamically adjusting the learning rate, in our case based on validation loss.
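
As a rough mental model of a reduce-on-plateau scheduler, the sketch below re-implements the core rule in plain Python: if the monitored loss fails to improve for more than `patience` consecutive checks, the learning rate is multiplied by `factor`. This is a simplified illustration, not PyTorch's implementation (it ignores options such as cooldown and a minimum learning rate), and the class name is hypothetical.

```python
class PlateauSketch:
    def __init__(self, lr, factor=0.5, patience=2, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.num_bad_epochs = 0

    def step(self, metric):
        # "Improved" means lower than the best value by a small relative margin
        if metric < self.best * (1 - self.threshold):
            self.best = metric
            self.num_bad_epochs = 0
        else:
            self.num_bad_epochs += 1
        # Too many checks without improvement: shrink the learning rate
        if self.num_bad_epochs > self.patience:
            self.lr *= self.factor
            self.num_bad_epochs = 0
        return self.lr

sketch = PlateauSketch(lr=0.03)
fake_val_losses = [0.5, 0.4, 0.41, 0.42, 0.43, 0.3]  # made-up numbers
lrs = [sketch.step(loss) for loss in fake_val_losses]
print(lrs)
```

The key point is that the scheduler needs the monitored metric (here a validation loss) at each step, not the epoch number.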

In [None]:
# loss and optimization functions
criterion = nn.BCELoss()

# Learning rate and optimizer
lr=0.03
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# We didn't mention this before, but using a scheduler is one way to achieve more stable training
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,\
                                                       mode = 'min', \
                                                      factor = 0.5,\
                                                      patience = 2)
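
Binary cross-entropy, the quantity `nn.BCELoss` computes, averages `-(y*log(p) + (1-y)*log(1-p))` over the batch. A plain-Python sketch of that formula (the function name `bce` is made up here) shows why an untrained model that outputs 0.5 for everything starts near ln 2 ≈ 0.693, which is close to the first values in the training log of section 3.1.

```python
import math

def bce(probabilities, labels):
    # Mean of the per-example binary cross-entropy terms
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(probabilities, labels)]
    return sum(terms) / len(terms)

print(bce([0.5, 0.5], [0, 1]))   # an undecided model
print(bce([0.1, 0.9], [0, 1]))   # a confident, correct model
```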

## 3. Train Model
Now that we've specified our model, our model hyperparameters, and our training hyperparameters, we are ready to train our email classifier on our data!

### 3.1 Training Loop

In [None]:
import numpy as np

# training params
epochs = 6 
counter = 0
print_every = 10
clip=5 # gradient clipping

# Specify this to tell the network it needs to train
net.train()

# Helper: mean loss and accuracy over one full pass through a loader
def evaluate(loader):
    losses, sums, sizes = [], [], []
    net.eval()
    with torch.no_grad(): # no gradients are needed for evaluation
        for seqs, lengths, labels in loader:
            seqs = seqs.to(device)
            lengths = lengths.to(device)
            labels = labels.to(device)
            output = net(seqs, lengths)

            # losses
            losses.append(criterion(output, labels.float()).item())

            # accuracy
            binary_output = (output >= 0.5).short() # short(): torch.int16
            right_or_not = torch.eq(binary_output, labels)
            sums.append(torch.sum(right_or_not).float().item())
            sizes.append(right_or_not.shape[0])
    net.train()
    return np.mean(losses), sum(sums) / sum(sizes)

# TRAINING LOOP - train for some number of epochs
val_losses = []
epochs_list = []
for e in range(epochs):

    for seq_tensor, seq_tensor_lengths, label in train_loader:
        counter += 1
               
        seq_tensor = seq_tensor.to(device)
        seq_tensor_lengths = seq_tensor_lengths.to(device)
        label = label.to(device)
 
        # get the output from the model
        output = net(seq_tensor, seq_tensor_lengths)
    
        # get the loss and backprop
        loss = criterion(output, label.float())
        optimizer.zero_grad() 
        loss.backward()
        
        # prevent the exploding gradient
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            val_loss, accuracy = evaluate(valid_loader)
            print("Epoch: {:2d}/{:2d}\t".format(e+1, epochs),
                  "Steps: {:3d}\t".format(counter),
                  "Loss: {:.6f}\t".format(loss.item()),
                  "Val Loss: {:.6f}\t".format(val_loss),
                  "Accuracy: {:.3f}".format(accuracy))

    # End of epoch: record the validation loss for plotting and let the
    # scheduler react to it (ReduceLROnPlateau expects a metric, not the epoch number)
    epoch_val_loss, _ = evaluate(valid_loader)
    epochs_list.append(e)
    val_losses.append(epoch_val_loss)
    scheduler.step(epoch_val_loss)
            
# Epoch:  1/ 6	 Steps:  10	 Loss: 0.693371	 Val Loss: 0.689860	 Accuracy: 0.530
# Epoch:  1/ 6	 Steps:  20	 Loss: 0.699150	 Val Loss: 0.667903	 Accuracy: 0.585
# Epoch:  1/ 6	 Steps:  30	 Loss: 0.631709	 Val Loss: 0.626028	 Accuracy: 0.651
# Epoch:  1/ 6	 Steps:  40	 Loss: 0.609348	 Val Loss: 0.538908	 Accuracy: 0.716
# Epoch:  1/ 6	 Steps:  50	 Loss: 0.435395	 Val Loss: 0.440515	 Accuracy: 0.780
# Epoch:  2/ 6	 Steps:  60	 Loss: 0.364830	 Val Loss: 0.312334	 Accuracy: 0.892
# Epoch:  2/ 6	 Steps:  70	 Loss: 0.177650	 Val Loss: 0.283867	 Accuracy: 0.901
# Epoch:  2/ 6	 Steps:  80	 Loss: 0.379663	 Val Loss: 0.360904	 Accuracy: 0.883
# Epoch:  2/ 6	 Steps:  90	 Loss: 0.399583	 Val Loss: 0.390520	 Accuracy: 0.857
# Epoch:  2/ 6	 Steps: 100	 Loss: 0.467552	 Val Loss: 0.480415	 Accuracy: 0.808
# Epoch:  3/ 6	 Steps: 110	 Loss: 0.239100	 Val Loss: 0.282348	 Accuracy: 0.896
# Epoch:  3/ 6	 Steps: 120	 Loss: 0.091864	 Val Loss: 0.252968	 Accuracy: 0.915
# Epoch:  3/ 6	 Steps: 130	 Loss: 0.160094	 Val Loss: 0.209478	 Accuracy: 0.934     
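
The training loop guards against exploding gradients by clipping the global gradient norm at `clip = 5`. The idea can be sketched in plain Python: compute the L2 norm over all gradient values and, if it exceeds the limit, rescale every value by the same factor. This is a simplified illustration of norm clipping, not the code behind `nn.utils.clip_grad_norm_`, and the helper name is hypothetical.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    # L2 norm over all gradient values taken together
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm > max_norm:
        # Rescale uniformly so the direction is kept and the norm becomes max_norm
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients, total_norm

clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=5)
print(norm_before, clipped)

untouched, _ = clip_by_global_norm([3.0, 4.0], max_norm=5)
print(untouched)
```

Gradients already within the limit pass through unchanged, so clipping only affects the occasional oversized step.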

### 3.2 Plot Validation Losses
As we've seen before with other tutorials, a great way to visualize how well a network is performing is to plot its losses (both training and testing/validation).

In [None]:
# Import for plotting
import matplotlib.pyplot as plt

# Make plot
plt.plot(epochs_list, val_losses, color="b")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Validation Loss of LSTM Classifier as a Function of Epochs")

# Show plot
plt.show()

## 4. Evaluate the LSTM Model
Now that we've trained our email classifier, let's measure its performance on our test dataset.

### 4.1 Testing Loop
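
The loop in this section thresholds each sigmoid output at 0.5 and pools the correct counts and batch sizes before dividing. Pooling matters because the last batch is usually smaller; averaging per-batch accuracies would give it too much weight. A minimal sketch with made-up probabilities and labels (the helper name `batch_counts` is hypothetical):

```python
def batch_counts(probabilities, labels, threshold=0.5):
    # Turn probabilities into 0/1 predictions, then count matches
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct, len(labels)

batches = [
    ([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 0]),  # full batch: 3 of 4 correct
    ([0.6], [1]),                          # small final batch: 1 of 1 correct
]

sums, sizes = [], []
for probabilities, labels in batches:
    correct, size = batch_counts(probabilities, labels)
    sums.append(correct)
    sizes.append(size)

accuracy = sum(sums) / sum(sizes)
mean_of_batch_accuracies = sum(c / s for c, s in zip(sums, sizes)) / len(sizes)
print(accuracy, mean_of_batch_accuracies)
```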

In [None]:
# Make counters for storing outputs from testing
test_losses = []
sums = []
sizes = []

# Use this to switch from "training" to "evaluation"
net.eval()

# TESTING/EVALUATION LOOP (no gradients are needed, so autograd tracking is disabled)
with torch.no_grad():
    for seq_tensor, seq_tensor_lengths, label in test_loader:

        seq_tensor = seq_tensor.to(device)
        seq_tensor_lengths = seq_tensor_lengths.to(device)
        label = label.to(device)
        output = net(seq_tensor, seq_tensor_lengths)

        # losses
        test_loss = criterion(output, label.float())
        test_losses.append(test_loss.item())

        # accuracy
        binary_output = (output >= 0.5).short() # short(): torch.int16
        right_or_not = torch.eq(binary_output, label)
        sums.append(torch.sum(right_or_not).float().item())
        sizes.append(right_or_not.shape[0])

accuracy = np.sum(sums) / np.sum(sizes)
print("Test Loss: {:.6f}\t".format(np.mean(test_losses)),
      "Accuracy: {:.3f}".format(accuracy))

## 5. Exercise: Try Improving the Network
Your turn!  Try modifying the following to see if you can improve network classification accuracy:

- Network architecture (the `SpamHamLSTM` class)
- Number of epochs
- Learning rate
- Optimizer
- (If you really want to, but not recommended) Scheduler
- (If you really want to, but not recommended) DataLoader Class

For reference, the baseline accuracy that you should try to improve is **0.927**.  Good luck!