## Final Project Day 2: Use Multi-layer Neural Net or Recurrent Neural Networks for the Product Safety Dataset

We continue to work with the final project dataset. This time, you can add more layers to your neural network or try Recurrent Neural Networks (RNNs).

Implement the model, then train and test it with the corresponding datasets. You can use these notebooks as a starting point: __MLA-NLP-DAY2-NN-NB__ and __MLA-NLP-DAY2-RNN-NB__.

You can follow these steps:
1. Read training-test data (Given)
2. Train a neural network (Implement)
3. Make predictions on your test dataset (Implement)
4. Write your test predictions to a CSV file (Given)

In [None]:
# Upgrade dependencies
!pip install -r ../../requirements.txt

In [1]:
import boto3
import os
from os import path
import pandas as pd
import numpy as np
import re, time

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import torch, torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
from torch.utils.data import TensorDataset, DataLoader
from torchtext.vocab import GloVe
from torch import nn, optim


## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. Let's first load the training and test files.

#### __Training data:__

In [2]:
train_df = pd.read_csv('../../data/final_project/training.csv', encoding='utf-8', header=0)
train_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title,human_tag
0,47490,15808037321,"I ordered a sample of the Dietspotlight Burn, ...",6/25/2018 17:51,1,DO NOT BUY!,0
1,16127,16042300811,This coffee tasts terrible as if it got burnt ...,2/8/2018 15:59,2,Coffee not good,0
2,51499,16246716471,I've been buying lightly salted Planters cashe...,3/22/2018 17:53,2,"Poor Quality - Burnt, Shriveled Nuts With Blac...",0
3,36725,14460351031,This product is great in so many ways. It goes...,12/7/2017 8:49,4,"Very lovey product, good sunscreen, but strong...",0
4,49041,15509997211,"My skin did not agree with this product, it wo...",3/21/2018 13:51,1,Not for everyone. Reactions can be harsh.,1


#### __Test data:__

In [3]:
test_df = pd.read_csv('../../data/final_project/test.csv', encoding='utf-8', header=0)
test_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title
0,62199,15449606311,"Quality of material is great, however, the bac...",3/7/2018 19:47,3,great backpack with strange fit
1,76123,15307152511,The product was okay but wasn't refined campho...,43135.875,2,Not refined
2,78742,12762748321,I normally read the reviews before buying some...,42997.37708,1,"Doesnt work, wouldnt recommend"
3,64010,15936405041,These pads are completely worthless. The light...,43313.25417,1,The lighter colored side of the pads smells li...
4,17058,13596875291,The saw works great but the blade oiler does n...,12/5/2017 20:17,2,The saw works great but the blade oiler does n...


In [4]:
train_df["human_tag"].value_counts()

0    53375
1     9759
Name: human_tag, dtype: int64
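
The counts above show that the two classes are far from balanced, which matters when reading accuracy numbers later on. As a rough reference point, the following sketch computes the accuracy of a trivial model that always predicts the majority class. It uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 53375, 1: 9759}

total = sum(counts.values())
positive_rate = counts[1] / total
# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = counts[0] / total

print(f"Total reviews: {total}")
print(f"Share of class 1: {positive_rate:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly 15% of the training reviews are tagged 1, so a model needs to beat an accuracy of about 0.85 before it adds anything over always predicting 0. Precision and recall on class 1 are usually more informative here.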

In [5]:
print(train_df.isna().sum())

ID             0
doc_id         0
text           6
date           0
star_rating    0
title          1
human_tag      0
dtype: int64


In [6]:
train_df['text'] = train_df['text'].fillna("missing")

## 2. Train a Classifier

In [7]:
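# Implement this
# Let's first process the text data
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources used below (stop word list and tokenizer models)
# Newer NLTK releases may also require nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("punkt")

print("Fixing missing values...")
# Fill any remaining missing values (assign back instead of using inplace on a column)
train_df["text"] = train_df["text"].fillna("")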

print("Splitting data into training and validation...")
X_train, X_val, y_train, y_val = train_test_split(
    train_df[["text"]],
    train_df["human_tag"].values,
    test_size=0.10,
    shuffle=True,
    random_state=324,
)

# Stop words removal and stemming
# Let's get a list of stop words from the NLTK library
stop = stopwords.words("english")

# These words are important for our problem. We don't want to remove them.
excluding = [
    "against",
    "not",
    "don",
    "don't",
    "ain",
    "aren",
    "aren't",
    "couldn",
    "couldn't",
    "didn",
    "didn't",
    "doesn",
    "doesn't",
    "hadn",
    "hadn't",
    "hasn",
    "hasn't",
    "haven",
    "haven't",
    "isn",
    "isn't",
    "mightn",
    "mightn't",
    "mustn",
    "mustn't",
    "needn",
    "needn't",
    "shouldn",
    "shouldn't",
    "wasn",
    "wasn't",
    "weren",
    "weren't",
    "won",
    "won't",
    "wouldn",
    "wouldn't",
]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer("english")

def process_text(texts):
    final_text_list = []
    for sent in texts:

        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence = []

        # Lowercase
        sent = sent.lower()
        # Remove leading/trailing whitespace
        sent = sent.strip()
        # Collapse runs of whitespace (spaces, tabs, newlines) into a single space
        sent = re.sub(r"\s+", " ", sent)
        # Remove HTML tags/markups:
        sent = re.sub(r"<.*?>", "", sent)

        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)  # final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list

print("Processing the text fields...")
X_train["text"] = process_text(X_train["text"].tolist())
X_val["text"] = process_text(X_val["text"].tolist())

# Use TF-IDF to turn each text into a vector of length 750.
tf_idf_vectorizer = TfidfVectorizer(max_features=750)

# Fit the vectorizer to training data
# Don't use the fit() on validation or test datasets
tf_idf_vectorizer.fit(X_train["text"].values)

print("Transforming the text fields (Bag of Words)...")
# Transform text fields
X_train = tf_idf_vectorizer.transform(X_train["text"].values).toarray()
X_val = tf_idf_vectorizer.transform(X_val["text"].values).toarray()

print("Shapes of features: Training and Validation")
print(X_train.shape, X_val.shape)
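
The RNN cells that follow work on sequences of token indices rather than TF-IDF vectors, and they rely on names such as `vocab`, `pad_features`, `max_len`, `train_loader` and `val_loader` whose definitions are not shown in this section. The sketch below illustrates one plausible way to turn text into fixed-length index sequences. Everything in it is an assumption for illustration: the whitespace tokenizer, the `build_vocab` helper, and the `pad_features(texts, stoi, max_len)` signature (the notebook's own version is called with two arguments). The padding index of 1 is chosen to match `padding_idx=1` in the model below.

```python
from collections import Counter

# Hypothetical special tokens; index 1 for padding matches padding_idx=1 in the model below
UNK_IDX, PAD_IDX = 0, 1


def build_vocab(texts, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counter = Counter(token for text in texts for token in text.lower().split())
    itos = ["<unk>", "<pad>"] + sorted(t for t, c in counter.items() if c >= min_freq)
    stoi = {token: idx for idx, token in enumerate(itos)}
    return itos, stoi


def pad_features(texts, stoi, max_len):
    """Convert texts to index lists, truncated or right-padded to max_len."""
    rows = []
    for text in texts:
        ids = [stoi.get(token, UNK_IDX) for token in text.lower().split()][:max_len]
        ids += [PAD_IDX] * (max_len - len(ids))
        rows.append(ids)
    return rows


itos, stoi = build_vocab(["the blade broke", "the product is great"])
features = pad_features(["the blade is sharp", "great"], stoi, max_len=5)
print(features)
```

In the notebook setting, the resulting lists would still need to be wrapped in `torch.tensor(...)` before being passed to `TensorDataset`. Unknown words such as "sharp" map to the `<unk>` index, and short texts are filled up with the padding index.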

In [24]:
# Size of the state vectors
hidden_size = 128

# General NN training parameters
learning_rate = 0.0005    # was 0.0001
epochs = 15

# Embedding vector and vocabulary sizes
embed_size = 300  # glove.6B.300d.txt
vocab_size = len(vocab.itos)

In [17]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=1)
        self.rnn = nn.RNN(
            embed_size, hidden_size, num_layers=num_layers, batch_first=True
        )

        self.linear = nn.Linear(hidden_size, 1)  
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # Call the RNN layer
        outputs, _ = self.rnn(embeddings)
        
        # Output shape after RNN: (batch_size, max_len, hidden_size)
        # Get the output from the last time step with outputs[:, -1, :] below
        # Indexing drops the time dimension, so the shape becomes: (batch_size, hidden_size)
        # Send it to the linear layer
        outs = self.linear(outputs[:, -1, :])
        return self.act(outs)
    
# Initialize the weights
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    if isinstance(m, nn.RNN):
        # Xavier-initialize the weight matrices and leave the biases at their defaults
        for name, param in m.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)
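
To make the role of `outputs[:, -1, :]` more concrete, here is a minimal pure-Python sketch of what a single-layer tanh RNN computes for one sequence: each hidden state is `tanh(W_ih x_t + W_hh h_{t-1} + b)`, and the last hidden state is what the linear layer receives. The weights and inputs are made-up toy values, the helper `rnn_forward` is hypothetical, and the two bias vectors of `nn.RNN` are merged into one for brevity.

```python
import math


def rnn_forward(inputs, w_ih, w_hh, bias):
    """Run a single-layer tanh RNN over one sequence and return all hidden states.

    inputs: list of time steps, each a list of length input_size
    w_ih:   hidden_size x input_size weights
    w_hh:   hidden_size x hidden_size weights
    bias:   list of length hidden_size
    """
    hidden = [0.0] * len(bias)  # initial state h_0 is all zeros
    outputs = []
    for x_t in inputs:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w * x for w, x in zip(w_ih[j], x_t))
            total += sum(w * h for w, h in zip(w_hh[j], hidden))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
        outputs.append(hidden)
    return outputs


# Toy sizes: 3 time steps, input_size=2, hidden_size=2
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_ih = [[0.5, -0.5], [0.25, 0.75]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]

outputs = rnn_forward(sequence, w_ih, w_hh, bias)
last_state = outputs[-1]  # plays the role of outputs[:, -1, :] for one sample
print(len(outputs), len(last_state))
```

There is one hidden state per time step, and only the final one summarizes the whole sequence. This is why the classifier head in `Net` reads the last time step only.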

In [18]:
# Our architecture with 2 RNN layers
model = Net(vocab_size, embed_size, hidden_size, num_layers=2)

# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)
# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

In [19]:
model

Net(
  (embedding): Embedding(31128, 300, padding_idx=1)
  (rnn): RNN(300, 128, num_layers=2, batch_first=True)
  (linear): Linear(in_features=128, out_features=1, bias=True)
  (act): Sigmoid()
)

In [20]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
# reduction="sum" sums the losses for given output and target
cross_ent_loss = nn.BCELoss(reduction="sum")
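
Because the loss uses `reduction="sum"`, each batch contributes the sum of its per-sample losses, and the training loop later divides the accumulated total by the number of samples. The sketch below reproduces that bookkeeping in plain Python with made-up probabilities and labels; `bce_sum` is a hypothetical helper, not part of PyTorch.

```python
import math


def bce_sum(probabilities, targets):
    """Binary cross-entropy summed over samples (like reduction="sum")."""
    return sum(
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probabilities, targets)
    )


# Two toy batches of predicted probabilities and true labels
batches = [
    ([0.9, 0.2, 0.7], [1, 0, 1]),
    ([0.4, 0.1], [0, 0]),
]

epoch_loss = sum(bce_sum(p, t) for p, t in batches)
num_samples = sum(len(t) for _, t in batches)
average_loss = epoch_loss / num_samples
print(f"Average loss per sample: {average_loss:.4f}")
```

Summing per batch and dividing once at the end gives the same per-sample average regardless of batch size, including a smaller final batch.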


In [25]:
# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device = ", device)

model.apply(init_weights)
model.to(device)

for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop, train the network
    for data, target in train_loader:
        trainer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output, target.unsqueeze(1))
        training_loss += L.item()
        L.backward()
        trainer.step()

    # Validate the network, no training (no weight update)
    # torch.no_grad() skips gradient tracking, which saves memory and time
    with torch.no_grad():
        for data, target in val_loader:
            val_predictions = model(data.to(device))
            L = cross_ent_loss(val_predictions, target.to(device).unsqueeze(1))
            val_loss += L.item()

    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}"
    )

device =  cuda
Epoch 0. Train_loss 0.4263557529080212. Val_loss 0.40951640235120296. Seconds 8.20822262763977
Epoch 1. Train_loss 0.40082598113543716. Val_loss 0.3749909591251889. Seconds 7.89204216003418
Epoch 2. Train_loss 0.37548782596366587. Val_loss 0.34988980433742267. Seconds 7.936044216156006
Epoch 3. Train_loss 0.36810157924250625. Val_loss 0.3504086048270012. Seconds 7.932371377944946
Epoch 4. Train_loss 0.35429563384708995. Val_loss 0.3427843894994746. Seconds 7.935791015625
Epoch 5. Train_loss 0.3535374701673487. Val_loss 0.36795455426931156. Seconds 8.05828309059143
Epoch 6. Train_loss 0.35419051777263294. Val_loss 0.3374540866988348. Seconds 7.924380779266357
Epoch 7. Train_loss 0.3438515128908087. Val_loss 0.3338036750138542. Seconds 8.066648483276367
Epoch 8. Train_loss 0.39018011613447706. Val_loss 0.40996785567159477. Seconds 7.904617071151733
Epoch 9. Train_loss 0.3973140641863352. Val_loss 0.3616157185249703. Seconds 7.954876661300659
Epoch 10. Train_loss 0.39236197

In [26]:
val_predictions = []
for data, target in val_loader:
    val_preds = model(data.to(device))
    val_predictions.extend(
        [np.rint(val_pred)[0] for val_pred in val_preds.detach().cpu().numpy()]
    )
print(val_predictions[:10])

[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [27]:
# Compare the validation predictions from the trained network with the true labels
print(confusion_matrix(val_label, val_predictions))
print(classification_report(val_label, val_predictions))
print("Accuracy (validation):", accuracy_score(val_label, val_predictions))

[[5121  212]
 [ 628  353]]
              precision    recall  f1-score   support

           0       0.89      0.96      0.92      5333
           1       0.62      0.36      0.46       981

    accuracy                           0.87      6314
   macro avg       0.76      0.66      0.69      6314
weighted avg       0.85      0.87      0.85      6314

Accuracy (validation): 0.8669623059866962
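
The class-1 row of the report can be re-derived directly from the confusion matrix, which helps when checking what each number means. The sketch below uses only the four counts printed above (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows are true labels, columns are predictions
tn, fp = 5121, 212
fn, tp = 628, 353

precision = tp / (tp + fp)  # of the reviews flagged as 1, how many are truly 1
recall = tp / (tp + fn)     # of the reviews that are truly 1, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The values agree with the class-1 line of the report. The low recall shows that the model misses most of the reviews tagged 1, even though the overall accuracy looks reasonable.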


## 3. Make predictions on your test dataset

In [28]:
# Implement this
# Fill missing review text the same way as in training
test_text = test_df["text"].fillna(value="missing").tolist()

# The test set has no labels, so the dataset holds only the padded features
test_dataset = TensorDataset(pad_features(test_text, max_len))
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [29]:
test_predictions = []
for data, in test_loader:
    test_preds = model(data.to(device))
    test_predictions.extend(
        [np.rint(test_pred)[0] for test_pred in test_preds.detach().cpu().numpy()]
    )
print(test_predictions[:10])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
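
The predictions above come from rounding the sigmoid output, which amounts to a decision threshold of 0.5. Given the class imbalance, it may be worth experimenting with a different threshold on the validation set. The following is a minimal sketch of that idea with a hypothetical `to_labels` helper and made-up probabilities; it is not part of the original notebook.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels using a decision threshold."""
    # Note: this differs from np.rint only at exactly 0.5, which np.rint rounds down to 0
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities, for illustration only
example_probs = [0.05, 0.35, 0.48, 0.62, 0.91]

default_labels = to_labels(example_probs)                # threshold of 0.5
lenient_labels = to_labels(example_probs, threshold=0.3)  # flags more reviews as 1
print(default_labels)
print(lenient_labels)
```

A lower threshold generally raises recall on class 1 at the cost of precision, so any threshold should be chosen on validation data rather than on the test set.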


## 4. Write your predictions to a CSV file
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://mlu.corp.amazon.com/contests/redirect/53

In [30]:
import pandas as pd
 
result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_predictions
 
#result_df.to_csv("../../data/final_project/project_day2_result.csv", encoding='utf-8', index=False)
result_df.to_csv("./project_day2_result.csv", encoding='utf-8', index=False)