## Final Project Day 3: Use LSTM or fine-tune BERT for the Product Safety Dataset

We continue to work with the final project dataset. This time you can use an [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) (used in much the same way as an RNN) or fine-tune a BERT model. Be careful with the BERT approach, as training may take a long time. You will again predict the __human_tag__ field of the dataset.

Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets.
You can follow these steps:
1. Read training-test data (Given)
2. Train a classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Write your test predictions to a CSV file (Given)
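
For the LSTM option, a minimal sketch of a classifier built on `torch.nn.LSTM` might look like the following. The class name `LSTMClassifier`, the layer sizes, and the random token ids are illustrative assumptions, not part of the class notebooks; real inputs would come from a vocabulary built on the review text.

```python
import torch
from torch import nn


class LSTMClassifier(nn.Module):
    """Hypothetical example: embedding -> LSTM -> linear layer over the last hidden state."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        # Index 0 is reserved for padding in this sketch
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])                # (batch, num_classes)


torch.manual_seed(0)
lstm_model = LSTMClassifier(vocab_size=100)
dummy_batch = torch.randint(1, 100, (4, 12))   # 4 made-up reviews of 12 token ids each
lstm_logits = lstm_model(dummy_batch)
print(lstm_logits.shape)
```

The logits have one column per class, so they can be passed to `nn.CrossEntropyLoss` together with the integer `human_tag` labels.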

In [1]:
# Upgrade dependencies
!pip install -r ../../requirements.txt
!pip install pytorch-pretrained-bert pytorch-nlp


In [1]:
import boto3
import os
from os import path
import pandas as pd

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset from the `../../data/final_project` directory.

#### __Training data:__

In [2]:
train_df = pd.read_csv('../../data/final_project/training.csv', encoding='utf-8', header=0)


#### __Test data:__

In [3]:
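def concat_text(X):
    """Add a 'concat' column joining text, star_rating and title (modifies X in place)."""
    # Cast every field to str so that numbers and missing values can be joined
    X["text"] = X["text"].apply(str)
    X["star_rating"] = X["star_rating"].apply(str)
    X["title"] = X["title"].apply(str)
    X["concat"] = X[["text", "star_rating", "title"]].agg(" ".join, axis=1)
    # No manual "[CLS]"/"[SEP]" markers here: the DistilBERT tokenizer used below
    # adds its special tokens itself, so adding them by hand would duplicate them.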
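
The helper above modifies the DataFrame it receives. As a sketch of an alternative, the hypothetical `build_concat` below returns the joined strings without touching its input and leaves any special tokens to the tokenizer. The toy rows are made up for illustration.

```python
import pandas as pd


def build_concat(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: join text, star_rating and title with spaces, without mutating df."""
    cols = ["text", "star_rating", "title"]
    return df[cols].astype(str).agg(" ".join, axis=1)


toy_df = pd.DataFrame(
    {
        "text": ["Strap broke after a week", "Works fine"],
        "star_rating": [1, 5],
        "title": ["Broken strap", "Great"],
    }
)
toy_concat = build_concat(toy_df)
print(toy_concat.tolist())
```

Returning a new Series keeps the original columns and dtypes intact, which makes the cell safe to re-run.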

In [4]:
test_df = pd.read_csv('../../data/final_project/test.csv', encoding='utf-8', header=0)

test_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title
0,62199,15449606311,"Quality of material is great, however, the bac...",3/7/2018 19:47,3,great backpack with strange fit
1,76123,15307152511,The product was okay but wasn't refined campho...,43135.875,2,Not refined
2,78742,12762748321,I normally read the reviews before buying some...,42997.37708,1,"Doesnt work, wouldnt recommend"
3,64010,15936405041,These pads are completely worthless. The light...,43313.25417,1,The lighter colored side of the pads smells li...
4,17058,13596875291,The saw works great but the blade oiler does n...,12/5/2017 20:17,2,The saw works great but the blade oiler does n...




## 2. Train a Classifier

In [5]:
concat_text(train_df)
concat_text(test_df)

In [6]:
train_df.isna().sum()


ID             0
doc_id         0
text           0
date           0
star_rating    0
title          0
human_tag      0
concat         0
dtype: int64


In [7]:
test_df.isna().sum()

ID             0
doc_id         0
text           0
date           0
star_rating    0
title          0
concat         0
dtype: int64


In [8]:
# Implement this
import time

import torch
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
# Note: later releases of `datasets` deprecate load_metric in favor of the `evaluate` package
from datasets import load_metric

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
# Show which device will be used (guarded so the cell also runs without a GPU)
torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"

'Tesla T4'


In [41]:
# Hold out 30% of the training data as a validation set, stratified by label.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df["concat"].tolist(),
    train_df["human_tag"].tolist(),
    test_size=0.3,
    shuffle=True,
    random_state=1234,
    stratify = train_df["human_tag"].tolist(),
)
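
To see what `stratify` does in the split above, here is a small sketch on made-up labels (20% positive). The toy lists are assumptions for illustration; only `train_test_split` and its arguments come from the cell above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

toy_texts = [f"review {i}" for i in range(100)]
toy_labels = [1] * 20 + [0] * 80  # 20 positive, 80 negative examples

toy_train_x, toy_val_x, toy_train_y, toy_val_y = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.3,
    shuffle=True,
    random_state=1234,
    stratify=toy_labels,
)

train_counts = Counter(toy_train_y)
val_counts = Counter(toy_val_y)
print(train_counts, val_counts)
```

Both parts keep the 20% positive rate, which matters when one class is much rarer than the other.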

In [42]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts,
                            truncation=True,
                            padding=True)
val_encodings = tokenizer(val_texts,
                          truncation=True,
                          padding=True)
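
As a conceptual sketch of what `truncation=True` and `padding=True` ask for, the hypothetical `pad_and_truncate` below cuts each sequence to a maximum length and pads the rest to the longest sequence in the batch. It is a simplification and not the tokenizer's implementation (for example, the real tokenizer keeps the closing special token when it truncates). The token ids are illustrative.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Hypothetical helper: truncate to max_length, then pad to the longest sequence."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_and_truncate(toy_batch, max_length=8)
clipped = pad_and_truncate(toy_batch, max_length=3)
print(padded)
print(clipped)
```

Because padding goes to the longest sequence in each call, the training and validation encodings above can end up with different widths; that is fine as long as each batch is internally consistent.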

In [43]:
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]).to(device) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx]).to(device)
        return item

    def __len__(self):
        return len(self.labels)
    
train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
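
The dataset above moves every item to the device inside `__getitem__`. A common alternative, sketched here under the assumption that batches are moved in the loop instead, keeps the dataset on the CPU. `CpuReviewDataset`, `to_device`, and the toy encodings are hypothetical names and data for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CpuReviewDataset(Dataset):
    """Hypothetical variant of ReviewDataset that returns CPU tensors."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


def to_device(batch, target_device):
    """Move every tensor of a collated batch to the target device."""
    return {key: value.to(target_device) for key, value in batch.items()}


toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 9, 102], [101, 5, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_dataset = CpuReviewDataset(toy_encodings, [0, 1, 0])
toy_loader = DataLoader(toy_dataset, batch_size=2)
toy_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
first_batch = to_device(next(iter(toy_loader)), toy_device)
print({key: tuple(value.shape) for key, value in first_batch.items()})
```

With this arrangement the training loop would call `to_device(batch, device)` before `model(**batch)`, and the data loader could use worker processes without touching the GPU.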

In [44]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                            num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
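
The next cell freezes every parameter whose name does not contain `"classifier"`. The sketch below applies the same name test to a small stand-in model to check which parameters stay trainable; `ToyClassifier` and its layer sizes are assumptions for illustration. Note that a layer named `pre_classifier` also passes the test.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in with an encoder and a two-layer classification head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.pre_classifier = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.pre_classifier(self.encoder(x))))


toy_model = ToyClassifier()

# Same rule as in the notebook: freeze everything outside the classification head
for name, param in toy_model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

trainable_names = [n for n, p in toy_model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in toy_model.parameters())

# Optionally hand only the trainable parameters to the optimizer
toy_optimizer = torch.optim.Adam(
    [p for p in toy_model.parameters() if p.requires_grad], lr=0.01
)
print(trainable_names, n_trainable, n_total)
```

Printing the trainable names before training is a quick check that the freeze did what was intended.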

In [None]:
# Freeze the encoder weights; only parameters with "classifier" in their name stay trainable
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

# Hyperparameters
num_epochs = 10
learning_rate=0.01

# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create data loaders
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=64, drop_last=True)
# Keep every validation example (no drop_last) so the metrics cover the full validation set
eval_dataloader = DataLoader(val_dataset, batch_size=64)

# Setup the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

metric = load_metric("f1")
model=model.to(device)
print("starting training")
for epoch in range(num_epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    print(epoch)
    # Training loop starts
    model.train() # put the model in training mode
    for batch in train_dataloader:
        # below: ** allows us to pass multiple arguments to model()
        outputs = model(**batch)
        loss = outputs.loss
        training_loss += loss.item()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()   
    # Validation loop starts
    model.eval() # put the model in prediction mode
    for batch in eval_dataloader:
        with torch.no_grad():
            # below:  ** allows us to pass multiple arguments to model()
            outputs = model(**batch)
        loss = outputs.loss
        val_loss += loss.item()
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    # Let's take the average losses
    training_loss = training_loss / len(train_dataloader)
    val_loss = val_loss / len(eval_dataloader)
    end = time.time()
    
    print(
        f"Epoch {epoch}. Train_loss {training_loss:.4f}. Val_loss {val_loss:.4f}. "
        f"f1 {metric.compute()['f1']:.4f}. Seconds {end-start:.3f}."
    )


starting training
0
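
The training loop reports an F1 score on the validation set after each epoch. As a sketch of what that number means, assuming the usual binary definition with class 1 as the positive class, the example below computes F1 by hand on made-up predictions and compares it with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

toy_references = [1, 0, 1, 1, 0, 0]   # made-up true labels
toy_predictions = [1, 0, 0, 1, 1, 0]  # made-up model outputs

pairs = list(zip(toy_references, toy_predictions))
tp = sum(1 for ref, pred in pairs if ref == 1 and pred == 1)  # true positives
fp = sum(1 for ref, pred in pairs if ref == 0 and pred == 1)  # false positives
fn = sum(1 for ref, pred in pairs if ref == 1 and pred == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

sklearn_f1 = f1_score(toy_references, toy_predictions)
print(manual_f1, sklearn_f1)
```

F1 balances precision and recall, so it is more informative than accuracy when the positive class is rare.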


## 3. Make predictions on your test dataset

In [None]:
test_texts = test_df["concat"].tolist()

In [None]:
test_encodings = tokenizer(test_texts,
                          truncation=True,
                          padding=True)

In [None]:
test_dataset = ReviewDataset(test_encodings, [0]*len(test_texts))

In [None]:
test_dataloader = DataLoader(test_dataset, batch_size=32)
test_predictions = []
model.eval()
print(len(test_dataloader))
# Predict batch by batch, printing progress every 10 batches
for i, batch in enumerate(test_dataloader):
    if i % 10 == 0:
        print(i)
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    test_predictions.extend(predictions.cpu().numpy())
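
The prediction loop takes the `argmax` of the logits. The sketch below uses made-up logits to show that step, and also how `softmax` turns the same logits into class probabilities in case a probability for the positive class is wanted. The tensor values are illustrative only.

```python
import torch

# Made-up logits for three reviews and two classes
toy_logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [-1.0, 0.3]])

toy_classes = torch.argmax(toy_logits, dim=-1)   # predicted class per row
toy_probs = torch.softmax(toy_logits, dim=-1)    # rows sum to 1
positive_prob = toy_probs[:, 1]                  # probability of class 1

print(toy_classes.tolist())
print(positive_prob.tolist())
```

For two classes, thresholding `positive_prob` at 0.5 gives the same labels as `argmax`; a different threshold trades precision against recall.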

## 4. Write your predictions to a CSV file
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://mlu.corp.amazon.com/contests/redirect/53

In [None]:
result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_predictions
 
result_df.to_csv("../../data/final_project/project_day3_result.csv", encoding='utf-8', index=False)
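
Before uploading, it can help to reload the file and confirm its shape. The sketch below does this round trip on made-up IDs and predictions in a temporary directory; the file name `toy_result.csv` and the toy values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

toy_ids = [62199, 76123, 78742]
toy_test_predictions = [0, 1, 0]

submission = pd.DataFrame({"ID": toy_ids, "human_tag": toy_test_predictions})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_result.csv")
    submission.to_csv(out_path, encoding="utf-8", index=False)
    reloaded = pd.read_csv(out_path)

# One row per test example, and exactly the two expected columns
print(list(reloaded.columns), len(reloaded))
```

The same checks applied to `result_df` would catch a length mismatch between `test_df` and `test_predictions` before submission.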