
# Yelp Review Classifier from NLP Book

This notebook works through the binary Yelp restaurant review classification problem from the NLP with PyTorch book, using the Ignite framework to train the model. The problem is described on page 57 of the book. [Here](https://nbviewer.jupyter.org/github/joosthub/PyTorchNLPBook/blob/master/chapters/chapter_3/3_5_Classifying_Yelp_Review_Sentiment.ipynb) is the book's notebook for training. I've made some changes to the code, refactoring the notebook code into modules.

A preprocessed "lite" dataset file containing 10\% of the data already exists. The code was tested on the lite version before the full version was processed.

## Imports & Inits

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
import torch
import pdb
import re

from collections import defaultdict
from pathlib import Path
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

In [3]:
from ignite.engine import Events, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from ignite.contrib.handlers import ProgressBar

In [4]:
# imports from my modules
from yelp.dataset import ProjectDataset
from yelp.trainer import YelpTrainer
from yelp.model import Classifier
from yelp.args import args

In [5]:
path = Path('../data/yelp')

## Functions

In [6]:
def preprocess_text(text):
  text = text.lower()
  text = re.sub(r"([.,!?])", r" \1 ", text)
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text
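
To make the two substitutions concrete, here is a small usage sketch. The function is repeated so the snippet runs on its own. The sample strings are made up, and the second one assumes the raw text can contain a literal backslash followed by `n` rather than a real newline.

```python
import re

def preprocess_text(text):
    # same logic as the notebook function
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)     # pad punctuation with spaces
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)  # collapse everything else to one space
    return text

# hypothetical sample inputs
tokens_simple = preprocess_text("Great food!").split()
tokens_escaped = preprocess_text("Service was slow.\\nExcellent pizza").split()
print(tokens_simple)
print(tokens_escaped)
```

Under that assumption the backslash is replaced by a space while the `n` survives and fuses with the next word, which would explain a token such as `nexcellent` showing up in the vocabulary.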

## Full Dataset Preprocessing

In [None]:
train_reviews = pd.read_csv(path/args.raw_train_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews['review'])]

test_reviews = pd.read_csv(path/args.raw_test_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews['review'])]

In [None]:
train_reviews.head()

In [None]:
test_reviews.head()

In [None]:
# splitting train by rating
by_rating = defaultdict(list)
for _, row in train_reviews.iterrows():
    by_rating[row['rating']].append(row.to_dict())

# create split data
final_list = []

for _, item_list in sorted(by_rating.items()):
  np.random.shuffle(item_list)
  n_total = len(item_list)
  n_train = int(args.train_proportion * n_total)
  
  # give data point a split attribute
  for item in item_list[:n_train]:
    item['split'] = 'train'
  
  # everything after the training slice is validation, so rounding
  # in n_train can never leave an item without a split
  for item in item_list[n_train:]:
    item['split'] = 'val'
    
  # add to final list
  final_list.extend(item_list)

# add test split
for _, row in test_reviews.iterrows():
  row_dict = row.to_dict()
  row_dict['split'] = 'test'
  final_list.append(row_dict)
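
The per-rating split can be tried on toy data. This is a minimal sketch rather than the notebook's exact code: the helper name `assign_splits`, the seeded `random.Random`, and the placeholder reviews are all assumptions made for illustration, and everything after the training slice is sent to validation.

```python
import random
from collections import defaultdict

def assign_splits(rows, train_proportion, seed=0):
    """Label each row 'train' or 'val', keeping the rating balance in both splits."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_rating = defaultdict(list)
    for row in rows:
        by_rating[row["rating"]].append(dict(row))

    labeled = []
    for _, items in sorted(by_rating.items()):
        rng.shuffle(items)
        n_train = int(train_proportion * len(items))
        for item in items[:n_train]:
            item["split"] = "train"
        for item in items[n_train:]:  # remainder goes to validation
            item["split"] = "val"
        labeled.extend(items)
    return labeled

# toy data: 10 placeholder reviews per rating
rows = [{"rating": r, "review": f"review {i}"} for r in (1, 2) for i in range(10)]
labeled = assign_splits(rows, train_proportion=0.7)

counts = defaultdict(int)
for row in labeled:
    counts[(row["rating"], row["split"])] += 1
print(dict(counts))
```

Splitting inside each rating group, instead of over the whole table, keeps the share of negative and positive reviews the same in the training and validation sets.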

In [None]:
# build a DataFrame from the split data and check the split sizes
final_reviews = pd.DataFrame(final_list)
final_reviews.split.value_counts()

In [None]:
final_reviews['review'].head()

In [None]:
final_reviews['review'] = final_reviews['review'].apply(preprocess_text)
final_reviews['rating'] = final_reviews['rating'].apply({1: 'negative', 2: 'positive'}.get)
final_reviews.head()

In [None]:
final_reviews.to_csv(path/args.full_file, index=False)

## Data Preparation

In [7]:
is_lite = False
is_load = True

In [8]:
if is_lite:
  scratch = path/args.lite_dir
  review_csv = path/args.lite_file
else:
  scratch = path/args.full_dir
  review_csv = path/args.full_file

vectorizer_path = scratch/args.vectorizer_fname
args.save_dir = scratch
args

Namespace(batch_size=1024, checkpointer_name='classifier', checkpointer_prefix='yelp', device='cuda:2', early_stopping_criteria=5, frequency_cutoff=25, full_dir='models/full', full_file='reviews_with_splits_full.csv', learning_rate=0.001, lite_dir='models/lite', lite_file='reviews_with_splits_lite.csv', num_epochs=100, raw_test_csv='raw_test.csv', raw_train_csv='raw_train.csv', save_dir=PosixPath('../data/yelp/models/full'), save_every=2, save_total=5, train_proportion=0.7, vectorizer_fname='vectorizer.json')
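
The `yelp.args` module itself is not shown in this notebook. As a hypothetical reconstruction, assuming `args` is a plain `argparse.Namespace` holding the printed values (the real module may build it differently, for example through an `ArgumentParser`), it could look like this:

```python
from argparse import Namespace

# hypothetical reconstruction; values copied from the printed Namespace
args = Namespace(
    raw_train_csv="raw_train.csv",
    raw_test_csv="raw_test.csv",
    lite_dir="models/lite",
    lite_file="reviews_with_splits_lite.csv",
    full_dir="models/full",
    full_file="reviews_with_splits_full.csv",
    vectorizer_fname="vectorizer.json",
    train_proportion=0.7,
    frequency_cutoff=25,
    batch_size=1024,
    learning_rate=0.001,
    num_epochs=100,
    early_stopping_criteria=5,
    device="cuda:2",
    checkpointer_prefix="yelp",
    checkpointer_name="classifier",
    save_every=2,
    save_total=5,
)
print(args.batch_size, args.train_proportion)
```

`save_dir` is left out on purpose: the notebook assigns it at run time from the chosen `lite_dir` or `full_dir`.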

In [9]:
df = pd.read_csv(review_csv)
len(df)

598000

Run the next cell only once, with `is_load = False`, to create and save the vectorizer.

In [10]:
if not is_load:
  train_ds = ProjectDataset.load_data_and_create_vectorizer(df.loc[df['split'] == 'train'])
  train_ds.save_vectorizer(vectorizer_path)

In [11]:
train_df = df.loc[df['split'] == 'train']
train_ds = ProjectDataset.load_data_and_vectorizer(train_df, vectorizer_path)
vectorizer = train_ds.get_vectorizer()
train_dl = DataLoader(train_ds, args.batch_size, shuffle=True, drop_last=True)

val_df = df.loc[df['split'] == 'val']
val_ds = ProjectDataset.load_data_and_vectorizer(val_df, vectorizer_path)
val_dl = DataLoader(val_ds, args.batch_size, shuffle=True, drop_last=True)

test_df = df.loc[df['split'] == 'test']
test_ds = ProjectDataset.load_data_and_vectorizer(test_df, vectorizer_path)
test_dl = DataLoader(test_ds, args.batch_size, shuffle=True, drop_last=True)

In [12]:
len(train_dl.dataset), len(val_dl.dataset), len(test_dl.dataset)

(392000, 168000, 38000)
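
These sizes can be checked with a little arithmetic. Because every `DataLoader` here uses `drop_last=True`, the final incomplete batch of each pass is discarded. The sketch below uses only the numbers printed in this notebook and works out how many full batches each split yields and how many examples are skipped per pass.

```python
batch_size = 1024  # args.batch_size in the printed Namespace
split_sizes = {"train": 392000, "val": 168000, "test": 38000}

summary = {}
for name, n in split_sizes.items():
    # full batches only, as with drop_last=True
    n_batches, n_dropped = divmod(n, batch_size)
    summary[name] = (n_batches, n_dropped)
    print(f"{name}: {n_batches} batches, {n_dropped} examples dropped per pass")

total = sum(split_sizes.values())
train_share = split_sizes["train"] / (split_sizes["train"] + split_sizes["val"])
print(total, train_share)
```

The total matches the 598000 rows loaded from the CSV, and the training share matches `train_proportion`. For the test split, each evaluation pass covers 37888 of the 38000 reviews, and with `shuffle=True` the skipped reviews differ from pass to pass.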

## Model

The following function is required because Ignite's `Accuracy` metric expects binary (0/1) predictions, while the classifier outputs raw logits.

In [13]:
def bce_logits_wrapper(output):
    y_pred, y = output
    y_pred = (torch.sigmoid(y_pred) > 0.5).long()
    return y_pred, y
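
The thresholding step can be checked without PyTorch. A minimal sketch in plain Python, with made-up logits, showing that `sigmoid(z) > 0.5` is the same decision as `z > 0`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [-2.0, -0.1, 0.0, 0.1, 3.5]  # made-up values

by_probability = [int(sigmoid(z) > 0.5) for z in logits]
by_sign = [int(z > 0) for z in logits]
print(by_probability)
print(by_sign)
```

Both lists are identical, so the wrapper only converts logits into the 0/1 labels that the metric needs; it does not change which class is predicted.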

In [14]:
classifier = Classifier(num_features=len(vectorizer.review_vocab))
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, mode='min', factor=0.5, patience=1)
loss_func = nn.BCEWithLogitsLoss()

pbar = ProgressBar(persist=True)
metrics = {'accuracy': Accuracy(bce_logits_wrapper), 'loss': Loss(loss_func)}

## Training

In [None]:
yelp_trainer = YelpTrainer(classifier, optimizer, loss_func, train_dl, val_dl, args, pbar, metrics)
yelp_trainer.run()

## Testing

In [15]:
classifier = Classifier(num_features=len(vectorizer.review_vocab))
loss_func = nn.BCEWithLogitsLoss()

if is_lite:
  state_dict = torch.load(scratch/'yelp_classifier_lite.pth')
else:
  state_dict = torch.load(scratch/'yelp_classifier_full.pth')
classifier.load_state_dict(state_dict)

### Ignite Testing

In [16]:
evaluator = create_supervised_evaluator(classifier, metrics=metrics)

@evaluator.on(Events.COMPLETED)
def log_testing_results(engine):
  metrics = engine.state.metrics
  print(f"Test loss: {metrics['loss']:0.3f}")
  print(f"Test accuracy: {metrics['accuracy']:0.3f}")

In [17]:
evaluator.run(test_dl)

Test loss: 0.178
Test accuracy: 0.935


<ignite.engine.engine.State at 0x7f2c082e08d0>

### NLPBook Testing

In [18]:
def compute_accuracy(y_pred, y):
  """Return the batch accuracy in percent, given raw logits and 0/1 targets."""
  y = y.type(torch.uint8)
  y_pred = torch.sigmoid(y_pred) > 0.5
  n_correct = torch.eq(y_pred, y).sum().item()
  return n_correct / len(y_pred) * 100

In [19]:
running_loss = 0.
running_acc = 0.

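classifier.eval()
with torch.no_grad():  # gradients are not needed for evaluation
  for i, batch in enumerate(test_dl):
    x, y = batch
    y_pred = classifier(x_in=x.float())

    # running mean of the per-batch loss
    loss = loss_func(y_pred, y.float())
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (i + 1)

    # running mean of the per-batch accuracy
    acc_t = compute_accuracy(y_pred, y)
    running_acc += (acc_t - running_acc) / (i + 1)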

In [20]:
print(f"Test loss: {running_loss:0.3f}")
print(f"Test acc: {running_acc:0.3f}")

Test loss: 0.178
Test acc: 93.534
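
The update `running += (value - running) / (i + 1)` used in the loop is an incremental mean. A minimal sketch with made-up per-batch losses, comparing it with the ordinary average:

```python
def running_mean(values):
    mean = 0.0
    for i, v in enumerate(values):
        # move the current mean toward the new value by 1/(i+1) of the gap
        mean += (v - mean) / (i + 1)
    return mean

batch_losses = [0.20, 0.16, 0.18, 0.17]  # made-up per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(incremental, direct)
```

Since `drop_last=True` makes every batch the same size, the mean of the per-batch values equals the mean over all evaluated examples, which is why this loop should agree closely with the Ignite result (0.935 there, 93.534 percent here).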


### Predict Rating

In [21]:
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
  """Predict the rating of a review

  Args:
      review (str): the text of the review
      classifier (Classifier): the trained model
      vectorizer: the vectorizer the model was trained with
      decision_threshold (float): the probability at or above which
          the review is assigned the class at index 1

  Returns:
      str: the predicted rating label
  """
  review = preprocess_text(review)
  print(review)

  vectorized_review = torch.tensor(vectorizer.vectorize(review))
  print(vectorized_review)
  result = classifier(vectorized_review.view(1, -1))
  print(result)

  probability_value = torch.sigmoid(result).item()
  print(probability_value)
  index = 1
  if probability_value < decision_threshold:
      index = 0

  return vectorizer.rating_vocab.lookup_idx(index)

In [22]:
test_review = "While the begining of this book is great, the ending sucks"

prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5)
print(f"{test_review} -> {prediction}")

while the begining of this book is great , the ending sucks
tensor([0., 0., 1.,  ..., 0., 0., 0.])
tensor([0.1268], grad_fn=<SqueezeBackward1>)
0.5316697955131531
While the begining of this book is great, the ending sucks -> positive
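
The printed logit and probability are related by the sigmoid, and the prediction sits close to the decision boundary. A small sketch in plain Python, using the rounded logit from the printed tensor; the helper `label_for` is hypothetical and only mirrors the index selection in `predict_rating`:

```python
import math

logit = 0.1268  # rounded value from the printed tensor
probability = 1.0 / (1.0 + math.exp(-logit))
print(probability)

def label_for(probability, decision_threshold=0.5):
    # hypothetical helper: index 1 ('positive') unless the
    # probability falls below the threshold
    return "positive" if probability >= decision_threshold else "negative"

default_label = label_for(probability)
stricter_label = label_for(probability, decision_threshold=0.6)
print(default_label, stricter_label)
```

With a probability of roughly 0.53, the mixed review is labeled positive at the default threshold but would flip to negative if `decision_threshold` were raised to 0.6.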


### Interpretability

In [23]:
classifier.fc1.weight.shape

torch.Size([1, 24662])

In [24]:
# sort weights
fc1_weights = classifier.fc1.weight.detach()[0]
_, idxs = torch.sort(fc1_weights, dim=0, descending=True)
idxs = idxs.numpy().tolist()
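
What the sorted indices provide can be shown on a toy example. This sketch assumes the classifier is a single linear layer over a bag-of-words vector, as the `[1, 24662]` weight shape suggests, so that each weight is one word's contribution to the logit. The vocabulary and weights below are made up.

```python
# hypothetical toy vocabulary and weights, standing in for
# vectorizer.review_vocab and classifier.fc1.weight[0]
vocab = ["bland", "delicious", "the", "worst", "amazing"]
weights = [-1.2, 2.1, 0.01, -2.5, 1.8]

# indices ordered by weight, largest first (the role of idxs in the notebook)
idxs = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

most_positive = [vocab[i] for i in idxs[:2]]
most_negative = [vocab[i] for i in idxs[::-1][:2]]
print(most_positive)
print(most_negative)
```

Words at the front of the ordering push the logit toward the positive class, words at the back push it toward the negative class, and words with weights near zero carry little signal either way.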

In [25]:
# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
    print(vectorizer.review_vocab.lookup_idx(idxs[i]))
    
print("====\n\n\n")

Influential words in Positive Reviews:
--------------------------------------
exceeded
delicious
excellent
pleasantly
hooked
amazing
fantastic
perfection
awesome
disappoint
hesitate
nexcellent
perfect
yum
divine
complaint
downside
delish
addicting
heaven
====





In [26]:
# Bottom 20 words (most negative weights first)
print("Influential words in Negative Reviews:")
print("--------------------------------------")
# iterate over a reversed copy so rerunning the cell gives the same result
for idx in idxs[::-1][:20]:
    print(vectorizer.review_vocab.lookup_idx(idx))
    
print("====\n\n\n")

Influential words in Negative Reviews:
--------------------------------------
worst
mediocre
meh
poisoning
bland
overrated
terrible
horrible
slowest
downhill
underwhelmed
disappointing
tasteless
unacceptable
flavorless
underwhelming
awful
disgusting
disappointment
unimpressed
====



