## Mohammed Furkhan, Shaikh

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
os.listdir()

## Import required libraries

In [None]:
import torch
import torchtext
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch.autograd import Variable
from torch.nn import functional as F

In [None]:
import re
import random
from sklearn.model_selection import train_test_split

## Load the dataset and preprocess
Only the comment and rating columns from the reviews CSV file are required.

In [None]:
data_df = pd.read_csv('/kaggle/input/boardgamegeek-reviews/bgg-15m-reviews.csv',usecols=[ "rating", "comment"])[["comment", "rating"]]

In [None]:
data_df.head()

In [None]:
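# Keep only rows that have a comment; .copy() avoids SettingWithCopyWarning in the later column assignments
df = data_df[data_df['comment'].notna()].copy()
### https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan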

In [None]:
df.head()

We have removed the missing values from the dataset. Let's clean it up so that only English characters remain.
I'm also going to round the ratings to the nearest integer, because I'm treating this as a classification problem rather than a regression problem.

In [None]:
df['rating'] = df['rating'].apply(lambda x: round(x))
df['comment'] = df['comment'].apply(lambda x: x.lower())
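
A note on the rounding step: Python's built-in round does not always round halves up. It rounds a value that is exactly halfway to the nearest even integer, so 6.5 becomes 6 while 7.5 becomes 8. A small sketch, using made-up sample ratings rather than values from the dataset:

```python
# Hypothetical sample ratings, not taken from the dataset
sample_ratings = [6.5, 7.5, 7.49, 8.6]

# round() uses round-half-to-even for values that sit exactly halfway
rounded = [round(r) for r in sample_ratings]
print(rounded)
```

If strict round-half-up behaviour were wanted instead, the rounding rule would have to be written explicitly.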

We should be able to see the class values below

In [None]:
df['rating'].unique()

In [None]:
### https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
df = df.sample(frac=1).reset_index(drop=True)
df.head(10)

The comments still contain characters other than English letters, so we strip everything except letters and spaces (this also removes digits).

In [None]:
pattern = re.compile("[^a-zA-Z ]+")
df["comment"] = df['comment'].map(lambda x: pattern.sub('', x))
df.head(10)
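
To make the effect of the pattern concrete, here is a minimal standalone sketch that applies the same lowercase-then-strip steps to a single made-up comment. The helper name clean_comment and the sample text are illustrative only:

```python
import re

# Same pattern as in the notebook: any run of characters that are not English letters or spaces
pattern = re.compile("[^a-zA-Z ]+")

def clean_comment(text):
    """Lowercase the text, then remove every run of non-letter, non-space characters."""
    return pattern.sub('', text.lower())

sample = "Great game!!! 10/10, would play again :)"
cleaned = clean_comment(sample)
print(repr(cleaned))
```

Digits and punctuation disappear entirely, which can leave double or trailing spaces behind. A whitespace tokenizer such as str.split() ignores those extra spaces.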

Let's also drop rows whose comments have very few characters.

In [None]:
# drop rows with comment length <= 10
df = df[df['comment'].map(len) > 10]
print(len(df))
df = df.reset_index(drop=True)

In [None]:
df.head(10)

That's done. Let's find the maximum comment length.

In [None]:
df['comment'].map(len).max()

That is a lot, and we don't need all of it to predict a rating, so we will fix the length during training.
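
Fixing the length means that every tokenized comment is cut or padded to the same number of tokens. A minimal sketch of the idea, assuming padding is appended at the end and the padding token is called "<pad>" (the helper fix_length is hypothetical, not a library function):

```python
def fix_length(tokens, length, pad_token="<pad>"):
    """Truncate a token list to `length`, or pad it at the end up to `length`."""
    if len(tokens) >= length:
        return tokens[:length]
    return tokens + [pad_token] * (length - len(tokens))

short = "fun filler game".split()
long_ = "this game is far too long for what it offers".split()

print(fix_length(short, 5))  # padded up to 5 tokens
print(fix_length(long_, 5))  # truncated down to 5 tokens
```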

## Create training and testing datasets

In [None]:
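# train_test_split ran out of RAM, so the data is split by position instead:
# training_df, testing_df = train_test_split(df, test_size=0.30)
# The first 50% of the shuffled rows is used for training and the last 25% for testing; the middle 25% is unused.
n = len(df)
training_df, testing_df = df.iloc[:n // 2], df.iloc[3 * n // 4:]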
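
The positional split takes the first half of the shuffled rows for training and the last quarter for testing, which leaves the quarter in between unused. A small sketch with a plain list standing in for the DataFrame rows (the 20-row size is arbitrary):

```python
rows = list(range(20))  # stand-in for the shuffled DataFrame rows
n = len(rows)

train_rows = rows[: n // 2]               # first 50%
unused_rows = rows[n // 2 : 3 * n // 4]   # middle 25%, not used by either set
test_rows = rows[3 * n // 4 :]            # last 25%

print(len(train_rows), len(unused_rows), len(test_rows))
```

Because the rows were shuffled beforehand, slicing by position still gives a random sample for each set.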

In [None]:
training_df.tail()

In [None]:
testing_df.tail()

In [None]:
del data_df
del df

In [None]:
training_df.to_csv("training.csv", index=False)
testing_df.to_csv("testing.csv", index=False)

In [None]:
os.listdir()

## Prepare the dataset for PyTorch torchtext
The data should be tokenized and converted to numeric indices.

In [None]:
tokenizer = lambda x: x.split()

In [None]:
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True, include_lengths=True, batch_first=True, fix_length=200)
LABEL = torchtext.data.LabelField(dtype=torch.float)

In [None]:
fields = [('comment',TEXT),('rating', LABEL)]

In [None]:
train_data = torchtext.data.TabularDataset("training.csv","csv", fields, skip_header=True)

In [None]:
test_data = torchtext.data.TabularDataset("testing.csv","csv", fields, skip_header=True)

Let's check that the data loaded properly.

In [None]:
train_data.examples[0].comment, train_data.examples[0].rating

In [None]:
del training_df
del testing_df

## Create word embeddings

In [None]:
# TEXT.build_vocab(train_data, vectors=torchtext.vocab.GloVe(name='6B', dim=100,cache = 'output/kaggle/working/vector_cache'))
# TEXT.build_vocab(train_data, vectors="glove.6B.100d") #some url error. Due to permissions I believe
TEXT.build_vocab(train_data, vectors=torchtext.vocab.Vectors("/kaggle/input/glove6b100dtxt/glove.6B.100d.txt", cache = '../output/working/vector_cache'))
LABEL.build_vocab(train_data)
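
Building a vocabulary assigns an integer index to every token, and each index later selects one row of the embedding matrix. The following is a simplified sketch of that idea and not torchtext's actual implementation. The function names, the special tokens and their ordering are assumptions made for illustration:

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an index: special tokens first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Replace tokens by their indices; unseen tokens map to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [s.split() for s in ["great game", "great fun", "boring game"]]
stoi, itos = build_vocab(corpus)
ids = numericalize("great new game".split(), stoi)
print(stoi)
print(ids)
```

The word "new" never appears in the toy corpus, so it falls back to the unknown-token index.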

In [None]:
word_embeddings = TEXT.vocab.vectors
word_embeddings.shape

## Create validation set

In [None]:
train_data, valid_data = train_data.split()

## Iterators for training and evaluation

In [None]:
train_iter, valid_iter, test_iter = torchtext.data.BucketIterator.splits((train_data, valid_data, test_data),
                                                               batch_size=32,
                                                               sort_key=lambda x: len(x.comment),
                                                               repeat=False,
                                                               shuffle=True)

In [None]:
vocab_size = len(TEXT.vocab)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(vocab_size, device)

In [None]:
word_embeddings.shape

In [None]:
torch.save(word_embeddings, "word_embeddings.pt")

In [None]:
import dill

In [None]:
with open("TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)

## Model

In [None]:
class ClassifierModel(nn.Module):
    def __init__(self, batch_size, output_size, hidden_size, vocab_size, embedding_length, weights):
        """
        output_size : number of rating classes (11 in this notebook)
        weights : pre-trained embedding matrix of shape (vocab_size, embedding_length)
        """
        super(ClassifierModel, self).__init__()
        self.batch_size = batch_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.embedding_length = embedding_length

        self.word_embeddings = nn.Embedding(vocab_size, embedding_length)  # Initialize the look-up table.
        self.word_embeddings.weight = nn.Parameter(weights, requires_grad=False) # Assign the pre-trained GloVe embeddings and freeze them.
        self.lstm = nn.LSTM(embedding_length, hidden_size)
        self.label = nn.Linear(hidden_size, output_size)

    def forward(self, input_sentence, batch_size=None):
        """
        input_sentence.shape = (batch_size, seq_len)
        final_output.shape = (batch_size, output_size)
        """
        if batch_size is None:
            batch_size = input_sentence.size(0)  # Infer the batch size from the input
        embedded = self.word_embeddings(input_sentence) # (batch_size, seq_len, embedding_length)
        embedded = embedded.permute(1, 0, 2) # (seq_len, batch_size, embedding_length)
        # Initial hidden and cell states, created on the same device as the input (CPU or GPU)
        h_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        c_0 = torch.zeros(1, batch_size, self.hidden_size, device=embedded.device)
        output, (final_hidden_state, final_cell_state) = self.lstm(embedded, (h_0, c_0))
        final_output = self.label(final_hidden_state[-1]) # final_hidden_state.size() = (1, batch_size, hidden_size)

        return final_output

## Gradient clipping

In [None]:
def clip_gradient(model, clip_value):
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)
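
The helper limits every individual gradient value to the range [-clip_value, clip_value] before the optimizer step. The plain-Python sketch below shows the same element-wise clamping on a few made-up numbers. PyTorch also ships torch.nn.utils.clip_grad_value_, which performs this kind of element-wise clipping on a model's gradients.

```python
def clamp_values(values, clip_value):
    """Clamp each number to the interval [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, v)) for v in values]

grads = [0.03, -0.5, 2.0, -0.08]  # made-up gradient values
clipped = clamp_values(grads, 0.1)
print(clipped)
```

Values already inside the interval pass through unchanged; only the outliers are pulled back to the boundary.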

## Training and Evaluation

In [None]:
def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.to(device)
    # Note: the optimizer is re-created on every call, so Adam's running statistics restart each epoch.
    optim = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate)
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        # batch.comment is a (token ids, lengths) pair because include_lengths=True
        text = batch.comment[0].to(device)
        target = batch.rating.long().to(device)
        if text.size(0) != 32:  # Skip the last batch if it is smaller than 32.
            continue
        optim.zero_grad()
        prediction = model(text)
        loss = loss_fn(prediction, target)
        num_corrects = (prediction.argmax(dim=1) == target).float().sum()
        acc = 100.0 * num_corrects / text.size(0)
        loss.backward()
        clip_gradient(model, 1e-1)
        optim.step()
        steps += 1

        if steps % 100 == 0:
            print(f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item():.2f}%')

        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()

    # Average over the batches that were actually processed (skipped batches are excluded)
    return total_epoch_loss / max(steps, 1), total_epoch_acc / max(steps, 1)

In [None]:
def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    steps = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.comment[0].to(device)
            if text.size(0) != 32:  # Skip the last batch if it is smaller than 32.
                continue
            target = batch.rating.long().to(device)
            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (prediction.argmax(dim=1) == target).float().sum()
            acc = 100.0 * num_corrects / text.size(0)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()
            steps += 1

    # Average over the batches that were actually processed (skipped batches are excluded)
    return total_epoch_loss / max(steps, 1), total_epoch_acc / max(steps, 1)
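
The accuracy inside both loops is the share of rows in a batch whose highest-scoring class equals the target class. A plain-Python sketch of that computation, with made-up scores for three classes (batch_accuracy is a hypothetical helper, not part of the notebook):

```python
def batch_accuracy(logits, targets):
    """Percentage of rows whose highest-scoring class equals the target class."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return 100.0 * correct / len(targets)

# Four examples, three classes; the scores are invented for illustration
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
targets = [1, 0, 1, 2]
print(batch_accuracy(logits, targets))
```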

In [None]:
batch_size = 32
output_size = 11
hidden_size = 256
embedding_length = 100
model = ClassifierModel(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

In [None]:
#architecture
print(model)

# Number of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')
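
The printed count can be cross-checked by hand. The sketch below assumes a single-layer, unidirectional LSTM with separate input and recurrent bias vectors, which is how PyTorch's nn.LSTM stores its parameters, followed by one linear layer. The frozen embedding matrix is left out because it is not trainable. The helper names are illustrative.

```python
def lstm_param_count(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def linear_param_count(in_features, out_features):
    # Weight matrix plus one bias per output
    return in_features * out_features + out_features

embedding_length, hidden_size, output_size = 100, 256, 11
lstm_params = lstm_param_count(embedding_length, hidden_size)
linear_params = linear_param_count(hidden_size, output_size)
total_trainable = lstm_params + linear_params
print(lstm_params, linear_params, total_trainable)
```

Doubling hidden_size roughly quadruples the recurrent part of the LSTM, which is why the hidden size dominates the parameter budget here.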

In [None]:
learning_rate = 0.001
loss_fn = F.cross_entropy

In [None]:
for epoch in range(5):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

In [None]:
test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

## Save the weights

In [None]:
torch.save(model.state_dict(), 'saved_weights.pt')

## Check on custom input text

In [None]:
test_sent = "This game is interesting"
test_sent = TEXT.preprocess(test_sent)  # lowercase and tokenize
test_sent = [[TEXT.vocab.stoi[x] for x in test_sent]]  # map tokens to vocabulary indices
test_tensor = torch.LongTensor(test_sent).to(device)  # shape (1, seq_len)
model.eval()
with torch.no_grad():
    output = model(test_tensor, 1)
out = F.softmax(output, 1)  # class probabilities
out

In [None]:
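# argmax gives a class index; LABEL.vocab.itos maps that index back to the rating label
print("rating", LABEL.vocab.itos[torch.argmax(out[0]).item()])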
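
The softmax step turns the raw scores from the final linear layer into probabilities that sum to 1, and the prediction is the class with the highest probability. A small standalone sketch with made-up scores for three classes:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0, 0.5]  # made-up scores for three classes
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most probable class
print(probs, best)
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw scores would select the same class.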

## References

- https://pytorch.org/text/stable/data.html
- https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
- https://github.com/prakashpandey9/Text-Classification-Pytorch/blob/master/main.py
- https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
- https://towardsdatascience.com/use-torchtext-to-load-nlp-datasets-part-ii-f146c8b9a496

## Contribution and Findings

1. Data cleaning and preparation
2. Explicitly based on torchtext and a self-preprocessed dataset
3. Different word embedding vectors and parameters
4. Optimized the hyperparameters empirically
5. Classifier based on 11 classes, 0 - 10
6. Deploying on cloud
7. Faster processing using sliced data

- Hyperparameters
1. The choice of embedding vectors and their dimension affects the number of parameters in the model.
2. Batch size can be 16, 32, 64, and so on. In this notebook I have used 32.
3. The number of layers in the model can be increased, but that will not necessarily give better results.
4. The input length has been fixed at 200 tokens (words) but can be increased. Shorter texts are padded by default.

- Overfitting
1. The training accuracy and loss stay close to the validation accuracy and loss.
2. The model does not overfit. Also, I had to use less data due to resource limits.

## Why use embedding vectors? What do the embeddings do?

Embedding vectors capture relations between words as positions in a vector space. For example, King is related to Queen in the same way that Man is related to Woman. Another generic example is orange and apple: both are fruits, and the embeddings reflect this by placing them close together.
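
Closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; the GloVe vectors used in this notebook have 100 dimensions and different values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means they point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, not real GloVe values
toy = {
    "apple": [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.2],
    "game": [0.1, 0.2, 0.9],
}

sim_fruit = cosine_similarity(toy["apple"], toy["orange"])
sim_other = cosine_similarity(toy["apple"], toy["game"])
print(sim_fruit, sim_other)
```

With these toy values the two fruits score much higher with each other than either does with an unrelated word.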

## What is LSTM?

LSTM (Long Short-Term Memory) is a recurrent neural network architecture that is mostly used for processing sequential data. Text is sequential by nature, which makes LSTMs useful for NLP tasks. An LSTM is also more powerful than a vanilla RNN because its gated cell state helps it retain information over longer sequences. There are several LSTM variants that can be experimented with.
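
To show what the gates do, here is a minimal sketch of one LSTM step on scalar values. It follows the standard LSTM equations, but the weights are arbitrary and the names lstm_step and weights are made up for this illustration. A real layer such as nn.LSTM applies the same steps with weight matrices and vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell. `w` maps each gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))    # how much new information to write
    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    g = math.tanh(pre("cand"))   # candidate value to write
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# Arbitrary weights chosen only for this example
weights = {gate: (0.5, 0.5, 0.0) for gate in ("input", "forget", "output", "cand")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a short made-up input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The cell state c carries information from step to step, while the hidden state h after the last step is what a classifier like the one in this notebook feeds to its final linear layer.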

## Evaluation

After 10 epochs on the dataset, average training accuracy was around 39.7% and validation accuracy was about 36%. These numbers can likely be improved by tuning the hyperparameters described above and by training for longer.