To perform sentence classification, like many other NLP classification tasks, we need three main steps:

- Preprocess the data
- Prepare the data loader
- Build the model

Of course, each of these steps involves several smaller steps, and each can be solved in many different ways.

To give you a jumpstart on this task, I will provide a fairly clean dataset, the Amazon Reviews one, which is widely available online and is also included in the `torchtext.datasets` module.

For this example, I will use just a small part of it, to give some guidance on how to start without actually training a model on the whole dataset.

### Load the data



In [14]:
import pandas as pd

In [15]:
import spacy

In [16]:
df = pd.read_csv("test.csv", nrows=3000, header=None)
df

Unnamed: 0,0,1,2
0,1,mens ultrasheer,"This model may be ok for sedentary types, but ..."
1,4,Surprisingly delightful,This is a fast read filled with unexpected hum...
2,2,"Works, but not as advertised",I bought one of these chargers..the instructio...
3,2,Oh dear,I was excited to find a book ostensibly about ...
4,2,Incorrect disc!,"I am a big JVC fan, but I do not like this mod..."
...,...,...,...
2995,2,A MAJOR ( PUN INTENDED) DISAPPOINTMENT,I was so disappointed in this book. Having rea...
2996,4,Good Inside look at the U.S. Open,The author does a great job of taking the read...
2997,1,A Good Open Spoiled,"The subtitle should be, ""Inside the Port-o-Joh..."
2998,2,"Good ideas, but horrible context",I praise Ostebee and Zorn for making an attemp...


In [17]:
df.rename({0:"star", 1:"rating1", 2:"rating2"}, axis=1, inplace=True)

Since we are going to predict the number of stars a product received from the text of its review, we can merge the title of the review with its body, just by concatenating them:

In [18]:
df["review"] = df["rating1"] + " " +  df["rating2"]

In [19]:
df

Unnamed: 0,star,rating1,rating2,review
0,1,mens ultrasheer,"This model may be ok for sedentary types, but ...",mens ultrasheer This model may be ok for seden...
1,4,Surprisingly delightful,This is a fast read filled with unexpected hum...,Surprisingly delightful This is a fast read fi...
2,2,"Works, but not as advertised",I bought one of these chargers..the instructio...,"Works, but not as advertised I bought one of t..."
3,2,Oh dear,I was excited to find a book ostensibly about ...,Oh dear I was excited to find a book ostensibl...
4,2,Incorrect disc!,"I am a big JVC fan, but I do not like this mod...","Incorrect disc! I am a big JVC fan, but I do n..."
...,...,...,...,...
2995,2,A MAJOR ( PUN INTENDED) DISAPPOINTMENT,I was so disappointed in this book. Having rea...,A MAJOR ( PUN INTENDED) DISAPPOINTMENT I was s...
2996,4,Good Inside look at the U.S. Open,The author does a great job of taking the read...,Good Inside look at the U.S. Open The author d...
2997,1,A Good Open Spoiled,"The subtitle should be, ""Inside the Port-o-Joh...","A Good Open Spoiled The subtitle should be, ""I..."
2998,2,"Good ideas, but horrible context",I praise Ostebee and Zorn for making an attemp...,"Good ideas, but horrible context I praise Oste..."


and then of course we can drop the other two columns:

In [20]:
df.drop(columns=["rating1", "rating2"], inplace=True)

In [21]:
df

Unnamed: 0,star,review
0,1,mens ultrasheer This model may be ok for seden...
1,4,Surprisingly delightful This is a fast read fi...
2,2,"Works, but not as advertised I bought one of t..."
3,2,Oh dear I was excited to find a book ostensibl...
4,2,"Incorrect disc! I am a big JVC fan, but I do n..."
...,...,...
2995,2,A MAJOR ( PUN INTENDED) DISAPPOINTMENT I was s...
2996,4,Good Inside look at the U.S. Open The author d...
2997,1,"A Good Open Spoiled The subtitle should be, ""I..."
2998,2,"Good ideas, but horrible context I praise Oste..."


👏

The `star` column is what we want to predict, given the text of the review. I think we are all Amazon users, and we all know how many stars a rating can have, but let's just double check:

In [22]:
df.star.unique()

array([1, 4, 2, 3, 5])

In [120]:
df.star = df.star.apply(lambda x: int(x) -1)
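
The cell above shifts the labels from 1..5 to 0..4, most likely because the loss used later (`nn.NLLLoss`) expects class indices that start at 0. A minimal sketch of the shift and its inverse, which is handy to turn predictions back into stars (the helper names are hypothetical, not part of the notebook):

```python
# Hypothetical helpers illustrating the label shift; not part of the notebook.
def star_to_class(star):
    """Map a 1..5 star rating to a 0..4 class index."""
    return int(star) - 1


def class_to_star(class_index):
    """Map a 0..4 class index back to a 1..5 star rating."""
    return class_index + 1


stars = [1, 4, 2, 3, 5]
classes = [star_to_class(s) for s in stars]
print(classes)  # [0, 3, 1, 2, 4]
```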

Ok, now that our data are in order, we need to preprocess them. We can take advantage of spaCy for basically all of the steps:

In [121]:
nlp = spacy.load("en_core_web_sm")

Let's create a function that, given a sentence, preprocesses it by:
- tokenizing it
- removing stop words
- removing special characters/punctuation
- making everything lower case (spaCy's lemmas are mostly lowercase already, although proper nouns seem to keep their capitalization)
- lemmatizing it

With spaCy, we can do it in a very compact form:

In [122]:
def preprocessing(sentence):
    """
    params sentence: a str containing the sentence we want to preprocess
    return the tokens list
    """
    doc = nlp(sentence)
    tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]
    return tokens
    

In [123]:
preprocessing("This is an example! Hello")

['example', 'hello']
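
To see what these steps do without spaCy in the way, here is a dependency-free sketch. It is only a rough stand-in: the stop-word list is a tiny hand-picked assumption, the function name is hypothetical, and there is no lemmatization.

```python
import re

# Tiny, hand-picked stop-word list: an assumption for illustration only.
# spaCy ships a much larger one.
TOY_STOPWORDS = {"this", "is", "an", "a", "the", "and", "of"}


def toy_preprocessing(sentence):
    """Lowercase, keep alphanumeric tokens, drop stop words. No lemmatization."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [token for token in tokens if token not in TOY_STOPWORDS]


print(toy_preprocessing("This is an example! Hello"))  # ['example', 'hello']
```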

The preprocessing phase has not finished yet. In fact, we want to create a neural network, and a neural network works with numbers. In general, computers work with numbers...

So we need to use embeddings to transform a sentence into a tensor. Each embedding is a one-dimensional vector, and in the following example it will have size 300. This means that a sentence of 10 words (after preprocessing) becomes a tensor of shape $10\times 300$. On top of that there is another dimension, the batch size, so you will train and run a model that receives as input a tensor of shape:

`batch_size × length_of_the_sentence × embedding_size`.

Let's do things in order:

In [124]:
import torch
from collections import Counter
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm, tqdm_notebook

If you are using the whole dataset, you should not need to split it into train and test, because it should already come split. If not, or if you are using any other dataset, remember to split it into train and test (and possibly validation).

In [125]:
train_df, test_df = df.iloc[:2000], df.iloc[2000:]
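
The slice above takes the first 2000 rows as they come. If the file happens to be ordered in some way, a shuffled split may be safer. A sketch, assuming hypothetical 70/15/15 fractions and a fixed seed (`split_indices` is not part of the notebook); the resulting positions could then be passed to `df.iloc[...]`:

```python
import random


def split_indices(n_rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Return shuffled row positions for train, validation and test."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # reproducible shuffle
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = positions[:n_train]
    val = positions[n_train:n_train + n_val]
    test = positions[n_train + n_val:]  # whatever is left
    return train, val, test


train_idx, val_idx, test_idx = split_indices(3000)
print(len(train_idx), len(val_idx), len(test_idx))
```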

To get the vectors for each token, we are going to use some pretrained embeddings. Specifically, we are going to use the FastText embeddings that you can find at this link https://pytorch.org/text/stable/vocab.html#fasttext .

We need to download and load them by doing:

In [126]:
from torchtext.vocab import FastText

In [127]:
fasttext = FastText("simple")

You can run `help(fasttext)` and/or `dir(fasttext)` to get more info about the methods and the attributes this object contains.

In [128]:
dir(fasttext)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'cache',
 'dim',
 'get_vecs_by_tokens',
 'itos',
 'stoi',
 'unk_init',
 'url_base',
 'vectors']

I want to highlight a couple of things:

- `dim` is the dimensionality of the vectors (in our case it is 300)
- `itos` stands for *index to string*: it is a list that maps an integer to the corresponding string. The reason for having such a mapping is that it's much lighter to store integers and use them to index the vectors than to carry a string per word. (In addition, heuristics can be used so that the most frequent words get lower indices, resulting in better memory management. I know, it sounds like a minor thing, but the model is going to perform billions of operations!)
- `stoi` is the opposite: it's a dictionary that, given the string, returns the index
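
To make the two mappings concrete, here is a toy vocabulary built from a made-up corpus, with the most frequent words receiving the lowest indices. This is only a sketch of the idea, not how FastText builds its own vocabulary:

```python
from collections import Counter

# Toy corpus, made up for illustration.
corpus = "the cat sat on the mat the cat slept".split()

# Most frequent words first, so they receive the lowest indices.
itos = [word for word, _ in Counter(corpus).most_common()]
stoi = {word: index for index, word in enumerate(itos)}

print(itos[0])      # the
print(stoi["cat"])  # 1
```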



Indexing the object with a word, as in `fasttext["hello"]`, returns the embedding associated with that word. Let's inspect its shape:

In [129]:
fasttext["hello"].shape

torch.Size([300])

300, as anticipated. 

Let's inspect the index associated with "hello":

In [130]:
fasttext.stoi["hello"]

2610

and here is the beginning of the full `stoi` dictionary (output truncated):

In [131]:
fasttext.stoi

{'</s>': 0,
 '.': 1,
 ',': 2,
 'the': 3,
 'of': 4,
 "'": 5,
 'in': 6,
 '-': 7,
 'and': 8,
 ')': 9,
 '(': 10,
 'a': 11,
 'to': 12,
 'is': 13,
 'was': 14,
 'it': 15,
 'for': 16,
 'on': 17,
 's': 18,
 'as': 19,
 'that': 20,
 'from': 21,
 'by': 22,
 'he': 23,
 'are': 24,
 'with': 25,
 'this': 26,
 '–': 27,
 'be': 28,
 'an': 29,
 'at': 30,
 'or': 31,
 'i': 32,
 'not': 33,
 'people': 34,
 '}': 35,
 'other': 36,
 'they': 37,
 'his': 38,
 'american': 39,
 'have': 40,
 'has': 41,
 'utc': 42,
 'also': 43,
 'one': 44,
 'were': 45,
 'which': 46,
 'but': 47,
 'can': 48,
 'talk': 49,
 'there': 50,
 'first': 51,
 '#': 52,
 'new': 53,
 'united': 54,
 'about': 55,
 'you': 56,
 'their': 57,
 'may': 58,
 'all': 59,
 'she': 60,
 'd': 61,
 'when': 62,
 'after': 63,
 'had': 64,
 'states': 65,
 'who': 66,
 'made': 67,
 'more': 68,
 'if': 69,
 'born': 70,
 'used': 71,
 'many': 72,
 'city': 73,
 'some': 74,
 'time': 75,
 'websites': 76,
 'two': 77,
 't': 78,
 'its': 79,
 'most': 80,
 'called': 81,
 'b': 82,
 ...

We can create an *encoder* which transforms each word into an integer:

In [132]:
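def token_encoder(token, vec):
    """Return the vocabulary index of a token (1 for padding, 0 for unknown words)."""
    if token == "<pad>":
        return 1
    # dict.get avoids a bare except and returns 0 for out-of-vocabulary tokens
    return vec.stoi.get(token, 0)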

In [133]:
def encoder(tokens, vec):
    return [token_encoder(token, vec) for token in tokens]

In [134]:
text = "Antonio is learning Python"
encoder(preprocessing(text), fasttext)

[0, 1660, 0]

Why all those zeros?
Well, in the function that we have defined, any word that is not in the vocabulary gets the index 0. Antonio and Python were not found in the FastText vocabulary!
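
One thing worth checking is whether unknown tokens are simply a matter of capitalization: the `stoi` entries printed above are all lowercase ('american', 'united'). The sketch below uses a made-up vocabulary and hypothetical helper names to show a lowercase lookup and a quick way to measure how many tokens end up unknown. Whether this recovers "Antonio" or "Python" in the real vocabulary is not something I have verified.

```python
# Toy vocabulary standing in for `fasttext.stoi`; the indices are made up.
toy_stoi = {"learn": 7, "python": 8}


def encode_with_lowercase(tokens, stoi, unk_index=0):
    """Look tokens up after lowercasing them; unknown tokens map to unk_index."""
    return [stoi.get(token.lower(), unk_index) for token in tokens]


def unknown_rate(indices, unk_index=0):
    """Fraction of tokens that were mapped to the unknown index."""
    return sum(index == unk_index for index in indices) / len(indices)


indices = encode_with_lowercase(["Antonio", "learn", "Python"], toy_stoi)
print(indices)                # [0, 7, 8]
print(unknown_rate(indices))  # one token out of three is unknown
```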


What about the `<pad>` thing? 

Well, not all the reviews have the same length, so we need to find a solution for that. Why? Because our neural network expects inputs that all have the same size! It needs to know how many weights to initialize!

There are several possibilities, but the easiest is to set a cap with a `max_seq_len` parameter: all the reviews shorter than that length are padded using the vector associated with the padding index, and all the ones longer than `max_seq_len` are simply cut.

Do you see problems? I actually don't see many problems with it. I think that the sentiment of a review can usually be seen already from its first words.
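
If you would rather pick `max_seq_len` from the data than guess it, one option is to look at the distribution of review lengths and choose a value that covers most of them. A sketch using the nearest-rank percentile; the token counts are made up and `length_at_percentile` is a hypothetical helper:

```python
import math


def length_at_percentile(lengths, percentile=90):
    """Smallest length that covers `percentile` percent of the sequences."""
    ordered = sorted(lengths)
    # nearest-rank method; integers are multiplied first to avoid float surprises
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank - 1, 0)]


# Made-up token counts for ten reviews.
token_counts = [5, 8, 12, 15, 20, 22, 30, 31, 45, 120]
print(length_at_percentile(token_counts, 90))  # 45
```

With these numbers, a cap of 45 would truncate only the single 120-token outlier.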

In the encoder, `<pad>` is a made-up token that is very unlikely to be part of the text. I assigned it the index 1.

You may ask: what happens to the words that originally sit at index 0 and 1? Well, let's inspect them:

In [135]:
fasttext.itos[0], fasttext.itos[1]

('</s>', '.')

and neither of them should survive our preprocessing pipeline (punctuation is removed), so reusing these two indices is fine!

Now let's create a function for padding:

In [136]:
def padding(list_of_indexes, max_seq_len, padding_index=1):
    output = list_of_indexes + (max_seq_len - len(list_of_indexes))*[padding_index]
    return output[:max_seq_len]

In [137]:
text = "this is a sample review"
list_of_indexes = encoder(preprocessing(text), fasttext)
list_of_indexes

[3697, 1363]

In [138]:
padding(list_of_indexes, max_seq_len=10)

[3697, 1363, 1, 1, 1, 1, 1, 1, 1, 1]

In this way, any sentence shorter than 10 tokens is padded to length 10, and anything longer...

In [139]:
text = "this is a sample review this is a sample review this is a sample review this is a sample review this is a sample review v this is a sample review this is a sample review this is a sample review this is a sample review this is a sample review"
list_of_indexes = encoder(preprocessing(text), fasttext)
padding(list_of_indexes, max_seq_len=10)

[3697, 1363, 3697, 1363, 3697, 1363, 3697, 1363, 3697, 1363]

...just gets cut to ten!

All right. I feel confident enough to say that we have all of what we need for the preprocessing part!

Now we need to create the:


### Data Loader

Yes, they are back. [Is it a good or a bad memory?](https://github.com/Strive-School/ai_mar21/blob/main/M5_Deep_Learning/D7/Custom%20DataLoader%20and%20Dataset.ipynb)

If you take a look at that notebook, you will remember that to create a custom data loader you need to override some methods of the `Dataset` class from `torch.utils.data`. Before doing so, let's define the steps we need to perform while loading the data:

- Receive as input the dataframe that we have defined above, which contains two columns: "star" and "review"
- Separate "star" from "review"
- Preprocess the "review" column as we have done so far (tokenization etc., but excluding the embeddings for now)
- Pad the sequences
- Store the list of index sequences together with the associated labels

Then we also need to override the `__len__` and `__getitem__` methods of the `Dataset` class.

Ok, stop talking, more action:

In [140]:
class TrainData(Dataset):
    def __init__(self, df, max_seq_len=32):
        # df is the input dataframe; max_seq_len is the length every sentence is cut or padded to
        self.max_seq_len = max_seq_len

        self.vec = FastText("simple")
        # replace the vector at index 1 (padding) with a vector of -1
        self.vec.vectors[1] = -torch.ones(self.vec.vectors[1].shape[0])
        # replace the vector at index 0 (unknown words) with zeros
        self.vec.vectors[0] = torch.zeros(self.vec.vectors[0].shape[0])
        self.vectorizer = lambda x: self.vec.vectors[x]
        # store the labels as a plain list, so that positional indexing also works
        # for dataframes whose index does not start at 0 (e.g. test_df)
        self.labels = df.star.tolist()
        self.sequences = [
            padding(encoder(preprocessing(sequence), self.vec), max_seq_len)
            for sequence in df.review.tolist()
        ]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, i):
        assert len(self.sequences[i]) == self.max_seq_len
        return self.sequences[i], self.labels[i]

In [141]:
dataset = TrainData(train_df, max_seq_len=32)

When we index dataset with a `dataset[index]` notation, we get the pair containing the padded sequence of indices with the associated label: 

In [142]:
dataset[0]

([468,
  0,
  868,
  2613,
  58316,
  360,
  818,
  12130,
  1044,
  13126,
  520,
  42977,
  2996,
  14931,
  197,
  2901,
  992,
  10051,
  42977,
  0,
  0,
  2603,
  0,
  4085,
  454,
  1736,
  631,
  338,
  3332,
  10770,
  5302,
  5512],
 0)

In [143]:
dataset[1][0]

[15391,
 47950,
 1508,
 934,
 4672,
 11584,
 15402,
 24369,
 14401,
 542,
 71101,
 1097,
 3851,
 19201,
 0,
 5815,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

What are the ones there? They are the product of the padding! 

What is the vector associated with the index 1?

In [144]:
dataset.vec.vectors[1]

tensor([-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.,
        -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., ...])

All negative ones! Makes sense! This is what we have defined!

Keeping in memory the embedded vectors of every review can be very costly. This is why we store the sequences as integers and look the vectors up by index. However, when we train our model, we need the embedded vectors!

So let's define the `collate` function, which looks up the vectors only when they are needed!

As arguments it takes the batch (a list of `batch_size` pairs, each holding a sequence of `max_seq_len` indices and its label) and the vectorizer. What is the vectorizer in our case? It's the one we have built in the `TrainData` class, which returns the vector associated with an index.

In [145]:
def collate(batch, vectorizer=dataset.vectorizer):
    # look up the embedding of every index, then stack into a batch_size x max_seq_len x emb_dim tensor
    inputs = torch.stack([torch.stack([vectorizer(token) for token in sentence[0]]) for sentence in batch])
    # class indices must be a LongTensor for the loss function
    target = torch.LongTensor([item[1] for item in batch])
    return inputs, target
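
The same lookup can be followed by hand on a toy example. The sketch below replaces the FastText matrix with a made-up table of 3-dimensional vectors and uses nested lists instead of tensors (`toy_vectors` and `toy_collate` are hypothetical names), so the resulting layout is easy to inspect:

```python
# Toy lookup table: index -> 3-dimensional vector (made-up numbers).
toy_vectors = [
    [0.0, 0.0, 0.0],     # index 0: unknown token
    [-1.0, -1.0, -1.0],  # index 1: padding
    [0.2, 0.5, 0.1],     # index 2: some real word
]


def toy_collate(batch, vectors=toy_vectors):
    """Turn (indices, label) pairs into nested lists of vectors plus labels."""
    inputs = [[vectors[index] for index in indices] for indices, _ in batch]
    targets = [label for _, label in batch]
    return inputs, targets


batch = [([2, 0, 1, 1], 3), ([2, 2, 2, 1], 0)]
inputs, targets = toy_collate(batch)
# batch_size x max_seq_len x embedding_size
print(len(inputs), len(inputs[0]), len(inputs[0][0]))  # 2 4 3
```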

And now, we can use the `DataLoader` class as we did for images:

In [146]:
batch_size = 16
train_loader = DataLoader(dataset, batch_size=batch_size, collate_fn=collate)


In [147]:
next(iter(train_loader))[0].shape

torch.Size([16, 32, 300])
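
The shape confirms the layout `batch_size × max_seq_len × embedding_size`. As a side note, the number of batches per epoch follows directly from the dataset size, and by default `DataLoader` keeps a smaller final batch when the size is not a multiple of `batch_size`. A small sketch (`batch_layout` is a hypothetical helper, and the 1000-row case is just an example):

```python
import math


def batch_layout(n_samples, batch_size):
    """Number of batches per epoch and the size of the last one."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch_size = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch_size


print(batch_layout(2000, 16))  # (125, 16): the 2000 training rows split evenly
print(batch_layout(1000, 16))  # (63, 8): the last batch is smaller
```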

Ready to train? What follows is a small model, just to *make things run on my computer*. You can expect to be kicked out if you come to the debrief with this model!



In [148]:
from torch import nn
import torch.nn.functional as F
emb_dim = 300
class Classifier(nn.Module):
    def __init__(self, max_seq_len, emb_dim, hidden1=16, hidden2=16):
        super(Classifier, self).__init__()
        self.fc1 = nn.Linear(max_seq_len*emb_dim, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, 5)
        self.out = nn.LogSoftmax(dim=1)
    
    
    def forward(self, inputs):
        x = F.relu(self.fc1(inputs.squeeze(1).float()))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return self.out(x)

In [149]:
MAX_SEQ_LEN = 32
model = Classifier(MAX_SEQ_LEN, 300, 16, 16)
model

Classifier(
  (fc1): Linear(in_features=9600, out_features=16, bias=True)
  (fc2): Linear(in_features=16, out_features=16, bias=True)
  (fc3): Linear(in_features=16, out_features=5, bias=True)
  (out): LogSoftmax(dim=1)
)
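
Most of the parameters of this model sit in the first layer, because the flattened input has `32 * 300 = 9600` features. A quick count from the layer sizes printed above (`linear_params` is a hypothetical helper):

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


layer_sizes = [(32 * 300, 16), (16, 16), (16, 5)]
counts = [linear_params(i, o) for i, o in layer_sizes]
print(counts)       # [153616, 272, 85]
print(sum(counts))  # 153973
```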

In [150]:
from torch import optim
criterion = nn.NLLLoss()

# Optimize all the parameters of the classifier; the pretrained embeddings are not part of the model, so they stay fixed
optimizer = optim.Adam(model.parameters(), lr=0.003)
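
The model ends with `LogSoftmax` and the criterion is `NLLLoss`; together they compute the same quantity as `nn.CrossEntropyLoss` applied to the raw scores. To see what happens numerically for a single example, here is a plain-Python sketch with made-up scores (the function names are hypothetical):

```python
import math


def log_softmax(scores):
    """Log-probabilities from raw scores, computed in a numerically stable way."""
    highest = max(scores)
    log_total = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_total for s in scores]


def nll(log_probs, target):
    """Negative log-likelihood of the correct class."""
    return -log_probs[target]


scores = [0.1, 0.3, 0.2, 0.6, 0.5]  # made-up raw outputs for the 5 classes
log_probs = log_softmax(scores)
print(sum(math.exp(lp) for lp in log_probs))  # approximately 1: valid probabilities
print(nll(log_probs, target=3))               # loss if the true class is 3
```

The loss is lowest when the true class is the one with the highest score, which is exactly what training pushes towards.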


In [151]:
dataiter = iter(train_loader)
sentences, labels = next(dataiter)

In [152]:
# Forward pass through the network for a single sentence
sentence_idx = 0
# flatten each sentence into one vector of MAX_SEQ_LEN*emb_dim values
sentences = sentences.reshape(sentences.size(0), 1, MAX_SEQ_LEN * emb_dim)
log_ps = model(sentences[sentence_idx, :])

torch.exp(log_ps)

tensor([[0.1627, 0.1963, 0.1702, 0.2471, 0.2237]], grad_fn=<ExpBackward>)

We got 5 probabilities: one for each possible star rating!

In [153]:
epochs = 3
print_every = 40

for e in range(epochs):
    running_loss = 0
    print(f"Epoch: {e+1}/{epochs}")

    for i, (sentences, labels) in enumerate(train_loader):

        # flatten each sentence into one vector of MAX_SEQ_LEN*emb_dim values
        sentences = sentences.reshape(sentences.size(0), MAX_SEQ_LEN * emb_dim)
        
        optimizer.zero_grad()
        
        output = model(sentences)        # 1) Forward pass
        loss = criterion(output, labels) # 2) Compute loss
        loss.backward()                  # 3) Backward pass
        optimizer.step()                 # 4) Update model
        
        running_loss += loss.item()
        
        if i % print_every == 0:
            # at i == 0 running_loss holds a single batch, so the printed average is artificially low
            print(f"\tIteration: {i}\t Loss: {running_loss/print_every:.4f}")
            running_loss = 0

Epoch: 1/3
	Iteration: 0	 Loss: 0.0412
	Iteration: 40	 Loss: 1.6470
	Iteration: 80	 Loss: 1.5952
	Iteration: 120	 Loss: 1.6148
Epoch: 2/3
	Iteration: 0	 Loss: 0.0385
	Iteration: 40	 Loss: 1.5419
	Iteration: 80	 Loss: 1.4906
	Iteration: 120	 Loss: 1.6104
Epoch: 3/3
	Iteration: 0	 Loss: 0.0343
	Iteration: 40	 Loss: 1.4839
	Iteration: 80	 Loss: 1.3513
	Iteration: 120	 Loss: 1.4886
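
To put these numbers in context, it helps to know the loss of a model that guesses uniformly among the 5 classes. A quick check:

```python
import math

n_classes = 5
# negative log-likelihood of the true class when every class gets probability 1/5
uniform_loss = -math.log(1 / n_classes)
print(round(uniform_loss, 4))  # 1.6094
```

The values around 1.6 in the first epoch are therefore close to chance level, and the later ones are only slightly better, which is not surprising for such a small model and so little data.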


Finally, if you want to work with the full dataset, you can download it through `torchtext`:

In [154]:
from torchtext import datasets

In [155]:
# train, test = datasets.AmazonReviewFull()

amazon_review_full_csv.tar.gz: 188MB [00:15, 12.3MB/s] 


KeyboardInterrupt: 

### Exercises

- Create a real training process: use the train, val, test split for the dataset
- Create a training loop that includes validation and test at the end
    - You can borrow from your previous work, no need to write it from scratch
- If you want to, feel free to change the dataset
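
As a starting point for the validation and test steps, here is a plain-Python sketch of the accuracy computation on made-up model outputs (`accuracy` is a hypothetical helper). With real tensors, the same idea is usually written with `output.argmax(dim=1)` inside a `torch.no_grad()` block.

```python
def accuracy(log_probs_batch, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for log_probs, label in zip(log_probs_batch, labels):
        predicted = max(range(len(log_probs)), key=lambda c: log_probs[c])
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up model outputs for three reviews and five classes.
outputs = [
    [-0.2, -2.0, -3.0, -3.5, -4.0],
    [-3.0, -2.5, -0.3, -2.8, -3.1],
    [-1.9, -1.7, -1.6, -1.5, -1.4],
]
print(accuracy(outputs, labels=[0, 2, 1]))  # 2 of the 3 predictions are correct
```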
