# Text classification with Pytorch

The goal of this lab is to explore various ways of representing textual data by applying them to a relatively small classification dataset - **20NewsGroup** - and evaluating how they perform on the classification task.
1. Using what we have previously seen, pre-process the data: clean it, obtain an appropriate vocabulary.
2. Obtain representations: any that will allow us to obtain a vector representation of each document is appropriate.
    - Symbolic: **BoW, TF-IDF**
    - Dense document representations: via **Topic Modeling: LSA, LDA**
    - Dense word representations: **SVD-reduced PPMI, Word2vec, GloVe**
        - For these, you will need to implement a **function aggregating word representations into document representations**
3. Perform classification: we can make things simple and only use a **logistic regression**
4. Learn how to use Pytorch for treating textual data, and implement **neural** classification models with Pytorch.

Besides ```torch```, we will use ```gensim``` to obtain word embeddings, and ```scikit-learn``` for simple classification models.  

In [1]:
import os.path as op
import re
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

## I - Simple classifier on top of dense representations

### I.1 Dataset

We're going to work with the **20NewsGroup** data. This dataset is available in ```scikit-learn```; you can find all relevant information in the [documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html).

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
# Import training data
ng_train = fetch_20newsgroups(subset='train',
                              remove=('headers', 'footers', 'quotes')
                              )

In [4]:
# Let's look at what is in this object
pprint(dir(ng_train))

['DESCR', 'data', 'filenames', 'target', 'target_names']


In [5]:
# Let's look at the categories
pprint(ng_train.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [6]:
# .. and the data itself
pprint(ng_train.data[0])
print("Target: ", ng_train.target_names[ng_train.target[0]])

('I was wondering if anyone out there could enlighten me on this car I saw\n'
 'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition,\n'
 'the front bumper was separate from the rest of the body. This is \n'
 'all I know. If anyone can tellme a model name, engine specs, years\n'
 'of production, where this car is made, history, or whatever info you\n'
 'have on this funky looking car, please e-mail.')
Target:  rec.autos


The dataset can be rather difficult as it is; in particular, some categories are very close to each other. We can simplify the task by using the higher-level categorisation of the newsgroups, thanks to the following function:

In [7]:
def aggregate_labels(label):
    # comp
    if label in [1,2,3,4,5]:
        new_label = 0
    # rec
    elif label in [7,8,9,10]:
        new_label = 1
    # sci
    elif label in [11,12,13,14]:
        new_label = 2
    # misc
    elif label in [6]:
        new_label = 3
    # pol
    elif label in [16,17,18]:
        new_label = 4
    # rel
    elif label in [0,15,19]:
        new_label = 5
    else:
        # Fail explicitly instead of returning an unbound variable
        raise ValueError(f"Unknown label: {label}")
    return new_label

We will first need to apply some pre-processing. Here we will use a tokenizer imported from ```nltk```, ```word_tokenize```, together with whatever processing you consider appropriate. Be careful: the data is not always clean and the messages are sometimes short, so pre-processing and tokenization can easily return an empty list of words. **Be careful to remove documents that are empty!**
<div class='alert alert-block alert-info'>
            Code:</div>

In [8]:
import nltk
# The first time you import this tokenizer, you need to download some data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [121]:
# Pre-processing
# Build the stop list once: NLTK stop words, re-tokenized so that pieces such as "n't"
# match what word_tokenize produces, plus all single letters a-z
STOP_WORDS = set(word_tokenize(" ".join(stopwords.words('english')))) | set(string.ascii_lowercase)

def pre_process(text):
    # Drop possessive markers before tokenizing
    text = text.replace("'s", "")
    # Tokenize and make lower case
    tokens = [token.lower() for token in word_tokenize(text)]
    # Keep alphabetic tokens that are not stop words (this also removes punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha() and word not in STOP_WORDS]

    return tokens

In [122]:
teste = pre_process("I don't want peter's cake, the fuck? x")
print(teste)

['want', 'peter', 'cake', 'fuck']


In [33]:
# Pre-processing (alternative version)
def pre_process_alt(text):
    tokens_translated = []
    # tokenize
    tokens = word_tokenize(text)

    stop_words = set(stopwords.words('english'))

    for word in tokens:
        # NOTE: this check runs before lowercasing, so capitalized stop words ("I", "The") are kept
        if word in stop_words: continue

        # remove punctuation and make lowercase
        token = word.translate(str.maketrans("", "", string.punctuation)).lower()

        # replace any \n left inside a token
        cleaned_token = token.replace("\n", " ")
        # skip empty strings
        if(len(cleaned_token) != 0): tokens_translated.append(cleaned_token)

    return tokens_translated

In [49]:
teste = pre_process("I don't want peter's cake, the fuck? x")
print(teste)

[]


In [123]:
ng_train_text = [pre_process(text) for text in ng_train.data if text != ""]

In [35]:
print(ng_train_text[0])

['i', 'wondering', 'anyone', 'could', 'enlighten', 'car', 'i', 'saw', 'day', 'it', '2door', 'sports', 'car', 'looked', 'late', '60s', 'early', '70s', 'it', 'called', 'bricklin', 'the', 'doors', 'really', 'small', 'in', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'this', 'i', 'know', 'if', 'anyone', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'production', 'car', 'made', 'history', 'whatever', 'info', 'funky', 'looking', 'car', 'please', 'email']


In [87]:
ng_train_labels = [ng_train.target[i] for i in range(len(ng_train.data)) if ng_train.data[i] != ""]

# Show only the first few labels rather than the whole list
print(ng_train_labels[:10])

[np.int64(7), np.int64(4), np.int64(4), np.int64(1), np.int64(14), np.int64(16), np.int64(13), np.int64(3), np.int64(2), np.int64(4)]

In [88]:
ng_test = fetch_20newsgroups(subset='test',
                             remove=('headers', 'footers', 'quotes')
                            )

ng_test_text = [pre_process(text) for text in ng_test.data if text != ""]
ng_test_labels = [ng_test.target[i] for i in range(len(ng_test.data)) if ng_test.data[i] != ""]
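
The list comprehensions above only drop documents whose raw text is empty; a document can still end up with no tokens once punctuation and stop words are removed. Below is a minimal sketch of filtering texts and labels together after pre-processing. The names `toy_tokenize` and `drop_empty_documents` are hypothetical, `toy_tokenize` only stands in for `pre_process`, and the toy documents are made up for illustration.

```python
def toy_tokenize(text):
    # Hypothetical stand-in for pre_process: lowercase alphabetic tokens only
    return [tok.lower() for tok in text.split() if tok.isalpha()]

def drop_empty_documents(raw_texts, labels, tokenizer):
    """Tokenize each document and keep only (tokens, label) pairs with at least one token."""
    kept_texts, kept_labels = [], []
    for text, label in zip(raw_texts, labels):
        tokens = tokenizer(text)
        if tokens:  # an empty list is falsy
            kept_texts.append(tokens)
            kept_labels.append(label)
    return kept_texts, kept_labels

toy_docs = ["Nice car engine", "", "?? 42 --", "Hockey season starts"]
toy_labels = [7, 3, 11, 10]
texts, labels = drop_empty_documents(toy_docs, toy_labels, toy_tokenize)
print(texts)
print(labels)
```

Filtering both lists in the same loop keeps each label aligned with its document, which two separate comprehensions only guarantee if they use the same condition.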

### I.2 Get a vocabulary.

Now that the data is cleaned, the first step we will follow is to pick a common vocabulary that we will use for every model we create in this lab. **Use the code of the previous lab to create a vocabulary.** As in the previous lab, we will have to be able to control its size, either by indicating a maximum number of words, or a minimum number of occurrences to take the words into account. Again, we add, at the end, an "unknown" word that will replace all the words that do not appear in our "limited" vocabulary.
<div class='alert alert-block alert-info'>
            Code:</div>
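
As a point of comparison, here is a minimal sketch of a vocabulary builder that supports both controls described above (a maximum size and a minimum count) and appends the "unknown" word at the end. The function name `build_vocab_sketch`, its parameters, and the choice of starting indexes at 0 are assumptions made for illustration; part II of the lab reserves index 0 for padding, in which case the indexes would have to start at 1.

```python
from collections import Counter
from itertools import chain

def build_vocab_sketch(corpus, max_size=None, min_count=1, unk_token="UNK"):
    """Hypothetical helper: index the most frequent words from 0, then add UNK last."""
    counts = Counter(chain.from_iterable(corpus))
    # most_common(None) returns every word, sorted by decreasing count
    kept = [w for w, c in counts.most_common(max_size) if c >= min_count]
    word2idx = {w: i for i, w in enumerate(kept)}
    word2idx[unk_token] = len(word2idx)
    return word2idx, counts

toy_corpus = [["car", "engine", "car"], ["hockey", "game"], ["car", "game"]]
word2idx, counts = build_vocab_sketch(toy_corpus, min_count=2)
print(word2idx)
```

With `min_count=2`, only `car` and `game` are kept, and every other word would be mapped to the `UNK` index at lookup time.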

In [89]:
# def vocabulary(corpus, voc_threshold=0):
#     """
#     Function using word counts to build a vocabulary - can be improved with a second parameter for
#     setting a frequency threshold
#     Params:
#         corpus (list of list of strings): corpus of sentences
#         voc_threshold (int): maximum size of the vocabulary (0 means no limit !)
#     Returns:
#         vocabulary (dictionary): keys: list of distinct words across the corpus
#                                  values: indexes corresponding to each word sorted by frequency
#         word_counts (dictionary): keys: list of distinct words across the corpus
#                                   values: their counts in the corpus
#     """
#     vocabulary = {}
#     word_counts = {}

#     flat_list = sum(corpus, [])
#     word_counts = {x: flat_list.count(x) for x in flat_list}

#     sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

#     if(voc_threshold == 0):
#         vocabulary = {word[0]: i+1 for i, word in enumerate(sorted_word_counts)}
#     else:
#         vocabulary = {word[0]: i+1 for i, word in enumerate(sorted_word_counts[:voc_threshold])}

#     return vocabulary, word_counts

In [21]:
from collections import Counter
from itertools import chain

def create_vocabulary(corpus, voc_threshold=0):
    """
    Function using word counts to build a vocabulary.
    Params:
        corpus (list of list of strings): corpus of sentences (list of tokenized sentences)
        voc_threshold (int): maximum size of the vocabulary (0 means no limit)
    Returns:
        vocabulary (dictionary): keys: list of distinct words across the corpus
                                 values: indexes (starting at 1) corresponding to each word sorted by frequency
        word_counts (dictionary): keys: list of distinct words across the corpus
                                  values: their counts in the corpus
    """
    # Count word frequencies without building a flattened copy of the corpus
    # (sum(corpus, []) copies the growing list at every step, which is quadratic)
    word_counts = Counter(chain.from_iterable(corpus))
    # Sort words by frequency in descending order
    sorted_word_counts = word_counts.most_common()

    # Keep only the voc_threshold most frequent words if a limit is specified
    if voc_threshold != 0:
        sorted_word_counts = sorted_word_counts[:voc_threshold]
    vocabulary = {word: i+1 for i, (word, _) in enumerate(sorted_word_counts)}

    return vocabulary, word_counts


In [124]:
vocab, word_counts = create_vocabulary(ng_train_text, 10)


In [25]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{"haven't", 'as', 'being', 'your', "shan't", 'had', 'or', "won't", 'about', 'do', 'once', 'more', "should've", 'me', 'won', 'he', 'after', "mustn't", 'wasn', 'until', 'been', 'my', 'any', 's', 'ours', 'does', 'before', 'didn', 'in', 'while', 'd', 'aren', 'own', "isn't", "they've", 'she', "i've", 'from', "he's", "it'll", "i'm", 'how', 'against', 'these', 'through', 'what', "she'd", 'under', 'an', "aren't", 'its', 'over', "it'd", "we're", 'are', 'i', 'at', "that'll", 'them', 'o', 'whom', 'couldn', 'themselves', "you're", "wasn't", 'needn', "shouldn't", "they'll", 'will', 'we', 'our', 'on', "we'll", 'each', 'weren', 'himself', 'yours', 'same', 'not', 't', "i'd", 'some', 'his', 'herself', 'those', 'yourself', 'll', "hadn't", "i'll", 'was', 'were', 'am', "she'll", 'there', 'don', 'the', 'shouldn', "you've", 'out', 'isn', 'above', 'too', 'which', 'did', 'ain', 'by', "he'd", "doesn't", 'and', "hasn't", 'all', 'up', 're', 'have', 'shan', 've', 'itself', 'just', 'hadn', "wouldn't", 'should', ...}

In [125]:
print(vocab)

{'one': 1, 'max': 2, 'people': 3, 'like': 4, 'get': 5, 'know': 6, 'also': 7, 'use': 8, 'think': 9, 'time': 10}


In [126]:
print(word_counts)



<div class='alert alert-block alert-warning'>
            Question:</div>
            
What do you think is the **appropriate vocabulary size here** ? Would any further pre-processing make sense ? Motivate your answer.

> The appropriate size depends on the size of the original dataset and on how word frequencies are distributed in it. According to Zipf's law, a small number of words account for most occurrences, while a long tail of words appear only a handful of times. Words that appear only once in the corpus are not very useful for classification, so we look for a minimum frequency suited to this corpus size: the minimum number of times a word has to appear in the corpus to be included in the vocabulary. Based on this number, we can set the `voc_threshold` variable.  
For example, if the minimum frequency is 3, we search the list of words sorted by decreasing frequency for the last word with frequency 3, and include every word in the list up to this one.

In [None]:
# Total number of tokens: sum of the lengths of the tokenized documents
n_tokens = sum(len(doc) for doc in ng_train_text)
print(f'There are {n_tokens} tokens in the corpus')
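
One way to ground the vocabulary-size question above is to look at how much of the corpus survives for different minimum counts. The sketch below is only an illustration on a made-up toy corpus; the helper name `coverage_by_min_count` is hypothetical, and on the real data it would be called on the tokenized training documents.

```python
from collections import Counter
from itertools import chain

def coverage_by_min_count(corpus, min_counts=(1, 2, 5)):
    """For each minimum count, return (vocabulary size, fraction of tokens covered)."""
    counts = Counter(chain.from_iterable(corpus))
    total = sum(counts.values())
    report = {}
    for m in min_counts:
        kept = [c for c in counts.values() if c >= m]
        report[m] = (len(kept), sum(kept) / total)
    return report

toy_corpus = [["car", "engine", "car"], ["hockey", "game"], ["car", "game"]]
report = coverage_by_min_count(toy_corpus, (1, 2, 3))
for m, (size, cov) in report.items():
    print(f"min_count={m}: {size} words, {cov:.0%} of tokens")
```

A threshold that removes many distinct words while still covering most tokens is usually a reasonable starting point, to be confirmed on the validation set.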

Before creating the vocabulary, put aside some training data for a **validation set** !

In [None]:
from sklearn.model_selection import train_test_split
train_texts_splt, val_texts, train_labels_splt, val_labels = train_test_split(ng_train_text, ng_train_labels, test_size=.2)

In [None]:
# Get the vocabulary from 'train_texts_splt'


### I.3 - Symbolic text representations

We can use the ```CountVectorizer``` class from scikit-learn to obtain the first set of representations:
- Use the appropriate argument to get your own vocabulary
- Fit the vectorizer on your training data, transform your test data
- Create a ```LogisticRegression``` model and train it with these representations. Display the confusion matrix using functions from ```sklearn.metrics```

Then, re-execute the same pipeline with the ```TfidfVectorizer```.

<div class='alert alert-block alert-info'>
            Code:</div>
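
The steps listed above can be sketched on a toy example as follows. This is a minimal illustration under stated assumptions, not the expected solution: the documents, labels and `toy_vocab` are made up, and the helper `fit_and_evaluate` is hypothetical. Note that scikit-learn expects the indexes of a user-supplied `vocabulary` to be contiguous and to start at 0, so a vocabulary starting at 1 would need to be shifted first.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-ins for the pre-processed documents (lists of tokens) and their labels
train_docs = [["car", "engine", "wheel"], ["hockey", "game", "team"],
              ["engine", "car", "road"], ["team", "hockey", "ice"]]
train_y = [0, 1, 0, 1]
test_docs = [["car", "road"], ["ice", "hockey"]]
test_y = [0, 1]
toy_vocab = {"car": 0, "engine": 1, "hockey": 2, "team": 3, "road": 4, "ice": 5}

def fit_and_evaluate(vectorizer_cls):
    # The documents are already tokenized, so the analyzer is the identity;
    # passing `vocabulary` fixes the columns to our own word -> index mapping
    vectorizer = vectorizer_cls(analyzer=lambda doc: doc, vocabulary=toy_vocab)
    X_train = vectorizer.fit_transform(train_docs)  # fit on the training data only
    X_test = vectorizer.transform(test_docs)
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)
    pred = clf.predict(X_test)
    return X_train, accuracy_score(test_y, pred), confusion_matrix(test_y, pred)

results = {}
for cls in (CountVectorizer, TfidfVectorizer):
    X_train, acc, cm = fit_and_evaluate(cls)
    results[cls.__name__] = (X_train.shape, acc, cm)
    print(cls.__name__, X_train.shape, acc)
    print(cm)
```

Words that are not in the supplied vocabulary (here `wheel` and `game`) are simply ignored by the vectorizer.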

### I.4 - Dense Representations from Topic Modeling

Now, the goal is to re-use the bag-of-words representations we obtained earlier - but reduce their dimension through a **topic model**. Note that this yields reduced **document representations**, which we can again use directly to perform classification.
- Do this with two models: ```TruncatedSVD``` and ```LatentDirichletAllocation```
- Pick $300$ as the dimensionality of the latent representation (*i.e*, the number of topics)

<div class='alert alert-block alert-info'>
            Code:</div>
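
A minimal sketch of the two reductions, assuming a bag-of-words matrix is already available. The random count matrices below are made-up stand-ins, and only 3 topics are used because the toy vocabulary has 12 words; the lab asks for 300.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy stand-in for bag-of-words matrices: 20 / 5 documents, 12 vocabulary words
X_train_bow = rng.integers(0, 4, size=(20, 12))
X_test_bow = rng.integers(0, 4, size=(5, 12))

n_topics = 3  # 300 in the lab; it cannot exceed the number of features

# LSA: truncated SVD of the document-term matrix
lsa = TruncatedSVD(n_components=n_topics, random_state=0)
X_train_lsa = lsa.fit_transform(X_train_bow)
X_test_lsa = lsa.transform(X_test_bow)

# LDA: expects non-negative counts; each output row is a distribution over topics
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
X_train_lda = lda.fit_transform(X_train_bow)
X_test_lda = lda.transform(X_test_bow)

print(X_train_lsa.shape, X_test_lsa.shape)
print(X_train_lda.shape, X_train_lda.sum(axis=1)[:3])
```

Both models are fitted on the training matrix only, and the test matrix is projected with `transform`, mirroring the vectorizer step.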

### I.5 - Dense Count-based Representations

The following function gives very high-dimensional vectors for **words**. We will now follow a different procedure:
- Use ```TruncatedSVD``` to obtain **word embeddings** of dimension $300$ from the output of the ```co_occurence_matrix``` function, to which you can apply any intermediate transformation you see fit.
- Complete the ```sentence_representations``` function given below, which will allow you to obtain **document representations** from **word embeddings**.
- Put the pipeline together and obtain document representations for both training and testing data, using word embeddings you got from the *training data co-occurrence matrix*.
- Apply the same classification model as before, and display the results.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def co_occurence_matrix(corpus, vocabulary, window=0, distance_weighting=False):
    """
    Params:
        corpus (list of list of strings): corpus of sentences
        vocabulary (dictionary): words to use in the matrix
        window (int): size of the context window; when 0, the context is the whole sentence
        distance_weighting (bool): indicates if we use a weight depending on the distance between words for co-oc counts
    Returns:
        matrix (array of size (len(vocabulary), len(vocabulary))): the co-oc matrix, using the same ordering as the vocabulary given in input
    """
    l = len(vocabulary)
    M = np.zeros((l,l))
    for sent in corpus:
        sent = ...
        # Obtain the indexes of the words in the sentence from the vocabulary
        sent_idx = ...
        # Avoid one-word sentences - can create issues in normalization:
        if len(sent_idx) == 1:
            sent_idx.append(len(vocabulary)-1)
        # Go through the indexes and add 1 / dist(i,j) to M[i,j] if words of index i and j appear in the same window
        for i, idx in enumerate(sent_idx):
            # If we consider a limited context:
            if window > 0:
                # Create a list containing the indexes that are on the left of the current index 'idx_i'
                l_ctx_idx = ...
            # If the context is the entire document:
            else:
                # The list containing the left context is easier to create
                l_ctx_idx = ...
            # Go through the list and update M[i,j]:
            for j, ctx_idx in enumerate(l_ctx_idx):
                if distance_weighting:
                    weight = ...
                else:
                    weight = ...
                M[idx, ctx_idx] += weight * 1.0
                M[ctx_idx, idx] += weight * 1.0
    return M

In [None]:
# Obtain the co-occurence matrix, transform it as needed, reduce its dimension
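
One common intermediate transformation is positive pointwise mutual information (PPMI), which the introduction of this lab mentions. The sketch below is a minimal NumPy version under stated assumptions: the helper name `ppmi` is hypothetical and the 3x3 count matrix is made up.

```python
import numpy as np

def ppmi(cooc, eps=1e-12):
    """Positive PMI of a square co-occurrence count matrix (hypothetical helper)."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    expected = row @ col / total          # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc / np.maximum(expected, eps))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts give log(0) = -inf
    return np.maximum(pmi, 0.0)           # keep only positive associations

M = np.array([[0., 4., 0.],
              [4., 0., 1.],
              [0., 1., 0.]])
P = ppmi(M)
print(np.round(P, 3))
```

The rows of the transformed matrix can then be reduced with ```TruncatedSVD```, whose ```fit_transform``` output gives one dense vector per vocabulary word.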


We will now use these representations for classification.
The basic model will be constructed in two steps:
- A function to obtain vector representations of documents, from text, vocabulary, and vector representations of words. Such a function (to be completed below) will associate to each word of a document its embedding, and create the representation for the whole document by aggregating these embeddings (for example by summing or averaging them).
- A classifier will take these representations as input and make a prediction. To achieve this, we can first use logistic regression ```LogisticRegression``` from ```scikit-learn```

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
def sentence_representations(texts, vocabulary, embeddings, np_func=np.mean):
    """
    Represent the sentences as a combination of the vectors of their words.
    Parameters
    ----------
    texts : a list of sentences
    vocabulary : dict
        From words to indexes of vector.
    embeddings : Matrix containing word representations
    np_func : function (default: np.mean)
        A numpy matrix operation that can be applied columnwise,
        like `np.mean`, `np.sum`, or `np.prod`.
    Returns
    -------
    np.array, dimension `(len(texts), embeddings.shape[1])`
    """
    #
    # To complete !
    #
    return representations
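
For reference, one possible way to pool word vectors into document vectors is sketched below. It is an illustration rather than the expected answer: the name `aggregate_word_vectors`, the `unk_token` parameter, the assumption that the vocabulary contains an `UNK` entry, and the choice to leave empty documents as zero vectors are all assumptions.

```python
import numpy as np

def aggregate_word_vectors(texts, vocabulary, embeddings, np_func=np.mean, unk_token="UNK"):
    """Hypothetical sketch: one vector per document, pooled from its word vectors."""
    unk_idx = vocabulary[unk_token]
    representations = np.zeros((len(texts), embeddings.shape[1]))
    for d, tokens in enumerate(texts):
        # Map every token to its row index, falling back to the UNK row
        idx = [vocabulary.get(tok, unk_idx) for tok in tokens]
        if idx:  # leave an all-zero row for an empty document
            representations[d] = np_func(embeddings[idx], axis=0)
    return representations

toy_vocab = {"car": 0, "engine": 1, "UNK": 2}
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
docs = [["car", "engine"], ["car", "boat"], []]
doc_vectors = aggregate_word_vectors(docs, toy_vocab, toy_embeddings)
print(doc_vectors)
```

Averaging rather than summing makes the representation less sensitive to document length, which matters here because message lengths vary a lot.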

In [None]:
# Obtain document representations, apply the classifier


### I.6 - Dense Prediction-based Representations

We will now use two types of word embeddings:
1. From ```Word2Vec```: which we will train ourselves
2. From ```GloVe```: which we will import

We will use the ```gensim``` library for both: its downloader gives access to pre-trained GloVe vectors, and it provides an implementation of word2vec in python.

#### GloVe

In [None]:
import gensim.downloader as api
loaded_glove_model = api.load("glove-wiki-gigaword-300")

We can extract the embedding matrix this way, and check its size:

In [None]:
loaded_glove_embeddings = loaded_glove_model.vectors
print(loaded_glove_embeddings.shape)

We can see that there are $400,000$ words represented, and that the embeddings are of size $300$. We define a function that returns, from the loaded model, the vocabulary and the embedding matrix according to the structures we used before. We add, here again, an unknown word "UNK" in case there are words in our data that are not part of the $400,000$ words represented here.

In [None]:
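def get_glove_voc_and_embeddings(glove_model):
    # index_to_key lists the words by row index, so enumerate yields (index, word) pairs
    voc = {word: index for index, word in enumerate(glove_model.index_to_key)}
    voc['UNK'] = len(voc)
    # Append an all-zero row so that the 'UNK' index points to a valid vector
    unk_row = np.zeros((1, glove_model.vectors.shape[1]), dtype=glove_model.vectors.dtype)
    embeddings = np.vstack([glove_model.vectors, unk_row])
    return voc, embeddings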

In [None]:
loaded_glove_voc, loaded_glove_embeddings = get_glove_voc_and_embeddings(loaded_glove_model)

To be able to merge these $400,000$ words with those that are in our vocabulary, we can create a specific function that will extract the representations of the words that are in our vocabulary and return a matrix of the appropriate size:

In [None]:
def get_glove_adapted_embeddings(glove_model, input_voc):
    keys = {i: glove_model.key_to_index.get(w, None) for w, i in input_voc.items()}
    index_dict = {i: key for i, key in keys.items() if key is not None}
    embeddings = np.zeros((len(input_voc),glove_model.vectors.shape[1]))
    for i, ind in index_dict.items():
        embeddings[i] = glove_model.vectors[ind]
    return embeddings

This function takes as input the model loaded using the Gensim API, as well as a vocabulary we created ourselves, and returns the embedding matrix from the loaded model, for the words in our vocabulary and in the right order.
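
To see what this alignment does without downloading the vectors, the sketch below restates the same logic and runs it on a small stand-in object. `fake_glove` is hypothetical and only mimics the two attributes the function reads (`key_to_index` and `vectors`); it is not a real gensim model. The sketch also assumes that the vocabulary indexes run from 0 to `len(input_voc) - 1`; a vocabulary whose indexes start at 1 would need one extra row.

```python
from types import SimpleNamespace
import numpy as np

def adapt_embeddings(model, input_voc):
    """Same logic as get_glove_adapted_embeddings, restated so this sketch runs on its own."""
    embeddings = np.zeros((len(input_voc), model.vectors.shape[1]))
    for word, i in input_voc.items():
        row = model.key_to_index.get(word)
        if row is not None:           # words unknown to the model keep a zero vector
            embeddings[i] = model.vectors[row]
    return embeddings

# Hypothetical stand-in exposing the two attributes used above
fake_glove = SimpleNamespace(
    key_to_index={"the": 0, "car": 1, "engine": 2},
    vectors=np.array([[0.1, 0.1], [1.0, 2.0], [3.0, 4.0]]),
)
our_voc = {"engine": 0, "car": 1, "bricklin": 2, "UNK": 3}
E = adapt_embeddings(fake_glove, our_voc)
print(E)
```

Rows follow the order of our own vocabulary, and words missing from the pre-trained model (here `bricklin` and `UNK`) are left as zero vectors.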


In [None]:
GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, ...) # Use your vocabulary

In [None]:
print(GloveEmbeddings.shape)

#### Word2Vec

We will use the ```gensim``` library for its implementation of word2vec in python. We'll have to make a specific use of it, since we want to keep the same vocabulary as before: we'll first create the class, then get the vocabulary we generated above.

In [None]:
from gensim.models import Word2Vec

model = Word2Vec(vector_size=300,
                 window=5,
                 null_word=len(...), # Use word counts
                 epochs=30)
model.build_vocab_from_freq(...) # Use word counts

The model takes as input a **list of list of words**: you need to tokenize the data beforehand.
In this case, you also need to indicate to the model the number of examples it should train with.
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
ng_train_text_tokenized = []
ex = 0
for sent in train_texts_splt:
    ...

In [None]:
model.train(ng_train_text_tokenized, total_examples=ex, epochs=30, report_delay=1)

In [None]:
W2VEmbeddings = model.wv.vectors
print(W2VEmbeddings.shape)

In [None]:
# Re-train a Logistic regression classifier with those representations


<div class='alert alert-block alert-warning'>
            Question:</div>

- Why can we expect that the results obtained with embeddings extracted from representations pre-trained with GloVe are much better than word2vec? What would be a 'fair' way to compare GloVe with word2vec?

<div class='alert alert-block alert-warning'>
            Question:</div>

- Try to give a high-level analysis of the results. Which representation works best? Did the confusion matrix give you any insight?

## II - Text classification with Pytorch

The goal of this second part of the lab is twofold: an introduction to using Pytorch for treating textual data, and implementing neural classification models that we can apply to our data - and then compare them to the models implemented previously.

In [None]:
import torch
import torch.nn as nn

### II.1 A (very small) introduction to pytorch

Pytorch Tensors are very similar to Numpy arrays, with the added benefit of being usable on GPU. For a short tutorial on various methods to create tensors of particular types, see [this link](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py).
The important things to note are that Tensors can be created empty or from lists, and that it is very easy to convert a numpy array into a pytorch tensor, and vice versa.

In [None]:
a = torch.LongTensor(5)    # uninitialized tensor with 5 elements (arbitrary values)
b = torch.LongTensor([5])  # tensor containing the single value 5

print(a)
print(b)

In [None]:
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)

The main reason for us to use Pytorch is the ```autograd``` package. ```torch.Tensor``` objects have an attribute ```.requires_grad```; if set to True, all operations on the tensor are tracked. When you finish your computation, you can call ```.backward()``` and all the gradients are computed automatically (and stored in the ```.grad``` attribute).

One way to easily cut a tensor from the computational graph once it is not needed anymore is to use ```.detach()```.
More info on automatic differentiation in pytorch on [this link](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py).

In [None]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2
print(w.grad)    # w.grad = 1
print(b.grad)    # b.grad = 1
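
Two behaviours mentioned above are easy to check on a scalar: gradients accumulate across calls to ```.backward()```, and ```.detach()``` returns a tensor that no longer belongs to the graph. The snippet below is a small illustrative sketch with made-up values.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Gradients accumulate across backward calls until they are reset
(3 * w).backward()
(3 * w).backward()
accumulated = w.grad.item()   # 3 + 3
print(accumulated)
w.grad.zero_()

# detach() returns a tensor sharing the data but cut off from the graph
y = w * w
y_const = y.detach()
print(y.requires_grad, y_const.requires_grad)

# y_const is treated as a constant: only the explicit factor w contributes
(y_const * w).backward()
print(w.grad)                 # d/dw (4 * w) = 4
```

This accumulation is why the training code further down resets the gradients before each backward pass.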

In [None]:
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)
for name, p in linear.named_parameters():
    print(name)
    print(p)

# Build loss function - Mean Square Error
criterion = nn.MSELoss()

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('Initial loss: ', loss.item())

# Backward pass.
loss.backward()

# Print out the gradients.
print ('dL/dw: ', linear.weight.grad)
print ('dL/db: ', linear.bias.grad)

In [None]:
# You can perform gradient descent manually, with an in-place update ...
linear.weight.data.sub_(0.01 * linear.weight.grad.data)
linear.bias.data.sub_(0.01 * linear.bias.grad.data)

# Print out the loss after 1-step gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after one update: ', loss.item())

In [None]:
# Use the optim package to define an Optimizer that will update the weights of the model.
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

# By default, gradients are accumulated in buffers (i.e., not overwritten) whenever .backward()
# is called. Before the backward pass, we need to use the optimizer object to zero all of the
# gradients.
optimizer.zero_grad()
loss.backward()

# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()

# Print out the loss after the second step of gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after two updates: ', loss.item())
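
The individual steps shown so far are usually wrapped in a loop. The sketch below repeats them for a fixed number of iterations on random data; the seed, the number of steps and the learning rate are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10, 3)
y = torch.randn(10, 2)

linear = nn.Linear(3, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

losses = []
for step in range(50):
    optimizer.zero_grad()            # reset the accumulated gradients
    loss = criterion(linear(x), y)   # forward pass
    loss.backward()                  # backward pass
    optimizer.step()                 # parameter update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The same four-line pattern (zero the gradients, forward, backward, step) carries over unchanged to the neural text classifiers built later, with batches coming from a ```DataLoader```.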

### II.2 Tools for data processing

```torch.utils.data.Dataset``` is an abstract class representing a dataset. Your custom dataset should inherit ```Dataset``` and override the following methods:
- ```__len__``` so that ```len(dataset)``` returns the size of the dataset.
- ```__getitem__``` to support the indexing such that ```dataset[i]``` can be used to get the i-th sample

Here is a toy example:

In [None]:
toy_corpus = ['I walked down down the boulevard',
              'I walked down the avenue',
              'I ran down the boulevard',
              'I walk down the city',
              'I walk down the the avenue']

toy_categories = [0, 0, 1, 0, 0]

In [None]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    # A pytorch dataset class for holding data for a text classification task.
    def __init__(self, data, categories):
        # Upon creating the Dataset object, store the data in an attribute
        # Split the text data and labels from each other
        self.X, self.Y = [], []
        for x, y in zip(data, categories):
            # We will probably need to preprocess the data - have it done in a separate method
            # We do it here because we might need corpus-wide info to do the preprocessing
            # For example, cutting all examples to the same length
            self.X.append(self.preprocess(x))
            self.Y.append(y)

    # Method allowing you to preprocess data
    def preprocess(self, text):
        text_pp = text.lower().strip()
        return text_pp

    # Overriding the method __len__ so that len(CustomDatasetName) returns the number of data samples
    def __len__(self):
        return len(self.Y)

    # Overriding the method __getitem__ so that CustomDatasetName[i] returns the i-th sample of the dataset
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

In [None]:
toy_dataset = CustomDataset(toy_corpus, toy_categories)

In [None]:
print(len(toy_dataset))
for i in range(len(toy_dataset)):
    print(toy_dataset[i])

```torch.utils.data.DataLoader``` is an iterator over a ```Dataset``` that provides very useful features:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers

It can be created very simply from a ```Dataset```. Continuing on our simple example:

In [None]:
toy_dataloader = DataLoader(toy_dataset, batch_size = 2, shuffle = True)

In [None]:
for e in range(3):
    print("Epoch:" + str(e))
    for x, y in toy_dataloader:
        print("Batch: " + str(x) + "; labels: " + str(y))
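
By default the ```DataLoader``` stacks the samples of a batch, which requires tensors of equal shape. When a dataset returns index sequences of different lengths, one option is to pad per batch through the ```collate_fn``` argument. The sketch below illustrates that option; it is an alternative to padding the whole dataset up front as the lab does later, and `ToyIndexedDataset` and `pad_collate` are hypothetical names.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class ToyIndexedDataset(Dataset):
    """Hypothetical dataset returning variable-length index tensors."""
    def __init__(self, sequences, labels):
        self.sequences = [torch.tensor(s, dtype=torch.long) for s in sequences]
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs produced by __getitem__
    seqs, labels = zip(*batch)
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)

dataset = ToyIndexedDataset([[1, 4, 5], [2], [6, 7], [3, 3, 3, 3]], [0, 1, 0, 1])
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)
for xb, yb in batches:
    print(xb.shape, yb)
```

Padding per batch means each batch is only as wide as its longest sequence, at the cost of batches having different widths.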

#### Data processing of a text dataset

Now, we would like to apply what we saw to our case, and **create a specific class** ```TextClassificationDataset``` **inheriting** ```Dataset``` that will:
- Create a vocabulary from the data (same as above)
- Preprocess the data using this vocabulary, adding whatever we need for our pytorch model
- Have a ```__getitem__``` method that allows us to use the class with a ```DataLoader``` to easily build batches.

In [None]:
from torch.nn import functional as F
import random

from torch.nn.utils.rnn import pad_sequence

We will now need to create a ```TextClassificationDataset``` and a ```DataLoader``` for the training data, the validation data, and the testing data.

We will implement our ```TextClassificationDataset``` class, that we will build from:
- A list of documents: ```data```
- A list of the corresponding categories: ```categories```

We will add three optional arguments:
- First, a way to input a vocabulary (so that we can re-use the training vocabulary on the validation and test ```TextClassificationDataset```). By default, the value of the argument is ```None```.
- In order to work with batches, we will need to have sequences of the same size. That can be done via **padding** but we will still need to limit the size of documents (to avoid having batches of huge sequences that are mostly empty because of one very long document) to a ```max_length```. The signature below sets it to 200 by default.
- Lastly, a ```voc_threshold``` that caps the size of the vocabulary (10000 words by default in the signature below).

The idea behind **padding** is to transform a list of pytorch tensors (of possibly different lengths) into a two-dimensional tensor, which we can see as a batch. With ```batch_first=True```, the first dimension indexes the sequences and the size of the second dimension is the length of the longest tensor; the others are **padded** with a chosen symbol: here, we choose 0.

**Careful: the symbol 0 is then reserved for padding. That means the vocabulary must begin at 1 !**

In [None]:
tensor_1 = torch.LongTensor([1, 4, 5])
tensor_2 = torch.LongTensor([2])
tensor_3 = torch.LongTensor([6, 7])

In [None]:
tensor_padded = pad_sequence([tensor_1, tensor_2, tensor_3], batch_first=True, padding_value = 0)
print(tensor_padded)
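
The same call can be combined with the length limit discussed above. The following is only a sketch with a toy `max_length` of 4 (an illustrative value, not the lab's default): each sequence is cut first, then padded, and a mask of the non-padding positions is recovered by comparing with 0.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

max_length = 4  # toy value chosen for illustration
sequences = [
    torch.LongTensor([1, 4, 5, 3, 2, 8]),
    torch.LongTensor([2]),
    torch.LongTensor([6, 7]),
]
# Cut every sequence to at most max_length tokens before padding
cut_sequences = [seq[:max_length] for seq in sequences]
batch = pad_sequence(cut_sequences, batch_first=True, padding_value=0)
# True where there is a real token, False where there is padding
mask = batch != 0
lengths = mask.sum(dim=1)
print(batch.shape)
print(lengths)
```

Cutting before padding bounds the width of the padded tensor by `max_length`, whatever the length of the longest document.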

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class TextClassificationDataset(Dataset):
    def __init__(self, data, categories, vocab = None, max_length = 200, voc_threshold = 10000):
        # Get all the data in a list
        self.data = data
        # Set the maximum length we will keep for the sequences
        self.max_length = max_length
        # Allow to import a vocabulary (for valid/test datasets, that will use the training vocabulary)
        if vocab is not None:
            self.word2idx, self.idx2word = vocab
        else:
            # If no vocabulary imported, build it (and reverse)
            self.word2idx, self.idx2word = self.build_vocab(self.data, voc_threshold)

        # We then need to tokenize the data ..
        tokenized_data = ...
        # Transform words into lists of indexes ... (use the .get() method to redirect unknown words to the UNK token)
        indexed_data = ...
        # And transform this list of lists into a list of Pytorch LongTensors
        tensor_data = ...
        # And the categories into a LongTensor (nn.CrossEntropyLoss expects class indexes of type long)
        tensor_y = ...
        # To finally cut it when it's above the maximum length
        cut_tensor_data = ...

        # Now, we need to use the pad_sequence function to have the whole dataset represented as one tensor,
        # containing sequences of the same length. We choose the padding_value to be 0, and we want the
        # batch dimension to be the first dimension
        self.tensor_data = ...
        self.tensor_y = tensor_y

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The iterator just gets one particular example with its category
        # The dataloader will take care of the shuffling and batching
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_data[idx], self.tensor_y[idx]

    def build_vocab(self, corpus, voc_threshold):
        """
        Same as previously: we want to output word_index, a dictionary containing words
        and their corresponding indexes as {word: index}
        But we also want the reverse, which is a dictionary {index: word}
        """
        # To complete
        # Careful, here we need to shift the indexes by 1 to put the padding symbol to 0
        return ...

    def get_vocab(self):
        # A simple way to get the training vocab when building the valid/test
        return self.word2idx, self.idx2word

In [None]:
training_dataset = TextClassificationDataset(train_texts_splt, train_labels_splt)
training_word2idx, training_idx2word = training_dataset.get_vocab()

In [None]:
valid_dataset = TextClassificationDataset(val_texts, val_labels, (training_word2idx, training_idx2word))
test_dataset = TextClassificationDataset(ng_test_text, ng_test_labels, (training_word2idx, training_idx2word))

In [None]:
training_dataloader = DataLoader(training_dataset, batch_size = 200, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size = 25)
test_dataloader = DataLoader(test_dataset, batch_size = 25)

In [None]:
print(valid_dataset[1])

In [None]:
example_batch = next(iter(training_dataloader))
print(example_batch[0].size())
print(example_batch[1].size())

### II.3 A simple averaging model

Now, we will implement in Pytorch what we did in the previous lab: a simple averaging model. For each model we implement, we need to create a class which inherits from ```nn.Module``` and redefine the ```__init__``` method as well as the ```forward``` method.
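
To make the averaging step concrete, here is a small functional sketch on toy tensors (not the class to complete). The sizes and the `toy_` names are illustrative, and the masked mean is one possible design choice rather than something the lab requires: a plain mean also counts the padding positions, which shrinks the representation of short documents.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy sizes: 7 real words (indexes 1..7) plus the padding index 0
toy_embeddings = nn.Embedding(num_embeddings=8, embedding_dim=5, padding_idx=0)
toy_linear = nn.Linear(5, 3)

toy_inputs = torch.LongTensor([[1, 4, 5], [2, 0, 0]])  # batch_size x seq_length
embedded = toy_embeddings(toy_inputs)  # batch_size x seq_length x embedding_dim

# Plain mean over the sequence dimension: padding rows (all zeros) are counted too
plain_mean = embedded.mean(dim=1)
# Masked mean: divide by the number of real tokens instead
mask = (toy_inputs != 0).unsqueeze(-1).float()
masked_mean = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

scores = toy_linear(masked_mean)  # batch_size x number of classes
print(scores.shape)
```

Passing `padding_idx=0` keeps the padding row at zero and excludes it from gradient updates, so padded positions never contribute to the sum.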

In [None]:
import torch.optim as optim

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class AveragingModel(nn.Module):

    def __init__(self, embedding_dim, vocabulary_size, categories_num):
        super().__init__()
        # Create an embedding object. Be careful with padding: you need to increase the vocabulary size by one!
        # Look into the arguments of the nn.Embedding class
        self.embeddings = ...
        # Create a linear layer that will transform the mean of the embeddings into classification scores
        self.linear = ...

    def forward(self, inputs):
        # Remember: the inputs have shape batch_size * seq_length; once embedded, batch_size * seq_length * embedding_dim
        # First, take the mean of the embeddings of the document
        x = ...
        o = ...
        return o

In [None]:
model = AveragingModel(300, len(training_word2idx), max(ng_train_labels)+1)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a cross entropy loss based on logits,
# meaning that the softmax is integrated into the criterion
criterion = nn.CrossEntropyLoss()
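
As a quick check of what "the softmax is integrated into the criterion" means, the sketch below uses random toy logits (illustrative values only) and compares ```nn.CrossEntropyLoss``` with an explicit log-softmax followed by the negative log-likelihood loss. Note that the targets are class indexes stored as a ```LongTensor```.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 20)                # batch_size x number of classes, raw scores
targets = torch.LongTensor([3, 0, 19, 7])  # one class index per example

# Cross entropy applied directly to the raw scores
loss_from_logits = nn.CrossEntropyLoss()(logits, targets)
# Same quantity computed in two explicit steps
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(loss_from_logits.item(), loss_manual.item())
```

This is why the model's ```forward``` can return raw scores without applying a softmax itself.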

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Implement a training function, which will train the model with the corresponding optimizer and criterion,
# with the appropriate dataloader, for one epoch.

def train_epoch(model, opt, criterion, dataloader):
    model.train()
    losses = []
    for i, (x, y) in enumerate(dataloader):
        opt.zero_grad()
        # (1) Forward
        pred = ...
        # (2) Compute the loss
        loss = ...
        # (3) Compute gradients
        ...
        # (4) update weights
        ...
        losses.append(loss.item())
        # Count the number of correct predictions in the batch: take the argmax of the scores (applying the softmax first does not change it)
        num_corrects = ...
        acc = 100.0 * num_corrects/len(y)

        if (i%20 == 0):
            print("Batch " + str(i) + " : training loss = " + str(loss.item()) + "; training acc = " + str(acc.item()))
    return losses
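
The accuracy computation used in the loop can be sketched on hand-written toy scores (the values and `toy_` names are illustrative). The predicted class is the index of the largest score, and since the softmax is monotonic it selects the same index.

```python
import torch

toy_scores = torch.tensor([[2.0, 0.1, -1.0],
                           [0.3, 0.2, 1.5],
                           [0.0, 3.0, 0.5]])
toy_y = torch.LongTensor([0, 2, 0])

# The predicted class is the index of the highest score in each row
predicted = toy_scores.argmax(dim=1)
# Going through the softmax first gives the same predictions
predicted_from_probs = torch.softmax(toy_scores, dim=1).argmax(dim=1)

# Keep num_corrects as a tensor so that acc.item() works as in the loop above
num_corrects = (predicted == toy_y).sum()
acc = 100.0 * num_corrects / len(toy_y)
print(num_corrects.item(), acc.item())
```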

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Same for the evaluation ! We don't need the optimizer here.

def eval_model(model, criterion, evalloader):
    model.eval()
    total_epoch_loss = 0
    total_epoch_acc = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(evalloader):
            pred = ...
            loss = ...
            num_corrects = ...
            acc = 100.0 * num_corrects/len(y)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss/(i+1), total_epoch_acc/(i+1)

In [None]:
# A function which will help you run experiments quickly, with an early_stopping option when necessary.

def experiment(model, opt, criterion, num_epochs = 10, early_stopping = True):
    train_losses = []
    if early_stopping:
        best_valid_loss = float("inf")
    print("Beginning training...")
    for e in range(num_epochs):
        print("Epoch " + str(e+1) + ":")
        train_losses += train_epoch(model, opt, criterion, training_dataloader)
        valid_loss, valid_acc = eval_model(model, criterion, valid_dataloader)
        print("Epoch " + str(e+1) + " : Validation loss = " + str(valid_loss) + "; Validation acc = " + str(valid_acc))
        if early_stopping:
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
            else:
                print("Early stopping.")
                break
    test_loss, test_acc = eval_model(model, criterion, test_dataloader)
    print("Epoch " + str(e+1) + " : Test loss = " + str(test_loss) + "; Test acc = " + str(test_acc))
    return train_losses
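
The early stopping rule above stops at the first epoch where the validation loss fails to improve. A common variation waits for several bad epochs in a row; the sketch below illustrates it in plain Python. The helper `stopping_epoch`, the `patience` parameter and the loss values are all hypothetical and not part of the lab.

```python
def stopping_epoch(valid_losses, patience=1):
    """Hypothetical helper: return the 1-based epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    # Never triggered: training runs through all epochs
    return len(valid_losses)

toy_valid_losses = [2.9, 2.1, 2.2, 1.8, 1.9, 2.0]  # made-up values
print(stopping_epoch(toy_valid_losses, patience=1))
print(stopping_epoch(toy_valid_losses, patience=2))
```

With `patience=1` the helper behaves like the rule in `experiment`; a larger value tolerates noisy validation curves at the cost of a few extra epochs.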

In [None]:
train_losses = experiment(model, opt, criterion)

In [None]:
import matplotlib.pyplot as plt
plt.plot(train_losses)

### II.4 Initializing with pre-trained embeddings

Now, we would like to integrate pre-trained word embeddings into our model! However, we must not forget to add a vector for the padding symbol.

In [None]:
def get_glove_adapted_embeddings(glove_model, input_voc):
    keys = {i: glove_model.key_to_index.get(w, None) for w, i in input_voc.items()}
    index_dict = {i: key for i, key in keys.items() if key is not None}
    # Important change here: add one supplementary word for padding
    embeddings = np.zeros((len(input_voc)+1,glove_model.vectors.shape[1]))
    for i, ind in index_dict.items():
        embeddings[i] = glove_model.vectors[ind]
    return embeddings

GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, training_word2idx)

In [None]:
print(GloveEmbeddings.shape)

Here, implement a ```PretrainedAveragingModel``` very similar to the previous model, using the ```nn.Embedding``` class method ```from_pretrained()``` to initialize the embeddings from a tensor built from the numpy array. Use the ```requires_grad_``` method to specify whether the model must fine-tune the embeddings or not!
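
As a hedged illustration of these two calls on a toy matrix (a random stand-in for the GloVe vectors, with row 0 left at zero for padding; the `toy_` names are hypothetical):

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-in for the GloVe matrix: 4 words + 1 padding row (row 0), dimension 3
toy_matrix = np.zeros((5, 3))
toy_matrix[1:] = np.random.RandomState(0).randn(4, 3)
toy_weights = torch.FloatTensor(toy_matrix)

# freeze=True keeps the vectors fixed during training
frozen = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=True, padding_idx=0)
# freeze=False lets the optimizer fine-tune them
tuned = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=False, padding_idx=0)

# The same switch can be flipped after construction
frozen.weight.requires_grad_(False)
print(frozen.weight.requires_grad, tuned.weight.requires_grad)
```

Frozen embeddings are also left out of the gradient computation, so only the linear layer is updated in that configuration.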
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class PretrainedAveragingModel(nn.Module):
    # To complete!
    ...

<div class='alert alert-block alert-warning'>
            Questions:</div>
            
- What are the results **with and without fine-tuning of embeddings imported from GloVe**? Explain them.
- Look at the confusion matrix for these results in more detail.

In [None]:
model_pre_trained = PretrainedAveragingModel(300, max(ng_train_labels)+1, torch.FloatTensor(GloveEmbeddings), True)
opt_pre_trained = optim.Adam(model_pre_trained.parameters(), lr=0.0025, betas=(0.9, 0.999))

In [None]:
train_losses = experiment(model_pre_trained, opt_pre_trained, criterion)

In [None]:
model_pre_trained_light = PretrainedAveragingModel(300, max(ng_train_labels)+1, torch.FloatTensor(GloveEmbeddings), False)
opt_pre_trained_light = optim.Adam(model_pre_trained_light.parameters(), lr=0.0025, betas=(0.9, 0.999))

In [None]:
train_losses = experiment(model_pre_trained_light, opt_pre_trained_light, criterion)