# Dataset

During this practical session, we will use the AG News dataset:  
*AG News (AG’s News Corpus) is a subdataset of [AG's corpus](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) of news articles constructed by assembling titles and description fields of articles from the 4 largest classes (“World”, “Sports”, “Business”, “Sci/Tech”) of AG’s Corpus. The AG News contains 30,000 training and 1,900 test samples per class.*  

Let's first download the dataset:

In [None]:
!wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/train.csv 2>&1
!wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/test.csv 2>&1

The following code will load the dataset and add the label names in a new column.

In [None]:
import pandas as pd

traindf = pd.read_csv('train.csv', names=["label", "title", "text"]).sample(2000)
testdf = pd.read_csv('test.csv', names=["label", "title", "text"]).sample(1000)

traindf['label'] = traindf['label'] -1
traindf['label_name'] = traindf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})
testdf['label'] = testdf['label'] -1
testdf['label_name'] = testdf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})
testdf

Start with a little bit of exploration:  
Use the Wordcloud library, following the example provided in the [documentation](https://amueller.github.io/word_cloud/auto_examples/simple.html#sphx-glr-auto-examples-simple-py), to plot the most common words in the corpus.  
To do so, start by joining all the documents into a single string. A simple way is to call ```" ".join()``` on a list of texts.
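
As a minimal sketch of the joining step, the snippet below uses a hypothetical three-document list (`texts` is an assumption standing in for the `text` column) and counts words with the standard library. A word cloud scales words according to counts of this kind.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the `text` column of the dataframe.
texts = [
    "Stocks rally as oil prices fall",
    "Oil prices climb after the storm",
    "Team wins the final after extra time",
]

# Join every document into one long string, separated by spaces.
corpus = " ".join(texts)

# Count lowercase word occurrences over the whole corpus.
word_counts = Counter(corpus.lower().split())
print(word_counts.most_common(4))
```

Note that frequent function words such as "the" show up among the top counts, which is why stop words are usually removed before plotting.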

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

corpus = ...
wordcloud = ...
...


By default, Wordcloud removes the most common English words (and, or, the, a ...). It is also possible to provide a custom list.  
[NLTK](https://www.nltk.org/) is a powerful library for natural language processing. It provides several lists of stop words that can be used to clean text.  
Even if it doesn't change the result here, let's provide Wordcloud with a custom list of stop words taken from NLTK.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english')) 

wordcloud = ...
...

Now, plot a different wordcloud for every category in the dataset.  
Are you capable of predicting the categories given only these wordclouds?

In [None]:
...

# Bag-of-Words  
We will now train different models to predict the category of these news articles.  
We saw, during the course, a first approach called "*bag of words*".  
BOW methods describe documents using counts or statistics on the words composing the documents. Once the bag-of-words is computed, documents are represented by vectors whose dimensions correspond to words present in the corpus vocabulary.  
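
To make the idea concrete, here is a from-scratch sketch of a count-based bag-of-words on two toy documents. The names `docs`, `vocabulary` and `bag_of_words` are illustrative assumptions, and whitespace splitting is a simplification of what a real vectorizer does.

```python
# Toy documents (hypothetical, not taken from AG News).
docs = ["the cat sat", "the dog sat on the mat"]

# Vocabulary: every distinct word, sorted so that each word gets a stable index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def bag_of_words(doc, vocabulary):
    """Return the count vector of `doc` over the vocabulary."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each document becomes a vector with one dimension per vocabulary word, regardless of word order.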

First, we will vectorize our documents using term frequencies.  
Look at the documentation of [scikit-learn's CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to encode the __text column__ of your training set.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = ...
...

The ```vocabulary_``` attribute of your fitted vectorizer contains a dictionary mapping each token to its index in the bag-of-words.  
How many unique tokens compose your bag-of-words?

In [None]:
...

You can also use the ```get_feature_names_out()``` method to get the list of identified tokens:

In [None]:
...

Now choose a classification method from scikit-learn and train it to classify the news articles.  
Print the classification score of your model on the training set.

In [None]:
...

Now use the ```transform``` method from your vectorizer on the testing set and print the score obtained by your model on the testing set.  
Your model is probably overfitting a lot.  
Plot a [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html) to see where your model makes the most mistakes.
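
For reference, a confusion matrix is only a table of counts. The sketch below builds one by hand from hypothetical labels and predictions (`y_true` and `y_pred` are made-up values), using the same convention as scikit-learn: rows are true classes, columns are predicted classes.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix

# Hypothetical labels: 0=World, 1=Sports, 2=Business, 3=Sci/Tech.
y_true = [0, 0, 1, 2, 2, 3, 3, 3]
y_pred = [0, 2, 1, 2, 3, 3, 3, 2]

cm = confusion_counts(y_true, y_pred, n_classes=4)
for row in cm:
    print(row)

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = sum(cm[i][i] for i in range(4)) / len(y_true)
print(accuracy)
```

Off-diagonal cells show which pairs of categories get confused with each other.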

In [None]:
...

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
...

Play with some of the vectorizer hyper-parameters to see whether you can improve the performance of your classifier on the testing set.  
Try adding stop words or changing the ```ngram_range```...

In [None]:
...

Once you are satisfied with the performance, or can no longer improve it, plot a t-SNE of your training representations with labels as colors.  
In particular, compare the t-SNE representations computed with and without stop words.  
What do you observe?

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns 

...

We will now use a second vectorization strategy, usually a little more effective than raw term frequencies: TF-IDF.  
How does it differ from the previous method?  
Use [scikit-learn's ```TfidfVectorizer```](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) to vectorize the documents in your corpus and train a classification algorithm to classify documents.  
Print the score you obtain on the testing set and the corresponding confusion matrix.
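
The following from-scratch sketch shows how TF-IDF weights could be computed on a toy corpus. It assumes the smoothed formula documented for scikit-learn's default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The toy `docs` and helper names are hypothetical.

```python
import math

# Toy tokenized documents (hypothetical).
docs = [["oil", "prices", "rise"], ["oil", "prices", "fall"], ["team", "wins", "final"]]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = {w: sum(w in doc for doc in docs) for w in vocabulary}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf(doc):
    """Term count times idf, then L2-normalized."""
    weights = [doc.count(w) * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

first = dict(zip(vocabulary, tfidf(docs[0])))
print(first)
```

In the first document, "rise" appears in a single document and therefore receives a larger weight than "oil", which appears in two.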


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
...

Plot a t-SNE of the representations obtained using TF-IDF.

In [None]:
...

Both the ```TfidfVectorizer``` and ```CountVectorizer``` use a default strategy to split a text into tokens, using whitespace and punctuation as separators.  
It is possible to provide custom __tokenizers__ to these vectorizers.  
Here we will use NLTK to build a more powerful tokenizer that will:

*   Remove stop words
*   Convert all texts to lowercase
*   Ignore punctuation symbols
*   Only consider letters
*   Perform Stemming on every token
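
The steps listed above can be sketched without any external library. The stop-word set and the suffix-stripping rule below are deliberately crude, hypothetical stand-ins for NLTK's stop-word list and Snowball stemmer; they only illustrate the order of operations.

```python
import re

# Tiny hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "a", "is", "are", "of", "and"}
SUFFIXES = ("ing", "ed", "s")

def naive_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def simple_tokenizer(text):
    text = text.lower()                    # convert to lowercase
    text = re.sub(r"[^a-z ]", " ", text)   # keep letters only (drops digits and punctuation)
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]  # stem every remaining token

tokens = simple_tokenizer("The markets are rallying, and investors cheered!")
print(tokens)
```

Stemming merges inflected forms ("markets", "market") into a single token, which shrinks the vocabulary.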



In [None]:
from nltk import word_tokenize          
from nltk.stem import SnowballStemmer
import nltk
from nltk.corpus import stopwords
import re


# Download the tokenizer models and the stop word lists
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english')) 

# Interface between NLTK's tokenizer and stemmer and scikit-learn
class StemTokenizer:
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.stemmer = SnowballStemmer('english')
    def __call__(self, doc):
        doc = doc.lower()
        return [self.stemmer.stem(t) for t in word_tokenize(re.sub("[^a-z' ]", "", doc)) if t not in self.ignore_tokens]

tokenizer=StemTokenizer()

Print an example of text from the dataset and the corresponding tokens computed by the tokenizer.

In [None]:
...

Now provide the tokenizer to a ```TfidfVectorizer``` and repeat the entire process.  
Does it improve the testing performance?  
Tip: you should also provide a tokenized version of the stop words, since we apply stemming to all tokens.

In [None]:
token_stop = tokenizer(' '.join(stop_words))

tfidf = TfidfVectorizer(stop_words=token_stop, tokenizer=tokenizer)

...

It is also possible to combine bag-of-words features with other features manually computed.  
The following code computes some new features on all documents.

In [None]:
def count_chars(text):
    return len(text)

def count_words(text):
    return len(text.split())

def count_capital_words(text):
  # Number of fully uppercase words (e.g. acronyms)
  return sum(map(str.isupper,text.split()))

def count_punctuations(text):
  # Count the characters that belong to a small set of punctuation marks
  punctuation = ('!', ',', "'", ';', '"', '.', '-', '?')
  return sum(char in punctuation for char in text)

def count_sentences(text):
    return len(nltk.sent_tokenize(text))

def count_unique_words(text):
    return len(set(text.split()))

for df in [traindf, testdf]:
  df['count_chars'] = df.text.apply(lambda s: count_chars(s))
  df['count_words'] = df.text.apply(lambda s: count_words(s))
  df['count_capital_words'] = df.text.apply(lambda s: count_capital_words(s))
  df['count_punctuations'] = df.text.apply(lambda s: count_punctuations(s))
  df['count_sentences'] = df.text.apply(lambda s: count_sentences(s))
  df['count_unique_words'] = df.text.apply(lambda s: count_unique_words(s))
  df['avg_wordlength'] = df['count_chars']/df['count_words']
  df['avg_sentlength'] = df['count_words']/df['count_sentences']

Using a [```ColumnTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) it is possible to combine all the features.

In [None]:
from sklearn.compose import ColumnTransformer

columns_to_keep = ['text', 'count_chars', 'count_words',
       'count_capital_words', 'count_punctuations',
       'count_unique_words', 'count_sentences', 'avg_wordlength',
       'avg_sentlength']

column_trans = ColumnTransformer(
    [('categories', TfidfVectorizer(stop_words=token_stop, tokenizer=tokenizer), 'text')],
    remainder='passthrough', verbose_feature_names_out=False)

X_train = column_trans.fit_transform(traindf[columns_to_keep])
X_test = column_trans.transform(testdf[columns_to_keep])

Unfortunately, in our case, these features do not improve the testing performance.  
In some other tasks like spam detection they can have a stronger influence.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, traindf.label)
print(f"Model score on training data: {rf.score(X_train, traindf.label):.2f}")
print(f"Model score on test data: {rf.score(X_test, testdf.label):.2f}")

predictions = rf.predict(X_test)
cm = confusion_matrix(testdf.label, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["World", "Sports", "Business", "Sci/Tech"])
disp.plot()
plt.show()

# Word2Vec

We will now use a second vectorization technique, seen during the course lectures: word vectorization.  
First, we will use the Gensim library to compute or download pre-computed word embeddings.

In [None]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip 
!unzip glove.6B.zip > /dev/null 2>&1

In [None]:
glove_file = ('glove.6B.100d.txt')
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

The model is a mapping between words and their vector representations.

In [None]:
model['apple']

It also has useful methods to explore the vocabulary's embeddings.  
Here are some examples to find the most similar words in the embedding space.  
Try with some other words and check whether the most similar words seem plausible.
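
Similarity between embeddings is typically measured with the cosine of the angle between the vectors. The sketch below ranks neighbours on hypothetical 3-dimensional vectors (`toy_vectors` is invented for illustration; the embeddings loaded above have 100 dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings.
toy_vectors = {
    "coffee": [0.9, 0.1, 0.0],
    "tea":    [0.8, 0.2, 0.1],
    "wolf":   [0.0, 0.9, 0.3],
    "dog":    [0.1, 0.8, 0.4],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = most_similar("coffee", toy_vectors)
print(neighbours)
```

With these toy values, "tea" comes out as the closest neighbour of "coffee", far ahead of the animal words.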

In [None]:
model.most_similar('zuckerberg')

In [None]:
model.most_similar('google')

In [None]:
model.most_similar('intelligence')

In [None]:
model.most_similar(negative=['network'])

Another cool feature of Word2Vec is the possibility to perform analogies.  
The most famous example is certainly king - man + woman = queen.  
Try to find other working analogies.
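
The arithmetic behind an analogy can be sketched on hypothetical 2-dimensional vectors, where the first axis loosely encodes "royalty" and the second "gender". The values in `toy_vectors` are invented so that the example works; library implementations typically also normalize the vectors first, which this sketch skips.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-d vectors: first axis ~ royalty, second axis ~ gender.
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(answer)
```

The input words are excluded from the candidates, otherwise "king" itself would often be the nearest vector.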

In [None]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

In [None]:
result = model.most_similar(positive=['paris', 'spain'], negative=['france'])
print("{}: {:.4f}".format(*result[0]))

In [None]:
result = model.most_similar(positive=['clinton', 'republican'], negative=['democrat'])
print("{}: {:.4f}".format(*result[0]))

In [None]:
result = model.most_similar(positive=['beer', 'france'], negative=['usa'])
for i in range(3):
  print("{}: {:.4f}".format(*result[i]))

The following code plots a PCA or t-SNE representation of a list of words.  
Use this method with your own list of words to see whether similar words are close to each other in the embedding space.

In [None]:
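import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_embeddings(model, words, reduction='pca'):
    # Stack the embeddings of the requested words into a (n_words, dim) array
    word_vectors = np.array([model[w] for w in words])
    if reduction == 'pca':
      reductor = PCA(n_components=2)
    elif reduction == 'tsne':
      # the perplexity must be smaller than the number of words
      reductor = TSNE(n_components=2, perplexity=20)
    else:
      raise ValueError(f"Unknown reduction: {reduction}")
    X = reductor.fit_transform(word_vectors)
    plt.figure(figsize=(12,12))
    plt.scatter(X[:,0], X[:,1])
    # Annotate each point with its word
    for word, x in zip(words, X):
        plt.text(x[0]+0.05, x[1]+0.05, word)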


In [None]:
word_list = ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute']

plot_embeddings(model, words=word_list, reduction='tsne') 

We will now use these pre-computed embeddings to build the document representations.  
A simple way to compute a document representation from word embeddings consists in computing the mean or the sum of all the document's word embeddings.  
Here, since the documents do not have the same length, it is preferable to use the mean.  
Fill in the following code to compute the mean embeddings of all documents.  
Since this process is a little bit long, we will use a limited amount of documents during the practical session. Nonetheless, feel free to try with the complete dataset at home.  

In [None]:
from tqdm import tqdm
tqdm.pandas()

traindf = pd.read_csv('train.csv', names=["label", "title", "text"]).sample(1000)
testdf = pd.read_csv('test.csv', names=["label", "title", "text"]).sample(200)

traindf['label'] = traindf['label'] -1
traindf['label_name'] = traindf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})
testdf['label'] = testdf['label'] -1
testdf['label_name'] = testdf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})

def compute_mean_embeddings(s, model, words_list, dim=100):
  s = ... # convert to lower case and split into words
  emb_list = [model[w] for w in s if w in words_list]
  if emb_list:
    return ... # compute the mean
  else:
    return ... # return a vector filled with 0 if the list is empty

words_list = model.index2entity
traindf['mean_embeddings'] = traindf.text.progress_apply(lambda s: compute_mean_embeddings(s, model, words_list))
testdf['mean_embeddings'] = testdf.text.progress_apply(lambda s: compute_mean_embeddings(s, model, words_list))
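
To illustrate why the mean is preferred over the sum when documents have different lengths, the sketch below compares both on a short document and on the same document repeated three times. The 2-dimensional `toy_vectors` and the `embed` helper are hypothetical.

```python
# Hypothetical 2-d word vectors.
toy_vectors = {"oil": [1.0, 0.0], "prices": [0.8, 0.2], "rise": [0.6, 0.4]}

def embed(words, vectors):
    """Return (sum, mean) of the vectors of the known words."""
    known = [vectors[w] for w in words if w in vectors]
    total = [sum(column) for column in zip(*known)]
    mean = [x / len(known) for x in total]
    return total, mean

short_doc = "oil prices rise".split()
long_doc = short_doc * 3  # same content, three times longer

short_sum, short_mean = embed(short_doc, toy_vectors)
long_sum, long_mean = embed(long_doc, toy_vectors)
print(short_sum, long_sum)    # the sum grows with the document length
print(short_mean, long_mean)  # the mean stays on the same scale
```

Both documents carry the same content, yet only the mean gives them the same representation.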

The following code extracts the computed embeddings from the dataframe.  
Use these to train a model to predict the article category.  
Print your testing performance and plot a confusion matrix.  
The results may be a little bit disappointing. Any idea why?

In [None]:
X_train = np.vstack(traindf['mean_embeddings'].values)
X_test = np.vstack(testdf['mean_embeddings'].values)

...

Plot a t-SNE of the computed embeddings.  
Is it a good representation to classify documents?

In [None]:
...

We will now try with custom Word2Vec representations.  
Gensim makes it possible to train Word2Vec representations in a few lines of code.  
Since our vocabulary is smaller than the one used for the pre-computed embeddings, we can now use more samples (the embedding look-up will be cheaper to process).

In [None]:
traindf = pd.read_csv('train.csv', names=["label", "title", "text"]).sample(10000)
testdf = pd.read_csv('test.csv', names=["label", "title", "text"]).sample(1000)

traindf['label'] = traindf['label'] -1
traindf['label_name'] = traindf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})
testdf['label'] = testdf['label'] -1
testdf['label_name'] = testdf.label.map({0:"World", 1:"Sports", 2:"Business", 3:"Sci/Tech"})

It might be worth applying our tokenizer to reduce the vocabulary size.  
The following code creates a new field containing the tokenized texts.

In [None]:
traindf["tokenized"] = traindf.text.apply(lambda s: tokenizer(s))
testdf["tokenized"] = testdf.text.apply(lambda s: tokenizer(s))

We will now train our own Word2Vec.  
Look at the official [documentation](https://radimrehurek.com/gensim/models/word2vec.html).  
What does the ```window``` argument stand for? What type of model are we using (CBOW or Skip-gram)?
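
As a hint for the first question, the sketch below enumerates the (center, context) word pairs seen with a fixed context window. The `context_pairs` helper is hypothetical and simplified; actual implementations may differ in details such as how the window size is sampled.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["oil", "price", "rise", "again"]
narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=2)
print(narrow)
print(len(narrow), len(wide))
```

A larger window pairs each word with more distant neighbours, so it produces more training pairs per sentence.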

In [None]:
from gensim.models import Word2Vec

model=Word2Vec(traindf["tokenized"],size=100,window=5,min_count=2)
model.train(traindf["tokenized"], total_examples=model.corpus_count, epochs=200)

Now train a model on these custom embeddings and evaluate it on the testing set.

In [None]:
# The trained word vectors are stored in model.wv
words_list = model.wv.index2entity
traindf['mean_embeddings'] = traindf.text.progress_apply(lambda s: compute_mean_embeddings(s, model.wv, words_list))
testdf['mean_embeddings'] = testdf.text.progress_apply(lambda s: compute_mean_embeddings(s, model.wv, words_list))

X_train = np.vstack(traindf['mean_embeddings'].values)
X_test = np.vstack(testdf['mean_embeddings'].values)

...

Plot a t-SNE of these new embeddings (use the test set to avoid too many points). Does it seem better?

In [None]:
...

Word2Vec learns word representations in a self-supervised way. All word representations are therefore meaningful, and every word has the same weight when computing the mean. This means that words irrelevant to the category contribute as much to the average document representation as words closely related to the category.  
Averaging word embeddings learned with self-supervised learning is therefore not very effective for document classification.  
In the following we will see two alternatives using deep neural networks:


1.   Replace the mean by a recurrent layer responsible for filtering informative words within the sequence
2.   Learn our word embeddings at the same time as we learn the classification function



For the rest of this notebook we will need a GPU to speed up training.  
Go to __Runtime__ and change your runtime type to GPU.  
Then run the following command and restart your runtime.

In [None]:
!pip3 install torch==1.9.1+cu111
!pip3 install torchtext==0.10.1

Since we changed our runtime we need to re-download the datasets.

In [None]:
!wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/train.csv 2>&1
!wget https://github.com/mhjabreel/CharCnn_Keras/raw/master/data/ag_news_csv/test.csv 2>&1

The following code will load the datasets in a format compatible with pytorch.  
During the process, texts will be tokenized using the [Spacy tokenizer](https://spacy.io/usage/linguistic-features#how-tokenizer-works), so we won't need to do it ourselves.

In [None]:
import torch
import torchtext
from torchtext.legacy.data import Field, LabelField, TabularDataset, BucketIterator

TEXT = Field(tokenize='spacy',batch_first=True,include_lengths=True)
LABEL = LabelField(dtype = torch.long, batch_first=True)
fields = [('label', LABEL), ('title', TEXT), ('text',TEXT)]
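# The CSV files have no header row (see the column names passed to pandas earlier), so no line is skipped
trainset = TabularDataset(path = 'train.csv',format = 'csv',fields = fields, skip_header = False)
testset = TabularDataset(path = 'test.csv',format = 'csv',fields = fields, skip_header = False)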

Here is an example of one sample of the processed dataset.

In [None]:
print(vars(trainset.examples[0]))

We will use the same pre-trained GloVe embeddings as before.  
Once again, since we changed our runtime, we need to download the corresponding files.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip 
!unzip glove.6B.zip > /dev/null 2>&1

The following code loads the pre-computed embeddings and builds the corresponding vocabulary.

In [None]:
glove = torchtext.vocab.Vectors('glove.6B.100d.txt')
TEXT.build_vocab(trainset,min_freq=3)
TEXT.vocab.set_vectors(glove.stoi, glove.vectors, dim=100)
LABEL.build_vocab(trainset)

In [None]:
torch.cuda.init()
torch.cuda.empty_cache()
print('CUDA MEM:',torch.cuda.memory_allocated())

print('cuda:', torch.cuda.is_available())
print('cuda index:',torch.cuda.current_device())

We will use a [```BucketIterator```](https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator) to generate mini-batches of token sequences.  
BucketIterators generate batches of examples of similar lengths while minimizing the amount of padding needed (padding here corresponds to adding a padding token to the sequence).  
The ```sort_key``` parameter is used to sort text sequences in batches. Here we want to use sequences of similar length, so we use a function returning a sequence's length.  
This is used with the complementary argument ```sort_within_batch```, which indicates that sequences are sorted within each mini-batch only, not across the entire dataset.
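
The benefit of batching sequences of similar lengths can be checked with a few lines of plain Python. The sketch below counts the padding tokens needed when batches are formed in the original order and after sorting by length. The lengths are hypothetical, and a real bucketing iterator also shuffles, which is ignored here.

```python
def padding_needed(lengths, batch_size):
    """Total number of padding tokens when consecutive items are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # every sequence is padded up to the longest one in its batch
        total += sum(max(batch) - n for n in batch)
    return total

# Hypothetical sequence lengths (in tokens).
lengths = [3, 40, 5, 38, 4, 41, 6, 39]

unsorted_padding = padding_needed(lengths, batch_size=2)
sorted_padding = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_padding)  # 140
print(sorted_padding)    # 4
```

Grouping sequences of similar lengths reduces the amount of wasted computation on padding tokens.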

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  
batch_size = 64

train_loader, test_loader = BucketIterator.splits(
    (trainset, testset), 
    batch_size = batch_size,
    sort_key = lambda x: len(x.text),
    sort_within_batch=True,
    device = device)

We will now define our network architecture.  
It is composed of:

*   an embedding layer responsible for the mapping between tokens and their vector representations
*   several recurrent LSTM layers
*   a final fully connected layer

Here we will use pretrained embeddings and freeze their "weights".  
This may be too new to ask you to implement the architecture yourself, but make an effort to understand the following code.  

In [None]:
import torch.nn as nn

class TextClassifier(nn.Module):
    
  def __init__(self, vocab, embedding_dim, hidden_dim=32, nb_lstm_layers=2, dropout=0.2, output_dim=4):
      super().__init__()          
      
      #embedding layer
      self.embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors)
      self.embedding.weight.requires_grad = False
      self.lstm = nn.LSTM(embedding_dim, 
                          hidden_dim, 
                          num_layers=nb_lstm_layers, 
                          dropout=dropout,
                          batch_first=True)
      self.fc = nn.Linear(hidden_dim, output_dim)
      
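  def forward(self, text, text_lengths): 
      embedded = self.embedding(text)    
      # pack the padded batch so that the LSTM skips the padding tokens
      packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
      _, (hidden, cell) = self.lstm(packed_embedded)
      # hidden has shape (num_layers, batch, hidden_dim): keep the final state of the last layer
      outputs = self.fc(hidden[-1])
      return outputs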

Complete the following code to implement the training and testing routines.

In [None]:
from tqdm.notebook import tqdm


def test(model, dataloader):
    model.eval()
    test_corrects = 0
    total = 0
    with torch.no_grad():
        for data in dataloader:
            text, text_lengths = data.text
            labels = data.label

            pred = model(text, text_lengths)  # logits of shape (batch, n_classes)
            _, predicted = pred.max(1)
            test_corrects += predicted.eq(labels).sum().item()
            total += labels.size(0)
    return test_corrects / total


def train(model, dataloader, optimizer, criterion, epochs=5):
  model.train()  
  for epoch in range(epochs): 
    running_loss = 0.0
    running_corrects = 0
    total = 0 
    t = tqdm(dataloader)
    for i, batch in enumerate(t):
      text, text_lengths = batch.text
      labels = batch.label

      pred = model(text, text_lengths)  # logits of shape (batch, n_classes)
      loss = criterion(pred, labels)
      
      _, predicted = ...
      running_corrects += ...
      total += ...
      running_loss += loss.item()

      ... #zero grad your optimizer
      ... # backward the loss  
      ... # perform a step
            
      t.set_description(f"epoch:{epoch} loss: {(running_loss / (i+1)):.4f} current accuracy:{round(running_corrects / total * 100, 2)}%")

Now, instantiate a model and its corresponding optimizer, choose the right criterion (loss function) to train your model, and evaluate its performance on both the training and testing sets.

In [None]:
import torch.optim as optim
embedding_size = 100
model =  ... # send your model to gpu with .to(device)
optimizer = ... # use Adam with default parameters
criterion = ... # choose the correct criterion
print(model)

In [None]:
train(...)
train_acc = ...
print(f"Train accuracy: {round(train_acc * 100, 2)}%")
test_acc = ...
print(f"Test accuracy: {round(test_acc * 100, 2)}%")


We saw that using the mean of embeddings learned by self-supervised learning is ineffective.  
This comes from the fact that all the words present in the corpus are given equal importance.  
Another solution could be to learn the embeddings while learning to classify.  
We will do this now, still using the mean to compute the final text representation.  
To do so, we will use a particular layer in pytorch called [```EmbeddingBag```](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag).  
This layer computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since it only computes the means.  
We will now implement a simple network, as illustrated below, that computes the mean of the embeddings to classify texts.  
This time we won't freeze the embeddings since we aim to learn them while learning to classify.  
![](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png)  
Source (https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)
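
To see why a bag of embeddings needs no padding, here is a plain Python sketch of the mean over bags when documents are stored back to back in a flat list, with an `offsets` list marking where each one starts. This mirrors the flat-input-plus-offsets calling convention of the layer; the embedding table and token ids are hypothetical.

```python
# Hypothetical embedding table: one 2-d vector per token id.
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def embedding_bag_mean(token_ids, offsets, table):
    """Average the embeddings of each bag; bag k starts at offsets[k] and ends where bag k+1 starts."""
    bounds = list(offsets) + [len(token_ids)]
    bags = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        rows = [table[i] for i in token_ids[start:stop]]
        bags.append([sum(column) / len(rows) for column in zip(*rows)])
    return bags

# Two documents of different lengths stored back to back, without padding.
token_ids = [1, 2, 3, 3, 1]   # document A = [1, 2, 3], document B = [3, 1]
offsets = [0, 3]              # start index of each document

bags = embedding_bag_mean(token_ids, offsets, embedding_table)
print(bags)
```

Each document yields one fixed-size vector whatever its length, which is what the final linear layer expects.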

In [None]:
class TextClassifier(nn.Module):

  def __init__(self, vocab_size, embedding_dim, output_dim=4):
    super().__init__()  
    self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim, sparse=True)
    self.fc = nn.Linear(embedding_dim, output_dim)

  def forward(self, text, _): # the unused second argument keeps this forward compatible with our previous training routine
    embedded = self.embedding(text)
    return self.fc(embedded)

  def get_embeddings(self, text):
    return self.embedding(text)

We need to build a new vocabulary from our corpus.

In [None]:
TEXT.build_vocab(trainset,min_freq=3)
vocab_size = len(TEXT.vocab)

Now, instantiate a model and its corresponding optimizer, choose the right criterion to train your model, and evaluate its performance on both the training and testing sets.

In [None]:
model = ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
train(...)
train_acc = ...
print(f"Train accuracy: {round(train_acc * 100, 2)}%")
test_acc = ...
print(f"Test accuracy: {round(test_acc * 100, 2)}%")

Is the obtained test performance better than that of the previous models using the mean?

Now try to do the same using an LSTM layer instead of averaging the embeddings.

In [None]:
import torch.nn as nn

class TextClassifier(nn.Module):
    
  def __init__(self, vocab_size, embedding_dim, output_dim=4):
    super().__init__()  
    self.embedding = ...
    self.lstm = ...
    self.fc = ...
      
  def forward(self, text, text_lengths): 
      embedded = self.embedding(text)    
      #packed sequence
      packed_embedded = ...
      _, (hidden, cell) = ...
      outputs = ...        
      return outputs

In [None]:
model = ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
train(...)
train_acc = test(...)
print(f"Train accuracy: {round(train_acc * 100, 2)}%")
test_acc = test(...)
print(f"Test accuracy: {round(test_acc * 100, 2)}%")

# Transformers ! 

Transformer models are the current state of the art in natural language processing.  
We have not seen them during the video lectures since studying transformers would require an entire session in itself.  
Nonetheless, we will see how to use them as "black-box" models to finish this practical session.

Let $\mathbf{X} \in \mathbb{R}^{B \times N \times F}$ be the input sequence, and $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{F \times H}$ be the query, key and value projection matrices.

We define the query, the key and the value as $\mathbf{Q}=\mathbf{X} \mathbf{W}^Q, \mathbf{K}=\mathbf{X} \mathbf{W}^K, \mathbf{V}=\mathbf{X} \mathbf{W}^V \in \mathbb{R}^{B \times N\times H}$.

Self-attention computes a score $\mathbf{E} =\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \in \mathbb{R}^{B \times N\times N}$ that measures the similarity between the elements of the input sequence, where $d_k = H$ is the dimension of the keys.
The normalized attention weights $\alpha_{i j}=\frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{N} \exp \left(e_{i k}\right)}$ are computed by applying the softmax function to each row of $\mathbf{E}$, whose entries are the $e_{i j}$.

The output of the self-attention layer is given by:
$$\mathbf{Y}=\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\operatorname{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V}$$
The features of each sequence element are therefore weighted according to the context.
It is possible to use several attention mechanisms in parallel and aggregate them to obtain the final output:
$$\operatorname{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V})= \operatorname{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) \mathbf{W}^O$$

where $\mathrm{head}_i= \operatorname{Attention}\left(\mathbf{Q} \mathbf{W}_i^Q, \mathbf{K} \mathbf{W}_i^K, \mathbf{V} \mathbf{W}_i^V\right)$.
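
The single-head formula can be checked numerically with a small pure Python sketch. It handles one sequence (no batch axis), and the input `X` and the identity projection matrices are hypothetical values chosen to keep the numbers readable.

```python
import math

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention for one sequence."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]  # one distribution per query position
    return matmul(weights, V), weights

# Hypothetical sequence of N=3 elements with F=2 features, projected to H=2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
Y, alpha = self_attention(X, identity, identity, identity)
print(alpha)
print(Y)
```

Each row of `alpha` sums to one, and each row of `Y` is the corresponding weighted average of the value vectors.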

[Hugging Face](https://huggingface.co/) provides a widely used [library](https://huggingface.co/docs/transformers/main/en/index) to work with transformers and pre-trained models.

In [None]:
!pip install transformers[torch] 2>&1

In [None]:
import pandas as pd

train_df = pd.read_csv('train.csv', names=["label", "title", "text"]).sample(40000)
test_df = pd.read_csv('test.csv', names=["label", "title", "text"]).sample(2000)

Split the train set into train and validation sets using sklearn's **train_test_split**.

Keep 20% of the data for the validation set, and remember to stratify! 
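
To see what stratification guarantees, here is a from-scratch sketch of a stratified split on hypothetical imbalanced labels. The `stratified_split` helper is an illustration only and is not how scikit-learn implements it.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction, seed=0):
    """Return (train_idx, val_idx) keeping each class's proportion in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle inside each class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

The validation set keeps the 80/20 class ratio of the full dataset, which a purely random split would only match approximately.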

In [None]:
from sklearn.model_selection import train_test_split

train, val, train_labels, val_labels = ...

As in computer vision, it is possible to use pre-trained models for transfer learning in NLP.  
[BERT](https://arxiv.org/abs/1810.04805) is one of the most famous transformers in NLP. In its largest form, it is composed of about 340 million parameters.  
In this practical session, we will use a smaller version: DistilBERT. [DistilBERT](https://arxiv.org/abs/1910.01108) is a smaller model that has been trained to mimic the outputs of the BERT model.   
This [distillation](https://arxiv.org/abs/1503.02531) process provides a model achieving very good performance with far fewer parameters.  
To use the pre-trained model, we need to tokenize our texts the same way as the data it was trained on.  
Here we will use the ```DistilBertTokenizerFast``` to manage the text preprocessing. 
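
The soft-target idea behind distillation can be sketched as a cross-entropy between temperature-softened teacher and student distributions. This is only an illustration of that single idea with hypothetical logits; the actual training objective of a distilled model may combine several terms.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [4.0, 1.0, 0.5, 0.2]   # hypothetical teacher outputs for 4 classes
close_student = [3.8, 1.1, 0.4, 0.3]    # student that almost agrees with the teacher
far_student = [0.2, 0.5, 1.0, 4.0]      # student that disagrees with the teacher

close_loss = distillation_loss(close_student, teacher_logits)
far_loss = distillation_loss(far_student, teacher_logits)
print(close_loss, far_loss)
```

The loss is lower for the student whose output distribution is closer to the teacher's, which is what pushes the small model to mimic the large one.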

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

We will also need to wrap our datasets into Pytorch-compatible datasets.  
The following code defines a Torch Dataset to handle our textual data.

In [None]:
import torch
from torch.utils.data import Dataset

class NlpDataset(Dataset):
    def __init__(self,data,labels,tokenizer):
        self.data = data.to_list()
        self.labels = labels.tolist()
        self.encodings = tokenizer(self.data, truncation=True, padding=True)

    def __getitem__(self,idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx],dtype=torch.long)
        return item
  
    def __len__(self):
        return len(self.labels)

Convert your datasets to torch compatible Datasets.

In [None]:
from torch.utils.data import DataLoader

train_dataset = ...
val_dataset = ...
test_dataset = NlpDataset(test_df["text"], test_df["label"]-1, tokenizer)

batch_size = 64
train_loader = ...
val_loader = ...
test_loader = DataLoader(test_dataset, batch_size=1) # we need a batch size at 1 for later in the notebook

Look at a sample yielded by your train loader.  
What type of object is that? Do you know what it is composed of?

In [None]:
next(iter(train_loader))

We will now instantiate our model and wrap it into a PyTorch module.

In [None]:
from transformers import DistilBertForSequenceClassification
import torch.nn as nn

class BertClf(nn.Module):

    def __init__(self, distilbert):

        super(BertClf, self).__init__()

        self.distilbert = distilbert
        # freeze every parameter except those of the classification head
        for name, param in distilbert.named_parameters():
            if "classifier" not in name:
                param.requires_grad = False

    def forward(self, sent_id, mask):

        #pass the inputs to the model  
        out = self.distilbert(sent_id, attention_mask=mask)
        logits = out.logits
        attn = out.attentions
        hidden_states = out.hidden_states
        

        return logits, hidden_states, attn

distilbert = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                                  num_labels=4,
                                                                  output_attentions=True,
                                                                  output_hidden_states=True)

model = BertClf(distilbert)
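
To check the effect of freezing parameters by name, the sketch below applies the same rule to a tiny stand-in network and counts what remains trainable. `TinyClf` and its layer sizes are hypothetical and only serve the illustration.

```python
import torch.nn as nn

class TinyClf(nn.Module):
    """Toy stand-in for a pre-trained encoder followed by a classification head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 2)

    def forward(self, x):
        return self.classifier(self.encoder(x))

tiny = TinyClf()

# Same rule as in BertClf: freeze everything whose name lacks "classifier"
for name, param in tiny.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

n_total = sum(p.numel() for p in tiny.parameters())
n_trainable = sum(p.numel() for p in tiny.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable} / {n_total}")
```

Applying the same two sums to `model` shows how small a fraction of DistilBERT is actually updated during training.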

Now complete the following training and testing loops.

In [None]:
from tqdm.notebook import tqdm

def train_bert(model, optimizer, dataloader, epochs):
  model.train()
  for epoch in range(epochs):
    running_loss = 0.0
    running_corrects = 0
    total = 0
    t = tqdm(dataloader)
    for i, batch in enumerate(t):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        preds, _, _ = model(input_ids,mask=attention_mask)
        loss = ...

        ... # zero grad your optimizer
        ... # backward the loss
        ... # perform a step

        _, predicted = ...
        running_corrects += predicted.eq(labels).sum().item()
        total += labels.size(0)
        running_loss += loss.item()

        t.set_description(f"epoch:{epoch} loss: {(running_loss / (i+1)):.4f} current accuracy:{round(running_corrects / total * 100, 2)}%")

def test_bert(model, dataloader):
    model.eval()
    test_corrects = 0
    total = 0
    with torch.no_grad():
      for batch in tqdm(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        preds, _, _ = model(input_ids,mask=attention_mask)
        _, predicted = preds.max(1)
        test_corrects += predicted.eq(labels).sum().item()
        total += labels.size(0)
    return test_corrects / total

Now train a model for one epoch and print its accuracy on the test set.

In [None]:
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favour of this one

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
optimizer = AdamW(model.parameters(),lr = 1e-5)
criterion  = nn.CrossEntropyLoss()
n_epochs = 1

train_bert(model, optimizer, train_loader, n_epochs)
test_bert(model, test_loader)

The following code computes the embeddings generated by the DistilBERT model.  
Use it to plot a t-SNE of DistilBERT's representations of the test set.  
How are the different classes distributed?

In [None]:
import numpy as np

def get_embeddings(model, dataloader):
    model.eval()
    embeddings = []
    labels = []
    with torch.no_grad():
      for batch in tqdm(dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels.append(batch["labels"].item())  # .item() requires a batch size of 1

        _, emb, _ = model(input_ids,mask=attention_mask)
        last_layer_cls = emb[-1][:,0,:]  # last layer, first ([CLS]) token: shape (1, hidden_size)
        embeddings.append(last_layer_cls.squeeze(0))  # drop the batch dimension
    embeddings = np.array([e.cpu().numpy() for e in embeddings])
    return embeddings, labels

embeddings, labels = get_embeddings(model, test_loader)
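
The indexing `emb[-1][:,0,:]` used above can be hard to read at first. The sketch below reproduces it on random tensors, assuming the layout returned with `output_hidden_states=True` for `distilbert-base-uncased`: a tuple with one tensor for the embedding output plus one per layer (6 layers), each of shape `(batch, sequence_length, 768)`. The tensors and the sequence length are made up.

```python
import torch

n_layers, batch, seq_len, hidden = 6, 1, 12, 768

# Fake hidden states: embedding output + one tensor per transformer layer
fake_hidden_states = tuple(
    torch.randn(batch, seq_len, hidden) for _ in range(n_layers + 1)
)

last_layer = fake_hidden_states[-1]  # output of the final layer
cls_vector = last_layer[:, 0, :]     # first token of each sequence ([CLS])
flat = cls_vector.squeeze(0)         # drop the batch dimension (batch size 1)

print(len(fake_hidden_states), cls_vector.shape, flat.shape)
```

Each document is therefore summarised by a single vector, which is what makes a 2D projection such as t-SNE possible.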

In [None]:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns 

...

We can use the [**bertviz** library](https://github.com/jessevig/bertviz) to visualize how the tokens of the input sequence attend to one another.

In [None]:
!pip install bertviz

In [None]:
from bertviz import model_view,head_view

sentence = test_df["text"].iloc[33]
tokenized = tokenizer(sentence)
print(sentence)
print(tokenized)

In [None]:
inputs = torch.tensor(tokenized["input_ids"]).unsqueeze(0).to(device)
mask = torch.tensor(tokenized["attention_mask"]).unsqueeze(0).to(device)
outputs = model(inputs,mask = mask)
attention = outputs[-1] 
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 
model_view(attention, tokens)

In [None]:
head_view(attention, tokens)