In this tutorial we will try out a few deep learning libraries on toy examples to get you started. You are free to choose any library you want, but for the course and the assignments we will stick to Keras with TensorFlow, Hugging Face Transformers, and KeyBERT/BERTopic.

In [1]:
# First we need to install some libraries that are not readily present in Colab (Keras and TF are)
!pip install keybert
!pip install keybert[spacy]

!pip install bertopic

!pip install stop-words

!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Downloading a toy dataset
In this tutorial, we will use a public and relatively small dataset. For your own work, swap it for your own data and preprocess that data accordingly. The dataset contains texts from 20 different newsgroups; the label of each text is the ID of the newsgroup it was posted in.


In [2]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
x = dataset['data']
y = dataset['target']

In [3]:
# Let's check it out
x[:3]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

In [4]:
y[:5]

array([10,  3, 17,  3,  4])

# Topic modeling
While we have labels on our dataset to perform classification, we can also work unsupervised and do topic modeling, as you have seen with LDA, but now with Transformers.

In [None]:
# -y skips the interactive confirmation prompt, which would otherwise stall the cell
!pip uninstall -y joblib
!pip install joblib==1.1.0

In [4]:
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(x)
topic_model.get_topic_info()


     Topic  Count  Name
0       -1   6891  -1_to_the_is_of
1        0   1844  0_game_team_games_he
2        1    624  1_key_clipper_chip_encryption
3        2    526  2_ites_cheek_yep_huh
4        3    482  3_israel_israeli_jews_arab
...    ...    ...  ...
220    219     11  219_abortion_abortions_women_choice
221    220     11  220_cramer_homosexual_men_homosexuals
222    221     10  221_list_ma_flea_june
223    222     10  222_amp_amps_components_adcom
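A note on reading this table: in BERTopic, topic `-1` collects the documents that were not assigned to any cluster (outliers). The small sketch below puts its count in perspective. It assumes the full 20 newsgroups set holds 18846 documents, which is not printed anywhere in this notebook.

```python
# Count of topic -1, taken from the table above
outlier_count = 6891
# Assumption: size of fetch_20newsgroups(subset='all'), not shown in this notebook
total_docs = 18846

outlier_share = outlier_count / total_docs
print(f"Share of documents left unassigned: {outlier_share:.1%}")
```

A share this large suggests that the default settings are worth tuning before drawing conclusions from the topics.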


In [5]:
# Check out a topic cluster
topic_model.get_topic(0)

[('game', 0.010203329120266847),
 ('team', 0.008895722164215521),
 ('games', 0.007090292212032411),
 ('he', 0.006891122705441054),
 ('players', 0.006249476861417218),
 ('season', 0.0061692971525915625),
 ('hockey', 0.006060312124864601),
 ('play', 0.005704005541998439),
 ('25', 0.005563422127321219),
 ('year', 0.005528754933826223)]

In [6]:
topic_model.get_topic(1)

[('key', 0.013720789131849946),
 ('clipper', 0.012683428141896099),
 ('chip', 0.011943059466901457),
 ('encryption', 0.011839195630821796),
 ('keys', 0.009549114683400953),
 ('government', 0.008275821692058557),
 ('escrow', 0.00825838767748389),
 ('nsa', 0.007416987945477699),
 ('algorithm', 0.006728890699840438),
 ('be', 0.00613871437750379)]

In [7]:
# Find some representative documents per topic for the most frequent topics
representative_docs = topic_model.get_representative_docs()
for topic_id in range(3):
    print(representative_docs[topic_id])

['Sabres are\nFuhr or\nknow much\n\n\nTwo words: Grant Fuhr.', '\n\nI didn\'t mean to offend or anything, I\'m just quoting Stanky himself on\nthe subject. I remember one time last year he was being interviewed by\nESPN, and the interviewer (can\'t remember who), asked Stanky if he was\nJewish because he (the interviewer) was Jewish and wanted to see more\nJewish ballplayers. To which Stanky replied, "I\'m Polish, not Jewish."\n\nSo maybe that wasn\'t the most PC thing for Stanky to say, and maybe I was\na little naive when I posted it. I think we should just devote this\nsubject to finding actual Jewish ballplayers (I myself am Jewish and the\nonly ones I ever knew until now were Koufax, Greenberg, and Blomberg).', 'I thought I\'d post my predicted standings since I find those posted by others\nto be interesting.  Sorry this is after Opening Day.  I certify that these\nwere completed before the first pitch. :-)\n\nAL East\n1.  New York Yankees - the most (only?) improved team in this 

# Keyword extraction
Topic modeling tries to group documents together by common (key)words. We may also be interested in the reverse direction: finding the most representative keywords per document. This is what KeyBERT does. While not guaranteed, the topics found by BERTopic and the keywords extracted by KeyBERT should be somewhat correlated for documents in the same topic group.
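For intuition, KeyBERT embeds the document and each candidate word or phrase with the same model and ranks the candidates by cosine similarity to the document. The sketch below imitates that ranking step only. The 3-dimensional vectors are made up for illustration; a real sentence embedding model produces hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "pens": [0.8, 0.2, 0.3],
    "video": [0.1, 0.9, 0.2],
    "playoffs": [0.7, 0.0, 0.5],
}

# Score every candidate against the document and sort best-first
ranked = sorted(
    ((word, cosine(doc_vec, vec)) for word, vec in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, round(score, 4))
```

The scores KeyBERT returns are similarities of this kind, which is why they are comparable within one document but not a probability.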

In [8]:
from keybert import KeyBERT
kw_model = KeyBERT()

In [9]:
x[0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [10]:
kw_model.extract_keywords(x[0])

[('pens', 0.4717),
 ('islanders', 0.4596),
 ('jagr', 0.4259),
 ('bowman', 0.4052),
 ('pittsburghers', 0.3388)]

In [11]:
# Extract multi-word key phrases
kw_model.extract_keywords(x[0], keyphrase_ngram_range=(1, 2), stop_words=None)

[('pens fans', 0.6066),
 ('islanders lose', 0.529),
 ('game pens', 0.5033),
 ('pens rule', 0.4849),
 ('the pens', 0.4808)]

In [12]:
kw_model.extract_keywords(x[1])

[('ram', 0.4176),
 ('vesa', 0.3791),
 ('vlb', 0.3374),
 ('card', 0.3025),
 ('2mb', 0.28)]

# Keras and Tensorflow modeling
Using Keras (on top of TF), we can start training a model to predict the labels of our dataset while jointly learning word embeddings for all the tokens present. This will lead to our word embeddings being biased towards our task at hand which often is a useful thing as long as we don't overfit (which is a high risk on our toy dataset given its small size).
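To make "learning word embeddings" concrete: an embedding layer is essentially a trainable lookup table with one row per token index. The sketch below shows only the lookup in plain Python. The sizes and token ids are tiny, made-up stand-ins for the real vocabulary and embedding dimensions.

```python
import random

random.seed(0)

# Tiny stand-ins for the real vocabulary size and embedding dimension
vocab_size, embedding_dim = 6, 4

# One randomly initialised row per token index; training would update these rows
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

# A padded sequence of token ids (0 is the padding index)
token_ids = [0, 0, 3, 5, 3]

# The "forward pass" of an embedding layer is just row selection
vectors = [embedding_matrix[i] for i in token_ids]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Because the same row is returned every time a token occurs, the gradient from every occurrence updates that one row, which is how the vectors become biased towards the task.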

In [13]:
import numpy as np
import tensorflow as tf
# Run tf.functions eagerly (easier to debug, but slower)
tf.config.run_functions_eagerly(True)
from tensorflow import keras


In [14]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

We first need to turn our text into sequences of word indices (tokens) that Keras can use.

In [15]:
tokenizer = Tokenizer(num_words = 20000)
tokenizer.fit_on_texts(x)
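As a rough mental model of what the fitted tokenizer holds: words are ranked by frequency, the most frequent word gets index 1, and only indices below `num_words` are used when converting text. The sketch below is a simplified plain-Python imitation, not the Keras implementation; the regular expression and the helper names `fit_word_index` and `texts_to_ids` are assumptions made for this example.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # simplified word splitter (assumption)

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 0 stays reserved for padding."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    ranked = [word for word, _ in counts.most_common()]
    # Keep only words whose rank is below num_words
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def texts_to_ids(texts, word_index):
    """Replace known words by their index and silently drop unknown ones."""
    return [
        [word_index[w] for w in TOKEN_PATTERN.findall(t.lower()) if w in word_index]
        for t in texts
    ]

toy_texts = ["The Pens rule", "the Pens beat the Devils"]
word_index = fit_word_index(toy_texts, num_words=4)
sequences = texts_to_ids(toy_texts, word_index)
print(word_index)
print(sequences)
```

Note that rare words simply disappear from the sequences, so a small `num_words` shortens documents as well as shrinking the vocabulary.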

Normally, we would store this tokenizer now so we can re-use it later:
```
import pickle

with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
```

In [16]:
# Turn our text into feature sequences
X_set = tokenizer.texts_to_sequences(x)
# Our Keras models expect fixed-size vectors, so we pad or truncate every sequence to 80 tokens
maxlen = 80
X_set = pad_sequences(X_set, maxlen=maxlen)
# Now we split them into train and test (70/30); with shuffle=False the original order is kept
# and random_state has no effect
X_train, X_test, Y_train, Y_test = train_test_split(X_set, y, test_size=0.3, shuffle=False, random_state=123456)

In [17]:
# This is now a numpy 2D array with #samples,80 as dimensions
X_train.shape

(13192, 80)
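Every row has exactly 80 entries because `pad_sequences` both pads short sequences and truncates long ones. With its default settings it does both at the front of the sequence. The helper below, `pad_or_truncate`, is a hypothetical plain-Python imitation of that default behaviour for a single sequence.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Imitate the default 'pre' padding and 'pre' truncating for one sequence."""
    seq = seq[-maxlen:]                          # keep only the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq   # left-pad with the padding value

print(pad_or_truncate([1, 2, 3], maxlen=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))
```

Truncating at the front means that for long posts only the final 80 tokens reach the model, which is worth keeping in mind when the informative part of a text sits at the start.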

We will now create our deep net's architecture. No specific thought was put into this (contrary to what you should do!) but it shows how to create an architecture and use it.

In [18]:
from keras.models import Sequential
from keras.layers import Input, Bidirectional, Embedding, Dense, Dropout, SpatialDropout1D, LSTM, Activation
from keras.regularizers import L1L2

model = Sequential()
# Our inputs are vectors with 80 word indices
model.add(
    Input(shape=(maxlen,), dtype='int32')
)
# Which we turn into regularized word vectors
model.add(
    Embedding(input_dim=20000,
                output_dim=256,
                mask_zero=True,
                input_length=maxlen,
                embeddings_regularizer=L1L2(l2=1E-6),
                name='embedding')
)
# Followed by a tanh activation that squashes the embedding values into (-1, 1)
model.add(
    Activation('tanh')
)
# BiLSTMs with dropout
model.add(Bidirectional(
  LSTM(512, return_sequences=True)
))
model.add(Dropout(rate=0.5))
model.add(Bidirectional(
  LSTM(512, return_sequences=False)
))
# We have 20 classes, so 20 predictions
model.add(Dense(units=20))
model.add(Activation('softmax'))
# Check out our model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 80, 256)           5120000   
                                                                 
 activation (Activation)     (None, 80, 256)           0         
                                                                 
 bidirectional (Bidirectiona  (None, 80, 1024)         3149824   
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 80, 1024)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 1024)             6295552   
 nal)                                                            
                                                                 
 dense (Dense)               (None, 20)                2

In [19]:
# One-hot encode our outputs
Y_train_oh = keras.utils.to_categorical(Y_train)
Y_test_oh = keras.utils.to_categorical(Y_test)
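One-hot encoding turns each class ID into a vector of zeros with a single one at the position of the class. The helper `one_hot` below is a hypothetical plain-Python sketch of that idea, using the first three labels shown earlier.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([10, 3, 17], num_classes=20)
print(encoded[0])
```

Passing the number of classes explicitly, which `to_categorical` also supports through `num_classes`, guards against a split that happens to miss the highest class ID and would otherwise produce vectors that are too short.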

In [20]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

A couple of things to take into account in the following block:
- We should tune our batch size based on the training history and what fits on the GPU. As a rule of thumb, the number of units in our LSTM (512) should be an integer multiple of our batch size (32).
- Here we use our test set for both validation and evaluation. This is a violation of good ML practice and standards. Don't do it.
- We use cross-entropy and accuracy; those are not necessarily the best choices for your problem at hand.
- If you run this, be sure to choose GPU as runtime or it will take forever :)
- 2 epochs is hardly ever sufficient, especially since here we also train word embeddings as we go.
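To avoid reusing the test set for validation, one option is a three-way split into train, validation, and test indices. The sketch below shows the idea in plain Python. The helper name `three_way_split`, the 15% fractions, and the document count of 18846 are assumptions for illustration.

```python
import random

def three_way_split(n_samples, val_fraction=0.15, test_fraction=0.15, seed=123456):
    """Shuffle the sample indices once and cut them into three disjoint parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(18846)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation part would then go into `validation_data`, while the test part is touched only once, for the final evaluation.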

In [21]:
model.fit(X_train, Y_train_oh, batch_size=32, epochs=2, validation_data=(X_test, Y_test_oh))
score, acc = model.evaluate(X_test, Y_test_oh, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

Epoch 1/2
Epoch 2/2
Test score: 2.0953047275543213
Test accuracy: 0.29430490732192993
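To put these numbers in context, it helps to compare them with a model that knows nothing. Assuming roughly balanced classes, a model that predicts a uniform distribution over 20 classes reaches the loss and accuracy computed below.

```python
import math

num_classes = 20

# Cross-entropy of a uniform prediction: -log(1 / num_classes)
chance_loss = -math.log(1.0 / num_classes)
# Expected accuracy of guessing, assuming roughly balanced classes
chance_accuracy = 1.0 / num_classes

print(f"chance loss: {chance_loss:.4f}, chance accuracy: {chance_accuracy:.2f}")
```

The test score printed above is the cross-entropy loss, so after 2 epochs the model is clearly better than guessing but still far from good.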


# Huggingface Transformers
Now that we have seen how to do basic classification using Keras and BiLSTMs, let's try to do the same using Transformers from Hugging Face. Since training our own LLMs from scratch is prohibitively expensive for most of us, we typically use out-of-the-box models or lightly fine-tune them. Note that every model comes with its own tokenizer too.

In [22]:
# Load a pretrained DistilBERT with a new 20-class classification head, plus its matching tokenizer
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=20)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'vocab_layer_norm', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_20']
You should probably TRAIN this model on a down-stream task to be able to use i


In [23]:
# Apply the tokenizer to our inputs, make sure to use Huggingface datasets here or prepare to get OoM :)
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({"data": dataset["data"], "label": dataset["target"]})
# Same 70/30 split as before; the labels travel along as a column of df
train_df, test_df = train_test_split(df, test_size=0.3, shuffle=False)
train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)

In [24]:
def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["data"], padding=True, truncation=True)

# Apply the tokenizer to our input data
tkn_train_ds = train_ds.map(tokenize_dataset, batched=True)
tkn_test_ds = test_ds.map(tokenize_dataset, batched=True)

  0%|          | 0/14 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

In [25]:
# Here we create our optimizer
from transformers import create_optimizer
import tensorflow as tf

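batch_size = 8  # must match the batch_size passed to to_tf_dataset below
num_epochs = 1  # must match the epochs passed to model.fit below
# One optimizer step per batch, so count batches rather than samples (rounded up)
batches_per_epoch = (len(tkn_train_ds) + batch_size - 1) // batch_size
total_train_steps = batches_per_epoch * num_epochs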

optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
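The number of training steps matters because the learning rate is decayed over exactly that many steps. The sketch below works through the arithmetic for the 13192 training samples and a batch size of 8. It assumes that, without warmup, the schedule decays linearly from `init_lr` to zero; that is my understanding of the defaults of `create_optimizer` and is worth checking against the Transformers documentation. The helper names are hypothetical.

```python
import math

def total_steps(num_samples, batch_size, num_epochs):
    """One optimizer step per batch; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size) * num_epochs

def linear_decay(step, init_lr, num_train_steps):
    """Assumed schedule: linear decay from init_lr to zero, no warmup."""
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining

steps = total_steps(num_samples=13192, batch_size=8, num_epochs=1)
print("total steps:", steps)
print("lr halfway:", linear_decay(steps // 2, init_lr=2e-5, num_train_steps=steps))
```

If the step count is overestimated, for instance by counting samples instead of batches, the learning rate barely decays during training; if it is underestimated, it reaches zero too early.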

We need to put in some effort to turn our dataset into TensorFlow-specific tensors. This is because Hugging Face supports multiple backends, such as PyTorch and TensorFlow, and our dataset is not natively in the right format.

In [26]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

tf_train_dataset = tkn_train_ds.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tkn_test_ds.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

In [27]:
# Now we can finally compile and fit our model
model.compile(optimizer=optimizer)
model.fit(x=tf_train_dataset, validation_data=tf_validation_dataset, epochs=1)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.




<keras.callbacks.History at 0x7fb34d152650>

In [28]:
model.predict(tf_validation_dataset)



TFSequenceClassifierOutput(loss=None, logits=array([[-1.7186681 ,  4.081797  ,  0.5308503 , ..., -2.196619  ,
        -2.1081474 , -2.378605  ],
       [-1.7905073 , -0.5689175 , -0.4168577 , ..., -2.08767   ,
        -2.0841732 , -1.7436225 ],
       [-0.35909104, -1.0783398 , -1.6784817 , ..., -1.3698161 ,
         0.04544097, -0.8618739 ],
       ...,
       [-1.7599331 , -0.26669726,  0.46577388, ..., -2.0898013 ,
        -2.1276777 , -2.4292684 ],
       [-0.1396529 ,  2.8338451 , -0.8337464 , ..., -1.405328  ,
        -1.4650098 , -0.71389085],
       [-0.9026109 , -1.8042994 , -1.6312944 , ..., -1.8590611 ,
        -0.9953726 , -1.0563321 ]], dtype=float32), hidden_states=None, attentions=None)
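The output above contains raw logits, one row per document and one column per class. To get predictions, take the index of the largest logit in each row; a softmax turns a row into probabilities if you need them. The sketch below uses made-up logits and labels with only three classes to keep it readable.

```python
import math

def softmax(logits):
    """Numerically stable softmax for one row of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(logit_rows):
    """Index of the largest logit in every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logit_rows]

# Hypothetical logits and true labels, invented for this example
toy_logits = [[-1.7, 4.1, 0.5], [-0.4, -1.1, 0.05]]
toy_labels = [1, 0]

predicted = predict_labels(toy_logits)
accuracy = sum(p == t for p, t in zip(predicted, toy_labels)) / len(toy_labels)
print(predicted, accuracy)
print([round(p, 3) for p in softmax(toy_logits[0])])
```

In the notebook, the rows would come from the `logits` field of the object returned by `model.predict`, and the true labels from the `label` column of the test set.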