# Transfer Learning for NLP
Transfer learning is still in its nascent field so the field and open-source community has not settled entirely on one easy and bulletproof-solutions. This means that libraries are still being developed and are changing as we speak. However major hubs are beginning to form and conform its use. The first one is based on the tensorflow framework: `tensorflow_hub`. The second one: `transformers` (formerly `pytorch-transformers`), came from PyTorch but adding tensorflow support also. 

For more classic work on Word-embeddings the gensim package which you worked with last week also has some decent ressorces. 

In this exercise set we will practice loading and applying models from both `tensorflow_hub` and from `transformers`. We will practice using sentence/paragraph embeddings as input to a clustering algorithm, as pretrained element in a new model, and finally try out the transformers library for pretraining a language model from scratch. 

Again we will use the Toxicity dataset. See download instructions in [week 7 exercises](https://github.com/ulfaslak/sds_tddl_2020/blob/master/exercises/exercises(7)_Categorydev_Class.ipynb)


In [0]:
# load dataset
import pandas as pd
path2tox_data = '/content/drive/My Drive/lm/toxic_train.csv'
df = pd.read_csv(path2tox_data)
# subsample data to allow faster prototyping
# df = df.sample(5000) # simple solution
# stratified solution where we subsample from each meta data column to get a higher variance.
strat_sample_cols = df.columns[3:23]
samples = []
n = 300
for col in strat_sample_cols:
    binary = pd.DataFrame((df[col]>0.5).astype(int))
    samples+=[j for _,j in binary.groupby(col).apply(lambda x: x.sample(min(len(x),n//2))).index]
idx = list(set(samples))
df = df.iloc[idx]

# subsample for clustering
df['label'] = (df.target>0.5).astype(int)

sample = df.groupby('label').apply(lambda x: x.sample(500))
sample_texts = sample.comment_text.values

> **Ex. 8.1:** *Pretrained Sentence Representations for Discovery and Exploration, using tensorflow_hub and the Universal Sentence Encoder*
TFhub allows you plug and play with fully implemented pipelines, including preprocessing and the embedding forward pass. You will use this as a basic feature extractor similarly to what you did in Exercise 5 with image data.

You will need to install tf_hub: `pip install --upgrade tensorflow-hub` first.

> **Ex. 8.1.1:** Load and aply the ["Universal Sentence Encoder"](https://arxiv.org/abs/1803.11175) embedder using tensorflow hub.
  - first `import tensorflow_hub as hub`
  - define the "embedder" object using the `hub.Module()` function that takes a link to a pretrained module, and initializes it. Use the link: https://tfhub.dev/google/universal-sentence-encoder/4. (**Hint**: follow the link to see an example)
  - Embed / transform a sample of texts from the toxicity dataaset to vectors by applying the embedder to a list of texts. Rremember to run the process within a tensorflow session:
  ```
  with tf.Session() as session:
      session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  ```



In [0]:
### Load universal encoder.
import tensorflow_hub as hub
import tensorflow as tf

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub


# Import the Universal Sentence Encoder's TF Hub module
#module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
#embed = hub.Module(module_url)
#inf = embed.get_output_info_dict()['default']
#output_dim = inf.get_shape()[1].value

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)
output_layer = embed.variables[-1]
output_dim = output_layer.shape[1]


In [0]:
## TensorFlow without eager execution
# initialize tf and embed documents.
#with tf.Session() as session:
  #session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  #embeddings = session.run(embed(sample_texts))
# New Tensorflow with eager execution
embeddings = embed(sample_texts)

> **Ex. 8.1.2:** Exploration based on Document embeddings. 
  - First we use this for **exploring** similar texts. This time we can not just use `gensim`'s neat `.most_similar` function. Instead we contruct it ourselves.
  - Construct a distance matrix between all texts. Here you can use the `sklearn.metrics.pairwise_distances` function that allows you to specify any distance measure implemented in the sklearn.metrics.pairwise. Get list here (`sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS`) 
  - Now we have a matrix where each document has a row that expresses the distance to every other document. Now we brush up on our matrix manipulations skills. We want to transform the distance matrix into a matrix that express which document are closest to each other, i.e. each row will be sorted indices relating to the closests documents. Use the `.argsort` function built in to the matrix. Validate that the argsort() is correct.
  - Now pick a random document contained in the distance matrix (i.e. a random index in the matrix). Print the text, along with with the most similar document (i.e. first index in the argsort matrix) and the distance score. Comment on what the model might have found / encoded.  
  - Finally write a function that takes a document not contained in the distance matrix already defined, but instead embeds the document, calculates the distance to sample of texts already embedded, returns the top k closests documents.


In [0]:
# Create doc2doc matrix.
import sklearn.metrics
doc2doc = sklearn.metrics.pairwise_distances(embeddings,metric='cosine')

In [0]:
# Apply argsort.
sort_mat = doc2doc.argsort()

In [0]:
# Get closests neighbors of a random document.
sample_id = np.random.choice(np.arange(len(sort_mat)))
print('This is the original text: \n %s\n'%sample_texts[sample_id].replace('\n','\t'))
neighbors = sort_mat[sample_id][1:5]
print('Most similar documents: \n')
for j in neighbors:
  print(sample_texts[j].replace('\n','\t'))
  print()
  print('__________________________')


This is the original text: 
 Fun...but not funny . Gays, woman, jews, atheists, christians, buddhists, hindus, kurds............and on and on and on down the list.

Most similar documents: 

Tough luck if you are Buddhist, Hindu, Muslim, Pagan, or Jew.  This is a good Hypocritical Christian Nation.

__________________________
Sounds reasonable....anti Muslim bigotry, anti Christian bigotry, anti Buddhist bigotry, anti Jewish bigotry....anti-belief system bigotry.

__________________________
The religion of the atheists is atheism itself.

__________________________
Whom should we blame then, Christians, Jews, Hindus - the list goes on but at the end of the day it is Muslims who are doing the killing, like it or not.

__________________________


The similarity seems to be related to invoking religious symbols. 

In [0]:
## Define function that has a document(i.e. userdefined string) as input and prints the 5 most similar documents in the dataset. 
def get_most_similar_doc(document,k=5):
  "Get k most similar documents"
  with tf.Session() as sess: # this will make it really slow since we are initializing it again. Should be kept within one session.

    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embedding = sess.run(embed([document]))
  dist = sklearn.metrics.pairwise_distances(embeddings,embedding,metric='cosine').flatten()
  sort = dist.argsort()[1:1+k]
  print('Most similar documents:\n')
  for i in sort:
    print('Score: %.2f'%dist[i])
    print('Document: %s'%sample_texts[i].replace('\n','\t'))
    print('__________________')
    print()
  
get_most_similar_doc('Your mom is so fat. How fat is she? Your mama is so big and fat that she can get busy with twenty-two burritos, but times are rough')
          

Most similar documents:

Score: 0.72
Document: for crying in the soup you have to dig up Hillary
__________________

Score: 0.75
Document: "The number of eggs produced by a female is related to its size. A 50-pound (23 kg) female will produce about 500,000 eggs, whereas a female over 250 pounds (113 kg) may produce 4 million eggs." (International Pacific Halibut Commission Technical Bulletin #40.)		The average size of halibut landed has declined dramatically in recent years. While the cause(s) is/are not clear, it would be prudent to try to maintain large-halibut genes in the pool, unless one believes smaller is better.
__________________

Score: 0.76
Document: John101	I stated 40 minutes via school bus.  Yes, school buses stop often and take much longer that a 15-minute car ride.  Not all families have the ability to drive their children to and from school.  Many families that attend Ainsworth have parents that both work, in fact, some are even single parents.	Ask PPS for the maps to 

> **Ex. 8.1.3:** Discovery based on Document embeddings. Cluster and summarize.  
  - Apply a clustering algorithm on the embeddings. `import sklearn.cluster`
  - Now we want to inspect the clusters. 
    - Random sample: Do a random sample from the largest cluster.
    - Most Representative: This will count as the Document with the shortest distance to all other docs. I.e. calculate average intra-cluster distance for each doc. Calculate the average intracluster distance for each document, and print the top 3 documents of the largest cluster.).

>## Wordbased Summarizations
Here we inspect the most representative words using TDIDF style weighing of each phrase/word in the cluster and in line with the "Computer Assisted Keyword and Documentset Discovery" we rank words in relation to feature importance / predictive capabilities. 

>**TfIdf style weighing**:. Idea is to calculate TDIDF not based on documents but on clusters. Formula is the following: 
$tfidf_{w,c} = tf_{w,c} \cdot log(\frac{\left | N_{c} \right |}{\left | CC_w \right |})$ where $N_c$ is no. of clusters $CC_w$ is no. of clusters word is present in. $tf_{w,c}$ is the frequency of a word in a cluster.
  - Transform documents into a DocumentTermMatrix,i.e. Counts of Words and Phrases., using `sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,4),min_df=5). Add ngrams to allow longer phrases.
  - Extrac the word index from the vectorizer using the function `.get_feature_names()`. This shall be used when printing the words, by translating an columnindex in the DocumentTermMatrix to a word.
  - Transform the DocumentTermMatrix into a ClusterTermMatrix by summing accross each cluster.
  - Transform this into a ClusterTermFrequencyMatrix by dividing by the sum for each cluster.
  - Calculate the "Inverse Cluster Frequency". 
  - Multiply these together to form the TFIDF. 
  - For each cluster get sort each word by their tfidf score and  print the top 10 terms. Remember the word index you defined earlier.

>## Extra "Computer-Assisted Keyword and Document Set Discovery from Unstructured Text" style word weighing. 
The method for discovering new query terms and in our context, phrases for doing *weak supervision', can also be used to summarize a given cluster.
- Train a model (e.g. `sklearn.linear_model.LogisticRegression`) using the DocumentTermMatrix as input and the cluster labels as output.
- Extract the coefficients of the model using the `.coef_` property of the model object.
- For each cluster label sort the words by the largest coefficients.


In [0]:
# Cluster using sklearn.cluster
import sklearn.cluster
clus = sklearn.cluster.AgglomerativeClustering(n_clusters=None, distance_threshold=2)
clus_labels = clus.fit_predict(embeddings,)

In [0]:
# Inspect random sample of documents from largest cluster
from collections import Counter
import random
for label,count in Counter(clus_labels).most_common():
  idx = clus_labels==label
  clus_texts = sample_texts[idx]

  for sample_text in random.sample(list(clus_texts),5):
    print(sample_text.replace('\n','\t'))
    print()

  break


This is again a matter stoke of genius for President Trump. Without even trying he has exposed Ryan as a fraud and sapped his power. Obama care will get so bad the Democrats who now own it by paying for it twice will come begging to make a deal fracturing the hold of the Elisabeth Warren, Nancy Pelosi liberal wing. 		The criminal investigation of Anthony Weiner, now in protective custody because of his information against the Clintons, will bring down the rest and the Democratic Party will be destroyed. Democrats on the fence like Senator Manchin, seeing the revival of coal in West Virginia will jump to the Party of Lincoln and other Democrats not wanting to be tainted by the Weiner/Clinton scandal will move to the center. The only thing the Democrats have is the fake news of Russian Complicity. This pails to eclipse Hillary Clinton selling 3/4 of American Uranium reserves while the Secretary of State...  		Trump will surpass Reagan and reach the pinnacle of Abraham Lincoln.

Nobody's 

In [0]:
# Find most representative documents by calculating
# the average intra-cluster-distance for each 
# document to its cluster neighbors. 
for label,count in Counter(clus_labels).most_common()[2:]:
  print()
  print('Topic %d: '%label)
  idx = clus_labels==label
  idx_l = np.arange(len(doc2doc))[idx]
  intra_distance = doc2doc[idx_l,idx_l.reshape(-1,1)].sum(axis=1)
  for rep in idx_l[intra_distance.argsort()[0:3]]:
    print()
    print(sample_texts[rep].replace('\n','\t'))
    print('_______________')

  break


Topic 8: 

The problem here is that Islam embodies hatred of all other religions. The Koran clearly and unambiguously requires the murder of homosexuals among other infidels (chapter 4). While the media claims Islam is a religion of peace, muslims have never made such a claim. 	So, the question is: Are we required to tolerate their intolerance? 	And further more, given that the media, the establishment and the political left have far, far more hatred for white christians than muslims, shouldn't we be primarily concerned with Christophobia?
_______________

This writer is deep in the Orwellian logic that provoked so many to vote for Trump.  She thinks a nation deciding who may enter is a "hate crime". Give me a break.  Every non-western country in the world has a list of nations who may not enter. It's called national sovereignty.		We tried the nice way for 15 years - praise Islam after every slaughter, condemn terrorists who explicitly tape videos saying their crimes are inspired by I

In [0]:
## Word Summarization
# Summarize by printing TF-IDF weighed words.
import sklearn.feature_extraction.text
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,4),min_df=5)
bows = vectorizer.fit_transform([i.lower() for i in sample_texts])
index = vectorizer.get_feature_names()
lab2counts = bows.T.dot(pd.get_dummies(clus_labels)).T
# extra split each text into sentences and embed these, to find most representative sentence.
labeltf = (lab2counts.T/lab2counts.sum(axis=1)).T
n_clus = labeltf.shape[0]
idf = np.log(n_clus/(lab2counts>0).sum(axis=0))
lab_tfidf = labeltf*idf

In [0]:
# Print top "phrases" of each cluster based on the tfidf score.
for num,vec in enumerate(lab_tfidf.argsort()):
  print()
  print('Topic: %d'%num)
  top = vec[::-1][0:10]
  for phrase in top:
    print(index[phrase])
  print('______________________')


Topic: 0
homosexual
gay
bisexual
sex
men
straight
acts
heterosexual
sodomy
gay men
______________________

Topic: 1
million
gonna
drug
given that
impossible
hey
disease
the entire
you have to
stupid
______________________

Topic: 2
religious
christians
christian
jews
bishops
faith
buddhist
weren
churches
of jesus
______________________

Topic: 3
rail
read
black
meaning
but you
proof
who will
the city
mayor
for you
______________________

Topic: 4
tax
rail
city
housing
taxes
the city
market
federal
rate
traffic
______________________

Topic: 5
fire
parents
keeping
what they
children
to take
police
to get
insurance
working
______________________

Topic: 6
thanks
gary
thank you
not to
thank
mayor
quit
2015
post
why don you
______________________

Topic: 7
states
can get
america
united states
the united
the united states
gender
party
degree
women
______________________

Topic: 8
muslims
muslim
islam
attacks
sharia
islamic
religion
muslims are
terrorist
attack
______________________

Topic

In [0]:
#  "Computer Assisted document and set discovery" most predictive summary
log_mod = sklearn.linear_model.LogisticRegression(max_iter=1000)
#log_mod.fit(bows.sign(),clus_labels)
log_mod.fit(bows,clus_labels)

for num,vec in enumerate(log_mod.coef_):
  print()
  print('Topic %d: '%num)
  top = vec.argsort()[::-1][0:10]
  for phrase in top:
    print(index[phrase])
  print('______________________')


Topic 0: 
gay
bisexual
straight
homosexual
men
sex
likes
acts
male
is
______________________

Topic 1: 
gonna
stupid
think
suck
000
both
does
men
man
kill
______________________

Topic 2: 
christian
jews
christians
churches
anti
again
muslims
those
americans
jews and
______________________

Topic 3: 
didn
read
personal
comment
obviously
your
you
suppose
as
it was
______________________

Topic 4: 
no
money
oil
tax
spending
bus
talking about
city
talking
millions
______________________

Topic 5: 
defense
boy
throughout
want
our
ones
take
parents
what they
new
______________________

Topic 6: 
thanks
post
two
great
that what
perfect
gary
to the
correct
absolutely
______________________

Topic 7: 
up
women
over
degree
red
party
work
republican
less
than
______________________

Topic 8: 
muslim
muslims
terrorist
islam
women
terrorists
islamic
allah
was
honor
______________________

Topic 9: 
marriage
church
marriage is
gay
catholic
heterosexual
married
for
many
jesus
______________________

## Transfer Learning for supervised learning
**Ex. 8.2:** Adopt pretrained embeddings into a larger model.
Here we shall practice using pretrained models as part of a larger pipeline using tensorflow_hub and keras. 
>
**Ex. 8.2.1:** Built a Keras model where the first layer is the Universal Sentence Encoder, and stack layers on top.
  - initialize your model as using the `Sequential()` function.
  - add the hub layer using the `hub.KerasLayer(module_url,input_shape=[],dtype=tf.string,trainable=False)` 
    - `trainable = True` option allows you to finetune the Universal Sentence Encoder also, this however will slow down training significantly. 
  - Add a classification layer on top. You may add any layers you like.
  - Compile model.
  - Train model. Because it is using Tensorflow again you need to initialize the session, using the line: ```with tf.Session() as sess:
  sess.run([tf.global_variables_initializer(), tf.tables_initializer()])```. And then call the `.fit()` method inputting your training data.
` 


In [0]:
# Adopt Universal Sentence Encoder 
# as the first layer in preprocessing 
# step in a larger Keras Pipeline

# Prepare Dataset split the Toxicity dataset into train and test
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['comment_text'], 
                                                    df['label'], 
                                                    test_size=0.3, 
                                                    stratify=df['label'], 
                                                    random_state=42)

# Initialize Keras Sequetial Pipeline
model = tf.keras.models.Sequential()

# Add USE layer to model
mod_url = 'https://tfhub.dev/google/universal-sentence-encoder/4'
model.add(hub.KerasLayer(mod_url, 
                        input_shape=[], 
                        dtype=tf.string,
                         trainable=False
                        ))

# Define extra layers for the pipeline.
model.add(tf.keras.layers.Dense(256, activation='relu')) # add standard feed forward layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid')) # define output layer.
# Compile model.
model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=['accuracy'])
              
import numpy as np
val_dat = (np.array([i for i in x_test.values]), y_test.values)
# Run session and fit model.
with tf.Session() as sess:

  sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
  model.fit(x_train.values, 
            y_train.values, 
            epochs=10
            #,validation_split=0.2
            ,validation_data=val_dat
            )


Train on 4002 samples, validate on 1716 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Transformers Library
> Exercise 8.3: Train a Language Model from scratch.
- First we need to compile a dataset. Famously wikipedia data has been used, becasue it is available in many languages. Wikimedia provides data dumps regularly. Choose a language to download from the https://dumps.wikimedia.org, e.g. the danish wikpedia: https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2
- Next we should unzip it, and preprocess it to extract plain text. See code below.
- To further prepare it for training the language model we should split the text into train and eval. We will do this by running through all files, and writing to a train file with a probability of p, and a eval file with a probability of 1-p. 
  - For each article (i.e. line in file), get a random value using the `random.random()` function. If below p write to test, if above write to train file.

Now you should follow the Tutorial Provided by the huggingface organization and the Implementation found here: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
or appropriate the following implementation https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b


- **Important Update** as this is an open source world, the implementations are changing and the example is not fully up-to-date with the newest changes to the `transformers` package api. With the transformers 2.6 release certain changes that have not been implemented in their `run_language_modelling.py` training script. To work around this issue I have put in a cell implementing a somewhat *crazy* hack, to slightly alter the script.


In [0]:
# Download language modelling corpus from the wikimedia.org 
data_link = 'https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2' # define link

# download data
! wget {data_link}

--2020-03-26 14:16:00--  https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325201125 (310M) [application/octet-stream]
Saving to: ‘dawiki-20200101-pages-articles.xml.bz2’


2020-03-26 14:17:05 (4.77 MB/s) - ‘dawiki-20200101-pages-articles.xml.bz2’ saved [325201125/325201125]



In [0]:
# unzip data
filename = data_link.split('/')[-1]
! bzip2 -d {filename}

bzip2: Output file dawiki-20200101-pages-articles.xml already exists.


In [0]:
# Inspect data
unzipped_file = filename.split('.bz2')[0]
# inspect file
! head -400 {unzipped_file}

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="da">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>dawiki</dbname>
    <base>https://da.wikipedia.org/wiki/Forside</base>
    <generator>MediaWiki 1.35.0-wmf.11</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Speciel</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Diskussion</namespace>
      <namespace key="2" case="first-letter">Bruger</namespace>
      <namespace key="3" case="first-letter">Brugerdiskussion</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia-diskussion</

In [0]:
# Download external package for extracting plain text from wiki xml.
ext_pack = 'https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py'
import requests
with open('WikiExtractor.py','wb') as f:
  f.write(requests.get(ext_pack).content)
import os 
if not os.path.isdir('da_files'):
  os.mkdir('da_files')
! python WikiExtractor.py --output da_files --bytes 10000000000 --quiet {unzipped_file}

mkdir: cannot create directory ‘da_files’: File exists


In [0]:
# inspect files generated
! ls -sh da_files/AA

total 366M
366M wiki_00


In [0]:
# Construct train and eval dataset.
import os
if not os.path.isdir('data'):
  os.mkdir('data')
efile = open('data/eval.txt','w')
tfile = open('data/train.txt','w')
import random
p = 0.01
for line in open('da_files/AA/wiki_00','r'):
  if line.strip()=='':
    continue
  if random.random()>p:
    tfile.write(line+'\n')
  else:
    efile.write(line+'\n')
tfile.close()
efile.close()

In [0]:
# Appropriation of the Transformers Tutorial on training a language model from scratch.

In [0]:
# install the tokenizers library developed by the huggingface group.
! pip install tokenizers

Collecting tokenizers
[?25l  Downloading https://files.pythonhosted.org/packages/73/de/ec55e2d5a8720557b25100dd7dd4a63108a44b6b303978ce2587666931cf/tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
[K     |████████████████████████████████| 3.8MB 4.9MB/s 
[?25hInstalling collected packages: tokenizers
Successfully installed tokenizers-0.6.0


In [0]:
# set path to full dataset
path2data = 'da_files/AA/wiki_00'
import os
os.path.isfile(path2data)

True

In [0]:
# Train tokenizers.
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=[path2data], vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
model_name = 'DaBerto'
if not os.path.isdir(model_name):
  os.mkdir(model_name)
tokenizer.save(model_name)
# test tokenizer
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "%s/vocab.json"%model_name,
    "%s/merges.txt"%model_name,
)
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
# test tokenizer
print(tokenizer.encode("Mit navn er Snorre Sturlason").tokens)


['<s>', 'Mit', 'Ġnavn', 'Ġer', 'ĠSnorre', 'ĠStur', 'lason', '</s>']


In [0]:
# Define config file for the run_language_modelling.py script
import os
# define config file.
import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 6,
	"type_vocab_size": 1,
	"vocab_size": 52000
}
with open("./%s/config.json"%model_name, 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {
	"max_len": 512
}
with open("./%s/tokenizer_config.json"%model_name, 'w') as fp:
    json.dump(tokenizer_config, fp)

In [0]:
# download script training 
import requests
with open('run_language_modeling.py','w') as f:
  f.write(requests.get('https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py').text)
# install transformers package.
! pip install transformers

Collecting transformers==2.6
[?25l  Downloading https://files.pythonhosted.org/packages/4c/a0/32e3a4501ef480f7ea01aac329a716132f32f7911ef1c2fac228acc57ca7/transformers-2.6.0-py3-none-any.whl (540kB)
[K     |▋                               | 10kB 19.3MB/s eta 0:00:01[K     |█▏                              | 20kB 3.1MB/s eta 0:00:01[K     |█▉                              | 30kB 4.5MB/s eta 0:00:01[K     |██▍                             | 40kB 3.0MB/s eta 0:00:01[K     |███                             | 51kB 3.7MB/s eta 0:00:01[K     |███▋                            | 61kB 4.4MB/s eta 0:00:01[K     |████▎                           | 71kB 5.0MB/s eta 0:00:01[K     |████▉                           | 81kB 5.6MB/s eta 0:00:01[K     |█████▌                          | 92kB 6.3MB/s eta 0:00:01[K     |██████                          | 102kB 4.8MB/s eta 0:00:01[K     |██████▋                         | 112kB 4.8MB/s eta 0:00:01[K     |███████▎                        | 122kB

In [0]:
##### UGLY HACK do not touch or look##### 
s = open('run_language_modeling.py').read()
s = s.replace('''from transformers import (\n    CONFIG_MAPPING,\n    MODEL_WITH_LM_HEAD_MAPPING,\n    WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelWithLMHead,\n    AutoTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    get_linear_schedule_with_warmup,\n)\n'''
,'''from transformers import (\n   WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelWithLMHead,\n    AutoTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    get_linear_schedule_with_warmup,\n)\n
from collections import OrderedDict

from transformers.configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from transformers.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from transformers.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from transformers.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from transformers.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from transformers.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from transformers.configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from transformers.configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from transformers.configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from transformers.configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
from transformers.configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from transformers.configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from transformers.configuration_utils import PretrainedConfig
from transformers.configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
from transformers.configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
from transformers.configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
CONFIG_MAPPING = OrderedDict(
    [
        ("t5", T5Config,),
        ("distilbert", DistilBertConfig,),
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
        ("bart", BartConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
        ("openai-gpt", OpenAIGPTConfig,),
        ("gpt2", GPT2Config,),
        ("transfo-xl", TransfoXLConfig,),
        ("xlnet", XLNetConfig,),
        ("xlm", XLMConfig,),
        ("ctrl", CTRLConfig,),
    ]
)
from transformers.modeling_auto import (
        MODEL_MAPPING,
        MODEL_FOR_PRETRAINING_MAPPING,
        MODEL_FOR_QUESTION_ANSWERING_MAPPING,
        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
        MODEL_WITH_LM_HEAD_MAPPING,
    )''').replace('model = AutoModelWithLMHead(config=config)',
                  '''model = AutoModelWithLMHead.from_config(config=config)'''
        
    )

with open('run_language_modeling2.py','w') as f:
  f.write(s)

In [0]:
# define cmd to run the slightly altered script:
# : run_language_modelling2.py

cmd =	("""
  python run_language_modeling2.py
  --train_data_file ./data/train.txt
  --eval_data_file ./data/eval.txt
  --output_dir ./DaBERTo-small-v1
	--model_type roberta
	--mlm
	--config_name ./%s
	--tokenizer_name ./%s
	--do_train
	--line_by_line
	--learning_rate 1e-4
	--num_train_epochs 1
	--save_total_limit 2
	--save_steps 2000
	--per_gpu_train_batch_size 8
	--seed 42
"""%(model_name,model_name)).replace("\n", " ")

# Start training.
%time
!{cmd}

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Iteration:   1% 1369/245837 [02:50<7:42:19,  8.81it/s][A
Iteration:   1% 1370/245837 [02:50<7:41:54,  8.82it/s][A
Iteration:   1% 1372/245837 [02:50<6:50:25,  9.93it/s][A
Iteration:   1% 1374/245837 [02:51<7:20:05,  9.26it/s][A
Iteration:   1% 1375/245837 [02:51<7:29:56,  9.06it/s][A
Iteration:   1% 1376/245837 [02:51<8:21:07,  8.13it/s][A
Iteration:   1% 1378/245837 [02:51<8:13:36,  8.25it/s][A
Iteration:   1% 1379/245837 [02:51<8:33:28,  7.93it/s][A
Iteration:   1% 1381/245837 [02:51<8:10:16,  8.31it/s][A
Iteration:   1% 1382/245837 [02:52<9:47:21,  6.94it/s][A
Iteration:   1% 1384/245837 [02:52<9:38:18,  7.05it/s][A
Iteration:   1% 1385/245837 [02:52<10:08:32,  6.70it/s][A
Iteration:   1% 1386/245837 [02:52<10:03:59,  6.75it/s][A
Iteration:   1% 1387/245837 [02:52<10:04:33,  6.74it/s][A
Iteration:   1% 1389/245837 [02:53<9:47:27,  6.94it/s] [A
Iteration:   1% 1390/245837 [02:53<10:16:14,  6.61it/s][A
It

See setting this module for doing discriminate learning, i.e. different learning rates for each layer.
https://pypi.org/project/keras-lr-multiplier/

```
from keras.models import Sequential
from keras.layers import Dense
from keras_lr_multiplier import LRMultiplier

model = Sequential()
model.add(Dense(
    units=5,
    input_shape=(5,),
    activation='tanh',
    name='Dense',
))
model.add(Dense(
    units=2,
    activation='softmax',
    name='Output',
))
model.compile(
    optimizer=LRMultiplier('adam', {'Dense': 0.5, 'Output': 1.5}),
    loss='sparse_categorical_crossentropy',
)
```