# Transfer Learning for NLP
Transfer learning for NLP is still a nascent field, so the research and open-source communities have not yet settled on one easy, bulletproof solution. Libraries are still under active development and their APIs change frequently. However, major hubs are beginning to form and standardize usage. The first is built on the TensorFlow framework: `tensorflow_hub`. The second, `transformers` (formerly `pytorch-transformers`), started out in PyTorch and has since added TensorFlow support.

For more classic work on word embeddings, the `gensim` package, which you worked with last week, also has some decent resources.

In this exercise set we will practice loading and applying models from both `tensorflow_hub` and `transformers`. We will use sentence/paragraph embeddings as input to a clustering algorithm and as a pretrained component in a new model, and finally try out the `transformers` library for pretraining a language model from scratch.

Again we will use the Toxicity dataset. See the download instructions in the [week 7 exercises](https://github.com/ulfaslak/sds_tddl_2020/blob/master/exercises/exercises(7)_Categorydev_Class.ipynb).


In [0]:
# load dataset
import pandas as pd
path2tox_data = '/content/drive/My Drive/lm/toxic_train.csv'
df = pd.read_csv(path2tox_data)
# subsample data to allow faster prototyping
# df = df.sample(5000) # simple solution
# stratified solution where we subsample from each meta data column to get a higher variance.
strat_sample_cols = df.columns[3:23]
samples = []
n = 300
for col in strat_sample_cols:
    binary = pd.DataFrame((df[col]>0.5).astype(int))
    samples+=[j for _,j in binary.groupby(col).apply(lambda x: x.sample(min(len(x),n//2))).index]
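idx = list(set(samples))
# the sampled values are index labels, so select with .loc; copy so that new columns can be added safely
df = df.loc[idx].copy()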

# subsample for clustering
df['label'] = (df.target>0.5).astype(int)

sample = df.groupby('label').apply(lambda x: x.sample(500))
sample_texts = sample.comment_text.values

> **Ex. 8.1:** *Pretrained Sentence Representations for Discovery and Exploration, using tensorflow_hub and the Universal Sentence Encoder*
TF Hub lets you plug and play with fully implemented pipelines, including preprocessing and the embedding forward pass. You will use it as a basic feature extractor, similar to what you did in Exercise 5 with image data.

First you need to install TF Hub: `pip install --upgrade tensorflow-hub`.

> **Ex. 8.1.1:** Load and apply the ["Universal Sentence Encoder"](https://arxiv.org/abs/1803.11175) embedder using TensorFlow Hub.
  - First `import tensorflow_hub as hub`.
  - Define the "embedder" object using the `hub.Module()` function, which takes a link to a pretrained module and initializes it. Use the link: https://tfhub.dev/google/universal-sentence-encoder/4. (**Hint**: follow the link to see an example. Note that `hub.Module()` is the TensorFlow 1.x interface; version 4 of the encoder is published as a TF2 SavedModel, so under TensorFlow 2.x load it with `hub.load()` instead.)
  - Embed / transform a sample of texts from the toxicity dataset into vectors by applying the embedder to a list of texts. With the TensorFlow 1.x interface, remember to run the process within a TensorFlow session:
  ```
  with tf.Session() as session:
      session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  ```



In [0]:
### Load universal encoder.

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [0]:
# initialize tf and embed documents.

> **Ex. 8.1.2:** Exploration based on document embeddings.
  - First we use the embeddings for **exploring** similar texts. This time we cannot just use `gensim`'s neat `.most_similar` function. Instead we construct it ourselves.
  - Construct a distance matrix between all texts. Here you can use the `sklearn.metrics.pairwise_distances` function, which lets you specify any distance measure implemented in `sklearn.metrics.pairwise`. The available measures are listed in `sklearn.metrics.pairwise.PAIRWISE_DISTANCE_FUNCTIONS`.
  - Now we have a matrix where each document has a row that expresses its distance to every other document. Time to brush up on our matrix manipulation skills: transform the distance matrix into a matrix that expresses which documents are closest to each other, i.e. each row holds the document indices sorted from closest to farthest. Use the matrix's built-in `.argsort` method. Validate that the result of `argsort()` is correct.
  - Now pick a random document contained in the distance matrix (i.e. a random index in the matrix). Print the text along with the most similar document and the distance score. Since every document has distance 0 to itself, the most similar *other* document is the first index in its argsort row that is not the document itself. Comment on what the model might have found / encoded.
  - Finally, write a function that takes a document not already contained in the distance matrix, embeds it, calculates the distance to the sample of texts already embedded, and returns the top k closest documents.
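
The matrix steps of this exercise can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the reference solution: the embeddings are random stand-ins for the real sentence vectors, the cosine distance is computed by hand (the exercise suggests `sklearn.metrics.pairwise_distances` for this), and the names `cosine_distance_matrix` and `top_k_similar` are made up for the example. In the real exercise the new document would first be embedded with the encoder before being passed to the lookup function.

```python
import numpy as np

# Stand-in for the real sentence embeddings (hypothetical random data):
# 6 "documents" with 16-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 16))
texts = [f"document {i}" for i in range(len(embeddings))]  # placeholder texts


def cosine_distance_matrix(a, b):
    """Cosine distance (1 - cosine similarity) between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


# Document-to-document distances and, per row, indices sorted from closest to farthest.
dist = cosine_distance_matrix(embeddings, embeddings)
order = dist.argsort(axis=1)

# Validation: every document is its own closest match (distance ~0),
# so the nearest *other* document sits in column 1, not column 0.
assert (order[:, 0] == np.arange(len(embeddings))).all()

query = 2
nearest = order[query, 1]
print(texts[query], "->", texts[nearest], "distance:", round(float(dist[query, nearest]), 3))


def top_k_similar(query_vector, k=3):
    """Return (index, distance) pairs for the k embedded documents closest to a new vector."""
    d = cosine_distance_matrix(query_vector[None, :], embeddings)[0]
    closest = d.argsort()[:k]
    return [(int(i), float(d[i])) for i in closest]


# A slightly perturbed copy of document 4 should retrieve document 4 first.
new_vector = embeddings[4] + 0.01 * rng.normal(size=16)
neighbours = top_k_similar(new_vector, k=3)
print(neighbours)
```

The built-in check on column 0 is one way to do the "validate the argsort" step: if it fails, the rows were sorted along the wrong axis or the matrix is not a proper distance matrix.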


In [0]:
# Create doc2doc matrix.

In [0]:
# Apply argsort.

In [0]:
# Get closests neighbors of a random document.

The similarity seems to be related to invoking religious symbols. 

In [0]:
## Define a function that takes a document (i.e. a user-defined string) as input and prints the 5 most similar documents in the dataset.

> **Ex. 8.1.3:** Discovery based on document embeddings. Cluster and summarize.
  - Apply a clustering algorithm to the embeddings (`import sklearn.cluster`).
  - Now we want to inspect the clusters.
    - Random sample: draw a random sample of documents from the largest cluster.
    - Most representative: the document with the shortest average distance to all other documents in its cluster. Calculate the average intra-cluster distance for each document, and print the top 3 documents of the largest cluster.
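
As a rough sketch of the "most representative" step, the snippet below ranks the members of the largest cluster by their average distance to the other members. Everything in it is illustrative: the 2-d points stand in for the sentence embeddings, the `labels` array stands in for the output of whichever `sklearn.cluster` algorithm you chose, and Euclidean distance is just one possible choice of measure.

```python
import numpy as np

# Hypothetical inputs: 2-d points standing in for embeddings, and cluster labels
# as a clustering algorithm would return them.
points = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5],  # cluster 0
    [10.0, 10.0], [11.0, 10.0],                                  # cluster 1
])
labels = np.array([0, 0, 0, 0, 0, 1, 1])

# The largest cluster is the label with the most members.
cluster_ids, sizes = np.unique(labels, return_counts=True)
largest = cluster_ids[sizes.argmax()]
members = np.flatnonzero(labels == largest)

# Random sample of documents from the largest cluster.
rng = np.random.default_rng(0)
random_sample = rng.choice(members, size=2, replace=False)

# Pairwise Euclidean distances among the members of the largest cluster.
diffs = points[members][:, None, :] - points[members][None, :, :]
intra = np.sqrt((diffs ** 2).sum(axis=-1))

# Average distance from each member to the *other* members
# (the zero self-distance is excluded from the average).
avg_dist = intra.sum(axis=1) / (len(members) - 1)

# Smallest average distance first: these are the most representative documents.
ranking = members[avg_dist.argsort()]
top3 = ranking[:3]
print("most representative document indices:", top3)
```

The point in the middle of the square (index 4) comes out first, which matches the intuition that the most representative document sits closest to the centre of its cluster.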

>## Word-based Summarizations
Here we inspect the most representative words using TF-IDF style weighting of each phrase/word in the cluster and, in line with "Computer-Assisted Keyword and Document Set Discovery from Unstructured Text", we rank words by feature importance / predictive capability.

>**TF-IDF style weighting:** The idea is to calculate TF-IDF based not on documents but on clusters. The formula is:
$\mathrm{tfidf}_{w,c} = \mathrm{tf}_{w,c} \cdot \log\left(\frac{N_c}{CC_w}\right)$, where $N_c$ is the number of clusters, $CC_w$ is the number of clusters in which word $w$ is present, and $\mathrm{tf}_{w,c}$ is the frequency of word $w$ in cluster $c$.
  - Transform the documents into a DocumentTermMatrix, i.e. counts of words and phrases, using `sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,4),min_df=5)`. The n-grams allow longer phrases.
  - Extract the word index from the vectorizer using the function `.get_feature_names()` (renamed `.get_feature_names_out()` in recent scikit-learn versions). It is used when printing the words, to translate a column index in the DocumentTermMatrix into a word.
  - Transform the DocumentTermMatrix into a ClusterTermMatrix by summing across each cluster.
  - Transform this into a ClusterTermFrequencyMatrix by dividing by the sum for each cluster.
  - Calculate the "Inverse Cluster Frequency".
  - Multiply these together to form the TF-IDF.
  - For each cluster, sort the words by their TF-IDF score and print the top 10 terms. Remember the word index you defined earlier.
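
The chain of matrix transformations above can be sketched on a toy example. This is only an illustration: the vocabulary, counts, and cluster labels are invented, and the count matrix is written out by hand instead of coming from `CountVectorizer`. It assumes every word occurs in at least one cluster, which `min_df` guarantees in the real pipeline.

```python
import numpy as np

# Hypothetical toy inputs: a document-term count matrix and one cluster label per document.
vocab = np.array(["the", "god", "church", "tax", "money"])
dtm = np.array([
    [3, 2, 1, 0, 0],  # doc 0, cluster 0
    [2, 2, 1, 0, 0],  # doc 1, cluster 0
    [3, 0, 0, 2, 1],  # doc 2, cluster 1
    [1, 0, 0, 1, 3],  # doc 3, cluster 1
])
labels = np.array([0, 0, 1, 1])
cluster_ids = np.unique(labels)

# ClusterTermMatrix: sum the rows of all documents belonging to each cluster.
ctm = np.vstack([dtm[labels == c].sum(axis=0) for c in cluster_ids])

# ClusterTermFrequencyMatrix: divide each row by that cluster's total count.
tf = ctm / ctm.sum(axis=1, keepdims=True)

# Inverse cluster frequency: log(N_c / CC_w), with CC_w the number of clusters containing word w.
n_clusters = len(cluster_ids)
clusters_with_word = (ctm > 0).sum(axis=0)
icf = np.log(n_clusters / clusters_with_word)

# TF-IDF per cluster and word.
tfidf = tf * icf

# Top terms per cluster, translated back to words through the vocabulary.
top_terms = {}
for row, c in enumerate(cluster_ids):
    best = tfidf[row].argsort()[::-1][:2]
    top_terms[int(c)] = [str(w) for w in vocab[best]]
    print(f"cluster {c}: {top_terms[int(c)]}")
```

Note how "the", which occurs in both clusters, gets an inverse cluster frequency of $\log(2/2) = 0$ and therefore drops out of every summary even though it is the most frequent word.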

>## Extra: "Computer-Assisted Keyword and Document Set Discovery from Unstructured Text" style word weighting
The method for discovering new query terms and, in our context, phrases for doing *weak supervision* can also be used to summarize a given cluster.
- Train a model (e.g. `sklearn.linear_model.LogisticRegression`) using the DocumentTermMatrix as input and the cluster labels as output.
- Extract the coefficients of the model using the `.coef_` property of the model object.
- For each cluster label sort the words by the largest coefficients.
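
A small sketch of this classifier-based summary, assuming scikit-learn is installed. The vocabulary, counts, and cluster labels are made up so that each cluster has two words of its own; with real data the DocumentTermMatrix would come from the vectorizer and the labels from your clustering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 9 documents, 6 terms, 3 clusters with two characteristic words each.
vocab = np.array(["god", "church", "tax", "money", "game", "team"])
dtm = np.array([
    [2, 1, 0, 0, 0, 0], [1, 2, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0],  # cluster 0
    [0, 0, 2, 1, 0, 0], [0, 0, 1, 2, 0, 0], [0, 0, 1, 1, 0, 0],  # cluster 1
    [0, 0, 0, 0, 2, 1], [0, 0, 0, 0, 1, 2], [0, 0, 0, 0, 1, 1],  # cluster 2
])
cluster_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Predict the cluster label from the word counts.
model = LogisticRegression(max_iter=1000)
model.fit(dtm, cluster_labels)

# With three or more clusters, coef_ holds one row of word weights per cluster.
predictive_terms = {}
for row, c in enumerate(model.classes_):
    best = model.coef_[row].argsort()[::-1][:2]  # indices of the largest coefficients
    predictive_terms[int(c)] = sorted(str(w) for w in vocab[best])
    print(f"cluster {c}: {predictive_terms[int(c)]}")
```

One caveat worth keeping in mind: with only two clusters, `coef_` has a single row, whose large positive values point to the second class and large negative values to the first.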


In [0]:
# Cluster using sklearn.cluster

In [0]:
# Inspect random sample of documents from largest cluster

In [0]:
# Find most representative documents by calculating
# the average intra-cluster-distance for each 
# document to its cluster neighbors. 

In [0]:
## Word Summarization
# Summarize by printing TF-IDF weighted words.

In [0]:
# Print top "phrases" of each cluster based on the tfidf score.

In [0]:
#  "Computer Assisted document and set discovery" most predictive summary

## Transfer Learning for supervised learning
**Ex. 8.2:** Adopt pretrained embeddings into a larger model.
Here we practice using pretrained models as part of a larger pipeline using `tensorflow_hub` and Keras.
>
**Ex. 8.2.1:** Build a Keras model where the first layer is the Universal Sentence Encoder, and stack layers on top.
  - Initialize your model using the `Sequential()` function.
  - Add the hub layer using `hub.KerasLayer(module_url, input_shape=[], dtype=tf.string, trainable=False)`.
    - The `trainable=True` option lets you fine-tune the Universal Sentence Encoder as well; this, however, slows down training significantly.
  - Add a classification layer on top. You may add any layers you like.
  - Compile the model.
  - Train the model. Under TensorFlow 1.x you again need to initialize the session before calling the `.fit()` method with your training data (under TensorFlow 2.x you can call `.fit()` directly):
  ```
  with tf.Session() as sess:
      sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
  ```


In [0]:
# Adopt Universal Sentence Encoder 
# as the first layer in preprocessing 
# step in a larger Keras Pipeline

# Prepare Dataset split the Toxicity dataset into train and test
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['comment_text'], 
                                                    df['label'], 
                                                    test_size=0.3, 
                                                    stratify=df['label'], 
                                                    random_state=42)
import numpy as np
val_dat = (np.array([i for i in x_test.values]), y_test.values)

# Initialize Keras Sequential Pipeline

# Add USE layer to model
# Define extra layers for the pipeline.
# Compile model.
              
# Run session and fit model.

See also this module (https://pypi.org/project/keras-lr-multiplier/) for doing discriminative fine-tuning (i.e. different learning rates for each layer), as described in ["Universal Language Model Fine-tuning for Text Classification"](https://arxiv.org/pdf/1801.06146.pdf) (Howard and Ruder 2018). For example:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras_lr_multiplier import LRMultiplier

model = Sequential()
model.add(Dense(
    units=5,
    input_shape=(5,),
    activation='tanh',
    name='Dense',
))
model.add(Dense(
    units=2,
    activation='softmax',
    name='Output',
))
model.compile(
    optimizer=LRMultiplier('adam', {'Dense': 0.5, 'Output': 1.5}),
    loss='sparse_categorical_crossentropy',
)
```
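
To make the idea of discriminative fine-tuning concrete, here is a minimal plain-Python sketch of how per-layer learning rates could be derived. The helper name `discriminative_learning_rates` is made up for this illustration, and the layer count and top learning rate are arbitrary. The default factor of 2.6 is the value Howard and Ruder report as working well: pick a learning rate for the last layer, then divide by 2.6 for each layer further down.

```python
def discriminative_learning_rates(top_lr, n_layers, factor=2.6):
    """Learning rates ordered from the first (lowest) layer to the last (top) layer.

    The top layer gets top_lr; every layer below it gets the rate of the
    layer above divided by factor.
    """
    return [top_lr / factor ** (n_layers - 1 - layer) for layer in range(n_layers)]


# Illustrative values: a 3-layer model whose top layer trains with lr = 0.01.
rates = discriminative_learning_rates(top_lr=0.01, n_layers=3)
for layer, lr in enumerate(rates):
    print(f"layer {layer}: lr = {lr:.6f}")
```

The lower layers, which hold the most general pretrained features, thus move more slowly than the task-specific layers on top.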

## Testdriving the Transformers Library
> Exercise 8.3.2: Train a language model from scratch.
- First we need to compile a dataset. Wikipedia data is a popular choice because it is available in many languages, and Wikimedia provides data dumps regularly. Choose a language to download from https://dumps.wikimedia.org, e.g. the Danish Wikipedia: https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2
- Next we unzip it and preprocess it to extract plain text. See the code below.
- To further prepare it for training the language model, we split the text into train and eval sets. We do this by running through all files and writing each line to the eval file with probability p and to the train file with probability 1-p.
  - For each article (i.e. line in the file), draw a random value using the `random.random()` function. If it is below p, write the line to the eval file; otherwise write it to the train file.

Now follow the tutorial provided by the Hugging Face organization and the implementation found here: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
or adapt the following implementation: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b


- **Important update:** as this is an open-source world, implementations keep changing, and the example is not fully up to date with the newest changes to the `transformers` package API. The `run_language_modeling.py` training script and the transformers 2.6 release are out of sync with each other. To work around this issue I have put in a cell implementing a somewhat *crazy* hack that slightly alters the script.


In [0]:
# Download language modelling corpus from the wikimedia.org 
data_link = 'https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2' # define link

# download data
! wget {data_link}

--2020-03-26 14:16:00--  https://dumps.wikimedia.org/dawiki/20200101/dawiki-20200101-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325201125 (310M) [application/octet-stream]
Saving to: ‘dawiki-20200101-pages-articles.xml.bz2’


2020-03-26 14:17:05 (4.77 MB/s) - ‘dawiki-20200101-pages-articles.xml.bz2’ saved [325201125/325201125]



In [0]:
# unzip data
filename = data_link.split('/')[-1]
! bzip2 -d {filename}

bzip2: Output file dawiki-20200101-pages-articles.xml already exists.


In [0]:
# Inspect data
unzipped_file = filename.split('.bz2')[0]
# inspect file
! head -400 {unzipped_file}

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="da">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>dawiki</dbname>
    <base>https://da.wikipedia.org/wiki/Forside</base>
    <generator>MediaWiki 1.35.0-wmf.11</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Speciel</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Diskussion</namespace>
      <namespace key="2" case="first-letter">Bruger</namespace>
      <namespace key="3" case="first-letter">Brugerdiskussion</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia-diskussion</

In [0]:
# Download external package for extracting plain text from wiki xml.
ext_pack = 'https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py'
import requests
with open('WikiExtractor.py','wb') as f:
  f.write(requests.get(ext_pack).content)
import os 
if not os.path.isdir('da_files'):
  os.mkdir('da_files')
! python WikiExtractor.py --output da_files --bytes 10000000000 --quiet {unzipped_file}

mkdir: cannot create directory ‘da_files’: File exists


In [0]:
# inspect files generated
! ls -sh da_files/AA

total 366M
366M wiki_00


In [0]:
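# Construct train and eval datasets by iterating through the large file line by line.
import os
import random

os.makedirs('data', exist_ok=True)
p = 0.01  # probability that a line ends up in the eval file
# context managers make sure all three files are closed, even if an error occurs
with open('da_files/AA/wiki_00', 'r') as infile, \
     open('data/train.txt', 'w') as tfile, \
     open('data/eval.txt', 'w') as efile:
  for line in infile:
    if line.strip() == '':
      continue  # skip empty lines
    # `line` keeps its trailing newline, so the extra '\n' leaves a blank line between entries
    if random.random() > p:
      tfile.write(line + '\n')
    else:
      efile.write(line + '\n')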

In [0]:
# Appropriation of the Transformers Tutorial on training a language model from scratch.

In [0]:
# install the tokenizers library developed by the huggingface group.
! pip install tokenizers

Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/73/de/ec55e2d5a8720557b25100dd7dd4a63108a44b6b303978ce2587666931cf/tokenizers-0.6.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.6.0


In [0]:
# set path to full dataset
path2data = 'da_files/AA/wiki_00'
import os
os.path.isfile(path2data)

True

In [0]:
#### Train tokenizers.
# Initialize a tokenizer
# Save files to disk
# test tokenizer

In [0]:
# Define the config file for the run_language_modeling.py script

In [0]:
# download the training script
import requests
with open('run_language_modeling.py','w') as f:
  f.write(requests.get('https://raw.githubusercontent.com/huggingface/transformers/master/examples/run_language_modeling.py').text)
# install transformers package.
! pip install transformers

In [0]:
##### UGLY HACK do not touch or look##### 
s = open('run_language_modeling.py').read()
s = s.replace('''from transformers import (\n    CONFIG_MAPPING,\n    MODEL_WITH_LM_HEAD_MAPPING,\n    WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelWithLMHead,\n    AutoTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    get_linear_schedule_with_warmup,\n)\n'''
,'''from transformers import (\n   WEIGHTS_NAME,\n    AdamW,\n    AutoConfig,\n    AutoModelWithLMHead,\n    AutoTokenizer,\n    PreTrainedModel,\n    PreTrainedTokenizer,\n    get_linear_schedule_with_warmup,\n)\n
from collections import OrderedDict

from transformers.configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from transformers.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from transformers.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from transformers.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from transformers.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from transformers.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
from transformers.configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
from transformers.configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
from transformers.configuration_openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig
from transformers.configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig
from transformers.configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
from transformers.configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig
from transformers.configuration_utils import PretrainedConfig
from transformers.configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig
from transformers.configuration_xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig
from transformers.configuration_xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig
CONFIG_MAPPING = OrderedDict(
    [
        ("t5", T5Config,),
        ("distilbert", DistilBertConfig,),
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
        ("bart", BartConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
        ("openai-gpt", OpenAIGPTConfig,),
        ("gpt2", GPT2Config,),
        ("transfo-xl", TransfoXLConfig,),
        ("xlnet", XLNetConfig,),
        ("xlm", XLMConfig,),
        ("ctrl", CTRLConfig,),
    ]
)
from transformers.modeling_auto import (
        MODEL_MAPPING,
        MODEL_FOR_PRETRAINING_MAPPING,
        MODEL_FOR_QUESTION_ANSWERING_MAPPING,
        MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING,
        MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING,
        MODEL_WITH_LM_HEAD_MAPPING,
    )''').replace('model = AutoModelWithLMHead(config=config)',
                  '''model = AutoModelWithLMHead.from_config(config=config)'''
        
    )

with open('run_language_modeling2.py','w') as f:
  f.write(s)

In [0]:
# define the command to run the slightly altered script:
# run_language_modeling2.py
# Start training.