Text Processing and Machine Learning
====================================

- pre-processing and tokenization (splitting text into words)
- n-grams, vectorization and word embeddings
- train and evaluate a text classifier
- a short look into [Hugging Face's transformers library](https://huggingface.co/transformers/)


## Natural Language Processing

[Natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) is about programming computers to process and analyze natural language data (text and speech).

During the text classification training we touch only some aspects of NLP, namely

- tokenization, i.e. splitting a text into words (aka tokens)
- the representation of words in a vector space (word embeddings)

NLP modules for Python:

- [spaCy](https://spacy.io/) or [spaCy on pypi](https://pypi.org/project/spacy/)
- [NLTK](https://www.nltk.org/) or [NLTK on pypi](https://pypi.org/project/nltk/)


## Machine Learning

The field of machine learning is too broad to be introduced here. Please see [Google's machine learning crash course](https://developers.google.com/machine-learning/crash-course/ml-intro).

## fastText

[fastText](https://fasttext.cc/) is a software library for text
classification and word representation learning. See the fastText
tutorials for

- [text classification](https://fasttext.cc/docs/en/supervised-tutorial.html)
- [word representation learning](https://fasttext.cc/docs/en/unsupervised-tutorial.html)

We will now follow the [fastText text
classification](https://fasttext.cc/docs/en/supervised-tutorial.html)
tutorial (cf. documentation of the [Python module
"fasttext"](https://pypi.org/project/fasttext/)) to train and apply
a text classifier.


The fastText tutorial uses the StackExchange cooking data set. We will use the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview) data set. In order to download the data set, you need to register at [Kaggle.com](https://www.kaggle.com/).

After the data set is downloaded and unpacked into the folder `data/kaggle-jigsaw-toxic`, you should see the three files `train.csv`, `test.csv` and `test_labels.csv` in that folder.

In [1]:
import pandas as pd

df_train = pd.read_csv('data/kaggle-jigsaw-toxic/train.csv')

# df_train.head()

In [2]:
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

df_train[labels].mean()

toxic            0.095844
severe_toxic     0.009996
obscene          0.052948
threat           0.002996
insult           0.049364
identity_hate    0.008805
dtype: float64

Only about 10% of the comments are toxic. What does this mean for building a classifier?
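
One consequence of this imbalance is that plain accuracy is a weak yardstick. The sketch below uses only the `toxic` rate from the output above and shows how a hypothetical baseline that marks every comment as non-toxic would score; the function name is illustrative and not part of the notebook.

```python
def always_negative_baseline(positive_rate):
    """Accuracy and recall of a classifier that never predicts the positive class."""
    accuracy = 1.0 - positive_rate   # every negative example is classified correctly
    recall = 0.0                     # no positive example is ever found
    return accuracy, recall

toxic_rate = 0.095844  # mean of the `toxic` column reported above
accuracy, recall = always_negative_baseline(toxic_rate)
print(f"accuracy={accuracy:.3f} recall={recall:.1f}")
```

Such a baseline is right about 90% of the time while finding no toxic comment at all, which is why the validation section looks at precision and recall per label.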

In [3]:
# tokenize the comments
import string

from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer(reduce_len=True)

def tokenize(text):
    # split into tokens, drop pure punctuation tokens, lowercase the rest
    words = tweet_tokenizer.tokenize(text)
    words = (w.lower() for w in words if w and w not in string.punctuation)
    return ' '.join(words)

tokenize("You're a hero! http://example.com/index.html")

"you're a hero http://example.com/index.html"

In [4]:
# write data to fastText train file

train_file = 'data/kaggle-jigsaw-toxic/train.txt'

def write_line_fasttext(fp, row):
    # one line per comment: all labels first, then the tokenized text
    row_labels = ['__label__' + label for label in labels if row[label] == 1]
    if not row_labels:
        row_labels = ['__label__none']
    fp.write(' '.join(row_labels) + ' ' + tokenize(row['comment_text']))
    fp.write('\n')

# write explicitly as UTF-8, independent of the platform default encoding
with open(train_file, 'w', encoding='utf-8') as fp:
    df_train.apply(lambda row: write_line_fasttext(fp, row), axis=1)
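
Each line of `train.txt` now starts with one or more `__label__`-prefixed labels, followed by the tokenized comment. As a quick sanity check, a line can be split back into its parts. `parse_fasttext_line` is a hypothetical helper written for this illustration, not a fastText function, and it assumes that the labels come first, as `write_line_fasttext` writes them.

```python
LABEL_PREFIX = '__label__'

def parse_fasttext_line(line):
    """Split a fastText training line into (labels, text)."""
    tokens = line.split()
    labels = []
    # leading tokens carrying the prefix are labels, the rest is the text
    while tokens and tokens[0].startswith(LABEL_PREFIX):
        labels.append(tokens.pop(0)[len(LABEL_PREFIX):])
    return labels, ' '.join(tokens)

print(parse_fasttext_line("__label__toxic __label__insult you are a fool"))
print(parse_fasttext_line("__label__none you're a hero"))
```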

In [None]:
!pip install fasttext

In [5]:
# train a model

import fasttext

model = fasttext.train_supervised(input=train_file, wordNgrams=2, minCount=2)
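
The option `wordNgrams=2` makes fastText use word bigrams in addition to single words as features, so that some word order is captured. The following sketch only illustrates what the word n-grams of a tokenized comment look like; fastText builds and hashes them internally, and the helper name is made up for this example.

```python
def word_ngrams(text, n):
    """Return all word n-grams of length 1..n for a whitespace-tokenized text."""
    words = text.split()
    ngrams = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            ngrams.append(' '.join(words[start:start + size]))
    return ngrams

# unigrams first, then bigrams
print(word_ngrams("you're a hero", 2))
```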

In [6]:
model.predict(tokenize("This is a well-written article."))
# model.predict(tokenize("Fuck you!"), k=5)

(('__label__none',), array([0.99993789]))
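
By default `predict` returns only the single most probable label. According to the fastText documentation, `k=-1` asks for all labels and `threshold` drops those with a lower probability. The wrapper below is a sketch under that assumption; `predict_labels` and the threshold of 0.5 are choices made for this example.

```python
def predict_labels(model, text, threshold=0.5):
    """Return {label: probability} for all labels that pass the threshold."""
    # k=-1 requests every label; the filtering by `threshold` is done by fastText
    labels, probabilities = model.predict(text, k=-1, threshold=threshold)
    return {
        label.replace('__label__', '', 1): float(prob)
        for label, prob in zip(labels, probabilities)
    }

# usage with the model trained above:
# predict_labels(model, tokenize("This is a well-written article."))
```

Because the model above was trained with the default softmax loss, the probabilities sum to one, so at most one label can exceed 0.5. The fastText tutorial describes `loss='ova'` (one-vs-all) as an alternative for multi-label data; it was not tried here.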

In [7]:
# looking into the underlying word embeddings

model.get_nearest_neighbors('idiot', k=20)

[(0.9997914433479309, 'stupid'),
 (0.9996288418769836, 'moron'),
 (0.9995864033699036, 'jerk'),
 (0.9993796348571777, 'arrogant'),
 (0.9993292093276978, 'ignorant'),
 (0.999278724193573, 'stupidity'),
 (0.9992066025733948, 'coward'),
 (0.9992029070854187, 'disgusting'),
 (0.9991973638534546, 'idiotic'),
 (0.9990672469139099, 'pathetic'),
 (0.9990224242210388, 'fool'),
 (0.9989080429077148, 'morons'),
 (0.9989030957221985, 'losers'),
 (0.9988322854042053, 'hell'),
 (0.9988279342651367, 'jackass'),
 (0.9987922310829163, 'fascist'),
 (0.9987281560897827, 'idiots'),
 (0.9987263679504395, 'dirty'),
 (0.9987045526504517, 'sucked'),
 (0.998673141002655, 'bloody')]
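
The scores in this output are similarities between word vectors; fastText's nearest-neighbor queries are generally described as using cosine similarity. As an illustration with made-up two-dimensional vectors (real fastText vectors have many more dimensions), cosine similarity can be computed as follows.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for this example
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

With the trained model, the vector of a word can be obtained via `model.get_word_vector('idiot')`.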

In [8]:
# save the model
model_file = 'data/kaggle-jigsaw-toxic/model.bin'

model.save_model(model_file)

In [9]:
df_test = pd.read_csv('data/kaggle-jigsaw-toxic/test.csv')
df_test_labels = pd.read_csv('data/kaggle-jigsaw-toxic/test_labels.csv')

# join both tables
df_test = df_test.merge(df_test_labels, on='id')

# skip rows not labelled / not used
df_test = df_test[df_test['toxic'] != -1]

test_file = 'data/kaggle-jigsaw-toxic/test.txt'

# write test set for fastText
with open(test_file, 'w', encoding='utf-8') as fp:
    df_test.apply(lambda row: write_line_fasttext(fp, row), axis=1)

### Model Validation

See also: [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)

In [10]:
model.test(test_file)

(63978, 0.9303666885491888, 0.8240416430163499)
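
`model.test` returns the number of examples, followed by precision and recall at one predicted label per example (the command-line output in cell 12 calls them P@1 and R@1). The sketch below shows how such values can be computed, assuming the usual definition: correct predictions divided by the number of predictions for precision, and by the number of gold labels for recall. The toy data is invented.

```python
def precision_recall_at_k(gold, predicted):
    """gold, predicted: lists of label sets, one pair per example."""
    n_correct = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return n_correct / n_predicted, n_correct / n_gold

# three toy examples with one predicted label each (k=1)
gold = [{'none'}, {'toxic', 'insult'}, {'toxic'}]
predicted = [{'none'}, {'toxic'}, {'none'}]
precision, recall = precision_recall_at_k(gold, predicted)
print(precision, recall)
```

Under this definition, recall falls below precision as soon as some examples carry more than one gold label, since a single prediction cannot cover them all.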

In [11]:
res_per_label = model.test_label(test_file)

for label in res_per_label.items():
    print(label)

('__label__threat', {'precision': nan, 'recall': 0.0, 'f1score': 0.0})
('__label__identity_hate', {'precision': nan, 'recall': 0.0, 'f1score': 0.0})
('__label__severe_toxic', {'precision': 0.275, 'recall': 0.05994550408719346, 'f1score': 0.09843400447427293})
('__label__insult', {'precision': 0.7333333333333333, 'recall': 0.0032098044937262913, 'f1score': 0.006391632771644393})
('__label__obscene', {'precision': 0.9406952965235174, 'recall': 0.12462747222974803, 'f1score': 0.22009569377990432})
('__label__toxic', {'precision': 0.5887384176764077, 'recall': 0.6781609195402298, 'f1score': 0.6302937809996184})
('__label__none', {'precision': 0.9737668280742829, 'recall': 0.950896336710834, 'f1score': 0.9621956990378043})
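
The F1 score is the harmonic mean of precision and recall. Precision is undefined (`nan`) for `threat` and `identity_hate` because the model never predicted these labels on the test set, so the denominator of precision is zero. The sketch below recomputes F1 for the `toxic` label from the values above. Treating an undefined precision as F1 = 0 mirrors the output shown, but the helper itself is only an illustration.

```python
import math

def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if it is undefined."""
    if math.isnan(precision) or precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# values reported for __label__toxic above
print(f1_score(0.5887384176764077, 0.6781609195402298))
# a label that was never predicted: precision is nan, recall is 0.0
print(f1_score(float('nan'), 0.0))
```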


In [12]:
# in case the fastText command-line tool is installed: it has a nice output formatter
!fasttext test-label \
   data/kaggle-jigsaw-toxic/model.bin \
   data/kaggle-jigsaw-toxic/test.txt

F1-Score : 0.962196  Precision : 0.973767  Recall : 0.950896   __label__none
F1-Score : 0.630294  Precision : 0.588738  Recall : 0.678161   __label__toxic
F1-Score : 0.220096  Precision : 0.940695  Recall : 0.124627   __label__obscene
F1-Score : 0.006392  Precision : 0.733333  Recall : 0.003210   __label__insult
F1-Score : 0.098434  Precision : 0.275000  Recall : 0.059946   __label__severe_toxic
F1-Score : 0.000000  Precision : --------  Recall : 0.000000   __label__identity_hate
F1-Score : 0.000000  Precision : --------  Recall : 0.000000   __label__threat
N	63978
P@1	0.930
R@1	0.824


## Transformers

- https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
- [Hugging Face's transformers library](https://huggingface.co/transformers/): a unified interface to various transformer language models, which it also distributes
  - see the [Hugging Face course](https://huggingface.co/course)

In [14]:
!pip install transformers
!pip install tensorflow
!pip install "transformers[sentencepiece]"

In [15]:
from transformers import pipeline

p = pipeline('fill-mask', model='bert-base-german-cased')

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
for s in p("Er arbeitet als [MASK]."): print(s)

{'sequence': 'Er arbeitet als Rechtsanwalt.', 'score': 0.09919334203004837, 'token': 6143, 'token_str': 'Rechtsanwalt'}
{'sequence': 'Er arbeitet als Trainer.', 'score': 0.07836302369832993, 'token': 3674, 'token_str': 'Trainer'}
{'sequence': 'Er arbeitet als Journalist.', 'score': 0.0628521665930748, 'token': 10486, 'token_str': 'Journalist'}
{'sequence': 'Er arbeitet als Anwalt.', 'score': 0.05725342780351639, 'token': 6938, 'token_str': 'Anwalt'}
{'sequence': 'Er arbeitet als Schauspieler.', 'score': 0.05046413466334343, 'token': 5607, 'token_str': 'Schauspieler'}


In [17]:
pipeline_fill_mask = pipeline('fill-mask', model='bert-base-german-cased')

def fill_mask(cloze):
    global pipeline_fill_mask
    for s in pipeline_fill_mask(cloze):
        print('%-20s\t%.5f' % (s['token_str'], s['score']))

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [18]:
fill_mask("Er arbeitet als [MASK] in einer Klinik.")

Arzt                	0.61843
Angestellter        	0.04225
Koch                	0.03064
Assistent           	0.02001
Mediziner           	0.01900


In [19]:
fill_mask("Er arbeitet als [MASK] in einer Lungenklinik.")

Arzt                	0.69560
Angestellter        	0.03423
Chemiker            	0.02711
Facharzt            	0.02113
Mediziner           	0.02024


In [20]:
fill_mask("Er arbeitet als [MASK] bei BMW.")

Ingenieur           	0.18871
Berater             	0.17160
Manager             	0.15090
Geschäftsführer     	0.07775
Trainer             	0.04951


In [21]:
fill_mask("Er arbeitet als [MASK] an der Universität Konstanz.")

Professor           	0.74687
Dozent              	0.11445
Hochschullehrer     	0.08565
Wissenschaftler     	0.00667
Assistent           	0.00427


In [22]:
fill_mask("Sie arbeitet als [MASK] an der Universität Konstanz.")

Professor           	0.52318
Lehrerin            	0.09859
Dozent              	0.08542
Professur           	0.04144
Richterin           	0.02292


In [23]:
fill_mask("Sie ist wirklich [MASK].")

schön               	0.11005
jung                	0.06098
glücklich           	0.05704
toll                	0.05053
gut                 	0.03495


In [24]:
fill_mask("Er ist wirklich [MASK].")

gut                 	0.05452
glücklich           	0.05183
da                  	0.03765
jung                	0.03233
tot                 	0.03229
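
The prompt pairs above differ only in the pronoun, which makes it easy to see how the pronoun shifts the predictions. The following hypothetical helper places two result lists side by side; it only assumes that the pipeline returns a list of dicts with the keys `token_str` and `score`, as seen in the output of cell 16.

```python
def compare_fill_mask(fill, cloze_a, cloze_b):
    """Print the predictions for two cloze sentences next to each other.

    `fill` is any callable returning a list of dicts with the keys
    'token_str' and 'score', e.g. `pipeline_fill_mask`.
    """
    rows = list(zip(fill(cloze_a), fill(cloze_b)))
    for a, b in rows:
        print('%-20s %.5f    %-20s %.5f'
              % (a['token_str'], a['score'], b['token_str'], b['score']))
    return rows

# usage with the pipeline defined above:
# compare_fill_mask(pipeline_fill_mask,
#                   "Er arbeitet als [MASK] an der Universität Konstanz.",
#                   "Sie arbeitet als [MASK] an der Universität Konstanz.")
```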


In [25]:
help(pipeline)

Help on function pipeline in module transformers.pipelines:

pipeline(task: str, model: Optional = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, framework: Optional[str] = None, revision: Optional[str] = None, use_fast: bool = True, use_auth_token: Union[str, bool, NoneType] = None, model_kwargs: Dict[str, Any] = {'use_auth_token': None}, **kwargs) -> transformers.pipelines.base.Pipeline
    Utility factory method to build a :class:`~transformers.Pipeline`.
    
    Pipelines are made of:
    
        - A :doc:`tokenizer <tokenizer>` in charge of mapping raw textual input to token.
        - A :doc:`model <model>` to make predictions from the inputs.
        - Some (optional) post processing for enhancing model's output.
    
    Args:
        task (:obj:`str`)

In [26]:
p = pipeline('sentiment-analysis')

p("I'm happy.")

[{'label': 'POSITIVE', 'score': 0.9998724460601807}]

In [27]:
p("I'm sad.")

[{'label': 'NEGATIVE', 'score': 0.9994174242019653}]

In [28]:
p("I'm not happy.")

[{'label': 'NEGATIVE', 'score': 0.9998021125793457}]

In [31]:
import transformers

p = pipeline('ner', aggregation_strategy=transformers.pipelines.AggregationStrategy.SIMPLE)

p("""We would like to belatedly welcome Ulrich Glassmann of the Europa-Universität
  Flensburg (#EUF), who is currently a guest at the Cluster. Ulrich has just decided
  to extend his stay until the end of June, welcome news indeed!""")

[{'entity_group': 'PER',
  'score': 0.9996402,
  'word': 'Ulrich Glassmann',
  'start': 35,
  'end': 51},
 {'entity_group': 'ORG',
  'score': 0.8913957,
  'word': 'Europa - Universität Flensburg',
  'start': 59,
  'end': 89},
 {'entity_group': 'ORG',
  'score': 0.988505,
  'word': 'EUF',
  'start': 92,
  'end': 95},
 {'entity_group': 'ORG',
  'score': 0.6957305,
  'word': 'Cluster',
  'start': 130,
  'end': 137},
 {'entity_group': 'PER',
  'score': 0.9996954,
  'word': 'Ulrich',
  'start': 139,
  'end': 145}]
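
The `start` and `end` values are character offsets into the input string, so the original surface form of each entity can be recovered by slicing. Note that `word` is rebuilt from tokens and may differ from the source text (`'Europa - Universität Flensburg'`). The snippet copies the input and three of the entities from the output above.

```python
text = ("We would like to belatedly welcome Ulrich Glassmann of the Europa-Universität\n"
        "  Flensburg (#EUF), who is currently a guest at the Cluster. Ulrich has just decided\n"
        "  to extend his stay until the end of June, welcome news indeed!")

# subset of the pipeline output shown above
entities = [
    {'entity_group': 'PER', 'word': 'Ulrich Glassmann', 'start': 35, 'end': 51},
    {'entity_group': 'ORG', 'word': 'Europa - Universität Flensburg', 'start': 59, 'end': 89},
    {'entity_group': 'ORG', 'word': 'EUF', 'start': 92, 'end': 95},
]

for entity in entities:
    surface = text[entity['start']:entity['end']]
    # collapse the line break and indentation inside the span
    print(entity['entity_group'], ' '.join(surface.split()))
```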

In [32]:
p = pipeline('translation', model='facebook/wmt19-de-en')

p("""Nicht nur unterschiedliche Berechnungen bereiten Kopfzerbrechen.
  Bei der Eigenwahrnehmung zeigt sich: In Deutschland gibt es massive
  Missverständnisse über Ausmaß und Art von Ungleichheit.""")

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


[{'translation_text': 'It is not only different calculations that cause headaches. Self-perception shows that in Germany there are massive misunderstandings about the extent and type of inequality.'}]

In [33]:
p = pipeline('translation', model='facebook/wmt19-en-de')

p("""We would like to belatedly welcome Ulrich Glassmann of the Europa-Universität
  Flensburg (#EUF), who is currently a guest at the Cluster. Ulrich has just decided
  to extend his stay until the end of June, welcome news indeed!""")

[{'translation_text': 'Mit Verspätung begrüßen wir Ulrich Glassmann von der Europa-Universität Flensburg (# EUF), der derzeit zu Gast im Cluster ist. Er hat sich gerade entschieden, seinen Aufenthalt bis Ende Juni zu verlängern, eine gute Nachricht!'}]

In [34]:
p = pipeline('text-generation')

p("In Germany there are massive misunderstandings about the extent and type of inequality.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In Germany there are massive misunderstandings about the extent and type of inequality. For instance, the fact that the "social mobility" issue is widely used in Germany and abroad suggests that some in Germany feel they have reached greater levels of economic equality without having'}]

In [36]:
p("some in Germany feel they have reached greater levels of economic equality without having")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'some in Germany feel they have reached greater levels of economic equality without having to depend on their labor. There are the small cities in particular without one of the very high productivity countries that have been the primary labor of the industrialised world.\n\nThey'}]

Transformers can be "fine-tuned" to a specific task, see [training of transformers](https://huggingface.co/transformers/training.html). Fine-tuning adds a task-specific head to a transformer that was pre-trained on large amounts of text (usually 100 GB or even terabytes). This saves training resources and helps when task-specific training data is scarce: manually labelling training data is expensive, which naturally limits its amount. Even if the vocabulary of the labelled data is limited, there is a good chance that the pre-trained transformer has already seen the otherwise unknown words in the huge corpus used for pre-training.