<a href="https://colab.research.google.com/github/simon-clematide/casdmit-fs21/blob/master/notebooks/zora_dewey_fasttext.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dewey Classification of ZORA Material with fastText
This notebook demonstrates how easily a simple classification model can be trained with fastText.
We work with the fasttext Python library.
For efficiency, we use a smaller training dataset here.

## Installing the Python fasttext and spaCy packages
The more recent version on GitHub fixes a [bug](https://stackoverflow.com/questions/61787119/fasttext-0-9-2-why-is-recall-nan) in the label-specific evaluation function.

In [65]:
# ! pip install fasttext # quick to install, but has a bug in test_label()
! pip install git+https://github.com/facebookresearch/fastText.git  # takes longer because it has to compile

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/facebookresearch/fastText.git
  Cloning https://github.com/facebookresearch/fastText.git to /tmp/pip-req-build-w2euytuk
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/fastText.git /tmp/pip-req-build-w2euytuk
  Resolved https://github.com/facebookresearch/fastText.git to commit 0622aad8571861d290b237e83e04e9a07a28839d
  Preparing metadata (setup.py) ... done


In [66]:
! pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [69]:
! python3 -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [71]:
import logging
import spacy
nlp = spacy.load('en_core_web_sm')

# we need neither syntactic parsing nor named entity recognition
nlp.disable_pipes("parser", "ner")

['parser', 'ner']

# Dataset: Randomly Selected Publications

In [72]:
! curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/zora-eng-dewey.fasttext.tsv -o zora-eng-dewey.fasttext.tsv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.1M  100 15.1M    0     0  7273k      0  0:00:02  0:00:02 --:--:-- 7277k


### Dataset format
 - Each tab-separated line has 2 columns
 - Column 1: [Dewey labels](https://en.wikipedia.org/wiki/List_of_Dewey_Decimal_classes)
 - Column 2: title and abstract, untokenized
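
As a minimal sketch, one line of this format could be read in Python as follows. The helper name `parse_fasttext_line` and the sample line are invented for illustration. The sketch assumes that multiple labels in column 1 are separated by spaces.

```python
def parse_fasttext_line(line):
    """Split one TSV line into a list of Dewey labels and the raw text."""
    # Split only at the first tab so that tabs inside the text cannot break parsing
    label_field, text = line.rstrip("\n").split("\t", 1)
    # Drop the fastText label prefix to get the bare Dewey class
    labels = [lab.replace("__label__", "") for lab in label_field.split(" ") if lab]
    return labels, text


# Invented sample line, not taken from the dataset
sample = "__label__570 __label__610\tAn invented title An invented abstract."
labels, text = parse_fasttext_line(sample)
print(labels)
print(text)
```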

In [73]:
! head -n 10 zora-eng-dewey.fasttext.tsv

__label__300	Orangutan activity budgets and diet: a comparison between species, populations and habitats The chapter examines differences in the activity budgets of wild orangutans (Pongo spp.) within and between a large number of study sites in Sumatra and Borneo. The authors of the chapter found that each orangutan population appeared to follow one of two distinct foraging strategies: either (1) ‘sit and wait’, in which orangutans aim to minimize their energy expenditure by spending long periods of time resting and relatively short periods feeding and travelling; or (2) ‘search and find’ in which orangutans aim to maximize their energy intake by resting little and mainly feeding or moving in search of food. Orangutans adopt the first strategy in mixed-dipterocarp forests characterized by mast-fruiting events and irregular fruit availability; and adopt the second strategy in swamp forests with a regular supply of fruit, or in dryland forests with high strangling-fig density. The chapt

### Dataset statistics

In [74]:
! wc -l zora-eng-dewey.fasttext.tsv

10267 zora-eng-dewey.fasttext.tsv


In [None]:
!  cut -f 1 < zora-eng-dewey.fasttext.tsv | sort | uniq -c | sort -rn 

In [76]:
def lemmatize_tsv(inputfile, outputfile, spacy_nlp, limit=999999):
    """Write a tokenized and lemmatized version of the data set (at most `limit` records)"""

    with open(outputfile, "w", encoding="utf-8") as output:
        with open(inputfile, "r", encoding="utf-8") as infile:
            for i, line in enumerate(infile):
                if i >= limit:
                    break
                labels, text = line.strip().split("\t", 1)
                doc = spacy_nlp(text)  # use the pipeline passed in, not the global `nlp`
                lemmas = ' '.join(token.lemma_ for token in doc).lower()
                print(labels, lemmas, sep="\t", file=output)
                if i % 100 == 0:
                    print(f"Processed {i} records")


In [78]:
# Download precomputed lemmatized data
! test -e zora-eng-dewey.lemmatized.fasttext.tsv || curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/zora-eng-dewey.lemmatized.fasttext.tsv -o zora-eng-dewey.lemmatized.fasttext.tsv

In [79]:
lemmatize_tsv("zora-eng-dewey.fasttext.tsv","zora-eng-dewey-10.lemmatized.fasttext.tsv",nlp,limit=10)

Processed 0 records


In [None]:
! head zora-eng-dewey-10.lemmatized.fasttext.tsv

In [None]:
lemmatize_tsv("zora-eng-dewey.fasttext.tsv","zora-eng-dewey.lemmatized.fasttext.tsv",nlp)

In [None]:
! head zora-eng-dewey.lemmatized.fasttext.tsv

In [None]:
def multilabel2singlelabel(inputfile, outputfile):
    """Reduce labels to the first label mentioned"""
    with open(outputfile, "w", encoding="utf-8") as output:
        with open(inputfile, "r", encoding="utf-8") as infile:
            for line in infile:
                labels, text = line.strip().split("\t", 1)
                label = labels.split(" ")[0]  # labels are separated by spaces
                print(label, text, sep="\t", file=output)


In [None]:
multilabel2singlelabel("zora-eng-dewey.lemmatized.fasttext.tsv","zora-eng-dewey.lemmatized.fasttext.single.tsv")

In [None]:
! head zora-eng-dewey.lemmatized.fasttext.single.tsv

## Splitting the data into training and test sets
Create the training and test data (the original data is in random order)

In [None]:
! head -n 9000 < zora-eng-dewey.lemmatized.fasttext.tsv > zora-eng-dewey.lemmatized.fasttext.train.tsv
! tail -n 1000 < zora-eng-dewey.lemmatized.fasttext.tsv > zora-eng-dewey.lemmatized.fasttext.test.tsv
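
The same head/tail split can be sketched in plain Python. The helper `split_head_tail` and the placeholder lines are hypothetical. The line count of 10267 is the `wc -l` figure reported for the original file, and the sketch assumes the lemmatized file has the same number of lines.

```python
def split_head_tail(lines, n_train, n_test):
    """Return the first n_train lines and the last n_test lines."""
    return lines[:n_train], lines[-n_test:]


# Placeholder records standing in for the real TSV lines
lines = [f"__label__000\tdoc {i}" for i in range(10267)]
train, test = split_head_tail(lines, 9000, 1000)

# The lines between head and tail end up in neither set
unused = len(lines) - len(train) - len(test)
print(len(train), len(test), unused)
```

Under that assumption the two sets do not overlap, and the lines between them are simply not used.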

In [None]:
# optional: create single-label data (overwrites the train/test files created above)
! head -n 9000 < zora-eng-dewey.lemmatized.fasttext.single.tsv > zora-eng-dewey.lemmatized.fasttext.train.tsv
! tail -n 1000 < zora-eng-dewey.lemmatized.fasttext.single.tsv > zora-eng-dewey.lemmatized.fasttext.test.tsv

In [None]:
! echo TRAINING DATA STATISTICS
! cut -f 1 < zora-eng-dewey.lemmatized.fasttext.train.tsv | sort | uniq -c | sort -rn |head
! echo TEST DATA STATISTICS
! cut -f 1 < zora-eng-dewey.lemmatized.fasttext.test.tsv | sort | uniq -c | sort -rn |head

# Training a model with the Python package
 - For the documentation, see https://fasttext.cc/docs/en/python-module.html

In [80]:
import fasttext

[Word embeddings](https://fasttext.cc/docs/en/pretrained-vectors.html) trained on Wikipedia, which I reduced to 50 dimensions to save memory (the text format is required for supervised classification)

In [81]:
! test -e wiki.en.50.vec || curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/wiki.en.50.vec -o wiki.en.50.vec

In [82]:
# takes about 40 seconds with these settings
model = fasttext.train_supervised(
    input='zora-eng-dewey.lemmatized.fasttext.train.tsv', 
    pretrainedVectors="wiki.en.50.vec", # pretrained word embeddings
    epoch=10,  # number of passes over the training data
    minn=5,    # minimum subword length in characters
    maxn=5,    # maximum subword length in characters
    dim=50,    # dimensionality of the word and subword vectors (must match pretrainedVectors)
    lr=1,      # learning rate: how strongly the vectors are updated after an error
    )

## Inspecting the trained model

Which labels/classes does the model know?

In [None]:
print(model.labels)

Classify a string and get the probabilities of the k most likely Dewey classes:

In [None]:
result = model.predict("interpersonal problems associate with multidimensional personality questionnaire traits in woman ",  
              k=5  # return the 5 best classes
              )
for label,prob in zip(*result):
    print(label, round(prob,3))

Systematically testing the trained model on the test data:
 - k: maximum number of suggested labels
 - threshold: minimum probability a label needs in order to count as predicted
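
The interplay of `k` and `threshold` can be illustrated with a small pure-Python sketch. The function `select_labels` and the scores are invented. It mirrors the idea only and is not fastText's internal implementation.

```python
def select_labels(scored_labels, k=3, threshold=0.25):
    """Keep at most k labels whose probability reaches the threshold."""
    ranked = sorted(scored_labels, key=lambda pair: pair[1], reverse=True)
    return [(label, prob) for label, prob in ranked[:k] if prob >= threshold]


# Invented probabilities for two documents
confident = [("570", 0.48), ("610", 0.31), ("150", 0.12), ("540", 0.05)]
uncertain = [("820", 0.20), ("700", 0.15)]

print(select_labels(confident))
print(select_labels(uncertain))
```

With a threshold, a document can receive fewer than `k` labels, or none at all if no class is probable enough.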

In [None]:
model.test("zora-eng-dewey.lemmatized.fasttext.test.tsv",k=3,threshold=0.25)

In [None]:
def print_results(N, p, r):
    "Pretty-print performance: N = number of samples, P@k/R@k = precision/recall of the top-k predictions"
    print(f"N\t{N}")
    print(f"P@k\t{p:.2f}")
    print(f"R@k\t{r:.2f}")
    # Note: recall equals accuracy only for single-label data evaluated with k=1
    print(f"Acc\t{r:.2f}")

In [None]:
print_results(*model.test("zora-eng-dewey.lemmatized.fasttext.test.tsv",k=3,threshold=0.25))

Detailed evaluation for each individual label:
 - Precision: share of the predictions of a class that are correct
 - Recall: share of the items of a class that are classified correctly
 - f1score: harmonic mean of precision and recall
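
As a sketch of these three definitions, the metrics can be computed from raw counts of true positives (`tp`), false positives (`fp`) and false negatives (`fn`). The function name and the counts are invented. F1 is written as `2*tp / (2*tp + fp + fn)`, which is algebraically the same as the harmonic mean whenever precision and recall are both defined.

```python
import math


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    predicted = tp + fp   # how often the class was predicted
    gold = tp + fn        # how often the class occurs in the test data
    precision = tp / predicted if predicted else math.nan
    recall = tp / gold if gold else math.nan
    f1 = 2 * tp / (predicted + gold) if predicted + gold else math.nan
    return precision, recall, f1


# Invented counts for one class
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"P {p:.3f} R {r:.3f} F1 {f1:.3f}")

# A class that is never predicted has an undefined (nan) precision
print(precision_recall_f1(tp=0, fp=0, fn=5))
```

A class that never appears in the predictions has no defined precision, which is one way `nan` values can arise in a per-label report.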

In [88]:
def per_label_evaluation(model, test_file, k=3, threshold=0.25):
    data = model.test_label(test_file,k=k, threshold=threshold)
    sorted_data = sorted(data.items(), key=lambda x: x[1]['f1score'], reverse=True)

    for label, perf in sorted_data:
        print(f"{label} F1 {perf['f1score']:.3f} P {perf['precision']:.3f} R {perf['recall']:.3f}")

In [89]:
per_label_evaluation(model,"zora-eng-dewey.lemmatized.fasttext.test.tsv")

__label__390 F1 nan P nan R nan
__label__400 F1 nan P nan R nan
__label__360 F1 nan P nan R nan
__label__430 F1 nan P nan R nan
__label__530 F1 0.921 P 0.911 R 0.932
__label__560 F1 0.875 P 0.875 R 0.875
__label__070 F1 0.857 P 0.857 R 0.857
__label__510 F1 0.807 P 0.719 R 0.920
__label__610 F1 0.803 P 0.762 R 0.848
__label__910 F1 0.780 P 0.821 R 0.744
__label__540 F1 0.773 P 0.810 R 0.739
__label__000 F1 0.767 P 0.719 R 0.821
__label__570 F1 0.727 P 0.656 R 0.815
__label__330 F1 0.719 P 0.641 R 0.820
__label__580 F1 0.615 P 0.533 R 0.727
__label__150 F1 0.562 P 0.491 R 0.659
__label__320 F1 0.533 P 0.400 R 0.800
__label__700 F1 0.500 P 1.000 R 0.333
__label__100 F1 0.500 P 1.000 R 0.333
__label__300 F1 0.476 P 0.455 R 0.500
__label__370 F1 0.476 P 0.833 R 0.333
__label__170 F1 0.467 P 0.412 R 0.538
__label__340 F1 0.444 P 0.667 R 0.333
__label__820 F1 0.364 P 0.667 R 0.250
__label__490 F1 0.286 P 0.250 R 0.333
__label__142 F1 0.093 P 0.167 R 0.065
__label__630 F1 0.000 P nan R 0.000


## Showing predictions and ground truth

In [None]:
!ls -lh

In [90]:
test_data = []
with open("zora-eng-dewey.lemmatized.fasttext.test.tsv", mode="r",encoding="utf-8") as testfile:
    for line in testfile:
        test_data.append(line.strip().split("\t"))
test_data[:3]


[['__label__570',
  'perspective : chain dynamic of unfold and intrinsically disorder protein from nanosecond fluorescence correlation spectroscopy combine with single - molecule fret the dynamic of unfolded protein be important both for the process of protein fold and for the behavior of intrinsically disorder protein . however , method for investigate the global chain dynamic of these structurally diverse system have be limit . a versatile experimental approach be single - molecule spectroscopy in combination with förster resonance energy transfer and nanosecond fluorescence correlation spectroscopy . the concept of polymer physics offer a powerful framework both for interpret the result and for understanding and classify the property of unfold and intrinsically disorder protein . this information on long - range chain dynamic can be complement with spectroscopic technique that probe different length scale and time scale , and integration of these result greatly benefit from recent a

In [91]:
from collections import Counter
confusion_matrix = Counter()

# If given a list of strings, it will return a list of results as usually received for a single line of text.
predictions,probs = model.predict([text for _,text in test_data], k=3, threshold=0.25)

for i,preds in enumerate(predictions):
    labels = " ".join(sorted(preds)).replace('__label__','')
    if not labels:
        labels = '???'
    confusion_matrix[(test_data[i][0].replace('__label__',''),labels)] += 1

# correct predictions
print("CORRECT PREDICTIONS")
for (correct, predicted), count in confusion_matrix.most_common():
    if correct == predicted:
        print("TRUTH",correct, "SYSTEM",predicted, "COUNT",count)

# wrong predictions
print("\n\nWRONG PREDICTIONS")
for (correct, predicted), count in confusion_matrix.most_common():
    if correct != predicted:
        print("TRUTH",correct, "SYSTEM",predicted, "COUNT",count)

CORRECT PREDICTIONS
TRUTH 610 SYSTEM 610 COUNT 288
TRUTH 570 SYSTEM 570 COUNT 155
TRUTH 530 SYSTEM 530 COUNT 40
TRUTH 330 SYSTEM 330 COUNT 38
TRUTH 910 SYSTEM 910 COUNT 28
TRUTH 510 SYSTEM 510 COUNT 23
TRUTH 150 SYSTEM 150 COUNT 22
TRUTH 000 SYSTEM 000 COUNT 20
TRUTH 540 SYSTEM 540 COUNT 17
TRUTH 580 SYSTEM 580 COUNT 7
TRUTH 560 SYSTEM 560 COUNT 6
TRUTH 370 SYSTEM 370 COUNT 5
TRUTH 070 SYSTEM 070 COUNT 5
TRUTH 300 SYSTEM 300 COUNT 4
TRUTH 320 SYSTEM 320 COUNT 3
TRUTH 170 SYSTEM 170 COUNT 3
TRUTH 340 SYSTEM 340 COUNT 2
TRUTH 820 SYSTEM 820 COUNT 2
TRUTH 100 SYSTEM 100 COUNT 1
TRUTH 700 SYSTEM 700 COUNT 1
TRUTH 490 SYSTEM 490 COUNT 1


WRONG PREDICTIONS
TRUTH 610 SYSTEM 570 COUNT 32
TRUTH 570 SYSTEM 610 COUNT 27
TRUTH 570 SYSTEM 570 610 COUNT 26
TRUTH 610 SYSTEM 570 610 COUNT 21
TRUTH 142 SYSTEM 610 COUNT 11
TRUTH 610 SYSTEM 150 610 COUNT 10
TRUTH 142 SYSTEM 570 COUNT 7
TRUTH 610 SYSTEM 150 COUNT 6
TRUTH 820 SYSTEM ??? COUNT 5
TRUTH 150 SYSTEM 610 COUNT 5
TRUTH 910 SYSTEM ??? COUNT 4
TRU

# Improving the model
Ways to improve the model: e.g. more epochs, more dimensions, longer character n-grams, ...

Most important parameters:
```
   epoch N  # The whole training set is used N times during training. Training time grows linearly with N!
   dim N    # Length of the learned vectors for words and character n-grams
   lr 0.N   # Initial learning rate: determines how strongly the vectors are changed when errors occur. The learning rate decreases during training.
   minn N   # Minimum length of the subwords, i.e. character n-grams
   maxn N   # Maximum length of the subwords, i.e. character n-grams (if N=0, no subwords are used, only words)
```
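
To make the subword parameters more concrete, here is a simplified sketch of character n-gram extraction. The function `char_ngrams` is hypothetical. It assumes that the word is wrapped in the boundary markers `<` and `>` before n-grams are cut out, and it ignores details of the real implementation such as hashing n-grams into buckets.

```python
def char_ngrams(word, minn=5, maxn=5):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    if maxn == 0:
        return []  # no subwords, only whole words
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(max(minn, 1), maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("dewey"))             # 5-grams only
print(char_ngrams("dewey", 3, 4)[:4])   # first few 3-grams
print(char_ngrams("dewey", 0, 0))       # subwords switched off
```

Longer n-grams are more specific but rarer. A wider range between minimum and maximum length produces more subword features per word.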

In [None]:
model = fasttext.train_supervised(
    input='zora-eng-dewey.lemmatized.fasttext.train.tsv', 
    pretrainedVectors="wiki.en.50.vec", # pretrained word embeddings, can be omitted
    epoch=20,  # number of passes over the training data
    minn=5,    # minimum subword length in characters
    maxn=5,    # maximum subword length in characters
    dim=50,    # dimensionality of the word and subword vectors (must match pretrainedVectors)
    lr=1,      # learning rate: how strongly the vectors are updated after an error
    )
print_results(*model.test("zora-eng-dewey.lemmatized.fasttext.test.tsv"))

In [None]:
per_label_evaluation(model,'zora-eng-dewey.lemmatized.fasttext.test.tsv',k=3, threshold=0.25)

# Appendix: Embeddings

In [None]:
! test -e wiki.en.50.bin || curl https://files.ifi.uzh.ch/cl/siclemat/lehre/fs23/bibliosuisse/data/wiki.en.50.bin -o wiki.en.50.bin

In [None]:
full_model = fasttext.load_model('wiki.en.50.bin')

In [None]:
full_model.get_nearest_neighbors('disease')

Analogies: A is to B as ? is to C, queried with `model.get_analogies(A, B, C)`
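
The idea behind such analogy queries can be sketched with toy vectors. The code below assumes the query vector is formed as `vec(A) - vec(B) + vec(C)` from length-normalized vectors and that the three query words are excluded from the result. The two-dimensional vectors and the helper names are invented for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    query = [x - y + z for x, y, z in zip(va, vb, vc)]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Invented toy embeddings: first axis roughly "gender", second axis roughly "royalty"
toy = {
    "man": [1.0, 0.2],
    "woman": [-1.0, 0.2],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}
result = analogy(toy, "man", "woman", "queen")
print(result)
```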

In [None]:
full_model.get_analogies('man','woman','queen')

How to store the 400000 most frequent words in a smaller text format that is usable for supervised training.

In [None]:
model=full_model
# Store only the 400,000 most frequent words
max_words = 400000
words = model.words[:max_words]
vectors = [model[word] for word in words]

# Save the subset of words and vectors to a text file
with open("model_subset.txt", "w", encoding="utf-8") as f:
    # Write the header with the actual vocabulary size and vector dimensionality
    f.write(f"{len(words)} {model.get_dimension()}\n")

    # Write the vectors for each word
    for word, vector in zip(words, vectors):
        vector_str = " ".join([f"{x:.6f}" for x in vector])
        f.write(f"{word} {vector_str}\n")
