# 02 Testkorpora_Tagging

## Taggers for POS
In this Jupyter notebook, a test corpus, testkorpus_divers_50.csv, is created that contains various difficulties such as spelling errors, hashtags, @-mentions, and emojis. The file is then tagged with several different models so that their output can be compared and the best-performing model identified. The choice is not justified quantitatively; it rests on the author's own impression of the results.

Different tagsets:
- Penn Treebank tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Universal Dependencies: https://huggingface.co/flair/upos-english

### Which tagger is best suited:
- Spacy
- Stanza
- Flair
- Bert
- Tweebank

In [None]:
## Write text about tagging and its problems
# correct afterwards if errors occur consistently
# also test a transformer model
# observe problems on a small test set (with spelling errors)
# (50 tweets)
# describe the procedure well, but then speak of a personal impression (and do not quantify)

In [None]:
# have SpaCy format the output as UD
# check in CWB what the data should look like (in XML tags)
# send a format example to Stephanie
# prepare for CWB (with metadata)
# send the CWB-indexed corpus to Stephanie
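
The CWB notes above refer to the verticalized text format, in which each token sits on its own line with tab-separated attributes and posts are wrapped in XML-style structural tags. The following is a minimal sketch of that layout, not the export actually used here; the attribute names (`id`, `date`), the column order (word, POS, lemma), and the example post are assumptions and would have to match the declarations given to `cwb-encode`.

```python
from xml.sax.saxutils import escape, quoteattr

def to_vrt(posts):
    """Render posts as verticalized text: one token per line, one <text> region per post."""
    lines = []
    for post in posts:
        # Structural tag carrying the post metadata as XML attributes
        lines.append(f"<text id={quoteattr(str(post['id']))} date={quoteattr(post['date'])}>")
        for word, pos, lemma in post["tokens"]:
            # Escape &, < and > so that tokens cannot be mistaken for tags
            lines.append("\t".join(escape(value) for value in (word, pos, lemma)))
        lines.append("</text>")
    return "\n".join(lines) + "\n"

# Hypothetical post, for illustration only
posts = [{
    "id": 1,
    "date": "2020-05-11",
    "tokens": [("Congress", "PROPN", "Congress"), ("must", "AUX", "must"), ("approve", "VERB", "approve")],
}]
print(to_vrt(posts))
```

Escaping matters for tweets in particular, since a token such as `AT&T` or a stray `<` would otherwise produce malformed input.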

In [None]:
# corpus analysis in CQPweb? More convenient?
# SpaCy? Tagging with it? (based on Universal Dependencies)
# Stanza (German Universal Dependencies treebank GSD, UD, Hamburg Dependency Treebank)
# test different models on random samples
# TweetNLP -> only suitable for speech tagging

#### Test corpus used to test the different taggers

In [None]:
## Test corpus:

In [None]:
# Installation: conda install -c conda-forge pyspellchecker

In [1]:
# to build a corpus that is as diverse as possible and covers all relevant cases
import pandas as pd
import re
from spellchecker import SpellChecker
from IPython.display import display

spell = SpellChecker(language="en")
# read CSV
df = pd.read_csv("tta_final_clean.csv")

# functions for different post types
def has_mention(x): return "@" in str(x)
def has_hashtag(x): return "#" in str(x)
def has_url(x): return re.search(r"http[s]?://", str(x)) is not None
def has_emoji(x): return re.search(r"[\U00010000-\U0010ffff]", str(x)) is not None
def is_long(x): return len(str(x)) > 200   # threshold can be adjusted
TYPO_RE = re.compile(
    r"\b("
    r"teh|recieve|definately|seperat(?:e|ely)|occured|untill|wich|"
    r"neccessary|adress|tomm?orow|becuase|wierd|yeee?s"
    r")\b",
    flags=re.IGNORECASE
)  # example patterns only
def has_typo(x): return TYPO_RE.search(str(x)) is not None  # returns a bool, not the compiled pattern
def has_typo_spellchecker(text):
    words = str(text).split()
    misspelled = spell.unknown(words)   # words that are not in the dictionary
    return len(misspelled) > 0

# draw samples per category
samples = []

# 5 examples each (if available)
samples.append(df[df['text'].apply(is_long)].sample(n=5, random_state=1))
samples.append(df[df['text'].apply(has_mention)].sample(n=5, random_state=2))
samples.append(df[df['text'].apply(has_hashtag)].sample(n=5, random_state=3))
samples.append(df[df['text'].apply(has_url)].sample(n=5, random_state=4))
samples.append(df[df['text'].apply(has_emoji)].sample(n=5, random_state=5))
#samples.append(df[df['text'].apply(has_typo)].sample(n=5, random_state=6))
samples.append(df[df["text"].apply(has_typo_spellchecker)].sample(n=5, random_state=6))
#df["text"].apply(has_typo_spellchecker)

# fill up the rest at random until 50
already = pd.concat(samples)
remaining = df.drop(already.index)
rest = remaining.sample(n=50-len(already), random_state=42)

# final test corpus
test_divers = pd.concat([already, rest]).sample(frac=1, random_state=99)
test_divers = test_divers[['date', 'id', 'text']]
display(test_divers.head(10))
test_divers.to_csv("testkorpus_divers_50.csv", index=False)

Unnamed: 0,date,id,text
41873,2010-11-04,3498743628,Reminder: The Miss Universe competition will b...
48694,2013-08-15,367977996541788160,@Timc1021 Thanks!
50189,2013-06-12,344775405057753088,"""""@_KatherineWebb: Looking forward to #MissUSA..."
14002,2011-09-04,110498268480198144,Addressing the Rise of Chronic Childhood Illn...
35985,2020-05-11,1259672385286012928,RT @darhar981: Attorney General Barr’s Office ...
11663,2011-09-07,111274504071323312,RT @MagaGlam🇺🇸♥️ Bring Back Trump 💙🇺🇸
22194,2011-09-01,109360482610823936,
2495,2025-03-19,2755,The CBP Home App is now available across all m...
84855,2019-01-23,1087867453684834304,Congratulations to Mariano Rivera on unanimous...
63673,2015-07-01,616265476616900608,My recent statement re: @macys -- We must have...


In [2]:
test_divers.shape

(50, 3)
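
The category predicates used to build the test corpus can be checked on a few hand-written strings before sampling. The sketch below repeats the simple predicates from the cell above and adds a hypothetical helper, `spellcheck_candidates`, which is not part of the notebook. It rests on the assumption that words split on whitespace still carry mentions, hashtags, URLs, and attached punctuation, which a dictionary-based spell checker is likely to report as unknown. The example posts are invented.

```python
import re
import string

def has_mention(x): return "@" in str(x)
def has_hashtag(x): return "#" in str(x)
def has_url(x): return re.search(r"https?://", str(x)) is not None
def has_emoji(x): return re.search(r"[\U00010000-\U0010ffff]", str(x)) is not None

CHECKS = {"mention": has_mention, "hashtag": has_hashtag, "url": has_url, "emoji": has_emoji}

def categories(text):
    """Names of all categories a post falls into."""
    return [name for name, check in CHECKS.items() if check(text)]

def spellcheck_candidates(text):
    """Lowercased alphabetic words, without URLs, mentions, hashtags, or attached punctuation."""
    words = []
    for raw in str(text).split():
        if raw.startswith(("@", "#", "http://", "https://")):
            continue
        word = raw.strip(string.punctuation)
        if word.isalpha():
            words.append(word.lower())
    return words

# Invented example posts, not taken from the corpus
examples = [
    "Thanks @user for the #MadeInAmerica event! https://example.com",
    "Great day \U0001F923",   # emoji above U+FFFF
    "We \u2665 America",      # U+2665 lies below U+10000, so has_emoji does not match it
]
for post in examples:
    print(categories(post), spellcheck_candidates(post))
```

The third example shows a limitation of the emoji pattern: symbols inside the Basic Multilingual Plane, such as the heart in one of the sampled posts, are not counted as emojis.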

# SpaCy
https://huggingface.co/spacy/en_core_web_sm

In [3]:
# !pip install spacy

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! @ben4appel 🤣 :) https://t.co/bsB6rVV7Yn")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
# it would be better if the hashtags were not split apart.

Donald Donald PROPN NNP
Trump Trump PROPN NNP
posted post VERB VBD
a a DET DT
new new ADJ JJ
tweet tweet NOUN NN
. . PUNCT .
# # X ADD
realdonaldtrump realdonaldtrump NOUN NN
@realdonaldtrump @realdonaldtrump PROPN NNP
! ! PUNCT .
@ben4appel @ben4appel PROPN NNP
🤣 🤣 PROPN NNP
:) :) PUNCT :
https://t.co/bsB6rVV7Yn https://t.co/bsb6rvv7yn NOUN NN


In [5]:
# python -m spacy download en_core_web_sm

In [6]:
#### 01 SpaCy ####
import pandas as pd
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")
all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue  # skip empty cells
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}",
            "tag": token.tag_
        })

pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_spacy.csv", index=False)
display(pos_df[310:360])
# the code works; however, emojis, @-mentions, and other post-specific items are not recognized.

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
310,16,skies,sky,NOUN,sky_NOUN,NNS
311,16,over,over,ADP,over_ADP,IN
312,16,Iran,Iran,PROPN,Iran_PROPN,NNP
313,16,.,.,PUNCT,._PUNCT,.
314,16,Iran,Iran,PROPN,Iran_PROPN,NNP
315,16,had,have,VERB,have_VERB,VBD
316,16,good,good,ADJ,good_ADJ,JJ
317,16,sky,sky,NOUN,sky_NOUN,NN
318,16,trackers,tracker,NOUN,tracker_NOUN,NNS
319,16,and,and,CCONJ,and_CCONJ,CC


In [7]:
import pandas as pd
fpt = pd.read_csv("testkorpus_divers_50_spacy.csv")
display(fpt[200:215])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
200,11,#,#,SYM,#_SYM,$
201,11,MadeInAmerica,MadeInAmerica,PROPN,MadeInAmerica_PROPN,NNP
202,11,event,event,NOUN,event_NOUN,NN
203,11,",",",",PUNCT,",_PUNCT",","
204,11,right,right,ADV,right_ADV,RB
205,11,here,here,ADV,here_ADV,RB
206,11,at,at,ADP,at_ADP,IN
207,11,the,the,DET,the_DET,DT
208,11,@WhiteHouse,@whitehouse,NOUN,@whitehouse_NOUN,NN
209,11,!,!,PUNCT,!_PUNCT,.


In [8]:
display(fpt[1200:1242])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
1200,50,ALL,all,PRON,all_PRON,DT
1201,50,of,of,ADP,of_ADP,IN
1202,50,them,they,PRON,they_PRON,PRP
1203,50,were,be,AUX,be_AUX,VBD
1204,50,released,release,VERB,release_VERB,VBN
1205,50,into,into,ADP,into_ADP,IN
1206,50,our,our,PRON,our_PRON,PRP$
1207,50,Country,Country,PROPN,Country_PROPN,NNP
1208,50,.,.,PUNCT,._PUNCT,.
1209,50,Thanks,thank,NOUN,thank_NOUN,NNS


In [9]:
display(fpt[70:110])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
70,5,’s,’s,PART,’s_PART,POS
71,5,Office,office,NOUN,office_NOUN,NN
72,5,Shreds,shred,VERB,shred_VERB,VBZ
73,5,NBC,NBC,PROPN,NBC_PROPN,NNP
74,5,’s,’s,PART,’s_PART,POS
75,5,Chuck,Chuck,PROPN,Chuck_PROPN,NNP
76,5,Todd,Todd,PROPN,Todd_PROPN,NNP
77,5,For,for,ADP,for_ADP,IN
78,5,‘,',PUNCT,'_PUNCT,``
79,5,Deceptive,Deceptive,PROPN,Deceptive_PROPN,NNP


In [10]:
fpt.shape

(1242, 6)

In [11]:
#### SpaCy ####
## this one also looks good
import pandas as pd
import spacy

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Load English model (install if needed: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Result container
all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text) or str(text).strip() == "":
        continue  # skip empty cells

    doc = nlp(str(text))

    for sent_id, sent in enumerate(doc.sents, start=1):
        sent_text = sent.text  # whole sentence as a string
        for token_id, token in enumerate(sent, start=1):
            all_results.append({
                "post_id": idx + 1,  # so that IDs start at 1
                "date": row["date"],
                "sentence_id": sent_id,
                "token_id": token_id,
                "sentence_text": sent_text,
                "word": token.text,
                "lemma": token.lemma_,
                "pos": token.pos_,
                "lemma_p": f"{token.lemma_}_{token.pos_}"
            })

# Create DataFrame
pos_df = pd.DataFrame(all_results)

# Save CSV
pos_df.to_csv("testkorpus_divers_50_spacy_2.csv", index=False, encoding="utf-8")

display(pos_df[500:555])


Unnamed: 0,post_id,date,sentence_id,token_id,sentence_text,word,lemma,pos,lemma_p
500,25,2025-03-12,1,19,The United States of America is going to take ...,by,by,ADP,by_ADP
501,25,2025-03-12,1,20,The United States of America is going to take ...,other,other,ADJ,other_ADJ
502,25,2025-03-12,1,21,The United States of America is going to take ...,countries,country,NOUN,country_NOUN
503,25,2025-03-12,1,22,The United States of America is going to take ...,and,and,CCONJ,and_CCONJ
504,25,2025-03-12,1,23,The United States of America is going to take ...,",",",",PUNCT,",_PUNCT"
505,25,2025-03-12,1,24,The United States of America is going to take ...,frankly,frankly,ADV,frankly_ADV
506,25,2025-03-12,1,25,The United States of America is going to take ...,",",",",PUNCT,",_PUNCT"
507,25,2025-03-12,1,26,The United States of America is going to take ...,by,by,ADP,by_ADP
508,25,2025-03-12,1,27,The United States of America is going to take ...,incompetent,incompetent,ADJ,incompetent_ADJ
509,25,2025-03-12,1,28,The United States of America is going to take ...,U.S.,U.S.,PROPN,U.S._PROPN


In [12]:
import pandas as pd
fpt2 = pd.read_csv("testkorpus_divers_50_spacy_2.csv")
display(fpt2[70:110])

Unnamed: 0,post_id,date,sentence_id,token_id,sentence_text,word,lemma,pos,lemma_p
70,5,2020-05-11,1,7,RT @darhar981: Attorney General Barr’s Office ...,’s,’s,PART,’s_PART
71,5,2020-05-11,1,8,RT @darhar981: Attorney General Barr’s Office ...,Office,office,NOUN,office_NOUN
72,5,2020-05-11,1,9,RT @darhar981: Attorney General Barr’s Office ...,Shreds,shred,VERB,shred_VERB
73,5,2020-05-11,1,10,RT @darhar981: Attorney General Barr’s Office ...,NBC,NBC,PROPN,NBC_PROPN
74,5,2020-05-11,1,11,RT @darhar981: Attorney General Barr’s Office ...,’s,’s,PART,’s_PART
75,5,2020-05-11,1,12,RT @darhar981: Attorney General Barr’s Office ...,Chuck,Chuck,PROPN,Chuck_PROPN
76,5,2020-05-11,1,13,RT @darhar981: Attorney General Barr’s Office ...,Todd,Todd,PROPN,Todd_PROPN
77,5,2020-05-11,1,14,RT @darhar981: Attorney General Barr’s Office ...,For,for,ADP,for_ADP
78,5,2020-05-11,1,15,RT @darhar981: Attorney General Barr’s Office ...,‘,',PUNCT,'_PUNCT
79,5,2020-05-11,1,16,RT @darhar981: Attorney General Barr’s Office ...,Deceptive,Deceptive,PROPN,Deceptive_PROPN


In [13]:
fpt2.shape # number of rows, i.e. how many tokens were tagged?

(1242, 9)

In [14]:
#### SpaCy with Twitter ####
## structure tags only
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
import re

def create_twitter_tokenizer(nlp):
    # Extended infix rule for hashtags and mentions (e.g. #NLP, @user)
    # Note: only infix rules are passed to Tokenizer, so spaCy's default prefix,
    # suffix, and exception rules are not applied and attached punctuation
    # stays on the word (e.g. "deal," in the output below).
    infix_re = spacy.util.compile_infix_regex(
        nlp.Defaults.infixes + [r'(?<=\w)[#@](?=\w)']
    )
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Set the Twitter-adapted tokenizer
nlp.tokenizer = create_twitter_tokenizer(nlp)

all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}"
        })

# Convert to DataFrame and save
pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_spacy_twitter.csv", index=False, encoding="utf-8")
display(pos_df[450:500])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
450,21,https://t.co,https://t.co,PROPN,https://t.co_PROPN
451,21,/,/,SYM,/_SYM
452,21,v6z46rUDtg,v6z46rUDtg,PROPN,v6z46rUDtg_PROPN
453,22,Congress,Congress,PROPN,Congress_PROPN
454,22,must,must,AUX,must_AUX
455,22,approve,approve,VERB,approve_VERB
456,22,the,the,DET,the_DET
457,22,"deal,","deal,",NOUN,"deal,_NOUN"
458,22,without,without,ADP,without_ADP
459,22,all,all,PRON,all_PRON


In [15]:
import pandas as pd
fptw = pd.read_csv("testkorpus_divers_50_spacy_twitter.csv")
display(fptw[70:110])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
70,5,"Press,”","Press,”",NOUN,"Press,”_NOUN"
71,5,Todd,Todd,PROPN,Todd_PROPN
72,5,…,…,PUNCT,…_PUNCT
73,6,RT,RT,PROPN,RT_PROPN
74,6,,,SPACE,_SPACE
75,6,@MagaGlam,@magaglam,SYM,@magaglam_SYM
76,6,🇺,🇺,X,🇺_X
77,6,🇸,🇸,NOUN,🇸_NOUN
78,6,♥,♥,PROPN,♥_PROPN
79,6,️,️,X,️_X


In [16]:
fptw.shape

(1244, 5)

Conclusion on SpaCy (including the Twitter-adapted tokenizer):
- With the Twitter adaptation, punctuation marks are not split off correctly
- Hashtags are often split, and then split incorrectly, or the hash is tagged SYM and what follows PROPN
- @-mentions are tagged X or NOUN (or otherwise incorrectly), but are kept intact
- Barr's is split into Barr (PROPN) and 's (PART)
- Emojis are split and analyzed as a sentence of their own (as are hashtags and @-mentions), which is wrong
- Lemmatization is not very good
- L.L.Bean is recognized correctly
- Tagging is not entirely correct

spacy2:
- Shreds is correctly lemmatized as shred
- Better lemmatization
- Hashtags are split cleanly, but analyzed as a sentence of their own, which is wrong
- Links are kept intact, but their tagging is odd
- Still problems with hashtags and @-mentions

With the Twitter tokenizer:
- Poor tokenization
- @-mentions and hashtags are handled well
- Not all tags fit (Meet NOUN the DET press NOUN, USA as ADV)
- Emojis are still not recognized well
- Some links are chopped up, others are not
- Lemmatization is good
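
One way to address the splitting of hashtags and mentions noted above is to tokenize with a pattern that matches them as whole units before a tagger sees the text. The following is a minimal regex sketch using only the standard library; it is not spaCy's tokenizer and was not used for the results in this notebook. The pattern and the function name `tweet_tokenize` are assumptions made for illustration.

```python
import re

# Order matters: URLs first, then hashtags/mentions, then words (optionally with
# an apostrophe part such as "doesn't"), then any single non-space symbol.
TOKEN_RE = re.compile(r"https?://\S+|[#@]\w+|\w+(?:['’]\w+)?|[^\w\s]")

def tweet_tokenize(text):
    """Split a post into tokens while keeping URLs, hashtags, and mentions intact."""
    return TOKEN_RE.findall(text)

tokens = tweet_tokenize(
    "Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! https://t.co/bsB6rVV7Yn"
)
print(tokens)
```

Two caveats apply to this sketch: punctuation directly attached to the end of a URL is swallowed by the URL pattern, and contractions and possessives such as `Barr’s` stay a single token, which differs from the Penn Treebank convention used by the taggers above.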

# Stanza
https://huggingface.co/stanfordnlp/stanza-en

In [17]:
# !pip install stanza

In [18]:
#### the English model for Stanza #### based on UD
## find the combined model
import pandas as pd
import stanza

# Load the Stanza pipeline for English (tokenization, lemma, and POS only)
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma', use_gpu=False)

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for sentence in doc.sentences:
        for token in sentence.tokens:
            # A token can contain several words (multi-word tokens),
            # so we take the first word for lemma and POS
            word = token.text
            word_info = token.words[0]  # first word in the token
            lemma = word_info.lemma
            pos = word_info.upos
            lemma_p = f"{lemma}_{pos}"
            
            all_results.append({
                "post_id": idx + 1,
                "date": row["date"],
                "text": text,
                "word": word,
                "lemma": lemma,
                "pos": pos,
                "lemma_p": lemma_p
            })

# Convert to DataFrame and save
pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_stanza.csv", index=False, encoding="utf-8")
display(pos_df[612:665])


Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
612,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",#,#,SYM,#_SYM
613,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",2016election,2016election,PROPN,2016election_PROPN
614,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
615,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
616,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,The,the,DET,the_DET
617,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,Zimmerman,Zimmerman,PROPN,Zimmerman_PROPN
618,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,trial,trial,NOUN,trial_NOUN
619,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,is,be,AUX,be_AUX
620,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,over,over,ADV,over_ADV
621,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,.,.,PUNCT,._PUNCT


In [19]:
import pandas as pd
fpts = pd.read_csv("testkorpus_divers_50_stanza.csv")
display(fpts[200:215])

Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
200,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,it,it,PRON,it_PRON
201,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,is,be,AUX,be_AUX
202,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,the,the,DET,the_DET
203,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,BEST,good,ADJ,good_ADJ
204,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,!,!,PUNCT,!_PUNCT
205,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,USA,USA,PROPN,USA_PROPN
206,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,🇺,🇺,PUNCT,🇺_PUNCT
207,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,🇸,🇸,PUNCT,🇸_PUNCT
208,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,https://t.co/q4vB9GdE5y,https://t.co/q4vB9GdE5y,PROPN,https://t.co/q4vB9GdE5y_PROPN
209,12,2011-08-31,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,PROPN,https://www.mediaite.com/tv/trump-team-scored-...


In [20]:
fpts.shape

(1190, 7)

Conclusion on Stanza:
- Fewer tokens than SpaCy (1190 rows versus 1242)
- @-mentions are mostly recognized well (and left as they were)
- Emojis are tagged as punctuation
- Hashtags are mostly split
- Links are tagged as proper nouns, but are kept whole
- Correct lemmatization
- Barr's becomes Barr as PROPN; Tweebank splits it into Barr 's
- Shreds and its (spelling error) are recognized incorrectly; ol' is tagged as a noun (it is actually old; Tweebank tagged ol' as ADJ)
- doesn't becomes do (lemma)

# Flair
- https://huggingface.co/flair/pos-english: F1 score: 98.18
- https://huggingface.co/flair/pos-english-fast: 98.10
- https://huggingface.co/flair/upos-english: 98.6
- https://huggingface.co/flair/upos-english-fast: 98.47

In [21]:
# !pip install flair # the model can be chosen as pos or upos

In [22]:
#### Flair with batch UPOS tagging ####
import logging
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger

# Reduce Flair log output (optional)
logging.getLogger("flair").setLevel(logging.ERROR)

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Keep only rows with text (to guarantee clean pairing)
df_nonempty = df[df["text"].notna()].copy()

# Load POS tagger
tagger = SequenceTagger.load("flair/upos-english")
label_type = tagger.label_type  # should be "pos"

# Prepare sentences
sentences = [Sentence(str(t)) for t in df_nonempty["text"]]

# Batch prediction
tagger.predict(sentences, mini_batch_size=32)

# Collect results (pair rows from df_nonempty with sentences)
all_results = []
for row, sentence in zip(df_nonempty.itertuples(index=True), sentences):
    for token in sentence:
        # Get the POS label (new API)
        # Variant A (single label):
        pos_label = token.get_label(label_type).value if token.has_label(label_type) else None
        # Variant B (list):
        # labels = token.get_labels(label_type)
        # pos_label = labels[0].value if labels else None

        all_results.append({
            "post_id": row.Index,                  # original index from df
            "date": getattr(row, "date", None),    # if present
            "word": token.text,
            "lemma": token.text.lower(),           # Flair provides no lemma → placeholder
            "pos": pos_label,
            "lemma_p": f"{token.text.lower()}_{pos_label}" if pos_label else token.text.lower()
        })

# Convert to DataFrame and save
pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_flair.csv", index=False)

display(pos_df.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,INTJ,reminder_INTJ
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,PROPN,the_PROPN
3,0,2010-11-04,Miss,miss,PROPN,miss_PROPN
4,0,2010-11-04,Universe,universe,PROPN,universe_PROPN
5,0,2010-11-04,competition,competition,PROPN,competition_PROPN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,VERB,be_VERB
8,0,2010-11-04,LIVE,live,VERB,live_VERB
9,0,2010-11-04,from,from,ADP,from_ADP


In [23]:
import pandas as pd
fptf = pd.read_csv("testkorpus_divers_50_flair.csv")
display(fptf[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,NUM,shreds_NUM
71,4,2020-05-11,NBC,nbc,SYM,nbc_SYM
72,4,2020-05-11,’s,’s,NUM,’s_NUM
73,4,2020-05-11,Chuck,chuck,NOUN,chuck_NOUN
74,4,2020-05-11,Todd,todd,SYM,todd_SYM
75,4,2020-05-11,For,for,NUM,for_NUM
76,4,2020-05-11,‘,‘,SYM,‘_SYM
77,4,2020-05-11,Deceptive,deceptive,NUM,deceptive_NUM
78,4,2020-05-11,Editing’,editing’,SYM,editing’_SYM
79,4,2020-05-11,Of,of,NUM,of_NUM


In [24]:
fptf.shape

(1357, 6)

At first glance, Flair does not handle POS tagging as well as SpaCy or Stanza.

In [25]:
#### Flair and SpaCy ####
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger
import spacy

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Load Flair POS tagger
tagger = SequenceTagger.load("pos-fast")

# Load SpaCy English model for lemmatization
nlp = spacy.load("en_core_web_sm")

all_results = []

for idx, row in df.iterrows():
    text = row['text']
    if pd.isna(text):
        continue

    # SpaCy document for lemmas
    spacy_doc = nlp(str(text))

    # Flair Sentence for POS
    flair_sentence = Sentence(str(text))
    tagger.predict(flair_sentence)

    # Caution: Flair and SpaCy tokenize differently!
    # We therefore match tokens by position, but only if both produce the same number of tokens
    if len(flair_sentence) == len(spacy_doc):
        for flair_token, spacy_token in zip(flair_sentence, spacy_doc):
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": spacy_token.lemma_,
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{spacy_token.lemma_}_{flair_token.get_label('pos').value}"
            })
    else:
        # If the token counts differ: fall back to Flair POS + word only, lemma = lowercased word
        for flair_token in flair_sentence:
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": flair_token.text.lower(),
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{flair_token.text.lower()}_{flair_token.get_label('pos').value}"
            })

pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_flair_spacy.csv", index=False)

display(pos_df.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,:,:,:_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,miss,NNP,miss_NNP
4,0,2010-11-04,Universe,universe,NNP,universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,JJ,live_JJ
9,0,2010-11-04,from,from,IN,from_IN


In [26]:
import pandas as pd
fptfs = pd.read_csv("testkorpus_divers_50_flair_spacy.csv")
display(fptfs[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,VBZ,shreds_VBZ
71,4,2020-05-11,NBC,nbc,NNP,nbc_NNP
72,4,2020-05-11,’s,’s,VBZ,’s_VBZ
73,4,2020-05-11,Chuck,chuck,NNP,chuck_NNP
74,4,2020-05-11,Todd,todd,NNP,todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‘,‘,``,‘_``
77,4,2020-05-11,Deceptive,deceptive,JJ,deceptive_JJ
78,4,2020-05-11,Editing’,editing’,NN,editing’_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [27]:
fptfs.shape

(1357, 6)

In [28]:
#### Flair and SpaCy ####
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger
import spacy
from IPython.display import display

# Read CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Load Flair POS tagger
tagger = SequenceTagger.load("pos-fast")

# Load SpaCy English model for lemmatization
nlp = spacy.load("en_core_web_sm")

def get_spacy_token_by_offset(spacy_tokens, start, end):
    """Find the SpaCy token that lies at least partly within the range [start, end)"""
    for token in spacy_tokens:
        token_start = token.idx
        token_end = token.idx + len(token.text)
        # Check for overlap
        # Note: the inclusive comparisons also accept a token that merely touches
        # the range, so the first adjacent token is returned (in the output below,
        # ":" receives the lemma "reminder")
        if token_start <= end and token_end >= start:
            return token
    return None

all_results = []

for idx, row in df.iterrows():
    text = row['text']
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))
    # use_tokenizer=True is important for start_position / end_position
    flair_sentence = Sentence(str(text), use_tokenizer=True)
    tagger.predict(flair_sentence)

    for flair_token in flair_sentence:
        start = flair_token.start_position
        end = flair_token.end_position
        spacy_token = get_spacy_token_by_offset(spacy_doc, start, end)

        if spacy_token:
            lemma = spacy_token.lemma_
        else:
            # Fallback if no matching token was found
            lemma = flair_token.text.lower()

        all_results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": flair_token.text,
            "lemma": lemma,
            "pos": flair_token.get_label('pos').value,
            "lemma_p": f"{lemma}_{flair_token.get_label('pos').value}"
        })

pos_df = pd.DataFrame(all_results)
pos_df.to_csv("testkorpus_divers_50_flair_spacy2.csv", index=False)
display(pos_df.head(40))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,reminder,:,reminder_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,Miss,NNP,Miss_NNP
4,0,2010-11-04,Universe,Universe,NNP,Universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,JJ,live_JJ
9,0,2010-11-04,from,from,IN,from_IN


In [29]:
import pandas as pd
fptfs2 = pd.read_csv("testkorpus_divers_50_flair_spacy2.csv")
display(fptfs2[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shred,VBZ,shred_VBZ
71,4,2020-05-11,NBC,NBC,NNP,NBC_NNP
72,4,2020-05-11,’s,NBC,VBZ,NBC_VBZ
73,4,2020-05-11,Chuck,Chuck,NNP,Chuck_NNP
74,4,2020-05-11,Todd,Todd,NNP,Todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‘,',``,'_``
77,4,2020-05-11,Deceptive,',JJ,'_JJ
78,4,2020-05-11,Editing’,editing,NN,editing_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [30]:
fptfs2.shape

(1357, 6)

Conclusion on Flair:
- Tagging is often wrong (too many NUM and SYM tags), especially for names
- Links are split up
- Hashtags and @-mentions are split (at least word by word)
- Emojis are tagged SYM and separated well

flair_spacy (different tagset):
- Lemmatization is not quite correct (lowercasing only)
- Emojis are also separated well, but tagged NFP

flair-spacy2 (also a different tagset):
- Lemmatization is worse
- Tagging is OK
- Emojis are also tagged NFP, but not separated as well
- L.L.Bean is recognized as a single expression! (Not the case with Stanza, Bert, and Tweebank)
- ol' is correctly tagged JJ
- @ is tagged CC
- Links are tagged as a sentence of their own, which is wrong and causes too much splitting
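
Part of the weaker lemmatization in the flair-spacy2 run can be traced to how character spans are matched: a comparison that accepts spans that merely touch returns the neighbouring token. The sketch below shows a stricter alternative on plain `(start, end)` tuples, choosing the candidate with the largest real overlap. It is an illustration under stated assumptions, not the code that produced the results above; the function name `best_match` and the example spans are hypothetical.

```python
def best_match(target, candidates):
    """Index of the candidate span sharing the most characters with target, or None.

    Spans are half-open (start, end) character ranges.
    """
    best_index, best_size = None, 0
    for index, (cand_start, cand_end) in enumerate(candidates):
        # Number of shared characters; zero or negative means no real overlap
        size = min(target[1], cand_end) - max(target[0], cand_start)
        if size > best_size:
            best_index, best_size = index, size
    return best_index

# Illustrative spans for the text "Reminder: The"
spacy_spans = [(0, 8), (8, 9), (10, 13)]   # "Reminder", ":", "The"
print(best_match((8, 9), spacy_spans))     # → 1, the ":" token rather than "Reminder"
```

With spaCy tokens, the spans would be built from `token.idx` and `token.idx + len(token.text)`, as in the cell above.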

# Bert
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos

In [31]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

# === 1. Model for POS tagging (local, compatible with PyTorch 2.2) ===
model_name = "vblagoje/bert-english-uncased-finetuned-pos"

# Load
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# spaCy for lemmas
nlp = spacy.load("en_core_web_sm")

# Label mapping
id2label = model.config.id2label

# === 2. Read CSV ===
df = pd.read_csv("testkorpus_divers_50.csv")
text_col = "text"

results = []

# === 3. POS tagging per tweet ===
for idx, row in df.iterrows():
    text = row.get(text_col)
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))

    # Tokenize with offsets
    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]

    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits
    predictions = torch.argmax(logits, dim=2)[0].tolist()

    # Iterate over the sub-tokens
    for idx_token, pred_id in enumerate(predictions):
        start, end = offset_mappings[idx_token].tolist()
        if start == end:
            continue  # Special tokens

        word_text = text[start:end]
        pos_tag = id2label[pred_id]

        # Lemma via the spaCy token
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()

        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

# === 4. Save results ===
pos_df = pd.DataFrame(results)
pos_df.to_csv("testkorpus_divers_50_pos_local.csv", index=False)
print("POS tagging finished. Results in 'testkorpus_divers_50_pos_local.csv'")
display(pos_df.head(40))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


POS tagging finished. Results in 'testkorpus_divers_50_pos_local.csv'


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [32]:
import pandas as pd
fptfba = pd.read_csv("testkorpus_divers_50_pos_local.csv")
display(fptfba[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,’,’s,PART,’s_PART
116,4,2020-05-11,s,’s,PART,’s_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,“,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,”,"""",PUNCT,"""_PUNCT"


In [33]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import spacy

# Load the data
df = pd.read_csv("testkorpus_divers_50.csv")

# Load the spaCy model for lemmatization
nlp = spacy.load("en_core_web_sm")

model_name = "vblagoje/bert-english-uncased-finetuned-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

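# Pipeline for token classification (POS)
# NOTE: aggregation_strategy="simple" merges adjacent tokens that share a tag
# (see "miss universe" and "will be" in the output of this cell)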
pos_tagger = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    
    # spaCy Doc for lemmatization
    spacy_doc = nlp(str(text))
    
    # Hugging Face POS tagging on the whole text
    hf_pos = pos_tagger(str(text))
    
    # Prepare spaCy tokens and offsets for matching
    spacy_tokens = list(spacy_doc)
    
    # Look up the spaCy token for a character span.
    # NOTE: the inclusive comparisons (<=, >=) also match a token that merely touches
    # the span, so a token without preceding whitespace gets its left neighbour's lemma.
    def get_spacy_token_by_offset(start, end):
        for token in spacy_tokens:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if token_start <= end and token_end >= start:
                return token
        return None
    
    for entity in hf_pos:
        word = entity['word']
        start = entity['start']
        end = entity['end']
        pos_tag = entity['entity_group']  # e.g. 'NOUN', 'VERB'
        
        spacy_token = get_spacy_token_by_offset(start, end)
        lemma = spacy_token.lemma_ if spacy_token else word.lower()
        
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

pos_df = pd.DataFrame(results)
pos_df.to_csv("testkorpus_divers_50_bert.csv", index=False)
display(pos_df.head(50))

Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,reminder,PUNCT,reminder_PUNCT
2,0,2010-11-04,the,the,DET,the_DET
3,0,2010-11-04,miss universe,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN
5,0,2010-11-04,will be,will,AUX,will_AUX
6,0,2010-11-04,live,live,ADJ,live_ADJ
7,0,2010-11-04,from,from,ADP,from_ADP
8,0,2010-11-04,the,the,DET,the_DET
9,0,2010-11-04,bahamas,Bahamas,PROPN,Bahamas_PROPN


In [34]:
import pandas as pd
fptfb = pd.read_csv("testkorpus_divers_50_bert.csv")
display(fptfb[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
200,10,2017-07-21,hosted,host,VERB,host_VERB
201,10,2017-07-21,a,a,DET,a_DET
202,10,2017-07-21,#,#,SYM,#_SYM
203,10,2017-07-21,madeinamerica,#,X,#_X
204,10,2017-07-21,event,event,NOUN,event_NOUN
205,10,2017-07-21,",",event,PUNCT,event_PUNCT
206,10,2017-07-21,right here,right,ADV,right_ADV
207,10,2017-07-21,at,at,ADP,at_ADP
208,10,2017-07-21,the,the,DET,the_DET
209,10,2017-07-21,@,@whitehouse,X,@whitehouse_X


In [35]:
fptfb.shape

(1559, 6)

In [36]:
fptfb[70:110]
# the lemma is completely wrong
# row 70: ":" is tagged PUNCT but carries the lemma "@darhar981", presumably that of the preceding token

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,:,@darhar981,PUNCT,@darhar981_PUNCT
71,4,2020-05-11,attorney general barr,Attorney,PROPN,Attorney_PROPN
72,4,2020-05-11,’ s,Barr,PART,Barr_PART
73,4,2020-05-11,office,office,NOUN,office_NOUN
74,4,2020-05-11,shreds,shred,VERB,shred_VERB
75,4,2020-05-11,nbc,NBC,PROPN,NBC_PROPN
76,4,2020-05-11,’ s,NBC,PART,NBC_PART
77,4,2020-05-11,chuck todd,Chuck,PROPN,Chuck_PROPN
78,4,2020-05-11,for,for,ADP,for_ADP
79,4,2020-05-11,‘,',PUNCT,'_PUNCT
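
The shifted lemmas in this table are consistent with the inclusive comparison in `get_spacy_token_by_offset`: a token that ends exactly where the span starts also satisfies the condition, and it is returned first. A minimal sketch of a stricter match follows, picking the token that shares the most characters with the span. It uses plain `(start, end, lemma)` tuples in place of spaCy tokens; the name `find_token_by_overlap` and the sample offsets are made up for illustration.

```python
def find_token_by_overlap(tokens, start, end):
    """Return the token tuple that shares the most characters with [start, end)."""
    best, best_overlap = None, 0
    for tok in tokens:
        tok_start, tok_end, _lemma = tok
        # Number of shared characters; zero or negative means no real overlap
        overlap = min(end, tok_end) - max(start, tok_start)
        if overlap > best_overlap:
            best, best_overlap = tok, overlap
    return best

# Hypothetical offsets for the text "Reminder: The"
tokens = [(0, 8, "reminder"), (8, 9, ":"), (10, 13, "the")]
colon = find_token_by_overlap(tokens, 8, 9)
print(colon)
```

With spaCy tokens the tuples would be built from `token.idx`, `token.idx + len(token.text)` and `token.lemma_`.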


In [37]:
# without pipeline
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

# Model name (POS tagging)
model_name = "vblagoje/bert-english-uncased-finetuned-pos"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Load spaCy for lemmatization
nlp = spacy.load("en_core_web_sm")

# Label mapping (index to POS tag) from the model config
id2label = model.config.id2label

# Read the data
df = pd.read_csv("testkorpus_divers_50.csv")

results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    spacy_doc = nlp(str(text))
    
    # Tokenize and return the offsets (start/end characters in the text)
    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]  # batch size = 1
    
    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits  # shape [1, seq_len, num_labels]
    predictions = torch.argmax(logits, dim=2)[0].tolist()  # indices of the best labels
    
    # Iterate over the tokens (skip special tokens such as CLS, SEP)
    for idx_token, pred_id in enumerate(predictions):
        # offsets (start, end) of the token in the original text
        start, end = offset_mappings[idx_token].tolist()
        if start == end:  # special tokens (CLS, SEP) have offset (0, 0)
            continue
        
        # str() keeps the slice aligned with the text that was actually tokenized
        word_text = str(text)[start:end]
        pos_tag = id2label[pred_id]
        
        # Find the lemma: spaCy token whose span contains the subword
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            # check that the subword lies entirely inside the spaCy token
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()
        
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

pos_df = pd.DataFrame(results)
pos_df.to_csv("testkorpus_divers_50_bert2.csv", index=False)
display(pos_df.head(40))



Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [38]:
import pandas as pd
fptfb2 = pd.read_csv("testkorpus_divers_50_bert2.csv")
display(fptfb2[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,’,’s,PART,’s_PART
116,4,2020-05-11,s,’s,PART,’s_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,“,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,”,"""",PUNCT,"""_PUNCT"


In [39]:
fptfb2.shape

(2024, 6)

In [40]:
fptfb2[1980:2024]
# links and @-mentions are split up oddly

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
1980,49,2025-03-01,were,be,AUX,be_AUX
1981,49,2025-03-01,released,release,VERB,release_VERB
1982,49,2025-03-01,into,into,ADP,into_ADP
1983,49,2025-03-01,our,our,PRON,our_PRON
1984,49,2025-03-01,Country,Country,NOUN,Country_NOUN
1985,49,2025-03-01,.,.,PUNCT,._PUNCT
1986,49,2025-03-01,Thanks,thank,NOUN,thank_NOUN
1987,49,2025-03-01,to,to,ADP,to_ADP
1988,49,2025-03-01,the,the,DET,the_DET
1989,49,2025-03-01,Trump,Trump,PROPN,Trump_PROPN


Conclusion (pos-local):
- links (and abbreviations such as CBP) are split into far too many pieces and tagged incorrectly (the same happens with numbers and other words: USA(emoji) becomes usa(emoji) and SYM, i.e. it is split incorrectly)
- hashtags and @-mentions are split (#MadeInAmerica becomes # Made InA meric a)
- emojis are tagged as SYM (and, when the split is wrong, so is the rest of the word)
- lemmatization is correct
- roughly 7,000-8,000 more tags than the other taggers (probably because of the links)
- bert2: poor lemmatization and poor word splitting (Opioid becomes Op io id)
- bert: times of day contain ## (presumably the WordPiece continuation prefix), lemmatization even worse
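
Several of the points above (Op io id, # Made InA meric a) follow from writing one row per WordPiece subword. One possible remedy, sketched here under assumptions, is to assign each subword to the word span that contains it and to keep the tag predicted for the word's first subword, which is a common convention. The word spans would come from a word-level tokenizer such as spaCy; here they are hard-coded, and `merge_subwords` is a hypothetical helper, not part of any library.

```python
def merge_subwords(text, word_spans, pieces):
    """Collapse subword predictions into one (word, tag) pair per word span.

    word_spans: list of (start, end) character spans, one per word
    pieces:     list of (start, end, tag) subword predictions, in text order
    The tag of the first subword inside a word is used for the whole word.
    """
    merged = []
    for w_start, w_end in word_spans:
        tags = [tag for (p_start, p_end, tag) in pieces
                if p_start >= w_start and p_end <= w_end]
        if tags:
            merged.append((text[w_start:w_end], tags[0]))
    return merged

# Invented example: "Opioid" arrives as three subwords, "crisis" as one
text = "Opioid crisis"
word_spans = [(0, 6), (7, 13)]
pieces = [(0, 2, "NOUN"), (2, 4, "NOUN"), (4, 6, "NOUN"), (7, 13, "NOUN")]
merged = merge_subwords(text, word_spans, pieces)
print(merged)
```

Whether the first subword is the best source for the word's tag on tweets was not evaluated here; a majority vote over the subword tags would be an alternative.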

## Tweebank

In [1]:
# The older option, tweetnlp, does not support POS tagging (no POS task in the list below).
import tweetnlp
print(list(tweetnlp.loader.TASK_CLASS.keys()))

['sentiment', 'offensive', 'irony', 'hate', 'emotion', 'emoji', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary', 'topic_classification', 'ner', 'language_model', 'sentence_embedding', 'question_answering', 'question_answer_generation']


In [None]:
#from transformers import AutoTokenizer, AutoModelForTokenClassification
#tokenizer = AutoTokenizer.from_pretrained("TweebankNLP/bertweet-tb2_ewt-pos-tagging")
#model = AutoModelForTokenClassification.from_pretrained("TweebankNLP/bertweet-tb2_ewt-pos-tagging")
# does not work because PyTorch 2.6 cannot be installed.

In [2]:
import torch
print(torch.__version__)

2.6.0


In [13]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Read the CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Replace NaN texts with empty strings
df["text"] = df["text"].fillna("").astype(str)

# Load model and tokenizer
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Device: CPU
device = -1

# Pipeline for token classification
tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=device
)

results = []

for idx, row in df.iterrows():
    text = row["text"].strip()
    
    if text:  # text exists
        tagged = tagger(text)
        for token_info in tagged:
            word_text = token_info.get("word")
            pos_tag = token_info.get("entity_group")  # POS-Tag
            lemma = word_text
            results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": word_text,
                "lemma": lemma,
                "pos": pos_tag,
                "lemma_p": f"{lemma}_{pos_tag}"
            })
    else:  # empty text → placeholder row
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": "",
            "lemma": "",
            "pos": "",
            "lemma_p": ""
        })

# Results as a DataFrame
df_tagged = pd.DataFrame(results)

# Save as CSV
df_tagged.to_csv("testkorpus_divers_50_tweebank_1.csv", index=False, encoding="utf-8")

print("✅ Done! File saved as testkorpus_divers_50_tweebank_1.csv")
display(df_tagged.head())

Device set to use cpu


✅ Done! File saved as testkorpus_divers_50_tweebank_1.csv


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,The,DET,The_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


In [14]:
import pandas as pd
fptft1 = pd.read_csv("testkorpus_divers_50_tweebank_1.csv")
display(fptft1[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,Of,ADP,Of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,Comments,NOUN,Comments_NOUN
74,4,2020-05-11,On,On,ADP,On_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",”",",”",PUNCT,",”_PUNCT"


In [15]:
fptft1.shape
# emojis replaced by @?? (the @@ suffix may instead be the tokenizer's subword continuation marker)

(1225, 6)

In [16]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import spacy

# Load the CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Load the model (Tweebank POS)
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# spaCy for lemmatization (English model, usually fine for tweets)
nlp = spacy.load("en_core_web_sm")

results = []

for idx, row in df.iterrows():
    text = str(row["text"])
    date = row.get("date", None)

    # POS tagging with Tweebank
    pos_tags = tagger(text)

    # Lemmatization with spaCy
    doc = nlp(text)

    # Map from word → lemma, taken from spaCy
    lemma_map = {token.text: token.lemma_ for token in doc}

    # Store the results
    for token_info in pos_tags:
        word_text = token_info["word"]
        pos_tag = token_info["entity_group"]
        lemma = lemma_map.get(word_text, word_text)  # fallback: the word itself if it is not in the lemma map

        results.append({
            "post_id": idx,
            "date": date,
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

# Build the DataFrame and save it
out_df = pd.DataFrame(results)
out_df.to_csv("testkorpus_divers_50_tweebank.csv", index=False)

Device set to use cpu


Device set to use cpu:
On macOS there is normally no CUDA GPU.
Apple Silicon (M1/M2/M3) has its own GPU, which can only be used through PyTorch's MPS (Metal Performance Shaders) backend.
On Intel Macs only the CPU is available.
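
The device choice described above can be written down as a small helper. This is a sketch, not something that was run as part of the comparison: `pick_device` is an illustrative name, and the import guard only exists so that the snippet also runs where PyTorch is missing.

```python
import importlib.util

def pick_device():
    """Return 'cuda', 'mps' or 'cpu', depending on what PyTorch reports as available."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch not installed: nothing to choose from
    import torch
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch releases
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(device)
```

The resulting string could be passed as `device=device` to `pipeline(...)`. Whether the Tweebank model produces the same tags on MPS as on the CPU was not checked here.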

In [20]:
import pandas as pd
fptft = pd.read_csv("testkorpus_divers_50_tweebank.csv")
display(fptft[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,of,ADP,of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
74,4,2020-05-11,On,on,ADP,on_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",”",",”",PUNCT,",”_PUNCT"


In [25]:
fptft.shape

(1225, 6)
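
In the table above, `Barr@@` and `Press@@` keep their surface form as lemma, because the spaCy lemma map has no key that ends in `@@`. The suffix is most likely the continuation marker of the BPE vocabulary used by BERTweet, left over at the end of a piece whose neighbour received a different tag. A minimal sketch that strips the marker before the lookup; `strip_bpe_marker` is an illustrative helper and the lemma map is invented.

```python
def strip_bpe_marker(word):
    """Remove a trailing '@@' continuation marker from a subword piece."""
    return word[:-2] if word.endswith("@@") else word

# Hypothetical lemma map, as it could come from spaCy for one post
lemma_map = {"Barr": "Barr", "Comments": "comment", "Press": "Press"}

rows = ["Barr@@", "Comments", "Press@@"]
lemmas = [lemma_map.get(strip_bpe_marker(w), strip_bpe_marker(w)) for w in rows]
print(lemmas)
```

This only cleans the surface form. It does not undo the split itself, so `Press@@` and the following `,”` remain two rows.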

In [18]:
## with pipeline and a different lemmatizer

In [21]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Make sure the NLTK resources are available
nltk.download("wordnet")
nltk.download("omw-1.4")

# Read the CSV
df = pd.read_csv("testkorpus_divers_50.csv")

# Replace NaN texts with empty strings
df["text"] = df["text"].fillna("").astype(str)

# Load model and tokenizer
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Device: CPU
device = -1  

# Pipeline
tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=device
)

# Prepare the NLTK lemmatizer
lemmatizer = WordNetLemmatizer()

# Helper: map the UPOS tags emitted by the model (NOUN, VERB, ADJ, ADV, ...) to WordNet POS constants
def map_pos_to_wordnet(pos_tag):
    if pos_tag in ("NOUN", "PROPN"):
        return wordnet.NOUN
    elif pos_tag == "VERB":
        return wordnet.VERB
    elif pos_tag == "ADJ":
        return wordnet.ADJ
    elif pos_tag == "ADV":
        return wordnet.ADV
    else:
        return wordnet.NOUN  # fallback

results = []

for idx, row in df.iterrows():
    text = row["text"].strip()
    
    if text:  # text exists
        tagged = tagger(text)
        for token_info in tagged:
            word_text = token_info.get("word")
            pos_tag = token_info.get("entity_group")  # POS-Tag
            wn_pos = map_pos_to_wordnet(pos_tag)
            lemma = lemmatizer.lemmatize(word_text, wn_pos)
            
            results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": word_text,
                "lemma": lemma,
                "pos": pos_tag,
                "lemma_p": f"{lemma}_{pos_tag}"
            })
    else:  # empty text → placeholder row
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": "",
            "lemma": "",
            "pos": "",
            "lemma_p": ""
        })

# Results as a DataFrame
df_tagged = pd.DataFrame(results)

# Save as CSV
output_file = "testkorpus_divers_50_tweebank_2.csv"
df_tagged.to_csv(output_file, index=False, encoding="utf-8")

print(f"✅ Results saved to: {output_file}")
display(df_tagged.head())

[nltk_data] Downloading package wordnet to /Users/vivien/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/vivien/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Device set to use cpu


✅ Results saved to: testkorpus_divers_50_tweebank_2.csv


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,The,DET,The_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


In [23]:
import pandas as pd
fptft2 = pd.read_csv("testkorpus_divers_50_tweebank_2.csv")
display(fptft2[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,Of,ADP,Of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,Comments,NOUN,Comments_NOUN
74,4,2020-05-11,On,On,ADP,On_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",”",",”",PUNCT,",”_PUNCT"


In [24]:
fptft2.shape

(1225, 6)

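Conclusion on Tweebank:
- emojis are converted to <unk>, plus @@ ?!
- MWEs appear to be recognized (e.g. Miss Universe), though this is probably a side effect of aggregation_strategy="simple", which merges adjacent tokens that share a tag
- emojis turn into @@ (the @@ is more likely the tokenizer's subword continuation marker)
- words are split incorrectly (especially around punctuation)
- @@ is always appended where a split occurs
- one character error was recognized well
- hashtags stay in one piece but are labelled X
- emojis are labelled as unknown and SYM
- links stay whole and are labelled X
- no difference in tokens and tags between the three code variants (0, 1, 2)
- lemma is always identical to the word for variants 1 & 2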
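
At least some of the `<unk>` rows line up with typographic apostrophes and quotes in the posts (compare `<unk> s` with `’ s` in the BERT output), which suggests that these characters are missing from the tokenizer's vocabulary. A possible pre-processing step, sketched here as an assumption rather than a tested fix, maps them to ASCII before tagging. `normalize_quotes` is an illustrative helper.

```python
# Translation table from typographic quotes to their ASCII counterparts
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Replace typographic quotes with ASCII quotes; everything else is kept."""
    return text.translate(QUOTE_MAP)

sample = normalize_quotes("Barr\u2019s Comments On \u201cMeet The Press\u201d")
print(sample)
```

Emojis are not touched by this step, so whether they still end up as `<unk>` would have to be checked separately.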

# Final decision and tagging of the data

I am going with... STANZA

In [None]:
import pandas as pd

# Read the DataFrame with all tokens and metadata (or reuse it from the previous step)
pos_df = pd.read_csv("factbase_posts_pos_tags_with_metadata.csv")

# Columns to group on (metadata is taken from the first row of each post)
metadata_cols = ["post_id", "author", "platform", "date", "time", "day", "month", "year", "text"]

# Aggregate the token columns as lists
grouped = pos_df.groupby("post_id").agg({
    **{col: 'first' for col in metadata_cols if col != "post_id"},  # metadata from the first row
    "word": list,
    "lemma": list,
    "pos": list,
    "lemma_p": list
}).reset_index()

# Optional: save
grouped.to_csv("factbase_posts_grouped_tokens.csv", index=False)

print(grouped.head(3))
## here the tokens are grouped into lists
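
One caveat with storing the grouped frame as CSV: list columns are written as their string representation, so reading the file back yields strings, not lists. A sketch of one way around this serialises each list as JSON. It uses only the standard library to stay self-contained; the column names follow the cell above and the sample rows are invented.

```python
import csv
import io
import json

rows = [
    {"post_id": 0, "word": ["Reminder", ":"], "pos": ["NOUN", "PUNCT"]},
    {"post_id": 1, "word": ["Thanks"], "pos": ["NOUN"]},
]

# Write: encode every list column as a JSON string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["post_id", "word", "pos"])
writer.writeheader()
for row in rows:
    writer.writerow({
        "post_id": row["post_id"],
        "word": json.dumps(row["word"]),
        "pos": json.dumps(row["pos"]),
    })

# Read: decode the JSON strings back into Python lists
buffer.seek(0)
restored = [
    {"post_id": int(r["post_id"]), "word": json.loads(r["word"]), "pos": json.loads(r["pos"])}
    for r in csv.DictReader(buffer)
]
print(restored[0]["word"])
```

With pandas the same idea would be `grouped["word"].apply(json.dumps)` before `to_csv` and `json.loads` on the column after `read_csv`.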

In [None]:
import pandas as pd

# Read the Stanza POS-tag DataFrame (or reuse it from the previous step)
pos_df = pd.read_csv("factbase_posts_pos_tags_stanza.csv")

# Metadata columns
metadata_cols = ["post_id", "author", "platform", "date", "time", "day", "month", "year", "text"]

# Group and aggregate the token columns as lists
grouped = pos_df.groupby("post_id").agg({
    **{col: 'first' for col in metadata_cols if col != "post_id"},
    "word": list,
    "lemma": list,
    "pos": list,
    "lemma_p": list
}).reset_index()

# Optional: save as CSV
grouped.to_csv("factbase_posts_grouped_stanza_tokens.csv", index=False)

print(grouped.head(3))