# 02 Testkorpora_Tagging

! F√ºr folgendes Notebook wurden verschiedene Kernel benutzt, da einige Modelle f√ºrs POS-Tagging besondere Anforderungen an die Version der Pakete hatten, welche mit anderen Paketversionen f√ºr andere Modelle nicht kompatibel sind.

## Tagger f√ºr POS
In diesem Jupyter Notebook wird ein Testkorpus testkorpus_divers_50.csv erstellt, welches verschiedene Schwierigkeiten wie Rechtschreibfehler, Hashes, @ und Emojis enth√§lt. Danach wird die Datei anhand mehrerer verschiedener Modelle getaggt, sodass verglichen werden kann, welches Modell am besten abschneidet. Die Entscheidung wird mit meinem pers√∂nlichen Eindruck begr√ºndet und nicht quantifiziert.
Im Folgenden werden immer die gleichen 50 Zeilen der Testkorpora gezeigt, um einen ersten Eindruck der Performance zu erhalten. F√ºr einen zuverl√§ssigen und quaitativ h√∂heren Eindruck √∂ffnete ich die Dateien allerdings alle einzeln und verglich Zeile f√ºr Zeile miteinander.

Verschiedene Tagsets:
- Penn Treebank Tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Universal Dependencies:
https://universaldependencies.org/u/pos/
https://huggingface.co/flair/upos-english

### Welcher Tagger eignet sich am Besten:
- Spacy
- Stanza
- Flair
- Bert
- Tweebank
- andere Varianten

#### Testkorpus, mit dem verschiedene Tagger getestet werden:

In [1]:
# Installation: conda install -c conda-forge pyspellchecker

In [2]:
# Ziel: ein m√∂glichst diverses Korpus erstellen, das alle relevanten F√§lle pr√ºft
import pandas as pd
import re
from spellchecker import SpellChecker
from IPython.display import display

spell = SpellChecker(language="en")
df = pd.read_csv("tta_final_clean.csv")

# Funktionen f√ºr verschiedene Post-Typen
def has_mention(x): return "@" in str(x)
def has_hashtag(x): return "#" in str(x)
def has_url(x): return re.search(r"http[s]?://", str(x)) is not None
def has_emoji(x): return re.search(r"[\U00010000-\U0010ffff]", str(x)) is not None
def is_long(x): return len(str(x)) > 200
def has_typo(x): return re.compile(
    r"\b("
    r"teh|recieve|definately|seperat(?:e|ely)|occured|untill|wich|"
    r"neccessary|adress|tomm?orow|becuase|wierd|yeee?s"
    r")\b",
    flags=re.IGNORECASE
)
def has_typo_spellchecker(text):
    words = str(text).split()
    misspelled = spell.unknown(words)   # W√∂rter, die nicht im W√∂rterbuch sind
    return len(misspelled) > 0

samples = []
# je 5 Beispiele (wenn vorhanden)
samples.append(df[df['text'].apply(is_long)].sample(n=5, random_state=1))
samples.append(df[df['text'].apply(has_mention)].sample(n=5, random_state=2))
samples.append(df[df['text'].apply(has_hashtag)].sample(n=5, random_state=3))
samples.append(df[df['text'].apply(has_url)].sample(n=5, random_state=4))
samples.append(df[df['text'].apply(has_emoji)].sample(n=5, random_state=5))
#samples.append(df[df['text'].apply(has_typo)].sample(n=5, random_state=6))
samples.append(df[df["text"].apply(has_typo_spellchecker)].sample(n=5, random_state=6))
#df["text"].apply(has_typo_spellchecker)

# Rest zuf√§llig auff√ºllen bis 50
already = pd.concat(samples)
remaining = df.drop(already.index)
rest = remaining.sample(n=50-len(already), random_state=42)

# finales Testkorpus
test_divers = pd.concat([already, rest]).sample(frac=1, random_state=99)
test_divers.to_csv("test_full.csv", index=False)
test_divers = test_divers[['date', 'id', 'text']]
display(test_divers.head(50))
test_divers.to_csv("testkorpus_divers_50.csv", index=False)
# Anmerkung: statt print verwende ich aufgrund der sch√∂neren Ansicht display

Unnamed: 0,date,id,text
41873,2010-11-04,3498743628,Reminder: The Miss Universe competition will b...
48694,2013-08-15,367977996541788160,@Timc1021 Thanks!
50189,2013-06-12,344775405057753088,"""""@_KatherineWebb: Looking forward to #MissUSA..."
14002,2011-09-04,110498268480198144,Addressing the Rise of Chronic Childhood Illn...
35985,2020-05-11,1259672385286012928,RT @darhar981: Attorney General Barr‚Äôs Office ...
11663,2011-09-07,111274504071323312,RT @MagaGlamüá∫üá∏‚ô•Ô∏è Bring Back Trump üíôüá∫üá∏
22194,2011-09-01,109360482610823936,
2495,2025-03-19,2755,The CBP Home App is now available across all m...
84855,2019-01-23,1087867453684834304,Congratulations to Mariano Rivera on unanimous...
63673,2015-07-01,616265476616900608,My recent statement re: @macys -- We must have...


In [3]:
test_divers.shape

(50, 3)

In [4]:
test_divers.to_json("testkorpus_divers_50.json", orient="records", force_ascii=False, indent=2)

# SpaCy
https://huggingface.co/spacy/en_core_web_sm

In [5]:
# !pip install spacy

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! @ben4appel ü§£ :) https://t.co/bsB6rVV7Yn")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
# besser w√§re, wenn die Hashes nicht zerlegt werden.
# und Emojis nicht als Eigennamen angesehen werden

Donald Donald PROPN NNP
Trump Trump PROPN NNP
posted post VERB VBD
a a DET DT
new new ADJ JJ
tweet tweet NOUN NN
. . PUNCT .
# # X ADD
realdonaldtrump realdonaldtrump NOUN NN
@realdonaldtrump @realdonaldtrump PROPN NNP
! ! PUNCT .
@ben4appel @ben4appel PROPN NNP
ü§£ ü§£ PROPN NNP
:) :) PUNCT :
https://t.co/bsB6rVV7Yn https://t.co/bsb6rvv7yn NOUN NN


In [7]:
# python -m spacy download en_core_web_sm

In [8]:
#### 01 SpaCy ####
import pandas as pd
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")
all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}",
            "tag": token.tag_
        })

sp = pd.DataFrame(all_results)
sp.to_csv("testkorpus_divers_50_spacy.csv", index=False)
display(sp[310:360])
# der Code funktioniert; allerdings werden Emojis, @ und Post-spezifische Dinge nicht erkannt.

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
310,16,skies,sky,NOUN,sky_NOUN,NNS
311,16,over,over,ADP,over_ADP,IN
312,16,Iran,Iran,PROPN,Iran_PROPN,NNP
313,16,.,.,PUNCT,._PUNCT,.
314,16,Iran,Iran,PROPN,Iran_PROPN,NNP
315,16,had,have,VERB,have_VERB,VBD
316,16,good,good,ADJ,good_ADJ,JJ
317,16,sky,sky,NOUN,sky_NOUN,NN
318,16,trackers,tracker,NOUN,tracker_NOUN,NNS
319,16,and,and,CCONJ,and_CCONJ,CC


In [9]:
import pandas as pd
sp = pd.read_csv("testkorpus_divers_50_spacy.csv")
display(sp[200:215])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
200,11,#,#,SYM,#_SYM,$
201,11,MadeInAmerica,MadeInAmerica,PROPN,MadeInAmerica_PROPN,NNP
202,11,event,event,NOUN,event_NOUN,NN
203,11,",",",",PUNCT,",_PUNCT",","
204,11,right,right,ADV,right_ADV,RB
205,11,here,here,ADV,here_ADV,RB
206,11,at,at,ADP,at_ADP,IN
207,11,the,the,DET,the_DET,DT
208,11,@WhiteHouse,@whitehouse,NOUN,@whitehouse_NOUN,NN
209,11,!,!,PUNCT,!_PUNCT,.


In [10]:
display(sp[1200:1242])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
1200,50,ALL,all,PRON,all_PRON,DT
1201,50,of,of,ADP,of_ADP,IN
1202,50,them,they,PRON,they_PRON,PRP
1203,50,were,be,AUX,be_AUX,VBD
1204,50,released,release,VERB,release_VERB,VBN
1205,50,into,into,ADP,into_ADP,IN
1206,50,our,our,PRON,our_PRON,PRP$
1207,50,Country,Country,PROPN,Country_PROPN,NNP
1208,50,.,.,PUNCT,._PUNCT,.
1209,50,Thanks,thank,NOUN,thank_NOUN,NNS


In [11]:
display(sp[70:110])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
70,5,‚Äôs,‚Äôs,PART,‚Äôs_PART,POS
71,5,Office,office,NOUN,office_NOUN,NN
72,5,Shreds,shred,VERB,shred_VERB,VBZ
73,5,NBC,NBC,PROPN,NBC_PROPN,NNP
74,5,‚Äôs,‚Äôs,PART,‚Äôs_PART,POS
75,5,Chuck,Chuck,PROPN,Chuck_PROPN,NNP
76,5,Todd,Todd,PROPN,Todd_PROPN,NNP
77,5,For,for,ADP,for_ADP,IN
78,5,‚Äò,',PUNCT,'_PUNCT,``
79,5,Deceptive,Deceptive,PROPN,Deceptive_PROPN,NNP


In [12]:
sp.shape # Wie viele Tags wurden vergeben?

(1242, 6)

In [13]:
#### 02 SpaCy ### mit ganzen S√§tzen zum Vergleichen ##
import pandas as pd
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text) or str(text).strip() == "":
        continue
    
    doc = nlp(str(text))
    
    for sent_id, sent in enumerate(doc.sents, start=1):
        sent_text = sent.text
        for token_id, token in enumerate(sent, start=1):
            all_results.append({
                "post_id": idx + 1,
                "date": row["date"],
                "sentence_id": sent_id,
                "token_id": token_id,
                "sentence_text": sent_text,
                "word": token.text,
                "lemma": token.lemma_,
                "pos": token.pos_,
                "lemma_p": f"{token.lemma_}_{token.pos_}"
            })

sp2 = pd.DataFrame(all_results)
sp2.to_csv("testkorpus_divers_50_spacy_2.csv", index=False, encoding="utf-8")
display(sp2[500:555])

Unnamed: 0,post_id,date,sentence_id,token_id,sentence_text,word,lemma,pos,lemma_p
500,25,2025-03-12,1,19,The United States of America is going to take ...,by,by,ADP,by_ADP
501,25,2025-03-12,1,20,The United States of America is going to take ...,other,other,ADJ,other_ADJ
502,25,2025-03-12,1,21,The United States of America is going to take ...,countries,country,NOUN,country_NOUN
503,25,2025-03-12,1,22,The United States of America is going to take ...,and,and,CCONJ,and_CCONJ
504,25,2025-03-12,1,23,The United States of America is going to take ...,",",",",PUNCT,",_PUNCT"
505,25,2025-03-12,1,24,The United States of America is going to take ...,frankly,frankly,ADV,frankly_ADV
506,25,2025-03-12,1,25,The United States of America is going to take ...,",",",",PUNCT,",_PUNCT"
507,25,2025-03-12,1,26,The United States of America is going to take ...,by,by,ADP,by_ADP
508,25,2025-03-12,1,27,The United States of America is going to take ...,incompetent,incompetent,ADJ,incompetent_ADJ
509,25,2025-03-12,1,28,The United States of America is going to take ...,U.S.,U.S.,PROPN,U.S._PROPN


In [14]:
import pandas as pd
sp2 = pd.read_csv("testkorpus_divers_50_spacy_2.csv")
display(sp2[70:110])

Unnamed: 0,post_id,date,sentence_id,token_id,sentence_text,word,lemma,pos,lemma_p
70,5,2020-05-11,1,7,RT @darhar981: Attorney General Barr‚Äôs Office ...,‚Äôs,‚Äôs,PART,‚Äôs_PART
71,5,2020-05-11,1,8,RT @darhar981: Attorney General Barr‚Äôs Office ...,Office,office,NOUN,office_NOUN
72,5,2020-05-11,1,9,RT @darhar981: Attorney General Barr‚Äôs Office ...,Shreds,shred,VERB,shred_VERB
73,5,2020-05-11,1,10,RT @darhar981: Attorney General Barr‚Äôs Office ...,NBC,NBC,PROPN,NBC_PROPN
74,5,2020-05-11,1,11,RT @darhar981: Attorney General Barr‚Äôs Office ...,‚Äôs,‚Äôs,PART,‚Äôs_PART
75,5,2020-05-11,1,12,RT @darhar981: Attorney General Barr‚Äôs Office ...,Chuck,Chuck,PROPN,Chuck_PROPN
76,5,2020-05-11,1,13,RT @darhar981: Attorney General Barr‚Äôs Office ...,Todd,Todd,PROPN,Todd_PROPN
77,5,2020-05-11,1,14,RT @darhar981: Attorney General Barr‚Äôs Office ...,For,for,ADP,for_ADP
78,5,2020-05-11,1,15,RT @darhar981: Attorney General Barr‚Äôs Office ...,‚Äò,',PUNCT,'_PUNCT
79,5,2020-05-11,1,16,RT @darhar981: Attorney General Barr‚Äôs Office ...,Deceptive,Deceptive,PROPN,Deceptive_PROPN


In [15]:
sp2.shape

(1242, 9)

In [16]:
#### SpaCy mit Twitter ####
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
import re

def create_twitter_tokenizer(nlp):
    # Erweiterte Infix-Regel f√ºr Hashtags und Mentions (z.B. #NLP, @user)
    infix_re = spacy.util.compile_infix_regex(
        nlp.Defaults.infixes + [r'(?<=\w)[#@](?=\w)']
    )
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = create_twitter_tokenizer(nlp)

all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}"
        })

spt = pd.DataFrame(all_results)
spt.to_csv("testkorpus_divers_50_spacy_twitter.csv", index=False, encoding="utf-8")
display(spt[450:500])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
450,21,https://t.co,https://t.co,PROPN,https://t.co_PROPN
451,21,/,/,SYM,/_SYM
452,21,v6z46rUDtg,v6z46rUDtg,PROPN,v6z46rUDtg_PROPN
453,22,Congress,Congress,PROPN,Congress_PROPN
454,22,must,must,AUX,must_AUX
455,22,approve,approve,VERB,approve_VERB
456,22,the,the,DET,the_DET
457,22,"deal,","deal,",NOUN,"deal,_NOUN"
458,22,without,without,ADP,without_ADP
459,22,all,all,PRON,all_PRON


In [17]:
import pandas as pd
spt = pd.read_csv("testkorpus_divers_50_spacy_twitter.csv")
display(spt[70:110])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
70,5,"Press,‚Äù","Press,‚Äù",NOUN,"Press,‚Äù_NOUN"
71,5,Todd,Todd,PROPN,Todd_PROPN
72,5,‚Ä¶,‚Ä¶,PUNCT,‚Ä¶_PUNCT
73,6,RT,RT,PROPN,RT_PROPN
74,6,,,SPACE,_SPACE
75,6,@MagaGlam,@magaglam,SYM,@magaglam_SYM
76,6,üá∫,üá∫,X,üá∫_X
77,6,üá∏,üá∏,NOUN,üá∏_NOUN
78,6,‚ô•,‚ô•,PROPN,‚ô•_PROPN
79,6,Ô∏è,Ô∏è,X,Ô∏è_X


In [18]:
spt.shape

(1244, 5)

#### Fazit zu SpaCy:
- Hashes oft getrennt und dann falsch getrennt oder Hashes als SYM und Folgendes als PROPN
- @ meistens als X oder NOUN getaggt, aber richtig beibehalten
- Barr's getrennt als Barr PROPN und 's als PART
- Emojis getrennt und als Satz analysiert (ebenso wie Hash und @ als Satz analysiert werden), falsch
- Lemmatisierung nicht so gut
- LL.Bean richtig erkannt
- Tagging nicht ganz richtig
#### spacy2: 
- shreds richtig als shred erkannt
- besser lemmatisiert
- Hashes gut getrennt, aber als Satz analysiert
- Links gut beibehalten, Tagging komisch
- trotzdem Problem mit Hash und @
#### mit twitter:
- Satzzeichen werden nicht richtig getrennt
- schlechte Tokenisierung
- @ und Hash gut
- nicht alle Tags passen (z.B. Meet NOUN the DET press NOUN, USA als ADV)
- Emojis nicht gut erkannt
- manche Links werden zerh√§kselt, manche nicht
- Lemmatisierung gut

# Stanza
https://huggingface.co/stanfordnlp/stanza-en

In [19]:
# !pip install stanza

In [20]:
#### das englische Modell f√ºr Stanza #### basierend auf UD
import pandas as pd
import stanza
stanza.download('en')  
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma', use_gpu=False)
df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for sentence in doc.sentences:
        for token in sentence.tokens:
            word = token.text
            word_info = token.words[0]
            lemma = word_info.lemma
            pos = word_info.upos
            lemma_p = f"{lemma}_{pos}"
            
            all_results.append({
                "post_id": idx + 1,
                "date": row["date"],
                "text": text,
                "word": word,
                "lemma": lemma,
                "pos": pos,
                "lemma_p": lemma_p
            })

sta = pd.DataFrame(all_results)
sta.to_csv("testkorpus_divers_50_stanza.csv", index=False, encoding="utf-8")
display(sta[612:665])

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 13:51:13 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 13:51:13 INFO: Downloading default packages for language: en (English) ...
2025-09-13 13:51:16 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 13:51:22 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 13:51:22 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 13:56:22 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 13:56:23 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-13 13:56:23 INFO: Using device: cpu
2025-09-13 13:56:23 INFO: Loading: tokenize
2025-09-13 13:56:26 INFO: Loading: mwt
2025-09-13 13:56:26 INFO: Loading: pos
2025-09-13 13:56:29 INFO: Loading: lemma
2025-09-13 13:56:30 INFO: Done loading processors!


Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
612,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",#,#,SYM,#_SYM
613,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",2016election,2016election,PROPN,2016election_PROPN
614,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
615,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
616,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,The,the,DET,the_DET
617,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,Zimmerman,Zimmerman,PROPN,Zimmerman_PROPN
618,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,trial,trial,NOUN,trial_NOUN
619,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,is,be,AUX,be_AUX
620,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,over,over,ADV,over_ADV
621,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,.,.,PUNCT,._PUNCT


In [21]:
import pandas as pd
sta = pd.read_csv("testkorpus_divers_50_stanza.csv")
display(sta[200:215])

Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
200,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,it,it,PRON,it_PRON
201,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,is,be,AUX,be_AUX
202,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,the,the,DET,the_DET
203,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,BEST,good,ADJ,good_ADJ
204,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,!,!,PUNCT,!_PUNCT
205,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,USA,USA,PROPN,USA_PROPN
206,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,üá∫,üá∫,PUNCT,üá∫_PUNCT
207,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,üá∏,üá∏,PUNCT,üá∏_PUNCT
208,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,https://t.co/q4vB9GdE5y,https://t.co/q4vB9GdE5y,PROPN,https://t.co/q4vB9GdE5y_PROPN
209,12,2011-08-31,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,PROPN,https://www.mediaite.com/tv/trump-team-scored-...


In [22]:
sta.shape

(1190, 7)

In [23]:
#### Stanza + Tweebank ####
import pandas as pd
import stanza
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

stanza.download('en')
stanza_nlp = stanza.Pipeline(
    lang='en',
    processors='tokenize,pos,lemma',
    use_gpu=False
)

# Tweebank
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

tweebank_tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=-1  # CPU
)

df = pd.read_csv("testkorpus_divers_50.csv")
df["text"] = df["text"].fillna("").astype(str)

all_results = []
for idx, row in df.iterrows():
    text = row["text"].strip()
    if not text:
        continue

    # Stanza-Analyse (Lemma + Universal POS)
    stanza_doc = stanza_nlp(text)
    stanza_tokens = []
    for sentence in stanza_doc.sentences:
        for token in sentence.tokens:
            word = token.text
            word_info = token.words[0]
            stanza_tokens.append({
                "word": word,
                "lemma": word_info.lemma,
                "upos": word_info.upos
            })

    # Tweebank-Analyse (Tweet-POS-Tags)
    tweebank_result = tweebank_tagger(text)
    tweebank_tokens = [tok.get("word") for tok in tweebank_result]
    tweebank_pos = [tok.get("entity_group") for tok in tweebank_result]

    # Token-Alignierung
    # gleiche Reihenfolge matchen
    min_len = min(len(stanza_tokens), len(tweebank_tokens))
    for i in range(min_len):
        word = stanza_tokens[i]["word"]
        lemma = stanza_tokens[i]["lemma"]
        upos = stanza_tokens[i]["upos"]
        tweet_pos = tweebank_pos[i]

        all_results.append({
            "post_id": idx + 1,
            "date": row.get("date"),
            "word": word,
            "lemma": lemma,
            "upos": upos,          # Universal POS (Stanza)
            "tweebank_pos": tweet_pos,  # Tweet-spezifisches POS
            "lemma_p": f"{lemma}_{upos}"
        })

stat = pd.DataFrame(all_results)
output_file = "testkorpus_divers_50_stanza_tweebank.csv"
stat.to_csv(output_file, index=False, encoding="utf-8")
display(stat.head(30))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:02:31 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:02:31 INFO: Downloading default packages for language: en (English) ...
2025-09-13 14:02:34 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 14:02:39 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 14:02:39 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:07:39 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:07:40 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-13 14:07:40 INFO: Using device: cpu
2025-09-13 14:07:40 INFO: Loading: tokenize
2025-09-13 14:07:40 INFO: Loading: mwt
2025-09-13 14:07:40 INFO: Loading: pos
2025-09-13 14:07:42 INFO: Loading: lemma
2025-09-13 14:07:43 INFO: Done loading processors!
Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,upos,tweebank_pos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NOUN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,PUNCT,:_PUNCT
2,1,2010-11-04,The,the,DET,DET,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,PROPN,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NOUN,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,AUX,competition_NOUN
6,1,2010-11-04,will,will,AUX,ADJ,will_AUX
7,1,2010-11-04,be,be,AUX,ADP,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,DET,live_ADJ
9,1,2010-11-04,from,from,ADP,PROPN,from_ADP


In [24]:
import pandas as pd
stat = pd.read_csv("testkorpus_divers_50_stanza_tweebank.csv")
display(stat[:60])

Unnamed: 0,post_id,date,word,lemma,upos,tweebank_pos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NOUN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,PUNCT,:_PUNCT
2,1,2010-11-04,The,the,DET,DET,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,PROPN,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NOUN,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,AUX,competition_NOUN
6,1,2010-11-04,will,will,AUX,ADJ,will_AUX
7,1,2010-11-04,be,be,AUX,ADP,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,DET,live_ADJ
9,1,2010-11-04,from,from,ADP,PROPN,from_ADP


In [25]:
stat.shape

(1081, 7)

In [26]:
# kurzer Test
import stanza

stanza.download("en")  
nlp = stanza.Pipeline(
    lang="en", 
    processors="tokenize,pos,lemma", 
    tokenize_pretokenized=False,
    use_gpu=False, 
    tokenize_with_spacy=False, 
    tokenize_no_ssplit=False,
    tokenize_engine="tokenize/tweet"
)

# Beispieltext
text = "Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! @ben4appel ü§£ :) https://t.co/bsB6rVV7Yn #Canada"

doc = nlp(text)

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.xpos)
# Hashes werden zu 50% getrennt und zu 50% ganz gelassen.

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:15:07 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:15:07 INFO: Downloading default packages for language: en (English) ...
2025-09-13 14:15:10 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 14:15:14 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 14:15:14 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:20:15 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:20:16 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-13 14:20:16 INFO: Using device: cpu
2025-09-13 14:20:16 INFO: Loading: tokenize
2025-09-13 14:20:16 INFO: Loading: mwt
2025-09-13 14:20:16 INFO: Loading: pos
2025-09-13 14:20:19 INFO: Loading: lemma
2025-09-13 14:20:20 INFO: Done loading processors!


Donald Donald PROPN NNP
Trump Trump PROPN NNP
posted post VERB VBD
a a DET DT
new new ADJ JJ
tweet tweet NOUN NN
. . PUNCT .
#realdonaldtrump #realdonaldtrump PROPN NNP
@realdonaldtrump @realdonaldtrump PROPN NNP
! ! PUNCT .
@ben4appel @ben4appel PROPN ADD
ü§£ ü§£ PUNCT .
:) :) SYM NFP
https://t.co/bsB6rVV7Yn https://t.co/bsB6rVV7Yn PROPN ADD
# # SYM NN
Canada Canada PROPN NNP


In [27]:
## mit Twitter-Tokenizer
import pandas as pd
import stanza
import emoji

stanza.download("en")

# Stanza-Pipeline mit Tweet-Tokenizer
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,lemma",
    tokenize_pretokenized=False,
    use_gpu=False,
    tokenize_with_spacy=False,
    tokenize_no_ssplit=False,
    tokenize_engine="tokenize/tweet"
)

df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []

for idx, row in df.iterrows():
    text = row.get("text")
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        continue

    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            lemma_p = f"{word.lemma}_{word.upos}"
            all_results.append({
                "post_id": idx + 1,
                "date": row.get("date"),
                "word": word.text,
                "lemma": word.lemma,
                "pos": word.upos,   
                "xpos": word.xpos,
                "lemma_p": lemma_p,
            })

statw = pd.DataFrame(all_results)
output_file = "testkorpus_divers_50_stanza_tweets.csv"
statw.to_csv(output_file, index=False, encoding="utf-8")
display(statw.head(20))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:25:21 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:25:21 INFO: Downloading default packages for language: en (English) ...
2025-09-13 14:25:25 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 14:25:31 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 14:25:31 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:30:31 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:30:32 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-13 14:30:32 INFO: Using device: cpu
2025-09-13 14:30:32 INFO: Loading: tokenize
2025-09-13 14:30:33 INFO: Loading: mwt
2025-09-13 14:30:33 INFO: Loading: pos
2025-09-13 14:30:35 INFO: Loading: lemma
2025-09-13 14:30:38 INFO: Done loading processors!


Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,:,:_PUNCT
2,1,2010-11-04,The,the,DET,DT,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,NNP,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NNP,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,NN,competition_NOUN
6,1,2010-11-04,will,will,AUX,MD,will_AUX
7,1,2010-11-04,be,be,AUX,VB,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,JJ,live_ADJ
9,1,2010-11-04,from,from,ADP,IN,from_ADP


In [28]:
import pandas as pd
statw = pd.read_csv("testkorpus_divers_50_stanza_tweets.csv")
display(statw[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
200,11,2017-07-21,MADE,made,VERB,VBN,made_VERB
201,11,2017-07-21,IN,in,ADP,IN,in_ADP
202,11,2017-07-21,AMERICA,America,PROPN,NNP,America_PROPN
203,11,2017-07-21,",",",",PUNCT,",",",_PUNCT"
204,11,2017-07-21,it,it,PRON,PRP,it_PRON
205,11,2017-07-21,is,be,AUX,VBZ,be_AUX
206,11,2017-07-21,the,the,DET,DT,the_DET
207,11,2017-07-21,BEST,good,ADJ,JJS,good_ADJ
208,11,2017-07-21,!,!,PUNCT,.,!_PUNCT
209,11,2017-07-21,USA,USA,PROPN,NNP,USA_PROPN


In [29]:
statw.shape
# taggt sehr gut, allerdings wird nicht jedes Emoji richtig als NFP erkannt, 
# sondern oft nur als Satzzeichen.

(1207, 7)

In [30]:
# Stanza + Tweet-Tokenizer + Emojivariante
import pandas as pd
import stanza
import re

stanza.download("en")

# Stanza-Pipeline mit Tweet-Tokenizer
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,lemma",
    tokenize_pretokenized=False,
    use_gpu=False,
    tokenize_with_spacy=False,
    tokenize_no_ssplit=False,
    tokenize_engine="tokenize/tweet"
)


# Unicode-Emoji-Regex
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # Emoticons
    u"\U0001F300-\U0001F5FF"  # Symbole & Piktogramme
    u"\U0001F680-\U0001F6FF"  # Transport & Symbole
    u"\U0001F1E0-\U0001F1FF"  # Flaggen
    "]+", flags=re.UNICODE)

# klassische Smileys
smiley_pattern = re.compile(r'[:;=8][\-~]?[)D]', flags=re.UNICODE)

def is_emoji_or_smiley(token):
    return bool(emoji_pattern.fullmatch(token)) or bool(smiley_pattern.fullmatch(token))


df = pd.read_csv("testkorpus_divers_50.csv")
all_results = []

for idx, row in df.iterrows():
    text = row.get("text")
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        continue

    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            xpos = word.xpos
            if is_emoji_or_smiley(word.text):
                xpos = "NFP"

            lemma_p = f"{word.lemma}_{word.upos}"
            all_results.append({
                "post_id": idx + 1,
                "date": row.get("date"),
                "word": word.text,
                "lemma": word.lemma,
                "pos": word.upos,   
                "xpos": xpos,
                "lemma_p": lemma_p,
            })

statw2 = pd.DataFrame(all_results)
output_file = "testkorpus_divers_50_stanza_tweets2.csv"
statw2.to_csv(output_file, index=False, encoding="utf-8")
display(statw2.head(20))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:36:34 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:36:34 INFO: Downloading default packages for language: en (English) ...
2025-09-13 14:36:37 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 14:36:42 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 14:36:42 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 14:41:42 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 14:41:43 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-13 14:41:43 INFO: Using device: cpu
2025-09-13 14:41:43 INFO: Loading: tokenize
2025-09-13 14:41:43 INFO: Loading: mwt
2025-09-13 14:41:43 INFO: Loading: pos
2025-09-13 14:41:46 INFO: Loading: lemma
2025-09-13 14:41:47 INFO: Done loading processors!


Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,:,:_PUNCT
2,1,2010-11-04,The,the,DET,DT,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,NNP,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NNP,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,NN,competition_NOUN
6,1,2010-11-04,will,will,AUX,MD,will_AUX
7,1,2010-11-04,be,be,AUX,VB,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,JJ,live_ADJ
9,1,2010-11-04,from,from,ADP,IN,from_ADP


In [31]:
import pandas as pd
statw2 = pd.read_csv("testkorpus_divers_50_stanza_tweets2.csv")
display(statw2[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
200,11,2017-07-21,MADE,made,VERB,VBN,made_VERB
201,11,2017-07-21,IN,in,ADP,IN,in_ADP
202,11,2017-07-21,AMERICA,America,PROPN,NNP,America_PROPN
203,11,2017-07-21,",",",",PUNCT,",",",_PUNCT"
204,11,2017-07-21,it,it,PRON,PRP,it_PRON
205,11,2017-07-21,is,be,AUX,VBZ,be_AUX
206,11,2017-07-21,the,the,DET,DT,the_DET
207,11,2017-07-21,BEST,good,ADJ,JJS,good_ADJ
208,11,2017-07-21,!,!,PUNCT,.,!_PUNCT
209,11,2017-07-21,USA,USA,PROPN,NNP,USA_PROPN


In [32]:
statw2.shape

(1207, 7)

Fazit zu Stanza:
- weniger Tags als alle anderen Modelle
- @ werden meistens gut erkannt (und so belassen wie sie waren)
- Emojis werden als Punkt erkannt
- Hashes werden meist getrennt
- Links als Eigenname, aber werden ganz gelassen
- richtige Lemmatisierung
- meistens richtiges Tagging
- Barr's wird zu Barr als PROPN - Tweebank trennt in Barr 's auf
- Shreds und its (Rechtschreibfehler) falsch erkannt, ol' als Noun (eigentl. old, Tweebank hatte ol' als ADJ)
- doesn't wird zu do (lemma)
- in Kombination mit Tweebank: leider sehr schlechtes Tokenisieren (zu viele X)
##### mit Tweet-Tokenizer: 
- super
- am besten xpos verwenden, da diese Tweet-spezifische Tags beinhalten
- Hashes und @ werden gut erkannt und entsprechend getaggt
- Links bei xpos mit ADD getagt, bei pos (also nicht tweetspezifisch) mit PROPN
- Emojis werden meist als Satzzeichen getaggt und nur selten als NFP
- Barr's wird aufgeteilt in Barr und 's
- Hashes und @ werden allerdings auch genau so oft getrennt, wie sie zusammen gelassen werden
- zwei neue Tags f√ºr Tweets: ADD und NFP
##### Tweet Tokenizer 2:
- Emojis werden besser erkannt als NFP
- der Rest ist identisch mit dem ersten Tweet-Tokenizer Modell

# Flair
- https://huggingface.co/flair/pos-english: F1 Score: 98,18
- https://huggingface.co/flair/pos-english-fast: 98,10
- https://huggingface.co/flair/upos-english: 98,6
- https://huggingface.co/flair/upos-english-fast: 98,47

In [33]:
# !pip install flair # bei Modell pos oder upos w√§hlbar

In [34]:
#### Flair mit UPOS ####
import logging
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger

logging.getLogger("flair").setLevel(logging.ERROR)
df = pd.read_csv("testkorpus_divers_50.csv")
df_nonempty = df[df["text"].notna()].copy()
tagger = SequenceTagger.load("flair/upos-english")
label_type = tagger.label_type

sentences = [Sentence(str(t)) for t in df_nonempty["text"]]
tagger.predict(sentences, mini_batch_size=32)

all_results = []
for row, sentence in zip(df_nonempty.itertuples(index=True), sentences):
    for token in sentence:
        # POS-Label 
        pos_label = token.get_label(label_type).value if token.has_label(label_type) else None

        all_results.append({
            "post_id": row.Index,
            "date": getattr(row, "date", None),
            "word": token.text,
            "lemma": token.text.lower(),           # Flair liefert kein Lemma
            "pos": pos_label,
            "lemma_p": f"{token.text.lower()}_{pos_label}" if pos_label else token.text.lower()
        })
        
fl = pd.DataFrame(all_results)
fl.to_csv("testkorpus_divers_50_flair.csv", index=False)
display(fl.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,INTJ,reminder_INTJ
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,PROPN,the_PROPN
3,0,2010-11-04,Miss,miss,PROPN,miss_PROPN
4,0,2010-11-04,Universe,universe,PROPN,universe_PROPN
5,0,2010-11-04,competition,competition,PROPN,competition_PROPN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,VERB,be_VERB
8,0,2010-11-04,LIVE,live,VERB,live_VERB
9,0,2010-11-04,from,from,ADP,from_ADP


In [35]:
import pandas as pd
fl = pd.read_csv("testkorpus_divers_50_flair.csv")
display(fl[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,NUM,shreds_NUM
71,4,2020-05-11,NBC,nbc,SYM,nbc_SYM
72,4,2020-05-11,‚Äôs,‚Äôs,NUM,‚Äôs_NUM
73,4,2020-05-11,Chuck,chuck,NOUN,chuck_NOUN
74,4,2020-05-11,Todd,todd,SYM,todd_SYM
75,4,2020-05-11,For,for,NUM,for_NUM
76,4,2020-05-11,‚Äò,‚Äò,SYM,‚Äò_SYM
77,4,2020-05-11,Deceptive,deceptive,NUM,deceptive_NUM
78,4,2020-05-11,Editing‚Äô,editing‚Äô,SYM,editing‚Äô_SYM
79,4,2020-05-11,Of,of,NUM,of_NUM


In [36]:
fl.shape

(1357, 6)

In [37]:
#### Flair mit POS ####
import logging
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger

logging.getLogger("flair").setLevel(logging.ERROR)
df = pd.read_csv("testkorpus_divers_50.csv")
df_nonempty = df[df["text"].notna()].copy()
tagger = SequenceTagger.load("flair/pos-english")
label_type = tagger.label_type

sentences = [Sentence(str(t)) for t in df_nonempty["text"]]
tagger.predict(sentences, mini_batch_size=32)

all_results = []
for row, sentence in zip(df_nonempty.itertuples(index=True), sentences):
    for token in sentence:
        # POS-Label 
        pos_label = token.get_label(label_type).value if token.has_label(label_type) else None

        all_results.append({
            "post_id": row.Index,
            "date": getattr(row, "date", None),
            "word": token.text,
            "lemma": token.text.lower(),           # Flair liefert kein Lemma
            "pos": pos_label,
            "lemma_p": f"{token.text.lower()}_{pos_label}" if pos_label else token.text.lower()
        })
        
fl2 = pd.DataFrame(all_results)
fl2.to_csv("testkorpus_divers_50_flair.csv", index=False)
display(fl2.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,:,:,:_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,miss,NNP,miss_NNP
4,0,2010-11-04,Universe,universe,NNP,universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,RB,live_RB
9,0,2010-11-04,from,from,IN,from_IN


In [38]:
import pandas as pd
fl2 = pd.read_csv("testkorpus_divers_50_flair.csv")
display(fl2[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,VBZ,shreds_VBZ
71,4,2020-05-11,NBC,nbc,NNP,nbc_NNP
72,4,2020-05-11,‚Äôs,‚Äôs,VBZ,‚Äôs_VBZ
73,4,2020-05-11,Chuck,chuck,NNP,chuck_NNP
74,4,2020-05-11,Todd,todd,NNP,todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‚Äò,‚Äò,``,‚Äò_``
77,4,2020-05-11,Deceptive,deceptive,JJ,deceptive_JJ
78,4,2020-05-11,Editing‚Äô,editing‚Äô,NN,editing‚Äô_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [39]:
fl2.shape

(1357, 6)

Auf den ersten Blick kommt flair mit POS Tagging nicht sehr gut klar, wie SpaCy oder Stanza.

In [40]:
#### Flair und SpaCy ####
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
tagger = SequenceTagger.load("pos-fast")
nlp = spacy.load("en_core_web_sm")

all_results = []

for idx, row in df.iterrows():
    text = row['text']
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))

    flair_sentence = Sentence(str(text))
    tagger.predict(flair_sentence)

    # Achtung: Flair und SpaCy tokenisieren unterschiedlich!
    if len(flair_sentence) == len(spacy_doc):
        for flair_token, spacy_token in zip(flair_sentence, spacy_doc):
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": spacy_token.lemma_, #f√ºr Lemma Spacy verwenden
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{spacy_token.lemma_}_{flair_token.get_label('pos').value}"
            })
    else:
        # Falls Tokenanzahl nicht √ºbereinstimmt
        for flair_token in flair_sentence:
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": flair_token.text.lower(),
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{flair_token.text.lower()}_{flair_token.get_label('pos').value}"
            })

flsp = pd.DataFrame(all_results)
flsp.to_csv("testkorpus_divers_50_flair_spacy.csv", index=False)
display(flsp.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,:,:,:_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,miss,NNP,miss_NNP
4,0,2010-11-04,Universe,universe,NNP,universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,JJ,live_JJ
9,0,2010-11-04,from,from,IN,from_IN


In [41]:
import pandas as pd
flsp = pd.read_csv("testkorpus_divers_50_flair_spacy.csv")
display(flsp[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,VBZ,shreds_VBZ
71,4,2020-05-11,NBC,nbc,NNP,nbc_NNP
72,4,2020-05-11,‚Äôs,‚Äôs,VBZ,‚Äôs_VBZ
73,4,2020-05-11,Chuck,chuck,NNP,chuck_NNP
74,4,2020-05-11,Todd,todd,NNP,todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‚Äò,‚Äò,``,‚Äò_``
77,4,2020-05-11,Deceptive,deceptive,JJ,deceptive_JJ
78,4,2020-05-11,Editing‚Äô,editing‚Äô,NN,editing‚Äô_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [42]:
flsp.shape

(1357, 6)

Fazit zu Flair:
- Tagging trift oft nicht zu (zu viel NUM und SYM) vor allem bei Namen
- Zerteilung der Links
- Trennung der Hashes und @ (immerhin wortweise)
- Emojis als SYM und gut getrennt
flair_spacy: anderes Tagset
- Lemmatisierung nicht ganz richtig (nur lower)
- Emojis auch gut getrennt, aber als NFP
flair-spacy2: auch anderes Tagset
- Lemmatisierung schlechter
- Tagging ok
- Emojis auch als NFP, nicht so gut getrennt
- LL. Bean wird als gemeinsamer Ausdruck erkannt! (Bei Stanza, Bert und Tweebank nicht)
- ol' richtig als JJ
- @ als CC
- Links als eigener Satz getaggt, was es falsch macht und dadurch zu viel aufsplittet

# Bert
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos

In [43]:
#### Bert mit SpaCy ####
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

model_name = "vblagoje/bert-english-uncased-finetuned-pos"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

nlp = spacy.load("en_core_web_sm")
id2label = model.config.id2label

df = pd.read_csv("testkorpus_divers_50.csv")
text_col = "text"

results = []


for idx, row in df.iterrows():
    text = row.get(text_col)
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))

    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]

    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits
    predictions = torch.argmax(logits, dim=2)[0].tolist()

    for idx_token, pred_id in enumerate(predictions):
        start, end = offset_mappings[idx_token].tolist()
        if start == end:
            continue

        word_text = text[start:end]
        pos_tag = id2label[pred_id]
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()

        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

ber = pd.DataFrame(results)
ber.to_csv("testkorpus_divers_50_bert.csv", index=False)
display(ber.head(40))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [44]:
import pandas as pd
ber = pd.read_csv("testkorpus_divers_50_bert.csv")
display(ber[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,‚Äô,‚Äôs,PART,‚Äôs_PART
116,4,2020-05-11,s,‚Äôs,PART,‚Äôs_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,‚Äú,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,‚Äù,"""",PUNCT,"""_PUNCT"


In [45]:
#### Bert2 ####
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")

model_name = "vblagoje/bert-english-uncased-finetuned-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

pos_tagger = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))
    hf_pos = pos_tagger(str(text))
    spacy_tokens = list(spacy_doc)
    
    def get_spacy_token_by_offset(start, end):
        for token in spacy_tokens:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if token_start <= end and token_end >= start:
                return token
        return None
    
    for entity in hf_pos:
        word = entity['word']
        start = entity['start']
        end = entity['end']
        pos_tag = entity['entity_group']  # z.B. 'NOUN', 'VERB'
        
        spacy_token = get_spacy_token_by_offset(start, end)
        lemma = spacy_token.lemma_ if spacy_token else word.lower()
        
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

bert = pd.DataFrame(results)
bert.to_csv("testkorpus_divers_50_bert2.csv", index=False)
display(bert.head(50))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,reminder,PUNCT,reminder_PUNCT
2,0,2010-11-04,the,the,DET,the_DET
3,0,2010-11-04,miss universe,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN
5,0,2010-11-04,will be,will,AUX,will_AUX
6,0,2010-11-04,live,live,ADJ,live_ADJ
7,0,2010-11-04,from,from,ADP,from_ADP
8,0,2010-11-04,the,the,DET,the_DET
9,0,2010-11-04,bahamas,Bahamas,PROPN,Bahamas_PROPN


In [46]:
import pandas as pd
fptfb = pd.read_csv("testkorpus_divers_50_bert2.csv")
display(bert[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
200,10,2017-07-21,hosted,host,VERB,host_VERB
201,10,2017-07-21,a,a,DET,a_DET
202,10,2017-07-21,#,#,SYM,#_SYM
203,10,2017-07-21,madeinamerica,#,X,#_X
204,10,2017-07-21,event,event,NOUN,event_NOUN
205,10,2017-07-21,",",event,PUNCT,event_PUNCT
206,10,2017-07-21,right here,right,ADV,right_ADV
207,10,2017-07-21,at,at,ADP,at_ADP
208,10,2017-07-21,the,the,DET,the_DET
209,10,2017-07-21,@,@whitehouse,X,@whitehouse_X


In [47]:
bert.shape

(1559, 6)

In [48]:
bert[70:110]
# das Lemma ist v√∂llig falsch
# @darhar wird als Satzzeichen erkannt?!

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,:,@darhar981,PUNCT,@darhar981_PUNCT
71,4,2020-05-11,attorney general barr,Attorney,PROPN,Attorney_PROPN
72,4,2020-05-11,‚Äô s,Barr,PART,Barr_PART
73,4,2020-05-11,office,office,NOUN,office_NOUN
74,4,2020-05-11,shreds,shred,VERB,shred_VERB
75,4,2020-05-11,nbc,NBC,PROPN,NBC_PROPN
76,4,2020-05-11,‚Äô s,NBC,PART,NBC_PART
77,4,2020-05-11,chuck todd,Chuck,PROPN,Chuck_PROPN
78,4,2020-05-11,for,for,ADP,for_ADP
79,4,2020-05-11,‚Äò,',PUNCT,'_PUNCT


In [49]:
#### Bert3 ### ohne Pipeline
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

model_name = "vblagoje/bert-english-uncased-finetuned-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()
nlp = spacy.load("en_core_web_sm")
id2label = model.config.id2label

df = pd.read_csv("testkorpus_divers_50.csv")

results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    spacy_doc = nlp(str(text))
    
    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]
    
    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits  
    predictions = torch.argmax(logits, dim=2)[0].tolist()
    
    for idx_token, pred_id in enumerate(predictions):
        start, end = offset_mappings[idx_token].tolist()
        if start == end:
            continue
        
        word_text = text[start:end]
        pos_tag = id2label[pred_id]
        
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()
        
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

bert3 = pd.DataFrame(results)
bert3.to_csv("testkorpus_divers_50_bert3.csv", index=False)
display(bert3.head(40))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [50]:
import pandas as pd
bert3 = pd.read_csv("testkorpus_divers_50_bert3.csv")
display(bert3[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,‚Äô,‚Äôs,PART,‚Äôs_PART
116,4,2020-05-11,s,‚Äôs,PART,‚Äôs_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,‚Äú,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,‚Äù,"""",PUNCT,"""_PUNCT"


In [51]:
bert3.shape

(2024, 6)

In [52]:
bert3[1980:2024]
# Links und @ werden seltsam zerlegt

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
1980,49,2025-03-01,were,be,AUX,be_AUX
1981,49,2025-03-01,released,release,VERB,release_VERB
1982,49,2025-03-01,into,into,ADP,into_ADP
1983,49,2025-03-01,our,our,PRON,our_PRON
1984,49,2025-03-01,Country,Country,NOUN,Country_NOUN
1985,49,2025-03-01,.,.,PUNCT,._PUNCT
1986,49,2025-03-01,Thanks,thank,NOUN,thank_NOUN
1987,49,2025-03-01,to,to,ADP,to_ADP
1988,49,2025-03-01,the,the,DET,the_DET
1989,49,2025-03-01,Trump,Trump,PROPN,Trump_PROPN


Fazit zu Bert:
- die Links (und auch andere Abk√ºrzungen wie CBP) werden in viel zu viele Einzelteile zerlegt und falsch bestimmt (Zahlen und sonstige W√∂rter auch: USA(Emoji) wird zu usa(emoji) und SYM, also falsch getrennt)
- Hashes und @ werden getrennt (#MadeInAmerica wird zu # Made InA meric a)
- Emojis werden als SYM erkannt (und wenn falsch getrennt auch der Rest des Wortes)
- rchtige Lemmatisierung
- ca. 7-8.000 Tags mehr als die anderen Tagger (wegen der Links wahrscheinlich)
- bert3: Lemmatisierung schlecht und Trennung der W√∂rter schlecht (Opioid wird zu Op io id) 
- bert2: Uhrzeit mit ## erg√§nzt?, Lemmatisierung noch schlechter

## Tweebank

In [53]:
# Die √§ltere Version TweebankNLP unterst√ºtzt kein POS-Tagging.
import tweetnlp
print(list(tweetnlp.loader.TASK_CLASS.keys()))

['sentiment', 'offensive', 'irony', 'hate', 'emotion', 'emoji', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary', 'topic_classification', 'ner', 'language_model', 'sentence_embedding', 'question_answering', 'question_answer_generation']


In [54]:
import torch
print(torch.__version__)

2.6.0


In [55]:
#### Tweebank ####
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
df = pd.read_csv("testkorpus_divers_50.csv")

df["text"] = df["text"].fillna("").astype(str)
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Device: CPU
device = -1

tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=device
)

results = []

for idx, row in df.iterrows():
    text = row["text"].strip()
    
    if text:
        tagged = tagger(text)
        for token_info in tagged:
            word_text = token_info.get("word")
            pos_tag = token_info.get("entity_group")
            lemma = word_text
            results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": word_text,
                "lemma": lemma,
                "pos": pos_tag,
                "lemma_p": f"{lemma}_{pos_tag}"
            })
    else:
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": "",
            "lemma": "",
            "pos": "",
            "lemma_p": ""
        })

twe = pd.DataFrame(results)
twe.to_csv("testkorpus_divers_50_tweebank_1.csv", index=False, encoding="utf-8")
display(twe.head())

Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,The,DET,The_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


In [56]:
import pandas as pd
twe = pd.read_csv("testkorpus_divers_50_tweebank_1.csv")
display(twe[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,Of,ADP,Of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,Comments,NOUN,Comments_NOUN
74,4,2020-05-11,On,On,ADP,On_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",‚Äù",",‚Äù",PUNCT,",‚Äù_PUNCT"


In [57]:
twe.shape
# Emojis durch @ ersetzt??

(1225, 6)

In [58]:
#### Tweebank mit SpaCy ####
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

nlp = spacy.load("en_core_web_sm")

results = []

for idx, row in df.iterrows():
    text = str(row["text"])
    date = row.get("date", None)

    # POS-Tagging mit Tweebank
    pos_tags = tagger(text)

    # Lemmatisierung mit spaCy
    doc = nlp(text)

    lemma_map = {token.text: token.lemma_ for token in doc}

    for token_info in pos_tags:
        word_text = token_info["word"]
        pos_tag = token_info["entity_group"]
        lemma = lemma_map.get(word_text, word_text)

        results.append({
            "post_id": idx,
            "date": date,
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

twesp = pd.DataFrame(results)
twesp.to_csv("testkorpus_divers_50_tweebank.csv", index=False)
display(twesp.head())

Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


Device set to use cpu:
Auf MacOs gibt es normalerweise keine CUDA-GPU.
Apple Silicon (M1/M2/M3) hat eine eigene GPU, die nur √ºber PyTorch mit MPS (Metal Performance Shaders) genutzt werden kann.
Auf Intel-Macs bleibt nur die CPU.

In [59]:
import pandas as pd
twesp = pd.read_csv("testkorpus_divers_50_tweebank.csv")
display(twesp[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,of,ADP,of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
74,4,2020-05-11,On,on,ADP,on_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",‚Äù",",‚Äù",PUNCT,",‚Äù_PUNCT"


In [60]:
twesp.shape

(1225, 6)

In [62]:
#### Tweebank mit Stanza ####
import pandas as pd
import stanza
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

df = pd.read_csv("testkorpus_divers_50.csv")

model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

stanza.download("en")
nlp_stanza = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma",
    tokenize_engine="tokenize/tweet"  # Tweet-Tokenizer
)

results = []

for idx, row in df.iterrows():
    text = str(row["text"])
    date = row.get("date", None)

    if not isinstance(text, str) or text.strip() == "":
        continue

    # Tokenisierung und Lemmatisierung mit Stanza
    doc = nlp_stanza(text)
    tokens = [(w.text, w.lemma) for s in doc.sentences for w in s.words]

    words = [t[0] for t in tokens]
    pos_tags_nested = tagger(words)
    pos_tags = [tag[0] if isinstance(tag, list) else tag for tag in pos_tags_nested]

    for token_info, (word_text, lemma) in zip(pos_tags, tokens):
        pos_tag = token_info["entity_group"]

        results.append({
            "post_id": idx,
            "date": date,
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}",
        })

twesta = pd.DataFrame(results)
twesta.to_csv("testkorpus_divers_50_tweebank2.csv", index=False, encoding="utf-8")
display(twesta.head())

Device set to use cpu


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 15:01:34 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 15:01:34 INFO: Downloading default packages for language: en (English) ...
2025-09-13 15:01:37 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-13 15:01:43 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-13 15:01:43 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  ‚Ä¶

2025-09-13 15:06:43 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-13 15:06:44 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| lemma     | combined_nocharlm |

2025-09-13 15:06:44 INFO: Using device: cpu
2025-09-13 15:06:44 INFO: Loading: tokenize
2025-09-13 15:06:44 INFO: Loading: mwt
2025-09-13 15:06:44 INFO: Loading: lemma
2025-09-13 15:06:45 INFO: Done loading processors!


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,NOUN,Universe_NOUN


In [63]:
import pandas as pd
twesta = pd.read_csv("testkorpus_divers_50_tweebank2.csv")
display(twesta[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,NOUN,shreds_NOUN
71,4,2020-05-11,NBC,NBC,PROPN,NBC_PROPN
72,4,2020-05-11,‚Äôs,'s,PRON,'s_PRON
73,4,2020-05-11,Chuck,Chuck,PROPN,Chuck_PROPN
74,4,2020-05-11,Todd,Todd,PROPN,Todd_PROPN
75,4,2020-05-11,For,for,INTJ,for_INTJ
76,4,2020-05-11,‚Äò,',PUNCT,'_PUNCT
77,4,2020-05-11,Deceptive,deceptive,ADJ,deceptive_ADJ
78,4,2020-05-11,Editing,editing,NOUN,editing_NOUN
79,4,2020-05-11,‚Äô,'s,PUNCT,'s_PUNCT


In [64]:
twesta.shape

(1209, 6)

Fazit zu Tweebank:
- Emojis werden in <unk/> umgewandelt plus @@ ?!
- MWEs werden erkannt
- Emojis in @@, 
- Worte falsch getrennt (v.a. bei Satzzeichen)
- bei Trennung immer @@ erg√§nzt
- einen Zeichenfehler gut erkannt
- Hashtag bleibt zusammen, aber mit X gelabelt 
- Emojis als unkown und SYM gelabelt
- Links bleiben ganz und als X
- kein Unterschied zwischen den zwei Codes (0,1)
- Lemma immer wie Word bei 1 & 2
- bei Tagging und Lemmatisierung mit Stanza besseres Ergebnis

# Finale Entscheidung und Tagging der Daten

Ich entscheide mich f√ºr... STANZA mit Tweet-Tokenizer

Hier die beiden Varianten des Tagging mit extra Tags f√ºr Tweets:

### Mapping Tabelle:
| `xpos` (Tweebank / PTB) | `upos` (Universal) | Bedeutung / Beispiele                                                                  |
| ----------------------- | ------------------ | -------------------------------------------------------------------------------------- |
| **NN**                  | NOUN               | Noun, singular ‚Üí *dog, idea*                                                           |
| **NNS**                 | NOUN               | Noun, plural ‚Üí *dogs, cars*                                                            |
| **NNP**                 | PROPN              | Proper noun, singular ‚Üí *Trump, Canada*                                                |
| **NNPS**                | PROPN              | Proper noun, plural ‚Üí *the Smiths*                                                     |
| **PRP**                 | PRON               | Personal pronoun ‚Üí *I, you, he*                                                        |
| **PRP\$**               | PRON               | Possessive pronoun ‚Üí *my, your*                                                        |
| **WP**                  | PRON               | Wh-pronoun ‚Üí *who, what*                                                               |
| **WP\$**                | PRON               | Possessive wh-pronoun ‚Üí *whose*                                                        |
| **DT**                  | DET                | Determiner ‚Üí *the, a, some*                                                            |
| **PDT**                 | DET                | Predeterminer ‚Üí *all the kids*                                                         |
| **WDT**                 | DET                | Wh-determiner ‚Üí *which*                                                                |
| **JJ**                  | ADJ                | Adjective ‚Üí *big, nice*                                                                |
| **JJR**                 | ADJ                | Comparative adj ‚Üí *bigger*                                                             |
| **JJS**                 | ADJ                | Superlative adj ‚Üí *biggest*                                                            |
| **RB**                  | ADV                | Adverb ‚Üí *quickly*                                                                     |
| **RBR**                 | ADV                | Comparative adv ‚Üí *faster*                                                             |
| **RBS**                 | ADV                | Superlative adv ‚Üí *fastest*                                                            |
| **WRB**                 | ADV                | Wh-adverb ‚Üí *how, when, why*                                                           |
| **VB**                  | VERB               | Verb base ‚Üí *eat, go*                                                                  |
| **VBD**                 | VERB               | Verb past ‚Üí *ate, went*                                                                |
| **VBG**                 | VERB               | Verb gerund/participle ‚Üí *eating*                                                      |
| **VBN**                 | VERB               | Verb past participle ‚Üí *eaten*                                                         |
| **VBP**                 | VERB               | Verb non-3sg present ‚Üí *eat, go*                                                       |
| **VBZ**                 | VERB               | Verb 3sg present ‚Üí *eats, goes*                                                        |
| **MD**                  | AUX                | Modal ‚Üí *can, should*                                                                  |
| **IN**                  | ADP                | Preposition, subordinating ‚Üí *in, of, because*                                         |
| **TO**                  | PART               | Particle *to*                                                                          |
| **CC**                  | CCONJ              | Coordinating conj ‚Üí *and, or*                                                          |
| **UH**                  | INTJ               | Interjection ‚Üí *oh, hi*                                                                |
| **EX**                  | PRON               | Existential *there*                                                                    |
| **FW**                  | X                  | Foreign word                                                                           |
| **SYM**                 | SYM                | Symbol ‚Üí *%, \$, +*                                                                    |
| **LS**                  | X                  | List item marker                                                                       |
| **CD**                  | NUM                | Cardinal number ‚Üí *5, twenty*                                                          |
| **POS**                 | PART               | Possessive marker *‚Äôs*                                                                 |
| **RP**                  | PART               | Particle ‚Üí *up, off*                                                                   |
| **ADD**                 | PROPN              | **Spezialtag Tweebank**: URL, Email, @mention, Hashtag ‚Üí *@user, #hashtag, https\://‚Ä¶* |
| **NFP**                 | SYM                | **Spezialtag Tweebank**: Non-functional punctuation ‚Üí *ü§£, :), ‚ù§Ô∏è*                     |
| **. , : ; - etc.**      | PUNCT              | Satzzeichen                                                                            |
