# 02 Tagging_Testkorpus

## Taggers for POS
In this Jupyter notebook, a test corpus testkorpus_divers_50.csv is created that contains various difficulties such as spelling errors, hashtags, @-mentions and emojis. The file is then tagged with several different models so that they can be compared to see which one performs best. The decision is based on my personal impression and is not quantified.
Below, the same 50 rows of the test corpora are always shown to give a first impression of the performance. For a more reliable and higher-quality impression, however, all files were opened individually and compared with each other line by line.
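The line-by-line comparison described above was done by hand. As a possible aid, the following minimal sketch aligns the output of two taggers for one post and lists the places where tokenization or tags differ. The function name `compare_taggings` and the `(word, tag)` input format are assumptions made for this illustration, and the comparison is only meaningful when both taggers use the same tagset.

```python
from difflib import SequenceMatcher

def compare_taggings(a, b):
    """Align two (word, tag) sequences for the same post.

    Returns (agreement, differences). agreement is the share of words in `a`
    that are aligned to an identical word in `b` carrying the same tag.
    """
    words_a = [w for w, _ in a]
    words_b = [w for w, _ in b]
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    same_tag = 0
    differences = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # identical words: compare the tags pairwise
            for i, j in zip(range(i1, i2), range(j1, j2)):
                if a[i][1] == b[j][1]:
                    same_tag += 1
                else:
                    differences.append((a[i][0], a[i][1], b[j][1]))
        else:
            # the two taggers tokenized this stretch differently
            differences.append(
                (" ".join(words_a[i1:i2]), "TOKENS", " ".join(words_b[j1:j2]))
            )
    agreement = same_tag / len(a) if a else 0.0
    return agreement, differences

# Hypothetical example: one tagger splits the hashtag, the other keeps it whole
split = [("#", "SYM"), ("MadeInAmerica", "PROPN"), ("event", "NOUN"), ("!", "PUNCT")]
whole = [("#MadeInAmerica", "PROPN"), ("event", "NOUN"), ("!", "PUNCT")]
print(compare_taggings(split, whole))
```

Entries marked `TOKENS` point to tokenization differences, all other entries to words that received different tags.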

Different tagsets:
- Universal Dependencies (introduction): https://universaldependencies.org/introduction.html
- Penn Treebank tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Universal Dependencies (UPOS tags): https://universaldependencies.org/u/pos/ and https://huggingface.co/flair/upos-english

### Which tagger is best suited:
- 01 Spacy
- 02 Flair
- 03 Bert
- 04 Tweebank
- 05 Stanza
- Variations & combinations

#### Test corpus used to test the different taggers:

In [3]:
# Installation: conda install -c conda-forge pyspellchecker

In [4]:
# Goal: build a corpus that is as diverse as possible and covers all relevant cases
import pandas as pd
import re
from spellchecker import SpellChecker
from IPython.display import display

spell = SpellChecker(language="en")
df = pd.read_csv("tta_final_clean.csv")

# Predicates for the different post types
def has_mention(x): return "@" in str(x)
def has_hashtag(x): return "#" in str(x)
def has_url(x): return re.search(r"http[s]?://", str(x)) is not None
def has_emoji(x): return re.search(r"[\U00010000-\U0010ffff]", str(x)) is not None
def is_long(x): return len(str(x)) > 200

# Common misspellings (regex alternative to the spell checker below)
TYPO_RE = re.compile(
    r"\b("
    r"teh|recieve|definately|seperat(?:e|ely)|occured|untill|wich|"
    r"neccessary|adress|tomm?orow|becuase|wierd|yeee?s"
    r")\b",
    flags=re.IGNORECASE
)
# Return a boolean; the compiled pattern itself would always be truthy
def has_typo(x): return TYPO_RE.search(str(x)) is not None

def has_typo_spellchecker(text):
    words = str(text).split()
    misspelled = spell.unknown(words)   # words that are not in the dictionary
    return len(misspelled) > 0

samples = []

def sample_up_to(predicate, n, seed):
    """Draw up to n rows whose text satisfies the predicate."""
    subset = df[df["text"].apply(predicate)]
    return subset.sample(n=min(n, len(subset)), random_state=seed)

# 5 examples per category (fewer if not enough posts match)
samples.append(sample_up_to(is_long, 5, 1))
samples.append(sample_up_to(has_mention, 5, 2))
samples.append(sample_up_to(has_hashtag, 5, 3))
samples.append(sample_up_to(has_url, 5, 4))
samples.append(sample_up_to(has_emoji, 5, 5))
# samples.append(sample_up_to(has_typo, 5, 6))  # regex-based alternative
samples.append(sample_up_to(has_typo_spellchecker, 5, 6))

# Fill up randomly to 50
already = pd.concat(samples)
remaining = df.drop(already.index)
rest = remaining.sample(n=50-len(already), random_state=42)

# final test corpus
test_divers = pd.concat([already, rest]).sample(frac=1, random_state=99)
test_divers.to_csv("test_full.csv", index=False)
test_divers = test_divers[['date', 'id', 'text']]
display(test_divers.head(50))
test_divers.to_csv("testkorpus_divers_50.csv", index=False)
# Note: I use display instead of print because the output looks nicer

Unnamed: 0,date,id,text
41873,2010-11-04,3498743628,Reminder: The Miss Universe competition will b...
48694,2013-08-15,367977996541788160,@Timc1021 Thanks!
50189,2013-06-12,344775405057753088,"""""@_KatherineWebb: Looking forward to #MissUSA..."
14002,2011-09-04,110498268480198144,Addressing the Rise of Chronic Childhood Illn...
35985,2020-05-11,1259672385286012928,RT @darhar981: Attorney General Barr’s Office ...
11663,2011-09-07,111274504071323312,RT @MagaGlam🇺🇸♥️ Bring Back Trump 💙🇺🇸
22194,2011-09-01,109360482610823936,
2495,2025-03-19,2755,The CBP Home App is now available across all m...
84855,2019-01-23,1087867453684834304,Congratulations to Mariano Rivera on unanimous...
63673,2015-07-01,616265476616900608,My recent statement re: @macys -- We must have...
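The spell-checker predicate in the cell above splits on whitespace only, so strings such as `@Timc1021`, `Thanks!` or a URL are passed to the spell checker together with their punctuation and are likely to count as unknown words. A minimal sketch of a pre-cleaning step, assuming that mentions, hashtags and URLs should not count as typos; the helper name `candidate_words` and the three patterns are made up for this illustration:

```python
import re

URL_RE = re.compile(r"https?://\S+")                 # links
HANDLE_RE = re.compile(r"[@#]\w+")                   # @mentions and #hashtags
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")    # plain words, optional apostrophe part

def candidate_words(text):
    """Return lowercased plain words, ignoring URLs, mentions and hashtags."""
    text = URL_RE.sub(" ", str(text))
    text = HANDLE_RE.sub(" ", text)
    return [w.lower() for w in WORD_RE.findall(text)]

print(candidate_words("@Timc1021 Thanks! I recieve #MAGA https://t.co/abc"))
```

The resulting list could then be handed to `spell.unknown(...)` instead of `str(text).split()`.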


In [5]:
test_divers.shape

(50, 3)

In [6]:
test_divers.to_json("testkorpus_divers_50.json", orient="records", force_ascii=False, indent=2)

# 01 SpaCy
- https://huggingface.co/spacy/en_core_web_sm
- https://github.com/explosion/spaCy

In [7]:
# !pip install spacy

In [8]:
# python -m spacy download en_core_web_sm

In [9]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! @ben4appel 🤣 :) https://t.co/bsB6rVV7Yn")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
# It would be better if the hashtags were not split apart
# and emojis were not treated as proper nouns.

Donald Donald PROPN NNP
Trump Trump PROPN NNP
posted post VERB VBD
a a DET DT
new new ADJ JJ
tweet tweet NOUN NN
. . PUNCT .
# # X ADD
realdonaldtrump realdonaldtrump NOUN NN
@realdonaldtrump @realdonaldtrump PROPN NNP
! ! PUNCT .
@ben4appel @ben4appel PROPN NNP
🤣 🤣 PROPN NNP
:) :) PUNCT :
https://t.co/bsB6rVV7Yn https://t.co/bsb6rvv7yn NOUN NN


In [10]:
#### SpaCy ####
import pandas as pd
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")
all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}",
            "tag": token.tag_
        })

sp = pd.DataFrame(all_results)
sp.to_csv("testkorpus_divers_50_spacy.csv", index=False)
display(sp[310:360])
# The code works; however, emojis, @-mentions and other post-specific elements are not recognized.

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
310,16,skies,sky,NOUN,sky_NOUN,NNS
311,16,over,over,ADP,over_ADP,IN
312,16,Iran,Iran,PROPN,Iran_PROPN,NNP
313,16,.,.,PUNCT,._PUNCT,.
314,16,Iran,Iran,PROPN,Iran_PROPN,NNP
315,16,had,have,VERB,have_VERB,VBD
316,16,good,good,ADJ,good_ADJ,JJ
317,16,sky,sky,NOUN,sky_NOUN,NN
318,16,trackers,tracker,NOUN,tracker_NOUN,NNS
319,16,and,and,CCONJ,and_CCONJ,CC


In [11]:
import pandas as pd
sp = pd.read_csv("testkorpus_divers_50_spacy.csv")
display(sp[200:215])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
200,11,#,#,SYM,#_SYM,$
201,11,MadeInAmerica,MadeInAmerica,PROPN,MadeInAmerica_PROPN,NNP
202,11,event,event,NOUN,event_NOUN,NN
203,11,",",",",PUNCT,",_PUNCT",","
204,11,right,right,ADV,right_ADV,RB
205,11,here,here,ADV,here_ADV,RB
206,11,at,at,ADP,at_ADP,IN
207,11,the,the,DET,the_DET,DT
208,11,@WhiteHouse,@whitehouse,NOUN,@whitehouse_NOUN,NN
209,11,!,!,PUNCT,!_PUNCT,.


In [12]:
display(sp[1200:1242])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
1200,50,ALL,all,PRON,all_PRON,DT
1201,50,of,of,ADP,of_ADP,IN
1202,50,them,they,PRON,they_PRON,PRP
1203,50,were,be,AUX,be_AUX,VBD
1204,50,released,release,VERB,release_VERB,VBN
1205,50,into,into,ADP,into_ADP,IN
1206,50,our,our,PRON,our_PRON,PRP$
1207,50,Country,Country,PROPN,Country_PROPN,NNP
1208,50,.,.,PUNCT,._PUNCT,.
1209,50,Thanks,thank,NOUN,thank_NOUN,NNS


In [13]:
display(sp[70:110])

Unnamed: 0,post_id,word,lemma,pos,lemma_p,tag
70,5,’s,’s,PART,’s_PART,POS
71,5,Office,office,NOUN,office_NOUN,NN
72,5,Shreds,shred,VERB,shred_VERB,VBZ
73,5,NBC,NBC,PROPN,NBC_PROPN,NNP
74,5,’s,’s,PART,’s_PART,POS
75,5,Chuck,Chuck,PROPN,Chuck_PROPN,NNP
76,5,Todd,Todd,PROPN,Todd_PROPN,NNP
77,5,For,for,ADP,for_ADP,IN
78,5,‘,',PUNCT,'_PUNCT,``
79,5,Deceptive,Deceptive,PROPN,Deceptive_PROPN,NNP


In [14]:
sp.shape # How many tags were assigned?

(1242, 6)

In [15]:
#### 02 SpaCy ###
import pandas as pd
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text) or str(text).strip() == "":
        continue
    
    doc = nlp(str(text))
    
    for token_id, token in enumerate(doc, start=1):
        all_results.append({
            "post_id": idx + 1,
            "date": row["date"],
            "token_id": token_id,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}"
        })

sp2 = pd.DataFrame(all_results)
sp2.to_csv("testkorpus_divers_50_spacy2.csv", index=False, encoding="utf-8")
display(sp2[500:555])

Unnamed: 0,post_id,date,token_id,word,lemma,pos,lemma_p
500,25,2025-03-12,19,by,by,ADP,by_ADP
501,25,2025-03-12,20,other,other,ADJ,other_ADJ
502,25,2025-03-12,21,countries,country,NOUN,country_NOUN
503,25,2025-03-12,22,and,and,CCONJ,and_CCONJ
504,25,2025-03-12,23,",",",",PUNCT,",_PUNCT"
505,25,2025-03-12,24,frankly,frankly,ADV,frankly_ADV
506,25,2025-03-12,25,",",",",PUNCT,",_PUNCT"
507,25,2025-03-12,26,by,by,ADP,by_ADP
508,25,2025-03-12,27,incompetent,incompetent,ADJ,incompetent_ADJ
509,25,2025-03-12,28,U.S.,U.S.,PROPN,U.S._PROPN


In [18]:
import pandas as pd
sp2 = pd.read_csv("testkorpus_divers_50_spacy2.csv")
display(sp2[70:110])

Unnamed: 0,post_id,date,token_id,word,lemma,pos,lemma_p
70,5,2020-05-11,7,’s,’s,PART,’s_PART
71,5,2020-05-11,8,Office,office,NOUN,office_NOUN
72,5,2020-05-11,9,Shreds,shred,VERB,shred_VERB
73,5,2020-05-11,10,NBC,NBC,PROPN,NBC_PROPN
74,5,2020-05-11,11,’s,’s,PART,’s_PART
75,5,2020-05-11,12,Chuck,Chuck,PROPN,Chuck_PROPN
76,5,2020-05-11,13,Todd,Todd,PROPN,Todd_PROPN
77,5,2020-05-11,14,For,for,ADP,for_ADP
78,5,2020-05-11,15,‘,',PUNCT,'_PUNCT
79,5,2020-05-11,16,Deceptive,Deceptive,PROPN,Deceptive_PROPN


In [20]:
sp2.shape

(1242, 7)

In [1]:
#### SpaCy with Twitter ####
import pandas as pd
import spacy
from spacy.tokenizer import Tokenizer
import re

def create_twitter_tokenizer(nlp):
    # Extended infix rule for hashtags and mentions (e.g. #NLP, @user)
    # Note: only infix rules are passed here, so this tokenizer has no prefix,
    # suffix or exception rules and leaves punctuation attached to words.
    infix_re = spacy.util.compile_infix_regex(
        nlp.Defaults.infixes + [r'(?<=\w)[#@](?=\w)']
    )
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

df = pd.read_csv("testkorpus_divers_50.csv")
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = create_twitter_tokenizer(nlp)

all_results = []

for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "lemma_p": f"{token.lemma_}_{token.pos_}"
        })

spt = pd.DataFrame(all_results)
spt.to_csv("testkorpus_divers_50_spacy_twitter.csv", index=False, encoding="utf-8")
display(spt[450:500])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
450,21,https://t.co,https://t.co,PROPN,https://t.co_PROPN
451,21,/,/,SYM,/_SYM
452,21,v6z46rUDtg,v6z46rUDtg,PROPN,v6z46rUDtg_PROPN
453,22,Congress,Congress,PROPN,Congress_PROPN
454,22,must,must,AUX,must_AUX
455,22,approve,approve,VERB,approve_VERB
456,22,the,the,DET,the_DET
457,22,"deal,","deal,",NOUN,"deal,_NOUN"
458,22,without,without,ADP,without_ADP
459,22,all,all,PRON,all_PRON


In [2]:
import pandas as pd
spt = pd.read_csv("testkorpus_divers_50_spacy_twitter.csv")
display(spt[200:215])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
200,12,-,-,PUNCT,-_PUNCT
201,12,scored,score,VERB,score_VERB
202,12,-,-,PUNCT,-_PUNCT
203,12,big,big,ADJ,big_ADJ
204,12,-,-,PUNCT,-_PUNCT
205,12,win,win,NOUN,win_NOUN
206,12,-,-,PUNCT,-_PUNCT
207,12,with,with,ADP,with_ADP
208,12,-,-,PUNCT,-_PUNCT
209,12,potentially,potentially,ADV,potentially_ADV


In [3]:
spt.shape

(1244, 5)

In [4]:
import pandas as pd
import spacy
from spacy.symbols import ORTH
import re

# Regex rules
HASHTAG_RE = re.compile(r"^#\w+")
MENTION_RE = re.compile(r"^@\w+")
EMOJI_RE = re.compile(
    r"[\U0001F600-\U0001F64F"  # emoticons
    r"\U0001F300-\U0001F5FF"   # symbols & pictographs
    r"\U0001F680-\U0001F6FF"   # transport & maps
    r"\U0001F1E0-\U0001F1FF"   # flags
    r"]", flags=re.UNICODE
)
URL_RE = re.compile(r"^(https?://[^\s]+|www\.[^\s]+|[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})(/[^\s]*)?$", flags=re.IGNORECASE)

def custom_pos(token):
    if HASHTAG_RE.match(token.text):
        return "ADD"
    elif MENTION_RE.match(token.text):
        return "PROPN"
    elif EMOJI_RE.match(token.text):
        return "NFP"
    elif URL_RE.match(token.text):
        return "URL"
    return token.pos_

# Load the pipeline
nlp = spacy.load("en_core_web_sm")

# Add special cases to the tokenizer so that they are not split apart
def add_special_cases(nlp):
    # Example: hashtags, mentions, emojis
    # Note: this only registers the bare strings "#" and "@" as special cases
    for prefix in ["#", "@"]:
        nlp.tokenizer.add_special_case(prefix, [{ORTH: prefix}])
    # Optional: URLs & emoji samples as special tokens
    # For all URLs or emojis in the texts, tokens could be collected dynamically beforehand and added

add_special_cases(nlp)

# Load the data
df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []
for idx, text in enumerate(df["text"], start=1):
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for token in doc:
        pos_tag = custom_pos(token)
        all_results.append({
            "post_id": idx,
            "word": token.text,
            "lemma": token.lemma_,
            "pos": pos_tag,
            "lemma_p": f"{token.lemma_}_{pos_tag}"
        })

spt2 = pd.DataFrame(all_results)
spt2.to_csv("testkorpus_divers_50_spacy_tags.csv", index=False, encoding="utf-8")
display(spt2[450:500])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
450,22,.,.,PUNCT,._PUNCT
451,22,Our,our,PRON,our_PRON
452,22,workers,worker,NOUN,worker_NOUN
453,22,will,will,AUX,will_AUX
454,22,be,be,AUX,be_AUX
455,22,hurt,hurt,VERB,hurt_VERB
456,22,!,!,PUNCT,!_PUNCT
457,23,https://justthenews.com/politics-policy/all-th...,https://justthenews.com/politics-policy/all-th...,URL,https://justthenews.com/politics-policy/all-th...
458,24,RT,RT,PROPN,RT_PROPN
459,24,@FLOTUS,@FLOTUS,PROPN,@FLOTUS_PROPN


In [5]:
import pandas as pd
spt2 = pd.read_csv("testkorpus_divers_50_spacy_tags.csv")
display(spt2[200:215])

Unnamed: 0,post_id,word,lemma,pos,lemma_p
200,11,#,#,SYM,#_SYM
201,11,MadeInAmerica,MadeInAmerica,PROPN,MadeInAmerica_PROPN
202,11,event,event,NOUN,event_NOUN
203,11,",",",",PUNCT,",_PUNCT"
204,11,right,right,ADV,right_ADV
205,11,here,here,ADV,here_ADV
206,11,at,at,ADP,at_ADP
207,11,the,the,DET,the_DET
208,11,@WhiteHouse,@whitehouse,PROPN,@whitehouse_PROPN
209,11,!,!,PUNCT,!_PUNCT


In [6]:
spt2.shape

(1242, 5)
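Two of the regex rules in the cell above may explain some of the remaining mistakes. The emoji pattern covers only four Unicode ranges, so characters such as 🤣 (U+1F923) or ♥ (U+2665) fall outside it, and the URL pattern also accepts any `word.word` string, for example `1887.Twisting`. The sketch below contrasts the original patterns with wider or stricter variants. The names `WIDE_EMOJI_RE` and `STRICT_URL_RE` and the chosen ranges are assumptions for illustration; the wide ranges also include some code points that are not emojis.

```python
import re

# Patterns as used in the cell above
NARROW_EMOJI_RE = re.compile(
    r"[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    r"\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)
ORIGINAL_URL_RE = re.compile(
    r"^(https?://[^\s]+|www\.[^\s]+|[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})(/[^\s]*)?$",
    flags=re.IGNORECASE,
)

# Assumed alternatives: wider emoji ranges, URLs only with scheme or "www."
WIDE_EMOJI_RE = re.compile(
    r"["
    r"\U0001F1E6-\U0001F1FF"   # regional indicators (flags)
    r"\U0001F300-\U0001FAFF"   # pictographs, emoticons, transport, supplemental symbols
    r"\u2600-\u27BF"           # miscellaneous symbols and dingbats
    r"]"
)
STRICT_URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", flags=re.IGNORECASE)

for token in ["\U0001F923", "\u2665", "1887.Twisting", "https://t.co/bsB6rVV7Yn"]:
    print(
        repr(token),
        bool(NARROW_EMOJI_RE.match(token)),
        bool(WIDE_EMOJI_RE.match(token)),
        bool(ORIGINAL_URL_RE.match(token)),
        bool(STRICT_URL_RE.match(token)),
    )
```

The stricter URL rule would no longer catch bare domains such as `example.com`, which may or may not be acceptable for this corpus.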

### Conclusion on SpaCy:
#### SpaCy:
- LL.Bean recognized correctly; @timc1021 left intact, tagged as X
- Thanks lemmatized to thank and tagged NOUN, pm as NOUN, EST as PROPN
- URLs are left intact but tagged as PROPN, NOUN or VERB
- @-mentions are left intact (@_KatherineWebb PROPN, @timc1021, @darhar981, @MagaGlam SYM, @FitnessGov NOUN, @WhiteHouse, @BreitbartNews)
- Hashtags are split (# SYM MissUSA NOUN, # SYM AGENDA47 NOUN, # SYM EnterSandman PROPN, # KellyFile)
- Spelling error split correctly (former ADJ . PUNCT Miss PROPN Alabama PROPN)
- Addressing is lemmatized to address, VERB
- Illnesses PROPN is only lemmatized to Illnesses
- Barr's becomes Barr PROPN 's PART
- Shreds becomes shred, tagged VERB
- Meet PROPN The PROPN Press PROPN
- Todd ADJ, RT PROPN
- Emojis tagged as X, NOUN, PROPN (but split off correctly)
- is is lemmatized to be, Stores to store, elected to elect, am to be, MADE to make
- National PROPN Baseball PROPN
- not PART only ADV a DET great ADJ player NOUN
- @macys stays @macys
- it PRON is AUX the DET BEST PROPN (the lemma should actually be good)
- L.L.Bean PROPN, ol' ADJ
- violating VERB 1st PROPN Amendment PROPN
- We PRON 're AUX going VERB to PART take VERB back ADV our PRON wealth NOUN
- RT PROPN

#### spacy2: 

- Shreds correctly recognized as shred
- will AUX be AUX LIVE VERB
- pm as NOUN
- better lemmatization
- Hash SYM MissUSA NOUN, Hash SYM AGENDA47 NOUN, Hash SYM MadeInAmerica PROPN event NOUN, Hash SYM MAGA PROPN, Hash SYM KellyFile PROPN
- URLs preserved well, but tagged as PROPN, NOUN (i.e. analyzed as part of the sentence)
- @timc1021 X, @_KatherineWebb PROPN, @darhar981 PROPN, @WhiteHouse NOUN, @LBPerfectMaine ADJ, @Citizens_United is lemmatized to @citizens_unite and tagged as VERB
- Thanks lemmatized to thank and tagged NOUN, Looking lemmatized to look, Addressing to address, being to be, MADE to make, but BEST stays BEST, People stays people, her becomes she
- Barr's becomes Barr PROPN 's PART
- Not PART only ADV a DET great ADJ player NOUN
- Emojis analyzed as part of the sentence: X, NOUN, PROPN
- L.L.Bean as PROPN, its as PRON, ol' as ADJ (not lemmatized to old)
- doesn't becomes do AUX and not PART
- Via PROPN @BreitbartNews PROPN
- violating VERB 1st PROPN Amendment PROPN
- RT PROPN

#### With the Twitter tokenizer function:

- Punctuation is not split off correctly (Vegas!"" PROPN, Thanks! INTJ)
- poor tokenization
- @timc1021 as X, ""@_KatherineWebb: as PUNCT, @darhar981: PROPN, @FitnessGov. NOUN, @WhiteHouse! PROPN, ""@HarmonBrew: NOUN, @megynkelly ADV, ""@Citizens_United PUNCT
- #AGENDA47 NOUN, #EnterSandman PROPN, #HOF2019 PROPN, #MadeInAmerica PROPN, #KellyFile NOUN, #MerryChristmas! NOUN
- not all tags fit (e.g. Meet NOUN the DET press NOUN, USA as ADV)
- Barr's stays Barr's PROPN
- Emojis tokenized well, but tagged X, NOUN, PROPN
- some URLs are chopped up, others are not (and tagged as PROPN, SYM PROPN or NOUN)
- lemmatization is good (Looking becomes look, Shreds becomes shred VERB, NBC's stays as it is and is tagged PRON, MADE becomes make, but BEST! becomes best!, her, stays her, getting becomes get)
- lemmatization is good whenever punctuation was split off beforehand
- Thank VERB you PRON .. support NOUN and CCONJ courage. VERB
- L.L.Bean. PROPN
- 1st PROPN Amendment PROPN
- RT PROPN

#### Extension with four functions:
- Hashtags are not recognized as such when they are embedded in the sentence as an object: they then become SYM NOUN/PROPN
- @-mentions are all correctly tagged as PROPN and kept in one piece
- tokenization is good (words and punctuation are separated well)
- Illnesses only lemmatized to Illnesses?, Shreds to shred, Stores to store, elected to elect, MADE to make, stolen to steal, 're to be
- RT as PROPN, Todd ADJ
- Barr's becomes PROPN PART
- Meet PROPN The PROPN Press PROPN
- Emojis mostly tagged NFP, but also PROPN
- Thank VERB you PRON
- @macys stays @macys PROPN, her stays her PRON
- even ADV more ADV (lemma stays more) now ADV
- DonaldTrump PROPN, Via PROPN, 1st Amendment PROPN PROPN
- its PRON (instead of it's), ol' stays ol' ADJ
- doesn't becomes do AUX not PART
- 1887.Twisting becomes 1887.twiste and is tagged as URL, arms?Allowed becomes arms?allowe VERB
- JOBS as PROPN
- anti-Trump becomes anti ADJ - ADJ trump ADJ

# 02 Flair
- https://huggingface.co/flair/pos-english: F1 score: 98.18
- https://huggingface.co/flair/pos-english-fast: 98.10
- https://huggingface.co/flair/upos-english: 98.6
- https://huggingface.co/flair/upos-english-fast: 98.47

In [22]:
# !pip install flair # the model can be chosen as pos or upos

In [1]:
#### Flair with UPOS ####
import logging
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger

logging.getLogger("flair").setLevel(logging.ERROR)
df = pd.read_csv("testkorpus_divers_50.csv")
df_nonempty = df[df["text"].notna()].copy()
tagger = SequenceTagger.load("flair/upos-english")
label_type = tagger.label_type

sentences = [Sentence(str(t)) for t in df_nonempty["text"]]
tagger.predict(sentences, mini_batch_size=32)

all_results = []
for row, sentence in zip(df_nonempty.itertuples(index=True), sentences):
    for token in sentence:
        # POS-Label 
        pos_label = token.get_label(label_type).value if token.has_label(label_type) else None

        all_results.append({
            "post_id": row.Index,
            "date": getattr(row, "date", None),
            "word": token.text,
            "lemma": token.text.lower(),           # Flair provides no lemma
            "pos": pos_label,
            "lemma_p": f"{token.text.lower()}_{pos_label}" if pos_label else token.text.lower()
        })
        
fl = pd.DataFrame(all_results)
fl.to_csv("testkorpus_divers_50_flair_upos.csv", index=False)
display(fl.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,INTJ,reminder_INTJ
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,PROPN,the_PROPN
3,0,2010-11-04,Miss,miss,PROPN,miss_PROPN
4,0,2010-11-04,Universe,universe,PROPN,universe_PROPN
5,0,2010-11-04,competition,competition,PROPN,competition_PROPN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,VERB,be_VERB
8,0,2010-11-04,LIVE,live,VERB,live_VERB
9,0,2010-11-04,from,from,ADP,from_ADP


In [2]:
import pandas as pd
fl = pd.read_csv("testkorpus_divers_50_flair_upos.csv")
display(fl[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,NUM,shreds_NUM
71,4,2020-05-11,NBC,nbc,SYM,nbc_SYM
72,4,2020-05-11,’s,’s,NUM,’s_NUM
73,4,2020-05-11,Chuck,chuck,NOUN,chuck_NOUN
74,4,2020-05-11,Todd,todd,SYM,todd_SYM
75,4,2020-05-11,For,for,NUM,for_NUM
76,4,2020-05-11,‘,‘,SYM,‘_SYM
77,4,2020-05-11,Deceptive,deceptive,NUM,deceptive_NUM
78,4,2020-05-11,Editing’,editing’,SYM,editing’_SYM
79,4,2020-05-11,Of,of,NUM,of_NUM


In [3]:
fl.shape

(1357, 6)

In [4]:
#### Flair with POS ####
import logging
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger

logging.getLogger("flair").setLevel(logging.ERROR)
df = pd.read_csv("testkorpus_divers_50.csv")
df_nonempty = df[df["text"].notna()].copy()
tagger = SequenceTagger.load("flair/pos-english")
label_type = tagger.label_type

sentences = [Sentence(str(t)) for t in df_nonempty["text"]]
tagger.predict(sentences, mini_batch_size=32)

all_results = []
for row, sentence in zip(df_nonempty.itertuples(index=True), sentences):
    for token in sentence:
        # POS-Label 
        pos_label = token.get_label(label_type).value if token.has_label(label_type) else None

        all_results.append({
            "post_id": row.Index,
            "date": getattr(row, "date", None),
            "word": token.text,
            "lemma": token.text.lower(),           # Flair provides no lemma
            "pos": pos_label,
            "lemma_p": f"{token.text.lower()}_{pos_label}" if pos_label else token.text.lower()
        })
        
fl2 = pd.DataFrame(all_results)
fl2.to_csv("testkorpus_divers_50_flair_pos.csv", index=False)
display(fl2.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,:,:,:_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,miss,NNP,miss_NNP
4,0,2010-11-04,Universe,universe,NNP,universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,RB,live_RB
9,0,2010-11-04,from,from,IN,from_IN


In [5]:
import pandas as pd
fl2 = pd.read_csv("testkorpus_divers_50_flair_pos.csv")
display(fl2[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,VBZ,shreds_VBZ
71,4,2020-05-11,NBC,nbc,NNP,nbc_NNP
72,4,2020-05-11,’s,’s,VBZ,’s_VBZ
73,4,2020-05-11,Chuck,chuck,NNP,chuck_NNP
74,4,2020-05-11,Todd,todd,NNP,todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‘,‘,``,‘_``
77,4,2020-05-11,Deceptive,deceptive,JJ,deceptive_JJ
78,4,2020-05-11,Editing’,editing’,NN,editing’_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [6]:
fl2.shape

(1357, 6)

In [31]:
#### Flair and SpaCy ####
import pandas as pd
from flair.data import Sentence
from flair.models import SequenceTagger
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
tagger = SequenceTagger.load("pos-fast")
nlp = spacy.load("en_core_web_sm")

all_results = []

for idx, row in df.iterrows():
    text = row['text']
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))

    flair_sentence = Sentence(str(text))
    tagger.predict(flair_sentence)

    # Caution: Flair and SpaCy tokenize differently!
    if len(flair_sentence) == len(spacy_doc):
        for flair_token, spacy_token in zip(flair_sentence, spacy_doc):
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": spacy_token.lemma_,  # use SpaCy for the lemma
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{spacy_token.lemma_}_{flair_token.get_label('pos').value}"
            })
    else:
        # If the token counts do not match
        for flair_token in flair_sentence:
            all_results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": flair_token.text,
                "lemma": flair_token.text.lower(),
                "pos": flair_token.get_label('pos').value,
                "lemma_p": f"{flair_token.text.lower()}_{flair_token.get_label('pos').value}"
            })

flsp = pd.DataFrame(all_results)
flsp.to_csv("testkorpus_divers_50_flair_spacy.csv", index=False)
display(flsp.head(20))

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NN,reminder_NN
1,0,2010-11-04,:,:,:,:_:
2,0,2010-11-04,The,the,DT,the_DT
3,0,2010-11-04,Miss,miss,NNP,miss_NNP
4,0,2010-11-04,Universe,universe,NNP,universe_NNP
5,0,2010-11-04,competition,competition,NN,competition_NN
6,0,2010-11-04,will,will,MD,will_MD
7,0,2010-11-04,be,be,VB,be_VB
8,0,2010-11-04,LIVE,live,JJ,live_JJ
9,0,2010-11-04,from,from,IN,from_IN


In [32]:
import pandas as pd
flsp = pd.read_csv("testkorpus_divers_50_flair_spacy.csv")
display(flsp[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,VBZ,shreds_VBZ
71,4,2020-05-11,NBC,nbc,NNP,nbc_NNP
72,4,2020-05-11,’s,’s,VBZ,’s_VBZ
73,4,2020-05-11,Chuck,chuck,NNP,chuck_NNP
74,4,2020-05-11,Todd,todd,NNP,todd_NNP
75,4,2020-05-11,For,for,IN,for_IN
76,4,2020-05-11,‘,‘,``,‘_``
77,4,2020-05-11,Deceptive,deceptive,JJ,deceptive_JJ
78,4,2020-05-11,Editing’,editing’,NN,editing’_NN
79,4,2020-05-11,Of,of,IN,of_IN


In [33]:
flsp.shape

(1357, 6)

### Conclusion on Flair:
(Flair does not offer lemmatization)
#### Flair upos:
- lemmatization is only lowercasing
- URLs chopped up (mrzad9 as PUNCT)
- Emojis as SYM
- @_KatherineWebb split into ""@_ SYM KatherineWebb NUM
- Hashtags are split (# SYM MissUSA NUM, # SYM AGENDA47 NUM)
- RT as X, @ X darhar981 X
- too many NUM tags (weekend NUM, Vegas NUM, former.Miss NUM (and wrongly tokenized), illnesses NUM, Attorney NUM, Shreds NUM, Fame NUM)
- Barr's becomes Barr PUNCT 's SYM
- @MagaGlam(Emojis) Bring Back Trump Emojis: becomes @ X MagaGlam X Emojis SYM Bring VERB Back SYM Trump NUM Emojis SYM
- available VERB
- on INTJ unanimously INTJ (interjection) being PRON elected VERB
- of INTJ L.L.Bean INTJ (and later on L.L.Bean as PUNCT)
- @macys becomes @ X macys X
- !.. X, . X
- @ SYM WhiteHouse NUM, AMERICA NUM
- This NUM week NOUN we VERB hosted VERB a DET # SYM MadeInAmerica NUM event NOUN

#### Flair pos:

- lemmatization is only lowercasing
- URLs chopped up, but tagged differently than by the upos model (mrzad9 as NN)
- @_KatherineWebb split into ""@_ NFP KatherineWebb NNP
- Hashtags are split (# NNP MissUSA NNP, # NFP AGENDA47 CD, # NFP EnterSandman NN)
- weekend NN, Vegas NNP, former.Miss NN, Illnesses NNPS
- Addressing VBG, Shreds VBZ, Barr NNP 's VBZ
- RT NN, @ IN darhar981 CD, elected VBN
- @-mentions are split (@ SYM MagaGlam NNP Emojis NFP Bring VB Back RB Trump NN Emojis NFP)
- Emojis as NFP
- @macys becomes @ NFP macys NNS
- This DT week NN we PRP hosted VBD a DT # NN MadeInAmerica NNP event NN

#### Flair-spacy: 

- because the two taggers tokenize differently, spaCy's lemmatization is not always applied (applied here: Stores becomes store, but being stays being, elected stays elected, BEST becomes best, People becomes people, longer becomes long, takes becomes take, harder becomes hard, workers becomes worker, is becomes be, was becomes be, stolen becomes steal, 're becomes be, ...)
- extended tagset (pos with ADD and NFP)
- 9pm unsplit, tagged NN
- URLs chopped up and tagged like a sentence (ADD NFP ADD SYM NNP NFP CD NNS .)
- Addressing is lemmatized to addressing
- #AGENDA47 becomes # NFP AGENDA47 NN
- RT as NNP
- @ CC darhar981 ADD
- Shreds is lemmatized to shreds and tagged VBZ
- Comments is lemmatized to comments
- hashtags and @-mentions are split (# NFP MAGA NNP, @ CC LBPerfectMaine NNP, # NNP KellyFile NNP)
- Emojis as NFP
- L.L. Bean as NNP without a space
- even RB more RBR now RB (is correct)
- its as PRP
- doesn't becomes does (which is lemmatized to do) VBZ and n't (which is lemmatized to not) RB
- ol' as JJ
- not segmented consistently: predictions. as NN, 1887.Twisting as VBG
- hashtag as a dollar sign? MAGA NN, more JJR
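
The Flair and SpaCy cell only uses spaCy's lemmas when both tokenizers produce the same number of tokens for a post, and otherwise falls back to lowercasing for the whole post. A possible refinement is to match tokens by their character offsets, so that every token with an identical span in both tokenizations still receives its spaCy lemma. The sketch below works on plain tuples; the function name `align_lemmas` and the tuple layout are assumptions. In spaCy the start offset is available as `token.idx`; how Flair exposes token offsets depends on the installed version, so extracting them is left out here.

```python
def align_lemmas(tagged_tokens, lemma_tokens):
    """Combine tags from one tokenizer with lemmas from another via offsets.

    tagged_tokens: list of (text, start, end, tag)
    lemma_tokens:  list of (text, start, end, lemma)
    Tokens without an identical span in lemma_tokens fall back to lowercase.
    """
    lemma_by_span = {(start, end): lemma for _, start, end, lemma in lemma_tokens}
    rows = []
    for text, start, end, tag in tagged_tokens:
        lemma = lemma_by_span.get((start, end), text.lower())
        rows.append({"word": text, "lemma": lemma, "pos": tag, "lemma_p": f"{lemma}_{tag}"})
    return rows

# Hypothetical tokenizations of the text "He takes #MAGA"
tagged = [("He", 0, 2, "PRP"), ("takes", 3, 8, "VBZ"), ("#", 9, 10, "NFP"), ("MAGA", 10, 14, "NNP")]
lemmas = [("He", 0, 2, "he"), ("takes", 3, 8, "take"), ("#MAGA", 9, 14, "#maga")]
for row in align_lemmas(tagged, lemmas):
    print(row)
```

In this example the token counts differ (4 versus 3), yet `takes` still receives the lemma `take`; only the two hashtag pieces fall back to lowercasing.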

# 03 Bert
https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos

In [34]:
#### Bert with SpaCy ####
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

model_name = "vblagoje/bert-english-uncased-finetuned-pos"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

nlp = spacy.load("en_core_web_sm")
id2label = model.config.id2label

df = pd.read_csv("testkorpus_divers_50.csv")
text_col = "text"

results = []


for idx, row in df.iterrows():
    text = row.get(text_col)
    if pd.isna(text):
        continue

    spacy_doc = nlp(str(text))

    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]

    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits
    predictions = torch.argmax(logits, dim=2)[0].tolist()

    for idx_token, pred_id in enumerate(predictions):
        start, end = offset_mappings[idx_token].tolist()
        if start == end:
            continue

        word_text = text[start:end]
        pos_tag = id2label[pred_id]
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()

        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

ber = pd.DataFrame(results)
ber.to_csv("testkorpus_divers_50_bert.csv", index=False)
display(ber.head(50))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [35]:
import pandas as pd
ber = pd.read_csv("testkorpus_divers_50_bert.csv")
display(ber[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,’,’s,PART,’s_PART
116,4,2020-05-11,s,’s,PART,’s_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,“,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,”,"""",PUNCT,"""_PUNCT"


In [36]:
ber.shape

(2024, 6)

In [37]:
#### BERT ### without pipeline
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import spacy

model_name = "vblagoje/bert-english-uncased-finetuned-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()
nlp = spacy.load("en_core_web_sm")
id2label = model.config.id2label

df = pd.read_csv("testkorpus_divers_50.csv")

results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    spacy_doc = nlp(str(text))
    
    encoding = tokenizer(str(text), return_tensors="pt", return_offsets_mapping=True, truncation=True)
    input_ids = encoding["input_ids"]
    offset_mappings = encoding["offset_mapping"][0]
    
    with torch.no_grad():
        output = model(input_ids)
    
    logits = output.logits  
    predictions = torch.argmax(logits, dim=2)[0].tolist()
    
    for idx_token, pred_id in enumerate(predictions):
        start, end = offset_mappings[idx_token].tolist()
        if start == end:
            continue
        
        word_text = text[start:end]
        pos_tag = id2label[pred_id]
        
        lemma = None
        for token in spacy_doc:
            token_start = token.idx
            token_end = token.idx + len(token.text)
            if start >= token_start and end <= token_end:
                lemma = token.lemma_
                break
        if lemma is None:
            lemma = word_text.lower()
        
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

bert = pd.DataFrame(results)
bert.to_csv("testkorpus_divers_50_bert2.csv", index=False)
display(bert.head(40))

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,PROPN,Universe_PROPN
5,0,2010-11-04,competition,competition,NOUN,competition_NOUN
6,0,2010-11-04,will,will,AUX,will_AUX
7,0,2010-11-04,be,be,AUX,be_AUX
8,0,2010-11-04,LIVE,live,ADJ,live_ADJ
9,0,2010-11-04,from,from,ADP,from_ADP


In [38]:
import pandas as pd
bert = pd.read_csv("testkorpus_divers_50_bert2.csv")
display(bert[115:165])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
115,4,2020-05-11,’,’s,PART,’s_PART
116,4,2020-05-11,s,’s,PART,’s_PART
117,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
118,4,2020-05-11,On,on,ADP,on_ADP
119,4,2020-05-11,“,"""",PUNCT,"""_PUNCT"
120,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
121,4,2020-05-11,The,The,DET,The_DET
122,4,2020-05-11,Press,Press,PROPN,Press_PROPN
123,4,2020-05-11,",",",",PUNCT,",_PUNCT"
124,4,2020-05-11,”,"""",PUNCT,"""_PUNCT"


In [39]:
bert.shape

(2024, 6)

In [40]:
bert[1980:2024]

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
1980,49,2025-03-01,were,be,AUX,be_AUX
1981,49,2025-03-01,released,release,VERB,release_VERB
1982,49,2025-03-01,into,into,ADP,into_ADP
1983,49,2025-03-01,our,our,PRON,our_PRON
1984,49,2025-03-01,Country,Country,NOUN,Country_NOUN
1985,49,2025-03-01,.,.,PUNCT,._PUNCT
1986,49,2025-03-01,Thanks,thank,NOUN,thank_NOUN
1987,49,2025-03-01,to,to,ADP,to_ADP
1988,49,2025-03-01,the,the,DET,the_DET
1989,49,2025-03-01,Trump,Trump,PROPN,Trump_PROPN


### Conclusions on BERT:
#### BERT 1
- 2,024 tags, considerably more than the other taggers assigned
- URLs (and abbreviations such as CBP) are split into too many pieces and tagged incorrectly as X and PUNCT (other words too: USA(emoji) becomes usa(emoji) tagged SYM, i.e. wrongly segmented)
- Hashtags and @ mentions are split (#MadeInAmerica becomes # Made InA meric a)
- Emojis are recognized as SYM (and, when wrongly segmented, so is the rest of the word)
- correct lemmatization
- approx. 7,000-8,000 more tags than the other taggers (probably because of the links)
- @Timc1021 becomes @ SYM Tim NOUN c NOUN 10 NUM 21 NUM
- Thanks is lemmatized to thank as NOUN, Looking to look as VERB, skies to sky as NOUN
- #MissUSA becomes # SYM Miss PROPN USA PROPN; lemma missusa?
- @ ADP dar PROPN har PROPN 9 NUM 8 NUM 1 NUM
- Shreds becomes Sh VERB red VERB s VERB
- Barr's becomes Barr PROPN ' PART s PART
- Emojis as SYM
- @MagaGlam(emojis) becomes @ SYM and MagaGlam(emojis) SYM; @Breitbart-news becomes @ SYM Br PROPN eit X bar X t X Ne X ws X
- CBP Home App becomes CB P Home App
- @macys becomes @ SYM macy PROPN s PROPN
- L.L.Bean becomes L . L. Bean
- People is lemmatized to People
- @LBPerfectMaine becomes @ LB Per fect Main e, tagged X
- its as AUX (should be it's)

#### BERT 2

- URLs (as well as @ mentions and hashtags) are chopped up and tagged X and PUNCT (sometimes other tags too)
- pm stays pm (not p.m.) as NOUN
- 9 as NOUN
- @Timc1021 becomes @ SYM Tim NOUN c NOUN 10 NUM 21 NUM
- Thanks is lemmatized to thank and tagged NOUN
- @_KatherineWebb becomes @ SYM _ SYM Katherine PROPN We PROPN bb PROPN
- Looking becomes look
- #MissUSA becomes # Miss USA
- Barr's becomes Barr ' s
- Deceptive becomes Dec ADJ eptive ADJ
- Emojis as SYM
- Opioid Drug becomes Op NOUN io X id PROPN Drug NOUN
- @FitnessGov. becomes @ X Fitness X Go X v X . NOUN
- #HOF2019(emoji) becomes # SYM HOF2019(emoji) SYM
- @macys becomes @ SYM macy PROPN s PROPN
- @WhiteHouse: @ White House
- MADE is lemmatized to make
- USA(flag) becomes usa(flag) SYM
- #MAGA becomes # MAG A
- LL. Bean becomes L . L . Bean
- Donald Trump becomes Donald Trum p
- its AUX
- doesn't becomes doesn AUX ' PART t PART
- ol ' ADJ
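
Most of the problems listed above come from subword pieces being written out as separate rows. A minimal sketch of one possible post-processing step is shown below: pieces whose character offsets touch are merged back into one unit, and the unit keeps the tag of its longest piece. The function name `merge_adjacent_pieces` and the longest-piece rule are assumptions made for illustration, not part of the notebook. Note that this also re-attaches punctuation that sits directly on a word (e.g. Barr's or Fame!), so it restores whitespace-delimited strings rather than linguistic tokens.

```python
from typing import Dict, List, Tuple

Piece = Tuple[int, int, str]  # (start offset, end offset, predicted tag)


def merge_adjacent_pieces(text: str, pieces: List[Piece]) -> List[Tuple[str, str]]:
    """Merge pieces whose character spans touch; keep the tag of the longest piece."""
    merged: List[Dict] = []
    for start, end, tag in pieces:
        if start == end:  # special tokens such as [CLS] / [SEP] have empty spans
            continue
        length = end - start
        if merged and merged[-1]["end"] == start:
            unit = merged[-1]
            unit["end"] = end  # extend the current unit
            if length > unit["best_len"]:
                unit["tag"], unit["best_len"] = tag, length
        else:
            merged.append({"start": start, "end": end, "tag": tag, "best_len": length})
    return [(text[u["start"]:u["end"]], u["tag"]) for u in merged]


# Hypothetical offsets and tags, modelled on the "@macys" observation above
text = "Thanks @macys"
pieces = [(0, 0, "X"), (0, 6, "NOUN"), (7, 8, "SYM"),
          (8, 12, "PROPN"), (12, 13, "PROPN"), (13, 13, "X")]
print(merge_adjacent_pieces(text, pieces))
# [('Thanks', 'NOUN'), ('@macys', 'PROPN')]
```

Whether the longest piece, the first piece, or a majority vote gives the best tag for the merged unit would have to be checked against the test corpus.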

In [None]:
## TweebankNLP
import stanza

# config for the `en_tweet` models (models trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # disable tokenization
          'tokenize_model_path': './twitter-stanza/saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './twitter-stanza/saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './twitter-stanza/saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './twitter-stanza/saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './twitter-stanza/saved_models/ner/en_tweet_nertagger.pt',
}

# Initialize the pipeline using a configuration dict
stanza.download("en")
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result

## 04 TweebankNLP
- Twpipe: https://github.com/Oneplus/twpipe (uses the TweeboParser and adapts it to Universal Dependencies)
- TweeboParser: https://github.com/ikekonglp/TweeboParser
- Tweebank V2: https://github.com/Oneplus/Tweebank
- https://github.com/mit-ccc/TweebankNLP
- pre-trained transformer models: the state-of-the-art Twitter NER and POS tagging models are available on Hugging Face Hub: https://huggingface.co/TweebankNLP

Various pretrained models:
- TweebankNLP/bertweet-tb2_ewt-pos-tagging

- TweebankNLP/bertweet-tb2-pos-tagging

- TweebankNLP/bertweet-tb2-ner

- TweebankNLP/bertweet-tb2_wnut17-ner

In [41]:
# The older package (tweetnlp) does not support POS tagging:
import tweetnlp
print(list(tweetnlp.loader.TASK_CLASS.keys()))

['sentiment', 'offensive', 'irony', 'hate', 'emotion', 'emoji', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary', 'topic_classification', 'ner', 'language_model', 'sentence_embedding', 'question_answering', 'question_answer_generation']


In [42]:
import torch
print(torch.__version__)

2.6.0


In [4]:
#### TweebankNLP ####
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
df = pd.read_csv("testkorpus_divers_50.csv")

df["text"] = df["text"].fillna("").astype(str)
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Device: CPU
device = -1

tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=device
)

results = []

for idx, row in df.iterrows():
    text = row["text"].strip()
    
    if text:
        tagged = tagger(text)
        for token_info in tagged:
            word_text = token_info.get("word")
            pos_tag = token_info.get("entity_group")
            lemma = word_text
            results.append({
                "post_id": idx,
                "date": row.get("date"),
                "word": word_text,
                "lemma": lemma,
                "pos": pos_tag,
                "lemma_p": f"{lemma}_{pos_tag}"
            })
    else:
        results.append({
            "post_id": idx,
            "date": row.get("date"),
            "word": "",
            "lemma": "",
            "pos": "",
            "lemma_p": ""
        })

twe = pd.DataFrame(results)
twe.to_csv("testkorpus_divers_50_tweebank.csv", index=False, encoding="utf-8")
display(twe.head())

Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,The,DET,The_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


In [5]:
import pandas as pd
twe = pd.read_csv("testkorpus_divers_50_tweebank.csv")
display(twe[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,Of,ADP,Of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,Comments,NOUN,Comments_NOUN
74,4,2020-05-11,On,On,ADP,On_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",”",",”",PUNCT,",”_PUNCT"


In [6]:
twe.shape
# Emojis replaced by @??

(1225, 6)

In [10]:
#### Tweebank with spaCy ####
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import spacy

df = pd.read_csv("testkorpus_divers_50.csv")
model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
tagger = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

nlp = spacy.load("en_core_web_sm")

results = []

for idx, row in df.iterrows():
    text = str(row["text"])
    date = row.get("date", None)

    # POS tagging with Tweebank
    pos_tags = tagger(text)

    # Lemmatization with spaCy
    doc = nlp(text)

    lemma_map = {token.text: token.lemma_ for token in doc}

    for token_info in pos_tags:
        word_text = token_info["word"]
        pos_tag = token_info["entity_group"]
        lemma = lemma_map.get(word_text, word_text)

        results.append({
            "post_id": idx,
            "date": date,
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}"
        })

twesp = pd.DataFrame(results)
twesp.to_csv("testkorpus_divers_50_twee_spa.csv", index=False)
display(twesp.head())

Device set to use cpu


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder@@,Reminder@@,NOUN,Reminder@@_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss Universe,Miss Universe,PROPN,Miss Universe_PROPN
4,0,2010-11-04,competition,competition,NOUN,competition_NOUN


In [11]:
import pandas as pd
twesp = pd.read_csv("testkorpus_divers_50_twee_spa.csv")
display(twesp[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Of,of,ADP,of_ADP
71,4,2020-05-11,Barr@@,Barr@@,PROPN,Barr@@_PROPN
72,4,2020-05-11,<unk> s,<unk> s,PART,<unk> s_PART
73,4,2020-05-11,Comments,comment,NOUN,comment_NOUN
74,4,2020-05-11,On,on,ADP,on_ADP
75,4,2020-05-11,<unk>,<unk>,PUNCT,<unk>_PUNCT
76,4,2020-05-11,Meet,Meet,VERB,Meet_VERB
77,4,2020-05-11,The,The,DET,The_DET
78,4,2020-05-11,Press@@,Press@@,NOUN,Press@@_NOUN
79,4,2020-05-11,",”",",”",PUNCT,",”_PUNCT"


In [12]:
twesp.shape

(1225, 6)

In [13]:
#### Tweebank with Stanza ####
import pandas as pd
import stanza
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

df = pd.read_csv("testkorpus_divers_50.csv")

model_name = "TweebankNLP/bertweet-tb2_ewt-pos-tagging" 
# based on a combination of UD's EWT and Tweebank v2
# according to (Parsing Tweets into UD), this performs better than other combinations
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

stanza.download("en")
nlp_stanza = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma",
    tokenize_engine="tokenize/tweet"  # intended: tweet tokenizer (the load log below lists the `combined` tokenize package instead)
)

results = []

for idx, row in df.iterrows():
    text = str(row["text"])
    date = row.get("date", None)

    if not isinstance(text, str) or text.strip() == "":
        continue

    # Tokenization and lemmatization with Stanza
    doc = nlp_stanza(text)
    tokens = [(w.text, w.lemma) for s in doc.sentences for w in s.words]

    words = [t[0] for t in tokens]
    # Note: a list input is tagged item by item, so each word is tagged without its sentence context
    pos_tags_nested = tagger(words)
    pos_tags = [tag[0] if isinstance(tag, list) else tag for tag in pos_tags_nested]

    for token_info, (word_text, lemma) in zip(pos_tags, tokens):
        pos_tag = token_info["entity_group"]

        results.append({
            "post_id": idx,
            "date": date,
            "word": word_text,
            "lemma": lemma,
            "pos": pos_tag,
            "lemma_p": f"{lemma}_{pos_tag}",
        })

twesta = pd.DataFrame(results)
twesta.to_csv("testkorpus_divers_50_twee_sta.csv", index=False, encoding="utf-8")
display(twesta.head())

Device set to use cpu


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-27 17:20:09 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-27 17:20:09 INFO: Downloading default packages for language: en (English) ...
2025-09-27 17:20:12 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-27 17:20:17 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-27 17:20:17 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-27 17:20:17 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-27 17:20:17 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| lemma     | combined_nocharlm |

2025-09-27 17:20:17 INFO: Using device: cpu
2025-09-27 17:20:17 INFO: Loading: tokenize
2025-09-27 17:20:17 INFO: Loading: mwt
2025-09-27 17:20:17 INFO: Loading: lemma
2025-09-27 17:20:18 INFO: Done loading processors!


Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
0,0,2010-11-04,Reminder,reminder,NOUN,reminder_NOUN
1,0,2010-11-04,:,:,PUNCT,:_PUNCT
2,0,2010-11-04,The,the,DET,the_DET
3,0,2010-11-04,Miss,Miss,PROPN,Miss_PROPN
4,0,2010-11-04,Universe,Universe,NOUN,Universe_NOUN


In [14]:
import pandas as pd
twesta = pd.read_csv("testkorpus_divers_50_twee_sta.csv")
display(twesta[70:115])

Unnamed: 0,post_id,date,word,lemma,pos,lemma_p
70,4,2020-05-11,Shreds,shreds,NOUN,shreds_NOUN
71,4,2020-05-11,NBC,NBC,PROPN,NBC_PROPN
72,4,2020-05-11,’s,'s,PRON,'s_PRON
73,4,2020-05-11,Chuck,Chuck,PROPN,Chuck_PROPN
74,4,2020-05-11,Todd,Todd,PROPN,Todd_PROPN
75,4,2020-05-11,For,for,INTJ,for_INTJ
76,4,2020-05-11,‘,',PUNCT,'_PUNCT
77,4,2020-05-11,Deceptive,deceptive,ADJ,deceptive_ADJ
78,4,2020-05-11,Editing,editing,NOUN,editing_NOUN
79,4,2020-05-11,’,'s,PUNCT,'s_PUNCT


In [15]:
twesta.shape

(1209, 6)

### Conclusions on TweebankNLP:
#### Tweebank:
- many @@ markers from segmentation: @Timc1021 becomes @Timc@@ 1021
- URLs as X
- Tokenization of words with attached punctuation fails: @_KatherineWeb@@ X b: PUNCT
- Hashtags are kept whole, tagged PROPN or X
- RT @darhar981@@ as X
- General ADJ
- Shreds becomes shred VERB
- Chuck Todd as PROPN
- <unk> (unknown) for ', " and :
- Emojis as unknown and SYM
- Mariano Rivera as PROPN
- Not only ADV a DET
- some URLs are chopped up and analyzed like a sentence (X PROPN X PROPN X NOUN VERB, ..)
- People is lemmatized to People (i.e. not at all)
- May your day be filled with peace! (May as AUX)
- jobs as NOUN
- no lemmatization
- tokenization with @@ at punctuation marks
- anti-@@ Trump@@ split (ADJ PROPN)
- Linda Bean as PROPN
- LL.Bean (without a space, as PROPN)
- its (spelling error) as PRON
- good ol' USA. becomes good ol@@ ADJ ' PART USA. PROPN
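
The `@@` suffix seen throughout this output is the continuation marker of the BPE subword tokenizer used by BERTweet: a piece ending in `@@` continues into the next piece. The sketch below shows one way the pieces could be glued back together before writing the CSV. The helper name `merge_bpe` and the rule of keeping the first piece's tag are assumptions made for illustration. Merging only restores the original whitespace-delimited string, so punctuation attached to a word stays attached.

```python
from typing import List, Optional, Tuple


def merge_bpe(tokens: List[Tuple[str, str]], marker: str = "@@") -> List[Tuple[str, str]]:
    """Join pieces that end with the continuation marker to the piece that follows."""
    merged: List[Tuple[str, str]] = []
    buffer = ""
    buffer_tag: Optional[str] = None
    for piece, tag in tokens:
        if buffer_tag is None:
            buffer_tag = tag  # the merged word keeps the tag of its first piece
        if piece.endswith(marker):
            buffer += piece[: -len(marker)]
        else:
            merged.append((buffer + piece, buffer_tag))
            buffer, buffer_tag = "", None
    if buffer:  # trailing piece that still carried a marker
        merged.append((buffer, buffer_tag))
    return merged


# Pieces modelled on the observations in the list above
tokens = [("@Timc@@", "X"), ("1021", "X"), ("jo@@", "NOUN"), ("ke!", "PUNCT")]
print(merge_bpe(tokens))
# [('@Timc1021', 'X'), ('joke!', 'NOUN')]
```

Grouped entries that already contain a space (such as `Miss Universe`) pass through unchanged; they result from `aggregation_strategy="simple"`, which groups neighbouring pieces that received the same tag.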

#### Tweebank with spaCy:

- Emojis are converted to <unk> plus @@; word segmentation fails
- MWEs are recognized, but too much else is kept together as well (will be as AUX and Miss Universe as PROPN, Childhood Illnesses@@ as NOUN, RT @darhar981@@ as X, Mariano Rivera as PROPN, Not only as ADV, even more now@@ as ADV)
- Punctuation marks ' and " are converted to unknown
- General as ADJ, Governor Phil Bryant as PROPN
- Hashtags are retained; #EnterSandman #HOF2019 tagged X
- Shreds becomes shred as VERB
- @macys stays @macys, @LBPerfectMaine is split
- Hashtag #Made@@ In@@ America split (VERB ADP PROPN)
- AMERIC@@ A, PROPN PUNCT
- election predictions." becomes election prediction@@ NOUN s." PUNCT
- joke! becomes jo@@ NOUN ke! PUNCT
- Words are split wrongly (especially at punctuation): 9pm as NUM, Thank@@ NOUN s@@ ADV, Fame! becomes Fam@@ PROPN e@@ NOUN ! PUNCT
- @@ is always added during tokenization: Vegas!" becomes Vegas@@ !""
- one character error is recognized well
- Hashtags stay together but are labeled X
- Emojis are labeled unknown and SYM or PUNCT
- URLs mostly stay whole and are tagged X; when they are chopped up, they are analyzed like a sentence (X PROPN X VERB X NOUN X ADP DET NOUN,..)
- no difference between the two code variants (0,1)
- The lemma often does not fit (BEST@@ stays BEST@@, but is becomes be)

#### Tweebank with Stanza:

- @KatherineWebb and @LBPerfectMaine as X, @macys stays @macys (in the lemma)
- #AGENDA47 as X, KellyFile as X, BreitbartNews as X
- URLs kept whole, but tagged X
- ol' as SYM?, on as X, of as ADV, to as SYM, take as NOUN, jobs as PROPN
- May (your day be filled with peace) as PROPN, from as ADP
- RT as VERB, BEST lemmatized to good, People to person, countries to country, stolen to steal
- Barr's split into PROPN and PRON
- Shreds as NOUN
- does is lemmatized to do, and n't to not (AUX and PART)
- Comments lemmatized to comment
- separated @ and emojis as SYM
- Addressing lemmatized to addressing
- Tokenization by Stanza (see Stanza)
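
Several of the odd tags in this variant (to as SYM, on as X) are plausibly a side effect of how the cell combines the two tools: the Stanza words are handed to the tagger as a list, and a list input is processed item by item, so no word sees its neighbours. A minimal sketch of an alternative is given below: tag the whole text once, then map the tagged character spans onto the tokenizer's tokens by overlap. The name `align_tags` and the plain-tuple data layout are assumptions made for illustration; the sketch also assumes that the tagger output carries `start`/`end` character offsets, which may not be available for this model's tokenizer.

```python
from typing import List, Optional, Tuple

Token = Tuple[int, int, str]  # (start, end, text) from the tokenizer
Span = Tuple[int, int, str]   # (start, end, tag) from the context-aware tagger


def align_tags(tokens: List[Token], spans: List[Span]) -> List[Tuple[str, Optional[str]]]:
    """Give each token the tag of the span it overlaps most (None if no overlap)."""
    aligned = []
    for t_start, t_end, t_text in tokens:
        best_tag, best_overlap = None, 0
        for s_start, s_end, tag in spans:
            overlap = min(t_end, s_end) - max(t_start, s_start)
            if overlap > best_overlap:
                best_tag, best_overlap = tag, overlap
        aligned.append((t_text, best_tag))
    return aligned


# Hypothetical offsets for "Miss Universe will be LIVE"
tokens = [(0, 4, "Miss"), (5, 13, "Universe"), (14, 18, "will"),
          (19, 21, "be"), (22, 26, "LIVE")]
spans = [(0, 13, "PROPN"), (14, 21, "AUX"), (22, 26, "ADJ")]
print(align_tags(tokens, spans))
# [('Miss', 'PROPN'), ('Universe', 'PROPN'), ('will', 'AUX'), ('be', 'AUX'), ('LIVE', 'ADJ')]
```

With this layout the grouped spans produced by the tagger (Miss Universe, will be) are distributed back onto single tokens, while lemmas can still come from the tokenizer side.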

# 05 Stanza
https://huggingface.co/stanfordnlp/stanza-en
https://github.com/stanfordnlp/stanza

In [52]:
# !pip install stanza

In [53]:
# quick test
import stanza

stanza.download("en") 
# if the EWT model is wanted, package="ewt" has to be added
# since version 1.10.0 the default for "en" is a combination of several English treebanks (package "combined")
nlp = stanza.Pipeline(
    lang="en", 
    processors="tokenize,pos,lemma", 
    tokenize_pretokenized=False,
    use_gpu=False, 
    tokenize_with_spacy=False, 
    tokenize_no_ssplit=False,
    tokenize_engine="tokenize/tweet"
)

# example text
text = "Donald Trump posted a new tweet. #realdonaldtrump @realdonaldtrump! @ben4appel 🤣 :) https://t.co/bsB6rVV7Yn #Canada"

doc = nlp(text)

for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.xpos)
# Hashtags are split in about half of the cases and kept whole in the other half.

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-24 11:57:31 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-24 11:57:31 INFO: Downloading default packages for language: en (English) ...
2025-09-24 11:57:34 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-24 11:57:39 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-24 11:57:39 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-24 11:57:40 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-24 11:57:41 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-24 11:57:41 INFO: Using device: cpu
2025-09-24 11:57:41 INFO: Loading: tokenize
2025-09-24 11:57:41 INFO: Loading: mwt
2025-09-24 11:57:41 INFO: Loading: pos
2025-09-24 11:57:43 INFO: Loading: lemma
2025-09-24 11:57:44 INFO: Done loading processors!


Donald Donald PROPN NNP
Trump Trump PROPN NNP
posted post VERB VBD
a a DET DT
new new ADJ JJ
tweet tweet NOUN NN
. . PUNCT .
#realdonaldtrump #realdonaldtrump PROPN NNP
@realdonaldtrump @realdonaldtrump PROPN NNP
! ! PUNCT .
@ben4appel @ben4appel PROPN ADD
ü§£ ü§£ PUNCT .
:) :) SYM NFP
https://t.co/bsB6rVV7Yn https://t.co/bsB6rVV7Yn PROPN ADD
# # SYM NN
Canada Canada PROPN NNP


In [54]:
#### the English Stanza model #### based on UD
import pandas as pd
import stanza
stanza.download('en')  
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma', use_gpu=False)
df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []

for idx, row in df.iterrows():
    text = row["text"]
    if pd.isna(text):
        continue
    doc = nlp(str(text))
    for sentence in doc.sentences:
        for token in sentence.tokens:
            word = token.text
            word_info = token.words[0]
            lemma = word_info.lemma
            pos = word_info.upos
            lemma_p = f"{lemma}_{pos}"
            
            all_results.append({
                "post_id": idx + 1,
                "date": row["date"],
                "text": text,
                "word": word,
                "lemma": lemma,
                "pos": pos,
                "lemma_p": lemma_p
            })

sta = pd.DataFrame(all_results)
sta.to_csv("testkorpus_divers_50_stanza.csv", index=False, encoding="utf-8")
display(sta[612:665])



Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
612,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",#,#,SYM,#_SYM
613,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...",2016election,2016election,PROPN,2016election_PROPN
614,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
615,30,2014-09-01,"""""@NPHerron: @realDonaldTrump For president #2...","""","""",PUNCT,"""_PUNCT"
616,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,The,the,DET,the_DET
617,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,Zimmerman,Zimmerman,PROPN,Zimmerman_PROPN
618,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,trial,trial,NOUN,trial_NOUN
619,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,is,be,AUX,be_AUX
620,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,over,over,ADV,over_ADV
621,31,2013-07-17,The Zimmerman trial is over. It is time to mo...,.,.,PUNCT,._PUNCT


In [55]:
import pandas as pd
sta = pd.read_csv("testkorpus_divers_50_stanza.csv")
display(sta[200:215])

Unnamed: 0,post_id,date,text,word,lemma,pos,lemma_p
200,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,it,it,PRON,it_PRON
201,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,is,be,AUX,be_AUX
202,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,the,the,DET,the_DET
203,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,BEST,good,ADJ,good_ADJ
204,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,!,!,PUNCT,!_PUNCT
205,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,USA,USA,PROPN,USA_PROPN
206,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,🇺,🇺,PUNCT,🇺_PUNCT
207,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,🇸,🇸,PUNCT,🇸_PUNCT
208,11,2017-07-21,ICYMI- This week we hosted a #MadeInAmerica ev...,https://t.co/q4vB9GdE5y,https://t.co/q4vB9GdE5y,PROPN,https://t.co/q4vB9GdE5y_PROPN
209,12,2011-08-31,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,https://www.mediaite.com/tv/trump-team-scored-...,PROPN,https://www.mediaite.com/tv/trump-team-scored-...


In [56]:
sta.shape

(1190, 7)

In [58]:
## with Twitter tokenizer
import pandas as pd
import stanza

stanza.download("en")

# Stanza pipeline with tweet tokenizer
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,lemma", #mwt
    tokenize_pretokenized=False,
    use_gpu=False,
    tokenize_with_spacy=False,
    tokenize_no_ssplit=False,
    tokenize_engine="tokenize/tweet" # Tweebank v2 in the Stanza tokenizer from TweebankNLP
)

df = pd.read_csv("testkorpus_divers_50.csv")

all_results = []

for idx, row in df.iterrows():
    text = row.get("text")
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        continue

    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            lemma_p = f"{word.lemma}_{word.upos}"
            all_results.append({
                "post_id": idx + 1,
                "date": row.get("date"),
                "word": word.text,
                "lemma": word.lemma,
                "pos": word.upos,   
                "xpos": word.xpos,
                "lemma_p": lemma_p,
            })

statw = pd.DataFrame(all_results)
output_file = "testkorpus_divers_50_stanza_tweets.csv"
statw.to_csv(output_file, index=False, encoding="utf-8")
display(statw.head(20))



Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,:,:_PUNCT
2,1,2010-11-04,The,the,DET,DT,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,NNP,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NNP,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,NN,competition_NOUN
6,1,2010-11-04,will,will,AUX,MD,will_AUX
7,1,2010-11-04,be,be,AUX,VB,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,JJ,live_ADJ
9,1,2010-11-04,from,from,ADP,IN,from_ADP


In [59]:
import pandas as pd
statw = pd.read_csv("testkorpus_divers_50_stanza_tweets.csv")
display(statw[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
200,11,2017-07-21,MADE,made,VERB,VBN,made_VERB
201,11,2017-07-21,IN,in,ADP,IN,in_ADP
202,11,2017-07-21,AMERICA,America,PROPN,NNP,America_PROPN
203,11,2017-07-21,",",",",PUNCT,",",",_PUNCT"
204,11,2017-07-21,it,it,PRON,PRP,it_PRON
205,11,2017-07-21,is,be,AUX,VBZ,be_AUX
206,11,2017-07-21,the,the,DET,DT,the_DET
207,11,2017-07-21,BEST,good,ADJ,JJS,good_ADJ
208,11,2017-07-21,!,!,PUNCT,.,!_PUNCT
209,11,2017-07-21,USA,USA,PROPN,NNP,USA_PROPN


In [60]:
statw.shape
# Tags very well overall; however, not every emoji is correctly recognized as NFP.
# Many are tagged as ordinary punctuation instead.

(1207, 7)
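
The emoji observation above can be checked by counting which xpos tags the emoji tokens actually received. The following is a minimal sketch, assuming a tagged table with `word` and `xpos` columns like the CSV above. The helper name `emoji_tag_counts`, the sample rows, and the character ranges are illustrative assumptions and cover only part of the emoji space.

```python
import re
from collections import Counter

import pandas as pd

# Illustrative ranges only (an assumption): pictographs, emoticons, transport
# symbols (U+1F300-U+1F6FF) and supplemental symbols (U+1F900-U+1F9FF).
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF]+")


def emoji_tag_counts(tagged: pd.DataFrame) -> Counter:
    """Count the xpos tags assigned to tokens that consist only of emoji."""
    is_emoji = tagged["word"].astype(str).map(lambda w: bool(EMOJI_RE.fullmatch(w)))
    return Counter(tagged.loc[is_emoji, "xpos"])


# Hypothetical sample rows, not taken from the test corpus.
sample = pd.DataFrame({
    "word": ["USA", "\U0001F600", "\U0001F923", "!", "\U0001F680"],
    "xpos": ["NNP", "NFP", ".", ".", "NFP"],
})
counts = emoji_tag_counts(sample)
print(counts)
```

Applied to the real output, `emoji_tag_counts(statw)` would give a rough share of emoji tagged as NFP versus punctuation, within the limits of the chosen ranges.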

In [7]:
# Stanza + tweet tokenizer + emoji variant
import pandas as pd
import stanza
import re

stanza.download("en")

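# Stanza pipeline intended to use a tweet tokenizer.
# Caution: aggregation_strategy and tokenize_engine do not appear to be
# documented Stanza options, and the load log below reports the default
# "combined" tokenizer, so these two arguments may have no effect.
nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,pos,lemma",
    tokenize_pretokenized=False,
    use_gpu=False,
    tokenize_with_spacy=False,
    tokenize_no_ssplit=False,
    aggregation_strategy="simple",
    tokenize_engine="tokenize/tweet"
)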


# Unicode emoji regex
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    "]+", flags=re.UNICODE)

# classic text smileys such as :) ;-) =D
smiley_pattern = re.compile(r'[:;=8][\-~]?[)D]', flags=re.UNICODE)

def is_emoji_or_smiley(token):
    return bool(emoji_pattern.fullmatch(token)) or bool(smiley_pattern.fullmatch(token))


df = pd.read_csv("testkorpus_divers_50.csv")
all_results = []

for idx, row in df.iterrows():
    text = row.get("text")
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        continue

    doc = nlp(text)
    for sentence in doc.sentences:
        for word in sentence.words:
            xpos = word.xpos
            if is_emoji_or_smiley(word.text):
                xpos = "NFP"

            lemma_p = f"{word.lemma}_{word.upos}"
            all_results.append({
                "post_id": idx + 1,
                "date": row.get("date"),
                "word": word.text,
                "lemma": word.lemma,
                "pos": word.upos,   
                "xpos": xpos,
                "lemma_p": lemma_p,
            })

statw2 = pd.DataFrame(all_results)
output_file = "testkorpus_divers_50_stanza_tweets2.csv"
statw2.to_csv(output_file, index=False, encoding="utf-8")
display(statw2.head(20))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-27 17:29:35 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-27 17:29:35 INFO: Downloading default packages for language: en (English) ...
2025-09-27 17:29:38 INFO: File exists: /Users/vivien/stanza_resources/en/default.zip
2025-09-27 17:29:43 INFO: Finished downloading models and saved to /Users/vivien/stanza_resources
2025-09-27 17:29:43 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-09-27 17:29:43 INFO: Downloaded file to /Users/vivien/stanza_resources/resources.json
2025-09-27 17:29:44 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| mwt       | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |

2025-09-27 17:29:44 INFO: Using device: cpu
2025-09-27 17:29:44 INFO: Loading: tokenize
2025-09-27 17:29:44 INFO: Loading: mwt
2025-09-27 17:29:44 INFO: Loading: pos
2025-09-27 17:29:47 INFO: Loading: lemma
2025-09-27 17:29:48 INFO: Done loading processors!


Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
0,1,2010-11-04,Reminder,reminder,NOUN,NN,reminder_NOUN
1,1,2010-11-04,:,:,PUNCT,:,:_PUNCT
2,1,2010-11-04,The,the,DET,DT,the_DET
3,1,2010-11-04,Miss,Miss,PROPN,NNP,Miss_PROPN
4,1,2010-11-04,Universe,Universe,PROPN,NNP,Universe_PROPN
5,1,2010-11-04,competition,competition,NOUN,NN,competition_NOUN
6,1,2010-11-04,will,will,AUX,MD,will_AUX
7,1,2010-11-04,be,be,AUX,VB,be_AUX
8,1,2010-11-04,LIVE,live,ADJ,JJ,live_ADJ
9,1,2010-11-04,from,from,ADP,IN,from_ADP


In [8]:
import pandas as pd
statw2 = pd.read_csv("testkorpus_divers_50_stanza_tweets2.csv")
display(statw2[200:215])

Unnamed: 0,post_id,date,word,lemma,pos,xpos,lemma_p
200,11,2017-07-21,MADE,made,VERB,VBN,made_VERB
201,11,2017-07-21,IN,in,ADP,IN,in_ADP
202,11,2017-07-21,AMERICA,America,PROPN,NNP,America_PROPN
203,11,2017-07-21,",",",",PUNCT,",",",_PUNCT"
204,11,2017-07-21,it,it,PRON,PRP,it_PRON
205,11,2017-07-21,is,be,AUX,VBZ,be_AUX
206,11,2017-07-21,the,the,DET,DT,the_DET
207,11,2017-07-21,BEST,good,ADJ,JJS,good_ADJ
208,11,2017-07-21,!,!,PUNCT,.,!_PUNCT
209,11,2017-07-21,USA,USA,PROPN,NNP,USA_PROPN


In [9]:
statw2.shape

(1207, 7)
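
The NFP override in the emoji variant depends entirely on which characters the two regular expressions match. The sketch below reuses the same four Unicode ranges and the same smiley pattern and probes them with a few tokens. The probe set and the name `coverage` are assumptions chosen for illustration.

```python
import re

# Same four ranges as in the emoji variant above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)
smiley_pattern = re.compile(r"[:;=8][\-~]?[)D]")


def is_emoji_or_smiley(token: str) -> bool:
    return bool(emoji_pattern.fullmatch(token)) or bool(smiley_pattern.fullmatch(token))


# Hypothetical probe tokens (not drawn from the corpus).
probes = {
    "grinning face": "\U0001F600",
    "rolling on the floor laughing": "\U0001F923",
    "red heart + variation selector": "\u2764\uFE0F",
    "smiley": ":)",
    "sad smiley": ":(",
}
coverage = {name: is_emoji_or_smiley(tok) for name, tok in probes.items()}
for name, matched in coverage.items():
    print(f"{name}: {matched}")
```

With these ranges, U+1F923 and the red heart (U+2764 plus variation selector) are not matched, and neither is `:(`, so such tokens keep whatever xpos Stanza assigned. Extending the ranges would be one way to widen the override.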

### Conclusion on Stanza:
#### Stanza:
- fewer tags than all the other models
- @ mentions are mostly recognized well (and left as they were): @_KatherineWebb; @Timc1021 is split into @ (PUNCT) and Timc1021 (PROPN); @darhar981; @ MagaGlam"Emojis"; @macys is lemmatized to @macy; @WhiteHouse
- emojis are tagged as a period
- hashtags are often split: # MissUSA, #AGENDA47, # MAGA, # MadeInAmerica
- URLs are tagged as proper nouns but are always kept whole
- correct lemmatization (Looking gets the lemma look), pm becomes p.m., IS becomes be, BEST becomes good
- tagging is correct in most cases
- Barr's becomes Barr as PROPN; Tweebank splits it into Barr 's
- Shreds (wrongly PROPN) and its (a loser) (spelling error) are tagged incorrectly; ol' is tagged as NOUN (really old; Tweebank had ol' as ADJ); Illnesses is wrongly PROPN and keeps the lemma Illnesses (wrong lemmatization)
- doesn't gets the lemma do
- is (He is a joke) is tagged AUX rather than VERB (Universal Dependencies tags the copula be as AUX, so this matches the UD convention)
- anti-Trump becomes anti- ADP and Trump PROPN
- We're becomes we (with no are)

#### with tweet tokenizer:

- best to use xpos, since it includes tweet-specific tags
- however, for # MAGA (when tokenized incorrectly) xpos assigns (NN, NNP)
- hashtags and @ mentions are recognized well and tagged accordingly
- @ mentions and hashtags as above
- but @LBPerfectMaine gets ADD in xpos? All other @ mentions get NNP
- didn't is split into did n't
- URLs are tagged ADD in xpos and PROPN in pos (i.e. not tweet-specific); they are also kept whole
- emojis are mostly tagged as punctuation and only rarely as NFP
- Barr's is split into Barr and 's
- however, hashtags and @ mentions are split about as often as they are kept together
- May (your day be filled with peace) as AUX
- two new tags for tweets: ADD and NFP
- her also keeps the lemma her (instead of she)
- anti-Trump PROPN and (one sentence later) anti- ADP Trump PROPN
- We're becomes we PRON be AUX
- BEST is lemmatized to good, better to good, People to person
- ol' stays ol' NOUN
- Twisting becomes twist VERB, political ADJ, arms becomes arm NOUN, ? PUNCT
- JOBS NOUN
- 1st ADJ Amendment NOUN
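
Since hashtags and @ mentions are split about as often as they are kept whole, one possible post-processing step is to re-join a standalone `#` or `@` with the token that follows it. This is only a sketch under assumptions: the function name `merge_split_markers` is made up, the table is assumed to have `post_id` and `word` columns as in the CSVs above, and the marker is assumed to have been directly adjacent to the next token in the original text (which the tagged table alone cannot confirm).

```python
import pandas as pd

MARKERS = {"#", "@"}


def merge_split_markers(tagged: pd.DataFrame) -> pd.DataFrame:
    """Re-join a standalone '#' or '@' with the next token of the same post.

    The merged row keeps all other columns (lemma, pos, xpos, ...) of the
    second token; only the 'word' column is changed.
    """
    rows = []
    pending = None  # a standalone marker row waiting for its partner
    for row in tagged.to_dict("records"):
        if pending is not None:
            if row["post_id"] == pending["post_id"]:
                row = {**row, "word": pending["word"] + row["word"]}
            else:
                rows.append(pending)  # marker was the last token of its post
            pending = None
        if row["word"] in MARKERS:
            pending = row
        else:
            rows.append(row)
    if pending is not None:
        rows.append(pending)
    return pd.DataFrame(rows)


# Hypothetical sample rows for illustration.
sample = pd.DataFrame({
    "post_id": [1, 1, 1, 1, 1, 1],
    "word": ["#", "MAGA", "is", "@", "WhiteHouse", "#AGENDA47"],
    "xpos": ["NN", "NNP", "VBZ", "NN", "NNP", "NNP"],
})
merged = merge_split_markers(sample)
print(merged["word"].tolist())
```

The lemma and tags of merged rows would still need a decision, for example setting xpos to ADD, which this sketch deliberately leaves untouched.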

#### Tweet tokenizer with emojis:

- emojis are recognized as NFP more reliably
- the rest is identical to the first tweet-tokenizer model
- more stays more, but getting becomes get, better becomes good, stolen becomes steal, were becomes be, BEST becomes good
- 1st ADJ Amendment NOUN
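
The statement that the two variants differ only in their emoji tags was checked by reading the files side by side. It could also be checked mechanically with a row-wise comparison. The sketch below assumes both runs produced the same tokens in the same order (both outputs have 1207 rows, which makes this plausible but does not prove it). The function name `xpos_differences` and the sample data are illustrative.

```python
import pandas as pd


def xpos_differences(run_a: pd.DataFrame, run_b: pd.DataFrame) -> pd.DataFrame:
    """Return the tokens on which two token-aligned tagging runs disagree in xpos."""
    words_a = run_a["word"].astype(str).to_numpy()
    words_b = run_b["word"].astype(str).to_numpy()
    if len(words_a) != len(words_b) or not (words_a == words_b).all():
        raise ValueError("runs are not token-aligned")
    xpos_a = run_a["xpos"].astype(str).to_numpy()
    xpos_b = run_b["xpos"].astype(str).to_numpy()
    changed = xpos_a != xpos_b
    return pd.DataFrame({
        "word": words_a[changed],
        "xpos_a": xpos_a[changed],
        "xpos_b": xpos_b[changed],
    })


# Hypothetical mini example: only the emoji row differs.
first = pd.DataFrame({"word": ["BEST", "!", "\U0001F600"], "xpos": ["JJS", ".", "."]})
second = pd.DataFrame({"word": ["BEST", "!", "\U0001F600"], "xpos": ["JJS", ".", "NFP"]})
diff = xpos_differences(first, second)
print(diff)
```

On the real files this would be called as `xpos_differences(statw, statw2)`. If every row in the result is an emoji or smiley, the two variants agree everywhere else.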

# Final decision and tagging of the data

I am going with... STANZA with the tweet tokenizer (based on Tweebank)

Here are the two tagging variants with extra tags for tweets:

### Mapping table:
| `xpos` (Tweebank / PTB) | `upos` (Universal) | Meaning / examples                                                                     |
| ----------------------- | ------------------ | -------------------------------------------------------------------------------------- |
| **NN**                  | NOUN               | Noun, singular → *dog, idea*                                                           |
| **NNS**                 | NOUN               | Noun, plural → *dogs, cars*                                                            |
| **NNP**                 | PROPN              | Proper noun, singular → *Trump, Canada*                                                |
| **NNPS**                | PROPN              | Proper noun, plural → *the Smiths*                                                     |
| **PRP**                 | PRON               | Personal pronoun → *I, you, he*                                                        |
| **PRP\$**               | PRON               | Possessive pronoun → *my, your*                                                        |
| **WP**                  | PRON               | Wh-pronoun → *who, what*                                                               |
| **WP\$**                | PRON               | Possessive wh-pronoun → *whose*                                                        |
| **DT**                  | DET                | Determiner → *the, a, some*                                                            |
| **PDT**                 | DET                | Predeterminer → *all the kids*                                                         |
| **WDT**                 | DET                | Wh-determiner → *which*                                                                |
| **JJ**                  | ADJ                | Adjective → *big, nice*                                                                |
| **JJR**                 | ADJ                | Comparative adjective → *bigger*                                                       |
| **JJS**                 | ADJ                | Superlative adjective → *biggest*                                                      |
| **RB**                  | ADV                | Adverb → *quickly*                                                                     |
| **RBR**                 | ADV                | Comparative adverb → *faster*                                                          |
| **RBS**                 | ADV                | Superlative adverb → *fastest*                                                         |
| **WRB**                 | ADV                | Wh-adverb → *how, when, why*                                                           |
| **VB**                  | VERB               | Verb, base form → *eat, go*                                                            |
| **VBD**                 | VERB               | Verb, past tense → *ate, went*                                                         |
| **VBG**                 | VERB               | Verb, gerund/present participle → *eating*                                             |
| **VBN**                 | VERB               | Verb, past participle → *eaten*                                                        |
| **VBP**                 | VERB               | Verb, non-3sg present → *eat, go*                                                      |
| **VBZ**                 | VERB               | Verb, 3sg present → *eats, goes*                                                       |
| **MD**                  | AUX                | Modal → *can, should*                                                                  |
| **IN**                  | ADP                | Preposition or subordinating conjunction → *in, of, because*                           |
| **TO**                  | PART               | Particle *to*                                                                          |
| **CC**                  | CCONJ              | Coordinating conjunction → *and, or*                                                   |
| **UH**                  | INTJ               | Interjection → *oh, hi*                                                                |
| **EX**                  | PRON               | Existential *there*                                                                    |
| **FW**                  | X                  | Foreign word                                                                           |
| **SYM**                 | SYM                | Symbol → *%, \$, +*                                                                    |
| **LS**                  | X                  | List item marker                                                                       |
| **CD**                  | NUM                | Cardinal number → *5, twenty*                                                          |
| **POS**                 | PART               | Possessive marker *'s*                                                                 |
| **RP**                  | PART               | Particle → *up, off*                                                                   |
| **ADD**                 | PROPN              | **Tweebank special tag**: URL, email, @mention, hashtag → *@user, #hashtag, https\://…* |
| **NFP**                 | SYM                | **Tweebank special tag**: non-functional punctuation → *🤣, :), ❤️*                     |
| **. , : ; - etc.**      | PUNCT              | Punctuation                                                                            |
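
The mapping table can be written down as a plain lookup, for example to derive a coarse `upos` column from `xpos`. This is a sketch of the table only, not of what Stanza does internally: the names `XPOS_TO_UPOS`, `PUNCT_TAGS`, and `xpos_to_upos` are made up for illustration, the fallback to `X` for unknown tags is an assumption, and the "etc." in the punctuation row is left unexpanded. The tagger's own `upos` can differ from this default, as in the output above, where *is* has xpos VBZ but upos AUX.

```python
# One-to-one rows of the mapping table above.
XPOS_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "PRP": "PRON", "PRP$": "PRON", "WP": "PRON", "WP$": "PRON", "EX": "PRON",
    "DT": "DET", "PDT": "DET", "WDT": "DET",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV", "WRB": "ADV",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "MD": "AUX",
    "IN": "ADP",
    "TO": "PART", "POS": "PART", "RP": "PART",
    "CC": "CCONJ",
    "UH": "INTJ",
    "FW": "X", "LS": "X",
    "SYM": "SYM", "NFP": "SYM",
    "CD": "NUM",
    "ADD": "PROPN",
}

# Only the punctuation tags spelled out in the last table row.
PUNCT_TAGS = {".", ",", ":", ";", "-"}


def xpos_to_upos(xpos: str, default: str = "X") -> str:
    """Look up the table's default upos for an xpos tag (assumed fallback: X)."""
    if xpos in PUNCT_TAGS:
        return "PUNCT"
    return XPOS_TO_UPOS.get(xpos, default)


for tag in ["NNP", "NFP", "ADD", ",", "VBZ"]:
    print(tag, "->", xpos_to_upos(tag))
```

Comparing this default against the tagger's actual `pos` column would show where the model deviates from the table, for instance auxiliaries tagged VBZ or VB.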