# Preprocessing Text

Reading, transforming, filtering, ...

After Sarkar, adapted by Heiko Rölke

## Reading Text from the Internet

In [1]:
import requests

data = requests.get('http://www.gutenberg.org/cache/epub/8001/pg8001.html')
content = data.content
print(content[1163:2200])

b'gin-top: 0;\r\n    margin-bottom: 0;\r\n}\r\n#pg-header #pg-machine-header strong {\r\n    font-weight: normal;\r\n}\r\n#pg-header #pg-start-separator, #pg-footer #pg-end-separator {\r\n    margin-bottom: 3em;\r\n    margin-left: 0;\r\n    margin-right: auto;\r\n    margin-top: 2em;\r\n    text-align: center\r\n}\r\n\r\n    .xhtml_center {text-align: center; display: block;}\r\n    .xhtml_center table {\r\n        display: table;\r\n        text-align: left;\r\n        margin-left: auto;\r\n        margin-right: auto;\r\n        }</style><title>The Project Gutenberg eBook of The Bible, King James version, Book 1: Genesis, by Anonymous</title><style>/* ************************************************************************\r\n * classless css copied from https://www.pgdp.net/wiki/CSS_Cookbook/Styles\r\n * ********************************************************************** */\r\n/* ************************************************************************\r\n * set the body margins t

In [2]:
import re
from bs4 import BeautifulSoup

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # drop embedded frames and scripts before extracting the text
    for s in soup(['iframe', 'script']):
        s.extract()
    stripped_text = soup.get_text()
    # collapse runs of CR/LF characters into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

clean_content = strip_html_tags(content)
print(clean_content[1163:2045])

form, and void; and darkness was
           upon the face of the deep. And the Spirit of God moved upon
           the face of the waters.
01:001:003 And God said, Let there be light: and there was light.
01:001:004 And God saw the light, that it was good: and God divided the
           light from the darkness.
01:001:005 And God called the light Day, and the darkness he called
           Night. And the evening and the morning were the first day.
01:001:006 And God said, Let there be a firmament in the midst of the
           waters, and let it divide the waters from the waters.
01:001:007 And God made the firmament, and divided the waters which were
           under the firmament from the waters which were above the
           firmament: and it was so.
01:001:008 And God called the firmament Heaven. And the evening and the
           morning were the second day.
01:001
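The tag stripping can also be sketched with the standard library alone, which may help when BeautifulSoup is not installed. This is a minimal illustration and not the notebook's method: the names TextExtractor and strip_html_stdlib are hypothetical, and unlike the cell above the sketch also skips style blocks.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect text content, ignoring everything inside SKIP tags."""
    SKIP = {'script', 'style', 'iframe'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.parts)
    # collapse runs of line breaks into one newline
    return re.sub(r'[\r\n]+', '\n', text)


print(strip_html_stdlib('<p>Hello</p>\r\n\r\n<script>var x = 1;</script><p>World</p>'))
```

The function expects a str; a bytes response such as data.content would have to be decoded first.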


## Tokenization

Sentence tokenization:

In [3]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
import numpy as np

# loading text corpora
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = ("US unveils world's most powerful supercomputer, beats China. " 
               "The US has unveiled the world's most powerful supercomputer called 'Summit', " 
               "beating the previous record-holder China's Sunway TaihuLight. With a peak performance "
               "of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, "
               "which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, "
               "which reportedly take up the size of two tennis courts.")
sample_text

"US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts."

In [4]:
# Total characters in Alice in Wonderland
len(alice)

144395

In [5]:
# First 100 characters in the corpus
alice[0:100]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was"

In [8]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
print(np.array(sample_sentences))

print('\nTotal sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice:-')
print(np.array(alice_sentences[0:5]))

Total sentences in sample_text: 4
Sample text sentences :-
["US unveils world's most powerful supercomputer, beats China."
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight."
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.'
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

Total sentences in alice: 1625
First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."
 "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'"
 

This also works for other languages:

In [11]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Total characters in the corpus
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [12]:
# default sentence tokenizer 
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance  
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

In [13]:
german_sentences_def[:5]

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']

In [14]:
german_sentences[:5]

[' \nWiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .',
 'Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .',
 'Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .',
 'Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .',
 'Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .']

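## Sentence Tokenization with Regular Expressions

The pattern here uses two new regex constructs: (?<!...) and (?<=...), the negative and positive lookbehind. A lookbehind checks the text immediately before the current position without consuming it.
To understand the construction better, try it out at your own pace in small steps (outside the lecture).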
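A small experiment along these lines might look as follows. It is only a sketch with a made-up example sentence, using Python's built-in re module; the patterns are simplified versions of the one used in the next cell.

```python
import re

text = "Dr. Smith arrived. Was he late? No!"

# Positive lookbehind: whitespace counts as a boundary only if it
# directly follows '.', '?' or '!'. The punctuation is not consumed.
naive = re.compile(r'(?<=[.?!])\s')
naive_parts = naive.split(text)
print(naive_parts)
# ['Dr.', 'Smith arrived.', 'Was he late?', 'No!']

# Negative lookbehind added: no boundary if the three characters before
# the whitespace look like a title such as 'Dr.' or 'Mr.'
guarded = re.compile(r'(?<![A-Z][a-z]\.)(?<=[.?!])\s')
guarded_parts = guarded.split(text)
print(guarded_parts)
# ['Dr. Smith arrived.', 'Was he late?', 'No!']
```

Python's re module requires each lookbehind to have a fixed width, which is why the full pattern chains several separate lookbehinds instead of one with alternatives of different lengths.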

In [15]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(
            pattern=SENTENCE_TOKENS_PATTERN,
            gaps=True)  # the pattern matches the gaps between sentences
sample_sentences = regex_st.tokenize(sample_text)
sample_sentences 

["US unveils world's most powerful supercomputer, beats China.",
 "The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.",
 'With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.',
 'Summit has 4,608 servers, which reportedly take up the size of two tennis courts.']

## Word Tokenization

In [16]:
default_wt = nltk.word_tokenize
words = default_wt(sample_text)
print(words)

['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'Summit", "'", ',', 'beating', 'the', 'previous', 'record-holder', 'China', "'s", 'Sunway', 'TaihuLight', '.', 'With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.']


In [17]:
TOKEN_PATTERN = r'\w+'        
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,
                                gaps=False)  # here the pattern matches the words themselves
words = regex_wt.tokenize(sample_text)
print(words)

['US', 'unveils', 'world', 's', 'most', 'powerful', 'supercomputer', 'beats', 'China', 'The', 'US', 'has', 'unveiled', 'the', 'world', 's', 'most', 'powerful', 'supercomputer', 'called', 'Summit', 'beating', 'the', 'previous', 'record', 'holder', 'China', 's', 'Sunway', 'TaihuLight', 'With', 'a', 'peak', 'performance', 'of', '200', '000', 'trillion', 'calculations', 'per', 'second', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', 'which', 'is', 'capable', 'of', '93', '000', 'trillion', 'calculations', 'per', 'second', 'Summit', 'has', '4', '608', 'servers', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts']


In [18]:
GAP_PATTERN = r'\s+'        
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,
                                gaps=True)  # and here the gaps again
words = regex_wt.tokenize(sample_text)
print(words)

['US', 'unveils', "world's", 'most', 'powerful', 'supercomputer,', 'beats', 'China.', 'The', 'US', 'has', 'unveiled', 'the', "world's", 'most', 'powerful', 'supercomputer', 'called', "'Summit',", 'beating', 'the', 'previous', 'record-holder', "China's", 'Sunway', 'TaihuLight.', 'With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second,', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight,', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second.', 'Summit', 'has', '4,608', 'servers,', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts.']
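The two regex variants above either split too much (4,608 becomes 4 and 608) or too little (punctuation stays glued to the words). A middle ground can be sketched with a single pattern of ordered alternatives. The pattern and the name regex_word_tokenize are illustrative assumptions, not taken from Sarkar, and the sketch uses only the re module.

```python
import re

# Alternatives are tried left to right at each position:
#   1. numbers with thousands separators (4,608)
#   2. words, optionally joined by '-' or an apostrophe (record-holder, China's)
#   3. any single remaining non-space character (punctuation)
TOKEN_RE = re.compile(r"\d+(?:,\d+)*|\w+(?:[-']\w+)*|[^\w\s]")


def regex_word_tokenize(text):
    """Return the list of tokens matched by TOKEN_RE."""
    return TOKEN_RE.findall(text)


print(regex_word_tokenize("China's record-holder has 4,608 servers."))
# ["China's", 'record-holder', 'has', '4,608', 'servers', '.']
```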


(Sarkar demonstrates further tokenizers; look them up there if you are interested.)

## Tokenization with spaCy

In [19]:
import spacy
nlp = spacy.load('en_core_web_trf')

text_spacy = nlp(sample_text)



In [20]:
sents = list(text_spacy.sents)
sents

[US unveils world's most powerful supercomputer, beats China.,
 The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight.,
 With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second.,
 Summit has 4,608 servers, which reportedly take up the size of two tennis courts.]

Words grouped by sentence:

In [21]:
sent_words = [[word.text for word in sent] for sent in sents]
print(sent_words)

[['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.'], ['The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating', 'the', 'previous', 'record', '-', 'holder', 'China', "'s", 'Sunway', 'TaihuLight', '.'], ['With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.'], ['Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.']]


or as one flat list:

In [22]:
words = [word.text for word in text_spacy]
print(words)

['US', 'unveils', 'world', "'s", 'most', 'powerful', 'supercomputer', ',', 'beats', 'China', '.', 'The', 'US', 'has', 'unveiled', 'the', 'world', "'s", 'most', 'powerful', 'supercomputer', 'called', "'", 'Summit', "'", ',', 'beating', 'the', 'previous', 'record', '-', 'holder', 'China', "'s", 'Sunway', 'TaihuLight', '.', 'With', 'a', 'peak', 'performance', 'of', '200,000', 'trillion', 'calculations', 'per', 'second', ',', 'it', 'is', 'over', 'twice', 'as', 'fast', 'as', 'Sunway', 'TaihuLight', ',', 'which', 'is', 'capable', 'of', '93,000', 'trillion', 'calculations', 'per', 'second', '.', 'Summit', 'has', '4,608', 'servers', ',', 'which', 'reportedly', 'take', 'up', 'the', 'size', 'of', 'two', 'tennis', 'courts', '.']


## Removing Accents and Umlauts

(not always necessary, but sometimes useful)

In [23]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

remove_accented_chars('Sómě Áccěntěd těxt')

'Some Accented text'

In [24]:
# effect on German umlauts:
remove_accented_chars("äusserlich, äußerlich, überhaupt, öffentlich")

'ausserlich, auerlich, uberhaupt, offentlich'

(This can certainly be done better: the 'ß' is dropped entirely and the umlauts lose their usual 'e' transliteration.)
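One possible improvement is to transliterate the German special characters before stripping the remaining accents. The following is a minimal sketch; the mapping UMLAUT_MAP and the function name transliterate_german are assumptions made for this illustration.

```python
import unicodedata

# Common German transliterations, applied before the generic accent removal
UMLAUT_MAP = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})


def transliterate_german(text):
    # NFC first, so that umlauts are single code points the map can find
    text = unicodedata.normalize('NFC', text).translate(UMLAUT_MAP)
    # then strip whatever accents are left, as in remove_accented_chars
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


print(transliterate_german("äusserlich, äußerlich, überhaupt, öffentlich"))
# aeusserlich, aeusserlich, ueberhaupt, oeffentlich
```

Whether 'ae' or plain 'a' is the better target depends on the downstream task, for example on how the vocabulary being matched against was normalised.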

## Expanding Contractions

In [29]:
# uses an externally defined mapping table

from contractions import contractions_dict
import re

def expand_contractions(text, contraction_mapping=contractions_dict):
    
    # HR: build one compiled pattern from all contraction keys
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), flags=re.IGNORECASE|re.DOTALL)
    
    # HR: inner function
    # HR: caution - Sarkar's function does not always work correctly. (Hopefully) fixed here.
    def expand_match(contraction):
        match = contraction.group(0)
        # the contraction keys are all lowercase
        expanded_contraction = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
        # restore capitalisation
        if match[0].isupper():
            expanded_contraction = expanded_contraction[0].upper() + expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)  # expand_match is called for every match
    expanded_text = re.sub("'", "", expanded_text)  # remove leftover apostrophes
    return expanded_text

In [30]:
expand_contractions("Y'all can't expand contractions I'd think")

'You all cannot expand contractions I would think'

In [31]:
expand_contractions("Ain't ain't good you'd think.")

'Are not are not good you would think.'
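Because the function above depends on an external mapping table, the mechanism can be tried in isolation with a tiny hand-written table. This is a sketch under stated assumptions: SMALL_CONTRACTIONS and expand_small are hypothetical names, the three entries are only examples, and the keys are escaped with re.escape, which the cell above does not do.

```python
import re

# Hypothetical miniature mapping; keys must be lowercase
SMALL_CONTRACTIONS = {
    "can't": "cannot",
    "i'd": "i would",
    "y'all": "you all",
}


def expand_small(text, mapping=SMALL_CONTRACTIONS):
    # longest keys first, so a longer contraction wins over a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys),
                         flags=re.IGNORECASE)

    def expand(match):
        matched = match.group(0)
        expanded = mapping[matched.lower()]
        # restore the capitalisation of the first letter
        if matched[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return pattern.sub(expand, text)


print(expand_small("Y'all can't expand contractions I'd think"))
# You all cannot expand contractions I would think
```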

## Removing Special Characters

In [32]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

remove_special_characters("Well this was fun! What do you think? 123#@!", 
                          remove_digits=True)

'Well this was fun What do you think '

## Text Correction

### Repeated Characters

First a "naive" solution: remove every character repetition

In [33]:
old_word = 'finalllyyy'
# new regex constructs: quantifiers/repetition and backreferences to captured sub-patterns
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1

while True:
    # remove one repeated character
    new_word = repeat_pattern.sub(match_substitution,
                                  old_word)
    if new_word != old_word:
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1 # update step
        # update old word to last substituted state
        old_word = new_word
        continue
    else:
        print("Final word:", new_word)
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Step: 4 Word: finaly
Final word: finaly


As we can see, this approach can also remove too many characters.

Better: 
remove characters only until a "real" word is reached (dictionary approach)

In [36]:
from nltk.corpus import wordnet
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
 
while True:
    # check for semantically correct word
    if wordnet.synsets(old_word):
        print("Final correct word:", old_word)
        break
    # remove one repeated character
    new_word = repeat_pattern.sub(match_substitution,
                                  old_word)
    if new_word != old_word:
        print('Step: {} Word: {}'.format(step, new_word))
        step += 1 # update step
        # update old word to last substituted state
        old_word = new_word  
        continue
    else:
        print("Final word:", new_word)
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Final correct word: finally


as a function:

In [37]:
import nltk
from nltk.corpus import wordnet

def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

In [38]:
sample_sentence = 'My schooool is realllllyyy amaaazingggg'
correct_tokens = remove_repeated_characters(nltk.word_tokenize(sample_sentence))
' '.join(correct_tokens)

'My school is really amazing'

In [39]:
beispielsatz = ("Hallloo? Wiiee jetzt?")
correct_tokens = remove_repeated_characters(nltk.word_tokenize(beispielsatz))
' '.join(correct_tokens)

'Halo ? Wie jetzt ?'

(This does not work well because WordNet is an English dictionary, the wrong one for German text.)
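The same loop works with any vocabulary that supports a membership test. The sketch below swaps WordNet for a plain Python set; the function name remove_repeated_characters_with_vocab and the three-word GERMAN_VOCAB are assumptions for illustration, and a real application would load a proper German word list.

```python
import re


def remove_repeated_characters_with_vocab(tokens, vocabulary):
    """Shrink character repetitions until the token is in the vocabulary."""
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')

    def replace(word):
        if word.lower() in vocabulary:
            return word
        # drop one character of the right-most doubled pair
        new_word = repeat_pattern.sub(r'\1\2\3', word)
        return replace(new_word) if new_word != word else new_word

    return [replace(token) for token in tokens]


# Hypothetical toy vocabulary, lowercase
GERMAN_VOCAB = {'hallo', 'wie', 'jetzt'}

tokens = ['Hallloo', '?', 'Wiiee', 'jetzt', '?']
print(' '.join(remove_repeated_characters_with_vocab(tokens, GERMAN_VOCAB)))
# Hallo ? Wie jetzt ?
```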

## Spelling Correction

In [41]:
import re, collections

def tokens(text): 
    """
    Get all words from the corpus
    """
    return re.findall('[a-z]+', text.lower()) 

# read the corpus and close the file afterwards
with open('big.txt') as f:
    WORDS = tokens(f.read())
WORD_COUNTS = collections.Counter(WORDS)
# top 10 words in corpus
WORD_COUNTS.most_common(10)

[('the', 80030),
 ('of', 40025),
 ('and', 38313),
 ('to', 28766),
 ('in', 22050),
 ('a', 21155),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681)]

In [42]:
def edits0(word): 
    """
    Return all strings that are zero edits away 
    from the input word (i.e., the word itself).
    """
    return {word}



def edits1(word):
    """
    Return all strings that are one edit away 
    from the input word.
    """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    def splits(word):
        """
        Return a list of all possible (first, rest) pairs 
        that the input word is made of.
        """
        return [(word[:i], word[i:]) 
                for i in range(len(word)+1)]
                
    pairs      = splits(word)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Return all strings that are two edits away 
    from the input word.
    """
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

In [43]:
def known(words):
    """
    Return the subset of words that are actually 
    in our WORD_COUNTS dictionary.
    """
    return {w for w in words if w in WORD_COUNTS}

In [44]:
# input word
word = 'fianlly'

# zero edit distance from input word
edits0(word)

{'fianlly'}

In [45]:
# returns the empty set since it is not a valid word
known(edits0(word))

set()

In [46]:
# one edit distance from input word
edits1(word)

{'afianlly',
 'aianlly',
 'bfianlly',
 'bianlly',
 'cfianlly',
 'cianlly',
 'dfianlly',
 'dianlly',
 'efianlly',
 'eianlly',
 'faanlly',
 'faianlly',
 'fainlly',
 'fanlly',
 'fbanlly',
 'fbianlly',
 'fcanlly',
 'fcianlly',
 'fdanlly',
 'fdianlly',
 'feanlly',
 'feianlly',
 'ffanlly',
 'ffianlly',
 'fganlly',
 'fgianlly',
 'fhanlly',
 'fhianlly',
 'fiaally',
 'fiaanlly',
 'fiablly',
 'fiabnlly',
 'fiaclly',
 'fiacnlly',
 'fiadlly',
 'fiadnlly',
 'fiaelly',
 'fiaenlly',
 'fiaflly',
 'fiafnlly',
 'fiaglly',
 'fiagnlly',
 'fiahlly',
 'fiahnlly',
 'fiailly',
 'fiainlly',
 'fiajlly',
 'fiajnlly',
 'fiaklly',
 'fiaknlly',
 'fiallly',
 'fially',
 'fialnlly',
 'fialnly',
 'fiamlly',
 'fiamnlly',
 'fianally',
 'fianaly',
 'fianblly',
 'fianbly',
 'fianclly',
 'fiancly',
 'fiandlly',
 'fiandly',
 'fianelly',
 'fianely',
 'fianflly',
 'fianfly',
 'fianglly',
 'fiangly',
 'fianhlly',
 'fianhly',
 'fianilly',
 'fianily',
 'fianjlly',
 'fianjly',
 'fianklly',
 'fiankly',
 'fianlaly',
 'fianlay',
 'fi
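The set above is long because every position in the word contributes candidates. Counting the four list comprehensions in edits1 gives an upper bound on its size before set() removes duplicates. The helper name edit1_candidate_count is made up for this sketch, which assumes a word of at least one character.

```python
def edit1_candidate_count(n, alphabet_size=26):
    """Number of strings edits1 generates for a word of length n (n >= 1),
    counted before duplicates are removed."""
    deletes = n                           # drop each of the n characters
    transposes = n - 1                    # swap each adjacent pair
    replaces = alphabet_size * n          # every letter at every position
    inserts = alphabet_size * (n + 1)     # every letter at every gap
    return deletes + transposes + replaces + inserts


print(edit1_candidate_count(len('fianlly')))
# 403
```

For the 26-letter alphabet this simplifies to 54n + 25. The actual set is somewhat smaller, since for example deleting either 'l' of 'fianlly' yields the same string.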

In [47]:
# get correct words from above set
known(edits1(word))

{'finally'}

In [48]:
# two edit distances from input word
edits2(word)

{'fiaicnlly',
 'niarlly',
 'fiailrly',
 'cianzlly',
 'fiunllc',
 'fnanilly',
 'fianwmly',
 'fiallt',
 'fiacely',
 'fnianllt',
 'fiaenllb',
 'nianlyl',
 'qfyanlly',
 'fianllyza',
 'yfianloy',
 'qfixanlly',
 'fvidnlly',
 'fianllyix',
 'fikblly',
 'flanzlly',
 'bianlpy',
 'fiqkanlly',
 'fianllyoa',
 'nfianlyly',
 'fkahlly',
 'yfianoly',
 'fianllyox',
 'fianlleby',
 'fibnlln',
 'fxanllg',
 'fhianlrly',
 'fianldlhy',
 'fmyianlly',
 'lianllz',
 'finnfly',
 'ficanllby',
 'fqanllty',
 'ifanlply',
 'fianyny',
 'ifianlliy',
 'faanylly',
 'bfianllj',
 'fgisanlly',
 'fianulyw',
 'fiajnllyj',
 'fiajtnlly',
 'fiaonlsy',
 'fwalnly',
 'fisvnlly',
 'ftianllvy',
 'fianlgay',
 'kfianlvy',
 'fimnllvy',
 'fibanldly',
 'fiaflxy',
 'fqianlry',
 'faianally',
 'fiaznllyo',
 'mianlky',
 'feuianlly',
 'fianhls',
 'mfianyly',
 'wfiansly',
 'fidnllyk',
 'ytanlly',
 'fainlay',
 'fiamnllm',
 'fiaynyly',
 'fqianldy',
 'zfbianlly',
 'fvnlly',
 'fiawnlli',
 'fpipnlly',
 'fianltlyw',
 'zianally',
 'mianloly',
 'fainllyi

In [49]:
# get correct words from above set
known(edits2(word))

{'faintly', 'finally', 'finely', 'frankly'}

Procedure: is the word itself known? If not, try edit distance 1, then edit distance 2. Fallback: the word itself.

In [50]:
candidates = (known(edits0(word)) or 
              known(edits1(word)) or 
              known(edits2(word)) or 
              [word])
candidates

{'finally'}

In [51]:
def correct(word):
    """
    Get the best correct spelling for the input word
    """
    # Priority is for edit distance 0, then 1, then 2
    # else defaults to the input word itself.
    candidates = (known(edits0(word)) or 
                  known(edits1(word)) or 
                  known(edits2(word)) or 
                  [word])
    return max(candidates, key=WORD_COUNTS.get)  # pick the most frequent candidate if there are several

In [52]:
correct('fianlly')

'finally'

In [53]:
correct('FIANLLY')

'FIANLLY'

In [54]:
def correct_match(match):
    """
    Spell-correct word in match, 
    and preserve proper upper/lower/title case.
    """
    
    word = match.group()
    def case_of(text):
        """
        Return the case-function appropriate 
        for text: upper, lower, title, or just str.
        """
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)
    return case_of(word)(correct(word.lower()))

    
def correct_text_generic(text):
    """
    Correct all the words within a text, 
    returning the corrected text.
    """
    return re.sub('[a-zA-Z]+', correct_match, text)

In [55]:
correct_text_generic('fianlly')

'finally'

In [56]:
correct_text_generic('FIANLLY')

'FINALLY'

### Spelling Correction in Libraries (Example)

In [57]:
# !pip install textblob
from textblob import Word

w = Word('fianlly')
w.correct()

'finally'

In [58]:
w.spellcheck()

[('finally', 1.0)]

In [59]:
w = Word('flaot')
w.spellcheck()

[('flat', 0.85), ('float', 0.15)]

## Stemming

Porter stemmer: rule-based; applies fixed suffix-stripping rules in successive phases

In [60]:
# Porter Stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()

ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped')

('jump', 'jump', 'jump')

In [61]:
ps.stem('lying')

'lie'

In [62]:
ps.stem('strange')

'strang'

Lancaster stemmer: also rule-based, but with partly different results

In [63]:
# Lancaster Stemmer
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped')

('jump', 'jump', 'jump')

In [64]:
ls.stem('lying')

'lying'

In [65]:
ls.stem('strange')

'strange'

How can something like this be rebuilt? Very simply, for example, with a regex:

In [66]:
# Regex based stemmer
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)
rs.stem('jumping'), rs.stem('jumps'), rs.stem('jumped')

('jump', 'jump', 'jump')

In [67]:
rs.stem('lying')

'ly'

In [68]:
rs.stem('strange')

'strange'
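The idea of successive phases mentioned for the Porter stemmer can be imitated a little more closely than with a single alternation. The sketch below applies at most one rule per phase and passes the result on to the next phase. The two-phase rule set PHASES and the name toy_stem are invented for this illustration and are far simpler than the real Porter rules.

```python
import re

# Hypothetical two-phase rule set: (suffix pattern, replacement)
PHASES = [
    # phase 1: plural endings
    [(r'sses$', 'ss'), (r'ies$', 'i'), (r'(?<!s)s$', '')],
    # phase 2: verb endings
    [(r'ing$', ''), (r'ed$', '')],
]


def toy_stem(word):
    for rules in PHASES:
        for pattern, replacement in rules:
            new_word, count = re.subn(pattern, replacement, word)
            if count:
                word = new_word
                break  # at most one rule fires per phase
    return word


print(toy_stem('jumping'), toy_stem('jumps'), toy_stem('jumped'))
# jump jump jump
print(toy_stem('caresses'), toy_stem('glass'))
# caress glass
```

Like the RegexpStemmer above, this toy version still reduces 'lying' to 'ly'; handling such cases requires additional rules or conditions on the remaining stem.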

Stemming is of course not limited to English:

In [69]:
# Snowball Stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer("german")
print('Supported Languages:', SnowballStemmer.languages)

Supported Languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


Sarkar's original comments :) (note that 'Autobahnen' means motorways, not cars)

In [70]:
# stemming on German words
# autobahnen -> cars
# autobahn -> car
ss.stem('autobahnen')

'autobahn'

In [71]:
# springen -> jumping
# spring -> jump
ss.stem('springen')

'spring'

further examples (HR)

In [72]:
ss.stem("gesprungen")

'gesprung'

As a function for sentences/texts:

In [73]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

'my system keep crash hi crash yesterday, our crash daili'

## Lemmatization

In [74]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

In [75]:
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

car
men


In [76]:
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

sad
fancy


In [77]:
# ineffective lemmatization
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

ate
fancier
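The lemmatizer only helps when it receives the right part of speech. A common way to obtain it is to run a POS tagger and translate its Penn Treebank tags into the single-letter codes used above ('n', 'v', 'a', 'r'). The mapping function below is a minimal sketch; the name penn_to_wordnet is an assumption, and the default of 'n' for all other tags is a simplification.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'   # verbs: VB, VBD, VBG, ...
    if tag.startswith('R'):
        return 'r'   # adverbs: RB, RBR, RBS
    return 'n'       # nouns and everything else


print([penn_to_wordnet(tag) for tag in ['JJR', 'VBD', 'RB', 'NNS', 'DT']])
# ['a', 'v', 'r', 'n', 'n']
```

With NLTK available, the tags could come from nltk.pos_tag, and the mapped code would be passed as the second argument to wnl.lemmatize.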


In [78]:
import spacy
nlp = spacy.load('en_core_web_trf')
text = 'My system keeps crashing his crashed yesterday, ours crashes daily'

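def lemmatize_text(text):
    text = nlp(text)
    # '-PRON-' was the pronoun lemma in spaCy v2; spaCy v3 no longer emits it, so the check is only a fallback
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text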

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")

'my system keep crash ! his crash yesterday , ours crash daily'

## Removing Stopwords

In [79]:
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

remove_stopwords("The, and, if are stopwords, computer is not")

', , stopwords , computer'
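The output above still contains the commas, because the function filters words only. If punctuation should disappear as well, tokenising with a word pattern is one option. The following sketch uses only the standard library; SMALL_STOPWORDS is a hypothetical stand-in for the NLTK list, and remove_stopwords_simple is a made-up name.

```python
import re

# Hypothetical miniature stopword list, lowercase
SMALL_STOPWORDS = {'the', 'and', 'if', 'are', 'is', 'not'}


def remove_stopwords_simple(text, stopwords=SMALL_STOPWORDS):
    # \w+ keeps word characters only, so punctuation never becomes a token
    tokens = re.findall(r'\w+', text)
    return ' '.join(t for t in tokens if t.lower() not in stopwords)


print(remove_stopwords_simple("The, and, if are stopwords, computer is not"))
# stopwords computer
```

Note that 'not' is treated as a stopword here, as in the cell above; for tasks such as sentiment analysis, negations are often worth keeping.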

## Putting It All Together

In [80]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

In [81]:
{'Original': sample_text,
 'Processed': normalize_corpus([sample_text])[0]}

{'Original': "US unveils world's most powerful supercomputer, beats China. The US has unveiled the world's most powerful supercomputer called 'Summit', beating the previous record-holder China's Sunway TaihuLight. With a peak performance of 200,000 trillion calculations per second, it is over twice as fast as Sunway TaihuLight, which is capable of 93,000 trillion calculations per second. Summit has 4,608 servers, which reportedly take up the size of two tennis courts.",
 'Processed': 'us unveil world powerful supercomputer beat china us unveil world powerful supercomputer call summit beat previous record holder chinas sunway taihulight peak performance trillion calculation per second twice fast sunway taihulight capable trillion calculation per second summit server reportedly take size two tennis court'}