<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:10px 5px'> 
Master Thesis Yannik Haller - Data Preprocessing TextBlob
</h1>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
1. Load required packages and the data
</h2>
</div>

In [1]:
# Import required baseline packages
import re
import os
import glob
import time
import sys
import pandas as pd
import numpy as np
from pprint import pprint

# Change pandas' setting to print out long strings
pd.options.display.max_colwidth = 200

# Spacy (for lemmatization)
import spacy

# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)
warnings.filterwarnings("ignore", category = FutureWarning)

In [2]:
# Set the appropriate working directory
os.chdir('D:\\Dropbox\\MA_data')

# Read in the aggregated data
de_tx = pd.read_csv("agg_csv_sparse_de.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [3]:
# Take a look at the shape of the data
de_tx.shape

(1934313, 3)

In [4]:
# Store the article IDs (i.e. index) of the language specific subsets
de_idx = de_tx.index  # German

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2. Preprocess the text data batchwise
</h2>
</div>

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.1 Define all required functions to preprocess the data (pre-cleaning, tokenizing, removing stop words and lemmatization)
</h2>
</div>

In [2]:
## Define all required functions for the batchwise data preprocessing

# Define a function to prepare/pre-clean the text data
def pre_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()

    # Remove any links starting with http:// or https://
    articles = [re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', x) for x in articles]
    # Remove any links starting with www.
    articles = [re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', x) for x in articles]
    
    # Append a blank to every period, so that sentences fused without a space are separated (surplus blanks are collapsed further below)
    articles = [re.sub(r'[\.]', '. ', x) for x in articles]
    # Intended to separate words in which a lowercase letter is followed by a capital letter; note that the leading ^ anchors the pattern to the start of each article, so fused words inside the text (e.g. 'PfiffnerIn') are left untouched
    articles = [re.sub(r'(^[a-z]*)+([A-Z])', r'\1 \2', x) for x in articles]
    # Correct manually for those cases where a name like 'McDonalds' was separated to Mc Donalds
    articles = [re.sub('Mc ', 'Mc', x) for x in articles]
    # Replace quotation marks with a blank
    articles = [re.sub('«', ' ', x) for x in articles]
    articles = [re.sub('»', ' ', x) for x in articles]
    # Remove percentage signs
    articles = [re.sub('%', ' ', x) for x in articles]
    # Remove distracting hyphens
    articles = [re.sub("-", " ", x) for x in articles]
    articles = [re.sub("–", " ", x) for x in articles]
    # Replace control characters (e.g. \n or \t) and multiple blanks with a single blank
    articles = [re.sub(r'\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for pre-cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the pre-cleaned text data
    return articles


# Define a function to perform the following tasks at once:
## 1. Tokenize: transform text into a list of words, digits and punctuations
## 2. Filter the tokens, such that only nouns, proper nouns, verbs, adjectives, adverbs and negations are kept, while digits and punctuations are removed
## 3. Lemmatize: transform each word back to its base form (lemma)
## 4. Lowercase the entire data
def tokenize_filter_and_lemmatize(articles, nlp):
    # Keep track of the processing time
    t = time.time()
    # Create a list to store the output
    articles_out = []
    # Define the list of allowed postags (pos = part of speech)
    allowed_postags = ['PROPN', 'NOUN', 'ADJ', 'VERB', 'ADV']
    # Define the list of allowed negations (for German)
    allowed_negations = ['nie', 'nichts', 'nicht', 'kein', 'wenig', 'ohne']
    # Create a loop to go through all articles in the input list of articles
    for article in articles:
        # Define the current article as the focal document
        doc = nlp(article)
        # Tokenize, filter and lemmatize the document, while filtering out punctuations and unused word types
        articles_out.append([token.lemma_.lower() for token in doc if (token.pos_ in allowed_postags) or (token.lemma_.lower() in allowed_negations)])
    # Print out the processing time
    print("Processing time for tokenizing, filtering and lemmatization: ", str((time.time() - t)/60), "minutes")
    return articles_out
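
The fused-word pattern in `pre_clean` begins with `^`, so it can only match at the very start of an article. This is consistent with tokens such as 'pfiffnerin' and 'cupeinsatzfür' surviving in the lemmatized output. The following is a minimal sketch of an unanchored alternative, not the pattern used to produce the results in this notebook. The helper name `split_fused_words` and the explicit umlaut ranges are assumptions made for illustration.

```python
import re

# Hypothetical helper (not part of the original notebook): insert a blank at
# every boundary between a lowercase and an uppercase letter, umlauts included.
FUSED_WORD_BOUNDARY = re.compile(r'([a-zäöüß])([A-ZÄÖÜ])')

def split_fused_words(text):
    # Keep both letters and place a single blank between them
    return FUSED_WORD_BOUNDARY.sub(r'\1 \2', text)

sample = 'Lukas PfiffnerIn der vergangenen Saison. Am Sonntagabend CupeinsatzFür Herisau'
print(split_fused_words(sample))
```

Names like 'McDonalds' would also be split by this pattern, which is what the subsequent 'Mc ' correction in `pre_clean` is meant to undo. All-caps tokens such as 'UHC' are left as they are.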

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.2 Define a function to preprocess and export the data batchwise
</h2>
</div>

In [7]:
## Define the function to apply batchwise processing of the text data

# Note: "articles" has to be a dataframe with a column 'tx' containing the article texts
def process_batchwise(articles, language, batch_size = 100000, first_batch_number = 1):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, pd.DataFrame)):
        raise ValueError("Invalid input type. Expected a pandas DataFrame.")
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Get the number of batches
    nbatches = int((len(articles)-1)/batch_size) + 1
    # Store the index of the articles
    idx = articles.index
    # Convert the column of the dataframe that contains the articles to a list of articles, while overwriting the variable 'articles' to save RAM
    articles = articles.tx.values.tolist()

    # Initialize the appropriate spacy model depending on the language of the text data, while disabling the pipeline components that are not needed here (for efficiency)
    spacy_models = {'de': 'de_core_news_lg', 'en': 'en_core_web_lg', 'fr': 'fr_core_news_lg', 'it': 'it_core_news_lg'}
    nlp = spacy.load(spacy_models[language], disable = ['tok2vec', 'morphologizer', 'senter', 'ner', 'attribute_ruler'])
    
    # Set up a loop to process the data batchwise
    for i in range(nbatches):
        print('Processing batch #', i+first_batch_number, '...')
        # Select the data related to the current batch
        batch_min = batch_size * i
        if i == (nbatches - 1):
            batch_max = len(articles)
        else:
            batch_max = batch_size * (i+1)
        batch_tx = articles[batch_min:batch_max]

        # Pre-clean the data
        batch_tx = pre_clean(batch_tx)
        # Tokenize, filter and lemmatize the data
        batch_tx = tokenize_filter_and_lemmatize(batch_tx, nlp)

        ## Save the processed text data to a csv file
        # Generate a list containing the preprocessed data in form of strings in which all lemmatized phrases are contained and separated by a blank (such that it's easy to read in later)
        batch_tx_out = []
        for article in batch_tx:
            batch_tx_out.append(" ".join(article))
        # Create a correctly indexed dataframe containing the preprocessed data in a column and export it as a csv file
        pd.DataFrame(batch_tx_out, index = idx[batch_min:batch_max], columns = ['tx']).to_csv(
            "Preprocessed/Sentiment_Analysis/Lemmatized/"+language+"_lemmatized_senti_batch_"+str(i+first_batch_number)+".csv", index = True, encoding = 'utf-8-sig'
        )

        # Delete large unused variables to save memory
        del batch_tx_out
    print("DONE! ;)")
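
The batch boundaries computed inside `process_batchwise` can be illustrated in isolation. The sketch below is a hypothetical generator `batch_bounds` (the name is an assumption) that yields the batch number together with the slice limits. Unlike the function above, it yields nothing for an empty input instead of one empty batch.

```python
import math

def batch_bounds(n_items, batch_size, first_batch_number=1):
    """Yield (batch_number, start, stop) for consecutive slices of n_items."""
    n_batches = math.ceil(n_items / batch_size)
    for i in range(n_batches):
        start = batch_size * i
        # The last batch may be shorter than batch_size
        stop = min(start + batch_size, n_items)
        yield i + first_batch_number, start, stop

# The last call in section 2.3 receives the remaining 434313 articles
for number, start, stop in batch_bounds(434313, 100000, first_batch_number=16):
    print(number, start, stop)
```

For the 434313 articles passed in the final call, this produces batches 16 to 20, the last of which covers only 34313 articles. That matches the much shorter processing time reported for batch 20.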

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
2.3 Apply batchwise preprocessing and store the preprocessed data externally as csv files
</h2>
</div>

In [8]:
# Apply batchwise preprocessing by means of the previously defined function
# Batches 1-5
process_batchwise(de_tx[:500000], language = 'de', batch_size = 100000, first_batch_number = 1)

Processing batch # 1 ...
Processing time for pre-cleaning:  0.4581407109896342 minutes
Processing time for tokenizing, filtering and lemmatization:  68.1004353483518 minutes
Processing batch # 2 ...
Processing time for pre-cleaning:  0.4717169920603434 minutes
Processing time for tokenizing, filtering and lemmatization:  65.81250176032384 minutes
Processing batch # 3 ...
Processing time for pre-cleaning:  0.4857803265253703 minutes
Processing time for tokenizing, filtering and lemmatization:  72.64777606725693 minutes
Processing batch # 4 ...
Processing time for pre-cleaning:  0.4697477459907532 minutes
Processing time for tokenizing, filtering and lemmatization:  81.45349471966425 minutes
Processing batch # 5 ...
Processing time for pre-cleaning:  0.433683451016744 minutes
Processing time for tokenizing, filtering and lemmatization:  72.35946209828059 minutes
DONE! ;)


In [8]:
# Batches 6-10
process_batchwise(de_tx[500000:1000000], language = 'de', batch_size = 100000, first_batch_number = 6)

Processing batch # 6 ...
Processing time for pre-cleaning:  0.540809675057729 minutes
Processing time for tokenizing, filtering and lemmatization:  81.53184503316879 minutes
Processing batch # 7 ...
Processing time for pre-cleaning:  0.6536357482274373 minutes
Processing time for tokenizing, filtering and lemmatization:  95.98633125623067 minutes
Processing batch # 8 ...
Processing time for pre-cleaning:  0.7139591574668884 minutes
Processing time for tokenizing, filtering and lemmatization:  110.23240004380544 minutes
Processing batch # 9 ...
Processing time for pre-cleaning:  0.6416012843449911 minutes
Processing time for tokenizing, filtering and lemmatization:  106.77491083542506 minutes
Processing batch # 10 ...
Processing time for pre-cleaning:  0.4863160332043966 minutes
Processing time for tokenizing, filtering and lemmatization:  73.21589245398839 minutes
DONE! ;)


In [8]:
# Batches 11-15
process_batchwise(de_tx[1000000:1500000], language = 'de', batch_size = 100000, first_batch_number = 11)

Processing batch # 11 ...
Processing time for pre-cleaning:  0.4868136405944824 minutes
Processing time for tokenizing, filtering and lemmatization:  71.40801916122436 minutes
Processing batch # 12 ...
Processing time for pre-cleaning:  0.5520401914914449 minutes
Processing time for tokenizing, filtering and lemmatization:  79.00566128094991 minutes
Processing batch # 13 ...
Processing time for pre-cleaning:  0.4907376249631246 minutes
Processing time for tokenizing, filtering and lemmatization:  69.40921380917231 minutes
Processing batch # 14 ...
Processing time for pre-cleaning:  0.5029875238736471 minutes
Processing time for tokenizing, filtering and lemmatization:  68.68988905350368 minutes
Processing batch # 15 ...
Processing time for pre-cleaning:  0.4651066025098165 minutes
Processing time for tokenizing, filtering and lemmatization:  68.85461974938711 minutes
DONE! ;)


In [8]:
# Batches 16-20
process_batchwise(de_tx[1500000:], language = 'de', batch_size = 100000, first_batch_number = 16)

Processing batch # 16 ...
Processing time for pre-cleaning:  0.43460333744684854 minutes
Processing time for tokenizing, filtering and lemmatization:  62.765749108791354 minutes
Processing batch # 17 ...
Processing time for pre-cleaning:  0.4038864612579346 minutes
Processing time for tokenizing, filtering and lemmatization:  56.21139123042425 minutes
Processing batch # 18 ...
Processing time for pre-cleaning:  0.6051807125409444 minutes
Processing time for tokenizing, filtering and lemmatization:  95.92116247415542 minutes
Processing batch # 19 ...
Processing time for pre-cleaning:  0.6003604650497436 minutes
Processing time for tokenizing, filtering and lemmatization:  88.91763996283213 minutes
Processing batch # 20 ...
Processing time for pre-cleaning:  0.15742949644724527 minutes
Processing time for tokenizing, filtering and lemmatization:  21.651685535907745 minutes
DONE! ;)


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
3. Read in and inspect the filtered and lemmatized data
</h2>
</div>

In [7]:
# Define a function to read in and concatenate the filtered and lemmatized data
def read_lemmatized(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data\\Preprocessed\\Sentiment_Analysis\\Lemmatized')

    # Get a list of all files to read and concatenate
    extension = 'csv'
    all_filenames = [i for i in glob.glob(language+"_lemmatized_senti_batch_*.{}".format(extension))]
    # Concatenate all files in the list to one dataframe
    batches_aggregated = pd.concat([pd.read_csv(f, index_col = 0, dtype = {'tx': object}) for f in all_filenames])
    # Get the articles' indices together with an enumeration to identify them in the list of filtered and lemmatized articles
    idx = batches_aggregated.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])
    # Tokenize the data again if tokenize = True
    if tokenize:
        batches_aggregated = retokenize(batches_aggregated.tx.values.tolist())
    else:
        batches_aggregated = batches_aggregated.tx.values.tolist()
    
    # Reset the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')  

    # Return the filtered and lemmatized data together with its index
    return batches_aggregated, idx

# Define a function to retokenize the filtered and lemmatized text data
def retokenize(articles):
    articles_out = []
    for article in articles:
        articles_out.append(article.split())
    return articles_out
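
`glob.glob` does not guarantee any particular order, and a plain alphabetical sort would place batch 10 before batch 2. If the concatenated data should follow the batch numbering, the file names can be sorted numerically first. This is a minimal sketch under that assumption; the helper `batch_number` is hypothetical and relies on the file naming scheme used in `process_batchwise`.

```python
import re

def batch_number(filename):
    """Extract the batch number from a name like 'de_lemmatized_senti_batch_12.csv'."""
    match = re.search(r'_batch_(\d+)\.csv$', filename)
    if match is None:
        raise ValueError("Unexpected file name: %s" % filename)
    return int(match.group(1))

filenames = [
    'de_lemmatized_senti_batch_10.csv',
    'de_lemmatized_senti_batch_2.csv',
    'de_lemmatized_senti_batch_1.csv',
]
# Sort by the numeric batch index rather than alphabetically
print(sorted(filenames, key=batch_number))
```

In `read_lemmatized`, the same key could be applied to `all_filenames` before the call to `pd.concat`.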

In [8]:
# Read in the filtered and lemmatized data
de_tx_lemm, de_idx = read_lemmatized('de', tokenize = True)

In [9]:
# Take a look at the dataframe containing the corresponding index
de_idx

Unnamed: 0,de_idx
0,16553
1,16554
2,16555
3,16556
4,16557
...,...
1934308,2441178
1934309,2441179
1934310,2441180
1934311,2441181


In [10]:
# Take a look at the size of the filtered and lemmatized data (note: sys.getsizeof only measures the outer list object, not the token lists it refers to)
sys.getsizeof(de_tx_lemm)

15673408
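
The value above covers the outer list only, so it understates the memory actually held by the tokens. A rough estimate that also visits the inner lists and strings could look as follows. This is a sketch only: `nested_list_size` is a hypothetical helper, and because strings shared between articles are counted every time they occur, the result is best read as an upper bound.

```python
import sys

def nested_list_size(articles):
    """Estimate the size in bytes of a list of token lists, including the tokens."""
    total = sys.getsizeof(articles)
    for tokens in articles:
        # Size of the inner list object plus the strings it refers to
        total += sys.getsizeof(tokens)
        total += sum(sys.getsizeof(token) for token in tokens)
    return total

toy = [['rückkehrer', 'stefan', 'meier'], ['flames', 'herisau']]
print(sys.getsizeof(toy), nested_list_size(toy))
```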

In [11]:
# Take a look at the first few tokens of the first element of filtered and lemmatized data
de_tx_lemm[0][:6]

['rückkehrer', 'stefan', 'meier', 'überragen', '7:6', 'flames']

In [12]:
# Compare it to the initial text
de_tx.tx.iloc[0]

'Rückkehrer Stefan Meier überragt beim 7:6 gegen die Flames. Herisau bangt allerdings am Schluss.Lukas PfiffnerIn der vergangenen Saison tat sich der UHC Herisau darin hervor, immer wieder einen Rückstand aufzuholen und Partien zu kehren. In der noch jungen 1.-Liga-Meisterschaft 2020/21 lebt das Team mindestens in den Heimspielen einem neuen Trend nach: trotz deutlicher Führung noch zu zittern.Am Samstag lagen die überzeugenden Ausserrhoder 2:0 vorne, sie reagierten auf den Ausgleich der Flames mit drei Toren innert dreier Minuten. Sie besassen mit Stefan Meier, der während neun Saisons für Wasa verteidigt hat und im Sommer aus der NLA zu seinem Stammverein zurückgekehrt ist, einen herausragenden Stürmer. Elf Minuten vor der Sirene hiess es 6:3, zum fünften Mal hatte Meier seinen Stock im Spiel. Mit der Sicherheit am Ball ging allerdings auch die Führung verloren. Der komplette Zusammenbruch drohte. Die Flames konnten aber die Gewichte nicht total verschieben – und ein eindrücklicher E

In [13]:
# Take a look at the tail of the first element of the filtered and lemmatized data, as quite a few unnecessary tokens are still contained
de_tx_lemm[0][200:]

['rückgängig',
 'machen',
 'flames',
 'innert',
 'sekunde',
 'tor',
 'schießen',
 'trainer',
 'nehmen',
 'beordern',
 'meier',
 'letzt',
 'minute',
 'verteidigung',
 'problem',
 'kurzfristig',
 'umstellung',
 'nicht',
 'meinen',
 'abwehr',
 'machen',
 'routine',
 'positionsspiel',
 'zudem',
 'jahrelang',
 'torhüter',
 'dominic',
 'jud',
 'rücken',
 'verteidigen',
 'kennen',
 'anweisung',
 'genau',
 'sonntagabend',
 'cupeinsatzfür',
 'herisau',
 'bringen',
 'samstag',
 'viert',
 'spiel',
 'dritt',
 'erfolg',
 'belegen',
 'zweit',
 'platz',
 'punkt',
 'partie',
 'bassersdorf',
 'nürensdorf',
 'saison',
 'nicht',
 'mehr',
 'effektiv',
 'punkt',
 'erstellung',
 'tabelle',
 'berücksichtigen',
 'quotient',
 'bisher',
 'geben',
 'gruppe',
 'corona',
 'allerdings',
 'noch',
 'kein',
 'spielabsagen',
 'sonntagabend',
 'treten',
 'ausserrhoder',
 '2.',
 'ligisten',
 'grab',
 'werdenberg',
 'cupspiel',
 'herisau',
 'flames',
 '1:0',
 '5:4)sportzentrum',
 'zuschauer',
 'sr',
 'cereda',
 'locatelli

In [14]:
# Remove unnecessary variables to save RAM
del de_tx_lemm

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
4. Supplemental (manual) cleaning of the filtered and lemmatized data
</h2>
</div>

In [15]:
# Read in the filtered and lemmatized data (untokenized)
de_tx, de_idx = read_lemmatized('de', tokenize = False) # Overwrite the variable containing the uncleaned data to save RAM

In [16]:
# Take a look at the dataframe containing the corresponding index
de_idx

Unnamed: 0,de_idx
0,16553
1,16554
2,16555
3,16556
4,16557
...,...
1934308,2441178
1934309,2441179
1934310,2441180
1934311,2441181


In [17]:
# Take a look at the first element of the (untokenized) filtered and lemmatized data
de_tx[0]

'rückkehrer stefan meier überragen 7:6 flames herisau bangen allerdings schluss lukas pfiffnerin vergangen saison tun uhc herisau darin immer wieder rückstand aufholen partie kehren noch jung 1. liga meisterschaft leben team mindestens heimspiel neu trend deutlich führung noch zittern samstag lagen überzeugend ausserrhoder vorne reagieren ausgleich flames tor dreier minute besassen stefan meier saison wasa verteidigen sommer nla stammverein zurückkehren herausragenden stürmer minute sirene hiess fünft mal meier stock spiel sicherheit ball gehen allerdings auch führung verlieren komplette zusammenbruch drohen flames können aber gewicht nicht total verschieben eindrücklich effort niklas hess tragen gastgeber sieg 57. trainer sagen grosses kino schon woche zuvor meier einzig herisauer niederlage pfannenstiel egg treffer erzielen liegen tor assists nun platz skorer gruppe 2. meier bewegen messen cm grösse kg gewicht erstaunlich geschmeidig können ball behaupten weisen wuchtig schuss so lau

In [18]:
# Define a function to apply the supplemental/manual cleaning to the filtered and lemmatized data
def supp_clean(articles):
    # Raise an error if an inappropriate data type is given as an input
    if(not isinstance(articles, list)):
        raise ValueError("Invalid input type. Expected a list.")

    # Keep track of the processing time
    t = time.time()
    # Remove any instances where 1 to 3 lowercase letters are followed by a dot (either once at the end or after each letter), since such cases usually represent abbreviations with little semantic meaning
    articles = [re.sub(r' [a-z][\.]?[a-z]?[\.]?[a-z]?\.+', ' ', x) for x in articles]
    # Remove any remaining standalone numbers
    articles = [re.sub(r'\b\d+\b', '', x) for x in articles]
    # Remove anything except word characters, whitespace and the & sign, since the latter might appear in certain names
    articles = [re.sub(r'[^\w\s\&]','', x) for x in articles]
    # Remove a list of specific words which appear quite often but do not have any semantic meaning
    words_to_remove = ['awp','afp']
    for word in words_to_remove:
        # Match whole words only, so that longer words merely starting with these letters stay intact
        articles = [re.sub(r'\b' + word + r'\b', '', x) for x in articles]
    # Replace control characters (e.g. \n or \t) and multiple blanks with a single blank
    articles = [re.sub(r'\s+', ' ', x) for x in articles]
    # Print out the processing time
    print("Processing time for supplemental manual cleaning: ", str((time.time() - t)/60), "minutes")

    # Return the manually cleaned text data
    return articles
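
To make the individual steps of `supp_clean` easier to follow, the sketch below applies them to a single short string. It is an illustration rather than the code used for the results in this notebook: the patterns are precompiled once, the agency abbreviations are matched as whole words, and the result is stripped of leading and trailing blanks. The name `supp_clean_one` and the sample string are assumptions.

```python
import re

# Precompiled patterns mirroring the steps of supp_clean
ABBREVIATION = re.compile(r' [a-z][\.]?[a-z]?[\.]?[a-z]?\.+')
NUMBER = re.compile(r'\b\d+\b')
NON_WORD = re.compile(r'[^\w\s&]')
NOISE_WORDS = re.compile(r'\b(?:awp|afp)\b')
WHITESPACE = re.compile(r'\s+')

def supp_clean_one(text):
    text = ABBREVIATION.sub(' ', text)   # ' sr.' -> ' '
    text = NUMBER.sub('', text)          # '7:6' -> ':' and '1.' -> '.'
    text = NON_WORD.sub('', text)        # drop the leftover ':' and '.'
    text = NOISE_WORDS.sub('', text)     # drop 'awp' and 'afp' as whole words
    return WHITESPACE.sub(' ', text).strip()

sample = 'meier überragen 7:6 flames 1. liga sr. cereda afp'
print(supp_clean_one(sample))
```

On the sample string this leaves 'meier überragen flames liga cereda': the score, the ordinal, the abbreviation 'sr.' and the agency tag are removed, while the remaining words are untouched.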

In [19]:
# Apply the supplemental/manual cleaning by means of the previously defined function
de_tx = supp_clean(de_tx)

Processing time for supplemental manual cleaning:  7.325034916400909 minutes


In [20]:
# Take a look at the first element of the fully preprocessed data
de_tx[0]

'rückkehrer stefan meier überragen flames herisau bangen allerdings schluss lukas pfiffnerin vergangen saison tun uhc herisau darin immer wieder rückstand aufholen partie kehren noch jung liga meisterschaft leben team mindestens heimspiel neu trend deutlich führung noch zittern samstag lagen überzeugend ausserrhoder vorne reagieren ausgleich flames tor dreier minute besassen stefan meier saison wasa verteidigen sommer nla stammverein zurückkehren herausragenden stürmer minute sirene hiess fünft mal meier stock spiel sicherheit ball gehen allerdings auch führung verlieren komplette zusammenbruch drohen flames können aber gewicht nicht total verschieben eindrücklich effort niklas hess tragen gastgeber sieg trainer sagen grosses kino schon woche zuvor meier einzig herisauer niederlage pfannenstiel egg treffer erzielen liegen tor assists nun platz skorer gruppe meier bewegen messen cm grösse kg gewicht erstaunlich geschmeidig können ball behaupten weisen wuchtig schuss so laufen aktuell vo

In [21]:
# Compare the fully preprocessed data to the initial text (copy paste from above)

'Rückkehrer Stefan Meier überragt beim 7:6 gegen die Flames. Herisau bangt allerdings am Schluss.Lukas PfiffnerIn der vergangenen Saison tat sich der UHC Herisau darin hervor, immer wieder einen Rückstand aufzuholen und Partien zu kehren. In der noch jungen 1.-Liga-Meisterschaft 2020/21 lebt das Team mindestens in den Heimspielen einem neuen Trend nach: trotz deutlicher Führung noch zu zittern.Am Samstag lagen die überzeugenden Ausserrhoder 2:0 vorne, sie reagierten auf den Ausgleich der Flames mit drei Toren innert dreier Minuten. Sie besassen mit Stefan Meier, der während neun Saisons für Wasa verteidigt hat und im Sommer aus der NLA zu seinem Stammverein zurückgekehrt ist, einen herausragenden Stürmer. Elf Minuten vor der Sirene hiess es 6:3, zum fünften Mal hatte Meier seinen Stock im Spiel. Mit der Sicherheit am Ball ging allerdings auch die Führung verloren. Der komplette Zusammenbruch drohte. Die Flames konnten aber die Gewichte nicht total verschieben – und ein eindrücklicher Effort von Niklas Hess trug den Gastgebern den Sieg ein (57.).Der Trainer sagte: «Grosses Kino»Schon eine Woche zuvor hatte Meier bei der einzigen Herisauer Niederlage (5:7 gegen Pfannenstiel Egg) drei Treffer erzielt. Er liegt mit sechs Toren und vier Assists nun auf Platz fünf der Skorer in der Gruppe 2. Meier bewegt sich – gemessen an 190\xa0cm Grösse und 82 kg Gewicht – erstaunlich geschmeidig, er kann sich am Ball behaupten und weist einen wuchtigen Schuss auf. «Wenn es so läuft wie aktuell, ist es vorne natürlich schön», meinte der 29-Jährige. Er harmonierte mit seinen Linienpartnern Joel Conzett und Silas Stucki vorzüglich. «Beide sind kreativ.» Trainer Nico Raschle sprach von «grossem Kino», was Meiers Auftritt am Samstag und allgemein seine Einstellung betreffe: Er sei nicht in die 1. Liga gekommen, um seine Karriere einfach ein wenig «ausplämperlen» zu lassen. Dies bestätigt Meier indirekt. 
«Es war ganz gut, eine neue Position zugeteilt zu bekommen.» Da könne man sich nochmals richtig «reinhängen». Warum spielt Meier in Herisau vorne? «Weil wir uns von ihm das erhoffen, was ihm heute gelungen ist», sagte Raschle. Im resultatmässigen Notstand muss man auch einmal einen Entscheid rückgängig machen: Die Flames hatten innert 90 Sekunden drei Tore geschossen, der Trainer nahm ein Time-out und beorderte Meier für die letzten vier Minuten in die Verteidigung zurück. Ein Problem sei die kurzfristige Umstellung nicht gewesen, meinte dieser. «In der Abwehr machst du vieles über die Routine und das Positionsspiel.» Er habe zudem jahrelang mit Torhüter Dominic Jud im Rücken verteidigt. «Ich kenne seine Anweisungen genau.»Am Sonntagabend CupeinsatzFür Herisau brachte der Samstag im vierten Spiel den dritten Erfolg. Es belegt den zweiten Platz (2,25 Punkte pro Partie) hinter Bassersdorf Nürensdorf (2,5). Seit dieser Saison werden nicht mehr die effektiven Punkte für die Erstellung der Tabelle berücksichtigt, sondern die Quotienten. Bisher gab es in der Gruppe 2 trotz Corona allerdings noch keine Spielabsagen. Am Sonntagabend traten die Ausserrhoder beim 2.-Ligisten Grabs-Werdenberg zum Cupspiel an.Herisau – Flames 7:6 (1:0, 1:2, 5:4)Sportzentrum. – 98 Zuschauer. – Sr. Cereda/Locatelli. Tore: 15. S. Meier (Penalty) 1:0. 22. S. Meier (S. Stucki) 2:0. 25. Mattsson (Bernet) 2:1. 37. Liechti (Jenny) 2:2. 42. Conzett (S. Meier) 3:2. 44. (43:47) Conzett (S. Meier) 4:2. 45. (44:31) Germann (Brandes) 5:2. 49. (48:21) Swoboda 5:3. 49. (48:59) S. Meier (S. Stucki) 6:3. 55. (54:10) B. Jud (Mattsson, Ausschluss Schilling) 6:4. 55. (54:39) Dürr (J. Jud) 6:5. 56. (55:40) Mattsson (Rautio) 6:6. 57. (56:44) Hess 7:6.Herisau: D. Jud; Brunner, Schwarz; Rüegg, Schmid; Stern, L. Stucki; Schweizer; Hess, Schilling, Sandmeier; S. Stucki, Conzett, S. Meier; Germann, Mittelholzer, Wetter; Brandes. 
Strafen: je 1-mal 2 Minuten.Konzentration bis am Schluss: Herisaus Torhüter Dominic Jud sieht einen Schuss auf sich zukommen. Bild: Lukas Pfiffner'

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
5. Export the fully preprocessed data as one csv file
</h2>
</div>

In [22]:
# Define a function to export the fully preprocessed data
def export_preprocessed(language, articles, idx, data_tokenized = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)

    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Untokenize the data if it is still tokenized
    if data_tokenized:
        # Generate a list containing the fully preprocessed data in form of strings in which all precleaned unigrams are contained and separated by a blank (such that it's easy to read in later)
        articles_out = []
        for article in articles:
            articles_out.append(" ".join(article))
        # Overwrite the variable which stores the tokenized articles
        articles = articles_out
        # Delete the variable articles_out to save RAM
        del articles_out
    
    # Create a correctly indexed dataframe containing the fully preprocessed data in a column and export it as a csv file
    pd.DataFrame(articles, index = idx, columns = ['tx']).to_csv("Preprocessed/Sentiment_Analysis/"+language+"_preprocessed_senti.csv", index = True, encoding = 'utf-8-sig')

In [23]:
# Export the fully preprocessed data
export_preprocessed('de', de_tx, de_idx.de_idx.to_list(), False)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
6. Read in the fully preprocessed data
</h2>
</div>

In [3]:
# Define a function to read in the fully preprocessed data
def read_preprocessed(language, tokenize = True):
    # Raise an error if an inadmissible language is chosen
    allowed_languages = ['de', 'en', 'fr', 'it']
    if language not in allowed_languages:
        raise ValueError("Invalid language. Expected one of: %s" % allowed_languages)
    
    # Set the appropriate working directory
    os.chdir('D:\\Dropbox\\MA_data')

    # Define the name of the file to load
    filename = "Preprocessed/Sentiment_Analysis/"+language+"_preprocessed_senti.csv"

    # Read in the dataframe containing the text data
    tx_pp = pd.read_csv(filename, index_col = 0, dtype = {'tx': object})

    # Get the articles' index together with an enumeration to identify their position in the list of preprocessed articles
    idx = tx_pp.index
    idx = pd.DataFrame(idx, columns = [language+'_idx'])

    # Reduce the dataframe to a list containing the text data
    tx_pp = tx_pp.tx.to_list()

    # Tokenize the data again if tokenize = True (RAM-saving)
    if tokenize:
        tx_pp = retokenize(tx_pp)

    # Return the preprocessed data
    return tx_pp, idx

# Define a function to retokenize the preprocessed text data (RAM-saving)
def retokenize(article_list):
    for i in range(len(article_list)):
        temp_tx = str(article_list[i]).split()
        article_list[i] = temp_tx
    return article_list
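
One edge case of the in-place `retokenize` is worth noting: by default, pandas reads empty CSV fields as NaN, and `str(float('nan')).split()` yields the single token 'nan'. If some articles end up empty after cleaning (an assumption, since the notebook does not check for this), a variant that maps missing values to an empty token list avoids spurious 'nan' tokens. The name `retokenize_safe` is hypothetical.

```python
import math

def retokenize_safe(article_list):
    """Split each article into tokens in place; missing values become empty lists."""
    for i, text in enumerate(article_list):
        if isinstance(text, float) and math.isnan(text):
            # An empty article that was read back from the CSV file as NaN
            article_list[i] = []
        else:
            article_list[i] = str(text).split()
    return article_list

articles = ['rückkehrer stefan meier', float('nan')]
print(retokenize_safe(articles))
```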

In [4]:
# Read in the fully preprocessed data
de_tx, de_idx = read_preprocessed('de', tokenize = True) # Overwrite the variables used above to save RAM

In [5]:
# Take a look at the dataframe containing the corresponding index
de_idx

Unnamed: 0,de_idx
0,16553
1,16554
2,16555
3,16556
4,16557
...,...
1934308,2441178
1934309,2441179
1934310,2441180
1934311,2441181


In [6]:
# Take a look at the first element of the fully preprocessed and tokenized data
de_tx[0]

['rückkehrer',
 'stefan',
 'meier',
 'überragen',
 'flames',
 'herisau',
 'bangen',
 'allerdings',
 'schluss',
 'lukas',
 'pfiffnerin',
 'vergangen',
 'saison',
 'tun',
 'uhc',
 'herisau',
 'darin',
 'immer',
 'wieder',
 'rückstand',
 'aufholen',
 'partie',
 'kehren',
 'noch',
 'jung',
 'liga',
 'meisterschaft',
 'leben',
 'team',
 'mindestens',
 'heimspiel',
 'neu',
 'trend',
 'deutlich',
 'führung',
 'noch',
 'zittern',
 'samstag',
 'lagen',
 'überzeugend',
 'ausserrhoder',
 'vorne',
 'reagieren',
 'ausgleich',
 'flames',
 'tor',
 'dreier',
 'minute',
 'besassen',
 'stefan',
 'meier',
 'saison',
 'wasa',
 'verteidigen',
 'sommer',
 'nla',
 'stammverein',
 'zurückkehren',
 'herausragenden',
 'stürmer',
 'minute',
 'sirene',
 'hiess',
 'fünft',
 'mal',
 'meier',
 'stock',
 'spiel',
 'sicherheit',
 'ball',
 'gehen',
 'allerdings',
 'auch',
 'führung',
 'verlieren',
 'komplette',
 'zusammenbruch',
 'drohen',
 'flames',
 'können',
 'aber',
 'gewicht',
 'nicht',
 'total',
 'verschieben',
 

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 
7. Quantitative summary of the data cleaning
</h2>
</div>

In [7]:
# Count the total number of words contained after the data cleaning
nwords_after = 0
for article in de_tx:
    nwords_after = nwords_after + len(article)
print('Total number of words contained after data cleaning:', nwords_after)

Total number of words contained after data cleaning: 488369046


In [8]:
# Get the average number of words per article after the data cleaning
avg_nwords_after = nwords_after/len(de_tx)
print('Average number of words per article after data cleaning:', avg_nwords_after)

Average number of words per article after data cleaning: 252.47674290562077


In [None]:
# Remove unnecessary variables to save RAM
del de_tx, de_idx

In [None]:
# Read in the uncleaned data
os.chdir('D:\\Dropbox\\MA_data')
de_tx_uncleaned = pd.read_csv("agg_csv_sparse_de.csv", index_col = 0, dtype = {'so': object, 'la': object, 'tx': object})

In [None]:
## Count the total number of words contained before the data cleaning
# Note: to get an appropriate word count we must at least apply the pre-cleaning first, to ensure that all words are separated properly and distracting signs are removed
de_tx_uncleaned = pre_clean(de_tx_uncleaned.tx.tolist())
# Count the total number of words
nwords_before = 0
for article in de_tx_uncleaned:
    nwords_before = nwords_before + len(article.split())
print('Total number of words contained before data cleaning:', nwords_before)

Processing time for pre-cleaning:  12.829872250556946 minutes
Total number of words contained before data cleaning: 857640778


In [None]:
# Get the average number of words per article before the data cleaning
avg_nwords_before = nwords_before/len(de_tx_uncleaned)
print('Average number of words per article before data cleaning:', avg_nwords_before)

Average number of words per article before data cleaning: 443.38262628643866


In [None]:
# Get the number of removed words
nwords_rm = nwords_before - nwords_after
print('Number of words removed by the data cleaning:', nwords_rm)

Number of words removed by the data cleaning: 369271732


In [None]:
# Get the ratio of the words that have been removed
ratio_removed = nwords_rm / nwords_before
print(np.round(ratio_removed*100,4),'percent of the words have been removed by the data cleaning.')

43.0567 percent of the words have been removed by the data cleaning.
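
The summary figures of this section follow from three numbers reported above: the word totals before and after cleaning and the number of articles. As a compact cross-check, they can be recomputed in one place. The function name `cleaning_summary` is an assumption; the inputs are the values printed in this notebook.

```python
def cleaning_summary(nwords_before, nwords_after, n_articles):
    """Recompute the summary statistics from the word totals and the article count."""
    removed = nwords_before - nwords_after
    return {
        'avg_before': nwords_before / n_articles,
        'avg_after': nwords_after / n_articles,
        'removed': removed,
        'percent_removed': round(100 * removed / nwords_before, 4),
    }

# Totals and article count as reported in the cells above
summary = cleaning_summary(857640778, 488369046, 1934313)
print(summary)
```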
