# Preprocessing German-Language Data

#### In this notebook, specifically for the German data, we will apply a series of techniques to remove ads, noise, and any information that is not needed for our analysis. We will clean the data that has been previously extracted and store it in different files and structures to fulfill the requirements of the models that will be applied later.

## Import packages

In [6]:
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from gensim.parsing.preprocessing import STOPWORDS, strip_numeric, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short
import numpy as np
import sys
sys.path.append('../')
from utils import remove_similar_rows_per_player, find_lines_with_player, del_patterns, map_emoji_to_description, remove_accents, name_wordgroups, extract_sentence
import emoji
from googletrans import Translator


## Load Data

We load the CSV file that contains all previously pulled data and keep only the German-language rows.

In [7]:
url = '../data_files/all_data_v3.csv'
df = pd.read_csv(url)

In [8]:
# Keep only the German rows
df_ger = df[df["language"] == "de"]

# Reset the index so it is contiguous again
df_ger = df_ger.reset_index(drop=True)


## Remove similar rows

We use a purpose-built helper function to remove duplicates and rows that were mistakenly stored twice.

In [9]:
# Remove similar rows (the function is imported from utils at the top)
df_ger = remove_similar_rows_per_player(df_ger, df_ger['player'].unique())
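
The implementation of `remove_similar_rows_per_player` lives in `utils` and is not shown in this notebook. As a rough illustration of what near-duplicate removal can look like, here is a minimal sketch using only the standard library. The function name `drop_near_duplicates` and the similarity threshold of 0.9 are assumptions made for this example, not the project's actual helper or settings.

```python
from difflib import SequenceMatcher

def drop_near_duplicates(texts, threshold=0.9):
    """Keep the first occurrence of each text and drop later texts whose
    similarity to an already kept text reaches `threshold`.
    Hypothetical stand-in, not the helper from utils."""
    kept = []
    for text in texts:
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept

articles = [
    "hincapie überzeugt in der abwehr von leverkusen.",
    "hincapie überzeugt in der abwehr von leverkusen!",  # stored twice
    "das spiel endet ohne tore.",
]
unique_articles = drop_near_duplicates(articles)
```

In the notebook the comparison runs per player, so a sketch like this would be applied to each player's rows separately.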

### Transform Hincapié

In [10]:
# Normalize the spelling "Hincapié" to "Hincapie"
df_ger.loc[df_ger['player'] == 'Piero Hincapié', 'player'] = 'Piero Hincapie'

## Transform data into lower case

We convert both the `data` and the `player` columns to lower case.

In [11]:
# Create a copy
data_lower = df_ger.copy()

# Transform data into lower case
data_lower['data'] = data_lower['data'].str.lower()
data_lower['player'] = data_lower['player'].str.lower()



## Delete Patterns

Because the texts contain a large amount of noise and irrelevant information, we define patterns that are removed from all texts. These patterns are specific to the German texts.

In [12]:
# Create a copy and delete content patterns
data_wo_pattern = data_lower.copy()
data_wo_pattern['data'] = data_wo_pattern['data'].apply(lambda x: re.sub(r"^{\'content\': \'", "", str(x)))
data_wo_pattern['data'] = data_wo_pattern['data'].apply(lambda x: re.sub(r"{'content':", "", str(x)))

In [13]:
# Define patterns to delete
pattern = ['nutze kicker', 
           'mit dem', 
           'nur €2,49', 
           'https://www.faz.net/',
           'bereits pur-abonnent?', 
           'alle antworten', 
           'hinweis zur verarbeitung', 
           'werbung',
           'olympia-verlags',
           'bild',
           'g+j medien gmbh',
           'services',
           'kurz-link dieses artikels lautet:', 
           'http://epaper.welt.de',
           'stephan von nocks',
           'warum sehe ich faz.net',
           'permalink:',
           'mcfit mitgliedschaft',
           'fitx-vertrag',
           'kündigeneasyfitness',
           'proteinbedarf',
           'fitnessland',
           'trainingspuls berechnen',
           'pulsrechner',
           'fitseveneleven-mitgliedschaft',
           'alkoholabbau & promille',
           'index.promillerechner',
           'mitgliedschaft kündigen',
           'promillerechner',
           'ihr body-mass-index',
           'bmi rechner',
           'kalorienrechner',
           'grundumsatz & kalorienbedarf',
           'partnerangebote',
           'newsletter',
           'journalismus der presse',
           'abonnieren',
           '(apa)',
           'foto:',
           'quelle:',
           'lesezeit:',
           'aktuelle nachrichten',
           'herausgegeben von',
           'zeitung',
           'faz.net',
           'lesen sie mehr',
           'stellenmarkt der sz.',
           'bewerben sie sich jetzt',
           'gutscheine',
           'vergleichsportal',
           'stern plus-inhalten',
           'jederzeit kündbar',
           'bereits registriert?',
           'zur startseite',
           'öffnet in neuem tab oder fenster',
           'vollansicht der tabelle unter',
           'frankfurter allgemeine zeitung', 
           'aktuelle nachrichten aus politik, wirtschaft, sport und kultur',
           ]

In [14]:
# Delete patterns
data_wo_pattern['data'] = data_wo_pattern['data'].apply(lambda x: del_patterns(str(x), pattern))
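
`del_patterns` is imported from `utils` and its body is not shown here. The sketch below assumes it deletes each pattern as a literal substring; `delete_patterns` is a hypothetical stand-in. Under that assumption the order of the list matters: a short pattern such as `'zeitung'` that is processed first would break up `'frankfurter allgemeine zeitung'` before the longer pattern can match. Sorting by length, longest first, avoids this.

```python
def delete_patterns(text, patterns):
    """Remove every literal pattern from text, longest pattern first.
    Hypothetical stand-in for utils.del_patterns."""
    for p in sorted(set(patterns), key=len, reverse=True):
        text = text.replace(p, "")
    return text

patterns = ["zeitung", "frankfurter allgemeine zeitung", "lesezeit:"]
sample = "lesezeit: kurz. frankfurter allgemeine zeitung hincapie verlängert."

cleaned = delete_patterns(sample, patterns)

# For comparison: processing the list in its given order leaves a fragment
naive = sample
for p in patterns:
    naive = naive.replace(p, "")
```

Here `naive` still contains "frankfurter allgemeine", while `cleaned` does not.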

## Transform Emojis to text

The texts contain emojis, which we translate into German text descriptions.

In [15]:
# Pass every non-word, non-whitespace character to map_emoji_to_description, which translates emojis into text
data_wo_pattern['data'] = data_wo_pattern['data'].apply(lambda x: re.sub(r'[^\w\s]', lambda match: map_emoji_to_description(match.group(), language='de'), str(x)))
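
The cell above relies on `re.sub` accepting a function as the replacement: each matched character is handed to the function, and whatever it returns is substituted. A minimal sketch of that mechanism follows. The lookup table `EMOJI_DE` and the function `describe_symbol` are assumptions for illustration; the real `map_emoji_to_description` in `utils` presumably covers many more symbols.

```python
import re

# Hypothetical lookup table with two entries, for illustration only
EMOJI_DE = {"⚽": "fußball", "🔥": "feuer"}

def describe_symbol(symbol):
    """Return a German description for a known emoji,
    otherwise return the symbol unchanged."""
    if symbol in EMOJI_DE:
        return f" {EMOJI_DE[symbol]} "
    return symbol

text = "tor für leverkusen ⚽🔥!"
converted = re.sub(r"[^\w\s]", lambda m: describe_symbol(m.group()), text)
```

Because `[^\w\s]` matches one character at a time, punctuation such as `!` also reaches the callback and must be returned unchanged. Emojis that consist of several code points would arrive in pieces with this pattern.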

## Remove noise

"Noise" refers to non-meaningful content such as numbers, links, and redundant whitespace. The accent-removal step, originally described for the Spanish data, is applied here as well.

In [16]:
data_rm_1 = data_wo_pattern.copy()

# Strip_numeric -> removing digits (https://gensimr.news-r.org/reference/strip_numeric.html)
data_rm_1['data'] = data_rm_1['data'].apply(strip_numeric)

# Strip links
data_rm_1['data'] = data_rm_1['data'].apply(lambda x: re.sub(r'http\S+', '', str(x)))

# Strip multiple whitespaces also \n -> (https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.strip_multiple_whitespaces)
data_rm_1['data'] = data_rm_1['data'].apply(strip_multiple_whitespaces)

# Strip accents (step described for the Spanish data, also applied here)
data_rm_1['data'] = data_rm_1['data'].apply(lambda x: remove_accents(str(x)))

# reset index 
data_rm_1 = data_rm_1.reset_index(drop=True)
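
To make the effect of these steps concrete, here is a standard-library sketch that mirrors the order of the cell above: digits, links, whitespace, accents. The functions `strip_accents` and `remove_noise` are illustrative names, and the accent handling via Unicode NFKD decomposition is an assumption about how `remove_accents` might work, since that helper is not shown. The URL is a placeholder.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_noise(text):
    text = re.sub(r"[0-9]+", "", text)    # digits
    text = re.sub(r"http\S+", "", text)   # links
    text = re.sub(r"\s+", " ", text)      # runs of whitespace, including \n
    return strip_accents(text)

sample = "hincapié spielt für bayer 04\nmehr: https://example.com/news"
cleaned = remove_noise(sample)
# cleaned == "hincapie spielt fur bayer mehr: "
```

Note that this kind of accent stripping also turns German umlauts into their base letters ("für" becomes "fur"), while "ß" is left untouched. If `remove_accents` behaves similarly, that would explain why the custom stop word list later in this notebook contains `fur`.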

## Remove empty rows

In [17]:
# remove empty rows 
data_rm_1 = data_rm_1.replace('', pd.NA)
data_rm_1.dropna(inplace=True)

## Save data clean 1

This version of the data is stored for later processing.

In [18]:
# Remove similar rows again (the function is imported from utils at the top)
data_rm_1 = remove_similar_rows_per_player(data_rm_1, data_rm_1['player'].unique())

In [19]:
# Store data 
data_rm_1.to_csv('../data_files/data_clean/de_clean_1.csv', index=False)

# ------------------------

# Preprocess for data clean 2

# ------------------------

In [20]:
# Create a copy data 
data_rm_2 = data_rm_1.copy()

## Remove stopwords

In [21]:
# Load stop words
german_stop_words = stopwords.words('german')

# Words to take out of the stop word list so that they stay in the text (mostly negations)
stop_words_to_remove = ['nicht', 'nichts', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'anders']

# Additional corpus-specific stop words
stop_words_to_add = ['vor', 'in', 'den', 'dem', 'beim', 'wir', 'der', 'ist','ende', 'seite', 'ersten', 'fürs', 'eh', 'blick', 'schon', 'zumal', 'erst', 'mal', 'bild', 't', '(dpa)', 'fur']

# Take the words to keep out of the stop word list (raises ValueError if a word is not in the list)
for word in stop_words_to_remove:
  german_stop_words.remove(word)

# Add the additional stop words
for word in stop_words_to_add:
  german_stop_words.append(word)

# Apply remove_stopwords to the 'data' column
data_rm_2['data'] = data_rm_2['data'].apply(lambda x: remove_stopwords(x, german_stop_words))
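
The same customization can be expressed with set operations, which also tolerates words that are missing from the base list. The sketch below uses a tiny stand-in list instead of the NLTK corpus, and `drop_stop_words` is a hypothetical helper, not the gensim function used above.

```python
# Tiny stand-in for stopwords.words('german'); the NLTK list is much longer
base_stop_words = ["der", "und", "nicht", "kein", "ist"]

keep_in_text = {"nicht", "kein"}     # negations that should survive
extra_stop_words = {"bild", "fur"}   # corpus-specific additions

stop_words = (set(base_stop_words) - keep_in_text) | extra_stop_words

def drop_stop_words(text, stop_words):
    """Remove whitespace-separated tokens that are in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

filtered = drop_stop_words("der trainer ist nicht zufrieden", stop_words)
# filtered == "trainer nicht zufrieden"
```

Keeping the negation matters for later sentiment work: without it, the example would read "trainer zufrieden", which reverses the meaning.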


## Strip punctuation and remove short words
We first take a copy of the data with punctuation intact, because it is needed for the third transformation below.

In [22]:
# Copy data for the third transformation
data_rm_3 = data_rm_2.copy()

# strip_punctuation -> removes punctuation (https://gensimr.news-r.org/reference/strip_punctuation.html)
data_rm_2['data'] = data_rm_2['data'].apply(strip_punctuation)

# strip_short -> removes words shorter than 3 characters (default minsize=3)
data_rm_2['data'] = data_rm_2['data'].apply(strip_short)
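
For readers without gensim at hand, the two steps can be approximated with the standard library. This is a sketch of the idea, not gensim's code; `strip_punct` and `strip_short_words` are illustrative names.

```python
import re
import string

# One or more ASCII punctuation characters in a row
PUNCT = re.compile(f"[{re.escape(string.punctuation)}]+")

def strip_punct(text):
    """Replace each run of ASCII punctuation with a single space."""
    return PUNCT.sub(" ", text)

def strip_short_words(text, minsize=3):
    """Keep only whitespace-separated tokens with at least minsize characters."""
    return " ".join(w for w in text.split() if len(w) >= minsize)

sample = "top-leistung von hincapie, er war so gut!"
result = strip_short_words(strip_punct(sample))
# result == "top leistung von hincapie war gut"
```

This sketch only covers ASCII punctuation, so typographic quotation marks such as „ and “ would pass through it unchanged.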

## Save data clean 2

In [23]:
# Store data 
data_rm_2.to_csv('../data_files/data_clean/de_clean_2.csv', index=False)

# ------------------------

# Data condensed
The third transformation condenses the corpus by deleting sentences. Only the paragraphs that mention a player's name are kept.

# ------------------------

## Keep only paragraphs with player names

We keep the lines in which the player's name appears. The `n_lines` argument controls how many following lines are kept as well; here it is 0.

In [24]:
# Select only the paragraphs that include player names
data_with_playernames = find_lines_with_player(data_rm_3, data_rm_3['player'].unique(), n_lines = 0)
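
`find_lines_with_player` is defined in `utils` and not shown here, so the following is only a sketch of the general idea. It assumes that a "line" is a sentence-like segment (the earlier whitespace step already collapsed line breaks) and that a segment counts as a hit when it contains any part of the player's name. Both the function name `segments_with_player` and these rules are assumptions.

```python
import re

def segments_with_player(text, player, n_following=0):
    """Keep each segment that mentions a part of the player's name,
    plus the n_following segments after it. Hypothetical stand-in."""
    name_parts = player.lower().split()
    segments = re.split(r"(?<=[.!?])\s+", text)
    keep = set()
    for i, seg in enumerate(segments):
        if any(part in seg for part in name_parts):
            keep.update(range(i, min(i + n_following + 1, len(segments))))
    return " ".join(segments[i] for i in sorted(keep))

text = ("leverkusen gewinnt das spiel. piero hincapie uberzeugt in der abwehr. "
        "der trainer lobt ihn. das wetter war schlecht.")

only_hits = segments_with_player(text, "piero hincapie")
with_context = segments_with_player(text, "piero hincapie", n_following=1)
```

With `n_following=0` only the sentence naming the player survives; with `n_following=1` the sentence after it is kept too, which preserves references such as "ihn".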

## Create wordgroups
We remove the players' first names and merge multi-word terms into single tokens, e.g. bayer leverkusen -> bayerleverkusen or europa league -> europaleague.

In [25]:
# Apply the word group function
data_with_playernames = name_wordgroups(data_with_playernames)
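
The merging of multi-word terms can be pictured as a simple phrase-to-token replacement. The sketch below covers only that part, using the two examples named above. The mapping `WORD_GROUPS` and the function `join_word_groups` are hypothetical, and the first-name removal that `name_wordgroups` also performs is left out.

```python
# Hypothetical mapping built from the two examples in the text
WORD_GROUPS = {
    "bayer leverkusen": "bayerleverkusen",
    "europa league": "europaleague",
}

def join_word_groups(text, groups=WORD_GROUPS):
    """Replace each known multi-word phrase with its single-token form."""
    for phrase, token in groups.items():
        text = text.replace(phrase, token)
    return text

joined = join_word_groups("bayer leverkusen steht im finale der europa league")
# joined == "bayerleverkusen steht im finale der europaleague"
```

Treating such terms as one token keeps later bag-of-words models from splitting a club or competition name into unrelated words.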

## Delete player names from their sentences
Because the player names had a strong influence on the clustering, each player's name is removed from that player's texts.

In [26]:
# For every player, remove their first and last name from their texts
for player in data_with_playernames['player'].unique():
    f_l_name = player.split()

    # First and last name, escaped so that they are matched literally
    first_name = re.escape(f_l_name[0].lower())
    last_name = re.escape(f_l_name[-1].lower())

    # Match either name as a whole word (no trailing "|", which would add an empty alternative)
    updated_pattern = r"\b(" + first_name + r"|" + last_name + r")\b"

    # Remove the names from this player's rows
    mask = data_with_playernames['player'] == player
    data_with_playernames.loc[mask, 'data'] = data_with_playernames.loc[mask, 'data'].apply(lambda x: re.sub(updated_pattern, "", str(x)))
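
A small worked example shows what a whole-word name pattern removes and what it leaves alone. This sketch generalizes to names with any number of parts; the function `name_pattern` is an illustrative helper, not part of `utils`.

```python
import re

def name_pattern(player):
    """Build a regex that matches any part of the player's name as a whole word."""
    parts = [re.escape(p) for p in player.lower().split()]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b")

pattern = name_pattern("Piero Hincapie")
stripped = pattern.sub("", "piero hincapie klart vor hincapies tor")
# stripped == "  klart vor hincapies tor"
```

The word boundaries `\b` mean that inflected forms such as "hincapies" are not matched. The leftover double spaces are harmless here because `strip_short` in the next step rebuilds the text from whitespace-separated tokens.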



## Strip punctuation and remove short words

In [27]:
# strip_punctuation -> removes punctuation (https://gensimr.news-r.org/reference/strip_punctuation.html)
data_with_playernames['data'] = data_with_playernames['data'].apply(strip_punctuation)

# strip_short -> removes words shorter than 3 characters (default minsize=3)
data_with_playernames['data'] = data_with_playernames['data'].apply(strip_short)

## Remove empty rows

In [28]:
# Delete empty rows
data_with_playernames = data_with_playernames.replace('', pd.NA)
data_with_playernames.dropna(inplace=True)

## Save data condensed

In [29]:
# Store data 
data_with_playernames.to_csv('../data_files/data_clean/de_clean_condensed.csv', index=False)

# Additional Sentiment Preprocessing
For the sentiment analysis we create one additional dataset that contains only the sentences from the clean dataset in which a player's name appears.

In [30]:
# load clean dataset
df_de_sentence = pd.read_csv('../data_files/data_clean/de_clean_1.csv')


In [31]:
# Keep the lines where the player name is found, plus one following line
df_de_sentence = find_lines_with_player(df_de_sentence, df_de_sentence['player'].unique(),n_lines=1)

In [32]:
# Extract sentence
df_de_sentence = extract_sentence(df_de_sentence)

In [33]:
# Copy the sentence column into data and drop the sentence column
df_de_sentence['data']= df_de_sentence['sentence']
df_de_sentence.drop('sentence', axis=1, inplace=True)

In [34]:
# Remove similar sentences
df_de_sentence = remove_similar_rows_per_player(df_de_sentence, df_de_sentence['player'].unique())

# apply wordgroups
df_de_sentence = name_wordgroups(df_de_sentence)

# delete empty rows
df_de_sentence = df_de_sentence.replace('', pd.NA)
df_de_sentence.dropna(inplace=True)


In [35]:
# Store data 
df_de_sentence.to_csv('../data_files/data_clean/de_clean_1_sen.csv', index=False)

# Notebook Output

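This notebook creates the following CSV files in `../data_files/data_clean/`:

1. `de_clean_1.csv`: patterns, emojis, and noise removed
2. `de_clean_2.csv`: additionally without stop words, punctuation, and short words
3. `de_clean_condensed.csv`: only the passages that mention the player, with word groups merged and player names removed
4. `de_clean_1_sen.csv`: only the sentences that mention the player, for the sentiment analysis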

The objective of these files is to have the data cleaned and saved at different levels of detail. They allow us to use the same data for different processes with various requirements.

# Next steps for Bayer04

To further improve the data processing, we recommend that Bayer04 try different combinations of preprocessing steps. Storing the data at each stage makes it possible to experiment with these combinations in the subsequent processes and to find the best accuracy for each model. This iterative approach allows the preprocessing to be fine-tuned so that only the steps that actually improve model performance are kept.
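
One way to organize such experiments is to enumerate subsets of optional steps and produce one variant of the text per subset. The sketch below is only an illustration of that idea. The three steps in `OPTIONAL_STEPS` are simplified placeholders, not the notebook's actual functions, and the evaluation of each variant against a model is left out.

```python
from itertools import combinations

# Hypothetical optional steps; each maps a string to a string
OPTIONAL_STEPS = {
    "lowercase": str.lower,
    "strip_digits": lambda t: "".join(ch for ch in t if not ch.isdigit()),
    "collapse_spaces": lambda t: " ".join(t.split()),
}

def variants(text):
    """Yield (step_names, processed_text) for every subset of the steps,
    always applying the chosen steps in the order defined above."""
    names = list(OPTIONAL_STEPS)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            out = text
            for name in combo:
                out = OPTIONAL_STEPS[name](out)
            yield combo, out

results = dict(variants("Bayer  04 Leverkusen"))
```

With three optional steps this yields eight variants, from the untouched text to the fully processed one. Each variant could then be saved and scored with the downstream models.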