# Preprocessing English Language

#### In this notebook, specifically for the English data, we will apply a series of techniques to remove ads, noise, and any information that is not needed for our analysis. We will clean the data that has been previously extracted and store it in different files and structures to fulfill the requirements of the models that will be applied later.

## Import packages

In [2]:
import pandas as pd
import re
#from sklarn.feature_extraction.text import CountVectorizer
from gensim.parsing.preprocessing import STOPWORDS, strip_tags, strip_numeric, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short, stem_text
import numpy as np
import sys
import os
sys.path.append('../')
#from utils import remove_similar_rows, find_lines_with_player, remove_stopwords_from_text, map_emoji_to_description, remove_similar_rows_per_player ##del_patterns, 


#INFO included in utils - but utils doesn't work for me here
import emoji
def map_emoji_to_description(emoji_text, language): 
    emoji_description = emoji.demojize(emoji_text, language=language)
    return emoji_description

def translate_emojis(text, language):
    return re.sub(r'[\U0001F000-\U0001F999]', lambda match: map_emoji_to_description(match.group(), language=language), text)

def remove_stopwords_from_text(text, stopwords_list_per_language):
    return remove_stopwords(text, stopwords=stopwords_list_per_language)

from difflib import SequenceMatcher
def remove_similar_rows_per_player(df, playerlist, threshold=0.9):
    '''The procedure of deleting similiar articles needs to be done by each player because if an article writes about 
    # e.g. two players we want to keep it for both of the players'''

    # define empty df which will be returned in the end
    df_complete = pd.DataFrame()

    for player in playerlist:
        
        # create the df for the player
        df_player = df[df["player"] == player]
        df_player = df_player.reset_index(drop=True)
        column_as_df = pd.DataFrame(df_player['data'])


        
        # Compute similarity scores for each pair of rows
        similarity_scores = {}
        for i, row in column_as_df.iterrows():
            for j, other_row in column_as_df.iterrows():
                if i >= j:
                    continue
                score = SequenceMatcher(None, row, other_row).ratio()
                if score >= threshold:
                    similarity_scores[(i, j)] = score
        
        # Identify rows to remove
        rows_to_remove = []
        for (i, j), score in similarity_scores.items():
            if i not in rows_to_remove and j not in rows_to_remove:
                rows_to_remove.append(j if df_player.index[i] < df_player.index[j] else i)
        
        # Remove rows and concatenate df
        df_player = df_player.drop(rows_to_remove)
        df_complete = pd.concat([df_complete, df_player], axis=0)

        #return modified DataFrame
    return df_complete

from unidecode import unidecode
def remove_accents(text):
    return unidecode(text)

# Function which finds the lines where a players name is contained
def find_lines_with_player(dataframe, playerlist, n_lines = 0):
    
    # create empty df 
    df_complete = pd.DataFrame()

    # iterating over all players
    for player in playerlist:

        # get players first and last_name to include them in later sentence checks
        player_first_name, player_last_name = player.split()

        # just select player indiviual data
        df_player = dataframe[dataframe["player"] == player]
        df_player = df_player.reset_index(drop=True)

        # iterate over all data for the player
        for i in range(len(df_player)):

            # get the current record
            current_line = df_player['data'].iloc[i]
            # split up the records in lines
            lines = current_line.split('\\n')
            # create an empty string
            new_string = ''

            line_counter = 0
            # iterate over all lines in the record
            for line in lines:
                # if the playername can be found in the line add the line to the string
                if line.find(player) != -1:
                    new_string = new_string + line + " "
                    if line_counter <= 0:
                        line_counter = line_counter + n_lines
            
                elif line.find(player_first_name) != -1:
                    new_string = new_string + line + " "
                    if line_counter <= 0:
                        line_counter = line_counter + n_lines
        
                elif line.find(player_last_name) != -1:
                    new_string = new_string + line + " "
                    if line_counter <= 0:
                        line_counter = line_counter + n_lines
            
                elif line_counter >= 0:
                    new_string = new_string + line + " "
                    line_counter = line_counter-1
        
            # switch the previos record against the newly created string 
            df_player['data'].iloc[i] = new_string

        # add the new data to the Dataframe and return
        df_complete = pd.concat([df_complete, df_player], axis=0)
        
    return df_complete

def name_wordgroups(df):
    '''
    Function to match first and surname to just last name
    '''
    # create patterns which should be matched 
    # first lastname and firstname should both result in just lastname
    pattern_match2d = np.array([[r"\b(mitchel bakker|mitchel)\b", 'bakker'], 
                                [r"\b(xabi alonso|xabi)\b", 'alonso'], 
                                [r"\b(exequiel palacios|exequiel)\b", 'palacios'],
                                [r"\b(nadiem amiri|nadiem)\b", 'amiri'],
                                [r"\b(kerem demirbay|kerem)\b", 'demirbay'],
                                [r"\b(robert andrich|robert)\b", 'andrich'],
                                [r"\b(exequiel palacios|exequiel)\b", 'palacios'],
                                [r"\b(piero hincapie|piero)\b", 'hincapie'],
                                [r"\b(jeremie frimpong|jeremie)\b", 'frimpong'],
                                [r"\b(jonathan tah|jonathan)\b", 'tah'],
                                [r"\b(moussa diaby|moussa)\b", 'diaby'],
                                [r"\b(mykhaylo mudryk|mykhaylo)\b", 'mudryk'],
                                [r"\b(amine adli|amine)\b", 'adli'],
                                [r"\b(florian wirtz|florian)\b", 'wirtz'],
                                [r"\b(jose mourinho|jose)\b", 'mourinho'],     
                                #other wordgroups
                                [r"\b(europa league)\b", 'europaleague'],
                                [r"\b(champions league)\b", 'championsleague'],
                                [r"\b(bayer leverkusen|bayer|leverkusen|leverkusens)\b", 'bayerleverkusen']
                                ])

    # do the pattern matching for each player
    for pattern, player in pattern_match2d:
        df['data'] = df['data'].apply(lambda x: re.sub(pattern, str(player), str(x)))

    return df

'''def del_patterns(df_line, pattern):
    lines = df_line.split('\\n')
    new_string = ''
    for line in lines:
        deleting = False
        for word in pattern:
            if deleting:
                break
            elif word in line:
                deleting = True
            else:
                deleting = False
        if not deleting:
            new_string += line + '\n'
    return new_string.strip()'''


"def del_patterns(df_line, pattern):\n    lines = df_line.split('\\n')\n    new_string = ''\n    for line in lines:\n        deleting = False\n        for word in pattern:\n            if deleting:\n                break\n            elif word in line:\n                deleting = True\n            else:\n                deleting = False\n        if not deleting:\n            new_string += line + '\n'\n    return new_string.strip()"

In [3]:
def del_patterns(df_line, pattern):
    '''
    Function which takes an input and deletes defined text pattern 
    '''
    # split up the records in lines
    lines = df_line.split('\\n')
    
    if len(lines) > 1:
        # create an empty string
        new_string = ''
        
        # iterating over the lines 
        for line in lines:
            # set deleting to False first
            deleting = False
            
            # check if any pattern word is included in the line and set deleting to True if so 
            for word in pattern:
                if word in line:
                    deleting = True
                    break
            
            # if the sentence should not be deleted, add it to the string  
            if not deleting:
                new_string += line + ' '  # Add a space after each line
            
        # remove trailing space from the new_string
        new_string = new_string.rstrip()
        
    else:
        new_string = df_line
    
    # return the string 
    return new_string


## Load Data

We load the data from our csv file with all the previously pulled data. And we filter English language data.

In [4]:
url = 'https://raw.githubusercontent.com/svisel22/SS23-BIPM-Analytics-Lab---Group-4-repository/main/data_files/all_data_v3.csv'
df = pd.read_csv(url)

In [5]:
# Filter out the English data and reindex
df_en = df[df["language"] == "en"]

#Reset index
df_en = df_en.reset_index(drop=True)
df_en

Unnamed: 0,data,player,language,publishedAt
0,{'content': 'Football\nFlorian Wirtz\'s goal f...,Exequiel Palacios,en,2023-02-16T23:56:00Z
1,{'content': '[1/4]\xa0Soccer Football - Europa...,Exequiel Palacios,en,2023-02-23T20:50:50Z
2,"{'content': ""By Will Pickworth For Mailonline\...",Exequiel Palacios,en,2023-02-23T20:53:59Z
3,{'content': '\nBUENOS AIRES (AP) — World Cup w...,Exequiel Palacios,en,2023-03-03T16:40:46Z
4,{'content': 'Sign In\nSign In\nThe Star Editio...,Exequiel Palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
408,{'content': 'Real Madrid and Manchester City p...,Mykhaylo Mudryk,en,2023-05-09T20:58:09Z
409,{'content': 'Real Madrid and Manchester City p...,Mykhaylo Mudryk,en,2023-05-09T19:45:09Z
410,{'content': 'West Ham United are the sole Prem...,Mykhaylo Mudryk,en,2023-05-09T14:30:39Z
411,"{'content': ""Inter Milan beat AC Milan 2-0 in ...",Mykhaylo Mudryk,en,2023-05-09T14:17:43Z


## Remove similiar rows

We are using a pre-tailored function to remove duplicates and rows that were mistakenly stored twice.

In [7]:
# Remove the similiar rows (The Function is imported from utils on top)
df_en = remove_similar_rows_per_player(df_en, df_en['player'].unique())
df_en

Unnamed: 0,data,player,language,publishedAt
0,{'content': 'Football\nFlorian Wirtz\'s goal f...,Exequiel Palacios,en,2023-02-16T23:56:00Z
1,{'content': '[1/4]\xa0Soccer Football - Europa...,Exequiel Palacios,en,2023-02-23T20:50:50Z
2,"{'content': ""By Will Pickworth For Mailonline\...",Exequiel Palacios,en,2023-02-23T20:53:59Z
3,{'content': '\nBUENOS AIRES (AP) — World Cup w...,Exequiel Palacios,en,2023-03-03T16:40:46Z
4,{'content': 'Sign In\nSign In\nThe Star Editio...,Exequiel Palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
24,"{'content': ""By Dominic Hogan For Mailonline\n...",Piero Hincapie,en,2023-05-16T12:22:18Z
0,{'content': 'We use cookies and other tracking...,Piero Hincapié,en,2023-04-27T04:57:02Z
1,{'content': 'Man City’s Alex Robertson makes d...,Piero Hincapié,en,2023-03-24T15:24:08Z
2,"{'content': ""\nLast updated on 19 March 202319...",Piero Hincapié,en,2023-03-19T20:03:28Z


## Transform data into lower case

We want to transform data and player into lower case

In [8]:
# Transform data into lower case
df_en['data'] = df_en['data'].str.lower()
df_en['player'] = df_en['player'].str.lower()

## Delete Patterns

Due to the large number of noice and irrelevant information, patterns are defined to be removed from all texts. These patterns are specific for English texts

In [9]:
# Delete content patterns
df_en['data'] = df_en['data'].apply(lambda x: re.sub(r"^{\'content\': \'", "", str(x)))
df_en['data'] = df_en['data'].apply(lambda x: re.sub(r"{'content':", "", str(x)))

# Define patterns to delete
patternlist_en = [
    "copyright",
    "photo",
    'image',
    "all rights reserved",
    'published by',
    'published',
    'publisher',
    'pic.twitter.com',
    'want an ad-free experience?',
    'comments',
    'log in',    
    'last updated on',
    'updated',
    'we use cookies and other tracking technologies',
    'we use cookies',
    'sign',
    'xa',
    'external-link',
    'creator',
    'gameswednesday',
    'fridayman', 
    'subscription', 
    'subscription',
    'februaryarsen',
    'decemberwest',
    'from the section',
    'filed under:',
    'when you purchase through links on our site, we may earn an affiliate commission',
    'how it works',
    'the ambury',
    'bath',
    'future publishing limited quay house',
    'to the independent?',
    'if you would prefer:',
    'want an ad-free experience?\n',
    'fourfourtwo is part of future plc, an international media group and leading digital publisher.',
    'visit our corporate site (opens in new tab).',
    ' the journal publishes the biggest breaking news in irish and international sport but for all of the 42′s insightful analysis and sharp sportswriting, subscribe\\xa0here. making a difference',
    'a mix of advertising and supporting contributions helps keep paywalls away from valuable information like this article.',
    'over 5,000 readers like you have already stepped up and support us with a monthly payment or a once-off donation.',
    'for the price of one cup of coffee each week you can make sure we can keep reliable, meaningful news open to everyone regardless of their ability to pay.',
    'support us',
    'learn more for the price of one cup of coffee each week you can make sure we can keep reliable, meaningful news open to everyone regardless of their ability to pay. to embed this post, copy the code below on your site 600px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=600&height=460" ></iframe> 400px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=400&height=460" ></iframe> 300px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=300&height=460" ></iframe> access to the comments facility has been disabled for this user',
    '483623.',
    'for mailonline',
    'registered office: 3rd floor, latin hall, golden lane, dublin 8. please note that the journal uses cookies to improve your experience and to provide',
    'services and',
    'advertising. for more information on cookies please refer to our cookies',
    'policy. the journal supports the work of the press council of ireland and the office of the press',
    'ombudsman, and our staff operate within the code of practice. you can obtain a copy of the',
    'code, or contact the council, at www.presscouncil.ie,',
    'ph: (01) 6489130, lo-call 1890 208 080 or email: info@presscouncil.ie news images provided by press association',
    'and rollingnews.ie unless otherwise stated.',
    'unless otherwise stated. wire service provided by afp and press association. journal media does not control and is not responsible for user created content, posts, comments,',
    'submissions or preferences. users are reminded that they are fully responsible for their own',
    'and indemnify journal media in relation to such content and their ability to make such content,',
    'posts, comments and submissions available. journal media does not control and is not responsible',
    'for the content of external websites. switch to mobile site switch to desktop site',
    'create an email alert based on the current article',
    'refresh the page or navigate to another page on the site to be automatically logged inplease refresh your browser to be logged in',
    'your bookmarks in your independent premium section, under my profile',
    'referee:\\xa0artur dias (por) the journal publishes the biggest breaking news in irish and international sport but for all of the 42′s insightful analysis and sharp sportswriting, subscribe\\xa0here. making a difference',
    'learn more to embed this post, copy the code below on your site 600px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=600&height=460" ></iframe> 400px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=400&height=460" ></iframe> 300px wide <iframe width="600" height="460" frameborder="0" style="border:0px;" src="https://www.thejournal.ie/https://www.thejournal.ie/ireland-france-4-6030557-mar2023/?embedpost=6030557&width=300&height=460" ></iframe> access to the comments facility has been disabled for this user',
    'registered office: 3rd floor, latin hall, golden lane, dublin 8. please note that the journal uses cookies to improve your experience and to provide',
    'policy. the journal supports the work of the press council of ireland and the office of the press',
    'ph: (01) 6489130, lo-call 1890 208 080 or email: info@presscouncil.ie news images provided by press association',
    'unless otherwise stated. wire service provided by afp and press association. journal media does not control and is not responsible for user created content, posts, comments,',
    'created content and their own posts, comments and submissions and fully and effectively warrant',
    'for the content of external websites. switch to mobile site switch to desktop site.',
    'to bookmark your favourite articles and stories to read or reference later? start your independent premium subscription today.please refresh the page or navigate to another page on the site to be automatically logged inplease refresh your browser to be logged inlog in',
    'use cookies and other tracking technologies to improve your browsing experience on our site, show personalized content and targeted ads, analyze site traffic, and understand where our audiences come from. to learn more or opt-out, read our cookie policy. please also read our privacy notice and terms of use, which became effective december 20, 2019.',
    'by choosing i accept, you consent to our use of cookies and other tracking technologies.',
    'latest transfer news',
    'manchester city transfer news, live! latest reports, rumors, updates',
    'manchester united transfer news, live! latest reports, rumors, updates',
    'arsenal transfer news, live! latest reports, rumors, updates',
    'liverpool transfer news, rumors today, live!',
    '[ transfer news: chelsea | tottenham | man city | arsenal | man united ]',
    'we bring sports news that matters to your inbox, to help you stay informed and get a winning edge.',
    'cbs interactive',
    'cbs sports is a registered trademark of cbs broadcasting inc. commissioner.com is a registered trademark of cbs interactive inc.',
    'images by getty images and us presswire',
    'want to bookmark your favourite articles and stories to read or reference later?',
    'start your independent premium subscription today.',
    'log in want an ad-free experience?',
    'comment',
    'name',
    'email',
    'website',
    'arsenal live scores',
    'january',
    'february',
    'march',
    'april',
    'mai',
    'june',
    'july',
    'august',
    'september',
    'october',
    'november',
    'december'
]

In [10]:
# Delete patterns
df_en['data'] = df_en['data'].apply(lambda x: del_patterns(str(x), patternlist_en))   
df_en                                                   

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz\'s goal for bayer lever...,exequiel palacios,en,2023-02-16T23:56:00Z
1,"monaco, feb 23 (reuters) - bayer leverkusen be...",exequiel palacios,en,2023-02-23T20:50:50Z
2,5 it was a goal-heavy thursday in the compet...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni on friday anno...,exequiel palacios,en,2023-03-03T16:40:46Z
4,the star edition change location this copy is ...,exequiel palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
24,43 tottenham have identified form bayer leve...,piero hincapie,en,2023-05-16T12:22:18Z
0,"one for the future of course, kendry is still ...",piero hincapié,en,2023-04-27T04:57:02Z
1,man city’s alex robertson makes debut as aiden...,piero hincapié,en,2023-03-24T15:24:08Z
2,""" exequiel palacios scored two penalties as b...",piero hincapié,en,2023-03-19T20:03:28Z


## Transform Emojis to text

Due to the presence of emojis, they are being translated into text

In [1]:
# Use translate emjois function to have text instead
df_en['data'] = df_en['data'].apply(lambda x: translate_emojis(str(x), language='en'))

NameError: name 'df_en' is not defined

## Remove noise

"Noise" refers to non-meaningful data, such as numbers, links, additional whitespaces, and for Spanish data, it also includes accents

In [12]:
# Strip_numeric
df_en['data'] = df_en['data'].apply(strip_numeric)

# Strip links
df_en['data'] = df_en['data'].apply(lambda x: re.sub(r'http\S+', '', str(x)))

# Strip multiple whitespaces also \n
df_en['data'] = df_en['data'].apply(strip_multiple_whitespaces)

# Strip spanish accents
df_en['data'] = df_en['data'].apply(lambda x: remove_accents(str(x)))
df_en['player'] = df_en['player'].apply(lambda x: remove_accents(str(x)))

In [13]:
df_en

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz\'s goal for bayer lever...,exequiel palacios,en,2023-02-16T23:56:00Z
1,"monaco, feb (reuters) - bayer leverkusen beat ...",exequiel palacios,en,2023-02-23T20:50:50Z
2,it was a goal-heavy thursday in the competiti...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni on friday anno...,exequiel palacios,en,2023-03-03T16:40:46Z
4,the star edition change location this copy is ...,exequiel palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
24,tottenham have identified form bayer leverkus...,piero hincapie,en,2023-05-16T12:22:18Z
0,"one for the future of course, kendry is still ...",piero hincapie,en,2023-04-27T04:57:02Z
1,man city's alex robertson makes debut as aiden...,piero hincapie,en,2023-03-24T15:24:08Z
2,""" exequiel palacios scored two penalties as b...",piero hincapie,en,2023-03-19T20:03:28Z


# Remove weird cells from tah

In [14]:
df_en[df_en['player']== 'jonathan tah']

Unnamed: 0,data,player,language,publishedAt
0,"it seems that ""citing gun research without bot...",jonathan tah,en,2023-02-08T21:46:59Z
1,""" the razor-sharp stand-up comic reveals it a...",jonathan tah,en,2023-02-10T22:36:59Z
2,"what made so much of world jewry--ashkenazim, ...",jonathan tah,en,2023-02-15T11:15:52Z
3,the unofficial guide to official washington. t...,jonathan tah,en,2023-03-17T10:16:11Z
4,""" exequiel palacios scored two penalties as b...",jonathan tah,en,2023-03-19T20:03:28Z
5,once again it\'s a much bigger week in notable...,jonathan tah,en,2023-04-07T14:12:31Z
6,"""after kicking off espn's college football fu...",jonathan tah,en,2023-04-26T11:24:16Z


In [15]:
df_en.info()

<class 'pandas.core.frame.DataFrame'>
Index: 394 entries, 0 to 3
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   data         394 non-null    object
 1   player       394 non-null    object
 2   language     394 non-null    object
 3   publishedAt  394 non-null    object
dtypes: object(4)
memory usage: 15.4+ KB


In [16]:
# Reset the index of the df
df_en = df_en.reset_index(drop=True)

In [17]:
# we examined unoprapiate rows for jonathan tah which have to be deleted manually
# print out the rows for tah and examine them 
df_en[df_en['player'] == 'jonathan tah']
#rows 131, 133, 134, 136 should be deleted

rows_to_delete = [131, 133, 134, 136]
df_en = df_en.drop(rows_to_delete)


In [18]:
# examine if rows are deleted
df_en[df_en['player'] == 'jonathan tah']

Unnamed: 0,data,player,language,publishedAt
132,""" the razor-sharp stand-up comic reveals it a...",jonathan tah,en,2023-02-10T22:36:59Z
135,""" exequiel palacios scored two penalties as b...",jonathan tah,en,2023-03-19T20:03:28Z
137,"""after kicking off espn's college football fu...",jonathan tah,en,2023-04-26T11:24:16Z


## Remove duplicates and nulls
If nulls or duplicates appeared during the transformation steps drop them now 

In [19]:
# remove nulls and duplicated rows 
# remove empty rows 
df_en = df_en.replace('', pd.NA)
df_en.dropna(inplace=True)

# Remove the similiar rows (The Function is imported from utils on top)
df_en = remove_similar_rows_per_player(df_en, df_en['player'].unique())

## Reset index

In [20]:
df_en_index = df_en.copy()

df_en_index.reset_index(drop=True, inplace=True)

In [21]:
df_en_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378 entries, 0 to 377
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   data         378 non-null    object
 1   player       378 non-null    object
 2   language     378 non-null    object
 3   publishedAt  378 non-null    object
dtypes: object(4)
memory usage: 11.9+ KB


In [22]:
df_en_index

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz\'s goal for bayer lever...,exequiel palacios,en,2023-02-16T23:56:00Z
1,"monaco, feb (reuters) - bayer leverkusen beat ...",exequiel palacios,en,2023-02-23T20:50:50Z
2,it was a goal-heavy thursday in the competiti...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni on friday anno...,exequiel palacios,en,2023-03-03T16:40:46Z
4,the star edition change location this copy is ...,exequiel palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
373,tottenham have identified form bayer leverkus...,piero hincapie,en,2023-05-16T12:22:18Z
374,"one for the future of course, kendry is still ...",piero hincapie,en,2023-04-27T04:57:02Z
375,man city's alex robertson makes debut as aiden...,piero hincapie,en,2023-03-24T15:24:08Z
376,""" exequiel palacios scored two penalties as b...",piero hincapie,en,2023-03-19T20:03:28Z


## Save data clean 1

This data is stored for a later treatment

In [23]:
# Define the folder path
folder_path = "data_clean"

# Define the file path
file_path = os.path.join(folder_path, "en_clean_1.csv")

# Save the DataFrame as a CSV file
df_en_index.to_csv(file_path, index=False)

print("Data saved successfully.")

Data saved successfully.


# ------------------------

# Preprocess for data clean 2

# ------------------------

## Remove punctuation & short words

In [24]:
# Create a copy
df_en_2 = df_en_index.copy()

# Strip_punctutation
df_en_2['data'] = df_en_2['data'].apply(strip_punctuation)

# Strip_short deletes words smaller 3
df_en_2['data'] = df_en_2['data'].apply(strip_short)

## Remove Stopwords

In [25]:
# Words to keep
words_to_keep = {'don', 'didn', 'doesn', 'shouldn', 'couldn', 'wouldn', 'never', 'isn', 'cannot', 'no', 'neither', 'nor', 'cant', 'top', 'least', 'except'} #'top', 'least', 'except' these would also change the meaning, which is why we keep them

# Create modified stopwords list
modified_en_stopwords = STOPWORDS - words_to_keep

print(modified_en_stopwords)

frozenset({'last', 'which', 'or', 'between', 'others', 'former', 'keep', 'several', 'myself', 'used', 'ever', 'whither', 'perhaps', 'wherever', 'forty', 'third', 'system', 'often', 'hundred', 'itself', 'before', 'at', 'either', 'an', 'call', 'made', 'anything', 'can', 'con', 'whence', 'from', 'then', 'twelve', 'each', 'thence', 'latterly', 'might', 'well', 'found', 'among', 'off', 'upon', 'on', 'towards', 'them', 'namely', 'we', 'did', 'so', 'via', 'became', 'mostly', 'rather', 'co', 'any', 'seem', 'hereupon', 'onto', 'etc', 'to', 'beforehand', 'couldnt', 'would', 'various', 'already', 'you', 'describe', 'what', 'amoungst', 'should', 'he', 'km', 'hasnt', 'really', 'yet', 'take', 'is', 'since', 'their', 'being', 'nothing', 'full', 'per', 'sincere', 'own', 'other', 'who', 'however', 'becoming', 'only', 'fill', 'does', 'whereas', 'eg', 'its', 'quite', 'herein', 'she', 'hence', 'through', 'me', 'someone', 'go', 'not', 'enough', 'the', 'these', 'with', 'ie', 'side', 'and', 'elsewhere', 'six

In [26]:
# Apply the remove_stopwords function to the 'text' column using the apply method
df_en_2['data'] = df_en_2['data'].apply(lambda x: remove_stopwords_from_text(x, modified_en_stopwords))

In [27]:
# remove nulls and duplicated rows 
# remove empty rows 
df_en_2 = df_en_2.replace('', pd.NA)
df_en_2.dropna(inplace=True)

# Remove the similiar rows (The Function is imported from utils on top)
df_en_2 = remove_similar_rows_per_player(df_en_2, df_en_2['player'].unique())


## Reset index

In [28]:
# Create a copy
df_en_2_index = df_en_2.copy()

# Reset index
df_en_2_index.reset_index(drop=True, inplace=True)

In [29]:
df_en_2_index.head()

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz goal bayer leverkusen e...,exequiel palacios,en,2023-02-16T23:56:00Z
1,monaco feb reuters bayer leverkusen beat monac...,exequiel palacios,en,2023-02-23T20:50:50Z
2,goal heavy thursday competition goals scored t...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni friday announce...,exequiel palacios,en,2023-03-03T16:40:46Z
4,star edition change location copy personal non...,exequiel palacios,en,2023-03-03T16:42:19Z


In [30]:
df_en_2_index

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz goal bayer leverkusen e...,exequiel palacios,en,2023-02-16T23:56:00Z
1,monaco feb reuters bayer leverkusen beat monac...,exequiel palacios,en,2023-02-23T20:50:50Z
2,goal heavy thursday competition goals scored t...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni friday announce...,exequiel palacios,en,2023-03-03T16:40:46Z
4,star edition change location copy personal non...,exequiel palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
372,tottenham identified form bayer leverkusen wer...,piero hincapie,en,2023-05-16T12:22:18Z
373,future course kendry couple weeks shy birthday...,piero hincapie,en,2023-04-27T04:57:02Z
374,man city alex robertson makes debut aiden neil...,piero hincapie,en,2023-03-24T15:24:08Z
375,exequiel palacios scored penalties bayer lever...,piero hincapie,en,2023-03-19T20:03:28Z


## Save data clean 2

In [31]:
# Define the folder path
folder_path = "data_clean"

# Define the file path
file_path = os.path.join(folder_path, "en_clean_2.csv")

# Save the DataFrame as a CSV file
df_en_2_index.to_csv(file_path, index=False)

print("Data saved successfully.")

Data saved successfully.


# ------------------------

# Data condensed
The third transformation focus on the deletion of sentences to clean the corpus. The only paragraphs kept are the one including the player names.

# ------------------------

## Keep only paragraph

Get lines and following lines where the Player name appears in the corpus 

In [32]:
# Because the following code wouldn't work with stripped punctuation we will redo the steps from en_data_2

# Create a copy
df_en_con = df_en_index.copy()

# remove stopwords
df_en_con['data'] = df_en_con['data'].apply(lambda x: remove_stopwords_from_text(x, modified_en_stopwords))

#strip_short deletes words smaller 3
df_en_con['data'] = df_en_con['data'].apply(strip_short)

In [33]:
# select only paragraphs which include playernames 
df_en_con = find_lines_with_player(df_en_con, df_en_con['player'].unique(), n_lines = 1)

In [34]:
# now reperform punctuation


# strip_punctutation
#df_en_con['data'] = df_en_con['data'].apply(strip_punctuation)

#ACITON: not used yet, but might make sense
# reperform remove similar rows
#df_en_con = remove_similar_rows_per_player(df_en_con, df_en_con['player'].unique())
#df_en_con.info()

## Create Wordgroups

In [35]:
# perform wordpair function
# df_en_con = name_wordgroups(df_en_con)

## Delete playernames from their sentences

In [36]:
'''# for every player remove their names from the texts 
for player in df_en_con['player'].unique():
    f_l_name = player.split()

    # Extracting the first name
    first_name = str(f_l_name[0])

    # Extracting the last name
    last_name = str(f_l_name[1])

    updated_pattern = r"\b(" + first_name.lower() + r"|" + last_name.lower() + r")\b|"


    # Apply the function to the data column
    df_en_con.loc[df_en_con['player'] == player, 'data'] = df_en_con.loc[df_en_con['player'] == player, 'data'].apply(lambda x: re.sub(updated_pattern, "", str(x)))'''


'# for every player remove their names from the texts \nfor player in df_en_con[\'player\'].unique():\n    f_l_name = player.split()\n\n    # Extracting the first name\n    first_name = str(f_l_name[0])\n\n    # Extracting the last name\n    last_name = str(f_l_name[1])\n\n    updated_pattern = r"\x08(" + first_name.lower() + r"|" + last_name.lower() + r")\x08|"\n\n\n    # Apply the function to the data column\n    df_en_con.loc[df_en_con[\'player\'] == player, \'data\'] = df_en_con.loc[df_en_con[\'player\'] == player, \'data\'].apply(lambda x: re.sub(updated_pattern, "", str(x)))'

## Remove empty rows (=where the API misclassified the Articles for players)

In [37]:
# Remove empty rows 
df_en_con = df_en_con.replace('', pd.NA)
df_en_con.dropna(inplace=True)

# Remove the similiar rows (The Function is imported from utils on top)
df_en_con = remove_similar_rows_per_player(df_en_con, df_en_con['player'].unique())


## Reset index

In [38]:
df_en_con_index = df_en_con.copy()

df_en_con_index.reset_index(drop=True, inplace=True)

In [39]:
df_en_con_index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378 entries, 0 to 377
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   data         378 non-null    object
 1   player       378 non-null    object
 2   language     378 non-null    object
 3   publishedAt  378 non-null    object
dtypes: object(4)
memory usage: 11.9+ KB


In [40]:
df_en_con_index

Unnamed: 0,data,player,language,publishedAt
0,football florian wirtz\'s goal bayer leverkuse...,exequiel palacios,en,2023-02-16T23:56:00Z
1,"monaco, feb (reuters) bayer leverkusen beat mo...",exequiel palacios,en,2023-02-23T20:50:50Z
2,goal-heavy thursday competition goals scored t...,exequiel palacios,en,2023-02-23T20:53:59Z
3,argentina coach lionel scaloni friday announce...,exequiel palacios,en,2023-03-03T16:40:46Z
4,star edition change location copy personal non...,exequiel palacios,en,2023-03-03T16:42:19Z
...,...,...,...,...
373,tottenham identified form bayer leverkusen wer...,piero hincapie,en,2023-05-16T12:22:18Z
374,"future course, kendry couple weeks shy birthda...",piero hincapie,en,2023-04-27T04:57:02Z
375,man city's alex robertson makes debut aiden o'...,piero hincapie,en,2023-03-24T15:24:08Z
376,exequiel palacios scored penalties bayer lever...,piero hincapie,en,2023-03-19T20:03:28Z


## Save data condensed

In [41]:
# Define the folder path
folder_path = "data_clean"

# Define the file path
file_path = os.path.join(folder_path, "en_clean_condensed_punc_play.csv")

# Save the DataFrame as a CSV file
df_en_con_index.to_csv(file_path, index=False)

print("Data saved successfully.")

Data saved successfully.


# Code to check whether preprocessing worked

In [42]:
'''Control for patterns
#check whether it worked
data_affected_row = data_wo_pattern_en.copy()
filtered_rows = data_affected_row[data_affected_row['data'].str.contains('by choosing', case=False)]

# Display the filtered rows
df_filtered = pd.DataFrame(filtered_rows['data'])
df_filtered
#print(df_filtered.iloc[0].values)
'''

"Control for patterns\n#check whether it worked\ndata_affected_row = data_wo_pattern_en.copy()\nfiltered_rows = data_affected_row[data_affected_row['data'].str.contains('by choosing', case=False)]\n\n# Display the filtered rows\ndf_filtered = pd.DataFrame(filtered_rows['data'])\ndf_filtered\n#print(df_filtered.iloc[0].values)\n"

In [43]:
'''Control for emojis

# Unicode ranges for emojis
emoji_ranges = [
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F300, 0x1F5FF),  # Miscellaneous symbols and pictographs
    (0x1F680, 0x1F6FF),  # Transport and map symbols
    (0x2600, 0x26FF),    # Miscellaneous symbols
    (0x2700, 0x27BF),    # Dingbats
    (0xFE00, 0xFE0F),    # Variation Selectors
    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
    (0x1F1E6, 0x1F1FF)   # Flags
]

# Function to check if a character is an emoji
def is_emoji(character):
    if emoji.demojize(character) != character:
        return True
    return False

# Assuming your DataFrame is named 'df'
articles_with_untranslated_emojis = 0

# Iterate over the rows of the DataFrame
for index, row in data_wo_emojis.iterrows():
    # Counter for untranslated emojis in the current row
    untranslated_emoji_count = 0

    # Iterate over the characters in the row
    for char in str(row['data']):
        if is_emoji(char) and emoji.demojize(char) == char:
            untranslated_emoji_count += 1

    # If there is at least one untranslated emoji in the current row, increment the count of rows with untranslated emojis
    if untranslated_emoji_count > 0:
        articles_with_untranslated_emojis += 1
        print("Untranslated emojis found in row", index + 1, ":", untranslated_emoji_count)

print("Total number of rows with untranslated emojis:", articles_with_untranslated_emojis)
'''

'Control for emojis\n\n# Unicode ranges for emojis\nemoji_ranges = [\n    (0x1F600, 0x1F64F),  # Emoticons\n    (0x1F300, 0x1F5FF),  # Miscellaneous symbols and pictographs\n    (0x1F680, 0x1F6FF),  # Transport and map symbols\n    (0x2600, 0x26FF),    # Miscellaneous symbols\n    (0x2700, 0x27BF),    # Dingbats\n    (0xFE00, 0xFE0F),    # Variation Selectors\n    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs\n    (0x1F1E6, 0x1F1FF)   # Flags\n]\n\n# Function to check if a character is an emoji\ndef is_emoji(character):\n    if emoji.demojize(character) != character:\n        return True\n    return False\n\n# Assuming your DataFrame is named \'df\'\narticles_with_untranslated_emojis = 0\n\n# Iterate over the rows of the DataFrame\nfor index, row in data_wo_emojis.iterrows():\n    # Counter for untranslated emojis in the current row\n    untranslated_emoji_count = 0\n\n    # Iterate over the characters in the row\n    for char in str(row[\'data\']):\n        if is_emoji(c

In [44]:
'''control for empty rows

def check_empty_lines(df):
    empty_lines_count = df.isnull().any(axis=1).sum()

    # Print the count of empty lines
    print("Number of empty lines:", empty_lines_count)
    
'''

'control for empty rows\n\ndef check_empty_lines(df):\n    empty_lines_count = df.isnull().any(axis=1).sum()\n\n    # Print the count of empty lines\n    print("Number of empty lines:", empty_lines_count)\n    \n'

# Notebook Output

This notebook will create the following CSV files:

1. data1
2. data2
3. data_condensed

The objective of these files is to have the data cleaned and saved at different levels of detail. They allow us to use the same data for different processes with various requirements.

# Next steps for Buyer04

To further improve the data processing, we could recommend Buyer04 to try different preprocessing combinations. By storing the data and experimenting with various combinations in subsequent processes, the goal is to achieve the best accuracy for each of the models. This iterative approach allows for fine-tuning the preprocessing steps and selecting the most effective ones that lead to improved model performance.