# Preprocessing the Spanish-Language Data

#### In this notebook we preprocess the Spanish data: we apply a series of techniques to remove ads, noise, and any information that is not needed for our analysis. We clean the previously extracted data and store it in different files and structures to meet the requirements of the models applied later.

## Import packages

In [61]:
import pandas as pd
import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Make the parent directory importable so that utils can be found
sys.path.append('../')
from utils import remove_similar_rows, find_lines_with_player, remove_similar_rows_per_player, map_emoji_to_description, del_patterns, extract_sentence, name_wordgroups
import nltk
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import remove_stopwords, strip_numeric, strip_punctuation, strip_multiple_whitespaces, strip_short
from unidecode import unidecode




## Load Data

We load the previously pulled data from our CSV file and keep only the Spanish-language rows.

In [62]:
url = '../data_files/all_data_v3.csv'
df = pd.read_csv(url)

In [63]:
# Keep only the Spanish rows
df_es = df[df["language"] == "es"]

# Reset index
df_es = df_es.reset_index(drop=True)
df_es

Unnamed: 0,data,player,language,publishedAt
0,"{'content': ""DIRECTO\nMercado de fichajes de f...",Exequiel Palacios,es,2023-01-29T18:25:03Z
1,{'content': 'Con el primer mes del 2023 a poco...,Exequiel Palacios,es,2023-01-30T16:52:46Z
2,{'content': 'Deportes\nGustavo Puerta ya no ju...,Exequiel Palacios,es,2023-01-31T20:41:38Z
3,{'content': 'Dólar blue\nAlberto Fernández\nMa...,Exequiel Palacios,es,2023-02-09T18:32:38Z
4,{'content': 'Dólar blue\nAlberto Fernández\nMa...,Exequiel Palacios,es,2023-02-12T21:13:55Z
...,...,...,...,...
270,"{'content': ""Juventus rescató un empate 1-1 fr...",Exequiel Palacios,es,2023-05-11T19:05:33Z
271,"{'content': ""Juventus rescató un empate 1-1 fr...",Exequiel Palacios,es,2023-05-11T18:26:06Z
272,{'content': 'Este jueves se abren las series d...,Exequiel Palacios,es,2023-05-11T13:44:18Z
273,"{'content': 'Con 10 futbolistas argentinos, la...",Exequiel Palacios,es,2023-05-10T16:03:25Z


## Remove similar rows

We use a custom function from `utils` to remove duplicates and near-duplicate rows that were mistakenly stored more than once.

In [64]:
# Remove the similar rows (the function is imported from utils at the top)
df_es = remove_similar_rows_per_player(df_es, df_es['player'].unique())
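
The implementation of `remove_similar_rows_per_player` lives in `utils` and is not shown in this notebook. As a rough illustration only, near-duplicate removal within one player's texts could look like the sketch below. It uses `difflib.SequenceMatcher` from the standard library; the function name `drop_near_duplicates`, the 0.9 threshold, and the sample texts are assumptions, not the project's actual code.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text kept before it.

    Hypothetical helper: the name and the threshold are illustrative.
    """
    kept = []
    for text in texts:
        is_duplicate = any(
            SequenceMatcher(None, text, other).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


# Made-up sample texts: the first two differ by a single character
articles = [
    "juventus rescato un empate 1-1 frente a sevilla",
    "juventus rescato un empate 1-1 frente al sevilla",
    "ecuador jugara dos amistosos en junio",
]
unique_articles = drop_near_duplicates(articles)
print(len(unique_articles))
```

Comparing every text with every kept text is quadratic, which is acceptable for a few hundred articles per player but would need a different approach for much larger corpora.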

## Transform data into lower case

We convert the `data` and `player` columns to lower case.

In [65]:
# Create a copy
data_lower_es = df_es.copy()

# Transform data into lower case
data_lower_es['data'] = data_lower_es['data'].str.lower()
data_lower_es['player'] = data_lower_es['player'].str.lower()
data_lower_es.head()

Unnamed: 0,data,player,language,publishedAt
0,"{'content': ""directo\nmercado de fichajes de f...",exequiel palacios,es,2023-01-29T18:25:03Z
1,{'content': 'con el primer mes del 2023 a poco...,exequiel palacios,es,2023-01-30T16:52:46Z
2,{'content': 'deportes\ngustavo puerta ya no ju...,exequiel palacios,es,2023-01-31T20:41:38Z
3,{'content': 'dólar blue\nalberto fernández\nma...,exequiel palacios,es,2023-02-09T18:32:38Z
4,{'content': 'dólar blue\nalberto fernández\nma...,exequiel palacios,es,2023-02-12T21:13:55Z


## Delete Patterns

Because the texts contain a large amount of noise and irrelevant information, we define patterns to be removed from all texts. These patterns are specific to the Spanish texts.

In [66]:
# Define patterns to delete
patternlist = [
    'content',
    'directo',
    'espacio publicitario',
    'copyright',
    'foto:',
    'todos los derechos reservados',
    'derechos reservados',
    'suscribete',
    'queda prohibida la reproducción',
    'parcial, por',
    'cualquier medio, de todos los contenidos sin autorización expresa de grupo el comercio',
    'pic.twitter.com'
]

In [67]:
# Create a copy
data_wo_pattern = data_lower_es.copy()

# Delete patterns
data_wo_pattern['data'] = data_wo_pattern['data'].apply(lambda x: del_patterns(str(x), patternlist))
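
`del_patterns` is imported from `utils`, so its exact behaviour is not visible here. A minimal sketch of how such a helper might work, assuming the patterns are plain substrings rather than regular expressions, is given below. The name `delete_patterns` and the sample text are hypothetical.

```python
import re


def delete_patterns(text, patterns):
    """Remove every literal pattern from the text (hypothetical stand-in for del_patterns)."""
    # Longest patterns first, so that a short pattern cannot break up a longer one
    ordered = sorted(patterns, key=len, reverse=True)
    # re.escape ensures characters such as '.' are matched literally
    regex = re.compile("|".join(re.escape(p) for p in ordered))
    return regex.sub("", text)


sample = "foto: efe. todos los derechos reservados, pic.twitter.com/abc"
cleaned = delete_patterns(
    sample,
    ["derechos reservados", "todos los derechos reservados", "foto:", "pic.twitter.com"],
)
print(cleaned)
```

Sorting by length matters for overlapping entries such as 'derechos reservados' and 'todos los derechos reservados': removing the shorter one first would leave 'todos los' behind.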

In [68]:
# Replace the literal escape sequence '\xa0' (non-breaking space) with a whitespace in the 'data' column
data_wo_pattern['data'] = data_wo_pattern['data'].str.replace(r'\\xa0', ' ', regex=True)

## Transform Emojis to text

The texts contain emojis, so we translate them into text descriptions.

In [69]:
# Create a copy
data_wo_emojis = data_wo_pattern.copy()

# Use map_emoji_to_description to translate emojis into text.
# The regex hands every non-word, non-whitespace character to the helper.
data_wo_emojis['data'] = data_wo_emojis['data'].apply(lambda x: re.sub(r'[^\w\s]', lambda match: map_emoji_to_description(match.group(), language='es'), str(x)))
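
The callback form of `re.sub` used above can be illustrated with a small stand-in for `map_emoji_to_description`, whose real implementation is in `utils` and not shown. In the sketch below the lookup table, the function name `describe_emoji`, and the sample sentence are all assumptions.

```python
import re

# Hypothetical lookup table; the real helper may rely on a full emoji database
EMOJI_DESCRIPTIONS_ES = {
    "\u26bd": "balon de futbol",   # soccer ball
    "\U0001F525": "fuego",         # fire
}


def describe_emoji(char):
    """Return a padded description for known emojis and the character itself otherwise."""
    if char in EMOJI_DESCRIPTIONS_ES:
        return " " + EMOJI_DESCRIPTIONS_ES[char] + " "
    return char


text = "golazo de palacios \u26bd\U0001F525, que partido!"
# [^\w\s] matches one character at a time: emojis, but also punctuation
converted = re.sub(r"[^\w\s]", lambda m: describe_emoji(m.group()), text)
print(converted)
```

Because the pattern also matches punctuation, the helper has to return unknown characters unchanged, otherwise commas and full stops would be lost at this step.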


## Remove noise

"Noise" refers to non-meaningful data such as numbers, links, and redundant whitespace. For the Spanish data we also strip accents.

In [70]:
# Create a copy
data_rm_es = data_wo_emojis.copy()

# Strip numbers
data_rm_es['data'] = data_rm_es['data'].apply(strip_numeric)

# Strip links
data_rm_es['data'] = data_rm_es['data'].apply(lambda x: re.sub(r'http\S+', '', str(x)))

# Collapse repeated whitespace into a single space (this also covers \n)
data_rm_es['data'] = data_rm_es['data'].apply(strip_multiple_whitespaces)

# Helper that transliterates accented characters to plain ASCII
def remove_accents(text):
    return unidecode(text)

# Strip Spanish accents from the texts and the player names
data_rm_es['data'] = data_rm_es['data'].apply(lambda x: remove_accents(str(x)))
data_rm_es['player'] = data_rm_es['player'].apply(lambda x: remove_accents(str(x)))

# Create a copy
data_es_clean1 = data_rm_es.copy()

# Reset index
data_es_clean1.reset_index(drop=True, inplace=True)
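
To make the effect of these steps concrete, the sketch below runs them on a single made-up sentence. It uses only the standard library: the regular expressions approximate `strip_numeric` and `strip_multiple_whitespaces`, and `strip_accents` is a rough stand-in for `unidecode`. The sample sentence and its URL are invented for illustration.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (rough stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = "dólar blue   subió 3 % https://example.com/nota el 11 de mayo"

step = re.sub(r"[0-9]+", "", sample)   # remove numbers
step = re.sub(r"http\S+", "", step)    # remove links
step = re.sub(r"\s+", " ", step)       # collapse repeated whitespace
step = strip_accents(step)             # remove accents
print(step)
```

Unlike `unidecode`, this stand-in only handles characters that decompose into a base letter plus combining marks, which covers the Spanish accents and the ñ but not arbitrary Unicode.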


## Save data clean 1

This version of the data is stored for later use; it is reloaded in the sentiment preprocessing at the end of this notebook.

In [71]:
# Store data 
#data_es_clean1.to_csv('../data_files/data_clean/es_clean_1.csv', index=False)

# ------------------------

# Preprocess for data clean 2

# ------------------------

## Remove short words

In [72]:
# Create a copy
data_wo_short = data_es_clean1.copy()

# Remove short words (by default strip_short drops tokens shorter than 3 characters)
data_wo_short['data'] = data_wo_short['data'].apply(strip_short)

## Remove stopwords

In [73]:
# Create a copy (data_es_sw: data with stop words removed)
data_es_sw = data_wo_short.copy()

# Load stop words (uncomment the download on first use)
#nltk.download('stopwords')
spanish_stopwords = stopwords.words('spanish')



In [74]:
# Stop words that should stay in the texts (negations)
stop_words_to_keep = ['ni', 'no', 'sin']

# Remove them from the stop word list so that they are not filtered out
# Note: strip_short above already dropped two-letter tokens such as 'ni' and 'no'
for word in stop_words_to_keep:
    spanish_stopwords.remove(word)

# Define a function that removes the Spanish stop words from a text
def remove_stopwords_from_text(text):
    return remove_stopwords(text, stopwords=spanish_stopwords)

# Apply the function to the 'data' column
data_es_sw['data'] = data_es_sw['data'].apply(remove_stopwords_from_text)
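
One detail that may be worth checking: the accents were stripped from the texts earlier, while the NLTK Spanish stop word list contains accented entries such as 'también' and 'más', which can no longer match the unaccented tokens. The sketch below shows one possible way to align the two, using a small illustrative subset of stop words and a standard-library stand-in for `unidecode`. It is a suggestion under these assumptions, not part of the original pipeline.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (rough stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Small illustrative subset, not the full NLTK list
sample_stopwords = ["también", "más", "el", "de", "no", "sin"]
keep_words = {"ni", "no", "sin"}

# Strip accents from the stop words and drop the negations we want to keep
normalized_stopwords = {strip_accents(w) for w in sample_stopwords} - keep_words

text = "el equipo tambien jugo sin mas cambios"
filtered = " ".join(w for w in text.split() if w not in normalized_stopwords)
print(filtered)
```

With the accented list, 'tambien' and 'mas' would survive the filter; after normalization they are removed while 'sin' is kept.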



## Make a copy for data clean

We keep a copy of the data at this stage because the condensed dataset built later still needs the punctuation.

In [75]:
data_es_clean3 = data_es_sw.copy()

## Remove Punctuation

In [76]:
# Create a copy
data_wo_pun = data_es_sw.copy()

# Remove punctuation
data_wo_pun['data'] = data_wo_pun['data'].apply(strip_punctuation)

## Save data clean 2

In [77]:
# Create a copy
data_es_clean2 = data_wo_pun.copy()

In [78]:
# Store data 
#data_es_clean2.to_csv('../data_files/data_clean/es_clean_2.csv', index=False)

# ------------------------

# Data condensed
The third transformation focuses on deleting sentences to clean the corpus. Only the paragraphs that include the player names are kept.

# ------------------------

## Keep only paragraph

We keep the lines in which the player name appears, together with the line that follows each of them (`n_lines = 1`).

In [79]:
# Create a copy
data_es_pp = data_es_clean3.copy()

# Select only the paragraphs that include player names
data_es_pp = find_lines_with_player(data_es_pp, data_es_pp['player'].unique(), n_lines = 1)
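
`find_lines_with_player` comes from `utils` and is not shown here. A minimal sketch of the idea for a single text is given below, assuming that lines are separated by `\n` and that a line counts as a hit when it contains any part of the player's name. Both assumptions, the function name `lines_with_player`, and the sample article are hypothetical.

```python
def lines_with_player(text, player, n_lines=1):
    """Keep each line mentioning the player plus the next n_lines lines.

    Hypothetical stand-in for find_lines_with_player; matches any part of the name.
    """
    lines = text.split("\n")
    name_parts = player.split()
    keep = set()
    for i, line in enumerate(lines):
        if any(part in line for part in name_parts):
            # Mark the matching line and the following n_lines lines
            keep.update(range(i, min(i + n_lines + 1, len(lines))))
    return "\n".join(lines[i] for i in sorted(keep))


article = "dolar blue\npalacios firmo su renovacion\njugara hasta 2028\nclima de hoy"
selected = lines_with_player(article, "exequiel palacios", n_lines=1)
print(selected)
```

Collecting the indices in a set keeps overlapping windows from duplicating lines when the player is mentioned in consecutive lines.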

## Remove empty rows

In [80]:
# Remove rows that are empty
data_es_er = data_es_pp.replace('', pd.NA)
data_es_er.dropna(inplace=True)
data_es_er

Unnamed: 0,data,player,language,publishedAt
0,adeyemi firmo primer gol bundesliga javier alf...,exequiel palacios,es,2023-01-29T18:25:03Z
1,ultima semana marzo primera ventana partidos i...,exequiel palacios,es,2023-01-30T16:52:46Z
2,gustavo puerta jugara bayer leverkusen enterat...,exequiel palacios,es,2023-01-31T20:41:38Z
3,alberto fernandez mauricio macri indec harvard...,exequiel palacios,es,2023-02-09T18:32:38Z
4,alberto fernandez mauricio macri indec harvard...,exequiel palacios,es,2023-02-12T21:13:55Z
...,...,...,...,...
73,"minuto, visitante habia llegado arco romano ge...",piero hincapie,es,2023-05-11T21:13:48Z
74,"bayer leverkusen jugo visita roma, partido ida...",piero hincapie,es,2023-05-11T20:56:21Z
75,"ecuatoriano volvera semifinales torneo uefa, a...",piero hincapie,es,2023-05-11T18:30:25Z
76,seleccion ecuador jugara dos ultimos amistosos...,piero hincapie,es,2023-05-10T23:37:55Z


## Delete player names from their sentences

Because the player names had a strong influence on the clustering, we remove each player's name from that player's texts.

In [81]:
# Create a copy
data_es_pn = data_es_er.copy()

# For every player, remove their first and last name from their texts
for player in data_es_pn['player'].unique():
    name_parts = player.split()

    # First and last token of the name, escaped for use in a regex
    first_name = re.escape(name_parts[0])
    last_name = re.escape(name_parts[-1])

    # Match either name as a whole word
    name_pattern = r"\b(" + first_name + r"|" + last_name + r")\b"

    # Apply the substitution only to the rows of this player
    mask = data_es_pn['player'] == player
    data_es_pn.loc[mask, 'data'] = data_es_pn.loc[mask, 'data'].apply(lambda x: re.sub(name_pattern, "", str(x)))
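
A small worked example of the name-removal pattern on a made-up sentence is shown below. `re.escape` is added here as a precaution for names that contain regex metacharacters; the sentence itself is invented.

```python
import re

player = "piero hincapie"
first_name, last_name = player.split()[0], player.split()[-1]

# \b restricts the match to whole words, so 'pieroni' would be left untouched
name_pattern = r"\b(" + re.escape(first_name) + r"|" + re.escape(last_name) + r")\b"

sentence = "hincapie volvera a semifinales, piero fue titular"
stripped = re.sub(name_pattern, "", sentence)
print(stripped)
```

The removal leaves extra whitespace behind, which is why the whitespace is collapsed again in the next step.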

## Remove punctuation

In [82]:
# Remove punctuation
data_es_pn['data'] = data_es_pn['data'].apply(strip_punctuation)

# Remove excessive whitespace
data_es_pn['data'] = data_es_pn['data'].apply(strip_multiple_whitespaces)

## Save data condensed

In [83]:
# Create a copy
data_es_condense = data_es_pn.copy()

In [84]:
# Store data 
#data_es_condense.to_csv('../data_files/data_clean/es_clean_condensed.csv', index=False)

# Additional Sentiment Preprocessing
For the sentiment analysis we create one additional dataset that keeps only the sentences of the clean dataset in which a player name appears.

In [85]:
# Load the clean dataset (es_clean_1.csv must have been written by the "Save data clean 1" cell)
df_es_sentence = pd.read_csv('../data_files/data_clean/es_clean_1.csv')

In [86]:
# Keep paragraph where the player name is found
df_es_sentence = find_lines_with_player(df_es_sentence, df_es_sentence['player'].unique(),n_lines=1)

In [87]:
# Extract sentence
df_es_sentence = extract_sentence(df_es_sentence)
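
`extract_sentence` is another `utils` helper whose code is not part of this notebook. The sketch below shows one way the idea could work for a single paragraph, assuming that sentences end in '.', '!' or '?' and that a sentence qualifies when it contains any part of the player's name. The function name `sentences_with_player` and the sample paragraph are hypothetical.

```python
import re


def sentences_with_player(text, player):
    """Return the sentences that mention any part of the player's name.

    Hypothetical stand-in for extract_sentence; the sentence splitting is a simplification.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    name_parts = player.split()
    hits = [s for s in sentences if any(part in s for part in name_parts)]
    return " ".join(hits)


paragraph = (
    "leverkusen jugo en roma. "
    "palacios entro en el segundo tiempo. "
    "el partido termino sin goles."
)
sentence = sentences_with_player(paragraph, "exequiel palacios")
print(sentence)
```

This only works on data that still contains punctuation, which is consistent with starting from `es_clean_1.csv` rather than from the punctuation-free `es_clean_2.csv`.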

In [88]:
# Copy the sentence column into data and remove sentence
df_es_sentence['data'] = df_es_sentence['sentence']
df_es_sentence.drop('sentence', axis=1, inplace=True)

In [89]:
# Remove similar sentences
df_es_sentence = remove_similar_rows_per_player(df_es_sentence, df_es_sentence['player'].unique())

# apply wordgroups
df_es_sentence = name_wordgroups(df_es_sentence)

# delete empty rows
df_es_sentence = df_es_sentence.replace('', pd.NA)
df_es_sentence.dropna(inplace=True)

In [90]:
# Store data 
#df_es_sentence.to_csv('../data_files/data_clean/es_clean_1_sen.csv', index=False)

# Notebook Output

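This notebook creates the following CSV files in `../data_files/data_clean/` (the `to_csv` calls above are commented out and must be enabled to write them):

1. es_clean_1.csv
2. es_clean_2.csv
3. es_clean_condensed.csv
4. es_clean_1_sen.csv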

The objective of these files is to have the data cleaned and saved at different levels of detail. They allow us to use the same data for different processes with various requirements.

# Next steps for Bayer04

To further improve the data processing, we recommend that Bayer04 try different preprocessing combinations. Storing the intermediate datasets makes it possible to experiment with these combinations in the subsequent processes and to look for the best accuracy for each model. This iterative approach allows the preprocessing steps to be fine-tuned and the most effective ones to be selected.