# Data Cleaning and Pre-processing

In [5]:
import pandas as pd
import re
import string 

## Dataset

In [6]:
# Reading the data 
dataset_csv = "ICMLA_2014_2015_2016_2017.csv"
encoding = "ISO-8859-1"
data_df = pd.read_csv(dataset_csv, encoding=encoding).set_index("paper_id")
data_df.head()

paper_id,title,keywords,abstract,session,year
1,Ensemble Statistical and Heuristic Models for ...,"statistical word alignment, ensemble learning,...",Statistical word alignment models need large a...,Ensemble Methods,2014
2,Improving Spectral Learning by Using Multiple ...,"representation, spectral learning, discrete fo...",Spectral learning algorithms learn an unknown ...,Ensemble Methods,2014
3,Applying Swarm Ensemble Clustering Technique f...,"software defect prediction, particle swarm opt...",Number of defects remaining in a system provid...,Ensemble Methods,2014
4,Reducing the Effects of Detrimental Instances,"filtering, label noise, instance weighting",Not all instances in a data set are equally be...,Ensemble Methods,2014
5,Concept Drift Awareness in Twitter Streams,"twitter, adaptation models, time-frequency ana...",Learning in non-stationary environments is not...,Ensemble Methods,2014


## Different Preprocessing Techniques

In [7]:
data_df["text"] = data_df["title"] + " " + data_df["abstract"]
data_df["text"].values[0] 

'Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.'

### Lower Case

In [8]:
data_df["text"] = data_df["text"].str.lower()
data_df["text"].values[0] 

'ensemble statistical and heuristic models for unsupervised word alignment statistical word alignment models need large amount of training data while they are weak in small-size corpora. this paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. this algorithm uses three base alignment models in several rounds to generate alignments. the ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. the underlying alignment algorithms used in this study include ibm model 1, 2 and a heuristic method based on dice measurement. our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.'

### Detect and Remove URLs

In [37]:
# Detect URLs in a sample string, then strip them
text = "login to www.example1.com and www.example2.com"
pattern = re.compile(r"https?://\S+|www\.\S+")

urls = re.findall(pattern, text)
print(urls)

cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['www.example1.com', 'www.example2.com']
login to  and 


In [38]:
pattern = re.compile(r"https?://\S+|www\.\S+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)

### Detect and Remove Email Addresses

In [39]:
text = "you can contact us at info@example.com "
pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
email_ids = re.findall(pattern, text)
print(email_ids)
cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['info@example.com']
you can contact us at  


In [40]:
pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)

### Remove Dates

In [43]:
text = "Today's date in different formats is 25-01-2023 25/01/2023 25.01.2023"
pattern = re.compile(r"\d+[\.\/-]\d+[\.\/-]\d+")
dates = re.findall(pattern, text)
print(dates)
cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['25-01-2023', '25/01/2023', '25.01.2023']
Today's date in different formats is   


In [44]:
pattern = re.compile(r"\d+[\.\/-]\d+[\.\/-]\d+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)
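
The date pattern above accepts any three digit runs joined by `.`, `/` or `-`, so it would also remove strings such as a version number `1.2.3`. A minimal sketch of a stricter variant follows; the name `STRICT_DATE` and the digit-count limits are assumptions for illustration, not part of the original notebook.

```python
import re

# Assumed stricter pattern: 1-2 digits, a separator, 1-2 digits,
# the SAME separator again (backreference \1), then a 2-4 digit year.
STRICT_DATE = re.compile(r"\b\d{1,2}([./-])\d{1,2}\1\d{2,4}\b")


def find_dates(text):
    # finditer + group(0) returns the whole match; findall would
    # return only the captured separator.
    return [m.group(0) for m in STRICT_DATE.finditer(text)]


def remove_dates(text):
    return STRICT_DATE.sub("", text)


sample = "dated 25-01-2023 25/01/2023 25.01.2023, tested on version 1.2.3"
print(find_dates(sample))
print(remove_dates(sample))
```

This still cannot tell a real calendar date from an arbitrary number triple such as `12.34.56`; it only narrows the match.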

### Expand Contractions

In [10]:
# !pip install contractions
import contractions
contractions.fix("There aren't many contractions they've used")

'There are not many contractions they have used'

In [11]:
data_df["text"] = data_df["text"].apply(contractions.fix)
data_df["text"].values[0]

'ensemble statistical and heuristic models for unsupervised word alignment statistical word alignment models need large amount of training data while they are weak in small-size corpora. this paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. this algorithm uses three base alignment models in several rounds to generate alignments. the ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. the underlying alignment algorithms used in this study include ibm model 1, 2 and a heuristic method based on dice measurement. our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.'

### Remove Stopwords

In [19]:
with open("stopwords-en.txt", encoding="utf-8") as sw:
    # One stopword per line; a set makes the membership test below O(1)
    STOPWORDS_EN = {line.strip() for line in sw if line.strip()}

In [18]:
data_df["text"] = data_df["text"].apply(
    lambda text: " ".join(word for word in str(text).split() if word not in STOPWORDS_EN)
)
data_df["text"].values[0]

'ensemble statistical heuristic models unsupervised word alignment statistical word alignment models training data weak small-size corpora. paper proposes approach unsupervised hybrid word alignment technique ensemble learning method. algorithm base alignment models rounds generate alignments. ensemble algorithm weighed scheme resampling training data voting score aggregated alignments. underlying alignment algorithms study include ibm model 1, 2 heuristic method based dice measurement. experimental approach, alignment error rate improved %15 base alignment models.'
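
Because stopwords are removed before punctuation, a token with punctuation attached (for example `the,`) does not equal any entry in the stopword list and is kept. A possible workaround, sketched below, compares each token with its surrounding punctuation stripped; `remove_stopwords` and the two-word stopword set are hypothetical and used only for illustration.

```python
import string


def remove_stopwords(text, stopwords):
    kept = []
    for token in text.split():
        # Compare the token without leading/trailing punctuation,
        # but keep the original token when it is not a stopword.
        core = token.strip(string.punctuation)
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)


print(remove_stopwords("however, the model works.", {"however", "the"}))
```

An alternative with the same effect would be to run the punctuation step before the stopword step.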

### Remove Punctuation

In [24]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub("[%s]" % re.escape(string.punctuation), "", text)
)
data_df["text"].values[0]

'ensemble statistical heuristic models unsupervised word alignment statistical word alignment models training data weak smallsize corpora paper proposes approach unsupervised hybrid word alignment technique ensemble learning method algorithm base alignment models rounds generate alignments ensemble algorithm weighed scheme resampling training data voting score aggregated alignments underlying alignment algorithms study include ibm model 1 2 heuristic method based dice measurement experimental approach alignment error rate improved 15 base alignment models'
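
Note that deleting punctuation outright fuses hyphenated words: `small-size` became `smallsize` in the output above. If the two parts should stay separate tokens, one option is to replace punctuation with a space instead. The helper below is a sketch under that assumption; `strip_punctuation` is a hypothetical name.

```python
import string

# Map every punctuation character to a single space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(text):
    # translate() swaps punctuation for spaces; split()/join() then
    # collapses the resulting runs of whitespace.
    return " ".join(text.translate(PUNCT_TO_SPACE).split())


print(strip_punctuation("weak in small-size corpora."))
```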

### Remove Digits

In [27]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(r"\d", "", text)
)
data_df["text"].values[0]

'ensemble statistical heuristic models unsupervised word alignment statistical word alignment models training data weak smallsize corpora paper proposes approach unsupervised hybrid word alignment technique ensemble learning method algorithm base alignment models rounds generate alignments ensemble algorithm weighed scheme resampling training data voting score aggregated alignments underlying alignment algorithms study include ibm model   heuristic method based dice measurement experimental approach alignment error rate improved  base alignment models'

### Remove Extra Spaces

In [29]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(" +", " ", text).strip()
)
data_df["text"].values[0]

'ensemble statistical heuristic models unsupervised word alignment statistical word alignment models training data weak smallsize corpora paper proposes approach unsupervised hybrid word alignment technique ensemble learning method algorithm base alignment models rounds generate alignments ensemble algorithm weighed scheme resampling training data voting score aggregated alignments underlying alignment algorithms study include ibm model heuristic method based dice measurement experimental approach alignment error rate improved base alignment models'
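
The steps above could be collected into a single function so that the same cleaning is applied consistently to new text. The sketch below reuses the notebook's regular expressions in the same order; `clean_text` is a hypothetical name, the contraction step is left out to keep the example free of third-party imports, and the small stopword set in the usage line is made up for illustration.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
DATE_RE = re.compile(r"\d+[\.\/-]\d+[\.\/-]\d+")
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
DIGIT_RE = re.compile(r"\d")
SPACE_RE = re.compile(r" +")


def clean_text(text, stopwords=frozenset()):
    """Apply the cleaning steps in notebook order (contractions omitted)."""
    text = text.lower()
    # Remove URLs, email addresses and dates while punctuation is intact
    for pattern in (URL_RE, EMAIL_RE, DATE_RE):
        text = pattern.sub("", text)
    text = " ".join(w for w in text.split() if w not in stopwords)
    text = PUNCT_RE.sub("", text)
    text = DIGIT_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()


sample = "Visit www.example1.com or mail info@example.com by 25-01-2023 for the 3 results!"
print(clean_text(sample, stopwords={"or", "by", "for", "the"}))
```

With the DataFrame from this notebook, it would be applied as `data_df["text"].apply(clean_text, stopwords=STOPWORDS_EN)`.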