# Data Cleaning and Pre-processing

In [2]:
import pandas as pd
import re
import string 

## Dataset

In [3]:
# Reading the data 
dataset_csv = "ICMLA_2014_2015_2016_2017.csv"
encoding = "ISO-8859-1"
data_df = pd.read_csv(dataset_csv, encoding=encoding).set_index("paper_id")
data_df.head()

paper_id,title,keywords,abstract,session,year
1,Ensemble Statistical and Heuristic Models for ...,"statistical word alignment, ensemble learning,...",Statistical word alignment models need large a...,Ensemble Methods,2014
2,Improving Spectral Learning by Using Multiple ...,"representation, spectral learning, discrete fo...",Spectral learning algorithms learn an unknown ...,Ensemble Methods,2014
3,Applying Swarm Ensemble Clustering Technique f...,"software defect prediction, particle swarm opt...",Number of defects remaining in a system provid...,Ensemble Methods,2014
4,Reducing the Effects of Detrimental Instances,"filtering, label noise, instance weighting",Not all instances in a data set are equally be...,Ensemble Methods,2014
5,Concept Drift Awareness in Twitter Streams,"twitter, adaptation models, time-frequency ana...",Learning in non-stationary environments is not...,Ensemble Methods,2014


In [4]:
data_df.shape

(448, 5)

## Different Preprocessing Techniques

In [5]:
data_df["text"] = data_df["title"] + " " + data_df["abstract"]
data_df["text"].values[3] 

'Reducing the Effects of Detrimental Instances Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning (RDIL). We examine RIDL on a set of 54 data sets and 5 learning algorithms and compare RIDL with other weighting and filtering approaches. RDIL is especially useful for learning algorithms where every instance can affect the cl

### Abbreviation Resolution

In [6]:
# Modified from https://stackoverflow.com/questions/63631451/how-to-match-abbreviations-with-their-meaning-with-regex
import re
def contains_abbrev(abbrev, text):
    text = text.lower()
    if not abbrev.isupper():
        return False
    cnt = 0
    for c in abbrev.lower():
        if text.find(c) > -1:
            text = text[text.find(c):]
            cnt += 1
    return cnt == len(abbrev)

# text= "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
text = data_df["text"].values[3]
pattern = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
abbreviations = {x.group(3):x.group(1) for x in re.finditer(pattern, text, re.I) if contains_abbrev(x.group(3), x.group(1))}
print(abbreviations)
abbreviations.keys()

{'RDIL': 'reduced detrimental instance learning'}


dict_keys(['RDIL'])

We need to expand the abbreviations in each document using the definitions found in that same document. The first occurrence of an abbreviation, the parenthesised one that accompanies its definition, is removed from the text, and every other occurrence is replaced with the expanded form (the definition).

In [7]:
text_step1 = [word for word in text.split() if not (word.startswith("(") and word.strip("().,:;?!/") in abbreviations)]
text_step2 = [abbreviations[word] if word in abbreviations.keys() else word for word in text_step1]
print(" ".join(text_step1))
print("================================================")
print(" ".join(text_step2))

Reducing the Effects of Detrimental Instances Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning We examine RIDL on a set of 54 data sets and 5 learning algorithms and compare RIDL with other weighting and filtering approaches. RDIL is especially useful for learning algorithms where every instance can affect the classificat

**OBSERVATIONS**: The text above contains abbreviation cases that are very difficult to handle. The abbreviation is defined as RDIL, but the author has written RIDL twice, so only the one remaining instance of RDIL is expanded by the preprocessing steps above. Another case is "multilayer perceptrons trained with backpropagation (MLPs)". Here the abbreviation does not correspond exactly to the definition because the phrase contains extra words. Also, MLPs ends with a lowercase character.
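
One possible way to surface transposed variants such as RIDL is to look for uppercase tokens that use the same letters as a defined abbreviation in a different order. The following is a minimal sketch and not part of the original pipeline: `find_scrambled_variants` is a hypothetical helper name, and the sample sentence is shortened from the abstract above. It would have to run before the text is lowercased.

```python
import re

def find_scrambled_variants(text, abbreviations):
    """Map uppercase tokens to the defined abbreviation whose letters they reorder."""
    variants = {}
    # Consider every distinct all-uppercase token of two or more letters
    for token in set(re.findall(r"\b[A-Z]{2,}\b", text)):
        for abbrev in abbreviations:
            # Same multiset of letters, different order
            if token != abbrev and sorted(token) == sorted(abbrev):
                variants[token] = abbrev
    return variants

sample = (
    "We call our method reduced detrimental instance learning (RDIL). "
    "We examine RIDL on a set of 54 data sets."
)
definitions = {"RDIL": "reduced detrimental instance learning"}
print(find_scrambled_variants(sample, definitions))  # {'RIDL': 'RDIL'}
```

The detected variants could then be reviewed by hand or added to the abbreviation dictionary before expansion. Unrelated abbreviations that happen to share the same letters would also be flagged, so the result is a candidate list rather than a safe automatic fix.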

In [8]:
def replace_abbreviations(text):
    pattern = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
    abbreviations = {x.group(3):x.group(1) for x in re.finditer(pattern, text, re.I) if contains_abbrev(x.group(3), x.group(1))}
    text_step1 = [word for word in text.split() if not (word.startswith("(") and word.strip("().,:;?!/") in abbreviations)]
    text_step2 = [abbreviations[word] if word in abbreviations.keys() else word for word in text_step1]
    return " ".join(text_step2)
    
data_df["text"] = data_df["text"].apply(replace_abbreviations)
data_df["text"].values[3]

'Reducing the Effects of Detrimental Instances Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning We examine RIDL on a set of 54 data sets and 5 learning algorithms and compare RIDL with other weighting and filtering approaches. reduced detrimental instance learning is especially useful for learning algorithms where every i

### Lower Case

In [9]:
data_df["text"] = data_df["text"].str.lower()
data_df["text"].values[3] 

'reducing the effects of detrimental instances not all instances in a data set are equally beneficial for inducing a model of the data. some instances (such as outliers or noise) can be detrimental. however, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. in this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. we call our method of identifying and weighting detrimental instances reduced detrimental instance learning we examine ridl on a set of 54 data sets and 5 learning algorithms and compare ridl with other weighting and filtering approaches. reduced detrimental instance learning is especially useful for learning algorithms where every i

### Detect and Remove URLs

In [10]:
#url
text = "login to www.example1.com and www.example2.com"
pattern = re.compile(r"https?://\S+|www\.\S+")

urls = re.findall(pattern, text)
print(urls)

cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['www.example1.com', 'www.example2.com']
login to  and 


In [11]:
pattern = re.compile(r"https?://\S+|www\.\S+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)

### Detect and Remove Email ID

In [12]:
text = "you can contact us at info@example.com "
pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
email_ids = re.findall(pattern, text)
print(email_ids)
cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['info@example.com']
you can contact us at  


In [13]:
pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)

### Remove Dates

In [14]:
text = "Today's date in different formats is 25-01-2023 25/01/2023 25.01.2023"
pattern = re.compile(r"\d+[\.\/-]\d+[\.\/-]\d+")
dates = re.findall(pattern, text)
print(dates)
cleaned_text = re.sub(pattern, "", text)
print(cleaned_text)

['25-01-2023', '25/01/2023', '25.01.2023']
Today's date in different formats is   


In [15]:
pattern = re.compile(r"\d+[\.\/-]\d+[\.\/-]\d+")
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(pattern, "", text)
)

### Expand Contractions

In [16]:
# !pip install contractions
import contractions
contractions.fix("There aren't many contractions they've used")

'There are not many contractions they have used'

In [17]:
data_df["text"] = data_df["text"].apply(contractions.fix)
data_df["text"].values[3]

'reducing the effects of detrimental instances not all instances in a data set are equally beneficial for inducing a model of the data. some instances (such as outliers or noise) can be detrimental. however, at least initially, the instances in a data set are generally considered equally in machine learning algorithms. many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. in this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. we call our method of identifying and weighting detrimental instances reduced detrimental instance learning we examine ridl on a set of 54 data sets and 5 learning algorithms and compare ridl with other weighting and filtering approaches. reduced detrimental instance learning is especially useful for learning algorithms where every i

Many general-purpose abbreviations and slang terms are also expanded by the `contractions.fix` function, for example gimme: give me and sux: sucks. These expansions are unlikely to be relevant for research articles and matter more in applications that process informal human communication. The dictionary used by the contractions library can be obtained from https://github.com/kootenpv/contractions/tree/master/contractions/data
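
If slang expansion is unwanted, one alternative is to expand only a hand-picked set of standard contractions. The sketch below does not use the contractions library at all: `STANDARD_CONTRACTIONS` is a hypothetical three-entry dictionary and `expand_standard_contractions` is an illustrative helper name, so it only shows the idea of a restricted dictionary.

```python
import re

# Hypothetical, deliberately small dictionary of standard contractions
STANDARD_CONTRACTIONS = {
    "aren't": "are not",
    "they've": "they have",
    "don't": "do not",
}

# One alternation over all keys, matched as whole words, case-insensitively
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in STANDARD_CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_standard_contractions(text):
    """Expand only the contractions listed in STANDARD_CONTRACTIONS."""
    return _CONTRACTION_RE.sub(
        lambda match: STANDARD_CONTRACTIONS[match.group(0).lower()], text
    )

print(expand_standard_contractions("there aren't many contractions they've used, gimme more"))
# there are not many contractions they have used, gimme more
```

Because the text has already been lowercased at this point in the notebook, the replacement values are lowercase as well.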

### Remove Stopwords

In [18]:
with open("stopwords-en.txt", encoding="utf-8") as sw:
    # A set gives constant-time membership checks when filtering words
    STOPWORDS_EN = {line.rstrip("\n") for line in sw}

In [19]:
data_df["text"] = data_df["text"].apply(
    lambda text: " ".join([word for word in str(text).split() if word not in STOPWORDS_EN])
    )
data_df["text"].values[3]

'reducing effects detrimental instances instances data set equally beneficial inducing model data. instances (such outliers noise) detrimental. however, initially, instances data set considered equally machine learning algorithms. current approaches handling noisy detrimental instances binary decision instance detrimental not. paper, 1) extend paradigm weighting instances continuous scale 2) methodology measuring detrimental instance inducing model data. method identifying weighting detrimental instances reduced detrimental instance learning examine ridl set 54 data sets 5 learning algorithms compare ridl weighting filtering approaches. reduced detrimental instance learning learning algorithms instance affect classification boundary training instances considered individually, multilayer perceptrons trained backpropagation (mlps). accurate estimate instances detrimental positive impact handling them.'
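
In the output above, tokens such as `(such` and `not.` survive, presumably because the attached punctuation prevents an exact match against the stopword list. A possible variant compares each word with its surrounding punctuation stripped. This is a sketch under assumptions: `TOY_STOPWORDS` is a hypothetical subset standing in for the contents of `stopwords-en.txt`, and `remove_stopwords` is an illustrative helper name.

```python
import string

# Hypothetical subset standing in for the full stopword list
TOY_STOPWORDS = {"such", "as", "or", "not", "can", "be"}

def remove_stopwords(text, stopwords):
    """Drop words whose punctuation-stripped form is a stopword."""
    kept = []
    for word in text.split():
        # Compare without leading/trailing punctuation, e.g. "(such" -> "such"
        core = word.strip(string.punctuation)
        if core not in stopwords:
            kept.append(word)
    return " ".join(kept)

sentence = "some instances (such as outliers or noise) can be detrimental or not."
print(remove_stopwords(sentence, TOY_STOPWORDS))
# some instances outliers noise) detrimental
```

Dropping the whole token also drops the punctuation attached to it, which is harmless here because punctuation is removed in the next step anyway.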

### Remove Punctuation

In [20]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub('[%s]' % re.escape(string.punctuation), "",text)
)
data_df["text"].values[3]

'reducing effects detrimental instances instances data set equally beneficial inducing model data instances such outliers noise detrimental however initially instances data set considered equally machine learning algorithms current approaches handling noisy detrimental instances binary decision instance detrimental not paper 1 extend paradigm weighting instances continuous scale 2 methodology measuring detrimental instance inducing model data method identifying weighting detrimental instances reduced detrimental instance learning examine ridl set 54 data sets 5 learning algorithms compare ridl weighting filtering approaches reduced detrimental instance learning learning algorithms instance affect classification boundary training instances considered individually multilayer perceptrons trained backpropagation mlps accurate estimate instances detrimental positive impact handling them'

### Remove Digits

In [22]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(r'\d', '', text)
)
data_df["text"].values[3]

'reducing effects detrimental instances instances data set equally beneficial inducing model data instances such outliers noise detrimental however initially instances data set considered equally machine learning algorithms current approaches handling noisy detrimental instances binary decision instance detrimental not paper  extend paradigm weighting instances continuous scale  methodology measuring detrimental instance inducing model data method identifying weighting detrimental instances reduced detrimental instance learning examine ridl set  data sets  learning algorithms compare ridl weighting filtering approaches reduced detrimental instance learning learning algorithms instance affect classification boundary training instances considered individually multilayer perceptrons trained backpropagation mlps accurate estimate instances detrimental positive impact handling them'

### Remove Extra Spaces

In [23]:
data_df["text"] = data_df["text"].apply(
    lambda text: re.sub(" +", " ", text).strip()
)
data_df["text"].values[3]

'reducing effects detrimental instances instances data set equally beneficial inducing model data instances such outliers noise detrimental however initially instances data set considered equally machine learning algorithms current approaches handling noisy detrimental instances binary decision instance detrimental not paper extend paradigm weighting instances continuous scale methodology measuring detrimental instance inducing model data method identifying weighting detrimental instances reduced detrimental instance learning examine ridl set data sets learning algorithms compare ridl weighting filtering approaches reduced detrimental instance learning learning algorithms instance affect classification boundary training instances considered individually multilayer perceptrons trained backpropagation mlps accurate estimate instances detrimental positive impact handling them'

### Spelling Correction

In [24]:
# Spelling correction using TextBlob
# !pip install -U textblob
# !python -m textblob.download_corpora

In [25]:
from textblob import TextBlob

data_df["text"] = data_df["text"].apply(
    lambda text: str(TextBlob(text).correct())
)
data_df["text"].values[3]

'reducing effects detrimental instances instances data set equally beneficial inducing model data instances such outlines noise detrimental however initially instances data set considered equally machine learning algorithms current approaches handling noisy detrimental instances binary decision instance detrimental not paper extend paradise weighing instances continuous scale methodology measuring detrimental instance inducing model data method identifying weighing detrimental instances reduced detrimental instance learning examine ride set data sets learning algorithms compare ride weighing faltering approaches reduced detrimental instance learning learning algorithms instance affect classification boundary training instances considered individually multilayer perceptions trained backpropagation maps accurate estimate instances detrimental positive impact handling them'

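This option might not be feasible for large datasets: it took 38 minutes to run on this small dataset of 448 records. The output above also shows that it miscorrects domain terms, for example paradigm becomes paradise, perceptrons becomes perceptions, and mlps becomes maps.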
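
One way to reduce both the run time and the damage to domain terms (the output above turns ridl into ride) might be to correct each distinct word only once and to skip a list of protected terms. The sketch below is an assumption-laden illustration: `make_cached_corrector` is a hypothetical helper, and a toy dictionary lookup stands in for the real spell checker so that the example runs without TextBlob. Whether word-by-word correction gives exactly the same result as correcting the whole text has not been verified here.

```python
from functools import lru_cache

def make_cached_corrector(correct_word, protected=frozenset()):
    """Wrap a word-level corrector so each distinct word is corrected once."""
    @lru_cache(maxsize=None)
    def cached(word):
        # Protected domain terms are returned unchanged
        return word if word in protected else correct_word(word)

    def correct_text(text):
        return " ".join(cached(word) for word in text.split())

    return correct_text

# Toy stand-in for a real spell checker (hypothetical data)
TOY_CORRECTIONS = {"algoritm": "algorithm", "ridl": "ride"}
calls = []

def toy_correct(word):
    calls.append(word)  # record how often the slow corrector is invoked
    return TOY_CORRECTIONS.get(word, word)

correct_text = make_cached_corrector(toy_correct, protected={"ridl", "mlps"})
print(correct_text("ridl algoritm ridl algoritm"))  # ridl algorithm ridl algorithm
print(calls)  # ['algoritm']
```

In the notebook, `toy_correct` could be swapped for a function that wraps the TextBlob call, and the protected set could be seeded with the abbreviations collected earlier.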

### Conversion to British or American English

In [26]:
# Based on https://stackoverflow.com/questions/42329766/python-nlp-british-english-vs-american-english

import re
import requests

def britishize(string):
    url ="https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/american_spellings.json"
    american_to_british_dict = requests.get(url).json()    

    # Note: plain substring replacement, so spellings inside longer words are also replaced
    for american_spelling, british_spelling in american_to_british_dict.items():
        string = string.replace(american_spelling, british_spelling)
  
    return string

def americanize(string):
    url ="https://raw.githubusercontent.com/hyperreality/American-British-English-Translator/master/data/british_spellings.json"
    british_to_american = requests.get(url).json()    
    for british_spelling, american_spelling in british_to_american.items():
        # Replace whole words only: the lookarounds reject a match that is
        # preceded or followed by a letter, so longer words are left alone
        string = re.sub(f'(?<![a-zA-Z]){re.escape(british_spelling)}(?![a-zA-Z])', american_spelling, string)
    return string

In [27]:
text = "Discount analyse"
americanize(text)

'Discount analyze'
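
Running one `re.sub` per dictionary entry and downloading the dictionary on every call can become slow when applied to a whole column. A possible alternative is to compile a single alternation once and replace all spellings in one pass. This is a sketch only: `BRITISH_TO_AMERICAN` is a hypothetical three-entry excerpt used in place of the downloaded JSON, and `americanize_once` is an illustrative name.

```python
import re

# Hypothetical excerpt; the full mapping would come from the downloaded JSON
BRITISH_TO_AMERICAN = {"analyse": "analyze", "colour": "color", "disc": "disk"}

# Longest keys first so a longer spelling is preferred over a shorter prefix
_keys = sorted(BRITISH_TO_AMERICAN, key=len, reverse=True)
_SPELLING_RE = re.compile(
    r"(?<![a-zA-Z])(" + "|".join(map(re.escape, _keys)) + r")(?![a-zA-Z])"
)

def americanize_once(text):
    """Replace every whole-word British spelling in a single pass."""
    return _SPELLING_RE.sub(lambda match: BRITISH_TO_AMERICAN[match.group(1)], text)

print(americanize_once("discount analyse colour disc"))
# discount analyze color disk
```

The word `discount` is left untouched because the lookahead rejects `disc` when a letter follows it.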