# Phase 01: Data Cleaning & Normalization

- Essential Libraries

In [1]:
import pandas as pd

- Loading the raw data and examining its structure

In [3]:
# Load the dataset
df = pd.read_csv('../data/raw/Spam_SMS.csv')
df.head()

Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# Print a summary of the dataset: columns, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5574 non-null   object
 1   Message  5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


Based on this summary, the DataFrame has <mark>5,574 entries</mark> in two columns (`Class` and `Message`) and contains <mark>no missing data</mark>.
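`df.info()` only counts null values, so a message that is an empty or whitespace-only string would still be reported as non-null. The following is a minimal sketch of an extra check for that case. It runs on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset).

```python
import pandas as pd

# Toy frame standing in for the SMS data (values are illustrative only)
toy = pd.DataFrame({
    "Class": ["ham", "spam", "ham"],
    "Message": ["Ok lar...", "   ", "See you soon"],
})

# True NaN/None values per column
missing_per_column = toy.isna().sum()

# Messages that are non-null but contain only whitespace
blank_messages = int(toy["Message"].str.strip().eq("").sum())

print(missing_per_column)
print(f"Blank messages: {blank_messages}")
```

The same two expressions could be applied to `df` in place of `toy`.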

In [5]:
df.describe()

Unnamed: 0,Class,Message
count,5574,5574
unique,2,5159
top,ham,"Sorry, I'll call later"
freq,4827,30


Of the 5,574 samples, 4,827 are labeled ham. <mark>Some messages are duplicated</mark>: only 5,159 of them are unique. The most frequent message, "Sorry, I'll call later", appears 30 times.
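The counts above imply a clearly imbalanced dataset. As a rough sketch, the class shares can be computed with `value_counts(normalize=True)`. The label series below is rebuilt from the reported totals rather than read from the CSV, and the spam count is an inference: it is obtained by subtraction, assuming the two classes are ham and spam.

```python
import pandas as pd

# Labels rebuilt from the counts reported above: 4,827 ham out of 5,574 rows.
# The spam count is derived by subtraction, since there are only two classes.
labels = pd.Series(["ham"] * 4827 + ["spam"] * (5574 - 4827), name="Class")

# Absolute counts per class
class_counts = labels.value_counts()

# Relative share of each class, rounded to three decimals
class_shares = labels.value_counts(normalize=True).round(3)

print(class_counts)
print(class_shares)
```

On the real data the equivalent call would be `df['Class'].value_counts(normalize=True)`.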

In [7]:
# keep=False marks every copy of a repeated message, including the first one
duplicates = df[df.duplicated(keep=False, subset=['Message'])]
print(f"Number of duplicated messages: {len(duplicates)}")
print("Example of duplicated messages: \n", duplicates.head(10))

Number of duplicated messages: 705
Example of duplicated messages: 
    Class                                            Message
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
7    ham  As per your request 'Melle Melle (Oru Minnamin...
8   spam  WINNER!! As a valued network customer you have...
9   spam  Had your mobile 11 months or more? U R entitle...
11  spam  SIX chances to win CASH! From 100 to 20,000 po...
12  spam  URGENT! You have won a 1 week FREE membership ...
45   ham                   No calls..messages..missed calls
56  spam  Congrats! 1 year special cinema pass for 2 is ...
62   ham                          Its a part of checking IQ
65  spam  As a valued customer, I am pleased to advise y...


The rows above are the messages that occur more than once, shown with their classes. Because `keep=False` flags every copy, the 705 rows include the first occurrence of each repeated message. If identical messages have the same class, one instance can be kept and the rest removed so that the model does not overfit to repeated texts. If identical messages have different classes, they must be examined individually to decide how they should be handled.
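The difference between flagging every copy and flagging only the redundant copies can be seen on a small example. This is a sketch on made-up data (the `toy` frame and its messages are assumptions for illustration), not output from the dataset.

```python
import pandas as pd

# Toy data: "call me" appears twice, "win cash" appears twice
toy = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "spam", "spam"],
    "Message": ["call me", "call me", "win cash", "win cash", "free entry"],
})

# keep=False flags every copy of a repeated message, including the first
all_copies = toy[toy.duplicated(subset=["Message"], keep=False)]

# keep="first" flags only the later, redundant copies
extra_copies = toy[toy.duplicated(subset=["Message"], keep="first")]

print(f"Rows involved in duplication: {len(all_copies)}")
print(f"Rows that would be dropped:   {len(extra_copies)}")
```

Applied to the real data, the first count corresponds to the 705 rows reported above, while the second corresponds to the 5,574 - 5,159 = 415 rows that deduplication removes.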

In [16]:
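# Flag messages that appear with more than one distinct class
inconsistent = duplicates.groupby('Message')['Class'].nunique() > 1
# Keep only the flagged messages and list their text
inconsistent = inconsistent[inconsistent].index.tolist()
inconsistent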

[]

The list is empty, so <mark>no inconsistencies</mark> were found: every duplicated message carries a single class.
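Had conflicting labels been present, one possible approach (an assumption for illustration, not something this notebook needed to do) would be to set those rows aside for manual review and deduplicate only the label-consistent remainder. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data: "win cash" is labeled both spam and ham (a conflict)
toy = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "ham", "spam"],
    "Message": ["call me", "call me", "win cash", "win cash", "free entry"],
})

# Number of distinct labels attached to each row's message text
label_counts = toy.groupby("Message")["Class"].transform("nunique")

# Rows whose message carries more than one label, set aside for review
needs_review = toy[label_counts > 1]

# Remaining rows are label-consistent and can be deduplicated safely
consistent = toy[label_counts == 1].drop_duplicates(subset=["Message"], keep="first")

print(needs_review)
print(consistent)
```

Whether to drop, relabel, or keep the reviewed rows is a judgment call that depends on the messages themselves.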

In [17]:
# Remove all duplicates, keeping only the first occurrence
df_clean = df.drop_duplicates(subset=['Message'], keep='first')
df_clean.describe()

Unnamed: 0,Class,Message
count,5159,5159
unique,2,5159
top,ham,Rofl. Its true to its name
freq,4518,1
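After deduplication, `count` and `unique` for `Message` are both 5,159, which confirms that no repeated message remains. The same check can be written as an explicit assertion. The sketch below uses a made-up frame (`toy` is an assumption for illustration). The `reset_index` call is an optional addition, included because `drop_duplicates` keeps the original index labels.

```python
import pandas as pd

# Toy data with one repeated message
toy = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "spam"],
    "Message": ["call me", "call me", "win cash", "free entry"],
})

# Drop later copies, then renumber the rows from 0
toy_clean = (
    toy.drop_duplicates(subset=["Message"], keep="first")
       .reset_index(drop=True)
)

# Sanity checks: no message repeats, and the row count matches the unique count
assert not toy_clean["Message"].duplicated().any()
assert len(toy_clean) == toy["Message"].nunique()

removed = len(toy) - len(toy_clean)
print(f"Removed {removed} duplicate row(s); {len(toy_clean)} remain")
```

On the real data the same checks would be run against `df` and `df_clean`.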
