# SMS Spam Classification 

## The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research.

## 1. READ IN TEXT DATA

### We use Python's open() function to read in the text file, because this approach stays flexible even for messier text files.

In [1]:
rawData = open("SMSSpamCollection").read()
rawData[0:300]

"ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 084528100"

### We can see that the values are separated by '\t', which means this is a tab-delimited text file. For TSV files, we can also call the Pandas read_csv function and specify '\t' as the separator.

In [2]:
parsedData = rawData.replace('\t', '\n').split('\n') # Replace tab characters with new-line characters
parsedData[0:4]

['ham',
 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'ham',
 'Ok lar... Joking wif u oni...']

In [3]:
labels = parsedData[0::2] # Every second element starting from the 0th index
texts = parsedData[1::2] # Every second element starting from the 1st index

print(labels[0:4])
print(texts[0:4])

['ham', 'ham', 'spam', 'ham']
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...']


In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

In [5]:
dataFrame = pd.DataFrame({
    'label': labels[:-1], # Drop the last label: the file's trailing newline leaves a blank final element, and both columns must be equal in length
    'body': texts
})

dataFrame.head()

|   | label | body |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |
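
The slicing approach above depends on every message containing exactly one tab and on trimming the blank element left by the trailing newline. As a hedged alternative, the sketch below parses line by line and splits each line on its first tab only. The `sample` string and the `parse_tsv` helper are hypothetical names introduced for illustration; they are not part of the dataset or the notebook.

```python
# Hypothetical three-line sample in the same "label<TAB>message" layout,
# ending with a trailing newline like the text shown in the raw output.
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
    "ham\tU dun say so early hor...\n"
)

def parse_tsv(raw):
    """Split tab-delimited text into parallel label and message lists."""
    labels, texts = [], []
    for line in raw.split("\n"):
        if not line.strip():
            continue  # skip blank lines, such as the one after the trailing newline
        label, text = line.split("\t", 1)  # split on the first tab only
        labels.append(label)
        texts.append(text)
    return labels, texts

sample_labels, sample_texts = parse_tsv(sample)
print(sample_labels)
print(sample_texts)
```

Because blank lines are skipped during parsing, both lists come out the same length and no `[:-1]` trimming is needed.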


### The Pandas approach below replicates the process above. Note that the manual procedure above is more flexible for complex text files.

In [6]:
df = pd.read_csv("SMSSpamCollection", sep = "\t", header = None) # Importing tab-delimited file
df.columns = ["label", "body"] # Naming the columns
df.head()

|   | label | body |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


## 2. ANALYSING THE TEXT DATA

In [7]:
# Number of rows and columns in the dataset
df.shape

(5572, 2)

In [8]:
# Number of ham entries
print("Ham: {}".format(len(df[df['label'] == 'ham'])))

# Number of spam entries
print("Spam: {}".format(len(df[df['label'] == 'spam'])))

Ham: 4825
Spam: 747
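
The raw counts are easier to interpret as shares of the whole dataset. The short sketch below only does the arithmetic on the two counts printed above; the `counts` and `shares` names are introduced here for illustration.

```python
# Class counts copied from the output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())  # 5572, matching df.shape[0]
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# ham: 86.6%
# spam: 13.4%
```

The classes are clearly imbalanced: a model that always predicts ham would already be right about 86.6% of the time, which is worth keeping in mind when judging accuracy later.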


In [9]:
# Checking for null values
df.isnull().sum()

label    0
body     0
dtype: int64

### The data doesn't contain any null values, so no further cleaning in that respect is required.

## 3. TOKENIZATION AND REMOVING STOPWORDS

In [10]:
import string
import re
import nltk
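# The NLTK 'stopwords' and 'wordnet' corpora must be available locally (fetch them once with nltk.download)
string.punctuation[0:10] # First 10 punctuation characters; not displayed because this is not the cell's last expression
stopword = nltk.corpus.stopwords.words('english') # Defining the stopwords
ps = nltk.PorterStemmer() # Defining the Porter stemmer
wn = nltk.WordNetLemmatizer() # Defining the WordNet lemmatizer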

In [11]:
def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text_nopunct = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    token = re.split(r"\W+", text_nopunct)
    text_nostopword = [word for word in token if word not in stopword]
    #lemmatized = [ps.stem(word) for word in text_nostopword]
    # We use lemmatizing because of its higher sophistication and we don't have a performance bottleneck
    lemmatized = [wn.lemmatize(word) for word in text_nostopword]
    return lemmatized

# This cleaning function is called by the vectorizer
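
To make the first three steps of `clean_text` concrete, here is a dependency-free sketch that can run without the NLTK corpora. It is a simplification, not the notebook's function: `demo_stopwords` is a hypothetical hand-picked list standing in for NLTK's English stopwords, and the lemmatization step is left out.

```python
import re
import string

# Hypothetical mini stopword list for illustration only; the notebook uses
# NLTK's full English list and additionally lemmatizes every token.
demo_stopwords = {"i", "he", "to", "so", "then", "in", "a", "the"}

def demo_clean(text):
    # 1. Lowercase and strip punctuation characters
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # 2. Tokenize on runs of non-word characters
    tokens = re.split(r"\W+", no_punct)
    # 3. Drop stopwords
    return [tok for tok in tokens if tok not in demo_stopwords]

cleaned = demo_clean("Nah I don't think he goes to usf, he lives around here though")
print(cleaned)
# ['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'here', 'though']
```

Note that removing punctuation before tokenizing turns "don't" into "dont", so contractions no longer match stopword entries that contain an apostrophe.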

### Feature Engineering - The hypothesis is that spam messages are longer and contain more punctuation than normal messages.

In [12]:
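def count_punc(text):
    # Percentage of non-space characters that are punctuation
    count = sum(1 for char in text if char in string.punctuation)
    non_space = len(text) - text.count(" ")
    # Scale before rounding to avoid float noise; guard against messages with no non-space characters
    return round(100 * count / non_space, 1) if non_space else 0.0

df['punc%'] = df['body'].apply(count_punc)
df['body_len'] = df['body'].apply(lambda x: len(x) - x.count(" ")) # Message length excluding spaces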

df.head()

|   | label | body | punc% | body_len |
|---|-------|------|-------|----------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | 9.8 | 92 |
| 1 | ham | Ok lar... Joking wif u oni... | 25.0 | 24 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | 4.7 | 128 |
| 3 | ham | U dun say so early hor... U c already then say... | 15.4 | 39 |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | 4.1 | 49 |
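
One way to start checking the hypothesis is to average both features per label. The sketch below does this in plain Python on the first four messages printed earlier in the notebook; the `sample`, `features`, and `by_label` names are introduced here for illustration, and four messages are far too few to support or reject the hypothesis.

```python
import string
from statistics import mean

# The first four messages of the dataset, as printed earlier in the notebook
sample = [
    ("ham", "Go until jurong point, crazy.. Available only in bugis n great "
            "world la e buffet... Cine there got amore wat..."),
    ("ham", "Ok lar... Joking wif u oni..."),
    ("spam", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
             "Text FA to 87121 to receive entry question(std txt rate)T&C's apply "
             "08452810075over18's"),
    ("ham", "U dun say so early hor... U c already then say..."),
]

def features(text):
    """Return (length without spaces, punctuation percentage) for one message."""
    non_space = len(text) - text.count(" ")
    punct = sum(ch in string.punctuation for ch in text)
    return non_space, round(100 * punct / non_space, 1)

# Group the feature tuples by label
by_label = {}
for label, text in sample:
    by_label.setdefault(label, []).append(features(text))

# Print the mean length and mean punctuation percentage per label
for label, rows in by_label.items():
    lengths, punct_pcts = zip(*rows)
    print(label, round(mean(lengths), 1), round(mean(punct_pcts), 1))
```

On the full DataFrame, the equivalent comparison would be `df.groupby('label')[['body_len', 'punc%']].mean()`.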


## 4. VECTORIZING

### Implementing the Count Vectorizer and the TF-IDF Vectorizer

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

countvect = CountVectorizer(analyzer = clean_text)
vect = TfidfVectorizer(analyzer = clean_text)

X_count = countvect.fit_transform(df['body'])
X_tfidf = vect.fit_transform(df['body'])

print(X_count.shape) 
print(X_tfidf.shape) 

(5572, 8917)
(5572, 8917)
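
Both matrices have the same shape because both vectorizers build one column per vocabulary term from the same analyzer; they differ only in the values stored. The pure-Python sketch below illustrates that difference on a hypothetical pre-tokenized toy corpus. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; it does not call scikit-learn itself.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus (not taken from the dataset)
docs = [["free", "entry", "win"], ["ok", "lar"], ["free", "win", "win"]]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)
# Document frequency: in how many documents each term appears
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def count_vector(doc):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocab]

def tfidf_vector(doc):
    """Counts weighted by IDF, then scaled to unit L2 length."""
    weighted = [c * idf[tok] for c, tok in zip(count_vector(doc), vocab)]
    norm = math.sqrt(sum(w * w for w in weighted))
    return [w / norm for w in weighted]

print(vocab)
for doc in docs:
    print(count_vector(doc), [round(w, 3) for w in tfidf_vector(doc)])
```

In the first toy document every term occurs once, so the count vector treats them equally, while the TF-IDF vector gives "entry" a higher weight than "free" because "entry" appears in fewer documents.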


In [14]:
# Prepend the label column to each dense feature matrix
X_count_df = pd.concat([df[['label']].reset_index(drop=True), pd.DataFrame(X_count.toarray())], axis = 1)

X_tfidf_df = pd.concat([df[['label']].reset_index(drop=True), pd.DataFrame(X_tfidf.toarray())], axis = 1)