## Spam filter for SMS messages

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd
sms = pd.read_csv("SMSSpamCollection", sep="\t", header = None, names= ['Label', 'SMS'])

In [2]:
sms.shape

(5572, 2)

In [3]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
spam_count = sum(sms['Label'] == 'spam')
ham_count = sms.shape[0] - spam_count

print("There are {} spam messages and {} ham messages".format(spam_count, ham_count))

There are 747 spam messages and 4825 ham messages


For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [6]:
sms = sms.sample(frac = 1, random_state = 1)
train = sms.iloc[:round(sms.shape[0]*.8)].reset_index().drop('index', axis = 1)
test = sms.iloc[round(sms.shape[0]*0.8):].reset_index().drop('index',axis = 1)

In [7]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [8]:
ham_perct = sum(train['Label'] == 'ham') * 100/train.shape[0]
spam_perct = sum(train['Label'] == 'spam') * 100/train.shape[0]

print("There are {:.2f}% spam messages and {:.2f}% ham messages in training dataset".format(spam_perct, ham_perct))

There are 13.46% spam messages and 86.54% ham messages in training dataset


In [9]:
test.head()

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...
3,ham,All sounds good. Fingers . Makes it difficult ...
4,ham,"All done, all handed in. Don't know if mega sh..."


In [10]:
ham_perct = sum(test['Label'] == 'ham') * 100/test.shape[0]
spam_perct = sum(test['Label'] == 'spam') * 100/test.shape[0]

print("There are {:.2f}% spam messages and {:.2f}% ham messages in test dataset".format(spam_perct, ham_perct))

There are 13.20% spam messages and 86.80% ham messages in test dataset


### Data Cleaning
- Remove puncutation
- lowercase 

In [11]:
train['SMS'] = train['SMS'].str.replace('\W', ' ').str.lower().str.strip()
test['SMS'] = test['SMS'].str.replace('\W', ' ').str.lower().str.strip()

In [12]:
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [14]:
vocabulary = []
for row in train['SMS']:
    for word in row.split():
        vocabulary.append(word)

vocabulary = list(set(vocabulary))


In [15]:
vocabulary

['stuff',
 '6',
 'brin',
 'welcomes',
 'empty',
 'poop',
 'sentiment',
 'slovely',
 'mum',
 'yan',
 'use',
 '08000839402',
 'coveragd',
 'soon',
 'wants',
 'ldew',
 'watts',
 'regular',
 '872',
 '08450542832',
 'yar',
 'dorothy',
 'co',
 'manda',
 'entered',
 'leftovers',
 'wth',
 'tmr',
 'cherthala',
 'hide',
 'feet',
 'big',
 'tickets',
 'aluable',
 'tookplace',
 '900',
 '4882',
 'blogspot',
 'functions',
 'claire',
 'cheesy',
 'galileo',
 'rcb',
 'happens',
 'ing',
 'truth',
 'mone',
 'young',
 'leh',
 'drms',
 'affections',
 'careless',
 'saves',
 '09058094583',
 'age23',
 'apply',
 'logon',
 'messed',
 'finishing',
 'impressively',
 'paces',
 'dinero',
 'fights',
 'iriver',
 'wa',
 'promptly',
 'blu',
 'janinexx',
 '09061701461',
 'guides',
 'tenerife',
 'ringtoneking',
 'indyarocks',
 'monkey',
 'pick',
 'understanding',
 'punch',
 'mtmsg18',
 'annoying',
 'icic',
 'earn',
 '54',
 '7mp',
 '100txt',
 'rights',
 'atm',
 'bunkers',
 'stone',
 'happened',
 'rimac',
 'ou',
 'antha',
 

In [16]:
print("There are {} unique words".format(len(vocabulary)))

There are 7783 unique words


In [17]:
word_counts_per_sms = {unique_word: [0]* len(train['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(train['SMS']):
    for word in sms.split():
        word_counts_per_sms[word][index] += 1
        

In [18]:
word_df = pd.DataFrame(word_counts_per_sms)
train = pd.concat([train, word_df], axis = 1)

In [19]:
train.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,yep by the pretty sculpture,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,yes princess are you going to make me moan,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,welp apparently he retired,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,havent,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,i forgot 2 ask ü all smth there s a card on ...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [20]:
p_spam = sum(train['Label'] == 'spam')/train.shape[0]
p_ham = sum(train['Label'] == 'ham')/train.shape[0]

In [21]:
spam_words = train[train['Label'] == 'spam']['SMS'].str.split().apply(len)
n_spam = spam_words.sum()

In [22]:
ham_words = train[train['Label'] == 'ham']['SMS'].str.split().apply(len)
n_ham = ham_words.sum()

In [26]:
n_vocabulary = len(vocabulary)
alpha = 1

In [None]:
spam_vocab = {unique_word: 0 for unique_word in vocabulary}
ham_vocab = {unique_word: 0 for unique_word in vocabulary}
train_spam = train[train['Label'] == 'spam']
train_ham = train[train['Label'] == 'ham']
for word in vocabulary:
    n_word_spam = train_spam[word].sum()
    spam_vocab[word] = (n_word_spam + alpha)/(n_spam + alpha * n_vocabulary)
    
    n_word_ham = train_ham[word].sum()
    ham_vocab[word] = (n_word_spam + alpha)/(n_ham + alpha * n_vocabulary)
    

In [1]:
import re
def classify(message):
    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_total = 1
    p_ham_total = 1
    for word in message:
        n_word_spam = train_spam[word].sum()
        p_spam = (n_word_spam + alpha)/(n_spam + alpha*n_vocabulary)
        p_smap_total *= p_spam
        
        n_word_ham = train_ham[word].sum()
        p_ham= (n_word_ham + alpha)/(n_ham + alpha * n_vocabulary)
        p_ham_tatal *= p_ham
        
        if p_ham_total > p_spam_total:
            print("Label : Ham")
        elif p_ham_total < p_spam_total:
            print("Label : Spam")
        else:
            print("Equal probabilites, have a human classify this")

In [2]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

NameError: name 'train_spam' is not defined