# Project 10: Spam Filter
## Probability

In this project, we will build a spam filter for SMS messages using Naive Bayes with Laplace smoothing.
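For reference, the standard form of Naive Bayes with Laplace (additive) smoothing can be written as below. The symbols are notation assumed here for illustration and do not appear in the notebook: $c$ is a class (ham or spam), $w_1, \dots, w_n$ are the words of a message, $N_{w_i \mid c}$ is the number of times $w_i$ occurs in messages of class $c$, $N_c$ is the total number of words in class $c$, $N_V$ is the vocabulary size, and $\alpha$ is the smoothing parameter ($\alpha = 1$ for Laplace smoothing).

```latex
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
P(w_i \mid c) = \frac{N_{w_i \mid c} + \alpha}{N_c + \alpha N_V}
```

The message is assigned to whichever class gives the larger value.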

The dataset used in this project comes from:
* [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) 

We will first import the libraries and load the dataset:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
df.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Next, we will clean the messages, count the distinct words in the whole dataset, and compute word frequencies separately for ham and spam messages:

In [3]:
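# Replace punctuation (except apostrophes that follow a word character) with spaces, lowercase, and split into words.
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default.
words=df['SMS'].str.replace(r"([^\w']|\B')",' ',regex=True).str.lower().str.split(expand=True).stack().nunique()
ham=df.loc[df['Label']=='ham','SMS'].str.replace(r"([^\w']|\B')",' ',regex=True).str.lower().str.split(expand=True).stack().value_counts().to_frame()
spam=df.loc[df['Label']=='spam','SMS'].str.replace(r"([^\w']|\B')",' ',regex=True).str.lower().str.split(expand=True).stack().value_counts().to_frame()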
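To see what the cleaning pattern does on its own, here is a minimal sketch using only the standard library. The `tokenize` helper and the toy messages are made up for illustration and are not part of the notebook; the regular expression is the same one used in the cell above.

```python
import re
from collections import Counter

# Same pattern as the notebook: match every character that is neither a word
# character nor an apostrophe, plus any apostrophe not preceded by a word character.
PATTERN = re.compile(r"([^\w']|\B')")

def tokenize(message):
    """Replace matched characters with spaces, lowercase, and split into tokens."""
    return PATTERN.sub(" ", message).lower().split()

# Toy messages (hypothetical, not taken from the dataset).
messages = ["Don't 'quote' me, OK?", "OK, call me!"]
counts = Counter(token for m in messages for token in tokenize(m))

print(tokenize(messages[0]))  # ["don't", "quote'", 'me', 'ok']
print(counts["ok"])           # 2
```

Note that a leading apostrophe is removed while an apostrophe inside or at the end of a word is kept, so `'quote'` becomes `quote'`.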

In [4]:
words

8916

In [5]:
ham.head()

| | 0 |
|---|---|
| i | 2331 |
| you | 1866 |
| to | 1562 |
| the | 1133 |
| a | 1070 |


In [6]:
spam.head()

| | 0 |
|---|---|
| to | 691 |
| a | 380 |
| call | 355 |
| you | 290 |
| your | 264 |


We will create a function that uses Laplace-smoothed word probabilities to classify a message as ham or spam:

In [7]:
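def classify(word):
    # Tokenize the message: split on whitespace, strip punctuation, lowercase.
    n=pd.Series(word.split()).str.replace(r"([^\w']|\B')",' ',regex=True).str.lower().str.strip()
    # Smoothed probability of each token per class:
    # (count in class + 1) / (distinct words in class + vocabulary size).
    h=np.prod(n.apply(lambda x:(ham[ham.index==x].sum()+1)/(len(ham)+words)))
    s=np.prod(n.apply(lambda x:(spam[spam.index==x].sum()+1)/(len(spam)+words)))
    # Return the class with the larger product of token probabilities.
    if float(h)>float(s):
        return 'ham'
    else:
        return 'spam'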
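For comparison, here is a minimal sketch of the textbook multinomial Naive Bayes classifier. It is not the notebook's method: it differs from `classify` in three ways, all of which are assumptions of this sketch. It uses the total number of words in each class in the denominator, it includes the class priors, and it sums log-probabilities instead of multiplying probabilities, which avoids underflow on long messages. The names `train` and `classify_log` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train(messages, labels):
    """Collect per-class word counts, per-class message counts, and the vocabulary."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, class_counts, vocabulary

def classify_log(text, word_counts, class_counts, vocabulary, alpha=1):
    """Return the label with the larger smoothed log-posterior score."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())  # total words seen in this class
        score = math.log(class_counts[label] / total_messages)  # log prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            score += math.log((counts[word] + alpha) / (n_words + alpha * len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical, not taken from the dataset).
messages = ["win a free prize now", "free entry win cash", "are you home now", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]
model = train(messages, labels)

print(classify_log("free cash prize", *model))  # spam
print(classify_log("see you now", *model))      # ham
```

With `alpha=1` this is Laplace smoothing; a word that never appears in a class still gets a small non-zero probability instead of zeroing out the whole score.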

In [8]:
df['predicted'] = df['SMS'].apply(classify)
df.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | ham |
| 1 | ham | Ok lar... Joking wif u oni... | ham |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | ham | U dun say so early hor... U c already then say... | ham |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | ham |


Finally, we will calculate the accuracy of the filter on the same messages that were used to build the word counts:

In [9]:
accuracy= round(len(df[df['Label']==df['predicted']])/len(df)*100,2)
print('Accuracy = '+str(accuracy)+'%')

Accuracy = 97.2%
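Because the accuracy above is measured on the messages the word counts were built from, it may be optimistic. One common alternative is to hold out part of the data for testing. The sketch below shows the idea with the standard library only; `holdout_split`, `accuracy`, the 80/20 split, and the placeholder rows are assumptions for illustration and were not run on the SMS dataset.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=1):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return round(correct / len(true_labels) * 100, 2)

# Placeholder rows (hypothetical); in practice these would be (label, message) pairs.
rows = [("ham", f"message {i}") for i in range(10)]
train_rows, test_rows = holdout_split(rows)

print(len(train_rows), len(test_rows))  # 8 2
print(accuracy(["ham", "spam", "ham", "ham"], ["ham", "spam", "spam", "ham"]))  # 75.0
```

The word counts would then be built from `train_rows` only, and the accuracy computed on `test_rows`.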


End. Thank you!