# SMS SPAM FILTER


The dataset contains 5,572 SMS messages written in English: 4,825 are ham and 747 are spam.


**Text preprocessing**

Rename the columns (v1 to label and v2 to text) and create a new column named 'copy', which is an exact copy of the text column. The main purpose of the copy is to let us compare the processed and unprocessed data. Because there are only two classes, we also encode the labels in the label column as numbers in a new 'class' column (spam = 1, ham = 0).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re


df = pd.read_csv("../input/sms-spam-collection-dataset/spam.csv", encoding='latin-1', usecols = [0, 1])

df.rename(columns = {'v1':'label','v2':'text'},inplace=True)

df['class'] = df.label.map({'ham':0, 'spam':1})

df['copy'] = df.text



Let's see what we got

In [None]:
df.info()


In [None]:
df.head(10)


We need to replace some words and/or numbers with specific strings in order to extract meaningful features from the dataset.

These are a few of the steps we apply:

1. Replace email addresses with 'emailaddr'
2. Replace website addresses with 'httpaddr', and so on
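
As a rough sketch of the first two steps, the snippet below swaps email addresses and URLs for placeholder tokens using Python's `re` module. The helper name `replace_addresses` and the sample message are made up for illustration, and the patterns are simplified, so treat this as an approximation rather than the exact rules used later in the notebook.

```python
import re

# Simplified patterns for illustration only; real-world addresses can be messier.
EMAIL_RE = re.compile(r'\b[\w\-.]+@\w+\.\w{2,4}\b')
URL_RE = re.compile(r'https?://\S+|www\.\S+')

def replace_addresses(text):
    """Replace email addresses and URLs with placeholder tokens (hypothetical helper)."""
    text = EMAIL_RE.sub('emailaddr', text)  # handle emails first
    text = URL_RE.sub('httpaddr', text)     # then anything starting with http(s):// or www.
    return text

sample = "Mail win@prizes.com or visit http://example.com/claim now"
print(replace_addresses(sample))
```

Mapping every address to the same token means the model learns "this message contains a link" instead of memorising individual addresses.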

Some words in the English language, while necessary, don't contribute much to the meaning of a phrase. These words, such as "when", "had", "those" or "before", are called **stop_words** and should be filtered out.

You can see the full list of stop words [here](https://gist.github.com/sebleier/554280).
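
To make the idea concrete, here is a minimal sketch of stop-word filtering. It uses a tiny hand-picked set (`SAMPLE_STOP_WORDS`, an assumption for this example) instead of the full NLTK list, and the helper name `drop_stop_words` is hypothetical.

```python
# A tiny hand-picked stop-word set, used only for illustration.
# The notebook itself relies on NLTK's much longer English list.
SAMPLE_STOP_WORDS = {"when", "had", "those", "before", "i", "the", "to"}

def drop_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the text with stop words removed (hypothetical helper)."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

print(drop_stop_words("I had called the office before noon"))
```

Storing the stop words in a set keeps each membership check cheap, which matters once the filter runs over thousands of messages.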

**Why are we doing these things?**

Let's think of programming languages as babies. At the very beginning, babies know nothing about what you are trying to do. Later, you give them information and rules so they can understand their environment and their purpose. In this way, babies grow through our teaching and reflect it. Programming languages are just like babies: they demand information (inputs) from you before they can process anything. You should therefore describe every nuance properly, otherwise the result will be insufficient.
For instance:

*  *"I feel exhausted."* 
*  *"i FEEL eXhAusted!"* 

Both refer to the same meaning. But in a programming language like Python, they are not the same thing at all: from Python's perspective, these are two different strings. That's why we use the lower() function and remove all punctuation, since ***exhausted.*** and ***exhausted!*** refer to the same word.
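
The two sentences above can be checked directly. This is a minimal sketch with a hypothetical helper called `normalize`; it deletes ASCII punctuation with `str.translate`, whereas the cleaning function later in the notebook replaces punctuation with spaces using a regex.

```python
import string

def normalize(text):
    """Lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

a = normalize("I feel exhausted.")
b = normalize("i FEEL eXhAusted!")

print("I feel exhausted." == "i FEEL eXhAusted!")  # raw strings differ
print(a == b)                                      # normalized strings match
```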




The corpus likely contains words that share a root but differ in suffix, such as "distribute", "distributing", "distributor" or "distribution". A step called stemming reduces such variants to a common stem like "distribut", so they count as the same feature.
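
The idea can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm that the notebook uses; the `SUFFIXES` tuple and the `toy_stem` helper are assumptions made up for this example, and they would mangle plenty of ordinary words.

```python
# Toy suffix stripper for illustration only. It is NOT the Porter algorithm
# and will mangle many ordinary words.
SUFFIXES = ("ing", "ion", "or", "e")

def toy_stem(word):
    """Strip the first matching suffix from the word (hypothetical helper)."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["distribute", "distributing", "distributor", "distribution"]
print({w: toy_stem(w) for w in words})
```

A real stemmer applies many ordered rules with extra conditions, so its output for a given word can differ from what this sketch produces.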





In [None]:
import nltk
from nltk.corpus import stopwords

porter = nltk.PorterStemmer()  # reduces words such as "distributing" to a stem
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups

def clean_text(string):
    # Replace email addresses with 'emailaddr'
    message = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', string)
    # Replace URLs with 'httpaddr'
    message = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', message)
    # Replace money symbols with 'money'
    message = re.sub(r'£|\$', 'money', message)
    # Replace phone numbers with 'phonenumbr'
    message = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', message)
    # Replace any remaining numbers with 'numbr'
    message = re.sub(r'\d+(\.\d+)?', 'numbr', message)
    # Replace punctuation with spaces, then collapse repeated whitespace
    message = re.sub(r'[^\w\d\s]', ' ', message)
    message = re.sub(r'\s+', ' ', message)
    # Lowercase and trim leading/trailing whitespace
    message = re.sub(r'^\s+|\s+?$', '', message.lower())
    # Drop stop words and stem what is left
    return ' '.join(
        porter.stem(term)
        for term in message.split()
        if term not in stop_words
    )

Let's test it!

In [None]:
clean_text("going to vacation!!! 5734 I have ,.$ £")



Cool! The clean_text() function changed some words to specific strings:

* going -> go (PorterStemmer)
* to -> removed (because it is in the stop_words list)
* vacation -> vacat (PorterStemmer)
* !!! -> removed
* 5734 -> numbr



In [None]:
textCopy = df['text'].copy()
textCopy = textCopy.apply(clean_text)
df["copy"] = textCopy
df.head(5)

In [None]:
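from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag-of-words counts (single words by default)
vectorizer = CountVectorizer()

# TF-IDF over unigrams and bigrams; defined here but not used below
tf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# fit_transform learns the vocabulary and builds the document-term matrix in one step
ngrams = vectorizer.fit_transform(textCopy).toarray()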

**Graphs**

In this part, we look at the 10 most common words in spam and ham texts and how many times each one is used.






In [None]:
from collections import Counter

# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
spam_df = df[df['class'] == 1].copy()  # sub-dataframe of spam texts
ham_df = df[df['class'] == 0].copy()   # sub-dataframe of ham texts

# df['copy'] already holds the cleaned text, so clean_text is not applied again here

# Per-message word counts for spam (not used by the plots below)
spam_df['new_column'] = spam_df['copy'].apply(lambda x: Counter(x.split(' ')))

# Ten most frequent words in each class as (word, count) pairs
forspam = Counter(" ".join(spam_df['copy']).split()).most_common(10)
forham = Counter(" ".join(ham_df['copy']).split()).most_common(10)

# Split the pairs into parallel lists of words and counts for plotting
spamfeat = [word for word, _ in forspam]
spamnumber = [count for _, count in forspam]
hamfeat = [word for word, _ in forham]
hamnumber = [count for _, count in forham]

In [None]:
import seaborn as sns

fig, (ax,ax1) = plt.subplots(1,2,figsize = (25, 8))
sns.barplot(x = spamfeat, y=spamnumber, ax = ax)
ax.set_ylabel('count', fontsize = 15)
ax.set_xlabel('word',fontsize = 15)
ax.tick_params(labelsize=12)
ax.set_title('spam top 10 words', fontsize = 15)

sns.barplot(x = hamfeat, y = hamnumber, ax = ax1)
ax1.set_ylabel('count', fontsize = 15)
ax1.set_xlabel('word',fontsize = 15)
ax1.tick_params(labelsize=15)
ax1.set_title('ham top 10 words', fontsize = 15)

In [None]:
from wordcloud import WordCloud

forS=" ".join(spam_df['copy'])
forH=" ".join(ham_df['copy'])
spam_word_cloud = WordCloud(width = 600, height = 400, background_color = 'white').generate(forS)
ham_word_cloud = WordCloud(width = 600, height = 400,background_color = 'white').generate(forH)

fig, (ax, ax2) = plt.subplots(1,2, figsize = (18,8))
ax.imshow(spam_word_cloud)
ax.axis('off')
ax.set_title('spam word cloud', fontsize = 20)
ax2.imshow(ham_word_cloud)
ax2.axis('off')
ax2.set_title('ham word cloud', fontsize = 20)
plt.show()

**Predictions**

In [None]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score

n_folds = 5
def f1_cv(model):
    # Pass the KFold object itself; get_n_splits() only returns the number of folds,
    # which would silently drop the shuffle and random_state settings
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=29)
    f1 = cross_val_score(model, ngrams, df["class"], scoring='f1', cv=kf)
    return f1


In [None]:

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression


x_train, x_test, y_train, y_test = train_test_split(ngrams, df['class'].values, test_size=.40)

clfs = {
    'Decision_tree': DecisionTreeClassifier(),
    'gradient_descent': SGDClassifier(),
    'Naive_bayes': GaussianNB(),
    'Logistic_Regression': LogisticRegression()
}

# Train each classifier on the same split and report per-class metrics
for clf_name, clf in clfs.items():
    print("Training", clf_name, "classifier")
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_test)
    print(classification_report(y_test, y_predict))
    print()

