## Spam Detection
* Read the dataset and put it into a usable format.
* Encode the labels.
* Convert all text to lower case.
* Remove punctuation.
* Remove stopwords.
* Check message statistics.
* Convert the texts into vectors.
* Import the classifier.
* Train and test.
* Check the accuracy and the confusion matrix.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
sms = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv',encoding='latin-1')
sms.head()

* The dataset has extra, unused columns: remove them.
* Rename v1 and v2 to label and message.

In [None]:
sms.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)
sms.rename(columns={'v1':'label','v2':'message'},inplace=True)

In [None]:
print ('Shape = >',sms.shape)

In [None]:
print ('ham and spam counts','\n',sms.label.value_counts())

In [None]:
print ('spam ratio = ', round(len(sms[sms['label']=='spam']) / len(sms.label),2)*100,'%')
print ('ham ratio  = ', round(len(sms[sms['label']=='ham']) / len(sms.label),2)*100,'%')

Add a new column holding the length of each message

In [None]:
sms['length'] = sms.message.str.len()
sms.head(2)

Label encoding: 0 = ham, 1 = spam

In [None]:
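# Assign the result back: an inplace replace on a selected column may not
# update the DataFrame in recent pandas versions
sms['label'] = sms['label'].map({'ham':0,'spam':1})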

Convert all messages to lower case

In [None]:
sms['message'] = sms['message'].str.lower()

Replace emails, URLs, money symbols, phone numbers and numbers with placeholder tokens, then remove punctuation

In [None]:
# regex=True is needed because Series.str.replace treats the pattern as a
# literal string by default in pandas 2.0 and later.
# Note: patterns anchored with ^ and $ only match when the whole message fits,
# in which case the entire message is replaced by the placeholder.

# Replace email addresses with 'emailaddress'
sms['message'] = sms['message'].str.replace(r'^.+@[^\.].*\.[a-z]{2,}$',
                                 'emailaddress', regex=True)

# Replace URLs with 'webaddress'
sms['message'] = sms['message'].str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$',
                                  'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
sms['message'] = sms['message'].str.replace(r'£|\$', 'moneysymb', regex=True)

# Replace 10 digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumber'
sms['message'] = sms['message'].str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$',
                                  'phonenumber', regex=True)

# Replace numbers with 'numbr'
sms['message'] = sms['message'].str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)

In [None]:
# Remove punctuation (regex=True so the patterns are not treated as literals)
sms['message'] = sms['message'].str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
sms['message'] = sms['message'].str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
sms['message'] = sms['message'].str.replace(r'^\s+|\s+?$', '', regex=True)
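
The cleaning steps above can also be tried on a single string with the standard re module. This is only a sketch: the token-level patterns below (without ^ and $ anchors) and the function name normalize_message are assumptions made for illustration, not the notebook's original patterns, and phone numbers are left out for brevity.

```python
import re

# Illustrative token-level patterns (assumptions, not the notebook's originals).
# Without ^/$ anchors a match replaces only the matched substring.
EMAIL = re.compile(r'\S+@\S+\.[a-z]{2,}')
URL = re.compile(r'https?://\S+')
MONEY = re.compile(r'£|\$')
NUMBER = re.compile(r'\d+(\.\d+)?')
PUNCT = re.compile(r'[^\w\s]')
SPACES = re.compile(r'\s+')


def normalize_message(text):
    """Lower-case a message and swap emails, URLs, money symbols and numbers for placeholder tokens."""
    text = text.lower()
    text = EMAIL.sub('emailaddress', text)
    text = URL.sub('webaddress', text)
    text = MONEY.sub('moneysymb', text)
    text = NUMBER.sub('numbr', text)
    text = PUNCT.sub(' ', text)          # punctuation becomes a space
    return SPACES.sub(' ', text).strip() # collapse and trim whitespace


sample = "WIN £1000 now! Visit http://example.com or mail claim@example.com"
cleaned = normalize_message(sample)
print(cleaned)
```

Note that the money symbol and the number that follows it end up merged into one token (moneysymbnumbr), because the placeholders are inserted without surrounding spaces, just as in the cells above.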

Remove stopwords

In [None]:
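import nltk
from nltk.corpus import stopwords

# Fetch the stop-word corpus if it is not already available locally
nltk.download('stopwords', quiet=True)

# NLTK's English stop words plus a few SMS-specific abbreviations
stop_words = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])

# Keep only the terms that are not stop words
sms['message'] = sms['message'].apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_words))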
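
To see what the filter does to one message, here is a minimal sketch that needs no NLTK download. The tiny STOP_WORDS set and the helper name remove_stop_words are hypothetical stand-ins for the full English list used above.

```python
# Hypothetical miniature stop-word list standing in for NLTK's English list.
STOP_WORDS = {'the', 'a', 'to', 'is', 'u', 'ur'}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return ' '.join(tok for tok in text.split() if tok not in stop_words)


result = remove_stop_words('u have won a free ticket to the final')
print(result)
```

Using a set rather than a list keeps each membership check constant-time, which matters when the filter runs over every token of every message.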

New column (clean_length) after punctuation and stopword removal. This gives some sense of how much text we removed altogether

In [None]:
sms['clean_length'] = sms.message.str.len()
sms.head()

In [None]:
print ('Original Length', sms.length.sum())
print ('Clean Length', sms.clean_length.sum())

Message length distribution BEFORE cleaning

In [None]:
f,ax = plt.subplots(1,2,figsize = (10,5))

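# histplot replaces the deprecated distplot; stat='density' with kde=True gives a comparable plot
sns.histplot(sms[sms['label']==1]['length'],bins=20,stat='density',kde=True,ax=ax[0],label='Spam messages distribution',color='r')
ax[0].set_xlabel('Spam sms length')
ax[0].legend()

sns.histplot(sms[sms['label']==0]['length'],bins=20,stat='density',kde=True,ax=ax[1],label='ham messages distribution')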
ax[1].set_xlabel('ham sms length')
ax[1].legend()

plt.show()

Message length distribution AFTER cleaning

In [None]:
f,ax = plt.subplots(1,2,figsize = (10,5))

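# histplot replaces the deprecated distplot; stat='density' with kde=True gives a comparable plot
sns.histplot(sms[sms['label']==1]['clean_length'],bins=20,stat='density',kde=True,ax=ax[0],label='Spam messages distribution',color='r')
ax[0].set_xlabel('Spam sms length')
ax[0].legend()

sns.histplot(sms[sms['label']==0]['clean_length'],bins=20,stat='density',kde=True,ax=ax[1],label='ham messages distribution')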
ax[1].set_xlabel('ham sms length')
ax[1].legend()

plt.show()

We can clearly see a large difference in the spam graph before and after cleaning.
* Now let's get a sense of the words used in spam and ham texts

In [None]:
# Get a sense of the most prominent words in spam
from wordcloud import WordCloud


spams = sms['message'][sms['label']==1]
spam_cloud = WordCloud(width=600,height=400,background_color='white',max_words=50).generate(' '.join(spams))
plt.figure(figsize=(10,8),facecolor='b')
plt.imshow(spam_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()


We can clearly see that words such as "free", "claim" and "cash" are indicators of spam.

In [None]:
# Get a sense of the most prominent words in ham

hams = sms['message'][sms['label']==0]
ham_cloud = WordCloud(width=600,height=400,background_color='white',max_words=50).generate(' '.join(hams))
plt.figure(figsize=(10,8),facecolor='k')
plt.imshow(ham_cloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

1. Convert the text into vectors using TF-IDF
2. Instantiate the MultinomialNB classifier
3. Separate the features from the labels
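
To make the TF-IDF step less of a black box, the weighting can be sketched in plain Python. This sketch assumes raw term counts, the smoothed IDF that scikit-learn documents as ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector. The function name tfidf and the whitespace tokenisation are simplifications for illustration; TfidfVectorizer tokenises differently, so its numbers will not match exactly on real messages.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ['free cash prize', 'free entry', 'see you later']
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "cash" receives a higher weight than "free" because "free" also appears in the second document and is therefore less distinctive.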

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

tf_vec = TfidfVectorizer()
naive = MultinomialNB()

features = tf_vec.fit_transform(sms['message'])

X = features
y = sms['label']

In [None]:
# Default split: 75% train, 25% test
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42)
naive.fit(X_train,y_train)
y_pred = naive.predict(X_test)

print ('Final score = > ', accuracy_score(y_test,y_pred))
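
One thing to be aware of: the vectorizer above is fitted on all messages before the split, so the IDF weights have already seen the test messages. A possible alternative, sketched here on a hypothetical toy dataset, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training part only. The texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the SMS dataset; 1 = spam, 0 = ham.
texts = [
    'free cash prize claim now', 'win free entry today',
    'claim your free prize', 'urgent cash reward waiting',
    'see you at lunch', 'call me later tonight',
    'are we still meeting tomorrow', 'thanks for the lift home',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, y_train)

# At prediction time the fitted vocabulary and IDF weights are reused.
predictions = model.predict(test_texts)
print(predictions)
```

On a dataset this small the predictions mean little; the point is only the order of operations, which should give a more honest accuracy estimate on the real data.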

In [None]:
print(classification_report(y_test, y_pred))

Confusion Matrix

In [None]:
conf_mat = confusion_matrix(y_test,y_pred)

ax=plt.subplot()
# fmt='d' shows the counts as integers instead of scientific notation
sns.heatmap(conf_mat,annot=True,fmt='d',ax=ax,linewidths=5,linecolor='r',center=0)
ax.set_xlabel('Predicted Labels');ax.set_ylabel('True Labels')
ax.set_title('Confusion matrix')
ax.xaxis.set_ticklabels(['ham','spam'])
ax.yaxis.set_ticklabels(['ham','spam'])
plt.show()
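
To read the matrix, it can help to compute the four cells and the derived scores by hand. The sketch below uses made-up labels (not the notebook's results) and a hypothetical helper confusion_counts; it treats spam (1) as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true negatives, false positives, false negatives and true positives."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp


# Made-up labels for illustration: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
print(tn, fp, fn, tp, accuracy, precision, recall)
```

With ham as 0 and spam as 1, scikit-learn's confusion_matrix lays these counts out as [[tn, fp], [fn, tp]], with true labels on the rows and predicted labels on the columns, which matches the axis labels in the heatmap above. For spam filtering, false positives (ham flagged as spam) are usually the costlier error, so precision is worth watching alongside accuracy.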

You can try other classifiers and compare their accuracy against this baseline to see whether it improves.