## Spam or Ham
#### "Spam" refers to irrelevant or unsolicited messages sent over the internet, typically to a large number of users, for purposes such as advertising, phishing, or spreading malware. "Ham" refers to e-mail or messages that are not spam.

#### Here we will build a classification model that can detect spam messages.

In [None]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df=pd.read_csv('../input/smsspamcollection/train.csv')
df.columns=['id','label','message']
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()
# There is only one null value; it will not make a meaningful difference, so we can ignore it.

In [None]:
df=df.drop(['id'], axis=1)
df.head(10)

#### The main issue with our data is that it is all in text format (strings), so classification algorithms that expect numeric features will not work on it directly. This is where NLP comes in: it works on text data and converts it into a machine-understandable format.

# Text Pre-processing

In [None]:
# importing libraries
import nltk
import re
nltk.download('stopwords')

### STOPWORDS
#### Stopwords are the most common words in a natural language. They do not add much value when analyzing text data and building NLP models, e.g. “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”.

#### We should therefore remove the stopwords present in the text data.
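
To make the idea concrete, here is a minimal sketch of stopword removal on a single message. The `TOY_STOPWORDS` set and the `remove_stopwords` helper are hypothetical names used only for illustration; the set holds just the example words listed above, whereas NLTK's English list is much longer.

```python
# Toy stopword list for illustration only (an assumption, not NLTK's full list).
TOY_STOPWORDS = {"the", "is", "in", "for", "where", "when", "to", "at"}


def remove_stopwords(text, stop_words=TOY_STOPWORDS):
    """Lowercase the text, split it on whitespace, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in stop_words]


message = "The prize is waiting for you at the store"
print(remove_stopwords(message))
```

Only the words that carry some meaning for the classifier ("prize", "waiting", "you", "store") remain in the output.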

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# The stemmer reduces each word to its root form (stem).

In [None]:
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(df)):
    # replace every character that is not a letter with a space
    review = re.sub('[^a-zA-Z]', ' ', str(df['message'][i]))
    review = review.lower()
    review = review.split()

    # stem every word that is not a stopword
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
print(corpus)  # list of cleaned, stemmed messages

### Now we'll convert each message, represented above as a string of stemmed tokens, into a vector that machine learning models can understand.
### We can use two methods:
### 1. CountVectorizer
### 2. TF-IDF (Term Frequency and Inverse Document Frequency)

## CountVectorizer
### CountVectorizer transforms a text into a vector based on the frequency (count) of each word that occurs in it. This is helpful when we have multiple texts and want to convert each of them into a vector that a machine can process easily. It is used to create a bag of words.
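
The bag-of-words idea can be sketched in plain Python as follows. This is a simplified illustration, not scikit-learn's implementation: the `bag_of_words` helper is a hypothetical name, it splits on whitespace only, and the three short texts are made up for the example.

```python
from collections import Counter


def bag_of_words(texts):
    """Return the sorted vocabulary and one count vector per text."""
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary is every distinct word seen across all texts.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One integer per vocabulary word: how often it occurs in this text.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


texts = ["free prize call now", "call me now", "free free prize"]
vocabulary, vectors = bag_of_words(texts)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each text becomes a row of integer counts with one column per vocabulary word, which is the same shape of output that `CountVectorizer` produces.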

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)
X1 = cv.fit_transform(corpus).toarray()

In [None]:
print(X1[0])

### We can see that the vector contains "1" in some places. A count vector contains only integer values.

## TF-IDF

### TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. 

### One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

### Typically, the tf-idf weight is composed of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a sentence, divided by the total number of words in that sentence. The second is the Inverse Document Frequency (IDF): the logarithm of the number of sentences in the corpus divided by the number of sentences containing that word.

### TF: Term Frequency, which measures how frequently a term occurs in a sentence. Since sentences differ in length, a term may appear many more times in long sentences than in short ones. Thus, the term frequency is often divided by the sentence length (i.e. the total number of terms in the sentence) as a way of normalization:

### TF(t) = (Number of times term t appears in a sentence) / (Total number of terms in the sentence).

### IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain words, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

### IDF(t) = log_e(Total number of sentences / Number of sentences containing term t).
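
The two formulas above can be worked through by hand on a tiny corpus. The sketch below is a direct transcription of those textbook formulas, not scikit-learn's variant; the helper names and the three sample sentences are assumptions made for illustration, and `inverse_document_frequency` assumes the term occurs in at least one sentence.

```python
import math


def term_frequency(term, tokens):
    """TF(t): occurrences of the term divided by the sentence length."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, tokenized_sentences):
    """IDF(t): natural log of (total sentences / sentences containing the term)."""
    containing = sum(1 for tokens in tokenized_sentences if term in tokens)
    return math.log(len(tokenized_sentences) / containing)


def tf_idf(term, tokens, tokenized_sentences):
    """TF-IDF weight of a term within one sentence of the corpus."""
    return term_frequency(term, tokens) * inverse_document_frequency(
        term, tokenized_sentences
    )


sentences = ["free prize call now", "call me now", "free free prize"]
tokenized = [sentence.split() for sentence in sentences]

# "free" appears twice in the third sentence but also occurs in two of three sentences.
print(round(tf_idf("free", tokenized[2], tokenized), 4))
# "me" appears once in the second sentence and nowhere else, so its IDF is higher.
print(round(tf_idf("me", tokenized[1], tokenized), 4))
```

Although "free" is repeated, the rarer term "me" ends up with the larger weight, which is the offsetting effect that IDF is meant to provide.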



In [None]:
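from sklearn.feature_extraction.text import TfidfVectorizer
# Use a separate name so the fitted CountVectorizer in `cv` is not overwritten.
# Note: scikit-learn's defaults use a smoothed IDF and L2-normalize each row,
# so the values differ slightly from the textbook formulas above.
tfidf = TfidfVectorizer()
X2 = tfidf.fit_transform(corpus).toarray()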

In [None]:
print(X2[0])

### The TF-IDF vector contains decimal (floating-point) values.


In [None]:
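# One-hot encode the labels and keep the second column,
# which marks spam (assuming the labels are 'ham' and 'spam').
y = pd.get_dummies(df['label'])
y = y.iloc[:, 1].values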

In [None]:
# USING CountVectorizer
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y, test_size = 0.20, random_state = 0)


In [None]:
# TF-IDF
from sklearn.model_selection import train_test_split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size = 0.20, random_state = 0)


In [None]:
#CountVectorizer
from sklearn.naive_bayes import MultinomialNB #classification model
spam_detect_model1 = MultinomialNB().fit(X_train1, y_train1)


In [None]:
# TF-IDF
from sklearn.naive_bayes import MultinomialNB
spam_detect_model2 = MultinomialNB().fit(X_train2, y_train2)

In [None]:
print("CountVectorizer")
print('Training Accuracy :',spam_detect_model1.score(X_train1, y_train1))
print('Testing Accuracy :',spam_detect_model1.score(X_test1, y_test1))

print("TF-IDF")
print('Training Accuracy :',spam_detect_model2.score(X_train2, y_train2))
print('Testing Accuracy :',spam_detect_model2.score(X_test2, y_test2))

#### We can see that CountVectorizer gives better accuracy than TF-IDF, so we use the CountVectorizer model for the predictions below.

In [None]:
y_pred=spam_detect_model1.predict(X_test1)

In [None]:
# creating the confusion matrix
from sklearn.metrics import confusion_matrix
con_mat = confusion_matrix(y_test1,y_pred)


print('\nCONFUSION MATRIX')
plt.figure(figsize= (6,4))
sns.heatmap(con_mat, annot = True,fmt='d',cmap="YlGnBu")
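
To read the heatmap, it helps to know how the four cells are laid out. scikit-learn puts the true labels on the rows and the predicted labels on the columns, so with 0 = ham and 1 = spam the matrix is `[[TN, FP], [FN, TP]]`. The sketch below reproduces that layout in plain Python; the helper name and the sample labels are hypothetical and are not taken from the dataset.

```python
def binary_confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for labels where 1 means spam and 0 means ham."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # spam correctly flagged
        elif actual == 1:
            fn += 1  # spam that slipped through as ham
        elif predicted == 1:
            fp += 1  # ham wrongly flagged as spam
        else:
            tn += 1  # ham correctly left alone
    return tn, fp, fn, tp


# Made-up labels for illustration only.
sample_true = [0, 0, 1, 1, 0, 1]
sample_pred = [0, 1, 1, 0, 0, 1]
tn, fp, fn, tp = binary_confusion_counts(sample_true, sample_pred)

# Rows are true labels (ham, spam); columns are predicted labels (ham, spam).
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```

For a spam filter, the top-right cell (ham flagged as spam) and the bottom-left cell (spam missed) are the two kinds of error worth inspecting separately.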

# THANKS.