# K-nearest neighbors
K-nearest neighbors (KNN) is a non-parametric machine learning model that memorizes the
training observations and uses them to classify unseen test data. It is also called
instance-based learning. The model is often described as a lazy learner because, unlike
regression, random forest, and so on, it does not learn anything during the training phase.
Instead, it does its work only in the testing/evaluation phase, when each test observation
is compared with its nearest training observations. Comparing every test data point in
this way takes significant time, so the technique is not efficient on big data; its
performance also deteriorates when the number of variables is high, due to the curse of
dimensionality.
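
The lazy-learning behavior described above can be illustrated with a minimal sketch. It uses only the standard library; the function name `knn_predict`, the toy points, and the choice of Euclidean distance with `k=3` are assumptions made for illustration and are not part of the original notebook.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens at query time,
    # where the query is compared against every stored observation.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up data: two small clusters in two dimensions.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.1, 1.0))
prediction_near_b = knn_predict(train_points, train_labels, (5.1, 5.0))
print(prediction_near_a, prediction_near_b)
```

Because every prediction scans the whole training set, the cost per query grows with the number of stored observations, which is the inefficiency on big data mentioned above.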

# Naive Bayes
The concept behind the Bayes algorithm is quite old: it dates back to the 18th century and
the work of Thomas Bayes, who developed the foundational mathematical principles for
determining the probability of unknown events from known events. For example, suppose all
apples are red and have an average diameter of about 4 inches. If one fruit selected at
random from the basket is red and has a diameter of 3.7 inches, what is the probability
that this particular fruit is an apple? The term naive refers to the assumption that,
within a class, each feature is independent of the others. In this case, there would be no
dependency between color and diameter. This independence assumption makes the Naive Bayes
classifier computationally cheap for tasks such as email classification based on words,
where the vocabulary has very high dimensionality. Even with the independence assumption,
the Naive Bayes classifier performs surprisingly well in practical applications.
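
The fruit question above can be worked through numerically. The sketch below is only an illustration: every prior and likelihood in it is a hypothetical number chosen for the example (the source gives none), and the names `priors`, `likelihoods`, and `posterior` are likewise assumptions.

```python
# Hypothetical numbers for a basket that holds apples and other fruit.
priors = {"apple": 0.5, "other": 0.5}
likelihoods = {
    # P(feature | class); "all apples are red" gives a likelihood of 1.0.
    "apple": {"red": 1.0, "diameter_near_3_7": 0.8},
    "other": {"red": 0.3, "diameter_near_3_7": 0.2},
}

def posterior(evidence):
    """Return P(class | evidence) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        # Naive step: multiply per-feature likelihoods as if they were independent.
        for feature in evidence:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

fruit_posterior = posterior(["red", "diameter_near_3_7"])
print(fruit_posterior)
```

Treating color and diameter as independent is what lets the two likelihoods be multiplied directly, without estimating their joint distribution.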

Bayesian classifiers are best applied to problems in which information from a very large
number of attributes should be considered simultaneously to estimate the probability of
the final outcome. Bayesian methods use all available evidence for prediction, even
features that have only a weak effect on the final outcome. We should not ignore the fact
that a large number of features with relatively minor effects can, taken together, form a
strong classifier.

In [91]:
import csv

In [115]:
smsdata = open('SMSSpamCollection.txt','r')
smsdata

<_io.TextIOWrapper name='SMSSpamCollection.txt' mode='r' encoding='cp1252'>

In [116]:
csv_reader = csv.reader(smsdata,delimiter='\t')

In [117]:
smsdata_data = []
smsdata_labels = []

In [118]:
for line in csv_reader:
    smsdata_labels.append(line[0])
    smsdata_data.append(line[1])

In [119]:
smsdata_data[:5]

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'Ok lar... Joking wif u oni...',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'U dun say so early hor... U c already then say...',
 "Nah I don't think he goes to usf, he lives around here though"]

In [120]:
smsdata_labels[:5]

['ham', 'ham', 'spam', 'ham', 'ham']

In [121]:
smsdata.close()

In [122]:
for i in range(5):
    print (smsdata_data[i],smsdata_labels[i])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... ham
Ok lar... Joking wif u oni... ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's spam
U dun say so early hor... U c already then say... ham
Nah I don't think he goes to usf, he lives around here though ham


In [123]:
# Count how many messages carry each label
from collections import Counter
c = Counter( smsdata_labels )
print(c)

Counter({'ham': 4825, 'spam': 747})


Out of 5,572 observations, 4,825 are ham messages (about 86.6 percent) and the remaining 747 are spam messages (about 13.4 percent).
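
As a quick check of the class balance, the shares can be recomputed from the counts printed above. This is a small sketch; the variable names `label_counts` and `shares` are introduced here only for illustration.

```python
from collections import Counter

# Counts as printed by the Counter in the previous cell.
label_counts = Counter({"ham": 4825, "spam": 747})

total = sum(label_counts.values())
# Percentage share of each label, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in label_counts.items()}
print(total, shares)
```

With roughly six ham messages for every spam message, accuracy alone can look high even when spam recall is modest, so the per-class metrics later on are worth reading as well.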

Using NLP techniques, we preprocess the data to obtain the final word vectors that are mapped to the outcomes spam or ham. The major preprocessing stages are:

#### Removal of punctuation: Punctuation needs to be removed before any further processing. The punctuation characters from the string library are `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``, and they are removed from all the messages.


#### Word tokenization: Words are chunked from sentences based on white space for further processing.


#### Converting words into lower case: Converting everything to lower case removes duplicates such as Run and run, where the first comes at the start of a sentence and the latter in the middle of a sentence. These all need to be unified because we are working with the bag-of-words technique.


#### Stop word removal: Stop words are words that repeat many times in text and yet contribute little to the explanatory power of sentences, for example I, me, you, this, that, and so on. They need to be removed before further processing.


#### Keeping words of length at least three: Here we remove words that are shorter than three characters.


#### Stemming of words: Stemming reduces words to their respective root words. For example, stemming reduces running to run and runs to run. Stemming reduces duplicates and can improve the accuracy of the model.


#### Part-of-speech (POS) tagging: This assigns a part-of-speech tag, such as noun, verb, or adjective, to each word. For example, the POS tag for running is verb, whereas for run it is noun. In some situations running is a noun, and lemmatization then does not reduce the word to the root word run; it keeps running as it is. Hence, POS tagging is a crucial step that must be performed before applying lemmatization to reduce a word to its root word.


#### Lemmatization of words: Lemmatization is another process for reducing dimensionality. Rather than just truncating a word, lemmatization reduces it to its root word. For example, ate is reduced to its root word eat when we pass it to the lemmatizer with the POS tag verb.
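
The first few stages listed above (punctuation removal, tokenization, lower-casing, stop word removal, and the length filter) can be sketched with the standard library alone. This is a simplified illustration rather than the notebook's implementation: the function name `simple_preprocess` and the tiny stop word set `TOY_STOP_WORDS` are hypothetical, and stemming, POS tagging, and lemmatization are left out because they need NLTK resources.

```python
import string

# Hypothetical, tiny stop word list used only for this illustration.
TOY_STOP_WORDS = {"i", "me", "you", "this", "that", "the", "is", "a", "to"}

def simple_preprocess(text):
    """Apply punctuation removal, tokenization, lower-casing and two filters."""
    # Replace punctuation with spaces so that "Run!" becomes "Run ".
    no_punct = "".join(" " if ch in string.punctuation else ch for ch in text)
    # Split on whitespace and lower-case each token.
    tokens = [tok.lower() for tok in no_punct.split()]
    # Drop stop words, then drop tokens shorter than three characters.
    tokens = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    tokens = [tok for tok in tokens if len(tok) >= 3]
    return tokens

cleaned = simple_preprocess("Run! This is a FREE prize, you run to claim it.")
print(cleaned)
```

Note how Run and run collapse into the same token once the text is lower-cased, which is the duplicate removal that the bag-of-words representation relies on.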

The nltk package is used for all the preprocessing steps, as it provides all the necessary NLP functionality under one roof:

In [124]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer

In [132]:
def preprocessing(text):
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())   # 1
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]    # 2
    tokens = [word.lower() for word in tokens]                                                    # 3
    stopwds = stopwords.words('english')                                                          # 4
    tokens = [token for token in tokens if token not in stopwds]                                  # 4
    tokens = [word for word in tokens if len(word)>=3]                                            # 5
    stemmer = PorterStemmer()                                                                     # 6
    tokens = [stemmer.stem(word) for word in tokens]                                              # 6
    tagged_corpus = pos_tag(tokens)                                                               # 7
    Noun_tags = ['NN','NNP','NNPS','NNS']
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()
    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')
    pre_proc_text=" ".join([prat_lemmatize(token,tag) for token, tag in tagged_corpus])
    return pre_proc_text


1. The line marked # 1 goes through the text character by character and checks whether each character is a standard punctuation mark; if so, it is replaced with a blank, otherwise it is kept as it is.
2. The line marked # 2 tokenizes the sentences into words and puts them together in a list for the further steps.
3. The line marked # 3 converts all cases (upper, lower, and proper) to lowercase, which reduces duplicates in the corpus.
4. As mentioned earlier, stop words are words that do not carry much weight in understanding the sentence; they are used for connecting words, and so on. The lines marked # 4 remove them.
5. The line marked # 5 keeps only words of length three or more, removing small words that hardly carry any meaning.
6. The lines marked # 6 apply stemming using the PorterStemmer class, which strips the extra suffixes from the words.
7. POS tagging (# 7) is a prerequisite for lemmatization: based on whether the word is a noun, a verb, and so on, the lemmatizer reduces it to the root word.
8. The pos_tag function returns the part of speech in four formats for nouns and six formats for verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper, plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, present participle), VBN (verb, past participle), VBP (verb, present tense, not third person singular), VBZ (verb, present tense, third person singular).

The prat_lemmatize function exists only because the tags returned by the pos_tag function
do not match the values accepted by the lemmatize function. If the tag for a word falls
under the noun or verb tag category, n or v is passed to the lemmatize function
accordingly; any other tag is treated as a noun. The preprocessing function is then
applied to every message:

In [135]:
smsdata_data_2 = []
for i in smsdata_data:
    smsdata_data_2.append(preprocessing(i))
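
The tag translation that prat_lemmatize performs can be isolated in a small sketch that needs no NLTK data. The helper name `to_wordnet_pos` and the hand-written tagged tokens are assumptions made for illustration; only the two tag lists and the fall-back to `'n'` come from the preprocessing function above.

```python
NOUN_TAGS = {"NN", "NNP", "NNPS", "NNS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the 'n' or 'v' code passed to lemmatize."""
    if penn_tag in NOUN_TAGS:
        return "n"
    if penn_tag in VERB_TAGS:
        return "v"
    # Every other tag (adjectives, adverbs, ...) falls back to 'n',
    # mirroring the final else branch of prat_lemmatize.
    return "n"

# Hand-written (token, tag) pairs standing in for the output of pos_tag.
tagged = [("ate", "VBD"), ("apples", "NNS"), ("quickly", "RB")]
mapped = [(token, to_wordnet_pos(tag)) for token, tag in tagged]
print(mapped)
```

One consequence of the fall-back is that adjectives and adverbs are lemmatized as if they were nouns, which usually leaves them unchanged.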

In [134]:
# Run once if the WordNet corpus is not yet installed (the output below is from that run):
#import nltk
#nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [136]:
import numpy as np
trainset_size = int(round(len(smsdata_data_2)*0.70))
print ('The training set size for this classifier is ' + str(trainset_size) + '\n')

The training set size for this classifier is 3900



In [138]:
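# First 70 percent of the messages for training, the rest for testing.
x_train = np.array([''.join(rec) for rec in smsdata_data_2[0:trainset_size]])
y_train = np.array([rec for rec in smsdata_labels[0:trainset_size]])
# Note: the test slices start at trainset_size+1, so the record at index trainset_size
# ends up in neither set; the shapes and results shown below reflect this.
x_test = np.array([''.join(rec) for rec in smsdata_data_2[trainset_size+1:len( smsdata_data_2)]])
y_test = np.array([rec for rec in smsdata_labels[trainset_size+1:len(smsdata_labels)]])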

In [139]:
print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(3900,) (3900,) (1671,) (1671,)
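
The shapes above add up to 5,571 rather than 5,572 because the test slice begins at `trainset_size+1`. A split that keeps every record could look like the following sketch; the helper `contiguous_split` and the toy list are hypothetical, and the notebook's results were not produced with it.

```python
def contiguous_split(records, train_fraction=0.70):
    """Split a list into a leading training part and a trailing test part."""
    cut = int(round(len(records) * train_fraction))
    # Using the same index for both slices keeps every record in exactly one set.
    return records[:cut], records[cut:]

toy_records = list(range(10))  # stand-in for the preprocessed messages
train_part, test_part = contiguous_split(toy_records)
print(len(train_part), len(test_part))
```

Since the messages are split in file order rather than shuffled, this kind of split assumes that spam and ham are spread fairly evenly through the file.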


The following code converts the words into vector format and applies term
frequency-inverse document frequency (TF-IDF) weights, which increase the weight of words
that occur frequently within a message and at the same time penalize general terms such as
the, him, at, and so on, that occur in most messages. In the following code, we have
restricted the vocabulary to the 4,000 most frequent terms; nonetheless, this parameter
can be tuned to check where better accuracies are obtained:

In [140]:
# building TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),stop_words='english',
                             max_features= 4000,strip_accents='unicode', norm='l2')

In [141]:
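# toarray() returns a plain NumPy array; todense() returns np.matrix, which recent
# scikit-learn releases no longer accept as estimator input.
x_train_2 = vectorizer.fit_transform(x_train).toarray()
x_test_2 = vectorizer.transform(x_test).toarray()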
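
To make the weighting concrete, here is a minimal TF-IDF sketch in plain Python. It uses raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. This is intended to resemble the vectorizer's default behavior with `norm='l2'`, but treat that correspondence as an assumption; the function name `tfidf_rows` and the toy messages are made up.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF; rarer terms get a larger multiplier.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["free prize call", "call later", "call home later"]  # made-up messages
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In the first toy message, call appears in every document and therefore receives a smaller weight than free or prize, which each appear in only one.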

The multinomial Naive Bayes classifier is suitable for classification with discrete
features (for example, word counts), and normally requires integer feature counts.
However, in practice, fractional counts such as TF-IDF also work well. If we do not
specify a Laplace estimator, it takes the default value of 1.0, which adds 1.0 to each
term count in the numerator and a corresponding total (1.0 times the number of features)
to the denominator:

In [147]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train_2, y_train)
ytrain_nb_predicted = clf.predict(x_train_2)
ytest_nb_predicted = clf.predict(x_test_2)
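
The effect of the Laplace estimator can be seen in a small stand-alone sketch. The function `smoothed_term_probs`, the three-word vocabulary, and the counts are hypothetical; the sketch only illustrates additive smoothing with a value of 1.0 and is not the library's internal code.

```python
from collections import Counter

def smoothed_term_probs(term_counts, vocabulary, alpha=1.0):
    """Estimate P(term | class) with additive (Laplace) smoothing."""
    total = sum(term_counts[t] for t in vocabulary)
    # alpha is added once per vocabulary term, so the denominator grows by alpha * |V|.
    denom = total + alpha * len(vocabulary)
    return {t: (term_counts[t] + alpha) / denom for t in vocabulary}

vocabulary = ["free", "prize", "later"]          # toy vocabulary
spam_counts = Counter({"free": 3, "prize": 1})   # "later" never seen in spam
spam_probs = smoothed_term_probs(spam_counts, vocabulary)
print(spam_probs)
```

Without smoothing, a term never seen in a class would get probability zero and wipe out the whole product of likelihoods for any message that contains it.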

In [148]:
from sklearn.metrics import classification_report,accuracy_score

In [149]:
print ("\nNaive Bayes - Train Confusion Matrix\n\n",
       pd.crosstab(y_train, ytrain_nb_predicted,rownames =["Actual"],colnames = ["Predicted"]))
print ("\nNaive Bayes- Train accuracy",round(accuracy_score(y_train,ytrain_nb_predicted),3))
print ("\nNaive Bayes - Train Classification Report\n",classification_report(y_train, ytrain_nb_predicted))
print ("\nNaive Bayes - Test Confusion Matrix\n\n",pd.crosstab(y_test,ytest_nb_predicted,rownames = ["Actual"],colnames = ["Predicted"]))
print ("\nNaive Bayes- Test accuracy",round(accuracy_score(y_test,ytest_nb_predicted),3))
print ("\nNaive Bayes - Test Classification Report\n",classification_report( y_test, ytest_nb_predicted))


Naive Bayes - Train Confusion Matrix

 Predicted   ham  spam
Actual               
ham        3381     0
spam         77   442

Naive Bayes- Train accuracy 0.98

Naive Bayes - Train Classification Report
               precision    recall  f1-score   support

         ham       0.98      1.00      0.99      3381
        spam       1.00      0.85      0.92       519

    accuracy                           0.98      3900
   macro avg       0.99      0.93      0.95      3900
weighted avg       0.98      0.98      0.98      3900


Naive Bayes - Test Confusion Matrix

 Predicted   ham  spam
Actual               
ham        1440     3
spam         54   174

Naive Bayes- Test accuracy 0.966

Naive Bayes - Test Classification Report
               precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1443
        spam       0.98      0.76      0.86       228

    accuracy                           0.97      1671
   macro avg       0.97      0.88      0.92  

The preceding results show that Naive Bayes has produced strong results: 96.6 percent test
accuracy, with a recall of almost 100 percent for ham and 76 percent for spam. The spam
recall means that 54 of the 228 spam messages in the test set were classified as ham.
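
The reported test metrics can be recomputed by hand from the test confusion matrix. The four counts below are taken from the output above; the variable names are introduced only for this sketch.

```python
# Test confusion matrix reported above (rows: actual, columns: predicted).
ham_as_ham, ham_as_spam = 1440, 3
spam_as_ham, spam_as_spam = 54, 174

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
# Accuracy: share of all test messages that were classified correctly.
accuracy = (ham_as_ham + spam_as_spam) / total
# Precision: of the messages predicted as spam, how many really are spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of the actual spam messages, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
print(round(accuracy, 3), round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
```

The high precision and lower recall for spam indicate that the classifier rarely flags ham as spam but lets a noticeable share of spam through.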


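To check the top 10 features based on their coefficients from Naive Bayes, the following
code is handy. With two classes, the coefficients are the log probabilities of each
feature given the spam class, so values closer to zero mark the terms most typical of spam: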

In [150]:
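# printing top features
# get_feature_names_out() and feature_log_prob_ replace get_feature_names() and coef_,
# which have been removed from recent scikit-learn releases.
feature_names = vectorizer.get_feature_names_out()
spam_log_probs = clf.feature_log_prob_[1]   # row 1 is the 'spam' class (the old coef_[0])
coefs_with_fns = sorted(zip(spam_log_probs, feature_names))
print ("\n\nTop 10 features - both first & last\n")
n=10
# Pair the n lowest-scoring features with the n highest-scoring ones.
top_n_coefs = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top_n_coefs:
    print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2,fn_2))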



Top 10 features - both first & last

	-8.7130	1hr            		-5.5795	free           
	-8.7130	1st love       		-5.7187	txt            
	-8.7130	2go            		-5.8721	text           
	-8.7130	2morrow        		-6.0066	claim          
	-8.7130	2mrw           		-6.0704	stop           
	-8.7130	2nd inning     		-6.0785	mobil          
	-8.7130	2nd sm         		-6.1074	repli          
	-8.7130	30ish          		-6.1514	prize          
	-8.7130	3rd            		-6.2015	servic         
	-8.7130	3rd natur      		-6.2208	tone           
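
The two slices used to pick the first and last features are easy to misread, so here is a small sketch of the same idiom on made-up data. The pairs in `toy_pairs` are hypothetical and only loosely echo the values printed above.

```python
# Hypothetical (log-probability, feature) pairs standing in for the real model.
toy_pairs = [(-5.6, "free"), (-8.7, "1hr"), (-5.7, "txt"), (-8.7, "2go"), (-6.0, "claim")]

ranked = sorted(toy_pairs)        # ascending: lowest log-probability first
n = 2
lowest = ranked[:n]               # first n entries of the sorted list
highest = ranked[:-(n + 1):-1]    # last n entries, walked backwards from the end
print(lowest, highest)
```

In the real output, the identical value of -8.7130 in the left column is what additive smoothing produces for terms that were never observed in spam messages, so ties there are broken alphabetically by the sort.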
