In [5]:
import pandas as pd
import re
import nltk
nltk.download('stopwords')
messages = pd.read_csv("spamham.csv", usecols=['message', 'class'])
messages.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hrisi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,message,class
0,"Date: Wed, 21 Aug 2002 10:54:46 -05...",ham
1,"Martin A posted:\r\n\r\nTassos Papadopoulos, t...",ham
2,Man Threatens Explosion In Moscow \r\n\r\n\r\n...,ham
3,Klez: The Virus That Won't Die\r\n\r\n \r\n\r\...,ham
4,"> in adding cream to spaghetti carbonara, whi...",ham


In [6]:

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
corpus = []

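# Build the stop-word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

for text in messages['message']:
    # Keep letters only, lowercase, and split into tokens
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stop words and stem the remaining tokens
    review = ' '.join(ps.stem(word) for word in words if word not in stop_words)
    corpus.append(review)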

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

In [8]:
y = messages['class'].map({'ham':0, 'spam':1})

## Grid Search

- Why do we use Grid Search?

`GridSearchCV` exhaustively tries every combination of hyperparameter values from a grid that we supply. It takes the model and the parameter grid as input, scores each combination with cross-validation, and reports the best-scoring parameter values. By default it then refits the model on the full training data with those values, so the fitted search object can be used directly to make predictions.
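
To make the mechanics concrete, here is a minimal sketch of a grid search. The synthetic data from `make_classification`, the three candidate values, and the names `X_demo`, `y_demo` and `search` are assumptions for illustration only, not part of our spam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: illustration only, not the spam corpus)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The grid: every listed value is tried
param_grid = {"var_smoothing": [1e-11, 1e-9, 1e-7]}

# 5-fold cross-validation is run for each candidate value
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("mean CV accuracy of the best candidate:", search.best_score_)
print("candidates evaluated:", len(search.cv_results_["params"]))
```

With three candidates and `cv=5`, the search fits the model 15 times, plus one final refit on all of the data passed to `fit`.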

## Select the best model
- Below is a list of the text classification algorithms we will try. We will compare each model's score to see which one performs best.

### 1. Multinomial Naive Bayes Classifier

The multinomial NB classifier has a hyperparameter called **`alpha`**. It is the **smoothing parameter** to avoid **zero counts** when calculating the frequencies. 

For example, suppose we classify a new message containing the word "ryan", which never appears in the spam messages of our training dataset. The **likelihood** of this word given the spam class is then zero. This causes the **overall likelihood** for that class to be zero as well (because we take the product of all **individual likelihoods**), no matter what the other words in the message are.

Therefore, we add **additional counts** to each word when calculating the frequencies to avoid a zero likelihood value. **Alpha** indicates how many **additional counts** we add.
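
The effect of `alpha` can be sketched with a few lines of NumPy. The helper `smoothed_likelihoods` and the word counts below are hypothetical and only illustrate the additive smoothing formula `(count + alpha) / (total + alpha * vocabulary_size)`.

```python
import numpy as np

def smoothed_likelihoods(word_counts, alpha):
    """Per-word likelihoods for one class with additive smoothing."""
    counts = np.asarray(word_counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

# Hypothetical counts of three vocabulary words in spam; the second word never occurs
spam_counts = [3, 0, 7]

unsmoothed = smoothed_likelihoods(spam_counts, alpha=0.0)
smoothed = smoothed_likelihoods(spam_counts, alpha=1.0)

print(unsmoothed)  # the unseen word has likelihood 0, which zeroes out any product
print(smoothed)    # the unseen word now gets (0 + 1) / (10 + 1 * 3) = 1/13
```

The smoothed likelihoods still sum to 1, so they remain a valid probability distribution over the vocabulary.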

### 2. Gaussian Naive Bayes Classifier

There is one hyperparameter we need to tune: **`var_smoothing`**. This is the **portion of the largest variance** of all features that is added to variances for **calculation stability**.
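
As a small check of that definition, the sketch below fits `GaussianNB` on a made-up array (`X_demo` and `y_demo` are assumptions for illustration) and compares the fitted `epsilon_` attribute with `var_smoothing` times the largest feature variance. The first feature is constant on purpose: without the added term its variance would be zero.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: the first feature is constant, so its variance is exactly zero
X_demo = np.array([[1.0, 10.0],
                   [1.0, 20.0],
                   [1.0, 30.0],
                   [1.0, 40.0]])
y_demo = np.array([0, 0, 1, 1])

var_smoothing = 1e-9
gnb = GaussianNB(var_smoothing=var_smoothing).fit(X_demo, y_demo)

# Largest variance over all features (here the second column, variance 125)
largest_variance = X_demo.var(axis=0).max()

print("value added to every variance:", gnb.epsilon_)
print("var_smoothing * largest variance:", var_smoothing * largest_variance)
```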

### 3. SVC
SVC, or Support Vector Classifier, is a supervised machine learning algorithm typically used for classification tasks. SVC works by mapping data points to a high-dimensional space (implicitly, through a kernel function) and then finding the hyperplane that separates the two classes with the largest margin.
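
A minimal sketch of the idea on two well-separated clusters of made-up points (`X_demo`, `y_demo` and `clf` are illustrative names). A linear kernel is chosen here to keep the separating hyperplane easy to picture; scikit-learn's `SVC()` uses the `rbf` kernel by default.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up, clearly separated clusters
X_demo = np.array([[0.0, 0.0], [0.5, 0.5], [0.0, 1.0],
                   [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# One new point near each cluster
preds = clf.predict(np.array([[0.2, 0.3], [4.8, 4.6]]))
print("predictions:", preds)

# Only the points closest to the boundary define the hyperplane
print("support vectors:\n", clf.support_vectors_)
```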

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVC": SVC()
}

- ### We will create a generic function to check each model's performance so that we can compare them

In [14]:
from sklearn.model_selection import train_test_split

# Evaluate every model on the same train/test split and return a report
def evaluate_models(X, y, models):
    '''
    Takes the features X, the labels y and a dictionary of models.
    Splits the data into train and test sets, fits each model on the
    train set and scores it on the test set.
    Returns: DataFrame with each model's name and test accuracy
    '''
    # Separate the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    models_list = []
    scores = []

    for model_name, model in models.items():
        model.fit(X_train, y_train)  # Train model

        # Make predictions on the held-out test set
        y_pred = model.predict(X_test)
        score = accuracy_score(y_test, y_pred)

        print(f'---- score for --- {model_name} ----')
        print(f"{score}")
        models_list.append(model_name)
        scores.append(score)

    print()

    report = pd.DataFrame()
    report['Model_name'] = models_list
    report['Score'] = scores
    return report

In [15]:
report = evaluate_models(x, y, models)

---- score for --- Multinomial Naive Bayes ----
0.9387186629526463
---- score for --- Gaussian Naive Bayes ----
0.9623955431754875
---- score for --- SVC ----
0.9303621169916435



In [16]:
report.sort_values('Score')

Unnamed: 0,Model_name,Score
2,SVC,0.930362
0,Multinomial Naive Bayes,0.938719
1,Gaussian Naive Bayes,0.962396


- ### From the report above we can see that the Gaussian Naive Bayes model scored highest on the test split, so we will continue with the Gaussian Naive Bayes algorithm and tune it.

In [27]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

In [54]:
from sklearn.naive_bayes import GaussianNB
import numpy as np

# 20 candidate values drawn at random (exponential, scale 1e-9); unseeded, so they change on every run
params = {'var_smoothing': np.random.exponential(1e-9, 20)}
gnb_model = GaussianNB()
gnb_cv = GridSearchCV(gnb_model, params, cv = 10)
gnb_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ",gnb_cv.best_params_)
print("accuracy :",gnb_cv.best_score_)

tuned hyperparameters :(best parameters)  {'var_smoothing': 1.0182959106004854e-09}
accuracy : 0.9536887824235386
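
The candidates above come from `np.random.exponential`, so they cluster around 1e-9 and differ between runs. One possible alternative, sketched here as an assumption rather than what this notebook ran, is a fixed log-spaced grid that covers several orders of magnitude. The names `rng`, `random_candidates` and `grid_candidates` are illustrative.

```python
import numpy as np

# Seeded random draws, shown only for comparison with the fixed grid
rng = np.random.default_rng(0)
random_candidates = rng.exponential(1e-9, 20)

# Fixed grid: one value per order of magnitude from 1e-12 to 1e-6
grid_candidates = np.logspace(-12, -6, num=7)

print("random draws span:", random_candidates.min(), "to", random_candidates.max())
print("log-spaced grid:", grid_candidates)

# The grid could be passed to GridSearchCV as {'var_smoothing': grid_candidates}
```

A fixed grid makes the search reproducible and easier to compare between runs.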


In [55]:
from sklearn.metrics import confusion_matrix, accuracy_score

spam_detect_model = GaussianNB(**gnb_cv.best_params_)
spam_detect_model.fit(X_train, y_train)
y_pred = spam_detect_model.predict(X_test)
confusion_m = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the model is {accuracy}")
print(f"The confusion matrix is: \n{confusion_m}")



Accuracy of the model is 0.9623955431754875
The confusion matrix is: 
[[604   5]
 [ 22  87]]
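
Accuracy alone hides which kind of error the model makes. As a sketch, precision and recall for the spam class can be read off the confusion matrix printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels (ham = 0, spam = 1).

```python
import numpy as np

# Confusion matrix copied from the output above
confusion_m = np.array([[604, 5],
                        [22, 87]])

# Rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_m.ravel()

accuracy = (tp + tn) / confusion_m.sum()
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# prints: accuracy=0.9624 precision=0.9457 recall=0.7982
```

So about one in five spam messages in the test set still gets through as ham, even though overall accuracy is 96%.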


- The model performs well, reaching an accuracy of about 96% on the test set. The confusion matrix shows that most of the remaining errors are spam messages classified as ham (22 of the 109 spam messages). In our project we will keep all of these models and select among them based on performance.