#Use Case: Spam Mail Detection

In this use case we will be having bunch of mails and there fact wether they are spam mails or non-spam mails using naie bayes lets build a model which classify spam or non-spam mails

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_auc_score, f1_score , log_loss
from nltk.stem.porter import PorterStemmer
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

## Lets get the dataset from kaggle

In [2]:
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d venky73/spam-mails-dataset
!unzip spam-mails-dataset.zip 

Saving kaggle.json to kaggle.json
Downloading spam-mails-dataset.zip to /content
  0% 0.00/1.86M [00:00<?, ?B/s]
100% 1.86M/1.86M [00:00<00:00, 89.2MB/s]
Archive:  spam-mails-dataset.zip
  inflating: spam_ham_dataset.csv    


### Lets load dataset

In [61]:
dataset = pd.read_csv("spam_ham_dataset.csv")
dataset.head()

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


label : what type of mail it is
ham - good mails
spam - spam mails

text : mail body

label_num:
0 : ham
1: spam

##EDA on text

Lets see some random mails

In [62]:
# lets print some random emails
gen = 0
for rand in np.random.randint(0, high=len(dataset), size=3, dtype=int):
  gen +=1
  print("mail ----- "+ str(gen) + "  --- its type : "+str(dataset['label'][rand]))
  print(dataset['text'][rand])
  print("---------------------")

mail ----- 1  --- its type : ham
Subject: re : enerfin meter 980439 for 10 / 00
daren can you please extend deal # 422516 to cover flow of 9 dec . for
10 / 6 / 2000 and extend deal 432556 to cover flow of 44 dec . for 10 / 19 / 2000 ?
volume mgmt is trying to clear up these issues .
thanks
- jackie -
enron north america corp .
from : victor lamadrid 12 / 15 / 2000 11 : 52 am
to : jackie young / hou / ect @ ect
cc : sherlyn schumack / hou / ect @ ect , daren j farmer / hou / ect @ ect , rita
wynne / hou / ect @ ect
subject : re : enerfin meter 980439 for 10 / 00
jackie , talk to darren about this . the deal you reference is an hpl deal with
dynegy and i don ' t have access to it . i ' m on the east desk . yesterday i
extended the deal 421415 for the 6 th and 19 th and meredith inserted a path in
unify - tetco to cover the small overflow volume between hpl and ena . the
ena / hpl piece is done . the piece between hpl and dynegy is what you need
inserted . thanks
jackie 

## Lets do some preprocessing

Lets drop  duplicates

In [63]:
dataset = dataset[['text','label_num']]
#droping duplicates
print("Records with duplicates : " + str(len(dataset)))
dataset = dataset.drop_duplicates(keep='first', inplace=False)
print("Records without duplicates : " + str(len(dataset)))

Records with duplicates : 5171
Records without duplicates : 4993


Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

* Begin by removing the html tags
* Remove any punctuations or limited set of special characters like , or . or # etc.
* Check if the word is made up of english letters and is not alpha-numeric
* Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
* Convert the word to lowercase
* Remove Stopwords
* Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)



In [64]:
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase



In [65]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [66]:
import nltk
nltk.download('words')
from nltk.stem import PorterStemmer
ps = PorterStemmer()

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [67]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup
from tqdm import tqdm
words = set(nltk.corpus.words.words())
# tqdm is for printing the status bar
def preprocessing(sentance):
  sentance = re.sub(r"http\S+", "", sentance)
  sentance = BeautifulSoup(sentance, 'lxml').get_text()
  sentance = decontracted(sentance)
  sentance = re.sub("\S*\d+\S*", "", sentance).strip()
  sentance = re.sub('[^A-Za-z]+', ' ', sentance)
  # https://gist.github.com/sebleier/554280
  sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
  sentance =  " ".join(w for w in nltk.wordpunct_tokenize(sentance) if w.lower() in words or not w.isalpha())
  sentance =  " ".join(ps.stem(w) for w in nltk.wordpunct_tokenize(sentance))
  return sentance.strip()
dataset['text'] = dataset['text'].apply(preprocessing)


In [68]:
# lets print some random emails
gen = 0
for rand in np.random.randint(0, high=len(dataset), size=3, dtype=int):
  gen +=1
  print("mail ----- "+ str(gen) + "  --- its type : "+str(dataset['label_num'][rand]))
  print(dataset['text'][rand])
  print("---------------------")

mail ----- 1  --- its type : 1
subject pavilion v monitor w satellit pavilion v monitor w satellit part no p abb pavilion v monitor w satellit diagon maximum viewabl true color x resolut harman power satellit minimum appli visit one stop offic duti free latest clearanc sale list contact depart pleas send net contact via dell creativ cisco us canon lot contact u ex work duti free zone avail subject chang canada u e without notic receiv special plain text format repli mail request export not consid long includ contact inform remov messag intend dealer somehow gotten list error reason would like remov pleas repli remov subject line messag messag sent complianc feder legisl commerci e mail h r section paragraph e bill titl th u congress logo properti respect sale middl east may not exactli shown follow link click link copi past address browser pleas give effect
---------------------
mail ----- 2  --- its type : 0
subject global two firm meter global need make sure deal first meter second m

Train Test split

In [69]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(dataset['text'], dataset['label_num'], test_size=0.20, random_state=42)


## Bernouli Naive Bayes.

As we know Bernouli Naive bayes we can apply for binary data lets use CountVectoriser and convert our mails into binary text embeddings and use Bernouli Naive Bayes

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
binaryvectoriser = CountVectorizer(ngram_range = (1,2), binary = True)
X_train_bernouli = binaryvectoriser.fit_transform(X_train)

In [71]:
X_test_bernouli = binaryvectoriser.transform(X_test)

In [72]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import RandomizedSearchCV
distributions = dict(alpha = np.random.uniform(0,2,size=1000))
bernouli_clf = RandomizedSearchCV(BernoulliNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
bernouli_clf.fit(X_train_bernouli, y_train)


RandomizedSearchCV(cv=20, estimator=BernoulliNB(),
                   param_distributions={'alpha': array([4.50704952e-01, 1.55277675e+00, 1.12358272e+00, 1.22960752e+00,
       8.73763230e-01, 1.88130291e+00, 2.60289889e-01, 2.13723542e-01,
       1.72560272e+00, 4.46168489e-01, 8.31434219e-01, 1.28060399e+00,
       1.90079494e+00, 6.10457237e-01, 1.43388147e+00, 1.18195222e+00,
       4.49424393e-01, 7.56669410...
       1.68521638e+00, 1.03180320e+00, 1.95912791e+00, 1.29256716e+00,
       1.24934693e-01, 1.29735624e+00, 2.42849150e-01, 1.37261607e+00,
       1.52804717e+00, 2.12062000e-01, 1.11548761e-01, 1.79695269e+00,
       1.35100871e+00, 9.62455263e-01, 3.62124121e-01, 1.26508770e+00,
       1.26985773e+00, 1.38627505e+00, 6.68563641e-02, 9.15853181e-01,
       1.67658189e+00, 1.21199584e+00, 5.82650006e-01, 8.88121486e-01])},
                   random_state=0, scoring='accuracy')

In [73]:
y_pred_bernouli = bernouli_clf.predict(X_test_bernouli)
y_prob_bernouli = bernouli_clf.predict_proba(X_test_bernouli)

In [74]:
results = pd.DataFrame(data = np.zeros((4,7)),index = ['accuracy','f1_score','roc_auc','log_loss'], columns = ['Bernouli_binary_bow', 'Multinomial_bow','Complement_bow','Gaussian_bow','Gaussian_TF_IDF','Gaussian_Glove','Gaussian_averagew2v'])

In [79]:
from sklearn.metrics import accuracy_score
def result_check (type, y_true,y_pred,y_prob ):
  results[type]['accuracy'] = accuracy_score(y_true, y_pred)
  results[type]['f1_score'] = f1_score(y_true, y_pred)
  results[type]['roc_auc'] = roc_auc_score(y_true, y_prob[:, 1])
  results[type]['log_loss'] = log_loss(y_true, y_prob[:, 1])
  print("accuracy score : " , accuracy_score(y_test, y_pred))
  print("f1_Score : " ,f1_score(y_test, y_pred))
  print("roc_auc_score :" , roc_auc_score(y_test, y_prob[:, 1]))
  print("log_loss :" , log_loss(y_test, y_prob[:, 1]))
  return results


In [80]:
result_check('Bernouli_binary_bow',y_test,y_pred_bernouli,y_prob_bernouli)

accuracy score :  0.913913913913914
f1_Score :  0.8114035087719299
roc_auc_score : 0.9933382452262541
log_loss : 2.155820121854353


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.0,0.0,0.0,0.0,0.0,0.0
f1_score,0.811404,0.0,0.0,0.0,0.0,0.0,0.0
roc_auc,0.993338,0.0,0.0,0.0,0.0,0.0,0.0
log_loss,2.15582,0.0,0.0,0.0,0.0,0.0,0.0


# MultiNomial Naive Bayes

In [81]:
from sklearn.naive_bayes import MultinomialNB
multinomialvectoriser = CountVectorizer(ngram_range = (1,2), binary = False)
X_train_multinomial = multinomialvectoriser.fit_transform(X_train)
X_test_multinomial = multinomialvectoriser.transform(X_test)
distributions = dict(alpha = np.random.uniform(0,3,size=1000))
Multinomial_clf = RandomizedSearchCV(MultinomialNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Multinomial_clf.fit(X_train_multinomial, y_train)
y_pred_multinomial = Multinomial_clf.predict(X_test_multinomial)
y_prob_multinomial = Multinomial_clf.predict_proba(X_test_multinomial)
result_check('Multinomial_bow',y_test,y_pred_multinomial,y_prob_multinomial)

accuracy score :  0.975975975975976
f1_Score :  0.9540229885057471
roc_auc_score : 0.9938268762407646
log_loss : 0.3077857716624732


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.0,0.0,0.0,0.0,0.0
f1_score,0.811404,0.954023,0.0,0.0,0.0,0.0,0.0
roc_auc,0.993338,0.993827,0.0,0.0,0.0,0.0,0.0
log_loss,2.15582,0.307786,0.0,0.0,0.0,0.0,0.0


# Complement Naive Bayes



In [82]:
from sklearn.naive_bayes import ComplementNB
import warnings
warnings.filterwarnings("ignore")
Complementvectoriser = CountVectorizer(ngram_range = (1,2), binary = False,max_features=3000)
X_train_Complement = Complementvectoriser.fit_transform(X_train)
X_test_Complement = Complementvectoriser.transform(X_test)
distributions = dict(alpha = np.random.uniform(0,3,size=1000))
Complement_clf = RandomizedSearchCV(ComplementNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Complement_clf.fit(X_train_Complement.toarray(), np.array(y_train))
y_pred_Complement = Complement_clf.predict(X_test_Complement.toarray())
y_prob_Complement = Complement_clf.predict_proba(X_test_Complement.toarray())
result_check('Complement_bow',y_test,y_pred_Complement,y_prob_Complement)



accuracy score :  0.9529529529529529
f1_Score :  0.9140767824497258
roc_auc_score : 0.9827597675037352
log_loss : 0.5098126844607179


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.0,0.0,0.0,0.0
f1_score,0.811404,0.954023,0.914077,0.0,0.0,0.0,0.0
roc_auc,0.993338,0.993827,0.98276,0.0,0.0,0.0,0.0
log_loss,2.15582,0.307786,0.509813,0.0,0.0,0.0,0.0


# Gaussian Naive bayes with Bag Of Words


In [83]:
from sklearn.naive_bayes import GaussianNB
Gaussianvectoriser = CountVectorizer(ngram_range = (1,2), binary = False,max_features=3000)
X_train_Gaussian = Gaussianvectoriser.fit_transform(X_train)
X_test_Gaussian = Gaussianvectoriser.transform(X_test)
distributions = dict(var_smoothing = np.random.uniform(0.00000001,0.0000000001,size=1000))
Gaussian_clf = RandomizedSearchCV(GaussianNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Gaussian_clf.fit(X_train_Gaussian.toarray(), y_train)
y_pred_Gaussian = Gaussian_clf.predict(X_test_Gaussian.toarray())
y_prob_Gaussian = Gaussian_clf.predict_proba(X_test_Gaussian.toarray())
result_check('Gaussian_bow',y_test,y_pred_Gaussian,y_prob_Gaussian)



accuracy score :  0.9269269269269269
f1_Score :  0.8760611205432937
roc_auc_score : 0.938578313992755
log_loss : 2.523905756828182


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.926927,0.0,0.0,0.0
f1_score,0.811404,0.954023,0.914077,0.876061,0.0,0.0,0.0
roc_auc,0.993338,0.993827,0.98276,0.938578,0.0,0.0,0.0
log_loss,2.15582,0.307786,0.509813,2.523906,0.0,0.0,0.0


## Gaussian Naive bayes with TF-IDF

In [84]:
from sklearn.naive_bayes import GaussianNB
Gaussianvectoriser_tfidf = TfidfVectorizer(ngram_range=(1,2),max_features=3000)
X_train_Gaussian_tfidf= Gaussianvectoriser_tfidf.fit_transform(X_train)
X_test_Gaussian_tfidf = Gaussianvectoriser_tfidf.transform(X_test)
distributions = dict()
Gaussian_clf_tfidf = RandomizedSearchCV(GaussianNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Gaussian_clf_tfidf.fit(X_train_Gaussian_tfidf.toarray(), y_train)
y_pred_Gaussian_tfidf = Gaussian_clf_tfidf.predict(X_test_Gaussian_tfidf.toarray())
y_prob_Gaussian_tfidf = Gaussian_clf_tfidf.predict_proba(X_test_Gaussian_tfidf.toarray())
result_check('Gaussian_TF_IDF',y_test,y_pred_Gaussian_tfidf,y_prob_Gaussian_tfidf)


accuracy score :  0.9429429429429429
f1_Score :  0.8980322003577818
roc_auc_score : 0.9417045291745972
log_loss : 1.9707137517563038


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.926927,0.942943,0.0,0.0
f1_score,0.811404,0.954023,0.914077,0.876061,0.898032,0.0,0.0
roc_auc,0.993338,0.993827,0.98276,0.938578,0.941705,0.0,0.0
log_loss,2.15582,0.307786,0.509813,2.523906,1.970714,0.0,0.0


# Gaussian Naive Bayes with Glove

Downloading Glove

In [25]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip



--2022-10-12 14:39:34--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-10-12 14:39:34--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-10-12 14:42:15 (5.12 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [23]:
glove = {}
with open('glove.6B.300d.txt',encoding='utf-8') as f: #taking 300 dimesions
  for line in tqdm(f):
    word = line.split() #it is stored as string like this "'the': '.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.4"
    glove[word[0]] = np.asarray(word[1:], dtype='float32')

400000it [00:24, 16283.05it/s]


In [85]:
dataset = pd.read_csv("spam_ham_dataset.csv")
dataset = dataset[['text','label_num']]
dataset = dataset.drop_duplicates(keep='first', inplace=False)
def preprocessing(sentance):
  sentance = re.sub(r"http\S+", "", sentance)
  sentance = BeautifulSoup(sentance, 'lxml').get_text()
  sentance = decontracted(sentance)
  sentance = re.sub("\S*\d+\S*", "", sentance).strip()
  sentance = re.sub('[^A-Za-z]+', ' ', sentance)
  # https://gist.github.com/sebleier/554280
  sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
  sentance =  " ".join(w for w in nltk.wordpunct_tokenize(sentance) if w.lower() in words or not w.isalpha())
  return sentance.strip()
dataset['text'] = dataset['text'].apply(preprocessing)
X_train,X_test,y_train,y_test = train_test_split(dataset['text'], dataset['label_num'], test_size=0.20, random_state=42)


In [86]:
X_train_Glove = []
for row in tqdm(X_train):
  glove_embed=np.zeros(300)
  val = 0
  for token in row.split():
    if token in glove.keys():
      val+=1
      glove_embed +=glove[token]
    else:
      pass
  X_train_Glove.append(glove_embed/val)



100%|██████████| 3994/3994 [00:00<00:00, 5782.80it/s]


In [87]:
X_test_Glove = []
for row in tqdm(X_test):
  glove_embed=np.zeros(300)
  val =0
  for token in row.split():
    if token in glove.keys():
      val+=1
      glove_embed +=glove[token]
    else:
      pass
  X_test_Glove.append(glove_embed/val)

100%|██████████| 999/999 [00:00<00:00, 5641.99it/s]


In [89]:
distributions = dict()
Gaussian_clf_Glove = RandomizedSearchCV(GaussianNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Gaussian_clf_Glove.fit(X_train_Glove, y_train)
y_pred_Gaussian_Glove = Gaussian_clf_Glove.predict(X_test_Glove)
y_prob_Gaussian_Glove = Gaussian_clf_Glove.predict_proba(X_test_Glove)
result_check('Gaussian_Glove',y_test,y_pred_Gaussian_Glove,y_prob_Gaussian_Glove)

accuracy score :  0.8588588588588588
f1_Score :  0.7622259696458684
roc_auc_score : 0.9391129940033974
log_loss : 1.3877585359460587


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.926927,0.942943,0.858859,0.0
f1_score,0.811404,0.954023,0.914077,0.876061,0.898032,0.762226,0.0
roc_auc,0.993338,0.993827,0.98276,0.938578,0.941705,0.939113,0.0
log_loss,2.15582,0.307786,0.509813,2.523906,1.970714,1.387759,0.0


# Gaussian Naive Bayes with Word2Vec

Need More RAM 

In [28]:
#! kaggle datasets download -d leadbest/googlenewsvectorsnegative300

In [29]:
#! unzip /content/googlenewsvectorsnegative300.zip


In [91]:
w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True,limit = 50000)
#print(w2v_model.wv.most_similar('great'))
X_train_W2V = []
#print(list(w2v_model.wv.vocab))
for row in tqdm(X_train):
  word2vec_embed=np.zeros(300)
  val =0
  for token in row.split():
    if token in list(w2v_model.wv.vocab):
      val +=1
      word2vec_embed +=w2v_model.wv[token]
    else:
      pass
  if val ==0:
    val =1
  X_train_W2V.append(word2vec_embed/val)

X_test_W2V = []
for row in tqdm(X_test):
  word2vec_embed=np.zeros(300)
  val =0
  for token in row.split():
    if token in list(w2v_model.wv.vocab):
      val+=1
      word2vec_embed +=w2v_model.wv[token]
    else:
      pass
  if val ==0:
    val =1
  X_test_W2V.append(word2vec_embed/val)

distributions = dict()
Gaussian_clf_w2v = RandomizedSearchCV(GaussianNB(), distributions, random_state=0,scoring = 'accuracy',cv=20)
Gaussian_clf_w2v.fit(X_train_W2V, y_train)
y_pred_Gaussian_w2v = Gaussian_clf_w2v.predict(X_test_W2V)
y_prob_Gaussian_w2v = Gaussian_clf_w2v.predict_proba(X_test_W2V)
result_check('Gaussian_averagew2v',y_test,y_pred_Gaussian_w2v,y_prob_Gaussian_w2v)

100%|██████████| 3994/3994 [03:52<00:00, 17.18it/s]
100%|██████████| 999/999 [00:58<00:00, 17.16it/s]


accuracy score :  0.7897897897897898
f1_Score :  0.6947674418604651
roc_auc_score : 0.9150907676879311
log_loss : 2.2310575902144483


Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.926927,0.942943,0.858859,0.78979
f1_score,0.811404,0.954023,0.914077,0.876061,0.898032,0.762226,0.694767
roc_auc,0.993338,0.993827,0.98276,0.938578,0.941705,0.939113,0.915091
log_loss,2.15582,0.307786,0.509813,2.523906,1.970714,1.387759,2.231058


# Results Comparision


In [92]:
results

Unnamed: 0,Bernouli_binary_bow,Multinomial_bow,Complement_bow,Gaussian_bow,Gaussian_TF_IDF,Gaussian_Glove,Gaussian_averagew2v
accuracy,0.913914,0.975976,0.952953,0.926927,0.942943,0.858859,0.78979
f1_score,0.811404,0.954023,0.914077,0.876061,0.898032,0.762226,0.694767
roc_auc,0.993338,0.993827,0.98276,0.938578,0.941705,0.939113,0.915091
log_loss,2.15582,0.307786,0.509813,2.523906,1.970714,1.387759,2.231058
