# Spam Classification Dataset

# Building Word2Vec from scratch and use of AvgWord2Vec

In [1]:
import pandas as pd

In [2]:
messages= pd.read_csv('spamclassification.csv')

In [3]:
messages

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


In [4]:
messages.shape

(5574, 2)

In [5]:
messages['message'].loc[100]

"Please don't text me anymore. I have nothing else to say."

## Data Cleaning and Preprocessing

In [6]:
import re #regular expression library
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # Porter stemmer, used to reduce words to their stems
ps= PorterStemmer()

1. Initialize an empty list, corpus, to store the processed messages.
2. Loop over the messages.
3. Replace every character that is not a letter or a digit with a space.
4. Convert the text to lowercase.
5. Split the message into words.
6. Drop the stopwords and stem each remaining word.
7. Join the list of words back into a single string.
8. Append the cleaned message to the corpus list.

In [8]:
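corpus = []
stop_words = set(stopwords.words('english'))  # build the stopword set once; set lookup is fast

for i in range(len(messages)):
    # keep letters and digits only ('A-z' would also let [ \ ] ^ _ ` through)
    review = re.sub('[^a-zA-Z0-9]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()

    # drop stopwords, then stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)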

In [9]:
corpus

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri question std txt rate c appli 08452810075over18',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl 3 week word back like fun still tb ok xxx std chg send 1 50 rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun',
 'winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour',
 'mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030',
 'gonna home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash 100 20 000 pound txt csh11 send 87575 cost 150p day 6day 16 tsandc appli repli hl 4 info',
 'urgent 1 week free mem

## 1. Creating bag of words model


In [10]:
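from sklearn.feature_extraction.text import CountVectorizer  # builds the bag-of-words matrix
# max_features keeps the 2500 most frequent terms to limit dimensionality;
# binary=True records presence/absence instead of counts; ngram_range=(2,2) uses bigrams only
cv = CountVectorizer(max_features=2500, binary=True, ngram_range=(2,2))
X = cv.fit_transform(corpus).toarray()  # sparse matrix converted to a dense array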

In [11]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [12]:
X.shape

(5574, 2500)

#### Label encoding

In [16]:
# target variable y
y = pd.get_dummies(messages['label'])  # one-hot encode the label into 'ham' and 'spam' columns
y = y.iloc[:, 1].values  # keep the 'spam' column: True for spam, False for ham

In [17]:
y

array([False, False,  True, ..., False, False, False])

#### Train Test Split

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.20, random_state=0)
#20% data for testing, 80% for training

In [19]:
X_train, y_train

(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64),
 array([False, False, False, ...,  True, False, False]))

#### Using naive bayes for classification

In [20]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model= MultinomialNB().fit(X_train, y_train)  # fit trains the model on the training data

#### Predictions

In [21]:
y_pred= spam_detect_model.predict(X_test)

In [22]:
from sklearn.metrics import accuracy_score, classification_report

In [23]:
score= accuracy_score(y_test, y_pred)
print(score)

0.9659192825112107


#### Classification Report

In [24]:
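from sklearn.metrics import classification_report
# Note: the signature is classification_report(y_true, y_pred). The arguments are swapped here,
# so the precision and recall columns in the output below are exchanged.
print(classification_report(y_pred, y_test))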

              precision    recall  f1-score   support

       False       1.00      0.96      0.98       986
        True       0.78      0.99      0.87       129

    accuracy                           0.97      1115
   macro avg       0.89      0.98      0.93      1115
weighted avg       0.97      0.97      0.97      1115
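
Because the report above was produced with the arguments in the order (y_pred, y_test), its precision and recall columns trade places relative to the usual (y_true, y_pred) convention. The following is a small plain-Python sketch of that effect. The helper precision_recall and the toy labels are made up for illustration and are not part of the notebook.

```python
def precision_recall(y_true, y_pred, positive=True):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (hypothetical): 4 spam messages, only 2 of them detected, no false alarms
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, False, False, False, False, False, False]

correct_order = precision_recall(y_true, y_pred)   # (precision, recall)
swapped_order = precision_recall(y_pred, y_true)   # arguments exchanged

print(correct_order)  # (1.0, 0.5)
print(swapped_order)  # (0.5, 1.0)
```

Read this way, the "True" row of the report above suggests a spam precision of about 0.99 and a spam recall of about 0.78, not the other way round.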



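Overall:

1. Accuracy: 0.97 (the proportion of correctly classified instances)
2. Macro average (unweighted mean of the two per-class values):
    1. Precision: (1.00 + 0.78) / 2 = 0.89
    2. Recall: (0.96 + 0.99) / 2 = 0.975 ≈ 0.98
    3. F1-score: (0.98 + 0.87) / 2 = 0.925 ≈ 0.93
3. Weighted average (per-class values weighted by support):
    1. Precision: (1.00 * 986 + 0.78 * 129) / 1115 ≈ 0.97
    2. Recall: (0.96 * 986 + 0.99 * 129) / 1115 ≈ 0.96 from these rounded values; the report shows 0.97 because it uses the unrounded per-class scores
    3. F1-score: (0.98 * 986 + 0.87 * 129) / 1115 ≈ 0.97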
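
The averages above can be re-derived with a few lines of plain Python. This is only a sketch that recomputes them from the rounded per-class values printed in the report; macro_avg and weighted_avg are illustrative helper names, not scikit-learn functions. Because the inputs are rounded to two decimals, the last digit can differ from the report.

```python
# Per-class values copied from the report: (precision, recall, f1-score, support)
per_class = {
    "False": (1.00, 0.96, 0.98, 986),
    "True": (0.78, 0.99, 0.87, 129),
}

def macro_avg(values):
    """Unweighted mean over the classes."""
    return sum(values) / len(values)

def weighted_avg(values, supports):
    """Mean over the classes, weighted by the number of samples in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

supports = [row[3] for row in per_class.values()]
for idx, name in enumerate(["precision", "recall", "f1-score"]):
    values = [row[idx] for row in per_class.values()]
    print(name, round(macro_avg(values), 3), round(weighted_avg(values, supports), 3))
```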

## 2. Creating the TFIDF Model

1. ngram_range=(1,2):-
(1,2) specifies that both unigrams (single words) and bigrams (pairs of consecutive words) should be considered.-
So, the feature set will include individual words as well as pairs of consecutive words.
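
To make the effect of ngram_range concrete, here is a minimal plain-Python sketch of word n-gram extraction. The function word_ngrams is a hypothetical helper written for this illustration. It splits on whitespace only, whereas scikit-learn's vectorizers use their own tokenizer, so treat it as an approximation of the idea, not of the library's exact output.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for every n in the inclusive range."""
    tokens = text.split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # slide a window of n tokens across the message
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("free entri win", (1, 2)))
# ['free', 'entri', 'win', 'free entri', 'entri win']
print(word_ngrams("free entri win", (2, 2)))
# ['free entri', 'entri win']
```

With (2,2) the single words disappear from the feature set, which is why a bigram-only model sees far fewer matches per message.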

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv= TfidfVectorizer(max_features=2500, ngram_range=(1,2))
X = tv.fit_transform(corpus).toarray()


#### Train Test Split

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=0)

#### Using naive bayes for classification

In [32]:
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train, y_train)

#### Predictions

In [33]:
y_pred= spam_detect_model.predict(X_test)

In [34]:
score= accuracy_score(y_test, y_pred)
print(score)

0.9766816143497757


#### With TF-IDF we got better accuracy (0.977) than with bag of words (0.966)
#### With ngram_range=(2,2), the reported accuracy drops

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv= TfidfVectorizer(max_features=2500, ngram_range=(2,2))
X = tv.fit_transform(corpus).toarray()

# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=0)

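# Note: spam_detect_model is not refitted here. It was trained on the (1,2) features above,
# so the score below mixes two different feature sets; refit with
# MultinomialNB().fit(X_train, y_train) before predicting for a fair comparison.
y_pred= spam_detect_model.predict(X_test)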

score= accuracy_score(y_test, y_pred)
print(score)

0.8681614349775785


#### Using Random Forest Classifier

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv= TfidfVectorizer(max_features=2500, ngram_range=(1,2))
X = tv.fit_transform(corpus).toarray()

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=0)

In [42]:
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier()
classifier.fit(X_train, y_train)

In [43]:
y_pred= classifier.predict(X_test)

In [44]:
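# accuracy is symmetric, but classification_report expects (y_true, y_pred);
# with the arguments swapped, precision and recall are exchanged in the output below
print(accuracy_score(y_pred,y_test))
print(classification_report(y_pred,y_test))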

0.97847533632287
              precision    recall  f1-score   support

       False       1.00      0.98      0.99       974
        True       0.85      1.00      0.92       141

    accuracy                           0.98      1115
   macro avg       0.93      0.99      0.95      1115
weighted avg       0.98      0.98      0.98      1115



### Slightly better results with TF-IDF and random forest (0.978) than with naive Bayes (0.977)

## 3. Word2Vec Implementation

In [45]:
!pip install gensim






#### i) Using Google's pretrained model

In [46]:
import gensim.downloader as api
wv =  api.load('word2vec-google-news-300')

In [47]:
vec_king= wv['king']

In [48]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

#### Seeing how model performs with wordnet lemmatizer

In [49]:
from nltk.stem import WordNetLemmatizer
lemmatizer= WordNetLemmatizer()

In [50]:
corpus=[]

for i in range(0, len(messages)):
    review= re.sub('[^a-zA-Z0-9]', " ", messages['message'][i])
    review= review.lower()
    review= review.split()

    review= [lemmatizer.lemmatize(word) for word in review if not word in stopwords.words('english')]
    review=' '.join(review)
    corpus.append(review)

In [51]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [54]:
corpus[10]

'gonna home soon want talk stuff anymore tonight k cried enough today'

In [55]:
words=[]
for doc in corpus:
    for sent in sent_tokenize(doc):
        # simple_preprocess lowercases and tokenizes; by default it drops
        # tokens shorter than 2 or longer than 15 characters
        words.append(simple_preprocess(sent))

In [56]:
words

[['go',
  'jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'great',
  'world',
  'la',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['ok', 'lar', 'joking', 'wif', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'fa',
  'cup',
  'final',
  'tkts',
  'st',
  'may',
  'text',
  'fa',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'apply',
  'over'],
 ['dun', 'say', 'early', 'hor', 'already', 'say'],
 ['nah', 'think', 'go', 'usf', 'life', 'around', 'though'],
 ['freemsg',
  'hey',
  'darling',
  'week',
  'word',
  'back',
  'like',
  'fun',
  'still',
  'tb',
  'ok',
  'xxx',
  'std',
  'chgs',
  'send',
  'rcv'],
 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent'],
 ['per',
  'request',
  'melle',
  'melle',
  'oru',
  'minnaminunginte',
  'nurungu',
  'vettam',
  'set',
  'callertune',
  'caller',
  'press',
  'copy',
  'friend',
  'callertune'],
 ['winner',
  'valued',
  'network',
  'customer',
  'selected',
  're

### ii) Now building/creating word2vec from scratch

In [57]:
import gensim

In [58]:
model= gensim.models.Word2Vec(words,window=5, min_count=2)
# min_count=2: any word that appears fewer than two times in the corpus is ignored
# vector_size defaults to 100, so each word vector has 100 dimensions

#### Seeing all the vocabulary

In [59]:
model.wv.index_to_key

['call',
 'get',
 'ur',
 'gt',
 'go',
 'lt',
 'ok',
 'free',
 'day',
 'know',
 'come',
 'like',
 'good',
 'time',
 'got',
 'text',
 'love',
 'want',
 'send',
 'one',
 'need',
 'txt',
 'today',
 'going',
 'stop',
 'home',
 'lor',
 'sorry',
 'see',
 'still',
 'mobile',
 'take',
 'back',
 'da',
 'dont',
 'reply',
 'think',
 'tell',
 'week',
 'hi',
 'phone',
 'new',
 'later',
 'please',
 'pls',
 'co',
 'msg',
 'dear',
 'make',
 'night',
 'message',
 'well',
 'say',
 'min',
 'thing',
 'much',
 'claim',
 'oh',
 'hope',
 'great',
 'hey',
 'number',
 'give',
 'happy',
 'work',
 'wat',
 'friend',
 'yes',
 'way',
 'www',
 'let',
 'prize',
 'right',
 'tomorrow',
 'already',
 'ask',
 'said',
 'win',
 'cash',
 'amp',
 'life',
 'im',
 'yeah',
 'really',
 'tone',
 'babe',
 'meet',
 'find',
 'miss',
 'morning',
 'service',
 'last',
 'thanks',
 'uk',
 'com',
 'care',
 'would',
 'anything',
 'year',
 'nokia',
 'also',
 'lol',
 'feel',
 'every',
 'keep',
 'sure',
 'pick',
 'contact',
 'sent',
 'urgent',


In [60]:
model.corpus_count

5567

In [61]:
model.epochs
# epochs is the number of passes the training makes over the entire corpus

5

In [62]:
model.wv.similar_by_word('love')

[('thing', 0.9997715353965759),
 ('need', 0.9997528791427612),
 ('one', 0.9997496604919434),
 ('miss', 0.9997262358665466),
 ('go', 0.999724805355072),
 ('dont', 0.9997167587280273),
 ('make', 0.9997159242630005),
 ('give', 0.9997117519378662),
 ('life', 0.9997108578681946),
 ('amp', 0.9997093677520752)]

In [63]:
model.wv['kid'].shape

(100,)

## 3.2 Avg Word2Vec

Function definition:

1. def avg_word2vec(doc): defines a function named avg_word2vec that takes a single argument, doc (a list of tokens).
2. [model.wv[word] for word in doc if word in model.wv.index_to_key]: a list comprehension that iterates over each word in doc, keeps only the words present in the Word2Vec model's vocabulary (model.wv.index_to_key), and retrieves the vector of each one (model.wv[word]).
3. np.mean(..., axis=0): averages those vectors along axis 0 (element-wise across the words), which gives the average word vector for the document. If none of the words in doc are in the vocabulary, the list is empty and np.mean returns nan instead of a vector.

In [64]:
import numpy as np
def avg_word2vec(doc):

    return np.mean([model.wv[word] for word in doc if word in model.wv.index_to_key],axis=0)

tqdm is a Python library that provides a fast, extensible progress bar for loops and other iterables. The name comes from the Arabic word "taqaddum", which means "progress."

In [65]:
!pip install tqdm









In [66]:
from tqdm import tqdm

In [67]:
words[73]

['performed']

In [69]:
type(model.wv.index_to_key)

list


In [70]:
X=[]

for i in tqdm(range(len(words))):
    X.append(avg_word2vec(words[i]))

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|████████████████████████████████████████████████████████████████████████████| 5567/5567 [00:00<00:00, 6985.53it/s]


1. X: This list will store the average word vectors for each document.

2. for i in tqdm(range(len(words))):: This loop iterates over the indices of the words list, and tqdm is used to display a progress bar for the loop.

3. X.append(avg_word2vec(words[i])): Inside the loop, the avg_word2vec function is called for each document (words[i]), and the resulting average word vector is appended to the list X.

In [87]:
type(X)

list

#### Converting X from a list to an array

In [91]:
import numpy as np

# Assuming X is a list of word vectors with potentially different dimensions
max_dimensionality = max(len(wv) for wv in X)

# Initialize an array filled with zeros
X_new = np.zeros((len(X), max_dimensionality))

# Copy each word vector into the corresponding row
for i, wv in enumerate(X):
    X_new[i, :len(wv)] = wv


TypeError: object of type 'numpy.float64' has no len()
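
The TypeError above most likely arises because X does not contain only vectors. When none of a document's words are in the vocabulary, the list passed to np.mean is empty, and np.mean returns nan (a numpy.float64 scalar) together with the warnings that appear in the tqdm output. One possible fix, sketched below in plain Python, is to fall back to a zero vector for such documents. The names avg_vector, toy_vectors, and dim are illustrative assumptions; toy_vectors is a made-up dictionary standing in for model.wv.

```python
def avg_vector(doc, vectors, dim):
    """Average the vectors of the in-vocabulary tokens of `doc`.

    Falls back to a zero vector of length `dim` when no token is known,
    so every document yields a vector of the same length.
    """
    found = [vectors[word] for word in doc if word in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) groups the i-th component of every vector together
    return [sum(component) / len(found) for component in zip(*found)]

# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {"free": [1.0, 0.0, 2.0], "win": [3.0, 2.0, 0.0]}
docs = [["free", "win"], ["performed"], []]

X_toy = [avg_vector(doc, toy_vectors, dim=3) for doc in docs]
print(X_toy)  # [[2.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

In the notebook, the same idea would amount to returning np.zeros(model.vector_size) when the comprehension is empty (assuming gensim 4's vector_size attribute), after which np.array(X) should give a regular 2-D array. Note also that words has 5567 entries while messages has 5574 rows, presumably because some messages became empty after cleaning, so y would need to be restricted to the same rows before the train/test split.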

#### Train and Test Split