### Ideal Steps for Machine Learning:

1. Text Pre-processing and Cleaning
2. Train Test Split
3. BOW and TF-IDF
4. Train the models

## Ideally run this in Colab

## Libraries used in this project
- gensim -----> Word2Vec
- gensim -----> simple_preprocess
- nltk -------> WordNetLemmatizer
- tqdm -------> tqdm (progress bar)
- sklearn ----> RandomForestClassifier

### Libraries used earlier
- pandas -----> read_csv, get_dummies [OHE]
- nltk -------> sent_tokenize, word_tokenize
- nltk -------> PorterStemmer, WordNetLemmatizer, pos_tag (Part of Speech)
- nltk -------> PorterStemmer, stopwords
- sklearn ----> CountVectorizer [*BagOfWords*]
- sklearn ----> TfidfVectorizer [*TF-IDF*]
- sklearn ----> naive_bayes.MultinomialNB [Classification Model]
- sklearn ----> metrics.classification_report
- sklearn ----> metrics.accuracy_score




In [108]:
!pip install gensim



In [109]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

## Pre-trained word2vec model: 'word2vec-google-news-300'

In [110]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']

In [111]:
import pandas as pd
messages=pd.read_csv('/content/SMSSpamCollection',
                    sep='\t',names=["label","message"])

In [112]:
messages[:5]

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [113]:
## Data Cleaning And Preprocessing
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [114]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

In [120]:
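# Undo an accidental shadowing of the builtin len (a bare `del len` raises NameError on a fresh kernel)
globals().pop('len', None)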

In [121]:
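corpus=[]
stop_words=set(stopwords.words('english'))  # build the stopword set once, not once per word
for i in range(0,len(messages)):
    # keep letters only; the range must be A-Z (A-z would also keep [ \ ] ^ _ and the backtick)
    review=re.sub('[^a-zA-Z]',' ',messages['message'][i])
    review=review.lower()
    review=review.split()
    review=[lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)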
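
A common typo in this kind of character class is writing `A-z` instead of `A-Z`. The range `A-z` spans ASCII 65 to 122, so it also keeps the six punctuation characters that sit between `Z` and `a`. The sketch below compares the two patterns on an invented string; `text` is a made-up example, not a row from the dataset.

```python
import re

text = "win_cash [now] ^_^ `free`"

# Typo version: A-z also matches [ \ ] ^ _ and the backtick, so nothing is removed here
loose = re.sub('[^a-zA-z]', ' ', text)

# Intended version: everything except ASCII letters becomes a space
strict = re.sub('[^a-zA-Z]', ' ', text)

print(loose.split())
print(strict.split())
```

With the strict pattern, only alphabetic tokens survive the split.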

In [122]:
# print the first 5 elements of the corpus in separate lines and its respective label
for i in range(5):
    print(corpus[i])
    print(messages['label'][i])

go jurong point crazy available bugis n great world la e buffet cine got amore wat
ham
ok lar joking wif u oni
ham
free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply
spam
u dun say early hor u c already say
ham
nah think go usf life around though
ham


In [123]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [124]:
# Download the punkt_tab resource if not already present
import nltk
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')

words=[]
for sentence in corpus:
  sent_tokenized=sent_tokenize(sentence)
  for sent in sent_tokenized:
    words.append(simple_preprocess(sent))

In [125]:
words[:5]

[['go',
  'jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'great',
  'world',
  'la',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['ok', 'lar', 'joking', 'wif', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'fa',
  'cup',
  'final',
  'tkts',
  'st',
  'may',
  'text',
  'fa',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'apply'],
 ['dun', 'say', 'early', 'hor', 'already', 'say'],
 ['nah', 'think', 'go', 'usf', 'life', 'around', 'though']]
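
Comparing the cleaned text with the token lists above, single-letter tokens such as `u`, `n`, `e`, and `c` are gone. `simple_preprocess` discards tokens shorter than `min_len` (default 2) or longer than `max_len` (default 15). The sketch below approximates only that length filter in plain Python; `length_filter` is a hypothetical helper, not gensim's implementation.

```python
def length_filter(text, min_len=2, max_len=15):
    """Approximate the token-length filter applied by gensim's simple_preprocess."""
    return [tok for tok in text.lower().split() if min_len <= len(tok) <= max_len]

cleaned = "u dun say early hor u c already say"
tokens = length_filter(cleaned)
print(tokens)
```

On this cleaned message the filter reproduces the fourth token list shown above.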

# Train word2vec from scratch

In [126]:
model=gensim.models.Word2Vec(words,window=5,min_count=5,epochs=20)
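
With `min_count=5`, Word2Vec ignores every word whose total frequency in the training sentences is below 5, and `window=5` sets the maximum distance between the current word and a context word. A rough sketch of the vocabulary filtering idea on toy data follows; the `sentences` list and the lower threshold are invented for illustration, and this is not gensim's internal code.

```python
from collections import Counter

# Toy training sentences (already tokenized)
sentences = [["free", "entry", "win"], ["free", "call"], ["call", "free", "now"]]
min_count = 2  # toy threshold; the notebook uses 5

# Count every word across all sentences
counts = Counter(word for sentence in sentences for word in sentence)

# Keep only words that occur at least min_count times, most frequent first
vocab = [word for word, n in counts.most_common() if n >= min_count]
print(vocab)
```

Rare words never get a vector, which is why some tokens of a message can be missing from the trained vocabulary.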

In [127]:
model.wv.index_to_key

['call',
 'get',
 'ur',
 'gt',
 'lt',
 'go',
 'day',
 'ok',
 'free',
 'know',
 'come',
 'like',
 'time',
 'good',
 'got',
 'love',
 'text',
 'want',
 'send',
 'need',
 'one',
 'txt',
 'today',
 'going',
 'stop',
 'home',
 'lor',
 'sorry',
 'see',
 'still',
 'mobile',
 'take',
 'back',
 'da',
 'reply',
 'dont',
 'think',
 'tell',
 'week',
 'phone',
 'hi',
 'new',
 'please',
 'later',
 'pls',
 'co',
 'msg',
 'min',
 'dear',
 'night',
 'make',
 'message',
 'well',
 'say',
 'thing',
 'much',
 'claim',
 'hope',
 'great',
 'oh',
 'hey',
 'give',
 'number',
 'happy',
 'friend',
 'wat',
 'work',
 'way',
 'yes',
 'www',
 'prize',
 'let',
 'right',
 'tomorrow',
 'already',
 'tone',
 'ask',
 'said',
 'win',
 'cash',
 'amp',
 'life',
 'yeah',
 'im',
 'really',
 'meet',
 'babe',
 'find',
 'miss',
 'morning',
 'last',
 'year',
 'service',
 'uk',
 'thanks',
 'care',
 'anything',
 'would',
 'com',
 'also',
 'nokia',
 'lol',
 'feel',
 'every',
 'keep',
 'pick',
 'sure',
 'urgent',
 'sent',
 'contact',


In [128]:
model.corpus_count

5564

In [129]:
model.epochs

20

In [130]:
model.wv.similar_by_word('good')

[('morning', 0.9707043170928955),
 ('night', 0.9279810190200806),
 ('sweet', 0.9277861714363098),
 ('hope', 0.9242984056472778),
 ('dream', 0.9205161333084106),
 ('ended', 0.8970087170600891),
 ('boy', 0.8921810984611511),
 ('day', 0.8913053870201111),
 ('afternoon', 0.8911586403846741),
 ('princess', 0.8885772228240967)]
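
`similar_by_word` ranks vocabulary words by cosine similarity to the query word's vector. Below is a minimal sketch of that measure on made-up 3-dimensional vectors; the `toy_vectors` values are invented, and the real vectors here have 100 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, for illustration only
toy_vectors = {
    "good": np.array([1.0, 2.0, 0.0]),
    "morning": np.array([2.0, 4.0, 0.1]),
    "prize": np.array([-1.0, 0.5, 3.0]),
}

query = toy_vectors["good"]
ranked = sorted(
    ((w, cosine_similarity(query, v)) for w, v in toy_vectors.items() if w != "good"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

The neighbours of `good` in the output above (`morning`, `night`, `afternoon`) likely reflect greeting phrases in the SMS data rather than synonymy, since the model learns from co-occurrence.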

In Word2Vec, each word is represented by a 100-dimensional vector (gensim's default `vector_size`). In average Word2Vec, the vectors of all words in a sentence are averaged, so each sentence gets a single 100-dimensional vector.

In [131]:
def avg_word2vec(doc):
  # Filter out words not present in the model's vocabulary
  # (key_to_index is a dict, so each lookup is O(1) instead of a scan of the index_to_key list)
  valid_words = [word for word in doc if word in model.wv.key_to_index]
  if not valid_words:  # Return zero vector if no valid words
    return np.zeros(model.wv.vector_size)
  return np.mean([model.wv[word] for word in valid_words], axis=0)
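
The averaging step can be illustrated without a trained model. This is a minimal sketch with hypothetical 2-dimensional embeddings; `toy_wv` and `toy_avg` are invented names that mirror the logic of `avg_word2vec`.

```python
import numpy as np

# Hypothetical 2-d embeddings standing in for model.wv
toy_wv = {"free": np.array([1.0, 0.0]), "prize": np.array([0.0, 1.0])}

def toy_avg(doc, wv, size=2):
    # Keep only vectors of words that are in the vocabulary
    vectors = [wv[w] for w in doc if w in wv]
    if not vectors:
        # No known words: fall back to a zero vector
        return np.zeros(size)
    # Element-wise mean over the word vectors
    return np.mean(vectors, axis=0)

mixed = toy_avg(["free", "prize", "unknownword"], toy_wv)
unknown = toy_avg(["unknownword"], toy_wv)
print(mixed, unknown)
```

Out-of-vocabulary words are skipped, so they do not pull the average towards zero.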

In [132]:
!pip install tqdm



In [133]:
from tqdm import tqdm  # the bare `import tqdm` was redundant: this import rebinds the same name

In [134]:
import numpy as np

In [135]:
X=[]
for i in tqdm(range(len(words))):
  X.append(avg_word2vec(words[i]))

100%|██████████| 5564/5564 [00:00<00:00, 8969.77it/s]


In [136]:
X[:2]

[array([-0.02704175,  0.257559  ,  0.05143752,  0.06801914, -0.0437781 ,
        -0.4131363 ,  0.01928416,  0.49850166, -0.20030332, -0.23698352,
        -0.22299844, -0.25883695, -0.0159449 ,  0.06195792,  0.0080861 ,
        -0.17287058,  0.01592366, -0.3516671 ,  0.04046682, -0.4590689 ,
         0.11547125,  0.17403379, -0.06232778, -0.21705753, -0.25670737,
         0.00162331, -0.3868079 , -0.26545188, -0.10486103,  0.11326233,
         0.13379505,  0.00399581,  0.10757468, -0.01886824, -0.14970802,
         0.42361155, -0.11330015, -0.25117028, -0.21843724, -0.4946473 ,
         0.12841119, -0.16605516,  0.09883534,  0.0715077 ,  0.11941345,
         0.00433502, -0.20213236, -0.05166787,  0.17504162,  0.27379304,
         0.10665791, -0.14880054, -0.02391928,  0.02517737, -0.22231363,
         0.11392688,  0.15719616, -0.00667098, -0.30006662,  0.00818404,
         0.01464317,  0.14462152, -0.2267537 , -0.08233383, -0.23450099,
         0.234445  ,  0.08864834,  0.01614122, -0.2

In [137]:
len(X)

5564

In [138]:
X[1]

array([ 0.00864806,  0.21313824,  0.17537999,  0.09277378, -0.20611963,
       -0.3503817 ,  0.03722663,  0.52216244, -0.21623565, -0.373985  ,
       -0.41898   , -0.2033761 ,  0.10985104,  0.03958215, -0.10870285,
       -0.01902244,  0.13943446, -0.52323025, -0.13885619, -0.6159522 ,
        0.00484971,  0.04632068, -0.00930485, -0.47021973, -0.5881736 ,
       -0.1240838 , -0.64060515, -0.11617996, -0.03410793,  0.27587104,
       -0.1828767 ,  0.09195088,  0.1442458 , -0.14266233, -0.12474997,
        0.6276421 , -0.07273011, -0.18824063, -0.4145806 , -0.59128064,
        0.390425  ,  0.07988682,  0.06901786,  0.20356382,  0.14558831,
       -0.0833194 , -0.30257252, -0.22555593,  0.29542592,  0.3932673 ,
        0.03011341, -0.20434305, -0.05581282,  0.0075122 , -0.15910494,
        0.27055466,  0.14100373, -0.08718036, -0.24272054,  0.01714549,
       -0.07248018,  0.0055347 , -0.14536761,  0.07054287, -0.35354364,
        0.58228207,  0.00836414, -0.11235195, -0.21473762,  0.36

In [139]:
X_new=np.array(X)

In [140]:
messages.shape

(5572, 2)

In [141]:
X_new.shape

(5564, 100)

In [142]:
y=pd.get_dummies(messages['label'])

In [143]:
y=y.iloc[:,0].values

In [144]:
y.shape

(5572,)

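## Some rows are missing from X_new 😖

`X_new` has 5564 rows but `y` has 5572. Messages that are empty after cleaning yield no sentences, so they never reach `words`.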

In [145]:
[[i,j,k] for i,j,k in zip(list(map(len,corpus)),corpus, messages['message']) if i<1]

[[0, '', 'What you doing?how are you?'],
 [0, '', 'Where @'],
 [0, '', '645'],
 [0, '', 'Can a not?'],
 [0, '', ':) '],
 [0, '', 'What you doing?how are you?'],
 [0, '', ':( but your not here....'],
 [0, '', ':-) :-)']]

In [150]:
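# Keep only the labels of messages that are non-empty after cleaning,
# so that y lines up row-for-row with X_new
mask = [len(text) > 0 for text in corpus]
y = messages[mask]
y = pd.get_dummies(y['label'])
y = y.iloc[:,0].values  # column 0 is 'ham', so True means ham and False means spam
y.shape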

(5564,)
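
The mismatch arises because the features lose the empty rows while the labels do not. Here is a small sketch of filtering features and labels with the same mask, on invented data; `toy_corpus` and `toy_labels` are made up for illustration.

```python
toy_corpus = ["free prize", "", "ok lar", ""]
toy_labels = ["spam", "ham", "ham", "ham"]

# One boolean per message: True if any text survived cleaning
keep = [len(text) > 0 for text in toy_corpus]

# Apply the same mask to both sides so rows stay aligned
features = [text.split() for text, k in zip(toy_corpus, keep) if k]
labels = [label for label, k in zip(toy_labels, keep) if k]

print(list(zip(features, labels)))
```

Applying one mask to both sides keeps every feature row paired with its own label.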

In [170]:
# Stack the per-message vectors into one 2-D array and wrap it in a DataFrame.
# Building it in one call avoids the slow row-by-row fill and keeps a float dtype.
import numpy as np

df = pd.DataFrame(np.vstack(X))

In [171]:
df.shape

(5564, 100)

In [172]:
X=df

In [176]:
X.isnull().sum()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
95,0
96,0
97,0
98,0


In [173]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [175]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=200,criterion='entropy')

In [177]:
classifier.fit(X_train,y_train)

In [178]:
y_pred=classifier.predict(X_test)

In [179]:
from sklearn.metrics import accuracy_score, classification_report

In [180]:
print(accuracy_score(y_test,y_pred))

0.9694519317160827


## Verify the model accuracy


## One-line explanations for each metric:
- Precision: Of the cases predicted positive, the fraction that are actually positive (high precision means few false positives).
- Recall: Of the actual positive cases, the fraction that were correctly identified (high recall means few false negatives).
- F1-Score: Harmonic mean of precision and recall, a balanced measure when false positives and false negatives matter equally.
- Support: The number of actual occurrences of each class in the evaluated data, indicating how much data was available to evaluate each class.
- Accuracy: The overall percentage of correct predictions out of all predictions made.
- Macro avg: The unweighted average of precision/recall/F1 across classes, treating each class equally regardless of its frequency.
- Weighted avg: The average of precision/recall/F1 across classes, weighted by the number of samples (support) in each class.
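
The first three definitions above can be computed directly from confusion-matrix counts. This is a minimal sketch with hypothetical counts; `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # correct positives among predicted positives
    recall = tp / (tp + fn)      # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one class
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than an arithmetic mean would.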

## Classification Report

In [181]:
report = classification_report(y_test,y_pred)
print(report)

              precision    recall  f1-score   support

       False       0.89      0.88      0.88       146
        True       0.98      0.98      0.98       967

    accuracy                           0.97      1113
   macro avg       0.94      0.93      0.93      1113
weighted avg       0.97      0.97      0.97      1113
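
The two averages in the report can be approximated from the per-class rows. The sketch below uses the rounded precision values and supports printed above, so results can differ slightly from the report, which averages the unrounded values.

```python
# Rounded per-class precision and support, copied from the report above
precision = {"False": 0.89, "True": 0.98}
support = {"False": 146, "True": 967}

# Macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(round(macro, 3), round(weighted, 3))
```

The minority class (146 of 1113 test rows) scores lower, which pulls the macro average further below the weighted one. With classes this imbalanced, the per-class rows say more than overall accuracy.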

