<a href="https://colab.research.google.com/github/shahriar1990/Deep_Learning/blob/main/SMS_Spam_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import os
import io
tf.__version__

'2.8.0'

In this chapter, we will focus on the basics of pre-processing text and
build a simple spam detector. Specifically, we will learn about the
following:
* The typical text processing workflow
* Data collection and labeling
* Text normalization, including case normalization, text
  tokenization, stemming, and lemmatization
* Modeling datasets that have been text normalized
* Vectorizing text
* Modeling datasets with vectorized text

# Collecting labeled data 
In this book, we will rely on publicly available datasets. The
appropriate datasets will be called out in their respective chapters
along with instructions on downloading them. To build a spam
detection system, we will be using the SMS Spam Collection dataset
made available by the University of California, Irvine. This dataset
can be downloaded with the code in the cell below. Each SMS is tagged
as "SPAM" or "HAM," with the latter indicating it is not a spam message.

In [2]:
path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip",
                  origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
                  extract=True)

# Unzip the downloaded archive into the data folder
!unzip $path_to_zip -d data 

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection  
  inflating: data/readme             


In [4]:
# Read the tab-separated file; each line holds a label and a message
with io.open('/content/data/SMSSpamCollection') as f:
  lines = f.read().strip().split('\n')
lines[5]

"spam\tFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"

# Preprocessing Data

In [15]:
spam_dataset = []
count = 0  # number of spam messages
for line in lines:
  label, text = line.split('\t')
  # Encode the label as 1 for spam and 0 for ham
  if label.lower().strip() == 'spam':
    spam_dataset.append((1, text.strip()))
    count += 1
  else:
    spam_dataset.append((0, text.strip()))

print(spam_dataset[0])
print("Spam: ", count)

(0, 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
Spam:  747


# Data Normalization

In [35]:
import pandas as pd

df = pd.DataFrame(spam_dataset,columns=['Spam','Message'])
df

Unnamed: 0,Spam,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,1,This is the 2nd time we have tried 2 contact u...
5570,0,Will ü b going to esplanade fr home?
5571,0,"Pity, * was in mood for that. So...any other s..."
5572,0,The guy did some bitching but I acted like i'd...


In [36]:
import re

#Normalization Functions

def message_length(x):
  # returns total number of characters
  return len(x)

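def num_capitals(x):
  # counts uppercase ASCII letters; re.subn returns (new_string, number_of_substitutions)
  _,count = re.subn(r'[A-Z]','',x)
  return count

def num_punctuation(x):
  # counts non-word characters (\W)
  _,count = re.subn(r'[\W]','',x)
  return count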
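Note that `\W` matches any character that is not a letter, digit, or underscore, so the count above includes spaces as well as punctuation marks. The sketch below contrasts the two on a short message from the dataset; `count_non_word` and `count_punctuation_only` are hypothetical helper names, and relying on `string.punctuation` (ASCII only) is an assumption rather than the notebook's method.

```python
import re
import string

def count_non_word(text):
    # Same pattern as the notebook: every non-word character, spaces included.
    return len(re.findall(r'\W', text))

def count_punctuation_only(text):
    # Hypothetical alternative: count ASCII punctuation marks only.
    return sum(1 for ch in text if ch in string.punctuation)

sample = "Ok lar... Joking wif u oni..."
print(count_non_word(sample), count_punctuation_only(sample))  # 11 6
```

The gap between the two numbers is the five spaces in the message, which helps explain why this feature tracks message length so closely.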


In [37]:
df['Capitals'] = df['Message'].apply(num_capitals)
df['punctutation'] = df['Message'].apply(num_punctuation)
df['Length'] = df['Message'].apply(message_length)

In [38]:
df.describe()

Unnamed: 0,Spam,Capitals,punctutation,Length
count,5574.0,5574.0,5574.0,5574.0
mean,0.134015,5.621636,18.942591,80.443488
std,0.340699,11.683233,14.825994,59.841746
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,36.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,27.0,122.0
max,1.0,129.0,253.0,910.0


In [39]:
train = df.sample(frac=0.8,random_state=42)
test = df.drop(train.index)
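The split above draws 80% of the rows at random and keeps the remaining rows as the test set. A minimal sketch on a toy frame (the `toy` DataFrame and its values are invented for illustration) shows that the two parts are disjoint and together cover every row:

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame; the values are invented.
toy = pd.DataFrame({'Spam': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
                    'Length': [25, 161, 40, 69, 37, 44, 65, 81, 38, 52]})

# Same idiom as above: sample 80% for training, drop those rows to get the test set.
toy_train = toy.sample(frac=0.8, random_state=42)
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))
# No index appears in both parts, and no row is lost.
overlap = set(toy_train.index) & set(toy_test.index)
print(overlap)
```

Because the sampling is not stratified, the spam share can differ slightly between the two parts, which is why it is worth comparing `train.describe()` and `test.describe()`.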

In [40]:
train.describe()

Unnamed: 0,Spam,Capitals,punctutation,Length
count,4459.0,4459.0,4459.0,4459.0
mean,0.132765,5.519399,18.886522,80.316439
std,0.339359,11.405424,14.602023,59.346407
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,35.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,27.0,122.0
max,1.0,129.0,253.0,910.0


In [41]:
test.describe()

Unnamed: 0,Spam,Capitals,punctutation,Length
count,1115.0,1115.0,1115.0,1115.0
mean,0.139013,6.030493,19.166816,80.95157
std,0.346116,12.731059,15.694599,61.807655
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,36.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,28.0,123.0
max,1.0,127.0,195.0,790.0


# Model Building

In [47]:
def make_model(input_dims=3, num_units=12):
  model = tf.keras.Sequential()

  # Adds a densely-connected layer with num_units units (12 by default) to the model:
  model.add(tf.keras.layers.Dense(num_units, 
                                  input_dim=input_dims, 
                                  activation='relu'))

  # Add a sigmoid layer with a binary output unit:
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', 
                metrics=['accuracy'])
  return model
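To make the architecture concrete, the sketch below reproduces the forward pass of this two-layer network in NumPy and counts its trainable parameters. The weights are randomly initialized stand-ins, not the values Keras learns, and the names `W1`, `b1`, `W2`, `b2`, and `forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dims, num_units = 3, 12

# Randomly initialized stand-ins for the weights Keras would learn.
W1 = rng.normal(size=(input_dims, num_units))
b1 = np.zeros(num_units)
W2 = rng.normal(size=(num_units, 1))
b2 = np.zeros(1)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # Dense(12, activation='relu')
    logits = hidden @ W2 + b2               # Dense(1)
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid squashes the output into [0, 1]

n_params = W1.size + b1.size + W2.size + b2.size
x = np.array([[25.0, 4.0, 1.0]])            # one row: Length, punctutation, Capitals
print(forward(x).shape, n_params)           # (1, 1) 61
```

The count of 61 comes from 3 × 12 weights plus 12 biases in the hidden layer, and 12 weights plus 1 bias in the output layer.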

In [48]:
x_train = train[['Length','punctutation','Capitals']]
y_train = train['Spam']

x_test = test[['Length','punctutation','Capitals']]
y_test = test['Spam']

In [49]:
x_train

Unnamed: 0,Length,punctutation,Capitals
3690,25,4,1
3527,161,48,107
724,40,7,1
3370,69,17,3
468,37,8,1
...,...,...,...
3280,444,114,44
3186,65,14,50
3953,81,23,2
2768,38,8,2


In [50]:
model = make_model()

In [51]:
model.fit(x_train,y_train,epochs=10,batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f03d1b5fd90>

In [52]:
model.evaluate(x_test,y_test)



[0.3027591109275818, 0.8699551820755005]
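An accuracy of about 0.87 should be read against the class balance. According to `test.describe()` above, the test set has 1,115 rows with a mean `Spam` value of 0.139013, so a trivial classifier that always predicts ham would already be right on roughly 86% of the test messages. The arithmetic below uses only those reported figures:

```python
# Figures reported by test.describe() and model.evaluate() in this notebook.
n_test = 1115
spam_rate = 0.139013
model_accuracy = 0.8699551820755005

n_spam = round(n_test * spam_rate)         # number of spam messages in the test set
baseline_accuracy = 1 - n_spam / n_test    # accuracy of always predicting ham

print(n_spam, round(baseline_accuracy, 4), round(model_accuracy - baseline_accuracy, 4))
```

On these numbers, the three hand-built features add less than one percentage point over the majority-class baseline.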

In [54]:
y_train_pred = model.predict(x_train)
y_train_pred


array([[0.04612938],
       [0.15152553],
       [0.05732074],
       ...,
       [0.01037714],
       [0.03827438],
       [0.04750213]], dtype=float32)
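`model.predict` returns one sigmoid output per message. A common convention, assumed here rather than taken from the notebook, is to label a message as spam when that value is at least 0.5. The sketch applies it to the six probabilities visible in the output above.

```python
# Probabilities copied from the truncated output above.
probs = [0.04612938, 0.15152553, 0.05732074, 0.01037714, 0.03827438, 0.04750213]

# Assumed cut-off; it can be moved if false positives and false negatives cost differently.
threshold = 0.5
labels = [1 if p >= threshold else 0 for p in probs]

print(labels)  # [0, 0, 0, 0, 0, 0]
```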

# Tokenization and Stop Word Removal

In [55]:
sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'
sentence.split()

['Go',
 'until',
 'jurong',
 'point,',
 'crazy..',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world']
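Splitting on whitespace leaves punctuation attached to words (`'point,'`, `'crazy..'`). One simple way to separate them, shown here as a sketch and not as the tokenizer used later in this notebook, is a regular expression that captures runs of word characters and runs of punctuation as separate tokens:

```python
import re

sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'

# \w+ matches a run of letters/digits/underscores; [^\w\s]+ matches a run of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```

This yields 14 tokens, with `','` and `'..'` split off from the words they followed. It does not detect sentence boundaries, which is one reason to reach for a trained tokenizer.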

In [56]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.3.0-py3-none-any.whl (432 kB)

In [58]:
import stanza

In [59]:
# Download the default English models; stanza.download() does not return a pipeline
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-04 05:51:28 INFO: Downloading default packages for language: en (English)...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-04 05:51:39 INFO: Finished downloading models and saved to /root/stanza_resources.


In [60]:
en = stanza.Pipeline(lang='en')

2022-04-04 05:52:26 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-04 05:52:26 INFO: Use device: cpu
2022-04-04 05:52:26 INFO: Loading: tokenize
2022-04-04 05:52:27 INFO: Loading: pos
2022-04-04 05:52:27 INFO: Loading: lemma
2022-04-04 05:52:27 INFO: Loading: depparse
2022-04-04 05:52:27 INFO: Loading: sentiment
2022-04-04 05:52:28 INFO: Loading: constituency
2022-04-04 05:52:29 INFO: Loading: ner
2022-04-04 05:52:30 INFO: Done loading processors!


In [61]:
sentence

'Go until jurong point, crazy.. Available only in bugis n great world'

In [68]:
tokenized = en(sentence)

In [69]:
len(tokenized.sentences)

2

In [70]:
for snt in tokenized.sentences:
  for word in snt.tokens:
    print(word.text)
  print("<End of Sentence>")

Go
until
jurong
point
,
crazy
..
<End of Sentence>
Available
only
in
bugis
n
great
world
<End of Sentence>
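Once the text is tokenized, stop words can be filtered out. The sketch below uses a tiny hand-picked stop-word set, which is purely an assumption for illustration (real lists are much longer and language specific), and applies it to the tokens printed above.

```python
# Tokens as printed by the Stanza pipeline above.
tokens = ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..',
          'Available', 'only', 'in', 'bugis', 'n', 'great', 'world']

# Hypothetical, deliberately tiny stop-word list.
stop_words = {'until', 'only', 'in', 'n'}

# Compare in lower case, and drop tokens with no letters or digits (pure punctuation).
filtered = [t for t in tokens
            if t.lower() not in stop_words and any(ch.isalnum() for ch in t)]
print(filtered)
```

For this sentence the result is `['Go', 'jurong', 'point', 'crazy', 'Available', 'bugis', 'great', 'world']`.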


# Dependency Parsing Example

In [72]:
en2 = stanza.Pipeline(lang='en')
pr2 = en2('Hari went to school')
for snt in pr2.sentences:
  for word in snt.tokens:
    print(word)

2022-04-04 06:06:07 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2022-04-04 06:06:07 INFO: Use device: cpu
2022-04-04 06:06:07 INFO: Loading: tokenize
2022-04-04 06:06:07 INFO: Loading: pos
2022-04-04 06:06:08 INFO: Loading: lemma
2022-04-04 06:06:08 INFO: Loading: depparse
2022-04-04 06:06:09 INFO: Loading: sentiment
2022-04-04 06:06:10 INFO: Loading: constituency
2022-04-04 06:06:11 INFO: Loading: ner
2022-04-04 06:06:12 INFO: Done loading processors!


[
  {
    "id": 1,
    "text": "Hari",
    "lemma": "Hari",
    "upos": "PROPN",
    "xpos": "NNP",
    "feats": "Number=Sing",
    "head": 2,
    "deprel": "nsubj",
    "start_char": 0,
    "end_char": 4,
    "ner": "S-PERSON"
  }
]
[
  {
    "id": 2,
    "text": "went",
    "lemma": "go",
    "upos": "VERB",
    "xpos": "VBD",
    "feats": "Mood=Ind|Tense=Past|VerbForm=Fin",
    "head": 0,
    "deprel": "root",
    "start_char": 5,
    "end_char": 9,
    "ner": "O"
  }
]
[
  {
    "id": 3,
    "text": "to",
    "lemma": "to",
    "upos": "ADP",
    "xpos": "IN",
    "head": 4,
    "deprel": "case",
    "start_char": 10,
    "end_char": 12,
    "ner": "O"
  }
]
[
  {
    "id": 4,
    "text": "school",
    "lemma": "school",
    "upos": "NOUN",
    "xpos": "NN",
    "feats": "Number=Sing",
    "head": 2,
    "deprel": "obl",
    "start_char": 13,
    "end_char": 19,
    "ner": "O"
  }
]
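In this output, `head` is the 1-based `id` of the word each token attaches to, and a `head` of 0 marks the root of the sentence. The sketch below rebuilds the relations from the fields printed above, copied into plain dictionaries so that it runs without Stanza:

```python
# id, text, head and deprel copied from the parser output above.
words = [
    {"id": 1, "text": "Hari",   "head": 2, "deprel": "nsubj"},
    {"id": 2, "text": "went",   "head": 0, "deprel": "root"},
    {"id": 3, "text": "to",     "head": 4, "deprel": "case"},
    {"id": 4, "text": "school", "head": 2, "deprel": "obl"},
]

by_id = {w["id"]: w["text"] for w in words}

relations = []
for w in words:
    head_text = by_id.get(w["head"], "ROOT")   # head 0 has no word, so label it ROOT
    relations.append((w["text"], w["deprel"], head_text))

for dependent, deprel, head in relations:
    print(f"{dependent} --{deprel}--> {head}")
```

Read this way, "Hari" is the subject of "went", "school" is an oblique argument of "went", and "to" marks the case of "school".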


In [73]:
# Download the default Persian models; the pipeline itself is created in the next cell
stanza.download('fa')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-04 06:09:10 INFO: Downloading default packages for language: fa (Persian)...


Downloading https://huggingface.co/stanfordnlp/stanza-fa/resolve/v1.3.0/models/default.zip:   0%|          | 0…

2022-04-04 06:09:19 INFO: Finished downloading models and saved to /root/stanza_resources.


In [74]:
fa = stanza.Pipeline(lang='fa')

2022-04-04 06:10:07 INFO: Loading these models for language: fa (Persian):
| Processor | Package |
-----------------------
| tokenize  | perdt   |
| mwt       | perdt   |
| pos       | perdt   |
| lemma     | perdt   |
| depparse  | perdt   |

2022-04-04 06:10:07 INFO: Use device: cpu
2022-04-04 06:10:07 INFO: Loading: tokenize
2022-04-04 06:10:07 INFO: Loading: mwt
2022-04-04 06:10:07 INFO: Loading: pos
2022-04-04 06:10:08 INFO: Loading: lemma
2022-04-04 06:10:08 INFO: Loading: depparse
2022-04-04 06:10:09 INFO: Done loading processors!


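In [81]:
# English gloss of the Persian sample text: "When you type a phrase into a translator
# such as Google Translate, or work with text editors (such as Microsoft Word), part of
# the correction and phrase-suggestion process is in fact performed by NLP; moreover,
# today's computers can provide users with a summary of a long text!"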
fa_line = fa('نگامی که عبارتی را در ترجمه‌گری مثل گوگل ترنسلیت تایپ می‌کنید یا با ویرایشگرهای متنی (مثل مایکروسافت ورد) کار می‌کنید در حقیقت بخشی از فرایند تصحیح و پیشنهاد عبارات توسط NLP انجام می‌شود، ضمنا رایانه‌های امروزی قادرند خلاصه‌ای از یک متن طولانی را در اختیار کاربران قرار دهند!')

In [82]:
for snt in fa_line.sentences:
  for word in snt.tokens:
    print(word.text)

نگامی
که
عبارتی
را
در
ترجمه‌گری
مثل
گوگل
ترنسلیت
تایپ
می‌کنید
یا
با
ویرایشگرهای
متنی
(
مثل
مایکروسافت
ورد
)
کار
می‌کنید
در
حقیقت
بخشی
از
فرایند
تصحیح
و
پیشنهاد
عبارات
توسط
NLP
انجام
می‌شود
،
ضمنا
رایانه‌های
امروزی
قادرند
خلاصه‌ای
از
یک
متن
طولانی
را
در
اختیار
کاربران
قرار
دهند
!


In [89]:
fa = stanza.Pipeline(lang='fa')
pr2 = fa('نگامی که عبارتی را در ترجمه‌گری مثل گوگل ترنسلیت تایپ می‌کنید یا با ویرایشگرهای متنی (مثل مایکروسافت ورد) کار می‌کنید در حقیقت بخشی از فرایند تصحیح و پیشنهاد عبارات توسط NLP انجام می‌شود، ضمنا رایانه‌های امروزی قادرند خلاصه‌ای از یک متن طولانی را در اختیار کاربران قرار دهند!')
for snt in pr2.sentences:
  for word in snt.tokens:
    print(word)

2022-04-04 06:54:38 INFO: Loading these models for language: fa (Persian):
| Processor | Package |
-----------------------
| tokenize  | perdt   |
| mwt       | perdt   |
| pos       | perdt   |
| lemma     | perdt   |
| depparse  | perdt   |

2022-04-04 06:54:38 INFO: Use device: cpu
2022-04-04 06:54:38 INFO: Loading: tokenize
2022-04-04 06:54:38 INFO: Loading: mwt
2022-04-04 06:54:38 INFO: Loading: pos
2022-04-04 06:54:38 INFO: Loading: lemma
2022-04-04 06:54:38 INFO: Loading: depparse
2022-04-04 06:54:39 INFO: Done loading processors!


[
  {
    "id": 1,
    "text": "نگامی",
    "lemma": "نگط",
    "upos": "NOUN",
    "xpos": "N_IANM",
    "feats": "Number=Sing",
    "head": 35,
    "deprel": "nsubj",
    "start_char": 0,
    "end_char": 5
  }
]
[
  {
    "id": 2,
    "text": "که",
    "lemma": "که",
    "upos": "SCONJ",
    "xpos": "SUBR",
    "head": 1,
    "deprel": "acl",
    "start_char": 6,
    "end_char": 8
  }
]
[
  {
    "id": 3,
    "text": "عبارتی",
    "lemma": "عبارت",
    "upos": "NOUN",
    "xpos": "N_IANM",
    "feats": "Number=Sing",
    "head": 11,
    "deprel": "obj",
    "start_char": 9,
    "end_char": 15
  }
]
[
  {
    "id": 4,
    "text": "را",
    "lemma": "را",
    "upos": "ADP",
    "xpos": "POSTP",
    "head": 3,
    "deprel": "case",
    "start_char": 16,
    "end_char": 18
  }
]
[
  {
    "id": 5,
    "text": "در",
    "lemma": "در",
    "upos": "ADP",
    "xpos": "PREP",
    "head": 6,
    "deprel": "case",
    "start_char": 19,
    "end_char": 21
  }
]
[
  {
    "id": 6,
    "text": "ت

# Adding word count features

In [87]:
def word_counts(x, pipeline=en):
  # Tokenize the message and add up the tokens across all of its sentences
  doc = pipeline(x)
  count = sum(len(sentence.tokens) for sentence in doc.sentences)
  return count
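The function adds up the token counts of every sentence the pipeline finds. To check that logic without loading Stanza, the sketch below swaps in a hypothetical stand-in, `fake_pipeline`, which simply returns the two tokenized sentences shown earlier in this notebook; only the `sentences` and `tokens` attributes are imitated.

```python
from types import SimpleNamespace

def word_counts(x, pipeline):
    # Same logic as above, with the pipeline passed in explicitly.
    doc = pipeline(x)
    return sum(len(sentence.tokens) for sentence in doc.sentences)

def fake_pipeline(text):
    # Hypothetical stand-in for a Stanza pipeline: ignores its input and returns
    # the two sentences Stanza produced for the example sentence earlier.
    sentences = [
        SimpleNamespace(tokens=['Go', 'until', 'jurong', 'point', ',', 'crazy', '..']),
        SimpleNamespace(tokens=['Available', 'only', 'in', 'bugis', 'n', 'great', 'world']),
    ]
    return SimpleNamespace(sentences=sentences)

sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'
print(word_counts(sentence, fake_pipeline))  # 14
```

Note that punctuation tokens such as `','` and `'..'` are included in the total, so the feature counts tokens rather than words in the strict sense.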


In [90]:
df['Words'] = df['Message'].apply(word_counts)

KeyError: ignored