<a href="https://colab.research.google.com/github/rahiakela/advanced-natural-language-processing-with-tensorflow-2/blob/main/1-essentials-of-nlp/1_text_normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SMS Spam Detection

To understand how to process text, it is important to understand the general
workflow for NLP.

<img src='https://github.com/rahiakela/img-repo/blob/master/advanced-nlp-with-tensorflow-2/text-processing-workflow.png?raw=1' width='800'/>

The first two steps of the process in the preceding diagram involve collecting labeled data. A supervised model or even a semi-supervised model needs data to operate.

The next step is usually normalizing and featurizing the data. Models have a hard time processing text data as is. There is a lot of hidden structure in a given text that needs to be processed and exposed. These two steps focus on that. 

The last step is building a model with the processed inputs. While NLP has some unique models, this chapter will use only a simple deep neural network and focus more on the normalization and vectorization/featurization. Often, the last three stages operate in a cycle, even though the diagram may give the impression of linearity.

## Setup

In [None]:
%%shell

pip install stopwordsiso
pip install stanfordnlp

In [None]:
# Colab magic command that selects TensorFlow 2.x
%tensorflow_version 2.x
import tensorflow as tf
#from tf.keras.models import Sequential
#from tf.keras.layers import Dense
import os
import io
import re

import pandas as pd 
import stopwordsiso as stopwords
import stanfordnlp as snlp
en = snlp.download('en')

tf.__version__

# Data collection

**The first step of any Machine Learning (ML) project is to obtain a dataset.**

We will be using the SMS Spam Collection dataset made available by the University of California, Irvine.

In [3]:
# Download the zip file
path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip",
                  origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
                  extract=True)

# Unzip the file into a folder
!unzip $path_to_zip -d data

Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection  
  inflating: data/readme             


In [None]:
# optional step - helps if colab gets disconnected
# from google.colab import drive
# drive.mount('/content/drive')

Reading the data file is trivial.

In [4]:
# Let's see if we read the data correctly
# lines = io.open('/content/drive/My Drive/colab-data/SMSSpamCollection').read().strip().split('\n')
lines = io.open('/content/data/SMSSpamCollection').read().strip().split('\n')
lines[0]

'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [5]:
lines[2]

"spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

## Pre-process Data

The next step is to split each line into two columns – one with the text of the message and the other as the label. While we are separating these labels, we will also convert the labels to numeric values. Since we are interested in predicting spam messages, we can assign a value of 1 to the spam
messages. A value of 0 will be assigned to legitimate messages.

In [7]:
spam_dataset = []
spam_count = 0
ham_count = 0
for line in lines:
  label, text = line.split('\t')
  if label.lower().strip() == 'spam':
    spam_dataset.append((1, text.strip()))
    spam_count += 1
  else:
    spam_dataset.append((0, text.strip()))
    ham_count += 1

spam_dataset[:5]

[(0,
  'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 (0, 'Ok lar... Joking wif u oni...'),
 (1,
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 (0, 'U dun say so early hor... U c already then say...'),
 (0, "Nah I don't think he goes to usf, he lives around here though")]

In [8]:
print("Spam: ", spam_count, ", Ham: ", ham_count)

Spam:  747 , Ham:  4827


Now the dataset is ready for further processing in the pipeline.
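
The parsing loop above relies on every line containing exactly one tab. As a minimal sketch, assuming a message body could itself contain a tab, the split can be limited to the first separator. The helper name `parse_line` is hypothetical and is not part of the notebook.

```python
def parse_line(line):
    """Split one raw line into (label, text); label is 1 for spam and 0 for ham."""
    # maxsplit=1 keeps any later tabs inside the message text
    label, text = line.split("\t", 1)
    is_spam = 1 if label.strip().lower() == "spam" else 0
    return is_spam, text.strip()


print(parse_line("ham\tOk lar... Joking wif u oni..."))
print(parse_line("spam\tFree entry\tin 2 a wkly comp"))
```

With this variant, a second tab stays in the message instead of raising a `ValueError` during unpacking.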

# Data Normalization

Text normalization is a pre-processing step aimed at improving the quality
of the text and making it suitable for machines to process. 

Four main steps in text normalization are:

- case normalization, 
- tokenization and stop word removal,
- Parts-of-Speech (POS) tagging, 
- and stemming.

## Case normalization

**Case normalization applies to languages that use uppercase and lowercase letters.**

All languages based on the Latin alphabet or the Cyrillic alphabet (Russian,
Mongolian, and so on) use upper- and lowercase letters. Other languages
that sometimes use this are Greek, Armenian, Cherokee, and Coptic. 

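In case normalization, all letters are converted to the same case. This helps in semantic use cases, where `Free` and `free` should be treated as the same word. In other cases it may hinder performance: in the spam example, the number of capital letters is itself a useful signal, and it is lost once the text is lowercased.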
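
As a rough illustration of the idea, the sketch below normalizes two spellings of the same text with Python's built-in `str.casefold()`. The helper name `normalize_case` and the sample strings are assumptions made for this example only.

```python
def normalize_case(text):
    """Lowercase text using casefold(), which also handles non-ASCII case pairs."""
    return text.casefold()


messages = ["FREE entry in 2 a wkly comp", "Free Entry In 2 A Wkly Comp"]
normalized = [normalize_case(m) for m in messages]

# Both variants map to the same string after normalization
print(normalized[0] == normalized[1])

# The capital-letter count is no longer available once the case is normalized
print(sum(ch.isupper() for ch in messages[0]),
      sum(ch.isupper() for ch in normalized[0]))
```

Any feature that depends on capitalization therefore has to be computed before this step is applied.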

### Preprocessing case normalized data

Let's build a baseline model with three simple features:

- Number of characters in the message
- Number of capital letters in the message
- Number of punctuation symbols in the message

In [9]:
# To do so, first, we will convert the data into a pandas DataFrame
df = pd.DataFrame(spam_dataset, columns=['Spam', 'Message'])

In [10]:
df.head()

Unnamed: 0,Spam,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Next, let's build some simple functions that can count the length of the message, and the numbers of capital letters and punctuation symbols. Python's regular expression package, re, will be used to implement these:

In [11]:
# Normalization functions

def message_length(x):
  # returns total number of characters
  return len(x)

def num_capitals(x):
  # get count of capital letters
  _, count = re.subn(r'[A-Z]', '', x) # only works in english
  return count

def num_punctuation(x):
  # count non-word characters; \W also matches whitespace, not just punctuation
  _, count = re.subn(r'\W', '', x)
  return count
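
Note that the `\W` pattern matches every character that is not a letter, digit, or underscore, so spaces are counted too. The sketch below contrasts that with a stricter count based on `string.punctuation`. Both helper names are hypothetical, and the stricter variant is an alternative rather than what the notebook uses.

```python
import re
import string


def count_non_word(text):
    # Same pattern as num_punctuation: anything that is not a letter, digit, or underscore
    return len(re.findall(r"\W", text))


def count_ascii_punctuation(text):
    # Stricter alternative: only characters listed in string.punctuation
    return sum(ch in string.punctuation for ch in text)


message = "Ok lar... Joking wif u oni..."
print(count_non_word(message), count_ascii_punctuation(message))
```

For this message the two counts differ by the number of spaces, which is worth keeping in mind when reading the Punctuation column.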

Additional feature columns will be added to the DataFrame, and then the set will
be split into test and train sets:

In [12]:
df['Capitals'] = df['Message'].apply(num_capitals)
df['Punctuation'] = df['Message'].apply(num_punctuation)
df['Length'] = df['Message'].apply(message_length)

In [13]:
df.head()

Unnamed: 0,Spam,Message,Capitals,Punctuation,Length
0,0,"Go until jurong point, crazy.. Available only ...",3,28,111
1,0,Ok lar... Joking wif u oni...,2,11,29
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,10,33,155
3,0,U dun say so early hor... U c already then say...,2,16,49
4,0,"Nah I don't think he goes to usf, he lives aro...",2,14,61


In [14]:
df.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length
count,5574.0,5574.0,5574.0,5574.0
mean,0.134015,5.621636,18.942591,80.443488
std,0.340699,11.683233,14.825994,59.841746
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,36.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,27.0,122.0
max,1.0,129.0,253.0,910.0


Now let's split the dataset into training and test sets, with
80% of the records in the training set and the rest in the test set.

In [15]:
train=df.sample(frac=0.8,random_state=42) #random state is a seed value
test=df.drop(train.index)
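
The idea behind the seeded split can be sketched without pandas. This is only an illustration: the function name `split_indices` is hypothetical, and Python's `random` module will not pick the same rows as `df.sample(random_state=42)`.

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=42):
    """Return (train_idx, test_idx) as sorted lists of row positions."""
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    n_train = round(n_rows * train_frac)
    train_idx = set(rng.sample(range(n_rows), n_train))
    test_idx = [i for i in range(n_rows) if i not in train_idx]
    return sorted(train_idx), test_idx


train_idx, test_idx = split_indices(5574)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, which is the property `random_state` provides in the pandas version.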

In [16]:
train.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length
count,4459.0,4459.0,4459.0,4459.0
mean,0.132765,5.519399,18.886522,80.316439
std,0.339359,11.405424,14.602023,59.346407
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,35.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,27.0,122.0
max,1.0,129.0,253.0,910.0


In [17]:
test.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length
count,1115.0,1115.0,1115.0,1115.0
mean,0.139013,6.030493,19.166816,80.95157
std,0.346116,12.731059,15.694599,61.807655
min,0.0,0.0,0.0,2.0
25%,0.0,1.0,8.0,36.0
50%,0.0,2.0,15.0,61.0
75%,0.0,4.0,28.0,123.0
max,1.0,127.0,195.0,790.0


Furthermore, the features and labels will be separated for both the training and test sets:

In [18]:
x_train = train[['Length', 'Punctuation', 'Capitals']]
y_train = train[['Spam']]

x_test = test[['Length', 'Punctuation', 'Capitals']]
y_test = test[['Spam']]

In [19]:
x_train.describe()

Unnamed: 0,Length,Punctuation,Capitals
count,4459.0,4459.0,4459.0
mean,80.316439,18.886522,5.519399
std,59.346407,14.602023,11.405424
min,2.0,0.0,0.0
25%,35.0,8.0,1.0
50%,61.0,15.0,2.0
75%,122.0,27.0,4.0
max,910.0,253.0,129.0


In [20]:
x_test.describe()

Unnamed: 0,Length,Punctuation,Capitals
count,1115.0,1115.0,1115.0
mean,80.95157,19.166816,6.030493
std,61.807655,15.694599,12.731059
min,2.0,0.0,0.0
25%,36.0,8.0,1.0
50%,61.0,15.0,2.0
75%,123.0,28.0,4.0
max,790.0,195.0,127.0


### Modeling case normalized data

We will use a very simple model, as the objective is to show different basic NLP data processing techniques more than modeling. Here, we want to see if three simple features can aid in the classification of spam. As more features are added, passing them through the same model will help in seeing if the
featurization aids or hampers the accuracy of the classification.

In [21]:
# Basic 1-layer neural network model for evaluation
def make_model(input_dims=3, num_units=12):
  model = tf.keras.Sequential()

  # Adds a densely-connected layer with 12 units to the model:
  model.add(tf.keras.layers.Dense(num_units, 
                                  input_dim=input_dims, 
                                  activation='relu'))

  # Add a sigmoid layer with a binary output unit:
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  return model

This model uses binary cross-entropy for computing loss and the Adam optimizer
for training. The key metric, given that this is a binary classification problem, is accuracy.
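
To make the loss and the metric concrete, here is a small pure-Python sketch of mean binary cross-entropy and thresholded accuracy. The function names, the clipping constant `eps`, and the sample probabilities are assumptions for illustration; Keras computes these internally.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted spam probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the label after thresholding."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 0, 1]
probs = [0.9, 0.2, 0.6, 0.4]  # made-up sigmoid outputs
print(binary_cross_entropy(labels, probs), accuracy(labels, probs))
```

The loss penalizes confident wrong probabilities more heavily, while accuracy only looks at which side of the threshold each prediction falls on.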

We can train our simple baseline model with only three features like so:

In [22]:
model = make_model()

In [23]:
model.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fc0c007d748>

This is not bad: our three simple features get us to about 93% accuracy. A quick check shows that there are 592 spam messages in the training set, out of a total of 4,459. So, this model is doing better than a very simple model that guesses everything as not spam.

That model would have an accuracy of 87%. This number may be surprising but is fairly common in classification problems where there is a severe class imbalance in the data. Evaluating the trained model on the test set gives an accuracy of around 93.27%:

In [24]:
model.evaluate(x_test, y_test)



[0.19739584624767303, 0.9327354431152344]

Please note that the actual performance you see may be slightly different due to the data splits and computational vagaries. 
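
The accuracy of the "always predict not spam" baseline follows directly from the class counts. The sketch below uses the counts that appear in this notebook's outputs; the function name is hypothetical.

```python
def majority_baseline_accuracy(n_ham, n_spam):
    """Accuracy of a classifier that always predicts the more frequent class."""
    return max(n_ham, n_spam) / (n_ham + n_spam)


# Class counts taken from the outputs shown in this notebook
print(round(majority_baseline_accuracy(3867, 592), 3))  # training split
print(round(majority_baseline_accuracy(960, 155), 3))   # test split
```

Any model worth keeping has to beat this number, not just 50%.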

A quick verification can be performed by plotting the confusion matrix to see the performance:

In [25]:
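# predict_classes was removed in TensorFlow 2.6; thresholding the sigmoid output at 0.5 is the suggested replacement
y_train_pred = (model.predict(x_train) > 0.5).astype("int32")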



In [26]:
# confusion matrix
tf.math.confusion_matrix(tf.constant(y_train.Spam), y_train_pred)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[3777,   90],
       [ 186,  406]], dtype=int32)>

This shows that 3,777 out of 3,867 regular messages were classified correctly, while 406 out of 592 spam messages were classified correctly. Again, you may get a slightly different result.

|  | **Predicted Not Spam** | **Predicted Spam** |
| --- | --- | --- |
| **Actual Not Spam** | 3777 | 90 |
| **Actual Spam** | 186 | 406 |

The row totals can be calculated as follows:

|  | **Predicted Not Spam** | **Predicted Spam** | **Total** |
| --- | --- | --- | --- |
| **Actual Not Spam** | 3777 | 90 | 3777 + 90 = 3867 |
| **Actual Spam** | 186 | 406 | 186 + 406 = 592 |

The confusion matrix shows that accuracy would increase if the two off-diagonal values, 90 (regular messages flagged as spam) and 186 (spam messages that were missed), were reduced.
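
The summary metrics can be derived from the four cells of the matrix. The following is a minimal sketch that assumes the `[[TN, FP], [FN, TP]]` layout shown above; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(matrix):
    """Compute accuracy, precision, and recall from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # share of predicted spam that really is spam
        "recall": tp / (tp + fn),     # share of actual spam that was caught
    }


train_matrix = [[3777, 90], [186, 406]]
for name, value in binary_metrics(train_matrix).items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, recall on the spam class is often more informative than overall accuracy, because the many regular messages dominate the accuracy figure.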

In [27]:
sum(y_train_pred)

array([496], dtype=int32)

In [28]:
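# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_test_pred = (model.predict(x_test) > 0.5).astype("int32")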
tf.math.confusion_matrix(tf.constant(y_test.Spam), y_test_pred)



<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[943,  17],
       [ 58,  97]], dtype=int32)>

### Exercise

To test the value of the features, try re-running the model after removing one of them, such as the punctuation count or the number of capital letters, to get a sense of its contribution to the model.

In [None]:
x_train = train[['Length', 'Punctuation']]
y_train = train[['Spam']]

x_test = test[['Length', 'Punctuation']]
y_test = test[['Spam']]

x_train.describe()

Unnamed: 0,Length,Punctuation
count,4459.0,4459.0
mean,80.316439,18.886522
std,59.346407,14.602023
min,2.0,0.0
25%,35.0,8.0
50%,61.0,15.0
75%,122.0,27.0
max,910.0,253.0


In [None]:
x_test.describe()

Unnamed: 0,Length,Punctuation
count,1115.0,1115.0
mean,80.95157,19.166816
std,61.807655,15.694599
min,2.0,0.0
25%,36.0,8.0
50%,61.0,15.0
75%,123.0,28.0
max,790.0,195.0


In [None]:
model1 = make_model(input_dims=2)
model1.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb3a40d9198>

In [None]:
model1.evaluate(x_test, y_test)



[0.26170480251312256, 0.8941704034805298]

In [None]:
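# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_train_pred = (model1.predict(x_train) > 0.5).astype("int32")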



In [None]:
# confusion matrix
tf.math.confusion_matrix(tf.constant(y_train.Spam), y_train_pred)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[3780,   87],
       [ 385,  207]], dtype=int32)>

Now try removing the Punctuation feature instead:

In [None]:
x_train = train[['Length', 'Capitals']]
y_train = train[['Spam']]

x_test = test[['Length', 'Capitals']]
y_test = test[['Spam']]

In [None]:
model2 = make_model(input_dims=2)
model2.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fb3906cf2b0>

In [None]:
model2.evaluate(x_test, y_test)



[0.2811644971370697, 0.9085201621055603]

In [None]:
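# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_train_pred = (model2.predict(x_train) > 0.5).astype("int32")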



In [None]:
# confusion matrix
tf.math.confusion_matrix(tf.constant(y_train.Spam), y_train_pred)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[3684,  183],
       [ 171,  421]], dtype=int32)>

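Removing either feature lowers test accuracy relative to the 93.3% of the three-feature model: dropping Capitals gives about 89.4%, and dropping Punctuation gives about 90.9%. Both features therefore contribute to the model, with Capitals contributing more in this run.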

## Tokenization normalization

This step takes a piece of text and converts it into a list of tokens. If the input is a sentence, then separating the words would be an example of tokenization. Depending on the model, different granularities can be chosen. At the lowest level, each character could become a token. In some cases, entire sentences or paragraphs can be considered as a token:

<img src='https://github.com/rahiakela/img-repo/blob/master/advanced-nlp-with-tensorflow-2/sentence-tokenizing.png?raw=1' width='800'/>

The preceding diagram shows two ways a sentence can be tokenized. One way to
tokenize is to chop a sentence into words. Another way is to chop it into individual characters. Splitting into words, however, can be a complex proposition in some languages such as Japanese and Mandarin.


Many languages use a word separator, a space, to separate words. This makes the
task of tokenizing on words trivial. However, there are other languages that do not use any markers or separators between words. Some examples of such languages are Japanese and Chinese. In such languages, the task is referred to as segmentation.
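
To give a feel for what segmentation involves, here is a toy greedy longest-match segmenter. It is only a sketch: the tiny vocabulary is made up for this example, the function name `max_match` is hypothetical, and trained tokenizers such as the one in StanfordNLP do not work this way.

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation against a known vocabulary."""
    max_len = max(len(word) for word in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_vocab = {"選挙", "管理", "委員", "委員会"}  # illustrative dictionary
print(max_match("選挙管理委員会", toy_vocab))
```

The quality of this approach depends entirely on the dictionary, which is one reason statistical and neural segmenters are preferred in practice.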

Fortunately, most languages are not as complex as Japanese and use spaces to
separate words. In Python, splitting by spaces is trivial.

In [29]:
sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'
sentence.split()

['Go',
 'until',
 'jurong',
 'point,',
 'crazy..',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world']

Two of the tokens in the preceding output (`point,` and `crazy..`) show that the naïve approach in Python leaves punctuation attached to the words, among other issues. Consequently, this step is usually done through a library like StanfordNLP.

This package provides capabilities for tokenization, POS tagging, and lemmatization out of the box. To start with tokenization, we instantiate a pipeline and tokenize a sample text to see how this works:

In [32]:
en = snlp.Pipeline(lang="en", processors="tokenize")

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---


For now, only tokenization of text is desired, so only the tokenizer is used:

In [33]:
tokenized = en(sentence)
len(tokenized.sentences)

2

This shows that the tokenizer correctly divided the text into two sentences.

To inspect the tokens that were produced, the following code can be used:

In [34]:
for snt in tokenized.sentences:
  for word in snt.tokens:
    print(word.text)
  print("<End of Sentence>")

Go
until
jurong
point
,
crazy
..
<End of Sentence>
Available
only
in
bugis
n
great
world
<End of Sentence>


Punctuation marks were separated out into their own words. Text was split into multiple sentences. This is an improvement over only using spaces to split. In some applications, removal of punctuation may be required.
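
When a full NLP library is not available, a regular expression can approximate this behavior. The sketch below separates punctuation into its own tokens and then optionally drops it. Unlike StanfordNLP it does no sentence splitting, and both helper names are hypothetical.

```python
import re


def simple_tokenize(text):
    # Runs of word characters, or runs of non-space punctuation, become tokens
    return re.findall(r"\w+|[^\w\s]+", text)


def drop_punctuation(tokens):
    # Keep only tokens that contain at least one letter or digit
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'
tokens = simple_tokenize(sentence)
print(tokens)
print(drop_punctuation(tokens))
```

This handles the `point,` and `crazy..` cases, but contractions, URLs, and abbreviations would still need the more careful treatment a trained tokenizer provides.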

### Japanese Tokenization Example

Consider the preceding example of Japanese. To see the performance of StanfordNLP on Japanese tokenization, the following piece of code can be used:

In [36]:
jp = snlp.download('ja')

Using the default treebank "ja_gsd" for language "ja".
Would you like to download the models for: ja_gsd now? (Y/n)
Y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: ja_gsd
Download location: /root/stanfordnlp_resources/ja_gsd_models.zip


100%|██████████| 219M/219M [00:13<00:00, 15.9MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/ja_gsd_models.zip
Extracting models file for: ja_gsd
Cleaning up...Done.


Next, a Japanese pipeline will be instantiated and the words will be processed:

In [45]:
jp = snlp.Pipeline(lang="ja", processors="tokenize")
jp_line = jp("選挙管理委員会")

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/ja_gsd_models/ja_gsd_tokenizer.pt', 'lang': 'ja', 'shorthand': 'ja_gsd', 'mode': 'predict'}
Done loading processors!
---


You may recall that the Japanese text reads Election Administration Committee.
Correct tokenization should produce three words, where the first two are two
characters each and the last word is three characters:

In [46]:
for snt in jp_line.sentences:
  for word in snt.tokens:
    print(word.text)

選挙
管理
委員会


This matches the expected output. StanfordNLP supports 53 languages, so the same
code can be used for tokenizing any language that is supported.

Coming back to the spam detection example, a new feature can be implemented that
counts the number of words in the message using this tokenization functionality.

### Modeling tokenized data

It is possible that spam messages have different numbers of words than regular
messages. The first step is to define a method to compute the number of words:

In [47]:
def word_counts(x, pipeline=en):
  doc = pipeline(x)
  count = sum( [ len(sentence.tokens) for sentence in doc.sentences] )
  return count

In [52]:
en = snlp.Pipeline(lang='en', processors='tokenize')
df['Words'] = df['Message'].apply(word_counts)

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---


In [53]:
df.head()

Unnamed: 0,Spam,Message,Capitals,Punctuation,Length,Words
0,0,"Go until jurong point, crazy.. Available only ...",3,28,111,24
1,0,Ok lar... Joking wif u oni...,2,11,29,8
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,10,33,155,34
3,0,U dun say so early hor... U c already then say...,2,16,49,13
4,0,"Nah I don't think he goes to usf, he lives aro...",2,14,61,15


Next, using the train and test splits, add a column for the word count feature:

In [54]:
#train=df.sample(frac=0.8,random_state=42) #random state is a seed value
#test=df.drop(train.index)

train['Words'] = train['Message'].apply(word_counts)
test['Words'] = test['Message'].apply(word_counts)

In [55]:
x_train = train[['Length', 'Punctuation', 'Capitals', 'Words']]
y_train = train[['Spam']]

x_test = test[['Length', 'Punctuation', 'Capitals' , 'Words']]
y_test = test[['Spam']]

In [56]:
model = make_model(input_dims=4)
model.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fc075581e10>

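Adding the word count brings no meaningful improvement in accuracy. In the run shown here, test accuracy is about 92.1%, slightly below the 93.3% of the three-feature model. One hypothesis is that the
number of words is not useful. It would be useful if the average number of words in spam messages were noticeably smaller or larger than in regular messages.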

In [57]:
model.evaluate(x_test, y_test)



[0.2072787880897522, 0.921076238155365]

Using pandas, this can be quickly verified:

In [58]:
train.loc[train.Spam == 1].describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words
count,592.0,592.0,592.0,592.0,592.0
mean,1.0,15.320946,29.086149,138.856419,29.511824
std,0.0,11.635105,7.083572,28.07998,7.474256
min,1.0,0.0,2.0,13.0,3.0
25%,1.0,7.0,26.0,132.0,26.0
50%,1.0,14.0,30.0,149.0,30.0
75%,1.0,21.0,34.0,157.0,35.0
max,1.0,128.0,49.0,197.0,49.0


Let's compare the preceding results to the statistics for regular messages:

In [59]:
train.loc[train.Spam == 0].describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words
count,3867.0,3867.0,3867.0,3867.0,3867.0
mean,0.0,4.018878,17.325058,71.354538,17.344194
std,0.0,10.599291,14.826644,57.755351,13.811278
min,0.0,0.0,0.0,2.0,1.0
25%,0.0,1.0,8.0,33.0,8.0
50%,0.0,2.0,13.0,53.0,13.0
75%,0.0,3.0,23.0,92.0,22.0
max,0.0,129.0,253.0,910.0,209.0


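Some interesting patterns can quickly be seen. Spam messages usually have much
less deviation from the mean. Focus on the Capitals feature column. It shows that regular messages use far fewer capitals than spam messages: a mean of about 4 versus about 15, and 3 versus 21 at the 75th percentile.

The difference is less pronounced for word counts. Regular messages average about 17 words against about 30 for spam, and the 75th percentiles are 22 and 35. This quick check yields an indication as to why adding the word feature wasn't that useful. However, there are a couple of things to consider still.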

First, the tokenization model split out punctuation marks as words. Ideally, these words should be removed from the word counts as the punctuation feature is showing that spam messages use a lot more punctuation characters.

Secondly, languages have some common words that are usually excluded. This is called stop word removal.

## Stop Word Removal normalization
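
Before turning to the library, the idea can be sketched with a small hand-written stop-word set. The set below and the function name `count_content_words` are assumptions for illustration; the cells that follow use the much larger list from `stopwordsiso`.

```python
# Illustrative stop-word set, far smaller than a real list
STOP_WORDS = {"a", "an", "the", "in", "of", "only", "until"}


def count_content_words(tokens, stop_words=STOP_WORDS):
    """Count tokens whose lowercase form is not a stop word."""
    return sum(1 for tok in tokens if tok.lower() not in stop_words)


tokens = "Go until jurong point crazy Available only in bugis n great world".split()
print(len(tokens), count_content_words(tokens))
```

Lowercasing before the lookup matters because stop-word lists are normally stored in lowercase.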

In [None]:
!pip install stopwordsiso

Collecting stopwordsiso
  Downloading https://files.pythonhosted.org/packages/3e/03/4c5f24b654bb9459f81aa5c1b60b094b804286b99dca9f2e116c9eb01ac8/stopwordsiso-0.6.1-py3-none-any.whl (73kB)
Installing collected packages: stopwordsiso
Successfully installed stopwordsiso-0.6.1


In [None]:
import stopwordsiso as stopwords

stopwords.langs()

{'af',
 'ar',
 'bg',
 'bn',
 'br',
 'ca',
 'cs',
 'da',
 'de',
 'el',
 'en',
 'eo',
 'es',
 'et',
 'eu',
 'fa',
 'fi',
 'fr',
 'ga',
 'gl',
 'gu',
 'ha',
 'he',
 'hi',
 'hr',
 'hu',
 'hy',
 'id',
 'it',
 'ja',
 'ko',
 'ku',
 'la',
 'lt',
 'lv',
 'mr',
 'ms',
 'nl',
 'no',
 'pl',
 'pt',
 'ro',
 'ru',
 'sk',
 'sl',
 'so',
 'st',
 'sv',
 'sw',
 'th',
 'tl',
 'tr',
 'uk',
 'ur',
 'vi',
 'yo',
 'zh',
 'zu'}

In [None]:
sorted(stopwords.stopwords('en'))

["'ll",
 "'tis",
 "'twas",
 "'ve",
 '10',
 '39',
 'a',
 "a's",
 'able',
 'ableabout',
 'about',
 'above',
 'abroad',
 'abst',
 'accordance',
 'according',
 'accordingly',
 'across',
 'act',
 'actually',
 'ad',
 'added',
 'adj',
 'adopted',
 'ae',
 'af',
 'affected',
 'affecting',
 'affects',
 'after',
 'afterwards',
 'ag',
 'again',
 'against',
 'ago',
 'ah',
 'ahead',
 'ai',
 "ain't",
 'aint',
 'al',
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'alongside',
 'already',
 'also',
 'although',
 'always',
 'am',
 'amid',
 'amidst',
 'among',
 'amongst',
 'amoungst',
 'amount',
 'an',
 'and',
 'announce',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anymore',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'ao',
 'apart',
 'apparently',
 'appear',
 'appreciate',
 'appropriate',
 'approximately',
 'aq',
 'ar',
 'are',
 'area',
 'areas',
 'aren',
 "aren't",
 'arent',
 'arise',
 'around',
 'arpa',
 'as',
 'aside',
 'ask',
 'asked',
 'asking',
 'asks',
 'associated

In [None]:
en_sw = stopwords.stopwords('en')

def word_counts(x, pipeline=en):
  doc = pipeline(x)
  count = 0
  for sentence in doc.sentences:
    for token in sentence.tokens:
        if token.text.lower() not in en_sw:
          count += 1
  return count

In [None]:
train['Words'] = train['Message'].apply(word_counts)
test['Words'] = test['Message'].apply(word_counts)

In [None]:
x_train = train[['Length', 'Punctuation', 'Capitals', 'Words']]
y_train = train[['Spam']]

x_test = test[['Length', 'Punctuation', 'Capitals' , 'Words']]
y_test = test[['Spam']]

model = make_model(input_dims=4)
#model = make_model(input_dims=3)

model.fit(x_train, y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f3832aa7d30>

## POS Based Features

In [None]:
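import stanza  # successor to stanfordnlp; run `pip install stanza` first if it is missing
stanza.download('en')  # fetch the English models used by the pipeline
en = stanza.Pipeline(lang='en')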

txt = "Yo you around? A friend of mine's lookin."
pos = en(txt)

2020-10-14 04:51:48 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| sentiment | sstplus   |
| ner       | ontonotes |

2020-10-14 04:51:48 INFO: Use device: gpu
2020-10-14 04:51:48 INFO: Loading: tokenize
2020-10-14 04:51:48 INFO: Loading: pos
2020-10-14 04:51:49 INFO: Loading: lemma
2020-10-14 04:51:49 INFO: Loading: depparse
2020-10-14 04:51:50 INFO: Loading: sentiment
2020-10-14 04:51:51 INFO: Loading: ner
2020-10-14 04:51:51 INFO: Done loading processors!


In [None]:
def print_pos(doc):
    text = ""
    for sentence in doc.sentences:
        for token in sentence.tokens:
            text += token.words[0].text + "/" + \
                    token.words[0].upos + " "
        text += "\n"
    return text

In [None]:
print(print_pos(pos))

Yo/PRON you/PRON around/ADV ?/PUNCT 
A/DET friend/NOUN of/ADP mine/PRON 's/PART lookin/NOUN ./PUNCT 
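
The tagged output can be used to separate real words from punctuation. The sketch below works on the printed `word/TAG` text rather than on the pipeline objects, so it runs without any NLP library; the function name `parse_tagged` is hypothetical.

```python
tagged = (
    "Yo/PRON you/PRON around/ADV ?/PUNCT\n"
    "A/DET friend/NOUN of/ADP mine/PRON 's/PART lookin/NOUN ./PUNCT"
)


def parse_tagged(text):
    """Turn 'word/TAG' pairs into a list of (word, tag) tuples."""
    pairs = []
    for item in text.split():
        word, tag = item.rsplit("/", 1)  # rsplit tolerates '/' inside the word
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged(tagged)
words_only = [w for w, tag in pairs if tag not in ("PUNCT", "SYM")]
print(len(pairs), len(words_only))
print(words_only)
```

Filtering on the tag rather than on a list of punctuation characters lets the tagger decide what counts as punctuation or a symbol.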



In [None]:
en_sw = stopwords.stopwords('en')

def word_counts_v3(x, pipeline=en):
  doc = pipeline(x)
  count = 0
  for sentence in doc.sentences:
    for token in sentence.tokens:
        if token.text.lower() not in en_sw and \
        token.words[0].upos not in ['PUNCT', 'SYM']:
          count += 1
  return count

In [None]:
print(word_counts(txt), word_counts_v3(txt))

6 4


In [None]:
train['Test'] = 0
train.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test
count,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0
mean,0.132765,5.519399,18.886522,80.316439,9.326979,0.0
std,0.339359,11.405424,14.602023,59.346407,8.016488,0.0
min,0.0,0.0,0.0,2.0,0.0,0.0
25%,0.0,1.0,8.0,35.0,4.0,0.0
50%,0.0,2.0,15.0,61.0,7.0,0.0
75%,0.0,4.0,27.0,122.0,13.0,0.0
max,1.0,129.0,253.0,910.0,147.0,0.0


In [None]:
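def word_counts_v3(x, pipeline=en):
  # Returns the number of tokens that are neither stop words nor punctuation,
  # and the share of all tokens that are non-stop-word punctuation or symbols
  doc = pipeline(x)
  totals = 0.
  count = 0.
  non_word = 0.
  for sentence in doc.sentences:
    totals += len(sentence.tokens)  # total number of tokens, including stop words
    for token in sentence.tokens:
        if token.text.lower() not in en_sw:
          if token.words[0].upos not in ['PUNCT', 'SYM']:
            count += 1.
          else:
            non_word += 1.
  # guard against an empty document to avoid dividing by zero
  non_word = non_word / totals if totals else 0.
  return pd.Series([count, non_word], index=['Words_NoPunct', 'Punct'])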

In [None]:
x = train[:10]
x.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test
count,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.0,14.4,18.3,72.7,8.6,0.0
std,0.0,32.948445,14.772723,50.36103,10.068653,0.0
min,0.0,1.0,4.0,23.0,2.0,0.0
25%,0.0,1.0,7.25,37.75,3.0,0.0
50%,0.0,1.5,13.0,57.0,4.0,0.0
75%,0.0,9.0,23.75,88.0,10.75,0.0
max,0.0,107.0,48.0,161.0,35.0,0.0


In [None]:
train_tmp = train['Message'].apply(word_counts_v3)
train = pd.concat([train, train_tmp], axis=1)
train.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test,Words_NoPunct,Punct
count,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0
mean,0.132765,5.519399,18.886522,80.316439,9.326979,0.0,6.535995,0.147763
std,0.339359,11.405424,14.602023,59.346407,8.016488,0.0,5.679984,0.094337
min,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,8.0,35.0,4.0,0.0,3.0,0.090909
50%,0.0,2.0,15.0,61.0,7.0,0.0,5.0,0.142857
75%,0.0,4.0,27.0,122.0,13.0,0.0,9.0,0.2
max,1.0,129.0,253.0,910.0,147.0,0.0,54.0,0.666667


In [None]:
test_tmp = test['Message'].apply(word_counts_v3)
test = pd.concat([test, test_tmp], axis=1)
test.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Words_NoPunct,Punct
count,1115.0,1115.0,1115.0,1115.0,1115.0,1115.0,1115.0
mean,0.139013,6.030493,19.166816,80.95157,9.623318,6.700448,0.152936
std,0.346116,12.731059,15.694599,61.807655,8.303803,5.887786,0.101909
min,0.0,0.0,0.0,2.0,0.0,0.0,0.0
25%,0.0,1.0,8.0,36.0,4.0,3.0,0.096774
50%,0.0,2.0,15.0,61.0,7.0,4.0,0.142857
75%,0.0,4.0,28.0,123.0,14.0,10.0,0.2
max,1.0,127.0,195.0,790.0,83.0,45.0,1.0


In [None]:
z = pd.concat([x, train_tmp], axis=1)
z.describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test,Words_NoPunct,Punct
count,10.0,10.0,10.0,10.0,10.0,10.0,4459.0,4459.0
mean,0.0,14.4,18.3,72.7,8.6,0.0,6.535995,0.147763
std,0.0,32.948445,14.772723,50.36103,10.068653,0.0,5.679984,0.094337
min,0.0,1.0,4.0,23.0,2.0,0.0,0.0,0.0
25%,0.0,1.0,7.25,37.75,3.0,0.0,3.0,0.090909
50%,0.0,1.5,13.0,57.0,4.0,0.0,5.0,0.142857
75%,0.0,9.0,23.75,88.0,10.75,0.0,9.0,0.2
max,0.0,107.0,48.0,161.0,35.0,0.0,54.0,0.666667


In [None]:
z.loc[z['Spam']==0].describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test,Words_NoPunct,Punct
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.0,14.4,18.3,72.7,8.6,0.0,5.5,0.151479
std,0.0,32.948445,14.772723,50.36103,10.068653,0.0,7.412452,0.063396
min,0.0,1.0,4.0,23.0,2.0,0.0,1.0,0.0
25%,0.0,1.0,7.25,37.75,3.0,0.0,2.0,0.130721
50%,0.0,1.5,13.0,57.0,4.0,0.0,2.0,0.166667
75%,0.0,9.0,23.75,88.0,10.75,0.0,6.75,0.2
max,0.0,107.0,48.0,161.0,35.0,0.0,25.0,0.208333


In [None]:
z.loc[z['Spam']==1].describe()

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test,Words_NoPunct,Punct
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,,,,,,,
std,,,,,,,,
min,,,,,,,,
25%,,,,,,,,
50%,,,,,,,,
75%,,,,,,,,
max,,,,,,,,


In [None]:
aa = [word_counts_v3(y) for y in x['Message']]

In [None]:
ab = pd.DataFrame(aa)
ab.describe()

Unnamed: 0,Words_NoPunct,Punct
count,10.0,10.0
mean,5.5,0.151479
std,7.412452,0.063396
min,1.0,0.0
25%,2.0,0.130721
50%,2.0,0.166667
75%,6.75,0.2
max,25.0,0.208333


# Lemmatization

In [None]:

text = "Stemming is aimed at reducing vocabulary and aid un-derstanding of" +\
       " morphological processes. This helps people un-derstand the" +\
       " morphology of words and reduce size of corpus."

lemma = en(text)

In [None]:
lemmas = ""
for sentence in lemma.sentences:
    for token in sentence.tokens:
        # Append "lemma/UPOS" for the first word of each token
        lemmas += token.words[0].lemma + "/" + token.words[0].upos + " "
    lemmas += "\n"

print(lemmas)

stemming/NOUN be/AUX aim/VERB at/SCONJ reduce/VERB vocabulary/NOUN and/CCONJ aid/NOUN un/NOUN -/PUNCT derstanding/NOUN of/ADP morphological/ADJ process/NOUN ./PUNCT 
this/PRON help/VERB people/NOUN un/NOUN -/PUNCT derstand/VERB the/DET morphology/NOUN of/ADP word/NOUN and/CCONJ reduce/VERB size/NOUN of/ADP corpus/NOUN ./PUNCT 



# TF-IDF Based Model


In [None]:
# if not installed already (the PyPI package is named scikit-learn; it is imported as sklearn)
!pip install scikit-learn



In [None]:
corpus = [
          "I like fruits. Fruits like bananas",
          "I love bananas but eat an apple",
          "An apple a day keeps the doctor away"
]


## Count Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

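# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()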

['an',
 'apple',
 'away',
 'bananas',
 'but',
 'day',
 'doctor',
 'eat',
 'fruits',
 'keeps',
 'like',
 'love',
 'the']

In [None]:
X.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]])
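
To see where these rows come from, the counting can be reproduced with the standard library alone. This is a minimal sketch, not scikit-learn's implementation: it assumes `CountVectorizer`'s default behavior of lowercasing the text and keeping only tokens of two or more word characters, which is why "I" and "a" are missing from the vocabulary. The helper names `tokenize` and `count_vectorize` are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    "I like fruits. Fruits like bananas",
    "I love bananas but eat an apple",
    "An apple a day keeps the doctor away",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def count_vectorize(docs):
    # Vocabulary is the sorted set of all tokens; each row holds per-document counts
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = count_vectorize(corpus)
print(vocab)
for row in rows:
    print(row)
```

Each column position corresponds to one vocabulary term, so the first row has a 2 under `fruits` and `like` and a 1 under `bananas`.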

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X.toarray())

array([[1.        , 0.13608276, 0.        ],
       [0.13608276, 1.        , 0.3086067 ],
       [0.        , 0.3086067 , 1.        ]])

In [None]:
query = vectorizer.transform(["apple and bananas"])

cosine_similarity(X, query)

array([[0.23570226],
       [0.57735027],
       [0.26726124]])
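
The similarity scores above can be checked by hand. The sketch below recomputes them with NumPy from the count matrix printed earlier, using the definition of cosine similarity (dot product divided by the product of the vector norms). The names `counts`, `cosine`, and `query_vec` are illustrative and not part of the notebook.

```python
import numpy as np

# Count matrix copied from the X.toarray() output above
counts = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
], dtype=float)

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "apple and bananas": "and" is not in the vocabulary, so only
# apple (index 1) and bananas (index 3) are set in the query vector
query_vec = np.zeros(counts.shape[1])
query_vec[[1, 3]] = 1.0

doc_sims = [cosine(counts[0], counts[1]), cosine(counts[1], counts[2])]
query_sims = [cosine(row, query_vec) for row in counts]
print(np.round(doc_sims, 8))
print(np.round(query_sims, 8))
```

The second document scores highest for the query because it is the only one containing both `apple` and `bananas`.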

## TF-IDF Vectorization

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X.toarray())

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tfidf.toarray(),
             columns=vectorizer.get_feature_names_out())

Unnamed: 0,an,apple,away,bananas,but,day,doctor,eat,fruits,keeps,like,love,the
0,0.0,0.0,0.0,0.230408,0.0,0.0,0.0,0.0,0.688081,0.0,0.688081,0.0,0.0
1,0.321267,0.321267,0.0,0.321267,0.479709,0.0,0.0,0.479709,0.0,0.0,0.0,0.479709,0.0
2,0.275785,0.275785,0.411797,0.0,0.0,0.411797,0.411797,0.0,0.0,0.411797,0.0,0.0,0.411797
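
The weights in this table can be reproduced from the raw counts. With `smooth_idf=False`, scikit-learn's `TfidfTransformer` uses `idf(t) = ln(n / df(t)) + 1` and then L2-normalizes each row by default. The following is a sketch of that arithmetic in NumPy, not the library's code; `counts` and `tfidf_manual` are illustrative names.

```python
import numpy as np

# Count matrix copied from the X.toarray() output above
counts = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)        # number of documents containing each term
idf = np.log(n_docs / df) + 1.0      # unsmoothed idf, as with smooth_idf=False
weighted = counts * idf              # raw term frequency times idf
# L2-normalize each row so every document vector has unit length
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf_manual, 6))
```

In the first document, `fruits` and `like` outweigh `bananas` both because they occur twice and because they appear in only one document, which gives them a larger idf.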


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

tfidf = TfidfVectorizer(binary=True)
X = tfidf.fit_transform(train['Message']).astype('float32')
X_test = tfidf.transform(test['Message']).astype('float32')

In [None]:
X.shape

(4459, 7741)

In [None]:
_, cols = X.shape
model2 = make_model(cols)  # input size must match the TF-IDF vocabulary size
lb = LabelEncoder()
y = lb.fit_transform(y_train)
# One-hot labels; note they are not used below, the model is fit on y_train directly
dummy_y_train = tf.keras.utils.to_categorical(y)
model2.fit(X.toarray(), y_train, epochs=10, batch_size=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f37d1fb7828>

In [None]:
model2.evaluate(X_test.toarray(), y_test)



[0.05765564367175102, 0.9838564991950989]
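
Only about 14% of the test messages are spam (the `Spam` mean of 0.139013 in the test summary), so accuracy alone can look high even when many spam messages are missed. A rough sketch of per-class metrics follows. The helper `spam_metrics` and the toy labels and scores are hypothetical and only illustrate the computation.

```python
import numpy as np

def spam_metrics(y_true, y_prob, threshold=0.5):
    # Turn predicted probabilities into 0/1 labels, then count outcomes for the spam class
    y_true = np.asarray(y_true).ravel()
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = float(np.mean(y_pred == y_true))
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical toy labels and scores, for illustration only
toy_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1]
toy_prob = [0.1, 0.2, 0.7, 0.05, 0.9, 0.4, 0.3, 0.1, 0.2, 0.8]
metrics = spam_metrics(toy_true, toy_prob)
print(metrics)
```

With the notebook's objects this could be called as `spam_metrics(y_test, model2.predict(X_test.toarray()))`, assuming `model2` ends in a single sigmoid unit that outputs a spam probability.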

In [None]:
train.loc[train.Spam == 1].describe() 

Unnamed: 0,Spam,Capitals,Punctuation,Length,Words,Test,Words_NoPunct,Punct
count,592.0,592.0,592.0,592.0,592.0,592.0,592.0,592.0
mean,1.0,15.320946,29.086149,138.856419,18.469595,0.0,14.25,0.138386
std,0.0,11.635105,7.083572,28.07998,6.085607,0.0,4.701046,0.064732
min,1.0,0.0,2.0,13.0,2.0,0.0,2.0,0.0
25%,1.0,7.0,26.0,132.0,14.0,0.0,11.0,0.096774
50%,1.0,14.0,30.0,149.0,19.0,0.0,14.0,0.137931
75%,1.0,21.0,34.0,157.0,23.0,0.0,18.0,0.176471
max,1.0,128.0,49.0,197.0,33.0,0.0,27.0,0.333333


# Word Vectors

In [None]:
# Loading the word vectors may exceed the notebook's memory limit. Delete large objects
# before running this section, or run it in a separate notebook.
!pip install gensim



In [None]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api


In [None]:
api.info()

{'corpora': {'20-newsgroups': {'checksum': 'c92fd4f6640a86d5ba89eaad818a9891',
   'description': 'The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups.',
   'fields': {'data': '',
    'id': 'original id inferred from folder name',
    'set': "marker of original split (possible values 'train' and 'test')",
    'topic': 'name of topic (20 variant of possible values)'},
   'file_name': '20-newsgroups.gz',
   'file_size': 14483581,
   'license': 'not found',
   'num_records': 18846,
   'parts': 1,
   'read_more': ['http://qwone.com/~jason/20Newsgroups/'],
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py',
   'record_format': 'dict'},
  '__testing_matrix-synopsis': {'checksum': '1767ac93a089b43899d54944b07d9dc5',
   'description': '[THIS IS ONLY FOR TESTING] Synopsis of the movie matrix.',
   'file_name': '__testing_matrix-synopsis.gz',
   'parts': 1,
   're

In [None]:
model_w2v = api.load("word2vec-google-news-300")


In [None]:
model_w2v.most_similar("cookies",topn=10)


[('cookie', 0.745154082775116),
 ('oatmeal_raisin_cookies', 0.6887780427932739),
 ('oatmeal_cookies', 0.662139892578125),
 ('cookie_dough_ice_cream', 0.6520504951477051),
 ('brownies', 0.6479344964027405),
 ('homemade_cookies', 0.6476464867591858),
 ('gingerbread_cookies', 0.6461867690086365),
 ('Cookies', 0.6341644525527954),
 ('cookies_cupcakes', 0.6275068521499634),
 ('cupcakes', 0.6258294582366943)]

In [None]:
model_w2v.doesnt_match(["USA","Canada","India","Tokyo"])


'Tokyo'

In [None]:
king = model_w2v['king']
man = model_w2v['man']
woman = model_w2v['woman']

queen = king - man + woman  
model_w2v.similar_by_vector(queen)


[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.6454660892486572),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
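
In the result above, `king` ranks first because `similar_by_vector` scores every word in the vocabulary, including the words used to build the query vector. gensim's `most_similar(positive=["king", "woman"], negative=["man"])` leaves the input words out of its results. The toy sketch below shows the same ranking logic with NumPy. The 3-dimensional vectors are hypothetical values chosen for illustration, not taken from the Google News model, and `rank_by_cosine` is a made-up helper.

```python
import numpy as np

# Hypothetical 3-d vectors (royalty, male, female), for illustration only
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

def rank_by_cosine(target, vectors, exclude=()):
    # Score every word by cosine similarity to the target vector, highest first
    scores = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
        scores.append((word, sim))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
all_ranked = rank_by_cosine(target, toy_vectors)
filtered = rank_by_cosine(target, toy_vectors, exclude=("king", "man", "woman"))
print(all_ranked)
print(filtered)
```

Passing the input words as `exclude` mirrors the filtering idea: only words that were not part of the arithmetic remain as candidates.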