In [2]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/spam.csv')
spam.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


spaCy handles the bag-of-words conversion and builds a simple linear model for you with the TextCategorizer class. The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

Since the classes are either ham or spam, we set "exclusive_classes" to True.


In [4]:
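import spacy
# Note: this notebook uses the spaCy 2.x API (nlp.create_pipe, nlp.begin_training);
# spaCy 3.x changed these calls.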
# Create an empty model
nlp = spacy.blank("en")
# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})
# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

1

"bow" stands for bag of words: the model looks at which tokens a message contains and ignores their order.
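To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not spaCy's internal implementation; the helper name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions chosen only for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Hypothetical helper: count lowercase, whitespace-separated tokens.

    Word order is discarded; only how often each token occurs is kept.
    """
    return Counter(text.lower().split())


bow = bag_of_words("free tea FREE entry")
# Tokens that never occur simply have a count of zero.
print(bow["free"], bow["tea"], bow["prize"])  # prints: 2 1 0
```

A linear model on top of such counts learns one weight per token and class, which is roughly why words like "free" or "winner" can push a message towards spam.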

In [5]:
train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]

# The zip() function returns an iterator of tuples based on the iterable objects
# zip(['a', 'b', 'c'], [1, 2, 3]) yields ('a', 1) ('b', 2) ('c', 3)
train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

Next, create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general, it's more efficient to train models in small batches. spaCy provides the minibatch function, which returns a generator yielding minibatches for training. Finally, each minibatch is split into texts and labels, which are passed to nlp.update to update the model's parameters.
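The batching and splitting steps can be tried without spaCy. The sketch below uses a hypothetical stand-in, `simple_minibatch`, that mimics the fixed-size behaviour described above (it is not spaCy's implementation), and toy data in the same (text, annotation) shape as train_data.

```python
from itertools import islice


def simple_minibatch(items, size):
    """Hypothetical stand-in for spacy.util.minibatch with a fixed batch size.

    Yields lists of up to `size` items until the input is exhausted.
    """
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Toy (text, annotation) pairs shaped like train_data; the texts are made up.
toy_data = [
    ("hi", {"cats": {"ham": True, "spam": False}}),
    ("win cash", {"cats": {"ham": False, "spam": True}}),
    ("see you", {"cats": {"ham": True, "spam": False}}),
    ("free prize", {"cats": {"ham": False, "spam": True}}),
    ("ok", {"cats": {"ham": True, "spam": False}}),
]

batch_sizes = []
for batch in simple_minibatch(toy_data, size=2):
    # zip(*batch) turns a list of pairs into one tuple of texts
    # and one tuple of annotations.
    texts, labels = zip(*batch)
    batch_sizes.append(len(texts))

print(batch_sizes)  # prints: [2, 2, 1]
```

The last batch is smaller because five items do not divide evenly into batches of two.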


In [6]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    # In python, * is the 'splat' operator. It is used for unpacking a list into arguments. 
    # For example: foo(*[1, 2, 3]) is the same as foo(1, 2, 3).
    
    texts, labels = zip(*batch)   
    nlp.update(texts, labels, sgd=optimizer)

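This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. Use another loop for more epochs, and optionally re-shuffle the training data at the beginning of each loop.

print(losses) shows the accumulated losses of the model. Because the losses dictionary is created once, outside the epoch loop, the printed value keeps growing from one epoch to the next.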

In [7]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.43189741921099767}
{'textcat': 0.6474976215331196}
{'textcat': 0.7842154536487618}
{'textcat': 0.8716683716818165}
{'textcat': 0.9280939335008995}
{'textcat': 0.9655779922872296}
{'textcat': 0.9939651840090362}
{'textcat': 1.0127976631523663}
{'textcat': 1.0275637812859075}
{'textcat': 1.0378531470013608}


Now that you have a trained model, you can make predictions with the predict() method. The input text needs to be tokenized with nlp.tokenizer. Then you pass the tokens to the predict method, which returns scores. Each score is the predicted probability that the input text belongs to the corresponding class. Here the columns follow the order in which the labels were added: ham first, then spam.

In [12]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA",
         "This is the tea I like, would you like to try it out",
         "Tea for sale. Have a 30% off by the end of Auguest"
         ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores.round(3))

[[1.    0.   ]
 [0.011 0.989]
 [0.966 0.034]
 [0.856 0.144]
 [0.757 0.243]]
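To turn scores into class names, pick the label with the highest score in each row. The sketch below does this in plain Python on two rows copied from the output above. It assumes the column order is ham, spam (the order in which the labels were added; in spaCy this order can be read from textcat.labels), and `predicted_labels` is a hypothetical helper name.

```python
# Assumed column order: the order in which the labels were added.
labels = ["ham", "spam"]

# Two score rows copied from the printed output above.
example_scores = [
    [0.011, 0.989],
    [0.966, 0.034],
]


def predicted_labels(score_rows, label_names):
    """Hypothetical helper: return the highest-scoring label for each row."""
    result = []
    for row in score_rows:
        best_index = max(range(len(row)), key=lambda i: row[i])
        result.append(label_names[best_index])
    return result


print(predicted_labels(example_scores, labels))  # prints: ['spam', 'ham']
```

With the real model output, calling .tolist() on the NumPy scores array would give rows in the same shape as example_scores.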
