## Text Classification with spaCy

- a common task in NLP is **text classification**
    - this is classification in the conventional ML sense, applied to text

- examples
    - spam detection 
    - sentiment analysis
    - tagging customer queries

***

## Spam Detection with spaCy

- this classifier will detect spam messages
- spam detection is a common feature in most email clients

### Import pandas

In [9]:
import pandas as pd 

### Load labelled data

In [10]:
spam = pd.read_csv('kaggle_data/spam.csv')

### Check head of dataset

In [11]:
spam.head(10)

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


- here, `ham` is the label for non-spam data

***

## Numerical Representation of Text

- Machine learning models don't learn from raw text data

    - Instead, you need to convert the text to something numeric
    
    - The simplest common representation is a variation of one-hot encoding

- once we have a numerical representation of the raw text, those numerical vectors can be fed into any machine learning model



### Building the Vocabulary

- The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents)

**Example**

- As an example, take the sentences 
    - "Tea is life. Tea is love." and 

    - "Tea is healthy, calming, and delicious." as our corpus. 
    
- The vocabulary then is `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}` (ignoring punctuation)
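
A minimal sketch of building this vocabulary in plain Python. The `tokenize` helper and its regex are assumptions made for illustration (lowercase, letters only); spaCy's own tokenizer behaves differently, for instance it keeps punctuation as separate tokens.

```python
import re

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep only runs of letters,
    # which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())

# Collect terms in order of first appearance so each term gets a stable index.
vocabulary = []
for document in corpus:
    for term in tokenize(document):
        if term not in vocabulary:
            vocabulary.append(term)

print(vocabulary)
```

Keeping the terms in a list (rather than a set) fixes the position of each term, which matters once documents are turned into vectors.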

### Bag of Words (One-Hot-Encoding for text) Representation

- You represent each document as a vector of term frequencies for each term in the vocabulary

- so for the sentences:
    - "Tea is life. Tea is love." and 

    - "Tea is healthy, calming, and delicious." as our corpus

- vector representation is given by
    -   `v_1 = [2 2 1 1 0 0 0 0]` and 
    -   `v_2 = [1 1 0 0 1 1 1 1]`  

- this vector representation is called the **bag-of-words** representation

- vocabularies typically have 10k-90k terms, so these **bag-of-words** vectors can be very large
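
The two vectors above can be reproduced with a short sketch. The `bag_of_words` function is a hypothetical helper written for this example, and the regex tokenization is an assumption (it is not how spaCy tokenizes).

```python
import re
from collections import Counter

vocabulary = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def bag_of_words(text, vocabulary):
    # Count every lowercase word in the text, then read the counts off
    # in vocabulary order. Terms that do not occur get a count of 0.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(document, vocabulary) for document in corpus]
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary term, which is why the vectors grow with the size of the vocabulary.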

### TF-IDF Representation

- Another common representation of text in numerical form is **TF-IDF** (Term Frequency - Inverse Document Frequency)

- TF-IDF is similar to bag of words except that each term count is scaled down by how common the term is across the documents in the corpus, so terms that appear in nearly every document carry less weight
    - Using TF-IDF can potentially improve your models. You won't need it here. Feel free to look it up though
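
For the curious, here is a rough sketch of the idea applied to the counts `v_1` and `v_2` from the bag-of-words section. It assumes the plain `log(N / df)` form of the inverse document frequency; this is only one of several variants, and libraries often use smoothed versions, so treat the numbers as illustrative.

```python
import math

vocabulary = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]

# Term counts per document (v_1 and v_2 from the bag-of-words section).
counts = [
    [2, 2, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 1],
]
n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[i] > 0) for i in range(len(vocabulary))]

# Assumed IDF variant: log(N / df). Terms found in every document get weight 0.
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF: scale each term count by the term's IDF weight.
tfidf = [[tf * weight for tf, weight in zip(row, idf)] for row in counts]

for term, weights in zip(vocabulary, zip(*tfidf)):
    print(term, [round(w, 3) for w in weights])
```

With this variant, "tea" and "is" drop to zero because they occur in both documents, while terms unique to one document keep a positive weight.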



***

## Building a Bag-of-Words Prediction Model

- with the `TextCategorizer` class, spaCy handles 
    - bag of words conversion and 
    
    - building a simple linear model for you 
    
### Pipes

- `TextCategorizer` is actually a spaCy **pipe**

- **pipes** are classes for processing and transforming tokens

#### Default Pipes

- `nlp = spacy.load('en')` actually creates a model with default pipes that perform part of speech tagging, entity recognition and other transformations

- when text is run through this `nlp` model by doing `doc = nlp("Some text here")`

    - the outputs of the pipes are attached to the tokens in the `doc` object
    
    - the lemmas returned by `token.lemma_` come from one of these pipes

#### Modifying Model Pipes

- pipes may be added to or removed from models

- to build a model from scratch with only the desired pipes, create an empty model first

    - `nlp = spacy.blank("en")`

    - the empty model still comes with a tokenizer, since text representation models always have a tokenizer

- then, add the `TextCategorizer` pipe to the empty model for instance 
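
A quick way to see what an empty model contains is sketched below. It only uses `spacy.blank` and `nlp.pipe_names`; the sample sentence is an arbitrary choice for illustration.

```python
import spacy

# A blank English model: it has a tokenizer but no pipes yet.
nlp = spacy.blank("en")
print(nlp.pipe_names)

# The tokenizer still works without any pipes.
doc = nlp("Tea is life.")
print([token.text for token in doc])
```

Checking `nlp.pipe_names` again after adding a pipe is a simple way to confirm that the pipe was registered.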

### Implementation (bag-of-words prediction model)

In [12]:
# import spacy
import spacy

# create empty model
nlp = spacy.blank("en")

# create the `textcategorizer` pipe with the given parameter settings
textcat = nlp.create_pipe(
    "textcat", # internal name of the pipe
    config={
        # each text gets exactly one label (spam or ham)
        "exclusive_classes": True,
        # use the bag-of-words ("bow") model architecture
        "architecture": "bow",
    }
)

# add the newly created pipe to the blank nlp model
nlp.add_pipe(textcat)


- Since the label classes are either `ham` or `spam`, we set "`exclusive_classes`" to `True`

- We've also configured it with the bag of words ("`bow`") architecture

- spaCy provides a convolutional neural network architecture as well, but it's more complex than you need for now


#### Adding labels to the pipe in the model

- next we'll add the labels to the model
    - "ham" is for the real messages

    - "spam" is for spam messages

In [13]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

1

#### Training a Text Categorizer Model


- Next, convert the labels in the data to the form `TextCategorizer` requires

- For each document, create a dictionary of boolean values for each class

- For example, if a text is "ham", we need a dictionary `{'ham': True, 'spam': False}`
    - The model is looking for these labels inside another dictionary with the key 'cats'


In [15]:
train_text = spam['text'].values
print(train_text)

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
 'Ok lar... Joking wif u oni...'
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
 ... 'Pity, * was in mood for that. So...any other suggestions?'
 "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free"
 'Rofl. Its true to its name']


In [19]:
train_labels = [{'cats':{
    'ham': label == 'ham',
    'spam': label == 'spam',
}} for label in spam['label']]
print(train_labels)

[{'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': False, 'spam': True}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}},

In [21]:
train_data = list(zip(train_text, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

#### Building an optimizer

- spaCy uses an optimizer to update the model

    - create an `optimizer` using `nlp.begin_training()`


#### Setting up a batcher

- in general, it is more efficient to train models in small batches 

- spaCy provides the `minibatch` function, which returns a generator yielding minibatches for training

- finally, the minibatches are split into texts and labels

    - then used with `nlp.update` to update the model's parameters

In [26]:
# setup the batch processor
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

batches = minibatch(train_data, size=8)

#### One Training Epoch

- the following is just one training loop (or epoch) through the data

In [25]:
# iterate through the minibatches
for batch in batches:
    # each batch is a list of (text,label)
    # but send separate lists for text and labels to update()
    # quick way to split a list of tuples into lists 
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
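
To make the batching and the `zip(*batch)` split concrete, here is a plain-Python sketch. `simple_minibatch` is a hypothetical stand-in written for illustration, not spaCy's `minibatch` implementation, and `toy_data` is a made-up three-item sample in the same `(text, label)` format as `train_data`.

```python
def simple_minibatch(items, size):
    # Hypothetical stand-in for spacy.util.minibatch:
    # yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy (text, label) pairs; the labels here are assigned by hand.
toy_data = [
    ("Ok lar... Joking wif u oni...", {"cats": {"ham": True, "spam": False}}),
    ("Rofl. Its true to its name", {"cats": {"ham": True, "spam": False}}),
    ("URGENT Reply to this message for GUARANTEED FREE TEA",
     {"cats": {"ham": False, "spam": True}}),
]

batches = list(simple_minibatch(toy_data, size=2))
print([len(batch) for batch in batches])  # the last batch may be smaller

# zip(*batch) turns a list of (text, label) pairs into
# one tuple of texts and one tuple of labels.
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```

Note that a generator can only be consumed once, which is why a fresh set of batches has to be created for every pass over the data.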

#### Multiple Epochs

- Training a model will typically need multiple epochs 

- Use another loop for more epochs

    - and optionally re-shuffle the training data at the beginning of each loop

In [29]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# running totals of the loss; nlp.update() adds to this dict on every call,
# so the values printed below accumulate across epochs
losses = {}
number_of_epochs = 10  # number of passes over the training data

for epoch in range(number_of_epochs):

    # shuffle training data at the beginning of the epoch
    random.shuffle(train_data)

    # create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)

    # iterate through the minibatches
    for batch in batches:
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.09073213039844849}
{'textcat': 0.1292124649657731}
{'textcat': 0.1522202355726207}
{'textcat': 0.16769853011832045}
{'textcat': 0.17897253698093277}
{'textcat': 0.1882149436709063}
{'textcat': 0.1960779418065535}
{'textcat': 0.20293926947665042}
{'textcat': 0.20946894104579655}
{'textcat': 0.21498010334911705}
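
The printed values keep growing, which can look as if training is getting worse. As far as I can tell, in spaCy v2 `nlp.update` adds each batch's loss to the entry already stored in `losses`, and the dictionary here is created once before the epoch loop, so the numbers are running totals. Under that assumption, a small sketch that recovers the loss added in each epoch from the printed output:

```python
# Running totals copied from the output above.
cumulative = [
    0.09073213039844849,
    0.1292124649657731,
    0.1522202355726207,
    0.16769853011832045,
    0.17897253698093277,
    0.1882149436709063,
    0.1960779418065535,
    0.20293926947665042,
    0.20946894104579655,
    0.21498010334911705,
]

# The difference between consecutive totals is the loss added in that epoch.
per_epoch = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.4f}")
```

Read this way, the loss shrinks with every epoch. Creating `losses = {}` inside the epoch loop instead would print per-epoch values directly.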


### Making predictions with the trained model

- you can make predictions with the `predict()` method of the trained model

- the input text needs to be tokenized with `nlp.tokenizer` 

    - then pass the tokens to the predict method which returns scores
    
    - each score is the probability that the input text belongs to the corresponding class





In [30]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
         
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

[[9.9998879e-01 1.1163655e-05]
 [1.0171348e-03 9.9898285e-01]]


- scores are used to predict a single class or label by choosing the label with the highest probability

- get the index of the highest probability with `scores.argmax`,

    - then use the index to get the label string from `textcat.labels`

In [31]:
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['ham', 'spam']
