

```

import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages

spam = pd.read_csv('../input/nlp-course/spam.csv')
spam.head(10)

```





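Machine learning models don't learn from raw text. Instead, you need to convert the text to something numeric.

The simplest common representation is a variation of one-hot encoding called **bag of words**: each document is represented as a vector of term counts, one entry per term in the vocabulary, with word order ignored.

spaCy handles the bag of words conversion and builds a simple linear model for you with the **TextCategorizer** class. The spaCy calls in this section follow the spaCy v2 API.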
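
As a rough illustration of what a bag of words vector looks like, the sketch below counts terms in two invented messages using only the Python standard library. The whitespace tokenization and the `bag_of_words` helper are assumptions made for this example; spaCy's tokenizer and the features it builds internally are more sophisticated.

```python
from collections import Counter

# Toy documents, invented for illustration (not taken from spam.csv)
docs = ["free tea free prize", "tea party tonight"]

# Naive whitespace tokenization; spaCy's tokenizer is more sophisticated
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every distinct term, sorted so the vector columns are stable
vocab = sorted({token for tokens in tokenized for token in tokens})


def bag_of_words(tokens, vocab):
    """Return one count per vocabulary term, ignoring word order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


vectors = [bag_of_words(tokens, vocab) for tokens in tokenized]

print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary term, so two documents that use the same words in a different order get the same vector.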

```
# Bag of Words

import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)


```





```
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

```





```
# Prepare the training texts and labels for the text categorizer

train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}} 
                for label in spam['label']]
```
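
To see what this label format looks like, here is a small sketch that applies the same list comprehension to a hypothetical three-item label list standing in for `spam['label']`. With exclusive classes, exactly one category is `True` for each message.

```python
# Hypothetical label column, standing in for spam['label']
labels = ["ham", "spam", "ham"]

# Same construction as above: one {"cats": {...}} dict per message
train_labels = [{"cats": {"ham": label == "ham",
                          "spam": label == "spam"}}
                for label in labels]

for label, annotation in zip(labels, train_labels):
    print(label, "->", annotation)
```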





```
# combine the texts and labels into a single list.

train_data = list(zip(train_texts, train_labels))
train_data[:3]

```



Now you are ready to train the model.

First, create an optimizer using **nlp.begin_training()**. spaCy uses this optimizer to update the model.

In general, it's more efficient to train models in small batches. spaCy provides the **minibatch** function, which returns a generator yielding minibatches for training.

Finally, the minibatches are split into texts and labels, then used with **nlp.update** to update the model's parameters.



```
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)

```
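
The batching and the `zip(*batch)` split can be tried without spaCy. The sketch below is a simplified stand-in for fixed-size batching, not spaCy's implementation of `minibatch`; the `fixed_size_batches` helper and the toy data are hypothetical.

```python
def fixed_size_batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy (text, label) pairs, invented for illustration
train_data = [("text %d" % i,
               {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
              for i in range(5)]

batches = list(fixed_size_batches(train_data, size=2))
print([len(batch) for batch in batches])  # the last batch may be smaller

# zip(*batch) transposes a list of (text, label) pairs into
# one tuple of texts and one tuple of labels
texts, labels = zip(*batches[0])
print(texts)
print(labels)
```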



This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. Use another loop for more epochs, and optionally re-shuffle the training data at the beginning of each loop.



```
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

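# The losses dict is created once, so nlp.update keeps adding to it
# and the value printed after each epoch is a running total
losses = {}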
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)
    
```



**Making Predictions**



```
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)
```



The scores are used to predict a single class or label by choosing the label with the highest probability.
You get the index of the highest probability with **scores.argmax**, then use the index to get the label string from **textcat.labels**.

```
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

```
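
The same lookup can be sketched in plain Python. The scores below are made-up values, not model output, and the `argmax` helper is a hypothetical stand-in for `scores.argmax(axis=1)` applied one row at a time.

```python
# Made-up scores for two documents; columns follow the label order below
labels = ("ham", "spam")
scores = [[0.98, 0.02],
          [0.10, 0.90]]


def argmax(row):
    """Return the index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


# Map each row's best-scoring index back to its label string
predicted = [labels[argmax(row)] for row in scores]
print(predicted)
```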

