## Text Classification with spaCy

- a common task in NLP is **text classification**
    - this is classification in the conventional ML sense, applied to text

- examples
    - spam detection 
    - sentiment analysis
    - tagging customer queries

***

## Spam Detection with spaCy

- this classifier will detect spam messages
- spam detection is a common feature of most email clients

### Import pandas

In [3]:
import pandas as pd 

### Load labelled data

In [5]:
spam = pd.read_csv('kaggle_data/spam.csv')

### Check head of dataset

In [7]:
spam.head(10)

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


- here, `ham` is the label for non-spam data

***

## Numerical Representation of Text

- Machine learning models don't learn from raw text data

    - Instead, you need to convert the text to something numeric
    
    - The simplest common representation is a variation of one-hot encoding

- once we have a numerical representation of the raw text, those numerical vectors can be fed into any machine learning model



### Building the Vocabulary

- The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents)

**Example**

- As an example, take the sentences 
    - "Tea is life. Tea is love." and 

    - "Tea is healthy, calming, and delicious." as our corpus. 
    
- The vocabulary then is `{"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"}` (ignoring punctuation)
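The vocabulary construction above can be sketched in plain Python. This is only an illustration: the `tokenize` helper is a hypothetical regex-based stand-in for a real tokenizer, and it lowercases the text and drops punctuation, as the example assumes.

```python
import re

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

def tokenize(text):
    """Lowercase the text and keep only alphabetic runs (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Collect each term once, in order of first appearance in the corpus.
vocabulary = []
for document in corpus:
    for token in tokenize(document):
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)
```

Keeping the terms in a list rather than a set fixes their order, which matters once each term is mapped to a position in a vector.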

### Bag of Words (One-Hot-Encoding for text) Representation

- You represent each document as a vector of term frequencies for each term in the vocabulary

- so for the two sentences in our corpus:
    - "Tea is life. Tea is love." and

    - "Tea is healthy, calming, and delicious."

- with the vocabulary ordered as listed above, the vector representations are
    -   `v_1 = [2 2 1 1 0 0 0 0]` and 
    -   `v_2 = [1 1 0 0 1 1 1 1]`  

- this vector representation is called the **bag-of-words** representation

- vocabularies typically have 10k-90k terms, so these **bag-of-words** vectors can be very large
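As a rough sketch, the two bag-of-words vectors can be reproduced by counting how often each vocabulary term occurs in a document. The `bag_of_words` function and the regex tokenization are illustrative assumptions, not what spaCy does internally.

```python
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]

# Vocabulary in the same order as in the text above.
vocabulary = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary term in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(document, vocabulary) for document in corpus]

for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary term, so its length grows with the vocabulary no matter how short the document is.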

### TF-IDF Representation

- Another common representation of text in numerical form is **TF-IDF** (Term Frequency - Inverse Document Frequency)

- TF-IDF is similar to bag of words, except that each term count is scaled by the term's inverse document frequency, so terms that appear in many documents of the corpus are down-weighted
    - Using TF-IDF can potentially improve your models. You won't need it here, but feel free to look it up
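For intuition only, here is a sketch of one common unsmoothed TF-IDF variant, where a term's weight is its count multiplied by `log(N / df)` (`N` is the number of documents, `df` the number of documents containing the term). Libraries typically use smoothed or normalized formulas, so their numbers will differ; the variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "Tea is life. Tea is love.",
    "Tea is healthy, calming, and delicious.",
]
vocabulary = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]

tokenized = [re.findall(r"[a-z]+", document.lower()) for document in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Inverse document frequency (unsmoothed): terms found in every document get 0.
idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

# Scale each raw term count by the term's IDF.
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf.append([counts[term] * idf[term] for term in vocabulary])

for row in tfidf:
    print([round(weight, 3) for weight in row])
```

In this tiny corpus, "tea" and "is" occur in both documents, so their weights drop to zero, while terms unique to one document keep a positive weight.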



***

## Building a Bag-of-Words Prediction Model

- with the `TextCategorizer` class, spaCy handles 
    - bag of words conversion and 
    
    - building a simple linear model for you 
    
### Pipes

- `TextCategorizer` is a spaCy **pipe**

- **pipes** are classes for processing and transforming tokens

#### Default Pipes

- `nlp = spacy.load('en')` creates a model with default pipes that perform part-of-speech tagging, entity recognition, and other transformations (spaCy v3 removed the `'en'` shortcut, so there you load a full package name such as `en_core_web_sm`)

- when text is run through this `nlp` model by doing `doc = nlp("Some text here")`

    - the outputs of the pipes are attached to the tokens in the `doc` object 
    
    - the lemmas returned by `token.lemma_` come from one of these pipes

#### Modifying Model Pipes

- pipes may be added to or removed from models

- to build a model from scratch with only the desired pipes, create an empty model first

    - `nlp = spacy.blank("en")`

    - the empty model still comes with a tokenizer, since text representation models always have a tokenizer

- then add the pipes you need to the empty model, for instance the `TextCategorizer` pipe
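A minimal sketch of these steps, assuming spaCy v3 is installed (in v2 the pipe is instead created with `nlp.create_pipe("textcat")` and then passed to `nlp.add_pipe`). The labels `ham` and `spam` come from the dataset above. The default `textcat` settings are used here, which is an assumption and may not match the bag-of-words configuration described earlier.

```python
import spacy

# Start from an empty English model: no pipes, but it still has a tokenizer.
nlp = spacy.blank("en")
blank_pipes = list(nlp.pipe_names)

# The tokenizer works even though no pipes have been added.
tokens = [token.text for token in nlp("Some text here")]

# Add the text categorizer pipe (spaCy v3 style) and register the labels.
textcat = nlp.add_pipe("textcat")
textcat.add_label("ham")
textcat.add_label("spam")

print(blank_pipes)
print(tokens)
print(nlp.pipe_names)
```

The pipe is only added and labelled at this point; it still has to be trained on the labelled messages before it can make useful predictions.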