Let us look at how we can perform text classification with spaCy.
The dataset is from the Tweet Sentiment Extraction challenge on Kaggle (https://www.kaggle.com/c/tweet-sentiment-extraction/overview).
We will use spaCy to classify each tweet as "positive", "negative" or "neutral".



In [13]:
#Import all required libraries
import spacy
import random
import time
import numpy as np
import pandas as pd
import re
import string


import sys
from spacy import displacy

from tqdm.auto import tqdm
from spacy.tokens import DocBin


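Let us define helper functions to pre-process the tweets: one removes emoji, one removes URLs, and one strips punctuation, numbers and very short words.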

In [7]:

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_url(text):
    # Raw string so that \( and \) are not treated as invalid string escapes
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    return url_pattern.sub('', text)



def clean_text(text):
    # Delete every punctuation character
    table = str.maketrans('', '', string.punctuation)
    text1 = text.translate(table)
    textArr = text1.split()
    # Keep only non-numeric words longer than 3 characters
    text2 = ' '.join([w for w in textArr if not w.isdigit() and len(w) > 3])

    return text2.lower()
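
To see what this pre-processing does to a tweet, here is a minimal, self-contained sketch. The names `preprocess`, `URL_RE` and `PUNCT_TABLE` are hypothetical, the URL pattern is a simplified stand-in for the one in `remove_url`, the emoji step is left out, and the sample tweet is made up for illustration.

```python
import re
import string

# Simplified stand-ins for remove_url and clean_text (assumptions, not the
# notebook's exact helpers), repeated here so the snippet runs on its own.
URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Drop URLs and punctuation, then keep non-numeric words longer than 3 characters."""
    no_urls = URL_RE.sub("", tweet)
    no_punct = no_urls.translate(PUNCT_TABLE)
    kept = [w for w in no_punct.split() if not w.isdigit() and len(w) > 3]
    return " ".join(kept).lower()


# Illustrative sample tweet
sample = "Sooo SAD I will miss you here in San Diego!!! http://example.com 2009"
print(preprocess(sample))  # sooo will miss here diego
```

Note that the length filter also drops short sentiment-bearing words such as "sad" or "bad", so it may be worth relaxing the `len(w) > 3` condition if the classifier struggles with negative tweets.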



The code below is adapted from: https://medium.com/analytics-vidhya/building-a-text-classifier-with-spacy-3-0-dd16e9979a

In [42]:
def make_docs(file_path):
    """
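    Read a CSV file of tweets, clean the text and turn every row
    into a spaCy Doc whose categories (doc.cats) encode the sentiment.

    file_path: path to a CSV file with 'text' and 'sentiment' columns

    returns: (list of spacy.tokens.Doc, cleaned pandas DataFrame)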
    """
    train_data = pd.read_csv(file_path)
    train_data.dropna(axis = 0, how ='any',inplace=True) 
    train_data['Num_words_text'] = train_data['text'].apply(lambda x:len(str(x).split())) 
    mask = train_data['Num_words_text'] >2
    # .copy() avoids pandas' SettingWithCopyWarning when 'text' is overwritten below
    train_data = train_data[mask].copy()
    print(train_data['sentiment'].value_counts())
    
    train_data['text'] = train_data['text'].apply(remove_emoji)
    train_data['text'] = train_data['text'].apply(remove_url)
    train_data['text'] = train_data['text'].apply(clean_text)
   
    data = tuple(zip(train_data['text'].tolist(), train_data['sentiment'].tolist())) 
    print(data[1])
    docs = []
    # nlp.pipe([texts]) is way faster than running 
    # nlp(text) for each text
    # as_tuples allows us to pass in a tuple, 
    # the first one is treated as text
    # the second one will get returned as it is.
    nlp = spacy.load("en_core_web_trf")
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total = len(data)):
        
        # we need to set the (text)cat(egory) for each document
        #print(label)
        if (label=='positive'):
            doc.cats['positive'] = 1
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 0
        elif (label=='negative'):
            doc.cats['positive'] = 0
            doc.cats['negative'] = 1
            doc.cats['neutral']  = 0
        else:
            doc.cats['positive'] = 0
            doc.cats['negative'] = 0
            doc.cats['neutral']  = 1
        #print(doc.cats)
        
        # put them into a nice list
        docs.append(doc)
    
    return docs,train_data
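
The if/elif/else block in `make_docs` builds a one-hot dictionary of categories. A more compact way to express the same idea is sketched below; `one_hot_cats` and `LABELS` are hypothetical names, and falling back to "neutral" for an unknown label is an assumption that mirrors the `else` branch above.

```python
LABELS = ("positive", "negative", "neutral")


def one_hot_cats(label, labels=LABELS, default="neutral"):
    """Return a {category: 0 or 1} dict with exactly one category set to 1."""
    # Any label outside `labels` falls back to `default`, like the else branch
    target = label if label in labels else default
    return {name: int(name == target) for name in labels}


print(one_hot_cats("negative"))  # {'positive': 0, 'negative': 1, 'neutral': 0}
```

Inside the loop this could be used as `doc.cats.update(one_hot_cats(label))`.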

Let us convert our train data and test data to spaCy's binary format. The test set is saved as the validation (dev) corpus used during training.

In [43]:
train_docs,train_data  = make_docs("C:\\TweetSenitment\\train.csv")
# then we save it in a binary file on disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./textcat_data/textcat_train.spacy")

test_docs,test_data  = make_docs("C:\\TweetSenitment\\test.csv")
# then we save it in a binary file on disk
doc_bin = DocBin(docs=test_docs)
doc_bin.to_disk("./textcat_data/textcat_valid.spacy")
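
The `to_disk` calls above write into `./textcat_data`. Assuming that this directory is not created automatically, it has to exist before the cell runs. A minimal sketch using only the standard library follows; `ensure_parent_dir` is a hypothetical helper, and the demonstration runs in a temporary directory so that it has no side effects.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "textcat_data" / "textcat_train.spacy")
    created = target.parent.is_dir()
    print(created)  # True
```

In the notebook this would be called as `doc_bin.to_disk(ensure_parent_dir("./textcat_data/textcat_train.spacy"))`.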




neutral     10704
positive     8375
negative     7673
Name: sentiment, dtype: int64
('sooo will miss here diego', 'negative')



neutral     1376
positive    1075
negative     983
Name: sentiment, dtype: int64
('shanghai also really exciting precisely skyscrapers galore good tweeps china', 'positive')



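Let's train the model on our dataset. First we fill in the base config (textcat_base_config.cfg, which must already exist in the working directory) with default values, then we run the training command.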

In [44]:
!python -m spacy init fill-config ./textcat_base_config.cfg ./textcat_config.cfg

[+] Auto-filled config with all values
[+] Saved config
textcat_config.cfg
You can now add your data and train your pipeline:
python -m spacy train textcat_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [45]:
!python -m spacy train textcat_config.cfg --verbose --output ./textcat_output --paths.train textcat_data/textcat_train.spacy --paths.dev textcat_data/textcat_valid.spacy

[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['transformer', 'textcat']
[i] Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  -------------  ------------  ----------  ------
  0       0           0.00          0.17        0.00    0.00
  1     200           0.10         43.59       69.74    0.70
  2     400           0.35         37.37       71.49    0.71
  3     600           0.50         28.74       70.67    0.71
  4     800           0.36         22.46       73.57    0.74
  5    1000           0.75         20.60       72.50    0.72
  7    1200           0.33         11.35       71.95    0.72
  8    1400           0.42         11.61       72.84    0.73
  9    1600           0.70         11.11       71.96    0.72
[+] Saved pipeline to output directory
textcat_output\model-last

[2021-06-11 14:25:14,792] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
[2021-06-11 14:25:16,108] [INFO] Set up nlp object from config
[2021-06-11 14:25:16,116] [DEBUG] Loading corpus from path: textcat_data\textcat_valid.spacy
[2021-06-11 14:25:16,117] [DEBUG] Loading corpus from path: textcat_data\textcat_train.spacy
[2021-06-11 14:25:16,117] [INFO] Pipeline: ['transformer', 'textcat']
[2021-06-11 14:25:16,120] [INFO] Created vocabulary
[2021-06-11 14:25:16,120] [INFO] Finished initializing nlp object
[2021-06-11 14:25:44,641] [INFO] Initialized pipeline components: ['transformer', 'textcat']
[2021-06-11 14:25:44,666] [DEBUG] Loading corpus from path: textcat_data\textcat_valid.spacy
[2021-06-11 14:25:44,667] [DEBUG] Loading corpus from path: textcat_data\textcat_train.spacy
[2021-06-11 14:25:44,704] [DEBUG] Removed existing output directory: textcat_output\model-best
[2021-06-11 14:25:44,709] [DEBUG] Removed existing output directory: textcat_output\model-last
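
In the training table above, the SCORE column appears to be CATS_SCORE divided by 100. For mutually exclusive labels, CATS_SCORE is assumed here to be a macro-averaged F-score; the log itself does not confirm this. The sketch below shows how such a score is computed on a toy example. `macro_f1` is a hypothetical helper and is not spaCy's scorer.

```python
def macro_f1(gold, pred, labels):
    """Average the per-label F1 scores, giving every label equal weight."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy data, not taken from the tweet dataset
gold = ["positive", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "neutral", "neutral"]
print(round(macro_f1(gold, pred, ["positive", "negative", "neutral"]), 3))  # 0.6
```

Because every label counts equally, a model that rarely predicts one class (here "negative") is penalised even when its overall accuracy looks acceptable.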





Let's test our model on the test data.

In [46]:
nlp_textcat = spacy.load("textcat_output/model-best")
test_texts = test_data['text'].tolist()
test_cats = test_data['sentiment'].tolist()
doc2 = nlp_textcat(test_texts[100])
print("Text: "+ test_texts[100])
print("Orig Cat:"+ test_cats[100])
print(" Predicted Cats:") 
print(doc2.cats)
print("=======================================")
doc2 = nlp_textcat(test_texts[1000])
print("Text: "+ test_texts[1000])
print(" Orig Cat:"+test_cats[1000])
print(" Predicted Cats:") 
print(doc2.cats)

Text: want david cook
Orig Cat:positive
 Predicted Cats:
{'positive': 0.06087260693311691, 'negative': 0.05941738560795784, 'neutral': 0.8797099590301514}
Text: okaii cool cant wait series begin guna awesome
 Orig Cat:positive
 Predicted Cats:
{'positive': 0.9591326713562012, 'negative': 0.00452009541913867, 'neutral': 0.036347199231386185}


In [47]:
doc2 = nlp_textcat("Avengers Endgame was a great movie")
print(doc2.cats)

{'positive': 0.9590757489204407, 'negative': 0.004502081777900457, 'neutral': 0.036422159522771835}


In [48]:
doc2 = nlp_textcat("Data science is tough to master")
print(doc2.cats)

{'positive': 0.02091258205473423, 'negative': 0.8962381482124329, 'neutral': 0.0828491747379303}
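
The `doc.cats` dictionaries printed above hold one score per category. To turn them into a single predicted label and a rough accuracy figure, something like the following sketch could be used. `predicted_label` and `accuracy` are hypothetical helpers, and the scores are rounded copies of the two test outputs shown earlier.

```python
def predicted_label(cats):
    """Return the category with the highest score."""
    return max(cats, key=cats.get)


def accuracy(gold_labels, predicted_cats):
    """Fraction of examples whose top-scoring category matches the gold label."""
    pairs = list(zip(gold_labels, predicted_cats))
    if not pairs:
        return 0.0
    hits = sum(gold == predicted_label(cats) for gold, cats in pairs)
    return hits / len(pairs)


# Rounded scores from the two test tweets above; both were labelled "positive"
examples = [
    {"positive": 0.0609, "negative": 0.0594, "neutral": 0.8797},
    {"positive": 0.9591, "negative": 0.0045, "neutral": 0.0363},
]
gold = ["positive", "positive"]

print([predicted_label(c) for c in examples])  # ['neutral', 'positive']
print(accuracy(gold, examples))  # 0.5
```

With the trained pipeline, the list of score dictionaries for the whole test set could be collected as `[doc.cats for doc in nlp_textcat.pipe(test_texts)]` and passed to `accuracy` together with `test_cats`.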
