## Tutorial: Text Classification in Python Using spaCy 
https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/  
## Documentation 
https://spacy.io/

In [52]:
import spacy
import pandas as pd
import en_core_web_sm

### Word tokenization
Breaking up text into individual words

In [53]:
# word tokenization
from spacy.lang.en import English

# create a blank English pipeline (tokenizer only; no tagger, parser, or NER)
nlp = English()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


### Sentence Tokenization  
The tokenizer looks for specific characters that fall between sentences, like periods, exclamation points, and newline characters

In [54]:
# sentence tokenization

# create a blank English pipeline (tokenizer only; no tagger, parser, or NER)
nlp = English()

# create the pipeline 'sentencizer' component
# (spaCy v2 API; in spaCy v3, call nlp.add_pipe('sentencizer') directly)
sbd = nlp.create_pipe('sentencizer')

# add the component to the pipeline
nlp.add_pipe(sbd)

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)

["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]
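
For intuition only, boundary-based sentence splitting can be approximated in plain Python. This is a rough sketch, not how spaCy's sentencizer is implemented: the regular expression is an assumption chosen for illustration, and it will mis-split abbreviations such as "Dr." or "e.g.".

```python
import re

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# Split wherever '.', '!' or '?' is followed by whitespace (spaces or newlines).
# The lookbehind keeps the punctuation mark attached to the sentence it ends.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

for sentence in sentences:
    print(sentence)
```

Unlike the spaCy output above, this sketch strips the leading newline from the second sentence.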


In [55]:
#stop words

#importing stop words from English language
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

#print total number of stop words
print('Number of stop words: %d' % len(spacy_stopwords))

#printing first twenty stop words (STOP_WORDS is a set, so the order is arbitrary)
print('First twenty stop words: %s' % list(spacy_stopwords)[:20])

Number of stop words: 312
First twenty stop words: ['within', 'part', 'third', 'also', 'top', 'did', 'itself', 'only', 'has', "'s", 'own', 'last', 'some', 'put', 'forty', "'m", 'whom', 'others', 'down', 'him']


In [56]:
#Implementation of stop words:
#create empty list called filtered_sent
filtered_sent=[]

#  "nlp" object is used to create documents with linguistic annotations
doc = nlp(text)

# filtering stop words
for word in doc:
    #use token attribute "is_stop" to identify words that aren’t in the stopword list 
    #and append them to our filtered_sent list (token attributes: https://spacy.io/usage/rule-based-matching/#adding-patterns-attributes)
    if not word.is_stop:
        filtered_sent.append(word)
print("Filtered Sentence:",filtered_sent)

Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]


### Lexicon Normalization  
Lexicon normalization collapses the different forms of a word into one, which converts high dimensional features into low dimensional features  
Stemming and lemmatization are two ways to do this  
Lemmatization looks at words and their roots (called lemmas) as described in the dictionary, which is more precise than basic stemming (as long as the words exist in the dictionary)

In [57]:
# Implementing lemmatization
lem = nlp("run runs running runner")
# finding lemma for each word
for word in lem:
    print(word.text,word.lemma_)

run run
runs run
running run
runner runner
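
To make the contrast between stemming and lemmatization concrete, here is a small plain-Python sketch. Both `naive_stem` and `LEMMA_TABLE` are hypothetical and heavily simplified: real stemmers use many more rules, and spaCy's lemmatizer relies on much larger lookup data.

```python
def naive_stem(word):
    """Crude stemming: strip a known suffix without consulting a dictionary."""
    for suffix in ("ing", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical, tiny lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {"runs": "run", "running": "run", "ran": "run"}

def lookup_lemma(word):
    """Dictionary-based lemmatization: unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["run", "runs", "running", "ran"]:
    print(word, naive_stem(word), lookup_lemma(word))
```

The stemmer turns "running" into "runn" and leaves "ran" untouched, while the lookup maps both to "run". That precision depends on the word being in the table.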


### Part of Speech (POS) Tagging
POS tagging needs the en_core_web_sm model, because that contains the dictionary and grammatical information  
Load the model with .load(), then loop through the docs variable and read the part of speech for each word from .pos_

In [58]:
spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [59]:
import en_core_web_sm

# load en_core_web_sm of English for vocabulary, syntax & entities
nlp = en_core_web_sm.load()

#  "nlp" object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text,word.pos_)

All DET
is VERB
well ADV
that DET
ends VERB
well ADV
. PUNCT


### Entity Detection (aka Entity Recognition)  
Identifies important elements like places, people, organizations, and languages within an input string of text  
We’ll use .label_ to grab the string label for each entity that’s detected in the text (.label holds the matching integer ID), and then we’ll take a look at these entities in a more visual format using spaCy’s displaCy visualizer  
Entity types: https://spacy.io/api/annotation#named-entities  

In [60]:
#for visualization of Entity detection importing displacy from spacy:

from spacy import displacy

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i, i.label_, i.label) for i in nytimes.ents]
entities

[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'NORP', 381),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]

In [61]:
#use style = "ent" to tell displaCy that we want to visualize entities
displacy.render(nytimes, style = "ent",jupyter = True)

### Dependency parsing  
Analyzes how a sentence is constructed to determine its meaning  
noun_chunks breaks the input down into nouns and the words describing them. The code below iterates through each chunk in the source text and prints the chunk, its root word, the root’s dependency label, and the root’s head word  
Labels: https://spacy.io/api/annotation#dependency-parsing

In [62]:
docp = nlp(" In pursuit of a wall, President Trump ran into one.")

for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

pursuit pursuit pobj In
a wall wall pobj of
President Trump Trump nsubj ran


In [63]:
#displaCy can also visualize the dependency parse (style="dep")
displacy.render(docp, style="dep", jupyter= True)

### Word vector representation  
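A word vector is a numeric representation of a word that communicates its relationship to other words  
Note: en_core_web_sm ships without pretrained word vectors, so .vector below is a 96-dimensional context-sensitive tensor rather than a true word vector  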

In [64]:
nlp = en_core_web_sm.load()
mango = nlp(u'mango')
print(mango.vector.shape)
print(mango.vector)

(96,)
[ 1.0466377  -1.5323697  -0.72177833 -2.4700646  -0.27151567  1.1589653
  1.7113379  -0.3161533  -2.0978353   1.8375525   1.4681312   2.7280447
 -2.3457406  -5.1718407  -4.611001   -0.21236429 -0.30295217  4.220026
 -0.68139046  2.4016762  -1.9546713  -0.8508699   1.2456177   1.5108002
  0.46847373  3.1612053   0.15542096  2.0598547   3.780033    4.611097
  0.6375267  -1.078107   -0.9664707  -1.3939939  -0.5691425   0.5143471
  2.3150034  -0.9319972  -2.7970653  -0.8540132  -3.4250066   4.285772
  2.5058162  -2.215088    0.78601825  3.496334   -0.6260618  -2.021353
 -4.474211    1.6821624  -6.078921    0.2280091  -0.3695004  -4.5340705
 -1.7978685  -2.0802987   4.125555    3.1852465  -3.2864473   1.0892262
  1.0171156   1.2736399  -0.10613781  3.5102787   1.1902345   0.0548352
 -0.06298053  0.82806814  0.05514137  0.94817257 -0.4937699   1.1512344
 -0.8137415  -1.6104263   1.8233354  -2.2784023  -2.1321888   0.30293244
 -1.4510609  -1.0584288  -3.5698357  -0.13046017 -0.26683304 
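
One common way to read the "relationship to other words" out of vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not spaCy output, and real vectors have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for this example.
mango = [0.9, 0.8, 0.1]
banana = [0.8, 0.9, 0.2]
laptop = [0.1, 0.2, 0.9]

print("mango vs banana:", cosine_similarity(mango, banana))
print("mango vs laptop:", cosine_similarity(mango, laptop))
```

With these toy values, the two fruit vectors point in nearly the same direction, so their similarity is much higher than the similarity between "mango" and "laptop".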

### Example using Amazon Alexa reviews

In [65]:
#load tsv file
df_amazon = pd.read_csv("smartphone-review/amazon_alexa.tsv", sep="\t")

In [66]:
df_amazon.head()

| | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


In [67]:
# shape of dataframe
df_amazon.shape

(3150, 5)

In [68]:
# View data information
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating              3150 non-null int64
date                3150 non-null object
variation           3150 non-null object
verified_reviews    3150 non-null object
feedback            3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.1+ KB


In [69]:
# Feedback Value count
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64

In [70]:
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our string of punctuation marks
punctuations = string.punctuation

# Create our set of stopwords
stop_words = STOP_WORDS

# Create a blank English pipeline; only its tokenizer is needed here
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Run the text through the pipeline to get a Doc of tokens
    mytokens = parser(sentence)

    # Lemmatize and lowercase each token; spaCy v2 lemmatizes pronouns to "-PRON-", so keep their lowercased text instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

To further clean our text data, we’ll also create a custom transformer that strips leading and trailing whitespace and converts text to lowercase. Here, we define a custom predictors class that inherits from scikit-learn’s TransformerMixin and implements the transform, fit, and get_params methods. We’ll also create a clean_text() function that does the actual stripping and lowercasing.

In [72]:
# Custom transformer using spaCy
from sklearn.base import TransformerMixin
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()
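
The fit/transform contract can be tried out without scikit-learn. In this sketch, `Cleaner` is a hypothetical stand-in for the predictors class, and its `fit_transform` is written by hand to mirror what TransformerMixin provides (fit, then transform).

```python
def clean_text(text):
    # Strip leading/trailing whitespace and lowercase the text.
    return text.strip().lower()

class Cleaner:
    """Hypothetical stand-in for the predictors class, with no scikit-learn dependency."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fitting is a no-op.
        return self

    def transform(self, X):
        return [clean_text(text) for text in X]

    def fit_transform(self, X, y=None):
        # Mirrors what TransformerMixin supplies: fit, then transform.
        return self.fit(X, y).transform(X)

cleaned = Cleaner().fit_transform(["  Love my Echo!  ", "MUSIC\n"])
print(cleaned)
```

Because the transformer learns nothing from the data, it gives the same result on training and test text, which is what makes it safe to use as the first step of a pipeline.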

We can generate a bag-of-words (BoW) matrix for our text data by using scikit-learn’s CountVectorizer. In the code below, we’re telling CountVectorizer to use the custom spacy_tokenizer function we built as its tokenizer, and defining the n-gram range we want.

N-grams are sequences of n adjacent words in a given text. For example, in the sentence “Who will win the football world cup in 2022?” the unigrams are the single words “who”, “will”, “win”, and so on. The bigrams are pairs of contiguous words such as “who will”, “will win”, and so on. The ngram_range parameter we’ll use in the code below sets the lower and upper bounds on n; with (1,1) we’ll be using unigrams only. Then we’ll assign the resulting vectorizer to bow_vector.

In [73]:
from sklearn.feature_extraction.text import CountVectorizer  
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
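
To see what n-grams and a bag-of-words matrix look like without any library, here is a minimal plain-Python sketch. The `ngrams` helper and the two toy documents are hypothetical, and whitespace splitting stands in for the spaCy tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "who will win the football world cup in 2022".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams)

# Bag of words: one row per document, one column per vocabulary term.
docs = ["love my echo", "love love it"]  # toy documents, invented for this example
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Each row records how often each vocabulary term occurs in one document and discards word order, which is the information CountVectorizer stores (in sparse form) when it uses unigrams.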

In [74]:
# apply tfidf
from sklearn.feature_extraction.text import TfidfVectorizer  
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
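
TF-IDF reweights raw counts so that terms appearing in many documents count for less. The sketch below is a plain-Python approximation on invented toy documents. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is my understanding of scikit-learn's default settings; check the TfidfVectorizer documentation before relying on exact values.

```python
import math
from collections import Counter

# Toy, pre-tokenized documents invented for this example.
docs = [["love", "echo"], ["love", "music"], ["bad", "sound"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first toy document, "echo" ends up with a higher weight than "love" because "love" also appears in another document.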

In [75]:
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

Create a pipeline with three components: a cleaner, a vectorizer, and a classifier. The cleaner uses our predictors class to clean and preprocess the text. The vectorizer uses the CountVectorizer object (bow_vector) to create the bag-of-words matrix for our text. The classifier is a LogisticRegression object that classifies the sentiment of each review.

In [77]:
# Logistic Regression Classifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x000001DB8A6E04A8>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
      ...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

### Evaluating the model  
Accuracy is the fraction of all the predictions our model makes that are correct  
Precision is the ratio of true positives to true positives plus false positives in our predictions  
Recall is the ratio of true positives to true positives plus false negatives in our predictions  

In [78]:
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model accuracy, precision, and recall
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

Logistic Regression Accuracy: 0.9396825396825397
Logistic Regression Precision: 0.9415584415584416
Logistic Regression Recall: 0.9965635738831615
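
The three metrics can also be computed by hand from confusion-matrix counts. The counts below do not appear in the original output: they are inferred as one set of counts that reproduces the reported scores, assuming a 945-review test set (30% of the 3150 reviews). Treat them as an illustration rather than the notebook's actual confusion matrix.

```python
# Inferred counts (an assumption, not taken from the notebook's output).
tp, fp, fn, tn = 870, 54, 3, 18

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Recall for the negative class: the share of negative reviews the model catches.
negative_recall = tn / (tn + fp)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Negative-class recall:", negative_recall)
```

Under these inferred counts, the model catches only 18 of the 72 negative reviews. The headline accuracy hides this because positive reviews dominate the dataset (2893 of 3150), so it's worth checking per-class scores on imbalanced data like this.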
