### What’s spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

In [180]:
import spacy
# Install the English model first, for example:
#   conda install -c conda-forge spacy-model-en_core_web_sm
nlp = spacy.load("en_core_web_sm")

#### Text: The original word text.
#### Lemma (lemma_): The base form of the word.
#### POS (pos_): The simple part-of-speech tag.
#### Tag (tag_): The detailed part-of-speech tag.
#### Dep (dep_): Syntactic dependency, i.e. the relation between tokens.
#### Shape (shape_): The word shape – capitalization, punctuation, digits.
#### is alpha (is_alpha): Does the token consist of alphabetic characters?
#### is stop (is_stop): Is the token part of a stop list, i.e. the most common words of the language?
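
To make the Shape attribute more concrete, here is a minimal pure-Python sketch of how a word shape can be computed. The function name `word_shape` and the run-length cap of four are assumptions made for illustration; the sketch approximates the idea and is not spaCy's own implementation.

```python
def word_shape(text, max_run=4):
    """Map letters to x/X and digits to d; keep other characters as-is.

    Runs of the same shape character are capped at max_run, so long
    words collapse to a short, comparable pattern.
    """
    shape = []
    last = None
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Count how long the current run of identical shape characters is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "U.K.", "$1", "healthy"]:
    print(word, "->", word_shape(word))
```

Words with the same shape (for example all capitalized five-letter words) end up with the same pattern, which is what makes the feature useful for spotting things like abbreviations or amounts.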

In [181]:
doc = nlp("Tea is healthy and calming, don't you think?")
print("Token\tLemma\tStopword\tPOS\tTag")
print('-'*60)
for token in doc:
    print(f"{token}\t{token.lemma_}\t{token.is_stop}\t{token.pos_}\t{token.tag_}")

Token	Lemma	Stopword	POS	Tag
------------------------------------------------------------
Tea	tea	False	NOUN	NN
is	be	True	AUX	VBZ
healthy	healthy	False	ADJ	JJ
and	and	True	CCONJ	CC
calming	calm	False	VERB	VBG
,	,	False	PUNCT	,
do	do	True	AUX	VBP
n't	not	True	PART	RB
you	-PRON-	True	PRON	PRP
think	think	False	VERB	VB
?	?	False	PUNCT	.
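
A common use of these attributes is to reduce a sentence to its content words. The sketch below works on the rows printed above, copied by hand into a list of tuples, so it runs without spaCy; the variable names are illustrative.

```python
# Rows copied from the output above: (text, lemma, is_stop, pos, tag)
rows = [
    ("Tea", "tea", False, "NOUN", "NN"),
    ("is", "be", True, "AUX", "VBZ"),
    ("healthy", "healthy", False, "ADJ", "JJ"),
    ("and", "and", True, "CCONJ", "CC"),
    ("calming", "calm", False, "VERB", "VBG"),
    (",", ",", False, "PUNCT", ","),
    ("do", "do", True, "AUX", "VBP"),
    ("n't", "not", True, "PART", "RB"),
    ("you", "-PRON-", True, "PRON", "PRP"),
    ("think", "think", False, "VERB", "VB"),
    ("?", "?", False, "PUNCT", "."),
]

# Keep the lemmas of tokens that are neither stop words nor punctuation
content_lemmas = [lemma for _, lemma, is_stop, pos, _ in rows
                  if not is_stop and pos != "PUNCT"]
print(content_lemmas)  # ['tea', 'healthy', 'calm', 'think']
```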


#### UNDERSTANDING TAGS AND LABELS
Most of the tags and labels look pretty abstract, and they vary between languages. spacy.explain will show you a short description – for example, spacy.explain("VBZ") returns “verb, 3rd person singular present”.

In [182]:
spacy.explain("VBZ")

'verb, 3rd person singular present'

In [183]:
spacy.explain('RB')

'adverb'

#### Named Entities
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title.

In [184]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(f"{ent.text} ==> {ent.label_} ==> {spacy.explain(ent.label_)}")

Apple ==> ORG ==> Companies, agencies, institutions, etc.
U.K. ==> GPE ==> Countries, cities, states
$1 billion ==> MONEY ==> Monetary values, including unit


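#### Similarity
spaCy can compare two tokens and estimate how similar they are. Note that small models such as en_core_web_sm do not ship with word vectors, so the scores below are only a rough guide; a larger model with vectors gives more meaningful results.

In [185]:
nlp = spacy.load('en_core_web_sm')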

In [186]:
tokens = nlp("dog cat banana orange")

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.5033337
dog banana 0.37854198
dog orange 0.37416118
cat dog 0.5033337
cat cat 1.0
cat banana 0.52952737
cat orange 0.19970883
banana dog 0.37854198
banana cat 0.52952737
banana banana 1.0
banana orange 0.27482426
orange dog 0.37416118
orange cat 0.19970883
orange banana 0.27482426
orange orange 1.0


In [187]:
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 19.266302 True
cat True 19.220264 True
banana True 17.748499 True
afskfsd True 20.882006 True
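
The similarity scores and vector norms printed above come down to a little vector arithmetic. As a minimal sketch, assuming similarity is the cosine of the angle between two vectors (spaCy's documented default) and that vector_norm is the L2 norm, the following pure-Python functions show the idea on made-up 3-dimensional vectors. The vectors and function names are illustrative assumptions, not values taken from any spaCy model.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; values near 1 mean same direction."""
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen only to illustrate the arithmetic
dog = [1.0, 2.0, 0.0]
cat = [2.0, 1.0, 0.0]

print(l2_norm(dog))                 # square root of 5
print(cosine_similarity(dog, dog))  # approximately 1: same direction
print(cosine_similarity(dog, cat))  # approximately 0.8 (4 / 5)
```

Because the result depends only on direction, two vectors of very different length can still have a similarity close to 1.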


#### Language data
Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organized in simple Python files. This makes the data easy to update and extend.

In [188]:
from spacy.lang.en import English
from spacy.lang.de import German

nlp_en = English()  # Includes English data
nlp_de = German()  # Includes German data
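
To illustrate what hard-coded exceptions look like in practice, here is a small pure-Python sketch of a tokenizer exception table. The table contents, the names `TOKENIZER_EXCEPTIONS` and `tokenize`, and the whitespace-splitting rule are hypothetical simplifications; spaCy's own language data is considerably richer.

```python
# Hypothetical exception table: surface form -> list of token texts
TOKENIZER_EXCEPTIONS = {
    "don't": ["do", "n't"],
    "can't": ["ca", "n't"],
    "U.K.": ["U.K."],  # keep the abbreviation as a single token
}


def tokenize(text):
    """Split on whitespace, then apply exceptions or peel off trailing punctuation."""
    tokens = []
    for chunk in text.split():
        if chunk in TOKENIZER_EXCEPTIONS:
            tokens.extend(TOKENIZER_EXCEPTIONS[chunk])
            continue
        # Peel trailing punctuation off one character at a time,
        # stopping early if what remains is a known exception
        trailing = []
        while chunk and chunk[-1] in ",.?!":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
            if chunk in TOKENIZER_EXCEPTIONS:
                break
        if chunk:
            tokens.extend(TOKENIZER_EXCEPTIONS.get(chunk, [chunk]))
        tokens.extend(trailing)
    return tokens


print(tokenize("Tea is healthy and calming, don't you think?"))
```

The general rule (split off punctuation) handles most words, while the table covers the special cases that no simple rule gets right.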

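#### Visualizers
displaCy renders the dependency parse and the named entities of a Doc in the browser.

In [189]:
from spacy import displacy

# displacy.serve starts a local web server and blocks until it is stopped,
# so the second visualization is served only after the first server shuts down.
doc_dep = nlp("This is a sentence.")
displacy.serve(doc_dep, style="dep")

doc_ent = nlp("When Sebastian Thrun started working on self-driving cars at Google "
              "in 2007, few people outside of the company took him seriously.")
displacy.serve(doc_ent, style="ent")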



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.



Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


#### Pattern Matching
Another common NLP task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.

In [202]:
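from spacy.matcher import PhraseMatcher
# attr='LOWER' compares lowercase forms, so matching is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
# spaCy v2 signature; in spaCy v3 this becomes matcher.add("TerminologyList", patterns)
matcher.add("TerminologyList", None, *patterns)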

In [203]:
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
for _, start, end in matches:
    print(text_doc[start:end])

iPhone 11
galaxy Note
iPhone XS
Google Pixel
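
For comparison, the regular-expression route mentioned above might look like the following sketch. It uses only Python's `re` module on the same sentence and term list; the way the pattern is built is one possible approach, and unlike PhraseMatcher it works on raw characters rather than tokens.

```python
import re

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
text = ("Glowing review overall, and some really interesting side-by-side "
        "photography tests pitting the iPhone 11 Pro against the "
        "galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")

# Escape each term, join with |, and require word boundaries on both sides.
# re.IGNORECASE plays the role of attr='LOWER' in the PhraseMatcher.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
                     flags=re.IGNORECASE)

regex_matches = [m.group(0) for m in pattern.finditer(text)]
print(regex_matches)  # ['iPhone 11', 'galaxy Note', 'iPhone XS', 'Google Pixel']
```

This works for a short list, but the pattern gets harder to maintain as the terminology grows or when matches should depend on token attributes such as lemmas.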


#### Text Classification with spaCy
We will train a classifier on a dataset of SMS messages to label each one as spam or ham (not spam).

In [204]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [205]:
# Load the SMS spam dataset from CSV
dfspam = pd.read_csv('spam.csv', encoding='ISO-8859-1')

In [206]:
dfspam.head(10)

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | | | |
| 6 | ham | Even my brother is not like to speak with me. ... | | | |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | | | |
| 8 | spam | WINNER!! As a valued network customer you have... | | | |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | | | |


In [207]:
# Create an empty model
nlp = spacy.blank('en')

In [208]:
# Create the TextCategorizer with exclusive classes and the bag-of-words ("bow") architecture
textcat = nlp.create_pipe("textcat",
                          config={"exclusive_classes": True, "architecture": "bow"})

In [209]:
nlp.add_pipe(textcat)
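
The "bow" setting refers to a bag-of-words model, which ignores word order and looks only at which words occur and how often. A minimal sketch of that representation, using only `collections.Counter` and a naive lowercase whitespace split (both simplifying assumptions; spaCy's internal feature extraction differs), is:

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


a = bag_of_words("free tea free entry")
b = bag_of_words("entry free tea free")

print(a["free"])  # 2
print(a == b)     # True: same words, same counts, different order
```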

In [210]:
# Add the classifier labels
textcat.add_label('ham')
textcat.add_label('spam')

1

In [211]:
# Create a train/test split for training and evaluating the model
X = dfspam['v2'].values
y = [{'cats': {'ham': label == 'ham', 'spam': label == 'spam'}}
     for label in dfspam['v1']]
train_texts, test_texts, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=9)

In [212]:
train_data = list(zip(train_texts, train_labels))
train_data[0:3]

[('You can jot down things you want to remember later.',
  {'cats': {'ham': True, 'spam': False}}),
 ('So you think i should actually talk to him? Not call his boss in the morning? I went to this place last year and he told me where i could go and get my car fixed cheaper. He kept telling me today how much he hoped i would come back in, how he always regretted not getting my number, etc.',
  {'cats': {'ham': True, 'spam': False}}),
 ('Wat makes some people dearer is not just de happiness dat u feel when u meet them but de pain u feel when u miss dem!!!',
  {'cats': {'ham': True, 'spam': False}})]
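
The label format shown above, one dict per text with a boolean per category under the 'cats' key, can be sketched on a few made-up labels. The helper name `to_cats` and the sample labels are hypothetical.

```python
def to_cats(label, categories=("ham", "spam")):
    """Turn a single string label into a {'cats': {...}} annotation."""
    return {"cats": {cat: label == cat for cat in categories}}


sample_labels = ["ham", "spam", "ham"]  # hypothetical labels
annotations = [to_cats(label) for label in sample_labels]
print(annotations[1])  # {'cats': {'ham': False, 'spam': True}}
```

With exclusive classes, exactly one category is True in each annotation.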

In [213]:
import random
from spacy.util import minibatch

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# losses is created once, so the value printed after each epoch
# is the running total across all epochs so far
losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) pairs, but update()
        # expects separate sequences of texts and labels.
        # zip(*batch) is a quick way to split a list of tuples
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 1.227645506569388}
{'textcat': 1.5717371886594265}
{'textcat': 1.752538370129031}
{'textcat': 1.8720738124953087}
{'textcat': 1.9544220569802029}
{'textcat': 2.0088419842828724}
{'textcat': 2.046027143647982}
{'textcat': 2.0726789025792405}
{'textcat': 2.0928852535404383}
{'textcat': 2.108061934710048}
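
The batching and the zip(*batch) trick in the loop above can be tried in isolation. This sketch uses a hypothetical fixed-size helper named `simple_minibatch` in place of spacy.util.minibatch, and made-up training pairs:

```python
def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical (text, annotation) pairs in the same shape as train_data
data = [(f"message {i}", {"cats": {"ham": i % 2 == 0, "spam": i % 2 == 1}})
        for i in range(5)]

batches = list(simple_minibatch(data, size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# zip(*batch) turns a list of pairs into a tuple of texts and a tuple of labels
texts, labels = zip(*batches[0])
print(texts)  # ('message 0', 'message 1')
```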


In [214]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

[[9.9985826e-01 1.4171332e-04]
 [2.7513899e-02 9.7248614e-01]]


In [215]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print(predicted_labels)
print([textcat.labels[label] for label in predicted_labels])

[0 1]
['ham', 'spam']
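
What argmax does here can be checked by hand. The sketch below re-implements it in plain Python on the two score rows printed above, assuming (as in this notebook) that column 0 is 'ham' and column 1 is 'spam' because the labels were added in that order.

```python
# Scores copied from the output above: one row per text, one column per label
scores = [[9.9985826e-01, 1.4171332e-04],
          [2.7513899e-02, 9.7248614e-01]]
labels = ["ham", "spam"]  # same order in which the labels were added


def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])


predicted = [labels[argmax(row)] for row in scores]
print(predicted)  # ['ham', 'spam']
```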


In [216]:
# Predict using the trained model
def predict(model, texts):
    """Return the index of the highest-scoring label for each text."""
    # Use the model's tokenizer to tokenize each input text
    docs = [model.tokenizer(text) for text in texts]

    # Use textcat to get the scores for each doc
    textcat = model.get_pipe('textcat')
    scores, _ = textcat.predict(docs)

    # From the scores, find the class with the highest score/probability
    predicted_class = scores.argmax(axis=1)

    return predicted_class

In [217]:
predictions = predict(nlp, ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA"])
print([textcat.labels[label] for label in predictions])

['ham', 'spam']


In [218]:
# Evaluate the model
def evaluate(model, texts, labels):
    # Predicted class indices (0 = 'ham', 1 = 'spam', the order the labels were added)
    predicted_class = predict(model, texts)

    # True class indices: 1 if the message is spam, otherwise 0
    actual_class = [int(label['cats']['spam']) for label in labels]

    return accuracy_score(actual_class, predicted_class)

In [219]:
evaluate(nlp, test_texts, test_labels)

0.9838565022421525
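
Accuracy is the fraction of predictions that match the true labels. As a sketch of what accuracy_score computes in this setting, here is a plain-Python version run on made-up class indices (0 = ham, 1 = spam); the function name `accuracy` and the sample values are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the prediction equals the true class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical class indices: 0 = ham, 1 = spam
actual = [0, 0, 1, 0, 1]
predicted = [0, 0, 1, 1, 1]
print(accuracy(actual, predicted))  # 0.8 (4 of 5 correct)
```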

===========================================END=======================================