# Using spaCy

<font color='steelblue'>
<h2>
<span style="font-family:verdana; font-size:1.2em;">
Using spaCy to explore the tools it provides for text processing<br>
</span>
</h2>
</font>

<font color='grey'>
<span style="font-family:verdana; font-size:1.2em;">
<b>
This notebook walks through the following examples:
    <ul>
        <li>Loading a prebuilt model</li>
        <li>Tokenization</li>
        <li>Text preprocessing</li>
        <li>Lemmatization</li>
        <li>Vocabulary</li>
        <li>Lexical analysis</li>
        <li>Parts of speech analysis</li>
        <li>Named entities within the text</li>
     </ul>
    </b>
</span>
</font>

# Install spaCy<br>
<font color='tomato'>
<span style="font-family:verdana; font-size:1.4em;">
    There are a couple of options for installing spaCy:
    <ul>
        <li>conda install -c conda-forge spacy</li>
        <li>pip install spacy</li>
    </ul><br>
    In an Anaconda terminal, use one of the following commands to download the small or the large English model:
    <ul>
        <li>python -m spacy download en_core_web_sm</li>
        <li>python -m spacy download en_core_web_lg</li>
    </ul>
        
</span>
</font>
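
After installing, it can help to confirm which spaCy version and which models are actually available before trying to load one. The following is a minimal sketch, assuming `spacy.util.is_package` is an adequate check for models installed through `spacy download`; the variable name `installed` is only illustrative.

```python
import spacy
from spacy.util import is_package

# Report the installed spaCy version
print("spaCy version:", spacy.__version__)

# Check whether each model package is installed (assumption: the models were
# installed as regular Python packages via `python -m spacy download ...`)
installed = {name: is_package(name) for name in ("en_core_web_sm", "en_core_web_lg")}
for name, available in installed.items():
    print(name, "->", "installed" if available else "missing")
```

A model reported as missing needs to be downloaded with one of the commands above before `spacy.load` can find it.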

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Functionality<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.0em;">
spaCy ships with pretrained NLP models that can be used to perform the most common tasks:<br>
    <ul>
        <li>Tokenization</li>
        <li><a href="https://spacy.io/usage/linguistic-features" target="_blank">Parts of speech (POS) tagging</a></li>
        <li><a href="https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/" target="_blank">Named Entity Recognition (NER) </a></li>
        <li><a href="https://spacy.io/api/lemmatizer" target="_blank">Lemmatization</a></li>
        <li><a href="https://spacy.io/usage/vectors-similarity" target="_blank">Transforming words into vectors</a></li>
    </ul>
</span>

## Load model

In [None]:
import spacy

In [None]:
# load the small English model (returns a Language object)
model = spacy.load('en_core_web_sm')

In [None]:
print(type(model))

In [None]:
text1 = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""


## Create a Doc object by running the model on our text


In [None]:
# create Doc object
doc1 = model(text1)
print(type(doc1))

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Tokenization<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    <ul>
<li>Tokens are the individual units of text that make up the document: words, punctuation marks, whitespace, etc.</li>
<li>Tokenization is the process of splitting a text into smaller pieces according to predefined rules. For example, sentences are tokenized into words (and, optionally, punctuation), and paragraphs into sentences, depending on the context.</li>
    </ul>
</span>
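
The two levels of tokenization described above can be sketched without downloading a model. This example assumes spaCy v3, where built-in components such as the rule-based `sentencizer` are added by name; the `sample` text is made up for illustration.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer,
# so no trained model is needed for this sketch
nlp = spacy.blank("en")

# Assumption: spaCy v3 API, where built-in components are added by name
nlp.add_pipe("sentencizer")

sample = "The market crashed. Investors lost millions."
doc = nlp(sample)

# Word-level tokens: each punctuation mark becomes its own token
tokens = [token.text for token in doc]
print(tokens)

# Sentence-level segmentation of the same text
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

A trained pipeline such as `en_core_web_sm` sets sentence boundaries itself, so the `sentencizer` is only needed for the blank pipeline used here.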

In [None]:
# print the first 10 tokens
for token in doc1[:10]:
    print(token.text)

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Preprocessing<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    <ul>
        <li>Remove stop words such as "the", "was", and "it"</li>
        <li>Remove punctuation and extra whitespace</li>
    </ul>
    This reduces the number of tokens that need to be processed
</span>
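
As a rough illustration of these preprocessing steps, the sketch below filters a short made-up sentence using only the tokenizer. It assumes that the stop-word list bundled with the English language defaults is acceptable for the task; real projects often adjust that list.

```python
import spacy

# Tokenizer only; the stop-word flags come from the English language defaults
nlp = spacy.blank("en")

# Made-up sample with a stray space before the comma and a double space
sample = "The market crashed , and  it was a great loss."
doc = nlp(sample)

# Keep tokens that are neither stop words, punctuation, nor whitespace
kept = [
    token.text
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
]

print(kept)
print(len(doc), "tokens before filtering,", len(kept), "after")
```

Checking `is_space` in addition to `is_stop` and `is_punct` matters here because the extra space in the sample is emitted as a token of its own.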

In [None]:
# print whether each token is a stop word or punctuation
for token in doc1:
    print(token.text, '\t', token.is_stop, '\t', token.is_punct)

In [None]:
# remove stop words, punctuation, and whitespace tokens (e.g. line breaks)
doc1_clean = [token for token in doc1
              if not (token.is_stop or token.is_punct or token.is_space)]

In [None]:
for token in doc1_clean:
    print(token.text)

In [None]:
# compare token counts (len(text1) would count characters, not tokens)
print("Length: original doc {} tokens, preprocessed doc {} tokens".format(len(doc1), len(doc1_clean)))

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Lemmatization<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    <ul>
        <li>Lemmatization reduces a word to its base (dictionary) form, e.g. played, playing, and plays all share the lemma play</li>
        <li>Useful when counting occurrences of a word, since playing and plays are then counted together with play</li>
    </ul>
</span>
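
The effect of lemmatization on word counts can be sketched as follows. To keep the example independent of any downloaded model, the lemmas here are assigned by hand (an assumption made purely for illustration); with a trained pipeline, `token.lemma_` is filled in by the lemmatizer.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["she", "played", "chess", "and", "likes", "playing", "chess"]
# Hand-assigned lemmas (illustration only); a trained pipeline such as
# en_core_web_sm would set token.lemma_ itself
lemmas = ["she", "play", "chess", "and", "like", "play", "chess"]

# Build a Doc from the word list and attach the lemmas
doc = Doc(nlp.vocab, words=words)
for token, lemma in zip(doc, lemmas):
    token.lemma_ = lemma

# Counting surface forms keeps "played" and "playing" apart
surface_counts = Counter(token.text for token in doc)
# Counting lemmas merges them under "play"
lemma_counts = Counter(token.lemma_ for token in doc)

print(surface_counts["played"], surface_counts["playing"])
print(lemma_counts["play"])
```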

In [None]:
text2 = 'she played chess against mary she likes playing chess.'
doc2 = model(text2)

In [None]:
# print the lemma of each word
# NOTE: spaCy v2 lemmatizes every pronoun to the placeholder '-PRON-', which
# needs to be handled separately; spaCy v3 returns the pronoun itself
for token in doc2:
    print(token.lemma_)

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Vocab<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    <ul>
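        <li>The words of a Doc are stored in the pipeline's shared Vocab</li>
        <li>Each word is mapped to a unique hash ID</li>
        <li>Lookups work in both directions: from word to ID and from ID back to word (the reverse lookup only works for strings the Vocab has already seen)</li>
        <li>A word always has the same hash value, irrespective of the document</li>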
    </ul>
</span>
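
The claim that a word hashes to the same ID everywhere can be checked with a small sketch. It uses two independent blank pipelines (an assumption made so that no model download is needed) and the `StringStore.add` method, which registers a string and returns its hash.

```python
import spacy

# Two independent pipelines, each with its own Vocab
nlp_a = spacy.blank("en")
nlp_b = spacy.blank("en")

# add() stores the string and returns its hash ID
hash_a = nlp_a.vocab.strings.add("chess")
hash_b = nlp_b.vocab.strings.add("chess")

# The hash depends only on the string, not on the pipeline or document
print(hash_a == hash_b)

# Reverse lookup: hash ID back to the string
print(nlp_a.vocab.strings[hash_a])
```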

In [None]:
uid = model.vocab.strings['chess']
print(uid)

In [None]:
print(model.vocab.strings[uid])

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Lexical attributes<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    Previous examples used is_stop and is_punct. These are lexical attributes.<br>
    spaCy provides many other lexical attributes, for example:
    <ul>
        <li>like_num</li>
        <li>like_email</li>
        <li>like_url</li>
        <li>etc...</li>
    </ul>
</span>
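
Lexical attributes depend only on the token text, so they can be tried out with a blank pipeline. The sketch below is a minimal example on a made-up sentence; note that `like_num` also matches spelled-out numbers such as "ten".

```python
import spacy

# Lexical attributes need only the tokenizer, not a trained model
nlp = spacy.blank("en")

sample = "Send ten copies to jsmith@gmail.com or see https://spacy.io before 2021"
doc = nlp(sample)

# Collect tokens by lexical attribute
numbers = [token.text for token in doc if token.like_num]
emails = [token.text for token in doc if token.like_email]
urls = [token.text for token in doc if token.like_url]

print("numbers:", numbers)
print("emails:", emails)
print("urls:", urls)
```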

In [None]:
text3 = '2020 is far worse for the world economy than 2009'
doc3 = model(text3)

In [None]:
# print the tokens that look like numbers
for token in doc3:
    if token.like_num:
        print(token)

In [None]:
text4 = """ name : Jim age: 45 email : jsmith@gmail.com
                 name : John age: 34 email: jdoe8888@gmail.com
                 name : Nila age: 60 email : nwafers222@gmail.com
                 name : Mary age: 15 email : mpotter@yahoo.com
                 """

In [None]:
doc4 = model(text4)

In [None]:
for token in doc4:
    if token.like_email:
        print(token.text)

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Part of Speech Analysis<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    spaCy tags each token with its part of speech: noun, verb, pronoun, etc.<br>
    POS tags can also be used to remove "junk" tokens such as "etc" and "i.e.", as done below with the X (other) tag.
</span>

In [None]:
text5 = 'John plays basketball, if time permits. He played in high school too'
doc5 = model(text5)

In [None]:
for token in doc5:
    print(token.text, '\t', token.pos_)

In [None]:
# understand what POS means
spacy.explain('SCONJ')
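
The tag names printed by `token.pos_` can also be decoded in bulk. The sketch below builds a small glossary with `spacy.explain`; the list of tags is an arbitrary selection chosen for illustration.

```python
import spacy

# A hand-picked selection of Universal POS tags (illustrative only)
tags = ["NOUN", "PROPN", "VERB", "SCONJ", "X"]

# spacy.explain returns a short description, or None for unknown labels
glossary = {tag: spacy.explain(tag) for tag in tags}

for tag, meaning in glossary.items():
    print(tag, "\t", meaning)
```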

### Remove "junk" text using POS

In [None]:
text6 = """I liked the movies etc The movie had good direction  The movie was amazing i.e.
            The movie was average direction was not bad The cinematography was nice. i.e.
            The movie was a bit lengthy  otherwise fantastic  etc etc"""

In [None]:
doc6 = model(text6)

In [None]:
# print the "junk" text
for token in doc6:
    if token.pos_ == "X":
        print(token.text)

In [None]:
# remove junk text
doc6_clean = [token for token in doc6 if token.pos_ != "X"]

In [None]:
print(doc6_clean)

In [None]:
# create a dictionary mapping each part-of-speech ID (token.pos) to its name (token.pos_)
doc6_tags = {token.pos: token.pos_ for token in doc6}

In [None]:
print(doc6_tags)

In [None]:
# visualize the dependency parse (with POS tags) using displaCy
from spacy import displacy
displacy.render(doc5, style='dep', jupyter=True)

<font color='gray'>
<h3>
<span style="font-family:verdana; font-size:1.2em;">
spaCy - Named Entity Recognition<br>
</span>
</h3>
</font>
<span style="font-family:verdana; font-size:1.2em;">
    spaCy can find named entities in text. For example, in "John works for Cisco", John and Cisco are named entities: a person (PERSON) and an organization (ORG)
</span>
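
To show what `doc.ents` holds, the sketch below marks the entities of the example sentence by hand and groups them by label. The hand-assigned spans are an assumption made so the example runs without a trained model; a pipeline such as `en_core_web_sm` fills `doc.ents` itself.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("John works for Cisco")

# Hand-assigned entity spans (illustration only): token 0 is a person,
# token 3 is an organization
doc.ents = [
    Span(doc, 0, 1, label="PERSON"),
    Span(doc, 3, 4, label="ORG"),
]

# Group entity texts by their label
entities_by_label = defaultdict(list)
for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

print(dict(entities_by_label))
```

Each entity is a `Span`, so attributes such as `ent.start_char` and `ent.end_char` are available in addition to `ent.text` and `ent.label_`.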

In [None]:
text7 = 'Tony Stark owns the company StarkEnterprises. Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'

In [None]:
doc7 = model(text7)

In [None]:
print(doc7.ents)

In [None]:
# print the type of entity
for entity in doc7.ents:
    print(entity.text, '\t', entity.label_)

In [None]:
displacy.render(doc7, style = 'ent', jupyter = True)