#Understanding spaCy


The field of Natural Language Processing (NLP) has made steady progress over the years, as the development of specialised libraries shows.

One such library is spaCy. Anyone working with a lot of text data will want to understand it better: what is the data about, what do the words mean in context, and what is the key point of the text?

This is where **spaCy** comes in. It is designed specifically for production use and helps you build applications that process and “understand” large volumes of text.


It is a one-stop shop for NLP tasks such as:

*   Tokenisation
*   Lemmatisation
*   POS tagging
*   Entity recognition
*   Word-to-vector transformations
*   Various methods for cleaning and processing text

In this article we learn how to use spaCy, a fast-growing, industry-standard library for NLP in Python.
It is fast because it is implemented in Cython, and it can be thought of as a sort of NumPy for NLP.

In the first part of the article we cover the basic NLP tasks in spaCy: tokenization, lemmatization, POS tagging and entity recognition.

In the second part we apply spaCy to a large text.

#Installing spaCy

First, let us install the spaCy library in Anaconda.

**Installing spaCy using Conda**

    conda install -c conda-forge spacy
    sudo python -m spacy download en
    sudo python -m spacy download fr

Now let's import the library and start playing with it.

In [0]:
import spacy

#Using spaCy
**Understanding the basic NLP tasks in spaCy, working with statistical models, and using them to predict linguistic features in our text**


First we have to instantiate spaCy's pipeline and store it in a variable named *nlp*.
To instantiate *nlp*, we first import the English language class from spacy.lang.en.

In [0]:
# Importing the English language class
from spacy.lang.en import English

# Creating the nlp object
nlp = English()

#Reading and Tokenization

We pass our text to the nlp object we created, which converts it into a form spaCy can work with: a document (Doc). The doc then gives us structured access to the text, letting us iterate over its tokens, get a token by its index, and so on.

In [12]:
doc = nlp('Life is beautiful if you do not take it too seriously')
print([token.text for token in doc])

['Life', 'is', 'beautiful', 'if', 'you', 'do', 'not', 'take', 'it', 'too', 'seriously']


We can slice/index tokens in a way similar to what we do in lists
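
As a minimal sketch of that list-like access, assuming spaCy is installed: `spacy.blank("en")` is used here only to keep the example self-contained, and the variable names are illustrative.

```python
import spacy

# A blank English pipeline is enough here: it only tokenizes
nlp = spacy.blank("en")
doc = nlp("Life is beautiful if you do not take it too seriously")

first_token = doc[0]   # indexing returns a single Token
last_token = doc[-1]   # negative indices work as they do with lists
phrase = doc[1:3]      # slicing returns a Span (the end index is exclusive)

print(first_token.text)
print(last_token.text)
print(phrase.text)
print(len(doc))
```

Note that a slice is a Span, a view onto the original Doc, rather than a new list of strings.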

Tokens also expose attributes that give us more information about them.
For example, **is_stop** lets us remove stop words, **is_alpha** tells us whether the token consists of alphabetic characters, and **is_punct** tells us whether the token is a punctuation symbol.

In [13]:
print([token.text for token in doc if not token.is_stop ])

['Life', 'beautiful', 'seriously']
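
The other two attributes mentioned above can be tried in the same way. This is a small sketch on a made-up sentence, again using a blank English pipeline as an assumption to keep it self-contained.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Print each token with its is_alpha and is_punct flags
for token in doc:
    print(token.text, token.is_alpha, token.is_punct)

# Keep only the alphabetic tokens, dropping the punctuation
words_only = [token.text for token in doc if token.is_alpha]
print(words_only)
```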


We can also **lemmatize** the tokens using the .lemma_ attribute.

In [14]:
text = 'loving the love of being loved'
doc = nlp(text)

print([token.lemma_ for token in doc])

['love', 'the', 'love', 'of', 'be', 'love']


#Statistical Models & Linguistic Features

Understanding the context of a sentence helps a machine make better predictions about the data.

Pre-trained statistical models let spaCy make predictions in context.

The pre-trained models can be downloaded with the 'spacy download' command.

'en_core_web_sm' is one such package. It is a small English model trained on web text that supports tagging, parsing and named entity recognition.

It provides spaCy with the binary weights and vocabulary that enable it to make predictions.

*   Parts of Speech tags


In [15]:
# Load the small English model – spaCy is already imported
nlp = spacy.load('en_core_web_sm')

sent = 'Parth has been ignorant this semester and would have to suffer for the same'

doc = nlp(sent)

# Coarse-grained part-of-speech tags
print([(token.text,token.pos_) for token in doc])


[('Parth', 'PROPN'), ('has', 'VERB'), ('been', 'VERB'), ('ignorant', 'ADJ'), ('this', 'DET'), ('semester', 'NOUN'), ('and', 'CCONJ'), ('would', 'VERB'), ('have', 'VERB'), ('to', 'PART'), ('suffer', 'VERB'), ('for', 'ADP'), ('the', 'DET'), ('same', 'ADJ')]


In [16]:
# Fine-grained part-of-speech tags
print([(token.text,token.tag_) for token in doc])

[('Parth', 'NNP'), ('has', 'VBZ'), ('been', 'VBN'), ('ignorant', 'JJ'), ('this', 'DT'), ('semester', 'NN'), ('and', 'CC'), ('would', 'MD'), ('have', 'VB'), ('to', 'TO'), ('suffer', 'VB'), ('for', 'IN'), ('the', 'DT'), ('same', 'JJ')]
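
The tag names in these outputs can be hard to read at first. As a small sketch, `spacy.explain` returns a short description for a tag or label from spaCy's built-in glossary; the tags below are simply picked from the outputs above.

```python
import spacy

# Look up a human-readable description for a few tags from the outputs above
for tag in ["PROPN", "ADP", "NNP", "MD"]:
    print(tag, "->", spacy.explain(tag))

propn_description = spacy.explain("PROPN")
```

No model needs to be loaded for this, since the glossary ships with the library itself.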



*   Named entity recognition


In [17]:
doc = nlp("Harry has total right to sew apple for the cheating")

# Text and label of named entity span

print([(tok.text, tok.label_) for tok in doc.ents])

[('Harry', 'PERSON')]



#Visualising dependencies

In [0]:
from spacy import displacy  

In [19]:
displacy.render(doc, style="dep",jupyter = True,options = {'distance':140})

#Visualizing named entities

In [20]:
displacy.render(doc, style="ent",jupyter = True,options = {'distance':140})

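Here, we'll show you how to create Doc objects manually.

The Doc constructor takes three arguments: nlp.vocab, the list of words, and a list of Boolean values (spaces) indicating whether each word is followed by a space.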

In [21]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Hello, How are you!"
words = ['Hello', ',', 'How', 'are', 'you' ,'!']
spaces = [False, True, True, True,False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Hello, How are you!


Next, let's look at one more object: Span.

Span takes four arguments. The first is a reference to the doc; the second and third are the start (inclusive) and end (exclusive) token indices; the fourth is optional and assigns a label to the span.

In [22]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'am', 'from', 'London', ',','United', 'Kingdom'], spaces=[True, True, True, False, True, True, False])

# Create a span for "London, United Kingdom" from the doc and assign it the label "PERSON"
# (the label is arbitrary in this demo; "GPE" would be the usual label for a place)
span = Span(doc, 3, 7, label='PERSON')
print(span.text, span.label_)

London, United Kingdom PERSON


Doc has a property called ents that stores the entity spans of the document.

doc.ents can be assigned a list of Span objects.

In [23]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'am', 'from', 'London', ',','United', 'Kingdom'], spaces=[True, True, True, False, True, True, False])

# Create a span for "London, United Kingdom" from the doc and assign it the label "PERSON"
# (the label is arbitrary in this demo; "GPE" would be the usual label for a place)
span = Span(doc, 3, 7, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])


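# Iterate over the tokens (all but the last, so doc[token.i + 1] always exists)
for token in doc[:-1]:
    # Check if the current token is a proper noun
    if token.pos_ == 'PROPN':
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')
# Note: this Doc was built by hand and has no POS tags, so nothing is printed here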

[('London, United Kingdom', 'PERSON')]


spaCy can be used to measure the similarity between two tokens, a doc and a token, a span and a doc, and so on.

How does spaCy measure similarity?

Similarity is determined using word vectors, which are multi-dimensional representations of word meaning generated by an algorithm such as Word2Vec. By default spaCy returns the cosine similarity, but this can be customized. Doc and Span vectors default to the average of their token vectors, so short phrases tend to give more accurate results than long documents, which contain more irrelevant words.
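
To make the two ideas above concrete, here is a plain-Python sketch of cosine similarity and of averaging token vectors into a phrase vector. The three-dimensional vectors are invented for illustration only (real word vectors typically have hundreds of dimensions), and the helper names are hypothetical, not part of spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average_vector(vectors):
    """Component-wise mean, the default way a phrase vector is built."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Toy vectors, made up for this sketch
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "burger": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_food = cosine_similarity(toy_vectors["pizza"], toy_vectors["burger"])
sim_car = cosine_similarity(toy_vectors["pizza"], toy_vectors["car"])
print(sim_food, sim_car)

# A "phrase" vector is the average of its token vectors
phrase_vector = average_vector([toy_vectors["pizza"], toy_vectors["burger"]])
print(phrase_vector)
```

With these toy values the two food words score much higher with each other than "pizza" does with "car", which is the behaviour one hopes for from real word vectors.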


Computing similarity requires a model with word vectors. spaCy offers English models in three sizes:

*   en_core_web_sm (small model)
*   en_core_web_md (medium model)
*   en_core_web_lg (large model)


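For our purpose, we'll be using the small one. Note that the small model does not ship with real word vectors, so its similarity scores are less reliable than those of the medium and large models.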

To get the vector of any token, use the vector property of a token in the Doc.

You can compare doc-doc or token-token similarity using doc.similarity(doc) or token.similarity(token).

In [24]:
# Larger models with word vectors can be downloaded first, for example:
# !python -m spacy download en_core_web_lg
# import spacy.cli
# spacy.cli.download('en_core_web_md')

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# import en_core_sci_md
# nlp = en_core_sci_md.load()

# Process a text
doc = nlp("why am i eating pizza?")

# Get the vector for the token "eating"
prnt_vector = doc[3].vector
print(prnt_vector)

doc1 = nlp("Let's have a pizza party tonight.")
doc2 = nlp("i will have a burger.")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)


doc = nlp("car and cards")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "car" and "cards"
similarity = token1.similarity(token2)
print(similarity)

[-0.5838928  -2.1370084  -0.02769181 -0.08865166  0.02538602 -1.8068115
  8.920958    0.08726597 -1.2190907  -2.872636   -2.0168803   1.0798023
  0.01544001 -1.9018455   0.4573893  -1.4435092   2.0902443   0.1434044
 -3.0693352   2.0814598   0.5286237  -3.1452076   0.52507913  5.818022
 -4.6569743  -0.74498695  6.3046036  -0.7623123  -0.19211659  3.837345
  1.8241663  -0.61304367  2.2083328  -2.594667    1.7671746  -2.3091698
  0.8150312  -0.7948343   0.116624    5.871045    4.9112945  -1.6638976
 -1.7903962  -0.94224524 -2.2724063  -3.2359116   2.1840324  -2.1792397
 -3.3966825  -2.2638009   1.6616726  -2.3878088   1.1520045   1.6465685
  1.0054997  -1.4016173  -1.8878505  -2.4043303   1.019835   -1.5475055
  1.9052529  -0.15786624 -3.5604143  -2.7913537  -2.3911486  -1.6027616
  1.4730654   0.68242794 -3.0053346   0.28748536 -0.92508596 -2.283294
  0.10451706  2.9885764   4.0413136   3.7486117  -2.351272   -0.4637978
  0.59839106  3.4197762   4.7469673  -0.8413911  -1.5675265   0.619


In [25]:
doc = nlp("We were playing cricket this evening. We are planing to go to watch world cup.")

span1 = doc[2:6]
span2 = doc[8:]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.44422776


A list containing the names of all countries.

In [0]:
COUNTRIES = ['Afghanistan',
 'Åland Islands',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antarctica',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Botswana',
 'Bouvet Island',
 'Brazil',
 'British Indian Ocean Territory',
 'United States Minor Outlying Islands',
 'Virgin Islands (British)',
 'Virgin Islands (U.S.)',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Christmas Island',
 'Cocos (Keeling) Islands',
 'Colombia',
 'Comoros',
 'Congo',
 'Congo (Democratic Republic of the)',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Curaçao',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Falkland Islands (Malvinas)',
 'Faroe Islands',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'French Southern Territories',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Gibraltar',
 'Greece',
 'Greenland',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guernsey',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Heard Island and McDonald Islands',
 'Holy See',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 "Côte d'Ivoire",
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jersey',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Macao',
 'Macedonia (the former Yugoslav Republic of)',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mayotte',
 'Mexico',
 'Micronesia (Federated States of)',
 'Moldova (Republic of)',
 'Monaco',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nauru',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'Norfolk Island',
 "Korea (Democratic People's Republic of)",
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Palestine, State of',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Pitcairn',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Kosovo',
 'Réunion',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Saint Barthélemy',
 'Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French part)',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Georgia and the South Sandwich Islands',
 'Korea (Republic of)',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'Sudan',
 'Suriname',
 'Svalbard and Jan Mayen',
 'Swaziland',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan',
 'Tajikistan',
 'Tanzania, United Republic of',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna',
 'Western Sahara',
 'Yemen',
 'Zambia',
 'Zimbabwe']

doc = 'Czech Republic may help Slovakia protect its airspace'

The PhraseMatcher class can be used to find phrases from a list of pattern Docs inside another Doc.

In [27]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
doc = nlp(doc)

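# Convert each country name into a Doc to use as a pattern
tt = [nlp(i) for i in COUNTRIES]
# spaCy v2 signature; in spaCy v3 the call is matcher.add('COUNTRY', tt)
matcher.add('COUNTRY', None, *tt)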

matches = matcher(doc)

print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]
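
For long pattern lists, running the full pipeline on every pattern is more work than the matcher needs. The sketch below assumes spaCy v3, where `matcher.add` takes the patterns as a list, and uses `nlp.make_doc` so that each pattern is only tokenized. The shortened `country_names` list is a stand-in for `COUNTRIES`.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
country_names = ["Czech Republic", "Slovakia", "France"]  # shortened list for the sketch

matcher = PhraseMatcher(nlp.vocab)
# nlp.make_doc only tokenizes, which is all an exact-text pattern needs
patterns = [nlp.make_doc(name) for name in country_names]
matcher.add("COUNTRY", patterns)  # spaCy v3 signature (assumed)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
# Sort by start token so the results follow the order of the text
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for match_id, start, end in matches]
print(found)
```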


In [32]:
file = open("test.txt" ,'r')
text = file.read()

# text = "After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, later described the hopes raised by these successes as a false renaissance for the organization, given the more troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission to Bosnia faced worldwide ridicule for its indecisive and confused mission in the face of ethnic cleansing. In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, withdrew his nation\'s funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, founded 1946) over allegations of mismanagement, followed by Britain and Singapore. 
Boutros Boutros-Ghali, Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization somewhat. His successor, Kofi Annan (1997–2006), initiated further management reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council resolution for authorization, prompting a new round of questioning of the organization\'s effectiveness. Under the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri Lankan Civil War in 2009 concluded that the organization had suffered systemic failure. One hundred and one UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization\'s history. The Millennium Summit was held in 2000 to discuss the UN\'s role in the 21st century. The three day meeting was the largest gathering of world leaders in history, and culminated in the adoption by all member states of the Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was ultimately uneven. The 2005 World Summit reaffirmed the UN\'s focus on promoting development, peacekeeping, human rights, and global security. 
The Sustainable Development Goals were launched in 2015 to succeed the Millennium Development Goals. In addition to addressing global challenges, the UN has sought to improve its accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. In an effort to enhance transparency, in 2016 the organization held its first public debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to global needs."
print(len(text))
file.close()

164189


The country names and other geopolitical entities (GPE) found in the "text" variable are printed.

Although the text file is large, the program runs quite smoothly.
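
For texts much larger than this one, a common approach is to split the input into smaller pieces and stream them through the pipeline. The following is only a sketch under that assumption: the two sentences stand in for paragraphs of a large file, and the batch size is an arbitrary choice.

```python
import spacy

nlp = spacy.blank("en")

# Stand-in for a large file that has been split into paragraphs
paragraphs = [
    "The UN was founded in 1945.",
    "It has 193 member states.",
]

token_count = 0
# nlp.pipe streams the texts through the pipeline in batches
for doc in nlp.pipe(paragraphs, batch_size=50):
    token_count += len(doc)

print(token_count)
# Upper limit on the number of characters accepted in a single text
print(nlp.max_length)
```

A text of 164189 characters, as above, can still be passed to nlp in one call as long as it stays under `nlp.max_length`.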

In [33]:
# Create a doc and find matches in it
doc = nlp(text)

matcher = PhraseMatcher(nlp.vocab)
matcher.add('COUNTRY', None, *([nlp(i) for i in COUNTRIES]))
for match_id, start, end in matcher(doc):
    # Create a Span with the label "GPE" for each matched country name
    # (the span is not written back to doc.ents here)
    span = Span(doc, start, end, label='GPE')
# Print the GPE entities predicted by the model's named entity recognizer
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

[('China', 'GPE'), ('North Africa', 'GPE'), ('Australia', 'GPE'), ('Germany', 'GPE'), ('Japan', 'GPE'), ('the Italian Republic', 'GPE'), ('the United States', 'GPE'), ('the Soviet Union', 'GPE'), ('United States', 'GPE'), ('United Kingdom', 'GPE'), ('China', 'GPE'), ('Germany', 'GPE'), ('Japan', 'GPE'), ('the Soviet Union', 'GPE'), ('China', 'GPE'), ('Japan', 'GPE'), ('China', 'GPE'), ('Poland', 'GPE'), ('Germany', 'GPE'), ('Germany', 'GPE'), ('France', 'GPE'), ('the United Kingdom', 'GPE'), ('Germany', 'GPE'), ('Italy', 'GPE'), ('Japan', 'GPE'), ('Germany', 'GPE'), ('the Soviet Union', 'GPE'), ('Poland', 'GPE'), ('Finland', 'GPE'), ('Romania', 'GPE'), ('North Africa', 'GPE'), ('East Africa', 'GPE'), ('France', 'GPE'), ('the British Empire', 'GPE'), ('Battle', 'GPE'), ('Britain', 'GPE'), ('the Soviet Union', 'GPE'), ('Japan', 'GPE'), ('the United States', 'GPE'), ('U.S.', 'GPE'), ('Japan', 'GPE'), ('Great Britain', 'GPE'), ('U.S.', 'GPE'), ('Japan', 'GPE'), ('Germany', 'GPE'), ('Italy'
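
If the matched country spans should themselves become the document's entities, they have to be assigned to doc.ents. The sketch below shows one way to do that, assuming spaCy v3 and a blank pipeline (so doc.ents starts empty); `filter_spans` from `spacy.util` removes overlapping spans, which doc.ents does not allow.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")  # no NER component, so doc.ents starts empty
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", [nlp.make_doc(name) for name in ["Czech Republic", "Slovakia"]])

doc = nlp("Czech Republic may help Slovakia protect its airspace")
# Turn every match into a Span labelled "GPE"
spans = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]

# filter_spans keeps the longest span wherever spans overlap
doc.ents = filter_spans(spans)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a model that already predicts entities, one option is to pass `list(doc.ents) + spans` to `filter_spans` so that the model's entities and the matched countries are combined.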

This shows that spaCy is an efficient library for NLP tasks and an effective tool for anyone planning to work with a lot of text data.

Signing off - Vandan Patel and Parth Upadhayay