
# NAMED ENTITY RECOGNITION WITH SPACY

NLP can help determine what a text is about. A key aspect of this is named entity recognition, which attempts to locate and categorize items such as dates, times, quantities, places, people, things, organizations and more. In this section, we’ll use the named entity recognition capabilities of the spaCy NLP library to analyze text.

You may also want to check out [Textacy](https://github.com/chartbeat-labs/textacy), an NLP library built on spaCy that supports additional NLP tasks.

## Loading the Language Model

The first step in using spaCy is to load the language model representing the natural language
of the text you’re analyzing. To do this, you’ll call the spacy module’s load function.

In [0]:
import spacy

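# The original cell used the 'en' shortcut, which spaCy 3.x no longer accepts.
# The full package name works in both 2.x and 3.x once the model is installed.
nlp = spacy.load('en_core_web_sm')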

## Creating a spaCy Doc

Next, use the nlp object to create a spaCy Doc object representing the document to
process. Here we use a sentence from the introduction to the World Wide Web that appears in many of
our books.

In [0]:
document = nlp('In 1994, Tim BernersLee founded the World Wide Web Consortium (W3C), devoted to developing web technologies.')

## Getting the Named Entities

The Doc object’s ents property returns a tuple of Span objects representing the named entities found in the Doc. Each Span has many properties.

In [7]:
for entity in document.ents:
  print(f'{entity.text} : {entity.label_}')

1994 : DATE
Tim BernersLee : PERSON
the World Wide Web Consortium (W3C : ORG


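Each Span’s text property returns the entity as a string, and the label_ property returns a string indicating the entity’s kind. Here, spaCy found three entities representing a DATE (1994), a PERSON (Tim BernersLee)
and an ORG (organization; the World Wide Web Consortium). Note that the ORG span in the output above also picks up the fragment "(W3C" from the parenthetical that follows the organization’s name.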
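When a text contains many entities, it can help to collect them by kind. The following is a minimal sketch, not part of the original notebook: the helper name `group_entities_by_label` is hypothetical, and it relies only on the `text` and `label_` attributes described above. The stand-in objects mimic the three Spans printed earlier so that the sketch runs without a language model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities_by_label(entities):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for entity in entities:
        # Only the text and label_ attributes are used here.
        grouped[entity.label_].append(entity.text)
    return dict(grouped)


# Hypothetical stand-ins that copy the three entities printed above.
sample_entities = [
    SimpleNamespace(text='1994', label_='DATE'),
    SimpleNamespace(text='Tim BernersLee', label_='PERSON'),
    SimpleNamespace(text='the World Wide Web Consortium (W3C', label_='ORG'),
]

entities_by_label = group_entities_by_label(sample_entities)
for label, texts in entities_by_label.items():
    print(f'{label}: {texts}')
```

With spaCy loaded, passing `document.ents` in place of `sample_entities` should work the same way, since each Span exposes the same two attributes.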

# SIMILARITY DETECTION WITH SPACY

Similarity detection is the process of analyzing documents to determine how alike they are. One possible similarity detection technique is word frequency counting. 

For example, some people believe that the works of William Shakespeare actually might have been written by Sir Francis Bacon, Christopher Marlowe or others. Comparing the word frequencies of their works with those of Shakespeare can reveal writing-style similarities.
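As a rough illustration of word frequency counting, the sketch below counts the words in two texts and compares the resulting count vectors with cosine similarity. This is one simple way to do it, chosen for illustration; the helper names `word_frequencies` and `cosine_similarity` are hypothetical, and two short quotations stand in for full-length works.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word occurrences, ignoring punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no words with any other text
    return dot / (norm_a * norm_b)


# Two short illustrative strings stand in for complete works.
text_a = 'To be, or not to be, that is the question.'
text_b = 'To die, to sleep; to sleep, perchance to dream.'

freq_a = word_frequencies(text_a)
freq_b = word_frequencies(text_b)

print(freq_a.most_common(3))
print(f'{cosine_similarity(freq_a, freq_b):.3f}')
```

On real works you would typically read each file, and possibly remove stop words first, because very common words such as "to" and "the" otherwise dominate the counts.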

Various machine-learning techniques can be used to study document similarity. However, as is often the case in Python, libraries such as spaCy and Gensim can do this for you. Here, we’ll use spaCy’s similarity detection features to compare Doc objects representing Shakespeare’s Romeo and Juliet with Christopher Marlowe’s Edward the Second.

## Creating the spaCy Docs

First, we create two Doc objects, one for Romeo and Juliet and one for Edward the Second:

In [0]:
from pathlib import Path

document1 = nlp(Path('romeo-and-juliet.txt').read_text())
document2 = nlp(Path('edward-the-second.txt').read_text())

## Comparing the Books’ Similarity

Finally, we use the Doc class’s similarity method to get a value from 0.0 (not similar) to
1.0 (identical) indicating how similar the documents are:

In [9]:
document1.similarity(document2)

0.9128401577782687

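spaCy reports that these two documents are highly similar. The same method works on short texts; the next comparisons use pairs of single sentences that differ by one word.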

In [0]:
document1 = nlp('I am very happy today.')
document2 = nlp('I am very very happy today.')

In [12]:
document1.similarity(document2)

0.950185623440308

In [14]:
document1 = nlp('I am very happy today.')
document2 = nlp('I am not happy today.')
document1.similarity(document2)

0.9522217818907655
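The last two scores are nearly equal even though one sentence negates the other. One possible explanation, assuming the score is a cosine similarity over averaged word vectors (spaCy documents this as the default estimate for `Doc.similarity`; the model used in this notebook may have computed it differently), is that changing a single word barely moves the average. The sketch below illustrates the effect with invented three-dimensional vectors; every number in `toy_vectors` is hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# Real models learn their values and use far more dimensions.
toy_vectors = {
    'i': [0.9, 0.1, 0.0],
    'am': [0.8, 0.2, 0.1],
    'very': [0.1, 0.9, 0.2],
    'not': [0.2, 0.1, 0.9],
    'happy': [0.3, 0.8, 0.4],
    'today': [0.5, 0.3, 0.6],
}


def average_vector(words):
    """Element-wise mean of the toy vectors for the given words."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


happy = average_vector(['i', 'am', 'very', 'happy', 'today'])
not_happy = average_vector(['i', 'am', 'not', 'happy', 'today'])

# The two words that differ are far apart on their own...
word_similarity = cosine_similarity(toy_vectors['very'], toy_vectors['not'])
# ...but the sentence averages share four of their five words.
sentence_similarity = cosine_similarity(happy, not_happy)

print(f'very vs. not: {word_similarity:.3f}')
print(f'sentences:    {sentence_similarity:.3f}')
```

With these made-up numbers, the sentence-level score stays above 0.9 while the word-level score is much lower. This suggests that a high similarity value indicates shared vocabulary and structure, not necessarily shared meaning.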