# Detecting Named Entities using NLTK and spaCy

### Our two favorite imports 😛 

In [2]:
import spacy, nltk

### And good old pandas

In [42]:
import pandas as pd

### Read in the text

In [20]:
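# Read as text and close the file afterwards; assumes the file is UTF-8 encoded
with open("testfile.txt", encoding="utf-8") as f:
    text = f.read()

### Load spaCy

In [8]:
# The "en" shortcut no longer works in spaCy 3; load a model by its full name
nlp = spacy.load("en_core_web_sm")

### Run spaCy pipeline

In [21]:
doc = nlp(text)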

### View named entities
See: https://spacy.io/docs/usage/entity-recognition

In [48]:
pd.DataFrame([(ent.text, ent.label_) for ent in doc.ents], columns=["Entity", "Entity Type"]).head()

| | Entity | Entity Type |
|---|---|---|
| 0 | January 2017 | DATE |
| 1 | Quinn | PERSON |
| 2 | Ruby | PRODUCT |
| 3 | Python | PRODUCT |
| 4 | Netezza | PERSON |


### View POS tags

In [57]:
rows = [(token.text, token.ent_type_, token.tag_, token.pos_) for token in doc]
df_pos = pd.DataFrame(rows, columns=["Token", "Entity Type", "Tag", "Part of Speech"])
df_pos.head()

| | Token | Entity Type | Tag | Part of Speech |
|---|---|---|---|---|
| 0 | As | | IN | ADP |
| 1 | of | | IN | ADP |
| 2 | January | DATE | NNP | PROPN |
| 3 | 2017 | DATE | CD | NUM |
| 4 | , | | , | PUNCT |


### Identifying domain-specific terms
One way to extract programming language names and other domain-specific terms is to look for proper nouns. However, this only identifies single-word terms, so multi-word terms such as _Ruby on Rails_ are missed. We would also still need a master list against which to compare the proper nouns and pick out the terms of interest. The noun chunks list does not contain _Ruby on Rails_ either.

#### Proper nouns

In [74]:
df_pos[df_pos["Part of Speech"] == "PROPN"]

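| | Token | Entity Type | Tag | Part of Speech |
|---|---|---|---|---|
| 2 | January | DATE | NNP | PROPN |
| 7 | Mr. | | NNP | PROPN |
| 8 | Quinn | PERSON | NNP | PROPN |
| 13 | C | | NNP | PROPN |
| 15 | C++ | | NNP | PROPN |
| 17 | C# | | NNP | PROPN |
| 19 | Ruby | PRODUCT | NNP | PROPN |
| 21 | Rails | | NNPS | PROPN |
| 24 | Python | PRODUCT | NNP | PROPN |
| 31 | Netezza | PERSON | NNP | PROPN |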
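
One partial workaround is to merge runs of adjacent proper nouns into a single candidate term. The sketch below is plain Python, not a spaCy API: the helper name `merge_propn_runs` is made up for illustration, and the sample is a hypothetical fragment of `(text, pos)` pairs shaped like the rows of `df_pos`. The non-proper-noun rows, including `on` between _Ruby_ and _Rails_, are assumptions.

```python
def merge_propn_runs(tokens):
    """Join consecutive PROPN tokens into multi-word candidate terms.

    `tokens` is a list of (text, pos) pairs for contiguous tokens,
    in document order.
    """
    terms, run = [], []
    for text, pos in tokens:
        if pos == "PROPN":
            run.append(text)
        elif run:
            # A non-PROPN token ends the current run
            terms.append(" ".join(run))
            run = []
    if run:
        terms.append(" ".join(run))
    return terms


# Hypothetical contiguous fragment; only the PROPN rows come from the table above
sample = [
    ("Mr.", "PROPN"),
    ("Quinn", "PROPN"),
    ("knows", "VERB"),
    ("Ruby", "PROPN"),
    ("on", "ADP"),
    ("Rails", "PROPN"),
]
candidates = merge_propn_runs(sample)
print(candidates)
```

Under these assumptions _Mr. Quinn_ is merged into one term, but _Ruby_ and _Rails_ stay separate because the `on` between them is not a proper noun, so the multi-word term is still missed.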


#### Noun chunks

In [70]:
list(doc.noun_chunks)

[January,
 the mysterious Mr. Quinn,
 working knowledge,
 C,
 Rails,
 He,
 Netezza,
 March,
 He,
 his Masters,
 Mathematics,
 the University,
 Cambridge,
 He,
 an intern,
 Marvel Studios,
 his Bachelors study,
 The Department,
 Physical Sciences,
 University,
 Awesomeness,
 their supercomputing cluster,
 simulations,
 Complexly,
 Physical Phenomenon,
 He,
 modern twistor theory,
 breakfast,
 string theory,
 lunch,
 his Masters thesis,
 Mr. Quinn,
 himself,
 new flavors,
 Linux,
 Hardy Heron,
 he,
 all the releases,
 Linux,
 He,
 Linux Mint,
 Ubuntu,
 He,
 the book Hackers,
 Painters,
 Paul Graham,
 He,
 a parallel universe,
 he,
 Y-Combinator,
 Paul Graham]
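
Since neither the proper nouns nor the noun chunks surface _Ruby on Rails_, a comparison against a master list would have to work on token sequences rather than single tokens. Here is a minimal sketch in plain Python, assuming a hypothetical hand-maintained `MASTER_TERMS` list and an already tokenized sentence; the function name `find_terms` and the sample tokens are illustrative only. spaCy's `PhraseMatcher` does similar matching directly on a `Doc` and may be the better choice in practice.

```python
# Hypothetical master list of domain-specific terms
MASTER_TERMS = ["Ruby on Rails", "Ruby", "Python", "C++", "C#", "C", "Netezza"]


def find_terms(tokens, terms=MASTER_TERMS):
    """Return master-list terms found in `tokens`, preferring the longest match."""
    # Index each term by its lowercased, whitespace-split token tuple
    lookup = {tuple(term.lower().split()): term for term in terms}
    max_len = max(len(key) for key in lookup)
    lowered = [tok.lower() for tok in tokens]

    found, i = [], 0
    while i < len(lowered):
        # Try the longest window first so "Ruby on Rails" beats "Ruby"
        for size in range(min(max_len, len(lowered) - i), 0, -1):
            key = tuple(lowered[i:i + size])
            if key in lookup:
                found.append(lookup[key])
                i += size
                break
        else:
            i += 1  # no term starts at this token
    return found


# Assumed tokenization of a sentence like the one in the test file
tokens = ["He", "knows", "C", ",", "C++", ",", "C#", ",",
          "Ruby", "on", "Rails", "and", "Python", "."]
matches = find_terms(tokens)
print(matches)
```

Matching is case-insensitive here, which is convenient for a sketch but would also match a lowercase `c`; a stricter version could compare the original casing.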