[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_11-NLP/blob/master/M11_CCS1--AG--NLP_Objective_1_Part_2.ipynb)

### Objective 1 Part 2: Diagnose the structure of text

**Utilize Dependency Parsing to glean chunks of meaning and relations**

POS tagging labels each word in a sentence as an adjective, verb, noun, and so on, as we saw in the tutorial. At times, however, we need to go beyond part-of-speech tags and determine how the words in a sentence depend on one another.

For dependency parsing, we will use spaCy, one of the newer NLP libraries for Python.

For your reference, the link below provides the list of dependency tokens (scroll down to the Dependency Tokens section):

https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean

In [1]:
# Reference: https://spacy.io/
!pip install -U spacy


Successfully installed cymem-1.31.2 cytoolz-0.8.2 dill-0.2.7.1 msgpack-numpy-0.4.1 msgpack-python-0.5.6 murmurhash-0.28.0 pathlib-1.0.1 plac-0.9.6 preshed-1.0.0 regex-2017.4.5 spacy-2.0.11 thinc-6.10.2 tqdm-4.23.4 ujson-1.35 wrapt-1.10.11


In [0]:
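# Download the English model. The 'en' shortcut works with spaCy 2.x (2.0.11 was installed above);
# spaCy 3.x dropped shortcuts, so there the full package name en_core_web_sm is needed instead.
!python -m spacy download en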

In [0]:
# Import spacy
import spacy

# Load English version of tokenizer, tagger, parser
nlp = spacy.load('en')


# Define a function to print each token with its dependency label
def show_dependency(rawText):
    tokens = nlp(rawText)
    print(tokens)
    for token in tokens:
        print("{} : {}".format(token.orth_,token.dep_))

# Call the function
show_dependency("The batter hit the ball out of AT&T park into the pacific ocean.")

**Determine the head of a sentence**

The head (root) of a sentence governs its central structure. It is the most important word in the sentence and is usually a verb. In the previous example, the word "hit" is the root around which the rest of the sentence revolves: it has a subject ("batter") and a direct object ("ball").
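
To make the terms root, subject, and direct object concrete, here is a minimal sketch in plain Python. The parse below is written by hand for illustration and is not spaCy output; the tuple layout and the helper names `find_root` and `dependents` are assumptions made for this sketch. It follows the convention spaCy uses, where the root token's head is the token itself.

```python
# Hand-written parse of "The batter hit the ball" (illustrative, not spaCy output).
# Each entry: (word, dependency label, index of the head word).
# The root points at itself, mirroring how token.head behaves for the root in spaCy.
PARSE = [
    ("The", "det", 1),
    ("batter", "nsubj", 2),
    ("hit", "ROOT", 2),
    ("the", "det", 4),
    ("ball", "dobj", 2),
]


def find_root(parse):
    """Return the index of the word whose head is itself."""
    for index, (_word, _dep, head) in enumerate(parse):
        if head == index:
            return index
    raise ValueError("parse has no root")


def dependents(parse, head_index, dep_label):
    """Return the words attached to head_index with the given dependency label."""
    return [
        word
        for index, (word, dep, head) in enumerate(parse)
        if index != head_index and head == head_index and dep == dep_label
    ]


root_index = find_root(PARSE)
root_word = PARSE[root_index][0]
subjects = dependents(PARSE, root_index, "nsubj")
objects = dependents(PARSE, root_index, "dobj")

print("root:", root_word)
print("subject:", subjects)
print("direct object:", objects)
```

With a real spaCy parse, the same questions can be asked through `token.dep_` and `token.head`, which the next cell prints for every token.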

In [0]:
# Import spacy
import spacy

# Load English version of tokenizer, tagger, parser
nlp = spacy.load('en')

# Define a function to show what word each word depends upon
def show_dependency(rawText):
    tokens = nlp(rawText)
    for token in tokens:
        print(" {} : {} : {} : {}".format(
            token.orth_, token.pos_, token.dep_, token.head))

# Print the header
print("token : POS : dep. : head")
print("#########################")

# Call the function
show_dependency("The batter hit the ball out of AT&T park into the pacific ocean")

# In the above example, the central structure of the sentence
# is dictated by the ROOT which is the word "hit"

### Find Named Entities in a body of text

So far, we have explored part-of-speech tagging, dependency parsing, and identifying the head of a sentence. Now, we will look at finding proper nouns or named entities. Named Entity Recognition (NER) helps understand what a body of text is about; essentially, it provides an insight into what is being referred to in the body of text.

In [0]:
# Import spacy
# https://spacy.io/usage/spacy-101
import spacy

# Load the English pipeline (tokenizer, tagger, parser, named entity recognizer)
nlp = spacy.load('en')

# Define a function to print each named entity with its label
def show_entities(rawText):
    doc = nlp(rawText)
    for entity in doc.ents:
        print("{} : {}".format(entity.text, entity.label_))

# Call the function
show_entities("Alexander Martino Rodrigues-Schmidt hit the ball out of AT&T park into the pacific ocean. The park is located in San Francisco")

# GPE stands for geopolitical entity (countries, cities, states)

**Access built-in corpora**

A corpus is a collection of text documents; corpora is the plural of corpus. Real-world **Natural Language Processing** (NLP) problems often require large amounts of data. This data is generally available as a corpus, either externally on the World Wide Web or as an add-on to the NLTK package. For example, to create a spell checker, you need a large corpus of words to match against. Within this objective, we will explore how to access and work with the built-in corpora, and we will walk through the process of creating a custom corpus; a custom corpus is essentially just a collection of text files in a directory.
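
Since a custom corpus is just a directory of text files, the idea can be sketched with the Python standard library alone. This is a minimal illustration, not the NLTK workflow: the file names, the sample sentences, the regex tokenizer, and the helper `load_corpus` are all assumptions made for this sketch.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def load_corpus(directory):
    """Map each .txt file name in the directory to its list of lowercase words."""
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Very simple tokenizer: runs of letters and apostrophes
        corpus[path.name] = re.findall(r"[a-z']+", text.lower())
    return corpus


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample documents, written only to make the sketch runnable
    Path(tmp, "doc1.txt").write_text("The batter hit the ball.", encoding="utf-8")
    Path(tmp, "doc2.txt").write_text(
        "The park is located in San Francisco.", encoding="utf-8"
    )
    corpus = load_corpus(tmp)

# File names play the same role as file IDs in a built-in corpus
file_ids = list(corpus)
word_counts = Counter(word for words in corpus.values() for word in words)

print(file_ids)
print(word_counts.most_common(3))
```

The dictionary keys act like the file IDs of a built-in corpus, and the per-file word lists can be counted or filtered in the same way as the words returned by a corpus reader.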

**Load, access and perform operations on a built-in corpus**

NLTK provides many corpora. The complete list of built-in corpora is available at: http://www.nltk.org/nltk_data. To access the data packages included with NLTK, you will have to download them first. Download instructions are available at: https://www.nltk.org/data.html

In [0]:
import nltk

nltk.download('all')

# Inaugural is one of the data packages included within NLTK

# Import the "inaugural" data package
from nltk.corpus import inaugural

# Output the list of all Inaugural addresses included
# print(inaugural.fileids())

# All words from a subset of the Inaugural Address Corpus that begin with future
# or fellow are counted; separate counts are kept for each inaugural address;
# these are plotted so that trends in usage over time can be observed
condfreqdist = nltk.ConditionalFreqDist(
            (tget, fileid[:4])
            for fileid in (inaugural.fileids()[0:25])
            for w in inaugural.words(fileid)
            for tget in ['future', 'fellow']
            if w.lower().startswith(tget)) 
condfreqdist.plot()
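
To see what a conditional frequency distribution keeps track of, the counting step can be sketched without NLTK. This is a simplified stand-in and not NLTK's implementation: the helper `conditional_counts` and the two miniature "speeches" are hypothetical and are not taken from the inaugural corpus. As in the cell above, the condition is the target prefix and the sample is the year.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Count samples separately for each condition, given (condition, sample) pairs."""
    counts = defaultdict(Counter)
    for condition, sample in pairs:
        counts[condition][sample] += 1
    return counts


# Hypothetical miniature "speeches" keyed by year (not real corpus text)
speeches = {
    "1789": "fellow citizens the future of our fellowship".split(),
    "1793": "my fellow citizens".split(),
}
targets = ["future", "fellow"]

# Same pairing as the NLTK example: (target prefix, year) for every matching word
counts = conditional_counts(
    (target, year)
    for year, words in speeches.items()
    for word in words
    for target in targets
    if word.lower().startswith(target)
)

for target in targets:
    print(target, dict(counts[target]))
```

Each condition gets its own counter, which is why the plot can draw one line per target word with the years along the horizontal axis.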

In [0]:
# Import the "reuters" data package
from nltk.corpus import reuters

# List the first 10 files within the "reuters" corpus
print(reuters.fileids()[:10])

# Fetch the words from a specific file
words14862 = reuters.words(['test/14862'])
print(words14862)

# Fetch the first 25 words from the file which is part of the corpus
words25 = reuters.words(['test/14862'])[:25]
print(words25)

# The documents in the corpus are classified into 90 topics (categories)
# Examine the categories within the Reuters corpus
reutersCategories = reuters.categories()
print(reutersCategories)

# Fetch the files associated with a specific category
print(reuters.fileids(categories='barley'))

# Fetch the words from the specific category
print(reuters.words(categories='barley')[:25])

# Fetch words from 2 categories
print(reuters.words(categories=['barley', 'carcass']))