Skip to content
This repository has been archived by the owner on May 20, 2019. It is now read-only.

Latest commit

 

History

History
165 lines (137 loc) · 8.91 KB

ner.org

File metadata and controls

165 lines (137 loc) · 8.91 KB

Social Network Analysis with NLTK

Introduction

  • What is NLP?
  • What is NER?
  • Why is it so hard?
    • Aliases

What is Natural Language Processing?

What is Named-Entity Recognition?

What is Social Network Analysis?

Why are these things so hard?

How can this problem be solved?

Preprocess

Strip unessential information.

First Pass

Create lexicon. pos-tag every sentence in lexicon, recording its use in the sentence and creating a dictionary. write out cfg for this text, using a standard cfg for branches and the dictionary for terminals.

Second Pass

Use CFG to parse and chunk each sentence in the text. Where no human subject is used (ie no John, he, they, etc.), discard sentence. Reduce sentence to basic meaning. Record relationship.

Postprocess

Determine aliases and combine accordingly. Construct graph. Output graph to text file.

Parsing the Text

From a File into Memory

In order to work with any text, it must first be loaded into memory so the computer can work with it. This is almost trivial in Python.

First, we need to determine the path to the text we wish to analyse. Because of its wealth of character diversity, we are going to use the text of Les Miserables. This text is stored with this file under a directory called texts/. So, let’s create a variable to store this information called text_filepath.

text_filepath = './texts/les-miserables.text'

Now we can use this file path to read in the raw text. Note that we do not need write access to the file—only read access. Since the text will almost certainly use some sort of accent characters, we’ll make sure Python uses the UTF-8 encoding.

text_file = open(text_filepath, 'rU')

Now that we have an open file to work with, we can read in the entire text of the file into a variable we will be working with extensively from here on out: text.

text = text_file.read()

We are now done with the file, so we will close it and move on.

text_file.close()

Into Chapters and Paragraphs

Hopefully, I can get this to work. It would be good to be able to parse it into a higher-level logical structure than a sentence.

Into Sentences

In the previous section, we took a file stored on disk and pulled it into memory—a fairly standard an unexciting procedure. Next, we are going to finally start to avail of a very important library for this research: the Natural Language Toolkit (NLTK). Using NLTK, we’re going to process text and turn it into a form that NLTK can natively work with. First, though, we must first turn text, which is at this point a mindless sequence of lifeless characters, into a collection (more properly a list) of sentences. Keep in mind that this is still low-level processing, and we aren’t doing anything too fancy yet. This could be done with only a few extra lines of Python, but since we are using this library anyway, why reinvent the wheel?

Speaking of using the library, we must import it before we can use it.

import nltk

Easy, right? Now, let’s parse text into sentences.

sentences = nltk.sent_tokenize(text)

Into Words and Into Position

Now, sentences contains text in a form that accurately represents the distinctions of sentences in the text. What we need to do now is to recognize the individual words of each sentence and determine their correct positions in the sentence. For example, the sentence

John gave Caitlyn a pretty flower.

has a distinct syntactical structure: John, the subject, gave, a transitive verb, Caitlyn, the indirect object, a flower, the direct object (with a determiner), and pretty, an adjective upon a flower. With any single sentence, NLTK can not only tokenize it into individual words (taking into account contractions, etc.), but also tag these words with their appropriate functions in the sentence using a set of keys described in its documentation. For example, the above sentence would be parsed as the tree structure below.

(S
  John/NNP
  gave/VBD
  Caitlyn/NNP
  a/DT
  pretty/RB
  flower/JJR
  ./.)

As English sentences can be arbitrarily complex (and absurd), NLTK may very well trip on more involved sentences, but that is separate research. We will unreasonally assume that NLTK is perfect to its specifications.

Let us now finally instruct Python to first tokenize each sentence into words and then tag each and every sentence in sentences using a list comprehension.

tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [ntlk.pos_tag(sentence) for sentence in tagged_sentences]

Determining Actors and Objects

Consider for a moment the (most likely) difference in syntactical complexity between a five-year-old speaker and a famed author of a classic. The five-year-old (say, Billy) will not have the same complex sentence structures as Victor Hugo, save unlikely genius. However, if you stick them both in front of a computer that’s been instructed to hold a faux-meaningful conversation with you, Billy is likely to have better luck in communicating. Notice that Billy is probably going to speak to the computer in such simple phrases as ”I hugged my cat,/” while Victor would be more likely to say something a little more dramatic, such as “/On the fourth day of June, 1254, Monty Python mustered up his scarce courage and strength to do battle with the great King Arthur of Britain.” At its most basic, English is (or is usually) constructed as Subject-/Verb/.

As such, Billy’s sentiment could be (slightly) reduced to “/I hug,/” while Victor’s could be drastically reduced to “Monty Python battle.” While this strips almost all prose and purpose from Victor’s masterfully crafted sentiment, it is far easier for a humble computer (not to mention myself) to understand. The aim of this research is to simplify the text to this point and quantify the social networks that become apparent through the text.

Before we get there, though, there are major hurdles that must be overcome. First of all, English sentences are hardly ever as simple as Billy’s. While they are also usually not as grandiose as Victor’s contribution, that level of complexity is much more typical. To boil down each sentence into something more basic, we’ll have to describe to NLTK what we are looking for using a grammar, specifically a context-free grammar. Succinctly, a context-free grammar (or CFG) describes a language as a set of rules starting from the sentence and decomposing itself into smaller and smaller parts until the individual words (terminals) are reached.

Fortunately, NLTK has rather intuitive support for these arguably complex things. Let us define, to the best of our ability, the English language using a CFG. (Note that this is technically not possible; see Higginbotham.)

More Than Your Grade-School Grammar

NLTK has pretty wonderful support for parts-of-speech tagging and CFG parsing, where the former is not too reliable. There are advantages and disadvantages to using each. Regular expressions are by default less powerful (since they define a smaller class of languages), but they can define tokens by part-of-speech, as in the regular expression <N.*>+<V.*><N.*>*. This expression would describe a sentence that begins with one or more nouns (proper or otherwise), followed by a verb in any tense, followed by zero or more nouns. I mentioned that this approach is by default less powerful than using a context-free grammar. Using regular expressions, it is impossible to recursively define a phrase such as a prepositional phrase, for example. A prepositional phrase is any phrase in the sentence which begins with a preposition, followed by a noun, and optionally followed by another prepositional phrase.

For example, consider the sentence, “Fetch the hose under the porch behind the house.” There is no way to describe this using a proper regular expression. However, context-free grammars can describe this scenario using a production rule for prepositional phrases:

Prepositional_Phrase -> <PP><DT>?<N.*><Prepositional_Phrase>

Recognizing Names

What’s in a Name?

discerning the difference between a noun and a name

Compiling a List

Determine Aliases

Determining Relationships

Strength

  • number of co-occurances

About the NLTK