# Intro to Jupyter Notebooks



# Basic Natural Language Processing in Python

We are going to do some simple natural language processing tasks in Python.

Steps:
   
1. Get some text
2. Prepare our text
3. Process our text
4. Revel in what we've done

## Step One: Get some text

We're going to select a text from Project Gutenberg: https://www.gutenberg.org/. Project Gutenberg texts are generally in the public domain in the United States and are already available in plain text (.txt files).

For this exercise, we're going to use Mark Twain's "Innocents Abroad," about his Grand Tour of Europe and the Mediterranean in 1867.

In the Gutenberg search box, search for "Mark Twain." In the list of results, select "Innocents Abroad."

_Take note of the url: https://www.gutenberg.org/ebooks/3176. We're going to need the number at the end later._

Click on the file called "Plain Text UTF-8." We use plain text files in computational text analysis so that the computer is not confused by formatting. UTF-8 can encode any Unicode character and can therefore handle text in most writing systems.

This will open up the text file in a browser window (Mac). You'll see it is not easy to read on the screen, which is why we have ePub and other formats for reading on a Kindle or other devices.
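As a small aside on the "UTF-8" label, here is a minimal sketch of what the encoding does. The sample string is just an illustration borrowed from the book's table of contents; the point is that the curly quotation marks fall outside plain ASCII but UTF-8 handles them.

```python
# A phrase from the book's table of contents, with curly quotation marks.
sample = "“Averaging” the Passengers"

# UTF-8 turns every character into one or more bytes.
encoded = sample.encode("utf-8")
print(len(sample), "characters ->", len(encoded), "bytes")

# Decoding with the same encoding gives back the original text.
decoded = encoded.decode("utf-8")
print(decoded == sample)

# Plain ASCII cannot represent the curly quotes at all.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("Fits in ASCII?", ascii_ok)
```

Each curly quote takes three bytes in UTF-8, which is why the byte count is larger than the character count.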

You'll see some preliminary information from Project Gutenberg at the beginning, and at the very end, you'll see the Project Gutenberg license agreement. In our next step, we'll learn how to remove that information so that only Mark Twain's words appear in our text.

But wait, how do I download the file onto my computer? You actually don't have to. We're going to let our program retrieve the file for us.

If you do want to download it, go back one page to the page with the different file formats (if you get lost, it's the URL above). On a Mac, right-click the file called "Plain Text UTF-8" and select "Save Link As ..." When prompted, select OK and remember where you saved the file.

Congratulations, you've now collected some data. Text = data.


In [12]:
import nltk  
from gutenberg.acquire import load_etext 
from gutenberg.cleanup import strip_headers

text = strip_headers(load_etext(3176)).strip()  # 3176 = Innocents Abroad
print(text)

INNOCENTS ABROAD

by Mark Twain


[From an 1869--1st Edition]



                                CONTENTS


                                CHAPTER I.
Popular Talk of the Excursion--Programme of the Trip--Duly Ticketed for
the Excursion--Defection of the Celebrities

                               CHAPTER II.
Grand Preparations--An Imposing Dignitary--The European Exodus
--Mr. Blucher's Opinion--Stateroom No. 10--The Assembling of the Clans
--At Sea at Last

                               CHAPTER III.
“Averaging” the Passengers--Far, far at Sea.--Tribulation among the
Patriarchs--Seeking Amusement under Difficulties--Five Captains in the
Ship

                               CHAPTER IV.
The Pilgrims Becoming Domesticated--Pilgrim Life at Sea
--“Horse-Billiards”--The “Synagogue”--The Writing School--Jack's “Journal”
 --The “Q. C. Club”--The Magic Lantern--State Ball on Deck--Mock Trials
--Charades--Pilgrim Solemnity--Slow Music--The Executive Officer Delivers
an Opinion

                

Let's review what just happened.

First, we imported the `nltk` (Natural Language Toolkit) Python package, which we will use in the steps below. You can read more about the Natural Language Toolkit here: http://www.nltk.org/. We also imported two functions from a separate package called `gutenberg`: `load_etext` retrieves a text from Project Gutenberg, and `strip_headers` removes the extraneous header and license information.

Notice how we used the number from the end of the Project Gutenberg URL. That number tells the program which book to retrieve, in this case "Innocents Abroad."
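To get a feel for what removing the header and license involves, here is a rough sketch. It is not how `strip_headers` is actually implemented; it simply assumes the file marks the book's body with lines beginning `*** START OF` and `*** END OF`, and the function name `strip_gutenberg_boilerplate` and the tiny sample file are made up for illustration.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file uses these marker lines. If a marker is missing,
    that side of the text is left untouched.
    """
    lines = raw_text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()


# A tiny made-up file, standing in for a real download.
raw = """The Project Gutenberg eBook of Example
*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***

INNOCENTS ABROAD

by Mark Twain

*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
License text goes here.
"""

body = strip_gutenberg_boilerplate(raw)
print(body)
```

Real files vary in how they word these markers, which is one reason to prefer a maintained library over a hand-rolled version like this.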

Rather than using the entire book, let's just use an excerpt to make things easier. In the next code block, I've reset the text variable to an excerpt, which includes an overview of Twain's itinerary.

In [13]:
# remember to wrap in triple quotes for multi-line text
text = """The undersigned will make an excursion as above during the coming
     season, and begs to submit to you the following programme:

       A first-class steamer, to be under his own command, and capable of
     accommodating at least one hundred and fifty cabin passengers, will
     be selected, in which will be taken a select company, numbering not
     more than   three-fourths of the ship's capacity.  There is good
     reason to believe that this company can be easily made up in this
     immediate vicinity, of mutual friends and acquaintances.

       The steamer will be provided with every necessary comfort,
     including library and musical instruments.

       An experienced physician will be on board.

       Leaving New York about June 1st, a middle and pleasant route will
     be taken across the Atlantic, and passing through the group of
     Azores, St. Michael will be reached in about ten days.  A day or two
     will be spent here, enjoying the fruit and wild scenery of these
     islands, and the voyage continued, and Gibraltar reached in three or
     four days.

       A day or two will be spent here in looking over the wonderful
     subterraneous fortifications, permission to visit these galleries
     being readily obtained.

       From Gibraltar, running along the coasts of Spain and France,
     Marseilles will be reached in three days.  Here ample time will be
     given not only to look over the city, which was founded six hundred
     years before the Christian era, and its artificial port, the finest
     of the kind in the Mediterranean, but to visit Paris during the
     Great Exhibition; and the beautiful city of Lyons, lying
     intermediate, from the heights of which, on a clear day, Mont Blanc
     and the Alps can be distinctly seen.  Passengers who may wish to
     extend the time at Paris can do so, and, passing down through
     Switzerland, rejoin the steamer at Genoa.

       From Marseilles to Genoa is a run of one night.  The excursionists
     will have an opportunity to look over this, the “magnificent city of
     palaces,” and visit the birthplace of Columbus, twelve miles off,
     over a beautiful road built by Napoleon I.  From this point,
     excursions may be made to Milan, Lakes Como and Maggiore, or to
     Milan, Verona (famous for its extraordinary fortifications), Padua,
     and Venice.  Or, if passengers desire to visit Parma (famous for
     Correggio's frescoes) and Bologna, they can by rail go on to
     Florence, and rejoin the steamer at Leghorn, thus spending about
     three weeks amid the cities most famous for art in Italy.

       From Genoa the run to Leghorn will be made along the coast in one
     night, and time appropriated to this point in which to visit
     Florence, its palaces and galleries; Pisa, its cathedral and
     “Leaning Tower,” and Lucca and its baths, and Roman amphitheater;
     Florence, the most remote, being distant by rail about sixty miles.

       From Leghorn to Naples (calling at Civita Vecchia to land any who
     may prefer to go to Rome from that point), the distance will be made
     in about thirty-six hours; the route will lay along the coast of
     Italy, close by Caprera, Elba, and Corsica.  Arrangements have been
     made to take on board at Leghorn a pilot for Caprera, and, if
     practicable, a call will be made there to visit the home of
     Garibaldi.

       Rome [by rail], Herculaneum, Pompeii, Vesuvius, Vergil's tomb, and
     possibly the ruins of Paestum can be visited, as well as the
     beautiful surroundings of Naples and its charming bay.

       The next point of interest will be Palermo, the most beautiful
     city of Sicily, which will be reached in one night from Naples.  A
     day will be spent here, and leaving in the evening, the course will
     be taken towards Athens.

       Skirting along the north coast of Sicily, passing through the
     group of Aeolian Isles, in sight of Stromboli and Vulcania, both
     active volcanoes, through the Straits of Messina, with “Scylla” on
     the one hand and “Charybdis” on the other, along the east coast of
     Sicily, and in sight of Mount Etna, along the south coast of Italy,
     the west and south coast of Greece, in sight of ancient Crete, up
     Athens Gulf, and into the Piraeus, Athens will be reached in two and
     a half or three days.  After tarrying here awhile, the Bay of
     Salamis will be crossed, and a day given to Corinth, whence the
     voyage will be continued to Constantinople, passing on the way
     through the Grecian Archipelago, the Dardanelles, the Sea of
     Marmora, and the mouth of the Golden Horn, and arriving in about
     forty-eight hours from Athens.

       After leaving Constantinople, the way will be taken out through
     the beautiful Bosphorus, across the Black Sea to Sebastopol and
     Balaklava, a run of about twenty-four hours.  Here it is proposed to
     remain two days, visiting the harbors, fortifications, and
     battlefields of the Crimea; thence back through the Bosphorus,
     touching at Constantinople to take in any who may have preferred to
     remain there; down through the Sea of Marmora and the Dardanelles,
     along the coasts of ancient Troy and Lydia in Asia, to Smyrna, which
     will be reached in two or two and a half days from Constantinople.
     A sufficient stay will be made here to give opportunity of visiting
     Ephesus, fifty miles distant by rail.

       From Smyrna towards the Holy Land the course will lay through the
     Grecian  Archipelago, close by the Isle of Patmos, along the coast
     of Asia, ancient Pamphylia, and the Isle of Cyprus.  Beirut will be
     reached in three days.  At Beirut time will be given to visit
     Damascus; after which the steamer will proceed to Joppa.

       From Joppa, Jerusalem, the River Jordan, the Sea of Tiberias,
     Nazareth, Bethany, Bethlehem, and other points of interest in the
     Holy Land can be visited, and here those who may have preferred to
     make the journey from Beirut through the country, passing through
     Damascus, Galilee, Capernaum, Samaria, and by the River Jordan and
     Sea of Tiberias, can rejoin the steamer.

       Leaving Joppa, the next point of interest to visit will be
     Alexandria, which will be reached in twenty-four hours.  The ruins
     of Caesar's Palace, Pompey's Pillar, Cleopatra's Needle, the
     Catacombs, and ruins of ancient Alexandria will be found worth the
     visit.  The journey to Cairo, one hundred and thirty miles by rail,
     can be made in a few hours, and from which can be visited the site
     of ancient Memphis, Joseph's Granaries, and the Pyramids.

       From Alexandria the route will be taken homeward, calling at
     Malta, Cagliari (in Sardinia), and Palma (in Majorca), all
     magnificent harbors, with charming scenery, and abounding in fruits.

       A day or two will be spent at each place, and leaving Parma in the
     evening, Valencia in Spain will be reached the next morning.  A few
     days will be spent in this, the finest city of Spain.

       From Valencia, the homeward course will be continued, skirting
     along the coast of Spain.  Alicant, Carthagena, Palos, and Malaga
     will be passed but a mile or two distant, and Gibraltar reached in
     about twenty-four hours.

       A stay of one day will be made here, and the voyage continued to
     Madeira, which will be reached in about three days.  Captain
     Marryatt writes: “I do not know a spot on the globe which so much
     astonishes and delights upon first arrival as Madeira.” A stay of
     one or two days will be made here, which, if time permits, may be
     extended, and passing on through the islands, and probably in sight
     of the Peak of Teneriffe, a southern track will be taken, and the
     Atlantic crossed within the latitudes of the northeast trade winds,
     where mild and pleasant weather, and a smooth sea, can always be
     expected.

       A call will be made at Bermuda, which lies directly in this route
     homeward, and will be reached in about ten days from Madeira, and
     after spending a short time with our friends the Bermudians, the
     final departure will be made for home, which will be reached in
     about three days.

       Already, applications have been received from parties in Europe
     wishing to join the Excursion there.

       The ship will at all times be a home, where the excursionists, if
     sick, will be surrounded by kind friends, and have all possible
     comfort and sympathy.

       Should contagious sickness exist in any of the ports named in the
     program, such ports will be passed, and others of interest
     substituted.

       The price of passage is fixed at $1,250, currency, for each adult
     passenger.  Choice of rooms and of seats at the tables apportioned
     in the order in which passages are engaged; and no passage
     considered engaged until ten percent of the passage money is
     deposited with the treasurer.

       Passengers can remain on board of the steamer, at all ports, if
     they desire, without additional expense, and all boating at the
     expense of the ship.

       All passages must be paid for when taken, in order that the most
     perfect arrangements be made for starting at the appointed time.

       Applications for passage must be approved by the committee before
     tickets are issued, and can be made to the undersigned.

       Articles of interest or curiosity, procured by the passengers
     during the voyage, may be brought home in the steamer free of
     charge.

       Five dollars per day, in gold, it is believed, will be a fair
     calculation to make for all traveling expenses onshore and at the
     various points where passengers may wish to leave the steamer for
     days at a time.

       The trip can be extended, and the route changed, by unanimous vote
     of the passengers.

      CHAS.  C.  DUNCAN,  117 WALL STREET, NEW YORK  R.  R.  G******,
     Treasurer

      Committee on Applications  J.  T.  H*****, ESQ.  R.  R.  G*****,
     ESQ.  C.  C.  Duncan

      Committee on Selecting Steamer  CAPT.  W.  W.  S* * * *, Surveyor
     for Board of Underwriters

       C.  W.  C******, Consulting Engineer for U.S.  and Canada  J.  T.
     H*****, Esq. C.  C.  DUNCAN

       P.S.--The very beautiful and substantial side-wheel steamship
     “Quaker City” has been chartered for the occasion, and will leave
     New York June 8th.  Letters have been issued by the government
     commending the party to courtesies abroad.
"""



In [14]:
nltk.download('punkt')
tokens = nltk.word_tokenize(text)

[nltk_data] Downloading package punkt to /Users/swanzsp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The variable `tokens` is a Python list, where each item in the list is a word or punctuation mark from our text. If we run the `print` command, we'll see what that looks like.

In [15]:
print(tokens)


['The', 'undersigned', 'will', 'make', 'an', 'excursion', 'as', 'above', 'during', 'the', 'coming', 'season', ',', 'and', 'begs', 'to', 'submit', 'to', 'you', 'the', 'following', 'programme', ':', 'A', 'first-class', 'steamer', ',', 'to', 'be', 'under', 'his', 'own', 'command', ',', 'and', 'capable', 'of', 'accommodating', 'at', 'least', 'one', 'hundred', 'and', 'fifty', 'cabin', 'passengers', ',', 'will', 'be', 'selected', ',', 'in', 'which', 'will', 'be', 'taken', 'a', 'select', 'company', ',', 'numbering', 'not', 'more', 'than', 'three-fourths', 'of', 'the', 'ship', "'s", 'capacity', '.', 'There', 'is', 'good', 'reason', 'to', 'believe', 'that', 'this', 'company', 'can', 'be', 'easily', 'made', 'up', 'in', 'this', 'immediate', 'vicinity', ',', 'of', 'mutual', 'friends', 'and', 'acquaintances', '.', 'The', 'steamer', 'will', 'be', 'provided', 'with', 'every', 'necessary', 'comfort', ',', 'including', 'library', 'and', 'musical', 'instruments', '.', 'An', 'experienced', 'physician', '
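Because `tokens` is an ordinary Python list, the usual list tools apply. The sketch below uses a short hand-typed sample (`sample_tokens`, a name introduced here) instead of the full list so that it runs on its own, and it shows one possible way to set the punctuation aside.

```python
# The first few tokens from the output, typed in by hand for illustration.
sample_tokens = ['The', 'undersigned', 'will', 'make', 'an', 'excursion',
                 'as', 'above', 'during', 'the', 'coming', 'season', ',',
                 'and', 'begs', 'to', 'submit', 'to', 'you', 'the',
                 'following', 'programme', ':', 'A', 'first-class',
                 'steamer', ',']

print(len(sample_tokens))      # how many tokens are in the sample
print(sample_tokens[:5])       # the first five tokens

# Keep tokens that contain at least one letter or digit, dropping ',' and ':'.
words_only = [t for t in sample_tokens if any(ch.isalnum() for ch in t)]
print(words_only)
```

Checking for "at least one letter or digit" rather than using `str.isalpha()` keeps hyphenated words such as "first-class" in the list.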

We can now have the computer identify the part of speech of each word ("POS tagging"). Ignore punctuation for now.


In [16]:
nltk.download('averaged_perceptron_tagger') # we'll have to install some additional modules from nltk
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/swanzsp/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('The', 'DT'), ('undersigned', 'JJ'), ('will', 'MD'), ('make', 'VB'), ('an', 'DT'), ('excursion', 'NN'), ('as', 'IN'), ('above', 'IN'), ('during', 'IN'), ('the', 'DT'), ('coming', 'VBG'), ('season', 'NN'), (',', ','), ('and', 'CC'), ('begs', 'VBZ'), ('to', 'TO'), ('submit', 'VB'), ('to', 'TO'), ('you', 'PRP'), ('the', 'DT'), ('following', 'VBG'), ('programme', 'NN'), (':', ':'), ('A', 'DT'), ('first-class', 'JJ'), ('steamer', 'NN'), (',', ','), ('to', 'TO'), ('be', 'VB'), ('under', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('command', 'NN'), (',', ','), ('and', 'CC'), ('capable', 'JJ'), ('of', 'IN'), ('accommodating', 'VBG'), ('at', 'IN'), ('least', 'JJS'), ('one', 'CD'), ('hundred', 'CD'), ('and', 'CC'), ('fifty', 'JJ'), ('cabin', 'NN'), ('passengers', 'NNS'), (',', ','), ('will', 'MD'), ('be', 'VB'), ('selected', 'VBN'), (',', ','), ('in', 'IN'), ('which', 'WDT'), ('will', 'MD'), ('be', 'VB'), ('taken', 'VBN'), ('a', 'DT'), ('select', 'JJ'), ('company', 'NN'), (',', ','), ('numbering',

Each two- or three-letter code corresponds to a part of speech. In addition to the common parts of speech (noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection), the tagger uses finer-grained tags that distinguish, for example, plural nouns, possessives, proper nouns, and verb tenses.

This model only works for English. If you are working in another language, you will need a tagger and tag set for that language. Non-English tag sets might add tags for grammatical gender or case.

We are using the Penn Treebank English tagset, and you can see what the tags mean here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

You can also quickly look up the meaning of a code in your program:



In [17]:
nltk.download('tagsets')
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


[nltk_data] Downloading package tagsets to /Users/swanzsp/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


This told us that the NNP tag refers to a proper noun. Sounds like it might be useful for named entity recognition tasks.
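Since the tagger returns a list of `(word, tag)` pairs, we can tally the tags or pull out the words carrying a particular tag with plain Python. This is a small sketch on a hand-typed sample (`sample_tagged` is a name introduced here), not on the full output.

```python
from collections import Counter

# A few (word, tag) pairs copied by hand from the tagger output.
sample_tagged = [('The', 'DT'), ('undersigned', 'JJ'), ('will', 'MD'),
                 ('make', 'VB'), ('an', 'DT'), ('excursion', 'NN'),
                 ('Leaving', 'VBG'), ('New', 'NNP'), ('York', 'NNP'),
                 ('about', 'IN'), ('June', 'NNP'), ('1st', 'CD'), (',', ',')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common(3))

# Which words were tagged as singular proper nouns?
proper_nouns = [word for word, tag in sample_tagged if tag == 'NNP']
print(proper_nouns)
```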

Some of you might be wondering how NLTK handles two-word names like "New York." (A pair of adjacent tokens like this is called a bigram.)

The fifth paragraph begins, "Leaving New York about June 1st, ..."

Here are the results of our tokenization and tagging for that phrase: "('Leaving', 'VBG'), ('New', 'NNP'), ('York', 'NNP'), ('about', 'IN'), ('June', 'NNP'), ('1st', 'CD'), (',', ',')"

Hmm, so "New" and "York" are separate tokens (words), but both are tagged as proper nouns. Is that just because they are both capitalized? Let's find out.
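One naive idea, sketched below, would be to glue together any run of adjacent tokens tagged NNP. This is only a baseline for comparison, not what NLTK's named entity recognizer does, and the helper name `join_proper_nouns` is made up for this example.

```python
from itertools import groupby

# The tagged phrase shown above.
phrase = [('Leaving', 'VBG'), ('New', 'NNP'), ('York', 'NNP'),
          ('about', 'IN'), ('June', 'NNP'), ('1st', 'CD'), (',', ',')]


def join_proper_nouns(tagged):
    """Join each run of consecutive NNP tokens into one candidate name."""
    names = []
    # groupby splits the list into runs of NNP and non-NNP tokens.
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1] == 'NNP'):
        if is_proper:
            names.append(' '.join(word for word, _ in group))
    return names


candidates = join_proper_nouns(phrase)
print(candidates)
```

The baseline does recover "New York" as one unit, but it also returns "June," which is a proper noun without being a person or a place. Tags alone cannot tell us what kind of thing a name refers to.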

Next we are going to identify named entities (such as persons, places, and organizations), a process known as "named entity recognition" (NER).

In [18]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
named_entities = nltk.chunk.ne_chunk(tagged_tokens)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/swanzsp/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/swanzsp/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [19]:
print(named_entities)

(S
  The/DT
  undersigned/JJ
  will/MD
  make/VB
  an/DT
  excursion/NN
  as/IN
  above/IN
  during/IN
  the/DT
  coming/VBG
  season/NN
  ,/,
  and/CC
  begs/VBZ
  to/TO
  submit/VB
  to/TO
  you/PRP
  the/DT
  following/VBG
  programme/NN
  :/:
  A/DT
  first-class/JJ
  steamer/NN
  ,/,
  to/TO
  be/VB
  under/IN
  his/PRP$
  own/JJ
  command/NN
  ,/,
  and/CC
  capable/JJ
  of/IN
  accommodating/VBG
  at/IN
  least/JJS
  one/CD
  hundred/CD
  and/CC
  fifty/JJ
  cabin/NN
  passengers/NNS
  ,/,
  will/MD
  be/VB
  selected/VBN
  ,/,
  in/IN
  which/WDT
  will/MD
  be/VB
  taken/VBN
  a/DT
  select/JJ
  company/NN
  ,/,
  numbering/VBG
  not/RB
  more/RBR
  than/IN
  three-fourths/NNS
  of/IN
  the/DT
  ship/NN
  's/POS
  capacity/NN
  ./.
  There/EX
  is/VBZ
  good/JJ
  reason/NN
  to/TO
  believe/VB
  that/IN
  this/DT
  company/NN
  can/MD
  be/VB
  easily/RB
  made/VBN
  up/RP
  in/IN
  this/DT
  immediate/JJ
  vicinity/NN
  ,/,
  of/IN
  mutual/JJ
  friends/NNS
  and/CC
  acquain

Digging through the results above, we see how it handled "New York":
    "Leaving/VBG
  (GPE New/NNP York/NNP)
  about/IN
  June/NNP
  1st/CD
  ,/,"

Looks to me like it identified New York as a single place and gave it the label "GPE," which stands for geo-political entity. It did so using a process called chunking, which groups neighboring tokens into phrases; we won't get into it here.

There are some errors too. "St. Michael" is tagged as "PERSON," but from the context it is clear that it refers to a place. And "Atlantic" is tagged as an "ORGANIZATION," when it clearly means the Atlantic Ocean. Perhaps the absence of the word "Ocean" fooled the tagger.

And as we skim through, should we have expected it to pick up "Cleopatra's Needle" or "Caesar's Palace"?

What about "River Jordan" and "Black Sea"?

It was a hassle to scroll through looking for tokens that were tagged as placenames. How can we do it computationally?

First we're going to figure out what type of object our `named_entities` variable is. Is it a list, a dictionary, or something else?

In [20]:
print(type(named_entities))

<class 'nltk.tree.Tree'>


Hmm, that's not a type I recognize. It looks like a special type defined by the nltk package. How do I extract just the placenames?

The code below traverses the `named_entities` tree and prints every subtree with the label "GPE."

In [21]:
for i in named_entities.subtrees(filter=lambda x: x.label() == 'GPE'):
     print(i)

(GPE New/NNP York/NNP)
(GPE Azores/NNP)
(GPE Gibraltar/NNP)
(GPE Spain/NNP)
(GPE France/NNP)
(GPE Christian/NNP)
(GPE Mediterranean/NNP)
(GPE Paris/NNP)
(GPE Lyons/NNP)
(GPE Paris/NNP)
(GPE Switzerland/NNP)
(GPE Genoa/NNP)
(GPE Columbus/NNP)
(GPE Maggiore/NNP)
(GPE Verona/NNP)
(GPE Venice/NNP)
(GPE Parma/NNP)
(GPE Florence/NNP)
(GPE Italy/NNP)
(GPE Florence/NNP)
(GPE Pisa/NNP)
(GPE Roman/NNP)
(GPE Florence/NNP)
(GPE Leghorn/NNP)
(GPE Naples/NNP)
(GPE Rome/NNP)
(GPE Italy/NNP)
(GPE Caprera/NNP)
(GPE Garibaldi/NNP)
(GPE Herculaneum/NNP)
(GPE Paestum/NNP)
(GPE Naples/NNP)
(GPE Palermo/NNP)
(GPE Sicily/NNP)
(GPE Naples/NNP)
(GPE Athens/NNP)
(GPE Sicily/NNP)
(GPE Aeolian/JJ Isles/NNP)
(GPE Vulcania/NNP)
(GPE Messina/NNP)
(GPE Sicily/NNP)
(GPE Italy/NNP)
(GPE Greece/NNP)
(GPE Athens/NNP)
(GPE Salamis/NNP)
(GPE Corinth/NNP)
(GPE Grecian/NNP Archipelago/NNP)
(GPE Dardanelles/NNP)
(GPE Athens/NNP)
(GPE Bosphorus/NNP)
(GPE Balaklava/NNP)
(GPE Crimea/NNP)
(GPE Bosphorus/NNP)
(GPE Dardanelles/NNP)

Whoa, that was fast. Imagine how long it would take to manually extract every place name in the entire book!

You'll see that it also picked up some other two-word placenames like "Grecian Archipelago" and "Aeolian Isles."

Hmm, but I also see some false positives -- anyone know where "Christian" is? I'm guessing that "Garibaldi" refers to the person. We'd have to dive into the text to be sure.

So it's not perfect, but even with a little cleanup to do, this certainly saved us a lot of time.
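The cleanup itself can also be partly scripted. The sketch below assumes the GPE names have been copied into a plain list of strings (`gpe_names`, a hypothetical variable holding a hand-picked subset of the output above) and that we have decided by reading the text which ones are false positives.

```python
from collections import Counter

# Hypothetical list: a hand-picked subset of the GPE names printed above.
gpe_names = ['New York', 'Paris', 'Christian', 'Paris', 'Florence',
             'Italy', 'Florence', 'Florence', 'Garibaldi', 'Naples']

# Names we judged to be false positives after reading the text.
false_positives = {'Christian', 'Garibaldi'}

# Drop the false positives, then count how often each place is mentioned.
places = [name for name in gpe_names if name not in false_positives]
place_counts = Counter(places)
print(place_counts.most_common(2))
```

Counting mentions gives a first rough sense of which places dominate the itinerary.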



In [22]:
for i in named_entities.subtrees(filter=lambda x: x.label() == 'LOCATION'):
     print(i)

(LOCATION Black/NNP Sea/NNP)


In [23]:
for i in named_entities.subtrees(filter=lambda x: x.label() == 'FACILITY'):
     print(i)

(FACILITY Applications/NNP J./NNP)


In [24]:
named_entities2 = nltk.chunk.ne_chunk(tagged_tokens, binary=True)
for i in named_entities2.subtrees(filter=lambda x: x.label() == 'NE'):
     print(i)

(NE New/NNP York/NNP)
(NE Atlantic/NNP)
(NE Azores/NNP)
(NE St./NNP Michael/NNP)
(NE Gibraltar/NNP)
(NE Gibraltar/NNP)
(NE Spain/NNP)
(NE France/NNP)
(NE Marseilles/NNP)
(NE Christian/NNP)
(NE Mediterranean/NNP)
(NE Paris/NNP)
(NE Great/NNP Exhibition/NNP)
(NE Lyons/NNP)
(NE Mont/NNP Blanc/NNP)
(NE Paris/NNP)
(NE Switzerland/NNP)
(NE Genoa/NNP)
(NE Marseilles/NNP)
(NE Genoa/NNP)
(NE Columbus/NNP)
(NE Milan/NNP)
(NE Lakes/NNP Como/NNP)
(NE Maggiore/NNP)
(NE Milan/NNP)
(NE Verona/NNP)
(NE Padua/NNP)
(NE Parma/NNP)
(NE Correggio/NNP)
(NE Bologna/NNP)
(NE Florence/NNP)
(NE Leghorn/NNP)
(NE Italy/NNP)
(NE Leghorn/NNP)
(NE Florence/NNP)
(NE Lucca/NNP)
(NE Roman/NNP)
(NE Leghorn/NNP)
(NE Naples/NNP)
(NE Civita/NNP Vecchia/NNP)
(NE Italy/NNP)
(NE Caprera/NNP)
(NE Elba/NNP)
(NE Corsica/NNP)
(NE Leghorn/NNP)
(NE Caprera/NNP)
(NE Garibaldi/NNP)
(NE Herculaneum/NNP)
(NE Pompeii/NNP)
(NE Vesuvius/NNP)
(NE Vergil/NNP)
(NE Paestum/NNP)
(NE Naples/NNP)
(NE Palermo/NNP)
(NE Sicily/NNP)
(NE Naples/NNP)


You'll see some names repeated. What if we wanted only unique names? 

Instead of printing out each named entity, we'll collect them in a set, which acts much like a list but is unordered and only keeps unique values.

In [25]:
ne = set()
for tree in named_entities2.subtrees(filter=lambda t: t.label() == 'NE'):
    ne.add(' '.join([child[0] for child in tree.leaves()]))
ne = list(ne)
print(type(ne))
print(ne)



<class 'list'>
['Cagliari', 'Florence', 'Elba', 'Lydia', 'Joppa', 'Malaga', 'Italy', 'Azores', 'Mediterranean', 'Galilee', 'Pillar', 'Capernaum', 'Mount Etna', 'Vergil', 'Aeolian Isles', 'Corinth', 'Teneriffe', 'Christian', 'Sardinia', 'Majorca', 'Canada J.', 'Paris', 'Bethany', 'Joseph', 'Athens', 'Sebastopol', 'Civita Vecchia', 'Cairo', 'Constantinople', 'Spain', 'Balaklava', 'Beirut', 'Athens Gulf', 'Parma', 'Gibraltar', 'Palma', 'Steamer CAPT', 'Correggio', 'Vulcania', 'Treasurer Committee', 'Nazareth', 'Naples', 'Pyramids', 'Duncan Committee', 'Garibaldi', 'Bosphorus', 'Greece', 'U.S.', 'Quaker City', 'Cleopatra', 'Columbus', 'Messina', 'DUNCAN', 'Lakes Como', 'Already', 'Smyrna', 'St. Michael', 'Valencia', 'River Jordan', 'Leghorn', 'Underwriters', 'Roman', 'Mont Blanc', 'Damascus', 'Sicily', 'Grecian Archipelago', 'Bermuda', 'Palermo', 'Piraeus', 'Vesuvius', 'Patmos', 'Salamis', 'Ephesus', 'NEW YORK', 'Corsica', 'Milan', 'Verona', 'France', 'Maggiore', 'Palos', 'Tiberias', 'ESQ'
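A set does not keep its items in any particular order, which is why the list above looks shuffled. If the order matters, there are a couple of common alternatives, sketched here on a made-up short list (`names`).

```python
# A short made-up list with repeats.
names = ['Paris', 'Genoa', 'Paris', 'Florence', 'Genoa', 'Naples']

# A set drops duplicates, but its order is not guaranteed.
unique_unordered = set(names)

# Sorting the set gives a reproducible, alphabetical list.
unique_sorted = sorted(unique_unordered)
print(unique_sorted)

# dict.fromkeys keeps only the first occurrence, in the original order.
unique_in_order = list(dict.fromkeys(names))
print(unique_in_order)
```

The first-seen order can be handy here because it roughly follows the itinerary in the text.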

Now let's export it to a CSV file so that we can put our places on a map.

In [26]:
import csv

# newline='' stops the csv module from writing blank rows on Windows
with open('places.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    for item in ne:
        writer.writerow([item])


Next, import the file into Google Maps and look for mistakes. Some notes:

1. NER didn't pick up Troy.
2. Smyrna: Greece or Turkey at the time?
3. Galilee: adding "Israel" helps, but Israel did not exist at the time.
4. Mont Blanc.
5. Add country names or other details.
6. Doesn't do well with terms like "Holy Land."
7. Pompey's Pillar: when the words are combined, Google went to Montana, not, as the context indicates, Alexandria, Egypt.
8. Cleopatra's Needle: when combined, Google went to London, where it is now. From context it was then in Alexandria, and indeed it was not moved to London until 1877. Changing the query to "Cleo Needle, Alexandria" puts it in New York City.
9. Leghorn = Livorno.
10. Teneriffe went to Australia, not the Canary Islands.
11. Sebastopol went to California, not Crimea.
12. Civitavecchia is usually spelled as one word; Twain spelled it as two.

And then annotate the place names with https://recogito.pelagios.org/!
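Several of these notes come down to giving the map tool a better search string than the bare name from the text. One possible approach, sketched below, is to export a second column with a hand-made modern equivalent. The lookup table `modern_names`, the column names, and both helper functions are all invented for this example, and the table would need to be filled in by hand for the real list.

```python
import csv
import io

# Hand-made lookup of modern search strings for a few historical spellings.
modern_names = {
    'Leghorn': 'Livorno, Italy',
    'Teneriffe': 'Tenerife, Canary Islands',
    'Sebastopol': 'Sevastopol, Crimea',
    'Civita Vecchia': 'Civitavecchia, Italy',
}


def build_rows(names):
    """Pair each extracted name with the string to send to the map tool."""
    return [(name, modern_names.get(name, name)) for name in sorted(names)]


def write_places(rows, csvfile):
    """Write a header row followed by one row per place."""
    writer = csv.writer(csvfile)
    writer.writerow(['name_in_text', 'search_query'])
    writer.writerows(rows)


rows = build_rows({'Leghorn', 'Paris', 'Teneriffe'})

# Written to an in-memory buffer here; to save a file instead, pass
# open('places.csv', 'w', newline='') in place of the buffer.
buffer = io.StringIO()
write_places(rows, buffer)
print(buffer.getvalue())
```

Keeping the original spelling in its own column means the link back to Twain's text is not lost when the query string is modernized.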

In [27]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that Sugar is not good for your lifestyle."

In [28]:
 # compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]


In [29]:
print(doc_complete)

['Sugar is bad to consume. My sister likes to have sugar, but not my father.', 'My father spends a lot of time driving my sister around to dance practice.', 'Doctors suggest that driving may cause increased stress and blood pressure.', 'Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.', 'Health experts say that Sugar is not good for your lifestyle.']


In [30]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]  

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/swanzsp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/swanzsp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [31]:
print(doc_clean)


[['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'], ['father', 'spends', 'lot', 'time', 'driving', 'sister', 'around', 'dance', 'practice'], ['doctor', 'suggest', 'driving', 'may', 'cause', 'increased', 'stress', 'blood', 'pressure'], ['sometimes', 'feel', 'pressure', 'perform', 'well', 'school', 'father', 'never', 'seems', 'drive', 'sister', 'better'], ['health', 'expert', 'say', 'sugar', 'good', 'lifestyle']]


### Preparing Document-Term Matrix

All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA (Latent Dirichlet Allocation) model looks for repeating term patterns across the entire document-term (DT) matrix. Python provides many libraries for text mining; `gensim` is one such library, designed to handle large text collections efficiently. The following code shows how to convert a corpus into a document-term matrix.

In [33]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [43]:
print(dictionary)

Dictionary(35 unique tokens: ['bad', 'consume', 'father', 'like', 'sister']...)


In [42]:

print(doc_term_matrix)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2)], [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(8, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(2, 1), (4, 1), (18, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
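To see what those pairs mean, here is a pure-Python sketch of the same idea: give every term an integer id, then describe each document as a list of `(term id, count)` pairs. The functions `build_vocabulary` and `doc_to_bow` are made up for this illustration, and the way ids are handed out is an assumption, so they are not guaranteed to match gensim's ids in general.

```python
from collections import Counter

# The first two cleaned documents from above.
docs = [['sugar', 'bad', 'consume', 'sister', 'like', 'sugar', 'father'],
        ['father', 'spends', 'lot', 'time', 'driving', 'sister',
         'around', 'dance', 'practice']]


def build_vocabulary(documents):
    """Assign each new term the next free integer id.

    Assumption for this sketch: within a document, new terms are
    numbered in alphabetical order.
    """
    vocab = {}
    for doc in documents:
        for term in sorted(set(doc)):
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab


def doc_to_bow(doc, vocab):
    """Turn one document into sorted (term id, count) pairs."""
    counts = Counter(doc)
    return sorted((vocab[term], n) for term, n in counts.items() if term in vocab)


vocab = build_vocabulary(docs)
bows = [doc_to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

In the first document, the pair for "sugar" has a count of 2 because the word appears twice; every other term appears once.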


### Running LDA Model

The next step is to create an LDA model object and train it on the document-term matrix. Training takes a few parameters: `num_topics` is the number of topics to look for, `id2word` is the dictionary that maps term ids back to words, and `passes` is the number of passes over the corpus during training. The gensim module supports both estimating an LDA model from a training corpus and inferring the topic distribution of new, unseen documents.

In [38]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=50)

In [40]:
print(ldamodel.print_topics(num_topics=2, num_words=2))

[(2, '0.076*"pressure" + 0.042*"better"'), (3, '0.029*"father" + 0.029*"sister"')]


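Each tuple pairs a topic id with that topic's top terms and their weights. In an earlier run, Topic1 could be termed Bad Health and Topic3 Family; of those, only topic 3 (father, sister) appears in the output above. Note that a model trained with `num_topics=2` numbers its topics 0 and 1, so the ids 2 and 3 shown here appear to be left over from a run with more topics. LDA also starts from a random initialization, so your own topics and weights will vary from run to run.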
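The topic descriptions come back as strings, which are awkward to work with. As a rough sketch, a regular expression can split each string into `(term, weight)` pairs. The helper `parse_topic` is made up for this example and assumes the `weight*"term"` format shown in the output above.

```python
import re


def parse_topic(topic_string):
    """Split '0.076*"pressure" + 0.042*"better"' into (term, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(term, float(weight)) for weight, term in pairs]


# The output printed above, copied in by hand.
topics = [(2, '0.076*"pressure" + 0.042*"better"'),
          (3, '0.029*"father" + 0.029*"sister"')]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}
for topic_id, terms in parsed.items():
    print(topic_id, terms)
```

With the weights as numbers, it becomes easy to sort terms, compare topics, or export them to a spreadsheet.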