# 1.   Computing with Language: Texts and Words

### 1.2   Getting Started with NLTK

Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection:

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

*Downloading the NLTK Book Collection:* browse the available packages using nltk.download(). The Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line labeled book to obtain all data required for the examples and exercises in this book. It consists of about 30 compressed files requiring about 100 MB of disk space. The full collection of data (i.e., everything in the downloader) is nearly ten times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds). Here's the command again, together with the output that you will see. Take care to get spelling and punctuation right:

In [2]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Any time we want to find out about these texts, we just have to enter their names at the Python prompt. For example, entering text1 displays <Text: Moby Dick by Herman Melville 1851>.

### 1.3   Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word *monstrous* in *Moby Dick* by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

In [3]:
text1.concordance("monstrous")

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
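
To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name concordance_lines, the width parameter, and the short sample token list are assumptions chosen for illustration.

```python
def concordance_lines(tokens, word, width=3):
    """Return (left, match, right) context triples for each occurrence of word.

    Matching ignores case; `width` is the number of tokens of context
    kept on each side (a hypothetical parameter for this sketch).
    """
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# A tiny hand-made sample, already split into tokens
sample = "one was of a most monstrous size . Of the Monstrous Pictures of Whales".split()

for left, tok, right in concordance_lines(sample, "monstrous"):
    # Right-align the left context so the matched word lines up in a column
    print(f"{left:>20} {tok} {right}")
```

Aligning the matched word in a column, as the loop does, is what makes recurring neighbours easy to spot by eye.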


Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

A concordance permits us to see words in context. For example, we saw that *monstrous* occurred in contexts such as the ___ pictures and a ___ size. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

In [4]:
text1.similar("monstrous")

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless


In [5]:
text2.similar("monstrous")

very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly


Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, *monstrous* has positive connotations, and sometimes functions as an intensifier like the word *very*.
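
One way to picture what "a similar range of contexts" means is to record the word immediately before and after each token, then rank other words by how many of those (left, right) pairs they share with the target. The sketch below does this in plain Python. It is a simplification under stated assumptions, not the method NLTK uses internally, and the name similar_words, the top_n parameter, and the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, top_n=5):
    """Rank words by the number of (left, right) contexts shared with `word`."""
    tokens = [t.lower() for t in tokens]
    target_word = word.lower()

    # Map each word to the set of (previous word, next word) pairs around it
    contexts = defaultdict(set)
    for left, mid, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[mid].add((left, right))

    target = contexts.get(target_word, set())
    scores = Counter()
    for other, ctxs in contexts.items():
        shared = len(ctxs & target)
        if other != target_word and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(top_n)]


sample = ("a monstrous size and a vast size but "
          "the monstrous pictures and the curious pictures").split()
print(similar_words(sample, "monstrous"))
```

Because the ranking depends entirely on the contexts found in one text, running the same procedure on a different text can return a very different list, which is consistent with the Melville and Austen results above.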

In [13]:
# Select the tokens in text1 that are longer than six characters
[w for w in text1 if len(w) > 6]

In [44]:
dist = FreqDist(text7)

In [45]:
dist

FreqDist({'Pierre': 1,
          'Vinken': 2,
          ',': 4885,
          '61': 5,
          'years': 115,
          'old': 24,
          'will': 281,
          'join': 4,
          'the': 4045,
          'board': 30,
          'as': 385,
          'a': 1878,
          'nonexecutive': 5,
          'director': 32,
          'Nov.': 24,
          '29': 5,
          '.': 3828,
          'Mr.': 375,
          'is': 671,
          'chairman': 45,
          'of': 2319,
          'Elsevier': 1,
          'N.V.': 3,
          'Dutch': 3,
          'publishing': 13,
          'group': 43,
          'Rudolph': 3,
          'Agnew': 1,
          '55': 10,
          'and': 1511,
          'former': 19,
          'Consolidated': 2,
          'Gold': 2,
          'Fields': 2,
          'PLC': 13,
          'was': 367,
          'named': 22,
          '*-1': 1123,
          'this': 184,
          'British': 11,
          'industrial': 18,
          'conglomerate': 3,
          'A': 110,
          

In [36]:
vocab1 = dist.keys()

In [56]:
lst = dist.keys()
list(lst)[:5]

['Pierre', 'Vinken', ',', '61', 'years']

In [46]:
import operator
sorted_x = sorted(dist.items(), key=operator.itemgetter(1))

In [47]:
sorted_x

[('Pierre', 1),
 ('Elsevier', 1),
 ('Agnew', 1),
 ('fiber', 1),
 ('resilient', 1),
 ('lungs', 1),
 ('symptoms', 1),
 ('Loews', 1),
 ('Micronite', 1),
 ('spokewoman', 1),
 ('properties', 1),
 ('Dana-Farber', 1),
 ('filter', 1),
 ('1953', 1),
 ('1955', 1),
 ('Four', 1),
 ('diagnosed', 1),
 ('malignant', 1),
 ('mesothelioma', 1),
 ('asbestosis', 1),
 ('morbidity', 1),
 ('Groton', 1),
 ('stringently', 1),
 ('smooth', 1),
 ('needle-like', 1),
 ('classified', 1),
 ('amphobiles', 1),
 ('Brooke', 1),
 ('pathlogy', 1),
 ('Vermont', 1),
 ('curly', 1),
 ('Environmental', 1),
 ('Protection', 1),
 ('gradual', 1),
 ('1997', 1),
 ('cancer-causing', 1),
 ('outlawed', 1),
 ('160', 1),
 ('Areas', 1),
 ('dusty', 1),
 ('burlap', 1),
 ('sacks', 1),
 ('bin', 1),
 ('poured', 1),
 ('cotton', 1),
 ('acetate', 1),
 ('mechanically', 1),
 ('clouds', 1),
 ('dust', 1),
 ('hung', 1),
 ('ventilated', 1),
 ('Darrell', 1),
 ('Yields', 1),
 ('tracked', 1),
 ('IBC', 1),
 ('fraction', 1),
 ('Compound', 1),
 ('reinvestment

In [48]:
sorted_x = sorted(dist.items(), key=operator.itemgetter(1), reverse=True)

In [49]:
sorted_x

[(',', 4885),
 ('the', 4045),
 ('.', 3828),
 ('of', 2319),
 ('to', 2164),
 ('a', 1878),
 ('in', 1572),
 ('and', 1511),
 ('*-1', 1123),
 ('0', 1099),
 ('*', 965),
 ("'s", 864),
 ('for', 817),
 ('that', 807),
 ('*T*-1', 806),
 ('*U*', 744),
 ('$', 718),
 ('The', 717),
 ('``', 702),
 ("''", 684),
 ('is', 671),
 ('said', 628),
 ('on', 490),
 ('it', 476),
 ('%', 446),
 ('by', 429),
 ('at', 402),
 ('with', 387),
 ('from', 386),
 ('as', 385),
 ('million', 383),
 ('Mr.', 375),
 ('*-2', 372),
 ('are', 369),
 ('was', 367),
 ('be', 356),
 ('*T*-2', 345),
 ('has', 339),
 ('its', 332),
 ("n't", 325),
 ('have', 323),
 ('an', 316),
 ('or', 291),
 ('will', 281),
 ('company', 260),
 ('--', 230),
 ('he', 230),
 ('which', 225),
 ('U.S.', 221),
 ('year', 212),
 ('they', 210),
 ('says', 210),
 ('would', 209),
 ('about', 206),
 ('more', 198),
 ('were', 197),
 ('In', 197),
 ('this', 184),
 ('their', 181),
 ('than', 180),
 ('market', 176),
 (';', 171),
 ('New', 165),
 ('had', 165),
 ('who', 163),
 ('new', 162
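
The sorting steps above can also be sketched with the standard library's collections.Counter, which offers a most_common method for the same descending-by-count ordering. The token list here is a made-up sample rather than text7, so the example runs without any NLTK data.

```python
from collections import Counter
from operator import itemgetter

# A small made-up token list standing in for a real corpus
tokens = "the board said the company will join the board".split()
counts = Counter(tokens)

# Explicit sort by count, highest first (same approach as the cells above)
by_sort = sorted(counts.items(), key=itemgetter(1), reverse=True)

# Counter's built-in equivalent
by_method = counts.most_common()

# Only the two most frequent tokens
top_two = counts.most_common(2)
print(top_two)
```

Passing a number to most_common avoids printing the whole vocabulary when only the top few entries are of interest.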

In [64]:
a = "List listed lists listing listings"

In [65]:
a = a.lower().split(' ')
a

['list', 'listed', 'lists', 'listing', 'listings']

In [66]:
porter = nltk.PorterStemmer()
[porter.stem(i) for i in a]

['list', 'list', 'list', 'list', 'list']
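
To give a rough feel for why all five forms collapse to the same stem, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies a much larger set of ordered rules with conditions on the remaining stem; the SUFFIXES tuple, the naive_stem name, and the min_stem guard are assumptions made for this sketch.

```python
# Suffixes are tried in order, longest first, so "ings" wins over "s"
SUFFIXES = ("ings", "ing", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


words = "List listed lists listing listings".lower().split()
stems = [naive_stem(w) for w in words]
print(stems)
```

The min_stem guard keeps short words such as "sing" from being reduced to a single letter, a small taste of the kind of condition a real stemmer has to check.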

In [71]:
import nltk
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
