# Natural Language Processing with Python

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

## Terminology

### Corpus
A corpus is a large collection of texts: a body of written or spoken material on which a linguistic analysis is based. The plural of corpus is corpora. Popular corpora include the British National Corpus (BNC), the COBUILD/Birmingham Corpus, and the IBM/Lancaster Spoken English Corpus. Monolingual corpora represent a single language, while bilingual corpora represent two.

A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving either all the occurrences of a particular word or structure for inspection or a randomly selected sample. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
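
Retrieving every occurrence of a word for inspection can be sketched in plain Python. The toy corpus and the `concordance` helper below are made up for illustration; they are not part of NLTK or of any real corpus, and a real corpus would be far larger.

```python
import re

# A toy "corpus" invented for this example.
corpus = (
    "The cat sat on the mat. A cat is a small animal. "
    "Dogs chase the cat when they can."
)

def concordance(text, word, width=2):
    """Return each occurrence of `word` with up to `width` words of context per side."""
    # Lowercase the text and keep only runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [token] + right))
    return hits

lines = concordance(corpus, "cat")
for line in lines:
    print(line)
```

Each printed line shows one occurrence of "cat" with its neighbouring words, which is the kind of view a linguist would scan when studying how a word is used.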

### Tokens

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called **tokens**, perhaps at the same time throwing away certain characters, such as punctuation. 

What is a token? There is a technical definition in NLP, but we can think of tokens as **data** that represent meaningful units of text:

- Words
- Phrases
- Punctuation
- Numbers
- bi-grams
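
As a rough sketch of the idea, a regular expression can split a sentence into word and punctuation tokens, and adjacent tokens can then be paired into bi-grams. The pattern below is an assumption chosen for this example; it is much simpler than the tokenizers NLTK provides and treats contractions differently.

```python
import re

sentence = "NLP is fun, isn't it?"

# Match either a word (allowing internal apostrophes) or a single punctuation mark.
tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

# Bi-grams: every pair of adjacent tokens.
bigrams = list(zip(tokens, tokens[1:]))

print(tokens)
print(bigrams[:3])
```

Whether punctuation is kept, dropped, or split further is a design choice that depends on the task.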

### Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".
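
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any published stemmer; the suffix list, the `naive_stem` name, and the minimum stem length are assumptions picked for this example.

```python
# Suffixes are tried in order, longer ones first, so "ing" is checked before "s".
SUFFIXES = ("ing", "ed", "er", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for word in ["cats", "fishing", "fished", "fisher", "stems", "stemmed"]:
    print(word, "->", naive_stem(word))
```

The last word comes out as "stemm" rather than "stem", which hints at why real stemmers need extra rules, for example for doubled consonants.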

### Stop Words

Some extremely common words appear to be of little value in extracting useful information from documents, so they are sometimes excluded from the vocabulary entirely. These words are called stop words. Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.

![](http://nlp.stanford.edu/IR-book/html/htmledition/img95.png)
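
Filtering stop words amounts to a membership test against a list. The sketch below uses a tiny hand-picked set, which is an assumption for illustration only; real stop lists are much longer and differ between tools.

```python
# A tiny, hand-picked stop list invented for this example.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "in", "and", "that"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "The idea of creating machines that think is old".split()
content_words = remove_stop_words(tokens)
print(content_words)
```

Using a set rather than a list keeps each lookup fast, which matters once the text grows to thousands of tokens.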

## NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.

Language processing tasks and corresponding NLTK modules, with examples of functionality:

| Language processing task | NLTK modules | Functionality |
| --- | --- | --- |
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expression, n-gram, named entity |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format |

NLTK was designed with four primary goals in mind:
    
**Simplicity**
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data

**Consistency**
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names

**Extensibility**
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task

**Modularity**
To provide components that can be used independently without needing to understand the rest of the toolkit

### NLTK Installation and use

NLTK ships with the [Anaconda](https://docs.continuum.io/anaconda/pkg-docs) distribution, so if you use Anaconda you already have it. Otherwise, it can be installed with `pip install nltk`.

In [1]:

    import nltk

In [2]:

    # Browse the available packages
    nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:

    # Read the whole sample article into one string
    with open('sample_article.txt', 'r') as text_file:
        sample_text = text_file.read()

In [4]:

    print(sample_text)

Computer scientists have defined artificial intelligence in many different ways, but at its core, AI involves machines that think the way humans think. Of course, it's very difficult to determine whether or not a machine is "thinking," so on a practical level, creating artificial intelligence involves creating a computer system that is good at doing the kinds of things humans are good at.

The idea of creating machines that are as smart as humans goes all the way back to the ancient Greeks, who had myths about automatons created by the gods. In practical terms, however, the idea didn't really take off until 1950.

In that year, Alan Turing published a groundbreaking paper called "Computing Machinery and Intelligence" that posed the question of whether machines can think. He proposed the famous Turing test, which says, essentially, that a computer can be said to be intelligent if a human judge can't tell whether he is interacting with a human or a machine.

The phrase artificial intelli

In [6]:

    from nltk.tokenize import word_tokenize, sent_tokenize

In [17]:

    # Print every word token, separated by " , "
    print(' , '.join(word_tokenize(sample_text)))

Computer , scientists , have , defined , artificial , intelligence , in , many , different , ways , , , but , at , its , core , , , AI , involves , machines , that , think , the , way , humans , think , . , Of , course , , , it , 's , very , difficult , to , determine , whether , or , not , a , machine , is , `` , thinking , , , '' , so , on , a , practical , level , , , creating , artificial , intelligence , involves , creating , a , computer , system , that , is , good , at , doing , the , kinds , of , things , humans , are , good , at , . , The , idea , of , creating , machines , that , are , as , smart , as , humans , goes , all , the , way , back , to , the , ancient , Greeks , , , who , had , myths , about , automatons , created , by , the , gods , . , In , practical , terms , , , however , , , the , idea , did , n't , really , take , off , until , 1950 , . , In , that , year , , , Alan , Turing , published , a , groundbreaking , paper , called , `` , Computing , Machinery , and 

In [8]:

    # Print each sentence followed by a blank line
    for x in sent_tokenize(sample_text):
        print(x + '\n')

Computer scientists have defined artificial intelligence in many different ways, but at its core, AI involves machines that think the way humans think.

Of course, it's very difficult to determine whether or not a machine is "thinking," so on a practical level, creating artificial intelligence involves creating a computer system that is good at doing the kinds of things humans are good at.

The idea of creating machines that are as smart as humans goes all the way back to the ancient Greeks, who had myths about automatons created by the gods.

In practical terms, however, the idea didn't really take off until 1950.

In that year, Alan Turing published a groundbreaking paper called "Computing Machinery and Intelligence" that posed the question of whether machines can think.

He proposed the famous Turing test, which says, essentially, that a computer can be said to be intelligent if a human judge can't tell whether he is interacting with a human or a machine.

The phrase artificial inte

### Stop Words

In [9]:

    from nltk.corpus import stopwords

In [10]:

    # Print NLTK's English stop word list
    print(' , '.join(stopwords.words('english')))

i , me , my , myself , we , our , ours , ourselves , you , your , yours , yourself , yourselves , he , him , his , himself , she , her , hers , herself , it , its , itself , they , them , their , theirs , themselves , what , which , who , whom , this , that , these , those , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , a , an , the , and , but , if , or , because , as , until , while , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , then , once , here , there , when , where , why , how , all , any , both , each , few , more , most , other , some , such , no , nor , not , only , own , same , so , than , too , very , s , t , can , will , just , don , should , now , d , ll , m , o , re , ve , y , ain , aren , couldn , didn , doesn , hadn , hasn , haven , isn , ma , mightn , mustn , needn 

In [17]:

    # Print NLTK's Spanish stop word list
    print(' , '.join(stopwords.words('spanish')))

de , la , que , el , en , y , a , los , del , se , las , por , un , para , con , no , una , su , al , lo , como , más , pero , sus , le , ya , o , este , sí , porque , esta , entre , cuando , muy , sin , sobre , también , me , hasta , hay , donde , quien , desde , todo , nos , durante , todos , uno , les , ni , contra , otros , ese , eso , ante , ellos , e , esto , mí , antes , algunos , qué , unos , yo , otro , otras , otra , él , tanto , esa , estos , mucho , quienes , nada , muchos , cual , poco , ella , estar , estas , algunas , algo , nosotros , mi , mis , tú , te , ti , tu , tus , ellas , nosotras , vosostros , vosostras , os , mío , mía , míos , mías , tuyo , tuya , tuyos , tuyas , suyo , suya , suyos , suyas , nuestro , nuestra , nuestros , nuestras , vuestro , vuestra , vuestros , vuestras , esos , esas , estoy , estás , está , estamos , estáis , están , esté , estés , estemos , estéis , estén , estaré , estarás , estará , estaremos , estaréis , estarán , estaría , estarías , 

### Stemming

In [13]:

    from nltk.stem import PorterStemmer

In [14]:

    stemmer = PorterStemmer()

In [15]:

    program_words = ['program', 'programming', 'programmer', 'programed', 'programs']
    for word in program_words:
        print(stemmer.stem(word))

Output:

    program
    program
    programm
    program
    program


In [16]:

    # Stem the first 70 tokens of the sample text
    my_words = word_tokenize(sample_text)[:70]
    print(' '.join(stemmer.stem(word) for word in my_words))

Comput scientist have defin artifici intellig in mani differ way , but at it core , AI involv machin that think the way human think . Of cours , it 's veri difficult to determin whether or not a machin is `` think , '' so on a practic level , creat artifici intellig involv creat a comput system that is good at do the kind of thing human are
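
One common follow-up is to count how often each stem occurs, so that inflected forms of a word are tallied together. The sketch below assumes NLTK is installed; it uses `RegexpTokenizer` and `FreqDist`, which are NLTK classes not shown elsewhere in this notebook, and a made-up sentence instead of the sample article so that it needs no downloaded data.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# A made-up sentence, used here instead of sample_article.txt.
text = "Machines that think. Thinking machines think like humans think."

# Keep alphabetic runs only, which drops the punctuation.
tokenizer = RegexpTokenizer(r"[A-Za-z]+")
stemmer = PorterStemmer()

# Lowercase each token, then reduce it to its stem.
stems = [stemmer.stem(tok.lower()) for tok in tokenizer.tokenize(text)]

# FreqDist counts how many times each stem appears.
stem_counts = FreqDist(stems)
print(stem_counts.most_common(2))
```

Because "think" and "thinking" share a stem, they are counted as the same item, which is often what a frequency analysis is after.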


## Resources

A popular introductory book on NLP with Python:

#### Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
**Steven Bird, Ewan Klein, and Edward Loper**

http://www.nltk.org/book/