<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**`nltk` Package**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## The Basics

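The `nltk` project home page is https://www.nltk.org/. The following commands install the package and download the `popular` collection of corpora and models: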

    !conda install nltk
    import nltk
    nltk.download('popular')

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import nltk

In [None]:
# nltk.download('popular')

In [None]:
# nltk.  # -> press tab

### Tokenization

In [None]:
s = '''This is a multiline
string object in Python
for tokenization. This is
a second sentence in the string.'''

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
word_tokenize(s)

In [None]:
s.split()
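The two cells above contrast `word_tokenize` with plain whitespace splitting. The following sketch illustrates the difference using only the standard library. The regular expression is a hypothetical stand-in for a punctuation-aware tokenizer and is not the algorithm `nltk` uses; it only shows why `s.split()` leaves punctuation attached to words.

```python
import re

s = '''This is a multiline
string object in Python
for tokenization. This is
a second sentence in the string.'''

# Whitespace splitting keeps punctuation glued to the neighboring word.
ws_tokens = s.split()

# Assumed toy tokenizer: runs of word characters become tokens, and every
# other non-space character (punctuation) becomes a token of its own.
re_tokens = re.findall(r"\w+|[^\w\s]", s)

print(ws_tokens[9:11])   # ['tokenization.', 'This']
print(re_tokens[9:12])   # ['tokenization', '.', 'This']
print(len(ws_tokens), len(re_tokens))
```

The whitespace version yields fewer tokens because each period stays attached to the word before it.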

In [None]:
import requests

In [None]:
url = 'https://hilpisch.com/walden.txt'

In [None]:
text = requests.get(url).text

In [None]:
tokens = word_tokenize(text)

In [None]:
tokens[:10]

In [None]:
tokens_ = [w.lower() for w in tokens]

In [None]:
tokens_[100:110]

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
sents = sent_tokenize(s)

In [None]:
sents

In [None]:
sents = sent_tokenize(text)

In [None]:
sents[150:155]
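To see why a dedicated sentence tokenizer is useful, consider a naive alternative that splits after every period, question mark, or exclamation mark. This is a minimal sketch under that assumption, not how `sent_tokenize` works internally; the second example string is made up for illustration.

```python
import re

s = '''This is a multiline
string object in Python
for tokenization. This is
a second sentence in the string.'''

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (toy rule)."""
    return re.split(r'(?<=[.!?])\s+', text)

naive = naive_sent_split(s)
print(len(naive))

# The toy rule breaks on abbreviations, which end in a period as well.
broken = naive_sent_split('Dr. Hilpisch wrote this. It works.')
print(broken)
```

The simple rule handles the sample string but wrongly splits after `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.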

### Stopwords

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords

In [None]:
stop_words = stopwords.words('english')
stop_words[:10]

In [None]:
len(stop_words)

In [None]:
len(tokens_)

In [None]:
%time tokens__ = [w for w in tokens_ if w not in stop_words]

In [None]:
len(tokens__)

In [None]:
# share of tokens that remain after removing stopwords
len(tokens__) / len(tokens_)
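The stopword filter above tests membership in a list, which scans the list for every token. A possible speed-up is to convert the stopwords to a `set` first. The sketch below uses small made-up stand-ins (`stop_list`, `sample`) for `stop_words` and `tokens_`, so the timings are only indicative.

```python
import timeit

# Hypothetical stand-ins for stop_words and tokens_ from the cells above.
stop_list = ['i', 'me', 'my', 'the', 'a', 'is', 'in', 'of', 'and', 'to']
sample = ['the', 'pond', 'is', 'deep', 'and', 'clear'] * 1000

# A set offers constant-time membership tests on average.
stop_set = set(stop_list)

filtered_list = [w for w in sample if w not in stop_list]
filtered_set = [w for w in sample if w not in stop_set]

# Both variants keep exactly the same tokens in the same order.
print(filtered_list == filtered_set)

t_list = timeit.timeit(
    lambda: [w for w in sample if w not in stop_list], number=5)
t_set = timeit.timeit(
    lambda: [w for w in sample if w not in stop_set], number=5)
print(f'list: {t_list:.4f}s | set: {t_set:.4f}s')
```

With a stopword list of realistic size, the gap between the two timings would be expected to grow, since the list scan gets longer while the set lookup does not.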

### Stemming

In [None]:
words = ['play', 'playing', 'played',
         'working', 'worked', 'work']

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
stems = [stemmer.stem(w) for w in words]

In [None]:
stems
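To give an intuition for what a stemmer does, here is a deliberately simplistic suffix stripper applied to the same `words` list. The function `toy_stem` is a hypothetical helper for illustration only. It is not the Porter algorithm, which applies a much larger, ordered set of rewrite rules.

```python
words = ['play', 'playing', 'played',
         'working', 'worked', 'work']

def toy_stem(word, suffixes=('ing', 'ed')):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Only strip if at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

toy_stems = [toy_stem(w) for w in words]
print(toy_stems)

# The toy rule has no notion of doubled consonants.
print(toy_stem('running'))
```

All six words collapse to `play` or `work`, but `running` becomes `runn`, which shows why practical stemmers need additional rules.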

In [None]:
# keep only tokens longer than three characters (also drops punctuation)
tokens__ = [w for w in tokens__ if len(w) > 3]

In [None]:
len(tokens__)

In [None]:
tokens__[:10]

In [None]:
stems = [stemmer.stem(w) for w in tokens__]

In [None]:
stems[49:60]

## Part-of-Speech Tagging

In [None]:
print(s)

In [None]:
from nltk import pos_tag

In [None]:
tokens = word_tokenize(s)
tokens[:7]

In [None]:
pos_tag(tokens)

## Chunking

In [None]:
tokens = word_tokenize(s)

In [None]:
tokens[:5]

In [None]:
tags = pos_tag(tokens)

In [None]:
tags[:5]

In [None]:
from nltk.chunk import ne_chunk

In [None]:
chunks = ne_chunk(tags)

In [None]:
# !pip install svgling

In [None]:
chunks.pprint()

## Named Entity Recognition

In [None]:
s = '''Python is a programming language.
Guido van Rossum was the inventor of it.
Bank of America, a US-based bank, is using
it across different departments. London
is a financial capital.'''

In [None]:
tokens = word_tokenize(s)
tokens[:5]

In [None]:
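# keep only tokens longer than three characters;
# note that this also drops name parts such as "of" and "van"
tokens = [w for w in tokens if len(w) > 3]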

In [None]:
tags = pos_tag(tokens)
tags[:5]

In [None]:
chunks = ne_chunk(tags)

In [None]:
chunks.pprint()

## Appendix: Tags

The tags have the following meaning:

    CC: coordinating conjunction
    CD: cardinal number
    DT: determiner
    EX: existential there
    FW: foreign word
    IN: preposition or subordinating conjunction
    JJ: adjective
    JJR: adjective, comparative
    JJS: adjective, superlative
    LS: list item marker
    MD: modal
    NN: noun, singular or mass
    NNS: noun, plural
    NNP: proper noun, singular
    NNPS: proper noun, plural
    PDT: predeterminer
    POS: possessive ending
    PRP: personal pronoun
    PRP$: possessive pronoun
    RB: adverb
    RBR: adverb, comparative
    RBS: adverb, superlative
    RP: particle
    SYM: symbol
    TO: to
    UH: interjection
    VB: verb, base form
    VBD: verb, past tense
    VBG: verb, gerund or present participle
    VBN: verb, past participle
    VBP: verb, non-3rd person singular present
    VBZ: verb, 3rd person singular present
    WDT: wh-determiner
    WP: wh-pronoun
    WP$: possessive wh-pronoun
    WRB: wh-adverb
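The table above can be turned into a lookup that annotates tagged tokens with readable descriptions. The sketch below covers only a subset of the tags. The `tagged` list is hand-written in the `(token, tag)` shape that `pos_tag` returns and is not actual tagger output; `TAG_MEANINGS` and `describe` are hypothetical names.

```python
# Subset of the tag table above as a lookup dictionary.
TAG_MEANINGS = {
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'NNP': 'proper noun, singular',
    'VBZ': 'verb, 3rd person singular present',
}

# Hand-written (token, tag) pairs for illustration, not real tagger output.
tagged = [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
          ('programming', 'NN'), ('language', 'NN'), ('.', '.')]

def describe(tagged_tokens, meanings=TAG_MEANINGS):
    """Attach a description to every (token, tag) pair."""
    return [(token, tag, meanings.get(tag, 'unknown tag'))
            for token, tag in tagged_tokens]

for token, tag, meaning in describe(tagged):
    print(f'{token:12} {tag:4} {meaning}')
```

Tags missing from the dictionary, such as the punctuation tag `.`, fall back to `'unknown tag'` rather than raising an error.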
