# Exercise 03: Splitting sentences and PoS annotation

Let's start with a simple paragraph, copied from the course description:

In [1]:
text = """
Increasingly, customers send text to interact or leave comments, 
which provides a wealth of data for text mining. That’s a great 
starting point for developing custom search, content recommenders, 
and even AI applications.
"""
repr(text)

"'\\nIncreasingly, customers send text to interact or leave comments, \\nwhich provides a wealth of data for text mining. That’s a great \\nstarting point for developing custom search, content recommenders, \\nand even AI applications.\\n'"

Notice how there are explicit *line breaks* in the text. Let's write some code to flow the paragraph without any line breaks:

In [2]:
text = " ".join(line.strip() for line in text.splitlines()).strip()
repr(text)

"'Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining. That’s a great starting point for developing custom search, content recommenders, and even AI applications.'"

Now we can use [TextBlob](http://textblob.readthedocs.io/) to *split* the paragraph into sentences:

In [3]:
from textblob import TextBlob

for sent in TextBlob(text).sentences:
    print("> ", sent)


**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/whitehat/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.
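
The traceback means the Punkt sentence model that TextBlob loads through NLTK is not installed, so the cell printed no sentences. Running the `python -m textblob.download_corpora` command from the message should resolve that. In the meantime, here is a rough, standard-library-only sketch of what sentence splitting does. `naive_split_sentences` is a hypothetical helper, and its regular expression is a crude stand-in, not the Punkt algorithm: it would wrongly split after abbreviations such as "Dr. Smith".

```python
import re


def naive_split_sentences(paragraph):
    """Split on whitespace that follows ., ! or ? and precedes a capital letter.

    Hypothetical helper for illustration only; not what TextBlob does internally.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [p for p in parts if p]


text = (
    "Increasingly, customers send text to interact or leave comments, "
    "which provides a wealth of data for text mining. That’s a great "
    "starting point for developing custom search, content recommenders, "
    "and even AI applications."
)

sentences = naive_split_sentences(text)

for sent in sentences:
    print("> ", sent)
```

On this paragraph the sketch yields two sentences, split after "text mining."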


Next we take a sentence and *annotate* it with part-of-speech (PoS) tags:

In [4]:
import textblob_aptagger as tag

sent = "Increasingly, customers send text to interact or leave comments, which provides a wealth of data for text mining."

ts = tag.PerceptronTagger().tag(sent)
print(ts)

ImportError: No module named 'textblob_aptagger'
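
The `ImportError` shows that `textblob_aptagger` is not installed in this environment; it appears to be distributed separately from TextBlob. One way to keep the notebook running is to check for the module first and fall back to TextBlob's own `tags` property. This is only a sketch: `module_available` and `pos_tag` are hypothetical helper names, the fallback still needs the NLTK data downloaded, and the two taggers may not return identical output.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def pos_tag(sentence):
    """Tag with textblob_aptagger when present, else with TextBlob's default tagger."""
    if module_available("textblob_aptagger"):
        from textblob_aptagger import PerceptronTagger
        return PerceptronTagger().tag(sentence)
    # Fallback: TextBlob's built-in tagger (requires the downloaded corpora)
    from textblob import TextBlob
    return TextBlob(sentence).tags


# Report whether the optional tagger is installed, without importing it
print(module_available("textblob_aptagger"))
```

Once the required data is in place, `pos_tag(sent)` should return a list of `(word, tag)` pairs from whichever tagger is available.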

Given these part-of-speech tags, we can *lemmatize* nouns and verbs to get their dictionary (root) forms. This also singularizes plural nouns:

In [5]:
from textblob import Word

ts = [('InterAct', 'VB'), ('comments', 'NNS'), ('provides', 'VBZ'), ('mining', 'NN')]

for lex, pos in ts:
    w = Word(lex.lower())
    # First letter of a Penn Treebank noun/verb tag, lowercased, is the WordNet POS ("n" or "v")
    lemma = w.lemmatize(pos[0].lower())
    print(lex, pos, lemma)


**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/Users/whitehat/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.
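
The cell above stopped because the WordNet corpus is missing, but the tag handling is worth a closer look on its own. Taking `pos[0].lower()` works for the noun (`NN*`) and verb (`VB*`) tags used here, yet it would turn an adjective tag such as `JJ` into `"j"`, which is not a WordNet part of speech (WordNet uses `n`, `v`, `a`, and `r`). A more explicit mapping might look like the sketch below; `penn_to_wordnet` is a hypothetical helper written for this exercise, not a TextBlob function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith("NN"):
        return "n"  # nouns
    if tag.startswith("VB"):
        return "v"  # verbs
    if tag.startswith("JJ"):
        return "a"  # adjectives
    if tag.startswith("RB"):
        return "r"  # adverbs
    return None


ts = [('InterAct', 'VB'), ('comments', 'NNS'), ('provides', 'VBZ'), ('mining', 'NN')]

# Pair each lowercased token with the WordNet POS letter for its tag
mapped = [(lex.lower(), penn_to_wordnet(pos)) for lex, pos in ts]
print(mapped)
# → [('interact', 'v'), ('comments', 'n'), ('provides', 'v'), ('mining', 'n')]
```

The resulting letter could then be passed to `Word.lemmatize` in place of `pos[0].lower()`, skipping tokens for which the helper returns `None`.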


We can also look up synonyms and definitions for each word, using *synsets* from [WordNet](https://wordnet.princeton.edu/):

In [6]:
from textblob import Word

w = Word("comments")

# Pair each WordNet synset with its definition
for synset, definition in zip(w.get_synsets(), w.define()):
    print(synset, definition)

LookupError: 
**********************************************************************
  Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/Users/whitehat/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************