---

_You are currently looking at **version 1.2** of this notebook._

---

# LHS 712 - Basics of NLTK

We will go through basic NLP techniques using NLTK.
Please visit https://www.nltk.org/ for more information.


1. Tokenization - document/sentence splitting
2. Normalization
3. Part-of-speech tagging
4. Parsing

## 1. Tokenization - document/sentence splitting
Text can be split into smaller units called `tokens` (e.g., a document into sentences, a sentence into words). 
For this demo, we explore three simple delimiters: spaces, newlines, and full stops.

### Splitting text by space

In [1]:
document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds'''

print(document+'\n')
print(document.split())

Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds

['Pt.', 'has', 'chronic', 'difficulties', 'with', 'concentration.', 'Pt.', 'has', 'taken', 'adderall', '40.00', 'mg', 'daily.', 'No', 'h/o', 'other', 'meds']


### Splitting text by newline

In [3]:
document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds'''

print(document+'\n')
print(document.split('\n'))

Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds

['Pt. has chronic difficulties with concentration. ', 'Pt. has taken adderall 40.00 mg daily.', 'No h/o other meds']


### Splitting text by full stop

In [4]:
document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds'''

print(document+'\n')
print(document.split('.'))

Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds

['Pt', ' has chronic difficulties with concentration', ' \nPt', ' has taken adderall 40', '00 mg daily', '\nNo h/o other meds']
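
Splitting on every `.` breaks the abbreviation `Pt.` and the decimal `40.00`. As a minimal sketch of one way around this, we can treat a full stop as a sentence boundary only when whitespace follows it and the word it ends is not a known abbreviation. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names made up for this note, and the rule is tuned to this one example rather than to clinical text in general.

```python
import re

document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No h/o other meds'''

# Hypothetical, deliberately tiny abbreviation list (an assumption for this demo).
ABBREVIATIONS = {'Pt.'}

def split_sentences(text):
    """Split after a full stop followed by whitespace, skipping abbreviations."""
    sentences = []
    start = 0
    # The period in "40.00" is followed by a digit, so it never matches here.
    for match in re.finditer(r'\.\s+', text):
        candidate = text[start:match.start() + 1]
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # "Pt." ends an abbreviation, not a sentence
        sentences.append(candidate.strip())
        start = match.end()
    # Keep any trailing text that does not end with a full stop.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sentences = split_sentences(document)
print(sentences)
```

A hand-written rule like this needs a new entry for every abbreviation, which is the main reason to prefer a trained sentence tokenizer on real notes.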


### Counting words/tokens

We can count the total number of words in the text, or the number of unique words (the vocabulary size).

In [5]:
document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No. h/o other meds'''

print(document+'\n')
print('Number of words: {}'.format(len(document.split())))
print('Vocabulary size: {}'.format(len(set(document.split()))))

Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No. h/o other meds

Number of words: 17
Vocabulary size: 15
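
To see which tokens account for the gap between the two numbers, a small sketch with `collections.Counter` from the standard library can tally each token. It reuses the same `document` string; the variable names `counts` and `repeated` are chosen only for this illustration.

```python
from collections import Counter

document = '''Pt. has chronic difficulties with concentration. 
Pt. has taken adderall 40.00 mg daily.
No. h/o other meds'''

# Tally how often each whitespace-separated token occurs.
counts = Counter(document.split())

# Tokens seen more than once explain why the vocabulary is smaller
# than the total word count.
repeated = {token: n for token, n in counts.items() if n > 1}

print('Number of words: {}'.format(sum(counts.values())))
print('Vocabulary size: {}'.format(len(counts)))
print('Repeated tokens: {}'.format(repeated))
```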


### NLTK's tokenizer

In [7]:
import nltk

In [8]:
import nltk

sentence = "Children shouldn't drink a sugary drink before bed."
print(sentence+'\n')

print("split text by space")
print(sentence.split())

print("\nNLTK's built-in tokenizer")
print(nltk.word_tokenize(sentence)) 

Children shouldn't drink a sugary drink before bed.

split text by space
['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

NLTK's built-in tokenizer


LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\Wu/nltk_data'
    - 'D:\\anaconda\\nltk_data'
    - 'D:\\anaconda\\share\\nltk_data'
    - 'D:\\anaconda\\lib\\nltk_data'
    - 'C:\\Users\\Wu\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


## 2. Normalization

### Stemming 

In [9]:
import nltk

words = ['list', 'lists', 'listed', 'listing', 'listings']

porter = nltk.PorterStemmer() 
words_stemmed = [porter.stem(t) for t in words]
print('Before stemming: {}'.format(words))
print('After stemming: {}'.format(words_stemmed)) 

Before stemming: ['list', 'lists', 'listed', 'listing', 'listings']
After stemming: ['list', 'list', 'list', 'list', 'list']
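
Stemming strips suffixes by rule and does not check that the result is a real word. The short sketch below, with a word list picked only for illustration, suggests how related but distinct words can collapse onto one truncated stem.

```python
import nltk

porter = nltk.PorterStemmer()

# Three words with different meanings that share a prefix.
related = ['universe', 'university', 'universal']
stems = [porter.stem(w) for w in related]

print('Before stemming: {}'.format(related))
print('After stemming: {}'.format(stems))
```

All three words map to the same stem, which is not itself an English word. This is worth keeping in mind when stems are shown to readers or used to tell terms apart.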


### Lemmatization

In [10]:
import nltk

words = ['list', 'lists', 'listed', 'listing', 'listings']

porter = nltk.PorterStemmer()
WNlemma = nltk.WordNetLemmatizer()
sent_stem = [porter.stem(t) for t in words]
sent_lemm = [WNlemma.lemmatize(t) for t in words]

# Print each original word next to its lemma and its stem
print('\nOriginal Lemmatizing Stemming')
print('-'*30)
for word, lemma, stem in zip(words, sent_lemm, sent_stem):
    print('{} {} {}'.format(word, lemma, stem))



LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\Wu/nltk_data'
    - 'D:\\anaconda\\nltk_data'
    - 'D:\\anaconda\\share\\nltk_data'
    - 'D:\\anaconda\\lib\\nltk_data'
    - 'C:\\Users\\Wu\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


## 3. Part-of-speech tagging 

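In [None]:
import nltk
nltk.download('punkt')  # models used by nltk.word_tokenize in the next cell
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag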

In [None]:
import nltk

sent = "I saw the man with a telescope"

sent_tk = nltk.word_tokenize(sent)
print(sent_tk)
print('================')
sent_pos = nltk.pos_tag(sent_tk)

for element in sent_pos:
    print(element)

## 4. Parsing

In [11]:
import nltk

grammar = nltk.data.load('mygrammar.cfg')
parser = nltk.ChartParser(grammar)

sent = "I saw the man with a telescope"
sent_tk = nltk.word_tokenize(sent)

trees = parser.parse_all(sent_tk)

# Print every parse tree found for the sentence
for tree in trees:
    print(sent + '\n')
    print(tree)
    print('\n')

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\Wu/nltk_data'
    - 'D:\\anaconda\\nltk_data'
    - 'D:\\anaconda\\share\\nltk_data'
    - 'D:\\anaconda\\lib\\nltk_data'
    - 'C:\\Users\\Wu\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
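
The cell above fails at `nltk.word_tokenize`, and the contents of `mygrammar.cfg` are not shown in this notebook. As a self-contained sketch, we can define a small grammar inline with `nltk.CFG.fromstring` and split the sentence on spaces, which avoids the `punkt` dependency. The grammar below is our own assumption, written only to cover this one sentence; it is not the contents of `mygrammar.cfg`.

```python
import nltk

# Toy grammar (an assumption for this demo, not the course's mygrammar.cfg).
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
NP -> Det N | NP PP | 'I'
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")
parser = nltk.ChartParser(grammar)

sent = "I saw the man with a telescope"
tokens = sent.split()  # no punctuation here, so a plain split is enough

# parse() yields every tree the grammar allows for the token sequence.
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)
    print()
```

With this grammar the parser returns two trees: one attaches `with a telescope` to the verb phrase (the seeing was done with a telescope), the other attaches it to `the man` (the man has the telescope).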
