# Text Analytics with NLTK

### What is NLTK?
The Natural Language Toolkit (NLTK) is a Python package for natural language processing.
NLTK comes with many corpora, toy grammars, trained models, etc. https://www.nltk.org/data.html

Let's start by importing NLTK and downloading the data packages used in this notebook. See https://www.nltk.org/data.html

In [None]:
import nltk, re, json, io  # install first with: pip install nltk (https://pypi.org/project/nltk/)
from collections import Counter
import pandas as pd

In [None]:
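nltk.download('punkt')      # Punkt sentence tokenizer models
nltk.download('punkt_tab')  # newer NLTK releases look for this package instead of 'punkt'
nltk.download('stopwords')  # stopword lists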

## Tokenization
Read documentation: https://www.nltk.org/api/nltk.tokenize.html

* sentence tokenization
* word tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

# tokenize text into words
words = word_tokenize(text)
words

**sentence tokenizer**

https://www.nltk.org/_modules/nltk/tokenize/punkt.html

Punkt Sentence Tokenizer

The NLTK data package includes a pre-trained Punkt tokenizer for English. 
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm.

In [None]:
# split text into sentences
sents = sent_tokenize(text)
sents

`sent_tokenize` handles many tricky cases, such as abbreviations and sentences that start with a lowercase word. See the examples below.

In [None]:
text2 = "Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.  And sometimes sentences can start with non-capitalized words.  i is a good variable name."

sents2 = sent_tokenize(text2)
sents2

In [None]:
# the leading "... " prompts copied from the NLTK docs are removed so they do not end up in the text
text3 = '''
(How does it deal with this parenthesis?)  "It should be part of the
previous sentence." "(And the same with this one.)" ('And this one!')
"('(And (this)) '?)" [(and this. )]
'''


sents3 = sent_tokenize( text3 )
sents3

#### WhitespaceTokenizer

```WhitespaceTokenizer```
Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead.
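
As a quick check of that advice, the sketch below uses only the built-in `str.split()`. Called with no arguments, it splits on any run of spaces, tabs, or newlines. The variable names `sample` and `tokens_split` and the sample text are made up for this illustration.

```python
# Hypothetical sample string containing spaces, a tab and newlines
sample = "Good muffins cost $3.88\nin New York.\tPlease buy me\ntwo of them."

# str.split() with no arguments splits on any run of whitespace
tokens_split = sample.split()
print(tokens_split)
```

Note that punctuation stays attached to the neighbouring word (`'York.'`, `'them.'`), which is why whitespace splitting alone is rarely enough for text analysis.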

In [None]:
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
print(s)

print('\nWhitespaceTokenizer:')
WhitespaceTokenizer().tokenize(s)

#### WordPunctTokenizer
```WordPunctTokenizer``` tokenizes a text into a sequence of alphabetic and non-alphabetic characters, using the regexp ```\w+|[^\w\s]+```
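
To see what that regular expression does on its own, here is a minimal sketch that applies the same pattern with Python's `re` module. It does not call NLTK at all; the names `sample` and `tokens_re` are invented for the example.

```python
import re

# Hypothetical sample text
sample = "Good muffins cost $3.88\nin New York."

# \w+ matches runs of word characters (letters, digits, underscore);
# [^\w\s]+ matches runs of characters that are neither word characters nor whitespace
tokens_re = re.findall(r"\w+|[^\w\s]+", sample)
print(tokens_re)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```

The price `$3.88` is broken into four tokens, which shows how aggressively this pattern separates punctuation from words and digits.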

In [None]:
print('WordPunctTokenizer:')
WordPunctTokenizer().tokenize(s)

**<span class="mark">TODO</span>**

Try to tokenize tweets with the `TweetTokenizer`
https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize

`TweetTokenizer`: a tokenizer specifically suited for tweets.

You can start with any sample tweets.

Here is one example: "@remy This is waaaaayyyy too much for you!!!!!!"

In [None]:
# Your code below



In [None]:
# comparing with whitespace tokenizer
# (assumes `tweet` is the sample tweet string you defined in the cell above)
WhitespaceTokenizer().tokenize(tweet)

**<span class="mark">TODO</span>**

Compare how two different tokenizers handle the same text.

 * Test with the same tweet that you picked as your sample data previously
 * Test with 2 tokenizers from NLTK: ```word_tokenize``` and ```casual_tokenize```

In [None]:
# Your code below



### Comparing tokenizers. 
Refer to this great resource: https://towardsdatascience.com/an-introduction-to-tweettokenizer-for-processing-tweets-9879389f8fe7

Instead of inspecting the output of each tokenizer separately, we can put all the results side by side in one `pd.DataFrame` to compare them quickly. How would you do it?

**<span class="mark">TODO for later</span>**

## Stopwords

In [None]:
from nltk.corpus import stopwords #nltk.download('stopwords')

In [None]:
stopeng = set(stopwords.words('english'))
stopeng

In [None]:
# remove stopwords from text

#text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""
text = "NASA Mars Rover Is Bringing 10.9 Million Names to the Red Planet"

# first tokenize text into words
tokens = word_tokenize( text.lower() )
print(tokens)
tokens_nostop = [w for w in tokens if w not in stopeng]

print('\n', tokens_nostop)

## Stemming

http://www.nltk.org/howto/stem.html

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
tokens_porter = [(w, ps.stem(w)) for w in tokens_nostop if w != ps.stem(w)]
tokens_porter

# Text cleaning

In [None]:
## Download an example tweets file  
## https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv

import urllib.request

print('Beginning file download with urllib.request...')

# url at which the file is in direct downloadable format
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv'

urllib.request.urlretrieve(url, './noisy_twitter.csv')

<span class="mark">**What is `urllib.request.urlretrieve` doing?**</span>

`urlretrieve` downloads the file at the given URL and saves it under the local file name passed as the second argument (here `./noisy_twitter.csv`).
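
The sketch below illustrates the same call without needing a network connection. As an assumption made purely for the demonstration, a small local file addressed through a `file://` URL stands in for the remote CSV; the names `workdir`, `source`, and `copy.csv` are hypothetical.

```python
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical stand-in for a remote file: a tiny CSV written to a temp folder
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("id,text\n1,hello\n")

# urlretrieve(url, filename) copies the resource at `url` to `filename`
# and returns the local file name together with the response headers
saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(workdir / "copy.csv"))

print(saved_path)
print(Path(saved_path).read_text())
```

With an `https://` URL, as in the cell above, the call works the same way: the first argument is where to fetch from and the second is where to save.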

### Reading the csv as pandas dataframe

In [None]:
# import pandas (install it first with `pip install pandas` if needed)
import pandas as pd

#read the file into dataframe. "header=0" means the first row will be considered as a header
frame = pd.read_csv("./noisy_twitter.csv", header=0, dtype={'id':str,"created_at":str,'text':str})

#print first 5 rows
frame.head()

In [None]:
#print last 5 rows
frame.tail()

In [None]:
# create a list of all values in the 'text' column

textlist = frame['text'].tolist() # here, the order of rows is preserved
textlist

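### Function for calculating the 20 most frequent items in a list

The function below relies on `most_common` from `collections.Counter`. NLTK's `FreqDist` is a subclass of `Counter` and offers the same method: https://tedboy.github.io/nlps/generated/generated/nltk.FreqDist.most_common.html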
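
Here is a minimal sketch of how `most_common` behaves on a `collections.Counter`. The word list and the names `words_demo`, `counts`, and `top2` are invented for the illustration.

```python
from collections import Counter

# Hypothetical list of words
words_demo = "to be or not to be that is the question to".split()

# Counter maps each distinct item to the number of times it occurs
counts = Counter(words_demo)

# most_common(n) returns the n most frequent (item, count) pairs, highest count first
top2 = counts.most_common(2)
print(top2)  # [('to', 3), ('be', 2)]
```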

In [None]:
from collections import Counter

def top20(thislist):
    # First make one big string out of the entire list
    bigstr = " ".join(thislist)
    # split() with no argument splits on any whitespace and drops empty strings
    wordlist = bigstr.split()
    wordcount = Counter(wordlist)
    return wordcount.most_common(20)

In [None]:
print(top20(textlist))

### What are the most frequent mentions?

Steps:
* Extract mentions using regular expressions
* Count the most common
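
Before running the extraction on the full dataset, it can help to try the pattern on one string. This sketch uses the same regular expression as the next cell; the sample sentence and the names `sample_tweet` and `found` are made up. The lookbehind `(?<![@\w])` rejects an `@` that directly follows a word character or another `@`, so email addresses are not picked up as mentions.

```python
import re

# Hypothetical sample text with two mentions and one email address
sample_tweet = "thanks @foxnews and @cnn! mail me at bob@example.com"

# Capture up to 25 word characters after an @ that does not follow a word character or @
found = re.findall(r"(?<![@\w])@(\w{1,25})", sample_tweet)
print(found)  # ['foxnews', 'cnn']
```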

In [None]:
import re
def extractmentions(row):
    row = row.lower()
    # raw string so that \w is passed to the regex engine unchanged
    result = re.findall(r"(?<![@\w])@(\w{1,25})", row)
    return result

all_mentions = []

for t in textlist:
    # extend() appends every mention found in this tweet (nothing happens for an empty list)
    all_mentions.extend(extractmentions(t))
        
print(top20(all_mentions))

### What are the most frequent hashtags?

<span class="mark">**TODO**</span>

write a similar function to extract hashtags

print top 20 most frequent hashtags

In [None]:
# Your code below




### Top 20 words without text cleaning

In [None]:
# what are the top words from the entire text corpus
top20(textlist)

#### Let's do some text cleaning

In [None]:
def textcleaner(row):
    row = row.lower()
    #remove urls
    row  = re.sub(r'http\S+', '', row)
    #remove mentions
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row)
    #remove hashtags
    row = re.sub(r"(?<![#\w])#(\w{1,25})", '',row)
    #remove other special characters
    row = re.sub('[^A-Za-z .-]+', '', row)
    #remove digits (the previous step already drops them; kept as an explicit safeguard)
    row = re.sub(r'\d+', '', row)
    row = row.strip(" ")
    return row

cleaned_textlist = []

for t in textlist:
    cleaned_textlist.append(textcleaner(t))
    
top20(cleaned_textlist)

**TODO for later** There are still a few things that could be cleaned up, such as the stray `-` character that shows up as a token.
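
One possible extra step, offered as a sketch and not as the intended solution: drop every token that contains no letter at all, which removes a lone `-` or `.` left over after cleaning. The helper name `drop_symbol_tokens` and the sample sentence are hypothetical.

```python
def drop_symbol_tokens(row):
    """Keep only tokens that contain at least one letter (hypothetical helper)."""
    kept = [tok for tok in row.split() if any(ch.isalpha() for ch in tok)]
    return " ".join(kept)

# Hypothetical cleaned tweet that still contains stray punctuation tokens
print(drop_symbol_tokens("great poll numbers - thank you ."))
# great poll numbers thank you
```

Tokens such as `u.s.` or `well-known` survive because they contain letters; only tokens made up entirely of symbols are removed.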

# Stopwords

**<span class="mark">TODO</span>**:
1. Fetch the English stopwords
2. Write code to remove stopwords from the text that you are working with: `cleaned_textlist`

In [None]:
from nltk.corpus import stopwords