# Week 9 Problem 1

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says YOUR CODE HERE. Do not write your answer anywhere other than where it says YOUR CODE HERE. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

5. When you are ready to submit your assignment, go to Dashboard → Assignments and click the Submit button. Your work is not submitted until you click Submit.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. If your code does not pass the unit tests, it will not pass the autograder.

**NOTE:** Validation may take some time. Be patient!!

## Author: Apurv Garg
### Primary Reviewer: John Nguyen


# Due Date: 6 PM, March 26, 2018

In [1]:
# Display all plots inline
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt
import re
import numpy as np
import pandas as pd
import collections as cl
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import scipy.sparse as sp
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_frame_equal, assert_index_equal
from nose.tools import assert_false, assert_equal, assert_almost_equal, assert_true, assert_in, assert_is_not
from operator import itemgetter
import nltk

# We do this to ignore several specific warnings
import warnings
warnings.filterwarnings("ignore")

# Set default seaborn plotting style
sns.set(style="white")

# Dataset

We will analyze the twenty newsgroups data set. Each posting is structured much like an email. We have removed the headers, quotes, and footers, so we will analyze only the message body. Note that we perform the analysis on just one message in order to keep the computation feasible.

The cell below will create a subdirectory under home called `temp_data`. *If you want to delete the temp_data directory at any point, run this code in a new cell.*  
``` bash
! rm -rf /home/data_scientist/temp_data
```

In [2]:
! mkdir ~/temp_data
HOME = '/home/data_scientist/temp_data'

In [10]:
text = fetch_20newsgroups(data_home=HOME, remove=('quotes', 'headers', 'footers'))
messageID = 11
message = text['data'][messageID]
target = text['target'][messageID]
#print(f'Target Newsgroup: {text["target_names"][target]}')
#print(80*'-')
#print(message)

# Problem 1

For this problem, complete the function `string_tokenizer`, which will take 3 parameters: `pattern`, `msg`, and `one_letter`. <br>
- In this function, we will explicitly split the text into tokens and then create a `Counter` to accumulate the number of occurrences of each unique token. 
- Remember to convert the words to lowercase before tokenizing. Since we will just be looking for alphanumeric strings, use `re.sub(pattern, ' ', msg)` for removing punctuation tokens. <br>
- If the parameter `one_letter` is True, include one-letter words in the count; otherwise, exclude them.

**Example:** 5 most common values with `one_letter=False` are: `[('the', 37), ('that', 16), ('of', 14), ('to', 13), ('is', 12)]`<br>
5 most common values with `one_letter=True` are: `[('the', 37), ('that', 16), ('of', 14), ('to', 13), ('a', 12)]`

**HINT:** If you want, you can write a pattern (regex) that removes one-character words, or<br>
you can remove one-character words after tokenizing by keeping only the tokens with length > 1.
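
As a minimal sketch of the steps above, the cell below runs them on a made-up sentence. The `sample` string and all variable names are illustrative assumptions, not part of the assignment data. It shows punctuation removal, lowercasing, counting, and both ways of dropping one-letter words mentioned in the hint.

```python
import re
import collections as cl

# Hypothetical sample text, not taken from the newsgroup data
sample = "A cat saw a dog. The dog, I think, saw the cat!"

# Match anything that is neither a word character nor whitespace
pattern = re.compile(r'[^\w\s]')

# Lowercase, replace punctuation with spaces, split on whitespace
tokens = re.sub(pattern, ' ', sample.lower()).split()

# Count every token, including one-letter words
all_counts = cl.Counter(tokens)

# Option 1: keep only the tokens longer than one character
long_counts = cl.Counter(t for t in tokens if len(t) > 1)

# Option 2: blank out standalone one-character words with a regex, then count
cleaned = re.sub(r'\b\w\b', ' ', re.sub(pattern, ' ', sample.lower()))
regex_counts = cl.Counter(cleaned.split())

print(all_counts.most_common(3))
print(long_counts.most_common(3))
```

Both options should give the same counts here; filtering by length is usually the easier of the two to read.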

In [41]:
def string_tokenizer(pattern, msg, one_letter):
    '''           
    Parameters
    ----------
    pattern : Regular expression searching for punctuations
    msg : the message which is to be tokenized
    one_letter : A boolean value where True implies that you have to include 1-letter words 
                 and a False value implies that you have to remove the 1-letter words
    
    Returns
    -------
    A Counter object wc    
    '''    
    # YOUR CODE HERE
    # Lowercase the message, replace punctuation with spaces, and split into tokens
    words = re.sub(pattern, ' ', msg.lower()).split()

    # Drop one-letter words unless the caller asked to keep them
    if not one_letter:
        words = [word for word in words if len(word) > 1]

    wc = cl.Counter(words)
    return wc

In [42]:
pattern = re.compile(r'[^\w\s]')
wc1 = string_tokenizer(pattern, message, False)
wc2 = string_tokenizer(pattern, message, True)
assert_equal(isinstance(wc1, cl.Counter), True)
assert_equal(isinstance(wc2, cl.Counter), True)
assert_equal(len(wc1), 219)
assert_equal(len(wc2), 224)
assert_equal(wc1.most_common()[0], ('the', 37))
assert_equal(wc2.most_common()[5], ('is', 12))
assert_equal(wc1.most_common()[4], ('is', 12))

In [43]:
print(f"{'Term':12s}: {'Frequency'}")
print(25*'-')

# Compute term counts
t_wc1 = sum(wc1.values())

# Display counts and frequencies
for wt in wc1.most_common(8):
    print(f'{wt[0]:12s}: {wt[1]/t_wc1:4.3f}')

Term        : Frequency
-------------------------
the         : 0.081
that        : 0.035
of          : 0.031
to          : 0.029
is          : 0.026
and         : 0.022
in          : 0.022
this        : 0.020


# Problem 2

For this problem, complete the function `vectorize`, which will take 3 parameters: `rm_stop`, `data`, and `message`. <br>
-  Inside the function, create a CountVectorizer object with hyper-parameters `stop_words='english', analyzer='word', lowercase=True` if `rm_stop` is True; if `rm_stop` is False, create it with hyper-parameters `analyzer='word', lowercase=True`. <br>
-  Fit the CountVectorizer on the data and transform the message into a Document Term Matrix (dtm). <br>
-  Find the non-zero elements of the Document Term Matrix and create a list of tuples `(i, j, count)`, where `[i, j]` is the position in the Document Term Matrix and `count` is the value stored there.<br>
-  Sort this list by word count in descending order (maximum comes 1st) and keep the first **10** tuples.<br>
-  Finally, return the CountVectorizer object and the sorted list of tuples.

**Example:** Your sample list should look like :<br>
[(0, 88532, 37),(0, 88519, 16),(0, 67670, 14),(0, 89360, 13),(0, 51136, 12),
(0, 18521, 10),(0, 49447, 10),(0, 60078, 9),(0, 88767, 9),(0, 69918, 8)]
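
To make the tuple format concrete, here is a small sketch on a toy corpus. The `docs` list, the test sentence, and the helper name `index_to_term` are assumptions made up for illustration, so the column indices will not match the newsgroup vocabulary above.

```python
from operator import itemgetter

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the newsgroup data
docs = ["the cat sat on the mat", "the dog chased the cat", "a bird sang"]

# Build the vocabulary from the whole corpus
cv = CountVectorizer(analyzer='word', lowercase=True)
cv.fit(docs)

# transform() expects an iterable of documents, so wrap the message in a list
dtm = cv.transform(["the cat and the dog and the bird"])

# sp.find returns row indices, column indices, and values of non-zero entries
rows, cols, counts = sp.find(dtm)
triples = sorted(zip(rows, cols, counts), key=itemgetter(2), reverse=True)

# Invert the vocabulary (term -> column) to look up a term by its column index
index_to_term = {int(idx): term for term, idx in cv.vocabulary_.items()}
top = [(index_to_term[int(j)], int(c)) for _, j, c in triples]
print(top)
```

Words in the message that never appeared in the fitted corpus (here, "and") get no column in the matrix, which is why the vocabulary is built from the whole data set rather than from the single message.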





In [59]:
def vectorize(rm_stop, data, message):
    '''           
    Name your CountVectorizer as cv and sorted list of tuples as srt_dtm.
    
    Parameters
    ----------
    rm_stop : A boolean value; if True, remove English stop words
    data : the whole data set, used to build the vocabulary
    message : the message which is to be vectorized
    
    Returns
    -------
    A tuple of 2 containing the CountVectorizer object and the sorted list of (i, j, count) tuples taken from the Document Term Matrix.
    '''    

    # YOUR CODE HERE
    if rm_stop:
        cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True)
    else:
        cv = CountVectorizer(analyzer='word', lowercase=True)
    
    # Build a vocabulary from the data passed in (not the global variable)
    cv.fit(data)
    
    # cv.transform() needs an iterable of documents, so wrap the message in a list
    dtm = cv.transform([message])
    
    # Find non-zero elements as (row, column, count) tuples
    i, j, c = sp.find(dtm)
    srt_dtm = list(zip(i, j, c))
    
    # Number of terms to keep
    top_display = 10

    # Sort our document term list by count, largest first
    srt_dtm.sort(key=itemgetter(2), reverse=True)   
    
    return cv, srt_dtm[:top_display]


In [61]:
cv1,srt_dtm1 = vectorize(rm_stop = False, data=text['data'], message=message)
cv2,srt_dtm2 = vectorize(rm_stop = True, data=text['data'], message=message)
assert_equal(isinstance(cv1,CountVectorizer), True)
assert_equal(isinstance(cv2,CountVectorizer), True)
assert_equal(srt_dtm1[0], (0, 88532, 37))
assert_equal(srt_dtm1[1], (0, 88519, 16))
assert_equal(srt_dtm2[0], (0, 69723, 8))
assert_equal(srt_dtm2[1], (0, 26952, 6))
max_key1 = max(srt_dtm1, key=itemgetter(2))[1]
assert_equal(max_key1,88532)

In [62]:
def cnt(top_display, cv , srt_dtm):

    terms = cv.vocabulary_
    # Sort our document term list, and unzip
    i, j, c = zip(*srt_dtm)
    # Grab out the keys and values for top terms
    x_keys = [(k, v) for k, v in terms.items() 
              if terms[k] in j[:top_display]]
    x_keys.sort(key=itemgetter(1), reverse=True)
    # Grab the data, including counts from DTM list
    x_counts = srt_dtm[:top_display]
    x_counts.sort(key=itemgetter(1), reverse=True)
    # Now we merge the two lists so we can sort to display terms in order
    x_merged = []
    for idx in range(len(x_keys)):
        x_merged.append((x_keys[idx][0], 
                         x_keys[idx][1], 
                         x_counts[idx][2]))
    x_merged.sort(key=itemgetter(2), reverse=True)
    print('Count: Term in Vocabulary')
    print(40*'-')
    for x in x_merged:
        print(f'{x[2]:5d}: vocabulary[{x[1]}] = {x[0]}')

In [63]:
cnt(6, cv1, srt_dtm1)
print(80*'-')
cnt(6, cv2, srt_dtm2)

Count: Term in Vocabulary
----------------------------------------
   37: vocabulary[88532] = the
   16: vocabulary[88519] = that
   14: vocabulary[67670] = of
   13: vocabulary[89360] = to
   12: vocabulary[51136] = is
   10: vocabulary[18521] = and
--------------------------------------------------------------------------------
Count: Term in Vocabulary
----------------------------------------
    8: vocabulary[69723] = parent
    6: vocabulary[26952] = child
    5: vocabulary[62940] = moral
    4: vocabulary[86584] = swear
    3: vocabulary[28151] = code
    3: vocabulary[16289] = absolute


# Problem 3

For this problem, complete the function `tokenize_nltk`, which will take `pattern` and `msg` as parameters and return the lexical diversity, the number of unique tokens, the most frequently occurring token, and 5 _hapaxes_ (tokens that occur exactly once) in the corpus. <br>Use the NLTK library to tokenize the message (passed through the `msg` parameter). Also, remember to convert the words to lowercase before tokenizing.
Since we will just be looking for alphanumeric strings, use `re.sub(pattern, ' ', msg)` for removing punctuation tokens.
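
The quantities requested above can be sketched on a short made-up sentence. The `sample` string and variable names are illustrative assumptions. Lexical diversity is computed here as total tokens divided by unique tokens, which is the ratio the test cell for this problem appears to expect.

```python
import re
import nltk

# Hypothetical sample text, not taken from the newsgroup data
sample = "The cat saw the dog; the dog ran."

# Lowercase, replace punctuation with spaces, split on whitespace
pattern = re.compile(r'[^\w\s]')
tokens = re.sub(pattern, ' ', sample.lower()).split()

# FreqDist counts how often each token occurs
freq = nltk.FreqDist(tokens)

num_words = len(tokens)           # total number of tokens
num_unique = freq.B()             # number of distinct tokens (bins)
lexdiv = num_words / num_unique   # total tokens per unique token
most_common_token = freq.max()    # token with the highest count
hapaxes = freq.hapaxes()          # tokens that occur exactly once

print(lexdiv, num_unique, most_common_token, hapaxes)
```

`FreqDist` needs no corpus download, so this should run with only the base `nltk` package installed.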

In [64]:
def tokenize_nltk(pattern, msg):
    '''           
    Parameters
    ----------
    pattern : Regular expression searching for punctuations
    msg : the message which is to be tokenized
    
    Returns
    -------
    A tuple of 4 containing the lexical diversity value, number of unique tokens, most frequently occurring token,
    and a list of 5 hapaxes.
    '''    
    # YOUR CODE HERE
    
    # Tokenize a text document
    words = re.sub(pattern, ' ', msg.lower()).split()

    # Count number of occurrences for each token
    counts = nltk.FreqDist(words)

    # Compute lexical diversity as total words divided by unique tokens
    num_words = len(words)
    num_tokens = len(counts)
    lexdiv  =  num_words / num_tokens
    
    unique_tk = counts.B()
    max_tk = counts.max()
    haps = counts.hapaxes()[:5]
    
    return lexdiv, unique_tk, max_tk, haps


In [65]:
pattern1 = re.compile(r'[^\w\s]')
div, bins, max_val, hap = tokenize_nltk(pattern1, message)
assert_almost_equal(div, 2.13392, 3)
assert_equal(bins, 224)
assert_equal(max_val, 'the')
assert_equal(isinstance(hap, list), True)
assert_equal(len(hap), 5)

In [66]:
print('5 hapaxes in corpus are:',hap)

5 hapaxes in corpus are: ['yep', 'pretty', 'much', 'jewish', 'thinking']
