## Python Text Modeling & Spark

Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All). You can speak with others regarding the assignment, but all work must be your own.


### This is a 30 point assignment graded from answers to questions and automated tests that should be run at the bottom. Be sure to clearly label all of your answers and commit final tests at the end. If you attempt to fake passing the tests you will receive a 0 on the assignment and it will be considered an ethical violation. (Note, not all questions have tests).

### You must show the executed code and then the output. Do not just copy and paste the code into a markdown cell.

In [1]:
NAME = ""
COLLABORATORS = ["Alyssa Hacker"]  #You can speak with others regarding the assignment, but all typed work must be your own.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [2]:
X = ["Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
     "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."]


In [None]:
%load_ext ipython_unittest

In [3]:
len(X)

3

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [5]:
vectorizer.vocabulary_

{'away': 0,
 'candlestick': 1,
 'colonel': 2,
 'fellow': 3,
 'from': 4,
 'green': 5,
 'has': 6,
 'he': 7,
 'his': 8,
 'in': 9,
 'is': 10,
 'killed': 11,
 'last': 12,
 'miss': 13,
 'mr': 14,
 'mustard': 15,
 'nice': 16,
 'not': 17,
 'office': 18,
 'plant': 19,
 'plum': 20,
 'professor': 21,
 'scarlett': 22,
 'study': 23,
 'the': 24,
 'very': 25,
 'was': 26,
 'watered': 27,
 'week': 28,
 'while': 29,
 'with': 30}

In [6]:
X_bag_of_words = vectorizer.transform(X)

In [7]:
X_bag_of_words.shape

(3, 31)

In [8]:
X_bag_of_words

<3x31 sparse matrix of type '<class 'numpy.int64'>'
	with 39 stored elements in Compressed Sparse Row format>

In [9]:
X_bag_of_words.toarray()

array([[0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0,
        0, 1, 2, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
        1, 0, 0, 0, 1, 1, 1, 1, 0]])

In [10]:
vectorizer.get_feature_names()

['away',
 'candlestick',
 'colonel',
 'fellow',
 'from',
 'green',
 'has',
 'he',
 'his',
 'in',
 'is',
 'killed',
 'last',
 'miss',
 'mr',
 'mustard',
 'nice',
 'not',
 'office',
 'plant',
 'plum',
 'professor',
 'scarlett',
 'study',
 'the',
 'very',
 'was',
 'watered',
 'week',
 'while',
 'with']

In [11]:
vectorizer.inverse_transform(X_bag_of_words)

[array(['candlestick', 'colonel', 'fellow', 'green', 'in', 'is', 'killed',
        'mr', 'mustard', 'nice', 'not', 'study', 'the', 'very', 'with'],
       dtype='<U11'),
 array(['green', 'has', 'his', 'in', 'plant', 'plum', 'professor', 'study'],
       dtype='<U11'),
 array(['away', 'from', 'green', 'he', 'his', 'last', 'miss', 'office',
        'plant', 'plum', 'professor', 'scarlett', 'was', 'watered', 'week',
        'while'],
       dtype='<U11')]

# tf-idf Encoding
A useful transformation that is often applied to the bag-of-words encoding is the so-called term-frequency inverse-document-frequency (tf-idf) scaling, which is a non-linear transformation of the word counts.

The tf-idf encoding rescales the counts so that words that are common across documents carry less weight:

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [13]:
import numpy as np
np.set_printoptions(precision=2)

print(tfidf_vectorizer.transform(X).toarray())

[[ 0.    0.22  0.22  0.22  0.    0.26  0.    0.    0.    0.17  0.22  0.22
   0.    0.    0.44  0.22  0.22  0.22  0.    0.    0.    0.    0.    0.17
   0.44  0.22  0.    0.    0.    0.    0.22]
 [ 0.    0.    0.    0.    0.    0.27  0.46  0.    0.35  0.35  0.    0.    0.
   0.    0.    0.    0.    0.    0.    0.35  0.35  0.35  0.    0.35  0.    0.
   0.    0.    0.    0.    0.  ]
 [ 0.27  0.    0.    0.    0.27  0.16  0.    0.27  0.21  0.    0.    0.
   0.27  0.27  0.    0.    0.    0.    0.27  0.21  0.21  0.21  0.27  0.    0.
   0.    0.27  0.27  0.27  0.27  0.  ]]


In [14]:
tfidf_vectorizer.get_feature_names()

['away',
 'candlestick',
 'colonel',
 'fellow',
 'from',
 'green',
 'has',
 'he',
 'his',
 'in',
 'is',
 'killed',
 'last',
 'miss',
 'mr',
 'mustard',
 'nice',
 'not',
 'office',
 'plant',
 'plum',
 'professor',
 'scarlett',
 'study',
 'the',
 'very',
 'was',
 'watered',
 'week',
 'while',
 'with']

tf-idfs are a way to represent documents as feature vectors. tf-idfs can be understood as a modification of the raw term frequencies (`tf`); the `tf` is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification. If you are interested in the mathematical details and equations, see this [external IPython Notebook](http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb) that walks you through the computation.
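As a rough illustration of the downweighting described above, the following plain-Python sketch recomputes the tf-idf values for the three example documents. It assumes the smoothed idf variant, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row, which corresponds to the `smooth_idf=True` and `norm='l2'` settings shown in the `TfidfVectorizer` output earlier. The helper names `tokenize` and `tfidf` are hypothetical and exist only for this example.

```python
import math
import re

docs = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. "
    "Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away "
    "from his office last week.",
]


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def tfidf(documents):
    """Return (vocabulary, rows) with L2-normalized tf-idf weights (assumed formula)."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed inverse document frequency (an assumption, see the text above).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        weights = [doc.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(docs)
second = dict(zip(vocab, rows[1]))
print(round(second["green"], 2), round(second["has"], 2), round(second["study"], 2))
# prints: 0.27 0.46 0.35
```

In the second document, "green" occurs in all three documents and so gets the smallest weight, while "has" occurs only in this document and gets the largest.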

# Bigrams and N-Grams

In the example illustrated in the figure at the beginning of this notebook, we used the so-called 1-gram (unigram) tokenization: Each token represents a single element with regard to the splitting criterion.

Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like "not" can invert the meaning of words.

A simple way to include some word order is to use n-grams, which look not only at single tokens but also at sequences of neighboring tokens. For example, in 2-gram (bigram) tokenization, we would group words together with an overlap of one word; in 3-gram (trigram) splits, we would create an overlap of two words, and so forth:

- original text: "this is how you get ants"
- 1-gram: "this", "is", "how", "you", "get", "ants"
- 2-gram: "this is", "is how", "how you", "you get", "get ants"
- 3-gram: "this is how", "is how you", "how you get", "you get ants"
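The sliding-window idea behind these examples can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical, and whitespace splitting is used here only for simplicity; it is not how `CountVectorizer` tokenizes text.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is how you get ants".split()
print(ngrams(tokens, 2))
# prints: ['this is', 'is how', 'how you', 'you get', 'get ants']
print(ngrams(tokens, 3))
# prints: ['this is how', 'is how you', 'how you get', 'you get ants']
```

A sequence of `k` tokens yields `k - n + 1` n-grams, which is why the six-word sentence produces five bigrams and four trigrams.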

Which "n" we choose for "n-gram" tokenization to obtain the optimal performance in our predictive model depends on the learning algorithm, dataset, and task. In other words, we have to consider "n" in "n-grams" as a tuning parameter, and in later notebooks, we will see how we deal with such parameters.

Now, let's create a bag of words model of bigrams using scikit-learn's `CountVectorizer`:

In [26]:
# look at sequences of tokens of minimum length 2 and maximum length 2
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [27]:
bigram_vectorizer.get_feature_names()

['away from',
 'candlestick mr',
 'colonel mustard',
 'from his',
 'green is',
 'green killed',
 'green plant',
 'has green',
 'he was',
 'his office',
 'his study',
 'in his',
 'in the',
 'is not',
 'killed colonel',
 'last week',
 'miss scarlett',
 'mr green',
 'mustard in',
 'nice fellow',
 'not very',
 'office last',
 'plant in',
 'plant while',
 'plum green',
 'plum has',
 'professor plum',
 'scarlett watered',
 'study with',
 'the candlestick',
 'the study',
 'very nice',
 'was away',
 'watered professor',
 'while he',
 'with the']

In [28]:
bigram_vectorizer.transform(X).toarray()

array([[0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]])

Often we want to include unigrams (single tokens) AND bigrams, which we can do by passing the following tuple as an argument to the `ngram_range` parameter of the `CountVectorizer` constructor:

In [29]:
gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
gram_vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [30]:
gram_vectorizer.get_feature_names()

['away',
 'away from',
 'candlestick',
 'candlestick mr',
 'colonel',
 'colonel mustard',
 'fellow',
 'from',
 'from his',
 'green',
 'green is',
 'green killed',
 'green plant',
 'has',
 'has green',
 'he',
 'he was',
 'his',
 'his office',
 'his study',
 'in',
 'in his',
 'in the',
 'is',
 'is not',
 'killed',
 'killed colonel',
 'last',
 'last week',
 'miss',
 'miss scarlett',
 'mr',
 'mr green',
 'mustard',
 'mustard in',
 'nice',
 'nice fellow',
 'not',
 'not very',
 'office',
 'office last',
 'plant',
 'plant in',
 'plant while',
 'plum',
 'plum green',
 'plum has',
 'professor',
 'professor plum',
 'scarlett',
 'scarlett watered',
 'study',
 'study with',
 'the',
 'the candlestick',
 'the study',
 'very',
 'very nice',
 'was',
 'was away',
 'watered',
 'watered professor',
 'week',
 'while',
 'while he',
 'with',
 'with the']

In [31]:
gram_vectorizer.transform(X).toarray()

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        1, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
        1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]])

Character n-grams
=================

Sometimes it is helpful to look at single characters rather than only at whole words.
That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.
We can simply look at characters instead of words by setting ``analyzer="char"``.
Looking at single characters is usually not very informative, but looking at longer n-grams of characters could be:

In [32]:
X

['Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
 'Professor Plum has a green plant in his study.',
 "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."]

In [33]:
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
char_vectorizer.fit(X)

CountVectorizer(analyzer='char', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [34]:
print(char_vectorizer.get_feature_names())

[' a', ' c', ' f', ' g', ' h', ' i', ' k', ' l', ' m', ' n', ' o', ' p', ' s', ' t', ' v', ' w', "'s", '. ', 'a ', 'an', 'ar', 'as', 'at', 'aw', 'ay', 'ca', 'ce', 'ck', 'co', 'd ', 'dl', 'dy', 'e ', 'ed', 'ee', 'ek', 'el', 'en', 'er', 'es', 'et', 'fe', 'ff', 'fi', 'fr', 'gr', 'h ', 'ha', 'he', 'hi', 'ic', 'il', 'in', 'is', 'it', 'k.', 'ki', 'l ', 'la', 'le', 'll', 'lo', 'lu', 'm ', "m'", 'mi', 'mr', 'mu', 'n ', 'nd', 'ne', 'ni', 'no', 'nt', 'of', 'ol', 'om', 'on', 'or', 'ot', 'ow', 'pl', 'pr', 'r ', 'r.', 'rd', 're', 'rl', 'ro', 'ry', 's ', 'sc', 'so', 'ss', 'st', 't ', 'ta', 'te', 'th', 'ti', 'tt', 'tu', 'ud', 'um', 'us', 've', 'w.', 'wa', 'we', 'wh', 'wi', 'y ', 'y.']
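To see where features such as `'r.'` or `' g'` come from, here is a plain-Python sketch of character n-gram extraction. It only approximates the `analyzer="char"` behavior (lowercasing, collapsing whitespace, then sliding a window over the whole string), and the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return all runs of n consecutive characters after simple normalization."""
    # Lowercase and collapse whitespace runs to one space (a simplifying assumption).
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Mr. Green", 2))
# prints: ['mr', 'r.', '. ', ' g', 'gr', 're', 'ee', 'en']
```

Because the window slides across word boundaries, spaces and punctuation become part of the features.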


### Word Count
#### **(1a) Create a base RDD**
#### We'll start by generating a base RDD by using a Python list and the `sc.parallelize` method.  Then we'll print out the type of the base RDD.

In [4]:
#Don't Execute this on Databricks
#To be used if executing via docker
import pyspark
sc = pyspark.SparkContext('local[*]')

In [10]:
%load_ext ipython_unittest

ModuleNotFoundError: No module named 'ipython_unittest'

In [5]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print(type(wordsRDD))

<class 'pyspark.rdd.RDD'>


#### **(1b) Pluralize and test**
#### Let's use a `map()` transformation to add the letter 's' to each string in the base RDD we just created. We'll define a Python function that returns the word with an 's' at the end of the word.  Please replace `<FILL IN>` with your solution.  If you have trouble, the next cell has the solution.  After you have defined `makePlural` you can run the third cell which contains a test.  If your implementation is correct it will print `1 test passed`.
#### This is the general form that exercises will take, except that no example solution will be provided.  Exercises will include an explanation of what is expected, followed by code cells where one cell will have one or more `<FILL IN>` sections.  The cell that needs to be modified will have `# TODO: Replace <FILL IN> with appropriate code` on its first line.  Once the `<FILL IN>` sections are updated and the code is run, the test cell can then be run to verify the correctness of your solution.  The last code cell before the next markdown section will contain the tests.

In [8]:
# TODO: Replace <FILL IN> with appropriate code
def makePlural(word):
    """Adds an 's' to `word`.

    Note:
        This is a simple function that only adds an 's'.  No attempt is made to follow proper
        pluralization rules.

    Args:
        word (str): A string.

    Returns:
        str: A string with 's' added to it.
    """
    return word+'s'

print(makePlural('cat'))

cats


In [9]:
%%unittest_main
class TestPackages(unittest.TestCase):
    def test_packages1(self):
        self.assertEqual(makePlural('rat'), 'rats', 'incorrect result: makePlural does not add an s')

UsageError: Cell magic `%%unittest_main` not found.
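The `UsageError` above indicates that the `%%unittest_main` magic was not available, which is consistent with the earlier `ModuleNotFoundError` for `ipython_unittest`. As a fallback, the same check can be run with only the standard-library `unittest` module. This is a minimal sketch, not the assignment's official test harness: the class name `TestMakePlural` is hypothetical, and `makePlural` is repeated here only so that the block runs on its own.

```python
import unittest


def makePlural(word):
    """Adds an 's' to `word` (same simple rule as in the exercise above)."""
    return word + 's'


class TestMakePlural(unittest.TestCase):
    def test_adds_s(self):
        self.assertEqual(makePlural('rat'), 'rats',
                         'incorrect result: makePlural does not add an s')


# Build and run the suite explicitly so that this also works inside a notebook,
# where calling unittest.main() would try to parse the kernel's arguments.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMakePlural)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The runner reports each test by name, and `result.wasSuccessful()` returns `True` when all tests pass.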
