
# Connect Intensive - Machine Learning Nanodegree
# Lesson 4: Bayes NLP Mini-Project


## Objectives
  - Understand how [Bayes Rule](https://en.wikipedia.org/wiki/Bayes%27_theorem) derives from [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability)
  - Write methods, utilizing Python dictionary objects and string methods such as `str.split()`.
  - Apply Bayes Rule to simple NLP: missing word prediction problems
  
## Prerequisites
  - Basic familiarity with Python strings and dictionaries will help.
 
## Acknowledgements
  - This lesson is adapted from one of [Nick Hoh's excellent sessions](https://github.com/nickypie/ConnectIntensive).
  
  
  

## Bad Handwriting Exposition


Imagine your boss has left you a message from a location with terrible reception. Several words are impossible to hear. Based on some transcriptions of previous messages he’s left, you want to fill in the missing words. To do this, we will use Bayes’ Rule to find the probability that a given word is in the blank, given some other information about the message.

Recall Bayes' Rule:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Or, in our case:

$$P(\text{a certain word}\,|\,\text{surrounding words}) = \frac{P(\text{surrounding words}\,|\,\text{a certain word})\,P(\text{a certain word})}{P(\text{surrounding words})}$$
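
Bayes' Rule follows directly from the definition of conditional probability. As a short derivation (assuming $P(A) > 0$ and $P(B) > 0$), write the joint probability $P(A \cap B)$ in two ways:

$$\begin{array}{rcl}
P(A \cap B) &=& P(A|B)\,P(B)\\
P(A \cap B) &=& P(B|A)\,P(A)
\end{array}$$

Setting the two right-hand sides equal and dividing by $P(B)$ gives:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$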

## 1. Calculations

Let’s calculate some of the probabilities from the previous page. Here is a message from your boss:

“So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?” 

Assuming the above text is representative, calculate the following probabilities: 

  - Finding the word “you” after the word “if”: 1
  - That a randomly selected word is “you”: 0.045
  - That a randomly selected word is “if”: 0.045


In [1]:
message = 'So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?'
message_word_count = message.split()

total_word_count = len(message_word_count)
count_you = float(0)
count_if = float(0)


for word in message_word_count:
    if word == "you":
        count_you += 1
    if word == "if":
        count_if += 1 
            
prob_word_is_you = count_you/total_word_count
prob_word_is_if = count_if/total_word_count


print ("Probability 'you': {:.3f} \nProbability 'if': {:.3f}".format(prob_word_is_you,prob_word_is_if))

Probability 'you': 0.045 
Probability 'if': 0.045
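
The cell above covers the two unconditional probabilities. The conditional one, finding "you" right after "if", can be checked in a similar way. The following is a minimal sketch, not part of the original exercise; the names `bigrams` and `p_you_after_if` are chosen here for illustration.

```python
message = ('So if you could just go ahead and pack up your stuff '
           'and move it down there, that would be terrific, OK?')
words = message.split()

# Pair each word with the word that follows it: ("So", "if"), ("if", "you"), ...
bigrams = list(zip(words, words[1:]))

# Number of times "if" is followed by any word, and by "you" specifically
count_if = sum(1 for first, _ in bigrams if first == "if")
count_if_you = sum(1 for pair in bigrams if pair == ("if", "you"))

# P("you" follows | previous word is "if")
p_you_after_if = count_if_you / float(count_if)
print("P('you' after 'if'): {:.3f}".format(p_you_after_if))
```

Because "if" appears once in the message and is followed by "you", the estimate is 1, which matches the answer listed above.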


## 2. Maximum Likelihood

In this exercise, you will write a method, `NextWordProbability(sampletext,word)`, that creates a Python dictionary from a string `sampletext` and a target word `word`. The keys of the dictionary will be the words that follow the target word `word`, and the values will be the number of times the key follows the target word `word`. For example,  the output of the following code:
```
memo = "If you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?"
word = "and"
print(NextWordProbability(memo,word))
```
should be the dictionary:
```
{'move': 1, 'pack': 1}
```
Don't worry about removing punctuation or changing upper or lower case letters.

**Complete** the method `NextWordProbability` in the cell below and then **run** the cell. You may want to use [the string method `split`](https://docs.python.org/2/library/stdtypes.html#string-methods), and refer to [the Python documentation on dictionaries](https://docs.python.org/2/library/stdtypes.html#mapping-types-dict).  Then, you can test your method by running the cell below it to try some test cases. When you feel confident your `NextWordProbability` method works, you can copy and paste the method into the Bayes NLP Mini-Project "Quiz: Maximum Likelihood".

In [2]:
sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

#
#   Maximum Likelihood Hypothesis
#
#
#   In this quiz we will find the maximum likelihood word based on the preceding word
#
#   Fill in the NextWordProbability procedure so that it takes in sample text and a word,
#   and returns a dictionary with keys the set of words that come after, whose values are
#   the number of times the key comes after that word.
#   
#   Just use .split() to split the sample_memo text into words separated by spaces.

def NextWordProbability(sampletext,word):
    words = sampletext.split()
    prevword = None
    d = {}
    for w in words:
        if prevword == word:
            if w in d:
                d[w] += 1
            else:
                d[w] = 1
        prevword = w

    return d

In [3]:
# Test cases: see how well your NextWordProbability method works.

memo1 = "If you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?"
word1 = "and"
print(NextWordProbability(memo1,word1))
# Output should be:
# {'move': 1, 'pack': 1}

# memo2 = "Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?"
# word2 = "need"
# print(NextWordProbability(memo2,word2))
# # Output should be:
# # {'to': 1, 'all': 1}

# memo3 = "Hello Peter, what's happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up."
# word3 = "in"
# print(NextWordProbability(memo3,word3))
# # Output should be:
# # {'tomorrow.': 1, 'on': 1}

{'move': 1, 'pack': 1}
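
`NextWordProbability` returns raw counts. To turn those counts into probabilities and pick the maximum likelihood word, one option is to divide each count by the total. The sketch below is an illustration under that assumption; the helper names `next_word_counts`, `next_word_distribution`, and `most_likely_next` are hypothetical and not part of the quiz.

```python
from collections import Counter

def next_word_counts(sampletext, word):
    """Count the words that directly follow `word` (same idea as NextWordProbability)."""
    words = sampletext.split()
    return Counter(nxt for prev, nxt in zip(words, words[1:]) if prev == word)

def next_word_distribution(sampletext, word):
    """Normalize the follower counts so that they sum to 1."""
    counts = next_word_counts(sampletext, word)
    total = float(sum(counts.values()))
    return {w: c / total for w, c in counts.items()}

def most_likely_next(sampletext, word):
    """Return the follower with the highest estimated probability, or None if there is none."""
    dist = next_word_distribution(sampletext, word)
    return max(dist, key=dist.get) if dist else None

memo = "If you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?"
dist = next_word_distribution(memo, "and")
print(dist)
```

In this short memo "pack" and "move" each follow "and" once, so both get probability 0.5 and the maximum likelihood choice is a tie; a larger sample usually breaks such ties.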


## NLP Disclaimer

In the previous exercise, you may have thought of some ways we might want to clean up the text available to us.

For example, we would certainly want to remove punctuation, and generally want to make all strings lowercase for consistency. In most language processing tasks we will have a much larger corpus of data, and will want to remove certain features.
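
As a rough illustration of that kind of cleanup, the sketch below lowercases each word and strips punctuation from its ends while keeping apostrophes inside words such as "I'm". The function name `clean_words` is hypothetical, and real pipelines typically do considerably more than this.

```python
import string

def clean_words(text):
    """Split on whitespace, strip leading/trailing punctuation, and lowercase."""
    cleaned = (w.strip(string.punctuation).lower() for w in text.split())
    # Drop tokens that were punctuation only, such as a stray "..."
    return [w for w in cleaned if w]

print(clean_words("So, you know, if you want to, go ahead... OK?"))
```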

Overall, just keep in mind that this mini-project is about Bayesian probability. If you're interested in the details of language processing, you might start with this Kaggle project, which introduces a more detailed and standard approach to text processing very different from what we cover here.

## 3. Optimal Classifier Example

Imagine that we have two missing words in a row. Rather than simply picking the maximum likelihood possibility for the first blank and using that to estimate the second, we can weigh every possibility for the first blank by its probability.

For example, take the following sentence:

“For --- ---”

With the following previous data:

$$\begin{array}{rcl}
P(\text{ "for this" }|\text{"for ---"})&=&0.4\\
P(\text{ "for that" }|\text{"for ---"})&=&0.3\\
P(\text{ "for those" }|\text{"for ---"})&=&0.3\end{array}$$

$$\begin{array}{rclrcl}
P(\text{ "this time" }|\text{"this ---"})&=&0.6,\quad&P(\text{ "this job" }|\text{"this ---"})&=&0.4\\
P(\text{ "that job" }|\text{"that ---"})&=&0.8,\quad&P(\text{ "that time" }|\text{"that ---"})&=&0.2\\
P(\text{ "those items" }|\text{"those ---"})&=&1.0\end{array}$$




**What word should you predict in the second blank?**  job

**With what probability?** .4
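
The answer comes from summing over every word that could fill the first blank, weighting each candidate by its own probability. A worked version of that sum, using only the numbers given above:

$$\begin{array}{rcl}
P(\text{job}) &=& P(\text{this})\,P(\text{job}|\text{this}) + P(\text{that})\,P(\text{job}|\text{that}) = 0.4 \times 0.4 + 0.3 \times 0.8 = 0.40\\
P(\text{time}) &=& P(\text{this})\,P(\text{time}|\text{this}) + P(\text{that})\,P(\text{time}|\text{that}) = 0.4 \times 0.6 + 0.3 \times 0.2 = 0.30\\
P(\text{items}) &=& P(\text{those})\,P(\text{items}|\text{those}) = 0.3 \times 1.0 = 0.30
\end{array}$$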



In [4]:
p_this= .4
p_that = .3 
p_those= .3 

p_this_time = .6
p_this_job = .4

p_that_job = .8 
p_that_time = .2

p_those_items = 1 

# getting probability of "job"

p_job = p_this_job * p_this + p_that_job * p_that

# getting probability of "time"

p_time  = p_this_time * p_this + p_that_time * p_that

# getting probability of "items"

p_items = p_those_items * p_those 

print("job: {:.1f}\ntime: {:.1f}\nitems: {:.1f}".format(p_job, p_time, p_items))


job: 0.4
time: 0.3
items: 0.3
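
The same calculation can be written with dictionaries so that it extends to any number of candidate words. This is a sketch using the numbers from the example; the variable names are chosen here for illustration. It also contrasts the result with the greedy approach of fixing the most likely first word before choosing the second.

```python
# P(first blank = w | "for ---")
first_blank = {"this": 0.4, "that": 0.3, "those": 0.3}

# P(second blank = v | first blank = w)
second_given_first = {
    "this": {"time": 0.6, "job": 0.4},
    "that": {"job": 0.8, "time": 0.2},
    "those": {"items": 1.0},
}

# Sum over every candidate for the first blank, weighted by its probability
second_blank = {}
for first, p_first in first_blank.items():
    for second, p_second in second_given_first[first].items():
        second_blank[second] = second_blank.get(second, 0.0) + p_first * p_second

best = max(second_blank, key=second_blank.get)

# Greedy alternative: commit to the most likely first word, then take its best follower
greedy_first = max(first_blank, key=first_blank.get)
greedy_second = max(second_given_first[greedy_first],
                    key=second_given_first[greedy_first].get)

print("optimal: {} ({:.2f}), greedy: {}".format(best, second_blank[best], greedy_second))
```

Here the greedy route commits to "this" and then picks "time", while summing over all first words favors "job".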


## 4.  Optimal Classifier Exercise

In this exercise, you will write a method `LaterWords(sample,word,distance)` that determines the most likely word to appear `distance` words after the target word `word` based on the text in the string `sample`. For example, a call to the method:
```
LaterWords(memo,"and",2)
```
would return a string: the most frequent word appearing 2 words after `"and"` in the string `memo`, *i.e.* the word in the second blank of "and --- **---**".

In [5]:
#------------------------------------------------------------------

#
#   Bayes Optimal Classifier
#
#   In this quiz we will compute the optimal label for a second missing word in a row
#   based on the possible words that could be in the first blank
#
#   Finish the procedure, LaterWords(), below
#
#   You may want to import your code from the previous programming exercise!
#

def NextWordProbability(sampletext,word):
    words = sampletext.split()
    prevword = None
    d = {}
    for w in words:
        if prevword == word:
            if w in d:
                d[w] += 1
            else:
                d[w] = 1
        prevword = w
            
    return d


sample_memo = '''
Milt, we're gonna need to go ahead and move you downstairs into storage B. We have some new people coming in, and we need all the space we can get. So if you could just go ahead and pack up your stuff and move it down there, that would be terrific, OK?
Oh, and remember: next Friday... is Hawaiian shirt day. So, you know, if you want to, go ahead and wear a Hawaiian shirt and jeans.
Oh, oh, and I almost forgot. Ahh, I'm also gonna need you to go ahead and come in on Sunday, too...
Hello Peter, whats happening? Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk... oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.
'''

corrupted_memo = '''
Yeah, I'm gonna --- you to go ahead --- --- complain about this. Oh, and if you could --- --- and sit at the kids' table, that'd be --- 
'''

data_list = sample_memo.strip().split()

words_to_guess = ["ahead","could"]


def LaterWords(sample,word,distance):
    '''@param sample: a sample of text to draw from
    @param word: a word occurring before a corrupted sequence
    @param distance: how many words later to estimate (i.e. 1 for the next word, 2 for the word after that)
    @returns: a single word which is the most likely possibility
    '''
    # Start with the known word, carrying a weight of 1
    word_dict = {word: 1}

    for idx in range(distance):
        # Accumulate weighted counts for the words one step further along
        new_dict = {}
        # Consider every candidate word for the current position, not just the most likely one
        for w, w_count in word_dict.items():
            # Weight each follower's count by the weight of the word that precedes it
            for new_word, new_count in NextWordProbability(sample,w).items():
                if new_word in new_dict:
                    new_dict[new_word] += w_count * new_count
                else:
                    new_dict[new_word] = w_count * new_count
        # Move one position forward: the followers become the new candidates
        word_dict = new_dict

    # Return the word with the largest accumulated weight
    return max(word_dict, key=(lambda key: word_dict[key]))


for i in words_to_guess:
    print("Guess: Two blanks after '{}' will be: {}".format(i, LaterWords(sample_memo,i,2)))

Guess: Two blanks after 'ahead' will be: come
Guess: Two blanks after 'could' will be: go
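
The cell above multiplies raw follower counts, which gives extra weight to intermediate words that simply occur often in the sample. One possible refinement, sketched here as an assumption and not as the quiz's expected solution, is to normalize the counts into probabilities at each step so the result is an estimate of the distribution over the later word. The helper names `follower_distributions` and `later_word_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict

def follower_distributions(sample):
    """Map each word to {next_word: P(next_word | word)} estimated from `sample`."""
    words = sample.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    dists = {}
    for prev, c in counts.items():
        total = float(sum(c.values()))
        dists[prev] = {nxt: n / total for nxt, n in c.items()}
    return dists

def later_word_probabilities(sample, word, distance):
    """Estimate the distribution of the word `distance` positions after `word`."""
    dists = follower_distributions(sample)
    current = {word: 1.0}
    for _ in range(distance):
        step = defaultdict(float)
        # Sum over every candidate at the current position
        for w, p_w in current.items():
            for nxt, p_nxt in dists.get(w, {}).items():
                step[nxt] += p_w * p_nxt
        current = dict(step)
    return current

# Toy text (made up for this sketch) in the spirit of the "for --- ---" example
text = "for this time for this job for that job for those items"
probs = later_word_probabilities(text, "for", 2)
print(max(probs, key=probs.get), probs)
```

In this toy text, "job" is the most probable word two positions after "for", since it can be reached through both "this" and "that".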


## 5. Which Words Meditation

What set of words in a memo do you think could help predict what a missing word might be? What are some advantages and disadvantages of using more or fewer possible influences in prediction?

Many answers are possible.

One possibility: people's names, locations, and times. The advantage of using more influences is that you could potentially get a more accurate model. The disadvantage is the risk of overfitting to the training data.




## 6. Joint Distribution Analysis


If you wanted to measure the joint probability distribution of a missing word, given its position relative to every other word in the document, how many probabilities would you need to measure? Say the document is $N$ words long.

$N^2$

## 7. Domain Knowledge Quiz


Given the corpus of text we have from our boss, we might like to identify some things he often says, and use that knowledge to make better predictions. What are some statements you see arising multiple times?

"go ahead and", "if you could", "that would be great" 

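Recurring phrases like these can also be found by counting word sequences. The sketch below counts three-word sequences in a short text made up for this illustration; the function name `common_phrases` and its parameters are hypothetical.

```python
from collections import Counter

def common_phrases(text, n=3, top=3):
    """Return the `top` most frequent sequences of `n` consecutive words."""
    words = text.split()
    # Slide a window of n words across the text
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return counts.most_common(top)

text = ("I'm gonna need you to go ahead and come in tomorrow. "
        "So if you could go ahead and sit here, that would be great. "
        "Just go ahead and ask if you could help.")
print(common_phrases(text, n=3, top=2))
```

Running the same function on `sample_memo` with different values of `n` is one way to build a list of candidate phrases.
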
## 8. Domain Knowledge Fill In

Trying to search all [regular expressions](https://docs.python.org/2/library/re.html) of length up to 9 with multiple optional parts is computationally infeasible. But if we have these hypotheses to begin with, we can make extremely accurate guesses. For example, fill in the blanks in the following sentence:
> "Yeah, I'm gonna --- you to go --- --- not complain about this. Oh, and if you could --- ahead and sit at the kids' table, that'd be ---."

need, ahead, and, go, great
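
Given a handful of phrase hypotheses, the blanks can be filled with a small regular expression per phrase. This is a minimal sketch that assumes the phrase list below (chosen here from the boss's recurring statements) and assumes blanks are written as `---`; the function name `fill_with_phrases` is hypothetical.

```python
import re

BLANK = "---"

def fill_with_phrases(text, phrases):
    """Restore a known phrase wherever it appears with some of its words blanked out."""
    for phrase in phrases:
        # Each position may hold the expected word or a blank
        slots = [r"(?:\b{}\b|{})".format(re.escape(w), BLANK) for w in phrase.split()]
        pattern = re.compile(r"\s+".join(slots))

        def restore(match, phrase=phrase):
            tokens = match.group(0).split()
            # Require at least one real word so a run of only blanks is left alone
            return phrase if any(t != BLANK for t in tokens) else match.group(0)

        text = pattern.sub(restore, text)
    return text

corrupted = ("Yeah, I'm gonna --- you to go --- --- not complain about this. "
             "Oh, and if you could --- ahead and sit at the kids' table, that'd be ---.")
phrases = ["gonna need you", "go ahead and", "that'd be great"]
filled = fill_with_phrases(corrupted, phrases)
print(filled)
```

With these three phrases every blank in the sentence is restored, which is far cheaper than searching over all possible patterns.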
