In [167]:
import sklearn
import nltk
import keras
keras.__version__

# Analyzing text


#### AN IMPORTANT ANNOUNCEMENT
The first part of the partner mini-project (downloading the dataset), which is further down in this notebook, takes a long time. You may want to do that step while you are taking a break from the rest of the notebook or, alternatively, you can download the file outside of this Python notebook.

<img src="http://zacharski.org/files/courses/cs419/tiles.jpg" width="500"/>

So far we have been dealing with **structured data**. Structured data is ... well ... structured. This means that an instance of our data has nice attributes that can be represented in a DataFrame or a table:

make | mpg | cylinders | HP | 0-60
---- | :---: | :---: | :---: | :---:
Fiat | 38 | 4 | 157 | 6.9
Ford F150 | 19 | 6 | 386 | 6.3
Mazda 3 | 37 | 4 | 155 | 7.5
Ford Escape | 27 | 4 | 245 | 7.1
Kia Soul | 31 | 4 | 164 | 8.5

The majority of data in the world is **unstructured**. Take text for example. Suppose I have a corpus of twitter posts from President Trump and the Dalai Lama and my goal is to create a classifier that takes a tweet and tells me if it was produced by Trump or the Dalai Lama:

*The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion.*

*How low has President Obama gone to tapp my phones during the very sacred election*

We might consider  the columns of a table to be things like *first word of the tweet*, *second word of the tweet* and so on:


id | word 1 | word 2 | word 3 | word 4 | word 5 | word 6 | ...
---- | :---: | :---: | :---: | :---: | :---: | :---: | :---:
1 | The | purpose | of | education | is | to | ...
2 | How | low | has | President | Obama | gone | ...

So we would be counting how many times the word *President* occurred as the fourth word of a tweet. **But that would be the wrong way to go**. 

A more common way to represent text is to treat the text as an unordered collection of words, ignoring where each word occurred. This is called the **bag of words** approach.

## Bag of words
<img src="http://zacharski.org/files/courses/cs419/BagofWords.jpg" width="350"/>

With the bag of words approach we count word occurrences and the features (what we might think of as columns) are the words. For example, we take a bunch of Trump tweets and count word occurrences and do the same with the Dalai Lama tweets and we might get something like:

id | a | the | compassion | love | sad | fake | ...
---- | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Trump | 42 | 27 | 1 | 5 | 311 | 227 | ...
Dalai Lama | 72 | 42 | 103 | 159 | 5 | 1 | ...

So, for example, Trump has used the word *compassion* once in all his tweets  but the Dalai Lama used it 103 times (this data is made-up).

Or maybe, as we did with the IMDB dataset, we will use a '1' to indicate that that word appeared at all in the text and a '0' if it did not.

id | a | the | compassion | love | sad | fake | ...
---- | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Trump | 1 | 1 | 0 | 0 | 1 | 1 | ...
Dalai Lama | 1 | 1 | 1 | 1 | 0 | 0 | ...



This 'bag of words' allows us to use the classification methods we have been using. 
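
To make the two tables above concrete, here is a minimal sketch that builds both the count version and the 0/1 version for two short texts. It uses only the standard library, and the helper name `bag_of_words` and the crude punctuation handling are assumptions made for illustration, not what Keras does internally.

```python
from collections import Counter

# Two short texts taken from the examples in this notebook.
texts = {
    "trump": "Obama bad (or sick) guy! Sad",
    "dalai_lama": "Love and compassion are important, because they strengthen us.",
}

def bag_of_words(text):
    """Lowercase, turn punctuation into spaces, and count each word."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return Counter(cleaned.split())

counts = {name: bag_of_words(text) for name, text in texts.items()}

# Shared vocabulary: one column per distinct word, in sorted order.
vocabulary = sorted(set().union(*counts.values()))

# One row per author: raw counts, and 1/0 for present/absent.
count_rows = {name: [c[w] for w in vocabulary] for name, c in counts.items()}
binary_rows = {name: [int(c[w] > 0) for w in vocabulary] for name, c in counts.items()}
```

Each row has one entry per vocabulary word, so every text ends up with the same number of columns no matter how long it is.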

Converting **unstructured** text to something **structured** is a multistep process. Let's learn the pieces before putting them together. We will start with one of the last steps first: converting a text to a list of numbers, with each number representing a different word.

Recall that the IMDB lab started with this. So if we can get to the list-of-numbers representation we can just use what we learned in that lab to construct a deep learning system.

## We are going to need a tokenizer.
A tokenizer takes a long string of text and converts it into individual words (tokens). At first you might think this is easy: just split the text by spaces. But it is not quite that easy. For example, should *you're* be considered one word or two (*you* followed by *'re*)? Is *ice-cream* one word or two? The Keras tokenizer does an additional step: it creates a dictionary of all the unique words of the text and creates indices for each one. Then it replaces the words in the text with their indices.
First, let's import the library:

In [52]:
from keras.preprocessing.text import Tokenizer

# now create a tokenizer

In [53]:
tokenizer = Tokenizer()

Now we will create some text:

In [54]:
trump1 = "How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Obama bad (or sick) guy! Sad"
trump2 = "Our wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster - is imploding fast! Sad"
trump3 = "Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great, and getting major things done!"
trump4 = "Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks! Sad"
dalaiLama1 = "The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion."
dalaiLama2 = "Be a kind and compassionate person. This is the inner beauty that is a key factor to making a better world."
dalaiLama3 = "If our goal is a happier, more peaceful world in the future, only education will bring change."
dalaiLama4 = "Love and compassion are important, because they strengthen us. This is a source of hope"
tinyCorpus = [trump1, trump2, trump3, trump4, dalaiLama1, dalaiLama2, dalaiLama3, dalaiLama4]
tinyCorpus


['How low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Obama bad (or sick) guy! Sad',
 'Our wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster - is imploding fast! Sad',
 "Don't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great, and getting major things done!",
 'Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks! Sad',
 'The purpose of education is to build a happier society, we need a more holistic approach that promotes the practice of love and compassion.',
 'Be a kind and compassionate person. This is the inner beauty that is a key factor to making a better world.',
 'If our goal is a happier, more peaceful world in the future, only education will bring change.',
 'Love and compassion are important, because they strengthe

## fit the corpus
Tokenization is a two-step process. Each step makes a pass through the data. The first step is `fit`, where the tokenizer goes through all the texts and constructs a dictionary containing all the unique words in the data.

In [56]:
tokenizer.fit_on_texts(tinyCorpus)

We can see a list of all the words in the dictionary and respective indices by:

In [57]:
tokenizer.word_index

{'a': 4,
 'admin': 67,
 'along': 68,
 'and': 3,
 'approach': 90,
 'are': 19,
 'bad': 42,
 'be': 93,
 'beauty': 98,
 'because': 112,
 'better': 102,
 'big': 17,
 'bill': 49,
 'bring': 109,
 'build': 86,
 'by': 21,
 'change': 110,
 'compassion': 26,
 'compassionate': 95,
 'complete': 55,
 'defeat': 82,
 'dems': 76,
 'disaster': 57,
 "don't": 60,
 'done': 72,
 'during': 36,
 'education': 22,
 'election': 12,
 'factor': 100,
 'fake': 15,
 'fast': 59,
 'for': 51,
 'future': 106,
 'getting': 20,
 'goal': 104,
 'gone': 32,
 'great': 69,
 'guy': 45,
 'happier': 23,
 'has': 30,
 'healthcare': 48,
 'holistic': 89,
 'hope': 117,
 'how': 28,
 'if': 103,
 'illegal': 83,
 'imploding': 58,
 'important': 111,
 'in': 9,
 'infighting': 65,
 'inner': 97,
 'is': 2,
 'key': 99,
 'kind': 94,
 'leaks': 84,
 'let': 61,
 'love': 25,
 'low': 29,
 'major': 70,
 'making': 101,
 'mask': 81,
 'media': 79,
 'more': 24,
 'my': 34,
 'need': 88,
 'negotiation': 53,
 'new': 47,
 'news': 16,
 'nixon': 40,
 'now': 50,
 'o

As you can see, the index of *love* is 25. Now we can use the tokenizer to convert words in the text to their corresponding indices. 

In [58]:
sequences = tokenizer.texts_to_sequences(tinyCorpus)
sequences

[[28,
  29,
  30,
  31,
  11,
  32,
  5,
  33,
  34,
  35,
  36,
  1,
  37,
  38,
  12,
  39,
  6,
  2,
  40,
  41,
  11,
  42,
  43,
  44,
  45,
  7],
 [13,
  46,
  47,
  48,
  49,
  2,
  50,
  14,
  51,
  52,
  3,
  53,
  54,
  2,
  4,
  55,
  3,
  56,
  57,
  2,
  58,
  59,
  7],
 [60,
  61,
  1,
  15,
  16,
  62,
  63,
  8,
  64,
  2,
  17,
  65,
  9,
  1,
  66,
  67,
  18,
  19,
  20,
  68,
  69,
  3,
  20,
  70,
  71,
  72],
 [73,
  74,
  2,
  15,
  16,
  75,
  14,
  21,
  1,
  76,
  3,
  77,
  78,
  21,
  1,
  79,
  9,
  80,
  5,
  81,
  1,
  17,
  12,
  82,
  3,
  1,
  83,
  84,
  7],
 [1,
  85,
  10,
  22,
  2,
  5,
  86,
  4,
  23,
  87,
  18,
  88,
  4,
  24,
  89,
  90,
  8,
  91,
  1,
  92,
  10,
  25,
  3,
  26],
 [93, 4, 94, 3, 95, 96, 6, 2, 1, 97, 98, 8, 2, 4, 99, 100, 5, 101, 4, 102, 27],
 [103, 13, 104, 2, 4, 23, 24, 105, 27, 9, 1, 106, 107, 22, 108, 109, 110],
 [25, 3, 26, 19, 111, 112, 113, 114, 115, 6, 2, 4, 116, 10, 117]]
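
If you are curious what those two passes amount to, here is a rough sketch in plain Python. It is not the Keras implementation; the helper names `simple_tokens`, `fit_word_index`, and `to_sequences` are hypothetical, and the assumption is simply that more frequent words get smaller indices, as in the `word_index` output above.

```python
from collections import Counter

def simple_tokens(text):
    """Lowercase and split on anything that is not a letter, digit, or apostrophe."""
    cleaned = "".join(ch if ch.isalnum() or ch == "'" else " " for ch in text.lower())
    return cleaned.split()

def fit_word_index(texts):
    """First pass: count words, then give the most frequent word index 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokens(text))
    # most_common() orders by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Second pass: replace each known word with its index."""
    return [[word_index[w] for w in simple_tokens(t) if w in word_index] for t in texts]

toy_corpus = ["Sad sad guy", "A kind guy"]
toy_index = fit_word_index(toy_corpus)
toy_sequences = to_sequences(toy_corpus, toy_index)
```

Here `toy_index` maps *sad* to 1 and *guy* to 2 because both occur twice, and index 0 is never assigned to a word.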

Now that we have that representation we can turn to what we learned in the IMDB notebook to create and deploy a deep learning system.

## Challenge 1
Can you convert the first item of `sequences` back to a string of words?

In [38]:
# TBD

Unfortunately, we missed an important step.

# stemming

You may know that the structure of a sentence is called syntax. So the sentence *The dogs chased the ball* consists of a noun phrase (NP - *the dogs*) followed by a verb phrase (VP - *chased the ball*). The VP consists of a verb (*chased*) followed by an NP (*the ball*), and that NP consists of a determiner *the* followed by a noun *ball*. We get a syntactic structure that looks like:

                                    S
                                  /   \
                                 /     \
                                /       \
                               NP        VP
                             /   \      |   \
                            Det   N     V    \   
                            |     |     |      NP  
                           the   dogs  chased  |  \
                                               |   \
                                              Det   N
                                               |    |
                                              the  ball
                                              
Similarly, words have internal structure. So *dogs* is really `dog+PLURAL` and *chased* is `chase+PAST`. This structure is called morphology and the analysis step is called morphological analysis. For many classification tasks, we don't care whether the person wrote *dogs* or *dog*. Or *chasing*, *chased*, or *chases* instead of *chase*. We might want to count all those variants of *chase* simply as *chase*. So instead of having separate attributes for *chase*, *chasing*, *chased*, and *chases*, we reduce them to one.
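
To see how collapsing variants shrinks the number of attributes, here is a deliberately naive sketch. The suffix list and the `naive_stem` helper are made up for illustration; a real stemmer has many more rules.

```python
from collections import Counter

# Hypothetical, deliberately naive rules: strip one common suffix.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["dogs", "dog", "chasing", "chased", "chases", "chase"]
raw_counts = Counter(variants)                              # six separate attributes
stemmed_counts = Counter(naive_stem(w) for w in variants)   # three attributes
```

Notice that these toy rules turn *chasing*, *chased*, and *chases* into `chas` but leave *chase* alone, so the variants are still not fully merged. Handling cases like that is what the extra rules in a real stemmer are for.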

The best way to do this task is with a morphological analyzer, but writing a good morphological analyzer is extremely tricky, so data scientists often use a much simpler solution called **stemming**. There are a number of stemming algorithms available to us. Here is how to use one called the Snowball Stemmer:

In [16]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [17]:
print(stemmer.stem('chasing'))
print(stemmer.stem('chased'))
print(stemmer.stem('chases'))

chase
chase
chase


The above look like real words, but this is not always the case with a stemmer:

In [168]:
print(stemmer.stem('everyone'))
print(stemmer.stem('please'))
print(stemmer.stem('president'))
print(stemmer.stem('compassionate'))


So it seems that we could improve on our initial representation of the Dalai Lama/Trump tweets if, for example, *compassion* and *compassionate* were lumped together, and this needs to be done before the tokenizer does its job. So going back to our tinyCorpus we might

### 1. Divide the texts into sequences of words


In [169]:
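from keras.preprocessing.text import text_to_word_sequence
# build real lists (not one-shot map iterators) so later cells can reuse them
words = [text_to_word_sequence(doc) for doc in tinyCorpus]
words

### 2. stem the words


In [170]:
# stem every word of every document; the next cell expects the name `stemmed`
stemmed = [[stemmer.stem(word) for word in doc] for doc in words]
stemmed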

As you can now see the stemmer 

transformed | into
 :---: | :---:
 president | presid
 during | dure
 negotiation | negoti   
 
 Finally we can perform tokenization, but first we must
 
 ### 3. Glom things back together
 
 

In [63]:
texts = [" ".join(stemmed_words) for stemmed_words in stemmed]
texts

['how low has presid obama gone to tapp my phone dure the veri sacr elect process this is nixon waterg obama bad or sick guy sad',
 'our wonder new healthcar bill is now out for review and negoti obamacar is a complet and total disast is implod fast sad',
 "don't let the fake news tell you that there is big infight in the trump admin we are get along great and get major thing done",
 'russia talk is fake news put out by the dem and play up by the media in order to mask the big elect defeat and the illeg leak sad',
 'the purpos of educ is to build a happier societi we need a more holist approach that promot the practic of love and compass',
 'be a kind and compassion person this is the inner beauti that is a key factor to make a better world',
 'if our goal is a happier more peac world in the futur onli educ will bring chang',
 'love and compass are import becaus they strengthen us this is a sourc of hope']

### 4. Tokenization

In [66]:
stokenizer = Tokenizer()
# fit on the stemmed, re-joined texts rather than the raw corpus
stokenizer.fit_on_texts(texts)
sequences = stokenizer.texts_to_sequences(texts)
sequences[0]

[28,
 29,
 30,
 31,
 11,
 32,
 5,
 33,
 34,
 35,
 36,
 1,
 37,
 38,
 12,
 39,
 6,
 2,
 40,
 41,
 11,
 42,
 43,
 44,
 45,
 7]

## Tokenization Parameters
One parameter a tokenizer object takes is the number of words. If we specify this, for example,

    tokenizer = Tokenizer(num_words = 100)

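the tokenizer will only use the most frequent words in the data when it converts texts to sequences. (According to the Keras documentation, only the `num_words - 1` most common words are kept.)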
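
Here is a small sketch of the effect of that limit, written in plain Python rather than with Keras. It assumes indices are ranked by frequency (1 is the most frequent word) and that any index at or above `num_words` is simply dropped; the helper name `limit_vocabulary` is hypothetical.

```python
def limit_vocabulary(sequences, num_words):
    """Drop every index that is not below num_words (index 1 = most frequent)."""
    return [[i for i in seq if i < num_words] for seq in sequences]

# Indices as a frequency-ranked tokenizer might produce them.
example_sequences = [[1, 5, 2, 9], [3, 3, 12, 1]]
limited = limit_vocabulary(example_sequences, num_words=5)
```

With `num_words=5` only indices 1 through 4 survive, so rare words vanish from the sequences and the texts get shorter.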

## Extracting parts of a text file
Suppose I have the following important email:

In [71]:
important_email = """
To: Ron Zacharski <ron.zacharski@gmail.com>
From: Susan Williams <desmondwilliams614@yahoo.com>
Reply-To: Susan Williams <deswill0119@yahoo.fr>
Message-ID: <1860373470.1061917.1488479328300@mail.yahoo.com>
Subject: Hello,
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-FileName:

Hello,

Greetings. With warm heart I offer my friendship and greetings, and I hope that this mail 
will meets you in good time.

However strange or surprising this contact might seem to you as we have not meet personally 
or had any dealings in the past. I humbly ask that you take due consideration of its 
importance and immense benefit.

My name is Susan Williams from Republic of Sierra-Leone. I have something very important 
that i would like to confide in you please,I have a reasonable amount of money which 
i inherited from my late father (Nine Million Five Hundred thousand United States Dollar}.
US$9.500.000.00.which I want to invest in your country with you and again in a very 
profitable venture.
"""


And suppose we only want the body of the email (and want to strip out the header). We can do that as follows:


In [172]:
import urllib.request



def parseOutText(all_text):
    """ given the full text of an email as a string, return all text
        below the metadata block at the top
        """
    text_string = ''
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        text_string = content[1]
        # put your code here
    return text_string

## This part just tests the code
target_url = 'http://zacharski.org/files/courses/cs370/important_email.txt'
openfile = urllib.request.urlopen(target_url)
data = openfile.read()
text = data.decode('utf-8')
new_text = parseOutText(text)
print(text)
print(new_text)

To: Ron Zacharski <ron.zacharski@gmail.com>
From: Susan Williams <desmondwilliams614@yahoo.com>
Reply-To: Susan Williams <deswill0119@yahoo.fr>
Message-ID: <1860373470.1061917.1488479328300@mail.yahoo.com>
Subject: Hello,
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-FileName:

Hello,

Greetings. With warm heart I offer my friendship and greetings, and I hope that this mail 
will meets you in good time.

However strange or surprising this contact might seem to you as we have not meet personally 
or had any dealings in the past. I humbly ask that you take due consideration of its 
importance and immense benefit.

My name is Susan Williams from Republic of Sierra-Leone. I have something very important 
that i would like to confide in you please,I have a reasonable amount of money which 
i inherited from my late father (Nine Million Five Hundred thousand United States Dollar}.
US$9.500.000.00.which I want to invest in your country with you and again in a very 
profitable ven

## Warmup 
Can you modify the code above so it returns the stemmed version of the file? So it should return:

    hello greet with warm heart i offer my friendship and greet and i hope that this mail will meet 
    you in good time howev strang or surpris this contact might seem to you as we have not meet person 
    or had ani deal in the past i humbl ask that you take due consider of it import and immens benefit 
    my name is susan william from republ of sierra leon i have someth veri import that i would like to 
    confid in you pleas i have a reason amount of money which i inherit from my late father nine million 
    five hundr thousand unit state dollar us 9 500 000 00 which i want to invest in your countri with you 
    and again in a veri profit ventur
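
As a hint for the warmup, notice that in the expected output the punctuation has been replaced by spaces and everything is lowercase. One way to get that far, sketched with the standard library only (the helper name `normalize` is an assumption, and the stemming step is left to you):

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(PUNCT_TO_SPACE).split())

sample = "US$9.500.000.00.which I want to invest"
normalized = normalize(sample)
```

The `text_to_word_sequence` function we used earlier does a similar cleanup, so either route should get you to a list of words you can feed to the stemmer.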


# Partner Mini Project (PMP)
This project is from the Udacity Course *Introduction to Machine Learning*

We are going to work through, step-by-step, the pre-processing we must do to prepare a text for a classification task.



## PMP1 - Downloading 400MB of the Enron Email Dataset
You only need to do this once (so using the 'Run All Cells' command on this notebook might not be a wise decision).

Prior to the Hillary Clinton email dataset, the Enron dataset was the largest publicly available email set in the known universe. According to Wikipedia: "The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse."

Our goal for this mini-project is to use emails from two people to see if we can build a classifier that can identify the author of an email. 

Because of its size and the resulting length of time it will take, let's divide the code into 2 cells:

#### first download ...

Feel free to change the location of the download.

In [69]:
import urllib.request  # importing just `urllib` does not guarantee that `urllib.request` is loaded
url = "http://zoo.cs.yale.edu/classes/cs458/lectures/sklearn/ud/ud120-projects-master/enron_mail_20150507.tgz"
urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz")
print("download complete!")



download complete!


#### now uncompress it!
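If you want to uncompress the data into a different place in your directory structure, uncomment the `chdir` lines and edit them to point to the correct place. Note that the previous cell saved the archive to `../enron_mail_20150507.tgz`, so the path passed to `tarfile.open` must point to wherever you saved the file.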

In [173]:
import tarfile
import os
#os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")
#os.chdir("notebooks")
print ("you're ready to go!")

you're ready to go!


## Part 4 - read the data and stem it
Just to reiterate, this mini-project and the majority of the write-up is from Udacity. They did a great job in putting this together.

In the next few code blocks, we will iterate through all the emails from Chris and from Sara. For each email, do three things:

1. apply parseOutText to the opened email to extract its text and return the stemmed text string
2. remove signature words (“sara”, “shackleton”, “chris”, “germani”--bonus points if you can figure out why it's "germani" and not "germany")
3. append the updated text string to word_data -- if the email is from Sara, append 0 (zero) to from_data, or append a 1 if Chris wrote the email.

Once this step is complete, you should have two lists: one contains the stemmed text of each email, and the second should contain the labels that encode (via a 0 or 1) who the author of that email is.
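
The shape of those two parallel lists can be sketched on toy data. Everything here is made up for illustration: the two email bodies, the `toy_` variable names, and the `remove_signature_words` helper. The real loop further down reads the emails from disk and calls parseOutText first.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]

def remove_signature_words(text):
    """Delete each signature word, then tidy the leftover whitespace."""
    for word in SIGNATURE_WORDS:
        text = text.replace(word, "")
    return " ".join(text.split())

# Hypothetical, already-stemmed email bodies paired with their authors.
toy_emails = [("sara", "pleas review the draft sara shackleton"),
              ("chris", "the deal is done chris germani")]

toy_word_data = []   # one cleaned text per email
toy_from_data = []   # one label per email: 0 for Sara, 1 for Chris
for author, stemmed_text in toy_emails:
    toy_word_data.append(remove_signature_words(stemmed_text))
    toy_from_data.append(0 if author == "sara" else 1)
```

Keep in mind that `str.replace` removes the letters wherever they occur, even inside a longer word, so a word such as *christma* would lose its first five letters.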

Running over all the emails can take a little while (5 minutes or more), so we've added a temp_counter to cut things off after the first 200 emails. Of course, once everything is working, you'd want to run over the full dataset.

In [91]:
target_url = 'http://zacharski.org/files/courses/cs370/from_sara.txt'
openfile = urllib.request.urlopen(target_url)
data = openfile.read()
from_sara = data.decode('utf-8').split('\n')


target_url = 'http://zacharski.org/files/courses/cs370/from_chris.txt'
openfile = urllib.request.urlopen(target_url)
data = openfile.read()
from_chris = data.decode('utf-8').split('\n')



['maildir/bailey-s/deleted_items/101.',
 'maildir/bailey-s/deleted_items/106.',
 'maildir/bailey-s/deleted_items/132.',
 'maildir/bailey-s/deleted_items/185.',
 'maildir/bailey-s/deleted_items/186.',
 'maildir/bailey-s/deleted_items/187.',
 'maildir/bailey-s/deleted_items/193.',
 'maildir/bailey-s/deleted_items/195.',
 'maildir/bailey-s/deleted_items/214.',
 'maildir/bailey-s/deleted_items/215.',
 'maildir/bailey-s/deleted_items/233.',
 'maildir/bailey-s/deleted_items/242.',
 'maildir/bailey-s/deleted_items/243.',
 'maildir/bailey-s/deleted_items/244.',
 'maildir/bailey-s/deleted_items/246.',
 'maildir/bailey-s/deleted_items/247.',
 'maildir/bailey-s/deleted_items/254.',
 'maildir/bailey-s/deleted_items/259.',
 'maildir/bailey-s/deleted_items/260.',
 'maildir/bailey-s/deleted_items/261.',
 'maildir/bailey-s/deleted_items/263.',
 'maildir/bailey-s/deleted_items/278.',
 'maildir/bailey-s/deleted_items/290.',
 'maildir/bailey-s/deleted_items/296.',
 'maildir/bailey-s/deleted_items/302.',


In [174]:
# CODE FROM UDACITY altered to work with Keras

import os
import pickle
import re
import sys
import string # raz




from_data = []
word_data = []

### temp_counter is a way to speed up the development--there are
### thousands of emails from Sara and Chris, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 emails in the list so you
### can iterate your modifications quicker
temp_counter = 0


for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    print('PROCESSING  ' + name)
    temp_counter = 0
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this limit to run over full dataset
        temp_counter += 1   # count each email once (a second increment would halve the limit)
        if temp_counter < 200:   # raz
            if path != '':
                path = os.path.join('/Users/raz/Code/deep-learning/', path)
                #print(path)
                email = open(path, "r")
                all_text = email.read()
                email.close()
                ### use parseOutText to extract the text from the opened email
                

                ### use str.replace() to remove any instances of the words
                ### ["sara", "shackleton", "chris", "germani"]
                
                ### append the text to word_data

                
                ### - if the email is from Sara, append 0 (zero) to from_data, or append a 1 if Chris wrote the email.


                

print ("emails processed")
    
# uncomment the lines below when you have the above working.
#with open('../data/word_data.pkl', 'wb') as word_file:
#    pickle.dump( word_data, word_file)
#with open('../data/email_authors.pkl', 'wb') as author_file:
#    pickle.dump( from_data, author_file)

print('Processing Complete')


PROCESSING  sara
PROCESSING  chris
emails processed
Processing Complete


### Great job!
Just for a check, what do you get for `word_data[152]`? I get

    'tjone nsf stephani and sam need nymex calendar'
    

## Congratulations!
Before we forget, let's go back to the above and load in all the data rather than the first 200. Also, uncomment the pickle lines so we don't need to go back and redo the work.

## Now it is your turn to code. 
1. We need to tokenize the texts to convert them to integer indices  (let's use **2,000 words for our dictionary size**)
2. We need to vectorize the texts 

Go ahead and do that work now.
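
As a reminder of what "vectorize" meant in the IMDB lab, here is a sketch in plain Python with nested lists. The function name `vectorize_sequences` and the toy sequences are assumptions; in your own solution you will probably want a NumPy array instead of a list of lists.

```python
def vectorize_sequences(sequences, dimension):
    """Return one row per sequence with a 1.0 at every index that occurs in it."""
    rows = []
    for seq in sequences:
        row = [0.0] * dimension
        for index in seq:
            if index < dimension:   # ignore indices outside the dictionary size
                row[index] = 1.0
        rows.append(row)
    return rows

toy_sequences = [[1, 3, 3], [2, 5]]
toy_X = vectorize_sequences(toy_sequences, dimension=6)
```

For the mini-project the dimension would be the dictionary size (2,000), so every email becomes a row of 2,000 zeros and ones.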



## We should also vectorize our labels

## Divide into training and testing data
Once we have our data vectorized, let's divide it into training and testing sets. I named my vectorized data `X` and my vectorized labels `y` -- rename those variables in the code below to match yours:

In [159]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)


## Finally the network.
Go ahead and create a network, train it and then classify the test set. How did we do?

### Create the model of the network

###  Compile the network

### Then fit it

### Now compute how well we did on the test data using the evaluate method

###  Let's use Matplotlib to plot the training and validation loss side by side, as well as the training and validation accuracy:

# Congratulations!!


<img src= "https://arjan-hada.github.io/images/enron_fraud.jpg" width=600 />



<img src="https://static01.nyt.com/images/2012/04/06/business/dbpix-enron-whitecollar/dbpix-enron-whitecollar-tmagArticle.jpg" width="600"/>

