#DATSCI W261 Week 5: JRW Notes
##Topics: 
    > Synonyms and synonym detection
    > Google 5-grams
    > AWS command line interface
    > Managing data on s3
    > Modifying the default Hadoop sort in MRJob 
    > Making files in s3 buckets public for distributed access
    > Reading distributed files into mappers
    > Iteration over MRJobs on AWS


###Synonym detection and the nltk evaluation

In [187]:
#!/usr/bin/python2.7
''' Pass a string to this function (e.g., 'car') and it will return a list of
related words: the lemma names of all WordNet synsets of that string. '''
import nltk
from nltk.corpus import wordnet as wn
import sys
# collect the lemma names from every synset of the given word
def synonyms(string):
    syndict = {}
    for synset in wn.synsets(string):
        for syn in synset.lemma_names():
            # dict keys de-duplicate lemmas shared across synsets
            syndict.setdefault(syn,1)
    return syndict.keys()
if 'auto' in synonyms("car"):
    print "got one!\n"

for syn in synonyms("car"):
    print syn


got one!

railcar
auto
car
automobile
cable_car
machine
railway_car
motorcar
elevator_car
gondola
railroad_car


###Google N-grams data and the AWS Command Line Interface (CLI)
Here we will look at a brief example of the AWS CLI.
In general, the CLI makes it easy to interact with data on Amazon's storage cluster (s3),
and has many familiar commands (e.g., cp, rm, etc.). For some general information on
how to get set up with the AWS CLI, please follow the appropriate link for your platform
starting at the following URL:

    http://docs.aws.amazon.com/AWSEC2/latest/CommandLineReference/set-up-ec2-cli-linux.html
    
In the remainder of this tutorial, we will work with a filtered-down sample of 10,000 5-grams
from the Google Books project (shown below). For reference, the data is tab-delimited,
with the following columns:

    (5-gram) \t (total_count) \t (pages_count) \t (books_count)

In [73]:
!head -10 gbooks_filtered_sample.txt

A Ball in Old Vienna	131	124	63
A Bibliography of First and	44	44	42
A Biographical Dictionary of Women	222	212	156
A Book of Ballads on	48	48	48
A Book of New Intellectual	400	400	327
A Brief Guide to the	1291	1244	995
A Case Study of Labor	62	61	50
A Commentary on the Method	127	125	91
A Commission of the European	46	46	44
A Constitutional Analysis of the	187	180	132
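
A minimal sketch of how one record of this file can be parsed in Python. The helper name `parse_ngram_line` and the field names are assumptions for illustration; the order of the last two fields follows the way the mrjob mappers in these notes unpack them (pages, then books).

```python
from collections import namedtuple

# Field names are illustrative; the last two follow the unpacking order
# used by the mrjob mappers in these notes.
NgramRecord = namedtuple("NgramRecord", ["ngram", "count", "pages", "books"])

def parse_ngram_line(line):
    """Split one tab-delimited record into a typed NgramRecord."""
    ngram, count, pages, books = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(count), int(pages), int(books))

# First record of the sample shown above.
record = parse_ngram_line("A Ball in Old Vienna\t131\t124\t63\n")
words = record.ngram.split(" ")
```

Converting the counts to integers once, at parse time, avoids accidental string comparisons later on.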


###Managing data on s3 with the AWS CLI
We're not only going to want to make sure we can run our jobs on AWS, 
but also manage our data on the cloud using the AWS storage cluster, s3.
Here is an example that uses the AWS CLI to copy our data up to a bucket
(ucb-mids-mls-hw5) that we have created. To make your own bucket, try the following:

    aws s3 mb s3://ucb-mids-mls-FirstnameLastname/

or use the utility in the GUI console. Buckets may also be removed as follows:

    aws s3 rb s3://ucb-mids-mls-FirstnameLastname/


In [97]:
!aws s3 mb s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/
!aws s3 cp gbooks_filtered_sample.txt s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt

make_bucket: s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt/
upload: ./gbooks_filtered_sample.txt to s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt


###An example mrjob class that uses a non-default Hadoop sort
In HW2, we were interested in doing a numeric sort on some integer keys (as strings)
by leveraging the default Hadoop sort (in the shuffle). This required padding our
integers as strings with zeros, so that the lexicographic sort would be equivalent 
to the desired numeric sort. However, this only got us to an increasing numeric sort.
So, how can we get a decreasing (rank) numeric sort while using mrjob?
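
As a reminder of why the padding was needed, here is a small plain-Python sketch (no Hadoop involved) comparing a string sort with and without zero padding. The counts are taken from the sample shown earlier; the pad width of 6 is an arbitrary choice for illustration.

```python
counts = [131, 44, 1291, 62]

# Plain string sort: compares character by character, so "1291" < "131" < "44".
as_strings = sorted(str(c) for c in counts)

# Zero-padding to a fixed width (6 here, an arbitrary choice) makes the
# lexicographic order agree with the numeric order.
padded = sorted(str(c).zfill(6) for c in counts)

# Increasing numeric order recovered from the padded strings.
recovered = [int(p) for p in padded]
```

The padded sort is still increasing only, which is the limitation the jobconf approach addresses.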
####def jobconf(self):
By defining this function in the mrjob class, we are able to pass parameters
to the Hadoop streaming operation going on underneath mrjob. Look closely at
this function in the script below, and notice that we merge the custom_jobconf 
configuration (with our sort parameters) with the default, so that we don't override
anything important from the default configuration that is not in our custom configuration.

####Changing the sort order with jobconf(self):
In the code below, we show how to change the default sort order, but there are many
other custom configurations that are possible, e.g., partitioning by multiple keys.
For some more general information and other good examples 
check out the following documentation:

http://hadoop.apache.org/docs/r1.0.4/streaming.html#Hadoop+Comparator+Class

Focusing in on the sort parameters in our custom configuration, we have:
>'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',

which will stay the same across your custom sorting needs, and
>'mapred.text.key.comparator.options': '-k1rn',

whose flags provide the functionality:
>k1: sort by the first key (keys can be multi-part, with user-specified delimiters)

>r: sort in reverse

>n: sort numerically
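
As a rough local illustration of what '-k1rn' asks Hadoop to do (this is plain Python, not Hadoop's comparator), the sketch below sorts a few (count, ngram) pairs from the sample above by the first field, numerically and in reverse.

```python
# (count, ngram) pairs as the mapper would emit them; values come from the
# sample shown earlier.
pairs = [
    (131, "A Ball in Old Vienna"),
    (44, "A Bibliography of First and"),
    (1291, "A Brief Guide to the"),
    (62, "A Case Study of Labor"),
]

# k1 -> use the first field as the sort key
# n  -> compare it as a number
# r  -> reverse the order (largest first)
ranked = sorted(pairs, key=lambda pair: float(pair[0]), reverse=True)
top_ngram = ranked[0][1]
```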



In [52]:
%%writefile sortingExample.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class sortingExample(MRJob):
    def jobconf(self):
        # Start from mrjob's default jobconf so nothing important is lost
        orig_jobconf = super(sortingExample, self).jobconf()        
        custom_jobconf = {
            'mapred.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapred.text.key.comparator.options': '-k1rn',
        }
        # Layer the sort settings on top of a copy of the defaults
        combined_jobconf = dict(orig_jobconf)
        combined_jobconf.update(custom_jobconf)
        # Return the merged dict; assigning it to self.jobconf would replace this method
        return combined_jobconf
    def steps(self):
        return [MRStep(mapper = self.mapper, reducer = self.reducer)]
    def mapper(self, _, line):
        # str.strip() returns a new string, so keep the result
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        yield int(count),ngram
    def reducer(self,count,values):
        for ngram in values:
            yield ngram,count
    
if __name__ == '__main__':
    sortingExample.run()

Overwriting sortingExample.py


In [53]:
!chmod +x sortingExample.py

###Run the sorting example locally and stream the output through STDOUT
Just to make sure our mrjob class is working, we can check using our local copy of the data.
Notice that our sort has not worked. The custom config in our class is a set of Hadoop options,
so it only takes effect when the job runs on Hadoop (here, on AWS), as we'll see below; the local
run uses a plain sort (see the '> sort' line in the log). Otherwise, things appear fine.


In [55]:
!./sortingExample.py gbooks_filtered_sample.txt | head -50

using configs in /Users/jakerylandwilliams/.mrjob.conf
creating tmp directory /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307
writing to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307/step-0-mapper_part-00000
Counters from step 1:
  (no counters found)
writing to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307/step-0-mapper-sorted
> sort /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307/step-0-mapper_part-00000
writing to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307/step-0-reducer_part-00000
Counters from step 1:
  (no counters found)
Moving /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220554.880307/step-0-reducer_part-00000 -> /var/folders/3y/665tnx6s0

###Running the sorting example on AWS with data from s3
Here, we send our code up to AWS to meet the data at s3. Note the '-r emr'
flags signaling an EMR job, in addition to the s3 url for our data in our bucket.
If you look closely at the bottom, you'll see the output is reverse numeric sorted!
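
One way to confirm the ordering without reading the output by eye is a small check like the following sketch. It assumes the job output has been saved as lines of the form `"ngram"<TAB>count`, as in the sorted output shown later in these notes; the function name is hypothetical.

```python
def is_reverse_numeric_sorted(lines):
    """Return True if the trailing counts never increase from line to line."""
    counts = [int(line.rstrip("\n").rsplit("\t", 1)[1]) for line in lines]
    return all(a >= b for a, b in zip(counts, counts[1:]))

# A few lines in the same format as the job output.
sample = [
    '"I have no doubt that"\t285570',
    '"At a meeting of the"\t104087',
    '"I ought not to have"\t34287',
]
ok = is_reverse_numeric_sorted(sample)
not_ok = is_reverse_numeric_sorted(list(reversed(sample)))
```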

In [56]:
!./sortingExample.py s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt -r emr | head -50

using configs in /Users/jakerylandwilliams/.mrjob.conf
using existing scratch bucket mrjob-070799b65f5ef217
using s3://mrjob-070799b65f5ef217/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220814.786777
writing master bootstrap script to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150928.220814.786777/b.py
Copying non-input files into s3://mrjob-070799b65f5ef217/tmp/sortingExample.jakerylandwilliams.20150928.220814.786777/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-645UWBTVR1M4
Created new job flow j-645UWBTVR1M4
Job launched 30.9s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.1s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 124.3s ago, status STARTING: Provisioni

###Storing output on s3
Supposing we wish to store the output of our job on s3 
(which can then be available as input for, say, some other job), 
we can specify an output directory (on s3) for mrjob, much as with Hadoop streaming:

    --output-dir=s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted 
    
However, this alone will not stop the output from also streaming back to our local machine;
to suppress that, add the separate parameter:

    --no-output

In [102]:
!./sortingExample.py s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt -r emr \
    --output-dir=s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted \
    --no-output

using configs in /Users/jakerylandwilliams/.mrjob.conf
using existing scratch bucket mrjob-070799b65f5ef217
using s3://mrjob-070799b65f5ef217/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150929.044057.664324
writing master bootstrap script to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/sortingExample.jakerylandwilliams.20150929.044057.664324/b.py
Copying non-input files into s3://mrjob-070799b65f5ef217/tmp/sortingExample.jakerylandwilliams.20150929.044057.664324/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-NERAUZK08MKU
Created new job flow j-NERAUZK08MKU
Job launched 35.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 66.7s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 98.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 129.2s ago, status STARTING: Provisioni

###Download our output and clean things up a little
To access our stored data, we can once again use the AWS CLI command 'cp'.
To download our program's output, we must look inside of the specified target directory 
for the usual Hadoop output files (part-00000 and _SUCCESS).

In [103]:
!aws s3 cp s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/part-00000 gbooks_filtered_sample_sorted.txt
!aws s3 rm s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/part-00000
!aws s3 rm s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/_SUCCESS

download: s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/part-00000 to ./gbooks_filtered_sample_sorted.txt
delete: s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/part-00000
delete: s3://ucb-mids-mls-hw5/gbooks_filtered_sample_sorted/_SUCCESS


###Check on our output

In [104]:
!head -50 gbooks_filtered_sample_sorted.txt 

"I have no doubt that"	285570
"At a meeting of the"	104087
"I ought not to have"	34287
"I am not certain that"	19329
"I am struck by the"	12965
"But in the first place"	12026
"American University in Cairo Press"	8737
"From the above discussion it"	8617
"I have been reading the"	7962
"According to the nature of"	7723
"But by that time the"	7510
"I took it upon myself"	7379
"A new commandment I give"	6149
"How long would it be"	5449
"I am taking the liberty"	5067
"He died of a heart"	5063
"But I could see that"	4762
"History of the American Negro"	4704
"As with so many of"	4699
"Association of the University of"	4693
"I am trying to show"	4548
"I had been with the"	4520
"Although the results of the"	4228
"He carried with him a"	4208
"I have got to go"	4168
"I learned that there was"	4135
"I told myself it was"	4103
"I am in no mood"	4078
"I do not like you"	4067
"I thought it was about"	3997
"God has given to man"	3996
"Board of the Southern Baptist"	3946


###Running our job from a python driver
Much as we have seen before, we can run our job with data from s3
from a driver that will receive streaming output: just pass the s3 URL
of the main data file as the input argument.

In [116]:
%%writefile sortingExample_driver.py
#!/usr/bin/python
from sortingExample import sortingExample
mr_job = sortingExample(args=['s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt','-r', 'emr'])
with mr_job.make_runner() as runner: 
    runner.run()
    # stream_output: iterate over the job's output lines as they come back
    for line in runner.stream_output():
        ngram,count = mr_job.parse_output_line(line)
        print ngram + "\t" + str(count)

Overwriting sortingExample_driver.py


In [117]:
!chmod +x sortingExample_driver.py

In [118]:
!./sortingExample_driver.py | head -50

I have no doubt that	285570
At a meeting of the	104087
I ought not to have	34287
I am not certain that	19329
I am struck by the	12965
But in the first place	12026
American University in Cairo Press	8737
From the above discussion it	8617
I have been reading the	7962
According to the nature of	7723
But by that time the	7510
I took it upon myself	7379
A new commandment I give	6149
How long would it be	5449
I am taking the liberty	5067
He died of a heart	5063
But I could see that	4762
History of the American Negro	4704
As with so many of	4699
Association of the University of	4693
I am trying to show	4548
I had been with the	4520
Although the results of the	4228
He carried with him a	4208
I have got to go	4168
I learned that there was	4135
I told myself it was	4103
I am in no mood	4078
I do not like you	4067
I thought it was about	3997
God has given to man	3996
Board of the Southern Baptist	3946
I do not feel so	3780
Houses of Parliament and the	3736
Guidelines for the Management of	3624
I 

##Reading and writing cached data
###Writing to s3
Sometimes (as is required with an iterative implementation of the Apriori algorithm) 
we will need to store data from one job and broadcast it to the mappers in another job.
This first MRJob class starts an implementation of the Apriori algorithm for 
the words inside of our 5-grams, storing the frequent 1-grams (words) and their support on s3
via the 

    --output-dir=s3://ucb-mids-mls-hw5/readWriteExample_output
    
command line parameter. We will then need to use the AWS CLI to copy our data 
from the usual Hadoop output location:

    s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
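
For intuition, here is a minimal single-machine sketch of the same first pass: each word in a 5-gram is credited with the 5-gram's total count, and only words whose summed count reaches the threshold are kept. The function name and the tiny in-memory input are assumptions for illustration; the input lines are taken from the sample shown earlier.

```python
from collections import Counter

def frequent_words(lines, thresh=1000):
    """Sum 5-gram counts per word and keep words at or above thresh."""
    support = Counter()
    for line in lines:
        ngram, count = line.rstrip("\n").split("\t")[:2]
        for word in ngram.split(" "):
            # Mirrors the mapper: one (word, count) pair per word occurrence.
            support[word] += int(count)
    return {word: total for word, total in support.items() if total >= thresh}

lines = [
    "A Brief Guide to the\t1291\t1244\t995",
    "A Ball in Old Vienna\t131\t124\t63",
    "A Case Study of Labor\t62\t61\t50",
]
frequent = frequent_words(lines)
```

In the MRJob version, the summation and the threshold test both happen in the reducer, one word at a time.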


In [170]:
%%writefile writeExample.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations

class writeExample(MRJob):
    thresh = 1000
    def steps(self):
        return [MRStep(
                mapper = self.mapper, 
                reducer = self.reducer
            )]
    def mapper(self, _, line):
        # str.strip() returns a new string, so keep the result
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        words = re.split(" ",ngram)
        count = int(count)
        for word in words:
            yield word,count
    def reducer(self,word,values):
        count = sum(values)
        if count >= self.thresh:
            yield word,count
    
if __name__ == '__main__':
    writeExample.run()

Overwriting writeExample.py


In [171]:
!chmod +x writeExample.py

In [172]:
!./writeExample.py s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt -r emr \
    --output-dir=s3://ucb-mids-mls-hw5/readWriteExample_output \
    --no-output

using configs in /Users/jakerylandwilliams/.mrjob.conf
using existing scratch bucket mrjob-070799b65f5ef217
using s3://mrjob-070799b65f5ef217/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/writeExample.jakerylandwilliams.20150930.215411.663562
writing master bootstrap script to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/writeExample.jakerylandwilliams.20150930.215411.663562/b.py
Copying non-input files into s3://mrjob-070799b65f5ef217/tmp/writeExample.jakerylandwilliams.20150930.215411.663562/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-X3XVQZPI1UZX
Created new job flow j-X3XVQZPI1UZX
Job launched 30.9s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.1s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 124.1s ago, status STARTING: Provisioning Ama

###Copy to local and examine output

In [173]:
!aws s3 cp s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 ./frequentSupport.txt
!head -25 frequentSupport.txt

download: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to ./frequentSupport.txt
"A"	74206
"AND"	3684
"Abbott"	1758
"About"	1805
"Academy"	1100
"According"	10013
"Account"	1152
"Act"	4238
"Africa"	1928
"After"	12056
"Age"	1468
"Ages"	1093
"All"	13245
"Although"	13899
"Alvin"	2561
"America"	5649
"American"	25879
"Americans"	1616
"Among"	3895
"An"	16379
"Analysis"	4320
"And"	44804
"Annual"	1344
"Anomalous"	1706
"Another"	3324


###Clean things up and store our output on s3

In [174]:
!aws s3 cp s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 s3://ucb-mids-mls-hw5/frequentSupport.txt
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS

copy: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to s3://ucb-mids-mls-hw5/frequentSupport.txt
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS


###Reading cached files from s3
We have already stored our output from the previous job in our bucket:

    s3://ucb-mids-mls-hw5/frequentSupport.txt,

but to read cached files from s3 over HTTP, we will have to make our bucket public (see the post below),

    http://tiffanybbrown.com/2014/09/making-all-objects-in-an-s3-bucket-public-by-default/

so that its contents will be accessible from a URL of the form:

    https://s3-us-west-2.amazonaws.com/ucb-mids-mls-hw5/frequentSupport.txt
    
and can be read with the Python commands

    import urllib2
    
    f = urllib2.urlopen(URL)
    for line in f.readlines():
    ...
    
discussed in the AWS forum:

    https://forums.aws.amazon.com/thread.jspa?messageID=341543
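
The quotes in the stored file are there because mrjob's default output protocol writes keys and values as JSON. As an alternative to stripping the quotes with a regular expression, the sketch below decodes each field with the standard json module. The function name `load_support` is hypothetical, and the input is a short in-memory list standing in for the lines returned by urlopen.

```python
import json

def load_support(lines):
    """Return (support dict, N) where N is the size of the next itemsets."""
    support = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        # Both fields are JSON: '"A,B"' -> 'A,B' and '1426' -> 1426.
        support[json.loads(key)] = json.loads(value)
    # Stored keys are comma-joined word sets; the next pass builds sets one larger.
    sizes = {len(key.split(",")) for key in support}
    n_next = max(sizes) + 1 if sizes else 1
    return support, n_next

# Two lines in the format of the stored frequent-support file.
support, n_next = load_support(['"A"\t74206\n', '"AND"\t3684\n'])
```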
    

In [164]:
%%writefile readWriteExample.py
#!/usr/bin/python
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
from itertools import combinations
import urllib2

class readWriteExample(MRJob):
    thresh = 1000
    N = 0
    freqSupport = {}
    def steps(self):
        return [MRStep(
                mapper_init = self.mapper_init,
                mapper = self.mapper,
                reducer = self.reducer
            )]
    def mapper_init(self):
        f = urllib2.urlopen("https://s3-us-west-2.amazonaws.com/ucb-mids-mls-hw5/frequentSupport.txt")
        for line in f.readlines():
            # str.strip() returns a new string, so keep the result
            line = line.strip()
            ### because the yielded output produces quotes!
            line = re.sub("\"","",line)
            ngram,count = re.split("\t",line)
            self.freqSupport[ngram] = int(count)
            if self.N == 0:
                self.N = len(re.split(",",ngram))+1
    def mapper(self, _, line):
        # str.strip() returns a new string, so keep the result
        line = line.strip()
        [ngram,count,pages,books] = re.split("\t",line)
        words = re.split(" ",ngram)
        count = int(count)
        combs = list(combinations(words,self.N))
        subSize = self.N-1
        for combination in combs:
            wordSet = ",".join(sorted(combination))
            subCombs = list(combinations(combination,subSize))
            subsFrequent = 1
            for subComb in subCombs:
                subWordSet = ",".join(sorted(subComb))
                if subWordSet not in self.freqSupport:
                    subsFrequent = 0
                    break
            if subsFrequent:
                yield wordSet,count
    def reducer(self,wordSet,values):
        count = sum(values)
        if count >= self.thresh:
            yield wordSet,count
    
if __name__ == '__main__':
    readWriteExample.run()

Overwriting readWriteExample.py
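
The pruning step in the mapper can be seen in isolation in the following single-machine sketch: a size-N word set is emitted only if every one of its size-(N-1) subsets is already in the frequent-support table. The function name and the toy support table (with illustrative values) are assumptions, not part of the job above.

```python
from itertools import combinations

def candidate_sets(words, freq_support, n):
    """Yield comma-joined size-n word sets whose (n-1)-subsets are all frequent."""
    for combination in combinations(words, n):
        subsets = combinations(combination, n - 1)
        if all(",".join(sorted(sub)) in freq_support for sub in subsets):
            yield ",".join(sorted(combination))

# Toy table of frequent single words; only membership matters here,
# so the counts are illustrative.
freq_support = {"A": 1, "Brief": 1, "Guide": 1}
words = "A Brief Guide to the".split(" ")
pairs = list(candidate_sets(words, freq_support, 2))
```

Pairs involving "to" or "the" are skipped because those words are missing from the toy table, which is the Apriori pruning rule in miniature.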


In [165]:
!./readWriteExample.py s3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt -r emr \
    --output-dir=s3://ucb-mids-mls-hw5/readWriteExample_output \
    --no-output

using configs in /Users/jakerylandwilliams/.mrjob.conf
using existing scratch bucket mrjob-070799b65f5ef217
using s3://mrjob-070799b65f5ef217/tmp/ as our scratch dir on S3
creating tmp directory /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/readWriteExample.jakerylandwilliams.20150930.212100.551496
writing master bootstrap script to /var/folders/3y/665tnx6s0jjcysf043nfcwm80000gn/T/readWriteExample.jakerylandwilliams.20150930.212100.551496/b.py
Copying non-input files into s3://mrjob-070799b65f5ef217/tmp/readWriteExample.jakerylandwilliams.20150930.212100.551496/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-1LFH4ZRUN5XBD
Created new job flow j-1LFH4ZRUN5XBD
Job launched 30.9s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.1s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 124.2s ago, status STARTING: Pr

###Copy to local and examine output

In [169]:
!aws s3 cp s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 ./frequentSupport_2.txt
!head -25 frequentSupport_2.txt

download: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to ./frequentSupport_2.txt
"A,B"	1426
"A,Brief"	1291
"A,DAYS"	1331
"A,FOURTEEN"	1331
"A,Guide"	1337
"A,History"	1041
"A,I"	6513
"A,a"	4685
"A,and"	3988
"A,at"	2634
"A,commandment"	6149
"A,council"	1174
"A,example"	3345
"A,fair"	1291
"A,few"	2407
"A,fine"	4722
"A,for"	2735
"A,from"	1530
"A,give"	6149
"A,good"	1035
"A,held"	1245
"A,in"	3620
"A,is"	5518
"A,large"	1097
"A,later"	2546


###We can package this all in a driver for iteration, too
####Clean up a little first

In [175]:
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS
!aws s3 rm s3://ucb-mids-mls-hw5/frequentSupport.txt

A client error (404) occurred when calling the HeadObject operation: Key "readWriteExample_output/part-00000" does not exist
Completed 1 part(s) with ... file(s) remaining
A client error (404) occurred when calling the HeadObject operation: Key "readWriteExample_output/_SUCCESS" does not exist
Completed 1 part(s) with ... file(s) remaining
delete: s3://ucb-mids-mls-hw5/frequentSupport.txt


In [176]:
%%writefile readWriteExample_driver.py
#!/usr/bin/python
from writeExample import writeExample
from readWriteExample import readWriteExample

import os

mr_job = writeExample(args=[
        's3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt',
        '-r', 'emr',
        '--output-dir=s3://ucb-mids-mls-hw5/readWriteExample_output',
        '--no-output'
    ])

with mr_job.make_runner() as runner: 
    runner.run()
os.system("aws s3 cp s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 s3://ucb-mids-mls-hw5/frequentSupport.txt")
os.system("aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000")
os.system("aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS")

for i in range(2):
    mr_job = readWriteExample(args=[
            's3://ucb-mids-mls-hw5/gbooks_filtered_sample.txt',
            '-r', 'emr',
            '--output-dir=s3://ucb-mids-mls-hw5/readWriteExample_output',
            '--no-output'
        ])
    with mr_job.make_runner() as runner: 
        runner.run()
        os.system("aws s3 cp s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 s3://ucb-mids-mls-hw5/frequentSupport.txt")
        os.system("aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000")
        os.system("aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS")    


Overwriting readWriteExample_driver.py
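
The three os.system calls are repeated after every job, so they could be factored into one helper. The sketch below only builds the AWS CLI command lines (using the `aws s3 cp` and `aws s3 rm` subcommands already used in these notes) and leaves running them to the caller. The function and argument names are hypothetical, and it assumes a single output file named part-00000, as in the examples above.

```python
def promote_output_commands(output_dir, target):
    """Build the AWS CLI commands that copy part-00000 to target and clean up."""
    part = output_dir.rstrip("/") + "/part-00000"
    success = output_dir.rstrip("/") + "/_SUCCESS"
    return [
        ["aws", "s3", "cp", part, target],
        ["aws", "s3", "rm", part],
        ["aws", "s3", "rm", success],
    ]

commands = promote_output_commands(
    "s3://ucb-mids-mls-hw5/readWriteExample_output",
    "s3://ucb-mids-mls-hw5/frequentSupport.txt",
)
# To run them, one option is:
#     import subprocess
#     for command in commands:
#         subprocess.check_call(command)
```

Running the commands with subprocess.check_call rather than os.system would raise an exception if a copy or delete fails, so the driver would not move on to the next iteration with stale data.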


In [177]:
!chmod +x writeExample.py readWriteExample.py readWriteExample_driver.py

In [178]:
!./readWriteExample_driver.py 

copy: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to s3://ucb-mids-mls-hw5/frequentSupport.txt
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS
copy: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to s3://ucb-mids-mls-hw5/frequentSupport.txt
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS
copy: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000 to s3://ucb-mids-mls-hw5/frequentSupport.txt
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
delete: s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS


####Check on the output

In [179]:
!aws s3 cp s3://ucb-mids-mls-hw5/frequentSupport.txt ./frequentSupport_3.txt
!head -25 frequentSupport_3.txt

download: s3://ucb-mids-mls-hw5/frequentSupport.txt to ./frequentSupport_3.txt
"A,B,and"	1244
"A,Brief,Guide"	1291
"A,Brief,the"	1291
"A,Brief,to"	1291
"A,DAYS,FOURTEEN"	1331
"A,DAYS,fine"	1331
"A,DAYS,of"	1331
"A,FOURTEEN,fine"	1331
"A,FOURTEEN,of"	1331
"A,Guide,the"	1337
"A,Guide,to"	1291
"A,History,of"	1041
"A,I,commandment"	6149
"A,I,give"	6149
"A,I,new"	6149
"A,a,few"	1821
"A,a,is"	1711
"A,a,later"	1909
"A,a,seconds"	1821
"A,and,of"	1072
"A,at,council"	1174
"A,at,held"	1174
"A,at,was"	1281
"A,commandment,give"	6149
"A,commandment,new"	6149


####Clean up

In [180]:
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/part-00000
!aws s3 rm s3://ucb-mids-mls-hw5/readWriteExample_output/_SUCCESS
!aws s3 rm s3://ucb-mids-mls-hw5/frequentSupport.txt

A client error (404) occurred when calling the HeadObject operation: Key "readWriteExample_output/part-00000" does not exist
Completed 1 part(s) with ... file(s) remaining
A client error (404) occurred when calling the HeadObject operation: Key "readWriteExample_output/_SUCCESS" does not exist
Completed 1 part(s) with ... file(s) remaining
delete: s3://ucb-mids-mls-hw5/frequentSupport.txt
