## 15688 Tutorial for AWS Python SDK (boto2)

author: Yuqi Wang

### Introduction
  After several weeks of practical data science, it is easy to see that better analysis results usually come from better and larger datasets. But once we acquire a huge volume of data, how and where do we store and preprocess it? 
  **Apparently**, our local computer is not the answer. This is where **cloud computing** comes in. AWS (Amazon Web Services), provided by Amazon Inc., is one of the popular choices. It offers various services:
  * Amazon EC2 (Elastic Compute Cloud) 
  * S3 (Simple Storage Service) 
  * Elastic Block Store (EBS)
  * Elastic Map Reduce (EMR)
  * DynamoDB        
  

  For more information about AWS, please refer to the [AWS website](https://aws.amazon.com/). In this tutorial, we will mainly focus on using EC2 and EMR to make data analysis more efficient. We can quickly register a new account and start using the different kinds of services right away.
  You may create your own AWS account [here](https://aws.amazon.com/). Note that AWS is not free: each service is billed according to its own pricing, so check the pricing before you start a service.

### Before you start
  ##### Generate Access Key ID & Secret Key pair on AWS
  After creating and signing in to your account, generate an Access Key ID and Secret Access Key pair. 
  [Here is the user manual:][1]
  * Choose Security Credentials from the drop-down menu in the top right corner. 
  * Under Access Keys, click Create a new Access Key and note down the Access Key ID and the Secret Access Key. 
  * The Secret Access Key is shown only when the key is created, so save it right away. If you lose it, visit the Security Credentials page to create a new key pair.
  [1]:https://theproject.zone/f16-15619/aws-intro
  **Store your keys in a safe place, or export them as environment variables, so that they are not leaked.**
  ##### Configure Boto Credentials  
  * Run `pip install boto` to install the boto library.
  * Create a `~/.boto` file with these contents:  
  [Credentials]  
  aws_access_key_id = YOURACCESSKEY  
  aws_secret_access_key = YOURSECRETKEY
  
  If you already have an AWS account and have installed the AWS CLI, you can also run the shell command `aws configure` to set up your credentials.
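
  As a sketch of the environment-variable option mentioned above: boto also reads `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the environment. The helper below is hypothetical (`find_credentials` is not part of boto) and is written for Python 3 with the standard library only. It reports which source would supply credentials without ever returning the secret key. The lookup order is an assumption of this sketch.

```python
import configparser
import os


def find_credentials(environ=None, boto_path="~/.boto"):
    """Return (source, access_key_id); the secret key is never returned.

    find_credentials is a hypothetical helper, not part of boto.
    """
    environ = os.environ if environ is None else environ
    # In this sketch, environment variables are checked first.
    if environ.get("AWS_ACCESS_KEY_ID") and environ.get("AWS_SECRET_ACCESS_KEY"):
        return "environment", environ["AWS_ACCESS_KEY_ID"]
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist.
    parser.read(os.path.expanduser(boto_path))
    if parser.has_option("Credentials", "aws_access_key_id"):
        return "boto config", parser.get("Credentials", "aws_access_key_id")
    return None, None
```

  Calling `find_credentials()` with no arguments inspects the real environment and `~/.boto`; a result of `(None, None)` suggests that neither source is configured yet.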

### Library importing
  **boto** is the AWS SDK for Python and the key library used in this tutorial. Its API documentation is [here](http://boto.cloudhackers.com/en/latest/index.html).

In [39]:
import boto
import sys
from boto.ec2.connection import EC2Connection
import boto.emr
from boto.emr.connection import EmrConnection
from boto.emr.instance_group import InstanceGroup
from boto.emr.step import StreamingStep
import requests
import time

### Making connections to EC2 instances
  The first thing we want to do is launch an EC2 instance, which acts as a virtual machine. To do that, we first need a connection to the EC2 service.

In [24]:
# Given a valid region name, return a boto.ec2.connection.EC2Connection.
def ec2_connection(region='us-east-1'):
    conn = boto.ec2.connect_to_region(region)
    return conn
ec2_conn = ec2_connection()

# In case you forget which regions are available, you can list them:

print boto.ec2.regions()

[RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1, RegionInfo:ap-southeast-2, RegionInfo:sa-east-1, RegionInfo:ap-southeast-1, RegionInfo:ap-northeast-2, RegionInfo:us-west-2, RegionInfo:us-gov-west-1, RegionInfo:us-west-1, RegionInfo:eu-central-1, RegionInfo:eu-west-1]


### Launch an EC2 instance
  The functions listed here:
  * create a security group (sg) for later use, using the connection we acquired above.
  * delete an unnecessary or wrongly created security group.
  * validate a URL. This function can be used to test whether an instance has launched successfully. **Note: this only works for an instance with a web server listening on port 80. If you start with a bare instance, simply launch it and try to ssh into it a few seconds later.**
  * **launch an EC2 instance**
  * Important: if you generate a key pair, remember to restrict the permissions of the key file, or ssh will consider it too open and refuse to use it, e.g. chmod 400 ``<KEYPAIRNAME>``
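
  The permission change on the key file can also be done from Python. This is a minimal sketch (Python 3, standard library only) with a hypothetical helper name, `restrict_key_file`. It uses mode 0400 (read-only for the owner), which ssh accepts.

```python
import os
import stat


def restrict_key_file(path):
    """Make a private key file readable by its owner only (mode 0400)."""
    os.chmod(path, stat.S_IRUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)
```

  For a key file saved as `test.pem` (a placeholder name), `restrict_key_file('test.pem')` should return `0o400` on Linux and macOS.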

In [20]:
# sg_name is the name of the sg, sg_desc is the description of the sg.
def sg_creator(conn, sg_name, sg_desc):
    sg = conn.create_security_group(sg_name, sg_desc)
    # Allow all inbound traffic from any address (convenient for a demo, but
    # far too open for real use; restrict the ports and CIDR range in practice).
    sg.authorize(ip_protocol="-1", from_port=None, to_port=None,
                 cidr_ip="0.0.0.0/0", src_group=None, dry_run=False)
    return sg
sg1 = sg_creator(ec2_conn, 'sg1', 'first sg')
print sg1


SecurityGroup:sg1


In [9]:
# This function is used to delete a security group with the name passed in
def delete_sg(conn, sg_name):
    return conn.delete_security_group(name=sg_name)
print delete_sg(ec2_conn, 'sg1')

True


In [26]:
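# Return True if an HTTP GET on the url answers with status 200, else False.
def url_validation(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return True
        print response.content
        return False
    except Exception as e:
        # Connection errors are expected while the server is still starting up.
        print e
        return False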
        
# This function is used to launch an instance with parameters passed in. 
# instance_type is the type of instance to run.
def ec2_launch(conn, ami_type, key_name, instance_type, security_groups=None):
    # returns a boto.ec2.instance.Reservation associated with the request for machines
    reservation = conn.run_instances(
        ami_type, key_name=key_name, instance_type=instance_type, security_groups=security_groups)
    # get our instance
    instance = reservation.instances[0]
    # add tags to our instance
    #instance.add_tags({'15688': 'test'})
    # manually update its running status
    state = instance.update()
    while state == 'pending':
        time.sleep(5)
        state = instance.update()
    public_dns_name = instance.public_dns_name
    print public_dns_name
    while not url_validation('http://' + str(public_dns_name)):
        time.sleep(5)
    print '[0] instance created successfully!'
    return public_dns_name

print ec2_launch(ec2_conn, 'ami-dc6f05cb','test','t2.micro',[sg1])

ec2-54-147-4-185.compute-1.amazonaws.com
HTTPConnectionPool(host='ec2-54-147-4-185.compute-1.amazonaws.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x106e26910>: Failed to establish a new connection: [Errno 60] Operation timed out',))
HTTPConnectionPool(host='ec2-54-147-4-185.compute-1.amazonaws.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x106e26950>: Failed to establish a new connection: [Errno 61] Connection refused',))
HTTPConnectionPool(host='ec2-54-147-4-185.compute-1.amazonaws.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x106e26c50>: Failed to establish a new connection: [Errno 61] Connection refused',))
[0] instance created successfully!
ec2-54-147-4-185.compute-1.amazonaws.com
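
Both loops in the launch function above wait without an upper bound, so a misconfigured instance would keep the notebook busy forever. A bounded variant could look like the sketch below (Python 3 syntax, standard library only). `wait_until` is a hypothetical helper, not part of boto, and the default timeout and interval are arbitrary.

```python
import time


def wait_until(check, timeout=300, interval=5, sleep=time.sleep, clock=time.time):
    """Call check() every `interval` seconds until it returns a truthy value.

    Returns True on success and False if `timeout` seconds pass first.
    wait_until is a hypothetical helper, not part of boto.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With it, the two waits could be written as `wait_until(lambda: instance.update() != 'pending')` and `wait_until(lambda: url_validation('http://' + public_dns_name))`, and the caller can stop the instance when either one returns False.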


### What you can do with this instance
* ssh into the instance using your key pair and the public DNS name
* use this instance as a storage system or data processing helper
* send and receive files between this instance and your local PC
### AWS Service: EMR

Now that we know how to create an EC2 instance with the boto API, we can explore how AWS helps with data science. In my view, Elastic MapReduce (EMR) is one of the most practical and approachable services for students who are just stepping into the field. In short, EMR is **an automation tool built around two parts, a mapper and a reducer, which processes data much faster with the help of Hadoop, since Hadoop can run jobs in parallel.**   
Hadoop is an open-source implementation of Google's MapReduce. In a MapReduce program, all data is processed and stored as key/value pairs. Input data is processed by a map function. The mappers that execute the map function output a transformed set of key/value pairs, which are then processed by a reduce function that produces key/value pairs as output.  
For more information about EMR, please refer to the [EMR introduction](https://aws.amazon.com/emr/).  
Then let's get started!
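
To make the map, shuffle and reduce stages described above concrete, here is a tiny in-memory word count (Python 3, standard library only). It is only an illustration of the data flow, not how Hadoop is implemented, and all function names are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: combine all values that share one key."""
    return word, sum(counts)


def run_mapreduce(lines):
    # Map: every input line becomes zero or more key/value pairs.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Shuffle/sort: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct key.
    return [reduce_counts(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]
```

On a real cluster the map and reduce calls run on different machines and the framework performs the sort in between, but the contract between the stages is the same.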

### Making connections to EMR service
  This is very similar to how we connected to EC2.

In [30]:
# Given a valid region name, return a boto.emr.connection.EmrConnection.
def emr_connection(region='us-east-1'):
    conn = boto.emr.connect_to_region(region)
    return conn
emr_conn = emr_connection()
print emr_conn

EmrConnection:elasticmapreduce.us-east-1.amazonaws.com


Here we design a MapReduce job flow to analyze the Wikimedia dataset.

### About the dataset
Wikimedia maintains hourly page view statistics for all objects stored on Wikimedia servers as publicly accessible datasets. We will use these statistics to analyse page-view trends and derive the trending topics on Wikipedia for a particular time range. Every record in the dataset looks like this:   
**domain_code page_title count_views total_response_size**  
domain_code is the domain name of the request to the server, page_title is the requested page title, and count_views is the number of times this page was viewed in the respective hour. total_response_size is the total response size caused by the requests for this page in the respective hour.  
domain_code has two parts: a language identifier and a sub-project suffix. The sub-project suffix (the trailing part of the domain) is abbreviated; for example, (no suffix), .b and .w stand for .wikipedia.org, .wikibooks.org and .mediawiki.org respectively. Here is one record:  
**en Carnegie_Mellon_University 32 3035632**  
which means 32 requests to "en.wikipedia.org/wiki/Carnegie_Mellon_University", the Wikipedia desktop site for Carnegie Mellon University in English, which accounted in total for 3035632 response bytes.
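
The record layout above maps naturally onto a small named tuple. This is a sketch with hypothetical names (`PageViewRecord`, `parse_record`), assuming fields are separated by single spaces as in the sample record.

```python
from collections import namedtuple

PageViewRecord = namedtuple(
    "PageViewRecord",
    ["domain_code", "page_title", "count_views", "total_response_size"])


def parse_record(line):
    """Split one line into typed fields, or return None if it is malformed."""
    fields = line.strip().split(" ")
    if len(fields) != 4:
        return None
    try:
        return PageViewRecord(fields[0], fields[1], int(fields[2]), int(fields[3]))
    except ValueError:
        # One of the two numeric columns was not an integer.
        return None
```

For the sample record, `parse_record('en Carnegie_Mellon_University 32 3035632').count_views` gives the integer 32.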

### Example Scenario
The job must meet the following requirements:
* Filter out elements based on the following rules:  
  1. filter out lines which do not have 4 columns
  2. the domain code should be 'en' or 'en.m'
  3. titles whose namespace prefix is written with '%3a' or '%3A' (an encoded ':') should be excluded as well
  4. first letter in title should not be lowercase
  5. exclude all invalid patterns (blacklisted namespace prefixes, file-extension suffixes and special pages, as defined in the mapper)
  6. add the page counts of 'en' and 'en.m' together when their titles are the same
* Get the input filename from within a mapper: the date/time information is encoded in the filename, and Hadoop streaming makes the filename available to every map task through the environment variable `mapreduce_map_input_file`. For example, the filename can be accessed in Python with `os.environ["mapreduce_map_input_file"]`, or in Java with `System.getenv("mapreduce_map_input_file")`.
* Aggregate the page views from hourly views to daily views.
* Calculate the total page views for each article.
* For every article that has over 100,000 page views (exactly 100,000 does not qualify), print the following line as output (\t is the tab character): [total month views]\t[article name]\t[date1:page views for date1]\t[date2:page views for date2]...
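
Extracting the date from the input filename can be sketched as follows. The code assumes file names of the form `pagecounts-YYYYMMDD-HHMMSS`, matching the example name used in the mapper of this tutorial. The helper name `date_from_input_file` and the `.gz` extension in the usage line are hypothetical.

```python
import os
import re


def date_from_input_file(path):
    """Return the YYYYMMDD part of a name like pagecounts-20160502-000000."""
    match = re.search(r"pagecounts-(\d{8})-(\d{6})", os.path.basename(path))
    return match.group(1) if match else None
```

Inside a mapper this would be called as `date_from_input_file(os.environ["mapreduce_map_input_file"])`. For a path ending in `pagecounts-20160502-000000.gz` it returns `'20160502'`, and it returns `None` for names that do not match, instead of producing a wrong date silently.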

Here is a sample output record:  
139175\t%C3%81lvaro_Morata\t20160501:1157\t20160502:1240\t20160503:1034\t20160504:1238\t20160505:1759\t20160506:1261\t20160507:885\t20160508:1013\t20160509:1155\t20160510:1181\t20160511:1493\t20160512:3472\t20160513:3993\t20160514:3762\t20160515:2253\t20160516:2209\t20160517:5694\t20160518:9451\t20160519:6747\t20160520:4369\t20160521:10302\t20160522:10663\t20160523:11454\t20160524:9558\t20160525:8111\t20160526:9318\t20160527:5517\t20160528:4192\t20160529:3246\t20160530:3215\t20160531:8233
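
A quick way to sanity-check result files is to parse each output line back and confirm that the leading total equals the sum of the daily counts. This is a minimal sketch; `parse_output` is a hypothetical helper and `Example_Article` in the usage line is made-up data.

```python
def parse_output(line):
    """Split one tab-separated output line into (total, article, {date: views})."""
    fields = line.rstrip("\n").split("\t")
    total, article = int(fields[0]), fields[1]
    daily = {}
    for item in fields[2:]:
        date, views = item.split(":")
        daily[date] = int(views)
    return total, article, daily
```

For example, `parse_output('110000\tExample_Article\t20160501:60000\t20160502:50000')` yields a total of 110000, which matches the sum of the two daily values.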

### Write a mapper and a reducer function to meet your own needs
  Things we need to remember:
  * the mapper and reducer are separate Python files, each of which should start with an explicit shebang line.
  * the mapper and reducer read input from stdin and write output to stdout.
  * the mapper should write its output as key/value pairs, one per line. Hadoop streaming treats everything before the first delimiter (a tab by default) as the key.
  * all lines with the same key go to the same reducer and arrive sorted by key, so the lines for one key are consecutive. A reducer may still see several keys, one after another.  
  
Example mapper and reducer Python code is listed below.

In [None]:
#!/usr/bin/env python
import os, sys

def dataFilter(line):
    # prefix_blacklist is a set which stores all prefixes that should be excluded
    prefix_blacklist = set([x.lower() for x in [
        'Media:', 'Special:', 'Talk:', 'User:', 'User_talk:', 'Wikipedia:', 'Wikipedia_talk:', 'File:',
        'File_talk:', 'MediaWiki:', 'MediaWiki_talk:', 'Template:', 'Template_talk:', 'Help:',
        'Help_talk:', 'Category:', 'Category_talk:', 'Portal:', 'Portal_talk:', 'Book:', 'Book_talk:',
        'Draft:', 'Draft_talk:', 'Education_Program:', 'Education_Program_talk:', 'TimedText:',
        'TimedText_talk:', 'Module:', 'Module_talk:', 'Gadget:', 'Gadget_talk:', 'Gadget_definition:',
        'Gadget_definition_talk:', 'Topic:']])
    # suffix_blacklist is a set which stores all suffixes that should be excluded
    suffix_blacklist = {'.png', '.gif', '.jpg', '.jpeg', '.tiff', '.tif', '.xcf', '.mid', '.ogg', '.ogv', '.svg',
                        '.djvu', '.oga', '.flac', '.opus', '.wav', '.webm', '.ico', '.txt'}
    # bad_pages is a set which stores all bad pages that should be excluded
    bad_pages = {'404_error/', 'Main_Page', 'Hypertext_Transfer_Protocol', 'Search'}
    field = line.strip().split(' ')
    # rule 1: four attributes
    if len(field) == 4:
        # rule 2: the domain code should be 'en' or 'en.m'
        if field[0] in ['en', 'en.m'] and field[1]:
            # rule 3: decode '%3a' and '%3A' to ':' so that titles with an encoded namespace prefix are excluded too
            prefix, suffix = field[1].replace('%3a', ':').replace('%3A', ':').lower().split(':')[0], \
                             field[1].lower().split('.')[-1]
            # rule 4: first letter in title should not be lowercase
            # rule 5: exclude all invalid patterns
            if not field[1][0].islower() and field[1] not in bad_pages \
                    and prefix + ':' not in prefix_blacklist and '.' + suffix not in suffix_blacklist:
                # raises ValueError early if the view count is not an integer
                value = int(field[2])
                # rule 6: 'en' and 'en.m' records share the same title key, so the reducer adds their counts together
                return field[1], field[2]
    return (None, None)


for line in sys.stdin:
    article_name, hour_page_count = dataFilter(line)
    if article_name and hour_page_count:
        input_file = os.environ["mapreduce_map_input_file"]
        #input_file = 'pagecounts-20160502-000000'
        # add date as an attribute into mapper
        date = input_file.split('-')[-2]
        print '%s\t%s\t%s' % (article_name, date, hour_page_count)


In [None]:
#!/usr/bin/env python
import sys

# print_month_article(article_name, day_counts, month_now) prints one formatted output line
# for an article, but only if its monthly total is above 100,000.
def print_month_article(article_name, day_counts, month_now):
    output = ''
    # add everyday's page count into month
    month_page_count = sum(day_counts)
    if month_page_count > 100000:
        output += '%d\t%s' % (month_page_count, article_name)
        for day, day_count in enumerate(day_counts):
            output += '\t%s%02d:%d' % (month_now, day + 1, day_count)
        print output


current_article = ''
month_now = '201605'
day_counts = [0] * 31
for line in sys.stdin:
    article_name, date, hour_page_count_s = line.strip().split('\t')
    hour_page_count_i = int(hour_page_count_s)
    day_now = int(date[-2:])
    # month_now = int(date[:-2])
    # add every hour's page count into day
    if current_article == article_name:
        day_counts[day_now - 1] += hour_page_count_i
    else:
        if current_article:
            print_month_article(current_article, day_counts, month_now)
        day_counts = [0] * 31
        day_counts[day_now - 1] += hour_page_count_i
        current_article = article_name
else:
    # for/else: this branch runs once the input is exhausted and flushes the last article
    print_month_article(current_article, day_counts, month_now)
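
Before paying for a cluster, it helps to cross-check the reducer on a small sample. The sketch below (Python 3, standard library only) is an independent reference for the hourly-to-daily aggregation. Unlike the streaming reducer above, it keeps everything in memory, so it is only suitable for small test inputs. The name `aggregate_daily` is hypothetical.

```python
from collections import defaultdict


def aggregate_daily(mapper_lines):
    """Sum hourly counts from tab-separated (title, date, count) lines.

    Returns a dict of the form {article: {date: views}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for line in mapper_lines:
        article, date, count = line.rstrip("\n").split("\t")
        totals[article][date] += int(count)
    # Convert to plain dicts so the result is easy to compare and print.
    return {article: dict(days) for article, days in totals.items()}
```

Feeding it a few lines of mapper output and comparing the sums with what the reducer prints for the same (sorted) lines is a cheap way to catch off-by-one errors in the day index.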


### Launch an EMR service
  After creating a connection to EMR, the next thing is to create one or more job flow steps. There are two kinds of steps: streaming and custom JAR. We will use a streaming step here, since it is already configured and easy to use: we only have to set a few file paths.
  Six arguments are essential for the step to run correctly:
  * step_name: a name that identifies the step.
  * mapper: mapper file name with extension.
  * reducer: reducer file name with extension.
  * input: input dataset location.
  * output: output result location.
  * file_s3_location: extra step arguments that specify the mapper and reducer locations on S3.
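
  Since most of these arguments are S3 locations, a mistyped path is an easy way to waste a cluster start. The sketch below (Python 3, standard library only) checks that a string has the `s3://bucket/key` shape before the job is submitted. `split_s3_uri` is a hypothetical helper, and the bucket name in the usage line is a placeholder.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc, parsed.path.lstrip("/")
```

  For example, `split_s3_uri('s3://my-bucket/output1')` returns `('my-bucket', 'output1')`, while a bare file name such as `'mapper.py'` raises `ValueError`. This only checks the form of the URI, not whether the bucket or key exists.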

In [50]:
def EMR_launch(emr_conn, step_name, mapper, reducer, input, output, job_name, log_uri, file_s3_location):
    instance_groups = []
    # feel free to change the num_instances, type and name here
    instance_groups.append(InstanceGroup(num_instances=1, role="MASTER", type="m1.medium", market="ON_DEMAND", name="Main node"))
    instance_groups.append(InstanceGroup(num_instances=2, role="CORE", type="m1.medium", market="ON_DEMAND", name="Worker nodes"))
    # file_s3_location is a list of extra step arguments that tells Hadoop streaming
    # where to fetch the mapper and reducer files from
    step = StreamingStep(name=step_name, mapper=mapper, reducer=reducer, input=input, output=output,jar='command-runner.jar',step_args=list(file_s3_location))
    api_params = {
    'ReleaseLabel': 'emr-5.0.3',
    #'Instances.Ec2SubnetId': 'subnet-6a58b740',
    }
    jobid = emr_conn.run_jobflow(name=job_name, log_uri=log_uri,steps=[step], instance_groups=instance_groups, job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole", api_params=api_params)
    print emr_conn.describe_cluster(jobid)
    # poll once a minute while the step is still waiting to start;
    # stopping on any other state avoids looping forever if the step fails
    state = emr_conn.list_steps(jobid).steps[0].status.state
    while state == 'PENDING':
        print state
        time.sleep(60)
        state = emr_conn.list_steps(jobid).steps[0].status.state
    if state in ('RUNNING', 'COMPLETED'):
        print '[0] Your emr launched successfully!'
    else:
        print '[1] Your emr step ended in state %s' % state

# the extra step arguments are passed as one list, which becomes file_s3_location
EMR_launch(emr_conn, 'Wikipedia pagecount example', 'mapper.py', 'reducer.py',
           's3://cmucc-datasets/wikipediatraf/201605', 's3://yuki777/output1','My jobflow', 's3://yuki777/log',
           ["hadoop-streaming", "-files", "s3://yuki777/mapper.py,s3://yuki777/reducer.py"])

<boto.emr.emrobject.Cluster object at 0x106cce510>
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
[0] Your emr launched successfully!


Once the message says that your EMR step is running, you can go to the AWS console or use the AWS CLI to check its status. If the task completes successfully, you will find the output files in the output path you specified earlier. Here are two screenshots to give you a quick view.


In [7]:
import subprocess
subprocess.call(["open", "result.jpg"])
subprocess.call(["open","result1.jpg"])

0

Hope you enjoy this tutorial!  
Thank you!