# High Performance Python -- AWS and Multi-Processing/Threading

AWS Services
============

AWS Storage + Execution
-----------------------

What are the primary services that Amazon AWS offers?


Name   |Full Name                  |Service
----   |---------                  |-------
EC2    |Elastic Compute Cloud      |Execution
S3     |Simple Storage Service     |Storage
EBS    |Elastic Block Store        |Storage attached to EC2 instances

Pop Quiz
--------

<details><summary>
Q: I want to store some video files on the web. Which Amazon service
should I use?
</summary>
S3
</details>

<details><summary>
Q: I just created an iPhone app which needs to store user profiles on the
web somewhere. Which Amazon service should I use?
</summary>
S3
</details>

<details><summary>
Q: I want to create a web application in PHP. Which Amazon service
should I use?
</summary>
EC2 + EBS or EC2 + S3
</details>



S3 vs EBS
---------

What is the difference between S3 and EBS? Why would I use one versus
the other?


Feature                |S3                   |EBS
-------                |--                   |---
Can be accessed from   |Anywhere on the web  |Particular availability zone
Can be accessed from   |Any EC2 instance     |EC2 instance attached to it
Pricing                |Storage              |Storage + IOPS
Price                  |Cheaper              |More Expensive
Latency                |Higher               |Lower
Throughput             |Varies More          |Varies Less
Performance            |Slightly Worse       |Slightly Better
Max volume size        |Unlimited            |16 TB
Max file size          |5 TB                 |16 TB

Pop Quiz
--------

<details><summary>
Q: What is latency?
</summary>
Latency is the time it takes between making a request and the start of a response.
</details>


<details><summary>
Q: Which is better? Higher latency or lower?
</summary>
Lower is better.
</details>

<details><summary>
Q: Why is S3 latency higher than EBS?
</summary>
One reason is that EBS is in the same availability zone.
</details>
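
To make the definition of latency concrete, here is a minimal sketch that separates latency (time until the response starts) from total transfer time. The `fake_response` generator is a hypothetical stand-in for a real server, and the delays are made-up numbers chosen only for illustration.

```python
import time

def fake_response(first_byte_delay=0.05, chunks=3, chunk_delay=0.01):
    """Hypothetical server: wait, then stream a few chunks."""
    time.sleep(first_byte_delay)          # time before the response starts
    for i in range(chunks):
        yield "chunk-{}".format(i)
        time.sleep(chunk_delay)           # time spent streaming the rest

def measure(response):
    """Return (latency, total_time) in seconds for an iterable response."""
    start = time.time()
    latency = None
    for _ in response:
        if latency is None:
            # The first chunk has arrived: this is the latency.
            latency = time.time() - start
    total = time.time() - start
    return latency, total

latency, total = measure(fake_response())
print("latency: {:.3f}s, total: {:.3f}s".format(latency, total))
```

Lower latency means the first chunk arrives sooner, regardless of how long the rest of the transfer takes.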


Amazon vs Other Cloud Services
------------------------------

Why do so many companies use Amazon's Web Services for their backend?

- Steve Yegge provides one of the big reasons for AWS's popularity.


Steve Yegge and Decoupled Design
--------------------------------

<img src="img/yegge.jpg">

Who is Steve Yegge?

- Steve Yegge is a developer who has worked at both Amazon and Google.

- Steve blogged a long [rant][yegge-rant] about Amazon's APIs vs
  Google's APIs.

[yegge-rant]: https://plus.google.com/+RipRowan/posts/eVeouesvaVX

What is the difference between Amazon and Google's APIs?

- At Amazon, developers have to use Amazon's public APIs for their
  internal dependencies.

- At Google developers can use private APIs for dependencies.

- The forced dogfooding makes Amazon's APIs more decoupled.

---

Amazon S3
=========

Buckets and Files
-----------------

What is a bucket?

- A bucket is a container for files.

- Think of a bucket as a logical grouping of files like a sub-domain.

- A bucket can contain an arbitrary number of files.

How large can a file in a bucket be?

- A file in a bucket can be up to 5 TB.


Bucket Names
------------

What are best practices on naming buckets?

Bucket names should be DNS-compliant.

- They must be at least 3 and no more than 63 characters long.

- They must be a series of one or more labels, separated by a single
  period. 
  
- Bucket names can contain lowercase letters, numbers, and hyphens. 

- Each label must start and end with a lowercase letter or a number.

- Bucket names must not be formatted as an IP address (e.g., 192.168.5.4).

What are some examples of valid bucket names?

- `myawsbucket`

- `my.aws.bucket`

- `myawsbucket.1`

What are some examples of invalid bucket names? 

- `.myawsbucket`

- `myawsbucket.`

- `my..examplebucket`

Pop Quiz
--------

<details><summary>
Q: Why are these bucket names invalid?
</summary>
Bucket names cannot start or end with a period, and they cannot have
multiple periods next to each other.
</details>
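
The naming rules above can be turned into a small checker. This is a sketch that covers only the rules listed in this section; `is_valid_bucket_name` is a hypothetical helper, not part of Boto, and S3 itself may enforce additional rules.

```python
import re

# One label: lowercase letters, digits, and hyphens,
# starting and ending with a letter or a digit.
LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\Z")

# Four dot-separated groups of digits, e.g. 192.168.5.4.
IP_LIKE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}\Z")

def is_valid_bucket_name(name):
    """Check a name against the rules listed in this section."""
    if not 3 <= len(name) <= 63:
        return False
    if IP_LIKE.match(name):
        return False
    # Splitting on "." yields an empty label for a leading period,
    # a trailing period, or two periods in a row; LABEL rejects it.
    return all(LABEL.match(label) for label in name.split("."))

for name in ["myawsbucket", "my.aws.bucket", ".myawsbucket", "my..examplebucket"]:
    print("{}: {}".format(name, is_valid_bucket_name(name)))
```

Checking a name locally like this is cheaper than waiting for `create_bucket` to fail.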


Creating Buckets
----------------

Q: How can I create a bucket?

- Get your access key and secret key from the `rootkey.csv` that you
  downloaded from Amazon AWS.
  
- Create a file called `~/.aws/credentials` (on Linux/Mac) or
  `%USERPROFILE%\.aws\credentials` (on Windows), and insert the
  following code into it. Replace `ACCESS_KEY` and `SECRET_KEY` with
  the S3 keys you got from Amazon.
  
```
[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
```

- Create a connection to S3.

In [2]:
import boto

conn = boto.connect_s3()
print conn

S3Connection:s3.amazonaws.com


- List all the buckets.

In [6]:
conn.get_all_buckets()

[<Bucket: aritro1>,
 <Bucket: aws-logs-991777501832-us-west-1>,
 <Bucket: aws-logs-991777501832-us-west-2>,
 <Bucket: sversage1>,
 <Bucket: versage.galvanize>]

- Create new bucket.

In [9]:
import os

user = os.environ['USER']
bucket_name = user + "1"
bucket_name = bucket_name.lower()

print bucket_name

bucket = conn.create_bucket(bucket_name)

print bucket


sversage1
<Bucket: sversage1>


Upgrading Boto
--------------

Q: Boto is not able to find the credentials. How can I fix this?

- Older versions of Boto were not able to read the credentials file.

- You might run into this problem on the EC2 instance.

- Here is how to upgrade Boto to the latest version.

In [7]:
! sudo pip install --upgrade boto

Password:
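
Another thing worth ruling out is a malformed credentials file. The sketch below parses text in the same format as `~/.aws/credentials` and pulls out the two keys. It assumes Python 3 (the module is named `ConfigParser` in Python 2), and `read_credentials` and `SAMPLE` are hypothetical names used only for this illustration.

```python
import configparser

# Same layout as the credentials file shown earlier, with placeholder keys.
SAMPLE = """[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
"""

def read_credentials(text, profile="default"):
    """Return (access_key, secret_key) for a profile in credentials-file text."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    # Raises NoSectionError / NoOptionError if the file is malformed.
    return (parser.get(profile, "aws_access_key_id"),
            parser.get(profile, "aws_secret_access_key"))

access_key, secret_key = read_credentials(SAMPLE)
print(access_key, secret_key)
```

To check a real file, read its contents with `open(path).read()` and pass the text to `read_credentials`.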


Adding Files
------------

Q: How can I add a file to a bucket?

- List files.

In [10]:
bucket.get_all_keys()

[<Key: sversage1,airline-data-extract.csv>,
 <Key: sversage1,file.txt>,
 <Key: sversage1,file2.txt>,
 <Key: sversage1,scripts/bootstrap-emr.sh>]

- Add file.

In [11]:
file_ = bucket.new_key('file.txt')
print file_
file_.set_contents_from_string('hello world!!')

<Key: sversage1,file.txt>


13

- Copy a local file to the bucket, then download it back.

In [18]:
file_ = bucket.new_key('aws_regions.png')
file_.set_contents_from_filename('img/aws_regions.png')
file_.get_contents_to_filename('aws_regions.png')

- List files again. New file should appear.

In [19]:
bucket.get_all_keys()

[<Key: sversage1,airline-data-extract.csv>,
 <Key: sversage1,aws_regions.png>,
 <Key: sversage1,file.txt>,
 <Key: sversage1,file2.txt>,
 <Key: sversage1,scripts/bootstrap-emr.sh>]

Q: How can I get a file from a bucket?

- Get file. This reads it all at once.

In [None]:
f = bucket.get_key('file.txt')
print f.get_contents_as_string()


Creating Buckets With Periods
-----------------------------

Q: How can I create a bucket in Boto with a period in the name?

- There is a bug in Boto that causes `create_bucket` to fail if the
  bucket name has a period in it. 

- Try creating the bucket with a period in its name. This should fail.

In [None]:
bucket_name_with_period = bucket_name + ".1.2.3"
bucket_with_period = conn.create_bucket(bucket_name_with_period)
print bucket_with_period

- To get around this, run this code snippet. Note that it turns off
  HTTPS certificate verification for the whole Python session.

In [None]:
import ssl
if hasattr(ssl, '_create_unverified_context'):
   ssl._create_default_https_context = ssl._create_unverified_context

- Now try creating the bucket with a period in its name and it should work.

In [None]:
bucket_name_with_period = bucket_name + ".1.2.3"
bucket_with_period = conn.create_bucket(bucket_name_with_period)
print bucket_with_period

- Now let's delete the bucket.

In [None]:
bucket_with_period.delete()

- For more details see <https://github.com/boto/boto/issues/2836>.


Access Control
--------------

Q: I want to access my S3 file from a web browser without giving my
access and secret keys. How can I open up access to the file to
anyone?

- You can set up Access Control Lists (ACLs) at the level of the
  bucket or at the level of the individual objects in the bucket
  (folders, files).

Q: What are the different ACL policies?

ACL Policy           |Meaning
----------           |-------
`private`            |No one else besides owner has any access rights.
`public-read`        |Everyone has read access.
`public-read-write`  |Everyone has read/write access.
`authenticated-read` |Registered Amazon S3 users have read access.

Q: What do `read` and `write` mean for buckets and files?

- Read access to a file lets you read the file.

- Read access to a bucket or folder lets you see the names of the
  files inside it.


Pop Quiz
--------

<details><summary>
Q: If a bucket is `private` and a file inside it is `public-read` can
I view it through a web browser?
</summary>
Yes. Access to the file is only determined by its ACL policy.
</details>


<details><summary>
Q: If a bucket is `public-read` and a file inside it is `private` can
I view the file through a web browser?
</summary>
No, you cannot. However, if you access the URL for the bucket you will see the file listed.
</details>
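
The two quiz answers can be summarized as a toy model. This sketch only encodes the ACL policies from the table above for an anonymous web visitor; the function names are hypothetical and are not part of Boto or S3.

```python
# Toy model only: hypothetical helpers, not an S3 or Boto API.
PUBLIC_READ_POLICIES = {"public-read", "public-read-write"}

def anonymous_can_read_file(bucket_acl, file_acl):
    """Reading a file depends only on the file's own ACL policy."""
    return file_acl in PUBLIC_READ_POLICIES

def anonymous_can_list_bucket(bucket_acl):
    """Seeing the file names depends on the bucket's ACL policy."""
    return bucket_acl in PUBLIC_READ_POLICIES

# Private bucket, public file: the file is readable.
print(anonymous_can_read_file("private", "public-read"))
# Public bucket, private file: the file is listed but not readable.
print(anonymous_can_list_bucket("public-read"),
      anonymous_can_read_file("public-read", "private"))
```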

Applying Access Control
-----------------------

Q: How can I make a file available on the web so anyone can read it?

- Create a file with a specific ACL.

In [13]:
file2 = bucket.new_key('file2.txt')
file2.set_contents_from_string('hello world!!!',policy='private')

14

- Try reading the file.

In [14]:
file2_url = 'http://s3.amazonaws.com/' + bucket_name + '/file2.txt'
print file2_url
!curl $file2_url

http://s3.amazonaws.com/sversage1/file2.txt
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>BE3202554FB5DF7B</RequestId><HostId>jK1CvY8Pr2ztgACibD6hl+yAl77z+NIBodUVVIdbNhrienAldCB7AIzTH15oN0Ew/sQCIbQzX3o=</HostId></Error>

- Now change its ACL.

In [15]:
file2.set_acl('public-read')
!curl $file2_url

hello world!!!

- Also you can try accessing the file through the browser.

- If you do not specify the ACL for a file when you set its contents,
  the file is `private` by default.


S3 Files to URLs
----------------

Q: How can I figure out the URL of my S3 file?

- As above, you can compose the URL using the bucket and file name. 

- The general template for the URL is `http://s3.amazonaws.com/BUCKET/FILE`.

- You can also find the URL by looking at the file on the AWS web console.
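
The URL template above can be wrapped in a small helper. This is a sketch: `s3_url` is a hypothetical function, and percent-encoding the file name with `quote` is an added assumption so that names containing spaces still produce a usable URL.

```python
try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2

def s3_url(bucket_name, key_name):
    """Compose http://s3.amazonaws.com/BUCKET/FILE for a bucket and file name."""
    # quote() leaves "/" alone, so keys such as "scripts/bootstrap-emr.sh" keep their path.
    return "http://s3.amazonaws.com/{}/{}".format(bucket_name, quote(key_name))

print(s3_url("sversage1", "file2.txt"))
```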


Deleting Buckets
----------------

Q: How can I delete a bucket?

- Try deleting a bucket containing files. What happens?

In [None]:
print conn.get_all_buckets()
bucket.delete()

- To delete the bucket first delete all the files in it.

In [None]:
for key in bucket.get_all_keys(): 
    key.delete()

- Then delete the bucket.

In [None]:
print conn.get_all_buckets()
bucket.delete()
print conn.get_all_buckets()

---

Amazon EC2
==========

Regions
-------

Q: What are *AWS Regions*?

- AWS is hosted in different geographic locations world-wide. 

- For example, there are several regions in the US.


Q: What are some of the regions in the US?

Region       |Name       |Location 
------       |----       |-------- 
us-east-1    |US East    |N. Virginia
us-west-1    |US West    |N. California
us-west-2    |US West 2  |Oregon


Q: How should I choose a region?

- N. Virginia or `us-east-1` is the default region for EC2.

- Using a region other than N. Virginia requires additional configuration.

- If you are not sure choose N. Virginia.


Availability Zones
------------------

Q: What are *AWS Availability Zones*?

- Regions are divided into isolated availability zones for fault
  tolerance.

- Availability zones run on physically separate hardware and
  infrastructure.

- They do not share hardware, or generators, or cooling equipment. 

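- If you do not choose one, an availability zone is assigned
  automatically to your EC2 instance. The zone names you see (such as
  `us-east-1a`) are mapped to physical zones separately for each account.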

<img src="img/aws_regions.png">



<details><summary>
Q: Is it possible for two separate users to coordinate and land on the
same availability zone?
</summary>
1. Availability zones are assigned automatically by the system.
<br>
2. It is not possible for two AWS users to coordinate and be hosted on the same
availability zone.
</details>

----

Connecting to EC2
-----------------

Q: How can I connect to an EC2 instance?

- Log in to the AWS console.

- Navigate to EC2 and launch an instance (a free Linux instance will do for now).

- Find the instance's Public DNS value. It should look something like
  `ec2-52-3-161-43.compute-1.amazonaws.com`.

- Use this command to connect to it.

- `ssh -X -i ~/.ssh/keypair.pem user@domain`

- Here is an example.

- `ssh -X -i ~/.ssh/keypair.pem ubuntu@ec2-52-3-161-43.compute-1.amazonaws.com`

- Make sure you replace the Public DNS value above with the value you
  have for your instance.


## Install/test Anaconda

The first step is to get the link to the most recent version of Anaconda for 64-bit Linux. At the time of writing, that link is https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-4.0.0-Linux-x86_64.sh

SSH into your EC2 instance.

Run the following commands (replace the URL and file name with those of the most recent version of Anaconda for 64-bit Linux):

- `wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-4.0.0-Linux-x86_64.sh`

- `bash Anaconda2-4.0.0-Linux-x86_64.sh`

- `exit`

Now re-SSH into your EC2 instance.

Make sure the `python` command launches Anaconda's version of Python. Run the command `which python` to be sure Anaconda is being used.

Copying Files to EC2
--------------------

Q: How can I copy files to the EC2 instance?

### SSH  
- To copy a file `myfile.txt` to EC2, use a command like this.

- `scp -i ~/.ssh/keypair.pem myfile.txt user@domain:`

- To copy a directory `mydir` recursively to EC2, use a command like
  this. 
  
- `scp -i ~/.ssh/keypair.pem -r mydir user@domain:`

### SFTP

While `scp` works great, it is a very bare-bones command. We can use the `sftp` command to add more pizzazz. Rather than call it directly, we'll use programs that use SFTP under the hood.

First up, Cyberduck. Cyberduck is a great free program that can use SFTP to connect to your EC2 instance and let you transfer and edit remote files with ease. See: https://cyberduck.io/

Next up, OSX Fuse and SSHFS. Together these two programs can mount your remote EC2 instance's file system as a drive on your Mac. The mounted drive looks and works much like a flash drive, allowing you to drag-and-drop and edit remote files as if they were local. See: https://osxfuse.github.io/

Note: SSHFS requires you to use the Terminal to mount your remote file system. To mount your EC2 instance's file system, run

`sshfs ec2-user@IP_ADDRESS: ec2server -o IdentityFile=~/Desktop/awskey.pem -f`

Note that the command above will appear to hang. That is normal, and it will continue to hang until the remote file system is unmounted.

To unmount your remote file system, right-click the mounted drive and click "Eject". That will make the command above terminate.

Pop Quiz
--------

<details><summary>
Q: When you copy a file to EC2 with `scp` will this show up in S3?
</summary>
No. The file will be stored on the disk on the EC2 instance. It will
not be in S3.
</details>

## Connect a local Jupyter notebook to your EC2 instance over SSH

### On your EC2 instance
- `jupyter notebook --generate-config`
- `mkdir certs`
- `cd certs`
- `sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem`
- `cd ~/.jupyter`
- `vim jupyter_notebook_config.py` (paste the contents below at the top)

```
c = get_config()

# Notebook config: this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ec2-user/certs/mycert.pem' 
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False  
# Fix port to 8888
c.NotebookApp.port = 8888
# in-line figure when using Matplotlib
c.IPKernelApp.pylab = 'inline'
```

- `jupyter notebook --no-browser` (you could use tmux here as well to keep the server going)

### On your local machine
- Open a new terminal window and execute `ssh -i YOUR.PEM -L 8000:localhost:8888 ec2-user@YOUR_ADDRESS`
- Open a browser window locally and navigate to https://localhost:8000/tree

High-Performance Python
=======================

Multi-Processing vs Multi-Threading
-----------------------------------

Q: What is the difference between *multi-processing* and
*multi-threading*?

- Multi-threading (also known as concurrency) splits the work between
  different threads running on the same processor. 
  
- When one thread is blocked the processor works on the tasks for the
  next one.

- Multi-processing splits work across processes running on different
  processors or even different machines.

- Multi-threading works better if you need to exchange data between
  the threads. 

- Multi-processing works better if the different processes can work
  heads down without communicating very much.

Pop Quiz
--------

<details><summary>
Q: I have to process a very large dataset and run it through a
CPU-intensive algorithm. Should I use multi-processing or
multi-threading to speed it up?
</summary>
Multi-processing will produce a result faster. This is because it will
be able to split the work across different processors or machines.
</details>


<details><summary>
Q: I have a web scraping application that spends most of its time
waiting for web servers to respond. Should I use multi-processing or
multi-threading to speed it up?
</summary>
Multi-threading will produce a bigger payoff. This is because it will
ensure that the CPU is fully utilized and does not waste time blocked
on input.
</details>
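
Here is a minimal sketch of the web scraping case. The `fake_fetch` function is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders. `ThreadPool` from `multiprocessing.pool` has the same `map` interface as `Pool` but runs the work in threads.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_fetch(url):
    """Hypothetical slow request: sleep instead of touching the network."""
    time.sleep(0.1)
    return "contents of " + url

urls = ["http://example.com/page{}".format(i) for i in range(5)]

# One request after another: each wait adds up.
start = time.time()
sequential = [fake_fetch(u) for u in urls]
sequential_time = time.time() - start

# Five threads: the waits overlap.
start = time.time()
pool = ThreadPool(5)
threaded = pool.map(fake_fetch, urls)
pool.close()
pool.join()
threaded_time = time.time() - start

print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential_time, threaded_time))
```

Because each thread spends its time blocked rather than computing, the threads overlap their waiting and the threaded version should finish in roughly the time of a single request.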

Analogies
---------

Multi-Threading  |Multi-Processing
-----------      |----------------
Laundromat       |Everyone has a washer-dryer
Uber or Carpool  |Everyone has a car


Multi-Threading
---------------

Q: How can I write a multi-threaded program that prints `"hello"` in
different threads?

- Define a function for each thread to run.

In [69]:
import time
import threading
import multiprocessing
import Queue
import random

def thread_demo(thread_num, sleep_time=None):
    if not sleep_time:
        sleep_time = random.randint(3,10)
    
    time.sleep(sleep_time)
    print ('Thread {} done, slept for {}'.format(thread_num, sleep_time))
    

- Create threads that are going to print.

In [70]:
t1 = threading.Thread(target = thread_demo, args=[1,3])
t2 = threading.Thread(target = thread_demo, args=[2,6])
t3 = threading.Thread(target = thread_demo, args=[3,6])

- Start the threads.

In [71]:
t1.start()
t2.start()
t3.start()

#Notice how this executes
print('Main thread keeps on trucking')

t1.join()
t2.join()
t3.join()

#Notice how this waits to execute until all threads are done
print('Main thread is running the rest of the script')



Main thread keeps on trucking
Thread 1 done, slept for 3
Thread 2 done, slept for 6
Thread 3 done, slept for 6
Main thread is running the rest of the script


## What's the deal with the join? 

Calling `join` on a thread tells the main thread to wait for that thread to finish before it proceeds in the code.

If you only call `join` on one thread, then the main thread will only wait for that thread before continuing through the script.
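
A short sketch of this behavior, using hypothetical thread names and deliberately short sleep times: the main thread joins the fast thread first, records what has finished, and only then joins the slow one.

```python
import threading
import time

finished = []

def worker(name, sleep_time):
    time.sleep(sleep_time)
    finished.append(name)

fast = threading.Thread(target=worker, args=("fast", 0.05))
slow = threading.Thread(target=worker, args=("slow", 0.3))
fast.start()
slow.start()

# Wait only for the fast thread; the slow one is normally still sleeping.
fast.join()
after_first_join = list(finished)

# Now wait for the slow thread as well.
slow.join()
after_both_joins = list(finished)

print(after_first_join, after_both_joins)
```

After the first `join` only the fast thread is guaranteed to be done; after the second, both are.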

Multi-Processing
----------------

Q: How can I find the nth prime for several values of n, beginning at a given start value?

- Import `Pool`

In [23]:
from multiprocessing import Pool

- Define how to test for a prime and how to find the nth prime.

In [44]:
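def isprime(n):
    # Trial division by every integer from 2 up to sqrt(n).
    # Caveat: this also returns True for n = 1, so prime() below counts 1
    # as a prime and prime(k) is really the (k-1)th prime.
    for i in range(2,int(n**0.5)+1):
        if n%i==0:
            return False
    return True

def prime(nth, q=None):
    # Count upward from 1 until `nth` numbers have passed isprime().
    n_found = 0
    i = 0 
    while n_found < nth:
        i += 1
        n_found = n_found + int(isprime(i))
    # Optionally report the result through a queue (used by the examples below).
    if q is not None:
        q.put(i)
    return i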

- Find primes sequentially.

In [45]:
start = 20000

In [46]:
#Serial example
t1 = time.time()

print(prime(start), prime(start+1), prime(start+2), prime(start+3))
print('Serial time took {} seconds'.format(time.time() - t1))

(224729, 224737, 224743, 224759)
Serial time took 4.74701499939 seconds


- Find primes in parallel.

In [47]:
#Processing example
t3 = time.time()    
proc_queue = multiprocessing.Queue()

jobs = [multiprocessing.Process(target=prime, args=(start, proc_queue)),
        multiprocessing.Process(target=prime, args=(start+1, proc_queue)),
        multiprocessing.Process(target=prime, args=(start+2, proc_queue)),
        multiprocessing.Process(target=prime, args=(start+3, proc_queue))]

for job in jobs:
    job.start()
    
for job in jobs:
    job.join()

print([proc_queue.get() for job in jobs])
    
print('Processing time took {} seconds'.format(time.time() - t3))

[224737, 224759, 224743, 224729]
Processing time took 2.55824589729 seconds


In [29]:
#OR using pool 

t4 = time.time()
pool = multiprocessing.Pool(processes=4)
result = pool.map(prime, range(start, start+4))
print(result)
print('Pool processing time took {} seconds'.format(time.time() - t4))

[224729, 224737, 224743, 224759]
Pool processing time took 2.67378401756 seconds


- Find primes using `Thread`.

In [48]:
#Threading example
t2 = time.time()
thread_queue = Queue.Queue()

jobs = [threading.Thread(target=prime, args=(start, thread_queue)),
        threading.Thread(target=prime, args=(start+1, thread_queue)),
        threading.Thread(target=prime, args=(start+2, thread_queue)),
        threading.Thread(target=prime, args=(start+3, thread_queue))]

for job in jobs:
    job.start()
    
for job in jobs:
    job.join()

#Typically we would have other threads/processes on the other end of the queue grabbing the
# data and doing something with it etc etc
print([thread_queue.get() for job in jobs])
    
print('Threading time took {} seconds'.format(time.time() - t2))

[224743, 224759, 224729, 224737]
Threading time took 9.0032119751 seconds
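
One detail of the `isprime` helper used in these examples: it treats 1 as a prime, so every result is shifted by one position. A possible variant that rejects numbers below 2 is sketched here. The names `is_prime` and `nth_prime` are new, chosen so they do not clash with the functions above, and the sketch assumes `nth` is at least 1.

```python
def is_prime(n):
    """Trial division; numbers below 2 are not prime."""
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

def nth_prime(nth):
    """Return the nth prime, counting 2 as the first (assumes nth >= 1)."""
    n_found = 0
    i = 1
    while n_found < nth:
        i += 1
        if is_prime(i):
            n_found += 1
    return i

print([nth_prime(k) for k in range(1, 6)])
```

The timing comparison between the sequential, multi-processing, and threading versions is unaffected by this change, since the amount of work is essentially the same.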


Pop Quiz
--------

<details><summary>
Q: Between sequential, parallel, and concurrent, which one is the
fastest? Which one is the slowest? Why?
</summary>
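1. In the timings above, parallel (multi-processing) is the fastest,
sequential is second, and concurrent (multi-threading) is the slowest.
<br>
2. The task is CPU-bound, and CPython's global interpreter lock lets
only one thread run Python code at a time, so the threads add switching
overhead without doing any of the work in parallel.
<br>
3. Concurrent and parallel both have a higher setup overhead than
sequential. This is not recovered for small problems, so use them only
if your processing takes longer than the setup overhead.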
</details>

Cleaning Up Zombie Python Processes
-----------------------------------

Here is how to kill all the processes that `multiprocessing` will
bring up in the background.

```sh
ps ux | grep IPython.kernel | grep -v grep | awk '{print $2}' | xargs kill -9
```

### Real world example

I was building an application that would make many calls to a database and write the returned data locally.

<details><summary>
Q: For the calls to the database, a generator was providing the queries. What would you use in this scenario?
</summary>
You would use threads here, as they share the same memory space. One warning: the generator is shared between the threads, so wrap it in a thread-safe iterator to make sure two threads never advance it at the same time.
</details>
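
One way to make a shared generator thread-safe is to guard it with a lock. The sketch below is an illustration under assumptions, not the code from the application described above: `ThreadSafeIterator` and `query_generator` are hypothetical names, and the query strings are placeholders.

```python
import threading

class ThreadSafeIterator(object):
    """Wrap an iterator so that only one thread advances it at a time."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)

    next = __next__  # Python 2 spelling of the same method

def query_generator(n):
    """Hypothetical source of queries."""
    for i in range(n):
        yield "SELECT * FROM table_{}".format(i)

queries = ThreadSafeIterator(query_generator(100))
seen = []

def worker():
    # Every thread pulls from the same shared iterator.
    for query in queries:
        seen.append(query)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen))
```

Each query is handed to exactly one thread, and the loop in every worker ends once the generator is exhausted.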

<details><summary>
Q: For the local writes, we would write to many local databases, then package them and send them over FTP. What would you use here?
</summary>
You would want to use multi-processing here. You can write the data coming back from the database to a shared pool, and have each process pick data up from the pool, write it to its own database, then package and send the result individually.
</details>