## Local files

### Download

The remote working directory can be downloaded with the `--download` parameter:

```bash
python spark-ec2-helper.py --download
```

This method will download all files, including the IPython Notebook and files that your program generated on the server (pickle files, etc.).
Files will be downloaded to the `./remote_files` directory.
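
To check what arrived after a download, you can walk the `./remote_files` directory from Python. This is only a minimal sketch: `list_downloaded` is a hypothetical helper written for illustration and is not part of `spark-ec2-helper.py`.

```python
import os


def list_downloaded(root="./remote_files"):
    """Return the sorted relative paths of all files under `root`.

    `root` defaults to the download directory named above. A missing
    directory simply yields an empty list.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Report paths relative to the download directory.
            found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

Calling `list_downloaded()` with no arguments lists everything under `./remote_files`, including notebooks and generated files in subdirectories.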

### Upload

You can upload a single file or all files in a directory with the `--upload` parameter:

```bash
python spark-ec2-helper.py --upload path/to/a/file
python spark-ec2-helper.py --upload path/to/a/directory
```

If your program needs to read from a local text file, upload the file to the server this way first.


## S3 files

The object `s3helper` is created to help you access S3 files.

In [1]:
help(s3helper)

Help on instance of S3Helper in module __main__:

class S3Helper
 |  A helper function to access S3 files
 |  
 |  Methods defined here:
 |  
 |  __init__(self)
 |  
 |  get_file(self, key_name)
 |      Download the remote file `key_name` on S3 to local.
 |      
 |      Args:
 |          key_name
 |      Returns:
 |          None
 |  
 |  get_path(self, path='')
 |      Get paths of all files in `path` with s3 prefix,
 |      which can be passed to Spark.
 |      
 |       Args:
 |           path
 |       Returns:
 |           an array of file paths with s3 prefix
 |  
 |  load_path(self, path, tgt)
 |      Load all files in `path` to the directory `tgt` in HDFS.
 |      
 |      Args:
 |          path, tgt
 |      Returns:
 |          an array of file paths in HDFS
 |  
 |  ls(self, path='')
 |      List all files in `path`.
 |      
 |      Args:
 |          path
 |      Returns:
 |          an array of files in `path`
 |  
 |  open_bucket(self, bucket_name)
 |      Open a S3 bucket

To access S3 files, the first step is to set your AWS credentials.

In [2]:
%run Credentials.ipynb

In [3]:
s3helper.set_credential(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

### Moving data from S3 to your Spark Cluster
Long-term storage of data on AWS is done either on **S3** (about \$30 per TB\*Month) or
**Glacier** (about \$7 per TB\*Month).
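
To get a feel for these rates, the arithmetic can be sketched in a few lines. The prices are the approximate figures quoted above, not current AWS pricing, and `storage_cost` as well as the 5 TB for 12 months scenario are made up for illustration.

```python
# Approximate prices quoted in the text, in USD per TB per month.
S3_USD_PER_TB_MONTH = 30.0
GLACIER_USD_PER_TB_MONTH = 7.0


def storage_cost(terabytes, months, usd_per_tb_month):
    """Storage cost in USD for `terabytes` kept for `months` months."""
    return terabytes * months * usd_per_tb_month


# Hypothetical scenario: 5 TB stored for one year.
s3_cost = storage_cost(5, 12, S3_USD_PER_TB_MONTH)            # 1800.0
glacier_cost = storage_cost(5, 12, GLACIER_USD_PER_TB_MONTH)  # 420.0
```

This covers storage only; it leaves out request and transfer charges.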

You can, of course, keep the data on your personal server, but moving data to and from AWS is slow and/or expensive.

The cheapest way to upload large amounts of data to S3 is by [physically shipping disks](http://aws.amazon.com/importexport/).

Once your files are on S3, it is quite fast to move them to your AWS instance.

### The Spark-Notebook package
[Julaiti Arapat](http://cseweb.ucsd.edu/~jalafate/) has written a utility that simplifies creating and managing a Spark cluster on AWS. The utility is available from GitHub [here](https://github.com/arapat/spark-notebook). (The utility is also described [here](http://mas-dse.github.io/DSE230/installation/).)

These scripts automate the creation of Spark clusters on AWS, the movement of files between your computer, your AWS cluster, and S3, and other useful tasks. I will not review the whole package here; I will only use it to demonstrate some useful actions.

#### Working with S3 buckets and files
The first step to working with S3 is to open the **bucket** that has your files.

In [None]:
s3helper.open_bucket('ucsd-twitter')

Now you can list your files in the bucket.

In [5]:
print s3helper.ls()
print s3helper.ls('model-feb')

[u'Constants.py', u'data', u'data-cse255', u'data2', u'jan_geodata', u'model-feb', u'otherdata', u'pairs-130-179', u'pairs-176-179', u'pairs-176-247', u'pairs-176-247-clean', u'sample-jan-57', u'xy-298-307', u'yelp', u'yx-298-307']
[u'model-feb/users-partition-feb.txt']


To read the files, you have two options. 

**Option 1** Get a list of S3 file paths and pass it to Spark.

This is the better option if you have enough memory to
keep all of the data and if redundancy and error recovery are not important.

In [6]:
files = s3helper.get_path('/model-feb')
print files
rdd = sc.textFile(','.join(files))

[u's3n://ucsd-twitter/model-feb/users-partition-feb.txt']
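
The paths are joined with commas because `sc.textFile` accepts a comma-separated list of paths. The sketch below reproduces that string-building step without Spark or S3, so it can be run anywhere. `to_s3n_paths` is a hypothetical stand-in that only imitates the shape of the output shown above; it is not how `S3Helper.get_path` is implemented.

```python
def to_s3n_paths(bucket, keys):
    """Prefix each S3 key with the s3n:// scheme and the bucket name.

    Hypothetical helper for illustration; it only formats strings and
    never contacts S3.
    """
    return ["s3n://{0}/{1}".format(bucket, key.lstrip("/")) for key in keys]


files = to_s3n_paths("ucsd-twitter", ["model-feb/users-partition-feb.txt"])

# A single comma-separated string is what gets passed to sc.textFile.
joined = ",".join(files)
```

With more than one key, `joined` contains all paths separated by commas, and Spark reads them into a single RDD.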


**Option 2** Load S3 files to HDFS and read them from HDFS.

This is the better option if the data is too large to fit in memory 
or if the data will be used over a long period of time so redundancy / error recovery 
are significant issues.

Loading data into HDFS is slower than loading it directly into memory.
On the other hand, loading from HDFS to memory is much faster than loading from S3 to memory, and
HDFS provides redundancy and error recovery.

In [7]:
files = s3helper.load_path('/model-feb', '/feb')
print files
rdd = sc.textFile(','.join(files))

[u'/feb/users-partition-feb.txt']


In [8]:
rdd.count()

7

## Parquet Format
Parquet is a file format developed specifically for large data applications.
Using this file format, a program can read a selected subset of the rows in a
table using an SQL query.

This is a much faster alternative to reading the whole file into memory and then filtering out
the unneeded parts.

Parquet files thus provide some of the functionality of an RDBMS, specifically
an efficient way to read subsets of large tables. However, to perform out-of-memory calculations other than selection, one needs to install a full-fledged RDBMS such as Hive.


In [4]:
s3helper.open_bucket("mas-dse-public")

files = s3helper.load_path('/Weather/US_Weather.parquet', '/US_Weather.parquet')
files[:10]

[u'/parquet/_SUCCESS',
 u'/parquet/_common_metadata',
 u'/parquet/_metadata',
 u'/parquet/part-r-00000-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00001-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00002-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00003-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00004-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00005-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet',
 u'/parquet/part-r-00006-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet']
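
The listing mixes the Parquet data files with marker and metadata files (`_SUCCESS`, `_metadata`, `_common_metadata`). Spark reads the directory as a whole, so no filtering is needed for the query in this section. If you ever need only the data files, for example to count the parts, a minimal sketch is to select them by extension. `parquet_parts` is a hypothetical helper name and the sample list is shortened from the output above.

```python
def parquet_parts(paths):
    """Keep only the Parquet data files, dropping marker and metadata files."""
    return [p for p in paths if p.endswith(".parquet")]


# Shortened sample of the listing shown above.
listing = [
    "/parquet/_SUCCESS",
    "/parquet/_common_metadata",
    "/parquet/_metadata",
    "/parquet/part-r-00000-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet",
    "/parquet/part-r-00001-0f4998c0-b27b-4f60-ad45-ed3212ddb46f.gz.parquet",
]

parts = parquet_parts(listing)  # the two part-r-* files
```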

In [2]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master=master_url)
sqlContext = SQLContext(sc)

In [6]:
df = sqlContext.sql("SELECT station, measurement FROM parquet.`/US_Weather.parquet`")
df.head()

Row(station=u'USC00415427', measurement=u'DAPR')