## Scientific Computing for Large Datasets

### What is Big Data?

* Data that is larger than what I can hold in Memory (of my Laptop)
* You are either:
    * Memory Bound: the data is too big to fit in the Memory of your Laptop
    * CPU Bound: processing takes too long (for example, Grid Searching models) and has to run for long stretches

You need a bigger machine (most likely more memory) to complete the processing. You can buy a more expensive laptop, add memory, or simply rent the hardware from AWS for the period you need it.
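
A rough way to tell whether you are Memory Bound is to compare the size of the data with the RAM of the machine. The sketch below is only a rule of thumb: the function name and the `overhead` multiplier (loaded data often takes more room than the file on disk) are assumptions for illustration, not measured values.

```python
GB = 1024 ** 3


def fits_in_memory(data_bytes, ram_bytes, overhead=3.0):
    """Return True if the data is likely to fit in RAM.

    `overhead` is an assumed multiplier for how much larger the data
    becomes once loaded; tune it for your own data.
    """
    return data_bytes * overhead <= ram_bytes


# A 2 GB file on an 8 GB laptop needs roughly 6 GB under this assumption.
print(fits_in_memory(2 * GB, 8 * GB))
# A 20 GB file would need roughly 60 GB, so it does not fit.
print(fits_in_memory(20 * GB, 8 * GB))
```

For a real file, pass `os.path.getsize(path)` as `data_bytes`.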

You need to either Scale Out (Horizontally) or Scale Up (Vertically)

### Scale Out:

Distribute the computing / memory across different Machines (StarCluster / Hadoop / Spark)

* Pros - Only solution on TB and PB scale.
* Cons - Non-Trivial setup, on-going maintenance.

### Scale Up:

Rent a Single Bigger Machine - [up to 244GB Memory from EC2](http://www.ec2instances.info/)

* Pros - Simpler Solution.  We all know how to work with one computer
* Cons - Although 30-50X larger than your laptop, it still has an upper bound.  Some problems will be too big for one machine.

We are going to Scale Up in this Class because that will help you with your project or at work right now.

### Lab Notes:

* [Log in to AWS](https://aws.amazon.com)
* Set the Region to US East (N. Virginia).  We will work with other Regions later.
* Select EC2 and Launch an Instance
* [Anaconda EC2 Image](http://docs.continuum.io/anaconda/images#id4)
* [Understand Costs](http://www.ec2instances.info/)
* Spin up an `m3.medium` instance
* Set Security Groups (Open Ports - ssh, http, https, tcp on 8888, 8000)
* Launch
* Create a Key - You need this to login to the box

The instance should be ready in a couple of minutes.


### Any issues?

### Logging into AWS EC2 System:

* Go to Terminal App in Mac / git bash on Windows

* Set Permission on .pem file
    * `chmod 600 pemfile.pem`

* SSH into the AWS system
    * `ssh -i <pem file with path> ubuntu@<ec2 box>`
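
`chmod 600` limits the key file to owner read/write, which `ssh` expects before it will use the key. If you prefer to script this step, the same permission change can be sketched in Python; the function name is hypothetical and the demo runs against a temporary file rather than a real key.

```python
import os
import stat
import tempfile


def restrict_key_permissions(pem_path):
    """Equivalent of `chmod 600 pem_path`; returns the resulting mode bits."""
    os.chmod(pem_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(pem_path).st_mode)


if __name__ == "__main__":
    # Demo on a throwaway file standing in for the .pem file.
    with tempfile.NamedTemporaryFile(suffix=".pem") as demo:
        print(oct(restrict_key_permissions(demo.name)))
```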

### Install / Update Packages:


* Update the Linux package lists - `sudo apt-get update`
* Install git - `sudo apt-get install git`
* Update conda - `conda update conda`
* Install Conda packages -
```
conda install ipython-notebook
conda install pandas
conda install matplotlib
conda install scikit-learn
conda install seaborn
```

* `git clone` your SF_DAT_15_WORK repo
* `ls` to confirm that you see your WORK repo
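
To confirm that the conda packages above installed correctly, you can try importing them. This is a minimal sketch; `missing_packages` is a hypothetical helper, and note that two import names differ from the conda package names (`ipython-notebook` is imported as `IPython`, `scikit-learn` as `sklearn`).

```python
import importlib


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Import names, not conda package names.
wanted = ["IPython", "pandas", "matplotlib", "sklearn", "seaborn"]
print("Missing:", missing_packages(wanted))
```

An empty list means everything imported; anything listed needs another `conda install`.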

### IPython Server Setup:

`ipython profile create nbserver`

> Output: [ProfileCreate] Generating default config file: u'/home/ubuntu/.ipython/profile_nbserver/ipython_config.py'



### Create a Password for IPython Notebook:

Start `ipython`, then run:

* `from IPython.lib import passwd`
* `passwd()`

>'sha1:7370b8b10c51:d3c284bac0f8d6a066f69a8485ae3591ab758983'
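
The output string has three colon-separated parts: the algorithm, a random salt, and a hex digest. The sketch below reproduces that layout with the standard library so you can see why the config file never needs your plain-text password. It assumes the digest is computed over the passphrase followed by the salt; it is an illustration of the format, not IPython's own code, and both function names are hypothetical.

```python
import hashlib
import secrets


def make_hash(passphrase, algorithm="sha1", salt_len=12):
    """Build an 'algorithm:salt:digest' string for `passphrase`."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hex characters
    digest = hashlib.new(algorithm)
    digest.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, digest.hexdigest()))


def check_hash(hashed, passphrase):
    """Return True if `passphrase` matches an 'algorithm:salt:digest' string."""
    try:
        algorithm, salt, expected = hashed.split(":", 2)
    except ValueError:
        return False
    digest = hashlib.new(algorithm)
    digest.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return digest.hexdigest() == expected


hashed = make_hash("not-my-real-password")
print(hashed)
print(check_hash(hashed, "not-my-real-password"))
```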


### Create SSL Certificate:
```
mkdir .certificates
cd .certificates
pwd
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem
```

[Stack Overflow Post](http://stackoverflow.com/questions/21477210/correct-location-of-openssl-cnf-file/21485937#21485937)

### Setup IPython Server Profile

`nano /home/ubuntu/.ipython/profile_nbserver/ipython_config.py`

```
# kernel config
c.IPKernelApp.pylab = 'inline'
# Notebook Config
c.NotebookApp.certfile = u'/home/ubuntu/.certificates/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'<sha1 hash returned by passwd()>'
c.NotebookApp.port = 8888
```

### Basic Setup Complete - Hurray
```
# Run the following Command
ipython notebook --profile=nbserver
```

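Open Browser: `https://<EC2 URL>:8888`

The server uses the self-signed certificate created earlier, so use `https` and accept the browser's certificate warning.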

Do you see the IPython Notebook?  - Any Problems?
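
If the page does not load, a common cause is that port 8888 is not open in the Security Group. A quick check from your laptop is to attempt a TCP connection to the port. This is a minimal sketch; `port_is_open` is a hypothetical helper, and the demo below only tests against a local socket so that it runs anywhere.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Demo: listen on a free local port and confirm it is reachable.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    print(port_is_open("127.0.0.1", listener.getsockname()[1]))
    listener.close()
```

Against the real server you would call `port_is_open("<ec2 box>", 8888)`. A `False` result points at the Security Group or at the notebook server not running.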

## Resources and Further Reading

* [Don't use Hadoop - your data isn't that big](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)
* [Stack Overflow Post on SSL](http://stackoverflow.com/questions/21477210/correct-location-of-openssl-cnf-file/21485937#21485937)
* [IPython HTML Notebook Documentation](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html)
* [IPython Magic Functions](http://ipython.org/ipython-doc/dev/interactive/tutorial.html)

* Use [Screen](http://www.thegeekstuff.com/2010/07/screen-command-examples/) to keep the IPython Server running in the background.

* Use Supervisord to restart the process if it stops.