## General

This notebook documents and provides scripts for managing a Spark cluster.

The cluster is started using the spark-notebook script created by Kevin Coakley, which is based on a script created by Julaiti Alafat. The advantage of Coakley's script is that it uses AWS EMR instead of directly managing EC2 instances. Clone Coakley's script using:
```sh
git clone https://github.com/mas-dse/spark-notebook.git
```

Follow its directions to initialize and start the web interface. The command that starts the cluster spinner is:
```
./run.py &
```
The first time you use the script you will need to enter your AWS credentials. They are saved in a YAML file and reused on later runs.

## Bootstrap
After the script starts the cluster, it executes a bootstrap script. The script is in the spark-notebook directory, in the file `provision/jupyter-provision-v0.4.sh`.

## Per-session bootstrap
You can add an additional script that will be executed after the general bootstrap. This script will be executed on both the head node and the worker nodes. To restrict execution to the head node, surround your commands with the following if-then:
```sh
# check for master node
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
   #put here commands that are intended only for the head node
fi
```
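
The same check can also be sketched in Python, for example when a bootstrap step is itself a Python program. This is a minimal sketch, assuming that `instance.json` is a JSON object with a boolean `isMaster` field, which is what the `grep` test above relies on. The function name `is_master_node` is hypothetical.

```python
import json

INSTANCE_INFO = "/mnt/var/lib/info/instance.json"  # path used by the grep check above

def is_master_node(instance_json_text):
    """Return True if the instance description marks this node as the head node.

    Assumption: the file holds a JSON object with a boolean "isMaster" key.
    """
    info = json.loads(instance_json_text)
    return bool(info.get("isMaster", False))

if __name__ == "__main__":
    try:
        with open(INSTANCE_INFO) as f:
            print("head node" if is_master_node(f.read()) else "worker node")
    except FileNotFoundError:
        # Not running on a cluster node, so the file does not exist.
        print("instance.json not found")
```

Parsing the JSON is stricter than the two chained `grep` calls, which would also match the word `true` anywhere on the same line as `isMaster`.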
An example script is below:

```sh
# %load PrivateBootstrap.sh
# check for master node
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
   cd /mnt/workspace/

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "Start of bootstrap, set up git" #>> /mnt/workspace/PrivateBootstrap.log
   git config --global user.email "yoav.freund@gmail.com"
   git config --global user.name "Yoav Freund"
   git config --global credential.helper cache
   echo "git clone https://github.com/ucsd-edx/edX-Micro-Master-in-Data-Science.git" >clone.sh    # could not figure out a way to clone without user intervention,
   # so making the clone into a one-line script that needs to be executed manually.

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "copy files from S3 to Local"  #>> /mnt/workspace/PrivateBootstrap.log
   mkdir Data
   cd Data
   aws s3 cp --recursive s3://dse-weather/weather.parquet  ./weather.parquet

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "copy files from Local to HDFS"  #>> /mnt/workspace/PrivateBootstrap.log
   hadoop fs -mkdir /weather
   hadoop fs -copyFromLocal weather.parquet /weather/weather.parquet

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "Bootstrap done"  #>> /mnt/workspace/PrivateBootstrap.log
fi
```

The cluster nodes will receive the script from S3. You therefore need to copy the script into S3 before starting the cluster. You can use the AWS command line interface (which you need to install on your laptop) to copy a local script to an S3 bucket that is accessible using the credentials of the cluster:
```sh
aws s3 cp PrivateBootstrap.sh s3://dse-weather/PrivateBootstrap.sh
```
You then need to type the S3 location of the script under "advanced options" when you start the cluster.
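
The upload step can also be driven from a notebook cell or a Python script. The following is only a sketch: it shells out to the same `aws s3 cp` command shown above, and the helper names `build_upload_command` and `upload_bootstrap` are hypothetical.

```python
import subprocess

def build_upload_command(local_path, bucket, key):
    """Build the `aws s3 cp` argument list for uploading one local file."""
    return ["aws", "s3", "cp", local_path, "s3://%s/%s" % (bucket, key)]

def upload_bootstrap(local_path, bucket, key):
    """Run the upload; raises CalledProcessError if the AWS CLI reports failure."""
    subprocess.run(build_upload_command(local_path, bucket, key), check=True)

# The command from the text above, expressed through the helper:
cmd = build_upload_command("PrivateBootstrap.sh", "dse-weather", "PrivateBootstrap.sh")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems if the file name contains spaces. Calling `upload_bootstrap` requires the AWS CLI to be installed and configured.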

## Log files

At the bottom of the spark-notebook page, before you start a cluster, there is a line of the form:
```
EMR Logs S3 Bucket [?] s3://aws-logs-846273844940-us-east-1
```
This line tells you the s3 bucket where the logs reside.
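
The cells further down use the bare bucket name, without the `s3://` prefix. As a small sketch, the name can be extracted from the URI with the standard library rather than retyped. The helper name `bucket_from_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse

def bucket_from_s3_uri(uri):
    """Return the bucket name of an s3://bucket[/key] URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: %r" % uri)
    return parsed.netloc

logs_bucket = bucket_from_s3_uri("s3://aws-logs-846273844940-us-east-1")
print(logs_bucket)
```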

It is not easy to find out which of the logs are related to your current cluster and which are left over from previous runs. I wrote some code here to help with that.

First, we get a listing of all of the files in the bucket:

In [107]:
logs_bucket="aws-logs-846273844940-us-east-1"
!aws s3 ls --recursive $logs_bucket/ > logOfLogs

In [53]:
# Get today's date; the cell below uses it to select today's logs
import datetime as dt
now=dt.datetime.now()
now.day

18

In [110]:
!head -1 logOfLogs

2017-05-25 09:46:45       2258 elasticmapreduce/j-11GVYBFRIKQ8I/containers/application_1495728137505_0001/container_1495728137505_0001_01_000001/stderr.gz


In [101]:
import re
# Each listing line has the form: <date> <time> <size> <top-level dir>/<session id>/<rest of key>
pat=re.compile(r'(\d+-\d+-\d+\s+\d+:\d+:\d+)\s+(\d+)\s+([^/]+/)([^/]+)/(.*)')
C={}   # maps session id (prefix) to the timestamps of today's log files
with open('logOfLogs','r') as logs:
    for line in logs:
        m=pat.search(line)
        if m:
            timestamp,size,_dir,prefix,file=m.groups()
            #print(timestamp,size,prefix,file)

            ts=dt.datetime.strptime(timestamp,'%Y-%m-%d %H:%M:%S')
            if now.year==ts.year and now.month==ts.month and now.day==ts.day:
                if prefix in C:
                    C[prefix].append(ts)
                else:
                    C[prefix]=[ts]
print("A listing of today's logs\n")
print(" session\t Started\t\t Ended \t\t\t No. of files")
for prefix in C.keys():
    print('%s\t%s\t%s\t %d'%(prefix,min(C[prefix]),max(C[prefix]),len(C[prefix])))

# _dir holds the top-level directory of the last matched line
print("_dir=",_dir)

A listing of today's logs

 session	 Started		 Ended 			 No. of files
j-1IIG6WD2TZPLR	2018-02-18 12:45:52	2018-02-18 13:26:31	 269
j-1VQQ62TIGBTRE	2018-02-18 12:24:51	2018-02-18 12:42:08	 239
j-2ZV5LAR33KCKC	2018-02-18 10:59:54	2018-02-18 11:02:25	 62
j-35XHQPIDSTWW	2018-02-18 09:38:33	2018-02-18 10:56:05	 237
j-JU6TGVTKCC9Y	2018-02-18 13:33:31	2018-02-18 14:23:33	 262
_dir= elasticmapreduce/
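
The grouping logic in the cell above can also be packaged as a function, so that it can be run for any date rather than only today. This is a minimal sketch, assuming the listing lines have the format shown by `head -1 logOfLogs`. The name `summarize_sessions` is hypothetical.

```python
import datetime as dt
import re
from collections import defaultdict

# <date> <time> <size> <top-level dir>/<session id>/<rest of key>
LINE_PAT = re.compile(r'(\d+-\d+-\d+\s+\d+:\d+:\d+)\s+(\d+)\s+([^/]+/)([^/]+)/(.*)')

def summarize_sessions(lines, day):
    """Map session id -> (first timestamp, last timestamp, file count) for one date."""
    stamps = defaultdict(list)
    for line in lines:
        m = LINE_PAT.search(line)
        if not m:
            continue  # skip lines that are not object listings
        timestamp, _size, _top_dir, session, _rest = m.groups()
        ts = dt.datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
        if ts.date() == day:
            stamps[session].append(ts)
    return {s: (min(t), max(t), len(t)) for s, t in stamps.items()}

# Example on the single listing line shown earlier.
sample = ["2017-05-25 09:46:45       2258 elasticmapreduce/j-11GVYBFRIKQ8I/"
          "containers/application_1495728137505_0001/"
          "container_1495728137505_0001_01_000001/stderr.gz"]
summary = summarize_sessions(sample, dt.date(2017, 5, 25))
print(summary)
```

In the notebook, the whole listing could be summarized with `summarize_sessions(open('logOfLogs'), dt.date.today())`.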


### Download a specific session for inspection

In [113]:
current='j-JU6TGVTKCC9Y'
s3path='s3://'+logs_bucket+'/'+_dir+current+'/'
print(s3path)
%cd /tmp
!aws s3 cp --recursive $s3path $current

s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/
/private/tmp
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.log.gz to j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.log.2018-02-18-21.gz to j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.log.2018-02-18-21.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.out.gz to j-JU6TGVTKCC9Y/node/i-00ef5ccc42effdfa1/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-239-205.out.gz
download: s3://aws-log

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-21-45.gz to j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-21-45.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-22-00.gz to j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-22-00.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-22-30.gz to j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/daemons/instance-state/instance-state.log-2018-02-18-22-30.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0288ea472349f126e/provision-node/51d7c60d-3b01-4f7b-9a38-adbbfb7115af/controller.gz to j-JU6TGVTKCC9Y

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz to j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-00.gz to j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-00.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-15.gz to j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-23-15.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02fe4b250d74129e3/daemons/instance-state/instance-state.log-2018-02-18-22-45.gz to j-JU6TGVTKCC9Y/nod

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_cache_dir.log.gz to j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_cache_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/provision-node/4cf0eee3-ccc9-4c3e-9ee4-ff437eca76b8/stdout.gz to j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/provision-node/4cf0eee3-ccc9-4c3e-9ee4-ff437eca76b8/stdout.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_lib_dir.log.gz to j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_lib_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_log_dir.log.gz to j-JU6TGVTKCC9Y/node/i-02ff8b34babf37087/setup-devices/setup_var_log_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/ela

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0371fc8754bf345c5/provision-node/c44978d9-4536-45ab-87f6-ef93fb6111e2/stderr.gz to j-JU6TGVTKCC9Y/node/i-0371fc8754bf345c5/provision-node/c44978d9-4536-45ab-87f6-ef93fb6111e2/stderr.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/bootstrap-actions/1/controller.gz to j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/bootstrap-actions/1/controller.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/daemons/instance-state/instance-state.log-2018-02-18-22-15.gz to j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/daemons/instance-state/instance-state.log-2018-02-18-22-15.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/daemons/instance-state/instance-state.log-2018-02-18-23-00.gz to j-JU6TGVTKCC9Y/node/i-046f6a88e76cd30e0/daemons/instance-state/instance-st

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-mapreduce/mapred-mapred-historyserver-ip-10-129-253-102.log.2018-02-18-22.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-mapreduce/mapred-mapred-historyserver-ip-10-129-253-102.log.2018-02-18-22.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-mapreduce/mapred-mapred-historyserver-ip-10-129-253-102.out.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-mapreduce/mapred-mapred-historyserver-ip-10-129-253-102.out.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-yarn/yarn-yarn-proxyserver-ip-10-129-253-102.out.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/applications/hadoop-yarn/yarn-yarn-proxyserver-ip-10-129-253-102.out.gz
download: s3://aws-logs-846273844940-us-east-1/elasticm

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_tmp_dir.log.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_tmp_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_cache_dir.log.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_cache_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_log_dir.log.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_log_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_lib_dir.log.gz to j-JU6TGVTKCC9Y/node/i-06291bfec6ce584fd/setup-devices/setup_var_lib_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-07f2897a3ec61bb3d/

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/bootstrap-actions/2/controller.gz to j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/bootstrap-actions/2/controller.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/bootstrap-actions/1/controller.gz to j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/bootstrap-actions/1/controller.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/daemons/instance-state/instance-state.log-2018-02-18-21-45.gz to j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/daemons/instance-state/instance-state.log-2018-02-18-21-45.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-251-153.log.gz to j-JU6TGVTKCC9Y/node/i-09ad5dff6a2e89c7a/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-251-153.log.gz
download

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-23-15.gz to j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-23-15.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-22-45.gz to j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-22-45.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz to j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/provision-node/apps-phase/stderr.gz to j-JU6TGVTKCC9Y/node/i-0a3d6028144030752/prov

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/daemons/instance-state/instance-state.log-2018-02-18-22-45.gz to j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/daemons/instance-state/instance-state.log-2018-02-18-22-45.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz to j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/daemons/instance-state/instance-state.log-2018-02-18-23-30.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/provision-node/d28ca74a-a9be-44e1-8c57-fb48f287d28c/stdout.gz to j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/provision-node/d28ca74a-a9be-44e1-8c57-fb48f287d28c/stdout.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/setup-devices/setup_drives.log.gz to j-JU6TGVTKCC9Y/node/i-0ddd3ad4127dfd0c4/setup-

In [106]:
%pwd

'/private/tmp'
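
Once a session has been downloaded, the compressed logs can be searched without unpacking them by hand. The following is a minimal sketch and not part of the original notebook: it walks a directory, opens every `.gz` file with Python's `gzip` module, and collects the lines that contain a keyword. The function name `grep_gz_logs` is hypothetical, and the keyword to search for is an assumption that depends on what you are debugging.

```python
import gzip
import os

def grep_gz_logs(root, keyword):
    """Return (path, line number, line) for each line containing keyword in *.gz files under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".gz"):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going if a log has non-UTF-8 bytes
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if keyword in line:
                        hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

For the session downloaded above, a call such as `grep_gz_logs('/tmp/j-JU6TGVTKCC9Y', 'ERROR')` would list the matching lines together with the file each one came from.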