## General

This notebook documents and provides scripts for managing a Spark cluster.

The cluster is started using the spark-notebook script created by Kevin Coakley, which is based on a script created by Julaiti Alafat. The advantage of Coakley's script is that it uses AWS EMR instead of managing EC2 instances directly. Clone Coakley's script using:
```sh
git clone https://github.com/mas-dse/spark-notebook.git
```

Follow the directions there to initialize and start the web interface. The command that starts the cluster spinner is:
```
./run.py &
```
The first time you use the script you will need to enter your AWS credentials. They are saved in a YAML file and reused on later runs.

## Bootstrap
After the script starts the cluster, it executes a bootstrap script. The script is in the spark-notebook directory, in the file `provision/jupyter-provision-v0.4.sh`.

## Per-session bootstrap
You can add an additional script that will be executed after the general bootstrap. This script is executed on both the head node and the worker nodes. To restrict execution to the head node, surround your commands with the following if-then:
```sh
# check for master node
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
   #put here commands that are intended only for the head node
fi
```
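If part of your bootstrap logic lives in Python rather than shell, the same test can be sketched as below. This is a minimal sketch, assuming that `instance.json` is valid JSON with a boolean `isMaster` field (the `grep` above only looks for the text); the function name `is_master_node` is made up for illustration.

```python
import json

def is_master_node(info_path="/mnt/var/lib/info/instance.json"):
    """Return True if the instance metadata marks this node as the master.

    Assumption: the file is JSON and contains a boolean "isMaster" key.
    A missing key is treated as "not the master".
    """
    with open(info_path) as f:
        info = json.load(f)
    return bool(info.get("isMaster", False))
```

On a cluster node you would call `is_master_node()` with no arguments and run the head-node-only steps when it returns `True`.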
An example script is below:

```sh
# %load PrivateBootstrap.sh
# check for master node
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
   cd /mnt/workspace/

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "Start of bootstrap, set up git" #>> /mnt/workspace/PrivateBootstrap.log
   git config --global user.email "yoav.freund@gmail.com"
   git config --global user.name "Yoav Freund"
   git config --global credential.helper cache
   # Could not figure out a way to clone without user intervention,
   # so the clone is written to a one-line script that needs to be executed manually.
   echo "git clone https://github.com/ucsd-edx/edX-Micro-Master-in-Data-Science.git" >clone.sh

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "copy files from S3 to Local"  #>> /mnt/workspace/PrivateBootstrap.log
   mkdir Data
   cd Data
   aws s3 cp --recursive s3://dse-weather/weather.parquet  ./weather.parquet

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "copy files from Local to HDFS"  #>> /mnt/workspace/PrivateBootstrap.log
   hadoop fs -mkdir /weather
   hadoop fs -copyFromLocal weather.parquet /weather/weather.parquet

   date +%H.%M:%S:%N  #>> /mnt/workspace/PrivateBootstrap.log
   echo "Bootstrap done"  #>> /mnt/workspace/PrivateBootstrap.log
fi
```

The cluster nodes receive the script from S3, so you need to copy the script to S3 before starting the cluster. You can use the AWS command line interface (which you need to install on your laptop) to copy a local script to an S3 bucket that is accessible with the cluster's credentials:
```sh
aws s3 cp PrivateBootstrap.sh s3://dse-weather/PrivateBootstrap.sh
```
You then need to type this S3 location under "advanced options" when you start the cluster.
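If you prefer to do the upload from a notebook cell or a Python script, a thin wrapper around the same `aws s3 cp` command is one option. This is only a sketch: the helper names `s3_upload_command` and `upload_bootstrap` are hypothetical, and it assumes the AWS command line is installed and already configured with suitable credentials.

```python
import subprocess

def s3_upload_command(local_path, bucket, key):
    """Build the argument list for `aws s3 cp <local_path> s3://<bucket>/<key>`."""
    return ["aws", "s3", "cp", local_path, "s3://%s/%s" % (bucket, key)]

def upload_bootstrap(local_path, bucket, key):
    """Run the upload. Raises CalledProcessError if the AWS CLI exits with an error."""
    subprocess.run(s3_upload_command(local_path, bucket, key), check=True)
```

For the example above the call would be `upload_bootstrap("PrivateBootstrap.sh", "dse-weather", "PrivateBootstrap.sh")`.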

## Log files

At the bottom of the spark-notebook page, before you start a cluster, there is a line of the form:
```
EMR Logs S3 Bucket [?] s3://aws-logs-846273844940-us-east-1
```
This line tells you the s3 bucket where the logs reside.

It is not easy to tell which of the logs belong to your current cluster and which are left over from previous runs. The code below helps with that.

First, we get a listing of all of the files in the bucket:

In [23]:
logs_bucket="aws-logs-846273844940-us-east-1"
!aws s3 ls --recursive $logs_bucket/ > logOfLogs

In [24]:
# I now grep for today's date
import datetime as dt
now=dt.datetime.now()
now.day
#dt.datetime.strptime

18

In [25]:
#!grep "2018-02-19" logOfLogs
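The commented-out `grep` can also be done in Python, which avoids typing the date by hand. A minimal sketch, assuming each line of the listing starts with a `YYYY-MM-DD` date as `aws s3 ls --recursive` prints it; the helper name `lines_for_date` is made up for illustration.

```python
import datetime as dt

def lines_for_date(listing_lines, day):
    """Keep the listing lines whose leading timestamp falls on `day`.

    `day` can be a datetime.date or a datetime.datetime.
    """
    date_prefix = day.strftime("%Y-%m-%d")
    return [line for line in listing_lines if line.startswith(date_prefix)]
```

For example, `lines_for_date(open('logOfLogs'), dt.date.today())` would keep only the files written today.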

In [26]:
import re
# Listing lines look like: <date> <time> <size> <top-level dir>/<session id>/<rest of key>
pat=re.compile(r'(\d+-\d+-\d+\s+\d+:\d+:\d+)\s+(\d+)\s+([^/]+/)([^/]+)/(.*)')
C={}  # session id -> timestamps of that session's log files
with open('logOfLogs','r') as logs:
    for line in logs:
        m=pat.search(line)
        if m:
            timestamp,size,_dir,prefix,file=m.groups()
            #print(timestamp,size,prefix,file)

            ts=dt.datetime.strptime(timestamp,'%Y-%m-%d %H:%M:%S')
            # keep only files written today (or later in the current month)
            if now.year==ts.year and now.month==ts.month and ts.day>=now.day:
                C.setdefault(prefix,[]).append(ts)
print("A listing of today's logs\n")
print(" session\t Started\t\t Ended \t\t\t No. of files")
for prefix in C.keys():
    print('%s\t%s\t%s\t %d'%(prefix,min(C[prefix]),max(C[prefix]),len(C[prefix])))

## to do  : print logs in order of start time
print("_dir=",_dir)

A listing of today's logs

 session	 Started		 Ended 			 No. of files
j-1CRVRCKP9FR84	2018-02-18 19:06:35	2018-02-18 21:31:40	 423
j-1IIG6WD2TZPLR	2018-02-18 12:45:52	2018-02-18 13:26:31	 269
j-1VQQ62TIGBTRE	2018-02-18 12:24:51	2018-02-18 12:42:08	 239
j-2FINDADPAMW2G	2018-02-18 16:54:10	2018-02-18 17:20:30	 266
j-2FYNZE0A6FVV2	2018-02-18 21:33:35	2018-02-18 21:55:24	 159
j-2ZV5LAR33KCKC	2018-02-18 10:59:54	2018-02-18 11:02:25	 62
j-35P8X8EE5WLJU	2018-02-18 18:38:06	2018-02-18 19:00:39	 299
j-35XHQPIDSTWW	2018-02-18 09:38:33	2018-02-18 10:56:05	 237
j-3D1WM2IRZOHQO	2018-02-18 17:25:50	2018-02-18 18:33:39	 250
j-JU6TGVTKCC9Y	2018-02-18 13:33:31	2018-02-18 16:46:01	 396
_dir= elasticmapreduce/
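The to-do in the cell above (printing sessions in order of start time) could be handled with a small helper like the one below. It is a sketch only: `sessions_by_start` is a hypothetical name, and it assumes a dictionary shaped like `C`, mapping each session id to a list of `datetime` timestamps.

```python
def sessions_by_start(timestamps_by_session):
    """Return (session, first, last, file_count) tuples, earliest session first."""
    rows = [(session, min(stamps), max(stamps), len(stamps))
            for session, stamps in timestamps_by_session.items()]
    # Sort on the first timestamp of each session.
    return sorted(rows, key=lambda row: row[1])
```

The printing loop would then become `for row in sessions_by_start(C): print('%s\t%s\t%s\t %d' % row)`, so the most recently started cluster appears last.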


### Download a specific session for inspection

In [27]:
current='j-2FYNZE0A6FVV2'
s3path='s3://'+logs_bucket+'/'+_dir+current+'/'
print(s3path)
%cd /tmp
!aws s3 cp --recursive $s3path $current

s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/
/private/tmp
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-230-83.log.gz to j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-230-83.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-230-83.out.gz to j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-129-230-83.out.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-230-83.log.gz to j-2FYNZE0A6FVV2/node/i-02303a709f33ec7de/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-230-83.log.gz
download: s3://aws-logs-846273844940-us-east-1/

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/provision-node/apps-phase/stderr.gz to j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/provision-node/apps-phase/stderr.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/setup-devices/setup_var_lib_dir.log.gz to j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/setup-devices/setup_var_lib_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/applications/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-129-254-76.out.gz to j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/applications/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-129-254-76.out.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/daemons/instance-state/instance-state.log-2018-02-19-05-45.gz to j-2FYNZE0A6FVV2/node/i-0392a269daed0e6bb/daemons/instance-state/instance-state.log-2018-02-19

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/daemons/instance-state/instance-state.log-2018-02-19-05-45.gz to j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/daemons/instance-state/instance-state.log-2018-02-19-05-45.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/setup-devices/setup_emr_metrics.log.gz to j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/setup-devices/setup_emr_metrics.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/provision-node/c157c5d1-fa42-42f9-bdb2-34f22e3b781f/stdout.gz to j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/provision-node/c157c5d1-fa42-42f9-bdb2-34f22e3b781f/stdout.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/setup-devices/setup_tmp_dir.log.gz to j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/setup-devices/setup_tmp_dir.log.gz
download

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/provision-node/06f8c857-f2d3-4577-9ded-e0c729f1a156/stdout.gz to j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/provision-node/06f8c857-f2d3-4577-9ded-e0c729f1a156/stdout.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-05d93e22535a118c9/setup-devices/setup_var_log_dir.log.gz to j-2FYNZE0A6FVV2/node/i-05d93e22535a118c9/setup-devices/setup_var_log_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/setup-devices/DiskEncryptor.log.gz to j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/setup-devices/DiskEncryptor.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/setup-devices/setup_drives.log.gz to j-2FYNZE0A6FVV2/node/i-072db8d8c45c0e073/setup-devices/setup_drives.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/

download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-0d1144fddcdbfd933/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-252-164.log.gz to j-2FYNZE0A6FVV2/node/i-0d1144fddcdbfd933/applications/hadoop-yarn/yarn-yarn-nodemanager-ip-10-129-252-164.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-085eca42d61f81138/setup-devices/setup_var_lib_dir.log.gz to j-2FYNZE0A6FVV2/node/i-085eca42d61f81138/setup-devices/setup_var_lib_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-085eca42d61f81138/setup-devices/setup_var_log_dir.log.gz to j-2FYNZE0A6FVV2/node/i-085eca42d61f81138/setup-devices/setup_var_log_dir.log.gz
download: s3://aws-logs-846273844940-us-east-1/elasticmapreduce/j-2FYNZE0A6FVV2/node/i-0d1144fddcdbfd933/bootstrap-actions/1/controller.gz to j-2FYNZE0A6FVV2/node/i-0d1144fddcdbfd933/bootstrap-actions/1/controller.gz
download: s3://aws-logs-846273844940

In [28]:
#the top level partition is by the node name
!ls $current/node

i-02303a709f33ec7de i-04a87b895a47321e0 i-072db8d8c45c0e073 i-0d1144fddcdbfd933
i-0392a269daed0e6bb i-05d93e22535a118c9 i-085eca42d61f81138


In [29]:
# For each node there are several directories of logs.
# I currently care about bootstrap-actions for the master.
# You find the master by finding the file `master.log.gz` inside bootstrap-actions
!ls $current/node/*/bootstrap-actions/master.log.gz

j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/bootstrap-actions/master.log.gz
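The same lookup can be done from Python, which is handy if you want to use the result in later cells instead of copying the node id by hand. A minimal sketch, assuming the downloaded session keeps the `node/<node id>/bootstrap-actions/` layout shown above; `find_master_nodes` is a hypothetical helper name.

```python
from pathlib import Path

def find_master_nodes(session_dir):
    """Return the node directories that contain bootstrap-actions/master.log.gz."""
    pattern = "node/*/bootstrap-actions/master.log.gz"
    # For .../node/<id>/bootstrap-actions/master.log.gz, parents[1] is .../node/<id>.
    return sorted(p.parents[1] for p in Path(session_dir).glob(pattern))
```

With the session downloaded above, `find_master_nodes(current)` would be expected to return a single path ending in the master's node id.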


In [31]:
## We go to that directory to see what happened with our second bootstrap
%cd j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/bootstrap-actions/
!ls

/private/tmp/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/bootstrap-actions
1             2             master.log.gz


In [32]:
# go into directory 2, which holds the logs of our second bootstrap action (PrivateBootstrap.sh)
%cd 2
!ls -l

/private/tmp/j-2FYNZE0A6FVV2/node/i-04a87b895a47321e0/bootstrap-actions/2
total 64
-rw-r--r--  1 yoavfreund  wheel    840 Feb 18 21:42 controller.gz
-rw-r--r--  1 yoavfreund  wheel    124 Feb 18 21:42 stderr.gz
-rw-r--r--  1 yoavfreund  wheel  21003 Feb 18 21:42 stdout.gz


In [33]:
!gunzip *

In [34]:
!ls -l

total 1216
-rw-r--r--  1 yoavfreund  wheel    1989 Feb 18 21:42 controller
-rw-r--r--  1 yoavfreund  wheel     412 Feb 18 21:42 stderr
-rw-r--r--  1 yoavfreund  wheel  613162 Feb 18 21:42 stdout


In [36]:
!tail stderr

/emr/instance-controller/lib/bootstrap-actions/2/PrivateBootstrap.sh: line 23: hdfs: command not found
/emr/instance-controller/lib/bootstrap-actions/2/PrivateBootstrap.sh: line 24: hdfs: command not found
/emr/instance-controller/lib/bootstrap-actions/2/PrivateBootstrap.sh: line 25: hdfs: command not found
/emr/instance-controller/lib/bootstrap-actions/2/PrivateBootstrap.sh: line 27: hdfs: command not found
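
As an alternative to `gunzip` followed by `tail`, the compressed logs can be read in place, which leaves the downloaded files untouched. This is a sketch using only the Python standard library; `tail_gz` is a hypothetical helper name, and undecodable bytes are replaced rather than raising an error.

```python
import gzip
from collections import deque

def tail_gz(path, n=10):
    """Return the last n lines of a gzip-compressed text log, without newlines."""
    with gzip.open(path, "rt", errors="replace") as f:
        # A bounded deque keeps only the most recent n lines while streaming the file.
        last_lines = deque(f, maxlen=n)
    return [line.rstrip("\n") for line in last_lines]
```

For example, `print("\n".join(tail_gz("stderr.gz")))` shows the end of the error log for a bootstrap action while it is still compressed.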
