# Simple Setup - Single Master and a Worker

https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc

__1. In a terminal, go to the Spark home folder, then navigate to the sbin folder. Start your master using the following command:__

./start-master.sh

__2. Then open the following URL in your browser:__

localhost:8080

__3. If the Spark master has started successfully, you will see a web page with Spark info. Below the Spark logo there is a line beginning with the word URL. For example,__

URL: spark://m:7077
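
If you would rather not copy the URL by hand, it can be picked out of the page text with a regular expression. This is only a minimal sketch: the helper name `find_master_url` is made up for illustration, and it assumes the URL always has the `spark://host:port` form shown above.

```python
import re

# Matches standalone master URLs of the form spark://host:port.
MASTER_URL_PATTERN = re.compile(r"spark://[A-Za-z0-9.\-]+:\d+")


def find_master_url(page_text):
    """Return the first spark://host:port URL found in page_text, or None."""
    match = MASTER_URL_PATTERN.search(page_text)
    return match.group(0) if match else None


# Sample line as shown on the master web page in step 3.
sample = "URL: spark://m:7077"
print(find_master_url(sample))  # prints spark://m:7077
```

To try it against a running master, the page text could be fetched with `urllib.request.urlopen("http://localhost:8080").read().decode()` and passed to the helper.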


__4. The URL above is the master URL. You can add it to your program's Spark configuration.__

For example, in Python,<br/>
<br/>
from pyspark import SparkConf, SparkContext<br/>
conf = (SparkConf().setMaster("spark://m:7077").setAppName("Examples"))<br/>
sc = SparkContext(conf=conf)<br/>

For example, in Scala,<br/>
<br/>
import org.apache.spark.{SparkConf, SparkContext}<br/>
val conf = new SparkConf().setAppName("Examples").setMaster("spark://m:7077")<br/>
val sc = new SparkContext(conf)<br/>

__5. The next step is to start the worker. In your terminal, in the SPARK_HOME/sbin/ folder, type the following command:__

./start-slave.sh spark://m:7077<br/>

__Here the command "./start-slave.sh" is followed by a space and then the master's URL, which is passed as an argument. After executing the command, you can check localhost:8080 for your worker.__
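
Instead of reading the web page by eye, the worker check can be scripted. The sketch below assumes that the master also serves a JSON summary (for example at `localhost:8080/json`) with a `workers` list whose entries carry `host` and `state` fields. Those field names are assumptions, so compare them with what your own Spark version actually returns. The sample data is made up.

```python
import json


def alive_workers(master_state):
    """Return host names of workers whose state is ALIVE.

    master_state is the parsed JSON summary of the master. The "workers",
    "host" and "state" field names are assumptions, not a documented contract.
    """
    return [worker["host"]
            for worker in master_state.get("workers", [])
            if worker.get("state") == "ALIVE"]


# Hypothetical sample of what the JSON summary might look like.
sample_json = (
    '{"url": "spark://m:7077", "workers": ['
    '{"host": "m", "state": "ALIVE"}, '
    '{"host": "m", "state": "DEAD"}]}'
)
print(alive_workers(json.loads(sample_json)))  # prints ['m']
```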

__6. When you run an application, the master and the worker automatically allocate memory and cores for its tasks. Test it with any example.__
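
One possible example to test with is a small job whose answer is known in advance. The sketch below sums the squares of 0..n-1 on the cluster and compares the result with the closed-form value. The function names and the app name "SmokeTest" are made up for illustration, and the master URL is the `spark://m:7077` placeholder used above.

```python
def square(x):
    return x * x


def expected_sum_of_squares(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def run_smoke_test(master_url, n=1000):
    """Sum the squares of 0..n-1 on the cluster and return the result."""
    # Imported here so this file can be loaded without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master_url).setAppName("SmokeTest")
    sc = SparkContext(conf=conf)
    try:
        return sc.parallelize(range(n)).map(square).sum()
    finally:
        # Always release the cores and memory held by this application.
        sc.stop()
```

With the master and worker running, `run_smoke_test("spark://m:7077") == expected_sum_of_squares(1000)` should evaluate to `True`, and the application should show up on the master web page.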


## Official Documentation
http://spark.apache.org/docs/latest/spark-standalone.html


Note: Set the following environment variables,<br/>

For Python (only if you haven't set these up earlier),<br/>
export PYSPARK_PYTHON="/usr/bin/python2"<br/>
export PYSPARK_DRIVER_PYTHON="python2"<br/>

For Scala,<br/>
None newly required. <br/>
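
If you prefer to set the Python variables from inside a script rather than in the shell, something along these lines may work. It is a sketch under the assumption that the variables are set before the SparkContext is created, and both helper names are made up. Values that are already set are left alone, in line with the "only if you haven't set these up earlier" remark.

```python
import os


def pyspark_python_env(python_path="/usr/bin/python2"):
    """Return the two variables from the note above for the given interpreter."""
    return {
        "PYSPARK_PYTHON": python_path,
        "PYSPARK_DRIVER_PYTHON": os.path.basename(python_path),
    }


def apply_if_unset(settings, environ=os.environ):
    """Copy settings into environ, keeping any value that is already set."""
    for key, value in settings.items():
        environ.setdefault(key, value)
    return environ
```

Calling `apply_if_unset(pyspark_python_env())` at the top of a driver script would then mirror the two `export` lines above.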

# Multiple Worker Setup - Single Master and Multiple Workers using start-slave.sh

### NOTE - YOU HAVE TO SHUT DOWN (n-1) WORKERS MANUALLY

https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc

__1. Follow the instructions in the section above to start the master.__

__2. When starting the slaves, you can either create a configuration file conf/spark-env.sh as shown in the documentation linked above, or use the following command:__

SPARK_WORKER_INSTANCES=4 ./sbin/start-slave.sh spark://m:7077

__3. Note that nothing is specified here except the number of workers needed. All other options are allocated greedily by Spark.__
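
If you would rather not rely on the greedy defaults, the conf/spark-env.sh route lets you set the per-worker cores and memory explicitly. The sketch below only generates the text of such a file by splitting a machine's resources evenly. The helper name and the even split are illustrative choices, while `SPARK_WORKER_INSTANCES`, `SPARK_WORKER_CORES` and `SPARK_WORKER_MEMORY` are the standalone-mode variables described in the official documentation.

```python
def render_spark_env(instances, total_cores, total_memory_gb):
    """Return conf/spark-env.sh text that splits cores and memory evenly."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # Give every worker at least one core and 1 GB, rounding down otherwise.
    cores_each = max(1, total_cores // instances)
    memory_each = max(1, total_memory_gb // instances)
    lines = [
        "export SPARK_WORKER_INSTANCES=%d" % instances,
        "export SPARK_WORKER_CORES=%d" % cores_each,
        "export SPARK_WORKER_MEMORY=%dg" % memory_each,
    ]
    return "\n".join(lines) + "\n"


# Example: 4 workers on a machine with 8 cores and 16 GB of memory.
print(render_spark_env(instances=4, total_cores=8, total_memory_gb=16))
```

The printed text can be saved as conf/spark-env.sh before starting the workers.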


# Multiple Worker Setup - Single Master and Multiple Workers using start-slaves.sh

https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-standalone-example-2-workers-on-1-node-cluster.adoc

https://www.cs.helsinki.fi/ukko/hpc-report.txt

__1. Follow the instructions in the section above to start the master.__

__2. Now create an ssh key and add it to the ssh-agent__<br/>
     cd ~/.ssh/<br/>
     ssh-keygen -t rsa  (skip this if ~/.ssh/id_rsa already exists)<br/>
     ssh-copy-id localhost<br/>
     eval "$(ssh-agent -s)"<br/>
     ssh-add ~/.ssh/id_rsa<br/>
     After all these, try issuing the following command from the Spark home folder, <br/>
        ./sbin/start-slaves.sh spark://m:7077<br/>

Note: 
You can also set up password-less ssh to the slaves to reduce the time taken during setup. There are several great tutorials available.
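
To see whether password-less login already works for a host, a check like the following can help. It is a sketch for Python 3 with made-up helper names. It relies on the ssh client option `BatchMode=yes`, which makes ssh fail instead of prompting for a password.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def has_passwordless_ssh(host):
    """Return True if running `true` on host over ssh succeeds without a prompt."""
    try:
        result = subprocess.run(ssh_check_command(host),
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        # The ssh client itself is not installed or not on PATH.
        return False
    return result.returncode == 0
```

For example, `has_passwordless_ssh("localhost")` should return `True` once the key setup above has worked.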

## Hint:
1. Try running this command before the ssh-copy-id command: <br/> 
    ssh localhost<br/>
2. On a Mac, allow Remote Login in your Sharing preferences, then use the ssh localhost command.
 
### References:
https://mbonaci.github.io/mbo-spark/<br/>

# Cluster in UKKO

https://www.cs.helsinki.fi/ukko/hpc-report.txt

### Refer to the following Linux Fundamentals course on setting up SSH keys
https://wiki.helsinki.fi/display/linuxfun2013/Week1
    
__1. Log in to melkki, then to the ukko node where you want to run your master. Create an ssh key on that node.<br/>__
__2. Now copy the ssh public key to all the other ukko nodes where you want to run the workers.<br/>__
    
    cat ~/.ssh/id_rsa.pub | ssh ukkoxy1.hpc.cs.helsinki.fi "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa.pub | ssh ukkoxy2.hpc.cs.helsinki.fi "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa.pub | ssh ukkoxy3.hpc.cs.helsinki.fi "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa.pub | ssh ukkoxy4.hpc.cs.helsinki.fi "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
       
In the commands above I have chosen ukkoxy1, ukkoxy2, ukkoxy3, and ukkoxy4 as the worker nodes for my master node on ukkoxy0. 
Remember that you now have password-less ssh access from ukkoxy0 to the worker nodes. <br/>

__3. The next step is to add the list of worker host names to the conf/slaves file.<br/>__
    ukkoxy1.hpc.cs.helsinki.fi<br/>
    ukkoxy2.hpc.cs.helsinki.fi<br/>
    ukkoxy3.hpc.cs.helsinki.fi<br/>
    ukkoxy4.hpc.cs.helsinki.fi<br/>
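
With more than a handful of workers, the conf/slaves file can be generated rather than typed. The sketch below writes one host name per line. The helper names are made up, and the demo writes into a temporary folder instead of a real Spark home.

```python
import os
import tempfile


def ukko_hosts(node_names, domain="hpc.cs.helsinki.fi"):
    """Turn short node names into fully qualified host names."""
    return [name + "." + domain for name in node_names]


def write_slaves_file(spark_home, hosts):
    """Write one worker host name per line to <spark_home>/conf/slaves."""
    conf_dir = os.path.join(spark_home, "conf")
    if not os.path.isdir(conf_dir):
        os.makedirs(conf_dir)
    path = os.path.join(conf_dir, "slaves")
    with open(path, "w") as handle:
        for host in hosts:
            handle.write(host + "\n")
    return path


# Demo against a throwaway folder; pass your real Spark home instead.
demo_home = tempfile.mkdtemp()
demo_path = write_slaves_file(demo_home, ukko_hosts(["ukkoxy1", "ukkoxy2"]))
with open(demo_path) as handle:
    print(handle.read())
```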
    
__4. Now start your master using the following command,<br/>__
    ./sbin/start-master.sh 

__5. To check if your master is running, use the following command,<br/>__
    ps aux | grep spark
    
    It will list all running spark processes. For example,
    
    chinnasa 17765  2.2  0.9 5828172 298884 pts/0  Sl   12:00   0:04 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -cp /cs/home/chinnasa/spark/spark-2.0.2-bin-hadoop2.7/conf/:/cs/home/chinnasa/spark/spark-2.0.2-bin-hadoop2.7/jars/* -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --host ukko043.hpc.cs.helsinki.fi --port 7077 --webui-port 8080
    
    From the output above, you can read off the host and port on which the master is running.<br/>
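
Reading the host and port off the process line can also be scripted. This is a small sketch with a made-up helper name. It assumes the Master process was started with `--host` and `--port` arguments, as in the example line above.

```python
import re


def master_address(ps_line):
    """Return (host, port) from a Master process line of `ps aux`, or None."""
    if "org.apache.spark.deploy.master.Master" not in ps_line:
        return None
    host = re.search(r"--host\s+(\S+)", ps_line)
    port = re.search(r"--port\s+(\d+)", ps_line)
    if not (host and port):
        return None
    return host.group(1), int(port.group(1))


# Tail of the example process line shown above.
sample = ("org.apache.spark.deploy.master.Master "
          "--host ukko043.hpc.cs.helsinki.fi --port 7077 --webui-port 8080")
host, port = master_address(sample)
print("spark://%s:%d" % (host, port))  # prints spark://ukko043.hpc.cs.helsinki.fi:7077
```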

__6. The final step is to start the Spark slaves. Use the following command,<br/>__
    ./sbin/start-slaves.sh spark://ukkoxy0.hpc.cs.helsinki.fi:7077
    <br/>
    It should list where the .out files are stored.<br/>
    
__7. You can go to the corresponding ukko nodes and check the list of running processes using the following command, <br/>__
    htop
    Use F10 to exit 'htop'. <br/>

__8. To stop the master and all slaves, use the following command,<br/>__
    ./sbin/stop-all.sh

Congrats, you have now set up a Spark Standalone cluster with four slaves and a master. When running your programs, use the following Spark configuration settings (replace the host name with that of your own master node),<br/>

For Python,<br/>
from pyspark import SparkConf, SparkContext<br/>
conf = (SparkConf().setMaster("spark://ukko043.hpc.cs.helsinki.fi:7077").setAppName("Examples"))<br/>
sc = SparkContext(conf=conf)<br/>

For Scala,<br/>
import org.apache.spark.{SparkConf, SparkContext}<br/>
val conf = new SparkConf().setAppName("week2").setMaster("spark://ukko043.hpc.cs.helsinki.fi:7077")<br/>
val sc = new SparkContext(conf)<br/>

NOTE: Your ssh keys do not always remain in the key ring, so whenever you need to start your workers, first check that you have access without being asked for a password.

Also, there is an issue with python3 on UKKO, but python2 works. For now, use python2 until the issue is resolved.