This is the repository for a Udacity MOOC about Spark.

Course Link | Spark Documentation | Spark Download
- Download and Install Spark
- Install `pyspark` via pip: `pip install pyspark`
- ... or via Anaconda: `conda install pyspark`
- On your machine, navigate to: `/usr/local/Cellar/apache-spark/2.4.5/libexec`
- Start the Master Node: `./sbin/start-master.sh -h <ip-address where to run>`
- Stop the Master Node: `./sbin/stop-master.sh`
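The start/stop scripts above can be chained into a small local standalone cluster. A hypothetical session is sketched below; the loopback address, the Spark 2.4.5 Homebrew path, and the default ports (8080 for the web UI, 7077 for the master) are assumptions, not prescribed by the course:

```shell
# Hypothetical local session, assuming Spark 2.4.5 installed via Homebrew
cd /usr/local/Cellar/apache-spark/2.4.5/libexec
./sbin/start-master.sh -h 127.0.0.1            # master web UI at http://127.0.0.1:8080
./sbin/start-slave.sh spark://127.0.0.1:7077   # attach a worker to the master
# ... run jobs against spark://127.0.0.1:7077 ...
./sbin/stop-slave.sh
./sbin/stop-master.sh
```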
- Connect to the instance (via SSH, or Browser + Proxy): `ssh -i <path>/<key_name>.pem hadoop@ec2-###-###-###-###.compute.amazonaws.com`
- Copy files to the cluster's master node: `scp -i <path>/<key_name>.pem ~/Desktop/sparkify_log_small.json hadoop@ec2-###-###-###-###.compute.amazonaws.com:~/`
- Create a new HDFS folder: `hdfs dfs -mkdir /user/<newFolder>`
- Need help? Run `hdfs` or `hdfs dfs` without arguments to list the available commands.
- Copy a file from the master node into HDFS: `hdfs dfs -copyFromLocal <file> /user/<folder>`
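Put together, the transfer steps above form one pipeline from laptop to HDFS. In this sketch the folder name `sparkify_data` and the `<master-dns>` placeholder are illustrative assumptions:

```shell
# Hypothetical end-to-end transfer: laptop -> master node -> HDFS
scp -i <path>/<key_name>.pem ~/Desktop/sparkify_log_small.json hadoop@<master-dns>:~/
ssh -i <path>/<key_name>.pem hadoop@<master-dns>
hdfs dfs -mkdir -p /user/sparkify_data                     # -p creates parent folders as needed
hdfs dfs -copyFromLocal ~/sparkify_log_small.json /user/sparkify_data/
hdfs dfs -ls /user/sparkify_data                           # confirm the file landed
```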
- Submit a script with Spark on YARN. Locate the binary with `which spark-submit` (here it returns `/usr/bin/spark-submit`), then run: `/usr/bin/spark-submit --master yarn ./<script>.py`
- Accumulators = global variables for debugging code; create one from a `SparkContext` *instance* (not the class) with an initial value: `from pyspark import SparkContext; sc = SparkContext(); errors = sc.accumulator(0)`