
Repository for the Spark MOOC on Udacity

This is the repository for a Udacity MOOC about Spark
Course Link | Spark Documentation | Spark Download

Setup

  1. Download and Install Spark
  2. Install pyspark via pip:
    pip install pyspark
  3. ... or via Anaconda:
    conda install pyspark
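
To verify the installation, a short sketch like the following (app name and sample data are placeholders) should print the Spark version and show a tiny DataFrame:

    # verify that pyspark is installed and can run a local job
    from pyspark.sql import SparkSession

    # local[*] runs Spark locally using all available cores
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("install-check") \
        .getOrCreate()

    print(spark.version)

    # small DataFrame to confirm the session can execute jobs
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).show()

    spark.stop()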

Spark Commands - How to start a Master Node (locally)

  1. On your machine, navigate to:
    /usr/local/Cellar/apache-spark/2.4.5/libexec
  2. Start the Master Node:
    ./sbin/start-master.sh -h <ip-address where to run>
  3. Stop the Master Node:
    ./sbin/stop-master.sh
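
With the master running (Spark 2.4 also ships ./sbin/start-slave.sh <master-url> to attach a worker), a PySpark session can be pointed at it. The master URL below is an assumption (default port 7077 on localhost); use the address printed by start-master.sh or shown in the master's web UI instead:

    # connect a PySpark session to the locally running standalone master
    from pyspark.sql import SparkSession

    # spark://localhost:7077 is an assumed default; adjust host and port as needed
    spark = SparkSession.builder \
        .master("spark://localhost:7077") \
        .appName("standalone-check") \
        .getOrCreate()

    print(spark.sparkContext.master)
    spark.stop()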

Connect to an AWS EMR instance

Documentation

Connect to instance:

ssh -i <path>/<key_name>.pem hadoop@ec2-###-###-###-###.compute.amazonaws.com

Transmit Files to HDFS

  1. Connect to instance using SSH or Browser + Proxy

  2. Copy the file to the cluster's master node with scp:

    scp -i <path>/<key_name>.pem ~/Desktop/sparkify_log_small.json hadoop@ec2-###-###-###-###.compute.amazonaws.com:~/
  3. Create new HDFS Folder:

    hdfs dfs -mkdir /user/<newFolder>
  4. Need Help?

    hdfs    # or: hdfs dfs
  5. Copy the file from the master node into HDFS:

    hdfs dfs -copyFromLocal <file> /user/<folder>
  6. Submit a script to the cluster with spark-submit:

    which spark-submit    # typically /usr/bin/spark-submit on EMR
    /usr/bin/spark-submit --master yarn ./<script>.py
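
Putting steps 2-6 together, a minimal submittable script might look like this (a sketch: the HDFS path mirrors the /user/<folder> placeholder above, and the master is supplied by the --master yarn flag rather than hard-coded):

    # example script for spark-submit: count the records of the uploaded log file
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("sparkify-log-count") \
        .getOrCreate()

    # read the JSON file that was copied into HDFS in step 5
    log_df = spark.read.json("hdfs:///user/<folder>/sparkify_log_small.json")

    log_df.printSchema()
    print("Number of log records:", log_df.count())

    spark.stop()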

Glossary

  • Accumulators = global variables, mainly used as counters while debugging code
    from pyspark import SparkContext
    sc = SparkContext()
    errors = sc.accumulator(0)  # created on a SparkContext instance, not on the class
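
A short usage sketch (the records and the "empty string means error" rule are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "accumulator-example")
    errors = sc.accumulator(0)

    # increment the accumulator once for every empty record
    records = sc.parallelize(["ok", "", "ok", ""])
    records.foreach(lambda rec: errors.add(1) if rec == "" else None)

    print("Bad records:", errors.value)  # -> 2
    sc.stop()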
