# Data Munging with PySpark

# Installation etc.

Setting up Spark and its dependencies is a little tedious.  
The steps below are for Windows. 

1. Download 
	- JDK (preferably 8.x, 64-bit)
	- Hadoop (3.2.x, at this time); on Windows, only the Hadoop winutils binaries are strictly needed
	- [Hadoop winutils (corresponding to the version of Hadoop)](https://github.com/cdarlint/winutils), [another repo](https://github.com/kontext-tech/winutils)
	- [Spark (3.x, at this time)](https://spark.apache.org/downloads.html)  
    
    
1. Set up environment variables (*note that the paths have no trailing backslash; the separator is added in the next step, when we set up the path*). Example paths would look like:
	- JAVA_HOME = ```C:\[Java]``` 
	- HADOOP_HOME = ```C:\Hadoop\hadoop-3.2.1```
	- SPARK_HOME = ```C:\Spark\spark-3.2.1-bin-hadoop3.2```  
    
    
1. Update the system 'path' variable (*here we add a backslash before bin*) to include the following:
	- Java: ```%JAVA_HOME%\bin```
	- Hadoop 01: ```%HADOOP_HOME%\bin```
	- Hadoop 02: ```%HADOOP_HOME%\sbin``` (*sbin is needed in addition to bin*)
	- Spark: ```%SPARK_HOME%\bin```  
    
    
1. Configure Hadoop (*optional, only needed if you want to use Hadoop (HDFS) as your file storage system*):
	- create a folder for ```namenode```
	- create a folder for ```datanode```
	- set up four files: ```core-site.xml```, ```mapred-site.xml```, ```hdfs-site.xml```, ```yarn-site.xml``` - see the code for each in the repo.  
    
    
1. Patch Hadoop (this is only needed when Hadoop is run on Windows):
	- copy the ```bin``` folder from the matching version of winutils to replace ```%HADOOP_HOME%\bin``` 
	- copy ```hadoop-yarn-server-timelineservice-3.0.3``` from ```%HADOOP_HOME%\share\hadoop\yarn\timelineservice``` to ```%HADOOP_HOME%\share\hadoop\yarn``` (the parent directory).  
    
    
1. References:
	- ```https://muhammadbilalyar.github.io/blogs/How-to-install-Hadoop-on-Window-10/```
	- ```https://github.com/MuhammadBilalYar/Hadoop-On-Window```
	- ```https://dev.to/awwsmm/installing-and-running-hadoop-and-spark-on-windows-33kc```
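
Once the variables are in place, a quick sanity check can save some debugging later. The following is a minimal sketch, not part of the original setup: the helper name `check_spark_env` is hypothetical, and it only checks the conventions described above (each variable is set, has no trailing slash, and points at a folder that contains `bin`).

```python
import os
from pathlib import Path

# The three variables set in the environment-variable step above.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means every check passed.

    `env` is any mapping of variable names to values and defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        # The guide expects no trailing slash, since '\bin' is appended in the path step.
        if value.endswith(("\\", "/")):
            problems.append(f"{name} ends with a slash: {value}")
        # Each home folder should contain a bin folder that goes on the path.
        if not (Path(value) / "bin").is_dir():
            problems.append(f"{name} has no bin folder: {value}")
    return problems
```

Calling `check_spark_env()` in a fresh terminal or notebook and printing the result shows which variables, if any, still need attention. It does not verify the Java or Hadoop versions themselves.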

# Setup

This boilerplate helps, especially when working in Jupyter Notebook.

In [1]:
# Step 1: initialize findspark
import findspark
findspark.init()
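
For intuition, here is a rough sketch of what `findspark.init()` does for you, as far as I understand it: it locates the Spark installation and puts Spark's bundled Python sources and Py4J archive on `sys.path` so that `import pyspark` works. This is an approximation under that assumption, not findspark's actual source, and `add_pyspark_to_path` is a hypothetical name.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's bundled Python sources and Py4J archive on sys.path.

    Returns the list of paths that were newly added.
    """
    spark_python = os.path.join(spark_home, "python")
    # Spark ships Py4J as a zip under python/lib; the version in the file name varies.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    added = []
    for path in [spark_python, *py4j_zips]:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

With the variables from the installation section, this would be called as `add_pyspark_to_path(os.environ["SPARK_HOME"])`. In practice, sticking with `findspark.init()` is simpler.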

In [2]:
# Step 2: import pyspark
import pyspark
from pyspark.sql import SparkSession
pyspark.__version__

'3.3.0'

In [14]:
# Step 3: Create a spark session

# 'local[1]' runs Spark locally with 1 worker thread; use 'local[N]' for N threads, or 'local[*]' for all available cores
# use .config("spark.some.config.option", "some-value") for additional configuration

spark = SparkSession \
    .builder \
    .master('local[1]') \
    .appName("10+ minutes to pyspark") \
    .getOrCreate()

spark
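
If you create sessions often, the builder calls can be wrapped in a small helper. This is a sketch under stated assumptions: `local_master` and `build_session` are hypothetical names, and the helper only uses the `master`, `appName`, `config` and `getOrCreate` builder calls.

```python
def local_master(threads=None):
    """Build a local-mode master URL: 'local[N]', or 'local[*]' for all cores."""
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


def build_session(app_name, threads=None, options=None):
    """Create (or reuse) a SparkSession, applying optional config key/value pairs."""
    # Imported lazily so local_master() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(local_master(threads)).appName(app_name)
    for key, value in (options or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, `build_session("10+ minutes to pyspark", threads=2, options={"spark.sql.shuffle.partitions": "4"})` would request two local threads. Note that `getOrCreate()` returns the existing session if one is already running, so a different master setting may not take effect until that session is stopped with `spark.stop()`.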

Before Spark 2.0, you needed several separate 'contexts' (such as SQLContext and HiveContext) as entry points into Spark functionality.  
All of these are now wrapped into a single SparkSession, which is easier to manage.

In [15]:
# The SparkSession carries the sparkContext
spark.sparkContext

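Check out the Spark UI link in the output above.  
Your local UI should be available at a link like: http://localhost:4041/jobs/ (Spark uses port 4040 by default and tries the next port up if that one is already taken).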
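
Because the port can vary, it is safer to ask the running session for its UI address than to guess it. A minimal sketch, assuming the `uiWebUrl` property of the SparkContext; the helper name `jobs_url` is hypothetical, and the `None` check is defensive in case the UI is disabled.

```python
def jobs_url(spark):
    """Return the Jobs page URL of the session's web UI, or None if no UI is reported."""
    base = spark.sparkContext.uiWebUrl  # e.g. 'http://localhost:4040'
    if base is None:
        return None
    return base.rstrip("/") + "/jobs/"
```

With the session created above, `jobs_url(spark)` gives the address to open in a browser.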