The goal of this project is to do some ETL (Extract, Transform and Load) with the Spark Python API (PySpark) and Hadoop Distributed File System (HDFS).
Working with the CSV files from the HiggsTwitter dataset, we will:
- Convert the CSV data to Apache Parquet files.
- Use Spark SQL through both the DataFrames API and SQL queries.
- Run some performance tests: compressed CSV vs. Parquet, cached vs. uncached DataFrames, and local file access vs. local HDFS access. A rough sketch of these steps follows this list.
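As an illustration of these three steps, the sketch below reads one CSV, writes it back as Parquet, and runs the same aggregation through the DataFrames API and through SQL, ending with a simple cached vs. not-cached timing. The file name, separator and column names are placeholders, not the actual HiggsTwitter schema:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiggsTwitterETL").getOrCreate()

# Read one of the dataset files (file name, separator and columns are assumptions)
df = (spark.read.csv("higgs-social_network.csv", sep=" ", inferSchema=True)
      .toDF("follower", "followed"))

# Convert to Apache Parquet (columnar storage, compressed by default)
df.write.mode("overwrite").parquet("higgs-social_network.parquet")

# Same aggregation with the DataFrames API...
df.groupBy("followed").count().orderBy("count", ascending=False).show(10)

# ...and with SQL over a temporary view
df.createOrReplaceTempView("social_network")
spark.sql("SELECT followed, COUNT(*) AS followers FROM social_network "
          "GROUP BY followed ORDER BY followers DESC LIMIT 10").show()

# Cached vs. not-cached DataFrame
start = time.time(); df.count(); print("not cached:", round(time.time() - start, 2), "s")
df.cache().count()  # materialize the cache
start = time.time(); df.count(); print("cached:", round(time.time() - start, 2), "s")
```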
We will need to install/extract Hadoop and Spark, and remember to create the necessary environment variables:
Windows environment variables (example):
JAVA_HOME=C:\Progra~1\Java\jdk1.8.0_131
HADOOP_HOME=C:\hadoop
SPARK_HOME=C:\spark
PYTHONPATH=%SPARK_HOME%\python;
PATH=C:\Python;C:\Python\Scripts;%HADOOP_HOME%\bin;%SPARK_HOME%\bin;%PYTHONPATH%;%PATH%;
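A quick way to check from Python that the variables are actually visible before launching anything (just a sanity-check sketch; the names mirror the example above):

```python
import os

# Each of these should print a path; None means the variable is missing
for var in ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "PYTHONPATH"):
    print(var, "=", os.environ.get(var))
```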
Hadoop can be downloaded from the Hadoop releases webpage. These releases do not include winutils (the Windows binaries for Hadoop), which is required to run Hadoop on Windows systems.
Read this guide in order to install Hadoop on Windows: https://wiki.apache.org/hadoop/Hadoop2OnWindows
Some notes to consider:
- Spark and Hadoop require Java. The 64-bit JDK 8 worked for me.
- JAVA_HOME must match the Java version we downloaded (in my case 1.8.0_131). "Progra~1" is the short name of the 64-bit installation path ("Program Files") and "Progra~2" of the 32-bit one ("Program Files (x86)").
- PYTHONPATH is the path where Python looks for additional libraries. In this case it is set so Python can find the Spark ones.
- In the PATH variable, %PATH% expands to the PATH value we already had, so we are basically prepending the new paths to the existing value.
- To avoid problems, we should set hadoop.tmp.dir in the file \hadoop\etc\hadoop\core-site.xml to the Hadoop tmp directory we want (note that the drive letter is preceded by /). For example:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/C:/hadoop/temp/</value>
</property>
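Once HDFS is up and running and the data has been copied into it, we can compare local file access with local HDFS access from PySpark. The paths and the namenode URL below are assumptions for illustration (hdfs://localhost:9000 is the usual pseudo-distributed default), not the project's actual layout:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LocalVsHDFS").getOrCreate()

# The Parquet data is assumed to have been copied into HDFS beforehand,
# e.g. with: hdfs dfs -put C:/data/higgs-social_network.parquet /data/
paths = ["file:///C:/data/higgs-social_network.parquet",
         "hdfs://localhost:9000/data/higgs-social_network.parquet"]

for path in paths:
    start = time.time()
    rows = spark.read.parquet(path).count()
    print(path, "->", rows, "rows in", round(time.time() - start, 2), "s")
```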
In order to run PySpark we need to install the Python library py4j:
pip install py4j
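With the environment variables set and py4j installed, a minimal smoke test like the one below (an illustrative sketch, not part of the project) should start a local Spark session from a notebook or a plain Python shell:

```python
import pyspark
from pyspark.sql import SparkSession

# Start a local session using all available cores
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print("PySpark", pyspark.__version__, "running on", spark.sparkContext.master)
spark.stop()
```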
Then we can go to the Jupyter Notebook ETL example to see some ETL with PySpark.