# PySpark Resources

https://github.com/ericxiao251/spark-syntax

# Install findspark

pip install findspark

# Install pyspark

pip install pyspark

# Install pytables

pip install tables

# Install java

sudo apt update<br>
sudo apt install default-jdk<br>
update-alternatives --list java  # the last line printed is the location of your Java runtime<br>
vim ~/.profile  # then add: export JAVA_HOME=<path_of_java_runtime_but_exclude_/bin/java>
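
After editing ```~/.profile```, it can help to confirm from Python that the Java setup is actually visible before starting Spark. The following is a minimal sketch using only the standard library; the helper name ```check_java_setup``` is hypothetical and not part of any Spark tooling.

```python
import os
import shutil
from pathlib import Path


def check_java_setup():
    """Return (java_home, java_on_path) describing the Java setup visible to Python."""
    java_home = os.environ.get("JAVA_HOME")  # None if the variable is not exported
    java_on_path = shutil.which("java")      # None if no `java` executable is on PATH
    if java_home is not None and not (Path(java_home) / "bin" / "java").exists():
        print(f"JAVA_HOME is set to {java_home!r}, but bin/java was not found there")
    return java_home, java_on_path


java_home, java_on_path = check_java_setup()
print("JAVA_HOME:", java_home)
print("java on PATH:", java_on_path)
```

If ```JAVA_HOME``` prints as ```None``` right after editing ```~/.profile```, a likely cause is that the file has not been re-read yet; it is normally only read by login shells, so logging out and back in (or sourcing the file) may be needed.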

In [1]:
import findspark
from pathlib import Path
import pandas as pd
import numpy as np

# Local Spark setup with findspark

In [2]:
# local spark
findspark.init('/home/pybokeh/envs/py3.7.2/lib/python3.7/site-packages/pyspark/')
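
The path above is specific to one machine and virtual environment. As a rough sketch, assuming pyspark was installed with pip into the active environment, the package directory could be looked up with the standard library instead of being hard-coded. The helper name ```locate_pyspark_home``` is hypothetical.

```python
import importlib.util
from pathlib import Path


def locate_pyspark_home():
    """Return the directory of the installed pyspark package, or None if it is missing."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at pyspark/__init__.py; its parent is the package directory
    return Path(spec.origin).parent


spark_home = locate_pyspark_home()
print("pyspark package directory:", spark_home)
# If a directory was found, it could be passed on in place of the literal path:
# findspark.init(str(spark_home))
```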

# Getting PySpark shell

To get a Spark session (the same entry point that the PySpark shell exposes as ```spark```):

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example_app').master('local[*]').getOrCreate()

In [4]:
spark.sql("show databases").show()

+------------+
|databaseName|
+------------+
|     default|
+------------+



# pandas -> spark

The first integration is about moving data from pandas, the de facto standard Python library for in-memory data manipulation, into Spark. First, let’s load a pandas DataFrame. This one contains air quality measurements from Madrid (the subject is not important for moving data from one place to another). You can download it [here](https://www.kaggle.com/decide-soluciones/air-quality-madrid). Make sure pytables is installed so that pandas can read the HDF5 data.

In [5]:
air_quality_df = pd.read_hdf('/home/pybokeh/Downloads/air-quality-madrid/madrid.h5', key='28079008')
air_quality_df.head()

| date | BEN | CH4 | CO | EBE | NMHC | NO | NO_2 | NOx | O_3 | PM10 | PM25 | SO_2 | TCH | TOL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001-07-01 01:00:00 | 30.65 | NaN | 6.91 | 42.639999 | NaN | NaN | 381.299988 | 1017.0 | 9.01 | 158.899994 | NaN | 47.509998 | NaN | 76.050003 |
| 2001-07-01 02:00:00 | 29.59 | NaN | 2.59 | 50.360001 | NaN | NaN | 209.5 | 409.200012 | 23.82 | 104.800003 | NaN | 20.950001 | NaN | 84.900002 |
| 2001-07-01 03:00:00 | 4.69 | NaN | 0.76 | 25.57 | NaN | NaN | 116.400002 | 143.399994 | 31.059999 | 48.470001 | NaN | 11.27 | NaN | 20.98 |
| 2001-07-01 04:00:00 | 4.46 | NaN | 0.74 | 22.629999 | NaN | NaN | 116.199997 | 149.300003 | 23.780001 | 47.5 | NaN | 10.1 | NaN | 14.77 |
| 2001-07-01 05:00:00 | 2.18 | NaN | 0.57 | 11.92 | NaN | NaN | 100.900002 | 124.800003 | 29.530001 | 49.689999 | NaN | 7.68 | NaN | 8.97 |


Let’s make some changes to this DataFrame, such as resetting the datetime index so that its information is not lost when loading into Spark. The datetime is also converted to a string, because Spark can have issues working with dates (related to system locale, time zones, and so on).

In [6]:
air_quality_df.reset_index(inplace=True)
air_quality_df['date'] = air_quality_df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')
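
To make the effect of that format string concrete, here is a small standard-library sketch (independent of pandas and Spark) showing that a timestamp written with ```'%Y-%m-%d %H:%M:%S'``` can be parsed back without loss at one-second resolution. The constant name ```DATE_FORMAT``` is introduced only for illustration.

```python
from datetime import datetime

# Same format string as the strftime call above
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

original = datetime(2001, 7, 1, 1, 0, 0)            # first timestamp in the sample data
as_text = original.strftime(DATE_FORMAT)            # what ends up in the 'date' column
restored = datetime.strptime(as_text, DATE_FORMAT)  # parse the text back into a datetime

print(as_text)               # 2001-07-01 01:00:00
print(restored == original)  # True
```

Any sub-second component would be dropped by this format, which should not matter for hourly measurements like these.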

We can simply load the data from pandas into Spark with ```createDataFrame```. Once the DataFrame is loaded into Spark (as ```air_quality_sdf``` here), it can be manipulated easily using PySpark methods:

In [7]:
air_quality_sdf = spark.createDataFrame(air_quality_df)
air_quality_sdf.dtypes

[('date', 'string'),
 ('BEN', 'double'),
 ('CH4', 'double'),
 ('CO', 'double'),
 ('EBE', 'double'),
 ('NMHC', 'double'),
 ('NO', 'double'),
 ('NO_2', 'double'),
 ('NOx', 'double'),
 ('O_3', 'double'),
 ('PM10', 'double'),
 ('PM25', 'double'),
 ('SO_2', 'double'),
 ('TCH', 'double'),
 ('TOL', 'double')]
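
The types above were inferred by Spark from the pandas DataFrame. If you would rather state them explicitly, ```createDataFrame``` also takes a ```schema``` argument, which can be given as a data type string. The sketch below only builds such a string in plain Python; the helper ```build_schema_string``` is hypothetical, and the exact string format accepted may depend on the Spark version, so treat the commented-out call as an assumption to verify.

```python
# Column names as reported by air_quality_sdf.dtypes above
POLLUTANT_COLUMNS = ["BEN", "CH4", "CO", "EBE", "NMHC", "NO", "NO_2",
                     "NOx", "O_3", "PM10", "PM25", "SO_2", "TCH", "TOL"]


def build_schema_string(string_columns, double_columns):
    """Build a 'name type, name type' schema string for spark.createDataFrame."""
    fields = [f"{name} string" for name in string_columns]
    fields += [f"{name} double" for name in double_columns]
    return ", ".join(fields)


schema = build_schema_string(["date"], POLLUTANT_COLUMNS)
print(schema)
# Possible use, assuming the `spark` session and `air_quality_df` from the cells above:
# air_quality_sdf = spark.createDataFrame(air_quality_df, schema=schema)
```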

In [8]:
air_quality_sdf.select('date', 'NOx').show(5)

+-------------------+------------------+
|               date|               NOx|
+-------------------+------------------+
|2001-07-01 01:00:00|            1017.0|
|2001-07-01 02:00:00|409.20001220703125|
|2001-07-01 03:00:00|143.39999389648438|
|2001-07-01 04:00:00| 149.3000030517578|
|2001-07-01 05:00:00|124.80000305175781|
+-------------------+------------------+
only showing top 5 rows



# pandas -> spark -> hive

To persist a Spark DataFrame into HDFS, where it can be queried using the default Hadoop SQL engine (Hive), one straightforward strategy (not the only one) is to create a temporary view from that DataFrame:

In [9]:
air_quality_sdf.createOrReplaceTempView("air_quality_sdf")

Once the temporary view is created, it can be used from the Spark SQL engine to create a real table using ```create table as select```. Before creating this table, I will create a new database called ```analytics``` to store it:

In [10]:
sql_drop_table = """
drop table if exists analytics.pandas_spark_hive
"""

sql_drop_database = """
drop database if exists analytics cascade
"""

sql_create_database = """
create database if not exists analytics
location '/home/pybokeh/temp/cloudera/analytics/'
"""

sql_create_table = """
create table if not exists analytics.pandas_spark_hive
using parquet
as select to_timestamp(date) as date_parsed, *
from air_quality_sdf
"""

print("dropping database...")
result_drop_db = spark.sql(sql_drop_database)

print("creating database...")
result_create_db = spark.sql(sql_create_database)

print("dropping table...")
result_droptable = spark.sql(sql_drop_table)

print("creating table...")
result_create_table = spark.sql(sql_create_table)

dropping database...
creating database...
dropping table...
creating table...
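
Since the statements above print nothing beyond the progress messages, one way to confirm that the table was registered is to ask the Spark catalog. This is a minimal sketch; the helper ```table_exists``` is hypothetical, and it assumes the ```spark``` session created earlier is passed in.

```python
def table_exists(spark, database, table):
    """Return True if `table` is listed in `database` by the Spark catalog."""
    # spark.catalog.listTables returns entries that expose a `name` attribute;
    # the listing also includes temporary views, not only persisted tables
    names = [entry.name for entry in spark.catalog.listTables(database)]
    return table in names


# Possible use, assuming the `spark` session from the cells above:
# table_exists(spark, "analytics", "pandas_spark_hive")
```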


We can check the results using the Spark SQL engine, for example by selecting the ozone concentration over time:

In [11]:
spark.sql("select * from analytics.pandas_spark_hive").select("date_parsed", "O_3").show(5)

+-------------------+------------------+
|        date_parsed|               O_3|
+-------------------+------------------+
|2001-07-01 01:00:00| 9.010000228881836|
|2001-07-01 02:00:00| 23.81999969482422|
|2001-07-01 03:00:00|31.059999465942383|
|2001-07-01 04:00:00|23.780000686645508|
|2001-07-01 05:00:00|29.530000686645508|
+-------------------+------------------+
only showing top 5 rows

