# PySpark provides two main options when it comes to using straight SQL: Spark SQL and SQL Transformer. 
# 1. Spark SQL
    Spark DataFrames provide two methods that register a temporary view, so that users can run **SQL** queries against a Spark DataFrame: 
# createOrReplaceTempView:
        The lifetime of this temporary view is tied to the SparkSession 
        that was used to create the dataset. It creates (or replaces, if that view name already exists) 
        a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. 
        It is not persisted in memory unless you cache the dataset that underpins the view.

# createGlobalTempView:
        The lifetime of this temporary view is tied to this Spark application. 
        This feature is useful when you want to share data among different sessions and keep it alive until
        your application ends.

# Spark session vs. Spark application
    A **Spark application** can be:
    a single batch job
    an interactive session with multiple jobs
    a long-lived server continually satisfying requests
    A Spark job can consist of more than just a single map and reduce,
    and a Spark application can consist of more than one SparkSession.
# A **SparkSession**, on the other hand:
    is an interaction between two or more entities (here, between you and the Spark application).
    can be created without creating SparkConf, SparkContext or SQLContext
    (they're encapsulated within the SparkSession, which was new in Spark 2.0)

# 2. SQL Transformer 
You can also use the SQL Transformer, which lets you write free-form SQL statements as well. 
    There are also SQL options within regular PySpark calls:
**1. The expr function in PySpark's SQL function library**
**2. PySpark's selectExpr function**
 
We will go over all of these in detail, so buckle up!

In [4]:
import pyspark
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark

In [7]:
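# Note: getExecutorMemoryStatus() has one entry per block manager (the driver plus
# each executor), so this counts executors rather than CPU cores.
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")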


You are working with 1 core(s)


In [8]:
file_path = "/home/nyalazone/Desktop/pyspark/Pyspark_Module/data/rec-crime-pfa.csv"
dataset = spark.read.csv(file_path, header=True, inferSchema=True)

In [9]:
dataset.count()

46469

In [10]:
dataset.limit(10).toPandas()

Unnamed: 0,12 months ending,PFA,Region,Offence,Rolling year total number of offences
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2
4,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561
5,31/03/2003,Avon and Somerset,South West,Drug offences,2308
6,31/03/2003,Avon and Somerset,South West,Fraud offences,5339
7,31/03/2003,Avon and Somerset,South West,Homicide,19
8,31/03/2003,Avon and Somerset,South West,Miscellaneous crimes against society,1597
9,31/03/2003,Avon and Somerset,South West,Non-domestic burglary,15621


In [12]:
# To make SQL calls against this dataframe convenient,
# we rename any columns that have spaces in their names
# (otherwise each such name has to be quoted with backticks in every query).
# Let's rename them.


In [17]:
dataframe = dataset.withColumnRenamed('Rolling year total number of offences','offence_count')
dataframe = dataframe.withColumnRenamed('12 months ending','12_months_ending')

dataframe.printSchema()

root
 |-- 12_months_ending: string (nullable = true)
 |-- PFA: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Offence: string (nullable = true)
 |-- offence_count: integer (nullable = true)



In [18]:
# Create a temporary view of the dataframe
dataframe.createOrReplaceTempView('tempview')

In [19]:
# Then query the temp view
spark.sql("Select * from tempview").limit(5).toPandas()

Unnamed: 0,12_months_ending,PFA,Region,Offence,offence_count
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2
4,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561


In [20]:
spark.sql("Select * from tempview where offence_count < 10 ").limit(5).toPandas()

Unnamed: 0,12_months_ending,PFA,Region,Offence,offence_count
0,31/03/2003,Avon and Somerset,South West,Death or serious injury caused by illegal driving,2
1,31/03/2003,Bedfordshire,East,Death or serious injury caused by illegal driving,8
2,31/03/2003,Bedfordshire,East,Homicide,2
3,31/03/2003,British Transport Police,British Transport Police,Death or serious injury caused by illegal driving,0
4,31/03/2003,British Transport Police,British Transport Police,Homicide,4


In [25]:
# We can also assign the query results to a variable
sql_results = spark.sql("SELECT * FROM tempview WHERE offence_count > 1000 AND Region='South West'")
sql_results.limit(5).toPandas()

Unnamed: 0,12_months_ending,PFA,Region,Offence,offence_count
0,31/03/2003,Avon and Somerset,South West,All other theft offences,25959
1,31/03/2003,Avon and Somerset,South West,Bicycle theft,3090
2,31/03/2003,Avon and Somerset,South West,Criminal damage and arson,26202
3,31/03/2003,Avon and Somerset,South West,Domestic burglary,14561
4,31/03/2003,Avon and Somerset,South West,Drug offences,2308


In [28]:
## We can even do aggregated "group by" calls like this
spark.sql("SELECT Region, sum(offence_count) AS Total FROM tempview GROUP BY Region").limit(5).toPandas()

Unnamed: 0,Region,Total
0,Fraud: CIFAS,7678981
1,North West,30235732
2,British Transport Police,3029117
3,Wales,11137260
4,London,42691902


# SQL Transformer
    SQLTransformer lets us write free-form SQL statements in which the placeholder __THIS__ stands for the input dataframe.

In [30]:
from pyspark.ml.feature import SQLTransformer


In [33]:
sqlTrans = SQLTransformer(statement="SELECT PFA,Region,Offence FROM __THIS__") # __THIS__ stands for the dataframe passed to transform()
# And use it to transform our df object
# Note that "__THIS__" is a special placeholder and cannot be changed to "__THAT__", for example
sqlTrans.transform(dataframe).show(5)


+-----------------+----------+--------------------+
|              PFA|    Region|             Offence|
+-----------------+----------+--------------------+
|Avon and Somerset|South West|All other theft o...|
|Avon and Somerset|South West|       Bicycle theft|
|Avon and Somerset|South West|Criminal damage a...|
|Avon and Somerset|South West|Death or serious ...|
|Avon and Somerset|South West|   Domestic burglary|
+-----------------+----------+--------------------+
only showing top 5 rows



In [34]:
type(sqlTrans)

pyspark.ml.feature.SQLTransformer

In [37]:
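# Also note that a call like this won't work: SQLTransformer is a transformer object, not a dataframe,
# so it has to be applied with .transform(dataframe) before .show() can be called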
SQLTransformer(statement="SELECT PFA,Region,Offence FROM __THIS__").show()


AttributeError: 'SQLTransformer' object has no attribute 'show'

In [41]:
# Note that this aggregation relies on offence_count being numeric; it was inferred as an integer because the CSV was read with inferSchema=True

sqlTrans = SQLTransformer(
    statement="SELECT Offence, SUM(offence_count) as Total FROM __THIS__ GROUP BY Offence") 
sqlTrans.transform(dataframe).show(5)

+--------------------+--------+
|             Offence|   Total|
+--------------------+--------+
|Public order offe...|10925676|
|       Bicycle theft| 5297006|
|Residential burglary| 1671469|
|Violence without ...|16590158|
|All other theft o...|30979393|
+--------------------+--------+
only showing top 5 rows



**And a where statement**

In [49]:
sqlTrans = SQLTransformer(
    statement="SELECT PFA,Offence FROM __THIS__ WHERE offence_count > 1000") 
sqlTrans.transform(dataframe).show(5)

+-----------------+--------------------+
|              PFA|             Offence|
+-----------------+--------------------+
|Avon and Somerset|All other theft o...|
|Avon and Somerset|       Bicycle theft|
|Avon and Somerset|Criminal damage a...|
|Avon and Somerset|   Domestic burglary|
|Avon and Somerset|       Drug offences|
+-----------------+--------------------+
only showing top 5 rows



In [51]:
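# Or we can also reference the view name tempview directly. This works because tempview is registered
# in the current session; the statement then reads from the view rather than from the dataframe passed to transform()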
sqlTrans = SQLTransformer(
    statement="SELECT PFA,Offence FROM tempview WHERE offence_count > 1000") 
sqlTrans.transform(dataframe).show(5)

+-----------------+--------------------+
|              PFA|             Offence|
+-----------------+--------------------+
|Avon and Somerset|All other theft o...|
|Avon and Somerset|       Bicycle theft|
|Avon and Somerset|Criminal damage a...|
|Avon and Somerset|   Domestic burglary|
|Avon and Somerset|       Drug offences|
+-----------------+--------------------+
only showing top 5 rows



In [53]:
result = sqlTrans.transform(dataframe)
result.show(5)

+-----------------+--------------------+
|              PFA|             Offence|
+-----------------+--------------------+
|Avon and Somerset|All other theft o...|
|Avon and Somerset|       Bicycle theft|
|Avon and Somerset|Criminal damage a...|
|Avon and Somerset|   Domestic burglary|
|Avon and Somerset|       Drug offences|
+-----------------+--------------------+
only showing top 5 rows



# SQL Options within regular PySpark calls 
**The expr function in PySpark's SQL function library**
You can also use the expr function from the **pyspark.sql.functions module**, coupled with either PySpark's withColumn function or the select function.


In [55]:
# First we need to import the function
from pyspark.sql.functions import expr 

In [60]:
## Let's add a percent column to the dataframe. To do this, we first need the total of offence_count across the whole dataframe (the value is hard-coded into the expression in the next cell).

sqlTrans = SQLTransformer(
    statement="SELECT SUM(offence_count) as Total FROM __THIS__") 
sqlTrans.transform(dataframe).show(5)

+---------+
|    Total|
+---------+
|244720928|
+---------+



In [59]:
# We could add a percent column to our df 
# that shows the offence %
# with the "withColumn" command
dataframe.withColumn("percent",expr("round((offence_count/244720928)*100,2)")).show()

+----------------+-----------------+----------+--------------------+-------------+-------+
|12_months_ending|              PFA|    Region|             Offence|offence_count|percent|
+----------------+-----------------+----------+--------------------+-------------+-------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|        25959|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft|         3090|    0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|        26202|   0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|            2|    0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|        14561|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences|         2308|    0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences|         5339|    0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|           19|    0.0|

# PySpark's selectExpr function
Very similar idea here but slightly different syntax.

In [61]:
dataframe.selectExpr("*","round((offence_count/244720928)*100,2) AS percent").filter("Region ='South West'").show()


+----------------+-----------------+----------+--------------------+-------------+-------+
|12_months_ending|              PFA|    Region|             Offence|offence_count|percent|
+----------------+-----------------+----------+--------------------+-------------+-------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|        25959|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft|         3090|    0.0|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|        26202|   0.01|
|      31/03/2003|Avon and Somerset|South West|Death or serious ...|            2|    0.0|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|        14561|   0.01|
|      31/03/2003|Avon and Somerset|South West|       Drug offences|         2308|    0.0|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences|         5339|    0.0|
|      31/03/2003|Avon and Somerset|South West|            Homicide|           19|    0.0|

In [62]:
# Speed test
spark.sql("SELECT * FROM tempview WHERE offence_count > 1000").show()


+----------------+-----------------+----------+--------------------+-------------+
|12_months_ending|              PFA|    Region|             Offence|offence_count|
+----------------+-----------------+----------+--------------------+-------------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|        25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft|         3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|        26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|        14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences|         2308|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences|         5339|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...|         1597|
|      31/03/2003|Avon and Somerset|South West|Non-domestic burg...|        15621|
|      31/03/2003|Avon and Somerset|South West|Public order offe...|         4025|
|   

In [64]:
# Then we create an SQL call 
sqlTrans = SQLTransformer(
    statement="SELECT * FROM __THIS__ WHERE offence_count > 1000")
# And use it to transform our df object
sqlTrans.transform(dataframe).show(5)

+----------------+-----------------+----------+--------------------+-------------+
|12_months_ending|              PFA|    Region|             Offence|offence_count|
+----------------+-----------------+----------+--------------------+-------------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|        25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft|         3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|        26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|        14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences|         2308|
+----------------+-----------------+----------+--------------------+-------------+
only showing top 5 rows



In [65]:
SQLTransformer(statement="SELECT * FROM __THIS__ WHERE offence_count > 1000").transform(dataframe).show()



+----------------+-----------------+----------+--------------------+-------------+
|12_months_ending|              PFA|    Region|             Offence|offence_count|
+----------------+-----------------+----------+--------------------+-------------+
|      31/03/2003|Avon and Somerset|South West|All other theft o...|        25959|
|      31/03/2003|Avon and Somerset|South West|       Bicycle theft|         3090|
|      31/03/2003|Avon and Somerset|South West|Criminal damage a...|        26202|
|      31/03/2003|Avon and Somerset|South West|   Domestic burglary|        14561|
|      31/03/2003|Avon and Somerset|South West|       Drug offences|         2308|
|      31/03/2003|Avon and Somerset|South West|      Fraud offences|         5339|
|      31/03/2003|Avon and Somerset|South West|Miscellaneous cri...|         1597|
|      31/03/2003|Avon and Somerset|South West|Non-domestic burg...|        15621|
|      31/03/2003|Avon and Somerset|South West|Public order offe...|         4025|
|   