<h2>Basic SQL Syntax</h2>
<p>
Many people are already familiar with SQL syntax.  Once your data is organized in a DataFrame, you can create what is called a view of that DataFrame.  The view lets you write SQL statements against the data via the Spark session's <code>sql</code> method.
</p>

In [1]:
from pyspark.sql import SparkSession
from datetime import datetime
path = "/Users/joesphgartner/Desktop/data/"

spk = SparkSession.builder.master("local").getOrCreate()
df = spk.read.json(path)

# createOrReplaceTempView lets this cell be re-run without a "view already exists" error
df.createOrReplaceTempView("tweets")

<p>Following the basic data slicing from the previous notebook, we can perform the same operations through SQL, such as:</p>

In [2]:
#Selecting specific fields
df2 = spk.sql("SELECT lang, user.name FROM tweets")
df2.show(5)

+----+----------------+
|lang|            name|
+----+----------------+
|  fr|          TAÏNA♡|
|  fr|         Cerpyth|
|  fr|   Bjr du Centre|
|  fr|            Ben³|
|  ru|Доня любит Шарфу|
+----+----------------+
only showing top 5 rows



In [3]:
# Remove null fields
# NULL checks need IS NOT NULL; `geo != NULL` evaluates to NULL and matches no rows
df2 = spk.sql("SELECT lang, user.name, geo FROM tweets WHERE geo IS NOT NULL")
df2.show(5)
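
<p>Comparisons against NULL deserve care. In SQL, <code>x != NULL</code> evaluates to NULL rather than true, so a filter written that way keeps no rows; <code>IS NOT NULL</code> is the null test. The sketch below illustrates the rule with Python's built-in <code>sqlite3</code> module as a stand-in engine, so it runs without a Spark session. The table and its three rows are made up for illustration; Spark SQL applies the same rule to its standard comparison operators.</p>

```python
import sqlite3

# In-memory stand-in table; the rows are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (lang TEXT, geo TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("en", "52.5,-1.5"), ("fr", None), ("ru", None)],
)

# `geo != NULL` evaluates to NULL for every row, so the filter keeps nothing.
wrong = conn.execute("SELECT lang FROM tweets WHERE geo != NULL").fetchall()

# `IS NOT NULL` is the actual null test and keeps the one row with a geo value.
right = conn.execute("SELECT lang FROM tweets WHERE geo IS NOT NULL").fetchall()

print(wrong)  # []
print(right)  # [('en',)]
```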



In [4]:
#Find specific field values
df2 = spk.sql("SELECT lang, user.name FROM tweets WHERE lang = 'en'")
df2.show(5)

+----+--------------+
|lang|          name|
+----+--------------+
|  en|TRUFFLESICIOUS|
|  en|     Mina Guli|
|  en| Mhairi-Stella|
|  en|         Harry|
|  en|       Rayy'🚀|
+----+--------------+
only showing top 5 rows



In [5]:
# Perform basic transformations
# NULL checks need IS NOT NULL; `geo != NULL` evaluates to NULL and matches no rows
df2 = spk.sql("SELECT geo.coordinates[0], geo.coordinates[0] + 1 FROM tweets WHERE geo IS NOT NULL")
df2.show(5)



In [6]:
#And basic groupings
df2 = spk.sql("SELECT lang, COUNT(*) FROM tweets GROUP BY lang")
df2.show(10)

+----+--------+
|lang|count(1)|
+----+--------+
|  en|   41831|
|  vi|      11|
|  ne|       4|
|  ps|       1|
|  ro|     219|
|  sl|      27|
| und|   17624|
|  ur|      11|
|  lv|      74|
|  pl|     297|
+----+--------+
only showing top 10 rows



<h2>UDFs</h2>
<p>For more sophisticated selection, you can use a user-defined function (UDF).  We start by checking that the function works locally on a few rows from the sample.</p>

In [7]:
def point_in_box(geo):
    if geo is None or geo.coordinates is None:
        return False
    if geo.coordinates[0] > 53 or geo.coordinates[0] < 52:
        return False
    if geo.coordinates[1] > -1 or geo.coordinates[1] < -2:
        return False
    return True

In [8]:
# Use IS NOT NULL here; `geo != NULL` matches no rows, which would leave nothing to test
df2 = spk.sql("SELECT * FROM tweets WHERE geo IS NOT NULL")
temp = df2.take(5)

for tweet in temp:
    print(point_in_box(tweet.geo))
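
<p>The function can also be exercised without Spark at all, which makes it easy to check each branch on hand-built inputs. The sketch below is one way to do that: <code>Geo</code> is a hypothetical stand-in for the row object Spark produces, and the coordinates are made up to hit each case.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for the Row object Spark builds from a tweet's geo field.
Geo = namedtuple("Geo", ["coordinates"])


def point_in_box(geo):
    # Copied from the cell above so this sketch runs on its own.
    if geo is None or geo.coordinates is None:
        return False
    if geo.coordinates[0] > 53 or geo.coordinates[0] < 52:
        return False
    if geo.coordinates[1] > -1 or geo.coordinates[1] < -2:
        return False
    return True


# Made-up inputs covering each branch of the function.
cases = {
    "inside the box": Geo([52.5, -1.5]),
    "first coordinate too large": Geo([53.2, -1.5]),
    "second coordinate too small": Geo([52.5, -2.4]),
    "missing coordinates": Geo(None),
    "missing geo": None,
}

results = {label: point_in_box(geo) for label, geo in cases.items()}
for label, inside in results.items():
    print(f"{label}: {inside}")
```

<p>Only the first case should print <code>True</code>; the null cases matter because many tweets carry no geo field.</p>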

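<p>Calling <code>point_in_box</code> directly in a SQL statement fails, because Spark SQL only resolves functions that have been registered with the session:</p>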

In [9]:
df2 = spk.sql("SELECT * FROM tweets WHERE point_in_box(geo)")

AnalysisException: "Undefined function: 'point_in_box'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"

<p>Registering the function as a UDF under a SQL name makes it callable from queries:</p>

In [10]:
from pyspark.sql.types import BooleanType

# Register the Python function under the SQL name "dist_pib" on the session created above
spk.udf.register("dist_pib", point_in_box, returnType=BooleanType())

In [11]:
df2 = spk.sql("SELECT lang, user.name, geo FROM tweets WHERE dist_pib(geo)")
df2.show(5)

+----+-----------------+--------------------+
|lang|             name|                 geo|
+----+-----------------+--------------------+
|  en|       Anne Dewis|[WrappedArray(52....|
|  en|     Vanilla Rose|[WrappedArray(52....|
|  en|  Melissa Jackson|[WrappedArray(52....|
|  en|360 Virtual Tours|[WrappedArray(52....|
|  en|      Super Conny|[WrappedArray(52....|
+----+-----------------+--------------------+
only showing top 5 rows
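
<p>The register-then-query pattern is not unique to Spark: a SQL engine resolves function names against its own registry, so a Python function stays invisible to queries until it is registered under a SQL name. As a rough analogy that runs without a cluster, the sketch below shows the same two steps with Python's built-in <code>sqlite3</code> module. The <code>in_box</code> helper, the table layout, and the rows are all assumptions made for illustration, and <code>create_function</code> is SQLite's registration call, not Spark's.</p>

```python
import sqlite3


def in_box(lat, lon):
    # Same bounding box as point_in_box, taking plain numbers.
    # SQLite has no boolean type, so return 1 or 0.
    if lat is None or lon is None:
        return 0
    return int(52 <= lat <= 53 and -2 <= lon <= -1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (name TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("a", 52.5, -1.5), ("b", 48.9, 2.3), ("c", None, None)],
)

query = "SELECT name FROM tweets WHERE in_box(lat, lon)"

# Before registration the engine does not know the name and rejects the query.
try:
    conn.execute(query)
    error_before = None
except sqlite3.OperationalError as exc:
    error_before = str(exc)

# After registration the same query runs and calls the Python function per row.
conn.create_function("in_box", 2, in_box)
matches = conn.execute(query).fetchall()

print(error_before)
print(matches)  # [('a',)]
```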

