<h2>Basic SQL Syntax</h2>
<p>
Many people are already familiar with SQL syntax.  Once your data is organized in a DataFrame, you can create what's called a view of that DataFrame.  The view lets you run SQL statements through the Spark session's sql method.
</p>

In [1]:
from pyspark.sql import SparkSession
from datetime import datetime
path = "/Users/joesphgartner/Desktop/data/"

spk = SparkSession.builder.master("local").getOrCreate()
df = spk.read.json(path)

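# Register the DataFrame as a temporary view; createOrReplaceTempView lets this cell be re-run safely
df.createOrReplaceTempView("tweets")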

<p>With the view in place, we can repeat the basic data slicing from the previous notebook, this time in SQL:</p>

In [2]:
#Selecting specific fields
df2 = spk.sql("SELECT lang, user.name, geo FROM tweets")
df2.show(5)

+----+----------------+----+
|lang|            name| geo|
+----+----------------+----+
|  fr|          TAÏNA♡|null|
|  fr|         Cerpyth|null|
|  fr|   Bjr du Centre|null|
|  fr|            Ben³|null|
|  ru|Доня любит Шарфу|null|
+----+----------------+----+
only showing top 5 rows



In [3]:
#Remove null fields
df2 = spk.sql("SELECT lang, user.name, geo FROM tweets WHERE geo IS NOT NULL")
df2.show(5)

+----+-------------------+--------------------+
|lang|               name|                 geo|
+----+-------------------+--------------------+
| und|        AllEyezOnIt|[WrappedArray(48....|
|  en|      Mhairi-Stella|[WrappedArray(51....|
|  fr|Train Travel Alerts|[WrappedArray(49....|
|  en|            Martika|[WrappedArray(48....|
|  en| CharlotteSouthcott|[WrappedArray(51....|
+----+-------------------+--------------------+
only showing top 5 rows



In [4]:
#Find specific field values
df2 = spk.sql("SELECT lang, user.name, text FROM tweets WHERE lang = 'en'")
df2.show(5)

+----+--------------+--------------------+
|lang|          name|                text|
+----+--------------+--------------------+
|  en|TRUFFLESICIOUS|@XxM0hawkPuppyXx ...|
|  en|     Mina Guli|MILESTONE: Only 2...|
|  en| Mhairi-Stella|🌼🌼🌼 spotted ou...|
|  en|         Harry|I wonder what the...|
|  en|       Rayy'🚀|Backseat, spaciou...|
+----+--------------+--------------------+
only showing top 5 rows



In [6]:
#Perform basic transformations
df2 = spk.sql("SELECT geo.coordinates[0], geo.coordinates[0] + 1 FROM tweets WHERE geo IS NOT NULL")
df2.show(5)

+-----------------------------------+---------------------------------------------------------+
|geo.coordinates AS `coordinates`[0]|(geo.coordinates AS `coordinates`[0] + CAST(1 AS DOUBLE))|
+-----------------------------------+---------------------------------------------------------+
|                        48.89007918|                                              49.89007918|
|                              51.25|                                                    52.25|
|                          49.598666|                                                50.598666|
|                            48.8576|                                                  49.8576|
|                        51.15783216|                                              52.15783216|
+-----------------------------------+---------------------------------------------------------+
only showing top 5 rows



In [7]:
#And basic groupings
df2 = spk.sql("SELECT lang, COUNT(*) FROM tweets GROUP BY lang")
df2.show(10)

+----+--------+
|lang|count(1)|
+----+--------+
|  en|   41831|
|  vi|      11|
|  ne|       4|
|  ps|       1|
|  ro|     219|
|  sl|      27|
| und|   17624|
|  ur|      11|
|  lv|      74|
|  pl|     297|
+----+--------+
only showing top 10 rows



<h2>UDFs</h2>
<p>For more sophisticated selections, you can use a user-defined function (UDF).  We will start by checking that the function works locally on rows taken from the sample.</p>

In [7]:
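def point_in_box(geo):
    """Return True if geo lies in the box with latitude 52 to 53 and longitude -2 to -1."""
    # Tweets without a geo tag cannot be inside the box
    if geo is None or geo.coordinates is None:
        return False
    # geo.coordinates is ordered [latitude, longitude]
    lat, lon = geo.coordinates[0], geo.coordinates[1]
    return 52 <= lat <= 53 and -2 <= lon <= -1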

In [8]:
df2 = spk.sql("SELECT * FROM tweets WHERE geo IS NOT NULL")
temp = df2.take(5)

for tweet in temp:
    print(point_in_box(tweet.geo))

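
<p>The local check can also be run without Spark at all. The sketch below assumes a hypothetical Geo namedtuple as a stand-in for the geo struct of a tweet row, and repeats the bounding-box function so that the block is self-contained. The sample coordinates are illustrative, apart from the Paris-area point, which appears in the data shown in this notebook.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for the geo struct of a tweet row
Geo = namedtuple("Geo", ["coordinates", "type"])


def point_in_box(geo):
    """Return True if geo lies in the box with latitude 52 to 53 and longitude -2 to -1."""
    if geo is None or geo.coordinates is None:
        return False
    # geo.coordinates is ordered [latitude, longitude]
    lat, lon = geo.coordinates[0], geo.coordinates[1]
    return 52 <= lat <= 53 and -2 <= lon <= -1


samples = [
    Geo([52.5, -1.5], "Point"),               # inside the box
    Geo([48.89007918, 2.42839066], "Point"),  # outside the box
    Geo(None, "Point"),                       # geo struct without coordinates
    None,                                     # tweet without a geo tag
]
results = [point_in_box(g) for g in samples]
print(results)
```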

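<p>Calling the function directly in a SQL statement fails, however, because Spark SQL only resolves functions that have been registered with it.  Registering the function as a UDF under a SQL name makes it available to queries.</p>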

In [9]:
df2 = spk.sql("SELECT * FROM tweets WHERE point_in_box(geo)")

AnalysisException: "Undefined function: 'point_in_box'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 27"

In [10]:
from pyspark.sql.types import BooleanType

# Register the Python function under the SQL name "dist_pib" on the session created above
spk.udf.register("dist_pib", point_in_box, returnType=BooleanType())

In [11]:
df2 = spk.sql("SELECT lang, user.name, geo FROM tweets WHERE dist_pib(geo)")
df2.show(5)

+----+-----------------+--------------------+
|lang|             name|                 geo|
+----+-----------------+--------------------+
|  en|       Anne Dewis|[WrappedArray(52....|
|  en|     Vanilla Rose|[WrappedArray(52....|
|  en|  Melissa Jackson|[WrappedArray(52....|
|  en|360 Virtual Tours|[WrappedArray(52....|
|  en|      Super Conny|[WrappedArray(52....|
+----+-----------------+--------------------+
only showing top 5 rows



In [13]:
temp = df2.rdd.take(1)

In [16]:
temp[0]

Row(contributors=None, coordinates=Row(coordinates=[2.42839066, 48.89007918], type='Point'), created_at='Wed Apr 26 13:30:47 +0000 2017', display_text_range=None, entities=Row(hashtags=[Row(indices=[0, 12], text='AllEyezOnIt'), Row(indices=[13, 23], text='Neochrome'), Row(indices=[24, 30], text='Eriah'), Row(indices=[31, 36], text='Loin'), Row(indices=[37, 45], text='Youtube'), Row(indices=[46, 63], text='YoutubeNeochrome'), Row(indices=[64, 71], text='RepDom')], media=None, symbols=[], urls=[Row(display_url='instagram.com/p/BTWa3QiA7ya/', expanded_url='https://www.instagram.com/p/BTWa3QiA7ya/', indices=[91, 114], url='https://t.co/UymjEOk2Py')], user_mentions=[Row(id=230227400, id_str='230227400', indices=[75, 89], name='NÉOCHROME Officiel', screen_name='Neochromeprod')]), extended_entities=None, extended_tweet=None, favorite_count=0, favorited=False, filter_level='low', geo=Row(coordinates=[48.89007918, 2.42839066], type='Point'), id=857225447821725696, id_str='857225447821725696', i