<h1>Dataframe Creation and basic data selection</h1>
<p>The idea of a dataframe has become pervasive among users of both R and the Python package pandas.  If you're not familiar with its usage, you can think of it as a dictionary that points to arrays, where the keys are the column names.  Let's start by creating one from Twitter data, which is stored as JSON.</p>
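<p>To make the "dictionary that points to arrays" picture concrete, here is a minimal plain-Python sketch.  It is only a mental model, not how Spark or pandas store data; the column values and the helper names select and row are made up for illustration.</p>

```python
# Toy model of a dataframe: a dict mapping column names to equal-length lists.
# The column names and values here are invented for illustration.
frame = {
    "lang": ["en", "fr", "en"],
    "name": ["Ada", "Blaise", "Grace"],
}


def select(frame, *columns):
    """Return a new frame holding only the requested columns."""
    return {col: frame[col] for col in columns}


def row(frame, i):
    """Return row i as a dict of column name -> value."""
    return {col: values[i] for col, values in frame.items()}


names_only = select(frame, "name")
second_row = row(frame, 1)
print(names_only)
print(second_row)
```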

In [1]:
from pyspark.sql import SparkSession
from datetime import datetime
path = "/Users/josephgartner/Desktop/data/"

spk = SparkSession.builder.master("local").getOrCreate()
df = spk.read.json(path)

<h2>The Spark dataframe</h2>
<p>SQL &amp; Dataframes are covered in greater detail <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html">in the Spark SQL programming guide</a>.  I'll briefly cover the usage here, using an example a bit more complicated than those in the docs.  The data they use is much better behaved than Twitter data, so a few extra tricks, such as handling nested JSON, are shown here.</p>

In [2]:
df.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |

<h2>Data Handles</h2>
<p>Dataframes allow you to quickly grab parts of the data without writing a lambda function.  The schema above shows how to reach these fields.  As it happens, Twitter data has many nested JSON objects, which you can see in the schema.  To access a nested field, join the field names with a period (.), as in 'user.name'.</p>

In [3]:
df.select('lang', 'user.name').show(5)

+----+------------+
|lang|        name|
+----+------------+
|  en| Ross Dillon|
|  en|ian slingsby|
|  en|       Nduka|
|  en|  Il Padrino|
|  en|Graham Banks|
+----+------------+
only showing top 5 rows
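
<p>As a rough illustration of what a dotted path such as 'user.name' means, the sketch below resolves one against a single nested JSON record in plain Python.  It illustrates the idea only and is not how Spark resolves columns; the record and the helper get_path are hypothetical.</p>

```python
import json

# One made-up record shaped like a heavily trimmed tweet.
record = json.loads('{"lang": "en", "user": {"name": "Example User", "id": 1}}')


def get_path(obj, dotted):
    """Walk a nested dict along a dotted path such as 'user.name'.

    Returns None when any step along the path is missing.
    """
    current = obj
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


lang = get_path(record, "lang")
name = get_path(record, "user.name")
missing = get_path(record, "geo.coordinates")
print(lang, name, missing)
```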



In [16]:
df.filter(df['geo'].isNotNull()).select('geo').take(1)[0].geo.coordinates[0]

51.484871

<h2>Data slicing</h2>
<p>Dataframes also allow you to slice data quickly.  This is a really important technique in data science.  While it's much more interesting to spin up projects using deep learning, a great many questions can be answered by simple data slicing!</p>

In [12]:
n_tot = df.count()
print("There are {} tweets in this sample.".format(n_tot))

# The '%' format type multiplies the fraction by 100 before printing.
n_w_geo = df.filter(df['geo'].isNotNull()).count()
print("{0:.4%} of which have geo tags".format(float(n_w_geo)/n_tot))

# Exclude English ('en') and undetermined ('und') tweets.
n_non_eng = df.filter(df['lang']!='en').filter(df['lang']!='und').count()
print("{0:.2%} of which aren't English".format(float(n_non_eng)/n_tot))

There are 111850 tweets in this sample.
12.0358% of which have geo tags
6.18% of which aren't English


<h2>Data Transformation</h2>
<p>We already know how to do a map with a normal Spark dataset; dataframes, however, work a bit differently.  A dataframe Row or Column is immutable, but rows or columns can be removed (filters for rows, selects for columns).  You can also add a new column in a predictable way using the withColumn method, which returns a new dataframe.  For basic operations, it can be used much like map.</p>

In [17]:
df_mod = df.filter(df['geo'].isNotNull()).withColumn('geo-first', df.geo.coordinates[0] + 1)
df_mod.select('geo.coordinates', 'geo-first').show(5)

+--------------------+-----------+
|         coordinates|  geo-first|
+--------------------+-----------+
|[51.484871, -0.17...|  52.484871|
|[51.42248136, -0....|52.42248136|
|[53.80157756, -1....|54.80157756|
|[53.645, -1.77972...|     54.645|
|[51.50733056, -0....|52.50733056|
+--------------------+-----------+
only showing top 5 rows



<h2>Non-trivial transforms</h2>
<p>The ability to do transforms in this way is very limited.  If you try to apply even a simple Python function directly to a column, it will fail.</p>

In [18]:
# Verify that the format string parses a sample value
print(datetime.strptime(df.select('created_at').take(1)[0].created_at, "%a %b %d %H:%M:%S +0000 %Y"))

2017-03-04 13:17:45


In [19]:
try:
    # Fails: strptime runs immediately and receives a Column, not a string.
    df.withColumn("created_at_dt", datetime.strptime(df["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
except TypeError as te:
    print(te)

strptime() argument 1 must be string, not Column
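
<p>One way to see why this fails while the earlier arithmetic worked: a column behaves like a description of a computation rather than a value.  The toy class below is a hypothetical stand-in, not Spark's Column.  It overloads + to build a new expression, but it is still not a string, so strptime rejects it.</p>

```python
from datetime import datetime


class ToyColumn(object):
    """Toy stand-in for a column expression: it records operations and holds no data."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Arithmetic works because the operator is overloaded to build a new expression.
        return ToyColumn("({} + {})".format(self.expr, other))


col = ToyColumn("created_at")
shifted = col + 1      # fine: produces another expression object
print(shifted.expr)    # (created_at + 1)

try:
    datetime.strptime(col, "%a %b %d %H:%M:%S +0000 %Y")
    error = None
except TypeError as te:
    # strptime wants an actual string, and an expression object is not one.
    error = te
print(error)
```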


<h2>Enter UDFs</h2>
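<p>The problem is that datetime.strptime is an ordinary Python function that needs an actual string, while df['created_at'] is a Column, and df.withColumn in turn expects a Column as its second argument.  To bridge the two, we must wrap the function in a UDF (user-defined function), which Spark applies to each row's value.</p>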

In [20]:
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType, BooleanType

make_dt = udf(lambda date_string: datetime.strptime(date_string, "%a %b %d %H:%M:%S +0000 %Y"), TimestampType())
df_mod = df.withColumn("dt_created_at", make_dt(df['created_at']))
df_mod.select("created_at", "dt_created_at").show(5)

+--------------------+--------------------+
|          created_at|       dt_created_at|
+--------------------+--------------------+
|Sat Mar 04 13:17:...|2017-03-04 13:17:...|
|Sat Mar 04 13:17:...|2017-03-04 13:17:...|
|Sat Mar 04 13:17:...|2017-03-04 13:17:...|
|Sat Mar 04 13:17:...|2017-03-04 13:17:...|
|Sat Mar 04 13:17:...|2017-03-04 13:17:...|
+--------------------+--------------------+
only showing top 5 rows
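
<p>The lambda above assumes every created_at value is a well-formed string; if any row held a null, strptime would raise inside the UDF.  Below is a minimal null-safe sketch of the parsing step, written in plain Python so it can be checked without a Spark session.  The names parse_created_at and TWITTER_FORMAT are hypothetical.</p>

```python
from datetime import datetime

TWITTER_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"  # same format string as above


def parse_created_at(date_string):
    """Parse a Twitter-style timestamp, returning None for missing or malformed input."""
    if date_string is None:
        return None
    try:
        return datetime.strptime(date_string, TWITTER_FORMAT)
    except ValueError:
        return None


parsed = parse_created_at("Sat Mar 04 13:17:45 +0000 2017")
print(parsed)
print(parse_created_at(None))
```

<p>It could then be wrapped in the same way as the lambda, for example udf(parse_created_at, TimestampType()).</p>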



<h2>Your UDF can also wrap a longer function</h2>

In [9]:
def target_market(dt, lng):
    # True for tweets posted at minute 23 of the hour in a language other than English
    return dt.minute == 23 and lng != 'en'

In [10]:
is_target = udf(target_market, BooleanType())
df_mod_filtered = df_mod.withColumn('is_target', is_target(df_mod['dt_created_at'], df_mod['lang']))
df_mod_filtered.select('is_target', 'dt_created_at', 'lang').show(5)

+---------+--------------------+----+
|is_target|       dt_created_at|lang|
+---------+--------------------+----+
|    false|2016-11-21 03:23:...|  en|
|     true|2016-11-21 03:23:...| und|
|    false|2016-11-21 03:23:...|  en|
|    false|2016-11-21 03:23:...|  en|
|    false|2016-11-21 03:23:...|  en|
+---------+--------------------+----+
only showing top 5 rows



In [11]:
df_mod_filtered.filter(df_mod_filtered['is_target']).select('is_target', 'dt_created_at', 'lang').show(5)

+---------+--------------------+----+
|is_target|       dt_created_at|lang|
+---------+--------------------+----+
|     true|2016-11-21 03:23:...| und|
|     true|2016-11-21 03:23:...| und|
|     true|2016-11-21 03:23:...| und|
|     true|2016-11-21 03:23:...| und|
|     true|2016-11-21 03:23:...| und|
+---------+--------------------+----+
only showing top 5 rows
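
<p>One advantage of keeping the logic in a named function is that it can be checked in plain Python before it is wrapped in a UDF.  The sketch below restates the same rule and runs it against a few hypothetical inputs chosen to exercise each branch.</p>

```python
from datetime import datetime


def target_market(dt, lng):
    """Same rule as above: posted at minute 23 and not tagged as English."""
    return dt.minute == 23 and lng != 'en'


# Hypothetical sample inputs: (timestamp, language, expected result).
cases = [
    (datetime(2016, 11, 21, 3, 23, 0), 'und', True),
    (datetime(2016, 11, 21, 3, 23, 0), 'en', False),
    (datetime(2016, 11, 21, 3, 24, 0), 'und', False),
]

results = [target_market(dt, lng) for dt, lng, _ in cases]
for (dt, lng, expected), got in zip(cases, results):
    assert got == expected, (dt, lng, got)
print(results)
```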

