# Dates and Timestamps in Spark DataFrames

## Table of Contents

<ol style = "type:1">
    <li><a href = "#ref">References</a></li>
</ol>

In [1]:
import findspark

In [2]:
findspark.init("/home/virchan/spark-3.3.1-bin-hadoop3")

In [3]:
from pyspark.sql import SparkSession

In [6]:
# Output hidden
spark = SparkSession.builder.appName("dates").getOrCreate()

In [7]:
df = spark.read.csv("appl_stock.csv", 
                    header = True,
                    inferSchema = True
                   )

In [8]:
df.head(1)

[Row(Date=datetime.datetime(2010, 1, 4, 0, 0), Open=213.429998, High=214.499996, Low=212.38000099999996, Close=214.009998, Volume=123432400, Adj Close=27.727039)]

In [11]:
df.select(["Date", "Open"]).show(10)

+-------------------+------------------+
|               Date|              Open|
+-------------------+------------------+
|2010-01-04 00:00:00|        213.429998|
|2010-01-05 00:00:00|        214.599998|
|2010-01-06 00:00:00|        214.379993|
|2010-01-07 00:00:00|            211.75|
|2010-01-08 00:00:00|        210.299994|
|2010-01-11 00:00:00|212.79999700000002|
|2010-01-12 00:00:00|209.18999499999998|
|2010-01-13 00:00:00|        207.870005|
|2010-01-14 00:00:00|210.11000299999998|
|2010-01-15 00:00:00|210.92999500000002|
+-------------------+------------------+
only showing top 10 rows



Spark provides various functions for extracting fields from dates and timestamps.

In [12]:
from pyspark.sql.functions import (dayofmonth, 
                                   hour, 
                                   dayofyear, 
                                   month, 
                                   year, 
                                   weekofyear, 
                                   format_number, 
                                   date_format
                                  )

In [14]:
df.select(dayofmonth(df["Date"])).show(10)

+----------------+
|dayofmonth(Date)|
+----------------+
|               4|
|               5|
|               6|
|               7|
|               8|
|              11|
|              12|
|              13|
|              14|
|              15|
+----------------+
only showing top 10 rows



In [15]:
df.select(hour(df["Date"])).show(10)

+----------+
|hour(Date)|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 10 rows



In [16]:
df.select(month(df["Date"])).show(10)

+-----------+
|month(Date)|
+-----------+
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
+-----------+
only showing top 10 rows
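
To make the semantics of these extraction functions concrete, here is a minimal sketch in plain Python rather than Spark. It reads the same fields from the `datetime` value that `df.head(1)` returned for the first row. The names `first_date` and `fields` are illustrative, and the `weekofyear` entry assumes ISO 8601 week numbering (weeks start on Monday), which is how the Spark documentation describes that function.

```python
from datetime import datetime

# First "Date" value, as shown in the df.head(1) output.
first_date = datetime(2010, 1, 4, 0, 0)

# Plain-Python counterparts of the Spark functions imported above.
# This is a mental model only, not how Spark computes them.
fields = {
    "dayofmonth": first_date.day,
    "hour": first_date.hour,
    "month": first_date.month,
    "year": first_date.year,
    "dayofyear": first_date.timetuple().tm_yday,
    "weekofyear": first_date.isocalendar()[1],  # ISO 8601 week number
}

for name, value in fields.items():
    print(f"{name}: {value}")
```

The `dayofmonth`, `hour`, and `month` values agree with the first row of each Spark output above.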



We can create a `"Year"` column with the `.withColumn()` method.

In [18]:
newdf = df.withColumn("Year", year(df["Date"]))

In [19]:
newdf.show(5)

+-------------------+----------+----------+------------------+------------------+---------+------------------+----+
|               Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|Year|
+-------------------+----------+----------+------------------+------------------+---------+------------------+----+
|2010-01-04 00:00:00|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|2010|
|2010-01-05 00:00:00|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|2010|
|2010-01-06 00:00:00|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|2010|
|2010-01-07 00:00:00|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|2010|
|2010-01-08 00:00:00|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|2010|
+-------------------+----------+----------+------------------+------------------+---------+------------------+----+

We can then group by year and average every numeric column. Because `Year` is itself numeric, the output also contains a redundant `avg(Year)` column.

In [21]:
newdf.groupBy("Year").mean().show()

+----+------------------+------------------+------------------+------------------+--------------------+------------------+---------+
|Year|         avg(Open)|         avg(High)|          avg(Low)|        avg(Close)|         avg(Volume)|    avg(Adj Close)|avg(Year)|
+----+------------------+------------------+------------------+------------------+--------------------+------------------+---------+
|2015|120.17575393253965|121.24452385714291| 118.8630954325397|120.03999980555547|  5.18378869047619E7|115.96740080555561|   2015.0|
|2013| 473.1281355634922| 477.6389272301587|468.24710264682557| 472.6348802857143|          1.016087E8| 62.61798788492063|   2013.0|
|2014| 295.1426195357143|297.56103184523823| 292.9949599801587| 295.4023416507935| 6.315273055555555E7| 87.63583323809523|   2014.0|
|2012|     576.652720788| 581.8254008040001| 569.9211606079999| 576.0497195640002|       1.319642044E8| 74.81383696800002|   2012.0|
|2016|104.50777772619044| 105.4271825436508|103.69027771825397|104.60

To keep only the columns we need, pass a list of column names to `.select()`.

In [23]:
result = newdf.groupBy("Year").mean().select(["Year", "avg(Close)"])

In [24]:
result.show()

+----+------------------+
|Year|        avg(Close)|
+----+------------------+
|2015|120.03999980555547|
|2013| 472.6348802857143|
|2014| 295.4023416507935|
|2012| 576.0497195640002|
|2016|104.60400786904763|
|2010| 259.8424600000002|
|2011|364.00432532142867|
+----+------------------+

We rename the `avg(Close)` column with the `.withColumnRenamed()` method.

In [26]:
new_result = result.withColumnRenamed("avg(Close)", "Average Closing")

In [27]:
new_result.show()

+----+------------------+
|Year|   Average Closing|
+----+------------------+
|2015|120.03999980555547|
|2013| 472.6348802857143|
|2014| 295.4023416507935|
|2012| 576.0497195640002|
|2016|104.60400786904763|
|2010| 259.8424600000002|
|2011|364.00432532142867|
+----+------------------+



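Next, we format the values to two decimal places with `format_number`. Note that `format_number` returns a string column, not a numeric one.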

In [30]:
new_result.select(["Year", format_number("Average Closing", 2).alias("Avg Close")]).show()

+----+---------+
|Year|Avg Close|
+----+---------+
|2015|   120.04|
|2013|   472.63|
|2014|   295.40|
|2012|   576.05|
|2016|   104.60|
|2010|   259.84|
|2011|   364.00|
+----+---------+



## <a name = "ref">References</a>

<ol style = "type:1">
    <li>Jose Portilla. Spark and Python for Big Data with PySpark.</li>
    <li>Apache Spark. <a href = "https://spark.apache.org/docs/latest/api/python/">https://spark.apache.org/docs/latest/api/python/</a>.</li>
</ol>