## Date & time functions

***Formatting***
>**current_date()**: Column  
>**current_timestamp()** : Column  

>**to_timestamp(s: Column)**: Column  
>**to_timestamp(s: Column, fmt: String)**: Column  

>**date_format(dateExpr: Column, format: String)**: Column  
>**to_date(col: Column)**: Column  
>**to_date(col: Column, fmt: String)**: Column  

>**trunc(date: Column, format: String)**: Column  
>**date_trunc(format: String, timestamp: Column)**: Column
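
Note from the signatures above that `trunc` takes the column first and the format second, while `date_trunc` takes them in the opposite order. As a rough illustration of what truncation means, here is a minimal plain-Python sketch using only the standard library rather than Spark. The helper name `trunc_sketch` is hypothetical, and it handles only the year and month format strings.

```python
from datetime import date

def trunc_sketch(d: date, fmt: str) -> date:
    """Round a date down to the first day of its year or month.

    Hypothetical helper for illustration; only a subset of the
    format strings accepted by Spark's trunc is handled here.
    """
    fmt = fmt.lower()
    if fmt in ("year", "yyyy", "yy"):
        return d.replace(month=1, day=1)
    if fmt in ("month", "mon", "mm"):
        return d.replace(day=1)
    raise ValueError(f"unsupported format in this sketch: {fmt}")

print(trunc_sketch(date(2015, 3, 16), "month"))  # 2015-03-01
print(trunc_sketch(date(2015, 3, 16), "year"))   # 2015-01-01
```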

***Calculation***  

>**add_months(startDate: Column, numMonths: Int)**: Column  
>**date_add(start: Column, days: Int)**: Column  
>**date_sub(start: Column, days: Int)**: Column  
>**datediff(end: Column, start: Column)**: Column  
>**months_between(end: Column, start: Column)**: Column  
>**months_between(end: Column, start: Column, roundOff: Boolean)**: Column  
>**next_day(date: Column, dayOfWeek: String)**: Column
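
The argument order of `datediff` (end first, then start) is easy to get backwards. The following plain-Python sketch shows what `datediff`, `date_add`, and `date_sub` compute for a single pair of values. It uses the standard library `datetime` module rather than Spark, and the `*_sketch` helper names are hypothetical.

```python
from datetime import date, timedelta

def datediff_sketch(end: date, start: date) -> int:
    """Whole days from start to end; negative when end is before start."""
    return (end - start).days

def date_add_sketch(start: date, days: int) -> date:
    """The date that falls `days` days after start."""
    return start + timedelta(days=days)

def date_sub_sketch(start: date, days: int) -> date:
    """The date that falls `days` days before start."""
    return start - timedelta(days=days)

print(datediff_sketch(date(2015, 4, 1), date(2015, 3, 16)))  # 16
print(date_add_sketch(date(2015, 3, 16), 10))                # 2015-03-26
print(date_sub_sketch(date(2015, 3, 16), 10))                # 2015-03-06
```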

***Extraction***  

>**year(e: Column)**: Column  
>**quarter(e: Column)**: Column  
>**month(e: Column)**: Column  
>**dayofweek(e: Column)**: Column   
>**dayofmonth(e: Column)**: Column  
>**dayofyear(e: Column)**: Column  
>**weekofyear(e: Column)**: Column  

>**hour(e: Column)**: Column  
>**minute(e: Column)**: Column  
>**second(e: Column)**: Column  
>**last_day(e: Column)**: Column
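
Some of these extractors use numbering that differs from Python's own. Per the Spark documentation, `dayofweek` runs from 1 (Sunday) to 7 (Saturday), and `weekofyear` follows ISO 8601 week numbering. The plain-Python sketch below mirrors those conventions with the standard library; the `*_sketch` helper names are hypothetical and this is not Spark code.

```python
from datetime import date

def dayofweek_sketch(d: date) -> int:
    # isoweekday() is 1 (Monday) .. 7 (Sunday); shift it to 1 (Sunday) .. 7 (Saturday)
    return d.isoweekday() % 7 + 1

def quarter_sketch(d: date) -> int:
    # Months 1-3 fall in quarter 1, 4-6 in quarter 2, and so on
    return (d.month - 1) // 3 + 1

def weekofyear_sketch(d: date) -> int:
    # ISO 8601 week number: weeks start on Monday
    return d.isocalendar()[1]

d = date(2015, 3, 16)        # a Monday
print(dayofweek_sketch(d))   # 2
print(quarter_sketch(d))     # 1
print(weekofyear_sketch(d))  # 12
```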

## Unix timestamp

Unix time represents a point in time as the number of seconds elapsed since `January 1st, 1970 at 00:00:00 UTC`. One of its primary benefits is that a timestamp can be stored as an ***integer***, which makes it easier to parse and use across different systems.
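
As a small illustration of that definition, the sketch below converts a timezone-aware `datetime` to Unix seconds and back using only the Python standard library. The helper names `to_unix_seconds` and `from_unix_seconds` are hypothetical and are not part of Spark.

```python
from datetime import datetime, timezone

def to_unix_seconds(dt: datetime) -> int:
    """Seconds elapsed since 1970-01-01 00:00:00 UTC (expects an aware datetime)."""
    return int(dt.timestamp())

def from_unix_seconds(seconds: int) -> datetime:
    """Inverse conversion, returning a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

moment = datetime(2015, 3, 16, 0, 0, 0, tzinfo=timezone.utc)
seconds = to_unix_seconds(moment)
print(seconds)                     # 1426464000
print(from_unix_seconds(seconds))  # 2015-03-16 00:00:00+00:00
print(from_unix_seconds(0))        # 1970-01-01 00:00:00+00:00
```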

***Unix Timestamp Functions***

>**from_unixtime(ut: Column)**: Column  
>**from_unixtime(ut: Column, f: String)**: Column  
>**unix_timestamp()**: Column  
>**unix_timestamp(s: Column)**: Column  
>**unix_timestamp(s: Column, p: String)**: Column

In [0]:
%fs ls /mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/

In [0]:
SampleDF = (spark.read
  .option("inferSchema", "true") # Not needed for Parquet, which stores its schema in the file
  .parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/")
  .cache()                       # Cache the expensive operation
)


In [0]:
SampleDF.count() #7200000

In [0]:
SampleDF.printSchema()

Renaming the timestamp column

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
(SampleDF
 #Method 1: using select() with alias(); returns a new DataFrame and leaves the SampleDF schema unchanged
  .select( col("timestamp").alias("capturedAt"), col("site"), col("requests") )
  .printSchema()
)

In [0]:
SampleDF #DataFrame[timestamp: string, site: string, requests: int]

In [0]:
#Method 2: Using withColumnRenamed

(SampleDF
  .withColumnRenamed("timestamp", "capturedAt")
  .printSchema()
)

In [0]:
#Method 3: using toDF()

(SampleDF
  .toDF("capturedAt", "site", "requests")
  .printSchema()
)

In [0]:
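tempA = (SampleDF
  .withColumnRenamed("timestamp", "capturedAt")
  # This pattern does not appear to match the stored strings (a space instead of the literal 'T', and no seconds),
  # so the parsed column is expected to come back null
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd HH:mm:").cast("timestamp") )
)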
tempA.printSchema()

In [0]:
tempA.show()

In [0]:
pageviewsDF = (SampleDF
  .withColumnRenamed("timestamp", "capturedAt") 
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)

pageviewsDF.printSchema()

In [0]:
pageviewsDF.show()
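
The format string passed to `unix_timestamp` has to describe the input text exactly, including the literal `T` between the date and the time. The plain-Python sketch below shows the same idea with `datetime.strptime`, whose pattern letters differ from Spark's. The sample string is a hypothetical value assumed from the pattern used above, `parse_or_none` is a hypothetical helper, and returning `None` on a mismatch is a choice made here to resemble a null result.

```python
from datetime import datetime
from typing import Optional

def parse_or_none(text: str, pattern: str) -> Optional[datetime]:
    """Parse text with a strptime pattern, returning None when it does not match."""
    try:
        return datetime.strptime(text, pattern)
    except ValueError:
        return None

sample = "2015-03-16T00:00:00"  # hypothetical value, not read from the dataset

print(parse_or_none(sample, "%Y-%m-%dT%H:%M:%S"))  # 2015-03-16 00:00:00
print(parse_or_none(sample, "%Y-%m-%d %H:%M:%S"))  # None
```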

In [0]:
(pageviewsDF
  .select( month(col("capturedAt")).alias("month"), year(col("capturedAt")).alias("year"))
  .distinct()
  .show()                     
)

In [0]:
newDF=pageviewsDF.select("capturedAt", current_date(), current_timestamp())

In [0]:
newDF.show()

Date calculation

In [0]:
newDF.select(months_between(current_date(), "capturedAt", True).alias("monthsDifference")).show()
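
`months_between` returns a fractional number of months, not days. The plain-Python sketch below follows the behaviour described in the Spark documentation: a whole number when both values share the same day of month or are both the last day of their month, otherwise a fraction based on 31-day months, rounded to 8 digits when `roundOff` is true. It is an approximation for illustration, not Spark's implementation, and `months_between_sketch` is a hypothetical name.

```python
import calendar
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_MONTH = 31 * SECONDS_PER_DAY  # fractions assume 31-day months

def _is_last_day(dt: datetime) -> bool:
    return dt.day == calendar.monthrange(dt.year, dt.month)[1]

def _seconds_in_day(dt: datetime) -> int:
    return dt.hour * 3600 + dt.minute * 60 + dt.second

def months_between_sketch(end: datetime, start: datetime, round_off: bool = True) -> float:
    month_diff = (end.year - start.year) * 12 + (end.month - start.month)
    # Same day of month, or both month-ends: the result is a whole number
    if end.day == start.day or (_is_last_day(end) and _is_last_day(start)):
        return float(month_diff)
    seconds_diff = ((end.day - start.day) * SECONDS_PER_DAY
                    + _seconds_in_day(end) - _seconds_in_day(start))
    result = month_diff + seconds_diff / SECONDS_PER_MONTH
    return round(result, 8) if round_off else result

# Example values taken from the Spark documentation for months_between
print(months_between_sketch(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30)))  # 3.94959677
```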

In [0]:
newDF.select("capturedAt",date_add("capturedAt",10)).show()