# Handling Date/TimeStamp Formats

## 1. When column is of Date/TimeStamp Datatype

### Creating Data Frame with sample Data/schema

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

In [3]:
spark = SparkSession.builder.appName("Spark_Date_Timestamp_Test").master("yarn").enableHiveSupport().getOrCreate()

In [4]:
data = [["naresh","hisar",25],["ravi","delhi",33],["virender","hasangarh", 55]]

In [5]:
fields = [T.StructField("name", T.StringType(), True),T.StructField("city", T.StringType(), True),\
          T.StructField("age", T.IntegerType(), True)]

In [6]:
schema = T.StructType(fields)

In [7]:
df = spark.createDataFrame(data,schema)

In [8]:
df.show()

+--------+---------+---+
|    name|     city|age|
+--------+---------+---+
|  naresh|    hisar| 25|
|    ravi|    delhi| 33|
|virender|hasangarh| 55|
+--------+---------+---+



### Adding 2 Extra columns: ctime and cdate

In [9]:
df = df.withColumn("ctime",F.current_timestamp().cast(T.TimestampType()))\
.withColumn("cdate",F.current_date().cast(T.TimestampType()))

In [10]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- ctime: timestamp (nullable = false)
 |-- cdate: timestamp (nullable = false)



In [11]:
df.select("ctime","cdate").show(10,False)

+-----------------------+---------------------+
|ctime                  |cdate                |
+-----------------------+---------------------+
|2018-02-27 21:47:41.947|2018-02-27 00:00:00.0|
|2018-02-27 21:47:41.947|2018-02-27 00:00:00.0|
|2018-02-27 21:47:41.947|2018-02-27 00:00:00.0|
+-----------------------+---------------------+



### Convert the ctime column into the desired format

In the format pattern, HH gives the hour in 24-hour format (hh would give the 12-hour clock).

We will use the following functions: F.date_format(), F.year(), F.month(), F.dayofmonth(), F.to_date(), F.second()

In [12]:
df = df.withColumn("ctime_date",F.date_format(df.ctime, "yyyy-MM-dd HH:mm:ss"))\
    .withColumn("ctime_year", F.year(df.ctime))\
    .withColumn("ctime_month", F.month(df.ctime))\
    .withColumn("ctime_day", F.dayofmonth(df.ctime))\
    .withColumn("ctime_todate", F.to_date(df.ctime))\
    .withColumn("ctime_seconds", F.second(df.ctime))

In [13]:
df.select("ctime_date","ctime_todate","ctime_year","ctime_month","ctime_day","ctime_seconds").show(10,False)

+-------------------+------------+----------+-----------+---------+-------------+
|ctime_date         |ctime_todate|ctime_year|ctime_month|ctime_day|ctime_seconds|
+-------------------+------------+----------+-----------+---------+-------------+
|2018-02-27 21:47:44|2018-02-27  |2018      |2          |27       |44           |
|2018-02-27 21:47:44|2018-02-27  |2018      |2          |27       |44           |
|2018-02-27 21:47:44|2018-02-27  |2018      |2          |27       |44           |
+-------------------+------------+----------+-----------+---------+-------------+
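As a plain-Python cross-check of the 24-hour point (this uses the standard library, not the Spark API), the sketch below formats the sample value shown above. Python's `%H` and `%I` play the role of `HH` and `hh` in the Spark pattern; the variable names are illustrative only.

```python
from datetime import datetime

# Sample value taken from the ctime_date column above.
ts = datetime(2018, 2, 27, 21, 47, 44)

# %H is the 24-hour clock (like HH in the Spark pattern);
# %I is the 12-hour clock (like hh).
as_24h = ts.strftime("%Y-%m-%d %H:%M:%S")
as_12h = ts.strftime("%Y-%m-%d %I:%M:%S")

# Field extraction, mirroring F.year(), F.month(), F.dayofmonth(), F.second().
parts = (ts.year, ts.month, ts.day, ts.second)

print(as_24h)
print(as_12h)
print(parts)
```

With the 12-hour pattern, 21:47:44 is printed as 09:47:44, which is ambiguous without an AM/PM marker. That is why HH is used here.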



### Get the date 6 days earlier using F.date_sub() and get the difference using F.datediff()

In [14]:
df = df.withColumn("date_before_6days" , F.date_sub(df.ctime_todate,6))

In [15]:
df = df.withColumn("datediff", F.datediff(df.ctime_todate,df.date_before_6days))

In [16]:
df.select("ctime_todate","date_before_6days","datediff").show(10,False)

+------------+-----------------+--------+
|ctime_todate|date_before_6days|datediff|
+------------+-----------------+--------+
|2018-02-27  |2018-02-21       |6       |
|2018-02-27  |2018-02-21       |6       |
|2018-02-27  |2018-02-21       |6       |
+------------+-----------------+--------+
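The same date arithmetic can be sanity-checked in plain Python (standard library only, not the Spark API). The variable names simply reuse the column names from the cells above.

```python
from datetime import date, timedelta

ctime_todate = date(2018, 2, 27)

# Like F.date_sub(col, 6): move 6 days back.
date_before_6days = ctime_todate - timedelta(days=6)

# Like F.datediff(end, start): number of days from start to end.
datediff = (ctime_todate - date_before_6days).days

print(date_before_6days, datediff)
```

Note the argument order: F.datediff() takes the end date first, so swapping the two columns would give -6 instead of 6.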



### Get the Unix timestamp using F.unix_timestamp() and convert it back using F.from_unixtime()

Note: unix_timestamp() returns whole seconds, so the milliseconds part is dropped. It handles up to 3 digits of milliseconds; with more than 3 digits the result is wrong (see Case (ii) in section 2).

In [17]:
df = df.withColumn("unixTime", F.unix_timestamp(df.ctime, format='yyyy-MM-dd HH:mm:ss'))

In [18]:
df = df.withColumn("revert_UnixTimestamp", F.from_unixtime(df.unixTime))

In [19]:
df.select(df.ctime, df.unixTime, "revert_UnixTimestamp").show(10,False)

+-----------------------+----------+--------------------+
|ctime                  |unixTime  |revert_UnixTimestamp|
+-----------------------+----------+--------------------+
|2018-02-27 21:47:50.533|1519721270|2018-02-27 21:47:50 |
|2018-02-27 21:47:50.533|1519721270|2018-02-27 21:47:50 |
|2018-02-27 21:47:50.533|1519721270|2018-02-27 21:47:50 |
+-----------------------+----------+--------------------+
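The numbers in this output can be reproduced with the standard library. This is a sketch under one assumption that the notebook does not state: the session time zone was UTC+13 (New Zealand daylight time). That offset is inferred from the sample values, because 1519721270 is 08:47:50 in UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumption: session time zone of UTC+13, inferred from the sample output.
session_tz = timezone(timedelta(hours=13))

ctime = datetime(2018, 2, 27, 21, 47, 50, 533000, tzinfo=session_tz)

# Like F.unix_timestamp(): whole seconds since the epoch, so .533 is dropped.
unix_time = int(ctime.timestamp())

# Like F.from_unixtime(): back to a wall-clock string in the session time zone.
reverted = datetime.fromtimestamp(unix_time, tz=session_tz).strftime("%Y-%m-%d %H:%M:%S")

print(unix_time, reverted)
```

The round trip loses the milliseconds, and both directions depend on the session time zone, so the same epoch value prints as a different wall-clock time on a cluster configured for another zone.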



### Converting from a timezone (say NZ, IST, PST) to UTC


In [20]:
df = df.withColumn("NZ_to_utc", F.to_utc_timestamp(df.ctime,"NZ"))\
    .withColumn("IST_to_utc", F.to_utc_timestamp(df.ctime,"IST"))\
    .withColumn("PST_to_utc", F.to_utc_timestamp(df.ctime,"PST"))    

In [21]:
df.select(df.ctime, df.NZ_to_utc,"IST_to_utc","PST_to_utc").show(100,False)

+-----------------------+-----------------------+-----------------------+-----------------------+
|ctime                  |NZ_to_utc              |IST_to_utc             |PST_to_utc             |
+-----------------------+-----------------------+-----------------------+-----------------------+
|2018-02-27 21:47:52.825|2018-02-27 08:47:52.825|2018-02-27 16:17:52.825|2018-02-28 05:47:52.825|
|2018-02-27 21:47:52.825|2018-02-27 08:47:52.825|2018-02-27 16:17:52.825|2018-02-28 05:47:52.825|
|2018-02-27 21:47:52.825|2018-02-27 08:47:52.825|2018-02-27 16:17:52.825|2018-02-28 05:47:52.825|
+-----------------------+-----------------------+-----------------------+-----------------------+
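F.to_utc_timestamp() treats the input as wall-clock time in the given zone and shifts it to UTC. The sketch below reproduces the three columns with the standard library. The fixed offsets are an assumption (the ones in effect on 2018-02-27), and the helper `to_utc` is hypothetical, not part of Spark.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets in effect on 2018-02-27 (no DST lookup is done here).
OFFSETS = {
    "NZ": timedelta(hours=13),               # New Zealand daylight time
    "IST": timedelta(hours=5, minutes=30),   # India standard time
    "PST": timedelta(hours=-8),              # Pacific standard time
}

def to_utc(naive_ts, zone):
    """Treat naive_ts as wall-clock time in `zone` and return naive UTC."""
    local = naive_ts.replace(tzinfo=timezone(OFFSETS[zone]))
    return local.astimezone(timezone.utc).replace(tzinfo=None)

ctime = datetime(2018, 2, 27, 21, 47, 52, 825000)
converted = {zone: to_utc(ctime, zone) for zone in OFFSETS}

for zone, value in converted.items():
    print(zone, value)
```

Zones ahead of UTC (NZ, IST) move the value earlier, while PST, which is behind UTC, pushes it into the next day. For the opposite direction, from UTC to a zone, Spark provides F.from_utc_timestamp().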



## 2. When column is of String Datatype and has Milliseconds

So far, the time/date columns already had a date or timestamp datatype. But what if you are reading from a file and everything is loaded as STRING? Then we have to convert the type from string to date/timestamp ourselves.

### Case (i): format = "yyyy-MM-dd HH.mm.ss.SSS" 

Sample: 2017-09-14 01.20.29.343, which has 3 digits of milliseconds.

In [22]:
three_digit_millisecs = [["2017-09-14 01.20.29.343"],["2017-09-10 04.20.29.341"],["2017-09-18 02.20.29.123"]]
schema = T.StructType([T.StructField("timing", T.StringType(),True)])
df = spark.createDataFrame(three_digit_millisecs,schema)

Note that the Datatype is String

In [23]:
df.printSchema()

root
 |-- timing: string (nullable = true)



In [24]:
df.show(10,False)

+-----------------------+
|timing                 |
+-----------------------+
|2017-09-14 01.20.29.343|
|2017-09-10 04.20.29.341|
|2017-09-18 02.20.29.123|
+-----------------------+



### Subcase (i): Milliseconds part is NOT needed

In this case, we can use the unix_timestamp() function and cast its result, which changes the data type from STRING to TIMESTAMP. However, we will lose the milliseconds part.

Below we create a new column 'in_timestamp_datatype' without the milliseconds.

In [25]:
format = "yyyy-MM-dd HH.mm.ss.SSS"

df = df.withColumn("in_timestamp_datatype",(F.unix_timestamp("timing",format)).cast('timestamp'))

Note the datatype of the new column: it is timestamp

In [26]:
df.printSchema()

root
 |-- timing: string (nullable = true)
 |-- in_timestamp_datatype: timestamp (nullable = true)



In [27]:
df.show(10,False)

+-----------------------+---------------------+
|timing                 |in_timestamp_datatype|
+-----------------------+---------------------+
|2017-09-14 01.20.29.343|2017-09-14 01:20:29.0|
|2017-09-10 04.20.29.341|2017-09-10 04:20:29.0|
|2017-09-18 02.20.29.123|2017-09-18 02:20:29.0|
+-----------------------+---------------------+



### Subcase (ii): Milliseconds part is Needed

If we want to keep the milliseconds part, we will:

1) First take the timestamp part using substring(0,21) and convert it with unix_timestamp() into whole seconds (2017-09-14 01:20:29, without milliseconds), cast to double.

Note that Spark's substring is 1-based and a start position of 0 behaves like 1, so both 0,21 and 1,21 give 2017-09-14 01.20.29.3

As we know from the previous example, unix_timestamp accepts up to 3 digits of milliseconds as input and drops them from the result, whatever digits we pass. So we can use 0,21 (2017-09-14 01.20.29.3) or 0,22 (2017-09-14 01.20.29.34) or 0,23 (2017-09-14 01.20.29.343)

    unix_timestamp(substring('timing',0,21),format).cast('double')

2) Use substring to add the milliseconds part back (it must also be cast to double). Positions 21 to 23 hold the 3 millisecond digits; dividing by 1000 turns them into the fraction .xyz (343/1000 = .343)

    +substring('timing',21,3).cast('double')/1000.0

3) Finally, cast the whole thing to timestamp.

    .cast('timestamp')


In [28]:
df = df.withColumn("in_timestamp_datatype",(F.unix_timestamp(F.substring('timing',0,21),format).cast(T.DoubleType())\
        +F.substring('timing',21,3).cast(T.DoubleType())/1000.0)\
        .cast(T.TimestampType()))

In [29]:
df.printSchema()

root
 |-- timing: string (nullable = true)
 |-- in_timestamp_datatype: timestamp (nullable = true)



Note that we now have the milliseconds part

In [30]:
df.show(10,False)

+-----------------------+-----------------------+
|timing                 |in_timestamp_datatype  |
+-----------------------+-----------------------+
|2017-09-14 01.20.29.343|2017-09-14 01:20:29.343|
|2017-09-10 04.20.29.341|2017-09-10 04:20:29.341|
|2017-09-18 02.20.29.123|2017-09-18 02:20:29.123|
+-----------------------+-----------------------+
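The split-and-recombine idea used in this subcase can be sketched in plain Python (standard library only). Unlike the Spark cell, this sketch parses only the first 19 characters and adds the milliseconds as a `timedelta` instead of a double, so it is an illustration of the steps rather than the same implementation.

```python
from datetime import datetime, timedelta

raw = "2017-09-14 01.20.29.343"

# Step 1: parse the part up to whole seconds (the first 19 characters).
whole_seconds = datetime.strptime(raw[:19], "%Y-%m-%d %H.%M.%S")

# Step 2: take the 3 millisecond digits (1-based positions 21 to 23,
# which are Python indexes 20 to 22).
millis = int(raw[20:23])

# Step 3: recombine; 343 ms is the same as 343 / 1000 = 0.343 s.
with_millis = whole_seconds + timedelta(milliseconds=millis)

print(with_millis)
```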



### Case (ii): format = "yyyy-MM-dd HH.mm.ss.SSSSSS" 

Sample: 2017-09-14 01.20.29.343234, which has 6 fractional digits (microseconds).

In [31]:
six_digit_millisecs = [["2017-09-14 01.20.29.343234"],["2017-09-10 04.20.29.341244"],["2017-09-18 02.20.29.123456"]]
schema = T.StructType([T.StructField("timing", T.StringType(),True)])
df = spark.createDataFrame(six_digit_millisecs,schema)

Note that the Datatype is String

In [32]:
df.printSchema()

root
 |-- timing: string (nullable = true)



In [33]:
df.show(10,False)

+--------------------------+
|timing                    |
+--------------------------+
|2017-09-14 01.20.29.343234|
|2017-09-10 04.20.29.341244|
|2017-09-18 02.20.29.123456|
+--------------------------+



### Subcase (i): Milliseconds part is NOT needed

From the previous subcase (i), we know that unix_timestamp handles up to 3 digits of milliseconds (and drops those 3 digits of precision from the result).

Everything here is the same, but since we have more than 3 fractional digits, we have to use substring to pass at most 3 of them to unix_timestamp.

If we pass the value as it is (without substring), the 6 digits are read as a number of milliseconds, converted into minutes/seconds and added to the actual timestamp. This means we get a later timestamp.

For example, 2017-09-14-01.20.29.469061 will be converted to 2017-09-14 01:28:18 (later by 469,061 ms = 7 min 49 seconds)
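The arithmetic behind this example can be verified with a few lines of plain Python (standard library only; the variable names are illustrative):

```python
from datetime import datetime, timedelta

base = datetime(2017, 9, 14, 1, 20, 29)

# The 6 fractional digits are read as 469061 milliseconds, not 0.469061 s.
misread = base + timedelta(milliseconds=469061)

# 469061 ms is 469 whole seconds, which is 7 minutes and 49 seconds.
minutes, seconds = divmod(469061 // 1000, 60)

print(misread.replace(microsecond=0), minutes, seconds)
```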

In [34]:
format = "yyyy-MM-dd HH.mm.ss.SSSSSS"

df = df.withColumn("in_timestamp_datatype",(F.unix_timestamp(F.substring("timing",1,23),format)).cast(T.TimestampType()))

Note the datatype of the new column: it is timestamp

In [35]:
df.printSchema()

root
 |-- timing: string (nullable = true)
 |-- in_timestamp_datatype: timestamp (nullable = true)



In [36]:
df.show(10,False)

+--------------------------+---------------------+
|timing                    |in_timestamp_datatype|
+--------------------------+---------------------+
|2017-09-14 01.20.29.343234|2017-09-14 01:20:29.0|
|2017-09-10 04.20.29.341244|2017-09-10 04:20:29.0|
|2017-09-18 02.20.29.123456|2017-09-18 02:20:29.0|
+--------------------------+---------------------+



### Subcase (ii): Milliseconds part is Needed

The explanation is exactly the same as earlier, except that we take 6 fractional digits and divide by 1000000.

In [37]:
df = df.withColumn("in_timestamp_datatype",(F.unix_timestamp(F.substring('timing',0,21),format).cast(T.DoubleType())\
        +F.substring('timing',21,6).cast(T.DoubleType())/1000000.0)\
        .cast(T.TimestampType()))

In [38]:
df.printSchema()

root
 |-- timing: string (nullable = true)
 |-- in_timestamp_datatype: timestamp (nullable = true)



Note that we now have the fractional (microseconds) part

In [39]:
df.show(10,False)

+--------------------------+--------------------------+
|timing                    |in_timestamp_datatype     |
+--------------------------+--------------------------+
|2017-09-14 01.20.29.343234|2017-09-14 01:20:29.343234|
|2017-09-10 04.20.29.341244|2017-09-10 04:20:29.341244|
|2017-09-18 02.20.29.123456|2017-09-18 02:20:29.123456|
+--------------------------+--------------------------+
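One reasonable worry with this approach is whether a double can hold epoch seconds plus 6 fractional digits. The plain-Python sketch below (standard library only, with UTC assumed so the result does not depend on the local machine) repeats the same addition and converts the value back. At this magnitude a double resolves roughly a quarter of a microsecond, which is enough for these samples.

```python
import calendar
from datetime import datetime, timezone

raw = "2017-09-14 01.20.29.343234"

# Epoch seconds for the whole-second part (UTC assumed for determinism).
parsed = datetime.strptime(raw[:19], "%Y-%m-%d %H.%M.%S")
whole = calendar.timegm(parsed.timetuple())

# Same arithmetic as the Spark cell: seconds as a double plus digits / 1,000,000.
as_double = float(whole) + float(raw[20:26]) / 1000000.0

# Convert back and check that the microseconds survived the double.
restored = datetime.fromtimestamp(as_double, tz=timezone.utc)

print(restored)
```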



# What's Next

1) To download this single notebook, click this file in my GitHub account, copy the URL and paste it into http://nbviewer.jupyter.org/. The download button is in the top right corner.

2) Open your Jupyter Notebook home page and upload the file using the "Upload" button.

3) Continue learning with the next notebook, Spark_05_UDF_Usage.ipynb