# Dates in `pyspark`

In this and subsequent lectures, you will learn about working with dates and times in `pyspark`. This includes:

1. Correctly loading dates and times.
2. Extracting various date- and time-parts.
3. Differencing and offsetting dates.

In [1]:
from pyspark.sql import SparkSession
from more_pyspark import to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()


## Loading Date/Time Data

### Dates and Times in `pyspark`

Two important column types:
* `DateType` represents a calendar date
    * default format: `yyyy-MM-dd`
* `TimestampType` represents a point in time
    * default format: `yyyy-MM-dd HH:mm:ss.SSSS`

### Reading in Date/Time data

`inferSchema` often fails to recognize dates and times $\rightarrow$ manually set the types

* Load the raw data
* Use `more_pyspark.pprint_schema` to print/copy schema
* Edit to define the correct schema
* Write a datetime format string
* Load the data with correct type

## Example 1 - Uber data - Loading a `Timestamp`

In [2]:
uber_april_raw = (spark.read.csv('./data/uber-raw-data-apr14-sample.csv', 
                            header=True, 
                            inferSchema=True)
             )

uber_april_raw.take(2) >> to_pandas

                                                                                

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/18/2014 21:38:00,40.7359,-73.9852,B02682
1,4/23/2014 15:19:00,40.7642,-73.9543,B02598


#### `inferSchema` didn't catch the timestamps

In [3]:
from more_pyspark import pprint_schema

uber_april_raw >> pprint_schema

StructType([StructField('Date/Time', StringType(), True),
            StructField('Lat', DoubleType(), True),
            StructField('Lon', DoubleType(), True),
            StructField('Base', StringType(), True)])


### Step 1 - Correct the Schema

* Copy from above
* Replace the `StringType` of the `'Date/Time'` field with `TimestampType`

In [4]:
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType, StringType

uber_schema = StructType([StructField('Date/Time', TimestampType(), True),
                          StructField('Lat', DoubleType(), True),
                          StructField('Lon', DoubleType(), True),
                          StructField('Base', StringType(), True)])

### Step 2 - Specifying the format for a `Timestamp`

To load the timestamps correctly, we need to:

1. Create the schema with a `TimestampType`
2. Use `timestampFormat` with the correct [datetime pattern](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

### Determining the datetime format

<img src="./img/datetime_format.png" width="600">

In [5]:
uber_datetime_format = 'M/d/yyyy H:mm:ss'
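
Before loading the full file, it can help to check a candidate format against a few raw strings. The sketch below is a quick pure-Python sanity check, not Spark's parser: it assumes that the `strptime` directives `%m/%d/%Y %H:%M:%S` are a close enough counterpart of the Spark pattern `M/d/yyyy H:mm:ss` for these samples. The names `samples` and `python_format` are illustrative.

```python
from datetime import datetime

# Sample strings copied from the raw 'Date/Time' column shown earlier.
samples = ['4/18/2014 21:38:00', '4/23/2014 15:19:00']

# Assumed Python counterpart of the Spark pattern 'M/d/yyyy H:mm:ss':
#   M -> %m (month), d -> %d (day), yyyy -> %Y (4-digit year),
#   H -> %H (hour, 0-23), mm -> %M (minute), ss -> %S (second)
python_format = '%m/%d/%Y %H:%M:%S'

# strptime raises ValueError if a sample does not match the format.
parsed = [datetime.strptime(s, python_format) for s in samples]

for raw, value in zip(samples, parsed):
    print(raw, '->', value.isoformat(sep=' '))
```

Note that the letter case differs between the two systems: Spark uses `M` for month and `m` for minute, while Python uses `%m` for month and `%M` for minute.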

### Step 3 - Save the format and schema to a script file

* Inspect the file `uber_schema.py` to see the result.
* This makes for cleaner and more reusable code.

In [6]:
!cat uber_schema.py

from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType, StringType

uber_schema = StructType([StructField('Date/Time', TimestampType(), True),
                          StructField('Lat', DoubleType(), True),
                          StructField('Lon', DoubleType(), True),
                          StructField('Base', StringType(), True)])
                          
uber_datetime_format = 'M/d/yyyy H:mm:ss'

### Step 4 - Load the data

**Warning.** Python caches imported modules, so re-running the `import` will not pick up edits. Restart the kernel any time you change the script file!
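
As a lighter-weight alternative to a full restart, the standard library's `importlib.reload` re-executes a module that has already been imported. The following is a self-contained sketch under stated assumptions: `demo_schema` is a hypothetical throwaway module written to a temporary directory, standing in for a script file such as `uber_schema.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a hypothetical module to a temporary directory (demonstration only).
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / 'demo_schema.py'
sys.path.insert(0, str(module_dir))
sys.dont_write_bytecode = True  # avoid any stale cached bytecode in this demo

module_path.write_text("demo_datetime_format = 'M/d/yyyy H:mm'\n")
importlib.invalidate_caches()

import demo_schema

first = demo_schema.demo_datetime_format

# Edit the script file, then reload instead of restarting the kernel.
module_path.write_text("demo_datetime_format = 'M/d/yyyy H:mm:ss'\n")
importlib.reload(demo_schema)

second = demo_schema.demo_datetime_format

print(first)
print(second)
```

Names bound earlier with `from module import name` keep their old values after a reload, so that import line must be re-run as well. Restarting the kernel remains the simplest way to guarantee a clean state.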

In [7]:
from pyspark.sql import SparkSession
from more_pyspark import to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()

In [8]:
# After kernel restart and loading spark
from uber_schema import uber_schema, uber_datetime_format

uber_april = spark.read.csv('./data/uber-raw-data-apr14-sample.csv', 
                            header=True, 
                            schema=uber_schema,
                            timestampFormat=uber_datetime_format
                           )

uber_april.take(5) >> to_pandas

Unnamed: 0,Date/Time,Lat,Lon,Base
0,2014-04-18 21:38:00,40.7359,-73.9852,B02682
1,2014-04-23 15:19:00,40.7642,-73.9543,B02598
2,2014-04-10 07:15:00,40.7138,-74.0103,B02598
3,2014-04-11 15:23:00,40.7847,-73.9698,B02682
4,2014-04-07 17:26:00,40.646,-73.7767,B02598


In [9]:
uber_april >> pprint_schema

StructType([StructField('Date/Time', TimestampType(), True),
            StructField('Lat', DoubleType(), True),
            StructField('Lon', DoubleType(), True),
            StructField('Base', StringType(), True)])


## Example 2 - MoMA exhibitions - Loading a `Date`

#### Step 1 - Load, inspect, and define schema

In [10]:
# Read with correct encoding
exhibitions_raw = spark.read.csv('./data/MoMAExhibitions1929to1989.csv', 
                             header=True, 
                             inferSchema=True,
                             encoding="ISO-8859-1")

exhibitions_raw.take(5) >> to_pandas

                                                                                


Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853,Q711362,500241556,moma.org/artists/9168
1,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836,Q35548,500004793,moma.org/artists/1053
2,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1848,1903,"French, 18481903",Male,27064953,Q37693,500011421,moma.org/artists/2098
3,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,Dutch,1853,1890,"Dutch, 18531890",Male,9854560,Q5582,500115588,moma.org/artists/2206
4,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1859,1891,"French, 18591891",Male,24608076,Q34013,500008873,moma.org/artists/5358


In [11]:
exhibitions_raw >> pprint_schema # Copy ==> paste ==> edit the output in a new cell

StructType([StructField('ExhibitionID', IntegerType(), True),
            StructField('ExhibitionNumber', StringType(), True),
            StructField('ExhibitionTitle', StringType(), True),
            StructField('ExhibitionCitationDate', StringType(), True),
            StructField('ExhibitionBeginDate', StringType(), True),
            StructField('ExhibitionEndDate', StringType(), True),
            StructField('ExhibitionSortOrder', StringType(), True),
            StructField('ExhibitionURL', StringType(), True),
            StructField('ExhibitionRole', StringType(), True),
            StructField('ExhibitionRoleinPressRelease', StringType(), True),
            StructField('ConstituentID', StringType(), True),
            StructField('ConstituentType', StringType(), True),
            StructField('DisplayName', StringType(), True),
            StructField('AlphaSort', StringType(), True),
            StructField('FirstName', StringType(), True),
            StructField('MiddleN

In [12]:
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

exhib_schema = StructType( [StructField('ExhibitionID', IntegerType(), True),
                            StructField('ExhibitionNumber', StringType(), True),
                            StructField('ExhibitionTitle', StringType(), True),
                            StructField('ExhibitionCitationDate', StringType(), True),
                            StructField('ExhibitionBeginDate', DateType(), True),
                            StructField('ExhibitionEndDate', DateType(), True),
                            StructField('ExhibitionSortOrder', StringType(), True),
                            StructField('ExhibitionURL', StringType(), True),
                            StructField('ExhibitionRole', StringType(), True),
                            StructField('ExhibitionRoleinPressRelease', StringType(), True),
                            StructField('ConstituentID', StringType(), True),
                            StructField('ConstituentType', StringType(), True),
                            StructField('DisplayName', StringType(), True),
                            StructField('AlphaSort', StringType(), True),
                            StructField('FirstName', StringType(), True),
                            StructField('MiddleName', StringType(), True),
                            StructField('LastName', StringType(), True),
                            StructField('Suffix', StringType(), True),
                            StructField('Institution', StringType(), True),
                            StructField('Nationality', StringType(), True),
                            StructField('ConstituentBeginDate', IntegerType(), True),
                            StructField('ConstituentEndDate', IntegerType(), True),
                            StructField('ArtistBio', StringType(), True),
                            StructField('Gender', StringType(), True),
                            StructField('VIAFID', StringType(), True),
                            StructField('WikidataID', StringType(), True),
                            StructField('ULANID', StringType(), True),
                            StructField('ConstituentURL', StringType(), True)])


#### Step 2 - Specifying the format for a `Date`

In [13]:
exhib_date_format = "M/d/yyyy"

#### Step 3 - Save the format and schema to a script file.  Restart kernel.

In [14]:
!cat MoMA_schema.py

from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType, DoubleType

exhib_schema = StructType( [StructField('ExhibitionID', IntegerType(), True),
                            StructField('ExhibitionNumber', StringType(), True),
                            StructField('ExhibitionTitle', StringType(), True),
                            StructField('ExhibitionCitationDate', StringType(), True),
                            StructField('ExhibitionBeginDate', DateType(), True),
                            StructField('ExhibitionEndDate', DateType(), True),
                            StructField('ExhibitionSortOrder', StringType(), True),
                            StructField('ExhibitionURL', StringType(), True),
                            StructField('ExhibitionRole', StringType(), True),
                            StructField('ExhibitionRoleinPressRelease', StringType(), True),
                            StructField('ConstituentID', StringT

#### Step 4 - Load the data

In [15]:
from MoMA_schema import exhib_schema, exhib_date_format

exhibitions = spark.read.csv('./data/MoMAExhibitions1929to1989.csv', 
                             header=True, 
                             schema=exhib_schema,
                             encoding="ISO-8859-1",
                             dateFormat=exhib_date_format)
exhibitions.take(2) >> to_pandas # No more "bad" symbols

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853,Q711362,500241556,moma.org/artists/9168
1,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",1929-11-07,1929-12-07,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836,Q35548,500004793,moma.org/artists/1053


## <font color="red"> Exercise 6.4.1 - Makeover Monday Example </font>

In the first week of January 2021, the [Makeover Monday](https://data.world/makeovermonday/2021w1/discuss/2021w1/nkpl7c4a#tegzfysi) topic related to a bike boom in 2020.  The weekly totals are provided in `./data/weekly_data.csv`.

**Tasks.**
1. Load the raw data and inspect the data/types.
2. Create the correct `schema` for these data (especially the dates).
3. Create an appropriate datetime format string.
4. Load the data and verify that the types and data are correct.
5. Save the schema and format string to a Python script file.

In [21]:
# Your code here
# Read with correct encoding
bikes = spark.read.csv('./data/weekly_data.csv', 
                             header=True, 
                             inferSchema=True,
                             encoding="ISO-8859-1")

bikes.take(5) >> to_pandas

Unnamed: 0,year,timeframe,week,counts_31_counters,covid_period,pedestrians_14_counters,bikes_14_counters
0,2019,Week 1,2018-12-30,167679,N,68957,63100
1,2019,Week 2,2019-01-06,82340,N,34778,31703
2,2019,Week 3,2019-01-13,62315,N,32065,18082
3,2019,Week 4,2019-01-20,75801,N,35016,26280
4,2019,Week 5,2019-01-27,75841,N,33627,24447


In [22]:
bikes >> pprint_schema

StructType([StructField('year', IntegerType(), True),
            StructField('timeframe', StringType(), True),
            StructField('week', TimestampType(), True),
            StructField('counts_31_counters', IntegerType(), True),
            StructField('covid_period', StringType(), True),
            StructField('pedestrians_14_counters', IntegerType(), True),
            StructField('bikes_14_counters', IntegerType(), True)])


In [23]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

bikes_schema = StructType([StructField('year', IntegerType(), True),
                            StructField('timeframe', StringType(), True),
                            StructField('week', DateType(), True),
                            StructField('counts_31_counters', IntegerType(), True),
                            StructField('covid_period', StringType(), True),
                            StructField('pedestrians_14_counters', IntegerType(), True),
                            StructField('bikes_14_counters', IntegerType(), True)])

# In Spark patterns, 'M' is month and 'm' is minute-of-hour
bike_datetime_format = 'yyyy-MM-dd'

In [27]:
!cat bikes_schema.py

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

bikes_schema = StructType([StructField('year', IntegerType(), True),
                            StructField('timeframe', StringType(), True),
                            StructField('week', DateType(), True),
                            StructField('counts_31_counters', IntegerType(), True),
                            StructField('covid_period', StringType(), True),
                            StructField('pedestrians_14_counters', IntegerType(), True),
                            StructField('bikes_14_counters', IntegerType(), True)])

# In Spark patterns, 'M' is month and 'm' is minute-of-hour
bike_datetime_format = 'yyyy-MM-dd'

In [28]:
from bikes_schema import bikes_schema , bike_datetime_format 

bikes = spark.read.csv('./data/weekly_data.csv', 
                             header=True, 
                             schema=bikes_schema,
                             encoding="ISO-8859-1",
                             dateFormat=bike_datetime_format)
bikes.take(5) >> to_pandas # `week` should match the raw data (first row: 2018-12-30)

Unnamed: 0,year,timeframe,week,counts_31_counters,covid_period,pedestrians_14_counters,bikes_14_counters
0,2019,Week 1,2018-12-30,167679,N,68957,63100
1,2019,Week 2,2019-01-06,82340,N,34778,31703
2,2019,Week 3,2019-01-13,62315,N,32065,18082
3,2019,Week 4,2019-01-20,75801,N,35016,26280
4,2019,Week 5,2019-01-27,75841,N,33627,24447
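
A common pitfall when writing the format string is letter case. In Spark patterns, `M` means month and `m` means minute-of-hour, so a minute letter in the month position can parse without an error and silently leave the month at its default of January. The sketch below illustrates the same failure mode with Python's `strptime` as an analogy only. It is not Spark's parser, and the casing is reversed in Python (`%m` is month, `%M` is minute).

```python
from datetime import datetime

# First value of the 'week' column in the raw data.
raw = '2018-12-30'

# Correct: %m is the month directive in Python.
correct = datetime.strptime(raw, '%Y-%m-%d')

# Mistaken: %M is the minute directive, so '12' is read as 12 minutes
# and the month silently falls back to its default (January).
mistaken = datetime.strptime(raw, '%Y-%M-%d')

print(correct.date())
print(mistaken.date(), 'minute =', mistaken.minute)
```

Comparing a few parsed values against the raw strings, especially rows outside January, is a cheap way to catch this kind of error.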
