# Agenda

1. Dates and times
2. Visualization
3. Optimization of queries
    - Memory usage (dtypes and categories)
    - Techniques for improving query speed
    - PyArrow

# Dates and times

When we use the word "time" in a human language, we actually mean two different things:

- A specific point in time, unique in the universe's history. We point to that time with a particular year, month, day, hour, minute, second, etc., depending on how fine-grained we want our measurement to be. This is how we indicate when a class starts, or when a meeting ends, or when someone was born. In programming, we refer to this as a `datetime` or as a `timestamp`.

- We can also mean a span of time -- how long something lasts, or how long someone lived, or how long we've been working for a particular company. It's related to a `datetime`, but it's not the same. It's not anchored to a particular start or finish; it's just a span of time. In programming, we refer to this as a `timedelta` or an `interval`.

You can actually do math with these:

- `end_datetime` - `start_datetime` = `timedelta`  # how long did an event last?
- `end_datetime` - `timedelta` = `start_datetime` # when did something start, given the endpoint and the length?
- `start_datetime` + `timedelta` = `end_datetime`  # given a starting point and a length of time, when did it end?

Pandas supports both of these, as `datetime` and `timedelta` objects. We can have these objects in a series, and thus in a data frame.
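Here is a minimal sketch of that arithmetic, using the pickup and dropoff times from the first two rows of the taxi data shown below. It is only an illustration; the variable names are made up for this example.

```python
import pandas as pd

# Two specific points in time
start = pd.Timestamp('2015-06-02 11:19:29')
end = pd.Timestamp('2015-06-02 11:47:52')

# datetime - datetime = timedelta
duration = end - start
print(duration)                   # 0 days 00:28:23

# datetime + timedelta = datetime, and datetime - timedelta = datetime
print(start + duration == end)    # True
print(end - duration == start)    # True

# The same arithmetic works element-wise on whole series
pickups = pd.Series(pd.to_datetime(['2015-06-02 11:19:29', '2015-06-02 11:19:30']))
dropoffs = pd.Series(pd.to_datetime(['2015-06-02 11:47:52', '2015-06-02 11:27:56']))
trip_lengths = dropoffs - pickups          # a series of timedelta values
print(trip_lengths.dt.total_seconds())     # trip lengths in seconds
```

Note that a `timedelta` series has its own `.dt` accessor, with methods such as `total_seconds`.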

In [1]:
import pandas as pd

df = pd.read_csv('taxi.csv')
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [2]:
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

In [3]:
# how can we turn one or both of these columns into series with datetime dtypes?
# we can use pd.to_datetime

pd.to_datetime(df['tpep_pickup_datetime'])

0      2015-06-02 11:19:29
1      2015-06-02 11:19:30
2      2015-06-02 11:19:31
3      2015-06-02 11:19:31
4      2015-06-02 11:19:32
               ...        
9994   2015-06-01 00:12:59
9995   2015-06-01 00:12:59
9996   2015-06-01 00:13:00
9997   2015-06-01 00:13:02
9998   2015-06-01 00:13:04
Name: tpep_pickup_datetime, Length: 9999, dtype: datetime64[ns]

In [4]:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
dtype: object

In [5]:
# it seems silly that we have to do this in two steps!
# isn't there a way for us to tell read_csv that these two columns should be interpreted as datetime values?
# yes -- we pass a keyword argument, parse_dates, with the names (or numeric indexes) of the columns to handle that way

df = pd.read_csv('taxi.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                          'passenger_count', 'trip_distance', 'total_amount'],
                parse_dates=['tpep_pickup_datetime',
                             'tpep_dropoff_datetime'])
df.dtypes


tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object

# What about date formats?

Dates come in a *lot* of different formats. By default, Pandas can handle `YYYY-MM-DD HH:MM:SS` and a few other well-known, unambiguous formats. But there are also some ambiguous ones, such as `02-03-2024`: Is that February 3rd, or March 2nd? By default, because Pandas comes from the US, it assumes that the month comes first -- so this date would be February 3rd. If you want to change that, you can set `dayfirst=True` in `read_csv`.

If your dates come in a very weird format, you can pass the `date_format` keyword argument to `read_csv`, which will then determine how Pandas interprets the dates. The format string uses the same `%` codes as `strptime` and `strftime`. (Info at https://www.strfti.me/, a great site.)
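To see the difference without needing a CSV file, here is a small sketch that uses `pd.to_datetime` on individual strings. It accepts `dayfirst` too, and its `format` argument takes the same `%` codes. The sample strings are made up for illustration.

```python
import pandas as pd

ambiguous = '02-03-2024'

# Default: the month comes first, so this is February 3rd
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the day comes first, so this is March 2nd
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guessing: %d = day, %m = month, %Y = 4-digit year
explicit = pd.to_datetime('02/03/2024 14:05', format='%d/%m/%Y %H:%M')

print(month_first)   # 2024-02-03 00:00:00
print(day_first)     # 2024-03-02 00:00:00
print(explicit)      # 2024-03-02 14:05:00
```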

# Once we have dates... so what?

The point of having `datetime` data is that you can pick it apart, and use the components. If we want to know the year in which something happened, we can grab just that. We can also get the minutes, hours, or anything else.

That's all done via the `.dt` accessor, which works similarly to the `.str` accessor we saw yesterday. We take a `datetime` series (column), apply `.dt` and a specific component, and we get a new series back.

In [6]:
df['tpep_pickup_datetime'].dt.year

0       2015
1       2015
2       2015
3       2015
4       2015
        ... 
9994    2015
9995    2015
9996    2015
9997    2015
9998    2015
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [8]:
df['tpep_pickup_datetime'].dt.second

0       29
1       30
2       31
3       31
4       32
        ..
9994    59
9995    59
9996     0
9997     2
9998     4
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [10]:
# in addition to the "normal" things we can request, we can also get:

df['tpep_pickup_datetime'].dt.is_quarter_end

0       False
1       False
2       False
3       False
4       False
        ...  
9994    False
9995    False
9996    False
9997    False
9998    False
Name: tpep_pickup_datetime, Length: 9999, dtype: bool

In [11]:
df['tpep_pickup_datetime'].dt.is_leap_year

0       False
1       False
2       False
3       False
4       False
        ...  
9994    False
9995    False
9996    False
9997    False
9998    False
Name: tpep_pickup_datetime, Length: 9999, dtype: bool

In [12]:
df['tpep_pickup_datetime'].dt.days_in_month

0       30
1       30
2       30
3       30
4       30
        ..
9994    30
9995    30
9996    30
9997    30
9998    30
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [13]:
df['tpep_pickup_datetime'].dt.day_of_week

0       1
1       1
2       1
3       1
4       1
       ..
9994    0
9995    0
9996    0
9997    0
9998    0
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

In [14]:
df['tpep_pickup_datetime'].dt.day_name()

0       Tuesday
1       Tuesday
2       Tuesday
3       Tuesday
4       Tuesday
         ...   
9994     Monday
9995     Monday
9996     Monday
9997     Monday
9998     Monday
Name: tpep_pickup_datetime, Length: 9999, dtype: object

In [15]:
# let's find all of the taxi rides that took place when the hour is 11

df.loc[  df['tpep_pickup_datetime'].dt.hour == 11  ]

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.80
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.30
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.00
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,10.30
...,...,...,...,...,...
5055,2015-06-02 11:33:35,2015-06-02 11:42:16,1,0.73,7.80
5092,2015-06-02 11:33:35,2015-06-02 11:38:53,1,0.96,8.50
5093,2015-06-02 11:33:36,2015-06-02 11:52:32,2,3.50,18.80
5130,2015-06-02 11:33:35,2015-06-02 11:55:02,5,2.14,14.80


# Exercise: Taxi dates/times

1. Load the (large) NYC taxi data from January 2020, turning the two datetime columns into `datetime` values/dtypes.
2. What percentage of the values in this file are actually not from January 2020? (How many are from earlier, and how many are from later?)
3. What are the mean `trip_distance` and `total_amount` at each hour of the day?
4. What is the breakdown of rides on each day of the week? Which day is most popular? And which day is least?

In [16]:
filename = '/Users/reuven/Courses/Current/Data/nyc_taxi_2020-01.csv'

df = pd.read_csv(filename, low_memory=False,
                 parse_dates=['tpep_pickup_datetime',
                              'tpep_dropoff_datetime'])

In [17]:
df.dtypes

VendorID                        float64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                    float64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object

In [19]:
# What percentage of the values in this file are actually not from January 2020? (How many are from earlier, and how many are from later?)

df.loc[df['tpep_pickup_datetime'].dt.year < 2020]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
7,2.0,2019-12-18 15:27:49,2019-12-18 15:28:59,1.0,0.00,5.0,N,193,193,1.0,0.01,0.0,0.0,0.00,0.00,0.3,2.81,2.5
8,2.0,2019-12-18 15:30:35,2019-12-18 15:31:35,4.0,0.00,1.0,N,193,193,1.0,2.50,0.5,0.5,0.00,0.00,0.3,6.30,2.5
796,2.0,2019-12-31 23:48:07,2019-12-31 23:53:39,1.0,0.88,1.0,N,41,41,2.0,6.00,0.5,0.5,0.00,0.00,0.3,7.30,0.0
1276,2.0,2019-12-31 23:59:40,2020-01-01 00:09:06,2.0,2.19,1.0,N,231,158,1.0,9.50,0.5,0.5,2.66,0.00,0.3,15.96,2.5
1419,2.0,2019-12-31 23:56:19,2020-01-01 00:15:43,1.0,3.74,1.0,N,162,158,1.0,15.00,0.5,0.5,5.64,0.00,0.3,24.44,2.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4715309,2.0,2009-01-01 00:02:54,2009-01-01 00:02:58,1.0,0.01,1.0,N,264,264,2.0,2.50,0.0,0.5,0.00,0.00,0.3,3.30,0.0
5086498,2.0,2008-12-31 23:02:40,2009-01-01 05:46:33,1.0,8.48,1.0,N,43,138,1.0,24.00,0.5,0.5,6.78,6.12,0.3,40.70,2.5
5086894,2.0,2008-12-31 23:03:44,2009-01-01 05:32:14,1.0,1.10,1.0,N,262,140,1.0,6.00,0.5,0.5,1.00,0.00,0.3,10.80,2.5
5240836,2.0,2009-01-01 00:08:44,2009-01-01 02:50:15,1.0,4.56,1.0,N,170,41,1.0,15.00,0.5,0.5,3.76,0.00,0.3,22.56,2.5


In [22]:
(
    df
    .loc[lambda df_: (((df_['tpep_pickup_datetime'].dt.year == 2020) & (df_['tpep_pickup_datetime'].dt.month > 1)) |
                      (df_['tpep_pickup_datetime'].dt.year > 2020))]
).shape

(51, 18)

In [24]:
# 51 later rows + 161 earlier rows, as a fraction of all rows (multiply by 100 for a percentage)
(51 + 161) / df.shape[0]

3.309909995428577e-05

In [27]:
# we can actually compare datetime values with *strings*!

df.loc[df['tpep_pickup_datetime'] > '2020-01-31 23:59:59'].shape

(51, 18)

In [28]:
df.loc[df['tpep_pickup_datetime'] > '23:59:59 2020-01-31'].shape

(51, 18)

In [29]:
df.loc[df['tpep_pickup_datetime'] >= '2020-02-01'].shape

(51, 18)
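Instead of copying the counts by hand, we could also compute the share directly from boolean series. The sketch below uses a tiny made-up sample in place of the real pickup column, so the numbers are illustrative only. It relies on the fact that the mean of a boolean series is the fraction of `True` values.

```python
import pandas as pd

# Tiny made-up sample standing in for df['tpep_pickup_datetime']
pickups = pd.Series(pd.to_datetime([
    '2019-12-31 23:59:40',   # before January 2020
    '2020-01-01 00:09:06',
    '2020-01-15 12:00:00',
    '2020-01-31 23:59:59',
    '2020-02-01 00:00:00',   # after January 2020
]))

# Comparing against strings works here, too
too_early = pickups < '2020-01-01'
too_late = pickups >= '2020-02-01'

print(too_early.sum(), too_late.sum())   # 1 1

# Fraction of rows outside January 2020
outside_share = (too_early | too_late).mean()
print(outside_share)                     # 0.4
```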

In [30]:
# What are the mean trip_distance and total_amount at each hour of the day?

df.groupby(df['tpep_pickup_datetime'].dt.hour)[['trip_distance', 'total_amount']].mean()

tpep_pickup_datetime,trip_distance,total_amount
0,3.473789,19.622096
1,3.127067,18.421611
2,3.041224,17.849248
3,3.336142,18.594672
4,4.235996,21.663218
5,4.934676,24.351221
6,3.63431,19.618494
7,3.754755,17.861858
8,2.55479,17.63064
9,2.510061,17.662423


In [None]:
# What is the breakdown of rides on each day of the week? Which day is most popular? And which day is least?
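One possible approach to this last question is to combine `dt.day_name()` with `value_counts`. The sketch below runs on a handful of made-up pickup times rather than the real file, so the counts are not the actual answer. On the full data, the same expression would be applied to `df['tpep_pickup_datetime']`.

```python
import pandas as pd

# Made-up pickup times; the real exercise would use df['tpep_pickup_datetime']
pickups = pd.Series(pd.to_datetime([
    '2020-01-06 08:00:00',   # Monday
    '2020-01-07 09:30:00',   # Tuesday
    '2020-01-07 18:45:00',   # Tuesday
    '2020-01-10 22:10:00',   # Friday
    '2020-01-10 23:00:00',   # Friday
    '2020-01-10 23:30:00',   # Friday
]))

# Count rides per day name, sorted from most to least common
rides_per_day = pickups.dt.day_name().value_counts()
print(rides_per_day)

print(rides_per_day.idxmax())   # most popular day in this sample: Friday
print(rides_per_day.idxmin())   # least popular day in this sample: Monday
```

Passing `normalize=True` to `value_counts` would return fractions instead of counts.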