# Time data; split-apply-combine

#### Hi again! Today we shift gears towards Data Visualisation. Before we do that, however, there are still some concepts we should introduce. Let's talk about time, then :)

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("../data/bikesharing/data.csv", sep=",")
df.head()

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17414 entries, 0 to 17413
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   timestamp     17414 non-null  object 
 1   cnt           17414 non-null  int64  
 2   t1            17414 non-null  float64
 3   t2            17414 non-null  float64
 4   hum           17414 non-null  float64
 5   wind_speed    17414 non-null  float64
 6   weather_code  17414 non-null  float64
 7   is_holiday    17414 non-null  float64
 8   is_weekend    17414 non-null  float64
 9   season        17414 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 1.3+ MB


# Time data

#### A data type we have not touched upon yet is datetime data, which represents time. You can imagine how often it is needed, and Time Series Analysis is a science in itself. We will show you some concepts related to this data type and what they offer and, as usual, you will have to figure out your own way from there.

#### In our dataset, the column "timestamp" contains objects of type string. However, we can see that there is more to it than just a sequence of characters: the column encodes information about time.

In [4]:
ts=df.loc[0,"timestamp"]
print(ts , type(ts))

2015-01-04 00:00:00 <class 'str'>


https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64

#### That's where we introduce the Timestamp, a type referring to a point in time. In this session we will use the Timestamp type from pandas. It is a subclass of datetime.datetime, the point-in-time type from Python's standard library module called, simply, datetime.

In [5]:
ts_pd=pd.to_datetime(ts, format="%Y-%m-%d %H:%M:%S")
print(ts_pd, "\n", type(ts_pd))

2015-01-04 00:00:00 
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
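#### The relationship between the pandas Timestamp and the standard library's datetime can be checked directly. The snippet below is a small sketch; the names ts_demo and py_dt are introduced here purely for illustration.

```python
import datetime

import pandas as pd

# A standalone Timestamp, built from the same string as the first row above.
ts_demo = pd.Timestamp("2015-01-04 00:00:00")

# pandas Timestamp objects are also instances of datetime.datetime.
print(isinstance(ts_demo, datetime.datetime))  # True

# to_pydatetime() converts to a plain standard-library datetime object.
py_dt = ts_demo.to_pydatetime()
print(type(py_dt))

# Both objects describe the same point in time.
print(ts_demo == py_dt)  # True
```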


#### Now we convert the whole column "timestamp" to Timestamp objects

In [6]:
df["timestamp"]=pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
type(df.loc[0,"timestamp"])

pandas._libs.tslibs.timestamps.Timestamp
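#### Real-world columns are not always as clean as ours. As a hedged sketch, assuming a hypothetical Series called raw with one malformed entry, the errors="coerce" option of pd.to_datetime turns unparseable values into NaT (the datetime counterpart of NaN) instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately malformed.
raw = pd.Series(["2015-01-04 00:00:00", "2015-01-04 01:00:00", "not a date"])

# errors="coerce" replaces entries that do not match the format with NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

#### Counting the NaT values afterwards is a quick way to see how many rows failed to parse.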

#### Now that we are dealing with Timestamp objects and understand what they represent, we can explore them a little and get to know their attributes and methods. The examples below are just a few; please take some time to dig deeper into the different possibilities, as it will make your life so much easier.

In [7]:
ts = df["timestamp"][0]
print(ts.date())
print(ts.day)
print(ts.year)
print(ts.weekday())
print(ts.day_name())
print(ts.hour)

2015-01-04
4
2015
6
Sunday
0


#### The next important concept when talking about time data is the Timedelta, a difference between two Timestamps. It answers the question "How long?". Subtracting one Timestamp from another with the overloaded "-" operator yields an object of type Timedelta.

In [8]:
ts1, ts2 = df["timestamp"].agg(["min", "max"])
delta_ts2_ts1 = ts2 - ts1
print(delta_ts2_ts1, "\n", type(delta_ts2_ts1))

730 days 23:00:00 
 <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
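#### Timedeltas also work in the other direction: adding one to a Timestamp gives a new Timestamp. A minimal sketch with illustrative names (start, step, later) that are not part of the dataset:

```python
import pandas as pd

start = pd.Timestamp("2015-01-04 00:00:00")

# A Timedelta can be built from keyword arguments such as hours and minutes.
step = pd.Timedelta(hours=1, minutes=30)

# Timestamp + Timedelta -> Timestamp
later = start + step
print(later)  # 2015-01-04 01:30:00

# Dividing two Timedeltas gives a plain number, here the length in minutes.
minutes = (later - start) / pd.Timedelta(minutes=1)
print(minutes)  # 90.0
```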


In [9]:
ts1, ts2 = df["timestamp"].values.min(), df["timestamp"].values.max()
delta_ts2_ts1 = ts2 - ts1
print(delta_ts2_ts1, "\n", type(delta_ts2_ts1))

63154800000000000 nanoseconds 
 <class 'numpy.timedelta64'>
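#### Going through .values drops us down to NumPy, which is why the result above is a numpy.timedelta64 counted in nanoseconds. As a small sketch, the value can be wrapped back into a pandas Timedelta to recover the readable form (delta_np and delta_pd are names chosen for this example).

```python
import numpy as np
import pandas as pd

# The nanosecond count printed above, as a NumPy timedelta.
delta_np = np.timedelta64(63154800000000000, "ns")

# pd.Timedelta accepts a numpy.timedelta64 directly.
delta_pd = pd.Timedelta(delta_np)

print(delta_pd)                  # 730 days 23:00:00
print(delta_pd.days)             # 730
print(delta_pd.total_seconds())  # 63154800.0
```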


#### Information about time often equips us with really valuable knowledge about our data. You then no longer have a set of independent rows; you can say how these rows RELATE in time. However, this feature can also be a curse and must be considered really carefully when working with the many Machine Learning or pre-processing algorithms that require, e.g., independence between datasets.
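#### One common way to respect the time order, shown here only as a sketch on hypothetical data, is to split chronologically instead of randomly, so that every "test" row lies after every "train" row. The frame demo, the 80% cutoff and the names train and test are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical hourly data standing in for the bike-sharing table.
demo = pd.DataFrame({
    "timestamp": pd.Timestamp("2015-01-04") + pd.to_timedelta(list(range(10)), unit="h"),
    "cnt": list(range(10)),
})

# Make sure rows are in chronological order before splitting.
demo = demo.sort_values("timestamp")

# Use the timestamp at the 80% position as the cutoff (an arbitrary choice).
cutoff = demo["timestamp"].iloc[int(len(demo) * 0.8)]

train = demo[demo["timestamp"] < cutoff]
test = demo[demo["timestamp"] >= cutoff]

print(len(train), len(test))
print(train["timestamp"].max() < test["timestamp"].min())
```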

## Split-apply-combine

In [10]:
#split
df.groupby("season")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa23e6eb820>

In [11]:
#apply and combine
df.groupby("season")[["t1", "cnt"]].agg("mean")

Unnamed: 0_level_0,t1,cnt
season,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,10.666705,1103.831589
1.0,18.43116,1464.465238
2.0,13.039236,1178.954218
3.0,7.686952,821.729099
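#### To see what groupby does for us, the three steps can be written out by hand. This is a sketch on a tiny hypothetical frame (demo), not on the bike-sharing data; manual and shortcut are names chosen for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    "season": [0.0, 0.0, 1.0, 1.0, 1.0],
    "cnt": [100, 200, 300, 400, 500],
})

# Split: iterating over a groupby yields (key, sub-frame) pairs.
# Apply: compute the mean of "cnt" on each sub-frame.
pieces = {season: group["cnt"].mean() for season, group in demo.groupby("season")}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(pieces, name="cnt")
manual.index.name = "season"

# The one-liner performs all three steps at once.
shortcut = demo.groupby("season")["cnt"].mean()

print(manual)
print(shortcut)
```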


#### Let's begin by creating columns containing information about weekdays as well as hours. Note that these columns are redundant: they repeat information already included in another column. Usually we would avoid this, especially when working with large datasets. Here we allow ourselves to do it for the sake of simplicity, and because our dataset has only about 17k rows.

In [12]:
df["weekday"] = df.timestamp.dt.day_name()
df["hour"] = df.timestamp.dt.hour
df.head()

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,weekday,hour
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,Sunday,0
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0,Sunday,1
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0,Sunday,2
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0,Sunday,3
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0,Sunday,4


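#### We will explore how the usage of the bike sharing system differs across weekdays. Please note the difference between setting the argument as_index to True (the default) or False: with True the group keys become the index of the result, with False they stay as a regular column.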

In [13]:
df.groupby("weekday")[["cnt"]].mean()

Unnamed: 0_level_0,cnt
weekday,Unnamed: 1_level_1
Friday,1182.772653
Monday,1130.270734
Saturday,995.553753
Sunday,959.567265
Thursday,1258.810594
Tuesday,1230.105389
Wednesday,1244.409


In [14]:
df.groupby("weekday", as_index=False)[["cnt"]].mean()

Unnamed: 0,weekday,cnt
0,Friday,1182.772653
1,Monday,1130.270734
2,Saturday,995.553753
3,Sunday,959.567265
4,Thursday,1258.810594
5,Tuesday,1230.105389
6,Wednesday,1244.409
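#### The effect of as_index is easiest to see by looking at the columns of each result. A small sketch on hypothetical data (demo, indexed and flat are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Sunday"],
    "cnt": [10, 30, 5],
})

# Default (as_index=True): the group key becomes the index.
indexed = demo.groupby("weekday")[["cnt"]].mean()

# as_index=False: the group key stays a regular column.
flat = demo.groupby("weekday", as_index=False)[["cnt"]].mean()

print(list(indexed.columns))  # ['cnt']
print(list(flat.columns))     # ['weekday', 'cnt']
```

#### The flat form is often more convenient when the result is passed on to plotting functions that expect column names.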


#### Now we do the same with the hour column. While the result for the weekdays was still quite easy to read, it becomes harder to digest with 24 rows (only the first five are shown below).

In [15]:
df.groupby("hour", as_index=False)[["cnt"]].mean().head(5)

Unnamed: 0,hour,cnt
0,0,290.609116
1,1,200.631215
2,2,136.303745
3,3,94.245492
4,4,73.313454


#### We might want to create some categories for the different hours then!

In [16]:
def time_of_day(x):
    """Map an hour of the day (0-23) to a coarse time-of-day label."""
    if 4 < x <= 8:
        return 'Early Morning'
    elif 8 < x <= 12:
        return 'Morning'
    elif 12 < x <= 16:
        return 'Noon'
    elif 16 < x <= 20:
        return 'Eve'
    elif 20 < x <= 24:
        return 'Night'
    elif x <= 4:
        # Hours 0 to 4 (inclusive) fall through to this last branch.
        return 'Late Night'

In [17]:
df["timeOfDayAlt"]=df.hour.apply(time_of_day)
df.head(5)

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,weekday,hour,timeOfDayAlt
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,Sunday,0,Late Night
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0,Sunday,1,Late Night
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0,Sunday,2,Late Night
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0,Sunday,3,Late Night
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0,Sunday,4,Late Night


In [18]:
b = [0,4,8,12,16,20,24]
l = ['Late Night', 'Early Morning','Morning','Noon','Eve','Night']

In [19]:
df["timeOfDay"] = pd.cut(df["hour"], bins=b, labels=l, include_lowest=True)
df.head(5)

Unnamed: 0,timestamp,cnt,t1,t2,hum,wind_speed,weather_code,is_holiday,is_weekend,season,weekday,hour,timeOfDayAlt,timeOfDay
0,2015-01-04 00:00:00,182,3.0,2.0,93.0,6.0,3.0,0.0,1.0,3.0,Sunday,0,Late Night,Late Night
1,2015-01-04 01:00:00,138,3.0,2.5,93.0,5.0,1.0,0.0,1.0,3.0,Sunday,1,Late Night,Late Night
2,2015-01-04 02:00:00,134,2.5,2.5,96.5,0.0,1.0,0.0,1.0,3.0,Sunday,2,Late Night,Late Night
3,2015-01-04 03:00:00,72,2.0,2.0,100.0,0.0,1.0,0.0,1.0,3.0,Sunday,3,Late Night,Late Night
4,2015-01-04 04:00:00,47,2.0,0.0,93.0,6.5,1.0,0.0,1.0,3.0,Sunday,4,Late Night,Late Night
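#### pd.cut builds intervals that are closed on the right by default, e.g. (4, 8], which is why hour 4 lands in 'Late Night' and hour 5 in 'Early Morning'. The sketch below, using a handful of hand-picked hours, also shows why include_lowest=True is needed: without it, hour 0 falls outside the first interval (0, 4].

```python
import pandas as pd

hours = pd.Series([0, 4, 5, 8, 9, 23])
bins = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late Night", "Early Morning", "Morning", "Noon", "Eve", "Night"]

# Without labels, pd.cut shows the interval each value falls into.
print(pd.cut(hours, bins=bins, include_lowest=True))

# With labels and include_lowest=True, every hour gets a category.
labelled = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)
print(labelled.tolist())

# Without include_lowest=True, hour 0 is not covered and becomes NaN.
missing = pd.cut(hours, bins=bins, labels=labels).isna()
print(missing.tolist())  # [True, False, False, False, False, False]
```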


In [20]:
print((df.timeOfDayAlt==df.timeOfDay).all())

df.drop(columns=["timeOfDayAlt"], inplace=True)

True


#### And then we can use the strategy of split-apply-combine on the new column :)

In [21]:
df.groupby("timeOfDay")[["cnt"]].agg("mean")

Unnamed: 0_level_0,cnt
timeOfDay,Unnamed: 1_level_1
Late Night,159.164497
Early Morning,1233.021747
Morning,1325.90784
Noon,1603.211321
Eve,2042.952234
Night,591.38196
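#### The "apply" step is not limited to a single function. As a hedged sketch on hypothetical data, agg also accepts a list of function names and returns one column per function (demo and summary are names made up for this example).

```python
import pandas as pd

demo = pd.DataFrame({
    "timeOfDay": ["Noon", "Noon", "Eve", "Eve"],
    "cnt": [10, 30, 100, 200],
})

# One row per group, one column per aggregation function.
summary = demo.groupby("timeOfDay")["cnt"].agg(["mean", "max", "count"])
print(summary)
```

#### Note that timeOfDay holds plain strings here, so the groups are sorted alphabetically rather than in the categorical order used above.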


In [22]:
df.groupby(["weekday","timeOfDay"])[["cnt"]].agg("mean")

Unnamed: 0_level_0,Unnamed: 1_level_0,cnt
weekday,timeOfDay,Unnamed: 2_level_1
Friday,Late Night,142.360236
Friday,Early Morning,1493.392593
Friday,Morning,1368.779412
Friday,Noon,1521.37561
Friday,Eve,2044.723301
Friday,Night,638.429967
Monday,Late Night,101.667308
Monday,Early Morning,1483.971292
Monday,Morning,1199.904762
Monday,Noon,1344.023981
