# Automatically generate prediction problems for the Chicago bike sharing dataset with Trane

In this tutorial, we show how to use Trane to generate prediction problems for a bike sharing dataset [from Kaggle](https://www.kaggle.com/datasets/yingwurenjian/chicago-divvy-bicycle-sharing-data).

## Load Data
First, let's load our data and examine a few rows.

In [37]:
import pandas as pd

data = pd.read_csv("data_raw.csv")
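
The timestamp columns matter later when the data is split into time windows, so it can help to parse them while loading. The sketch below is a suggestion rather than part of the original notebook. It reads a tiny in-memory CSV (two rows copied from the output shown further down, with only a few of the columns) so that it runs without `data_raw.csv`; the variable names are illustrative.

```python
import io

import pandas as pd

# Two rows taken from the notebook's output, standing in for data_raw.csv.
sample_csv = io.StringIO(
    "trip_id,usertype,starttime,stoptime,tripduration\n"
    "16734070,Subscriber,2017-10-01 00:01:00,2017-10-01 00:15:00,837\n"
    "16734069,Customer,2017-10-01 00:00:00,2017-10-01 00:07:00,366\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["starttime", "stoptime"])

# Elapsed time between start and stop, in seconds, for comparison with
# the recorded tripduration column.
elapsed_seconds = (sample["stoptime"] - sample["starttime"]).dt.total_seconds()

print(sample.dtypes)
print(elapsed_seconds.tolist())
```

The same `parse_dates` argument could be passed when reading the full file, at the cost of a slower load.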

In [38]:
counts = data["usertype"].value_counts()
counts

Subscriber    10017631
Customer       3756894
Dependent          190
Name: usertype, dtype: int64

In [39]:
# Drop the 190 "Dependent" rows so that only Subscriber and Customer remain.
remove_rows = data[data["usertype"] == "Dependent"]
data = data.drop(index=remove_rows.index)

In [40]:
counts = data["usertype"].value_counts()
counts

Subscriber    10017631
Customer       3756894
Name: usertype, dtype: int64
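
The filter above removes one category by name. A more general variant, sketched here as an assumption and not as the notebook's method, drops every category with fewer rows than a chosen threshold. The toy frame and the `min_rows` value are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Toy frame using the dataset's column name; the rows are illustrative.
toy = pd.DataFrame(
    {"usertype": ["Subscriber"] * 5 + ["Customer"] * 3 + ["Dependent"]}
)

min_rows = 2  # hypothetical threshold, chosen for this toy example only

# Count rows per category, then keep only categories that meet the threshold.
toy_counts = toy["usertype"].value_counts()
keep = toy_counts[toy_counts >= min_rows].index
filtered = toy[toy["usertype"].isin(keep)]

print(filtered["usertype"].value_counts())
```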

In [43]:
data.tail(5)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_id,from_station_name,latitude_start,longitude_start,...,windchill,dewpoint,humidity,pressure,visibility,wind_speed,precipitation,events,rain,conditions
13774710,16734070,Subscriber,Male,2017-10-01 00:01:00,2017-10-01 00:15:00,837,289,Wells St & Concord Ln,41.912133,-87.634656,...,-999.0,41.0,64.0,30.31,10.0,8.1,-9999.0,partlycloudy,0,Partly Cloudy
13774711,16734069,Customer,,2017-10-01 00:00:00,2017-10-01 00:07:00,366,45,Michigan Ave & Congress Pkwy,41.876243,-87.624426,...,-999.0,41.0,64.0,30.31,10.0,8.1,-9999.0,partlycloudy,0,Partly Cloudy
13774712,16734068,Customer,,2017-10-01 00:00:00,2017-10-01 00:05:00,264,520,Greenview Ave & Jarvis Ave,42.015962,-87.66857,...,-999.0,41.0,64.0,30.31,10.0,8.1,-9999.0,partlycloudy,0,Partly Cloudy
13774713,16734067,Subscriber,Female,2017-10-01 00:00:00,2017-10-01 00:06:00,361,288,Larrabee St & Armitage Ave,41.918084,-87.643749,...,-999.0,41.0,64.0,30.31,10.0,8.1,-9999.0,partlycloudy,0,Partly Cloudy
13774714,16734066,Subscriber,Female,2017-10-01 00:00:00,2017-10-01 00:12:00,741,135,Halsted St & 21st St,41.85378,-87.64665,...,-999.0,41.0,64.0,30.31,10.0,8.1,-9999.0,partlycloudy,0,Partly Cloudy


In [42]:
print(f"Number of Rows: {data.shape[0]}")

Number of Rows: 13774525


In [44]:
data.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_id', 'from_station_name',
       'latitude_start', 'longitude_start', 'dpcapacity_start',
       'to_station_id', 'to_station_name', 'latitude_end', 'longitude_end',
       'dpcapacity_end', 'temperature', 'windchill', 'dewpoint', 'humidity',
       'pressure', 'visibility', 'wind_speed', 'precipitation', 'events',
       'rain', 'conditions'],
      dtype='object')

As we can see, this is a dataset from a bike sharing program. We have information on where riders start and end their trips, when they ride, how long their trips are, and what the weather was like at the time.

We are required to determine the following parameters to generate the Cutoff Strategy:

**entity_col**: the column name to use for grouping the data.
- For this walkthrough, we are interested in prediction problems about the `usertype`, which can be either Customer or Subscriber.

**window_size**: the amount of data to use per label
- We will set this to `2d` (two days).

**minimum_size**: the time at which the labeling should begin
 - We want to use all available information for labeling: set the `minimum_size` to the timestamp of the oldest data point.

**maximum_size**: the time at which the labeling will end
 - We want to create labels for all data points: set the `maximum_size` to be the timestamp of the most recent data point. 
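
To make the four parameters concrete, here is a plain pandas sketch of what they describe. It is not Trane's implementation or API. It assumes `starttime` is the time column, uses made-up timestamps, and the names `first_time`, `last_time`, and `window_start` are illustrative.

```python
import pandas as pd

# Toy trips: column names match the dataset, timestamps are illustrative.
trips = pd.DataFrame(
    {
        "usertype": ["Subscriber", "Customer", "Subscriber", "Customer", "Subscriber"],
        "starttime": pd.to_datetime(
            [
                "2017-10-01 00:00:00",
                "2017-10-01 08:30:00",
                "2017-10-02 12:00:00",
                "2017-10-03 09:15:00",
                "2017-10-05 18:45:00",
            ]
        ),
    }
)

entity_col = "usertype"
window = pd.Timedelta(days=2)  # corresponds to a window size of `2d`

# The oldest and most recent timestamps bound the labeling period.
first_time = trips["starttime"].min()
last_time = trips["starttime"].max()

# Start of each two-day window between the two bounds.
window_starts = pd.date_range(start=first_time, end=last_time, freq=window)

# Assign every trip to the window it falls in, counted from first_time.
window_index = (trips["starttime"] - first_time) // window
trips["window_start"] = first_time + window_index * window

# Number of trips per entity and window.
trips_per_window = trips.groupby([entity_col, "window_start"]).size()

print(window_starts)
print(trips_per_window)
```

Each (entity, window) pair in `trips_per_window` corresponds to one slice of data from which a label could be computed.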

In [4]:
import trane

metadata = trane.datasets.load_bike_metadata()
metadata

<trane.utils.table_meta.TableMeta at 0x28fd0a6d0>