<a id='start'></a>
# Data preparation AAA - Intentionally Blank

Before running the notebook, please download the data folder from this [sciebo link](https://uni-koeln.sciebo.de/s/QKgDQhiJOfGieZv) and place it in the project root directory.

**Dependencies:**
- Pandas
- Pyarrow (*conda install pyarrow*)
  - Needed for reading the taxi dataset from the parquet file

In [17]:
# Import all necessary libraries
import os

import pandas as pd

# Create the output directories (missing parent directories are created automatically)
os.makedirs('./data/prepared/csv', exist_ok=True)

The weather data is taken from the Open-Meteo Historical Weather API; see the [API documentation](https://open-meteo.com/en/docs/historical-weather-api) for further details. Because the request URL already restricts the timeframe to 2016, there is no need to filter by time in the cleaning process.
<br>The taxi data was preprocessed in the preprocess notebook; the original dataset was obtained from the Chicago data portal. For more details about the preprocessing and the data collection, see the [preprocess notebook](./preprocess.ipynb).

In [16]:
# Read in data from the open-meteo API and the preprocessed taxi data
weather_data_hourly = pd.read_csv('https://archive-api.open-meteo.com/v1/archive?latitude=41.85&longitude=-87.65&start_date=2016-01-01&end_date=2016-12-31&hourly=temperature_2m,relativehumidity_2m,apparent_temperature,precipitation,cloudcover,windspeed_10m&format=csv&timezone=CST', header=2)
weather_data_daily = pd.read_csv('https://archive-api.open-meteo.com/v1/archive?latitude=41.85&longitude=-87.65&start_date=2016-01-01&end_date=2016-12-31&daily=temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,precipitation_hours,windspeed_10m_max&format=csv&timezone=CST', header=2)
taxi_df = pd.read_parquet("data/taxi_data_preprocessed.gzip")
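
The two request URLs above differ only in their query parameters. As a minimal sketch (not part of the original notebook), the same URL can be assembled from a dictionary, which makes the location, timeframe and variables easier to adjust. The helper name `build_archive_url` is hypothetical; the endpoint and all parameter values are copied from the URLs above.

```python
from urllib.parse import urlencode

# Endpoint copied from the request URLs used in this notebook
BASE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(resolution, variables):
    """Build a request URL; resolution is 'hourly' or 'daily' (hypothetical helper)."""
    params = {
        "latitude": 41.85,
        "longitude": -87.65,
        "start_date": "2016-01-01",
        "end_date": "2016-12-31",
        resolution: ",".join(variables),
        "format": "csv",
        "timezone": "CST",
    }
    # Keep commas unescaped so the result matches the hand-written URL
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


hourly_url = build_archive_url(
    "hourly",
    [
        "temperature_2m",
        "relativehumidity_2m",
        "apparent_temperature",
        "precipitation",
        "cloudcover",
        "windspeed_10m",
    ],
)
print(hourly_url)
```

The resulting string can be passed to pd.read_csv with header=2 in the same way as the hard-coded URL.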

For a detailed description of the data, please refer to the [data reference section here](#references).

## Preparing the weather and taxi datasets

### Preparing the taxi dataset

Most of the cleaning, such as dropping columns and removing rows with null or duplicate values, was already done in the preprocess notebook, so we skip these steps here. For details about these steps, see the [preprocess notebook](./preprocess.ipynb).

In [6]:
# Getting an overview of the data
taxi_df.describe()

Unnamed: 0,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude
count,20356210.0,20356210.0,20356210.0,20356210.0,20356210.0,20356210.0,20356210.0,20356210.0,20356210.0
mean,751.8359,3.392523,17031370000.0,17031360000.0,14.50391,41.89503,-87.65069,41.89572,-87.64809
std,1055.934,18.3492,343222.4,336176.1,33.88548,0.03092414,0.0673502,0.03020812,0.05917774
min,0.0,0.0,17031010000.0,17031010000.0,0.0,41.66049,-87.90304,41.66049,-87.90304
25%,344.0,0.6,17031080000.0,17031080000.0,7.0,41.88099,-87.64265,41.88099,-87.64281
50%,540.0,1.2,17031280000.0,17031280000.0,9.25,41.89197,-87.63186,41.89204,-87.63275
75%,840.0,2.4,17031840000.0,17031830000.0,13.25,41.89916,-87.62217,41.89967,-87.62217
max,86399.0,3353.1,17031980000.0,17031980000.0,9997.16,42.02122,-87.53139,42.02122,-87.53139


In [7]:
# Convert the timestamp columns to datetime type from pandas
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'])
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'])

# Sort the taxi data by the start timestamp
taxi_df = taxi_df.sort_values(['trip_start_timestamp'])
# Reset the index
taxi_df = taxi_df.reset_index(drop=True)
taxi_df

Unnamed: 0,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,trip_total,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_centroid_latitude,dropoff_centroid_longitude
0,2016-01-01 00:00:00,2016-01-01 00:15:00,900,2.20,1.703128e+10,1.703108e+10,11.65,41.879255,-87.642649,41.891972,-87.612945
1,2016-01-01 00:00:00,2016-01-01 00:15:00,780,1.98,1.703132e+10,1.703128e+10,11.35,41.884987,-87.620993,41.879255,-87.642649
2,2016-01-01 00:00:00,2016-01-01 00:00:00,300,2.30,1.703184e+10,1.703184e+10,10.45,41.880994,-87.632746,41.898306,-87.653614
3,2016-01-01 00:00:00,2016-01-01 00:45:00,3300,4.60,1.703103e+10,1.703103e+10,26.45,41.994381,-87.672538,41.994381,-87.672538
4,2016-01-01 00:00:00,2016-01-01 00:00:00,0,0.00,1.703108e+10,1.703132e+10,3.25,41.895033,-87.619711,41.877406,-87.621972
...,...,...,...,...,...,...,...,...,...,...,...
20356204,2016-12-31 23:45:00,2016-12-31 23:45:00,840,1.20,1.703108e+10,1.703108e+10,10.75,41.892073,-87.628874,41.899156,-87.626211
20356205,2016-12-31 23:45:00,2017-01-01 00:00:00,1080,0.20,1.703132e+10,1.703133e+10,13.50,41.884987,-87.620993,41.859350,-87.617358
20356206,2016-12-31 23:45:00,2017-01-01 00:30:00,3000,0.20,1.703184e+10,1.703108e+10,23.50,41.880994,-87.632746,41.890922,-87.618868
20356207,2016-12-31 23:45:00,2016-12-31 23:45:00,120,0.40,1.703124e+10,1.703124e+10,5.00,41.906026,-87.675312,41.906026,-87.675312


Note that the trip_seconds column can differ from the time difference between the start and end timestamps. This is NOT an error: the timestamps are rounded to the nearest 15 minutes, whereas trip_seconds is not. The discrepancy should still be kept in mind when analysing the data.
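
To make the discrepancy concrete, the following sketch compares trip_seconds with the duration implied by the rounded timestamps. It uses three rows copied from the output above; the column names `timestamp_delta_seconds` and `difference_seconds` are introduced here for illustration only.

```python
import pandas as pd

# Three sample rows copied from the taxi data output above
sample = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime([
        "2016-01-01 00:00:00",
        "2016-01-01 00:00:00",
        "2016-01-01 00:00:00",
    ]),
    "trip_end_timestamp": pd.to_datetime([
        "2016-01-01 00:15:00",
        "2016-01-01 00:00:00",
        "2016-01-01 00:45:00",
    ]),
    "trip_seconds": [900, 300, 3300],
})

# Duration implied by the timestamps, which are rounded to 15 minutes
sample["timestamp_delta_seconds"] = (
    sample["trip_end_timestamp"] - sample["trip_start_timestamp"]
).dt.total_seconds()

# Gap between the recorded duration and the timestamp-based duration
sample["difference_seconds"] = (
    sample["trip_seconds"] - sample["timestamp_delta_seconds"]
)

print(sample[["trip_seconds", "timestamp_delta_seconds", "difference_seconds"]])
```

In this sample, only the first row agrees exactly; the other two differ by 300 and 600 seconds, which is within what rounding both timestamps to 15 minutes can produce.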

### Preparing weather data

In [8]:
# Getting an overview of the hourly weather data
weather_data_hourly.describe()

Unnamed: 0,temperature_2m (°C),relativehumidity_2m (%),apparent_temperature (°C),precipitation (mm),cloudcover (%),windspeed_10m (km/h)
count,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0
mean,11.084324,75.335155,9.053848,0.110474,48.00444,14.383106
std,10.688952,12.752638,13.630042,0.492808,36.689525,7.138672
min,-19.7,32.0,-25.7,0.0,0.0,0.0
25%,2.3,66.0,-2.1,0.0,14.0,8.9
50%,11.2,76.0,8.4,0.0,40.0,13.5
75%,20.7,85.0,21.2,0.0,88.0,18.9
max,31.7,100.0,37.5,11.0,100.0,49.2


In [9]:
# Convert the time columns to datetime type from pandas
weather_data_hourly['time'] = pd.to_datetime(weather_data_hourly['time'])
# Sort the dataframe by its time column
weather_data_hourly.sort_values(['time'], inplace=True)
# Reset the index
weather_data_hourly.reset_index(drop=True, inplace=True)
weather_data_hourly

Unnamed: 0,time,temperature_2m (°C),relativehumidity_2m (%),apparent_temperature (°C),precipitation (mm),cloudcover (%),windspeed_10m (km/h)
0,2016-01-01 00:00:00,-4.0,71,-10.4,0.0,14,22.5
1,2016-01-01 01:00:00,-3.9,72,-10.5,0.0,20,23.9
2,2016-01-01 02:00:00,-3.6,71,-10.3,0.0,10,24.7
3,2016-01-01 03:00:00,-3.9,71,-10.6,0.0,14,24.3
4,2016-01-01 04:00:00,-4.5,73,-11.2,0.0,20,24.8
...,...,...,...,...,...,...,...
8779,2016-12-31 19:00:00,0.6,61,-4.8,0.0,51,16.9
8780,2016-12-31 20:00:00,0.0,65,-4.8,0.0,2,13.0
8781,2016-12-31 21:00:00,-0.6,69,-5.1,0.0,0,11.2
8782,2016-12-31 22:00:00,-1.2,71,-5.8,0.0,0,11.7


In [10]:
# Getting an overview of the daily weather data
weather_data_daily.describe()

Unnamed: 0,temperature_2m_max (°C),temperature_2m_min (°C),temperature_2m_mean (°C),apparent_temperature_max (°C),apparent_temperature_min (°C),apparent_temperature_mean (°C),precipitation_sum (mm),precipitation_hours (h),windspeed_10m_max (km/h)
count,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0
mean,14.098087,8.280601,11.085792,12.869126,5.767486,9.053005,2.651366,3.300546,21.065574
std,10.605513,10.474961,10.487603,13.689586,13.280628,13.412974,5.760821,5.106252,7.194592
min,-12.3,-19.7,-15.3,-18.6,-25.7,-22.1,0.0,0.0,7.4
25%,5.225,0.0,2.45,1.1,-5.275,-2.375,0.0,0.0,15.525
50%,14.9,8.2,11.7,12.6,4.7,8.7,0.0,0.0,20.35
75%,23.675,17.85,20.7,25.4,17.8,21.2,2.6,5.0,25.75
max,31.7,24.6,27.3,37.5,29.5,32.3,36.3,24.0,49.2


In [11]:
# Convert the time columns to datetime type from pandas
weather_data_daily['time'] = pd.to_datetime(weather_data_daily['time'])
weather_data_daily['sunrise (iso8601)'] = pd.to_datetime(weather_data_daily['sunrise (iso8601)'])
weather_data_daily['sunset (iso8601)'] = pd.to_datetime(weather_data_daily['sunset (iso8601)'])
# Sort the dataframe by its time column
weather_data_daily.sort_values(['time'], inplace=True)
# Reset the index
weather_data_daily.reset_index(drop=True, inplace=True)
weather_data_daily

Unnamed: 0,time,temperature_2m_max (°C),temperature_2m_min (°C),temperature_2m_mean (°C),apparent_temperature_max (°C),apparent_temperature_min (°C),apparent_temperature_mean (°C),sunrise (iso8601),sunset (iso8601),precipitation_sum (mm),precipitation_hours (h),windspeed_10m_max (km/h)
0,2016-01-01,-0.5,-6.2,-3.2,-6.7,-13.1,-9.7,2016-01-01 08:16:00,2016-01-01 17:31:00,0.0,0.0,25.6
1,2016-01-02,1.4,-4.1,-1.6,-4.7,-9.3,-7.4,2016-01-02 08:16:00,2016-01-02 17:32:00,0.0,0.0,27.6
2,2016-01-03,-0.3,-3.0,-1.7,-6.2,-9.1,-7.4,2016-01-03 08:16:00,2016-01-03 17:33:00,0.0,0.0,22.1
3,2016-01-04,-0.5,-3.1,-2.0,-6.4,-9.0,-7.5,2016-01-04 08:16:00,2016-01-04 17:34:00,0.0,0.0,23.3
4,2016-01-05,1.3,-4.8,-2.0,-4.0,-10.4,-7.4,2016-01-05 08:16:00,2016-01-05 17:35:00,0.0,0.0,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...
361,2016-12-27,-0.2,-3.4,-2.0,-5.8,-9.6,-8.0,2016-12-27 08:15:00,2016-12-27 17:28:00,0.0,0.0,28.1
362,2016-12-28,5.5,-3.4,1.1,-0.7,-7.3,-4.2,2016-12-28 08:15:00,2016-12-28 17:29:00,0.0,0.0,29.7
363,2016-12-29,1.9,-0.4,0.7,-3.8,-7.3,-5.9,2016-12-29 08:16:00,2016-12-29 17:30:00,0.0,0.0,35.5
364,2016-12-30,0.9,-2.7,-0.9,-4.3,-8.9,-7.0,2016-12-30 08:16:00,2016-12-30 17:30:00,0.2,1.0,28.5


In [12]:
# Check for duplicates in the weather data
print("Number of duplicates in weather_data_hourly: ", weather_data_hourly.duplicated().sum())
print("Number of duplicates in weather_data_daily: ", weather_data_daily.duplicated().sum())

Number of duplicates in weather_data_hourly:  0
Number of duplicates in weather_data_daily:  0


In [13]:
# Check for rows with missing values in the weather data
print("Number of rows with missing values in weather_data_hourly: ", weather_data_hourly.isnull().any(axis=1).sum())
print("Number of rows with missing values in weather_data_daily: ", weather_data_daily.isnull().any(axis=1).sum())

Number of rows with missing values in weather_data_hourly:  0
Number of rows with missing values in weather_data_daily:  0
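
The duplicate and missing-value checks only cover rows that exist. A complementary check, which is not part of the original notebook, is whether any hour is absent altogether. The describe output above reports 8784 hourly rows, which equals 366 × 24 for the leap year 2016, so no gap is expected here. The sketch below shows how such a check could look on toy data; the helper name `missing_hours` is hypothetical.

```python
import pandas as pd


def missing_hours(times, start, end):
    """Return the hourly timestamps between start and end that are absent from times."""
    expected = pd.date_range(start=start, end=end, freq="h")
    return expected.difference(pd.DatetimeIndex(times))


# Toy example with 02:00 left out on purpose
toy_times = pd.to_datetime([
    "2016-01-01 00:00",
    "2016-01-01 01:00",
    "2016-01-01 03:00",
])
print(missing_hours(toy_times, "2016-01-01 00:00", "2016-01-01 03:00"))

# Number of hourly timestamps expected for the full year 2016
print(len(pd.date_range("2016-01-01 00:00", "2016-12-31 23:00", freq="h")))
```

Applied to the notebook's data, the first argument would be the time column of weather_data_hourly.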


### Saving to file

In [14]:
# Optional: To also save the prepared taxi data as a csv file, uncomment the following line
# taxi_df.to_csv('data/prepared/csv/taxi_data_prepared.csv', index=False)
# Save the prepared weather data as csv files
weather_data_hourly.to_csv('data/prepared/csv/weather_data_hourly_prepared.csv', index=False)
weather_data_daily.to_csv('data/prepared/csv/weather_data_daily_prepared.csv', index=False)



# Saving the prepared data as a parquet file with gzip compression
taxi_df.to_parquet('data/prepared/taxi_data_prepared.gzip', compression='gzip')
weather_data_hourly.to_parquet('data/prepared/weather_data_hourly_prepared.gzip', compression='gzip')
weather_data_daily.to_parquet('data/prepared/weather_data_daily_prepared.gzip', compression='gzip')

To open the parquet files, use the pd.read_parquet function. The documentation can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html).

## References

The tables below describe all dataframes for reference. Most data descriptions were taken from the [Chicago data portal](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew#column-menu:~:text=Columns%20in%20this%20Dataset) and the [Open-Meteo Historical Weather API](https://open-meteo.com/en/docs/historical-weather-api#:~:text=of%20the%20data.-,API%20Documentation,-The%20API%20endpoint):
* **Taxi data for Chicago - Variable name: <br> *taxi_df*** 

| **Column Name**            | **Data Description**                                                                                                                                              | **Dtype**      |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| trip_start_timestamp       | When the trip started, rounded to the nearest 15 minutes.                                                                                                         | datetime64[ns] |
| trip_end_timestamp         | When the trip ended, rounded to the nearest 15 minutes.                                                                                                           | datetime64[ns] |
| trip_seconds               | Time of the trip in seconds.                                                                                                                                      | uint32         |
| trip_miles                 | Distance of the trip in miles.                                                                                                                                    | float64        |
| pickup_census_tract        | The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips. This column often will be blank for locations outside Chicago. | float64        |
| dropoff_census_tract       | The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips. This column often will be blank for locations outside Chicago. | float64        |
| trip_total                 | Total cost of the trip, i.e. the sum of the fare, tips, tolls and extras columns of the original dataset.                                                         | float64        |
| pickup_centroid_latitude   | The latitude of the center of the pickup census tract.                                                                                                            | float64        |
| pickup_centroid_longitude  | The longitude of the center of the pickup census tract.                                                                                                           | float64        |
| dropoff_centroid_latitude  | The latitude of the center of the dropoff census tract.                                                                                                           | float64        |
| dropoff_centroid_longitude | The longitude of the center of the dropoff census tract.                                                                                                          | float64        |
* **Hourly weather data - Variable name: <br> *weather_data_hourly***

| **Column name**           | **Data Description**                                                                                                                                                                                                                       | **Unit of Measurement** | **Dtype**      |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|----------------|
| time                      | The timestamp for the indicated hour.                                                                                                                                                                                                      | None     | datetime64[ns] |
| temperature_2m (°C)       | Air temperature at 2 meters above ground.                                                                                                                                                                                                  | °C       | float64        |
| relativehumidity_2m (%)   | Relative humidity at 2 meters above ground.                                                                                                                                                                                                | %        | int64          |
| apparent_temperature (°C) | Apparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation.                                                                                                           | °C       | float64        |
| precipitation (mm)        | Total precipitation (rain, showers, snow) sum of the preceding hour. Data is stored with a 0.1 mm precision. If precipitation data is summed up to monthly sums, there might be small inconsistencies with the total precipitation amount. | mm       | float64        |
| cloudcover (%)            | Total cloud cover as an area fraction.                                                                                                                                                                                                     | %        | int64          |
| windspeed_10m (km/h)      | Wind speed at 10 meters above ground.                                                                                                                                                                                                      | km/h     | float64        |
* **Daily weather data (note: the values are simple 24-hour aggregations of the hourly values) - Variable name: <br> *weather_data_daily***

| **Column name**                | **Data Description**                                               | **Unit of Measurement** | **Dtype**      |
|--------------------------------|--------------------------------------------------------------------|-------------------------|----------------|
| time                           | The date of the indicated day.                                     |                         | datetime64[ns] |
| temperature_2m_max (°C)        | Maximum daily air temperature at 2 meters above ground.            | °C                      | float64        |
| temperature_2m_min (°C)        | Minimum daily air temperature at 2 meters above ground.            | °C                      | float64        |
| temperature_2m_mean (°C)       | Mean daily air temperature at 2 meters above ground.               | °C                      | float64        |
| apparent_temperature_max (°C)  | Maximum daily apparent temperature.                                | °C                      | float64        |
| apparent_temperature_min (°C)  | Minimum daily apparent temperature.                                | °C                      | float64        |
| apparent_temperature_mean (°C) | Mean daily apparent temperature.                                   | °C                      | float64        |
| sunrise (iso8601)              | Time of sunrise.                                                   | iso8601                 | datetime64[ns] |
| sunset (iso8601)               | Time of sunset.                                                    | iso8601                 | datetime64[ns] |
| precipitation_sum (mm)         | Sum of daily precipitation (including rain, showers and snowfall). | mm                      | float64        |
| precipitation_hours (h)        | The number of hours with rain.                                     | h                       | float64        |
| windspeed_10m_max (km/h)       | Maximum wind speed on a day.                                       | km/h                    | float64        |

Further spatial discretization (different resolutions of hexagons in Uber's H3) and different temporal discretizations (e.g., hourly, 4-hourly, daily) are done in the other notebooks where needed.
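
As a rough illustration of the temporal discretization mentioned above, the sketch below floors trip start timestamps to hourly, 4-hourly and daily bins and counts trips per 4-hour bin. It is not the code used in the other notebooks; the toy timestamps and the column names `start_hour`, `start_4h_bin` and `start_day` are assumptions.

```python
import pandas as pd

# Toy trips; the real data would come from taxi_df
trips = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime([
        "2016-01-01 00:00:00",
        "2016-01-01 00:45:00",
        "2016-01-01 05:15:00",
        "2016-01-01 23:45:00",
    ]),
})

# Floor each timestamp to the start of its hourly, 4-hourly and daily bin
trips["start_hour"] = trips["trip_start_timestamp"].dt.floor("h")
trips["start_4h_bin"] = trips["trip_start_timestamp"].dt.floor("4h")
trips["start_day"] = trips["trip_start_timestamp"].dt.floor("D")

# Number of trips starting in each 4-hour bin
trips_per_4h = trips.groupby("start_4h_bin").size()
print(trips_per_4h)
```

The hourly bin has the same resolution as the time column of weather_data_hourly, so it could serve as a join key when combining trips with weather observations.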