<a id='start'></a>
# Data preparation AAA - Intentionally Blank

Before running the notebook, please download the data folder from this [sciebo link](https://uni-koeln.sciebo.de/s/QKgDQhiJOfGieZv) and put it into the project root directory.

**Dependencies:**
- Pandas
- Pyarrow (*conda install pyarrow*)
  - Needed for reading the taxi dataset from the parquet file
- Geopandas (*conda install -c conda-forge geopandas*)
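
A quick way to confirm that the dependencies listed above can be imported before running the notebook is sketched below. The helper name `find_missing` is hypothetical and not part of the notebook; it only relies on `importlib.util.find_spec` from the standard library and assumes the import names match the package names.

```python
from importlib.util import find_spec


def find_missing(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed to match the packages listed above
missing = find_missing(["pandas", "pyarrow", "geopandas"])
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All dependencies are available.")
```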

In [1]:
# Import all necessary libraries
import pandas as pd
# Also install the following libraries:
import geopandas as gpd # For spatial data (conda install -c conda-forge geopandas)

The weather data is taken from the Open-Meteo Historical Weather API; see the [documentation](https://open-meteo.com/en/docs/historical-weather-api) for further details. The API request already restricts the data to the required timeframe, so the timeframe does not need to be filtered again in the cleaning process.
<br>The taxi data was prepared in the other preparation notebook; the original dataset was obtained from the Chicago data portal. For more details about the data collection and the preparation, see the [preparation notebook for the taxi dataset](./preprocess.ipynb).

In [2]:
# Read in data from the open-meteo API and the preprocessed taxi data
weather_data_hourly = pd.read_csv('https://archive-api.open-meteo.com/v1/archive?latitude=41.85&longitude=-87.65&start_date=2016-01-01&end_date=2016-12-31&hourly=temperature_2m,relativehumidity_2m,apparent_temperature,precipitation,cloudcover,windspeed_10m&format=csv&timezone=CST', header=2)
weather_data_daily= pd.read_csv('https://archive-api.open-meteo.com/v1/archive?latitude=41.85&longitude=-87.65&start_date=2016-01-01&end_date=2016-12-31&daily=temperature_2m_max,temperature_2m_min,temperature_2m_mean,apparent_temperature_max,apparent_temperature_min,apparent_temperature_mean,sunrise,sunset,precipitation_sum,precipitation_hours,windspeed_10m_max&format=csv&timezone=CST', header=2)
taxi_df = pd.read_parquet("data/prepared/taxi_data_prepared.gzip")

For a detailed description of the data, please refer to the [data reference section here](#references).
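
The Open-Meteo request URLs used when reading the weather data encode the location, the timeframe, and the requested variables as query parameters. The sketch below rebuilds the hourly URL from its parts, which can make it easier to change the timeframe or the variable list. The helper `build_archive_url` and the constant `BASE_URL` are hypothetical names introduced for illustration; the parameter values are copied from the request above.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(variables, resolution="hourly"):
    """Build an Open-Meteo archive URL for Chicago in 2016 (hypothetical helper)."""
    params = {
        "latitude": 41.85,
        "longitude": -87.65,
        "start_date": "2016-01-01",
        "end_date": "2016-12-31",
        resolution: ",".join(variables),  # "hourly" or "daily"
        "format": "csv",
        "timezone": "CST",
    }
    # Keep commas unescaped so the variable list stays readable
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


hourly_url = build_archive_url([
    "temperature_2m",
    "relativehumidity_2m",
    "apparent_temperature",
    "precipitation",
    "cloudcover",
    "windspeed_10m",
])
print(hourly_url)
```

The resulting string can be passed to `pd.read_csv(hourly_url, header=2)` in the same way as the hard-coded URL.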

## Preparing the weather datasets

### Preparing weather data

In [3]:
# Getting an overview of the hourly weather data
weather_data_hourly.describe()

Unnamed: 0,temperature_2m (°C),relativehumidity_2m (%),apparent_temperature (°C),precipitation (mm),cloudcover (%),windspeed_10m (km/h)
count,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0
mean,11.084324,75.335155,9.053848,0.110474,48.00444,14.383106
std,10.688952,12.752638,13.630042,0.492808,36.689525,7.138672
min,-19.7,32.0,-25.7,0.0,0.0,0.0
25%,2.3,66.0,-2.1,0.0,14.0,8.9
50%,11.2,76.0,8.4,0.0,40.0,13.5
75%,20.7,85.0,21.2,0.0,88.0,18.9
max,31.7,100.0,37.5,11.0,100.0,49.2


In [4]:
# Convert the time column to the pandas datetime type
weather_data_hourly['time'] = pd.to_datetime(weather_data_hourly['time'])
# Sort the dataframe by its time column
weather_data_hourly.sort_values(['time'], inplace=True)
# Reset the index
weather_data_hourly.reset_index(drop=True, inplace=True)
weather_data_hourly

Unnamed: 0,time,temperature_2m (°C),relativehumidity_2m (%),apparent_temperature (°C),precipitation (mm),cloudcover (%),windspeed_10m (km/h)
0,2016-01-01 00:00:00,-4.0,71,-10.4,0.0,14,22.5
1,2016-01-01 01:00:00,-3.9,72,-10.5,0.0,20,23.9
2,2016-01-01 02:00:00,-3.6,71,-10.3,0.0,10,24.7
3,2016-01-01 03:00:00,-3.9,71,-10.6,0.0,14,24.3
4,2016-01-01 04:00:00,-4.5,73,-11.2,0.0,20,24.8
...,...,...,...,...,...,...,...
8779,2016-12-31 19:00:00,0.6,61,-4.8,0.0,51,16.9
8780,2016-12-31 20:00:00,0.0,65,-4.8,0.0,2,13.0
8781,2016-12-31 21:00:00,-0.6,69,-5.1,0.0,0,11.2
8782,2016-12-31 22:00:00,-1.2,71,-5.8,0.0,0,11.7


In [5]:
# Getting an overview of the daily weather data
weather_data_daily.describe()

Unnamed: 0,temperature_2m_max (°C),temperature_2m_min (°C),temperature_2m_mean (°C),apparent_temperature_max (°C),apparent_temperature_min (°C),apparent_temperature_mean (°C),precipitation_sum (mm),precipitation_hours (h),windspeed_10m_max (km/h)
count,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0
mean,14.098087,8.280601,11.085792,12.869126,5.767486,9.053005,2.651366,3.300546,21.065574
std,10.605513,10.474961,10.487603,13.689586,13.280628,13.412974,5.760821,5.106252,7.194592
min,-12.3,-19.7,-15.3,-18.6,-25.7,-22.1,0.0,0.0,7.4
25%,5.225,0.0,2.45,1.1,-5.275,-2.375,0.0,0.0,15.525
50%,14.9,8.2,11.7,12.6,4.7,8.7,0.0,0.0,20.35
75%,23.675,17.85,20.7,25.4,17.8,21.2,2.6,5.0,25.75
max,31.7,24.6,27.3,37.5,29.5,32.3,36.3,24.0,49.2


In [6]:
# Convert the time, sunrise, and sunset columns to the pandas datetime type
weather_data_daily['time'] = pd.to_datetime(weather_data_daily['time'])
weather_data_daily['sunrise (iso8601)'] = pd.to_datetime(weather_data_daily['sunrise (iso8601)'])
weather_data_daily['sunset (iso8601)'] = pd.to_datetime(weather_data_daily['sunset (iso8601)'])
# Sort the dataframe by its time column
weather_data_daily.sort_values(['time'], inplace=True)
# Reset the index
weather_data_daily.reset_index(drop=True, inplace=True)
weather_data_daily

Unnamed: 0,time,temperature_2m_max (°C),temperature_2m_min (°C),temperature_2m_mean (°C),apparent_temperature_max (°C),apparent_temperature_min (°C),apparent_temperature_mean (°C),sunrise (iso8601),sunset (iso8601),precipitation_sum (mm),precipitation_hours (h),windspeed_10m_max (km/h)
0,2016-01-01,-0.5,-6.2,-3.2,-6.7,-13.1,-9.7,2016-01-01 08:18:00,2016-01-01 17:29:00,0.0,0.0,25.6
1,2016-01-02,1.4,-4.1,-1.6,-4.7,-9.3,-7.4,2016-01-02 08:18:00,2016-01-02 17:30:00,0.0,0.0,27.6
2,2016-01-03,-0.3,-3.0,-1.7,-6.2,-9.1,-7.4,2016-01-03 08:18:00,2016-01-03 17:31:00,0.0,0.0,22.1
3,2016-01-04,-0.5,-3.1,-2.0,-6.4,-9.0,-7.5,2016-01-04 08:18:00,2016-01-04 17:32:00,0.0,0.0,23.3
4,2016-01-05,1.3,-4.8,-2.0,-4.0,-10.4,-7.4,2016-01-05 08:18:00,2016-01-05 17:33:00,0.0,0.0,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...
361,2016-12-27,-0.2,-3.4,-2.0,-5.8,-9.6,-8.0,2016-12-27 08:17:00,2016-12-27 17:26:00,0.0,0.0,28.1
362,2016-12-28,5.5,-3.4,1.1,-0.7,-7.3,-4.2,2016-12-28 08:17:00,2016-12-28 17:27:00,0.0,0.0,29.7
363,2016-12-29,1.9,-0.4,0.7,-3.8,-7.3,-5.9,2016-12-29 08:17:00,2016-12-29 17:28:00,0.0,0.0,35.5
364,2016-12-30,0.9,-2.7,-0.9,-4.3,-8.9,-7.0,2016-12-30 08:17:00,2016-12-30 17:28:00,0.2,1.0,28.5


In [7]:
#Checking for any duplicates in weather data
print("Number of duplicates in weather_data_hourly: ", weather_data_hourly.duplicated().sum())
print("Number of duplicates in weather_data_daily: ", weather_data_daily.duplicated().sum())

Number of duplicates in weather_data_hourly:  0
Number of duplicates in weather_data_daily:  0


In [8]:
# Checking for rows with missing values in the weather data
print("Number of rows with missing values in weather_data_hourly: ", weather_data_hourly.isnull().any(axis=1).sum())
print("Number of rows with missing values in weather_data_daily: ", weather_data_daily.isnull().any(axis=1).sum())

Number of rows with missing values in weather_data_hourly:  0
Number of rows with missing values in weather_data_daily:  0
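
The duplicate and missing-value checks above look at the rows that are present; they would not notice an hour that is absent from the data altogether. A minimal sketch of such a gap check follows. It uses a synthetic stand-in for `weather_data_hourly['time']` with one hour removed, and the helper name `missing_timestamps` is hypothetical.

```python
import pandas as pd


def missing_timestamps(times, start, end, step=pd.Timedelta(hours=1)):
    """Return the expected timestamps between start and end that are absent from times."""
    expected = pd.date_range(start=start, end=end, freq=step)
    return expected.difference(pd.DatetimeIndex(times))


# Synthetic stand-in for weather_data_hourly['time'] with one hour removed
full_year = pd.date_range("2016-01-01 00:00", "2016-12-31 23:00", freq=pd.Timedelta(hours=1))
times = pd.Series(full_year).drop(index=5)

gaps = missing_timestamps(times, "2016-01-01 00:00", "2016-12-31 23:00")
print(len(full_year), "expected hours,", len(gaps), "missing")
```

For the leap year 2016 the full hourly range contains 366 * 24 = 8784 timestamps, which matches the row count reported by `describe()` above.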


### Saving to file

In [9]:
# Optional: To also save the prepared taxi data as a csv file, uncomment the following line
# taxi_df.to_csv('data/prepared/csv/taxi_data_prepared.csv', index=False)
# Saving the prepared weather data as csv files
weather_data_hourly.to_csv('data/prepared/csv/weather_data_hourly_prepared.csv', index=False)
weather_data_daily.to_csv('data/prepared/csv/weather_data_daily_prepared.csv', index=False)



# Saving the prepared data as a parquet file with gzip compression
weather_data_hourly.to_parquet('data/prepared/weather_data_hourly_prepared.gzip', compression='gzip')
weather_data_daily.to_parquet('data/prepared/weather_data_daily_prepared.gzip', compression='gzip')

To open the parquet files, use the `pd.read_parquet` function. The documentation can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html).

## References

All dataframes are listed below for reference. Most data descriptions were taken from the [Chicago data portal](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew#column-menu:~:text=Columns%20in%20this%20Dataset) and the [Open-Meteo Historical Weather API](https://open-meteo.com/en/docs/historical-weather-api#:~:text=of%20the%20data.-,API%20Documentation,-The%20API%20endpoint):
* **Taxi data for Chicago - Variable name: <br> *taxi_df*** 

| **Column Name**            | **Data Description**                                                                                                                                              | **Dtype**      |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| trip_start_timestamp       | When the trip started, rounded to the nearest 15 minutes.                                                                                                         | datetime64[ns] |
| trip_end_timestamp         | When the trip ended, rounded to the nearest 15 minutes.                                                                                                           | datetime64[ns] |
| trip_seconds               | Time of the trip in seconds.                                                                                                                                      | uint32         |
| trip_miles                 | Distance of the trip in miles.                                                                                                                                    | float64        |
| pickup_census_tract        | The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips. This column often will be blank for locations outside Chicago. | float64        |
| dropoff_census_tract       | The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips. This column often will be blank for locations outside Chicago. | float64        |
| trip_total                 | Total cost of the trip, i.e. the sum of the fare, tips, tolls and extras columns of the original dataset.                                                         | float64        |
| pickup_centroid_latitude   | The latitude of the center of the pickup census tract.                                                                                                            | float64        |
| pickup_centroid_longitude  | The longitude of the center of the pickup census tract.                                                                                                           | float64        |
| dropoff_centroid_latitude  | The latitude of the center of the dropoff census tract.                                                                                                           | float64        |
| dropoff_centroid_longitude | The longitude of the center of the dropoff census tract.                                                                                                          | float64        |
| idle_time                  | The idle time in seconds. Rounded to the nearest 15 minutes. NaN if it is the first ride for the specific taxi of the year. Keep in mind that the idle seconds are approximated from the start and end times. | float64        |
* **Hourly weather data - Variable name: <br> *weather_data_hourly***

| **Column name**           | **Data Description**                                                                                                                                                                                                                       | **Unit of Measurement** | **Dtype**      |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|----------------|
| time                      | The timestamp for the indicated hour.                                                                                                                                                                                                      | None     | datetime64[ns] |
| temperature_2m (°C)       | Air temperature at 2 meters above ground.                                                                                                                                                                                                  | °C       | float64        |
| relativehumidity_2m (%)   | Relative humidity at 2 meters above ground.                                                                                                                                                                                                | %        | int64          |
| apparent_temperature (°C) | Apparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation.                                                                                                           | °C       | float64        |
| precipitation (mm)        | Total precipitation (rain, showers, snow) sum of the preceding hour. Data is stored with a 0.1 mm precision. If precipitation data is summed up to monthly sums, there might be small inconsistencies with the total precipitation amount. | mm       | float64        |
| cloudcover (%)            | Total cloud cover as an area fraction.                                                                                                                                                                                                     | %        | int64          |
| windspeed_10m (km/h)      | Wind speed at 10 meters above ground.                                                                                                                                                                                                      | km/h     | float64        |
* **Daily weather data (note: the daily values are simple 24-hour aggregations of the hourly values) - Variable name: <br> *weather_data_daily***

| **Column name**                | **Data Description**                                               | **Unit of Measurement** | **Dtype**      |
|--------------------------------|--------------------------------------------------------------------|-------------------------|----------------|
| time                           | The date of the indicated day.                                     | None                    | datetime64[ns] |
| temperature_2m_max (°C)        | Maximum daily air temperature at 2 meters above ground.            | °C                      | float64        |
| temperature_2m_min (°C)        | Minimum daily air temperature at 2 meters above ground.            | °C                      | float64        |
| temperature_2m_mean (°C)       | Mean daily air temperature at 2 meters above ground.               | °C                      | float64        |
| apparent_temperature_max (°C)  | Maximum daily apparent temperature.                                | °C                      | float64        |
| apparent_temperature_min (°C)  | Minimum daily apparent temperature.                                | °C                      | float64        |
| apparent_temperature_mean (°C) | Mean daily apparent temperature.                                   | °C                      | float64        |
| sunrise (iso8601)              | Sunrise time.                                                      | iso8601                 | datetime64[ns] |
| sunset (iso8601)               | Sunset time.                                                       | iso8601                 | datetime64[ns] |
| precipitation_sum (mm)         | Sum of daily precipitation (including rain, showers and snowfall). | mm                      | float64        |
| precipitation_hours (h)        | The number of hours with rain.                                     | h                       | float64        |
| windspeed_10m_max (km/h)       | Maximum wind speed on a day.                                       | km/h                    | float64        |

Further spatial discretization (different resolutions of hexagons in Uber's H3) and temporal discretization (e.g., hourly, 4-hourly, daily) are done in the other notebooks when needed.
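
As a rough illustration of what temporal discretization can look like, the sketch below floors trip start timestamps to hourly, 4-hourly, and daily bins with pandas. This is only an assumed approach, not necessarily the one used in the other notebooks; the sample timestamps and the column names `start_hour`, `start_4h`, and `start_day` are hypothetical.

```python
import pandas as pd

# Synthetic trips standing in for taxi_df
trips = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime(["2016-01-01 05:45", "2016-01-01 23:15"]),
})

start = trips["trip_start_timestamp"]
trips["start_hour"] = start.dt.floor("h")    # hourly bins
trips["start_4h"] = start.dt.floor("4h")     # 4-hourly bins starting at midnight
trips["start_day"] = start.dt.normalize()    # daily bins (time set to midnight)

print(trips)
```

Grouping by one of these columns, for example `trips.groupby("start_4h").size()`, then yields the number of trips per time bin.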

## Geopandas example for later notebooks

When the data is saved to a file, the location columns are always stored as strings in WKT format. To convert the location columns to the Point geometry type and the pandas dataframe to a geodataframe, see the following example:

In [10]:
# Converting the pickup_centroid_location column to a GeoSeries
taxi_df['pickup_centroid_location'] = gpd.GeoSeries.from_wkt(taxi_df['pickup_centroid_location'])
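# Keep all columns except idle_seconds (note: columns.difference returns the remaining columns in sorted order)
taxi_df = taxi_df[taxi_df.columns.difference(['idle_seconds'])]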
# Converting the taxi_df to a GeoDataFrame
# !Note: The crs is set to 4326 which is the WGS84 coordinate system and must be used to show the coordinates properly on a map
taxi_geo_df = gpd.GeoDataFrame(taxi_df, geometry='pickup_centroid_location', crs=4326)
taxi_geo_df

Unnamed: 0,dropoff_census_tract,dropoff_centroid_location,pickup_census_tract,pickup_centroid_location,trip_end_timestamp,trip_miles,trip_seconds,trip_start_timestamp,trip_total
0,17031070102,POINT (-87.6422063127 41.9305785697),17031070300,POINT (-87.65131 41.92905),2016-01-01 00:00:00,0.9,120,2016-01-01 00:00:00,6.45
1,17031081100,POINT (-87.6291051864 41.9002212967),17031081201,POINT (-87.62621 41.89916),2016-01-01 00:00:00,0.3,120,2016-01-01 00:00:00,5.05
2,17031842300,POINT (-87.6536139825 41.8983058696),17031081201,POINT (-87.62621 41.89916),2016-01-01 00:15:00,2.8,720,2016-01-01 00:00:00,9.85
3,17031081403,POINT (-87.6188683546 41.8909220259),17031081300,POINT (-87.62076 41.89833),2016-01-01 00:15:00,1.0,960,2016-01-01 00:00:00,13.80
4,17031839000,POINT (-87.6314065252 41.8710158803),17031081403,POINT (-87.61887 41.89092),2016-01-01 00:30:00,3.0,1260,2016-01-01 00:00:00,15.65
...,...,...,...,...,...,...,...,...,...
16756403,17031320100,POINT (-87.6209929134 41.8849871918),17031320100,POINT (-87.62099 41.88499),2016-12-31 23:45:00,0.2,480,2016-12-31 23:45:00,7.25
16756404,17031833000,POINT (-87.6572331997 41.8852813201),17031081403,POINT (-87.61887 41.89092),2017-01-01 00:00:00,1.8,780,2016-12-31 23:45:00,12.00
16756405,17031081300,POINT (-87.6207628651 41.8983317935),17031839100,POINT (-87.63275 41.88099),2016-12-31 23:45:00,1.8,840,2016-12-31 23:45:00,10.25
16756406,17031320400,POINT (-87.6219716519 41.8774061234),17031081700,POINT (-87.63186 41.89204),2017-01-01 00:00:00,1.1,600,2016-12-31 23:45:00,7.50


Now you can use the geodataframe to plot the points on a folium map.

In [12]:
# Select the first ten rows (iloc is end-exclusive, whereas loc[:10] would return eleven rows)
taxi_geo_df_first_ten_rows = taxi_geo_df.iloc[:10].copy()
# For the explore function convert the timestamp columns to string or else it will throw an error
taxi_geo_df_first_ten_rows["trip_start_timestamp"] = taxi_geo_df_first_ten_rows["trip_start_timestamp"].astype(str)
taxi_geo_df_first_ten_rows["trip_end_timestamp"] = taxi_geo_df_first_ten_rows["trip_end_timestamp"].astype(str)
taxi_geo_df_first_ten_rows.explore()

In [13]:
# Select the first ten rows (iloc is end-exclusive, whereas loc[:10] would return eleven rows)
taxi_geo_df_first_ten_rows = taxi_geo_df.iloc[:10].copy()
# Or select all columns except the timestamp columns
taxi_geo_df_first_ten_rows = taxi_geo_df_first_ten_rows[taxi_geo_df_first_ten_rows.columns.difference(['trip_start_timestamp', 'trip_end_timestamp'])]
taxi_geo_df_first_ten_rows.explore()