# Weather & Bicycle Usage – Data Preprocessing

This notebook cleans, merges, and prepares the collected weather and bicycle traffic data for Zurich for further analysis.

## Libraries and settings

Import the required libraries and set the main settings.

In [35]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")
print(os.getcwd())

/workspaces/data_analytics_project/notebooks


## Loading datasets

We load the weather and bicycle counter data from the CSV files generated in the data collection phase.

In [37]:
# Load daily mean temperature data (English, mean only)
weather_df = pd.read_csv("../data/temperature_mean_zurich_2025.csv")
print("Weather data shape:", weather_df.shape)
print(weather_df.head())
print("\nData types:")
print(weather_df.dtypes)

Weather data shape: (361, 2)
         date  temp_mean
0  2025-01-01        0.6
1  2025-01-02        3.1
2  2025-01-03       -0.3
3  2025-01-04       -1.9
4  2025-01-05        1.8

Data types:
date          object
temp_mean    float64
dtype: object


In [38]:
# Load bicycle counter data
bikes_df = pd.read_csv("../data/2025_verkehrszaehlungen_werte_fussgaenger_velo.csv")
print("Bicycle data shape:", bikes_df.shape)
print(bikes_df.head())
print("\nData types:")
print(bikes_df.dtypes)

Bicycle data shape: (838029, 8)
   FK_STANDORT             DATUM  VELO_IN  VELO_OUT  FUSS_IN  FUSS_OUT  \
0         4241  2025-01-01T00:00      3.0       0.0      NaN       NaN   
1         2989  2025-01-01T00:00      1.0       2.0      NaN       NaN   
2         2991  2025-01-01T00:00     18.0       2.0      NaN       NaN   
3         4255  2025-01-01T00:00      0.0       0.0      NaN       NaN   
4         4242  2025-01-01T00:00      1.0       0.0      NaN       NaN   

       OST     NORD  
0  2682297  1248328  
1  2682278  1248324  
2  2682756  1247323  
3  2682881  1246549  
4  2682337  1248451  

Data types:
FK_STANDORT      int64
DATUM           object
VELO_IN        float64
VELO_OUT       float64
FUSS_IN        float64
FUSS_OUT       float64
OST              int64
NORD             int64
dtype: object


## Cleaning and handling missing values

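We convert the date columns to datetime format, aggregate the bicycle and pedestrian counts to daily totals, and check both datasets for missing values and duplicates. The daily totals are computed with a sum that skips missing counts, so a day with no recorded values for a column shows up as 0 rather than as a missing value.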

In [39]:
# Convert date columns to datetime
weather_df["date"] = pd.to_datetime(weather_df["date"])
bikes_df["DATUM"] = pd.to_datetime(bikes_df["DATUM"])
print("After datetime conversion:")
print("Weather data types:", weather_df.dtypes.values)
print("Bikes data types:", bikes_df.dtypes.values)

After datetime conversion:
Weather data types: [dtype('<M8[ns]') dtype('float64')]
Bikes data types: [dtype('int64') dtype('<M8[ns]') dtype('float64') dtype('float64')
 dtype('float64') dtype('float64') dtype('int64') dtype('int64')]
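
The conversion above relies on pandas inferring the timestamp format. As a minimal sketch, assuming the `DATUM` strings always follow the `YYYY-MM-DDTHH:MM` pattern seen in the preview, the format can be given explicitly and unparseable values turned into `NaT` so they can be counted. The sample values and the names `raw`, `parsed`, and `n_bad` are made up for illustration.

```python
import pandas as pd

# Toy sample (illustrative values, not taken from the real file)
raw = pd.Series(["2025-01-01T00:00", "2025-01-01T01:00", "not a date"])

# Explicit format; errors="coerce" turns unparseable strings into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M", errors="coerce")

# Number of values that could not be parsed
n_bad = int(parsed.isna().sum())
print(parsed)
print("Unparseable timestamps:", n_bad)
```

Counting the `NaT` values before any aggregation makes it visible if part of the file does not match the expected pattern.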


In [40]:
# --- Aggregate bicycle data to daily level ---
# Create a new column with only the date (no time)
bikes_df['DATE_ONLY'] = bikes_df['DATUM'].dt.date

# Sum all bicycle and pedestrian counts per day (across all locations)
daily_bikes = bikes_df.groupby('DATE_ONLY').agg({
    'VELO_IN': 'sum',
    'VELO_OUT': 'sum',
    'FUSS_IN': 'sum',
    'FUSS_OUT': 'sum'
}).reset_index()

daily_bikes.rename(columns={'DATE_ONLY': 'date'}, inplace=True)
# Convert 'date' back to datetime for merging
daily_bikes['date'] = pd.to_datetime(daily_bikes['date'])

print('Aggregated daily bicycle data:')
print(daily_bikes.head())

Aggregated daily bicycle data:
        date  VELO_IN  VELO_OUT  FUSS_IN  FUSS_OUT
0 2025-01-01   5094.0    2558.0   1188.0    1099.0
1 2025-01-02   5086.0    2423.0    541.0     448.0
2 2025-01-03   9073.0    4420.0    450.0     404.0
3 2025-01-04   7129.0    3551.0    457.0     388.0
4 2025-01-05   5000.0    2641.0    630.0     565.0
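
Because the raw table contains many missing counts, it helps to know how the daily sum treats them. The following sketch uses a small made-up table (the name `toy` and its values are illustrative) to show that a grouped `sum` skips `NaN` by default and returns 0 for a group with no valid values, while `min_count=1` keeps such a group as missing.

```python
import numpy as np
import pandas as pd

# Toy data: on 2025-01-01 no pedestrian count was recorded at all
toy = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-01", "2025-01-02"]),
    "VELO_IN": [3.0, np.nan, 5.0],
    "FUSS_IN": [np.nan, np.nan, 2.0],
})

cols = ["VELO_IN", "FUSS_IN"]

# Default behaviour: NaN is skipped, an all-NaN group sums to 0.0
default_sum = toy.groupby("date")[cols].sum()

# min_count=1: a group needs at least one valid value, otherwise NaN
strict_sum = toy.groupby("date")[cols].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

Which variant is preferable depends on whether a day without any valid reading should count as zero traffic or as unknown.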


In [41]:
# Check for missing values
print("Missing values in weather data:")
print(weather_df.isnull().sum())

print("\nMissing values in bicycle data:")
print(bikes_df.isnull().sum())

Missing values in weather data:
date         0
temp_mean    0
dtype: int64

Missing values in bicycle data:
FK_STANDORT         0
DATUM               0
VELO_IN         76402
VELO_OUT       167593
FUSS_IN        761627
FUSS_OUT       761627
OST                 0
NORD                0
DATE_ONLY           0
dtype: int64


In [42]:
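# Drop rows that contain any missing value.
# Note: every row of bikes_df has at least one missing count column
# (VELO_IN is missing in 76402 rows and FUSS_IN in 761627; together 838029),
# so dropna() leaves bikes_df empty, as the output shows.
# The merge further down uses daily_bikes, which was aggregated before this step.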
weather_df = weather_df.dropna()
bikes_df = bikes_df.dropna()

print("After dropping missing values:")
print("Weather data shape:", weather_df.shape)
print("Bicycle data shape:", bikes_df.shape)

After dropping missing values:
Weather data shape: (361, 2)
Bicycle data shape: (0, 9)


In [43]:
# Check for duplicates (bikes_df is already empty at this point, so its count is trivially 0)
print("Duplicates in weather data:", weather_df.duplicated().sum())
print("Duplicates in bicycle data:", bikes_df.duplicated().sum())

# Drop duplicates if any
weather_df = weather_df.drop_duplicates()
bikes_df = bikes_df.drop_duplicates()

print("\nAfter removing duplicates:")
print("Weather data shape:", weather_df.shape)
print("Bicycle data shape:", bikes_df.shape)

Duplicates in weather data: 0
Duplicates in bicycle data: 0

After removing duplicates:
Weather data shape: (361, 2)
Bicycle data shape: (0, 9)


## Merging weather and bicycle datasets

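We perform an inner join on the date column to combine the weather data with the daily bicycle totals into a single analysis-ready dataset. Because the join is inner, dates that appear in only one of the two datasets are dropped: the merged data has 359 rows, against 361 rows in the weather data.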

In [44]:
# Merge weather and aggregated bicycle data on date (daily mean only)
merged_df = pd.merge(weather_df, daily_bikes, on="date", how="inner")

print("Merged data shape:", merged_df.shape)
print("\nFirst 5 rows:")
print(merged_df.head())
print("\nData info:")
print(merged_df.info())

Merged data shape: (359, 6)

First 5 rows:
        date  temp_mean  VELO_IN  VELO_OUT  FUSS_IN  FUSS_OUT
0 2025-01-01        0.6   5094.0    2558.0   1188.0    1099.0
1 2025-01-02        3.1   5086.0    2423.0    541.0     448.0
2 2025-01-03       -0.3   9073.0    4420.0    450.0     404.0
3 2025-01-04       -1.9   7129.0    3551.0    457.0     388.0
4 2025-01-05        1.8   5000.0    2641.0    630.0     565.0

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359 entries, 0 to 358
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       359 non-null    datetime64[ns]
 1   temp_mean  359 non-null    float64       
 2   VELO_IN    359 non-null    float64       
 3   VELO_OUT   359 non-null    float64       
 4   FUSS_IN    359 non-null    float64       
 5   FUSS_OUT   359 non-null    float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 17.0 KB
None
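
To see which dates an inner join discards, a left join with `indicator=True` can be used as a diagnostic. This is a minimal sketch on toy frames: the temperatures and the two bicycle totals are copied from the preview above, but the gap on 2025-01-02 is invented for illustration, and the names `weather`, `bikes`, `check`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy weather data for three consecutive days
weather = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    "temp_mean": [0.6, 3.1, -0.3],
})

# Toy daily totals with an invented gap on 2025-01-02
bikes = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-03"]),
    "VELO_IN": [5094.0, 9073.0],
})

# A left join keeps all weather rows; the _merge column marks rows without a match
check = weather.merge(bikes, on="date", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "date"]

print(unmatched)
```

Applied to the real frames, the same pattern would list the weather dates that have no bicycle record.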


In [45]:
# Basic statistics
print("Basic statistics:")
print(merged_df.describe())

Basic statistics:
                      date   temp_mean       VELO_IN      VELO_OUT  \
count                  359  359.000000    359.000000    359.000000   
mean   2025-06-29 00:00:00   10.576323  23350.487465  11717.409471   
min    2025-01-01 00:00:00   -2.900000   1373.000000   1241.000000   
25%    2025-03-31 12:00:00    4.500000  17209.500000   8603.000000   
50%    2025-06-29 00:00:00   10.000000  22996.000000  11533.000000   
75%    2025-09-26 12:00:00   16.450000  28954.000000  14785.000000   
max    2025-12-25 00:00:00   26.800000  47470.000000  23479.000000   
std                    NaN    7.308193   9144.180319   4650.774839   

           FUSS_IN     FUSS_OUT  
count   359.000000   359.000000  
mean   1220.885794   941.256267  
min       0.000000     0.000000  
25%     622.500000   492.000000  
50%    1000.000000   837.000000  
75%    1643.000000  1286.500000  
max    4116.000000  3340.000000  
std     795.581763   558.080430  


## Saving cleaned dataset

The merged and cleaned dataset is saved as a CSV file for use in exploratory data analysis.

In [46]:
# Save merged data
merged_df.to_csv("../data/merged_weather_bikes.csv", index=False)
print("Merged data saved to ../data/merged_weather_bikes.csv")

Merged data saved to ../data/merged_weather_bikes.csv

## Conclusions

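In this notebook, we cleaned and merged the daily mean temperature and bicycle counter datasets for Zurich. The timestamped counts were aggregated to daily totals, with missing counts skipped in the sums, and no duplicate rows were found. Dropping rows with any missing value emptied the raw bicycle table, but the merge uses the daily totals computed beforehand, so it is unaffected. The final merged dataset contains 359 daily observations with mean temperature and bicycle and pedestrian counts, and is ready for exploratory data analysis.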

### Jupyter notebook --footer info--

In [47]:
import os
import platform
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')

-----------------------------------
POSIX
Linux | 6.8.0-1030-azure
Datetime: 2025-12-28 19:00:15
Python Version: 3.12.3
-----------------------------------
