# Weather & Bicycle Usage – Data Preprocessing

## Libraries and settings

In [12]:
import os
import pandas as pd
import warnings

# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

print(os.getcwd())

/workspaces/data_analytics_project/notebooks



In [13]:
# Load weather data
weather_df = pd.read_csv("../data/weather_zurich.csv")
print("Weather data shape:", weather_df.shape)
print(weather_df.head())
print("\nData types:")
print(weather_df.dtypes)

Weather data shape: (8760, 5)
                  time  temperature_2m  humidity  wind_speed_10m  \
0  2023-01-01 00:00:00             7.0        80             6.2   
1  2023-01-01 01:00:00             7.9        80            10.3   
2  2023-01-01 02:00:00             8.7        75             6.1   
3  2023-01-01 03:00:00             7.6        80             8.1   
4  2023-01-01 04:00:00             8.5        75             8.0   

   precipitation  
0            0.0  
1            0.0  
2            0.0  
3            0.0  
4            0.0  

Data types:
time               object
temperature_2m    float64
humidity            int64
wind_speed_10m    float64
precipitation     float64
dtype: object


## Loading datasets

We load both the weather and bicycle counter data from the CSV files generated in the data collection phase.

In [14]:
# Load bicycle counter data
bikes_df = pd.read_csv("../data/bikes_raw.csv")
print("Bicycle data shape:", bikes_df.shape)
print(bikes_df.head())
print("\nData types:")
print(bikes_df.dtypes)

Bicycle data shape: (8737, 2)
                  time  bike_count
0  2023-01-01 00:00:00   23.934283
1  2023-01-01 01:00:00   11.234714
2  2023-01-01 02:00:00   26.953771
3  2023-01-01 03:00:00   44.460597
4  2023-01-01 04:00:00    9.316933

Data types:
time           object
bike_count    float64
dtype: object


## Cleaning and handling missing values

We convert the time columns to datetime so that both datasets share a comparable join key, then drop any rows with missing values and any duplicate rows.

In [4]:
# Convert time columns to datetime
weather_df["time"] = pd.to_datetime(weather_df["time"])
bikes_df["time"] = pd.to_datetime(bikes_df["time"])

print("After datetime conversion:")
print("Weather data types:", weather_df.dtypes.values)
print("Bikes data types:", bikes_df.dtypes.values)

After datetime conversion:
Weather data types: [dtype('<M8[ns]') dtype('float64') dtype('int64') dtype('float64')
 dtype('float64')]
Bikes data types: [dtype('<M8[ns]') dtype('float64')]
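As an alternative to converting after loading, the time column can be parsed while the CSV is read. The sketch below is only an illustration: it uses a small inline sample (values copied from the weather preview above) instead of the real file, and the names `csv_text` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Inline stand-in for the weather CSV; values copied from the preview output
csv_text = """time,temperature_2m,humidity,wind_speed_10m,precipitation
2023-01-01 00:00:00,7.0,80,6.2,0.0
2023-01-01 01:00:00,7.9,80,10.3,0.0
2023-01-01 02:00:00,8.7,75,6.1,0.0
"""

# parse_dates converts the named column to datetime while reading
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

print(sample_df.dtypes)
```

Parsing at load time removes the separate conversion step, at the cost of hiding the raw string format from the first inspection of the data.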


In [5]:
# Check for missing values
print("Missing values in weather data:")
print(weather_df.isnull().sum())

print("\nMissing values in bicycle data:")
print(bikes_df.isnull().sum())

Missing values in weather data:
time              0
temperature_2m    0
humidity          0
wind_speed_10m    0
precipitation     0
dtype: int64

Missing values in bicycle data:
time          0
bike_count    0
dtype: int64
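Note that `isnull()` only counts empty cells in rows that exist; it cannot reveal hours that are absent altogether. The shapes printed earlier (8760 weather rows versus 8737 bicycle rows) suggest that 23 hourly timestamps may have no bicycle record. A minimal sketch of how such gaps could be listed, using a hypothetical three-row sample rather than the real data:

```python
import pandas as pd

# Hypothetical sample: hourly counts with the 02:00 record absent
sample_bikes = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 00:00:00",
        "2023-01-01 01:00:00",
        "2023-01-01 03:00:00",
    ]),
    "bike_count": [23.9, 11.2, 44.5],
})

# Full hourly grid between the first and last observed timestamp
expected_hours = pd.date_range(
    sample_bikes["time"].min(),
    sample_bikes["time"].max(),
    freq=pd.Timedelta(hours=1),
)

# Timestamps on the grid that have no row in the data
missing_hours = expected_hours.difference(pd.DatetimeIndex(sample_bikes["time"]))

print("Missing hours:", list(missing_hours))
```

Applied to the real bicycle data, the same pattern would show which hours the inner join is going to drop.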


In [6]:
# Drop missing values
weather_df = weather_df.dropna()
bikes_df = bikes_df.dropna()

print("After dropping missing values:")
print("Weather data shape:", weather_df.shape)
print("Bicycle data shape:", bikes_df.shape)

After dropping missing values:
Weather data shape: (8760, 5)
Bicycle data shape: (8737, 2)


In [7]:
# Check for duplicates
print("Duplicates in weather data:", weather_df.duplicated().sum())
print("Duplicates in bicycle data:", bikes_df.duplicated().sum())

# Drop duplicates if any
weather_df = weather_df.drop_duplicates()
bikes_df = bikes_df.drop_duplicates()

print("\nAfter removing duplicates:")
print("Weather data shape:", weather_df.shape)
print("Bicycle data shape:", bikes_df.shape)

Duplicates in weather data: 0
Duplicates in bicycle data: 0

After removing duplicates:
Weather data shape: (8760, 5)
Bicycle data shape: (8737, 2)
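The check above compares entire rows, so two rows with the same timestamp but different values would not be counted. Because time is the join key, it may be worth checking that column on its own. The following sketch uses hypothetical sample values to show the difference:

```python
import pandas as pd

# Hypothetical sample: the 00:00 timestamp appears twice with different counts
sample = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 00:00:00",
        "2023-01-01 00:00:00",
        "2023-01-01 01:00:00",
    ]),
    "bike_count": [23.9, 25.0, 11.2],
})

# Full-row comparison finds nothing because the counts differ
full_row_dupes = sample.duplicated().sum()

# Restricting the check to the join key reveals the repeated timestamp
key_dupes = sample.duplicated(subset="time").sum()

# One possible policy (an assumption, not the notebook's method): keep the first row
deduped = sample.drop_duplicates(subset="time", keep="first")

print("Full-row duplicates:", full_row_dupes)
print("Duplicate timestamps:", key_dupes)
```

Which row to keep for a repeated timestamp is a judgment call; averaging the values would be another reasonable choice.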


## Merging weather and bicycle datasets

We perform an inner join on the time column, which keeps only the timestamps present in both datasets, to combine them into a single analysis-ready dataset.

In [8]:
# Merge weather and bicycle data on time
merged_df = pd.merge(weather_df, bikes_df, on="time", how="inner")

print("Merged data shape:", merged_df.shape)
print("\nFirst 5 rows:")
print(merged_df.head())
print("\nData info:")
print(merged_df.info())

Merged data shape: (8737, 6)

First 5 rows:
                 time  temperature_2m  humidity  wind_speed_10m  \
0 2023-01-01 00:00:00             7.0        80             6.2   
1 2023-01-01 01:00:00             7.9        80            10.3   
2 2023-01-01 02:00:00             8.7        75             6.1   
3 2023-01-01 03:00:00             7.6        80             8.1   
4 2023-01-01 04:00:00             8.5        75             8.0   

   precipitation  bike_count  
0            0.0   23.934283  
1            0.0   11.234714  
2            0.0   26.953771  
3            0.0   44.460597  
4            0.0    9.316933  

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8737 entries, 0 to 8736
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   time            8737 non-null   datetime64[ns]
 1   temperature_2m  8737 non-null   float64       
 2   humidity        8737 non-null   int6
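An inner join silently discards rows without a match. One way to audit this, sketched here on hypothetical sample frames, is to run an outer join with `indicator=True` and `validate="one_to_one"` first, then filter to the matched rows. The frame names `weather`, `bikes`, and `audit` are illustrative only.

```python
import pandas as pd

# Hypothetical samples: weather has three hours, bikes lacks the 01:00 record
weather = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 00:00:00",
        "2023-01-01 01:00:00",
        "2023-01-01 02:00:00",
    ]),
    "temperature_2m": [7.0, 7.9, 8.7],
})
bikes = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 02:00:00"]),
    "bike_count": [23.9, 27.0],
})

# validate raises MergeError if a timestamp repeats on either side;
# indicator adds a _merge column naming the source of each row
audit = pd.merge(
    weather, bikes, on="time", how="outer",
    validate="one_to_one", indicator=True,
)
print(audit["_merge"].value_counts())

# Keeping only the matched rows is equivalent to the inner join
merged = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(merged)
```

The `_merge` counts make it explicit how many weather hours have no bicycle record before they are dropped.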

In [9]:
# Basic statistics
print("Basic statistics:")
print(merged_df.describe())

Basic statistics:
                      time  temperature_2m     humidity  wind_speed_10m  \
count                 8737     8737.000000  8737.000000     8737.000000   
mean   2023-07-02 00:00:00       11.337770    78.538057        7.412830   
min    2023-01-01 00:00:00       -9.900000    26.000000        0.000000   
25%    2023-04-02 00:00:00        5.000000    69.000000        4.000000   
50%    2023-07-02 00:00:00       10.800000    82.000000        6.100000   
75%    2023-10-01 00:00:00       17.500000    91.000000        9.600000   
max    2023-12-31 00:00:00       32.700000   100.000000       40.900000   
std                    NaN        8.067036    15.592839        5.121235   

       precipitation   bike_count  
count    8737.000000  8737.000000  
mean        0.170791    67.715526  
min         0.000000     0.000000  
25%         0.000000    26.565584  
50%         0.000000    57.723824  
75%         0.000000   100.945528  
max         8.400000   227.054630  
std         0.5610

In [10]:
# Save merged data
merged_df.to_csv("../data/merged_weather_bikes.csv", index=False)
print("Merged data saved to ../data/merged_weather_bikes.csv")

Merged data saved to ../data/merged_weather_bikes.csv


## Saving cleaned dataset

The merged and cleaned dataset is saved as a CSV file for use in exploratory data analysis.
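CSV files do not store column types, so the datetime conversion is lost on disk and has to be repeated when the file is read back. The sketch below illustrates this with a hypothetical two-row sample and an in-memory buffer standing in for the saved file:

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the merged dataset
sample = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 01:00:00"]),
    "bike_count": [23.934283, 11.234714],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading without parse_dates yields text timestamps
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime column
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["time"])

print("Without parse_dates:", plain["time"].dtype)
print("With parse_dates:", parsed["time"].dtype)
```

A downstream notebook would therefore need `parse_dates=["time"]` (or an explicit `pd.to_datetime` call) when loading the saved CSV.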

## Conclusions

In this notebook, we cleaned and merged the weather and bicycle counter datasets. The checks found no missing values and no duplicate rows, so the drop steps left both datasets unchanged (8,760 weather rows and 8,737 bicycle rows). The inner join on time kept the 8,737 hourly observations present in both datasets, with all necessary variables (temperature, humidity, wind speed, precipitation, and bicycle counts). The cleaned dataset is now ready for exploratory data analysis.

### Jupyter notebook --footer info--

In [None]:
import os
import platform
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')