# Bike Lane Data Transformation

## Data Cleaning and Normalisation

### The Purpose of this Notebook

- Load the necessary Python libraries
- Load the OSM, Fahrradstraßen and Radverkehrsanlagen GeoJSON files
- Clean data
- Standardise column names for consistency
- Compare OSM with the datasets from Geoportal Berlin (e.g. gaps, mismatches)
- Verify Fahrradstraßen tagging is consistent in the OSM dataset
- Create a unified dataset with OSM as the base with validations from the Geoportal Berlin datasets

### 1. Importing Libraries

In [3]:
import pandas as pd
import geopandas as gpd
import numpy as np
import osmnx as ox

### 2. Load Data

#### Loading local GeoJSON files

In [14]:
osm = gpd.read_file("osm_bikelanes_raw.geojson")
frd = gpd.read_file("fahrradstrassen.geojson")
rvkr = gpd.read_file("radverkehrsanlagen.geojson")

### 3. Overview and Normalisation of Column Names  

#### Quick overview of files

In [15]:
print("OSM:", osm.shape)
print("Fahrradstraßen:", frd.shape)
print("Radverkehrsanlagen:", rvkr.shape)

OSM: (78865, 1064)
Fahrradstraßen: (57, 6)
Radverkehrsanlagen: (18641, 13)


#### OSM data

In [19]:
print("OSM Bike Lanes Columns & Data Types")
osm.info()

OSM Bike Lanes Columns & Data Types
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 78865 entries, 0 to 78864
Columns: 1064 entries, id to geometry
dtypes: datetime64[ms](22), geometry(1), object(1041)
memory usage: 640.2+ MB


##### Selecting the relevant columns

In [20]:
columns_to_keep = ["id", "geometry", "cycleway", "surface", "smoothness"]
osm_small = osm[columns_to_keep].copy()

##### Renaming "id" to "bikelane_id" and "smoothness" to "condition"

In [38]:
# Rename both columns in one call; assigning the result avoids in-place mutation
osm_small = osm_small.rename(columns={"id": "bikelane_id", "smoothness": "condition"})

##### Checking the Output of the Reduced OSM Data

In [39]:
print("OSM Bike Lanes Columns & Data Types (reduced)")
osm_small.info()


OSM Bike Lanes Columns & Data Types (reduced)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 78865 entries, 0 to 78864
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   bikelane_id  78865 non-null  object  
 1   geometry     78865 non-null  geometry
 2   cycleway     6479 non-null   object  
 3   surface      76376 non-null  object  
 4   condition    48139 non-null  object  
dtypes: geometry(1), object(4)
memory usage: 3.0+ MB


##### Note: 
The raw OSM table has 1064 columns (1041 object, 22 datetime and 1 geometry), far more than the project needs.

Based on the parameters of the project, only the following columns are kept:

- id *(renamed to bikelane_id)*
- geometry *(the feature geometry)*
- cycleway *(type of bike lane)*
- surface *(material type of the lane)*
- smoothness *(renamed to condition; condition of the bike lane)*

##### OSM Dataset - Columns Overview

| Column       | Non-Null Count | Data Type | Description                                   |
|--------------|----------------|-----------|-----------------------------------------------|
| `bikelane_id`| 78,865         | object    | Unique identifier for each bike lane segment   |
| `geometry`   | 78,865         | geometry  | Spatial geometry of the bike lane (line/polygon) |
| `cycleway`   | 6,479          | object    | Type of cycleway (tag from OSM, often sparse) |
| `surface`    | 76,376         | object    | Surface material of the bike lane (e.g., asphalt, dirt) |
| `condition` | 48,139         | object    | Condition/rideability of the bike lane        |
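
The non-null counts above translate into very different coverage per column: `cycleway` is filled for 6,479 of 78,865 rows (about 8.2%), `condition` for about 61.0% and `surface` for about 96.8%. A minimal sketch of how such shares could be computed is shown below; the helper name `non_null_share` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def non_null_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of non-null values per column, rounded to 0.1."""
    return (df.notna().mean() * 100).round(1)


# Toy stand-in for osm_small; the values are illustrative, not real OSM rows.
toy = pd.DataFrame({
    "cycleway": ["lane", None, None, None],
    "surface": ["asphalt", "asphalt", None, "paving_stones"],
})

coverage = non_null_share(toy)
print(coverage)
```

Applied to the reduced OSM table, the same call would be `non_null_share(osm_small.drop(columns="geometry"))`.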

In [23]:
display(osm_small.head())

Unnamed: 0,bikelane_id,geometry,cycleway,surface,smoothness
0,way/43998936,"POLYGON ((13.60278 52.53787, 13.60269 52.53786...",,asphalt,good
1,way/517805554,"POLYGON ((13.46367 52.47111, 13.46348 52.47116...",no,paving_stones,good
2,way/1186003574,"POLYGON ((13.34615 52.58978, 13.34614 52.58978...",lane,,
3,way/1186011275,"POLYGON ((13.34569 52.58962, 13.3457 52.58961,...",lane,,
4,way/1187324842,"POLYGON ((13.42565 52.48772, 13.42563 52.48772...",crossing,,


#### Fahrradstraßen 

In [36]:
print("Fahrradstraßen Bike Lanes Columns & Data Types")
frd.info()

Fahrradstraßen Bike Lanes Columns & Data Types
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   bikelane_id  57 non-null     int32   
 1   district     57 non-null     object  
 2   street_name  57 non-null     object  
 3   designation  57 non-null     object  
 4   section      33 non-null     object  
 5   geometry     57 non-null     geometry
dtypes: geometry(1), int32(1), object(4)
memory usage: 2.6+ KB


##### Note: 
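Based on the parameters of the project, the unified dataset is expected to carry the fields listed below. Of these, Fahrradstraßen itself only provides `bikelane_id`, `district` and `geometry`; the remaining fields have to be derived or joined from other tables:

- bikelane_id (an integer here, unlike the `way/43998936`-style strings in OSM, so linking to the OSM table may have to go through geometry)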
- district_id (link to core districts table in ERD)
- district (link to core districts table in ERD)
- neighborhood_id (link to core neighborhoods table in ERD)
- neighborhood (link to core neighborhoods table in ERD)
- geometry (link to OSM table (base dataset for this project))
- lane_type (same as cycleway in OSM)
- length
- condition (same as condition in OSM)
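
To see how far a given table is from this field list, the list can be compared against the table's columns. The sketch below is one possible way to do that; `TARGET_FIELDS` and `missing_fields` are hypothetical names introduced here, and the empty frame only mimics the normalised Fahrradstraßen columns.

```python
import pandas as pd

# Field list taken from the note above (hypothetical constant name).
TARGET_FIELDS = [
    "bikelane_id", "district_id", "district", "neighborhood_id",
    "neighborhood", "geometry", "lane_type", "length", "condition",
]


def missing_fields(df: pd.DataFrame, target=TARGET_FIELDS) -> list:
    """List the target fields that are not columns of df, in target order."""
    return [name for name in target if name not in df.columns]


# Empty frame with the normalised Fahrradstraßen column names.
frd_like = pd.DataFrame(columns=[
    "bikelane_id", "district", "street_name", "designation", "section", "geometry",
])

gaps = missing_fields(frd_like)
print(gaps)
```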

##### Fahrradstraßen Dataset – Column Translations & Overview
| German      | English      | Non-Null Count | Data Type | Description                                                                 |
|-------------|--------------|----------------|-----------|-----------------------------------------------------------------------------|
| `id`        | id           | 57             | int32     | Unique identifier for each entry                                            |
| `bezirk`    | district     | 57             | object    | Berlin district the bike street belongs to                                  |
| `strasse`   | street_name  | 57             | object    | Street name                                                                 |
| `freigabe`  | designation  | 57             | object    | Official designation as Fahrradstraße (the sample rows hold years, e.g. 2003) |
| `abschnitt` | section      | 33             | object    | Section of the street (may be missing if not applicable)                    |
| `geometry`  | geometry     | 57             | geometry  | Spatial geometry of the bike street segment                                 |


In [25]:
display(frd.head())

Unnamed: 0,id,bezirk,strasse,freigabe,abschnitt,geometry
0,1,Marzahn-Hellersdorf,Alberichstraße,2003,Hadubrandweg - Alfelder Straße,"MULTILINESTRING ((402133.284 5816807.644, 4021..."
1,2,Pankow,Norwegerstraße,2005,"Behmstraße - Bösebrücke Nr. 1 bis Nr. 6, Finnl...","MULTILINESTRING ((391486.356 5823711.95, 39150..."
2,3,Pankow,Schwedter Straße,2005,Gleimstraße - Schwedter Steg von Nr. 76 bis Nr...,"MULTILINESTRING ((391537.944 5823276.792, 3915..."
3,4,Lichtenberg,Orankeweg,2007,Hansastraße - Orankestraße,"MULTILINESTRING ((396415.785 5823069.715, 3964..."
4,5,Charlottenburg-Wilmersdorf,Teufelsseechaussee,2007,Teufelsseestraße - Grunewald (bzw. Nr. 13 und ...,"MULTILINESTRING ((380754.31 5817467.889, 38077..."


##### Column Name Normalisation

In [35]:
frd = frd.rename(columns={
    "id": "bikelane_id",
    "bezirk": "district",
    "strasse": "street_name",
    "freigabe": "designation",
    "abschnitt": "section"
})


print("Normalised Fahrradstraßen Columns:")
print(frd.columns)

Normalised Fahrradstraßen Columns:
Index(['bikelane_id', 'district', 'street_name', 'designation', 'section',
       'geometry'],
      dtype='object')
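
By default, `DataFrame.rename` silently ignores mapping keys that are not present, so re-running the rename cell on an already renamed frame does nothing and a typo in a source column name goes unnoticed. A hedged sketch of a stricter variant uses the `errors="raise"` parameter of `rename`. The toy frame is illustrative and `FRD_RENAMES` is a name introduced here for the mapping used above.

```python
import pandas as pd

# Same mapping as in the rename cell above (constant name is an assumption).
FRD_RENAMES = {
    "id": "bikelane_id",
    "bezirk": "district",
    "strasse": "street_name",
    "freigabe": "designation",
    "abschnitt": "section",
}

# Toy row with the original German column names.
toy = pd.DataFrame({
    "id": [2], "bezirk": ["Pankow"], "strasse": ["Norwegerstraße"],
    "freigabe": ["2005"], "abschnitt": [None],
})

# errors="raise" makes pandas fail if any source column is missing.
renamed = toy.rename(columns=FRD_RENAMES, errors="raise")

# A second run fails loudly because the German names are gone.
try:
    renamed.rename(columns=FRD_RENAMES, errors="raise")
    rerun_failed = False
except KeyError:
    rerun_failed = True

print(list(renamed.columns), rerun_failed)
```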


*Note on column names:* 
- *All three datasets have a `geometry` column, which will be used after data cleaning to cross-check and integrate them. `street_name` is not common to all three: the reduced OSM table does not have one.*


#### Radverkehrsanlagen 

In [37]:
print("Radverkehrsanlagen Bike Lanes Columns & Data Types")
rvkr.info()

Radverkehrsanlagen Bike Lanes Columns & Data Types
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 18641 entries, 0 to 18640
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   id         18641 non-null  object  
 1   importid   18641 non-null  int32   
 2   sobj_kz    18641 non-null  object  
 3   segm_segm  18641 non-null  object  
 4   segm_bez   2430 non-null   object  
 5   stst_str   18641 non-null  object  
 6   stor_name  18641 non-null  object  
 7   ortstl     18641 non-null  object  
 8   rva_typ    18641 non-null  object  
 9   sorvt_typ  18641 non-null  object  
 10  laenge     18641 non-null  int32   
 11  b_pflicht  18641 non-null  object  
 12  geometry   18641 non-null  geometry
dtypes: geometry(1), int32(2), object(10)
memory usage: 1.7+ MB


##### Radverkehrsanlagen Dataset – Column Translations & Overview

| German | English        | Non-Null Count | Data Type | Description                                                                 |
|-----------------|-------------------------|----------------|-----------|-----------------------------------------------------------------------------|
| `id`            | id                      | 18,641         | object    | Unique identifier for each entry                                            |
| `importid`      | import_id               | 18,641         | int32     | Internal import identifier                                                  |
| `sobj_kz`       | object_code             | 18,641         | object    | Object classification code                                                  |
| `segm_segm`     | segment_id              | 18,641         | object    | Segment identifier                                                          |
| `segm_bez`      | segment_name            | 2,430          | object    | Segment name/label (mostly missing)                                         |
| `stst_str`      | street_code             | 18,641         | object    | Encoded street identifier                                                   |
| `stor_name`     | district                | 18,641         | object    | District name (the sample rows hold `Marzahn-Hellersdorf`, a district rather than a street) |
| `ortstl`        | locality                | 18,641         | object    | Locality / neighborhood                                                     |
| `rva_typ`       | bike_facility_type      | 18,641         | object    | Type of bike infrastructure (e.g., lane, track, path)                       |
| `sorvt_typ`     | facility_subtype        | 18,641         | object    | Subtype / further classification of the bike facility                       |
| `laenge`        | length_m                | 18,641         | int32     | Length of the bike facility segment (meters)                                |
| `b_pflicht`     | mandatory_designation   | 18,641         | object    | Indicates whether use of the bike facility is mandatory (German traffic law, “Benutzungspflicht”); the sample rows hold `ja`/`nein` |
| `geometry`      | geometry                | 18,641         | geometry  | Spatial geometry of the bike facility                                       |


##### Column Name Normalisation

In [None]:
# Select and rename useful columns
rvkr_small = rvkr[[
    "id", "stor_name", "ortstl", "rva_typ", "laenge", "b_pflicht", "geometry", "sobj_kz"
]].copy()

rvkr_small = rvkr_small.rename(columns={
    "id": "bikelane_id",
    "stor_name": "district",               # values are district names, e.g. "Marzahn-Hellersdorf"
    "ortstl": "neighborhood",
    "rva_typ": "lane_type",
    "laenge": "length",
    "b_pflicht": "mandatory_designation",  # ja/nein flag, not a surface condition
    "sobj_kz": "object_code"               # object code such as "10-000415", not a district name
})

# Check the result
print(rvkr_small.columns)
rvkr_small.head()

Index(['bikelane_id', 'district', 'neighborhood', 'lane_type', 'length',
       'mandatory_designation', 'geometry', 'object_code'],
      dtype='object')


Unnamed: 0,bikelane_id,district,neighborhood,lane_type,length,mandatory_designation,geometry,object_code
0,b_radverkehrsanlagen.1,Marzahn-Hellersdorf,Kaulsdorf,Radwege,26,ja,"MULTILINESTRING ((403754.607 5818158.525, 403780.662 5818156.986))",10-000415
1,b_radverkehrsanlagen.2,Marzahn-Hellersdorf,Mahlsdorf,Radwege,16,ja,"MULTILINESTRING ((405403.96 5818088.543, 405410.213 5818087.293, 405415.187 5818087.199, 405420.175 5818087.782))",10-000039
2,b_radverkehrsanlagen.3,Marzahn-Hellersdorf,Mahlsdorf,Radwege,510,nein,"MULTILINESTRING ((404896.216 5818112.451, 404911.818 5818112.154, 404953.426 5818111.589, 404963.819 5818110.939, 405003.38 5818109.734, 405025.991 5818109.304, 405148.339 5818105.167, 405198.014 5818103.544, 405238.601 5818102.772, 405315.166 5818099.732, 405318.505 5818099.895, 405335.86 5818096.624, 405373.16 5818095.462, 405383.939 5818094.353, 405388.461 5818094.267, 405399.024 5818090.673, 405403.96 5818088.543))",10-000038
3,b_radverkehrsanlagen.4,Marzahn-Hellersdorf,Biesdorf,Radwege,76,ja,"MULTILINESTRING ((402391.01 5818565.356, 402415.129 5818554.064, 402456.963 5818527.768))",10-000011
4,b_radverkehrsanlagen.5,Marzahn-Hellersdorf,Biesdorf,Radwege,192,ja,"MULTILINESTRING ((402456.963 5818527.768, 402494.839 5818503.713, 402532.87 5818478.99, 402553.561 5818467.096, 402577.292 5818452.978, 402621.5 5818428.803))",10-000012
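
The values that come from `b_pflicht` are `ja`/`nein` strings in the rows shown above. If a true/false column is preferred in the unified dataset, a mapping like the following could be used. This is a sketch under the assumption that only these two values (plus missing values) occur; `FLAG_MAP` and `to_bool_flag` are hypothetical names, and anything unmapped becomes missing rather than raising.

```python
import pandas as pd

FLAG_MAP = {"ja": True, "nein": False}  # assumed to be the only valid codes


def to_bool_flag(values: pd.Series) -> pd.Series:
    """Map ja/nein strings to a nullable boolean Series (unknown values -> NA)."""
    cleaned = values.str.strip().str.lower()
    return cleaned.map(FLAG_MAP).astype("boolean")


# Toy values, including stray whitespace/case and a missing entry.
flags = pd.Series(["ja", "nein", "Ja ", None])
result = to_bool_flag(flags)
print(result)
```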


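*Note for later standardisation of columns:* 
- *The OSM `geometry` column holds `POLYGON ((...))` geometries with longitude/latitude coordinates (e.g. `13.60278 52.53787`).*
- *The Fahrradstraßen and Radverkehrsanlagen `geometry` columns hold `MULTILINESTRING ((...), (...))` geometries with projected coordinates (e.g. `402133.284 5816807.644`), so both the geometry type and the coordinate reference system have to be aligned before the datasets can be compared.*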
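
Because the OSM coordinates look like longitude/latitude while the Geoportal coordinates are in the hundreds of thousands, the tables are most likely in different coordinate reference systems. The sketch below shows one way to bring a polygon layer into the CRS of a line layer and reduce it to outlines. It assumes EPSG:4326 for OSM and EPSG:25833 for the Geoportal data, which should be confirmed via `osm.crs`, `frd.crs` and `rvkr.crs`. Taking polygon boundaries is only one possible way to get line geometries, and the toy geometries are made up.

```python
import geopandas as gpd
from shapely.geometry import MultiLineString, Polygon

ASSUMED_GEOPORTAL_CRS = "EPSG:25833"  # assumption; check frd.crs / rvkr.crs

# Toy OSM-like polygon in longitude/latitude (assumed EPSG:4326).
osm_like = gpd.GeoDataFrame(
    {"bikelane_id": ["way/1"]},
    geometry=[Polygon([(13.40, 52.50), (13.41, 52.50), (13.41, 52.51), (13.40, 52.50)])],
    crs="EPSG:4326",
)

# Toy Geoportal-like multi-line in projected coordinates.
geoportal_like = gpd.GeoDataFrame(
    {"bikelane_id": [1]},
    geometry=[MultiLineString([[(391486.0, 5823711.0), (391500.0, 5823720.0)]])],
    crs=ASSUMED_GEOPORTAL_CRS,
)

# Reproject OSM into the Geoportal CRS so coordinates are comparable.
osm_projected = osm_like.to_crs(geoportal_like.crs)

# Replace each polygon by its outline so both layers hold line geometries.
osm_outlines = osm_projected.copy()
osm_outlines["geometry"] = osm_projected.geometry.boundary

print(osm_outlines.geom_type.tolist())
print(osm_outlines.crs == geoportal_like.crs)
```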