# Gym Data Transformation

This notebook provides a step-by-step workflow to clean and map OpenStreetMap (OSM) gym data for further analysis or database import. It includes:
- Loading the most recent OSM gym export (CSV)
- Cleaning and standardizing the data
- Mapping gyms to Berlin districts via spatial join (GeoPandas)
- Exporting the cleaned and mapped dataset

> **All steps are self-contained and annotated in English.**

In [1]:
# === 1. Import Required Libraries ===
import os
import glob
import pandas as pd
import numpy as np
from datetime import datetime
import geopandas as gpd
from shapely.geometry import Point

# === 2. Define file paths ===
# Nothing is loaded in this cell; it only defines the paths used later.

# Glob pattern for the dated OSM gym exports (the newest match is loaded in the next cell)
osm_path = os.path.join('..', 'sources', 'gyms_osm_berlin_*.csv')

# Cleaned gym data (written in step 4, read again in step 5)
gyms_path = os.path.join('..', 'sources', 'gyms_cleaned_for_db.csv')

# Berlin districts GeoJSON
districts_path = os.path.join('..', 'sources', 'berlin_districts.geojson')

# Berlin neighborhoods GeoJSON
neighborhoods_path = os.path.join('..', 'sources', 'berlin_neighborhood.geojson')

# Final CSV path
final_csv_path = os.path.join('..', 'sources', 'gyms_ready_for_db.csv')

## 2. Load the Latest Exported OSM Data

This code block performs the following steps:

- **Searches** for all OSM gym export files in the `../sources` directory.
- **Identifies** the most recent file by extracting the date from each filename.
- **Loads** the latest file as a pandas DataFrame for further processing.
- **Raises an error** if no matching file is found.

This ensures your analysis always uses the most up-to-date OSM data export.

In [2]:
osm_files = glob.glob(osm_path)
if not osm_files:
    raise FileNotFoundError("No OSM gym export file found in ../sources/")

def extract_date(fname):
    basename = os.path.basename(fname)
    date_str = basename.replace('gyms_osm_berlin_', '').replace('.csv', '')
    return datetime.strptime(date_str, "%Y-%m-%d")

osm_files_sorted = sorted(osm_files, key=extract_date)
raw_file = osm_files_sorted[-1]  # most recent file
print(f"Loading OSM export: {raw_file}")
df = pd.read_csv(raw_file)

Loading OSM export: ../sources/gyms_osm_berlin_2025-09-26.csv
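
One caveat of the cell above: `extract_date` raises a `ValueError` as soon as any file matching the glob has a name that does not end in a `YYYY-MM-DD` date. Below is a minimal sketch of a more forgiving selection, assuming that stray files should simply be skipped. `pick_latest_export` is a hypothetical helper name, not part of the notebook, and the third path in the example is made up.

```python
import os
import re
from datetime import datetime

# Assumed filename pattern, derived from the glob used above.
EXPORT_RE = re.compile(r"^gyms_osm_berlin_(\d{4}-\d{2}-\d{2})\.csv$")

def pick_latest_export(paths):
    """Return the path with the newest date in its filename, or None."""
    dated = []
    for path in paths:
        match = EXPORT_RE.match(os.path.basename(path))
        if not match:
            continue  # skip files that do not follow the naming scheme
        try:
            date = datetime.strptime(match.group(1), "%Y-%m-%d")
        except ValueError:
            continue  # skip impossible dates such as month 13
        dated.append((date, path))
    return max(dated)[1] if dated else None

candidates = [
    "../sources/gyms_osm_berlin_2025-09-26.csv",
    "../sources/gyms_osm_berlin_2025-08-01.csv",
    "../sources/gyms_osm_berlin_backup.csv",  # hypothetical stray file
]
latest = pick_latest_export(candidates)
print(latest)
```

Returning `None` instead of raising leaves the decision to the caller, who can still raise `FileNotFoundError` as the notebook does.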


## 3. Clean and Standardize Gym Data

This code block prepares the OSM data for further analysis by:

- **Renaming columns** to a unified naming scheme.  
  For example:  
    - `leisure` → `type` (main type, e.g., fitness_centre)  
    - `sport` → `type_alt` (alternative type, e.g., yoga)

- **Merging gym types:**  
  If the main type (`type`) is missing, it uses the value from `type_alt`.

- **Cleaning and filling missing values:**  
  - Fills missing names with "Unknown Gym"
  - Fills missing addresses, postcodes, and other details with sensible defaults
  - Sets city to "Berlin" if missing
  - Normalizes website and phone fields
  - Ensures all latitude and longitude values are numeric
  - Fills unknown wheelchair access info with "unknown"
  - Ensures all `osm_id` values are strings

- **Prepares for future steps:**  
  Adds an empty `district_id` column as a placeholder.


In [3]:
# --- Column Renaming (adjust if your source has different names) ---
df = df.rename(columns={
    'name': 'name',
    'leisure': 'type',           # Main type (e.g. fitness_centre)
    'sport': 'type_alt',         # Backup type (e.g. yoga)
    'street': 'street',
    'housenumber': 'housenumber',
    'postcode': 'postcode',
    'city': 'city',
    'opening_hours': 'opening_hours',
    'phone': 'phone',
    'website': 'website',
    'wheelchair': 'wheelchair',
    'latitude': 'latitude',
    'longitude': 'longitude',
    'osm_id': 'osm_id',
    'osm_type': 'osm_type',
    'source': 'source'
})

# --- Type (merge type and type_alt if main is missing) ---
df['type'] = df['type'].fillna('')
df['type'] = np.where(df['type'] != '', df['type'], df['type_alt'])
df.drop(columns=['type_alt'], inplace=True)

# --- Fill other fields ---
df['name'] = df['name'].fillna('Unknown Gym')
df['street'] = df['street'].fillna('')
df['housenumber'] = df['housenumber'].fillna('')
df['postcode'] = df['postcode'].fillna('')
df['city'] = df['city'].fillna('Berlin')  # Default: Berlin

df['website'] = df['website'].fillna('').str.lower().str.strip()
df['phone'] = df['phone'].fillna('').str.strip()

df['opening_hours'] = df['opening_hours'].fillna('')
df['wheelchair'] = df['wheelchair'].fillna('unknown')

df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

df['osm_id'] = df['osm_id'].fillna('').astype(str)
df['osm_type'] = df['osm_type'].fillna('')
df['source'] = df['source'].fillna('OSM Overpass')

# --- Placeholder for future district_id assignment --- 
df['district_id'] = ''
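
To see the type fallback and the numeric coercion in isolation, here is a small sketch on a made-up three-row frame (the values are illustrative, not taken from the OSM export). It uses `Series.mask` in place of `np.where`; that is an equivalent alternative for this case, not the notebook's own method.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["fitness_centre", None, ""],
    "type_alt": ["fitness", "yoga", "climbing"],
    "latitude": ["52.52", "n/a", "52.49"],
})

# Treat missing and empty main types alike, then fall back to type_alt.
main = toy["type"].fillna("")
toy["type"] = main.mask(main == "", toy["type_alt"])

# Non-numeric coordinates become NaN instead of raising an error.
toy["latitude"] = pd.to_numeric(toy["latitude"], errors="coerce")

print(toy[["type", "latitude"]])
```

Rows whose latitude or longitude ends up as `NaN` cannot be placed in a district later, so it may be worth counting them before the spatial join.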


## 4. Save Cleaned Data for Further Processing

This code block does the following:

- **Reuses the output file path** for the cleaned data (`gyms_cleaned_for_db.csv` in the `../sources` directory), defined as `gyms_path` in the first cell.
- **Exports the cleaned DataFrame** to a CSV file at that location, without row indices.
- **Prints a confirmation message** with the file path and the number of rows saved.

This step stores your cleaned and standardized gym data for further processing or database import.


In [4]:
# gyms_path was defined in the first cell; reusing it avoids a second copy of the same path
df.to_csv(gyms_path, index=False)
print(f"Cleaned data saved to {gyms_path} ({len(df)} rows)")

Cleaned data saved to ../sources/gyms_cleaned_for_db.csv (444 rows)
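
When this CSV is read back, pandas infers column types again unless told otherwise, so numeric-looking text such as postcodes or `osm_id` can come back as numbers and empty strings as `NaN`. Here is a hedged sketch of a read that keeps those columns as text, shown on an in-memory CSV with made-up rows:

```python
import io
import pandas as pd

csv_text = (
    "osm_id,name,postcode,latitude\n"
    "123456,Example Gym,10115,52.53\n"
    "789,Unknown Gym,,52.49\n"
)

# dtype pins the text columns; keep_default_na=False keeps '' as ''.
df_back = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"osm_id": str, "postcode": str},
    keep_default_na=False,
)
print(df_back.dtypes)
```

`keep_default_na=False` applies to every column, so a numeric column that contains blanks would then load as text. Whether that trade-off is acceptable depends on the data.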


## 5. Spatial Join: Assign Districts to Gyms

This code performs the following steps:

- **Loads the cleaned gym data** from CSV into a pandas DataFrame.
- **Loads Berlin district and neighborhood boundaries** from GeoJSON files into GeoDataFrames.
- **Creates spatial Point geometries** for each gym from its longitude and latitude.
- **Performs a spatial join** between the gyms (as points) and the districts (as polygons), 
  assigning each gym to the district in which it is located.
- **Adds district information** to the gyms DataFrame:
  - `district_id`: the unique identifier for the district (`Schluessel_gesamt`)
  - `district`: the district name (`Gemeinde_name`)
- **Repeats the spatial join with the neighborhood polygons** and removes helper columns left over from both joins.

This step enriches the gym data with spatial context, allowing further analysis by Berlin district.
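
The `within` predicate used for the join asks, for each gym point, which polygon contains it. GeoPandas delegates that test to its geometry engine. The pure-Python ray-casting function below only illustrates the idea on a made-up square; it is not what the library runs, and it ignores edge cases such as points exactly on the boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how often a ray going right from (x, y)
    crosses the polygon boundary; an odd count means inside."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical "district": a square given in (lon, lat) order, like Point(lon, lat).
square = [(13.0, 52.0), (14.0, 52.0), (14.0, 53.0), (13.0, 53.0)]

print(point_in_polygon(13.4, 52.5, square))  # point inside the square
print(point_in_polygon(12.9, 52.5, square))  # point west of the square
```

Note the coordinate order: the notebook builds points as `(longitude, latitude)`, and swapping the two would leave every gym outside every district.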


In [5]:
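# ---- 1. Load cleaned gym data ----
# Read ID-like columns as text so they are not re-inferred as numbers
gyms_df = pd.read_csv(gyms_path, dtype={'osm_id': str, 'postcode': str})

# ---- 2. Load Berlin districts (GeoJSON) ----
# (the neighborhoods file is loaded in step 6 below)
districts_gdf = gpd.read_file(districts_path)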

# ---- 3. Build Point geometries for gyms ----
gyms_gdf = gpd.GeoDataFrame(
    gyms_df,
    geometry=[Point(xy) for xy in zip(gyms_df.longitude, gyms_df.latitude)],
    crs='EPSG:4326'
)

# ---- 4. Spatial join gyms with districts ----
gyms_with_district = gpd.sjoin(
    gyms_gdf,
    districts_gdf,
    how="left",
    predicate='within'
)

# ---- 5. Assign district_id and district name from joined data ----
gyms_with_district['district_id'] = gyms_with_district['Schluessel_gesamt']
gyms_with_district['district'] = gyms_with_district['Gemeinde_name']

# ---- 6. Load Berlin neighborhoods (GeoJSON) ----
neighborhoods_gdf = gpd.read_file(neighborhoods_path)

# Print columns for debugging
print("Neighborhoods columns:", neighborhoods_gdf.columns)

# ---- 7. Spatial join gyms_with_district with neighborhoods ----
gyms_with_all = gpd.sjoin(
    gyms_with_district,
    neighborhoods_gdf,
    how="left",
    predicate='within',
    lsuffix='_gym',
    rsuffix='_neigh'
)

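# ---- 8. Assign neighborhood_id and neighborhood from joined data ----
# The first candidate column that exists wins. 'OTEIL' is listed as a fallback
# because the neighborhoods file has that column and no 'neighborhood' column;
# that it holds the neighborhood name is an assumption about the GeoJSON.
for neigh_id_col in ['neighborhood_id_neigh', 'neighborhood_id']:
    if neigh_id_col in gyms_with_all.columns:
        gyms_with_all['neighborhood_id'] = gyms_with_all[neigh_id_col]
        break
for neigh_col in ['neighborhood_neigh', 'neighborhood', 'OTEIL']:
    if neigh_col in gyms_with_all.columns:
        gyms_with_all['neighborhood'] = gyms_with_all[neigh_col]
        break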
    
# ---- 9. Clean up: Remove duplicates and unwanted columns ----
unwanted_columns = [col for col in gyms_with_all.columns 
                    if col.endswith('.1') or col.endswith('_gym') or col.endswith('_neigh') 
                    or col in ['Schluessel_gesamt', 'Gemeinde_name', 'index_right', 'index_left']]
gyms_with_all = gyms_with_all.drop(columns=unwanted_columns, errors='ignore')

# district_id as str and without ".0"
gyms_with_all['district_id'] = gyms_with_all['district_id'].apply(
    lambda x: str(int(float(x))) if pd.notnull(x) and x != '' else None
)

# Keep only the target columns that already exist at this point.
# Note: source columns such as osm_id, street and housenumber are not in this
# list, so gyms_with_district loses them; the full frame stays in gyms_with_all.
final_columns = [
    'gym_id', 'district_id', 'name', 'address', 'postal_code', 'phone_number', 'email',
    'coordinates', 'latitude', 'longitude', 'neighborhood_id', 'neighborhood', 'district'
]
cols_existing = [col for col in final_columns if col in gyms_with_all.columns]

gyms_with_district = gyms_with_all[cols_existing]

Neighborhoods columns: Index(['gml_id', 'spatial_name', 'spatial_alias', 'spatial_type', 'OTEIL',
       'BEZIRK', 'FLAECHE_HA', 'geometry'],
      dtype='object')
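
The `district_id` clean-up in step 9 turns float-like values back into plain strings. Here is the same rule as a small named function, which is easier to test than an inline lambda. `normalize_id` is a hypothetical name and the sample values are made up.

```python
import math

def normalize_id(value):
    """Return an integer-like ID as a string without '.0', or None if missing."""
    if value is None or value == "":
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(int(float(value)))

print([normalize_id(v) for v in [11.0, "7", "", None, float("nan")]])
```

This rule, like the lambda it mirrors, drops leading zeros (`"07"` becomes `"7"`). That matters if the district key is zero-padded in the target table.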


## 6. Prepare Columns for Database Import

This code block prepares the DataFrame for database import by:

- **Renaming and creating columns** to match the target SQL table structure.
  - Combines street name and house number into a single address field.
  - Extracts latitude and longitude.
  - Sets empty values for columns not available in the source (e.g., email, neighborhood).

- **Defining the exact column order** to match the SQL schema:
  - Ensures the exported data will align perfectly with the database table, minimizing import issues.

- **Reordering the DataFrame columns** according to the SQL table.
  - This step also removes any unwanted or duplicate columns.

This ensures your data is cleanly formatted, named, and ordered for a smooth transition into the SQL database.


In [6]:
# --- Reorder and rename columns to match the SQL table structure ---

# Start from gyms_with_all: gyms_with_district was already reduced to the target
# columns in the previous cell and no longer has osm_id, street, housenumber, etc.
# (using it here raised KeyError: 'osm_id').
gyms_db = gyms_with_all.copy()

# Create new columns or adjust as needed to match SQL schema
gyms_db['gym_id'] = gyms_db['osm_id'].astype(str)
gyms_db['address'] = (
    gyms_db['street'].fillna('').astype(str) + ' ' + gyms_db['housenumber'].fillna('').astype(str)
).str.strip()
gyms_db['postal_code'] = gyms_db['postcode']
gyms_db['phone_number'] = gyms_db['phone']
gyms_db['email'] = ''  # No email in source; set empty or fill if available
gyms_db['coordinates'] = gyms_db['geometry'].apply(lambda geom: str(geom) if geom is not None else '')
if 'neighborhood' not in gyms_db.columns:
    gyms_db['neighborhood'] = ''  # No neighborhood name was assigned in the previous cell

# Define final column order (matching SQL table)
final_columns = [
    'gym_id',        # VARCHAR(20) PRIMARY KEY
    'district_id',   # VARCHAR(2)
    'name',          # VARCHAR(200)
    'address',       # VARCHAR(200)
    'postal_code',   # VARCHAR(10)
    'phone_number',  # VARCHAR(50)
    'email',         # VARCHAR(100)
    'coordinates',   # VARCHAR(200)
    'latitude',      # DECIMAL(9,6)
    'longitude',     # DECIMAL(9,6)
    'neighborhood',  # VARCHAR(100)
    'district'       # VARCHAR(100)
]

# Reorder DataFrame columns to match the SQL schema
gyms_final = gyms_db[final_columns].copy()
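
Since the column comments above record the target widths, a quick length check before export can catch values the database would reject or truncate. This is a sketch under the assumption that those widths are the real schema. The sample rows are made up and `too_long` is a hypothetical helper.

```python
import pandas as pd

# Widths copied from the column comments above (subset).
MAX_LEN = {"gym_id": 20, "district_id": 2, "name": 200, "postal_code": 10}

def too_long(frame, limits):
    """Return {column: number of values longer than the limit}."""
    report = {}
    for col, limit in limits.items():
        if col not in frame.columns:
            continue
        lengths = frame[col].fillna("").astype(str).str.len()
        count = int((lengths > limit).sum())
        if count:
            report[col] = count
    return report

sample = pd.DataFrame({
    "gym_id": ["123", "456"],
    "district_id": ["11", "11000001"],
    "name": ["Example Gym", "Another Gym"],
    "postal_code": ["10115", None],
})
print(too_long(sample, MAX_LEN))
```

An empty report means every checked column fits. A non-empty one names the columns to inspect before the import.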

## 7. Export Final DataFrame with Districts

This code block saves the final, cleaned, and enriched DataFrame to a CSV file:

- **Defines the output path** using `os.path.join` for consistency and portability.
- **Exports the DataFrame** to a CSV file (`gyms_with_district.csv`) in the `../sources` directory, without row indices.
- **Prints a confirmation message** with the file path, confirming successful export.

This ensures you have a ready-to-import CSV file containing all gym and district information, structured for database import or further analysis.


In [None]:
# --- Save the final DataFrame to CSV ---
CSV_PATH = os.path.join('..', 'sources', 'gyms_with_district.csv')
gyms_final.to_csv(CSV_PATH, index=False)
print(f"Columns reordered and renamed for SQL import. CSV file saved as '{CSV_PATH}'.")

Columns reordered and renamed for SQL import. CSV file saved as '../sources/gyms_with_district.csv'.


# **Done!**

- The OSM gym data is now cleaned and mapped to Berlin districts.
- Next steps: The CSV can now be used for database import or further analysis.
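
For the database import mentioned above, here is a minimal sketch using Python's built-in `sqlite3` and `csv` modules. The table name `gyms`, the in-memory database, and the sample row are assumptions for illustration. The column types follow the SQL comments in step 6 (only a subset is shown), and a real target database will have its own import tooling.

```python
import csv
import io
import sqlite3

# Stand-in for the exported CSV (made-up row, subset of columns).
csv_text = (
    "gym_id,district_id,name,latitude,longitude\n"
    "123456,11,Example Gym,52.53,13.38\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gyms (
        gym_id      VARCHAR(20) PRIMARY KEY,
        district_id VARCHAR(2),
        name        VARCHAR(200),
        latitude    DECIMAL(9,6),
        longitude   DECIMAL(9,6)
    )
    """
)

# DictReader yields one dict per row; named placeholders map keys to columns.
rows = list(csv.DictReader(io.StringIO(csv_text)))
conn.executemany(
    "INSERT INTO gyms VALUES (:gym_id, :district_id, :name, :latitude, :longitude)",
    rows,
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM gyms").fetchone()[0]
print(row_count)
```

SQLite does not enforce `VARCHAR` lengths, so this sketch only checks that the column layout lines up, not the width limits.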