### Step 1: Research & Data Modelling  
PR Branch Name: clubs-data-modelling  

This notebook documents Step 1 of the "Clubs & Social Activities in Berlin" project and continues into Step 2 (data transformation). Step 1 covers:  

- **1.1 Data Source Discovery**  
- **1.2 Modelling & Planning**  
- **1.3 Prepare the /sources Directory**  
- **1.4 Review**  

---

### Goal  
- Identify and document relevant data sources.  
- Select the key parameters for our use case.  
- Draft the planned table schema.  
- Plan cleaning and transformation steps before database population.  

---

## 1.1 Data Source Discovery  

**Topic:** Clubs & Social Activities in Berlin  

**Main source:**  
- **Name:** OpenStreetMap (OSM) via OSMnx / Overpass API  
- **Source and origin:** Public crowdsourced geospatial database  
- **Update frequency:** Continuous (dynamic)  
- **Data type:** Dynamic (API query using tags such as `club=*`, `leisure=*`, `sport=*`, `amenity=community_centre`)  
- **Reason for selection:**  
  - Covers a wide variety of sports clubs, cultural clubs, and social activity centers in Berlin  
  - Includes geospatial data (coordinates, polygons), names, addresses, and attributes  
  - Open, free, and queryable programmatically  

**Optional additional sources:**  
- **Name:** Berliner Turn- und Freizeitsport-Bund (BTFB)  (https://btfb.de/vereinsservice/vereinssuche/#Vereine-im-Portrait)
  - Source: Official Berlin sports association website  
  - Type: Static (manual export / scraping)  
  - Use: Provides official structured list of sports clubs in Berlin  

- **Name:** Berlin Open Data Portal (daten.berlin.de)  
  - Source: Berlin city government  
  - Type: Static or semi-static (CSV, GeoJSON)  
  - Use: Enrichment with official district boundaries or metadata  

**Enrichment potential:**  
- Use Berlin shapefiles (districts, neighborhoods) for spatial joins.  


---

## 1.2 Modelling & Planning  

**Key Parameters (planned):**  
- Identification: `name`, `club`, `category`, `subcategory`  
- Location: `address`, `district`, `geometry (lat/lon)`  
- Contact: `website`, `phone`, `email`  
- Attributes: `opening_hours`, `membership`, `fees`, `sport` / `leisure type`  
- Metadata: `source`, `last_updated`  

**Integration with existing tables:**  
- Join on `district_id` from the Berlin districts reference table.  
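
A minimal sketch of that join, assuming the reference table exposes a `district_name` column next to `district_id`. The column name and the sample rows are hypothetical; only `district_id` is named above.

```python
import pandas as pd

# Hypothetical excerpt of the districts reference table.
districts = pd.DataFrame({
    "district_id": ["11001001", "11002002"],
    "district_name": ["Mitte", "Friedrichshain-Kreuzberg"],
})

# Hypothetical club rows that already carry a district_id.
clubs = pd.DataFrame({
    "name": ["Karame e.V.", "Umspannwerk"],
    "district_id": ["11001001", "11002002"],
})

# validate="many_to_one" raises if district_id is not unique in the reference table.
joined = clubs.merge(districts, on="district_id", how="left", validate="many_to_one")
print(joined)
```

Both sides need the same dtype for `district_id` (string on both sides here), otherwise the merge finds no matches or raises.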


**Planned table schema:**  
```sql
CREATE TABLE berlin_clubs (
    club_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    club TEXT,
    leisure TEXT,
    sport TEXT,
    amenity TEXT,
    street TEXT,
    housenumber TEXT,
    postcode TEXT,
    district TEXT,
    city TEXT,
    country TEXT,
    district_id INT REFERENCES berlin_districts(district_id),
    latitude FLOAT NOT NULL,
    longitude FLOAT NOT NULL,
    website TEXT,
    phone TEXT,
    email TEXT,
    opening_hours TEXT,
    wheelchair TEXT
);
```

In [138]:
# Install Libraries

# !pip install osmnx geopandas

In [139]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [140]:
ox.settings.use_cache = False

In [141]:
# Define multiple tags
# tags = {
    # "amenity": ["community_centre", "arts_centre", "youth_centre", "music_school"],
    # "leisure": ["sports_centre", "fitness_centre", "dance"],
    # "club": True  # will capture all club types
# }
tags = {
    "amenity": [
        "community_centre", "arts_centre", "social_centre",
        "youth_centre", "social_club", "music_school", "events_venue",
        "music_venue",
        "dojo", "dancing_school", "studio",
        "theatre",
    ],
    "leisure": [
        "sports_centre", "fitness_centre", "dance",
        "hackerspace", "music_venue", "garden",
    ],
    "club": True,  # matches every club=* value
}
clubs_gdf = ox.features_from_place("Berlin, Germany", tags)

print(clubs_gdf.head())
print(len(clubs_gdf), "clubs/activities found in Berlin")

                                   geometry       amenity     contact:phone  \
element id                                                                    
node    30012753  POINT (13.42919 52.49404)  events_venue  +49 30 338402320   
        60775321  POINT (13.48162 52.53862)           NaN               NaN   
        66917094  POINT (13.38888 52.52392)       theatre               NaN   
        66917098  POINT (13.38862 52.52362)       theatre   +49 30 27879030   
        66917115  POINT (13.38851 52.52067)       theatre   +49 30 203000-0   

                                       contact:website  \
element id                                               
node    30012753  http://www.umspannwerk-kreuzberg.de/   
        60775321                                   NaN   
        66917094                                   NaN   
        66917098                                   NaN   
        66917115                                   NaN   

                                     na

In [142]:
clubs_gdf = clubs_gdf.to_crs(epsg=4326)

In [143]:
# Replace non-point geometries (ways, relations) with a point that lies inside them
clubs_gdf['geometry'] = clubs_gdf['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())
# Extract latitude and longitude
clubs_gdf["latitude"] = clubs_gdf.geometry.y
clubs_gdf["longitude"] = clubs_gdf.geometry.x
clubs_gdf

element,id,geometry,amenity,contact:phone,contact:website,name,wheelchair,addr:housenumber,addr:street,club,addr:city,...,construction,manufacturer,monitoring:harvesting,type,not:name,length,maxdepth,communication:amateur_radio:pota,latitude,longitude
node,30012753,POINT (13.42919 52.49404),events_venue,+49 30 338402320,http://www.umspannwerk-kreuzberg.de/,Umspannwerk,yes,,,,,...,,,,,,,,,52.494042,13.429187
node,60775321,POINT (13.48162 52.53862),,,,KW76,,76,Konrad-Wolf-Straße,poker,,...,,,,,,,,,52.538623,13.481623
node,66917094,POINT (13.38888 52.52392),theatre,,,Friedrichstadt-Palast,yes,107,Friedrichstraße,,Berlin,...,,,,,,,,,52.523922,13.388879
node,66917098,POINT (13.38862 52.52362),theatre,+49 30 27879030,,Quatsch Comedy Club,limited,107,Friedrichstraße,,Berlin,...,,,,,,,,,52.523624,13.388621
node,66917115,POINT (13.38851 52.52067),theatre,+49 30 203000-0,,Kabarett-Theater Distel,yes,101,Friedrichstraße,,Berlin,...,,,,,,,,,52.520667,13.388505
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
way,1428999282,POINT (13.3565 52.56115),,,,,,,,,,...,,,,,,,,,52.561152,13.356499
way,1428999283,POINT (13.35644 52.56128),,,,,,,,,,...,,,,,,,,,52.561283,13.356439
way,1428999284,POINT (13.35655 52.56104),,,,,,,,,,...,,,,,,,,,52.561039,13.356547
way,1429788653,POINT (13.42016 52.47043),,,,Nanowald,,,,,,...,,,,,,,,,52.470430,13.420161


In [144]:
print(clubs_gdf.notnull().sum().sort_values(ascending=False).head(30))


geometry                6495
latitude                6495
longitude               6495
leisure                 4633
name                    3400
garden:type             2168
access                  2151
addr:street             1926
addr:housenumber        1910
addr:postcode           1843
addr:city               1800
amenity                 1614
addr:country            1298
addr:suburb             1284
website                 1237
wheelchair              1071
sport                    952
operator                 904
contact:website          772
building                 761
opening_hours            688
check_date               535
phone                    505
community_centre         480
contact:phone            435
club                     384
building:levels          334
contact:email            308
community_centre:for     308
wikidata                 299
dtype: int64
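
The raw counts above are easier to compare as shares of all rows. A small sketch on toy data follows; the `min_share` threshold is a hypothetical value for illustration, not a rule used in this notebook.

```python
import pandas as pd

# Toy frame standing in for clubs_gdf.
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "website": [None, "https://example.org", None, None],
    "latitude": [52.5, 52.4, 52.6, 52.5],
})

# Share of non-null values per column, highest first.
coverage = toy.notna().mean().sort_values(ascending=False)

min_share = 0.5  # hypothetical cut-off
keep = coverage[coverage >= min_share].index.tolist()
print(coverage)
print(keep)
```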


In [145]:
# Select important columns
important_cols = [
    "name",               
    "club",                 
    "leisure",             
    "sport",                
    "amenity",               
    "addr:street",           
    "addr:housenumber",
    "addr:suburb",      
    "addr:postcode",         
    "addr:city",
    "addr:country",            
    "website",              
    "phone",             
    "email",               
    "opening_hours",         
    "geometry" ,
    "wheelchair",
    "latitude",
    "longitude"              
]

In [146]:
clubs_df = clubs_gdf[important_cols].copy()

print(clubs_df.head(10))

                                      name   club leisure sport       amenity  \
element id                                                                      
node    30012753               Umspannwerk    NaN     NaN   NaN  events_venue   
        60775321                      KW76  poker     NaN   NaN           NaN   
        66917094     Friedrichstadt-Palast    NaN     NaN   NaN       theatre   
        66917098       Quatsch Comedy Club    NaN     NaN   NaN       theatre   
        66917115   Kabarett-Theater Distel    NaN     NaN   NaN       theatre   
        66917188            Admiralspalast    NaN     NaN   NaN       theatre   
        79808389             Die Wühlmäuse    NaN     NaN   NaN       theatre   
        173985100   HAU 2 (Hebbel am Ufer)    NaN     NaN   NaN       theatre   
        229948256              Sophiensæle    NaN     NaN   NaN       theatre   
        257709121       Kulturhaus Spandau    NaN     NaN   NaN   arts_centre   

                          a

In [147]:
# Rename map for only the columns that need renaming
rename_map = {
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "addr:country": "country",
    "addr:suburb": "district"
}

In [148]:
# Rename the columns
clubs_df = clubs_df.rename(columns=rename_map)

In [149]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,district,postcode,city,country,website,phone,email,opening_hours,geometry,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,,,,,POINT (13.42919 52.49404),yes,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,,,,,POINT (13.48162 52.53862),,52.538623,13.481623
node,66917094,Friedrichstadt-Palast,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.palast.berlin/,+49 30 23262326,,,POINT (13.38888 52.52392),yes,52.523922,13.388879
node,66917098,Quatsch Comedy Club,,,,theatre,Friedrichstraße,107.0,Mitte,10117.0,Berlin,DE,https://www.quatsch-comedy-club.de/,,,,POINT (13.38862 52.52362),limited,52.523624,13.388621
node,66917115,Kabarett-Theater Distel,,,,theatre,Friedrichstraße,101.0,Mitte,10117.0,Berlin,DE,http://www.distel-berlin.de,,,,POINT (13.38851 52.52067),yes,52.520667,13.388505


In [150]:
for col in [ "name", "club", "leisure", "sport", "amenity"]:
    print(f"\n--- {col.upper()} ---")
    print(clubs_df[col].dropna().unique())


--- NAME ---
['Umspannwerk' 'KW76' 'Friedrichstadt-Palast' ... 'Gemeinschaftsbeet'
 'Hochbeet Annie Heuser Waldorfschule'
 'Begegnungszentrum im Kölner Viertel']

--- CLUB ---
['poker' 'scout' 'sport' 'social' 'yes' 'dance' 'amateur_radio'
 'automobile' 'fishing' 'Körperschaft_des_Öffentlichen_Rechts' 'culture'
 'fan' 'animals' 'elderly' 'bonsai' 'dog' 'freemasonry' 'student'
 'business' 'game' 'music' 'ethnic' 'Agrarbörse Deutschland Ost' 'linux'
 'history' 'education' 'computer' 'religion' 'art' 'politics'
 'board_games' 'youth_movement' 'archive' 'chess' 'sailing' 'science'
 'humanist' 'charity' 'nature' 'hdk_0' 'youth' 'academic' 'motorcycle'
 'allotment_club' 'allotments' 'TC Berolina Biesdorf' 'gardening']

--- LEISURE ---
['hackerspace' 'fitness_centre' 'sports_centre' 'garden' 'dance'
 'music_venue' 'pitch' 'playground' 'stadium' 'ice_rink' 'marina' 'track']

--- SPORT ---
['bowling' '10pin' 'rowing' 'fitness' 'soccer' 'yoga' 'pilates'
 'gymnastics' 'hapkido' 'karate' 'swimmin

In [151]:
print(clubs_df["amenity"].unique())

['events_venue' nan 'theatre' 'arts_centre' 'community_centre'
 'social_centre' 'studio' 'dojo' 'music_school' 'pub' 'nightclub'
 'dancing_school' 'cafe' 'restaurant' 'music_venue' 'bicycle_parking'
 'photo_booth' 'school' 'social_club']


In [152]:
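# Define lists of allowed values for the 'amenity' and 'leisure' tags.
# Keep only rows that match one of these values or have a non-empty 'club' tag.
# This filters out irrelevant OSM features such as restaurants and pubs.
allowed_amenities = [
    'arts_centre', 'community_centre', 'events_venue', 'music_venue',
    'social_centre', 'studio', 'dojo', 'music_school',
    'social_club', 'dancing_school'
]

# Commas are required: adjacent string literals without commas are concatenated
# into one string, which would match no 'leisure' value at all.
# The outputs recorded in this notebook come from a run with the un-separated
# list, so row counts will differ once the leisure filter takes effect.
allowed_leisure = [
    'hackerspace', 'fitness_centre', 'sports_centre', 'garden', 'dance',
    'music_venue', 'pitch', 'playground', 'stadium', 'ice_rink', 'marina', 'track'
]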



clubs_df = clubs_df[
    (clubs_df['amenity'].isin(allowed_amenities)) |
    (clubs_df['leisure'].isin(allowed_leisure)) |
    (clubs_df['club'].notna())
]

exclude = ["theatre", "pub", "cafe", "bar", "nightclub", "restaurant"]
clubs_df = clubs_df[~clubs_df["amenity"].isin(exclude)]

print(clubs_df[['name', 'amenity', 'leisure', 'club']].head())
print(len(clubs_df), "clubs/activities after filtering")

                                   name           amenity leisure   club
element id                                                              
node    30012753            Umspannwerk      events_venue     NaN    NaN
        60775321                   KW76               NaN     NaN  poker
        257709121    Kulturhaus Spandau       arts_centre     NaN    NaN
        266630320  Buergeramt Mahlsdorf  community_centre     NaN    NaN
        268915262           Karame e.V.  community_centre     NaN    NaN
1731 clubs/activities after filtering
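
The three-way filter can be checked on a toy table before it is trusted on the full dataset. This is a sketch with made-up rows and shortened allow-lists, not the notebook's actual lists.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Hall", "Pub", "Poker club", "Gym"],
    "amenity": ["community_centre", "pub", None, None],
    "leisure": [None, None, None, "fitness_centre"],
    "club": [None, None, "poker", None],
})

allowed_amenities = ["community_centre", "arts_centre"]
allowed_leisure = ["fitness_centre", "sports_centre"]

# A row is kept if any one of the three conditions holds.
keep = (
    toy["amenity"].isin(allowed_amenities)
    | toy["leisure"].isin(allowed_leisure)
    | toy["club"].notna()
)
filtered = toy[keep]
print(filtered["name"].tolist())

# Pitfall: without a comma, Python joins adjacent string literals.
merged_literals = ["fitness_centre" "sports_centre"]
print(len(merged_literals))
```

The last two lines show why every list entry needs a comma: the list holds a single merged string, so `isin` would never match.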


In [153]:
# Check for unwanted values in 'amenity' column
check_values = ["theatre", "pub", "cafe", "bar", "nightclub", "restaurant"]


if "amenity" in clubs_df.columns:
    found = clubs_df["amenity"].dropna().unique()
    print("Unique amenity values currently in the dataset:")
    print(found)

    
    unwanted = set(found).intersection(check_values)
    if unwanted:
        print("\n⚠️ Unwanted values found:", unwanted)
    else:
        print("\n✅ No unwanted values present in 'amenity'.")
else:
    print("No 'amenity' column found in dataset.")

Unique amenity values currently in the dataset:
['events_venue' 'arts_centre' 'community_centre' 'social_centre' 'studio'
 'dojo' 'music_school' 'dancing_school' 'music_venue' 'photo_booth'
 'social_club']

✅ No unwanted values present in 'amenity'.


In [154]:

clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1731 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           1685 non-null   object  
 1   club           379 non-null    object  
 2   leisure        63 non-null     object  
 3   sport          226 non-null    object  
 4   amenity        1401 non-null   object  
 5   street         1124 non-null   object  
 6   housenumber    1114 non-null   object  
 7   district       738 non-null    object  
 8   postcode       1072 non-null   object  
 9   city           1057 non-null   object  
 10  country        744 non-null    object  
 11  website        717 non-null    object  
 12  phone          282 non-null    object  
 13  email          174 non-null    object  
 14  opening_hours  374 non-null    object  
 15  geometry       1731 non-null   geometry
 16  wheelchair     457

In [155]:
clubs_df = clubs_df.drop_duplicates()
clubs_df = clubs_df.drop_duplicates(subset=['name', 'street', 'housenumber'])
clubs_df = clubs_df.dropna(subset=['name', 'geometry'])
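
One caveat with the address-based de-duplication: pandas treats missing values as equal when comparing duplicate keys, so two different venues that share a name and have no address are collapsed into one. The sketch below shows the effect on toy data and one possible alternative that only de-duplicates rows with a complete address. That alternative is an assumption for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Club A", "Club A", "Club B", "Club B"],
    "street": ["Main St", "Main St", np.nan, np.nan],
    "housenumber": ["1", "1", np.nan, np.nan],
})
key = ["name", "street", "housenumber"]

# Same call as in the notebook: rows 2 and 3 count as duplicates of each other.
naive = toy.drop_duplicates(subset=key)

# Alternative: de-duplicate only where the address is complete.
has_address = toy[["street", "housenumber"]].notna().all(axis=1)
careful = pd.concat([
    toy[has_address].drop_duplicates(subset=key),
    toy[~has_address],
]).sort_index()

print(len(naive), len(careful))
```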

In [156]:

clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1658 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           1658 non-null   object  
 1   club           345 non-null    object  
 2   leisure        62 non-null     object  
 3   sport          215 non-null    object  
 4   amenity        1356 non-null   object  
 5   street         1105 non-null   object  
 6   housenumber    1095 non-null   object  
 7   district       730 non-null    object  
 8   postcode       1056 non-null   object  
 9   city           1042 non-null   object  
 10  country        736 non-null    object  
 11  website        703 non-null    object  
 12  phone          279 non-null    object  
 13  email          173 non-null    object  
 14  opening_hours  369 non-null    object  
 15  geometry       1658 non-null   geometry
 16  wheelchair     455

In [157]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,district,postcode,city,country,website,phone,email,opening_hours,geometry,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,,,,,POINT (13.42919 52.49404),yes,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,,,,,POINT (13.48162 52.53862),,52.538623,13.481623
node,257709121,Kulturhaus Spandau,,,,arts_centre,,,,,,,,,,,POINT (13.20231 52.53548),yes,52.535479,13.202312
node,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91.0,Mahlsdorf,12623.0,Berlin,DE,,,,,POINT (13.61206 52.51314),yes,52.51314,13.612063
node,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22.0,Moabit,10551.0,Berlin,DE,,,,"Mo-Fr 13:00-18:00; Sa-Su,PH off",POINT (13.34154 52.531),no,52.531002,13.341544


## Geometry sanity checks

In [158]:
print("Missing geometries:", clubs_df.geometry.isna().sum())

Missing geometries: 0


In [159]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", clubs_df["latitude"].min(), "to", clubs_df["latitude"].max())

print("Longitude range:", clubs_df["longitude"].min(), "to", clubs_df["longitude"].max())

Latitude range: 52.37387955 to 52.6448252
Longitude range: 13.12237797012892 to 13.7311336
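
The range check can be turned into a reusable test that lists the offending rows. In this sketch the bounding box values are rough assumptions chosen to enclose Berlin, not official limits, and `outside_bbox` is a hypothetical helper name.

```python
import pandas as pd

# Approximate bounding box around Berlin (assumed values for illustration).
BERLIN_BBOX = {"lat_min": 52.33, "lat_max": 52.68, "lon_min": 13.08, "lon_max": 13.77}

def outside_bbox(df, bbox=BERLIN_BBOX):
    """Return the rows whose coordinates fall outside the bounding box."""
    inside = (
        df["latitude"].between(bbox["lat_min"], bbox["lat_max"])
        & df["longitude"].between(bbox["lon_min"], bbox["lon_max"])
    )
    return df[~inside]

toy = pd.DataFrame({
    "name": ["ok", "swapped"],
    "latitude": [52.52, 13.40],
    "longitude": [13.40, 52.52],  # second row has lat/lon swapped
})
suspicious = outside_bbox(toy)
print(suspicious)
```

A swapped latitude/longitude pair, the most common conversion mistake, is caught immediately by this kind of check.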


## 1.3 Prepare the /sources Directory
### Raw Data Files:

- **clubs_raw.geojson** (includes geometry)
- **clubs_raw.csv** (tabular only, no geometry)
- **README.md** in /sources will contain:
  - Data sources used
  - Planned transformation steps

In [160]:
# Save locally
clubs_gdf.to_file("clubs_raw.geojson", driver="GeoJSON")
clubs_gdf.drop(columns="geometry").to_csv("clubs_raw.csv", index=False)  # tabular export without the geometry column



### Step 2: Data Transformation

In [161]:
# Standardize column names

clubs_df.columns = clubs_df.columns.str.lower().str.strip().str.replace(" ", "_").str.replace("-", "_")

# Map wheelchair "yes"/"no" to True/False; any other value (e.g. "limited") becomes missing

clubs_df["wheelchair"] = clubs_df["wheelchair"].map({"yes": True, "no": False})

In [162]:
print(clubs_df.dtypes)

name               object
club               object
leisure            object
sport              object
amenity            object
street             object
housenumber        object
district           object
postcode           object
city               object
country            object
website            object
phone              object
email              object
opening_hours      object
geometry         geometry
wheelchair         object
latitude          float64
longitude         float64
dtype: object


## Drop irrelevant / redundant columns

In [163]:
clubs_df.drop(columns=["city","district" ,"country"], inplace=True)

## Normalize categories

In [164]:
clubs_df["wheelchair"] = clubs_df["wheelchair"].fillna("unknown").astype(str).str.strip().str.lower()
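
The yes/no mapping followed by the fill with `unknown` folds values such as `limited` into `unknown`. If that distinction matters, one option is to normalize the text and keep a small set of known values. This is a sketch on toy data; the `known` set is an assumption about which values are worth keeping.

```python
import pandas as pd

raw = pd.Series(["yes", "no", "limited", None, " Yes "])

# Assumed set of wheelchair values to preserve.
known = {"yes", "no", "limited", "designated"}

# Trim and lower-case first, then replace everything else with "unknown".
clean = raw.str.strip().str.lower()
clean = clean.where(clean.isin(known), "unknown")
print(clean.tolist())
```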

In [165]:

clubs_df["opening_hours"] = clubs_df["opening_hours"].fillna("unknown").astype(str).str.strip().str.lower()



In [166]:
clubs_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1658 entries, ('node', np.int64(30012753)) to ('way', np.int64(1423837870))
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           1658 non-null   object  
 1   club           345 non-null    object  
 2   leisure        62 non-null     object  
 3   sport          215 non-null    object  
 4   amenity        1356 non-null   object  
 5   street         1105 non-null   object  
 6   housenumber    1095 non-null   object  
 7   postcode       1056 non-null   object  
 8   website        703 non-null    object  
 9   phone          279 non-null    object  
 10  email          173 non-null    object  
 11  opening_hours  1658 non-null   object  
 12  geometry       1658 non-null   geometry
 13  wheelchair     1658 non-null   object  
 14  latitude       1658 non-null   float64 
 15  longitude      1658 non-null   float64 
dtypes: float64(2), geo

In [167]:
clubs_df.head()

element,id,name,club,leisure,sport,amenity,street,housenumber,postcode,website,phone,email,opening_hours,geometry,wheelchair,latitude,longitude
node,30012753,Umspannwerk,,,,events_venue,,,,,,,unknown,POINT (13.42919 52.49404),true,52.494042,13.429187
node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,,,,unknown,POINT (13.48162 52.53862),unknown,52.538623,13.481623
node,257709121,Kulturhaus Spandau,,,,arts_centre,,,,,,,unknown,POINT (13.20231 52.53548),true,52.535479,13.202312
node,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91.0,12623.0,,,,unknown,POINT (13.61206 52.51314),true,52.51314,13.612063
node,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22.0,10551.0,,,,"mo-fr 13:00-18:00; sa-su,ph off",POINT (13.34154 52.531),false,52.531002,13.341544


### Add district and district_id to the data frame

In [168]:
# conda install -c conda-forge geopy

In [169]:
import geopandas as gpd

gdf_clubs = gpd.GeoDataFrame(
    clubs_df,  # cleaned clubs table with latitude/longitude columns
    geometry=gpd.points_from_xy(clubs_df.longitude, clubs_df.latitude),
    crs="EPSG:4326"
)


neighborhoods = gpd.read_file(
    "lor_ortsteile.geojson"
).to_crs("EPSG:4326")


# Harmonize column names coming from the GeoJSON
neighborhoods = neighborhoods.rename(columns={
    "BEZIRK": "district",
    "OTEIL": "neighborhood",
    "spatial_name": "neighborhood_id"
})

gdf_with_districts = gpd.sjoin(
    gdf_clubs,
    neighborhoods[["district", "neighborhood_id", "neighborhood", "geometry"]],
    how="left",
    predicate="within"
)

df_final = gdf_with_districts.drop(columns=["index_right"])
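
A left spatial join can leave points without a match (outside every polygon) and can repeat a point that falls inside more than one polygon. The sketch below uses a plain pandas frame as a stand-in for the join result to show how both cases could be detected; the rows are made up.

```python
import pandas as pd

# Stand-in for a left-join result: the index identifies the original club row.
joined = pd.DataFrame(
    {
        "name": ["A", "B", "B", "C"],
        "district": ["Mitte", "Pankow", "Spandau", None],
    },
    index=[0, 1, 1, 2],
)

# Clubs that matched no polygon.
unmatched = joined[joined["district"].isna()]

# Clubs that matched more than one polygon (their index value repeats).
multi = joined[joined.index.duplicated(keep=False)]

print(len(unmatched), len(multi))
```

Running the same two checks on `gdf_with_districts` before dropping `index_right` would confirm that the row count is unchanged by the join.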

In [170]:
df_final = df_final.reset_index()

# Rename the "id" column to "club_id"

df_final = df_final.rename(columns={"id": "club_id"})

# Change club_id column type to string

df_final["club_id"] = df_final["club_id"].astype(str)

In [171]:
df_final.head()

Unnamed: 0,element,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,...,phone,email,opening_hours,geometry,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood
0,node,30012753,Umspannwerk,,,,events_venue,,,,...,,,unknown,POINT (13.42919 52.49404),true,52.494042,13.429187,Friedrichshain-Kreuzberg,202,Kreuzberg
1,node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76.0,,...,,,unknown,POINT (13.48162 52.53862),unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen
2,node,257709121,Kulturhaus Spandau,,,,arts_centre,,,,...,,,unknown,POINT (13.20231 52.53548),true,52.535479,13.202312,Spandau,501,Spandau
3,node,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91.0,12623.0,...,,,unknown,POINT (13.61206 52.51314),true,52.51314,13.612063,Marzahn-Hellersdorf,1004,Mahlsdorf
4,node,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22.0,10551.0,...,,,"mo-fr 13:00-18:00; sa-su,ph off",POINT (13.34154 52.531),false,52.531002,13.341544,Mitte,102,Moabit


In [172]:
# Reverse geocoding
import logging
import time

import requests

def get_address(lat, lon):
    """Retrieve full formatted address from Nominatim"""
    url = "https://nominatim.openstreetmap.org/reverse"
    params = {"lat": lat, "lon": lon, "format": "json", "addressdetails": 1}
    headers = {"User-Agent": "berlin-venues-scraper/1.0"}
    try:
        r = requests.get(url, params=params, headers=headers, timeout=10)
        r.raise_for_status()
        data = r.json()
        return data.get("display_name")
    except requests.exceptions.RequestException as e:
        logging.warning(f"Error fetching address for ({lat}, {lon}): {e}")
        return None

# Apply reverse geocoding with throttling (to respect Nominatim usage policy)
full_addresses = []
for i, row in df_final.iterrows():
    print(f"fetching missing data for {i}")
    lat, lon = row["latitude"], row["longitude"]
    if pd.notna(lat) and pd.notna(lon):
        
        full_addresses.append(get_address(lat, lon))
        time.sleep(1)  # polite delay between requests
    else:
        
        full_addresses.append(None)


df_final["full_address"] = full_addresses

fetching missing data for 0
fetching missing data for 1
fetching missing data for 2
...
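
With one request per second, a full run over all rows takes a long time, and a re-run repeats every request. A small cache keyed on rounded coordinates avoids repeated lookups. This is a sketch: `make_cached_lookup` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for `get_address` so the example runs without network access.

```python
import time

def make_cached_lookup(fetch, min_interval=1.0, precision=5):
    """Wrap a fetch(lat, lon) function with a cache and a minimum delay between calls."""
    cache = {}
    last_call = [None]  # time of the previous real request, if any

    def lookup(lat, lon):
        key = (round(lat, precision), round(lon, precision))
        if key in cache:
            return cache[key]
        if last_call[0] is not None:
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        result = fetch(lat, lon)
        if result is not None:  # do not cache failed lookups
            cache[key] = result
        return result

    return lookup

calls = []

def fake_fetch(lat, lon):
    """Stand-in for get_address; records each call instead of hitting the API."""
    calls.append((lat, lon))
    return f"address near {lat:.3f}, {lon:.3f}"

lookup = make_cached_lookup(fake_fetch, min_interval=0.0)
first = lookup(52.494042, 13.429187)
second = lookup(52.494042, 13.429187)
print(len(calls))
```

In the notebook, `make_cached_lookup(get_address)` could replace the direct call plus `time.sleep(1)`, since the wrapper applies the delay only to real requests.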

In [173]:
df_final

Unnamed: 0,element,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,...,email,opening_hours,geometry,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood,full_address
0,node,30012753,Umspannwerk,,,,events_venue,,,,...,,unknown,POINT (13.42919 52.49404),true,52.494042,13.429187,Friedrichshain-Kreuzberg,0202,Kreuzberg,"Umspannwerk, Ohlauer Straße, Luisenstadt, Kreu..."
1,node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76,,...,,unknown,POINT (13.48162 52.53862),unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen,"KW76, 76, Konrad-Wolf-Straße, Wilhelmsberg, Al..."
2,node,257709121,Kulturhaus Spandau,,,,arts_centre,,,,...,,unknown,POINT (13.20231 52.53548),true,52.535479,13.202312,Spandau,0501,Spandau,"Kulturhaus Spandau, Mauerstraße, Altstadt, Spa..."
3,node,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91,12623,...,,unknown,POINT (13.61206 52.51314),true,52.513140,13.612063,Marzahn-Hellersdorf,1004,Mahlsdorf,"Buergeramt Mahlsdorf, 91, Hönower Straße, Lich..."
4,node,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22,10551,...,,"mo-fr 13:00-18:00; sa-su,ph off",POINT (13.34154 52.531),false,52.531002,13.341544,Mitte,0102,Moabit,"Karame e.V., 22, Wilhelmshavener Straße, Alt-M..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,way,1352576785,DAV OG Berlin Oberschöneweide e.V.,fishing,,,,Nalepastraße,56,,...,,unknown,POINT (13.49774 52.47722),unknown,52.477217,13.497736,Treptow-Köpenick,0909,Oberschöneweide,"DAV OG Berlin Oberschöneweide e.V., 56, Nalepa..."
1654,way,1353882816,Haus Wolfgang Raeder,gardening,,,community_centre,Leonberger Ring,54,,...,,unknown,POINT (13.42883 52.42721),unknown,52.427209,13.428828,Neukölln,0802,Britz,"Haus Wolfgang Raeder, 54, Leonberger Ring, Alt..."
1655,way,1387340919,Kulturzentrum Alte Schule,,,,community_centre,,,,...,,unknown,POINT (13.54977 52.43896),unknown,52.438961,13.549769,Treptow-Köpenick,0907,Adlershof,"Kulturzentrum Alte Schule, Selchowstraße, Sied..."
1656,way,1413964279,Vereinsheim Treue Seele,,,,community_centre,,,,...,,unknown,POINT (13.46274 52.47362),unknown,52.473617,13.462736,Neukölln,0801,Neukölln,"Narzissenweg, Kleingartenanlage Treue Seele, N..."


In [174]:
# District mapping (official codes as strings)
district_mapping = {
    'Mitte': '11001001',
    'Friedrichshain-Kreuzberg': '11002002',
    'Pankow': '11003003',
    'Charlottenburg-Wilmersdorf': '11004004',
    'Spandau': '11005005',
    'Steglitz-Zehlendorf': '11006006',
    'Tempelhof-Schöneberg': '11007007',
    'Neukölln': '11008008',
    'Treptow-Köpenick': '11009009',
    'Marzahn-Hellersdorf': '11010010',
    'Lichtenberg': '11011011',
    'Reinickendorf': '11012012'
}

# Apply mapping to create district_id column
df_final['district_id'] = (
    df_final['district']
    .map(district_mapping)
    .astype(str)
)
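
District names that are missing from the mapping come out as missing values, and a later cast to string can hide them as the text `nan`. A quick check before the cast lists the names that did not map. The sketch uses a shortened mapping and toy input.

```python
import pandas as pd

district_mapping = {"Mitte": "11001001", "Pankow": "11003003"}

# Toy input with one misspelled name and one missing value.
districts = pd.Series(["Mitte", "Pankow", "mitte ", None])

mapped = districts.map(district_mapping)
unmapped = districts[mapped.isna()]
print(unmapped.tolist())
```

An empty `unmapped` series means every row received a valid `district_id`.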

In [175]:
df_final

Unnamed: 0,element,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,...,opening_hours,geometry,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood,full_address,district_id
0,node,30012753,Umspannwerk,,,,events_venue,,,,...,unknown,POINT (13.42919 52.49404),true,52.494042,13.429187,Friedrichshain-Kreuzberg,0202,Kreuzberg,"Umspannwerk, Ohlauer Straße, Luisenstadt, Kreu...",11002002
1,node,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76,,...,unknown,POINT (13.48162 52.53862),unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen,"KW76, 76, Konrad-Wolf-Straße, Wilhelmsberg, Al...",11011011
2,node,257709121,Kulturhaus Spandau,,,,arts_centre,,,,...,unknown,POINT (13.20231 52.53548),true,52.535479,13.202312,Spandau,0501,Spandau,"Kulturhaus Spandau, Mauerstraße, Altstadt, Spa...",11005005
3,node,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91,12623,...,unknown,POINT (13.61206 52.51314),true,52.513140,13.612063,Marzahn-Hellersdorf,1004,Mahlsdorf,"Buergeramt Mahlsdorf, 91, Hönower Straße, Lich...",11010010
4,node,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22,10551,...,"mo-fr 13:00-18:00; sa-su,ph off",POINT (13.34154 52.531),false,52.531002,13.341544,Mitte,0102,Moabit,"Karame e.V., 22, Wilhelmshavener Straße, Alt-M...",11001001
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,way,1352576785,DAV OG Berlin Oberschöneweide e.V.,fishing,,,,Nalepastraße,56,,...,unknown,POINT (13.49774 52.47722),unknown,52.477217,13.497736,Treptow-Köpenick,0909,Oberschöneweide,"DAV OG Berlin Oberschöneweide e.V., 56, Nalepa...",11009009
1654,way,1353882816,Haus Wolfgang Raeder,gardening,,,community_centre,Leonberger Ring,54,,...,unknown,POINT (13.42883 52.42721),unknown,52.427209,13.428828,Neukölln,0802,Britz,"Haus Wolfgang Raeder, 54, Leonberger Ring, Alt...",11008008
1655,way,1387340919,Kulturzentrum Alte Schule,,,,community_centre,,,,...,unknown,POINT (13.54977 52.43896),unknown,52.438961,13.549769,Treptow-Köpenick,0907,Adlershof,"Kulturzentrum Alte Schule, Selchowstraße, Sied...",11009009
1656,way,1413964279,Vereinsheim Treue Seele,,,,community_centre,,,,...,unknown,POINT (13.46274 52.47362),unknown,52.473617,13.462736,Neukölln,0801,Neukölln,"Narzissenweg, Kleingartenanlage Treue Seele, N...",11008008


In [176]:
df_final = df_final.drop(columns=["element"])

In [177]:
# (Optional) Save enriched dataset for later use
df_final.to_csv("clubs_with_districts.csv", index=False)
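
When the CSV is read back later, pandas infers numeric types by default, so identifiers such as `neighborhood_id` lose their leading zeros. The sketch below shows the effect and one way around it, assuming the ID columns should stay strings. It uses an in-memory buffer and a two-row sample built from values shown above rather than the real file.

```python
import io

import pandas as pd

# Two-row sample with the same ID columns as df_final
sample = pd.DataFrame({
    "club_id": ["30012753", "268915262"],
    "neighborhood_id": ["0202", "0102"],
    "district_id": ["11002002", "11001001"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing infers integers, so leading zeros are dropped
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the ID columns as strings keeps the values intact
buffer.seek(0)
id_columns = ["club_id", "neighborhood_id", "district_id"]
restored = pd.read_csv(buffer, dtype={col: str for col in id_columns})

print(inferred["neighborhood_id"].tolist())
print(restored["neighborhood_id"].tolist())
```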

### Final Summary of Cleaned and Transformed Data

In [178]:
print("✅ Dataset after cleaning and transformation\n")

# Shape of dataframe
print(f"Number of rows: {df_final.shape[0]}")
print(f"Number of columns: {df_final.shape[1]}")

# Column list
print("\nRemaining columns:")
print(df_final.columns.tolist())

# Missing values check
missing = df_final.isnull().sum()
print("\nMissing values after cleaning and transformation:")
print(missing)

✅ Dataset after cleaning and transformation

Number of rows: 1658
Number of columns: 22

Remaining columns:
['club_id', 'name', 'club', 'leisure', 'sport', 'amenity', 'street', 'housenumber', 'postcode', 'website', 'phone', 'email', 'opening_hours', 'geometry', 'wheelchair', 'latitude', 'longitude', 'district', 'neighborhood_id', 'neighborhood', 'full_address', 'district_id']

Missing values after cleaning and transformation:
club_id               0
name                  0
club               1313
leisure            1596
sport              1443
amenity             302
street              553
housenumber         563
postcode            602
website             955
phone              1379
email              1485
opening_hours         0
geometry              0
wheelchair            0
latitude              0
longitude             0
district              0
neighborhood_id       0
neighborhood          0
full_address          0
district_id           0
dtype: int64


In [179]:
df_final.dtypes


club_id              object
name                 object
club                 object
leisure              object
sport                object
amenity              object
street               object
housenumber          object
postcode             object
website              object
phone                object
email                object
opening_hours        object
geometry           geometry
wheelchair           object
latitude            float64
longitude           float64
district             object
neighborhood_id      object
neighborhood         object
full_address         object
district_id          object
dtype: object

### Step 3: Populate the Database (`layereddb`)

In [180]:
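import warnings

import pandas as pd
import psycopg2  # PostgreSQL driver used by SQLAlchemy
from sqlalchemy import create_engine, text

# Caution: this silences all warnings, including deprecation notices
warnings.filterwarnings("ignore")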

In [181]:
# Convert the 'geometry' column to WKT text so it can be stored in a TEXT column
from shapely import wkt

df_final["geometry"] = df_final["geometry"].apply(lambda g: g.wkt if g else None)

In [182]:


# Connection details
host = 'localhost'
port = 5433
database = 'layereddb'
user = '*******'
password = '******'

# Create connection string
connection_string = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"

# Create engine
engine = create_engine(connection_string)
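
A password containing characters such as `@`, `/` or `:` breaks a connection string built by plain interpolation, and credentials typed into a notebook cell are easy to commit by accident. The following is a minimal sketch of one way to handle both, assuming the credentials are provided through environment variables. The names `LAYEREDDB_USER` and `LAYEREDDB_PASSWORD`, the fallback values, and the helper `build_connection_string` are hypothetical.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host="localhost", port=5433,
                            database="layereddb"):
    """Return a SQLAlchemy-style PostgreSQL URL with URL-escaped credentials."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names; the fallbacks are placeholders only
user = os.environ.get("LAYEREDDB_USER", "example_user")
password = os.environ.get("LAYEREDDB_PASSWORD", "p@ss/w:rd")

connection_string = build_connection_string(user, password)
```

The resulting string can be passed to `create_engine` exactly as in the cell above.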



In [184]:

# Create the table with its constraints and foreign-key reference before loading data
create_table_query = """
CREATE TABLE IF NOT EXISTS berlin_source_data.social_clubs_activities (
    club_id VARCHAR(100) PRIMARY KEY,   -- OSM element id, stored as string
    name VARCHAR(200) NOT NULL,
    club VARCHAR(100),
    leisure VARCHAR(100),
    sport VARCHAR(100),
    amenity VARCHAR(100),
    street VARCHAR(200),
    housenumber VARCHAR(50),
    postcode VARCHAR(20),
    website VARCHAR(250),
    phone VARCHAR(100),
    email VARCHAR(150),
    opening_hours TEXT,
    wheelchair VARCHAR(50),
    latitude DECIMAL(9,6) NOT NULL,
    longitude DECIMAL(9,6) NOT NULL,
    district VARCHAR(100),
    neighborhood_id VARCHAR(100),
    neighborhood VARCHAR(100),
    full_address TEXT,
    district_id VARCHAR(100) NOT NULL,
    geometry TEXT,
    CONSTRAINT district_id_fk FOREIGN KEY (district_id)
        REFERENCES berlin_source_data.districts(district_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);
"""

# Execute the query to create empty table
with engine.connect() as conn:
    conn.execute(text(create_table_query))
    conn.commit()  # commit the transaction
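
The table definition rejects rows that exceed a `VARCHAR` length or leave a `NOT NULL` column empty, and a single bad row makes the whole load fail. A check before inserting can point at the offending rows. The sketch below works on plain dictionaries and covers only a subset of the columns. The helper `find_violations` and the second sample row are hypothetical. With the real data, `df_final.to_dict("records")` yields rows of the same shape, although pandas represents missing values as `NaN` rather than `None`, which a real check would need to account for.

```python
# Length limits copied from the CREATE TABLE statement (subset of columns)
VARCHAR_LIMITS = {
    "club_id": 100,
    "name": 200,
    "housenumber": 50,
    "postcode": 20,
    "website": 250,
    "email": 150,
}
NOT_NULL_COLUMNS = ["club_id", "name", "latitude", "longitude", "district_id"]


def find_violations(rows):
    """Return (row_index, column, reason) tuples for rows the table would reject."""
    violations = []
    for index, row in enumerate(rows):
        for column in NOT_NULL_COLUMNS:
            if row.get(column) is None:
                violations.append((index, column, "NULL in NOT NULL column"))
        for column, limit in VARCHAR_LIMITS.items():
            value = row.get(column)
            if value is not None and len(str(value)) > limit:
                violations.append((index, column, f"longer than {limit} characters"))
    return violations


# The second row is deliberately broken for illustration
rows = [
    {"club_id": "30012753", "name": "Umspannwerk", "latitude": 52.494042,
     "longitude": 13.429187, "district_id": "11002002"},
    {"club_id": "60775321", "name": "KW76", "latitude": 52.538623,
     "longitude": 13.481623, "district_id": None, "postcode": "1" * 21},
]
print(find_violations(rows))
```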

In [185]:
# Send the DataFrame to the database using .to_sql()
df_final.to_sql(
    'social_clubs_activities',
    engine,
    schema='berlin_source_data',
    if_exists='append',  # keeps the table, just inserts data
    index=False
)

print("DataFrame sent to PostgreSQL using .to_sql() with psycopg2!")

DataFrame sent to PostgreSQL using .to_sql() with psycopg2!
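
Because `club_id` is the primary key, running the append a second time fails on duplicate keys. One possible way to make the load repeatable is an `INSERT ... ON CONFLICT (club_id) DO NOTHING` statement, which PostgreSQL supports. The sketch below demonstrates the clause on an in-memory SQLite database so that it runs without a server; this is a stand-in, not the notebook's actual load path. The table is reduced to two columns, and with psycopg2 the placeholders would be `%s` instead of `?`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs without a server
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE social_clubs_activities ("
    "club_id TEXT PRIMARY KEY, name TEXT NOT NULL)"
)

rows = [("30012753", "Umspannwerk"), ("60775321", "KW76")]
insert_sql = """
INSERT INTO social_clubs_activities (club_id, name)
VALUES (?, ?)
ON CONFLICT (club_id) DO NOTHING
"""

# Running the load twice leaves the table unchanged the second time
for _ in range(2):
    conn.executemany(insert_sql, rows)
    conn.commit()

row_count = conn.execute(
    "SELECT COUNT(*) FROM social_clubs_activities"
).fetchone()[0]
print(row_count)
conn.close()
```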


In [186]:
# Query the loaded table to verify the insert
query = """
SELECT * FROM berlin_source_data.social_clubs_activities
"""

# Execute the query; a read-only SELECT needs no commit
with engine.connect() as conn:
    df = pd.read_sql(text(query), conn)
df

Unnamed: 0,club_id,name,club,leisure,sport,amenity,street,housenumber,postcode,website,...,opening_hours,wheelchair,latitude,longitude,district,neighborhood_id,neighborhood,full_address,district_id,geometry
0,30012753,Umspannwerk,,,,events_venue,,,,,...,unknown,true,52.494043,13.429187,Friedrichshain-Kreuzberg,0202,Kreuzberg,"Umspannwerk, Ohlauer Straße, Luisenstadt, Kreu...",11002002,POINT (13.4291868 52.4940425)
1,60775321,KW76,poker,,,,Konrad-Wolf-Straße,76,,,...,unknown,unknown,52.538623,13.481623,Lichtenberg,1110,Alt-Hohenschönhausen,"KW76, 76, Konrad-Wolf-Straße, Wilhelmsberg, Al...",11011011,POINT (13.4816226 52.5386233)
2,257709121,Kulturhaus Spandau,,,,arts_centre,,,,,...,unknown,true,52.535479,13.202312,Spandau,0501,Spandau,"Kulturhaus Spandau, Mauerstraße, Altstadt, Spa...",11005005,POINT (13.2023117 52.5354787)
3,266630320,Buergeramt Mahlsdorf,,,,community_centre,Hönower Straße,91,12623,,...,unknown,true,52.513140,13.612063,Marzahn-Hellersdorf,1004,Mahlsdorf,"Buergeramt Mahlsdorf, 91, Hönower Straße, Lich...",11010010,POINT (13.6120626 52.51314)
4,268915262,Karame e.V.,,,,community_centre,Wilhelmshavener Straße,22,10551,,...,"mo-fr 13:00-18:00; sa-su,ph off",false,52.531003,13.341544,Mitte,0102,Moabit,"Karame e.V., 22, Wilhelmshavener Straße, Alt-M...",11001001,POINT (13.3415438 52.5310025)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,1352576785,DAV OG Berlin Oberschöneweide e.V.,fishing,,,,Nalepastraße,56,,,...,unknown,unknown,52.477217,13.497736,Treptow-Köpenick,0909,Oberschöneweide,"DAV OG Berlin Oberschöneweide e.V., 56, Nalepa...",11009009,POINT (13.4977362 52.4772169)
1654,1353882816,Haus Wolfgang Raeder,gardening,,,community_centre,Leonberger Ring,54,,,...,unknown,unknown,52.427209,13.428828,Neukölln,0802,Britz,"Haus Wolfgang Raeder, 54, Leonberger Ring, Alt...",11008008,POINT (13.428828205550849 52.427209)
1655,1387340919,Kulturzentrum Alte Schule,,,,community_centre,,,,,...,unknown,unknown,52.438961,13.549769,Treptow-Köpenick,0907,Adlershof,"Kulturzentrum Alte Schule, Selchowstraße, Sied...",11009009,POINT (13.549768602710081 52.438960699999996)
1656,1413964279,Vereinsheim Treue Seele,,,,community_centre,,,,,...,unknown,unknown,52.473617,13.462736,Neukölln,0801,Neukölln,"Narzissenweg, Kleingartenanlage Treue Seele, N...",11008008,POINT (13.462735720606702 52.473617250000004)
