# 🧪 Step 1: Research & Data Modelling
**PR Branch Name:** banks-data-modelling

This notebook documents the process for Step 1 of the "Banks in Berlin" project:
- **1.1 Data Source Discovery**
- **1.2 Modelling & Planning**
- **1.3 Prepare the /sources Directory**
- **1.4 Review**

Goal:
- Identify and document relevant data sources.
- Select the 23 key parameters for our use case.
- Draft the planned table schema.
- Plan cleaning and transformation steps before database population.


## 1.1 Data Source Discovery

**Topic:** Banks in Berlin

**Main source:**
- **Name:** OpenStreetMap (OSM) via OSMnx library
- **Source and origin:** Public crowdsourced geospatial database
- **Update frequency:** Continuous (dynamic)
- **Data type:** Dynamic (API query using `amenity=bank`)
- **Reason for selection:**  
  - Covers all banks in Berlin  
  - Includes coordinates, names, addresses, and other useful attributes  
  - Open, free, and easy to query programmatically

**Optional additional sources:**
- **Name:** Berlin Open Data Portal (daten.berlin.de)
- **Source and origin:** Official Berlin city government
- **Update frequency:** Varies per dataset
- **Data type:** Static or semi-static (download as CSV/GeoJSON)
- **Possible usage:** Enrich with official administrative boundaries or extra metadata

**Enrichment potential:**
- Neighborhood/district info from Berlin shapefiles (GeoJSON)
- Linking to local amenities for spatial context


In [1]:
# Install Libraries

# %pip install osmnx geopandas pandas --quiet

In [78]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [79]:
# Fetch banks in Berlin from OSM using the tag "amenity=bank"
# tags filter for only features with 

tags = {"amenity": "bank"}

In [80]:
# Fetch geometries for Berlin
# bank-gdf = GeoDataFrame (DataFrame with geometry)

banks_gdf = ox.features_from_place("Berlin, Germany", tags)


In [81]:
# Display basic info

print(f"Number of bank entries fetched: {len(banks_gdf)}")
banks_gdf.head(3)

Number of bank entries fetched: 323


Unnamed: 0_level_0,Unnamed: 1_level_0,geometry,addr:city,addr:country,addr:housenumber,addr:postcode,addr:street,addr:suburb,amenity,atm,branch,...,operator:type,start_date,building:levels,roof:levels,roof:shape,indoor,access,room,western_union,building:part
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
node,28968292,POINT (13.31972 52.48667),Berlin,DE,42.0,10713.0,Berliner Straße,Wilmersdorf,bank,yes,eG,...,,,,,,,,,,
node,60848455,POINT (13.47104 52.53033),Berlin,,13.0,10369.0,Anton-Saefkow-Platz,,bank,yes,,...,,,,,,,,,,
node,87040399,POINT (13.3888 52.51105),,,,,,,bank,,,...,,,,,,,,,,


In [82]:
banks_gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 323 entries, ('node', np.int64(28968292)) to ('way', np.int64(611744021))
Data columns (total 100 columns):
 #   Column                                 Non-Null Count  Dtype   
---  ------                                 --------------  -----   
 0   geometry                               323 non-null    geometry
 1   addr:city                              227 non-null    object  
 2   addr:country                           141 non-null    object  
 3   addr:housenumber                       224 non-null    object  
 4   addr:postcode                          228 non-null    object  
 5   addr:street                            233 non-null    object  
 6   addr:suburb                            136 non-null    object  
 7   amenity                                323 non-null    object  
 8   atm                                    251 non-null    object  
 9   branch                                 9 non-null      object  
 10  b

## 1.2 Modelling & Planning

### Selected 23 Key Columns
1. osm_id
2. name
3. brand
4. operator
5. street
6. housenumber
7. postcode
8. city
9. country
10. phone
11. email
12. website
13. opening_hours
14. atm
15. wheelchair
16. building
17. latitude
18. longitude
19. geom_type
20. geom
21. neighbourhood
22. district
23. source

---

### How this connects to existing tables:
- **Coordinates (latitude, longitude, geom):** link to neighbourhood and district polygons.
- **Neighbourhood & district fields:** join with administrative boundaries table.
- **Source field:** ensures traceability.

---

### Planned Schema: `banks_in_berlin`
| Column Name     | Data Type | Description | Example |
|-----------------|-----------|-------------|---------|
| osm_id          | int       | Unique OSM ID | 12345678 |
| name            | text      | Bank name | Deutsche Bank |
| brand           | text      | Brand name if available | Sparkasse |
| operator        | text      | Entity operating the bank | Berliner Volksbank |
| street          | text      | Street name | Friedrichstraße |
| housenumber     | text      | House number | 45 |
| postcode        | text      | Postal code | 10117 |
| city            | text      | City name | Berlin |
| country         | text      | Country code | DE |
| phone           | text      | Contact phone | +49 30 123456 |
| email           | text      | Contact email | info@bank.de |
| website         | text      | Website URL | www.bank.de |
| opening_hours   | text      | Opening hours string | Mo-Fr 09:00-17:00 |
| atm             | text      | Presence of ATM | yes |
| wheelchair      | text      | Accessibility info | yes |
| building        | text      | Building type | yes |
| latitude        | float     | Latitude coordinate | 52.5200 |
| longitude       | float     | Longitude coordinate | 13.4050 |
| geom_type       | text      | Geometry type | Point |
| geom            | geometry  | Full geometry | (GeoJSON) |
| neighbourhood   | text      | Local neighbourhood name | Mitte |
| district        | text      | Berlin district | Mitte |
| source          | text      | Data source info | OSM |

---

### Known Data Issues
- Missing contact details for some entries.
- Inconsistent postcode and address formats.
- Neighbourhood and district not always included in raw OSM data.
- Opening hours in non-standard formats.

---

### Transformation Plan
1. Fetch data from OSM with filter `amenity=bank` (Berlin bounding box).
2. Clean column names → snake_case.
3. Normalize formats (phone, postcode, website URLs).
4. Enrich with neighbourhood/district via spatial join.
5. Save cleaned dataset (GeoJSON + CSV).


In [7]:
# Select 23 Columns & Add Coordinates

In [83]:
# Ensure geometry type is Point for lat/lon extraction

banks_gdf = banks_gdf.to_crs(epsg=4326)

In [84]:
banks_gdf['geometry'] = banks_gdf['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())
#Extract latitude and longitude
banks_gdf["latitude"] = banks_gdf.geometry.y
banks_gdf["longitude"] = banks_gdf.geometry.x
banks_gdf

Unnamed: 0_level_0,Unnamed: 1_level_0,geometry,addr:city,addr:country,addr:housenumber,addr:postcode,addr:street,addr:suburb,amenity,atm,branch,...,building:levels,roof:levels,roof:shape,indoor,access,room,western_union,building:part,latitude,longitude
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
node,28968292,POINT (13.31972 52.48667),Berlin,DE,42,10713,Berliner Straße,Wilmersdorf,bank,yes,eG,...,,,,,,,,,52.486668,13.319723
node,60848455,POINT (13.47104 52.53033),Berlin,,13,10369,Anton-Saefkow-Platz,,bank,yes,,...,,,,,,,,,52.530331,13.471037
node,87040399,POINT (13.3888 52.51105),,,,,,,bank,,,...,,,,,,,,,52.511050,13.388798
node,89274635,POINT (13.41575 52.52324),Berlin,,5,10178,Alexanderstraße,,bank,yes,,...,,,,,,,,,52.523238,13.415750
node,203561614,POINT (13.53833 52.52769),Berlin,,1/2,12681,Helene-Weigel-Platz,,bank,yes,,...,,,,,,,,,52.527687,13.538327
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
way,210920895,POINT (13.49139 52.42095),Berlin,DE,15,12357,Alt-Rudow,Rudow,bank,yes,,...,,,,,,,,,52.420953,13.491388
way,336063499,POINT (13.43363 52.51035),Berlin,DE,5,10243,Am Ostbahnhof,Friedrichshain,bank,yes,,...,,,,room,customers,shop,yes,,52.510354,13.433634
way,422475544,POINT (13.38491 52.53128),Berlin,DE,28,10115,Invalidenstraße,Mitte,bank,,,...,,,,,,,,,52.531277,13.384913
way,423847739,POINT (13.19855 52.53418),,,,,,,bank,yes,,...,,,,room,,,,yes,52.534183,13.198554


In [85]:
# Select the 23 columns (fill missing with None if not present)

selected_columns = [
    #"osmid",
    "name", "brand", "operator",
    "addr:street", "addr:housenumber", "addr:postcode", "addr:city", "addr:country",
    "phone", "email", "website", "opening_hours",
    "atm", "wheelchair", "building",
    "latitude", "longitude", "geometry",
    # placeholders for enrichment
    #"neighbourhood", "district",
    # add source info
    "source"
]

In [86]:
# Rename map for only the columns that need renaming

rename_map = {
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "addr:country": "country"
}

In [87]:
# # Add missing columns if they don’t exist in the data
# for col in selected_columns:
#     if col not in banks_gdf.columns:
#         banks_gdf[col] = None

In [88]:
# Select the columns in the right order
banks_df = banks_gdf[selected_columns]

In [89]:
# Rename the columns
banks_df = banks_df.rename(columns=rename_map)

In [90]:
# Preview the final DataFrame
banks_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,name,brand,operator,street,housenumber,postcode,city,country,phone,email,website,opening_hours,atm,wheelchair,building,latitude,longitude,geometry,source
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
node,28968292,Berliner Volksbank,Berliner Volksbank,,Berliner Straße,42,10713.0,Berlin,DE,,,,"Mo-Fr 10:00-13:00, Mo 14:00-16:00, Tu,Th 14:00...",yes,yes,,52.486668,13.319723,POINT (13.31972 52.48667),
node,60848455,Sparkasse,,Berliner Sparkasse,Anton-Saefkow-Platz,13,10369.0,Berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",yes,limited,,52.530331,13.471037,POINT (13.47104 52.53033),
node,87040399,DKB,,,,,,,,,,,,,limited,,52.51105,13.388798,POINT (13.3888 52.51105),
node,89274635,Deutsche Bank,Deutsche Bank,Deutsche Bank,Alexanderstraße,5,10178.0,Berlin,,,,,Mo-Tu 10:00-18:00; We 10:00-16:00; Th 10:00-18...,yes,yes,,52.523238,13.41575,POINT (13.41575 52.52324),
node,203561614,Sparkasse,,Berliner Sparkasse,Helene-Weigel-Platz,1/2,12681.0,Berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",yes,yes,,52.527687,13.538327,POINT (13.53833 52.52769),


## Step 1 Review and A–F Data Familiarization

### A) Quick overview

In [91]:
# How many rows and columns?
# banks_df.shape

print("Rows, Columns:", banks_df.shape)

Rows, Columns: (323, 19)


In [92]:
# What are the column names (in order)?
# banks_df.columns.tolist()

print("\nColumns:", banks_df.columns.tolist())


Columns: ['name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'city', 'country', 'phone', 'email', 'website', 'opening_hours', 'atm', 'wheelchair', 'building', 'latitude', 'longitude', 'geometry', 'source']


In [93]:
# Data types and non-null counts

banks_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 323 entries, ('node', np.int64(28968292)) to ('way', np.int64(611744021))
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   name           322 non-null    object  
 1   brand          194 non-null    object  
 2   operator       174 non-null    object  
 3   street         233 non-null    object  
 4   housenumber    224 non-null    object  
 5   postcode       228 non-null    object  
 6   city           227 non-null    object  
 7   country        141 non-null    object  
 8   phone          28 non-null     object  
 9   email          2 non-null      object  
 10  website        46 non-null     object  
 11  opening_hours  278 non-null    object  
 12  atm            251 non-null    object  
 13  wheelchair     287 non-null    object  
 14  building       8 non-null      object  
 15  latitude       323 non-null    float64 
 16  longitude      323 n

### B) Missing values per column

In [94]:
# Count missing values (NaN/None) in each column
# I need this to compute percentages of missing values below

missing_count = banks_df.isna().sum().sort_values(ascending=False)
print(missing_count)


email            321
building         315
source           310
phone            295
website          277
country          182
operator         149
brand            129
housenumber       99
city              96
postcode          95
street            90
atm               72
opening_hours     45
wheelchair        36
name               1
latitude           0
longitude          0
geometry           0
dtype: int64


In [95]:
# Number of rows (observations, banks)
# I need this to compute percentages of missing values below

row_count = len(banks_df)
print(row_count)

323


In [96]:
# Build table with counts and % of missing values
# What does pd.DataFrame({...}) do? It converts that dictionary into a DataFrame (like an Excel table).
# The keys become column names.
# The values become column data.

missing = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": (missing_count / row_count * 100).round(1)
}).sort_values(by="missing_pct", ascending=False)

print(missing)

               missing_count  missing_pct
email                    321         99.4
building                 315         97.5
source                   310         96.0
phone                    295         91.3
website                  277         85.8
country                  182         56.3
operator                 149         46.1
brand                    129         39.9
housenumber               99         30.7
city                      96         29.7
postcode                  95         29.4
street                    90         27.9
atm                       72         22.3
opening_hours             45         13.9
wheelchair                36         11.1
name                       1          0.3
latitude                   0          0.0
longitude                  0          0.0
geometry                   0          0.0


### C) Distinct values per column

In [97]:
# Number of unique values per column
# Goal: See the “variety” of each column.


distinct = banks_df.nunique().sort_values(ascending=False)
print(distinct)

# Concusion:
# latitude, longitude and geometry are diverse  => Columns to keep but use mainly for mapping
# country, city , email , maybe source  => Columns I might drop/ignore later (in Step2)
# brand, operator, postcode, wheelchair, atm, maybe opening_hours => Columns that will be most useful in Step 2


latitude         323
geometry         323
longitude        323
opening_hours    176
housenumber      146
street           145
postcode         114
name              59
website           32
operator          31
phone             23
brand             23
source             7
wheelchair         3
building           3
atm                3
email              2
country            1
city               1
dtype: int64


### D) Most common values in key columns

In [98]:
# Goal: Peek at distributions, not just counts.

# Example: top 10 brands
print("\nTop 10 brands:")
print(banks_df["brand"].value_counts().head(10))


Top 10 brands:
brand
Berliner Volksbank    37
Deutsche Bank         30
Commerzbank           29
Postbank              28
Targobank             21
Santander             10
Sparda-Bank Berlin     9
HypoVereinsbank        6
Reisebank              4
Western Union          3
Name: count, dtype: int64


In [99]:
# Example: top 10 operators
print("\nTop 10 operators:")
print(banks_df["operator"].value_counts().head(10))

# Shows concentration: most of the banks are "Berliner Sparkasse"


Top 10 operators:
operator
Berliner Sparkasse       97
Berliner Volksbank        9
Deutsche Bank             9
Berliner Volksbank eG     9
Commerzbank               8
Targobank                 7
Sparda-Bank Berlin eG     6
Commerzbank AG            3
Postbank                  3
Deutsche Bank AG          2
Name: count, dtype: int64


In [100]:
# Example: most common street 
print("\nTop street:")
print(banks_df["street"].value_counts().head(10))


Top street:
street
Bahnhofstraße           9
Schloßstraße            7
Breite Straße           6
Friedrichstraße         6
Müllerstraße            6
Frankfurter Allee       5
Kurfürstendamm          5
Hauptstraße             5
Mariendorfer Damm       5
Wilmersdorfer Straße    4
Name: count, dtype: int64


In [101]:
# Example: most common postcode 
print("\nTop postcode:")
print(banks_df["postcode"].value_counts().head(10))


Top postcode:
postcode
10117    14
12163     7
14169     6
10627     6
12555     4
13187     4
13353     4
10247     4
13125     4
13051     4
Name: count, dtype: int64


In [102]:
# Example: most common opening_hours
print("\nTop opening_hours:")
print(banks_df["opening_hours"].value_counts().head(10))


Top opening_hours:
opening_hours
Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00                         55
Mo-Fr 06:00-23:00; Sa,Su 08:00-23:00                             8
24/7                                                             6
Mo-Fr 10:00-13:00, Mo 14:00-16:00, Tu,Th 14:00-18:00             4
Mo-Fr 10:00-13:00                                                4
Mo-Fr 09:30-18:00                                                4
Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00; PH off                  3
Mo,Tu,Th 10:00-12:30,14:00-18:00; We,Fr 10:00-14:00              3
Mo-Th 10:00-13:00,14:00-18:00; We,Fr 10:00-13:00,14:00-16:00     3
Mo-Fr 10:00-13:00, Mo,Tu,Th 14:00-18:00                          3
Name: count, dtype: int64


### E) Geometry sanity checks

In [103]:
# Goal: Ensure spatial data makes sense.

# Unique geometry types (Point, Polygon/LineString). 
# If some are Polygon/LineString, I already handled them with .representative_point() (somewhere above in Step 1.2).

print(banks_df.geometry.geom_type.value_counts())

Point    323
Name: count, dtype: int64


In [104]:
# Any missing geometries?
# Why? Missing geometry would be a problem for maps.

print("Missing geometries:", banks_df.geometry.isna().sum())

Missing geometries: 0


### F) Latitude/Longitude checks

In [105]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", banks_df["latitude"].min(), "to", banks_df["latitude"].max())

print("Longitude range:", banks_df["longitude"].min(), "to", banks_df["longitude"].max())


Latitude range: 52.3865001 to 52.6357618
Longitude range: 13.1416433 to 13.6255293


## 1.3 Prepare the /sources Directory

- **Raw Data Files:**  
    - `banks_raw.geojson` (includes geometry)  
    - `banks_raw.csv` (tabular only, no geometry)  

- **README.md** in `/sources` will contain:
    - Data sources used.
    - Planned transformation steps.


In [106]:
# Save as GeoJSON (keeps geometry) and CSV

raw_geojson_path = "../sources/banks_raw.geojson"
raw_csv_path = "../sources/banks_raw.csv"


banks_gdf.to_file(raw_geojson_path, driver="GeoJSON")
banks_gdf.drop(columns="geometry").to_csv(raw_csv_path, index=False)

print(f"Raw data saved to: {raw_geojson_path} and {raw_csv_path}")

Raw data saved to: ../sources/banks_raw.geojson and ../sources/banks_raw.csv


## 1.4 Review

- All 23 target columns defined.
- Data sources identified and documented.
- Schema draft created.
- Data fetched and stored in `/sources`.
- Data cleaning & enrichment plan in place.

**Next Step:** Step 2 — Fetch & Transform data.


# 🛠 Step 2: Data Transformation

### A) Standardize column names and types

In [107]:
# Standardize column names

banks_df.columns = banks_df.columns.str.lower().str.strip().str.replace(" ", "_").str.replace("-", "_")

# Convert certain columns to correct type

banks_df["housenumber"] = banks_df["housenumber"].astype(str)   # ensure text

banks_df["postcode"] = banks_df["postcode"].astype(str)         # keep leading zeros

# Normalize yes/no columns into Boolean (True/False)

banks_df["atm"] = banks_df["atm"].map({"yes": True, "no": False})

banks_df["wheelchair"] = banks_df["wheelchair"].map({"yes": True, "no": False})

# Make text values consistent (lowercase to avoid duplicates like "Sparkasse" vs "sparkasse")
# See "opening_hours" normalization in Step 2 E)


text_cols = ["name", "street", "city", "country", "website", "operator", "brand", "phone", "email", "source", "building"]
for col in text_cols:
    if col in banks_df.columns:
        banks_df[col] = banks_df[col].astype(str).str.strip().str.lower()

In [108]:
# Check the  datatypes after Step 2 A)

print(banks_df.dtypes)   

name               object
brand              object
operator           object
street             object
housenumber        object
postcode           object
city               object
country            object
phone              object
email              object
website            object
opening_hours      object
atm                object
wheelchair         object
building           object
latitude          float64
longitude         float64
geometry         geometry
source             object
dtype: object


In [109]:
# See first rows after Step 2 A)

banks_df.head() 


Unnamed: 0_level_0,Unnamed: 1_level_0,name,brand,operator,street,housenumber,postcode,city,country,phone,email,website,opening_hours,atm,wheelchair,building,latitude,longitude,geometry,source
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
node,28968292,berliner volksbank,berliner volksbank,,berliner straße,42,10713.0,berlin,de,,,,"Mo-Fr 10:00-13:00, Mo 14:00-16:00, Tu,Th 14:00...",True,True,,52.486668,13.319723,POINT (13.31972 52.48667),
node,60848455,sparkasse,,berliner sparkasse,anton-saefkow-platz,13,10369.0,berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",True,,,52.530331,13.471037,POINT (13.47104 52.53033),
node,87040399,dkb,,,,,,,,,,,,,,,52.51105,13.388798,POINT (13.3888 52.51105),
node,89274635,deutsche bank,deutsche bank,deutsche bank,alexanderstraße,5,10178.0,berlin,,,,,Mo-Tu 10:00-18:00; We 10:00-16:00; Th 10:00-18...,True,True,,52.523238,13.41575,POINT (13.41575 52.52324),
node,203561614,sparkasse,,berliner sparkasse,helene-weigel-platz,1/2,12681.0,berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",True,True,,52.527687,13.538327,POINT (13.53833 52.52769),


In [110]:
# Seeing more than head()

# banks_df.head(20)              # first 20 rows
# banks_df.tail(10)              # last 10 rows
# banks_df.sample(10, random_state=0)  # 10 random rows
# banks_df[["brand","operator","atm","wheelchair"]].sample(15, random_state=1)
# banks_df["brand"].value_counts(dropna=False).head(20)


In [111]:
# Seeing more than head()

banks_df.sample(10, random_state=0)  # 10 random rows

Unnamed: 0_level_0,Unnamed: 1_level_0,name,brand,operator,street,housenumber,postcode,city,country,phone,email,website,opening_hours,atm,wheelchair,building,latitude,longitude,geometry,source
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
node,9798306522,ziraat bank,,t.c. ziraat bankası,leipziger straße,31.0,10117.0,berlin,,,,,"Mo-Fr 09:00-12:00, Mo,We 13:00-17:00, Th 13:00...",,,,52.510254,13.390841,POINT (13.39084 52.51025),
node,355142046,commerzbank,commerzbank,commerzbank,,,,,,,,,"Mo-Th 09:00-13:00, Fr 09:00-14:00, Mo,We 14:00...",,False,,52.589679,13.283215,POINT (13.28321 52.58968),
node,266614884,berliner volksbank,berliner volksbank,,,,,,,,,,,True,True,,52.511426,13.586284,POINT (13.58628 52.51143),
node,4909118855,sparda-bank berlin,sparda-bank berlin,sparda-bank berlin eg,mehrower allee,20.0,12687.0,berlin,,493042080420.0,,https://www.sparda-b.de,"Tu, We, Fr 10:00-13:00; Mo 10:00-13:00, 16:00-...",True,True,,52.555698,13.559385,POINT (13.55939 52.5557),
node,701070442,deutsche bank,deutsche bank,,köpenicker straße,184.0,12355.0,berlin,,,,,,True,False,,52.418478,13.496902,POINT (13.4969 52.41848),
node,876978848,commerzbank,commerzbank,,,,,,,,,,"Mo,We 09:00-13:00,14:00-16:00, Tu,Th 09:00-13:...",True,True,,52.439537,13.215075,POINT (13.21507 52.43954),
node,472430735,deutsche bank,deutsche bank,,johannisthaler chaussee,300.0,12351.0,berlin,de,49306600670.0,,https://www.deutsche-bank.de/,,True,True,,52.430767,13.455791,POINT (13.45579 52.43077),
node,3612065220,postbank,postbank,,potsdamer straße,52.0,14163.0,berlin,de,,,,"Mo-Fr 09:00-18:00; Sa 09:00-14:00; Su,PH off",True,True,,52.435032,13.258984,POINT (13.25898 52.43503),
node,346135245,sparkasse,sparkasse,berliner sparkasse,schulzendorfer straße,1.0,13347.0,berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",True,,,52.539821,13.370978,POINT (13.37098 52.53982),
node,1005311925,sparkasse,,berliner sparkasse,bahnhofstraße,61.0,13125.0,berlin,,,,,"Mo,We,Fr 09:30-15:00; Tu,Th 09:30-18:00",True,True,,52.613947,13.470641,POINT (13.47064 52.61395),


In [112]:
# Seeing more than head()

banks_df[["brand","operator","atm","wheelchair"]].sample(15, random_state=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,brand,operator,atm,wheelchair
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
node,1110318360,targobank,,True,True
node,510622166,targobank,targobank,True,
node,8462242299,postbank,,,
node,3229138230,commerzbank,commerzbank ag,True,
node,414996118,,berliner sparkasse,True,
node,4464731328,psd bank,psd bank berlin-brandenburg eg,True,
node,1365431660,deutsche bank,,,
node,4016608995,berliner volksbank,berliner volksbank eg,True,
node,837792490,,berliner sparkasse,True,
node,4537589642,,,,True


### B) Drop irrelevant / redundant columns

In [113]:
# # Drop redundant columns
columns_to_drop_in_2B = ["city", "country", "source"]

# Keep only the ones that really exist in the dataframe
columns_to_drop_in_2B = [col for col in columns_to_drop_in_2B if col in banks_df.columns]

print("Dropping in Step 2B:", columns_to_drop_in_2B)
banks_df = banks_df.drop(columns=columns_to_drop_in_2B)

print("\nRemaining columns after Step 2B:")
print(banks_df.columns.tolist())

Dropping in Step 2B: ['city', 'country', 'source']

Remaining columns after Step 2B:
['name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'phone', 'email', 'website', 'opening_hours', 'atm', 'wheelchair', 'building', 'latitude', 'longitude', 'geometry']


### C) Handle missing values

In [114]:
# Drop columns with too many missing values => See table with counts and % of missing values in Step 1 B)
# email → 99.4% missing;
# phone → 91.3% missing
# website → 85.8% missing
# building → 97.5% missing


columns_to_drop_in_2C = ["email", "phone", "website", "building"]

columns_to_drop_in_2C = [col for col in columns_to_drop_in_2C if col in banks_df.columns]

print("Dropping in Step 2C:", columns_to_drop_in_2C)
banks_df = banks_df.drop(columns=columns_to_drop_in_2C)

print("\nRemaining columns after Step 2C:")
print(banks_df.columns.tolist())

Dropping in Step 2C: ['email', 'phone', 'website', 'building']

Remaining columns after Step 2C:
['name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'opening_hours', 'atm', 'wheelchair', 'latitude', 'longitude', 'geometry']


### D) Normalize categories

In [116]:
# atm and wheelchair should be consistent values
# Replace NaN with "unknown" and standardize values

if "atm" in banks_df.columns:
    banks_df["atm"] = banks_df["atm"].fillna("unknown").astype(str).str.strip().str.lower()

if "wheelchair" in banks_df.columns:
    banks_df["wheelchair"] = banks_df["wheelchair"].fillna("unknown").astype(str).str.strip().str.lower()

print("\nUnique values in 'atm':", banks_df["atm"].unique() if "atm" in banks_df.columns else "No column")
print("Unique values in 'wheelchair':", banks_df["wheelchair"].unique() if "wheelchair" in banks_df.columns else "No column")


Unique values in 'atm': ['true' 'unknown' 'false']
Unique values in 'wheelchair': ['true' 'unknown' 'false']


### E) Opening hours normalization

In [117]:
# Normalize text format

if "opening_hours" in banks_df.columns:
    banks_df["opening_hours"] = banks_df["opening_hours"].fillna("unknown").astype(str).str.strip().str.lower()

print("\nSample opening hours values:")
print(banks_df["opening_hours"].head(10) if "opening_hours" in banks_df.columns else "No column")


Sample opening hours values:
element  id       
node     28968292     mo-fr 10:00-13:00, mo 14:00-16:00, tu,th 14:00...
         60848455               mo,we,fr 09:30-15:00; tu,th 09:30-18:00
         87040399                                               unknown
         89274635     mo-tu 10:00-18:00; we 10:00-16:00; th 10:00-18...
         203561614              mo,we,fr 09:30-15:00; tu,th 09:30-18:00
         213106681              mo,we,fr 09:30-15:00; tu,th 09:30-18:00
         213108224                                    mo-fr 09:30-18:00
         213112439                                                 24/7
         239659091              mo,we,fr 09:30-15:00; tu,th 09:30-18:00
         239661671                 mo-fr 06:00-23:00; sa,su 08:00-23:00
Name: opening_hours, dtype: object


In [118]:
# Quick preview

banks_df.head()


Unnamed: 0_level_0,Unnamed: 1_level_0,name,brand,operator,street,housenumber,postcode,opening_hours,atm,wheelchair,latitude,longitude,geometry
element,id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
node,28968292,berliner volksbank,berliner volksbank,,berliner straße,42,10713.0,"mo-fr 10:00-13:00, mo 14:00-16:00, tu,th 14:00...",true,true,52.486668,13.319723,POINT (13.31972 52.48667)
node,60848455,sparkasse,,berliner sparkasse,anton-saefkow-platz,13,10369.0,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,unknown,52.530331,13.471037,POINT (13.47104 52.53033)
node,87040399,dkb,,,,,,unknown,unknown,unknown,52.51105,13.388798,POINT (13.3888 52.51105)
node,89274635,deutsche bank,deutsche bank,deutsche bank,alexanderstraße,5,10178.0,mo-tu 10:00-18:00; we 10:00-16:00; th 10:00-18...,true,true,52.523238,13.41575,POINT (13.41575 52.52324)
node,203561614,sparkasse,,berliner sparkasse,helene-weigel-platz,1/2,12681.0,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,true,52.527687,13.538327,POINT (13.53833 52.52769)


### F) Add district and district_id to the data frame

In [119]:
conda install -c conda-forge geopy


[1;32m2[0m[1;32m channel Terms of Service accepted[0m
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [120]:
import pandas as pd
from geopy.geocoders import Nominatim
from time import sleep

# Initialize the geolocator
geolocator = Nominatim(user_agent="berlin_district_locator")

# Define a reverse geocoding function to get the official Berlin district (Bezirk)
def get_district(lat, lon):
    try:
        location = geolocator.reverse((lat, lon), exactly_one=True, language='de')
        sleep(1)  # Nominatim rate limit: 1 request per second
        if location and "address" in location.raw:
            address = location.raw["address"]
            return (
                address.get("city_district") or
                address.get("borough") or
                address.get("county") or
                None
            )
        return None
    except:
        return None

# Apply function row by row → add new "district" column
banks_df["district"] = banks_df.apply(
    lambda row: get_district(row["latitude"], row["longitude"]) if pd.notnull(row["latitude"]) else None,
    axis=1
)




In [45]:
# banks_df = banks_df.drop(columns="district_id")

In [121]:
# Generating district ids
# https://www.regionalstatistik.de

# District mapping (official codes as strings)
district_mapping = {
    'Mitte': '11001001',
    'Friedrichshain-Kreuzberg': '11002002',
    'Pankow': '11003003',
    'Charlottenburg-Wilmersdorf': '11004004',
    'Spandau': '11005005',
    'Steglitz-Zehlendorf': '11006006',
    'Tempelhof-Schöneberg': '11007007',
    'Neukölln': '11008008',
    'Treptow-Köpenick': '11009009',
    'Marzahn-Hellersdorf': '11010010',
    'Lichtenberg': '11011011',
    'Reinickendorf': '11012012'
}

# Apply mapping to create district_id column (string)
banks_df['district_id'] = banks_df['district'].map(district_mapping).astype(str)

# (Optional) Check if some districts were not mapped
#unmapped = df[~df['district'].isin(district_mapping.keys())]['district'].unique()
#if len(unmapped) > 0:
    #print("⚠️ Unmapped districts found:", unmapped)

### G)  Reset index, drop columns "element" and "geometry", rename "id" to "banks_id"

In [122]:
# Reset index
banks_df= banks_df.drop(columns=["geometry"]).reset_index()

In [123]:
# Drop the redundant column "element" 

banks_df= banks_df.drop(columns=["element"])

In [124]:
# Rename the "id" column to "bank_id"

banks_df = banks_df.rename(columns={"id": "bank_id"})

In [125]:
# Change bank_id column type to string

banks_df["bank_id"] = banks_df["bank_id"].astype(str)



In [126]:
# (Optional) Save enriched dataset for later use
banks_df.to_csv("banks_with_districts.csv", index=False)

### H)  Final Summary of Cleaned and Transformed Data

In [127]:
print("✅ Dataset after Steps A - G cleaning and transforming\n")

# Shape of dataframe
print(f"Number of rows: {banks_df.shape[0]}")
print(f"Number of columns: {banks_df.shape[1]}")

# Column list
print("\nRemaining columns:")
print(banks_df.columns.tolist())

# Missing values check
missing = banks_df.isnull().sum()
print("\nMissing values after cleaning and transforming :")
print(missing)

✅ Dataset after Steps A - G cleaning and transforming

Number of rows: 323
Number of columns: 14

Remaining columns:
['bank_id', 'name', 'brand', 'operator', 'street', 'housenumber', 'postcode', 'opening_hours', 'atm', 'wheelchair', 'latitude', 'longitude', 'district', 'district_id']

Missing values after cleaning and transforming :
bank_id          0
name             0
brand            0
operator         0
street           0
housenumber      0
postcode         0
opening_hours    0
atm              0
wheelchair       0
latitude         0
longitude        0
district         0
district_id      0
dtype: int64


In [128]:
# Data types and non-null counts

banks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   bank_id        323 non-null    object 
 1   name           323 non-null    object 
 2   brand          323 non-null    object 
 3   operator       323 non-null    object 
 4   street         323 non-null    object 
 5   housenumber    323 non-null    object 
 6   postcode       323 non-null    object 
 7   opening_hours  323 non-null    object 
 8   atm            323 non-null    object 
 9   wheelchair     323 non-null    object 
 10  latitude       323 non-null    float64
 11  longitude      323 non-null    float64
 12  district       323 non-null    object 
 13  district_id    323 non-null    object 
dtypes: float64(2), object(12)
memory usage: 35.5+ KB


In [129]:
# Quick overview

banks_df

Unnamed: 0,bank_id,name,brand,operator,street,housenumber,postcode,opening_hours,atm,wheelchair,latitude,longitude,district,district_id
0,28968292,berliner volksbank,berliner volksbank,,berliner straße,42,10713,"mo-fr 10:00-13:00, mo 14:00-16:00, tu,th 14:00...",true,true,52.486668,13.319723,Charlottenburg-Wilmersdorf,11004004
1,60848455,sparkasse,,berliner sparkasse,anton-saefkow-platz,13,10369,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,unknown,52.530331,13.471037,Lichtenberg,11011011
2,87040399,dkb,,,,,,unknown,unknown,unknown,52.511050,13.388798,Mitte,11001001
3,89274635,deutsche bank,deutsche bank,deutsche bank,alexanderstraße,5,10178,mo-tu 10:00-18:00; we 10:00-16:00; th 10:00-18...,true,true,52.523238,13.415750,Mitte,11001001
4,203561614,sparkasse,,berliner sparkasse,helene-weigel-platz,1/2,12681,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,true,52.527687,13.538327,Marzahn-Hellersdorf,11010010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,210920895,sparkasse,,berliner sparkasse,alt-rudow,15,12357,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,true,52.420953,13.491388,Neukölln,11008008
319,336063499,reisebank,reisebank,reisebank,am ostbahnhof,5,10243,mo-fr 09:00-19:00; sa 10:00-18:00,true,true,52.510354,13.433634,Friedrichshain-Kreuzberg,11002002
320,422475544,allgemeine beamten bank,,,invalidenstraße,28,10115,"mo-fr 09:00-13:00, mo,we 14:00-18:00",unknown,true,52.531277,13.384913,Mitte,11001001
321,423847739,sparda-bank berlin,sparda-bank berlin,sparda-bank berlin eg,,,,"mo,th 09:00-13:00,14:00-18:00",true,true,52.534183,13.198554,Spandau,11005005


In [130]:
banks_df.to_csv("final_banks_with_districts.csv")

# 🧩 Step 3: Populate Database

In [131]:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine, text
import warnings

warnings.filterwarnings("ignore")

In [132]:
user_name='clara_neagu'
password='nG60A6GG8ZKZuV'

In [133]:
# Conection
host = 'localhost'
port = '5433'
database = 'layereddb'
schema='berlin_source_data'

#connection to db after you opened tunnel
engine = create_engine(f'postgresql+psycopg2://{user_name}:{password}@{host}:{port}/{database}')

In [135]:
#this is where you create table with constraints and references first
create_table_query = f"""
CREATE TABLE IF NOT EXISTS {schema}.banks (
    bank_id VARCHAR(20) PRIMARY KEY,
    name VARCHAR(255),
    brand VARCHAR(255),
    operator VARCHAR(255),
    street VARCHAR(255),
    housenumber VARCHAR(50),
    postcode VARCHAR(10),
    opening_hours TEXT,
    atm VARCHAR(10),
    wheelchair VARCHAR(10),
    latitude FLOAT,
    longitude FLOAT,
    district VARCHAR(100),
    district_id VARCHAR(100),
     CONSTRAINT district_id_fk FOREIGN KEY (district_id)
        REFERENCES berlin_source_data.districts(district_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
   
);
"""

# Execute the query to create empty table
with engine.connect() as conn:
    conn.execute(text(create_table_query))
    conn.commit()  # commit the transaction




In [136]:
#  Send the DataFrame to the database using .to_sql()
banks_df.to_sql(
    'banks',      
    engine,
    schema=schema,
    if_exists='append', # ✅ keeps table, just inserts data
    index=False
)

print("DataFrame sent to PostgreSQL using .to_sql() with psycopg2!")

DataFrame sent to PostgreSQL using .to_sql() with psycopg2!


In [137]:
##let's query test data!
query = f"""
SELECT * from berlin_source_data.banks
"""

# Execute the query
with engine.connect() as conn:
    df= pd.read_sql(text(query), conn)
    conn.commit()  # commit the transaction
df

Unnamed: 0,bank_id,name,brand,operator,street,housenumber,postcode,opening_hours,atm,wheelchair,latitude,longitude,district,district_id
0,28968292,berliner volksbank,berliner volksbank,,berliner straße,42,10713,"mo-fr 10:00-13:00, mo 14:00-16:00, tu,th 14:00...",true,true,52.486668,13.319723,Charlottenburg-Wilmersdorf,11004004
1,60848455,sparkasse,,berliner sparkasse,anton-saefkow-platz,13,10369,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,unknown,52.530331,13.471037,Lichtenberg,11011011
2,87040399,dkb,,,,,,unknown,unknown,unknown,52.511050,13.388798,Mitte,11001001
3,89274635,deutsche bank,deutsche bank,deutsche bank,alexanderstraße,5,10178,mo-tu 10:00-18:00; we 10:00-16:00; th 10:00-18...,true,true,52.523238,13.415750,Mitte,11001001
4,203561614,sparkasse,,berliner sparkasse,helene-weigel-platz,1/2,12681,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,true,52.527687,13.538327,Marzahn-Hellersdorf,11010010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,210920895,sparkasse,,berliner sparkasse,alt-rudow,15,12357,"mo,we,fr 09:30-15:00; tu,th 09:30-18:00",true,true,52.420953,13.491388,Neukölln,11008008
319,336063499,reisebank,reisebank,reisebank,am ostbahnhof,5,10243,mo-fr 09:00-19:00; sa 10:00-18:00,true,true,52.510354,13.433634,Friedrichshain-Kreuzberg,11002002
320,422475544,allgemeine beamten bank,,,invalidenstraße,28,10115,"mo-fr 09:00-13:00, mo,we 14:00-18:00",unknown,true,52.531277,13.384913,Mitte,11001001
321,423847739,sparda-bank berlin,sparda-bank berlin,sparda-bank berlin eg,,,,"mo,th 09:00-13:00,14:00-18:00",true,true,52.534183,13.198554,Spandau,11005005
