# 🧪 Step 1: Research & Data Modelling
**PR Branch Name:** banks-data-modelling

This notebook documents the process for Step 1 of the "Banks in Berlin" project:
- **1.1 Data Source Discovery**
- **1.2 Modelling & Planning**
- **1.3 Prepare the /sources Directory**
- **1.4 Review**

Goals:
- Identify and document relevant data sources.
- Select the 23 key parameters for our use case.
- Draft the planned table schema.
- Plan cleaning and transformation steps before database population.


## 1.1 Data Source Discovery

**Topic:** Banks in Berlin

**Main source:**
- **Name:** OpenStreetMap (OSM) via OSMnx library
- **Source and origin:** Public crowdsourced geospatial database
- **Update frequency:** Continuous (dynamic)
- **Data type:** Dynamic (API query using `amenity=bank`)
- **Reason for selection:**  
  - Broad coverage of banks in Berlin (crowdsourced, so completeness is not guaranteed)  
  - Includes coordinates and, where mapped, names, addresses, and other useful attributes  
  - Open, free, and easy to query programmatically

**Optional additional sources:**
- **Name:** Berlin Open Data Portal (daten.berlin.de)
- **Source and origin:** Official Berlin city government
- **Update frequency:** Varies per dataset
- **Data type:** Static or semi-static (download as CSV/GeoJSON)
- **Possible usage:** Enrich with official administrative boundaries or extra metadata

**Enrichment potential:**
- Neighborhood/district info from Berlin shapefiles (GeoJSON)
- Linking to local amenities for spatial context


In [23]:
# Install Libraries

%pip install osmnx geopandas pandas --quiet

Note: you may need to restart the kernel to use updated packages.


In [24]:
# Import Libraries

import osmnx as ox
import geopandas as gpd
import pandas as pd

In [25]:
# Fetch banks in Berlin from OSM using the tag "amenity=bank"

tags = {"amenity": "bank"}

In [26]:
# Fetch bank features for Berlin.
# Recent OSMnx releases (2.x) removed geometries_from_place; features_from_place replaces it.

banks_gdf = ox.features_from_place("Berlin, Germany", tags)

In [None]:
# Display basic info

print(f"Number of bank entries fetched: {len(banks_gdf)}")
banks_gdf.head(3)


## 1.2 Modelling & Planning

### Selected 23 Key Columns
1. osm_id
2. name
3. brand
4. operator
5. street
6. housenumber
7. postcode
8. city
9. country
10. phone
11. email
12. website
13. opening_hours
14. atm
15. wheelchair
16. building
17. latitude
18. longitude
19. geom_type
20. geom
21. neighbourhood
22. district
23. source

---

### How this connects to existing tables:
- **Coordinates (latitude, longitude, geom):** link to neighbourhood and district polygons.
- **Neighbourhood & district fields:** join with administrative boundaries table.
- **Source field:** ensures traceability.
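
The link between bank coordinates and district polygons can be sketched as a spatial join. This is a minimal illustration, not the project's final enrichment step: the two rectangular "districts" and the two bank points are made-up placeholders, and the real polygons would come from a Berlin administrative-boundaries dataset.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical district polygons (simple rectangles), standing in for the
# real administrative boundaries, which are not part of this notebook.
districts = gpd.GeoDataFrame(
    {"district": ["West", "East"]},
    geometry=[
        Polygon([(13.0, 52.3), (13.4, 52.3), (13.4, 52.7), (13.0, 52.7)]),
        Polygon([(13.4, 52.3), (13.8, 52.3), (13.8, 52.7), (13.4, 52.7)]),
    ],
    crs="EPSG:4326",
)

# Two made-up bank locations as (longitude, latitude) points.
banks = gpd.GeoDataFrame(
    {"name": ["Bank A", "Bank B"]},
    geometry=[Point(13.2, 52.5), Point(13.6, 52.5)],
    crs="EPSG:4326",
)

# A left join keeps every bank; a bank outside all polygons gets NaN as district.
joined = gpd.sjoin(banks, districts, how="left", predicate="within")
result = joined[["name", "district"]].reset_index(drop=True)
print(result)
```

Both layers must share the same CRS before joining; here both are declared as EPSG:4326.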

---

### Planned Schema: `banks_in_berlin`
| Column Name     | Data Type | Description | Example |
|-----------------|-----------|-------------|---------|
| osm_id          | int       | OSM element ID (unique per element type: node, way, relation) | 12345678 |
| name            | text      | Bank name | Deutsche Bank |
| brand           | text      | Brand name if available | Sparkasse |
| operator        | text      | Entity operating the bank | Berliner Volksbank |
| street          | text      | Street name | Friedrichstraße |
| housenumber     | text      | House number | 45 |
| postcode        | text      | Postal code | 10117 |
| city            | text      | City name | Berlin |
| country         | text      | Country code | DE |
| phone           | text      | Contact phone | +49 30 123456 |
| email           | text      | Contact email | info@bank.de |
| website         | text      | Website URL | www.bank.de |
| opening_hours   | text      | Opening hours string | Mo-Fr 09:00-17:00 |
| atm             | text      | Presence of ATM | yes |
| wheelchair      | text      | Accessibility info | yes |
| building        | text      | Building type | yes |
| latitude        | float     | Latitude coordinate | 52.5200 |
| longitude       | float     | Longitude coordinate | 13.4050 |
| geom_type       | text      | Geometry type | Point |
| geom            | geometry  | Full geometry | (GeoJSON) |
| neighbourhood   | text      | Local neighbourhood name | Mitte |
| district        | text      | Berlin district | Mitte |
| source          | text      | Data source info | OSM |
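
A small check like the following could confirm that a prepared table matches the planned schema. It is only a sketch: `SCHEMA_COLUMNS` and `check_columns` are hypothetical helper names introduced here, and the check covers column names only, not data types.

```python
# Planned columns of banks_in_berlin, in schema order.
SCHEMA_COLUMNS = [
    "osm_id", "name", "brand", "operator",
    "street", "housenumber", "postcode", "city", "country",
    "phone", "email", "website", "opening_hours",
    "atm", "wheelchair", "building",
    "latitude", "longitude", "geom_type", "geom",
    "neighbourhood", "district", "source",
]


def check_columns(columns):
    """Return (missing, unexpected) column names relative to the planned schema."""
    columns = list(columns)
    present = set(columns)
    missing = [c for c in SCHEMA_COLUMNS if c not in present]
    unexpected = [c for c in columns if c not in SCHEMA_COLUMNS]
    return missing, unexpected


# Example: a table that lacks most columns and carries one stray column.
missing, unexpected = check_columns(["osm_id", "name", "foo"])
print(len(missing), unexpected)
```

With a DataFrame, the call would be `check_columns(df.columns)`.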

---

### Known Data Issues
- Missing contact details for some entries.
- Inconsistent postcode and address formats.
- Neighbourhood and district not always included in raw OSM data.
- Opening hours in non-standard formats.
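
The format issues above could be handled with small normalizers such as these. This is a sketch under stated assumptions: postcodes are treated as valid only if they contain exactly five digits, phone numbers with a leading `0` are assumed to be German national numbers, and `https://` is assumed when a URL has no scheme. The function names are hypothetical, and each function expects a string or `None`.

```python
import re


def normalize_postcode(value):
    """Return the five digits of a German postcode, or None if not exactly five."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", str(value))
    return digits if len(digits) == 5 else None


def normalize_phone(value):
    """Return a compact number with +49 prefix (assumes German national numbers)."""
    if value is None:
        return None
    # Keep only digits and the plus sign.
    compact = re.sub(r"[^\d+]", "", str(value))
    if compact.startswith("00"):
        compact = "+" + compact[2:]      # 0049... -> +49...
    elif compact.startswith("0"):
        compact = "+49" + compact[1:]    # 030...  -> +4930...
    return compact or None


def normalize_website(value):
    """Strip whitespace and prepend https:// when no scheme is present (an assumption)."""
    if value is None or not str(value).strip():
        return None
    url = str(value).strip()
    if not re.match(r"^https?://", url, flags=re.IGNORECASE):
        url = "https://" + url
    return url


print(normalize_postcode("D-10117"))
print(normalize_phone("030 / 123456"))
print(normalize_website("www.bank.de"))
```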

---

### Transformation Plan
1. Fetch data from OSM with filter `amenity=bank` (place query for "Berlin, Germany").
2. Clean column names → snake_case.
3. Normalize formats (phone, postcode, website URLs).
4. Enrich with neighbourhood/district via spatial join.
5. Save cleaned dataset (GeoJSON + CSV).
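
Step 2 of the plan could look roughly like this. `to_snake_case` is a hypothetical helper, and it assumes ASCII column names, which holds for typical OSM keys such as `addr:street`; any non-ASCII character would be replaced by an underscore.

```python
import re


def to_snake_case(name):
    """Convert a key such as 'addr:street' or 'Opening Hours' to snake_case."""
    # Insert an underscore at camelCase boundaries (lower-case or digit, then upper-case).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


for key in ["addr:street", "Opening Hours", "contact:phone", "openingHours"]:
    print(key, "->", to_snake_case(key))
```

With pandas, the function can be passed directly to `rename`, for example `df.rename(columns=to_snake_case)`.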


In [None]:
# Select 23 Columns & Add Coordinates

In [None]:
# Reproject to WGS84 (EPSG:4326) so that coordinates are in degrees

banks_gdf = banks_gdf.to_crs(epsg=4326)

In [None]:
# Extract latitude and longitude. Banks mapped as buildings are polygons,
# so use a representative point inside each geometry (a Point maps to itself).

points = banks_gdf.geometry.representative_point()
banks_gdf["latitude"] = points.y
banks_gdf["longitude"] = points.x

In [None]:
# Select the 23 columns (columns absent from the OSM result are filled with NaN below)

selected_columns = [
    "osmid", "name", "brand", "operator",
    "addr:street", "addr:housenumber", "addr:postcode", "addr:city", "addr:country",
    "phone", "email", "website", "opening_hours",
    "atm", "wheelchair", "building",
    "latitude", "longitude", "geom_type", "geometry",
    # placeholders for enrichment
    "neighbourhood", "district",
    # add source info
    "source"
]

In [None]:
# Rename columns to match our schema ("geometry" corresponds to "geom" in the schema)

rename_map = {
    "osmid": "osm_id",
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "addr:country": "country"
}

In [None]:
# Move the OSM index into columns; the ID level is named "id" in OSMnx 2.x and "osmid" in 1.x
banks_df = banks_gdf.reset_index().rename(columns={"id": "osmid"})
banks_df["geom_type"] = banks_df.geometry.geom_type
banks_df["source"] = "OSM"

# Keep the planned columns (absent ones become NaN), then rename them to match the schema
banks_df = banks_df.reindex(columns=selected_columns).rename(columns=rename_map)

## 1.3 Prepare the /sources Directory

- **Raw Data Files:**  
    - `banks_raw.geojson` (includes geometry)  
    - `banks_raw.csv` (tabular only, no geometry)  

- **README.md** in `/sources` will contain:
    - Data sources used.
    - Planned transformation steps.


In [None]:
# Create /sources folder if it doesn't exist

import os

os.makedirs("sources", exist_ok=True)

In [None]:
# Save as GeoJSON (keeps geometry) and CSV

raw_geojson_path = "sources/banks_raw.geojson"
raw_csv_path = "sources/banks_raw.csv"


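banks_gdf.to_file(raw_geojson_path, driver="GeoJSON")
# Reset the index first so the OSM element type and ID are written to the CSV as columns
banks_gdf.drop(columns="geometry").reset_index().to_csv(raw_csv_path, index=False)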

print(f"Raw data saved to: {raw_geojson_path} and {raw_csv_path}")

## 1.4 Review

- All 23 target columns defined.
- Data sources identified and documented.
- Schema draft created.
- Data fetched and stored in `/sources`.
- Data cleaning & enrichment plan in place.

**Next Step:** Step 2 — Fetch & Transform data.
