# 🧪 Step 1: Research & Data Source Discovery

**PR Branch Name: supermarkets-data-modelling**

This notebook documents the process for Step 1 of the "Supermarkets in Berlin" project:

- 1.1 Data Source Discovery
- 1.2 Modelling & Planning
- 1.3 Prepare the /sources Directory
- 1.4 Review

**Goal:**

- Identify and document relevant data sources.
- Select key parameters for our use case.
- Draft the planned table schema.
- Plan cleaning and transformation steps before database population.

## 1.1 Data Source Discovery

Topic: Supermarkets in Berlin

**Main source:**

- Name: OpenStreetMap (OSM) via OSMnx library
- Source and origin: Public crowdsourced geospatial database
- Update frequency: Continuous (dynamic)
- Data type: Dynamic (API query using shop=supermarket)

**Reason for selection:**

- Covers supermarkets across all of Berlin (completeness depends on community mapping)
- Includes coordinates, names, addresses, and other useful attributes
- Open, free, and easy to query programmatically

**Optional additional sources:**

- Name: Berlin Open Data Portal (daten.berlin.de)
- Source and origin: Official Berlin city government
- Update frequency: Varies per dataset
- Data type: Static or semi-static (download as CSV/GeoJSON)
- Possible usage: Enrich with official administrative boundaries or extra metadata

**Enrichment potential:**

- Neighborhood/district info from Berlin boundary files (shapefile or GeoJSON)
- Linking to local amenities for spatial context

**📍 Fetch data about supermarkets in Berlin from OpenStreetMap (OSM)**

In [69]:
# install libraries
# %pip install osmnx geopandas pandas

In [70]:
# import libraries
import osmnx as ox
import geopandas as gpd
import pandas as pd

In [71]:
# Tag filter: keep only OSM features tagged shop=supermarket.

tags = {"shop": "supermarket"}

In [72]:
# ✅ Enable caching: speeds up repeated queries
# 🖥 Log details to the console (helpful for debugging)
ox.settings.use_cache = True
ox.settings.log_console = True

In [73]:
# Fetch supermarkets in Berlin from OSM using the tag shop=supermarket
gdf = ox.features.features_from_place("Berlin, Germany", tags=tags)

This call queries the OSM Overpass API and returns a GeoDataFrame (`gdf`) with all features tagged as supermarkets in Berlin. Each row carries a geometry (a point for nodes, a polygon for mapped outlines) plus the OSM metadata attached to the feature, such as name, address, and brand.

In [74]:
# Display basic info

print(f"Number of supermarket entries fetched: {len(gdf)}")
gdf.head(3)

Number of supermarket entries fetched: 1358


| element | id | geometry | addr:city | addr:country | addr:housenumber | addr:postcode | addr:street | addr:suburb | brand | brand:wikidata | brand:wikipedia | ... | cash_withdrawal:purchase_minimum | internet | payment:ec | disused:shop | building:part | fee | building:parts | safety:hand_sanitizer:covid19 | operator:legal | opening_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| node | 58489979 | POINT (13.40737 52.50982) | Berlin | DE | 83.0 | 10179.0 | Alte Jakobstraße | Mitte | Netto Marken-Discount | Q879858 | de:Netto Marken-Discount | ... |  |  |  |  |  |  |  |  |  |  |
| node | 79418658 | POINT (13.30855 52.57242) | Berlin | DE | 2.0 | 13403.0 | Quäkerstraße | Reinickendorf |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| node | 79422426 | POINT (13.31248 52.57168) |  |  |  |  |  |  | Nahkauf | Q57515238 |  | ... |  |  |  |  |  |  |  |  |  |  |


In [None]:
#gdf.to_file("../sources/raw_supermarkets.geojson", driver="GeoJSON")

In [None]:
#gdf.to_csv("../sources/raw_supermarkets.csv", index=False)

In [77]:
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1358 entries, ('node', np.int64(58489979)) to ('way', np.int64(1429504461))
Columns: 230 entries, geometry to opening_date
dtypes: geometry(1), object(229)
memory usage: 2.4+ MB


In [78]:
print(gdf.columns.tolist())


['geometry', 'addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode', 'addr:street', 'addr:suburb', 'brand', 'brand:wikidata', 'brand:wikipedia', 'check_date', 'check_date:opening_hours', 'internet_access', 'name', 'opening_hours', 'payment:mastercard', 'payment:visa', 'shop', 'website', 'wheelchair', 'operator', 'phone', 'layer', 'payment:cards', 'payment:cash', 'payment:contactless', 'atm', 'atm:operator', 'type', 'wheelchair:description', 'contact:website', 'toilets:wheelchair', 'name:fa', 'diet:halal', 'diet:kosher', 'origin', 'diet:gluten_free', 'diet:lactose_free', 'diet:sugar_free', 'organic', 'contact:phone', 'operator:wikidata', 'internet_access:fee', 'level', 'payment:credit_cards', 'payment:girocard', 'email', 'branch', 'source', 'building', 'air_conditioning', 'drink:club-mate', 'payment:debit_cards', 'ref', 'self_checkout', 'stroller', 'toilets', 'payment:maestro', 'brand:website', 'addr:floor', 'diet:vegan', 'diet:vegetarian', 'name:de', 'note', 'currency:EUR', '

In [79]:
gdf['atm'].value_counts()

atm
yes    3
no     1
Name: count, dtype: int64

In [80]:
# Keep only the columns that are relevant to the project
columns = [
    "name", "addr:street", "addr:housenumber", "addr:postcode",
    "addr:city", "opening_hours", "brand", "type", "geometry",
    "payment:credit_cards", "payment:debit_cards", "payment:cash",
    "payment:contactless", "wheelchair", "internet_access", "layer",
]
gdf_superstore = gdf[[col for col in columns if col in gdf.columns]].copy()

In [81]:
gdf_superstore.head(3)

| element | id | name | addr:street | addr:housenumber | addr:postcode | addr:city | opening_hours | brand | type | geometry | payment:credit_cards | payment:debit_cards | payment:cash | payment:contactless | wheelchair | internet_access | layer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| node | 58489979 | Netto Marken-Discount | Alte Jakobstraße | 83.0 | 10179.0 | Berlin | Mo-Sa 07:00-24:00; Su off | Netto Marken-Discount |  | POINT (13.40737 52.50982) |  |  |  |  | no | no |  |
| node | 79418658 | Ledo | Quäkerstraße | 2.0 | 13403.0 | Berlin | Mo-Sa 09:00-20:00 |  |  | POINT (13.30855 52.57242) |  |  |  |  | yes |  |  |
| node | 79422426 | kiezmarkt |  |  |  |  | Mo-Fr 08:00-20:00, Sa 08:00-19:00 | Nahkauf |  | POINT (13.31248 52.57168) |  |  |  |  | yes |  |  |


In [82]:
print('filtered dataset shape:', gdf_superstore.shape)
gdf_superstore.info()

filtered dataset shape: (1358, 16)
<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1358 entries, ('node', np.int64(58489979)) to ('way', np.int64(1429504461))
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   name                  1355 non-null   object  
 1   addr:street           1105 non-null   object  
 2   addr:housenumber      1097 non-null   object  
 3   addr:postcode         1081 non-null   object  
 4   addr:city             1044 non-null   object  
 5   opening_hours         1260 non-null   object  
 6   brand                 1134 non-null   object  
 7   type                  7 non-null      object  
 8   geometry              1358 non-null   geometry
 9   payment:credit_cards  152 non-null    object  
 10  payment:debit_cards   145 non-null    object  
 11  payment:cash          169 non-null    object  
 12  payment:contactless   37 non-null     object  
 13  wheelchair   

In [83]:
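# Reduce every geometry to a single point: keep points as they are and
# replace other shapes (e.g. building outlines) with a point inside the shape.
gdf_superstore['geometry'] = gdf_superstore['geometry'].apply(lambda geom: geom if geom.geom_type == 'Point' else geom.representative_point())

# Extract latitude and longitude.
# Note: the centroid is computed on the original gdf, which uses a geographic CRS,
# so GeoPandas emits a UserWarning for these two lines. For non-point features the
# centroid can also differ slightly from the representative point stored above.
gdf_superstore["latitude"] = gdf.geometry.centroid.y
gdf_superstore["longitude"] = gdf.geometry.centroid.x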


In [84]:
# rename columns for better understanding
gdf_superstore = gdf_superstore.rename(columns={
    "addr:street": "street",
    "addr:housenumber": "housenumber",
    "addr:postcode": "postcode",
    "addr:city": "city",
    "payment:credit_cards": "payment_credit_card",
    "payment:debit_cards": "payment_debit_cards",
    "payment:cash": "payment_cash",
    "payment:contactless": "payment_contactless"
    })

In [85]:
gdf_superstore.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1358 entries, ('node', np.int64(58489979)) to ('way', np.int64(1429504461))
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   name                 1355 non-null   object  
 1   street               1105 non-null   object  
 2   housenumber          1097 non-null   object  
 3   postcode             1081 non-null   object  
 4   city                 1044 non-null   object  
 5   opening_hours        1260 non-null   object  
 6   brand                1134 non-null   object  
 7   type                 7 non-null      object  
 8   geometry             1358 non-null   geometry
 9   payment_credit_card  152 non-null    object  
 10  payment_debit_cards  145 non-null    object  
 11  payment_cash         169 non-null    object  
 12  payment_contactless  37 non-null     object  
 13  wheelchair           1213 non-null   object  
 14  internet_acc

**⚠️ Data Quality Summary – Berlin Supermarkets Dataset**

- ✅ The dataset contains 1358 entries, all with latitude and longitude.

**📉 Missing Values**

- Address-related fields have some missing values:
  - street (19% missing)
  - housenumber (19% missing)
  - postcode (20% missing)
  - city (23% missing)
- Payment method tags are very sparse:
  - payment_credit_card (only 11% present)
  - payment_contactless (only 2.7% present)
- internet_access and type are also highly incomplete.
- brand is missing in ~16% of records.
- Accessibility: the wheelchair tag is present for ~89% of stores, so it is relatively well covered. This measures tag coverage, not the share of accessible stores.

**💡 Recommendation**

- Consider filling missing address values using reverse geocoding (based on lat/lon).
- Treat payment method columns as optional metadata.
- Drop or ignore type, layer, and other sparse columns unless specifically needed.
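
The recommendation to drop sparse columns can be expressed as a threshold on the share of missing values per column. The following is a minimal sketch on a toy DataFrame; the helper name `drop_sparse_columns` and the 0.8 cutoff are assumptions for illustration and are not part of the project.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.8) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep].copy()


# Toy data standing in for gdf_superstore (values are illustrative only)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "layer": [None, None, None, "1", "2"],   # 60% missing, kept
    "type": [None, None, None, None, None],  # 100% missing, dropped
})
slim = drop_sparse_columns(toy, max_missing=0.8)
print(slim.columns.tolist())
```

A threshold keeps the decision explicit and repeatable; columns that are sparse but still wanted can be listed as exceptions instead of raising the cutoff.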


In [86]:
gdf_superstore.head()

| element | id | name | street | housenumber | postcode | city | opening_hours | brand | type | geometry | payment_credit_card | payment_debit_cards | payment_cash | payment_contactless | wheelchair | internet_access | layer | latitude | longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| node | 58489979 | Netto Marken-Discount | Alte Jakobstraße | 83.0 | 10179.0 | Berlin | Mo-Sa 07:00-24:00; Su off | Netto Marken-Discount |  | POINT (13.40737 52.50982) |  |  |  |  | no | no |  | 52.509819 | 13.407373 |
| node | 79418658 | Ledo | Quäkerstraße | 2.0 | 13403.0 | Berlin | Mo-Sa 09:00-20:00 |  |  | POINT (13.30855 52.57242) |  |  |  |  | yes |  |  | 52.572417 | 13.308547 |
| node | 79422426 | kiezmarkt |  |  |  |  | Mo-Fr 08:00-20:00, Sa 08:00-19:00 | Nahkauf |  | POINT (13.31248 52.57168) |  |  |  |  | yes |  |  | 52.571676 | 13.312476 |
| node | 79428988 | nah & gut | Scharnweberstraße | 100.0 | 13405.0 | Berlin | Mo-Sa 07:00-22:00 | EDEKA |  | POINT (13.31693 52.5664) |  |  |  |  | yes |  |  | 52.566403 | 13.316931 |
| node | 79438509 | Nahkauf | Meller Bogen | 2.0 | 13403.0 | Berlin | Mo-Sa 07:00-20:00 | Nahkauf |  | POINT (13.32114 52.57049) |  |  | yes | yes | limited |  | 2.0 | 52.570486 | 13.321144 |


## 1.2 Modelling & Planning

**Selected 22 Key Columns**

 1.  osm_id 
 2.  name                 
 3.  street               
 4.  housenumber           
 5.  postcode              
 6.  city                   
 7.  opening_hours         
 8.  brand                  
 9.  type                  
 10. geometry             
 11. payment_credit_card   
 12. payment_debit_cards   
 13. payment_cash          
 14. payment_contactless    
 15. wheelchair            
 16. internet_access      
 17. layer                
 18. latitude           
 19. longitude 
 20. neighbourhood
 21. district
 22. source

**How this connects to existing tables:**

- Coordinates (latitude, longitude, geometry): link to neighbourhood and district polygons via a spatial join.
- Neighbourhood & district fields: join with administrative boundaries table.
- Source field: ensures traceability.
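
The link between store coordinates and district polygons could be made with a GeoPandas spatial join. The sketch below uses two hypothetical rectangular "districts" (not real Berlin boundaries) and two store points taken from the preview rows; the district names and the `districts` frame are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical district polygons (simple rectangles, NOT real Berlin boundaries)
districts = gpd.GeoDataFrame(
    {"district": ["West", "East"]},
    geometry=[box(13.0, 52.3, 13.4, 52.7), box(13.4, 52.3, 13.8, 52.7)],
    crs="EPSG:4326",
)

# Two store points copied from the preview rows of the dataset
stores = gpd.GeoDataFrame(
    {"name": ["Ledo", "Netto Marken-Discount"]},
    geometry=[Point(13.30855, 52.57242), Point(13.40737, 52.50982)],
    crs="EPSG:4326",
)

# A left join keeps every store; stores outside all polygons get a missing district
joined = gpd.sjoin(stores, districts, how="left", predicate="within")
print(joined[["name", "district"]])
```

Both frames need the same CRS before joining. With real boundary data, the polygon layer would replace the toy `districts` frame.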

### 🏪 Planned Schema: `superstore_in_berlin`

| Column Name             | Data Type | Description                                      | Example                        |
|-------------------------|-----------|--------------------------------------------------|--------------------------------|
| `osm_id`                | int       | Unique OSM element ID                            | 58489979                       |
| `name`                  | text      | Supermarket or store name                        | Netto Marken-Discount          |
| `brand`                 | text      | Brand if available                               | Nahkauf                        |
| `street`                | text      | Street name                                      | Alte Jakobstraße               |
| `housenumber`           | text      | House number                                     | 83                             |
| `postcode`              | text      | Postal code                                      | 10179                          |
| `city`                  | text      | City name                                        | Berlin                         |
| `opening_hours`         | text      | Opening hours string                             | Mo-Sa 07:00-24:00; Su off      |
| `type`                  | text      | Store type if tagged                             | supermarket                    |
| `payment_credit_card`   | text      | Accepts credit card                              | yes                            |
| `payment_debit_cards`   | text      | Accepts debit cards                              | yes                            |
| `payment_cash`          | text      | Accepts cash payment                             | yes                            |
| `payment_contactless`   | text      | Accepts contactless payments                     | no                             |
| `wheelchair`            | text      | Accessibility info                               | yes                            |
| `internet_access`       | text      | Public internet access (e.g., wifi)              | wlan                           |
| `layer`                 | text      | OSM `layer` tag (relative vertical position)     | 1                              |
| `latitude`              | float     | Latitude coordinate                              | 52.5200                        |
| `longitude`             | float     | Longitude coordinate                             | 13.4050                        |
| `geometry`              | geometry  | Point geometry (WKT representation shown)        | POINT (13.4050 52.5200)        |
| `neighbourhood`         | text      | Local neighborhood (optional / derived)          | Kreuzberg                      |
| `district`              | text      | Berlin administrative district                   | Friedrichshain-Kreuzberg       |
| `source`                | text      | Data source info                                 | OSM                            |
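
The schema lists `osm_id`, while the fetched data keeps the OSM identifier in the `(element, id)` index rather than in a column. One possible way to expose it is sketched below on a toy frame. The `osm_type` column name is an assumption and is not part of the planned schema; it is kept here because OSM IDs are only unique within one element type.

```python
import pandas as pd

# Toy frame mimicking the (element, id) MultiIndex seen in the previews
index = pd.MultiIndex.from_tuples(
    [("node", 58489979), ("node", 79418658)], names=["element", "id"]
)
toy = pd.DataFrame({"name": ["Netto Marken-Discount", "Ledo"]}, index=index)

# Move both index levels into columns and rename them
flat = toy.reset_index().rename(columns={"id": "osm_id", "element": "osm_type"})
print(flat.columns.tolist())
```

Doing this before exporting matters for the CSV in particular, since writing with `index=False` discards the index and with it the OSM identifier.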

**Transformation Plan**

- Fetch data from OSM with the filter shop=supermarket (place query for "Berlin, Germany"). ✅
- Clean column names, e.g. "addr:street" → "street". ✅
- Normalize formats and consider filling missing address values using reverse geocoding (based on lat/lon). 📌
- Enrich with neighbourhood/district via spatial join. 📌
- Save cleaned dataset (GeoJSON + CSV). 📌
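
As one example of format normalization, the previews show postcodes and house numbers such as `10179.0` and `83.0`, while the schema expects plain text like `10179`. A minimal sketch of a cleanup helper follows; the function name `normalize_postcode` is hypothetical, and whether the stored values are strings or floats is an assumption, so the helper accepts both.

```python
import math


def normalize_postcode(value):
    """Return the postcode as a plain digit string, or None if it is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    # Drop a trailing ".0" left by float conversion, e.g. "10179.0" -> "10179"
    if text.endswith(".0"):
        text = text[:-2]
    return text


samples = ["10179.0", 13403.0, None, "10827"]
cleaned = [normalize_postcode(v) for v in samples]
print(cleaned)
```

On the real data this could be applied with something like `gdf_superstore["postcode"].map(normalize_postcode)`. House numbers need more care, because values such as `12a` must pass through unchanged.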


In [87]:
# Placeholder for the data source label; left empty for now
gdf_superstore["source"] = None

In [88]:
gdf_superstore.head(3)

| element | id | name | street | housenumber | postcode | city | opening_hours | brand | type | geometry | payment_credit_card | payment_debit_cards | payment_cash | payment_contactless | wheelchair | internet_access | layer | latitude | longitude | source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| node | 58489979 | Netto Marken-Discount | Alte Jakobstraße | 83.0 | 10179.0 | Berlin | Mo-Sa 07:00-24:00; Su off | Netto Marken-Discount |  | POINT (13.40737 52.50982) |  |  |  |  | no | no |  | 52.509819 | 13.407373 |  |
| node | 79418658 | Ledo | Quäkerstraße | 2.0 | 13403.0 | Berlin | Mo-Sa 09:00-20:00 |  |  | POINT (13.30855 52.57242) |  |  |  |  | yes |  |  | 52.572417 | 13.308547 |  |
| node | 79422426 | kiezmarkt |  |  |  |  | Mo-Fr 08:00-20:00, Sa 08:00-19:00 | Nahkauf |  | POINT (13.31248 52.57168) |  |  |  |  | yes |  |  | 52.571676 | 13.312476 |  |


# Step 1 Review and A–F Data Familiarization

## A) Quick overview

In [89]:
print("Rows, Columns:", gdf_superstore.shape)
print("\nColumns:", gdf_superstore.columns.tolist())
print("data Info\n")
gdf_superstore.info()  # info() prints its report directly and returns None

Rows, Columns: (1358, 19)

Columns: ['name', 'street', 'housenumber', 'postcode', 'city', 'opening_hours', 'brand', 'type', 'geometry', 'payment_credit_card', 'payment_debit_cards', 'payment_cash', 'payment_contactless', 'wheelchair', 'internet_access', 'layer', 'latitude', 'longitude', 'source']
data Info

<class 'geopandas.geodataframe.GeoDataFrame'>
MultiIndex: 1358 entries, ('node', np.int64(58489979)) to ('way', np.int64(1429504461))
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   name                 1355 non-null   object  
 1   street               1105 non-null   object  
 2   housenumber          1097 non-null   object  
 3   postcode             1081 non-null   object  
 4   city                 1044 non-null   object  
 5   opening_hours        1260 non-null   object  
 6   brand                1134 non-null   object  
 7   type                 7 non-null      object  
 8   geometry 

## B) Missing values per column

In [90]:
missing_count = gdf_superstore.isna().sum().sort_values(ascending=False)
print(missing_count)

source                 1358
layer                  1355
type                   1351
payment_contactless    1321
internet_access        1242
payment_debit_cards    1213
payment_credit_card    1206
payment_cash           1189
city                    314
postcode                277
housenumber             261
street                  253
brand                   224
wheelchair              145
opening_hours            98
name                      3
geometry                  0
latitude                  0
longitude                 0
dtype: int64


In [91]:
row_count = len(gdf_superstore)
print(row_count)
missing = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": (missing_count / row_count * 100).round(1)
}).sort_values(by="missing_pct", ascending=False)

print(missing)

1358
                     missing_count  missing_pct
source                        1358        100.0
layer                         1355         99.8
type                          1351         99.5
payment_contactless           1321         97.3
internet_access               1242         91.5
payment_debit_cards           1213         89.3
payment_credit_card           1206         88.8
payment_cash                  1189         87.6
city                           314         23.1
postcode                       277         20.4
housenumber                    261         19.2
street                         253         18.6
brand                          224         16.5
wheelchair                     145         10.7
opening_hours                   98          7.2
name                             3          0.2
geometry                         0          0.0
latitude                         0          0.0
longitude                        0          0.0


**✅ Recommendations**

- Drop or ignore high-missing fields
  - Columns like source, layer, type, payment_*, and internet_access have too many missing values to be meaningful without external enrichment.
  - You can either drop them or keep them for very limited exploratory use.
- Enrich moderate-missing fields
  - Consider reverse geocoding to infer street, housenumber, postcode, and city from latitude/longitude.
  - Use brand clustering or reference lists to fill common brand names.
- Fill remaining small gaps
  - For opening_hours, wheelchair, and name, missing values can be flagged or marked as "unknown" for visualizations and charts.
- Keep spatial data clean
  - Ensure geometry, latitude, and longitude are retained as the core fields for mapping and spatial joins.


## C) Distinct values per column

In [92]:
distinct = gdf_superstore.nunique().sort_values(ascending=False)
print(distinct)

longitude              1358
geometry               1358
latitude               1358
street                  635
housenumber             445
name                    315
opening_hours           239
postcode                186
brand                    26
type                      4
wheelchair                3
internet_access           3
layer                     2
payment_credit_card       2
payment_debit_cards       2
payment_contactless       1
payment_cash              1
city                      1
source                    0
dtype: int64


## D) Most common values in key columns

In [93]:
# Goal: Peek at distributions, not just counts.

# Example: top 10 brands
print("\nTop 10 brands:")
print(gdf_superstore["brand"].value_counts().head(10))


Top 10 brands:
brand
Edeka                    163
Lidl                     142
Netto Marken-Discount    126
Rewe                     121
Aldi Nord                 96
Penny                     71
EDEKA                     54
Denns BioMarkt            50
Bio Company               49
Netto                     44
Name: count, dtype: int64
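
The counts above list `Edeka` and `EDEKA` as separate brands, which suggests that spelling variants should be merged before any per-brand analysis. A minimal sketch, assuming a hand-maintained mapping from a case-insensitive key to one canonical spelling (the `canonical` dictionary and the chosen spellings are illustrative, not project decisions):

```python
import pandas as pd

brands = pd.Series(["Edeka", "EDEKA", "Lidl", "Rewe", None, "EDEKA"])

# Case-insensitive key mapped to one canonical spelling per brand
canonical = {"edeka": "EDEKA", "lidl": "Lidl", "rewe": "REWE"}
key = brands.str.casefold()

# Brands without a mapping keep their original value; missing stays missing
normalized = key.map(canonical).fillna(brands)
print(normalized.value_counts())
```

An explicit mapping is safer than automatic merging: names such as `Netto` and `Netto Marken-Discount` look similar but may refer to different chains, so they should only be combined deliberately.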


In [94]:
print("\nTop 10 streets:")
print(gdf_superstore["street"].value_counts().head(10))


Top 10 streets:
street
Hauptstraße             19
Müllerstraße            14
Tempelhofer Damm        11
Frankfurter Allee       10
Wilmersdorfer Straße    10
Greifswalder Straße     10
Hermannstraße            9
Karl-Marx-Straße         9
Friedrichstraße          9
Landsberger Allee        8
Name: count, dtype: int64


In [95]:
print("\nTop 10 opening hours:")
print(gdf_superstore["opening_hours"].value_counts().head(10))


Top 10 opening hours:
opening_hours
Mo-Sa 07:00-22:00                    254
Mo-Sa 07:00-21:00                    247
Mo-Sa 08:00-21:00                    100
Mo-Sa 08:00-20:00                     88
Mo-Sa 07:00-20:00                     59
Mo-Sa 07:00-21:00; PH off             31
Mo-Sa 07:00-22:00; PH off             25
Mo-Sa 07:00-22:00; Su,PH off          23
Mo-Sa 08:00-22:00                     22
Mo-Fr 07:00-24:00; Sa 07:00-23:30     19
Name: count, dtype: int64
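
Most of the frequent values follow a simple `day-day HH:MM-HH:MM` pattern, so a first pass could split those into structured fields and leave everything else for later. The sketch below handles only that simplest form; the helper name `parse_simple_hours` is hypothetical, and strings with several rules (such as `; PH off`) are deliberately returned as `None` rather than guessed at.

```python
import re

DAY = r"(Mo|Tu|We|Th|Fr|Sa|Su)"
# Matches only the simplest form, e.g. "Mo-Sa 07:00-22:00"
SIMPLE = re.compile(rf"^{DAY}-{DAY} (\d{{2}}:\d{{2}})-(\d{{2}}:\d{{2}})$")


def parse_simple_hours(value):
    """Return (first_day, last_day, opens, closes), or None for anything more complex."""
    match = SIMPLE.match(value or "")
    return match.groups() if match else None


print(parse_simple_hours("Mo-Sa 07:00-22:00"))
print(parse_simple_hours("Mo-Sa 07:00-21:00; PH off"))
```

The full OSM opening hours syntax is much richer than this pattern, so a dedicated parser would be needed for complete coverage.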


In [96]:
print("\nTop 10 postcodes:")
print(gdf_superstore["postcode"].value_counts().head(10))


Top 10 postcodes:
postcode
10827    15
10365    14
10117    13
10405    13
12683    12
10243    12
12524    12
12555    12
10245    12
10967    11
Name: count, dtype: int64


In [97]:
print("\nTop 10 names:")
print(gdf_superstore["name"].value_counts().head(10))


Top 10 names:
name
Lidl                     142
Aldi                     132
Netto Marken-Discount    126
REWE                     105
EDEKA                     82
PENNY                     51
Denns BioMarkt            50
Bio Company               49
Netto                     45
Kaufland                  35
Name: count, dtype: int64


## E) Geometry sanity checks

In [99]:
print(gdf_superstore.geometry.geom_type.value_counts())

Point    1358
Name: count, dtype: int64


In [101]:
print("Missing geometries:", gdf_superstore.geometry.isna().sum())

Missing geometries: 0


## F) Latitude/Longitude checks

In [103]:
# Goal: Verify lat/lon look realistic.
# Why? If values are way off, something went wrong in conversion.

print("Latitude range:", gdf_superstore["latitude"].min(), "to", gdf_superstore["latitude"].max())

print("Longitude range:", gdf_superstore["longitude"].min(), "to", gdf_superstore["longitude"].max())

Latitude range: 52.37918305338519 to 52.64074992624387
Longitude range: 13.124876595696755 to 13.7149761
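
The same check can be automated by testing each row against a bounding box, which also catches swapped latitude/longitude values. The box limits below are a rough approximation of Berlin's extent and should be treated as an assumption; the toy rows include one deliberately swapped pair.

```python
import pandas as pd

# Approximate bounding box for Berlin (an assumption for illustration)
LAT_MIN, LAT_MAX = 52.33, 52.68
LON_MIN, LON_MAX = 13.08, 13.77

toy = pd.DataFrame({
    "latitude": [52.509819, 52.572417, 13.407373],   # last row has lat/lon swapped
    "longitude": [13.407373, 13.308547, 52.509819],
})

# True where both coordinates fall inside the box
inside = (
    toy["latitude"].between(LAT_MIN, LAT_MAX)
    & toy["longitude"].between(LON_MIN, LON_MAX)
)
print(toy[~inside])
```

The ranges printed above (about 52.38 to 52.64 latitude and 13.12 to 13.71 longitude) fall inside this box, so no row of the real dataset would be flagged.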


## 1.3 Prepare the /sources Directory

1. Raw Data Files:

- supermarkets_raw.geojson (includes geometry)
- supermarkets_raw.csv (tabular only, no geometry)

2. README.md in /sources will contain:

- Data sources used.
- Planned transformation steps.

In [104]:
# Save as GeoJSON (keeps geometry) and CSV

raw_geojson_path = "../sources/supermarkets_raw.geojson"
raw_csv_path = "../sources/supermarkets_raw.csv"


gdf_superstore.to_file(raw_geojson_path, driver="GeoJSON")
gdf_superstore.drop(columns="geometry").to_csv(raw_csv_path, index=False)

print(f"Raw data saved to: {raw_geojson_path} and {raw_csv_path}")

Raw data saved to: ../sources/supermarkets_raw.geojson and ../sources/supermarkets_raw.csv


## 1.4 Review

- All 22 target columns defined.
- Data sources identified and documented.
- Schema draft created.
- Data fetched and stored in /sources.
- Data cleaning & enrichment plan in place.

# 🛠 Step 2: Data Transformation