# Exercise: Reproducing 'How close is "close"?' (Shingleton & Basiri, 2025)

This notebook guides you through a **small-scale reproduction** of the core analytical pipeline in:

- Shingleton, J., & Basiri, A. (2025). How close is "close"? An analysis of the spatial characteristics of perceived proximity using Large Language Models. AGILE GIScience Series, 6, 1–14. [10.1111/1755-2665.12276](https://doi.org/10.1111/1755-2665.12276)

## 🎯 Objectives of the Notebook

1. Load **Inside Airbnb** listings for a city (London in the paper).
2. Use a **Large Language Model (LLM)** to extract *explicitly named* places described as **near** each property.
3. Robustly parse and clean the model output (incl. **JSON repair** and **fuzzy hallucination filtering**).
4. Compute and interpret a **standard distance** for frequently mentioned (“reference”) locations.

---

## Working assumptions (for a classroom reproduction)

- We run the LLM **locally** (via Ollama) and process a **small sample** (e.g. 200–500 listings).
- We reproduce the *core logic* of the paper, not the full London-scale compute budget.

> Keep notes as you go: any simplification you make is part of your reproducibility argument.


## 0. Environment setup

This exercise uses common data-science and geospatial libraries. If your environment is missing packages, install them below.

**Libraries you will use**
- `pandas` for tabular data
- `geopandas` + `shapely` for geospatial data structures
- `h3` for hexagonal indexing
- `tqdm` for progress bars, because we are processing a lot of data
- `ollama` for local LLM calls (or skip if you use another provider)
- `json_repair` to handle malformed JSON from the LLM
- `rapidfuzz` for fuzzy matching to remove hallucinated place names
- `geopy` for geocoding with Nominatim
- `osmnx` for downloading OpenStreetMap data
- `plotly` for interactive maps & visualization


In [None]:
# If needed, uncomment and run (pyarrow is only required for the Parquet export later on):
# %pip install -q pandas geopandas shapely h3 tqdm matplotlib ollama json_repair rapidfuzz geopy haversine osmnx plotly pyarrow

## 1. Imports

Read this cell carefully: in a reproducible notebook, you want *all imports in one place* so dependencies are obvious.

In [None]:
import json
import math

import pandas as pd

from tqdm.auto import tqdm

import h3
import ollama
from rapidfuzz import fuzz
from json_repair import repair_json

import osmnx as ox
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import plotly.express as px

from typing import List, Dict, Any

# progress bar initialization
tqdm.pandas()

## 2. Data: Inside Airbnb listings

The paper uses Inside Airbnb listings for **London** and analyses property descriptions + coordinates. We will run the same analysis for a city in Austria (the later sections assume Vienna).

For this exercise:
- Open the [Inside Airbnb](https://insideairbnb.com/) website
- Find a city in Austria. 
```python
❓ `Is Graz available?`
```
- Download `listings.csv.gz` from Inside Airbnb, then set `DATA_PATH` below.

**Task (fill-in):** set `DATA_PATH` to your local file.


In [None]:
# TODO (fill in): path to your downloaded listings.csv or listings.csv.gz
DATA_PATH = ___


## 3. Load and inspect the listings

We will keep only the columns we need:
- `id` (listing identifier)
- `latitude`, `longitude`
- `description` (the key text input to the LLM)
- `neighborhood_overview` (for additional location context)
- `neighbourhood_cleansed` (for later filtering)

**🚀 Tasks**
1. How many listings are present in the dataset in total?
2. What does a typical description/neighborhood_overview look like?
3. How many listings are missing description, neighborhood_overview, or both?
4. Create a kepler.gl map of listing density based on H3 cells
```python
❓ `What is the coordinate reference system (CRS) of the data? Is it acceptable for our analysis?`
```

In [None]:
# load dataframe 
raw_airbnb_df = pd.___(DATA_PATH)

# TODO: print the number of rows of the dataframe


In [None]:
# Keep only relevant columns
use_cols = [
    ___
]

airbnb_df = raw_airbnb_df[___].copy()


In [None]:
# Clean na values
# drop any row that has missing values in the columns we need
airbnb_df = airbnb_df.dropna(subset=[___])

# You can also use the parameter how='all' to drop only rows that have missing values in all columns
# In the next step we will concatenate "description" and "neighborhood_overview"
airbnb_df = airbnb_df.dropna(subset=[___], how='all')

# TODO: print the number of rows in the cleaned dataframe


In [None]:
# TODO: what does the data look like? Inspect the description column in particular

## 4. Spatial aggregation with H3

The paper aggregates properties into an H3 grid at **resolution 7** (~5.16 km² per hexagon in their setup).  
We will use the same default so your maps are comparable in scale.

**🚀 Tasks** 
1. Assign an H3 cell id to each row using the `point_to_h3` function
2. Group by H3 cell id and count the number of properties in each cell
3. Create a kepler.gl map of the H3 cell density


In [None]:
# TODO: create a function that will be passed to `DataFrame.apply`

H3_RES = 7

def point_to_h3(___) -> str:
    """Return the H3 cell id for the (lat, lon) coordinate of one row."""
    return ___

airbnb_df["h3_cell"] = airbnb_df.apply(point_to_h3, axis=1)

In [None]:
# TODO: group your data by h3 cell and count the number of properties in each cell
h3_airbnb_density = (
    airbnb_df
    .groupby("___")
    .___
    .reset_index(name="airbnb_count")
)

In [None]:
# TODO: create a kepler.gl map of the h3 cell density

## 5. Filter your data to only include the inner city

The full dataset contains far too many points for this analysis, so we keep only the inner city.

**🚀 Task:**
1. Filter your data to only include the inner city
2. Create a kepler.gl map of the inner city


In [None]:
# TODO: keep only data points from the inner city
# tip: use the `neighbourhood_cleansed` column, what are the possible values? 

# TODO: How many data points are left?


In [None]:
# TODO: create a kepler.gl map of the inner city


## 6. Definition of the extraction prompt

The paper uses a carefully engineered prompt (Appendix Fig. A1) to extract **explicitly named** nearby locations, excluding vague places and excluding the property’s own location.

**🚀 Tasks**
1. Read the following prompt for Vienna. Is it similar to the prompts you usually write?
2. Identify **three** design choices that reduce false positives
3. What are the differences between this prompt and the one used in the paper?


In [None]:
SYSTEM_PROMPT = """
You are an Airbnb location extraction system for Vienna. 
Your task is to extract specific Points of Interest (POIs) that are mentioned as being NEAR the property.

### 1. WHAT TO EXTRACT (Positive Criteria)
You must only extract specific, physical landmarks that can be pinpointed on a map.
- Specific Landmarks (e.g., "Schönbrunn Palace", "St. Stephen's Cathedral")
- Named Squares/Streets (e.g., "Stephansplatz", "Mariahilfer Straße", "Florianigasse")
- Named Markets/Parks (e.g., "Naschmarkt", "Prater", "Schönbornpark")
- Specific Transport Hubs (e.g., "Westbahnhof", "U4 Pilgramgasse")

### 2. WHAT TO IGNORE (Negative Criteria)
- IGNORE Districts and Quarters: Do NOT extract "Neubau", "Favoriten", "1st District", "Freihausviertel", "Museumsquartier" (unless referring to the specific complex, usually assume it's a district).
- IGNORE Generic Amenities: Do NOT extract "supermarkets", "restaurants", "bars", "shops", "metro station" (if unnamed).
- IGNORE Property Location: If the text says "The apartment is located in [Place]," ignore [Place]. Only extract places described as being *near*, *around*, or *walking distance* from the apartment.

### 3. ADDRESS FORMATTING RULES
- If a full address is in the text, use it.
- If NO address is in the text, you MUST format the address as: "[Name], Vienna".
- This is required formatting, not "inventing."

### 4. OUTPUT FORMAT
Do not include markdown formatting or explanations.
Keys: "name", "address", "latitude" (default null), "longitude" (default null), "context".

### 5. FEW-SHOT EXAMPLES (Do not confuse these with your real task)

Input: "We are located in the heart of Neubau, close to the Zieglergasse station and a short walk to Mariahilfer Straße."
Output:
[
  {
    "name": "Zieglergasse station",
    "address": "Zieglergasse station, Vienna",
    "latitude": null,
    "longitude": null,
    "context": "close to the Zieglergasse station"
  },
  {
    "name": "Mariahilfer Straße",
    "address": "Mariahilfer Straße, Vienna",
    "latitude": null,
    "longitude": null,
    "context": "short walk to Mariahilfer Straße"
  }
]

Input: "Enjoy the many shops and cafes in the 7th district. The bus takes you to the City Center."
Output:
[]

Input: "The apartment is located in the 3rd district, right next to the Belvedere Palace and just a short walk from the Rennweg station."
Output:
[
  {
    "name": "Belvedere Palace",
    "address": "Belvedere Palace, Vienna",
    "latitude": null,
    "longitude": null,
    "context": "right next to the Belvedere Palace"
  },
  {
    "name": "Rennweg station",
    "address": "Rennweg station, Vienna",
    "latitude": null,
    "longitude": null,
    "context": "short walk from the Rennweg station"
  }
]
Input: "You are staying directly opposite the famous Hundertwasserhaus."
Output:
[
  {
    "name": "Hundertwasserhaus",
    "address": "Hundertwasserhaus, Vienna",
    "latitude": null,
    "longitude": null,
    "context": "directly opposite the famous Hundertwasserhaus"
  }
]

### END OF EXAMPLES
Now, analyze the following description and extract the locations based strictly on the text provided below.
"""

## 7. Calling an LLM locally (Ollama)

We will call a local model via Ollama.
- You need [Ollama](https://ollama.com/) running locally. Follow the link to download and install it.
- You also need a model pulled (e.g., `llama3`, `qwen2`, etc.). You can find a list of available models [here](https://ollama.com/models).

Running LLMs locally is demanding on RAM, disk space and CPU. As a rough rule of thumb, budget about 1 GB of RAM and 1 GB of disk per 1B parameters. We will use a small model for this exercise: `qwen3:1.7b`.
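
The rule of thumb can be sanity-checked with simple arithmetic: the weights take roughly parameter count times bits per weight. The sketch below assumes the weights dominate the footprint and ignores the context cache and runtime overhead. The function name and the 4/8/16-bit settings are illustrative assumptions, not figures reported by Ollama.

```python
def estimate_weight_size_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    n_params = n_params_billion * 1e9
    return n_params * bits_per_weight / 8 / 1e9

# Compare a few common precisions for a 1.7B-parameter model.
for bits in (4, 8, 16):
    size = estimate_weight_size_gb(1.7, bits)
    print(f"1.7B parameters at {bits}-bit: ~{size:.2f} GB of weights")
```

At 8 bits per weight this reproduces the "1 GB per 1B parameters" rule. Quantized downloads are smaller, and full 16-bit weights need about twice as much.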

**🚀 Task** 
1. Set `MODEL_NAME` to a model you have locally
2. Pull the model using `ollama.pull`


In [None]:
MODEL_NAME = "qwen3:1.7b" # You need 1.4GB of free space on your computer
# ⚠️ You must have Ollama running locally 

ollama.pull(MODEL_NAME)

# Create a client to call your model 
client = ollama.Client()


### 7.1 A minimal chat wrapper

- We keep temperature = 0 for deterministic outputs (useful in reproduction).
- We allow the model to think (set `think=True`), which is useful for debugging. Not all models support this.

In [None]:
def ollama_extract_nearby(text: str, show_log: bool = True) -> tuple[str, str]:
    """Return (model_text, thinking) for one listing."""

    # Define what you send to the model: the `system` message sets how the model should behave, the `user` message is the text you want to analyze
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    # you call the model here 
    resp = client.chat(
        model=MODEL_NAME,
        messages=messages,
        options={"temperature": 0}, 
        think=True
    )

    model_text = resp["message"]["content"]
    thinking = resp["message"].get("thinking", "...no thinking...")

    if show_log:
        # print the input, the model's thinking process and the final answer
        print("-----")
        print(text)
        print("\n thinking... \n")
        print(thinking)
        print("\n\n =>\n")
        print(model_text)
    
    # return the final answer and the thinking trace separately
    return model_text, thinking


### 7.2 Try the model on one listing

**🚀 Task:** 
1. Run the model on one description from your dataframe

❓
```python

` - What does the model's thinking process look like?`
` - Are the results correct?`
` - How long does it take to run?`
` - What happens if you rerun the same description? Is the output the same?`
` - What happens if you set think to False?`
```
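
To answer the timing question with numbers rather than impressions, a small wrapper around any call is enough. The sketch below is a generic helper, not part of the paper's pipeline: `timed`, `projected_runtime_hours` and `fake_llm_call` are hypothetical names, and the 10 seconds per listing is a made-up figure to show the extrapolation.

```python
import time
from typing import Any, Callable, Tuple

def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def projected_runtime_hours(seconds_per_listing: float, n_listings: int) -> float:
    """Naive linear extrapolation of the total runtime."""
    return seconds_per_listing * n_listings / 3600

# Hypothetical stand-in for the LLM call, so the snippet runs without Ollama.
def fake_llm_call(text: str) -> str:
    time.sleep(0.01)
    return "[]"

result, seconds = timed(fake_llm_call, "close to the Naschmarkt")
print(f"one call: {seconds:.3f} s")
print(f"10 s per listing x 500 listings = {projected_runtime_hours(10, 500):.2f} h")
```

In the notebook you would pass your own extraction function instead of `fake_llm_call`, for example `timed(ollama_extract_nearby, text, show_log=False)`.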

In [None]:
# TODO: Run your function on the first row of the dataframe



## 8. Parsing the LLM output robustly

In practice, models sometimes return malformed JSON. The paper repairs JSON first, and may ask the model to fix it if needed.

We will implement a two-step parser:
1. Try `json.loads`
2. If that fails, try `json_repair.repair_json`, then parse again

**🚀 Tasks** 
1. Create the function `parse_json_array` that cleans the LLM output and tries to parse it as a JSON array
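
Before reaching for a repair library, it often helps to strip whatever the model wrapped around the array (markdown fences, a chatty sentence). The stdlib-only sketch below shows that pre-cleaning step together with the `try`/`except` pattern. `extract_json_array` is a hypothetical helper, and the "first `[` to last `]`" heuristic is an assumption that fails if the surrounding text itself contains brackets.

```python
import json
from typing import Any, List

def extract_json_array(raw: str) -> List[Any]:
    """Best-effort parse of a JSON array embedded in raw model output."""
    text = raw.strip()
    try:
        data = json.loads(text)  # happy path: the output is already valid JSON
        return data if isinstance(data, list) else []
    except json.JSONDecodeError:
        pass

    # Fallback: keep only the span from the first "[" to the last "]",
    # which drops markdown fences and chatty text around the array.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []  # this is the point where a repair library would get its chance
    return data if isinstance(data, list) else []

wrapped = 'Here is the result:\n```json\n[{"name": "Naschmarkt"}]\n```'
print(extract_json_array(wrapped))
print(extract_json_array("not json"))
```

Returning an empty list on failure keeps the batch run going. The price is that parsing failures become invisible, so counting them separately is worth considering.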


In [None]:
def parse_json_array(text: str) -> List[Dict[str, Any]]:
    """Parse a JSON array from model output, with basic repair."""
    text = text.strip()

    # 1) direct parse
    # TODO: wrap the json load in a try-except block
    data = json.loads(text)
    if isinstance(data, list):
        return data

    # 2) attempt repair
    # TODO: use the library `json_repair` to repair the json
    repaired = ___  # returns a JSON string
    
    # TODO: try to reload the repaired json

    # 3) final fallback: return empty
    return []


In [None]:
# TODO: Use the output of your LLM from earlier to parse the JSON array

## 9. Batch extraction on a small sample

We will:
- run the LLM on each airbnb listing
- parse the JSON
- store results in a new column `nearby_raw`

**🚀 Task** 
1. ⚠️ Select a very small sample (e.g. 3-10 rows) to test and debug your code
2. Create a function that will be applied to each row of the dataframe to extract the locations mentioned in the description. The function should run your `ollama_extract_nearby` function and parse the JSON
3. Apply the function to the dataframe and store the results in a new column `nearby_raw`
4. Compute the mean number of extracted locations per listing
5. Create a Kepler.gl map to visualize the extracted locations

**📝 Note**
- The execution might take a while, so we will use `progress_apply` from `tqdm` to show a progress bar (instead of a plain `apply`)

> **📢 TODO at home:**  
> Run the operation on the entire dataframe and save the results
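
Listing text usually arrives with HTML remnants and missing values, so the text-building step deserves a quick test on its own. Here is a minimal stdlib sketch, assuming the two text fields are passed in separately. `build_listing_text` is a hypothetical helper name. It keeps the original casing, so lower-casing (as in the skeleton below) remains a separate choice.

```python
import html
import re
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")   # any HTML tag, e.g. <br />, <b>
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including newlines

def build_listing_text(description: Optional[str], overview: Optional[str]) -> str:
    """Join the two text fields and strip HTML debris; returns "" if both are missing."""
    # pandas stores missing text as float NaN, so keep only real strings
    parts = [p for p in (description, overview) if isinstance(p, str)]
    text = " ".join(parts)
    text = TAG_RE.sub(" ", text)   # replace tags with a space so words do not fuse
    text = html.unescape(text)     # &amp; -> &
    return SPACE_RE.sub(" ", text).strip()

print(build_listing_text("Close to Naschmarkt.<br />5 min walk &amp; quiet", None))
print(repr(build_listing_text(float("nan"), None)))
```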

In [None]:
# get a small sample of the dataframe
small_sample = airbnb_df.sample(___)

In [None]:
# create a function that will be applied to each row of your df
def extract_for_row(row: pd.Series) -> pd.Series:
    # define a variable to store the result
    out = pd.Series({
        "locations": [],
        "thinking": None
    })
    # TODO: aggregate the columns to get the description and the neighborhood

    # you can also clean the text
    ___ = ___.replace("\n", " ") # remove new lines
    ___ = ___.replace("<br />", " ") # remove HTML line breaks
    ___ = ___.strip().lower() # remove leading/trailing spaces and lower-case

    if len(___)==0: # make sure the text is not empty 
        return out
    
    # TODO: run your LLM function

    # TODO: Return the parsed value
    out["locations"] = ___
    out["thinking"] = ___
    return out 

In [None]:
airbnb_df[["nearby_raw", "thinking"]] = ___.progress_apply(___, ___)

In [None]:
# Save the results here if you ran the whole dataframe

# In json
JSON_FILEPATH = ___
airbnb_df.to_json(JSON_FILEPATH)

# in parquet
PARQUET_FILEPATH = ___
airbnb_df.to_parquet(PARQUET_FILEPATH)

```python
❓
` - What is the difference between Parquet and JSON? Compare the size of each file on disk`
` - Are the results similar when using the 1.7b model you ran this weekend?`

```

In [None]:
# TODO: calculate the number of locations per row and add it as a new column
airbnb_df["nearby_raw_n"] = ___


In [None]:
# TODO: create a kepler map that shows the average number of nearby locations per cell

## 9. Reduce hallucinations with fuzzy presence checks

A common failure mode is hallucinating a nearby place not mentioned in the text.


We will:
- use fuzzy matching to remove hallucinated place names
- create a function that checks, for each row, whether every extracted location appears in the description text
- store results in a new column `nearby_clean`


**🚀 Task** 
1. Set a threshold (start with 70)
2. Compare how many locations remain after cleaning
3. Compute the mean number of extracted locations per listing
4. Create a Kepler.gl map to visualize the extracted locations

**📝 Note**
- You can load the parquet file `airbnb_df.parquet` to get the results of the previous step (produced by running a thinking model with 15B parameters on the dataset)
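
To build intuition for what a "partial" fuzzy match and a threshold do, the sketch below slides a window over the text and keeps the best similarity. It uses the standard library's `difflib` as a stand-in, so its scores approximate but do not reproduce `rapidfuzz.fuzz.partial_ratio`. `partial_match_score` is a hypothetical helper for illustration only.

```python
from difflib import SequenceMatcher

def partial_match_score(needle: str, haystack: str) -> float:
    """Best similarity (0-100) between needle and any same-length window of haystack."""
    needle, haystack = needle.lower(), haystack.lower()  # compare case-insensitively
    if not needle or not haystack:
        return 0.0
    if len(needle) >= len(haystack):
        return 100 * SequenceMatcher(None, needle, haystack).ratio()
    width = len(needle)
    return max(
        100 * SequenceMatcher(None, needle, haystack[i:i + width]).ratio()
        for i in range(len(haystack) - width + 1)
    )

text = "Close to the Naschmarkt and a short walk to Karlsplatz."
for name in ("Naschmarkt", "Nashmarkt", "Schönbrunn Palace"):
    print(f"{name!r}: {partial_match_score(name, text):.0f}")
```

A name that appears verbatim scores 100 and a small typo stays above a threshold of 70. A place that is absent from the text scores far lower, which is what the filter relies on.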

In [None]:
THRESH = 70  # experiment with 60, 70, 80

def filter_dicts_by_presence(row):
    # 1. Prepare the text to search in: rebuild the text context you created in the previous step

    kept_dicts = [] # init a list for returning the filtered values 

    # 2. loop over your found locations
    ___:
        # 3. Check if the location name is in the text context
        if fuzz.partial_ratio(___, ___) >= THRESH:
            kept_dicts.append(___) # keep the item in the nearby column
    return kept_dicts

In [None]:
# TODO: Apply the above function to your dataframe
airbnb_df['nearby_clean'] = ___

In [None]:
# TODO: Compute the new number of locations per row and add it as a new column & create a kepler map

```python
❓ `How many locations are left after cleaning?`
```

## 10. Adding missing coordinates to the cleaned locations

```python
❓
` - Were the lat and lon filled in by the AI model?`
` - What can we do to fix this?`   
```

We will:
- use a geocoder to get the coordinates of the cleaned nearby locations: we will use Nominatim
- create a function that geocodes the different nearby locations
- drop the locations that have no coordinates or that are not within Vienna
- store results in a new column `nearby_coords`


**🚀 Task** 
1. Create a function that geocodes the different nearby locations
2. Apply the function to the dataframe and store the results in a new column `nearby_coords`
3. Drop the locations that have no coordinates
4. Compute the mean number of extracted locations per listing
5. Create a Kepler.gl map to visualize the extracted locations

In [None]:
# Initialize Nominatim using geopy
geolocator = Nominatim(user_agent="geoapiExercises") # the user_agent should identify your application; pick your own unique string
# Nominatim enforces a rate limit of 1 request per second; you can use time.sleep(1) between calls or geopy's RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

In [None]:
# TODO: define a function that geocodes the different nearby locations

def geocode_locations(nearby_locs):
    # loop over the nearby locations
    ___:
        # Call the rate-limited geocoder on the address field
        location = geocode(___)

        # TODO: test that you got a result (location is not None) before assigning values
        ___["latitude"] = location.latitude
        ___["longitude"] = location.longitude
    return nearby_locs

### 10.1 Try the geocoding on a few listings

Geocoding can be slow because of the pause between requests. Try your function on a few addresses first; then you can load the file `airbnb_df_with_coords.parquet`.
The coordinates in that file were geocoded with Mapbox.
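
Many listings mention the same places, so remembering earlier answers can cut the number of requests considerably. The sketch below wraps any geocoding callable in a dictionary cache. `make_cached_geocoder` and `fake_geocode` are hypothetical names, the fake geocoder replaces the real service so the snippet runs offline, and the coordinates are approximate.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coords = Optional[Tuple[float, float]]

def make_cached_geocoder(geocode_fn: Callable[[str], Coords]) -> Callable[[str], Coords]:
    """Wrap a geocoding function so each distinct address is requested only once."""
    cache: Dict[str, Coords] = {}

    def cached(address: str) -> Coords:
        key = address.strip().lower()          # treat trivial variants as the same address
        if key not in cache:
            cache[key] = geocode_fn(address)   # failed lookups (None) are cached too
        return cache[key]

    return cached

# Hypothetical offline stand-in for a real geocoder; it also records its calls.
calls: List[str] = []

def fake_geocode(address: str) -> Coords:
    calls.append(address)
    known = {"Stephansplatz, Vienna": (48.2085, 16.3731)}  # approximate coordinates
    return known.get(address)

lookup = make_cached_geocoder(fake_geocode)
for address in ["Stephansplatz, Vienna", "stephansplatz, vienna ", "Nowhere, Vienna"]:
    print(address, "->", lookup(address))
print("requests sent:", len(calls))
```

With geopy, the wrapped function would need to turn the returned location object into a `(latitude, longitude)` tuple, or `None` when nothing is found. That adapter is left out here.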

In [None]:
# TODO: get a sample of your dataframe
___


# run the function on the sample
sample['nearby_coords'] = sample['nearby_clean'].progress_apply(geocode_locations)

### 10.2 Drop the nearby addresses without coordinates or outside Vienna
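
Two details are easy to get wrong in this step: locations the geocoder could not resolve still carry `None` coordinates, and a bounding box lists longitude before latitude. The sketch below illustrates both on toy data. `inside_bbox` is a hypothetical helper, and the box and coordinates are rough hand-typed values rather than the output of `osmnx`.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)

def inside_bbox(loc: Dict[str, Any], bbox: BBox) -> bool:
    """True if loc has coordinates and they fall inside bbox."""
    lat, lon = loc.get("latitude"), loc.get("longitude")
    if lat is None or lon is None:   # geocoder found nothing: drop instead of crashing
        return False
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# Rough, hand-typed box around Vienna, for illustration only.
APPROX_VIENNA_BBOX: BBox = (16.18, 48.11, 16.58, 48.33)

candidates: List[Dict[str, Any]] = [
    {"name": "Stephansplatz", "latitude": 48.2085, "longitude": 16.3731},
    {"name": "Graz Hauptplatz", "latitude": 47.0707, "longitude": 15.4383},
    {"name": "never geocoded", "latitude": None, "longitude": None},
]
kept = [loc["name"] for loc in candidates if inside_bbox(loc, APPROX_VIENNA_BBOX)]
print(kept)
```

A bounding box is looser than the city boundary, so points just outside Vienna can still pass. Testing against the polygon itself would be stricter.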


In [None]:
# TODO: get the geometry of Vienna using osmnx 
___

In [None]:
def clean_coordinates(row, geometry):
    # get bbox of vienna from the geometry
    min_lon, min_lat, max_lon, max_lat = geometry.bounds

    kept_dicts = [] # init a list for returning the filtered values 
    
    # TODO: create a loop over the nearby_coords
    ___:
        # TODO: Extract longitude and latitude from the dict
        lon, lat = ___
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            kept_dicts.append(___)
            
    return kept_dicts

In [None]:
airbnb_df['nearby_coords'] = airbnb_df.progress_apply( lambda row: ___(___, ___), axis=1)

In [None]:
# TODO: Compute the new number of locations per row and add it as a new column & create a kepler map


```python
❓
`How many locations are left after cleaning?`
```

In [None]:
# TODO: Create a histogram with plotly that shows the overview of the counts: raw number of addresses, non-hallucinated addresses, addresses with coordinates, addresses with coordinates inside Vienna 

___

data = {
    'Metric': ['Coords', 'Clear', 'Count'],
    'Value': [coords, clear, count]
}

fig = px.bar(data, x='Metric', y='Value', title='Overview of the counts')
fig.show()


## 11. Reference locations and “area of influence” (standard distance)

We will:
- expand our dataframe to one row per nearby location
- display the most common nearby locations for the first district of Vienna 
- calculate the distance between each Airbnb listing and each of its nearby locations
- display the average distance on the H3 grid


**🚀 Task** 
1. Keep only the necessary columns
2. Explode the dataframe to create one row per nearby location
3. Group by nearby location and count the number of Airbnb listings
4. Sort the results by the number of Airbnb listings
5. Display the top 25 nearby locations
6. Calculate the distance from each Airbnb listing to its nearby location
7. Display the average distance on the H3 grid

### 11.1 Exploring the most common locations 

In [None]:
# TODO: drop the columns that are not necessary
# We need the nearby_coords, h3_cell, latitude, longitude and id
___

In [None]:
# TODO: explode the df
airbnb_df_exploded = airbnb_df.explode('___', ignore_index=True).dropna(subset=["___"]).reset_index(drop=True).copy()

# extract the location name and coordinates for each dict
airbnb_df_exploded['location_name'] = airbnb_df_exploded['nearby_coords'].apply(lambda x: x['name'])
___


In [None]:
# TODO: display the 25 most common nearby locations
___ 

In [None]:
# use plotly to create a histogram
fig = px.histogram(x=value_counts.index, 
    y=value_counts.values,
    labels={'x': 'Location Name', 'y': 'Count'}, height=400)
fig

```python
❓ 
`- What do you notice about the names? Is there any issue? `
`- When you look at a “high influence” location, what spatial factors might explain it beyond geometry? `
```
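
For the optional regrouping step, a light normalisation pass already merges the most trivial spelling variants. The stdlib sketch below shows one possible key function. `normalize_name` and the list of leading articles are illustrative assumptions, and the sample names are made up rather than taken from real output.

```python
import re
import unicodedata
from collections import Counter

def normalize_name(name: str) -> str:
    """Collapse trivial spelling variants of a place name into one key."""
    key = unicodedata.normalize("NFKC", name).casefold().strip()  # casefold maps ß to ss
    key = re.sub(r"^(the|der|die|das)\s+", "", key)   # drop a leading article
    key = re.sub(r"[^\w\s]", "", key)                 # drop punctuation
    return re.sub(r"\s+", " ", key).strip()

names = ["Naschmarkt", "naschmarkt", "the Naschmarkt", "Mariahilfer Straße",
         "Mariahilfer Strasse", "Prater"]
counts = Counter(normalize_name(n) for n in names)
print(counts.most_common(3))
```

This does not merge translations or nicknames of the same place. Those need a lookup table or a comparison of the geocoded coordinates.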

In [None]:
# (optional) Regroup similar names

### 11.2 Computing the relative distance 
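
The section title refers to a standard distance, which in spatial statistics is usually the root-mean-square distance of a set of points from their mean centre. The sketch below is one plausible reading, applied to the listings that mention a single reference location. The exact formulation used in the paper should be checked there. It assumes that averaging latitude and longitude is acceptable at city scale, and the sample coordinates are made up.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance (km) between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def standard_distance_km(points: List[Tuple[float, float]]) -> float:
    """Root-mean-square distance (km) of (lat, lon) points from their mean centre."""
    if not points:
        raise ValueError("need at least one point")
    # Assumption: a plain average of lat/lon is a good enough centre at city scale.
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lon = sum(lon for _, lon in points) / len(points)
    squared = [haversine_km(lat, lon, mean_lat, mean_lon) ** 2 for lat, lon in points]
    return math.sqrt(sum(squared) / len(points))

# Made-up listing coordinates that all mention the same reference location.
listings = [(48.20, 16.35), (48.21, 16.37), (48.22, 16.36), (48.19, 16.38)]
print(f"standard distance: {standard_distance_km(listings):.2f} km")
```

A small value means the listings that call a place "near" are tightly clustered. A large value suggests the place is perceived as close from a wide area.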

In [None]:
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance (km) between two WGS84 points."""
    # Haversine formula, already filled in: read it and check each term
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)

    a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return EARTH_RADIUS_KM * c

In [None]:
airbnb_df_exploded['distance'] = airbnb_df_exploded.apply(lambda row: ___, axis=1)

In [None]:
# TODO: Aggregate the data by h3 cell and calculate the mean distance

In [None]:
# TODO: Add the results to a Kepler map


---

#### Key limitations 
- **Sampling bias**: Airbnb listings represent a particular demographic and purpose.
- **Coordinate uncertainty**: listing coordinates are approximate.
- **Model bias**: small models may miss local place names; larger models may hallucinate more.
- **Geocoding uncertainty**: ambiguous names and API limits.


#### Reflection questions 
1. Which step contributes the largest uncertainty to the final maps: prompt design, model choice, JSON repair, hallucination filtering, or geocoding?
2. How would you validate extraction quality without manually labelling thousands of listings?
3. What ethical risks exist when mining user-generated text for spatial inference?

---

**Congratulations! 🎉 You have successfully completed the exercise.**