# A question and prompting dataset
The dataset should contain questions about the three defined spatial tasks: neighborhood, direction, and proximity. All questions are based on the NUTS and GeoNames datasets.

### 1. Neighborhood and other topological relations
True/False questions:
- Is NUTS region X a neighbor of Y?
- Is NUTS region X inside of Y?
- Do NUTS regions X and Y lie in the same top-level NUTS region?
- Are cities X and Y in the same NUTS level 2 region?

Retrieval questions:
- What NUTS regions border region X on the same level?
- What NUTS regions border region X?
- What NUTS regions are within region X?
- What cities with CONDITION are within region X?
- What are all NUTS regions that contain city X?
- What are the second-order neighbors of region X?

### 2. Directions
True/False/Direct questions:
- Is NUTS region X further north than NUTS region Y?
- Is NUTS region X west of NUTS region Y?
- Is city X south of NUTS region Y?
- Is city X southeast from city Y?

Retrieval questions:
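- Region to region: in which direction is region X from region Y?
- Region to city: in which direction is region X from city Y?
- City to region: in which direction is city X from region Y?
- City to city: in which direction is city X from city Y?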

### 3. Proximity
True/False/Direct questions:
- Is city X within 10 km of region Y?
- Is city X within 10 km of city Y?

Retrieval questions:
- What cities with CONDITION are within 50 km of region X?
- What regions are within 10 km of region X?
- What cities with CONDITION are within 10 km of city X?
- What cities are within 150 km of both region X and region Y?
- What cities are within the bounding box of region X?
- What is the largest city within 15 km of region X?

### 4. Combinations
- What cities are within 10 km to the west of region X?
- What NUTS regions border region X to the south?

In [3]:
import re
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
from rdflib import Graph
from random import shuffle, sample, choice
from functions.correct_queries import get_gt_queries
from functions.general_functions import pickledump, pickleload

# Set up the question catalogue with variables

In [5]:
all_questions = [
    
# simple topology
    "What are all the NUTS regions that contain the city CITY?",
    "What NUTS regions are within the region CODE?",
    
# Neighborhood 
    "Is the NUTS region CODE bordering the region CODE?",
    "What regions of the same level are neighbors of the NUTS region CODE?",
    "What are the second order neighbors of the same NUTS level for the NUTS region CODE?",

# Directions (all combinations of city and region) BOOL
    "Is NUTS region CODE DIRECTION of NUTS region CODE?",
    "Is the NUTS region CODE DIRECTION of the city of CITY?",

# Directions question
    "To which direction is CITY from CITY?",
    "CITY is to which direction of NUTS region CODE?",

# Proximity
    "Is CITY within SMALLDISTANCE km of the NUTS region CODE?",
    "What are all cities within SMALLDISTANCE km of CITY?",
    "What are all cities CCONDITION and are less than BIGDISTANCE km from the NUTS region CODE?",
    "What is the largest city that can be found within the bounding box of the NUTS region CODE?",
    "What is the largest city within BIGDISTANCE km of the NUTS region CODE?",

# Combinations
    "What are cities within SMALLDISTANCE kilometers to the DIRECTION of the NUTS region CODE?",
    "Which cities can be found not further than SMALLDISTANCE km to the DIRECTION of CITY?",
    "What NUTS regions share a border with the region CODE in the DIRECTION on the same nuts level?",
    "What is the closest city to the DIRECTION of the NUTS region CODE?"
]

kind_of_question = ["simple topology"]*2 + ["neighbors"]*3 + ["directions bool"]*2 + ["directions open"]*2 + ["proximity"]*5 + ["combinations"]*4
variables = ["CODE", "CITY", "DIRECTION", "SMALLDISTANCE", "BIGDISTANCE", "CCONDITION"]
pattern = '|'.join(map(re.escape, variables))
variables_in_question = [sorted(re.findall(pattern, question)) for question in all_questions]
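
The placeholder mechanism above can be illustrated with a minimal sketch of how a template might later be populated. The helper `fill_template` and the example values are assumptions made for illustration and are not part of the notebook; only `variables` and `pattern` are taken from the cell above. Placeholders are replaced from left to right, so a question with two `CODE` slots consumes two values.

```python
import re

# Same placeholder names and pattern as in the cell above.
variables = ["CODE", "CITY", "DIRECTION", "SMALLDISTANCE", "BIGDISTANCE", "CCONDITION"]
pattern = "|".join(map(re.escape, variables))


def fill_template(question, values):
    """Hypothetical helper: replace placeholders left to right.

    `values` maps each placeholder name to a list of fillers, one per occurrence.
    """
    # Copy the lists so the caller's data is not consumed.
    queues = {name: list(items) for name, items in values.items()}
    return re.sub(pattern, lambda m: str(queues[m.group(0)].pop(0)), question)


template = "Is NUTS region CODE DIRECTION of NUTS region CODE?"
found = sorted(re.findall(pattern, template))

# Illustrative values only; the direction is not checked against real geometry.
filled = fill_template(template, {"CODE": ["DE1", "DE2"], "DIRECTION": ["north"]})
print(found)
print(filled)
```

Keeping one filler list per placeholder name is one way to handle templates in which the same variable appears twice.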

In [6]:
correct_graphdb_requests = get_gt_queries()

# Set up the dataframe and example tags
The dataframe contains the same 18 questions, each instantiated with several example configurations: regions from different NUTS levels, different minimum size requirements for cities, and cardinal or intercardinal directions.

In [8]:
df_questions = pd.DataFrame([all_questions, kind_of_question, variables_in_question]).T
df_questions.columns = ["question_raw", "question_category", "variables"]
df_questions = pd.concat([df_questions], ignore_index=True)
df_questions['gt_query_raw'] = correct_graphdb_requests

In [9]:
directions = ["north", "west", "east", "south"]
intercardinals = ["northwest", "northeast", "southwest", "southeast"]
sdistances = [8]  # 8 km covered all the different question types and use cases, so no other values were used in the end
bdistances = range(200, 750, 50)

# cconditions do not interfere with the minimum-size filter used when picking random cities, because no question uses both
cconditions = ["that have more than 100 thousand inhabitants", "that have more than 150000 residents", "with more than a 120 k people", "that are bigger than 99,999"]
cconditions_code = ["> 100000", "> 150000", "> 120000", "> 99999"]

# NUTS level 1, cities > 120000
ex1 = [[1, 120000, choice(directions), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

# NUTS level 2, cities > 50000
ex2 = [[2, 50000, choice(directions), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

# NUTS level 3, cities > 15000
ex3 = [[3, 15000, choice(directions), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

# Intercardinal, level2, cities > 50000
ex4 = [[2, 50000, choice(intercardinals), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

# No filters
ex5 = [[None, None, choice(directions), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

# No filters Intercardinal
ex6 = [[None, None, choice(intercardinals), choice(sdistances), choice(bdistances), choice(cconditions)] for _ in range(len(df_questions))]

exs = [ex1, ex2, ex3, ex4, ex5, ex6]
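
The lists `cconditions` and `cconditions_code` above are parallel: the phrase at position i corresponds to the filter fragment at position i. A possible way to keep the two in sync once a phrase has been drawn with `choice` is a lookup table, sketched below. The names `condition_to_filter` and `min_population` are assumptions introduced here, and how the notebook actually uses `cconditions_code` is not shown in this section.

```python
cconditions = [
    "that have more than 100 thousand inhabitants",
    "that have more than 150000 residents",
    "with more than a 120 k people",
    "that are bigger than 99,999",
]
cconditions_code = ["> 100000", "> 150000", "> 120000", "> 99999"]

# Hypothetical lookup from the natural-language phrase to its filter fragment.
condition_to_filter = dict(zip(cconditions, cconditions_code))


def min_population(condition_text):
    """Return the numeric threshold of a phrase, e.g. '> 100000' -> 100000."""
    return int(condition_to_filter[condition_text].lstrip("> "))


print(condition_to_filter["that have more than 150000 residents"])
print(min_population("that are bigger than 99,999"))
```

A dictionary avoids relying on list positions after a random draw, at the cost of requiring the phrases to be unique.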

In [10]:
columns = ["NUTS level", "min inhabitants city", "direction", "small_distance", "big_distance", "city_condition"]

ex_with_df = [pd.concat([df_questions, pd.DataFrame(ex, columns=columns)], axis=1) for ex in exs]

all_examples = pd.concat(ex_with_df, ignore_index=True)
all_examples["intercardinal"] = all_examples["direction"].isin(intercardinals)

pickledump(all_examples, "data/examples_not_populated.pkl")
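
The construction above pairs every question with one parameter row per example set and then stacks the sets, so 18 questions and 6 example sets should give 108 rows. The toy sketch below reproduces the same pattern on made-up data (3 questions, 2 example sets) to show the kind of sanity check that could be run before pickling; all names and values in it are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for df_questions and exs.
questions = pd.DataFrame({"question_raw": ["q1", "q2", "q3"]})
columns = ["NUTS level", "direction"]
example_sets = [
    [[1, "north"], [1, "west"], [1, "east"]],
    [[2, "northwest"], [2, "southeast"], [2, "northeast"]],
]

# axis=1 aligns on the index, so each question gets exactly one parameter row.
blocks = [
    pd.concat([questions, pd.DataFrame(ex, columns=columns)], axis=1)
    for ex in example_sets
]
toy_examples = pd.concat(blocks, ignore_index=True)

# One row per (question, example set) pair and no misaligned (NaN) cells.
assert len(toy_examples) == len(questions) * len(example_sets)
assert not toy_examples.isna().any().any()
print(toy_examples.shape)
```

The NaN check matters because `pd.concat(..., axis=1)` silently pads with missing values if the two frames have different lengths or indexes.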

# Notes on experiment dataset:
- Compared codes are always on the same level.
- Compared codes are always closer to each other than a level-dependent distance threshold.
- Cities are picked either at random or with a minimum-size filter.
- For CODE-CITY questions, the city should not be too far from the region; the allowed distance depends on the level.
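
The third note, picking cities at random or with a size filter, could look roughly like the sketch below. The function `pick_city`, the column names `name` and `population`, and the toy table are assumptions for illustration; the actual population step and the level-dependent distance thresholds are not part of this section.

```python
import pandas as pd


def pick_city(cities, min_inhabitants=None, seed=None):
    """Hypothetical helper: draw one city name, optionally above a population threshold."""
    if min_inhabitants is None:
        candidates = cities
    else:
        candidates = cities[cities["population"] > min_inhabitants]
    if candidates.empty:
        raise ValueError("no city satisfies the size filter")
    return candidates.sample(n=1, random_state=seed)["name"].iloc[0]


# Toy table with made-up names and populations.
toy_cities = pd.DataFrame(
    {"name": ["A", "B", "C"], "population": [20_000, 60_000, 130_000]}
)

print(pick_city(toy_cities, min_inhabitants=120_000))  # only "C" passes the filter
print(pick_city(toy_cities, seed=0))                   # unfiltered random pick
```

Passing `None` as the threshold mirrors the "No filters" example sets, where the minimum-inhabitants entry is `None`.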