# Going further!

**Authors**

| Author               | Affiliation   |
|----------------------|---------------|
| Rémy Decoupes        | INRAE / TETIS |
| Mathieu Roche        | CIRAD / TETIS |
| Maguelonne Teisseire | INRAE / TETIS |

![TETIS](https://www.umr-tetis.fr/images/logo-header-tetis.png)

## Indicator 1:

### 1.1 Work on prompt

Optimize the prompts to reduce parsing issues.
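
One possible direction, sketched below, is to ask for a constrained output format and to parse the reply defensively. The prompt template and the helpers `build_prompt` and `parse_answer` are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import json
import re

# Hypothetical prompt template: asking for a single JSON object makes the
# model's answer easier to parse than free text.
PROMPT_TEMPLATE = (
    "Answer with a JSON object of the form {{\"answer\": \"<city>\"}} and nothing else.\n"
    "Question: What is the capital of {country}?"
)


def build_prompt(country: str) -> str:
    """Fill the template with a country name."""
    return PROMPT_TEMPLATE.format(country=country)


def parse_answer(raw: str):
    """Extract the 'answer' field from a model reply, or None if unparsable."""
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    answer = payload.get("answer")
    return answer.strip() if isinstance(answer, str) else None
```

Replies for which `parse_answer` returns `None` can be counted separately, which gives a simple measure of how often a given prompt leads to parsing issues.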

### 1.2 Use other basic geographical questions

- Predict the capital of a given country
- What are the 3 most populated cities in each country?
- ...
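
For questions of this kind, the answers could be scored against a gold table after light normalization. The sketch below assumes predictions are already available as a `country -> answer` dictionary; `normalize` and `capital_accuracy` are illustrative helpers, not functions from the original notebook.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.strip().lower()


def capital_accuracy(gold: dict, predicted: dict) -> float:
    """Share of countries whose predicted capital matches the gold capital."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for country, capital in gold.items()
        if normalize(predicted.get(country) or "") == normalize(capital)
    )
    return hits / len(gold)


# Toy data for illustration only
gold = {"France": "Paris", "Colombia": "Bogotá", "Peru": "Lima"}
predicted = {"France": "paris ", "Colombia": "Bogota", "Peru": "Cusco"}
accuracy = capital_accuracy(gold, predicted)
```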

## Indicator 2:

### 2.1 How can we explain the very good geographic knowledge of LLMs when their vocabulary contains few locations?

**Hypothesis**: LLMs encountered many locations during their training, but these are drowned out by the quantity of other words, so few locations get their own vocabulary entry. As a result, the subtokens that make up location names have a good geographical representation once merged.

To validate this hypothesis, we could compare, for LLM and SLM tokenizers, the proportion of location names that are split into several subtokens.
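
A minimal sketch of such a measurement is given below. It takes the tokenizer as a plain callable so it can be tried without downloading a model; `multi_subtoken_share` and the toy tokenizer are illustrative assumptions, not code from the original notebook.

```python
from typing import Callable, Iterable, List


def multi_subtoken_share(names: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of names that the tokenizer splits into more than one subtoken."""
    names = list(names)
    if not names:
        return 0.0
    split = sum(1 for name in names if len(tokenize(name)) > 1)
    return split / len(names)


# Toy tokenizer for illustration only: it knows two whole words and
# otherwise falls back to chunks of three characters.
TOY_VOCAB = {"Paris", "Lima"}


def toy_tokenize(text: str) -> List[str]:
    if text in TOY_VOCAB:
        return [text]
    return [text[i:i + 3] for i in range(0, len(text), 3)]


share = multi_subtoken_share(["Paris", "Lima", "Ouagadougou", "Taipei"], toy_tokenize)
print(share)
```

With the `transformers` library, the `tokenize` method of a tokenizer loaded through `AutoTokenizer.from_pretrained` could be passed as the `tokenize` argument, once per model to compare.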

## Indicator 3:

### 3.1 Build clusters of countries that are semantically close

Use K-Means (n=10 clusters), hierarchical clustering, or DBSCAN to cluster countries.

A low correlation between geographical distance and the semantic distance between location embeddings suggests that the embedding space is not organized primarily by geography. The semantic relationships may instead be more influenced by cultural, historical, or sociological factors.
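
The correlation discussed here could be computed along the following lines. This is a sketch under assumptions: haversine distance for geography, cosine distance for embeddings, and Pearson correlation over all pairs; the original notebook may use other choices, and all function names are illustrative.

```python
import math
from itertools import combinations


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cosine_distance(u, v):
    """1 - cosine similarity of two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def geo_semantic_correlation(coords, embeddings):
    """Pearson correlation between pairwise geographic and cosine distances."""
    pairs = list(combinations(range(len(coords)), 2))
    geo = [haversine_km(coords[i], coords[j]) for i, j in pairs]
    sem = [cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]
    return pearson(geo, sem)
```

Here `coords` is a list of `(lat, lon)` tuples and `embeddings` a list of vectors in the same order, one per location.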

Clustering of countries may highlight cultural or historical relationships between countries.
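
As a starting point, the K-Means option could look like the sketch below. It runs on synthetic vectors standing in for real country embeddings (an assumption made so the example is self-contained); `cluster_countries` is an illustrative helper name.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_countries(names, embeddings, n_clusters=10, seed=0):
    """Group country names by the K-Means cluster of their embedding vectors."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(np.asarray(embeddings))
    clusters = {}
    for name, label in zip(names, labels):
        clusters.setdefault(int(label), []).append(name)
    return clusters


# Synthetic stand-in for real country embeddings (assumption: 40 countries, 16 dims)
rng = np.random.default_rng(0)
fake_names = [f"country_{i}" for i in range(40)]
fake_embeddings = rng.normal(size=(40, 16))
clusters = cluster_countries(fake_names, fake_embeddings, n_clusters=10)
```

Inspecting which countries share a cluster, and comparing the groups with the `Region` and `Subregion` columns, is one way to check whether clusters follow geography or other relationships.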

## Indicator 4:

### 4.1 Data visualization

Can we work on other data visualizations to highlight which countries are at the center of the semantic space and which ones are on the periphery?
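
One simple quantity to visualize is the distance of each country embedding to the centroid of all embeddings. The sketch below assumes Euclidean distance to the mean vector as the notion of "center", which is only one possible choice; `rank_by_centrality` is an illustrative name.

```python
import numpy as np


def rank_by_centrality(names, embeddings):
    """Sort names from closest to farthest from the centroid of the embeddings."""
    vectors = np.asarray(embeddings, dtype=float)
    centroid = vectors.mean(axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    order = np.argsort(distances)
    return [(names[i], float(distances[i])) for i in order]


# Toy example: "B" lies closest to the centroid, "C" is the most peripheral
ranking = rank_by_centrality(["A", "B", "C"], [[0, 0], [1, 0], [10, 0]])
```

The resulting distances could then be mapped to a color scale on a world map, so that central and peripheral countries stand out visually.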

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel
import torch

**Geo Datasets**

In [None]:
!pip install countryinfo
!pip install shapely
!pip install geopandas
!pip install matplotlib
!pip install scikit-learn
!pip install geopy
!pip install plotly-express
!pip install --upgrade nbformat
!pip install unidecode

In [None]:
from countryinfo import CountryInfo
import pandas as pd
import numpy as np
from shapely.geometry import Polygon
import geopandas as gpd

country = CountryInfo()

countries = []
capitals = []
regions = []
subregions = []
coordinates = []

for c in list(country.all().keys()):
    country_info = CountryInfo(c)
    countries.append(c)
    # Some countries lack a field: fall back to a missing value instead of failing
    try:
        regions.append(country_info.region())
    except Exception:
        regions.append(np.nan)  # np.NAN was removed in NumPy 2.0
    try:
        subregions.append(country_info.subregion())
    except Exception:
        subregions.append(np.nan)
    try:
        geometry = country_info.geo_json()["features"][0]["geometry"]
        if geometry["type"] == "Polygon":
            coordinates.append(Polygon(geometry["coordinates"][0]))
        else:  # MultiPolygon: keep the polygon whose outer ring has the most vertices
            max_polygon = max(geometry["coordinates"], key=lambda x: len(x[0]))
            coordinates.append(Polygon(max_polygon[0]))
    except Exception:
        coordinates.append(None)  # missing geometry
    try:
        capitals.append(country_info.capital())
    except Exception:
        capitals.append(np.nan)

# Create DataFrame
data = {
    'Country': countries,
    'Capital': capitals,
    'Region': regions,
    'Subregion': subregions,
    'Coordinates': coordinates
}

df_countries = pd.DataFrame(data)
df_countries = gpd.GeoDataFrame(df_countries, geometry='Coordinates')

**Add capital coordinates**

Using OpenStreetMap data through the Nominatim geocoder
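
The public Nominatim service asks clients to stay at or below one request per second. A minimal sketch of a cached, throttled wrapper is shown below; `make_polite_geocoder` is a hypothetical helper, written against any geocode callable so it does not depend on the network.

```python
import time


def make_polite_geocoder(geocode, min_delay_seconds=1.0):
    """Wrap a geocode callable with a cache and a minimum delay between calls."""
    cache = {}
    last_call = [float("-inf")]  # time of the previous real call

    def polite_geocode(query):
        if query in cache:
            return cache[query]  # repeated queries never hit the service
        wait = min_delay_seconds - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        cache[query] = geocode(query)
        return cache[query]

    return polite_geocode
```

geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can wrap `geolocator.geocode` with a `min_delay_seconds` argument and may be preferable in practice.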

In [None]:
from geopy.geocoders import Nominatim
from shapely.geometry import Point

geolocator = Nominatim(user_agent="geoBias-llm")
location = geolocator.geocode("Taipei", language='en')

print(f"lat: {location.latitude}, lon: {location.longitude}")

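def capital_coord(city):
    """Geocode a capital name; return a shapely Point(lon, lat) or None on failure."""
    try:
        loc = geolocator.geocode(city, language='en')
        return Point(loc.longitude, loc.latitude)
    except Exception:  # no match (loc is None), missing capital, or network error
        return None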

df_countries["capital_coordinates"] = df_countries["Capital"].apply(capital_coord)

# Change the geometry
df_countries = gpd.GeoDataFrame(df_countries, geometry="capital_coordinates")

In [None]:
df_countries

In [None]:
# Note: gpd.datasets was removed in GeoPandas 1.0; this call requires geopandas < 1.0
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = world.plot(color='lightgrey')

df_countries.plot(ax=ax, color="red")

In [None]:
def point_to_tuple(point):
    if point is None:
        return (None, None)
    else:
        return (point.y, point.x)

# Create dataset and data loader
df_countries = df_countries.dropna(subset=["capital_coordinates"])
cities = df_countries["Capital"].to_list()  # list of city names
gps_coords = df_countries["capital_coordinates"].apply(point_to_tuple).to_list()  # list of GPS coordinates (lat, lon)

In [None]:
gps_coords

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import RobertaTokenizer, RobertaModel
from tqdm import tqdm

# Load pre-trained RoBERTa model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaModel.from_pretrained('roberta-base')

# Define the MLP model
class CityEmbeddingMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(CityEmbeddingMLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Define the dataset class
class CityGpsDataset(torch.utils.data.Dataset):
    def __init__(self, cities, gps_coords):
        self.cities = cities
        self.gps_coords = gps_coords

    def __len__(self):
        return len(self.cities)

    def __getitem__(self, idx):
        city = self.cities[idx]
        gps_coord = self.gps_coords[idx]

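        # Encode city name using RoBERTa (used as a frozen encoder)
        inputs = tokenizer(city,
                           add_special_tokens=True,
                           max_length=50,
                           padding='max_length',
                           truncation=True,
                           return_attention_mask=True,
                           return_tensors='pt')
        # No gradients are needed for RoBERTa: only the MLP is trained
        with torch.no_grad():
            city_embedding = roberta_model(inputs['input_ids'], attention_mask=inputs['attention_mask'])[0]
        # city_embedding.shape: [1, 50, 768] (last hidden state of each token)
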
        # Create target GPS coordinates
        gps_coord_tensor = torch.tensor(gps_coord, dtype=torch.float)

        return city_embedding, gps_coord_tensor

dataset = CityGpsDataset(cities, gps_coords)
batch_size = 8
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=True)

# Initialize MLP model, loss function, and optimizer
mlp_model = CityEmbeddingMLP(input_dim=768, hidden_dim=128, output_dim=2)  # 768 is the RoBERTa embedding dimension
criterion = nn.MSELoss()
optimizer = optim.Adam(mlp_model.parameters(), lr=0.001)

# Train the model
for epoch in range(10):  # train for 10 epochs
    for batch in tqdm(data_loader):
        # Use a new name so the global `gps_coords` list is not overwritten
        city_embeddings, target_coords = batch

        optimizer.zero_grad()

        # Forward pass
        outputs = mlp_model(city_embeddings)
        # outputs.shape: [batch_size, 1, 50, 2]
        outputs = torch.mean(outputs, dim=2)  # average over all 50 token positions (padding included)
        outputs = torch.squeeze(outputs, dim=1)  # drop the singleton sequence dimension
        loss = criterion(outputs, target_coords)

        # Backward pass
        loss.backward()
        optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')
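
The loop above averages over all 50 token positions, so padding tokens contribute to the pooled vector. A possible refinement, sketched here as an assumption rather than as the notebook's method, is to average only over real tokens using the attention mask; `masked_mean_pool` is an illustrative helper.

```python
import torch


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only, ignoring padding.

    token_embeddings: [batch, seq_len, hidden]
    attention_mask:   [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Toy check: the third position is padding and must not affect the mean
emb = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(emb, mask)
```

Using this in the notebook would require the dataset to return the attention mask alongside the embeddings, which the current `CityGpsDataset` does not do.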

In [None]:
for i, d in enumerate(data_loader):
    try:
        print(f"{i}: {len(d)}")
    except:
        print(d)

In [None]:
dataset[0][0].shape

In [None]:
dataset[1][0].shape

In [None]:
len(dataset[4])

In [None]:
len(dataset[3])

In [None]:
# dataset = CityGpsDataset(cities, gps_coords)
print(len(cities))
print(len(gps_coords))