# Geographical Biases in Large Language Models (LLMs)

This tutorial aims to identify geographical biases propagated by LLMs. It covers four topics:

1. Spatial disparities in geographical knowledge
2. Spatial information coverage in training datasets
3. Correlation between geographic distance and semantic distance
4. Anomalies between geographic distance and semantic distance

**Authors**

| Author               | Affiliation   |
|----------------------|---------------|
| Rémy Decoupes        | INRAE / TETIS |
| Mathieu Roche        | CIRAD / TETIS |
| Maguelonne Teisseire | INRAE / TETIS |

![TETIS](https://www.umr-tetis.fr/images/logo-header-tetis.png)






In [None]:
# Installation
!pip install -U bitsandbytes
!pip install transformers==4.37.2
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install openai==0.28

In [None]:
from transformers import BertModel, BertTokenizer
from transformers import RobertaTokenizer, RobertaModel
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline

list_of_models = {
    'bert': {
        'name': 'bert-base-uncased',
        'tokenizer': BertTokenizer.from_pretrained('bert-base-uncased'),
        'model': BertModel.from_pretrained('bert-base-uncased'),
        'mask': "[MASK]",
        'type': "SLM"
    },
    'bert-base-multilingual-uncased':{
        'name': 'bert-base-multilingual-uncased',
        'tokenizer': AutoTokenizer.from_pretrained('bert-base-multilingual-uncased'),
        'model': BertModel.from_pretrained('bert-base-multilingual-uncased'),
        'mask': "[MASK]",
        'type': "SLM"
    },
    'roberta': {
        'name': 'roberta-base',
        'tokenizer': AutoTokenizer.from_pretrained('roberta-base'),
        'model': RobertaModel.from_pretrained('roberta-base'),
        'mask': "<mask>",
        'type': "SLM"
    },
    'xlm-roberta-base': {
        'name': 'xlm-roberta-base',
        'tokenizer': AutoTokenizer.from_pretrained('xlm-roberta-base'),
        'model': RobertaModel.from_pretrained('xlm-roberta-base'),
        'mask': "<mask>",
        'type': "SLM"
    },
    'mistral': {
        'name': 'mistralai/Mistral-7B-Instruct-v0.1',
        'type': "LLM_local"
    },
    'llama2': {
        'name': 'meta-llama/Llama-2-7b-chat-hf',
        'type': "LLM_local"
    },
    'chatgpt':{
        'name': 'gpt-3.5-turbo-0301',
        'type': "LLM_remote_api"
    },
}

**Set up API keys**

- Hugging Face
- OpenAI

In [1]:
import getpass

# getpass hides the input so the keys are not echoed in the notebook output
HF_API_TOKEN = getpass.getpass(prompt="Your Hugging Face API token: ")
OPENAI_API_KEY = getpass.getpass(prompt="Your OpenAI API key: ")

## 1. Spatial Disparities In Geographical Knowledge

We will use two types of language models, Small Language Models (SLMs) and Large Language Models (LLMs):

- For SLMs, we use the Hugging Face `transformers` library.
- For LLMs, we use one of two methods: the OpenAI API (ChatGPT), or local inference for Mistral and Llama 2.

In [None]:
SLMs = {key: value for key, value in list_of_models.items() if 'type' in value and value['type'] == 'SLM'}
print(f"List of SLMs: {[value['name'] for value in SLMs.values()]}")

local_LLMs = {key: value for key, value in list_of_models.items() if 'type' in value and value['type'] == 'LLM_local'}
print(f"List of LLMs local inference: {[value['name'] for value in local_LLMs.values()]}")

api_LLMs = {key: value for key, value in list_of_models.items() if 'type' in value and value['type'] == 'LLM_remote_api'}
print(f"List of LLMs remote API: {[value['name'] for value in api_LLMs.values()]}")


**Geo datasets**

In [None]:
!pip install countryinfo

In [None]:
from countryinfo import CountryInfo
import pandas as pd
import numpy as np

country = CountryInfo()

countries = []
capitals = []
regions = []
subregions = []
coordinates = []

# countryinfo does not provide every field for every country,
# so fall back to NaN when a lookup fails.
for c in country.all().keys():
    country_info = CountryInfo(c)
    countries.append(c)
    try:
        regions.append(country_info.region())
    except Exception:
        regions.append(np.nan)
    try:
        subregions.append(country_info.subregion())
    except Exception:
        subregions.append(np.nan)
    try:
        # Raw GeoJSON geometry (country outline) in (longitude, latitude) order
        coordinates.append(country_info.geo_json()["features"][0]["geometry"]["coordinates"])
    except Exception:
        coordinates.append(np.nan)
    try:
        capitals.append(country_info.capital())
    except Exception:
        capitals.append(np.nan)

# Create DataFrame
data = {
    'Country': countries,
    'Capital': capitals,
    'Region': regions,
    'Subregion': subregions,
    'Coordinates': coordinates
}

df_countries = pd.DataFrame(data)

# Display DataFrame
df_countries
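
The `Coordinates` column holds raw GeoJSON geometry, while the distance-based topics listed in the introduction need one point per country and a way to measure distance between points. The sketch below shows one possible approach, not necessarily the one used later in the tutorial. The helper names (`iter_points`, `rough_center`, `haversine_km`) and the toy square are hypothetical. Averaging boundary vertices is only a crude stand-in for a true centroid and can mislead for countries that span the antimeridian or have distant territories.

```python
import math

def iter_points(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_points(part)

def rough_center(coords):
    """Average of all boundary vertices (an assumption, not a true centroid)."""
    points = list(iter_points(coords))
    lon = sum(p[0] for p in points) / len(points)
    lat = sum(p[1] for p in points) / len(points)
    return lon, lat

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical toy polygon using GeoJSON nesting, in (lon, lat) order.
square = [[[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]]
print(rough_center(square))
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
```

On the notebook's data, `rough_center` could be applied to each non-missing entry of `df_countries["Coordinates"]`.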

### 1.1 SLMs

#### 1.1.1 Example

Let's ask `roberta-base` which country Taipei is the capital of.

In [None]:
fill_mask = pipeline(task="fill-mask", model='roberta-base')
masked_sentence = 'Taipei is capital of <mask>'

prediction = fill_mask(masked_sentence)
print(f"Prediction: {prediction}")
print(f"Predicted token: {prediction[0]['token_str']}")
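
The pipeline returns a ranked list of candidates, so a natural follow-up is to check whether the expected country appears among the top k rather than only at rank 1. The following is a minimal sketch, assuming the usual `fill-mask` output shape (a list of dicts with `token_str` and `score` keys). The helper name `rank_of_expected` is hypothetical, and the sample predictions are made-up values for illustration, not actual model output.

```python
def rank_of_expected(predictions, expected, k=5):
    """Return the 1-based rank of `expected` in the top-k predictions, or None."""
    for rank, candidate in enumerate(predictions[:k], start=1):
        # RoBERTa tokens often carry a leading space, so strip before comparing.
        if candidate["token_str"].strip().lower() == expected.lower():
            return rank
    return None

# Made-up predictions, shaped like the fill-mask pipeline output.
sample_predictions = [
    {"token_str": " China", "score": 0.41},
    {"token_str": " Taiwan", "score": 0.38},
    {"token_str": " Japan", "score": 0.05},
]
print(rank_of_expected(sample_predictions, "Taiwan"))
```

With the real pipeline, the call would be `rank_of_expected(prediction, "Taiwan")`.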

#### 1.1.2 Worldwide
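
Extending the single example to every country could look like the sketch below, which computes the share of correctly identified capitals per region. It rests on assumptions: `predict_country` is a hypothetical callable (for instance a thin wrapper around `fill_mask`) that maps a capital name to a predicted country string, and the three toy records and stub answers are invented so the block runs on its own.

```python
from collections import defaultdict

def accuracy_by_region(records, predict_country):
    """Share of capitals whose country is predicted correctly, per region."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in records:
        totals[row["Region"]] += 1
        predicted = predict_country(row["Capital"]).strip().lower()
        # countryinfo uses lowercase country names, so compare case-insensitively.
        hits[row["Region"]] += predicted == row["Country"].lower()
    return {region: hits[region] / totals[region] for region in totals}

# Toy records with the same column names as df_countries.
toy_records = [
    {"Country": "france", "Capital": "Paris", "Region": "Europe"},
    {"Country": "taiwan", "Capital": "Taipei", "Region": "Asia"},
    {"Country": "kenya", "Capital": "Nairobi", "Region": "Africa"},
]
# Invented answers standing in for a model; not real predictions.
stub_answers = {"Paris": "France", "Taipei": "China", "Nairobi": " Kenya"}

scores = accuracy_by_region(toy_records, lambda capital: stub_answers.get(capital, ""))
print(scores)
```

On the notebook's data, the records could be obtained with `df_countries.dropna(subset=["Capital", "Region"]).to_dict("records")`.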

### 1.2 Local LLMs

Let's do the same with local LLMs. Because these models are large, we need to quantize them so that they fit in memory.
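
A rough back-of-the-envelope calculation shows why quantization helps. The sketch assumes about 7 billion parameters (the nominal size of the 7B models listed above) and counts weights only, ignoring activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # nominal parameter count, an approximation
for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take roughly a quarter of the memory needed at 16-bit precision.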

#### 1.2.1 Example

In [None]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
    token=HF_API_TOKEN
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", token=HF_API_TOKEN)

In [None]:
city = "Taipei"

messages = [
    {"role": "user", "content": "Name the country corresponding to its capital: Paris. Only give the country."},
    {"role": "assistant", "content": "France"},
    {"role": "user", "content": f"Name the country corresponding to its capital: {city}. Only give the country."}]
# Tokenize the chat and move the input IDs to the same device as the model
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

generated_ids = model.generate(encodeds, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)

print(f"Results: '{decoded}'")
print(f'Parse only the country: {decoded[0].split("[/INST] ")[-1].replace("</s>", "").lstrip()}')
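
The parsing step in the last line can be wrapped in a small reusable function. This is a sketch: `parse_answer` is a hypothetical helper, the `[/INST]` and `</s>` markers are taken from the parsing line above, and the sample string is hand-written to resemble a decoded output rather than copied from a real run.

```python
def parse_answer(decoded_text, inst_end="[/INST]", eos="</s>"):
    """Keep only the text generated after the last instruction marker."""
    answer = decoded_text.split(inst_end)[-1]
    return answer.replace(eos, "").strip()

# Hand-written string shaped like a decoded one-shot conversation.
sample = "<s>[INST] Question about Paris [/INST] France</s>[INST] Question about Taipei [/INST] Taiwan</s>"
print(parse_answer(sample))
```

With the model output above, the call would be `parse_answer(decoded[0])`.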

### 1.3 Remote LLMs

Here we query ChatGPT through the OpenAI API, using the legacy `openai==0.28` interface installed above.

#### 1.3.1 Example 

In [None]:
import openai
openai.api_key = OPENAI_API_KEY


city = "Taipei"

messages = [
    {"role": "user", "content": "Name the country corresponding to its capital: Paris. Only give the country."},
    {"role": "assistant", "content": "France"},
    {"role": "user", "content": f"Name the country corresponding to its capital: {city}. Only give the country."}]


response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=messages
)

# Single quotes inside the f-string avoid a syntax error on Python < 3.12
print(f"Results (without parsing): {response['choices'][0]['message']['content']}")
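
The local and the remote examples use the same one-shot prompt, so it can be factored into a helper and reused for any capital. This is a sketch; `build_messages` is a hypothetical name and not part of either the `transformers` or the `openai` API.

```python
def build_messages(city):
    """One-shot chat prompt asking for the country of a given capital."""
    template = "Name the country corresponding to its capital: {}. Only give the country."
    return [
        {"role": "user", "content": template.format("Paris")},
        {"role": "assistant", "content": "France"},
        {"role": "user", "content": template.format(city)},
    ]

for message in build_messages("Taipei"):
    print(message["role"], "->", message["content"])
```

The returned list can be passed as `messages` to either `tokenizer.apply_chat_template` or `openai.ChatCompletion.create`.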