# Geographical Biases in Large Language Models (LLMs)

This tutorial aims to identify geographical biases propagated by LLMs. It covers four topics:

1. Spatial disparities in geographical knowledge
2. Spatial information coverage in training datasets
3. Correlation between geographic distance and semantic distance
4. Anomalies between geographic distance and semantic distance

**Authors**

| Author               | Affiliation   |
|----------------------|---------------|
| Rémy Decoupes        | INRAE / TETIS |
| Mathieu Roche        | CIRAD / TETIS |
| Maguelonne Teisseire | INRAE / TETIS |

![TETIS](https://www.umr-tetis.fr/images/logo-header-tetis.png)






In [1]:
# Installation
!pip install -U bitsandbytes
!pip install transformers==4.37.2
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git


In [2]:
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from transformers import BertModel, BertTokenizer, RobertaModel
from transformers import pipeline

# Registry of the models compared in this tutorial.
# 'type' is "SLM" (small model loaded locally), "LLM_local" (local inference)
# or "LLM_remote_api" (queried through a remote API).
list_of_models = {
    'bert': {
        'name': 'bert-base-uncased',
        'tokenizer': BertTokenizer.from_pretrained('bert-base-uncased'),
        'model': BertModel.from_pretrained('bert-base-uncased'),
        'mask': "[MASK]",
        'type': "SLM"
    },
    'bert-base-multilingual-uncased': {
        'name': 'bert-base-multilingual-uncased',
        'tokenizer': AutoTokenizer.from_pretrained('bert-base-multilingual-uncased'),
        'model': BertModel.from_pretrained('bert-base-multilingual-uncased'),
        'mask': "[MASK]",
        'type': "SLM"
    },
    'roberta': {
        'name': 'roberta-base',
        'tokenizer': AutoTokenizer.from_pretrained('roberta-base'),
        'model': RobertaModel.from_pretrained('roberta-base'),
        'mask': "<mask>",
        'type': "SLM"
    },
    'xlm-roberta-base': {
        'name': 'xlm-roberta-base',
        'tokenizer': AutoTokenizer.from_pretrained('xlm-roberta-base'),
        # AutoModel picks the XLM-RoBERTa class; loading this checkpoint with
        # RobertaModel triggers a model-type mismatch warning.
        'model': AutoModel.from_pretrained('xlm-roberta-base'),
        'mask': "<mask>",
        'type': "SLM"
    },
    'mistral': {
        'name': 'mistralai/Mistral-7B-Instruct-v0.1',
        'type': "LLM_local"
    },
    'llama2': {
        'name': 'meta-llama/Llama-2-7b-chat-hf',
        'type': "LLM_local"
    },
    'chatgpt': {
        'name': 'gpt-3.5-turbo-0301',
        'type': "LLM_remote_api"
    },
}

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Set up API keys**

- Hugging Face
- OpenAI

In [3]:
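from getpass import getpass

# getpass hides the token as it is typed, so it does not end up in the notebook output
HF_API_TOKEN = getpass("Your Hugging Face API token: ")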
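
To avoid typing a key on every run, one option is to look for it in an environment variable first and only prompt as a fallback. The sketch below is an illustration, not part of the tutorial's code: the helper `read_api_key` is hypothetical, and the variable names `HF_TOKEN` and `OPENAI_API_KEY` are assumptions about how the keys were exported.

```python
import os
from getpass import getpass


def read_api_key(env_var, prompt, environ=None, ask=getpass):
    """Return the key stored in `env_var`, or prompt for it if unset.

    `environ` and `ask` are injectable so the helper can be exercised
    without touching the real environment or blocking on input.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip()
    if value:
        return value
    return ask(prompt).strip()


# Example usage (commented out so the cell does not block on input):
# HF_API_TOKEN = read_api_key("HF_TOKEN", "Your Hugging Face API token: ")
# OPENAI_API_KEY = read_api_key("OPENAI_API_KEY", "Your OpenAI API key: ")
```

Keeping keys out of the notebook source also makes it safer to share the file.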

## 1. Spatial Disparities in Geographical Knowledge

We will use two types of language models, Small Language Models (SLMs) and Large Language Models (LLMs):

- For SLMs: we use the Hugging Face `transformers` library.
- For LLMs: there are two options, a remote API (OpenAI's ChatGPT) or local inference (Mistral or Llama 2).

In [4]:
# Split the registry by model type
SLMs = {key: value for key, value in list_of_models.items() if value.get('type') == 'SLM'}
print(f"List of SLMs: {[value['name'] for value in SLMs.values()]}")

local_LLMs = {key: value for key, value in list_of_models.items() if value.get('type') == 'LLM_local'}
print(f"List of LLMs local inference: {[value['name'] for value in local_LLMs.values()]}")

api_LLMs = {key: value for key, value in list_of_models.items() if value.get('type') == 'LLM_remote_api'}
print(f"List of LLMs remote API: {[value['name'] for value in api_LLMs.values()]}")


List of SLMs: ['bert-base-uncased', 'bert-base-multilingual-uncased', 'roberta-base', 'xlm-roberta-base']
List of LLMs local inference: ['mistralai/Mistral-7B-Instruct-v0.1', 'meta-llama/Llama-2-7b-chat-hf']
List of LLMs remote API: ['gpt-3.5-turbo-0301']


### 1.1 Example of probing geographical knowledge

Let's ask `roberta-base` which country Taipei is the capital of.

In [5]:
fill_mask = pipeline(task="fill-mask", model='roberta-base')
masked_sentence = 'Taipei is capital of <mask>'

prediction = fill_mask(masked_sentence)
print(f"Prediction: {prediction}")
print(f"Predicted token: {prediction[0]['token_str']}")

Prediction: [{'score': 0.8915017247200012, 'token': 6951, 'token_str': ' Taiwan', 'sequence': 'Taipei is capital of Taiwan'}, {'score': 0.05331902578473091, 'token': 436, 'token_str': ' China', 'sequence': 'Taipei is capital of China'}, {'score': 0.025364302098751068, 'token': 1429, 'token_str': ' Japan', 'sequence': 'Taipei is capital of Japan'}, {'score': 0.011096514761447906, 'token': 6547, 'token_str': ' Thailand', 'sequence': 'Taipei is capital of Thailand'}, {'score': 0.006269776728004217, 'token': 1101, 'token_str': ' Korea', 'sequence': 'Taipei is capital of Korea'}]
Predicted token:  Taiwan
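
To look at spatial disparities rather than a single city, the same cloze sentence can be repeated over many capital/country pairs and scored. The following is a minimal sketch under stated assumptions: `build_prompt` and `top1_accuracy` are hypothetical helpers, the three pairs are illustrative only, and `predict` stands for any function that returns the top predicted token for a masked sentence.

```python
def build_prompt(capital, mask_token):
    """Build the same cloze sentence as above for a given capital."""
    return f"{capital} is capital of {mask_token}"


def top1_accuracy(pairs, predict, mask_token):
    """Share of (capital, country) pairs whose top prediction matches.

    `predict` maps a masked sentence to the predicted token string.
    """
    if not pairs:
        return 0.0
    hits = 0
    for capital, country in pairs:
        guess = predict(build_prompt(capital, mask_token))
        # Fill-mask tokens may carry a leading space (e.g. ' Taiwan').
        if guess.strip().lower() == country.lower():
            hits += 1
    return hits / len(pairs)


# Illustrative pairs only; a real study would need a full gazetteer.
pairs = [("Taipei", "Taiwan"), ("Paris", "France"), ("Tokyo", "Japan")]

# With the pipeline from the cell above, `predict` could be:
# predict = lambda s: fill_mask(s)[0]["token_str"]
# print(top1_accuracy(pairs, predict, mask_token="<mask>"))
```

The `mask_token` argument matters because the mask differs between models (`[MASK]` for the BERT entries, `<mask>` for the RoBERTa entries in `list_of_models`). Note that a single mask can only recover country names that the tokenizer encodes as one token, so this scoring likely underestimates knowledge of countries with longer names.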


Let's do the same with local LLMs. Because these models are large, we need to quantize them first.
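
A rough back-of-the-envelope calculation shows why quantization helps. The sketch below estimates the memory taken by the weights alone; it assumes a round figure of 7 billion parameters for a "7B" model and ignores activations, the KV cache, and the small overhead of quantization constants, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights only, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights go from about 14 GB in bfloat16 to about 3.5 GB in 4-bit, which is what makes a 7B model fit on a single consumer-grade GPU.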

In [6]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # NormalFloat 4 (NF4) data type
    bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
    token=HF_API_TOKEN
)
model.eval()

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`
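
The error above suggests that the running kernel does not see a usable `accelerate` or `bitsandbytes` installation. One way to check what the kernel actually sees is to query the installed versions, as in the sketch below. The helper `installed_versions` is hypothetical; the package list simply mirrors the installation cell.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["transformers", "accelerate", "bitsandbytes", "peft"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows up as installed but the error persists, a possible cause (an assumption, not something the log confirms) is that it was installed after the kernel started, in which case restarting the kernel and re-running the cells may help.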