# Prepare datasets with Argilla

## 1. Deploy Argilla server locally

```bash
cd /workspace
mkdir argilla && cd argilla
wget -O docker-compose.yaml https://raw.githubusercontent.com/argilla-io/argilla/main/examples/deployments/docker/docker-compose.yaml

service docker start
docker compose up -d
```

Open the web UI at http://localhost:6900/ and log in with the default credentials defined in `docker-compose.yaml`:

> cat docker-compose.yaml

- USERNAME: argilla
- PASSWORD: 12345678
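
`docker compose up -d` returns as soon as the containers are started, which may be before the server accepts connections. Below is a minimal sketch of a readiness check, assuming the URL above. The helper names `http_probe` and `wait_until_ready` are illustrative and not part of Argilla.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout...
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe() until it returns True or the attempts are exhausted."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Hypothetical usage against the local deployment:
# wait_until_ready(lambda: http_probe("http://localhost:6900/"))
```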

## 2. Install Argilla client SDK

In [None]:
pip install argilla -U --pre

In [1]:
from importlib.metadata import version
version('argilla')

'2.0.0'

The API key is also defined in `docker-compose.yaml`:

> cat docker-compose.yaml

- API_KEY: argilla.apikey

In [2]:
import argilla as rg

client = rg.Argilla(api_url="http://localhost:6900/", api_key="argilla.apikey")

client.me.first_name

'argilla'
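
The URL and API key above are hardcoded in the cell. A possible alternative, sketched here under assumptions, is to read them from environment variables and fall back to the local defaults. The variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` and the helper `argilla_connection_settings` are choices made for this example.

```python
import os
from typing import Dict, Mapping, Optional


def argilla_connection_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the keyword arguments for rg.Argilla from environment variables.

    Falls back to the defaults of the local docker-compose deployment.
    """
    env = os.environ if env is None else env
    return {
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900/"),
        "api_key": env.get("ARGILLA_API_KEY", "argilla.apikey"),
    }


connection = argilla_connection_settings()

# Hypothetical usage with the client created above:
# client = rg.Argilla(**connection)
```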

In [24]:
workspace_to_create = rg.Workspace(name="argilla")
created_workspace = workspace_to_create.create()

## 3. Import Argilla dataset

### 3.1 Load Huggingface dataset

https://huggingface.co/datasets/frenchtext/banque-fr-2311

Dataset extracted from public websites by wordslab-webscraper in 2311 (November 2023):
- domain: banque
- language: fr
- license: Apache 2.0

In [None]:
pip install datasets

In [3]:
from datasets import load_dataset

with open("/workspace/myhftoken", 'r') as file:
    myhftoken = file.read().strip()

hf_dataset = load_dataset("frenchtext/banque-fr-2311",  token=myhftoken)

Resolving data files:   0%|          | 0/42 [00:00<?, ?it/s]


```yaml
dataset_info:
  features:
    - name: Uri
      dtype: string
    - name: Timestamp
      dtype: string
    - name: Lang
      dtype: string
    - name: Title
      dtype: string
    - name: Text
      dtype: string
    - name: Words
      dtype: int32
    - name: AvgWordsLength
      dtype: int32
    - name: Chars
      dtype: int32
    - name: LetterChars
      dtype: int32
    - name: NumberChars
      dtype: int32
    - name: OtherChars
      dtype: int32
    - name: Website
      dtype: string
    - name: PDF
      dtype: bool
  config_name: default
  splits:
    - name: train
      num_examples: 68166
    - name: valid
      num_examples: 8522
    - name: test
      num_examples: 8541
  download_size: 247147772
```

### 3.2 Compute text embeddings

MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis

https://arxiv.org/pdf/2405.20468v2

=> multilingual-e5-base is selected for clustering; it produces 768-dimensional embeddings

https://github.com/microsoft/unilm/tree/master/e5

https://huggingface.co/intfloat/multilingual-e5-small --- https://huggingface.co/intfloat/multilingual-e5-base

In [None]:
pip install sentence_transformers

In [4]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/multilingual-e5-base', device="cuda")

In [5]:
input_texts = [
    'query: how much protein should a female eat',
    'query: 南瓜的家常做法',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
]
embeddings = model.encode(input_texts, normalize_embeddings=True)

In [6]:
embeddings.shape

(4, 768)
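
Because the embeddings are computed with `normalize_embeddings=True`, the dot product of two vectors equals their cosine similarity. The sketch below illustrates this with toy 2-dimensional vectors in pure Python. The helper names are illustrative, and the toy values stand in for real model output.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(queries, passages) -> List[List[float]]:
    """One row per query, one column per passage."""
    return [[dot(q, p) for p in passages] for q in queries]


# Toy stand-ins for normalized embeddings
toy_queries = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 2.0])]
toy_passages = [l2_normalize([3.0, 0.0]), l2_normalize([1.0, 1.0])]
toy_scores = similarity_matrix(toy_queries, toy_passages)

# Hypothetical usage with the embeddings computed above (2 queries, 2 passages):
# similarity_matrix(embeddings[:2], embeddings[2:])
```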

In [7]:
hf_ds_valid = hf_dataset["valid"]
hf_ds_valid

Dataset({
    features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF'],
    num_rows: 8522
})

In [8]:
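def embed(example):
    # Note: the E5 example above prefixes inputs with "query: " or "passage: ";
    # here the raw page text is encoded without a prefix and without normalization.
    example["Text_e5_embeddings"] = model.encode(example["Text"])
    return example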

In [16]:
hf_dataset = hf_dataset.map(embed)

Map:   0%|          | 0/68166 [00:00<?, ? examples/s]

Map:   0%|          | 0/8522 [00:00<?, ? examples/s]

Map:   0%|          | 0/8541 [00:00<?, ? examples/s]
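
The `embed` function above encodes one page per call. A possible variant, sketched here under assumptions, encodes pages in batches and adds the `passage: ` prefix used in the E5 example. The `encode` callable is injected so that the function can be tried without a GPU; `embed_batch` and the batch size of 64 are choices made for this example, not values from the notebook.

```python
from typing import Callable, Dict, List, Sequence


def embed_batch(
    batch: Dict[str, List[str]],
    encode: Callable[[List[str]], Sequence[Sequence[float]]],
    prefix: str = "passage: ",
) -> Dict[str, List[List[float]]]:
    """Return a new embeddings column for a batch of rows."""
    texts = [prefix + text for text in batch["Text"]]
    vectors = encode(texts)
    return {"Text_e5_embeddings": [list(vector) for vector in vectors]}


# Hypothetical usage with the model and dataset defined above:
# hf_dataset = hf_dataset.map(
#     lambda batch: embed_batch(
#         batch, lambda texts: model.encode(texts, normalize_embeddings=True)
#     ),
#     batched=True,
#     batch_size=64,
# )
```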

In [18]:
hf_dataset.save_to_disk("/models/huggingface/datasets/frenchtext___banque-fr-2311/preprocessed")

Saving the dataset (0/2 shards):   0%|          | 0/68166 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/8522 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/8541 [00:00<?, ? examples/s]

In [19]:
from datasets import load_from_disk
hf_dataset = load_from_disk("/models/huggingface/datasets/frenchtext___banque-fr-2311/preprocessed")

In [20]:
hf_dataset

DatasetDict({
    train: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings'],
        num_rows: 68166
    })
    valid: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings'],
        num_rows: 8522
    })
    test: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings'],
        num_rows: 8541
    })
})

### 3.3 Create empty Argilla dataset

**IMPORTANT** !!!

- Argilla dataset field **names are converted to lowercase** and **spaces in names are replaced by `_`**

```python
@field_validator("name")
@classmethod
def __name_lower(cls, name):
    formatted_name = name.lower().replace(" ", "_")
    return formatted_name
```

- the comparison with Hugging Face dataset column names is then **case sensitive**

Consequently:
- **always use lowercase names** in Argilla dataset settings
- **define an explicit mapping** from the Hugging Face dataset columns, which contain uppercase letters, to the Argilla names
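
The naming rule above can be turned into a small helper that derives the mapping from the column names. This is a sketch; `argilla_name` and `build_mapping` are names chosen for this example, and the `renames` argument covers columns whose target name does not follow the rule.

```python
from typing import Dict, Iterable, Optional


def argilla_name(column: str) -> str:
    """Apply the same normalization as the validator: lowercase, spaces to underscores."""
    return column.lower().replace(" ", "_")


def build_mapping(
    columns: Iterable[str], renames: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Map each source column to its Argilla name, with optional explicit renames."""
    renames = renames or {}
    return {column: renames.get(column, argilla_name(column)) for column in columns}


example_mapping = build_mapping(
    ["Text", "Uri", "Lang", "Words", "Website", "PDF_int", "Text_e5_embeddings"],
    renames={"PDF_int": "pdf"},
)
```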

In [109]:
settings = rg.Settings(
    guidelines="Explore french banking websites dataset - date ",
    fields=[
        rg.TextField(
            name="text",
            title="Web page text",
            required=True,
            use_markdown=True,
        ),
    ],
    questions=[
        rg.MultiLabelQuestion(
            name="contenttype",
            title="Does the web page include any of these content types?",
            labels=["info", "news", "product", "process", "ads", "metadata"],
        )
    ],
    metadata=[
        rg.TermsMetadataProperty(name="uri"),
        rg.TermsMetadataProperty(name="lang"),
        rg.IntegerMetadataProperty(name="words"),
        rg.TermsMetadataProperty(name="website"),
        rg.IntegerMetadataProperty(name="pdf"),
    ],
    vectors=[
        rg.VectorField(name="text_e5_embeddings", dimensions=768)
    ],
)
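
The `VectorField` above declares 768 dimensions, so each embedding sent to Argilla should have exactly that length. A minimal sketch of a check that could be run on a sample of rows before logging them; `vector_dimension_errors` is a hypothetical helper, not an Argilla function.

```python
from typing import Iterable, List, Mapping, Optional, Tuple


def vector_dimension_errors(
    records: Iterable[Mapping],
    column: str = "Text_e5_embeddings",
    expected: int = 768,
) -> List[Tuple[int, Optional[int]]]:
    """Return (row index, actual length) for rows whose vector is missing or mis-sized."""
    errors = []
    for index, record in enumerate(records):
        vector = record.get(column)
        actual = None if vector is None else len(vector)
        if actual != expected:
            errors.append((index, actual))
    return errors


# Hypothetical usage on the first rows of a split:
# vector_dimension_errors(hf_dataset["valid"].select(range(100)))
```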

In [25]:
dataset = rg.Dataset(
    name="banque-fr-2311",
    workspace="argilla",
    settings=settings,
)

dataset.create()

Dataset(id=UUID('2cda3686-e350-4999-897d-8f30c01c81f1') inserted_at=datetime.datetime(2024, 8, 1, 22, 1, 48, 500104) updated_at=datetime.datetime(2024, 8, 1, 22, 1, 49, 262278) name='banque-fr-2311' status='ready' guidelines='Explore french banking websites dataset - date' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('485e94ed-4e72-496c-9d92-756baea3882d') last_activity_at=datetime.datetime(2024, 8, 1, 22, 1, 49, 262278))

In [26]:
dataset = client.datasets(name="banque-fr-2311")
dataset

Dataset(id=UUID('2cda3686-e350-4999-897d-8f30c01c81f1') inserted_at=datetime.datetime(2024, 8, 1, 22, 1, 48, 500104) updated_at=datetime.datetime(2024, 8, 1, 22, 1, 49, 262278) name='banque-fr-2311' status='ready' guidelines='Explore french banking websites dataset - date' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('485e94ed-4e72-496c-9d92-756baea3882d') last_activity_at=datetime.datetime(2024, 8, 1, 22, 1, 49, 262278))

In [66]:
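def convert_pdf(example):
    # The settings declare `pdf` as an IntegerMetadataProperty: store the flag as 0/1
    example["PDF_int"] = 1 if example["PDF"] else 0
    return example

# map() returns a new DatasetDict: assign it, otherwise the PDF_int column is lost
hf_dataset = hf_dataset.map(convert_pdf)
hf_dataset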

Map:   0%|          | 0/68166 [00:00<?, ? examples/s]

Map:   0%|          | 0/8522 [00:00<?, ? examples/s]

Map:   0%|          | 0/8541 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings', 'PDF_int'],
        num_rows: 68166
    })
    valid: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings', 'PDF_int'],
        num_rows: 8522
    })
    test: Dataset({
        features: ['Uri', 'Timestamp', 'Lang', 'Title', 'Text', 'Words', 'AvgWordsLength', 'Chars', 'LetterChars', 'NumberChars', 'OtherChars', 'Website', 'PDF', 'Text_e5_embeddings', 'PDF_int'],
        num_rows: 8541
    })
})

In [112]:
hfmapping = {"Text":"text", "Uri":"uri", "Lang":"lang", "Words":"words", "Website":"website", "PDF_int":"pdf", "Text_e5_embeddings":"text_e5_embeddings"}

In [68]:
for split, ds in hf_dataset.items():
    dataset.records.log(records=ds, mapping=hfmapping)

Sending records...: 267batch [16:30,  3.71s/batch]                                                                      
Sending records...: 34batch [01:54,  3.36s/batch]                                                                       
Sending records...: 34batch [01:52,  3.29s/batch]                                                                       
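
The batch counts in the progress bars are consistent with 256 records per batch. That batch size is inferred from the numbers and is not stated in the notebook. The arithmetic can be checked as follows:

```python
import math

# Split sizes reported by the dataset
split_sizes = {"train": 68166, "valid": 8522, "test": 8541}

# Assumption: inferred from the progress bars, not confirmed by the notebook
assumed_batch_size = 256

batches = {
    name: math.ceil(size / assumed_batch_size) for name, size in split_sizes.items()
}

# Estimated upload time for the train split at about 3.71 s per batch
train_seconds = batches["train"] * 3.71
train_minutes, train_remainder = divmod(int(train_seconds), 60)
```

With these figures, the train split needs 267 batches and about 16 minutes 30 seconds, which matches the progress bar.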


### Debugging the HF to Argilla mapping

In [33]:
record_dict = {'Uri': 'https://banque.meilleurtaux.com/changer-de-banque/actualites/2020-fevrier/banques-ameliorent-cybersecurite-proteger-menace-grandissante.html', 'Timestamp': '11/07/2023 21:53:27', 'Lang': 'fr', 'Title': 'Les banques améliorent leur cybersécurité pour se protéger d’une menace grandissante - Meilleurtaux Banques', 'Text': "# Les banques améliorent leur cybersécurité pour se protéger d’une menace grandissante\r\n\r\nSécurité de la carte bancaire\r\n\r\nDepuis quelques années, les cyberattaques contre les institutions financières tendent à se multiplier et à s’aggraver. Les acteurs du secteur bancaire s’efforcent tant bien que mal de les anticiper en comblant les failles dans leur système informatique, mais les hackers les prennent souvent de court. Afin d’éviter les situations catastrophiques, les banques réinventent leur politique en matière de cybersécurité.\r\n\r\nFin 2019, le géant des changes Travelex a été victime d’un ransomware affectant ses différents partenaires, incluant HSBC, Barclays, ou encore le Royal Bank of Scotland. Concrètement, les hackers ont lancé un virus perturbant le système informatique de l’établissement. Ils ont ensuite exigé une rançon pour le débloquer.\r\n\r\nAfin d’anticiper ce scénario, huit grandes enseignes américaines se sont associées pour protéger leurs activités d’un éventuel piratage de grande envergure. Elles travaillent ainsi sur un dispositif de protection unique permettant de prendre momentanément le relais de l’établissement paralysé par une cyberattaque. 
De cette manière, les clients n’envisageront pas de changer de banque suite à la suspension des services.\r\n\r\nOffre du moment\r\n\r\nBoursoBank\r\n\r\nBoursoBank\r\n\r\n100 €\r\n\r\nBoursoBank\r\n\r\nJe découvre l’offre\r\n\r\nOfferts\r\n\r\nHello Bank\r\n\r\nHello Bank\r\n\r\n180 €\r\n\r\nHello Bank\r\n\r\nJe découvre l’offre\r\n\r\nOfferts\r\n\r\nOffre du moment\r\n\r\nFortuneo\r\n\r\nFortuneo\r\n\r\n80 €\r\n\r\nFortuneo\r\n\r\nJe découvre l’offre\r\n\r\nOfferts\r\n\r\nOffre du moment\r\n\r\nMonabanq\r\n\r\nMonabanq\r\n\r\njusqu'à 240 €\r\n\r\nMonabanq\r\n\r\nJe découvre l’offre\r\n\r\nOffre du moment\r\n\r\nNotre sélection des promos\r\n\r\nJe compare les offres bancaires\r\n\r\n## Une question d’image\r\n\r\nActuellement, les cyberattaques contre les banques sont protéiformes. Outre les ransomwares, les établissements financiers doivent faire face à la multiplication du jackpotting . Ce type de piratage consiste à s’attaquer aux DAB (distributeurs automatiques de billets) pour les détourner de leur usage initial.\r\n\r\nLes pirates étudient le fonctionnement de ces appareils en effectuant de la rétro-ingénierie sur un modèle acheté sur un site spécialisé ou au marché noir. Ils exploitent ensuite leurs failles de pour en faire ce qu’ils veulent.\r\n\r\nBien que les pertes soient relativement faibles, ce type de cyberattaque nuit foncièrement à l’image des banques. Ces incidents démontrent en effet l’impuissance des institutions financières dans l’univers numérique. Pourtant, elles s’évertuent à poursuivre leur processus de digitalisation.\r\n\r\nLes experts en cybersécurité émettent des recommandations en matière d’ hygiène informatique pour éviter ces problèmes. Les banques doivent par exemple mettre régulièrement à jour leur antivirus et renforcer les systèmes d’authentification concernant le droit d’accès des clients.\r\n\r\nDe leur côté, les acteurs du secteur bancaire s’efforcent de se détacher des antivirus. 
Ils cherchent notamment à développer des systèmes visant à repérer les anomalies. Tel est le cas de la récupération de données sur un poste de travail au sein d’un réseau bancaire ou de l’exploitation d’informations sans autorisation.\r\n\r\nToutefois, les démarches préventives en matière de cybersécurité tendent à augmenter le budget dédié à ce secteur et le coût d’exploitation des banques. Société Générale, par exemple, y consacre aujourd’hui des centaines de millions d’euros et a multiplié ses effectifs par trois au cours de ces trois dernières années.\r\n\r\nJe compare les offres bancaires\r\n\r\n## Un phénomène très répandu\r\n\r\nTravelex n’est pas le seul établissement à avoir rencontré des problèmes suite à des attaques informatiques. Le fournisseur de cartes de crédit américain Capital One a aussi subi un piratage l’été dernier. Les hackers ont eu accès aux données personnelles de plus de 100 millions d’utilisateurs, soit près d’un tiers de la population du pays (327 millions d’habitants).\r\n\r\nCet évènement a eu lieu à peine quelques jours après le vol des données personnelles de 4,2 millions d’usagers au sein de la banque Desjardins au Canada. L’enseigne maltaise Bank of Valletta a aussi subi une cyberattaque début 2019. Cette fois-ci, les pirates ont voulu dérober 13 millions d'euros.\r\n\r\nLa banque italienne UniCredit , quant à elle, a vu pirater les données de 400 000 clients en 2017. Cela s’est produit une année après l’attaque concernant le compte de 40 000 usagers chez Tesco bank. Près de 20 000 clients de cette enseigne britannique ont constaté des retraits illicites sur leur compte. En somme, les banques subissent de plus en plus de cyberattaques. 
D’autant que les hackers sont de plus en plus doués.\r\n\r\nSelon le directeur général de YesWeHack (start-up spécialisée en hacking éthique), Guillaume Vassault-Houlière :\r\n\r\nDès lors que vous avez une activité BtoC, c'est-à-dire qui vise des particuliers, vous êtes plus exposé aux cyberattaques, car les données personnelles ont une valeur et peuvent être revendues.\r\n\r\nGuillaume Vassault-Houlière.\r\n\r\nDans ce contexte, les white hackers sont très sollicités. Ces entreprises technologiques sont officiellement commanditées et payées pour simuler des cyberattaques afin de détecter les failles de sécurité dans un système informatique. Elles sont désormais incontournables dans le secteur bancaire.\r\n\r\n\r\n", 'Words': 780, 'AvgWordsLength': 5, 'Chars': 5375, 'LetterChars': 4258, 'NumberChars': 49, 'OtherChars': 136, 'Website': 'banque.meilleurtaux.com', 'PDF': False, 'Text_e5_embeddings': [-0.004125738516449928, 0.052799925208091736, -0.005087921395897865, -0.0037737833335995674, 0.028302252292633057, -0.015996426343917847, -0.03714762255549431, -0.029330981895327568, 0.016668792814016342, 0.034771665930747986, -0.02304643578827381, -0.022055406123399734, 0.09054294973611832, 0.01396885048598051, -0.04197077825665474, -0.04331587255001068, 0.019524618983268738, -0.046622395515441895, 0.01751800999045372, 0.0008871607133187354, 0.03083798661828041, -0.01707722805440426, 0.05824931338429451, -0.04043231159448624, 0.04148544743657112, -0.013450578786432743, -0.0018609119579195976, 0.03836340084671974, -0.02304152026772499, 0.040782567113637924, 0.0317576564848423, -0.030175451189279556, -0.02995487116277218, 0.050354838371276855, 0.037023141980171204, 0.010040245950222015, -0.004701793659478426, -0.03178611025214195, 0.01853093132376671, 0.03684496879577637, 0.04350774362683296, 0.04442795366048813, 0.03184879943728447, -0.054099902510643005, 0.014245082624256611, -0.011768095195293427, 0.027170950546860695, 0.007879257202148438, -0.03463435173034668, 
0.010933599434792995, 0.02139032445847988, 0.03208475559949875, 0.049029503017663956, 0.020551972091197968, -0.04457463324069977, -0.028314489871263504, 0.022501487284898758, 0.01814953424036503, -0.03182266280055046, 0.0242055244743824, -0.015266749076545238, 0.027578113600611687, 0.011312909424304962, 0.03830978274345398, 0.030648326501250267, -0.008077877573668957, 0.006959210615605116, 0.006798553746193647, -0.046457987278699875, -0.03344039246439934, 0.009354660287499428, -0.023917673155665398, 0.07241213321685791, -0.05667330324649811, 0.012301481328904629, -0.020276034250855446, -0.054128795862197876, -0.004290479701012373, 0.018986240029335022, 0.0076298960484564304, 0.05596494302153587, 0.00024804947315715253, 0.028321417048573494, 0.02733875997364521, 0.007436877116560936, 0.0017666886560618877, -0.033237338066101074, 0.022512711584568024, 0.03760552406311035, 0.0573975071310997, 0.04332464188337326, -0.06243094429373741, -0.07820035517215729, 0.0269607063382864, 0.023824719712138176, 0.008596660569310188, 0.0246119424700737, -0.013292466290295124, 0.061037179082632065, -0.05026281625032425, -0.03024391084909439, -0.04958132654428482, -0.05013424903154373, -0.03304915130138397, -0.09276832640171051, -0.0063573750667274, 0.002803198294714093, -0.015705352649092674, 0.04428776353597641, -0.0742022693157196, -0.010881246998906136, -0.0013333914102986455, 0.0643802359700203, -0.019422031939029694, 0.04081067442893982, 0.00691958935931325, 0.06717659533023834, -0.023497868329286575, 0.00970116164535284, -0.041089121252298355, 0.00018724077381193638, 0.03180642053484917, 0.021739177405834198, -0.020464539527893066, 0.03015872649848461, 0.012818785384297371, 0.02296868897974491, 0.002681286307051778, 0.02054157294332981, -0.06318259984254837, -0.05296844244003296, -0.03129856288433075, -0.00032290120725519955, 0.031324177980422974, -0.046070557087659836, 0.052070654928684235, 0.044560063630342484, 0.022265497595071793, -0.010595710016787052, 
-0.0076601216569542885, 0.03914617747068405, -0.06517999619245529, 0.009003976359963417, 0.037763383239507675, 0.04426691681146622, -0.04967082664370537, 0.03488786518573761, -0.0023707314394414425, -0.025164417922496796, 0.007108941208571196, 0.018469765782356262, -0.025784898549318314, -0.020430605858564377, -0.016951890662312508, 0.013305514119565487, 0.00019533354497980326, 0.007345386315137148, -0.01385069265961647, -0.02852608822286129, 0.03613980486989021, 0.02701202593743801, 0.006803820375353098, 0.03578433021903038, 0.0031786030158400536, 0.05497261881828308, -0.0012973906705155969, -0.03850801661610603, 0.033493608236312866, -0.01736755296587944, -0.07046686857938766, -0.05403686687350273, -0.01568000204861164, -0.029526911675930023, 0.014160181395709515, -0.017206666991114616, -0.04248189553618431, -0.03234652802348137, -0.009299954399466515, -0.010609281249344349, -0.03935691714286804, -0.044622257351875305, 0.005447396542876959, -0.03771286457777023, 0.017767149955034256, -0.035144198685884476, -0.03216921165585518, -0.04316001012921333, -0.036992549896240234, 0.016038203611969948, 0.0007581572281196713, -0.03716326132416725, 0.042677853256464005, 0.04566049203276634, -5.764169691246934e-05, 0.04473340883851051, 0.07298982888460159, 0.05190373957157135, 0.036170922219753265, -0.03619106113910675, -0.04360021650791168, 0.024645091965794563, -8.833556057652459e-05, 0.01749955676496029, 0.06589918583631516, 0.003319535404443741, -0.06444893032312393, -0.003730597672984004, 0.04214875027537346, 0.04345645755529404, 0.029554884880781174, -0.004199139308184385, 0.020024804398417473, -0.04448840767145157, 0.01888960413634777, -0.033957596868276596, -0.047766074538230896, 0.02224310114979744, 0.00287211243994534, -0.02852497808635235, 0.023138755932450294, 0.023081496357917786, -0.043730780482292175, 0.04775260016322136, 0.0416734516620636, -0.0204816535115242, -0.0038233872037380934, 0.0026284276973456144, -0.019669413566589355, 0.03813127055764198, 
0.023951388895511627, 0.05076836794614792, 0.03729300573468208, -0.03147077187895775, 0.09346741437911987, 0.022042548283934593, 0.007527607958763838, 0.013238240033388138, 0.018599536269903183, -0.02439536713063717, -0.09934971481561661, 0.00038976085488684475, 0.023135682567954063, -0.014683248475193977, -0.042394693940877914, -0.01612926833331585, -0.04001915454864502, -0.05577804893255234, 0.06073528900742531, -0.036210447549819946, 0.005706870928406715, 0.02511626109480858, -0.03735749423503876, 0.01732243038713932, -0.02994532138109207, 0.021044183522462845, 0.03996942192316055, -0.07260077446699142, -0.003790860064327717, 0.02422136813402176, 0.01763824373483658, 0.030176790431141853, -0.04294665902853012, -0.002991731045767665, -0.021065589040517807, -0.001003108685836196, -0.0013614860363304615, 0.0711933895945549, -0.020256511867046356, -0.030732285231351852, 0.0003993701539002359, -0.029878413304686546, -0.006858566775918007, -0.06644973903894424, 0.022493304684758186, 0.030522001907229424, -0.00669897673651576, -0.03639552742242813, 0.0237265657633543, -0.05745626613497734, -0.05254894867539406, -0.03464609757065773, 0.04674195125699043, -0.058115404099226, -0.03645890951156616, -0.02475786581635475, 0.030376499518752098, -0.06819037348031998, 0.01002578902989626, -0.09901266545057297, -0.009885337203741074, 0.04081396386027336, 0.061137229204177856, -0.011152742430567741, 0.015374809503555298, 0.09466422349214554, 0.016260482370853424, 0.01881936937570572, -0.015438860282301903, 0.0006528745871037245, -0.010997870936989784, -0.026573114097118378, -0.04482477903366089, 0.03733973950147629, -0.04242299124598503, -0.01920631155371666, 0.04447663202881813, 0.06526459008455276, 0.00809468049556017, -0.04889126494526863, 0.0548531599342823, -0.040250055491924286, 0.016282645985484123, -0.07659725099802017, 0.04355869069695473, 0.04914768785238266, -0.04788849875330925, 7.913843001006171e-05, 0.05811683461070061, -0.007543399930000305, 0.026660067960619926, 
-0.01351650059223175, 0.00044786432408727705, -0.017642082646489143, -0.04148491472005844, 0.01597125455737114, -0.057059723883867264, 0.04379284381866455, -0.006355058401823044, 0.03821885213255882, 0.030845211818814278, -0.0368330292403698, -0.004034117329865694, -0.04146667942404747, 0.027249377220869064, -0.09593617916107178, 0.02602269873023033, -0.08165067434310913, -0.04826661944389343, 0.0713566243648529, 0.03227315470576286, -0.03073509968817234, 0.024546029046177864, -0.01211515162140131, -0.025128556415438652, 0.05684630572795868, -0.0337102934718132, -0.02875952050089836, -0.031147180125117302, -0.022409671917557716, -0.0027325262781232595, 0.01671760529279709, -0.06551449000835419, 0.022235384210944176, 0.0028853323310613632, -0.02344406768679619, 0.027678214013576508, -0.010283753275871277, -0.023852944374084473, -0.029935011640191078, -0.007409980520606041, -0.027347300201654434, 0.055852387100458145, -0.014667457900941372, 0.007145185023546219, -0.028462117537856102, 0.017623715102672577, -0.043297503143548965, -0.03319935128092766, 0.024013701826334, 0.009711435064673424, 0.09195603430271149, 0.01316817756742239, -0.03712339326739311, 0.00652801850810647, 0.04394783079624176, -0.009103711694478989, 0.07693421840667725, -0.03728241100907326, -0.036612506955862045, 0.06281452625989914, 0.021622080355882645, 0.04757705703377724, -0.011499015614390373, 0.026834610849618912, -0.051813755184412, 0.06825681775808334, 0.025271106511354446, -0.03379122167825699, 0.039778634905815125, -0.03281019255518913, 0.030888734385371208, -0.007114924490451813, -0.0332091823220253, 0.010099238716065884, 0.042176056653261185, 0.0028638383373618126, 0.03224765881896019, -0.04734206572175026, 0.007288982160389423, -0.0421617291867733, 0.03613709285855293, -0.000992881366983056, -0.022461652755737305, 0.05475684627890587, 0.03514552116394043, 0.0022920172195881605, 0.03150108456611633, 0.05843281000852585, -0.006464590784162283, -0.0758441761136055, 0.023868372663855553, 
-0.02282830886542797, 0.008000489324331284, 0.020025134086608887, 0.0016176658682525158, -0.010252195410430431, 0.001220570644363761, 0.04033764824271202, -0.014924971386790276, 0.04098896309733391, -0.047514867037534714, 0.03120553307235241, 0.018053054809570312, -0.022937675938010216, 0.018179243430495262, -0.0677308589220047, -0.003150759031996131, 0.0048280395567417145, 0.02251225896179676, -0.017873728647828102, -0.008214901201426983, -0.010113192722201347, -0.015560148283839226, 0.0011726523516699672, 0.0011017024517059326, 0.016179660335183144, 0.0252061914652586, 0.02547972835600376, -0.03040494956076145, 0.02530280500650406, 0.024554390460252762, -0.009776907041668892, 0.03453369438648224, 0.004147632513195276, -0.04515322297811508, -0.003012153087183833, -0.05862865969538689, 0.031080003827810287, -0.022502155974507332, 0.02556418627500534, 0.020866844803094864, 0.03025103360414505, 0.02314906008541584, -0.017622116953134537, 0.02844931371510029, -0.04355978965759277, 0.02992424927651882, 0.0109980758279562, 0.021472835913300514, -0.09483753889799118, 0.028610335662961006, 0.013455747626721859, -0.01091096829622984, -0.002290438860654831, -0.005865796003490686, -0.029927939176559448, 0.02775508724153042, -0.00056908541591838, 0.04248350113630295, -0.06784949451684952, -0.00018667985568754375, -0.00832618959248066, 0.0348394401371479, 0.008433294482529163, 0.034615129232406616, 0.05789554864168167, 0.005664119031280279, 0.052489448338747025, 0.02391180209815502, -0.016532940790057182, 0.015048651956021786, -0.006456727162003517, -0.024786755442619324, 0.034554898738861084, 0.02075093984603882, 0.0009807961760088801, -0.07601477205753326, 0.0644899383187294, 0.05962175875902176, -0.02689405530691147, -0.03937731683254242, 0.017068276181817055, 0.004893966019153595, -0.02995367906987667, -0.005550507456064224, 0.03443741425871849, -0.020432276651263237, -0.08794093132019043, 0.06543774902820587, 0.011440999805927277, -0.03031259775161743, 
-0.03527821600437164, -0.05504462122917175, 0.053674280643463135, -0.044080063700675964, 0.024201612919569016, 0.01520184613764286, 0.012106197886168957, 0.056542135775089264, 0.005991669837385416, 0.09383217245340347, -0.02979951910674572, -0.0019963241647928953, -0.01121876947581768, 0.03557722643017769, 0.0020818605553358793, -0.01442661788314581, 0.00740943755954504, 0.022544562816619873, -0.05304212123155594, -0.035392265766859055, 0.004865722265094519, -0.03660944476723671, -0.039198581129312515, 0.012140045873820782, -0.09377826750278473, 0.020345594733953476, -0.04734979569911957, 0.03352127596735954, -0.02157255820930004, 0.020762966945767403, -0.03899162635207176, 0.062051910907030106, -0.07257862389087677, -0.05139090120792389, 0.0370694138109684, 0.007715103216469288, -0.021716443821787834, 0.042889390140771866, -0.007739713415503502, -0.03160223364830017, -0.026067491620779037, -0.004564320668578148, -0.023110520094633102, 0.017302772030234337, -0.05635445937514305, -0.0017279192106798291, -0.013120061717927456, 0.026750091463327408, -0.016840871423482895, 0.05423123762011528, -0.002599692903459072, -0.014072054997086525, 0.01782345026731491, 0.0008663723710924387, 0.028620483353734016, 0.021142292767763138, 0.02623216062784195, 0.042014867067337036, -0.027157656848430634, 0.012177089229226112, -0.043139971792697906, 0.012291813269257545, 0.023679694160819054, -0.0015785921132192016, -0.0033968219067901373, 0.015060486271977425, -0.02665022760629654, -0.019371267408132553, 0.04571299999952316, -0.07143539190292358, 0.054651323705911636, 0.0263277068734169, 0.008455408737063408, 0.014058727771043777, 0.04485484957695007, -0.03883916512131691, -0.026818199083209038, 0.051357246935367584, -0.024137666448950768, -0.022996705025434494, -0.022313067689538002, -0.024908430874347687, 0.0039613088592886925, 0.061495065689086914, 0.013638702221214771, 0.02657528780400753, -0.02323933131992817, -0.21407051384449005, -0.00044733553659170866, -0.03422166407108307, 
-0.007491130847483873, 0.036139700561761856, 0.014372160658240318, 0.026507975533604622, 0.0029908879660069942, 0.03003745898604393, 0.0013645648723468184, 0.05024179443717003, -0.04764833301305771, 0.007673018611967564, -0.03721681237220764, -0.03935578092932701, -0.0024473911616951227, -0.007290615700185299, -0.05637887120246887, 0.00839980784803629, 0.005282987374812365, -0.016281703487038612, 0.050302356481552124, -0.03956905007362366, 0.010644330643117428, 0.0022109716664999723, -0.01152209471911192, -0.04428955540060997, 0.027794253081083298, 0.00593393761664629, 0.02490459382534027, -6.793269130866975e-05, -0.0022120687644928694, -2.1457604816532694e-05, 0.018745090812444687, -0.013560828752815723, -0.02915852516889572, -0.0003414980601519346, 0.010818613693118095, -0.0907197892665863, 0.0472375713288784, 0.026492193341255188, 0.0016760677099227905, -0.04764818772673607, 0.01590362936258316, -0.01775454916059971, 0.055174268782138824, 0.05076879635453224, 0.01877765730023384, 0.010465260595083237, -0.031460367143154144, 0.026567967608571053, 0.022445503622293472, -0.08158332854509354, 0.012405652552843094, 0.010762711986899376, 0.04833889380097389, 0.0588943213224411, 0.026579998433589935, 0.02594783343374729, -0.007129158359020948, 0.05389690026640892, 0.029877452179789543, -0.05216946080327034, 0.0036160433664917946, -0.01339024119079113, 0.010822700336575508, 0.06716179102659225, 0.004739412106573582, -0.03011951968073845, 0.02312459424138069, 0.03374551981687546, -0.04297107085585594, -0.021124770864844322, -0.009261753410100937, 0.0338602289557457, -0.02790028229355812, -0.006365458481013775, -0.0377344973385334, -0.05805971845984459, 0.01592940464615822, -0.04100913554430008, -0.040288347750902176, 0.0034218986984342337, 0.02328554354608059, -0.01571032404899597, -0.027518007904291153, -0.014712312258780003, -0.004379367455840111, 0.017573729157447815, 0.06509513407945633, 0.009546909481287003, -0.03422899171710014, 0.01643945276737213, 
-0.005541112273931503, 0.019336624071002007, -0.03573956340551376, -0.026652837172150612, 0.057633690536022186, 0.019588235765695572, -0.010759882628917694, -0.003146431175991893, 0.003788457252085209, -0.04164247214794159, 0.035790394991636276, 0.03770679607987404, -0.05559611693024635, -0.04488547518849373, 0.04360348358750343, -0.011637243442237377, -0.035134635865688324, 0.028055207803845406, 0.06429721415042877, 0.06059931963682175, -0.04355704411864281, 0.01206795871257782, -0.0164168868213892, 0.03133976459503174, -0.016833046451210976, -0.05125684291124344, -0.06336405873298645, -0.03995782136917114, -0.01585494726896286, 0.0482410304248333, 0.009308689273893833, 0.008267953060567379, -0.033630210906267166, -0.039584480226039886, 0.010770360939204693, -0.028853224590420723, -0.007633221335709095, -0.04107866436243057, -0.055370673537254333, 0.022609995678067207, -0.011715376749634743, -0.0030148657970130444, 0.00966445542871952, 0.01956905797123909, 0.011797575280070305, 0.03743590787053108, 0.021945560351014137, 0.06499559432268143, -0.04801178351044655, -0.021771740168333054, 0.03736365586519241, -0.05418664962053299, -0.006423573475331068, 0.01130286231637001, -0.009217603132128716, 0.019941752776503563, -0.014777987264096737, -0.015072721987962723, -0.10496288537979126, -0.009849184192717075, -0.009301090613007545, -0.024059738963842392, 0.028072401881217957, 0.02990454062819481, 0.04299801215529442, -0.022248119115829468, 0.013482044450938702, -0.001538264099508524, -0.04709165543317795, 0.05298429727554321, 0.009463155642151833, 0.01461552083492279, 0.005833548028022051, -0.011023236438632011, -0.0495406799018383, 0.022273892536759377, 0.020157495513558388, -0.04642893373966217, 0.00888157170265913, 0.010884840972721577, 0.03579328954219818, 0.04011231288313866, -0.015362519770860672, 0.04041285440325737, -0.034518055617809296, -0.037399549037218094, 0.003966163378208876]}

In [57]:
from argilla.records._mapping import IngestedRecordMapper

record_mapper = IngestedRecordMapper(mapping=hfmapping, dataset=dataset, user_id=None)

In [None]:
record_mapper(data=record_dict)

In [None]:
record_mapper._map_suggestions(data=record_dict, mapping=record_mapper.mapping.suggestion)

In [None]:
record_mapper._map_responses(data=record_dict, user_id=record_mapper.user_id, mapping=record_mapper.mapping.response)

In [None]:
record_mapper._map_attributes(data=record_dict, mapping=record_mapper.mapping.field)

In [None]:
record_mapper._map_attributes(data=record_dict, mapping=record_mapper.mapping.metadata)

In [None]:
record_mapper._map_attributes(data=record_dict, mapping=record_mapper.mapping.vector)

In [None]:
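# Standalone replay of the attribute-mapping loop, to see which source columns are picked up
data = record_dict
attributes = {}
for name, route in record_mapper.mapping.field.items():
    print(name, route.source)
    # Skip fields whose source column is missing or empty in the record
    if route.source not in data:
        continue
    value = data.get(route.source)
    if value is None:
        continue
    attributes[name] = value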

### Adjust dataset settings

In [73]:
dataset.settings.questions[0]

MultiLabelQuestion(name=contenttype, title=Does the web page include any of these content types?, description=None, type=multi_label_selection, required=True) 

In [76]:
from argilla.settings import LabelQuestion

question2 = LabelQuestion(name="domain", title="What is the business domain of the web page?", labels=["Bank", "Insurance", "Other"], required=True)

dataset.settings.questions.add(question2)

LabelQuestion(name=domain, title=What is the business domain of the web page?, description=None, type=label_selection, required=True) 

In [None]:
dataset.settings.questions._create_question(question2)

**UnprocessableEntityError: Argilla SDK error: UnprocessableEntityError: Unprocessable entity. The server cannot process the request. Details: {"detail":"questions cannot be created for a published dataset"}**

=> As the server-side validator quoted below shows, questions cannot be created or updated once an Argilla dataset has been published.

=> The right approach is to create successive versions of the dataset and evolve the annotation schema with each iteration.

https://github.com/argilla-io/argilla/blob/b7ac946af610a663b48e01007bc6b31955fc0b2a/argilla-server/src/argilla_server/validators/questions.py#L38

```python
class QuestionCreateValidator:

    @staticmethod
    def _validate_dataset_is_not_ready(dataset):
        if dataset.is_ready:
            raise UnprocessableEntityError("questions cannot be created for a published dataset")
```

https://github.com/argilla-io/argilla/blob/develop/argilla/src/argilla/settings/_resource.py

```python
class QuestionsProperties(SettingsProperties[QuestionType]):
    """
    This class is used to align questions with the rest of the settings.

    Since questions are not aligned with the Resource class definition, we use this
    class to work with questions as we do with fields, vectors, or metadata (specially when creating questions).

    Once issue https://github.com/argilla-io/argilla/issues/4931 is tackled, this class should be removed.
    """
```

https://github.com/argilla-io/argilla/issues/4931

Issue title: "[REFACTOR] Improve handling of question models and dicts". On Jun 5, burtenshaw modified its milestones (v2.0.0, v2.1.0).

https://github.com/argilla-io/argilla/milestone/48

Milestone status when this notebook was written: no due date, 7% complete.
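Since questions are frozen once a dataset is published, one way to iterate is to put a version suffix in the dataset name, as with `banque-fr-2311-v1` further down. The helper below is a minimal sketch of that convention; `next_version_name` and the `-vN` suffix format are assumptions made for illustration and are not part of the Argilla SDK.

```python
import re


def next_version_name(name: str) -> str:
    """Return the dataset name with its trailing -vN suffix incremented.

    A name without a version suffix gets "-v1" appended.
    The "-vN" naming convention is an assumption of this sketch.
    """
    match = re.fullmatch(r"(.*)-v(\d+)", name)
    if match is None:
        return f"{name}-v1"
    base, version = match.group(1), int(match.group(2))
    return f"{base}-v{version + 1}"


print(next_version_name("banque-fr-2311"))     # banque-fr-2311-v1
print(next_version_name("banque-fr-2311-v1"))  # banque-fr-2311-v2
```

The resulting name can then be passed to `rg.Dataset(name=..., workspace=..., settings=...)` together with the updated settings, followed by `dataset.create()`, as in the other cells of this notebook.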

### Restart with a smaller sample

Look for a sensible way to sample the Hugging Face dataset: the subset should be small but still cover every source website.

In [84]:
sampled_hf_dataset = hf_dataset["train"]

First shuffle the dataset

In [85]:
sampled_hf_dataset = sampled_hf_dataset.shuffle(seed=42)

Then count the number of source websites

In [91]:
websites = sampled_hf_dataset.unique("Website")
len(websites)

42

For a first annotation round, we want roughly 500 samples => take up to 15 examples per website (42 × 15 = 630 at most)

In [95]:
websites_indexes = {}
for website in websites:
    websites_indexes[website] = []

for idx,example in enumerate(sampled_hf_dataset):
    indexes = websites_indexes[example["Website"]]
    if len(indexes)<15:
        indexes.append(idx)

In [102]:
sampled_indexes = []
for website,indexes in websites_indexes.items():
    sampled_indexes.extend(indexes)
sampled_indexes = sorted(sampled_indexes)

In [105]:
sampled_hf_dataset = sampled_hf_dataset.select(sampled_indexes)

In [106]:
len(sampled_hf_dataset)

630
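The capped per-website selection above can be written as a small reusable function. This is a sketch on plain Python dictionaries, not on a `datasets.Dataset`; the function name `sample_indexes_per_group` and the toy rows are hypothetical.

```python
from collections import defaultdict


def sample_indexes_per_group(rows, key, max_per_group):
    """Return sorted indexes keeping at most max_per_group rows for each value of key."""
    kept = defaultdict(list)
    for idx, row in enumerate(rows):
        bucket = kept[row[key]]
        if len(bucket) < max_per_group:
            bucket.append(idx)
    return sorted(idx for bucket in kept.values() for idx in bucket)


# Toy rows standing in for the shuffled dataset (hypothetical values)
rows = [
    {"Website": "a.fr"}, {"Website": "b.fr"}, {"Website": "a.fr"},
    {"Website": "a.fr"}, {"Website": "b.fr"}, {"Website": "c.fr"},
]
print(sample_indexes_per_group(rows, key="Website", max_per_group=2))
# [0, 1, 2, 4, 5]
```

The cap is only an upper bound: a website with fewer rows than the cap contributes all of them. Here the result is exactly 630 = 42 × 15, which means every website had at least 15 pages.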

Delete the first, larger dataset from the Argilla server

In [107]:
dataset.delete()

Re-create a smaller dataset with a more open annotation scheme: free-text questions instead of fixed label lists

In [110]:
settings.questions = [
    rg.TextQuestion(
        name="pagetype",
        title="What is the page type?",
        required=True,
        use_markdown=False
    ),
    rg.TextQuestion(
        name="pagedomain",
        title="What is the page domain?",
        required=True,
        use_markdown=False
    ),
    rg.TextQuestion(
        name="entitiestype",
        title="What are the types of the named entities? (separated by ,)",
        required=False,
        use_markdown=False
    ),
    rg.TextQuestion(
        name="webscrapingerrors",
        title="Are there text errors introduced by web scraping? (separated by ,)",
        required=False,
        use_markdown=False
    )
]

In [111]:
dataset = rg.Dataset(
    name="banque-fr-2311-v1",
    workspace="argilla",
    settings=settings,
)

dataset.create()

Dataset(id=UUID('9f5f5739-c475-4069-ba09-f4633faad67a') inserted_at=datetime.datetime(2024, 8, 4, 15, 30, 48, 486054) updated_at=datetime.datetime(2024, 8, 4, 15, 30, 49, 241525) name='banque-fr-2311-v1' status='ready' guidelines='Explore french banking websites dataset - date' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('485e94ed-4e72-496c-9d92-756baea3882d') last_activity_at=datetime.datetime(2024, 8, 4, 15, 30, 49, 241525))

In [113]:
dataset.records.log(records=sampled_hf_dataset, mapping=hfmapping)

Sending records...: 3batch [00:10,  3.53s/batch]

DatasetRecords(Dataset(id=UUID('9f5f5739-c475-4069-ba09-f4633faad67a') inserted_at=datetime.datetime(2024, 8, 4, 15, 30, 48, 486054) updated_at=datetime.datetime(2024, 8, 4, 15, 30, 49, 241525) name='banque-fr-2311-v1' status='ready' guidelines='Explore french banking websites dataset - date' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('485e94ed-4e72-496c-9d92-756baea3882d') last_activity_at=datetime.datetime(2024, 8, 4, 15, 30, 49, 241525)))

Now do a first manual round of annotation
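Once this round is done, the comma-separated free-text answers can be tallied to see which labels come up often enough to become fixed choices in the next dataset version. The sketch below works on a plain list of answer strings; how the answers are exported from Argilla is left out, and both `count_labels` and the sample answers are hypothetical.

```python
from collections import Counter


def count_labels(answers):
    """Count normalized labels found in comma-separated free-text answers."""
    counts = Counter()
    for answer in answers:
        for label in answer.split(","):
            # Normalize case and whitespace so "Bank" and " bank" count together
            label = label.strip().lower()
            if label:
                counts[label] += 1
    return counts


# Hypothetical answers to the "entitiestype" question
answers = ["Bank, Product", "product,  Person", "", "bank"]
print(count_labels(answers).most_common())
```

Lowercasing and stripping is a deliberately simple normalization; spelling variants or synonyms would still need to be merged by hand.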