
# Import every dataset

In [1]:
from google.colab import files
uploaded = files.upload()

import pandas as pd
from PIL import Image
from torchvision import transforms
import numpy as np

Saving [MNLP 2025 HW1] train set [PUBLIC] - train_cleaned.tsv to [MNLP 2025 HW1] train set [PUBLIC] - train_cleaned.tsv
Saving train_df_country.json to train_df_country.json
Saving train_df_dates.json to train_df_dates.json
Saving train_df_descr_analyse.json to train_df_descr_analyse.json
Saving train_df_images.json to train_df_images.json
Saving train_df_lang.json to train_df_lang.json
Saving trainset_subclass_instance.json to trainset_subclass_instance.json


In [2]:
train_df = pd.read_csv('[MNLP 2025 HW1] train set [PUBLIC] - train_cleaned.tsv', sep='\t')

train_df_subclass = pd.read_json('trainset_subclass_instance.json', orient="records", lines=True)
train_df_images = pd.read_json('train_df_images.json', orient="records", lines=True)
train_df_dates = pd.read_json('train_df_dates.json', orient="records", lines=True)
train_df_lang = pd.read_json('train_df_lang.json', orient="records", lines=True)
train_df_country = pd.read_json('train_df_country.json', orient="records", lines=True)
train_df_descr = pd.read_json('train_df_descr_analyse.json', orient="records", lines=True)

# Merge all the datasets

In [3]:
train_df.columns.to_list()

['item', 'name', 'description', 'type', 'category', 'subcategory', 'label']

In [4]:
# Get the dataset we first had
initial_df = train_df
initial_variables = train_df.columns.to_list()

# Get the datasets with the features we created
df_to_merge = [train_df_subclass, train_df_images, train_df_dates, train_df_lang, train_df_country, train_df_descr]

# Merge all of them, in a single dataframe
for df in df_to_merge:
  initial_df = pd.merge(left=initial_df, right=df, on=initial_variables, how='inner')

# Visualisation :
initial_df.head()

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,image,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event
0,http://www.wikidata.org/entity/Q306,Sebastián Piñera,Chilean entrepreneur and politician (1949–2024),entity,politics,politician,cultural exclusive,0,1,1,"[[[85, 93, 69], [64, 70, 47], [50, 55, 39], [1...",1949.0,126.0,Q298,2,0,1,0,1,0
1,http://www.wikidata.org/entity/Q12735,John Amos Comenius,"Czech teacher, educator, philosopher and write...",entity,politics,politician,cultural representative,0,1,1,"[[[33, 31, 17], [35, 33, 19], [32, 31, 19], [2...",1592.0,72.0,Q153136,4,0,1,0,1,0
2,http://www.wikidata.org/entity/Q1752,Macrinus,Roman emperor from 217 to 218,entity,politics,politician,cultural representative,0,1,1,"[[[109, 110, 103], [113, 115, 107], [117, 118,...",165.0,83.0,Q1747689,1,0,1,0,1,0
3,http://www.wikidata.org/entity/Q1639,Lamine Diack,Senegalese sports manager (1933–2021),entity,politics,politician,cultural representative,0,1,1,"[[[136, 91, 41], [123, 74, 26], [169, 134, 89]...",1933.0,42.0,Q1041,2,0,1,0,1,0
4,http://www.wikidata.org/entity/Q9588,Richard Nixon,President of the United States from 1969 to 1974,entity,politics,politician,cultural representative,0,1,1,"[[[106, 88, 76], [114, 96, 84], [115, 96, 84],...",1913.0,174.0,Q30,0,0,0,1,0,0
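
Because the loop above uses `how='inner'`, any row whose key columns do not match exactly in one of the feature tables is silently dropped, and duplicated keys would multiply rows. A minimal sketch of how this could be checked, on toy data (the frames `base` and `feature` and their values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for the base table and one feature table (hypothetical data).
base = pd.DataFrame({"item": ["Q1", "Q2", "Q3"], "name": ["a", "b", "c"]})
feature = pd.DataFrame({"item": ["Q1", "Q2"], "name": ["a", "b"], "nb_lang": [10, 20]})

keys = ["item", "name"]

# validate="one_to_one" raises a MergeError if a key combination is duplicated on either side.
# how="left" with indicator=True keeps every base row and records where each row was found.
merged = pd.merge(base, feature, on=keys, how="left", validate="one_to_one", indicator=True)

# Rows present only in the base table are the ones an inner join would have dropped.
unmatched = merged.loc[merged["_merge"] == "left_only", "item"].tolist()
print(unmatched)
```

If `unmatched` is empty for every feature table, the inner merge and the left merge return the same rows.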


# Cleaning

## Labels

In [5]:
initial_df['label'].describe()

Unnamed: 0,label
count,6244
unique,9
top,cultural exclusive
freq,2685


In [6]:
initial_df['label'].unique()

array(['cultural exclusive', 'cultural representative',
       'cultural agnostic', 'cultural', 'cult', nan, 'cultural agn',
       'cultural represent', 'cultural ex', 'cultural ag'], dtype=object)

In [7]:
initial_df[['item', 'label']].groupby("label").count()

Unnamed: 0_level_0,item
label,Unnamed: 1_level_1
cult,1
cultural,5
cultural ag,1
cultural agn,3
cultural agnostic,1862
cultural ex,1
cultural exclusive,2685
cultural represent,1
cultural representative,1685


We are supposed to have only 3 different labels:

- Cultural Exclusive (C.E.)
- Cultural Agnostic (C.A.)
- Cultural Representative (C.R.)

Instead, the dataset contains 9 distinct label strings, plus some NaN values. We therefore have to clean the data so that only the 3 expected labels remain. The solution we propose has two steps.

First step:

We separate the 3 labels we want (C.A., C.E., C.R.) from the labels that are truncated versions of them ('cultural agn', 'cultural represent', 'cultural ex', 'cultural ag'). Each truncated label is a prefix of exactly one expected label, so we propose the following mapping:

- 'cultural agn' : C.A.
- 'cultural represent' : C.R.
- 'cultural ex' : C.E.
- 'cultural ag' : C.A.
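
The mapping above can also be seen as a prefix rule: a label is repaired only when it is a prefix of exactly one expected label. A minimal sketch of that idea, on a toy Series (the helper `normalize_label` and the constant `CANONICAL` are hypothetical names, not part of the notebook):

```python
import pandas as pd

CANONICAL = ["cultural agnostic", "cultural exclusive", "cultural representative"]

def normalize_label(label):
    """Return the expected label that `label` is an unambiguous prefix of, else None."""
    if not isinstance(label, str):
        return None  # NaN or any other non-string value
    prefix = label.strip().lower()
    matches = [c for c in CANONICAL if c.startswith(prefix)]
    # Zero or several matches means the label cannot be repaired automatically.
    return matches[0] if len(matches) == 1 else None

labels = pd.Series(["cultural agn", "cultural ex", "cultural", "cult", None, "cultural agnostic"])
normalized = labels.map(normalize_label)
print(normalized.isna().tolist())
```

With this rule, 'cultural' and 'cult' stay unresolved because they are prefixes of all three expected labels.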

Second step:

The other labels ('cultural', 'cult', NaN) are compatible with all three expected labels, so they cannot be mapped directly. For them, we propose to apply the rule that was defined to build the dataset (see https://huggingface.co/datasets/sapienzanlp/nlp2025_hw1_cultural_dataset): ask ChatGPT o3.

First step :

In [8]:
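# .copy() gives an independent frame, so the label column can be reassigned without a SettingWithCopyWarning
to_correct = initial_df[initial_df["label"].isin(['cultural agn','cultural represent', 'cultural ex', 'cultural ag'])][['item', 'label']].copy()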
to_correct

Unnamed: 0,item,label
1545,http://www.wikidata.org/entity/Q86135347,cultural agn
3698,http://www.wikidata.org/entity/Q509900,cultural represent
4391,http://www.wikidata.org/entity/Q643677,cultural ex
4819,http://www.wikidata.org/entity/Q1711593,cultural agn
5731,http://www.wikidata.org/entity/Q25618,cultural ag
5968,http://www.wikidata.org/entity/Q30405,cultural agn


In [9]:
mapping_correction = {
  'cultural agn' : 'cultural agnostic',
  'cultural represent' : 'cultural representative',
  'cultural ex' : 'cultural exclusive',
  'cultural ag' : 'cultural agnostic'
}

In [10]:
to_correct['label'] = to_correct['label'].map(mapping_correction)

In [11]:
# Apply the corrected labels to the initial dataset (rows are aligned on 'item')
initial_df.set_index('item', inplace=True)
to_correct.set_index('item', inplace=True)

initial_df.update(to_correct)
initial_df.reset_index(inplace=True)


# Check that the updates are correct
initial_df[1543:1547]

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,image,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event
1543,http://www.wikidata.org/entity/Q27503001,professional athlete,person who earns their living from sports,concept,sports,athlete,cultural agnostic,0,1,9,,,17.0,,3,1,0,0,0,0
1544,http://www.wikidata.org/entity/Q107690317,competition climber,climber who competes in IFSC and Olympic climb...,concept,sports,athlete,cultural agnostic,0,3,9,"[[[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 2], ...",,14.0,,2,2,1,0,0,1
1545,http://www.wikidata.org/entity/Q86135347,climber,person who practices climbing,concept,sports,athlete,cultural agnostic,2,1,9,,,16.0,,1,2,0,0,0,0
1546,http://www.wikidata.org/entity/Q28971125,cheerleader,performer who leads crowd support at sports ev...,concept,sports,athlete,cultural representative,0,1,9,,,33.0,,9,2,0,0,0,0


Second step :

Here are the items that ChatGPT has to classify:

In [12]:
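# .copy() gives an independent frame, so a label column can be added without a SettingWithCopyWarning
to_classify = initial_df[(initial_df["label"].isin(["cultural", "cult"])) | (initial_df["label"].isna())][['item', 'name', 'description']].copy()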

to_classify

Unnamed: 0,item,name,description
858,http://www.wikidata.org/entity/Q957033,Sunny Baudelaire,fictional character
1066,http://www.wikidata.org/entity/Q30327019,post and lintel,building system where horizontal elements (bea...
1214,http://www.wikidata.org/entity/Q811361,architectural glass,building material typically used as transparen...
1651,http://www.wikidata.org/entity/Q12014207,trekking,backpacking or hiking
2556,http://www.wikidata.org/entity/Q257907,Ethiopian movement,religious movement in southern Africa
2831,http://www.wikidata.org/entity/Q1136336,Costa Book Awards,annual series of literary awards in five categ...
3236,http://www.wikidata.org/entity/Q67111,Franz Pfemfert,German journalist (1879-1954)
3820,http://www.wikidata.org/entity/Q1089672,The Elm-Chanted Forest,1986 animated film directed by Milan Blažeković
4853,http://www.wikidata.org/entity/Q85755629,Daniel Airlie,novel
5566,http://www.wikidata.org/entity/Q206912,extremophile,organisms specifically adapted to live and sur...


Here is the classification returned by ChatGPT:

In [13]:
categories = {
    'http://www.wikidata.org/entity/Q957033': 'cultural representative',
    'http://www.wikidata.org/entity/Q30327019': 'cultural agnostic',
    'http://www.wikidata.org/entity/Q811361': 'cultural agnostic',
    'http://www.wikidata.org/entity/Q12014207': 'cultural agnostic',
    'http://www.wikidata.org/entity/Q257907': 'cultural exclusive',
    'http://www.wikidata.org/entity/Q1136336': 'cultural representative',
    'http://www.wikidata.org/entity/Q67111': 'cultural representative',
    'http://www.wikidata.org/entity/Q1089672': 'cultural representative',
    'http://www.wikidata.org/entity/Q85755629': 'cultural representative',
    'http://www.wikidata.org/entity/Q206912': 'cultural agnostic',
    'http://www.wikidata.org/entity/Q23228': 'cultural agnostic',
    'http://www.wikidata.org/entity/Q3196604': 'cultural representative',
    'http://www.wikidata.org/entity/Q1940624': 'cultural representative'
}

to_classify['label'] = to_classify['item'].map(categories)

In [14]:
# Apply the new labels to the initial dataset (rows are aligned on 'item')
initial_df.set_index('item', inplace=True)
to_classify.set_index('item', inplace=True)

initial_df.update(to_classify)
initial_df.reset_index(inplace=True)


# Check that the updates are correct
initial_df[856:860]

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,image,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event
856,http://www.wikidata.org/entity/Q929866,Santi Santamaria,Spanish chef (1957–2011),entity,food,cook,cultural exclusive,0,1,1,"[[[3, 3, 5], [3, 3, 5], [3, 3, 5], [3, 3, 5], ...",1957.0,41.0,Q29,1,0,1,0,1,0
857,http://www.wikidata.org/entity/Q899011,Ian Beale,fictional character from the soap opera EastEn...,entity,food,cook,cultural exclusive,0,2,1,"[[[81, 57, 38], [88, 57, 40], [94, 61, 40], [8...",1968.0,8.0,,3,0,1,0,0,0
858,http://www.wikidata.org/entity/Q957033,Sunny Baudelaire,fictional character,entity,food,cook,cultural representative,0,4,1,,,16.0,,1,0,1,0,0,0
859,http://www.wikidata.org/entity/Q935079,SpongeBob SquarePants,main character from the animated television sh...,entity,food,cook,cultural representative,0,4,1,"[[[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], ...",1986.0,55.0,,3,1,1,0,0,0


Check that only the 3 expected labels remain:

In [15]:
initial_df['label'].describe()

Unnamed: 0,label
count,6251
unique,3
top,cultural exclusive
freq,2687


In [16]:
initial_df['label'].unique()

array(['cultural exclusive', 'cultural representative',
       'cultural agnostic'], dtype=object)

In [17]:
initial_df[['item', 'label']].groupby("label").count()

Unnamed: 0_level_0,item
label,Unnamed: 1_level_1
cultural agnostic,1871
cultural exclusive,2687
cultural representative,1693


## Images

Many items have no image, so we fill in the missing values with white images. We hope that:

- the model will learn to recognize and ignore these white images;
- the absence of an image for an item is itself informative, since it can be correlated with the label;
- the white images act as a form of regularization with noise.

In [18]:
def fill_NaN_image(image):

  """
  If the cell is empty (None or NaN), return a white image of shape (28, 28, 3), i.e. (height, width, channels).
  Otherwise, return the image unchanged.
  """

  if image is None or (isinstance(image, float) and np.isnan(image)):
    image = np.ones((28, 28, 3), dtype=np.uint8) * 255

  return image

In [19]:
initial_df['image'] = initial_df['image'].apply(lambda x: fill_NaN_image(x))
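
Once the cells are filled, the information "this item had no image" is only implicit in the white pixels. A minimal sketch, assuming one wanted to keep that signal as an explicit 0/1 column before filling, in the same spirit as the `main_country_cat` column built later (the column name `has_image` and the helper `is_missing_image` are hypothetical):

```python
import numpy as np
import pandas as pd

def is_missing_image(image):
    """True when the cell holds None or a float NaN rather than pixel data."""
    return image is None or (isinstance(image, float) and np.isnan(image))

# Toy column: one tiny 1x1 RGB image, one None, one NaN.
df = pd.DataFrame({"image": [[[[1, 2, 3]]], None, float("nan")]})

# Record missingness before any filling: 1 if an image is present, 0 otherwise.
df["has_image"] = (~df["image"].apply(is_missing_image)).astype(int)
print(df["has_image"].tolist())
```

The flag has to be computed before the fill, since afterwards every cell contains an array.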

In [20]:
initial_df

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,image,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event
0,http://www.wikidata.org/entity/Q306,Sebastián Piñera,Chilean entrepreneur and politician (1949–2024),entity,politics,politician,cultural exclusive,0,1,1,"[[[85, 93, 69], [64, 70, 47], [50, 55, 39], [1...",1949.0,126.0,Q298,2,0,1,0,1,0
1,http://www.wikidata.org/entity/Q12735,John Amos Comenius,"Czech teacher, educator, philosopher and write...",entity,politics,politician,cultural representative,0,1,1,"[[[33, 31, 17], [35, 33, 19], [32, 31, 19], [2...",1592.0,72.0,Q153136,4,0,1,0,1,0
2,http://www.wikidata.org/entity/Q1752,Macrinus,Roman emperor from 217 to 218,entity,politics,politician,cultural representative,0,1,1,"[[[109, 110, 103], [113, 115, 107], [117, 118,...",165.0,83.0,Q1747689,1,0,1,0,1,0
3,http://www.wikidata.org/entity/Q1639,Lamine Diack,Senegalese sports manager (1933–2021),entity,politics,politician,cultural representative,0,1,1,"[[[136, 91, 41], [123, 74, 26], [169, 134, 89]...",1933.0,42.0,Q1041,2,0,1,0,1,0
4,http://www.wikidata.org/entity/Q9588,Richard Nixon,President of the United States from 1969 to 1974,entity,politics,politician,cultural representative,0,1,1,"[[[106, 88, 76], [114, 96, 84], [115, 96, 84],...",1913.0,174.0,Q30,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6246,http://www.wikidata.org/entity/Q321103,Bühl,"quarter of Tübingen, Baden-Württemberg, Germany",entity,geography,neighborhood,cultural exclusive,0,2,1,"[[[255, 255, 255], [255, 255, 255], [255, 255,...",,73.0,Q183,1,0,0,1,0,0
6247,http://www.wikidata.org/entity/Q338167,Tenderloin,area of New York City during the late 19th and...,entity,geography,neighborhood,cultural exclusive,0,1,1,"[[[255, 255, 255], [255, 255, 255], [255, 255,...",,9.0,Q30,3,0,3,1,0,0
6248,http://www.wikidata.org/entity/Q66991,Schinznach-Dorf,former municipality and current district of Sc...,entity,geography,neighborhood,cultural exclusive,0,3,1,"[[[61, 117, 223], [64, 118, 226], [65, 119, 22...",,30.0,Q39,2,0,2,1,0,0
6249,http://www.wikidata.org/entity/Q66922,Ependes,village and former municipality in Bois-d'Amon...,entity,geography,neighborhood,cultural exclusive,0,2,1,"[[[153, 147, 44], [209, 198, 54], [201, 192, 5...",,82.0,Q39,2,0,1,1,0,0


## Dates

In [21]:
var = 'date'
initial_df[var].describe()  # the include argument is ignored for a Series

Unnamed: 0,date
count,2506.0
mean,1880.42498
std,258.087493
min,1.0
25%,1892.0
50%,1948.0
75%,1983.0
max,5000.0


In [22]:
initial_df[var].isna().sum()

np.int64(3745)

In [23]:
type(initial_df[var][0])

numpy.float64

In [24]:
initial_df[var].nunique()

446

Problems:

- The maximum value is 5000, which is not a possible year: we set every date later than the current year to NaN.
- There are many missing values (3745 out of 6251 rows). For the same reasons as for the images, we fill them in with the date -1: we hope that the model will learn to recognize items with no date, since the absence of a date for an item can be correlated with its label.


Max values

In [25]:
current_year = 2025
# NaN > current_year is False, so dates that are already missing stay NaN
initial_df['date'] = initial_df['date'].apply(lambda x: np.nan if x > current_year else x)

In [26]:
var = 'date'
initial_df[var].describe()

Unnamed: 0,date
count,2504.0
mean,1878.931709
std,250.237383
min,1.0
25%,1892.0
50%,1948.0
75%,1983.0
max,2025.0


Missing values

In [27]:
initial_df['date'] = initial_df['date'].fillna(-1)
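
The two date steps can be checked on a small example. The sketch below uses `Series.mask` as a vectorized alternative to the `apply` call, on toy values (the Series `dates` is hypothetical data):

```python
import numpy as np
import pandas as pd

current_year = 2025
dates = pd.Series([1949.0, 5000.0, np.nan, 165.0])

# Step 1: years after current_year become NaN.
# Comparing NaN with a number is False, so values that are already missing are left as they are.
cleaned = dates.mask(dates > current_year)

# Step 2: every missing value receives the sentinel -1.
filled = cleaned.fillna(-1)
print(filled.tolist())  # [1949.0, -1.0, -1.0, 165.0]
```

After the second step, impossible dates and originally missing dates can no longer be told apart, since both are encoded as -1.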

## Nb_lang

In [28]:
var = 'nb_lang'
initial_df[var].describe()  # the include argument is ignored for a Series

Unnamed: 0,nb_lang
count,6250.0
mean,31.19024
std,36.651012
min,1.0
25%,8.0
50%,19.0
75%,39.0
max,310.0


In [29]:
initial_df[var].isna().sum()

np.int64(1)

In [30]:
type(initial_df[var][0])

numpy.float64

In [31]:
initial_df[var].nunique()

219

Problem : We have 1 missing value

In [32]:
initial_df[initial_df[var].isna()]

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,image,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event
2058,http://www.wikidata.org/entity/Q7551241,social media and television,Emerging platforms,entity,media,television,cultural agnostic,0,2,1,"[[[255, 255, 255], [255, 255, 255], [255, 255,...",-1.0,,,1,0,1,0,0,0


We fill it in with the median of the column (19).

In [33]:
med_value = initial_df['nb_lang'].median()
initial_df['nb_lang'] = initial_df['nb_lang'].fillna(int(med_value))

## Main country

In [34]:
var = 'main_country'
initial_df[var].describe()

Unnamed: 0,main_country
count,3384
unique,283
top,Q30
freq,580


In [35]:
initial_df[var].isna().sum()

np.int64(2867)

In [36]:
type(initial_df[var][0])

str

In [37]:
initial_df[var].nunique()

283

Problem : Many missing values

- Solution : Put the missing values in a category of their own and encode each unique country as a category. Problem: we are not sure that the training set contains every country, so at inference time the model could encounter a category it has never seen.

- Solution (what we will do) : 0 if NaN, 1 otherwise. We lose a lot of information...

In [38]:
initial_df['main_country_cat'] = initial_df['main_country'].apply(lambda x: 0 if pd.isna(x) else 1)
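
For comparison, the first option could be made robust to unseen countries by reserving one id for missing values and one for unknown values. This is only a sketch of that alternative, not what the notebook does; the names `MISSING_ID`, `UNKNOWN_ID`, `build_country_vocab` and `encode_country` are hypothetical:

```python
import pandas as pd

MISSING_ID, UNKNOWN_ID = 0, 1  # reserved ids (hypothetical convention)

def build_country_vocab(train_countries):
    """Map each country seen in training to an integer id starting at 2."""
    seen = sorted(train_countries.dropna().unique())
    return {country: idx for idx, country in enumerate(seen, start=2)}

def encode_country(value, vocab):
    """Return the id of a country, MISSING_ID for NaN, UNKNOWN_ID if never seen in training."""
    if pd.isna(value):
        return MISSING_ID
    return vocab.get(value, UNKNOWN_ID)

train = pd.Series(["Q30", None, "Q298", "Q30"])
vocab = build_country_vocab(train)

# "Q183" does not appear in the toy training data, so it falls back to UNKNOWN_ID.
inference = pd.Series(["Q30", "Q183", None])
encoded = inference.map(lambda v: encode_country(v, vocab)).tolist()
print(encoded)  # [3, 1, 0]
```

The vocabulary must be built on the training set only and reused unchanged at inference time.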

# Exportation

In [40]:
# Visualisation

initial_df

Unnamed: 0,item,name,description,type,category,subcategory,label,subclasses,instances_of,subclass_depth,...,date,nb_lang,main_country,descr_num_nouns,descr_num_verbs,descr_num_adjectives,descr_has_location,descr_has_ethnic_group,descr_has_event,main_country_cat
0,http://www.wikidata.org/entity/Q306,Sebastián Piñera,Chilean entrepreneur and politician (1949–2024),entity,politics,politician,cultural exclusive,0,1,1,...,1949.0,126.0,Q298,2,0,1,0,1,0,1
1,http://www.wikidata.org/entity/Q12735,John Amos Comenius,"Czech teacher, educator, philosopher and write...",entity,politics,politician,cultural representative,0,1,1,...,1592.0,72.0,Q153136,4,0,1,0,1,0,1
2,http://www.wikidata.org/entity/Q1752,Macrinus,Roman emperor from 217 to 218,entity,politics,politician,cultural representative,0,1,1,...,165.0,83.0,Q1747689,1,0,1,0,1,0,1
3,http://www.wikidata.org/entity/Q1639,Lamine Diack,Senegalese sports manager (1933–2021),entity,politics,politician,cultural representative,0,1,1,...,1933.0,42.0,Q1041,2,0,1,0,1,0,1
4,http://www.wikidata.org/entity/Q9588,Richard Nixon,President of the United States from 1969 to 1974,entity,politics,politician,cultural representative,0,1,1,...,1913.0,174.0,Q30,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6246,http://www.wikidata.org/entity/Q321103,Bühl,"quarter of Tübingen, Baden-Württemberg, Germany",entity,geography,neighborhood,cultural exclusive,0,2,1,...,-1.0,73.0,Q183,1,0,0,1,0,0,1
6247,http://www.wikidata.org/entity/Q338167,Tenderloin,area of New York City during the late 19th and...,entity,geography,neighborhood,cultural exclusive,0,1,1,...,-1.0,9.0,Q30,3,0,3,1,0,0,1
6248,http://www.wikidata.org/entity/Q66991,Schinznach-Dorf,former municipality and current district of Sc...,entity,geography,neighborhood,cultural exclusive,0,3,1,...,-1.0,30.0,Q39,2,0,2,1,0,0,1
6249,http://www.wikidata.org/entity/Q66922,Ependes,village and former municipality in Bois-d'Amon...,entity,geography,neighborhood,cultural exclusive,0,2,1,...,-1.0,82.0,Q39,2,0,1,1,0,0,1


In [39]:
#initial_df.to_json('final_train_df.json', orient="records", lines=True)
#files.download('final_train_df.json')