<a href="https://colab.research.google.com/github/ymoslem/MT-Preparation/blob/main/extra/oversampling-polars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Oversampling

In some toolkits like OpenNMT-{py,tf}, you can apply oversampling during training using the "dataset weights" feature. However, this notebook explains how to apply it *manually* to a dataset as part of data preparation, using *Polars*.

In [1]:
import polars as pl

In [3]:
data = {
        "label": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "source": [
                  "Sunny skies and gentle breeze.",
                  "Rainy with thunderstorms in evening.",
                  "Clear night with a full moon.",
                  "Cloudy and cool, chance of showers.",
                  "Students gather, eager to learn.",
                  "New books, fresh start, endless possibilities.",
                  "Friends reunite, laughter fills corridors.",
                  "Teachers inspire, minds come alive.",
                  "Busy hallways, buzzing with excitement.",
                  "Homework assignments, challenges to conquer.",
                  "Exploring new subjects, expanding knowledge.",
                  "Lunchtime chatter, delicious meals shared.",
                  "Sports teams practice, readying for competition.",
                  "Art projects, creativity unleashed on canvas.",
                  "Science experiments, discoveries waiting ahead.",
                  "Math problems solved, confidence grows.",
                  "Field trips, adventures beyond classroom.",
                  "Class discussions, diverse ideas shared.",
                  "Exam time, studying late nights.",
                  "Graduation nears, futures take shape."
  ],
        "target": [
                  "Ciel ensoleillé et légère brise.",
                  "Pluvieux avec des orages le soir.",
                  "Nuit claire avec une pleine lune.",
                  "Nuageux et frais, risque d'averses.",
                  "Les élèves se rassemblent, impatients d'apprendre.",
                  "Nouveaux livres, nouveau départ, possibilités infinies.",
                  "Les amis se retrouvent, les couloirs résonnent de rires.",
                  "Les enseignants inspirent, les esprits s'éveillent.",
                  "Couloirs animés, bourdonnant d'excitation.",
                  "Devoirs à faire, défis à relever.",
                  "Exploration de nouvelles matières, élargissement des connaissances.",
                  "Bavardages à l'heure du déjeuner, repas délicieux partagés.",
                  "Les équipes sportives s'entraînent, se préparant à la compétition.",
                  "Projets artistiques, créativité libérée sur la toile.",
                  "Expériences scientifiques, découvertes en attente.",
                  "Problèmes de maths résolus, confiance qui grandit.",
                  "Sorties scolaires, aventures au-delà de la salle de classe.",
                  "Discussions en classe, partage d'idées diverses.",
                  "Temps des examens, études jusqu'à tard dans la nuit.",
                  "La remise des diplômes approche, les futurs prennent forme."
  ]
}

Assume that `data` includes two domains. We will use the label *0* for the domain with less data, and the label *1* for the domain with more data. In this toy example, the first domain (label *0*) has 4 translation pairs, while the second domain (label *1*) has 16 translation pairs.

In [4]:
df = pl.DataFrame(data)

In [5]:
print(df.shape)

df

(20, 3)


| label | source | target |
| --- | --- | --- |
| i64 | str | str |
| 0 | "Sunny skies and gentle breeze." | "Ciel ensoleillé et légère bris…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Clear night with a full moon." | "Nuit claire avec une pleine lu…" |
| 0 | "Cloudy and cool, chance of sho…" | "Nuageux et frais, risque d'ave…" |
| 1 | "Students gather, eager to lear…" | "Les élèves se rassemblent, imp…" |
| … | … | … |
| 1 | "Math problems solved, confiden…" | "Problèmes de maths résolus, co…" |
| 1 | "Field trips, adventures beyond…" | "Sorties scolaires, aventures a…" |
| 1 | "Class discussions, diverse ide…" | "Discussions en classe, partage…" |
| 1 | "Exam time, studying late night…" | "Temps des examens, études jusq…" |


Now, let's randomly oversample the smaller domain data (with label *0*) by sampling with replacement, so that the dataset becomes balanced. After this step, it will have the same number of translation pairs as the larger domain data (with label *1*).

In [6]:
# Get the max class count
most = df["label"].value_counts()["count"].max()

# Oversampling using group_by and map_groups
df_balanced = (
    df.group_by("label", maintain_order=True)
    .map_groups(lambda group: group.sample(n=most, shuffle=True, with_replacement=True))
)

print(df_balanced.shape)
print(df_balanced["label"].value_counts())  # Verify class balance


(32, 3)
shape: (2, 2)
┌───────┬───────┐
│ label ┆ count │
│ ---   ┆ ---   │
│ i64   ┆ u32   │
╞═══════╪═══════╡
│ 1     ┆ 16    │
│ 0     ┆ 16    │
└───────┴───────┘


Compare the new dataframe with the original one, and notice how the data with the label *0* is now oversampled: its rows are drawn from the 4 original pairs, repeated to fill 16 rows. Congratulations, now you have a balanced dataset!

In [12]:
df_balanced

| label | source | target |
| --- | --- | --- |
| i64 | str | str |
| 0 | "Clear night with a full moon." | "Nuit claire avec une pleine lu…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Sunny skies and gentle breeze." | "Ciel ensoleillé et légère bris…" |
| 0 | "Cloudy and cool, chance of sho…" | "Nuageux et frais, risque d'ave…" |
| … | … | … |
| 1 | "Science experiments, discoveri…" | "Expériences scientifiques, déc…" |
| 1 | "Lunchtime chatter, delicious m…" | "Bavardages à l'heure du déjeun…" |
| 1 | "Exploring new subjects, expand…" | "Exploration de nouvelles matiè…" |
| 1 | "Students gather, eager to lear…" | "Les élèves se rassemblent, imp…" |


In [13]:
df_balanced.filter(pl.col("label") == 0)

| label | source | target |
| --- | --- | --- |
| i64 | str | str |
| 0 | "Clear night with a full moon." | "Nuit claire avec une pleine lu…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Sunny skies and gentle breeze." | "Ciel ensoleillé et légère bris…" |
| 0 | "Cloudy and cool, chance of sho…" | "Nuageux et frais, risque d'ave…" |
| … | … | … |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
| 0 | "Sunny skies and gentle breeze." | "Ciel ensoleillé et légère bris…" |
| 0 | "Rainy with thunderstorms in ev…" | "Pluvieux avec des orages le so…" |
