# Aligning word embeddings for Scottish Gaelic -- Irish

I aligned fastText word embeddings for Scottish Gaelic and Irish. This notebook presents some evidence that the alignment worked, at least to some extent.

In [1]:
# Import only the class this notebook uses rather than the whole namespace
from pymagnitude import Magnitude

In [2]:
# Original unaligned word vectors
gd_original = Magnitude("gd-ga-en-vecs/vectors-gd.magnitude")
ga_original = Magnitude("gd-ga-en-vecs/vectors-ga.magnitude")

## Src = gd, Tgt = ga

In [3]:
# Aligned word vectors using Kevin Scannell's bilingual dictionary for training
# src = gd, tgt = ga
gd_aligned = Magnitude("gd-ga-en-vecs/src=gd_tgt=ga/vectors-gd-aligned-w-ga.magnitude")
ga_aligned = Magnitude("gd-ga-en-vecs/src=gd_tgt=ga/vectors-ga-aligned-w-gd.magnitude")
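
The notebook does not say which method produced the aligned vectors. One common approach to dictionary-supervised alignment is orthogonal Procrustes: stack the source and target vectors of the dictionary pairs into two matrices and solve for the rotation that maps one onto the other. The sketch below assumes that method and uses random toy matrices in place of real dictionary pairs; `procrustes_align` and every variable name in it are hypothetical.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimising ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays whose i-th rows are the vectors of the
    i-th dictionary pair (hypothetical layout, not taken from the notebook).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy data: the "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
tgt = src @ true_rotation

w = procrustes_align(src, tgt)
max_error = np.abs(src @ w - tgt).max()
print("largest residual after alignment:", max_error)
```

Because the map is orthogonal, it preserves distances and similarities within the source language, which is why monolingual neighbours are expected to survive this kind of alignment.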

## Distance and Similarity of Cognates

### Identical surface forms *athair*--*athair* 'father'

The distance between cognates when using the original word embeddings is greater than the distance when using the aligned word vectors.

In [4]:
gd_original.distance(gd_original.query("athair"), ga_original.query("athair"))

1.3714929

In [5]:
gd_aligned.distance(gd_aligned.query("athair"), ga_aligned.query("athair"))

0.9214781

Conversely, similarity between cognates is greater when using aligned word embeddings compared to the original unaligned word embeddings.

In [6]:
gd_original.similarity(gd_original.query("athair"), ga_original.query("athair"))

0.059503734

In [7]:
gd_aligned.similarity(gd_aligned.query("athair"), ga_aligned.query("athair"))

0.575439
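
The distance and similarity figures above are two views of the same quantity. Assuming the vectors are unit length and that `distance` is Euclidean while `similarity` is cosine (the reported numbers are consistent with this), the two are tied by d² = 2 − 2·cos. The sketch below checks that relation against the values printed in this notebook; `distance_from_cosine` is a hypothetical helper, not part of pymagnitude.

```python
import math

def distance_from_cosine(cos_sim):
    # For unit vectors u and v: ||u - v||^2 = 2 - 2 * (u . v)
    return math.sqrt(2.0 - 2.0 * cos_sim)

# (similarity, distance) pairs copied from the cell outputs above
reported = {
    "athair, original": (0.059503734, 1.3714929),
    "athair, aligned": (0.575439, 0.9214781),
}

for label, (sim, dist) in reported.items():
    derived = distance_from_cosine(sim)
    print(f"{label}: derived {derived:.4f}, reported {dist:.4f}")
```

In other words, the similarity cells carry no information beyond the distance cells; either measure alone is enough to compare the original and aligned spaces.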

### Different surface forms *tìdsear*--*múinteoir* 'teacher'

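As with identical surface forms, the distance between the two words decreases when using the aligned word vectors. (Strictly speaking, *tìdsear* and *múinteoir* are translation equivalents rather than cognates.)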

In [8]:
gd_original.distance(gd_original.query("tìdsear"), ga_original.query("múinteoir"))

1.4716123

In [9]:
gd_aligned.distance(gd_aligned.query("tìdsear"), ga_aligned.query("múinteoir"))

1.1173369

Similarly, similarity increases when using the aligned word vectors compared to the unaligned ones.

In [10]:
gd_original.similarity(gd_original.query("tìdsear"), ga_original.query("múinteoir"))

-0.08282142

In [11]:
gd_aligned.similarity(gd_aligned.query("tìdsear"), ga_aligned.query("múinteoir"))

0.37577924

## Most similar embeddings

### Identical surface forms *athair*--*athair* 'father'

#### Original word vectors

The most similar entries to _athair_ in both the original Scottish Gaelic and Irish word embeddings are closely related words.

For Scottish Gaelic, for example, we see _mathair_ 'mother' and _leas-athair_ 'stepfather'. However, there are some odd entries that are similar only in their surface form, e.g. _Nathair_ 'snake' (possibly a corruption of _n-athair_ 'their father'?) or _Bathair_ 'goods'.

In [12]:
gd_original.most_similar("athair", topn=10)

[('leas-athair', 0.7714895),
 ('àrd-athair', 0.7699828),
 ('mhathair', 0.76533455),
 ('t-athair', 0.7603621),
 ('mathair', 0.7544086),
 ('h-athair', 0.7522838),
 ('Mathair', 0.75154734),
 ('Nathair', 0.7480391),
 ('t-àrd-athair', 0.73647505),
 ('Bathair', 0.7290089)]

Top 10:
1. stepfather
2. stepfather
3. mother
4. the father
5. mother
6. her father
7. mother
8. **Snake** (or *n-athair* 'their father'?) 
9. the great-grandfather
10. goods
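
A likely reason for the surface-form neighbours is that fastText builds each word vector from character n-grams, so words that share long substrings end up close together regardless of meaning. The sketch below is a rough illustration only: it assumes fastText's default n-gram lengths (3 to 6) and uses Jaccard overlap as a stand-in for the real learned vectors. `char_ngrams` and `ngram_overlap` are hypothetical helpers.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers, fastText style."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }

def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# 'bathair' shares most of its substrings with 'athair'; 'roicéad' shares none.
for other in ["bathair", "nathair", "roicéad"]:
    print(other, round(ngram_overlap("athair", other), 3))
```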

For Irish, we see similar words like _mháthair_ 'mother' and _sheanathair_ 'grandfather', but also some odd names like _Théodebald_, _Gondioque_, and _Aminah_.

In [13]:
ga_original.most_similar("athair", topn=10)

[('mháthair', 0.58177733),
 ('t-athair', 0.5201826),
 ('h-athair', 0.50818443),
 ('sheanathair', 0.5013288),
 ('n-athair', 0.49979436),
 ('shin-seanathair', 0.48902076),
 ('shean-athair', 0.4642293),
 ('Théodebald', 0.4593647),
 ('Gondioque', 0.4590472),
 ('Aminah', 0.45837396)]

Top 10:
1. mother
2. the father
3. father
4. grandfather
5. their father
6. great-grandfather
7. grandfather
8. **Théodebald**
9. **Gondioque**
10. **Aminah**

#### Unaligned word vector comparison

When we query across the unaligned word embeddings, the results show that the two spaces are not aligned. Here we take the word vector for _athair_ from the Irish word embeddings and find the most similar word vectors to it in the Scottish Gaelic word embeddings, and vice versa. Notice that we no longer see related terms like 'mother' or 'step/grandfather'.

In [14]:
gd_original.most_similar(ga_original.query("athair"), topn=10)

[('Spartan', 0.21923038),
 ('Albion', 0.21457234),
 ('Tartan', 0.20621169),
 ('bheanntan', 0.20374598),
 ('1,902km', 0.20359457),
 ('Bishop', 0.20356673),
 ('star', 0.20177013),
 ('agree', 0.2008923),
 ('tartan', 0.19916904),
 ('Là-bàis', 0.19887699)]

Top 10:
1. **Spartan**
2. **Albion**
3. **Tartan**
4. **mountains**
5. **1,902km**
6. **Bishop**
7. **star**
8. **agree**
9. **tartan**
10. **day of death**

In [15]:
ga_original.most_similar(gd_original.query("athair"), topn=10)

[('dheoin', 0.29285294),
 ('dímhorfach', 0.27654022),
 ('ró-shona', 0.26213565),
 ('fhéinscrúdú', 0.2587552),
 ('roicéad', 0.25809878),
 ('tascanna', 0.25144416),
 ('fútsa', 0.2504164),
 ('ndáríre', 0.24929008),
 ('mhionscrúdú', 0.24743652),
 ('oireas', 0.24653216)]

Top 10:
1. **voluntarily**
2. **degenerate**
3. **too happy**
4. **self-examination**
5. **rocket**
6. **tasks**
7. **about you**
8. **in fact**
9. **scrutiny**
10. **time** (?)

#### Aligned word vector comparison

However, when we do the same procedure using the aligned word embeddings, we once again see the familiar, similar terms. This is good evidence that, at least in some cases, the word embedding alignment was successful.

In [16]:
gd_aligned.most_similar(ga_aligned.query("athair"), topn=10)

[('athair', 0.575439),
 ('mhathair', 0.4998223),
 ('h-athair', 0.49022955),
 ('mathair', 0.47973418),
 ('phiuthair', 0.4791753),
 ('àrd-athair', 0.47690833),
 ('leas-athair', 0.47433442),
 ('mic-bhràthair', 0.47367144),
 ('Bràthair', 0.47073084),
 ('Bhràthair', 0.4641339)]

Top 10:
1. father
2. mother
3. her father
4. mother
5. sister
6. stepfather
7. stepfather
8. nephews
9. brother
10. brother

In [17]:
ga_aligned.most_similar(gd_aligned.query("athair"), topn=10)

[('athair', 0.575439),
 ('h-athair', 0.44955707),
 ('máthair', 0.44419354),
 ('mhathair', 0.44060296),
 ('shin-seanathair', 0.42478544),
 ('seanathair', 0.41890094),
 ('sheanathair', 0.41703337),
 ('Ardathair', 0.4120913),
 ('shean-athair', 0.40566164),
 ('n-athair', 0.4001657)]

Top 10:
1. father
2. father
3. mother
4. mother
5. great-grandfather
6. grandfather
7. grandfather
8. grandfather
9. grandfather
10. their father

In order to continue, how many other examples should I consider? The alignment appears to work for _athair_ 'father', but it may break down elsewhere. What else should I try in order to be reasonably confident in these results?
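
One way to move beyond individual examples is to score retrieval over a held-out list of translation pairs and report the fraction whose correct translation appears among the top k neighbours (precision at k). The sketch below is a minimal version on toy vectors; `precision_at_k`, the toy data and the gold pairs are all illustrative assumptions. With the real data, the source vectors would come from calls such as `gd_aligned.query(...)`.

```python
import numpy as np

def precision_at_k(src_vectors, tgt_matrix, tgt_words, gold, k=1):
    """Fraction of source words whose gold translation is in the top-k neighbours.

    src_vectors: dict mapping source word -> vector
    tgt_matrix:  (n_words, dim) array of target vectors
    tgt_words:   list of target words, row-aligned with tgt_matrix
    gold:        dict mapping source word -> set of acceptable target words
    """
    tgt_unit = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    hits = 0
    for src_word, answers in gold.items():
        vec = np.asarray(src_vectors[src_word], dtype=float)
        scores = tgt_unit @ (vec / np.linalg.norm(vec))  # cosine similarities
        top = np.argsort(-scores)[:k]
        if any(tgt_words[i] in answers for i in top):
            hits += 1
    return hits / len(gold)

# Toy target space with three words on orthogonal axes
tgt_words = ["athair", "múinteoir", "cat"]
tgt_matrix = np.eye(3)

# Toy source vectors: 'athair' lands correctly, 'tìdsear' lands nearest to 'cat'
src_vectors = {"athair": [0.9, 0.1, 0.0], "tìdsear": [0.1, 0.2, 0.9]}
gold = {"athair": {"athair"}, "tìdsear": {"múinteoir"}}

p1 = precision_at_k(src_vectors, tgt_matrix, tgt_words, gold, k=1)
p2 = precision_at_k(src_vectors, tgt_matrix, tgt_words, gold, k=2)
print("P@1:", p1, "P@2:", p2)
```

The gold pairs should be disjoint from the dictionary entries used to train the alignment; otherwise the score mostly measures how well the training pairs were fitted.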

### Different surface forms *tìdsear*--*múinteoir* 'teacher'

#### Original word vectors

In [18]:
gd_original.most_similar("tìdsear", topn=10)

[('thìdsear', 0.7221449),
 ('bùidsear', 0.70341194),
 ('poidsear', 0.6913197),
 ('Mhàidsear', 0.6801137),
 ('Fleidsear', 0.67649484),
 ('Boildsear', 0.6738622),
 ('tìdsear-piana', 0.6710552),
 ('tidsear', 0.66696274),
 ('Màidsear', 0.6398938),
 ('ghàidsear', 0.6365607)]

Top 10:
1. teacher
2. butcher
3. poacher (?)
4. Major
5. **Fleidsear** (name? Aonghas Fleidsear)
6. **Boildsear** (SG name in The Hobbit for Fredegar Bolger)
7. piano teacher
8. teacher
9. Major
10. gauger/exciseman/catcher

In [19]:
ga_original.most_similar("múinteoir", topn=10)

[('Múinteoir', 0.66677696),
 ('bunmhúinteoir', 0.61109895),
 ('Iarmhúinteoir', 0.59405947),
 ('Mhúinteoir', 0.57933664),
 ('scoth-mhúinteoir', 0.57555544),
 ('muinteoir', 0.5748805),
 ('mhúinteoir', 0.5688859),
 ('Iar-mhúinteoir', 0.556272),
 ('iarmhúinteoir', 0.54384166),
 ('Fuinteoir', 0.5346435)]

Top 10:
1. teacher
2. primary teacher
3. former teacher
4. teacher
5. excellent teacher
6. teacher
7. teacher
8. former teacher
9. former teacher
10. teacher (misspelled?)

#### Unaligned word vector comparison

In [20]:
gd_original.most_similar(ga_original.query("múinteoir"), topn=10)

[('Ounceland', 0.1916768),
 ('Stòr', 0.18859445),
 ('fhag', 0.1851612),
 ('fo-bhailtean', 0.18509023),
 ('rèabhailtean', 0.18382232),
 ('iom-bailtean', 0.18359554),
 ('tàmailtean', 0.18334988),
 ('recycle', 0.18286756),
 ('Nollaig', 0.18228635),
 ('iomall-bhailtean', 0.18077886)]

Top 10:
1. **Ounceland** (traditional Scottish land measurement)
2. **source**
3. **leave**
4. **suburbs**
5. **revolutions**
6. **suburbs**
7. **insults**
8. **recycle**
9. **Christmas**
10. **suburbs**

In [21]:
ga_original.most_similar(gd_original.query("tìdsear"), topn=10)

[('Vanhanen', 0.25243953),
 ('Moane', 0.2466792),
 ('Goody', 0.24587086),
 ('Nord-Sud', 0.24368005),
 ('Mattraw', 0.2426495),
 ('Daydream', 0.2421751),
 ('Lynn', 0.24175812),
 ('Ghoebbels', 0.24150711),
 ('gnáthocsaigine', 0.23977983),
 ('Hinchcliffw', 0.23965335)]

Top 10:
1. **Vanhanen**
2. **Moane**
3. **Goody**
4. **Nord-Sud**
5. **Mattraw**
6. **Daydream**
7. **Lynn**
8. **Goebbels**
9. **normal oxygen**
10. **Hinchcliffw**

#### Aligned word vector comparison

In [22]:
gd_aligned.most_similar(ga_aligned.query("múinteoir"), topn=10)

[('fear-teagaisg', 0.53558624),
 ('Neach-teagaisg-dannsa', 0.4949063),
 ('fhear-teagaisg', 0.4791184),
 ('neach-teagaisg', 0.47808105),
 ('Neach-teagaisg', 0.47394735),
 ('clàr-teagaisg', 0.46540892),
 ('bhocsair', 0.4291808),
 ('Theagaisg', 0.41435152),
 ('treòraiche-ciùil', 0.409761),
 ('Luchd-teagaisg', 0.408665)]

Top 10:
1. teacher
2. dance teacher
3. teacher
4. teacher
5. teacher
6. curriculum
7. boxer
8. teaching
9. music guide
10. teachers

In [23]:
ga_aligned.most_similar(gd_aligned.query("tìdsear"), topn=10)

[('Yehudi', 0.43762973),
 ('phianódóir', 0.43374342),
 ('Lorenzini', 0.4251451),
 ('Debastien', 0.41626137),
 ('Comala', 0.41439307),
 ('Tereza', 0.40945715),
 ('Bonifacio', 0.40215826),
 ('Pianódóir', 0.40194356),
 ('Bonifacia', 0.39925823),
 ('Strozzi', 0.3992504)]

Top 10:
1. **Yehudi**
2. pianist
3. **Lorenzini**
4. **Debastien**
5. **Comala**
6. **Tereza**
7. **Bonifacio**
8. pianist
9. **Bonifacia**
10. **Strozzi**

In [45]:
# gd_original2 = Magnitude("gd-ga-en-vecs/gd-en/vectors-gd.magnitude")
# gd_aligned2 = Magnitude("gd-ga-en-vecs/gd-en/vectors-gd-aligned.magnitude")
# en_original = Magnitude("gd-ga-en-vecs/gd-en/vectors-en.magnitude")
# en_aligned = Magnitude("gd-ga-en-vecs/gd-en/vectors-en-aligned.magnitude")

In [None]:
# gd_original2.most_similar("athair", topn=10)

In [None]:
# en_original.most_similar("father", topn=10)

In [None]:
# gd_original2.most_similar(en_original.query("father"), topn=10)

In [None]:
# en_original.most_similar(gd_original2.query("athair"), topn=10)

Using aligned word embeddings

In [None]:
# gd_aligned2.most_similar(en_aligned.query("father"), topn=10)

In [None]:
# en_aligned.most_similar(gd_aligned2.query("athair"), topn=10)

In [None]:
# gd_original2.most_similar(en_original.query("cat"), topn=10)

In [None]:
# gd_aligned2.most_similar(en_aligned.query("cat"), topn=10)