<a href="https://colab.research.google.com/github/shacharmirkin/misc/blob/main/achiataalivrit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# אחי, אתה על עברית ("Bro, you're on Hebrew")
This script is meant to find cases where accidentally typing in Hebrew instead of English still produces a Hebrew word with the same meaning as the English word we intended to write.

Executive summary: fail :( Anyway, it was worth a try

@shacharmirkin

In [None]:
import numpy as np
from gensim.models import KeyedVectors  # the code below uses the gensim 3.x API (.vocab, .word_vec)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Create a keyboard mapping

In [None]:
# Left to right, top to bottom keys on my laptop
en = 'qwertyuiopasdfghjkl;zxcvbnm,.'
he = "/'קראטוןםפשדגכעיחלךףזסבהנמצתץ"

# Map each Hebrew key to the Latin character on the same physical key
he2en_map = dict(zip(he, en))

In [None]:
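from typing import Optional

def he2en(he_str: str) -> Optional[str]:
  """ Convert a Hebrew string to an English (Latin letters) string based on the keyboard layout.
  Returns None if any character is missing from the keyboard mapping. """
  en_word = ''
  for ch in he_str:
    if ch not in he2en_map:
      return None  # e.g. digits, spaces or Latin letters
    en_word += he2en_map[ch]
  return en_word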

he2en('טמקא')

'ynet'
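
The same table can be inverted to go from Latin keystrokes back to the Hebrew letters on the same keys. This is only a minimal sketch, assuming the same key order as above; `en2he_map` and `en2he` are hypothetical names that do not appear in the original notebook.

```python
from typing import Optional

# Same key order as above: left to right, top to bottom
en = 'qwertyuiopasdfghjkl;zxcvbnm,.'
he = "/'קראטוןםפשדגכעיחלךףזסבהנמצתץ"

# Hypothetical inverse table: Latin key -> Hebrew key
en2he_map = dict(zip(en, he))

def en2he(en_str: str) -> Optional[str]:
    """Convert a Latin-letter string to the Hebrew string typed with the same keys."""
    if any(ch not in en2he_map for ch in en_str):
        return None  # contains a key that is not in the table (e.g. uppercase)
    return ''.join(en2he_map[ch] for ch in en_str)

print(en2he('ynet'))
```

Note that the table only covers lowercase keys, so a caller would probably want to lowercase the input first.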

# A bilingual dictionary
We're using aligned word vectors from fastText: https://fasttext.cc/docs/en/aligned-vectors.html

In aligned word vectors, similar words in different languages have similar vectors.


In [None]:
import os
def download_vec(lang: str):
  """ Download fastText word vectors for a given language """
  filename = 'wiki.' + lang + '.align.vec'
  url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/' + filename
  if not os.path.exists(filename):
    ! wget $url
  return filename


Download the vectors for Hebrew and English and print a few lines from each file (showing only part of each vector).

Format: The first line includes the number of vectors and their dimension. 
The other lines contain a word followed by its vector, space separated. 
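
To make the format concrete, here is a rough sketch of a reader for it in plain Python. The notebook itself relies on gensim's `KeyedVectors.load_word2vec_format` for this; `read_vec` is a hypothetical helper, and the sample data is made up rather than taken from the real files.

```python
import io

def read_vec(stream, limit=None):
    """Parse the text format: a 'count dim' header, then one 'word v1 ... vd' line per word."""
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break  # stop early, similar in spirit to gensim's limit argument
        parts = line.rstrip().split(' ')
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {parts[0]!r}, got {len(values)}")
        vectors[parts[0]] = values
    return count, dim, vectors

# Made-up sample with two 3-dimensional vectors
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
count, dim, vectors = read_vec(sample)
print(count, dim, vectors['cat'])
```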

In [None]:
he_filename = download_vec('he')
en_filename = download_vec('en')

!sed -n '1p;997,1000p' $he_filename | cut -f1-10 -d' '
print()
!sed -n '1p;997,1000p' $en_filename | cut -f1-10 -d' '

--2020-10-30 11:19:41--  https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.he.align.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1106836660 (1.0G) [binary/octet-stream]
Saving to: ‘wiki.he.align.vec’


2020-10-30 11:20:29 (22.2 MB/s) - ‘wiki.he.align.vec’ saved [1106836660/1106836660]

--2020-10-30 11:20:29--  https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5685446378 (5.3G) [binary/octet-stream]
Saving to: ‘wiki.en.align.vec’


2020-10-30 11:25:32 (17.9 MB/s) - ‘wiki.en.align.vec’ saved 

Loading the word vectors into memory

In [None]:
he_wv = KeyedVectors.load_word2vec_format(he_filename, binary=False, encoding='utf8', limit=None)

In [None]:
# The English vectors include 2.5M entries and take some time to load.
# If not all are needed, set limit to a smaller number (None means all).
en_wv = KeyedVectors.load_word2vec_format(en_filename, binary=False, encoding='utf8', limit=None)

Playing a bit with the word vectors

In [None]:
w1, w2 = 'אליפות', 'גביע'
print(f"sim({w1},{w2})={he_wv.similarity(w1,w2):.3f}")  # (0.56 is pretty high)

print(f"\nMost similar words to '{w1}':")
for w, score in he_wv.most_similar(w1):
  print(w, round(float(score), 3))

sim(אליפות,גביע)=0.557

Most similar words to 'אליפות':
ואליפות 0.844
אליפות+גביע 0.839
#אליפות 0.831
שאליפות 0.83
כאליפות 0.78
אליפותו 0.779
מאליפות 0.745
ולאליפות 0.729
כשבאליפות 0.728
שבאליפות 0.728


## Checking the alignment
We expect translations to get a relatively high similarity score

In [None]:
def get_he_en_sim(he_w, en_w):
  """ Get the Cosine similarity between a Hebrew and an English word """
  try:
    v1 = he_wv.word_vec(he_w) 
    v2 = en_wv.word_vec(en_w)
    sim = np.dot(v1,v2)  # the vectors are normalized so dot product equals cosine similarity
    print(he_w, en_w, sim)
    return sim
  except KeyError as e:
    print(e)
    return None

get_he_en_sim('חתול', 'cat')  # 0.43 is actually quite high
get_he_en_sim('חתול', 'dog')
get_he_en_sim('חתול', 'table')
_ = get_he_en_sim('חתול', 'black')


חתול cat 0.43351614
חתול dog 0.38818812
חתול table 0.06924441
חתול black 0.06795343
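
The shortcut used in `get_he_en_sim` (a plain dot product instead of the full cosine formula) relies on the vectors having unit length. The sketch below illustrates that equivalence on made-up toy vectors; `cosine` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity for vectors of any length."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors, not taken from the fastText files
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# After scaling to unit length, the plain dot product gives the same value
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)

print(cosine(u, v), float(np.dot(u_unit, v_unit)))
```

Both printed values are 0.96 up to floating-point rounding. If the loaded vectors were not unit length, the dot product alone would no longer be bounded by 1 and the thresholds used below would need rethinking.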


# Looking for matches
Requiring keyboard layout match and semantic similarity

In [None]:
min_sim = 0.3  # this also removes a lot of the noise that exists in the word vectors
min_length = 3
 
for w in he_wv.vocab:
  en_w = he2en(w)
  if len(w) >= min_length and en_w in en_wv.vocab:
    v1 = he_wv.word_vec(w) 
    v2 = en_wv.word_vec(en_w)
    sim = np.dot(v1,v2)
    if sim >= min_sim: 
      print(w, en_w, round(sim,3))

רקס rex 0.372


That's quite disappointing.
The only (kind of) success with at least 3 letters and a minimum similarity of 0.3 is *רקס rex 0.372*.

In retrospect, maybe this shouldn't be too surprising,
e.g. because of the length difference between Hebrew and English words (Hebrew is usually written without vowels).

Would it work better with other language pairs (and different scripts/keyboards)?
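
For anyone who wants to try another pair, the search does not depend on Hebrew or English specifically. The following is a layout-agnostic sketch of the same two filters (keyboard match and minimum similarity). It uses plain dicts of made-up unit-length vectors instead of gensim objects; `layout_matches` and the toy data are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Vectors = Dict[str, List[float]]

def layout_matches(src: Vectors, tgt: Vectors, key_map: Dict[str, str],
                   min_sim: float = 0.3, min_length: int = 3
                   ) -> Iterator[Tuple[str, str, float]]:
    """Yield (source word, retyped word, similarity) for words passing both filters."""
    for word, vec in src.items():
        if len(word) < min_length or any(ch not in key_map for ch in word):
            continue
        retyped = ''.join(key_map[ch] for ch in word)
        if retyped not in tgt:
            continue
        # Dot product; equals cosine similarity only for unit-length vectors
        sim = sum(a * b for a, b in zip(vec, tgt[retyped]))
        if sim >= min_sim:
            yield word, retyped, sim

# Made-up layout and unit-length vectors, for illustration only
toy_map = {'a': 'x', 'b': 'y', 'c': 'z'}
toy_src = {'abc': [1.0, 0.0], 'cba': [0.0, 1.0], 'ab': [1.0, 0.0]}
toy_tgt = {'xyz': [0.6, 0.8], 'zyx': [0.0, 1.0]}

for match in layout_matches(toy_src, toy_tgt, toy_map):
    print(match)
```

With real data, `src` and `tgt` would be filled from two aligned vector files and `key_map` from the two keyboard layouts.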


Lastly, are the Hebrew words semantically similar to their accidentally typed versions?

In [None]:
min_sim = 0.25
min_length = 3

for w in he_wv.vocab:
  en_w = he2en(w)
  if len(w) >= min_length and en_w in he_wv.vocab:  # now looking at the Hebrew vocab only
    v1 = he_wv.word_vec(w) 
    v2 = he_wv.word_vec(en_w)
    sim = np.dot(v1,v2)
    if sim >= min_sim: 
      # also showing for reference the 10th-most similar word
      most_sim_w, most_sim_score = he_wv.most_similar(w, topn=10)[9]
      print(w, en_w, round(sim,3), "::", most_sim_w, round(most_sim_score, 3), sep='\t') 

שמאל	antk	0.382	::	upright	0.624
נגד	bds	0.319	::	נגדו	0.573
אלי	tkh	0.267	::	ותשאלי	0.432
הכס	vfx	0.325	::	אפוקריסיאריוס	0.428
רקס	rex	0.285	::	שונוזאורוס	0.486
יגה	hdv	0.295	::	טרהבייט	0.519
הצג	vmd	0.256	::	תוצג	0.482
קצא	emt	0.274	::	אמרכל	0.427
אצה	tmv	0.254	::	אצפה	0.39
דאע	stg	0.261	::	שטרור	0.479
ערוד	grus	0.276	::	ערופה	0.447
פררפ	prrp	0.386	::	לפקמן	0.486
דוגם	sudo	0.254	::	ומבוצע	0.425
נהורא	bvurt	0.257	::	אהורא	0.491
בצג	cmd	0.369	::	במציג	0.546
ועמ	ugn	0.265	::	וכרך	0.483
טמקא	ynet	0.26	::	אשתוק	0.399
דפור	spur	0.258	::	ברדפורד	0.419
ההה	vvv	0.258	::	ההמפ	0.49
דמל	snk	0.263	::	הלכתא	0.426
לצא	kmt	0.292	::	לצאתם	0.546
אונקר	tuber	0.268	::	אונקיית	0.493
צצצ	mmm	0.297	::	לכאן/תבנית	0.532
