# Semantic Sort

See the blog post on www.robosoup.com for more detail about this code.

Word embedding is a feature learning technique in natural language processing (NLP) where words or phrases are mapped into vectors of real numbers. The technique was popularised by Tomas Mikolov's word2vec toolkit in 2013.
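
If you want to experiment with the idea yourself, here's a minimal sketch using the gensim library (this repo uses Athena instead; the corpus and parameters below are made up purely for illustration):

```python
from gensim.models import Word2Vec

# A toy corpus: real embeddings need vastly more text than this.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# Train a tiny word2vec model (illustrative, hypothetical parameters).
model = Word2Vec(sentences, vector_size=8, window=2, min_count=1, seed=42)

# Each word is now mapped to a vector of real numbers.
print(model.wv["cat"])  # an 8-dimensional numpy array
```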

Today, word embedding finds its way into many real-world applications, including the recent update of Google Translate. Google now uses a neural machine translation engine, which translates whole sentences at a time rather than just piece by piece. These sentences are represented by vectors.

Normally, word vectors consist of hundreds, if not thousands, of real numbers, which can capture the intricate detail of a rich language. However, I wondered: what would happen if we crushed each word's representation down to a single number?

Surely, no meaningful information would remain.

I exported a selection of words and their associated word vectors, generated by Athena (my port of word2vec), into a CSV file. Here's a truncated view of the 128-dimensional word vectors:

```
albert_einstein, -2.480522, -1.91355, 0.4266494, 1.927253, -0.5318278, -3.722191, ...
audi, -0.8185719, -1.691721, -0.3929027, 1.698154, -2.953124, 0.9475167, ...
bill_gates, -1.673416, -1.601884, 1.130291, 2.139339, 0.5832655, -2.355634, ...
bmw, -1.027373, -1.668206, -1.728756, 2.338698, -4.249786, 0.4357562, ...
cat, -0.5071716, -2.760615, -2.546596, 0.9999355, -0.1860456, 0.2416906, ...
climb, 0.7150535, 0.1190424, 1.583062, -0.3858005, -3.991093, 1.382508, ...
dog, -0.4773144, -2.224563, -3.67255, 0.5424166, 0.6331441, 1.222993, ...
hotel, 0.3524887, -4.38588, 1.197459, 2.595855, -0.3414235, -0.4427964, ...
house, 0.5532117, -2.279454, -0.2512704, 0.4140477, 2.676775, 0.05087801, ...
monkey, -0.623855, -3.508944, -0.931955, -0.4193407, -0.9044554, 0.347873, ...
mountain, 2.207317, 0.5984201, -1.398792, -0.5220098, -1.344777, 0.3062904, ...
porsche, -0.316146, -1.779519, -0.8431134, 2.44296, -3.680713, 0.874707, ...
river, 3.286445, 2.139661, -1.43799, 2.606198, -2.337485, -0.4348237, ...
school, -3.210236, -3.298275, 3.333953, 0.9878215, 1.926927, -0.1040714, ...
steve_jobs, -2.178778, -2.492632, 1.083596, 1.491468, 0.5440083, -3.330186, ...
swim, 0.8094505, -0.911125, -1.189181, 1.908399, -4.087798, 1.79775, ...
valley, 1.044242, 1.814712, 0.1396747, 0.6305104, -1.227837, -0.389852, ...
walk, 0.5212327, 0.03666721, 0.6227544, 0.6157224, -2.084322, 0.6642563, ...
```

Next, I crushed each vector down to just a single dimension using Principal Component Analysis (PCA). It's really easy to do this in Python using the scikit-learn library.
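
The full implementation lives in this repo's semantic_sort.py; the sketch below just shows the general shape, assuming the CSV above is saved as data.csv:

```python
import numpy as np
from sklearn.decomposition import PCA

# Load the words and their 128-dimensional vectors from the CSV above.
words, vectors = [], []
with open("data.csv") as f:
    for line in f:
        if not line.strip():
            continue
        parts = line.strip().split(",")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

# Project every vector onto its first principal component.
scores = PCA(n_components=1).fit_transform(np.array(vectors)).flatten()

# Sort the words by their single-number representations.
for word, score in sorted(zip(words, scores), key=lambda pair: pair[1]):
    print(f"{word}  {score:f}")
```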

The results are astonishing!

```
bmw  -12.042180
audi  -11.357731
porsche  -11.349108
steve_jobs  -8.577104
bill_gates  -7.390602
albert_einstein  -4.910876
monkey  -1.285317
cat  -1.197481
dog  -0.925163
house  0.517183
school  0.732882
hotel  1.319551
swim  6.056411
climb  6.943114
walk  7.069848
mountain  9.365861
valley  11.752656
river  15.278058
```

Even with just a single value representing each word, enough information is retained so that clusters have clearly formed - we have car manufacturers, people, animals, buildings, verbs and geographical features...

What does this mean? No idea! But if we can semantically sort words, we can probably do the same for sentences. Would sorting the sentences in a document make the flow of ideas more coherent? Maybe?
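
As a purely speculative sketch (not code from this repo), one simple way to try that would be to average each sentence's word vectors and apply the same one-dimensional PCA projection. Here, `embeddings` is assumed to be a word-to-vector dictionary built from data like the CSV above:

```python
import numpy as np
from sklearn.decomposition import PCA

def sentence_vector(sentence, embeddings):
    # One crude choice: average the vectors of the sentence's known words.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

def sort_sentences(sentences, embeddings):
    # Crush each sentence vector down to a single number, exactly as with words.
    matrix = np.array([sentence_vector(s, embeddings) for s in sentences])
    scores = PCA(n_components=1).fit_transform(matrix).flatten()
    return [s for _, s in sorted(zip(scores, sentences))]
```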

Follow me on Twitter for more updates like this.

@robosoup www.robosoup.com