### Distributional hypothesis in semantics

+ Ludwig Wittgenstein:
The meaning of a word lies in its use. (German original: Die Bedeutung eines Wortes liegt in seinem Gebrauch.)


+ Firth (1935:37) on context dependence (cited by Stubbs):
the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.


+ Firth (1957:11):
You shall know a word by the company it keeps . . .


+ Harris (1954:34):
All elements in a language can be grouped into classes whose relative occurrence can be stated exactly. However, for the occurrence of a particular member of one class relative to a particular member of another class, it would be necessary to speak in terms of probability, based on the frequency of that occurrence in a sample.


+ Harris (1954:34):
It is possible to state the occurrence of any element relative to any other element, to the degree of exactness indicated above, so that distributional statements can cover all of the material of a language without requiring support from other types of information.


+ Harris (1954:34) (anticipating deep learning?):
The restrictions on relative occurrence of each element are described most simply by a network of interrelated statements, certain of them being put in terms of the results of certain others, rather than by a simple measure of the total restriction on each element separately.


+ Harris (1954:36) on levels of analysis:


    - Some question has been raised as to the reality of this structure. Does it really exist, or is it just a mathematical creation of the investigator’s? Skirting the philosophical difficulties of this problem, we should, in any case, realize that there are two quite different questions here.
    
    - One: Does the structure really exist in language? The answer is yes, as much as any scientific structure really obtains in the data which it describes — the scientific structure states a network of relations, and these relations really hold in the data investigated.
    
    - Two: Does the structure really exist in speakers? Here we are faced with a question of fact which is not directly or fully investigated in the process of determining the distributional structure. Clearly, certain behaviors of the speakers indicate perception along the lines of the distributional structure, for example, the fact that while people imitate nonlinguistic or foreign-language sounds, they repeat utterances of their own language.


+ Harris (1954:39) on meaning and context-dependence:
All this is not to say that there is not a great interconnection between language and meaning, in whatever sense it may be possible to use this word. But it is not a one-to-one relation between morphological structure and anything else. There is not even a one-to-one relation between vocabulary and any independent classification of meaning; we cannot say that each morpheme or word has a single central meaning or even that it has a continuous or coherent range of meanings... The correlation between language and meaning is much greater when we consider connected discourse.


+ Harris (1954:43):
The fact that, for example, not every adjective occurs with every noun can be used as a measure of meaning difference. For it is not merely that different members of the one class have different selections of members of the other class with which they are actually found. More than that: if we consider words or morphemes A and B to be more different than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference in meaning correlates with difference in distribution.


+ Turney & Pantel (2010:153):


    - Statistical semantics hypothesis: Statistical patterns of human word usage can be used to figure out what people mean (Weaver, 1955; Furnas et al., 1983). – If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings. (We take this to be a general hypothesis that subsumes the four more specific hypotheses that follow.)

    - Bag of words hypothesis: The frequencies of words in a document tend to indicate the relevance of the document to a query (Salton et al., 1975). – If documents and pseudo-documents (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings.

    - Distributional hypothesis: Words that occur in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957; Deerwester et al., 1990). – If words have similar row vectors in a word–context matrix, then they tend to have similar meanings.
      
    - Extended distributional hypothesis: Patterns that co-occur with similar pairs tend to have similar meanings (Lin & Pantel, 2001). – If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations.

    - Latent relation hypothesis: Pairs of words that co-occur in similar patterns tend to have similar semantic relations (Turney et al., 2003). – If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations.
    
    
+ What is the meaning of the word "bardiwac" (Stefan Evert's example)?

    - He handed her her glass of bardiwac.

    - Beef dishes are made to complement the bardiwacs.

    - Nigel staggered to his feet, face flushed from too much bardiwac.

    - Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.

    - I dined off bread and cheese and this excellent bardiwac.

    - The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
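
The guessing game above can be mimicked mechanically. The following is a minimal sketch, not a real distributional model: it collects the words that occur near "bardiwac" in the six sentences. The stop list, the window size, and the function name `context_counts` are assumptions made for this illustration.

```python
import re
from collections import Counter

sentences = [
    "He handed her her glass of bardiwac.",
    "Beef dishes are made to complement the bardiwacs.",
    "Nigel staggered to his feet, face flushed from too much bardiwac.",
    "Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.",
    "I dined off bread and cheese and this excellent bardiwac.",
    "The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.",
]

# A tiny hand-picked stop list (an assumption made for this illustration).
STOP = {"a", "and", "as", "are", "from", "he", "her", "his", "i", "of", "off",
        "one", "the", "this", "to", "too", "were", "well"}


def context_counts(target_prefix, sentences, window=5):
    """Count content words within `window` tokens of any token starting with the target."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep alphabetic tokens, allowing internal hyphens.
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
        for i, token in enumerate(tokens):
            if not token.startswith(target_prefix):
                continue
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i and tokens[j] not in STOP:
                    counts[tokens[j]] += 1
    return counts


counts = context_counts("bardiwac", sentences)
print(sorted(counts))
```

The resulting context words (glass, grapes, drinks, and so on) are the kind of evidence a reader uses to conclude that bardiwac is a red wine; a word-context matrix stores such counts for every word in the vocabulary.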

#### Word2Vec

One of the most famous distributional models is word2vec. The model is based on a neural network that predicts the probability of occurrence of a given word in a given context. The two seminal papers are linked below:

+ [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
+ [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)

The model represents each word as a vector, also called an embedding.

Word2Vec comprises two algorithms: Skip-gram and Continuous Bag-of-Words (CBOW). The CBOW architecture predicts the current word from its context, while Skip-gram predicts the surrounding words given the current word.
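
To make the difference between the two architectures concrete, here is a small sketch of the training examples each one would see for a single sentence. It only builds the (input, output) pairs and does not train anything; the helper name `training_pairs` and the window size of 2 are assumptions for this illustration.

```python
def training_pairs(tokens, window=2):
    """Yield (context, target) for each position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


tokens = "i dined off bread and cheese".split()

# CBOW: the whole context window is the input, the current word is the output.
cbow_examples = [(context, target) for context, target in training_pairs(tokens)]

# Skip-gram: the current word is the input, each surrounding word is a separate output.
skipgram_examples = [
    (target, neighbour)
    for context, target in training_pairs(tokens)
    for neighbour in context
]

print(cbow_examples[2])
print(skipgram_examples[:3])
```

Skip-gram produces more (and more fine-grained) training examples from the same text, one per word pair instead of one per position.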

#### How does word2vec work?

Word2Vec takes a corpus as an input and creates a vector for each word. Vectors (embeddings) are created based on the distributional hypothesis. Cosine similarity between embeddings reflects similarity in the semantics of the words.
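
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. The sketch below computes it by hand; the three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings", not taken from any trained model.
wine = [0.9, 0.1, 0.3]
beer = [0.8, 0.2, 0.4]
tractor = [-0.2, 0.9, -0.5]

print(cosine_similarity(wine, beer))
print(cosine_similarity(wine, tractor))
```

Vectors that point in a similar direction score close to 1, unrelated or opposed ones score near 0 or below.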

We can use embeddings to create analogies:

+ king: man = queen: woman $\Rightarrow$
+ king - man + woman = queen

![w2v](https://cdn-images-1.medium.com/max/2600/1*sXNXYfAqfLUeiDXPCo130w.png)
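
The analogy arithmetic can be sketched with hand-made vectors. Everything here is an assumption for illustration: the three dimensions loosely stand for "royal", "male", and "female", and the vocabulary is invented. gensim's `most_similar` follows the same idea but differs in details, for example it works on length-normalized vectors.

```python
import math

# Hypothetical embeddings: [royal, male, female].
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.7, 0.1],
    "apple":  [0.0, 0.2, 0.2],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy(positive=["king", "woman"], negative=["man"], vectors=vectors))
```

Excluding the query words from the candidates matters: without it, the nearest neighbour of the result is often one of the input words themselves.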

You can find more on the mechanics [here](https://habr.com/ru/post/446530/).

#### Why do we need it?

+ to solve semantic problems
+ for which classes of words is the distributional hypothesis most useful?
+ some papers on its use in semantics:

    - [Turney and Pantel 2010](https://jair.org/index.php/jair/article/view/10640)
    - [Lenci 2018](https://www.annualreviews.org/doi/abs/10.1146/annurev-linguistics-030514-125254?journalCode=linguistics)
    - [Smith 2019](https://arxiv.org/pdf/1902.06006.pdf)
    - [Pennington et al. 2014](https://www.aclweb.org/anthology/D14-1162/)
    - [Faruqui et al. 2015](https://www.aclweb.org/anthology/N15-1184/)

+ to create input for neural networks
+ word2vec is used in Siri, Google Assistant, Alexa, Google Translate...

#### Gensim

We will use the `gensim` library to get access to the word2vec model. Here you can find the library's [documentation](https://radimrehurek.com/gensim/models/word2vec.html).

First, let's install the library: `pip install gensim`. You can do it from Jupyter: `!pip install gensim`. To update: `pip install gensim --upgrade` or `pip install gensim -U`.

In [None]:
import re
import gensim
import logging
import nltk.data
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models import word2vec

import warnings
warnings.filterwarnings('ignore')

#### How to train your own model

NB! Training does not involve any preprocessing! If your task requires it, you have to remove punctuation, lowercase, lemmatize, and POS-tag the text before training.
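
As a rough idea of what such preprocessing looks like, here is a minimal sketch that lowercases a text, strips punctuation, and puts one sentence per line. Lemmatization and POS tagging are left out because they need an external tool; the naive sentence splitter and the function name `preprocess` are assumptions for this illustration.

```python
import re


def preprocess(text):
    """Return a list of sentences, each a string of lowercased tokens without punctuation."""
    # Naive split after ., ! or ? followed by whitespace; real corpora
    # usually need a proper sentence tokenizer.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in raw_sentences:
        # \w+ keeps letters and digits in any script and drops punctuation.
        tokens = re.findall(r"\w+", sentence.lower())
        if tokens:
            cleaned.append(" ".join(tokens))
    return cleaned


sample = "He handed her her glass of bardiwac. Nigel staggered to his feet!"
lines = preprocess(sample)
print("\n".join(lines))
```

Writing these lines to a text file gives the one-sentence-per-line layout that the training step expects.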

To log the training:

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

The input for the model is a text file in which every sentence starts on a new line. The text has been stripped of punctuation, lowercased, and lemmatized. We won't do the preprocessing in class; instead we will use a preprocessed file, `liza_lem.txt`.

In [None]:
f = 'liza_lem.txt'
data = gensim.models.word2vec.LineSentence(f)

Now we train the model. The main parameters:

+ the first argument is the data, which should be an iterable of tokenized sentences,
+ `vector_size`: dimensionality of the word vectors,
+ `window`: maximum distance between the current and predicted word within a sentence,
+ `min_count`: ignores all words with total frequency lower than this,
+ `sg`: training algorithm, 1 for skip-gram; otherwise CBOW,
+ `sample`: the threshold for configuring which higher-frequency words are randomly downsampled,
+ `epochs` (called `iter` before gensim 4.0): number of iterations (epochs) over the corpus,
+ `max_vocab_size`: limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

In [None]:
%time model_liza = gensim.models.Word2Vec(data, vector_size=300, window=5, min_count=2)

CPU times: user 81.3 ms, sys: 5.51 ms, total: 86.8 ms
Wall time: 310 ms


We can normalize the vectors so that the model takes up less RAM. After this operation, however, you won't be able to continue training the model. L2 normalization is used: each vector is scaled so that the sum of squares of its elements equals 1.

In [None]:
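# L2-normalize all vectors in place (init_sims is deprecated in gensim 4.x)
model_liza.init_sims(replace=True)
model_path = "liza.bin"

# save only the word vectors, in the binary word2vec format
print("Saving model...")
model_liza.wv.save_word2vec_format(model_path, binary=True)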



Saving model...


Let's see what the model learned:

In [None]:
model_liza.wv.most_similar(positive=["смерть", "любовь"], negative=["печальный"], topn=3)

[('проходить', 0.18653517961502075),
 ('нежный', 0.17550215125083923),
 ('показываться', 0.16081567108631134)]

In [None]:
model_liza.wv.most_similar("любовь", topn=3)

[('выть', 0.2030768096446991),
 ('нежный', 0.1860518455505371),
 ('лодка', 0.1758255660533905)]

In [None]:
model_liza.wv.similarity("лиза", "эраст")

0.14449573

In [None]:
model_liza.wv.similarity("лиза", "лиза")

1.0

In [None]:
model_liza.wv.doesnt_match("скорбь грусть слеза улыбка".split())

'слеза'

In [None]:
model_liza.wv.words_closer_than("лиза", "эраст")

['свой',
 'который',
 'мочь',
 'сей',
 'мой',
 'ты',
 'часто',
 'слеза',
 'жить',
 'цветок',
 'смотреть',
 'прощаться',
 'прекрасный',
 'девушка',
 'час',
 'дуб',
 'поле',
 'дело',
 'поцеловать',
 'деревня',
 'довольно',
 'страшно',
 'какой',
 'карман',
 'побледнеть',
 'забава',
 'схватывать']

#### Parameter variation

Note that what is said below holds for large corpora; if your corpus is small, you need to be extra careful!

1) preprocessing -- whether we tokenize, lemmatize, and POS-tag or not

2) corpus size -- the larger, the better; for semantic tasks, however, quality is more important than quantity

3) vocabulary size

4) negative samples

5) the number of iterations

6) vector size -- 100-300 (going above 300 does not seem to improve the results)

7) window size -- around 4 for syntax, 8 or 10 for semantics

A paper that discusses different parameter settings: [Pennington et al. 2014](https://www.aclweb.org/anthology/D14-1162.pdf)

### How to use a pre-trained model

http://vectors.nlpl.eu/ provides a number of pre-trained models for Russian and for other languages.

For other languages, look also at [fastText](https://fasttext.cc/docs/en/english-vectors.html) and [GloVe](https://nlp.stanford.edu/projects/glove/)

For a bit of exploration, let's look at some vector novels https://nevmenandr.github.io/novel2vec/

#### Working with a model

Word2vec models come in two formats:

+ .vec.gz: a plain-text file
+ .bin.gz: a binary file

To load a word2vec model, use the `load_word2vec_format` function of `KeyedVectors`; its `binary` parameter tells it which of the two formats to expect.

Note that if the embeddings were not created by word2vec, you need to use `load` instead. Use it when loading the `glove`, `fasttext`, or `bpe` embeddings.
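
To see what the plain-text format looks like inside, here is a small sketch of a reader for it: the first line gives the vocabulary size and the vector dimensionality, and every following line holds a word and its space-separated vector components. In practice you would use `load_word2vec_format`; the function `read_word2vec_text` and the two toy entries are hypothetical and serve only to show the layout.

```python
import io


def read_word2vec_text(stream):
    """Parse the plain-text word2vec format into a {word: vector} dictionary."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"expected {vocab_size} words, got {len(vectors)}")
    return vectors


# A hypothetical two-word model with 3-dimensional vectors.
toy_file = io.StringIO("2 3\ngood_ADJ 0.1 0.2 0.3\nbad_ADJ 0.4 0.5 0.6\n")
toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["bad_ADJ"])
```

The binary format stores the same information, with the vector components written as raw bytes instead of text, which makes the files smaller and faster to load.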

In [None]:
!pip install wget

In [2]:
import zipfile
import wget

In [5]:
model_url = 'http://vectors.nlpl.eu/repository/20/220.zip'
m = wget.download(model_url)
model_file = model_url.split('/')[-1]
with zipfile.ZipFile(model_file, 'r') as archive:
    stream = archive.open('model.bin')
    model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True)

100% [......................................................................] 638171816 / 638171816

In [6]:
words = ['хороший_ADJ', 'плохой_ADJ', 'ужасный_ADJ','жуткий_ADJ', 'страшный_ADJ', 'красный_ADJ', 'синий_ADJ']

We need the POS tags because the model was trained on lemmatized and tagged words. The name of the model specifies the algorithm that was used to tag the words (mystem, in our case).

Let's look at the 10 nearest neighbours of each word we are interested in, together with their cosine similarity.


In [7]:
for word in words:
    # is the word present in the model?
    if word in model:
        print(word)
        # looking at the first 10 numbers from the embedding
        print(model[word][:10])
        # getting 10 neighbours
        for i in model.most_similar(positive=[word], topn=10):
            # word + cosine similarity
            print(i[0], i[1])
        print('\n')
    else:
        # Oops!
        print('Oops, the word "%s" is not in the model!' % word)

хороший_ADJ
[-0.5533533   3.525192    1.3954544   0.50957227  0.9530872   0.42150345
  0.14798506 -0.89938575 -1.9526412  -2.9605858 ]
плохой_ADJ 0.7704135179519653
отличный_ADJ 0.745093822479248
хороший_ADV 0.7096987962722778
неплохой_ADJ 0.7080509662628174
хорошо_ADJ 0.685799777507782
превосходный_ADJ 0.66509610414505
приятный_ADJ 0.6305955052375793
хорошо_ADV 0.6236262321472168
дурной_ADJ 0.5771215558052063
нужный_ADJ 0.5764609575271606


плохой_ADJ
[ 1.47815     3.4967737   0.6200617   0.21216297 -2.1780734  -0.71112794
 -0.55119324 -0.8843036  -1.0574621  -2.4701962 ]
хороший_ADJ 0.7704134583473206
плохой_ADV 0.6875114440917969
дурной_ADJ 0.6704886555671692
плохо_ADJ 0.6194170117378235
скверный_ADJ 0.6098257303237915
слабый_ADJ 0.5972127914428711
хороший_ADV 0.5740218162536621
плохо_ADV 0.5671497583389282
неплохой_ADJ 0.5508592128753662
неудовлетворительный_ADJ 0.5305141806602478


ужасный_ADJ
[-0.7115563   0.14254573 -0.9630084  -0.7578321  -0.35663527 -2.2121139
 -1.0280207   1.

Cosine similarity for a pair of words:

In [8]:
print(model.similarity('плохой_ADJ', 'хороший_ADJ'))

0.7704135


In [9]:
print(model.similarity('плохой_ADJ', 'синий_ADJ'))

0.027845463


In [10]:
print(model.similarity('ужасный_ADJ', 'жуткий_ADJ'))

0.73393726


Proportion (analogy):

+ positive: vectors that we add
+ negative: vectors that we subtract

In [11]:
print(model.most_similar(positive=['плохой_ADJ', 'ужасный_ADJ'], negative=['хороший_ADJ'])[0][0])

страшный_ADJ


Find the word that does not match the rest of the words:

In [12]:
print(model.doesnt_match('плохой_ADJ хороший_ADJ ужасный_ADJ страшный_ADJ'.split()))

хороший_ADJ


In [13]:
print(model.doesnt_match('плохой_ADJ ужасный_ADJ страшный_ADJ'.split()))

плохой_ADJ
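
The odd-one-out queries above can be understood roughly as follows: normalize every vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below is a simplified re-implementation under that assumption, with invented two-dimensional vectors; it is not gensim's actual code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)


# Hypothetical toy embeddings.
toy = {
    "bad": [0.9, 0.1],
    "awful": [0.8, 0.3],
    "terrible": [0.85, 0.2],
    "good": [0.1, 0.9],
}
print(odd_one_out(toy))
```

Note that a procedure like this always returns some word, even when all the candidates are close to each other, as in the three-word query above; in that case the answer should not be over-interpreted.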


In [14]:
for word, score in model.most_similar(positive=['ужасно_ADV'], negative=['плохой_ADJ']):
    print(f'{score:.4}\t{word}')

0.5586	страшно_ADV
0.4865	безумно_ADV
0.4491	несказанно_ADV
0.433	безмерно_ADV
0.4316	неимоверно_ADV
0.4027	донельзя_ADV
0.3973	необыкновенно_ADV
0.3899	чрезвычайный_ADV
0.3874	чрезвычайность_NOUN
0.3825	жутко_ADV


#### Model evaluation

+ word similarity, compare the results of the training with experimental results from human participants
+ analogies:

| word 1            | word 2          | relation        |
|-------------------|-----------------|-----------------|
| Россия (Russia)   | Москва (Moscow) | country-capital |
| Норвегия (Norway) | Осло (Oslo)     | country-capital |
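
For the word-similarity evaluation listed first, a common choice is a rank correlation (Spearman's rho) between the model's cosine similarities and the human ratings for the same word pairs. Here is a minimal sketch in plain Python; the ratings and scores are made-up numbers, not results from any experiment or model.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data for four word pairs.
human_ratings = [9.1, 8.5, 3.2, 1.0]     # averaged human similarity judgements
model_scores = [0.81, 0.64, 0.05, 0.22]  # cosine similarities from a model

print(round(spearman(human_ratings, model_scores), 2))
```

A rank correlation is preferred over comparing raw numbers because human ratings and cosine similarities live on different scales; only the ordering of the pairs is compared.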