[SemAxis](https://arxiv.org/pdf/1806.05521.pdf) is a method for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold), which can be used to address a range of empirical questions (for one example, see [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)). In this activity, you'll implement SemAxis using word representations from GloVe and use it to explore corpus-specific conceptual associations.

This notebook requires gensim; if it isn't already installed, install it with:

`conda install gensim`


In [None]:
import re
from gensim.models import KeyedVectors
import numpy as np
import numpy.linalg as LA

In this activity, we'll be working with pre-trained word embeddings using the `gensim` library, which provides a number of functions for accessing representations for individual words and comparing them. The representations we'll use come from [GloVe](https://nlp.stanford.edu/projects/glove/); the `glove.6B` vectors loaded below were trained on Wikipedia 2014 and Gigaword 5 (about 6 billion tokens).

In [None]:
glove = KeyedVectors.load_word2vec_format("../data/glove.6B.100d.100K.w2v.txt", binary=False)

In [None]:
good_vector=glove["good"]

In [None]:
print(good_vector)

Functions useful for this activity include the following:

In [None]:
# access the representation for a single word
great_vector=glove["great"]

# use numpy to average multiple vector representations together
vecs_to_average=[good_vector, great_vector]
average=np.mean(vecs_to_average, axis=0)

# calculate the cosine similarity between two vectors
cosine_similarity=glove.cosine_similarities(good_vector, [great_vector])

print(good_vector.shape, great_vector.shape, average.shape, cosine_similarity)
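To make the averaging and cosine similarity steps concrete, here is a minimal sketch in plain numpy on toy 3-dimensional vectors. The vectors and the helper name `cosine` are made up for illustration; they are not GloVe values or part of gensim. The sketch assumes only the standard definition of cosine similarity (dot product divided by the product of the norms).

```python
import numpy as np
import numpy.linalg as LA

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (LA.norm(u) * LA.norm(v)))

# toy vectors (hypothetical values, not real embeddings)
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])

# axis=0 averages across vectors, so the result keeps the vector dimensionality
toy_average = np.mean([a, b], axis=0)

print(toy_average.shape)   # same shape as a and b
print(cosine(a, a))        # identical direction: 1
print(cosine(a, b))        # 45 degrees apart: about 0.707
print(cosine(a, -a))       # opposite direction: -1
```

Cosine similarity depends only on direction, not length, so scaling either vector leaves the value unchanged.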

Implement the [SemAxis](https://arxiv.org/pdf/1806.05521.pdf) method as described in class. Given a set of word embeddings for positive terms $S^+ = \{v_1^+, \ldots, v_n^+\}$ and embeddings for negative terms $S^- = \{v_1^-, \ldots, v_m^-\}$ that define the endpoints of the axis, your output should be a single real-valued score for an input word $w$ with word representation $v_w$:

$$
\textrm{score}(w)_{\mathbf{V}_\textrm{axis}} = \textrm{cos}(v_w, \mathbf{V}_\textrm{axis})
$$

where:
$$
\mathbf{V}^+ = {1 \over n} \sum_{i=1}^n v_i^+
$$

$$
\mathbf{V}^- = {1 \over m} \sum_{i=1}^m v_i^-
$$

$$
\mathbf{V}_{\textrm{axis}} = \mathbf{V}^+ - \mathbf{V}^-
$$
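As a sanity check on the formulas above, here is a small sketch that applies them to hand-made 2-dimensional vectors. It is one possible reading of the equations, not the notebook's reference solution: the function name `toy_semaxis_score` and all vector values are hypothetical, and it works on raw numpy arrays rather than on a gensim `KeyedVectors` object, so looking up words is still left to `get_semaxis_score`.

```python
import numpy as np

def toy_semaxis_score(pos_vecs, neg_vecs, target_vec):
    """Cosine between target_vec and the axis V+ - V- (illustrative helper)."""
    v_plus = np.mean(pos_vecs, axis=0)    # V+: mean of the n positive vectors
    v_minus = np.mean(neg_vecs, axis=0)   # V-: mean of the m negative vectors
    axis = v_plus - v_minus               # V_axis = V+ - V-
    denom = np.linalg.norm(target_vec) * np.linalg.norm(axis)
    return float(np.dot(target_vec, axis) / denom)

# hypothetical embeddings: n = 2 positive terms, m = 3 negative terms
pos = np.array([[1.0, 0.0], [1.0, 0.2]])
neg = np.array([[-1.0, 0.0], [-1.0, -0.2], [-1.0, -0.1]])

# here V+ is about [1.0, 0.1], V- about [-1.0, -0.1], so the axis is about [2.0, 0.2]
along = toy_semaxis_score(pos, neg, np.array([2.0, 0.2]))        # same direction as the axis
against = toy_semaxis_score(pos, neg, np.array([-1.0, -0.1]))    # opposite direction
unrelated = toy_semaxis_score(pos, neg, np.array([-0.1, 1.0]))   # perpendicular to the axis

print("%.3f %.3f %.3f" % (along, against, unrelated))
```

Scores near 1 indicate a word close to the positive end of the axis, scores near -1 a word close to the negative end, and scores near 0 a word the axis says little about. Note that the two term sets need not be the same size.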



In [None]:
def get_semaxis_score(vectors, positive_terms=None, negative_terms=None, target_word=None):
    
    # See cell below for example arguments
    
    # vectors: gensim KeyedVectors object (e.g., glove)
    # positive_terms: list of terms defining one end of an axis
    # negative_terms: list of terms defining the other end of an axis
    # target_word: the term to score along that axis
    
    # the output should be a single real number (the SemAxis score for that word).
    
    # TODO

    return score

In [None]:
# should be 0.342
get_semaxis_score(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_word="actress")

Now let's score a set of target terms along that axis and print them from highest to lowest score.

In [None]:
def score_list_of_targets(vectors, positive_terms=None, negative_terms=None, target_words=None):
    scores=[]
    for target in target_words:
        scores.append((get_semaxis_score(vectors, positive_terms, negative_terms, target), target))

    for k,v in reversed(sorted(scores)):
        print("%.3f\t%s" % (k,v))

In [None]:
targets=["doctor", "nurse", "actor", "actress", "mechanic", "librarian", "architect", "magician", "cook", "chef"]

In [None]:
score_list_of_targets(glove, positive_terms=["woman", "women"], negative_terms=["man", "men"], target_words=targets)

Define **your own concept axis** by selecting a set of positive and negative terms, and illustrate its utility by scoring a set of 10 target terms (as we did above). If you've implemented `get_semaxis_score` above, you only need to fill in the `positive_terms`, `negative_terms`, and `targets` lists below and execute this cell.

In [None]:
positive_terms=[]
negative_terms=[]
targets=[]

score_list_of_targets(glove, positive_terms=positive_terms, negative_terms=negative_terms, target_words=targets)

Let's assume now that you're able to score all words in a vocabulary along several conceptual dimensions (like the one you've defined) for a given set of word embeddings trained on a dataset. What could you do with those scores? Brainstorm possible applications.