# Python Text Analysis: Part 3 Solutions

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors

In [2]:
wv = KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin', binary=True)

## 🥊 Challenge 1: Doesn't Match

Now it's your turn! In the following cell, we have prepared a list of word pairs in which "coffee" is paired with a specific coffee drink. Let's find out which coffee drink is considered most similar to "coffee," and which one is least similar.

Complete the for loop (two cells below) to calculate the cosine similarity between each pair of words, using the `similarity` method.

In [15]:
coffee_nouns = [
    ('coffee', 'espresso'),
    ('coffee', 'cappuccino'),
    ('coffee', 'latte'),
    ('coffee', 'americano'),
    ('coffee', 'irish'),
]

In [16]:
# Get cosine similarities between each pair
for w1, w2 in coffee_nouns:
    similarity = wv.similarity(w1, w2)
    print(f"{w1}, {w2}, {similarity}")

coffee, espresso, 0.6616826057434082
coffee, cappuccino, 0.662549614906311
coffee, latte, 0.6049396395683289
coffee, americano, 0.33290809392929077
coffee, irish, 0.16667571663856506
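
To see what `similarity` is computing, here is a minimal sketch of cosine similarity written with NumPy. The helper `cosine_similarity` and the three-dimensional `toy_vectors` are made up for illustration; they are not real word2vec embeddings and not part of gensim.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, invented for this example
toy_vectors = {
    "coffee":   [1.0, 2.0, 0.0],
    "espresso": [2.0, 4.0, 0.0],   # points in the same direction as "coffee"
    "irish":    [-2.0, 1.0, 3.0],  # orthogonal to "coffee"
}

sim_same = cosine_similarity(toy_vectors["coffee"], toy_vectors["espresso"])
sim_orth = cosine_similarity(toy_vectors["coffee"], toy_vectors["irish"])
print(sim_same, sim_orth)
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why a higher value in the output above is read as "more similar."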


Next, let's investigate verbs commonly associated with coffee-making. Take a look at the use case for the [`doesnt_match`](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#word2vec-demo) function and then use it to identify the verb that does not seem to belong.

Feel free to add more verbs to the list!

In [17]:
coffee_verbs = ['brew', 'drip', 'pour', 'make', 'grind', 'roast']

In [18]:
# Find the word that doesn't belong to the list
verb_doesnt_match = wv.doesnt_match(coffee_verbs)
verb_doesnt_match

'make'
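
For intuition, the idea behind `doesnt_match` can be sketched roughly as follows: normalize each word vector, average them, and pick the word least similar to that average. The function `odd_one_out` and the two-dimensional `toy_vectors` below are assumptions made for illustration, not gensim's actual code or real embeddings.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word least similar to the group's mean direction (hypothetical helper)."""
    # Scale every vector to unit length so only direction matters
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    # Mean direction of the group, rescaled to unit length
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # Cosine similarity of each word to the mean direction
    scores = unit @ mean
    return words[int(np.argmin(scores))]

# Toy 2-dimensional vectors, invented for this example
toy_vectors = {
    "brew":  np.array([1.0, 0.1]),
    "drip":  np.array([0.9, 0.2]),
    "grind": np.array([1.0, 0.0]),
    "make":  np.array([0.1, 1.0]),
}
toy_verbs = list(toy_vectors)
outlier = odd_one_out(toy_verbs, toy_vectors)
print(outlier)
```

In this toy setup "make" points away from the other three verbs, so it has the lowest similarity to the group mean.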

## 🥊 Challenge 2: Woman is to Homemaker?

[Bolukbasi et al. (2016)](https://arxiv.org/pdf/1607.06520) thoroughly investigate gender bias in word embeddings, focusing primarily on word analogies, especially those that reveal gender stereotyping. Let's run a couple of examples discussed in the paper, using the `most_similar` function we've just learned.

The following code block contains a few examples we can pass to the `positive` argument: we want the output to be similar to, for example, `woman` and `chairman`. At the same time, we specify through the `negative` argument that it should be dissimilar to `man`. We'll print the top result by indexing to the 0th item.

Let's complete the following for loop.

In [7]:
positive_pair = [['woman', 'chairman'],
                 ['woman', 'doctor'], 
                 ['woman', 'computer_programmer']]
negative_word = 'man'

In [8]:
# Get the most similar word given positive and negative examples
for example in positive_pair:
    result = wv.most_similar(positive=example, negative=[negative_word])
    print(f"man is to {example[1]} as woman is to {result[0][0]}")

man is to chairman as woman is to chairwoman
man is to doctor as woman is to gynecologist
man is to computer_programmer as woman is to homemaker
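
The analogy query can be pictured as vector arithmetic: add the positive words, subtract the negative word, and look for the nearest remaining word by cosine similarity. The sketch below illustrates that idea under stated assumptions; `toy_most_similar`, `unit`, and the three-dimensional `toy_vectors` are hypothetical and only approximate what `most_similar` does internally.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to (positive - negative); hypothetical helper."""
    # Add unit vectors of positive words, subtract those of negative words
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    # Score every word that was not part of the query
    inputs = set(positive) | set(negative)
    scores = [(w, float(unit(v) @ query)) for w, v in vectors.items() if w not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for this example
toy_vectors = {
    "man":        np.array([0.0, 0.0, 1.0]),
    "woman":      np.array([0.0, 1.0, 1.0]),
    "chairman":   np.array([1.0, 0.0, 1.0]),
    "chairwoman": np.array([1.0, 1.0, 1.0]),
    "doctor":     np.array([0.5, 0.0, 1.0]),
    "office":     np.array([1.0, 0.0, 0.0]),
}

result = toy_most_similar(["woman", "chairman"], ["man"], toy_vectors)
print(f"man is to chairman as woman is to {result[0][0]}")
```

Because the query words themselves are excluded from the ranking, the top result is always a different word, which is worth keeping in mind when interpreting analogy outputs.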


## 🥊 Challenge 3: Construct a Semantic Axis

Now it's your turn! We have two sets of pole words for "female" and "male". These are example words tested in Bolukbasi et al. (2016). We will get the embeddings for these words from GloVe to calculate the gender axis.

The cell for the function `get_semaxis` provides some starting code. Complete the function. If everything runs, the embedding size of the semantic axis should be the same as the size of the input vector. 

In [9]:
glove = api.load('glove-wiki-gigaword-50')

In [10]:
# Define two sets of pole words (examples from Bolukbasi et al., 2016)
female = ['she', 'woman', 'female', 'daughter', 'mother', 'girl']
male = ['he', 'man', 'male', 'son', 'father', 'boy']

In [11]:
def get_semaxis(list1, list2, model, embedding_size):
    '''Calculate the embedding of a semantic axis given two lists of pole words.'''

    # Step 1: Get the embeddings for terms in each list
    v_plus = [model[term] for term in list1]
    v_minus = [model[term] for term in list2]

    # Step 2: Calculate the mean embeddings for each list
    v_plus_mean = np.mean(v_plus, axis=0)
    v_minus_mean = np.mean(v_minus, axis=0)

    # Step 3: Get the difference between two means
    sem_axis = v_plus_mean - v_minus_mean

    # Sanity check
    assert sem_axis.size == embedding_size
    
    return sem_axis

In [12]:
# Plug in the gender lists to calculate the semantic axis for gender
gender_axis = get_semaxis(list1=female, 
                          list2=male, 
                          model=glove, 
                          embedding_size=50)
gender_axis

array([ 0.08418201,  0.30625182, -0.23662159,  0.02026337, -0.00296998,
        0.6195349 ,  0.01208681,  0.06963003,  0.49099812, -0.20878893,
        0.00934163, -0.44707334,  0.48806185,  0.19471335,  0.20141667,
        0.0832995 , -0.4245833 , -0.08612835,  0.47612852, -0.05129966,
        0.31475997,  0.49075842,  0.12465019,  0.26685053,  0.29776838,
        0.14211655, -0.09953564,  0.2320785 , -0.01026282, -0.30585438,
       -0.1335001 ,  0.21605133,  0.10961549, -0.03373036, -0.13584831,
       -0.12131716, -0.14671612, -0.04348468,  0.06151834, -0.3654362 ,
       -0.06193466, -0.17093089,  0.5058871 , -0.44872418,  0.05962732,
       -0.18274659,  0.24432765, -0.3396697 ,  0.00442566,  0.10554916],
      dtype=float32)
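
One common way to use such an axis is to take the cosine similarity between a word's vector and the axis: positive scores lean toward the first pole, negative scores toward the second. The sketch below is a self-contained illustration under stated assumptions: `toy_model` is a plain dictionary of made-up two-dimensional vectors standing in for `glove`, `project_on_axis` is a hypothetical helper, and `get_semaxis` is repeated in compact form without the size check.

```python
import numpy as np

def get_semaxis(list1, list2, model):
    """Difference between the mean embeddings of two pole-word lists."""
    v_plus_mean = np.mean([model[term] for term in list1], axis=0)
    v_minus_mean = np.mean([model[term] for term in list2], axis=0)
    return v_plus_mean - v_minus_mean

def project_on_axis(word, axis, model):
    """Cosine similarity between a word vector and the axis (hypothetical helper)."""
    v = model[word]
    return float(np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis)))

# Toy 2-dimensional "embeddings", invented for this example
toy_model = {
    "she":   np.array([1.0, 0.2]),
    "woman": np.array([0.8, 0.4]),
    "he":    np.array([-1.0, 0.2]),
    "man":   np.array([-0.8, 0.4]),
    "aunt":  np.array([0.9, 0.5]),
    "uncle": np.array([-0.9, 0.5]),
}

toy_axis = get_semaxis(["she", "woman"], ["he", "man"], toy_model)
aunt_score = project_on_axis("aunt", toy_axis, toy_model)
uncle_score = project_on_axis("uncle", toy_axis, toy_model)
print(aunt_score, uncle_score)
```

With the real model, the same projection could be applied to `gender_axis` and any word in `glove`, since both support lookup with `model[word]`.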